What is a corporate AI governance layer?

It is a software layer that sits between the company's tools and the LLM providers and applies the organization's rules to every request: it blocks or masks sensitive data, controls spend through budgets, restricts which models can be used, isolates access per department and records everything in an audit trail. HorseLabs implements this layer across 5 fronts — data, cost, models, access and connectivity — behind a single key, for any provider.

How does HorseLabs stop sensitive data from reaching the LLM provider?

Every prompt passes through the gateway before the model. The DLP Shield inspects the content pre-call: deterministic rules detect bank data — credit-card numbers (Luhn) —, CPF, e-mails and credentials; an NLP layer detects person names and PII that rules can't reach. Depending on the team's policy, the data is masked or the request is blocked before it leaves — and the violation is recorded in the trail, with the data already masked.

Does using HorseLabs solve my LGPD compliance?

Not on its own — and be wary of anyone who promises that. Compliance involves process, legal basis, contracts and people; no tool "solves" it. What HorseLabs delivers are the technical controls that sustain your posture: blocking and masking of personal data before the provider, a violation audit trail and access isolation. And let's say it plainly: every relevant LLM provider is foreign, so international data transfer is intrinsic to using AI. We give you control, reduction and proof over what leaves — not the illusion that nothing does.

Do I need to replace the tools my team already uses?

No. The layer speaks the OpenAI-compatible standard: anything that already works with that standard — IDEs, agents, scripts, internal tools — starts pointing at the gateway by swapping the base_url and using the team's virtual key. The employee keeps their workflow; the company gains control.

What happens if the data detector goes down?

Fail-closed. When the team's policy is set to block and the detector becomes unavailable, the layer stops the request instead of letting it through. Protecting the data is the architectural default — not a setting someone forgets to turn on.

How does AI cost control work?

Each team uses a virtual key with its own budget. Spend shows up in real time per user, key, team and cost center. When consumption crosses the threshold you set, the layer fires an alert (webhook into your workflow); when it overruns, it cuts off. And every request is logged: who, which model, how many tokens, how much it cost.

Which providers and models are supported?

Claude (Anthropic), GPT (OpenAI), Gemini (Google) and Grok (xAI), behind the same key and the same API standard. The catalog is fed by each provider's live models and governed by an allowlist: everything starts off, and only what an administrator approves goes into use. A non-approved model gets a 403.

How is access isolated between departments and companies?

Each organization and each department lives in its own tenant, with strictly scoped roles (operator, admin, member). Provider credentials live in a vault and never reach the end user. Sensitive actions require a second factor, and every access lands in the audit trail.

How much does it cost?

It depends on scope — operation size, number of departments/tenants and request volume. The investment structure is at horse-labs.dev/pricing; scope and metric are defined before we start, with no surprises.

By requesting access through the form on this page — a corporate e-mail and your team size are enough. We are in a validation phase with selected companies: the founder replies within 1 business day.

LLM cost governance

LLM cost governance means making model spend predictable and attributable — with budget caps, threshold alerts, blocking before overrun, and cost tracked per client, team or project. This guide details the four mechanisms and how to apply them at the gateway layer.

Budget caps

A budget cap is a spend limit the gateway enforces before overrun — not a warning on the invoice afterward.

The model provider bills per token consumed and knows nothing about your budget: it serves the next call no matter how much you've already spent. That's why the cap can't live at the provider — it has to live in the layer that sits between your operation and the models. When every call passes through a single point, that point can compare accumulated spend against the defined budget and refuse the request that would cross the limit, before the token is ever sent and charged. The difference between blocking beforehand and finding out afterward is the difference between a control and a report: the first prevents the loss, the second merely documents it once nothing can be done.

The cap is set per period and per cost center, so each client, team or project carries its own limit without contaminating the others. When one cost center's budget runs out, only that center is blocked — the rest of the operation keeps running. When the period turns over, the budget applies again and calls flow once more, with no manual intervention. The result is a cap that acts as an automatic brake, not a retroactive alarm: spend stays contained within what was planned, contract by contract.

At Horse Labs, the gateway enforces per-cost-center budgets and blocks automatically when the budget runs out — before the overrun, not after.

Per-client attribution

Attribution is knowing who spent what: every call carries a cost center, so spend lands per client, team or project.

In an operation serving many clients on the same infrastructure, the provider's aggregate invoice is useless for management: it says how much the whole operation consumed but won't tell you who generated that consumption or under which contract. Without that answer you can't bill each client the right cost, can't see which project is expensive, and have no basis to forecast next month. Attribution solves this by tying each call to a cost center the moment it passes through the gateway: spend stops being an opaque total and becomes a traceable sum per client, team or project.

With spend attributed, the conversation shifts from "AI cost X" to "client A cost X, project B cost Y" — and that holds both for passing the cost through and for deciding where to optimize. Tracking these numbers in real time, not only at close, lets you act while the month is still running: spot the contract that spiked, the team that changed its usage pattern, the project that needs a tighter cap. Attribution is the prerequisite for any fair billing and for any optimization that isn't a guess.

At Horse Labs, per-client spend is visible in real time in the Console and can be delivered by report.

Threshold alerts

Alerts fire at 50%, 80% and 100% of the budget, before the block — time to react.

Blocking on overrun protects the wallet but catches the operation by surprise: the call simply stops working. Alerts exist so no one reaches the block blind. As a cost center's consumption climbs, the gateway fires warnings at defined milestones — 50%, 80% and 100% of the budget — giving the team time to react before the cutoff. At 50% you know the month is on the expected pace; at 80%, that it's worth a close look; at 100%, that the limit was reached and the block kicked in.

Each alert can fire a configurable webhook, which wires the budget into the rest of the operation's tooling: a notice in the team's channel, a ticket, an automation trigger. So the reaction doesn't depend on someone watching the dashboard at the right moment — the system reaches out where the team already works. And when a block needs to be reversed by a conscious decision (a campaign that justifies the extra spend, a contract about to be adjusted), an unblock flow releases that cost center in a controlled way, instead of dropping everyone's cap.

At Horse Labs, alerts fire at 50/80/100% with a configurable webhook and an unblock flow.

Cost optimization

Optimization is routing each task to the right model — not paying for an expensive model where a light one suffices.

Much of the AI cost that looks inevitable is, in fact, routing waste: simple tasks — classifying, extracting a field, rephrasing a sentence — running on a top-tier model that charges a premium per token. The most capable model isn't always the one you need, and using it where a lightweight model would deliver the same result means overpaying by default. Optimization starts with seeing this, which is only possible once spend is attributed: with cost per cost center in view, it becomes clear where an expensive model is doing work that doesn't demand it.

Per-cost-center model choice turns that diagnosis into action: each cost center points to the model suited to its kind of task, and the operation routes the right work to the right model without rewriting code at every adjustment. Tasks that call for capability go to the capable model; high-volume, low-complexity tasks go to the economical one. The effect is to cut unoptimized spend while keeping the result — cost drops because the work now runs where it should, not because the operation gave up quality.

At Horse Labs, per-cost-center model choice keeps each task on the appropriate model.

FAQ

How do I control LLM cost per client?

By attributing each call to a cost center at the gateway, with budget caps and automatic blocking per client, team or project.

Can I block spend before overrun?

Yes — the gateway enforces a cap and blocks automatically when the budget runs out, with alerts at 50/80/100% beforehand.

Talk about cost governance