The Silent Budget Killer: Why Agents Are Different from Chatbots

2026-03-20·ClawFirewall·3 minutes

Chat is predictable. A user sends a message, the model responds. Your costs scale roughly with usage.

Agents are different. You're handing a semi-autonomous script access to your model provider. When that script hits a bad state—misparses a tool response, gets stuck in a retry—it can fire off fifty high-token requests in a minute. No user involved. No natural stop.

1. Where the leakage happens

In agentic workflows, runaway costs usually come from three places:

Recursive reasoning loops

The agent keeps asking the model for "clarification" on a sub-task it can't resolve. Because each round trip re-sends the growing context, cost compounds rather than merely adds up.
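One cheap defense is to detect when the agent is re-asking the same question. A minimal sketch (the `LoopGuard` class, window size, and repeat limit are all illustrative choices, not from any particular framework): hash each outgoing prompt and refuse once the same prompt recurs too often in a sliding window.

```python
import hashlib
from collections import deque

class LoopGuard:
    """Trip when the agent repeats near-identical prompts within a window."""

    def __init__(self, window: int = 6, max_repeats: int = 2):
        self.recent = deque(maxlen=window)  # digests of recent prompts
        self.max_repeats = max_repeats

    def check(self, prompt: str) -> bool:
        """Return True if this prompt may proceed, False if a loop is suspected."""
        digest = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
        repeats = sum(1 for d in self.recent if d == digest)
        self.recent.append(digest)
        return repeats < self.max_repeats

guard = LoopGuard()
assert guard.check("summarize file A")        # first call: allowed
assert guard.check("summarize file A")        # one repeat: still allowed
assert not guard.check("summarize file A")    # third identical call: blocked
```

Exact-match hashing misses paraphrased loops, but it catches the common failure mode where a retry handler re-submits an identical prompt verbatim.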

Token bloat

Agents often pass their full "memory" or scratchpad into every step. A 5-step task can blow up to 32k tokens by the last iteration if you don't cap it.
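Capping the scratchpad can be as simple as keeping the task framing plus the most recent tail that fits a budget. A sketch, using character count as a rough proxy for tokens (the function name and budget are illustrative):

```python
def trim_scratchpad(messages: list[str], max_chars: int = 8000,
                    keep_head: int = 1) -> list[str]:
    """Keep the first message(s) (the task framing) and as much of the
    most recent tail as fits under max_chars; drop the middle."""
    head = messages[:keep_head]
    budget = max_chars - sum(len(m) for m in head)
    tail = []
    for m in reversed(messages[keep_head:]):
        if budget - len(m) < 0:
            break
        budget -= len(m)
        tail.append(m)
    return head + list(reversed(tail))

# A 200-step scratchpad collapses to the task plus the last few steps.
history = ["task: audit the repo"] + ["step output " * 10] * 200
trimmed = trim_scratchpad(history, max_chars=500)
```

In production you would count tokens with your provider's tokenizer rather than characters, and possibly replace the dropped middle with a cheap-model summary, but the shape is the same: the context sent per step stays bounded instead of growing with every iteration.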

Wrong model for the job

Using GPT-4o or Claude 3.5 Sonnet for simple classification that a cheaper model could handle for a fraction of the price.
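Routing by task kind doesn't need to be clever. A minimal sketch, where the model names and the set of "simple" task kinds are placeholder assumptions you'd replace with your own tiers:

```python
# Placeholder model identifiers: substitute your provider's actual tiers.
CHEAP_MODEL = "small-fast-model"
STRONG_MODEL = "large-reasoning-model"

# Task kinds that rarely benefit from the expensive tier (assumed set).
SIMPLE_TASKS = {"classify", "extract", "route"}

def pick_model(task_kind: str) -> str:
    """Send bounded, well-specified tasks to the cheaper tier by default."""
    return CHEAP_MODEL if task_kind in SIMPLE_TASKS else STRONG_MODEL
```

Even a static lookup like this captures most of the savings; escalation logic (retry on the strong model when the cheap one fails validation) can come later.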

2. An "agent firewall" at the application layer

A monthly limit on the provider dashboard is a reactive kill switch. It cuts everything when you hit the cap—bad UX, and often too late.

You want controls before calls hit the provider:

  • Max-iteration caps: Hard limit on how many "thoughts" or actions an agent can take per session.
  • Token attribution: Tag every request with user_id or session_id so you can see which agent instance is burning budget.
  • Budget handshakes: If a single task exceeds a token threshold, require a cheaper-model summary or soft approval before continuing.
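The three controls above can live in one gate that every model call passes through. A sketch (the class name, return codes, and thresholds are illustrative, not a real library API): the firewall counts iterations per session, attributes reported token usage to that session, and switches to a "handshake" state once spend crosses the threshold.

```python
from collections import defaultdict

class AgentFirewall:
    """App-layer gate in front of the model client: per-session iteration
    caps, token attribution, and a soft stop once spend crosses a threshold."""

    def __init__(self, max_iterations: int = 15, token_threshold: int = 50_000):
        self.max_iterations = max_iterations
        self.token_threshold = token_threshold
        self.iterations = defaultdict(int)  # session_id -> actions taken
        self.tokens = defaultdict(int)      # session_id -> tokens spent

    def admit(self, session_id: str) -> str:
        """Call before each model request: 'ok', 'cap', or 'handshake'."""
        if self.iterations[session_id] >= self.max_iterations:
            return "cap"        # hard stop: iteration budget exhausted
        if self.tokens[session_id] >= self.token_threshold:
            return "handshake"  # require a summary/approval before continuing
        self.iterations[session_id] += 1
        return "ok"

    def record(self, session_id: str, tokens_used: int) -> None:
        """Call after each response with the usage your provider reported."""
        self.tokens[session_id] += tokens_used

fw = AgentFirewall(max_iterations=3, token_threshold=1000)
assert fw.admit("s1") == "ok"
fw.record("s1", 600)
assert fw.admit("s1") == "ok"
fw.record("s1", 600)
assert fw.admit("s1") == "handshake"  # 1200 tokens: soft stop, not a blackout
```

Because the counters are keyed by session, the token attribution falls out for free: sorting `fw.tokens` tells you which agent instance is burning budget.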

The goal is to catch runaway behavior early, not after the damage is done.