How to Bulletproof Your Agents After a $10k Blowup
Part 1: What happened and why agents are harder to control
Step 1: Map every workflow and tool call
Know exactly what your agents do. For each workflow, write down:
- Average API calls per run
- Worst-case calls per run
- What happens when a tool fails, and how many retries does that trigger?
- Can it spin up nested agents?
- Max possible cost per run?
If you can't answer these, you're flying blind.
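One way to make those answers concrete is to record them per workflow and compute an upper bound on cost. The sketch below is a minimal illustration, not a definitive model: the `WorkflowProfile` class, its field names, the sample numbers, and the assumption that each nested agent can repeat the parent's worst case once are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowProfile:
    """Answers to the Step 1 questions for one workflow (all values hypothetical)."""
    name: str
    avg_calls_per_run: int
    worst_case_calls_per_run: int
    max_retries_per_call: int      # retries allowed after the first attempt
    max_nested_agents: int         # 0 if the workflow cannot spawn sub-agents
    max_cost_per_call_usd: float   # assumed upper bound for a single API call

    def worst_case_cost_usd(self) -> float:
        # Every call may be attempted once plus all of its retries.
        attempts_per_call = 1 + self.max_retries_per_call
        # Assumption: each nested agent can repeat the parent's worst case once.
        agents = 1 + self.max_nested_agents
        return (self.worst_case_calls_per_run * attempts_per_call
                * agents * self.max_cost_per_call_usd)


profile = WorkflowProfile(
    name="support-triage",
    avg_calls_per_run=6,
    worst_case_calls_per_run=20,
    max_retries_per_call=3,
    max_nested_agents=2,
    max_cost_per_call_usd=0.05,
)
print(f"{profile.name}: worst case ${profile.worst_case_cost_usd():.2f} per run")
```

If the worst-case figure surprises you, that workflow is the first candidate for the limits in Step 2.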
Step 2: Hard limits at every level
Set non-negotiable limits that override agent logic:
- Retries per tool call: max 3
- Retries per workflow: max 2 total
- API calls per minute per agent
- API calls per hour per workflow
- Tokens per user request
- Spend per day per agent
- Spend per month per team
Enforce these at the infrastructure layer, before calls hit the provider. If they're only in agent code, a bug or loop can bypass them.
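As a rough sketch of what "before calls hit the provider" can look like, the guard below checks two of the limits above (calls per minute and spend per day) and refuses the call if either would be exceeded. The `AgentLimits` and `LimitExceeded` names, the in-memory counters, and the sample limits are assumptions for illustration; a production gateway would keep these counters in shared storage and reset the daily total on a schedule.

```python
import time
from collections import deque
from typing import Optional


class LimitExceeded(Exception):
    """Raised before a call is sent, so the provider is never reached."""


class AgentLimits:
    def __init__(self, max_calls_per_minute: int, max_spend_per_day_usd: float) -> None:
        self.max_calls_per_minute = max_calls_per_minute
        self.max_spend_per_day_usd = max_spend_per_day_usd
        self._call_times = deque()   # timestamps of calls in the last 60 seconds
        self._spent_today_usd = 0.0  # a real system would reset this once a day

    def check_and_record(self, estimated_cost_usd: float,
                         now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        # Drop timestamps that have fallen out of the 60-second window.
        while self._call_times and now - self._call_times[0] >= 60:
            self._call_times.popleft()
        if len(self._call_times) >= self.max_calls_per_minute:
            raise LimitExceeded("calls-per-minute limit reached")
        if self._spent_today_usd + estimated_cost_usd > self.max_spend_per_day_usd:
            raise LimitExceeded("daily spend limit reached")
        # Only record the call once both checks have passed.
        self._call_times.append(now)
        self._spent_today_usd += estimated_cost_usd


limits = AgentLimits(max_calls_per_minute=2, max_spend_per_day_usd=1.00)
limits.check_and_record(0.10, now=0.0)
limits.check_and_record(0.10, now=1.0)
try:
    limits.check_and_record(0.10, now=2.0)  # third call inside the same minute
    blocked = False
except LimitExceeded as exc:
    blocked = True
    print("blocked:", exc)
```

The point of the design is that the agent never gets a vote: the guard runs outside the agent's own logic, so a looping agent hits the exception instead of the provider.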
Step 3: Circuit breakers for every workflow
A circuit breaker trips when a metric crosses a threshold. When it trips, the workflow stops: no more retries, and the incident is escalated to a human. Don't restart until someone has reviewed what happened.
Trip conditions that work well:
- More than 3 failed retries
- More than 2x the average tokens for a single run
- More than 10x the average cost for a single run
- More than 5x the average API calls per minute
This is one of the most effective ways to stop a loop before it burns the budget.
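The trip conditions listed above can be sketched as a small class. This is a minimal illustration under stated assumptions: the `Baseline` and `CircuitBreaker` names, the strict "greater than" comparisons, and the sample averages are hypothetical, and the averages would need to come from your own historical data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Baseline:
    """Historical averages for one workflow (values are illustrative)."""
    avg_tokens_per_run: float
    avg_cost_per_run_usd: float
    avg_calls_per_minute: float


class CircuitBreaker:
    def __init__(self, baseline: Baseline) -> None:
        self.baseline = baseline
        self.tripped_reason: Optional[str] = None  # set while the breaker is open

    def allow(self, failed_retries: int, tokens: int,
              cost_usd: float, calls_per_minute: float) -> bool:
        """Return True if the run may continue, False once the breaker is open."""
        if self.tripped_reason is None:
            b = self.baseline
            if failed_retries > 3:
                self.tripped_reason = "more than 3 failed retries"
            elif tokens > 2 * b.avg_tokens_per_run:
                self.tripped_reason = "tokens above 2x average"
            elif cost_usd > 10 * b.avg_cost_per_run_usd:
                self.tripped_reason = "cost above 10x average"
            elif calls_per_minute > 5 * b.avg_calls_per_minute:
                self.tripped_reason = "call rate above 5x average"
        return self.tripped_reason is None

    def reset(self, reviewed_by: str) -> None:
        """Close the breaker again, but only with a named human reviewer."""
        if not reviewed_by.strip():
            raise ValueError("a reviewer name is required to reset the breaker")
        self.tripped_reason = None


breaker = CircuitBreaker(Baseline(avg_tokens_per_run=4000,
                                  avg_cost_per_run_usd=0.20,
                                  avg_calls_per_minute=10))
print(breaker.allow(failed_retries=1, tokens=3500, cost_usd=0.18, calls_per_minute=9))
print(breaker.allow(failed_retries=1, tokens=9000, cost_usd=0.45, calls_per_minute=9))
print(breaker.tripped_reason)
```

Once tripped, `allow` keeps returning False even for healthy-looking readings, which mirrors the rule that nothing restarts until a person has called `reset`.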
Step 4: Real-time alerts and anomaly detection
You need to know when things go wrong. Set up alerts for:
- Spend at 50% of daily per-agent budget
- Abnormal API call volume
- Circuit breaker trips
- Error rate above 5%
Use Slack, email, or SMS, whichever you actually check. For critical events (a breaker trip, or spend reaching 80% of the daily budget), send SMS to at least two people.
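The thresholds in this step can be expressed as a small rule function that returns a severity for each alert. The sketch below assumes hypothetical names (`evaluate_alerts`, `CHANNELS`, the on-call recipients) and leaves out call-volume anomaly detection, which needs a baseline of its own; sending the messages is also out of scope here.

```python
from typing import List, Tuple

# Hypothetical routing table: critical alerts go to SMS for two people.
CHANNELS = {
    "warning": ["slack"],
    "critical": ["slack", "sms:on-call-1", "sms:on-call-2"],
}


def evaluate_alerts(spend_usd: float, daily_budget_usd: float,
                    error_rate: float, breaker_tripped: bool) -> List[Tuple[str, str]]:
    """Return (severity, message) pairs; daily_budget_usd must be positive."""
    alerts: List[Tuple[str, str]] = []
    budget_used = spend_usd / daily_budget_usd
    if breaker_tripped:
        alerts.append(("critical", "circuit breaker tripped"))
    if budget_used >= 0.80:
        alerts.append(("critical", f"{budget_used:.0%} of daily budget spent"))
    elif budget_used >= 0.50:
        alerts.append(("warning", f"{budget_used:.0%} of daily budget spent"))
    if error_rate > 0.05:
        alerts.append(("warning", f"error rate {error_rate:.1%} is above 5%"))
    return alerts


for severity, message in evaluate_alerts(42.0, 50.0, 0.02, False):
    print(severity, message, "->", CHANNELS[severity])
```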
Step 5: Weekly audits
Review spend every week. Which workflows cost the most? Where are the retries piling up? Can you move simple tasks to cheaper models? Fixing a single misrouted workflow can cut spend by as much as 30%, and the savings show up immediately. Weekly reviews keep you from drifting back into waste.
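A weekly review can start from something as simple as grouping the call log by workflow. The sketch below assumes a hypothetical log format (one dict per call with `workflow`, `cost_usd`, and `is_retry` keys) and made-up sample records; adapt the field names to whatever your logging actually emits.

```python
from collections import defaultdict


def weekly_report(records):
    """Total cost, calls, and retries per workflow, most expensive first."""
    totals = defaultdict(lambda: {"cost_usd": 0.0, "calls": 0, "retries": 0})
    for record in records:
        entry = totals[record["workflow"]]
        entry["cost_usd"] += record["cost_usd"]
        entry["calls"] += 1
        if record["is_retry"]:
            entry["retries"] += 1
    return sorted(totals.items(), key=lambda item: item[1]["cost_usd"], reverse=True)


# Made-up sample data for illustration only.
records = [
    {"workflow": "summarize", "cost_usd": 0.25, "is_retry": False},
    {"workflow": "support-triage", "cost_usd": 0.50, "is_retry": False},
    {"workflow": "support-triage", "cost_usd": 0.50, "is_retry": True},
]
report = weekly_report(records)
for workflow, stats in report:
    print(f"{workflow}: ${stats['cost_usd']:.2f}, "
          f"{stats['calls']} calls, {stats['retries']} retries")
```

Sorting by cost puts the workflows worth investigating at the top, and a high retry count next to a high cost is usually the first thing to look into.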
A faster path: ClawFirewall
Building this from scratch takes months. ClawFirewall can be set up in about five minutes.
It sits between your agents and your providers (OpenRouter, OpenClaw, OpenAI, Anthropic, etc.) and enforces limits before calls go out. You get:
- Per-agent, per-workflow, per-user budget limits
- Pre-built circuit breakers
- Real-time anomaly detection and alerts
- A unified view of every call and cent across providers
Jake's team added ClawFirewall after the $10k incident. Six months later, no overages. They also cut average monthly spend by 62% while improving the support agent.
Set up ClawFirewall in a few minutes and start protecting your budget.