How to Prevent Runaway Loops (Part 1)
You wake up, check your model provider dashboard, and see thousands of dollars in API calls from the night before. One loop. It kept running while you slept.
Surveys of engineering teams suggest most have seen this at least once, with average overages around $7,200 and some exceeding $20,000. For early-stage startups, a single runaway loop has put some out of business.
The good news: they're mostly preventable. This playbook walks through what causes them and how to stop them.
What is a runaway loop?
A runaway loop is any autonomous workflow that gets stuck in a repeating cycle of API calls with no built-in exit. Each cycle generates more calls, more tokens, more cost. Because it's autonomous, it can run for hours or days.
Common types:
- Retry loops – tool call fails, agent retries forever
- Workflow restart loops – workflow fails, agent restarts from scratch, repeat
- Nested agent loops – primary agent spawns secondary, secondary spawns another, cascade
- Hallucination loops – model invents a tool call or request and keeps trying to fulfill it
- Context overflow loops – context grows each cycle until it hits the limit, workflow restarts, repeat
They all burn through your budget without delivering value.
Why they happen
Most loops share the same root cause: teams only build for the happy path.
You spend most of your time testing the case where everything works: the user sends a clear request, tools succeed, the model returns good output, the workflow completes. You don't spend enough time on the edge cases: What if a tool fails? What if the model returns garbage? What if the API times out? What if the agent can't complete the request?
Teams add retry logic but not hard limits, circuit breakers, or exit conditions. They assume the workflow will eventually succeed. That assumption is how loops start.
The $10k loop from the earlier post happened because retries were limited per call but not per workflow. When 10 retries failed, the agent restarted the whole workflow. No one added that guardrail.
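That failure mode is easy to reproduce. Here is a minimal Python sketch of it (the function names are illustrative, and the 1,000-restart bound exists only so the demo terminates; the real incident had no bound at all):

```python
MAX_CALL_RETRIES = 3  # the per-call cap the team did have

def flaky_tool():
    # Stand-in for a tool call that fails persistently.
    raise RuntimeError("tool failed")

def run_workflow():
    # The per-call retry limit works as intended...
    for _ in range(MAX_CALL_RETRIES):
        try:
            return flaky_tool()
        except RuntimeError:
            continue
    raise RuntimeError("workflow failed after retries")

# ...but nothing caps workflow restarts, so a persistent failure
# turns into an endless restart cycle.
restarts = 0
while restarts < 1000:  # the missing guardrail: no such bound existed
    try:
        run_workflow()
        break
    except RuntimeError:
        restarts += 1  # restart from scratch: the runaway loop

print(restarts)  # every restart is another full round of API calls
```

Each pass through the outer loop re-runs every API call in the workflow, which is why the cost compounds so quickly overnight.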
Step 1: Hard retry limits at every level
Retry logic causes most runaway loops. Hard limits stop them.
Most teams only limit retries per tool call. You need limits at four levels:
- Per-tool call: max 3 retries. After 3 failures, the call fails and the workflow goes to error handling—no more retries.
- Per-action: max 2 retries. If the agent can't complete the action after 2 attempts, escalate to a human.
- Per-workflow: max 1 restart. If the workflow fails once, allow one restart. If it fails again, stop.
- Global: cap total retries per hour or day. Hit the cap, disable retries for the rest of the period.
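The four levels above can be sketched as a single retry budget that every call path consults. This is a minimal illustration, not a production implementation; the class name, limits, and helper functions are all assumptions for the example:

```python
import time

class RetryBudget:
    """Layered retry limits: per-call, per-action, per-workflow, global."""

    def __init__(self):
        self.per_call_max = 3       # retries per tool call
        self.per_action_max = 2     # retries per agent action
        self.per_workflow_max = 1   # workflow restarts
        self.global_max_per_hour = 50  # global cap; tune to your traffic
        self._window_start = time.time()
        self._global_count = 0

    def allow_global_retry(self):
        # Reset the counter when the hour-long window rolls over.
        now = time.time()
        if now - self._window_start >= 3600:
            self._window_start = now
            self._global_count = 0
        if self._global_count >= self.global_max_per_hour:
            return False  # cap hit: retries disabled for the rest of the hour
        self._global_count += 1
        return True

def call_with_retries(budget, tool):
    # Attempt the tool call up to the per-call limit, charging the
    # global budget for each attempt. On exhaustion, fail the call so
    # the workflow's error handling (not another retry) takes over.
    last_err = None
    for _ in range(budget.per_call_max):
        if not budget.allow_global_retry():
            break
        try:
            return tool()
        except RuntimeError as err:
            last_err = err
    raise RuntimeError("tool call failed; route to error handling") from last_err

# Demo: a persistently failing tool exhausts exactly the per-call budget.
attempts = 0
def always_fails():
    global attempts
    attempts += 1
    raise RuntimeError("boom")

try:
    call_with_retries(RetryBudget(), always_fails)
except RuntimeError:
    pass
```

The key design point is that the budget object is shared across the whole workflow, so no single call site can exceed the global cap on its own.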
Enforce these at the infrastructure layer, before calls reach the provider. If they're only in agent code, a bug or malformed response can bypass them. ClawFirewall enforces retry limits at the API gateway so even buggy agent code can't exceed them.
Step 2: Circuit breakers for every workflow
A circuit breaker stops a workflow when it crosses a threshold. When it trips, the workflow halts, no further API calls go out, and the issue escalates to a human. Don't restart until someone has reviewed the failure.
Set breakers for:
- Error rate – e.g., >20% errors over 10 minutes
- Retry volume – e.g., >10 retries in 5 minutes
- Token usage – e.g., 2x average for a single run
- Spend – e.g., 10x average cost or 50% of daily budget
- API call volume – e.g., 5x average calls per minute
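To make the mechanism concrete, here is a minimal sketch of a breaker for the retry-volume threshold (>10 retries in 5 minutes). The class name and thresholds are illustrative; the same sliding-window pattern applies to the error-rate, token, spend, and call-volume breakers:

```python
import time
from collections import deque

class CircuitBreaker:
    """Trips when retries in a sliding window exceed a threshold."""

    def __init__(self, max_retries=10, window_seconds=300):
        self.max_retries = max_retries
        self.window = window_seconds
        self.events = deque()   # timestamps of recent retries
        self.tripped = False

    def record_retry(self, now=None):
        now = time.time() if now is None else now
        self.events.append(now)
        # Drop retries that have aged out of the sliding window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        if len(self.events) > self.max_retries:
            self.tripped = True  # stop the workflow and escalate

    def allow_call(self):
        # Once tripped, stay open until a human resets the breaker.
        return not self.tripped
```

A workflow would check `allow_call()` before every API call and call `record_retry()` on every retry; once the breaker trips, it stays open until a person resets it, which is what prevents an automatic restart from re-entering the loop.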
Even if something bypasses retry limits, a breaker can stop the loop before it burns the budget. ClawFirewall includes configurable circuit breakers out of the box.