Cost Control & Token Budgets - Preventing $10K Surprises
Deep dive into cost control for production agents: token budgets, circuit breakers, model routing, max step limits, and preventing runaway loops that burn through API credits
Prerequisite: This is Part 4 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.
Why This Matters
Your agent enters a loop. Loop calls LLM. LLM responds. Loop continues. You wake up to a $10K bill.
This isn’t hypothetical:
- 68% of teams hit budget overruns in their first agent deployments
- 50% cite “runaway tool loops and recursive logic” as the cause
- Agents consume 5-20x more tokens than simple chains
What Goes Wrong Without This:
- Symptom: Monthly API bill 10x higher than expected. Cause: An agent retry loop while an external API was down. No circuit breaker, so it kept calling the LLM for 6 hours.
- Symptom: A single user task consumed $500 in tokens. Cause: A complex research task with no budget limit. The agent kept gathering more context and expanding scope.
- Symptom: Costs vary wildly between identical requests. Cause: No model routing (GPT-4 used for tasks GPT-3.5 handles fine) and no visibility into per-task costs.
Why Agents Are Expensive
Agents aren’t just more LLM calls. They’re structurally more expensive.
| Factor | Simple Chain | Agent |
|---|---|---|
| LLM calls per task | 1-3 | 5-50+ |
| Context size growth | None | Accumulates each turn |
| Retries | Rare | Common (external dependencies) |
| Tool outputs in context | Minimal | Large (file contents, API responses) |
| Loops | None | Yes (observe-think-act) |
Example cost breakdown:
Simple RAG query:
- 1 embedding call: $0.0001
- 1 completion call: $0.01
- Total: ~$0.01

Agent research task:
- 5 planning calls: $0.05
- 20 tool calls: $0.20
- 10 analysis calls: $0.10
- 3 retry loops: $0.15
- Total: $0.50
50x more expensive for a single task. At scale, this compounds.
Pattern 1: Token Budgets
Every task gets a budget. Exceed it, and the agent stops gracefully.
import logging

logger = logging.getLogger(__name__)

class TokenBudgetExceeded(Exception):
    """Raised when a task's token budget is exhausted."""
    def __init__(self, used, max, message):
        super().__init__(message)
        self.used = used
        self.max = max

class TokenBudget:
    def __init__(self, max_tokens=50000, warn_at=0.8):
        self.max = max_tokens
        self.warn_threshold = warn_at
        self.used = 0
        self.warning_issued = False
def consume(self, tokens):
self.used += tokens
if not self.warning_issued and self.used >= self.max * self.warn_threshold:
self.warning_issued = True
logger.warning(f"Token budget at {self.used}/{self.max} ({self.warn_threshold*100}%)")
if self.used >= self.max:
raise TokenBudgetExceeded(
used=self.used,
max=self.max,
message="Task exceeded token budget. Gracefully stopping."
)
@property
def remaining(self):
return max(0, self.max - self.used)
@property
def percentage_used(self):
return self.used / self.max
# Usage in agent (inside a task-handling function, since we return on exhaustion)
budget = TokenBudget(max_tokens=100000)
for step in agent_loop():
try:
response = llm.call(prompt)
budget.consume(response.usage.total_tokens)
except TokenBudgetExceeded:
return agent.graceful_shutdown("Budget exceeded")
Budget Sizing Guidelines
| Task Type | Suggested Budget | Rationale |
|---|---|---|
| Simple Q&A | 5,000 tokens | 1-2 turns max |
| Document analysis | 50,000 tokens | Large context, few turns |
| Research task | 100,000 tokens | Many tool calls, iteration |
| Code generation | 150,000 tokens | Multiple files, testing |
| Complex workflow | 500,000 tokens | Multi-step, human-in-loop |
Start conservative. Increase based on actual usage patterns, not guesses.
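How do you move from guesses to data? One approach is to log tokens per completed task and set each budget from a high percentile of observed usage. A minimal sketch (the `suggest_budget` helper and its thresholds are illustrative, not from any particular library):

    # Illustrative: derive budgets from observed usage instead of guessing.
    # Assumes you already log total tokens per completed task, grouped by task type.
    import statistics

    def suggest_budget(token_samples, headroom=1.5):
        """Suggest a budget: p95 of observed usage, plus headroom for variance."""
        if len(token_samples) < 20:
            return None  # too little data; keep the conservative default
        p95 = statistics.quantiles(token_samples, n=20)[-1]  # 95th percentile
        return int(p95 * headroom)

    # Example: observed token usage for document-analysis tasks
    samples = [18_000, 22_500, 31_000, 27_400, 45_200] * 5
    print(suggest_budget(samples))  # 67800 with 1.5x headroom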
Pattern 2: Circuit Breakers for Loops
Agents loop. Loops can run forever. Circuit breakers stop them.
class LoopLimitExceeded(Exception):
    """Raised when the agent exceeds its total iteration cap."""

class StuckInLoop(Exception):
    """Raised when the agent repeats the same action too many times."""

class LoopBreaker:
    def __init__(self, max_iterations=25, max_same_action=3):
        self.max_iterations = max_iterations
        self.max_same_action = max_same_action
        self.iterations = 0
        self.action_history = []
def check(self, action):
self.iterations += 1
self.action_history.append(action)
# Too many total iterations
if self.iterations >= self.max_iterations:
raise LoopLimitExceeded(
f"Agent exceeded {self.max_iterations} iterations"
)
# Stuck in same action
recent = self.action_history[-self.max_same_action:]
if len(recent) == self.max_same_action and len(set(recent)) == 1:
raise StuckInLoop(
f"Agent repeated '{action}' {self.max_same_action} times"
)
# Usage
breaker = LoopBreaker(max_iterations=25, max_same_action=3)
while not done:
action = agent.decide()
breaker.check(action.type) # Raises if stuck
result = agent.execute(action)
Loop Detection Strategies
| Strategy | Detects | Implementation |
|---|---|---|
| Max iterations | Runaway loops | Counter, hard limit |
| Same action repeated | Stuck agent | Track last N actions |
| No progress | Spinning without results | Track state changes |
| Time limit | Slow infinite loops | Wall clock timeout |
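The LoopBreaker above covers the first two rows. "No progress" detection needs something extra: a fingerprint of the agent's observable state after each step. A minimal sketch, assuming you can serialize whatever counts as progress for your agent (the `ProgressGuard` name and thresholds are our own):

    # Illustrative no-progress detector; complements LoopBreaker above.
    import hashlib
    import json

    class NoProgressDetected(Exception):
        pass

    class ProgressGuard:
        def __init__(self, max_stalled_steps=5):
            self.max_stalled = max_stalled_steps
            self.stalled = 0
            self.last_fingerprint = None

        def check(self, state: dict):
            # Hash whatever you consider progress (files written, facts
            # gathered, subtasks completed) into a comparable fingerprint.
            fp = hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()
            if fp == self.last_fingerprint:
                self.stalled += 1
                if self.stalled >= self.max_stalled:
                    raise NoProgressDetected(f"No state change in {self.stalled} steps")
            else:
                self.stalled = 0
                self.last_fingerprint = fp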
Pattern 3: Model Routing
Use expensive models only when needed.
class ModelRouter:
def __init__(self):
self.models = {
"simple": "gpt-4o-mini", # $0.15/1M input
"standard": "gpt-4o", # $5/1M input
"complex": "claude-opus", # $15/1M input
}
    def route(self, task):
        # Check the most specific buckets first, so complex task types
        # aren't accidentally captured by the broad reasoning check
        if task.type in ["clarification", "formatting", "simple_qa"]:
            return self.models["simple"]
        if task.type in ["code_review", "complex_research", "multi_step"]:
            return self.models["complex"]
        if task.requires_reasoning or task.type in ["analysis", "planning"]:
            return self.models["standard"]
        return self.models["standard"]  # Default
# Usage
router = ModelRouter()
model = router.route(current_task)
response = llm.call(model=model, prompt=prompt)
Model Cost Comparison (Dec 2024)
| Model | Input (per 1M) | Output (per 1M) | Use For |
|---|---|---|---|
| GPT-4o-mini | $0.15 | $0.60 | Formatting, simple tasks |
| GPT-4o | $5 | $15 | Standard reasoning |
| Claude Sonnet | $3 | $15 | Balanced cost/quality |
| Claude Opus | $15 | $75 | Complex tasks, code |
| GPT-4-turbo | $10 | $30 | Legacy compatibility |
The math: If 60% of your tasks can use mini models, you save ~95% on those tasks.
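To sanity-check that claim, a quick back-of-the-envelope calculation using the input prices from the table above (the daily volume is illustrative):

    # Back-of-the-envelope routing savings, input-side prices from the table above.
    # Assumes 1M input tokens/day with 60% of traffic routable to the mini model.
    PRICE_PER_1M_INPUT = {"gpt-4o-mini": 0.15, "gpt-4o": 5.00}

    all_standard = PRICE_PER_1M_INPUT["gpt-4o"]  # everything on gpt-4o
    blended = 0.6 * PRICE_PER_1M_INPUT["gpt-4o-mini"] + 0.4 * PRICE_PER_1M_INPUT["gpt-4o"]

    print(f"All gpt-4o:   ${all_standard:.2f}/day")           # $5.00/day
    print(f"With routing: ${blended:.2f}/day")                # $2.09/day
    print(f"On routed tasks: {1 - 0.15 / 5.00:.0%} saved")    # 97% on input tokens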
Pattern 4: Cost Tracking
You can’t control what you don’t measure.
class CostTracker:
# Pricing per 1K tokens (update as needed)
PRICING = {
"gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
"gpt-4o": {"input": 0.005, "output": 0.015},
"claude-sonnet": {"input": 0.003, "output": 0.015},
"claude-opus": {"input": 0.015, "output": 0.075},
}
def __init__(self, alert_threshold=10.0):
self.total_cost = 0
self.cost_by_model = {}
self.cost_by_task_type = {}
self.alert_threshold = alert_threshold
def record(self, model, input_tokens, output_tokens, task_type=None):
pricing = self.PRICING.get(model, {"input": 0.01, "output": 0.03})
cost = (
(input_tokens * pricing["input"] / 1000) +
(output_tokens * pricing["output"] / 1000)
)
self.total_cost += cost
self.cost_by_model[model] = self.cost_by_model.get(model, 0) + cost
if task_type:
self.cost_by_task_type[task_type] = (
self.cost_by_task_type.get(task_type, 0) + cost
)
if self.total_cost >= self.alert_threshold:
self.trigger_alert()
return cost
    def trigger_alert(self):
        # 'alert' stands in for your alerting client (Slack webhook, PagerDuty, etc.)
        alert.send(
            channel="slack-finops",
            message=f"Agent cost alert: ${self.total_cost:.2f} exceeded threshold"
        )
def report(self):
return {
"total_cost": self.total_cost,
"by_model": self.cost_by_model,
"by_task_type": self.cost_by_task_type,
}
Cost Attribution Dimensions
| Dimension | How to Track | Why It Matters |
|---|---|---|
| Per request | Tag spans with request_id | Identify expensive requests |
| Per user | Tag with user_id | Fair billing, abuse detection |
| Per task type | Classify tasks | Optimize high-cost task types |
| Per model | Track model in each call | Validate routing effectiveness |
| Per feature | Feature flags on tasks | ROI by feature |
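Most tracing stacks make this tagging straightforward. A sketch using an OpenTelemetry-style span, where the attribute names are our own convention and `llm` is the same abstract client used throughout this post:

    # Illustrative: tag every LLM call's span with attribution dimensions.
    from opentelemetry import trace

    tracer = trace.get_tracer("agent.cost")

    def tracked_llm_call(llm, prompt, *, model, request_id, user_id, task_type, feature):
        with tracer.start_as_current_span("llm.call") as span:
            # Attribution dimensions from the table above
            span.set_attribute("request.id", request_id)
            span.set_attribute("user.id", user_id)
            span.set_attribute("task.type", task_type)
            span.set_attribute("feature", feature)
            span.set_attribute("llm.model", model)
            response = llm.call(model=model, prompt=prompt)
            span.set_attribute("llm.tokens.input", response.usage.prompt_tokens)
            span.set_attribute("llm.tokens.output", response.usage.completion_tokens)
            return response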
Pattern 5: Max Step Limits
Hard limits prevent catastrophic runaway.
class AgentExecutor:
    def __init__(self, agent, max_steps=50, max_tool_calls=100):
        self.agent = agent  # the agent whose loop we're bounding
        self.max_steps = max_steps
        self.max_tool_calls = max_tool_calls
def run(self, task):
steps = 0
tool_calls = 0
while not task.is_complete():
steps += 1
if steps > self.max_steps:
return self.force_completion(
task,
reason=f"Exceeded max steps ({self.max_steps})"
)
action = self.agent.decide(task)
if action.is_tool_call:
tool_calls += 1
if tool_calls > self.max_tool_calls:
return self.force_completion(
task,
reason=f"Exceeded max tool calls ({self.max_tool_calls})"
)
task = self.agent.execute(action)
return task.result
def force_completion(self, task, reason):
logger.warning(f"Force completing task: {reason}")
return self.agent.summarize_progress(task, interrupted=True)
Alerting Strategy
# Example alerting rules
alerts:
- name: high_cost_request
condition: request_cost > $5
severity: warning
action: log_and_review
- name: budget_exceeded
condition: daily_cost > $100
severity: critical
action: page_oncall
- name: runaway_loop
condition: iterations > 30
severity: critical
action: kill_and_alert
- name: cost_spike
condition: hourly_cost > 3x_average
severity: warning
action: investigate
- name: model_misrouting
condition: expensive_model_on_simple_task
severity: info
action: log_for_review
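Most of these rules are simple threshold checks; cost_spike is the one that needs state. A minimal sketch of the "3x average" detection, assuming you can total costs per hour (the class and its defaults are illustrative):

    # Illustrative cost_spike detector: flags an hour costing > 3x trailing average.
    from collections import deque

    class CostSpikeDetector:
        def __init__(self, window_hours=24, multiplier=3.0, min_baseline=6):
            self.history = deque(maxlen=window_hours)  # trailing hourly costs
            self.multiplier = multiplier
            self.min_baseline = min_baseline           # hours of data before alerting

        def check(self, hourly_cost):
            spike = False
            if len(self.history) >= self.min_baseline:
                average = sum(self.history) / len(self.history)
                spike = hourly_cost > self.multiplier * average
            self.history.append(hourly_cost)
            return spike  # True -> fire the cost_spike alert

    detector = CostSpikeDetector()
    for cost in [2.0, 2.1, 1.9, 2.2, 2.0, 2.1, 9.5]:
        if detector.check(cost):
            print(f"Cost spike: ${cost:.2f}/hour")  # fires on 9.5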
Common Gotchas
| Gotcha | Symptom | Fix |
|---|---|---|
| No budget on dev | Works in dev, explodes in prod | Budget in all environments |
| Budget too tight | Tasks fail legitimately | Monitor actual usage, adjust |
| No graceful shutdown | Task fails with no results | Implement partial result return |
| Static routing | Over-using expensive models | Dynamic complexity detection |
| No per-user limits | One user burns budget for all | User-level quotas |
| Alerting too late | See bill at end of month | Real-time cost monitoring |
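The per-user limits gotcha deserves its own guard. A minimal in-memory sketch (a production version would back this with Redis or a database so quotas persist across workers and restarts; the class is illustrative):

    # Illustrative per-user daily quota; in-memory only, reset by a daily job.
    from collections import defaultdict

    class UserQuotaExceeded(Exception):
        pass

    class UserQuota:
        def __init__(self, daily_limit_usd=5.0):
            self.daily_limit = daily_limit_usd
            self.spend = defaultdict(float)

        def charge(self, user_id, cost):
            """Record spend for a user; raise before they exceed their daily cap."""
            if self.spend[user_id] + cost > self.daily_limit:
                raise UserQuotaExceeded(f"{user_id} hit the ${self.daily_limit:.2f}/day cap")
            self.spend[user_id] += cost

    # Usage: charge after each CostTracker.record() call
    quota = UserQuota(daily_limit_usd=5.0)
    quota.charge("user_123", 0.42)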
The Cost Control Checklist
Before deploying an agent:
TOKEN BUDGETS
- [ ] Per-task budget defined
- [ ] Warning at 80% threshold
- [ ] Graceful shutdown when exceeded
- [ ] Budget sizes based on actual usage data

LOOP PROTECTION
- [ ] Max iterations limit
- [ ] Same-action detection
- [ ] Time limit as backstop
- [ ] Progress tracking (no-op detection)

MODEL ROUTING
- [ ] Task complexity classification
- [ ] Model selection based on task
- [ ] Default model is cost-efficient
- [ ] Override for critical tasks

COST TRACKING
- [ ] Per-request cost calculation
- [ ] Per-user attribution
- [ ] Per-task-type breakdown
- [ ] Real-time dashboards

ALERTING
- [ ] Per-request cost alerts
- [ ] Daily budget alerts
- [ ] Anomaly detection
- [ ] Oncall escalation configured
Key Takeaways
- Agents are 5-20x more expensive than chains. Budget accordingly.
- Token budgets are mandatory. No task runs without a limit.
- Circuit breakers prevent runaway loops. Max iterations + stuck detection.
- Model routing saves 90%+ on simple tasks. Use expensive models selectively.
- You can’t control what you don’t measure. Track cost by request, user, task type.
Next Steps
Costs are controlled. But how do you know if your agent is doing the right thing?
→ Part 5: Observability & Silent Failures
Or jump to another topic:
- Part 6: Durable Execution — Temporal, Inngest, Restate
- Part 7: Security — Sandboxing and prompt injection