Cost Control & Token Budgets - Preventing $10K Surprises
Deep dive into cost control for production agents: token budgets, circuit breakers, model routing, max step limits, and preventing runaway loops that burn through API credits
Prerequisite: This is Part 4 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.
Why This Matters
Your agent enters a loop. Loop calls LLM. LLM responds. Loop continues. You wake up to a $10K bill.
This isn’t hypothetical:
- 68% of teams hit budget overruns in their first agent deployments
- 50% cite “runaway tool loops and recursive logic” as the cause
- Agents consume 5-20x more tokens than simple chains
What Goes Wrong Without This:
- Symptom: Monthly API bill 10x higher than expected. Cause: An agent retry loop while an external API was down. No circuit breaker, so it kept calling the LLM for 6 hours.
- Symptom: A single user task consumed $500 in tokens. Cause: A complex research task with no budget limit. The agent kept gathering more context and expanding scope.
- Symptom: Costs vary wildly between identical requests. Cause: No model routing (GPT-4 used for tasks GPT-3.5 handles fine) and no visibility into per-task costs.
Why Agents Are Expensive
Agents aren’t just more LLM calls. They’re structurally more expensive.
| Factor | Simple Chain | Agent |
|---|---|---|
| LLM calls per task | 1-3 | 5-50+ |
| Context size growth | None | Accumulates each turn |
| Retries | Rare | Common (external dependencies) |
| Tool outputs in context | Minimal | Large (file contents, API responses) |
| Loops | None | Yes (observe-think-act) |
Example cost breakdown:
Simple RAG query:
- 1 embedding call: $0.0001
- 1 completion call: $0.01
- Total: ~$0.01

Agent research task:
- 5 planning calls: $0.05
- 20 tool calls: $0.20
- 10 analysis calls: $0.10
- 3 retry loops: $0.15
- Total: $0.50
50x more expensive for a single task. At scale, this compounds.
Pattern 1: Token Budgets
Every task gets a budget. Exceed it, and the agent stops gracefully.
import logging

logger = logging.getLogger(__name__)

class TokenBudgetExceeded(Exception):
    """Raised when a task's token budget is exhausted."""
    def __init__(self, used, max, message):
        super().__init__(message)
        self.used = used
        self.max = max

class TokenBudget:
    def __init__(self, max_tokens=50000, warn_at=0.8):
        self.max = max_tokens
        self.warn_threshold = warn_at
        self.used = 0
        self.warning_issued = False
def consume(self, tokens):
self.used += tokens
if not self.warning_issued and self.used >= self.max * self.warn_threshold:
self.warning_issued = True
logger.warning(f"Token budget at {self.used}/{self.max} ({self.warn_threshold*100}%)")
if self.used >= self.max:
raise TokenBudgetExceeded(
used=self.used,
max=self.max,
message="Task exceeded token budget. Gracefully stopping."
)
@property
def remaining(self):
return max(0, self.max - self.used)
@property
def percentage_used(self):
return self.used / self.max
# Usage in agent (inside a task-handling function, since we return on exhaustion)
budget = TokenBudget(max_tokens=100000)
for step in agent_loop():
try:
response = llm.call(prompt)
budget.consume(response.usage.total_tokens)
except TokenBudgetExceeded:
return agent.graceful_shutdown("Budget exceeded")
Budget Sizing Guidelines
| Task Type | Suggested Budget | Rationale |
|---|---|---|
| Simple Q&A | 5,000 tokens | 1-2 turns max |
| Document analysis | 50,000 tokens | Large context, few turns |
| Research task | 100,000 tokens | Many tool calls, iteration |
| Code generation | 150,000 tokens | Multiple files, testing |
| Complex workflow | 500,000 tokens | Multi-step, human-in-loop |
Start conservative. Increase based on actual usage patterns, not guesses.
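How do you move from guesses to data? One approach is to log tokens per completed task and set each budget from a high percentile of observed usage. A minimal sketch (the `suggest_budget` helper and its thresholds are illustrative, not from any particular library):

    # Illustrative: derive budgets from observed usage instead of guessing.
    # Assumes you already log total tokens per completed task, grouped by task type.
    import statistics

    def suggest_budget(token_samples, headroom=1.5):
        """Suggest a budget: p95 of observed usage, plus headroom for variance."""
        if len(token_samples) < 20:
            return None  # too little data; keep the conservative default
        p95 = statistics.quantiles(token_samples, n=20)[-1]  # 95th percentile
        return int(p95 * headroom)

    # Example: observed token usage for document-analysis tasks
    samples = [18_000, 22_500, 31_000, 27_400, 45_200] * 5
    print(suggest_budget(samples))  # 67800 with 1.5x headroom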
Pattern 2: Circuit Breakers for Loops
Agents loop. Loops can run forever. Circuit breakers stop them.
class LoopLimitExceeded(Exception):
    """Raised when the agent exceeds its total iteration cap."""

class StuckInLoop(Exception):
    """Raised when the agent repeats the same action too many times."""

class LoopBreaker:
    def __init__(self, max_iterations=25, max_same_action=3):
        self.max_iterations = max_iterations
        self.max_same_action = max_same_action
        self.iterations = 0
        self.action_history = []
def check(self, action):
self.iterations += 1
self.action_history.append(action)
# Too many total iterations
if self.iterations >= self.max_iterations:
raise LoopLimitExceeded(
f"Agent exceeded {self.max_iterations} iterations"
)
# Stuck in same action
recent = self.action_history[-self.max_same_action:]
if len(recent) == self.max_same_action and len(set(recent)) == 1:
raise StuckInLoop(
f"Agent repeated '{action}' {self.max_same_action} times"
)
# Usage
breaker = LoopBreaker(max_iterations=25, max_same_action=3)
while not done:
action = agent.decide()
breaker.check(action.type) # Raises if stuck
result = agent.execute(action)
Loop Detection Strategies
| Strategy | Detects | Implementation |
|---|---|---|
| Max iterations | Runaway loops | Counter, hard limit |
| Same action repeated | Stuck agent | Track last N actions |
| No progress | Spinning without results | Track state changes |
| Time limit | Slow infinite loops | Wall clock timeout |
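The LoopBreaker above covers the first two rows. "No progress" detection needs something extra: a fingerprint of the agent's observable state after each step. A minimal sketch, assuming you can serialize whatever counts as progress for your agent (the `ProgressGuard` name and thresholds are our own):

    # Illustrative no-progress detector; complements LoopBreaker above.
    import hashlib
    import json

    class NoProgressDetected(Exception):
        pass

    class ProgressGuard:
        def __init__(self, max_stalled_steps=5):
            self.max_stalled = max_stalled_steps
            self.stalled = 0
            self.last_fingerprint = None

        def check(self, state: dict):
            # Hash whatever you consider progress (files written, facts
            # gathered, subtasks completed) into a comparable fingerprint.
            fp = hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()
            if fp == self.last_fingerprint:
                self.stalled += 1
                if self.stalled >= self.max_stalled:
                    raise NoProgressDetected(f"No state change in {self.stalled} steps")
            else:
                self.stalled = 0
                self.last_fingerprint = fp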
Pattern 3: Model Routing
Use expensive models only when needed.
class ModelRouter:
def __init__(self):
self.models = {
"simple": "gpt-4o-mini", # $0.15/1M input
"standard": "gpt-4o", # $5/1M input
"complex": "claude-opus", # $15/1M input
}
    def route(self, task):
        # Check the most specific buckets first, so complex task types
        # aren't accidentally captured by the broad reasoning check
        if task.type in ["clarification", "formatting", "simple_qa"]:
            return self.models["simple"]
        if task.type in ["code_review", "complex_research", "multi_step"]:
            return self.models["complex"]
        if task.requires_reasoning or task.type in ["analysis", "planning"]:
            return self.models["standard"]
        return self.models["standard"]  # Default
# Usage
router = ModelRouter()
model = router.route(current_task)
response = llm.call(model=model, prompt=prompt)
Model Cost Comparison (Dec 2024)
| Model | Input (per 1M) | Output (per 1M) | Use For |
|---|---|---|---|
| GPT-4o-mini | $0.15 | $0.60 | Formatting, simple tasks |
| GPT-4o | $5 | $15 | Standard reasoning |
| Claude Sonnet | $3 | $15 | Balanced cost/quality |
| Claude Opus | $15 | $75 | Complex tasks, code |
| GPT-4-turbo | $10 | $30 | Legacy compatibility |
The math: If 60% of your tasks can use mini models, you save ~95% on those tasks.
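To sanity-check that claim, a quick back-of-the-envelope calculation using the input prices from the table above (the daily volume is illustrative):

    # Back-of-the-envelope routing savings, input-side prices from the table above.
    # Assumes 1M input tokens/day with 60% of traffic routable to the mini model.
    PRICE_PER_1M_INPUT = {"gpt-4o-mini": 0.15, "gpt-4o": 5.00}

    all_standard = PRICE_PER_1M_INPUT["gpt-4o"]  # everything on gpt-4o
    blended = 0.6 * PRICE_PER_1M_INPUT["gpt-4o-mini"] + 0.4 * PRICE_PER_1M_INPUT["gpt-4o"]

    print(f"All gpt-4o:   ${all_standard:.2f}/day")           # $5.00/day
    print(f"With routing: ${blended:.2f}/day")                # $2.09/day
    print(f"On routed tasks: {1 - 0.15 / 5.00:.0%} saved")    # 97% on input tokens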
Pattern 4: Cost Tracking
You can’t control what you don’t measure.
class CostTracker:
# Pricing per 1K tokens (update as needed)
PRICING = {
"gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
"gpt-4o": {"input": 0.005, "output": 0.015},
"claude-sonnet": {"input": 0.003, "output": 0.015},
"claude-opus": {"input": 0.015, "output": 0.075},
}
def __init__(self, alert_threshold=10.0):
self.total_cost = 0
self.cost_by_model = {}
self.cost_by_task_type = {}
self.alert_threshold = alert_threshold
def record(self, model, input_tokens, output_tokens, task_type=None):
pricing = self.PRICING.get(model, {"input": 0.01, "output": 0.03})
cost = (
(input_tokens * pricing["input"] / 1000) +
(output_tokens * pricing["output"] / 1000)
)
self.total_cost += cost
self.cost_by_model[model] = self.cost_by_model.get(model, 0) + cost
if task_type:
self.cost_by_task_type[task_type] = (
self.cost_by_task_type.get(task_type, 0) + cost
)
if self.total_cost >= self.alert_threshold:
self.trigger_alert()
return cost
    def trigger_alert(self):
        # 'alert' stands in for your alerting client (Slack webhook, PagerDuty, etc.)
        alert.send(
            channel="slack-finops",
            message=f"Agent cost alert: ${self.total_cost:.2f} exceeded threshold"
        )
def report(self):
return {
"total_cost": self.total_cost,
"by_model": self.cost_by_model,
"by_task_type": self.cost_by_task_type,
}
Cost Attribution Dimensions
| Dimension | How to Track | Why It Matters |
|---|---|---|
| Per request | Tag spans with request_id | Identify expensive requests |
| Per user | Tag with user_id | Fair billing, abuse detection |
| Per task type | Classify tasks | Optimize high-cost task types |
| Per model | Track model in each call | Validate routing effectiveness |
| Per feature | Feature flags on tasks | ROI by feature |
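Most tracing stacks make this tagging straightforward. A sketch using an OpenTelemetry-style span, where the attribute names are our own convention and `llm` is the same abstract client used throughout this post:

    # Illustrative: tag every LLM call's span with attribution dimensions.
    from opentelemetry import trace

    tracer = trace.get_tracer("agent.cost")

    def tracked_llm_call(llm, prompt, *, model, request_id, user_id, task_type, feature):
        with tracer.start_as_current_span("llm.call") as span:
            # Attribution dimensions from the table above
            span.set_attribute("request.id", request_id)
            span.set_attribute("user.id", user_id)
            span.set_attribute("task.type", task_type)
            span.set_attribute("feature", feature)
            span.set_attribute("llm.model", model)
            response = llm.call(model=model, prompt=prompt)
            span.set_attribute("llm.tokens.input", response.usage.prompt_tokens)
            span.set_attribute("llm.tokens.output", response.usage.completion_tokens)
            return response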
Pattern 5: Max Step Limits
Hard limits prevent catastrophic runaway.
class AgentExecutor:
    def __init__(self, agent, max_steps=50, max_tool_calls=100):
        self.agent = agent  # the agent whose loop we're bounding
        self.max_steps = max_steps
        self.max_tool_calls = max_tool_calls
def run(self, task):
steps = 0
tool_calls = 0
while not task.is_complete():
steps += 1
if steps > self.max_steps:
return self.force_completion(
task,
reason=f"Exceeded max steps ({self.max_steps})"
)
action = self.agent.decide(task)
if action.is_tool_call:
tool_calls += 1
if tool_calls > self.max_tool_calls:
return self.force_completion(
task,
reason=f"Exceeded max tool calls ({self.max_tool_calls})"
)
task = self.agent.execute(action)
return task.result
def force_completion(self, task, reason):
logger.warning(f"Force completing task: {reason}")
return self.agent.summarize_progress(task, interrupted=True)
Alerting Strategy
# Example alerting rules
alerts:
- name: high_cost_request
condition: request_cost > $5
severity: warning
action: log_and_review
- name: budget_exceeded
condition: daily_cost > $100
severity: critical
action: page_oncall
- name: runaway_loop
condition: iterations > 30
severity: critical
action: kill_and_alert
- name: cost_spike
condition: hourly_cost > 3x_average
severity: warning
action: investigate
- name: model_misrouting
condition: expensive_model_on_simple_task
severity: info
action: log_for_review
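Most of these rules are simple threshold checks; cost_spike is the one that needs state. A minimal sketch of the "3x average" detection, assuming you can total costs per hour (the class and its defaults are illustrative):

    # Illustrative cost_spike detector: flags an hour costing > 3x trailing average.
    from collections import deque

    class CostSpikeDetector:
        def __init__(self, window_hours=24, multiplier=3.0, min_baseline=6):
            self.history = deque(maxlen=window_hours)  # trailing hourly costs
            self.multiplier = multiplier
            self.min_baseline = min_baseline           # hours of data before alerting

        def check(self, hourly_cost):
            spike = False
            if len(self.history) >= self.min_baseline:
                average = sum(self.history) / len(self.history)
                spike = hourly_cost > self.multiplier * average
            self.history.append(hourly_cost)
            return spike  # True -> fire the cost_spike alert

    detector = CostSpikeDetector()
    for cost in [2.0, 2.1, 1.9, 2.2, 2.0, 2.1, 9.5]:
        if detector.check(cost):
            print(f"Cost spike: ${cost:.2f}/hour")  # fires on 9.5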
Common Gotchas
| Gotcha | Symptom | Fix |
|---|---|---|
| No budget on dev | Works in dev, explodes in prod | Budget in all environments |
| Budget too tight | Tasks fail legitimately | Monitor actual usage, adjust |
| No graceful shutdown | Task fails with no results | Implement partial result return |
| Static routing | Over-using expensive models | Dynamic complexity detection |
| No per-user limits | One user burns budget for all | User-level quotas |
| Alerting too late | See bill at end of month | Real-time cost monitoring |
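The per-user limits gotcha deserves its own guard. A minimal in-memory sketch (a production version would back this with Redis or a database so quotas persist across workers and restarts; the class is illustrative):

    # Illustrative per-user daily quota; in-memory only, reset by a daily job.
    from collections import defaultdict

    class UserQuotaExceeded(Exception):
        pass

    class UserQuota:
        def __init__(self, daily_limit_usd=5.0):
            self.daily_limit = daily_limit_usd
            self.spend = defaultdict(float)

        def charge(self, user_id, cost):
            """Record spend for a user; raise before they exceed their daily cap."""
            if self.spend[user_id] + cost > self.daily_limit:
                raise UserQuotaExceeded(f"{user_id} hit the ${self.daily_limit:.2f}/day cap")
            self.spend[user_id] += cost

    # Usage: charge after each CostTracker.record() call
    quota = UserQuota(daily_limit_usd=5.0)
    quota.charge("user_123", 0.42)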
The Cost Control Checklist
Before deploying an agent:
TOKEN BUDGETS
- [ ] Per-task budget defined
- [ ] Warning at 80% threshold
- [ ] Graceful shutdown when exceeded
- [ ] Budget sizes based on actual usage data

LOOP PROTECTION
- [ ] Max iterations limit
- [ ] Same-action detection
- [ ] Time limit as backstop
- [ ] Progress tracking (no-op detection)

MODEL ROUTING
- [ ] Task complexity classification
- [ ] Model selection based on task
- [ ] Default model is cost-efficient
- [ ] Override for critical tasks

COST TRACKING
- [ ] Per-request cost calculation
- [ ] Per-user attribution
- [ ] Per-task-type breakdown
- [ ] Real-time dashboards

ALERTING
- [ ] Per-request cost alerts
- [ ] Daily budget alerts
- [ ] Anomaly detection
- [ ] Oncall escalation configured
Key Takeaways
- Agents are 5-20x more expensive than chains. Budget accordingly.
- Token budgets are mandatory. No task runs without a limit.
- Circuit breakers prevent runaway loops. Max iterations + stuck detection.
- Model routing saves 90%+ on simple tasks. Use expensive models selectively.
- You can’t control what you don’t measure. Track cost by request, user, task type.
Next Steps
Costs are controlled. But how do you know if your agent is doing the right thing?
→ Part 5: Observability & Silent Failures
Or jump to another topic:
- Part 6: Durable Execution — Temporal, Inngest, Restate
- Part 7: Security — Sandboxing and prompt injection