Durable Execution Frameworks - Don't Reinvent the Wheel
Deep dive into durable execution frameworks for agents: Temporal, Inngest, Restate, Azure Durable Functions, AWS Step Functions. When to use each and how they solve agent production challenges
Prerequisite: This is Part 6 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.
Why This Matters
You’ve read about idempotency, checkpointing, retries, and state persistence. Here’s the secret: all of these problems have been solved before.
Durable execution frameworks handle:
- State persistence automatically
- Retries with exponential backoff built-in
- Checkpointing at every step
- Effectively-once side effects (at-least-once execution combined with idempotent operations)
- Long-running workflows (hours, days, weeks)
- Human-in-the-loop interrupts
If you’re writing your own checkpointing + retry + recovery logic, you’re probably reinventing a durable execution framework.
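To make the "retries with exponential backoff" item concrete, here is a minimal sketch of the loop a framework runs for you. The names `TransientError`, `backoff_delays`, and `retry` are hypothetical and belong to no framework; the default parameters mirror a typical retry policy (initial interval, coefficient, maximum interval, maximum attempts).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delays(initial: float, coefficient: float, maximum: float, attempts: int) -> list[float]:
    """Delay before each retry: initial * coefficient**n, capped at maximum."""
    return [min(initial * coefficient**n, maximum) for n in range(attempts - 1)]


def retry(fn: Callable[[], T], *, attempts: int = 5, initial: float = 1.0,
          coefficient: float = 2.0, maximum: float = 30.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds or the attempts are exhausted."""
    delays = backoff_delays(initial, coefficient, maximum, attempts)
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise                   # out of attempts: surface the failure
            sleep(delays[attempt])      # wait longer after each failure
    raise ValueError("attempts must be at least 1")
```

This sketch keeps its attempt counter in memory, so a process crash resets it. A durable framework persists the attempt count and the next retry time, which is a large part of what you would otherwise have to build.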
The decision:
- Build: Roll your own state management (thousands of lines, months of debugging)
- Buy: Use a framework that’s been battle-tested in production (days to integrate)
What Durable Execution Means
A durable execution framework guarantees:
- State survives failures: If your process crashes, it resumes from the last step
- Effectively-once semantics: Steps may execute more than once, but completed steps are never re-run, so idempotent side effects land exactly once even with retries
- Automatic retries: Transient failures handled without your code knowing
- Long-running support: Workflows can pause for days, waiting for human input
Traditional Code: Start → Execute → [Crash] → Start over from scratch
Durable Execution: Start → Execute → [Crash] → Resume from last checkpoint
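The resume behaviour can be sketched in a few lines. This is a toy journal, not how any particular framework stores state. The `Journal` class and its JSON file are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Any, Callable


class Journal:
    """Persists each completed step's result so a restart can skip it."""

    def __init__(self, path: Path):
        self.path = path
        # Load results recorded by a previous run, if any
        self.results: dict[str, Any] = json.loads(path.read_text()) if path.exists() else {}

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name in self.results:
            return self.results[name]    # completed earlier: replay the stored result
        result = fn()                    # first time: run the side effect
        self.results[name] = result
        self.path.write_text(json.dumps(self.results))  # checkpoint before moving on
        return result
```

Note the gap between running `fn()` and writing the checkpoint: a crash there means the step runs again on resume. That is why step execution is at-least-once and why side effects still need idempotency keys.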
Framework Comparison
| Framework | Best For | Deployment | Language Support |
|---|---|---|---|
| Temporal | Complex workflows, enterprise | Self-hosted or Temporal Cloud | Go, Java, Python, TypeScript |
| Inngest | Event-driven, serverless | Inngest Cloud (managed) or self-hosted | TypeScript, Python, Go |
| Restate | Low latency, lightweight | Self-hosted or Restate Cloud | TypeScript, Java, Kotlin, Go |
| Azure Durable Functions | Azure-native | Azure Functions | C#, JavaScript, Python, PowerShell |
| AWS Step Functions | AWS-native, visual | AWS native | JSON state machine, any via Lambda |
| GCP Cloud Workflows | GCP-native, YAML | GCP native | YAML config |
Temporal
The gold standard for complex, long-running workflows. Used by Netflix, Snap, Stripe, and Datadog.
Core Concepts
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy

# BookingRequest, BookingResult, PaymentResult and the flight_api, payment_api,
# and email_service clients are application code defined elsewhere.

@activity.defn
async def book_flight(flight_id: str, idempotency_key: str) -> BookingResult:
    """
    Activities are your side effects. Temporal handles retries automatically.
    Your idempotent implementation + Temporal's at-least-once = effectively exactly-once.
    """
    return await flight_api.book(flight_id, idempotency_key=idempotency_key)

@activity.defn
async def charge_payment(amount: float, idempotency_key: str) -> PaymentResult:
    return await payment_api.charge(amount, idempotency_key=idempotency_key)

@activity.defn
async def send_confirmation(email: str, booking: BookingResult) -> None:
    await email_service.send(email, template="booking_confirmation", data=booking)

@workflow.defn
class BookingWorkflow:
    @workflow.run
    async def run(self, request: BookingRequest) -> BookingResult:
        # Each completed activity is recorded in the workflow history.
        # If we crash after book_flight, we resume at charge_payment.
        booking = await workflow.execute_activity(
            book_flight,
            args=[request.flight_id, f"{request.user_id}:{request.booking_id}:book"],
            start_to_close_timeout=timedelta(seconds=30),  # an activity timeout is required
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_interval=timedelta(seconds=30),
                backoff_coefficient=2.0,
                maximum_attempts=5,
                non_retryable_error_types=["ValidationError", "AuthError"],
            ),
        )
        payment = await workflow.execute_activity(
            charge_payment,
            args=[booking.total_amount, f"{request.user_id}:{request.booking_id}:pay"],
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
        # Awaiting blocks until the email activity finishes. Returning without
        # awaiting would let the workflow complete and cancel the activity.
        await workflow.execute_activity(
            send_confirmation,
            args=[request.email, booking],
            start_to_close_timeout=timedelta(minutes=5),
        )
        return booking
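The "at-least-once plus idempotency" claim in the docstring depends on the downstream API deduplicating by key. A minimal sketch of that contract follows, with `PaymentAPI` as a hypothetical in-memory stand-in for a provider that honors idempotency keys.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentAPI:
    """Hypothetical stand-in for a payment provider that honors idempotency keys."""
    charges: dict[str, float] = field(default_factory=dict)

    def charge(self, amount: float, idempotency_key: str) -> str:
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = amount   # first delivery: perform the charge
        return f"charge:{idempotency_key}"           # retries receive the same receipt


def idempotency_key(user_id: str, booking_id: str, step: str) -> str:
    """Derive the key from stable workflow inputs, never from a timestamp or random value."""
    return f"{user_id}:{booking_id}:{step}"
```

Because the key is built only from workflow inputs, a retried activity sends the same key and the provider can recognize the duplicate.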
Human-in-the-Loop with Signals
@workflow.defn
class ApprovalWorkflow:
    def __init__(self) -> None:
        self.approved = None
        self.reason = None

    @workflow.signal
    async def approve(self, approved: bool, reason: str):
        """Human sends this signal to approve/reject"""
        self.approved = approved
        self.reason = reason

    @workflow.run
    async def run(self, request: ApprovalRequest) -> ApprovalResult:
        # Execute some work (analyze_request and complete_request are activities defined elsewhere)
        analysis = await workflow.execute_activity(
            analyze_request,
            args=[request],
            start_to_close_timeout=timedelta(minutes=5),
        )
        if analysis.needs_approval:
            # Wait for human signal (can wait days)
            await workflow.wait_condition(lambda: self.approved is not None)
            if not self.approved:
                return ApprovalResult(status="rejected", reason=self.reason)
        # Continue with approved workflow
        return await workflow.execute_activity(
            complete_request,
            args=[request],
            start_to_close_timeout=timedelta(minutes=5),
        )
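An unbounded wait is rarely what you want in production, so it helps to decide what happens when nobody answers. The sketch below models the pattern with plain `asyncio`; `ApprovalGate` is a hypothetical in-memory stand-in, not Temporal code. In the Temporal Python SDK, `workflow.wait_condition` accepts a `timeout` argument for the same purpose.

```python
import asyncio
from typing import Optional


class ApprovalGate:
    """In-memory stand-in for a workflow paused on a human decision."""

    def __init__(self) -> None:
        self.approved: Optional[bool] = None
        self._decided = asyncio.Event()

    def signal(self, approved: bool) -> None:
        """Called from outside the waiting task, like a signal handler."""
        self.approved = approved
        self._decided.set()

    async def wait(self, timeout: float) -> str:
        try:
            await asyncio.wait_for(self._decided.wait(), timeout)
        except asyncio.TimeoutError:
            return "timed_out"   # escalate or auto-reject instead of waiting forever
        return "approved" if self.approved else "rejected"
```

The difference in a durable framework is that the wait survives restarts and deploys; this in-memory version loses the pending approval if the process exits.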
When to Use Temporal
- Complex multi-step workflows
- Long-running processes (hours to weeks)
- Enterprise requirements (audit trails, compliance)
- Need strong consistency guarantees
- Already have infrastructure team capacity
Inngest
Developer-friendly, event-driven, fully managed. Great for teams that want to move fast.
Core Concepts
import { Inngest } from "inngest";

// llm and agent are application clients defined elsewhere
const inngest = new Inngest({ id: "my-agent" });

export const agentWorkflow = inngest.createFunction(
  {
    id: "process-customer-request",
    retries: 5, // Built-in retry
  },
  { event: "customer/request.received" },
  async ({ event, step }) => {
    // Each step is automatically checkpointed
    // If we crash after classify, we resume at route
    const classification = await step.run("classify", async () => {
      return await llm.classify(event.data.message);
    });
    const route = await step.run("route", async () => {
      if (classification.confidence < 0.7) {
        return "human";
      }
      return classification.intent;
    });
    if (route === "human") {
      // Wait up to 7 days for a human response.
      // The first argument is the step ID; the event name goes in the options.
      const humanResponse = await step.waitForEvent("wait-for-human", {
        event: "human/responded",
        match: "data.request_id",
        timeout: "7d",
      });
      // Resolves to null if the timeout elapses first
      return humanResponse;
    }
    // Continue with automated handling
    const result = await step.run("execute", async () => {
      return await agent.execute(classification.intent, event.data);
    });
    return result;
  }
);
Token Budget with Inngest
export const budgetedAgent = inngest.createFunction(
  { id: "budgeted-agent" },
  { event: "agent/task.started" },
  async ({ event, step }) => {
    // Completed steps are replayed from stored results and their callbacks do
    // not run again, so the counter is updated outside step.run from the
    // returned values. Mutations made inside a callback are lost on replay.
    let tokensUsed = 0;
    const maxTokens = 100000;
    const plan = await step.run("plan", async () => {
      return await llm.plan(event.data.task);
    });
    tokensUsed += plan.usage.total_tokens;
    for (const [index, action] of plan.actions.entries()) {
      if (tokensUsed >= maxTokens) {
        // Graceful shutdown within budget
        return {
          status: "budget_exceeded",
          completed: index,
        };
      }
      const result = await step.run(`execute-${action.id}`, async () => {
        return await agent.execute(action);
      });
      tokensUsed += result.usage?.total_tokens || 0;
    }
    return { status: "completed", tokensUsed };
  }
);
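Budget accounting has to stay correct when the function body is re-executed and completed steps are served from stored results. The sketch below models that with a hypothetical `MemoizedSteps` runner and a fixed, made-up cost of 40 tokens per action. The point is that the counter is derived from step return values, so a replay reaches the same total.

```python
from typing import Any, Callable


class MemoizedSteps:
    """Stand-in for a step runner that replays stored results instead of re-running callbacks."""

    def __init__(self, stored: dict[str, Any]):
        self.stored = stored

    def run(self, step_id: str, fn: Callable[[], Any]) -> Any:
        if step_id not in self.stored:
            self.stored[step_id] = fn()      # only executed the first time
        return self.stored[step_id]


def run_with_budget(steps: MemoizedSteps, actions: list[str], max_tokens: int) -> dict:
    tokens_used = 0
    for index, action in enumerate(actions):
        if tokens_used >= max_tokens:
            return {"status": "budget_exceeded", "completed": index, "tokens_used": tokens_used}
        result = steps.run(f"execute-{action}", lambda: {"tokens": 40})  # hypothetical cost
        tokens_used += result["tokens"]      # counted from the returned value, so replays add it too
    return {"status": "completed", "completed": len(actions), "tokens_used": tokens_used}
```

Running the function twice against the same stored results yields the same outcome, which is the property a memoizing step runner needs from your code.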
When to Use Inngest
- Event-driven architectures
- Serverless deployments
- Want fully managed infrastructure
- TypeScript/Node.js primary stack
- Need fast iteration speed
Restate
Lightweight, low-latency, excellent developer experience. Written in Rust.
Core Concepts
import * as restate from "@restatedev/restate-sdk";
import * as clients from "@restatedev/restate-sdk-clients";

// llm and agent are application clients defined elsewhere
const agentService = restate.service({
  name: "agent",
  handlers: {
    processRequest: async (ctx: restate.Context, request: AgentRequest) => {
      // Each ctx.run() result is journaled. On retry, the stored result is
      // replayed instead of re-running the closure.
      const classification = await ctx.run("classify", async () => {
        return await llm.classify(request.message);
      });
      if (classification.needs_approval) {
        // Create an awakeable (durable promise) for the human approval
        const approval = ctx.awakeable<ApprovalResult>();
        // This ID can be used by external system to complete the awakeable
        console.log(`Awaiting approval: ${approval.id}`);
        const result = await approval.promise;
        if (!result.approved) {
          return { status: "rejected" };
        }
      }
      const result = await ctx.run("execute", async () => {
        return await agent.execute(classification.intent, request);
      });
      return result;
    },
  },
});

// Complete awakeable from external system (e.g., webhook)
async function approveRequest(awakeableId: string, approved: boolean) {
  const ingress = clients.connect({ url: "http://localhost:8080" });
  await ingress.resolveAwakeable(awakeableId, { approved });
}
Virtual Objects for Stateful Agents
const agentSession = restate.object({
  name: "agent-session",
  handlers: {
    // State is automatically persisted per session ID
    addMessage: async (ctx: restate.ObjectContext, message: Message) => {
      const history = (await ctx.get<Message[]>("history")) || [];
      history.push(message);
      ctx.set("history", history);
      const response = await ctx.run("generate", async () => {
        return await llm.chat(history);
      });
      history.push({ role: "assistant", content: response });
      ctx.set("history", history);
      return response;
    },
    getHistory: async (ctx: restate.ObjectContext) => {
      return (await ctx.get<Message[]>("history")) || [];
    },
  },
});
When to Use Restate
- Low-latency requirements
- Lightweight deployment (single binary)
- Strong consistency without heavy infrastructure
- TypeScript or JVM stack
- Want to self-host easily
Azure Durable Functions
Native Azure integration. Great for Azure-first shops.
Core Concepts
import azure.functions as func
import azure.durable_functions as df

# Python v2 programming model: triggers are registered on a DFApp instance.
# llm and agent are application clients defined elsewhere.
app = df.DFApp(http_auth_level=func.AuthLevel.ANONYMOUS)

# Orchestrator function
@app.orchestration_trigger(context_name="context")
def agent_orchestrator(context: df.DurableOrchestrationContext):
    request = context.get_input()
    # Each activity is checkpointed
    classification = yield context.call_activity(
        "classify_request",
        request
    )
    if classification["needs_approval"]:
        # Wait for external event (human approval)
        approval = yield context.wait_for_external_event("approval")
        if not approval["approved"]:
            return {"status": "rejected"}
    # Continue with execution
    result = yield context.call_activity(
        "execute_agent_action",
        classification
    )
    return result

# Activity functions
@app.activity_trigger(input_name="request")
def classify_request(request: dict) -> dict:
    return llm.classify(request["message"])

@app.activity_trigger(input_name="classification")
def execute_agent_action(classification: dict) -> dict:
    return agent.execute(classification["intent"])
Fan-out/Fan-in Pattern
# `app` is the function app's df.DFApp instance
@app.orchestration_trigger(context_name="context")
def parallel_research(context: df.DurableOrchestrationContext):
    queries = context.get_input()["queries"]
    # Fan out: schedule research tasks without yielding, so they run in parallel
    tasks = []
    for query in queries:
        task = context.call_activity("research_query", query)
        tasks.append(task)
    # Fan in: wait for all to complete
    results = yield context.task_all(tasks)
    # Synthesize results
    synthesis = yield context.call_activity("synthesize_results", results)
    return synthesis
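The same fan-out/fan-in shape can be tried locally with plain `asyncio` before wiring it into an orchestrator. In this sketch `research_query` is a hypothetical placeholder that sleeps briefly instead of calling a search tool or LLM.

```python
import asyncio


async def research_query(query: str) -> str:
    """Hypothetical activity; a real one would call a search tool or LLM."""
    await asyncio.sleep(0.01)
    return f"findings for {query}"


async def parallel_research(queries: list[str]) -> str:
    # Fan out: build one coroutine per query; gather schedules them concurrently
    tasks = [research_query(q) for q in queries]
    # Fan in: gather returns results in input order, regardless of completion order
    results = await asyncio.gather(*tasks)
    # Synthesize: a trivial join stands in for a synthesis activity
    return " | ".join(results)
```

Unlike the orchestrator version, nothing here is checkpointed: if the process dies midway, every query is re-run from scratch.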
When to Use Azure Durable Functions
- Already on Azure
- .NET or Python primary stack
- Want serverless with durable state
- Need tight Azure service integration
- Cost optimization via consumption pricing
AWS Step Functions
Visual workflows, tight AWS integration. Best for teams who like declarative state machines.
Core Concepts
{
  "Comment": "Agent workflow with human approval",
  "StartAt": "ClassifyRequest",
  "States": {
    "ClassifyRequest": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:classify",
      "Next": "NeedsApproval"
    },
    "NeedsApproval": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.needsApproval",
          "BooleanEquals": true,
          "Next": "WaitForApproval"
        }
      ],
      "Default": "ExecuteAction"
    },
    "WaitForApproval": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
      "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123/approvals",
        "MessageBody": {
          "TaskToken.$": "$$.Task.Token",
          "Request.$": "$"
        }
      },
      "Next": "CheckApproval"
    },
    "CheckApproval": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.approved",
          "BooleanEquals": false,
          "Next": "Rejected"
        }
      ],
      "Default": "ExecuteAction"
    },
    "ExecuteAction": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:execute",
      "End": true
    },
    "Rejected": {
      "Type": "Fail",
      "Error": "ApprovalRejected",
      "Cause": "Human rejected the action"
    }
  }
}
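To see how the Task, Choice, and Fail states in a definition like this drive control flow, here is a toy interpreter. It is a sketch under heavy assumptions: it ignores `Resource` and `Parameters`, looks up a hypothetical Python callable per Task state name, handles only top-level `$.field` paths with `BooleanEquals`, and treats a missing field as a non-match, whereas the real service is stricter.

```python
from typing import Callable


def run_state_machine(definition: dict, tasks: dict[str, Callable[[dict], dict]], data: dict) -> dict:
    """Walks Task, Choice, and Fail states; a tiny subset of what the real service supports."""
    name = definition["StartAt"]
    while True:
        state = definition["States"][name]
        if state["Type"] == "Task":
            data = tasks[name](data)             # the task's output becomes the next state's input
            if state.get("End"):
                return data
            name = state["Next"]
        elif state["Type"] == "Choice":
            target = state["Default"]
            for rule in state["Choices"]:
                field = rule["Variable"].removeprefix("$.")   # only top-level paths handled
                if data.get(field) == rule["BooleanEquals"]:
                    target = rule["Next"]        # first matching rule wins
                    break
            name = target
        elif state["Type"] == "Fail":
            raise RuntimeError(f'{state["Error"]}: {state["Cause"]}')
        else:
            raise ValueError(f'unsupported state type: {state["Type"]}')
```

Loading a definition with `json.loads` and passing one callable per Task state lets you check both the approved and rejected branches locally before deploying anything.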
When to Use Step Functions
- Already on AWS
- Visual workflow design preferred
- Need built-in AWS service integrations
- Long-running workflows with wait states
- Audit and compliance requirements
Decision Framework
| If You Need… | Use |
|---|---|
| Complex multi-step workflows with strong guarantees | Temporal |
| Fast iteration, event-driven, serverless | Inngest |
| Low latency, lightweight, self-hosted | Restate |
| Azure-native, serverless, cost optimization | Azure Durable Functions |
| AWS-native, visual workflows, service integrations | AWS Step Functions |
| GCP-native, simple YAML workflows | GCP Cloud Workflows |
The “Build vs Buy” Decision Tree
Are you writing retry logic with exponential backoff? → Consider a durable execution framework
Are you implementing checkpointing to survive crashes? → Consider a durable execution framework
Are you building idempotency key management? → Consider a durable execution framework
Are you handling human-in-the-loop with long waits? → Consider a durable execution framework
If yes to 2+ of these, you're reinventing the wheel.
Migration Path
If you have existing agent code, here’s how to migrate:
1. Identify Side Effects
# Before: Side effects scattered in code
def process_request(request):
    classification = llm.classify(request)  # LLM call
    if classification.needs_action:
        result = api.execute(classification)  # External API
        email.send(request.user, result)  # Email
        return result
2. Extract as Activities
# After: Side effects are activities
from datetime import timedelta
from temporalio import activity, workflow

@activity.defn
async def classify(request): return llm.classify(request)

@activity.defn
async def execute_action(classification): return api.execute(classification)

@activity.defn
async def send_email(user, result): email.send(user, result)

@workflow.defn
class RequestWorkflow:
    @workflow.run
    async def run(self, request):
        timeout = timedelta(minutes=1)  # every activity call needs a timeout
        classification = await workflow.execute_activity(
            classify, args=[request], start_to_close_timeout=timeout)
        if classification.needs_action:
            result = await workflow.execute_activity(
                execute_action, args=[classification], start_to_close_timeout=timeout)
            await workflow.execute_activity(
                send_email, args=[request.user, result], start_to_close_timeout=timeout)
            return result
Key Takeaways
- Don’t reinvent the wheel. If you’re writing checkpointing + retry + recovery, use a framework.
- Match framework to needs. Enterprise = Temporal. Fast iteration = Inngest. Low latency = Restate.
- Cloud-native options exist. Azure Durable Functions, AWS Step Functions for tight integration.
- Migration is incremental. Extract side effects as activities, wrap in workflows.
- The patterns are the same. Every framework provides: checkpointing, retries, human-in-loop, durability.
Next Steps
Framework selected. But how do you secure agents that execute code and call external APIs?
→ Part 7: Security & Sandboxing
Or jump to another topic:
- Part 8: Testing & Evaluation — How to test agents