
Production-agents Series

Durable Execution Frameworks - Don't Reinvent the Wheel

A deep dive into durable execution frameworks for agents: Temporal, Inngest, Restate, Azure Durable Functions, and AWS Step Functions. Covers when to use each and how they solve agent production challenges.

Prerequisite: This is Part 6 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.

Why This Matters

You’ve read about idempotency, checkpointing, retries, and state persistence. Here’s the secret: all of these problems have been solved before.

Durable execution frameworks handle:

  • State persistence automatically
  • Retries with exponential backoff built-in
  • Checkpointing at every step
  • Effectively-once semantics (at-least-once execution combined with idempotent side effects)
  • Long-running workflows (hours, days, weeks)
  • Human-in-the-loop interrupts

If you’re writing your own checkpointing + retry + recovery logic, you’re probably reinventing a durable execution framework.

The decision:

  • Build: Roll your own state management (1000s of lines, months of debugging)
  • Buy: Use a framework that’s been battle-tested in production (days to integrate)

What Durable Execution Means

A durable execution framework guarantees:

  1. State survives failures: If your process crashes, it resumes from the last step
  2. Effectively-once semantics: Steps may be retried, but completed steps are not re-run, and idempotent side effects take effect once
  3. Automatic retries: Transient failures handled without your code knowing
  4. Long-running support: Workflows can pause for days, waiting for human input

TRADITIONAL vs DURABLE EXECUTION

Traditional Code:
Start → Execute → [Crash] → Start over from scratch

Durable Execution:
Start → Execute → [Crash] → Resume from last checkpoint
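
To make the contrast concrete, here is a minimal sketch of checkpoint-and-replay in plain Python. The `Journal` class and `booking_workflow` function are hypothetical stand-ins, not any framework's API, and the journal lives in memory here where a real framework would persist it durably.

```python
from typing import Any, Callable, Dict


class Journal:
    """Hypothetical stand-in for a framework's persisted step log."""

    def __init__(self) -> None:
        self.results: Dict[str, Any] = {}
        self.executions: Dict[str, int] = {}

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # A completed step is replayed from the journal, not re-executed.
        if name in self.results:
            return self.results[name]
        self.executions[name] = self.executions.get(name, 0) + 1
        result = fn()
        self.results[name] = result  # checkpoint after the step succeeds
        return result


def booking_workflow(journal: Journal, crash_after_book: bool) -> str:
    booking = journal.step("book", lambda: "booking-1")
    if crash_after_book:
        raise RuntimeError("simulated process crash")
    return journal.step("charge", lambda: f"receipt-for-{booking}")


journal = Journal()  # a real framework keeps this in durable storage
try:
    booking_workflow(journal, crash_after_book=True)
except RuntimeError:
    pass  # the "process" died after the first step

# Re-running the same function resumes: "book" is replayed, "charge" runs.
final = booking_workflow(journal, crash_after_book=False)
```

The workflow function runs twice, yet each step executes only once: the second run reads the "book" result from the journal and continues from there.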

Framework Comparison

Framework | Best For | Deployment | Language Support
--- | --- | --- | ---
Temporal | Complex workflows, enterprise | Self-hosted or Temporal Cloud | Go, Java, Python, TypeScript
Inngest | Event-driven, serverless | Fully managed | TypeScript, Python
Restate | Low latency, lightweight | Self-hosted or Restate Cloud | TypeScript, Java, Kotlin, Go
Azure Durable Functions | Azure-native | Azure Functions | C#, JavaScript, Python, PowerShell
AWS Step Functions | AWS-native, visual | AWS native | JSON state machine, any via Lambda
GCP Cloud Workflows | GCP-native, YAML | GCP native | YAML config

Temporal

The gold standard for complex, long-running workflows. Used by Netflix, Snap, Stripe, and Datadog.

Core Concepts

from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def book_flight(flight_id: str, idempotency_key: str) -> BookingResult:
    """
    Activities are your side effects. Temporal handles retries automatically.
    Your idempotent implementation + Temporal's at-least-once = exactly-once semantics.
    """
    return await flight_api.book(flight_id, idempotency_key=idempotency_key)

@activity.defn
async def charge_payment(amount: float, idempotency_key: str) -> PaymentResult:
    return await payment_api.charge(amount, idempotency_key=idempotency_key)

@activity.defn
async def send_confirmation(email: str, booking: BookingResult) -> None:
    await email_service.send(email, template="booking_confirmation", data=booking)

@workflow.defn
class BookingWorkflow:
    @workflow.run
    async def run(self, request: BookingRequest) -> BookingResult:
        # Each step is automatically checkpointed
        # If we crash after book_flight, we resume at charge_payment

        booking = await workflow.execute_activity(
            book_flight,
            args=[request.flight_id, f"{request.user_id}:{request.booking_id}:book"],
            # Temporal requires a start-to-close or schedule-to-close timeout
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_interval=timedelta(seconds=30),
                backoff_coefficient=2.0,
                maximum_attempts=5,
                non_retryable_error_types=["ValidationError", "AuthError"]
            )
        )

        payment = await workflow.execute_activity(
            charge_payment,
            args=[booking.total_amount, f"{request.user_id}:{request.booking_id}:pay"],
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3)
        )

        # Send the confirmation last. Note that this await still blocks
        # until the activity completes or its timeout expires.
        await workflow.execute_activity(
            send_confirmation,
            args=[request.email, booking],
            start_to_close_timeout=timedelta(minutes=5)
        )

        return booking
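
The retry policy on `book_flight` implies a concrete schedule of waits. The sketch below works it out in plain Python, assuming no jitter and that `maximum_attempts` counts the first try; `retry_delays` is a hypothetical helper for illustration, not part of the Temporal SDK.

```python
from datetime import timedelta
from typing import List


def retry_delays(
    initial_interval: timedelta,
    backoff_coefficient: float,
    maximum_interval: timedelta,
    maximum_attempts: int,
) -> List[timedelta]:
    """Delay before each retry: initial * coefficient**n, capped at the maximum."""
    delays = []
    # maximum_attempts includes the first try, so at most attempts - 1 retries.
    for retry_index in range(maximum_attempts - 1):
        delay = initial_interval * (backoff_coefficient ** retry_index)
        delays.append(min(delay, maximum_interval))
    return delays


delays = retry_delays(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(seconds=30),
    maximum_attempts=5,
)
seconds = [d.total_seconds() for d in delays]
```

With the values from the workflow above, `seconds` comes out as `[1.0, 2.0, 4.0, 8.0]`: four retries after the first attempt, none of them reaching the 30-second cap.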

Human-in-the-Loop with Signals

@workflow.defn
class ApprovalWorkflow:
    def __init__(self):
        self.approved = None
        self.reason = None  # set by the approve signal

    @workflow.signal
    async def approve(self, approved: bool, reason: str):
        """Human sends this signal to approve/reject"""
        self.approved = approved
        self.reason = reason

    @workflow.run
    async def run(self, request: ApprovalRequest) -> ApprovalResult:
        # Execute some work
        analysis = await workflow.execute_activity(
            analyze_request,
            args=[request],
            start_to_close_timeout=timedelta(minutes=5),
        )

        if analysis.needs_approval:
            # Wait for human signal (can wait days)
            await workflow.wait_condition(lambda: self.approved is not None)

            if not self.approved:
                return ApprovalResult(status="rejected", reason=self.reason)

        # Continue with approved workflow
        return await workflow.execute_activity(
            complete_request,
            args=[request],
            start_to_close_timeout=timedelta(minutes=5),
        )
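
The signal-then-resume control flow can be modeled with the standard library alone. This is a toy sketch: `ApprovalGate` is a hypothetical class built on `asyncio.Event`, not Temporal's API, and unlike a real workflow its wait lives in memory and would not survive a process restart.

```python
import asyncio
from typing import Optional


class ApprovalGate:
    """Toy in-memory model of a workflow paused on a human decision."""

    def __init__(self) -> None:
        self.approved: Optional[bool] = None
        self.reason: Optional[str] = None
        self._decided = asyncio.Event()

    def approve(self, approved: bool, reason: str) -> None:
        # Plays the role of the signal handler: record the decision, wake the waiter.
        self.approved = approved
        self.reason = reason
        self._decided.set()

    async def run(self) -> str:
        # Plays the role of wait_condition: suspend until a decision arrives.
        await self._decided.wait()
        return "completed" if self.approved else f"rejected: {self.reason}"


async def main() -> str:
    gate = ApprovalGate()
    waiter = asyncio.create_task(gate.run())
    await asyncio.sleep(0)  # let the workflow reach its wait point
    gate.approve(False, "over budget")  # the "human" responds
    return await waiter


outcome = asyncio.run(main())
```

The durable version differs in one important way: the pending wait and the recorded decision are persisted by the framework, which is what lets the pause last for days.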

When to Use Temporal

  • Complex multi-step workflows
  • Long-running processes (hours to weeks)
  • Enterprise requirements (audit trails, compliance)
  • Need strong consistency guarantees
  • Already have infrastructure team capacity

Inngest

Developer-friendly, event-driven, fully managed. Great for teams that want to move fast.

Core Concepts

import { Inngest } from "inngest";

const inngest = new Inngest({ id: "my-agent" });

export const agentWorkflow = inngest.createFunction(
  {
    id: "process-customer-request",
    retries: 5, // Built-in retry
  },
  { event: "customer/request.received" },
  async ({ event, step }) => {
    // Each step is automatically checkpointed
    // If we crash after classify, we resume at route

    const classification = await step.run("classify", async () => {
      return await llm.classify(event.data.message);
    });

    const route = await step.run("route", async () => {
      if (classification.confidence < 0.7) {
        return "human";
      }
      return classification.intent;
    });

    if (route === "human") {
      // Wait up to 7 days for a human response; resolves to null on timeout
      const humanResponse = await step.waitForEvent("wait-for-human", {
        event: "human/responded",
        match: "data.request_id",
        timeout: "7d",
      });
      return humanResponse;
    }

    // Continue with automated handling
    const result = await step.run("execute", async () => {
      return await agent.execute(classification.intent, event.data);
    });

    return result;
  }
);

Token Budget with Inngest

export const budgetedAgent = inngest.createFunction(
  { id: "budgeted-agent" },
  { event: "agent/task.started" },
  async ({ event, step }) => {
    let tokensUsed = 0;
    const maxTokens = 100000;

    const plan = await step.run("plan", async () => {
      return await llm.plan(event.data.task);
    });
    // Update the counter from the step's return value. Completed steps are
    // memoized, so their callbacks do not run again when the function
    // re-executes, and mutations made inside them would be lost.
    tokensUsed += plan.usage.total_tokens;

    for (const action of plan.actions) {
      if (tokensUsed >= maxTokens) {
        // Graceful shutdown within budget
        return {
          status: "budget_exceeded",
          completed: plan.actions.indexOf(action),
        };
      }

      const result = await step.run(`execute-${action.id}`, async () => {
        return await agent.execute(action);
      });
      tokensUsed += result.usage?.total_tokens || 0;
    }

    return { status: "completed", tokensUsed };
  }
);
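
One subtlety when tracking a budget across steps: with step memoization, a completed step's callback is skipped on re-execution and only its saved return value comes back. The toy Python model below (the `Memo` class is hypothetical, not Inngest's API) shows why a counter is safest when derived from step return values instead of being mutated inside the callback.

```python
from typing import Any, Callable, Dict


class Memo:
    """Hypothetical model of step memoization across re-executions."""

    def __init__(self) -> None:
        self.saved: Dict[str, Any] = {}

    def run(self, step_id: str, fn: Callable[[], Any]) -> Any:
        if step_id not in self.saved:
            self.saved[step_id] = fn()  # callback runs only the first time
        return self.saved[step_id]


def execute_once(memo: Memo) -> Dict[str, int]:
    inside = 0   # mutated inside the step callback
    outside = 0  # derived from the step's return value

    def plan() -> Dict[str, int]:
        nonlocal inside
        inside += 120
        return {"total_tokens": 120}

    usage = memo.run("plan", plan)
    outside += usage["total_tokens"]
    return {"inside": inside, "outside": outside}


memo = Memo()
first = execute_once(memo)   # callback runs
second = execute_once(memo)  # re-execution: callback skipped
```

On the second execution `inside` stays at 0 while `outside` is still 120, because only the return value was saved.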

When to Use Inngest

  • Event-driven architectures
  • Serverless deployments
  • Want fully managed infrastructure
  • TypeScript/Node.js primary stack
  • Need fast iteration speed

Restate

Lightweight, low-latency, excellent developer experience. Written in Rust.

Core Concepts

import * as restate from "@restatedev/restate-sdk";

const agentService = restate.service({
  name: "agent",
  handlers: {
    processRequest: async (ctx: restate.Context, request: AgentRequest) => {
      // Each ctx.run() is automatically checkpointed
      // Idempotency is built-in via deterministic execution

      const classification = await ctx.run("classify", async () => {
        return await llm.classify(request.message);
      });

      if (classification.needs_approval) {
        // Create an awakeable (durable promise) for the human approval;
        // ctx.awakeable() returns { id, promise } synchronously
        const approval = ctx.awakeable<ApprovalResult>();

        // This ID can be used by external system to complete the awakeable
        console.log(`Awaiting approval: ${approval.id}`);

        const result = await approval.promise;
        if (!result.approved) {
          return { status: "rejected" };
        }
      }

      const result = await ctx.run("execute", async () => {
        return await agent.execute(classification.intent, request);
      });

      return result;
    },
  },
});

// Complete awakeable from external system (e.g., webhook)
// Requires: import * as clients from "@restatedev/restate-sdk-clients";
async function approveRequest(awakeableId: string, approved: boolean) {
  const ingress = clients.connect({ url: "http://localhost:8080" });
  await ingress.resolveAwakeable(awakeableId, { approved });
}

Virtual Objects for Stateful Agents

const agentSession = restate.object({
  name: "agent-session",
  handlers: {
    // State is automatically persisted per session ID
    addMessage: async (ctx: restate.ObjectContext, message: Message) => {
      const history = (await ctx.get<Message[]>("history")) || [];
      history.push(message);
      ctx.set("history", history);

      const response = await ctx.run("generate", async () => {
        return await llm.chat(history);
      });

      history.push({ role: "assistant", content: response });
      ctx.set("history", history);

      return response;
    },

    getHistory: async (ctx: restate.ObjectContext) => {
      return (await ctx.get<Message[]>("history")) || [];
    },
  },
});

When to Use Restate

  • Low-latency requirements
  • Lightweight deployment (single binary)
  • Strong consistency without heavy infrastructure
  • TypeScript or JVM stack
  • Want to self-host easily

Azure Durable Functions

Native Azure integration. Great for Azure-first shops.

Core Concepts

import azure.functions as func
import azure.durable_functions as df

# Python v2 programming model: triggers are registered on a DFApp instance
app = df.DFApp(http_auth_level=func.AuthLevel.FUNCTION)

# Orchestrator function
@app.orchestration_trigger(context_name="context")
def agent_orchestrator(context: df.DurableOrchestrationContext):
    request = context.get_input()

    # Each activity is checkpointed
    classification = yield context.call_activity(
        "classify_request",
        request
    )

    if classification["needs_approval"]:
        # Wait for external event (human approval)
        approval = yield context.wait_for_external_event("approval")

        if not approval["approved"]:
            return {"status": "rejected"}

    # Continue with execution
    result = yield context.call_activity(
        "execute_agent_action",
        classification
    )

    return result

# Activity functions
@app.activity_trigger(input_name="request")
def classify_request(request: dict) -> dict:
    return llm.classify(request["message"])

@app.activity_trigger(input_name="classification")
def execute_agent_action(classification: dict) -> dict:
    return agent.execute(classification["intent"])

Fan-out/Fan-in Pattern

# `app` is the df.DFApp instance for this function app (v2 programming model)
@app.orchestration_trigger(context_name="context")
def parallel_research(context: df.DurableOrchestrationContext):
    queries = context.get_input()["queries"]

    # Fan out: run research tasks in parallel
    tasks = []
    for query in queries:
        task = context.call_activity("research_query", query)
        tasks.append(task)

    # Fan in: wait for all to complete
    results = yield context.task_all(tasks)

    # Synthesize results
    synthesis = yield context.call_activity("synthesize_results", results)

    return synthesis
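
The same fan-out/fan-in shape can be tried locally with `asyncio.gather`. This is only an in-process analogy: the two coroutines below are hypothetical stand-ins for the `research_query` and `synthesize_results` activities, and nothing here is checkpointed.

```python
import asyncio
from typing import List


async def research_query(query: str) -> str:
    # Hypothetical stand-in for the "research_query" activity.
    await asyncio.sleep(0.01)
    return f"findings for {query}"


async def synthesize_results(results: List[str]) -> str:
    # Hypothetical stand-in for the "synthesize_results" activity.
    return " | ".join(results)


async def parallel_research(queries: List[str]) -> str:
    # Fan out: create one coroutine per query without awaiting each in turn.
    tasks = [research_query(q) for q in queries]
    # Fan in: gather waits for all of them and preserves input order.
    results = await asyncio.gather(*tasks)
    return await synthesize_results(list(results))


summary = asyncio.run(parallel_research(["pricing", "reviews"]))
```

As with `context.task_all`, results come back in the order the tasks were created, regardless of which one finishes first.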

When to Use Azure Durable Functions

  • Already on Azure
  • .NET or Python primary stack
  • Want serverless with durable state
  • Need tight Azure service integration
  • Cost optimization via consumption pricing

AWS Step Functions

Visual workflows, tight AWS integration. Best for teams who like declarative state machines.

Core Concepts

{
  "Comment": "Agent workflow with human approval",
  "StartAt": "ClassifyRequest",
  "States": {
    "ClassifyRequest": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:classify",
      "Next": "NeedsApproval"
    },
    "NeedsApproval": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.needsApproval",
          "BooleanEquals": true,
          "Next": "WaitForApproval"
        }
      ],
      "Default": "ExecuteAction"
    },
    "WaitForApproval": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
      "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123/approvals",
        "MessageBody": {
          "TaskToken.$": "$$.Task.Token",
          "Request.$": "$"
        }
      },
      "ResultPath": "$.approval",
      "Next": "CheckApproval"
    },
    "CheckApproval": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.approval.approved",
          "BooleanEquals": false,
          "Next": "Rejected"
        }
      ],
      "Default": "ExecuteAction"
    },
    "ExecuteAction": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:execute",
      "End": true
    },
    "Rejected": {
      "Type": "Fail",
      "Error": "ApprovalRejected",
      "Cause": "Human rejected the action"
    }
  }
}
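
The state machine pauses at `WaitForApproval` until something returns the task token. A minimal sketch of that approver side follows, assuming the approver reads the SQS message body produced above. `complete_approval` and `RecordingClient` are hypothetical names; in a real deployment the client would be `boto3.client("stepfunctions")`, whose `send_task_success` call takes the same `taskToken` and `output` arguments.

```python
import json
from typing import Any, Dict, List


def complete_approval(sfn_client: Any, message_body: str, approved: bool) -> Dict[str, bool]:
    """Resume a paused execution using the token from the queued message."""
    body = json.loads(message_body)
    output = {"approved": approved}
    # send_task_success hands `output` back to the waiting Task state.
    sfn_client.send_task_success(
        taskToken=body["TaskToken"],
        output=json.dumps(output),
    )
    return output


class RecordingClient:
    """Hypothetical stub so the sketch runs without AWS credentials."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, str]] = []

    def send_task_success(self, **kwargs: str) -> None:
        self.calls.append(kwargs)


stub = RecordingClient()
message = json.dumps({"TaskToken": "tok-1", "Request": {"needsApproval": True}})
complete_approval(stub, message, approved=True)
```

Passing the client in as a parameter keeps the function easy to test; a rejection is reported the same way with `approved=False`, which the `CheckApproval` state then routes to `Rejected`.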

When to Use Step Functions

  • Already on AWS
  • Visual workflow design preferred
  • Need built-in AWS service integrations
  • Long-running workflows with wait states
  • Audit and compliance requirements

Decision Framework

If You Need… | Use
--- | ---
Complex multi-step workflows with strong guarantees | Temporal
Fast iteration, event-driven, serverless | Inngest
Low latency, lightweight, self-hosted | Restate
Azure-native, serverless, cost optimization | Azure Durable Functions
AWS-native, visual workflows, service integrations | AWS Step Functions
GCP-native, simple YAML workflows | GCP Cloud Workflows

The “Build vs Buy” Decision Tree

BUILD vs BUY DECISION

Are you writing retry logic with exponential backoff?
  → Consider a durable execution framework

Are you implementing checkpointing to survive crashes?
  → Consider a durable execution framework

Are you building idempotency key management?
  → Consider a durable execution framework

Are you handling human-in-the-loop with long waits?
  → Consider a durable execution framework

If yes to 2+ of these, you’re reinventing the wheel.
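
The rule of thumb above is simple enough to encode as a checklist. The sketch below is one way to do it; the question keys and the `should_adopt_framework` helper are hypothetical names chosen for illustration.

```python
from typing import Dict

QUESTIONS = (
    "retry_with_backoff",
    "crash_checkpointing",
    "idempotency_keys",
    "human_in_the_loop_waits",
)


def should_adopt_framework(answers: Dict[str, bool], threshold: int = 2) -> bool:
    """True when at least `threshold` of the four questions are answered yes."""
    yes_count = sum(1 for q in QUESTIONS if answers.get(q, False))
    return yes_count >= threshold


verdict = should_adopt_framework(
    {"retry_with_backoff": True, "crash_checkpointing": True}
)
```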

Migration Path

If you have existing agent code, here’s how to migrate:

1. Identify Side Effects

# Before: Side effects scattered in code
def process_request(request):
    classification = llm.classify(request)  # LLM call
    result = None  # stays None when no action is needed
    if classification.needs_action:
        result = api.execute(classification)  # External API
        email.send(request.user, result)      # Email
    return result

2. Extract as Activities

# After: Side effects are activities
from datetime import timedelta

from temporalio import activity, workflow

@activity.defn
async def classify(request): return llm.classify(request)

@activity.defn
async def execute_action(classification): return api.execute(classification)

@activity.defn
async def send_email(user, result): email.send(user, result)

@workflow.defn
class RequestWorkflow:
    @workflow.run
    async def run(self, request):
        timeout = timedelta(minutes=5)  # illustrative value; tune per activity
        classification = await workflow.execute_activity(
            classify, args=[request], start_to_close_timeout=timeout
        )
        result = None  # stays None when no action is needed
        if classification.needs_action:
            result = await workflow.execute_activity(
                execute_action, args=[classification], start_to_close_timeout=timeout
            )
            await workflow.execute_activity(
                send_email, args=[request.user, result], start_to_close_timeout=timeout
            )
        return result

Key Takeaways

  1. Don’t reinvent the wheel. If you’re writing checkpointing + retry + recovery, use a framework.

  2. Match framework to needs. Enterprise = Temporal. Fast iteration = Inngest. Low latency = Restate.

  3. Cloud-native options exist. Azure Durable Functions, AWS Step Functions for tight integration.

  4. Migration is incremental. Extract side effects as activities, wrap in workflows.

  5. The patterns are the same. Every framework provides: checkpointing, retries, human-in-loop, durability.


Next Steps

Framework selected. But how do you secure agents that execute code and call external APIs?

Part 7: Security & Sandboxing
