Durable Execution Frameworks - Don't Reinvent the Wheel | Intentional / Deliberate / Engineering

Prerequisite: This is Part 6 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.

Why This Matters

You’ve read about idempotency, checkpointing, retries, and state persistence. Here’s the secret: all of these problems have been solved before.

Durable execution frameworks handle:

State persistence automatically
Retries with exponential backoff built-in
Checkpointing at every step
Effectively-once side effects (at-least-once retries plus idempotent steps)
Long-running workflows (hours, days, weeks)
Human-in-the-loop interrupts

If you’re writing your own checkpointing + retry + recovery logic, you’re probably reinventing a durable execution framework.

The decision:

Build: Roll your own state management (1000s of lines, months of debugging)
Buy: Use a framework that’s been battle-tested in production (days to integrate)

What Durable Execution Means

A durable execution framework guarantees:

State survives failures: If your process crashes, it resumes from the last step
Effectively-once semantics: Steps are retried at-least-once; paired with idempotent side effects, each effect is applied once
Automatic retries: Transient failures handled without your code knowing
Long-running support: Workflows can pause for days, waiting for human input
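
The effectively-once guarantee depends on the downstream call being idempotent. The sketch below shows one way that can work, assuming a hypothetical PaymentService that deduplicates on an idempotency key; the class, its fields, and the key format are illustrative and not part of any framework.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    """Hypothetical downstream service that deduplicates by idempotency key."""
    processed: dict = field(default_factory=dict)
    charges_made: int = 0

    def charge(self, amount: float, idempotency_key: str) -> str:
        # A repeated key returns the stored receipt instead of charging again.
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]
        self.charges_made += 1
        receipt = f"receipt-{self.charges_made}"
        self.processed[idempotency_key] = receipt
        return receipt


service = PaymentService()
# A framework with at-least-once delivery may run the same step twice.
first = service.charge(49.99, idempotency_key="user-1:booking-7:pay")
retry = service.charge(49.99, idempotency_key="user-1:booking-7:pay")
```

Both calls return the same receipt and only one charge is recorded, which is why a retried step is safe.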

TRADITIONAL vs DURABLE EXECUTION

Traditional Code:
Start → Execute → [Crash] → Start over from scratch

Durable Execution:
Start → Execute → [Crash] → Resume from last checkpoint
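
The resume behavior can be approximated with a journal of completed steps. This is a minimal sketch of the idea, not how any particular framework stores state; the Journal class, the JSON file, and the step names are assumptions made for illustration.

```python
import json
import os
import tempfile


class Journal:
    """Hypothetical file-backed journal: one recorded result per step name."""

    def __init__(self, path):
        self.path = path
        self.results = {}
        if os.path.exists(path):
            with open(path) as f:
                self.results = json.load(f)

    def step(self, name, fn):
        # Replay: a step that already completed returns its recorded result.
        if name in self.results:
            return self.results[name]
        result = fn()
        self.results[name] = result
        with open(self.path, "w") as f:
            json.dump(self.results, f)  # checkpoint after every step
        return result


calls = []  # records which step bodies actually executed


def book():
    calls.append("book")
    return "booking-1"


def run_workflow(journal, crash_after_book=False):
    booking = journal.step("book", book)
    if crash_after_book:
        raise RuntimeError("simulated crash")

    def charge():
        calls.append("charge")
        return f"paid:{booking}"

    return journal.step("charge", charge)


path = os.path.join(tempfile.mkdtemp(), "journal.json")
try:
    run_workflow(Journal(path), crash_after_book=True)
except RuntimeError:
    pass  # the process "died" after the first step was checkpointed

# A fresh Journal reloads the file and resumes: "book" is not re-executed.
result = run_workflow(Journal(path))
```

The second run skips the booking step because its result is already in the journal, so the body of book() executes only once across both runs.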

Framework Comparison

Framework	Best For	Deployment	Language Support
Temporal	Complex workflows, enterprise	Self-hosted or Temporal Cloud	Go, Java, Python, TypeScript
Inngest	Event-driven, serverless	Fully managed	TypeScript, Python
Restate	Low latency, lightweight	Self-hosted or Restate Cloud	TypeScript, Java, Kotlin, Go
Azure Durable Functions	Azure-native	Azure Functions	C#, JavaScript, Python, PowerShell
AWS Step Functions	AWS-native, visual	AWS native	JSON state machine, any via Lambda
GCP Cloud Workflows	GCP-native, YAML	GCP native	YAML config

Temporal

The gold standard for complex, long-running workflows. Used by Netflix, Snap, Stripe, and Datadog.

Core Concepts

from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy

# BookingRequest, BookingResult, PaymentResult, flight_api, payment_api, and
# email_service are application-specific and assumed to be defined elsewhere.

@activity.defn
async def book_flight(flight_id: str, idempotency_key: str) -> BookingResult:
    """
    Activities are your side effects. Temporal handles retries automatically.
    Your idempotent implementation + Temporal's at-least-once execution
    = effectively-once side effects.
    """
    return await flight_api.book(flight_id, idempotency_key=idempotency_key)

@activity.defn
async def charge_payment(amount: float, idempotency_key: str) -> PaymentResult:
    return await payment_api.charge(amount, idempotency_key=idempotency_key)

@activity.defn
async def send_confirmation(email: str, booking: BookingResult) -> None:
    await email_service.send(email, template="booking_confirmation", data=booking)

@workflow.defn
class BookingWorkflow:
    @workflow.run
    async def run(self, request: BookingRequest) -> BookingResult:
        # Each completed activity is recorded in the workflow history.
        # If we crash after book_flight, we resume at charge_payment.

        booking = await workflow.execute_activity(
            book_flight,
            args=[request.flight_id, f"{request.user_id}:{request.booking_id}:book"],
            # Every activity call needs a start-to-close or schedule-to-close timeout
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_interval=timedelta(seconds=30),
                backoff_coefficient=2.0,
                maximum_attempts=5,
                non_retryable_error_types=["ValidationError", "AuthError"]
            )
        )

        payment = await workflow.execute_activity(
            charge_payment,
            args=[booking.total_amount, f"{request.user_id}:{request.booking_id}:pay"],
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3)
        )

        # Send the confirmation last; this call still awaits the activity's result
        await workflow.execute_activity(
            send_confirmation,
            args=[request.email, booking],
            start_to_close_timeout=timedelta(minutes=5)
        )

        return booking
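
To make the retry policy on book_flight concrete, the waits it implies can be worked out by hand. The helper below is a sketch of that arithmetic only (initial interval times the coefficient raised to the retry number, capped at the maximum); retry_delays is a hypothetical name, and it ignores anything the server may add on top.

```python
from datetime import timedelta


def retry_delays(initial, coefficient, maximum, max_attempts):
    """Waits between attempts: initial * coefficient**n, capped at maximum."""
    delays = []
    for retry in range(max_attempts - 1):  # N attempts means N-1 waits
        delay = initial * (coefficient ** retry)
        delays.append(min(delay, maximum))
    return delays


# Same numbers as the book_flight policy: 1s initial, x2 backoff, 30s cap, 5 attempts
delays = retry_delays(timedelta(seconds=1), 2.0, timedelta(seconds=30), 5)
# With more attempts the 30-second cap starts to apply
capped = retry_delays(timedelta(seconds=1), 2.0, timedelta(seconds=30), 8)
```

With five attempts the waits are 1, 2, 4, and 8 seconds, so the cap is never reached; it only matters from the sixth retry onward.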

Human-in-the-Loop with Signals

@workflow.defn
class ApprovalWorkflow:
    def __init__(self) -> None:
        # None means "no decision yet"
        self.approved = None
        self.reason = None

    @workflow.signal
    async def approve(self, approved: bool, reason: str):
        """Human sends this signal to approve/reject"""
        self.approved = approved
        self.reason = reason

    @workflow.run
    async def run(self, request: ApprovalRequest) -> ApprovalResult:
        # Execute some work
        analysis = await workflow.execute_activity(
            analyze_request,
            args=[request],
            start_to_close_timeout=timedelta(minutes=5),
        )

        if analysis.needs_approval:
            # Wait for human signal (can wait days)
            await workflow.wait_condition(lambda: self.approved is not None)

            if not self.approved:
                return ApprovalResult(status="rejected", reason=self.reason)

        # Continue with approved workflow
        return await workflow.execute_activity(
            complete_request,
            args=[request],
            start_to_close_timeout=timedelta(minutes=5),
        )
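
The signal-then-wait shape can be tried out in plain asyncio before touching a framework. The sketch below is an in-process analogy only: nothing here is durable, and ApprovalGate and its methods are hypothetical names rather than Temporal APIs.

```python
import asyncio


class ApprovalGate:
    """In-memory stand-in for a workflow waiting on a human decision."""

    def __init__(self):
        self.approved = None
        self.reason = None
        self._decided = asyncio.Event()

    def approve(self, approved, reason):
        # Plays the role of the signal handler: record the decision and wake waiters.
        self.approved = approved
        self.reason = reason
        self._decided.set()

    async def wait_for_decision(self, timeout=None):
        # Plays the role of wait_condition: suspend until a decision arrives.
        await asyncio.wait_for(self._decided.wait(), timeout)
        return self.approved, self.reason


async def main():
    gate = ApprovalGate()
    # Simulate a human responding shortly after the wait begins.
    asyncio.get_running_loop().call_later(0.01, gate.approve, False, "budget too high")
    return await gate.wait_for_decision(timeout=1.0)


decision = asyncio.run(main())
```

The difference a durable framework adds is that the waiting state survives process restarts, so the wait can span days rather than the lifetime of one event loop.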

When to Use Temporal

Complex multi-step workflows
Long-running processes (hours to weeks)
Enterprise requirements (audit trails, compliance)
Need strong consistency guarantees
Already have infrastructure team capacity

Inngest

Developer-friendly, event-driven, fully managed. Great for teams that want to move fast.

Core Concepts

import { Inngest } from "inngest";

const inngest = new Inngest({ id: "my-agent" });

export const agentWorkflow = inngest.createFunction(
  {
    id: "process-customer-request",
    retries: 5, // Built-in retry
  },
  { event: "customer/request.received" },
  async ({ event, step }) => {
    // Each step is automatically checkpointed
    // If we crash after classify, we resume at route

    const classification = await step.run("classify", async () => {
      return await llm.classify(event.data.message);
    });

    const route = await step.run("route", async () => {
      if (classification.confidence < 0.7) {
        return "human";
      }
      return classification.intent;
    });

    if (route === "human") {
      // Wait up to 7 days for a human response; resolves to null on timeout
      const humanResponse = await step.waitForEvent("wait-for-human", {
        event: "human/responded",
        match: "data.request_id",
        timeout: "7d",
      });
      return humanResponse ?? { status: "timed_out" };
    }

    // Continue with automated handling
    const result = await step.run("execute", async () => {
      return await agent.execute(classification.intent, event.data);
    });

    return result;
  }
);

Token Budget with Inngest

export const budgetedAgent = inngest.createFunction(
  { id: "budgeted-agent" },
  { event: "agent/task.started" },
  async ({ event, step }) => {
    const maxTokens = 100000;

    const plan = await step.run("plan", async () => {
      return await llm.plan(event.data.task);
    });

    // Accumulate from step return values, not inside step callbacks:
    // completed steps are memoized and their callbacks do not run again on
    // re-execution, so a counter mutated inside a callback would be lost.
    let tokensUsed = plan.usage.total_tokens;

    for (const [index, action] of plan.actions.entries()) {
      if (tokensUsed >= maxTokens) {
        // Graceful shutdown once the budget is spent
        return {
          status: "budget_exceeded",
          completed: index,
        };
      }

      const result = await step.run(`execute-${action.id}`, async () => {
        return await agent.execute(action);
      });
      tokensUsed += result.usage?.total_tokens ?? 0;
    }

    return { status: "completed", tokensUsed };
  }
);
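
Budget counters interact with step memoization: when a function is re-executed, completed step bodies are skipped and only their stored results come back, so a running total has to be rebuilt from those results. The Python sketch below models that with a hypothetical Steps class and made-up token costs; it illustrates the pattern and is not Inngest's implementation.

```python
class Steps:
    """Hypothetical memoizing step runner: a step body runs once, then replays."""

    def __init__(self, memo):
        self.memo = memo
        self.executed = []  # step ids whose bodies ran in this invocation

    def run(self, step_id, fn):
        if step_id not in self.memo:
            self.memo[step_id] = fn()
            self.executed.append(step_id)
        return self.memo[step_id]


def budgeted_agent(steps, actions, max_tokens):
    tokens_used = 0
    for index, action in enumerate(actions):
        if tokens_used >= max_tokens:
            return {"status": "budget_exceeded", "completed": index,
                    "tokens_used": tokens_used}
        # The step returns its usage; the total is updated outside the step body.
        result = steps.run(f"execute-{action['id']}",
                           lambda a=action: {"tokens": a["cost"]})
        tokens_used += result["tokens"]
    return {"status": "completed", "completed": len(actions),
            "tokens_used": tokens_used}


actions = [{"id": "a", "cost": 60}, {"id": "b", "cost": 50}, {"id": "c", "cost": 40}]
memo = {}
first_run = budgeted_agent(Steps(memo), actions, max_tokens=100)

# Re-execution with the same memo: no step body runs, yet the total is identical.
replay_steps = Steps(memo)
replayed = budgeted_agent(replay_steps, actions, max_tokens=100)
```

Because the total is derived from step results, the replay reaches the same budget decision without re-running any action.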

When to Use Inngest

Event-driven architectures
Serverless deployments
Want fully managed infrastructure
TypeScript/Node.js primary stack
Need fast iteration speed

Restate

Lightweight, low-latency, excellent developer experience. Written in Rust.

Core Concepts

import * as restate from "@restatedev/restate-sdk";

const agentService = restate.service({
  name: "agent",
  handlers: {
    processRequest: async (ctx: restate.Context, request: AgentRequest) => {
      // Each ctx.run() is automatically checkpointed
      // Idempotency is built-in via deterministic execution

      const classification = await ctx.run("classify", async () => {
        return await llm.classify(request.message);
      });

      if (classification.needs_approval) {
        // Await human approval (durable promise)
        const approval = await ctx.awakeable<ApprovalResult>();

        // This ID can be used by external system to complete the awakeable
        console.log(`Awaiting approval: ${approval.id}`);

        const result = await approval.promise;
        if (!result.approved) {
          return { status: "rejected" };
        }
      }

      const result = await ctx.run("execute", async () => {
        return await agent.execute(classification.intent, request);
      });

      return result;
    },
  },
});

// Complete the awakeable from an external system (e.g., a webhook handler).
// Requires: import * as clients from "@restatedev/restate-sdk-clients";
async function approveRequest(awakeableId: string, approved: boolean) {
  const ingress = clients.connect({ url: "http://localhost:8080" });
  await ingress.resolveAwakeable(awakeableId, { approved });
}

Virtual Objects for Stateful Agents

const agentSession = restate.object({
  name: "agent-session",
  handlers: {
    // State is automatically persisted per session ID
    addMessage: async (ctx: restate.ObjectContext, message: Message) => {
      const history = (await ctx.get<Message[]>("history")) || [];
      history.push(message);
      ctx.set("history", history);

      const response = await ctx.run("generate", async () => {
        return await llm.chat(history);
      });

      history.push({ role: "assistant", content: response });
      ctx.set("history", history);

      return response;
    },

    getHistory: async (ctx: restate.ObjectContext) => {
      return (await ctx.get<Message[]>("history")) || [];
    },
  },
});

When to Use Restate

Low-latency requirements
Lightweight deployment (single binary)
Strong consistency without heavy infrastructure
TypeScript or JVM stack
Want to self-host easily

Azure Durable Functions

Native Azure integration. Great for Azure-first shops.

Core Concepts

import azure.functions as func
import azure.durable_functions as df

# Python v2 programming model: triggers are registered on a DFApp instance.
# `llm` and `agent` are application-specific and assumed to be defined elsewhere.
app = df.DFApp(http_auth_level=func.AuthLevel.ANONYMOUS)

# Orchestrator function
@app.orchestration_trigger(context_name="context")
def agent_orchestrator(context: df.DurableOrchestrationContext):
    request = context.get_input()

    # Each activity is checkpointed
    classification = yield context.call_activity(
        "classify_request",
        request
    )

    if classification["needs_approval"]:
        # Wait for external event (human approval)
        approval = yield context.wait_for_external_event("approval")

        if not approval["approved"]:
            return {"status": "rejected"}

    # Continue with execution
    result = yield context.call_activity(
        "execute_agent_action",
        classification
    )

    return result

# Activity functions
@app.activity_trigger(input_name="request")
def classify_request(request: dict) -> dict:
    return llm.classify(request["message"])

@app.activity_trigger(input_name="classification")
def execute_agent_action(classification: dict) -> dict:
    return agent.execute(classification["intent"])

Fan-out/Fan-in Pattern

# `app` is a df.DFApp instance (Python v2 programming model)
@app.orchestration_trigger(context_name="context")
def parallel_research(context: df.DurableOrchestrationContext):
    queries = context.get_input()["queries"]

    # Fan out: run research tasks in parallel
    tasks = []
    for query in queries:
        task = context.call_activity("research_query", query)
        tasks.append(task)

    # Fan in: wait for all to complete
    results = yield context.task_all(tasks)

    # Synthesize results
    synthesis = yield context.call_activity("synthesize_results", results)

    return synthesis
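
The same fan-out/fan-in shape exists in plain asyncio, which can help when reasoning about the orchestrator. This sketch uses asyncio.gather in place of task_all and stub coroutines in place of real activities; it is not durable and the function bodies are placeholders.

```python
import asyncio


async def research_query(query):
    await asyncio.sleep(0.01)  # stands in for an LLM or search call
    return f"findings for {query}"


async def synthesize_results(results):
    return " | ".join(results)


async def parallel_research(queries):
    # Fan out: create one coroutine per query without awaiting each in turn
    tasks = [research_query(q) for q in queries]
    # Fan in: wait for all of them; results keep the input order
    results = await asyncio.gather(*tasks)
    return await synthesize_results(results)


summary = asyncio.run(parallel_research(["pricing", "latency"]))
```

In the durable version each branch is checkpointed independently, so a crash during the fan-in does not repeat research tasks that already finished.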

When to Use Azure Durable Functions

Already on Azure
.NET or Python primary stack
Want serverless with durable state
Need tight Azure service integration
Cost optimization via consumption pricing

AWS Step Functions

Visual workflows, tight AWS integration. Best for teams who like declarative state machines.

Core Concepts

{
  "Comment": "Agent workflow with human approval",
  "StartAt": "ClassifyRequest",
  "States": {
    "ClassifyRequest": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:classify",
      "Next": "NeedsApproval"
    },
    "NeedsApproval": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.needsApproval",
          "BooleanEquals": true,
          "Next": "WaitForApproval"
        }
      ],
      "Default": "ExecuteAction"
    },
    "WaitForApproval": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
      "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123/approvals",
        "MessageBody": {
          "TaskToken.$": "$$.Task.Token",
          "Request.$": "$"
        }
      },
      "Next": "CheckApproval"
    },
    "CheckApproval": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.approved",
          "BooleanEquals": false,
          "Next": "Rejected"
        }
      ],
      "Default": "ExecuteAction"
    },
    "ExecuteAction": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:execute",
      "End": true
    },
    "Rejected": {
      "Type": "Fail",
      "Error": "ApprovalRejected",
      "Cause": "Human rejected the action"
    }
  }
}
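
The WaitForApproval state pauses until something returns the task token. A minimal sketch of that callback follows, assuming the approver's service reads the SQS message body produced above. complete_approval and FakeStepFunctions are hypothetical names; in production the client would be boto3.client("stepfunctions"), whose send_task_success call takes the token and a JSON string as output.

```python
import json


def complete_approval(sfn_client, message_body, approved):
    """Resume the waiting execution using the token from the SQS message."""
    body = json.loads(message_body)
    sfn_client.send_task_success(
        taskToken=body["TaskToken"],
        # The output becomes the state's result, read by CheckApproval as $.approved
        output=json.dumps({"approved": approved}),
    )


class FakeStepFunctions:
    """Test double that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def send_task_success(self, taskToken, output):
        self.calls.append((taskToken, output))


fake = FakeStepFunctions()
message = json.dumps({"TaskToken": "token-abc", "Request": {"id": 1}})
complete_approval(fake, message, approved=True)
```

Passing the client in as a parameter keeps the function testable without AWS credentials.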

When to Use Step Functions

Already on AWS
Visual workflow design preferred
Need built-in AWS service integrations
Long-running workflows with wait states
Audit and compliance requirements

Decision Framework

If You Need…	Use
Complex multi-step workflows with strong guarantees	Temporal
Fast iteration, event-driven, serverless	Inngest
Low latency, lightweight, self-hosted	Restate
Azure-native, serverless, cost optimization	Azure Durable Functions
AWS-native, visual workflows, service integrations	AWS Step Functions
GCP-native, simple YAML workflows	GCP Cloud Workflows

The “Build vs Buy” Decision Tree

BUILD vs BUY DECISION

Are you writing retry logic with exponential backoff?
→ Consider a durable execution framework

Are you implementing checkpointing to survive crashes?
→ Consider a durable execution framework

Are you building idempotency key management?
→ Consider a durable execution framework

Are you handling human-in-the-loop with long waits?
→ Consider a durable execution framework

If yes to 2+ of these, you’re reinventing the wheel.

Migration Path

If you have existing agent code, here’s how to migrate:

1. Identify Side Effects

# Before: Side effects scattered in code
def process_request(request):
    result = None  # stays None when no action is needed
    classification = llm.classify(request)  # LLM call
    if classification.needs_action:
        result = api.execute(classification)  # External API
        email.send(request.user, result)      # Email
    return result

2. Extract as Activities

# After: Side effects are activities
from datetime import timedelta

from temporalio import activity, workflow

@activity.defn
async def classify(request): return llm.classify(request)

@activity.defn
async def execute_action(classification): return api.execute(classification)

@activity.defn
async def send_email(user, result): email.send(user, result)

@workflow.defn
class RequestWorkflow:
    @workflow.run
    async def run(self, request):
        timeout = timedelta(minutes=1)  # every activity call needs a timeout
        result = None  # stays None when no action is needed
        classification = await workflow.execute_activity(
            classify, args=[request], start_to_close_timeout=timeout)
        if classification.needs_action:
            result = await workflow.execute_activity(
                execute_action, args=[classification], start_to_close_timeout=timeout)
            await workflow.execute_activity(
                send_email, args=[request.user, result], start_to_close_timeout=timeout)
        return result

Key Takeaways

Don’t reinvent the wheel. If you’re writing checkpointing + retry + recovery, use a framework.
Match framework to needs. Enterprise = Temporal. Fast iteration = Inngest. Low latency = Restate.
Cloud-native options exist. Azure Durable Functions, AWS Step Functions for tight integration.
Migration is incremental. Extract side effects as activities, wrap in workflows.
The patterns are the same. Every framework provides: checkpointing, retries, human-in-loop, durability.

Next Steps

Framework selected. But how do you secure agents that execute code and call external APIs?

→ Part 7: Security & Sandboxing

Or jump to another topic:

Part 8: Testing & Evaluation — How to test agents
