Ketan Khairnar

Systems writing for AI agents, data platforms, distributed coordination, and the production details tutorials skip.

AI-native systems engineering

Notes on building systems that hold up in production.

Retries, state, budgets, coordination, observability, failure recovery — written down so you don't have to relearn them.

distributed systems, production agents, data platforms, observability
10B+ events/day systems direction
150+ paying users from zero
70+ data pipelines shipped
1B+ market data points served

Search

Find the exact thread.

Jump straight into the essays, concepts, explainers, and deep dives behind the system you are thinking about.

agents

AI systems that survive real users

Multi-agent orchestration, bounded tool calls, memory, evals, cost guardrails, and recovery paths.

systems

Distributed systems with operational taste

Coordination, retries, idempotency, stream processing, failure detection, SLOs, and quiet incident prevention.

platforms

Data platforms that earn their keep

Kafka, Spark, ClickHouse, search, lakehouse migrations, and the product pressure behind architecture choices.

Latest field note

Apr 22, 2026
Manager / Coordinator / Agent: A Multi-Agent Topology That Survives Real Workloads

Most multi-agent demos collapse on contact with non-trivial tasks. The Manager / Coordinator / Agent topology is what I keep reaching for when I want a multi-agent system that does not deadlock, drift, or hallucinate its way into nonsense. Here is what each layer does, and where the pattern breaks.

A topology for agents that need to survive real work.

AI, agents, architecture, multi-agent