AI-native systems engineering
Notes on building systems that hold up in production.
Retries, state, budgets, coordination, observability, failure recovery — written down so you don't have to relearn them.
distributed systems · production agents · data platforms · observability
prod-agent-loop p95 stable
trace: tool call bounded
eval: failure path replayable
budget: cost ceiling enforced
10B+ events/day systems direction
150+ paying users from zero
70+ data pipelines shipped
1B+ market data points served
Search
Find the exact thread.
Jump straight into the essays, concepts, explainers, and deep dives behind the system you are thinking about.
agents
AI systems that survive real users
Multi-agent orchestration, bounded tool calls, memory, evals, cost guardrails, and recovery paths.
systems
Distributed systems with operational taste
Coordination, retries, idempotency, stream processing, failure detection, SLOs, and quiet incident prevention.
platforms
Data platforms that earn their keep
Kafka, Spark, ClickHouse, search, lakehouse migrations, and the product pressure behind architecture choices.
Blog
Recent posts
01 · Manager / Coordinator / Agent: A Multi-Agent Topology That Survives Real Workloads
A topology for agents that need to survive real work.
02 · Encoding the Senior Engineer in the Room: A Design Memo for Tacit Skills
Encoding the senior engineer's questions into reusable skills.
03 · Akka Cluster Sharding vs a Consistent-Hash Ring: What You Trade Away
What you buy when you stop hand-rolling the hash ring.
04 · The New Rules of Shipping in the AI Era
Creation got cheaper; evolution is still the hard part.
Series
Deep dives
Harness Engineering: The Compounding Stack
The operating layer around serious AI work.
Production Agents Deep Dive
Retries, state, cost, security, and evals after the demo.
AI Engineering Fundamentals
Start at tokens; end at evaluated agents.
Kafka Deep Dive
Partitions, offsets, transactions, and the trade-offs behind them.