
#distributed-systems

8 posts filed under this tag.

Backend 01

Catching a Retry Race with One Seed: Deterministic Simulation in Rust using turmoil

I had three flaky retry tests no one could reproduce on a laptop. I rewrote one in Rust on top of turmoil, Tokio's deterministic simulator, and a single 8-byte seed pinned the partition race byte-for-byte. These are my notes on what the seed actually controls, what leaks past it, and when deterministic simulation testing is worth the seam.

Jun 4
Distributed Systems 02

Actor-per-Entity vs Postgres Optimistic Locking: A Seat-Reservation Bake-off

I ran the same hot-key seat reservation workload two ways: Postgres with a version column and retries, and a single actor per seat. The actor design did not scale better — it moved the hard problem from concurrency control to routing and rebalance correctness, and that trade was the easier one to reason about under hot keys.

May 26
Backend 03

Durable Execution Isn't About Agents — It's About Replayable Backend Workflows

I came to durable-execution runtimes through the agent press, but the constraint that surprises everyone is determinism on replay. These are my notes from working a six-step payment reconciliation as a Restate workflow in TypeScript — the line that broke replay, the mental model that fixed it, and the trade-offs that come with the pattern.

May 19
Distributed Systems 04

AckWait Is a Contract: How a 30-Second Default Took Down My JetStream Consumer

I lost an evening to a NATS JetStream pull consumer that doubled its work in production. The cause was three lines of ConsumerConfig I never wrote. These are my notes on what AckWait actually counts, why MaxDeliver = -1 is the silent footgun, and the 70-line Go contract I now ship on every JetStream consumer.

May 12
Engineering 05

What `dbos ontime` Actually Asks: Building a Distributed Cron on etcd Leases in Go

A zero-click query for `dbos ontime` showed up in my Search Console last week. The reader is not asking about DBOS; they are asking how to run a job every minute, exactly once, across a fleet. From my own notes, an etcd lease, the `Election` type from etcd's `concurrency` package, and a fencing token cover that case in under 100 lines of Go, without pulling in a workflow engine.

May 7
Engineering 06

DBOS vs Temporal: When Postgres Is Enough for Durable Workflow Execution

DBOS reuses Postgres as the durability layer for workflows, while Temporal runs a dedicated cluster. The right choice depends on team size, workload shape, and where you want your operational budget to go. This is a practical rubric for picking between them.

Apr 26
Distributed Systems 07

Cell-Based Architecture Isn't Free: What Slack, DoorDash, and Roblox Actually Paid For It

Cell-based architecture contains blast radius, but it is not free. A look at what Slack, DoorDash, and Roblox actually paid for cells in production — and a checklist for the cheaper fault-isolation patterns most teams should reach for first.

Apr 23
Engineering 08

The Transactional Outbox Is Not a Queue

The transactional outbox is a ledger, not a queue. Treating it like one is what breaks Postgres under load. This post walks through the specific failure modes — autovacuum stalls, xmin horizon drift, replication slot lag, poison pills — and the operational rules that actually keep it working in production.

Apr 17