#distributed-systems
8 posts filed under this tag.
Catching a Retry Race with One Seed: Deterministic Simulation in Rust using turmoil
I had three flaky retry tests no one could reproduce on a laptop. I rewrote one in Rust on top of turmoil, Tokio's deterministic simulator, and a single 8-byte seed pinned the partition race byte-for-byte. These are my notes on what the seed actually controls, what leaks past it, and when deterministic simulation testing is worth the seam.
Actor-per-Entity vs Postgres Optimistic Locking: A Seat-Reservation Bake-off
I ran the same hot-key seat reservation workload two ways: Postgres with a version column and retries, and a single actor per seat. The actor design did not scale better — it moved the hard problem from concurrency control to routing and rebalance correctness, and that trade was the easier one to reason about under hot keys.
Durable Execution Isn't About Agents — It's About Replayable Backend Workflows
I came to durable-execution runtimes through the agent press, but the constraint that surprises everyone is determinism on replay. These are my notes from working a six-step payment reconciliation as a Restate workflow in TypeScript — the line that broke replay, the mental model that fixed it, and the trade-offs that come with the pattern.
AckWait Is a Contract: How a 30-Second Default Took Down My JetStream Consumer
I lost an evening to a NATS JetStream pull consumer that doubled its work in production. The cause was three lines of ConsumerConfig I never wrote. These are my notes on what AckWait actually counts, why MaxDeliver = -1 is the silent footgun, and the 70-line Go contract I now ship on every JetStream consumer.
What `dbos ontime` Actually Asks: Building a Distributed Cron on etcd Leases in Go
A 0-click query for `dbos ontime` showed up in my Search Console last week. The reader is not asking about DBOS; they are asking how to run a job every minute, exactly once, across a fleet. From my own notes, an etcd lease, the `concurrency.Election` type from etcd's `concurrency` package, and a fencing token cover that case in under 100 lines of Go, without pulling in a workflow engine.
DBOS vs Temporal: When Postgres Is Enough for Durable Workflow Execution
DBOS reuses Postgres as the durability layer for workflows, while Temporal runs a dedicated cluster. The right choice depends on team size, workload shape, and where you want your operational budget to go. This is a practical rubric for picking between them.
Cell-Based Architecture Isn't Free: What Slack, DoorDash, and Roblox Actually Paid For It
Cell-based architecture contains blast radius, but it is not free. A look at what Slack, DoorDash, and Roblox actually paid for cells in production — and a checklist for the cheaper fault-isolation patterns most teams should reach for first.
The Transactional Outbox Is Not a Queue
The transactional outbox is a ledger, not a queue. Treating it like one is what breaks Postgres under load. This post walks through the specific failure modes — autovacuum stalls, xmin horizon drift, replication slot lag, poison pills — and the operational rules that actually keep it working in production.