Backend · 7 min read

Catching a Retry Race with One Seed: Deterministic Simulation in Rust using turmoil

I had three flaky retry tests no one could reproduce on a laptop. I rewrote one in Rust on top of turmoil, Tokio's deterministic simulator, and a single 8-byte seed pinned the partition race byte-for-byte. These are my notes on what the seed actually controls, what leaks past it, and when deterministic simulation testing is worth the seam.

Most flaky tests teach me nothing. The output line says "expected 1, got 2" once in 4,000 runs, the CI screenshot scrolls off the dashboard, and the next person to touch the file rebases past it. I have notes on three of these from the last year — all retries-under-partition, all Rust, all unreproducible on my laptop.

That changed when I sat down with turmoil. It is Tokio's own deterministic simulation framework: every host runs on one thread, time and the network are mocked, and the whole simulation is driven by a seedable RNG. The promise is the part that makes this category of bug interesting. A single 8-byte seed replays the same scheduling decisions, the same partition timings, the same TCP byte order. The flake stops being a story and becomes a handle.

Deterministic simulation testing has stopped being FoundationDB lore. A QCon London 2026 talk walked through a state-machine DST in Rust, WarpStream wrote up running their entire SaaS through Antithesis in March, and the S2 team published a piece on combining turmoil with libc shims for Rust storage code. This post is what I learned wiring up a small example, plus the leaks I had to plug before it actually held water.

What turmoil simulates and what it doesn't

A turmoil test builds a Sim, registers some "hosts" (each is an async closure), registers a "client" (the test driver), and calls sim.run(). Hosts get virtual TCP/UDP through turmoil::net, virtual hostnames, and tokio time that is also mocked. The simulator has a step loop. At each step the network delivers any packets that are due, any timers that are due fire, and any task that is ready makes progress. With a fixed seed, the order is fully determined. The current crate is turmoil = "0.7", which adds a partial filesystem shim behind unstable-fs for crash-consistency tests.
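To make the step loop concrete, here is a toy model of it. This is not turmoil's implementation: `toy_run`, `SplitMix64`, and the 1 ms tick are assumptions made up for illustration, and the only thing simulated is message delivery. It shows the shape of the idea: a seeded RNG picks latencies, a virtual clock advances in steps, and whatever is due at each step is delivered in a fixed order.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// SplitMix64: a tiny seedable PRNG, used only so the sketch needs no crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Toy simulation (hypothetical, not turmoil's internals): every message gets
/// a latency drawn from the seeded RNG, then a step loop advances a virtual
/// clock and delivers whatever is due. Returns (virtual_ms, message_id) pairs.
fn toy_run(seed: u64, messages: u32) -> Vec<(u64, u32)> {
    let mut rng = SplitMix64(seed);

    // Min-heap keyed on (due time, message id); the id breaks ties the same
    // way on every run.
    let mut in_flight = BinaryHeap::new();
    for id in 0..messages {
        let latency_ms = 1 + rng.next_u64() % 100; // 1..=100 ms, set by the seed
        in_flight.push(Reverse((latency_ms, id)));
    }

    const STEP_MS: u64 = 1; // assumed tick length
    let mut now_ms = 0u64;
    let mut trace = Vec::new();
    while !in_flight.is_empty() {
        now_ms += STEP_MS;
        // Deliver every message whose due time has been reached.
        while let Some(&Reverse((due, id))) = in_flight.peek() {
            if due > now_ms {
                break;
            }
            in_flight.pop();
            trace.push((now_ms, id));
        }
    }
    trace
}
```

Calling `toy_run(7, 5)` twice returns the same trace both times, because nothing in the loop reads a real clock or an unseeded source of randomness. That property is what the real simulator provides at a much larger scale.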

What turmoil does not control is anything that escapes its surface: real syscalls, threads spawned outside the runtime, libraries that hold their own clocks, and anything that iterates a std::collections::HashMap. HashMap's default hasher, RandomState, takes its keys from operating-system randomness rather than from the simulator's RNG, so iteration order changes between runs even when the rest of the code is deterministic. I lost an afternoon to that one.
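Two ways to take the hash map out of the picture, sketched with the standard library only. The names `DetHashMap`, `applied_ids_det`, and `applied_ids_sorted` are made up for this example. One caveat on the first option: the standard library does not promise that DefaultHasher's algorithm stays the same across Rust releases, so the order is stable for a given toolchain, not forever.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::BuildHasherDefault;

/// A HashMap whose hasher starts from fixed keys instead of random ones.
/// (Hypothetical alias for this sketch.)
type DetHashMap<K, V> = HashMap<K, V, BuildHasherDefault<DefaultHasher>>;

/// Option 1: fixed-key hasher. The iteration order is still arbitrary, but it
/// is the same arbitrary order on every run of the same binary.
fn applied_ids_det(ids: &[u32]) -> Vec<u32> {
    let mut map: DetHashMap<u32, ()> = DetHashMap::default();
    for &id in ids {
        map.insert(id, ());
    }
    map.keys().copied().collect()
}

/// Option 2: BTreeMap. Iteration always follows key order.
fn applied_ids_sorted(ids: &[u32]) -> Vec<u32> {
    let map: BTreeMap<u32, ()> = ids.iter().map(|&id| (id, ())).collect();
    map.keys().copied().collect()
}
```

I would reach for the BTreeMap first, since its order is part of its contract. The fixed-key hasher is the fallback when a hash map is needed for performance and the order only has to be repeatable.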

A retry race that survives 10,000 reruns

The bug shape I want a test to find is small. A client sends an "apply 42" request to a server. The server applies it, increments a counter, and writes back "ok". The network drops the ack. The client retries. The server applies "42" again. The bug is the missing idempotency check. The flake is that it only triggers when the partition timing aligns with the application's retry timer.

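The shape, drawn out, looks like this:

[Diagram: client sends "apply 42"; server applies it and writes back "ok"; the network drops the ack; client retries; server applies "42" a second time.]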

Here is the smallest turmoil test I could write that pins it down.

```rust
// Cargo.toml
// [dependencies]
// tokio = { version = "1", features = ["full"] }
// turmoil = "0.7"
// rand = "0.9"
//
// Run with: cargo test --release retry_race -- --nocapture

use std::sync::{Arc, atomic::{AtomicU32, Ordering}};
use std::time::Duration;

use rand::SeedableRng;
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use turmoil::{net::{TcpListener, TcpStream}, Builder};

/// Runs one simulation under `seed` and returns how many times the server
/// applied the request.
fn run_one(seed: u64) -> u32 {
    let applied = Arc::new(AtomicU32::new(0));
    let mut sim = Builder::new()
        .simulation_duration(Duration::from_secs(30))
        .build_with_rng(Box::new(rand::rngs::StdRng::seed_from_u64(seed)));

    let counter = applied.clone();
    sim.host("server", move || {
        let counter = counter.clone();
        async move {
            let listener = TcpListener::bind("0.0.0.0:80").await?;
            loop {
                let (mut s, _) = listener.accept().await?;
                let counter = counter.clone();
                // One task per connection: read a 4-byte id, apply, ack.
                tokio::spawn(async move {
                    let mut id = [0u8; 4];
                    if s.read_exact(&mut id).await.is_ok() {
                        // BUG: no dedupe on id — every arrival applies.
                        counter.fetch_add(1, Ordering::Relaxed);
                        let _ = s.write_all(b"ok").await;
                    }
                });
            }
        }
    });

    sim.client("client", async {
        let id: u32 = 42;
        for attempt in 0..2 {
            let mut s = TcpStream::connect("server:80").await?;
            s.write_all(&id.to_le_bytes()).await?;
            // Cut the link right after the first write so the ack is lost.
            if attempt == 0 {
                turmoil::partition("client", "server");
            }
            // Wait up to 2 simulated seconds for the ack, then move on.
            let mut ack = [0u8; 2];
            let _ = tokio::time::timeout(
                Duration::from_secs(2),
                s.read_exact(&mut ack),
            ).await;
            // Heal the link so the retry can get through.
            turmoil::repair("client", "server");
        }
        Ok(())
    });

    sim.run().unwrap();
    applied.load(Ordering::Relaxed)
}

#[test]
fn retry_race() {
    // Sweep seeds; any seed where the request is applied twice fails the test.
    for seed in 0..32 {
        let n = run_one(seed);
        assert!(n <= 1, "seed={seed} applied={n}: not idempotent");
    }
}
```

Run it with cargo test --release retry_race -- --nocapture. The first failing seed prints applied=2 and the assertion stops the test. The fix is unsurprising — keep a HashSet<u32> of applied ids on the server and skip the increment when the id is already present. The point is not the fix. The point is that the failure is the same on the first run, the millionth run, on my laptop, on CI, on a borrowed M3, and stays the same as long as I hold the seed.
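For completeness, the dedupe can be sketched in isolation. This is a minimal, synchronous stand-in rather than the server from the test: `Server`, `seen`, and `apply` are hypothetical names, and in the async version the set would have to be shared across connection tasks (for example behind an `Arc<Mutex<..>>`), which is left out here.

```rust
use std::collections::HashSet;

/// Stand-in for the server's state. Hypothetical type for this sketch.
#[derive(Default)]
struct Server {
    /// Ids that have already been applied.
    seen: HashSet<u32>,
    /// How many requests actually did work.
    applied: u32,
}

impl Server {
    /// Applies a request at most once per id. Returns true if this call did
    /// the work, false if the id was a retry of something already applied.
    fn apply(&mut self, id: u32) -> bool {
        // HashSet::insert returns false when the value was already present.
        if !self.seen.insert(id) {
            return false;
        }
        self.applied += 1;
        true
    }
}
```

A real server would probably still write "ok" back for a duplicate, so the retrying client learns that its request landed. Only the increment is skipped.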

The two non-obvious lines are build_with_rng(...), which makes every randomness source the simulator owns downstream of seed, and partition("client", "server") immediately after the write. The bytes for "42" may or may not have crossed to the server before that partition takes effect. With seed variance, both outcomes happen across the 32 runs. The invariant the test checks (applied <= 1) catches the bad path without needing to predict which seeds expose it.
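The sweep-and-check pattern is easy to factor out. The helper below is a sketch with a made-up name, `first_failing_seed`; its `run` parameter stands in for `run_one`, so it can be exercised here without the simulator. It stops at the first seed that breaks the `applied <= 1` invariant and hands that seed back, which is the value worth writing down.

```rust
use std::ops::Range;

/// Runs `run` once per seed and returns the first seed whose result breaks
/// the invariant `applied <= 1`, together with the observed count.
/// `run` is a stand-in for `run_one`: any function from seed to apply count.
fn first_failing_seed(
    seeds: Range<u64>,
    run: impl Fn(u64) -> u32,
) -> Option<(u64, u32)> {
    seeds
        .map(|seed| (seed, run(seed))) // lazy: later seeds never run after a hit
        .find(|&(_, applied)| applied > 1)
}
```

With the real test this would be called as `first_failing_seed(0..32, run_one)`. Once it returns a seed, a one-line regression test that calls `run_one` with that seed replays the same failure.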

What the seed actually pins down

A turmoil run threads one RNG through every choice the simulator makes — task scheduling order, packet delivery jitter, the timing the network manipulation primitives use. tokio time is mocked too, so tokio::time::timeout returns at the same logical instant for the same seed. That is enough to make the partition-timing race above stable across runs.

It is also where the guarantee stops. The seed pins the simulator's choices, not your program's. If the code calls std::time::Instant::now, it reads the real OS clock and the test stops being deterministic. The same goes for getrandom, quanta, rdtsc, and anything else that fetches entropy or time from outside the simulated runtime. The S2 team's writeup on combining turmoil with a libc-shimming layer (their mad-turmoil derivative of madsim) was the first place I saw this called out cleanly: the Rust ecosystem has so many transitive crates pulling time and randomness out of band that "deterministic" needs a libc-level seam, not just a runtime-level one.

For my own code I keep the rule short. Pass a Clock and an Rng in by trait, swap them for deterministic versions in tests, never call Instant::now from anywhere cargo test reaches. It does not catch every leak — a transitive HashMap iteration still bites — but it is the cheapest 80% of the win.
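A minimal sketch of that rule, using only the standard library. Every name here (`Clock`, `Rng`, `SystemClock`, `ManualClock`, `SeededRng`, `retry_deadline`) is hypothetical, and the backoff numbers are arbitrary. The point is only that the function computing the retry deadline never reaches for ambient time or entropy.

```rust
use std::cell::Cell;
use std::time::{Duration, Instant};

/// Time source the application code depends on. Hypothetical trait.
trait Clock {
    /// Time elapsed since some fixed origin.
    fn now(&self) -> Duration;
}

/// Production clock: real monotonic time measured from construction.
struct SystemClock(Instant);

impl Clock for SystemClock {
    fn now(&self) -> Duration {
        self.0.elapsed()
    }
}

/// Test clock: time only moves when the test says so.
struct ManualClock(Cell<Duration>);

impl ManualClock {
    fn advance(&self, by: Duration) {
        self.0.set(self.0.get() + by);
    }
}

impl Clock for ManualClock {
    fn now(&self) -> Duration {
        self.0.get()
    }
}

/// Randomness source the application code depends on. Hypothetical trait.
trait Rng {
    fn next_u64(&mut self) -> u64;
}

/// Seeded xorshift64 generator for tests. The seed must be non-zero.
struct SeededRng(u64);

impl Rng for SeededRng {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Example consumer: exponential backoff with jitter, computed only from the
/// injected clock and RNG.
fn retry_deadline(clock: &impl Clock, rng: &mut impl Rng, attempt: u32) -> Duration {
    let base_ms = 100u64 << attempt.min(6); // 100 ms, 200 ms, ... capped at 6.4 s
    let jitter_ms = rng.next_u64() % base_ms; // jitter in 0..base_ms
    clock.now() + Duration::from_millis(base_ms + jitter_ms)
}
```

In production the caller would pass `SystemClock(Instant::now())` and an RNG seeded from the OS. In tests it passes a `ManualClock` and `SeededRng(seed)`, and the same seed yields the same deadline every time.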

What this is not

Turmoil is not chaos engineering. Chaos runs against production-shaped systems and surfaces problems with the deployment, the monitoring, and the people. DST runs in a single process on a single thread and surfaces problems with the protocol — the kind of problems where the whole question is "what happens if these two messages arrive in this exact order and this packet is dropped first?" The two answer different questions. The TigerBeetle project's VOPR farm runs on 1,000 cores 24/7 and accelerates simulated time roughly 700x — they catch protocol bugs before deploys, then run real chaos in pre-prod for everything else.

I would also avoid reaching for it for code that does not have a clean network seam. If a "distributed system" is a Postgres connection pool plus an HTTP handler, the time goes into wedging turmoil::net types into the code, not into flake debugging. DST earns its weight when a service has its own message protocol, multiple participants, and timing-dependent correctness: replication, leader election, retry-with-dedupe, sticky session routing, distributed transactions.

A short list of leaks to plug before you trust the seed

  • Direct calls to std::time::Instant::now and std::time::SystemTime::now inside any path the test reaches. They bypass turmoil's mocked time.
  • HashMap iteration order. The default hasher (RandomState) is keyed from OS randomness, not from the simulator's seed. Use a BTreeMap, a HashMap with a deterministic hasher, or sort before observing.
  • getrandom, OsRng, anything pulling entropy from /dev/urandom. Plumb the RNG in. Do not let dependencies create their own.
  • Anything that spawns OS threads. Turmoil schedules tasks on its single thread. Threads outside it are invisible to the simulator.
  • Direct tokio::net use instead of turmoil::net. The compile-time seam is awkward — the standard pattern is a cfg(feature = "turmoil") block — but it is the line between a simulated network and real network calls escaping the test.

When to reach for turmoil: a service with its own protocol, where retries, partitions, or timing windows have already burned a flaky test no one wants to debug. When to skip it: simple request/response services that fit a mock and a few #[tokio::test]s, or anything where the bug class lives in production traffic shape rather than message ordering.

The takeaway I would write on a sticky note: a flaky distributed-systems test is unscientific, but only because I have not yet pinned its seed. The work to make a service runnable under a seeded simulator is real, and not all services pay it back. The ones that do return something better than "the test passed" — they return "the test failed at seed 7 and will keep failing at seed 7 until the protocol is right".
