Durable Execution: The Replay Determinism Trap

I came to durable execution through the AI press cycle. Every blog post framed it as the runtime that finally makes long-running agent loops survive a crash. After tracing the pattern back through Temporal's docs, the Restate handbook, and InfoQ's coverage of Cloudflare's Project Think, I think the framing has it backwards. The interesting part of durable execution is not the agent on top. It is what the runtime demands of the code underneath, and that demand is what surprises every backend engineer who picks it up.

The promise is short. Write the workflow as a normal function. The runtime journals every step. If the process dies mid-flight, a different worker re-runs the function from the start, replays the journal, and lands at the exact line where the crash happened. No coordinator, no saga compensations, no hand-rolled state machine.

The price is a contract. The code must produce the same sequence of journal entries every time it runs. Anything that lives outside the journal — a wall-clock read, a UUID generated in-process, a network call that bypassed the journal slot — turns the next replay into a lottery. Crash recovery moves out of application code, but it pays for the move by forcing every side effect to be deterministic or explicitly journaled. That is the constraint the press never names.

I want to walk through what that means on a workflow that has nothing to do with agents. I picked the most boring example I could think of: a six-step payment reconciliation job. Pull a bank statement, pull the ledger, diff them, post corrections for the missing entries, wait until the close window, mark the batch reconciled. The kind of cron job that lives in every payments stack.

Why the boring example matters

A reconciliation job is interesting because nothing is interesting about it. No streaming partitions, no consensus, no hot keys. The only hard requirement is that if the worker dies after step three, the resumed worker picks up at step four and does not re-post the corrections that already landed. That is the same requirement an agent loop has when it dies between an LLM call and a tool invocation, stripped of the LLM theater. If the pattern is going to hold for backend workflows in general, it has to hold here first.

The hand-rolled version is well-trodden. A service walks the six steps, persists progress to a row in Postgres after each step, retries on failure, and hopes the service does not die between finishing a step and writing "step 3 done". Saga libraries pretty this up. They do not change the shape of the problem.
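
To make that crash window concrete, here is a minimal sketch of a hand-rolled runner. Everything in it is hypothetical: a Map stands in for the Postgres progress row, the steps are counters instead of real side effects, and the crashAfterStep parameter exists only to simulate a worker dying after a step ran but before its checkpoint was written.

```typescript
// Hypothetical hand-rolled runner. A Map stands in for the Postgres progress row.
type Step = () => void;

const progress = new Map<string, number>(); // batchId -> index of the next step to run

function runBatch(batchId: string, steps: Step[], crashAfterStep?: number): void {
  for (let i = progress.get(batchId) ?? 0; i < steps.length; i++) {
    steps[i](); // the side effect happens here
    if (crashAfterStep === i) {
      // Simulated worker death: the step ran, the checkpoint did not.
      throw new Error(`crashed after step ${i}, before the checkpoint`);
    }
    progress.set(batchId, i + 1); // checkpoint: "step i done"
  }
}

// Three steps that only count how often they execute.
const executions = [0, 0, 0];
const steps: Step[] = executions.map((_, i) => () => {
  executions[i]++;
});

try {
  runBatch("batch-1", steps, 1); // dies after step 1 ran
} catch {
  // the worker is gone; a replacement picks the batch up
}
runBatch("batch-1", steps); // resumes from the last checkpoint

console.log(executions); // step 1 executed twice
```

The resumed run has no way to know that step 1 already reached the outside world, so it runs it again. A durable runtime narrows this window but does not remove it, which is why the receiver still matters later in this post.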

A durable runtime collapses the bookkeeping into a primitive. Each step is a journal entry. The runtime owns the journal. The code is the script that produces journal entries in order.
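
A toy model helps before looking at a real SDK. The sketch below is my own in-memory stand-in, not Restate's implementation: ToyJournal, script, and the step names are all hypothetical, and it is synchronous for brevity. It shows only the core move: a step runs its closure the first time and returns the recorded value on replay.

```typescript
// Toy model only: an in-memory journal, not Restate's implementation.
type JournalEntry = { name: string; value: unknown };

class ToyJournal {
  private cursor = 0;
  constructor(readonly entries: JournalEntry[] = []) {}

  // First execution: call fn and record the result.
  // Replay: return the recorded result without calling fn.
  run<T>(name: string, fn: () => T): T {
    if (this.cursor < this.entries.length) {
      return this.entries[this.cursor++].value as T;
    }
    const value = fn();
    this.entries.push({ name, value });
    this.cursor++;
    return value;
  }
}

let sideEffects = 0; // counts how often a closure really ran

function script(journal: ToyJournal): number {
  const a = journal.run("step_a", () => { sideEffects++; return 2; });
  const b = journal.run("step_b", () => { sideEffects++; return 3; });
  return a * b; // pure computation, lives outside the journal
}

const firstJournal = new ToyJournal();
const firstResult = script(firstJournal); // runs both closures
// A replacement worker re-runs the script against the persisted entries.
const replayResult = script(new ToyJournal(firstJournal.entries)); // runs neither
console.log({ firstResult, replayResult, sideEffects });
```

Both passes return the same result, and the closures ran only during the first pass. The real runtimes add persistence, retries, and timers on top, but the replay contract has this shape.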

The Restate version

Here is the workflow as a single Restate handler in TypeScript. It runs against a local Restate server (npx @restatedev/restate-server) registered to this endpoint. The handler is the whole moving part.

typescript

// reconcile.ts
// Start the runtime in another terminal:
//   npx @restatedev/restate-server
// Then run this file:
//   npx tsx reconcile.ts
// Trigger it:
//   curl localhost:8080/payments.reconcile/2026-05-05/run --json '{}'
import * as restate from "@restatedev/restate-sdk";

type Entry = { id: string; amount: number };

const reconcile = restate.workflow({
  name: "payments.reconcile",
  handlers: {
    run: async (ctx: restate.WorkflowContext) => {
      const batchId = ctx.key;

      // 1. Stamp the run inside ctx.run so the timestamp lands in the journal.
      const startedAt = await ctx.run("started_at", () =>
        new Date().toISOString(),
      );

      // 2. Fetch the bank statement.
      const bank = await ctx.run("fetch_bank", async () => {
        const r = await fetch(`https://bank.local/statements/${batchId}`);
        return (await r.json()) as { entries: Entry[] };
      });

      // 3. Fetch the ledger entries for the same batch.
      const ledger = await ctx.run("fetch_ledger", async () => {
        const r = await fetch(`https://ledger.local/entries?batch=${batchId}`);
        return (await r.json()) as { entries: Entry[] };
      });

      // 4. Pure diff. Safe to run outside the journal.
      const known = new Set(ledger.entries.map((l) => l.id));
      const missing = bank.entries.filter((b) => !known.has(b.id));

      // 5. Each correction is its own journal slot, keyed by entry id.
      for (const entry of missing) {
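        await ctx.run(`post_${entry.id}`, async () => {
          const r = await fetch("https://ledger.local/corrections", {
            method: "POST",
            headers: { "content-type": "application/json" },
            body: JSON.stringify({ ...entry, source: "bank" }),
          });
          // Throw on a non-2xx so a failed POST is retried, not journaled as done.
          if (!r.ok) throw new Error(`correction ${entry.id} failed: ${r.status}`);
        });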
      }

      // 6. Sleep until close (60s here for the demo), then stamp the finish.
      // ctx.sleep takes a duration in milliseconds. Recent SDK releases also
      // accept a Duration object, so check the version you have installed.
      await ctx.sleep(60_000);
      const finishedAt = await ctx.run("finished_at", () =>
        new Date().toISOString(),
      );

      return { batchId, startedAt, finishedAt, posted: missing.length };
    },
  },
});

restate.endpoint().bind(reconcile).listen(9080);

Run it with npx tsx reconcile.ts after the Restate server is up. The shape worth studying is what is inside ctx.run and what is outside. Everything that touches the world — wall-clock reads, HTTP calls, the eventual fetch to post a correction — is wrapped. The diff in step 4 is not. That split is the entire mental model.

What the journal actually records

When the workflow runs the first time, Restate appends a journal entry for each ctx.run block as it completes. The entry stores the name and the result. Step 1 records started_at = "2026-05-05T09:00:00Z". Step 2 records the parsed bank statement. Steps 5a, 5b, 5c each record a successful POST. The sleep records its scheduled wakeup.
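
As a rough picture of that state, the sketch below writes the journal out as plain data at the moment a worker dies while posting a third correction. The entry ids (e7, e8, e9), the amounts, and the JournalEntry shape are illustrative assumptions, not Restate's storage format.

```typescript
// Illustrative only: journal contents at the moment of a crash during post_e9.
type JournalEntry = { name: string; result: unknown };

const journalAtCrash: JournalEntry[] = [
  { name: "started_at", result: "2026-05-05T09:00:00Z" },
  {
    name: "fetch_bank",
    result: { entries: [{ id: "e7", amount: 120 }, { id: "e8", amount: 45 }, { id: "e9", amount: 300 }] },
  },
  { name: "fetch_ledger", result: { entries: [] } },
  { name: "post_e7", result: null },
  { name: "post_e8", result: null },
  // post_e9 was in flight when the worker died, so it has no entry.
];

// The step names the script asks for, in order, given the journaled data above.
const plannedSteps = [
  "started_at", "fetch_bank", "fetch_ledger",
  "post_e7", "post_e8", "post_e9",
  "sleep", "finished_at",
];

// Everything past the end of the journal still has to execute for real.
const remainingSteps = plannedSteps.slice(journalAtCrash.length);
console.log(remainingSteps);
```

On the replacement worker, the first five names resolve from the journal and the remaining ones execute, starting with the correction that was in flight.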

The diagram below traces what the journal looks like before a forced crash and after the worker is replaced. The interesting move is at the dashed line: a fresh worker starts the same function from the top, but every wrapped block short-circuits to the recorded value instead of re-running.

The diagram makes the small trap obvious. The crash happened during post_e9. The runtime does not know whether the side effect made it to the ledger. So it re-issues the slot on the new worker and the POST handler at ledger.local/corrections has to be idempotent on the entry id. Durable execution does not relieve you of idempotency; it concentrates it on the boundary between your workflow and the systems it talks to. I have a longer set of notes on what idempotency actually demands of the receiver, but the rule for this post is short: anything inside ctx.run can run zero, one, or several times from the receiver's perspective. Design accordingly.
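
On the receiving side, the smallest version of "idempotent on the entry id" might look like the sketch below. It is an assumption about what the ledger could do, not a description of any real service: CorrectionsLedger is hypothetical, and a Map stands in for a table with a unique constraint on the id.

```typescript
// Hypothetical receiver: dedupes corrections by entry id.
type Correction = { id: string; amount: number; source: string };

class CorrectionsLedger {
  // Stand-in for a table with a unique constraint on id.
  private readonly applied = new Map<string, Correction>();

  // Returns true if the correction was applied, false if it was a duplicate.
  post(c: Correction): boolean {
    if (this.applied.has(c.id)) return false;
    this.applied.set(c.id, c);
    return true;
  }

  total(): number {
    let sum = 0;
    this.applied.forEach((c) => { sum += c.amount; });
    return sum;
  }
}

const ledger = new CorrectionsLedger();
const e9: Correction = { id: "e9", amount: 300, source: "bank" };

const firstDelivery = ledger.post(e9);  // original attempt, landed before the crash
const secondDelivery = ledger.post(e9); // re-issued by the replacement worker
console.log({ firstDelivery, secondDelivery, total: ledger.total() });
```

The second delivery is acknowledged without changing the total, which is the property the workflow depends on when it re-issues post_e9.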

The line that broke replay

The line that surprised me was step 1. In an early draft I had written it like this:

typescript

// BROKEN: bare new Date() outside ctx.run
const startedAt = new Date().toISOString();

The first run worked. The journal recorded steps 2 through 6 and the workflow returned a sensible record. I then killed the worker between steps 4 and 5 and watched a fresh worker pick up. The replay diverged on the startedAt value because new Date() ran a second time and produced a later instant. Restate caught the divergence at the next journal entry, raised a "journal mismatch" error, and parked the invocation.

Temporal's TypeScript SDK hides this trap with a sandbox. It rewrites Date.now() and Math.random() so they return values pulled from the workflow context, and it has done so for years. Restate's TypeScript SDK does not sandbox. Outside ctx.run, it is plain Node, and plain Node calls the system clock. The Temporal docs say the workflow code "must be deterministic between replays" — that single rule sits underneath both runtimes, but Temporal moves the enforcement into the SDK while Restate moves it onto the engineer.

The fix is the version in the example: wrap the timestamp in ctx.run("started_at", () => new Date().toISOString()). The first run executes the closure, the second reads it from the journal, and replay matches.
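
The difference between the two versions is easy to reproduce without a runtime. In the sketch below, readClock is a fake clock that advances on every read, and ReplayLog is a hypothetical stand-in for the journal. Neither is part of any SDK.

```typescript
// Fake clock: every read returns a later value, like the system clock would.
let tick = 0;
const readClock = (): string => `t${++tick}`;

// Hypothetical stand-in for the journal: records on first run, replays after.
class ReplayLog {
  private cursor = 0;
  constructor(readonly values: string[] = []) {}

  run(fn: () => string): string {
    if (this.cursor < this.values.length) return this.values[this.cursor++];
    const value = fn();
    this.values.push(value);
    this.cursor++;
    return value;
  }
}

// Broken: reads the clock directly, so every execution sees a new value.
const brokenScript = (): string => readClock();
// Fixed: the clock read goes through the log, so replay sees the recorded value.
const fixedScript = (log: ReplayLog): string => log.run(readClock);

const brokenFirst = brokenScript();
const brokenReplay = brokenScript(); // a later instant than brokenFirst

const log = new ReplayLog();
const fixedFirst = fixedScript(log);
const fixedReplay = fixedScript(new ReplayLog(log.values)); // same as fixedFirst

console.log({ brokenFirst, brokenReplay, fixedFirst, fixedReplay });
```

The broken pair differs between the first run and the replay, while the fixed pair is identical.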

The same trap shows up in two more shapes once you start looking. A non-deterministic loop bound: for (const entry of missing) is fine because missing was derived from journal data, but for (let i = 0; i < Math.random() * 10; i++) is not. And an out-of-band fetch: if a refactor of step 4 slips a bare await fetch(...) into the workflow body to "just check the gateway one more time", replay calls the network a second time, gets a different response, and the next ctx.run slot mismatches.

The mental model that actually clicks

The shorthand I have settled on is two sentences. Code outside ctx.run is "the script". Code inside ctx.run is "the journal".

The script is replayed verbatim. The script reads from the journal but never writes outside it. Every read it makes from the world has to come back through the journal, which means every read has to be wrapped.

The journal is the durable part. Each slot stores a name and a value. Restate's name-based addressing is why steps 5a/5b/5c all coexist as separate slots: post_${entry.id} makes the name unique per entry, and the for-loop walks the same names every time because missing is derived deterministically from journaled data. Naming is not cosmetic. Two slots with the same name in the same invocation collide and the runtime rejects the second. Two slots that should match but use different names produce a phantom step and replay diverges.

That mental model also tells you when durable execution is the wrong tool. It is the wrong tool when the workflow does not have a script — when every step depends on out-of-band signals you cannot represent in the journal, or when the steps are themselves throughput-critical and the journal write becomes a bottleneck on the hot path. Restate's docs are clear that each ctx.run call costs a write to the journal store, and that is a real number you have to budget for at high call rates.

Trade-offs worth naming

Durable execution shifts the operational surface. Coordinators and saga state machines drop out; a journal store takes their place. Restate runs the store inline, Temporal runs a separate cluster, DBOS reuses Postgres. Each choice trades latency, footprint, and ops cost differently — I wrote separately about picking between DBOS and Temporal, and the rubric there extends to Restate as the third option in the same row.

The "no bare side effects" rule is real, and the SDK only catches it at replay time. A workflow that worked in tests can still trip the first time a worker dies after a non-journaled call. Two countermeasures earn their keep: a CI step that runs a forced-replay harness (Temporal's Worker.runReplayHistory, Restate's RestateTestEnvironment), and a code review rule that flags uses of Date, Math.random, crypto.randomUUID, and fetch inside any module that exports a workflow handler.
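
The review rule can start out very crude. The sketch below is a naive text scan under my own assumptions: the rule list and the flagSideEffects name are hypothetical, it flags every use whether or not it sits inside ctx.run, and it leaves the judgment to the reviewer. A real rule would more likely walk the AST in a linter.

```typescript
// Naive, hypothetical scan: flags every use and leaves the decision to a human.
const RULES: { label: string; pattern: RegExp }[] = [
  { label: "Date", pattern: /\bnew Date\(|\bDate\.now\(/ },
  { label: "Math.random", pattern: /\bMath\.random\(/ },
  { label: "crypto.randomUUID", pattern: /\bcrypto\.randomUUID\(/ },
  { label: "fetch", pattern: /\bfetch\(/ },
];

function flagSideEffects(source: string): string[] {
  const findings: string[] = [];
  source.split("\n").forEach((line, index) => {
    RULES.forEach((rule) => {
      if (rule.pattern.test(line)) {
        findings.push(`line ${index + 1}: ${rule.label}`);
      }
    });
  });
  return findings;
}

// A made-up workflow body with two suspicious lines and one harmless one.
const sample = [
  "const startedAt = new Date().toISOString();",
  "const known = new Set(ledgerIds);",
  "const r = await fetch(gatewayUrl);",
].join("\n");

const findings = flagSideEffects(sample);
console.log(findings);
```

It produces false positives for calls that are already wrapped, which is acceptable for a review prompt and not for a hard CI gate.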

Changing workflow code between runs is a versioning problem, not a refactor. If a journal from yesterday referred to a step post_${entry.id} and today's code renames it to apply_${entry.id}, replay walks off the journal at that point. Both Temporal and Restate have explicit versioning hooks (Temporal's patched API, Restate's stable-name guidance). Use them.
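
The rename case can be reproduced with a toy journal that checks the step name at each position. CheckedJournal and both workflow versions are hypothetical, and the error text is mine, not a message from either runtime.

```typescript
// Toy journal that verifies names during replay. Not a real SDK.
type Slot = { name: string; value: unknown };

class CheckedJournal {
  private cursor = 0;
  constructor(readonly slots: Slot[] = []) {}

  run<T>(name: string, fn: () => T): T {
    if (this.cursor < this.slots.length) {
      const slot = this.slots[this.cursor++];
      if (slot.name !== name) {
        throw new Error(`journal mismatch: recorded "${slot.name}", code asked for "${name}"`);
      }
      return slot.value as T;
    }
    const value = fn();
    this.slots.push({ name, value });
    this.cursor++;
    return value;
  }
}

const entryIds = ["e7", "e8"];

// Yesterday's code and today's rename of the same step.
const workflowV1 = (j: CheckedJournal): void =>
  entryIds.forEach((id) => { j.run(`post_${id}`, () => "ok"); });
const workflowV2 = (j: CheckedJournal): void =>
  entryIds.forEach((id) => { j.run(`apply_${id}`, () => "ok"); });

const yesterday = new CheckedJournal();
workflowV1(yesterday); // records post_e7 and post_e8

let replayError = "";
try {
  workflowV2(new CheckedJournal(yesterday.slots)); // replay with renamed steps
} catch (e) {
  replayError = (e as Error).message;
}
console.log(replayError);
```

Replay fails at the first renamed step, before any closure runs. In-flight invocations have to finish on the old names, or the change has to go through the runtime's versioning mechanism.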

A single ctx.run slot can execute more than once: a crash after the side effect but before the journal write, or a retry after a failed attempt, re-runs the closure. The systems on the other end have to dedupe by their own keys. The Stripe Idempotency-Key header is the easy reference; my own write-up on idempotency as a four-part protocol covers what the receiver actually has to do beyond storing the key.

When I would reach for it, and when I would not

Reach for it when the workflow has more than three sequential side effects, each side effect is expensive enough to justify a journal write, and the cost of running a step twice on the receiver is bounded by an idempotency key. Reconciliation, provisioning, payment capture, multi-vendor onboarding, long-running checkout flows — all good fits.

Skip it for a workflow that is a single transaction in a single database. Postgres already gives durability and atomicity for that. Adding a runtime on top doubles the moving parts and buys nothing.

Skip it for low-latency request-response. The journal write tax shows up as visible p99 even on a healthy runtime, and the resilience does not earn back the latency.

Be cautious when the workflow is mostly branching on out-of-band data. Each branch has to come from the journal, which means every signal has to be threaded through ctx.run or a workflow signal. The journal then becomes the bulk of the code and the runtime stops paying for itself.

The agent framing in the press is, at most, one application of the pattern. The pattern is a 2026 take on what saga frameworks tried to do for the last decade — make crash recovery boring — and the determinism contract is the price of admission. Pay it on purpose, in ctx.run blocks, with named slots, or pay it later in mismatched journals at 2 a.m.

Takeaways

Treat code outside ctx.run as a deterministic script and code inside ctx.run as the journal.
Wrap every side effect: clock reads, UUIDs, network calls, randomness. The Temporal SDK sandboxes some of these in TypeScript; the Restate SDK does not.
Name journal slots by data you control (post_${entry.id}), not by counters or timestamps.
Add a forced-replay test to CI so divergence shows up at build time, not at recovery time.
Idempotency does not move; it concentrates on the receivers that workflows talk to.
Version workflow code with the runtime's hooks (Temporal patched, Restate stable names) — never silently rename a step that already exists in a journal.

Reach for durable execution when there are several side effects in a row, each worth a journal write, each idempotent on the receiver. Skip it for single-database transactions and for latency-sensitive request-response paths.
