
Idempotency Is a Protocol, Not a Key

The first time I shipped idempotency as a UUID header and a Redis lookup, a duplicate charge slipped through a week later. These are my notes on treating idempotency as a four-part protocol — dedup, determinism, concurrent safety, downstream propagation — with a minimal Kotlin plus Postgres implementation that holds up under retry.

Tiarê Balbi Bonamini · Software Engineer · Vancouver

Four concerns that get collapsed into one header

Reading Brandur's write-up on Stripe-style keys and Adyen's API reference, I kept running into the same point: the word "idempotency" hides four separate concerns, and collapsing them into a single UUID is where the bugs come from.

  • Request deduplication. Two identical requests must produce one outcome.
  • Operation determinism. The same input must produce the same result, for as long as the key is valid.
  • Concurrent execution safety. Two concurrent copies of the same key must not both execute the side effect.
  • Downstream propagation. Every service the handler calls must honor the same contract.

A Redis SETNX covers deduplication in the happy path and partially addresses concurrent safety. It does nothing for the other two. Each gap is a production incident waiting for the right combination of retry timing, pod restart, and cache eviction.

Why Redis alone failed me

The concrete failure I hit went like this. Client sends a POST with Idempotency-Key: abc. Pod A wins the Redis SETNX, calls the payment gateway, then crashes before writing the response back to Redis. The client retries 30ms later. Pod B sees the Redis key, but the value is empty — the response was never stored. The fallback path treated that as "about to finish" and, after a short wait, executed the charge again.
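
For concreteness, the pattern looked roughly like the sketch below, using Spring Data Redis's setIfAbsent (the SETNX call). This is a reconstruction, not the original code, but the shape and the gap are the same.

kotlin
import org.springframework.data.redis.core.StringRedisTemplate
import java.time.Duration

// The SETNX-style guard: one key, one cached response, no durable state machine.
class NaiveRedisGuard(private val redis: StringRedisTemplate) {

    fun runOnce(key: String, work: () -> String): String {
        val claimed = redis.opsForValue()
            .setIfAbsent("idem:$key", "IN_PROGRESS", Duration.ofHours(24)) ?: false
        if (!claimed) {
            // The stored value is either the finished response or still "IN_PROGRESS"
            // if the first attempt crashed before writing it back. Treating the second
            // case as "about to finish" is where the duplicate charge came from.
            val cached = redis.opsForValue().get("idem:$key")
            if (cached != null && cached != "IN_PROGRESS") return cached
        }
        val result = work()
        redis.opsForValue().set("idem:$key", result, Duration.ofHours(24))
        return result
    }
}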

The root cause is that a cache is not a state machine. Any operation with more than one outcome — in progress, succeeded, failed, expired — needs durable storage with at least three explicit states, not a single "set or not set" bit.

The storage shape that holds up

Moving the key store into Postgres with a unique constraint changed the shape of the problem. The constraint is the only layer in the stack that makes "two concurrent inserts, exactly one winner" an atomic property. Brandur names it directly: the UNIQUE index is "atomic by construction, no possible race condition."

The minimal schema I kept coming back to:

sql
CREATE TABLE idempotency (
  key         TEXT PRIMARY KEY,
  status      TEXT NOT NULL CHECK (status IN ('IN_PROGRESS','SUCCEEDED','FAILED')),
  body        JSONB,
  started_at  TIMESTAMPTZ NOT NULL,
  finished_at TIMESTAMPTZ
);

The protocol on top is a three-phase write. INSERT first, to claim the key. Run the work. UPDATE with the outcome. If the INSERT conflicts, another request is already holding the key — depending on the key's state and age, either return the stored response, return 409 Conflict, or reclaim a stuck entry.

The state machine behind those three phases is small and worth spelling out before reading the code: a new key is claimed as IN_PROGRESS, ends as SUCCEEDED or FAILED, and a FAILED row or a stale IN_PROGRESS row can be reclaimed by a later retry.
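
A minimal sketch of those transitions (illustrative only; the guard below stores the status as plain TEXT rather than an enum):

kotlin
// Illustrative transition table for an idempotency key's lifecycle.
enum class KeyStatus { IN_PROGRESS, SUCCEEDED, FAILED }

// `from == null` means the key does not exist yet. A FAILED row, or an IN_PROGRESS row
// past the staleness window, may be re-claimed as IN_PROGRESS by a retry; the terminal
// states are only reachable from IN_PROGRESS.
fun canTransition(from: KeyStatus?, to: KeyStatus): Boolean = when (to) {
    KeyStatus.IN_PROGRESS -> from == null || from == KeyStatus.FAILED || from == KeyStatus.IN_PROGRESS
    KeyStatus.SUCCEEDED, KeyStatus.FAILED -> from == KeyStatus.IN_PROGRESS
}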

Here is the whole thing in Kotlin, using Spring's JdbcTemplate. One class, one file.

kotlin
import org.springframework.http.HttpStatus
import org.springframework.jdbc.core.JdbcTemplate
import org.springframework.stereotype.Service
import org.springframework.transaction.annotation.Isolation
import org.springframework.transaction.annotation.Transactional
import org.springframework.web.bind.annotation.ResponseStatus
import java.sql.Timestamp
import java.time.Instant

@Service
class IdempotencyGuard(private val db: JdbcTemplate) {

    @Transactional(isolation = Isolation.SERIALIZABLE)
    fun <T : Any> runOnce(key: String, staleAfterSec: Long = 30, work: () -> T): T {
        val now = Timestamp.from(Instant.now())
        // Claim the key with ON CONFLICT DO NOTHING rather than catching the unique-violation
        // exception: in Postgres a constraint error aborts the surrounding transaction, so the
        // conflict path could not safely run further statements after a catch.
        val claimed = db.update(
            "INSERT INTO idempotency(key,status,started_at) VALUES(?, 'IN_PROGRESS', ?) ON CONFLICT (key) DO NOTHING",
            key, now,
        )
        if (claimed == 0) {
            val row = db.queryForMap(
                "SELECT status, body, started_at FROM idempotency WHERE key = ? FOR UPDATE",
                key,
            )
            when (row["status"] as String) {
                "SUCCEEDED" -> {
                    // Simplification: the stored response comes back however the driver maps
                    // JSONB; a production version would deserialize it back into T.
                    @Suppress("UNCHECKED_CAST")
                    return row["body"] as T
                }
                "IN_PROGRESS" -> {
                    val started = (row["started_at"] as Timestamp).toInstant()
                    if (Instant.now().isBefore(started.plusSeconds(staleAfterSec))) {
                        throw ConflictException("Operation for $key is in progress")
                    }
                }
            }
            // FAILED, or IN_PROGRESS past the staleness window: reclaim the key and re-run.
            db.update("UPDATE idempotency SET status='IN_PROGRESS', started_at=? WHERE key=?", now, key)
        }

        return try {
            val result = work()
            // Simplification: result.toString() must be valid JSON for the ::jsonb cast;
            // a real service would serialize the response object explicitly.
            db.update(
                "UPDATE idempotency SET status='SUCCEEDED', body=?::jsonb, finished_at=? WHERE key=?",
                result.toString(), Timestamp.from(Instant.now()), key,
            )
            result
        } catch (e: Exception) {
            db.update(
                "UPDATE idempotency SET status='FAILED', finished_at=? WHERE key=?",
                Timestamp.from(Instant.now()), key,
            )
            throw e
        }
    }
}

// Maps to HTTP 409 so a concurrent duplicate gets a retryable status rather than a 500.
@ResponseStatus(HttpStatus.CONFLICT)
class ConflictException(msg: String) : RuntimeException(msg)

Wire it into a Spring Boot app, create the table, and run with ./gradlew bootRun. The behavior I verified in a throwaway project: fire ten parallel requests with the same key at a controller that delegates to runOnce, and confirm that exactly one closure body executes, one retrieves the stored response, and the others see 409.
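
For reference, the wiring is roughly the sketch below. The endpoint, the PaymentGateway interface, and the request shape are made up for illustration; the only real contract is that the closure handed to runOnce is the side effect.

kotlin
import org.springframework.web.bind.annotation.PostMapping
import org.springframework.web.bind.annotation.RequestBody
import org.springframework.web.bind.annotation.RequestHeader
import org.springframework.web.bind.annotation.RestController

// Hypothetical downstream client; returns the receipt as a JSON string so the guard
// can store it as JSONB.
interface PaymentGateway { fun charge(amountCents: Long): String }

data class ChargeRequest(val amountCents: Long)

@RestController
class ChargeController(
    private val guard: IdempotencyGuard,
    private val gateway: PaymentGateway,
) {
    // The caller supplies the key; the guard decides whether the closure runs at all.
    @PostMapping("/charges")
    fun charge(
        @RequestHeader("Idempotency-Key") key: String,
        @RequestBody req: ChargeRequest,
    ): String = guard.runOnce(key) {
        gateway.charge(req.amountCents)
    }
}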

Three non-obvious details. The INSERT happens before any side effect, so the unique constraint — not application logic — enforces mutual exclusion. The FOR UPDATE on the conflict path blocks a concurrent lookup until the first transaction commits, so the second caller never reads a half-written row. The staleAfterSec window is the escape hatch for keys that get stuck because a pod crashed between claim and result; without it, an IN_PROGRESS row can wedge the key forever.

One caveat I noted in my own tests: work() runs inside the SERIALIZABLE transaction here. That is fine for local DB mutations but not for slow external calls. For those, I split the flow into two transactions and advance a recovery_point column between them, following Brandur's atomic-phase pattern. The single-transaction version is the pedagogical core; the two-transaction version is what I actually deploy.
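
A sketch of that split, assuming an added recovery_point column, Spring Boot's auto-configured TransactionTemplate, and a hypothetical gateway client that returns its response as a JSON string. The conflict and lookup handling from runOnce is omitted to keep the two-transaction shape visible.

kotlin
import org.springframework.jdbc.core.JdbcTemplate
import org.springframework.stereotype.Service
import org.springframework.transaction.support.TransactionTemplate
import java.sql.Timestamp
import java.time.Instant

// Hypothetical downstream client that itself accepts an idempotency key.
interface GatewayClient { fun charge(idempotencyKey: String, amountCents: Long): String }

@Service
class TwoPhaseCharge(
    private val db: JdbcTemplate,
    private val tx: TransactionTemplate,
    private val gateway: GatewayClient,
) {
    fun chargeOnce(key: String, amountCents: Long): String {
        // Tx 1: claim the key and record the recovery point, then commit before the slow call.
        tx.execute {
            db.update(
                "INSERT INTO idempotency(key,status,started_at,recovery_point) " +
                    "VALUES(?, 'IN_PROGRESS', ?, 'before_gateway') ON CONFLICT (key) DO NOTHING",
                key, Timestamp.from(Instant.now()),
            )
        }
        // Between the two transactions: the external call, outside any DB transaction,
        // forwarding the same key so the gateway can dedup on its side.
        val receipt = gateway.charge(key, amountCents)
        // Tx 2: record the outcome and advance the recovery point. A crash before this
        // commit leaves an IN_PROGRESS row for the stale-reclaim path to pick up later.
        tx.execute {
            db.update(
                "UPDATE idempotency SET status='SUCCEEDED', recovery_point='done', " +
                    "body=?::jsonb, finished_at=? WHERE key=?",
                receipt, Timestamp.from(Instant.now()), key,
            )
        }
        return receipt
    }
}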

TTLs are load-bearing, not decoration

It took me longer than it should have to internalize that retention is part of the contract, not a cleanup knob. A DZone article on phantom writes caused by expiring idempotency data makes the point sharply: when an idempotency record expires mid-replay, every replayed message looks new and gets processed again.

My rough rule from reading the material and watching behavior locally: the retention window must exceed the longest possible retry window the endpoint faces.

  • HTTP clients with 24-hour retry budgets: keep keys at least 24 hours. That is what Stripe documents.
  • Kafka consumers or webhook replayers with week-long replay windows: keep keys at least 7 days.
  • Ledger entries and double-entry bookkeeping: never expire. The key becomes part of the permanent record.

The trap is reusing one short TTL (often whatever the Redis setup already defaults to) across all three cases, just because the same tool happens to back them.
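
One way to make the window explicit, as a sketch: tag each key with a purpose (an extra column not in the schema above) and sweep per class on a schedule, never touching ledger keys.

kotlin
import org.springframework.jdbc.core.JdbcTemplate
import org.springframework.scheduling.annotation.Scheduled
import org.springframework.stereotype.Component

// Assumes an extra `purpose TEXT NOT NULL` column ('http', 'kafka', 'ledger') on the table,
// and @EnableScheduling on a configuration class.
@Component
class IdempotencyRetention(private val db: JdbcTemplate) {

    // Hourly sweep; each class keeps keys at least as long as its longest retry window.
    @Scheduled(fixedDelayString = "PT1H")
    fun sweep() {
        db.update(
            """
            DELETE FROM idempotency
            WHERE finished_at IS NOT NULL
              AND (
                    (purpose = 'http'  AND finished_at < now() - interval '24 hours')
                 OR (purpose = 'kafka' AND finished_at < now() - interval '7 days')
                  )
            """.trimIndent(),
        )
        // 'ledger' rows are intentionally excluded: those keys never expire.
    }
}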

The downstream hole

The failure that kept biting me after the Postgres rewrite was this. The payment service was idempotent. It called an email service to send a receipt. The email service had never heard of idempotency keys. A retry would arrive and the payment service would return the stored response, but the email had gone out zero times if the first attempt died before reaching that call, and twice if a crash pushed both attempts through the work() block.

The protocol only holds if every hop honors it. Two patterns from my notes, depending on the downstream:

If the downstream accepts an idempotency header (Stripe, Adyen, modern webhooks), forward the caller's key or derive a deterministic child key from it. If the downstream does not, move the side effect behind a transactional outbox: write the domain row and an outbox row inside the same DB transaction, and have a worker drain the outbox with its own dedup. The worker's dedup is trivial because each outbox row has a primary key.
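
A sketch of the outbox half, assuming hypothetical payments and outbox tables. The point is that the domain write and the outbox row share one transaction, and that a deterministic child key derived from the caller's key becomes the outbox primary key.

kotlin
import org.springframework.jdbc.core.JdbcTemplate
import org.springframework.stereotype.Service
import org.springframework.transaction.annotation.Transactional
import java.security.MessageDigest

@Service
class ReceiptOutbox(private val db: JdbcTemplate) {

    // Deterministic child key: the same parent key and purpose always yield the same value,
    // so a replayed parent request produces the same outbox row, not a second email.
    fun childKey(parentKey: String, purpose: String): String =
        MessageDigest.getInstance("SHA-256")
            .digest("$parentKey:$purpose".toByteArray())
            .joinToString("") { "%02x".format(it) }

    // Domain row and outbox row commit or roll back together; a worker drains the outbox
    // and calls the email service, deduping on the outbox primary key.
    @Transactional
    fun recordPaymentAndQueueReceipt(parentKey: String, paymentId: String, email: String) {
        db.update(
            "INSERT INTO payments(id, status) VALUES (?, 'CAPTURED')",
            paymentId,
        )
        db.update(
            "INSERT INTO outbox(id, kind, payload) VALUES (?, 'send_receipt', ?::jsonb) " +
                "ON CONFLICT (id) DO NOTHING",
            childKey(parentKey, "receipt-email"),
            """{"paymentId":"$paymentId","email":"$email"}""",
        )
    }
}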

Either path gives you the same property: the contract crosses every boundary where a side effect happens, or a local mechanism (outbox, fencing token) supplies the same guarantee.

Why this matters more now

Spring Framework 7.0 shipped in November 2025 with @Retryable and @ConcurrencyLimit moved into the core framework, no spring-retry dependency required. Making retries first-class is good. It also quietly turns any non-idempotent handler into a duplicate-write generator the moment someone slaps @Retryable(maxRetries = 3) on it. The release notes call out backoff and timeout; they do not remind the author that the method underneath now needs an idempotency protocol around any state mutation.

What I actually do now

  • Use the database's unique constraint as the mutual-exclusion primitive. Not Redis, not a distributed lock.
  • Model the operation explicitly as IN_PROGRESS to SUCCEEDED or FAILED, with a stale-after policy to reclaim wedged rows.
  • Set the retention window to the longest retry window that can reach the endpoint. For money movement, never expire.
  • Forward or derive an idempotency key for every downstream that accepts one. For the rest, use a transactional outbox.
  • Keep the flow small enough to reason about on one page: an INSERT, a FOR UPDATE, an UPDATE.

Reach for this protocol when the handler writes to a database, calls a payment or messaging provider, or fans out to more than one downstream. Skip it for pure reads and for operations that are naturally idempotent, like writing a known value to a known key. Never skip it for anything that moves money.
