What `dbos ontime` Actually Asks: Building a Distributed Cron on etcd Leases in Go
A 0-click query showed up in my Search Console last week: `dbos ontime`. Four impressions, no landing post. The query is interesting because it is not really about DBOS. The reader typing it has a job that needs to run every N minutes, exactly once, across a fleet — and they are pricing out durable-execution products to get there.
That is the wrong floor of the stack to start at. Before reaching for a workflow engine, the question worth answering is: how many primitives do I actually need to make "run this every minute, exactly once, with failover" hold up under partition?
Three primitives, plus a fence. All four live in etcd, and the Go client packages them as 30-line APIs. The result is a scheduler I can fit in one file, read on a Sunday morning, and reason about when the leader stalls.
Why this is worth writing down right now: Thoughtworks Technology Radar Vol 34 moved Apache APISIX into Trial specifically because it uses etcd to push routing configuration into data planes without reload-induced latency. That blip elevated etcd from "Kubernetes implementation detail" into a primitive senior backend engineers should be reaching for directly. The APISIX use case is configuration broadcast; the leader-elected cron use case is the same primitive viewed from a different angle.
The four primitives
Lease. A timed lock with a TTL — held server-side and renewed by a keep-alive stream from the client. If renewals stop arriving (process death, network stall, GC pause longer than the TTL), the server expires the lease and deletes every key attached to it. The Go client renews at TTL/3 by default.
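Stripped to its essentials, the lease on its own looks like this. A minimal sketch, assuming a connected `cli *clientv3.Client` and a cancellable `ctx`; the key `/cron/holder` and its value are illustrative, not part of the scheduler below.

```go
// holdKey grants a 10-second lease, attaches a key to it, and keeps the
// lease alive until ctx is cancelled. When renewals stop, etcd deletes
// the key within the TTL.
func holdKey(ctx context.Context, cli *clientv3.Client) error {
	lease, err := cli.Grant(ctx, 10)
	if err != nil {
		return err
	}
	// The key lives exactly as long as the lease does.
	if _, err := cli.Put(ctx, "/cron/holder", "node-a", clientv3.WithLease(lease.ID)); err != nil {
		return err
	}
	// KeepAlive starts the renewal stream. If the channel is not drained,
	// the client keeps renewing but drops the responses.
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		return err
	}
	for range ch {
		// Each message is one successful renewal; the loop ends when the
		// lease expires or ctx is cancelled.
	}
	return ctx.Err()
}
```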
Election. The `concurrency` package wraps a lease into a CAS-style election: candidates write a key under a shared prefix with their lease attached, and the candidate with the lowest creation revision wins. `Campaign` either returns "you are leader" or blocks, watching until your turn comes. `Resign` lets a leader yield voluntarily.
Watch. Every key change in etcd is a logical event ordered by revision. The Go client streams those events. The election uses watch internally to know when the previous leader's key disappears.
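Watch in isolation, under the same `cli` and `ctx` assumptions, pointed at the election prefix the scheduler below uses:

```go
// watchPrefix streams every change under the election prefix. A DELETE of
// the current leader's key is the event the election machinery reacts to
// in order to promote the runner-up.
func watchPrefix(ctx context.Context, cli *clientv3.Client) {
	for resp := range cli.Watch(ctx, "/cron/scheduler/leader", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			log.Printf("watch: %s %s at rev=%d",
				ev.Type, ev.Kv.Key, resp.Header.Revision)
		}
	}
}
```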
Fencing token. The leader's lease ID changes per session, but the field that never goes backwards across leadership turnovers is `Election.Rev()` — the etcd revision at which the current leader's key was created. That is the integer to persist alongside any work the leader does, so a stalled-then-resumed old leader cannot overwrite a newer leader's output. Martin Kleppmann's critique of distributed locks-without-fencing applies directly here, and etcd's revision field is exactly the monotonic integer he asks for.
That is the full primitive set. No workflow engine, no Postgres queue table.
A single-file scheduler in Go
Here is the whole thing. It assumes a local etcd at `localhost:2379` — running `etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379` is enough to follow along.
```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

const (
	electionPrefix = "/cron/scheduler/leader"
	tickInterval   = 1 * time.Minute
	sessionTTL     = 10 // seconds
)

func tick(at time.Time, fence clientv3.LeaseID, rev int64) {
	// In a real scheduler the work writes a row keyed by the tick timestamp,
	// guarded by "WHERE existing.fence < $rev" so a delayed old leader cannot
	// overwrite a newer one's output.
	log.Printf("tick at=%s lease=%d fence_rev=%d",
		at.UTC().Format(time.RFC3339), fence, rev)
}

func runAsLeader(ctx context.Context, sess *concurrency.Session, rev int64) {
	t := time.NewTicker(tickInterval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-sess.Done():
			// Lease expired or session closed. We are no longer leader.
			return
		case now := <-t.C:
			tick(now, sess.Lease(), rev)
		}
	}
}

func main() {
	nodeID := os.Getenv("NODE_ID")
	if nodeID == "" {
		nodeID = fmt.Sprintf("node-%d", os.Getpid())
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatalf("etcd connect: %v", err)
	}
	defer cli.Close()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	for ctx.Err() == nil {
		sess, err := concurrency.NewSession(cli, concurrency.WithTTL(sessionTTL))
		if err != nil {
			log.Printf("session: %v — backing off", err)
			time.Sleep(2 * time.Second)
			continue
		}
		e := concurrency.NewElection(sess, electionPrefix)

		log.Printf("%s campaigning", nodeID)
		if err := e.Campaign(ctx, nodeID); err != nil {
			log.Printf("campaign: %v", err)
			sess.Close()
			continue
		}

		log.Printf("%s elected leader lease=%d rev=%d",
			nodeID, sess.Lease(), e.Rev())
		runAsLeader(ctx, sess, e.Rev())
		log.Printf("%s lost leadership lease=%d", nodeID, sess.Lease())
		sess.Close()
	}
}
```

Run it:

```
go run main.go
```
Start it twice in two terminals as `NODE_ID=a go run main.go` and `NODE_ID=b go run main.go`. One ticks every minute; the other prints `campaigning` and waits. Ctrl-C the leader, and within roughly the session TTL the runner-up logs its election and starts ticking.
Three lines deserve a closer look — the rest is plumbing.
`concurrency.NewSession(cli, concurrency.WithTTL(sessionTTL))` quietly does two things: it creates a lease and starts the keep-alive goroutine that renews it. With `sessionTTL = 10` seconds, the keep-alive runs every ~3.3 seconds, and the lease expires 10 seconds after the last renewal the server saw. Measured from the moment a process stalls, that puts expiry anywhere from ~6.7 to 10 seconds out, depending on where in the keep-alive cycle the stall landed.
`runAsLeader` selects on `ctx.Done()` and `sess.Done()`. The second channel closes when the session expires, which is the only signal that I have lost leadership without resigning. I have seen newcomers replace this with a periodic `IsLeader()` poll. That is wrong: between two polls the lease can expire and a new leader can run. The session's `Done()` channel is the source of truth.
`e.Rev()` is the fencing token I pass into `tick`. In a real scheduler I write the tick result into a downstream store with `INSERT (..., fence) VALUES (..., $rev) ON CONFLICT (key) DO UPDATE SET ... WHERE existing.fence < $rev`. A leader that stalled past its TTL and then woke up will still try to write — and the row's fence will reject the write because it carries a revision older than whatever the new leader has already committed.
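Spelled out as a sketch: the table `ticks` and its columns are assumptions for illustration, and `db` is a `*sql.DB` the scheduler above never opens.

```go
// writeTick persists one tick guarded by the fencing token. The DO UPDATE
// only fires when the incoming fence is newer than the stored one, so a
// stale leader's overwrite affects zero rows.
func writeTick(ctx context.Context, db *sql.DB, at time.Time, payload string, rev int64) error {
	const upsert = `
		INSERT INTO ticks (tick_ts, payload, fence)
		VALUES ($1, $2, $3)
		ON CONFLICT (tick_ts) DO UPDATE
		SET payload = EXCLUDED.payload, fence = EXCLUDED.fence
		WHERE ticks.fence < EXCLUDED.fence`
	res, err := db.ExecContext(ctx, upsert, at.UTC(), payload, rev)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		// The row already carries a newer fence: we are a stale leader.
		return fmt.Errorf("fenced out: rev %d is stale", rev)
	}
	return nil
}
```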
What happens when the leader dies mid-tick
The interesting failure mode is not the clean Ctrl-C. It is the leader that goes into a 12-second stop-the-world or loses its uplink at second 31 of a 60-second interval, mid-tick. The sequence below is what actually happens; the thing to watch for is the gap between L1's lease expiry and L2's first action.

- Second 30. Leader L1 begins tick #N at lease ID `0xAAA`, election revision 17.
- Second 31. L1 stalls. Its keep-alive goroutine cannot renew.
- Second ~41. The lease expires server-side; etcd deletes L1's election key. The watch wakes up the runner-up L2.
- Second 41–42. L2 acquires its own session at lease ID `0xBBB`, becomes leader at revision 19, and starts its ticker.
- Second 50. L1 wakes up. The keep-alive returns an error. `sess.Done()` closes. The `runAsLeader` loop exits, and L1 attempts to write the result of tick #N to the downstream store with fence revision 17.
- Second 50.001. The downstream store rejects the write because the latest row carries fence revision 19.
Two facts to remember from this. First, the failover window is bounded below by roughly TTL × 2/3 (a stall landing just before the next renewal was due) and above by roughly 2 × TTL once election latency and cluster turbulence are counted. With a 10-second TTL that is a window of roughly 7–20 seconds during which no node holds leadership. If the work cannot tolerate that gap, lower the TTL, but do not push it under 3× the realistic worst-case pause, including GC pauses and TLS handshake jitter on the runtime.
Second, the fence is what makes the late write safe. Without it, L1's stale tick #N silently overwrites L2's fresh tick #N+1. The etcd FAQ has been explicit about this since 3.5: revisions are the fencing token. `Election.Rev()` is what gets persisted alongside the work.
Trade-offs I would not skip naming
Wall-clock drift between leaders. Each leader runs its own `time.Ticker`. If L1 fires at :00 and L2 takes over at :12, L2's first tick lands at :72 — not at :00 of the next minute, unless I align the ticker to a wall-clock boundary. For interval-based work this rarely matters; for cron-expression-based schedules it does, and the fix is to sleep until the next aligned boundary at the top of `runAsLeader` rather than calling `NewTicker` immediately.
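A sketch of that alignment fix, meant to replace the `NewTicker` call at the top of `runAsLeader`:

```go
// Align the first tick to the next wall-clock boundary of tickInterval
// before starting the ticker, so a leader elected at :12 still ticks at
// :00 of the next minute rather than at :72.
next := time.Now().Truncate(tickInterval).Add(tickInterval)
select {
case <-ctx.Done():
	return
case <-sess.Done():
	return // lost the lease while waiting for the boundary
case <-time.After(time.Until(next)):
}
t := time.NewTicker(tickInterval)
defer t.Stop()
```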
`time.Ticker` drops ticks under load. Go's docs are explicit: if a receiver is busy when a tick fires, that tick is dropped, not buffered. For once-per-minute cron this is irrelevant; for sub-second work it is a real source of lost ticks.
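The drop is easy to demonstrate in isolation; the 100ms/250ms numbers below are chosen only to make the loss visible:

```go
// A 100ms ticker against a receiver that is busy for 250ms per tick
// delivers roughly 4 ticks per second, not 10: the ticker's channel
// holds one pending tick and later ones are silently dropped.
t := time.NewTicker(100 * time.Millisecond)
defer t.Stop()
deadline := time.After(1 * time.Second)
ticks := 0
for running := true; running; {
	select {
	case <-t.C:
		ticks++
		time.Sleep(250 * time.Millisecond) // simulate a busy receiver
	case <-deadline:
		running = false
	}
}
log.Printf("got %d ticks from a 100ms ticker in 1s", ticks)
```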
Lease auto-renew during etcd cluster turbulence. Issue #9888 on etcd-io/etcd describes a window during cluster-leader elections where leases get extended automatically by the new etcd leader. The TTL effectively widens during cluster turbulence. This is rarely a correctness bug — the fence still holds — but it is the reason failover during a real partition takes longer than the back-of-the-envelope 2 × TTL upper bound.
What this is not. The single-leader cron handles "run X every minute, no overlap, failover under 15 seconds." It does not give me cron-expression parsing, time-zone-aware schedules, business-hour logic, multi-tenant fairness, or per-job retry budgeting. Those are features I would write on top — or, at the point I find myself writing the third of them, switch to a real scheduler library.
Two tests worth keeping
Lease-renewal stall. Use `iptables -A OUTPUT -p tcp --dport 2379 -j DROP` (or `tc` for delay) on the leader, then wait 1.5 × TTL. Assert that within 2 × TTL the runner-up has won the election and started ticking, and that the original leader's `runAsLeader` exited via `sess.Done()`. This is the test that catches the "I forgot to listen on `sess.Done()` and used a polling boolean instead" bug.
Double-tick across split-brain. Run three nodes with TTL 5s. Drop the network between the leader and etcd for 2 × TTL + 2 seconds. Bring it back. Assert that the downstream store has exactly one row per tick interval, and that any duplicate writes were rejected by the fence column. This is the test that proves the fencing token is doing its job. If two rows exist, the downstream is missing the `WHERE existing.fence < $rev` guard.
I do not run either of these in a unit test — they need a real etcd. The first one I run inside a docker compose with three etcd nodes and a tiny pumba netem step. The second one needs a sidecar Postgres or whatever the fenced rows are landing in.
Where this stops being enough
The line where I would walk away from this design and pull in a workflow runtime is not "more than one job." Sharding ticks across leaders by job key is a 30-line addition: hash the job ID, modulo the leader count, store one election prefix per shard. The line is when individual ticks need durable, replayable state machines: multi-step workflows where a single tick spawns a saga that takes 20 minutes, talks to seven services, and must resume across process restarts without re-doing the side effects.
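A sketch of that sharding addition, assuming string job IDs and a fixed shard count; `shardFor` and `electionPrefixFor` are hypothetical helpers, and `hash/fnv` is the only new import:

```go
// Each shard runs its own election under its own prefix, so up to
// shardCount leaders tick concurrently. Jobs map to shards by a stable
// hash, so two leaders never own the same job.
const shardCount = 4

func shardFor(jobID string) int {
	h := fnv.New32a()
	h.Write([]byte(jobID))
	return int(h.Sum32() % shardCount)
}

func electionPrefixFor(shard int) string {
	return fmt.Sprintf("/cron/scheduler/shard-%d/leader", shard)
}
```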
That is what DBOS, Temporal, and Restate were built for. The leader-elected cron is what they sit on top of in their own deployments. The `dbos ontime` query lands the reader on the wrong floor of that stack — the answer is not a workflow engine, it is the primitive the workflow engine uses internally to schedule its own ticks.
When to use this and when to avoid it
Use it when:
- etcd is already in the operational footprint (Kubernetes, APISIX, a service mesh — same cluster, scoped key prefix).
- The unit of work fits inside one tick interval and is idempotent at the storage layer.
- A 6–15 second failover gap is acceptable.
- The schedule is a fixed interval or a small handful of cron entries.
Avoid it when:
- The job is multi-step, long-running, and needs durable replay. Use Temporal, DBOS, or Restate.
- Sub-second precision is required. The TTL math will not deliver it.
- etcd is not already operated. Standing up a cluster just to run cron is a poor trade against a Postgres advisory lock plus a row-per-tick guard.
Takeaways
- An etcd lease, `concurrency.Election`, and `sess.Done()` together cover exactly-once recurring ticks in under 100 lines of Go.
- The fencing token to persist is `Election.Rev()`. Without it, a stalled leader can silently overwrite a fresher one's output.
- The failover window is bounded by ~2 × TTL. Tune the session TTL against the realistic worst-case pause, not the average.
- Listen on `sess.Done()`, never poll a leadership boolean. Polling has a window where two nodes both think they hold the lock.
- The line where this stops being enough is durable multi-step workflows, not "more jobs." Add sharding before adding a workflow engine.