Cell-Based Architecture Isn't Free: What Slack, DoorDash, and Roblox Actually Paid For It
Cell-based architecture contains blast radius, but it is not free. A look at what Slack, DoorDash, and Roblox actually paid for cells in production — and a checklist for the cheaper fault-isolation patterns most teams should reach for first.
Cells are back on the roadmap. Every other resilience talk this spring has a slide that puts Slack, DoorDash, and Roblox in the same frame and concludes with "this is where we're headed." The pattern is real, the benefits are real, and the production write-ups are worth reading. But the cost of cells is also real, and in most of the rooms where this slide gets shown, the team is not ready to pay it.
A cell is a bulkhead with a budget. You pick a boundary — an availability zone, a customer shard, a workload type — and rebuild your stack so that every dependency inside that boundary is independent of every other cell. Routing sits on top. When a cell burns, it burns alone. That's the whole pitch, and it is compelling when you've just watched a single bad config propagate across every region you own.
The problem is that this boundary is load-bearing. Everything you add — a new service, a new database, a new background job — has to decide which side of the line it sits on, and the wrong answer quietly destroys the isolation you thought you had. That is the part the conference talks undersell.
What Slack, DoorDash, and Roblox actually fixed
Slack's move to a cellular architecture started after a series of AZ-level networking and control-plane incidents that turned a single zone's bad day into hours of degraded service for every customer. The fix was to make the front door — routing, Envoy, the session layer — aware of zones, then to align stateful services behind it so a zone failure drained cleanly. The headline metric is not "fewer failures" but "shorter blast radius": when one zone misbehaves, the damage stops at the zone boundary instead of crossing the network fabric.
DoorDash's story is adjacent but not identical. Their zone-aware routing work in Envoy came out of a cross-AZ traffic problem, not a pure isolation problem — cross-zone hops were adding latency and cost at a scale where both mattered. Isolation was a side effect of keeping traffic local. The lesson most teams take from DoorDash is "do cells"; the lesson they should take is "most of your traffic had no business leaving the zone in the first place."
Roblox's post-incident reshuffling was the harshest teacher. A cascading failure in a shared control-plane component took the platform offline for the better part of three days. The fix was deeper than routing: it involved breaking shared dependencies, changing how clusters were provisioned, and treating the blast radius as a first-class design target. Cells were part of the answer. So was aggressively pruning what the control plane was allowed to do.
The thing these three stories have in common is not "we built cells and everything got better." It is "we had a specific failure mode that a shared substrate made worse, and cellularizing the substrate was cheaper than hardening it."
Anatomy of a cell, and where the cost hides

A cell is three things stacked. A routing layer on top that knows which cell a request belongs to. A set of cell-local services and data underneath, with no cross-cell calls on the hot path. And a control plane off to the side that provisions cells, places tenants, drains traffic, and owns everything that genuinely has to be global.
Each of those three layers has a cost most roadmaps underestimate.
The routing layer is the part teams get most wrong. "Cell-aware routing" sounds like a config change. In practice it is a new fleet: you own the router, its deployment cadence, its observability, and — worst of all — its failure modes. A buggy router cancels every bit of isolation you paid for, because a request routed to the wrong cell is now a cross-cell dependency with no circuit breaker. Slack and DoorDash lean on Envoy for a reason: it's a mature data plane with a credible story for zone awareness. Rolling your own at a Kotlin/Spring gateway tier is a two-quarter project that looks like two sprints on a whiteboard.
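To make the "buggy router cancels the isolation" point concrete, here is a minimal sketch of a cell-aware router — names and shapes are hypothetical, not from any of the systems discussed. The one property worth copying is that it fails closed: an unknown tenant or an unhealthy cell is a hard error, never a silent fallback into some other cell, because a misrouted request is a cross-cell dependency with no circuit breaker.

```python
class CellRoutingError(Exception):
    """Raised instead of ever falling back to a different cell."""


class CellRouter:
    """Illustrative cell-aware router (hypothetical, in-memory)."""

    def __init__(self, tenant_to_cell, cell_endpoints):
        self.tenant_to_cell = tenant_to_cell    # tenant_id -> cell name
        self.cell_endpoints = cell_endpoints    # cell name -> endpoint
        self.unhealthy = set()

    def mark_unhealthy(self, cell):
        self.unhealthy.add(cell)

    def route(self, tenant_id):
        cell = self.tenant_to_cell.get(tenant_id)
        if cell is None:
            # Fail closed: no assignment means no routing decision.
            raise CellRoutingError(f"no cell assignment for {tenant_id!r}")
        if cell in self.unhealthy:
            # Draining and re-placement are the control plane's job,
            # not something the data path improvises per request.
            raise CellRoutingError(f"cell {cell!r} is draining")
        return self.cell_endpoints[cell]
```

A real router also has to handle the assignment map changing under it mid-flight, which is most of why this is a fleet and not a config change.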
The cell-local data layer is where the architecture diagram stops being honest. If every cell has its own database, you have just multiplied your operational surface by the number of cells. Backups, failovers, schema migrations, read-replica lag, capacity planning — all of it, per cell. Teams compensate by shrinking the blast radius of each cell, which means more cells, which means more data planes to operate. This is the cost most adoption checklists gloss over in a single bullet and is where the on-call load lives.
The control plane is the part where the pattern quietly leaks. A pure cell model says nothing crosses the boundary on the hot path, but the control plane still has to place tenants, move them between cells, and handle cross-cell features. Search, analytics, global counters, rate limits that span the whole user base — none of these want to live inside a cell. Most teams end up with a "global" tier bolted on, and that tier becomes exactly the shared fate the cells were supposed to eliminate. Roblox's postmortem is, in part, a story about this tier being underestimated.
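One way to keep that global tier from becoming the new shared fate is to enumerate its verbs and keep the list short. A toy sketch, with entirely hypothetical names, of a control plane that is only allowed to place tenants and drain cells:

```python
class ControlPlane:
    """Illustrative global tier with a deliberately small surface:
    place tenants, drain cells. Every extra capability added here
    widens the blast radius shared by all cells."""

    def __init__(self, cell_capacity):
        self.assignments = {}                 # tenant -> cell
        self.capacity = dict(cell_capacity)   # cell -> free slots

    def place(self, tenant):
        # Least-loaded placement: pick the cell with the most free slots.
        cell = max(self.capacity, key=self.capacity.get)
        if self.capacity[cell] == 0:
            raise RuntimeError("no cell has capacity")
        self.capacity[cell] -= 1
        self.assignments[tenant] = cell
        return cell

    def drain(self, cell):
        # Evacuate a failing cell by re-placing its tenants elsewhere.
        self.capacity[cell] = 0
        moved = [t for t, c in self.assignments.items() if c == cell]
        for tenant in moved:
            del self.assignments[tenant]
            self.place(tenant)
        return moved
```

Everything the article lists — search, analytics, global counters — would be tempted to live next to this code. The discipline is saying no.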
The cheaper things that usually come first
For a surprising number of systems, the right answer is not cells. It is one of the three patterns cells subsume, applied deliberately and without the operational tax.
Zone-aware routing is the first one. Keep traffic inside the zone it entered; fall back across zones only when the local target is unhealthy. This solves most of the "one AZ had a bad day" incidents that motivate cell talks, costs almost nothing if you already run a service mesh, and does not require you to rebuild your data layer. DoorDash's work is as much an argument for zone-aware routing as it is for cells.
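The whole policy fits in a few lines. A sketch of the selection rule, independent of any particular mesh (the tuple shape here is an assumption for illustration):

```python
def pick_endpoint(local_zone, endpoints):
    """Zone-aware selection: keep traffic in the zone it entered,
    and cross zones only when no local endpoint is healthy.

    `endpoints` is a list of (zone, address, healthy) tuples."""
    healthy = [e for e in endpoints if e[2]]
    if not healthy:
        raise RuntimeError("no healthy endpoints in any zone")
    local = [e for e in healthy if e[0] == local_zone]
    # Prefer the local zone; a real balancer would also weight by
    # capacity rather than taking the first candidate.
    return (local or healthy)[0][1]
```

In Envoy this is configuration on the load balancer rather than code you write, which is precisely why it is so much cheaper than cells.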
Shuffle sharding is the second. Instead of running every tenant on every node, you assign each tenant a small, pseudo-random subset of the fleet. A noisy neighbor or a bad tenant can only damage the small group of tenants whose subsets overlap with theirs. This is the pattern AWS has quietly used for years, and it gives you 80% of the blast-radius reduction of cells with a fraction of the operational surface. If your pain is "one customer can hurt all the others," shuffle sharding is almost always the cheaper answer.
Partitioned workloads are the third. A lot of cell adoption is really about separating workloads that should never have been sharing a cluster: batch jobs next to interactive traffic, webhook delivery next to user-facing APIs, a single Kafka consumer group handling three unrelated pipelines. Splitting these is cheap, eliminates a real class of incidents, and does not require a router, a control plane, or per-cell databases.
If you have not done these three, you are not ready for cells — you are ready for an easier win.
When cells are actually the right tool
The signals are specific enough to enumerate. You should consider cells when all of these are true, not just some.
Your availability target is high enough that a single-region, single-control-plane failure will breach your SLO on its own. If a three-hour AZ event is survivable for your business, you almost certainly do not need cells.
You have already exhausted zone-aware routing, shuffle sharding, and workload partitioning, and you still see incidents where one cohort's problems leak into everyone else's. The remaining leaks are in shared substrate, not application code.
You have a credible owner for the router and the control plane. Not a team that will build them as a side project, but a team whose roadmap and on-call rotation are explicitly "the cell infrastructure." Slack, DoorDash, and Roblox all have this team. Most companies citing their talks do not.
You can name the thing you want cell-local — request processing, customer data, a specific pipeline — and the thing you are comfortable leaving global. If every service wants to be cell-local, you are describing a region, not a cell, and you should just run another region.
You are willing to pay a latency and cost tax on cross-cell features forever. Global search, global analytics, cross-tenant features, anti-abuse — these will be harder, slower, or more expensive. If leadership is surprised by this in year two, the program will be rolled back.
If you are going to do this anyway
Start with one cell per availability zone, not one cell per customer. Zone-sized cells match your existing failure domains, let you reuse your service mesh for routing, and give you a working reference implementation without the "how many cells?" argument. Grow finer-grained cells later, only for the workloads that demonstrably need them.
Build the router and the control plane as products, not as plumbing. Give them an owner, an SLO, and a runbook. Treat "a request was routed to the wrong cell" as a severity incident, because it is: it is the one failure mode that silently erases the whole pattern's value.
Write the cross-cell features down on day one. Search, analytics, billing, anti-abuse, global rate limits, admin tools — list them, decide which tier owns them, and accept that this is where the complexity went. If the list is empty, you are lying to yourself.
Measure blast radius, not uptime. Cells rarely reduce the number of incidents; they reduce the share of users affected per incident. If your dashboards cannot show "percent of traffic degraded during the last incident," you will not be able to tell whether the program is working.
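The metric itself is trivially computable once you tag traffic by outcome; the hard part is deciding to report it. A sketch of the per-incident calculation (the request shape is an assumption for illustration):

```python
def blast_radius(requests):
    """Blast radius for one incident window: the share of traffic
    degraded, not whether the service was 'up'.

    `requests` is a list of (cell, ok) pairs observed in the window."""
    if not requests:
        return 0.0
    degraded = sum(1 for _cell, ok in requests if not ok)
    return degraded / len(requests)
```

A cell program that is working shows this number shrinking toward one cell's share of traffic, even while the raw incident count stays flat.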
Takeaways
- Cells are a bulkhead at scale. The benefit is contained blast radius; the cost is a router, a control plane, and per-cell operational surface that grows with the number of cells.
- The Slack, DoorDash, and Roblox stories are about specific shared-substrate failure modes. Copy the pattern only if you have the same failure mode.
- Zone-aware routing, shuffle sharding, and workload partitioning solve most of what motivates cell adoption, at a fraction of the operational cost.
- Cross-cell features — search, analytics, global counters, identity — are where the complexity silently relocates. If they are not named on day one, they will eat the program.
- Start with zone-sized cells, own the router and control plane as products, and measure blast radius as a first-class SLO.
Reach for cells when you have an availability target that a single shared substrate cannot meet, a team to own the cell infrastructure as a product, and a tolerance for permanent cross-cell friction. Avoid them when a smaller set of patterns would plausibly close the gap, or when "we need cells" is being driven by the conference circuit instead of by your incident history.
Written by Tiarê Balbi