The Deterministic Backbone: Why Production AI Systems Are Moving Away From Fully Autonomous Agents
Fully autonomous agents are hard to bound, hard to test, and expensive to operate. A deterministic backbone with narrow agent steps gives you the control flow back while keeping the intelligence where it matters. Here is how to design, test, and migrate toward it.
A year of running "autonomous" agents in production has settled a debate that sounded philosophical in 2024 and sounds operational in 2026: if your control flow lives inside a model's context window, you do not have control flow. You have a suggestion.
The pattern that has won, quietly and across very different codebases, is not "bigger agent, better prompts." It is the opposite. A deterministic workflow backbone — a boring state machine, saga, or workflow engine — drives the execution, and intelligence gets invoked at specific, narrow steps that return control to the backbone the moment they finish. Anthropic's effective-agents guidance, JetBrains' Koog, Spring AI's Agent Harness, LangGraph, and Temporal's AI primitives all point in the same direction.
This post makes the case that this hybrid architecture should be your default for any AI feature going to production, and walks through what it buys you in testing, observability, cost, and incident response. It is written for backend engineers who have either shipped an agent and regretted it, or are about to.
The Problem
The 2024 default looked like this: give the model a system prompt, a set of tools, and a loop. The model decides what to call, in what order, with what arguments, and when it is done. It is seductive. It is also nearly impossible to operate past a certain level of traffic.
Three things tend to break.
The control flow is non-local. When a support agent ends up refunding the wrong order, the bug is not in any single tool. It is in the sequence the model chose, given a context that no longer exists. You cannot set a breakpoint in a prompt. You cannot diff a reasoning chain against last week's reasoning chain. Your "stack trace" is a transcript, and transcripts do not reproduce.
The cost is unbounded. An open-ended loop that can call tools until it decides to stop has no a priori token budget. A single pathological input — a document that confuses the planner, a tool that returns ambiguous errors — can burn 40x your p50 cost. Finance will notice before you do.
Testing degrades into vibes. Unit tests assume determinism. A loop that can take four different paths on the same input cannot be unit tested in any meaningful sense. Teams end up with "golden transcript" suites that pass 98% of the time and never catch the regression that matters, because the regression is a 2% tail.
None of these are prompt problems. They are architecture problems. The system has no seams.
The Approach
The deterministic backbone pattern inverts the question. Instead of asking "what is the smallest instruction set that lets the model accomplish this task?", you ask "what is the largest fraction of this task that does not need a model at all?"
The mental model is simple. Your workflow is a directed graph with typed state. Most nodes are plain code — database reads, API calls, validations, routing. A small number of nodes invoke an LLM, and each of those invocations is narrow: a single prompt, a single expected output shape, a bounded retry policy, and a hard token ceiling. When the node returns, the backbone owns the state again.
Three properties fall out of this, and they are the whole reason the pattern wins:
- The graph is inspectable. You can draw it. You can log every transition. You can replay it from state.
- Each agent step is a pure function of its inputs. It is not "the agent that runs the whole task." It is "the classifier that turns a support email into one of seven intents" or "the extractor that pulls line items from a PDF." These are testable.
- The expensive parts are local. Token cost, latency, and failure are concentrated at specific nodes. You can cache, fallback, or circuit-break each one independently.
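Making the third property concrete: because cost and failure live at named nodes, standard resilience patterns apply one node at a time. A minimal sketch of a per-node guard with a cache, a token budget, and a cheap fallback — the names (`GuardedClassifier`, `Classification`, the budget figure) are illustrative, not from the pipeline below:

```kotlin
// Hypothetical per-node guard: cache + budget + fallback, all local to one step.
class GuardedClassifier(
    private val llmClassify: (String) -> Classification,  // the expensive LLM call
    private val heuristic: (String) -> Classification,    // cheap deterministic fallback
    private val dailyTokenBudget: Long = 2_000_000,
) {
    private val cache = HashMap<String, Classification>()
    private var tokensSpent = 0L

    fun classify(text: String): Classification {
        cache[text]?.let { return it }          // identical input: free
        if (tokensSpent >= dailyTokenBudget) {
            return heuristic(text)              // budget blown: degrade, don't die
        }
        val result = llmClassify(text)
        tokensSpent += result.tokensUsed
        cache[text] = result
        return result
    }
}

data class Classification(val intent: String, val confidence: Double, val tokensUsed: Long)
```

None of this is possible when the model owns the loop, because there is no single call site to wrap.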
This is not a new idea. It is how every mature distributed system has handled "we need to call something unreliable" for twenty years. Sagas, outbox patterns, workflow engines — the AI case is the same shape with a new kind of unreliable dependency.
Technical Deep Dive
Let me make this concrete in Kotlin with Spring. The same pattern translates to Temporal, Flowable, LangGraph, or a hand-rolled state machine — what matters is the shape, not the framework.
Imagine a support triage pipeline: an email comes in, we classify it, we extract structured data if relevant, we route it to a queue, and we draft a reply only when confidence clears a threshold. The naive agentic version hands this entire task to one LLM with five tools. The backbone version looks like this:
```kotlin
sealed interface TriageState {
    data class Received(val email: Email) : TriageState
    data class Classified(val email: Email, val intent: Intent, val confidence: Double) : TriageState
    data class Extracted(val email: Email, val intent: Intent, val fields: Map<String, Any>) : TriageState
    data class Routed(val ticketId: TicketId, val queue: Queue) : TriageState
    data class Drafted(val ticketId: TicketId, val draft: String) : TriageState
    data class Escalated(val ticketId: TicketId, val reason: String) : TriageState
}

@Component
class TriageWorkflow(
    private val classifier: IntentClassifier, // narrow LLM step
    private val extractor: FieldExtractor,    // narrow LLM step
    private val router: Router,               // plain code
    private val drafter: ReplyDrafter,        // narrow LLM step
    private val tickets: TicketRepository,
) {
    fun run(email: Email): TriageState {
        val classified = classifier.classify(email) // bounded: 1 call, <800 tokens
        if (classified.confidence < 0.6) {
            return escalate(email, "low_confidence_classification")
        }
        val extracted = if (classified.intent.needsExtraction) {
            extractor.extract(email, classified.intent) // bounded: 1 call, strict schema
        } else {
            TriageState.Extracted(email, classified.intent, emptyMap())
        }
        val routed = router.route(extracted) // plain code, deterministic
        return if (classified.intent.autoReplyEligible) {
            drafter.draft(routed) // bounded: 1 call, template-anchored
        } else {
            routed
        }
    }
}
```

Every LLM call here is a function. Each has a typed input, a typed output, a timeout, a token ceiling, and its own retry policy. The control flow — the ifs, the escalation, the routing — is ordinary Kotlin. You can set a breakpoint on any line.
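Before a workflow engine enters the picture, that per-step contract can live in a small wrapper. A sketch, assuming kotlinx.coroutines for the timeout; `boundedStep` and its defaults are illustrative names, not a library API:

```kotlin
import kotlinx.coroutines.TimeoutCancellationException
import kotlinx.coroutines.withTimeout

// Illustrative contract for a narrow LLM step: bounded time, bounded attempts.
suspend fun <I, O> boundedStep(
    input: I,
    timeoutMs: Long = 10_000,
    maxAttempts: Int = 3,
    call: suspend (I) -> O,
): O {
    var lastError: Exception? = null
    repeat(maxAttempts) {
        try {
            return withTimeout(timeoutMs) { call(input) } // hard wall-clock ceiling per attempt
        } catch (e: TimeoutCancellationException) {
            lastError = e                                 // timed out: retry up to the cap
        }
    }
    throw IllegalStateException("LLM step failed after $maxAttempts attempts", lastError)
}
```

The point is that the bound is enforced by the caller, in code, not requested of the model in a prompt.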
The classifier is where the interesting trade-offs live:
```kotlin
@Component
class IntentClassifier(private val chat: ChatClient) {
    fun classify(email: Email): TriageState.Classified {
        val response = chat.prompt()
            .system(CLASSIFY_SYSTEM_PROMPT)
            .user(email.normalizedBody())
            .options { it.maxTokens(200).temperature(0.0) }
            .call()
            .entity(ClassificationResult::class.java)
        return TriageState.Classified(email, response.intent, response.confidence)
    }
}

data class ClassificationResult(val intent: Intent, val confidence: Double)
```

Three things worth pointing out. Temperature is zero, because this step is doing structured classification and non-determinism is a bug, not a feature. maxTokens is tight, because the output shape is small and a blown budget means the prompt drifted. The response is parsed into a typed object — if the model returns something off-schema, you get an exception at the node boundary, not a downstream corruption.
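That exception at the node boundary deserves a policy of its own. One common shape, sketched here with illustrative function names: retry once with the parse error fed back, then give up deterministically so the caller can escalate rather than guess.

```kotlin
// Sketch: one repair attempt on a schema violation, then a deterministic give-up.
fun classifyWithRepair(
    attempt: () -> ClassificationResult,             // first call, strict schema
    repair: (error: String) -> ClassificationResult, // second call, parse error in-prompt
): ClassificationResult? =
    try {
        attempt()
    } catch (first: Exception) {
        try {
            repair(first.message ?: "schema violation")
        } catch (second: Exception) {
            null // caller escalates; no silent default intent
        }
    }
```

Returning null (or a sealed failure type) keeps the "what do we do when the model is wrong" decision in the backbone, where it can be reviewed.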
For the workflow itself, I will usually reach for Spring StateMachine or a workflow engine once the graph has more than five or six nodes, branching, or any long-running steps. Temporal is particularly good here because retries, timeouts, and history replay come for free, and LLM steps are exactly the kind of flaky, expensive, timeout-prone activities Temporal was designed for. In Kotlin:
```kotlin
class TriageWorkflowImpl : TriageWorkflow {
    override fun run(email: Email): TriageState {
        val classified = Workflow.newActivityStub(
            ClassifierActivity::class.java,
            ActivityOptions.newBuilder()
                .setStartToCloseTimeout(Duration.ofSeconds(10))
                .setRetryOptions(RetryOptions.newBuilder().setMaximumAttempts(3).build())
                .build()
        ).classify(email)
        // ...
    }
}
```

Each LLM activity has a bounded timeout, a bounded retry count, and a replayable history. If a run fails halfway, Temporal resumes from the last completed activity, not from the start. That single property — deterministic replay on partial failure — is the reason I have stopped writing ad-hoc orchestration for anything that talks to a model in production.
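For completeness, the wiring side looks roughly like this, assuming the Temporal Java SDK against a local server; the task queue name and `ClassifierActivityImpl` are illustrative:

```kotlin
import io.temporal.client.WorkflowClient
import io.temporal.serviceclient.WorkflowServiceStubs
import io.temporal.worker.WorkerFactory

// Sketch of a worker process: register the workflow and the activity that
// actually talks to the model, then start polling the task queue.
fun main() {
    val service = WorkflowServiceStubs.newLocalServiceStubs()
    val client = WorkflowClient.newInstance(service)
    val factory = WorkerFactory.newInstance(client)

    val worker = factory.newWorker("triage-task-queue")
    worker.registerWorkflowImplementationTypes(TriageWorkflowImpl::class.java)
    worker.registerActivitiesImplementations(ClassifierActivityImpl()) // your LLM call lives here
    factory.start()
}
```

The separation matters: the workflow code must stay deterministic for replay, so every LLM call belongs in an activity, never in the workflow body.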
The trade-offs worth naming:
- You give up emergent behavior. An agent that can choose its own tool order sometimes finds a path you did not think of. The backbone will not. For well-understood workflows, this is what you want. For exploratory tasks, it is a real loss.
- You move complexity into the graph. Branches, guards, and escalation paths become your problem again. This is a feature — it is complexity you can see and review — but it is not free.
- You are more coupled to your domain. A single mega-agent is tantalizingly generic. A backbone is shaped like your business. That is a good thing at scale and a bad thing when requirements change weekly.
Pitfalls & Edge Cases
A few failure modes I have watched teams hit, usually in the first quarter after adopting the pattern.
Fake determinism. Teams wrap LLM calls in a workflow engine and declare victory, then leave temperature at 0.7 and set maxTokens to infinity. The backbone is deterministic; the steps are not; nothing is reproducible. If the step is doing structured work, pin temperature to 0 and constrain output with a schema. If it is doing generative work, accept the non-determinism and isolate it — do not pretend a creative step is a pure function.
Graph sprawl. The pattern rewards breaking things into small nodes, so teams break things into too many small nodes. Thirty-node graphs become unreadable for the same reason thirty-method classes do. Group related steps into sub-workflows, and resist the urge to model every branch as a node.
Smuggling an agent back in. One LLM node starts needing to "just call one tool to check something," then two, then it loops. Congratulations, you have an agent again, now hiding inside a node that pretends to be atomic. If a step needs to call tools, make the tool calls nodes in the graph. If that is impractical, acknowledge you have an agent step and give it an explicit budget and step limit.
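If you do decide a node is legitimately an agent, say so in the type and bound it explicitly. A sketch — `AgentAction`, `runOneStep`, and the budget numbers are illustrative:

```kotlin
// An acknowledged agent step with a hard step limit and token budget.
sealed interface AgentAction {
    data class Done(val result: String) : AgentAction
    object Continue : AgentAction // runOneStep executed a tool internally; loop again
}

fun runBoundedAgentStep(
    maxSteps: Int = 5,
    maxTokens: Long = 20_000,
    runOneStep: () -> Pair<AgentAction, Long>, // returns (action, tokens used this step)
): String? {
    var tokens = 0L
    repeat(maxSteps) {
        val (action, used) = runOneStep()
        tokens += used
        if (tokens > maxTokens) return null        // budget exceeded: escalate
        if (action is AgentAction.Done) return action.result
    }
    return null                                     // step limit hit: escalate
}
```

A null here flows into the same escalation path as a low-confidence classification — the backbone does not care why a node gave up, only that it did so within budget.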
Ignoring the escalation path. The whole value of the backbone is that it knows when to stop being confident. If every branch eventually feeds into "let the agent figure it out," you have built an expensive wrapper around the same autonomous loop you were trying to replace. Human handoff, fallback heuristics, and hard stops are first-class features, not afterthoughts.
Over-indexing on replay. Replay is powerful and seductive. It is also misleading when the underlying model has drifted. A transcript that replayed cleanly in January may classify the same input differently in April because the provider shipped a new checkpoint. Pin model versions in your activity options, and include the model identifier in your workflow history.
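Mirroring the illustrative ChatClient call from the classifier above, pinning looks like one extra option plus one extra field — the model identifier here is a placeholder, and whether your options builder exposes a model setter this way depends on your Spring AI version:

```kotlin
// Sketch: pin the model per step and carry the identifier into the typed result,
// so it lands in workflow history alongside the classification.
val response = chat.prompt()
    .system(CLASSIFY_SYSTEM_PROMPT)
    .user(email.normalizedBody())
    .options { it.model("provider/model-2026-01-15").maxTokens(200).temperature(0.0) }
    .call()
    .entity(PinnedClassificationResult::class.java)

data class PinnedClassificationResult(
    val intent: Intent,
    val confidence: Double,
    val model: String = "provider/model-2026-01-15", // recorded, not inferred
)
```

When a replay disagrees with history, the recorded identifier tells you immediately whether you are debugging your graph or the provider's checkpoint.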
Practical Takeaways
- Treat every LLM call as an unreliable, expensive, non-deterministic activity — because it is. Design around it the same way you would design around a flaky external API.
- Make the control flow ordinary code. Branching, retries, and escalation should be readable in your IDE, not inferred from a transcript.
- Pin temperature, max tokens, and model version at every step. A narrow step with loose knobs is a wide step in disguise.
- Put LLM steps behind typed interfaces with structured output. Schema violations should fail fast at the node boundary.
- Reach for a workflow engine (Temporal, Flowable, LangGraph, Spring StateMachine) once your graph branches. Ad-hoc orchestration ages badly.
- Reserve fully autonomous loops for tasks where you accept unbounded cost and latency in exchange for exploration — research, coding agents, long-horizon analysis. Not for customer-facing transactional workflows.
- Migrate incrementally. Wrap your existing agent as a single node in a new backbone, then pull behavior out of it one step at a time. You do not need a rewrite.
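The last takeaway can be sketched directly. The existing agent becomes one node behind the same typed interface as every other step, and behavior migrates out of it one branch at a time — `LegacyAgent`, its `handle` method, and `TicketId.pending()` are all illustrative names:

```kotlin
// Step 1 of the migration: the old autonomous loop becomes a single node.
class LegacyAgentNode(private val agent: LegacyAgent) {
    fun run(email: Email): TriageState {
        val outcome = agent.handle(email)          // the whole old loop, untouched
        return TriageState.Drafted(outcome.ticketId, outcome.reply)
    }
}

// Step 2, weeks later: the first narrow step is carved out in front of it.
class MigratingWorkflow(
    private val classifier: IntentClassifier,      // new deterministic gate
    private val legacy: LegacyAgentNode,           // shrinking remainder
) {
    fun run(email: Email): TriageState {
        val classified = classifier.classify(email)
        if (classified.confidence < 0.6) {
            return TriageState.Escalated(TicketId.pending(), "low_confidence")
        }
        return legacy.run(email)                   // everything else, for now
    }
}
```

Each carve-out shrinks the agent's blast radius while the external behavior stays the same, which is what makes the migration reviewable one pull request at a time.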
Conclusion
Fully autonomous agents are not wrong. They are a specific tool for a specific problem: exploratory, low-SLA, high-tolerance-for-cost tasks where the value of finding a novel path outweighs the cost of not being able to bound one. That is a real category. It is not most production features.
For everything else — triage, extraction, routing, moderation, summarization, the long tail of "LLM somewhere in the pipeline" — the deterministic backbone is the right default in 2026. Not because it is fashionable, but because it is the only architecture that gives you back the three things production systems cannot function without: reproducibility, bounded cost, and a place to put a breakpoint.
If you are starting a new AI feature this quarter, start with the graph. Add intelligence at the nodes that need it. If you are running an autonomous agent in production today and you are tired, the migration is less scary than it looks: wrap it, carve it, replace it.
Written by Tiarê Balbi