Research · 15 min read

AI Prompts: How Good and How Bad They Are — Opening a New Line of Research

An honest look at where prompts work, where they quietly fail, and the assumption we stopped questioning — that AI must make mistakes. The opening shot of a research line on moving from "best effort" to specifiable, measurable precision.

Tiarê Balbi Bonamini · Software Engineer · Vancouver

There is a quiet sentence repeated in every meeting, every demo, every architecture review: "the model may make mistakes." The industry says it like a disclaimer, like a weather warning, and has grown so comfortable with it that the phrase now functions as load-bearing infrastructure for everything else — a permission slip that lets engineers ship, integrate, and trust systems that they also, in the same breath, refuse to fully trust.

This post is the public start of a research line I want to drive over the coming months. It's not a tutorial, it's not a benchmark, and it's not a hot take. It's a thesis statement. The question I want to live with is simple to state and difficult to answer:

What would it take for the field to stop assuming AI will make mistakes — and start engineering for precision instead?

Before I lay out where I want to go, I want to honestly survey where the field is. Prompts today are extraordinary. Prompts today are also unreliable in ways the industry still doesn't measure well. Both things are true, and the gap between them is the territory I want to explore.

The good: what prompts actually deliver today

It's easy to forget how much a single prompt is being asked to do. One block of natural language is, simultaneously, an API contract, a task specification, a context window, a persona definition, a safety boundary, and a coordination protocol for tools. That this works at all is one of the most surprising results in recent software history.

Here is what I think prompts genuinely deliver today, and where I refuse to be cynical about them:

Context narrowing. A prompt is the cheapest mechanism ever invented to take a system trained on the entire internet and aim it at this customer, this ticket, this invoice. Retrieval-augmented generation, when done well, drops hallucination rates by 75–90% compared to closed-book generation. That is not a tweak — it is an architectural primitive. The field discovered that you can fix a huge class of failure modes not by retraining the model, but by improving what you put in front of it.
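
To make the primitive concrete, here is a minimal sketch of the pattern. `retrieve` and `llm` are hypothetical stand-ins, not any particular library:

```python
# Context narrowing via retrieval (sketch; retrieve() and llm() are hypothetical).
# The model is not asked to recall facts; it is asked to answer from supplied passages.
def answer_with_retrieval(question: str, retrieve, llm, k: int = 5) -> str:
    passages = retrieve(question, top_k=k)  # deterministic search over your own corpus
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using ONLY the passages below. "
        "If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)
```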

Task completion through composition. The shift from "prompt the model" to "program the model" is already underway. Frameworks like DSPy explicitly reframe prompts as typed modules with input/output signatures, optimizable end-to-end, and composable into pipelines. The Stanford group behind DSPy has been pushing this hard: stop tinkering with strings, start declaring contracts. I think this is where the field is genuinely heading, and the early results — neural-symbolic pipelines combining LLMs with formal solvers — are reporting accuracy gains of up to 43% over plain chain-of-thought.
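
For a feel of what "typed module" means in practice, here is a DSPy-style sketch. Exact configuration varies by version, and the field names are illustrative:

```python
import dspy

# A prompt expressed as a typed module rather than a string (DSPy-style sketch;
# you still need to configure a model backend, e.g. dspy.configure(lm=...)).
class TicketTriage(dspy.Signature):
    """Classify a support ticket and extract the product it refers to."""
    ticket: str = dspy.InputField()
    category: str = dspy.OutputField(desc="one of: billing, bug, feature_request")
    product: str = dspy.OutputField()

triage = dspy.ChainOfThought(TicketTriage)
# result = triage(ticket="I was charged twice for the Pro plan last month.")
# result.category and result.product are declared fields, not substrings to parse out.
```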

Structured output, on demand. Constrained decoding (XGrammar, llguidance, Outlines) means that when I ask for JSON, I can now guarantee JSON — not "usually JSON, except on Tuesdays." XGrammar runs at under 40 microseconds per token and, as of March 2026, is the default structured-generation backend for vLLM, SGLang, and TensorRT-LLM; XGrammar-2 followed in May 2026 with a focus on dynamic structures for agentic flows. llguidance is the foundation OpenAI publicly credited for its Structured Outputs feature. This is a precision win worth celebrating: at the syntactic level, the problem is essentially solved.
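
As an illustration of how that guarantee gets expressed, here is a sketch in the Outlines 0.x style. The library's API has shifted across releases, so treat the names as indicative rather than exact:

```python
import outlines
from pydantic import BaseModel

class Invoice(BaseModel):
    vendor: str
    total_cents: int
    currency: str

# Constrained decoding sketch: the decoder can only emit tokens that keep the
# output consistent with the Invoice schema, so the JSON cannot be malformed.
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generate_invoice = outlines.generate.json(model, Invoice)
# invoice = generate_invoice("Extract the invoice fields from: ...")
# `invoice` comes back as a validated Invoice instance, not a string to be repaired.
```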

Tool use and agentic flows. Function calling, MCP servers, agent frameworks — the model can now reach out to deterministic systems for the things it shouldn't be guessing about. Tool grounding reduces hallucinations by 65–80%. That is huge. Every time a calculation, a date lookup, a database query, or a file read is delegated to an actual tool, the architecture is removing a category of error rather than tolerating it.
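
The mechanics are mundane, which is part of the appeal: a tool is declared as a schema the model can propose calls against, and the call itself runs in deterministic code. A sketch in the common JSON-schema function-calling shape (field names follow the OpenAI-style convention; other APIs are close):

```python
# A tool definition in the common JSON-schema function-calling shape.
lookup_order_tool = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Fetch an order record by its ID from the orders database.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The order identifier."}
            },
            "required": ["order_id"],
        },
    },
}
# The model proposes a call such as {"name": "lookup_order", "arguments": {"order_id": "A-1042"}};
# deterministic code executes it and returns the ground-truth record for the model to use.
```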

In-context learning as a primitive. A handful of well-chosen examples can re-shape behavior in ways that used to require fine-tuning. For exploratory work, for one-off tasks, for evolving requirements, this is a genuinely new kind of programming.
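
A sketch, with made-up examples, of how little machinery this takes:

```python
# Few-shot prompting: behavior shaped by examples instead of weight updates.
EXAMPLES = [
    ("Refund took three weeks and I am still waiting.", "negative"),
    ("Setup was painless and support replied in minutes.", "positive"),
]

def few_shot_prompt(review: str) -> str:
    shots = "\n\n".join(f"Review: {r}\nSentiment: {s}" for r, s in EXAMPLES)
    return f"{shots}\n\nReview: {review}\nSentiment:"
```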

If you came of age writing imperative code, the experience of writing a prompt that generalizes is uncanny. You describe the what, not the how, and the system fills in the rest. That is real power, and I don't want to talk about the failure modes without first acknowledging it.

The bad: where prompts quietly break

And yet. The same property that makes prompts feel magical — natural language as the interface — is the same property that makes them brittle. Here is where I see prompts breaking today, ordered roughly from "well known" to "rarely talked about":

Brittleness to phrasing. This is the big one, and it is much worse than the field acknowledges in public. The Brittlebench and Multi-Prompt Evaluation lines of work show that semantics-preserving perturbations — rewording, reordering, reformatting — can shift performance by up to 12%, and a single perturbation flips the relative ranking of models in 63% of benchmark cases. Read that again: the answer to "which model is best at this task?" depends, more often than not, on how the question happened to be phrased. Benchmark culture papers over this by reporting one number per model per task, but the underlying distribution is wide and bimodal.

Hallucination, even in 2026. The headline numbers look good — frontier hallucination rates are 3–8× lower than 2024 baselines. Some published comparisons put Claude in the low single digits and other frontier models in the 8–12% range on factual summarization tasks; on harder knowledge benchmarks like AA-Omniscience, the spread is much wider. That is real progress. But the same data shows that prompt-only mitigations cap out at around –15%. The big wins (RAG, tool grounding) are architectural, not prompt-engineering wins. The industry has been collectively over-rewarding "prompt hacks" relative to what the data shows about where reliability actually comes from.

Non-determinism as a feature, not a bug. Even at temperature zero, the same prompt against the same model on the same day can produce different outputs across runs, because batching, hardware, and inference-time optimizations introduce small numerical differences that compound through sampling. Evaluating reliability requires running the same task across hundreds of trials and measuring variance — yet most production systems treat one good response as evidence the prompt "works."
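
Measuring this is not hard; it is just rarely done. A sketch, with a hypothetical `llm` callable (in practice you would also pin the model version, temperature, and tool set):

```python
from collections import Counter

# Reliability as a distribution, not a single run: how often does the same prompt
# agree with its own modal output across repeated trials?
def self_agreement(llm, prompt: str, trials: int = 200) -> float:
    outputs = [llm(prompt) for _ in range(trials)]
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / trials

# A prompt that "worked" on the one run you looked at may agree with itself
# only 80% of the time under identical settings.
```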

Compounding errors in multi-step systems. A 95%-reliable step is fine. Twenty of them in a row is 36%. Agentic flows multiply per-step errors, and the field still doesn't have widely accepted eval frameworks that measure the trajectory, not just the final output. The measurement problem is ahead of the tooling — and the tooling is what people actually use to decide whether to ship.
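
The arithmetic is worth writing out, because it is the whole argument in three lines:

```python
# Compounding per-step reliability across a multi-step agentic flow.
per_step_success = 0.95
steps = 20
end_to_end = per_step_success ** steps
print(f"{end_to_end:.1%}")  # ~35.8%: twenty "fine" steps make an unreliable system
```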

Instruction collisions. As prompts grow — system prompts, developer prompts, tool descriptions, retrieved context, user messages — instructions start to contradict each other. The model resolves these collisions in ways no one can fully predict. Worse, the resolution depends on order, recency, and phrasing in ways that change with each model version. The field has no good vocabulary for this. There is no git diff for prompt behavior.

Verification failure under composition. A recent study on integrating LLMs with formal methods found that while single-function correctness reaches over 99% syntactic accuracy, compositional tasks degrade to 95.67% syntax correctness and a catastrophic 3.69% full verification rate, with the best model reaching only 7% on Pass@8. Single functions are tractable. Composing them with guarantees is not.

The "best effort" cultural default. This is the deepest one. The industry has collectively agreed, without ever debating it, that "best effort" is the right contract between humans and language models. The model tried, it usually got it right, and the human is responsible for catching the cases where it didn't. That is a reasonable transitional default. It is a terrible terminal state.

The assumption no one questions anymore

Here is the table I want to turn.

When a database returns the wrong row, engineers don't shrug and say "well, databases sometimes make mistakes." They file a bug. They open an incident. They expect a postmortem. The system is held to a precision contract, and violations of that contract are treated as defects.

When a language model returns the wrong row, the response is "the model can hallucinate, please verify." An entire profession has been built — "AI engineer," "prompt engineer," "eval engineer" — around the assumption that the system is fundamentally non-deterministic and therefore fundamentally fallible. Verification is offloaded onto the human or onto a downstream check. The system is held to a best-effort contract.

I am not arguing that this assumption is wrong today. It is empirically correct — the systems that exist do make mistakes, frequently, in ways no one can fully predict. I am arguing that it has stopped being interrogated. The field has internalized it as a permanent property of "AI" rather than a contingent property of "the LLMs that happen to be available in 2026."

A few things have changed recently that make me think this assumption is now load-bearing in a way it shouldn't be:

  1. Constrained decoding works. At the syntactic level there is proof that the field can move from "usually correct" to "provably correct." There is no theoretical reason the same shift cannot happen at higher semantic levels — it just requires the right scaffolding.
  2. Tool grounding works. Pushing answers out of the model and into deterministic systems gives a clean factoring: the model proposes, the tool decides. Most "hallucinations" in well-designed agentic systems are now planning failures, not factual ones.
  3. Programmatic prompting works. DSPy and its descendants show that when prompts are treated as compilable artifacts with typed signatures, they can actually be optimized and verified — rather than vibe-tuned.
  4. The economics are aligning. Every customer I talk to has the same complaint: "I can build a demo in an hour, but I can't ship to production because I can't guarantee anything." There is now real economic pressure to close the precision gap.

The conditions for a phase change are present. The vocabulary for the phase change is not.

What "precision" could mean for language models

I want to be careful here. "Precision" is doing a lot of work in this post, and I owe it a definition.

When I say I want to move from best effort to precision, I do not mean determinism in the strict sense. I do not mean that every prompt must produce exactly one token sequence. Some tasks are inherently open-ended: creativity, summarization, and ideation all benefit from variance and would be impoverished by removing it.

What I mean is something more like: the behavior of a prompt should be specifiable, measurable, and bounded. Concretely, I'd like to be able to say things like:

  • "For inputs matching this schema, this prompt produces an output satisfying this contract with probability ≥ 0.999."
  • "Across 100 semantics-preserving rewordings of this prompt, output variance on these metrics is ≤ ε."
  • "This prompt, this model version, this temperature, this tool set — if any of these change, behavior is re-verified."

That is not science fiction. That is what engineering already does for software components. Software has type systems, property-based tests, contract tests, versioning, CI. Prompts have none of these in a serious, standardized way.

Today, the closest analog is: write the prompt, run it on a handful of examples, eyeball the outputs, ship. That is not engineering. That is artisanal craft, and while there is real skill in it, it does not scale to the systems being built today.

What precision could look like, concretely, in the medium term:

Prompt contracts as first-class artifacts. A prompt has typed inputs, typed outputs, pre-conditions, post-conditions, and assertions. Tooling enforces them at compile time and at runtime. DSPy is the most serious move in this direction. Mellea and similar projects are reaching for the same thing.
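
A hypothetical sketch of what such a contract could look like when it is checked rather than hoped for. All names are illustrative, and the pass threshold is the thing a team would actually negotiate:

```python
from pydantic import BaseModel, ValidationError

# A prompt contract: typed output plus post-conditions, estimated over many
# trials instead of eyeballed once. `llm` is a hypothetical callable.
class TriageResult(BaseModel):
    category: str
    confidence: float

ALLOWED_CATEGORIES = {"billing", "bug", "feature_request"}

def satisfies_contract(raw_output: str) -> bool:
    try:
        result = TriageResult.model_validate_json(raw_output)
    except ValidationError:
        return False
    return result.category in ALLOWED_CATEGORIES and 0.0 <= result.confidence <= 1.0

def estimated_contract_rate(llm, prompt_template: str, inputs: list[str], trials: int = 20) -> float:
    checks = [
        satisfies_contract(llm(prompt_template.format(ticket=x)))
        for x in inputs
        for _ in range(trials)
    ]
    return sum(checks) / len(checks)  # compare against the target, e.g. >= 0.999
```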

Behavioral diffs. A change to a prompt should produce a diff not just at the string level but at the behavior level — "this rewording changes outputs on 3% of historical inputs, and here are the deltas." This is the missing primitive that would let prompts be treated as software.
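
A minimal sketch of the primitive, assuming a corpus of historical inputs and a task-specific notion of equivalence:

```python
# Behavioral diff between two prompt versions over historical inputs.
# Exact string comparison stands in for a task-specific equivalence check.
def behavioral_diff(llm, prompt_old: str, prompt_new: str, historical_inputs: list[str]):
    deltas = []
    for x in historical_inputs:
        old_out = llm(prompt_old.format(input=x))
        new_out = llm(prompt_new.format(input=x))
        if old_out != new_out:
            deltas.append((x, old_out, new_out))
    changed_fraction = len(deltas) / len(historical_inputs)
    return changed_fraction, deltas  # "this rewording changes 3% of outputs, and here they are"
```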

Span-level verification. The REFIND SemEval 2025 work points at the right shape: every claim a model makes is matched against retrieved evidence, and unsupported claims are flagged. This should become the default UX for any system that produces factual outputs.

Process supervision, not just outcome supervision. For agentic flows, the right unit of evaluation is the trajectory — was the right tool selected, was the plan coherent, did the reasoning hold — not just the final answer. The measurement problem here is open; it is one of the most important open problems in applied AI right now.

Domain-bounded models. A general model with a 10% hallucination rate on factual queries cannot be the foundation of a clinical system, a legal system, or a financial system. The right architecture is probably a tightly scoped, tightly evaluated model whose domain is small enough to characterize its failure modes — composed into a larger system through explicit interfaces.

None of this is a single breakthrough. All of it is the slow accumulation of engineering practice that databases got in the 1970s, networks got in the 1980s, and the web got in the 2000s. The field is at the 1970s phase of language model engineering. There is no equivalent of ACID yet.

The research questions I want to drive

This post is the start of a line of research. Here are the specific questions I want to investigate over the next stretch of time, in roughly the order I plan to take them on:

1. How fragile, in practice, are the prompts that actually ship?

Most published brittleness research uses academic benchmarks. I want to take the kinds of prompts that show up in real production systems — agents, RAG pipelines, classification flows — and measure how they shift under realistic perturbations. The hypothesis I want to test: most production prompts are far more fragile than their authors believe, and the fragility is concentrated in predictable places.

2. What is the right unit of versioning for prompt behavior?

Today the field versions prompt strings. That is the wrong unit. The right unit is closer to "the behavior of this prompt on a representative input distribution." Can tooling let a developer commit a prompt and automatically get back a behavioral fingerprint — and a diff against the previous fingerprint? What would such a tool look like in practice?

3. Where is the boundary between "specify in language" and "specify in code"?

DSPy's bet is to write code that generates prompts, not prompts directly. Constrained decoding's bet is to write grammars that constrain output. Tool use's bet is to write functions that the model calls. Each of these is a place where natural language gives way to formal specification. I want to understand, for a given task class, where the right boundary is — and develop heuristics for drawing it.

4. Can "the model might make a mistake" move from a system property to a quantified, contractual property?

This is the big one. Can a prompt come with an SLA? Not "best effort," but "for inputs in this class, error rate ≤ X, measured continuously." What would the infrastructure look like to deliver and enforce that — for a single prompt, and then for compositions of prompts?

5. What does precision cost, and is it worth paying?

Every move toward precision — constrained decoding, retrieval grounding, span verification, tool delegation, behavioral testing — costs latency, cost, complexity, and sometimes capability. For which tasks is the tradeoff worth it, and how do teams decide? I suspect there is a useful taxonomy here that the field doesn't yet have a vocabulary for.

What you'll see from me on this

I'm planning to share work in this area in a few forms over the coming months:

  • Field reports — concrete experiments measuring prompt fragility, with the methodology open and the results honest, including the parts that don't work.
  • Pattern write-ups — when a specific approach (a contract, a verification step, a decoding strategy) actually moves a system from "best effort" to "specifiable," I want to document it precisely enough for others to apply.
  • Position pieces — like this one, where I'm staking out an opinion about where the field should be heading and inviting disagreement.
  • Tooling sketches — I expect some of this work to produce small open tools. When it does, they'll be in the open from day one.

I am intentionally not promising a paper, a product, or a framework. I am promising sustained attention to a problem I think the industry has stopped looking at clearly.

The reason this matters

Every once in a while, an industry develops a piece of received wisdom that is actually a load-bearing assumption mistaken for a law of nature. "Websites need to be served from a single physical location." "Databases can't be both consistent and available under partitions." "Software ships with bugs and that's just how it is."

Some of those assumptions turn out to be true. Some turn out to be temporary, and the people who notice early — and build the alternative — define the next decade.

I think "AI makes mistakes" is in the second category. Not because the current generation of models won't make mistakes — they will, often, and I want to be honest about that — but because the assumption that mistakes are intrinsic, rather than a property of how these systems are currently being built, is starting to look like a failure of imagination.

The good news is that the path forward doesn't require any single dramatic breakthrough. It requires the same boring, accumulated engineering discipline that every other branch of computing has gone through. Contracts. Tests. Diffs. Versioning. SLAs. Composability. Verification.

The bad news is that none of this is glamorous, and very little of it shows up in launch posts.

I'd like to spend the next while building, measuring, and writing about the unglamorous parts. If that interests you — or, especially, if you disagree with any of the above — I'd love to hear from you. Research lines get better in public.

This is the opening shot. More to come.
