Code Graphs for Coding Agents: The Delivery Shape Matters More Than the Algorithm
I spent a weekend pointing a coding agent at a 480k-line Go monorepo and watching it grep-loop through 38 tool calls on one question. AST-derived code graphs fix that, but the delivery shape — local stdio MCP, remote service, or skill — changes the economics more than the graph algorithm does. Here is where I would put one in 2026, with a minimal Go indexer I can drop next to the agent.
I spent a weekend pointing a coding agent at a 480k-line Go monorepo and watching it lose. The agent grep-looped for 38 tool calls trying to answer one question: if I change the signature of BillingClient.Charge, which handlers break? It found the method. It read the file. It missed three callers because their package names had nothing in common with "billing". Total cost: under three dollars. Total useful output: a near miss.
That afternoon is what this post is about. The pitch for coding agents on a 5k-line repo is honest. The pitch on a real codebase is dishonest by default — agents flatten structure into vector similarity and grep loops, and structure is exactly what you need to answer the question.
Code graphs fix this. The 2026 question is not whether to build one. The arguments for that are settled past about 50k lines. The question is where to put it, because the delivery shape changes the economics more than the graph algorithm does.
Two failure modes do most of the damage
The first is context flattening. Vector RAG finds chunks that look lexically similar to the query. Ask "which controllers call ShoppingCartService" and you get back ShoppingCartService plus other service classes whose text resembles it. The controllers that depend on it never enter top-k because their text never mentions shopping carts. The DKB paper on graph-RAG for codebases reproduces this on Shopizer, ThingsBoard, and OpenMRS Core, and the gap doesn't close with bigger k.
The second is multi-hop blindness. "If I change this interface, what breaks?" needs a traversal, not a similarity score. An agent can simulate a traversal with grep — but the cost is linear in repo size, every hop costs a tool call, and the agent often gives up after the third level.
A precomputed graph collapses both. Build it once from the AST, store the edges, traverse on demand. The interesting choice is where that store lives.
The options on the table
There are four ways to give a coding agent structural awareness over a large repo, plus three no-structure baselines I'll dismiss quickly.
The baselines, for completeness. Long-context dumping (drop the whole repo into a 1M-token window) burns money on every query and only works once. Naive vector RAG finds something fast and finds the wrong thing on multi-hop questions. Pure agentic grep is what the agent does by default and what I am trying to escape.
The real options:
| Delivery shape | What it is | Where it lives | Cold-start | Per-query cost |
|---|---|---|---|---|
| Skill only | A SKILL.md teaching the agent to use rg, ast-grep, and tree-sitter CLI competently | Context, on demand | Zero | Bounded by primitives; still O(repo) for "who calls X" |
| Local stdio MCP | Tree-sitter → SQLite indexer that ships as a single binary and runs next to the agent | Developer machine | Seconds for a medium repo | Sub-millisecond on warm queries |
| Remote MCP | Centralized graph (Apache AGE, FalkorDB, Neo4j) reached over HTTP/SSE | A server somewhere | Seconds to minutes | Network hop plus lookup |
| Adaptive (skill + local MCP) | Skill routes between primitives and the MCP based on the question | Developer machine | Same as local | Cheap by default, escalates on demand |
This post is about picking between rows two, three, and four. The skill-only baseline is the right answer often enough to mention, and never enough to recommend by itself.
The numbers that actually decide it
Two recent papers ran the same workload through these paradigms and published every number you need.
On indexing cost, measured on Shopizer (around 1,200 Java files), the DKB authors clock the AST-derived knowledge graph at 2.81 seconds end to end. An LLM-extracted knowledge graph over the same repo takes 200.14 seconds, and — this is the part nobody quotes — the extraction is probabilistic. The per-file success rate is 0.69. Out of around 1,200 files, 377 are silently dropped. Your "knowledge graph" is missing a third of the codebase and the agent has no way to know which third.
On end-to-end cost across fifteen architecture and code-tracing queries, also from the DKB paper, the spread is:
- Naive vector RAG: $0.04
- AST-derived graph: $0.09
- LLM-extracted graph: $0.79
On the larger OpenMRS-core + ThingsBoard workload, the same ratio holds: naive $0.149, AST graph $0.317, LLM graph $6.80. The AST approach costs roughly 2× the naive baseline and roughly a twentieth of the LLM approach, and answers more multi-hop questions correctly than either.
On query latency, the Codebase-Memory paper measures BFS traversal over a SQLite recursive CTE at the sub-millisecond mark on warm queries. The same evaluation reports 99% fewer tokens than file-exploration baselines on equivalent questions. The numbers feel optimistic until you read the queries — they are exactly the questions a coding agent asks, just answered against a table instead of a corpus.
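The traversal itself is one query. Below is a minimal sketch of that recursive-CTE lookup as a Go helper. The function name, the depth cap, and the edges(caller, callee) schema are my own choices, matching the indexer later in this post rather than anything from the paper, and the snippet is meant to slot into a file that already imports database/sql.
// transitiveCallers walks the call graph upward from fqn, returning every
// function that can reach it within maxDepth hops. The UNION dedupes
// (fn, depth) rows and the depth cap bounds the recursion even on cycles.
func transitiveCallers(db *sql.DB, fqn string, maxDepth int) ([]string, error) {
	rows, err := db.Query(`
		WITH RECURSIVE reach(fn, depth) AS (
			SELECT caller, 1 FROM edges WHERE callee = ?
			UNION
			SELECT e.caller, r.depth + 1
			FROM edges e JOIN reach r ON e.callee = r.fn
			WHERE r.depth < ?
		)
		SELECT DISTINCT fn FROM reach`, fqn, maxDepth)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var out []string
	for rows.Next() {
		var fn string
		if err := rows.Scan(&fn); err != nil {
			return nil, err
		}
		out = append(out, fn)
	}
	return out, rows.Err()
}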
On coverage, the AST graph covers 0.90 of the code chunks on Shopizer with 1,158 nodes. The LLM-extracted graph covers 0.64 with 842 nodes. The gap is not algorithmic taste; it is the stochastic extractor failing in predictable ways.
The honest takeaway on the algorithm question: code already has a deterministic parser. Use it. Save LLM-driven graph extraction for the corpora that lack one — prose, PDFs, support tickets.
The cross-cut nobody talks about: MCP servers eat your context window
The thing that surprised me most isn't in either paper. It's in two posts that landed within months of each other: Anthropic's Code Execution with MCP on November 4, 2025, and Cloudflare's Code Mode update in April 2026. Both arrive at the same conclusion from different sides.
Tool definitions for a connected MCP server sit in the model's context before the first user message. A typical five-server developer setup adds around 55,000 tokens of definitions before anyone says hello. Cloudflare's measurement on the Cloudflare MCP — 2,500+ API endpoints — comes in at 1.17 million tokens of definitions if loaded eagerly. The fix in both posts is the same: present the MCP tools as a code API, let the agent discover and call them by writing small scripts in a sandbox, and cut the eager context to a stub. Anthropic measures 150,000 → 2,000 tokens on their canonical example (98.7% reduction). Cloudflare measures 1.17M → ~1,000 tokens on theirs (99.9%).
Why does this matter for code graphs specifically? Because a remote graph MCP that exposes ten tools (find_callers, find_callees, class_hierarchy, impact_radius, …) lands tens of thousands of tokens in context whether you query it or not. A local stdio MCP does the same. The Skill route avoids it. The adaptive route — a skill that knows when to shell out to a local MCP — gets the best of both: cheap default, structural lookup when the question warrants it.
This is the axis I would not have picked going in. Picking the delivery shape is partly a question of what the agent pays just to know the tool exists.
A decision matrix I would actually paste into a design doc
I tried four real situations against the options. The matrix below is the version I would defend on a whiteboard.
| Situation | Pick | Why |
|---|---|---|
| Solo dev, < 50k LOC, exploratory work | Skill + good grep/ast-grep primitives | Indexing cost outweighs the win; the agent already does this passably |
| Solo dev, large monorepo, frequent multi-hop questions | Local stdio MCP with AST graph | Sub-ms queries, source never leaves the machine, single binary |
| Team of N, shared semantics, governance | Remote MCP backed by Apache AGE or Neo4j | The graph itself is an org asset — ownership, deployment topology, review history all live there |
| Mixed workload, want cost discipline | Adaptive: skill first, escalate to local MCP on flagged ambiguity | Saves tokens on the easy 70% of questions |
| One-shot architecture audit | Long-context dump, skip the infra | Building the graph costs more than the answer |
| Macro-heavy or codegen-heavy stack | Skill + grep, accept the limitation | AST graph misses generated code; pretending it doesn't is the dangerous move |
The macro case deserves more than a row. The Codebase-Memory evaluation reports 0.58 coverage on macro-heavy C. The pattern generalizes — anywhere surface syntax is rewritten before semantics, the AST-only graph underperforms. C macros, Lisp macros, Rust proc-macros, JVM annotation processors like KSP and KAPT, code-gen frameworks that produce sources at build time. The graph builder sees the input and the agent asks about the output. If you bolt an AST graph onto a codebase like that and trust it, you will ship a regression because the graph reported no callers of the function the macro-expanded code actually invokes thirty times.
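To put that in Go terms, here is a small hypothetical. The package, the names, and the generator are all invented for illustration; the point is only that the sole call site lives in a file that does not exist until the build runs.
// billing.go: handwritten source, and all the indexer walks before the build.
// A generator (wire, sqlc, a protoc plugin, take your pick) later emits a
// wire_gen.go into this package whose InitializeClient is the only caller of
// NewClient. Index the checked-in tree and find_callers("billing.NewClient")
// comes back empty, even though every post-generation build calls it.
package billing

type Client struct{ key string }

func NewClient(key string) *Client { return &Client{key: key} }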
A minimal worked example, in Go
I picked Go on purpose: the idea for this post started in Kotlin/Spring territory, but Go has a deterministic parser in the standard library and no CGO ceremony. The example below is the smallest thing that does the interesting work: walk a Go repo, parse every file with go/ast, extract function declarations and call expressions, store them in SQLite, answer "who calls X". This is the engine an MCP server would expose as find_callers.
// callgraph.go — minimal AST-based call-graph indexer for a Go repo.
package main
import (
"database/sql"
"fmt"
"go/ast"
"go/parser"
"go/token"
"io/fs"
"log"
"os"
"path/filepath"
"strings"
_ "modernc.org/sqlite"
)
func main() {
if len(os.Args) != 3 {
log.Fatalf("usage: %s <repo> <fqn>", os.Args[0])
}
root, target := os.Args[1], os.Args[2]
db, err := sql.Open("sqlite", ":memory:")
if err != nil {
log.Fatal(err)
}
defer db.Close()
if _, err := db.Exec(`
CREATE TABLE symbols(fqn TEXT PRIMARY KEY, file TEXT, line INT);
CREATE TABLE edges(caller TEXT, callee TEXT);
CREATE INDEX idx_callee ON edges(callee);`); err != nil {
log.Fatal(err)
}
fset := token.NewFileSet()
filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
if err != nil || d.IsDir() || !strings.HasSuffix(path, ".go") {
return nil
}
f, err := parser.ParseFile(fset, path, nil, 0)
if err != nil {
return nil
}
pkg := f.Name.Name
for _, decl := range f.Decls {
fn, ok := decl.(*ast.FuncDecl)
if !ok {
continue
}
caller := pkg + "." + fn.Name.Name
pos := fset.Position(fn.Pos())
db.Exec(`INSERT OR IGNORE INTO symbols VALUES(?,?,?)`, caller, pos.Filename, pos.Line)
ast.Inspect(fn, func(n ast.Node) bool {
if call, ok := n.(*ast.CallExpr); ok {
if name := resolve(call.Fun, pkg); name != "" {
db.Exec(`INSERT INTO edges VALUES(?,?)`, caller, name)
}
}
return true
})
}
return nil
})
rows, _ := db.Query(`SELECT caller FROM edges WHERE callee = ?`, target)
defer rows.Close()
fmt.Printf("callers of %s:\n", target)
for rows.Next() {
var c string
rows.Scan(&c)
fmt.Println(" -", c)
}
}
func resolve(e ast.Expr, pkg string) string {
switch v := e.(type) {
case *ast.Ident:
return pkg + "." + v.Name
case *ast.SelectorExpr:
if x, ok := v.X.(*ast.Ident); ok {
return x.Name + "." + v.Sel.Name
}
}
return ""
}
Run it with:
go run callgraph.go ./my-repo billing.Charge
The non-obvious parts. go/ast and go/parser are standard library — no CGO, no grammar headers. modernc.org/sqlite is a pure-Go SQLite, which keeps the build single-binary. ast.Inspect walks the function body and pulls out every *ast.CallExpr. The resolve helper is deliberately small: for an *ast.Ident it qualifies with the current package; for an *ast.SelectorExpr like billing.Charge it qualifies with the imported package alias. That covers most calls in a real codebase and breaks gracefully on interface-typed receivers — which is the AST blind spot all over again, this time inside the language.
The MCP wrapper around this is the boring part. A goroutine reading newline-delimited JSON-RPC from stdin, dispatching on method name (find_callers, find_callees), and writing responses to stdout. The MCP spec is small enough that the entire stdio loop is another 60 lines.
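For the shape of it, here is a minimal sketch of that loop, under two loud assumptions: it elides the MCP handshake (initialize, tools/list) and the exact result schema, both of which you should take from the spec rather than from me, and the handle callback is a hypothetical hook where find_callers and find_callees would be dispatched to the SQL queries from callgraph.go.
// mcpstdio.go: sketch of a newline-delimited JSON-RPC loop over stdio.
// Not a spec-complete MCP server; the handshake and result schema are elided.
package main

import (
	"bufio"
	"encoding/json"
	"os"
)

type rpcRequest struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id"`
	Method  string          `json:"method"`
	Params  json.RawMessage `json:"params"`
}

type rpcResponse struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id"`
	Result  any             `json:"result,omitempty"`
	Error   any             `json:"error,omitempty"`
}

// serve reads one JSON-RPC message per line, dispatches it to handle, and
// writes one response per request. Notifications (no id) get no reply.
func serve(handle func(method string, params json.RawMessage) (any, error)) {
	in := bufio.NewScanner(os.Stdin)
	in.Buffer(make([]byte, 0, 1<<20), 1<<20) // tolerate large messages
	out := json.NewEncoder(os.Stdout)
	for in.Scan() {
		var req rpcRequest
		if err := json.Unmarshal(in.Bytes(), &req); err != nil || len(req.ID) == 0 {
			continue
		}
		resp := rpcResponse{JSONRPC: "2.0", ID: req.ID}
		if result, err := handle(req.Method, req.Params); err != nil {
			resp.Error = map[string]any{"code": -32000, "message": err.Error()}
		} else {
			resp.Result = result
		}
		out.Encode(resp)
	}
}
Wiring find_callers to the edges table inside handle is the remaining ten lines.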
What I would test
I wrote a callgraph_test.go to convince myself the indexer worked before trusting it on a real repo. The fixture is a single svc.go file written into t.TempDir():
package svc
func ChargeOrder(id string) { logPayment(id); chargeStripe(id) }
func logPayment(id string) {}
func chargeStripe(id string) {}
Three assertions, none of them clever. After indexing, find_callers("svc.logPayment") returns exactly ["svc.ChargeOrder"]. find_callers("svc.chargeStripe") returns exactly ["svc.ChargeOrder"]. find_callers("svc.ChargeOrder") returns the empty list. A fourth test parses a file with a deliberate syntax error and confirms the indexer skips the file rather than crashing the run — failure of one file should never poison the graph.
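The sketch below writes those assertions as a table test, under one assumption of mine: that the walk and the lookup from callgraph.go have been factored out of main into two helpers, index(root) (*sql.DB, error) and findCallers(db, fqn) ([]string, error). Those names are hypothetical; the code above keeps everything in main.
// callgraph_test.go: fixture test for the indexer, assuming the hypothetical
// index and findCallers helpers described above.
package main

import (
	"os"
	"path/filepath"
	"strings"
	"testing"
)

func TestFindCallers(t *testing.T) {
	dir := t.TempDir()
	src := `package svc

func ChargeOrder(id string) { logPayment(id); chargeStripe(id) }
func logPayment(id string) {}
func chargeStripe(id string) {}
`
	if err := os.WriteFile(filepath.Join(dir, "svc.go"), []byte(src), 0o644); err != nil {
		t.Fatal(err)
	}
	db, err := index(dir)
	if err != nil {
		t.Fatal(err)
	}
	defer db.Close()

	cases := []struct {
		callee string
		want   []string
	}{
		{"svc.logPayment", []string{"svc.ChargeOrder"}},
		{"svc.chargeStripe", []string{"svc.ChargeOrder"}},
		{"svc.ChargeOrder", nil}, // nothing calls the entry point
	}
	for _, c := range cases {
		got, err := findCallers(db, c.callee)
		if err != nil {
			t.Fatal(err)
		}
		if strings.Join(got, ",") != strings.Join(c.want, ",") {
			t.Errorf("findCallers(%q) = %v, want %v", c.callee, got, c.want)
		}
	}
}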
The performance smoke test I would not skip: index a synthetic repo of N files, measure cold-start time and warm-query p50, plot against repo size. The Codebase-Memory paper claims sub-millisecond on real codebases. The thing the paper cannot tell me is what happens on my repo on my machine, and a 30-line benchmark closes that gap in five minutes.
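That benchmark is short enough to sketch here. Same caveat as the test above: it leans on the hypothetical index and findCallers helpers, and the synthetic repo it generates is far more uniform than a real one, so treat the numbers as a floor check rather than a claim.
// bench_test.go: cold-start and warm-query smoke benchmarks over a synthetic
// repo, using the same hypothetical helpers as the fixture test.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"testing"
)

// synthRepo writes n tiny packages, each with one caller/callee pair.
func synthRepo(b *testing.B, n int) string {
	b.Helper()
	dir := b.TempDir()
	for i := 0; i < n; i++ {
		pkgDir := filepath.Join(dir, fmt.Sprintf("p%d", i))
		if err := os.MkdirAll(pkgDir, 0o755); err != nil {
			b.Fatal(err)
		}
		src := fmt.Sprintf("package p%d\n\nfunc F%d() { helper%d() }\nfunc helper%d() {}\n", i, i, i, i)
		if err := os.WriteFile(filepath.Join(pkgDir, "f.go"), []byte(src), 0o644); err != nil {
			b.Fatal(err)
		}
	}
	return dir
}

func BenchmarkColdIndex(b *testing.B) {
	dir := synthRepo(b, 1000)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		db, err := index(dir)
		if err != nil {
			b.Fatal(err)
		}
		db.Close()
	}
}

func BenchmarkWarmQuery(b *testing.B) {
	db, err := index(synthRepo(b, 1000))
	if err != nil {
		b.Fatal(err)
	}
	defer db.Close()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := findCallers(db, "p42.helper42"); err != nil {
			b.Fatal(err)
		}
	}
}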
Where I would land
Three things I believe after a weekend of pushing on this.
For most people working on most codebases, the answer is local stdio MCP with an AST-derived graph, not remote, not LLM-extracted. The source never leaves the machine, the binary ships in one file, the queries are cheap, and the algorithm matches the data — code has a parser, use it.
The LLM-extracted knowledge graph approach is the trendy one and it is almost never worth 20× the cost of the deterministic approach for code specifically. The 0.69 per-file success rate is the part I cannot get past. A graph that silently misses a third of your repo is worse than no graph, because the agent now answers wrong with confidence.
The pattern I would build today, if I were starting fresh, combines the two trends. A small skill that teaches the agent to use grep and ast-grep for cheap questions, and a local stdio MCP behind a code-execution sandbox for the multi-hop ones. Anthropic's Code Execution with MCP and Cloudflare's Code Mode are converging on the same shape, and it lines up exactly with what code graphs need: most of the time the agent should not pay the context cost of the graph existing.
Takeaways
- Pick the delivery shape before the algorithm. The graph algorithm matters less than where the graph lives.
- AST-derived graphs cost about 2× naive RAG and about a twentieth of an LLM-extracted graph, and answer more questions correctly than either.
- A local stdio MCP keeps source on the machine and serves sub-millisecond queries on warm caches.
- MCP servers cost context whether you query them or not — adaptive routing through a skill is the underrated default.
- AST-only graphs miss codegen and macro-expanded code. If you live on a codebase like that, name the limitation and stop the agent from trusting the graph.
When to reach for each
Reach for a local stdio MCP when you are one developer on a large repo asking multi-hop questions every day. Reach for a remote MCP when the graph itself is a shared asset that encodes more than AST — ownership, deployment topology, review history. Reach for a skill alone on small repos, or as the cheap-path layer in an adaptive setup. Skip the structural infrastructure entirely for one-shot audits and macro-heavy stacks where the graph would lie.
The agent on my 480k-line Go repo, after I built the indexer above, answered the same question in three tool calls and 14k tokens. The reason isn't that the algorithm got better. The reason is that I moved the structure to where the agent could traverse it on demand instead of reconstructing it on every question.