A Fitness Function Is Just a Test That Fails the Build When the Architecture Drifts
A fitness function is not a framework artifact — it is a build-failing test that encodes one architectural invariant. I encode a layering rule in about 60 lines of TypeScript using the compiler's own API, test the test against good, bad, and generated-code trees, then draw the line between an invariant worth gating and a metric gate that backfires under Goodhart's law.
"Fitness function" sounds like a special artifact you have to install a framework to get. It is not. A fitness function is an automated check that encodes one architectural invariant and fails the build when the code drifts away from it. Martin Fowler and the authors of Building Evolutionary Architectures define it as a test that measures how close an implementation sits to its stated design goals. The word comes from evolutionary biology, but the implementation is a unit test that returns a non-zero exit code.
I went looking for the concrete mechanism because the buzzword kept showing up without the wiring underneath it. The O'Reilly Radar piece from late 2025 on agentic architecture governance reminded me why most teams never get there: the idea was introduced in the 2017 first edition of the book, and the ambitious version was "largely thwarted by brittleness." Fitness functions that break on every legitimate refactor get disabled within a month. So the real question is not how to write one. It is which invariant is worth gating, and how to write the check so it fails for the right reason.
This post encodes a single invariant as a build-failing test in TypeScript, with no architecture framework, then tests the test, then draws the line I now hold between an invariant worth a hard gate and ceremony that just annoys people.
The one invariant I gate
The rule I trust enough to fail a build over is a layering boundary: the domain layer must not import the transport layer. Domain code holds business rules. Transport code holds HTTP handlers, serializers, framework glue. Domain may know nothing about how a request arrived. When that edge inverts — a domain entity reaches up into an HTTP module — the dependency graph quietly rots, and every future change to transport risks breaking business logic that should never have seen it.
This is a good gate for one reason: it is a structural fact, not a number. An import either exists or it does not. There is no threshold to argue about and nothing to game. Compare that to "no file over 300 lines," which I will get to later and which I deliberately do not gate.
The mechanism is an import-graph walk. I do not need a heavy tool for it. The popular defaults in the TypeScript world are dependency-cruiser and ArchUnitTS, and both are fine. But the check itself is small enough that the TypeScript compiler's own API does the job, and writing it by hand makes the failure mode obvious instead of hiding it behind config.
The picture I am encoding looks like this: two clusters of modules, most edges legal, one edge crossing the boundary the wrong way.
What the import-graph check actually does
Here is the whole thing — the fitness function and its own tests in one runnable file.
```typescript
import * as ts from "typescript";
import * as path from "node:path";

type SourceUnit = { file: string; code: string };
type Violation = { from: string; to: string; spec: string };

// One invariant: nothing in src/domain may import from src/transport.
const RULE = { layer: "domain", mustNotReach: "transport" };

function layerOf(file: string): string | null {
  const m = file.replace(/\\/g, "/").match(/(?:^|\/)src\/([^/]+)\//);
  return m ? m[1] : null;
}

function isGenerated(file: string): boolean {
  return /\.gen\.ts$|\.generated\.ts$/.test(file);
}

function importsOf(unit: SourceUnit): string[] {
  const sf = ts.createSourceFile(unit.file, unit.code, ts.ScriptTarget.Latest, true);
  const specs: string[] = [];
  const visit = (node: ts.Node) => {
    if (ts.isImportDeclaration(node) && ts.isStringLiteral(node.moduleSpecifier)) {
      specs.push(node.moduleSpecifier.text);
    }
    ts.forEachChild(node, visit);
  };
  visit(sf);
  return specs;
}

export function checkLayering(units: SourceUnit[]): Violation[] {
  const violations: Violation[] = [];
  for (const unit of units) {
    if (isGenerated(unit.file)) continue;
    if (layerOf(unit.file) !== RULE.layer) continue;
    const dir = path.posix.dirname(unit.file.replace(/\\/g, "/"));
    for (const spec of importsOf(unit)) {
      if (!spec.startsWith(".")) continue; // third-party packages are not our layers
      const resolved = path.posix.normalize(path.posix.join(dir, spec));
      if (layerOf(resolved + "/") === RULE.mustNotReach) {
        violations.push({ from: unit.file, to: resolved, spec });
      }
    }
  }
  return violations;
}

// --- tests for the fitness function itself ---
function assert(cond: boolean, msg: string) {
  if (!cond) { console.error("FAIL:", msg); process.exit(1); }
}

const clean: SourceUnit[] = [
  { file: "src/domain/order.ts", code: `import { Money } from "./money";` },
  { file: "src/transport/http.ts", code: `import { Order } from "../domain/order";` },
];
const dirty: SourceUnit[] = [
  { file: "src/domain/order.ts", code: `import { send } from "../transport/http";` },
  { file: "src/domain/cache.gen.ts", code: `import { x } from "../transport/http";` },
];

assert(checkLayering(clean).length === 0, "clean tree must pass");
const found = checkLayering(dirty);
assert(found.length === 1, "one real violation; generated file must be ignored");
assert(found[0].spec === "../transport/http", "error must name the offending edge");
console.log(`ok: ${found.length} violation -> ${found[0].from} imports ${found[0].spec}`);
```

Run it with npx tsx fitness.ts. It prints the single violation it found on the dirty tree and exits zero because the assertions held; in CI you would invert that last step so any violation exits non-zero.
A few lines carry the weight. importsOf uses ts.createSourceFile to parse a module into an AST and walks it with ts.forEachChild, collecting only ImportDeclaration nodes whose specifier is a string literal. That is the entire dependency extraction — no regex over source text, which would trip over comments and strings. layerOf reads the layer name straight from the path: the first segment under src/. checkLayering resolves each relative import against the importing file's directory with path.posix.join plus normalize, then asks which layer the resolved path lands in. If a domain file resolves an import into transport, that edge is a violation, and the record names the exact offending import specifier — not just "a violation exists."
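The claim that a regex over source text trips over comments and strings can be checked directly. The following is a minimal sketch, assuming the typescript package is installed. importsByRegex and the sample source are made up for contrast and are not part of the fitness function; importsByAst repeats the same walk as importsOf.

```typescript
import * as ts from "typescript";

// Hypothetical naive extractor, shown only for contrast: scan raw text for `from "..."`.
function importsByRegex(code: string): string[] {
  return [...code.matchAll(/from\s+["']([^"']+)["']/g)].map((m) => m[1]);
}

// Same AST walk as importsOf: only real ImportDeclaration nodes count.
function importsByAst(code: string): string[] {
  const sf = ts.createSourceFile("sample.ts", code, ts.ScriptTarget.Latest, true);
  const specs: string[] = [];
  const visit = (node: ts.Node) => {
    if (ts.isImportDeclaration(node) && ts.isStringLiteral(node.moduleSpecifier)) {
      specs.push(node.moduleSpecifier.text);
    }
    ts.forEachChild(node, visit);
  };
  visit(sf);
  return specs;
}

// One real import, one commented-out import, one import-looking string.
const sample = [
  `import { Money } from "./money";`,
  `// import { send } from "../transport/http";`,
  "const doc = `usage: import { send } from \"../transport/http\"`;",
].join("\n");

console.log("regex:", importsByRegex(sample));
console.log("ast:  ", importsByAst(sample));
```

On this sample the regex reports the transport import twice, once from the comment and once from the template string, while the AST walk reports only ./money. In a gate, those two false positives would be exactly the kind of wrong-reason failure that gets a check disabled.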
I feed it in-memory source units here so the example runs as one file. For CI, the only change is the input: walk src/ on disk, read each .ts file into a SourceUnit, and exit non-zero when the returned array is non-empty. The check logic does not change.
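The on-disk input described above could look roughly like this. It is a sketch under assumptions: collectUnits and reportAndExitCode are hypothetical helper names, the source root is a directory literally named src, and only files ending in .ts are read.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Same shapes as in fitness.ts.
type SourceUnit = { file: string; code: string };
type Violation = { from: string; to: string; spec: string };

// Hypothetical helper: recursively read every .ts file under `root`.
// Paths are reported relative to the parent of `root`, with forward slashes,
// so layerOf sees "src/<layer>/..." on every platform.
export function collectUnits(root: string, base: string = path.dirname(root)): SourceUnit[] {
  const units: SourceUnit[] = [];
  for (const entry of fs.readdirSync(root, { withFileTypes: true })) {
    const full = path.join(root, entry.name);
    if (entry.isDirectory()) {
      units.push(...collectUnits(full, base));
    } else if (entry.isFile() && entry.name.endsWith(".ts")) {
      const file = path.relative(base, full).split(path.sep).join("/");
      units.push({ file, code: fs.readFileSync(full, "utf8") });
    }
  }
  return units;
}

// Hypothetical helper: print each offending edge and return the process exit code.
export function reportAndExitCode(violations: Violation[]): number {
  for (const v of violations) {
    console.error(`layering violation: ${v.from} imports ${v.spec} (resolves to ${v.to})`);
  }
  return violations.length === 0 ? 0 : 1;
}
```

A CI entry point would then be a single line such as process.exit(reportAndExitCode(checkLayering(collectUnits("src")))), with checkLayering imported from the file above. The in-memory self-tests would probably move to their own file at that point so they do not run on every import.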
Testing the test
A fitness function that silently passes is worse than no fitness function, because it buys false confidence. So the check needs its own tests, and they are at the bottom of the same file.
Three cases matter. A known-good tree must return zero violations, or the gate is too loose and will wave real drift through. A known-bad tree must return exactly the violations I planted, or the gate is mismeasuring. And the error must name the offending edge — found[0].spec === "../transport/http" — because a violation report that does not point at the bad import wastes the time of whoever has to fix it at 5pm on a Friday.
The fourth case is the one I learned to add only after a fitness function falsely failed on me: generated code. The dirty tree includes src/domain/cache.gen.ts, which also imports transport. A naive check would flag it, the build would go red on code nobody wrote by hand, and the team would reach for the disable switch. isGenerated excludes it, and the assertion that the dirty tree yields exactly one violation, not two, locks that exclusion in. This is the brittleness the O'Reilly piece warned about, caught by a test of the test rather than by a production incident.
The gate I refused to add
The tempting next move is to keep adding gates. "No file over 300 lines." "Test coverage must stay above 80 percent." "No more than 5 parameters per function." These feel like fitness functions. They are metric gates, and metric gates are where the practice goes wrong.
The reason is Goodhart's law, named for the economist Charles Goodhart, who formulated it in 1975, and usually quoted in a later paraphrase: when a measure becomes a target, it ceases to be a good measure. The strong version is sharper: even an honest, good-faith pursuit of the metric, pushed far enough, damages the goal the metric was a proxy for. A coverage gate is the textbook case. Developers chasing the number write tests that execute lines without asserting anything, so coverage climbs while real test quality falls, and the integration tests that would actually catch a regression get skipped because they are harder to write and move the number less. A line-count gate is worse: it punishes the refactor that deletes code and rewards splitting one coherent module into three incoherent ones to duck under the threshold.
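To make the coverage case concrete, here is a small illustration. Everything in it is hypothetical: applyDiscount is an invented function with a deliberate bug, and the two "tests" are plain functions rather than calls into any test framework.

```typescript
// Hypothetical function under test, with a deliberate bug:
// `percent` is used as a fraction instead of being divided by 100.
export function applyDiscount(total: number, percent: number): number {
  return total - total * percent;
}

// Coverage-only "test": executes every line of applyDiscount but asserts nothing,
// so it can never fail, and a coverage tool still counts the function as covered.
export function coverageOnlyTest(): boolean {
  applyDiscount(200, 10);
  return true;
}

// Asserting test: the same call, but the result is checked against the intent
// (10 percent off 200 should be 180).
export function assertingTest(): boolean {
  return applyDiscount(200, 10) === 180;
}

console.log("coverage-only test passes:", coverageOnlyTest());
console.log("asserting test passes:", assertingTest());
```

Both functions produce the same line coverage for applyDiscount. Only the second one fails on the bug, and a coverage threshold cannot tell them apart.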
The layering check has no such failure mode. There is no number to optimize toward and no honest way to "improve the score" except by not writing the forbidden import. The check measures a fact about the structure, and the only way to satisfy it is to keep the structure correct. That is the property I now screen for before I gate anything: can a well-meaning engineer make this metric better while making the codebase worse? If yes, it does not go in CI as a hard gate.
When to gate, when to warn, when to report
Not every invariant earns a build failure. After getting this wrong in both directions, I sort candidate checks into three buckets.
Hard CI gate, for structural invariants that are binary and cannot be gamed: layer boundaries, "no module imports a sibling's internal package," "no direct database access outside the repository layer," "no cyclic dependencies." These either hold or they do not, and a violation is always a real problem.
Warning, not a gate, for checks that are usually right but have legitimate exceptions: a new third-party dependency appearing, a public API surface growing. Surface the finding in the pull request, let a human judge, do not block the merge.
Periodic report, for anything metric-shaped: file sizes, coverage trend, complexity. Track the direction over time, look at it in a review, but never wire it to a red build. A trend you discuss is information; a threshold you enforce is an invitation to game it.
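One way to keep the three buckets honest is to make the tier part of each finding, so that only one tier can ever turn the build red. This is a sketch with hypothetical names (Tier, Finding, exitCodeFor, summarize), not the API of any tool.

```typescript
// Hypothetical types: every check declares which bucket its findings belong to.
type Tier = "gate" | "warn" | "report";
type Finding = { check: string; tier: Tier; message: string };

// Only gate-tier findings may fail the build; warnings and reports never block a merge.
export function exitCodeFor(findings: Finding[]): number {
  return findings.some((f) => f.tier === "gate") ? 1 : 0;
}

// Group messages by tier so each bucket can go to its own channel:
// build log for gates, pull request comment for warnings, periodic report for the rest.
export function summarize(findings: Finding[]): Record<Tier, string[]> {
  const out: Record<Tier, string[]> = { gate: [], warn: [], report: [] };
  for (const f of findings) out[f.tier].push(`${f.check}: ${f.message}`);
  return out;
}

// Example findings, invented for illustration.
const example: Finding[] = [
  { check: "layering", tier: "gate", message: "src/domain/order.ts imports ../transport/http" },
  { check: "new-dependency", tier: "warn", message: "a third-party package was added" },
  { check: "file-size", tier: "report", message: "src/domain/order.ts grew since the last report" },
];

console.log(summarize(example), "exit code:", exitCodeFor(example));
```

The design choice is that a metric-shaped check cannot become a hard gate by accident: promoting it means changing its tier in code, which is a reviewable decision.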
The last failure mode is organizational, not technical. A fitness function rots if nobody owns it. The day the layering check fails for a reason the author does not understand — a path convention changed, a new generated-file pattern appeared — someone will comment it out to ship, and it never comes back. The fix is the same discipline as any test: when it fails, you either fix the code or fix the check, and you never silence it. That is why I keep the check small and its own tests next to it. A fitness function I can read in one screen is one I will repair instead of delete.
Actionable takeaways:
- Start with one invariant that is a structural fact, not a number — a layer boundary is the highest-value first gate.
- Write the check against the language's own AST tooling before reaching for a framework; the TypeScript compiler API walks imports in a dozen lines.
- Make the violation report name the exact offending edge, not just its existence.
- Test the fitness function with a known-good tree, a known-bad tree, and a generated-code case, so it fails only for real drift.
- Screen every candidate gate against Goodhart's law: if honest optimization of the metric can hurt the codebase, do not make it a hard gate.
Reach for a fitness function when an invariant is binary, structural, and expensive to violate silently — layering, dependency direction, module boundaries. Avoid wiring one to any metric with a threshold, because the threshold becomes the target and the target gets gamed. A fitness function that fails for the wrong reason gets disabled within a month; one that names a real structural violation in plain terms earns its place in the build.
