Turning Repo Maintenance into Markdown: Keeping a Rust Codebase Alive with Agentic Workflows
Long-lived repositories drift: deprecated components linger, layers bleed, and tests miss the functions that actually break. In my own study I turned three recurring chores into scheduled markdown workflows the repo runs on itself, then wrote up what I learned about capping blast radius, pairing LLM checks with deterministic scans, and letting agents draft shapes while I write the substance.
Long-lived codebases never fail on a single day. They drift. A replacement component ships and half the project still calls the deprecated one. A handler reaches three layers down because that was the shortest path under deadline. Tests exist, but not on the functions that actually broke last quarter. None of it blocks a release, none of it earns a ticket, and all of it taxes every future change. The problem I kept hitting in my own side projects was not lack of awareness — it was lack of a loop that would make the awareness act on itself.
Kief Morris's March 2026 piece on martinfowler.com gave me the framing I was missing: the productive lever with coding agents is the working loop, not the prompt. He calls it being "on the loop" — humans design and maintain the harness, agents run it. I had been oscillating between copy-pasting prompts into a chat window and running ad-hoc scripts only I remembered to run. Both are in-the-loop patterns that break the moment I look away.
What I wanted was a harness the repository could carry on its back. GitHub shipped GitHub Agentic Workflows (gh aw) as a technical preview on 13 February 2026, and the shape of it fit: each workflow is a markdown file under .github/workflows/, the agent runs with a read-only token by default, and any write back into the repo must pass through a declared safe-outputs block that opens a pull request in a separate, scoped job. The markdown is the harness. That changed how I thought about maintenance chores.
The three chores I moved out of my head
In a Rust service I keep for study, three recurring chores kept showing up as "someday" items in my notes. I turned each one into a scheduled workflow.
The first is deprecated-component migration. A UI module had moved from ui::button::Button to ui::v2::Button, and twenty-eight callers still used the old one. The workflow file:
---
description: Migrate ui::button::Button to ui::v2::Button
on:
  schedule:
    - cron: "0 4 * * 2"
permissions: read-all
tools:
  - rg
  - cargo
safe-outputs:
  create-pull-request:
    max: 1
    labels: [chore, auto-migration]
---
Find up to 20 call sites of `ui::button::Button` that are not inside
`src/ui/button/`. Replace each with `ui::v2::Button`, preserving the
prop names listed in `docs/migration-button-v2.md`. Run
`cargo check --all-targets` and stop if it fails. Open one pull
request titled `chore(ui): migrate Button v1 -> v2 (N/28)`.

Two details matter. The max: 1 on create-pull-request caps the blast radius: if the agent misreads the migration guide, I get one noisy PR, not twenty. The "up to 20 call sites" cap keeps any single PR reviewable; I would rather merge fifteen small PRs than try to read one enormous one.
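The per-run cap is simple enough to state as code. What follows is a minimal sketch, not something from the workflow itself: `select_call_sites` is a hypothetical helper, and it assumes the file contents are already in memory and that a plain substring match is a good enough stand-in for what `rg` would report.

```rust
/// Hypothetical helper: returns at most `cap` `(path, line_number)` pairs
/// where `needle` occurs, skipping every file under `excluded_prefix`.
/// Line numbers are 1-based, as a reviewer would expect.
pub fn select_call_sites<'a>(
    files: &[(&'a str, &str)],
    needle: &str,
    excluded_prefix: &str,
    cap: usize,
) -> Vec<(&'a str, usize)> {
    let mut hits = Vec::new();
    for &(path, text) in files {
        // The old component's own directory is not a call site to migrate.
        if path.starts_with(excluded_prefix) {
            continue;
        }
        for (idx, line) in text.lines().enumerate() {
            // Stop as soon as the cap is reached so one run stays reviewable.
            if hits.len() == cap {
                return hits;
            }
            if line.contains(needle) {
                hits.push((path, idx + 1));
            }
        }
    }
    hits
}
```

Called with a cap of 20 and the prefix `src/ui/button/`, this yields the same bounded work list the prompt asks the agent to find. Whatever is left over waits for the next scheduled run.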
The second chore is architectural boundary enforcement. The service has a clean layering — domain, application, infrastructure — and the rule is that domain must not import from infrastructure. A deterministic scan catches the easy cases; an LLM pass is what finally catches the clever ones, where a domain module imports a helper whose transitive dependency reaches into infrastructure. The Thoughtworks Technology Radar Volume 34 (April 2026) placed "architecture drift reduction with LLMs" in Assess — promising, not yet a default. My workflow mirrors that pattern: run the deterministic scan first, hand the remaining ambiguous cases to the agent.
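The deterministic first pass can be very small. Here is a sketch under stated assumptions: `Leak` and `direct_leaks` are hypothetical names, the layer paths match the layout described above, and the check looks only at the text of each `use` line. It is not the scan from my repository, only an illustration of the kind of rule that runs before the agent.

```rust
/// A direct `use` of the infrastructure layer found in a domain file.
#[derive(Debug, PartialEq)]
pub struct Leak {
    pub file: String,
    pub line: usize,
    pub import: String,
}

/// Flags `use` lines in domain files that name `crate::infrastructure`.
/// Deliberately naive: it reads one line at a time, so a dependency that
/// arrives through a re-export or a transitive helper is invisible to it.
pub fn direct_leaks(files: &[(&str, &str)]) -> Vec<Leak> {
    let mut leaks = Vec::new();
    for &(path, text) in files {
        // Only the domain layer is subject to this rule.
        if !path.starts_with("src/domain/") {
            continue;
        }
        for (idx, raw) in text.lines().enumerate() {
            let line = raw.trim();
            let is_use = line.starts_with("use ") || line.starts_with("pub use ");
            if is_use && line.contains("crate::infrastructure") {
                leaks.push(Leak {
                    file: path.to_string(),
                    line: idx + 1,
                    import: line.to_string(),
                });
            }
        }
    }
    leaks
}
```

A rule like this is cheap and never wrong about what it reports, which is why it goes first. The cases it cannot see are the ones the workflow file hands to the agent.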
---
description: Flag and propose fixes for domain -> infrastructure leaks
on:
  schedule:
    - cron: "0 5 * * 4"
permissions: read-all
tools:
  - rg
  - cargo
safe-outputs:
  create-issue:
    max: 3
    labels: [architecture, drift]
---
Scan `src/domain/**/*.rs`. For each module that directly or
transitively depends on a type from `src/infrastructure/**`,
propose a trait in `src/domain` that narrows the dependency.
Open at most three issues; each must cite the offending import
path and a minimal trait sketch. Do not open pull requests.

The third is coverage-gap closure. cargo tarpaulin --print-summary reports per-file coverage; the workflow picks the three public functions with zero branch coverage in files below 60 percent and proposes deterministic tests: table-driven, no I/O, no clock. Output goes to a PR with max: 1 so I can reject a bad test without wading through a backlog.
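The selection rule for the coverage workflow can be written down precisely. This is a minimal sketch, assuming the coverage report has already been parsed into records: `FnCoverage` and `pick_targets` are hypothetical names, the record shape is not tarpaulin's output format, and the ordering (least-covered files first, names as tie-breakers) is my own assumption, since the prompt only says "three".

```rust
/// One function's entry in an already-parsed coverage report (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct FnCoverage {
    pub file: &'static str,
    pub function: &'static str,
    pub is_public: bool,
    pub file_coverage_pct: f64,
    pub branches_hit: u32,
}

/// Picks up to `limit` public functions with zero branch coverage that live
/// in files below 60 percent coverage.
pub fn pick_targets(report: &[FnCoverage], limit: usize) -> Vec<&FnCoverage> {
    let mut candidates: Vec<&FnCoverage> = report
        .iter()
        .filter(|f| f.is_public && f.branches_hit == 0 && f.file_coverage_pct < 60.0)
        .collect();
    // Least-covered files first; file and function names break ties so two
    // runs over the same report pick the same targets.
    candidates.sort_by(|a, b| {
        a.file_coverage_pct
            .total_cmp(&b.file_coverage_pct)
            .then_with(|| a.file.cmp(b.file))
            .then_with(|| a.function.cmp(b.function))
    });
    candidates.truncate(limit);
    candidates
}
```

Keeping the pick deterministic means a rejected PR does not reshuffle the queue: the same three functions come back on the next run until someone covers them.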
The case that made me trust the loop
The architecture workflow surfaced a real leak I would have kept missing. A domain function depended on a struct from infrastructure::postgres::rows to shape its return value, and the deterministic scan missed it because the import chain ran through a re-export. The agent flagged the offending path and drafted a trait sketch. I kept the shape and wrote the final code myself. Here is the minimal example I pulled out of my notes to capture the pattern for a future me:
// src/domain/pricing.rs
// Run with: cargo test

/// An amount in integer cents.
pub struct Money {
    pub cents: i64,
}

/// The narrow seam: the only thing the domain needs from a price store.
pub trait PriceSource {
    fn unit_price(&self, sku: &str) -> Option<Money>;
}

/// Sums unit price times quantity over the cart, skipping unknown SKUs.
pub fn total(src: &dyn PriceSource, cart: &[(&str, u32)]) -> Money {
    let cents: i64 = cart
        .iter()
        .filter_map(|(sku, qty)| src.unit_price(sku).map(|p| p.cents * i64::from(*qty)))
        .sum();
    Money { cents }
}

#[cfg(test)]
mod tests {
    use super::*;
    use std::collections::HashMap;

    /// In-memory price table standing in for the real adapter.
    struct Stub(HashMap<&'static str, i64>);

    impl PriceSource for Stub {
        fn unit_price(&self, sku: &str) -> Option<Money> {
            self.0.get(sku).map(|c| Money { cents: *c })
        }
    }

    #[test]
    fn totals_known_skus_and_ignores_unknown() {
        let src = Stub(HashMap::from([("A", 100), ("B", 250)]));
        // "C" has no price, so it contributes nothing: 2 * 100 + 1 * 250.
        let t = total(&src, &[("A", 2), ("B", 1), ("C", 5)]);
        assert_eq!(t.cents, 450);
    }
}

cargo test exercises the entire contract without touching Postgres, without a tokio runtime, and without the chrono transitive dependency that the old shape dragged in. The PriceSource trait is the narrow seam. Before the extraction, every change in the infrastructure row struct rippled into domain tests; after, the domain owns its vocabulary and the Postgres adapter implements the trait in src/infrastructure/pricing_pg.rs. The agent did not write this code. It drafted the trait signature in an issue comment and left the implementation to a human PR. That split (agent proposes shape, I write substance) is the one I want the harness to enforce.
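For completeness, here is roughly what the other side of the seam looks like. This is a sketch rather than the contents of src/infrastructure/pricing_pg.rs: `PriceRow` and `PgPriceSource` are hypothetical names, a `Vec` stands in for the result of a real query so the example runs without a database, and the domain types are repeated only so the block compiles on its own.

```rust
// Domain side, repeated here so the sketch is self-contained.
pub struct Money {
    pub cents: i64,
}

pub trait PriceSource {
    fn unit_price(&self, sku: &str) -> Option<Money>;
}

// Infrastructure side (hypothetical). The row carries storage details the
// domain never needs to see.
pub struct PriceRow {
    pub sku: String,
    pub unit_price_cents: i64,
    pub updated_at_epoch_secs: i64,
}

/// Adapter that owns the rows and exposes only the domain's vocabulary.
pub struct PgPriceSource {
    rows: Vec<PriceRow>,
}

impl PgPriceSource {
    /// In a real adapter these rows would come from a query.
    pub fn from_rows(rows: Vec<PriceRow>) -> Self {
        Self { rows }
    }
}

impl PriceSource for PgPriceSource {
    fn unit_price(&self, sku: &str) -> Option<Money> {
        // Translate the row into the domain type at the boundary.
        self.rows
            .iter()
            .find(|r| r.sku == sku)
            .map(|r| Money { cents: r.unit_price_cents })
    }
}
```

The dependency arrow now points from infrastructure to domain: adding or renaming a column changes `PriceRow` and this `impl`, and nothing under src/domain has to recompile its assumptions.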
What I got wrong on the first pass
A few traps I hit in my own tests and corrected:
- I started with max: 5 on the migration PR. Review attention collapsed by PR three. Dropping to max: 1 with a per-run cap of 20 call sites made every PR small enough to read in five minutes.
- I gave the architecture workflow write-all "just in case". Predictably, the agent proposed a sweeping rewrite. Switching to read-only plus safe-outputs: create-issue forced the draft-then-human split and cut the noise by an order of magnitude.
- I scheduled all three workflows at the same minute. gh-aw's fuzzy-schedule feature distributes start times deterministically across workflows to avoid cron spikes; moving off a shared exact time stopped me fighting myself for runner minutes.
- I assumed outbound HTTP from the agent was unrestricted. It is not: the runtime routes outbound traffic through a proxy with an explicit domain allowlist, and anything outside it is dropped at the kernel. That is a feature; I stopped writing workflows that tried to call arbitrary SaaS analyzers.
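The scheduling item is easier to reason about with the idea spelled out. The sketch below is not gh-aw's algorithm, which I have not read; it only illustrates what "deterministic spreading" can mean, using a hypothetical `staggered_cron` helper that derives the minute from the workflow name with an FNV-1a hash.

```rust
/// FNV-1a over bytes. Used here because its output is fixed by the
/// algorithm, so the same name maps to the same minute on every machine.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hypothetical helper: builds a weekly cron expression whose minute is
/// derived from the workflow name instead of being hard-coded to 0.
pub fn staggered_cron(workflow_name: &str, hour: u8, day_of_week: u8) -> String {
    let minute = fnv1a(workflow_name.as_bytes()) % 60;
    format!("{minute} {hour} * * {day_of_week}")
}
```

Two workflows can still land on the same minute, since there are only sixty slots, but the collision is no longer guaranteed by everyone typing `0` out of habit.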
When to reach for this, when to walk away
This shape earns its keep when the chore is recurring, narrow, and easy to review. Deprecated-symbol migrations, import-boundary drift, coverage gaps on pure functions, doc-comment freshness — all fit. The cost is a handful of small PRs per week, and each one carries the same mental load as a human-authored cleanup.
It is the wrong shape when the chore is one-off, when it requires taste ("make this API nicer"), or when the failure mode is subtle enough that a noisy PR can sneak past review. LLM-driven architecture enforcement is still in Assess on the Thoughtworks Radar for good reason — I pair it with deterministic rules and treat the agent's output as a draft, never a verdict.
The one takeaway I keep returning to: a repository that maintains itself under supervision, not on interrupt, gives back the bandwidth I used to spend remembering to schedule cleanup sprints.
Notes for a future me:
- keep safe-outputs.*.max small (1–3), and cap per-run scope inside the prompt itself.
- use read-only permissions; never set write-all to save a debugging minute.
- use fuzzy schedules so parallel workflows do not collide on runners.
- pair any LLM enforcement step with a deterministic scan that runs first.
- let the agent propose shapes in issues; write the substance in PRs.
Use when: recurring narrow chores with cheap review. Avoid when: taste calls, one-offs, or anywhere a silent wrong answer can land.
Further reading: Kief Morris, "Humans and Agents in Software Engineering Loops", martinfowler.com, 4 March 2026; the gh-aw overview and safe-outputs reference at github.github.com/gh-aw; Thoughtworks Technology Radar Volume 34, April 2026.
