
Memory Evaluation: Measuring How AI Memory Decays Over a Project's Lifetime

Most AI memory benchmarks grade on recall and stop there. That hides the real failure mode: stale facts quietly poisoning the context window. Here is a lifecycle-based evaluation framework that tests recall, revision, and controlled forgetting across the change points every long-lived project goes through.

Most teams I have worked with bolt a memory layer onto their AI assistants — a vector store, a summarization job, a "facts" table — and then evaluate it once, at a single point in time. Recall looks good, latency is acceptable, ship it.

Six months later the assistant confidently tells a new hire that the auth service still runs on Node, that payments-v1 is the source of truth, and that Alice owns the billing domain. None of those things are true anymore. They were true. The memory layer never noticed they stopped being true.

This is the gap I want to close. A useful memory system is not evaluated on what it remembers; it is evaluated on whether its representation of the project tracks reality as the project changes. If you are building a long-lived copilot — for a codebase, a team, a customer account — you need to treat memory the way you treat a cache: correctness includes invalidation.

By the end of this post you will have a concrete framework for evaluating memory across a project's lifecycle, the four metrics that matter, and the failure modes to design around.

The Problem

The standard memory evaluation looks roughly like this: seed the store with N facts, ask M questions, measure recall and precision against a gold set. This is a snapshot. It tells you nothing about what happens when fact 17 becomes obsolete in week 12.
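That snapshot-style harness fits in a few lines, which is part of why it is so common. Here is a minimal sketch (names hypothetical), scoring retrieved fact keys against a frozen gold set:

```kotlin
import kotlin.math.max

// Hypothetical snapshot benchmark: seed once, probe once, score once.
data class SnapshotResult(val recall: Double, val precision: Double)

fun snapshotEval(retrieved: Set<String>, gold: Set<String>): SnapshotResult {
    val hits = retrieved.intersect(gold).size.toDouble()
    return SnapshotResult(
        recall = hits / max(1, gold.size),
        precision = hits / max(1, retrieved.size),
    )
}
```

Nothing in this score can ever change once the gold set is authored, which is exactly the problem: the gold set is frozen while the project is not.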

Real projects produce a continuous stream of invalidating events:

  • A service is rewritten in a different language. Every code-specific answer grounded in the old stack is now wrong.
  • A team reorganizes. "Ask Alice about billing" was true; now it routes to a person who left the company.
  • A decision is reversed. You chose Kafka, then moved to SQS, then moved back. Memory that reflects any single one of those states in isolation is misleading.
  • A requirement is dropped. The assistant keeps citing a constraint nobody cares about anymore.
  • An architectural boundary shifts. The module the memory "knows" no longer exists.

Each of these is a mutation to the ground truth. A snapshot benchmark does not see them. Worse, the standard failure mode is silent: the system keeps answering confidently with stale context because nothing in the retrieval pipeline flags a fact as expired. Relevance scores look normal. Embeddings still match. The answer is just wrong.

The hard part is not the retrieval. The hard part is that "the right answer" is a moving target and most evaluation harnesses freeze it.

The Approach

The mental model I use is borrowed from database replication: memory is a read replica of the project. What you measure is replication lag, drift, and divergence — not just throughput.

Instead of a single-point benchmark, I run a lifecycle evaluation. The evaluation harness drives the system through a sequence of change points and probes at each one. At every step it asks four questions:

  1. Preservation — Did it keep facts that are still valid?
  2. Revision — Did it update facts that changed?
  3. Forgetting — Did it drop facts that are obsolete?
  4. Non-contamination — Are old answers bleeding into new ones?

Those four questions map onto four metrics you can track over time:

  • Freshness — % of retrieved facts whose last validation falls within the acceptable staleness window for their domain.
  • Consistency — % of retrieved facts that do not contradict the current ground truth.
  • Coverage — % of current ground-truth facts that are retrievable.
  • Controlled forgetting rate — % of invalidated facts that have been removed or superseded within N change points of the invalidating event.

Notice what is missing: raw recall. Recall is a component of coverage, but on its own it rewards hoarding. A store that never forgets anything will have perfect recall and catastrophic consistency.
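Assuming each retrieved fact carries its identity, value, and last-validation timestamp (the shapes below are hypothetical), the first three metrics reduce to set arithmetic against the current ground truth. Controlled forgetting needs the event history as well, so it is omitted here:

```kotlin
import java.time.Duration
import java.time.Instant

// Minimal fact shape for metric computation (hypothetical fields).
data class ScoredFact(
    val key: String,              // subject+predicate identity
    val value: String,
    val lastValidated: Instant,
    val invalidatedAt: Instant?,  // null while the fact is live
)

data class StageMetrics(val freshness: Double, val consistency: Double, val coverage: Double)

fun metrics(
    retrieved: List<ScoredFact>,
    truth: Map<String, String>,   // current ground truth: key -> value
    at: Instant,
    window: Duration,
): StageMetrics {
    val fresh = retrieved.count { Duration.between(it.lastValidated, at) <= window }
    val consistent = retrieved.count { truth[it.key] == it.value }
    val covered = truth.keys.count { k -> retrieved.any { it.key == k && it.value == truth[k] } }
    return StageMetrics(
        freshness = fresh.toDouble() / retrieved.size.coerceAtLeast(1),
        consistency = consistent.toDouble() / retrieved.size.coerceAtLeast(1),
        coverage = covered.toDouble() / truth.size.coerceAtLeast(1),
    )
}
```

Note the denominators differ: freshness and consistency are rates over what was retrieved; coverage is a rate over what should have been retrievable.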

Technical Deep Dive

Modeling change points

The first concrete artifact is a timeline — an ordered list of events that mutate the ground truth. In testing we synthesize it; in production we derive it from sources we already have (git history, ADRs, org charts, ticket closures).

```kotlin
import java.time.Instant

sealed interface ChangeEvent {
    val id: String
    val timestamp: Instant

    data class FactAdded(
        override val id: String,
        override val timestamp: Instant,
        val fact: Fact,
    ) : ChangeEvent

    data class FactRevised(
        override val id: String,
        override val timestamp: Instant,
        val previous: Fact,
        val current: Fact,
    ) : ChangeEvent

    data class FactInvalidated(
        override val id: String,
        override val timestamp: Instant,
        val fact: Fact,
        val reason: InvalidationReason,
    ) : ChangeEvent
}

data class Fact(
    val subject: String,
    val predicate: String,
    val value: String,
    val domain: FactDomain,   // TECH, OWNERSHIP, DECISION, REQUIREMENT
)
```

Domains matter because staleness tolerance is domain-specific. An ownership fact is stale within days of a reorg. An architectural decision can be valid for quarters. A requirement might be valid for years. Treating them uniformly is the root of most over-eager forgetting bugs I have seen.
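A minimal encoding of that policy looks like the following; the window values are illustrative defaults, not recommendations:

```kotlin
import java.time.Duration

enum class FactDomain { TECH, OWNERSHIP, DECISION, REQUIREMENT }

// Illustrative staleness windows per domain; tune per organisation.
val stalenessWindow: Map<FactDomain, Duration> = mapOf(
    FactDomain.OWNERSHIP to Duration.ofDays(30),
    FactDomain.TECH to Duration.ofDays(90),
    FactDomain.DECISION to Duration.ofDays(180),
    FactDomain.REQUIREMENT to Duration.ofDays(365),
)

fun isFresh(domain: FactDomain, age: Duration): Boolean =
    age <= stalenessWindow.getValue(domain)
```

The point is that freshness is a function of (domain, age), never of age alone.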

Running the harness

The evaluator replays the timeline, and after each event it runs a probe set against the memory system. A probe is a question with a time-aware expected answer — the answer is a function of "what is true at timestamp T".

```kotlin
import java.time.Instant

class LifecycleEvaluator(
    private val memory: MemoryService,
    private val probes: List<Probe>,
    private val groundTruth: GroundTruth,
) {
    fun evaluate(timeline: List<ChangeEvent>): List<StageResult> {
        val results = mutableListOf<StageResult>()

        timeline.forEach { event ->
            memory.apply(event)

            // aggregate() is a small extension that averages the per-probe
            // scores into one StageMetrics for this change point.
            val metrics = probes
                .map { probe -> score(probe, event.timestamp) }
                .aggregate()

            results += StageResult(event, metrics)
        }

        return results
    }

    private fun score(probe: Probe, at: Instant): ProbeScore {
        val expected = groundTruth.resolve(probe, at)
        val actual = memory.answer(probe.question, at)

        return ProbeScore(
            freshness = freshness(actual.facts, at),
            consistency = consistency(actual.facts, expected),
            coverage = coverage(actual.facts, expected),
            contaminated = actual.facts.any { it.invalidatedBefore(at) },
        )
    }
}
```

The important choice here is that memory.answer returns the facts used to produce the answer, not just the answer text. Scoring at the fact level is what lets you distinguish "the answer is right for the wrong reasons" from genuine correctness. I have seen systems score well on answer-text similarity while retrieving completely stale evidence — they were one rephrased question away from breaking.
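A sketch of that distinction, with hypothetical types: score the answer text and the cited evidence separately, and treat the mismatch as its own signal rather than letting text similarity mask it:

```kotlin
// Hypothetical evidence-level check: the answer text can match the expected
// answer while the facts behind it are stale. Score both independently.
data class EvidenceFact(val factKey: String, val valid: Boolean)

data class AnswerScore(val textCorrect: Boolean, val evidenceValid: Boolean) {
    // Correct text backed by invalid evidence is a latent bug, not a pass.
    val rightForWrongReasons: Boolean get() = textCorrect && !evidenceValid
}

fun scoreAnswer(answerText: String, expectedText: String, evidence: List<EvidenceFact>): AnswerScore =
    AnswerScore(
        textCorrect = answerText.trim().equals(expectedText.trim(), ignoreCase = true),
        evidenceValid = evidence.isNotEmpty() && evidence.all { it.valid },
    )
```

A rising rightForWrongReasons rate is the early warning that answer-level scores are about to decouple from reality.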

Forcing the system to forget

Controlled forgetting is the metric most implementations fail. The usual retrieval stack — embed, nearest-neighbor, return — has no native concept of invalidation. You need an explicit signal.

Three mechanisms, in increasing order of invasiveness:

  1. TTL per domain. Ownership facts expire in 30 days unless revalidated. Cheap, noisy, but catches the long tail.
  2. Supersession links. When a FactRevised event comes in, write a tombstone pointing the old fact at the new one. Retrieval filters tombstoned facts unless the query explicitly asks for history.
  3. Invalidation propagation. A FactInvalidated event fans out to dependent facts. Killing "payments-v1 is the source of truth" should also invalidate "payments-v1 uses Postgres 13".

In practice I run all three. TTL handles the drift you did not model, supersession handles the changes you did model, propagation handles the cascades. The evaluation harness specifically probes for each failure mode — miss a TTL, miss a supersession, miss a cascade — so you can tell which layer is leaking.
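Mechanisms 2 and 3 can be sketched over an in-memory store; the fact ids and dependency links below are hypothetical, and a real store would persist them:

```kotlin
// Sketch of supersession tombstones plus invalidation propagation.
class FactStore {
    private val supersededBy = mutableMapOf<String, String>()             // old id -> new id
    private val invalidated = mutableSetOf<String>()
    private val dependsOn = mutableMapOf<String, MutableList<String>>()   // fact -> dependents

    fun supersede(oldId: String, newId: String) { supersededBy[oldId] = newId }

    fun link(parentId: String, dependentId: String) {
        dependsOn.getOrPut(parentId) { mutableListOf() }.add(dependentId)
    }

    // Invalidation fans out: killing a parent fact cascades to dependents.
    fun invalidate(id: String) {
        if (!invalidated.add(id)) return
        dependsOn[id].orEmpty().forEach(::invalidate)
    }

    // Default retrieval path: tombstoned and invalidated facts are filtered.
    fun isRetrievable(id: String): Boolean =
        id !in invalidated && id !in supersededBy
}
```

Keeping superseded facts as tombstones rather than deleting them is what preserves the "ask for history" escape hatch.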

Wiring it into Spring

For services that already expose memory through a Spring-based retrieval API, the evaluator slots in as just another client. The harness becomes a scheduled job that replays a synthetic timeline against a shadow memory instance and publishes metrics to the same observability stack as everything else. No reason to build a parallel dashboard.

```kotlin
import io.micrometer.core.instrument.Gauge
import io.micrometer.core.instrument.MeterRegistry
import org.springframework.scheduling.annotation.Scheduled
import org.springframework.stereotype.Component

@Component
class MemoryHealthJob(
    private val evaluator: LifecycleEvaluator,
    private val timelineSource: TimelineSource,
    meter: MeterRegistry,
) {
    // Micrometer gauges sample on demand and hold their source weakly, so
    // register them once against a stable field; calling meter.gauge(...)
    // inside the scheduled method would not re-register or update them.
    @Volatile
    private var latest: StageMetrics? = null

    init {
        Gauge.builder("memory.freshness") { latest?.freshness ?: Double.NaN }.register(meter)
        Gauge.builder("memory.consistency") { latest?.consistency ?: Double.NaN }.register(meter)
        Gauge.builder("memory.coverage") { latest?.coverage ?: Double.NaN }.register(meter)
        Gauge.builder("memory.contamination") { latest?.contaminationRate ?: Double.NaN }.register(meter)
    }

    @Scheduled(cron = "0 0 * * * *") // hourly
    fun run() {
        val timeline = timelineSource.latest()
        latest = evaluator.evaluate(timeline).last().metrics
    }
}
```

Freshness and consistency are the two I alert on. Coverage tends to move slowly and contamination is a leading indicator for consistency — if contamination is rising, consistency will follow.

Pitfalls & Edge Cases

Over-forgetting. The first time you wire up aggressive TTLs, coverage collapses. Ownership facts expire before the revalidation job catches up; the assistant forgets who owns anything. The fix is a soft-expiry state: facts that are past TTL but not yet invalidated are retrievable with a confidence penalty, not excluded outright.
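One way to sketch soft expiry; the penalty factor and the double-TTL hard cutoff are illustrative choices, not fixed policy:

```kotlin
import java.time.Duration

// Past-TTL facts stay retrievable at reduced confidence instead of vanishing.
data class Retrieved(val factKey: String, val confidence: Double)

fun withSoftExpiry(
    factKey: String,
    baseConfidence: Double,
    age: Duration,
    ttl: Duration,
    penalty: Double = 0.5,
): Retrieved? =
    when {
        age <= ttl -> Retrieved(factKey, baseConfidence)                             // fresh
        age <= ttl.multipliedBy(2) -> Retrieved(factKey, baseConfidence * penalty)   // soft-expired
        else -> null                                                                 // hard-expired
    }
```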

Embedding drift masquerading as memory drift. If you change the embedding model, everything looks stale. The evaluator needs to distinguish "the fact is outdated" from "the retrieval layer changed." Version the embeddings and include the embedding version in the fact record.

Silent contradiction. Two facts can be individually plausible and jointly inconsistent — the store claims service X owns domain A and that service Y owns domain A. Pointwise retrieval never sees the conflict. The evaluator should run consistency checks across the retrieved set, not just per-fact.
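A set-level check is cheap once retrieval returns structured facts. This sketch (fact shape hypothetical) flags any two retrieved facts that answer the same (subject, predicate) question differently:

```kotlin
// Pairwise contradiction scan over a retrieved set.
data class RetrievedFact(val subject: String, val predicate: String, val value: String)

fun findContradictions(facts: List<RetrievedFact>): List<Pair<RetrievedFact, RetrievedFact>> =
    facts.groupBy { it.subject to it.predicate }
        .values
        .flatMap { group ->
            group.flatMapIndexed { i, a ->
                group.drop(i + 1).filter { b -> b.value != a.value }.map { b -> a to b }
            }
        }
```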

Probe leakage. If you author probes once and never update them, the evaluator starts testing yesterday's project. Probes should be versioned alongside the codebase and invalidated on the same signals as the facts they test.

The "still true, just less true" case. Some facts degrade gradually — a performance characteristic, a team size, a traffic pattern. They are never explicitly invalidated. Controlled forgetting does not help here; you need continuous revalidation against fresh observations. Treat this as a separate pipeline and do not try to squeeze it into the event model.
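For numeric facts of this kind, the revalidation check can be as simple as a relative-drift threshold; the tolerance and fact shape below are illustrative:

```kotlin
import kotlin.math.abs

// Compare the stored value against a fresh observation and flag drift
// past a relative tolerance (20% here, purely illustrative).
data class NumericFact(val key: String, val value: Double)

fun needsRevalidation(stored: NumericFact, observed: Double, tolerance: Double = 0.2): Boolean =
    abs(stored.value - observed) / maxOf(abs(stored.value), 1e-9) > tolerance
```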

Non-determinism in scoring. If consistency is scored by an LLM judge, its own drift will show up in your metrics. Pin the judge model version and re-baseline deliberately, not incidentally.

Practical Takeaways

  • Evaluate memory as a lifecycle, not a snapshot. Every change point is a test case.
  • Track freshness, consistency, coverage, and controlled forgetting. Raw recall on its own rewards hoarding.
  • Score at the fact level, not the answer-text level. Right-answer-wrong-evidence is a bug, not a pass.
  • Segment by fact domain. Ownership, decisions, requirements, and technical facts all have different staleness tolerances.
  • Combine TTLs, supersession links, and invalidation propagation. Each catches a different class of drift.
  • Run the evaluator as a scheduled job against a shadow store and wire its metrics into your existing observability stack.
  • Version your probes. A frozen probe set silently turns into a regression test for the past.

Conclusion

The memory layer is a cache over your project's ground truth. Like any cache, it is only as useful as its invalidation story. Evaluating it once, at a single point in time, tells you how good a snapshot it took; it tells you nothing about how fast it will rot.

Use this framework when the project is long-lived, the ground truth is mutable, and wrong answers are worse than "I don't know" — that is most real engineering contexts. Skip it when the memory is session-scoped or when the underlying domain genuinely does not change. The overhead is not free: a lifecycle harness, a timeline source, and per-domain staleness policies are real engineering investment. Pay for them where the cost of stale confidence is higher than the cost of the harness.

The shift in mindset is small but load-bearing. Stop asking "does it remember?" and start asking "does what it remembers still match reality, and how quickly does it notice when it doesn't?"

Written by Tiarê Balbi
