Writing · № 03

All posts

Long-form notes on software, distributed systems, and the craft of building. Shipping one a week.

All AI Backend Distributed Systems Engineering Research

All Posts

30 posts

Event-Log-as-Source-of-Truth Turns Schema Evolution Into a Forever Problem

When the log is the source of truth, every schema change is permanent. A Kotlin/Avro walkthrough of the rename that passed the Schema Registry check and silently corrupted every old event, plus the Protobuf and Avro invariants I now keep pinned above my desk.

May 5

Engineering02

Turning LLM Context Engineering Into an Evaluation Loop with DSPy

Notes from two weekends of digging into DSPy. I stopped treating prompts as the source of truth and started treating them as compiled output from a typed signature, a metric, and an optimizer. Here is the smallest end-to-end program I kept, how MIPROv2 actually searches, and where the approach breaks down in practice.

May 3

AI03

Exposing Spring AI Agents via the A2A Protocol: What Interoperability Actually Buys You

Spring AI's server-side A2A integration is stable enough to put in production, but the protocol is most useful at organizational boundaries, not as an internal RPC replacement. This post walks through what actually changes in a Spring AI codebase, where the sharp edges still are, and a practical decision framework for A2A vs MCP vs plain REST.

Apr 29

Engineering04

DBOS vs Temporal: When Postgres Is Enough for Durable Workflow Execution

DBOS reuses Postgres as the durability layer for workflows, while Temporal runs a dedicated cluster. The right choice depends on team size, workload shape, and where you want your operational budget to go. This is a practical rubric for picking between them.

Apr 26

Distributed Systems05

Cell-Based Architecture Isn't Free: What Slack, DoorDash, and Roblox Actually Paid For It

Cell-based architecture contains blast radius, but it is not free. A look at what Slack, DoorDash, and Roblox actually paid for cells in production — and a checklist for the cheaper fault-isolation patterns most teams should reach for first.

Apr 23

AI06

JetBrains Tracy: Pragmatic AI Observability for Kotlin

JetBrains Tracy is a Kotlin library that wires LLM-aware tracing into your app on top of OpenTelemetry. This post walks through how I integrated it in a Spring Boot service, the design decisions that matter, and the failure modes teams hit once LLM calls become the hottest path in their system.

Apr 22

AI07

The Deterministic Backbone: Why Production AI Systems Are Moving Away From Fully Autonomous Agents

Fully autonomous agents are hard to bound, hard to test, and expensive to operate. A deterministic backbone with narrow agent steps gives you the control flow back while keeping the intelligence where it matters. Here is how to design, test, and migrate toward it.

Apr 19

AI08

Memory Evaluation: Measuring How AI Memory Decays Over a Project's Lifetime

Most AI memory benchmarks grade on recall and stop there. That hides the real failure mode: stale facts quietly poisoning the context window. Here is a lifecycle-based evaluation framework that tests recall, revision, and controlled forgetting across the change points every long-lived project goes through.

Apr 17

Engineering09

The Transactional Outbox Is Not a Queue

The transactional outbox is a ledger, not a queue. Treating it like one is what breaks Postgres under load. This post walks through the specific failure modes — autovacuum stalls, xmin horizon drift, replication slot lag, poison pills — and the operational rules that actually keep it working in production.

Apr 17

Engineering10

Virtual Threads After JEP 491: The Bottleneck Moved

JEP 491 removed the `synchronized` pinning problem that kept virtual threads out of production. The interesting question now isn't whether to enable them — it's which bottleneck shows up next. A field guide for Spring Boot / Kotlin services running on JDK 24+.

Apr 17