Lesson 16: Architecture Decision Records — Making Decisions Visible and Reversible
Prerequisites: Lesson 11 (Transport Comparison).
Introduction
Six months from now, someone will ask: “Why does every Python agent spawn a separate Rust process just to send a JSON message?” The code shows what happens. The git log shows when. Neither explains why.
The answer lives in ADR-001: Python’s WebRTC ecosystem (aiortc) is fragile and lacks production-grade DTLS support, so each agent spawns a Rust sidecar that handles WebRTC and talks to the agent over Unix socket IPC. Without this record, the next engineer might reasonably conclude the sidecar is unnecessary and try to remove it.
Architecture Decision Records (ADRs) are short documents that capture the context, decision, and consequences of significant technical choices. They preserve the reasoning that code, tests, and configuration only imply.
1. Why Document Decisions?
Code is the what. Commits are the when. ADRs are the why.
Without ADRs, teams cycle through a costly pattern: Engineer A evaluates three approaches and implements one. Engineer B joins months later, finds the code confusing, and proposes rewriting it using an approach A already rejected. The team spends a week debating trade-offs that were already evaluated. ADRs keep the original evaluation accessible, so revisiting a decision starts from the recorded trade-offs instead of repeating the analysis.
The reasoning behind decisions is the most perishable knowledge on a project. Code persists in the repository. The reasoning exists only in the heads of those who made the decision, and it fades or leaves with them. ADRs convert that ephemeral knowledge into a durable artifact.
2. The ADR Format
The format, adapted from Michael Nygard’s 2011 template, has five sections under the title:
| Section | Purpose |
|---|---|
| Date | When the decision was made |
| Status | Proposed, accepted, deprecated, or superseded |
| Context | The situation requiring a decision |
| Decision | What we decided |
| Consequences | What follows — both positive and negative |
An ADR should take 10-30 minutes to write and 5 minutes to read.
chatixia-mesh adds a sixth section — Migration Path — answering “if this decision is wrong, how do we reverse it?” This forces thinking about reversibility at decision time, not when things break.
A complete example: ADR-004
## ADR-004: In-Memory State (No Database)
**Date:** 2026-03-21
**Status:** Accepted
**Context:** Registry needs to track agents, tasks, and signaling peers.
Options: database (PostgreSQL, Redis) or in-memory (DashMap).
**Decision:** All state is in-memory using DashMap. No database dependency.
**Consequences:**
- (+) Zero deployment complexity -- single binary, no external services
- (+) Very fast reads/writes
- (-) No durability -- restart loses all state
- (-) Single-instance only (no horizontal scaling)
**Migration path:** Add PostgreSQL when persistence or multi-instance is needed.
Notice: Context names alternatives. Consequences use (+) and (-) markers — honest ADRs always have both. Migration path names a specific technology and trigger.
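The conventions above can be sketched as a small data model. This is an illustrative sketch, not part of chatixia-mesh: the `ADR` and `Consequence` classes and the `problems()` check are hypothetical names, and the sketch assumes that a record should carry at least one (+) consequence, at least one (-) consequence, and a non-empty migration path.

```python
from dataclasses import dataclass, field


@dataclass
class Consequence:
    text: str
    positive: bool  # True renders as (+), False as (-)


@dataclass
class ADR:
    # Hypothetical model of one record; field names mirror the sections above.
    number: int
    title: str
    date: str
    status: str
    context: str
    decision: str
    consequences: list[Consequence] = field(default_factory=list)
    migration_path: str = ""

    def problems(self) -> list[str]:
        """Return reasons this record falls short of the conventions above."""
        found = []
        if not any(c.positive for c in self.consequences):
            found.append("no (+) consequence listed")
        if not any(not c.positive for c in self.consequences):
            found.append("no (-) consequence listed")
        if not self.migration_path.strip():
            found.append("missing migration path")
        return found

    def render(self) -> str:
        """Render the record in the Markdown layout of the ADR-004 example."""
        lines = [
            f"## ADR-{self.number:03d}: {self.title}",
            f"**Date:** {self.date}",
            f"**Status:** {self.status}",
            f"**Context:** {self.context}",
            f"**Decision:** {self.decision}",
            "**Consequences:**",
        ]
        for c in self.consequences:
            marker = "(+)" if c.positive else "(-)"
            lines.append(f"- {marker} {c.text}")
        lines.append(f"**Migration path:** {self.migration_path}")
        return "\n".join(lines)


adr_004 = ADR(
    number=4,
    title="In-Memory State (No Database)",
    date="2026-03-21",
    status="Accepted",
    context="Registry needs to track agents, tasks, and signaling peers.",
    decision="All state is in-memory using DashMap. No database dependency.",
    consequences=[
        Consequence("Zero deployment complexity", positive=True),
        Consequence("No durability -- restart loses all state", positive=False),
    ],
    migration_path="Add PostgreSQL when persistence or multi-instance is needed.",
)
print(adr_004.render())
print(adr_004.problems())  # an empty list means the record passes the check
```

A check like `problems()` could run in code review to flag one-sided records before they are merged.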
3. Case Study: The Evolution of Task Execution
The most instructive ADRs form chains — sequences where each decision builds on the previous. chatixia-mesh has a three-ADR chain tracing task execution’s evolution:
ADR-005 (Sync HTTP task queue): Python skill handlers are synchronous and cannot use the async IPC bridge. Workaround: route tasks through the registry’s REST API. Latency: 3-15s (poll-based). Migration path: “once async handlers are supported, route through DataChannel.”
ADR-013 (Heartbeat-driven execution): During E2E testing, tasks were assigned but never executed — the runner’s heartbeat loop discarded the response body. Fix: parse pending_tasks from heartbeat response and execute them. New limitation: heartbeat interval (~15s) bounds pickup latency.
ADR-016 (P2P DataChannel execution + HTTP fallback): All task data still routed through the registry, contradicting the P2P architecture. Solution: async skill handlers with P2P-first path via DataChannels, automatic HTTP fallback when P2P is unavailable. Sub-second latency on P2P path.
The chain tells a story of iterative refinement: a pragmatic workaround, a testing-driven bug fix, and a principled refactoring. ADR-005’s migration path predicted ADR-016 before it existed — this is the value of migration paths.
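The end state of the chain, a P2P-first path with automatic HTTP fallback, can be sketched in a few lines of async Python. This is a minimal sketch under stated assumptions, not the chatixia-mesh implementation: `send_via_datachannel`, `send_via_registry`, `submit_task`, the `P2PUnavailable` exception, and the one-second timeout are all hypothetical, and the two senders are stubs standing in for the sidecar IPC call and the registry's REST API.

```python
import asyncio


class P2PUnavailable(Exception):
    """Raised by the hypothetical P2P sender when no DataChannel is open."""


async def send_via_datachannel(peer_id: str, task: dict) -> dict:
    # Stub for the sidecar IPC call; in this sketch the peer is never reachable.
    raise P2PUnavailable(peer_id)


async def send_via_registry(peer_id: str, task: dict) -> dict:
    # Stub for the registry's REST task queue (the ADR-005 path).
    return {"task": task["id"], "via": "http"}


async def submit_task(peer_id, task, p2p=send_via_datachannel,
                      http=send_via_registry, timeout=1.0):
    """Try the P2P path first; fall back to HTTP if it fails or times out."""
    try:
        return await asyncio.wait_for(p2p(peer_id, task), timeout)
    except (P2PUnavailable, asyncio.TimeoutError):
        return await http(peer_id, task)


result = asyncio.run(submit_task("agent-b", {"id": "t1"}))
print(result)  # {'task': 't1', 'via': 'http'}
```

Keeping the HTTP path as the fallback means the earlier decisions in the chain stay in service as a safety net instead of being deleted.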
4. The Devil’s Advocate ADR
Most ADRs present decisions favorably. ADR-018 takes a different approach: it presents the case for WebRTC DataChannels, then systematically argues against its own conclusion with eleven specific criticisms, including:
- ICE gathering + DTLS handshake takes 5-10s per peer (vs ~50-100ms for TCP+TLS)
- The system already requires a central registry, making “registry is not in the data path” aspirational
- TURN relays all traffic through a server, negating P2P benefits
- Every agent deployment requires four moving parts where HTTP/gRPC would need one
These are genuine weaknesses the team acknowledges, quantifies, and accepts.
The devil’s advocate ADR builds trust in three ways. It demonstrates that the evaluation was thorough. It makes future re-evaluation possible by stating explicit conditions for reconsideration (“replace WebRTC if: all agents are on the same network, agent count exceeds ~30, webrtc-rs stalls, or WebTransport matures”). And it ends with honesty rather than certainty.
How to write one
Present your decision with benefits. For each benefit, ask when it becomes irrelevant. For each complexity cost, ask what the simpler alternative gives you for free. Quantify where possible. State explicit conditions for reversal. Do not resolve the tension.
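Explicit reversal conditions are easier to revisit when they are written as checks instead of prose. The sketch below encodes the four reconsideration conditions quoted from ADR-018. Everything else is an assumption made for illustration: the `MeshFacts` fields, the `triggered_conditions` helper, and reading "~30" as a hard threshold of 30.

```python
from dataclasses import dataclass


@dataclass
class MeshFacts:
    # Hypothetical snapshot of the facts the conditions depend on.
    all_agents_same_network: bool
    agent_count: int
    webrtc_rs_stalled: bool
    webtransport_mature: bool


# Each entry pairs a condition from ADR-018 with a predicate over the facts.
REVERSAL_CONDITIONS = [
    ("all agents are on the same network", lambda f: f.all_agents_same_network),
    ("agent count exceeds ~30", lambda f: f.agent_count > 30),  # 30 is assumed
    ("webrtc-rs stalls", lambda f: f.webrtc_rs_stalled),
    ("WebTransport matures", lambda f: f.webtransport_mature),
]


def triggered_conditions(facts: MeshFacts) -> list[str]:
    """Return the labels of every reversal condition that currently holds."""
    return [label for label, holds in REVERSAL_CONDITIONS if holds(facts)]


today = MeshFacts(all_agents_same_network=False, agent_count=42,
                  webrtc_rs_stalled=False, webtransport_mature=False)
print(triggered_conditions(today))  # ['agent count exceeds ~30']
```

A non-empty result does not reverse the decision by itself; it signals that the ADR is due for re-evaluation.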
5. When to Write an ADR
Write one when the decision is:
- Irreversible or expensive to reverse — transport protocol, database choice
- Cross-cutting — affects multiple components
- Likely to be questioned — “someone will ask why we did this”
- Involves significant trade-offs — strong arguments on both sides
Skip when the decision is:
- Cheap to reverse — logging library choice
- Local to one component — internal data structure choice
- Industry standard — JWT for auth, JSON for config
Heuristic: If you would spend more than 30 minutes explaining the decision to a new team member (including alternatives considered), write the ADR. It will take less than 30 minutes and save that explanation for every future reader.
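The filter and the 30-minute heuristic can be written down as a single predicate. This is a sketch, assuming that any one of the four criteria is enough to justify an ADR (the lesson does not say whether one or all must hold); `DecisionProfile` and `should_write_adr` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class DecisionProfile:
    expensive_to_reverse: bool = False
    cross_cutting: bool = False
    likely_to_be_questioned: bool = False
    significant_trade_offs: bool = False
    minutes_to_explain: int = 0  # estimated time to brief a new team member


def should_write_adr(d: DecisionProfile) -> bool:
    """Apply the four criteria, then the 30-minute explanation heuristic."""
    meets_criterion = any([
        d.expensive_to_reverse,
        d.cross_cutting,
        d.likely_to_be_questioned,
        d.significant_trade_offs,
    ])
    return meets_criterion or d.minutes_to_explain > 30


transport = DecisionProfile(expensive_to_reverse=True, cross_cutting=True)
logging_lib = DecisionProfile(minutes_to_explain=5)
print(should_write_adr(transport))    # True
print(should_write_adr(logging_lib))  # False
```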
6. Living Documentation
An ADR describing a decision the codebase no longer follows is worse than no ADR. Keeping docs in sync requires triggers, not good intentions.
chatixia-mesh uses a documentation matrix mapping code changes to specific documents: add a route, update COMPONENTS.md; change auth flow, update SYSTEM_DESIGN.md; make an architectural decision, add to ADR.md. “Update the docs” is vague and ignorable. “Add the new route to the routes table in COMPONENTS.md” is concrete and takes two minutes.
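A documentation matrix is a lookup table from kinds of change to concrete documentation tasks. The sketch below uses the three mappings named above. The change-kind keys, the `doc_tasks` helper, and the wording of the second and third actions are assumptions for illustration, not the project's actual matrix.

```python
# Maps a kind of code change to (document, concrete action).
DOC_MATRIX = {
    "add_route": ("COMPONENTS.md", "Add the new route to the routes table"),
    "change_auth_flow": ("SYSTEM_DESIGN.md", "Update the auth flow description"),
    "architectural_decision": ("ADR.md", "Append a new ADR"),
}


def doc_tasks(change_kinds: list[str]) -> list[str]:
    """Turn the kinds of change in a pull request into documentation tasks."""
    tasks = []
    for kind in change_kinds:
        if kind not in DOC_MATRIX:
            continue  # this kind of change has no documentation trigger
        doc, action = DOC_MATRIX[kind]
        tasks.append(f"{doc}: {action}")
    return tasks


print(doc_tasks(["add_route", "fix_typo"]))
# ['COMPONENTS.md: Add the new route to the routes table']
```

The output is a checklist a reviewer can tick off, which is what makes the trigger concrete.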
chatixia-mesh keeps all ADRs in a single file (docs/ADR.md) — searchable with one Ctrl+F, readable in sequence since ADRs reference each other, and low overhead to append. For larger organizations with hundreds of ADRs, one file per ADR with an index is more practical.
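A single-file layout also makes the ADRs easy to index and lint with a short script. The sketch below assumes every record starts with a `## ADR-NNN: Title` heading and uses the bold field labels from the ADR-004 example. `index_adrs` is a hypothetical helper, and ADR-099 in the sample text is an invented, deliberately incomplete record used only to exercise the check.

```python
import re

ADR_HEADING = re.compile(r"^## ADR-(\d+): (.+)$", re.MULTILINE)
REQUIRED_FIELDS = [
    "**Date:**", "**Status:**", "**Context:**",
    "**Decision:**", "**Consequences:**", "**Migration path:**",
]


def index_adrs(text: str) -> dict[int, tuple[str, list[str]]]:
    """Return {adr_number: (title, missing_fields)} for each ADR in the text."""
    matches = list(ADR_HEADING.finditer(text))
    index = {}
    for i, m in enumerate(matches):
        # A record's body runs from its heading to the next heading (or EOF).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end]
        missing = [f for f in REQUIRED_FIELDS if f not in body]
        index[int(m.group(1))] = (m.group(2).strip(), missing)
    return index


SAMPLE = """\
## ADR-004: In-Memory State (No Database)
**Date:** 2026-03-21
**Status:** Accepted
**Context:** Registry needs to track agents, tasks, and signaling peers.
**Decision:** All state is in-memory using DashMap. No database dependency.
**Consequences:**
- (+) Zero deployment complexity
- (-) No durability
**Migration path:** Add PostgreSQL when persistence or multi-instance is needed.

## ADR-099: Placeholder Entry
**Date:** 2026-01-01
**Status:** Proposed
**Context:** Hypothetical record used only to exercise the check.
**Decision:** None yet.
**Consequences:**
- (+) None yet
"""

adr_index = index_adrs(SAMPLE)
for number, (title, missing) in adr_index.items():
    print(number, title, missing)
```

In practice the text would come from reading docs/ADR.md; the inline sample keeps the sketch self-contained.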
Three triggers catch documentation drift: code review (does the relevant doc still match?), session boundaries (meeting notes prompt doc updates), and new contributor onboarding (discrepancies found during reading are documentation bugs).
Summary
ADRs preserve the why behind your architecture. The format is simple: Date, Status, Context, Decision, Consequences, and optionally a Migration Path.
The best ADRs include arguments against the decision, conditions for reconsideration, and known costs the team accepts. Not every decision needs one — apply the filter: irreversible, cross-cutting, likely to be questioned, significant trade-offs.
Documentation stays alive when updates are triggered by code changes, not by willpower. A documentation matrix that maps code changes to specific documents converts maintenance from a chore into a habit.