In-Memory State -- DashMap, Eventual Consistency, and the Database Question
Every networked service needs to store state -- who is connected, what work is pending, what configuration applies. The question is where that state lives and what guarantees it provides.
Lesson 12: In-Memory State — DashMap, Eventual Consistency, and the Database Question
Prerequisites: Lesson 04: Async Programming Patterns, Lesson 07: Application Protocol Design
1. The Spectrum of State Management
State storage falls into three broad tiers:
| Tier | Example | Durability | Speed | Scaling |
|---|---|---|---|---|
| In-memory | HashMap, DashMap | None — lost on restart | Nanoseconds | Single process |
| Embedded DB | SQLite, RocksDB | Durable to disk | Microseconds | Single process |
| External DB | PostgreSQL, Redis | Durable, replicated | Milliseconds | Multi-instance |
The chatixia-mesh registry sits in the first tier. All state lives in DashMap instances that exist only in the registry process’s memory. This is a deliberate choice (ADR-004): the registry is a single binary with zero external dependencies.
The key insight: if your clients will re-send their state on a fixed interval, you can treat the server as a cache rather than a source of truth. The agents are the source of truth for their own existence. The registry merely aggregates that truth.
2. DashMap: A Concurrent HashMap for Rust
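Rust’s standard HashMap is not thread-safe. Wrapping it in Mutex<HashMap<K, V>> makes it safe but locks the entire map for every operation. DashMap reduces that contention with sharded locking: it splits the map into multiple shards, each with its own RwLock. The default shard count is a power of two derived from the number of CPU cores (16 on a 4-core machine, for example). Two operations on keys in different shards proceed in parallel without contending for the same lock.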
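The sharding idea can be sketched with the standard library alone. This is a minimal illustration, not DashMap's actual implementation: the names ShardedMap, shard_index, and get_cloned are hypothetical, and the shard count is simply a constructor argument.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

/// A toy sharded map: every shard is an independent RwLock<HashMap>.
/// Illustrative only; DashMap's internals differ.
pub struct ShardedMap<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> ShardedMap<K, V> {
    /// `shard_count` is a free parameter in this sketch.
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        let shards = (0..shard_count)
            .map(|_| RwLock::new(HashMap::new()))
            .collect();
        Self { shards }
    }

    /// Hash the key to pick the single shard that owns it.
    pub fn shard_index(&self, key: &K) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.shards.len()
    }

    /// Write-locks only the owning shard; returns the previous value, if any.
    pub fn insert(&self, key: K, value: V) -> Option<V> {
        let idx = self.shard_index(&key);
        self.shards[idx].write().unwrap().insert(key, value)
    }

    /// Read-locks only the owning shard and clones the value out,
    /// so the lock is released before the caller uses the result.
    pub fn get_cloned(&self, key: &K) -> Option<V> {
        let idx = self.shard_index(key);
        self.shards[idx].read().unwrap().get(key).cloned()
    }

    /// Total number of entries, summed shard by shard.
    pub fn len(&self) -> usize {
        self.shards.iter().map(|s| s.read().unwrap().len()).sum()
    }
}
```

Because insert and get_cloned touch only one shard's lock, two threads working on keys that hash to different shards never wait on each other. With a single shard, the structure degrades to the one-big-lock behavior of a wrapped HashMap.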
The registry uses 6 DashMap instances across 4 state structs: agent records, tasks, signaling peers, invite codes, onboarding entries, and rate-limit buckets. All are wrapped in Arc for sharing across tokio tasks and HTTP handlers.
Key operations in the codebase
- Insert: agents.insert(info.agent_id.clone(), record) overwrites the entry if the key already exists
- Get: Returns a Ref<K, V> guard holding the shard’s read lock. Clone the value before the guard drops; never hold it across an .await point
- Get mutable: agents.get_mut(&id) returns a RefMut<K, V> with the shard’s write lock for in-place updates
- Iterate with mutation: iter_mut() write-locks each shard in sequence, which is safe but briefly blocks other readers and writers of that shard
- Retain: Iterates and removes entries where the closure returns false; used for TTL cleanup
- Remove: Atomically removes a key-value pair
3. Eventual Consistency via Heartbeats
The registry does not require agents to be pre-configured. Agents announce themselves by sending heartbeats, and the registry builds its view from those heartbeats. The heartbeat handler performs an upsert: update if exists, insert if new.
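A minimal sketch of such an upsert, assuming a plain HashMap in place of the registry's DashMap so that it runs without external crates. The AgentRecord fields and the handle_heartbeat signature are hypothetical; the source does not show the real handler.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Hypothetical record shape; the registry's real struct is not shown in this lesson.
#[derive(Debug, Clone, PartialEq)]
pub struct AgentRecord {
    pub skills: Vec<String>,
    pub last_heartbeat: Instant,
    pub health: &'static str,
}

/// Upsert: refresh the record if the agent is known, insert it otherwise.
/// Returns true when the heartbeat introduced a new agent.
pub fn handle_heartbeat(
    agents: &mut HashMap<String, AgentRecord>,
    agent_id: &str,
    skills: Vec<String>,
    now: Instant,
) -> bool {
    if let Some(record) = agents.get_mut(agent_id) {
        // Known agent: update in place and mark it active again.
        record.skills = skills;
        record.last_heartbeat = now;
        record.health = "active";
        return false;
    }
    // Unknown agent (first contact, or the registry restarted): insert.
    agents.insert(
        agent_id.to_string(),
        AgentRecord {
            skills,
            last_heartbeat: now,
            health: "active",
        },
    );
    true
}
```

The same code path serves first registration and recovery after a restart, which is why the registry needs no separate registration step.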
The registry’s state is always eventually consistent with the actual set of running agents. The convergence window is one heartbeat interval: 15 seconds.
T=0s Registry starts (empty)
T=3s Agent-A heartbeat --> registry knows A
T=8s Agent-B heartbeat --> registry knows A and B
T=20s Registry restarts (empty again)
T=23s Agent-B heartbeat --> re-learns B
T=33s Agent-A heartbeat --> full convergence
This design means the registry has no startup dependencies. Restart it at any time, and within 15 seconds the world rebuilds itself. The trade-off: during those 15 seconds, requests return partial results.
4. The Health State Machine
The health_check_loop runs every 15 seconds and classifies each agent by heartbeat age:
- active — last heartbeat < 90 seconds ago (fewer than 6 missed heartbeats)
- stale — 90-270 seconds (6-18 missed heartbeats)
- offline — > 270 seconds
Any heartbeat received resets the agent to “active.” The find_by_skill method only returns active agents, preventing task assignment to agents that are probably dead.
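The thresholds above translate into a small pure function. The sketch below is an assumption about shape, not the registry's code: the Health enum, the Agent struct, and this find_by_skill signature are hypothetical, and only the 90-second and 270-second boundaries come from the lesson.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Active,
    Stale,
    Offline,
}

/// Thresholds from the lesson: 6 and 18 missed 15-second heartbeats.
pub const STALE_AFTER: Duration = Duration::from_secs(90);
pub const OFFLINE_AFTER: Duration = Duration::from_secs(270);

/// Classify an agent by the age of its last heartbeat.
pub fn classify(age: Duration) -> Health {
    if age < STALE_AFTER {
        Health::Active
    } else if age <= OFFLINE_AFTER {
        Health::Stale
    } else {
        Health::Offline
    }
}

/// Hypothetical agent view used only for this sketch.
pub struct Agent {
    pub id: String,
    pub skills: Vec<String>,
    pub heartbeat_age: Duration,
}

/// Return the ids of active agents that advertise `skill`.
pub fn find_by_skill<'a>(agents: &'a [Agent], skill: &str) -> Vec<&'a str> {
    agents
        .iter()
        .filter(|a| classify(a.heartbeat_age) == Health::Active)
        .filter(|a| a.skills.iter().any(|s| s == skill))
        .map(|a| a.id.as_str())
        .collect()
}
```

Keeping classification a pure function of heartbeat age makes the boundaries easy to test without running a background loop.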
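The loop uses tokio::time::sleep rather than tokio::time::interval, so it waits a full 15 seconds after each iteration completes. The effective period is 15 seconds plus the scan time, and a slow scan can never trigger overlapping or back-to-back catch-up runs, which is simpler and safer for a health check.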
5. The TTL Pattern
Three background loops use the same pattern: sleep, scan, expire.
| Loop | Interval | TTL | What expires |
|---|---|---|---|
| health_check_loop | 15s | 90s stale / 270s offline | Agent health transitions |
| expire_tasks_loop | 30s | 300s (task.ttl) | Pending/assigned tasks marked “failed” |
| cleanup_loop | 60s | 300s (codes) | Used/expired invite codes, empty rate-limit buckets |
Task expiration marks tasks as “failed” with error “TTL expired” rather than removing them, preserving them for debugging. Note: completed tasks are never garbage-collected — a known limitation.
Invite code cleanup uses retain() to remove used or expired codes, and prunes empty rate-limit buckets.
This pattern avoids per-entry timers. A single background scan handles all entries in one pass. The worst-case staleness is the TTL value plus one scan interval (e.g., a task could live up to 330 seconds before expiration).
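The scan step of two of these loops might look like the following sketch. It uses a plain HashMap in place of DashMap so it runs with the standard library only, and the Task and InviteCode fields are hypothetical. Only the "failed" state, the "TTL expired" error, and the use of retain come from the lesson.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical task shape for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Task {
    pub state: String, // "pending", "assigned", "completed", or "failed"
    pub error: Option<String>,
    pub created_at: Instant,
    pub ttl: Duration,
}

/// One scan pass: mark overdue pending/assigned tasks as failed.
/// Tasks are kept in the map, not removed. Returns how many expired.
pub fn expire_tasks(tasks: &mut HashMap<String, Task>, now: Instant) -> usize {
    let mut expired = 0;
    for task in tasks.values_mut() {
        let open = task.state == "pending" || task.state == "assigned";
        if open && now.saturating_duration_since(task.created_at) >= task.ttl {
            task.state = "failed".to_string();
            task.error = Some("TTL expired".to_string());
            expired += 1;
        }
    }
    expired
}

/// Hypothetical invite-code shape for this sketch.
pub struct InviteCode {
    pub used: bool,
    pub expires_at: Instant,
}

/// One scan pass: keep only codes that are unused and not yet expired.
pub fn cleanup_codes(codes: &mut HashMap<String, InviteCode>, now: Instant) {
    codes.retain(|_, code| !code.used && code.expires_at > now);
}
```

A background loop would call one of these functions, sleep for its scan interval, and repeat. Since a task can cross its TTL just after a scan finishes, it may wait almost one full interval for the next pass, which is where the 300 + 30 = 330 second worst case comes from.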
6. When to Add a Database
The in-memory approach works because the registry is a coordination point, not a system of record. But it has clear limits. On registry restart:
| Data | Recovery | Impact |
|---|---|---|
| Agent registrations | Auto-recover (~15s) | Brief dashboard flicker |
| Task queue | LOST permanently | Pending tasks vanish |
| Signaling peers | Auto-recover (seconds) | Brief connectivity gap |
| Invite codes | LOST permanently | In-flight pairing fails |
| Rate-limit buckets | Reset (harmless) | Briefly allows extra attempts |
What to move to a database: Task queue (lifecycle spans minutes, must not be lost) and onboarding entries (long-lived credentials).
What to keep in memory: Agent registrations (rebuilt from heartbeats), signaling peers (cannot serialize channel handles), rate-limit buckets (ephemeral), invite codes (5-minute TTL).
Beyond durability, the second reason for a database is horizontal scaling. Because DashMap state lives inside a single process, only one registry instance can run. A shared database enables multiple instances behind a load balancer. The signaling peer map would still need cross-instance coordination (for example, via Redis pub/sub), but the registry API would scale horizontally.
Summary
The chatixia-mesh registry uses in-memory DashMap instances as its sole state store. This eliminates infrastructure dependencies and delivers nanosecond-scale reads, but all state is volatile. The system compensates through:
- Heartbeat-driven eventual consistency — agents re-announce themselves, so the registry rebuilds within 15 seconds of any restart
- Background TTL loops — three spawned tasks scan their DashMaps on fixed intervals, expiring stale entries
The state that cannot self-heal (task queue, onboarding entries) is the state that would justify adding a database. ADR-004 documents this as a planned migration path for when the system needs durability or multi-instance scaling.