Lesson 12: In-Memory State — DashMap, Eventual Consistency, and the Database Question

Prerequisites: Lesson 04: Async Programming Patterns, Lesson 07: Application Protocol Design

1. The Spectrum of State Management

Every networked service needs to store state. There are three broad tiers:

Tier	Example	Durability	Speed	Scaling
In-memory	`HashMap`, `DashMap`	None — lost on restart	Nanoseconds	Single process
Embedded DB	SQLite, RocksDB	Durable to disk	Microseconds	Single process
External DB	PostgreSQL, Redis	Durable, replicated	Milliseconds	Multi-instance

The chatixia-mesh registry sits in the first tier. All state lives in DashMap instances that exist only in the registry process’s memory. This is a deliberate choice (ADR-004): the registry is a single binary with zero external dependencies.

The key insight: if your clients will re-send their state on a fixed interval, you can treat the server as a cache rather than a source of truth. The agents are the source of truth for their own existence. The registry merely aggregates that truth.

2. DashMap: A Concurrent HashMap for Rust

Rust’s standard HashMap is not thread-safe. Wrapping it in Mutex<HashMap<K, V>> locks the entire map for every operation. DashMap solves this with sharded locking: it splits the map into a fixed number of shards (by default a power of two derived from the CPU count, for example 16 on a 4-core machine), each with its own RwLock. Two operations on keys in different shards proceed in parallel without contending for the same lock.
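To make sharded locking concrete, here is a minimal sketch built only on the standard library. It is not DashMap's actual implementation; ShardedMap, shard_index, and get_cloned are hypothetical names chosen for illustration, and the shard count is simply whatever the caller passes in.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

/// Illustrative only: a vector of independently locked maps.
pub struct ShardedMap<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> ShardedMap<K, V> {
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        let shards = (0..shard_count)
            .map(|_| RwLock::new(HashMap::new()))
            .collect();
        Self { shards }
    }

    /// Hash the key to decide which shard owns it.
    pub fn shard_index(&self, key: &K) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.shards.len()
    }

    /// Takes the write lock of one shard only; other shards stay available.
    pub fn insert(&self, key: K, value: V) -> Option<V> {
        let idx = self.shard_index(&key);
        let mut guard = self.shards[idx].write().unwrap();
        guard.insert(key, value)
    }

    /// Takes the read lock of one shard, clones the value out,
    /// and releases the lock when the guard goes out of scope.
    pub fn get_cloned(&self, key: &K) -> Option<V> {
        let idx = self.shard_index(key);
        let guard = self.shards[idx].read().unwrap();
        guard.get(key).cloned()
    }
}
```

Because each operation locks only the shard its key hashes to, writers on different shards never wait for each other. Returning a clone from get_cloned mirrors the advice to copy the value out rather than hold a guard.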

The registry uses 6 DashMap instances across 4 state structs: agent records, tasks, signaling peers, invite codes, onboarding entries, and rate-limit buckets. All are wrapped in Arc for sharing across tokio tasks and HTTP handlers.

Key operations in the codebase

Insert: agents.insert(info.agent_id.clone(), record) — overwrites if exists
Get: Returns a Ref<K, V> guard that holds the shard’s read lock. Clone the value out and let the guard drop promptly; never hold a guard across an .await point, because writers to that shard block until it is released
Get mutable: agents.get_mut(&id) returns a RefMut<K, V> with the shard’s write lock for in-place updates
Iterate with mutation: iter_mut() takes each shard’s write lock in turn, so it is safe but briefly blocks other reads and writes on the shard being visited
Retain: Iterates and removes entries where the closure returns false — used for TTL cleanup
Remove: Atomically removes a key-value pair

3. Eventual Consistency via Heartbeats

The registry does not require agents to be pre-configured. Agents announce themselves by sending heartbeats, and the registry builds its view from those heartbeats. The heartbeat handler performs an upsert: update if exists, insert if new.
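The upsert can be sketched as follows. This is an assumption-laden illustration, not the registry's handler: a std HashMap stands in for the DashMap, and AgentRecord, first_seen_secs, last_heartbeat_secs, and apply_heartbeat are hypothetical names.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hypothetical record shape; the real registry stores more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct AgentRecord {
    pub first_seen_secs: u64,
    pub last_heartbeat_secs: u64,
}

/// Upsert: refresh the record if the agent is known, create it otherwise.
/// Returns true when the agent was previously unknown to this map.
pub fn apply_heartbeat(
    agents: &mut HashMap<String, AgentRecord>,
    agent_id: &str,
    now_secs: u64,
) -> bool {
    match agents.entry(agent_id.to_string()) {
        Entry::Occupied(mut slot) => {
            // Known agent: keep first_seen_secs, refresh the timestamp.
            slot.get_mut().last_heartbeat_secs = now_secs;
            false
        }
        Entry::Vacant(slot) => {
            // Unknown agent (first contact, or the registry restarted).
            slot.insert(AgentRecord {
                first_seen_secs: now_secs,
                last_heartbeat_secs: now_secs,
            });
            true
        }
    }
}
```

The handler cannot tell a brand-new agent from one it forgot in a restart, and it does not need to: both cases take the insert branch, which is what lets the registry rebuild its view without any recovery logic.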

The registry’s state is always eventually consistent with the actual set of running agents. The convergence window is one heartbeat interval: 15 seconds.

T=0s    Registry starts (empty)
T=3s    Agent-A heartbeat --> registry knows A
T=8s    Agent-B heartbeat --> registry knows A and B
T=20s   Registry restarts (empty again)
T=23s   Agent-B heartbeat --> re-learns B
T=33s   Agent-A heartbeat --> full convergence

This design means the registry has no startup dependencies. Restart it at any time, and within 15 seconds the world rebuilds itself. The trade-off: during those 15 seconds, requests return partial results.

4. The Health State Machine

The health_check_loop runs every 15 seconds and classifies each agent by heartbeat age:

active — last heartbeat < 90 seconds ago (fewer than 6 missed heartbeats)
stale — 90-270 seconds (6-18 missed heartbeats)
offline — > 270 seconds

Any heartbeat received resets the agent to “active.” The find_by_skill method only returns active agents, preventing task assignment to agents that are probably dead.
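The classification rule can be written as a small pure function. The sketch below assumes timestamps are plain seconds and that an age of exactly 270 seconds still counts as stale (the text only says offline is "> 270"); Health, classify, and the constant names are hypothetical.

```rust
/// The three health states named in the lesson.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Active,
    Stale,
    Offline,
}

/// Thresholds from the lesson: 6 and 18 missed 15-second heartbeats.
pub const STALE_AFTER_SECS: u64 = 90;
pub const OFFLINE_AFTER_SECS: u64 = 270;

/// Classify an agent by the age of its last heartbeat.
pub fn classify(now_secs: u64, last_heartbeat_secs: u64) -> Health {
    // saturating_sub guards against a heartbeat stamped slightly in the future.
    let age = now_secs.saturating_sub(last_heartbeat_secs);
    if age < STALE_AFTER_SECS {
        Health::Active
    } else if age <= OFFLINE_AFTER_SECS {
        // Assumption: the 270-second boundary itself is still "stale".
        Health::Stale
    } else {
        Health::Offline
    }
}
```

Keeping the rule free of I/O and locks means the background loop only has to iterate the map and call it, and the boundaries can be unit-tested directly.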

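The loop uses tokio::time::sleep rather than tokio::time::interval, so each 15-second wait starts only after the previous scan completes. The effective period is therefore 15 seconds plus the scan time, and a slow scan can never cause back-to-back runs; interval, by default, fires delayed ticks in a burst to catch up. For a health check, that simplicity matters more than a drift-free schedule.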

5. The TTL Pattern

Three background loops use the same pattern: sleep, scan, expire.

Loop	Interval	TTL	What expires
health_check_loop	15s	90s stale / 270s offline	Agent health transitions
expire_tasks_loop	30s	300s (task.ttl)	Pending/assigned tasks marked “failed”
cleanup_loop	60s	300s (codes)	Used/expired invite codes, empty rate-limit buckets

Task expiration marks tasks as “failed” with error “TTL expired” rather than removing them, preserving them for debugging. Note: completed tasks are never garbage-collected — a known limitation.
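A mark-instead-of-remove sweep might look like the sketch below. It is illustrative only: a std HashMap stands in for the task DashMap, Task, TaskStatus, and expire_tasks are hypothetical names, and treating a task as expired when its age is strictly greater than its TTL is an assumption.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum TaskStatus {
    Pending,
    Assigned,
    Completed,
    Failed,
}

/// Hypothetical task shape with a per-task TTL in seconds.
#[derive(Debug, Clone)]
pub struct Task {
    pub status: TaskStatus,
    pub created_at_secs: u64,
    pub ttl_secs: u64,
    pub error: Option<String>,
}

/// One scan: mark overdue pending/assigned tasks as failed.
/// Nothing is removed, so expired tasks remain inspectable.
pub fn expire_tasks(tasks: &mut HashMap<String, Task>, now_secs: u64) -> usize {
    let mut expired = 0;
    for task in tasks.values_mut() {
        let open = matches!(task.status, TaskStatus::Pending | TaskStatus::Assigned);
        let age = now_secs.saturating_sub(task.created_at_secs);
        if open && age > task.ttl_secs {
            task.status = TaskStatus::Failed;
            task.error = Some("TTL expired".to_string());
            expired += 1;
        }
    }
    expired
}
```

Completed and already-failed tasks are skipped by the status check, which is also why they accumulate: nothing in this pass ever deletes an entry.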

Invite code cleanup uses retain() to remove used or expired codes, and prunes empty rate-limit buckets.
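The retain-based cleanup can be sketched with std HashMap::retain, whose closure has the same keep-if-true shape. InviteCode, its fields, the bucket representation (a list of attempt timestamps), and cleanup are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical invite code record.
#[derive(Debug, Clone)]
pub struct InviteCode {
    pub used: bool,
    pub expires_at_secs: u64,
}

/// One cleanup pass over both maps.
/// retain keeps an entry only when the closure returns true.
pub fn cleanup(
    codes: &mut HashMap<String, InviteCode>,
    buckets: &mut HashMap<String, Vec<u64>>,
    now_secs: u64,
) {
    // Keep codes that are both unused and not yet expired.
    codes.retain(|_code, invite| !invite.used && invite.expires_at_secs > now_secs);
    // Drop rate-limit buckets that no longer hold any attempts.
    buckets.retain(|_client, attempts| !attempts.is_empty());
}
```

Phrasing the closure as "what to keep" avoids the common mistake of inverting the condition and deleting every live entry.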

This pattern avoids per-entry timers. A single background scan handles all entries in one pass. The worst-case staleness is the TTL value plus one scan interval (e.g., a task could live up to 330 seconds before expiration).

6. When to Add a Database

The in-memory approach works because the registry is a coordination point, not a system of record. But it has clear limits. On registry restart:

Data	Recovery	Impact
Agent registrations	Auto-recover (~15s)	Brief dashboard flicker
Task queue	LOST permanently	Pending tasks vanish
Signaling peers	Auto-recover (seconds)	Brief connectivity gap
Invite codes	LOST permanently	In-flight pairing fails
Rate-limit buckets	Reset (harmless)	Briefly allows extra attempts

What to move to a database: Task queue (lifecycle spans minutes, must not be lost) and onboarding entries (long-lived credentials).

What to keep in memory: Agent registrations (rebuilt from heartbeats), signaling peers (cannot serialize channel handles), rate-limit buckets (ephemeral), invite codes (5-minute TTL).

The second reason for a database is horizontal scaling. DashMap state lives inside a single process, so a second registry instance would hold its own separate, partial view; in practice only one instance can run. A shared database enables multiple instances behind a load balancer. The signaling peer map would still need coordination (via Redis pub/sub), but the registry API would scale horizontally.

Summary

The chatixia-mesh registry uses in-memory DashMap instances as its sole state store. This eliminates infrastructure dependencies and delivers nanosecond-scale reads, but all state is volatile. The system compensates through:

Heartbeat-driven eventual consistency — agents re-announce themselves, so the registry rebuilds within 15 seconds of any restart
Background TTL loops — three spawned tasks scan their DashMaps on fixed intervals, expiring stale entries

The state that cannot self-heal (task queue, onboarding entries) is the state that would justify adding a database. ADR-004 documents this as a planned migration path for when the system needs durability or multi-instance scaling.

Previous: Lesson 11: Transport Comparison | Next: Lesson 13: Building Monitoring Dashboards
