chatixia blog
Deep Dive March 22, 2026 · 6 min read

In-Memory State -- DashMap, Eventual Consistency, and the Database Question

Every networked service needs to store state -- who is connected, what work is pending, what configuration applies. The question is where that state lives and what guarantees it provides.

Tags: state-management, consistency, rust, dashmap

Lesson 12: In-Memory State — DashMap, Eventual Consistency, and the Database Question

Prerequisites: Lesson 04: Async Programming Patterns, Lesson 07: Application Protocol Design


1. The Spectrum of State Management

Every networked service needs to store state. There are three broad tiers:

| Tier | Example | Durability | Speed | Scaling |
| --- | --- | --- | --- | --- |
| In-memory | HashMap, DashMap | None (lost on restart) | Nanoseconds | Single process |
| Embedded DB | SQLite, RocksDB | Durable to disk | Microseconds | Single process |
| External DB | PostgreSQL, Redis | Durable, replicated | Milliseconds | Multi-instance |

The chatixia-mesh registry sits in the first tier. All state lives in DashMap instances that exist only in the registry process’s memory. This is a deliberate choice (ADR-004): the registry is a single binary with zero external dependencies.

The key insight: if your clients will re-send their state on a fixed interval, you can treat the server as a cache rather than a source of truth. The agents are the source of truth for their own existence. The registry merely aggregates that truth.


2. DashMap: A Concurrent HashMap for Rust

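Rust’s standard HashMap can only be mutated through exclusive access, so sharing it between threads requires a lock. Wrapping it in Mutex<HashMap<K, V>> serializes every operation on the whole map. DashMap instead uses sharded locking: it splits the map into several shards (a power of two, derived by default from the number of CPU cores), each guarded by its own RwLock. Operations on keys in different shards proceed in parallel without contending for the same lock.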
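To make the idea of sharded locking concrete, here is a minimal sketch built only from the standard library. It is not how DashMap is implemented internally; the `ShardedMap` type and its `get_cloned` method are hypothetical names chosen for illustration. It shows the core mechanism: hash the key, pick a shard, and lock only that shard.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

/// Illustrative sharded map (hypothetical type, not DashMap's real internals).
/// Each shard is an ordinary HashMap behind its own RwLock.
struct ShardedMap<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> ShardedMap<K, V> {
    fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        let shards = (0..shard_count)
            .map(|_| RwLock::new(HashMap::new()))
            .collect();
        Self { shards }
    }

    /// Hash the key to decide which shard (and therefore which lock) owns it.
    fn shard_index(&self, key: &K) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.shards.len()
    }

    /// Takes the write lock of one shard only; other shards stay available.
    /// Returns the previous value if the key was already present.
    fn insert(&self, key: K, value: V) -> Option<V> {
        let idx = self.shard_index(&key);
        let mut shard = self.shards[idx].write().unwrap();
        shard.insert(key, value)
    }

    /// Takes the read lock of one shard and clones the value out, so the
    /// lock is released before the caller does anything else with it.
    fn get_cloned(&self, key: &K) -> Option<V> {
        let idx = self.shard_index(key);
        let shard = self.shards[idx].read().unwrap();
        shard.get(key).cloned()
    }
}
```

Because every method takes `&self`, a `ShardedMap` can be shared between threads (for example inside an Arc) without an outer lock. Two writers only wait on each other when their keys happen to hash to the same shard.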

The registry uses 6 DashMap instances across 4 state structs: agent records, tasks, signaling peers, invite codes, onboarding entries, and rate-limit buckets. All are wrapped in Arc for sharing across tokio tasks and HTTP handlers.

Key operations in the codebase

  • Insert: agents.insert(info.agent_id.clone(), record) — overwrites if exists
  • Get: Returns a Ref<K, V> guard holding the shard’s read lock. Clone the value before the guard drops; never hold it across an .await point
  • Get mutable: agents.get_mut(&id) returns a RefMut<K, V> with the shard’s write lock for in-place updates
  • Iterate with mutation: iter_mut() locks each shard in sequence — safe but briefly blocks writes per shard
  • Retain: Iterates and removes entries where the closure returns false — used for TTL cleanup
  • Remove: Atomically removes a key-value pair

3. Eventual Consistency via Heartbeats

The registry does not require agents to be pre-configured. Agents announce themselves by sending heartbeats, and the registry builds its view from those heartbeats. The heartbeat handler performs an upsert: update if exists, insert if new.

The registry’s state is always eventually consistent with the actual set of running agents. The convergence window is one heartbeat interval: 15 seconds.

T=0s    Registry starts (empty)
T=3s    Agent-A heartbeat --> registry knows A
T=8s    Agent-B heartbeat --> registry knows A and B
T=20s   Registry restarts (empty again)
T=23s   Agent-B heartbeat --> re-learns B
T=33s   Agent-A heartbeat --> full convergence

This design means the registry has no startup dependencies. Restart it at any time, and within 15 seconds the world rebuilds itself. The trade-off: during those 15 seconds, requests return partial results.
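The upsert-and-rebuild behavior described in this section can be sketched as follows. This is a simplified model under stated assumptions, not the registry's actual handler: the `Registry` and `AgentRecord` types, their fields, and the `heartbeat` signature are hypothetical, and a plain HashMap stands in for the DashMap so the example needs only the standard library.

```rust
use std::collections::HashMap;

/// Hypothetical record shape; the real registry's fields may differ.
#[derive(Debug, Clone, PartialEq)]
struct AgentRecord {
    skills: Vec<String>,
    last_heartbeat_secs: u64,
}

/// Hypothetical registry view. A plain HashMap stands in for DashMap here.
#[derive(Default)]
struct Registry {
    agents: HashMap<String, AgentRecord>,
}

impl Registry {
    /// Upsert: overwrite the record if the agent is known, insert it otherwise.
    /// Returns true when the agent was not previously known.
    fn heartbeat(&mut self, agent_id: &str, skills: &[&str], now_secs: u64) -> bool {
        let record = AgentRecord {
            skills: skills.iter().map(|s| s.to_string()).collect(),
            last_heartbeat_secs: now_secs,
        };
        // HashMap::insert returns the old value, so None means "newly learned".
        self.agents.insert(agent_id.to_string(), record).is_none()
    }
}
```

Replacing the `Registry` with `Registry::default()` models a restart: the map is empty, and each agent reappears the next time its heartbeat arrives, with no separate registration step.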


4. The Health State Machine

The health_check_loop runs every 15 seconds and classifies each agent by heartbeat age:

  • active — last heartbeat < 90 seconds ago (fewer than 6 missed heartbeats)
  • stale — 90-270 seconds (6-18 missed heartbeats)
  • offline — > 270 seconds

Any heartbeat received resets the agent to “active.” The find_by_skill method only returns active agents, preventing task assignment to agents that are probably dead.

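The loop uses tokio::time::sleep (not interval), so it waits a full 15 seconds after each iteration completes. If a scan runs long, the next one simply starts later. An interval, by default, fires missed ticks back to back to catch up, which is unnecessary for a health check.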
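The classification rules above amount to a small pure function. The sketch below is one way to write it, assuming the boundaries are inclusive on the stale side (exactly 90 and exactly 270 seconds count as stale), which the text does not specify. The `Health` enum, the `Agent` struct, and this `find_by_skill` signature are hypothetical stand-ins for the registry's real types.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Health {
    Active,
    Stale,
    Offline,
}

// Thresholds taken from the text: 90s to become stale, 270s to go offline.
const STALE_AFTER: Duration = Duration::from_secs(90);
const OFFLINE_AFTER: Duration = Duration::from_secs(270);

/// Classify an agent by the time elapsed since its last heartbeat.
/// Boundary handling (90s and 270s both count as stale) is an assumption.
fn classify(age: Duration) -> Health {
    if age < STALE_AFTER {
        Health::Active
    } else if age <= OFFLINE_AFTER {
        Health::Stale
    } else {
        Health::Offline
    }
}

/// Hypothetical agent shape used only for this example.
struct Agent {
    id: String,
    skills: Vec<String>,
    heartbeat_age: Duration,
}

/// Return the ids of agents that have the skill and are currently active.
fn find_by_skill<'a>(agents: &'a [Agent], skill: &str) -> Vec<&'a str> {
    agents
        .iter()
        .filter(|a| classify(a.heartbeat_age) == Health::Active)
        .filter(|a| a.skills.iter().any(|s| s.as_str() == skill))
        .map(|a| a.id.as_str())
        .collect()
}
```

Keeping the classification separate from the loop makes the thresholds easy to test without waiting on real timers.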


5. The TTL Pattern

Three background loops use the same pattern: sleep, scan, expire.

| Loop | Interval | TTL | What expires |
| --- | --- | --- | --- |
| health_check_loop | 15s | 90s stale / 270s offline | Agent health transitions |
| expire_tasks_loop | 30s | 300s (task.ttl) | Pending/assigned tasks marked “failed” |
| cleanup_loop | 60s | 300s (codes) | Used/expired invite codes, empty rate-limit buckets |

Task expiration marks tasks as “failed” with error “TTL expired” rather than removing them, preserving them for debugging. Note: completed tasks are never garbage-collected — a known limitation.

Invite code cleanup uses retain() to remove used or expired codes, and prunes empty rate-limit buckets.

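This pattern avoids per-entry timers: a single background scan handles all entries in one pass. The cost is precision. An entry can outlive its TTL by up to one scan interval, so the worst-case lifetime is the TTL plus the interval (e.g., a task with a 300-second TTL scanned every 30 seconds could live up to 330 seconds before expiration).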
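The scan-and-expire step of these loops might look like the sketch below. It is an illustration under assumptions rather than the registry's code: the `Task`, `TaskState`, and `InviteCode` types and their fields are hypothetical, timestamps are plain seconds, and a standard HashMap stands in for DashMap. HashMap::retain is used here with a closure that receives the key and a mutable reference to the value and returns whether to keep the entry.

```rust
use std::collections::HashMap;

/// Hypothetical task states; the real registry may model these differently.
#[derive(Debug, Clone, PartialEq)]
enum TaskState {
    Pending,
    Assigned,
    Completed,
    Failed { error: String },
}

struct Task {
    state: TaskState,
    created_at_secs: u64,
    ttl_secs: u64,
}

struct InviteCode {
    used: bool,
    expires_at_secs: u64,
}

/// One scan pass: mark overdue pending/assigned tasks as failed instead of
/// removing them, so they remain visible for debugging.
/// Returns how many tasks were expired in this pass.
fn expire_tasks(tasks: &mut HashMap<String, Task>, now_secs: u64) -> usize {
    let mut expired = 0;
    for task in tasks.values_mut() {
        let overdue = now_secs.saturating_sub(task.created_at_secs) >= task.ttl_secs;
        let open = matches!(task.state, TaskState::Pending | TaskState::Assigned);
        if overdue && open {
            task.state = TaskState::Failed {
                error: "TTL expired".to_string(),
            };
            expired += 1;
        }
    }
    expired
}

/// One scan pass: keep only invite codes that are unused and not yet expired.
fn cleanup_codes(codes: &mut HashMap<String, InviteCode>, now_secs: u64) {
    codes.retain(|_, code| !code.used && code.expires_at_secs > now_secs);
}
```

A background loop would call these functions after each sleep. A task created just after one scan is only noticed by a later scan, which is where the extra interval in the worst-case lifetime comes from.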


6. When to Add a Database

The in-memory approach works because the registry is a coordination point, not a system of record. But it has clear limits. On registry restart:

| Data | Recovery | Impact |
| --- | --- | --- |
| Agent registrations | Auto-recover (~15s) | Brief dashboard flicker |
| Task queue | LOST permanently | Pending tasks vanish |
| Signaling peers | Auto-recover (seconds) | Brief connectivity gap |
| Invite codes | LOST permanently | In-flight pairing fails |
| Rate-limit buckets | Reset (harmless) | Briefly allows extra attempts |

What to move to a database: Task queue (lifecycle spans minutes, must not be lost) and onboarding entries (long-lived credentials).

What to keep in memory: Agent registrations (rebuilt from heartbeats), signaling peers (cannot serialize channel handles), rate-limit buckets (ephemeral), invite codes (5-minute TTL).

The second reason for a database is horizontal scaling. Because DashMap state lives inside a single process, only one registry instance can serve it. A shared database enables multiple instances behind a load balancer. The signaling peer map would still need coordination (for example via Redis pub/sub), but the registry API would scale horizontally.


Summary

The chatixia-mesh registry uses in-memory DashMap instances as its sole state store. This eliminates infrastructure dependencies and delivers nanosecond-scale reads, but all state is volatile. The system compensates through:

  1. Heartbeat-driven eventual consistency — agents re-announce themselves, so the registry rebuilds within 15 seconds of any restart
  2. Background TTL loops — three spawned tasks scan their DashMaps on fixed intervals, expiring stale entries

The state that cannot self-heal (task queue, onboarding entries) is the state that would justify adding a database. ADR-004 documents this as a planned migration path for when the system needs durability or multi-instance scaling.

Previous: Lesson 11: Transport Comparison | Next: Lesson 13: Building Monitoring Dashboards
