Threat Modeling Distributed Systems -- From Attack Surfaces to Mitigations
Lesson 14: Threat Modeling Distributed Systems — From Attack Surfaces to Mitigations
Prerequisites: Lesson 08: Authentication and Security, Lesson 10: The Sidecar Pattern
Introduction
Most security incidents in distributed systems come from overlooked assumptions: an endpoint that was never meant to be public, a token that never expires, a payload trusted because it came from “inside the mesh.” Threat modeling finds these gaps before an attacker does.
This lesson walks through threat modeling as applied to chatixia-mesh using the STRIDE framework, analyzes the WebRTC attack surface, and examines the pairing system’s brute-force defenses.
1. Why Threat Model?
Threat modeling replaces “is this secure?” (unanswerable) with “what can go wrong, and what have we done about it?” (enumerable). A threat model has three columns: Threat (what can an attacker do?), Mitigation (what prevents it?), and Residual risk (what remains unaddressed?).
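The three columns map naturally onto a small record type. The sketch below is illustrative only: the class name, field names, and the `unmitigated` helper are assumptions, and the two sample rows paraphrase threats described later in this lesson.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreatEntry:
    threat_id: str       # e.g. "T1"
    threat: str          # what can an attacker do?
    mitigation: str      # what prevents it? ("" if nothing yet)
    residual_risk: str   # what remains unaddressed?


def unmitigated(entries):
    """Return the entries that have no mitigation recorded."""
    return [e for e in entries if not e.mitigation.strip()]


register = [
    ThreatEntry("T1", "Spoofed peer on the signaling WebSocket",
                "JWT required on upgrade; sub claim bound to peer_id",
                "JWT in URL query parameter appears in logs"),
    ThreatEntry("T4", "Flooding HTTP and WebSocket endpoints",
                "",
                "No rate limiting on most endpoints"),
]

print([e.threat_id for e in unmitigated(register)])
```

Keeping the residual risk as a required field makes it hard to record a threat without stating what is still open.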
The residual risk column matters most. Every system has residual risks. The difference between secure and insecure is whether the team knows what those risks are.
The rule: when you add a feature, add its threats. When you change a protocol, re-examine its trust boundaries.
2. System Boundaries and Assets
Before enumerating threats, identify where the boundaries are and what is worth protecting.
chatixia-mesh has five trust boundaries: Internet to Registry (HTTP/WebSocket), Internet to TURN (UDP relay), Registry to Sidecar (WebSocket with JWT), Sidecar to Agent (Unix socket with filesystem permissions), and Sidecar to Sidecar (WebRTC DataChannels with DTLS).
The asset inventory drives prioritization:
| Asset | Sensitivity | Why it matters |
|---|---|---|
| JWT signing secret | Critical | Compromise allows forging tokens for any identity |
| API keys | High | Grants ability to impersonate agents |
| Agent-to-agent messages | High | Contains task payloads, LLM prompts |
| TURN shared secret | High | Compromise allows abusing the relay |
| Task payloads | Medium-High | Visible to registry in plaintext |
| Agent capabilities | Low | Public by design |
3. STRIDE Applied to chatixia-mesh
STRIDE classifies threats into six categories: Spoofing (authentication), Tampering (integrity), Repudiation (non-repudiation), Information Disclosure (confidentiality), Denial of Service (availability), and Elevation of Privilege (authorization).
S — Spoofing
Threat T1: The WebSocket signaling endpoint is the mesh’s front door, so an attacker who can impersonate a peer there can send signaling messages in that peer’s name. chatixia-mesh mitigates spoofing with JWT-on-upgrade (token required before the WebSocket upgrade completes), identity binding (JWT sub claim = peer_id), and sender verification (JWT identity must match message source). Residual risk: the JWT is passed as a URL query parameter, so it appears in server and proxy logs.
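The identity-binding and sender-verification checks can be sketched with the standard library alone. This is a minimal illustration, not the registry's code: it assumes HS256 tokens, and the helper names and the `peer_id` message field are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_decode(part: str) -> bytes:
    # JWT segments omit base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def sign_hs256(claims: dict, secret: bytes) -> str:
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url_encode(sig)}"


def verify_hs256(token: str, secret: bytes) -> dict:
    """Return the claims if signature, algorithm, and expiry check out."""
    header_b64, body_b64, sig_b64 = token.split(".")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret, f"{header_b64}.{body_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(body_b64))
    if claims.get("exp", 0) <= time.time():
        raise ValueError("token expired or missing exp")
    return claims


def sender_is_authentic(claims: dict, message: dict) -> bool:
    # Sender verification: the claimed source must equal the JWT subject.
    return claims.get("sub") == message.get("peer_id")
```

A handler would call `verify_hs256` once during the WebSocket upgrade and `sender_is_authentic` on every subsequent message, dropping any message whose source does not match.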
T — Tampering
Threat T5: The registry does not validate task payloads. Any authenticated agent can submit arbitrary content via POST /api/tasks/submit. Tasks route by skill match but payload content is unchecked — no schema validation, no authorization on who can target whom, no rate limiting per source.
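One way to close part of this gap is to validate a submission before routing it. The sketch below assumes a task is a dictionary with `source`, `skill`, and `payload` string fields; those field names, the size limit, and the ACL contents are all hypothetical.

```python
MAX_PAYLOAD_BYTES = 16 * 1024  # assumed limit, tune per deployment

# Hypothetical ACL: skills listed here may only be requested by the named
# sources. Skills that are not listed are open to any authenticated agent.
SKILL_ACL = {"shell": {"agent-admin"}}


def validate_task(task: dict, acl=SKILL_ACL) -> list:
    """Return a list of problems; an empty list means the task may be routed."""
    errors = []
    for field in ("source", "skill", "payload"):
        value = task.get(field)
        if not isinstance(value, str) or not value:
            errors.append(f"missing or non-string field: {field}")
    if errors:
        return errors
    if len(task["payload"].encode()) > MAX_PAYLOAD_BYTES:
        errors.append("payload too large")
    allowed = acl.get(task["skill"])
    if allowed is not None and task["source"] not in allowed:
        errors.append("source may not target this skill")
    return errors
```

The `source` value should come from the verified JWT identity rather than from the request body, otherwise the ACL check can be bypassed by lying about the sender.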
R — Repudiation
chatixia-mesh has no audit logging. Agent actions, admin approvals, and API key usage are logged only via tracing::info! — operational logs, not tamper-evident audit trails. A production deployment needs append-only audit logs for authentication, pairing lifecycle, and task submissions.
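A common way to make a log tamper-evident is to chain each record to the hash of the previous one. The following is a minimal in-memory sketch of that idea; the record layout and function names are assumptions, and a real deployment would also persist the records and anchor the latest hash somewhere the writer cannot modify.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash: str, event: dict) -> str:
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


def append_event(log: list, event: dict) -> dict:
    """Append an event whose hash covers the previous record's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    record = {"event": event, "prev": prev_hash,
              "hash": _digest(prev_hash, event)}
    log.append(record)
    return record


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or removed record breaks the chain."""
    prev_hash = GENESIS
    for record in log:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["event"]):
            return False
        prev_hash = record["hash"]
    return True
```

Editing an earlier event changes its hash, which no longer matches the `prev` value stored in the next record, so the modification is detectable as long as the final hash is known.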
I — Information Disclosure
Threat T9: Several GET endpoints (/api/registry/agents, /api/mesh/topology, /api/pairing/pending) require no authentication, exposing agent IPs, skills, topology, and onboarding activity. Knowing an agent has a shell skill tells an attacker it is a high-value target.
D — Denial of Service
Threat T4: The registry has no rate limiting on most HTTP endpoints and no WebSocket connection limits. An attacker can flood token issuance, open thousands of WebSocket connections, or submit thousands of tasks.
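A per-source sliding-window limiter is one straightforward mitigation. This sketch is an assumption about how such a limiter could look, not the registry's implementation; the class name is hypothetical and the clock is injectable so the behavior can be tested deterministically.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per key within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock
        self.hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key: str) -> bool:
        now = self.clock()
        recent = self.hits[key]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True
```

Keying by client IP matches the limit the pairing endpoint already uses (5 attempts per 60 seconds); keying by authenticated peer_id is usually a better fit for task submission, since many agents can share one address.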
E — Elevation of Privilege
Threat T7 (Prompt Injection): A malicious task payload can manipulate the target agent’s LLM into executing dangerous skills. Mitigations require defense in depth: skill parameter schemas, allowlists for dangerous skills, payload sanitization, and separation of user content from system instructions.
Threat T6 (IPC Hijacking): The sidecar’s Unix socket at /tmp/chatixia-sidecar.sock has a predictable path in a directory shared by every local user, so any local process that can open it can impersonate the agent. Fix: move the socket to $XDG_RUNTIME_DIR, set 0600 permissions, and authenticate the IPC connection.
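The first two parts of the T6 fix can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, the fallback to a private temporary directory is a choice made for this example, and authenticating the connecting process is left out.

```python
import os
import socket
import stat
import tempfile


def bind_private_socket(name: str = "chatixia-sidecar.sock"):
    """Bind a Unix socket in a per-user directory with 0600 permissions."""
    runtime_dir = os.environ.get("XDG_RUNTIME_DIR")
    if not runtime_dir:
        # Assumed fallback: mkdtemp creates a directory only the owner can enter.
        runtime_dir = tempfile.mkdtemp(prefix="chatixia-")
    path = os.path.join(runtime_dir, name)
    if os.path.exists(path):
        os.unlink(path)  # remove a stale socket left by a previous run
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    # Restrict the mode at creation time so the socket is never briefly open
    # to other users, as it would be with bind() followed by chmod().
    old_umask = os.umask(0o177)
    try:
        server.bind(path)
    finally:
        os.umask(old_umask)
    server.listen(1)
    return server, path
```

Setting the umask before `bind` avoids the window between creating the socket file and tightening its permissions.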
4. The WebRTC Attack Surface
WebRTC DataChannels use four protocol layers (ICE, STUN/TURN, DTLS, SCTP) where HTTP/gRPC uses one (TLS). Each layer adds state machines, parsers, and configuration surface.
ICE: Candidate injection via compromised signaling could redirect connections. Mitigated by JWT on the signaling WebSocket.
STUN/TURN: An open TURN server enables DDoS amplification. chatixia-mesh uses ephemeral HMAC-SHA1 credentials with 24-hour expiry — the standard use-auth-secret mode for coturn.
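In coturn's use-auth-secret mode, the username carries the expiry timestamp and the password is the base64-encoded HMAC-SHA1 of that username under the shared secret. The sketch below shows the derivation; the function name, parameter names, and the `user_id` suffix are illustrative assumptions.

```python
import base64
import hashlib
import hmac
import time


def turn_credentials(shared_secret: str, user_id: str,
                     ttl_seconds: int = 24 * 3600, now=None):
    """Derive time-limited TURN credentials from the shared secret."""
    issued_at = time.time() if now is None else now
    expiry = int(issued_at + ttl_seconds)
    username = f"{expiry}:{user_id}"  # the TURN server rejects it after expiry
    digest = hmac.new(shared_secret.encode(), username.encode(),
                      hashlib.sha1).digest()
    password = base64.b64encode(digest).decode()
    return username, password
```

Because the server recomputes the same HMAC, it never needs a per-user credential database, but anyone holding the shared secret can mint valid credentials, which is why the asset table rates it High.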
DTLS: Provides end-to-end encryption. The webrtc-rs library is less audited than mainstream TLS implementations.
SCTP: Less battle-tested than TCP; vulnerabilities in chunk parsing could cause crashes.
The trade-off
More layers mean a larger audit surface, but also stronger end-to-end encryption. When messages travel over HTTP through the registry, the registry sees all of them in plaintext. With WebRTC, DTLS encrypts the DataChannel end-to-end, so a compromised registry sees only signaling metadata, not content. Residual risk: a signaling-path compromise enables MITM by substituting DTLS fingerprints during SDP exchange.
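Fingerprint verification counters that substitution by comparing the fingerprint in the received SDP against a value learned through a channel the signaling path cannot alter. The sketch below covers only the comparison step; how the expected fingerprint is distributed is an open design question, and the function names are hypothetical.

```python
def sdp_fingerprints(sdp: str) -> set:
    """Collect every (algorithm, value) pair from a=fingerprint lines."""
    found = set()
    prefix = "a=fingerprint:"
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            algo, _, value = line[len(prefix):].partition(" ")
            found.add((algo.lower(), value.strip().upper()))
    return found


def fingerprint_matches(sdp: str, expected_algo: str,
                        expected_value: str) -> bool:
    """True only if the SDP advertises exactly the expected fingerprint."""
    expected = (expected_algo.lower(), expected_value.upper())
    # Require an exact match so an attacker cannot append a second fingerprint.
    return sdp_fingerprints(sdp) == {expected}
```

Rejecting SDP that carries any additional fingerprint is deliberately strict: a peer that legitimately advertises several certificates would need the expected set extended accordingly.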
5. Pairing Security
The pairing system lets new agents join without pre-provisioned API keys using a multi-step flow: admin generates a 6-digit code, agent redeems it (out-of-band), admin approves, agent receives a device token.
Brute-force resistance
Rate limiting (5 attempts per IP per 60 seconds) combined with a 5-minute code TTL gives a single IP five windows of five attempts, so at most 25 guesses out of 1,000,000 possibilities: a 0.0025% chance of hitting a given code. Even 100 IPs yield only 0.25%.
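The arithmetic can be checked with a few lines. The function below is a sketch that assumes a single live code, distinct guesses, and perfectly enforced per-IP limits; its name and parameters are illustrative.

```python
from fractions import Fraction


def guess_probability(attempts_per_window: int, window_seconds: int,
                      code_ttl_seconds: int, code_digits: int,
                      source_ips: int = 1) -> Fraction:
    """Upper bound on the chance of guessing one code before it expires."""
    windows = code_ttl_seconds // window_seconds
    guesses = attempts_per_window * windows * source_ips
    keyspace = 10 ** code_digits
    return Fraction(min(guesses, keyspace), keyspace)


single_ip = guess_probability(5, 60, 300, 6)
hundred_ips = guess_probability(5, 60, 300, 6, source_ips=100)
print(f"{float(single_ip):.4%}")    # 0.0025%
print(f"{float(hundred_ips):.2%}")  # 0.25%
```

The bound grows linearly with the number of source addresses, which is why the single-use and admin-approval steps matter as independent layers.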
Additional defenses compound the difficulty: codes are single-use (consumed on first redemption), and successful redemption only creates a “pending_approval” entry requiring admin approval.
After approval, the agent receives a 128-bit random device token (dt_ + 32 hex chars), computationally infeasible to guess. The device token is returned once, stored locally, and used to obtain short-lived JWTs.
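A token in this format can be produced with Python's `secrets` module. The sketch below also shows one way a server could honor "returned once": keep only a hash of the token. That storage choice and the helper names are assumptions, not something the source describes.

```python
import hashlib
import hmac
import secrets


def new_device_token() -> str:
    # 16 random bytes = 128 bits, rendered as 32 hex characters.
    return "dt_" + secrets.token_hex(16)


def token_digest(token: str) -> str:
    """Hash stored server-side in place of the token itself (assumed design)."""
    return hashlib.sha256(token.encode()).hexdigest()


def token_matches(presented: str, stored_digest: str) -> bool:
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(token_digest(presented), stored_digest)
```

A plain SHA-256 digest is adequate here because the token already has 128 bits of entropy; slow password hashes are designed for low-entropy secrets.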
6. Production Security Checklist
The threat model’s 12-item checklist, ordered by effort and impact:
- Change SIGNALING_SECRET from default (T1, T2)
- Replace ak_dev_001 with unique API keys per agent (T2)
- Move api_keys.json to a secrets manager (T2)
- Enable TLS on registry (T1, T9, T10)
- Deploy coturn with TLS (T11)
- Move IPC socket to secure directory (T6)
- Add rate limiting to all HTTP endpoints (T4)
- Require JWT on GET registry endpoints (T9)
- Implement task submission ACLs (T5)
- Sanitize task payloads before LLM processing (T7)
- Add DTLS fingerprint verification (T3)
- Set up monitoring for abnormal signaling patterns (T1, T4)
Items 1-3 are configuration changes; items 9-12 require code. The default JWT secret and API key (ak_dev_001) are in source code — anyone who has read the repository can forge tokens or authenticate in a production deployment that inherits these defaults.
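A startup check can refuse to run with inherited defaults. The sketch below is hypothetical: the function name is invented, and because the default secret value is not quoted in this lesson, the caller supplies the set of known default values.

```python
def insecure_defaults(env: dict, api_keys: list,
                      known_default_secrets: set) -> list:
    """Return a list of reasons the configuration is unsafe for production."""
    problems = []
    secret = env.get("SIGNALING_SECRET", "")
    if not secret or secret in known_default_secrets:
        problems.append("SIGNALING_SECRET is unset or still a known default")
    if "ak_dev_001" in api_keys:
        problems.append("development API key ak_dev_001 is still enabled")
    return problems
```

Calling this at process start and exiting on a non-empty result turns the first two checklist items into a failure that cannot be overlooked.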
7. Threat Modeling as a Repeatable Process
Apply this to any system: (1) Draw the boundary diagram, (2) inventory assets with sensitivity levels, (3) apply STRIDE to each trust boundary, (4) assess residual risk honestly, (5) turn findings into a prioritized checklist, and (6) update when the system changes.
Summary
Threat modeling builds structured understanding of what your system protects and where the gaps are. chatixia-mesh documents 11 threats across all STRIDE categories — some well-mitigated, others explicitly unaddressed. The honest documentation of residual risks tells you exactly where to invest next.
The WebRTC transport introduces a larger protocol attack surface but provides genuine end-to-end encryption. The pairing system demonstrates defense in depth. And the production checklist turns abstract threats into concrete action items.
Security is not a feature you add. It is a set of questions you ask at every design decision.