Lesson 14: Threat Modeling Distributed Systems — From Attack Surfaces to Mitigations

Prerequisites: Lesson 08: Authentication and Security, Lesson 10: The Sidecar Pattern

Introduction

Most security incidents in distributed systems come from overlooked assumptions: an endpoint that was never meant to be public, a token that never expires, a payload trusted because it came from “inside the mesh.” Threat modeling finds these gaps before an attacker does.

This lesson walks through threat modeling as applied to chatixia-mesh using the STRIDE framework, analyzes the WebRTC attack surface, and examines the pairing system’s brute-force defenses.

1. Why Threat Model?

Threat modeling replaces “is this secure?” (unanswerable) with “what can go wrong, and what have we done about it?” (enumerable). A threat model has three columns: Threat (what can an attacker do?), Mitigation (what prevents it?), and Residual risk (what remains unaddressed?).

The residual risk column matters most. Every system has residual risks. The difference between secure and insecure is whether the team knows what those risks are.

The rule: when you add a feature, add its threats. When you change a protocol, re-examine its trust boundaries.

2. System Boundaries and Assets

Before enumerating threats, identify where the boundaries are and what is worth protecting.

chatixia-mesh has five trust boundaries: Internet to Registry (HTTP/WebSocket), Internet to TURN (UDP relay), Registry to Sidecar (WebSocket with JWT), Sidecar to Agent (Unix socket with filesystem permissions), and Sidecar to Sidecar (WebRTC DataChannels with DTLS).

The asset inventory drives prioritization:

Asset	Sensitivity	Why it matters
JWT signing secret	Critical	Compromise allows forging tokens for any identity
API keys	High	Grants ability to impersonate agents
Agent-to-agent messages	High	Contains task payloads, LLM prompts
TURN shared secret	High	Compromise allows abusing the relay
Task payloads	Medium-High	Visible to registry in plaintext
Agent capabilities	Low	Public by design

3. STRIDE Applied to chatixia-mesh

STRIDE classifies threats into six categories: Spoofing (authentication), Tampering (integrity), Repudiation (non-repudiation), Information Disclosure (confidentiality), Denial of Service (availability), and Elevation of Privilege (authorization).

S — Spoofing

Threat T1: The WebSocket signaling endpoint is the mesh’s front door. chatixia-mesh mitigates spoofing with JWT-on-upgrade (token required before WebSocket completes), identity binding (JWT sub claim = peer_id), and sender verification (JWT identity must match message source). Residual risk: JWT passed as URL query parameter appears in server and proxy logs.

T — Tampering

Threat T5: The registry does not validate task payloads. Any authenticated agent can submit arbitrary content via POST /api/tasks/submit. Tasks route by skill match but payload content is unchecked — no schema validation, no authorization on who can target whom, no rate limiting per source.

R — Repudiation

chatixia-mesh has no audit logging. Agent actions, admin approvals, and API key usage are logged only via tracing::info! — operational logs, not tamper-evident audit trails. A production deployment needs append-only audit logs for authentication, pairing lifecycle, and task submissions.

I — Information Disclosure

Threat T9: Several GET endpoints (/api/registry/agents, /api/mesh/topology, /api/pairing/pending) require no authentication, exposing agent IPs, skills, topology, and onboarding activity. Knowing an agent has a shell skill tells an attacker it is a high-value target.

D — Denial of Service

Threat T4: The registry has no rate limiting on most HTTP endpoints and no WebSocket connection limits. An attacker can flood token issuance, open thousands of WebSocket connections, or submit thousands of tasks.

E — Elevation of Privilege

Threat T7 (Prompt Injection): A malicious task payload can manipulate the target agent’s LLM into executing dangerous skills. Mitigations require defense in depth: skill parameter schemas, allowlists for dangerous skills, payload sanitization, and separation of user content from system instructions.

Threat T6 (IPC Hijacking): The sidecar’s Unix socket at /tmp/chatixia-sidecar.sock is predictable and /tmp is world-readable. Any local process can impersonate the agent. Fix: move to $XDG_RUNTIME_DIR, set 0600 permissions, authenticate the IPC connection.

4. The WebRTC Attack Surface

WebRTC DataChannels use four protocol layers (ICE, STUN/TURN, DTLS, SCTP) where HTTP/gRPC uses one (TLS). Each layer adds state machines, parsers, and configuration surface.

ICE: Candidate injection via compromised signaling could redirect connections. Mitigated by JWT on the signaling WebSocket.

STUN/TURN: An open TURN server enables DDoS amplification. chatixia-mesh uses ephemeral HMAC-SHA1 credentials with 24-hour expiry — the standard use-auth-secret mode for coturn.

DTLS: Provides end-to-end encryption. The webrtc-rs library is less audited than mainstream TLS implementations.

SCTP: Less battle-tested than TCP; vulnerabilities in chunk parsing could cause crashes.

The trade-off

More layers means a larger audit surface, but also stronger end-to-end encryption. In HTTP, the registry sees all messages in plaintext. With WebRTC, DTLS encrypts the DataChannel end-to-end — a compromised registry sees only signaling metadata, not content. Residual risk: A signaling-path compromise enables MITM by substituting DTLS fingerprints during SDP exchange.

5. Pairing Security

The pairing system lets new agents join without pre-provisioned API keys using a multi-step flow: admin generates a 6-digit code, agent redeems it (out-of-band), admin approves, agent receives a device token.

Brute-force resistance

Rate limiting (5 attempts per IP per 60 seconds) combined with a 5-minute code TTL means a single IP can attempt at most 25 guesses out of 1,000,000 possibilities — a 0.0025% chance. Even 100 IPs yield only 0.25%.

Additional defenses compound the difficulty: codes are single-use (consumed on first redemption), and successful redemption only creates a “pending_approval” entry requiring admin approval.

After approval, the agent receives a 128-bit random device token (dt_ + 32 hex chars), computationally infeasible to guess. The device token is returned once, stored locally, and used to obtain short-lived JWTs.

6. Production Security Checklist

The threat model’s 12-item checklist, ordered by effort and impact:

Change SIGNALING_SECRET from default (T1, T2)
Replace ak_dev_001 with unique API keys per agent (T2)
Move api_keys.json to a secrets manager (T2)
Enable TLS on registry (T1, T9, T10)
Deploy coturn with TLS (T11)
Move IPC socket to secure directory (T6)
Add rate limiting to all HTTP endpoints (T4)
Require JWT on GET registry endpoints (T9)
Implement task submission ACLs (T5)
Sanitize task payloads before LLM processing (T7)
Add DTLS fingerprint verification (T3)
Set up monitoring for abnormal signaling patterns (T1, T4)

Items 1-3 are configuration changes; items 9-12 require code. The default JWT secret and API key (ak_dev_001) are in source code — anyone who has read the repository can forge tokens or authenticate in a production deployment that inherits these defaults.

7. Threat Modeling as a Repeatable Process

Apply this to any system: (1) Draw the boundary diagram, (2) inventory assets with sensitivity levels, (3) apply STRIDE to each trust boundary, (4) assess residual risk honestly, (5) turn findings into a prioritized checklist, and (6) update when the system changes.

Summary

Threat modeling builds structured understanding of what your system protects and where the gaps are. chatixia-mesh documents 11 threats across all STRIDE categories — some well-mitigated, others explicitly unaddressed. The honest documentation of residual risks tells you exactly where to invest next.

The WebRTC transport introduces a larger protocol attack surface but provides genuine end-to-end encryption. The pairing system demonstrates defense in depth. And the production checklist turns abstract threats into concrete action items.

Security is not a feature you add. It is a set of questions you ask at every design decision.

Previous: Lesson 13: Building Monitoring Dashboards | Next: Lesson 15: Deployment Patterns

Threat Modeling Distributed Systems -- From Attack Surfaces to Mitigations