Why Distributed Systems? From Monolith to Mesh
Lesson 01 — Why Distributed Systems? From Monolith to Mesh
Prerequisites
None. This is the first lesson in the series.
What You’ll Learn
- What a distributed system is and how it differs from a monolithic application
- The Eight Fallacies of Distributed Computing and why they matter
- Four fundamental network topologies and their trade-offs
- The difference between control plane and data plane
- How the chatixia-mesh project maps to these concepts
1. What Is a Distributed System?
A distributed system is a collection of independent processes, running on different machines, that coordinate over a network to accomplish a shared goal. To an outside observer, the system behaves as a single coherent unit — but internally, each process has its own memory, its own clock, and its own failure modes.
Contrast this with a monolith: a single process running on a single machine. All components share the same memory space, the same clock, and the same fate.
Why distribute at all?
A monolith is simpler to build, deploy, and debug. You distribute a system when you need something a single machine cannot provide:
- Resilience — if one node fails, the others keep running.
- Scalability — you can add machines instead of buying a bigger one.
- Geographic reach — you place nodes close to users or data sources.
- Autonomy — independent teams or organizations each run their own node.
Every one of these benefits comes with a cost: the network between nodes is unreliable, slow, and insecure.
2. The Eight Fallacies of Distributed Computing
In 1994, L Peter Deutsch at Sun Microsystems listed seven assumptions that developers new to distributed systems tend to make; James Gosling added the eighth around 1997. Each one is false.
- The network is reliable. Packets get dropped. Cables get unplugged. A `task_request` sent over a WebRTC DataChannel may never arrive, so the sender needs a timeout and a retry mechanism.
- Latency is zero. A “fast” cross-continent round trip takes 100-200 ms. A local function call takes microseconds or less. Developers who confuse the two build brittle systems.
- Bandwidth is infinite. A home connection might offer 10 Mbps upload. If 20 agents all broadcast large payloads simultaneously, the network saturates and heartbeats stop arriving.
- The network is secure. Without encryption, any device on the network path can read and modify traffic. This is why chatixia-mesh uses DTLS encryption on DataChannels and JWT authentication on signaling.
- Topology doesn’t change. Nodes join and leave. A laptop moves from office Wi-Fi to a mobile hotspot. A developer closes their laptop lid, and the topology just changed.
- There is one administrator. chatixia-mesh agents can run on a developer’s laptop, a Raspberry Pi, and a cloud VM. Each environment has different firewall rules and update cycles.
- Transport cost is zero. Network communication costs CPU cycles for serialization, memory for buffers, and sometimes real money for bandwidth. TURN relay servers cost money to host.
- The network is homogeneous. One agent runs Python 3.13 on macOS ARM; another runs on Ubuntu x86_64 in Docker. Each has different performance characteristics and different bugs.
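The first fallacy is the one most often handled directly in code. Below is a minimal sketch of a timeout-and-retry wrapper. It assumes a hypothetical `send` callable that raises `TimeoutError` when no reply arrives in time; the function names, default values, and the simulated flaky sender are illustrative and are not taken from chatixia-mesh.

```python
import time
from typing import Callable


def send_with_retry(
    send: Callable[[float], str],
    attempts: int = 3,
    timeout_s: float = 0.05,
    backoff_s: float = 0.01,
) -> str:
    """Call send(timeout_s) until it returns or the attempts run out."""
    last_error = None
    for attempt in range(attempts):
        try:
            return send(timeout_s)
        except TimeoutError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: wait longer after each failure.
                time.sleep(backoff_s * (2 ** attempt))
    raise TimeoutError(f"no reply after {attempts} attempts") from last_error


def make_flaky_sender(failures: int) -> Callable[[float], str]:
    """Hypothetical sender that times out `failures` times, then succeeds."""
    state = {"calls": 0}

    def send(timeout_s: float) -> str:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TimeoutError("simulated packet loss")
        return "task_response"

    return send


# Two simulated losses, then the third attempt gets through.
reply = send_with_retry(make_flaky_sender(failures=2))
```

Retrying is only safe when the receiver can tolerate seeing the same request twice, so a real protocol would usually attach a request ID and deduplicate on the receiving side.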
3. Topology Models
The topology of a distributed system describes how nodes are connected. Different topologies make different trade-offs.
Star (Client-Server)
Every node connects to a single central server. Connections: N (one per client, not counting the server itself). Simple and easy to reason about, but the server is a single point of failure and a bottleneck.
Ring
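Each node connects to exactly two neighbors in a closed loop. No single bottleneck, but one failed node breaks the loop: a one-way ring stops forwarding, and a two-way ring degrades to a line. Latency is high and grows with N (N/2 hops on average when messages travel one way; about N/4 on average and N/2 at worst when they can travel either way). Rarely used in practice.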
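Hop counts in a ring depend on whether messages can travel in one direction or both. The following sketch computes them by brute force; the function names are illustrative.

```python
def ring_hops(n: int, src: int, dst: int, bidirectional: bool = True) -> int:
    """Hops from src to dst in an n-node ring (nodes numbered 0..n-1)."""
    clockwise = (dst - src) % n
    if not bidirectional:
        return clockwise
    # With two-way links, take the shorter way around the loop.
    return min(clockwise, n - clockwise)


def average_hops(n: int, bidirectional: bool = True) -> float:
    """Mean hop count from node 0 to every other node."""
    total = sum(ring_hops(n, 0, dst, bidirectional) for dst in range(1, n))
    return total / (n - 1)


one_way = average_hops(8, bidirectional=False)
two_way = average_hops(8, bidirectional=True)
```

For 8 nodes, one-way forwarding averages 4 hops (N/2), while shortest-path routing in both directions averages about 2.3 hops and never needs more than 4.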
Tree (Hierarchical)
Nodes are arranged in a parent-child hierarchy. Natural for hierarchical organizations and efficient for broadcasting. The root is a single point of failure, and communication between subtrees must pass through their common ancestor, which for top-level subtrees is the root.
Full Mesh
Every node connects directly to every other node. Connections: N * (N-1) / 2.
| Agents | Connections |
|---|---|
| 5 | 10 |
| 10 | 45 |
| 20 | 190 |
| 50 | 1,225 |
Maximum resilience (no single point of failure) and minimum latency (single hop), but O(N^2) connections make it impractical beyond ~50 nodes. chatixia-mesh uses full mesh for agent-to-agent DataChannels.
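To make the growth rates concrete, here is a small sketch that compares connection counts for star and full mesh using the formulas from this section. The function names are illustrative.

```python
def star_connections(n: int) -> int:
    """n clients, each holding one connection to the central server."""
    return n


def mesh_connections(n: int) -> int:
    """Every pair of nodes shares one connection: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (5, 10, 20, 50):
    print(f"{n:>3} nodes: star={star_connections(n):>3}  mesh={mesh_connections(n):>5}")
```

The mesh column reproduces the table above. Doubling the node count doubles the star's connections but roughly quadruples the mesh's.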
Trade-offs Summary
| Topology | Simplicity | Resilience | Scalability |
|---|---|---|---|
| Star | High | Low | Moderate |
| Ring | Moderate | Low | Low |
| Tree | Moderate | Low | Moderate |
| Full Mesh | Low | High | Low |
No topology is universally best. The right choice depends on the number of nodes, failure requirements, and communication patterns.
4. Control Plane vs Data Plane
Every distributed system has two fundamental concerns:
- Where should messages go? (discovery, routing, coordination) — the control plane
- How do messages actually get there? (transport, delivery) — the data plane
The control plane manages the system: which nodes are alive, what capabilities they have, how to reach them. It carries metadata, not application data. Analogy: Air traffic control tracks positions and assigns routes but does not fly the planes.
The data plane carries actual application data between nodes. Analogy: The aircraft moving from point A to point B.
Why Separate Them?
- Different scaling requirements. Metadata is small and infrequent; application traffic is large and frequent.
- Fault isolation. If the control plane goes down, existing data plane connections keep working.
- Security boundaries. End-to-end encryption on the data plane means even the control plane operator cannot read messages.
In chatixia-mesh
The registry handles control plane functions (agent registration, health checking, skill routing, signaling). WebRTC DataChannels handle the data plane (task requests/responses, agent prompts, direct sidecar-to-sidecar communication with DTLS encryption).
The registry never sees the content of agent-to-agent messages. If the registry goes down, existing DataChannel connections continue working — new agents just cannot join until it comes back.
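The fault-isolation property can be illustrated with a toy in-memory model. This is a sketch under simplifying assumptions, not the chatixia-mesh implementation: `Registry` and `Peer` are hypothetical classes, and a direct object reference stands in for an established DataChannel.

```python
class Registry:
    """Control plane: knows who is registered and which skills they offer."""

    def __init__(self) -> None:
        self.online = True
        self._skills = {}  # agent_id -> set of skill names

    def register(self, agent_id: str, skills: list) -> None:
        self._require_online()
        self._skills[agent_id] = set(skills)

    def find(self, skill: str) -> list:
        self._require_online()
        return sorted(a for a, s in self._skills.items() if skill in s)

    def _require_online(self) -> None:
        if not self.online:
            raise ConnectionError("registry unavailable")


class Peer:
    """Data plane endpoint: holds direct links to other peers."""

    def __init__(self, agent_id: str) -> None:
        self.agent_id = agent_id
        self.links = {}  # agent_id -> Peer (stand-in for a DataChannel)
        self.inbox = []

    def connect(self, other: "Peer") -> None:
        self.links[other.agent_id] = other
        other.links[self.agent_id] = self

    def send(self, to: str, payload: str) -> None:
        # The payload goes straight to the peer; the registry never sees it.
        self.links[to].inbox.append((self.agent_id, payload))


registry = Registry()
registry.register("agent-a", ["translate"])
registry.register("agent-b", ["summarize"])

a, b = Peer("agent-a"), Peer("agent-b")
target = registry.find("summarize")[0]  # control plane: discovery
a.connect(b)                            # link established once

registry.online = False                 # control plane outage
a.send(target, "please summarize")      # data plane still works
```

After the simulated outage, the existing link keeps delivering messages, while any call to `register` or `find` raises `ConnectionError`, mirroring the behavior described above.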
5. Case Study: chatixia-mesh at a Glance
chatixia-mesh is an agent-to-agent mesh network where AI agents communicate directly over WebRTC DataChannels.
The Four Components
| Component | Language | Role |
|---|---|---|
| Registry | Rust (axum) | Control plane — signaling, agent registry, task queue, hub API |
| Sidecar | Rust (webrtc-rs) | Networking layer — WebRTC peer connections, IPC bridge to agent |
| Agent | Python | Application logic — skills, LLM calls, task execution |
| Hub | React (Vite) | Monitoring dashboard — agent health, task queue, topology graph |
Each agent process is paired with a sidecar process. The agent handles application logic (what to do); the sidecar handles networking (how to reach other agents). They communicate over a Unix domain socket using a JSON-line protocol — the sidecar pattern.
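A JSON-line protocol frames each message as one JSON object followed by a newline. The sketch below shows the idea using `socket.socketpair()` as a stand-in for the agent-to-sidecar Unix socket. The message fields (`type`, `to`, `payload`) are assumptions made for illustration and may not match the actual chatixia-mesh wire format.

```python
import json
import socket


def send_json_line(sock: socket.socket, message: dict) -> None:
    """Serialize one message as a single newline-terminated JSON line."""
    # json.dumps escapes newlines inside strings, so b"\n" only ever
    # appears as the message terminator.
    sock.sendall(json.dumps(message).encode("utf-8") + b"\n")


def read_json_lines(sock: socket.socket, count: int) -> list:
    """Read until `count` complete lines have arrived, then decode them.

    Simplification: any bytes received after the last requested line
    are discarded; a long-lived reader would keep them in a buffer.
    """
    buffer = b""
    while buffer.count(b"\n") < count:
        chunk = sock.recv(4096)
        if not chunk:  # peer closed the connection
            break
        buffer += chunk
    lines = buffer.split(b"\n")[:count]
    return [json.loads(line) for line in lines if line]


# A connected socket pair plays the role of the shared Unix socket.
agent_end, sidecar_end = socket.socketpair()
send_json_line(agent_end, {"type": "send", "to": "agent-b",
                           "payload": {"task": "summarize"}})
send_json_line(agent_end, {"type": "ping"})
received = read_json_lines(sidecar_end, count=2)
agent_end.close()
sidecar_end.close()
```

Newline framing matters because a stream socket has no message boundaries: one `recv` call may return half a message or several messages at once, which is why the reader buffers until it has seen enough terminators.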
How a Message Travels
When Agent A wants to send a task to Agent B:
- Agent A writes a JSON command to the Unix socket shared with Sidecar A.
- Sidecar A sends the `MeshMessage` over the DataChannel (P2P, encrypted).
- Sidecar B receives the message and writes it to its Unix socket.
- Agent B reads the message and executes the task.
The registry is not involved in the data transfer. Data flows directly between sidecars.
What Problem Does It Solve?
AI agents are most useful when they can collaborate. chatixia-mesh solves this with:
- Discovery — the registry lets agents find each other by skill.
- Direct communication — WebRTC DataChannels let agents talk without a central server.
- NAT traversal — STUN/TURN lets agents behind firewalls connect.
- Graceful degradation — if P2P fails, agents fall back to the registry’s HTTP task queue.
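The graceful-degradation point can be sketched as a try-direct-then-fall-back function. The transports here are hypothetical callables supplied by the caller; the sketch assumes the direct path signals failure by raising `ConnectionError`, which is an illustrative choice rather than documented chatixia-mesh behavior.

```python
from typing import Callable


def deliver(
    task: dict,
    send_p2p: Callable[[dict], None],
    enqueue_http: Callable[[dict], None],
) -> str:
    """Try the direct channel first; fall back to the task queue.

    Returns the name of the path that accepted the task.
    """
    try:
        send_p2p(task)
        return "p2p"
    except ConnectionError:
        # Direct path unavailable: hand the task to the registry's queue.
        enqueue_http(task)
        return "queue"


def broken_channel(task: dict) -> None:
    """Stand-in for a DataChannel that could not be established."""
    raise ConnectionError("no route to peer")


queue = []
path = deliver({"skill": "summarize"}, broken_channel, queue.append)
```

The fallback trades the direct path's low latency for reachability, and it also routes task content through the registry, which the direct path avoids.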
Exercises
- Identify Control Plane and Data Plane. Pick a system you use daily. Draw its topology and identify which components form the control plane vs data plane.
- Calculate Mesh Connections. Using N * (N-1) / 2, calculate connections for 5, 10, and 20 agents. If each WebRTC handshake takes 5 seconds, how long for the 20th agent to connect to all 19 peers?
- Star vs Mesh at Scale. Write a paragraph arguing that star topology is better than full mesh for 100 agents.
- Fallacies and Heartbeats. chatixia-mesh agents heartbeat every 15 seconds. After 90s of silence an agent is “stale”; after 270s it is “offline.” Which fallacies does this address? Why three states instead of two?
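As a starting point for the last exercise, the thresholds can be written as a small classifier. This is a sketch: the name of the healthy state (`"active"`) and the treatment of the exact boundary values are assumptions, since the text only names the "stale" and "offline" states.

```python
STALE_AFTER_S = 90     # from the exercise text
OFFLINE_AFTER_S = 270  # from the exercise text


def agent_status(seconds_since_heartbeat: float) -> str:
    """Classify an agent by how long it has been silent."""
    if seconds_since_heartbeat >= OFFLINE_AFTER_S:
        return "offline"
    if seconds_since_heartbeat >= STALE_AFTER_S:
        return "stale"
    return "active"  # assumed label for a healthy agent


statuses = [agent_status(s) for s in (15, 100, 300)]
```

With a 15-second heartbeat interval, an agent has to miss several consecutive heartbeats before it leaves the healthy state, so a single lost packet does not change its status.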
Related Lessons
- Lesson 02: Peer-to-Peer Networking — NAT traversal, STUN/TURN, ICE
- Lesson 03: WebRTC Fundamentals — SDP, ICE, DTLS, DataChannels
- Lesson 05: Signaling Protocol Design — how agents find each other
Further Reading
- Deutsch, P. (1994). The Eight Fallacies of Distributed Computing. Sun Microsystems.
- Van Steen, M. & Tanenbaum, A. S. (2017). Distributed Systems. 3rd ed. distributed-systems.net.
- Kleppmann, M. (2017). Designing Data-Intensive Applications. O’Reilly Media. Chapters 5-9.