Lesson 01 — Why Distributed Systems? From Monolith to Mesh

Prerequisites

None. This is the first lesson in the series.

What You’ll Learn

What a distributed system is and how it differs from a monolithic application
The Eight Fallacies of Distributed Computing and why they matter
Four fundamental network topologies and their trade-offs
The difference between control plane and data plane
How the chatixia-mesh project maps to these concepts

1. What Is a Distributed System?

A distributed system is a collection of independent processes, running on different machines, that coordinate over a network to accomplish a shared goal. To an outside observer, the system behaves as a single coherent unit — but internally, each process has its own memory, its own clock, and its own failure modes.

Contrast this with a monolith: a single process running on a single machine. All components share the same memory space, the same clock, and the same fate.

Why distribute at all?

A monolith is simpler to build, deploy, and debug. You distribute a system when you need something a single machine cannot provide:

Resilience — if one node fails, the others keep running.
Scalability — you can add machines instead of buying a bigger one.
Geographic reach — you place nodes close to users or data sources.
Autonomy — independent teams or organizations each run their own node.

Every one of these benefits comes with a cost: the network between nodes is unreliable, slow, and insecure.

2. The Eight Fallacies of Distributed Computing

In 1994, L Peter Deutsch at Sun Microsystems listed seven assumptions that developers new to distributed systems tend to make; James Gosling later added an eighth. Each one is false.

The network is reliable. Packets get dropped. Cables get unplugged. A task_request over a WebRTC DataChannel may never arrive without a timeout and retry mechanism.
Latency is zero. Even a fast cross-continent round trip takes 100-200 ms, while a local function call takes microseconds or less. Code that treats a remote call like a local one becomes slow and brittle.
Bandwidth is infinite. A home connection might offer 10 Mbps upload. If 20 agents all broadcast large payloads simultaneously, the network saturates and heartbeats stop arriving.
The network is secure. Without encryption, any device on the network path can read and modify traffic. This is why chatixia-mesh uses DTLS encryption on DataChannels and JWT authentication on signaling.
Topology doesn’t change. Nodes join and leave. A laptop moves from office Wi-Fi to a mobile hotspot. A developer closes their laptop lid, and the topology just changed.
There is one administrator. chatixia-mesh agents can run on a developer’s laptop, a Raspberry Pi, and a cloud VM. Each environment has different firewall rules and update cycles.
Transport cost is zero. Network communication costs CPU cycles for serialization, memory for buffers, and sometimes real money for bandwidth. TURN relay servers cost money to host.
The network is homogeneous. One agent runs Python 3.13 on macOS ARM; another runs on Ubuntu x86_64 in Docker. Each has different performance characteristics and different bugs.
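
The first fallacy mentions a timeout and retry mechanism. A minimal sketch of that idea follows, assuming a hypothetical send callable that returns None when no reply arrives before its own timeout. The function names, message fields, and backoff values are illustrative assumptions, not taken from chatixia-mesh.

```python
import random
import time
from typing import Callable, Optional


def send_with_retry(
    send: Callable[[dict], Optional[dict]],
    message: dict,
    max_attempts: int = 4,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call send(message) until it returns a reply or attempts run out.

    send is expected to return None when no reply arrived before its own
    timeout expired. Returns the reply, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        reply = send(message)
        if reply is not None:
            return reply
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter so peers do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
    return None


def make_flaky_send(drops: int) -> Callable[[dict], Optional[dict]]:
    """Build a fake transport (hypothetical) that loses the first `drops` messages."""
    state = {"calls": 0}

    def send(message: dict) -> Optional[dict]:
        state["calls"] += 1
        if state["calls"] <= drops:
            return None  # simulated timeout: the message was lost
        return {"type": "task_response", "request_id": message["request_id"]}

    return send
```

Calling send_with_retry(make_flaky_send(2), {"type": "task_request", "request_id": "r1"}) succeeds on the third attempt. Because a retried request may be delivered more than once, the receiver would normally need to deduplicate, for example by request_id.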

3. Topology Models

The topology of a distributed system describes how nodes are connected. Different topologies make different trade-offs.

Star (Client-Server)

Every node connects to a single central server. Connections: N. Simple and easy to reason about, but the server is a single point of failure and a bottleneck.

Ring

Each node connects to exactly two neighbors in a closed loop. No single bottleneck, but when messages circulate in one direction a single node failure breaks the ring, and latency is high (N/2 hops on average). Rarely used in practice.
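
The N/2 figure can be checked with a few lines of arithmetic. The sketch below assumes a ring where messages travel in one direction only, and for contrast also computes the average when a message may take the shorter way around. Both function names are made up for this illustration.

```python
def average_hops_one_way(n: int) -> float:
    """Average hop count to another node when messages travel in one direction."""
    distances = range(1, n)  # the other n - 1 nodes sit 1..n-1 hops away
    return sum(distances) / (n - 1)


def average_hops_two_way(n: int) -> float:
    """Average hop count when a message may take the shorter way around."""
    distances = [min(d, n - d) for d in range(1, n)]
    return sum(distances) / (n - 1)


for n in (10, 20):
    print(n, average_hops_one_way(n), round(average_hops_two_way(n), 2))
```

For a one-way ring the average is exactly N/2; allowing both directions roughly halves it, at the cost of each node handling traffic in two directions.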

Tree (Hierarchical)

Nodes are arranged in a parent-child hierarchy. Natural for hierarchical organizations, efficient for broadcasting. The root is a single point of failure, and communication between subtrees must pass through their common ancestor, which for top-level subtrees is the root.

Full Mesh

Every node connects directly to every other node. Connections: N * (N-1) / 2.

Agents	Connections
5	10
10	45
20	190
50	1,225

Maximum resilience (no single point of failure) and minimum latency (single hop), but O(N^2) connections make it impractical beyond ~50 nodes. chatixia-mesh uses full mesh for agent-to-agent DataChannels.
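
The table above follows directly from the N * (N-1) / 2 formula. A short sketch that reproduces it and compares it with the star topology's N connections (the function names are illustrative):

```python
def star_connections(n: int) -> int:
    """One connection from each of the n nodes to the central server."""
    return n


def full_mesh_connections(n: int) -> int:
    """Every unordered pair of nodes shares exactly one connection."""
    return n * (n - 1) // 2


for n in (5, 10, 20, 50):
    print(f"{n:>3} agents: star={star_connections(n):>3}  mesh={full_mesh_connections(n):>5}")
```

Each node in a full mesh maintains N - 1 connections of its own, so adding the 51st agent means 50 new handshakes.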

Trade-offs Summary

                Simplicity    Resilience    Scalability
Star            High          Low           Moderate
Ring            Moderate      Low           Low
Tree            Moderate      Low           Moderate
Full Mesh       Low           High          Low

No topology is universally best. The right choice depends on the number of nodes, failure requirements, and communication patterns.

4. Control Plane vs Data Plane

Every distributed system has two fundamental concerns:

Where should messages go? (discovery, routing, coordination) — the control plane
How do messages actually get there? (transport, delivery) — the data plane

The control plane manages the system: which nodes are alive, what capabilities they have, how to reach them. It carries metadata, not application data. Analogy: Air traffic control tracks positions and assigns routes but does not fly the planes.

The data plane carries actual application data between nodes. Analogy: The aircraft moving from point A to point B.

Why Separate Them?

Different scaling requirements. Metadata is small and infrequent; application traffic is large and frequent.
Fault isolation. If the control plane goes down, existing data plane connections keep working.
Security boundaries. End-to-end encryption on the data plane means even the control plane operator cannot read messages.

In chatixia-mesh

The registry handles control plane functions (agent registration, health checking, skill routing, signaling). WebRTC DataChannels handle the data plane (task requests/responses, agent prompts, direct sidecar-to-sidecar communication with DTLS encryption).

The registry never sees the content of agent-to-agent messages sent over DataChannels. If the registry goes down, existing DataChannel connections continue working; new agents just cannot join until it comes back.
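
The separation can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the chatixia-mesh implementation: the Registry and Peer classes, their methods, and the message fields are all hypothetical, and a plain list stands in for a DataChannel.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """One node. `inbox` stands in for the receiving end of a DataChannel."""
    name: str
    skills: set
    inbox: list = field(default_factory=list)

    def send(self, other: "Peer", payload: dict) -> None:
        # Data plane: the payload goes straight to the other peer.
        other.inbox.append({"from": self.name, **payload})


class Registry:
    """Control plane: stores metadata about peers, never their messages."""

    def __init__(self) -> None:
        self.online = True
        self._peers = {}

    def register(self, peer: Peer) -> None:
        if not self.online:
            raise ConnectionError("registry unavailable")
        self._peers[peer.name] = peer

    def find_by_skill(self, skill: str) -> list:
        if not self.online:
            raise ConnectionError("registry unavailable")
        return [p for p in self._peers.values() if skill in p.skills]


registry = Registry()
a = Peer("agent-a", {"summarize"})
b = Peer("agent-b", {"translate"})
registry.register(a)
registry.register(b)

target = registry.find_by_skill("translate")[0]  # control plane lookup
registry.online = False                          # simulated registry outage
a.send(target, {"type": "task_request", "text": "hola"})  # data plane still works
```

After the simulated outage, the message still reaches agent-b, while an attempt to register a third peer raises ConnectionError.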

5. Case Study: chatixia-mesh at a Glance

chatixia-mesh is an agent-to-agent mesh network where AI agents communicate directly over WebRTC DataChannels.

The Four Components

Component	Language	Role
Registry	Rust (axum)	Control plane — signaling, agent registry, task queue, hub API
Sidecar	Rust (webrtc-rs)	Networking layer — WebRTC peer connections, IPC bridge to agent
Agent	Python	Application logic — skills, LLM calls, task execution
Hub	React (Vite)	Monitoring dashboard — agent health, task queue, topology graph

Each agent process is paired with a sidecar process. The agent handles application logic (what to do); the sidecar handles networking (how to reach other agents). They communicate over a Unix domain socket using a JSON-line protocol — the sidecar pattern.
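
A JSON-line protocol frames each message as one JSON object followed by a newline. The sketch below shows that framing only, using socket.socketpair() as a stand-in for the Unix domain socket between agent and sidecar. The command names and fields are invented for illustration and are not the project's actual message schema.

```python
import json
import socket


def send_json_line(sock: socket.socket, message: dict) -> None:
    """Serialize one message as compact JSON terminated by a newline."""
    line = json.dumps(message, separators=(",", ":")).encode("utf-8") + b"\n"
    sock.sendall(line)


def read_json_lines(sock: socket.socket, count: int) -> list:
    """Read `count` newline-delimited JSON messages from the socket."""
    reader = sock.makefile("rb")
    try:
        return [json.loads(reader.readline()) for _ in range(count)]
    finally:
        reader.close()


# socketpair() returns two connected sockets, standing in for the agent and
# sidecar ends of the Unix domain socket.
agent_end, sidecar_end = socket.socketpair()
send_json_line(agent_end, {"cmd": "send", "to": "agent-b", "payload": {"task": "summarize"}})
send_json_line(agent_end, {"cmd": "ping"})
received = read_json_lines(sidecar_end, 2)
agent_end.close()
sidecar_end.close()
print(received)
```

Because json.dumps escapes any newline inside a string value, a newline byte on the wire always marks the end of a message, which keeps the reader simple.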

How a Message Travels

When Agent A wants to send a task to Agent B:

Agent A writes a JSON command to the Unix socket shared with Sidecar A.
Sidecar A sends the MeshMessage over the DataChannel (P2P, encrypted).
Sidecar B receives the message and writes it to its Unix socket.
Agent B reads the message and executes the task.

The registry is not involved in the data transfer. Data flows directly between sidecars.

What Problem Does It Solve?

AI agents are most useful when they can collaborate. chatixia-mesh solves this with:

Discovery — the registry lets agents find each other by skill.
Direct communication — WebRTC DataChannels let agents talk without a central server.
NAT traversal — STUN/TURN lets agents behind firewalls connect.
Graceful degradation — if P2P fails, agents fall back to the registry’s HTTP task queue.
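
The graceful degradation item can be sketched as a try-direct-then-fall-back function. Everything here is an assumption made for illustration: the PeerUnreachable exception, the two transport callables, and the returned path names do not come from the chatixia-mesh codebase.

```python
from typing import Callable


class PeerUnreachable(Exception):
    """Raised by the hypothetical P2P transport when no DataChannel is usable."""


def deliver_task(
    task: dict,
    send_p2p: Callable[[dict], None],
    enqueue_http: Callable[[dict], None],
) -> str:
    """Try the direct path first; fall back to the registry queue on failure.

    Returns the name of the path that accepted the task.
    """
    try:
        send_p2p(task)
        return "p2p"
    except PeerUnreachable:
        enqueue_http(task)
        return "http-queue"


def broken_p2p(task: dict) -> None:
    """Fake transport that always fails, as if the peer could not be reached."""
    raise PeerUnreachable("no usable DataChannel")


queued = []
path = deliver_task({"skill": "translate"}, broken_p2p, queued.append)
print(path, queued)
```

The fallback trades latency and privacy for availability: a queued task passes through the registry, so it no longer benefits from the direct, end-to-end encrypted path.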

Exercises

Identify Control Plane and Data Plane. Pick a system you use daily. Draw its topology and identify which components form the control plane vs data plane.
Calculate Mesh Connections. Using N * (N-1) / 2, calculate connections for 5, 10, and 20 agents. If each WebRTC handshake takes 5 seconds, how long for the 20th agent to connect to all 19 peers?
Star vs Mesh at Scale. Write a paragraph arguing that star topology is better than full mesh for 100 agents.
Fallacies and Heartbeats. chatixia-mesh agents heartbeat every 15 seconds. After 90s of silence an agent is “stale”; after 270s it is “offline.” Which fallacies does this address? Why three states instead of two?
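
As a starting point for the last exercise, the thresholds can be written down as a small classifier. This sketch assumes the thresholds are inclusive and uses "active" as a placeholder name for the healthy state; neither detail is specified above.

```python
HEARTBEAT_INTERVAL_S = 15
STALE_AFTER_S = 90     # 6 missed heartbeats at a 15 s interval
OFFLINE_AFTER_S = 270  # 18 missed heartbeats at a 15 s interval


def classify(last_heartbeat: float, now: float) -> str:
    """Map seconds of silence to one of three agent states."""
    silence = now - last_heartbeat
    if silence >= OFFLINE_AFTER_S:
        return "offline"
    if silence >= STALE_AFTER_S:
        return "stale"
    return "active"  # placeholder name, an assumption of this sketch


for silence in (15, 90, 269, 270):
    print(silence, classify(0.0, float(silence)))
```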

Related Lessons

Lesson 02: Peer-to-Peer Networking (NAT traversal, STUN/TURN, ICE)
Lesson 03: WebRTC Fundamentals (SDP, ICE, DTLS, DataChannels)
Lesson 05: Signaling Protocol Design (how agents find each other)
