Lesson 07: Application Protocol Design — MeshMessage and Task Lifecycle

Prerequisites: Lesson 05: Signaling Protocol Design, Lesson 06: Inter-Process Communication

Key source files:

sidecar/src/protocol.rs — MeshMessage struct and message type constants
registry/src/hub.rs — Task struct, TaskSubmission, HubState, task lifecycle
agent/chatixia/core/mesh_skills.py — handle_delegate with P2P-first and HTTP fallback
agent/chatixia/core/mesh_client.py — MeshClient.request() for correlated request/response

Introduction

Lessons 05 and 06 covered how chatixia-mesh establishes connections (signaling) and bridges WebRTC to Python (IPC). Both are about moving bytes between components. This lesson covers what those bytes mean: the application protocol that gives messages their semantics.

1. Layered Protocols

Every message travels through multiple layers:

+---------------------------------------------------------------+
|  Application:  MeshMessage | IpcMessage | SignalingMessage     |
+---------------------------------------------------------------+
|  Transport:    DataChannel | Unix Socket | WebSocket           |
+---------------------------------------------------------------+
|  Network:      UDP         | Filesystem  | TCP                 |
+---------------------------------------------------------------+

A MeshMessage from Agent A to Agent B crosses three hops over two kinds of channel: it is wrapped in an IpcMessage over the Unix socket to Sidecar A, sent as raw JSON over the DataChannel to Sidecar B, then wrapped in another IpcMessage to reach Agent B. Each component understands only its own protocol: the Python agent never deals with WebRTC, and the sidecar never interprets task payloads.
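The wrap and unwrap steps can be sketched in a few lines of Python. The IPC envelope keys used here (`kind`, `peer`, `body`) are placeholders invented for illustration, not the real IpcMessage format from Lesson 06; only the MeshMessage keys come from this lesson.

```python
import json


def wrap_for_ipc(mesh_message: dict, peer_id: str) -> str:
    """Agent side: place a MeshMessage inside a hypothetical IPC envelope."""
    envelope = {"kind": "send", "peer": peer_id, "body": mesh_message}
    return json.dumps(envelope)


def sidecar_forward(ipc_frame: str) -> str:
    """Sidecar side: strip the envelope and forward the body untouched.

    Only the envelope is inspected; the MeshMessage payload is never read.
    """
    envelope = json.loads(ipc_frame)
    return json.dumps(envelope["body"])  # raw JSON for the DataChannel


mesh = {
    "type": "task_request",
    "request_id": "a1b2c3d4e5f6",
    "source_agent": "agent-a",
    "target_agent": "agent-b",
    "payload": {"message": "summarize this", "skill": "summarize"},
}

on_unix_socket = wrap_for_ipc(mesh, peer_id="sidecar-b")
on_data_channel = sidecar_forward(on_unix_socket)
```

What travels over the DataChannel is the MeshMessage alone; the envelope exists only on the Unix socket hop.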

2. MeshMessage Format

The MeshMessage is the single envelope for all agent-to-agent communication. Five fields:

// sidecar/src/protocol.rs
pub struct MeshMessage {
    #[serde(rename = "type")]
    pub msg_type: String,
    #[serde(default)]
    pub request_id: String,
    #[serde(default)]
    pub source_agent: String,
    #[serde(default)]
    pub target_agent: String,
    #[serde(default)]
    pub payload: serde_json::Value,
}

| Field | Required | Purpose |
|---|---|---|
| `type` | Yes | Determines how the receiver interprets the message |
| `request_id` | No | Correlates requests with responses (12-char UUID hex) |
| `source_agent` | No | Sender identity for routing and attribution |
| `target_agent` | No | Intended recipient; `"*"` for broadcasts |
| `payload` | No | Arbitrary JSON, contents depend on `type` |

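Every field except type has a default (an empty string, or JSON null for payload, which is the Default value of serde_json::Value), so a ping is just {"type": "ping"}. The #[serde(default)] annotation means missing fields deserialize without error, so minimal or older senders that omit fields stay compatible. Serde also ignores unknown fields by default, which is what allows newer senders to add fields without breaking older receivers.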
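The same tolerant parsing can be mirrored on the Python side. The class below is a minimal sketch written for this lesson, not the project's actual Python MeshMessage class; it assumes the JSON keys shown in the Rust struct and treats everything except `type` as optional.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class MeshMessage:
    msg_type: str
    request_id: str = ""
    source_agent: str = ""
    target_agent: str = ""
    payload: Any = None  # JSON null when the sender omits it

    @classmethod
    def from_json(cls, raw: str) -> "MeshMessage":
        data = json.loads(raw)
        return cls(
            msg_type=data["type"],  # the only required key
            request_id=data.get("request_id", ""),
            source_agent=data.get("source_agent", ""),
            target_agent=data.get("target_agent", ""),
            payload=data.get("payload"),
        )  # keys not listed here are silently ignored

    def to_json(self) -> str:
        return json.dumps({
            "type": self.msg_type,
            "request_id": self.request_id,
            "source_agent": self.source_agent,
            "target_agent": self.target_agent,
            "payload": self.payload,
        })


ping = MeshMessage.from_json('{"type": "ping"}')
newer = MeshMessage.from_json('{"type": "ping", "trace_id": "x"}')
```

Both calls succeed: the first exercises the defaults, the second shows an unknown key (`trace_id`, made up for the example) being dropped rather than rejected.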

Message Types

Connectivity: ping / pong — lightweight DataChannel heartbeat.

Task delegation (request/response):

task_request — “Execute this task.” Carries request_id, payload includes message and skill.
task_response — “Here is the result.” Carries matching request_id, payload includes result or error.
task_stream_chunk — Streaming partial results with the same request_id.

Skill discovery: skill_query / skill_response — “What can you do?”

Agent communication (fire-and-forget):

agent_status — Broadcast of skills, health, load.
agent_prompt — Direct message or broadcast, no response expected.
agent_response / agent_stream_chunk — Optional replies.
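Because `type` alone decides how a message is interpreted, a receiver can be organized as a lookup table from type to handler. The sketch below is illustrative only: the handler bodies, the bare `pong` reply, and the choice to drop unknown types are assumptions, not behavior confirmed by the source files.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[dict], Optional[dict]]
HANDLERS: Dict[str, Handler] = {}


def handles(msg_type: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler for one message type."""
    def register(fn: Handler) -> Handler:
        HANDLERS[msg_type] = fn
        return fn
    return register


@handles("ping")
def on_ping(msg: dict) -> Optional[dict]:
    return {"type": "pong"}  # heartbeat reply


@handles("task_request")
def on_task_request(msg: dict) -> Optional[dict]:
    # Echo the request_id so the sender can correlate the response.
    return {
        "type": "task_response",
        "request_id": msg.get("request_id", ""),
        "payload": {"result": "handled: " + msg["payload"]["message"]},
    }


@handles("agent_prompt")
def on_agent_prompt(msg: dict) -> Optional[dict]:
    return None  # fire-and-forget: no reply expected


def dispatch(msg: dict) -> Optional[dict]:
    handler = HANDLERS.get(msg.get("type", ""))
    if handler is None:
        return None  # assumption: unknown types are ignored
    return handler(msg)
```

Adding a new message type then means registering one more handler, which fits the envelope's goal of being easy to extend.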

3. Task Lifecycle

When an agent delegates work, it is tracked as a task with a four-state lifecycle (registry/src/hub.rs):

              submit_task
                  |
                  v
             [pending]
                  |
         get_pending_for_agent()
                  |
                  v
             [assigned]
              /        \
   update_task()    update_task()
   completed        failed
        |               |
        v               v
  [completed]      [failed]

| Transition | Trigger | What happens |
|---|---|---|
| -> `pending` | `submit_task()` HTTP handler | New task created with UUID, timestamped |
| `pending` -> `assigned` | `get_pending_for_agent()` | Agent claims task by skill match; `assigned_agent_id` set |
| `assigned` -> `completed` | `update_task()` | Agent POSTs result |
| `assigned` -> `failed` | `update_task()` | Agent POSTs error |
| `pending` or `assigned` -> `failed` | `expire_tasks_loop()` | TTL exceeded (default 300s, checked every 30s) |

Terminal states (completed/failed) are permanent — no retry mechanism. The source agent decides whether to resubmit.
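The lifecycle can be modeled as a small state machine. This is a Python sketch of the rules in the table, not a port of registry/src/hub.rs; the class layout and method names are invented for illustration, and it assumes expiry only affects tasks that have not yet reached a terminal state.

```python
import time
from dataclasses import dataclass, field

# Transitions an agent can drive; expiry is handled separately below.
ALLOWED = {
    "pending": {"assigned"},
    "assigned": {"completed", "failed"},
    "completed": set(),  # terminal
    "failed": set(),     # terminal
}


@dataclass
class Task:
    task_id: str
    ttl: float = 300.0  # seconds, matching the default in the table
    state: str = "pending"
    created_at: float = field(default_factory=time.monotonic)

    def transition(self, new_state: str) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

    def expire_if_stale(self, now: float) -> bool:
        """Mark a non-terminal task failed once its TTL has passed."""
        if self.state in ("pending", "assigned") and now - self.created_at > self.ttl:
            self.state = "failed"
            return True
        return False


task = Task("t1", created_at=0.0)
task.transition("assigned")
task.transition("completed")

stale = Task("t2", created_at=0.0)
expired = stale.expire_if_stale(now=301.0)
```

Keeping the allowed transitions in one table makes the "terminal states are permanent" rule a property of the data rather than of scattered checks.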

4. The Dual Execution Path

The core design pattern: every operation tries the fast P2P DataChannel first, then falls back to the slower HTTP task queue.

Path 1: P2P (fast, <100ms typical)

Taken when the mesh client is connected and the target peer is reachable:

msg = MeshMessage(
    msg_type="task_request",
    source_agent=agent_id,
    target_agent=target_agent_id,
    payload={"message": message, "skill": skill},
)
response = await _mesh_client.request(target_peer, msg, timeout=120.0)

The request() method generates a request_id, registers an asyncio.Future, sends the message through IPC -> DataChannel -> remote sidecar -> remote agent, and resolves the future when a task_response with the matching ID arrives.

Total hops: 6 (3 each direction). Network crossings: 2 DataChannel messages.
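The correlation mechanism described above can be sketched with plain asyncio. The `Correlator` class and its `on_message` hook are hypothetical names for this example, and deriving the ID from the first 12 hex characters of a UUID4 is an assumption based on the "12-char UUID hex" description; the real MeshClient.request() may differ.

```python
import asyncio
import uuid
from typing import Awaitable, Callable, Dict


class Correlator:
    def __init__(self, send: Callable[[dict], Awaitable[None]]) -> None:
        self._send = send
        self._pending: Dict[str, asyncio.Future] = {}

    async def request(self, msg: dict, timeout: float) -> dict:
        request_id = uuid.uuid4().hex[:12]
        future = asyncio.get_running_loop().create_future()
        self._pending[request_id] = future
        try:
            await self._send({**msg, "request_id": request_id})
            # Raises asyncio.TimeoutError if no response arrives in time.
            return await asyncio.wait_for(future, timeout)
        finally:
            self._pending.pop(request_id, None)  # never leak an entry

    def on_message(self, msg: dict) -> None:
        """Called for every inbound message; resolves the matching future."""
        if msg.get("type") != "task_response":
            return
        future = self._pending.get(msg.get("request_id", ""))
        if future is not None and not future.done():
            future.set_result(msg)


async def _demo() -> dict:
    async def loopback(msg: dict) -> None:
        # Stand-in for the remote agent: answer on the next loop iteration.
        reply = {"type": "task_response", "request_id": msg["request_id"],
                 "payload": {"result": "ok"}}
        asyncio.get_running_loop().call_soon(correlator.on_message, reply)

    correlator = Correlator(loopback)
    return await correlator.request({"type": "task_request", "payload": {}}, timeout=1.0)


reply = asyncio.run(_demo())
```

The `finally` clause matters: whether the request succeeds or times out, the pending entry is removed, so a late response for an abandoned request is simply ignored.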

Path 2: HTTP Fallback (3-15s typical)

Used when P2P is unavailable:

result = _post(f"{registry}/api/hub/tasks", {
    "skill": skill, "target_agent_id": target_agent_id,
    "source_agent_id": agent_id, "payload": {"message": message}, "ttl": 300,
})
task_id = result.get("task_id")

# Poll every 3s until the task is completed or failed, or the 120s deadline passes
deadline = time.monotonic() + 120.0
while time.monotonic() < deadline:
    await asyncio.sleep(3)
    status = _get(f"{registry}/api/hub/tasks/{task_id}")
    if status["state"] in ("completed", "failed"):
        return status

Minimum 4 HTTP requests per task. Latency dominated by the 3-second polling interval.
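One plausible reading of the four-request minimum is: the source agent submits, the target agent polls for pending work, the target agent posts the outcome, and the source agent polls for status. That breakdown is an interpretation of the lifecycle table rather than something the source files are quoted as stating. The in-memory stand-in below counts requests under that assumption; `FakeRegistry` and its method names are hypothetical.

```python
import uuid
from typing import Optional, Set


class FakeRegistry:
    """In-memory stand-in for the registry that counts incoming requests."""

    def __init__(self) -> None:
        self.tasks: dict = {}
        self.request_count = 0

    def submit(self, skill: str, payload: dict) -> str:
        self.request_count += 1  # source agent: POST a new task
        task_id = uuid.uuid4().hex
        self.tasks[task_id] = {"state": "pending", "skill": skill,
                               "payload": payload, "result": None}
        return task_id

    def claim(self, skills: Set[str]) -> Optional[str]:
        self.request_count += 1  # target agent: poll for pending work
        for task_id, task in self.tasks.items():
            if task["state"] == "pending" and task["skill"] in skills:
                task["state"] = "assigned"
                return task_id
        return None

    def update(self, task_id: str, state: str, result: str) -> None:
        self.request_count += 1  # target agent: POST the outcome
        self.tasks[task_id].update(state=state, result=result)

    def status(self, task_id: str) -> dict:
        self.request_count += 1  # source agent: poll for status
        return dict(self.tasks[task_id])


registry = FakeRegistry()
task_id = registry.submit("summarize", {"message": "hello"})
claimed = registry.claim({"summarize"})
registry.update(claimed, "completed", "done")
final = registry.status(task_id)
```

In practice both agents usually poll more than once, so four is a floor; every empty poll adds a request and up to one polling interval of latency.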

Comparison

| Aspect | P2P DataChannel | HTTP Task Queue |
|---|---|---|
| Latency | <100ms | 3-15s |
| Registry load | None | 4+ requests/task |
| Encryption | DTLS (end-to-end) | TLS to registry (registry sees content) |
| Reliability | Requires active DataChannel | Works if registry is reachable |

5. Graceful Degradation

The dual path is part of a three-tier transport hierarchy:

| Tier | Transport | Latency | When used |
|---|---|---|---|
| 1 | Direct P2P (UDP) | <100ms | Same LAN, public IPs, or STUN-assisted NAT traversal |
| 2 | TURN Relay (UDP) | 100-500ms | Symmetric NAT or restrictive firewall |
| 3 | HTTP Fallback | 3-15s | No WebRTC connectivity at all |

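Tier selection is implicit: ICE negotiation tries a direct path first, then TURN; if WebRTC connectivity fails entirely, handle_delegate falls through to HTTP. From the agent's perspective, delegate returns a result as long as at least one tier is usable (the HTTP tier requires a reachable registry), and the main thing that varies is latency.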
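The fall-through itself is a small piece of control flow. The sketch below assumes the P2P attempt signals unavailability by raising ConnectionError and takes both transports as injected callables; the real handle_delegate may detect unavailability differently.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Transport = Callable[[], Awaitable[dict]]


async def delegate(p2p_request: Optional[Transport],
                   http_fallback: Transport) -> dict:
    """Try the P2P path first; use the HTTP task queue if it is unavailable."""
    if p2p_request is not None:  # None models "mesh client not connected"
        try:
            result = await p2p_request()
            return {"via": "p2p", **result}
        except ConnectionError:
            pass  # peer unreachable: fall through to HTTP
    result = await http_fallback()
    return {"via": "http", **result}


async def _p2p_ok() -> dict:
    return {"result": "fast"}


async def _p2p_down() -> dict:
    raise ConnectionError("no DataChannel to peer")


async def _http_queue() -> dict:
    return {"result": "slow"}


fast = asyncio.run(delegate(_p2p_ok, _http_queue))
slow = asyncio.run(delegate(_p2p_down, _http_queue))
offline = asyncio.run(delegate(None, _http_queue))
```

The caller sees the same result shape on either path; the `via` key is added here only to make the chosen tier visible in the example.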

This design serves four goals: latency (P2P is roughly 30-150x faster, going by the figures above), privacy (on the P2P path, DTLS means the registry never sees data-plane content), scalability (P2P offloads registry traffic), and resilience (established DataChannels survive registry downtime).

6. Fire-and-Forget vs Request/Response

Fire-and-Forget: `mesh_send` and `mesh_broadcast`

Uses agent_prompt type and MeshClient.send():

msg = MeshMessage(msg_type="agent_prompt", source_agent=agent_id,
                  target_agent=target_agent_id, payload={"message": message})
await _mesh_client.send(target_peer, msg)  # returns immediately

No request_id, no response expected. Appropriate for status announcements, notifications, and broadcasts.

Request/Response: `delegate`

Uses task_request/task_response types and MeshClient.request():

msg = MeshMessage(msg_type="task_request", source_agent=agent_id,
                  target_agent=target_agent_id, payload={"message": message, "skill": skill})
response = await _mesh_client.request(target_peer, msg, timeout=120.0)

Generates request_id, blocks until response or timeout. Appropriate for task delegation where the sender needs the result.

The wait parameter on handle_delegate converts it to fire-and-forget when wait=False — sends a task_request but does not await the response.
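The effect of the wait flag can be sketched against a stub client. `FakeMeshClient` is a hypothetical stand-in that only records calls, and returning None when wait=False is an assumption; the sketch shows the same task_request going out through either send() or request().

```python
import asyncio
from typing import List, Optional, Tuple


class FakeMeshClient:
    """Stub that records which client method carried each message."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str]] = []

    async def send(self, peer: str, msg: dict) -> None:
        self.calls.append(("send", msg["type"]))

    async def request(self, peer: str, msg: dict, timeout: float) -> dict:
        self.calls.append(("request", msg["type"]))
        return {"type": "task_response", "payload": {"result": "done"}}


async def delegate(client: FakeMeshClient, peer: str, message: str,
                   skill: str, wait: bool = True) -> Optional[dict]:
    msg = {"type": "task_request",
           "payload": {"message": message, "skill": skill}}
    if wait:
        # Request/response: block until the matching task_response arrives.
        return await client.request(peer, msg, timeout=120.0)
    # Fire-and-forget: hand the message off and return immediately.
    await client.send(peer, msg)
    return None


client = FakeMeshClient()
answered = asyncio.run(delegate(client, "peer-b", "summarize", "summarize"))
dropped = asyncio.run(delegate(client, "peer-b", "summarize", "summarize", wait=False))
```

The message type stays task_request in both cases; only the sender's willingness to wait changes.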

Summary

MeshMessage is a five-field JSON envelope that carries all agent-to-agent communication. Its minimal design makes it easy to implement across languages and extend with new types.

Tasks follow a four-state lifecycle (pending -> assigned -> completed | failed) with TTL-based expiration preventing abandoned tasks from accumulating.

The dual execution path tries P2P first (fast, private, decentralized) and falls back to HTTP (slower, but available whenever the registry is reachable). request_id enables correlation over DataChannels; task_id serves the same purpose over HTTP polling.

Graceful degradation across three tiers means that, as long as one tier remains usable, the system keeps working and only slows down.

Previous: Lesson 06: Inter-Process Communication | Next: Lesson 08: Authentication and Security
