chatixia blog
Operations March 29, 2026 · 6 min read

Testing Distributed Systems -- From Unit Tests to End-to-End Validation

A distributed system is a collection of independent processes that communicate over a network to achieve a shared goal. Testing such a system is harder than testing a monolith because bugs can hide...

testing, integration-tests, e2e

Lesson 17: Testing Distributed Systems — From Unit Tests to End-to-End Validation

Prerequisites: Lesson 05: Signaling Protocol Design, Lesson 06: Inter-Process Communication


Introduction

chatixia-mesh spans three languages (Rust, Python, TypeScript), four components (registry, sidecar, agent, hub), and three communication channels (WebSocket, WebRTC DataChannel, Unix socket IPC). A unit test proving MeshMessage serializes correctly tells you nothing about whether the heartbeat loop processes the tasks it receives.

This lesson examines the testing strategy chatixia-mesh uses — and the testing gap it discovered the hard way.


1. The Testing Pyramid for Distributed Systems

The classic pyramid applies, but the middle and top layers carry more weight than in monoliths because the most dangerous bugs live at component boundaries.

Unit tests (base): Verify individual functions in isolation — protocol serialization, state transitions, token validation, config parsing. Fast, deterministic, cheap to write. Cannot catch bugs at component boundaries.

Integration tests (middle): Verify two or more components working together, using mocks to avoid standing up the full system. Examples: skill handlers with a mock MeshClient, runner registration with mock HTTP, and CLI subcommands against the filesystem. This tier gives the best return on investment for distributed system bugs.

E2E tests (top): Run the full system — registry, sidecars, agents. Most expensive but catch cross-component state synchronization failures that nothing else can. The insight: you cannot skip E2E testing, but you can minimize how often you need it by pushing boundary-crossing tests into the integration tier.


2. Unit Testing Protocol Code

Protocol code has clear inputs, clear outputs, and no side effects — ideal for testing. The pattern: construct, serialize, deserialize, verify.

Serialization round-trips in Rust

#[test]
fn test_signaling_message_serialize_with_target() {
    let msg = SignalingMessage {
        msg_type: "offer".into(),
        peer_id: "peer-1".into(),
        target_id: Some("peer-2".into()),
        payload: serde_json::json!({"sdp": "..."}),
    };
    let json: serde_json::Value = serde_json::to_value(&msg).unwrap();
    assert_eq!(json["type"], "offer");
    // "msg_type" must NOT appear (renamed to "type" via serde)
    assert!(json.get("msg_type").is_none());
}

This guards the #[serde(rename = "type")] attribute. If it is removed, the field serializes as msg_type and every component that expects type breaks.

Default deserialization tests verify that a minimal {"type":"ping"} deserializes successfully because the omitted fields carry #[serde(default)] annotations. Round-trip tests (serialize, then deserialize) prove no data is lost.

Cross-language consistency

Python’s MeshMessage has a subtle difference: payload defaults to {} (empty dict), while Rust defaults to serde_json::Value::Null. This is intentional but illustrates why cross-language protocol testing matters.
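A minimal sketch of how the Python side of this difference could be pinned down in a test. The MeshMessage dataclass below is a hypothetical stand-in, not the project's class: the field set, the from_json and to_json helpers, and the choice to normalize a null payload to {} are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class MeshMessage:
    """Hypothetical stand-in for the Python MeshMessage; fields are assumed."""

    type: str
    payload: dict[str, Any] = field(default_factory=dict)  # defaults to {}

    @classmethod
    def from_json(cls, raw: str) -> "MeshMessage":
        data = json.loads(raw)
        # Assumption: a missing or null payload (what Rust's Value::Null
        # serializes to) is normalized to an empty dict on the Python side.
        return cls(type=data["type"], payload=data.get("payload") or {})

    def to_json(self) -> str:
        return json.dumps(asdict(self))


def test_minimal_message_gets_empty_payload() -> None:
    assert MeshMessage.from_json('{"type": "ping"}').payload == {}


def test_null_payload_is_normalized() -> None:
    raw = '{"type": "ping", "payload": null}'
    assert MeshMessage.from_json(raw).payload == {}


def test_round_trip_preserves_fields() -> None:
    original = MeshMessage(type="task", payload={"skill": "research"})
    assert MeshMessage.from_json(original.to_json()) == original
```

Feeding the Python parser the exact JSON the Rust side emits for its default is what turns an intentional difference into a tested one.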

Auth token validation

Three scenarios cover the security boundary: a freshly issued token validates correctly, an expired token is rejected (crafted with past timestamps to avoid waiting), and a token from a different signing secret is rejected.
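The three scenarios can be sketched as follows. The token format here (base64url JSON claims plus an HMAC-SHA256 signature) is an assumption chosen to keep the example self-contained; the text does not specify the actual token scheme, and issue_token and validate_token are hypothetical names.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"test-secret"


def issue_token(peer_id: str, secret: bytes, issued_at: float, ttl: int = 300) -> str:
    """Hypothetical token: base64url(JSON claims) + "." + hex HMAC-SHA256."""
    claims = {"sub": peer_id, "exp": issued_at + ttl}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
    sig = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"


def validate_token(token: str, secret: bytes, now: float) -> bool:
    body, _, sig = token.partition(".")
    expected = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # signed with another secret, or the body was altered
    claims = json.loads(base64.urlsafe_b64decode(body))
    return now < claims["exp"]


def test_fresh_token_validates() -> None:
    now = time.time()
    token = issue_token("peer-1", SECRET, issued_at=now)
    assert validate_token(token, SECRET, now)


def test_expired_token_is_rejected() -> None:
    now = time.time()
    # Issued an hour ago with a 5-minute TTL, so no sleeping is needed.
    token = issue_token("peer-1", SECRET, issued_at=now - 3600)
    assert not validate_token(token, SECRET, now)


def test_token_from_other_secret_is_rejected() -> None:
    now = time.time()
    token = issue_token("peer-1", b"other-secret", issued_at=now)
    assert not validate_token(token, SECRET, now)
```

Passing the clock in as a parameter is the design choice that makes the expiry case deterministic.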


3. The E2E Gap — A Cautionary Tale

Session 4 ran a full E2E test: 1 registry, 2 agents with WebRTC sidecars. Registration, authentication, signaling, DataChannel formation, task submission, and task assignment all passed. Then a list_agents task was submitted targeting agent-beta.

The task was created (pending), claimed on beta’s heartbeat (assigned) — and nothing happened. Agent-beta never executed it.

The root cause

The heartbeat loop sent an HTTP POST and discarded the response body. The registry returned pending_tasks in the response, but resp.json() was never called.

Why unit tests missed it

Every component passed its tests individually. HubState.get_pending_for_agent correctly assigns tasks (Rust test). Skill handlers execute correctly (Python test). The heartbeat POST sends the right payload (mock test). The bug was at the seam — the registry returned data; the runner ignored it. No unit test covered this because it spans two processes in two languages.

The fix (ADR-013)

The heartbeat loop now parses the response, extracts pending_tasks, and dispatches each to _execute_task via asyncio.create_task so long-running skills do not block the loop.
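A minimal sketch of the fixed loop body, assuming a hypothetical Runner class with an injected post_heartbeat coroutine in place of the real HTTP call. Only pending_tasks, _execute_task, and asyncio.create_task come from the text; the response shape and the task fields are illustrative.

```python
import asyncio
from typing import Any, Awaitable, Callable

PostFn = Callable[[dict[str, Any]], Awaitable[dict[str, Any]]]


class Runner:
    """Hypothetical runner; names other than _execute_task are assumptions."""

    def __init__(self, agent_id: str, post_heartbeat: PostFn) -> None:
        self.agent_id = agent_id
        self._post_heartbeat = post_heartbeat  # injected so tests can fake the registry
        self._running: set[asyncio.Task] = set()
        self.executed: list[str] = []

    async def heartbeat_once(self) -> None:
        # The original bug: the response was discarded at this point.
        body = await self._post_heartbeat({"agent_id": self.agent_id})
        for task in body.get("pending_tasks", []):
            # Schedule rather than await, so a slow skill cannot stall heartbeats.
            job = asyncio.create_task(self._execute_task(task))
            self._running.add(job)  # hold a reference until the task finishes
            job.add_done_callback(self._running.discard)

    async def _execute_task(self, task: dict[str, Any]) -> None:
        await asyncio.sleep(0)  # stand-in for running the skill
        self.executed.append(task["id"])


async def demo() -> list[str]:
    async def fake_registry(_payload: dict[str, Any]) -> dict[str, Any]:
        # Assumed response shape; only pending_tasks is named in the text.
        return {"pending_tasks": [{"id": "t-1", "skill": "list_agents"}]}

    runner = Runner("agent-beta", fake_registry)
    await runner.heartbeat_once()
    await asyncio.gather(*runner._running)  # let the scheduled tasks finish
    return runner.executed
```

Running asyncio.run(demo()) exercises the seam that the original tests missed: the fake registry hands back a task, and the assertion of interest is that the runner actually executes it.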

Guideline: Identify seams where data crosses process boundaries. Write at least one test per seam verifying the receiving side acts on what the sending side provides.


4. Testing Async Code

Rust: #[tokio::test]

#[tokio::test] replaces #[test] for async functions and sets up a tokio runtime for the test, single-threaded by default. Use #[tokio::test(flavor = "multi_thread")] for tests that need tasks to run in parallel across worker threads.

Python: pytest-asyncio

@pytest.mark.asyncio handles async test functions. Key patterns from the skill handler tests:

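import pytest
from unittest.mock import AsyncMock, MagicMock

# Method of a test class (hence `self`); handle_delegate is imported from the
# skill module under test.
@pytest.mark.asyncio
async def test_p2p_path_fire_and_forget(self):
    mock_client = MagicMock()
    mock_client.connected = True  # plain attribute, read synchronously
    mock_client.is_peer_connected.return_value = True
    mock_client.send = AsyncMock()  # awaited by the handler

    result = await handle_delegate(
        message="do something",
        target_agent_id="agent-b",
        skill="research",
        wait=False,  # fire-and-forget path
        _mesh_client=mock_client,  # injected in place of a real MeshClient
    )
    assert "P2P" in result
    mock_client.send.assert_called_once()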

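Three patterns are at work here: MagicMock for synchronous attributes and methods, AsyncMock for awaited methods, and injection of the client through the _mesh_client parameter so the handler can be tested without a real MeshClient.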


5. Integration Test Strategies

The key skill is choosing what to mock and what to run for real. Mock too much and you have unit tests. Mock too little and you have slow, flaky E2E tests.

Strategy 1: Skill handlers with mock MeshClient. Test P2P path (peer connected, uses request() for wait=True, send() for fire-and-forget), fallback path (peer not connected, falls back to HTTP), and broadcast path. These are the highest-value tests in the Python codebase.
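A sketch of Strategy 1 covering the wait=True and fallback branches. The handle_delegate below is a hypothetical, simplified handler written for this example; the argument order of request() and send(), the _http_submit parameter, and the returned strings are assumptions, not the project's API.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def handle_delegate(message, target_agent_id, wait, _mesh_client, _http_submit):
    """Hypothetical handler mirroring the branches described above."""
    if _mesh_client.connected and _mesh_client.is_peer_connected(target_agent_id):
        if wait:
            reply = await _mesh_client.request(target_agent_id, message)
            return f"P2P reply: {reply}"
        await _mesh_client.send(target_agent_id, message)
        return "P2P: sent"
    # Peer unreachable over the DataChannel: fall back to HTTP.
    task_id = await _http_submit(target_agent_id, message)
    return f"HTTP fallback: task {task_id}"


def make_client(peer_connected: bool) -> MagicMock:
    client = MagicMock()
    client.connected = True
    client.is_peer_connected.return_value = peer_connected
    client.send = AsyncMock()
    client.request = AsyncMock(return_value="done")
    return client


async def check_paths() -> None:
    http = AsyncMock(return_value="t-42")

    # P2P with wait=True must use request(), never send().
    client = make_client(peer_connected=True)
    result = await handle_delegate("hi", "agent-b", True, client, http)
    assert result == "P2P reply: done"
    client.request.assert_awaited_once_with("agent-b", "hi")
    client.send.assert_not_awaited()
    http.assert_not_awaited()

    # A disconnected peer must go through HTTP and leave the DataChannel alone.
    client = make_client(peer_connected=False)
    result = await handle_delegate("hi", "agent-b", False, client, http)
    assert result == "HTTP fallback: task t-42"
    client.send.assert_not_awaited()
    http.assert_awaited_once_with("agent-b", "hi")
```

Asserting on what was not called is as important as asserting on what was: it is the only way to show the handler took exactly one path.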

Strategy 2: Registration with mock HTTP. Mock requests.post and verify the exact JSON payload, headers, and URL construction. Catches field name changes without a running registry.
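A sketch of Strategy 2. To stay runnable without the requests package, this version injects a post callable with the same calling shape as requests.post instead of patching the library. The register function, the /register path, and the payload fields are hypothetical placeholders.

```python
from unittest.mock import MagicMock


def register(registry_url: str, agent_id: str, api_key: str, post) -> None:
    """Hypothetical registration call; `post` has the shape of requests.post."""
    post(
        f"{registry_url.rstrip('/')}/register",  # assumed endpoint path
        json={"agent_id": agent_id, "skills": ["research"]},  # assumed fields
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=10,
    )


def test_register_sends_exact_request() -> None:
    post = MagicMock()
    register("http://registry:8080/", "agent-a", "k-123", post)
    # One call, with the exact URL, body, and headers the registry expects.
    post.assert_called_once_with(
        "http://registry:8080/register",
        json={"agent_id": "agent-a", "skills": ["research"]},
        headers={"Authorization": "Bearer k-123"},
        timeout=10,
    )
```

Because the assertion compares the whole call, renaming agent_id on only one side of the wire fails this test immediately.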

Strategy 3: Error handling. Verify that _update_task catches ConnectionError rather than propagating it — network failures should not crash the heartbeat loop.

Strategy 4: URL derivation. Test that http:// maps to ws:// and https:// maps to wss:// for signaling URLs, and that trailing slashes are stripped. These bugs would only surface at runtime when the sidecar tries to connect.
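A sketch of Strategy 4. The derive_signaling_url helper is a hypothetical name; it implements only the two rules stated above (scheme mapping and trailing-slash stripping), and rejecting other schemes is an added assumption.

```python
def derive_signaling_url(registry_url: str) -> str:
    """Hypothetical helper: map an HTTP(S) registry URL to a WebSocket URL."""
    base = registry_url.rstrip("/")
    if base.startswith("https://"):
        return "wss://" + base[len("https://"):]
    if base.startswith("http://"):
        return "ws://" + base[len("http://"):]
    raise ValueError(f"unsupported scheme in {registry_url!r}")


def test_url_derivation() -> None:
    cases = {
        "http://localhost:8080": "ws://localhost:8080",
        "https://mesh.example.com": "wss://mesh.example.com",
        "https://mesh.example.com/": "wss://mesh.example.com",
        "http://localhost:8080///": "ws://localhost:8080",
    }
    for given, expected in cases.items():
        assert derive_signaling_url(given) == expected
```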


6. CI for Multi-Language Projects

The CI pipeline (.github/workflows/ci.yml) runs on every push and PR, testing all three languages in parallel:

| Job | What it checks |
| --- | --- |
| rust-lint | cargo fmt --check + cargo clippy -- -D warnings |
| rust-test | cargo test --workspace |
| python-lint | ruff check + ruff format --check |
| python-test | uv sync --all-groups && uv run pytest -v |
| hub | tsc --noEmit + pnpm build |
| version-check | (PRs only) Fails if agent source changed without version bump |
| docker | (PRs only) Builds all three Dockerfiles without pushing |

Swatinem/rust-cache@v2 reduces Rust build times from ~3 minutes to ~30 seconds. Docker builds use GitHub Actions cache for layer sharing.

PyPI publishing uses OIDC trusted publisher — no API tokens in secrets. The workflow verifies the git tag matches pyproject.toml version. Version bump enforcement on PRs plus tag verification on publish creates a two-step safety net.


Summary

Testing distributed systems requires every pyramid level, with special attention to component boundaries.

Unit tests cover protocol structs, state machines, and security boundaries. Integration tests with mocked network layers are the best ROI — skill handler tests with mock MeshClient cover both P2P and fallback paths. E2E tests are expensive but essential; the heartbeat bug was invisible to everything else.

Async testing needs #[tokio::test] in Rust and @pytest.mark.asyncio in Python. CI for multi-language projects runs checks in parallel with caching, version enforcement, and Docker build validation.

The core lesson: in a distributed system, the most dangerous bugs hide at the seams between components. Push seam-crossing tests into the integration tier, and use E2E as the final safety net.


Previous: Lesson 16: Architecture Decision Records
