Building a Multi-Agent AI System for Business
Most multi-agent demos are toys. Here's the architecture of one that runs a real company — shipping products, making sales, and operating 18 hours a day.
By Victor Novikov · April 4, 2026
If you search for multi-agent AI systems, you'll find research papers about debate protocols, AutoGen demos with three chatbots talking to each other, and LangGraph tutorials that never leave a Jupyter notebook.
None of that tells you how to build a multi-agent system that runs a business.
We have one. Three agents — CEO, CTO, CMO — operating a company with two shipped products and real revenue. Here's the architecture.
The topology: star, not mesh
The first decision is how agents communicate. Research systems often use mesh topologies where every agent talks to every other agent. In practice, this creates chaos: three agents in a mesh means six directed communication channels (counting each direction separately); five agents means twenty.
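The channel counts above fall out of simple combinatorics. A minimal sketch (function names are illustrative):

```python
def mesh_channels(n: int) -> int:
    """Directed channels in a full mesh: every ordered pair of agents."""
    return n * (n - 1)

def star_channels(n: int) -> int:
    """Directed channels in a star: one spoke to and from the hub per agent."""
    return 2 * (n - 1)

print(mesh_channels(3), mesh_channels(5))  # 6 20 — mesh grows quadratically
print(star_channels(3), star_channels(5))  # 4 8  — star grows linearly
```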
We use a star topology with the CEO agent at the center. Cordy (CEO) receives all reports, assigns all work, and resolves conflicts. The CTO and CMO talk to each other only for direct handoffs — "here's the blog post you asked me to build" — not for strategic discussion.
Why this works:
- Single point of coordination prevents conflicting priorities
- CEO agent has full context on everything happening in the company
- Adding a new agent means one new spoke to the hub, not a channel to every existing agent
- Mirrors how small teams actually work (founder coordinates, specialists execute)
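The routing rule can be made explicit. This is a hypothetical sketch, not our production config — agent names and the direct-handoff allowlist are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class StarRouter:
    """All strategic messages flow through the hub (CEO) agent;
    only whitelisted pairs may exchange direct artifact handoffs."""
    hub: str = "ceo"
    direct_handoffs: set = field(
        default_factory=lambda: {("cto", "cmo"), ("cmo", "cto")}
    )

    def route(self, sender: str, recipient: str, kind: str) -> list:
        """Return the delivery path for a message."""
        if sender == self.hub or recipient == self.hub:
            return [sender, recipient]
        if kind == "handoff" and (sender, recipient) in self.direct_handoffs:
            return [sender, recipient]          # direct artifact handoff
        return [sender, self.hub, recipient]    # strategy goes via the hub

router = StarRouter()
print(router.route("cto", "cmo", "handoff"))   # ['cto', 'cmo']
print(router.route("cto", "cmo", "strategy"))  # ['cto', 'ceo', 'cmo']
```

Adding a fourth agent means one new entry in the hub's roster, and optionally one allowlisted handoff pair — the linear growth the list above describes.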
The handoff protocol
Agent-to-agent communication needs structure. Free-form messages lead to dropped context and ambiguous assignments. We use a structured protocol with four message types:
- HANDOFF — initiating a task. Includes: from, to, task ID, priority, summary, context, deadline, and explicit "done when" criteria.
- ACK — accept or reject within 2 minutes. A rejection must state what's needed to proceed.
- DONE — completion. Includes evidence (commit hash, URL, test result) and any risks or notes.
- BLOCKED — can't proceed. States blocker, impact, options with tradeoffs, and a recommendation.
Every handoff includes done_when criteria — specific, verifiable conditions. Not "build the blog page" but "blog page returns HTTP 200 at /blog/post-slug with Article JSON-LD and a link to /checkout."
This eliminates the most common multi-agent failure mode: agents that say "done" when they've only partially completed the work.
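A HANDOFF message can be as simple as a typed record. The field names below mirror the protocol described above; the exact keys and the sample task are illustrative, not our production schema:

```python
from dataclasses import dataclass

@dataclass
class Handoff:
    """Structured task handoff with explicit completion criteria."""
    from_agent: str
    to_agent: str
    task_id: str
    priority: str
    summary: str
    context: str
    deadline: str
    done_when: list  # specific, verifiable conditions — never "it's built"

handoff = Handoff(
    from_agent="ceo",
    to_agent="cto",
    task_id="T-041",
    priority="high",
    summary="Ship the blog post page",
    context="Part of the content-to-checkout funnel",
    deadline="2026-04-05",
    done_when=[
        "GET /blog/post-slug returns HTTP 200",
        "Page includes Article JSON-LD",
        "Page links to /checkout",
    ],
)
```

The matching DONE reply carries one piece of evidence per `done_when` entry, which is what makes "done" checkable rather than claimed.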
Shared state: files over databases
Agents need a shared understanding of what's happening in the company. We use a shared filesystem:
- PROJECTS.md — single source of truth for all project state. What's in progress, what's shipped, what's blocked, who owns what.
- SIGNALS.md — strategic intelligence. Market findings, competitive insights, patterns that should influence priorities.
- THESIS.md — the business north star. Read-only, human-written. Keeps all agents aligned on what the company is building and why.
- FEEDBACK-LOG.md — lessons from corrections. When an agent makes a mistake and gets corrected, the lesson goes here so all agents learn from it.
Why files instead of a database? Because LLMs can read and write markdown natively. No ORM, no schema migrations, no API layer. An agent reads PROJECTS.md at session start and knows the full state of the company in 500 milliseconds.
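The session-start read is correspondingly trivial. A minimal sketch, assuming the four shared files sit in one workspace root (file names are from the list above; the function is illustrative):

```python
from pathlib import Path

SHARED_FILES = ["PROJECTS.md", "SIGNALS.md", "THESIS.md", "FEEDBACK-LOG.md"]

def load_shared_state(root: Path) -> dict:
    """Read every shared markdown file into the agent's context.
    Missing files yield empty strings so a fresh workspace still loads."""
    return {
        name: (root / name).read_text() if (root / name).exists() else ""
        for name in SHARED_FILES
    }
```

The returned dict is the agent's entire view of company state — no schema, no API layer, just markdown the model reads directly.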
Session isolation and concurrency
Each agent runs in its own session with its own context. This is important — sharing a session between agents creates confusion about whose instructions to follow.
But isolation creates a concurrency problem: what if two agents update PROJECTS.md at the same time? In practice, this is rare because the shared files change slowly (project state doesn't flip every minute). When it does happen, the next agent to read the file gets the latest version, and stale state self-corrects within one heartbeat cycle.
We chose simplicity over correctness here. No distributed locks. No CRDT. If an occasional stale read costs us a wasted agent cycle, that's cheaper than the engineering complexity of a proper distributed system.
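One cheap refinement keeps last-writer-wins from producing torn reads: write to a temp file, then rename over the target, which is atomic on POSIX filesystems. A sketch of that pattern (the function name is illustrative):

```python
import os
import tempfile
from pathlib import Path

def write_shared_file(path: Path, content: str) -> None:
    """Last-writer-wins update that readers never see half-written.
    Write to a temp file in the same directory, then atomically
    rename it over the target."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())  # content on disk before the swap
        os.replace(tmp, path)     # atomic: readers get old or new, never torn
    except BaseException:
        os.unlink(tmp)
        raise
```

Concurrent writers still race — the last one wins — but an agent mid-read never sees a half-written PROJECTS.md, which is the failure that actually wastes cycles.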
The failure modes you'll hit
Multi-agent systems fail in specific, predictable ways:
- Duplicate work — two agents pick up the same task. Fix: clear ownership in PROJECTS.md before starting work.
- Context drift — an agent's memory diverges from reality because it hasn't re-read shared state. Fix: mandatory re-reads at session start and after every heartbeat.
- Escalation loops — Agent A asks Agent B for help, Agent B asks Agent A. Fix: star topology with clear authority hierarchy.
- Premature "done" — agents mark tasks complete without verification. Fix: DONE messages require evidence, and the CEO agent spot-checks.
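The spot-check for premature "done" can itself be code. A hypothetical verifier for the blog-page `done_when` example given earlier — in production you would fetch the live URL; here the evidence is passed in so the check is self-contained:

```python
def verify_blog_done(html: str, status: int) -> list:
    """Check a DONE claim against evidence instead of trusting it.
    Returns a list of failed criteria; empty means verified."""
    failures = []
    if status != 200:
        failures.append(f"expected HTTP 200, got {status}")
    if '"@type": "Article"' not in html and '"@type":"Article"' not in html:
        failures.append("missing Article JSON-LD")
    if 'href="/checkout"' not in html:
        failures.append("missing link to /checkout")
    return failures
```

If the list is non-empty, the CEO agent bounces the task back with the failed criteria — turning "premature done" from a silent failure into a cheap, automatic rejection.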
You won't prevent all of these. The goal is to make them cheap to detect and fix. A wasted agent cycle costs $0.10. A wasted human hour costs $100+.
Scaling: when to add agents
Don't start with three agents. Start with one. Get it reliable. Add a second when you have a clear role separation (the CTO agent shouldn't also be doing marketing).
The signal to add an agent: your existing agents are spending significant time on work that's outside their core competency, and that work is well-defined enough to spec.
We ran with two agents (CEO + CTO) for the first month. Added the CMO agent when we had two products to market and the CEO was spending 60% of its cycles on distribution tasks instead of strategy.
The full architecture — spec layer templates, handoff protocol, shared state system, and governance model — is documented in The Zero Employee Guide. Chapter 1 covers the thesis and core architecture. Free to read.
Build the system, not just the agents
11 chapters. Real templates. Production configs. $29 one-time.