Multi-agent systems were one of the dominant architectural patterns in 2025. Most of them were poorly designed. I have reviewed enough of them to recognise the failure modes by silhouette. Here are the ones I see most often, and what to do about them.

The agent that should have been a function call

The most common anti-pattern is wrapping a deterministic operation in an LLM-driven agent. A team needs to call an external API, parse the response, and update a database field. They build an "agent" that decides whether to call the API and another agent that decides how to update the field.

The decision points are illusory. The work is deterministic. The LLM adds latency, cost, and a non-determinism budget that buys nothing. The right answer is a function call wrapped in a try-catch, not an agent.

The diagnostic question: for each agent in the system, could I write the deterministic version in a few hours? If yes, I should write it. The LLM should be reserved for genuine interpretation steps, not for control flow that has a clear specification.
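As a rough illustration, the fetch, parse, and update example above fits in one short function. This is a minimal sketch, not a real integration: `sync_status`, the injected `fetch` callable, the `status` field, and the dict standing in for a database are all hypothetical.

```python
import json
from typing import Callable


def sync_status(record_id: str, fetch: Callable[[str], str], db: dict) -> bool:
    """Fetch a record, parse it, and update one field.

    `fetch` stands in for the external API call and `db` for the database;
    both are assumptions made for this sketch.
    Returns True on success and False on a handled failure.
    """
    try:
        payload = json.loads(fetch(record_id))
        db[record_id] = payload["status"]
        return True
    except (KeyError, TypeError, ValueError, OSError):
        # Missing field, wrong payload shape, bad JSON, or transport error:
        # leave the record untouched and report the failure to the caller.
        return False
```

There is no decision point here for a model to own. Every branch is specified, so every branch can be tested.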

The agents that talk to each other forever

Two agents are tasked with negotiating a solution. Agent A proposes. Agent B critiques. Agent A revises. Agent B critiques again. The conversation continues until a token budget runs out. The output is sometimes good. Often it is a slightly worse version of what either agent could produce alone.

The problem is that the negotiation has no clear termination condition. Without an external check, the agents will continue to disagree on cosmetic issues. The system spends compute without converging.

The fix is to bound the negotiation. Allow at most one round of critique. Require the second agent to either approve or reject with a specific reason. Have a third, deterministic process arbitrate. Unbounded multi-turn negotiation is almost always worse than a constrained interaction.
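One way to sketch the bounded version is below. The agents are passed in as plain callables, and the names `Verdict`, `negotiate`, and `arbiter` are hypothetical. Treating a rejection that carries no reason as an approval is my assumption, not a rule from any framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    approved: bool
    reason: str = ""  # expected to be specific when approved is False


def negotiate(
    task: str,
    propose: Callable[[str], str],
    critique: Callable[[str, str], Verdict],
    revise: Callable[[str, str, str], str],
    arbiter: Callable[[str], bool],
) -> str:
    """Run exactly one critique round, then let a deterministic check decide."""
    draft = propose(task)
    verdict = critique(task, draft)  # the only critique call in the whole exchange
    if verdict.approved or not verdict.reason.strip():
        # Approved, or rejected without an actionable reason: keep the draft.
        return draft
    revision = revise(task, draft, verdict.reason)
    # The arbiter is ordinary code (a test suite, a schema check, a linter),
    # so the exchange always terminates after a fixed number of model calls.
    return revision if arbiter(revision) else draft
```

The worst case is three model calls and one deterministic check, which makes the cost of the interaction predictable.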

The orchestrator that does not understand the tools

A single orchestrator agent is given a long list of tools and asked to choose. As the tool list grows past five or six, the orchestrator's selection accuracy falls quickly. By ten or twelve tools, the system makes wrong tool choices on a significant fraction of inputs.

The cause is usually that the tool descriptions are ambiguous, the tools have overlapping capabilities, or the orchestrator has not been given enough context about the input to choose well.

The fixes are direct. Reduce the tool list to the few that actually apply to the orchestrator's scope. Group tools into domains and add a routing step that selects a domain before selecting a tool within it. Improve the tool descriptions with concrete examples of when each tool is and is not appropriate.

A system with twelve tools should rarely be a single-orchestrator system. It is usually better built as two or three orchestrators with three or four tools each.
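The domain-then-tool routing can be sketched as two small selections instead of one large one. The domains, the tool names, and the `choose` callable (which stands in for whatever model call does the selecting) are all hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical tool inventory, grouped so each selection sees only a few options.
TOOLS_BY_DOMAIN = {
    "billing": ["create_invoice", "refund_payment", "lookup_charge"],
    "accounts": ["reset_password", "update_email", "close_account"],
    "shipping": ["track_parcel", "change_address", "schedule_pickup"],
}

# A chooser takes the request and a short list of options and returns one option.
Chooser = Callable[[str, Sequence[str]], str]


def route(request: str, choose: Chooser) -> tuple[str, str]:
    """Pick a domain first, then a tool inside that domain."""
    domain = choose(request, list(TOOLS_BY_DOMAIN))
    if domain not in TOOLS_BY_DOMAIN:
        raise ValueError(f"chooser returned unknown domain: {domain!r}")
    tools = TOOLS_BY_DOMAIN[domain]
    tool = choose(request, tools)
    if tool not in tools:
        raise ValueError(f"chooser returned unknown tool: {tool!r}")
    return domain, tool
```

Each call to `choose` sees three options rather than nine, and the membership checks turn a hallucinated tool name into an explicit error rather than a silent wrong call.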

The agent that does not check its own outputs

An agent generates an output and ships it to a downstream system without validation. The downstream system has its own schema. The agent's output sometimes does not conform. Errors appear in production at low rates and are hard to debug.

The fix is to validate every agent output against a schema before it leaves the agent boundary. If the output does not conform, the agent retries with a corrective prompt. If the retry also fails, the agent escalates to a human or to a fallback path. The schema is the contract between the agent and the rest of the system. Without it, the agent is producing output that the system cannot rely on.
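A minimal sketch of that boundary, using only the standard library. The contract in `SCHEMA`, the `Escalation` exception, and the wording of the corrective prompt are assumptions for illustration. A real system would more likely use a dedicated schema validator.

```python
import json
from typing import Callable

# Hypothetical contract: required field name -> required Python type.
SCHEMA = {"order_id": str, "amount_cents": int}


class Escalation(Exception):
    """Raised when the agent cannot produce conforming output."""


def validate(raw: str) -> dict:
    """Parse the agent output and check it against SCHEMA (types only)."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, kind in SCHEMA.items():
        if not isinstance(data.get(name), kind):
            raise ValueError(f"field {name!r} must be {kind.__name__}")
    return data


def run_validated(prompt: str, agent: Callable[[str], str], max_retries: int = 1) -> dict:
    """Call the agent, validate, retry with a corrective prompt, then escalate."""
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        try:
            return validate(agent(attempt_prompt))
        except ValueError as err:
            # Feed the specific validation error back to the agent.
            attempt_prompt = (
                f"{prompt}\n\nYour previous output was rejected: {err}. "
                "Return only JSON that matches the schema."
            )
    raise Escalation(f"no conforming output after {max_retries + 1} attempts")
```

The caller catches `Escalation` and hands the request to a human or a fallback path, so nothing non-conforming crosses the agent boundary.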

The agent that is graded only by other agents

LLM-as-judge evaluation is a useful tool. It is not a substitute for human evaluation on the cases that matter. Teams that grade their multi-agent systems entirely with LLM-as-judge build systems that pass the judge and fail the user.

The pattern is to use LLM-as-judge for high-volume regression testing and human evaluation on a stratified sample of production traffic for ground-truth calibration. The two techniques complement each other. Either one alone is insufficient.
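The sampling and calibration steps can be sketched as follows. The stratum key (an intent label, for instance), the per-stratum quota, and the use of a simple agreement rate as the calibration measure are assumptions. The right strata and metric depend on the product.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, Sequence


def stratified_sample(
    records: Sequence[dict],
    key: Callable[[dict], Hashable],
    per_stratum: int,
    seed: int = 0,
) -> list:
    """Draw up to `per_stratum` records from every stratum for human review."""
    rng = random.Random(seed)  # seeded so the review set is reproducible
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        # Rare strata contribute everything they have; common ones are capped.
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


def agreement_rate(pairs: Sequence[tuple]) -> float:
    """Fraction of (judge_label, human_label) pairs where the two agree."""
    if not pairs:
        raise ValueError("need at least one labelled pair")
    return sum(judge == human for judge, human in pairs) / len(pairs)
```

Sampling per stratum keeps rare but costly request types in front of reviewers, where uniform sampling would mostly return the common case. A falling agreement rate is a signal to stop trusting the judge on that stratum.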

The system that has no observability

A multi-agent system in production should produce a trace of each request that includes the agents invoked, the tools called, the prompts used, the model versions, the latency at each stage, and the cost at each stage. Without this trace, debugging production issues is guesswork.
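A trace of that shape does not need much machinery. The sketch below is a hand-rolled record, not the API of any tracing tool. `Span`, `Trace`, and the convention that each agent call returns `(output, tools_called, cost_usd)` are hypothetical.

```python
import time
from dataclasses import asdict, dataclass, field
from typing import Callable


@dataclass
class Span:
    agent: str
    model_version: str
    prompt: str
    tools_called: list = field(default_factory=list)
    latency_s: float = 0.0
    cost_usd: float = 0.0


@dataclass
class Trace:
    request_id: str
    spans: list = field(default_factory=list)

    def record(self, agent: str, model_version: str, prompt: str,
               call: Callable[[str], tuple]) -> str:
        """Run one agent call and append a span describing it.

        `call` is assumed to return (output, tools_called, cost_usd).
        """
        start = time.perf_counter()
        output, tools, cost = call(prompt)
        latency = time.perf_counter() - start
        self.spans.append(Span(agent, model_version, prompt, list(tools), latency, cost))
        return output

    def total_cost(self) -> float:
        return sum(span.cost_usd for span in self.spans)
```

Calling `asdict(trace)` gives a plain dictionary that can be logged as JSON, which is enough to answer which agent ran, with what prompt and model version, how long it took, and what it cost.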

A surprising number of multi-agent systems ship without adequate tracing. The team can tell that something is wrong in production. They cannot tell which agent is wrong, or why. The mean time to resolution stretches out and trust in the system erodes.

This is not exotic infrastructure. There are open-source tracing tools that handle the agent case adequately. The investment is small. The cost of not having it is large.

What good looks like

A multi-agent system that works in production tends to share a few properties. The agents are few in number. Each agent has a narrow, well-specified job. The orchestration is mostly deterministic, with the LLM used in clearly bounded interpretive roles. The outputs are validated against schemas. The system is observable. Human review is in the loop where the failure cost warrants it.

This is less ambitious than the marketing of multi-agent systems suggests. It is also what ships.