2025 proved that AI agents can write code, build features, and ship software autonomously. Teams at Anthropic, OpenAI, and across the industry demonstrated that a single agent session could produce meaningful working software. The question in 2026 is no longer whether agents work. It is whether they work reliably — across multi-hour sessions, at scale, with quality that meets production standards. That is the problem harness engineering was built to solve.
What Is a Harness?
A harness is a structured multi-agent framework that orchestrates specialised AI agents to collaboratively solve long-running, complex tasks. Rather than pushing a single agent harder and hoping it maintains coherence over a multi-hour session, a harness decomposes the work across agents with distinct roles: planning, generating, and evaluating. Each agent is optimised for its specific function, and the harness coordinates how work is handed off from one to the next.
Both Anthropic and OpenAI have published detailed engineering writeups on how they built harness systems for internal development. The patterns they describe are strikingly similar, arrived at independently, which suggests these are real solutions to real structural problems — not design choices that could have gone either way.

Why Single Agents Fail at Long Tasks
To see why harnesses work, it helps to understand what breaks when you run a single agent on an extended task. The Anthropic engineering team identified three consistent failure modes.
The three failure modes of long-running single agents:
- Context window degradation — Models lose coherence as context fills up, exhibiting what the team calls 'context anxiety': prematurely wrapping up tasks and presenting incomplete work as finished.
- Self-evaluation bias — Agents confidently overrate their own output. A generator asked to evaluate its own design or code quality will consistently give itself high marks, producing mediocre work that appears polished on the surface.
- Scope under-estimation — Without external guidance, agents under-scope tasks and miss important features. The agent solves the stated problem but not the real problem.
These are not bugs to be patched with better prompting. They are structural properties of how large language models work. A harness addresses them architecturally rather than fighting them with instructions.
The Three-Agent Architecture
The Anthropic harness that powered large-scale internal development implements three specialised agents, each with a clearly scoped role.
The three agents and their roles:
- Planner — Takes a high-level prompt and expands it into a detailed product specification with ambitious scope. The Planner's job is to push back against under-scoping before a single line of code is written.
- Generator — Implements features iteratively from the planning spec. Works across the full stack (React, Vite, FastAPI, and PostgreSQL in Anthropic's case). Performs an initial self-evaluation before hand-off, but is not expected to catch everything.
- Evaluator — Tests the running application using Playwright MCP. Catches bugs and design flaws. Grades work against explicit, measurable criteria. Provides structured feedback that the Generator uses in the next iteration.
Separating the agent doing the work from the agent judging it proves to be a strong lever. Rather than trying to make generators more self-critical, you separate these concerns entirely — allowing evaluators to be calibrated with independent skepticism.
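The plan-generate-evaluate loop described above can be sketched in a few lines. This is a toy: the three agent roles are simulated with canned responses so the control flow is runnable, and every name here (run_harness, the PASS marker, the iteration cap) is an illustrative assumption rather than Anthropic's actual API — in a real harness each role would be an LLM call.

```python
# Toy harness loop: plan once, then iterate generate -> evaluate until the
# independent evaluator accepts the work. Roles are stubbed for illustration.

def planner(task: str) -> str:
    # Push back against under-scoping: expand the task into an ambitious spec.
    return f"SPEC: {task} + error handling + tests + docs"

def generator(spec: str, feedback: str) -> str:
    # Implement iteratively; fold in the evaluator's last critique.
    return f"BUILD({spec})" + (f" fixed[{feedback}]" if feedback else "")

def evaluator(build: str) -> str:
    # Independent judge: grades the work, never trusts the generator's self-report.
    return "PASS" if "fixed" in build else "FAIL: missing edge-case handling"

def run_harness(task: str, max_iterations: int = 5) -> str:
    spec = planner(task)                    # 1. plan once, ambitiously
    feedback = ""
    for _ in range(max_iterations):
        build = generator(spec, feedback)   # 2. generate or revise
        feedback = evaluator(build)         # 3. external critique
        if feedback == "PASS":
            return build
    return build

result = run_harness("todo app")
print(result)
```

The key structural point is that `evaluator` never shares state with `generator`: acceptance is an external gate, not self-assessment.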

The GAN Insight: Adversarial Quality
The generator-evaluator separation in a harness directly mirrors Generative Adversarial Networks from machine learning. In a GAN, a generator and a discriminator are trained in opposition: the generator learns to produce outputs that fool the discriminator; the discriminator learns to catch the generator's failures. The tension between them drives quality far beyond what either could achieve alone.
A harness applies this same principle to AI engineering. The Generator is motivated to produce working software. The Evaluator is calibrated to be skeptical and find flaws. Neither is trying to collaborate in a comfortable way — they are in productive tension. The result is iterative improvement through external critique rather than self-evaluation, which the engineering teams found to be dramatically more effective.
Context Management: Reset Beats Compaction
One of the most counterintuitive findings from the Anthropic team's experimentation: context resets proved superior to context compaction for maintaining coherence in long sessions. Rather than summarising earlier conversation history and appending it to the context window, they found it more effective to clear the context entirely and let each agent pick up work by reading structured file artifacts.
This is the file-based communication pattern: agents communicate through structured files rather than message passing. Each agent can start fresh — no prior conversation history needed — because the previous agent's work is captured in artifacts. The Generator reads the Planner's spec. The Evaluator reads the running application's test results. Clean handoffs without context accumulation.
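A minimal sketch of the file-based handoff, assuming JSON artifacts on disk. The directory layout, file names, and spec fields are illustrative assumptions; the point is only that a freshly reset agent needs nothing but the artifact file to resume the work.

```python
# File-based handoff: each agent session starts with a clean context and
# reads the previous agent's output from a structured artifact on disk.

import json
from pathlib import Path

ARTIFACTS = Path("artifacts")

def write_artifact(name: str, payload: dict) -> None:
    # An agent ends its session by persisting its output as a file.
    ARTIFACTS.mkdir(exist_ok=True)
    (ARTIFACTS / name).write_text(json.dumps(payload, indent=2))

def read_artifact(name: str) -> dict:
    # The next agent starts cold: this file replaces conversation history.
    return json.loads((ARTIFACTS / name).read_text())

# Planner session ends: spec written to disk, context reset entirely.
write_artifact("spec.json", {"features": ["auth", "task list", "search"]})

# Generator session begins fresh: no prior messages carried over.
spec = read_artifact("spec.json")
print(spec["features"])
```

Because the handoff is a file rather than a transcript, context never accumulates across agents, which is exactly what makes resets viable.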
Results at Scale
The deployment of Anthropic's harness architecture demonstrated what reliable multi-agent systems can achieve: repositories containing on the order of one million lines of code across application logic, infrastructure, tooling, and documentation, with approximately 1,500 pull requests opened and merged — all achieved by a team of three engineers orchestrating the system.
OpenAI's Harness system, deployed internally for engineering work with Codex, reported similar leverage: small teams producing output at a scale that would have required significantly larger teams working manually.
What This Means for How You Build
Harness engineering is not about raw AI capability. It is about systems design — structuring the environment around agents to compensate for their structural weaknesses and amplify their strengths. If you are building production AI systems in 2026, the question is not which model to use. It is how to design the harness around it.
The practical takeaways from both the Anthropic and OpenAI engineering writeups: separate generation from evaluation; use file artifacts for context-free handoffs; make evaluation criteria explicit and measurable; and prefer context resets over context accumulation for long sessions. These are not theoretical recommendations — they are the outputs of engineering teams who ran the experiments and measured the results.
2025 was agents. 2026 is the year we learn to engineer around them.



