CONTEXT × HARNESS / 2026

Don't build an AI that replays yesterday's spec — the gap between spec and source of truth is the real context

More and more often, an AI agent's accuracy is decided by its context, not its prompting. But "context" here is not a polished spec. What really moves the needle is the gap between the spec and the source of truth, and the reasons behind the drift.

An AI fed only the spec replays "past truth." Feed it the drift reasons too, and it approaches "today's truth." The blind spot of Spec-Driven Development and the real core of Harness Engineering, laid out.

Context engineering · Harness Engineering · Source of Truth · AI agent · SDD · Issue Driven · 2026.05.20 · 7 min read

FIG.0 — THE GAP

The spec (green, dashed) sits still while the Source of Truth (pink, curve) — the running code, DB, API, field judgement — drifts downward over time. The GAP (the unexplained delta) in between is where the real context lives.

▍ THE PROMISE

What separates AI agents in 2026 is no longer the model or the prompt, and it is not the spec either. It is the gap between the spec and reality, and whether the "reasons for the drift" are accumulated alongside it.

▍ TL;DR

The frontier of AI-agent accuracy has shifted: model → prompt → context.
If you mistake "context" for a polished spec, the AI just replays "past truth." Specs drift further from the Source of Truth (running code, ops, field judgement) as time passes.
What actually works is accumulating the reasons for the drift. Five whys (why the spec was changed, why an exception was allowed, why the implementation compromised, why the issue went the way it did, why the review came out that way) decide the quality of the AI's output.
Documents are polished; context is accumulated. Put the spec at the core of the Harness, and layer the drift reasons around it. The right AI is not the loud one — it's the one that lowers the human's verification load.

§ 01 SHIFT

From prompt-craft to context design

For a few years now, the lever that moves AI-agent quality has been moving:

Up to 2023: raw model strength dominated (GPT-3.5 → 4 → Claude 3 → …)
2023-2024: same model, different wording produced different results — "prompt engineering" boomed
2025-: raw model quality commoditised; context (what you feed in) became the dominant variable

By 2026 the gap between frontier models is closing fast. With Claude Opus 4.7 / GPT-5.5 / grok-4.3 at the top, "what the model knows" matters less than "what you put in front of it for this task." Welcome to the era of context design.

▍ What is ‘context’ really?

"Context" here is not just "input text." It's the whole substrate of judgement material — background, history, contradictions, hesitations, wobbles. The same layer I called "tacit thoughts" in the previous post.

§ 02 GAP

Spec vs Source of Truth — the gap is inevitable

Say "context" and most people think of the spec or design document. This article takes a clear position against that: feeding only a spec as context is almost always wrong.

> 2-1 Spec = "what should be." Source of Truth = "what is."

The spec describes what should be. A snapshot of agreement at a moment, internally coherent, neatly polished.

As implementation and operations evolve, the actual "truth" drifts elsewhere:

The running code — hard-coded values, exception handlers, commented-out branches
The DB schema and the live data — migration history, unexpected records, exceptional values
The actual API behaviour — undocumented responses, unofficial endpoints
Customer-side operating decisions — approval routes never written down, tacit exceptions
Field judgement — choices an operator made on the spot

These are the Source of Truth (SoT). The spec inevitably drifts away from the SoT over time. This is not laziness — it's structural. Requirements change, exceptions happen, implementations compromise.

> 2-2 The gap = an unexplained delta

The problem is not that the gap exists. It's that the gap is never explained. The spec says "what should be," the code says "how it runs" — but the bit in between, "why it diverged," lives nowhere.
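One way to make the "unexplained delta" concrete is the minimal sketch below. It assumes the spec and the observed behaviour can each be flattened into key/value pairs and that drift reasons are stored in a separate mapping. The names `Gap`, `find_gaps`, `spec`, `observed`, and `reasons`, along with the sample values, are hypothetical and not taken from any real system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gap:
    key: str
    spec_value: object
    observed_value: object
    reason: Optional[str]  # None means the delta is unexplained

    @property
    def explained(self) -> bool:
        return self.reason is not None


def find_gaps(spec: dict, observed: dict, reasons: dict) -> list[Gap]:
    """Return every key where the observed value differs from the spec."""
    gaps = []
    for key in sorted(spec.keys() | observed.keys()):
        spec_value = spec.get(key)
        observed_value = observed.get(key)
        if spec_value != observed_value:
            gaps.append(Gap(key, spec_value, observed_value, reasons.get(key)))
    return gaps


# Hypothetical values, for illustration only.
spec = {"approval_steps": 2, "retry_limit": 3}
observed = {"approval_steps": 1, "retry_limit": 5}
reasons = {"retry_limit": "Raised after upstream timeouts; see ops note"}

gaps = find_gaps(spec, observed, reasons)
unexplained = [g.key for g in gaps if not g.explained]
print(unexplained)  # ['approval_steps']
```

Both keys have drifted, but only `approval_steps` has no recorded reason. In this framing the drift itself is expected; the entry without a reason is the one that costs the AI (and the next human) accuracy.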

▍ Doc decay, or context?

Most organisations treat this gap as "document decay" and pour effort into "keeping the spec in sync with the truth." That work is futile — the moment you polish it, drift resumes.
This article takes the opposite stance: keep the gap as something to be explained, not eliminated.

§ 03 LOSS

An AI fed only the spec replays "past truth"

Feed only the spec to an AI, and its output will faithfully replay "the past consensus."

FIG.1 — SPEC-ONLY VS SPEC + GAP

An AI fed only the spec returns "past truth." An AI fed the spec plus code, issues, reviews, ops notes, and drift reasons approaches "today's truth." What differs is not the model, not the prompt — only the thickness of the context.

> 3-1 Typical failures of a spec-only AI

"The spec says X is correct, but the code shows Y." → The AI trusts the spec, returns X, and drifts from reality.
"The spec has no exception handling, so edge cases can be ignored." → Operationally impossible — a misjudgement.
"I implemented per the latest API docs." → The unofficial operating rules get missed.

This is not the AI's fault. The context you fed it is frozen at a point in time, and the AI is faithful to that point. The cleaner the spec, the more confidently the AI quotes "past truth."

> 3-2 Reverse-engineering alone is not enough either

"So just give it the code, not the spec?" That fails too. Code reveals "what is implemented and how," but never "why it became that." Same structure as the tacit knowledge / tacit thoughts argument from the previous post — outputs alone can't reproduce the context of judgement.

▍ The link to tacit knowledge / tacit thoughts

"Why was it written this way?" "Why was this exception allowed?" — these are expert tacit knowledge, or the tacit thoughts just upstream of it. What sinks into the spec/SoT gap is, essentially, this layer of knowledge.

§ 04 WHY

Five whys to accumulate — that's strong context

What should you accumulate, then? This article proposes five "whys" to keep deliberately.

FIG.2 — FIVE WHYS

Five "whys" that explain the gap between spec and Source of Truth. The central GAP (the reasons for the drift) is filled by five surrounding assets: change log, ops log, code comments, issues, and PR reviews.

> 4-1 ① Why was the spec changed?

The motivation behind a spec change: "Customer fed back X," "the upstream premise broke," "a different problem surfaced once we built it." These reasons sit somewhere in change logs, meeting notes, or Slack history. If they disappear, neither the AI nor the next human can answer "why the old spec was discarded."

> 4-2 ② Why was the exception allowed?

An operational decision that "this is off-spec, but we'll allow it": approval-rule exceptions, customer-specific carve-outs, emergency manual workarounds. These are usually never documented, yet "the impossible case is now business as usual" is more common than people admit.

> 4-3 ③ Why was the implementation compromised?

Compromises made during implementation: "Should be X, but a legacy constraint forced Y," or "we dropped the edge case for performance reasons." These can land in PR comments or code comments, but never in a polished spec. The "reason for the compromise" is exactly the judgement material you'll need when the spec changes next.

> 4-4 ④ Why was it argued this way in the issue?

The shape of the discussion in issues and discussion threads: not just the final verdict, but the "rejected alternatives," "premises debated," and "trade-offs that produced the agreement." This echoes Karpathy's LLM Wiki philosophy: keep the discussion, not just the conclusion.

> 4-5 ⑤ Why did the review come out this way?

PR review comments: the reviewer's concerns, objections, compromises, and reasons for approving. Nobody re-reads review history after merge, but if it persists, the AI can "reproduce the same kind of judgement elsewhere."
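The five "whys" above could be kept as structured records rather than loose notes. The following is a minimal sketch under that assumption; the `Why` enum, the `DriftReason` fields, and the `coverage` helper are hypothetical names chosen for illustration, not a schema the article prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Why(Enum):
    SPEC_CHANGE = "why the spec was changed"
    EXCEPTION = "why the exception was allowed"
    COMPROMISE = "why the implementation was compromised"
    ISSUE_DISCUSSION = "why it was argued this way in the issue"
    REVIEW = "why the review came out this way"


@dataclass(frozen=True)
class DriftReason:
    why: Why
    source: str    # free text, e.g. "change log", "ops log", "PR review"
    summary: str   # what was decided
    rejected: tuple[str, ...] = ()  # alternatives considered and dropped


def coverage(records: list[DriftReason]) -> dict[Why, int]:
    """Count records per 'why' so that empty categories are visible."""
    counts = {why: 0 for why in Why}
    for record in records:
        counts[record.why] += 1
    return counts


# Hypothetical records, for illustration only.
records = [
    DriftReason(Why.COMPROMISE, "code comment",
                "Edge case dropped for performance",
                rejected=("Handle it synchronously",)),
    DriftReason(Why.REVIEW, "PR review", "Approved with a follow-up issue"),
]
missing = [why.name for why, n in coverage(records).items() if n == 0]
print(missing)  # ['SPEC_CHANGE', 'EXCEPTION', 'ISSUE_DISCUSSION']
```

The `rejected` field is there because the argument in ④ is about keeping the discarded alternatives, not only the verdict. The `coverage` count is one possible way to notice which of the five assets a team is not yet capturing.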

▍ The "five whys" are SECI externalisation

Keeping these "whys" is exactly the Externalisation step in Nonaka's SECI model. The twist: you're externalising the process, not the conclusion. That's how judgement patterns become reproducible in other contexts.

§ 05 PRINCIPLE

Documents are polished; context is accumulated

"Keep the five whys" sounds like "write more docs" — but it isn't. The thing you polish and the thing you accumulate are different objects.

FIG.3 — DOCUMENTS VS CONTEXT

The same word "documents" splits into two distinct kinds: polished documents (for humans) and accumulated context (for AI). Push for coherence and you cut out the wobbles and hesitations — the context loses thickness.

> 5-1 Documents: polished, for humans

Proposals, final specs, articles, reports, manuals. Polished for humans (clients or readers). Coherence matters; contradictions are stripped. The value is "readability" and "clarity of conclusion."

> 5-2 Context: accumulated, for AI

Issues, discussion, PR reviews, ops notes, failure logs, drift reasons, rough notes from before verbalisation. Accumulated for the AI agent. Keep the contradictions, the wobbles, the hesitations. Thickness of judgement material matters more than coherence.
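The difference between polishing and accumulating can be shown with a small append-only store. This is a sketch only: `ContextLog` and its methods are hypothetical names, and the two notes are invented examples.

```python
from dataclasses import dataclass, field


@dataclass
class ContextLog:
    """Append-only notes per topic; later entries never overwrite earlier ones."""
    entries: dict[str, list[str]] = field(default_factory=dict)

    def add(self, topic: str, note: str) -> None:
        self.entries.setdefault(topic, []).append(note)

    def history(self, topic: str) -> list[str]:
        # Return a copy so callers cannot edit the accumulated record.
        return list(self.entries.get(topic, []))


log = ContextLog()
log.add("timeout", "Agreed on 30s in the design meeting")
log.add("timeout", "Ops raised it to 60s during an incident")

# A polished document would keep only the latest statement;
# the accumulated context keeps both, contradiction included.
polished = log.history("timeout")[-1]
accumulated = log.history("timeout")
print(len(accumulated))  # 2
```

The design choice is that nothing is ever replaced. The two notes disagree, and that disagreement is exactly the judgement material the section argues should survive.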

▍ Tolerating contradiction is the core

If you treat context as a "thinking process," contradictions are natural. Human judgement wobbles constantly; organisational decisions get overwritten. Whether you can keep that without sanding it down decides whether your AI agent can reproduce "your kind of judgement."

> 5-3 Which kind of organisation wins in the AI era

Traditional knowledge management leaned heavily toward "polishing documents." But the organisations that win in the AI era are the ones that can run an "accumulating context" practice one step upstream. The DX resolution of the long-tail discussion from the previous post ties directly into this.

§ 06 HARNESS

Spec at the core of the Harness; drift reasons on the outer rings

So should we discard the spec? No — the spec is the core of the Harness. But on its own, it isn't yet a Harness.

FIG.4 — HARNESS LAYERS

At the centre, Model (a bare LLM) wrapped by SPEC. Layered around them, concentrically: code / issues / ops notes / drift reasons. All of these together = Harness (everything outside the model that steers it).

> 6-1 Agent = Model + Harness

Following Karpathy's framing, Agent = Model + Harness. The Harness is everything other than the model — SPEC, REQUIREMENTS, PLAN, tools, verification, constraints, feedback loop, context.

Within that, the spec acts as the innermost ring of the Harness, because:

The spec gives the AI its initial "premise for what should be"
SoT (code, ops) sits one ring out; "drift reasons" layer further out
By reading these concentrically from inside out, the AI can update its judgement from "past truth" → "today's truth"
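The inside-out reading order described in the list above can be sketched as a context assembler. The ring names in `RING_ORDER`, the `ContextItem` type, and the `build_context` function are assumptions made for illustration; the article does not specify a concrete format.

```python
from dataclasses import dataclass

# Ring order from the list above: spec innermost, then SoT, then drift reasons.
RING_ORDER = ("spec", "sot", "drift_reason")


@dataclass(frozen=True)
class ContextItem:
    ring: str  # expected to be one of RING_ORDER
    text: str


def build_context(items: list[ContextItem]) -> str:
    """Order items from the innermost ring outward and label each one."""
    rank = {ring: i for i, ring in enumerate(RING_ORDER)}
    # sorted() is stable, so items within one ring keep their input order.
    ordered = sorted(items, key=lambda item: rank[item.ring])
    return "\n".join(f"[{item.ring}] {item.text}" for item in ordered)


# Hypothetical items, for illustration only.
items = [
    ContextItem("drift_reason", "Limit raised after an incident"),
    ContextItem("spec", "Retry limit is 3"),
    ContextItem("sot", "Code retries 5 times"),
]
context = build_context(items)
print(context)
# [spec] Retry limit is 3
# [sot] Code retries 5 times
# [drift_reason] Limit raised after an incident
```

Labelling each line with its ring lets the reader of the context, human or model, tell the premise ("what should be") apart from the observation and from the explanation. An item with an unknown ring raises `KeyError` in this sketch, which is a deliberate simplification.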

> 6-2 The blind spot of Spec-Driven Development

The SDD boom is fundamentally pointing the right way. Anchoring the AI's starting point in a spec is important. But SDD alone is not enough — the more effort you pour into polishing the spec, the more the SoT delta goes unexplained. Read this article not as a rejection of SDD but as a "design the SDD outer rings" argument.

▍ Issue Driven Development as a complement

"Issue Driven Development (IDD)" is sometimes proposed alongside SDD. It pairs well with this article: keeping issues amounts to accumulating "why was it argued" and "why was it decided." SDD and IDD shouldn't be set in opposition. SDD treats the spec as the truth; IDD treats the drift reasons as the truth. Let them coexist.

> 6-3 How a resident agent like Hermes plugs in

A concrete execution base for this Harness is a resident agent like Hermes Agent. With Skills / Memory / Hooks / Cron, you can continuously ingest the code, issues, PRs, and ops logs, and accumulate the five "whys" into a Vault or Knowledge Graph as the system runs.

§ 07 EVAL

Good AI = how much it lowers verification load

All of this ends up at one evaluation question: "what is a good AI?"

> 7-1 Not output volume, but verification-load reduction

On the Linux kernel 7.1 RC4 release in May 2026, Linus Torvalds publicly declared the security mailing list "almost entirely unmanageable" due to the flood of AI-generated vulnerability reports. What was a stream of 2-3 reports per week two years ago has ballooned to 5-10 reports per day, with multiple researchers independently surfacing the same patterns via automated tools and filing duplicates that drain maintainers' time^[1]^[2].

Linus himself does not dismiss AI in security work — he asks researchers to "understand the code and contribute a patch," not just the alert. That's a miniature of AI-agent operations in general. The value of an AI is not output volume — it is how much it lowers the human's verification, correction, and review load. This generalisation is also taken up in the previous post (§08 BUSINESS).

What an AI report should include at a minimum: what the problem is / why it's a problem / whether it duplicates existing reports / whether it's reproducible / what its impact range is / whether a patch exists / whether the patch is reviewable.
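The minimum checklist above can be turned into a simple completeness check. This is a sketch under the assumption that each item is captured as free text; the `AIReport` field names and the `missing_fields` helper are hypothetical, and the sample report is invented.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AIReport:
    problem: Optional[str] = None              # what the problem is
    why_it_matters: Optional[str] = None       # why it's a problem
    duplicate_check: Optional[str] = None      # whether it duplicates existing reports
    reproduction: Optional[str] = None         # whether it's reproducible
    impact_range: Optional[str] = None         # what its impact range is
    patch: Optional[str] = None                # whether a patch exists
    patch_reviewability: Optional[str] = None  # whether the patch is reviewable


def missing_fields(report: AIReport) -> list[str]:
    """Names of checklist fields that are absent or blank."""
    return [
        f.name for f in fields(report)
        if not (getattr(report, f.name) or "").strip()
    ]


# Hypothetical report, for illustration only.
report = AIReport(
    problem="Possible out-of-bounds read",
    why_it_matters="Could expose adjacent memory",
)
todo = missing_fields(report)
print(todo)
# ['duplicate_check', 'reproduction', 'impact_range', 'patch', 'patch_reviewability']
```

A gate like this only checks that each field is filled in, not that the content is correct. It is still a cheap way to push verification work back onto the reporting agent before a human maintainer has to look.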

> 7-2 Spec-only AI and "Slop"

A spec-only AI mass-produces plausible-looking output. It reads right, but it's drifted from the SoT and a human has to check every line to use it. This is the textbook case of "Slop" (low-quality, generic, templated AI output) discussed in the previous post. Only the AI fed the drift reasons becomes the kind that actually lowers human verification load.

▍ Context design is an ROI argument

"Accumulating context is hard." True — installing a culture of keeping the five "whys" is non-trivial. But on ROI it pays back many times over in the volume of verification and judgement the AI takes on. Burning human hours reviewing a "Slop-producing AI" is far more expensive over time.

▍ THE WORLDVIEW — accumulate, don't polish

To approach "today's truth," AI needs accumulation, not polish

What sharpens an AI agent is no longer the model or the prompt. It is whether you can accumulate the gap between spec and Source of Truth, and the reasons for the drift.

An AI fed only the spec faithfully replays past consensus. It reads neatly — but it's drifted from reality. Accumulate the drift reasons too, and the AI approaches "today's truth." Same model, same prompt — only the thickness of context differs.

Polish documents (for humans / clients)
Accumulate context (for AI agents — keep the contradictions and wobbles)
Spec at the core of the Harness; layer "why it diverged" on the outside

Many organisations pour energy into "polishing the spec" because of the SDD boom. But the real differentiation lies elsewhere: not in polishing the spec, but in accumulating the gap with the SoT. To stop building AIs that replay "past truth," stop polishing — start accumulating.