Don't build an AI that replays yesterday's spec — the gap between spec and source of truth is the real context
More and more often, an AI agent's accuracy is decided by its context, not its prompting. But "context" here is not a polished spec. What really moves the needle is the gap between the spec and the source of truth, and the reasons behind the drift.
An AI fed only the spec replays "past truth." Feed it the drift reasons too, and it approaches "today's truth." The blind spot of Spec-Driven Development and the real core of Harness Engineering, laid out.
What separates AI agents in 2026 is no longer the model or the prompt, and it is not the spec either. It is whether the gap between the spec and reality is accumulated together with the reasons for the drift.
- The frontier of AI-agent accuracy has shifted: model → prompt → context.
- If you mistake "context" for a polished spec, the AI just replays "past truth." Specs drift further from the Source of Truth (running code, ops, field judgement) the longer time passes.
- What actually works is keeping the reasons for the drift. Five whys decide the quality of the AI's output: why the spec was changed, why an exception was allowed, why the implementation compromised, why the issue went the way it did, and why the review came out that way.
- Documents are polished; context is accumulated. Put the spec at the core of the Harness, and layer the drift reasons around it. The right AI is not the loud one — it's the one that lowers the human's verification load.
From prompt-craft to context design
For a few years now, the lever that moves AI-agent quality has been moving:
- Up to 2023: raw model strength dominated (GPT-3.5 → 4 → Claude 3 → …)
- 2023-2024: same model, different wording produced different results — "prompt engineering" boomed
- 2025-: raw model quality commoditised; context (what you feed in) became the dominant variable
By 2026 the gap between frontier models is closing fast. With Claude Opus 4.7 / GPT-5.5 / grok-4.3 at the top, "what the model knows" matters less than "what you put in front of it for this task." Welcome to the era of context design.
"Context" here is not just "input text." It's the whole substrate of judgement material — background, history, contradictions, hesitations, wobbles. The same layer I called "tacit thoughts" in the previous post.
Spec vs Source of Truth — the gap is inevitable
Say "context" and most people think of the spec or design document. This article takes a clear position against that: feeding the AI only a spec as context is almost always wrong.
>2-1 Spec = "what should be." Source of Truth = "what is."
The spec describes what should be. A snapshot of agreement at a moment, internally coherent, neatly polished.
As implementation and operations evolve, the actual "truth" drifts elsewhere:
- The running code — hard-coded values, exception handlers, commented-out branches
- The DB schema and the live data — migration history, unexpected records, exceptional values
- The actual API behaviour — undocumented responses, unofficial endpoints
- Customer-side operating decisions — approval routes never written down, tacit exceptions
- Field judgement — choices an operator made on the spot
These are the Source of Truth (SoT). The spec inevitably drifts away from the SoT over time. This is not laziness — it's structural. Requirements change, exceptions happen, implementations compromise.
>2-2 The gap = an unexplained delta
The problem is not that the gap exists. It's that the gap is never explained. The spec says "what should be," the code says "how it runs" — but the bit in between, "why it diverged," lives nowhere.
Most organisations treat this gap as "document decay" and pour effort into "keeping the spec in sync with the truth." That work is futile — the moment you polish it, drift resumes.
This article takes the opposite stance: keep the gap as something to be explained, not eliminated.
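One way to picture "keeping the gap as something to be explained" is a small record per delta. The following is a minimal sketch, not a prescribed format: the `Drift` class, its field names, and the sample entries are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Drift:
    """One delta between the spec and the Source of Truth (hypothetical schema)."""
    topic: str
    spec_says: str                 # "what should be"
    sot_says: str                  # "what is" (code, data, ops)
    reason: Optional[str] = None   # "why it diverged"; None means unexplained


def unexplained(drifts: list[Drift]) -> list[Drift]:
    """Return the deltas that still have no recorded reason."""
    return [d for d in drifts if not d.reason]


# Illustrative entries only; none of these come from a real system.
drifts = [
    Drift(
        topic="approval route",
        spec_says="two approvers required",
        sot_says="one approver for small orders",
        reason="customer-specific carve-out agreed in an ops meeting",
    ),
    Drift(
        topic="retry count",
        spec_says="retry 3 times",
        sot_says="retry 5 times",
    ),
]

for d in unexplained(drifts):
    print(f"unexplained drift: {d.topic} (spec={d.spec_says!r}, sot={d.sot_says!r})")
```

The point of the sketch is that the delta itself is never deleted. Only the `reason` field gets filled in, so the list of unexplained drifts shrinks without the spec having to be re-polished.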
An AI fed only the spec replays "past truth"
Feed only the spec to an AI, and its output will faithfully replay "the past consensus."
>3-1 Typical failures of a spec-only AI
- "The spec says X is correct, but the code shows Y." → The AI trusts the spec, returns X, and drifts from reality.
- "The spec has no exception handling, so edge cases can be ignored." → Operationally impossible — a misjudgement.
- "I implemented per the latest API docs." → The unofficial operating rules get missed.
This is not the AI's fault. The context you fed it is frozen at a point in time, and the AI is faithful to that point. The cleaner the spec, the more confidently the AI quotes "past truth."
>3-2 Reverse-engineering alone is not enough either
"So just give it the code, not the spec?" That fails too. Code reveals "what is implemented and how," but never "why it ended up that way." This is the same structure as the tacit knowledge / tacit thoughts argument from the previous post: outputs alone can't reproduce the context of judgement.
"Why was it written this way?" "Why was this exception allowed?" — these are expert tacit knowledge, or the tacit thoughts just upstream of it. What sinks into the spec/SoT gap is, essentially, this layer of knowledge.
Five whys to accumulate — that's strong context
What should you accumulate, then? This article proposes five "whys" to keep deliberately.
>4-1 ① Why was the spec changed?
The motivation behind a spec change: "Customer fed back X," "the upstream premise broke," "a different problem surfaced once we built it." It usually survives somewhere in change logs, meeting notes, or Slack history. If it disappears, neither the AI nor the next human can answer "why the old spec was discarded."
>4-2 ② Why was the exception allowed?
An operational decision that "this is off-spec, but we'll allow it." Approval-rule exceptions, customer-specific carve-outs, emergency manual workarounds. Usually never documented. But "the impossible case is now business as usual" is more common than people admit.
>4-3 ③ Why was the implementation compromised?
The trade-offs accepted during implementation: "Should be X, but a legacy constraint forced Y," "we dropped the edge case for performance reasons." These can land in PR comments or code comments, but never in a polished spec. The "reason for the compromise" is exactly the judgement material you'll need when the spec changes next.
>4-4 ④ Why was it argued this way in the issue?
The shape of the discussion in issues and discussion threads. Not just the final verdict, but the "rejected alternatives," "premises debated," and "trade-offs that produced the agreement." Echoes Karpathy's LLM Wiki philosophy — keep the discussion, not just the conclusion.
>4-5 ⑤ Why did the review come out this way?
PR review comments: the reviewer's concerns, objections, compromises, and reasons for approving. Nobody re-reads review history after merge, but if it persists, the AI can "reproduce the same kind of judgement elsewhere."
Keeping these "whys" is exactly the Externalisation step in Nonaka's SECI model. The twist: you're externalising the process, not the conclusion. That's how judgement patterns become reproducible in other contexts.
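The five "whys" above could be captured as tagged, append-only log entries. This is a rough sketch under stated assumptions: the `Why` enum values, the `WhyEntry` fields, and the JSON Lines layout are hypothetical choices made for illustration, not a format the article prescribes.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Why(Enum):
    """The five kinds of "why" (identifier names are hypothetical)."""
    SPEC_CHANGE = "why_spec_changed"
    EXCEPTION = "why_exception_allowed"
    COMPROMISE = "why_implementation_compromised"
    ISSUE = "why_issue_argued_this_way"
    REVIEW = "why_review_came_out_this_way"


@dataclass(frozen=True)
class WhyEntry:
    kind: Why
    source: str   # where it was found, e.g. "PR comment", "meeting notes"
    text: str     # the reasoning itself, kept verbatim, contradictions included


def to_jsonl(entries: list[WhyEntry]) -> str:
    """Serialise one entry per line so the log can be appended to, not rewritten."""
    return "\n".join(
        json.dumps({"kind": e.kind.value, "source": e.source, "text": e.text})
        for e in entries
    )


# Illustrative entries only.
log = [
    WhyEntry(Why.COMPROMISE, "PR comment", "Should be X, but a legacy constraint forced Y."),
    WhyEntry(Why.EXCEPTION, "ops notes", "Off-spec, but allowed for this customer."),
]
lines = to_jsonl(log).splitlines()
print(lines[0])
```

An append-only, line-per-entry layout is one way to externalise the process rather than the conclusion: nothing is merged or reconciled at write time, so later entries can contradict earlier ones without overwriting them.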
Documents are polished; context is accumulated
"Keep the five whys" sounds like "write more docs" — but it isn't. The thing you polish and the thing you accumulate are different objects.
>5-1 Documents: polished, for humans
Proposals, final specs, articles, reports, manuals. Polished for humans (clients or readers). Coherence matters; contradictions are stripped. The value is "readability" and "clarity of conclusion."
>5-2 Context: accumulated, for AI
Issues, discussion, PR reviews, ops notes, failure logs, drift reasons, rough notes from before verbalisation. Accumulated for the AI agent. Keep the contradictions, the wobbles, the hesitations. Thickness of judgement material matters more than coherence.
If you treat context as a "thinking process," contradictions are natural. Human judgement wobbles constantly; organisational decisions get overwritten. Whether you can keep that without sanding it down decides whether your AI agent can reproduce "your kind of judgement."
>5-3 Which kind of organisation wins in the AI era
Traditional knowledge management leaned heavily toward "polishing documents." But the organisations that win in the AI era are the ones that can run an "accumulating context" practice one step upstream. The DX resolution of the long-tail discussion from the previous post ties directly into this.
Spec at the core of the Harness; drift reasons on the outer rings
So should we discard the spec? No — the spec is the core of the Harness. But on its own, it isn't yet a Harness.
>6-1 Agent = Model + Harness
Following Karpathy's framing, Agent = Model + Harness. The Harness is everything other than the model — SPEC, REQUIREMENTS, PLAN, tools, verification, constraints, feedback loop, context.
Within that, the spec acts as the innermost ring of the Harness, because:
- The spec gives the AI its initial "premise for what should be"
- SoT (code, ops) sits one ring out; "drift reasons" layer further out
- By reading these concentrically from inside out, the AI can update its judgement from "past truth" → "today's truth"
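The concentric reading order described above (spec innermost, SoT one ring out, drift reasons outermost) can be sketched as a context assembler. This is only an illustration: `build_context`, the section titles, and the sample strings are assumptions, and a real Harness would pull these rings from actual repositories and logs.

```python
def build_context(spec: str, sot_notes: list[str], drift_reasons: list[str]) -> str:
    """Assemble context from the inside out: spec, then SoT, then drift reasons."""
    rings = [
        ("SPEC (what should be)", [spec]),
        ("SOURCE OF TRUTH (what is)", sot_notes),
        ("DRIFT REASONS (why it diverged)", drift_reasons),
    ]
    parts = []
    for title, items in rings:
        # An empty ring is stated explicitly so a missing explanation stays visible.
        body = "\n".join(f"- {item}" for item in items) or "- (none recorded)"
        parts.append(f"## {title}\n{body}")
    return "\n\n".join(parts)


# Illustrative inputs only.
context = build_context(
    spec="Orders need two approvers.",
    sot_notes=["The running code allows one approver for small orders."],
    drift_reasons=["Carve-out agreed with the customer; never written into the spec."],
)
print(context)
```

Marking an empty ring as "(none recorded)" rather than omitting it is a deliberate choice in this sketch: it tells the model that the delta exists but is unexplained, instead of implying the spec and the SoT agree.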
>6-2 The blind spot of Spec-Driven Development
The SDD boom is fundamentally pointing the right way. Anchoring the AI's starting point in a spec is important. But SDD alone is not enough — the more effort you pour into polishing the spec, the more the SoT delta goes unexplained. Read this article not as a rejection of SDD but as a "design the SDD outer rings" argument.
"Issue Driven Development (IDD)" is sometimes proposed alongside SDD. It pairs well with this article: keeping issues is exactly how "why was it argued" and "why was it decided" get accumulated. SDD vs IDD shouldn't be an opposition. SDD = the spec is the truth. IDD = the drift reasons are the truth. Let them coexist.
>6-3 How a resident agent like Hermes plugs in
A concrete execution base for this Harness is a resident agent like Hermes Agent. With Skills / Memory / Hooks / Cron, you can continuously ingest the code, issues, PRs, and ops logs, and accumulate the five "whys" into a Vault or Knowledge Graph as the system runs.
Good AI = how much it lowers verification load
All of this ends up at one evaluation question: "what is a good AI?"
>7-1 Not output volume but verification-load reduction
On the Linux kernel 7.1 RC4 release in May 2026, Linus Torvalds publicly declared the security mailing list "almost entirely unmanageable" due to the flood of AI-generated vulnerability reports. What was a stream of 2-3 reports per week two years ago has ballooned to 5-10 reports per day, with multiple researchers independently surfacing the same patterns via automated tools and filing duplicates that drain maintainers' time[1][2].
Linus himself does not dismiss AI in security work — he asks researchers to "understand the code and contribute a patch," not just the alert. That's a miniature of AI-agent operations in general. The value of an AI is not output volume — it is how much it lowers the human's verification, correction, and review load. This generalisation is also taken up in the previous post (§08 BUSINESS).
What an AI report should include at a minimum:
- what the problem is
- why it's a problem
- whether it duplicates existing reports
- whether it's reproducible
- what its impact range is
- whether a patch exists
- whether the patch is reviewable
>7-2 Spec-only AI and "Slop"
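The minimum items for an AI report listed above can be turned into a simple completeness check before a human ever reads the report. The sketch below is hypothetical throughout: `AIReport`, its field names, and `missing_items` are illustrative, and checking that a field is filled in says nothing about whether its content is correct.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AIReport:
    """The seven minimum items, one field each (field names are hypothetical)."""
    problem: Optional[str] = None
    why_it_matters: Optional[str] = None
    duplicate_check: Optional[str] = None
    reproduction: Optional[str] = None
    impact_range: Optional[str] = None
    patch: Optional[str] = None
    patch_reviewability: Optional[str] = None


def missing_items(report: AIReport) -> list[str]:
    """Return the names of the items the report left empty, in declaration order."""
    return [f.name for f in fields(report) if not getattr(report, f.name)]


# Illustrative report that states only the alert, with no supporting material.
report = AIReport(
    problem="Possible out-of-bounds read in a parser.",
    why_it_matters="Could leak adjacent memory.",
)
print(missing_items(report))
```

A gate like this lowers verification load only in the narrow sense of rejecting reports that are obviously incomplete. Judging whether the duplicate check or the patch is any good still falls to a human or to a better-fed agent.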
A spec-only AI mass-produces plausible-looking output. It reads right, but it's drifted from the SoT and a human has to check every line to use it. This is the textbook case of "Slop" (low-quality, generic, templated AI output) discussed in the previous post. Only the AI fed the drift reasons becomes the kind that actually lowers human verification load.
"Accumulating context is hard." True — installing a culture of keeping the five "whys" is non-trivial. But on ROI it pays back many times over in the volume of verification and judgement the AI takes on. Burning human hours reviewing a "Slop-producing AI" is far more expensive over time.
To approach "today's truth," AI needs accumulation, not polish
What sharpens an AI agent is no longer the model or the prompt. It is whether you can accumulate the gap between spec and Source of Truth, and the reasons for the drift.
An AI fed only the spec faithfully replays past consensus. It reads neatly — but it's drifted from reality. Accumulate the drift reasons too, and the AI approaches "today's truth." Same model, same prompt — only the thickness of context differs.
- Polish documents (for humans / clients)
- Accumulate context (for AI agents — keep the contradictions and wobbles)
- Spec at the core of the Harness; layer "why it diverged" on the outside
Many organisations pour energy into "polishing the spec" because of the SDD boom. But the real differentiation lies elsewhere: not in polishing the spec, but in accumulating the gap with the SoT. To stop building AIs that replay "past truth," stop polishing — start accumulating.