A defensive design pattern for high-stakes agentic workflows. Hybrid neuro-symbolic verification for multi-agent LLM systems that need trustworthy, auditable decisions instead of confident-sounding hallucinations.
Status: Production pattern, distilled from a working media-planning agent. License: CC0 — public domain. Adapt freely.
The pattern targets a major blocker for enterprise agent adoption: critic fatigue and silent hallucinations in multi-agent consensus loops.
LLMs can propose. LLMs can object. LLMs can request evidence. LLMs cannot adjudicate.
Adjudication is deterministic code that reads typed evidence and applies typed rules. Everything else follows from that.
```mermaid
flowchart TD
    U[User directive]:::user --> P[Proposer LLM]:::llm
    P -->|typed proposal| PT[Pressure-Tester LLM]:::llm
    PT -->|typed objections| RW[Research Workers<br/>APIs · MCP · files · search]:::tools
    RW -->|typed evidence| R{{Resolver<br/>PURE CODE}}:::code
    R -->|advance / hold / escalate| OUT[Validated Artifact + Receipts]:::out
    R -. another round .-> P
    classDef llm fill:#1f6feb,stroke:#0d419d,color:#fff
    classDef tools fill:#8957e5,stroke:#553098,color:#fff
    classDef code fill:#1a7f37,stroke:#0f5223,color:#fff
    classDef user fill:#9a6700,stroke:#5c3d00,color:#fff
    classDef out fill:#cf222e,stroke:#82071e,color:#fff
```
The Resolver is what makes this pattern different from "two LLMs talking." It is the part you trust because you wrote it.
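To make "pure code" concrete, here is a minimal sketch of a deterministic decision step. The names (`OpenObjection`, `decide`) and the rule that any uncleared objection blocks advancement are illustrative assumptions, not the contracts in src/types.ts or the logic in src/resolver.ts.

```typescript
// Hypothetical, simplified shapes for illustration only.
type Outcome = "advance" | "hold" | "escalate";

interface OpenObjection {
  id: string;
  cleared: boolean; // set by evidence checks, never by an LLM's say-so
}

// Pure function: no I/O, no clock, no randomness.
// The same inputs always produce the same outcome.
function decide(
  objections: OpenObjection[],
  round: number,
  maxRounds: number,
): Outcome {
  const unresolved = objections.filter((o) => !o.cleared);
  if (unresolved.length === 0) return "advance";
  // Assumption: "hold" means rounds remain, so the agents get another pass.
  if (round < maxRounds) return "hold";
  // Out of rounds with objections still open: hand off to a human.
  return "escalate";
}
```

Because the function only reads its arguments, replaying a stored artifact and evidence chain reproduces the original decision, which is what makes the receipts auditable.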
| Path | Purpose |
|---|---|
| docs/agent-dialectic-resolver-pattern.md | The full pattern: four actors, typed contracts, authority hierarchy, confidence thresholds, resolver state machine, hard gates, receipts, anti-patterns, generic use cases |
| src/types.ts | TypeScript contracts (EvidenceChainItem, Proposal, Objection, Resolution, DialecticArtifact, DialecticReceipt) |
| src/resolver.ts | Pure runDialecticResolver() — no LLM, no network, no randomness |
| src/gates.ts | The starter hard-gate set, including the invention check |
| src/index.ts | Public surface |
| examples/code-review/ | Runnable mock: code-review proposer/pressure-tester with stub research workers |
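The table names an invention check without defining it here. One plausible reading, given the pattern's rule that anything absent from the evidence chain does not exist, is a gate that rejects claims citing no known evidence. The sketch below encodes that assumption; the `Claim`, `EvidenceRef`, and `GateResult` shapes are hypothetical and are not the code in src/gates.ts.

```typescript
// Hypothetical shapes; the real contracts live in src/types.ts.
interface EvidenceRef {
  id: string;
}

interface Claim {
  text: string;
  evidenceIds: string[]; // ids the claim says it is grounded in
}

interface GateResult {
  passed: boolean;
  inventedClaims: string[];
}

// Flags any claim that cites no evidence at all, or cites an id
// that is not present in the evidence chain.
function inventionCheck(claims: Claim[], chain: EvidenceRef[]): GateResult {
  const isKnown = (id: string): boolean => chain.some((e) => e.id === id);
  const inventedClaims = claims
    .filter((c) => c.evidenceIds.length === 0 || !c.evidenceIds.every(isKnown))
    .map((c) => c.text);
  return { passed: inventedClaims.length === 0, inventedClaims };
}
```

A gate like this fails closed: a claim with an empty `evidenceIds` list is treated the same as one citing a made-up id.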
```typescript
import { runDialecticResolver, defaultAuthorityRank } from "agent-dialectic-resolver";

const result = runDialecticResolver({
  artifact, // your DialecticArtifact
  evidenceChain, // EvidenceChainItem[] appended so far this turn
  config: {
    thresholdsByCategory: {
      "Regulatory": 0.85,
      "Safety": 0.80,
      "External-Availability": 0.75,
      "Factual": 0.70,
      "Categorical": 0.65,
    },
    authorityRank: defaultAuthorityRank,
    maxRounds: 3,
    minSourceKindForSeverity: {
      BLOCKING: ["user-override", "live-api", "official-source", "domain-evidence"],
      HIGH: ["user-override", "live-api", "official-source", "domain-evidence", "fresh-research"],
      MEDIUM: ["user-override", "live-api", "official-source", "domain-evidence", "fresh-research", "modeled-fallback"],
      LOW: ["user-override", "live-api", "official-source", "domain-evidence", "fresh-research", "modeled-fallback", "llm-inference"],
    },
  },
});

if (result.artifact.validation.readyToAdvance) {
  // ship the artifact
} else if (result.needsAnotherRound) {
  // run the Proposer + Pressure-Tester again with the appended receipts visible
} else {
  // escalate to human review
}
```

See examples/code-review/run.ts for an end-to-end mock.
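The two config maps above suggest how an objection might be cleared: the evidence must come from a source kind allowed for the objection's severity, and its confidence must meet the category threshold. The sketch below is one way that check could look. `ObjectionLike`, `EvidenceLike`, and `canClear` are hypothetical names, and the actual rules in src/resolver.ts may differ.

```typescript
type SourceKind =
  | "user-override" | "live-api" | "official-source" | "domain-evidence"
  | "fresh-research" | "modeled-fallback" | "llm-inference";
type Severity = "BLOCKING" | "HIGH" | "MEDIUM" | "LOW";

// Simplified stand-ins for the library's typed contracts (assumed shapes).
interface EvidenceLike {
  sourceKind: SourceKind;
  confidence: number; // expected range: 0 to 1
}
interface ObjectionLike {
  severity: Severity;
  category: string;
}
interface ClearanceConfig {
  thresholdsByCategory: Record<string, number>;
  minSourceKindForSeverity: Record<Severity, SourceKind[]>;
}

// True when at least one evidence item is both of an allowed kind for the
// objection's severity and at or above the category's confidence threshold.
function canClear(
  objection: ObjectionLike,
  evidence: EvidenceLike[],
  config: ClearanceConfig,
): boolean {
  const threshold = config.thresholdsByCategory[objection.category];
  if (threshold === undefined) return false; // unknown category: fail closed
  const allowed = config.minSourceKindForSeverity[objection.severity];
  return evidence.some(
    (e) => allowed.indexOf(e.sourceKind) !== -1 && e.confidence >= threshold,
  );
}
```

Under this reading, a very confident `llm-inference` item still cannot clear a BLOCKING objection, because its source kind is not in the BLOCKING list.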
Use it when the output drives a real action with cost: spend, code, configuration, communication, policy.
Don't use it when:
- The output is purely informational and the user is the final adjudicator (a search tool, a Q&A bot).
- The cost of being wrong is low and the cost of slowness is high (autocomplete).
- There is no domain-specific Red Flag Playbook worth writing — meaning nobody has expertise about how this thing fails.
If you can't name the failures you're guarding against, you're not ready to apply this pattern. Go find a domain expert first.
The pattern is domain-agnostic. The Red Flag Playbook, the closed enums, the confidence thresholds, and the gates are what you specialize.
- Software engineering: Proposer writes a patch, Pressure-Tester applies a security/perf/test-coverage playbook, Research Workers run tests & static analysis, Resolver requires green tests + no blocking critique unresolved.
- Customer support: Proposer drafts a refund decision, Pressure-Tester checks policy + fraud, Research Workers hit billing/orders, Resolver blocks high-value refunds without a `live-api` order confirmation.
- Medical triage: Proposer suggests next step, Pressure-Tester applies red-flag symptom list, Research Workers hit EHR + drug interaction + guideline retrieval, Resolver never advances on `llm-inference` alone for contraindications.
- Legal drafting: Proposer drafts a clause, Pressure-Tester argues opposing counsel, Research Workers retrieve case law + statute, Resolver requires `official-source` citations for any binding claim.
- DevOps: Proposer generates a Terraform/K8s change, Pressure-Tester applies SRE red flags, Research Workers run `terraform plan` + OPA, Resolver gates prod changes on green plan + policy pass.
- Scientific paper review: Proposer synthesizes findings, Pressure-Tester critiques methodology, Research Workers verify citations + re-run stats, Resolver requires dataset references and checked test statistics.
Full breakdown in docs/agent-dialectic-resolver-pattern.md.
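As an illustration of how one of these use cases could become a hard gate, here is a minimal sketch of the customer-support rule (block high-value refunds without a `live-api` order confirmation). The shapes, the `refundGate` name, and the `HIGH_VALUE_CENTS` cutoff are all hypothetical; pick values that match your own policy.

```typescript
// Hypothetical shapes for a refund workflow.
interface RefundProposal {
  orderId: string;
  amountCents: number;
}

interface OrderEvidence {
  sourceKind: string; // e.g. "live-api", "llm-inference"
  orderId: string;
  confirmed: boolean;
}

const HIGH_VALUE_CENTS = 50000; // assumed cutoff, not taken from the pattern

// Low-value refunds pass. High-value refunds pass only when the evidence
// chain holds a live-api confirmation for the same order.
function refundGate(
  proposal: RefundProposal,
  chain: OrderEvidence[],
): "pass" | "block" {
  if (proposal.amountCents < HIGH_VALUE_CENTS) return "pass";
  const confirmed = chain.some(
    (e) =>
      e.sourceKind === "live-api" &&
      e.orderId === proposal.orderId &&
      e.confirmed,
  );
  return confirmed ? "pass" : "block";
}
```

The gate never inspects the Proposer's reasoning. It only checks whether the required evidence exists, so a persuasive argument cannot substitute for a confirmation.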
Honest accounting:
- Schema discipline. Every domain field needs a closed enum or a validator.
- Calibrated confidences at ingest. Garbage-in, garbage-out: if your data sources don't carry honest confidence, the Resolver propagates lies.
- Real Red Flag Playbooks. A weak Pressure-Tester misses real objections.
- A separate research layer. Typed adapters around your tools. You wanted those anyway.
- More LLM calls, not fewer. Proposer + Pressure-Tester + (sometimes) Research = ≥2 LLM calls per turn.
What it buys: a system whose decisions you can defend.
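For the schema-discipline cost, a closed enum with a validator can be fairly small in TypeScript. The sketch below uses a made-up `Channel` field from a media-planning setting; the values and the helper names `isChannel` and `parseChannel` are assumptions for illustration.

```typescript
// Hypothetical domain field: the only channels a proposal may reference.
const CHANNELS = ["search", "social", "display", "video"] as const;
type Channel = (typeof CHANNELS)[number];

// Type guard: narrows unknown input to the closed enum.
function isChannel(value: unknown): value is Channel {
  return (
    typeof value === "string" &&
    (CHANNELS as readonly string[]).indexOf(value) !== -1
  );
}

// Validator for ingest boundaries: rejects anything outside the enum
// instead of letting a free-form string travel downstream.
function parseChannel(value: unknown): Channel {
  if (!isChannel(value)) {
    throw new Error(`Unknown channel: ${String(value)}`);
  }
  return value;
}
```

Deriving the type from the `CHANNELS` array keeps the runtime check and the compile-time type in one place, so adding a value updates both.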
Other multi-agent frameworks (AutoGen, CrewAI, LangGraph) ship the agent loop but leave adjudication to either (a) free-form chat between two LLMs or (b) an "LLM-as-judge." Both collapse under pressure-testing because confidence inflation eventually convinces the critic to relax its standards.
Three elements that are not currently found off-the-shelf:
- Evidence Authority Hierarchy. A strict transport-layer ranking (`user-override` > `live-api` > `official-source` > `domain-evidence` > `fresh-research` > `modeled-fallback` > `llm-inference`) with an explicit rule that `llm-inference` can never clear a BLOCKING objection alone.
- No natural-language drift. Both agents emit only typed message kinds (closed enums of structural moves), not free-form prose. The audit trail is parseable, not skimmable.
- Pure-code adjudication. The Resolver is deterministic TypeScript. No network. No model evaluation. 100% reproducible: same artifact + same evidence chain → same resolutions, every time.
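The authority hierarchy can be sketched as an ordered list plus two small pure functions. This is an illustration under assumptions: the shape of the library's `defaultAuthorityRank` is not shown in this README, and the tie-breaking rule (the earlier item wins) and the names `pickAuthoritative` and `hasNonInferenceSupport` are made up for the example.

```typescript
// Ranking from the list above, highest authority first.
const AUTHORITY_ORDER = [
  "user-override",
  "live-api",
  "official-source",
  "domain-evidence",
  "fresh-research",
  "modeled-fallback",
  "llm-inference",
] as const;
type SourceKind = (typeof AUTHORITY_ORDER)[number];

interface RankedEvidence {
  id: string;
  sourceKind: SourceKind;
  value: string;
}

// Lower index means higher authority.
function authorityOf(kind: SourceKind): number {
  return AUTHORITY_ORDER.indexOf(kind);
}

// When items disagree, the highest-authority item wins.
// Assumption: on a tie, the item that appears first is kept.
function pickAuthoritative(items: RankedEvidence[]): RankedEvidence | undefined {
  let best: RankedEvidence | undefined;
  for (const item of items) {
    if (
      best === undefined ||
      authorityOf(item.sourceKind) < authorityOf(best.sourceKind)
    ) {
      best = item;
    }
  }
  return best;
}

// Encodes the explicit rule: llm-inference alone is never enough
// to clear a BLOCKING objection.
function hasNonInferenceSupport(items: RankedEvidence[]): boolean {
  return items.some((i) => i.sourceKind !== "llm-inference");
}
```

Both functions depend only on their inputs, so they fit inside a Resolver that must return the same resolutions for the same evidence chain.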
This pattern lives next to a handful of adjacent ideas. It is not a replacement for them; it solves a different slice of the problem.
| Approach | What it does well | Where this pattern is different |
|---|---|---|
| "DIALECTIC" frameworks (multi-agent debate, e.g. VC-style pro/contra critics) | Structures high-stakes decisions as adversarial argument | Those still rely on an LLM judge or scoring heuristic to draw the conclusion. Here the adjudicator is a hard-coded state machine that cannot be bypassed by argument quality. |
| Compiled AI / LLM+P | Runs the LLM once at compile-time to generate intent, then routes execution through deterministic planners and hard validation gates | Optimizes for static, low-latency execution. This pattern optimizes for live, iterative negotiation loops where agents fetch dynamic tools to satisfy a critic before committing. |
| Reasoning Graphs | Anchors evaluation edges directly to retrieved evidence items — mirrors this pattern's rule that "if it's not in the evidence chain, it does not exist" | Designed for agentic self-improvement and memory. This Resolver is a defensive runtime firewall whose job is to block bad execution, not to learn from it. |
| LLM-as-judge (AutoGen, RLHF reward models) | Cheap, flexible, easy to deploy | Judge is still inference. Cannot resolve blockers in this pattern. Used here only as one input among many to the deterministic Resolver. |
| CrewAI / LangGraph multi-agent chat | Great ergonomics, fast iteration | Free-form prose between agents. Confidence inflation is the default failure mode. |
Position this pattern as hybrid neuro-symbolic verification — neural for generation and critique, symbolic for adjudication — sitting on top of whichever multi-agent framework you already use.
If you ship something using this pattern, file an issue or PR with a one-line summary of your Red Flag Playbook and gate configuration. The goal is a shared library across domains.
PRs welcome for:
- Additional reference adapter examples (Python, Go, Rust).
- Domain-specific Red Flag Playbooks.
- Real-world case studies (sanitized).
Distilled from a production multi-agent media-planning system where a "Proposer" agent and a "Pressure-Tester" agent had to agree on a campaign plan before live ad spend was committed. Letting two LLMs converge in free-form chat turned out to be indistinguishable from one LLM rationalizing. Adding the deterministic Resolver between them turned the system from confident-sounding theater into a defensible decision pipeline.
The pattern is now generalized. Use it anywhere an LLM-generated proposal needs to clear a real-world bar before being acted on.
Simon Foster — inventor of the Agent Dialectic Resolver pattern.
Questions, implementation help, real-world case studies, or commercial inquiries: simon@spotrunner.com
Released to the public domain under CC0. Attribution is appreciated but not required.