600 trials. Standard prompting reported 100% success. Actual correct output: 34%.
In those trials, Manifold eliminated false positives entirely through external, specification-driven validation.
Most agent frameworks put the agent in two roles at once: executor and judge.
It performs the task. Then it decides if its output is correct. This is like asking a student to grade their own exam. The result across 600 controlled trials:
| Approach | True Success | False Positives | Relative Cost |
|---|---|---|---|
| Naive prompting | 34% | 66% | 1× |
| Retry logic | 38% | 62% | 1.5–3.5× |
| Manifold | 94% | 0% | 0.36× |
Two failure modes drive this:
- Silent false positives — the agent reports success, the output is wrong. Your system never finds out.
- Infinite retry loops — output is bad → retry → same output → retry. Standard retry counters count attempts, not progress.
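The retry trap can be made concrete with a toy loop (illustrative sketch only, not Manifold code; `flaky_agent` is a hypothetical stand-in for a deterministic agent that fails the same way every time):

```python
# A deterministic agent defeats attempt-counting retries: each retry re-runs
# the identical situation and burns budget without making any progress.

def flaky_agent(task: str) -> str:
    # Deterministic: given the same input, it produces the same (empty) output.
    return ""

def retry_loop(task: str, max_attempts: int = 3):
    attempts = 0
    for _ in range(max_attempts):
        attempts += 1
        output = flaky_agent(task)
        if output:  # external check: output must be non-empty
            return output, attempts
    return None, attempts  # all retries consumed on identical failures

output, attempts = retry_loop("extract items")
print(output, attempts)  # None 3: three attempts, zero progress
```

The counter dutifully reaches its limit, but nothing about the situation changed between attempts.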
Manifold treats prompts as contracts, not instructions.
Each workflow step defines what must be true before it runs, what must be true after, what must always hold globally, and what must change before a retry is allowed. Verification is external — the agent's opinion about its own output is irrelevant.
Instruction approach: agent executes → agent judges → system trusts
Manifold approach: agent executes → spec engine judges → system trusts the spec
Install:

```shell
pip install manifold-ai
```

Define specs:

```python
from manifold import Spec, SpecResult, Context


class HasAPIKey(Spec):
    rule_id = "has_api_key"
    tags = ("precondition", "config")

    def evaluate(self, context: Context, candidate=None):
        if context.has_data("api_key"):
            return SpecResult.ok(self.rule_id, "API key configured")
        return SpecResult.fail(
            self.rule_id,
            "Missing API key",
            suggested_fix="Set 'api_key' in context.data",
        )


class OutputNotEmpty(Spec):
    rule_id = "output_not_empty"
    tags = ("postcondition", "output")

    def evaluate(self, context: Context, candidate=None):
        if candidate and len(candidate) > 0:
            return SpecResult.ok(self.rule_id, f"Extracted {len(candidate)} items")
        return SpecResult.fail(
            self.rule_id,
            "No items extracted",
            suggested_fix="Check input data format or prompt",
        )
```

Declare the workflow:

```yaml
manifest_version: "1.0"
globals:
  start_step: "extract"
  budgets:
    max_total_attempts: 10
    max_attempts_per_step: 3
    max_cost_dollars: 5.0
steps:
  extract:
    agent_id: "extraction_agent"
    pre_specs: ["has_api_key"]
    post_specs: ["output_not_empty"]
edges:
  - from_step: "extract"
    to_step: "__complete__"
    when: "post_ok"
    priority: 10
  - from_step: "extract"
    to_step: "__fail__"
    when: "attempts('extract') >= 3"
    priority: 1
```

Run it:

```python
from manifold import OrchestratorBuilder

orchestrator = (
    OrchestratorBuilder()
    .with_manifest_file("workflow.yaml")
    .with_spec(HasAPIKey())
    .with_spec(OutputNotEmpty())
    .build()
)

result = await orchestrator.run(initial_data={"api_key": "sk-..."})
print(f"Success: {result.success}")
print(f"Steps executed: {result.total_steps_executed}")
print(f"Trace: {result.final_context.trace}")
```

| Category | Purpose |
|---|---|
| Preconditions | Must be true before the agent runs. Gates execution. |
| Postconditions | Must be true about the output. Eliminates false positives. |
| Invariants | Must always hold across the entire run. Global safety constraints. |
| Progress conditions | Must show the situation changed. Prevents infinite retry loops. |
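An invariant spec can be sketched in the same shape as the pre/post examples above. This is an assumption-laden illustration: the "invariant" tag mirrors the pattern of the earlier specs, and `CostUnderBudget`, `total_cost_dollars`, and `max_cost_dollars` are hypothetical names. Minimal stand-ins are included so the sketch runs without manifold installed:

```python
# Stand-in mirroring the SpecResult.ok / SpecResult.fail API shown above,
# so this sketch is self-contained. The real class comes from manifold.
class SpecResult:
    def __init__(self, rule_id, passed, message, suggested_fix=None):
        self.rule_id, self.passed, self.message = rule_id, passed, message
        self.suggested_fix = suggested_fix

    @classmethod
    def ok(cls, rule_id, message):
        return cls(rule_id, True, message)

    @classmethod
    def fail(cls, rule_id, message, suggested_fix=None):
        return cls(rule_id, False, message, suggested_fix)


class CostUnderBudget:
    # Invariant: must hold at every point in the run, not just at one step.
    rule_id = "cost_under_budget"
    tags = ("invariant", "budget")

    def evaluate(self, context, candidate=None):
        if context["total_cost_dollars"] <= context["max_cost_dollars"]:
            return SpecResult.ok(self.rule_id, "Cost within budget")
        return SpecResult.fail(
            self.rule_id,
            "Run exceeded cost budget",
            suggested_fix="Lower max_attempts or use a cheaper model",
        )


result = CostUnderBudget().evaluate({"total_cost_dollars": 4.2, "max_cost_dollars": 5.0})
print(result.passed)  # True
```

The point is the shape: the check lives outside the agent, so a run that blows its budget fails the invariant regardless of what the agent reports.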
Standard retry counters count attempts. Manifold counts progress.
```python
fingerprint = hash((step_id, canonical_inputs, tool_calls, failed_rule_ids, missing_fields))
if fingerprint in seen_fingerprints:
    raise LoopDetectedError()  # Blocked — not just counted
seen_fingerprints.add(fingerprint)
```

Every retry must represent genuine forward movement. Same situation = blocked.
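A self-contained sketch of the idea (Manifold's actual canonicalization and hashing are internal details not shown in this document; `check_progress` and the SHA-256 fingerprint here are illustrative choices):

```python
import hashlib
import json


class LoopDetectedError(Exception):
    pass


def fingerprint(step_id, inputs, failed_rule_ids):
    # Canonicalize: sorted keys and sorted rule ids, so ordering noise
    # cannot disguise an identical retry situation.
    payload = json.dumps(
        {"step": step_id, "inputs": inputs, "failed": sorted(failed_rule_ids)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


seen = set()


def check_progress(step_id, inputs, failed_rule_ids):
    fp = fingerprint(step_id, inputs, failed_rule_ids)
    if fp in seen:
        raise LoopDetectedError(f"No progress since last attempt of '{step_id}'")
    seen.add(fp)


check_progress("extract", {"url": "a.csv"}, ["output_not_empty"])  # first attempt: allowed
try:
    check_progress("extract", {"url": "a.csv"}, ["output_not_empty"])  # identical retry
except LoopDetectedError as e:
    print(e)  # blocked: same situation, no forward movement
```

A retry with different inputs, different tool calls, or a different set of failed rules produces a new fingerprint and is allowed through.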
Workflows live in data, not code. Swap domains by swapping manifests. Agents are replaceable components. Specs are the laws of physics.
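Swapping domains means writing a new manifest against the same orchestrator code. A hypothetical example (step, agent, and spec names here are illustrative, not part of the library):

```yaml
manifest_version: "1.0"
globals:
  start_step: "summarize"
  budgets:
    max_total_attempts: 6
steps:
  summarize:
    agent_id: "summarizer_agent"
    pre_specs: ["has_api_key"]
    post_specs: ["summary_within_length"]
edges:
  - from_step: "summarize"
    to_step: "__complete__"
    when: "post_ok"
    priority: 10
```

Nothing in the orchestration code changes; only the data describing the workflow does.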
| Feature | Manifold | LangGraph | Manual Code |
|---|---|---|---|
| Declarative manifests | ✓ | Partial | ✗ |
| External spec validation | ✓ | ✗ | Manual |
| Loop prevention | ✓ | ✗ | Manual |
| Progress conditions | ✓ | ✗ | Manual |
| Complete tracing | ✓ | Partial | Manual |
| Zero false positives | ✓ | ✗ | Manual |
This framework is the subject of a published whitepaper and accompanying scientific paper with full experimental methodology across 600 trials.
📄 Whitepaper — Architecture & Concepts (this repo)
🔬 Scientific Paper — Full Methodology & Results on Zenodo (600 trials, statistical analysis)
Honest note on scope: The experimental results come from controlled trials at a scale I could fund independently. I don't have the resources for large-scale production testing across diverse domains. If you stress-test this architecture and find failure modes — I want to know. Every independent result, including negative ones, advances the work.
Open to collaboration and co-authorship on follow-up research.
Best fit:
- Multi-step workflows with verifiable intermediate outputs
- Data extraction and format compliance tasks
- Production systems where silent failures are costly
- Any pipeline where you need to know why something failed
Less applicable:
- Purely creative tasks where "correct" is subjective
- Exploratory tasks with undefined output spaces
Even in creative domains, loop prevention and cost control provide value.
- Python 3.10+
- PyYAML
Alpha (v0.1.0) — Core architecture is stable. API may evolve.
Feedback, issues, and pull requests welcome.
Fabio-Eric Rempel · fabiorempel@proton.me · github.com/fabs133
MIT License · Built on contract-driven design and immutable data patterns