Your agent is fluent. It's also confident when it shouldn't be, stops at the first plausible explanation, and fails silently inside a well-formed response. These failures don't throw exceptions. They ship to production.
Ejentum exists because the reasoning gap is structural, not informational. RAG solved what models know. Nobody solved how they think. We built for that.
RA²R (Reasoning Ability-Augmented Retrieval) retrieves engineered cognitive operations (not documents, not facts) and injects them into an LLM's context at inference time. 679 operations across four harnesses. One API call. The model's next output reflects a different reasoning structure before the first token generates.
Your agent sends a task description. The Ejentum API returns a structured cognitive injection with six labeled blocks:
- PROCEDURE: the natural-language steps the model should follow
- REASONING TOPOLOGY: the same procedure as an executable DAG (Sn:steps, Gn{gates}, M{meta-nodes}, FREEFORM{exits})
- COGNITIVE PAYLOAD: named signals —
Amplify:to activate,Suppress:to block, plus cognitive style and elasticity - FALSIFICATION TEST: the verification criterion the model checks its output against
- NEGATIVE GATE: the failure pattern to avoid, stated as a concrete scenario the model recognizes
- TARGET PATTERN: what correct reasoning looks like
The injection is inserted into the agent's context window before the task. The suppression signals are the core mechanism. They constrain the model's output space multiplicatively: each named failure mode eliminates an entire branch of incorrect reasoning. In our testing, suppression consistently outperforms amplification alone.
Two delivery modes per harness:
- Dynamic — single retrieval. The highest-scoring operation comes back as-is. Predictable, fast.
- Adaptive — top-k retrieval, then an adapter LLM rewrites the operation with task-specific identifiers from your query. The procedure steps and topology nodes reference YOUR problem, not the canonical example.
curl -X POST "https://api.ejentum.com/harness/" \
-H "Authorization: Bearer YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "Our conversion rate dropped 40% after the checkout redesign. Determine whether the redesign caused the drop or if something else changed at the same time.",
"mode": "reasoning"
}'[PROCEDURE]
Step 1: Identify the intervention, variable, or agent whose impact is being
attributed. Step 2: Simulate the counterfactual world where ONLY that element
is absent while all correlated variables are held at their natural values.
Step 3: Compute the difference between the factual and counterfactual outcome
distributions. Step 4: Check SUTVA (Stable Unit Treatment Value Assumption)
and flag violations where removal of the target variable changes the values
of the supposedly held-constant covariates. Step 5: Reject attribution claims
that do not survive the isolation test, where a proposed cause is actually a
proxy for a correlated variable that carries the true causal load.
[REASONING TOPOLOGY]
N{accept_attribution_based_correlation_alone} → FORK:
├→ S1a:simulate_factual(with_intervention) → ACTUAL[]
└→ S1b:simulate_counterfactual(without_intervention,hold_covariates) → CONTROL[]
→ JOIN → C{causal_impact = outcome(factual) - outcome(counterfactual)}
→ S2:apply_causal_impact → S3:check_SUTVA
→ G1{covariates_shift?} --yes→ FLAG:contaminated_isolation → OUT:invalid_attribution
--no→ G2{attribution_survives?}
--yes→ OUT:confirmed_impact
--no→ REJECT:proxy_for_correlated_var
[COGNITIVE PAYLOAD]
Amplify: potential outcomes simulation; counterfactual world construction; SUTVA enforcement
Suppress: correlation based attribution; shared variance credit diffusion
Cognitive Style: counterfactual reasoning
Elasticity: coherence=optimal isolation, expansion=adversarial perturbation
[FALSIFICATION TEST]
If the effect of a specific variable is estimated without verifying
controlling for correlated co-variables, impact isolation was not performed.
[NEGATIVE GATE]
The sales increase is attributed to the new marketing campaign, ignoring
correlated economic factors and treatment spillover.
[TARGET PATTERN]
Apply potential-outcomes simulation: construct the counterfactual world
where the intervention is absent, hold covariates at natural values,
and verify that the measured effect survives SUTVA checks before
accepting any causal claim.
The agent absorbs this before answering. Instead of accepting "the redesign caused the drop" on temporal proximity alone, it constructs a counterfactual world where the redesign did not happen, checks whether other variables (traffic source, device mix, pricing, market conditions) shifted in the same window, and either confirms the impact or rejects the attribution as a proxy for a correlated variable.
Eight modes across four harnesses:
| Mode | Harness | Delivery | Best for |
|---|---|---|---|
| reasoning | Reasoning Harness | Dynamic | Single-domain reasoning, predictable injection cost |
| adaptive-reasoning | Reasoning Harness | Adaptive | Task-specific reasoning with named entities, variables, constraints |
| code | Code Harness | Dynamic | Code generation, refactoring, architecture |
| adaptive-code | Code Harness | Adaptive | Multi-file changes where the operation needs to reference specific symbols |
| anti-deception | Anti-Deception Harness | Dynamic | Sycophancy, hallucination, prompt injection prevention |
| adaptive-anti-deception | Anti-Deception Harness | Adaptive | Targeted attack-surface protection for a specific prompt or domain |
| memory | Memory Harness | Dynamic | State tracking, perception sharpening |
| adaptive-memory | Memory Harness | Adaptive | Multi-turn drift monitoring against named anchor concepts |
Retrieval under 1 second for dynamic, a few seconds for adaptive (one adapter-LLM round-trip). If the API is unreachable, your agent continues on native reasoning. Enhancement, not dependency.
Why not just write better prompts? Prompt engineering is O(n) work. Every new scenario needs a new prompt. Every model update risks breaking what worked. Ejentum selects from 679 cognitive operations across 4 harnesses at runtime. Zero prompt maintenance. One API call adapts to any task.
Why not fine-tune? Fine-tuning burns reasoning into weights. It costs weeks, requires training data per domain, and must be repeated when the base model updates. The reasoning harness operates at the prompt level. No training, no data collection, no retraining cycle. Deploy in minutes, not months.
Why not RAG? RAG retrieves information. Ejentum retrieves reasoning. Your agent already has the data. The problem is how it thinks about that data. RAG gives the agent more to read. Ejentum gives it a better way to think. Use both.
Why not chain-of-thought? Chain-of-thought makes reasoning visible. Ejentum governs which reasoning steps to take and which failure modes to block. CoT says "show your work." Ejentum says "do this work, not that work." They are complementary. Use both.
Why not just use a stronger model? We tested on Claude Opus 4.6 with maximum-effort extended thinking. The strongest reasoning model available. It scored 85.7% on hard competitive programming. Four tasks failed: two to reasoning spirals where the model thought for 600-1200 seconds and produced zero code. One API call per task fixed all four. The model had the knowledge. It lacked the convergence discipline. A stronger engine with uncalibrated steering still crashes.
The original product. Addresses reasoning failures across extended execution chains. Each operation is a structured protocol with an execution topology, suppression vectors, amplification signals, and graph edges connecting it to other operations. Organized by class of cognitive failure, not by subject matter.
| Dimension | Operations | The failure it addresses |
|---|---|---|
| Causality | 52 | Reverses the direction of causation. Treats correlation as mechanism. |
| Temporal | 51 | Compresses time. Confabulates timelines. Anchors to optimistic durations. |
| Spatial | 51 | Violates boundaries. Routes through walls. Ignores structural constraints. |
| Simulation | 52 | Can't hold a counterfactual. The real world bleeds into the hypothetical. |
| Abstraction | 51 | Collapses categories. Specifics absorb into generalities. Metaphor becomes mechanism. |
| Metacognition | 54 | Doesn't know it's wrong. Confidence is uniform. Errors reinforce errors. |
Blocks sycophancy, hallucination, social engineering, and prompt injection. Tested cross-model on GPT-4o.
- ELEPHANT sycophancy benchmark: 5.8% composite sycophancy across 40 real Reddit scenarios
- Adversarial social engineering: Detected at Turn 6 in a 20-turn adaptive attack. 27/30 blind evaluation
- Hallucination prevention: Zero hallucinations across 5 fabrication tests
Engineering operations for code generation, refactoring, security auditing, and architecture. Tested on competitive programming and scientific computing.
- LiveCodeBench Hard: 85.7% to 100% on 28 hard AtCoder tasks. Zero regressions
- SciCode: Zero bugs on 10 hard scientific computing problems with dual injection (reasoning + code stacked)
Perception sharpening and behavioral calibration. The model detects what changed, not just what was said. Two-pass protocol: observe first (free perception), then sharpen (focused injection).
- State tracking: 50% fewer stale facts served as current across 20-turn conversations
- Perceptual detection: 3x signal detection rate in coaching conversations
- Blind evaluation: 4.1/5 vs 3.5/5. The evaluator independently named the core insight: "Retention without updating is a liability"
Browse all 679 operations: ejentum.com/abilities
Headline numbers (external, binary, reproducible):
- LiveCodeBench Hard: 85.7% → 100% on 28 hard AtCoder tasks. Zero regressions across three batches.
- SciCode: 7 → 0 correctness bugs on 10 hard scientific computing problems. Blind evaluator chose injection on 10/10.
- ELEPHANT sycophancy: 5.8% composite across 40 real Reddit scenarios. Cross-model validated on GPT-4o.
- Memory state tracking: 50% fewer stale facts served as current across 20-turn conversations. Blind eval 4.1/5 vs 3.5/5.
Eight benchmarks across four harnesses. Two-stage blind protocol on reasoning and memory. Exact-match on code. Cross-model validation on anti-deception and memory (GPT-4o). Source data for every number is in the benchmarks repo.
LiveCodeBench Hard: 28 hard competitive programming tasks. Designed by an independent research team.
| Condition | Passed | Rate |
|---|---|---|
| Baseline (Opus max effort) | 24/28 | 85.7% |
| With Ejentum injection | 28/28 | 100.0% |
+14.3pp. Four tasks gained. Zero lost. Zero regressions across three independent batches.
SciCode: 10 hard scientific computing problems. Externally published benchmark. Dual injection (reasoning + code modes stacked).
| Condition | Correctness Bugs | Blind Evaluation |
|---|---|---|
| Baseline (Opus 4.6) | 7 bugs | (not scored) |
| With dual injection | 0 bugs | 10/10 chose injection |
ELEPHANT: 40 real Reddit sycophancy scenarios. Externally published benchmark. Cross-model validation: these results are on GPT-4o, not Claude.
| Metric | With Injection |
|---|---|
| Composite sycophancy | 5.8% |
| Framing sycophancy | 7.5% |
| Social engineering detection (20-turn adaptive) | Turn 6 of 20 |
| Hallucinations (5 fabrication tests) | 0 |
BBH / CausalBench / MuSR: 70 published academic tasks. Externally designed by independent research teams.
| Factor | Baseline | With Injection | Change |
|---|---|---|---|
| Composite | 0.476 | 0.684 | +20.8pp |
| Self-Monitoring | 0.74/3.0 | 1.73/3.0 | +134% |
| Correctness | 2.19/3.0 | 2.33/3.0 | +6% |
EjBench: 180 custom professional tasks. Designed by Ejentum as a harder stress test after external validation. Same blind protocol as BBH above. Reported here for transparency despite custom design.
| Factor | Baseline | With Injection | Change |
|---|---|---|---|
| Composite | 0.621 | 0.722 | +10.1pp |
| Self-Monitoring | 0.94/3.0 | 1.81/3.0 | +92% |
| Verification | 1.50/3.0 | 2.16/3.0 | +45% |
20-turn Vantage scenario: implicit state changes. Conversation designed to silently walk back Turn-1 facts by Turn 19. Measures whether the agent detects the shifts or serves stale facts.
| Metric | Baseline | With Injection | Change |
|---|---|---|---|
| Stale facts served as current | 1.6 | 0.8 | -50% |
| Perceptual detection rate (Morgan 15-turn) | 14% | 43% | 3x |
| Blind evaluation (5-point) | 3.5 | 4.1 | +17% |
25 interactive reasoning steps. Neither condition cleared Level 0 (baseline rate on ARC-AGI-3 is <1% across all frontier models). This is not a headline result; it is evidence that the injection persists and compounds over extended execution, not a claim of task completion.
| Metric | Baseline | Augmented |
|---|---|---|
| Memory decay slope | -0.005 (degrading) | +0.014 (improving) |
| Injection half-life | 0 | 24 steps (entire game) |
| Reasoning depth trend | 0.86 | 10.50 (12.2x) |
Full reports, raw traces, generation outputs, judgment scores: ejentum/benchmarks
Ejentum is built by Frank Brsrk (Franko Luci). Self-taught AI developer. Solo founder. One person engineering the operations, running the benchmarks, shipping the system.
When results disappoint, they're published alongside the improvements. When the most complex mode underperformed the simplest one, it was deprecated and documented. When 62% of tasks received the wrong reasoning domain and improvements persisted anyway, that was published too, because the finding was more valuable than the embarrassment.
Based in Athens, Greece. Contact: info@ejentum.com.
One REST endpoint, or a native package for your stack. Each shim exposes 8 tools (4 harnesses × dynamic + adaptive).
| Surface | Install | Notes |
|---|---|---|
| MCP Server (Cursor, Claude Code, Windsurf, Continue, Cline, n8n MCP node) | npx -y ejentum-mcp |
Listed on Smithery, Glama, mcp.so, PulseMCP, Cline Marketplace, Continue Hub, Docker MCP Registry, mcpservers.org, and the official MCP Registry |
| Anthropic Claude Code plugin directory | Install from the directory UI | Published 2026-05-22 |
| n8n community node | npm install n8n-nodes-ejentum |
8 operations as a native n8n node, usableAsTool for the n8n AI Agent |
| CrewAI | pip install crewai-ejentum |
EjentumHarnessTool with mode arg |
| Agno | pip install agno-ejentum |
EjentumTools Toolkit with 8 methods |
| PydanticAI | pip install pydantic-ai-ejentum |
EjentumToolset(FunctionToolset) with closure-based config |
| HuggingFace smolagents | pip install smolagents-ejentum |
8 Tool subclasses + ejentum_tools() factory |
| LangChain | pip install langchain-ejentum |
8 BaseTool subclasses + EjentumTools factory |
| Letta | pip install letta-ejentum |
8 plain functions serialised for Letta's sandbox |
| Microsoft AutoGen | pip install autogen-ejentum |
8 async closures (httpx + respx tested) |
| Vercel AI SDK | npm install ejentum-ai |
Tool object factory for the AI SDK loop |
| Mastra | npm install ejentum-mastra |
Mastra tool definitions |
| LangGraph.js | npm install ejentum-langgraph |
Node factory for LangGraph graphs |
| Firebase Genkit | npm install ejentum-genkit |
defineTool registrations |
| OpenAI Agents SDK (function tool) | openai_agents_tool.py |
Single function tool for the Responses API |
| OpenAI Agents SDK (MCP transport) | openai_agents_streamable_http.py |
Streamable-HTTP MCP transport variant |
| Anthropic Claude SDK | claude_agent_sdk.py |
tool_use definition for Claude agents |
| Zed editor | Install Ejentum from Zed Extensions | Rust + Wasm extension wrapping ejentum-mcp |
| Python (custom integration) | basic.py |
Minimal POST + inject |
| TypeScript (custom integration) | basic.ts |
Minimal fetch + inject |
| Skill Files (Cursor, Windsurf, Claude Code, Codex) | skill files | Drop in ~/.claude/skills/ |
| n8n workflow JSON | ejentum_workflow.json |
Importable visual workflow |
| curl | single.sh |
Raw HTTP one-liner |
All examples: ejentum/examples
Ejentum is designed as a non-critical enhancement layer. The Ejentum API sits outside your agent's critical path: if it's unreachable, your agent continues on native reasoning.
- Graceful degradation is first-class. On timeout, 5xx, or rate limit, your agent continues on native reasoning. The API never blocks agent execution. See graceful_degradation.py for the production pattern.
- Recommended timeout: 5 seconds for dynamic modes, 15 seconds for adaptive modes (adapter LLM adds a round-trip).
- Rate limit: 100 requests/minute across all tiers.
- Cost impact: injection adds ~400–900 tokens to each augmented LLM call. This adds cost to your LLM bill; size depends on your model provider's rates.
- Data handling: task descriptions sent via the
queryparameter are processed for operation retrieval only. EU hosting. Contact info@ejentum.com for data processing questions.
| Free trial | Go | Super | |
|---|---|---|---|
| Price | Free (30 days) | €5/month | €25/month |
| Dynamic calls | 1,000 (trial total) | 1,000/month | 5,000/month |
| Adaptive calls | — | 250/month | 1,500/month |
| Modes | Dynamic reasoning only | All 8 modes | All 8 modes |
| Rate | 100/min | 100/min | 100/min |
No card required for the 30-day trial. Get your API key · Full pricing.
We built injections to prevent reasoning shortcuts. We did not build them to teach new reasoning modalities. But during testing, behaviors emerged that we did not engineer:
Spontaneous domain shift. At step 15 of a 25-step ARC-AGI-3 spatial navigation game, the augmented agent stopped reasoning in natural language and switched to symbolic mathematical notation. It defined formal variables, computed coordinates algebraically, and reasoned about movement vectors. The suppression signal (start_end_only_thinking) constrained a failure mode. The agent found its own solution to that constraint. The suppression was the pressure. The math was the agent's adaptation.
Emergent query evolution. The agent's queries to Ejentum improved systematically over 25 steps, from generic ("help me reason about this game") to spatially precise ("wall barrier at x=29-33 for rows 30-39 blocks rightward movement") to metacognitive ("robot appears stuck after 5 turns"). Nobody instructed query format or specificity. The agent learned to be a better user of the reasoning harness through practice, not instruction.
The tense test. In Memory Harness testing, a fact established in Turn 1 was implicitly walked back by Turn 19 without explicit correction. The baseline agent's scratchpad still said "Rust is their competitive advantage" (present tense). The augmented agent's said "Rust was initially considered their competitive advantage" (past tense). One word. The difference between accurate memory and stale memory.
These are observations from traced data, not marketing claims. Full step-by-step traces are in the benchmarks repo. We report them because they suggest something about how suppression signals interact with model behavior that we don't fully understand yet.
What the data demonstrates:
- Structured cognitive injection produces measurable behavioral changes across eight independent benchmarks, multiple model sizes, and single-turn, multi-step, code generation, anti-deception, and memory settings.
- Suppression signals work even when the retrieved operation comes from a mismatched domain (retrieval precision was 38%; improvements persisted).
- Injections persist across extended execution chains (half-life: 24 steps) and compound rather than degrade.
- On competitive programming, the injection improved correctness from 85.7% to 100% with zero regressions.
- On scientific computing, dual injection (reasoning + code stacked) eliminated all 7 correctness bugs.
- On sycophancy, the injection reduced composite sycophancy to 5.8% across 40 real scenarios, validated cross-model on GPT-4o.
- On memory, the injection halved stale facts served as current, confirmed by blind evaluator.
What we haven't proven:
- Full cross-model generalization. Anti-deception is validated on GPT-4o. Reasoning and code benchmarks are Claude-only. Multi-model testing is in progress.
- Ablation of suppression vs. amplification. We observe that suppression consistently outperforms amplification. We have not published a controlled ablation isolating the two mechanisms.
- Human evaluation at scale. Our evaluations use a two-stage blind protocol with LLM-as-judge. Human evaluation on a subset would strengthen the evidence.
We publish limitations because the work should be evaluated on what it demonstrates, not what we believe it will demonstrate. Every claim traces to a data file. The benchmarks repo contains the raw data so you can verify independently.
| Repo | What's inside |
|---|---|
| ejentum-mcp | MCP server exposing 8 cognitive operations (4 harnesses × dynamic + adaptive) as MCP tools. One-click install via Smithery, Glama, or mcp.so. MIT. |
| agent-teams | Multi-agent teams (orchestrator + specialists, each with a mode-specific harness). Adversarial code review and eval workflows shipped. MIT. |
| eval | Open-source eval workflows (n8n + Python) for A/B testing tools, harnesses, prompts. Blind judges, fairness guarantees, sample results. MIT. |
| benchmarks | Full reports, raw traces, generation outputs, judgment scores, system prompts. Eight benchmark suites. CC BY 4.0. |
| docs | Documentation pages. Quickstart to advanced method. Same content as ejentum.com/docs. |
| examples | Integration examples. Python, TypeScript, curl, skill files, n8n workflow, no-code guides. MIT. |
| builders_playbook | 28 screenshots from real work sessions. Field notes from inside the IDE. |
Standalone packages on PyPI and npm, one per host framework. Each speaks the host framework's native idiom rather than mirroring the previous shim. Each ships 8 tools (4 harnesses × dynamic + adaptive).
| Package | Registry | Repo |
|---|---|---|
crewai-ejentum |
PyPI | crewai-ejentum |
agno-ejentum |
PyPI | agno-ejentum |
pydantic-ai-ejentum |
PyPI | pydantic-ai-ejentum |
smolagents-ejentum |
PyPI | smolagents-ejentum |
langchain-ejentum |
PyPI | langchain-ejentum |
letta-ejentum |
PyPI | letta-ejentum |
autogen-ejentum |
PyPI | autogen-ejentum |
llama-index-tools-ejentum |
GitHub (PyPI registration pending) | llama-index-tools-ejentum |
ejentum-ai |
npm | ejentum-ai |
ejentum-mastra |
npm | ejentum-mastra |
ejentum-langgraph |
npm | ejentum-langgraph |
ejentum-genkit |
npm | ejentum-genkit |
n8n-nodes-ejentum |
npm | n8n-nodes-ejentum |
Plus the integrations repo (n8n workflows + editor skill files) and zed-ejentum-mcp (Zed extension).
Ejentum. Reasoning Harness for Agentic AI.
Where RAG retrieves what the model should know, RA²R retrieves how it should think.



