
Evolution

How Cortex grew from a routing table to a cognitive layer — with all the wrong turns, crashes, and course corrections along the way.

This isn't a polished product history. It's an honest changelog of building infrastructure while using it daily. Every decision has context. Every pivot had a reason.


Act 0: Before Cortex — The Plugin Jungle

When: Early February 2026

Claude Code ships with a plugin ecosystem. I installed liberally: 95+ plugins active, 14 MCP servers, dozens of hooks. Docker helpers, Kubernetes managers, Terraform — none relevant to a Greek lawyer doing legal work.

One plugin alone (claude-octopus, a 30-skill persona system) injected ~4,000 tokens per session through CLAUDE.md. It required Codex and Gemini CLI binaries that weren't even installed on my machine.

The question that started everything: I have all these tools. How does Claude know which one to use?

It didn't. Claude guessed based on description competition. Sometimes it picked the right MCP server. Often it didn't. And every session started from zero — no memory of what worked yesterday.

Lesson: Existence ≠ function. Having 95 plugins installed is not the same as having a system.


Act 1: The Cowork Prototype

When: Pre-February 2026 (on Claude Desktop)

The first Cortex wasn't built for Claude Code at all. It was a meta-orchestrator for Cowork (Claude Desktop's collaborative surface):

  • Router: 10-step intent scoring pipeline. Trigger matching, anti-trigger veto, confidence tiers (autoload ≥0.8, suggest 0.6–0.79, silent <0.6), cooldown enforcement, shadow registry for uninstalled plugins.
  • Memory: Real MCP server (Node.js, sql.js SQLite, 11 tools).
  • Analyst: Progressive-depth analysis engine.

The registry existed in three forms: registry.md (human-editable), registry.json (compiled, SHA-256 hashed), digest.md (~30 lines for cheap context injection). Scoring: trigger×0.4 + domain×0.3 + entity×0.2 + context×0.1.
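The scoring and tiering described above can be sketched in a few lines. This is an illustrative reconstruction from the formula and thresholds given here, not the actual Cowork code; function and signal names are assumptions.

```python
# Sketch of the Act 1 scoring pipeline (names illustrative, weights from the doc).
def score_entry(signals: dict) -> float:
    """Combine per-entry match signals (each assumed normalized to [0, 1])."""
    return (signals.get("trigger", 0.0) * 0.4
            + signals.get("domain", 0.0) * 0.3
            + signals.get("entity", 0.0) * 0.2
            + signals.get("context", 0.0) * 0.1)

def tier(score: float) -> str:
    """Map a score onto the confidence tiers from the router design."""
    if score >= 0.8:
        return "autoload"
    if score >= 0.6:
        return "suggest"
    return "silent"
```

Note the consequence of the weights: a perfect trigger match alone (0.4) stays silent; it takes trigger plus domain agreement (0.7) just to reach the suggest tier.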

All dependencies (@modelcontextprotocol/sdk, sql.js, zod) were platform-agnostic. The core logic was portable. What wasn't: the plugin manifest format, skill activation mechanism, hook system, and command syntax — all Cowork-specific.


Act 2: The Port — From Cowork to Claude Code

When: February 13–18, 2026

Feb 13: Searched for any existing routing infrastructure in Claude Code. Nothing. Blank slate — no registries, no routing, no MCP servers configured.

Feb 17: Reverse-engineered the entire Cowork Cortex codebase (26 files). Documented the architecture for porting. Designed a 4-Layer Memory Stack, placing Cortex as Layer 4: the intelligent router that sends queries to the correct memory layer.

Feb 18: First install. Cortex v1.0.0 dropped into ~/.claude/plugins/cache/local/cortex/1.0.0/. Three skills, seven commands, one MCP server.

First gotcha: The plugin cache existed on disk, but Cortex was NOT registered in installed_plugins.json. Cache ≠ active. Classic existence-check failure.

Same day: MCP server load test passed. The SQLite-backed memory server worked fine over stdio. But the routing layer — the whole reason Cortex existed — didn't activate.


Act 3: The Big Pivot — Passive Over Active

When: February 20, 2026

This is the most significant architectural decision in Cortex's evolution.

Three options evaluated:

  1. Pure MCP — Clean, but requires explicit tool calls. No automatic routing.
  2. Pure hooks — Automatic, but hooks can't see conversation context. Limited error handling.
  3. Hybrid — UserPromptSubmit hook for automatic lightweight routing + an MCP server for interactive queries.

The hybrid won on paper. But during implementation planning, three Cowork features were permanently scrapped:

  • MCP bridge scanner — Too complex for the value delivered.
  • LLM calls in the routing hot path — Latency killer. Unacceptable in a hook that fires on every user message.
  • Context budget management — Unnecessary with the passive approach.

What was built instead: 4 passive query tools. No active routing. No interception.

cortex_digest  → pre-computed 596-token capability summary
cortex_search  → keyword matching against trigger arrays (~5ms)
cortex_lookup  → full entry details by ID
cortex_status  → registry health check
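The cortex_search mechanism is plain keyword matching, which is why it stays in the ~5ms range. A minimal sketch of the idea (registry shape and ranking are illustrative, not the shipped implementation):

```python
# Illustrative cortex_search-style matching: rank registry entries by how
# many of their trigger keywords appear in the query. No embeddings, no LLM.
def cortex_search(query: str, registry: list[dict]) -> list[dict]:
    words = set(query.lower().split())
    hits = []
    for entry in registry:
        matched = words & {t.lower() for t in entry.get("triggers", [])}
        if matched:
            hits.append({"id": entry["id"], "matches": sorted(matched)})
    # More trigger hits = higher rank.
    return sorted(hits, key=lambda h: len(h["matches"]), reverse=True)
```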

The reasoning: "Defer active routing until data proves Claude's native matching insufficient." The bet was that Claude Code's built-in description competition was good enough, and active routing would only be added if metrics showed failure.

That data never materialized. Passive routing worked. The active routing phase was never built.

What was abandoned from the original design:

| Feature | Why Dropped |
| --- | --- |
| Active routing via UserPromptSubmit hook | Claude's native routing proved sufficient |
| LLM calls in routing hot path | Latency — scrapped permanently |
| MCP bridge scanner | Too complex for the value |
| Context budget management | Unnecessary with passive approach |
| cortex_record / cortex_outcome / cortex_learn (DB write-back) | Deferred, never needed |
| Cooldown system / polite interruption protocol | Deferred |
| Shadow recommend_install flow | Not applicable in Claude Code |

Act 4: The First Registry Breakdown

When: February 20, 2026

The registry had never been compiled. The database was empty. Despite MEMORY.md claiming 163 capabilities, the actual count was 39.

What was wrong:

  • compile.js existed but had never been run
  • registry.json was empty/stale
  • MEMORY.md had a fictional count (163)
  • Legal entries were Cowork-era corporate templates (NDA, GDPR) — wrong for Greek civil law

Fix: Registry expanded from 39 → 53 entries (43 installed + 10 shadow). Legal entries flagged for replacement with Greek law entries.

Same day, worse discovery: Two Cortex MCP servers were running simultaneously:

  1. The old plugin-managed server at ~/.claude/plugins/cache/local/cortex/1.0.0/
  2. The new standalone server at ~/.claude/cortex/server/index.js

The cortex.memory.db had: 1 session, 43 capability metrics, 23 preferences, 4 anti-patterns, 0 routing decisions. The router logic existed only as pseudocode specs.

Lesson: Audit before building. The v0.6 → v1.0 roadmap was created: fix what's broken (v0.7), close the loops (v0.8), add domain intelligence (v0.9), achieve closed-loop learning (v1.0).


Act 5: The Observer — Half-Built for Months

When: February 20–24, 2026

The learning loop had a PostToolUse observer that logged tool calls to observations.jsonl. It accumulated thousands of observations. But the analysis half of the pipeline never ran — data flowed in and piled up. Nothing evolved.

Feb 24 audit found the pipeline was 50% broken:

  • Observer config set to enabled: false
  • Thousands of observations collected, never analyzed
  • All instinct directories empty
  • 41.7% duplicate entries — the observe hook was registered on BOTH PreToolUse and PostToolUse. PreToolUse events have no output. Pure waste.

Root cause: settings.json had observe.sh on both hook events.

Fix: PreToolUse hook removed. 270 tool_start ghost entries cleaned. Each tool now fires exactly one observation.
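The cleanup amounts to dropping any record that lacks tool output. A one-off sketch of that pass (field names like event and output are assumptions about the observations.jsonl schema, not confirmed):

```python
import json

# One-off cleanup sketch: drop PreToolUse "ghost" records — entries with no
# tool output — keeping exactly one record per completed tool call.
def clean_observations(lines: list[str]) -> list[str]:
    kept = []
    for line in lines:
        rec = json.loads(line)
        if rec.get("event") == "tool_start" or rec.get("output") is None:
            continue  # ghost entry from the duplicate PreToolUse registration
        kept.append(line)
    return kept
```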

Feb 25: Second bug in observe.sh. Line 98 wrote None for the input field on PostToolUse even though the data was in the hook JSON. Fix: input captured properly, enabling analyze.py's file path extraction.


Act 6: The Learning Loop Actually Closes

When: February 26, 2026 (B1/C1 phase of the evolution master plan)

Deep audit of analyze.py (495 lines, 14 functions) revealed why patterns never promoted:

  • Error confidence capped at 0.8 (line 214: min(0.3 + retries * 0.03, 0.8)). Promotion gate was 0.9. Mathematical impossibility.
  • File pattern confidence flat at 0.4. Also never promotes.
  • Clustering: designed but zero code after build_instincts() at line 246.
  • auto_evolve: false — the curator was explicitly disabled.
  • 4 CLI commands documented in SKILL.md, 0 built.
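The "mathematical impossibility" is easy to demonstrate: the confidence formula is hard-capped below the promotion gate, so no amount of evidence could ever promote an error pattern.

```python
# The error-confidence formula from analyze.py line 214, reproduced verbatim,
# against the 0.9 promotion gate it could never reach.
def error_confidence(retries: int) -> float:
    return min(0.3 + retries * 0.03, 0.8)

PROMOTION_GATE = 0.9
```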

The C1 fix:

  • Config loading wired up properly
  • Confidence caps raised (error: 0.8 → 0.95, file: 0.4 → dynamic)
  • Jaccard + union-find clustering implemented
  • promote_to_rules / decay_stale_instincts pipeline built
  • auto_evolve: true activated
  • 16/16 tests passing
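The Jaccard + union-find combination can be sketched compactly. This is a minimal illustration of the approach, not the C1 code — the threshold and the token-set representation of patterns are assumptions:

```python
# Minimal sketch of C1-style clustering: Jaccard similarity over pattern
# token sets, union-find to merge similar patterns into clusters.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(patterns: list[set], threshold: float = 0.5) -> list[int]:
    parent = list(range(len(patterns)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(patterns)):
        for j in range(i + 1, len(patterns)):
            if jaccard(patterns[i], patterns[j]) >= threshold:
                parent[find(i)] = find(j)  # union similar patterns
    return [find(i) for i in range(len(patterns))]
```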

First auto-promotion: Tool chain patterns (Bash→Bash→Bash seen 120x, Read→Read→Read seen 38x) promoted to ~/.claude/rules/generated-rules.md — automatically loaded every session.


Act 7: The ChromaDB Reckoning

When: February–March 2026

The memory plugin (claude-mem) shipped with ChromaDB for vector search. For weeks, everyone assumed it worked.

The discovery: ChromaDB stored thousands of embeddings at 384 dimensions. The MCP search interface didn't use them. Every search went through FTS5 keyword matching only. The vector layer was structurally present and functionally dead.

The decision (D4.1 in the evolution coordination doc):

Three options:

  1. Fix ChromaDB integration
  2. Replace with sqlite-vec
  3. Migrate to external service

sqlite-vec won: same SQLite file (no separate process), native FTS5 hybrid, 1024-dim BGE-M3 embeddings for proper multilingual support (the 384-dim ChromaDB vectors were too small for Greek text anyway — dimension upgrade would require re-embedding everything regardless).

ChromaDB: thousands of embeddings deleted. Archived at ~/.claude-mem/archive/.

sqlite-vec: hundreds of vectors at 1024-dim, over a thousand semantic links via KNN (cosine > 0.75), temporal decay per collection (30d for memory, 90d for cases, 365d for templates, null for evergreen).
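The per-collection decay maps directly to an exponential half-life, with evergreen collections exempt. A sketch using the half-lives listed above (function name illustrative):

```python
import math

# Per-collection temporal decay: exponential half-life scoring, with
# evergreen collections (half_life=None) exempt from decay.
HALF_LIFE_DAYS = {"memory": 30, "cases": 90, "templates": 365, "evergreen": None}

def decayed_score(score: float, age_days: float, collection: str) -> float:
    half_life = HALF_LIFE_DAYS.get(collection)
    if half_life is None:
        return score  # evergreen: no decay
    return score * math.exp(-math.log(2) / half_life * age_days)
```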


Act 8: The RAG Pipeline — 20 Servers Evaluated, Custom Built

When: February–March 2026

Before building a custom RAG pipeline, 20+ existing MCP servers were evaluated: GNO, mcp-local-rag, gnosis-mcp, RAGLite, kb-mcp-server, knowledge-mcp (LightRAG), and others.

None satisfied all constraints: DOCX support + Greek language + hybrid search + local-only + MCP protocol.

Fatal gaps:

  • gnosis-mcp: no DOCX (eliminates hundreds of legal templates)
  • GNO: requires Bun runtime
  • RAGLite: requires Pandoc
  • All cloud options: disqualified (legal documents stay local, non-negotiable)

The embedding model journey:

nomic-embed-text-v2-moe (768-dim) → demoted for weak Greek → BGE-M3 (1024-dim, 8K context, best multilingual) became the winner. The Greek-specific stsb-xlm-r-greek-transfer was considered but killed by 512-token context limit (too short for legal documents).

Zone-based chunking insight (late addition): 60% of legal template tokens are waste — placeholder dots, identical headers, near-duplicate forms. Hundreds of power-of-attorney forms are 70–85% identical. Solution: Strip+Tag+Embed — segment into HEADER/PARTIES/FACTS/LAW/REQUEST/BOILERPLATE zones, embed only the semantic zones with a metadata prefix. Raw corpus of millions of tokens → ~5M cleaned tokens.
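A hedged sketch of the Strip+Tag+Embed idea. The zone classification itself is the hard part and is assumed done upstream; the placeholder regex and skip list here are illustrative, not the production rules:

```python
import re

# Strip+Tag+Embed sketch: given (zone, text) segments, drop boilerplate
# zones, strip placeholder runs, and prefix each kept chunk with its zone
# tag so the embedding carries structural metadata.
SKIP_ZONES = {"HEADER", "BOILERPLATE"}
PLACEHOLDER = re.compile(r"\.{4,}|_{4,}")  # runs of filler dots/underscores

def embeddable_chunks(zoned_segments: list[tuple[str, str]]) -> list[str]:
    chunks = []
    for zone, text in zoned_segments:
        if zone in SKIP_ZONES:
            continue  # structurally present, semantically empty
        cleaned = PLACEHOLDER.sub(" ", text).strip()
        if cleaned:
            chunks.append(f"[{zone}] {cleaned}")  # metadata prefix
    return chunks
```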

The Ollama crash (still unresolved): Everything depended on Ollama for local embeddings. Ollama 0.17.x introduced MLX Metal initialization that crashes unconditionally on M1 Air — SIGABRT before serving. CPU fallback also fails (MLX init is unconditional). Current: 49% vectorized (roughly half the chunks). The remaining 51% blocked on a broken binary.


Act 9: The Crash That Changed Everything

When: March 2, 2026, 2:29 AM

macOS OOM killed 4 concurrent Claude Code sessions. 4 × ~1.5GB = ~6GB on an 8GB M1 Air.

What the crash revealed about the system:

The healthcheck reported 51/51 PASS while three critical systems were non-functional:

  1. claude-mem MCP search: returning empty — chromaStrategy=null, no FTS5 fallback
  2. Ollama: SIGABRT on launch (MLX crash)
  3. PreCompact hook: reading a conversation key that doesn't exist (the actual key: transcript_path). It had never worked — and passed healthcheck.

"You have existence checks, not functional tests." — SRE expert, post-crash analysis

Data casualty: A Google Drive restore was running mid-crash. DCIM photos recovered (4,314 files confirmed). Munich trip photos: 0 files, unrecoverable unless restored from Google's trash before ~April 1 purge.

The non-destructive rules codified from this crash:

  1. Never delete without verified recovery path
  2. Verify identity by CONTENT (hash), not just name/path
  3. Context matters — folder structure carries meaning
  4. Override requires explicit per-action user confirmation
  5. A plan is a hypothesis, not authority — re-verify at execution time

The context rot discovery (same session cluster): Softmax attention dilution at high token counts = structural amnesia. The fix: re-inject critical rules near the recency position every N turns (300 tokens, forced eval snippet, exploiting recency bias against RoPE geometric decay). Cost: zero engineering.


Act 10: The Master Plan — Chaos to Architecture

When: February 26, 2026

After 82+ sessions of organic growth, the system had accumulated 5 memory layers, 12 MCP servers, 6 hooks, 36 plugins, RAG at 49%, broken plugin hooks, and auto_evolve: false.

The evolution master plan (starry-chasing-flask.md): ~38 hours across 12–15 sessions.

Key planning pivot: The original approach was parallel plan-then-implement cycles. Changed to strict sequential: all 5 B-phase planning sessions complete before any C-phase implementation begins. Reason: early implementation was making decisions that conflicted with later planning insights.

What each phase added:

| Phase | Focus | Key Deliverables |
| --- | --- | --- |
| B1/C1 | Learning Loop | Fixed confidence caps, added clustering, promotion pipeline, temporal decay, auto_evolve: true |
| B2/C2 | Session Lifecycle | Stop hook, enhanced PreCompact, /refresh-streams skill, rotation automation |
| B3/C3 | Personas | Evaluated 29 octopus personas (kept all — zero passive cost), created Greek legal persona, /persona command |
| B4/C4 | Memory Architecture | ChromaDB → sqlite-vec, observation_links table, temporal decay per collection, semantic KNN linking |
| B5/C5 | Cross-Layer Search | Unified Search MCP: Promise.allSettled + 2s timeout/layer, RRF k=60, Jaccard MMR λ=0.7, 5-min cache |
| C5.1 | Audit & Fixes | Injection detection (5 regex patterns), XML memory wrapping, memory result trust boundaries |

All phases complete as of March 2026.
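Reciprocal Rank Fusion, the merge step in the C5 Unified Search, is small enough to show in full. A sketch with k=60 as in the table above (layer names and data shapes are illustrative):

```python
# Reciprocal Rank Fusion: each layer contributes 1/(k + rank) per document;
# documents ranked well by multiple layers float to the top.
def rrf_fuse(rankings: dict[str, list[str]], k: int = 60) -> list[str]:
    """rankings: per-layer ranked lists of doc ids -> fused ranking."""
    scores: dict[str, float] = {}
    for ranked in rankings.values():
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The large k (60) flattens the contribution curve, so appearing in several layers' lists outweighs being rank 1 in a single layer.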


Act 11: Stealing from OpenClaw

When: February 25, 2026

OpenClaw: 100K+ stars, 675K LOC TypeScript, 23 lifecycle hooks, 52 skills. Analyzed for stealable patterns.

Adopted immediately:

  • Injection detection: 5 regexes blocking "ignore instructions", XML tags, tool invocation patterns in memory retrieval paths
  • Memory XML wrapping: <relevant-memories> tag + "untrusted historical data" label + HTML entity escaping
  • Temporal decay formula: score *= exp(-ln(2)/30 * ageInDays) — adopted verbatim (evergreen files exempt)
  • detect-secrets baseline for the ~/.claude git repo
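The injection screen can be sketched as a handful of compiled patterns applied to every retrieved memory before it re-enters context. These regexes are examples in the spirit of the adopted set, not the actual five patterns:

```python
import re

# Illustrative injection screen for retrieved-memory text. The real system
# uses 5 patterns; these three are example stand-ins.
SUSPECT = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.I),
    re.compile(r"</?\s*(system|assistant|tool)\b", re.I),  # role/XML tags
    re.compile(r"<\s*invoke\b", re.I),                     # tool invocation
]

def flag_injection(memory_text: str) -> bool:
    """True if the retrieved text looks like a prompt-injection attempt."""
    return any(p.search(memory_text) for p in SUSPECT)
```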

Deferred:

  • Progressive disclosure (SKILL.md < 500 lines + references/ on-demand)
  • Auto-capture triggers (preference/decision detection from conversation text)
  • Embedding cache (SQLite LRU)

Gap comparison at time of analysis vs. now:

| Feature | Feb 25 State | Current State |
| --- | --- | --- |
| Hooks | 6 | 8 |
| Temporal decay | None | Implemented (C4) |
| Injection detection | None | 5 patterns (C5.1) |
| MMR diversity | None | Jaccard λ=0.7 (C5) |
| XML memory wrapping | None | Implemented (C5.1) |
| Progressive disclosure | Full skill load | Still full load |
| Auto-capture | Manual | Partially automated (PreCompact) |

Act 12: The Neuroscience Connection

When: February 25, 2026 (Chapter 1), March 4, 2026 (Chapters 2–3)

The brain metaphor wasn't part of the original design. It was discovered after the system was already built.

Reading Buschman et al. (2025) and Artem Kirsanov's explainer ("Why the Brain Doesn't Start From Scratch"), I realized: the system I'd built for practical reasons — routing, suppression of irrelevant context, pattern promotion, session lifecycle — mapped almost perfectly to known brain architecture.

The recognition was convergent, not deliberate. Same constraints (limited working memory, need for context routing, value of not starting from scratch) → same solutions.

Chapter 1 (Composition) was already built: Cortex routing = thalamic relay, domain rules = gain control suppression, agent delegation = dynamic routing ("railroad switch" in the Buschman paper).

Chapter 2 (Consolidation) was the identified gap. An archive quality audit proved it empirically:

  • Feb 24–26 done blocks (with consolidation): narrative + decisions + reasoning, 8–9/10
  • Mar 1–3 done blocks (crisis mode, no ritual): pure checklists, 4–5/10

The /bye skill was designed to fill this gap: hippocampal replay → session summary, episodic→semantic transfer → claude-mem persistence, pruning → session rotation.

Chapter 3 (Activation) existed already via Synergatis + SessionStart hooks, but wasn't framed as such until the three-chapter model crystallized.


The Meta-Patterns

Looking across 100+ sessions of building this system, five patterns recur:

1. Existence checks masquerading as functional tests. ChromaDB "working" but unused. Healthcheck passing while 3 systems broken. Observer "running" but never evolving. The system consistently accumulated the appearance of functionality before the reality of it.

2. Organic growth followed by consolidation crisis. 82+ sessions of additive work created extraordinary breadth and significant debt. The evolution master plan was the consolidation response. The system had to be planned after being built.

3. Plans survived crashes; execution didn't. Every crashed session had a plan file on disk. The recovery protocol: resume = read plan + execute, no re-planning. Persistent state on disk is the only reliable thing.

4. The AuDHD hyperfocus trap. Session overreach was consistent: 4 crashed sessions each had 3–5 major tasks running concurrently. The ADHD coach framing: "The overreach is hyperfocus-within-scope (gravity, not escape). Fix the plumbing before adding more fixtures."

5. Convergent evolution with neuroscience. The brain metaphor wasn't imposed — it was discovered. The same constraints produce the same solutions whether you're a primate prefrontal cortex or a CLI tool managing 14 MCP servers.


Current State (March 2026)

Living system (daily production):

  • 60+ registry entries across 12 namespaces
  • 8 hooks across full Claude Code lifecycle
  • Learning loop: hundreds of observations, dozens of instincts, auto-promotion active
  • 5-layer memory: MEMORY.md → claude-mem (hundreds of observations + hundreds of vectors + over a thousand semantic links) → Obsidian (dozens of notes) → Cortex registry → Unified Search (RRF fusion)
  • RAG: thousands of docs and chunks, partially vectorized (blocked on Ollama crash)
  • 12 Greek law tools in production
  • Dashboard: 16 sections, dual-tier polling, standalone PWA

This repo (v0.1.0):

  • Working cortex init CLI (end-to-end initialization)
  • 4-tool MCP server (raw JSON-RPC, zero dependencies)
  • 3 installable hooks (session-start, observe, prompt-router)
  • Starter capability registry (5 entries — your production system will grow from here)
  • Architecture documentation + config schema + domain pack format spec

Next: Extract learning loop (analyze.py → Node.js). Generalize hooks to use $CORTEX_HOME. Add tests. Collapse 9 package stubs to ~4 active packages.


Every wrong turn taught something. The crashes were the best audits. The system that exists today is the sum of everything that broke.