
Evolution

How Cortex grew from a routing table to a cognitive layer — with all the wrong turns, crashes, and course corrections along the way.

This isn't a polished product history. It's an honest changelog of building infrastructure while using it daily. Every decision has context. Every pivot had a reason.


Act 0: Before Cortex — The Plugin Jungle

When: Early February 2026

Claude Code ships with a plugin ecosystem. I installed liberally: 95+ plugins active, 14 MCP servers, dozens of hooks. Docker helpers, Kubernetes managers, Terraform — none relevant to a Greek lawyer doing legal work.

One plugin alone (claude-octopus, a 30-skill persona system) injected ~4,000 tokens per session through CLAUDE.md. It required Codex and Gemini CLI binaries that weren't even installed on my machine.

The question that started everything: I have all these tools. How does Claude know which one to use?

It didn't. Claude guessed based on description competition. Sometimes it picked the right MCP server. Often it didn't. And every session started from zero — no memory of what worked yesterday.

Lesson: Existence ≠ function. Having 95 plugins installed is not the same as having a system.


Act 1: The Cowork Prototype

When: Pre-February 2026 (on Claude Desktop)

The first Cortex wasn't built for Claude Code at all. It was a meta-orchestrator for Cowork (Claude Desktop's collaborative surface):

  • Router: 10-step intent scoring pipeline. Trigger matching, anti-trigger veto, confidence tiers (autoload ≥0.8, suggest 0.6–0.79, silent <0.6), cooldown enforcement, shadow registry for uninstalled plugins.
  • Memory: Real MCP server (Node.js, sql.js SQLite, 11 tools).
  • Analyst: Progressive-depth analysis engine.

The registry existed in three forms: registry.md (human-editable), registry.json (compiled, SHA-256 hashed), digest.md (~30 lines for cheap context injection). Scoring: trigger×0.4 + domain×0.3 + entity×0.2 + context×0.1.
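The scoring and tiering described above can be sketched in a few lines. This is an illustrative reconstruction from the formula and thresholds given here, not the actual Cowork code; function and signal names are assumptions.

```python
# Sketch of the Act 1 scoring pipeline (names illustrative, weights from the doc).
def score_entry(signals: dict) -> float:
    """Combine per-entry match signals (each assumed normalized to [0, 1])."""
    return (signals.get("trigger", 0.0) * 0.4
            + signals.get("domain", 0.0) * 0.3
            + signals.get("entity", 0.0) * 0.2
            + signals.get("context", 0.0) * 0.1)

def tier(score: float) -> str:
    """Map a score onto the confidence tiers from the router design."""
    if score >= 0.8:
        return "autoload"
    if score >= 0.6:
        return "suggest"
    return "silent"
```

Note the consequence of the weights: a perfect trigger match alone (0.4) stays silent; it takes trigger plus domain agreement (0.7) just to reach the suggest tier.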

All dependencies (@modelcontextprotocol/sdk, sql.js, zod) were platform-agnostic. The core logic was portable. What wasn't: the plugin manifest format, skill activation mechanism, hook system, and command syntax — all Cowork-specific.


Act 2: The Port — From Cowork to Claude Code

When: February 13–18, 2026

Feb 13: Searched for any existing routing infrastructure in Claude Code. Nothing. Blank slate — no registries, no routing, no MCP servers configured.

Feb 17: Reverse-engineered the entire Cowork Cortex codebase (26 files). Documented the architecture for porting. Designed a 4-Layer Memory Stack, placing Cortex as Layer 4: the intelligent router that sends queries to the correct memory layer.

Feb 18: First install. Cortex v1.0.0 dropped into ~/.claude/plugins/cache/local/cortex/1.0.0/. Three skills, seven commands, one MCP server.

First gotcha: The plugin cache existed on disk, but Cortex was NOT registered in installed_plugins.json. Cache ≠ active. Classic existence-check failure.

Same day: MCP server load test passed. The SQLite-backed memory server worked fine over stdio. But the routing layer — the whole reason Cortex existed — didn't activate.


Act 3: The Big Pivot — Passive Over Active

When: February 20, 2026

This is the most significant architectural decision in Cortex's evolution.

Three options evaluated:

  1. Pure MCP — Clean, but requires explicit tool calls. No automatic routing.
  2. Pure hooks — Automatic, but hooks can't see conversation context. Limited error handling.
  3. Hybrid — UserPromptSubmit hook for automatic lightweight routing + an MCP server for interactive queries.

The hybrid won on paper. But during implementation planning, three Cowork features were permanently scrapped:

  • MCP bridge scanner — Too complex for the value delivered.
  • LLM calls in the routing hot path — Latency killer. Unacceptable in a hook that fires on every user message.
  • Context budget management — Unnecessary with the passive approach.

What was built instead: 4 passive query tools. No active routing. No interception.

cortex_digest  → pre-computed 596-token capability summary
cortex_search  → keyword matching against trigger arrays (~5ms)
cortex_lookup  → full entry details by ID
cortex_status  → registry health check
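The cortex_search mechanism is plain keyword matching, which is why it stays in the ~5ms range. A minimal sketch of the idea (registry shape and ranking are illustrative, not the shipped implementation):

```python
# Illustrative cortex_search-style matching: rank registry entries by how
# many of their trigger keywords appear in the query. No embeddings, no LLM.
def cortex_search(query: str, registry: list[dict]) -> list[dict]:
    words = set(query.lower().split())
    hits = []
    for entry in registry:
        matched = words & {t.lower() for t in entry.get("triggers", [])}
        if matched:
            hits.append({"id": entry["id"], "matches": sorted(matched)})
    # More trigger hits = higher rank.
    return sorted(hits, key=lambda h: len(h["matches"]), reverse=True)
```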

The reasoning: "Defer active routing until data proves Claude's native matching insufficient." The bet was that Claude Code's built-in description competition was good enough, and active routing would only be added if metrics showed failure.

That data never materialized. Passive routing worked. The active routing phase was never built.

What was abandoned from the original design:

| Feature | Why Dropped |
| --- | --- |
| Active routing via UserPromptSubmit hook | Claude's native routing proved sufficient |
| LLM calls in routing hot path | Latency — scrapped permanently |
| MCP bridge scanner | Too complex for the value |
| Context budget management | Unnecessary with passive approach |
| cortex_record / cortex_outcome / cortex_learn (DB write-back) | Deferred, never needed |
| Cooldown system / polite interruption protocol | Deferred |
| Shadow recommend_install flow | Not applicable in Claude Code |

Act 4: The First Registry Breakdown

When: February 20, 2026

The registry had never been compiled. The database was empty. Despite MEMORY.md claiming 163 capabilities, the actual count was 39.

What was wrong:

  • compile.js existed but had never been run
  • registry.json was empty/stale
  • MEMORY.md had a fictional count (163)
  • Legal entries were Cowork-era corporate templates (NDA, GDPR) — wrong for Greek civil law

Fix: Registry expanded from 39 → 53 entries (43 installed + 10 shadow). Legal entries flagged for replacement with Greek law entries.

Same day, worse discovery: Two Cortex MCP servers were running simultaneously:

  1. The old plugin-managed server at ~/.claude/plugins/cache/local/cortex/1.0.0/
  2. The new standalone server at ~/.claude/cortex/server/index.js

The cortex.memory.db had: 1 session, 43 capability metrics, 23 preferences, 4 anti-patterns, 0 routing decisions. The router logic existed only as pseudocode specs.

Lesson: Audit before building. The v0.6 → v1.0 roadmap was created: fix what's broken (v0.7), close the loops (v0.8), add domain intelligence (v0.9), achieve closed-loop learning (v1.0).


Act 5: The Observer — Half-Built for Months

When: February 20–24, 2026

The learning loop had a PostToolUse observer that logged tool calls to observations.jsonl. It accumulated thousands of observations. But the analysis half of the pipeline never ran — data flowed in and piled up. Nothing evolved.

Feb 24 audit found the pipeline was 50% broken:

  • Observer config set to enabled: false
  • Thousands of observations collected, never analyzed
  • All instinct directories empty
  • 41.7% duplicate entries — the observe hook was registered on BOTH PreToolUse and PostToolUse. PreToolUse events have no output. Pure waste.

Root cause: settings.json had observe.sh on both hook events.

Fix: PreToolUse hook removed. 270 tool_start ghost entries cleaned. Each tool now fires exactly one observation.
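The cleanup amounts to dropping any record that lacks tool output. A one-off sketch of that pass (field names like event and output are assumptions about the observations.jsonl schema, not confirmed):

```python
import json

# One-off cleanup sketch: drop PreToolUse "ghost" records — entries with no
# tool output — keeping exactly one record per completed tool call.
def clean_observations(lines: list[str]) -> list[str]:
    kept = []
    for line in lines:
        rec = json.loads(line)
        if rec.get("event") == "tool_start" or rec.get("output") is None:
            continue  # ghost entry from the duplicate PreToolUse registration
        kept.append(line)
    return kept
```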

Feb 25: Second bug in observe.sh. Line 98 wrote None for the input field on PostToolUse even though the data was in the hook JSON. Fix: input captured properly, enabling analyze.py's file path extraction.


Act 6: The Learning Loop Actually Closes

When: February 26, 2026 (B1/C1 phase of the evolution master plan)

Deep audit of analyze.py (495 lines, 14 functions) revealed why patterns never promoted:

  • Error confidence capped at 0.8 (line 214: min(0.3 + retries * 0.03, 0.8)). Promotion gate was 0.9. Mathematical impossibility.
  • File pattern confidence flat at 0.4. Also never promotes.
  • Clustering: designed but zero code after build_instincts() at line 246.
  • auto_evolve: false — the curator was explicitly disabled.
  • 4 CLI commands documented in SKILL.md, 0 built.
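The "mathematical impossibility" is easy to demonstrate: the confidence formula is hard-capped below the promotion gate, so no amount of evidence could ever promote an error pattern.

```python
# The error-confidence formula from analyze.py line 214, reproduced verbatim,
# against the 0.9 promotion gate it could never reach.
def error_confidence(retries: int) -> float:
    return min(0.3 + retries * 0.03, 0.8)

PROMOTION_GATE = 0.9
```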

The C1 fix:

  • Config loading wired up properly
  • Confidence caps raised (error: 0.8 → 0.95, file: 0.4 → dynamic)
  • Jaccard + union-find clustering implemented
  • promote_to_rules / decay_stale_instincts pipeline built
  • auto_evolve: true activated
  • 16/16 tests passing
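The Jaccard + union-find combination can be sketched compactly. This is a minimal illustration of the approach, not the C1 code — the threshold and the token-set representation of patterns are assumptions:

```python
# Minimal sketch of C1-style clustering: Jaccard similarity over pattern
# token sets, union-find to merge similar patterns into clusters.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(patterns: list[set], threshold: float = 0.5) -> list[int]:
    parent = list(range(len(patterns)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(patterns)):
        for j in range(i + 1, len(patterns)):
            if jaccard(patterns[i], patterns[j]) >= threshold:
                parent[find(i)] = find(j)  # union similar patterns
    return [find(i) for i in range(len(patterns))]
```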

First auto-promotion: Tool chain patterns (Bash→Bash→Bash seen 120x, Read→Read→Read seen 38x) promoted to ~/.claude/rules/generated-rules.md — automatically loaded every session.


Act 7: The ChromaDB Reckoning

When: February–March 2026

The memory plugin (claude-mem) shipped with ChromaDB for vector search. For weeks, everyone assumed it worked.

The discovery: ChromaDB stored thousands of embeddings at 384 dimensions. The MCP search interface didn't use them. Every search went through FTS5 keyword matching only. The vector layer was structurally present and functionally dead.

The decision (D4.1 in the evolution coordination doc):

Three options:

  1. Fix ChromaDB integration
  2. Replace with sqlite-vec
  3. Migrate to external service

sqlite-vec won: same SQLite file (no separate process), native FTS5 hybrid, 1024-dim BGE-M3 embeddings for proper multilingual support (the 384-dim ChromaDB vectors were too small for Greek text anyway — dimension upgrade would require re-embedding everything regardless).

ChromaDB: thousands of embeddings deleted. Archived at ~/.claude-mem/archive/.

sqlite-vec: hundreds of vectors at 1024-dim, over a thousand semantic links via KNN (cosine > 0.75), temporal decay per collection (30d for memory, 90d for cases, 365d for templates, null for evergreen).
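The per-collection decay maps directly to an exponential half-life, with evergreen collections exempt. A sketch using the half-lives listed above (function name illustrative):

```python
import math

# Per-collection temporal decay: exponential half-life scoring, with
# evergreen collections (half_life=None) exempt from decay.
HALF_LIFE_DAYS = {"memory": 30, "cases": 90, "templates": 365, "evergreen": None}

def decayed_score(score: float, age_days: float, collection: str) -> float:
    half_life = HALF_LIFE_DAYS.get(collection)
    if half_life is None:
        return score  # evergreen: no decay
    return score * math.exp(-math.log(2) / half_life * age_days)
```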


Act 8: The RAG Pipeline — 20 Servers Evaluated, Custom Built

When: February–March 2026

Before building a custom RAG pipeline, 20+ existing MCP servers were evaluated: GNO, mcp-local-rag, gnosis-mcp, RAGLite, kb-mcp-server, knowledge-mcp (LightRAG), and others.

None satisfied all constraints: DOCX support + Greek language + hybrid search + local-only + MCP protocol.

Fatal gaps:

  • gnosis-mcp: no DOCX (eliminates hundreds of legal templates)
  • GNO: requires Bun runtime
  • RAGLite: requires Pandoc
  • All cloud options: disqualified (legal documents stay local, non-negotiable)

The embedding model journey:

nomic-embed-text-v2-moe (768-dim) → demoted for weak Greek → BGE-M3 (1024-dim, 8K context, best multilingual) became the winner. The Greek-specific stsb-xlm-r-greek-transfer was considered but killed by 512-token context limit (too short for legal documents).

Zone-based chunking insight (late addition): 60% of legal template tokens are waste — placeholder dots, identical headers, near-duplicate forms. Hundreds of power-of-attorney forms are 70–85% identical. Solution: Strip+Tag+Embed — segment into HEADER/PARTIES/FACTS/LAW/REQUEST/BOILERPLATE zones, embed only the semantic zones with a metadata prefix. Raw corpus of millions of tokens → ~5M cleaned tokens.
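A hedged sketch of the Strip+Tag+Embed idea. The zone classification itself is the hard part and is assumed done upstream; the placeholder regex and skip list here are illustrative, not the production rules:

```python
import re

# Strip+Tag+Embed sketch: given (zone, text) segments, drop boilerplate
# zones, strip placeholder runs, and prefix each kept chunk with its zone
# tag so the embedding carries structural metadata.
SKIP_ZONES = {"HEADER", "BOILERPLATE"}
PLACEHOLDER = re.compile(r"\.{4,}|_{4,}")  # runs of filler dots/underscores

def embeddable_chunks(zoned_segments: list[tuple[str, str]]) -> list[str]:
    chunks = []
    for zone, text in zoned_segments:
        if zone in SKIP_ZONES:
            continue  # structurally present, semantically empty
        cleaned = PLACEHOLDER.sub(" ", text).strip()
        if cleaned:
            chunks.append(f"[{zone}] {cleaned}")  # metadata prefix
    return chunks
```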

The Ollama crash (still unresolved): Everything depended on Ollama for local embeddings. Ollama 0.17.x introduced MLX Metal initialization that crashes unconditionally on M1 Air — SIGABRT before serving. CPU fallback also fails (MLX init is unconditional). Current: 49% vectorized (roughly half the chunks). The remaining 51% blocked on a broken binary.


Act 9: The Crash That Changed Everything

When: March 2, 2026, 2:29 AM

macOS OOM killed 4 concurrent Claude Code sessions. 4 × ~1.5GB = ~6GB on an 8GB M1 Air.

What the crash revealed about the system:

The healthcheck reported 51/51 PASS while three critical systems were non-functional:

  1. claude-mem MCP search: returning empty — chromaStrategy=null, no FTS5 fallback
  2. Ollama: SIGABRT on launch (MLX crash)
  3. PreCompact hook: reading a conversation key that doesn't exist (the actual key: transcript_path). It had never worked — and passed healthcheck.

"You have existence checks, not functional tests." — SRE expert, post-crash analysis

Data casualty: A Google Drive restore was running mid-crash. DCIM photos recovered (4,314 files confirmed). Munich trip photos: 0 files, unrecoverable unless restored from Google's trash before ~April 1 purge.

The non-destructive rules codified from this crash:

  1. Never delete without verified recovery path
  2. Verify identity by CONTENT (hash), not just name/path
  3. Context matters — folder structure carries meaning
  4. Override requires explicit per-action user confirmation
  5. A plan is a hypothesis, not authority — re-verify at execution time

The context rot discovery (same session cluster): Softmax attention dilution at high token counts = structural amnesia. The fix: re-inject critical rules near the recency position every N turns (300 tokens, forced eval snippet, exploiting recency bias against RoPE geometric decay). Cost: zero engineering.


Act 10: The Master Plan — Chaos to Architecture

When: February 26, 2026

After 82+ sessions of organic growth, the system had accumulated 5 memory layers, 12 MCP servers, 6 hooks, 36 plugins, RAG at 49%, broken plugin hooks, and auto_evolve: false.

The evolution master plan (starry-chasing-flask.md): ~38 hours across 12–15 sessions.

Key planning pivot: The original approach was parallel plan-then-implement cycles. Changed to strict sequential: all 5 B-phase planning sessions complete before any C-phase implementation begins. Reason: early implementation was making decisions that conflicted with later planning insights.

What each phase added:

| Phase | Focus | Key Deliverables |
| --- | --- | --- |
| B1/C1 | Learning Loop | Fixed confidence caps, added clustering, promotion pipeline, temporal decay, auto_evolve: true |
| B2/C2 | Session Lifecycle | Stop hook, enhanced PreCompact, /refresh-streams skill, rotation automation |
| B3/C3 | Personas | Evaluated 29 octopus personas (kept all — zero passive cost), created Greek legal persona, /persona command |
| B4/C4 | Memory Architecture | ChromaDB → sqlite-vec, observation_links table, temporal decay per collection, semantic KNN linking |
| B5/C5 | Cross-Layer Search | Unified Search MCP: Promise.allSettled + 2s timeout/layer, RRF k=60, Jaccard MMR λ=0.7, 5-min cache |
| C5.1 | Audit & Fixes | Injection detection (5 regex patterns), XML memory wrapping, memory result trust boundaries |

All phases complete as of March 2026.
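Reciprocal Rank Fusion, the merge step in the C5 Unified Search, is small enough to show in full. A sketch with k=60 as in the table above (layer names and data shapes are illustrative):

```python
# Reciprocal Rank Fusion: each layer contributes 1/(k + rank) per document;
# documents ranked well by multiple layers float to the top.
def rrf_fuse(rankings: dict[str, list[str]], k: int = 60) -> list[str]:
    """rankings: per-layer ranked lists of doc ids -> fused ranking."""
    scores: dict[str, float] = {}
    for ranked in rankings.values():
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The large k (60) flattens the contribution curve, so appearing in several layers' lists outweighs being rank 1 in a single layer.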


Act 11: Stealing from OpenClaw

When: February 25, 2026

OpenClaw: 100K+ stars, 675K LOC TypeScript, 23 lifecycle hooks, 52 skills. Analyzed for stealable patterns.

Adopted immediately:

  • Injection detection: 5 regexes blocking "ignore instructions", XML tags, tool invocation patterns in memory retrieval paths
  • Memory XML wrapping: <relevant-memories> tag + "untrusted historical data" label + HTML entity escaping
  • Temporal decay formula: score *= exp(-ln(2)/30 * ageInDays) — adopted verbatim (evergreen files exempt)
  • detect-secrets baseline for the ~/.claude git repo
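The injection screen can be sketched as a handful of compiled patterns applied to every retrieved memory before it re-enters context. These regexes are examples in the spirit of the adopted set, not the actual five patterns:

```python
import re

# Illustrative injection screen for retrieved-memory text. The real system
# uses 5 patterns; these three are example stand-ins.
SUSPECT = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.I),
    re.compile(r"</?\s*(system|assistant|tool)\b", re.I),  # role/XML tags
    re.compile(r"<\s*invoke\b", re.I),                     # tool invocation
]

def flag_injection(memory_text: str) -> bool:
    """True if the retrieved text looks like a prompt-injection attempt."""
    return any(p.search(memory_text) for p in SUSPECT)
```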

Deferred:

  • Progressive disclosure (SKILL.md < 500 lines + references/ on-demand)
  • Auto-capture triggers (preference/decision detection from conversation text)
  • Embedding cache (SQLite LRU)

Gap comparison at time of analysis vs. now:

| Feature | Feb 25 State | Current State |
| --- | --- | --- |
| Hooks | 6 | 8 |
| Temporal decay | None | Implemented (C4) |
| Injection detection | None | 5 patterns (C5.1) |
| MMR diversity | None | Jaccard λ=0.7 (C5) |
| XML memory wrapping | None | Implemented (C5.1) |
| Progressive disclosure | Full skill load | Still full load |
| Auto-capture | Manual | Partially automated (PreCompact) |

Act 12: The Neuroscience Connection

When: February 25, 2026 (Chapter 1), March 4, 2026 (Chapters 2–3)

The brain metaphor wasn't part of the original design. It was discovered after the system was already built.

Reading Buschman et al. (2025) and Artem Kirsanov's explainer ("Why the Brain Doesn't Start From Scratch"), I realized: the system I'd built for practical reasons — routing, suppression of irrelevant context, pattern promotion, session lifecycle — mapped almost perfectly to known brain architecture.

The recognition was convergent, not deliberate. Same constraints (limited working memory, need for context routing, value of not starting from scratch) → same solutions.

Chapter 1 (Composition) was already built: Cortex routing = thalamic relay, domain rules = gain control suppression, agent delegation = dynamic routing ("railroad switch" in the Buschman paper).

Chapter 2 (Consolidation) was the identified gap. An archive quality audit proved it empirically:

  • Feb 24–26 done blocks (with consolidation): narrative + decisions + reasoning, 8–9/10
  • Mar 1–3 done blocks (crisis mode, no ritual): pure checklists, 4–5/10

The /bye skill was designed to fill this gap: hippocampal replay → session summary, episodic→semantic transfer → claude-mem persistence, pruning → session rotation.

Chapter 3 (Activation) existed already via Synergatis + SessionStart hooks, but wasn't framed as such until the three-chapter model crystallized.


The Meta-Patterns

Looking across 100+ sessions of building this system, five patterns recur:

1. Existence checks masquerading as functional tests. ChromaDB "working" but unused. Healthcheck passing while 3 systems broken. Observer "running" but never evolving. The system consistently accumulated the appearance of functionality before the reality of it.

2. Organic growth followed by consolidation crisis. 82+ sessions of additive work created extraordinary breadth and significant debt. The evolution master plan was the consolidation response. The system had to be planned after being built.

3. Plans survived crashes; execution didn't. Every crashed session had a plan file on disk. The recovery protocol: resume = read plan + execute, no re-planning. Persistent state on disk is the only reliable thing.

4. The AuDHD hyperfocus trap. Session overreach was consistent: 4 crashed sessions each had 3–5 major tasks running concurrently. The ADHD coach framing: "The overreach is hyperfocus-within-scope (gravity, not escape). Fix the plumbing before adding more fixtures."

5. Convergent evolution with neuroscience. The brain metaphor wasn't imposed — it was discovered. The same constraints produce the same solutions whether you're a primate prefrontal cortex or a CLI tool managing 14 MCP servers.


Current State (March 2026)

Living system (daily production):

  • 60+ registry entries across 12 namespaces
  • 8 hooks across full Claude Code lifecycle
  • Learning loop: hundreds of observations, dozens of instincts, auto-promotion active
  • 5-layer memory: MEMORY.md → claude-mem (hundreds of observations + hundreds of vectors + over a thousand semantic links) → Obsidian (dozens of notes) → Cortex registry → Unified Search (RRF fusion)
  • RAG: thousands of docs and chunks, partially vectorized (blocked on Ollama crash)
  • 12 Greek law tools in production
  • Dashboard: 16 sections, dual-tier polling, standalone PWA

This repo (v0.1.0):

  • Working cortex init CLI (end-to-end initialization)
  • 4-tool MCP server (raw JSON-RPC, zero dependencies)
  • 3 installable hooks (session-start, observe, prompt-router)
  • Starter capability registry (5 entries — your production system will grow from here)
  • Architecture documentation + config schema + domain pack format spec

Next: Extract learning loop (analyze.py → Node.js). Generalize hooks to use $CORTEX_HOME. Add tests. Collapse 9 package stubs to ~4 active packages.


Every wrong turn taught something. The crashes were the best audits. The system that exists today is the sum of everything that broke.