MicroGPT-C

Tiny specialist models, coordinated by a pipeline, outperform single models on focused tasks.

Composable Intelligence — the four phases of MicroGPT-C: stem cell foundation, targeted differentiation, organelle pipeline coordination, and proven results across logic games and code composition


The Story

This project started as a C port of Andrej Karpathy's microgpt.py, a ~200-line Python GPT that trains a character-level Transformer from scratch. We rewrote it in pure C99 with zero dependencies, and, as you'd expect from C, it's much faster.

Then we asked a bigger question: can tiny models actually be intelligent?

Not by making them bigger; the industry already does that. Instead, by making them work together. We took the same ~460K-parameter engine and trained it on different tasks: one becomes a planner, another becomes a player, another becomes a judge. Each one starts as the same blank "stem cell" and differentiates based on its training data.

We call them organelles — like the specialised structures inside a biological cell.

The result surprised us. A single organelle playing Connect-4 wins about 55% of the time. But when a planner and a player coordinate through a shared protocol, the system wins 88% of the time, even though the individual models are still wrong almost half the time. The pipeline catches the mistakes. The coordination is the intelligence.
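That loop is small enough to sketch in C. The toy program below is illustrative only: the organelles are stubbed with canned strings, and every name here is invented rather than taken from the repository's API. What it shows is the division of labour, with probabilistic models proposing and deterministic scaffolding disposing.

```c
/* Minimal sketch of the planner/player/judge loop described above.
 * Organelle calls are stubbed; all names are illustrative only. */
#include <stdio.h>

#define COLS 7
#define ROWS 6

typedef struct { int height[COLS]; } Board;   /* Connect-4 column fills */

/* stand-in for a trained micro-model: prompt in, completion out */
typedef const char *(*Organelle)(const char *prompt);

static const char *planner_stub(const char *p) { (void)p; return "centre"; }
static const char *player_stub(const char *p)  { (void)p; return "3"; }

/* deterministic judge: is the proposed column playable? */
static int legal(const Board *b, int col) {
    return col >= 0 && col < COLS && b->height[col] < ROWS;
}

static int first_legal(const Board *b) {               /* safe fallback */
    for (int c = 0; c < COLS; c++) if (legal(b, c)) return c;
    return -1;
}

/* One turn: the planner proposes, the player commits, and plain C code
 * rejects anything illegal. The models may be wrong; the scaffolding
 * catches the mistake, which is where the extra win rate comes from. */
static int pipeline_turn(Organelle planner, Organelle player, const Board *b) {
    char prompt[128];
    snprintf(prompt, sizeof prompt, "PLAN:%s MOVE:", planner("board"));
    int col = player(prompt)[0] - '0';
    return legal(b, col) ? col : first_legal(b);
}

int main(void) {
    Board b = {{0}};
    printf("move: %d\n", pipeline_turn(planner_stub, player_stub, &b));
    return 0;
}
```

Swapping the stubs for real trained organelles changes nothing structurally; the judge and the fallback stay deterministic C.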

We've now tested this across 11 logic games, from Tic-Tac-Toe to Sudoku, with models ranging from 30K to 460K parameters. The pattern holds: right-sized specialists working together consistently outperform a single larger model working alone.

Then we asked: does it work on real-world data?

We ran a lottery prediction experiment as a negative control for organelle intelligence. The lottery model hit an entropy floor at 0.50 loss: it learned nothing, because lottery draws are random. This supports the engine's integrity: the 78–91% win and solve rates on games such as Mastermind and Connect-4 come from the model genuinely learning the underlying rules, not from a hidden flaw in the training engine.

We also explored applying OPA to continuous-valued domains like financial time-series. This revealed a fundamental insight: the 31-character vocabulary that makes game coordination reliable destroys the continuous gradients that prediction requires — what we call the "Discretisation Wall." Bridging categorical reasoning with numerical sensing is an active research direction.
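A toy round-trip makes the wall concrete. The bin boundaries and five-symbol vocabulary below are invented for illustration (the engine's actual game vocabulary is 31 characters): two visibly different returns collapse onto the same symbol, and no decoder can recover the difference.

```c
/* Toy illustration of the "Discretisation Wall": mapping a continuous
 * series onto a small symbol vocabulary destroys the fine-grained
 * magnitude information that numeric prediction needs. The bins are
 * invented for this sketch, not the project's actual scheme. */
#include <stdio.h>
#include <math.h>

#define NBINS 5   /* tiny categorical vocabulary: {--, -, 0, +, ++} */

static int to_symbol(double r) {            /* continuous -> category */
    if (r < -0.02)  return 0;
    if (r < -0.005) return 1;
    if (r <  0.005) return 2;
    if (r <  0.02)  return 3;
    return 4;
}

static double from_symbol(int s) {          /* category -> bin centre */
    static const double centre[NBINS] = {-0.03, -0.0125, 0.0, 0.0125, 0.03};
    return centre[s];
}

int main(void) {
    const double returns[] = {0.011, 0.013, -0.001, 0.019, -0.026};
    double err = 0.0;
    for (int i = 0; i < 5; i++) {
        double back = from_symbol(to_symbol(returns[i]));
        err += fabs(returns[i] - back);
        printf("%+.4f -> bin %d -> %+.4f\n",
               returns[i], to_symbol(returns[i]), back);
    }
    /* 0.011 and 0.019 land in the same bin: a game move is either legal
     * or not, but "how much" is exactly what the bins throw away. */
    printf("mean abs round-trip error: %.4f\n", err / 5.0);
    return 0;
}
```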

Same engine. Same architecture. One learns game patterns, one hits an entropy floor on random data, and one maps the boundary between pattern matching and temporal prediction. That's three kinds of proof.

The full research journey — from character-level Transformer to VM-based code generation through the calibrated three-bound scaling-curve closure — is documented in Composable Intelligence at the Edge (21 chapters + appendix, online version).

Honest-claim note (May 2026): the project's headline numbers were re-audited after the closing scaling-curve experiment caught a curator self-overlap leakage incident. The restated, calibrated claim is ~75–80% retrieval on novel-paraphrase tests in distinctive-noun domains, with three documented structural bounds (curator-, model-, and domain-bounded) and audit infrastructure baked in via tools/scaling_leakage_audit.sh; it replaces the earlier inflated retrieval claims. See docs/research/ORGANELLE_STATE.md for the synthesis and docs/engineering/CLEAN_ROOM_IMPLEMENTATION/RESEARCH_DISCLOSURE.md for the regulator-friendly disclosure register. This repository is research-only; the productisation strategy and per-vertical implementation plans were migrated to a private companion repo (organelles.bio) on 2026-05-01. See docs/MIGRATED_TO_ORGANELLES_BIO.md for the index.


Quick Start

```sh
git clone https://github.com/enjector/microgpt-c.git
cd microgpt-c
mkdir build && cd build
cmake ..
cmake --build . --config Release

# Train a name generator in < 1 second (4K params)
./names_demo

# Train Shakespeare character-level generation (841K params)
./shakespeare_demo

# Train Shakespeare word-level generation (510K params, ~40K tok/s inference, 2 min training)
./shakespeare_word_demo

# Generate infinite word-level Shakespeare using Memory Sparse Attention (MSA)
./msa_infinite_shakespeare

# Generate word-level Shakespeare with TurboQuant 4-bit memory compression
cd demos/turbo_quant
../../build/tq_shakespeare_tq

# Run a multi-organelle game pipeline (88% win rate)
./connect4_demo
```

All 11 game experiments, the lottery negative control, 3 pretrained checkpoints, 97 unit tests, and 22 benchmarks are included. See the full list in demos/character-level/.


Performance Highlights

All benchmarks on Apple M2 Max (dev machine), single-threaded unless noted. Models are 360KB–5.4MB and compile anywhere with a C99 compiler. Edge device testing is a future research stage. See PERFORMANCE for full details.

| Engine | Params | Training | Inference | Notes |
| --- | --- | --- | --- | --- |
| Character-level (Shakespeare) | 841K | 28K tok/s | 16K tok/s | 14 min, 12 threads |
| Word-level (Shakespeare) | 510K | 12.5K tok/s | 40K tok/s | 2 min, 12 threads |
| VM engine (dispatch) | — | — | 3.7–5.8M ops/s | Single-threaded |
| Micro-benchmark (tiny model) | 6.5K | 642K tok/s | 1.55M infer/s | Float32, 1 thread |
| SSD ensemble (5-vote, prefix cache) | 6.5K | — | 1.9× faster | vs old ensemble (arXiv:2603.03251) |

vs. Karpathy's microgpt.py: ~1,000× faster training, ~700× faster inference (expected for C vs Python; the real contribution is the orchestration layer).

Game Leaderboard (11 Games)

All games: trained organelle vs random opponent, 100 evaluation games each. Full details in RESEARCH_ORGANELLE_GAMES.

| Game | Organelles | Params (each) | Size (each) | Size (total) | Training | Result |
| --- | --- | --- | --- | --- | --- | --- |
| Pentago | 2 | 92K | 1.1 MB | 2.2 MB | ~9 min | 91% win |
| 8-Puzzle | 5 | 460K | 5.4 MB | 27 MB | ~7 min | 90% solve |
| Connect-4 | 2 | 460K | 5.4 MB | 10.8 MB | ~21 min | 88% win |
| Tic-Tac-Toe | 2 | 460K | 5.4 MB | 10.8 MB | ~17 min | 87% win+draw |
| Mastermind | 2 | 92K | 1.1 MB | 2.2 MB | ~8 min | 79% solve |
| Sudoku | 2 | 160K | 1.9 MB | 3.8 MB | ~3 min | 78% solve |
| Othello | 2 | 92K | 1.1 MB | 2.2 MB | ~8 min | 67% win |
| Klotski | 2 | 30K | 360 KB | 720 KB | ~36 sec | 62% solve |
| Red Donkey | 2 | 30K | 360 KB | 720 KB | ~38 sec | 19% solve |
| Lights Out | 2 | 160K | 1.9 MB | 3.8 MB | ~4 min | 10% solve |
| Hex | 2 | 92K | 1.1 MB | 2.2 MB | ~3 min | 27% win |

Negative Control

| Experiment | Organelles | Params | Size | Training | Result | Interpretation |
| --- | --- | --- | --- | --- | --- | --- |
| Lottery | 2 | 163K | 1.9 MB | ~5 min | Entropy floor | Negative control ✓ |

Innovations

Key technical contributions shipped in this engine:

| Innovation | Description | Evidence |
| --- | --- | --- |
| 🧬 Organelle Pipeline Architecture | Composable specialist micro-models coordinated by deterministic C scaffolding | 11 games, from 91% win (Pentago) to 90% solve (8-Puzzle) |
| 💾 Memory Sparse Attention (MSA) | Infinite sequence lengths routed via O(1) LRU-paged latent storage chunks | Removes quadratic memory growth for endless CRM interactions |
| 🗜️ TurboQuant Memory Compression | 4-bit dual-state quantization (MSE codebooks + 1-bit QJL residuals) | 8× memory reduction with +25% generation speedup, validated at 1.3M+ encodes/sec |
| 💪 TinyLlama-Class Resiliency | SwiGLU, RMSNorm, grouped-query attention, and decoupled weight decay, rigorously audited against PyTorch output logits; zero NaN instability | Zero invalid moves across all 11 games |
| Prefix KV Cache Sharing | Prompt processed once, KV state copied per ensemble vote, eliminating redundant prompt inference (minimal sketch below this table) | 1.9–5.7× ensemble speedup (arXiv:2603.03251) |
| 🔮 Speculative Decoding | Draft organelle generates candidates; the target verifies, with KV rollback on rejection | Functional, with acceptance-statistics tracking |
| 🧠 Neural Algorithmic Reasoning | Deterministic scaffolding (Kanban, cycle detector, judge) frees model capacity for pattern matching | ~340 lines of C replace what gradient descent handles poorly |
| 📝 Dual Tokenisation | Character-level (zero `<unk>`) and word-level (O(1) hash, 2.5× faster inference) | Shakespeare: 16K→40K tok/s |
| 🔧 Compile-Time Architecture | `N_EMBD`, `N_LAYER`, `BLOCK_SIZE` etc. as CMake defines, with zero runtime overhead | 30K–841K params, 360 KB–5.4 MB |
| 🖥️ Metal GPU + SIMD + BLAS | Optional Apple Metal shaders, NEON auto-vectorisation, Accelerate BLAS | All opt-in; zero-dependency baseline |
| 📦 Paged KV Cache | Memory-efficient attention for constrained deployments | Opt-in via `-DMICROGPT_PAGED_KV=ON` |
| 🔀 Block Attention Residuals | Learned depth-attention replaces additive residuals, preserving prompt signal through deep layers | Opt-in via `-DMICROGPT_ATTN_RES=ON` (paper) |
| 🎯 Negative Control Methodology | Lottery experiment shows the engine learns patterns, not artefacts | Entropy floor at 0.50 (theoretical maximum) |
| 🧭 DeepSeek-V4 Port Stack | Active-attention triumvirate (Partial RoPE + Attention Sink + Q/K RMSNorm) ported from DeepSeek-V4 §2.3.3 onto a CPU-first C99 engine; RoPE-aware MSA pool/recency injection makes long-context inference relative-position-correct. All four flags off by default; the combined stack is opt-in. | −8.7% held-out PPL on the deep config (4-layer, 138K params), 0 new params, ~1% extra runtime. See the V4 port roadmap. |
| 🔌 Pipeline IR + Wiring Organelle (multi-organelle + manifold retrieval) | Typed graph IR (DAG + verifier + text round-trip + DOT) emitted by a 540K-param word-level wiring organelle plus a 540K-param planner organelle. Phase 2c ships anchor-retrieval generation over a 20D Geodesic manifold: a 20-entry canonical @graph table indexed by Geodesic top-1 prediction over a hand-coded keyword embedder, replacing autoregressive token generation with table lookup. The Phase 2d leakage audit (§38) confirmed that 13 of 20 original held-out prompts were verbatim in the wiring training corpus (introduced by Phase 13). Restated honest headlines: the anchor-retrieval mechanism scores 🎯 100% (20/20) on the clean Phase 2c paraphrases (no training-corpus overlap); the wiring transformer alone scores 7/20 (35%) on the same clean set. | Run `./wiring_organelle_demo --clean-only` (anchor retrieval, 100%), `./wiring_organelle_demo --composition` (multi-stage composition, 60%), or `./wiring_organelle_demo --no-anchor --clean-only` (wiring-only baseline, 35%). For Phase 4: `./corpus_expand pipeline_corpus_phase4_train.txt 42`, then `./manifold_tfidf_demo pipeline_corpus_adversarial.txt pipeline_corpus_phase4_train.txt` (TF-IDF on the expanded corpus, 90% on adversarial axis-2 vs 10% hand-coded). See the standalone paper; the development log (§38 leakage audit, §39 works/doesn't-work examples, §40 pre-registered Phase 3, §41 Phase 3a falsification, §42+§43 Phase 3b shipped, §44 state-of-the-arc, §45 pre-registered Phase 4, §46 corpus expansion shipped); and the manifold-learning research. |
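The prefix KV cache sharing row above reduces to a small amount of code. The sketch below is self-contained but hypothetical: the struct layout, names, and stubbed model calls stand in for the engine's actual API. What it shows is the shape of the optimisation: prefill runs once, and each ensemble vote pays only two memcpys instead of a full prompt pass.

```c
/* Shape of prefix KV cache sharing: the prompt is attended once, then
 * each ensemble vote starts from a copy of that KV state. Names and
 * layout are hypothetical, not the engine's actual API. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { float *k, *v; size_t cap; int len; } KVCache;

/* stub prefill: in the real engine this is the expensive attention pass */
static void model_prefill(KVCache *kv, const int *prompt, int n) {
    (void)prompt;
    kv->cap = 1024; kv->len = n;
    kv->k = calloc(kv->cap, sizeof(float));
    kv->v = calloc(kv->cap, sizeof(float));
}

/* stub sampler: one cheap generation step continuing from the cache */
static int model_sample_vote(KVCache *kv, unsigned seed) {
    return (kv->len + (int)seed) % 3;
}

static int ensemble_vote(const int *prompt, int n, int votes) {
    KVCache shared, scratch;
    model_prefill(&shared, prompt, n);        /* prompt processed ONCE  */

    int tally[3] = {0};
    for (int i = 0; i < votes; i++) {
        scratch = shared;                     /* per-vote KV copy:      */
        scratch.k = malloc(shared.cap * sizeof(float));
        scratch.v = malloc(shared.cap * sizeof(float));
        memcpy(scratch.k, shared.k, shared.cap * sizeof(float));
        memcpy(scratch.v, shared.v, shared.cap * sizeof(float));
        tally[model_sample_vote(&scratch, (unsigned)i)]++;
        free(scratch.k); free(scratch.v);
    }
    free(shared.k); free(shared.v);

    int best = 0;                             /* majority vote          */
    for (int c = 1; c < 3; c++) if (tally[c] > tally[best]) best = c;
    return best;
}

int main(void) {
    int prompt[4] = {1, 2, 3, 4};
    printf("majority vote: %d\n", ensemble_vote(prompt, 4, 5));
    return 0;
}
```

In the real engine the copied state feeds the per-vote sampler; here the sampler is a stub so the program runs standalone.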

Explore Further

| Topic | Link |
| --- | --- |
| 🧭 Where the research stands today (start here) | ORGANELLE_STATE |
| 📖 Book: Composable Intelligence at the Edge | PDF · Online · Chapters |
| FAQ | FAQ.md |
| 🧬 The stem cell philosophy | VISION.md |
| 🏆 Game leaderboard (11 games) | RESEARCH_ORGANELLE_GAMES |
| 🎲 Lottery experiment (entropy baseline) | lottery/README.md |
| 🔬 Pipeline architecture (white paper) | RESEARCH_ORGANELLE_PIPELINE |
| 🧠 Reasoning conclusion | RESEARCH_ORGANELLE_REASONING |
| 🔌 Pipeline IR + Wiring Organelle (paper) | RESEARCH_WIRING_ORGANELLE_PAPER |
| 🔬 Pipeline IR + Wiring Organelle (full development log) | RESEARCH_PIPELINE_IR |
| 📐 Calibrated three-bound scaling claim (post-Phase-3) | wiring_scaling_post_phase3 |
| 🛡️ Standing leakage-audit protection | tools/scaling_leakage_audit.sh |
| 📑 Honest disclosure register (cancelled phases, restated headlines) | RESEARCH_DISCLOSURE |
| 🏛️ Clean-room rebuild-test corpus (BS / TDD / FS / BRD / FRD / NFRD / TRACEABILITY) | docs/engineering/CLEAN_ROOM_IMPLEMENTATION/ |
| 🌐 Manifold-learning composition (research sketch) | RESEARCH_MANIFOLD_LEARNING |
| 📚 Using as a library | FUNCTIONAL_SPEC |
| Performance & benchmarks | PERFORMANCE |
| 🚀 SSD inference optimisations | RESEARCH_SSD |
| 🔀 Attention Residuals research | RESEARCH_ATTN_RES |
| 🔧 Build options (Metal, BLAS, INT8, SIMD) | BUILD_OPTIONS |
| 🛠️ Extending the Wiring Organelle | EXTENDING_WIRING_ORGANELLE |
| 🌳 Project as a Node-style runtime (analysis) | NODE_ANALYSIS |
| 🤝 Contributing | CONTRIBUTING.md |
| 📋 Data licensing | DATA_LICENSE.md |
| 🔒 Productisation artefacts (migrated to private companion repo) | MIGRATED_TO_ORGANELLES_BIO |

Requirements

  • C99 compiler (GCC, Clang, MSVC)
  • CMake 3.10+
  • No other dependencies

Optional: Git LFS for pretrained checkpoints (git lfs pull).
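For illustration, a typical configure step combining the opt-in features mentioned above might look like the following. The two MICROGPT_* flags appear in the Innovations table; passing model dimensions directly as -D defines is an assumption based on the "N_EMBD, N_LAYER, BLOCK_SIZE as CMake defines" description, so check BUILD_OPTIONS for the exact spellings.

```sh
# baseline build: zero dependencies, stock model dimensions
cmake .. && cmake --build . --config Release

# opt-in features documented in the Innovations table
cmake .. -DMICROGPT_PAGED_KV=ON -DMICROGPT_ATTN_RES=ON

# compile-time model sizing (exact define names assumed; see BUILD_OPTIONS)
cmake .. -DN_EMBD=128 -DN_LAYER=4 -DBLOCK_SIZE=256
cmake --build . --config Release
```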


Responsible Use

MicroGPT-C runs entirely on-device with no telemetry, no cloud calls, and no data collection. Small models trained on narrow corpora inherit the biases of that corpus — be aware of this when deploying. High confidence means the model has seen similar patterns, not that the output is correct. Always validate through deterministic checks (the Judge pattern) or human review for safety-critical applications.
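As a deliberately tiny illustration of the Judge pattern, the validator below is invented for this example; a real deployment substitutes its own deterministic domain rules. The point is that a model completion is untrusted text until a plain C check accepts it.

```c
/* Judge pattern in miniature: treat a model completion as untrusted
 * text until a deterministic check accepts it. The rule here (a single
 * digit 0-6, i.e. a Connect-4 column) is invented for this example. */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

static int judge_accepts(const char *completion) {
    return strlen(completion) == 1 &&
           isdigit((unsigned char)completion[0]) &&
           completion[0] <= '6';
}

int main(void) {
    const char *samples[] = { "3", "9", "three", "4 " };
    for (int i = 0; i < 4; i++)
        printf("\"%s\" -> %s\n", samples[i],
               judge_accepts(samples[i]) ? "accepted"
                                         : "rejected, use fallback");
    return 0;
}
```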

See CONTRIBUTING.md for ethics guidelines.


Research Team

This project was built transparently with human–AI collaboration — the same philosophy of coordinated intelligence that MicroGPT-C explores.

| Role | Member | Contribution |
| --- | --- | --- |
| 🧭 Principal Research Manager | Ajay Soni | Research direction, validation, and decisions |
| 💻 Engineering & Documentation | Claude | Coding, documentation, and junior research |
| 🔬 Senior Research Assistant | Grok | In-depth analysis and insights |
| 🎨 Senior Research Assistant | Gemini | Creative synthesis and validation |
| 📚 Community Education | NotebookLM | Accessible explanations and education materials |

License

MIT — see LICENSE.

About

Zero-dependency C99 GPT-2 engine for edge AI. Sub-1M-parameter models train on-device in seconds to minutes. The Organelle Pipeline Architecture (OPA) coordinates specialised micro-models, reaching up to 91% win rates across 11 logic games with 30K–460K parameters. Composition beats capacity.
