A minimal transformer inference engine built from scratch in C++ to understand deep learning systems at a fundamental level.
Most ML engineers interact with transformers through high-level APIs and pretrained models. But what actually happens under the hood? How does memory layout affect performance? Where are the real bottlenecks?
Prun is a ground-up implementation of a transformer inference pipeline in C++. It strips away the abstraction layers to expose exactly what's happening: tensor operations in memory, how data moves through the computation graph, and where efficiency really comes from.
This is built to explore:
- How inference engines work at the systems level
- Why memory layout and data movement matter more than model architecture
- Real performance bottlenecks in neural network computation
- Optimization strategies that actually move the needle (quantization, buffer reuse, sparse routing)
- What it takes to run models efficiently on real hardware
Not a framework. Not a library. Just a ground-up implementation to understand the fundamentals.
- Minimal Tensor System — Flat memory layout with manual indexing (row-major). Row-major keeps your memory access linear and cache-friendly
- Core Operations — MatMul, softmax, transpose. Hand-written to understand what's actually happening at the CPU level
- Scaled Dot-Product Attention — The heart of transformers: QKᵀ / √d → softmax → V. Implemented to profile each stage
- Linear Layers — Weights, biases, and operations. Manual computation to see where memory bottlenecks emerge
- Feedforward Networks — Two-layer MLPs with activation functions. Profile these separately to understand relative cost
- Full Transformer Blocks — Attention + feedforward + layer structure combined. Where the whole thing comes together
- Mixture of Experts (MoE) — Gating layer that routes inputs to expert networks. Sparse computation and load balancing challenges exposed
When you use PyTorch or TensorFlow, you don't see:
- How memory is actually laid out
- Where data copies happen
- What makes one MatMul implementation 10x faster than another
- The cost of allocations vs. reuse
- How routing decisions affect expert utilization
- What quantization actually does to your computation
Prun puts all of that front and center.
- Compiler: C++17 compatible (GCC 7+, Clang 5+, MSVC 2017+)
- Build System: CMake 3.10+
- Platform: Windows, macOS, Linux (no GPU dependencies currently)
# Clone and build
git clone https://github.com/yourusername/prun.git
cd prun
mkdir build && cd build
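cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build . --config Release
The Release build matters here: it enables optimizations that reveal what's actually fast vs. slow. Single-config generators (Makefiles, Ninja) read CMAKE_BUILD_TYPE at configure time and ignore --config; multi-config generators (Visual Studio, Xcode) do the opposite, so passing both covers either case.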
#include "tensor/Tensor.h"
#include "model/TransformerBlock.h"
int main() {
    // Single inference forward pass through a transformer block
    TransformerBlock block(512);  // 512-dim embedding
    Tensor input({1, 512});       // Batch size 1
    Tensor output = block.forward(input);
    return 0;
}

#include "tensor/Tensor.h"
#include "model/Attention.h"
#include "utils/Timer.h"
int main() {
    Tensor Q({64, 64});
    Tensor K({64, 64});
    Tensor V({64, 64});
    // Initialize data...
    Timer timer;
    Tensor out = attention(Q, K, V);
    auto elapsed = timer.elapsed();  // See exactly how long attention takes
    return 0;
}
This is the whole point: measure everything. Optimization begins with understanding where time actually goes.
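utils/Timer.h isn't reproduced in this README, so here is a minimal sketch of what such a timer could look like. The class name SketchTimer, the elapsed_ms() method, and the choice of milliseconds are assumptions for illustration, not Prun's actual interface.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for utils/Timer.h; the real class may differ.
class SketchTimer {
    // steady_clock is monotonic, so measured intervals never go negative.
    using Clock = std::chrono::steady_clock;

public:
    SketchTimer() : start_(Clock::now()) {}

    // Restart the measurement window.
    void reset() { start_ = Clock::now(); }

    // Milliseconds elapsed since construction or the last reset().
    double elapsed_ms() const {
        const std::chrono::duration<double, std::milli> delta = Clock::now() - start_;
        return delta.count();
    }

private:
    Clock::time_point start_;
};
```

Construct one right before the call you care about and read elapsed_ms() right after. For short operations, time many iterations and divide, since a single call can be dominated by noise.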
prun/
├── tensor/ # Core tensor data structure
│ └── Tensor.h
├── ops/ # Primitive operations
│ └── MatMul.h
├── layers/ # Neural network layers
│ ├── Linear.h # Fully connected layer
│ └── FeedForward.h # Feed-forward network
├── model/ # High-level model components
│ ├── Attention.h # Multi-head attention
│ ├── TransformerBlock.h
│ └── MoE.h # Mixture of experts
├── core/ # Execution and model management
│ ├── Model.h
│ └── Executor.h
├── utils/ # Utility functions
│ ├── Logger.h
│ └── Timer.h
├── examples/ # Usage examples
├── benchmarks/ # Performance benchmarking
├── tests/ # Unit tests
└── CMakeLists.txt
Tensor
The fundamental unit. Shape + flat data array. When you iterate through it, you're iterating through memory. No black boxes.
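To make "shape + flat data array" concrete, here is a rough 2-D sketch. Tensor2D and at() are hypothetical names, and std::vector<float> is used for storage as an assumption to keep ownership simple; the real Tensor.h may look different.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2-D tensor for illustration; not Prun's actual Tensor class.
struct Tensor2D {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;  // one contiguous block, row-major

    Tensor2D(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}

    // Element (i, j) lives at flat offset i * cols + j.
    float& at(std::size_t i, std::size_t j) {
        assert(i < rows && j < cols);
        return data[i * cols + j];
    }

    float at(std::size_t i, std::size_t j) const {
        assert(i < rows && j < cols);
        return data[i * cols + j];
    }
};
```

Walking j in the inner loop touches adjacent floats; walking i in the inner loop jumps cols floats per step. That difference is the whole cache story in miniature.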
MatMul
Three nested loops, O(n³) on paper. The operation count is fixed, so the real work is making those loops cache-efficient. This is where 80% of inference time lives.
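A small sketch of why loop order matters, assuming row-major std::vector<float> storage. The function names are illustrative, not taken from ops/MatMul.h. Both versions do the same multiplications; only the memory access pattern changes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C (m x n) = A (m x k) * B (k x n), all row-major.
// i-j-p order: the inner loop strides down a column of B (n floats per step).
void matmul_naive(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t m, std::size_t k, std::size_t n) {
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}

// i-p-j order: the inner loop walks a row of B and a row of C contiguously.
void matmul_ikj(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t m, std::size_t k, std::size_t n) {
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);
    std::fill(C.begin(), C.end(), 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            const float a = A[i * k + p];
            for (std::size_t j = 0; j < n; ++j) {
                C[i * n + j] += a * B[p * n + j];
            }
        }
    }
}
```

Time both on matrices too large for cache before drawing conclusions; the gap depends on sizes, compiler flags, and hardware.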
Attention
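QKᵀ (matmul) → scale → softmax → V (matmul). Profile each stage. Softmax seems cheap until you realize that each row needs a full reduction (max, then sum) before any output can be normalized, which makes it a synchronization point that limits parallelism.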
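The pipeline above can be sketched for a single head as follows. This is an illustration under assumptions (row-major std::vector<float>, Q, K, and V all n x d, no masking), and attention_sketch is a hypothetical name, not the attention() declared in model/Attention.h.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Mat = std::vector<float>;  // row-major storage

// In-place, numerically stable softmax over each row of a (rows x cols) matrix.
void softmax_rows(Mat& x, std::size_t rows, std::size_t cols) {
    assert(cols > 0 && x.size() == rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        float* row = x.data() + r * cols;
        const float mx = *std::max_element(row, row + cols);  // reduction 1: max
        float sum = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            row[c] = std::exp(row[c] - mx);
            sum += row[c];                                     // reduction 2: sum
        }
        for (std::size_t c = 0; c < cols; ++c) row[c] /= sum;
    }
}

// softmax(Q K^T / sqrt(d)) V for one head.
Mat attention_sketch(const Mat& Q, const Mat& K, const Mat& V, std::size_t n, std::size_t d) {
    assert(n > 0 && d > 0);
    assert(Q.size() == n * d && K.size() == n * d && V.size() == n * d);
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));

    // Stage 1 and 2: scores = Q K^T, scaled. K^T is implicit in the indexing.
    Mat scores(n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float dot = 0.0f;
            for (std::size_t c = 0; c < d; ++c) dot += Q[i * d + c] * K[j * d + c];
            scores[i * n + j] = dot * scale;
        }
    }

    // Stage 3: row-wise softmax turns scores into attention weights.
    softmax_rows(scores, n, n);

    // Stage 4: out = weights * V.
    Mat out(n * d, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            const float w = scores[i * n + j];
            for (std::size_t c = 0; c < d; ++c) out[i * d + c] += w * V[j * d + c];
        }
    }
    return out;
}
```

Keeping the four stages as separate loops makes it easy to wrap a timer around each one, at the cost of materializing the full n x n score matrix.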
Transformer Block
Attention + residual + LayerNorm + FFN + residual. See how much of the cost is in attention vs feedforward.
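The two pieces of glue in that pipeline, LayerNorm and the residual add, are cheap next to the matmuls but still worth seeing. The sketch below assumes row-major std::vector<float> and omits the learned gain and bias; the function names are hypothetical, and the order in which Prun applies normalization is not specified here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalize each row of a (rows x cols) matrix to zero mean and unit variance.
// Learned gain and bias are left out to keep the sketch short.
void layer_norm_rows(std::vector<float>& x, std::size_t rows, std::size_t cols,
                     float eps = 1e-5f) {
    assert(cols > 0 && x.size() == rows * cols);
    const float inv_cols = 1.0f / static_cast<float>(cols);
    for (std::size_t r = 0; r < rows; ++r) {
        float* row = x.data() + r * cols;

        float mean = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) mean += row[c];
        mean *= inv_cols;

        float var = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            const float diff = row[c] - mean;
            var += diff * diff;
        }
        var *= inv_cols;

        const float inv_std = 1.0f / std::sqrt(var + eps);
        for (std::size_t c = 0; c < cols; ++c) row[c] = (row[c] - mean) * inv_std;
    }
}

// Residual connection: y += x, element-wise.
void add_residual(std::vector<float>& y, const std::vector<float>& x) {
    assert(y.size() == x.size());
    for (std::size_t i = 0; i < y.size(); ++i) y[i] += x[i];
}
```

Both functions make several passes over the data and do almost no arithmetic per byte, so they are bound by memory bandwidth rather than compute.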
MoE (Mixture of Experts)
Gating layer that routes each token to the highest-scoring expert. Sparse computation wins only if load is balanced. Routing collapse kills the speedup.
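A sketch of top-1 routing that also counts how many tokens land on each expert, which is the number to watch for routing collapse. RoutingResult and route_top1 are hypothetical names; the gate logits are assumed to be a row-major tokens x experts matrix. Softmax is skipped because it does not change which expert has the highest score.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct RoutingResult {
    std::vector<std::size_t> expert_of_token;  // chosen expert index per token
    std::vector<std::size_t> load;             // number of tokens per expert
};

// Pick the highest-scoring expert for each token (ties go to the lowest index).
RoutingResult route_top1(const std::vector<float>& gate_logits,
                         std::size_t tokens, std::size_t experts) {
    assert(experts > 0 && gate_logits.size() == tokens * experts);
    RoutingResult result;
    result.expert_of_token.resize(tokens);
    result.load.assign(experts, 0);
    for (std::size_t t = 0; t < tokens; ++t) {
        const float* row = gate_logits.data() + t * experts;
        const std::size_t best =
            static_cast<std::size_t>(std::max_element(row, row + experts) - row);
        result.expert_of_token[t] = best;
        ++result.load[best];
    }
    return result;
}
```

If one entry of load holds nearly all the tokens, the other experts sit idle and the sparse path does roughly the same work as a single dense FFN plus the gating overhead.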
- Flat tensors: No nested vectors. Single contiguous float* with shape info
- Row-major layout: Sequential memory access = cache hits
- Manual indexing: You see exactly how [i,j] becomes data[i*cols+j]
- No dynamic allocation during inference: Buffers preallocated upfront
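One way to get "preallocated upfront" is a bump-style workspace: allocate a single block once, hand out slices during a forward pass, and reset the cursor before the next pass. The Workspace class below is a hypothetical sketch of that idea, not Prun's actual buffer management.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scratch arena: one allocation up front, none during inference.
class Workspace {
public:
    explicit Workspace(std::size_t capacity_floats) : buffer_(capacity_floats), used_(0) {}

    // Hand out the next n floats. The contents are whatever the previous pass left.
    float* take(std::size_t n) {
        assert(used_ + n <= buffer_.size());
        float* slice = buffer_.data() + used_;
        used_ += n;
        return slice;
    }

    // Make the whole block available again; no memory is freed or allocated.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    std::vector<float> buffer_;
    std::size_t used_;
};
```

Size the workspace for the largest set of intermediates a single forward pass needs. Slices from before a reset() must not be read afterwards, since the next pass will overwrite them.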
Here's what we're working toward:
- Profile Everything — Understand the current cost breakdown before touching anything
- MatMul First — This is the bottleneck. Cache-friendly tiling, blocking, better memory access
- Buffer Reuse — Stop allocating and deallocating during inference. Preallocate, reuse
- Quantization — INT8 inference. Float tensors → quantized, see the speed/accuracy tradeoff
- MoE Optimization — Fix routing collapse, better load balancing across experts
- Sparse Kernels — Only compute what matters
- SIMD & Intrinsics — When you've profiled down to the actual CPU bottlenecks
Right now this is a reference implementation: correct and readable, not yet fast. The goal is to progressively optimize without losing clarity about what changed and why.
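For the quantization step, the simplest starting point is symmetric per-tensor INT8: one scale for the whole tensor, values rounded into [-127, 127]. The sketch below is an assumption about how that could look, not an implemented Prun feature; QuantizedTensor, quantize_int8, and dequantize are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct QuantizedTensor {
    std::vector<std::int8_t> values;
    float scale;  // real value is approximately values[i] * scale
};

// Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127].
QuantizedTensor quantize_int8(const std::vector<float>& x) {
    float max_abs = 0.0f;
    for (float v : x) max_abs = std::max(max_abs, std::fabs(v));

    QuantizedTensor q;
    q.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;  // avoid divide-by-zero
    q.values.reserve(x.size());
    for (float v : x) {
        long rounded = std::lround(v / q.scale);
        rounded = std::max(-127L, std::min(127L, rounded));
        q.values.push_back(static_cast<std::int8_t>(rounded));
    }
    return q;
}

// Recover approximate floats to measure the error quantization introduced.
std::vector<float> dequantize(const QuantizedTensor& q) {
    std::vector<float> out(q.values.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<float>(q.values[i]) * q.scale;
    }
    return out;
}
```

Round-tripping a tensor through quantize_int8 and dequantize gives the accuracy half of the tradeoff: the per-element error is bounded by about half of scale. The speed half only shows up once the matmul itself runs on the int8 values.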
If you're interested in inference optimization, this is an open invitation to contribute:
- New optimization techniques — Novel memory layouts, quantization strategies, kernel designs
- Profiling and analysis — Identify bottlenecks on different hardware, platforms
- Sparse computation — Better routing, pruning, dynamic execution
- SIMD implementations — Hand-tuned kernels for specific operations
- Experimental features — Quantization variants, new MoE routing, gradient checkpointing
The bar is simple: show your work. Measure before and after. Explain why something is faster.
Q: Should I use this in production?
A: No. This is a learning tool and sandbox for inference optimization research. For production, use an optimized backend such as TensorRT, Core ML, or ONNX Runtime.
Q: Why C++?
A: Because inference bottlenecks are memory and CPU-level concerns. C++ gives you the control to see and optimize those. Python doesn't.
Q: Can this run actual models?
A: Not yet—we'd need weight loaders, quantization that maps real model weights, etc. This is currently a transformer building block library, not an inference framework that runs trained models.
Q: Will you add GPU support?
A: GPU is a different beast. Right now CPU is interesting enough—most edge devices don't have GPUs. GPU optimization is its own project.
Q: How does this compare to TensorRT / TVM / llama.cpp?
A: Not meaningfully yet. Those are production systems. Prun is at the "understand the fundamentals" stage. Eventually the interesting comparison will be on specific optimizations.
This project is licensed under the MIT License - see the LICENSE file for details.
@software{prun2026,
title = {Prun: A Minimal Transformer Inference Engine},
author = {Your Name},
year = {2026},
url = {https://github.com/yourusername/prun}
}
Background reading for understanding transformer inference at the systems level:
- Vaswani et al. (2017), "Attention Is All You Need": the foundational transformer paper
- Shazeer et al. (2017), "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer": MoE fundamentals
- Roark et al., "Memory-Efficient Attention on GPUs": understanding memory bottlenecks in attention
- Bone et al., "MatMul Optimization": the real bottleneck in neural network inference
- Stock et al., "And the Bit Goes Down": post-training quantization and inference optimization
Built with respect for systems-level optimization work in:
- llama.cpp (bringing model inference to CPU seriously)
- ONNX Runtime (understanding inference backend design)
- TVM (compiler approach to kernels)
- PyTorch C++ API (how modern inference APIs look)
- Have an optimization idea? Open an issue with benchmarks
- Found a bottleneck? Document it. Measure it. Show us the profile
- Want to contribute an optimization? Include before/after numbers, not just code
- Questions about how something works? Read the header files, run the benchmarks, profile it yourself
Serious inquiries only. This is about understanding systems, not collecting features.