A minimal CUDA transformer inference engine for small GPT-2-family models.
It currently supports:
- token embeddings + positional embeddings
- multi-layer causal self-attention
- GELU MLP blocks
- final norm + LM head
- argmax token generation
- decoding generated token IDs back to text with the matching Hugging Face tokenizer
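For reference on the MLP blocks: GPT-2 uses the tanh approximation of GELU. A NumPy sketch of that activation (illustration only, not the project's CUDA kernel):

```python
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    """GPT-2's tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# GELU is ~0 for large negative inputs and ~x for large positive inputs.
print(gelu(np.array([-3.0, 0.0, 3.0])))
```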
This project shows the core shape of decoder-only transformer inference on GPU:
- prompt -> token ids
- token ids -> embeddings
- run transformer blocks
- produce logits
- pick next token id
- repeat
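The loop above can be sketched in a few lines of Python. Here `toy_logits` is a hypothetical stand-in for the CUDA forward pass, just so the greedy decode structure is runnable:

```python
import numpy as np

def toy_logits(token_ids, vocab_size=16):
    # Hypothetical stand-in for the transformer forward pass:
    # returns one logit vector per position; only the last row is used below.
    rng = np.random.default_rng(sum(token_ids))
    return rng.standard_normal((len(token_ids), vocab_size))

def generate(prompt_ids, n_new):
    ids = list(prompt_ids)                    # prompt -> token ids
    for _ in range(n_new):
        logits = toy_logits(ids)              # run transformer blocks -> logits
        next_id = int(np.argmax(logits[-1]))  # pick next token id (argmax)
        ids.append(next_id)                   # repeat with the extended sequence
    return ids

print(generate([3, 7, 1], 5))
```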
The code is intentionally minimal. It is not an optimized production inference engine.
Working well:
- small GPT-2-style models
- exported Hugging Face weights
- real GPU runtime testing
Not implemented:
- KV cache
- sampling beyond argmax
- Llama-style architectures (RoPE, RMSNorm, GQA, SwiGLU)
- wide-model support beyond the current small-kernel assumptions
- full tensor-by-tensor numerical validation against PyTorch
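For context on "sampling beyond argmax": a standard alternative is temperature sampling over the last-position logits. A NumPy sketch of that general technique (not code from this repo):

```python
import numpy as np

def sample_token(logits, temperature=1.0, seed=None):
    """Sample a token id from a logit vector; temperature -> 0 recovers argmax."""
    rng = np.random.default_rng(seed)
    scaled = logits / max(temperature, 1e-6)
    scaled = scaled - scaled.max()            # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([1.0, 3.0, 0.5, 2.0])
print(sample_token(logits, temperature=0.0001))  # near-greedy: picks index 1
```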
- src/main2.cu: main CUDA inference path and decode loop
- src/kernels.cu: CUDA kernels
- src/model_loader.h: exported GPT-2 bundle loader
- scripts/export_gpt2_hf.py: export Hugging Face GPT-2-family weights into this project’s flat-text bundle format
- scripts/decode_with_hf_tokenizer.py: decode generated token IDs back to text
```sh
mkdir -p build
nvcc -arch=sm_75 -O3 src/main2.cu -o build/main2
```

Adjust `sm_75` to match your GPU.
Example with the tiny GPT-2-family test model:
```sh
python3 scripts/export_gpt2_hf.py \
  --model sshleifer/tiny-gpt2 \
  --prompt "Hey how are you" \
  --output-dir /tmp/tiny_gpt2_bundle
```

This writes:
- `model_config.json`
- `token_ids.txt`
- embedding weights
- per-layer transformer weights
- final norm / LM head files
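As an illustration of consuming such a bundle, here is a minimal loader sketch, assuming each weight file is plain whitespace-separated floats. The authoritative layout is whatever `scripts/export_gpt2_hf.py` writes; the file name in the usage comment is hypothetical:

```python
import numpy as np

def load_flat_weights(path, shape):
    """Read a flat text file of floats and reshape to the expected tensor shape."""
    flat = np.loadtxt(path, dtype=np.float32).ravel()
    assert flat.size == int(np.prod(shape)), f"{path}: got {flat.size} values"
    return flat.reshape(shape)

# e.g. wte = load_flat_weights("/tmp/tiny_gpt2_bundle/wte.txt", (vocab_size, n_embd))
```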
```sh
./build/main2 /tmp/tiny_gpt2_bundle/model_config.json /tmp/tiny_gpt2_bundle/token_ids.txt 8
```

That prints:
- current token ids
- last-position logits
- argmax next token
- final token id sequence
```sh
python3 scripts/decode_with_hf_tokenizer.py \
  --config /tmp/tiny_gpt2_bundle/model_config.json \
  --token-ids-file /tmp/tiny_gpt2_bundle/token_ids.txt
```

Or decode generated IDs directly:

```sh
python3 scripts/decode_with_hf_tokenizer.py \
  --config /tmp/tiny_gpt2_bundle/model_config.json \
  --token-ids "10814 703 389 345"
```

This repo now contains a real, minimal CUDA transformer inference path for small GPT-2-family models:
- real Hugging Face weights
- real GPU execution
- real token generation
- real tokenizer decode back to strings