
CUDA NN Inference Engine

A minimal CUDA transformer inference engine for small GPT-2-family models.

It currently supports:

  • token embeddings + positional embeddings
  • multi-layer causal self-attention
  • GELU MLP blocks
  • final norm + LM head
  • argmax token generation
  • decoding generated token IDs back to text with the matching Hugging Face tokenizer
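The supported pieces above can be sketched conceptually in NumPy. This is a hedged illustration of one causal-attention head group plus the GPT-2 GELU, not the repo's actual CUDA kernels; all weight names and shapes here are hypothetical:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used by GPT-2
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def causal_self_attention(x, w_qkv, w_out, n_head):
    # x: (seq, d_model); w_qkv: (d_model, 3*d_model); w_out: (d_model, d_model)
    seq, d = x.shape
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)
    hd = d // n_head
    # reshape to (n_head, seq, head_dim)
    q = q.reshape(seq, n_head, hd).transpose(1, 0, 2)
    k = k.reshape(seq, n_head, hd).transpose(1, 0, 2)
    v = v.reshape(seq, n_head, hd).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)       # (n_head, seq, seq)
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # positions after the query
    scores[:, mask] = -1e9                                # causal mask: no looking ahead
    p = np.exp(scores - scores.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    out = (p @ v).transpose(1, 0, 2).reshape(seq, d)
    return out @ w_out
```

The CUDA implementation fuses and parallelizes these steps, but the math per block is the same.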

What This Project Demonstrates

This project shows the core shape of decoder-only transformer inference on GPU:

  1. prompt -> token ids
  2. token ids -> embeddings
  3. run transformer blocks
  4. produce logits
  5. pick next token id
  6. repeat
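The loop above can be sketched in a few lines of Python. Here `logits_fn` is a hypothetical stand-in for the CUDA forward pass (steps 2 through 4), returning last-position logits:

```python
import numpy as np

def greedy_generate(logits_fn, token_ids, n_new):
    # logits_fn(ids) -> (vocab,) logits at the last position;
    # stand-in for the GPU forward pass, not the repo's actual API
    ids = list(token_ids)
    for _ in range(n_new):
        logits = logits_fn(ids)
        ids.append(int(np.argmax(logits)))  # step 5: argmax = greedy decoding
    return ids
```

Without a KV cache (see Current Scope), each call to `logits_fn` reprocesses the full sequence, so generation cost grows quadratically with length.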

The code is intentionally minimal. It is not an optimized production inference engine.

Current Scope

Working well:

  • small GPT-2-style models
  • exported Hugging Face weights
  • real GPU runtime testing

Not implemented:

  • KV cache
  • sampling beyond argmax
  • Llama-style architectures (RoPE, RMSNorm, GQA, SwiGLU)
  • wide-model support beyond the current small-kernel assumptions
  • full tensor-by-tensor numerical validation against PyTorch
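To make the sampling gap concrete, here is a minimal sketch of what top-k sampling with temperature could look like. None of this exists in the repo; function and parameter names are illustrative only:

```python
import numpy as np

def sample_top_k(logits, k=40, temperature=1.0, rng=None):
    # Keep the k highest logits, rescale by temperature, sample from the result.
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    top = np.argsort(logits)[-k:]                # indices of the k best tokens
    p = np.exp(logits[top] - logits[top].max())  # stable softmax over the top-k
    p /= p.sum()
    return int(rng.choice(top, p=p))
```

With k=1 this reduces to the argmax behavior the engine currently implements.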

Main Files

Build

mkdir -p build
nvcc -arch=sm_75 -O3 src/main2.cu -o build/main2

Adjust sm_75 to match your GPU's compute capability.

Export A Model

Example with the tiny GPT-2-family test model:

python3 scripts/export_gpt2_hf.py \
  --model sshleifer/tiny-gpt2 \
  --prompt "Hey how are you" \
  --output-dir /tmp/tiny_gpt2_bundle

This writes:

  • model_config.json
  • token_ids.txt
  • embedding weights
  • per-layer transformer weights
  • final norm / LM head files

Run Generation

./build/main2 /tmp/tiny_gpt2_bundle/model_config.json /tmp/tiny_gpt2_bundle/token_ids.txt 8

That prints:

  • current token ids
  • last-position logits
  • argmax next token
  • final token id sequence

Decode To Text

python3 scripts/decode_with_hf_tokenizer.py \
  --config /tmp/tiny_gpt2_bundle/model_config.json \
  --token-ids-file /tmp/tiny_gpt2_bundle/token_ids.txt

Or decode generated IDs directly:

python3 scripts/decode_with_hf_tokenizer.py \
  --config /tmp/tiny_gpt2_bundle/model_config.json \
  --token-ids "10814 703 389 345"

Summary

This repo now contains a real, minimal CUDA transformer inference path for small GPT-2-family models:

  • real Hugging Face weights
  • real GPU execution
  • real token generation
  • real tokenizer decode back to strings