A minimal CUDA transformer inference engine for small GPT-2-family models.
It currently supports:
- token embeddings + positional embeddings
- multi-layer causal self-attention
- GELU MLP blocks
- final norm + LM head
- argmax token generation
- decoding generated token IDs back to text with the matching Hugging Face tokenizer
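For reference on the MLP blocks: GPT-2 uses the tanh approximation of GELU. A NumPy sketch of that activation (illustration only, not the project's CUDA kernel):

```python
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    """GPT-2's tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# GELU is ~0 for large negative inputs and ~x for large positive inputs.
print(gelu(np.array([-3.0, 0.0, 3.0])))
```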
This project shows the core shape of decoder-only transformer inference on GPU:
- prompt -> token ids
- token ids -> embeddings
- run transformer blocks
- produce logits
- pick next token id
- repeat
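The loop above can be sketched in a few lines of Python. Here `toy_logits` is a hypothetical stand-in for the CUDA forward pass, just so the greedy decode structure is runnable:

```python
import numpy as np

def toy_logits(token_ids, vocab_size=16):
    # Hypothetical stand-in for the transformer forward pass:
    # returns one logit vector per position; only the last row is used below.
    rng = np.random.default_rng(sum(token_ids))
    return rng.standard_normal((len(token_ids), vocab_size))

def generate(prompt_ids, n_new):
    ids = list(prompt_ids)                    # prompt -> token ids
    for _ in range(n_new):
        logits = toy_logits(ids)              # run transformer blocks -> logits
        next_id = int(np.argmax(logits[-1]))  # pick next token id (argmax)
        ids.append(next_id)                   # repeat with the extended sequence
    return ids

print(generate([3, 7, 1], 5))
```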
The code is intentionally minimal. It is not an optimized production inference engine.
Working well:
- small GPT-2-style models
- exported Hugging Face weights
- real GPU runtime testing
Not implemented:
- KV cache
- sampling beyond argmax
- Llama-style architectures (RoPE, RMSNorm, GQA, SwiGLU)
- wide-model support beyond the current small-kernel assumptions
- full tensor-by-tensor numerical validation against PyTorch
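For context on "sampling beyond argmax": a standard alternative is temperature sampling over the last-position logits. A NumPy sketch of that general technique (not code from this repo):

```python
import numpy as np

def sample_token(logits, temperature=1.0, seed=None):
    """Sample a token id from a logit vector; temperature -> 0 recovers argmax."""
    rng = np.random.default_rng(seed)
    scaled = logits / max(temperature, 1e-6)
    scaled = scaled - scaled.max()            # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([1.0, 3.0, 0.5, 2.0])
print(sample_token(logits, temperature=0.0001))  # near-greedy: picks index 1
```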
- src/main2.cu: main CUDA inference path and decode loop
- src/kernels.cu: CUDA kernels
- src/model_loader.h: exported GPT-2 bundle loader
- scripts/export_gpt2_hf.py: export Hugging Face GPT-2-family weights into this project’s flat-text bundle format
- scripts/decode_with_hf_tokenizer.py: decode generated token IDs back to text
```sh
mkdir -p build
nvcc -arch=sm_75 -O3 src/main2.cu -o build/main2
```

Adjust `sm_75` to match your GPU.
Example with the tiny GPT-2-family test model:
```sh
python3 scripts/export_gpt2_hf.py \
  --model sshleifer/tiny-gpt2 \
  --prompt "Hey how are you" \
  --output-dir /tmp/tiny_gpt2_bundle
```

This writes:
- `model_config.json`
- `token_ids.txt`
- embedding weights
- per-layer transformer weights
- final norm / LM head files
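As an illustration of consuming such a bundle, here is a minimal loader sketch, assuming each weight file is plain whitespace-separated floats. The authoritative layout is whatever `scripts/export_gpt2_hf.py` writes; the file name in the usage comment is hypothetical:

```python
import numpy as np

def load_flat_weights(path, shape):
    """Read a flat text file of floats and reshape to the expected tensor shape."""
    flat = np.loadtxt(path, dtype=np.float32).ravel()
    assert flat.size == int(np.prod(shape)), f"{path}: got {flat.size} values"
    return flat.reshape(shape)

# e.g. wte = load_flat_weights("/tmp/tiny_gpt2_bundle/wte.txt", (vocab_size, n_embd))
```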
```sh
./build/main2 /tmp/tiny_gpt2_bundle/model_config.json /tmp/tiny_gpt2_bundle/token_ids.txt 8
```

That prints:
- current token ids
- last-position logits
- argmax next token
- final token id sequence
```sh
python3 scripts/decode_with_hf_tokenizer.py \
  --config /tmp/tiny_gpt2_bundle/model_config.json \
  --token-ids-file /tmp/tiny_gpt2_bundle/token_ids.txt
```

Or decode generated IDs directly:

```sh
python3 scripts/decode_with_hf_tokenizer.py \
  --config /tmp/tiny_gpt2_bundle/model_config.json \
  --token-ids "10814 703 389 345"
```

This repo now contains a real, minimal CUDA transformer inference path for small GPT-2-family models:
- real Hugging Face weights
- real GPU execution
- real token generation
- real tokenizer decode back to strings