CUDA NN Inference Engine

A minimal CUDA transformer inference engine for small GPT-2-family models.

It currently supports:

  • token embeddings + positional embeddings
  • multi-layer causal self-attention
  • GELU MLP blocks
  • final norm + LM head
  • argmax token generation
  • decoding generated token IDs back to text with the matching Hugging Face tokenizer
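As a reference for what the attention and MLP kernels compute, here is a NumPy sketch of one causal self-attention step and a GELU MLP. The weight names, shapes, and single-sequence layout are illustrative only, not the engine's actual memory layout:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used by GPT-2
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w_fc, w_out):
    # GPT-2 MLP block: expand, GELU, project back
    return gelu(x @ w_fc) @ w_out

def causal_self_attention(x, w_qkv, w_proj, n_head):
    # x: (T, C) activations for one sequence
    T, C = x.shape
    qkv = x @ w_qkv                                   # (T, 3C)
    q, k, v = np.split(qkv, 3, axis=-1)
    hs = C // n_head                                  # head size
    q = q.reshape(T, n_head, hs).transpose(1, 0, 2)   # (n_head, T, hs)
    k = k.reshape(T, n_head, hs).transpose(1, 0, 2)
    v = v.reshape(T, n_head, hs).transpose(1, 0, 2)
    att = q @ k.transpose(0, 2, 1) / np.sqrt(hs)      # (n_head, T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    att[:, mask] = -np.inf                            # causal mask: no peeking ahead
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)                 # row-wise softmax
    y = (att @ v).transpose(1, 0, 2).reshape(T, C)    # merge heads back to (T, C)
    return y @ w_proj
```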

What This Project Proves

This project shows the core shape of decoder-only transformer inference on GPU:

  1. prompt -> token ids
  2. token ids -> embeddings
  3. run transformer blocks
  4. produce logits
  5. pick next token id
  6. repeat
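The six steps above reduce to a short greedy decoding loop. The sketch below stands in for the engine: `forward_logits` is a hypothetical callback representing the CUDA forward pass, not the project's real API:

```python
import numpy as np

def generate_greedy(forward_logits, token_ids, n_new):
    # forward_logits(ids) -> (len(ids), vocab) logits; a stand-in for
    # the CUDA forward pass (hypothetical callback, not the real API).
    ids = list(token_ids)
    for _ in range(n_new):
        logits = forward_logits(ids)          # steps 3-4: run blocks, get logits
        next_id = int(np.argmax(logits[-1]))  # step 5: argmax at the last position
        ids.append(next_id)                   # step 6: feed back and repeat
    return ids
```

Without a KV cache (see "Not implemented" below), each iteration re-runs the full sequence through the model.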

The code is intentionally minimal. It is not an optimized production inference engine.

Current Scope

Working well:

  • small GPT-2-style models
  • exported Hugging Face weights
  • real GPU runtime testing

Not implemented:

  • KV cache
  • sampling beyond argmax
  • Llama-style architectures (RoPE, RMSNorm, GQA, SwiGLU)
  • wide-model support beyond the current small-kernel assumptions
  • full tensor-by-tensor numerical validation against PyTorch

Build

mkdir -p build
nvcc -arch=sm_75 -O3 src/main2.cu -o build/main2

Adjust sm_75 to match your GPU's compute capability.

Export A Model

Example with the tiny GPT-2-family test model:

python3 scripts/export_gpt2_hf.py \
  --model sshleifer/tiny-gpt2 \
  --prompt "Hey how are you" \
  --output-dir /tmp/tiny_gpt2_bundle

This writes:

  • model_config.json
  • token_ids.txt
  • embedding weights
  • per-layer transformer weights
  • final norm / LM head files
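Judging by the `--token-ids "10814 703 389 345"` form accepted by the decode script, token_ids.txt appears to hold whitespace-separated integer IDs. A minimal loader under that assumption:

```python
def read_token_ids(path):
    # Assumes token_ids.txt holds whitespace-separated integer token ids,
    # matching the space-separated form accepted by --token-ids.
    with open(path) as f:
        return [int(tok) for tok in f.read().split()]
```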

Run Generation

./build/main2 /tmp/tiny_gpt2_bundle/model_config.json /tmp/tiny_gpt2_bundle/token_ids.txt 8

That prints:

  • current token ids
  • last-position logits
  • argmax next token
  • final token id sequence

Decode To Text

python3 scripts/decode_with_hf_tokenizer.py \
  --config /tmp/tiny_gpt2_bundle/model_config.json \
  --token-ids-file /tmp/tiny_gpt2_bundle/token_ids.txt

Or decode generated IDs directly:

python3 scripts/decode_with_hf_tokenizer.py \
  --config /tmp/tiny_gpt2_bundle/model_config.json \
  --token-ids "10814 703 389 345"
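Decoding itself only needs the standard Hugging Face tokenizer API; a minimal equivalent of what the script does (assuming `transformers` is installed; the actual script may differ in details) is:

```python
from transformers import AutoTokenizer

def decode_ids(model_name, ids):
    # Load the matching tokenizer and map token ids back to a string.
    tok = AutoTokenizer.from_pretrained(model_name)
    return tok.decode(ids)
```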

Summary

This repo now contains a real, minimal CUDA transformer inference path for small GPT-2-family models:

  • real Hugging Face weights
  • real GPU execution
  • real token generation
  • real tokenizer decode back to strings
