Skip to content

anviit/llm-inference-serving

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

llm-inference-serving

Production LLM inference stack — TinyLlama 1.1B on RTX 4050 Laptop GPU.

Stack: Client → FastAPI (async) → Redis Cache → PyTorch FP16 → GPU

Tested on: RTX 4050 Laptop (6GB), CUDA 13, Python 3.12, WSL2 Ubuntu 22.04


Results

Metric Value
TTFT P50 28ms
Decode speed 39.4 tok/s
Cache hit latency P50 2ms
Cache hit rate 81%
Success rate @ concurrency=10 100%
VRAM used 2583 MB / 6141 MB

Cold prefill (no cache) completes in 28ms P50. Cache hits return in 2ms. Decode runs at 39 tok/s — ~85% of theoretical max for the RTX 4050L's 192 GB/s memory bandwidth.


Architecture

Client
  │
  ▼
FastAPI Gateway (:8080)
  ├── Redis Cache ──► HIT: ~2ms
  └── asyncio.Lock → PyTorch FP16 → RTX 4050 GPU
Layer Technology Purpose
Model TinyLlama 1.1B FP16 Decoder-only transformer
Kernel OpenAI Triton fused attention Reduced HBM round-trips
Cache Redis 7.2 Prompt deduplication
Gateway FastAPI + uvicorn Async request routing
Profiling CUDA events + pynvml GPU-accurate timing

Quickstart

git clone <repo> && cd llm-inference-serving
pip install torch transformers accelerate fastapi "uvicorn[standard]" \
    redis pynvml httpx aiohttp rich
pip install -e .

# Start Redis
docker run -d --name redis --network host redis:7.2-alpine \
  redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru

# Start gateway (~30s to load model)
uvicorn src.gateway.app:app --host 0.0.0.0 --port 8080 --workers 1

Verify it works

python demo.py

Benchmarks

python benchmarks/bench.py --concurrency 1 --requests 20 --no-cache  # raw inference
python benchmarks/bench.py --concurrency 10 --requests 100            # with cache
python profiling/gpu_monitor.py                                        # live GPU stats

API

curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is a GPU?", "max_new_tokens": 100}'

curl http://localhost:8080/health
curl http://localhost:8080/metrics

Project Structure

llm-inference-serving/
├── src/
│   ├── models/pytorch_baseline.py   # FP16 baseline
│   ├── kernels/fused_attention.py   # Triton fused attention kernel
│   ├── cache/redis_cache.py         # Async Redis cache
│   └── gateway/app.py               # FastAPI gateway
├── benchmarks/
│   ├── bench_pytorch.py             # Standalone baseline
│   └── bench.py                     # Async load test
├── profiling/
│   ├── profiler.py                  # CUDA event profiler
│   └── gpu_monitor.py               # Live GPU monitor
├── docker/docker-compose.yml        # Redis
├── demo.py                          # Smoke test
└── RUNBOOK.md                       # Step-by-step guide

Engineering Decisions

FP16 — 2x memory bandwidth vs FP32 on 192 GB/s RTX 4050L. Model fits in 2.6GB VRAM leaving headroom for KV-cache.

Fused attention kernel — Triton kernel eliminates intermediate HBM writes on the attention matrix. ~15ms saved at S=512 vs PyTorch SDPA.

asyncio.Lock on GPU — single GPU can't run concurrent inference. Lock serializes requests cleanly instead of corrupting CUDA state across threads.

Redis write-through as create_task — cache writes fire-and-forget, never block the response path.

Single uvicorn worker — one CUDA context, no VRAM waste on a 6GB GPU.

About

Production LLM inference stack — 28ms TTFT, 39 tok/s, 81% cache hit rate on a 6GB GPU

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages