Production LLM inference stack — TinyLlama 1.1B on RTX 4050 Laptop GPU.
Stack: Client → FastAPI (async) → Redis Cache → PyTorch FP16 → GPU
Tested on: RTX 4050 Laptop (6GB), CUDA 13, Python 3.12, WSL2 Ubuntu 22.04
| Metric | Value |
|---|---|
| TTFT (time to first token) P50 | 28ms |
| Decode speed | 39.4 tok/s |
| Cache hit latency P50 | 2ms |
| Cache hit rate | 81% |
| Success rate @ concurrency=10 | 100% |
| VRAM used | 2583 MB / 6141 MB |
Cold prefill (no cache) completes in 28ms P50; cache hits return in 2ms. Decode runs at 39.4 tok/s, roughly 85% of the theoretical maximum for the RTX 4050L's 192 GB/s memory bandwidth.
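The P50 figures above are percentiles over per-request timing samples. A minimal sketch of that aggregation, using invented sample values (not the project's actual measurements):

```python
import statistics

# Hypothetical per-request TTFT samples in milliseconds — invented numbers
# for illustration, one sample per request from a benchmark run.
samples_ms = [27.1, 28.0, 29.4, 26.8, 28.3, 31.0, 27.9]

# P50 is simply the median of the samples.
p50 = statistics.median(samples_ms)

# Higher percentiles (P90/P99) come from the same samples; with n=100
# cut points, quantiles()[89] approximates P90.
p90 = statistics.quantiles(samples_ms, n=100)[89]

print(p50)  # 28.0
```

Tail percentiles (P90/P99) matter more than P50 under load, since a single slow request behind the GPU lock delays everything queued after it.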
```
Client
  │
  ▼
FastAPI Gateway (:8080)
  ├── Redis Cache ──► HIT: ~2ms
  └── asyncio.Lock → PyTorch FP16 → RTX 4050 GPU
```
| Layer | Technology | Purpose |
|---|---|---|
| Model | TinyLlama 1.1B FP16 | Decoder-only transformer |
| Kernel | OpenAI Triton fused attention | Reduced HBM round-trips |
| Cache | Redis 7.2 | Prompt deduplication |
| Gateway | FastAPI + uvicorn | Async request routing |
| Profiling | CUDA events + pynvml | GPU-accurate timing |
```bash
git clone <repo> && cd llm-inference-serving
pip install torch transformers accelerate fastapi "uvicorn[standard]" \
    redis pynvml httpx aiohttp rich
pip install -e .
```
```bash
# Start Redis
docker run -d --name redis --network host redis:7.2-alpine \
    redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru

# Start gateway (~30s to load model)
uvicorn src.gateway.app:app --host 0.0.0.0 --port 8080 --workers 1
```

```bash
python demo.py                                                        # smoke test
python benchmarks/bench.py --concurrency 1 --requests 20 --no-cache   # raw inference
python benchmarks/bench.py --concurrency 10 --requests 100            # with cache
```
```bash
python profiling/gpu_monitor.py   # live GPU stats
```

```bash
curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is a GPU?", "max_new_tokens": 100}'

curl http://localhost:8080/health
curl http://localhost:8080/metrics
```

```
llm-inference-serving/
├── src/
│   ├── models/pytorch_baseline.py   # FP16 baseline
│   ├── kernels/fused_attention.py   # Triton fused attention kernel
│   ├── cache/redis_cache.py         # Async Redis cache
│   └── gateway/app.py               # FastAPI gateway
├── benchmarks/
│   ├── bench_pytorch.py             # Standalone baseline
│   └── bench.py                     # Async load test
├── profiling/
│   ├── profiler.py                  # CUDA event profiler
│   └── gpu_monitor.py               # Live GPU monitor
├── docker/docker-compose.yml        # Redis
├── demo.py                          # Smoke test
└── RUNBOOK.md                       # Step-by-step guide
```
**FP16** — 2× effective memory bandwidth vs FP32 on the 192 GB/s RTX 4050L. The model fits in 2.6GB of VRAM, leaving headroom for the KV cache.
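A back-of-envelope check of that footprint — raw FP16 weights come in under the 2.6GB VRAM figure, which also includes CUDA context and activation overhead:

```python
# TinyLlama has ~1.1B parameters.
params = 1.1e9

fp16_gb = params * 2 / 1e9   # 2 bytes per parameter in FP16
fp32_gb = params * 4 / 1e9   # 4 bytes per parameter in FP32

print(fp16_gb)  # 2.2 — weights alone; the rest of the 2.6GB is overhead
print(fp32_gb)  # 4.4 — FP32 would leave little headroom on a 6GB GPU
```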
**Fused attention kernel** — the Triton kernel eliminates intermediate HBM writes of the attention matrix; ~15ms saved at S=512 vs PyTorch SDPA.
**asyncio.Lock on the GPU** — a single GPU can't run concurrent inference; the lock serializes requests cleanly instead of corrupting CUDA state across threads.
**Redis write-through via `create_task`** — cache writes are fire-and-forget and never block the response path.
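A sketch of the write-through pattern, with an in-memory dict standing in for Redis; `cache_key` illustrates hashing-based prompt deduplication but is an assumed scheme, not necessarily the repo's key format:

```python
import asyncio
import hashlib
import json

fake_cache: dict = {}        # stand-in for Redis
background_tasks = set()     # keep strong refs so pending tasks aren't GC'd

def cache_key(prompt: str, max_new_tokens: int) -> str:
    # Deterministic key: identical (prompt, params) pairs dedupe to one entry.
    payload = json.dumps({"p": prompt, "n": max_new_tokens}, sort_keys=True)
    return "gen:" + hashlib.sha256(payload.encode()).hexdigest()

async def cache_set(key: str, value: str) -> None:
    await asyncio.sleep(0.01)            # stand-in for the Redis round-trip
    fake_cache[key] = value

async def handle_request(prompt: str) -> str:
    key = cache_key(prompt, 100)
    if key in fake_cache:                # cache hit: the ~2ms path
        return fake_cache[key]
    text = f"generated:{prompt}"         # stand-in for GPU inference
    task = asyncio.create_task(cache_set(key, text))  # fire-and-forget write
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return text                          # response never waits on the write

async def main():
    first = await handle_request("What is a GPU?")
    await asyncio.sleep(0.05)            # let the background write land
    second = await handle_request("What is a GPU?")   # now a cache hit
    return first, second

first, second = asyncio.run(main())
```

Holding a reference to each task (here via `background_tasks`) matters: the event loop keeps only weak references, so an unanchored `create_task` can be garbage-collected before the write completes.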
**Single uvicorn worker** — one CUDA context, no VRAM wasted on a 6GB GPU.