Production LLM inference stack — TinyLlama 1.1B on RTX 4050 Laptop GPU.
Stack: Client → FastAPI (async) → Redis Cache → PyTorch FP16 → GPU
Tested on: RTX 4050 Laptop (6GB), CUDA 13, Python 3.12, WSL2 Ubuntu 22.04
| Metric | Value |
|---|---|
| TTFT (time to first token) P50 | 28ms |
| Decode speed | 39.4 tok/s |
| Cache hit latency P50 | 2ms |
| Cache hit rate | 81% |
| Success rate @ concurrency=10 | 100% |
| VRAM used | 2583 MB / 6141 MB |
Cold prefill (no cache) completes in 28ms P50; cache hits return in 2ms. Decode runs at 39.4 tok/s, roughly 85% of the theoretical maximum for the RTX 4050L's 192 GB/s memory bandwidth.
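The P50 figures above are percentiles over per-request timing samples. A minimal sketch of that aggregation, using invented sample values (not the project's actual measurements):

```python
import statistics

# Hypothetical per-request TTFT samples in milliseconds — invented numbers
# for illustration, one sample per request from a benchmark run.
samples_ms = [27.1, 28.0, 29.4, 26.8, 28.3, 31.0, 27.9]

# P50 is simply the median of the samples.
p50 = statistics.median(samples_ms)

# Higher percentiles (P90/P99) come from the same samples; with n=100
# cut points, quantiles()[89] approximates P90.
p90 = statistics.quantiles(samples_ms, n=100)[89]

print(p50)  # 28.0
```

Tail percentiles (P90/P99) matter more than P50 under load, since a single slow request behind the GPU lock delays everything queued after it.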
```
Client
  │
  ▼
FastAPI Gateway (:8080)
  ├── Redis Cache ──► HIT: ~2ms
  └── asyncio.Lock → PyTorch FP16 → RTX 4050 GPU
```
| Layer | Technology | Purpose |
|---|---|---|
| Model | TinyLlama 1.1B FP16 | Decoder-only transformer |
| Kernel | OpenAI Triton fused attention | Reduced HBM round-trips |
| Cache | Redis 7.2 | Prompt deduplication |
| Gateway | FastAPI + uvicorn | Async request routing |
| Profiling | CUDA events + pynvml | GPU-accurate timing |
```bash
git clone <repo> && cd llm-inference-serving
pip install torch transformers accelerate fastapi "uvicorn[standard]" \
    redis pynvml httpx aiohttp rich
pip install -e .
```
```bash
# Start Redis
docker run -d --name redis --network host redis:7.2-alpine \
    redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru

# Start gateway (~30s to load model)
uvicorn src.gateway.app:app --host 0.0.0.0 --port 8080 --workers 1
```

```bash
python demo.py                                                        # smoke test
python benchmarks/bench.py --concurrency 1 --requests 20 --no-cache   # raw inference
python benchmarks/bench.py --concurrency 10 --requests 100            # with cache
```
```bash
python profiling/gpu_monitor.py   # live GPU stats
```

```bash
curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is a GPU?", "max_new_tokens": 100}'

curl http://localhost:8080/health
curl http://localhost:8080/metrics
```

```
llm-inference-serving/
├── src/
│   ├── models/pytorch_baseline.py   # FP16 baseline
│   ├── kernels/fused_attention.py   # Triton fused attention kernel
│   ├── cache/redis_cache.py         # Async Redis cache
│   └── gateway/app.py               # FastAPI gateway
├── benchmarks/
│   ├── bench_pytorch.py             # Standalone baseline
│   └── bench.py                     # Async load test
├── profiling/
│   ├── profiler.py                  # CUDA event profiler
│   └── gpu_monitor.py               # Live GPU monitor
├── docker/docker-compose.yml        # Redis
├── demo.py                          # Smoke test
└── RUNBOOK.md                       # Step-by-step guide
```
**FP16** — 2× effective memory bandwidth vs FP32 on the 192 GB/s RTX 4050L. The model fits in 2.6GB of VRAM, leaving headroom for the KV cache.
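A back-of-envelope check of that footprint — raw FP16 weights come in under the 2.6GB VRAM figure, which also includes CUDA context and activation overhead:

```python
# TinyLlama has ~1.1B parameters.
params = 1.1e9

fp16_gb = params * 2 / 1e9   # 2 bytes per parameter in FP16
fp32_gb = params * 4 / 1e9   # 4 bytes per parameter in FP32

print(fp16_gb)  # 2.2 — weights alone; the rest of the 2.6GB is overhead
print(fp32_gb)  # 4.4 — FP32 would leave little headroom on a 6GB GPU
```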
**Fused attention kernel** — the Triton kernel eliminates intermediate HBM writes of the attention matrix; ~15ms saved at S=512 vs PyTorch SDPA.
**asyncio.Lock on the GPU** — a single GPU can't run concurrent inference; the lock serializes requests cleanly instead of corrupting CUDA state across threads.
**Redis write-through via `create_task`** — cache writes are fire-and-forget and never block the response path.
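A sketch of the write-through pattern, with an in-memory dict standing in for Redis; `cache_key` illustrates hashing-based prompt deduplication but is an assumed scheme, not necessarily the repo's key format:

```python
import asyncio
import hashlib
import json

fake_cache: dict = {}        # stand-in for Redis
background_tasks = set()     # keep strong refs so pending tasks aren't GC'd

def cache_key(prompt: str, max_new_tokens: int) -> str:
    # Deterministic key: identical (prompt, params) pairs dedupe to one entry.
    payload = json.dumps({"p": prompt, "n": max_new_tokens}, sort_keys=True)
    return "gen:" + hashlib.sha256(payload.encode()).hexdigest()

async def cache_set(key: str, value: str) -> None:
    await asyncio.sleep(0.01)            # stand-in for the Redis round-trip
    fake_cache[key] = value

async def handle_request(prompt: str) -> str:
    key = cache_key(prompt, 100)
    if key in fake_cache:                # cache hit: the ~2ms path
        return fake_cache[key]
    text = f"generated:{prompt}"         # stand-in for GPU inference
    task = asyncio.create_task(cache_set(key, text))  # fire-and-forget write
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return text                          # response never waits on the write

async def main():
    first = await handle_request("What is a GPU?")
    await asyncio.sleep(0.05)            # let the background write land
    second = await handle_request("What is a GPU?")   # now a cache hit
    return first, second

first, second = asyncio.run(main())
```

Holding a reference to each task (here via `background_tasks`) matters: the event loop keeps only weak references, so an unanchored `create_task` can be garbage-collected before the write completes.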
**Single uvicorn worker** — one CUDA context, no VRAM wasted on a 6GB GPU.