| title | Neural Paged Attention |
|---|---|
| emoji | 🚀 |
| colorFrom | blue |
| colorTo | purple |
| sdk | docker |
| sdk_version | 4.36.1 |
| app_file | app.py |
| pinned | false |
A reinforcement-learning environment for training agents to manage the GPU KV Cache in LLM inference servers.
Neural PagedAttention is a deterministic, high-fidelity simulator of vLLM-style PagedAttention memory hardware. It exposes the cache eviction / swapping / admission decisions as a Gymnasium-compatible RL environment, and ships with a suite of baseline and learned agents (LRU, Random, Q-Learning, DQN, LLM, PPO) so static and dynamic policies can be benchmarked head-to-head on identical workloads.
A full design write-up lives in blog.md.
Modern LLM servers are bottlenecked by GPU memory. Static heuristics like LRU/FIFO are simple and well-understood, but they cannot anticipate future requests, distinguish VIP from Free users, or react to bimodal traffic. Existing research into learned eviction policies is fragmented — every paper builds its own simulator and metrics.
This project provides a shared, reproducible environment where:
- Static baselines (LRU, FIFO, size-weighted) and learned policies (DQN, PPO, LLM-as-policy) run on identical scenarios.
- Traffic patterns include short chatters, power users, returning users, and viral spikes.
- Episodes are scored on a composite of throughput, latency efficiency, and cache residency.
It is built as an early contribution to OpenEnv, a framework for standardizing RL environments in LLM infrastructure.
- Deterministic simulator of PagedAttention with GPU + CPU swap tiers.
- 20-D observation space capturing memory pressure, queue depth, idle-cache statistics, and SLA wait-time pressure.
- 18 discrete actions spanning eviction, swapping, admission, rejection, preemption, and garbage collection.
- Three difficulty tiers —
easy,medium,hard— with different traffic models and SLAs. - Six bundled agents plug-and-play behind a single API:
lru— Least Recently Used heuristicrandom— Random baselineqlearning— Tabular Q-learningneural— Deep Q-Network (PyTorch)llm— Qwen2.5-3B-Instruct as a zero-shot policyppo— Unsloth-fine-tuned Qwen with REINFORCE/PPO + LoRA via MLX
- FastAPI HTTP server (
/api/simulate,/api/agents,/api/settings) for headless batch runs and dashboards. - Next.js dashboard under
dashboard/for visualizing tick-level telemetry. - Hugging Face Spaces deploy out of the box (this repo is a Docker SDK Space).
.
├── app.py # FastAPI entrypoint (served on port 7860)
├── inference.py # Standalone inference / smoke test
├── models.py # Pydantic schemas for observations & actions
├── train_dqn.py # DQN training loop
├── run_lru_test.py # Quick LRU sanity check
├── run_qagent_test.py # Quick tabular Q-learning sanity check
├── agents/ # Agent implementations
│ ├── LRUAgent/
│ ├── RandomAgent/
│ ├── QLearningAgent/
│ ├── NeuralAgent/ # DQN
│ ├── LLMAgent/ # Qwen2.5 zero-shot
│ └── UnslothPPOAgent/ # PPO + LoRA via MLX
├── server/ # Core environment
│ ├── environment.py # KVCacheEnvironment (reset / step / state)
│ ├── env_components/ # Constants, traffic generator, scoring
│ └── session_manager.py # Multi-session support
├── hf_train_space/ # Hugging Face Space for training
├── hf_ppo_space/ # Hugging Face Space for PPO inference
├── dashboard/ # Next.js telemetry dashboard
├── colab/ # Colab training notebooks
├── assets/ # Diagrams used in blog.md
├── blog.md # Project write-up
├── openapi.yaml # OpenAPI spec for the HTTP API
├── Dockerfile # Hugging Face Space build
├── requirements.txt
└── pyproject.toml
20-dimensional float32 array, normalized to [0.0, 1.0] (the memory_pressure_trend field is in [-1.0, 1.0]).
| # | Field | Description |
|---|---|---|
| 1 | gpu_utilization_pct |
Fraction of GPU blocks in use |
| 2 | cpu_utilization_pct |
Fraction of CPU blocks in use |
| 3 | memory_pressure_trend |
GPU utilization slope over last 5 ticks (negative = falling) |
| 4 | free_queue_pressure |
Free queue length / max (100) |
| 5 | vip_queue_pressure |
VIP queue length / max (50) |
| 6 | free_max_wait_time_pct |
Oldest Free request wait / SLA threshold |
| 7 | vip_max_wait_time_pct |
Oldest VIP request wait / SLA threshold |
| 8 | yield_preempt_active |
Largest active request size / GPU capacity |
| 9 | free_size_max |
Largest idle Free GPU cache / GPU capacity |
| 10 | free_size_mean |
Mean idle Free GPU cache size |
| 11 | free_size_std_dev |
Std dev of idle Free GPU cache sizes |
| 12 | vip_size_max |
Largest idle VIP GPU cache / GPU capacity |
| 13 | vip_size_mean |
Mean idle VIP GPU cache size |
| 14 | vip_size_std_dev |
Std dev of idle VIP GPU cache sizes |
| 15 | free_age_max |
Oldest idle Free GPU cache age (normalized to 500 ticks) |
| 16 | free_age_mean |
Mean idle Free GPU cache age |
| 17 | free_age_std_dev |
Std dev of idle Free GPU cache ages |
| 18 | vip_age_max |
Oldest idle VIP GPU cache age |
| 19 | vip_age_mean |
Mean idle VIP GPU cache age |
| 20 | vip_age_std_dev |
Std dev of idle VIP GPU cache ages |
Discrete(18):
| ID | Action |
|---|---|
| 0 | Evict Largest Free cache (idle GPU only) |
| 1 | Evict Largest VIP cache (idle GPU only) |
| 2 | Evict Oldest Free cache (idle GPU only) |
| 3 | Evict Oldest VIP cache (idle GPU only) |
| 4 | Swap Largest Free cache GPU→CPU |
| 5 | Swap Largest VIP cache GPU→CPU |
| 6 | Swap Oldest Free cache GPU→CPU |
| 7 | Swap Oldest VIP cache GPU→CPU |
| 8 | Admit next Free user from queue |
| 9 | Admit next VIP user from queue |
| 10 | Reject next Free user (penalty if GPU has space) |
| 11 | Reject next VIP user (penalty if GPU has space) |
| 12 | Preempt & Shred Largest Active Free request |
| 13 | Preempt & Shred Largest Active VIP request |
| 14 | Preempt & Swap Largest Active Free → CPU |
| 15 | Preempt & Swap Largest Active VIP → CPU |
| 16 | Garbage Collect (delete idle Free CPU caches > 200 ticks) |
| 17 | Do Nothing |
| Task | Traffic | SLA Free | SLA VIP | Max Ticks |
|---|---|---|---|---|
| easy | 1 req / tick | None | None | 2,000 |
| medium | Day / night | 100 | 50 | 5,000 |
| hard | Viral spikes | 50 | 25 | 10,000 |
A tick corresponds to one token generated across all active requests. The default GPU has 10,000 blocks (≈160k tokens) and the CPU swap tier has 50,000 blocks (≈800k tokens). Traffic mix defaults to ~80% chatters, ~20% power users, with ~30% returning users.
Each episode produces a composite score over:
- Throughput — fraction of arriving requests successfully completed.
- Latency efficiency — generation time vs. queue-wait time per request.
- Cache residency — for returning users, whether their historical cache survived.
Throughput is the dominant term; cache residency acts as a tie-breaker.
- Python ≥ 3.10
- (Optional) Docker, for the Hugging Face Space build
- (Optional)
HF_TOKENif you want to pull gated models
git clone https://github.com/<your-org>/NeuralPagedAttention.git
cd NeuralPagedAttention
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txtpython app.py
# or
uvicorn app:app --host 0.0.0.0 --port 7860 --reloadThe server is now available at http://localhost:7860 with interactive Swagger docs at http://localhost:7860/docs.
docker build -t neural-paged-attention .
docker run -p 7860:7860 neural-paged-attentionpython run_lru_test.py
python run_qagent_test.pypython train_dqn.pyexport HF_TOKEN=your_token
export MODEL_NAME=TinyLlama/TinyLlama-1.1B-Chat-v1.0
python inference.pyFull schema lives in openapi.yaml. The most useful endpoints:
List the agents available to the simulator.
Inspect or override environment constants (GPU_TOTAL_BLOCKS, CPU_TOTAL_BLOCKS, TOKENS_PER_BLOCK, per-task max_ticks, TRAFFIC_SEED).
Run a full headless simulation and return tick- and session-level telemetry.
curl -X POST http://localhost:7860/api/simulate \
-H "Content-Type: application/json" \
-d '{"agent": "neural", "task": "hard"}'Response:
{
"status": "success",
"agent": "neural",
"task": "hard",
"session_logs": [ ... ],
"tick_logs": [ ... ]
}Legacy single-env endpoints (/reset, /step, /state) remain available but are deprecated; prefer the session-scoped variants exposed via server/session_manager.py.
The table below reports the environment final score (a composite of throughput, latency efficiency, and cache residency, clamped to (0.001, 0.999)). Random / LRU / Q-Learning / DQN were measured locally with the default environment constants. LLM and PPO scores were fetched from the same backend the dashboard uses (https://suryanshchattree-neural-paged-attention-env.hf.space/api/simulate), where their tick budget is capped to keep request latency manageable on a CPU-only Space.
| Agent | easy | medium | hard | Notes |
|---|---|---|---|---|
| Random | 0.887 | 0.001 | 0.001 | Crashes within ~15 ticks on medium / hard |
| LRU | 0.977 | 0.253 | 0.223 | Best on easy and medium; only agent that survives all of easy |
| Tabular Q-Learning | 0.435 | 0.172 | 0.148 | Cold-start, no warm Q-table — bottom of the pack |
| DQN (Neural) | 0.898 | 0.001 | 0.353 | Strong on easy, crashes early on medium |
| LLM (Qwen2.5-1.5B) | 0.267 | 0.309 | 0.333 | Hosted run; ~50-tick budget. Parse-failure rate 0% |
| PPO (Unsloth + LoRA) | 0.334 | 0.253 | 0.352 | Hosted run; survives furthest on medium (138 ticks) before crash |
Cumulative-score traces from the live dashboard make the gap between difficulty tiers easy to read: every agent finishes easy in positive territory, while every agent is in free fall by the end of hard.
Raw per-run telemetry (reward totals, ticks run, arrival/completion counts, crash flags, LLM parse failures) is saved to results/benchmarks.json and the full LLM/PPO tick-level traces are in results/remote_runs/.
A few takeaways:
easyrewards a near-do-nothing policy more than people expect. Even Random scores0.887because traffic stays under capacity. The environment doesn't really start testing a policy untilmedium.mediumis the difficulty cliff. Random and DQN both trip the crash penalty within the first 20 ticks, collapsing to the0.001floor. LRU's0.253is the strongest passing score.- LLM and PPO get higher scores than Random / DQN on the harder tiers but on much shorter horizons. Both crash within ~25–30 ticks on
medium/hardon the hosted backend. The composite score is biased upward by the small denominator (few requests arrived before crash). PPO onmediumis the exception: it sustains 138 ticks and 12 completed requests, the best LLM-class number on that tier. - The classical baseline is harder to beat than people assume. LRU's
0.977oneasyand0.253onmediumare the bars a learned policy actually has to clear. None of the learned agents shipped here clear them onmediumend-to-end.
To regenerate locally:
python scripts/run_benchmarks.py # random / lru / qlearning / neural
python scripts/fetch_remote_scores.py # llm / ppo via the live HF Space
# or, with GPU + cached Qwen2.5:
python scripts/run_benchmarks.py --include-llm --include-ppoOr run an individual agent against the live API:
curl -s -X POST https://suryanshchattree-neural-paged-attention-env.hf.space/api/simulate \
-H "Content-Type: application/json" \
-d '{"agent": "ppo", "task": "medium"}' \
| jq '.session_logs[].final_score'A web dashboard for running and inspecting these simulations is hosted at:
https://neural-paged-attention-dashboard-7q.vercel.app/
It is a Vite + React + Recharts client that talks to the public Hugging Face Space backend at https://suryanshchattree-neural-paged-attention-env.hf.space/api.
- Configuration panel — override the global environment constants (deterministic traffic seed, GPU and CPU total blocks, tokens per block, max ticks per difficulty), then trigger a batch run with the START BATCH SIMULATION button. Leaving fields blank uses the backend defaults.
- Multi-agent comparison panel —
MULTI-AGENT COMPARISON (CUMULATIVE SCORE)plots the running cumulative score of each agent on the same axis so you can read off relative performance over time, not just at episode end. - Agent focus panel —
AGENT FOCUS:lets you drill into a single agent and shows liveTick,Score,Total Reward, currentTask,Action,GPU / CPU Util,Memory Trend, andYield Preemptindicator. - Queue panels — separate
Free RequestsandVIP Requestscharts surface the queue dynamics that drive the SLA-miss penalties. - Live log — per-tick action / reward / tick stream (
log-action,log-reward,log-tick) and aLive Scorerunning total. - Status badges —
COMPLETEDandCRASHEDflags per session, matching thecrashedfield in the API'ssession_logs.
A completed easy run vs. a fully-crashed hard run, side by side on the dashboard's traffic / token / macro-stats panels:
- Crash localization. The cumulative-score line for an agent flattens the moment its
crashedflag flips. Random and DQN go flat almost immediately onmedium/hard, while LRU keeps drifting and only hits the wall mid-episode. - Memory-pressure inflection points. The
Memory TrendandGPU Utilizationcharts reveal that crashes follow a sharp, concentrated 5-tick rise in GPU utilization rather than slow drift — agents that don't pre-empt aggressively lose the window. - Throughput vs. latency trade-off. On the LRU run, you can see a tight queue (
VIP Requestsnear zero) but periodic stalls in cumulative score during traffic spikes — that's the policy paying for swap latency to keep VIPs admitted. - Policy fingerprints. The action stream in the live log makes each agent's signature obvious: Random is uniform, LRU concentrates on
Swap Oldest/Evict Oldestonce GPU > 85%, the Q-learning agent over-issuesRejectactions, and DQN's behavior changes character around the same memory-pressure thresholds the LRU heuristic uses. - Reward shape ≠ score shape. Cumulative reward can plunge into very negative territory (large SLA penalties) while final score remains > 0.2 — useful when comparing crashed-but-completed runs against early-terminated ones.
This is an early-stage simulator and intentionally simplified.
- No hardware-level timing effects, memory bandwidth, or PCIe transfer cost modeling.
- Traffic patterns are heuristic (compound sine + spikes), not replayed production traces.
- Not yet validated against a live vLLM deployment; simulator and real-world performance may diverge.
See the "What it does not do" section of blog.md for a longer discussion.
- 🌐 Live dashboard: https://neural-paged-attention-dashboard-7q.vercel.app/
- 🤗 Hugging Face Space (backend): https://suryanshchattree-neural-paged-attention-env.hf.space
- 📓 Blog / design write-up:
blog.md - 📜 OpenAPI spec:
openapi.yaml




