TensorPath — Inference Optimization Control Plane

Part of the TensorMux open-source ecosystem.

TensorPath tells you the best way to serve an LLM. Give it a model, a workload, and what you care about — latency, throughput, or cost — and it picks the optimal GPU + backend + quantization combination, explains why, and hands you a deployment-ready config. It also ships a kernel optimization layer (Forge) that can autonomously generate and verify faster Triton kernels for specific GPU + dtype + shape combinations.

What it does

Recommender — turns workload intent into an optimized deployment plan

  • Generates candidate plans (GPU × backend × quantization) for a model
  • Filters out anything that won't fit in VRAM or exceeds your budget
  • Scores each on five dimensions: latency, throughput, cost, quality, simplicity
  • Ranks them, explains the top pick, exports a deployment artifact
  • Compares plans across models (/compare)
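The first two steps above can be sketched roughly as follows. This is a minimal illustration, not TensorPath's recommender: the VRAM figures are the published capacities of each GPU, while `OVERHEAD`, `Candidate`, and `generate_candidates` are hypothetical names, and the weight-only memory estimate is an assumption. It checks memory only and ignores whether a GPU or backend actually supports a given quantization format.

```python
from dataclasses import dataclass
from itertools import product

# Published VRAM capacities (GB) for the GPUs in the supported surface.
GPU_VRAM_GB = {"L4": 24, "L40S": 48, "A100-80GB": 80, "H100": 80, "RTX 4070": 12}
BACKENDS = ["vLLM", "TensorRT-LLM"]
# Bytes per weight parameter for each quantization.
BYTES_PER_PARAM = {"FP16": 2.0, "BF16": 2.0, "FP8": 1.0, "AWQ": 0.5, "GPTQ": 0.5}
# Assumption: headroom multiplier for KV cache and activations (not TensorPath's value).
OVERHEAD = 1.2


@dataclass(frozen=True)
class Candidate:
    gpu: str
    backend: str
    quantization: str
    est_vram_gb: float


def generate_candidates(params_billion: float) -> list[Candidate]:
    """Enumerate GPU x backend x quantization and keep the plans that fit in VRAM."""
    fits = []
    for gpu, backend, quant in product(GPU_VRAM_GB, BACKENDS, BYTES_PER_PARAM):
        # 1e9 parameters at N bytes each is roughly N GB of weights.
        est = params_billion * BYTES_PER_PARAM[quant] * OVERHEAD
        if est <= GPU_VRAM_GB[gpu]:
            fits.append(Candidate(gpu, backend, quant, round(est, 2)))
    return fits


candidates = generate_candidates(7.0)
total = len(GPU_VRAM_GB) * len(BACKENDS) * len(BYTES_PER_PARAM)
print(f"{len(candidates)} of {total} plans fit")
```

Under these assumptions a 7B model in FP16 or BF16 is dropped for the 12 GB RTX 4070, while its 4-bit variants survive the filter.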

Forge — verified kernel optimization layer

  • Retrieves expert playbooks from @krxgu/kernel-skills (npm)
  • Either generates an agent-ready prompt for you to use elsewhere (manual mode)
  • Or runs an autonomous loop with Claude Opus 4.7 that writes the kernel itself (agentic mode)
  • Validates correctness with pytest against a PyTorch reference, benchmarks against a baseline, and refuses to promote anything below a 1.10× speedup
  • Verified kernels land in a registry that the recommender annotates onto plan results
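The promotion rule in the list above can be pictured as a small gate function. This is a sketch under assumptions: `GateResult` and `promotion_gate` are hypothetical names, and only the 1.10× threshold comes from this README.

```python
from dataclasses import dataclass

MIN_SPEEDUP = 1.10  # promotion threshold stated in this README


@dataclass
class GateResult:
    promoted: bool
    speedup: float
    reason: str


def promotion_gate(tests_passed: bool, baseline_ms: float, candidate_ms: float) -> GateResult:
    """Decide whether a candidate kernel may enter the registry."""
    if baseline_ms <= 0 or candidate_ms <= 0:
        raise ValueError("timings must be positive")
    speedup = baseline_ms / candidate_ms
    # Correctness comes first: a fast but wrong kernel is never promoted.
    if not tests_passed:
        return GateResult(False, speedup, "correctness tests failed")
    if speedup < MIN_SPEEDUP:
        return GateResult(False, speedup, f"speedup {speedup:.2f}x is below {MIN_SPEEDUP:.2f}x")
    return GateResult(True, speedup, "verified and faster than baseline")
```

Checking correctness before speed means a failing kernel is rejected regardless of how fast it runs.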

Quick start

# python 3.11+
pip install -r requirements.txt

# Node 18+ for kernel-skills (via npm)
bash scripts/install_kernel_skills.sh

# (optional, for agentic Forge) drop your Anthropic key in .env
cp .env.example .env
# edit .env and paste your sk-ant-api03-...

# run the server
python -m uvicorn app.main:app --reload --port 8000

Then visit http://localhost:8000/

Three ways to use it

1. Web UI

URL                        What
/                          Recommendation form
/compare                   Side-by-side comparison across models
/forge                     Kernel optimization runs + verified-kernel registry
/forge/runs/<id>/agentic   Live dashboard for an autonomous Forge run

2. JSON API

# recommend
curl -X POST http://localhost:8000/api/recommend \
  -H 'Content-Type: application/json' \
  -d '{"model_id":"qwen2.5-7b","workload_type":"chat","optimization_priority":"cost","constraints":{"max_p95_latency_ms":250,"max_monthly_budget_usd":300}}'

# compare two models
curl -X POST http://localhost:8000/api/compare \
  -H 'Content-Type: application/json' \
  -d '{"model_ids":["qwen2.5-7b","llama3.2-3b"],"workload_type":"chat","optimization_priority":"balanced"}'

# kick off an autonomous Forge run
curl -X POST http://localhost:8000/api/forge/runs \
  -H 'Content-Type: application/json' \
  -d '{"op":"rmsnorm","language":"triton","target_gpu":"RTX 4070","dtype":"fp16","shape":{"batch":16,"hidden_size":4096}}'
# then start the agent against the returned run_id:
curl -X POST http://localhost:8000/api/forge/runs/<run_id>/agentic \
  -H 'Content-Type: application/json' \
  -d '{"max_iterations":5,"cost_cap_usd":3.0}'

OpenAPI docs at /docs.
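The same recommend call can be made from Python with only the standard library. This is a sketch: the endpoint and payload fields mirror the curl example above, while `build_recommend_request` and `recommend` are hypothetical helper names. The response schema is not documented here, so the result is returned as a plain dict.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the quick-start server


def build_recommend_request(model_id: str, priority: str = "cost") -> urllib.request.Request:
    """Build the POST request for /api/recommend without sending it."""
    payload = {
        "model_id": model_id,
        "workload_type": "chat",
        "optimization_priority": priority,
        "constraints": {"max_p95_latency_ms": 250, "max_monthly_budget_usd": 300},
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/recommend",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def recommend(model_id: str, priority: str = "cost") -> dict:
    """Send the request and parse the JSON body (requires the server to be running)."""
    with urllib.request.urlopen(build_recommend_request(model_id, priority), timeout=30) as resp:
        return json.load(resp)
```

With the server running, `recommend("qwen2.5-7b")` returns the parsed response. See /docs for the actual schema.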

3. CLI

# Forge from the terminal (no server required)
python scripts/forge.py create --op rmsnorm --language triton \
    --gpu "RTX 4070" --dtype fp16 --batch 16 --hidden-size 4096
python scripts/forge.py verify    --run-id <run_id>
python scripts/forge.py benchmark --run-id <run_id>
python scripts/forge.py promote   --run-id <run_id>
python scripts/forge.py list-kernels

# tiny smoke-test of the recommender (no server)
python scripts/demo.py

Project structure

app/
  api/             FastAPI routes (/api/* + /api/forge/*)
  ui/              Jinja templates + UI handlers
  schemas/         pydantic data contracts
  services/
    recommender/         candidate generation + scoring + ranking
    benchmark_store/     loads + queries benchmark profiles
    runtime_registry/    backend capabilities (vLLM, TensorRT-LLM)
    deployment/          DeploymentConfig export
    explanation/         "why this plan won" text
    optimization/        plan-annotation passes (KernelRegistryPass)
    forge/               kernel-skills, prompt builder, gates, agentic loop
  kernels/triton/        verified kernel source files
benchmarks/
  profiles/        benchmark JSON (priors + measured)
  runners/         vLLM-driven measurement tool
  kernels/         shared bench utils (warmup, CUDA sync, percentiles)
kernel_registry/   verified_kernels.json (the source of truth)
forge_runs/        per-run artifacts (gitignored except .gitkeep)
docs/              FORGE.md, KERNEL_REGISTRY.md, BENCHMARKING.md, CLAIMS.md
scripts/           forge CLI + demo
tests/             93 unit tests + 1 opt-in integration test

Supported surface (MVP)

models: Qwen 2.5 7B, Qwen 2.5 3B, Llama 3.1 8B, Llama 3.2 3B
GPUs: L4, L40S, A100-80GB, H100 (datacenter); RTX 4070 (local for measured profiles)
backends: vLLM, TensorRT-LLM
quantizations: FP16, BF16, FP8, AWQ 4-bit, GPTQ 4-bit
priorities: latency, throughput, cost, balanced

How scoring works

Each candidate is scored on five dimensions:

  1. latency — relative to the candidate set; bonus for headroom under user target, penalty proportional to overshoot
  2. throughput — tokens/sec, plus a hard penalty if min_throughput_tps not met
  3. cost — hourly rate; bonus when comfortably under budget, heavy penalty when over
  4. quality — quantization degradation (FP16=1.00, FP8=0.96, AWQ=0.88, GPTQ=0.86)
  5. simplicity — operational complexity of the backend (vLLM is simpler than TensorRT-LLM)

Weights shift based on optimization_priority. Hard constraints (VRAM, budget × 2, latency, throughput) either filter a candidate out or heavily penalize its score before ranking.
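A weighted sum is one way to picture how the priority shifts the ranking. The weights and per-dimension scores below are made up for illustration (only the quality values 0.88 and 1.00 come from the list above), and `score` and `rank` are hypothetical names, not the recommender's API.

```python
DIMENSIONS = ("latency", "throughput", "cost", "quality", "simplicity")

# Assumption: illustrative weights only; the real values live in the recommender service.
WEIGHTS = {
    "latency":    {"latency": 0.45, "throughput": 0.15, "cost": 0.15, "quality": 0.15, "simplicity": 0.10},
    "throughput": {"latency": 0.15, "throughput": 0.45, "cost": 0.15, "quality": 0.15, "simplicity": 0.10},
    "cost":       {"latency": 0.15, "throughput": 0.15, "cost": 0.45, "quality": 0.15, "simplicity": 0.10},
    "balanced":   {"latency": 0.20, "throughput": 0.20, "cost": 0.20, "quality": 0.20, "simplicity": 0.20},
}


def score(dim_scores: dict[str, float], priority: str) -> float:
    """Combine per-dimension scores (each in 0..1) using the weights for a priority."""
    weights = WEIGHTS[priority]
    return sum(weights[d] * dim_scores[d] for d in DIMENSIONS)


def rank(plans: dict[str, dict[str, float]], priority: str) -> list[tuple[str, float]]:
    """Return (plan name, score) pairs, best first."""
    scored = [(name, score(dims, priority)) for name, dims in plans.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical per-dimension scores for two plans.
plans = {
    "L4 / vLLM / AWQ": {
        "latency": 0.6, "throughput": 0.5, "cost": 0.9, "quality": 0.88, "simplicity": 1.0,
    },
    "H100 / TensorRT-LLM / FP16": {
        "latency": 1.0, "throughput": 1.0, "cost": 0.2, "quality": 1.00, "simplicity": 0.6,
    },
}

for priority in ("cost", "latency"):
    print(priority, "->", rank(plans, priority)[0][0])
```

With these made-up numbers the L4 plan ranks first under the cost priority and the H100 plan ranks first under the latency priority.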

Benchmark data

Profiles live in benchmarks/profiles/ as JSON, labeled with their source: measured, estimated, or imported. We don't fake precision — if a number is estimated, it says so.

Measured profiles (RTX 4070, single-request, vLLM)

Model          Quant   p95 TTFT   Tok/s
Qwen 2.5 7B    AWQ     160 ms     79
Llama 3.1 8B   AWQ     161 ms     81
Llama 3.2 3B   AWQ     122 ms     113
Qwen 2.5 3B    FP16    195 ms     66

To add more:

python benchmarks/runners/bench.py --model <id> --quantization <quant>

Gated repos (Meta Llama) need either huggingface-cli login or HF_TOKEN exported.

Forge — kernel optimization layer

Two modes, same gates.

Manual mode. You fill out the Forge form and get a 75 KB strict markdown prompt with a curated kernel-skills bundle. Hand it to your own coding agent (Claude Code, Cursor, etc.) or write the kernel yourself, drop the five required files into forge_runs/<id>/candidate/, then click verify → benchmark → promote.

Agentic mode (requires ANTHROPIC_API_KEY). Same form, but check the "Autonomous" box. Forge launches a background loop that calls Claude Opus 4.7 with a tool-use surface (write_candidate_file, run_verify, run_benchmark, etc.), iterates up to 5 times under a $3 cost cap, and on success copies the verified Triton kernel into app/kernels/triton/verified/<op>/<kernel_id>.py and registers it. A live dashboard shows status, cost, iteration count, gate badges, and a scrolling transcript.
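The stopping rules of the agentic loop (iteration limit, cost cap, success) can be sketched as below. This is not Forge's implementation: `StepResult`, `run_agentic_loop`, and the status strings are hypothetical, and `step` stands in for one model call plus whatever tools it triggers.

```python
from typing import Callable, NamedTuple


class StepResult(NamedTuple):
    cost_usd: float      # spend attributed to this iteration
    gates_passed: bool   # True once verify and benchmark both succeed


def run_agentic_loop(
    step: Callable[[int], StepResult],
    max_iterations: int = 5,
    cost_cap_usd: float = 3.0,
) -> tuple[str, int, float]:
    """Run `step` until it succeeds, the cost cap is hit, or iterations run out.

    Returns (status, iterations_used, total_cost_usd).
    """
    total = 0.0
    for i in range(1, max_iterations + 1):
        result = step(i)
        total += result.cost_usd
        if result.gates_passed:
            return "promoted", i, total
        if total >= cost_cap_usd:
            return "cost_cap_reached", i, total
    return "max_iterations_reached", max_iterations, total
```

For example, `run_agentic_loop(lambda i: StepResult(0.5, i == 3))` stops on the third iteration with a total cost of 1.5.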

Real promoted kernels in this repo (RMSNorm on RTX 4070, fp16, batch=16, hidden_size=4096):

app/kernels/triton/verified/rmsnorm/
  triton_rmsnorm_rtx4070_fp16_b16_h4096_v1.py    # 3.66x vs torch eager
  triton_rmsnorm_rtx4070_fp16_b16_h4096_v2.py    # 3.66x
  triton_rmsnorm_rtx4070_fp16_b16_h4096_v3.py    # 3.63x  (tightest implementation)

These were written by the agent, verified against a PyTorch RMSNorm reference at multiple shapes including non-power-of-two, and benchmarked at 200 iterations after 25 warmup iterations.
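The warmup-then-measure protocol mentioned above (25 warmup iterations, 200 timed iterations) looks roughly like this. It is a CPU-only sketch with a hypothetical `bench` helper, not the shared utilities in benchmarks/kernels/. On a GPU, each timed call would also need a device synchronization before the clock is read, because kernel launches are asynchronous.

```python
import statistics
import time
from typing import Callable


def bench(fn: Callable[[], object], warmup: int = 25, iters: int = 200) -> dict:
    """Time `fn` after a warmup phase and report median and p95 latency in ms."""
    if iters <= 0:
        raise ValueError("iters must be positive")
    # Warmup runs are executed but not recorded (caches, JIT, allocator settle).
    for _ in range(warmup):
        fn()
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()  # on CUDA, synchronize the device here before stopping the clock
        samples_ms.append((time.perf_counter() - start) * 1e3)
    samples_ms.sort()
    # Nearest-rank percentile: rank = ceil(0.95 * n), computed in integer arithmetic.
    p95_rank = (95 * iters + 99) // 100
    return {
        "iters": iters,
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_rank - 1],
    }


def speedup(baseline: dict, candidate: dict) -> float:
    """Speedup of the candidate over the baseline, based on median latency."""
    return baseline["median_ms"] / candidate["median_ms"]
```

Comparing medians rather than means keeps a few slow outliers from skewing the reported speedup. That choice is a property of this sketch, not a statement about how Forge computes its figures.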

Why kernel-skills is used

TensorPath uses @krxgu/kernel-skills as an external instruction source for CUDA, Triton, quantization, benchmarking, and kernel optimization workflows.

kernel-skills provides reusable expert playbooks. TensorPath does not depend on it for execution, benchmarking, compilation, or deployment.

All execution happens inside Forge. Forge retrieves skill bundles, creates agent-ready prompts, accepts generated candidate kernels, validates correctness, benchmarks performance, and promotes only verified kernels into the local kernel registry.

Documentation

  • docs/FORGE.md — full Forge pipeline, agentic loop, hardware constraints
  • docs/KERNEL_REGISTRY.md — registry schema, kernel ID format, evidence levels, promotion rules
  • docs/BENCHMARKING.md — measurement protocol (warmup, sync, percentiles), output schema, threshold rules
  • docs/CLAIMS.md — what we are and aren't allowed to say about kernel speedups

Testing

# default (no GPU required)
pytest -q                # 93 passed, 1 skipped

# CUDA-required tests (real GPU)
pytest -m cuda -q

The single skipped test is the kernel-skills CLI integration test, opt-in via NEEVPATH_FORGE_INTEGRATION=1.

Contributing

TensorPath is open source under github.com/tensormux. PRs welcome — see the docs folder for architecture details before opening a large change.

What's next

  • [x] benchmark runner with measured RTX 4070 profiles
  • [x] UI: server-rendered Jinja templates at /, /compare, /forge
  • [x] comparison endpoint + side-by-side cards
  • [x] Forge: kernel-skills-driven optimization with verify + benchmark + promote gates
  • [x] verified kernel registry annotated onto recommendation results (op-level evidence)
  • [x] autonomous agentic mode with Claude Opus 4.7 orchestrator
  • [x] real promoted Triton kernels in the registry (RMSNorm on RTX 4070)
  • [ ] live deployment integration (currently exports config artifact only)
  • [ ] more models and GPU tiers
  • [ ] runtime-level integration of promoted kernels (currently op-level evidence only; see docs/CLAIMS.md)
  • [ ] more ops in Forge: fused add+RMSNorm, softmax, sampling, KV cache append, dequant, RoPE
