Skip to content

Fatemanx/kernel-pilot

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

KernelPilot

A Codex + Humanize workflow for GPU kernel tuning: local CUDA knowledge, Nsight Compute digests, and clean standalone benchmark repos.

GitHub stars GitHub forks Last commit CUDA PR corpus Open watchlist

Works well with AI-Infra-Auto-Driven-SKILLS.

KernelPilot is for long GPU-kernel tuning runs, the kind where the useful facts are easy to lose: which baseline was copied, which shape regressed, what NCU said, and which source idea has already been tried.

It wraps Humanize RLCR for kernel work. Codex plans the task, creates a standalone repo, builds candidates, runs tests and benchmarks, records provenance, profiles representative cases, and lets the Humanize stop hook decide whether another round is needed.

It is designed to sit next to AI-Infra-Auto-Driven-SKILLS: that repo carries broader serving/profiler/SGLang playbooks, while this repo keeps the kernel-loop machinery and CUDA knowledge pack.

What Is Here

Signal What makes it useful
Humanize Kernel Agent Loop One Codex skill does the plan, refinement, standalone repo setup, RLCR startup, benchmark/profile loop, and stop-hook review.
607 CUDA optimization PRs PR notes from SGLang, vLLM, TensorRT-LLM, FlashAttention, FlashInfer, CUTLASS/CuTe, DeepGEMM, TileLang, CCCL/CUB, and similar repos.
280 open PR watchlist entries Current CUDA optimization work is kept separate from merged evidence. Open PRs must be re-checked before use.
Code-first knowledge routing Topic pages, source guides, PR notes, blog-to-code maps, and AKO4ALL references tell Codex what to read first.
Nsight Compute feedback loop profile-evidence turns NCU metrics into a bottleneck call and one next edit.
Clean standalone repos Candidate kernels live in isolated repos with their own bindings, tests, benchmarks, ledgers, lineage, and artifacts.
Baseline-aware, language-flexible Use CUDA C++/PTX, Triton, CuTe DSL, TileLang, CUTLASS/CuTe, ThunderKittens, or the baseline's own kernel stack unless the user asks for from-scratch work.

What You Can Do

Goal Start here
Run a full kernel optimization loop in Codex humanize-kernel-agent-loop
Route Codex through CUDA PR/source knowledge before editing kernel-knowledge
Turn NCU reports into concrete next kernel edits profile-evidence
Inspect the PR-driven kernel corpus by framework knowledge/references/prs/
Inspect kernel ideas by bottleneck family knowledge/references/prs/by-topic/
Use broader serving, profiler, incident, and model optimization skills AI-Infra-Auto-Driven-SKILLS

How The Loop Works

  1. Scoped knowledge pass: read the target topic, target framework, matching source guide, and the PR page only when that repo is PR-driven. Source-only repos go straight to source files, tests, and benchmarks.
  2. Standalone setup: create a fresh repo with torch bindings, correctness tests, benchmarks, ledgers, lineage, and profile artifact folders.
  3. Baseline evidence: run baseline correctness/benchmark and collect one representative NCU digest before the first profile-driven edit.
  4. Evidence loop: implement one candidate, test it, benchmark it, collect NCU on regressions/plateaus/surprising wins, and record every attempt.
  5. Review gate: Humanize RLCR reviews the round and either stops cleanly or writes the next round prompt.

After two consecutive weak rounds with less than 1% geomean improvement, KernelPilot expands the search: read at least 50 new code-first sources before prose sources and log do-not-reread keys so the next round does not repeat the same reading.

Install

Fresh install:

git clone https://github.com/BBuf/kernel-pilot.git
cd kernel-pilot
./scripts/install-codex-skills.sh

Update an existing checkout:

git pull --ff-only
./scripts/install-codex-skills.sh

Restart Codex after installation, then open /skills and check that these skills are visible:

  • humanize-kernel-agent-loop
  • kernel-knowledge
  • profile-evidence

If Codex shows hook needs review, open /hooks and approve the Humanize Stop hook. Use /permissions to switch to Full Access, then continue after Codex shows Permissions updated to Full Access.

Knowledge Base

kernel-knowledge includes copied AKO4ALL CUDA/CUTLASS/NCU references plus a PR-driven production knowledge layer plus source-only code guides. The current PR scan covers 9 PR-driven CUDA optimization repos. PyTorch, DeepSeek TileKernels, Triton, QuACK, ThunderKittens, sample repos, blog/code companion repos, puzzle repos, source catalogs, and any repo with fewer than 10 selected CUDA optimization PRs are intentionally source-only.

The knowledge layout is split into:

  • knowledge/routing/ for lightweight topic and source routing
  • knowledge/references/prs/ for PR case notes
  • knowledge/references/source-guides/ for code maps
  • knowledge/references/blogs/ for article-to-code maps

The PR layer keeps all filtered CUDA optimization PRs for PR-driven repos, not a small curated top-N. It also has cross-repository topic pages and an open PR watchlist.

Refresh it with:

python3 scripts/refresh_pr_knowledge.py --since 2024-05-15

Prompt Cards

Baseline-derived optimization:

[$humanize-kernel-agent-loop] I want to optimize SGLang's H100 int8_scaled_mm kernel on H100. Use the existing SGLang/CUTLASS kernel as the baseline and starting point. Work in a clean standalone repo, keep provenance/lineage, and use the most appropriate kernel language for the candidate.

From-scratch optimization:

[$humanize-kernel-agent-loop] I want to optimize SGLang's H100 int8_scaled_mm kernel on H100. Implement the candidate kernel from scratch and use the existing SGLang/CUTLASS kernel only as the correctness/performance comparison baseline. Work in a clean standalone repo and keep provenance/lineage.

Example Outputs

The loop should leave enough state for you to tell what happened without replaying the whole session.

Humanize stop hook summary

The optimization ledger should make the useful versions and rejected follow-ups easy to scan.

KernelPilot optimization ledger

Monitor

Open another terminal in the same target repo:

source "${HUMANIZE_RUNTIME_ROOT:-${CODEX_HOME:-$HOME/.codex}/skills/humanize}/scripts/humanize.sh"
humanize monitor rlcr

Keep the monitor outside the Codex TUI.

Related

  • AI-Infra-Auto-Driven-SKILLS: playbooks for serving benchmarks, torch-profiler triage, SGLang optimization, production incidents, model PR histories, and GPU-kernel AKO4ALL workflows.
  • Humanize: the RLCR loop runtime that KernelPilot specializes for GPU kernel work.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Shell 80.8%
  • Python 11.4%
  • JavaScript 5.2%
  • CSS 2.4%
  • HTML 0.2%