Triton kernel optimization for the METR MLPuzzles benchmark.
This project originated from 3b1b.co/talent (Grant Sanderson's talent search), which led to mlpuzzles.com, a GPU kernel optimization challenge created by METR (Model Evaluation & Threat Research).
The challenge tests the ability to optimize GPU kernels on real hardware (NVIDIA A10G); submissions are scored on correctness and execution time, and reviewed by Claude Opus for reward hacking.
This repo contains:
- Working implementations (reference + Triton)
- Benchmarking infrastructure for measured performance
- Theoretical analysis tools (roofline model, work-depth diagrams)
- CI pipeline for automated submission
Compute a conditional prefix sum: x[i] contributes to the running total only if the count of positive values in x[0:i] (exclusive) is odd; output[i] is the running total through position i.
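As a plain-Python sketch of that rule (the repo's `prefix_sum_reference` is the authoritative baseline; this is an illustrative restatement, not its actual code):

```python
def prefix_sum_reference(x):
    """Conditional prefix sum: x[i] joins the running total only when the
    number of positive values strictly before index i is odd."""
    out, total, positives_before = [], 0, 0
    for v in x:
        if positives_before % 2 == 1:
            total += v
        out.append(total)
        if v > 0:
            positives_before += 1
    return out

# prefix_sum_reference([3, -1, 2, 5, -3]) == [0, -1, 1, 1, -2]
```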
```python
# Example
x = [3, -1, 2, 5, -3]
# Position 0: positive_count_before = 0 (even) → exclude
# Position 1: positive_count_before = 1 (odd)  → include (-1)
# Position 2: positive_count_before = 1 (odd)  → include (2)
# Position 3: positive_count_before = 2 (even) → exclude
# Position 4: positive_count_before = 3 (odd)  → include (-3)
# Output: [0, -1, 1, 1, -2]
```

```bash
# Clone and setup
git clone <repo-url>
cd metr
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows
pip install -r requirements.txt

# Run tests
python prefix_sum.py

# Run benchmarks (requires CUDA GPU)
python benchmark.py --quick  # Quick test
python benchmark.py          # Full benchmark
python benchmark.py --help   # See all options
```

Pushing to the `main` or `trunk` branch triggers an automatic submission:
- Runs CPU correctness tests
- Submits to mlpuzzles.com A10G
- Posts results to commit/PR summary
- Updates leaderboard
```bash
# Submit prefix_sum.py
python submit.py

# Submit custom file
python submit.py --file my_solution.py

# View leaderboard
python submit.py --leaderboard

# Raw JSON output
python submit.py --json
```

Direct API submission (the file is JSON-encoded so it can be embedded in the request body):

```bash
curl -X POST https://puzzle.metr-dev.org/api/submit \
  -H "Content-Type: application/json" \
  -d "{\"code\": $(cat prefix_sum.py | python3 -c 'import sys,json; print(json.dumps(sys.stdin.read()))')}"
```

Interactive sandbox over SSH:

```bash
ssh -p 2222 sandbox@puzzle.metr-dev.org  # password: puzzle
# Then: submit solution.py
```

The benchmark suite measures:
- Throughput (GFLOP/s)
- Memory bandwidth (GB/s)
- Stability (coefficient of variation < 5%)
Results are saved to results/ as JSON and CSV, with plots.
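For reference, the coefficient of variation is just stdev/mean over the timing samples; a hypothetical standalone computation (not necessarily how `benchmark.py` implements it):

```python
import statistics

def coefficient_of_variation(times):
    """Sample CV = stdev / mean; below 0.05 means run-to-run noise under 5%."""
    return statistics.stdev(times) / statistics.mean(times)

# Five hypothetical timing samples in milliseconds
times_ms = [1.02, 0.98, 1.00, 1.01, 0.99]
assert coefficient_of_variation(times_ms) < 0.05  # would count as stable
```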
```bash
# Full benchmark across sizes
python benchmark.py --sizes 256 512 1024 2048 4096 8192 16384 32768 65536

# Custom iterations for stability
python benchmark.py --warmup 100 --iterations 500

# Output to specific directory
python benchmark.py --output my_results/
```

| Implementation | Size Range | Description |
|---|---|---|
| `prefix_sum_reference` | any n | CPU sequential (baseline) |
| `prefix_sum_single_block` | n ≤ 1024 | Single Triton block, optimal for small inputs |
| `prefix_sum` | any n | Multi-block with CPU Phase 2 (current bottleneck) |
```bash
# Generate theoretical roofline model
python roofline_model.py

# Generate work-depth diagram
python work_depth_diagram.py
```

The multi-block implementation has a CPU Phase 2 that processes elements sequentially, destroying GPU parallelism for n > 1024.

```
Phase 1 (GPU) → Phase 2 (CPU, O(n) sequential) → Phase 3 (GPU)
                       ↑
                  BOTTLENECK
```
- Move Phase 2 to GPU: expected ~100x speedup for large n
- Operator fusion: reduce memory traffic
- Autotune block sizes: optimize for the specific GPU
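For the first item, Phase 2 cannot be a plain sum scan here, because whether an element is included depends on the parity of positive counts carried in from earlier blocks. Below is a NumPy sketch of a three-phase decomposition that carries both the running sum and that parity between blocks (hypothetical: `conditional_prefix_sum`, `phase1`, and `block_size` are illustrative names, not the repo's API):

```python
import numpy as np

def phase1(block):
    """Per-block summary: positive count, plus the block's contribution
    to the running sum under each possible incoming parity."""
    pos = block > 0
    local_parity = (np.cumsum(pos) - pos) % 2  # parity of positives strictly before, within block
    sum_if_even_in = block[local_parity == 1].sum()  # incoming parity even
    sum_if_odd_in = block[local_parity == 0].sum()   # incoming parity odd flips inclusion
    return int(pos.sum()), sum_if_even_in, sum_if_odd_in

def conditional_prefix_sum(x, block_size=4):
    x = np.asarray(x, dtype=np.int64)
    blocks = [x[i:i + block_size] for i in range(0, len(x), block_size)]
    # Phase 2: sequential carry of (running sum, parity) across block summaries
    carries, s, parity = [], 0, 0
    for b in blocks:
        carries.append((s, parity))
        n_pos, s_even, s_odd = phase1(b)
        s += s_odd if parity else s_even
        parity = (parity + n_pos) % 2
    # Phase 3: each block finishes independently given its carry
    out = np.empty_like(x)
    for (s0, p0), b, i in zip(carries, blocks, range(0, len(x), block_size)):
        pos = b > 0
        par = (np.cumsum(pos) - pos + p0) % 2
        out[i:i + len(b)] = s0 + np.cumsum(np.where(par == 1, b, 0))
    return out

# conditional_prefix_sum([3, -1, 2, 5, -3]).tolist() == [0, -1, 1, 1, -2]
```

Because Phase 2 only touches one summary per block rather than one entry per element, it shrinks from O(n) to O(n / block_size) sequential work, which is the step worth moving onto the GPU.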
Primary: NVIDIA A10G (AWS G5 instances, METR MLPuzzles)
- 31.2 TFLOPS FP32
- 600 GB/s memory bandwidth
- 84 SMs, 24GB GDDR6
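These numbers pin the puzzle firmly on the memory-bound side of the roofline; a back-of-envelope check (assuming fp32 and roughly one add, one load, and one store per element):

```python
# Peak numbers for the A10G quoted above
peak_compute = 31.2e12  # FLOP/s, fp32
peak_bw = 600e9         # bytes/s

# Ridge point: arithmetic intensity where a kernel stops being memory-bound
ridge = peak_compute / peak_bw  # ≈ 52 FLOP/byte

# Prefix sum: ~1 add per element, with a 4-byte load and 4-byte store (fp32)
intensity = 1 / 8  # FLOP/byte

# Roofline: attainable throughput is capped by bandwidth at this intensity
attainable = min(peak_compute, peak_bw * intensity)  # ≈ 75 GFLOP/s
```

Since 1/8 FLOP/byte is far below the ~52 FLOP/byte ridge point, the kernel's ceiling is bandwidth (~75 GFLOP/s), not the 31.2 TFLOPS compute peak.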
```
├── prefix_sum.py          # Main implementation
├── benchmark.py           # Benchmarking suite
├── roofline_model.py      # Theoretical performance analysis
├── work_depth_diagram.py  # Parallelism visualization
├── requirements.txt       # Dependencies
└── results/               # Benchmark outputs (gitignored)
```