Prompt card: reproduce AVO FA4 B200 attention result #1

@BBuf

[$humanize-kernel-agent-loop](/Users/bbuf/.codex/skills/humanize-kernel-agent-loop/SKILL.md) [$ion-b200](/Users/bbuf/.codex/skills/ion-b200/SKILL.md)

All NVIDIA B200 work must run on GPU0 inside the existing `sglang_bbuf` Docker container on `ion-b200`.

Use this exact pattern for all remote Python, pip, nvcc, build, test, benchmark, and Nsight Compute commands:

ssh ion-b200 'docker exec sglang_bbuf bash -lc "CUDA_VISIBLE_DEVICES=0 <command>"'

Do not run Python, pip, nvcc, builds, tests, benchmarks, or profiling directly on the ion-b200 host. Do not `pip install flash-attn` on the host. The container already has FlashAttention-4 installed; use it as the main baseline.
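
The command pattern above can be wrapped in a small helper so that no command accidentally runs on the host. This is a minimal sketch, not part of the original card: the function name `remote_command` is hypothetical, and it quotes the inner command with single quotes via `shlex.quote` rather than the double quotes shown in the pattern, which should be equivalent for the remote shell.

```python
import shlex

HOST = "ion-b200"          # remote host named in the card
CONTAINER = "sglang_bbuf"  # existing Docker container named in the card
GPU = "0"                  # the card pins all work to GPU0


def remote_command(command: str) -> list[str]:
    """Build an argv that runs `command` on GPU0 inside the container.

    Hypothetical helper: pass the result to subprocess.run(argv, check=True).
    """
    inner = f"CUDA_VISIBLE_DEVICES={GPU} {command}"
    # ssh hands this single string to the remote shell, which then
    # starts `docker exec ... bash -lc '<inner>'`.
    docker = f"docker exec {CONTAINER} bash -lc {shlex.quote(inner)}"
    return ["ssh", HOST, docker]
```

Building an argv list instead of one shell string avoids a second layer of local quoting; only the remote shell has to parse the `docker exec` line.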

Implement a standalone CUDA/inline-PTX forward-only MHA attention kernel for NVIDIA B200.

Scope:
- Forward pass only
- No backward
- No GQA
- No serving/framework integration
- dtype: BF16
- head_dim: 128
- num_heads: 16
- total tokens: 32768

Benchmark cases:
- batch=8, seqlen=4096
- batch=4, seqlen=8192
- batch=2, seqlen=16384
- batch=1, seqlen=32768
- Test both `causal=False` and `causal=True`
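
The case list above can be enumerated programmatically so that every shape is checked against the fixed token budget. The following is a sketch under the scope stated in this card; the function name `benchmark_cases` and the dictionary keys are illustrative assumptions.

```python
from itertools import product

NUM_HEADS = 16
HEAD_DIM = 128
TOTAL_TOKENS = 32768
# (batch, seqlen) pairs from the card
SHAPES = [(8, 4096), (4, 8192), (2, 16384), (1, 32768)]


def benchmark_cases() -> list[dict]:
    """Return the 4 shapes x 2 mask settings = 8 benchmark cases."""
    cases = []
    for (batch, seqlen), causal in product(SHAPES, (False, True)):
        # Every case must process the same total number of tokens.
        if batch * seqlen != TOTAL_TOKENS:
            raise ValueError(f"batch={batch}, seqlen={seqlen} breaks the token budget")
        cases.append({
            "batch": batch,
            "seqlen": seqlen,
            "num_heads": NUM_HEADS,
            "head_dim": HEAD_DIM,
            "causal": causal,
        })
    return cases
```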

Correctness:
- Compare against PyTorch reference and/or official FlashAttention-4 output.
- Report the maximum absolute error and the relative error, and state the tolerances used explicitly.
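
To make the correctness criteria concrete, here is a naive single-head reference and an error report in plain Python, intended only for tiny inputs. It is a sketch, not the card's required reference (which is PyTorch and/or FlashAttention-4). The names `reference_attention` and `error_report` are hypothetical, and defining relative error as the maximum absolute error divided by the largest reference magnitude is an assumption; the card does not specify a definition.

```python
import math


def reference_attention(q, k, v, causal=False):
    """Naive softmax(Q K^T / sqrt(d)) V for one head, on nested lists of floats."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        # Under a causal mask, query i attends only to keys 0..i.
        limit = i + 1 if causal else len(k)
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in range(limit)]
        row_max = max(scores)  # subtract the row max for numerical stability
        weights = [math.exp(s - row_max) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights)) / total
            for c in range(len(v[0]))
        ])
    return out


def error_report(candidate, reference, eps=1e-12):
    """Max absolute error, plus that error relative to the largest reference value."""
    diffs = [abs(c - r)
             for c_row, r_row in zip(candidate, reference)
             for c, r in zip(c_row, r_row)]
    ref_max = max(abs(r) for row in reference for r in row)
    max_abs = max(diffs)
    return {"max_abs_error": max_abs, "max_rel_error": max_abs / (ref_max + eps)}
```

For BF16 kernels the reference is usually computed in higher precision and the tolerances chosen accordingly; the actual thresholds are left to the implementer, as the card requires only that they be reported.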

Benchmarking:
- Follow Dao-AILab/flash-attention `benchmarks/benchmark_attn.py` methodology as closely as practical, including warmup/repeat logic.
- Report per-case mean latency, std, TFLOPS, and geometric mean TFLOPS.
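
The reported metrics can be derived from raw latency samples as sketched below. This assumes the common forward-pass FLOP convention of `4 * batch * num_heads * seqlen^2 * head_dim`, halved for causal masking; verify it against the `benchmark_attn.py` version actually used before comparing numbers. The function names are hypothetical.

```python
import statistics


def attention_flops(batch, seqlen, num_heads, head_dim, causal):
    """Forward FLOPs: two matmuls (Q K^T and P V), 2 FLOPs per multiply-add."""
    flops = 4 * batch * num_heads * seqlen * seqlen * head_dim
    # Assumed convention: causal masking skips about half of the work.
    return flops // 2 if causal else flops


def summarize(latencies_ms, flops):
    """Mean latency, sample standard deviation, and TFLOPS for one case."""
    mean_ms = statistics.fmean(latencies_ms)
    std_ms = statistics.stdev(latencies_ms) if len(latencies_ms) > 1 else 0.0
    tflops = flops / (mean_ms * 1e-3) / 1e12
    return {"mean_ms": mean_ms, "std_ms": std_ms, "tflops": tflops}


def geomean_tflops(per_case_tflops):
    """Geometric mean across cases, so no single shape dominates the score."""
    return statistics.geometric_mean(per_case_tflops)
```

Because every case has the same token count but a different sequence length, FLOPs differ by case, which is why TFLOPS rather than raw latency is the aggregated quantity.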

Target:
- Beat official FlashAttention-4 by at least 5% geometric-mean TFLOPS across the configured B200 cases.
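
One way to evaluate the target is to compare geometric means over the same ordered list of cases. The sketch below assumes that "at least 5%" means a geometric-mean ratio of 1.05 or more; the name `meets_target` and its return shape are hypothetical.

```python
import statistics


def meets_target(candidate_tflops, baseline_tflops, min_gain=0.05):
    """Return (speedup, passed) for per-case TFLOPS lists in matching order."""
    if len(candidate_tflops) != len(baseline_tflops):
        raise ValueError("candidate and baseline must cover the same cases")
    candidate = statistics.geometric_mean(candidate_tflops)
    baseline = statistics.geometric_mean(baseline_tflops)
    speedup = candidate / baseline
    return speedup, speedup >= 1.0 + min_gain
```

Both lists should come from the same run conditions (same container, GPU0, same warmup and repeat settings) so that the ratio reflects the kernels rather than the environment.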

Deliverable must be correct, benchmarkable, profileable, reproducible, and compared against FlashAttention-4 on B200 GPU0.
