Merged
49 commits
dc2e18d
Add fmha kernel
Mar 11, 2026
4b193fd
flash_attn: port kernel and test to refactored API
Mar 12, 2026
b032cd7
flash_attn: add mfma_f32_32x32x8f16 wrapper and remove fp_math params
Mar 12, 2026
567402e
Update run.sh
yanguahe Mar 13, 2026
8cd41fd
flash_attn: add unsafe_fp_math and fast_fp_math compile options
yanguahe Mar 13, 2026
d572241
flash_attn: switch to GEP-based global load/store and add DAZ support
yanguahe Mar 13, 2026
a8beabc
scripts: add ISA postprocess scripts for FMHA kernel optimization
yanguahe Mar 14, 2026
60ea2cc
GEMM schedule: add SIGEMMScheduleOptimize LLVM patch and ISA postproc…
yanguahe Mar 14, 2026
9ce5ea1
scripts: fix patch check to use git apply --check
yanguahe Mar 15, 2026
790e088
flash_attn: CK-aligned architecture with BLOCK_M=256, BLOCK_N=64, K=1…
yanguahe Mar 16, 2026
8005e46
flash_attn: CK-aligned memory access + GEMM pipeline restructuring
yanguahe Mar 17, 2026
07ff3a6
flash_attn: DMA-based K double-buffer + V DMA with XOR swizzle
yanguahe Mar 20, 2026
a0e85d4
tests: add Triton flash attention benchmark script
yanguahe Mar 20, 2026
d4226c4
tests: disable TRITON_HIP_USE_ASYNC_COPY for Triton flash attn test
yanguahe Mar 20, 2026
461dede
flash_attn: tile-level causal mask skip via scf.IfOp + remove post-V-…
yanguahe Mar 20, 2026
5f49fca
flash_attn: add bf16 support alongside existing fp16
yanguahe Mar 20, 2026
eb098af
tests: include dtype in benchmark output line
yanguahe Mar 21, 2026
046728c
run: update benchmark configs for bf16/fp16 multi-shape testing
yanguahe Mar 21, 2026
4f20459
perf: add MI350X bf16/fp16 multi-shape benchmark log
yanguahe Mar 21, 2026
5b85b58
docs: add MI350X bf16 flash_attn_func optimization log
yanguahe Mar 22, 2026
f42ab17
tests: allow waves_per_eu override via FLYDSL_WAVES_PER_EU env var
yanguahe Mar 22, 2026
72fba87
perf: add MI350X full-rebuild benchmark log (v1)
yanguahe Mar 22, 2026
33d3fc9
flash_attn_func: add gfx942 (MI325X/MI300X/MI308X) compatibility
yanguahe Mar 22, 2026
cddbd69
flash_attn_func: gfx942 performance optimizations (+13% bf16)
yanguahe Mar 22, 2026
1e7ca41
Add log.perf.MI325X.v0
yanguahe Mar 22, 2026
e380859
flash_attn_func: apply MaxNumFOp fastmath + bf16_trunc_pack to gfx950
yanguahe Mar 22, 2026
9be2ac9
flash_attn: fix bf16_trunc_pack_v8 range() -> range_constexpr()
yanguahe Mar 22, 2026
833db21
Remove amdgpu-gemm-schedule-opt
yanguahe Mar 23, 2026
bad24d8
flash_attn: derive NUM_WAVES from flat_work_group_size
yanguahe Mar 23, 2026
8ec4c8e
flash_attn: add flat_work_group_size assertion and fix test configs
yanguahe Mar 23, 2026
723a842
flash_attn: support seq_len < BLOCK_M, add --compare mode, align aite…
yanguahe Mar 23, 2026
9705e19
test: add summary table after all configs in normal test mode
yanguahe Mar 23, 2026
6067f33
test: add triton_flash_attn comparison mode (FlyDSL vs Triton)
yanguahe Mar 23, 2026
e9e889f
flash_attn: fix UnboundLocalError for _v_vecs_prefetch on gfx950 noca…
yanguahe Mar 23, 2026
c568002
perf: update MI350X benchmark logs and add triton comparison results
yanguahe Mar 23, 2026
1d7dcb9
perf: update MI325X benchmark log (v0)
yanguahe Mar 23, 2026
8cc3be9
perf: add MI308X benchmark log (v0)
yanguahe Mar 23, 2026
05602ae
Remove useless file
yanguahe Mar 23, 2026
831f2c1
fix error after rebase main
coderfeli Mar 31, 2026
2a59b91
fix conflict
coderfeli Mar 31, 2026
0b1058b
refine code
coderfeli Mar 31, 2026
16d2242
reorder files
coderfeli Mar 31, 2026
4d1f60a
note tune_fmha llvm commit
coderfeli Apr 1, 2026
05d4e0e
change test to real ck and asm
coderfeli Apr 1, 2026
02a8eb0
refine code
coderfeli Apr 1, 2026
3469c4e
fix error del files
coderfeli Apr 1, 2026
337d46b
Merge branch 'main' into hyg_mha_new_api
coderfeli Apr 1, 2026
7458e2b
use flyc.compile
coderfeli Apr 1, 2026
7f0b012
slight change
coderfeli Apr 1, 2026