Tracking issue for the sparse-mario branch (local/unpushed at the time of writing — see "How to access the work" below).
What this is
A 2,200-line example under crates/ruvllm_sparse_attention/examples/sparse_mario.rs that uses the existing sparse attention kernel as a training-free associative memory — no autograd, no learned weights, no Python — to generate Super Mario Bros level slices from an embedded VGLC-alphabet corpus. Two pipelines from the same kernel:
- Autoregressive (AR) retrieval LM:
KvCache + decode_step for incremental O(log T) decoding.
- Masked discrete diffusion (D3PM / MaskGIT family): bidirectional context, MaskGIT cosine schedule, training-free retrieval-as-denoiser.
13 iterations across 7 commits, 40/40 unit tests passing, two docs (sparse_mario_metrics.md, sparse_mario_baselines.md) capturing the analysis.
Headline results
End-to-end AR speedup (iter 8, KvCache + decode_step for the example):
| Path |
700 tokens |
per-token |
| Iter 6 (full forward per step) |
25,970 ms |
37 ms |
Iter 8 (KvCache + decode_step) |
9 ms |
12 µs |
|
|
2,880× |
Cross-baseline comparison (iter 13, avg L2 distance to corpus median target):
| Pipeline |
L2 dist |
vs SOTA |
| Corpus (self-distance) |
0.504 |
– |
| Sparse-Mario diffusion |
0.723 ⭐ |
1.0× (SOTA on this artifact) |
| Markov-1 (corpus bigram) |
2.745 |
3.8× |
| Uniform random |
3.353 |
4.6× |
| Sparse-Mario AR |
4.998 |
6.9× |
The headline finding: the value-add of attention machinery on this artifact is not bigram fidelity (Markov-1 has perfect bigrams and still loses by 3.8×). It's bidirectional masked filling, which only the kernel-based diffuser provides. That's the SOTA story for sparse attention as a primitive — not as an LLM accelerator.
Honest finding (candidate follow-up)
Sparse-Mario AR is the worst pipeline on aggregate, even worse than uniform random, despite excellent density. Cause: AR's K builder adds 0.5·pos(i) and the query sits at the tail of the corpus+prefix sequence, biasing retrieval toward corpus-tail (level-floor) positions. Linearity 5.25 vs corpus 0.57.
A 3-line fix — drop positional encoding from the AR K builder, the same way iter 7 already did for the diffuser — should halve AR's L2 distance. Detailed in docs/sparse_mario_baselines.md § "Why AR is the worst — and what would fix it".
Diffusion's remaining gap to corpus self-distance is bimodal playability (boot-slice placement decides whether a floor exists). Floor-anchor pre-step (~5–10 line architectural change) would close most of it.
Both fixes are guarded by the iter-11 metrics scaffolding shipping on the branch — any change must improve playable_columns and linearity without crashing density / novelty.
How to access the work
- Branch (local on the maintainer's workstation, not pushed):
sparse-mario
- Public gist with the full README, code, bench, and both docs: https://gist.github.com/ruvnet/d3e0aaa7af2745b678a9eecddf610301
- Iteration log:
- corpus + tokenizer
2-3. retrieval LM + ASCII generation
- dense vs sparse vs sparse+FastGRNN bench
- top-k + repetition penalty
- wrapped renderer
- masked discrete diffusion
- KvCache + decode_step (2,880× speedup)
- nucleus / top-p sampling
- multi-token bidirectional context
- PCG metrics module (density / linearity / leniency / novelty / playable)
- hyperparameter sweep against metrics
- cross-baseline comparison (SOTA)
Suggested next steps
Filed with gh issue create on behalf of @ruvnet from a Claude Code session that drove iters 1-13.
Tracking issue for the
sparse-mariobranch (local/unpushed at the time of writing — see "How to access the work" below).What this is
A 2,200-line example under
crates/ruvllm_sparse_attention/examples/sparse_mario.rsthat uses the existing sparse attention kernel as a training-free associative memory — no autograd, no learned weights, no Python — to generate Super Mario Bros level slices from an embedded VGLC-alphabet corpus. Two pipelines from the same kernel:KvCache+decode_stepfor incremental O(log T) decoding.13 iterations across 7 commits, 40/40 unit tests passing, two docs (
sparse_mario_metrics.md,sparse_mario_baselines.md) capturing the analysis.Headline results
End-to-end AR speedup (iter 8, KvCache + decode_step for the example):
KvCache+ decode_step)Cross-baseline comparison (iter 13, avg L2 distance to corpus median target):
The headline finding: the value-add of attention machinery on this artifact is not bigram fidelity (Markov-1 has perfect bigrams and still loses by 3.8×). It's bidirectional masked filling, which only the kernel-based diffuser provides. That's the SOTA story for sparse attention as a primitive — not as an LLM accelerator.
Honest finding (candidate follow-up)
Sparse-Mario AR is the worst pipeline on aggregate, even worse than uniform random, despite excellent density. Cause: AR's K builder adds
0.5·pos(i)and the query sits at the tail of the corpus+prefix sequence, biasing retrieval toward corpus-tail (level-floor) positions. Linearity 5.25 vs corpus 0.57.A 3-line fix — drop positional encoding from the AR K builder, the same way iter 7 already did for the diffuser — should halve AR's L2 distance. Detailed in
docs/sparse_mario_baselines.md§ "Why AR is the worst — and what would fix it".Diffusion's remaining gap to corpus self-distance is bimodal playability (boot-slice placement decides whether a floor exists). Floor-anchor pre-step (~5–10 line architectural change) would close most of it.
Both fixes are guarded by the iter-11 metrics scaffolding shipping on the branch — any change must improve
playable_columnsandlinearitywithout crashing density / novelty.How to access the work
sparse-mario2-3. retrieval LM + ASCII generation
Suggested next steps
sparse-mariobranch and open a PR to track review and mergeFiled with
gh issue createon behalf of @ruvnet from a Claude Code session that drove iters 1-13.