sparse-mario: training-free retrieval LM + masked discrete diffusion demo (SOTA on cross-baseline metrics)

Tracking issue for the `sparse-mario` branch (local/unpushed at the time of writing — see "How to access the work" below).

## What this is

A 2,200-line example under `crates/ruvllm_sparse_attention/examples/sparse_mario.rs` that uses the existing sparse attention kernel as a **training-free associative memory** — no autograd, no learned weights, no Python — to generate Super Mario Bros level slices from an embedded VGLC-alphabet corpus. Two pipelines from the same kernel:

- **Autoregressive (AR)** retrieval LM: `KvCache` + `decode_step` for incremental O(log T) decoding.
- **Masked discrete diffusion** (D3PM / MaskGIT family): bidirectional context, MaskGIT cosine schedule, training-free retrieval-as-denoiser.

13 iterations across 7 commits, 40/40 unit tests passing, two docs (`sparse_mario_metrics.md`, `sparse_mario_baselines.md`) capturing the analysis.

## Headline results

**End-to-end AR speedup** (iter 8, KvCache + decode_step for the example):

| Path                              | 700 tokens   | per-token |
|-----------------------------------|-------------:|----------:|
| Iter 6 (full forward per step)    | 25,970 ms    |     37 ms |
| Iter 8 (`KvCache` + decode_step)  | **9 ms**     | **12 µs** |
|                                   |              | **2,880×**|

**Cross-baseline comparison** (iter 13, avg L2 distance to corpus median target):

| Pipeline                  | L2 dist | vs SOTA |
|---------------------------|--------:|--------:|
| Corpus (self-distance)    |   0.504 |       – |
| **Sparse-Mario diffusion**| **0.723** ⭐ | 1.0× (SOTA on this artifact) |
| Markov-1 (corpus bigram)  |   2.745 |   3.8×  |
| Uniform random            |   3.353 |   4.6×  |
| Sparse-Mario AR           |   4.998 |   6.9×  |

**The headline finding**: the value-add of attention machinery on this artifact is *not* bigram fidelity (Markov-1 has perfect bigrams and still loses by 3.8×). It's **bidirectional masked filling**, which only the kernel-based diffuser provides. That's the SOTA story for sparse attention as a primitive — not as an LLM accelerator.

## Honest finding (candidate follow-up)

Sparse-Mario AR is the worst pipeline on aggregate, even worse than uniform random, despite excellent density. Cause: AR's K builder adds `0.5·pos(i)` and the query sits at the tail of the corpus+prefix sequence, biasing retrieval toward corpus-tail (level-floor) positions. Linearity 5.25 vs corpus 0.57.

A 3-line fix — drop positional encoding from the AR K builder, the same way iter 7 already did for the diffuser — should halve AR's L2 distance. Detailed in `docs/sparse_mario_baselines.md` § "Why AR is the worst — and what would fix it".

Diffusion's remaining gap to corpus self-distance is bimodal playability (boot-slice placement decides whether a floor exists). Floor-anchor pre-step (~5–10 line architectural change) would close most of it.

Both fixes are guarded by the iter-11 metrics scaffolding shipping on the branch — any change must improve `playable_columns` and `linearity` without crashing density / novelty.

## How to access the work

- **Branch (local on the maintainer's workstation, not pushed):** `sparse-mario`
- **Public gist with the full README, code, bench, and both docs:** https://gist.github.com/ruvnet/d3e0aaa7af2745b678a9eecddf610301
- **Iteration log:**
  1. corpus + tokenizer
  2-3. retrieval LM + ASCII generation
  4. dense vs sparse vs sparse+FastGRNN bench
  5. top-k + repetition penalty
  6. wrapped renderer
  7. masked discrete diffusion
  8. KvCache + decode_step (2,880× speedup)
  9. nucleus / top-p sampling
  10. multi-token bidirectional context
  11. PCG metrics module (density / linearity / leniency / novelty / playable)
  12. hyperparameter sweep against metrics
  13. cross-baseline comparison (SOTA)

## Suggested next steps

- [ ] Push the `sparse-mario` branch and open a PR to track review and merge
- [ ] Implement the AR K-builder positional fix (3-line change, expected ~50% drop in AR L2 distance)
- [ ] Implement the diffusion floor-anchor pre-step (5-10 line change, expected to close most of the diffusion → corpus gap)
- [ ] Optionally generalise: same retrieval-as-memory pattern + bidirectional masked diffusion applied to other discrete-token domains (logs, configs, MIDI, MAGVIT-style visual tokens)

Filed with `gh issue create` on behalf of @ruvnet from a Claude Code session that drove iters 1-13.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sparse-mario: training-free retrieval LM + masked discrete diffusion demo (SOTA on cross-baseline metrics) #448

What this is

Headline results

Honest finding (candidate follow-up)

How to access the work

Suggested next steps

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Path	700 tokens	per-token
Iter 6 (full forward per step)	25,970 ms	37 ms
Iter 8 (`KvCache` + decode_step)	9 ms	12 µs
		2,880×

Pipeline	L2 dist	vs SOTA
Corpus (self-distance)	0.504	–
Sparse-Mario diffusion	0.723 ⭐	1.0× (SOTA on this artifact)
Markov-1 (corpus bigram)	2.745	3.8×
Uniform random	3.353	4.6×
Sparse-Mario AR	4.998	6.9×

sparse-mario: training-free retrieval LM + masked discrete diffusion demo (SOTA on cross-baseline metrics) #448

Description

What this is

Headline results

Honest finding (candidate follow-up)

How to access the work

Suggested next steps

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions