Skip to content

sparse-mario: training-free retrieval LM + masked discrete diffusion demo (SOTA on cross-baseline metrics) #448

@ruvnet

Description

@ruvnet

Tracking issue for the sparse-mario branch (local/unpushed at the time of writing — see "How to access the work" below).

What this is

A 2,200-line example under crates/ruvllm_sparse_attention/examples/sparse_mario.rs that uses the existing sparse attention kernel as a training-free associative memory — no autograd, no learned weights, no Python — to generate Super Mario Bros level slices from an embedded VGLC-alphabet corpus. Two pipelines from the same kernel:

  • Autoregressive (AR) retrieval LM: KvCache + decode_step for incremental O(log T) decoding.
  • Masked discrete diffusion (D3PM / MaskGIT family): bidirectional context, MaskGIT cosine schedule, training-free retrieval-as-denoiser.

13 iterations across 7 commits, 40/40 unit tests passing, two docs (sparse_mario_metrics.md, sparse_mario_baselines.md) capturing the analysis.

Headline results

End-to-end AR speedup (iter 8, KvCache + decode_step for the example):

Path 700 tokens per-token
Iter 6 (full forward per step) 25,970 ms 37 ms
Iter 8 (KvCache + decode_step) 9 ms 12 µs
2,880×

Cross-baseline comparison (iter 13, avg L2 distance to corpus median target):

Pipeline L2 dist vs SOTA
Corpus (self-distance) 0.504
Sparse-Mario diffusion 0.723 1.0× (SOTA on this artifact)
Markov-1 (corpus bigram) 2.745 3.8×
Uniform random 3.353 4.6×
Sparse-Mario AR 4.998 6.9×

The headline finding: the value-add of attention machinery on this artifact is not bigram fidelity (Markov-1 has perfect bigrams and still loses by 3.8×). It's bidirectional masked filling, which only the kernel-based diffuser provides. That's the SOTA story for sparse attention as a primitive — not as an LLM accelerator.

Honest finding (candidate follow-up)

Sparse-Mario AR is the worst pipeline on aggregate, even worse than uniform random, despite excellent density. Cause: AR's K builder adds 0.5·pos(i) and the query sits at the tail of the corpus+prefix sequence, biasing retrieval toward corpus-tail (level-floor) positions. Linearity 5.25 vs corpus 0.57.

A 3-line fix — drop positional encoding from the AR K builder, the same way iter 7 already did for the diffuser — should halve AR's L2 distance. Detailed in docs/sparse_mario_baselines.md § "Why AR is the worst — and what would fix it".

Diffusion's remaining gap to corpus self-distance is bimodal playability (boot-slice placement decides whether a floor exists). Floor-anchor pre-step (~5–10 line architectural change) would close most of it.

Both fixes are guarded by the iter-11 metrics scaffolding shipping on the branch — any change must improve playable_columns and linearity without crashing density / novelty.

How to access the work

  • Branch (local on the maintainer's workstation, not pushed): sparse-mario
  • Public gist with the full README, code, bench, and both docs: https://gist.github.com/ruvnet/d3e0aaa7af2745b678a9eecddf610301
  • Iteration log:
    1. corpus + tokenizer
      2-3. retrieval LM + ASCII generation
    2. dense vs sparse vs sparse+FastGRNN bench
    3. top-k + repetition penalty
    4. wrapped renderer
    5. masked discrete diffusion
    6. KvCache + decode_step (2,880× speedup)
    7. nucleus / top-p sampling
    8. multi-token bidirectional context
    9. PCG metrics module (density / linearity / leniency / novelty / playable)
    10. hyperparameter sweep against metrics
    11. cross-baseline comparison (SOTA)

Suggested next steps

  • Push the sparse-mario branch and open a PR to track review and merge
  • Implement the AR K-builder positional fix (3-line change, expected ~50% drop in AR L2 distance)
  • Implement the diffusion floor-anchor pre-step (5-10 line change, expected to close most of the diffusion → corpus gap)
  • Optionally generalise: same retrieval-as-memory pattern + bidirectional masked diffusion applied to other discrete-token domains (logs, configs, MIDI, MAGVIT-style visual tokens)

Filed with gh issue create on behalf of @ruvnet from a Claude Code session that drove iters 1-13.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions