Captured from the Phase 1→2 transition adversarial sweep (2026-04-26). Each item is something the random-input parity tests do not exercise but production will. Severity tags reflect "how likely is this to bite real users."
-
top_k > n_slotsis silently clamped on GPU but unrestricted in PyTorch. Layer constructor should validatetop_k <= n_slots and top_k <= MAX_TOP_K. Add tophase1/cube_memory_layer.py::__init__. -
top_k = 0crashes softmax in the PyTorch retrieve and produces uniform 1/k weights with garbage indices on GPU. Constructor should refusetop_k < 1.
-
Cleanup tie-break diverges between CPU
argmax(first-wins) and GPU>comparison (last-wins). Random unit phasors don't tie at fp32 precision; deterministic codebooks with duplicate entries do. Either standardize the comparison (>=vs>) on both sides or document that the layer is not bit-deterministic across backends. -
STE gradient through cleanup vanishes at exact codebook boundaries. Gradient on
role_projis zero whenever the query exactly aligns with a snapped phasor. Practically rare with random init, but the test set never explicitly covers it. Add a unit test that initializes a query at a codebook entry and asserts the forward+backward path still moves the loss. -
Distillation test fixture's FakeLoader resets seed every epoch, so all 10 steps see identical batches. Hides shuffling bugs that would matter at real distillation scale. Fix by using a per-epoch RNG reseed or asserting batch variety.
-
GGUF buffer layout for
slot_keysandslot_valuesis unspecified. PyTorch is row-major(n_slots, 2*d_codebook); the ggml-vulkan backend's tensor stride conventions need to match exactly. Test before claiming Phase 2 is integrated. If a silent transpose creeps in, the dot products compute on the wrong axis and retrieve returns plausible-looking garbage. -
Unitize on
Vec2::ZEROreturnsVec2::ZERO(the eps floor prevents division-by-zero but does not produce a unit vector). Algebraically wrong but doesn't crash. Mitigate by making the cleanup path mapVec2::ZEROto a canonical phasor(1, 0).
- Cleanup with d ≤ 4 has insufficient noise margin to be
reliable. Document a
d >= 64floor in the layer. n_slots = 1withtop_k > 1picks the same slot k times and softmax-weights it equally. Mathematically correct, semantically pointless. Validaten_slots >= top_kat init.- Naive dot product in the GPU retrieve loop could accumulate
fp32 error at very large
d_key. Probably fine to d_key ≈ 4096; re-check before going beyond.
Three of the items above gate Phase 2 acceptance:
- The
top_k/n_slotsinvariants must be enforced in the GGUF loader (refuse to load a model that violates them). - The buffer-layout assumption must be verified by a round-trip parity test: PyTorch forward → export to GGUF → load in ggml → ggml forward → compare to PyTorch output. Same input, same weights, same output to fp32 tolerance.
- The shader push-constant order must match the GGUF tensor order exactly. Today both are written by us; in a future where ggml reorders tensors during graph optimization, that assumption breaks. Add a runtime assert.