Two Negative Results for Vector Symbolic Architectures: FFN Replacement and Compositional Image Generation
Target venue: NeurIPS 2026 (Negative Results / Datasets & Benchmarks track)
Status: All experiments complete, writing phase
Vector Symbolic Architectures (VSAs) offer algebraically structured representations built from binding, bundling, and permutation, all of which cost O(D) and need no matrix multiplication. These properties make VSA a natural candidate for (1) replacing dense feed-forward networks with structured retrieval, and (2) compositional image generation via token binding. We test both hypotheses systematically and find that neither works, for different but complementary reasons.
Case Study 1: FFN Replacement. We test VSA memory layers as replacements for feed-forward networks in Qwen3.6-27B. The failure mode is a rank bottleneck: VSA's cleanup→bind→retrieve pipeline has effective rank bounded by top-k (typically 4), while FFN mappings are ~89% linear with effective rank >2048. A 164K-parameter rank-16 linear projection captures more variance than a 35M-parameter VSA memory layer. Scaling top-k to 64 does not help (2.9% → 2.7% variance). SVD-derived codebooks improve over random (4.3% vs 2.8%) but remain far below linear baselines (36.6% at matched parameters).
Case Study 2: Compositional Image Generation. We test VSA binding for encoding image token sequences from TiTok VQ-8K (64 tokens × 8192 codebook). The failure has two independent causes: (a) FHRR superposition of 64 bindings cannot support 8192-way retrieval (0% accuracy at every D tested for pure binding; the factored-codebook and MLP-decoder variants do not recover it), and (b) real images do not share token multisets (2.7-8.9% match rate against a cluster reference, even under optimal assignment), so the permutation-based generation framing is inapplicable.
Positive findings. We identify conditions where VSA succeeds: (1) permutation locality in TiTok, where small token permutations yield small visual changes, (2) VSA encodes and recovers full permutations of 64 positions with 100% exact match at D≥2048, and (3) SVD-aligned codebooks capture roughly 50% more variance than random codebooks (4.3% vs 2.8%). These successes mark the boundary: VSA excels at structured, low-cardinality operations but fails when the target requires high-rank continuous mappings or high-cardinality discrete retrieval.
- VSA (hyperdimensional computing) has theoretical appeal for neural computation: algebraic structure, compositionality, O(D) operations
- Two natural applications:
- FFN replacement: structured retrieval instead of dense projection
- Image generation: compositional binding of token positions and content
- Both applications leverage VSA's core operations (binding, bundling) in domains where structure should help
- "Can VSA memory layers match the effective rank of learned FFN projections in deep transformers?"
- "Can VSA binding encode and retrieve image token sequences for compositional generation?"
- Both answers are negative, but for different reasons:
- FFN replacement fails due to a rank bottleneck (retrieval collapses output dimensionality)
- Image generation fails due to a capacity bottleneck (64 superposed bindings can't support 8192-way retrieval) AND a framing mismatch (images don't share token multisets)
- Positive findings narrow the scope: VSA works for permutation representation (64-way, structured) but not for content retrieval (8192-way, unstructured)
- Two independent case studies demonstrating VSA limitations in complementary domains
- Rank-bottleneck diagnosis for FFN replacement (theoretical + empirical)
- Capacity analysis for VSA token binding (theory predicts ~log2(D/N) bits per retrieval; experiments confirm)
- Positive result: VSA permutation encoding achieves 100% exact match at D=2048 for 64-position permutations
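- Negative result: images tokenized with TiTok VQ-8K share few tokens (<9% match rate against a cluster reference under optimal assignment), invalidating permutation-based generation
- FHRR: complex phasor vectors; binding is element-wise multiplication, unbinding is multiplication by the complex conjugate, bundling is summation, and cleanup is a nearest-neighbor search under cosine similarity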
- Storage capacity: ~log2(D/N) bits per retrieval from N superposed bindings (Plate 2003; Frady et al. 2018)
- Key strength: compositionality — bind(A,B) is invertible, supports group operations
- Standard FFN: up-project → activation → down-project
- Accounts for roughly 2/3 of transformer parameters and is known to store factual knowledge
- Parameter count: 2 × d_model × d_ffn per layer (3 × d_model × d_ffn for gated variants such as SwiGLU)
- TiTok (Yu et al. 2024): 1D image tokenizer, 64 tokens per 256×256 image, VQ-8K codebook (8192 entries × 64 dims)
- Frozen pre-trained encoder + decoder (Apache 2.0)
- Teacher: Qwen3.6-27B (d_model=5120, 64 layers)
- Extract (input, output) activation pairs at layers 3, 27, 43
- 50,000 tokens from FineWeb-Edu, 5% validation
- Metric: variance captured = (1 - val_MSE / zero_MSE) × 100%
- Cube Memory V1 (VSA with frozen random codebooks), V2 (learned codebooks + Gumbel-softmax)
- VSA-MoE (VSA routing for expert selection)
- Baselines: rank-r linear (SVD), rank-r + MLP, learned-gate MoE
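The variance-captured metric and the parameter budget of the rank-r baseline can be sketched in a few lines. This is a minimal NumPy illustration rather than the code used for the experiments: the function names and the "least squares, then truncate the SVD" fitting recipe are assumptions made here for clarity.

```python
import numpy as np

def variance_captured(y_true, y_pred):
    """(1 - val_MSE / zero_MSE) * 100, where zero_MSE is the error of predicting all zeros."""
    val_mse = np.mean((y_true - y_pred) ** 2)
    zero_mse = np.mean(y_true ** 2)
    return float((1.0 - val_mse / zero_mse) * 100.0)

def rank_r_linear_params(d_model, r):
    """Parameters of a d_model -> r -> d_model factorization (two matrices, no bias)."""
    return 2 * d_model * r

def fit_rank_r_linear(x, y, r):
    """Assumed recipe: solve least squares for the full map, then keep its top-r SVD components."""
    w, *_ = np.linalg.lstsq(x, y, rcond=None)          # (d_in, d_out)
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r]

# Toy check on synthetic data whose true map has rank 2 (not real activations).
rng = np.random.default_rng(0)
x = rng.standard_normal((200, 8))
y = x @ (rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6)))
w2 = fit_rank_r_linear(x, y, r=2)
print(variance_captured(y, x @ w2))
print(rank_r_linear_params(5120, 16))
```

With d_model=5120, `rank_r_linear_params(5120, 16)` evaluates to 163,840, which is the 164K budget quoted for the rank-16 baseline.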
Table 1: Architecture comparison (layer 27)
| Architecture | Var% | Params |
|---|---|---|
| Zero baseline | 0.0 | 0 |
| Cube Memory V1 (frozen VSA) | ~5 | 35M |
| Cube Memory V2 (learned VSA) | 4.8 | 35M |
| Rank-16 linear (SVD) | 5.9 | 164K |
| VSA-MoE 16×128 top-4 | 14.2 | 24M |
| Learned-MoE 8×256 top-2 | 16.2 | 21M |
| Rank-2048 linear | 36.6 | 21M |
| Rank-2048 + MLP-512 | 38.4 | 26M |
| Full-rank ceiling | 41.1 | 52M |
Top-k scaling (Table 2): Increasing top-k from 4 to 64 does not break the ceiling; variance captured drifts down from 2.9% to 2.7%.
| top_k | Var% |
|---|---|
| 4 | 2.9 |
| 16 | 2.8 |
| 64 | 2.7 |
Codebook ablation (Table 3): SVD-aligned codebooks help but don't change the conclusion.
| Codebook | Var% |
|---|---|
| Frozen random | 2.8 |
| Learned | 2.8 |
| SVD-optimal | 4.3 |
| SVD-optimal + learned | 4.3 |
SVD spectrum (Table 4): Singular values of the FFN activation matrix decay slowly, giving an effective rank above 2048. See Appendix A for the full spectrum.
FLOPs comparison (Table 5): VSA costs more FLOPs and captures less variance. All FLOP counts cover the inference forward pass only (no training overhead).
| Architecture | FLOPs/token | Var% |
|---|---|---|
| Rank-2048 linear | 42M | 36.6 |
| Rank-2048 + MLP-512 | 52M | 38.4 |
| Cube Memory V2 | 62M | 4.8 |
Rank-equalized FLOP comparison (Table 6): At matched effective rank, VSA uses ~1500x more FLOPs for only a marginal gain in variance captured.
| Architecture | Eff. Rank | FLOPs/token | Var% |
|---|---|---|---|
| Rank-4 linear (SVD) | 4 | 41K | <0 (neg. R²) |
| Cube Memory V2 top-4 | ≤4 | 62M | 2.9 |
Note: Rank-4 linear (static SVD) yields negative R² (worse than predicting the mean). VSA top-4 achieves 2.9% because routing selects different codebook entries per input, so a data-dependent rank-4 map beats a static one. Both remain far below rank-2048 linear (36.6%), and VSA spends 1514× more FLOPs for its marginal advantage over static SVD. The rank bottleneck holds: per-token output has rank ≤ 4 regardless of routing, which limits information throughput.
Proof sketch. The VSA retrieval output is y = Σ_{j ∈ top-k} α_j · v_j, where the v_j are memory value vectors and the α_j are softmax weights over the top-k similarity scores. This is a convex combination of k vectors, so y ∈ span{v_j : j ∈ top-k}, a subspace of dimension at most k. Hence rank(output) ≤ k for every token.
- With top-k=4, VSA output rank ≤ 4
- A rank-4 linear (SVD) projection also fails (negative R²) but costs 41K FLOPs vs VSA's 62M FLOPs: a comparable failure at 1514× lower cost
- FFN effective rank >2048, so top-k retrieval is fundamentally underpowered
- ~89% of learnable FFN variance is linear (rank-2048 linear reaches 36.6% against a 41.1% ceiling); a further ~5% is better captured by an MLP (+1.8 points) than by sparse retrieval
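The span argument in the proof sketch can be checked numerically. The following is a hedged toy example, assuming a generic dot-product top-k memory with random keys and values; `topk_retrieve` and the array sizes are illustrative names and choices, not the Cube Memory implementation.

```python
import numpy as np

def topk_retrieve(query, keys, values, k):
    """Softmax-weighted sum of the k value vectors whose keys best match the query."""
    scores = keys @ query                      # similarity of the query to every key
    top = np.argsort(scores)[-k:]              # indices of the k highest scores
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                   # non-negative and summing to 1: a convex combination
    return weights @ values[top], top

rng = np.random.default_rng(0)
keys = rng.standard_normal((256, 32))
values = rng.standard_normal((256, 32))
query = rng.standard_normal(32)

y, top = topk_retrieve(query, keys, values, k=4)

# Stacking the output under the 4 selected value vectors adds no new direction,
# so the numerical rank of the stack stays at 4.
rank_with_output = np.linalg.matrix_rank(np.vstack([values[top], y]))
print(rank_with_output)
```

The bound is per token: different queries can select different entries, so outputs across many tokens may span more than k directions, which is consistent with routed top-4 doing slightly better than a static rank-4 map.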
- Hypothesis: image generation as token permutation manipulation
- TiTok encodes images as 64 discrete tokens from 8192-entry codebook
- VSA binding naturally represents position → content mappings
- Permutation locality: small changes in token arrangement → small visual changes (validated experimentally)
- Encode test image with frozen TiTok, apply N random position swaps, decode, measure MSE vs original
| Swaps | MSE |
|---|---|
| 0 | 0.000 |
| 1 | 0.001 |
| 4 | 0.022 |
| 16 | 0.073 |
| 64 | 0.153 |
- Monotonic relationship confirms the TiTok decoder is smooth w.r.t. token permutation — the generation premise has empirical support.
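The perturbation used in this locality check can be sketched as follows. This is an assumed reading of "N random position swaps" (each swap exchanges two distinct, uniformly chosen positions); the TiTok decoder call is omitted because its interface is not specified here, and the random tokens stand in for an encoded image.

```python
import numpy as np

def apply_random_swaps(tokens, n_swaps, rng):
    """Return a copy of `tokens` after n_swaps exchanges of two distinct random positions."""
    out = np.array(tokens, copy=True)
    for _ in range(n_swaps):
        i, j = rng.choice(len(out), size=2, replace=False)
        out[i], out[j] = out[j], out[i]
    return out

def mse(a, b):
    """Mean squared error between two arrays, e.g. two decoded images."""
    return float(np.mean((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))

rng = np.random.default_rng(0)
tokens = rng.integers(0, 8192, size=64)        # stand-in for one encoded image
swapped = apply_random_swaps(tokens, 16, rng)
# A swap only rearranges positions, so the token multiset is unchanged.
print(sorted(swapped.tolist()) == sorted(tokens.tolist()))
```

In the experiment itself, `mse` would be applied to the decoder outputs for `tokens` and `swapped`, not to the token ids.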
Experiment 1: Pure VSA binding. Bind 64 position phasors with content phasors, superpose, unbind each position, classify against 8192 codebook entries.
| D | Acc% | Theory (bits have/need) |
|---|---|---|
| 512 | 0.0 | 3.0 / 13.0 |
| 1024 | 0.0 | 4.0 / 13.0 |
| 2048 | 0.0 | 5.0 / 13.0 |
| 4096 | 0.0 | 6.0 / 13.0 |
Theory predicts ~log2(D/64) bits per retrieval, while an 8192-entry codebook needs log2(8192) = 13 bits. Even D=4096 provides only 6 bits, and measured accuracy is 0% at every D.
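The "bits have / need" column follows directly from the log2(D/N) estimate. A small sketch of that arithmetic, with hypothetical helper names, is given below; `break_even_dimension` simply solves log2(D/N) = log2(C) for D and is only as reliable as the estimate itself.

```python
import math

def bits_available(d, n_bindings):
    """Estimated bits recoverable per retrieval from n_bindings superposed bindings."""
    return math.log2(d / n_bindings)

def bits_needed(codebook_size):
    """Bits required to pick one entry out of codebook_size."""
    return math.log2(codebook_size)

def break_even_dimension(n_bindings, codebook_size):
    """Dimension at which bits_available equals bits_needed under the same estimate."""
    return n_bindings * codebook_size

for d in (512, 1024, 2048, 4096):
    print(d, bits_available(d, 64), bits_needed(8192))
print(break_even_dimension(64, 8192))
```

Under this estimate the break-even dimension for N=64 and an 8192-entry codebook is 64 × 8192 = 524,288, two orders of magnitude above the largest D tested.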
Experiment 1b: VSA + MLP decoder. Add 2-layer MLP after unbinding. Result: catastrophic overfitting at all D. Train loss → 0, val loss → 25. The decoder memorizes training-specific noise patterns.
Experiment 1c: Factored codebook. Decompose 8192 = 128 × 64, use 3-way binding (pos ⊗ factA ⊗ factB), retrieve factors independently.
| D | AccA% (128-way) | AccB% (64-way) | Joint% |
|---|---|---|---|
| 512 | 2.2 | 5.6 | 0.1 |
| 1024 | 2.3 | 5.8 | 0.2 |
| 2048 | 2.0 | 5.7 | 0.1 |
| 4096 | 1.9 | 5.9 | 0.1 |
Hard ceiling at ~5.8% AccB regardless of D, against a chance level of 1/64 ≈ 1.6%. The bottleneck is structural (cross-talk from superposition), not capacity.
Experiment 2: Real TiTok tokens. Test on Imagenette (6000 images encoded by frozen TiTok). Token distribution: 12.85/13.00 bits entropy, all 8192 codes used. Result at D=4096: 0.61% accuracy after 1500 steps (same plateau as D=512). Near-uniform codebook usage means real tokens are as hard as random — no concentration to exploit.
MLP-only control: Position one-hot (64) → MLP(64→512→512→8192) without any VSA binding achieves 0.93% accuracy, confirming the task is impossible from positional information alone (slight positional bias but nothing useful). VSA is not the bottleneck — the problem structure is.
Reframe: encode permutations (64-way classification) instead of tokens (8192-way). No training — pure VSA theory test.
Cells report swap accuracy / exact-match rate; for D=512 only swap accuracy is listed.
| D | k=1 | k=4 | k=16 | k=32 (full) |
|---|---|---|---|---|
| 512 | 100% / - | 100% / - | 99.9% / - | 99.3% / - |
| 1024 | 100% / 89% | 100% / 54% | 100% / 15% | 100% / 99.9% |
| 2048 | 100% / 99.8% | 100% / 99.8% | 100% / 96.5% | 100% / 100% |
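For full permutations (k=32), recovery is exact at D≥2048 (0 failures in 2000 test permutations per condition, 95% CI: [99.8%, 100%]); for partial swaps at D=2048, swap accuracy is 100% and exact match stays between 96.5% and 99.8%. VSA excels at permutation representation: the 64-way classification is well within capacity.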
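One way to encode a permutation with FHRR is to bundle role-filler bindings, with one random phasor per source position (role) and one per target position (filler). The sketch below assumes that encoding and two independent random codebooks; the encoding actually used in Experiment 3 is not specified here and may differ.

```python
import numpy as np

def random_phasors(n, d, rng):
    """n random FHRR vectors of dimension d: unit-magnitude complex phasors."""
    return np.exp(1j * rng.uniform(-np.pi, np.pi, size=(n, d)))

def encode_permutation(perm, roles, fillers):
    """Bundle (sum) the bindings role_i * filler_{perm[i]} into a single trace."""
    return np.sum(roles * fillers[perm], axis=0)

def decode_permutation(trace, roles, fillers):
    """Unbind each role with its conjugate, then clean up against the filler codebook."""
    unbound = trace[None, :] * np.conj(roles)          # (n, d): noisy filler_{perm[i]}
    sims = np.real(unbound @ np.conj(fillers).T)       # (n, n) similarity scores
    return np.argmax(sims, axis=1)                     # 64-way classification per position

rng = np.random.default_rng(0)
n, d = 64, 2048
roles = random_phasors(n, d, rng)
fillers = random_phasors(n, d, rng)

perm = rng.permutation(n)
trace = encode_permutation(perm, roles, fillers)
recovered = decode_permutation(trace, roles, fillers)
exact_match = bool(np.array_equal(recovered, perm))
print(exact_match)
```

Separate role and filler codebooks are a deliberate choice in this sketch: with a single shared codebook, unbinding position i would also surface the vector for the position that maps onto i, which makes the cleanup ambiguous.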
Test whether real images can be approximated as permutations of a cluster reference. For each image: find nearest cluster, solve optimal assignment (Hungarian algorithm), measure token match rate.
| K clusters | Within-cluster overlap | Match rate | >=50% match |
|---|---|---|---|
| 10 | 4.0% | 2.7% | 0.2% |
| 50 | 2.1% | 5.2% | 0.8% |
| 200 | 2.4% | 8.9% | 3.3% |
Metric definitions: "Within-cluster overlap" = average multiset intersection size / 64 over random pairs of images in the same cluster (how many identical token values two images share). "Match rate" = fraction of positions where the Hungarian-optimal permutation of the cluster reference yields the correct token. The two metrics use different pairings (random within-cluster pairs vs. an image and its cluster reference), so they need not be ordered: match rate exceeds overlap at K=50 and K=200 but not at K=10.
Theoretical bound: Each image uses 64/8192 = 0.78% of the codebook. Under uniform codebook usage, the expected multiset overlap between two random images is ~0.5 tokens (birthday collision rate: 8192 × (64/8192)² ≈ 0.5). Even with K=6000 clusters (one reference per image, defeating the purpose), estimated match rate is ~29%. The permutation framing requires near-complete multiset overlap, which is impossible when the codebook is large and near-uniformly used.
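Both the overlap metric and the ~0.5-token estimate can be reproduced with the standard library. The sketch below is illustrative only: the function names are hypothetical, and `match_rate` assumes the assignment cost simply counts exact token matches, in which case the best achievable number of matched positions equals the multiset intersection size and no explicit Hungarian solve is needed.

```python
import random
from collections import Counter

def multiset_overlap(tokens_a, tokens_b):
    """Size of the multiset intersection: sum over token values of min(count_a, count_b)."""
    return sum((Counter(tokens_a) & Counter(tokens_b)).values())

def match_rate(image_tokens, reference_tokens):
    """Best fraction of positions any permutation of the reference can match exactly."""
    return multiset_overlap(image_tokens, reference_tokens) / len(image_tokens)

def mean_random_overlap(n_tokens=64, codebook=8192, trials=2000, seed=0):
    """Monte Carlo estimate of the overlap between two uniformly random token sequences."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = [rng.randrange(codebook) for _ in range(n_tokens)]
        b = [rng.randrange(codebook) for _ in range(n_tokens)]
        total += multiset_overlap(a, b)
    return total / trials

toy_overlap = multiset_overlap([1, 1, 2, 3], [1, 2, 2, 4])   # shares one 1 and one 2
estimate = mean_random_overlap()
print(toy_overlap, estimate)
```

The Monte Carlo estimate should land near the analytic figure of 8192 × (64/8192)² = 0.5 tokens, i.e. under 1% of the 64 positions.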
- Capacity failure: FHRR superposition of N bindings provides ~log2(D/N) bits per retrieval. With N=64 and codebook=8192 (13 bits), this estimate calls for D ≥ 64 × 8192 = 524,288, far beyond the D ≤ 4096 tested. Reducing cardinality to 64-way (permutations) solves this.
- Framing failure: Even if token retrieval worked, the permutation generation framing requires shared token multisets between images. With 8192 near-uniformly-used codebook entries, this assumption fails catastrophically (≤4% within-cluster overlap, <9% match rate).
- Permutation representation: 100% exact recovery at D=2048 for 64-position permutations (Experiment 3). VSA is ideal for structured operations on small discrete sets.
- Permutation locality: TiTok decoder is smooth w.r.t. token permutation (Experiment 0). This property is real and potentially useful for local image editing.
- SVD-aligned codebooks: Initializing VSA codebooks from SVD of the target activation matrix captures 50% more variance than random (4.3% vs 2.8%). If codebook entries align with the data manifold, retrieval improves — but the ceiling remains low.
VSA succeeds when:
- Classification cardinality is low (64 positions, not 8192 codes)
- The target has discrete, compositional structure
- Operations are group-theoretic (binding = multiplication, unbinding = division)
VSA fails when:
- The target mapping is high-rank and continuous (FFN replacement)
- Retrieval requires high-cardinality discrimination from superposition
- The compositional framing doesn't match the data (token multisets)
- The rank bottleneck (Case Study 1) applies to all retrieval-based FFN replacements: Product Key Memory, holographic reduced representations, any top-k selection mechanism
- The capacity limit (Case Study 2) applies to any VSA scheme that superposes N bindings and retrieves from a codebook of size C: reliable retrieval requires D >> N × C
- VSA / Hyperdimensional Computing: Kanerva (2009); Plate (2003) HRR; Gayler & Levy (2020); Frady et al. (2018) capacity bounds
- Memory Layers: Lample et al. (2019) PKM; Wu et al. (2024) Meta; Sukhbaatar et al. (2019)
- FFN Compression: Low-rank factorization (Hsu et al. 2022); pruning (Frantar & Alistarh 2023); distillation (Hinton et al. 2015)
- Image Tokenization: VQ-VAE (van den Oord et al. 2017); TiTok (Yu et al. 2024); VQGAN (Esser et al. 2021)
- Permutation-based Generation: Jigsaw puzzle methods (Noroozi & Favaro 2016); set prediction (Lee et al. 2019)
- Sparse MoE: Shazeer et al. (2017); Fedus et al. (2022); Mixtral
We tested VSA in two settings where its algebraic properties should provide advantages: structured FFN replacement and compositional image generation. Both fail, for different reasons:
- FFN replacement: The rank bottleneck. FFN mappings are ~89% linear with effective rank >2048. VSA retrieval collapses rank to top-k. Linear projections dominate at every parameter budget.
- Image generation: The capacity bottleneck plus framing mismatch. Superposition of 64 bindings provides at most ~6 bits per retrieval (at D=4096); 8192 codebook entries need 13 bits. And even with perfect retrieval, images don't share token multisets, so the permutation framing fails.
The positive findings narrow VSA's useful scope: it excels at representing discrete group operations (permutations of 64 positions: 100% exact match) but fails when the target requires either high-rank continuous mappings or high-cardinality discrete retrieval.
These results suggest that VSA's future in deep learning lies in genuinely compositional, low-cardinality tasks — symbolic reasoning, relational learning, discrete program synthesis — rather than as a drop-in replacement for dense continuous computations.
- Full singular value plot for layers 3, 27, 43
- Quantitative effective rank analysis
- Hyperparameters for all runs
- Hardware: AMD Ryzen AI 9 HX 370, Radeon 890M, 23GB RAM (local); ThinkCentre M70q Gen 5, 32GB RAM (reviewer experiments)
- Training curves for all architectures
- TiTok VQ-8K codebook usage histogram
- Entropy analysis (12.85 / 13.00 bits)
- Spatial correlation in real tokens
- Full accuracy tables for D=256,512,1024,2048,4096 across all k
- Swap accuracy vs position accuracy vs exact match
- Architecture comparison bar chart: Var% vs params for all architectures (Case Study 1) — visual punchline
- Singular value decay: Log-scale SVD spectrum for 3 layers
- Token binding capacity: Accuracy vs D for exps 1, 1c, 2 showing capacity wall
- Permutation recovery heatmap: Accuracy (D × k_swaps) from exp 3 — green region (permutations work) vs red region (tokens fail)
- Token overlap histogram: Distribution of match rates from exp 4 showing the framing failure
- Signal decomposition: Stacked bar: ~89% linear + 5% MLP + <1% VSA
Case Study 1 (FFN replacement) experiments:
| Exp | Description | Status | Location |
|---|---|---|---|
| SVD spectrum | Singular value decay of FFN activations | DONE | thinkhub:~/reviewer_results/exp1_svd_spectrum.json |
| Top-k scaling | top_k = 4,16,64 variance | DONE | thinkhub:/tmp/exp2_topk.log |
| FLOPs | Wall-clock + FLOP comparison | DONE | thinkhub:~/reviewer_results/exp3_flops.json |
| Codebook ablation | Random/learned/SVD/SVD+learned | DONE | thinkhub:~/reviewer_results/exp4_codebook_ablation.json |
| Rank-4 linear | SVD baseline at matched rank | DONE | thinkhub:~/reviewer_results/rank4_linear.json |

Case Study 2 (image generation) experiments:
| Exp | Description | Status | Location |
|---|---|---|---|
| Exp 0 | Permutation locality validation | DONE | rubik-gen/VALIDATION_RESULT.md |
| Exp 1 | Pure VSA token binding D=512-4096 | DONE | rubik-gen/results/exp1_vsa_capacity.json |
| Exp 1b | VSA + MLP decoder | DONE | rubik-gen/results/exp1b_vsa_decoder.json |
| Exp 1b ctrl | MLP-only baseline (no VSA) | DONE | rubik-gen/results/exp1b_mlp_only_control.json |
| Exp 1c | Factored codebook (128×64) | DONE | rubik-gen/results/exp1c_factored.json |
| Exp 2 | Real TiTok tokens (Imagenette, D=512-4096) | DONE | rubik-gen/results/exp2_real_tokens.json |
| Exp 3 | Permutation VSA encoding | DONE | rubik-gen/results/exp3_permutation_vsa.json |
| Exp 4 | Permutation reconstruction quality | DONE | rubik-gen/results/exp4_permutation_reconstruction.json |
- Code: github.com/Peterc3-dev/cube-memory (will be made public)
- Rubik gen: ~/projects/cube-memory/rubik-gen/
- Activations: ~/cube-memory-cache/activations/
- All experiments reproducible on consumer AMD APU (23GB RAM)
- Total compute: ~8 hours across both machines