Summary
`entity_cooccurrence` holds 1.72M rows and occupies 365 MB (~62% of the 592 MB SQLite DB). But 82% of pairs have weight ≤ 1.1, i.e., the pair co-occurred in exactly one note and received at most one Hebbian bump. The top-weight pairs are polluted by tool-output tokens (`"MEMORY.md" <-> "System reminder"` at 12.0, `"0" <-> "1"` at 9.0) rather than vault content. The boost at query time is ≤ +10% and rarely differentiates results.
Evidence
Direct SQL on the DB:
| bucket | count | share |
|---|---|---|
| weight == 1.0 | 1,085,370 | 62.9% |
| weight == 1.1 | 325,920 | 18.9% |
| 1.1 < w < 2.0 | 301,671 | 17.5% |
| 2 ≤ w < 5 | 12,043 | 0.70% |
| 5 ≤ w < 10 | 114 | 0.007% |
| ≥10 | 3 | 0.0002% |
Top entities: literal tool-output fragments leaking from triple extraction.
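For anyone who wants to reproduce the bucket counts, a query along these lines should work. This is a sketch and not the exact SQL that was run: it assumes only that `entity_cooccurrence` has a numeric `weight` column and that weights never drop below 1.0. The `entity_a` / `entity_b` columns in the demo are hypothetical stand-ins for the real schema.

```python
import sqlite3

# Bucket every row by weight. The numeric prefix on each label only exists
# so that ORDER BY returns the buckets in ascending weight order.
# Assumes weight >= 1.0 for every row (anything lower would land in bucket 3).
BUCKETS_SQL = """
SELECT bucket, COUNT(*) AS n
FROM (
    SELECT CASE
        WHEN ABS(weight - 1.0) < 1e-9 THEN '1: weight == 1.0'
        WHEN ABS(weight - 1.1) < 1e-9 THEN '2: weight == 1.1'
        WHEN weight < 2.0             THEN '3: 1.1 < w < 2.0'
        WHEN weight < 5.0             THEN '4: 2 <= w < 5'
        WHEN weight < 10.0            THEN '5: 5 <= w < 10'
        ELSE                               '6: >= 10'
    END AS bucket
    FROM entity_cooccurrence
)
GROUP BY bucket
ORDER BY bucket
"""


def weight_histogram(conn):
    """Return a list of (bucket, count, share) tuples; empty buckets are omitted."""
    rows = conn.execute(BUCKETS_SQL).fetchall()
    total = sum(n for _, n in rows)
    return [(bucket, n, n / total if total else 0.0) for bucket, n in rows]


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real DB (column names are hypothetical).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entity_cooccurrence (entity_a TEXT, entity_b TEXT, weight REAL)"
    )
    conn.executemany(
        "INSERT INTO entity_cooccurrence VALUES (?, ?, ?)",
        [("a", "b", 1.0), ("a", "c", 1.1), ("b", "c", 3.0)],
    )
    for bucket, n, share in weight_histogram(conn):
        print(f"{bucket:<20} {n:>6} {share:7.2%}")
```

The tolerance comparison (`ABS(weight - 1.1) < 1e-9`) is there because weights are floats; an exact `=` would probably also match a single `×1.1` bump, but the tolerance avoids depending on that.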
Proposed fix
- Prune `weight <= 1.0` after each rebuild. In `cooccurrence.py:142`, after the `executemany`: `conn.execute("DELETE FROM entity_cooccurrence WHERE weight <= 1.0")`. Frees ~225 MB. A single-note co-occurrence is noise.
- Switch score from raw count to NPMI. In `search.py:661-708` and `attractor.py:130-144`, replace count-based weight with `npmi(a,b) = pmi / -log p(a,b)`. Cache per-entity note counts in a small companion table populated by `persist_cooccurrence`. Effect: `Azure` / `MEMORY.md` stop dominating; rare-but-specific pairs rise.
- Denylist + upstream filter. Add a denylist in `cooccurrence.py:113-124` (single-digit strings, numeric-only tokens, known system tokens like `System reminder`, `vault_search`). Better: fix the triples extractor in `triples.py` so tool-output fragments never become entities.
- Hebbian LTD. Add `decay_cooccurrence(conn, factor=0.99, floor=1.0)` in `cooccurrence.py`, call from `cli/decay.py`. Multiplies all weights by 0.99 then `DELETE WHERE weight < floor`. Prevents unbounded growth from the `×1.1` reinforcement at `search.py:840-866`.
- Wire into `vault_related`. Currently `related.py:93-102` is pure cosine. Fold in co-occurrence via the same blend the attractor uses (`0.6 cosine + 0.25 cooccur + 0.15 wikilinks`). The signal becomes load-bearing instead of decorative.
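To make the NPMI proposal concrete, here is a minimal sketch of the score computed from note counts. The function signature and argument names are hypothetical; the only things taken from the proposal are the formula `npmi(a,b) = pmi / -log p(a,b)` and the idea that per-entity note counts come from a companion table. Probabilities are estimated as note counts divided by the total number of notes.

```python
import math


def npmi(n_ab, n_a, n_b, n_notes):
    """Normalized PMI in [-1, 1] from note counts (hypothetical helper).

    n_ab    -- notes containing both entities
    n_a     -- notes containing entity a
    n_b     -- notes containing entity b
    n_notes -- total notes in the vault
    """
    if n_notes <= 0 or n_a <= 0 or n_b <= 0:
        raise ValueError("counts must be positive")
    if n_ab > min(n_a, n_b) or max(n_a, n_b) > n_notes:
        raise ValueError("inconsistent counts")
    if n_ab == 0:
        return -1.0  # the pair never co-occurs: NPMI's lower bound
    if n_ab == n_notes:
        # Both entities appear in every note: pmi = 0 and -log p(a,b) = 0,
        # so the ratio is 0/0. Treated as uninformative here (a choice).
        return 0.0
    # pmi = log( p(a,b) / (p(a) * p(b)) ) = log( n_ab * N / (n_a * n_b) )
    pmi = math.log(n_ab * n_notes / (n_a * n_b))
    # -log p(a,b) = log( N / n_ab )
    return pmi / math.log(n_notes / n_ab)


if __name__ == "__main__":
    # Rare pair that only ever appears together: maximal score.
    print(npmi(n_ab=5, n_a=5, n_b=5, n_notes=1000))
    # Ubiquitous entities that co-occur about as often as chance predicts.
    print(npmi(n_ab=720, n_a=800, n_b=900, n_notes=1000))
```

Under this scoring, an entity that shows up in most notes (the role `MEMORY.md` plays today) gets a score near 0 with everything it touches, however large the raw count, while a pair that appears in only a few notes but always together scores near 1.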
Expected effect
- DB shrinks ~225 MB.
- Top pairs reflect actual knowledge, not tool noise.
- NPMI rebalances the score toward specific associations.
- Self-limiting growth via decay.
- `vault_related` starts surfacing Hebbian-connected notes, not just embedding-similar ones — which is the neuroscience premise the project sells.
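The self-limiting behaviour can be sketched as follows. The name, defaults, and semantics of `decay_cooccurrence` follow the proposal above; the body, the return value, and the `entity_a` / `entity_b` columns in the demo are assumptions, not the project's code.

```python
import sqlite3


def decay_cooccurrence(conn, factor=0.99, floor=1.0):
    """Multiply every weight by `factor`, then drop rows below `floor`.

    Returns the number of rows deleted (an assumption; the proposal does
    not specify a return value).
    """
    if not 0.0 < factor <= 1.0:
        raise ValueError("factor must be in (0, 1]")
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "UPDATE entity_cooccurrence SET weight = weight * ?", (factor,)
        )
        cur = conn.execute(
            "DELETE FROM entity_cooccurrence WHERE weight < ?", (floor,)
        )
    return cur.rowcount


if __name__ == "__main__":
    # In-memory stand-in for the real table (column names are hypothetical).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entity_cooccurrence (entity_a TEXT, entity_b TEXT, weight REAL)"
    )
    conn.executemany(
        "INSERT INTO entity_cooccurrence VALUES (?, ?, ?)",
        [("a", "b", 1.0), ("a", "c", 1.1), ("b", "c", 12.0)],
    )
    for i in range(1, 11):
        deleted = decay_cooccurrence(conn)
        print(f"pass {i}: deleted {deleted}")
```

With the default `factor=0.99` and `floor=1.0`, a pair sitting at exactly 1.0 is removed on the first pass, and a pair bumped once to 1.1 survives nine passes and is removed on the tenth unless it is reinforced again in between. Note that SQLite only marks the freed pages as reusable after a `DELETE`; the file on disk shrinks only after a `VACUUM` (or with `auto_vacuum` enabled), so the size reduction would need that extra step.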
Key files
- `src/neurostack/cooccurrence.py:17, 20-88, 91-148, 257-275`
- `src/neurostack/search.py:661-708, 840-866`
- `src/neurostack/attractor.py:43, 67-177`
- `src/neurostack/related.py:19-125` (currently doesn't read `entity_cooccurrence` at all)
- `src/neurostack/config.py:53`