Improvement: co-occurrence table is 62% of DB and 82% noise #34

@raphasouthall

Description

Summary

`entity_cooccurrence` holds 1.72M rows and occupies 365 MB (~62% of the 592 MB SQLite DB). But 82% of pairs have weight ≤ 1.1, i.e., the pair co-occurred in exactly one note and got at most one Hebbian bump. The top-weight pairs are polluted by tool-output tokens (`"MEMORY.md" <-> "System reminder"` at 12.0, `"0" <-> "1"` at 9.0) rather than vault content. The boost at query time is at most +10% and rarely differentiates between candidates.

Evidence

Direct SQL on the DB:

| Bucket | Count | Share |
|---|---|---|
| weight == 1.0 | 1,085,370 | 62.9% |
| weight == 1.1 | 325,920 | 18.9% |
| 1.1 < w < 2.0 | 301,671 | 17.5% |
| 2 ≤ w < 5 | 12,043 | 0.70% |
| 5 ≤ w < 10 | 114 | 0.007% |
| w ≥ 10 | 3 | 0.0002% |
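
For anyone who wants to reproduce the bucket counts, here is a minimal sketch using Python's `sqlite3`. It assumes only that `entity_cooccurrence` has a numeric `weight` column; the function name `weight_histogram` is hypothetical. The first two buckets are expressed as ranges with rounding, because exact float equality on `1.1` can be fragile.

```python
import sqlite3

# Buckets approximate the table above. The first two use ROUND() so that
# float noise around 1.0 and 1.1 does not push rows into the wrong bucket.
BUCKET_SQL = """
SELECT bucket, COUNT(*) AS n
FROM (
    SELECT CASE
        WHEN ROUND(weight, 6) <= 1.0 THEN 'w <= 1.0'
        WHEN ROUND(weight, 6) <= 1.1 THEN '1.0 < w <= 1.1'
        WHEN weight < 2.0 THEN '1.1 < w < 2'
        WHEN weight < 5.0 THEN '2 <= w < 5'
        WHEN weight < 10.0 THEN '5 <= w < 10'
        ELSE 'w >= 10'
    END AS bucket
    FROM entity_cooccurrence
)
GROUP BY bucket
"""


def weight_histogram(conn: sqlite3.Connection) -> dict:
    """Return {bucket: (count, share)} for entity_cooccurrence.weight."""
    rows = conn.execute(BUCKET_SQL).fetchall()
    total = sum(n for _, n in rows)
    if total == 0:
        return {}
    return {bucket: (n, n / total) for bucket, n in rows}
```

Summing the shares of the first two buckets gives the "weight ≤ 1.1" figure quoted in the summary.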

Top entities: literal tool-output fragments leaking from triple extraction.
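
A rough sketch of the kind of entity screen that would catch these fragments, in line with the denylist idea proposed in this issue. The names `SYSTEM_TOKENS` and `is_noise_entity` are hypothetical and the token list only contains the two examples named here; the real list would need to come from inspecting the extractor's output.

```python
import re

# Hypothetical denylist: only the system tokens named in this issue.
SYSTEM_TOKENS = frozenset({"system reminder", "vault_search"})

# Matches numeric-only tokens such as "0", "42", "3.14", "1,000", "-7".
_NUMERIC_ONLY = re.compile(r"[+-]?\d+([.,]\d+)*")


def is_noise_entity(name: str) -> bool:
    """Return True if an entity name looks like tool-output noise."""
    token = name.strip().lower()
    if not token:
        return True  # empty or whitespace-only names carry no meaning
    if token in SYSTEM_TOKENS:
        return True
    return _NUMERIC_ONLY.fullmatch(token) is not None
```

A filter like this treats the symptom only; fragments that are neither numeric nor on the list would still get through unless extraction itself is fixed.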

Proposed fix

  1. Prune `weight <= 1.0` after each rebuild. In `cooccurrence.py:142`, after the `executemany`: `conn.execute("DELETE FROM entity_cooccurrence WHERE weight <= 1.0")`. Frees ~225 MB of pages (SQLite returns them to the filesystem only after a `VACUUM`). A single-note co-occurrence is noise.
  2. Switch score from raw count to NPMI. In `search.py:661-708` and `attractor.py:130-144`, replace count-based weight with `npmi(a,b) = pmi(a,b) / (-log p(a,b))`, where `pmi(a,b) = log(p(a,b) / (p(a) p(b)))`. Cache per-entity note counts in a small companion table populated by `persist_cooccurrence`. Effect: `Azure` / `MEMORY.md` stop dominating; rare-but-specific pairs rise.
  3. Denylist + upstream filter. Add a denylist in `cooccurrence.py:113-124` (single-digit strings, numeric-only tokens, known system tokens like `System reminder`, `vault_search`). Better: fix the triples extractor in `triples.py` so tool-output fragments never become entities.
  4. Hebbian LTD. Add `decay_cooccurrence(conn, factor=0.99, floor=1.0)` in `cooccurrence.py`, call from `cli/decay.py`. Multiplies all weights by 0.99 then `DELETE WHERE weight < floor`. Prevents unbounded growth from the `×1.1` reinforcement at `search.py:840-866`.
  5. Wire into `vault_related`. Currently `related.py:93-102` is pure cosine. Fold in co-occurrence via the same blend the attractor uses (`0.6 cosine + 0.25 cooccur + 0.15 wikilinks`). The signal becomes load-bearing instead of decorative.
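
The NPMI score from step 2 could look roughly like this, assuming probabilities are estimated from note counts: `p(a) = n_a / N`, `p(b) = n_b / N`, `p(a,b) = n_ab / N`, where `N` is the total number of notes. The function name and parameters are hypothetical and not taken from `search.py`. Edge-case handling (zero counts, a pair present in every note) is a choice made for this sketch.

```python
import math


def npmi(n_ab: int, n_a: int, n_b: int, n_notes: int) -> float:
    """Normalized PMI in [-1, 1] from note counts.

    n_ab: notes containing both entities
    n_a, n_b: notes containing each entity
    n_notes: total notes in the vault
    """
    if n_notes <= 0 or n_a <= 0 or n_b <= 0 or n_ab <= 0:
        return -1.0  # limit value for pairs that never co-occur
    p_ab = n_ab / n_notes
    if p_ab >= 1.0:
        return 0.0  # pair is in every note: no specificity (sketch's choice)
    p_a = n_a / n_notes
    p_b = n_b / n_notes
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)


# A rare pair that always appears together scores near 1.
specific = npmi(n_ab=5, n_a=5, n_b=5, n_notes=100)
# A ubiquitous pair co-occurring about as often as chance scores near 0.
common = npmi(n_ab=720, n_a=900, n_b=800, n_notes=1000)
```

With raw counts the second pair (720 shared notes) would dominate the first (5 shared notes); under NPMI the order flips, which is the rebalancing that step 2 aims for.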

Expected effect

  • DB shrinks ~225 MB.
  • Top pairs reflect actual knowledge, not tool noise.
  • NPMI rebalances the score toward specific associations.
  • Self-limiting growth via decay.
  • `vault_related` starts surfacing Hebbian-connected notes, not just embedding-similar ones — which is the neuroscience premise the project sells.
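
A minimal sketch of the decay pass from step 4, using the signature proposed there. It assumes only a numeric `weight` column on `entity_cooccurrence`; the return value (number of pruned rows) is an addition for illustration.

```python
import sqlite3


def decay_cooccurrence(conn: sqlite3.Connection,
                       factor: float = 0.99,
                       floor: float = 1.0) -> int:
    """Scale every weight by `factor`, then drop rows below `floor`.

    Returns the number of rows deleted.
    """
    if not 0.0 < factor <= 1.0:
        raise ValueError("factor must be in (0, 1]")
    # `with conn` commits on success and rolls back on error, so the
    # UPDATE and DELETE are applied together or not at all.
    with conn:
        conn.execute(
            "UPDATE entity_cooccurrence SET weight = weight * ?", (factor,)
        )
        cur = conn.execute(
            "DELETE FROM entity_cooccurrence WHERE weight < ?", (floor,)
        )
    return cur.rowcount
```

With the proposed defaults, a pair at weight 1.0 falls to 0.99 and is removed on the first pass, while a pair at 1.1 needs about ten passes without reinforcement before it drops below the floor. Deleted rows leave free pages inside the SQLite file; the file on disk shrinks only after a `VACUUM` (or with `auto_vacuum` enabled).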

Key files

  • `src/neurostack/cooccurrence.py:17, 20-88, 91-148, 257-275`
  • `src/neurostack/search.py:661-708, 840-866`
  • `src/neurostack/attractor.py:43, 67-177`
  • `src/neurostack/related.py:19-125` (currently doesn't read `entity_cooccurrence` at all)
  • `src/neurostack/config.py:53`

Metadata

    Labels

    enhancement (New feature or request)
