Prune dense semantic floor so community modularity recovers by raphasouthall · Pull Request #44 · raphasouthall/neurostack

raphasouthall · 2026-06-03T13:09:49Z

Problem

GraphRAG community detection sat at Newman modularity Q = 0.068 (coarse) / 0.058 (fine), barely above the Q ≈ 0 expected of a random partition. Communities bled into each other and the hierarchy inverted.
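For readers who want the metric spelled out, here is a minimal sketch of weighted Newman modularity in NumPy. It is not the project's implementation; the function name `modularity` and the toy graphs are illustrative assumptions. It shows why a graph where every pair is connected scores near or below zero for any split.

```python
import numpy as np

def modularity(weights, labels) -> float:
    """Newman modularity Q of a partition on a weighted, undirected graph.

    `weights` is a symmetric matrix with a zero diagonal; `labels[i]` is the
    community id of node i. (Hypothetical helper, not from the PR.)
    """
    w = np.asarray(weights, dtype=float)
    labels = np.asarray(labels)
    two_m = w.sum()                       # total weight, each edge counted twice
    if two_m == 0:
        return 0.0
    degree = w.sum(axis=1)                # weighted degree k_i
    same = labels[:, None] == labels[None, :]     # delta(c_i, c_j)
    expected = np.outer(degree, degree) / two_m   # null-model weight k_i*k_j / 2m
    return float(((w - expected) * same).sum() / two_m)

labels = np.array([0, 0, 0, 1, 1, 1])

# Two disjoint triangles: the partition matches the structure exactly.
sparse = np.zeros((6, 6))
sparse[:3, :3] = 1.0
sparse[3:, 3:] = 1.0
np.fill_diagonal(sparse, 0.0)
print(modularity(sparse, labels))   # approximately 0.5

# Complete graph with uniform weights: the same partition explains nothing.
dense = np.ones((6, 6))
np.fill_diagonal(dense, 0.0)
print(modularity(dense, labels))    # approximately -0.1
```

The second case is the toy analogue of a dense semantic floor: when nearly all pairs carry similar weight, the observed within-community weight is no higher than the null model predicts.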

Diagnosis (measured, not assumed)

The blended similarity matrix is S = 0.6·semantic + 0.25·cooc + 0.15·wikilinks. A read-only parameter sweep against a copy of a ~490-note vault showed:

Co-occurrence is not the cause. Entity document-frequency is healthy: 90% of entities (9,829/10,948) appear in a single note, and the biggest hub appears in only 37 notes. Co-occurrence knobs (MIN_SHARED, hub df-cap) left Q unchanged (0.067 → 0.067).
The semantic signal is a dense floor. Off-diagonal cosine averaged 0.358, with 66% of all note pairs above 0.3. In a single-domain vault everything is mildly similar to everything, so the dominant signal connects nearly all pairs and collapses modularity.
Thresholding the floor is the lever. Zeroing semantic edges below mean + 0.5·std lifts Q to ~0.30 coarse / ~0.28 fine with stable community counts (7 coarse / 11–12 fine across the whole sweep). Top-k sparsification alone hurt Q, because it keeps each node's top neighbours regardless of absolute weight, so floor-level edges still survive.
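The floor measurement described above can be reproduced with a few lines of NumPy. This is a sketch under stated assumptions: `off_diagonal_stats` is a hypothetical helper, the 0.3 cutoff mirrors the figure quoted in the diagnosis, and the embeddings are synthetic (a shared direction plus noise, standing in for a single-domain vault), so the printed numbers will not match the real vault.

```python
import numpy as np

def off_diagonal_stats(sem: np.ndarray, cutoff: float = 0.3) -> dict:
    """Mean, std, and share of pairs above `cutoff`, ignoring the diagonal."""
    n = sem.shape[0]
    if n < 2:
        raise ValueError("need at least two notes to have off-diagonal pairs")
    off = ~np.eye(n, dtype=bool)          # True everywhere except the diagonal
    vals = sem[off]
    return {
        "mean": float(vals.mean()),
        "std": float(vals.std()),         # population std (assumption)
        "frac_above": float((vals > cutoff).mean()),
    }

# Synthetic single-domain vault: every note shares one dominant direction.
rng = np.random.default_rng(0)
shared = rng.normal(size=64)
emb = 0.6 * shared + rng.normal(size=(50, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit vectors -> cosine
stats = off_diagonal_stats(emb @ emb.T)
print(stats)
```

A high mean together with a large `frac_above` is the signature of the dense floor: most pairs carry a non-trivial edge before any real cluster structure is considered.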

config	Q coarse (Qc)	Q fine (Qf)
baseline (no threshold)	0.068	0.058
co-occurrence df-cap / min-shared	0.067	0.057
mean + 0.5·std (this PR)	0.343	0.278

Change

_build_similarity_matrix now zeros semantic edges below an adaptive threshold of mean + K·std of the off-diagonal similarity distribution, where K is the new SEMANTIC_THRESHOLD_K setting (default 0.5). The threshold is adaptive rather than a fixed cosine value so that it self-tunes to each vault's similarity spread. Setting SEMANTIC_THRESHOLD_K to None disables pruning. The threshold is skipped when n ≤ 2.
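A minimal sketch of the thresholding step, assuming NumPy matrices. Only SEMANTIC_THRESHOLD_K, its default, the None switch, the n ≤ 2 guard, and the 0.6 / 0.25 / 0.15 weights come from this PR; the helper names `prune_semantic_floor` and `blend`, the use of population std, leaving the diagonal untouched, and pruning before blending are assumptions made for illustration, not a copy of _build_similarity_matrix.

```python
from typing import Optional

import numpy as np

SEMANTIC_THRESHOLD_K: Optional[float] = 0.5   # name and default from the PR

def prune_semantic_floor(sem, k: Optional[float] = SEMANTIC_THRESHOLD_K):
    """Zero off-diagonal similarities below mean + k*std (hypothetical helper)."""
    sem = np.asarray(sem, dtype=float)
    n = sem.shape[0]
    if k is None or n <= 2:               # disabled, or too few notes for stats
        return sem.copy()
    off = ~np.eye(n, dtype=bool)          # off-diagonal mask
    vals = sem[off]
    threshold = vals.mean() + k * vals.std()
    pruned = sem.copy()
    pruned[off & (sem < threshold)] = 0.0
    return pruned

def blend(sem, cooc, links, k: Optional[float] = SEMANTIC_THRESHOLD_K):
    """S = 0.6*semantic + 0.25*cooc + 0.15*links, with the semantic floor pruned."""
    return 0.6 * prune_semantic_floor(sem, k) + 0.25 * cooc + 0.15 * links

# Two clusters of two notes: strong within-cluster, floor-level across.
sem = np.full((4, 4), 0.3)
sem[:2, :2] = 0.9
sem[2:, 2:] = 0.9
np.fill_diagonal(sem, 1.0)
print(prune_semantic_floor(sem))   # cross-cluster 0.3 entries become 0.0
```

Because the cutoff is computed from each matrix's own off-diagonal distribution, the same K removes the floor in a tightly similar vault and in a more varied one without retuning a fixed cosine value.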

Tests

New TestSemanticThreshold: cross-cluster edges pruned while within-cluster survive, disabled mode keeps the floor, small-vault skip. Full suite: 559 passed.

Community detection blended S = 0.6*semantic + 0.25*cooc + 0.15*links, but in a single-domain vault the semantic signal is a dense floor: off-diagonal cosine averaged ~0.36 with 66% of all note pairs above 0.3, so nearly every note was connected and Newman modularity sat at Q~0.06, barely above random.

Measurement (read-only sweep on a ~490-note vault) showed co-occurrence was not the culprit: entity document-frequency is healthy (90% of entities appear in a single note), and co-occurrence knobs (min-shared, hub df-cap) left Q unchanged. The lever is the semantic floor. Zeroing semantic edges below an adaptive threshold (mean + k*std of the off-diagonal distribution) lifts Q 0.06 -> ~0.30 coarse / ~0.28 fine while keeping community counts stable.

Add SEMANTIC_THRESHOLD_K (default 0.5) applied in _build_similarity_matrix. Adaptive rather than a fixed cosine so it self-tunes per vault; None disables. Guarded for n<=2. Tests cover cross-cluster pruning, disabled mode, and the small-vault skip.

raphasouthall merged commit bb410a4 into main Jun 3, 2026
5 checks passed

raphasouthall deleted the fix/semantic-edge-threshold-modularity branch June 3, 2026 13:10

