Prune dense semantic floor so community modularity recovers #44

Merged
raphasouthall merged 1 commit into main from fix/semantic-edge-threshold-modularity
Jun 3, 2026
@raphasouthall
Owner

Problem

GraphRAG community detection sat at Newman modularity Q ≈ 0.06 (coarse) / 0.05 (fine) — barely above a random partition. Communities bled into each other and the hierarchy inverted.

Diagnosis (measured, not assumed)

The blended similarity matrix is S = 0.6·semantic + 0.25·cooc + 0.15·wikilinks. A read-only parameter sweep against a copy of a ~490-note vault showed:

  • Co-occurrence is not the cause. Entity document-frequency is healthy: 90% of entities (9,829/10,948) appear in a single note, and the biggest hub appears in only 37 notes. Tuning the co-occurrence knobs (MIN_SHARED, hub df-cap) left Q unchanged (0.067 → 0.067).
  • The semantic signal is a dense floor. Off-diagonal cosine averaged 0.358, with 66% of all note pairs above 0.3. In a single-domain vault everything is mildly similar to everything, so the dominant signal connects nearly all pairs and collapses modularity.
  • Thresholding the floor is the lever. Zeroing semantic edges below mean + 0.5·std lifts Q to ~0.30 coarse / ~0.28 fine with stable community counts (7 coarse / 11–12 fine across the whole sweep). Top-k sparsification alone hurt, because it keeps each node's top neighbours regardless of their absolute weight.

| Config | Qc (coarse) | Qf (fine) |
|---|---|---|
| baseline (no threshold) | 0.068 | 0.058 |
| co-occurrence df-cap / min-shared | 0.067 | 0.057 |
| mean + 0.5·std (this PR) | 0.343 | 0.278 |
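
The effect in the table can be reproduced in miniature. The sketch below is not the project's code: `floor_stats` and `modularity` are hypothetical helper names, the 4-note matrices are invented toy data, and modularity is computed with the standard weighted Newman formula Q = (1/2m) Σ_ij [A_ij − k_i·k_j / 2m] δ(c_i, c_j), with self-loops ignored. It only illustrates why a dense floor of mild similarity drags Q down even when the true clusters are obvious.

```python
from statistics import mean


def floor_stats(semantic, cutoff=0.3):
    """Return (mean off-diagonal value, share of off-diagonal pairs above cutoff)."""
    n = len(semantic)
    values = [semantic[i][j] for i in range(n) for j in range(n) if i != j]
    share_above = sum(v > cutoff for v in values) / len(values)
    return mean(values), share_above


def modularity(weights, labels):
    """Weighted Newman modularity of a partition; the diagonal is ignored."""
    n = len(weights)
    # Node strength k_i: sum of edge weights to all other nodes.
    strength = [sum(weights[i][j] for j in range(n) if j != i) for i in range(n)]
    two_m = sum(strength)
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] != labels[j]:
                continue
            a_ij = weights[i][j] if i != j else 0.0
            q += a_ij - strength[i] * strength[j] / two_m
    return q / two_m


# Toy data (invented): two obvious clusters {0, 1} and {2, 3}.
labels = [0, 0, 1, 1]
dense = [  # every cross-cluster pair is still "mildly similar" (0.35)
    [1.0, 0.9, 0.35, 0.35],
    [0.9, 1.0, 0.35, 0.35],
    [0.35, 0.35, 1.0, 0.9],
    [0.35, 0.35, 0.9, 1.0],
]
pruned = [  # same matrix with the cross-cluster floor zeroed
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
]

q_dense = modularity(dense, labels)
q_pruned = modularity(pruned, labels)
print(floor_stats(dense), round(q_dense, 4), round(q_pruned, 4))
```

With the same partition, the dense matrix scores far lower than the pruned one, because the floor inflates every node's strength and therefore the expected-weight term that modularity subtracts.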

Change

_build_similarity_matrix now zeros semantic edges below an adaptive threshold of mean + k·std of the off-diagonal distribution, where k is SEMANTIC_THRESHOLD_K (default 0.5). The threshold is adaptive rather than a fixed cosine so that it self-tunes to each vault's similarity spread. Setting SEMANTIC_THRESHOLD_K to None disables pruning, and the step is skipped for n ≤ 2.
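
A minimal sketch of the thresholding step, under stated assumptions: `prune_semantic_floor` and `blend_similarity` are hypothetical names (the real logic lives inside `_build_similarity_matrix`), matrices are plain lists of lists rather than whatever array type the project uses, the standard deviation is the population one, the diagonal is left untouched, and pruning is assumed to be applied to the semantic component before blending. Only the 0.6 / 0.25 / 0.15 weights, the mean + k·std rule, the None switch and the n ≤ 2 guard come from this PR.

```python
from statistics import mean, pstdev

SEMANTIC_THRESHOLD_K = 0.5  # default from this PR; None disables pruning


def prune_semantic_floor(semantic, k=SEMANTIC_THRESHOLD_K):
    """Zero off-diagonal entries below mean + k*std of the off-diagonal values."""
    n = len(semantic)
    if k is None or n <= 2:
        # Disabled, or too few notes for a meaningful distribution.
        return [row[:] for row in semantic]
    off_diag = [semantic[i][j] for i in range(n) for j in range(n) if i != j]
    threshold = mean(off_diag) + k * pstdev(off_diag)  # population std (assumption)
    return [
        [
            semantic[i][j] if i == j or semantic[i][j] >= threshold else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def blend_similarity(semantic, cooc, links, k=SEMANTIC_THRESHOLD_K):
    """S = 0.6*semantic + 0.25*cooc + 0.15*links, with the semantic floor pruned first."""
    pruned = prune_semantic_floor(semantic, k)
    n = len(semantic)
    return [
        [0.6 * pruned[i][j] + 0.25 * cooc[i][j] + 0.15 * links[i][j] for j in range(n)]
        for i in range(n)
    ]


# Toy data (invented): clusters {0, 1} and {2, 3} over a 0.35 floor.
semantic = [
    [1.0, 0.9, 0.35, 0.35],
    [0.9, 1.0, 0.35, 0.35],
    [0.35, 0.35, 1.0, 0.9],
    [0.35, 0.35, 0.9, 1.0],
]
zeros = [[0.0] * 4 for _ in range(4)]

pruned = prune_semantic_floor(semantic)
blended = blend_similarity(semantic, zeros, zeros)
```

Because the cutoff is derived from each matrix's own spread, the same k prunes the 0.35 floor here and would sit higher or lower in a vault whose notes are more or less alike overall.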

Tests

New TestSemanticThreshold: cross-cluster edges pruned while within-cluster survive, disabled mode keeps the floor, small-vault skip. Full suite: 559 passed.

Community detection blended S = 0.6*semantic + 0.25*cooc + 0.15*links, but in
a single-domain vault the semantic signal is a dense floor: off-diagonal cosine
averaged ~0.36 with 66% of all note pairs above 0.3, so nearly every note was
connected and Newman modularity sat at Q~0.06 — barely above random.

Measurement (read-only sweep on a ~490-note vault) showed co-occurrence was not
the culprit: entity document-frequency is healthy (90% of entities appear in a
single note), and co-occurrence knobs (min-shared, hub df-cap) left Q unchanged.
The lever is the semantic floor. Zeroing semantic edges below an adaptive
threshold (mean + k*std of the off-diagonal distribution) lifts Q 0.06 -> ~0.30
coarse / ~0.28 fine while keeping community counts stable.

Add SEMANTIC_THRESHOLD_K (default 0.5) applied in _build_similarity_matrix.
Adaptive rather than a fixed cosine so it self-tunes per vault; None disables.
Guarded for n<=2. Tests cover cross-cluster pruning, disabled mode, and the
small-vault skip.
@raphasouthall raphasouthall merged commit bb410a4 into main Jun 3, 2026
5 checks passed
@raphasouthall raphasouthall deleted the fix/semantic-edge-threshold-modularity branch June 3, 2026 13:10