Prune dense semantic floor so community modularity recovers #44
Merged
Community detection blended S = 0.6*semantic + 0.25*cooc + 0.15*links, but in a single-domain vault the semantic signal is a dense floor: off-diagonal cosine averaged ~0.36 with 66% of all note pairs above 0.3, so nearly every note was connected and Newman modularity sat at Q~0.06 — barely above random. Measurement (read-only sweep on a ~490-note vault) showed co-occurrence was not the culprit: entity document-frequency is healthy (90% of entities appear in a single note), and co-occurrence knobs (min-shared, hub df-cap) left Q unchanged. The lever is the semantic floor. Zeroing semantic edges below an adaptive threshold (mean + k*std of the off-diagonal distribution) lifts Q 0.06 -> ~0.30 coarse / ~0.28 fine while keeping community counts stable. Add SEMANTIC_THRESHOLD_K (default 0.5) applied in _build_similarity_matrix. Adaptive rather than a fixed cosine so it self-tunes per vault; None disables. Guarded for n<=2. Tests cover cross-cluster pruning, disabled mode, and the small-vault skip.
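The pruning step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: it uses plain Python lists so it runs without dependencies, it assumes the population standard deviation, and `prune_semantic_floor`, `blend`, and `two_cluster_matrix` are hypothetical names (the real logic lives in `_build_similarity_matrix`).

```python
from statistics import fmean, pstdev


def prune_semantic_floor(semantic, k=0.5):
    """Zero off-diagonal entries below mean + k*std of the off-diagonal values.

    Hypothetical stand-in for the thresholding applied inside
    _build_similarity_matrix; k plays the role of SEMANTIC_THRESHOLD_K.
    """
    n = len(semantic)
    # k=None disables pruning; with n <= 2 there are too few pairs
    # for the spread to mean anything, so the matrix is returned as-is.
    if k is None or n <= 2:
        return [row[:] for row in semantic]
    off_diag = [semantic[i][j] for i in range(n) for j in range(n) if i != j]
    # Assumption: population std (ddof=0).
    threshold = fmean(off_diag) + k * pstdev(off_diag)
    return [
        [v if i == j or v >= threshold else 0.0 for j, v in enumerate(row)]
        for i, row in enumerate(semantic)
    ]


def blend(semantic, cooc, links, k=0.5):
    """S = 0.6*semantic + 0.25*cooc + 0.15*links, with the semantic floor pruned first."""
    pruned = prune_semantic_floor(semantic, k)
    n = len(pruned)
    return [
        [0.6 * pruned[i][j] + 0.25 * cooc[i][j] + 0.15 * links[i][j] for j in range(n)]
        for i in range(n)
    ]


def two_cluster_matrix(size=3, within=0.9, across=0.3):
    """Toy cosine matrix: two clusters of `size` notes sitting on a dense floor."""
    n = 2 * size
    return [
        [1.0 if i == j else (within if i // size == j // size else across) for j in range(n)]
        for i in range(n)
    ]


sem = two_cluster_matrix()
pruned = prune_semantic_floor(sem, k=0.5)
print(pruned[0][1], pruned[0][3])  # within-cluster weight kept, cross-cluster weight zeroed
```

Because the cutoff is derived from each matrix's own off-diagonal distribution, the same `k` yields a different absolute cosine cutoff on a vault with a higher or lower similarity floor.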
Problem
GraphRAG community detection sat at Newman modularity Q ≈ 0.06 (coarse) / 0.05 (fine) — barely above a random partition. Communities bled into each other and the hierarchy inverted.
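To make the metric concrete, here is a small dependency-free sketch of weighted Newman modularity on a toy graph. The `modularity` and `two_clusters` helpers are hypothetical and the numbers are illustrative only; they do not reproduce the vault measurements. The sketch only shows how a uniform floor of cross-cluster weight pulls Q down for the same partition.

```python
def modularity(weights, labels):
    """Newman modularity Q for a symmetric weighted adjacency matrix.

    Q = (1 / 2m) * sum_ij (A_ij - k_i * k_j / 2m) * [labels[i] == labels[j]]
    Self-loops are ignored (the diagonal is treated as zero).
    """
    n = len(weights)
    strength = [sum(weights[i][j] for j in range(n) if j != i) for i in range(n)]
    two_m = sum(strength)
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                a_ij = weights[i][j] if i != j else 0.0
                q += a_ij - strength[i] * strength[j] / two_m
    return q / two_m


def two_clusters(size, within, across):
    """Toy graph: two equal clusters with uniform within/across edge weights."""
    n = 2 * size
    return [
        [0.0 if i == j else (within if i // size == j // size else across) for j in range(n)]
        for i in range(n)
    ]


labels = [0, 0, 0, 1, 1, 1]
q_floor = modularity(two_clusters(3, within=0.9, across=0.3), labels)   # about 0.17
q_pruned = modularity(two_clusters(3, within=0.9, across=0.0), labels)  # 0.50
print(round(q_floor, 2), round(q_pruned, 2))
```

The partition is identical in both calls; only the cross-cluster floor differs, which is the effect the diagnosis below isolates.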
Diagnosis (measured, not assumed)
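The blended similarity matrix is S = 0.6·semantic + 0.25·cooc + 0.15·wikilinks. A read-only parameter sweep against a copy of a ~490-note vault showed:
- The semantic signal is a dense floor: off-diagonal cosine averaged ~0.36, with 66% of all note pairs above 0.3, so nearly every note was connected.
- Co-occurrence was not the culprit: entity document-frequency is healthy (90% of entities appear in a single note), and the co-occurrence knobs (MIN_SHARED, hub df-cap) left Q unchanged (0.067 → 0.067).
- Zeroing semantic edges below mean + 0.5·std lifts Q to ~0.30 coarse / ~0.28 fine with stable community counts (7 coarse / 11–12 fine across the whole sweep). Top-k sparsification alone hurt (it keeps each node's top neighbours regardless of absolute weight).
Change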
_build_similarity_matrix zeros semantic edges below an adaptive threshold controlled by SEMANTIC_THRESHOLD_K (default 0.5, i.e. mean + 0.5·std of the off-diagonal distribution). The threshold is adaptive rather than a fixed cosine cutoff so that it self-tunes to each vault's similarity spread. Setting it to None disables pruning. Pruning is skipped when n ≤ 2.
Tests
New TestSemanticThreshold: cross-cluster edges are pruned while within-cluster edges survive, disabled mode keeps the floor, and small vaults skip pruning. Full suite: 559 passed.
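The three cases could look roughly like the `unittest` sketch below. It is not the PR's test code: `prune_floor` is a hypothetical stand-in for the thresholding inside `_build_similarity_matrix`, written with plain lists and the population standard deviation so the example is self-contained, and the class name is deliberately different from the real `TestSemanticThreshold`.

```python
import unittest
from statistics import fmean, pstdev


def prune_floor(sem, k=0.5):
    """Hypothetical stand-in: zero off-diagonal entries below mean + k*std."""
    n = len(sem)
    if k is None or n <= 2:  # disabled mode and small-vault guard
        return [row[:] for row in sem]
    off = [sem[i][j] for i in range(n) for j in range(n) if i != j]
    cut = fmean(off) + k * pstdev(off)
    return [
        [v if i == j or v >= cut else 0.0 for j, v in enumerate(row)]
        for i, row in enumerate(sem)
    ]


def clustered(size=3, within=0.9, across=0.3):
    """Two clusters of `size` notes on a dense cross-cluster floor."""
    n = 2 * size
    return [
        [1.0 if i == j else (within if i // size == j // size else across) for j in range(n)]
        for i in range(n)
    ]


class TestSemanticThresholdSketch(unittest.TestCase):
    def test_cross_cluster_pruned_within_cluster_survives(self):
        pruned = prune_floor(clustered(), k=0.5)
        self.assertEqual(pruned[0][1], 0.9)  # same cluster: kept
        self.assertEqual(pruned[0][4], 0.0)  # different cluster: zeroed

    def test_disabled_mode_keeps_floor(self):
        sem = clustered()
        self.assertEqual(prune_floor(sem, k=None), sem)

    def test_small_vault_skips_pruning(self):
        sem = [[1.0, 0.1], [0.1, 1.0]]
        self.assertEqual(prune_floor(sem, k=0.5), sem)
```

Saved as a module, this runs with `python -m unittest`.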