Add folder-path signal to community similarity matrix #46
Merged
Embeddings see "infrastructure" as one topic regardless of whether a note lives under work/ or home/, so distinct organisational areas collapsed into a single community (the largest coarse cluster was work+literature+home soup). Blend a fourth channel into _build_similarity_matrix: notes sharing a top-level folder prefix get a uniform similarity bump (PATH_SIGNAL_WEIGHT=0.3, PATH_PREFIX_DEPTH=1). A read-only sweep on a ~490-note vault picked these: depth=1 is the work/home grain (depth=2 reduced cohesion); delta=0.3 lifts folder purity 0.68->0.85 coarse / 0.76->0.93 fine while modularity holds or improves; delta>=0.5 over-purifies and collapses community count. Degrades gracefully: root-level files form no path edges, a flat single-folder vault yields an all-zero signal, weight 0 disables. Tests cover same-folder bump, cross-folder isolation, root-file exclusion, and the disable path.
Problem
Embeddings treat "infrastructure" as one topic whether a note lives under work/ or home/, so distinct organisational areas collapsed into one community. The largest coarse cluster was a work+literature+home soup (258 notes: work 203, literature 26, home 16).
Change
Blend a fourth channel into _build_similarity_matrix: notes sharing a top-level folder prefix get a uniform similarity bump.
PATH_SIGNAL_WEIGHT = 0.3 (0 disables)
PATH_PREFIX_DEPTH = 1 (top-level: work/, home/, research/…)
S = 0.6·semantic + 0.25·cooc + 0.15·links + 0.3·path
Measurement (read-only sweep, ~490-note vault)
Purity = fraction of each community's notes from its dominant top-level folder.
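The purity definition above can be sketched in a few lines. This is a minimal illustration, not the code used for the sweep: `top_level_folder` and `community_purity` are hypothetical names, root-level files are grouped under a made-up `"(root)"` label, and the community is assumed to be non-empty. How the PR aggregates per-community purity into a single coarse or fine figure is not stated, so the sketch stops at one community.

```python
from collections import Counter


def top_level_folder(path):
    """Return the first folder of a note path, or "(root)" for a root-level file."""
    head, sep, _ = path.partition("/")
    return head if sep else "(root)"  # "(root)" is an illustrative label only


def community_purity(note_paths):
    """Fraction of a community's notes that come from its dominant top-level folder."""
    counts = Counter(top_level_folder(p) for p in note_paths)
    dominant_count = counts.most_common(1)[0][1]
    return dominant_count / len(note_paths)


# Three of the four notes sit under work/, so purity is 3/4.
community = ["work/a.md", "work/b.md", "work/infra/c.md", "home/d.md"]
print(community_purity(community))  # 0.75
```

Applied to the figures quoted in the Problem section, the largest coarse cluster before the change has purity 203/258, roughly 0.79.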
home/ weakens reinforcement).
Largest coarse cluster: work:203, literature:26, home:16 → work:211, literature:8, home:0.
Safety
Root-level files (CLAUDE.md etc.) form no path edges; a flat single-folder vault yields an all-zero S_path; weight 0 disables. So it degrades gracefully as a default.
Tests
New TestPathSignal: same-folder bump, cross-folder isolation, root-file exclusion, disable path. Full suite: 566 passed.
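The four behaviours those tests cover can be illustrated with a small pure-Python sketch. It is not the PR's implementation: `folder_prefix`, `path_signal`, and `add_path_channel` are hypothetical names, matrices are plain lists of lists, and zeroing the diagonal is an assumption. Only the two constants and their values come from the PR.

```python
from pathlib import PurePosixPath

PATH_SIGNAL_WEIGHT = 0.3  # value from the PR; 0 disables the channel
PATH_PREFIX_DEPTH = 1     # value from the PR; 1 means top-level folder


def folder_prefix(path, depth=PATH_PREFIX_DEPTH):
    """Return the first `depth` folder components, or None if the file is shallower."""
    folders = PurePosixPath(path).parts[:-1]  # drop the filename
    if len(folders) < depth:
        return None  # e.g. a root-level file at depth=1: no path edges
    return folders[:depth]


def path_signal(paths, depth=PATH_PREFIX_DEPTH):
    """Hypothetical S_path: 1.0 where two distinct notes share a prefix, else 0.0."""
    prefixes = [folder_prefix(p, depth) for p in paths]
    n = len(paths)
    signal = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Assumption: the diagonal stays 0; only pairs of distinct notes get a bump.
            if i != j and prefixes[i] is not None and prefixes[i] == prefixes[j]:
                signal[i][j] = 1.0
    return signal


def add_path_channel(base, paths, weight=PATH_SIGNAL_WEIGHT):
    """Return base + weight * S_path; weight 0 returns an unchanged copy of base."""
    if weight == 0:
        return [row[:] for row in base]
    s_path = path_signal(paths)
    return [
        [b + weight * s for b, s in zip(base_row, path_row)]
        for base_row, path_row in zip(base, s_path)
    ]


paths = ["work/infra/k8s.md", "work/notes.md", "home/infra/nas.md", "CLAUDE.md"]
base = [[0.0] * len(paths) for _ in paths]  # stand-in for the other three channels
bumped = add_path_channel(base, paths)
```

With these inputs, the two `work/` notes get the 0.3 bump, the `work/` and `home/` notes stay at 0 despite both containing an `infra` subfolder, and `CLAUDE.md` forms no path edges with anything.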