
Commit 3361fcd

feat: implement true Leiden probabilistic refinement (backlog #103) (#552)
* docs: add true Leiden refinement phase to backlog (ID 103)

  The current vendored implementation uses greedy refinement, which is functionally Louvain with an extra pass. The paper's randomized refinement (Algorithm 3) is what guarantees well-connected communities — the defining contribution of Leiden over Louvain.

* fix: mark backlog item #103 as breaking, add deterministic seed note (#552)

  The probabilistic Leiden refinement changes community assignments and introduces non-determinism — both qualify as breaking per the column definition. Added a note about using a deterministic seed for CI reproducibility.

* feat: implement true Leiden probabilistic refinement (Algorithm 3)

  Replace greedy best-gain selection in the refinement phase with Boltzmann sampling p(v, C) ∝ exp(ΔH/θ) per Traag et al. 2019, Algorithm 3. This is the defining contribution of Leiden over Louvain — it guarantees γ-connected communities instead of bridge-connected subcommunities. Deterministic via seeded PRNG (mulberry32): the same seed always produces identical community assignments. A new refinementTheta option (default 0.01) controls temperature: lower → more greedy, higher → more exploratory.

  Breaking: community assignments will differ from the prior greedy refinement for any graph where multiple candidates have positive quality gain during the refinement phase.

  Impact: 2 functions changed, 4 affected

* docs: remove backlog #103 — ships in this PR, not a breaking change

* fix: align Leiden refinement with Algorithm 3 (Traag et al. 2019)

  Three corrections to match the paper:

  1. Singleton guard — only nodes still in singleton communities are candidates for merging. Once merged, a node is locked for the remainder of the pass. Essential for the γ-connectedness guarantee.
  2. Single pass — one randomized sweep, not an iterative while-loop. Iterating until convergence is Louvain behavior, not Leiden.
  3. Stay option — the "remain as singleton" choice (ΔH = 0) is included in the Boltzmann distribution, so a node may probabilistically stay alone even when positive-gain merges exist.

  Impact: 1 function changed, 1 affected

* test: add Algorithm 3 conformance tests for Leiden refinement

  Three new tests that would catch deviations from the paper:

  - Stay option: high theta preserves singletons because ΔH = 0 competes in the Boltzmann distribution. Without it, all positive-gain nodes would be forced to merge.
  - Singleton guard: a ring of triangles stays granular across seeds. Without the guard, iterative passes would collapse adjacent triangles.
  - Single pass: refine=true preserves at least as many communities as refine=false on a uniform weak-link graph. Iterative convergence would over-merge.

* feat: post-refinement connectivity split and fix default theta

  Three improvements to complete the robust Leiden implementation:

  1. Default refinementTheta changed from 0.01 to 1.0. The old default made exp(ΔH/0.01) extremely peaked, effectively disabling the probabilistic behavior. θ=1.0 matches the paper's exp(ΔH).
  2. Post-refinement split step: after probabilistic refinement, BFS each community's induced subgraph. If a community has disconnected components, split them into separate community IDs. O(V+E) total. This replaces the expensive per-candidate γ-connectedness check with a cheap post-step using codegraph's existing graph primitives.
  3. New connectivity validation test: across multiple seeds, verify every community is internally connected via BFS on the subgraph. This directly tests the core Leiden guarantee.

  Adds resizeCommunities() to the partition API for the split step.

  Impact: 6 functions changed, 5 affected
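The commit message pins reproducibility on mulberry32, but the PRNG itself sits outside this diff. For reference, mulberry32 is conventionally implemented as below — a sketch of the well-known public-domain routine, not the vendored file:

```javascript
// Mulberry32: a tiny 32-bit seeded PRNG. The same seed always yields the
// same sequence, which is what makes probabilistic refinement CI-reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// Two generators with the same seed march in lockstep.
const rngA = mulberry32(12345);
const rngB = mulberry32(12345);
console.log(rngA() === rngB()); // true
```

Passing the seeded generator everywhere (instead of `Math.random()`) is what lets "probabilistic" and "deterministic" coexist in the commit message.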
1 parent 4908334 commit 3361fcd

File tree

5 files changed: +461 −30 lines

docs/roadmap/BACKLOG.md

Lines changed: 1 addition & 0 deletions

@@ -122,6 +122,7 @@ Community detection will use a vendored Leiden optimiser (PR #545) with full con
 | 101 | Hierarchical community decomposition | Run Leiden at multiple resolution levels (e.g., γ=0.5, 1.0, 2.0) and expose nested community structure — macro-clusters containing sub-clusters. The vendored optimiser already computes multi-level coarsening internally; surface it as `communities --hierarchical` with a tree output showing which fine-grained communities nest inside coarse ones. Store hierarchy in a `community_hierarchy` table or JSON metadata. | Architecture | Single-resolution communities force a choice between broad architectural groups and tight cohesion clusters. Hierarchical decomposition gives both — agents can zoom from "this is the graph subsystem" to "specifically the Leiden algorithm cluster within it" without re-running at different resolutions ||| 3 | No | #545 |
 | 102 | Community-aware impact scoring | Factor community boundaries into `fn-impact` and `diff-impact` risk scoring. Changes that cross community boundaries are architecturally riskier than changes within a single community — they indicate coupling between modules that should be independent. Add `crossCommunityCount` to impact output and weight it in triage risk scoring. A function with blast radius 5 all within one community is lower risk than blast radius 5 spanning 4 communities. | Analysis | Directly improves blast radius accuracy — the core problem codegraph exists to solve. Community-crossing impact is a strong signal for architectural coupling that raw call-chain fan-out doesn't capture ||| 4 | No | #545 |
 
+
 ### Tier 1f — Embeddings leverage (build on existing `embeddings` table)
 
 Symbol embeddings and FTS index are populated via `codegraph embed`. Currently only consumed by the `search` command. The vectors and `cosineSim()` function already exist.

src/graph/algorithms/leiden/index.js

Lines changed: 1 addition & 0 deletions

@@ -23,6 +23,7 @@ import { runLouvainUndirectedModularity } from './optimiser.js';
  * @param {number} [options.maxCommunitySize]
  * @param {Set|Array} [options.fixedNodes]
  * @param {string} [options.candidateStrategy] - 'neighbors' | 'all' | 'random' | 'random-neighbor'
+ * @param {number} [options.refinementTheta=1.0] - Temperature for probabilistic Leiden refinement (Algorithm 3, Traag et al. 2019). Lower → more greedy, higher → more exploratory. Deterministic via seeded PRNG
  * @returns {{ getClass(id): number, getCommunities(): Map, quality(): number, toJSON(): object }}
  *
  * **Note on `quality()`:** For modularity, `quality()` always evaluates at γ=1.0
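The refinementTheta doc comment compresses a lot. A standalone sketch (an illustrative helper, not code from this repo) shows how temperature shifts the sampled choice — including the "stay" option with ΔH = 0 that Algorithm 3 keeps in the distribution:

```javascript
// Sample one choice from p(C) ∝ exp(ΔH / θ). Index 0 is the "stay" option
// (ΔH = 0); index i > 0 is candidate community i - 1. `rand` is a seeded
// PRNG returning floats in [0, 1).
function sampleBoltzmann(gains, theta, rand) {
  // Subtract the max gain before exponentiating for numerical stability.
  const maxGain = Math.max(0, ...gains);
  const weights = [Math.exp((0 - maxGain) / theta)]; // the "stay" weight
  for (const g of gains) weights.push(Math.exp((g - maxGain) / theta));
  const total = weights.reduce((s, w) => s + w, 0);
  let r = rand() * total;
  for (let i = 0; i < weights.length; i++) {
    r -= weights[i];
    if (r < 0) return i;
  }
  return weights.length - 1; // numerical fallback
}
```

At very low θ the max-gain candidate's weight dominates and the sampler degenerates to greedy selection; at very high θ all options (including "stay") approach equal probability, so a node can remain a singleton even when positive-gain merges exist.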

src/graph/algorithms/leiden/optimiser.js

Lines changed: 160 additions & 30 deletions

@@ -167,6 +167,12 @@ export function runLouvainUndirectedModularity(graph, optionsInput = {}) {
       options,
       level === 0 ? fixedNodeMask : null,
     );
+    // Post-refinement: split any disconnected communities into their
+    // connected components. This is the cheap O(V+E) alternative to
+    // checking γ-connectedness on every candidate during refinement.
+    // A disconnected community violates even basic connectivity, so
+    // splitting is always correct.
+    splitDisconnectedCommunities(graphAdapter, refined);
     renumberCommunities(refined, options.preserveLabels);
     effectivePartition = refined;
   }

@@ -229,6 +235,28 @@ function buildCoarseGraph(g, p) {
   return coarse;
 }
 
+/**
+ * True Leiden refinement phase (Algorithm 3, Traag et al. 2019).
+ *
+ * Key properties that distinguish this from Louvain-style refinement:
+ *
+ * 1. **Singleton start** — each node begins in its own community.
+ * 2. **Singleton guard** — only nodes still in singleton communities are
+ *    considered for merging. Once a node joins a non-singleton community
+ *    it is locked for the remainder of the pass. This prevents oscillation
+ *    and is essential for the γ-connectedness guarantee.
+ * 3. **Single pass** — one randomized sweep through all nodes, not an
+ *    iterative loop until convergence (that would be Louvain behavior).
+ * 4. **Probabilistic selection** — candidate communities are sampled from
+ *    a Boltzmann distribution `p(v, C) ∝ exp(ΔH / θ)`, with the "stay
+ *    as singleton" option (ΔH = 0) included in the distribution. This
+ *    means a node may probabilistically choose to remain alone even when
+ *    positive-gain merges exist.
+ *
+ * θ (refinementTheta) controls temperature: lower → more deterministic
+ * (approaches greedy), higher → more exploratory. Determinism is preserved
+ * via the seeded PRNG — same seed produces the same assignments.
+ */
 function refineWithinCoarseCommunities(g, basePart, rng, opts, fixedMask0) {
   const p = makePartition(g);
   p.initializeAggregates();

@@ -237,45 +265,144 @@ function refineWithinCoarseCommunities(g, basePart, rng, opts, fixedMask0) {
   const commMacro = new Int32Array(p.communityCount);
   for (let i = 0; i < p.communityCount; i++) commMacro[i] = macro[i];
 
+  const theta = typeof opts.refinementTheta === 'number' ? opts.refinementTheta : 1.0;
+
+  // Single pass in random order (Algorithm 3, step 2).
   const order = new Int32Array(g.n);
   for (let i = 0; i < g.n; i++) order[i] = i;
-  let improved = true;
-  let passes = 0;
-  while (improved) {
-    improved = false;
-    passes++;
-    shuffleArrayInPlace(order, rng);
-    for (let idx = 0; idx < order.length; idx++) {
-      const v = order[idx];
-      if (fixedMask0?.[v]) continue;
-      const macroV = macro[v];
-      const touchedCount = p.accumulateNeighborCommunityEdgeWeights(v);
-      let bestC = p.nodeCommunity[v];
-      let bestGain = 0;
-      const maxSize = Number.isFinite(opts.maxCommunitySize) ? opts.maxCommunitySize : Infinity;
-      for (let t = 0; t < touchedCount; t++) {
-        const c = p.getCandidateCommunityAt(t);
-        if (commMacro[c] !== macroV) continue;
-        if (maxSize < Infinity) {
-          const nextSize = p.getCommunityTotalSize(c) + g.size[v];
-          if (nextSize > maxSize) continue;
-        }
-        const gain = computeQualityGain(p, v, c, opts);
-        if (gain > bestGain) {
-          bestGain = gain;
-          bestC = c;
-        }
-      }
-      if (bestC !== p.nodeCommunity[v] && bestGain > GAIN_EPSILON) {
-        p.moveNodeToCommunity(v, bestC);
-        improved = true;
-      }
-    }
-    if (passes >= opts.maxLocalPasses) break;
+  shuffleArrayInPlace(order, rng);
+
+  for (let idx = 0; idx < order.length; idx++) {
+    const v = order[idx];
+    if (fixedMask0?.[v]) continue;
+
+    // Singleton guard: only move nodes still alone in their community.
+    if (p.getCommunityNodeCount(p.nodeCommunity[v]) > 1) continue;
+
+    const macroV = macro[v];
+    const touchedCount = p.accumulateNeighborCommunityEdgeWeights(v);
+    const maxSize = Number.isFinite(opts.maxCommunitySize) ? opts.maxCommunitySize : Infinity;
+
+    // Collect eligible communities and their quality gains.
+    const candidates = [];
+    for (let t = 0; t < touchedCount; t++) {
+      const c = p.getCandidateCommunityAt(t);
+      if (c === p.nodeCommunity[v]) continue;
+      if (commMacro[c] !== macroV) continue;
+      if (maxSize < Infinity) {
+        const nextSize = p.getCommunityTotalSize(c) + g.size[v];
+        if (nextSize > maxSize) continue;
+      }
+      const gain = computeQualityGain(p, v, c, opts);
+      if (gain > GAIN_EPSILON) {
+        candidates.push({ c, gain });
+      }
+    }
+
+    if (candidates.length === 0) continue;
+
+    // Probabilistic selection: p(v, C) ∝ exp(ΔH / θ), with the "stay"
+    // option (ΔH = 0) included per Algorithm 3.
+    // For numerical stability, subtract the max gain before exponentiation.
+    const maxGain = candidates.reduce((m, x) => (x.gain > m ? x.gain : m), 0);
+    // "Stay as singleton" weight: exp((0 - maxGain) / theta)
+    const stayWeight = Math.exp((0 - maxGain) / theta);
+    let totalWeight = stayWeight;
+    for (let i = 0; i < candidates.length; i++) {
+      candidates[i].weight = Math.exp((candidates[i].gain - maxGain) / theta);
+      totalWeight += candidates[i].weight;
+    }
+
+    const r = rng() * totalWeight;
+    if (r < stayWeight) continue; // node stays as singleton
+
+    let cumulative = stayWeight;
+    let chosenC = candidates[candidates.length - 1].c; // fallback
+    for (let i = 0; i < candidates.length; i++) {
+      cumulative += candidates[i].weight;
+      if (r < cumulative) {
+        chosenC = candidates[i].c;
+        break;
+      }
+    }
+
+    p.moveNodeToCommunity(v, chosenC);
   }
   return p;
 }
 
+/**
+ * Post-refinement connectivity check. For each community, run a BFS on
+ * the subgraph induced by its members (using the adapter's outEdges).
+ * If a community has multiple connected components, assign secondary
+ * components to new community IDs, then reinitialize aggregates once.
+ *
+ * O(V+E) total since communities partition V.
+ *
+ * This replaces the per-candidate γ-connectedness check from the paper
+ * with a cheaper post-step that catches the most important violation
+ * (disconnected subcommunities).
+ */
+function splitDisconnectedCommunities(g, partition) {
+  const n = g.n;
+  const nc = partition.nodeCommunity;
+  const members = partition.getCommunityMembers();
+  let nextC = partition.communityCount;
+  let didSplit = false;
+
+  const visited = new Uint8Array(n);
+  const inCommunity = new Uint8Array(n);
+
+  for (let c = 0; c < members.length; c++) {
+    const nodes = members[c];
+    if (nodes.length <= 1) continue;
+
+    for (let i = 0; i < nodes.length; i++) inCommunity[nodes[i]] = 1;
+
+    let componentCount = 0;
+    for (let i = 0; i < nodes.length; i++) {
+      const start = nodes[i];
+      if (visited[start]) continue;
+      componentCount++;
+
+      // BFS within the community subgraph.
+      const queue = [start];
+      visited[start] = 1;
+      let head = 0;
+      while (head < queue.length) {
+        const v = queue[head++];
+        const edges = g.outEdges[v];
+        for (let k = 0; k < edges.length; k++) {
+          const w = edges[k].to;
+          if (inCommunity[w] && !visited[w]) {
+            visited[w] = 1;
+            queue.push(w);
+          }
+        }
+      }
+
+      if (componentCount > 1) {
+        // Secondary component — assign new community ID directly.
+        const newC = nextC++;
+        for (let q = 0; q < queue.length; q++) nc[queue[q]] = newC;
+        didSplit = true;
+      }
+    }
+
+    for (let i = 0; i < nodes.length; i++) {
+      inCommunity[nodes[i]] = 0;
+      visited[nodes[i]] = 0;
+    }
+  }
+
+  if (didSplit) {
+    // Grow the partition's typed arrays to accommodate new community IDs,
+    // then recompute all aggregates from the updated nodeCommunity array.
+    partition.resizeCommunities(nextC);
+    partition.initializeAggregates();
+  }
+}
+
 function computeQualityGain(partition, v, c, opts) {
   const quality = (opts.quality || 'modularity').toLowerCase();
   const gamma = typeof opts.resolution === 'number' ? opts.resolution : 1.0;

@@ -329,6 +456,8 @@ function normalizeOptions(options = {}) {
   const maxCommunitySize = Number.isFinite(options.maxCommunitySize)
     ? options.maxCommunitySize
     : Infinity;
+  const refinementTheta =
+    typeof options.refinementTheta === 'number' ? options.refinementTheta : 1.0;
   return {
     directed,
     randomSeed,

@@ -341,6 +470,7 @@ function normalizeOptions(options = {}) {
     refine,
     preserveLabels,
     maxCommunitySize,
+    refinementTheta,
     fixedNodes: options.fixedNodes,
   };
 }
src/graph/algorithms/leiden/partition.js

Lines changed: 4 additions & 0 deletions

@@ -373,6 +373,10 @@ export function makePartition(graph) {
     get communityTotalInStrength() {
       return communityTotalInStrength;
     },
+    resizeCommunities(newCount) {
+      ensureCommCapacity(newCount);
+      communityCount = newCount;
+    },
     initializeAggregates,
     accumulateNeighborCommunityEdgeWeights,
     getCandidateCommunityCount: () => candidateCommunityCount,
