
Commit 3361fcd

feat: implement true Leiden probabilistic refinement (backlog #103) (#552)
* docs: add true Leiden refinement phase to backlog (ID 103)

  The current vendored implementation uses greedy refinement, which is functionally Louvain with an extra pass. The paper's randomized refinement (Algorithm 3) is what guarantees well-connected communities — the defining contribution of Leiden over Louvain.

* fix: mark backlog item #103 as breaking, add deterministic seed note (#552)

  The probabilistic Leiden refinement changes community assignments and introduces non-determinism — both qualify as breaking per the column definition. Added a note about using a deterministic seed for CI reproducibility.

* feat: implement true Leiden probabilistic refinement (Algorithm 3)

  Replace greedy best-gain selection in the refinement phase with Boltzmann sampling p(v, C) ∝ exp(ΔH/θ) per Traag et al. 2019, Algorithm 3. This is the defining contribution of Leiden over Louvain — it guarantees γ-connected communities instead of bridge-connected subcommunities. Deterministic via seeded PRNG (mulberry32): the same seed always produces identical community assignments. A new refinementTheta option (default 0.01) controls temperature: lower → more greedy, higher → more exploratory.

  Breaking: community assignments will differ from the prior greedy refinement for any graph where multiple candidates have positive quality gain during the refinement phase.

  Impact: 2 functions changed, 4 affected

* docs: remove backlog #103 — ships in this PR, not a breaking change

* fix: align Leiden refinement with Algorithm 3 (Traag et al. 2019)

  Three corrections to match the paper:

  1. Singleton guard — only nodes still in singleton communities are candidates for merging. Once merged, a node is locked for the remainder of the pass. Essential for the γ-connectedness guarantee.
  2. Single pass — one randomized sweep, not an iterative while-loop. Iterating until convergence is Louvain behavior, not Leiden.
  3. Stay option — the "remain as singleton" choice (ΔH = 0) is included in the Boltzmann distribution, so a node may probabilistically stay alone even when positive-gain merges exist.

  Impact: 1 function changed, 1 affected

* test: add Algorithm 3 conformance tests for Leiden refinement

  Three new tests that would catch deviations from the paper:

  - Stay option: high theta preserves singletons because ΔH = 0 competes in the Boltzmann distribution. Without it, all positive-gain nodes would be forced to merge.
  - Singleton guard: a ring of triangles stays granular across seeds. Without the guard, iterative passes would collapse adjacent triangles.
  - Single pass: refine=true preserves at least as many communities as refine=false on a uniform weak-link graph. Iterative convergence would over-merge.

* feat: post-refinement connectivity split and fix default theta

  Three improvements to complete the robust Leiden implementation:

  1. Default refinementTheta changed from 0.01 to 1.0. The old default made exp(ΔH/0.01) extremely peaked, effectively disabling the probabilistic behavior. θ=1.0 matches the paper's exp(ΔH).
  2. Post-refinement split step: after probabilistic refinement, BFS each community's induced subgraph. If a community has disconnected components, split them into separate community IDs. O(V+E) total. This replaces the expensive per-candidate γ-connectedness check with a cheap post-step using codegraph's existing graph primitives.
  3. New connectivity validation test: across multiple seeds, verify every community is internally connected via BFS on the subgraph. This directly tests the core Leiden guarantee.

  Adds resizeCommunities() to the partition API for the split step.

  Impact: 6 functions changed, 5 affected
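The commit message pins reproducibility on mulberry32, but the PRNG itself sits outside this diff. For reference, mulberry32 is conventionally implemented as below — a sketch of the well-known public-domain routine, not the vendored file:

```javascript
// Mulberry32: a tiny 32-bit seeded PRNG. The same seed always yields the
// same sequence, which is what makes probabilistic refinement CI-reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// Two generators with the same seed march in lockstep.
const rngA = mulberry32(12345);
const rngB = mulberry32(12345);
console.log(rngA() === rngB()); // true
```

Passing the seeded generator everywhere (instead of `Math.random()`) is what lets "probabilistic" and "deterministic" coexist in the commit message.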
1 parent 4908334 commit 3361fcd

File tree

5 files changed: +461 −30 lines

docs/roadmap/BACKLOG.md

Lines changed: 1 addition & 0 deletions

@@ -122,6 +122,7 @@ Community detection will use a vendored Leiden optimiser (PR #545) with full con
 | 101 | Hierarchical community decomposition | Run Leiden at multiple resolution levels (e.g., γ=0.5, 1.0, 2.0) and expose nested community structure — macro-clusters containing sub-clusters. The vendored optimiser already computes multi-level coarsening internally; surface it as `communities --hierarchical` with a tree output showing which fine-grained communities nest inside coarse ones. Store hierarchy in a `community_hierarchy` table or JSON metadata. | Architecture | Single-resolution communities force a choice between broad architectural groups and tight cohesion clusters. Hierarchical decomposition gives both — agents can zoom from "this is the graph subsystem" to "specifically the Leiden algorithm cluster within it" without re-running at different resolutions ||| 3 | No | #545 |
 | 102 | Community-aware impact scoring | Factor community boundaries into `fn-impact` and `diff-impact` risk scoring. Changes that cross community boundaries are architecturally riskier than changes within a single community — they indicate coupling between modules that should be independent. Add `crossCommunityCount` to impact output and weight it in triage risk scoring. A function with blast radius 5 all within one community is lower risk than blast radius 5 spanning 4 communities. | Analysis | Directly improves blast radius accuracy — the core problem codegraph exists to solve. Community-crossing impact is a strong signal for architectural coupling that raw call-chain fan-out doesn't capture ||| 4 | No | #545 |
 
+
 ### Tier 1f — Embeddings leverage (build on existing `embeddings` table)
 
 Symbol embeddings and FTS index are populated via `codegraph embed`. Currently only consumed by the `search` command. The vectors and `cosineSim()` function already exist.

src/graph/algorithms/leiden/index.js

Lines changed: 1 addition & 0 deletions

@@ -23,6 +23,7 @@ import { runLouvainUndirectedModularity } from './optimiser.js';
  * @param {number} [options.maxCommunitySize]
  * @param {Set|Array} [options.fixedNodes]
  * @param {string} [options.candidateStrategy] - 'neighbors' | 'all' | 'random' | 'random-neighbor'
+ * @param {number} [options.refinementTheta=1.0] - Temperature for probabilistic Leiden refinement (Algorithm 3, Traag et al. 2019). Lower → more greedy, higher → more exploratory. Deterministic via seeded PRNG
  * @returns {{ getClass(id): number, getCommunities(): Map, quality(): number, toJSON(): object }}
  *
  * **Note on `quality()`:** For modularity, `quality()` always evaluates at γ=1.0
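The refinementTheta doc comment compresses a lot. A standalone sketch (an illustrative helper, not code from this repo) shows how temperature shifts the sampled choice — including the "stay" option with ΔH = 0 that Algorithm 3 keeps in the distribution:

```javascript
// Sample one choice from p(C) ∝ exp(ΔH / θ). Index 0 is the "stay" option
// (ΔH = 0); index i > 0 is candidate community i - 1. `rand` is a seeded
// PRNG returning floats in [0, 1).
function sampleBoltzmann(gains, theta, rand) {
  // Subtract the max gain before exponentiating for numerical stability.
  const maxGain = Math.max(0, ...gains);
  const weights = [Math.exp((0 - maxGain) / theta)]; // the "stay" weight
  for (const g of gains) weights.push(Math.exp((g - maxGain) / theta));
  const total = weights.reduce((s, w) => s + w, 0);
  let r = rand() * total;
  for (let i = 0; i < weights.length; i++) {
    r -= weights[i];
    if (r < 0) return i;
  }
  return weights.length - 1; // numerical fallback
}
```

At very low θ the max-gain candidate's weight dominates and the sampler degenerates to greedy selection; at very high θ all options (including "stay") approach equal probability, so a node can remain a singleton even when positive-gain merges exist.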

src/graph/algorithms/leiden/optimiser.js

Lines changed: 160 additions & 30 deletions

@@ -167,6 +167,12 @@ export function runLouvainUndirectedModularity(graph, optionsInput = {}) {
       options,
       level === 0 ? fixedNodeMask : null,
     );
+    // Post-refinement: split any disconnected communities into their
+    // connected components. This is the cheap O(V+E) alternative to
+    // checking γ-connectedness on every candidate during refinement.
+    // A disconnected community violates even basic connectivity, so
+    // splitting is always correct.
+    splitDisconnectedCommunities(graphAdapter, refined);
     renumberCommunities(refined, options.preserveLabels);
     effectivePartition = refined;
   }

@@ -229,6 +235,28 @@ function buildCoarseGraph(g, p) {
   return coarse;
 }
 
+/**
+ * True Leiden refinement phase (Algorithm 3, Traag et al. 2019).
+ *
+ * Key properties that distinguish this from Louvain-style refinement:
+ *
+ * 1. **Singleton start** — each node begins in its own community.
+ * 2. **Singleton guard** — only nodes still in singleton communities are
+ *    considered for merging. Once a node joins a non-singleton community
+ *    it is locked for the remainder of the pass. This prevents oscillation
+ *    and is essential for the γ-connectedness guarantee.
+ * 3. **Single pass** — one randomized sweep through all nodes, not an
+ *    iterative loop until convergence (that would be Louvain behavior).
+ * 4. **Probabilistic selection** — candidate communities are sampled from
+ *    a Boltzmann distribution `p(v, C) ∝ exp(ΔH / θ)`, with the "stay
+ *    as singleton" option (ΔH = 0) included in the distribution. This
+ *    means a node may probabilistically choose to remain alone even when
+ *    positive-gain merges exist.
+ *
+ * θ (refinementTheta) controls temperature: lower → more deterministic
+ * (approaches greedy), higher → more exploratory. Determinism is preserved
+ * via the seeded PRNG — same seed produces the same assignments.
+ */
 function refineWithinCoarseCommunities(g, basePart, rng, opts, fixedMask0) {
   const p = makePartition(g);
   p.initializeAggregates();

@@ -237,45 +265,144 @@ function refineWithinCoarseCommunities(g, basePart, rng, opts, fixedMask0) {
   const commMacro = new Int32Array(p.communityCount);
   for (let i = 0; i < p.communityCount; i++) commMacro[i] = macro[i];
 
+  const theta = typeof opts.refinementTheta === 'number' ? opts.refinementTheta : 1.0;
+
+  // Single pass in random order (Algorithm 3, step 2).
   const order = new Int32Array(g.n);
   for (let i = 0; i < g.n; i++) order[i] = i;
-  let improved = true;
-  let passes = 0;
-  while (improved) {
-    improved = false;
-    passes++;
-    shuffleArrayInPlace(order, rng);
-    for (let idx = 0; idx < order.length; idx++) {
-      const v = order[idx];
-      if (fixedMask0?.[v]) continue;
-      const macroV = macro[v];
-      const touchedCount = p.accumulateNeighborCommunityEdgeWeights(v);
-      let bestC = p.nodeCommunity[v];
-      let bestGain = 0;
-      const maxSize = Number.isFinite(opts.maxCommunitySize) ? opts.maxCommunitySize : Infinity;
-      for (let t = 0; t < touchedCount; t++) {
-        const c = p.getCandidateCommunityAt(t);
-        if (commMacro[c] !== macroV) continue;
-        if (maxSize < Infinity) {
-          const nextSize = p.getCommunityTotalSize(c) + g.size[v];
-          if (nextSize > maxSize) continue;
-        }
-        const gain = computeQualityGain(p, v, c, opts);
-        if (gain > bestGain) {
-          bestGain = gain;
-          bestC = c;
-        }
-      }
-      if (bestC !== p.nodeCommunity[v] && bestGain > GAIN_EPSILON) {
-        p.moveNodeToCommunity(v, bestC);
-        improved = true;
-      }
-    }
-    if (passes >= opts.maxLocalPasses) break;
+  shuffleArrayInPlace(order, rng);
+
+  for (let idx = 0; idx < order.length; idx++) {
+    const v = order[idx];
+    if (fixedMask0?.[v]) continue;
+
+    // Singleton guard: only move nodes still alone in their community.
+    if (p.getCommunityNodeCount(p.nodeCommunity[v]) > 1) continue;
+
+    const macroV = macro[v];
+    const touchedCount = p.accumulateNeighborCommunityEdgeWeights(v);
+    const maxSize = Number.isFinite(opts.maxCommunitySize) ? opts.maxCommunitySize : Infinity;
+
+    // Collect eligible communities and their quality gains.
+    const candidates = [];
+    for (let t = 0; t < touchedCount; t++) {
+      const c = p.getCandidateCommunityAt(t);
+      if (c === p.nodeCommunity[v]) continue;
+      if (commMacro[c] !== macroV) continue;
+      if (maxSize < Infinity) {
+        const nextSize = p.getCommunityTotalSize(c) + g.size[v];
+        if (nextSize > maxSize) continue;
+      }
+      const gain = computeQualityGain(p, v, c, opts);
+      if (gain > GAIN_EPSILON) {
+        candidates.push({ c, gain });
+      }
+    }
+
+    if (candidates.length === 0) continue;
+
+    // Probabilistic selection: p(v, C) ∝ exp(ΔH / θ), with the "stay"
+    // option (ΔH = 0) included per Algorithm 3.
+    // For numerical stability, subtract the max gain before exponentiation.
+    const maxGain = candidates.reduce((m, x) => (x.gain > m ? x.gain : m), 0);
+    // "Stay as singleton" weight: exp((0 - maxGain) / theta)
+    const stayWeight = Math.exp((0 - maxGain) / theta);
+    let totalWeight = stayWeight;
+    for (let i = 0; i < candidates.length; i++) {
+      candidates[i].weight = Math.exp((candidates[i].gain - maxGain) / theta);
+      totalWeight += candidates[i].weight;
+    }
+
+    const r = rng() * totalWeight;
+    if (r < stayWeight) continue; // node stays as singleton
+
+    let cumulative = stayWeight;
+    let chosenC = candidates[candidates.length - 1].c; // fallback
+    for (let i = 0; i < candidates.length; i++) {
+      cumulative += candidates[i].weight;
+      if (r < cumulative) {
+        chosenC = candidates[i].c;
+        break;
+      }
+    }
+
+    p.moveNodeToCommunity(v, chosenC);
   }
   return p;
 }
 
+/**
+ * Post-refinement connectivity check. For each community, run a BFS on
+ * the subgraph induced by its members (using the adapter's outEdges).
+ * If a community has multiple connected components, assign secondary
+ * components to new community IDs, then reinitialize aggregates once.
+ *
+ * O(V+E) total since communities partition V.
+ *
+ * This replaces the per-candidate γ-connectedness check from the paper
+ * with a cheaper post-step that catches the most important violation
+ * (disconnected subcommunities).
+ */
+function splitDisconnectedCommunities(g, partition) {
+  const n = g.n;
+  const nc = partition.nodeCommunity;
+  const members = partition.getCommunityMembers();
+  let nextC = partition.communityCount;
+  let didSplit = false;
+
+  const visited = new Uint8Array(n);
+  const inCommunity = new Uint8Array(n);
+
+  for (let c = 0; c < members.length; c++) {
+    const nodes = members[c];
+    if (nodes.length <= 1) continue;
+
+    for (let i = 0; i < nodes.length; i++) inCommunity[nodes[i]] = 1;
+
+    let componentCount = 0;
+    for (let i = 0; i < nodes.length; i++) {
+      const start = nodes[i];
+      if (visited[start]) continue;
+      componentCount++;
+
+      // BFS within the community subgraph.
+      const queue = [start];
+      visited[start] = 1;
+      let head = 0;
+      while (head < queue.length) {
+        const v = queue[head++];
+        const edges = g.outEdges[v];
+        for (let k = 0; k < edges.length; k++) {
+          const w = edges[k].to;
+          if (inCommunity[w] && !visited[w]) {
+            visited[w] = 1;
+            queue.push(w);
+          }
+        }
+      }
+
+      if (componentCount > 1) {
+        // Secondary component — assign new community ID directly.
+        const newC = nextC++;
+        for (let q = 0; q < queue.length; q++) nc[queue[q]] = newC;
+        didSplit = true;
+      }
+    }
+
+    for (let i = 0; i < nodes.length; i++) {
+      inCommunity[nodes[i]] = 0;
+      visited[nodes[i]] = 0;
+    }
+  }
+
+  if (didSplit) {
+    // Grow the partition's typed arrays to accommodate new community IDs,
+    // then recompute all aggregates from the updated nodeCommunity array.
+    partition.resizeCommunities(nextC);
+    partition.initializeAggregates();
+  }
+}
+
 function computeQualityGain(partition, v, c, opts) {
   const quality = (opts.quality || 'modularity').toLowerCase();
   const gamma = typeof opts.resolution === 'number' ? opts.resolution : 1.0;

@@ -329,6 +456,8 @@ function normalizeOptions(options = {}) {
   const maxCommunitySize = Number.isFinite(options.maxCommunitySize)
     ? options.maxCommunitySize
     : Infinity;
+  const refinementTheta =
+    typeof options.refinementTheta === 'number' ? options.refinementTheta : 1.0;
   return {
     directed,
     randomSeed,

@@ -341,6 +470,7 @@ function normalizeOptions(options = {}) {
     refine,
     preserveLabels,
     maxCommunitySize,
+    refinementTheta,
     fixedNodes: options.fixedNodes,
   };
 }
src/graph/algorithms/leiden/partition.js

Lines changed: 4 additions & 0 deletions

@@ -373,6 +373,10 @@ export function makePartition(graph) {
     get communityTotalInStrength() {
       return communityTotalInStrength;
     },
+    resizeCommunities(newCount) {
+      ensureCommCapacity(newCount);
+      communityCount = newCount;
+    },
     initializeAggregates,
     accumulateNeighborCommunityEdgeWeights,
     getCandidateCommunityCount: () => candidateCommunityCount,
