feat: DBSCAN cluster coloring for embedding landscape #341

Merged
aaronsb merged 2 commits into main from feat/dbscan-cluster-coloring
Feb 19, 2026
aaronsb commented Feb 19, 2026

Summary

  • Run DBSCAN on projected coordinates server-side to detect spatial clusters automatically
  • Auto-tune eps via 40th percentile k-NN distances for balanced clusters (no single cluster dominates)
  • Derive cluster names from concept labels using TF-IDF scoring (frequent-in-cluster, rare-across-others)
  • Add "By Cluster" color mode with 3 switchable palettes (Bold, Warm→Cool, Earth)
  • Sortable legend by name, count, or palette order with click-to-toggle highlight/dim
  • Poll job status after regeneration so the UI refreshes reliably for large datasets
  • Move node info from left-click to right-click context menu (left click = pan/rotate only)
  • Updated screenshots and feature docs

Test plan

  • Regenerate projection from web UI, verify cluster_id and cluster names appear in API response
  • Switch to "By Cluster" color mode — verify bold colored regions with named clusters in legend
  • Click legend entries to highlight/dim clusters, verify "Show all" clears filter
  • Switch palettes (Bold, Warm→Cool, Earth) — colors update instantly
  • Sort by name/count/color in legend header
  • Right-click a point — context menu shows label, grounding, cluster name, and examine actions
  • Left click only pans/rotates, does not select points
  • Switch to other color modes (ontology, grounding, position) — still work unchanged
  • tsc -b passes with no errors

Run DBSCAN on t-SNE/UMAP projected coordinates server-side to identify
spatial clusters, then color them in the web UI like a political map.

Server:
- Add _compute_clusters() with auto-tuned eps (40th percentile k-NN)
- Add _name_clusters() using TF-IDF scoring of concept labels
- Emit cluster_id per point and cluster stats/names in projection response

Client:
- Add "By Cluster" color mode with 3 palettes (Bold, Warm→Cool, Earth)
- Sortable legend (by name, count, or palette order) with cluster toggles
- Highlight/dim clusters by clicking legend entries
- Poll job status after regeneration so UI refreshes reliably
- Move info panel from left-click to right-click context menu
- Left click reserved for pan/rotate only

Screenshots and docs updated.
aaronsb commented Feb 19, 2026

Code Review -- PR #341: DBSCAN Cluster Coloring

Scope: 447 additions, 61 deletions across 7 source files (standard tier). Backend clustering + naming logic in Python, frontend cluster visualization in React/TypeScript, documentation and screenshots.

The overall feature is well-structured: server-side clustering keeps the frontend thin, TF-IDF naming is a smart approach, and the auto-tuned eps via k-NN percentile is a solid heuristic. Good work separating _compute_clusters and _name_clusters as distinct private methods.


Important

1. Job polling reimplements pollJobUntilComplete -- use the existing utility

Location: web/src/components/embeddings/EmbeddingLandscapeWorkspace.tsx (lines in the handleRegenerate callback, around the result.status === 'queued' block)

The apiClient already has pollJobUntilComplete() (client.ts:663-683) which handles completed/failed/cancelled states, configurable interval, and progress callbacks. The hand-rolled polling loop here:

  • Silently exits after 60 iterations without reporting timeout to the user
  • Does not handle cancelled job status
  • Hardcodes 1-second interval vs the utility's 2-second default

Suggestion: Replace the manual loop with:

if (result.status === 'queued' && result.job_id) {
  const finalJob = await apiClient.pollJobUntilComplete(result.job_id, {
    intervalMs: 1000,
  });
  if (finalJob.status === 'failed') {
    setError('Projection job failed');
    return;
  }
  if (finalJob.status === 'cancelled') {
    setError('Projection job was cancelled');
    return;
  }
}

Note that pollJobUntilComplete has no timeout either (it loops forever), so you may still want to add an AbortController-style timeout wrapper. But at minimum, reusing the existing utility avoids divergent polling implementations.


2. _name_clusters falls back to TF-only ranking by accident when num_clusters == 1

Location: api/app/services/embedding_projection_service.py, _name_clusters method, around the TF-IDF scoring loop

When there is exactly one cluster, every term has doc_freq[w] == 1 and num_clusters == 1, so the IDF expression would evaluate to zero:

idf = math.log(num_clusters / doc_freq[w])  # math.log(1/1) = 0.0

That expression is never actually reached, because the guard doc_freq[w] < num_clusters is False (1 < 1 is False) and the else branch assigns idf = 0.1 to every term. Nothing crashes, and there is no division by zero, since doc_freq[w] is at least 1 for any term that occurs. The effect is that every term gets the same IDF (0.1), so ranking is purely by term frequency. That is reasonable behavior for a single cluster, but it is accidental rather than intentional. Worth a comment or a dedicated if num_clusters == 1 branch that picks top-TF terms explicitly.
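A minimal sketch of what an explicit single-cluster branch could look like, using only the standard library. The function name name_clusters, its labels_by_cluster argument, the top_n parameter, and the shortened stop-word list are all hypothetical stand-ins; only the 0.1 fallback and the doc_freq[w] < num_clusters guard come from the code under review.

```python
import math
from collections import Counter

# Hypothetical, abbreviated stop-word list; the PR's _STOP_WORDS is larger.
STOP_WORDS = frozenset({"a", "an", "and", "for", "in", "of", "the", "to"})


def name_clusters(labels_by_cluster, top_n=2):
    """Return {cluster_id: name} built from the highest-scoring label terms."""
    # Term frequency per cluster, ignoring stop words.
    tf = {
        cid: Counter(
            word
            for label in labels
            for word in label.lower().split()
            if word not in STOP_WORDS
        )
        for cid, labels in labels_by_cluster.items()
    }
    num_clusters = len(tf)
    # Document frequency: in how many clusters does each term appear?
    doc_freq = Counter(word for counts in tf.values() for word in counts)

    def idf(word):
        # Terms present in every cluster keep a small nonzero weight.
        if doc_freq[word] < num_clusters:
            return math.log(num_clusters / doc_freq[word])
        return 0.1

    names = {}
    for cid, counts in tf.items():
        if num_clusters <= 1:
            # Single cluster: IDF carries no information, so rank by TF alone.
            scores = dict(counts)
        else:
            scores = {word: count * idf(word) for word, count in counts.items()}
        # Sort by score (descending), then alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        names[cid] = " ".join(ranked[:top_n]) or f"cluster {cid}"
    return names
```

With the branch written out, the single-cluster behavior is a documented decision rather than a side effect of the fallback constant, and it is easy to pin down in a unit test.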


3. onSelectPoint type signature is now misleading

Location: web/src/components/embeddings/EmbeddingScatter3D.tsx:115

The Props interface still declares:

onSelectPoint: (point: EmbeddingPoint | null, screenPos?: { x: number; y: number }) => void;

But after this PR, handleClick only ever calls onSelectPoint(null) -- the screenPos parameter is dead. The caller in EmbeddingLandscapeWorkspace.tsx passes (point) => { setSelectedConcept(point); } ignoring screenPos. The type should be updated to (point: EmbeddingPoint | null) => void to reflect the actual contract, or if future use is planned, leave it but add a comment.


Minor

4. Cluster legend JSX is ~120 lines of inline rendering -- consider extraction

Location: web/src/components/embeddings/EmbeddingLandscapeWorkspace.tsx (the {colorScheme === 'cluster' && (...)} block)

EmbeddingLandscapeWorkspace.tsx is already 1178 lines (priority file per code quality standards). The cluster legend block adds substantial JSX with its own sorting, filtering, and interaction logic. Extracting a <ClusterLegend> component would:

  • Reduce the workspace file size
  • Make the legend independently testable
  • Follow Single Responsibility -- the workspace orchestrates, the legend renders cluster UI

Not blocking, but this file will keep growing as features are added.


5. Python type annotations: cluster_sizes uses int keys internally, Dict[str, int] in Pydantic

Location: api/app/services/embedding_projection_service.py:743-745 vs api/app/routes/projection.py:65

_compute_clusters builds cluster_sizes with int keys, and _name_clusters returns Dict[int, str]. The Pydantic response models declare Dict[str, int] and Dict[str, str]. This works because JSON serialization converts int keys to strings, and Pydantic coerces on the way out. But it is a type lie in the service layer -- the internal dict has int keys while the docstring and response model imply str. Either: (a) use str(label) as keys in the service to match the declared types, or (b) annotate the Pydantic models as Dict[int, int] (Pydantic v2 handles int-key dicts in JSON). Not a bug, but a maintenance trap for anyone reading the service code and trusting the types.
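To make the mismatch concrete, here is a small standard-library sketch of option (a). The helper to_str_keys and the sample values are hypothetical; the only behavior relied on is that json.dumps turns int keys into strings.

```python
import json


def to_str_keys(mapping):
    """Convert cluster-id keys to str so the dict matches a Dict[str, ...] annotation."""
    return {str(key): value for key, value in mapping.items()}


# Hypothetical values; -1 is the label DBSCAN uses for noise points.
cluster_sizes = {0: 12, 1: 7, -1: 3}

# json.dumps coerces int keys to strings, so a round trip changes the key type.
round_tripped = json.loads(json.dumps(cluster_sizes))

# Lookups by int key fail after the round trip ...
assert 0 in cluster_sizes and 0 not in round_tripped
# ... while converting explicitly up front yields the same dict on both sides.
assert to_str_keys(cluster_sizes) == round_tripped
```

Converting once at the service boundary means the service's return annotations, the Pydantic models, and the JSON the client sees all agree on the key type.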


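6. np.ptp: only the ndarray.ptp() method was removed in NumPy 2.0

Location: api/app/services/embedding_projection_service.py, inside _compute_clusters:

data_range = float(np.max(np.ptp(projection, axis=0)))

NumPy 2.0 (released June 2024) removed the ndarray.ptp() method; the np.ptp() function used here is still available, so this line keeps working. If you would rather not depend on it, the explicit form is equivalent: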

data_range = float(np.max(np.max(projection, axis=0) - np.min(projection, axis=0)))

Nit

7. import math and from collections import Counter are inside _name_clusters

These are stdlib imports. Convention in this codebase (and PEP 8) is top-of-file imports. Lazy imports are appropriate for heavy optional dependencies (like the sklearn guard pattern used above), but math and Counter are not heavy. Moving them to the top avoids repeated import overhead on each call and follows the pattern used everywhere else in this file.

8. Stop words list could be a module-level constant

_STOP_WORDS is defined as a class attribute (frozenset), which is fine. Just noting it is a large literal that could also live in a shared utility if other NLP-adjacent code in the project needs it. Not actionable now, but flagging for future awareness.


What looks good

  • Auto-tuning eps via k-NN percentile is a well-known heuristic (the "elbow" method variant). Using the 40th percentile with a floor at 1% of data range is a pragmatic choice that avoids degenerate clusters.
  • Server-side clustering keeps the frontend simple -- cluster IDs travel as plain integers, all the heavy math stays in Python/numpy.
  • Graceful degradation when DBSCAN is unavailable or dataset is too small -- returns all-noise labels, frontend shows "No clusters" message.
  • TF-IDF naming is a genuinely useful feature for making clusters interpretable without manual labeling.
  • Context menu consolidation (moving info from left-click to right-click) simplifies the interaction model and removes the NodeInfoBox dependency.
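The eps heuristic praised above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the PR's _compute_clusters: the function name auto_eps, the default k=5, and the brute-force neighbor search are hypothetical, while the 40th percentile, the floor at 1% of the data range, and the 1e-6 absolute floor (added in the follow-up commit for identical points) come from this thread.

```python
import math


def auto_eps(points, k=5, percentile=40.0, range_floor=0.01, abs_floor=1e-6):
    """Pick a DBSCAN eps from the distribution of k-th nearest-neighbor distances."""
    n = len(points)
    if n <= k:
        # Too few points to have a k-th neighbor; the caller can treat all as noise.
        return None

    # Distance from each point to its k-th nearest neighbor (brute force, O(n^2)).
    kth = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        kth.append(dists[k - 1])
    kth.sort()

    # Linearly interpolated percentile of the k-NN distances.
    pos = (percentile / 100.0) * (n - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    eps = kth[lo] + (kth[hi] - kth[lo]) * (pos - lo)

    # Floor at a fraction of the largest per-axis extent, then at an absolute
    # minimum so that identical points never produce eps == 0.
    data_range = max(max(axis) - min(axis) for axis in zip(*points))
    return max(eps, range_floor * data_range, abs_floor)
```

A lower percentile gives a tighter eps and therefore more, smaller clusters; the two floors only matter for degenerate inputs such as heavily duplicated coordinates.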

Testing gap

There are no tests for _compute_clusters or _name_clusters. Both methods have interesting edge cases worth covering:

  • Empty projection array
  • All points in one cluster
  • Single-term concept labels (TF-IDF with degenerate input)
  • Very small datasets (N < min_samples)

These are pure-function methods on numpy arrays -- easy to unit test without database fixtures.

- Replace manual job polling with apiClient.pollJobUntilComplete()
- Fix single-cluster TF-IDF scoring (frequency-only when num_clusters <= 1)
- Remove dead screenPos parameter from onSelectPoint
- Extract ClusterLegend into its own component (from ~140 inline lines)
- Replace deprecated np.ptp with explicit max-min
- Move inline imports (math, Counter) to module top level
- Use consistent str keys for cluster_sizes and cluster_names dicts
- Fix eps=0 edge case when all points are identical (floor at 1e-6)
- Add 11 unit tests for _compute_clusters and _name_clusters
@aaronsb aaronsb merged commit af95b3e into main Feb 19, 2026
3 checks passed
@aaronsb aaronsb deleted the feat/dbscan-cluster-coloring branch February 19, 2026 06:41