A quick walkthrough of what happens to your embeddings from the moment you click "Run Clustering" to the scatter plot on screen.
Raw Embeddings (from parquet or model)
│
├─ Validate: check for NaN/Inf, cast to float32
├─ L2 Normalize: project onto unit hypersphere
│
├─► Step 1: KMeans Clustering (high-dimensional)
│ Backend: cuML → FAISS → sklearn
│
├─► Step 2: Dimensionality Reduction to 2D
│ Method: PCA / t-SNE / UMAP
│ Backend: cuML → sklearn
│
└─► Scatter Plot (Altair)
Color = cluster, position = 2D projection
Before any computation, every embedding goes through _prepare_embeddings():
- Cast to float32 — GPU backends require it; keeps memory predictable.
- NaN/Inf check — replaces bad values with 0 and logs a warning.
- L2 normalization — divides each vector by its magnitude so every point
sits on the unit hypersphere. This is critical for two reasons:
- Prevents cuML UMAP's NN-descent from crashing with SIGFPE on
large-magnitude vectors (see
investigation/cuml_umap_sigfpe/). - Appropriate for contrastive embeddings (CLIP, BioCLIP) whose training objective is cosine-similarity based — magnitude isn't a learned signal.
- Prevents cuML UMAP's NN-descent from crashing with SIGFPE on
large-magnitude vectors (see
Input norms are logged so you can always verify what came in.
Clusters the full high-dimensional embeddings (e.g., 768-d for BioCLIP 2). Runs before dimensionality reduction so clusters are based on the full feature space, not a lossy 2D projection.
| Backend | When It's Used | How It Works |
|---|---|---|
| cuML | GPU available + >500 samples | GPU-accelerated KMeans via RAPIDS. Runs on CuPy arrays. Falls back to sklearn on any error. |
| FAISS | No GPU + >500 samples | Facebook's optimized CPU KMeans using L2 index. Fast for medium datasets. Falls back to sklearn on error. |
| sklearn | Small datasets or fallback | Standard scikit-learn KMeans. Always works, no special dependencies. |
Auto-selection priority: cuML > FAISS > sklearn. You can override in the sidebar.
Projects embeddings from high-dimensional space down to 2D for visualization. This is purely for the scatter plot — clustering uses the full-dimensional data.
The fastest option. Linear projection onto the two directions of maximum variance. Good for getting a quick overview; doesn't capture nonlinear structure.
| Backend | Notes |
|---|---|
| cuML | GPU-accelerated, near-instant even on large datasets |
| sklearn | CPU-based, still fast since PCA is O(n) |
Nonlinear method that preserves local neighborhoods. Good at revealing clusters but slow on large datasets. Perplexity is auto-adjusted based on sample size.
| Backend | Notes |
|---|---|
| cuML | GPU-accelerated, handles thousands of samples well |
| sklearn | CPU-based, can be slow above ~5k samples |
The recommended default. Nonlinear like t-SNE but faster and better at preserving global structure. Neighbor count is auto-adjusted.
| Backend | Notes |
|---|---|
| cuML | Runs in an isolated subprocess so a crash doesn't kill the app. The subprocess verifies L2 normalization as a safety net. Falls back to sklearn on failure. |
| sklearn | CPU-based umap-learn. Slower but numerically stable. |
Why the subprocess? cuML UMAP's NN-descent algorithm can occasionally trigger a SIGFPE (floating-point exception) that kills the process instantly — no Python try/except can catch it. The subprocess isolates this risk.
When you select "auto" (the default), the app picks the fastest available backend:
| Operation | Auto Logic |
|---|---|
| KMeans | cuML if GPU + >500 samples, else FAISS if available + >500 samples, else sklearn |
| Dim. Reduction | cuML if GPU + >5000 samples, else sklearn |
Any GPU error (architecture mismatch, missing libraries, out of memory (OOM)) triggers an automatic retry with sklearn. OOM errors are surfaced to the user with guidance.
Every step is logged to logs/emb_explorer.log (DEBUG level) and console (INFO):
- Embedding extraction: shape, dtype
- Preparation: input norms (min/max/mean), non-finite count, L2 normalization
- Backend selection: which backend was chosen and why
- KMeans: cluster count, sample count, elapsed time
- Reduction: method, sample count, elapsed time
- Fallbacks: what failed and what we fell back to
- Visualization: point selection events, density mode changes
Check the log file for the full picture when debugging.
cuML (GPU)
│ error?
▼
FAISS (CPU, optimized) ← KMeans only
│ error?
▼
sklearn (CPU, always works)
The app is designed to always produce a result. GPU acceleration is a nice-to-have, never a hard requirement.