Summary
Two bugs in the precalculated embeddings app: (1) hover tooltips show only partial metadata, and (2) when clustering by column values, KMeans labels are overwritten with raw column values -- destroying the ability to evaluate whether KMeans successfully separated records by category.
Bug 1: Hover tooltip shows only partial metadata
Root cause: Hard-coded [:8] slice in shared/components/visualization.py:87:
metadata_cols = [c for c in df_plot.columns if c not in skip_cols][:8]
With the example dataset's 17 metadata columns, only the first 8 reach the tooltip, so fields such as species, scientific_name, and common_name are cut off. Users must click a point and check the Data Preview panel to see them.
Fix: Remove/raise the cap, or add a user-selectable tooltip column picker.
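A minimal sketch of what the fix could look like, assuming the column selection is pulled out into a helper. The name `select_tooltip_columns` and the `selected` and `max_cols` parameters are hypothetical and do not exist in the codebase; only `skip_cols` and the list comprehension come from the current code.

```python
from typing import Iterable, List, Optional


def select_tooltip_columns(
    columns: Iterable[str],
    skip_cols: Iterable[str],
    selected: Optional[Iterable[str]] = None,  # hypothetical: user's picker choice
    max_cols: Optional[int] = None,            # hypothetical: optional cap, off by default
) -> List[str]:
    """Return tooltip columns in their original order, with no hidden hard cap."""
    skip = set(skip_cols)
    candidates = [c for c in columns if c not in skip]
    if selected is not None:
        chosen = set(selected)
        candidates = [c for c in candidates if c in chosen]
    if max_cols is not None:
        candidates = candidates[:max_cols]
    return candidates
```

With `max_cols=None` all 17 metadata columns would reach the tooltip; passing `selected` models the column picker option.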
Bug 2: KMeans labels overwritten + "null" in tooltips when clustering by column
The goal of "Use column values" clustering is to evaluate whether KMeans can separate records by a column's categories (column = ground truth, KMeans = prediction). The current code defeats this by overwriting KMeans labels with the raw column values (sidebar.py:691-693).
Example -- clustering by kingdom (5 unique values, n_clusters=5)
| Row | kingdom | KMeans label (from embeddings) | After overwrite (from column) | Problem |
| --- | --- | --- | --- | --- |
| 0 | Animalia | 2 | 0 (Animalia) | KMeans said cluster 2 |
| 1 | Plantae | 2 | 3 (Plantae) | Same KMeans cluster, different color |
| 2 | Animalia | 0 | 0 (Animalia) | KMeans said cluster 0 |
| 3 | NaN | 1 | 5 (Unknown) | Phantom 6th cluster |
After the overwrite:
- Colors show taxonomy, not KMeans output -- you can't see what KMeans actually found
- Two points in the same KMeans cluster get different colors (rows 0 and 1) -- exactly the comparison the user wants to see, but it's invisible
- Rows 0 and 2 look identical (both "Animalia") but KMeans put them in different clusters -- this disagreement is the evaluation signal, and it's destroyed
- Null rows create a phantom 6th cluster that KMeans never assigned
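To illustrate the evaluation signal that the overwrite destroys, here is a small sketch that keeps the KMeans labels intact and tallies which column values fall into each cluster. It uses the four example rows from the table; `cluster_composition` is a hypothetical helper, and representing the missing value as `None` is an assumption made to keep the example free of dependencies.

```python
from collections import Counter, defaultdict

# (column value, KMeans label) for the four example rows; None marks the missing kingdom.
rows = [("Animalia", 2), ("Plantae", 2), ("Animalia", 0), (None, 1)]


def cluster_composition(pairs):
    """Map each KMeans label to a Counter of the column values it contains."""
    composition = defaultdict(Counter)
    for value, label in pairs:
        # Missing values are only renamed for display; the KMeans label is untouched.
        composition[label]["Unknown" if value is None else value] += 1
    return dict(composition)


composition = cluster_composition(rows)
for label in sorted(composition):
    print(label, dict(composition[label]))
```

Cluster 2 mixing Animalia and Plantae, and Animalia being split across clusters 0 and 2, are the disagreements a user would want to see.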
Additionally, raw NaN values in the original column leak into the tooltip and render as "null" in Altair/Vega-Lite, which users interpret as the cluster name being null (even though cluster_name is correctly set to "Unknown").
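One possible way to stop the leak is to sanitize values before they are handed to the chart. The sketch below works on plain dict records; `tooltip_safe` and the `placeholder` parameter are hypothetical names, and in the actual pandas code the equivalent would more likely be a `fillna` on the tooltip columns.

```python
import math


def tooltip_safe(record, placeholder="Unknown"):
    """Return a copy of the record with None/NaN replaced by a display placeholder."""
    cleaned = {}
    for key, value in record.items():
        missing = value is None or (isinstance(value, float) and math.isnan(value))
        cleaned[key] = placeholder if missing else value
    return cleaned


row = {"kingdom": float("nan"), "cluster_name": "Unknown", "x": 1.5}
safe_row = tooltip_safe(row)
```

Here `safe_row["kingdom"]` becomes `"Unknown"`, matching `cluster_name`, so the tooltip no longer shows a raw null next to a named cluster.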
Proposed fix: two distinct modes
| Mode | KMeans | Colors | Purpose |
| --- | --- | --- | --- |
| Cluster by column count (fix existing) | Yes, n_clusters from column | KMeans labels (preserve, don't overwrite) | Evaluate: do embedding clusters align with taxonomy? |
| Label by column (new) | No, skip KMeans | Raw column values | Explore: how does a column distribute in embedding space? |
For the evaluation mode, column values become comparison metadata (in tooltip + Data Preview), not the cluster assignment.
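The two modes could be separated roughly as follows. This is a sketch under stated assumptions, not the existing sidebar code: the mode strings, `n_clusters_from_column`, and `color_records` are all hypothetical, and missing values are represented as `None`.

```python
def n_clusters_from_column(values):
    """Count distinct non-missing categories, so nulls never add a phantom cluster."""
    return len({v for v in values if v is not None})


def color_records(mode, column_values, kmeans_labels=None):
    """Build per-row records with a color field and the column value as metadata."""
    if mode == "cluster_by_column_count":
        if kmeans_labels is None:
            raise ValueError("KMeans labels are required in evaluation mode")
        # Color by the KMeans label; the column value rides along for comparison only.
        return [
            {"color": f"Cluster {label}", "column_value": value}
            for value, label in zip(column_values, kmeans_labels)
        ]
    if mode == "label_by_column":
        # No KMeans at all: color directly by the column value.
        return [
            {"color": "Unknown" if value is None else value, "column_value": value}
            for value in column_values
        ]
    raise ValueError(f"Unknown mode: {mode!r}")


kingdom = ["Animalia", "Plantae", "Animalia", None]
kmeans = [2, 2, 0, 1]
evaluation = color_records("cluster_by_column_count", kingdom, kmeans)
exploration = color_records("label_by_column", kingdom)
```

In the evaluation records, rows 0 and 1 share the color `"Cluster 2"` while keeping their different kingdoms as metadata, which is the comparison the current overwrite hides.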
Affected Files
shared/components/visualization.py:87 -- tooltip [:8] cap
apps/precalculated/components/sidebar.py:555-602 -- cluster method UI
apps/precalculated/components/sidebar.py:672-697 -- label overwrite logic
apps/precalculated/components/sidebar.py:732-747 -- create_cluster_dataframe() null handling