Open
Description
When the input matrix has only one row with nonzero variance and every other row has zero variance, the model is never fitted.
The error is:
NotFittedError: This UMAP instance is not fitted yet.
raised on this line:
connected_vertices_mask = ~disconnected_vertices(reducer)
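For reference, an input of the shape described above can be sketched as follows. This is only an illustration: the array values and the names `row_is_informative` and `n_usable_rows` are made up, and the snippet does not call UMAP. It only shows that a single usable row remains once constant rows are excluded.

```python
import numpy as np

# Hypothetical input: 4 samples, 5 features; only row 0 varies.
X = np.zeros((4, 5))
X[0] = [0.1, 0.5, 0.2, 0.9, 0.3]

# Row-wise variance test: True where a row is not constant.
row_is_informative = ~np.isclose(X.std(axis=1), 0)
n_usable_rows = int(row_is_informative.sum())

print(row_is_informative)  # [ True False False False]
print(n_usable_rows)       # 1
```

With one usable row there is nothing for a neighbor graph to connect, so returning early in this case seems safer than calling `disconnected_vertices` on the reducer.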
Proposed solution:
from typing import Optional, Tuple

import numpy as np
from umap import UMAP
from umap.utils import disconnected_vertices


def umap_embedding(
    X: np.ndarray,
    n_neighbors: int = 5,
    min_dist: float = 0.12,
    spread: float = 9.0,
    random_state: int = 42,
    n_components: int = 2,
    metric: str = "correlation",
    n_epochs: int = 1500,
    **kwargs,
) -> Tuple[np.ndarray, np.ndarray, Optional[UMAP]]:
    """
    Perform UMAP embedding on input data.

    Args:
        X: Input data with shape (n_samples, n_features).
        n_neighbors: Number of neighbors to consider for each point.
        min_dist: Minimum distance between points in the embedding space.
        spread: Determines how spread out all embedded points are overall.
        random_state: Random seed for reproducibility.
        n_components: Number of dimensions in the embedding space.
        metric: Distance metric to use.
        n_epochs: Number of training epochs for embedding optimization.
        **kwargs: Additional keyword arguments for UMAP.

    Returns:
        A tuple containing:
        - embedding: The UMAP embedding (n_samples, n_components). Rows that
          were dropped are NaN; all rows are NaN if there is insufficient data.
        - mask: Boolean mask (length n_samples) showing which rows had
          nonzero variance and were connected.
        - reducer: The fitted UMAP object, or None if there is insufficient data.

    Raises:
        ValueError: If n_components is too large relative to sample size.

    Note:
        This function handles reshaping of input data and removes constant rows.
    """
    if n_components > X.shape[0] - 2:
        raise ValueError(
            "number of components must be 2 smaller than sample size. "
            "See: https://github.com/lmcinnes/umap/issues/201"
        )

    if len(X.shape) > 2:
        # Flatten (n_samples, n_features_1, ...) -> (n_samples, n_features)
        X = X.reshape(X.shape[0], -1)

    # Prepare an output array of NaNs.
    n_samples = X.shape[0]
    embedding = np.full((n_samples, n_components), np.nan)

    # Mask out rows that have zero (or near-zero) variance.
    mask = ~np.isclose(X.std(axis=1), 0)
    X_nonconst = X[mask]

    # If fewer than 2 rows remain, skip UMAP and return the all-NaN embedding.
    if X_nonconst.shape[0] < 2:
        return embedding, mask, None

    # Fit UMAP on the non-constant rows only.
    reducer = UMAP(
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        random_state=random_state,
        n_components=n_components,
        metric=metric,
        spread=spread,
        n_epochs=n_epochs,
        **kwargs,
    )
    _embedding = reducer.fit_transform(X_nonconst)

    # Remove any "disconnected" vertices UMAP couldn't place
    # (e.g. if the graph is disjoint).
    connected_vertices_mask = ~disconnected_vertices(reducer)

    # Incorporate the connected-vertices mask into our existing mask.
    mask[mask] = mask[mask] & connected_vertices_mask

    # Place the valid embeddings back into the final array.
    embedding[mask] = _embedding[connected_vertices_mask]
    return embedding, mask, reducer
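The last two steps of the proposal, folding the connected-vertices mask into the row mask and scattering the fitted rows back into the full-size array, can be checked with NumPy alone. The sketch below uses made-up values for `connected` and `fitted` in place of real UMAP output, so it illustrates the indexing only, not the library's behavior.

```python
import numpy as np

# Rows 1 and 3 are assumed constant and were dropped before fitting.
mask = np.array([True, False, True, False, True])

# Hypothetical result for the 3 fitted rows: the middle one is disconnected.
connected = np.array([True, False, True])
fitted = np.array([[0.0, 1.0], [np.nan, np.nan], [2.0, 3.0]])

# Full-size output, NaN everywhere until valid rows are written back.
embedding = np.full((mask.size, 2), np.nan)

# Only positions that were True can be switched off by `connected`.
mask[mask] = mask[mask] & connected

# After the update, mask selects exactly as many rows as `connected` does.
embedding[mask] = fitted[connected]

print(mask)  # [ True False False False  True]
```

Rows 0 and 4 receive the two connected embeddings; rows 1, 2 and 3 stay NaN, which matches the `mask` returned to the caller.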