Edge case with umap_embedding

When the input matrix has only one row with nonzero variance and all other rows have zero variance, the model doesn't get fitted.
Error is:
NotFittedError: This UMAP instance is not fitted yet.  
on line:
    connected_vertices_mask = ~disconnected_vertices(reducer)
Proposed solution:

```
from typing import Optional, Tuple

import numpy as np
from umap import UMAP
from umap.utils import disconnected_vertices


def umap_embedding(
    X: np.ndarray,
    n_neighbors: int = 5,
    min_dist: float = 0.12,
    spread: float = 9.0,
    random_state: int = 42,
    n_components: int = 2,
    metric: str = "correlation",
    n_epochs: int = 1500,
    **kwargs,
) -> Tuple[np.ndarray, np.ndarray, Optional[UMAP]]:
    """
    Perform UMAP embedding on input data.

    Args:
        X: Input data with shape (n_samples, n_features).
        n_neighbors: Number of neighbors to consider for each point.
        min_dist: Minimum distance between points in the embedding space.
        spread: Determines how spread out all embedded points are overall.
        random_state: Random seed for reproducibility.
        n_components: Number of dimensions in the embedding space.
        metric: Distance metric to use.
        n_epochs: Number of training epochs for embedding optimization.
        **kwargs: Additional keyword arguments for UMAP.

    Returns:
        A tuple containing:
        - embedding: The UMAP embedding (n_samples, n_components). Rows excluded by the mask are NaN;
          all rows are NaN if fewer than 2 non-constant rows remain.
        - mask: Boolean mask (length n_samples) showing which rows had nonzero variance and were connected.
        - reducer: The fitted UMAP object or None if insufficient data.

    Raises:
        ValueError: If n_components is too large relative to sample size.

    Note:
        This function handles reshaping of input data and removes constant rows.
    """
    if n_components > X.shape[0] - 2:
        raise ValueError(
            "n_components must be at least 2 smaller than the sample size. "
            "See: https://github.com/lmcinnes/umap/issues/201"
        )

    if len(X.shape) > 2:
        # Flatten (n_samples, n_features_1, ...) → (n_samples, n_features)
        X = X.reshape(X.shape[0], -1)

    # Prepare an output array of NaNs.
    n_samples = X.shape[0]
    embedding = np.full((n_samples, n_components), np.nan)

    # Mask out rows that have zero (or near-zero) variance.
    mask = ~np.isclose(X.std(axis=1), 0)
    X_nonconst = X[mask]

    # If fewer than 2 rows remain, skip UMAP and return embedding of NaNs.
    if X_nonconst.shape[0] < 2:
        return embedding, mask, None

    # Fit UMAP
    reducer = UMAP(
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        random_state=random_state,
        n_components=n_components,
        metric=metric,
        spread=spread,
        n_epochs=n_epochs,
        **kwargs,
    )
    _embedding = reducer.fit_transform(X_nonconst)

    # Remove any “disconnected” vertices UMAP couldn’t place
    # (e.g. if the graph is disjoint).
    connected_vertices_mask = ~disconnected_vertices(reducer)

    # Incorporate the connected-vertices mask into our existing mask.
    mask[mask] = mask[mask] & connected_vertices_mask

    # Place the valid embeddings back into the final array.
    embedding[mask] = _embedding[connected_vertices_mask]

    return embedding, mask, reducer
```
