Edge case with umap_embedding #6

@0xJustin

When the input matrix has only one row with nonzero variance and every other row is constant, the model is never fitted.
The error is:
NotFittedError: This UMAP instance is not fitted yet.
raised on the line:
connected_vertices_mask = ~disconnected_vertices(reducer)
Proposed solution:


from typing import Optional, Tuple

import numpy as np
from umap import UMAP
from umap.utils import disconnected_vertices


def umap_embedding(
    X: np.ndarray,
    n_neighbors: int = 5,
    min_dist: float = 0.12,
    spread: float = 9.0,
    random_state: int = 42,
    n_components: int = 2,
    metric: str = "correlation",
    n_epochs: int = 1500,
    **kwargs,
) -> Tuple[np.ndarray, np.ndarray, Optional[UMAP]]:
    """
    Perform UMAP embedding on input data.

    Args:
        X: Input data with shape (n_samples, n_features).
        n_neighbors: Number of neighbors to consider for each point.
        min_dist: Minimum distance between points in the embedding space.
        spread: Determines how spread out all embedded points are overall.
        random_state: Random seed for reproducibility.
        n_components: Number of dimensions in the embedding space.
        metric: Distance metric to use.
        n_epochs: Number of training epochs for embedding optimization.
        **kwargs: Additional keyword arguments for UMAP.

    Returns:
        A tuple containing:
        - embedding: The UMAP embedding (n_samples, n_components). May be NaN if insufficient data.
        - mask: Boolean mask (length n_samples) showing which rows had nonzero variance and were connected.
        - reducer: The fitted UMAP object or None if insufficient data.

    Raises:
        ValueError: If n_components is too large relative to sample size.

    Note:
        This function handles reshaping of input data and removes constant rows.
    """
    if n_components > X.shape[0] - 2:
        raise ValueError(
            "n_components must be at least 2 smaller than the number of samples. "
            "See: https://github.com/lmcinnes/umap/issues/201"
        )

    if len(X.shape) > 2:
        # Flatten (n_samples, n_features_1, ...) → (n_samples, n_features)
        X = X.reshape(X.shape[0], -1)

    # Prepare an output array of NaNs.
    n_samples = X.shape[0]
    embedding = np.full((n_samples, n_components), np.nan)

    # Mask out rows that have zero (or near-zero) variance.
    mask = ~np.isclose(X.std(axis=1), 0)
    X_nonconst = X[mask]

    # If fewer than 2 rows remain, skip UMAP and return embedding of NaNs.
    if X_nonconst.shape[0] < 2:
        return embedding, mask, None

    # Fit UMAP
    reducer = UMAP(
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        random_state=random_state,
        n_components=n_components,
        metric=metric,
        spread=spread,
        n_epochs=n_epochs,
        **kwargs,
    )
    _embedding = reducer.fit_transform(X_nonconst)

    # Remove any “disconnected” vertices UMAP couldn’t place
    # (e.g. if the graph is disjoint).
    connected_vertices_mask = ~disconnected_vertices(reducer)

    # Incorporate the connected-vertices mask into our existing mask.
    # mask[mask] is all True, so assigning the connected mask directly is equivalent.
    mask[mask] = connected_vertices_mask

    # Place the valid embeddings back into the final array.
    embedding[mask] = _embedding[connected_vertices_mask]

    return embedding, mask, reducer
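
The masking logic above can be checked without running UMAP. The following is a minimal sketch, not part of the proposed patch: the helpers `split_constant_rows` and `scatter_back` are hypothetical names introduced here for illustration, and the stand-in embedding and `connected` array are made-up values that take the place of `reducer.fit_transform(...)` and `~disconnected_vertices(reducer)`.

```python
import numpy as np


def split_constant_rows(X: np.ndarray):
    """Return (mask, X_nonconst); mask is True for rows with nonzero variance."""
    mask = ~np.isclose(X.std(axis=1), 0)
    return mask, X[mask]


def scatter_back(sub_embedding, mask, connected, n_components):
    """Write the connected rows of sub_embedding into a NaN-filled array."""
    mask = mask.copy()            # do not mutate the caller's mask
    mask[mask] = connected        # drop rows flagged as disconnected
    out = np.full((mask.shape[0], n_components), np.nan)
    out[mask] = sub_embedding[connected]
    return out, mask


# Edge case from this report: only one row has nonzero variance.
X = np.ones((4, 3))
X[2] = [0.0, 1.0, 2.0]
mask, X_nonconst = split_constant_rows(X)
skip_umap = X_nonconst.shape[0] < 2   # True: take the early-return path

# General case: three varying rows, one of them treated as disconnected.
X2 = np.array([[0, 1, 2], [5, 5, 5], [2, 0, 1], [1, 2, 0]], dtype=float)
mask2, X2_nonconst = split_constant_rows(X2)
fake_embedding = np.arange(6, dtype=float).reshape(3, 2)  # stand-in for UMAP output
connected = np.array([True, False, True])                 # stand-in connectivity
out, final_mask = scatter_back(fake_embedding, mask2, connected, 2)
```

In the first case `skip_umap` is true, so the function would return the all-NaN embedding and `None` for the reducer instead of calling `disconnected_vertices` on it. In the second case the rows that are constant or disconnected stay NaN in `out`, and `final_mask` marks only the rows that received coordinates.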
