Adding distribution overlap metrics #95

One advantage of EMD is that it is a distance metric: it indicates how much work is still required for a treated population. However, it does not tell us how much the treated and control single-cell populations overlap.

We can add a separate function that estimates the amount of overlap using a traditional logistic regression model or a binary decision tree. This would give insight into the global level of overlap and make it easier to interpret how single cells overlap in the on-morphology signature space.

Implementation example:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np

def classifier_auroc(control_df, treated_df, n_splits=5, random_state=42):
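    """
    AUROC-based overlap for pre-normalized, feature-selected morphological profiles.

    A classifier is trained to separate control from treated cells. An AUROC
    near 0.5 means the populations are hard to tell apart (high overlap).

    Parameters
    ----------
    control_df : pd.DataFrame  shape (n_control, n_features)
    treated_df : pd.DataFrame  shape (n_treated, n_features)
        Must contain the same feature columns, in the same order, as control_df.
    n_splits : int  number of stratified cross-validation folds
    random_state : int  seed for fold shuffling and the classifier

    Returns
    -------
    dict with auroc (mean over folds), overlap (1 = full overlap,
    0 = fully separable), and population sizes
    """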
    n_ctrl = len(control_df)
    n_trt  = len(treated_df)
    ratio  = min(n_ctrl, n_trt) / max(n_ctrl, n_trt)

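    # class_weight='balanced' is always applied below; this only flags
    # strongly imbalanced inputs to the user.
    if ratio < 0.3:
        print(f"⚠️  High imbalance detected: {n_ctrl} control vs {n_trt} treated "
              f"(ratio={ratio:.2f}). class_weight='balanced' is applied.")

    # Stack both populations and label them: 0 = control, 1 = treated.
    X = np.vstack([control_df.to_numpy(), treated_df.to_numpy()])
    y = np.array([0] * n_ctrl + [1] * n_trt)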

    cv  = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    clf = LogisticRegression(
        max_iter=1000,
        random_state=random_state,
        class_weight="balanced"
    )

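    # Mean cross-validated AUROC: 0.5 means the classifier cannot tell the
    # populations apart; values near 0 or 1 mean they are separable.
    auroc   = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
    # Fold the AUROC around 0.5 so that 1 = full overlap, 0 = fully separable.
    overlap = 1 - abs(auroc - 0.5) * 2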

    return {"auroc": auroc, "overlap": overlap, "n_control": n_ctrl, "n_treated": n_trt}
```
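
The description also mentions a tree-based classifier as an alternative. A minimal sketch of that variant is below, assuming a shallow scikit-learn `DecisionTreeClassifier` is an acceptable stand-in; the function name `tree_overlap`, the `max_depth=3` default, and the synthetic `feat_*` columns are hypothetical and only there to illustrate how the overlap score behaves on identical versus well-separated populations.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier


def tree_overlap(control_df, treated_df, n_splits=5, max_depth=3, random_state=42):
    """Hypothetical tree-based variant of the AUROC overlap score."""
    # Stack both populations and label them: 0 = control, 1 = treated.
    X = np.vstack([control_df.to_numpy(), treated_df.to_numpy()])
    y = np.concatenate([
        np.zeros(len(control_df), dtype=int),
        np.ones(len(treated_df), dtype=int),
    ])

    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    # A shallow tree (assumed depth) keeps the classifier from memorizing cells.
    clf = DecisionTreeClassifier(
        max_depth=max_depth,
        class_weight="balanced",
        random_state=random_state,
    )

    auroc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
    # Same folding as above: 1 = full overlap, 0 = fully separable.
    overlap = 1 - abs(auroc - 0.5) * 2
    return {"auroc": float(auroc), "overlap": float(overlap)}


# Synthetic data for illustration only (not real morphology profiles).
rng = np.random.default_rng(0)
cols = [f"feat_{i}" for i in range(5)]
control = pd.DataFrame(rng.normal(0, 1, (300, 5)), columns=cols)
same    = pd.DataFrame(rng.normal(0, 1, (300, 5)), columns=cols)  # same distribution
shifted = pd.DataFrame(rng.normal(4, 1, (300, 5)), columns=cols)  # strongly shifted

high = tree_overlap(control, same)     # expected: overlap close to 1
low  = tree_overlap(control, shifted)  # expected: overlap close to 0
print(high, low)
```

Because the score folds the AUROC around 0.5, a classifier that does no better than chance yields an overlap near 1, while one that separates the populations almost perfectly yields an overlap near 0.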

However, how to incorporate the overlap score into the on-Buscar score needs further discussion. Should it be reported separately, or should higher overlap lower the overall on-Buscar score?
