One advantage of EMD is that it is a distance metric: it indicates how much work is still required to move a treated population onto the reference population. However, it does not tell us how much the two single-cell populations overlap.
We can add a separate function that estimates overlap with a traditional classifier, such as a logistic regression model or a binary decision tree. If the classifier cannot tell the two populations apart (AUROC near 0.5), they overlap heavily; if it separates them almost perfectly (AUROC near 1), they barely overlap. This gives a global, more interpretable view of how much single cells overlap in the on-morphology signature space.
Implementation example:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np


def classifier_auroc(control_df, treated_df, n_splits=5, random_state=42):
    """
    AUROC-based overlap for pre-normalized, feature-selected morphological profiles.

    Parameters
    ----------
    control_df : pd.DataFrame shape (n_control, n_features)
    treated_df : pd.DataFrame shape (n_treated, n_features)

    Returns
    -------
    dict with auroc, overlap, and population sizes
    """
    n_ctrl = len(control_df)
    n_trt = len(treated_df)

    # Warn when one population is much smaller than the other.
    ratio = min(n_ctrl, n_trt) / max(n_ctrl, n_trt)
    if ratio < 0.3:
        print(f"⚠️ High imbalance detected: {n_ctrl} control vs {n_trt} treated "
              f"(ratio={ratio:.2f}). Using class_weight='balanced'.")

    # Stack both populations; label control as 0 and treated as 1.
    X = np.vstack([control_df.values, treated_df.values])
    y = np.array([0] * n_ctrl + [1] * n_trt)

    # Stratified folds keep the class ratio the same in every split.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    clf = LogisticRegression(
        max_iter=1000,
        random_state=random_state,
        class_weight="balanced",
    )
    auroc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()

    # AUROC of 0.5 (indistinguishable) maps to overlap 1; AUROC of 0 or 1 maps to overlap 0.
    overlap = 1 - abs(auroc - 0.5) * 2
    return {"auroc": auroc, "overlap": overlap, "n_control": n_ctrl, "n_treated": n_trt}
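The text above also mentions a tree-based classifier as an alternative. A minimal sketch of that variant follows, under stated assumptions: the function name `tree_overlap`, the `max_depth=5` default, and the synthetic Gaussian data are all hypothetical and chosen only for illustration. It reuses the same overlap formula, 1 - 2 * |AUROC - 0.5|.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier


def tree_overlap(control_df, treated_df, n_splits=5, max_depth=5, random_state=42):
    """Hypothetical tree-based counterpart to the logistic-regression version."""
    # Stack both populations; label control as 0 and treated as 1.
    X = np.vstack([control_df.to_numpy(), treated_df.to_numpy()])
    y = np.concatenate([
        np.zeros(len(control_df), dtype=int),
        np.ones(len(treated_df), dtype=int),
    ])
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    # max_depth is an assumed cap to limit overfitting; it is not taken from the source.
    clf = DecisionTreeClassifier(
        max_depth=max_depth,
        class_weight="balanced",
        random_state=random_state,
    )
    auroc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
    overlap = 1 - 2 * abs(auroc - 0.5)
    return {"auroc": float(auroc), "overlap": float(overlap)}


# Synthetic stand-ins for normalized profiles (not real data).
rng = np.random.default_rng(0)
control = pd.DataFrame(rng.normal(0.0, 1.0, size=(300, 5)))
same = pd.DataFrame(rng.normal(0.0, 1.0, size=(300, 5)))     # same distribution as control
shifted = pd.DataFrame(rng.normal(4.0, 1.0, size=(300, 5)))  # clearly separated from control

high = tree_overlap(control, same)
low = tree_overlap(control, shifted)
print(high, low)
```

On data like this, the population drawn from the same distribution as the control should score a much higher overlap than the shifted one. A tree can pick up non-linear separation that logistic regression misses, but it is also more prone to overfitting, so the depth cap would need tuning on real profiles.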
However, incorporating the overlap score into the on-Buscar score needs further discussion. Should it be treated separately, or should higher overlap lower the overall on-Buscar score?
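To make the two options concrete, here is a purely illustrative sketch. The function `combine_scores`, its `mode` argument, and the multiplicative penalty are hypothetical; nothing here is an agreed definition of the on-Buscar score. It assumes only that overlap lies in [0, 1].

```python
def combine_scores(on_score, overlap, mode="separate"):
    """Hypothetical ways to report an on-score alongside an overlap value."""
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    if mode == "separate":
        # Option 1: report both values side by side, without changing the score.
        return {"on_score": on_score, "overlap": overlap}
    if mode == "penalize":
        # Option 2 (assumed form): higher overlap scales the score down.
        return {"on_score": on_score * (1.0 - overlap), "overlap": overlap}
    raise ValueError(f"unknown mode: {mode!r}")


separate = combine_scores(2.0, 0.25, mode="separate")
penalized = combine_scores(2.0, 0.25, mode="penalize")
print(separate, penalized)
```

Keeping the values separate preserves the EMD interpretation of the score. The penalized form folds both into one number, at the cost of making that number harder to read as a distance.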