Skip to content

miaohancheng/pysmatch

Repository files navigation

pysmatch

PyPI version Downloads GitHub License codecov

Propensity Score Matching (PSM) helps reduce selection bias in observational studies by matching treatment and control units with similar propensity scores.

pysmatch is an improved and extended version of pymatch, with modernized modeling, modularized matching utilities, and better support for reproducible workflows.

Multilingual

English | 中文

Highlights

  • Multiple score models: Logistic Regression, KNN, CatBoost
  • Flexible balancing: oversampling and undersampling (balance_strategy)
  • Standard and exhaustive matching workflows
  • Balance diagnostics for categorical and continuous covariates
  • Optional Optuna tuning for automated model search

Installation

Install from PyPI:

pip install pysmatch

Install optional extras:

pip install "pysmatch[tree]"   # CatBoost support
pip install "pysmatch[tune]"   # Optuna support
pip install "pysmatch[all]"    # all optional dependencies

Install from source:

git clone https://github.com/miaohancheng/pysmatch.git
cd pysmatch
pip install -e ".[all]"

Quickstart

This minimal example runs the full core path with the built-in demo dataset (misc/loan.csv).

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
from pysmatch.Matcher import Matcher

np.random.seed(42)
data = pd.read_csv("misc/loan.csv")

test = data[data.loan_status == "Default"].copy()
control = data[data.loan_status == "Fully Paid"].copy()

matcher = Matcher(
    test=test,
    control=control,
    yvar="is_default",
    exclude=["loan_status"],
)

matcher.fit_scores(
    balance=True,
    balance_strategy="over",
    nmodels=10,
    model_type="linear",
    n_jobs=2,
)
matcher.predict_scores()
matcher.match(method="min", nmatches=1, threshold=0.001, replacement=False)

print(matcher.matched_data.head())

If this works, continue to the full workflow below.

End-to-End Workflow

Data Preparation

Use domain-relevant covariates and avoid leaking post-treatment variables into matching features.

import pandas as pd

fields = [
    "loan_amnt",
    "funded_amnt",
    "funded_amnt_inv",
    "term",
    "int_rate",
    "installment",
    "grade",
    "sub_grade",
    "loan_status",
]

raw = pd.read_csv("misc/loan.csv", usecols=fields)
test = raw[raw.loan_status == "Default"].copy()
control = raw[raw.loan_status == "Fully Paid"].copy()

Initialize Matcher

from pysmatch.Matcher import Matcher

matcher = Matcher(
    test=test,
    control=control,
    yvar="is_default",
    exclude=["loan_status"],
)

print("xvars:", matcher.xvars)
print("test/control:", matcher.testn, matcher.controln)

Fit Propensity Score Models

fit_scores supports three model types:

  • linear (logistic regression)
  • knn
  • tree (CatBoost, requires pysmatch[tree])
matcher.fit_scores(
    balance=True,
    balance_strategy="over",   # "over" or "under"
    nmodels=10,
    model_type="linear",
    max_iter=200,
    n_jobs=2,
)

print("models:", len(matcher.models))
print("avg validation accuracy:", sum(matcher.model_accuracy) / len(matcher.model_accuracy))

Optuna path (single tuned model):

# matcher.fit_scores(
#     balance=True,
#     model_type="tree",
#     use_optuna=True,
#     n_trials=20,
# )

Predict and Plot Scores

matcher.predict_scores()
matcher.plot_scores()

matcher.data now contains a scores column.

Tune Threshold

import numpy as np

matcher.tune_threshold(
    method="min",
    nmatches=1,
    rng=np.arange(0.0001, 0.0051, 0.0005),
)

Choose a threshold that balances quality and retained sample size.

Run Matching

Standard matching:

matcher.match(
    method="min",
    nmatches=1,
    threshold=0.001,
    replacement=False,
    exhaustive_matching=False,
)
matcher.plot_matched_scores()

Exhaustive matching:

matcher.match(
    threshold=0.001,
    nmatches=1,
    exhaustive_matching=True,
)

Review Matched Data and Weights

print(matcher.matched_data.head())
print(matcher.record_frequency().head())
matcher.assign_weight_vector()
print(matcher.matched_data[["record_id", "match_id", "weight"]].head())

Matching Strategies

Standard vs Exhaustive Matching

  • Standard (exhaustive_matching=False): uses nearest-neighbor style control selection with configurable method/replacement behavior.
  • Exhaustive (exhaustive_matching=True): prioritizes wider control utilization while still respecting threshold constraints.

Key Parameters

  • threshold: max allowed score distance
  • nmatches: controls per treated unit
  • replacement: whether a control can be reused
  • method: "min" (closest) or "random" (random within threshold)

Practical Guidance

  • Start with nmatches=1, replacement=False, and a moderate threshold.
  • If retention is too low, loosen threshold gradually.
  • If balance is weak after matching, tighten threshold or change model/balance strategy.
  • For severe class imbalance, test balance_strategy="under" as sensitivity analysis.

Evaluation

After matching, evaluate covariate balance before causal analysis.

Categorical Covariates

cat_table = matcher.compare_categorical(return_table=True, plot_result=True)
print(cat_table)

Interpretation:

  • check before/after p-value shifts
  • look for reduced proportional differences after matching

Continuous Covariates

cont_table = matcher.compare_continuous(return_table=True, plot_result=True)
print(cont_table)

Interpretation:

  • compare KS statistics and grouped permutation test p-values
  • monitor standardized mean/median differences pre vs post matching

Single Variable Proportion Test

print(matcher.prop_test("grade"))

Troubleshooting

ValueError: numpy.dtype size changed

This is usually a NumPy/Pandas binary compatibility issue.

pip install --upgrade --force-reinstall "numpy>=1.26.4" "pandas>=2.1.4"

Restart your Python kernel/session after reinstalling.

Scores column not found

Run predict_scores() before match().

matcher.fit_scores(...)
matcher.predict_scores()
matcher.match(...)

FileNotFoundError for dataset path

Use a repo-relative path:

pd.read_csv("misc/loan.csv")

No matches found

Usually threshold is too strict or groups are weakly overlapping.

  • increase threshold
  • try a different model_type
  • inspect score distributions with plot_scores()

Jupyter kernel issues in notebooks

If your notebook kernel name is unavailable, switch to an existing kernel (python3) and rerun cells.

FAQ

When should I use linear vs tree vs knn?

  • Start with linear for strong baseline interpretability.
  • Use tree for nonlinear relationships and mixed feature types.
  • Use knn as a local-structure baseline and compare sensitivity.

Is high model accuracy always better for matching?

Not necessarily. Very high separability may indicate weak overlap, which can reduce matchability. Balance diagnostics matter more than raw classifier accuracy.

Should I use over- or under-sampling?

  • over: usually keeps more majority information; good default.
  • under: faster/smaller training sets; useful for sensitivity checks.

How do I make runs reproducible?

  • set np.random.seed(...)
  • keep fixed package versions
  • record model/matching parameters in experiment logs

Additional resources

  • Sekhon, J. S. (2011), Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software, 42(7), 1-52. Link
  • Rosenbaum, P. R., & Rubin, D. B. (1983), The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55. Link

Contributing

Contributions are welcome. Please open an issue or pull request in this repository.

License

pysmatch is released under the MIT License.

About

Propensity Score Matching(PSM) on python

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors