`pysmatch`

Propensity Score Matching (PSM) helps reduce selection bias in observational studies by matching treatment and control units with similar propensity scores.

pysmatch is an improved and extended version of pymatch, with modernized modeling, modularized matching utilities, and better support for reproducible workflows.

Multilingual

English | 中文

Highlights

Multiple score models: Logistic Regression, KNN, CatBoost
Flexible balancing: oversampling and undersampling (balance_strategy)
Standard and exhaustive matching workflows
Balance diagnostics for categorical and continuous covariates
Optional Optuna tuning for automated model search

Installation

Install from PyPI:

pip install pysmatch

Install optional extras:

pip install "pysmatch[tree]"   # CatBoost support
pip install "pysmatch[tune]"   # Optuna support
pip install "pysmatch[all]"    # all optional dependencies

Install from source:

git clone https://github.com/miaohancheng/pysmatch.git
cd pysmatch
pip install -e ".[all]"

Quickstart

This minimal example runs the full core path with the built-in demo dataset (misc/loan.csv).

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
from pysmatch.Matcher import Matcher

np.random.seed(42)
data = pd.read_csv("misc/loan.csv")

test = data[data.loan_status == "Default"].copy()
control = data[data.loan_status == "Fully Paid"].copy()

matcher = Matcher(
    test=test,
    control=control,
    yvar="is_default",
    exclude=["loan_status"],
)

matcher.fit_scores(
    balance=True,
    balance_strategy="over",
    nmodels=10,
    model_type="linear",
    n_jobs=2,
)
matcher.predict_scores()
matcher.match(method="min", nmatches=1, threshold=0.001, replacement=False)

print(matcher.matched_data.head())

If this works, continue to the full workflow below.

End-to-End Workflow

Data Preparation

Use domain-relevant covariates and avoid leaking post-treatment variables into matching features.

import pandas as pd

fields = [
    "loan_amnt",
    "funded_amnt",
    "funded_amnt_inv",
    "term",
    "int_rate",
    "installment",
    "grade",
    "sub_grade",
    "loan_status",
]

raw = pd.read_csv("misc/loan.csv", usecols=fields)
test = raw[raw.loan_status == "Default"].copy()
control = raw[raw.loan_status == "Fully Paid"].copy()

Initialize Matcher

from pysmatch.Matcher import Matcher

matcher = Matcher(
    test=test,
    control=control,
    yvar="is_default",
    exclude=["loan_status"],
)

print("xvars:", matcher.xvars)
print("test/control:", matcher.testn, matcher.controln)

Fit Propensity Score Models

fit_scores supports three model types:

linear (logistic regression)
knn
tree (CatBoost, requires pysmatch[tree])

matcher.fit_scores(
    balance=True,
    balance_strategy="over",   # "over" or "under"
    nmodels=10,
    model_type="linear",
    max_iter=200,
    n_jobs=2,
)

print("models:", len(matcher.models))
print("avg validation accuracy:", sum(matcher.model_accuracy) / len(matcher.model_accuracy))

Optuna path (single tuned model):

# matcher.fit_scores(
#     balance=True,
#     model_type="tree",
#     use_optuna=True,
#     n_trials=20,
# )

Predict and Plot Scores

matcher.predict_scores()
matcher.plot_scores()

matcher.data now contains a scores column.

Tune Threshold

import numpy as np

matcher.tune_threshold(
    method="min",
    nmatches=1,
    rng=np.arange(0.0001, 0.0051, 0.0005),
)

Choose a threshold that balances quality and retained sample size.

Run Matching

Standard matching:

matcher.match(
    method="min",
    nmatches=1,
    threshold=0.001,
    replacement=False,
    exhaustive_matching=False,
)
matcher.plot_matched_scores()

Exhaustive matching:

matcher.match(
    threshold=0.001,
    nmatches=1,
    exhaustive_matching=True,
)

Review Matched Data and Weights

print(matcher.matched_data.head())
print(matcher.record_frequency().head())
matcher.assign_weight_vector()
print(matcher.matched_data[["record_id", "match_id", "weight"]].head())

Matching Strategies

Standard vs Exhaustive Matching

Standard (exhaustive_matching=False): uses nearest-neighbor style control selection with configurable method/replacement behavior.
Exhaustive (exhaustive_matching=True): prioritizes wider control utilization while still respecting threshold constraints.

Key Parameters

threshold: max allowed score distance
nmatches: controls per treated unit
replacement: whether a control can be reused
method: "min" (closest) or "random" (random within threshold)

Practical Guidance

Start with nmatches=1, replacement=False, and a moderate threshold.
If retention is too low, loosen threshold gradually.
If balance is weak after matching, tighten threshold or change model/balance strategy.
For severe class imbalance, test balance_strategy="under" as sensitivity analysis.

Evaluation

After matching, evaluate covariate balance before causal analysis.

Categorical Covariates

cat_table = matcher.compare_categorical(return_table=True, plot_result=True)
print(cat_table)

Interpretation:

check before/after p-value shifts
look for reduced proportional differences after matching

Continuous Covariates

cont_table = matcher.compare_continuous(return_table=True, plot_result=True)
print(cont_table)

Interpretation:

compare KS statistics and grouped permutation test p-values
monitor standardized mean/median differences pre vs post matching

Single Variable Proportion Test

print(matcher.prop_test("grade"))

Troubleshooting

`ValueError: numpy.dtype size changed`

This is usually a NumPy/Pandas binary compatibility issue.

pip install --upgrade --force-reinstall "numpy>=1.26.4" "pandas>=2.1.4"

Restart your Python kernel/session after reinstalling.

`Scores column not found`

Run predict_scores() before match().

matcher.fit_scores(...)
matcher.predict_scores()
matcher.match(...)

`FileNotFoundError` for dataset path

Use a repo-relative path:

pd.read_csv("misc/loan.csv")

No matches found

Usually threshold is too strict or groups are weakly overlapping.

increase threshold
try a different model_type
inspect score distributions with plot_scores()

Jupyter kernel issues in notebooks

If your notebook kernel name is unavailable, switch to an existing kernel (python3) and rerun cells.

FAQ

When should I use `linear` vs `tree` vs `knn`?

Start with linear for strong baseline interpretability.
Use tree for nonlinear relationships and mixed feature types.
Use knn as a local-structure baseline and compare sensitivity.

Is high model accuracy always better for matching?

Not necessarily. Very high separability may indicate weak overlap, which can reduce matchability. Balance diagnostics matter more than raw classifier accuracy.

Should I use over- or under-sampling?

over: usually keeps more majority information; good default.
under: faster/smaller training sets; useful for sensitivity checks.

How do I make runs reproducible?

set np.random.seed(...)
keep fixed package versions
record model/matching parameters in experiment logs

Additional resources

Sekhon, J. S. (2011), Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software, 42(7), 1-52. Link
Rosenbaum, P. R., & Rubin, D. B. (1983), The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55. Link

Contributing

Contributions are welcome. Please open an issue or pull request in this repository.

License

pysmatch is released under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 98 Commits
.github/workflows		.github/workflows
Example_files		Example_files
docs		docs
misc		misc
pysmatch		pysmatch
scripts		scripts
test		test
.gitignore		.gitignore
Example.ipynb		Example.ipynb
Example.py		Example.py
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
README_CHINESE.md		README_CHINESE.md
coverage.xml		coverage.xml
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
requirements.txt		requirements.txt
setup.cfg		setup.cfg
setup.py		setup.py

Folders and files

Latest commit

History

Repository files navigation

pysmatch

Multilingual

Highlights

Installation

Quickstart

End-to-End Workflow

Data Preparation

Initialize Matcher

Fit Propensity Score Models

Predict and Plot Scores

Tune Threshold

Run Matching

Review Matched Data and Weights

Matching Strategies

Standard vs Exhaustive Matching

Key Parameters

Practical Guidance

Evaluation

Categorical Covariates

Continuous Covariates

Single Variable Proportion Test

Troubleshooting

ValueError: numpy.dtype size changed

Scores column not found

FileNotFoundError for dataset path

No matches found

Jupyter kernel issues in notebooks

FAQ

When should I use linear vs tree vs knn?

Is high model accuracy always better for matching?

Should I use over- or under-sampling?

How do I make runs reproducible?

Additional resources

Contributing

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 22

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

`pysmatch`

`ValueError: numpy.dtype size changed`

`Scores column not found`

`FileNotFoundError` for dataset path

When should I use `linear` vs `tree` vs `knn`?

Packages