Add helper to compute simple SNP feature matrix for ML workflows #1185
Description
Downstream users often want to feed MalariaGEN SNP/genotype data into machine‑learning models (e.g. taxon classification, resistance prediction), but there is no small, documented helper that turns SNP data into a simple, ML‑ready feature matrix. Each user currently has to re‑implement feature extraction on top of snp_allele_frequencies() and snp_genotype_allele_counts().
Proposed
Add a lightweight helper snp_feature_matrix(...) on the Anopheles SNP frequency analysis API that:
Accepts the standard parameters (transcript or region, sample_sets, sample_query, optional cohorts).
Returns a pandas.DataFrame where rows are samples or cohorts and columns are:
total_snp_count
nonsynonymous_snp_count
mean_allele_frequency
Internally uses existing public methods such as snp_allele_frequencies() and snp_genotype_allele_counts() (no ML inside the package).
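To make the proposed output concrete for the cohort case, here is a minimal pure-Python sketch. It is not the PR code. The function name cohort_feature_rows, the effect label "NON_SYNONYMOUS_CODING", the toy input rows, and the rule that a site counts for a cohort only when its frequency there is above zero are all assumptions made for illustration. The input mimics a frequency table with one row per alternate allele and one frq_<cohort> column per cohort.

```python
from statistics import fmean

# Assumed label for nonsynonymous variants in the "effect" column.
NONSYN_EFFECT = "NON_SYNONYMOUS_CODING"


def cohort_feature_rows(freq_rows, cohorts):
    """Build one feature record per cohort (hypothetical helper).

    freq_rows: list of dicts with keys "contig", "position", "effect"
    and one "frq_<cohort>" key per cohort, one row per alternate allele.
    """
    records = []
    for cohort in cohorts:
        col = f"frq_{cohort}"
        sites = set()         # unique (contig, position) sites seen in this cohort
        nonsyn_sites = set()  # subset of sites with a nonsynonymous allele
        for row in freq_rows:
            # Assumption: a site counts only if the allele is present in the cohort.
            if row[col] > 0:
                site = (row["contig"], row["position"])
                sites.add(site)
                if row["effect"] == NONSYN_EFFECT:
                    nonsyn_sites.add(site)
        freqs = [row[col] for row in freq_rows]
        records.append(
            {
                "cohort": cohort,
                "total_snp_count": len(sites),
                "nonsynonymous_snp_count": len(nonsyn_sites),
                "mean_allele_frequency": fmean(freqs) if freqs else float("nan"),
            }
        )
    return records


# Toy data: two alternate alleles at one site, one allele at another.
freq_rows = [
    {"contig": "2L", "position": 100, "effect": "NON_SYNONYMOUS_CODING", "frq_A": 0.5, "frq_B": 0.0},
    {"contig": "2L", "position": 100, "effect": "SYNONYMOUS_CODING", "frq_A": 0.25, "frq_B": 0.0},
    {"contig": "2L", "position": 250, "effect": "SYNONYMOUS_CODING", "frq_A": 0.0, "frq_B": 0.5},
]
features = cohort_feature_rows(freq_rows, ["A", "B"])
```

The resulting list of records can be passed straight to pandas.DataFrame(features) to get the one-row-per-cohort table described above.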
Implementation details
Implemented as AnophelesSnpFrequencyAnalysis.snp_feature_matrix in malariagen_data/anoph/snp_frq.py, decorated with @_check_types and @doc(...) like other public methods.
Supports:
Cohort mode (cohorts provided): one row per cohort; counts based on unique (contig, position) sites; mean_allele_frequency from each frq_ column.
Sample mode (cohorts is None): one row per sample; per‑sample counts from snp_genotype_allele_counts; mean_allele_frequency from snp_allele_frequencies(..., cohorts={"all": "True"}).
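Sample mode can be sketched in the same spirit. This is a pure-Python illustration under stated assumptions, not the PR code. The helper name sample_feature_rows, the nested-list layout allele_counts[variant][sample][allele] with index 0 as the reference allele, the per-variant is_nonsyn flags, and the sample IDs are all hypothetical. As I read the description, the mean allele frequency comes from a single pooled cohort, so the same value is attached to every sample row here.

```python
def sample_feature_rows(sample_ids, allele_counts, is_nonsyn, pooled_frequencies):
    """Build one feature record per sample (hypothetical helper).

    allele_counts[v][s] lists per-allele counts for variant v and sample s,
    with index 0 the reference allele (assumed layout).
    is_nonsyn[v] flags variants annotated as nonsynonymous (assumed input).
    pooled_frequencies holds frequencies from one all-samples cohort.
    """
    if pooled_frequencies:
        mean_af = sum(pooled_frequencies) / len(pooled_frequencies)
    else:
        mean_af = float("nan")

    records = []
    for s, sample_id in enumerate(sample_ids):
        total = 0
        nonsyn = 0
        for v, per_sample in enumerate(allele_counts):
            # The sample carries a SNP at this site if any non-reference count is positive.
            if any(count > 0 for count in per_sample[s][1:]):
                total += 1
                nonsyn += int(is_nonsyn[v])
        records.append(
            {
                "sample_id": sample_id,
                "total_snp_count": total,
                "nonsynonymous_snp_count": nonsyn,
                "mean_allele_frequency": mean_af,
            }
        )
    return records


# Toy data: 3 variants x 2 samples x 4 alleles (reference plus three alternates).
allele_counts = [
    [[2, 0, 0, 0], [1, 1, 0, 0]],
    [[0, 2, 0, 0], [2, 0, 0, 0]],
    [[1, 0, 1, 0], [1, 1, 0, 0]],
]
is_nonsyn = [True, False, True]
rows = sample_feature_rows(["S1", "S2"], allele_counts, is_nonsyn, [0.25, 0.5, 0.5])
```

Attaching one pooled value to every row makes mean_allele_frequency constant within a call in sample mode, which may be worth revisiting if a per-sample summary is more useful as a feature.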
Includes two tests in tests/anoph/test_snp_frq.py that check expected columns and non‑empty results using ag3_sim.
I’ve opened a PR implementing this and am very happy to adjust the feature set or signature if you’d prefer.