Add helper to compute simple SNP feature matrix for ML workflows #1185

@ishaniray1

Downstream users often want to feed MalariaGEN SNP/genotype data into machine‑learning models (e.g. taxon classification, resistance prediction), but there is no small, documented helper that turns SNP data into a simple, ML‑ready feature matrix. Each user currently has to re‑implement feature extraction on top of snp_allele_frequencies() and snp_genotype_allele_counts().

Proposed

Add a lightweight helper snp_feature_matrix(...) on the Anopheles SNP frequency analysis API that:
- Accepts the standard parameters (transcript or region, sample_sets, sample_query, optional cohorts).
- Returns a pandas.DataFrame where rows are samples or cohorts and columns are:
  - total_snp_count
  - nonsynonymous_snp_count
  - mean_allele_frequency
- Internally uses existing public methods such as snp_allele_frequencies() and snp_genotype_allele_counts() (no ML inside the package).
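
To make the proposed columns concrete, here is a minimal pandas-only sketch of how one row per cohort could be derived from a frequency table. It does not call malariagen_data. The toy table, the helper name cohort_feature_matrix, the column names (contig, position, effect, frq_*), the effect label, and the choice to count only sites with non-zero frequency in a cohort are all assumptions for illustration, not the PR's actual code.

```python
import pandas as pd

# Toy stand-in for a SNP allele frequency table. The schema used here
# (contig, position, effect, frq_<cohort>) is an assumption for illustration.
freqs = pd.DataFrame(
    {
        "contig": ["2L", "2L", "2L", "2L"],
        "position": [100, 100, 250, 400],
        "effect": [
            "NON_SYNONYMOUS_CODING",
            "SYNONYMOUS_CODING",
            "NON_SYNONYMOUS_CODING",
            "SYNONYMOUS_CODING",
        ],
        "frq_cohort_a": [0.5, 0.0, 0.25, 0.0],
        "frq_cohort_b": [0.0, 0.1, 0.0, 0.3],
    }
)


def cohort_feature_matrix(freqs, nonsyn_label="NON_SYNONYMOUS_CODING"):
    """Hypothetical helper: build one feature row per frq_ column."""
    site_cols = ["contig", "position"]
    rows = {}
    for col in [c for c in freqs.columns if c.startswith("frq_")]:
        # Assumption: a SNP counts for a cohort only if its frequency is > 0.
        present = freqs[freqs[col] > 0]
        sites = present[site_cols].drop_duplicates()
        nonsyn = present.loc[present["effect"] == nonsyn_label, site_cols]
        rows[col[len("frq_"):]] = {
            "total_snp_count": len(sites),
            "nonsynonymous_snp_count": len(nonsyn.drop_duplicates()),
            # Assumption: the mean is taken over every row of the frq_ column.
            "mean_allele_frequency": freqs[col].mean(),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


features = cohort_feature_matrix(freqs)
print(features)
```

Whether the mean should run over all rows or only over segregating sites is a design choice this issue leaves open, so the sketch just picks one option.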

Implementation details
Implemented as AnophelesSnpFrequencyAnalysis.snp_feature_matrix in malariagen_data/anoph/snp_frq.py, decorated with _check_types and @doc(...) like other public methods.

Supports:
- Cohort mode (cohorts provided): one row per cohort; counts based on unique (contig, position) sites; mean_allele_frequency from each frq_ column.
- Sample mode (cohorts is None): one row per sample; per‑sample counts from snp_genotype_allele_counts; mean AF from snp_allele_frequencies(..., cohorts={"all": "True"}).
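
As a rough illustration of sample mode, the sketch below derives per-sample rows from a toy allele-count array plus a per-site frequency vector. Everything here is assumed for the example: the array shape (variants, samples, alleles) with allele 0 as reference, the is_nonsyn flag, the frq_all vector standing in for the "all" cohort frequencies, the helper name sample_feature_matrix, and the choice to average frq_all over the sites where a sample carries a non-reference allele. It is not the PR's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical per-sample allele counts, shape (n_variants, n_samples, n_alleles).
# Allele index 0 is assumed to be the reference allele.
ac = np.array(
    [
        [[2, 0, 0, 0], [1, 1, 0, 0], [0, 2, 0, 0]],
        [[2, 0, 0, 0], [2, 0, 0, 0], [1, 0, 1, 0]],
        [[0, 1, 0, 1], [1, 1, 0, 0], [2, 0, 0, 0]],
    ]
)
is_nonsyn = np.array([True, False, True])  # assumed per-site annotation
frq_all = np.array([0.5, 0.1, 0.4])  # assumed per-site frequency in an "all" cohort
sample_ids = ["S1", "S2", "S3"]


def sample_feature_matrix(ac, is_nonsyn, frq_all, sample_ids):
    """Hypothetical helper: build one feature row per sample."""
    # A sample "carries" a SNP if it has at least one non-reference allele there.
    carries_alt = ac[:, :, 1:].sum(axis=2) > 0  # shape (variants, samples)
    total = carries_alt.sum(axis=0)
    nonsyn = (carries_alt & is_nonsyn[:, None]).sum(axis=0)
    # Assumption: mean AF is averaged over the sites each sample carries;
    # samples carrying no SNPs get NaN instead of a division error.
    carried_sum = (carries_alt * frq_all[:, None]).sum(axis=0)
    mean_af = np.divide(
        carried_sum,
        total,
        out=np.full(total.shape, np.nan),
        where=total > 0,
    )
    return pd.DataFrame(
        {
            "total_snp_count": total,
            "nonsynonymous_snp_count": nonsyn,
            "mean_allele_frequency": mean_af,
        },
        index=pd.Index(sample_ids, name="sample_id"),
    )


sample_features = sample_feature_matrix(ac, is_nonsyn, frq_all, sample_ids)
print(sample_features)
```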

Includes two tests in tests/anoph/test_snp_frq.py that check expected columns and non‑empty results using ag3_sim.
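
The kind of check those tests perform could look roughly like the sketch below. It runs against a hand-built DataFrame rather than the ag3_sim fixture, and the helper name check_feature_matrix and the extra sanity checks (non-synonymous count not exceeding the total, frequencies within [0, 1]) are assumptions added for illustration.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "total_snp_count",
    "nonsynonymous_snp_count",
    "mean_allele_frequency",
]


def check_feature_matrix(df):
    """Hypothetical assertion helper for a feature matrix."""
    assert isinstance(df, pd.DataFrame)
    assert list(df.columns) == EXPECTED_COLUMNS
    assert len(df) > 0  # non-empty result
    # Extra sanity checks, not stated in the issue:
    assert (df["nonsynonymous_snp_count"] <= df["total_snp_count"]).all()
    af = df["mean_allele_frequency"].dropna()
    assert ((af >= 0) & (af <= 1)).all()


# Toy matrix standing in for the output of the proposed helper.
toy = pd.DataFrame(
    {
        "total_snp_count": [12, 7],
        "nonsynonymous_snp_count": [3, 0],
        "mean_allele_frequency": [0.08, 0.02],
    },
    index=["cohort_a", "cohort_b"],
)
check_feature_matrix(toy)
```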

I’ve opened a PR implementing this and am very happy to adjust the feature set or signature if you’d prefer.
