feat: Add SNP feature matrix helper function for ML workflows (Issue #1185) #1248

Open
janhavitupe wants to merge 1 commit into malariagen:master from janhavitupe:feature/snp-feature-matrix
@janhavitupe
Summary

  • Implement snp_feature_matrix method in AnophelesSnpFrequencyAnalysis class
  • Generate ML-ready feature matrix with total SNP count, nonsynonymous SNP count, and mean allele frequency
  • Support both cohort mode (aggregated by cohorts) and sample mode (per-sample)
  • Accept either transcript or genomic region as input
  • Include a test suite covering cohort mode, sample mode, region input, input validation, and minimal-parameter usage
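
For reviewers, here is a minimal sketch of what the cohort-mode aggregation could look like. It is not the code in this PR. It assumes a frequency table with one row per SNP, an `effect` column, and one `frq_<cohort>` column per cohort. The function name `cohort_feature_matrix`, the effect label `NON_SYNONYMOUS_CODING`, and the rule that a SNP counts only when its frequency in the cohort is above zero are all assumptions made for illustration.

```python
import pandas as pd


def cohort_feature_matrix(
    df_freq: pd.DataFrame,
    nonsyn_effect: str = "NON_SYNONYMOUS_CODING",  # assumed effect label
) -> pd.DataFrame:
    """Summarise a SNP-by-cohort frequency table into one row per cohort.

    Hypothetical helper: assumes columns named 'frq_<cohort>' plus 'effect'.
    """
    prefix = "frq_"
    rows = {}
    for col in df_freq.columns:
        if not col.startswith(prefix):
            continue
        freqs = df_freq[col]
        # Assumption: a SNP is counted only if it is present in the cohort.
        present = freqs > 0
        nonsyn = present & (df_freq["effect"] == nonsyn_effect)
        rows[col[len(prefix):]] = {
            "total_snp_count": int(present.sum()),
            "nonsynonymous_snp_count": int(nonsyn.sum()),
            # Mean over the counted SNPs; 0.0 when the cohort has none.
            "mean_allele_frequency": (
                float(freqs[present].mean()) if present.any() else 0.0
            ),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy input: three SNPs, two cohorts.
df_freq = pd.DataFrame(
    {
        "effect": [
            "NON_SYNONYMOUS_CODING",
            "SYNONYMOUS_CODING",
            "NON_SYNONYMOUS_CODING",
        ],
        "frq_cohort_a": [0.5, 0.25, 0.0],
        "frq_cohort_b": [0.0, 0.0, 0.1],
    }
)
features = cohort_feature_matrix(df_freq)
print(features)
```

The result has one row per cohort and the three feature columns named in the summary, so it can be passed directly to most ML libraries as a feature matrix.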

Implementation Details

  • Main function: snp_feature_matrix() with full parameter support
  • Helper functions: _snp_feature_matrix_cohort_mode() and _snp_feature_matrix_sample_mode()
  • Input validation: Ensures exactly one of transcript or region is provided
  • Output: pandas DataFrame with columns total_snp_count, nonsynonymous_snp_count, mean_allele_frequency
  • Metadata: Includes descriptive title with transcript/region information
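
The "exactly one of transcript or region" rule and the descriptive title can be illustrated with a small standalone sketch. The function name `resolve_target`, the error message, and the title wording are hypothetical and may differ from what the PR implements.

```python
from typing import Optional


def resolve_target(
    transcript: Optional[str] = None,
    region: Optional[str] = None,
) -> str:
    """Validate the inputs and return a label usable in a title.

    Hypothetical helper: raises ValueError unless exactly one of
    `transcript` or `region` is given.
    """
    # True when both are None or both are set; either case is invalid.
    if (transcript is None) == (region is None):
        raise ValueError("Provide exactly one of 'transcript' or 'region'.")
    if transcript is not None:
        return f"transcript {transcript}"
    return f"region {region}"


print(resolve_target(transcript="AGAP004707-RA"))
print(resolve_target(region="2L:2,400,000-2,450,000"))
```

Comparing the two `is None` results rejects both invalid cases (neither given, both given) with a single condition.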

Testing

  • Added five test functions covering:
    • Cohort mode with multiple cohorts
    • Sample mode for individual samples
    • Region-based analysis
    • Input validation error handling
    • Minimal parameter usage
  • All new tests pass, and existing functionality is unchanged
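
As a rough illustration of the kind of check such tests might share, the sketch below verifies the output contract described in this PR: the three feature columns are present, counts are consistent, and frequencies lie in [0, 1]. The helper name `check_feature_matrix` and the example values are hypothetical. This is not the test code added in the PR.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "total_snp_count",
    "nonsynonymous_snp_count",
    "mean_allele_frequency",
]


def check_feature_matrix(df: pd.DataFrame) -> None:
    """Assert basic invariants of a feature matrix (hypothetical helper)."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    assert not missing, f"missing columns: {missing}"
    assert (df["total_snp_count"] >= 0).all()
    # Nonsynonymous SNPs are a subset of all SNPs.
    assert (df["nonsynonymous_snp_count"] <= df["total_snp_count"]).all()
    # Assumes no NaN values; NaN would fail this range check.
    assert df["mean_allele_frequency"].between(0.0, 1.0).all()


# Made-up example values, one row per cohort.
example = pd.DataFrame(
    {
        "total_snp_count": [12, 7],
        "nonsynonymous_snp_count": [3, 0],
        "mean_allele_frequency": [0.08, 0.21],
    },
    index=["cohort_a", "cohort_b"],
)
check_feature_matrix(example)
```

The same helper could be called from both the cohort-mode and sample-mode tests, since the output columns are the same in both modes.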

Code Quality

  • Follows project coding standards
  • Proper type hints and docstrings
  • Linting and formatting checks pass
  • No breaking changes to existing API

Fixes #1185

- Add snp_feature_matrix() method to AnophelesSnpFrequencyAnalysis class
- Supports both cohort mode (one row per cohort) and sample mode (one row per sample)
- Returns DataFrame with columns: total_snp_count, nonsynonymous_snp_count, mean_allele_frequency
- Uses existing public methods: snp_allele_frequencies() and snp_genotype_allele_counts()
- Add comprehensive test suite with 5 test functions covering all scenarios
- Remove unnecessary comments for cleaner codebase
- All code quality checks pass (ruff, py_compile)

Fixes malariagen#1185
Development

Successfully merging this pull request may close these issues.

Add helper to compute simple SNP feature matrix for ML workflows