Migrate gene_cnv() from AnophelesCnvFrequencyAnalysis to AnophelesCnvData (resolves dipclust.py TODO) #1208

@khushthecoder

Summary

There is a TODO in malariagen_data/anoph/dipclust.py indicating that gene_cnv() still needs to be migrated into the modular CNV “data-layer” mixin architecture used by the rest of malariagen_data/anoph/.

Specifically, gene_cnv() is currently implemented in malariagen_data/anoph/cnv_frq.py (as part of AnophelesCnvFrequencyAnalysis) but is conceptually a data access method (returns an xr.Dataset of modal copy number by gene). This creates architectural inconsistency and forces dipclust.py to call it in a way that bypasses the standard class-hierarchy pattern.

How to reproduce this issue

  1. Open malariagen_data/anoph/dipclust.py.
  2. Jump to the TODO block at ~lines 401–404:
    • It explicitly says gene_cnv() needs to be migrated to the AnophelesCnvData class so it can be found in the class hierarchy.
  3. Confirm where gene_cnv() lives today:
    • Search in malariagen_data/anoph/ and you’ll find gene_cnv() implemented in malariagen_data/anoph/cnv_frq.py.
    • Verify that malariagen_data/anoph/cnv_data.py (which defines AnophelesCnvData) does not contain gene_cnv().

This mismatch between dipclust.py (TODO + type ignore) and the actual method location is the root problem.

Problem / Current state

  • dipclust.py contains:
    • a TODO noting gene_cnv() must be migrated to AnophelesCnvData
    • a self.gene_cnv(...) # type: ignore call inside _dipclust_cnv_bar_trace()
  • gene_cnv() is implemented in malariagen_data/anoph/cnv_frq.py (within AnophelesCnvFrequencyAnalysis), not in malariagen_data/anoph/cnv_data.py (AnophelesCnvData).

As a result:

  • The API surface is inconsistent: a “gene copy number data access” method is tied to a “frequency analysis” class rather than the CNV data mixin.
  • Some analysis modules/components rely on inheritance via specific analysis classes instead of the intended data-layer hierarchy.
  • Type-checking / linting hints (the # type: ignore) show the hierarchy is not clean for method discovery.
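
The situation described above can be reduced to a small, self-contained sketch. The class names below are simplified stand-ins for AnophelesCnvData, AnophelesCnvFrequencyAnalysis and AnophelesDipClustAnalysis, and the base classes, method bodies and return values are assumptions made for illustration only; they are not copied from malariagen_data.

```python
class CnvData:
    """Stand-in for AnophelesCnvData (the data-layer mixin)."""

    def cnv_hmm(self, region):
        # Placeholder for an existing data access method.
        return {"region": region}


class CnvFrequencyAnalysis(CnvData):
    """Stand-in for AnophelesCnvFrequencyAnalysis."""

    def gene_cnv(self, region):
        # Hypothetical return value; the real method returns an xr.Dataset.
        return {"region": region, "CN_mode": [2, 3]}


class DipClustAnalysis(CnvData):
    """Stand-in for AnophelesDipClustAnalysis (assumed bases)."""

    def cnv_bar_trace(self, region):
        # gene_cnv() is not defined on any base of this class, so a static
        # type checker reports an unknown attribute unless it is silenced.
        return self.gene_cnv(region)  # type: ignore


class Api(DipClustAnalysis, CnvFrequencyAnalysis):
    """Stand-in for a concrete API class that combines both mixins."""
```

In this sketch the call only works on Api, where CnvFrequencyAnalysis happens to be mixed in alongside DipClustAnalysis. Calling cnv_bar_trace() on DipClustAnalysis alone raises AttributeError, which is the fragility the # type: ignore hides.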

Why this is important (Impact)

The current layout increases fragility and maintenance cost:

  • Architectural consistency: Other CNV “data” methods live in cnv_data.py under AnophelesCnvData, but gene_cnv() does not, which breaks the intended pattern.
  • Onboarding & extensibility: Adding or reusing gene-level CNV datasets in new modules becomes harder because developers must know that gene_cnv() lives in an analysis mixin rather than the data mixin.
  • Inheritance correctness: If a future class inherits only AnophelesCnvData (and not AnophelesCnvFrequencyAnalysis), it will not naturally get gene_cnv(), even though it is a data access method.
  • Maintainability: Moving logic into the correct layer reduces the chances of duplicated logic and future “workarounds” (type: ignore, direct calls, or special-case imports).

Proposed fix

  1. Move gene_cnv() (and its internal helper, e.g. _gene_cnv)
    • From malariagen_data/anoph/cnv_frq.py
    • Into malariagen_data/anoph/cnv_data.py
    • As methods on AnophelesCnvData.
  2. Update callers to use the standard hierarchy
    • malariagen_data/anoph/cnv_frq.py:
      • Adjust AnophelesCnvFrequencyAnalysis so it calls the moved gene_cnv() implementation (directly or via a shared helper).
    • malariagen_data/anoph/dipclust.py:
      • Remove the TODO and remove the # type: ignore if possible after the method is available on the proper class hierarchy.
  3. Keep public behavior stable
    • Preserve the existing gene_cnv() signature and returned dataset structure so downstream plots/analyses don’t break.
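
A minimal sketch of the layout after the proposed move, using simplified stand-in classes rather than the real malariagen_data code. The method bodies, the dictionary return value and the amplified_fraction() method are hypothetical and exist only to show that both consumers resolve gene_cnv() through the data mixin.

```python
class CnvData:
    """Stand-in for AnophelesCnvData after the migration."""

    def _gene_cnv(self, region):
        # Hypothetical internal helper; the real one would compute modal
        # copy number by gene and return an xr.Dataset.
        return {"region": region, "CN_mode": [2, 3, 4, 2]}

    def gene_cnv(self, region):
        # Public signature stays as it was; only the owning class changes.
        return self._gene_cnv(region)


class CnvFrequencyAnalysis(CnvData):
    """Stand-in for AnophelesCnvFrequencyAnalysis."""

    def amplified_fraction(self, region):
        # Hypothetical analysis method that consumes the inherited data method.
        cn_mode = self.gene_cnv(region)["CN_mode"]
        return sum(cn > 2 for cn in cn_mode) / len(cn_mode)


class DipClustAnalysis(CnvData):
    """Stand-in for AnophelesDipClustAnalysis (assumed bases)."""

    def cnv_bar_trace(self, region):
        # The method is now visible in the class hierarchy, so no
        # "# type: ignore" is needed here.
        return self.gene_cnv(region)["CN_mode"]
```

With this arrangement, a class that inherits only from the data mixin still gets gene_cnv(), and the frequency analysis class keeps working without owning the implementation.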

Tests / Acceptance criteria

  • Existing CNV/dipclust-related test suites pass.
  • Add/adjust unit tests to verify:
    1. AnophelesCnvData exposes gene_cnv() (or that instances in the expected hierarchy do).
    2. AnophelesDipClustAnalysis no longer needs the # type: ignore for self.gene_cnv(...) (or at least no runtime break occurs).
    3. Output schema of gene_cnv() is unchanged (coords/data_vars expected by downstream code).
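
One possible shape for the schema check in item 3 is sketched below. It relies only on the fact that an xr.Dataset exposes coords and data_vars as mappings keyed by name. The dataset_schema() helper, the fake_dataset() stand-in and the variable names used in it are hypothetical; a real test would capture the schema from the pre-migration gene_cnv() output instead.

```python
from types import SimpleNamespace


def dataset_schema(ds):
    """Return the coordinate and data variable names of a dataset-like object."""
    return {
        "coords": tuple(sorted(ds.coords)),
        "data_vars": tuple(sorted(ds.data_vars)),
    }


def fake_dataset(coords, data_vars):
    # Stand-in exposing the same two mapping attributes as xr.Dataset.
    return SimpleNamespace(
        coords=dict.fromkeys(coords),
        data_vars=dict.fromkeys(data_vars),
    )


# Placeholder names; these are not claimed to be the real gene_cnv() variables.
before = fake_dataset(["gene_id", "sample_id"], ["CN_mode"])
after = fake_dataset(["sample_id", "gene_id"], ["CN_mode"])

# Name order does not matter, only the set of names in each group.
schema_unchanged = dataset_schema(before) == dataset_schema(after)
```

Comparing names rather than values keeps the check cheap and independent of the underlying data release, at the cost of not detecting changes in dtypes or dimensions.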

Implementation approach (high level)

  • Extract the current gene_cnv() / _gene_cnv code from cnv_frq.py.
  • Paste into AnophelesCnvData in cnv_data.py with minimal changes.
  • Update imports so cnv_data.py has access to whatever dependencies gene_cnv() currently uses (e.g., Region, _parse_multi_region, _cn_mode, genome feature access, etc.).
  • Refactor AnophelesCnvFrequencyAnalysis to call self.gene_cnv(...) from the data mixin.
  • Update dipclust.py to call gene_cnv() without the migration warning/type workaround.

Reference

  • TODO location: malariagen_data/anoph/dipclust.py (~lines 401–404)
  • Current implementation location: malariagen_data/anoph/cnv_frq.py
  • Intended target location: malariagen_data/anoph/cnv_data.py (AnophelesCnvData)
