-
Notifications
You must be signed in to change notification settings - Fork 178
Migrate gene_cnv() from AnophelesCnvFrequencyAnalysis to AnophelesCnvData (resolves dipclust.py TODO) #1208
Description
Summary
There is a TODO in malariagen_data/anoph/dipclust.py indicating that gene_cnv() still needs to be migrated into the modular CNV “data-layer” mixin architecture used by the rest of malariagen_data/anoph/.
Specifically, gene_cnv() is currently implemented in malariagen_data/anoph/cnv_frq.py (as part of AnophelesCnvFrequencyAnalysis) but is conceptually a data access method (returns an xr.Dataset of modal copy number by gene). This creates architectural inconsistency and forces dipclust.py to call it in a way that bypasses the standard class-hierarchy pattern.
How to generate this issue
- Open
malariagen_data/anoph/dipclust.py. - Jump to the TODO block at ~lines 401–404:
- It explicitly says
gene_cnv()needs to be migrated to theAnophelesCnvDataclass so it can be found in the class hierarchy.
- It explicitly says
- Confirm where
gene_cnv()lives today:- Search in
malariagen_data/anoph/and you’ll findgene_cnv()implemented inmalariagen_data/anoph/cnv_frq.py. - Verify that
malariagen_data/anoph/cnv_data.py(which definesAnophelesCnvData) does not containgene_cnv().
- Search in
This mismatch between dipclust.py (TODO + type ignore) and the actual method location is the root problem.
Problem / Current state
dipclust.pycontains:- a TODO noting
gene_cnv()must be migrated toAnophelesCnvData - a
self.gene_cnv(...) # type: ignorecall inside_dipclust_cnv_bar_trace()
- a TODO noting
gene_cnv()is implemented inmalariagen_data/anoph/cnv_frq.py(withinAnophelesCnvFrequencyAnalysis), not inmalariagen_data/anoph/cnv_data.py(AnophelesCnvData).
As a result:
- The API surface is inconsistent: a “gene copy number data access” method is tied to a “frequency analysis” class rather than the CNV data mixin.
- Some analysis modules/components rely on inheritance via specific analysis classes instead of the intended data-layer hierarchy.
- Type-checking / linting hints (the
# type: ignore) show the hierarchy is not clean for method discovery.
Why this is important (Impact)
This is important because it increases fragility and maintenance cost:
- Architectural consistency: Other CNV “data” methods live in
cnv_data.pyunderAnophelesCnvData, butgene_cnv()does not—this breaks the intended pattern. - Onboarding & extensibility: Adding or reusing gene-level CNV datasets in new modules becomes harder because developers must know that
gene_cnv()lives in an analysis mixin rather than the data mixin. - Inheritance correctness: If a future class inherits only
AnophelesCnvData(and notAnophelesCnvFrequencyAnalysis), it will not naturally getgene_cnv(), even though it is a data access method. - Maintainability: Moving logic into the correct layer reduces the chances of duplicated logic and future “workarounds” (
type: ignore, direct calls, or special-case imports).
Proposed fix
- Move
gene_cnv()(and its internal helper, e.g._gene_cnv)- From
malariagen_data/anoph/cnv_frq.py - Into
malariagen_data/anoph/cnv_data.py - As methods on
AnophelesCnvData.
- From
- Update callers to use the standard hierarchy
malariagen_data/anoph/cnv_frq.py:- Adjust
AnophelesCnvFrequencyAnalysisso it calls the movedgene_cnv()implementation (directly or via a shared helper).
- Adjust
malariagen_data/anoph/dipclust.py:- Remove the TODO and remove the
# type: ignoreif possible after the method is available on the proper class hierarchy.
- Remove the TODO and remove the
- Keep public behavior stable
- Preserve the existing
gene_cnv()signature and returned dataset structure so downstream plots/analyses don’t break.
- Preserve the existing
Tests / Acceptance criteria
- Existing CNV/dipclust-related test suites pass.
- Add/adjust unit tests to verify:
AnophelesCnvDataexposesgene_cnv()(or that instances in the expected hierarchy do).AnophelesDipClustAnalysisno longer needs the# type: ignoreforself.gene_cnv(...)(or at least no runtime break occurs).- Output schema of
gene_cnv()is unchanged (coords/data_vars expected by downstream code).
Implementation approach (high level)
- Extract the current
gene_cnv()/_gene_cnvcode fromcnv_frq.py. - Paste into
AnophelesCnvDataincnv_data.pywith minimal changes. - Update imports so
cnv_data.pyhas access to whatever dependenciesgene_cnv()currently uses (e.g.,Region,_parse_multi_region,_cn_mode, genome feature access, etc.). - Refactor
AnophelesCnvFrequencyAnalysisto callself.gene_cnv(...)from the data mixin. - Update
dipclust.pyto callgene_cnv()without the migration warning/type workaround.
Reference
- TODO location:
malariagen_data/anoph/dipclust.py(~lines 401–404) - Current implementation location:
malariagen_data/anoph/cnv_frq.py - Intended target location:
malariagen_data/anoph/cnv_data.py(AnophelesCnvData)