Issue Description
The codebase uses Python assert statements for input validation and invariant checks in production code (non-test files). This is problematic because assert statements are silently stripped when Python runs with -O (optimize flag), meaning all validation checks disappear in optimized builds — leading to confusing downstream errors or silent data corruption.
There are 50+ assert statements across production modules:
| File | Count | Example |
|---|---|---|
| `util.py` | ~20 | `assert indexer.ndim == 1` |
| `anoph/snp_data.py` | ~6 | `assert contig in self.contigs` |
| `mjn.py` | 1 | `assert metric in ["hamming", "jaccard"]` |
| `anoph/hap_data.py` | 3 | `assert contig in self.contigs` |
| `anoph/hap_frq.py` | 2 | `assert n_samples >= min_cohort_size` |
| `anoph/snp_frq.py` | 3 | `assert nobs_mode == "fixed"` |
| `anoph/cnv_frq.py` | 1 | `assert nobs_mode == "fixed"` |
| `anoph/genome_features.py` | 2 | `assert contig in self.contigs` |
| `anoph/genome_sequence.py` | 1 | `assert contig in self.contigs` |
| `anoph/h1x.py` | 2 | `assert ha.ndim == hb.ndim == 2` |
| `anoph/sample_metadata.py` | 2 | `assert self._aim_metadata_columns is not None` |
| `anoph/aim_data.py` | 2 | `assert self._aim_palettes is not None` |
| `anopheles.py` | 1 | `assert np.count_nonzero(loc_j) == n_sites_j` |
Before / After Example
    # ❌ Current: silently skipped with python -O
    assert metric in ["hamming", "jaccard"]

    # ✅ Proposed: always enforced, clear error message
    if metric not in ("hamming", "jaccard"):
        raise ValueError(f"metric must be 'hamming' or 'jaccard', got {metric!r}")
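The stripping behavior can be observed without restarting the interpreter, because the built-in `compile()` accepts an `optimize` argument that mirrors the `-O` flag. The following is a minimal sketch under that assumption; `check_metric_assert` and `check_metric_raise` are hypothetical stand-ins for the two validation styles, not functions taken from `mjn.py`.

```python
# Hypothetical stand-ins for the two validation styles; not code from mjn.py.
SOURCE = '''
def check_metric_assert(metric):
    assert metric in ("hamming", "jaccard")
    return metric

def check_metric_raise(metric):
    if metric not in ("hamming", "jaccard"):
        raise ValueError(f"metric must be 'hamming' or 'jaccard', got {metric!r}")
    return metric
'''


def load(optimize):
    """Compile SOURCE at the given optimization level and return its namespace."""
    namespace = {}
    # optimize=0 keeps assert statements; optimize=1 corresponds to `python -O`.
    exec(compile(SOURCE, "<sketch>", "exec", optimize=optimize), namespace)
    return namespace


def outcome(func, value):
    """Return the name of the exception raised by func(value), or 'accepted'."""
    try:
        func(value)
    except (AssertionError, ValueError) as exc:
        return type(exc).__name__
    return "accepted"


normal = load(optimize=0)
optimized = load(optimize=1)

results = {
    "assert, normal": outcome(normal["check_metric_assert"], "cosine"),
    "assert, -O": outcome(optimized["check_metric_assert"], "cosine"),
    "raise, normal": outcome(normal["check_metric_raise"], "cosine"),
    "raise, -O": outcome(optimized["check_metric_raise"], "cosine"),
}
for label, result in results.items():
    print(f"{label}: {result}")
```

Only the `assert`-based check changes behavior between the two levels: at `optimize=1` the invalid value `"cosine"` is accepted, while the explicit `raise` rejects it in both cases.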
Why This Issue Is Important
Reliability: Users who run Python with -O (for example in some Docker images, production deployments, or CI pipelines) silently lose all validation. An invalid metric such as "cosine" could slip through mjn.py and produce garbage results instead of a clear error.
Debuggability: Most of these assert statements carry no error message. When one fires, the user sees a bare AssertionError with no context. Proper exceptions provide actionable information (e.g., "Expected contig '2L' but got 'chr2L'").
Python best practice: The Python language reference notes that no code is emitted for assert statements when optimization is requested, so they cannot be relied on for data validation. Linters flag the pattern for the same reason (Ruff S101, Bandit B101).
Data integrity: In a genomics library, silent shape mismatches or wrong contigs could produce scientifically incorrect results without any error, which is far worse than a crash.
My Approach to Solve
Step 1: Categorize each assert by type
| Category | Replacement Exception | Example |
|---|---|---|
| Input validation | `ValueError` / `TypeError` | `assert metric in [...]` → `raise ValueError(...)` |
| State invariants | `RuntimeError` | `assert self._aim_palettes is not None` → `raise RuntimeError(...)` |
| Shape/dimension checks | `ValueError` | `assert arr.ndim == 1` → `raise ValueError(f"Expected 1-D array, got {arr.ndim}-D")` |
| Internal consistency | `RuntimeError` | `assert nobs_mode == "fixed"` → `raise RuntimeError(...)` |
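One way to keep the categories consistent across modules would be a few small helpers, one per category. This is only a sketch of that option: the helper names `check_choice`, `check_ndim`, and `require_state` are hypothetical and do not exist in the codebase, and a `SimpleNamespace` stands in for an array so the example runs without numpy.

```python
from types import SimpleNamespace


def check_choice(name, value, allowed):
    """Input validation: reject a value outside the allowed set."""
    if value not in allowed:
        raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value!r}")
    return value


def check_ndim(name, arr, expected):
    """Shape check: works for any object exposing an integer `ndim`."""
    if arr.ndim != expected:
        raise ValueError(f"Expected {name} to be {expected}-D, got {arr.ndim}-D")
    return arr


def require_state(value, message):
    """State invariant: the object was not set up the way the caller assumed."""
    if value is None:
        raise RuntimeError(message)
    return value


# Stand-in for an array; numpy arrays expose the same `ndim` attribute.
fake_array = SimpleNamespace(ndim=2)

messages = []
for call in (
    lambda: check_choice("metric", "cosine", {"hamming", "jaccard"}),
    lambda: check_ndim("indexer", fake_array, 1),
    lambda: require_state(None, "AIM palettes have not been loaded."),
):
    try:
        call()
    except (ValueError, RuntimeError) as exc:
        messages.append(f"{type(exc).__name__}: {exc}")

print("\n".join(messages))
```

Inline `if ... raise` checks, as proposed in this issue, work equally well; helpers mainly reduce repetition where the same check (such as contig membership) appears in several files.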
Step 2: Replace in priority order
1. `util.py` first: the most widely used core functions (`_dask_compress_dataset`, `_da_compress`, allele mapping)
2. `anoph/snp_data.py` and `anoph/hap_data.py`: user-facing data access, contig validation
3. `mjn.py`: public-facing parameter validation
4. Remaining `anoph/` modules: `hap_frq.py`, `snp_frq.py`, `cnv_frq.py`, `genome_features.py`, `genome_sequence.py`, `h1x.py`, `aim_data.py`, `sample_metadata.py`
5. `anopheles.py`: one occurrence
Step 3: Add descriptive error messages
Every replacement will include a clear, actionable error message explaining what was expected vs. what was received:
    # Shape validation
    if indexer.ndim != 1:
        raise ValueError(
            f"Expected indexer to be 1-dimensional, got {indexer.ndim} dimensions"
        )

    # Contig validation
    if contig not in self.contigs:
        raise ValueError(
            f"Contig {contig!r} not found. Available contigs: {self.contigs}"
        )

    # State invariant
    if self._default_phasing_analysis is None:
        raise RuntimeError(
            "No default phasing analysis configured. "
            "Please specify the 'analysis' parameter explicitly."
        )
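The new messages can be pinned down with a small regression check so they stay actionable over time. The sketch below assumes a hypothetical `FakeGenome` class with a `contigs` attribute and a `check_contig` method; neither is part of the library, and the contig names are illustrative.

```python
class FakeGenome:
    """Hypothetical stand-in for a data-access class with a `contigs` attribute."""

    contigs = ("2L", "2R", "3L", "3R", "X")

    def check_contig(self, contig):
        # Same pattern as the contig validation proposed in this issue.
        if contig not in self.contigs:
            raise ValueError(
                f"Contig {contig!r} not found. Available contigs: {self.contigs}"
            )
        return contig


def error_message(func, *args):
    """Call func and return the text of the ValueError it raises, or None."""
    try:
        func(*args)
    except ValueError as exc:
        return str(exc)
    return None


genome = FakeGenome()
bad = error_message(genome.check_contig, "chr2L")
good = error_message(genome.check_contig, "2L")

# The message names the offending value and lists the accepted ones.
print(bad)
print(good)
```

Because the exception type changes from `AssertionError` to `ValueError` or `RuntimeError`, any existing tests or callers that expect `AssertionError` would need updating; whether such tests exist in this repository is not verified here.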
Step 4: Verify
- Run the existing test suite to confirm no regressions
- Verify tests pass with the `python -O` flag
- Confirm no `assert` statements remain in production code (only in `tests/`)
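The last check can be automated with the standard-library `ast` module, which finds real `assert` statements and ignores the word inside comments or strings. This is a minimal sketch; the function names `count_asserts` and `scan_package` are hypothetical, and it assumes test code lives in directories named `tests`.

```python
import ast
from collections import Counter
from pathlib import Path


def count_asserts(source):
    """Count assert statements in Python source text using the ast module."""
    tree = ast.parse(source)
    return sum(isinstance(node, ast.Assert) for node in ast.walk(tree))


def scan_package(root):
    """Map each .py file under root (skipping tests/ directories) to its assert count."""
    counts = Counter()
    root = Path(root)
    for path in sorted(root.rglob("*.py")):
        if "tests" in path.relative_to(root).parts:
            continue
        n = count_asserts(path.read_text(encoding="utf-8"))
        if n:
            counts[str(path.relative_to(root))] = n
    return counts


sample = '''
def f(x):
    assert x > 0
    # assert in a comment is not counted
    return "assert in a string is not counted"
'''
print(count_asserts(sample))
```

Calling `scan_package("malariagen_data")` from the repository root should return an empty `Counter` once the work is complete. Before that, `scan_package(...).most_common()` gives a per-file count that can guide the priority order in Step 2.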
Files to Modify
malariagen_data/util.py
malariagen_data/mjn.py
malariagen_data/anopheles.py
malariagen_data/anoph/snp_data.py
malariagen_data/anoph/hap_data.py
malariagen_data/anoph/hap_frq.py
malariagen_data/anoph/snp_frq.py
malariagen_data/anoph/cnv_frq.py
malariagen_data/anoph/genome_features.py
malariagen_data/anoph/genome_sequence.py
malariagen_data/anoph/h1x.py
malariagen_data/anoph/sample_metadata.py
malariagen_data/anoph/aim_data.py