Replace assert Statements with Proper Runtime Validation in Production Code #1259

@khushthecoder

Issue Description

The codebase uses Python assert statements for input validation and invariant checks in production code (non-test files). This is problematic because assert statements are silently stripped when Python runs with -O (optimize flag), meaning all validation checks disappear in optimized builds — leading to confusing downstream errors or silent data corruption.
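The stripping behaviour can be reproduced outside the library. The following is a minimal sketch, not code from this repository: it runs the same one-line script in a fresh interpreter with and without -O. The `SNIPPET` string and the `run_snippet` helper are hypothetical names used only for illustration, and the sketch assumes the PYTHONOPTIMIZE environment variable is not set.

```python
import subprocess
import sys

# A one-line script that relies on assert for validation.
SNIPPET = 'assert 1 == 2, "validation failed"; print("reached end")'


def run_snippet(optimize):
    """Run SNIPPET in a fresh interpreter, optionally with the -O flag."""
    cmd = [sys.executable]
    if optimize:
        cmd.append("-O")
    cmd += ["-c", SNIPPET]
    return subprocess.run(cmd, capture_output=True, text=True)


normal = run_snippet(optimize=False)
optimized = run_snippet(optimize=True)

print(normal.returncode)         # non-zero: AssertionError was raised
print(optimized.returncode)      # 0: the assert was compiled away
print(optimized.stdout.strip())  # the script ran past the failed check
```

Under -O the script reaches its final print even though the condition is false, which is the failure mode described above.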

There are 50+ assert statements across production modules:

| File | Count | Example |
| --- | --- | --- |
| util.py | ~20 | assert indexer.ndim == 1 |
| anoph/snp_data.py | ~6 | assert contig in self.contigs |
| mjn.py | 1 | assert metric in ["hamming", "jaccard"] |
| anoph/hap_data.py | 3 | assert contig in self.contigs |
| anoph/hap_frq.py | 2 | assert n_samples >= min_cohort_size |
| anoph/snp_frq.py | 3 | assert nobs_mode == "fixed" |
| anoph/cnv_frq.py | 1 | assert nobs_mode == "fixed" |
| anoph/genome_features.py | 2 | assert contig in self.contigs |
| anoph/genome_sequence.py | 1 | assert contig in self.contigs |
| anoph/h1x.py | 2 | assert ha.ndim == hb.ndim == 2 |
| anoph/sample_metadata.py | 2 | assert self._aim_metadata_columns is not None |
| anoph/aim_data.py | 2 | assert self._aim_palettes is not None |
| anopheles.py | 1 | assert np.count_nonzero(loc_j) == n_sites_j |

Before / After Example

# ❌ Current — silently skipped with python -O
assert metric in ["hamming", "jaccard"]
# ✅ Proposed — always enforced, clear error message
if metric not in ("hamming", "jaccard"):
    raise ValueError(f"metric must be 'hamming' or 'jaccard', got {metric!r}")

Why This Issue Is Important

Reliability: Users running python -O (which some Docker images and production deployments use) silently lose every assert-based check. An invalid metric like "cosine" would slip through mjn.py and produce garbage results instead of a clear error.

Debuggability: Most of these assert statements have no error message. When they do fire, the user sees a bare AssertionError with no context. Proper exceptions provide actionable information (e.g., "Expected contig '2L' but got 'chr2L'").

Python Best Practice: The Python language reference notes that no code is emitted for an assert statement when optimization is requested with -O, so assert cannot be relied on for data validation. Linters flag assert usage for this reason (Ruff S101, Bandit B101).

Data Integrity: In a genomics library, silent shape mismatches or wrong contigs could produce scientifically incorrect results without any error, which is far worse than a crash.


My Approach to Solve

Step 1: Categorize each assert by type

| Category | Replacement Exception | Example |
| --- | --- | --- |
| Input validation | ValueError / TypeError | assert metric in [...] → raise ValueError(...) |
| State invariants | RuntimeError | assert self._aim_palettes is not None → raise RuntimeError(...) |
| Shape/dimension checks | ValueError | assert arr.ndim == 1 → raise ValueError(f"Expected 1-D array, got {arr.ndim}-D") |
| Internal consistency | RuntimeError | assert nobs_mode == "fixed" → raise RuntimeError(...) |
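One way to keep the replacements consistent across modules would be a few small helpers, one per category. This is only a sketch under stated assumptions: the helper names `check_choice`, `check_ndim`, and `require_state` are hypothetical and do not exist in the codebase, and `SimpleNamespace` stands in for an array so the example runs without NumPy.

```python
from types import SimpleNamespace


def check_choice(name, value, allowed):
    """Input validation: raise ValueError for an unsupported option."""
    if value not in allowed:
        raise ValueError(
            f"{name} must be one of {sorted(allowed)}, got {value!r}"
        )


def check_ndim(name, arr, expected):
    """Shape check: raise ValueError when the array has the wrong rank."""
    if arr.ndim != expected:
        raise ValueError(
            f"Expected {name} to be {expected}-D, got {arr.ndim}-D"
        )


def require_state(value, message):
    """State invariant: raise RuntimeError when required state is missing."""
    if value is None:
        raise RuntimeError(message)
    return value


# Valid input passes silently.
check_choice("metric", "hamming", {"hamming", "jaccard"})

# Invalid input raises with an actionable message, with or without -O.
try:
    check_choice("metric", "cosine", {"hamming", "jaccard"})
except ValueError as exc:
    choice_error = str(exc)

# Stand-in for a 2-D array passed where a 1-D indexer is expected.
fake_indexer = SimpleNamespace(ndim=2)
try:
    check_ndim("indexer", fake_indexer, 1)
except ValueError as exc:
    ndim_error = str(exc)
```

Whether helpers like these are worth adding, or whether inline if/raise blocks read better at each call site, is a design choice for the maintainers.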

Step 2: Replace in priority order

  1. util.py first — most used, core functions (_dask_compress_dataset, _da_compress, allele mapping)
  2. anoph/snp_data.py and anoph/hap_data.py — user-facing data access, contig validation
  3. mjn.py — public-facing parameter validation
  4. Remaining anoph/ modules — hap_frq.py, snp_frq.py, cnv_frq.py, genome_features.py, genome_sequence.py, h1x.py, aim_data.py, sample_metadata.py
  5. anopheles.py — one occurrence

Step 3: Add descriptive error messages

Every replacement will include a clear, actionable error message explaining what was expected vs. what was received:

# Shape validation
if indexer.ndim != 1:
    raise ValueError(
        f"Expected indexer to be 1-dimensional, got {indexer.ndim} dimensions"
    )
 
# Contig validation
if contig not in self.contigs:
    raise ValueError(
        f"Contig {contig!r} not found. Available contigs: {self.contigs}"
    )
 
# State invariant
if self._default_phasing_analysis is None:
    raise RuntimeError(
        "No default phasing analysis configured. "
        "Please specify the 'analysis' parameter explicitly."
    )

Step 4: Verify

  • Run the existing test suite to confirm no regressions
  • Run the test suite under python -O to verify that the new checks still raise when assert statements are disabled
  • Confirm no assert statements remain in production code (only in tests/)
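The last check could be automated rather than done by hand. The sketch below uses the standard-library ast module to list remaining assert statements outside test files; the function name `find_asserts` and the rule for skipping tests (a `tests` directory or a `test_` filename prefix) are assumptions for illustration, not existing project tooling.

```python
import ast
from pathlib import Path


def find_asserts(root):
    """Return sorted (path, line number) pairs for assert statements
    in .py files under root, skipping test files."""
    root = Path(root)
    hits = []
    for path in root.rglob("*.py"):
        rel_parts = path.relative_to(root).parts
        # Skip anything inside a tests/ directory or named test_*.py.
        if "tests" in rel_parts or path.name.startswith("test_"):
            continue
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.Assert):
                hits.append((str(path), node.lineno))
    return sorted(hits)
```

Calling `find_asserts("malariagen_data")` from the repository root would return an empty list once every production assert has been replaced. Enabling the Ruff S101 rule with a per-file ignore for tests is an alternative that achieves a similar result.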

Files to Modify

  • malariagen_data/util.py
  • malariagen_data/mjn.py
  • malariagen_data/anopheles.py
  • malariagen_data/anoph/snp_data.py
  • malariagen_data/anoph/hap_data.py
  • malariagen_data/anoph/hap_frq.py
  • malariagen_data/anoph/snp_frq.py
  • malariagen_data/anoph/cnv_frq.py
  • malariagen_data/anoph/genome_features.py
  • malariagen_data/anoph/genome_sequence.py
  • malariagen_data/anoph/h1x.py
  • malariagen_data/anoph/sample_metadata.py
  • malariagen_data/anoph/aim_data.py
