Skip to content

AssertionError: negative extraPadLeft/Right in TranslateColorspaceRead.realign_new on spliced long reads (greflen overflow) #19

@k-roy

Description

@k-roy

Summary

mapPacBio.sh with intronlen=50 (splice-aware mode) crashes with a Java AssertionError when processing long-read (ONT/PacBio) sub-reads whose spliced genomic extent (greflen) exceeds msa.maxColumns - 80 = 7520. The padding-shrinkage while-loops in TranslateColorspaceRead.realign_new have no lower-bound guard, driving extraPadLeft and extraPadRight to negative values (e.g. -30, -30).

Environment

  • BBTools version: 39.26 (from bbmap/current/ install)
  • Java: 1.8.0 (OpenJDK, -ea assertions enabled)
  • Command: mapPacBio.sh with intronlen=50 maxindel=300000 ...
  • Input: Oxford Nanopore DRS reads, ~200–32000 bp, split at 6000 bp before alignment
  • Genome: S. cerevisiae (R64), but the bug is geometry-dependent and not organism-specific

Crash output

java.lang.AssertionError: -30, -30
<read_id>
	at align2.TranslateColorspaceRead.realign_new(TranslateColorspaceRead.java:407)
	at align2.SiteScore.realign(SiteScore.java:263)
	at align2.BBMapThreadPacBio.realignAndTest(BBMapThreadPacBio.java:791)
	at align2.BBMapThreadPacBio.processRead(BBMapThreadPacBio.java:552)
	at align2.BBMapThreadPacBio.run(BBMapThreadPacBio.java:155)
	at java.lang.Thread.run(Thread.java:750)

Followed (after thread death) by:

java.lang.AssertionError:
The number of reads out does not add up to the number of reads in.
<stats breakdown> = 316339 != 316340
	at align2.AbstractMapper.printOutputStats(AbstractMapper.java:~1970)

Exit code: non-zero (killed by assertion). The process loses exactly 1 read per crashing thread.

Root cause

Trigger condition

In TranslateColorspaceRead.realign_new (splice-aware branch, ss.gaps != null):

// TranslateColorspaceRead.java ~line 394–407
int greflen = Tools.max(bases.length, GapTools.calcGrefLen(minLoc, maxLoc, ss.gaps));
int newlen = (greflen + 1 + extraPadLeft + extraPadRight);
if (newlen >= msa.maxColumns - 80) {
    while (newlen >= msa.maxColumns-80 && extraPadLeft > extraPadRight) { newlen--; extraPadLeft--; }
    while (newlen >= msa.maxColumns-80 && extraPadLeft < extraPadRight) { newlen--; extraPadRight--; }
    while (newlen >= msa.maxColumns-80) { newlen -= 2; extraPadLeft--; extraPadRight--; }
    // ← NO lower-bound guard here
}
assert(extraPadLeft >= 0 && extraPadRight >= 0) : extraPadLeft + ", " + extraPadRight + ...;

ALIGN_COLUMNS = 7600 (hardcoded in BBIndexPacBio.java:2474), so the threshold is 7520.

How the pads go negative

When a spliced sub-read has greflen ≈ 7579–7581 (e.g. a ~600 bp exon-split read spanning a 7000 bp intron):

newlen = greflen + 1 + extraPadLeft + extraPadRight
       ≈ 7580 + 1 + 30 + 30  = 7641   (example initial pads)
excess = 7641 - 7520 = 121

The first two while-loops equalise the pads (or are no-ops if already equal). The third loop fires (121 - 1) / 2 = 60 times, decrementing both pads by 60:

extraPadLeft  = 30 - 60 = -30
extraPadRight = 30 - 60 = -30

The assertion fires.

If extraPadLeft == extraPadRight == 0 initially (minimum padding case), excess of 60 above threshold drives both to -30 in 30 iterations. This is reproducible with the same read and same reference; the crash is 100% deterministic for affected reads.

Why removing -ea does not fix the bug

Without -ea, the assertion becomes a no-op, but extraPadLeft = -30 and extraPadRight = -30 are then passed to the MSA array sizing code. This produces ArrayIndexOutOfBoundsException instead — same crash, different exception class.

Why ALIGN_COLUMNS cannot be raised at runtime

BBIndexPacBio.java:2474:

public static final int ALIGN_COLUMNS = 7600;

This is static final and referenced at index-build time to size arrays. It cannot be overridden via command-line parameters.

Suggested fix

Clamp extraPadLeft and extraPadRight to 0 before the assertion:

if (newlen >= msa.maxColumns - 80) {
    while (newlen >= msa.maxColumns-80 && extraPadLeft > extraPadRight) { newlen--; extraPadLeft--; }
    while (newlen >= msa.maxColumns-80 && extraPadLeft < extraPadRight) { newlen--; extraPadRight--; }
    while (newlen >= msa.maxColumns-80) { newlen -= 2; extraPadLeft--; extraPadRight--; }
    // ADDED: clamp — alignment will be degraded but not crash
    if (extraPadLeft < 0) { newlen -= extraPadLeft; extraPadLeft = 0; }
    if (extraPadRight < 0) { newlen -= extraPadRight; extraPadRight = 0; }
}
assert(extraPadLeft >= 0 && extraPadRight >= 0) : ...;

A stricter alternative is to break out of the third while-loop when either pad reaches 0:

while (newlen >= msa.maxColumns-80 && extraPadLeft > 0 && extraPadRight > 0) {
    newlen -= 2; extraPadLeft--; extraPadRight--;
}

Either approach prevents the assertion/AIOOBE while producing a slightly compressed alignment window for reads that genuinely exceed ALIGN_COLUMNS. The comment already present in the code acknowledges this case: // TODO: In this case the alignment will probably be wrong.

Reproduction context

Reads involved are sub-reads split from long Nanopore DRS reads (>6000 bp) by an upstream read-splitter. A sub-read covering one exon can still span a very long intron in genomic coordinates, giving greflen >> read_length. The affected reads are processed correctly by minimap2 and other long-read aligners; the crash is specific to mapPacBio.sh with intronlen set.

Any workflow that uses mapPacBio.sh with intronlen>0 on reads from species with long introns (>7000 bp) and sufficiently large sub-reads is at risk.

Impact

One read is silently lost per crashing thread. The printOutputStats assertion then fires, causing mapPacBio to exit non-zero. Any downstream tool checking exit codes will reject the entire output BAM — even though only 1 read out of potentially hundreds of thousands was affected.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions