Aggressive Sequence Identifier Truncation with ignorejunk=t in filterbyname.sh v39.62 #17

@mmokrejs

Describe the Bug

When filterbyname.sh is run on raw FASTA assemblies whose headers contain literal Unicode escape text (for example the six characters \u0003), the Java runtime appears to convert that text into the 0x03 End of Text (ETX) control byte.

When the ignorejunk=t parser encounters this ETX byte mid-header, it truncates the sequence identifier at that position. This behavior is undocumented. Matching then happens on the surviving prefix, so every sequence that happens to share that prefix is extracted as a false positive.

In our case, supplying a list of 156 unique GISAID accession IDs resulted in 87,675 incorrect sequences being extracted from the source FASTA.

To Reproduce

(Note: Unlike the earlier pipe-delimiter reproduction script, this one requires a literal Unicode escape in the header to trigger the substitution.)

1. Create a dummy FASTA file input.fasta:

>id1_valid
ACGTACTG
>id2_with_literal_escape\u0003_suffix
TGCACTGC
>id3_valid
CGCGCGCA

2. Create a target filter list names.txt:

id2_with_literal_escape\u0003_suffix
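When creating the two files from a shell, the backslash has to stay literal, otherwise the reproducer tests a different input. The following is a minimal sketch that writes both files with quoted heredocs and then checks what is actually on disk. The scratch directory and the variable names are only illustrative.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Quoted heredocs ('EOF') keep the backslash literal: the header holds the
# six characters \u0003, not a single 0x03 byte.
cat > input.fasta <<'EOF'
>id1_valid
ACGTACTG
>id2_with_literal_escape\u0003_suffix
TGCACTGC
>id3_valid
CGCGCGCA
EOF

cat > names.txt <<'EOF'
id2_with_literal_escape\u0003_suffix
EOF

# Count lines containing the literal escape text, and count raw ETX bytes.
literal_count=$(grep -c -F '\u0003' input.fasta)
etx_count=$(LC_ALL=C tr -cd '\003' < input.fasta | wc -c | tr -d ' ')
echo "literal escapes: $literal_count, raw ETX bytes: $etx_count"
```

If the second number is not 0, the file already contains a real control byte and the run no longer exercises the literal-escape case described here.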

3. Run filterbyname.sh:

filterbyname.sh in=input.fasta names=names.txt out=output.fasta include=t ignorejunk=t fastawrap=0 

4. Expected vs. Actual Output:

  • Expected: output.fasta should contain exactly 1 record, the one whose header is id2_with_literal_escape\u0003_suffix.
  • Actual: The header is truncated at the ETX position, so name matching no longer uses the full identifier and the output contains mismatched records.
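To compare expected and actual output without reading the files by eye, something like the following can be run after step 3. It is a rough sketch using only grep and sed. The helper names count_records and unexpected_headers are made up for this example and are not part of BBTools.

```shell
# Count FASTA records by counting header lines.
count_records() {
    grep -c '^>' "$1"
}

# Print headers from a FASTA file that do not exactly match any line of the
# names list. $1 = names list, $2 = FASTA file.
unexpected_headers() {
    grep '^>' "$2" | sed 's/^>//' | grep -v -x -F -f "$1"
}

# Usage after running filterbyname.sh (file names from the steps above):
#   count_records output.fasta
#   unexpected_headers names.txt output.fasta
```

For the reproducer, the expected result is a count of 1 and no unexpected headers. Any header printed by the second helper was extracted without being an exact match for a requested name.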

Root Cause Analysis

The bug occurs during the internal parsing behavior triggered by ignorejunk=t.

We initially believed ignorejunk=t was tokenizing on the pipe character (|). We have since isolated the silent truncation to a different cause: the Java stream decoding layer appears to convert literal \uXXXX sequences that upstream submitters embedded in the FASTA headers (a common GISAID anomaly).

Because \u0003 seems to be resolved into the End of Text byte on ingestion, before BBTools parses the header as plain text, filterbyname.sh treats the resulting ETX byte as a terminating delimiter or as junk.

Consequently, filterbyname.sh truncates the identifier at the point where the escape began, drops the trailing suffix, and extracts every FASTA record that shares the truncated prefix, which greatly inflates the output.
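To narrow down where the ETX byte comes from, it helps to know whether a given FASTA contains the literal escape text, a raw control byte, or both. The sketch below lists the affected headers for each case. The function names are hypothetical and the control-byte range is an assumption (0x01 to 0x1F, excluding tab, newline and carriage return).

```shell
# Headers containing the literal text \uXXXX (backslash, u, four hex digits).
headers_with_literal_escape() {
    grep '^>' "$1" | grep -E '\\u[0-9A-Fa-f]{4}'
}

# Headers containing a raw control byte. Tab (0x09), newline (0x0A) and
# carriage return (0x0D) are left out so CRLF files are not flagged.
headers_with_control_byte() {
    grep '^>' "$1" | LC_ALL=C grep "$(printf '[\001-\010\013\014\016-\037]')"
}

# Usage:
#   headers_with_literal_escape input.fasta
#   headers_with_control_byte input.fasta
```

If only the first helper prints anything, the input holds plain text escapes and any ETX byte must be produced later in the pipeline. If the second helper prints headers, the control byte is already present in the file before filterbyname.sh reads it.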

System Information

  • BBTools Version: > v39.06
  • OS: Linux
  • Java Version: standard JDK 11+

Suggested Fix

  1. Refine the definition of "junk" in the string tokenizer (likely inside FilterByName.java or Shared.java), or bypass any automatic Unicode escape conversion so that literal metadata strings \u0000 through \uFFFF are preserved exactly as plain-text bytes.

(Updated: Bug formally isolated to \u0003 ETX expansion, not the pipe separator).
