Describe the Bug
When filterbyname.sh is run on raw FASTA assemblies whose headers contain literal Unicode string escapes (for example the six characters \u0003 embedded in an identifier), the internal Java runtime automatically maps these literals to the 0x03 End of Text (ETX) control byte.
When the ignorejunk=t parser hits this dynamically evaluated ETX byte mid-header, it truncates the sequence identifier at that position. This behavior is undocumented. The truncated identifier then matches every sequence that happens to share the surviving prefix, producing large numbers of false-positive extractions.
In our case, an exact list of 156 unique GISAID accession IDs resulted in the extraction of 87,675 incorrect sequences from the root FASTA.
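To check whether a given FASTA is likely to be affected, it helps to distinguish headers that contain the literal six-character escape text from headers that contain a raw control byte. The following is a minimal sketch in Python, not part of BBTools; the function name scan_fasta_headers and the returned dictionary keys are hypothetical names chosen for illustration.

```python
import re

# Literal escape text: a backslash, the letter u, then four hex digits.
LITERAL_ESCAPE = re.compile(rb"\\u[0-9A-Fa-f]{4}")
# Raw control bytes, excluding tab, line feed and carriage return.
RAW_CONTROL = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def scan_fasta_headers(path):
    """Count FASTA headers holding literal escape text or raw control bytes."""
    counts = {"headers": 0, "literal_escape": 0, "raw_control": 0}
    # Read as bytes so no decoder can alter the header content.
    with open(path, "rb") as handle:
        for line in handle:
            if not line.startswith(b">"):
                continue
            header = line.rstrip(b"\r\n")
            counts["headers"] += 1
            if LITERAL_ESCAPE.search(header):
                counts["literal_escape"] += 1
            if RAW_CONTROL.search(header):
                counts["raw_control"] += 1
    return counts


if __name__ == "__main__":
    import sys
    print(scan_fasta_headers(sys.argv[1]))
```

A non-zero literal_escape count means the file on disk holds the escape as plain text; a non-zero raw_control count means the control byte is already present before any tool reads the file.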
To Reproduce
(Note: Unlike the previous pipe-delimiter reproduction script, this requires an actual embedded Unicode string escape in the header to trigger the Java internal substitution mechanism.)
1. Create a dummy FASTA file input.fasta:
>id1_valid
ACGTACTG
>id2_with_literal_escape\u0003_suffix
TGCACTGC
>id3_valid
CGCGCGCA
2. Create a target filter list names.txt:
id2_with_literal_escape\u0003_suffix
3. Run filterbyname.sh:
filterbyname.sh in=input.fasta names=names.txt out=output.fasta include=t ignorejunk=t fastawrap=0
4. Expected vs. Actual Output:
- Expected: output.fasta should contain exactly 1 record.
- Actual: the ETX truncation cuts the header short, so the name lookup frequently fails to match the full identifier and mismatched records are emitted.
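The files from steps 1 and 2 are easy to get wrong when created from a shell, because some echo and printf implementations expand \u0003 themselves. A small sketch in Python that writes both files with the escape kept as six literal characters is shown here; the helper name write_repro_files is hypothetical, and the record contents are copied from the steps above.

```python
from pathlib import Path

# A raw string keeps the escape as six literal characters
# (backslash, u, 0, 0, 0, 3) rather than the single 0x03 character.
ESCAPED_ID = r"id2_with_literal_escape\u0003_suffix"

FASTA_RECORDS = [
    ("id1_valid", "ACGTACTG"),
    (ESCAPED_ID, "TGCACTGC"),
    ("id3_valid", "CGCGCGCA"),
]


def write_repro_files(directory="."):
    """Write input.fasta and names.txt and return their paths."""
    directory = Path(directory)
    fasta_path = directory / "input.fasta"
    names_path = directory / "names.txt"
    fasta_text = "".join(f">{name}\n{seq}\n" for name, seq in FASTA_RECORDS)
    # Write bytes directly so no newline or encoding translation occurs.
    fasta_path.write_bytes(fasta_text.encode("ascii"))
    names_path.write_bytes((ESCAPED_ID + "\n").encode("ascii"))
    return fasta_path, names_path


if __name__ == "__main__":
    fasta_path, names_path = write_repro_files()
    data = fasta_path.read_bytes()
    # Confirm the file holds the literal escape text, not a raw ETX byte.
    print("literal escape present:", b"\\u0003" in data)
    print("raw 0x03 byte present:", b"\x03" in data)
```

Running the script before step 3 confirms that the input on disk contains only the literal text, so any ETX byte seen later would have to be introduced during processing.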
Root Cause Analysis
The bug occurs during the internal parsing behavior triggered by ignorejunk=t.
We initially believed ignorejunk=t was aggressively tokenizing across the pipe character (|), but we have since isolated the silent truncation to the Java native stream decoder hitting literal \uXXXX sequences embedded in the FASTA file by upstream sequence submitters (a common GISAID anomaly).
Because the Java VM resolves \u0003 into the End of Text byte immediately on ingestion, before BBTools gets to parse it as plain text, filterbyname.sh interprets the ETX byte as a structural terminator or as junk.
Consequently, filterbyname.sh truncates the identifier where the escape sequence began, drops the trailing suffix, and extracts every FASTA record that shares the resulting prefix, which greatly inflates the output.
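The effect of the truncation described above can be illustrated with a short simulation. This is only a sketch in Python of the behavior as we understand it, not the BBTools code path: truncate_at_escape and headers_sharing_prefix are hypothetical helpers, prefix matching is an assumption about how the shortened name is compared, and the example headers are made up.

```python
import re

# A literal six-character escape such as \u0003, or a raw control character.
ESCAPE_OR_CONTROL = re.compile(r"\\u[0-9A-Fa-f]{4}|[\x00-\x08\x0b\x0c\x0e-\x1f]")


def truncate_at_escape(name):
    """Return the part of name before the first escape or control character."""
    match = ESCAPE_OR_CONTROL.search(name)
    return name if match is None else name[:match.start()]


def headers_sharing_prefix(headers, requested_name):
    """List headers that start with the truncated form of requested_name."""
    prefix = truncate_at_escape(requested_name)
    return [header for header in headers if header.startswith(prefix)]


if __name__ == "__main__":
    # Hypothetical headers, for illustration only.
    example_headers = [r"sampleA\u0003_x", "sampleA_other", "sampleB"]
    requested = r"sampleA\u0003_x"
    matches = headers_sharing_prefix(example_headers, requested)
    print(truncate_at_escape(requested))
    print(len(matches))
```

In this example one requested name is cut down to a prefix that two of the three headers share, which is the same mechanism by which a short list of accession IDs can pull in far more records than were requested.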
System Information
- BBTools Version: > v39.06
- OS: Linux
- Java Version: standard JDK 11+
Suggested Fix
- Refine the definition of "junk" in the string tokenizer (likely inside FilterByName.java or Shared.java), or proactively bypass Java's automatic Unicode hex conversion so that literal metadata strings \u0000 through \uFFFF are preserved exactly as plain-text bytes.
(Updated: Bug formally isolated to \u0003 ETX expansion, not the pipe separator).