Aggressive Sequence Identifier Truncation with `ignorejunk=t` in filterbyname.sh v39.62

## Describe the Bug
When `filterbyname.sh` is run on raw FASTA assemblies whose headers contain literal Unicode escape sequences (for example the six characters `\u0003`), the internal Java runtime automatically maps these literals to the `0x03` End of Text (ETX) control byte.

When the `ignorejunk=t` parser reaches this ETX byte mid-header, it truncates the sequence identifier at that point, a behavior that is not documented. Only the prefix before the escape survives, so every sequence whose identifier happens to share that prefix is extracted as a false positive.

In our specific case, providing an exact list of **156** unique GISAID accession IDs resulted in the extraction of **87,675** completely incorrect sequences from the root FASTA.

## To Reproduce

*(Note: Unlike the previous pipe-delimiter reproduction script, this one requires an actual embedded Unicode string escape in the header to trigger the Java internal substitution mechanism.)*

**1. Create a dummy FASTA file `input.fasta`:**
```fasta
>id1_valid
ACGTACTG
>id2_with_literal_escape\u0003_suffix
TGCACTGC
>id3_valid
CGCGCGCA
```

**2. Create a target filter list `names.txt`:**
```text
id2_with_literal_escape\u0003_suffix
```
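Steps 1 and 2 can also be scripted. The sketch below assumes a POSIX-like shell and uses quoted heredoc delimiters so the backslash stays literal. The two checks at the end are only a sanity test that the files contain the six-character escape rather than a raw `0x03` byte; the variable names are illustrative.

```bash
# Quoted heredoc delimiters ('EOF') keep the backslash literal,
# so the header holds the six characters \u0003, not a 0x03 byte.
cat > input.fasta <<'EOF'
>id1_valid
ACGTACTG
>id2_with_literal_escape\u0003_suffix
TGCACTGC
>id3_valid
CGCGCGCA
EOF

cat > names.txt <<'EOF'
id2_with_literal_escape\u0003_suffix
EOF

# Number of lines containing the literal six-character escape (-F: fixed string).
literal_lines=$(grep -cF '\u0003' input.fasta)
# Number of raw ETX (0x03) bytes in the file: delete everything except \003, then count.
etx_bytes=$(tr -cd '\003' < input.fasta | wc -c)

echo "lines with literal escape: ${literal_lines}"
echo "raw ETX bytes: ${etx_bytes}"
```

If the second count is non-zero, the file already contains real control bytes and the reproduction is no longer testing the literal-escape case.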

**3. Run filterbyname.sh:**
```bash
filterbyname.sh in=input.fasta names=names.txt out=output.fasta include=t ignorejunk=t fastawrap=0
```

**4. Expected vs. Actual Output:**
- **Expected:** `output.fasta` should contain exactly **1** record.
- **Actual:** The header is truncated at the ETX byte, so the name lookup no longer compares full identifiers and the output contains records that do not match the requested name.
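One way to compare expected and actual output after a run is to count the header lines in `output.fasta` and list any header that is not an exact entry in `names.txt`. This is a minimal sketch; the helper names `count_records` and `unexpected_headers` are made up for illustration and are not part of BBTools.

```bash
# Count FASTA records (header lines) in a file; prints 0 if there are none.
count_records() {
    grep -c '^>' "$1" || true
}

# Print headers (without the leading '>') that are not an exact,
# whole-line match for any entry in the names list.
# -F treats names as fixed strings, so backslashes are compared literally.
unexpected_headers() {
    grep '^>' "$1" | sed 's/^>//' | grep -vxFf "$2" || true
}

# Only report if a previous filterbyname.sh run produced output.fasta.
if [ -f output.fasta ]; then
    echo "records in output: $(count_records output.fasta)"
    unexpected_headers output.fasta names.txt
fi
```

For the reproduction above, a correct run would report one record and print no unexpected headers.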

## Root Cause Analysis
The bug occurs in the header parsing that `ignorejunk=t` enables.

We initially believed `ignorejunk=t` was tokenizing on the pipe character (`|`), but we have since isolated the silent truncation to Java's native stream decoder encountering literal `\uXXXX` sequences that upstream sequence submitters embedded in the FASTA file (a common GISAID anomaly).

Because the Java VM resolves `\u0003` to the ETX byte on ingestion, before BBTools parses the header as plain text, `filterbyname.sh` treats the ETX byte as either a terminal delimiter or junk.

Consequently, `filterbyname.sh` truncates the identifier where the escape sequence began, drops the trailing suffix, and extracts every FASTA record that shares the resulting prefix, which greatly inflates the number of extracted sequences.
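The prefix-collision effect described here can be illustrated outside BBTools. The sketch below is not BBTools code: `truncate_at_escape` and `count_prefix_matches` are hypothetical helpers, and the `sampleA`/`sampleB` headers are invented demo data. It cuts an identifier at the first backslash and counts how many headers start with what is left.

```bash
# Hypothetical helper: keep only the part of an identifier before the
# first backslash, mimicking the truncation described above.
truncate_at_escape() {
    printf '%s\n' "$1" | cut -d '\' -f 1
}

# Count header lines in a FASTA file that start with the given prefix.
# The prefix contains no backslash by construction, so passing it via -v is safe.
count_prefix_matches() {
    awk -v p=">$1" 'index($0, p) == 1 { n++ } END { print n + 0 }' "$2"
}

# Invented demo data: two headers share the prefix before the escape.
demo=$(mktemp)
cat > "$demo" <<'EOF'
>sampleA\u0003_x
AC
>sampleA\u0003_y
GT
>sampleB
TT
EOF

prefix=$(truncate_at_escape 'sampleA\u0003_x')
matches=$(count_prefix_matches "$prefix" "$demo")
echo "prefix after truncation: $prefix"
echo "headers sharing that prefix: $matches"
rm -f "$demo"
```

Running the same count against the real root FASTA with a truncated accession prefix would show how many records a single severed identifier can pull in.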

## System Information
- **BBTools Version:** > v39.06
- **OS:** Linux
- **Java Version:** standard JDK 11+

## Suggested Fix
1. Refine the definition of "junk" in the string tokenizer (likely inside `FilterByName.java` or `Shared.java`), or bypass Java's automatic Unicode escape conversion so that literal metadata strings `\u0000` through `\uFFFF` are preserved exactly as plain-text bytes.
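Until the tool itself is changed, a possible stopgap is to rewrite the literal escapes before filtering. The sketch below is an assumption on our side and has not been verified against BBTools: it replaces each literal `\uXXXX` with `_uXXXX` in FASTA headers and in the names list, using hypothetical helpers `sanitize_fasta` and `sanitize_names`. It changes the identifiers, so downstream steps must expect the rewritten form.

```bash
# Replace each literal \uXXXX escape with _uXXXX on FASTA header lines only;
# sequence lines are passed through unchanged.
sanitize_fasta() {
    sed '/^>/ s/\\u\([0-9A-Fa-f]\{4\}\)/_u\1/g' "$1"
}

# Apply the same substitution to every line of a names list,
# so the rewritten names still match the rewritten headers.
sanitize_names() {
    sed 's/\\u\([0-9A-Fa-f]\{4\}\)/_u\1/g' "$1"
}

# Only run if the reproduction files are present.
if [ -f input.fasta ] && [ -f names.txt ]; then
    sanitize_fasta input.fasta > input.sanitized.fasta
    sanitize_names names.txt > names.sanitized.txt
fi
```

The sanitized files can then be passed to `filterbyname.sh` in place of the originals.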

*(Updated: Bug formally isolated to `\u0003` ETX expansion, not the pipe separator).*

