Make faidx work with very long (>4 Gbyte!) lines #2008

Open
daviesrob wants to merge 1 commit into samtools:develop from daviesrob:faidx-very-long-lines

@daviesrob daviesrob commented May 12, 2026

Although faidx should support very long references, writing one longer than 4Gbases on a single line broke it because it used a uint32_t field to store the line length.
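
As a minimal sketch of the failure mode (the helper name `store_line_length_u32` is hypothetical and not part of htslib), storing a length above 2^32 - 1 in a `uint32_t` silently reduces it modulo 2^32, so the index ends up with a much shorter line length than the file really has:

```c
#include <stdint.h>

/* Hypothetical helper, not htslib code: mimics storing a line length
 * in a 32-bit field, as the old index entry did. */
static uint32_t store_line_length_u32(uint64_t actual_len)
{
    /* Conversion to an unsigned 32-bit type keeps only the low 32 bits,
     * i.e. the value modulo 2^32. */
    return (uint32_t) actual_len;
}
```

For example, a 5,000,000,000-base line would be recorded as 705,032,704 (5,000,000,000 - 4,294,967,296), and a line of exactly 2^32 bases would be recorded as 0.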

To make it work with such inputs, faidx1_t::line_blen is increased in size to uint64_t so the correct length can be stored. To avoid having to do the same for faidx1_t::line_len, which would make each entry quite a bit bigger for a fairly rare use-case, that field is changed so that it stores the number of bytes to be skipped at the end of each line instead of the full length. As this value will usually only be 1 or 2, a uint32_t is plenty big enough for it. Combined with the fact that the original structure had a four-byte hole in it (between line_blen and len), it's possible to store the longer line lengths while keeping faidx1_t exactly the same size as it had before.
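
The layout change can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual patch: the type names `faidx1_old_t` and `faidx1_new_t` and the helper `base_file_offset` are hypothetical, and only `line_len`, `line_blen` and `len` are named in the description above; the remaining fields are assumed for the sake of a complete example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout before the change (hypothetical type name). */
typedef struct {
    int id;               /* assumed field */
    uint32_t line_len;    /* full line length in bytes, terminator included */
    uint32_t line_blen;   /* bases per line */
    /* where uint64_t is 8-byte aligned, 4 bytes of padding sit here */
    uint64_t len;         /* total sequence length */
    uint64_t seq_offset;  /* assumed field: file offset of the first base */
    uint64_t qual_offset; /* assumed field */
} faidx1_old_t;

/* Assumed layout after the change (hypothetical type name). */
typedef struct {
    int id;
    uint32_t line_len;    /* now: bytes to skip at end of line (usually 1 or 2) */
    uint64_t line_blen;   /* widened, fills the former padding */
    uint64_t len;
    uint64_t seq_offset;
    uint64_t qual_offset;
} faidx1_new_t;

/* Hypothetical helper: file offset of 0-based base `pos`, given that each
 * full line occupies line_blen + line_len bytes on disk. */
static uint64_t base_file_offset(const faidx1_new_t *e, uint64_t pos)
{
    uint64_t full_lines = pos / e->line_blen;
    uint64_t within_line = pos % e->line_blen;
    return e->seq_offset
         + full_lines * (e->line_blen + (uint64_t) e->line_len)
         + within_line;
}
```

On ABIs where `uint64_t` is 8-byte aligned (for example x86-64 Linux), both structures in this sketch come out at 40 bytes, which matches the claim that the entry size is unchanged.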

Fixes samtools/samtools#2331

Comment thread faidx.c Outdated

  while ((l = hgetln(buf, 0x10000, fp)) > 0) {
-     uint32_t line_len, line_blen, n;
+     uint64_t line_len, line_blen, n;
Member

It doesn't affect the behaviour, but this is a good opportunity to make n a plain int.

Member Author

Agreed, it's now an int.

Development

Successfully merging this pull request may close these issues.

samtools faidx incorrrent sequence extraction when scaffold length is very long (>2.1Gb)