Skip to content

vg giraffe chaining mode speedup#4910

Open
faithokamoto wants to merge 10 commits into
masterfrom
speedup
Open

vg giraffe chaining mode speedup#4910
faithokamoto wants to merge 10 commits into
masterfrom
speedup

Conversation

@faithokamoto
Copy link
Copy Markdown
Contributor

@faithokamoto faithokamoto commented May 18, 2026

Changelog Entry

To be copied to the draft changelog by merger:

  • Significant time and memory optimizations to vg giraffe chaining/long-read mode

Description

A collection of small tweaks to optimize around ziptrees. Each tweak was individually tested to make sure it actually was a speed improvement. Output does not change.

  • Ziptree construction changes:
    • When in add_child_to_chain(), instead of using the zipcode to look up the chain component for the new child, use the memory within the sorting information; much quicker to access.
    • When in sort_one_interval() and we're sorting the children of a chain, instead of separately looking up each thing the zipcode stores, page through the entire zipcode at once (see get_chain_child_sort_info())
  • Ziptree distance_iterator change: instead of using pos.index to look up the item at the current index, store an iterator for the backing vector and dereference that as needed.
    • Yes this makes the code a bit more fragile, but as long as no one tries to move pos.index outside of the helper shift_index_by() / change_index_to() functions, it'll work.
    • I'm still storing the full ziptree & the index, since I need them for stuff like easy teleportation (hard to jump an iterator to an index if you don't know where you are) and being able to look up values within distance matrices.
  • DP changes:
    • The vector which holds transitions gets space pre-allocated.
    • Instead of generating a bunch of transitions and then iterating through them to throw out the ones which aren't read-reachable, now we check each transition for whether it's read-reachable before we even bother saving it. (Note this does mean I had to kill the missing-transition checker, since we are purposely missing transitions, but whatever it was never really used.)

I'm futzing around with a final optimization, will push that if it turns out to help.

@faithokamoto
Copy link
Copy Markdown
Contributor Author

Nope the last thing I was playing with ended up being bad, so this is it.

Copy link
Copy Markdown
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good, but I think the sort_value_t generation could be substantially simplified to avoid throwing around the big tuples.

Comment thread src/zip_code.hpp Outdated
Comment thread src/zip_code_tree.cpp Outdated
Comment on lines +2265 to +2271
ZipCode::code_type_t code_type;
size_t offset_in_chain;
size_t length;
size_t chain_component;
bool is_reversed;
std::tie(code_type, offset_in_chain, length, chain_component, is_reversed) \
= seed.zipcode.get_chain_child_sort_info(interval.depth+1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope we're unpacking it ourselves and not just < comparing it.

When I see sorting and tuples together I assume we're building a sort key and we're going to compare those tuples with the tuple operator<, but that's not what's happening here.

This is probably a good candidate for a struct, which then gives a place to explain how/why we can't just make an operator<. But then we need to ask why that struct isn't already just the same as sort_value_t? Why can't get_chain_child_sort_info() take seed.pos and node_offset, which seem to be the only inputs to the code below here to set up sort_values_by_seed[zipcode_sort_order[i]] that don't come out of get_chain_child_sort_info(), and then return the completed sort_value_t, instead of a whole collection of things that we use only to build it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set it up so now the ZipCodeForest takes responsibility for figuring out its sort values, and it actually just directly sets the various fields of the sort_value_t by reference instead of return.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants