Split recombination penalty and enable by default #4911

Open
adamnovak wants to merge 25 commits into master from split-penalty

@adamnovak
Member

Changelog Entry

To be copied to the draft changelog by merger:

  • The vg giraffe --rec-penalty-chain parameter has been split into --rec-penalty (for chaining), --rec-consistency-bonus (a bonus for haplotype consistency used during chaining but not incorporated into the chain score), and --rec-penalty-aln (used to penalize alignment scores per recombination).
  • Recombination-aware minimizer indexing is now always on when there are few enough haplotypes. Passing --rec-mode to vg minimizer now just makes it fail if recombination-aware minimizer indexing isn't on (because of too many haplotypes).
  • Recombination-aware mapping is now the default in vg giraffe, if a recombination-aware minimizer index file is loaded and you are using the hifi or r10 presets. To turn it off, pass --no-rec-mode. There's no longer a distinction between .path minimizer and zipcodes files and normal ones. If using a path cover instead of real haplotypes, like with an unphased VCF as a Giraffe input, you must pass --no-rec-mode manually or else recombination-aware mapping will be used despite being inappropriate for a path cover. The wiki will need to be updated to reflect this.
  • The hifi and r10 presets for vg giraffe have been updated with tuned recombination penalty settings.
  • vg giraffe no longer produces alignments with a nonempty path and a negative or zero score. Potential alignments that would reach or go below a score of 0 (perhaps because of --rec-penalty-aln) will be removed, and if needed an unmapped alignment record will be emitted for the read.
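
As a rough illustration of the last item, here is a minimal sketch of how a per-recombination alignment penalty could push a candidate to a nonpositive score and get it dropped. The `Candidate` struct and both function names are hypothetical stand-ins, not vg's actual types or code, and the penalty is assumed to be a simple linear subtraction.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a candidate alignment; this is not vg's Alignment type.
struct Candidate {
    int32_t raw_score;       // alignment score before any recombination penalty
    int32_t recombinations;  // number of recombinations along the aligned path
};

// Assumption: the penalty is subtracted once per recombination,
// which is how --rec-penalty-aln is described in the changelog.
inline int32_t penalized_score(const Candidate& c, int32_t rec_penalty_aln) {
    return c.raw_score - rec_penalty_aln * c.recombinations;
}

// Keep only the scores that stay strictly positive after the penalty.
// An empty result corresponds to emitting an unmapped record for the read.
inline std::vector<int32_t> surviving_scores(const std::vector<Candidate>& candidates,
                                             int32_t rec_penalty_aln) {
    std::vector<int32_t> kept;
    for (const Candidate& c : candidates) {
        int32_t score = penalized_score(c, rec_penalty_aln);
        if (score > 0) {
            kept.push_back(score);
        }
    }
    return kept;
}
```

Note that a candidate landing exactly on 0 is dropped too, matching the "reach or go below" wording.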

Description

This sets up recombination-aware mapping to be on by default when possible, using separately-tuned parameters for the different ways we use recombination information.
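
For reviewers, the intended default can be summarized as a small predicate. This is only a sketch of the rule as stated in the changelog entry; `RecModeInputs` and `use_rec_aware_mapping` are hypothetical names and do not correspond to the actual code in giraffe_main.cpp.

```cpp
#include <string>

// Hypothetical summary of the inputs the changelog says matter.
struct RecModeInputs {
    bool index_is_rec_aware;  // a recombination-aware minimizer index was loaded
    std::string preset;       // e.g. "hifi", "r10", or another preset name
    bool no_rec_mode;         // the user passed --no-rec-mode
};

// Recombination-aware mapping is on only when the user has not opted out,
// the index supports it, and the preset is one of the two tuned ones.
inline bool use_rec_aware_mapping(const RecModeInputs& in) {
    if (in.no_rec_mode) {
        return false;
    }
    if (!in.index_is_rec_aware) {
        return false;
    }
    return in.preset == "hifi" || in.preset == "r10";
}
```

Nothing in this predicate knows whether the haplotypes are real or come from a path cover, which is why --no-rec-mode currently has to be passed by hand in that case.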

To write the tests cleanly, I had to get rid of negative-score alignments, which I don't think we want anyway, but I'm currently testing to make sure that doesn't do anything bad to our calling results. (I'm not evaluating what it does to our mapping results, because I don't think a read put at the correct position with a negative score should ever have really counted.)

I tried to smarten up how we decide whether to use recombination-aware mapping, but there are a couple of things that could change:

  1. I just have a made-up constant for how many paths we can probably represent in the payload, rather than getting that information from the code that defines the path flags payload. I'm also assuming all samples in the GBWT metadata count (including the generic-path one). @dcmonti is there a better way for me to find out how many distinct things the payload can represent, and how many of them are in any given GBWT?
  2. The new smartness will do the wrong thing when building a path cover, especially within Giraffe, where we handed Giraffe itself an unphased GBZ or a graph without covering paths and it ought to know better. I probably should do the extra engineering around that to cover at least some of the cases, but getting the right bits to the right places is awkward. I think to really get it right I'd have to double out a bunch of IndexRegistry indexes and rules to bring the state of whether a path cover got used all the way through to the minimizer indexing, or else sniff the inputs in giraffe_main.cpp and guess what the IndexRegistry is going to do. I didn't like any of these options, so I'm making the user remember to turn it off, for now.
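
To make item 1 concrete, the indexing decision amounts to something like the sketch below. The capacity value is an arbitrary placeholder chosen for illustration (it is not the constant used in this PR), and `RecIndexDecision` and `decide_rec_indexing` are hypothetical names. It also bakes in the assumption from item 1 that every sample in the GBWT metadata counts toward the limit.

```cpp
#include <cstddef>

// Placeholder capacity for illustration only. The real limit should come from
// the code that defines the path flags payload.
constexpr std::size_t ASSUMED_MAX_PAYLOAD_PATHS = 64;

enum class RecIndexDecision {
    BUILD_REC_AWARE,  // few enough haplotypes: recombination-aware indexing is on
    BUILD_PLAIN,      // too many haplotypes and --rec-mode was not passed
    FAIL              // too many haplotypes but --rec-mode was passed
};

// sample_count is assumed to include every sample in the GBWT metadata,
// including the generic-path one.
inline RecIndexDecision decide_rec_indexing(std::size_t sample_count,
                                            bool rec_mode_requested) {
    if (sample_count <= ASSUMED_MAX_PAYLOAD_PATHS) {
        return RecIndexDecision::BUILD_REC_AWARE;
    }
    return rec_mode_requested ? RecIndexDecision::FAIL
                              : RecIndexDecision::BUILD_PLAIN;
}
```

A path-cover flag would be one more input to this function, which is the part that is awkward to plumb through IndexRegistry.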

adamnovak added 25 commits May 12, 2026 18:11
@adamnovak
Member Author

@jltsiren What if we changed the path cover generation to somehow mark the resulting GBZ as being a path cover, maybe with a tag? Then I could look at it when doing minimizer indexing and know not to make a recombination-aware minimizer index because I don't have real haplotypes.
