@faithokamoto faithokamoto commented Mar 14, 2025

This PR adds an --impute option that imputes every substitution to an ambiguous base, provided the parental state is consistent with the ambiguity code. (For example, N may be any non-gap character, but Y may only be C or T.) It erases these mutations from the relevant nodes, because never introducing the ambiguity is the most parsimonious scenario. If a NucMut contains both ambiguous and non-ambiguous bases, the non-ambiguous bases are retained, and merged together if consecutive.
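
As a rough illustration of the consistency rule described above (a sketch only, not the code in impute.cpp): the helper names `iupacExpansion` and `canImpute` are hypothetical, and the subset rule for an ambiguous parent is an assumption about how "consistent" might be interpreted.

```cpp
#include <string>

// Hypothetical helper (not PanMAN's API): the concrete bases that a
// standard IUPAC nucleotide code may stand for.
std::string iupacExpansion(char code) {
    switch (code) {
        case 'A': return "A";
        case 'C': return "C";
        case 'G': return "G";
        case 'T': return "T";
        case 'R': return "AG";
        case 'Y': return "CT";
        case 'S': return "CG";
        case 'W': return "AT";
        case 'K': return "GT";
        case 'M': return "AC";
        case 'B': return "CGT";
        case 'D': return "AGT";
        case 'H': return "ACT";
        case 'V': return "ACG";
        case 'N': return "ACGT";
        default:  return "";  // gaps and unrecognized characters match nothing
    }
}

// Assumed rule: a parent -> child substitution can be imputed away when the
// child is ambiguous and every base the parent may be is allowed by the child.
bool canImpute(char parent, char child) {
    const std::string parentBases = iupacExpansion(parent);
    const std::string childBases = iupacExpansion(child);
    if (childBases.size() < 2 || parentBases.empty()) {
        return false;  // child is not ambiguous, or parent is a gap
    }
    for (char base : parentBases) {
        if (childBases.find(base) == std::string::npos) {
            return false;  // parent could be a base the child code excludes
        }
    }
    return true;
}
```

Under these assumptions, `canImpute('C', 'Y')` holds because Y covers C, while `canImpute('A', 'Y')` and `canImpute('-', 'N')` do not.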

Besides creating impute.cpp and adding the necessary lines to panman.hpp, I also introduce various small conveniences. Examples:

  • NucMuts now have functions such as .getNucCode(int i) to do the bitshift lookup of the base at a given position, or .length() to do the bitshift lookup of the length of the mutation. I changed code elsewhere to use these new helper functions, which improves readability and follows the DRY principle.
  • I introduce a hashable struct Coordinate, which stores just a position and can be constructed from a NucMut. I use this struct to speed up consolidateNucMutations() by converting it to use an unordered_map.
  • I factor out functions to apply and undo nucleotide/block mutations given a set of mutations and a tracker (e.g. a sequence_t). These were used in two separate FASTA-printing functions, and I use them for my imputation function.
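
To illustrate the hashable-coordinate idea from the list above, here is a minimal sketch. The field names, the `CoordinateHash` functor, the `PointMut` struct, and the "last mutation at a coordinate wins" rule are all assumptions made for the example; the real Coordinate and consolidateNucMutations() in this PR may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the PR's Coordinate; field names are assumed.
struct Coordinate {
    int32_t blockId;
    int32_t nucPosition;
    int32_t nucGapPosition;

    bool operator==(const Coordinate& other) const {
        return blockId == other.blockId &&
               nucPosition == other.nucPosition &&
               nucGapPosition == other.nucGapPosition;
    }
};

// Hash functor so Coordinate can be used as an unordered_map key.
struct CoordinateHash {
    std::size_t operator()(const Coordinate& c) const {
        std::size_t seed = std::hash<int32_t>{}(c.blockId);
        // Combine the remaining fields into the seed, one at a time.
        auto mix = [&seed](int32_t value) {
            seed ^= std::hash<int32_t>{}(value) + 0x9e3779b9 +
                    (seed << 6) + (seed >> 2);
        };
        mix(c.nucPosition);
        mix(c.nucGapPosition);
        return seed;
    }
};

// Simplified single-base mutation used only for this illustration.
struct PointMut {
    Coordinate where;
    char base;
};

// Keep one mutation per coordinate (the last one seen), preserving the order
// in which coordinates first appear. Average O(1) lookup per mutation.
std::vector<PointMut> consolidate(const std::vector<PointMut>& muts) {
    std::unordered_map<Coordinate, std::size_t, CoordinateHash> indexOf;
    std::vector<PointMut> out;
    for (const PointMut& m : muts) {
        auto it = indexOf.find(m.where);
        if (it == indexOf.end()) {
            indexOf.emplace(m.where, out.size());
            out.push_back(m);
        } else {
            out[it->second].base = m.base;  // overwrite the earlier mutation
        }
    }
    return out;
}
```

The point of the sketch is only that once equality and a hash are defined, a linear scan per mutation can be replaced by a hash lookup.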

I also make one unambiguous bugfix in mergeNodes(), which now, for example, updates the parent attribute of all new children.
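
For context, the kind of bookkeeping involved can be sketched as follows. This is not the actual mergeNodes() implementation; the `Node` struct and `mergeIntoParent` function are hypothetical and only show why re-parenting adopted children matters.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical minimal tree node; the real node type carries much more state.
struct Node {
    Node* parent = nullptr;
    std::vector<Node*> children;
};

// Merge `child` into `parent`: the parent adopts the child's children and the
// child is detached from the tree. Ownership and deletion are left to the caller.
void mergeIntoParent(Node* parent, Node* child) {
    std::vector<Node*>& siblings = parent->children;
    // Remove the merged node from its parent's child list.
    siblings.erase(std::remove(siblings.begin(), siblings.end(), child),
                   siblings.end());
    for (Node* grandchild : child->children) {
        grandchild->parent = parent;  // without this, the pointer goes stale
        siblings.push_back(grandchild);
    }
    child->children.clear();
    child->parent = nullptr;
}
```

If the parent pointer is not updated, any later walk from an adopted child toward the root would pass through a node that is no longer in the tree.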

it seems that only the NSNPX types are used, so I'll be externally figuring out e.g. indel length
notably, "insertions from siblings" end up appearing as SNVs, since the parent has the sibling's insertion - but that seems fine to me
hashing Coordinate still doesn't work
nucleotide mutations need to be consolidated
@faithokamoto
Author

You can ignore the updates to install/installationUbuntu.sh and workflows/scripts/wfmash.sh; I'm honestly not quite sure how they're different. I reverted the -mavx512f flag in CMakeLists.txt because it caused segfaults on my machine (Windows 11 Home running Ubuntu under WSL) due to a bad instruction set.

@faithokamoto
Author

Ahhh I triple-checked and I must've broken something. The imputation isn't working like it was yesterday.

@faithokamoto
Author

Never mind, it is all good.
