Update predicates to handle special cases #1184
Open
nakib103 wants to merge 10 commits into Ensembl:postreleasefix/116
palakpsheth added a commit to palakpsheth/ensembl-variation that referenced this pull request Feb 11, 2026
…654) Integrate improvements from the Ensembl team's parallel PR Ensembl#1184 to handle additional edge cases in stop codon consequence prediction:
- inframe_insertion: guard against */* peptides (ref/alt both stop)
- stop_lost/stop_retained: fall through to sequence-level analysis when alt_pep contains 'X' (unreliable peptide translation)
- _overlaps_stop_codon: fix insertion coordinate handling to properly detect stop codon overlap (root cause fix for insertion detection)
- ref_eq_alt_sequence: remove redundant condition 1 (strict subset of condition 3), improve condition 2 to check trailing stop semantically, simplify condition 3 index comparison
These changes complement the existing Issue #1710 fixes (Bug 1 and Bug 2) and were validated by the Ensembl team against the full GRCh38 variant database with no difference in variant consequences.
ENSVAR-6654
Handle special cases for

inframe_insertion
* - not an inframe_insertion.
Ideally we should not get $ref_pep == $alt_pep once the check length($alt_codon) > length($ref_codon) has passed. But we found cases where alt_codon is only 1 bp or 2 bp longer than ref_codon and the insertion is not in frame.
For example, the variant 5 134750948 . TG T has codons TAA/TAAG and peptides */*.
We only handle the case where $ref_pep = $alt_pep = * to keep the change minimal, and do not add a general check for $ref_pep = $alt_pep.

stop_retained and stop_lost
Check transcript sequence if alt_pep has X
These functions first use the peptide to infer whether the effect is present. If that is not possible, for example because the peptide string is not available, then the genomic sequence is used to infer the effect. We should also use the transcript sequence if alt_pep has X, because in some cases the effect cannot be inferred from such a peptide sequence.
For example, 3 149520808 . G GTTAA has peptides L/L*X, which suggests stop_gained, but the transcript sequence shows that it is actually stop_retained.
By-product: for stop_lost, there are now cases where stop_lost is reported along with incomplete_terminal_codon (before it was not). That is because we now check whether alt_pep has X.
A patch is applied that returns 0 if partial_codon is true, as stop_retained already does.
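The two peptide-level guards described above can be sketched as follows. This is a minimal illustration in Python, not the Perl in ensembl-variation: the function names, the callback parameters, and the simplified codon-length check are all assumptions made for the example.

```python
def is_inframe_insertion(ref_codon, alt_codon, ref_pep, alt_pep):
    """Toy inframe_insertion check with the */* guard (hypothetical helper)."""
    # Guard from the PR description: stop-to-stop (*/*) is never an
    # inframe insertion, even if the alt codon is longer.
    if ref_pep == "*" and alt_pep == "*":
        return False
    # Simplified stand-in for the existing codon-length check (assumption).
    return len(alt_codon) > len(ref_codon)


def stop_effect(alt_pep, partial_codon, peptide_check, sequence_check):
    """Decide a stop_lost/stop_retained style effect (hypothetical helper).

    peptide_check(alt_pep) and sequence_check() are caller-supplied
    callbacks standing in for the real peptide-level and sequence-level
    logic, which is not reproduced here.
    """
    # Patch from the description: never report on a partial terminal codon.
    if partial_codon:
        return False
    # Trust the peptide only if it exists and has no ambiguous residue X.
    if alt_pep and "X" not in alt_pep:
        return peptide_check(alt_pep)
    # Otherwise fall through to the sequence-level analysis.
    return sequence_check()
```

With the codons from the example, `is_inframe_insertion("TAA", "TAAG", "*", "*")` returns `False`, and a peptide such as `L*X` always routes through `sequence_check` rather than the peptide-level test.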
stop_gained
In some cases stop_gained and stop_lost now appear together after the above edit. For example, for rs2154303328 the change is -/LPRFKTRS*P*SX. The insertion happens just before the stop codon and L replaces the stop codon, so it is definitely stop_lost. This part is now correct (by checking the genomic sequence, as alt_pep has X in it).
But stop_gained is wrong, because the new stop codon that appears is not a premature stop codon. This is an old issue and is present in pre-112 VEP. It appears because we do not check whether the newly added stop codon comes after the original stop codon.
A patch has been applied to check if we already have stop_lost. If the stop is lost then it cannot be stop_gained.
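A rough Python sketch of that ordering rule, assuming a deliberately simplified peptide test for stop_gained (the real predicate has more conditions). The function name and the `stop_lost_called` flag are hypothetical.

```python
def stop_gained(ref_pep, alt_pep, stop_lost_called):
    """Toy stop_gained predicate with the stop_lost guard (hypothetical)."""
    # Patch from the description: if the original stop is lost, a stop
    # appearing in alt_pep lies downstream of it and is not premature.
    if stop_lost_called:
        return False
    # Simplified peptide-level test (assumption): a stop appears in the
    # alt peptide that was not in the ref peptide.
    return "*" in alt_pep and "*" not in ref_pep
```

For the rs2154303328 peptides, `stop_gained("-", "LPRFKTRS*P*SX", True)` returns `False`, whereas the same call with `stop_lost_called=False` returns `True`, which is the false positive the patch removes.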
Remove false cases from ref_eq_alt_sequence function -
# 1 - even if alt_pep has * it might not be at the exact stop location.
Because we do not know if the stop codon is actually at the original stop position. (Ola also removed this case in her last PR and I agree.)
# 2 - why are we checking $final_stop_length < 3
I do not know why we check that $final_stop_length is less than 3 here. First of all, final_stop_length is the number of aa in the mut aa string counted from the length of the ref aa string onwards. We do not care about that length as long as the first aa is a stop (i.e. the stop is at the same position). I changed it to $final_stop =~ /^\Q*\E/ so we know that the first aa is a stop codon.
(What happens if ref_seq does not go all the way to the stop codon?)
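To make the difference in condition 2 concrete, here is a small Python comparison of the two tests as they are quoted in the description. It assumes `final_stop` is the tail of the mutated peptide starting at the reference stop position; any other clauses around the condition in the real function are omitted, and both function names are hypothetical.

```python
import re


def old_condition(final_stop):
    """Length test quoted in the description: $final_stop_length < 3."""
    return len(final_stop) < 3


def new_condition(final_stop):
    """Python equivalent of $final_stop =~ /^\\Q*\\E/ (first aa is a stop)."""
    # re.match anchors at the start of the string, like ^ in the Perl regex.
    return re.match(r"\*", final_stop) is not None
```

A tail of `LK` passes the old test although it contains no stop at all, while a tail of `*PQR` fails the old test even though the stop sits at the original position. The new test gives the opposite, intended answer in both cases.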
_overlaps_stop_codon
For example, in the 3 149520808 . G GTTAA variant, the insertion happens before the stop codon. If we only consider the position where it is inserted, it does not overlap the stop codon and we never learn that the stop is actually retained. So we consider the full length of the insertion in the reference sequence to see if it actually overlaps the stop codon.
(What happens if an indel length does not reach the stop codon but, after applying the change to the transcript sequence, it is actually a stop_retained? Is it something to do with VEP not being completely haplotype-aware?)
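One way to picture the coordinate change is the Python sketch below. It relies on the Ensembl convention that an insertion has start = end + 1. The coordinates in the usage note are made up, and the exact window used by the real `_overlaps_stop_codon` may differ, so treat both functions as an interpretation of the description.

```python
def overlaps_point_only(start, end, stop_start, stop_end):
    """Overlap test using only the two bases flanking the insertion point."""
    if start > end:  # insertion: start == end + 1
        start, end = end, start
    return start <= stop_end and end >= stop_start


def overlaps_full_length(start, end, stop_start, stop_end, inserted_len):
    """Overlap test that extends the insertion over its full length."""
    if start > end:  # insertion: project the inserted bases onto the reference
        end = start + inserted_len - 1
    return start <= stop_end and end >= stop_start
```

For a hypothetical 4-base insertion between positions 10 and 11 (start=11, end=10) and a stop codon at 13 to 15, the point-only test sees the window 10 to 11 and reports no overlap, while the full-length test sees 11 to 14 and reports an overlap.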
frameshift
If the first base affected is in the stop codon then we do not report frameshift, as the reading frame excludes the start and stop codon. In such a case the current reading frame stays the same, but new aa are added with a new stop.
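A minimal Python sketch of this rule, assuming CDS-relative coordinates and a length-based frameshift test. Both are simplifications of the real predicate, and the parameter names are hypothetical.

```python
def is_frameshift(ref_len, alt_len, first_affected_pos, stop_codon_start):
    """Toy frameshift predicate with the stop-codon exclusion (hypothetical)."""
    # A length change that is a multiple of 3 keeps the reading frame.
    if (alt_len - ref_len) % 3 == 0:
        return False
    # Rule from the description: if the first affected base lies inside the
    # 3-base stop codon, the upstream frame is untouched, so no frameshift.
    if stop_codon_start <= first_affected_pos <= stop_codon_start + 2:
        return False
    return True
```

For a TAG/G change that starts on the first base of a stop codon at CDS position 301, `is_frameshift(3, 1, 301, 301)` returns `False`. The same 2-base deletion starting at position 100 returns `True`.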
By-product: a case appears for inframe_deletion like the one mentioned at the top for inframe_insertion. For example, a variant such as TAG/G (*/X) is now considered stop_lost,inframe_deletion. Stop lost is fine but inframe deletion is not. The inframe deletion appears as a by-product of the frameshift change.
As the variant is no longer classified as frameshift, _get_codon_alleles returns the codon and the condition for inframe deletion evaluates to true. Why are we checking the codon instead of the peptide? It is probably based on the frameshift check: if it is not a frameshift, then we would only see deletions whose length is a multiple of 3. The above update to frameshift opens a case for a deletion at the stop codon.
So we add a check whether ref_pep is exactly *. If so, even if it is deleted, that is not an inframe deletion.
Impact
This PR is supposed to fix some edge cases. To confirm that it does not affect a large number of known variants, I compared the consequences for variants in the Ensembl variation database for human GRCh38 (this file). It shows no difference in variant consequences.