Update predicates to handle special cases #1184

Open
nakib103 wants to merge 10 commits into Ensembl:postreleasefix/116 from nakib103:upd_pred

nakib103 commented Sep 25, 2025

ENSVAR-6654

Handle special cases for inframe_insertion

  • ref and alt pep both * - not an inframe_insertion.
    Ideally, the check length($alt_codon) > length($ref_codon) should never let through a case where $ref_pep equals $alt_pep. But we found cases where alt_codon is only 1 bp or 2 bp longer than ref_codon, so the insertion is not in frame.
    For example, the variant 5 134750948 . TG T has codons TAA/TAAG and peptides */*.
    To keep the change minimal, we only handle the case where $ref_pep and $alt_pep are both *; we do not add a general check for $ref_pep equal to $alt_pep.
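
The guard described above can be sketched as follows. This is a minimal illustration in Python, not the Perl predicate from this PR: the function name and signature are hypothetical, and only the length check and the */* guard mentioned in the description are modelled.

```python
def is_inframe_insertion(ref_codon: str, alt_codon: str,
                         ref_pep: str, alt_pep: str) -> bool:
    """Hypothetical, simplified predicate (illustration only)."""
    # Guard from the description: if both peptides are a bare stop,
    # nothing was inserted into the protein.
    if ref_pep == "*" and alt_pep == "*":
        return False
    # Length check quoted in the description; the rest of the real
    # predicate is intentionally left out of this sketch.
    return len(alt_codon) > len(ref_codon)


# Codons and peptides taken from the example in the description.
print(is_inframe_insertion("TAA", "TAAG", "*", "*"))
```

Without the guard, the TAA/TAAG example would pass the length check even though only one base was added.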

stop_retained and stop_lost

  • Check transcript sequence if alt_pep has X
    These functions first use the peptide to infer whether the effect is present. If that is not possible, for example because the peptide string is not available, then the genomic sequence is used to infer the effect. We should also use the transcript sequence when alt_pep contains X, because in some cases the effect cannot be inferred from such a peptide sequence.
    For example, 3 149520808 . G GTTAA has peptides L/L*X, which suggests stop_gained, but the transcript sequence shows it is actually stop_retained.

    By-product: stop_lost is now sometimes reported together with incomplete_terminal_codon (previously it was not), because we now check whether alt_pep contains X.

    This is patched by returning 0 if partial_codon is true, as stop_retained already does.
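
The decision order described above might look roughly like this. It is a sketch under assumptions: `stop_effect`, `from_peptide` and `from_transcript` are hypothetical names, and the two callables stand in for the peptide-level and sequence-level checks, whose actual logic is not reproduced here.

```python
from typing import Callable, Optional


def stop_effect(alt_pep: Optional[str], partial_codon: bool,
                from_peptide: Callable[[], bool],
                from_transcript: Callable[[], bool]) -> bool:
    """Hypothetical dispatcher for stop_lost / stop_retained style checks."""
    # Patch from the description: a partial codon never yields the effect.
    if partial_codon:
        return False
    # Trust the peptide only when it exists and has no ambiguous residue X.
    if alt_pep and "X" not in alt_pep:
        return from_peptide()
    # Otherwise fall through to the sequence-level check.
    return from_transcript()


# With alt_pep "L*X" the peptide answer is ignored and the
# sequence-level callable decides.
print(stop_effect("L*X", False, lambda: True, lambda: False))
```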

stop_gained

  • In some cases stop_gained and stop_lost now appear together after the above edit. For example, for rs2154303328 the change is -/LPRFKTRS*P*SX. The insertion happens just before the stop codon and L replaces the stop codon, so it is definitely stop_lost. This part is now correct (because alt_pep contains X, the genomic sequence is checked).

    But stop_gained is wrong, because the new stop codon is not a premature stop codon. This is an old issue that is present in pre-112 VEP. It appears because we do not check whether the newly added stop codon comes after the original stop codon.

    A patch has been applied to check whether we already have stop_lost. If the stop is lost, the variant cannot be stop_gained.
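
A minimal sketch of that ordering, in Python for illustration only. The function name is hypothetical, and the peptide rule used here (a stop appears in alt_pep but not in ref_pep) is an assumed simplification, not the exact condition in the code.

```python
def is_stop_gained(ref_pep: str, alt_pep: str, stop_lost: bool) -> bool:
    """Hypothetical, simplified stop_gained check."""
    # Patch from the description: a variant that already loses the
    # stop is not also reported as stop_gained.
    if stop_lost:
        return False
    # Assumed simplification of the peptide-level rule.
    return "*" in alt_pep and "*" not in ref_pep


# Peptides from the rs2154303328 example, with stop_lost already true.
print(is_stop_gained("-", "LPRFKTRS*P*SX", stop_lost=True))
```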

Remove false cases from ref_eq_alt_sequence function -

  • # 1 - even if alt_pep has * it might not be at the exact stop location.
    We do not know whether the stop codon is actually at the original stop position. (Ola also removed this case in her last PR and I agree.)
  • # 2 - why are we checking $final_stop_length < 3?
    I do not know why we check that $final_stop_length is less than 3 here. final_stop_length is the number of aa in the mutated aa string counted from the length of the ref aa string. We do not care about the length as long as the first aa is a stop (i.e. the stop is at the same position). I changed it to $final_stop =~ /^\Q*\E/ so that we know the first aa is a stop codon.
    (What happens if ref_seq does not go all the way to the stop codon?)
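
For readers less familiar with the Perl pattern, /^\Q*\E/ matches a literal * at the start of the string. A Python equivalent is sketched below; `starts_with_stop` is a hypothetical helper, and how the tail string is cut from the mutated peptide is not modelled.

```python
import re


def starts_with_stop(final_stop: str) -> bool:
    """True when the first residue of the tail is a stop (*)."""
    # re.escape plays the role of \Q...\E; re.match anchors at the start.
    return re.match(re.escape("*"), final_stop) is not None


# The length of the tail no longer matters, only its first residue.
for tail in ("*", "*PX", "L*", ""):
    print(repr(tail), starts_with_stop(tail))
```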

_overlaps_stop_codon

  • Consider the length of the insertion when checking for stop codon overlap.
    For example, for the variant 3 149520808 . G GTTAA the insertion happens before the stop codon. If we only consider the position where it is inserted, it does not overlap the stop codon and we never learn that the stop is actually retained. So we consider the full length of the insertion in the reference sequence to see whether it actually overlaps the stop codon.
    (What happens if an indel's length does not reach the stop codon, but after applying the change to the transcript sequence it is actually stop_retained? Is it something to do with VEP not being completely haplotype-aware?)
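
The difference between a position-only check and a length-aware check can be sketched with plain intervals. Everything here is illustrative: the coordinates are invented, the helper names are hypothetical, a forward-strand layout is assumed, and `insert_pos` is taken to be the base after which the new bases are placed.

```python
def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """Closed-interval overlap test."""
    return a_start <= b_end and b_start <= a_end


def insertion_overlaps_stop(insert_pos: int, insert_len: int,
                            stop_start: int, stop_end: int) -> bool:
    """Hypothetical length-aware check: project the full insertion
    length onto the reference, starting after insert_pos."""
    return overlaps(insert_pos + 1, insert_pos + insert_len,
                    stop_start, stop_end)


# Invented coordinates: stop codon at 101-103, 4 bases inserted after 98.
print(overlaps(98, 99, 101, 103))                  # flanking bases only
print(insertion_overlaps_stop(98, 4, 101, 103))    # full insertion length
```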

frameshift

  • If the first base affected is in the stop codon then we do not report a frameshift, as the reading frame excludes the start and stop codon. In such a case the current reading frame stays the same but new aa are added with a new stop.

    By-product: a case appears for inframe_deletion like the one mentioned at the top for inframe_insertion. For example, a variant such as TAG/G (*/X) is now considered stop_lost,inframe_deletion. stop_lost is fine but inframe_deletion is not; it appears as a by-product of the frameshift change.

    As the variant is not classified as frameshift, _get_codon_alleles returns the codon and the condition for inframe deletion evaluates to true. Why do we check the codon instead of the peptide? Probably because it relies on the frameshift check: if the variant is not a frameshift, then any deletion we see has a length that is a multiple of 3. The above update to frameshift opens up a case for a deletion at the stop codon.

    So we add a check for whether ref_pep is exactly *; if so, even if it is deleted, it is not an inframe deletion.
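
The interaction between the two changes can be sketched as below. Both functions are hypothetical simplifications in Python, with invented transcript coordinates; they only model the conditions named in the description.

```python
def is_frameshift(length_change: int, first_affected: int,
                  stop_start: int, stop_end: int) -> bool:
    """Hypothetical check: no frameshift if the first affected base
    lies inside the stop codon."""
    if stop_start <= first_affected <= stop_end:
        return False
    return length_change % 3 != 0


def is_inframe_deletion(ref_codon: str, alt_codon: str, ref_pep: str) -> bool:
    """Hypothetical check: a deleted bare stop is not an inframe deletion."""
    if ref_pep == "*":
        return False
    return len(ref_codon) > len(alt_codon)


# TAG/G (*/X) from the description: 2 bases deleted, starting in the stop
# codon (stop placed at 101-103 purely for illustration).
print(is_frameshift(-2, 101, 101, 103))
print(is_inframe_deletion("TAG", "G", "*"))
```

Because the variant is no longer a frameshift, the codon-length comparison alone would accept it; the ref_pep guard is what rejects it.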

Impact

This PR is supposed to fix some edge cases. To check that it does not affect a large number of known variants, I compared the consequences for variants in the Ensembl variation database for human GRCh38 (this file) before and after the change. The comparison shows no difference in variant consequences.
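
A comparison of this kind could be scripted along these lines. This is only a sketch: the input shape (a mapping from variant identifier to a set of consequence terms) and the identifiers are assumptions, not the format of the file used for the test above.

```python
from typing import Dict, Set, Tuple

Consequences = Dict[str, Set[str]]


def consequence_diff(before: Consequences,
                     after: Consequences) -> Dict[str, Tuple[Set[str], Set[str]]]:
    """Return variants whose consequence sets differ between two runs."""
    changed = {}
    for variant in sorted(before.keys() | after.keys()):
        old = before.get(variant, set())
        new = after.get(variant, set())
        if old != new:
            changed[variant] = (old, new)
    return changed


# Invented identifiers; an empty result means no consequence changed.
run_a = {"var_a": {"stop_lost"}, "var_b": {"missense_variant"}}
run_b = {"var_a": {"stop_lost"}, "var_b": {"missense_variant"}}
print(consequence_diff(run_a, run_b))
```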

@dglemos dglemos added the e116 label Oct 27, 2025
@likhitha-surapaneni likhitha-surapaneni self-assigned this Dec 2, 2025
palakpsheth added a commit to palakpsheth/ensembl-variation that referenced this pull request Feb 11, 2026
…654)

Integrate improvements from the Ensembl team's parallel PR Ensembl#1184 to
handle additional edge cases in stop codon consequence prediction:

- inframe_insertion: guard against */* peptides (ref/alt both stop)
- stop_lost/stop_retained: fall through to sequence-level analysis
  when alt_pep contains 'X' (unreliable peptide translation)
- _overlaps_stop_codon: fix insertion coordinate handling to properly
  detect stop codon overlap (root cause fix for insertion detection)
- ref_eq_alt_sequence: remove redundant condition 1 (strict subset of
  condition 3), improve condition 2 to check trailing stop semantically,
  simplify condition 3 index comparison

These changes complement the existing Issue #1710 fixes (Bug 1 and
Bug 2) and were validated by the Ensembl team against the full GRCh38
variant database with no difference in variant consequences.

3 participants