Skip to content

Protein coordinates should not be returned for c. UTR coordinates #466

@theferrit32

Description

@theferrit32

Describe the bug

A c. transcript variant which is in 5' or 3' untranslated regions (before cds start or after cds end) currently get mapped to protein coordinates using arithmetic, but those coordinates are not valid positions.

The example below uses clinvar variant 875307:
g.: NC_000001.11:226875386:C:T
c.: NM_000447.3:c.-184C>T
p.: N/A

Steps to reproduce

def main():
    # https://www.ncbi.nlm.nih.gov/clinvar/variation/875307/
    clinvar_spdi = "NC_000001.11:226875386:C:T"
    clinvar_nm_hgvs = "NM_000447.3:c.-184C>T"
    seqrepo_dir = "/Users/kferrite/dev/data/seqrepo/2024-12-20"
    sr = SeqRepo(root_dir=seqrepo_dir)
    cst = CoolSeqTool(sr=sr)
    seq_accession = "NC_000001.11"
    seq_pos_start = 226875386
    seq_pos_end = 226875387
    result = asyncio.run(
        cst.mane_transcript.grch38_to_mane_c_p(
            alt_ac=seq_accession,
            start_pos=seq_pos_start,
            end_pos=seq_pos_end,
            coordinate_type=CoordinateType.INTER_RESIDUE,
        )
    )

    cdna = result.cdna
    protein = result.protein
    print(f"cdna: {cdna}")
    print(f"protein: {protein}")

    # Check if variant is in an untranslated region
    cds_length = cdna.coding_end_site - cdna.coding_start_site
    if cdna.pos[1] <= 0:
        print(
            f"\nShould not map to protein: variant at cDNA pos {cdna.pos} is in the "
            f"5' UTR (before CDS start at position 0). "
            f"Absolute transcript pos: {cdna.pos[0] + cdna.coding_start_site}-"
            f"{cdna.pos[1] + cdna.coding_start_site}, "
            f"CDS spans: {cdna.coding_start_site}-{cdna.coding_end_site}"
        )
    elif cdna.pos[0] >= cds_length:
        print(
            f"\nShould not map to protein: variant at cDNA pos {cdna.pos} is in the "
            f"3' UTR (after CDS end at position {cds_length}). "
            f"Absolute transcript pos: {cdna.pos[0] + cdna.coding_start_site}-"
            f"{cdna.pos[1] + cdna.coding_start_site}, "
            f"CDS spans: {cdna.coding_start_site}-{cdna.coding_end_site}"
        )
    else:
        print(f"\nVariant is in CDS (pos {cdna.pos}, CDS length {cds_length})")

    print(f"cool-seq-tool returned protein pos {protein.pos}")

main()

Output:

cdna: gene='PSEN2' refseq='NM_000447.3' ensembl='ENST00000366783.8' pos=(-184, -183) strand=<Strand.POSITIVE: 1> status=<TranscriptPriority.MANE_SELECT: 'mane_select'> coding_start_site=383 coding_end_site=1730 alt_ac='NC_000001.11'
protein: gene='PSEN2' refseq='NP_000438.2' ensembl='ENSP00000355747.3' pos=(-62, -61) strand=<Strand.POSITIVE: 1> status=<TranscriptPriority.MANE_SELECT: 'mane_select'>

Should not map to protein: variant at cDNA pos (-184, -183) is in the 5' UTR (before CDS start at position 0). Absolute transcript pos: 199-200, CDS spans: 383-1730
cool-seq-tool returned protein pos (-62, -61)

Acceptance Criteria

If the variant end < cds start or variant start > cds end it's easy enough to have the protein coordinates be None.

But I don't know what the behavior should be if the variant transcript region spans the cds boundary such that part of it codes for a protein and part does not.

Possible reason(s)/Suggested Fix

Either:

(1) Add cds start/end args to _get_mane_p and return None if mane_c_pos_range is outside those bounds.
(2) Prior to calling _get_mane_p (from grch38_to_mane_c_p and get_mane_transcript), check the cds bounds and do not call _get_mane_p if outside them.

Environment & Version

pypi: cool-seq-tool==0.15.4 (but same code in branch: main / tag: 0.16.0)

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions