tcren — structure-based prediction of TCR–epitope recognition

TCRen predicts which epitopes a T-cell receptor recognises from a single TCR–peptide–MHC structure (experimental or modelled). It extracts the TCR–peptide contact map and scores every candidate peptide with a residue-level statistical potential derived from contact preferences in TCR:pMHC crystal structures — answering not "what fancy complex can a model draw?" but "is this binding physically plausible?".
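
The scoring idea described above can be sketched in a few lines. This is an illustrative toy, not the tcren implementation: the `Contact` layout, the function names, and every value in `TOY_POTENTIAL` are made up for the example, and it assumes that a lower summed energy means a more plausible peptide.

```python
from typing import Dict, List, Tuple

# A contact is (TCR residue type, 0-based peptide position). Hypothetical layout.
Contact = Tuple[str, int]

# Toy pairwise potential: (TCR residue, peptide residue) -> energy.
# The numbers are invented for illustration only.
TOY_POTENTIAL: Dict[Tuple[str, str], float] = {
    ("R", "D"): -1.0,
    ("R", "L"): 0.5,
    ("Y", "L"): -0.4,
    ("Y", "D"): 0.2,
}


def score_peptide(contacts: List[Contact], peptide: str,
                  potential: Dict[Tuple[str, str], float],
                  default: float = 0.0) -> float:
    """Sum the pair energies over a fixed contact map for one candidate peptide."""
    return sum(potential.get((tcr_aa, peptide[pos]), default)
               for tcr_aa, pos in contacts)


def rank_peptides(contacts: List[Contact], peptides: List[str],
                  potential: Dict[Tuple[str, str], float]) -> List[Tuple[str, float]]:
    """Score every candidate against the same contacts; lowest energy first."""
    scored = [(p, score_peptide(contacts, p, potential)) for p in peptides]
    return sorted(scored, key=lambda item: item[1])


# The contact map comes from one structure; only the peptide sequence changes.
contacts = [("R", 0), ("Y", 1)]
ranked = rank_peptides(contacts, ["DL", "LD"], TOY_POTENTIAL)
```

Because the contact map is fixed by the structure, scoring a new candidate only needs a table lookup per contact, which is why large candidate lists are cheap to rank.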

This is a documented, tested, CLI-driven Python library. TCR chains are annotated with the sibling package arda; MHC chains are mapped and the groove is partitioned against a curated reference; structures are oriented into one canonical frame; and the contact maps, potential, and scores of the original implementation are reproduced numerically (validated against committed oracles to floating-point precision).

While the original tcren focused on TCR:peptide contacts, the new version adds scoring of TCR:MHC and peptide:MHC interactions, which is required to get a full picture of TCR:pMHC binding mechanics and to estimate ddG values.

Install

pip install tcren          # from PyPI — binary wheels ship the C++ extension; pulls in arda-mapper

For development (editable install, conda env with the build toolchain, and the reference data fetched into data/):

bash setup.sh              # creates the `tcren` conda env, installs arda + tcren, fetches data/
conda activate tcren

tcren ships a small pybind11/C++ extension (tcren._align) for the MHC-pseudosequence fitting-alignment hot path, built on install by scikit-build-core (a Biopython fallback runs if it is not built). TCR annotation is provided by arda, a runtime dependency published to PyPI as arda-mapper (it imports as arda); pip/setup.sh pull it automatically. setup.sh also runs tcren fetch-data to populate data/ with the reference structure sets (Native2026, Canonical2026) used by orient/superimpose (set TCREN_NO_FETCH=1 to skip).
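
As a rough illustration of the "compiled extension with a fallback" pattern, the sketch below tries to import an accelerated routine and otherwise defines a pure-Python one. Everything here is an assumption made for the example: `_hypothetical_fast_align` is a made-up module name (not tcren._align), the fallback is plain Python rather than Biopython, and the unit match/mismatch/gap scores are arbitrary. "Fitting alignment" is taken in its usual sense: the whole query is aligned to the best-matching substring of the target.

```python
try:
    # Made-up module name standing in for an optional compiled extension.
    from _hypothetical_fast_align import fitting_score  # type: ignore
except ImportError:
    def fitting_score(query: str, target: str,
                      match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
        """Best score for aligning all of `query` to any substring of `target`."""
        # Row 0 is all zeros: the alignment may start anywhere in the target.
        prev = [0] * (len(target) + 1)
        for i, q in enumerate(query, start=1):
            # Column 0: the first i query residues are aligned to gaps.
            curr = [i * gap] + [0] * len(target)
            for j, t in enumerate(target, start=1):
                diag = prev[j - 1] + (match if q == t else mismatch)
                curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
            prev = curr
        # Free end: the alignment may stop anywhere in the target.
        return max(prev)


exact = fitting_score("ACG", "TTACGTT")   # query occurs verbatim in the target
gapped = fitting_score("ACG", "TTAGTT")   # one query residue has no partner
```

Callers use `fitting_score` the same way whichever branch supplied it, so the fallback only costs speed, not functionality.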

Command line

# Full pipeline: annotate -> superimpose -> resmarkup / canonical Cα / contacts -> per-interface
# energies (TCRen for TCR↔peptide, MJ for TCR↔MHC and peptide↔MHC) + total
tcren pipeline -s complex.pdb -o scores.csv

# End-to-end candidate-epitope scoring from a structure
tcren score -s complex.pdb -c candidates.txt -o ranked.csv

# Substitute a peptide and refine its pose (knowledge-based MC scored by the DOPE atom-level
# statistical potential — independent of the TCRen/MJ scoring potentials, restrained to the input).
# Not physics relaxation — use Rosetta FlexPepDock for that.
tcren refine -s complex.pdb -o refined/ --substitute KQWLVWLFL

# Structures: any of .pdb / .cif / .pdb.gz / .cif.gz, a directory, or a .tar.gz batch
tcren contacts -s batch.tar.gz -o contacts.csv --interface tcr_peptide

# Per-residue markup: TCR (CDR/FR) + MHC groove (helix/floor) + peptide in one table.
# --regions all|tcr|mhc|peptide filters; --pseudo also marks NetMHCpan groove residues (MPS).
tcren annotate -s complex.cif.gz -o markup.csv --regions mhc --pseudo

# Superimpose structure(s) onto the canonical frame, by MHC, against the canonical database
# (data/Canonical2026, fetched at install). Detects MHC class + species and averages the
# superposition over every database structure of that class/species. Chains -> A=Vα B=Vβ
# C=peptide D=MHCα E=MHCβ/β2m. -s takes a file / directory / .tar.gz / glob; -o is a directory,
# or a single structure file (one input) whose extension must match --mmCIF/--compress; -t threads.
tcren superimpose -s complex.pdb -o oriented.pdb           # single file
tcren superimpose -s 'data/*.pdb' -o oriented/ -t 8        # glob -> directory, threaded

# Build a canonical database from native complexes (how Canonical2026 is produced). Annotation
# is one batched mmseqs call; -t threads only the structural alignment + write.
tcren orient -s data/Native2026 -o data/Canonical2026 -t 8

# Structure outputs are plain .pdb by default; add --mmCIF for .cif and --compress for .gz.
tcren superimpose -s complex.pdb -o oriented/ --mmCIF --compress   # -> oriented/<id>.cif.gz

# Fetch recent TCR-pMHC structures from RCSB -> data/pdb_recent (mmCIF .cif.gz, 5-chain validated)
tcren fetch-recent --discover --after 2024-01-01

# Build the MHC reference once (IMGT/HLA + mouse H-2; cached, not committed)
tcren build-mhc-ref

tcren info
tcren --install-completion        # shell tab-completion (bash/zsh/fish)

tcren orient and tcren superimpose need the reference sets in data/ (Native2026, Canonical2026); setup.sh fetches them at install via tcren fetch-data (re-run it any time).

Library

from tcren import run_pipeline, parse_structure, import_structure, ContactMap, score_peptides
from tcren.annotation import classify_chains
from tcren.potential import tcren

# One call: annotate -> superimpose -> contacts -> per-interface energies + total
res = run_pipeline("complex.pdb")              # res.scores, res.markup, res.contacts, res.oriented

# …or the individual steps:
s = parse_structure("complex.pdb.gz")          # also .cif/.cif.gz; import_structure trims the C-gene
classify_chains(s, organism="human")           # TRA/TRB via arda, peptide, MHC
cm = ContactMap.from_structure(s)              # 5 Å contacts + interface partitioning
ranked = score_peptides(cm, ["KQWLVWLFL", "RLLHPHHPL"], tcren())
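
For readers unfamiliar with distance-cutoff contacts, here is a minimal standalone sketch of the idea behind a 5 Å contact map. It is not the ContactMap implementation: the `Atom` tuple layout and the `residue_contacts` helper are hypothetical, the coordinates are invented, and it uses a brute-force double loop where the Performance section of this README mentions a cKDTree.

```python
import math
from typing import List, Tuple

# Hypothetical atom record: (chain id, residue number, (x, y, z) in Å).
Atom = Tuple[str, int, Tuple[float, float, float]]
ResidueKey = Tuple[str, int]


def residue_contacts(atoms_a: List[Atom], atoms_b: List[Atom],
                     cutoff: float = 5.0) -> List[Tuple[ResidueKey, ResidueKey]]:
    """Residue pairs that have at least one atom pair within `cutoff` Å."""
    pairs = set()
    for chain_a, res_a, xyz_a in atoms_a:
        for chain_b, res_b, xyz_b in atoms_b:
            if math.dist(xyz_a, xyz_b) <= cutoff:
                pairs.add(((chain_a, res_a), (chain_b, res_b)))
    return sorted(pairs)


# Invented coordinates: chain B stands for Vβ, chain C for the peptide.
tcr_atoms = [("B", 95, (0.0, 0.0, 0.0)), ("B", 96, (10.0, 0.0, 0.0))]
peptide_atoms = [("C", 4, (3.0, 0.0, 0.0)), ("C", 5, (0.0, 0.0, 9.0))]
contacts = residue_contacts(tcr_atoms, peptide_atoms)
```

Only residue B95 and peptide residue C4 lie within 5 Å of each other here, so `contacts` holds that single pair. A spatial index gives the same result without comparing every atom pair.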

Batch inputs, gzip, archives

from tcren.structure import iter_structures
for pdb_id, structure in iter_structures("batch.tar.gz"):   # file | directory | .tar.gz
    classify_chains(structure, organism="human")
    ...

Canonical orientation, contacts, docking geometry

from tcren.mhc import annotate_mhc
from tcren.orient import canonicalize_structure, superimpose, docking_angles
from tcren.contacts import multi_contacts, ContactDefinition

annotate_mhc(s)
oriented, info = canonicalize_structure(s)     # frame: z=MHC→TCR, y=peptide, x=thin; chains A–E
oriented, info = superimpose(s)                # orient onto data/Canonical2026 by MHC (class+species ensemble)
layers = multi_contacts(s, ContactDefinition(d1=5, d2=8, d3=12))   # heavy-atom / Cβ / Cα
d = docking_angles(s)                          # crossing (~20–70° αβ) + incident angle
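
Once a complex sits in a frame with z pointing from MHC to TCR and y along the peptide, docking angles reduce to simple trigonometry. The sketch below shows one common convention and is an assumption, not the definition used by docking_angles: the crossing angle is taken between the Vα→Vβ centroid vector (projected onto the xy plane) and the peptide axis, and the incident angle is the tilt of a TCR axis away from z. Function names and coordinates are invented.

```python
import math
from typing import Sequence


def crossing_angle(va_centroid: Sequence[float], vb_centroid: Sequence[float]) -> float:
    """Angle in degrees between the xy-projected Vα→Vβ vector and the +y (peptide) axis."""
    dx = vb_centroid[0] - va_centroid[0]
    dy = vb_centroid[1] - va_centroid[1]
    # atan2(|dx|, dy) measures the angle away from +y, in the range 0..180.
    return math.degrees(math.atan2(abs(dx), dy))


def incident_angle(tcr_axis: Sequence[float]) -> float:
    """Tilt in degrees of a TCR axis vector away from the +z (MHC→TCR) axis."""
    x, y, z = tcr_axis
    return math.degrees(math.atan2(math.hypot(x, y), z))


# Invented centroids in Å, already expressed in the canonical frame.
va = (-5.0, -5.0, 30.0)
vb = (5.0, 5.0, 30.0)
crossing = crossing_angle(va, vb)
incident = incident_angle((1.0, 0.0, math.sqrt(3.0)))
```

With these inputs the Vα→Vβ vector is diagonal in the groove plane, giving a crossing angle of about 45°, and the chosen axis is tilted about 30° from z.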

2D complementarity maps & region-pair contacts

from tcren.project2d import (project_structure, residue_markup_table, contacts_table,
                             region_pair_summary)
from tcren.viz import render_complementarity_map, view_pocket_cdr

proj = project_structure(s)                                   # canonical groove plane
svg  = render_complementarity_map(residue_markup_table(s, proj),
                                  contacts=contacts_table(s, threshold=5.0))
region_pair_summary(s, kind="closest")        # contacts per region pair + bond types (cb/ca too)
view_pocket_cdr(s).show()                      # interactive 3D pocket + CDR overlay (py3Dmol)
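
A region-pair summary is essentially a grouped count over the contact list. The sketch below shows that idea with plain dictionaries. It is not the region_pair_summary implementation: the region labels, residue keys, and the `region_pair_counts` helper are hypothetical, and bond types are left out.

```python
from collections import Counter
from typing import Dict, List, Tuple

ResidueKey = Tuple[str, int]  # (chain id, residue number)

# Hypothetical per-residue region labels, as a markup table might provide.
regions: Dict[ResidueKey, str] = {
    ("A", 28): "CDR1α",
    ("B", 95): "CDR3β",
    ("B", 96): "CDR3β",
    ("C", 4): "peptide",
    ("C", 5): "peptide",
    ("D", 65): "MHC helix",
}

# Invented residue-level contacts (first partner on the TCR side).
contacts: List[Tuple[ResidueKey, ResidueKey]] = [
    (("B", 95), ("C", 4)),
    (("B", 96), ("C", 4)),
    (("B", 96), ("C", 5)),
    (("A", 28), ("D", 65)),
]


def region_pair_counts(contacts: List[Tuple[ResidueKey, ResidueKey]],
                       regions: Dict[ResidueKey, str]) -> Counter:
    """Count contacts per (region of first residue, region of second residue)."""
    return Counter((regions[a], regions[b]) for a, b in contacts)


counts = region_pair_counts(contacts, regions)
```

Here `counts` reports three CDR3β–peptide contacts and one CDR1α–MHC helix contact. The same grouping applies whichever distance layer produced the contact list.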

Data

Structures live in the Hugging Face dataset isalgo/tcren_structures, all gzipped:

| folder | contents |
| --- | --- |
| Native2022 | the 2022 paper set (oracle) |
| Native2026 | the comprehensive 2026 TCR:pMHC set the current potential is derived from |
| Canonical2026 | Native2026 re-oriented into the canonical frame (tcren orient) |

tcren reads .pdb/.cif/.pdb.gz/.cif.gz and .tar.gz batches; an installed library lazily fetches the canonical reference structures from the Hub when orienting a new complex. The root data/ holds Native2026 (+ Canonical2026, gitignored, fetched on demand), PDB_date.tsv, orient_metadata.json, and TCRen_potential.csv — the current potential derived from the Native2026 set (use it with tcren score -p data/TCRen_potential.csv).

Notebooks

Runnable examples under notebooks/ (rendered in the docs):

  • complementarity_map_2d — 2D interface maps, multiple structural + map views of 1ao7
  • contact_thresholds_and_bondtypes — region-pair contact counts (closest/Cβ/Cα) + bond types
  • canonical_frame_figures — canonical-frame QC across the Native2026 set
  • pymol_canonical_figures — ray-traced PyMOL panels (overlay, groove, interface) by class/species
  • mhc_pseudosequence_mps — NetMHCpan MHC pseudosequence (MPS) residues vs. peptide contacts
  • example_gil_a02_rs_motif — GILGFVFTL/HLA-A*02 and the public CDR3β Arg–Ser motif
  • natcompsci2022/ — full reproduction of the Nat Comput Sci 2022 analyses

Performance

Per-stage timings on a TCR-pMHC complex (1ao7), Apple M3, single thread (RUN_BENCHMARK=1 pytest -k benchmark -s to reproduce):

| stage | time | notes |
| --- | --- | --- |
| parse a gzipped structure | ~19 ms | .pdb.gz / .cif.gz |
| contact map (5 Å, cKDTree) | ~9 ms | per structure |
| score 1000 candidate peptides | ~8 ms | ~8 µs/peptide (vectorised) |
| annotate (TCR + MHC), batched | ~213 ms/structure | one mmseqs2 call for the whole set; vs ~1.5 s/structure unbatched |
| peak RSS, single-structure pipeline | ~195 MB | |

Annotation is the only network/compute-heavy step and is always batched (one mmseqs2 search over all chains; mmseqs2 parallelises internally — never per-structure, never Python-threaded). Threads are used only for the embarrassingly-parallel, mmseqs-free stages (structural alignment, write, rendering): tcren orient -t N.
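
The "batch once, then thread the independent stages" shape described above can be sketched with the standard library. This is a schematic under stated assumptions, not tcren code: `annotate_batch` and `align_and_write` are placeholder functions with trivial bodies, and the structure ids are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def annotate_batch(chains_by_id: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Placeholder for ONE annotation call that covers every structure at once."""
    return {sid: [c.upper() for c in chains] for sid, chains in chains_by_id.items()}


def align_and_write(item: Tuple[str, List[str]]) -> Tuple[str, int]:
    """Placeholder for an independent per-structure stage (alignment, write)."""
    sid, annotated = item
    return sid, len(annotated)


def run(chains_by_id: Dict[str, List[str]], threads: int = 4) -> Dict[str, int]:
    # Step 1: a single batched call, never one call per structure.
    annotated = annotate_batch(chains_by_id)
    # Step 2: per-structure work is independent, so it can be fanned out.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return dict(pool.map(align_and_write, annotated.items()))


result = run({"s1": ["a", "b", "c", "d", "e"], "s2": ["a", "b"]}, threads=2)
```

Keeping the batched call outside the pool avoids launching the expensive search once per structure, while the pool handles stages that share no state.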

Tests

pytest -m "not slow"          # unit + fast regression (the CI gate)
pytest                        # add the arda/mmseqs-backed regression tests
RUN_BENCHMARK=1 pytest -k benchmark -s

Methods appendix

The coordinate-level extensions, namely backbone-preserving peptide substitution and the potential-guided Monte-Carlo refinement kernel (energy function, the restraint-necessity argument, sampler, and citations), are written up in the technical appendix appendix/tcren.tex (built with make -C appendix, which produces appendix/tcren.pdf).

Citing

TCRen is free for academic and non-commercial use. If you use it, please cite our latest Nature Computational Science 2024 paper:

Karnaukhov VK, Shcherbinin DS, Chugunov AO, Chudakov DM, Efremov RG, Zvyagin IV, Shugay M. Structure-based prediction of T cell receptor recognition of unseen epitopes using TCRen. Nat Comput Sci. 2024 Jul;4(7):510-521. doi: 10.1038/s43588-024-00653-0. Epub 2024 Jul 10. PMID: 38987378.