Candidate: Anvitha Bhat A
My Medium blog post for this project: Click here
To ensure quick access without extensive setup, the key project resources are linked below:
| Resource Type | Description | Link |
|---|---|---|
| Interactive Demo | Verify QED/QCD predictions live | Open in Colab |
| Pre-trained Weights | Final JEPA backbone weights (0.098 MSE) | Download (.pth) |
| Inference Script | Lightweight CLI tool for batch verification | View Code |
- **Tokenization Transparency (Feature 1):** A raw QED amplitude string (e.g., `mul(pow(alpha,2), Tr(gamma_mu, slash(p1), gamma_nu, slash(p2)), div(1, s))`) is passed through the prefix normalizer and converted into an integer token sequence in real time, with each intermediate representation printed for full auditability.
- **O(N) FastAttention Backbone (Feature 2):** The normalized token tensor is forwarded through the LM-JEPA context encoder, producing a contextual embedding of shape `(1, 16, 256)` via linear-complexity attention, confirming stable latent dimensionality with zero quadratic memory overhead.
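For illustration, the prefix-tokenization step can be sketched in a few lines of Python. The `tokenize_prefix` and `encode` helpers and the on-the-fly vocabulary are hypothetical stand-ins for the actual Task 1.2 normalizer, not the project's code:

```python
import re

# Hypothetical sketch: split a prefix-notation amplitude string into
# symbols, then map each symbol to an integer id.
def tokenize_prefix(expr: str) -> list[str]:
    # Strip parentheses, commas, and whitespace; prefix notation is
    # unambiguous without the delimiters.
    parts = re.split(r"[(),\s]+", expr)
    return [p for p in parts if p]

def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    # Grow the vocabulary on the fly; a real pipeline would freeze it.
    return [vocab.setdefault(t, len(vocab)) for t in tokens]

expr = "mul(pow(alpha,2), div(1, s))"
tokens = tokenize_prefix(expr)
vocab: dict[str, int] = {}
ids = encode(tokens, vocab)
print(tokens)  # ['mul', 'pow', 'alpha', '2', 'div', '1', 's']
print(ids)     # [0, 1, 2, 3, 4, 5, 6]
```

Printing both intermediate representations mirrors the auditability idea described above: every stage of the string-to-tensor conversion stays inspectable.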
These results can be replicated using the interactive notebook linked in the Quick Links section above.
Full environment setup, dependency installation, and verification steps are documented in SETUP.md.
The flowchart below highlights the sequential data flow from preprocessing to the foundation model, depicting a pipeline that uses advanced representation learning and symbolic mathematics to predict squared amplitudes.
```mermaid
graph TD
    A[Raw QED/QCD Symbolic Expressions] --> B[Task 1.2: Prefix Tokenization & Index Normalization]
    B -->|Preprocessed Data Handover| C[Task 2.5: LM-JEPA Transformer Backbone]
    C -->|Linear FastAttention| D[Predicted Squared Amplitudes]
    D -.->|Complexity Scaling| E((Convergence))
```
The progressive deliverables and the corresponding performance metrics are cataloged below. Please click the directory links to navigate to the specific task folders.
| Task ID | Component | Metric (MSE Loss) | Visual Validation | Task Documentation | Notebook / Weights |
|---|---|---|---|---|---|
| 1.2 | Data Pre-processing | N/A (Lossless) | Reconstruction Proof | Task 1.2 README | Solution PDF |
| 2.5 | LM-JEPA Pre-training | 0.125 | JEPA Loss Curve | Task 2.5 README | Local Weights |
| 2.5 | QCD Fine-tuning | 0.098 | Parity Plot | Task 2.5 README | Local Weights |
The table below demonstrates the robustness of the model and the efficiency of the FastAttention mechanism when predicting squared amplitudes for expressions of varying lengths:
| Operand Count | Validation MSE Loss | Inference Infrastructure Status |
|---|---|---|
| 2 Operands | 0.091 | Stable |
| 4 Operands | 0.098 | Stable |
| 6+ Operands | 0.112 | Stable |
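A minimal numpy sketch shows why linear attention keeps memory flat as operand count grows: the N×N attention matrix is never materialized. The feature map `phi` and all shapes here are illustrative assumptions, not the repository's FastAttention implementation:

```python
import numpy as np

# Positive feature map (ELU + 1), a common choice for linear attention.
def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    # O(N * d^2): accumulate K^T V once and reuse it for every query,
    # instead of forming the O(N^2) attention matrix.
    qf, kf = phi(q), phi(k)          # (N, d) feature-mapped queries/keys
    kv = kf.T @ v                    # (d, d) summary of keys and values
    z = qf @ kf.sum(axis=0)          # (N,) per-query normalizer
    return (qf @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 16, 8                         # toy sequence length and head dim
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)  # (16, 8)
```

By associativity, `(phi(Q) phi(K)^T) V = phi(Q) (phi(K)^T V)`, so the output matches full kernel attention while the per-token cost stays linear in sequence length.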
- **$O(N)$ Complexity Architecture:** The integration of FastAttention allows the model to efficiently process deep symbolic expressions containing more than four operands. This structural enhancement circumvents the computational bottleneck of standard $O(N^2)$ Transformers, preventing out-of-memory errors and maintaining rapid inference times on extended mathematical sequences.
- **JEPA Latent Space:** The Joint-Embedding Predictive Architecture (JEPA) facilitates robust representation learning. By operating in the latent space, the model establishes a foundational prior for Feynman diagrams entirely unsupervised, capturing the underlying physics before conventional supervised fine-tuning is initiated.
Variance control during training was a key discovery made during the model optimization stage. While predicting complex QCD amplitudes, Index Normalization proved to be the crucial intervention needed to stabilize the validation loss. Regularizing the index distributions before tokenization kept the magnitude of the attention gradients bounded, avoiding divergence and guaranteeing smooth convergence.
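As a rough illustration of the idea (the `mu_<n>` index pattern and the `normalize_indices` helper are hypothetical, not the project's actual implementation), dummy indices can be canonically renamed in order of first appearance, so that equivalent expressions map to identical token sequences and the index vocabulary stays small:

```python
import re

def normalize_indices(expr: str, prefix: str = "idx") -> str:
    mapping: dict[str, str] = {}

    def rename(match: re.Match) -> str:
        name = match.group(0)
        if name not in mapping:
            # First appearance fixes the canonical name.
            mapping[name] = f"{prefix}{len(mapping)}"
        return mapping[name]

    # Assume dummy indices follow the pattern mu_<number> (illustrative).
    return re.sub(r"mu_\d+", rename, expr)

print(normalize_indices("Tr(gamma_mu_7, slash(p1), gamma_mu_7, gamma_mu_3)"))
# Tr(gamma_idx0, slash(p1), gamma_idx0, gamma_idx1)
```

Collapsing arbitrary index labels onto a small canonical set is what keeps the index distribution regular across samples, which is the property credited above with stabilizing the attention gradients.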


