
# LM-JEPA for Squared Amplitude Calculation: SYMBA - GSoC 2026

**Candidate:** Anvitha Bhat A

My Medium blog for this project: Click here

## Quick Links

For quick verification without extensive setup or $O(N^2)$ memory requirements, please refer to the following resources:

| Resource Type | Description | Link |
| --- | --- | --- |
| Interactive Demo | Verify QED/QCD predictions live | Open in Colab |
| Pre-trained Weights | Final JEPA backbone weights (0.098 MSE) | Download (.pth) |
| Inference Script | Lightweight CLI tool for batch verification | View Code |

## Live Inference Demonstration

SYMBA Live Inference Demo

- **Tokenization Transparency (Feature 1):** A raw QED amplitude string (e.g., `mul(pow(alpha,2), Tr(gamma_mu, slash(p1), gamma_nu, slash(p2)), div(1, s))`) is passed through the prefix normalizer and converted into an integer token sequence in real time, with each intermediate representation printed for full auditability.

- **O(N) FastAttention Backbone (Feature 2):** The normalized token tensor is forwarded through the LM-JEPA context encoder, producing a contextual embedding of shape `(1, 16, 256)` via linear-complexity attention, confirming stable latent dimensionality with zero quadratic memory overhead.

These results can be replicated using the interactive notebook linked in the Quick Links section above.
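As an illustration of the tokenization step, the following minimal sketch splits a prefix-notation amplitude string into tokens and maps them to integer ids. This is not the repository's code: the splitting rule and the on-the-fly vocabulary are assumptions for illustration only.

```python
import re

VOCAB = {}  # token string -> integer id, built on the fly (illustrative)

def tokenize_prefix(expr: str) -> list[int]:
    """Split a prefix-notation amplitude string into tokens, then map to ids."""
    # Split on runs of parentheses, commas, and whitespace, keeping names intact.
    tokens = [t for t in re.split(r"[(),\s]+", expr) if t]
    # Repeated tokens (e.g. both "slash" occurrences) share one id.
    return [VOCAB.setdefault(t, len(VOCAB)) for t in tokens]

amp = "mul(pow(alpha,2), Tr(gamma_mu, slash(p1), gamma_nu, slash(p2)), div(1, s))"
ids = tokenize_prefix(amp)
print(ids)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 6, 9, 10, 11, 12]
```

Note that the same operator appearing twice (`slash`) maps to the same id, which is what makes the sequence auditable against the original expression tree.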

## Setup & Installation

Full environment setup, dependency installation, and verification steps are documented in SETUP.md.

## Project Architecture and Pipeline

The flowchart below traces the sequential data flow from preprocessing to the foundation model, depicting a pipeline that combines representation learning with symbolic mathematics to predict squared amplitudes.

```mermaid
graph TD
    A[Raw QED/QCD Symbolic Expressions] --> B[Task 1.2: Prefix Tokenization & Index Normalization]
    B -->|Preprocessed Data Handover| C[Task 2.5: LM-JEPA Transformer Backbone]
    C -->|Linear FastAttention| D[Predicted Squared Amplitudes]
    D -.->|Complexity Scaling| E((Convergence))
```

## Deliverables

The progressive deliverables and the corresponding performance metrics are cataloged below. Please click the directory links to navigate to the specific task folders.

| Task ID | Component | Metric (MSE Loss) | Visual Validation | Task Documentation | Notebook / Weights |
| --- | --- | --- | --- | --- | --- |
| 1.2 | Data Pre-processing | N/A (Lossless) | Reconstruction Proof | Task 1.2 README | Solution PDF |
| 2.5 | LM-JEPA Pre-training | 0.125 | JEPA Loss Curve | Task 2.5 README | Local Weights |
| 2.5 | QCD Fine-tuning | 0.098 | Parity Plot | Task 2.5 README | Local Weights |

## Comparative Results: Complexity Scaling

The table below demonstrates the model's robustness and the efficiency of the FastAttention mechanism when predicting squared amplitudes for expressions of varying lengths.

| Operand Count | Validation MSE Loss | Inference Infrastructure Status |
| --- | --- | --- |
| 2 operands | 0.091 | Stable |
| 4 operands | 0.098 | Stable |
| 6+ operands | 0.112 | Stable |

## Innovation Highlights

- **$O(N)$ Complexity Architecture:** The integration of FastAttention allows the model to efficiently process deep symbolic expressions containing more than four operands. This structural enhancement circumvents the computational bottleneck of standard $O(N^2)$ Transformers, preventing out-of-memory errors and maintaining rapid inference times on extended mathematical sequences.
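The linear-attention idea above can be sketched as follows. This is an illustrative NumPy implementation of the Katharopoulos-style kernel trick (feature map $\phi(x) = \mathrm{elu}(x) + 1$), not the repository's FastAttention module; the shapes match the `(1, 16, 256)` embedding mentioned earlier but are otherwise assumptions.

```python
import numpy as np

def phi(x):
    # elu(x) + 1: the non-negative feature map from Katharopoulos et al. (2020)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """O(N) attention: phi(Q) @ (phi(K)^T V), normalized; no N x N matrix is built."""
    q, k = phi(q), phi(k)
    kv = np.einsum("nd,ne->de", k, v)   # (D, E) summary of keys/values: O(N*D*E)
    z = q @ k.sum(axis=0)               # (N,) per-query normalizers
    return (q @ kv) / z[:, None]

rng = np.random.default_rng(0)
N, D = 16, 256
q, k, v = (rng.standard_normal((N, D)) for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)  # (16, 256)
```

Because the key-value summary `kv` has a fixed `(D, E)` shape regardless of sequence length, memory stays flat as the symbolic expression grows, which is the source of the "0 quadratic memory overhead" claim.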

### LM-JEPA with FastAttention

| 1. Latent Manifold Organization | 2. Linear Complexity Scaling |
| --- | --- |
| JEPA Manifold | Memory Scaling |
| **Mapping Physics Symmetries:** The model organizes scattering processes into distinct latent clusters (Electron vs. Muon) | **Breaking the O(N²) Barrier:** Linear attention ensures stable memory usage for deep symbolic trees |

- **JEPA Latent Space:** The Joint-Embedding Predictive Architecture (JEPA) facilitates robust representation learning. By operating in the latent space, the model establishes a foundational prior for Feynman diagrams entirely unsupervised, capturing the underlying physics before conventional supervised fine-tuning begins.
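To make the latent-space objective concrete, here is a toy sketch of the JEPA training signal: a context encoder embeds visible tokens, a predictor regresses the target encoder's embedding of the masked span, and the loss is computed between embeddings rather than tokens. All names, sizes, the pooling scheme, and the identity predictor are illustrative assumptions; the real backbone is a Transformer, not an embedding average.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab, d = 64, 32
E_ctx = rng.standard_normal((vocab, d)) * 0.1   # toy context-encoder "weights"
E_tgt = E_ctx.copy()                            # target encoder: EMA copy in JEPA
W_pred = np.eye(d)                              # predictor head (identity init)

tokens = rng.integers(0, vocab, size=16)        # one tokenized expression
mask = np.zeros(16, dtype=bool)
mask[[3, 7, 11, 12]] = True                     # positions the predictor must infer

ctx = E_ctx[tokens[~mask]].mean(axis=0)         # pooled context representation
pred = ctx @ W_pred                             # predicted latent for the masked span
target = E_tgt[tokens[mask]].mean(axis=0)       # target latent (no gradient flows here)
loss = float(np.mean((pred - target) ** 2))     # MSE in latent space, not token space
print(loss >= 0.0)
```

The key design point is that the loss never touches raw tokens: the model only has to predict *representations* of the masked physics content, which is what lets pre-training proceed fully unsupervised.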

## Observations

A key discovery during model optimization was the importance of variance control during training. When predicting complex QCD amplitudes, index normalization proved to be the crucial intervention for stabilizing the validation loss: regularizing the index distributions before tokenization kept the magnitude of attention gradients bounded, avoiding divergence and guaranteeing smooth convergence.
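The index-normalization intervention can be illustrated with a small sketch: dummy indices carrying arbitrary generated names are renamed to a canonical ordered set, so physically equivalent expressions map to identical token sequences. The `%sigma_NNN` index pattern and the canonical `mu_k` names here are assumptions for illustration, not the project's exact scheme.

```python
import re

def normalize_indices(expr: str, prefix: str = "mu_") -> str:
    """Rename indices like %sigma_347 to mu_0, mu_1, ... in order of appearance."""
    mapping = {}
    def rename(m):
        name = m.group(0)
        if name not in mapping:
            mapping[name] = f"{prefix}{len(mapping)}"  # canonical ordered name
        return mapping[name]
    return re.sub(r"%\w+_\d+", rename, expr)

a = "Tr(gamma_%sigma_347, gamma_%sigma_912)"
b = "Tr(gamma_%sigma_15, gamma_%sigma_88)"
print(normalize_indices(a))                        # Tr(gamma_mu_0, gamma_mu_1)
print(normalize_indices(a) == normalize_indices(b))  # True
```

Collapsing the arbitrary index labels into a small canonical set is what regularizes the index distribution seen by the tokenizer, which is the stabilizing effect described above.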

## References

| Category | Reference / Resource | Contribution to Project |
| --- | --- | --- |
| Physics Framework | Alnuqaydan, A., Gleyzer, S., & Prosper, H. (2023). "SYMBA: Symbolic Computation of Squared Amplitudes in High Energy Physics with Machine Learning". *Mach. Learn.: Sci. Technol.* 4, 015007 | The foundational work showing that Transformers can predict 97.6% (QCD) and 99% (QED) of squared amplitudes correctly |
| Data Generation | Uhlrich, G., Mahmoudi, F., & Arbey, A. (2021). "MARTY: A new C++ framework for automated symbolic calculations...". *Comput. Phys. Commun.* 264, 107928 | Context for the raw QED/QCD symbolic expressions in the 125,000-sample dataset |
| Tokenization | Lample, G., & Charton, F. (2019). "Deep Learning for Symbolic Mathematics" | Established the prefix (Polish) notation standard for processing mathematical trees |
| Architecture | LeCun, Y. (2022). "A Path Towards Autonomous Machine Intelligence" | Foundational theory for the asymmetric encoder-predictor JEPA stack used in Task 2.5 |
| Complexity | Katharopoulos, A., et al. (2020). "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" | Mathematical basis for the $O(N)$ FastAttention claim and numerical efficiency |
| Framework | Paszke, A., et al. (2019). "PyTorch: An Imperative Style, High-Performance Deep Learning Library" | Core library for model development and tensor operations |
| Community | ML4SCI GSoC Baselines & JetClass Resources | Standards for repository modularity |

## About

LM-JEPA for SYMBA: implementing foundation models with $O(N)$ linear FastAttention for symbolic calculation of squared amplitudes.
