
# GSoC 2026: E2E & End-to-End Deep Learning

**Candidate:** Anvitha Bhat

**Organisation:** ML4SCI / E2E

**Project:** Foundation Models for E2E Event Reconstruction


## Overview

This repository presents an E2E physics foundation model (Task 2j) targeting High-Luminosity LHC conditions. It combines a Joint-Embedding Predictive Architecture (JEPA) with linear FastAttention, replacing the $O(N^2)$ attention bottleneck of standard Transformers with $O(N)$ scaling.

| Component | Technical Result | Status | Reference |
|---|---|---|---|
| Task 2j: Foundation Model | 75.53% accuracy (0.35 loss) | Complete | Task README |
| Task 2g: E2E Inference | Linear $O(N)$ scaling (Opset 18) | Complete | Task README |

## Key Navigation

### Visual Evidence

#### E2E Architecture Flowchart

```mermaid
flowchart TD
    A["Raw E2E Data"] --> B["Preprocessing (100k Events)"]
    B --> C["E2E Foundation Model (JEPA)"]

    subgraph JEPA["Joint-Embedding Architecture"]
        direction TB
        C1["Input Sequence"] --> C2["Random Masking"]
        C2 --> C3["Context Encoder (FastAttention)"]
        C2 --> C4["Target Encoder (Momentum)"]
        C3 --> C5["Predictor"]
        C4 --> C6["Latent Target"]
        C5 <-->|"JEPA Loss"| C6
    end

    C3 --> D["CLS Embedding"]
    D --> E["Classification Head"] --> F{"Result"}
    F --> G["75.53% Accuracy"]

    D -.-> H["t-SNE Latent Visualization"]
    C3 -.-> I["ONNX Export (Opset 18)"]
    I -.-> J["E2E Inference (< 5ms)"]

    style JEPA fill:#f9f9f9,stroke:#333,stroke-dasharray: 5 5
    style G fill:#d4edda,stroke:#28a745,color:#155724
    style J fill:#cce5ff,stroke:#004085,color:#004085
```

## Technical Results & Proofs

*(Plots: latent manifold (t-SNE); $O(N)$ scaling (FastAttention); physics saliency; loss decay (0.8 → 0.3).)*
- **Representation Learning:** JEPA discovers distinct manifold separation between quark and gluon jets without explicit labels during pre-training.
- **Linear Scaling:** Replacing $O(N^2)$ attention with FastAttention allows the model to handle High-Luminosity LHC pileup scales (2048+ particles) without memory overflow.
- **Physics Intuition:** Saliency maps confirm the model focuses on core kinematic features ($p_T$, $\eta$, $\phi$).
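The saliency check above can be reproduced with a few lines of autograd: take the gradient of a class score with respect to the per-particle inputs. The model below is a stand-in classifier (not the repository's network), and the `(particles, features)` layout is an assumption for illustration.

```python
import torch
import torch.nn as nn

# Stand-in classifier: 16 particles x 4 features -> 2 classes.
# The repo's actual model differs; only the saliency recipe matters here.
model = nn.Sequential(nn.Flatten(), nn.Linear(16 * 4, 2))

x = torch.randn(1, 16, 4, requires_grad=True)  # (event, particles, features)
score = model(x)[0, 1]                         # score of one class
score.backward()                               # d(score)/d(inputs)

saliency = x.grad.abs().squeeze(0)   # (particles, features)
per_feature = saliency.mean(dim=0)   # mean importance per kinematic feature
```

Averaging the absolute gradients over particles gives one importance value per input feature, which is the kind of evidence the saliency plots summarise.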

## Ablation Summary

The full pipeline (JEPA and FastAttention) achieves a +15% AUC gain over vanilla Transformer baselines on high-multiplicity events.

## Technical Architecture

### 1. Joint-Embedding Predictive Architecture (JEPA)

Unlike conventional autoencoders, which reconstruct raw pixels, JEPA predicts latent representations. This pushes the model to capture the physical structure of energy distributions rather than low-level detail, making the learned features robust to detector noise.
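A minimal sketch of a JEPA training step, assuming a PyTorch setup: a context encoder sees only unmasked tokens, a predictor regresses the latents of a momentum-updated target encoder, and the loss is taken in latent space rather than pixel space. Module sizes, the linear encoders, and the masking scheme are illustrative, not the repository's actual configuration.

```python
import torch
import torch.nn as nn

class TinyJEPA(nn.Module):
    """Toy JEPA: predict target-encoder latents of masked tokens."""
    def __init__(self, dim=32, momentum=0.99):
        super().__init__()
        self.context_encoder = nn.Linear(dim, dim)
        self.target_encoder = nn.Linear(dim, dim)
        self.predictor = nn.Linear(dim, dim)
        self.m = momentum
        # Target encoder starts as a copy and is never trained by backprop.
        self.target_encoder.load_state_dict(self.context_encoder.state_dict())
        for p in self.target_encoder.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def momentum_update(self):
        # EMA of context-encoder weights, as in the "Momentum" box above.
        for pc, pt in zip(self.context_encoder.parameters(),
                          self.target_encoder.parameters()):
            pt.mul_(self.m).add_((1 - self.m) * pc)

    def forward(self, x, mask):
        # x: (batch, seq, dim); mask: (batch, seq) bool, True = hidden token.
        ctx = self.context_encoder(x * (~mask).unsqueeze(-1))  # visible only
        pred = self.predictor(ctx)
        with torch.no_grad():
            tgt = self.target_encoder(x)       # latent targets, full sequence
        # Loss only on masked positions: predict latents, not pixels.
        return ((pred - tgt) ** 2)[mask].mean()

x = torch.randn(4, 16, 32)
mask = torch.zeros(4, 16, dtype=torch.bool)
mask[:, 8:] = True                  # hide the second half of each sequence
model = TinyJEPA()
loss = model(x, mask)
loss.backward()                     # updates context encoder + predictor
model.momentum_update()             # target encoder follows by EMA
```

Because the target is a latent vector rather than raw pixels, the gradient never asks the model to reproduce noise it cannot predict.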

### 2. FastAttention: O(N) Efficiency

Linearising the attention mechanism meets the <5 ms HLT latency budget while preserving global context. The framework is also sparsity-ready, so dictionary-learning extensions can be layered on without structural changes.
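The linearisation trick can be sketched in a few lines: with a positive feature map $\phi$, attention is computed as $\phi(Q)\,(\phi(K)^\top V)$ with a matching normaliser, so the $N \times N$ attention matrix is never materialised. The kernel below uses the common $\mathrm{elu}(x)+1$ choice; the repository's FastAttention may use a different feature map.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """O(N) attention sketch: out = phi(Q) @ (phi(K).T @ V) / normaliser.
    phi = elu(x) + 1 keeps features positive, so the normaliser is safe."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qp, Kp = phi(Q), phi(K)                  # (N, d) each
    kv = Kp.T @ V                            # (d, d)  -- cost O(N d^2)
    z = Qp @ Kp.sum(axis=0)                  # (N,)    -- row normalisers
    return (Qp @ kv) / (z[:, None] + eps)    # (N, d)  -- cost O(N d^2)

# At HL-LHC pileup scales (2048+ particles) the (N, N) matrix never appears.
N, d = 2048, 64
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, N, d))
out = linear_attention(Q, K, V)
```

Since `kv` and `z` are computed once and reused for every query row, memory and time grow linearly in the number of particles rather than quadratically.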

## Repo Structure

Each task lives in its own parent folder with a README.md, models/, and evidence.

```
E2E_2026/
├── Task_2j_Foundation_Model/      # JEPA pre-training + O(N) attention
│   ├── models/                    #   FastAttention & JepaMAE
│   ├── data/                      #   preprocess_cms.py (100k events, 80-10-10)
│   ├── training/                  #   train_cms.py, val_cms.py, run_ablation.py
│   ├── results/
│   │   ├── e2e_flowchart.jpg      #   architecture flowchart
│   │   ├── verify_e2e_results.py  #   primary: run for 75.53%
│   │   ├── weights/               #   pre-trained weights
│   │   └── plots/                 #   latent_tsne.png, loss_decay_plot.png, etc.
│   └── README.md                  #   Task 2j detail page
│
├── Task_2g_CMSSW_Inference/       # CMSSW-ready ONNX inference
│   ├── onnx_models/               #   part_hybrid_vit.onnx, momentum_regressor.onnx
│   ├── benchmarks/                #   run_onnx_inference.py, benchmark_model.py
│   ├── CMSSW_Guide.md             #   E2E-ready ONNX inference guide
│   └── README.md                  #   Task 2g detail page
│
├── reco/                          #   CMSSW inference configuration (inference_cfg.py)
├── utils/                         #   Shared visualization & export scripts
├── proj_data/                     #   Processed .npz splits (train/val/test)
├── QuarkGluon/                    #   Raw E2E parquet files (~22 GB)
└── README.md
```

## Quick Verification

Run all commands from the repo root:

```bash
# Task 2j: reproduce 75.53% accuracy
python Task_2j_Foundation_Model/data/preprocess_cms.py
python Task_2j_Foundation_Model/results/verify_e2e_results.py   # 75.53%

# Task 2j: regenerate plots
python utils/visualize_latent_space.py                           # t-SNE
python utils/plot_scaling_comparison.py                          # O(N) scaling

# Task 2g: E2E latency benchmark
pip install onnxruntime
python Task_2g_CMSSW_Inference/benchmarks/run_onnx_inference.py
```
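The latency benchmark boils down to warming up and averaging wall-clock time over many runs. A generic sketch of that recipe (the helper and the stand-in workload are illustrative; with `onnxruntime` installed, the callable would wrap `session.run` on one of the exported models):

```python
import time
import numpy as np

def mean_latency_ms(fn, warmup=10, runs=100):
    """Warm up, then average wall-clock time per call in milliseconds."""
    for _ in range(warmup):
        fn()                          # warm caches / JIT before timing
    t0 = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - t0) / runs * 1e3

# With onnxruntime, fn would wrap an inference call, e.g.:
#   sess = onnxruntime.InferenceSession("part_hybrid_vit.onnx")
#   fn = lambda: sess.run(None, {input_name: batch})
# Here a matrix multiply stands in so the sketch is self-contained.
x = np.random.randn(2048, 64).astype(np.float32)
w = np.random.randn(64, 64).astype(np.float32)
ms = mean_latency_ms(lambda: x @ w)
```

Warming up before timing matters: the first few calls of an inference session pay one-off allocation and graph-optimisation costs that would otherwise skew the mean.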
