This project explores adaptive inference for Vision Transformers with the goal of reducing unnecessary compute and latency at inference time.
Standard ViT inference runs at a fixed resolution, so every image pays the full compute cost, even a simple image that gains nothing from high resolution. The result is wasted GPU time, higher latency, and inefficient deployment in real-world systems.
Instead of treating all inputs equally, this project investigates whether changes in the model’s internal representations can be used to decide when higher-resolution computation is actually necessary.
The focus of this project is systems efficiency, not accuracy improvements.
This file measures the raw forward-pass latency of a custom StaticViT model at different input resolutions.
Key characteristics:
- No dataset
- No dataloader
- Single-image inference
- Measures pure model compute cost only
The goal is to understand how expensive resolution scaling is in isolation, without any system-level overhead.
- Patch-based Vision Transformer
- Fixed architecture
- Same weights for all runs
- Only input resolution changes
- Low resolution (32×32): ~1.8 ms
- High resolution (64×64): ~8.7 ms
These numbers are not end-to-end inference latency.
They represent the lower bound on compute cost imposed by resolution alone.
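A minimal sketch of this kind of isolated latency measurement, using only the Python standard library. The name `time_forward` and its parameters are hypothetical and are not taken from this repository; `forward` stands in for a call into the StaticViT model, and `x` for a single input at the resolution under test.

```python
import statistics
import time
from typing import Any, Callable, Optional


def time_forward(
    forward: Callable[[Any], Any],
    x: Any,
    warmup: int = 10,
    iters: int = 100,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the median wall-clock latency of forward(x) in milliseconds."""
    # Warm-up runs absorb one-time costs such as allocation and caching.
    for _ in range(warmup):
        forward(x)
    if sync is not None:
        sync()

    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        forward(x)
        if sync is not None:
            # Wait for asynchronous device work before stopping the clock.
            sync()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples_ms)
```

If the model runs under PyTorch on a GPU (an assumption; the framework is not named here), `sync` would typically be `torch.cuda.synchronize`, since GPU kernels launch asynchronously and an unsynchronized timer would mostly measure launch overhead.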
This file establishes reference baselines for static ViT inference on a real dataset.
- Dataset: Tiny ImageNet
- Images evaluated: 2000
- Inference mode: static (always low-res or always high-res)
These baselines represent how ViT is typically deployed today.
Low resolution (32×32):
- ~2.5 ms / image
- ~30 MB peak VRAM
High resolution (64×64):
- ~8.8 ms / image
- ~68 MB peak VRAM
These values are used as reference points when evaluating adaptive inference.
This file analyzes how much the model’s internal representation changes when resolution increases.
For the same image:
- Run ViT at low resolution
- Run ViT at high resolution
- Measure the L2 distance between CLS embeddings
This produces a scalar value per image that answers:
Does increasing resolution materially change the model’s computation?
- A distribution of representation deltas
- Identification of images where high resolution meaningfully affects the model
- A principled oracle signal for adaptive inference
This signal is not semantic confidence; it measures only how sensitive the model's computation is to input resolution.
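The per-image delta described above can be sketched as follows. This is an illustration only: `representation_delta`, `delta_distribution`, `embed_low`, and `embed_high` are hypothetical names, and embeddings are represented as plain sequences of floats rather than tensors.

```python
import math
from typing import Any, Callable, Iterable, List, Sequence

Embedding = Sequence[float]


def representation_delta(cls_low: Embedding, cls_high: Embedding) -> float:
    """L2 distance between the low-res and high-res CLS embeddings of one image."""
    if len(cls_low) != len(cls_high):
        raise ValueError("CLS embeddings must have the same dimension")
    return math.dist(cls_low, cls_high)


def delta_distribution(
    images: Iterable[Any],
    embed_low: Callable[[Any], Embedding],
    embed_high: Callable[[Any], Embedding],
) -> List[float]:
    """One scalar delta per image; both passes must use the same weights."""
    return [representation_delta(embed_low(img), embed_high(img)) for img in images]
```

Because the delta needs the high-resolution pass, it can only serve as an offline oracle; it cannot be used at inference time to decide whether to run that pass.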
This file implements the adaptive inference system.
- Run low-resolution inference for every image
- Compute a cheap proxy signal from low-resolution only (layerwise CLS representation change)
- Compare the proxy against a calibrated threshold
- Escalate to high resolution only when necessary
The proxy is calibrated offline to ensure it lives in the same representation space as the oracle signal.
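The routing steps above can be sketched as follows. This is one possible reading, not the repository's implementation: the proxy is assumed to be the mean L2 distance between CLS embeddings of consecutive layers, a larger proxy is assumed to mean "escalate", and `run_low`, `run_high`, `AdaptiveResult`, and `adaptive_infer` are hypothetical names. Offline calibration of the proxy is not shown.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Sequence, Tuple

Embedding = Sequence[float]


def layerwise_cls_change(cls_per_layer: Sequence[Embedding]) -> float:
    """Assumed proxy: mean L2 distance between consecutive layers' CLS embeddings."""
    steps = [math.dist(a, b) for a, b in zip(cls_per_layer, cls_per_layer[1:])]
    return sum(steps) / len(steps) if steps else 0.0


@dataclass
class AdaptiveResult:
    output: Any
    escalated: bool
    proxy: float


def adaptive_infer(
    image: Any,
    run_low: Callable[[Any], Tuple[Any, Sequence[Embedding]]],
    run_high: Callable[[Any], Any],
    threshold: float,
) -> AdaptiveResult:
    """Always run low resolution; run high resolution only above the threshold."""
    # run_low returns the model output plus the CLS embedding at each layer.
    output, cls_per_layer = run_low(image)
    proxy = layerwise_cls_change(cls_per_layer)
    if proxy > threshold:
        return AdaptiveResult(run_high(image), True, proxy)
    return AdaptiveResult(output, False, proxy)


def escalation_rate(results: Iterable[AdaptiveResult]) -> float:
    """Fraction of images that were sent to the high-resolution pass."""
    results = list(results)
    return sum(r.escalated for r in results) / len(results) if results else 0.0
```

In this sketch an escalated image pays for both passes, so the threshold directly trades average latency against how often high-resolution behavior is preserved.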
- images processed : 2000
- avg latency : 2.894 ms / image
- peak VRAM : 61.8 MB
- escalation rate : 9.5 %
- low resolution : ~2.8 ms / image (~30 MB peak)
- high resolution : ~9.6 ms / image (~68 MB peak)
- ~3.3× faster than static high-resolution inference
- High-resolution compute is used for <10% of images
- Peak VRAM stays close to the static high-resolution figure because any escalated image still runs the full high-resolution pass (expected)
- Significant reduction in average latency and GPU compute
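The speedup figure follows directly from the latencies reported above. The short calculation below reproduces it; the constant and function names are hypothetical, and the values are copied from the numbers in this section.

```python
def speedup(baseline_ms: float, measured_ms: float) -> float:
    """How many times faster the measured latency is than the baseline."""
    return baseline_ms / measured_ms


# Latencies reported in this section, in ms per image.
ADAPTIVE_MS = 2.894
STATIC_LOW_MS = 2.8
STATIC_HIGH_MS = 9.6

# Adaptive vs. static high resolution: about 3.3x.
speedup_vs_high = speedup(STATIC_HIGH_MS, ADAPTIVE_MS)

# Relative cost of adaptive over always-low-resolution: about 3%.
overhead_vs_low = ADAPTIVE_MS / STATIC_LOW_MS - 1.0

print(f"speedup vs. static high-res: {speedup_vs_high:.2f}x")
print(f"overhead vs. static low-res: {overhead_vs_low:.1%}")
```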
This project demonstrates that Vision Transformer inference does not need to be static.
By using representation dynamics instead of semantic confidence, it is possible to:
- Avoid unnecessary high-resolution compute
- Preserve model behavior on difficult inputs
- Achieve large speedups (~3.3× over static high-resolution inference in these measurements) without changing the model architecture
The result is a principled, system-level adaptive ViT inference pipeline suitable for real deployment scenarios.