Adaptive Resolution Vision Transformer (ViT)

This project explores adaptive inference for Vision Transformers with the goal of reducing unnecessary compute and latency at inference time.

Standard ViT inference runs at a fixed resolution, meaning every image pays the full compute cost even when the image is simple and does not benefit from high resolution. This leads to wasted GPU time, higher latency, and inefficient deployment in real-world systems.

Instead of treating all inputs equally, this project investigates whether changes in the model’s internal representations can be used to decide when higher-resolution computation is actually necessary.

The focus of this project is systems efficiency, not accuracy improvements.

`forward_inference.py` — Static ViT Forward Pass Cost

This file measures the raw forward-pass latency of a custom StaticViT model at different input resolutions.

Key characteristics:

No dataset
No dataloader
Single-image inference
Measures pure model compute cost only

The goal is to understand how expensive resolution scaling is in isolation, without any system-level overhead.
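A minimal sketch of how a model-only timing loop of this kind could look. `time_forward` and its parameters are hypothetical names, not taken from `forward_inference.py`, and the callable stands in for one forward pass on a single preloaded image. On a GPU the device would also need to be synchronized before reading the clock (in PyTorch, `torch.cuda.synchronize()`); that step is left out here to keep the sketch dependency-free.

```python
import time
from statistics import median


def time_forward(forward_fn, warmup=10, iters=100):
    """Return the median wall-clock time of forward_fn() in milliseconds.

    forward_fn is a hypothetical zero-argument callable that runs one
    forward pass on a single, already-loaded input.
    """
    # Warm-up runs are discarded so one-time costs (allocation, caching,
    # kernel compilation) do not inflate the measurement.
    for _ in range(warmup):
        forward_fn()

    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        forward_fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    # The median is less sensitive to scheduler noise than the mean.
    return median(samples_ms)


if __name__ == "__main__":
    # Stand-in workload; a real run would call the model here.
    latency = time_forward(lambda: sum(i * i for i in range(10_000)))
    print(f"median latency: {latency:.3f} ms")
```

Keeping data loading and preprocessing outside the timed region is what makes the result a model-only number.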

Model configuration (summary)

Patch-based Vision Transformer
Fixed architecture
Same weights for all runs
Only input resolution changes

Forward-pass latency (model-only)

Low resolution (32×32): ~1.8 ms
High resolution (64×64): ~8.7 ms

These numbers are not end-to-end inference latency.
They represent the lower bound on compute cost imposed by resolution alone.
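One way to see why the cost grows faster than the image side length: in a patch-based ViT, the token count scales with the square of the resolution. The sketch below assumes a patch size of 4 and a single CLS token. Neither value is stated in this README, so treat both as placeholders.

```python
def num_tokens(resolution, patch_size=4, cls_tokens=1):
    """Sequence length for a square image split into square patches.

    patch_size=4 and cls_tokens=1 are assumptions for illustration.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    patches_per_side = resolution // patch_size
    return patches_per_side ** 2 + cls_tokens


low = num_tokens(32)    # 8 * 8 patches + CLS
high = num_tokens(64)   # 16 * 16 patches + CLS

print(low, high)                                     # 65 257
print(f"token ratio: {high / low:.2f}x")             # 3.95x
print(f"measured latency ratio: {8.7 / 1.8:.2f}x")   # 4.83x, from the numbers above
```

Doubling the side length roughly quadruples the token count. The measured latency ratio is somewhat higher than that, which is consistent with the self-attention term growing quadratically in the number of tokens.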

`baseline_inference.py` — Static ViT Dataset Baselines

This file establishes reference baselines for static ViT inference on a real dataset.

Dataset: Tiny ImageNet
Images evaluated: 2000
Inference mode: static (always low-res or always high-res)

These baselines represent how ViT is typically deployed today.

Static baseline results

Low resolution (32×32):

~2.5 ms / image
~30 MB peak VRAM

High resolution (64×64):

~8.8 ms / image
~68 MB peak VRAM

These values are used as reference points when evaluating adaptive inference.

`analysis/delta.py` — Representation Delta Analysis

This file analyzes how much the model’s internal representation changes when resolution increases.

For the same image:

Run ViT at low resolution
Run ViT at high resolution
Measure the L2 distance between CLS embeddings

This produces a scalar value per image that answers:

Does increasing resolution materially change the model’s computation?
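The per-image delta described above amounts to a Euclidean distance between two vectors. A minimal sketch, assuming both resolutions yield CLS embeddings of the same dimension; `cls_delta` is a hypothetical name and plain Python lists stand in for tensors.

```python
import math


def cls_delta(cls_low, cls_high):
    """L2 distance between low-res and high-res CLS embeddings."""
    if len(cls_low) != len(cls_high):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cls_low, cls_high)))


# Toy 4-dimensional embeddings for one image.
delta = cls_delta([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 5.0, 7.0])
print(delta)  # 5.0
```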

What this gives us

A distribution of representation deltas
Identification of images where high resolution meaningfully affects the model
A principled oracle signal for adaptive inference

This signal is not a measure of semantic confidence; it purely measures how sensitive the model’s computation is to input resolution.
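As an illustration of how the delta distribution could be turned into an oracle label, the sketch below flags the images whose delta falls in the top fraction of the distribution. The function name, the toy values, and the 10% fraction are assumptions for the example, not the values used in `analysis/delta.py`.

```python
def oracle_labels(deltas, top_fraction=0.10):
    """Flag the images with the largest deltas as 'needs high resolution'.

    Returns (threshold, labels), where labels[i] is True when deltas[i]
    is at or above the threshold.
    """
    if not deltas:
        raise ValueError("deltas must be non-empty")
    k = max(1, round(len(deltas) * top_fraction))
    # The k-th largest delta becomes the cut-off.
    threshold = sorted(deltas, reverse=True)[k - 1]
    return threshold, [d >= threshold for d in deltas]


deltas = [0.2, 0.3, 0.1, 2.5, 0.4, 0.2, 0.3, 0.1, 0.5, 0.2]
threshold, labels = oracle_labels(deltas, top_fraction=0.10)
print(threshold)      # 2.5
print(sum(labels))    # 1
```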

`adaptive/main.py` — Adaptive Inference Policy

This file implements the adaptive inference system.

How adaptive inference works

Run low-resolution inference for every image
Compute a cheap proxy signal from the low-resolution pass only (layerwise CLS representation change)
Compare the proxy against a calibrated threshold
Escalate to high resolution only when necessary
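The steps above can be sketched as follows. Everything here is a stand-in: `run_low` and `run_high` are hypothetical callables for the two forward passes, `run_low` is assumed to return the per-layer CLS embeddings alongside its prediction, and summing L2 distances between consecutive layers is only one plausible reading of "layerwise CLS representation change".

```python
import math


def layerwise_cls_change(cls_per_layer):
    """Sum of L2 distances between CLS embeddings of consecutive layers."""
    total = 0.0
    for prev, curr in zip(cls_per_layer, cls_per_layer[1:]):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(prev, curr)))
    return total


def adaptive_infer(image, run_low, run_high, threshold):
    """Run low-res first; escalate to high-res when the proxy exceeds threshold.

    Returns (prediction, escalated).
    """
    prediction, cls_per_layer = run_low(image)
    proxy = layerwise_cls_change(cls_per_layer)
    if proxy > threshold:
        return run_high(image), True
    return prediction, False


# Toy stand-ins: the "image" is just a key into canned outputs.
def run_low(image):
    canned = {
        "easy": ("cat", [[0.0, 0.0], [0.1, 0.0], [0.1, 0.1]]),
        "hard": ("cat", [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]]),
    }
    return canned[image]


def run_high(image):
    return "dog"


print(adaptive_infer("easy", run_low, run_high, threshold=1.0))  # ('cat', False)
print(adaptive_infer("hard", run_low, run_high, threshold=1.0))  # ('dog', True)
```

Because the proxy is computed from activations the low-resolution pass already produces, the decision itself adds very little work on top of that pass.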

The proxy is calibrated offline to ensure it lives in the same representation space as the oracle signal.
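One possible form of offline calibration is a least-squares linear map from the proxy to the oracle delta, fitted on a held-out set, so that a threshold chosen on oracle deltas can be applied to calibrated proxy values. This is an assumption about what calibration could look like, not a description of `adaptive/main.py`; the names and toy data are hypothetical.

```python
def fit_linear_calibration(proxy, oracle):
    """Least-squares fit of oracle = slope * proxy + intercept (approximately)."""
    n = len(proxy)
    if n != len(oracle) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_p = sum(proxy) / n
    mean_o = sum(oracle) / n
    var_p = sum((p - mean_p) ** 2 for p in proxy)
    if var_p == 0:
        raise ValueError("proxy values are constant; cannot fit a slope")
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(proxy, oracle))
    slope = cov / var_p
    intercept = mean_o - slope * mean_p
    return slope, intercept


# Toy calibration set: the oracle delta happens to be exactly 2 * proxy + 1.
proxy_vals = [0.0, 1.0, 2.0, 3.0]
oracle_vals = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_linear_calibration(proxy_vals, oracle_vals)
print(slope, intercept)  # 2.0 1.0

# At inference time the calibrated proxy is compared with the threshold.
calibrated = slope * 1.5 + intercept
print(calibrated)  # 4.0
```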

Final Adaptive Inference Results

images processed : 2000
avg latency : 2.894 ms / image
peak VRAM : 61.8 MB
escalation rate : 9.5 %

Static Baselines (Reference)

low resolution : ~2.8 ms / image (~30 MB peak)
high resolution : ~9.6 ms / image (~68 MB peak)

Key Outcome

~3.3× faster than static high-resolution inference
High-resolution compute is used for <10% of images
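Peak VRAM (61.8 MB) remains close to the high-res baseline (~68 MB) because escalated images still run the full high-resolution pass (expected)
Average latency (2.894 ms / image) stays within ~0.1 ms of static low-resolution inference, with correspondingly less GPU compute than always running high-res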
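The headline figures follow directly from the reported measurements. A short worked check, using only numbers that appear in this README:

```python
images = 2000
adaptive_ms = 2.894
static_low_ms = 2.8
static_high_ms = 9.6
escalation_rate = 0.095

speedup_vs_high = static_high_ms / adaptive_ms
overhead_vs_low = adaptive_ms / static_low_ms - 1.0
escalated_images = round(images * escalation_rate)

print(f"speedup vs static high-res: {speedup_vs_high:.1f}x")   # 3.3x
print(f"overhead vs static low-res: {overhead_vs_low:.1%}")    # 3.4%
print(f"images escalated: {escalated_images}")                 # 190
```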

Summary

This project demonstrates that Vision Transformer inference does not need to be static.

By using representation dynamics instead of semantic confidence, it is possible to:

Avoid unnecessary high-resolution compute
Preserve model behavior on difficult inputs
Achieve large real-world speedups without changing the model architecture

The result is a principled, system-level adaptive ViT inference pipeline suitable for real deployment scenarios.
