Mog9/Adaptive-ViT

Adaptive Resolution Vision Transformer (ViT)

This project explores adaptive inference for Vision Transformers with the goal of reducing unnecessary compute and latency at inference time.

Standard ViT inference runs at a fixed resolution, meaning every image pays the full compute cost even when the image is simple and does not benefit from high resolution. This leads to wasted GPU time, higher latency, and inefficient deployment in real-world systems.

Instead of treating all inputs equally, this project investigates whether changes in the model’s internal representations can be used to decide when higher-resolution computation is actually necessary.

The focus of this project is systems efficiency, not accuracy improvements.


forward_inference.py — Static ViT Forward Pass Cost

This file measures the raw forward-pass latency of a custom StaticViT model at different input resolutions.

Key characteristics:

  • No dataset
  • No dataloader
  • Single-image inference
  • Measures pure model compute cost only

The goal is to understand how expensive resolution scaling is in isolation, without any system-level overhead.
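A minimal sketch of this kind of model-only timing loop, using only the Python standard library. The function name `measure_latency_ms`, the warm-up and iteration counts, and the stand-in workload are assumptions for illustration and are not taken from forward_inference.py.

```python
import time
from statistics import fmean

def measure_latency_ms(fn, warmup=10, iters=100):
    """Return the mean wall-clock time of fn() in milliseconds."""
    # Warm-up calls are discarded so one-time costs (allocation, caching)
    # do not inflate the measurement.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        # On a GPU, kernels are queued asynchronously. A real harness would
        # synchronize the device (for example with torch.cuda.synchronize())
        # before reading the clock, otherwise only launch time is measured.
        samples.append((time.perf_counter() - start) * 1000.0)
    return fmean(samples)

# Stand-in workload (hypothetical); replace with a single-image forward pass.
def dummy_forward():
    return sum(i * i for i in range(10_000))

latency = measure_latency_ms(dummy_forward, warmup=2, iters=5)
print(f"{latency:.3f} ms per call")
```

Timing a single callable with no data loading in the loop is what keeps the number a pure compute cost rather than an end-to-end latency.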

Model configuration (summary)

  • Patch-based Vision Transformer
  • Fixed architecture
  • Same weights for all runs
  • Only input resolution changes

Forward-pass latency (model-only)

  • Low resolution (32×32): ~1.8 ms
  • High resolution (64×64): ~8.7 ms

These numbers are not end-to-end inference latency.
They represent the lower bound on compute cost imposed by resolution alone.
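The gap between the two resolutions follows from how a patch-based ViT tokenizes its input. The sketch below counts tokens at each resolution, assuming a patch size of 4 (the repository's actual patch size is not stated here) and one CLS token.

```python
def num_tokens(resolution, patch_size):
    """Patch tokens for a square image plus one CLS token."""
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side + 1

PATCH = 4  # assumed patch size, not taken from the repository

low = num_tokens(32, PATCH)    # 8 * 8 + 1 = 65
high = num_tokens(64, PATCH)   # 16 * 16 + 1 = 257

# Per-token layers (MLP, projections) scale linearly with token count;
# the attention score matrix scales with its square.
token_ratio = high / low
attention_ratio = (high / low) ** 2

print(low, high)
print(f"per-token cost grows ~{token_ratio:.1f}x, attention grows ~{attention_ratio:.1f}x")
```

Under these assumptions, doubling the side length roughly quadruples the token count. The measured ratio above (8.7 ms / 1.8 ms, about 4.8×) sits between the linear and quadratic terms.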


baseline_inference.py — Static ViT Dataset Baselines

This file establishes reference baselines for static ViT inference on a real dataset.

  • Dataset: Tiny ImageNet
  • Images evaluated: 2000
  • Inference mode: static (always low-res or always high-res)

These baselines represent how ViT is typically deployed today: a single fixed input resolution for every image.

Static baseline results

Low resolution (32×32):

  • ~2.5 ms / image
  • ~30 MB peak VRAM

High resolution (64×64):

  • ~8.8 ms / image
  • ~68 MB peak VRAM

These values are used as reference points when evaluating adaptive inference.


analysis/delta.py — Representation Delta Analysis

This file analyzes how much the model’s internal representation changes when resolution increases.

For the same image:

  • Run ViT at low resolution
  • Run ViT at high resolution
  • Measure the L2 distance between CLS embeddings

This produces a scalar value per image that answers:

Does increasing resolution materially change the model’s computation?

What this gives us

  • A distribution of representation deltas
  • Identification of images where high resolution meaningfully affects the model
  • A principled oracle signal for adaptive inference (an oracle because computing it requires the high-resolution pass itself)

This signal is not semantic confidence; it purely measures compute sensitivity.
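The per-image delta described above can be sketched in a few lines. The function name `representation_delta` and the toy embeddings are hypothetical; in practice the two vectors would be the CLS embeddings from the low-resolution and high-resolution passes over the same image.

```python
import math

def representation_delta(cls_low, cls_high):
    """L2 distance between two CLS embeddings of the same image."""
    if len(cls_low) != len(cls_high):
        raise ValueError("CLS embeddings must have the same dimension")
    return math.dist(cls_low, cls_high)

# Toy 4-dimensional embeddings (made up for illustration).
cls_low = [0.10, -0.40, 0.25, 0.00]
cls_high = [0.12, -0.10, 0.20, 0.30]

delta = representation_delta(cls_low, cls_high)
print(f"representation delta: {delta:.4f}")
```

Collecting this scalar over a dataset gives the distribution of deltas; images in the upper tail are the ones where resolution changes the representation most.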


adaptive/main.py — Adaptive Inference Policy

This file implements the adaptive inference system.

How adaptive inference works

  1. Run low-resolution inference for every image
  2. Compute a cheap proxy signal from the low-resolution pass only
    (layerwise CLS representation change)
  3. Compare the proxy against a calibrated threshold
  4. Escalate to high resolution only when necessary
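The four steps above can be sketched as a small control loop. Everything named here is an assumption for illustration: `run_low` and `run_high` are hypothetical callables standing in for the two model passes, the proxy is taken to be the summed L2 change of the CLS vector across consecutive layers, and escalation is assumed to trigger when the proxy exceeds the threshold. The actual code in adaptive/main.py may define these differently.

```python
import math

def layerwise_cls_change(cls_per_layer):
    """Sum of L2 distances between the CLS vector at consecutive layers."""
    return sum(math.dist(a, b) for a, b in zip(cls_per_layer, cls_per_layer[1:]))

def adaptive_infer(image, run_low, run_high, threshold):
    """Return (prediction, escalated).

    run_low(image)  -> (prediction, list of per-layer CLS vectors)
    run_high(image) -> prediction
    """
    prediction, cls_per_layer = run_low(image)       # step 1
    proxy = layerwise_cls_change(cls_per_layer)      # step 2
    if proxy > threshold:                            # step 3
        return run_high(image), True                 # step 4: escalate
    return prediction, False

# Toy stand-ins for the two model passes (hypothetical).
def fake_low(image):
    return "low-res prediction", [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]

def fake_high(image):
    return "high-res prediction"

kept = adaptive_infer("img", fake_low, fake_high, threshold=3.0)
escalated = adaptive_infer("img", fake_low, fake_high, threshold=1.5)
print(kept)
print(escalated)
```

Because the proxy is computed from activations the low-resolution pass already produces, the decision itself adds little cost beyond that pass.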

The proxy is calibrated offline to ensure it lives in the same representation space as the oracle signal.
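One simple way such an offline calibration could work, shown here purely as an assumption, is a least-squares linear fit that maps proxy values onto the oracle deltas measured on a held-out calibration set. The function names and the sample numbers are hypothetical.

```python
from statistics import fmean

def fit_linear_calibration(proxy_values, oracle_deltas):
    """Least-squares fit of oracle ~ slope * proxy + intercept."""
    if len(proxy_values) != len(oracle_deltas):
        raise ValueError("inputs must have the same length")
    mean_p = fmean(proxy_values)
    mean_o = fmean(oracle_deltas)
    var_p = sum((p - mean_p) ** 2 for p in proxy_values)
    if var_p == 0:
        raise ValueError("proxy values must not all be identical")
    cov = sum((p - mean_p) * (o - mean_o)
              for p, o in zip(proxy_values, oracle_deltas))
    slope = cov / var_p
    intercept = mean_o - slope * mean_p
    return slope, intercept

def calibrate(proxy, slope, intercept):
    """Map a raw proxy value onto the oracle's scale."""
    return slope * proxy + intercept

# Made-up calibration pairs: (proxy from low-res pass, oracle delta).
proxies = [1.0, 2.0, 3.0, 4.0]
oracles = [2.1, 3.9, 6.2, 7.8]

slope, intercept = fit_linear_calibration(proxies, oracles)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"calibrated proxy for 2.5: {calibrate(2.5, slope, intercept):.2f}")
```

With the proxy expressed on the oracle's scale, a single threshold chosen from the oracle's delta distribution can be applied at inference time without running the high-resolution pass.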


Final Adaptive Inference Results

  • images processed : 2000
  • avg latency : 2.894 ms / image
  • peak VRAM : 61.8 MB
  • escalation rate : 9.5 %

Static Baselines (Reference)

  • low resolution : ~2.8 ms / image (~30 MB peak)
  • high resolution : ~9.6 ms / image (~68 MB peak)

Key Outcome

  • ~3.3× faster than static high-resolution inference
  • High-resolution compute is used for <10% of images
  • Peak VRAM remains close to high-res due to worst-case escalation (expected)
  • Significant reduction in average latency and GPU compute
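The headline ratio can be checked directly from the figures reported in this section. The variable names are illustrative; the numbers are the ones listed above.

```python
adaptive_ms = 2.894     # adaptive avg latency per image
static_low_ms = 2.8     # static low-resolution reference
static_high_ms = 9.6    # static high-resolution reference

speedup_vs_high = static_high_ms / adaptive_ms
overhead_vs_low = adaptive_ms / static_low_ms - 1.0

print(f"speedup over static high-res: {speedup_vs_high:.2f}x")   # 3.32x
print(f"overhead over static low-res: {overhead_vs_low:.1%}")    # 3.4%
```

In other words, the adaptive pipeline costs only a few percent more than always running at low resolution.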

Summary

This project demonstrates that Vision Transformer inference does not need to be static.

By using representation dynamics instead of semantic confidence, it is possible to:

  • Avoid unnecessary high-resolution compute
  • Preserve model behavior on difficult inputs
  • Achieve large real-world speedups without changing the model architecture

The result is a principled, system-level adaptive ViT inference pipeline suitable for real deployment scenarios.

About

An adaptive Vision Transformer inference system that avoids unnecessary high-resolution computation, achieving ~3× faster inference than static high-res ViT by selectively escalating only when needed.
