deep-vqa-framework

A Unified Deep Learning Framework for Image Quality Assessment (IQA) and Video Quality Assessment (VQA).

This framework provides an end-to-end solution for training, evaluating, and deploying quality assessment models. It features a unified architecture that seamlessly handles both image and video inputs, multi-dataset support, cross-validation pipelines, and production-ready inference APIs.

Note

This framework is primarily tested on AutoDL cloud GPU instances.

System Requirements

Hardware Requirements

Component	Minimum	Recommended
GPU	8GB VRAM (IQA only)	24GB+ VRAM (VQA training)
RAM	16GB	32GB+
Disk Space	50GB	200GB+ (including datasets)
CPU	4 cores	8+ cores

Software Requirements

Component	Version	Notes
OS	Linux (Ubuntu 20.04+) / Windows 10+ / macOS 12+	Linux recommended for training
Python	3.10 - 3.12	3.12+ not fully tested
CUDA	11.8 / 12.1	Required for GPU training
PyTorch	2.0+	2.5+ recommended for better AMP support
cuDNN	8.7+	Included with PyTorch

Storage Breakdown

The supported datasets require approximately the following disk space:

Dataset	Size (Compressed)	Size (Extracted)
TID2013	~500MB	~3GB
KoNViD-1k	~9GB	~10GB
T2VQA-DB	~45GB	~50GB+
Total	~55GB	~63GB+

Additional Space

Item	Estimated Size
Python environment (uv/venv)	~5GB
Model checkpoints (5-fold)	~10GB
Training logs & plots	~2GB
Grand Total	~80-100GB

Note

- Use uv for faster installation and a smaller dependency footprint.
- Symbolic links (see below) do not consume additional disk space.
- The quarantine/ directory may grow if files are isolated; run scripts/cache_clean.sh regularly.

Architecture & Design Decisions

Unified IQA/VQA Architecture

The framework implements a dimension-aware routing system that automatically switches between image (4D tensors) and video (5D tensors) processing modes.

Key Design Decisions:

Decision	Implementation	Rationale
Unified Model	Single `IQAVQANet` handles both 4D and 5D inputs	Eliminates duplicate code, ensures consistent quality metrics
Flexible Backbones	Swin-T / ResNet50 with automatic feature adaptation	Balances accuracy vs. memory consumption
Temporal Fusion	Transformer encoder for video frame aggregation	Captures long-range dependencies between frames
Hybrid Loss	MSE (70%) + Rank Loss (30%)	Optimizes both absolute prediction and relative ordering
Multi-Dataset Support	YAML-based configuration with factory pattern	Easy addition of new datasets without code changes
Path Abstraction	DSL-based `PathManager` with YAML routing	Eliminates hardcoded paths, supports symbolic links
Lazy Asset Resolution	`CaseInsensitiveAssetResolver` with pre-built index	O(1) file lookup, case-insensitive matching
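
The "pre-built index" idea behind the lazy asset resolution row can be illustrated with a small standard-library sketch. The class name `CaseInsensitiveIndex` and its `resolve` method are hypothetical stand-ins, not the framework's actual `CaseInsensitiveAssetResolver` API; the sketch only assumes that the index is built by walking the dataset root once and keying files by their lowercased relative path.

```python
import os
import tempfile
from pathlib import Path


class CaseInsensitiveIndex:
    """Hypothetical stand-in for a case-insensitive asset resolver."""

    def __init__(self, root):
        self.root = Path(root)
        self._index = {}
        # Walk the tree once; every later lookup is a single dict access.
        for dirpath, _dirnames, filenames in os.walk(self.root):
            for name in filenames:
                full = Path(dirpath) / name
                key = full.relative_to(self.root).as_posix().lower()
                # Keep the first match if two files differ only by case.
                self._index.setdefault(key, full)

    def resolve(self, relative_path):
        """Return the on-disk path for a relative path, ignoring case."""
        key = Path(relative_path).as_posix().lower()
        try:
            return self._index[key]
        except KeyError:
            raise FileNotFoundError(relative_path) from None


# Usage: the annotation file says "videos/clip_001.mp4",
# but the file on disk is "Videos/Clip_001.MP4".
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Videos").mkdir()
    (Path(tmp) / "Videos" / "Clip_001.MP4").write_bytes(b"")
    index = CaseInsensitiveIndex(tmp)
    print(index.resolve("videos/clip_001.mp4").name)
```

Building the dictionary costs one directory walk up front, which is the trade that makes each subsequent lookup constant time on average.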

Model Architecture

IQAVQANet: Unified Quality Assessment Network

# Architecture overview
Input (4D: [B,3,H,W] or 5D: [B,F,3,H,W])
    ↓
Backbone (Swin-T / ResNet50)
    ↓
Spatial Pooling (AdaptiveAvgPool2d)
    ↓
[Temporal Fusion] ← TransformerEncoder (only for video)
    ↓
Quality Head (3-layer MLP + Sigmoid)
    ↓
Output: Quality Score (0-1 range)
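
The routing between 4D and 5D inputs can be sketched at the level of shapes alone. This is an illustration under assumptions, not the `IQAVQANet` source: the function name `plan_forward` is hypothetical, the 224×224 resolution is an arbitrary example, and folding the frame axis into the batch axis before the backbone is a common technique that the framework may or may not use.

```python
def plan_forward(shape):
    """Describe how an input of the given shape would be routed.

    4D (B, C, H, W)    -> image mode, no temporal fusion.
    5D (B, F, C, H, W) -> video mode, frames folded into the batch axis
                          for the backbone, then regrouped per clip.
    """
    if len(shape) == 4:
        b, c, h, w = shape
        return {
            "mode": "image",
            "backbone_input": (b, c, h, w),
            "frames_per_sample": 1,
            "temporal_fusion": False,
        }
    if len(shape) == 5:
        b, f, c, h, w = shape
        return {
            "mode": "video",
            # Assumption: each frame passes through the 2D backbone independently.
            "backbone_input": (b * f, c, h, w),
            "frames_per_sample": f,
            "temporal_fusion": True,
        }
    raise ValueError(f"expected a 4D or 5D input, got {len(shape)}D: {shape}")


print(plan_forward((2, 3, 224, 224))["mode"])
print(plan_forward((2, 8, 3, 224, 224))["backbone_input"])
```

Under this assumption a batch of 2 clips with 8 frames each reaches the backbone as 16 images, which is why `num_frames` has such a direct effect on memory.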

Supported Configurations

Backbone	Parameters	IQA	VQA	Memory (per sample)
ResNet50	25M	✅	✅	~2GB (8 frames)
Swin-T	28M	✅	✅	~4GB (8 frames)

Loss Function: Hybrid MSE + Rank Loss

Total Loss = 0.7 × MSE + 0.3 × Rank Loss

- MSE Loss: Absolute prediction accuracy
- Rank Loss: Preserves relative ordering between samples
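
The weighting above can be made concrete with a plain-Python sketch. The README does not state which rank loss the framework uses, so the pairwise hinge formulation below is an assumption chosen for illustration, and the function names are hypothetical. A real training loop would presumably compute the same quantities on tensors.

```python
def mse_loss(pred, target):
    """Mean squared error between predictions and ground-truth scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pairwise_rank_loss(pred, target, margin=0.0):
    """Hinge penalty on every pair whose predicted order contradicts the labels."""
    total, pairs = 0.0, 0
    n = len(pred)
    for i in range(n):
        for j in range(i + 1, n):
            if target[i] == target[j]:
                continue  # tied labels carry no ordering information
            sign = 1.0 if target[i] > target[j] else -1.0
            total += max(0.0, margin - sign * (pred[i] - pred[j]))
            pairs += 1
    return total / pairs if pairs else 0.0


def hybrid_loss(pred, target, w_mse=0.7, w_rank=0.3):
    """Total Loss = 0.7 x MSE + 0.3 x Rank Loss (weights taken from the README)."""
    return w_mse * mse_loss(pred, target) + w_rank * pairwise_rank_loss(pred, target)


# Correctly ordered predictions pay only the MSE term;
# swapped predictions are penalised by both terms.
print(hybrid_loss([0.2, 0.8, 0.5], [0.3, 0.7, 0.6]))
print(hybrid_loss([0.9, 0.1], [0.1, 0.9]))
```

The rank term is zero whenever the predicted ordering matches the label ordering, so it only contributes gradient for misordered pairs.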

Training Pipeline

Quick Start Training

Step 1: Initialize Paths

make link

Step 2: Training Commands

You can choose between running the direct uv command or using the make wrapper.

Dataset	Model	uv Command	make Command
TID2013	`resnet_iqa`	`uv run python -m src.main --model resnet_iqa --dataset tid2013`	`make train DATASET=tid2013 MODEL=resnet_iqa`
KoNViD-1k	`timeswin_vqa`	`uv run python -m src.main --model timeswin_vqa --dataset konvid-1k`	`make train DATASET=konvid-1k MODEL=timeswin_vqa`
T2VQA-DB	`resnet_vqa`	`uv run python -m src.main --model resnet_vqa --dataset t2vqa-db`	`make train DATASET=t2vqa-db MODEL=resnet_vqa`

Note: By default, DEBUG=0 is applied in make commands. You can override it by appending DEBUG=1 if needed.

Configuration Parameters

# config/models/resnet_vqa.yaml
preprocessing:
  batch_size: 2          # Reduce if OOM
  num_workers: 4         # Data loading threads
  k_fold: 5              # Cross-validation folds

model:
  backbone: "resnet50"   # or "swin_t"
  num_frames: 8          # Video frames per sample
  transformer_layers: 2  # Temporal fusion depth

train:
  epochs: 50
  lr: 0.0001
  gradient_accumulation_steps: 4  # Effective batch = batch_size × steps
  early_stop:
    enabled: true
    patience: 10
    monitor: "val_srocc"
    mode: "max"
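
The `early_stop` settings above describe a patience counter on a monitored metric. A minimal sketch of that logic follows; the class name `EarlyStopper` is hypothetical and the framework's own implementation may differ, for example by requiring a minimum improvement.

```python
class EarlyStopper:
    """Stop training once the monitored metric stops improving."""

    def __init__(self, patience=10, mode="max"):
        if mode not in ("max", "min"):
            raise ValueError("mode must be 'max' or 'min'")
        self.patience = patience
        self.mode = mode
        self.best = None
        self.bad_epochs = 0

    def step(self, value):
        """Record one epoch's metric; return True when training should stop."""
        improved = (
            self.best is None
            or (self.mode == "max" and value > self.best)
            or (self.mode == "min" and value < self.best)
        )
        if improved:
            self.best = value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Usage with made-up validation SROCC values and a short patience.
stopper = EarlyStopper(patience=2, mode="max")
for epoch, val_srocc in enumerate([0.50, 0.60, 0.59, 0.58]):
    if stopper.step(val_srocc):
        print(f"stopping after epoch {epoch}, best val_srocc={stopper.best}")
        break
```

With the configuration shown above (`batch_size: 2`, `gradient_accumulation_steps: 4`), the effective batch size is 2 × 4 = 8.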

Advanced Options

The following modes cover quick verification, debugging, and long-running jobs on a remote server:

Mode	Use Case	uv / Shell Command	make Wrapper
Smoke Test	Quick functionality check	`uv run python -m src.main --smoke_test`	`make test`
Debug Mode	Enable breakpoints & verbose logs	`LOG_LEVEL=DEBUG uv run python -m src.main`	`make train DEBUG=1`
Background	Run on remote server persistently	`nohup uv run python -m src.main > results/scripts_logs/train.log 2>&1 &`	`make train`

Detailed Execution

Smoke Test: Executes a single epoch with minimal data to verify pipeline integrity. Ideal for CI/CD or validating changes.

uv run python -m src.main --smoke_test

Debug Mode: Runs with LOG_LEVEL=DEBUG to enable verbose logging and debugger breakpoints. Useful for troubleshooting.

LOG_LEVEL=DEBUG uv run python -m src.main

Background Training: Uses nohup to ensure training continues after terminal closure. Output is redirected to results/scripts_logs/train.log.

nohup uv run python -m src.main > results/scripts_logs/train.log 2>&1 &

Tip

Monitor real-time training progress with:

tail -f results/scripts_logs/train.log

Evaluation & Metrics

Core Metrics

Metric	Full Name	Interpretation
PLCC	Pearson Linear Correlation Coefficient	Linear relationship (accuracy)
SROCC	Spearman Rank Order Correlation Coefficient	Monotonic relationship (ranking)
KROCC	Kendall Rank Correlation Coefficient	Ordinal agreement
RMSE	Root Mean Square Error	Prediction error magnitude
R²	Coefficient of Determination	Variance explained

Visualizations

The framework automatically generates:

- Training History: Loss curves, PLCC/SROCC progression
- Residual Analysis: Scatter plots, error distribution
- Cross-Model Comparison: Bar charts for multiple models

Output location: results/{dataset}/plots/

Project Main Structure

deep-vqa-framework/
├── Makefile                # Automation & workflow commands
├── README.md               # Project overview
├── DISCLAIMER.md           # Legal liability & resource usage policy
├── pyproject.toml          # Dependency & environment management (uv)
│
├── config/                 # YAML configuration modules
│   ├── basic.yaml          # System & training global defaults
│   ├── dataset_config.yaml # Dataset-specific metadata
│   └── models/             # Model architecture parameters
│
├── datasets/               # Data storage & symlink routing
│   ├── KoNViD-1k/          # Video quality dataset
│   ├── T2VQA-DB/           # Text-to-Video QA dataset
│   └── TID2013/            # Image quality dataset
│
├── docs/                   # Interactive architecture & manuals
│   ├── pipeline.html       # System execution & module flow
│   └── Cloud_Platform_Rental_Guide.md
│
├── results/                # Global outputs & logs
│   ├── model_outputs/      # Training checkpoints
│   ├── train_logs/         # Execution & performance history
│   └── plots/              # Visualization (loss, residuals, etc.)
│
├── scripts/                # Infrastructure automation
│   ├── manage_data.sh      # Download & data preparation
│   ├── setup_env.sh        # Environment & system initialization
│   └── *.sh                # Auxiliary maintenance & cleanup scripts
│
└── src/                    # Core framework logic
    ├── main.py             # Global execution entry point
    ├── core/               # Training engine & evaluation pipeline
    ├── data/               # Data loaders, EDA & integrity analysis
    ├── models/             # Architecture definitions (IQAVQANet)
    └── utils/              # Configuration, logging & path management

System Overview

For a detailed look at the system architecture and execution flow, see docs/pipeline.html.

Configuration Guide

Configuration Layering

Configuration files are merged in the following order (later files override earlier ones):

Layer	File	Purpose
1 (Base)	`basic.yaml`	Global defaults
2 (Model)	`models/{model}.yaml`	Model-specific overrides
3 (Dataset)	`dataset_config.yaml`	Dataset-specific settings
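
The override order in the table can be sketched as a recursive dictionary merge. The helper names `deep_merge` and `merge_layers` are hypothetical, the values are invented for the example, and whether the framework merges nested sections key by key (as here) or replaces them wholesale is an assumption. Plain dictionaries stand in for the parsed YAML files.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict where `override` wins, merging nested dicts key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def merge_layers(*layers):
    """Apply layers left to right, so later layers override earlier ones."""
    result = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Stand-ins for basic.yaml, models/{model}.yaml and dataset_config.yaml.
basic = {"train": {"epochs": 100, "lr": 0.001}, "preprocessing": {"batch_size": 8}}
model = {"train": {"epochs": 50}, "model": {"backbone": "resnet50"}}
dataset = {"preprocessing": {"batch_size": 2}}

config = merge_layers(basic, model, dataset)
print(config["train"])
print(config["preprocessing"])
```

A key-by-key merge lets a model file change `train.epochs` without having to restate `train.lr` from the base layer.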

Memory Optimization for Video Training

# If encountering CUDA Out of Memory (OOM)
preprocessing:
  batch_size: 1              # Reduce batch size
  num_workers: 0             # Disable multiprocessing

model:
  num_frames: 4              # Reduce temporal frames
  backbone: "resnet50"       # Use smaller backbone
  transformer_layers: 1      # Reduce transformer depth

train:
  gradient_accumulation_steps: 4  # Simulate larger batch
  amp: true                  # Enable mixed precision
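
Why `gradient_accumulation_steps` can "simulate" a larger batch is easiest to see numerically. The sketch below uses a one-parameter linear model and hypothetical function names; it shows that averaging the gradients of equally sized micro-batches reproduces the full-batch gradient of a mean-reduced loss. It is an illustration of the arithmetic, not the framework's training loop.

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w * x - y) ** 2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch, steps):
    """Sum the gradients of `steps` micro-batches, each scaled by 1 / steps."""
    total = 0.0
    for s in range(steps):
        lo, hi = s * micro_batch, (s + 1) * micro_batch
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# batch_size=1 with 4 accumulation steps versus one batch of 4 samples.
print(grad_mse(w, xs, ys))
print(accumulated_grad(w, xs, ys, micro_batch=1, steps=4))
```

The equivalence holds for the gradient itself; layers that depend on batch statistics, such as batch normalization, still see the small micro-batch.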

Troubleshooting

CUDA Out of Memory

Symptom	Solution
OOM at first batch	Reduce `batch_size` to 1
OOM after several epochs	Enable `gradient_checkpointing: true`
OOM during validation	Reduce `num_frames` to 4

Dataset Not Found

A FileNotFoundError raised when passing --dataset <name> usually means the dataset symlink is missing or points to the wrong location. Recreate the links with:

make link

Alternatively, run the link script directly:

cd scripts
bash setup_links.sh

Video Loading Backend (AutoDL Specific)

Warning

On AutoDL or similar cloud GPU instances, OpenCV's VideoCapture may fail due to missing system dependencies.

Solution: Use Decord

uv add decord

Decord is pre-configured as the default backend. If Decord is not available, the framework automatically falls back to OpenCV, but on AutoDL this fallback may fail. Always use Decord for video training on AutoDL.

Slow Training

Issue	Optimization
Data loading bottleneck	Increase `num_workers` (e.g., to 8)
Small batch size	Use `gradient_accumulation_steps`
Video decoding slow	Ensure Decord is installed

📄 License

Framework: MIT
Author: @autentisitet
Version: 0.9.2-beta (pre-release)

🙏 Acknowledgments

- PyTorch team for the deep learning framework
- Decord developers for efficient video loading
- TID2013, KoNViD-1k, and T2VQA-DB dataset providers

⚖️ Legal & Disclaimer

For details regarding third-party tool usage, dataset compliance, and resource usage, please refer to the DISCLAIMER.md file.

For detailed contribution guidelines and issue reporting, please check the .github folder.

Built with ❤️ for the research community
