Skip to content

Pure Rust video action classification engine based on ViViT. Classifies video clips into 400 action categories from Kinetics-400. No Python runtime, no CUDA, no external dependencies.

Notifications You must be signed in to change notification settings

qora-protocol/QORA-Vision-Video

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

2 Commits
Β 
Β 
Β 
Β 

Repository files navigation

language license tags base_model library_name pipeline_tag model-index
en
apache-2.0
rust
cpu-inference
quantized
q4
video-classification
action-recognition
vivit
video-transformer
pure-rust
no-python
no-cuda
kinetics-400
google/vivit-b-16x2-kinetics400
qora
video-classification
name results
QORA-Vision-Video
task dataset metrics
type
video-classification
name type
Kinetics-400
kinetics-400
name type value
Top-1 Accuracy
accuracy
79.3

QORA-Vision (Video) - Native Rust Video Classifier

Screenshot 2026-02-27 174517

Pure Rust video action classification engine based on ViViT. Classifies video clips into 400 action categories from Kinetics-400. No Python runtime, no CUDA, no external dependencies.

Overview

Property Value
Engine QORA-Vision (Pure Rust)
Base Model ViViT-B/16x2 (google/vivit-b-16x2-kinetics400)
Parameters ~89M
Quantization Q4 (4-bit symmetric, group_size=32)
Model Size 60 MB (Q4 binary)
Executable 4.4 MB
Input 32 frames x 224x224 RGB video
Output 768-dim embeddings + 400-class logits
Classes 400 (Kinetics-400 action categories)
Platform Windows x86_64 (CPU-only)

Architecture

ViViT-B/16x2 (Video Vision Transformer)

Component Details
Backbone 12-layer ViT-Base transformer
Hidden Size 768
Attention Heads 12 (head_dim=64)
MLP (Intermediate) 3,072 (GELU-Tanh activation)
Tubelet Size [2, 16, 16] (temporal, height, width)
Input Frames 32
Patches per Frame 14 x 14 = 196
Total Tubelets 16 x 14 x 14 = 3,136
Sequence Length 3,137 (3,136 tubelets + 1 CLS token)
Normalization LayerNorm with bias (eps=1e-6)
Attention Bidirectional (no causal mask)
Position Encoding Learned [3137, 768]
Classifier Linear(768, 400)

Key Design: Tubelet Embedding

Unlike image ViTs that use 2D patches, ViViT uses 3D tubelets β€” spatiotemporal volumes that capture both spatial and temporal information:

Video [3, 32, 224, 224] (C, T, H, W)
  β†’ Extract tubelets [3, 2, 16, 16] = 1,536 values each
  β†’ 16 temporal Γ— 14 height Γ— 14 width = 3,136 tubelets
  β†’ GEMM: [3136, 1536] Γ— [1536, 768] β†’ [3136, 768]
  β†’ Prepend CLS token β†’ [3137, 768]

Pipeline

Video (32 frames Γ— 224Γ—224)
    β†’ Tubelet Embedding (3D Conv: [2,16,16])
    β†’ 3,136 tubelets + CLS token = 3,137 sequence
    β†’ Add Position Embeddings [3137, 768]
    β†’ 12x ViT Transformer Layers (bidirectional)
    β†’ Final LayerNorm
    β†’ CLS token β†’ Linear(768, 400)
    β†’ Kinetics-400 logits

Files

vivit-model/
  qora-vision.exe      - 4.4 MB    Inference engine
  model.qora-vision    - 60 MB     Video model (Q4)
  config.json          - 293 B     QORA-branded config
  README.md            - This file

Usage

# Classify from frame directory
qora-vision.exe vivit --frames ./my_frames/ --model-path ../ViViT/

# Classify from video file (requires ffmpeg)
qora-vision.exe vivit --video clip.mp4 --model-path ../ViViT/

# Load from binary
qora-vision.exe vivit --load model.qora-vision --frames ./my_frames/

CLI Arguments

Flag Default Description
--model-path <path> . Path to model directory (safetensors)
--frames <dir> - Directory of 32 JPEG/PNG frames
--video <file> - Video file (extracts frames via ffmpeg)
--load <path> - Load binary (.qora-vision)
--save <path> - Save binary
--f16 off Use F16 weights instead of Q4

Input Requirements

  • 32 frames at 224x224 resolution
  • Frames are uniformly sampled from the video
  • Each frame: resize shortest edge to 224, center crop
  • Normalize: (pixel/255 - 0.5) / 0.5 = range [-1, 1]

Published Benchmarks

ViViT (Original Paper - ICCV 2021)

Model Variant Kinetics-400 Top-1 Top-5 Views
ViViT-B/16x2 (Factorised) 79.3% 93.4% 1x3
ViViT-L/16x2 (Factorised) 81.7% 93.8% 1x3
ViViT-H/14x2 (JFT pretrained) 84.9% 95.8% 4x3

Comparison with Other Video Models

Model Params Kinetics-400 Top-1 Architecture
QORA-Vision (ViViT-B/16x2) 89M 79.3% Video ViT (tubelets)
TimeSformer-B 121M 78.0% Divided attention
Video Swin-T 28M 78.8% 3D shifted windows
SlowFast R101-8x8 53M 77.6% Two-stream CNN
X3D-XXL 20M 80.4% Efficient 3D CNN

Test Results

Test: 32 Synthetic Frames (Color Gradient)

Input: 32 test frames (224x224, color gradient from red to blue)

Output:

Top-5 predictions:
  #1: class 169 (score: 4.5807)
  #2: class 346 (score: 4.2157)
  #3: class 84  (score: 3.3206)
  #4: class 107 (score: 3.2053)
  #5: class 245 (score: 2.5995)
Metric Value
Tubelets 3,136 patches
Sequence Length 3,137 (+ CLS)
Embedding dim=768, L2 norm=17.0658
Forward Pass 1,235.7s (12 layers x 12 heads, 3137x3137 attention)
Model Load 5.4s (from safetensors)
Model Memory 60 MB (Q4)
Binary Save 63ms
Result PASS (valid predictions with correct logit distribution)

Performance Notes

The long forward pass time (1,235s) is due to the large sequence length (3,137 tokens). Each attention layer computes a 3,137 x 3,137 attention matrix across 12 heads. This is expected for CPU-only inference of a video model β€” GPU acceleration would dramatically improve this.

Component Time
Tubelet Embedding ~0.1s
Attention (per layer) ~100s (3137x3137 matrix)
12 Layers Total ~1,200s
Final Classifier <1s

Kinetics-400 Classes

The model classifies videos into 400 human action categories including:

Sports: basketball, golf, swimming, skateboarding, skiing, surfing, tennis, volleyball... Daily activities: cooking, eating, drinking, brushing teeth, washing dishes... Music: playing guitar, piano, drums, violin, saxophone... Dance: ballet, breakdancing, salsa, tap dancing... Other: driving car, riding horse, flying kite, blowing candles...

Full class list: Kinetics-400 Labels

QORA Model Family

Engine Model Params Size (Q4) Purpose
QORA SmolLM3-3B 3.07B 1.68 GB Text generation, reasoning, chat
QORA-TTS Qwen3-TTS 1.84B 1.5 GB Text-to-speech synthesis
QORA-Vision (Image) SigLIP 2 Base 93M 58 MB Image embeddings, zero-shot classification
QORA-Vision (Video) ViViT Base 89M 60 MB Video action classification

Built with QORA - Pure Rust AI Inference

About

Pure Rust video action classification engine based on ViViT. Classifies video clips into 400 action categories from Kinetics-400. No Python runtime, no CUDA, no external dependencies.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors