---
license: apache-2.0
base_model: google/vivit-b-16x2-kinetics400
library_name: qora
pipeline_tag: video-classification
---
Download 🤗: https://huggingface.co/qoranet/QORA-Vision-Video
Pure Rust video action classification engine based on ViViT. Classifies video clips into 400 action categories from Kinetics-400. No Python runtime, no CUDA, no external dependencies.
| Property | Value |
|---|---|
| Engine | QORA-Vision (Pure Rust) |
| Base Model | ViViT-B/16x2 (google/vivit-b-16x2-kinetics400) |
| Parameters | ~89M |
| Quantization | Q4 (4-bit symmetric, group_size=32) |
| Model Size | 60 MB (Q4 binary) |
| Executable | 4.4 MB |
| Input | 32 frames x 224x224 RGB video |
| Output | 768-dim embeddings + 400-class logits |
| Classes | 400 (Kinetics-400 action categories) |
| Platform | Windows x86_64 (CPU-only) |
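The Q4 format above (4-bit symmetric, group_size=32) can be sketched as follows. This is a minimal illustration of symmetric group quantization; `quantize_group` and `dequantize_group` are hypothetical helpers, not the engine's actual API, and the real on-disk layout (bit packing, scale storage) is not documented here.

```rust
/// Symmetric 4-bit quantization of one group of 32 weights (sketch).
/// Each group stores one f32 scale plus 32 signed 4-bit values.
fn quantize_group(weights: &[f32; 32]) -> (f32, [i8; 32]) {
    // Scale maps the largest |weight| in the group to the int4 limit 7.
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
    let mut q = [0i8; 32];
    for (i, w) in weights.iter().enumerate() {
        // Round to the nearest int4 value in [-7, 7].
        q[i] = (w / scale).round().clamp(-7.0, 7.0) as i8;
    }
    (scale, q)
}

/// Dequantize back to f32: w ≈ q * scale.
fn dequantize_group(scale: f32, q: &[i8; 32]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (i, v) in q.iter().enumerate() {
        out[i] = *v as f32 * scale;
    }
    out
}

fn main() {
    let w = [0.25f32; 32];
    let (scale, q) = quantize_group(&w);
    let d = dequantize_group(scale, &q);
    println!("scale = {scale}, q[0] = {}, d[0] = {}", q[0], d[0]);
}
```

With a shared scale per 32-weight group, storage works out to roughly 4 bits per weight plus a small per-group overhead, which is how ~89M parameters fit in a 60 MB binary.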
| Component | Details |
|---|---|
| Backbone | 12-layer ViT-Base transformer |
| Hidden Size | 768 |
| Attention Heads | 12 (head_dim=64) |
| MLP (Intermediate) | 3,072 (GELU-Tanh activation) |
| Tubelet Size | [2, 16, 16] (temporal, height, width) |
| Input Frames | 32 |
| Patches per Frame | 14 x 14 = 196 |
| Total Tubelets | 16 x 14 x 14 = 3,136 |
| Sequence Length | 3,137 (3,136 tubelets + 1 CLS token) |
| Normalization | LayerNorm with bias (eps=1e-6) |
| Attention | Bidirectional (no causal mask) |
| Position Encoding | Learned [3137, 768] |
| Classifier | Linear(768, 400) |
Unlike image ViTs that use 2D patches, ViViT uses 3D tubelets: spatiotemporal volumes that capture both spatial and temporal information:
```
Video [3, 32, 224, 224]  (C, T, H, W)
  → Extract tubelets [3, 2, 16, 16] = 1,536 values each
  → 16 temporal × 14 height × 14 width = 3,136 tubelets
  → GEMM: [3136, 1536] × [1536, 768] → [3136, 768]
  → Prepend CLS token → [3137, 768]
```
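The tubelet arithmetic above can be checked with a few lines of Rust. This sketch only reproduces the shape math; `tubelet_counts` is a hypothetical helper, not part of the engine:

```rust
/// Number of tubelets and resulting sequence length for a (T, H, W)
/// input with tubelet size (tt, th, tw).
fn tubelet_counts(t: usize, h: usize, w: usize,
                  tt: usize, th: usize, tw: usize) -> (usize, usize) {
    // Grid of non-overlapping tubelets along each axis.
    let tubelets = (t / tt) * (h / th) * (w / tw);
    (tubelets, tubelets + 1) // +1 for the CLS token
}

fn main() {
    let (tubelets, seq_len) = tubelet_counts(32, 224, 224, 2, 16, 16);
    let values_per_tubelet = 3 * 2 * 16 * 16; // C * tt * th * tw
    println!("tubelets = {tubelets}, seq_len = {seq_len}, values/tubelet = {values_per_tubelet}");
}
```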
```
Video (32 frames × 224×224)
  → Tubelet Embedding (3D Conv: [2,16,16])
  → 3,136 tubelets + CLS token = 3,137 sequence
  → Add Position Embeddings [3137, 768]
  → 12× ViT Transformer Layers (bidirectional)
  → Final LayerNorm
  → CLS token → Linear(768, 400)
  → Kinetics-400 logits
```
```
vivit-model/
  qora-vision.exe      4.4 MB   Inference engine
  model.qora-vision    60 MB    Video model (Q4)
  config.json          293 B    QORA-branded config
  README.md                     This file
```
```bash
# Classify from frame directory
qora-vision.exe vivit --frames ./my_frames/ --model-path ../ViViT/

# Classify from video file (requires ffmpeg)
qora-vision.exe vivit --video clip.mp4 --model-path ../ViViT/

# Load from binary
qora-vision.exe vivit --load model.qora-vision --frames ./my_frames/
```

| Flag | Default | Description |
|---|---|---|
| `--model-path <path>` | `.` | Path to model directory (safetensors) |
| `--frames <dir>` | - | Directory of 32 JPEG/PNG frames |
| `--video <file>` | - | Video file (extracts frames via ffmpeg) |
| `--load <path>` | - | Load binary (`.qora-vision`) |
| `--save <path>` | - | Save binary |
| `--f16` | off | Use F16 weights instead of Q4 |
- 32 frames at 224x224 resolution
- Frames are uniformly sampled from the video
- Each frame: resize shortest edge to 224, center crop
- Normalize: `(pixel/255 - 0.5) / 0.5` → range [-1, 1]
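The normalization formula above maps 8-bit pixel values onto [-1, 1]; in Rust it is a one-liner (`normalize_pixel` is an illustrative helper, not the engine's actual preprocessing code):

```rust
/// Map a u8 pixel value to the [-1, 1] range expected by ViViT:
/// (pixel/255 - 0.5) / 0.5
fn normalize_pixel(p: u8) -> f32 {
    (p as f32 / 255.0 - 0.5) / 0.5
}

fn main() {
    // 0 → -1.0, 255 → 1.0, midpoint near 0.0
    println!("{} {} {}", normalize_pixel(0), normalize_pixel(128), normalize_pixel(255));
}
```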
| Model Variant | Kinetics-400 Top-1 | Top-5 | Views |
|---|---|---|---|
| ViViT-B/16x2 (Factorised) | 79.3% | 93.4% | 1x3 |
| ViViT-L/16x2 (Factorised) | 81.7% | 93.8% | 1x3 |
| ViViT-H/14x2 (JFT pretrained) | 84.9% | 95.8% | 4x3 |
| Model | Params | Kinetics-400 Top-1 | Architecture |
|---|---|---|---|
| QORA-Vision (ViViT-B/16x2) | 89M | 79.3% | Video ViT (tubelets) |
| TimeSformer-B | 121M | 78.0% | Divided attention |
| Video Swin-T | 28M | 78.8% | 3D shifted windows |
| SlowFast R101-8x8 | 53M | 77.6% | Two-stream CNN |
| X3D-XXL | 20M | 80.4% | Efficient 3D CNN |
Input: 32 test frames (224x224, color gradient from red to blue)

Output:

```
Top-5 predictions:
  #1: class 169 (score: 4.5807)
  #2: class 346 (score: 4.2157)
  #3: class 84  (score: 3.3206)
  #4: class 107 (score: 3.2053)
  #5: class 245 (score: 2.5995)
```
| Metric | Value |
|---|---|
| Tubelets | 3,136 patches |
| Sequence Length | 3,137 (+ CLS) |
| Embedding | dim=768, L2 norm=17.0658 |
| Forward Pass | 1,235.7s (12 layers x 12 heads, 3137x3137 attention) |
| Model Load | 5.4s (from safetensors) |
| Model Memory | 60 MB (Q4) |
| Binary Save | 63ms |
| Result | PASS (valid predictions with correct logit distribution) |
The long forward pass time (1,235s) is due to the large sequence length (3,137 tokens). Each attention layer computes a 3,137 x 3,137 attention matrix across 12 heads. This is expected for CPU-only inference of a video model; GPU acceleration would dramatically improve this.
| Component | Time |
|---|---|
| Tubelet Embedding | ~0.1s |
| Attention (per layer) | ~100s (3137x3137 matrix) |
| 12 Layers Total | ~1,200s |
| Final Classifier | <1s |
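The per-layer attention cost in the table follows directly from the sequence length. A back-of-the-envelope estimate (assumptions of this sketch: one unfused f32 score matrix per head, 2 FLOPs per multiply-add, counting only the QKᵀ and scores×V matmuls; `attention_cost` is a hypothetical helper):

```rust
/// Rough memory footprint of the attention score matrices and FLOP
/// count of the two attention matmuls for one transformer layer.
fn attention_cost(seq_len: usize, heads: usize, head_dim: usize) -> (usize, u64) {
    // One seq_len x seq_len f32 score matrix per head.
    let score_bytes = seq_len * seq_len * 4 * heads;
    // QK^T plus scores x V: 2 matmuls, 2 FLOPs per multiply-add each.
    let flops = 2 * 2 * (seq_len as u64).pow(2) * (heads * head_dim) as u64;
    (score_bytes, flops)
}

fn main() {
    // ViViT-B: seq_len 3,137; 12 heads of dim 64 (hidden size 768).
    let (bytes, flops) = attention_cost(3137, 12, 64);
    println!("score matrices: {:.1} MB/layer", bytes as f64 / 1e6);
    println!("attention matmuls: {:.1} GFLOPs/layer", flops as f64 / 1e9);
}
```

Under these assumptions the score matrices alone are on the order of hundreds of megabytes per layer, and the attention matmuls tens of GFLOPs per layer, which is consistent with attention dominating the CPU-only runtime.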
The model classifies videos into 400 human action categories including:
- Sports: basketball, golf, swimming, skateboarding, skiing, surfing, tennis, volleyball...
- Daily activities: cooking, eating, drinking, brushing teeth, washing dishes...
- Music: playing guitar, piano, drums, violin, saxophone...
- Dance: ballet, breakdancing, salsa, tap dancing...
- Other: driving car, riding horse, flying kite, blowing candles...
Full class list: Kinetics-400 Labels
| Engine | Model | Params | Size (Q4) | Purpose |
|---|---|---|---|---|
| QORA | SmolLM3-3B | 3.07B | 1.68 GB | Text generation, reasoning, chat |
| QORA-TTS | Qwen3-TTS | 1.84B | 1.5 GB | Text-to-speech synthesis |
| QORA-Vision (Image) | SigLIP 2 Base | 93M | 58 MB | Image embeddings, zero-shot classification |
| QORA-Vision (Video) | ViViT Base | 89M | 60 MB | Video action classification |
Built with QORA - Pure Rust AI Inference