---
license: apache-2.0
base_model: google/vivit-b-16x2-kinetics400
library_name: qora
pipeline_tag: video-classification
---
Download 🤗: https://huggingface.co/qoranet/QORA-Vision-Video
Pure Rust video action classification engine based on ViViT. Classifies video clips into 400 action categories from Kinetics-400. No Python runtime, no CUDA, no external dependencies.
| Property | Value |
|---|---|
| Engine | QORA-Vision (Pure Rust) |
| Base Model | ViViT-B/16x2 (google/vivit-b-16x2-kinetics400) |
| Parameters | ~89M |
| Quantization | Q4 (4-bit symmetric, group_size=32) |
| Model Size | 60 MB (Q4 binary) |
| Executable | 4.4 MB |
| Input | 32 frames x 224x224 RGB video |
| Output | 768-dim embeddings + 400-class logits |
| Classes | 400 (Kinetics-400 action categories) |
| Platform | Windows x86_64 (CPU-only) |
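The Q4 format above (4-bit symmetric, group_size=32) can be sketched as follows. This is a minimal illustration of symmetric group quantization; `quantize_group` and `dequantize_group` are hypothetical helpers, not the engine's actual API, and the real on-disk layout (bit packing, scale storage) is not documented here.

```rust
/// Symmetric 4-bit quantization of one group of 32 weights (sketch).
/// Each group stores one f32 scale plus 32 signed 4-bit values.
fn quantize_group(weights: &[f32; 32]) -> (f32, [i8; 32]) {
    // Scale maps the largest |weight| in the group to the int4 limit 7.
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
    let mut q = [0i8; 32];
    for (i, w) in weights.iter().enumerate() {
        // Round to the nearest int4 value in [-7, 7].
        q[i] = (w / scale).round().clamp(-7.0, 7.0) as i8;
    }
    (scale, q)
}

/// Dequantize back to f32: w ≈ q * scale.
fn dequantize_group(scale: f32, q: &[i8; 32]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (i, v) in q.iter().enumerate() {
        out[i] = *v as f32 * scale;
    }
    out
}

fn main() {
    let w = [0.25f32; 32];
    let (scale, q) = quantize_group(&w);
    let d = dequantize_group(scale, &q);
    println!("scale = {scale}, q[0] = {}, d[0] = {}", q[0], d[0]);
}
```

With a shared scale per 32-weight group, storage works out to roughly 4 bits per weight plus a small per-group overhead, which is how ~89M parameters fit in a 60 MB binary.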
| Component | Details |
|---|---|
| Backbone | 12-layer ViT-Base transformer |
| Hidden Size | 768 |
| Attention Heads | 12 (head_dim=64) |
| MLP (Intermediate) | 3,072 (GELU-Tanh activation) |
| Tubelet Size | [2, 16, 16] (temporal, height, width) |
| Input Frames | 32 |
| Patches per Frame | 14 x 14 = 196 |
| Total Tubelets | 16 x 14 x 14 = 3,136 |
| Sequence Length | 3,137 (3,136 tubelets + 1 CLS token) |
| Normalization | LayerNorm with bias (eps=1e-6) |
| Attention | Bidirectional (no causal mask) |
| Position Encoding | Learned [3137, 768] |
| Classifier | Linear(768, 400) |
Unlike image ViTs that use 2D patches, ViViT uses 3D tubelets: spatiotemporal volumes that capture both spatial and temporal information:
```
Video [3, 32, 224, 224]  (C, T, H, W)
  → Extract tubelets [3, 2, 16, 16] = 1,536 values each
  → 16 temporal × 14 height × 14 width = 3,136 tubelets
  → GEMM: [3136, 1536] × [1536, 768] → [3136, 768]
  → Prepend CLS token → [3137, 768]
```
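The tubelet arithmetic above can be checked with a few lines of Rust. This sketch only reproduces the shape math; `tubelet_counts` is a hypothetical helper, not part of the engine:

```rust
/// Number of tubelets and resulting sequence length for a (T, H, W)
/// input with tubelet size (tt, th, tw).
fn tubelet_counts(t: usize, h: usize, w: usize,
                  tt: usize, th: usize, tw: usize) -> (usize, usize) {
    // Grid of non-overlapping tubelets along each axis.
    let tubelets = (t / tt) * (h / th) * (w / tw);
    (tubelets, tubelets + 1) // +1 for the CLS token
}

fn main() {
    let (tubelets, seq_len) = tubelet_counts(32, 224, 224, 2, 16, 16);
    let values_per_tubelet = 3 * 2 * 16 * 16; // C * tt * th * tw
    println!("tubelets = {tubelets}, seq_len = {seq_len}, values/tubelet = {values_per_tubelet}");
}
```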
```
Video (32 frames × 224×224)
  → Tubelet Embedding (3D Conv: [2,16,16])
  → 3,136 tubelets + CLS token = 3,137 sequence
  → Add Position Embeddings [3137, 768]
  → 12× ViT Transformer Layers (bidirectional)
  → Final LayerNorm
  → CLS token → Linear(768, 400)
  → Kinetics-400 logits
```
```
vivit-model/
  qora-vision.exe      4.4 MB   Inference engine
  model.qora-vision    60 MB    Video model (Q4)
  config.json          293 B    QORA-branded config
  README.md                     This file
```
```bash
# Classify from frame directory
qora-vision.exe vivit --frames ./my_frames/ --model-path ../ViViT/

# Classify from video file (requires ffmpeg)
qora-vision.exe vivit --video clip.mp4 --model-path ../ViViT/

# Load from binary
qora-vision.exe vivit --load model.qora-vision --frames ./my_frames/
```

| Flag | Default | Description |
|---|---|---|
| `--model-path <path>` | `.` | Path to model directory (safetensors) |
| `--frames <dir>` | - | Directory of 32 JPEG/PNG frames |
| `--video <file>` | - | Video file (extracts frames via ffmpeg) |
| `--load <path>` | - | Load binary (`.qora-vision`) |
| `--save <path>` | - | Save binary |
| `--f16` | off | Use F16 weights instead of Q4 |
- 32 frames at 224x224 resolution
- Frames are uniformly sampled from the video
- Each frame: resize shortest edge to 224, center crop
- Normalize: `(pixel/255 - 0.5) / 0.5` → range [-1, 1]
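The normalization formula above maps 8-bit pixel values onto [-1, 1]; in Rust it is a one-liner (`normalize_pixel` is an illustrative helper, not the engine's actual preprocessing code):

```rust
/// Map a u8 pixel value to the [-1, 1] range expected by ViViT:
/// (pixel/255 - 0.5) / 0.5
fn normalize_pixel(p: u8) -> f32 {
    (p as f32 / 255.0 - 0.5) / 0.5
}

fn main() {
    // 0 → -1.0, 255 → 1.0, midpoint near 0.0
    println!("{} {} {}", normalize_pixel(0), normalize_pixel(128), normalize_pixel(255));
}
```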
| Model Variant | Kinetics-400 Top-1 | Top-5 | Views |
|---|---|---|---|
| ViViT-B/16x2 (Factorised) | 79.3% | 93.4% | 1x3 |
| ViViT-L/16x2 (Factorised) | 81.7% | 93.8% | 1x3 |
| ViViT-H/14x2 (JFT pretrained) | 84.9% | 95.8% | 4x3 |
| Model | Params | Kinetics-400 Top-1 | Architecture |
|---|---|---|---|
| QORA-Vision (ViViT-B/16x2) | 89M | 79.3% | Video ViT (tubelets) |
| TimeSformer-B | 121M | 78.0% | Divided attention |
| Video Swin-T | 28M | 78.8% | 3D shifted windows |
| SlowFast R101-8x8 | 53M | 77.6% | Two-stream CNN |
| X3D-XXL | 20M | 80.4% | Efficient 3D CNN |
Input: 32 test frames (224x224, color gradient from red to blue)

Output:

```
Top-5 predictions:
  #1: class 169 (score: 4.5807)
  #2: class 346 (score: 4.2157)
  #3: class 84  (score: 3.3206)
  #4: class 107 (score: 3.2053)
  #5: class 245 (score: 2.5995)
```
| Metric | Value |
|---|---|
| Tubelets | 3,136 patches |
| Sequence Length | 3,137 (+ CLS) |
| Embedding | dim=768, L2 norm=17.0658 |
| Forward Pass | 1,235.7s (12 layers x 12 heads, 3137x3137 attention) |
| Model Load | 5.4s (from safetensors) |
| Model Memory | 60 MB (Q4) |
| Binary Save | 63ms |
| Result | PASS (valid predictions with correct logit distribution) |
The long forward pass time (1,235s) is due to the large sequence length (3,137 tokens). Each attention layer computes a 3,137 x 3,137 attention matrix across 12 heads. This is expected for CPU-only inference of a video model; GPU acceleration would dramatically improve this.
| Component | Time |
|---|---|
| Tubelet Embedding | ~0.1s |
| Attention (per layer) | ~100s (3137x3137 matrix) |
| 12 Layers Total | ~1,200s |
| Final Classifier | <1s |
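The per-layer attention cost in the table follows directly from the sequence length. A back-of-the-envelope estimate (assumptions of this sketch: one unfused f32 score matrix per head, 2 FLOPs per multiply-add, counting only the QKᵀ and scores×V matmuls; `attention_cost` is a hypothetical helper):

```rust
/// Rough memory footprint of the attention score matrices and FLOP
/// count of the two attention matmuls for one transformer layer.
fn attention_cost(seq_len: usize, heads: usize, head_dim: usize) -> (usize, u64) {
    // One seq_len x seq_len f32 score matrix per head.
    let score_bytes = seq_len * seq_len * 4 * heads;
    // QK^T plus scores x V: 2 matmuls, 2 FLOPs per multiply-add each.
    let flops = 2 * 2 * (seq_len as u64).pow(2) * (heads * head_dim) as u64;
    (score_bytes, flops)
}

fn main() {
    // ViViT-B: seq_len 3,137; 12 heads of dim 64 (hidden size 768).
    let (bytes, flops) = attention_cost(3137, 12, 64);
    println!("score matrices: {:.1} MB/layer", bytes as f64 / 1e6);
    println!("attention matmuls: {:.1} GFLOPs/layer", flops as f64 / 1e9);
}
```

Under these assumptions the score matrices alone are on the order of hundreds of megabytes per layer, and the attention matmuls tens of GFLOPs per layer, which is consistent with attention dominating the CPU-only runtime.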
The model classifies videos into 400 human action categories including:
- Sports: basketball, golf, swimming, skateboarding, skiing, surfing, tennis, volleyball...
- Daily activities: cooking, eating, drinking, brushing teeth, washing dishes...
- Music: playing guitar, piano, drums, violin, saxophone...
- Dance: ballet, breakdancing, salsa, tap dancing...
- Other: driving car, riding horse, flying kite, blowing candles...
Full class list: Kinetics-400 Labels
| Engine | Model | Params | Size (Q4) | Purpose |
|---|---|---|---|---|
| QORA | SmolLM3-3B | 3.07B | 1.68 GB | Text generation, reasoning, chat |
| QORA-TTS | Qwen3-TTS | 1.84B | 1.5 GB | Text-to-speech synthesis |
| QORA-Vision (Image) | SigLIP 2 Base | 93M | 58 MB | Image embeddings, zero-shot classification |
| QORA-Vision (Video) | ViViT Base | 89M | 60 MB | Video action classification |
Built with QORA - Pure Rust AI Inference