A TensorRT-accelerated face-swap studio with a real-time visual editor. Load a video, set your anchors, and see every adjustment rendered live — no blind parameter-tweaking, no guess-and-recheck render cycles.

Most face-swap tools are batch processors: you set parameters in a config or a dated GUI, run a full render, look at the result, and repeat until it's right. ViniMaya is built the other way around — as an interactive editor.
- Live per-frame preview. See the swap on your actual frames as you work, not after a render completes.
- Live segment preview. Define a segment by dropping anchors, preview just that segment, and decide whether to commit to a full swap and save — before spending time on a complete encode.
- Visual knob feedback. Every control — strength, masks, blend, color, restorer — updates the preview in real time, so you can see what each adjustment does instead of inferring it from a number.
- Hardware-accelerated throughout. TensorRT inference and NVDEC/NVENC video I/O mean the preview is fast enough to actually iterate in.
Under the hood, ViniMaya's swap algorithms are a faithful port of Rope's proven pipeline — the same math that produces its quality — re-implemented on a modern TensorRT stack (vs. Rope's older ONNX Runtime / tkinter / CUDA 11.8 base). But where Rope and its peers stop at "run the swap," ViniMaya is designed for the part that actually takes the time: dialing it in.
| Metric | Rope (ONNX Runtime) | ViniMaya (TensorRT) |
|---|---|---|
| Runtime | ONNX Runtime + CUDA EP | TensorRT 10.x (FP32 generative, FP16 detection) |
| Video I/O | OpenCV + ffmpeg pipe | pyNVVideoCodec (NVDEC/NVENC hardware codec) |
| Python | 3.10 only | 3.11+ |
| CUDA | 11.8 | 12.x |
| 1080p, 3 faces, GFPGAN | ~8-12 fps | hardware-accelerated TRT path |
```
Decode (NVDEC) → Detect (RetinaFace) → Recognize (ArcFace) → Match → Swap → Encode (NVENC)
                                                                      │
                                                         ┌────────────┴────────────┐
                                                         │       swap_core()       │
                                                         │                         │
                                                         │  1. Similarity Transform│
                                                         │  2. Align face 512×512  │
                                                         │  3. InSwapper (tiled)   │
                                                         │  4. Strength blending   │
                                                         │  5. Color correction    │
                                                         │  6. Mask generation:    │
                                                         │     • Border mask       │
                                                         │     • Diff mask         │
                                                         │     • Occluder          │
                                                         │     • Face parser       │
                                                         │  7. Face restoration    │
                                                         │  8. Paste-back (bbox)   │
                                                         └─────────────────────────┘
```
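Steps 1 and 2 of `swap_core()` can be illustrated with a small dependency-free sketch. It assumes `ARCFACE_DST` holds the widely used InsightFace five-point template for a 112×112 crop (the values below are that template, not copied from this repository), scales it into 512-space, and fits a similarity transform by least squares. The estimator is a stand-in for `skimage.transform.SimilarityTransform`, not the project's code, and `estimate_similarity` / `apply_similarity` are hypothetical helper names.

```python
# Five-point ArcFace template for a 112x112 crop:
# left eye, right eye, nose tip, left mouth corner, right mouth corner.
# Assumed to match the project's ARCFACE_DST.
ARCFACE_DST = [
    (38.2946, 51.6963),
    (73.5318, 51.5014),
    (56.0252, 71.7366),
    (41.5493, 92.3655),
    (70.7299, 92.2041),
]


def reference_points_512():
    """Scale the 112-space template into 512-space: p * 4.0 + 32.0."""
    return [(x * 4.0 + 32.0, y * 4.0 + 32.0) for x, y in ARCFACE_DST]


def estimate_similarity(src, dst):
    """Least-squares 2D similarity (rotation, uniform scale, translation).

    Points are treated as complex numbers, so the transform is w = a*z + b.
    Reflections are not modelled.
    """
    z = [complex(x, y) for x, y in src]
    w = [complex(x, y) for x, y in dst]
    z_mean = sum(z) / len(z)
    w_mean = sum(w) / len(w)
    num = sum((zi - z_mean).conjugate() * (wi - w_mean) for zi, wi in zip(z, w))
    den = sum(abs(zi - z_mean) ** 2 for zi in z)
    a = num / den              # rotation and scale
    b = w_mean - a * z_mean    # translation
    return a, b


def apply_similarity(a, b, point):
    """Map one (x, y) point through w = a*z + b."""
    w = a * complex(*point) + b
    return (w.real, w.imag)
```

In use, `src` would be the five landmarks returned by the detector and `dst` would be `reference_points_512()`; `abs(a)` is then the scale factor and the angle of `a` is the in-plane rotation of the face.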
```
vinimaya/
├── vinimaya/
│   ├── __init__.py
│   ├── models.py              # Frozen dataclasses: SwapConfig, MaskConfig, etc.
│   ├── config.py              # JSON config loading with defaults from dataclasses
│   ├── swap/
│   │   ├── engine.py          # TRTEngine wrapper + EngineManager (lazy-loading)
│   │   └── core.py            # swap_core(): complete port of Rope's swap logic
│   ├── detection/
│   │   └── retinaface.py      # RetinaFace detection + ArcFace recognition
│   └── video/
│       ├── decoder.py         # NVDEC decoder via pyNVVideoCodec
│       └── encoder.py         # NVENC encoder + ffmpeg remux
├── tools/
│   ├── convert_models.py      # ONNX → TensorRT engine conversion
│   ├── process_video.py       # Full video swap pipeline (CLI)
│   ├── test_swap.py           # Single-image swap test
│   ├── test_swap_debug.py     # Diagnostic with intermediate dumps
│   ├── test_minimal.py        # Minimal swap (match Rope settings)
│   └── test_rope_match.py     # Test with exact Rope parameter set
├── models/                    # Place ONNX model files here
│   └── engines/               # Generated TRT engines (auto-created)
├── vinimaya-config.json       # Default configuration
└── pyproject.toml
```
- GPU: NVIDIA GPU with Turing architecture or newer (RTX 20xx+)
- OS: Windows 10/11 or Linux
- Python: 3.11 or 3.12
- CUDA: 12.x with matching cuDNN
- ffmpeg: On system PATH (for audio remux)
```bash
python -m venv venv
# Windows
venv\Scripts\activate
# Linux
source venv/bin/activate
```

```bash
# PyTorch with CUDA 12.8 (adjust for your CUDA version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
# TensorRT and tools
pip install tensorrt polygraphy
# ONNX (for model conversion)
pip install onnx protobuf
# Video codec
pip install PyNvVideoCodec
# Image processing
pip install opencv-python scikit-image numpy tqdm
```

Place the following ONNX model files in the `models/` directory. These are the same models used by Rope:
| Model | File | Purpose |
|---|---|---|
| RetinaFace | `det_10g.onnx` | Face detection (primary) |
| ArcFace | `w600k_r50.onnx` | Face recognition (512-dim embeddings) |
| InSwapper | `inswapper_128.fp16.onnx` | Face swap core |
| GFPGAN | `GFPGANv1.4.onnx` | Face restoration |
| CodeFormer | `codeformer_fp16.onnx` | Face restoration (transformer-based) |
| GPEN-256 | `GPEN-BFR-256.onnx` | Face restoration |
| GPEN-512 | `GPEN-BFR-512.onnx` | Face restoration |
| Occluder | `occluder.onnx` | Occlusion mask (hands/glasses) |
| Face Parser | `faceparser_fp16.onnx` | Semantic face segmentation (19-class) |
| SCRFD | `scrfd_2.5g_bnkps.onnx` | Face detection (lighter alternative) |
| YOLOv8 | `yoloface_8n.onnx` | Face detection (fastest) |
| 106-point | `2d106det.onnx` | Fine facial landmarks |
| ResNet50 | `res50.onnx` | Restorer reference detection |
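Before converting, it can help to confirm that the expected files are actually in place. The following is a minimal sketch, not part of the ViniMaya tooling: `missing_models` is a hypothetical helper, the file names are taken from the model table, and which models a particular run really needs is left to the caller.

```python
from pathlib import Path

# File names copied from the model table.
MODEL_FILES = {
    "RetinaFace": "det_10g.onnx",
    "ArcFace": "w600k_r50.onnx",
    "InSwapper": "inswapper_128.fp16.onnx",
    "GFPGAN": "GFPGANv1.4.onnx",
    "CodeFormer": "codeformer_fp16.onnx",
    "GPEN-256": "GPEN-BFR-256.onnx",
    "GPEN-512": "GPEN-BFR-512.onnx",
    "Occluder": "occluder.onnx",
    "Face Parser": "faceparser_fp16.onnx",
    "SCRFD": "scrfd_2.5g_bnkps.onnx",
    "YOLOv8": "yoloface_8n.onnx",
    "106-point": "2d106det.onnx",
    "ResNet50": "res50.onnx",
}


def missing_models(models_dir, names=None):
    """Return {model name: expected path} for files not found in models_dir.

    names: optional iterable of model names to check; defaults to all of them.
    """
    root = Path(models_dir)
    wanted = names if names is not None else MODEL_FILES.keys()
    return {
        name: root / MODEL_FILES[name]
        for name in wanted
        if not (root / MODEL_FILES[name]).is_file()
    }


if __name__ == "__main__":
    for name, path in missing_models("./models").items():
        print(f"missing {name}: {path}")
```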
```bash
# Validate all ONNX models (no TRT needed)
python -m tools.convert_models --models-dir ./models --onnx-only
# Build all TRT engines (takes 15-30 minutes on first run)
python -m tools.convert_models --models-dir ./models
# Rebuild a specific model
python -m tools.convert_models --models-dir ./models --model inswapper --force
```

The converter automatically uses FP32 for generative models (inswapper, all restorers) and FP16 for detection models. Engines are cached in `models/engines/` and are GPU-architecture-specific.
```bash
# Basic swap with GFPGAN restoration
python -m tools.test_swap \
  --source-face source.jpg \
  --target target.jpg \
  --output result.jpg \
  --models-dir ./models \
  --threshold 0 \
  --restorer gfpgan
# With all masks enabled
python -m tools.test_swap \
  --source-face source.jpg \
  --target target.jpg \
  --output result.jpg \
  --models-dir ./models \
  --threshold 0 \
  --restorer gfpgan \
  --occluder --faceparser
```

```bash
# Full video swap with GFPGAN + Blend mode
python -m tools.process_video \
  --source-face source.jpg \
  --input video.mp4 \
  --output swapped.mp4 \
  --models-dir ./models \
  --restorer gfpgan \
  --restorer-det blend
# Quick test (first 30 frames with timing)
python -m tools.process_video \
  --source-face source.jpg \
  --input video.mp4 \
  --output test.mp4 \
  --models-dir ./models \
  --restorer gfpgan \
  --restorer-det blend \
  --max-frames 30 --perf
# With masks and H.264 output
python -m tools.process_video \
  --source-face source.jpg \
  --input video.mp4 \
  --output swapped.mp4 \
  --models-dir ./models \
  --restorer gfpgan \
  --restorer-det blend \
  --occluder --faceparser \
  --codec h264 --qp 20
```

```bash
# Dumps intermediate images at every pipeline stage
python -m tools.test_swap_debug \
  --source-face source.jpg \
  --target target.jpg \
  --models-dir ./models \
  --output-dir ./debug_output \
  --restorer gfpgan \
  --all-masks
```

This saves 15+ intermediate images showing each processing step: aligned face, raw swap output, restorer input/output, mask layers, and final composite.
All parameters are controlled via frozen dataclasses in `models.py`. Defaults match Rope's proven settings:

Swap parameters:

| Parameter | Default | Description |
|---|---|---|
| `swapper_resolution` | 128 | InSwapper resolution: 128, 256, or 512 (tiled) |
| `strength` | 100 | Swap iterations × 100 (100 = 1 pass, 300 = 3 passes) |
| `match_threshold` | 75.0 | Cosine similarity threshold for face matching |

Mask parameters:

| Parameter | Default | Description |
|---|---|---|
| `border_top/bottom/sides` | 10 | Border inset on 128×128 mask (pixels) |
| `border_blur` | 10 | Gaussian blur kernel for border mask |
| `blend_amount` | 5 | Gaussian blur on final composite mask |
| `occluder_enabled` | false | Occlusion mask (hands/glasses) |
| `occluder_amount` | 0 | Dilate (+) or erode (-) occluder mask |
| `face_parser_enabled` | false | Semantic face segmentation mask |
| `face_parser_amount` | 0 | Must be non-zero to activate segmentation |
| `mouth_parser_amount` | 0 | Mouth region control |
| `diff_enabled` | false | Pixel-difference mask (for video temporal stability) |
| `diff_threshold` | 10 | Diff sensitivity (higher = less change) |

Restorer parameters:

| Parameter | Default | Description |
|---|---|---|
| `enabled` | false | Enable face restoration |
| `restorer_type` | "gfpgan" | gfpgan, codeformer, gpen256, gpen512 |
| `det_mode` | "none" | Face alignment for restorer: none, blend, reference |
| `blend_alpha` | 0.80 | Restorer strength (1.0 = full restorer output) |

Video encoding parameters:

| Parameter | Default | Description |
|---|---|---|
| `codec` | "hevc" | Output codec: hevc or h264 |
| `preset` | "P5" | NVENC preset (P1 = fastest, P7 = best quality) |
| `qp` | 18 | Quantization parameter (lower = higher quality) |
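The "frozen dataclass plus JSON overrides" pattern described above can be sketched as follows. This is an illustration only: the field names and defaults are the three swap parameters from the first table, while `load_config` and the JSON fragment are hypothetical and the real layout of `vinimaya-config.json` may differ.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SwapConfig:
    # Field names and defaults copied from the swap parameters above.
    swapper_resolution: int = 128
    strength: int = 100
    match_threshold: float = 75.0


def load_config(cls, overrides):
    """Build a frozen config: dataclass defaults plus a dict of overrides."""
    known = {f.name for f in fields(cls)}
    unknown = set(overrides) - known
    if unknown:
        # Fail loudly on typos instead of silently ignoring them.
        raise KeyError(f"unknown {cls.__name__} keys: {sorted(unknown)}")
    return cls(**overrides)


# Hypothetical JSON fragment, used here only to show the override behavior.
cfg = load_config(SwapConfig, json.loads('{"strength": 300, "swapper_resolution": 256}'))
print(cfg)
```

Because the dataclass is frozen, assigning to `cfg.strength` afterwards raises `dataclasses.FrozenInstanceError`, so a changed setting has to be expressed as a new config object rather than an in-place mutation.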
A key finding during development: detection models work in FP16, but generative models require FP32.
| Category | Models | Precision | Reason |
|---|---|---|---|
| Detection | RetinaFace, SCRFD, YOLOv8, ArcFace | FP16 | Classification output tolerates reduced precision |
| Masking | Occluder, Face Parser | FP16 | Binary/categorical output tolerates reduced precision |
| Generation | InSwapper | FP32 | Color fidelity degrades over iterative application |
| Generation | GFPGAN, GPEN-256, GPEN-512 | FP32 | StyleGAN2 ModulatedConv2d causes output collapse in FP16 |
| Generation | CodeFormer | FP32 | Transformer LayerNorm overflows in FP16 (produces NaN) |
The model converter handles this automatically — generative models are blacklisted from FP16 builds.
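A precision policy of this kind can be expressed in a few lines. The sketch below is an assumption about how such a blacklist could look, not the converter's actual code: `FP32_ONLY` and `build_precision` are hypothetical names, and the model keys are borrowed from the `--model` and `restorer_type` values used elsewhere in this README.

```python
# Generative models that must never be built in FP16 (see the table above).
FP32_ONLY = frozenset({"inswapper", "gfpgan", "gpen256", "gpen512", "codeformer"})


def build_precision(model_name, want_fp16=True):
    """Return "fp16" or "fp32" for a model; generative models always get FP32."""
    if model_name.lower() in FP32_ONLY:
        return "fp32"
    return "fp16" if want_fp16 else "fp32"


for name in ("retinaface", "occluder", "inswapper", "codeformer"):
    print(name, build_precision(name))
```

In a TensorRT build, the returned value would decide whether `trt.BuilderFlag.FP16` is set on the builder config for that model.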
The following Rope mechanics are preserved exactly:
- ArcFace alignment reference points (`ARCFACE_DST × 4.0 + 32.0` in 512-space)
- Similarity transform estimation via `skimage.transform.SimilarityTransform`
- Affine warp/unwarp via `torchvision.transforms.v2.functional.affine`
- InSwapper latent computation (L2 norm → emap dot product → L2 norm)
- Tiled swap execution (128→256/512 via stride-sampling)
- Iterative strength control (multiple swap passes + alpha blending)
- 5-source multiplicative mask composition (border × diff × occluder × parser × CLIP)
- Morphological mask operations (dilate/erode via conv2d with ones kernel)
- Gaussian blur on masks
- Bounding-box-optimized paste-back
- Frame upscaling for small images (ensure both dimensions ≥ 512)
- Color correction (gamma + per-channel RGB offsets)
- Restorer detection modes (None / Blend / Reference with FFHQ alignment)
- Face parser semantic class handling (bg/neck/cloth/hair/hat exclusion)
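The tiled swap in the list above can be illustrated without any tensor library. This sketch assumes "stride-sampling" has its usual meaning: for a scale factor k, each of the k × k tiles takes every k-th pixel starting at a different offset, so every tile is a low-resolution view of the whole face at the model's native size. `tiled_swap` and `swap_tile` are hypothetical names; a tensor-based implementation would typically express the same thing with slices such as `face[j::k, i::k]`.

```python
def tiled_swap(face, swap_tile, factor):
    """Run a fixed-size swap over a larger square image via stride sampling.

    face:      square 2D grid (list of rows) whose side is factor * tile size
    swap_tile: callable that takes and returns one tile-sized grid
    factor:    2 for 128 -> 256, 4 for 128 -> 512
    """
    size = len(face)
    out = [[None] * size for _ in range(size)]
    for j in range(factor):          # row offset
        for i in range(factor):      # column offset
            # Every factor-th pixel starting at (j, i).
            tile = [row[i::factor] for row in face[j::factor]]
            result = swap_tile(tile)
            # Scatter the processed tile back onto the same strided positions.
            for tj, row in enumerate(result):
                for ti, value in enumerate(row):
                    out[j + tj * factor][i + ti * factor] = value
    return out


# Toy example: a 4x4 "image" processed by a 2x2 "model" that adds 100.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
result = tiled_swap(image, lambda t: [[v + 100 for v in row] for row in t], factor=2)
```

The model is called `factor ** 2` times per face, which is why the 512 setting costs roughly sixteen InSwapper passes per iteration.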
The TRTEngine class provides:
- Automatic input dtype casting (handles FP16↔FP32 mismatches between model expectations and PyTorch tensors)
- Dynamic shape support via optimization profiles
- Auto-allocated output tensors for multi-output models (face parser, CodeFormer)
- Lazy loading via `EngineManager` (engines loaded on first use)
- Execution on PyTorch's current CUDA stream (prevents race conditions with tensor operations)
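Lazy loading is simple to picture as a cache in front of a loader. The class below is a minimal sketch under that assumption; `LazyEngineManager` and its `loader` callable are hypothetical and do not reproduce the real `EngineManager` interface.

```python
class LazyEngineManager:
    """Loads an engine the first time it is requested and caches it."""

    def __init__(self, loader):
        self._loader = loader      # callable: engine name -> engine object
        self._engines = {}

    def get(self, name):
        # Deserialize only on first use; later calls return the cached object.
        if name not in self._engines:
            self._engines[name] = self._loader(name)
        return self._engines[name]

    def loaded(self):
        """Names of the engines that have been loaded so far."""
        return sorted(self._engines)


# Stand-in loader; a real one would deserialize a TensorRT engine file.
manager = LazyEngineManager(lambda name: {"name": name})
swapper = manager.get("inswapper")
```

With this design, a run that never enables a restorer never pays the load time or GPU memory for the restorer engines. If engines can be requested from several threads, the lookup would also need a lock.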
The video pipeline uses:
- pyNVVideoCodec for hardware-accelerated decode (NVDEC) and encode (NVENC)
- DLPack for zero-copy tensor exchange between decoder and PyTorch
- ffmpeg for container remux and audio copy
- Frame clone from DLPack to own CUDA memory (decoder's circular buffer is read-only)
- `ensure_min_size()` / `restore_size()` for frames smaller than 512px
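The remux step can be sketched as a single ffmpeg invocation that takes the video stream from the encoder output and the audio from the original file, copying both without re-encoding. This is an assumption about how the step could be written, not the project's actual command; `remux_command` and the file names are hypothetical.

```python
import subprocess


def remux_command(video_only, original, output):
    """Build an ffmpeg command: video from the encoder output, audio from the original."""
    return [
        "ffmpeg", "-y",
        "-i", str(video_only),   # input 0: NVENC output without audio
        "-i", str(original),     # input 1: source file that carries the audio
        "-map", "0:v:0",         # first video stream of input 0
        "-map", "1:a?",          # audio of input 1; "?" tolerates files with no audio
        "-c", "copy",            # stream copy, no re-encode
        str(output),
    ]


def remux(video_only, original, output):
    """Run the remux and raise CalledProcessError if ffmpeg fails."""
    subprocess.run(remux_command(video_only, original, output), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.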

Roadmap:

- GUI integration (PyWebView + Flask)
- Performance optimization (event-based stream sync, tensor pre-allocation)
- Source face management UI (multi-source assignment like Rope)
- Temporal stability (diff mask, face tracking across frames)
- CLIPSeg text-guided masking (requires PyTorch model export)
- Restorer Reference detection mode (ResNet50)
- Frame orientation support (rotation)
- Batch face processing

ViniMaya builds on the following projects:

- Rope by Hillobar — the face swap pipeline and algorithms that ViniMaya ports
- InsightFace — RetinaFace, ArcFace, and InSwapper models
- GFPGAN — face restoration model
- CodeFormer — transformer-based face restoration
- GPEN — blind face restoration
See LICENSE for details.