mountlord/ViniMaya

ViniMaya

A TensorRT-accelerated face-swap studio with a real-time visual editor. Load a video, set your anchors, and see every adjustment rendered live: no blind parameter-tweaking, no guess-and-recheck render cycles.

Why ViniMaya?

Most face-swap tools are batch processors: you set parameters in a config or a dated GUI, run a full render, look at the result, and repeat until it's right. ViniMaya is built the other way around — as an interactive editor.

  • Live per-frame preview. See the swap on your actual frames as you work, not after a render completes.
  • Live segment preview. Define a segment by dropping anchors, preview just that segment, and decide whether to commit to a full swap and save — before spending time on a complete encode.
  • Visual knob feedback. Every control — strength, masks, blend, color, restorer — updates the preview in real time, so you can see what each adjustment does instead of inferring it from a number.
  • Hardware-accelerated throughout. TensorRT inference and NVDEC/NVENC video I/O mean the preview is fast enough to actually iterate in.

Under the hood, ViniMaya's swap algorithms are a faithful port of Rope's proven pipeline — the same math that produces its quality — re-implemented on a modern TensorRT stack (vs. Rope's older ONNX Runtime / tkinter / CUDA 11.8 base). But where Rope and its peers stop at "run the swap," ViniMaya is designed for the part that actually takes the time: dialing it in.

Results

| Metric | Rope (ONNX Runtime) | ViniMaya (TensorRT) |
|---|---|---|
| Runtime | ONNX Runtime + CUDA EP | TensorRT 10.x (FP32 generative, FP16 detection) |
| Video I/O | OpenCV + ffmpeg pipe | PyNvVideoCodec (NVDEC/NVENC hardware codec) |
| Python | 3.10 only | 3.11+ |
| CUDA | 11.8 | 12.x |
| 1080p, 3 faces, GFPGAN | ~8-12 fps | hardware-accelerated TRT path |

Architecture

Decode (NVDEC) → Detect (RetinaFace) → Recognize (ArcFace) → Match → Swap → Encode (NVENC)
                                                                 │
                                                    ┌────────────┴────────────┐
                                                    │       swap_core()       │
                                                    │                         │
                                                    │  1. Similarity Transform│
                                                    │  2. Align face 512×512  │
                                                    │  3. InSwapper (tiled)   │
                                                    │  4. Strength blending   │
                                                    │  5. Color correction    │
                                                    │  6. Mask generation:    │
                                                    │     • Border mask       │
                                                    │     • Diff mask         │
                                                    │     • Occluder          │
                                                    │     • Face parser       │
                                                    │  7. Face restoration    │
                                                    │  8. Paste-back (bbox)   │
                                                    └─────────────────────────┘
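
Step 3 runs a 128×128 model over a larger aligned face, and the Technical Details section describes this as stride-sampling. The sketch below illustrates that idea on plain nested lists: each sub-image takes every `factor`-th pixel, the model sees each sub-image once, and the results are interleaved back. The function names and the `swap_tile` callback are hypothetical and are not taken from `core.py`.

```python
def split_strided(image, factor):
    """Split an image (list of rows) into factor*factor sub-images by stride-sampling."""
    return {
        (j, i): [row[i::factor] for row in image[j::factor]]
        for j in range(factor)
        for i in range(factor)
    }


def merge_strided(tiles, factor):
    """Inverse of split_strided: interleave the sub-images back into one image."""
    tile_h = len(tiles[(0, 0)])
    tile_w = len(tiles[(0, 0)][0])
    out = [[None] * (tile_w * factor) for _ in range(tile_h * factor)]
    for (j, i), tile in tiles.items():
        for y, row in enumerate(tile):
            for x, value in enumerate(row):
                out[y * factor + j][x * factor + i] = value
    return out


def tiled_apply(image, factor, swap_tile):
    """Run swap_tile (hypothetical stand-in for the 128x128 model) on each sub-image."""
    tiles = split_strided(image, factor)
    return merge_strided({key: swap_tile(tile) for key, tile in tiles.items()}, factor)


# Toy 4x4 "image" split into four 2x2 sub-images (factor 2).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tiles = split_strided(image, 2)
roundtrip = tiled_apply(image, 2, lambda tile: tile)
```

With an identity `swap_tile`, the round trip reproduces the input, which is a quick way to check the indexing before plugging in a real model call.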

Project Structure

vinimaya/
├── vinimaya/
│   ├── __init__.py
│   ├── models.py              # Frozen dataclasses: SwapConfig, MaskConfig, etc.
│   ├── config.py              # JSON config loading with defaults from dataclasses
│   ├── swap/
│   │   ├── engine.py          # TRTEngine wrapper + EngineManager (lazy-loading)
│   │   └── core.py            # swap_core() — complete port of Rope's swap logic
│   ├── detection/
│   │   └── retinaface.py      # RetinaFace detection + ArcFace recognition
│   └── video/
│       ├── decoder.py         # NVDEC decoder via PyNvVideoCodec
│       └── encoder.py         # NVENC encoder + ffmpeg remux
├── tools/
│   ├── convert_models.py      # ONNX → TensorRT engine conversion
│   ├── process_video.py       # Full video swap pipeline (CLI)
│   ├── test_swap.py           # Single-image swap test
│   ├── test_swap_debug.py     # Diagnostic with intermediate dumps
│   ├── test_minimal.py        # Minimal swap (match Rope settings)
│   └── test_rope_match.py     # Test with exact Rope parameter set
├── models/                    # Place ONNX model files here
│   └── engines/               # Generated TRT engines (auto-created)
├── vinimaya-config.json       # Default configuration
└── pyproject.toml

Prerequisites

  • GPU: NVIDIA GPU with Turing architecture or newer (RTX 20xx+)
  • OS: Windows 10/11 or Linux
  • Python: 3.11 or 3.12
  • CUDA: 12.x with matching cuDNN
  • ffmpeg: On system PATH (for audio remux)

Installation

1. Create virtual environment

python -m venv venv
# Windows
venv\Scripts\activate
# Linux
source venv/bin/activate

2. Install dependencies

# PyTorch with CUDA 12.8 (adjust for your CUDA version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

# TensorRT and tools
pip install tensorrt polygraphy

# ONNX (for model conversion)
pip install onnx protobuf

# Video codec
pip install PyNvVideoCodec

# Image processing
pip install opencv-python scikit-image numpy tqdm

3. Obtain model files

Place the following ONNX model files in the models/ directory. These are the same models used by Rope:

| Model | File | Purpose |
|---|---|---|
| RetinaFace | det_10g.onnx | Face detection (primary) |
| ArcFace | w600k_r50.onnx | Face recognition (512-dim embeddings) |
| InSwapper | inswapper_128.fp16.onnx | Face swap core |
| GFPGAN | GFPGANv1.4.onnx | Face restoration |
| CodeFormer | codeformer_fp16.onnx | Face restoration (transformer-based) |
| GPEN-256 | GPEN-BFR-256.onnx | Face restoration |
| GPEN-512 | GPEN-BFR-512.onnx | Face restoration |
| Occluder | occluder.onnx | Occlusion mask (hands/glasses) |
| Face Parser | faceparser_fp16.onnx | Semantic face segmentation (19-class) |
| SCRFD | scrfd_2.5g_bnkps.onnx | Face detection (lighter alternative) |
| YOLOv8 | yoloface_8n.onnx | Face detection (fastest) |
| 106-point | 2d106det.onnx | Fine facial landmarks |
| ResNet50 | res50.onnx | Restorer reference detection |

4. Convert models to TensorRT

# Validate all ONNX models (no TRT needed)
python -m tools.convert_models --models-dir ./models --onnx-only

# Build all TRT engines (takes 15-30 minutes on first run)
python -m tools.convert_models --models-dir ./models

# Rebuild a specific model
python -m tools.convert_models --models-dir ./models --model inswapper --force

The converter automatically uses FP32 for generative models (inswapper, all restorers) and FP16 for detection, recognition, and masking models. Engines are cached in models/engines/ and are GPU-architecture-specific, so they must be rebuilt when you move to a different GPU architecture.

Usage

Single Image Swap

# Basic swap with GFPGAN restoration
python -m tools.test_swap \
    --source-face source.jpg \
    --target target.jpg \
    --output result.jpg \
    --models-dir ./models \
    --threshold 0 \
    --restorer gfpgan

# With all masks enabled
python -m tools.test_swap \
    --source-face source.jpg \
    --target target.jpg \
    --output result.jpg \
    --models-dir ./models \
    --threshold 0 \
    --restorer gfpgan \
    --occluder --faceparser

Video Processing

# Full video swap with GFPGAN + Blend mode
python -m tools.process_video \
    --source-face source.jpg \
    --input video.mp4 \
    --output swapped.mp4 \
    --models-dir ./models \
    --restorer gfpgan \
    --restorer-det blend

# Quick test (first 30 frames with timing)
python -m tools.process_video \
    --source-face source.jpg \
    --input video.mp4 \
    --output test.mp4 \
    --models-dir ./models \
    --restorer gfpgan \
    --restorer-det blend \
    --max-frames 30 --perf

# With masks and H.264 output
python -m tools.process_video \
    --source-face source.jpg \
    --input video.mp4 \
    --output swapped.mp4 \
    --models-dir ./models \
    --restorer gfpgan \
    --restorer-det blend \
    --occluder --faceparser \
    --codec h264 --qp 20

Diagnostic (Debug Pipeline)

# Dumps intermediate images at every pipeline stage
python -m tools.test_swap_debug \
    --source-face source.jpg \
    --target target.jpg \
    --models-dir ./models \
    --output-dir ./debug_output \
    --restorer gfpgan \
    --all-masks

This saves 15+ intermediate images showing each processing step: aligned face, raw swap output, restorer input/output, mask layers, and final composite.

Configuration

All parameters are controlled via frozen dataclasses in models.py. Defaults match Rope's proven settings:

Swap Parameters

| Parameter | Default | Description |
|---|---|---|
| swapper_resolution | 128 | InSwapper resolution: 128, 256, or 512 (tiled) |
| strength | 100 | Swap iterations × 100 (100 = 1 pass, 300 = 3 passes) |
| match_threshold | 75.0 | Cosine similarity threshold for face matching |
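
The `strength` encoding can be unpacked as follows. This is a minimal sketch, not the project's code: the table only confirms that 100 means one pass and 300 means three, so the handling of values that are not multiples of 100 (an extra pass cross-faded with the previous result) is an assumption, and both function names are hypothetical.

```python
import math


def strength_schedule(strength):
    """Return (passes, final_alpha) for a strength value such as 100 or 250."""
    if strength <= 0:
        return 0, 0.0  # no swap pass at all
    passes = math.ceil(strength / 100)
    remainder = strength % 100
    # Assumption: a partial remainder cross-fades the last pass with the one before it.
    final_alpha = remainder / 100 if remainder else 1.0
    return passes, final_alpha


def blend(previous, current, alpha):
    """Alpha-blend two equally sized pixel lists: alpha weights the newer pass."""
    return [alpha * c + (1.0 - alpha) * p for p, c in zip(previous, current)]


one_pass = strength_schedule(100)
three_passes = strength_schedule(300)
partial = strength_schedule(150)
mixed = blend([0.0, 1.0], [1.0, 0.0], partial[1])
```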

Mask Parameters

| Parameter | Default | Description |
|---|---|---|
| border_top/bottom/sides | 10 | Border inset on 128×128 mask (pixels) |
| border_blur | 10 | Gaussian blur kernel for border mask |
| blend_amount | 5 | Gaussian blur on final composite mask |
| occluder_enabled | false | Occlusion mask (hands/glasses) |
| occluder_amount | 0 | Dilate (+) or erode (-) occluder mask |
| face_parser_enabled | false | Semantic face segmentation mask |
| face_parser_amount | 0 | Must be non-zero to activate segmentation |
| mouth_parser_amount | 0 | Mouth region control |
| diff_enabled | false | Pixel-difference mask (for video temporal stability) |
| diff_threshold | 10 | Diff sensitivity (higher = less change) |
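
The enabled masks are combined multiplicatively (see Technical Details), so any single mask can veto a pixel. The sketch below shows that composition on nested lists with values in [0, 1]. `compose_masks` is a hypothetical name, and real masks would be GPU tensors rather than Python lists.

```python
def compose_masks(*masks):
    """Multiply same-sized masks (lists of rows, values in [0, 1]) element-wise."""
    if not masks:
        raise ValueError("at least one mask is required")
    height, width = len(masks[0]), len(masks[0][0])
    combined = [[1.0] * width for _ in range(height)]
    for mask in masks:
        if len(mask) != height or any(len(row) != width for row in mask):
            raise ValueError("all masks must share the same shape")
        for y in range(height):
            for x in range(width):
                combined[y][x] *= mask[y][x]
    return combined


# Toy 2x2 masks: the border mask removes one corner, the occluder removes another.
border = [[0.0, 1.0], [1.0, 1.0]]
occluder = [[1.0, 0.5], [1.0, 0.0]]
final_mask = compose_masks(border, occluder)
```

One consequence of this design is that a disabled mask can be modeled as all ones, which leaves the product unchanged.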

Restorer Parameters

| Parameter | Default | Description |
|---|---|---|
| enabled | false | Enable face restoration |
| restorer_type | "gfpgan" | gfpgan, codeformer, gpen256, gpen512 |
| det_mode | "none" | Face alignment for restorer: none, blend, reference |
| blend_alpha | 0.80 | Restorer strength (1.0 = full restorer output) |

Encoder Parameters

| Parameter | Default | Description |
|---|---|---|
| codec | "hevc" | Output codec: hevc or h264 |
| preset | "P5" | NVENC preset (P1 = fastest, P7 = best quality) |
| qp | 18 | Quantization parameter (lower = higher quality) |
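
To illustrate how frozen dataclasses can supply defaults for a JSON config, here is a minimal sketch using the three swap parameters listed above. The flat JSON layout, the `load_swap_config` helper, and the unknown-key check are assumptions for illustration; the actual schema of vinimaya-config.json and the loader in config.py may differ.

```python
import json
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class SwapConfig:
    # Defaults copied from the Swap Parameters table.
    swapper_resolution: int = 128
    strength: int = 100
    match_threshold: float = 75.0


def load_swap_config(json_text):
    """Overlay JSON values on the dataclass defaults, rejecting unknown keys."""
    overrides = json.loads(json_text)
    known = {f.name for f in fields(SwapConfig)}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"unknown swap settings: {sorted(unknown)}")
    # replace() returns a new frozen instance; the defaults object is never mutated.
    return replace(SwapConfig(), **overrides)


config = load_swap_config('{"strength": 300}')
```

Because the dataclass is frozen, a loaded config can be shared across pipeline stages without any stage changing it by accident.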

Model Precision Strategy

A key finding during development: detection models work in FP16, but generative models require FP32.

| Category | Models | Precision | Reason |
|---|---|---|---|
| Detection / recognition | RetinaFace, SCRFD, YOLOv8, ArcFace | FP16 | Classification output tolerates reduced precision |
| Masking | Occluder, Face Parser | FP16 | Binary/categorical output tolerates reduced precision |
| Generation | InSwapper | FP32 | Color fidelity degrades over iterative application |
| Generation | GFPGAN, GPEN-256, GPEN-512 | FP32 | StyleGAN2 ModulatedConv2d causes output collapse in FP16 |
| Generation | CodeFormer | FP32 | Transformer LayerNorm overflows in FP16 (produces NaN) |

The model converter handles this automatically — generative models are blacklisted from FP16 builds.
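
The precision rule in the table can be expressed as a small lookup. This is a sketch of the policy only, not the converter's code: the model keys reuse the names seen in `--model inswapper` and `restorer_type`, and both `FP32_MODELS` and `build_precision` are hypothetical names.

```python
# Generative models that the table lists as requiring FP32.
FP32_MODELS = frozenset({"inswapper", "gfpgan", "codeformer", "gpen256", "gpen512"})


def build_precision(model_name):
    """Return the engine precision for a model key: FP32 if generative, else FP16."""
    return "fp32" if model_name.lower() in FP32_MODELS else "fp16"


precisions = {name: build_precision(name) for name in ("inswapper", "retinaface", "occluder")}
```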

Technical Details

Ported from Rope

The following Rope mechanics are preserved exactly:

  • ArcFace alignment reference points (ARCFACE_DST × 4.0 + 32.0 in 512-space)
  • Similarity transform estimation via skimage.transform.SimilarityTransform
  • Affine warp/unwarp via torchvision.transforms.v2.functional.affine
  • InSwapper latent computation (L2 norm → emap dot product → L2 norm)
  • Tiled swap execution (128→256/512 via stride-sampling)
  • Iterative strength control (multiple swap passes + alpha blending)
  • 5-source multiplicative mask composition (border × diff × occluder × parser × CLIP)
  • Morphological mask operations (dilate/erode via conv2d with ones kernel)
  • Gaussian blur on masks
  • Bounding-box-optimized paste-back
  • Frame upscaling for small images (ensure both dimensions ≥ 512)
  • Color correction (gamma + per-channel RGB offsets)
  • Restorer detection modes (None / Blend / Reference with FFHQ alignment)
  • Face parser semantic class handling (bg/neck/cloth/hair/hat exclusion)
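
As an illustration of the latent computation listed above (L2 norm, then emap product, then L2 norm), here is a pure-Python sketch on a toy 2-dimensional embedding. The real embedding is 512-dimensional; the function names are hypothetical, and the row-vector-times-matrix orientation of the product is an assumption.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vector]


def compute_latent(embedding, emap):
    """Normalize the embedding, project it through emap, and normalize again."""
    unit = l2_normalize(embedding)
    # Row vector times matrix: projected[j] = sum_i unit[i] * emap[i][j]
    projected = [
        sum(u * row[j] for u, row in zip(unit, emap))
        for j in range(len(emap[0]))
    ]
    return l2_normalize(projected)


embedding = [3.0, 4.0]
emap = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the latent is the unit embedding
latent = compute_latent(embedding, emap)
```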

TensorRT Engine Wrapper

The TRTEngine class provides:

  • Automatic input dtype casting (handles FP16↔FP32 mismatches between model expectations and PyTorch tensors)
  • Dynamic shape support via optimization profiles
  • Auto-allocated output tensors for multi-output models (face parser, CodeFormer)
  • Lazy loading via EngineManager (engines loaded on first use)
  • Execution on PyTorch's current CUDA stream (prevents race conditions with tensor operations)
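
The lazy-loading behavior can be sketched without TensorRT by caching whatever a loader callback returns. `LazyEngineManager` and `fake_loader` are hypothetical stand-ins; the real EngineManager's interface is not shown in this README and may differ.

```python
class LazyEngineManager:
    """Hypothetical stand-in for EngineManager: load each engine on first request."""

    def __init__(self, loader):
        self._loader = loader   # callable that builds or deserializes an engine
        self._engines = {}      # cache of engines already loaded, keyed by name

    def get(self, name):
        if name not in self._engines:
            self._engines[name] = self._loader(name)
        return self._engines[name]

    def loaded(self):
        """Names of engines that have actually been loaded so far."""
        return sorted(self._engines)


load_calls = []


def fake_loader(name):
    # Records each load so the caching behavior is observable.
    load_calls.append(name)
    return f"engine:{name}"


manager = LazyEngineManager(fake_loader)
first = manager.get("inswapper")
second = manager.get("inswapper")
```

The second `get` returns the cached object, so models that a given run never uses (for example, an unused restorer) never cost load time or GPU memory.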

Video Pipeline

The video pipeline uses:

  • PyNvVideoCodec for hardware-accelerated decode (NVDEC) and encode (NVENC)
  • DLPack for zero-copy tensor exchange between decoder and PyTorch
  • ffmpeg for container remux and audio copy
  • Frame clone from DLPack to own CUDA memory (decoder's circular buffer is read-only)
  • ensure_min_size() / restore_size() for frames smaller than 512px
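
The minimum-size step can be illustrated by the dimension arithmetic alone. This sketch assumes the frame is scaled uniformly so that its shorter side reaches 512 while the aspect ratio is preserved; `min_size_dims` is a hypothetical helper, and the actual ensure_min_size() may round or resize differently.

```python
def min_size_dims(width, height, minimum=512):
    """Return (width, height) scaled up so both sides are at least `minimum`."""
    short = min(width, height)
    if short >= minimum:
        return width, height  # already large enough, leave untouched

    def scale_up(side):
        # Integer ceiling division avoids float rounding below the minimum.
        return -(-side * minimum // short)

    return scale_up(width), scale_up(height)


small = min_size_dims(320, 240)
portrait = min_size_dims(100, 400)
full_hd = min_size_dims(1920, 1080)
```

Keeping the original dimensions alongside the scaled ones is what would let a restore_size()-style step resize the processed frame back afterwards.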

Roadmap

  • GUI integration (PyWebView + Flask)
  • Performance optimization (event-based stream sync, tensor pre-allocation)
  • Source face management UI (multi-source assignment like Rope)
  • Temporal stability (diff mask, face tracking across frames)
  • CLIPSeg text-guided masking (requires PyTorch model export)
  • Restorer Reference detection mode (ResNet50)
  • Frame orientation support (rotation)
  • Batch face processing

Acknowledgments

  • Rope by Hillobar — the face swap pipeline and algorithms that ViniMaya ports
  • InsightFace — RetinaFace, ArcFace, and InSwapper models
  • GFPGAN — face restoration model
  • CodeFormer — transformer-based face restoration
  • GPEN — blind face restoration

License

See LICENSE for details.