ViniMaya

A TensorRT-accelerated face-swap studio with a real-time visual editor. Load a video, set your anchors, and see every adjustment rendered live — no blind parameter-tweaking, no guess-and-recheck render cycles.

Why ViniMaya?

Most face-swap tools are batch processors: you set parameters in a config or a dated GUI, run a full render, look at the result, and repeat until it's right. ViniMaya is built the other way around — as an interactive editor.

Live per-frame preview. See the swap on your actual frames as you work, not after a render completes.
Live segment preview. Define a segment by dropping anchors, preview just that segment, and decide whether to commit to a full swap and save — before spending time on a complete encode.
Visual knob feedback. Every control — strength, masks, blend, color, restorer — updates the preview in real time, so you can see what each adjustment does instead of inferring it from a number.
Hardware-accelerated throughout. TensorRT inference and NVDEC/NVENC video I/O mean the preview is fast enough to actually iterate in.

Under the hood, ViniMaya's swap algorithms are a faithful port of Rope's proven pipeline — the same math that produces its quality — re-implemented on a modern TensorRT stack (vs. Rope's older ONNX Runtime / tkinter / CUDA 11.8 base). But where Rope and its peers stop at "run the swap," ViniMaya is designed for the part that actually takes the time: dialing it in.

Results

Metric	Rope (ONNX Runtime)	ViniMaya (TensorRT)
Runtime	ONNX Runtime + CUDA EP	TensorRT 10.x (FP32 generative, FP16 detection)
Video I/O	OpenCV + ffmpeg pipe	PyNvVideoCodec (NVDEC/NVENC hardware codec)
Python	3.10 only	3.11+
CUDA	11.8	12.x
1080p, 3 faces, GFPGAN	~8-12 fps	hardware-accelerated TRT path

Architecture

Decode (NVDEC) → Detect (RetinaFace) → Recognize (ArcFace) → Match → Swap → Encode (NVENC)
                                                                 │
                                                    ┌────────────┴────────────┐
                                                    │       swap_core()       │
                                                    │                         │
                                                    │  1. Similarity Transform│
                                                    │  2. Align face 512×512  │
                                                    │  3. InSwapper (tiled)   │
                                                    │  4. Strength blending   │
                                                    │  5. Color correction    │
                                                    │  6. Mask generation:    │
                                                    │     • Border mask       │
                                                    │     • Diff mask         │
                                                    │     • Occluder          │
                                                    │     • Face parser       │
                                                    │  7. Face restoration    │
                                                    │  8. Paste-back (bbox)   │
                                                    └─────────────────────────┘
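The Match stage in the diagram compares ArcFace embeddings of detected faces against the source faces. A minimal sketch of that idea follows, in plain Python so it runs without a GPU. The function names are hypothetical, and the mapping of cosine similarity onto a 0-100 score (so that a threshold such as 75.0 is meaningful) is an assumption, not something this README confirms.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_score(a, b):
    # ASSUMPTION: cosine similarity in [-1, 1] is rescaled to a 0-100 score.
    return 50.0 + 50.0 * cosine_similarity(a, b)


def best_match(target, sources, threshold=75.0):
    """Return the index of the best-scoring source, or None if none reaches threshold."""
    best_idx, best_score = None, threshold
    for idx, source in enumerate(sources):
        score = match_score(target, source)
        if score >= best_score:
            best_idx, best_score = idx, score
    return best_idx
```

Real embeddings would be the 512-dimensional ArcFace vectors; the two-element vectors used when trying this out are only for readability.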

Project Structure

vinimaya/
├── vinimaya/
│   ├── __init__.py
│   ├── models.py              # Frozen dataclasses: SwapConfig, MaskConfig, etc.
│   ├── config.py              # JSON config loading with defaults from dataclasses
│   ├── swap/
│   │   ├── engine.py          # TRTEngine wrapper + EngineManager (lazy-loading)
│   │   └── core.py            # swap_core() — complete port of Rope's swap logic
│   ├── detection/
│   │   └── retinaface.py      # RetinaFace detection + ArcFace recognition
│   └── video/
│       ├── decoder.py         # NVDEC decoder via PyNvVideoCodec
│       └── encoder.py         # NVENC encoder + ffmpeg remux
├── tools/
│   ├── convert_models.py      # ONNX → TensorRT engine conversion
│   ├── process_video.py       # Full video swap pipeline (CLI)
│   ├── test_swap.py           # Single-image swap test
│   ├── test_swap_debug.py     # Diagnostic with intermediate dumps
│   ├── test_minimal.py        # Minimal swap (match Rope settings)
│   └── test_rope_match.py     # Test with exact Rope parameter set
├── models/                    # Place ONNX model files here
│   └── engines/               # Generated TRT engines (auto-created)
├── vinimaya-config.json       # Default configuration
└── pyproject.toml

Prerequisites

GPU: NVIDIA GPU with Turing architecture or newer (RTX 20xx+)
OS: Windows 10/11 or Linux
Python: 3.11 or 3.12
CUDA: 12.x with matching cuDNN
ffmpeg: On system PATH (for audio remux)

Installation

1. Create virtual environment

python -m venv venv
# Windows
venv\Scripts\activate
# Linux
source venv/bin/activate

2. Install dependencies

# PyTorch with CUDA 12.8 (adjust for your CUDA version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

# TensorRT and tools
pip install tensorrt polygraphy

# ONNX (for model conversion)
pip install onnx protobuf

# Video codec
pip install PyNvVideoCodec

# Image processing
pip install opencv-python scikit-image numpy tqdm

3. Obtain model files

Place the following ONNX model files in the models/ directory. These are the same models used by Rope:

Model	File	Purpose
RetinaFace	`det_10g.onnx`	Face detection (primary)
ArcFace	`w600k_r50.onnx`	Face recognition (512-dim embeddings)
InSwapper	`inswapper_128.fp16.onnx`	Face swap core
GFPGAN	`GFPGANv1.4.onnx`	Face restoration
CodeFormer	`codeformer_fp16.onnx`	Face restoration (transformer-based)
GPEN-256	`GPEN-BFR-256.onnx`	Face restoration
GPEN-512	`GPEN-BFR-512.onnx`	Face restoration
Occluder	`occluder.onnx`	Occlusion mask (hands/glasses)
Face Parser	`faceparser_fp16.onnx`	Semantic face segmentation (19-class)
SCRFD	`scrfd_2.5g_bnkps.onnx`	Face detection (lighter alternative)
YOLOv8	`yoloface_8n.onnx`	Face detection (fastest)
106-point	`2d106det.onnx`	Fine facial landmarks
ResNet50	`res50.onnx`	Restorer reference detection

4. Convert models to TensorRT

# Validate all ONNX models (no TRT needed)
python -m tools.convert_models --models-dir ./models --onnx-only

# Build all TRT engines (takes 15-30 minutes on first run)
python -m tools.convert_models --models-dir ./models

# Rebuild a specific model
python -m tools.convert_models --models-dir ./models --model inswapper --force

The converter automatically uses FP32 for generative models (inswapper, all restorers) and FP16 for detection and masking models. Engines are cached in models/engines/ and are specific to the GPU architecture they were built on, so they need rebuilding after a move to a different GPU.

Usage

Single Image Swap

# Basic swap with GFPGAN restoration
python -m tools.test_swap \
    --source-face source.jpg \
    --target target.jpg \
    --output result.jpg \
    --models-dir ./models \
    --threshold 0 \
    --restorer gfpgan

# With all masks enabled
python -m tools.test_swap \
    --source-face source.jpg \
    --target target.jpg \
    --output result.jpg \
    --models-dir ./models \
    --threshold 0 \
    --restorer gfpgan \
    --occluder --faceparser

Video Processing

# Full video swap with GFPGAN + Blend mode
python -m tools.process_video \
    --source-face source.jpg \
    --input video.mp4 \
    --output swapped.mp4 \
    --models-dir ./models \
    --restorer gfpgan \
    --restorer-det blend

# Quick test (first 30 frames with timing)
python -m tools.process_video \
    --source-face source.jpg \
    --input video.mp4 \
    --output test.mp4 \
    --models-dir ./models \
    --restorer gfpgan \
    --restorer-det blend \
    --max-frames 30 --perf

# With masks and H.264 output
python -m tools.process_video \
    --source-face source.jpg \
    --input video.mp4 \
    --output swapped.mp4 \
    --models-dir ./models \
    --restorer gfpgan \
    --restorer-det blend \
    --occluder --faceparser \
    --codec h264 --qp 20
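When several clips need the same settings, the commands above can be assembled in a small script. The helper below is a hypothetical convenience, not part of ViniMaya; it only arranges the flags shown in this section into an argument list.

```python
import sys


def build_process_video_cmd(source_face, input_video, output_video,
                            models_dir="./models", restorer="gfpgan",
                            restorer_det="blend", masks=False,
                            codec=None, qp=None, max_frames=None, perf=False):
    """Build the argv list for tools.process_video from the documented flags."""
    cmd = [sys.executable, "-m", "tools.process_video",
           "--source-face", source_face,
           "--input", input_video,
           "--output", output_video,
           "--models-dir", models_dir,
           "--restorer", restorer,
           "--restorer-det", restorer_det]
    if masks:
        cmd += ["--occluder", "--faceparser"]
    if codec is not None:
        cmd += ["--codec", codec]
    if qp is not None:
        cmd += ["--qp", str(qp)]
    if max_frames is not None:
        cmd += ["--max-frames", str(max_frames)]
    if perf:
        cmd.append("--perf")
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root with the virtual environment active.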

Diagnostic (Debug Pipeline)

# Dumps intermediate images at every pipeline stage
python -m tools.test_swap_debug \
    --source-face source.jpg \
    --target target.jpg \
    --models-dir ./models \
    --output-dir ./debug_output \
    --restorer gfpgan \
    --all-masks

This saves 15+ intermediate images showing each processing step: aligned face, raw swap output, restorer input/output, mask layers, and final composite.

Configuration

All parameters are controlled via frozen dataclasses in models.py. Defaults match Rope's proven settings:

Swap Parameters

Parameter	Default	Description
`swapper_resolution`	128	InSwapper resolution: 128, 256, or 512 (tiled)
`strength`	100	Swap iterations × 100 (100=1 pass, 300=3 passes)
`match_threshold`	75.0	Cosine similarity threshold for face matching
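To make the strength encoding concrete, here is a minimal sketch of how a value could be split into a number of swap passes plus a blend weight for the last pass. The table above only documents whole multiples of 100; the treatment of values in between (a partial final pass blended with the previous result) is an assumption, and both function names are hypothetical.

```python
import math


def strength_schedule(strength):
    """Split a strength value into (passes, final_alpha)."""
    if strength <= 0:
        return 0, 0.0
    passes = math.ceil(strength / 100)
    remainder = strength % 100
    # ASSUMPTION: a whole multiple of 100 means the last pass is used at full weight.
    final_alpha = remainder / 100 if remainder else 1.0
    return passes, final_alpha


def blend(previous, swapped, alpha):
    """Alpha-blend the last pass over the result of the pass before it."""
    return [alpha * s + (1.0 - alpha) * p for p, s in zip(previous, swapped)]
```

Under these assumptions, 100 gives one full pass, 300 gives three, and 150 gives two passes with the second blended in at 50%.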

Mask Parameters

Parameter	Default	Description
`border_top/bottom/sides`	10	Border inset on 128×128 mask (pixels)
`border_blur`	10	Gaussian blur kernel for border mask
`blend_amount`	5	Gaussian blur on final composite mask
`occluder_enabled`	false	Occlusion mask (hands/glasses)
`occluder_amount`	0	Dilate (+) or erode (-) occluder mask
`face_parser_enabled`	false	Semantic face segmentation mask
`face_parser_amount`	0	Must be non-zero to activate segmentation
`mouth_parser_amount`	0	Mouth region control
`diff_enabled`	false	Pixel-difference mask (for video temporal stability)
`diff_threshold`	10	Diff sensitivity (higher = less change)
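The border inset and the multiplicative combination of masks can be pictured with a small pure-Python sketch. It is illustrative only: the real pipeline works on GPU tensors and also blurs the masks, which is left out here, and the function names are hypothetical.

```python
def border_mask(size=128, top=10, bottom=10, sides=10):
    """Mask that is 1.0 inside the inset rectangle and 0.0 in the border band."""
    return [[1.0 if (top <= y < size - bottom and sides <= x < size - sides) else 0.0
             for x in range(size)]
            for y in range(size)]


def compose(*masks):
    """Multiply equally sized masks element-wise; a zero in any source zeroes the result."""
    height, width = len(masks[0]), len(masks[0][0])
    out = [[1.0] * width for _ in range(height)]
    for mask in masks:
        for y in range(height):
            for x in range(width):
                out[y][x] *= mask[y][x]
    return out
```

Because the sources are multiplied, a pixel survives only if every enabled mask keeps it, which is why a disabled mask can simply be treated as all ones.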

Restorer Parameters

Parameter	Default	Description
`enabled`	false	Enable face restoration
`restorer_type`	"gfpgan"	gfpgan, codeformer, gpen256, gpen512
`det_mode`	"none"	Face alignment for restorer: none, blend, reference
`blend_alpha`	0.80	Restorer strength (1.0 = full restorer output)

Encoder Parameters

Parameter	Default	Description
`codec`	"hevc"	Output codec: hevc or h264
`preset`	"P5"	NVENC preset (P1=fastest, P7=best quality)
`qp`	18	Quantization parameter (lower = higher quality)
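As a rough picture of "frozen dataclasses with JSON overrides", the sketch below mirrors the encoder defaults from the table above. The class name EncoderConfig, the loader function, and the decision to reject unknown keys are assumptions for illustration; the actual definitions live in models.py and config.py.

```python
import json
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class EncoderConfig:
    codec: str = "hevc"   # hevc or h264
    preset: str = "P5"    # P1 = fastest, P7 = best quality
    qp: int = 18          # lower = higher quality


def load_encoder_config(json_text):
    """Overlay JSON values on the dataclass defaults and return a new frozen instance."""
    raw = json.loads(json_text) if json_text.strip() else {}
    known = {f.name for f in fields(EncoderConfig)}
    unknown = set(raw) - known
    if unknown:
        # ASSUMPTION: fail loudly on misspelled keys instead of ignoring them.
        raise KeyError(f"unknown encoder options: {sorted(unknown)}")
    return replace(EncoderConfig(), **raw)
```

Freezing the dataclass means a loaded configuration cannot be mutated by accident midway through a render; a changed setting produces a new instance instead.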

Model Precision Strategy

A key finding during development: detection models work in FP16, but generative models require FP32.

Category	Models	Precision	Reason
Detection	RetinaFace, SCRFD, YOLOv8, ArcFace	FP16	Classification output tolerates reduced precision
Masking	Occluder, Face Parser	FP16	Binary/categorical output tolerates reduced precision
Generation	InSwapper	FP32	Color fidelity degrades over iterative application
Generation	GFPGAN, GPEN-256, GPEN-512	FP32	StyleGAN2 ModulatedConv2d causes output collapse in FP16
Generation	CodeFormer	FP32	Transformer LayerNorm overflows in FP16 (produces NaN)

The model converter handles this automatically — generative models are blacklisted from FP16 builds.
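The rule in the table above reduces to a small lookup. This is a sketch under assumptions: the set name, the function name, and the detection and masking keys are hypothetical, and the real converter may key models differently.

```python
# Generative models that must stay in FP32 according to the table above.
FP32_ONLY = frozenset({"inswapper", "gfpgan", "gpen256", "gpen512", "codeformer"})


def precision_for(model_name):
    """Return 'fp32' for generative models and 'fp16' for everything else."""
    return "fp32" if model_name.lower() in FP32_ONLY else "fp16"
```

A converter built this way would consult precision_for() once per model and request a reduced-precision build only when it returns "fp16".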

Technical Details

Ported from Rope

The following Rope mechanics are preserved exactly:

ArcFace alignment reference points (ARCFACE_DST × 4.0 + 32.0 in 512-space)
Similarity transform estimation via skimage.transform.SimilarityTransform
Affine warp/unwarp via torchvision.transforms.v2.functional.affine
InSwapper latent computation (L2 norm → emap dot product → L2 norm)
Tiled swap execution (128→256/512 via stride-sampling)
Iterative strength control (multiple swap passes + alpha blending)
5-source multiplicative mask composition (border × diff × occluder × parser × CLIP)
Morphological mask operations (dilate/erode via conv2d with ones kernel)
Gaussian blur on masks
Bounding-box-optimized paste-back
Frame upscaling for small images (ensure both dimensions ≥ 512)
Color correction (gamma + per-channel RGB offsets)
Restorer detection modes (None / Blend / Reference with FFHQ alignment)
Face parser semantic class handling (bg/neck/cloth/hair/hat exclusion)
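The stride-sampling idea behind the tiled swap can be shown on nested lists. With a factor of 2, a 256×256 face splits into four interleaved 128×128 tiles, each of which fits the swapper's native input, and the outputs are written back to the same interleaved positions. The helper names are hypothetical and the sketch skips the swap itself.

```python
def split_tiles(image, dim):
    """Tile (j, i) takes every dim-th row starting at j and every dim-th column starting at i."""
    return {(j, i): [row[i::dim] for row in image[j::dim]]
            for j in range(dim) for i in range(dim)}


def merge_tiles(tiles, dim):
    """Write each tile back to the interleaved positions it was sampled from."""
    tile_h = len(tiles[(0, 0)])
    tile_w = len(tiles[(0, 0)][0])
    out = [[None] * (tile_w * dim) for _ in range(tile_h * dim)]
    for (j, i), tile in tiles.items():
        for ty, row in enumerate(tile):
            for tx, value in enumerate(row):
                out[ty * dim + j][tx * dim + i] = value
    return out
```

Each tile is a subsampled view of the whole face rather than a spatial crop, so every swapper call sees the complete face at 128×128.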

TensorRT Engine Wrapper

The TRTEngine class provides:

Automatic input dtype casting (handles FP16↔FP32 mismatches between model expectations and PyTorch tensors)
Dynamic shape support via optimization profiles
Auto-allocated output tensors for multi-output models (face parser, CodeFormer)
Lazy loading via EngineManager (engines loaded on first use)
Execution on PyTorch's current CUDA stream (prevents race conditions with tensor operations)
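Lazy loading, as described for EngineManager, amounts to a cache that is filled on first request. The class below is a hypothetical stand-in that takes any loader callable, so it can be tried without TensorRT; it is not the project's actual implementation.

```python
class LazyEngineManager:
    """Load each engine the first time it is requested, then reuse it."""

    def __init__(self, loader):
        self._loader = loader   # callable: engine name -> engine object
        self._engines = {}

    def get(self, name):
        if name not in self._engines:
            self._engines[name] = self._loader(name)
        return self._engines[name]

    def loaded(self):
        """Names of the engines that have been loaded so far."""
        return sorted(self._engines)


# A fake loader that records calls, standing in for TensorRT deserialization.
calls = []


def fake_loader(name):
    calls.append(name)
    return f"<engine {name}>"


manager = LazyEngineManager(fake_loader)
```

With this pattern, a run that never enables the occluder or face parser never pays the cost of loading those engines.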

Video Pipeline

The video pipeline uses:

PyNvVideoCodec for hardware-accelerated decode (NVDEC) and encode (NVENC)
DLPack for zero-copy tensor exchange between decoder and PyTorch
ffmpeg for container remux and audio copy
A clone of each DLPack frame into its own CUDA memory (the decoder's circular buffer is read-only)
ensure_min_size() / restore_size() for frames smaller than 512px
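The size arithmetic behind a helper like ensure_min_size() can be sketched as follows. The function name min_size_dims is hypothetical, and preserving the aspect ratio while rounding up is an assumption about how the real helper behaves.

```python
def min_size_dims(width, height, min_side=512):
    """Return (new_width, new_height) with both sides >= min_side, aspect ratio kept."""
    short = min(width, height)
    if short >= min_side:
        return width, height
    # Integer ceiling division avoids float rounding leaving a side at 511.
    new_width = -(-width * min_side // short)
    new_height = -(-height * min_side // short)
    return new_width, new_height
```

A restore step would resize the processed frame back to the original width and height, so keeping the original dimensions alongside the upscaled frame is enough to undo the change.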

Roadmap

GUI integration (PyWebView + Flask)
Performance optimization (event-based stream sync, tensor pre-allocation)
Source face management UI (multi-source assignment like Rope)
Temporal stability (diff mask, face tracking across frames)
CLIPSeg text-guided masking (requires PyTorch model export)
Restorer Reference detection mode (ResNet50)
Frame orientation support (rotation)
Batch face processing

Acknowledgments

Rope by Hillobar — the face swap pipeline and algorithms that ViniMaya ports
InsightFace — RetinaFace, ArcFace, and InSwapper models
GFPGAN — face restoration model
CodeFormer — transformer-based face restoration
GPEN — blind face restoration

License

See LICENSE for details.
