A browser-based workbench for experimenting with ML/Vision models for video processing. You can run various models on an uploaded video to:
- Generate camera pose + intrinsics + depth
- 2D object segmentation / tracking
- 3D bounding box
- Scene & Object point clouds
Runs as a local backend server & frontend web app. The UI is a Solid + Three.js app served by Vite; all heavy lifting is done by Python scripts that the Vite dev server shells out to. A plugin architecture makes it easy to add additional models.
The repo bundles a Python setup/ toolchain that installs every model
into a single, gitignored models/ directory: one venv, one set of
pinned commits, one canonical location for weights.
All computed results are saved, and can be browsed from the frontend.
This repo has been almost entirely created by Claude Code, under step-by-step human guidance. DO NOT deploy on the internet - code has not been hardened.
Questions / Comments? Find me at rms@rms80.com, or @rms80.
See the Model / method reference section below for pinned commits, custom patches, and per-plugin quirks.
- COLMAP + DepthAnythingV2 — classical SfM camera solve (COLMAP 4.0.3) paired with DepthAnythingV2 Metric Indoor for per-frame depth, RANSAC-scaled to meters.
- CUT3R — CUT3R/CUT3R, feed-forward poses + per-frame depth + pointmaps from a sliding context window.
- VGGT — facebookresearch/vggt, Meta's 1B transformer; anchors-only phase produces cameras + depth + pointmaps in one shot.
- VGGT-Omega (gated) — facebookresearch/vggt-omega, CVPR 2026 successor to VGGT-1B. Requires HF access to facebook/VGGT-Omega.
- Depth-Anything-3 (Metric, Large) — ByteDance-Seed/Depth-Anything-3, pairs the LARGE-1.1 (pose + relative depth) and DA3METRIC-LARGE (metric depth) heads for metric-scale reconstructions.
- Pi3X — yyfz/Pi3, single feed-forward pass producing per-frame pointmaps + a global scene pointmap; fits in 12 GB VRAM.
- MapAnything — facebookresearch/map-anything, memory-efficient inference with edge-aware scene-pointmap masking.
- HunyuanWorld-Mirror — Tencent-Hunyuan/HunyuanWorld-Mirror, Tencent's multi-head model (we use the pointmap + depth + camera heads; gaussian-splat head stubbed out).
- HunyuanWorld-Mirror 2.0 — Tencent-Hunyuan/HY-World-2.0, v2 with a flash-attention-free SDPA shim.
- WildDet3D (depth + K) — allenai/WildDet3D, produces depth + predicted intrinsics per frame (no cross-frame pose solve — useful as a depth/K signal, not a real reconstruction).
- InfiniDepth (depth refiner) — zju3dv/InfiniDepth. Not a standalone reconstruction — consumes another plugin's cameras + depth and sharpens the per-frame depth via a neural implicit field.
- SAM3 (detect) (gated) — facebook/sam3 via Ultralytics. Click + text label → first-frame bbox & RGBA mask.
- SAM2 (track) — sam2.1_l.pt via Ultralytics' SAM2VideoPredictor. Propagates the SAM3 detection across all frames in the tracked range.
- Boxer (default) — facebookresearch/boxer, per-frame oriented bounding boxes from the masked depth/pointmap. The Fuse toggle merges all frames into one shared static box (more stable, slower).
- WildDet3D — allenai/WildDet3D, neural 3D detector with optional Use Cameras (intrinsics prior) and Use Depth toggles that pipe the active scene plugin's K / depth in as priors.
The typical end-to-end run looks like this. Every step writes its outputs
to analysis/<video_stem>/ and shows up in the right viewport without
needing a refresh.
- Upload a video. Drag-drop or pick a previously uploaded video. The server re-encodes for smooth scrubbing and pre-extracts every frame as a JPEG under analysis/<video>/_scene/frames/.
- Scene analysis. Pick a method and run. Produces camera poses + intrinsics + per-frame depth, and (for most plugins) per-frame and/or global pointmaps. InfiniDepth is also available as a depth refiner.
- World-up annotation (optional). Click 3+ points on horizontal surfaces (floor, table) across one or more frames, then Align Scene. This rotates the reconstruction so up is +y, yaws frame 0 to look down +z, and translates frame 0 to the origin. Stored per-video; reusable across analyses.
- Object segmentation. Click any object in the source frame and enter a text label. SAM3 segments the first frame, writing detect.json (bbox + base64 RGBA mask) plus a frame-0 PNG mask. SAM2 can then propagate the track, writing track.json and per-frame masks/NNNNNN.png. Note: the click is used to disambiguate which segmentation result to track. Each detect-then-track pair lives in its own analysis folder <label>_<N> (e.g. chair_1), selectable from the UI.
- Box solve. Lift the tracked 2D mask into a 3D oriented bounding box. Two solvers:
  - Boxer (default): per-frame OBB from masked depth/pointmap. The Fuse toggle merges all frames' point clouds into one shared static box (more stable, slower).
  - WildDet3D: neural 3D detector with optional camera-intrinsics-prior and depth-prior toggles.
  Output: <analysis>/<solver>/boxes.json. Solvers can co-exist on one analysis run (output dirs are keyed by solver id).
- Object point cloud. Builds a dense object-only point cloud by unprojecting per-frame depth through the per-frame mask and concatenating across the tracked range. Streamed to the viewer as chunked .npz blobs.
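The object point cloud step can be pictured with a short NumPy sketch. It is not the repo's runner: it assumes a pinhole camera with a 3x3 intrinsics matrix K, depth measured along the camera z axis, and a 4x4 camera-to-world matrix per frame. The function names `unproject_masked_depth` and `object_cloud` are hypothetical.

```python
import numpy as np

def unproject_masked_depth(depth, mask, K, cam_to_world):
    """Lift masked depth pixels to world-space points (pinhole model, assumed)."""
    valid = mask.astype(bool) & (depth > 0)        # keep masked pixels with usable depth
    v, u = np.nonzero(valid)                       # v = pixel rows, u = pixel columns
    z = depth[v, u]
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * z                          # back-project through the intrinsics
    y = (v - cy) / fy * z
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # (N, 4) homogeneous
    return (pts_cam @ cam_to_world.T)[:, :3]       # camera space -> world space

def object_cloud(frames):
    """Concatenate per-frame clouds; frames yields (depth, mask, K, cam_to_world)."""
    return np.concatenate([unproject_masked_depth(*f) for f in frames], axis=0)
```

Calling `object_cloud` over the tracked range gives one (N, 3) array. Chunking it for streaming is a separate step that this sketch leaves out.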
The right side of the UI is a tab strip + viewport. Tabs are keyboard-navigable; arrow keys step frames within a tab (or jump to the nearest keyframe).
- Source — the raw video frame, overlaid with the current mask / bbox if an analysis is loaded.
- Depth — the active scene plugin's per-frame depth map, colourised and aligned to the source frame's resolution.
- 3D (Per-Frame) — Three.js viewport showing the current frame's depth lifted into a 3D mesh, plus camera path/frustums, solved boxes, etc.
- 3D (Scene) — global scene pointmap streamed in chunks (plus boxes, etc.).
- 3D (Object) — object-only point cloud built by unprojecting depth through the tracked masks.
Status / log output appears in a fixed bar at the bottom of the viewport and follows the latest pipeline run (scene prep, detect, track, box, object cloud).
A few plugins pull weights from gated Hugging Face repos. Skip this step if you don't need them — the rest of the setup still works. To request access, open the links below and fill out the form (approved by Meta, often within minutes):
- facebook/sam3 — required for object detection / tracking.
- facebook/VGGT-Omega — required for the VGGT-Omega scene plugin.
While you are waiting, generate a token at https://huggingface.co/settings/tokens (paste it into a text file until you have finished setup, as you only get to see it once on the website!). Then either provide it in the text field in the GUI installer (see below) or run hf auth login (do pip install -U huggingface_hub to get the hf commands).
Windows
Prerequisites
- Python 3.11+ on PATH. Setup scripts use only the stdlib until the venv exists.
- Node.js 18+ for the Vite dev server.
- Git (every external model is a pinned git clone).
- NVIDIA GPU + CUDA 12.4 driver for the included torch wheels.
- Visual Studio 2019/2022 Build Tools (optional) — only needed if you want CUT3R's curope CUDA extension to build. Skipping is fine; CUT3R falls back to a slower pure-Python RoPE.
Install
npm install
python setup/INSTALL.py # GUI (recommended)
Run the app
run_server.bat
macOS
Prerequisites
- Python 3.11+ on PATH.
- Node.js 18+.
- Git.
- Homebrew — only used to install COLMAP.
- No CUDA: macOS uses the default MPS torch wheels.
Install
npm install
python setup/INSTALL.py # GUI (recommended)
Run the app
bash run_server.sh
Linux
Prerequisites
- Python 3.11+ on PATH.
- Node.js 18+.
- Git.
- NVIDIA GPU + CUDA 12.4 driver for the included torch wheels.
- For the GUI installer: python3-tk on Debian/Ubuntu, or python3-tkinter on Fedora/RHEL. The headless installer doesn't need Tk.
- On Ubuntu 26.04, plugin_colmap.py auto-fixes a known libposelib packaging gap.
Install
npm install
python setup/INSTALL.py # python GUI installer
Run the app
bash run_server.sh
Installers run 00_venv.py first and then the requested
plugin_*.py scripts. Every script is idempotent — it skips
already-present artifacts — and supports --force to wipe and
reinstall. Re-running either installer is safe.
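As an illustration of the skip-if-present / --force behaviour, here is a minimal sketch of how such a step could be structured. It is not taken from the setup scripts: `ensure_artifact` and `parse_args` are hypothetical names, and the real scripts may check for artifacts differently.

```python
import argparse
import shutil
from pathlib import Path

def ensure_artifact(target: Path, build, force: bool = False) -> bool:
    """Run build(target) only when target is missing; return True if it ran."""
    if force and target.exists():
        # --force: wipe the existing artifact so the step starts from scratch
        if target.is_dir():
            shutil.rmtree(target)
        else:
            target.unlink()
    if target.exists():
        print(f"[skip] {target} already present")
        return False
    build(target)
    return True

def parse_args(argv=None):
    """Hypothetical CLI: a single --force flag, mirroring the README's description."""
    parser = argparse.ArgumentParser(description="hypothetical setup step")
    parser.add_argument("--force", action="store_true", help="wipe and reinstall")
    return parser.parse_args(argv)
```

Because the check is on the artifact itself and not on a separate "done" marker, an interrupted run that never produced the artifact is retried on the next invocation.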
GUI (setup/INSTALL.py) — a Tk window with one checkbox per setup step (venv, plugins). Optionally enter your HF auth token for gated models. Each script's output streams into the log pane.
The --force checkbox forwards --force to every selected script.
Headless (setup/EVERYTHING.py) — runs every plugin in order,
unconditionally. Use this when you can't run a GUI. Be aware that this consumes roughly 100 GB of disk space.
Per-component — if you only need a subset, run scripts piecewise:
python setup/00_venv.py # always first
python setup/plugin_colmap.py # if you want the COLMAP plugin
python setup/plugin_depthanythingv2.py # paired with COLMAP
python setup/plugin_cut3r.py
python setup/plugin_vggt.py
python setup/plugin_vggtomega.py # gated HF repo
python setup/plugin_da3.py
python setup/plugin_pi3.py
python setup/plugin_mapanything.py
python setup/plugin_worldmirror.py
python setup/plugin_worldmirror2.py
python setup/plugin_wilddet3d.py # provides scene AND box solver
python setup/plugin_infinidepth.py # depth refiner (post-process)
python setup/plugin_boxer.py
python setup/plugin_sam.py # required for detect/track (gated)
Everything except COLMAP (Windows zip / macOS Homebrew / Linux apt) and
the models/.venv/ itself lands under models/, which is gitignored.
The Vite dev server listens on port 4444. run_server.bat /
run_server.sh kill any existing listener on that port before starting
— don't invoke npm run dev directly, since strictPort: true makes a
stale listener fatal. Change the port in the script if you need a
different one.
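If you want to check for a stale listener yourself before starting, a probe along these lines works. It is only a sketch: `port_is_free` is a hypothetical helper, it looks at the IPv4 loopback address only, and it does not kill anything the way the run_server scripts do.

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is bound to host:port (IPv4 loopback by default)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))   # fails with OSError if the port is taken
        except OSError:
            return False
    return True

if __name__ == "__main__":
    print("port 4444 free:", port_is_free(4444))
```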
models/
  .venv/                  project venv (CUDA torch on Win, MPS on Mac)
  external/
    boxer/
    cut3r/
    depth-anything-3/
    hunyuanworld-mirror/
    hy-world-2.0/
    infinidepth/
    vggt/
    vggt-omega/
    wilddet3d/            cloned --recursive (sam3, lingbot_depth submodules)
  tools/
    colmap/               Windows only; macOS uses Homebrew's binary
  weights/
    sam2.1_l.pt
    sam3.pt
    infinidepth/depth/infinidepth.ckpt
HuggingFace-distributed weights live in the standard HF cache
(~/.cache/huggingface), not under models/.
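To find where those weights actually land on a given machine, the lookup can be sketched as below. This assumes the environment variables documented by huggingface_hub (HF_HUB_CACHE, HF_HOME, XDG_CACHE_HOME) and is not code from this repo; `hf_hub_cache_dir` is a hypothetical helper.

```python
import os
from pathlib import Path

def hf_hub_cache_dir(env=None) -> Path:
    """Resolve the Hugging Face hub cache directory from environment variables."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):                      # most specific override
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):                           # relocates the whole HF directory
        hf_home = Path(env["HF_HOME"])
    else:                                            # default: ~/.cache/huggingface
        hf_home = Path(env.get("XDG_CACHE_HOME") or "~/.cache") / "huggingface"
    return hf_home.expanduser() / "hub"

if __name__ == "__main__":
    print(hf_hub_cache_dir())
```

Model snapshots sit in the hub/ subfolder of that directory, so this is the place to look when checking disk usage.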
Pinned commits as of the last setup-script update. Every commit hash
below comes straight from setup/<name>.py.
COLMAP + DepthAnythingV2
- Native binary: COLMAP 4.0.3 (colmap-x64-windows-cuda.zip on Windows; brew install colmap on macOS).
- HF model: depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf.
- Python: pycolmap (latest pip wheel).
- Notes: COLMAP solves shared intrinsics + per-frame poses on downscaled frames (--max-size 1920, default --every 3). run_depth.py then RANSAC-fits a per-frame affine between DA2 depth and COLMAP sparse observations to recover a global metric scale and rescales cameras.json translations into meters.
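The RANSAC affine fit mentioned in the notes can be illustrated with a small stdlib-only sketch. This is not run_depth.py: the function name, the two-point sampling, the tolerance, and the fixed seed are all assumptions chosen for illustration. It fits `sparse ≈ a * depth + b` while ignoring outlier correspondences.

```python
import random

def ransac_affine(depth, sparse, iters=200, tol=0.05, seed=0):
    """Robustly fit sparse ≈ a * depth + b; returns (a, b, inlier_count)."""
    rng = random.Random(seed)                  # fixed seed keeps the sketch repeatable
    pairs = list(zip(depth, sparse))
    best = (1.0, 0.0, -1)
    for _ in range(iters):
        (d1, s1), (d2, s2) = rng.sample(pairs, 2)   # minimal sample: two correspondences
        if abs(d2 - d1) < 1e-9:
            continue                                # degenerate sample, slope undefined
        a = (s2 - s1) / (d2 - d1)
        b = s1 - a * d1
        inliers = sum(abs(a * d + b - s) <= tol for d, s in pairs)
        if inliers > best[2]:                       # keep the model with most support
            best = (a, b, inliers)
    return best
```

A production fit would usually re-estimate `a` and `b` by least squares over the winning inlier set. How the per-frame fits are combined into one global scale is not covered here.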
CUT3R
- Repo: CUT3R/CUT3R pinned to 8bc15dc92a6d7fd92920b4ec81540d3dec7d3ecf.
- Checkpoint: cut3r_512_dpt_4_64.pth fetched from CUT3R's Google Drive via gdown (Drive throttling occasionally needs a manual download; the script prints the URL and bails gracefully if so).
- Custom modifications applied by setup/plugin_cut3r.py:
  - Filters CUT3R's requirements.txt to skip its pinned torch / torchvision / numpy / pillow (we keep the CUDA-built torch from the base venv).
  - Applies two local patches in setup/patches/:
    - cut3r-fallback-rope-and-load.patch — adds a Python-only RoPE2D fallback path so CUT3R imports even without the curope CUDA extension, and tolerates modern PyTorch's stricter torch.load.
    - cut3r-curope-modern-pytorch.patch — fixes the bundled curope CUDA extension's setup.py for current PyTorch / CUDA toolkit headers.
  - Builds the curope CUDA extension in-place if both nvcc and a Windows vcvars64.bat are reachable. If either is missing, the build is skipped and CUT3R falls back to the patched Python RoPE.
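Several plugins install a filtered copy of the upstream requirements.txt so that the base venv's torch build is left alone. A minimal sketch of that kind of filter is shown below. The helper name and the name-parsing regex are illustrative assumptions, not the code in the setup scripts.

```python
import re

SKIP = {"torch", "torchvision", "numpy", "pillow"}

def filter_requirements(lines, skip=SKIP):
    """Drop requirement lines whose package name is in skip; keep everything else."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            kept.append(line)                  # blank lines and comments pass through
            continue
        # The package name ends at the first version operator, extra, marker or space.
        name = re.split(r"[\s<>=!~\[;@]", stripped, maxsplit=1)[0]
        if name.lower().replace("_", "-") not in skip:
            kept.append(line)
    return kept
```

For example, `filter_requirements(["torch==2.3.1", "einops", "Pillow"])` keeps only `"einops"`; the surviving lines can then be written to a temporary file and passed to `pip install -r`.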
VGGT
- Repo: facebookresearch/vggt pinned to 44b3afbd1869d8bde4894dd8ea1e293112dd5eba.
- HF model: facebook/VGGT-1B.
- Notes: Two-phase strategy in the upstream code — phase 1 is anchors-only (every 10th frame); phase 2 fills in spans via similarity alignment. The plugin runs phase 1 only (--anchors-only). Repo requirements.txt is installed filtered (skip torch/torchvision/numpy/pillow).
VGGT-Omega
- Repo: facebookresearch/vggt-omega pinned to 39a0cb8af88554f15ddcb5354cd52bde588fa014.
- HF model: facebook/VGGT-Omega — gated (request access, then hf auth login). The setup downloads only the non-text 512-resolution checkpoint (vggt_omega_1b_512.pt, ~4.58 GB); the 256-text variant is skipped since we only consume camera + depth here.
- Notes: Successor to VGGT-1B (CVPR 2026 Oral). Unlike VGGT-1B it ships plain .pt state dicts (no from_pretrained). Repo requirements.txt is installed filtered (skip torch/torchvision/numpy/pillow).
Depth-Anything-3
- Repo: ByteDance-Seed/Depth-Anything-3 pinned to 41736238f5bced4debf3f2a12375d2466874866d.
- HF models: depth-anything/DA3-LARGE-1.1 (pose + relative depth) and depth-anything/DA3METRIC-LARGE (metric depth). A scene-wide ratio reconciles the two so translations land in meters.
- Custom modifications applied by setup/plugin_da3.py (DA3's upstream requirements are not Python 3.13-friendly):
  - Skips open3d (no 3.13 wheel on PyPI; only viz paths use it), xformers (latest hard-requires torch >= 2.10 and would clobber our CUDA build), moviepy (DA3 imports moviepy.editor, removed in 2.x — explicitly pinned to moviepy==1.0.3), and pre-commit (dev tool only).
  - Installs addict manually — imported by depth_anything_3.model.da3 but missing from upstream pyproject / requirements.txt.
  - DA3's pyproject.toml pins requires-python = "<=3.13", which PEP 440 reads as excluding 3.13.x patch releases. We install with --ignore-requires-python --no-deps.
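The requires-python quirk is easier to see with a toy comparison. The sketch below is a simplified model of how PEP 440 orders final releases (missing segments are padded with zeros), written with the stdlib only. It is not the resolver pip uses, and both helper names are hypothetical.

```python
def release_tuple(version: str, width: int = 3):
    """Parse 'X.Y[.Z]' into a tuple, padding missing segments with zeros."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))

def satisfies_le(candidate: str, bound: str) -> bool:
    """True if candidate <= bound under zero-padded release comparison."""
    return release_tuple(candidate) <= release_tuple(bound)

assert satisfies_le("3.13.0", "3.13")        # 3.13.0 equals 3.13 after padding
assert not satisfies_le("3.13.1", "3.13")    # any later patch release is greater
```

So under `<=3.13` only 3.13.0 itself passes, which is why the install has to bypass the check.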
Pi3X
- Source: pip install git+https://github.com/yyfz/Pi3.git@b412c3bd236dfd7686f1e4b48004d5087f2fa093.
- HF model: yyfz233/Pi3X.
- Notes: Single feed-forward pass. The runner disables Pi3's multimodal conditioning branches (disable_multimodal()) to fit in 12 GB VRAM. Publishes both per-frame pointmaps and a global scene pointmap.
MapAnything
- Source: pip install git+https://github.com/facebookresearch/map-anything.git@f7ebafb4d8349776705aaa686cf928988d1bd7f4.
- HF model: facebook/map-anything.
- Notes: Memory-efficient inference path with minibatch_size=1 for 12 GB GPUs. Edge-aware masking applied to the scene pointmap.
HunyuanWorld-Mirror
- Repo: Tencent-Hunyuan/HunyuanWorld-Mirror pinned to b38bdd12e677f406788b1a56db5c3b4585f9ccd3.
- HF model: tencent/HunyuanWorld-Mirror.
- Custom modifications:
  - The setup script does not install the upstream requirements.txt — it pins torch==2.3.1 (would clobber our CUDA build) and depends on a CUDA build of gsplat that we do not ship.
  - run_worldmirror.py stubs gsplat at import time so the gaussian-splat head is unused. Pointmap + depth + cameras work without it.
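Stubbing a module at import time generally means registering a placeholder in sys.modules before the real import statement runs. The sketch below shows the general technique only. `stub_module` is a hypothetical helper, and the actual stub in run_worldmirror.py may define more than an empty module.

```python
import sys
import types

def stub_module(name: str) -> types.ModuleType:
    """Register an empty placeholder so that `import <name>` succeeds."""
    if name in sys.modules:
        return sys.modules[name]       # an already imported (or stubbed) module wins
    stub = types.ModuleType(name)
    stub.__doc__ = "placeholder registered by stub_module()"
    sys.modules[name] = stub           # later imports find this entry first
    return stub
```

An empty placeholder only satisfies a bare `import name`. Code that does `from name.sub import thing` also needs the submodule and attribute to be stubbed, and anything that really calls into the stub will fail at call time, which is acceptable only when that code path is never executed.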
HunyuanWorld-Mirror 2.0
- Repo: Tencent-Hunyuan/HY-World-2.0 pinned to 484e22020e7d7943eb199e31a00e10facf64c3d9.
- HF model: tencent/HY-World-2.0 (subfolder HY-WorldMirror-2.0).
- Custom modifications: same gsplat stub as v1, plus a flash_attn shim in run_worldmirror2.py that routes everything to PyTorch SDPA so the flash-attention build is unnecessary. Upstream requirements.txt is again skipped (Linux-only gsplat wheel + flash-attention).
WildDet3D (depth + K)
- Repo: allenai/WildDet3D pinned to 1768ffcd4c5e9bb1856d3f1a5b0b5e0498b89c97, cloned recursively (submodules third_party/sam3 and third_party/lingbot_depth).
- Checkpoint: wilddet3d_alldata_all_prompt_v1.0.pt from allenai/WildDet3D on HF.
- Custom modifications applied by setup/plugin_wilddet3d.py (this one's the worst — upstream requirements are heavily incompatible with Python 3.13):
  - utils3d in upstream requirements.txt resolves to the wrong PyPI package (Kalash Jain's utils3d, which has no .pt/.np submodules). WildDet3D's depth backend calls utils3d.pt.depth_map_to_point_map, which is from EasternJournalist's git-only utils3d. We filter the PyPI name out and install the git version pinned to commit 94d1037aabbce32dea9c07a7c4849525817a1615.
  - vis4d==1.0.0 is installed with --no-deps: its transitive deps (bdd100k, scalabel) pin matplotlib==3.5.3 / Shapely==1.8, neither of which has a 3.13 wheel and both fail to build from sdist. The inference path doesn't touch any of that. We then install the actual runtime deps (lightning, jsonargparse[signatures], pydantic>=2.0, cloudpickle, devtools, h5py) from WildDet3D's HF demo requirements.txt, which the upstream authors vetted as inference-only.
  - Submodule runtime deps installed explicitly: ftfy, regex, iopath, open_clip_torch, safetensors.
  - On Windows, installs triton-windows (registers itself as triton) since sam3.model.edt does a bare import triton and the Linux HF demo gets that for free with torch.
  - The runner builds the model with skip_pretrained=True, so SAM3 / LingBot pretrained weights are not needed — the WildDet3D checkpoint already contains them.
- Notes: Produces depth + predicted intrinsics per frame but no cross-frame pose solve (every camera pose is identity). Useful as a depth/K signal, not as a real reconstruction.
InfiniDepth (depth refiner)
- Repo: zju3dv/InfiniDepth pinned to 36c6e0c31887fafc210184ee43ca475230704095.
- HF model: ritianyu/InfiniDepth → models/weights/infinidepth/depth/infinidepth.ckpt.
- Notes: Not a standalone reconstruction. The runner consumes an upstream plugin's cameras.json + per-frame depth and feeds them through InfiniDepth's neural implicit field to produce a sharper / higher-res depth map. Pick the upstream source in the UI when running the plugin.
- Custom modifications applied by setup/plugin_infinidepth.py:
  - Filters upstream requirements.txt to skip torch/torchvision/torchaudio/numpy/pillow (CUDA build in the base venv), xformers (pins torch 2.9 and would clobber it), gsplat (only used by the Gaussian-Splatting inference path, which we don't run), open3d (no 3.13 wheel; viz-only), and spaces (HF Space SDK shim).
  - Explicitly pins moviepy==1.0.3 because InfiniDepth imports moviepy.editor, which 2.x dropped.
  - Skips MoGe-2 entirely: the runner always supplies override_gt_depth + intrinsics, so the lazy from moge.model.v2 import MoGeModel import inside moge_utils._get_moge2_model is never reached.
Boxer
- Repo: facebookresearch/boxer pinned to df474128a76ba42b05bc81feca7ac1a53fab41af.
- HF model: facebook/boxer (we pull three checkpoints into models/external/boxer/ckpts/):
  - boxernet_hw960in4x6d768-wssxpf9p.ckpt
  - dinov3_vits16plus_pretrain_lvd1689m-4057cbaa.pth
  - owlv2-base-patch16-ensemble.pt
- Custom modifications: skip Boxer's pyproject (uv-based, pins versions we already control). We install only dill (used by its checkpoint loader); torch + opencv + tqdm are already in the base venv.
- Notes: The runner rotates the world into Boxer's gravity convention (gravity = [0, 0, -1]) before inference and rotates results back. --fuse (UI: Fuse toggle) fuses all frames' masked pointclouds into one static box.
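The gravity-convention round trip can be sketched as a fixed rotation and its inverse. The sketch assumes the scene has already been aligned so that up is +y, meaning gravity points along -y. If the runner starts from a different frame the matrix would differ. The names `R_TO_BOXER`, `to_boxer`, and `from_boxer` are hypothetical.

```python
import numpy as np

# Assumption: input world has gravity along -y; target convention has gravity along -z.
# This is a +90 degree rotation about the x axis.
R_TO_BOXER = np.array([[1.0, 0.0, 0.0],
                       [0.0, 0.0, -1.0],
                       [0.0, 1.0, 0.0]])

def to_boxer(points):
    """Rotate (N, 3) row-vector points into the gravity = [0, 0, -1] frame."""
    return np.asarray(points, dtype=float) @ R_TO_BOXER.T

def from_boxer(points):
    """Inverse rotation: the transpose undoes an orthonormal rotation."""
    return np.asarray(points, dtype=float) @ R_TO_BOXER
```

Box centres and corner points go through `from_boxer` after inference; box orientations would need the same rotation applied to their axes.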
WildDet3D (box solver)
- Same wilddet3d checkout and checkpoint as the scene plugin.
- Notes: Runs on every ~10th frame and propagates to neighbouring frames using the nearest preceding keyframe. UI toggles Use Cameras (intrinsics prior) and Use Depth (depth prior) control whether the active scene plugin's K / depth are passed in.
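The keyframe propagation rule can be illustrated with a binary search over sorted keyframe indices. This is a sketch of the idea and not the plugin's code. The function names are hypothetical, and falling back to the first keyframe for frames that precede it is an added assumption.

```python
from bisect import bisect_right

def nearest_preceding_keyframe(frame: int, keyframes: list) -> int:
    """Return the largest keyframe <= frame (keyframes must be sorted ascending)."""
    i = bisect_right(keyframes, frame) - 1
    return keyframes[max(i, 0)]        # assumption: early frames reuse the first keyframe

def propagate(boxes_by_keyframe: dict, num_frames: int) -> list:
    """Give every frame the box solved at its nearest preceding keyframe."""
    keys = sorted(boxes_by_keyframe)
    return [boxes_by_keyframe[nearest_preceding_keyframe(f, keys)]
            for f in range(num_frames)]
```

With keyframes at 0, 10, and 20, frames 0 to 9 reuse the box from frame 0, frames 10 to 19 the box from frame 10, and so on.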
SAM3 (detect)
- Source: facebook/sam3 on HF → models/weights/sam3.pt. Gated repo: request access on the model page and hf auth login into the project venv before running setup/plugin_sam.py. See the Setup section.
- Loaded via: Ultralytics (ultralytics>=8.4.37).
- Inputs: frame image, click x/y, label.
- Output: detect.json (bbox + base64 RGBA mask) + frame0_mask.png.
SAM2 (track)
- Source: sam2.1_l.pt from the Ultralytics asset release → models/weights/sam2.1_l.pt.
- Loaded via: SAM2VideoPredictor (Ultralytics).
- Notes: Uses the bbox from detect.json, not the mask, due to a shape bug in Ultralytics' mask-prompt path. Output is track.json and masks/NNNNNN.png.
Every video gets an analysis/<video_stem>/ directory:
analysis/<video>/
  _scene/
    frames/NNNNNN.jpg               (extract_frames.py)
    frames.json                     (fps, frame count, source size)
    <plugin>/cameras.json           per-plugin poses + intrinsics
    <plugin>/depth/NNNNNN.npz       per-plugin per-frame depth
    <plugin>/pointmap/NNNNNN.npz    (optional, per-plugin)
    <plugin>/scene_pointmap.npz     (optional global pointmap)
    <plugin>.log / prepare.log
  <object_name>_<N>/                one per object analysis (e.g. chair_1)
    detect.json
    frame0_mask.png
    track.json
    masks/NNNNNN.png
    boxer/boxes.json                (optional)
    wilddet3d/boxes.json            (optional)
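For scripts that read these outputs directly, the layout maps onto pathlib fairly mechanically. The helpers below are a hypothetical sketch: they assume that NNNNNN means a six-digit zero-padded frame index and that every object analysis folder contains a detect.json. Neither helper exists in the repo.

```python
from pathlib import Path

def depth_path(root, video_stem: str, plugin: str, frame: int) -> Path:
    """Path of one per-frame depth file, assuming six-digit zero-padded names."""
    return Path(root) / video_stem / "_scene" / plugin / "depth" / f"{frame:06d}.npz"

def list_object_analyses(root, video_stem: str) -> list:
    """Names of object analysis folders (e.g. chair_1) for one video."""
    base = Path(root) / video_stem
    return sorted(
        p.name for p in base.iterdir()
        if p.is_dir() and p.name != "_scene" and (p / "detect.json").exists()
    )
```

For example, `depth_path("analysis", "clip", "vggt", 42)` points at analysis/clip/_scene/vggt/depth/000042.npz; whether frame numbering starts at 0 or 1 is not stated in the layout, so check an existing run first.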
Adding a new scene method is a single entry in
src/scenePlugins.ts plus one runner script under scripts/. Adding
a new box solver is the same shape but in src/boxSolverPlugins.ts.