oindrilasaha/SIGMA-Gen-Code

SIGMA-GEN: Structure and Identity Guided Multi-subject Assembly for Image Generation

Oindrila Saha¹, Vojtech Krs², Radomir Mech²,
Subhransu Maji¹, Kevin Blackburn-Matzen²*, Matheus Gadelha²*

¹ University of Massachusetts Amherst     ² Adobe Research

* denotes equal advising

arXiv | Project Page | Demo | Model Weights

[Teaser video: sigmagen_teaser_linkedin.mp4]

Overview

SIGMA-Gen enables multi-subject image generation and editing in a single forward pass. Given reference images of the subjects, spatial masks indicating where each subject should appear, optional depth (precise or box-level), and a text prompt, SIGMA-Gen composes them into a coherent scene while preserving each subject's identity.

This code supports two modes:

  • Generation — compose subjects into a new scene from scratch
  • Editing / Insertion — insert subjects into an existing image with latent blending for seamless background preservation

Results

| Multi-Control Generation with 2D box, 3D box, and precise depth combined in a single generation | Subject Insertion with precise mask and depth |
| --- | --- |
| "a bowl, a can and a toy on a table" | "a woman parasailing over a yacht" |

Getting Started

Requirements

  • Python 3.10+
  • CUDA-capable GPU (32GB+ VRAM recommended)
  • Access to FLUX.1-Kontext-dev (gated model — accept the license on HuggingFace)
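
Before installing, it can help to confirm that the machine meets the requirements above. The following is a minimal sketch, not part of the repository: the helper names `python_ok` and `gpu_memory_gib` are hypothetical, and the GPU check only works once `torch` is installed.

```python
import sys


def python_ok(version_info=sys.version_info, minimum=(3, 10)):
    """Return True if the interpreter is at least the required version."""
    return tuple(version_info[:2]) >= minimum


def gpu_memory_gib():
    """Return total memory of GPU 0 in GiB, or None if torch or CUDA is unavailable."""
    try:
        import torch  # imported lazily so the check also runs before installation
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 2**30


if __name__ == "__main__":
    print("Python >= 3.10:", python_ok())
    mem = gpu_memory_gib()
    print("GPU memory (GiB):", "not detected" if mem is None else f"{mem:.1f}")
```

A reported value well below 32 suggests the pipeline may run out of memory at the recommended settings.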

Installation

pip install torch torchvision
pip install "diffusers>=0.35.0" transformers peft accelerate
pip install scipy numpy Pillow

Model Weights

LoRA weights are hosted on HuggingFace and downloaded automatically on first run:

| Repository | Description |
| --- | --- |
| oindrila13saha/sigma-gen-lora | Dual LoRA adapters (identity + spatial conditioning) |
| black-forest-labs/FLUX.1-Kontext-dev | Base model (downloaded automatically) |
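
If the first run should not depend on network access, the two repositories can be fetched ahead of time. This is a sketch under assumptions: it uses `snapshot_download` from the `huggingface_hub` package, the `prefetch` helper is hypothetical, and it assumes `run.py` resolves weights through the default Hugging Face cache. The gated base model also requires an accepted license and prior authentication (for example via `huggingface-cli login`).

```python
# Repository IDs are taken from the table above.
REPOS = [
    "oindrila13saha/sigma-gen-lora",
    "black-forest-labs/FLUX.1-Kontext-dev",  # gated: needs an accepted license and a token
]


def prefetch(repos=REPOS):
    """Download each repository into the local Hugging Face cache and return the local paths."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return [snapshot_download(repo_id=repo) for repo in repos]
```

Calling `prefetch()` downloads every file in both repositories, which is considerably larger than the LoRA adapters alone.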

Usage

run.py — CLI Inference

run.py reads a folder of inputs and generates an image.

python run.py \
  --folder examples/multi_control \
  --prompt "a bowl, a can and a toy on a table" \
  --out outputs/multi_control.png
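
To process several example folders in one go, the command above can be wrapped in a short script. This is a minimal sketch that relies only on the three flags shown above (`--folder`, `--prompt`, `--out`); the `JOBS` list, the helper names, and the output naming scheme are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical job list of (input folder, prompt); extend it with your own examples.
JOBS = [
    ("examples/multi_control", "a bowl, a can and a toy on a table"),
]


def build_command(folder, prompt, out_dir="outputs"):
    """Build the run.py invocation for one example folder."""
    out_path = Path(out_dir) / f"{Path(folder).name}.png"
    return [
        sys.executable, "run.py",
        "--folder", str(folder),
        "--prompt", prompt,
        "--out", out_path.as_posix(),
    ]


def run_all(jobs=JOBS, out_dir="outputs"):
    """Run every job in sequence from the repository root."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for folder, prompt in jobs:
        # check=True stops the batch at the first failing run.
        subprocess.run(build_command(folder, prompt, out_dir), check=True)
```

Calling `run_all()` from the repository root writes one image per job, named after its input folder.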

Input Folder Format

Each example is a folder with the following files:

example_folder/
  generated_nobg_0.png        # Subject 0 (RGBA, background removed)
  generated_nobg_1.png        # Subject 1
  ...
  mask_object_0.png           # Binary mask for subject 0 (white = subject region)
  mask_object_1.png           # Binary mask for subject 1
  ...
  source.png                  # (Optional) Background image for edit/insertion mode
  depth_precise.png           # (Optional) Depth map for precise regions
  depth_box.png               # (Optional) Depth map for box regions

File Descriptions

| File | Required | Description |
| --- | --- | --- |
| generated_nobg_*.png | Yes | Subject reference images with background removed (RGBA). Numbered starting from 0. |
| mask_object_*.png | Yes | Per-subject binary masks (white on black). Each mask defines where the corresponding subject should be placed. Numbered to match subjects. |
| source.png | No | Source image for insertion/editing. When present, the pipeline uses latent blending to preserve unmasked regions. When absent, a new image is generated from scratch. |
| depth_precise.png | No | Depth information for precise object regions. |
| depth_box.png | No | Depth information for bounding box regions. |
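
The naming rules above can be checked before launching a run. The sketch below is not part of the repository: `inspect_example` is a hypothetical helper, and the requirement that indices run from 0 without gaps is an assumption based on the "numbered starting from 0" convention.

```python
import re
from pathlib import Path

SUBJECT_RE = re.compile(r"generated_nobg_(\d+)\.png")
MASK_RE = re.compile(r"mask_object_(\d+)\.png")


def inspect_example(folder):
    """Summarise an input folder; raise ValueError if subjects and masks do not pair up."""
    names = [p.name for p in Path(folder).iterdir()]
    subjects = sorted(int(m.group(1)) for n in names if (m := SUBJECT_RE.fullmatch(n)))
    masks = sorted(int(m.group(1)) for n in names if (m := MASK_RE.fullmatch(n)))
    if not subjects:
        raise ValueError("no generated_nobg_*.png files found")
    if subjects != masks:
        raise ValueError(f"subject indices {subjects} do not match mask indices {masks}")
    # Assumption: indices are contiguous and start at 0.
    if subjects != list(range(len(subjects))):
        raise ValueError(f"indices must run 0..N-1 without gaps, got {subjects}")
    return {
        "num_subjects": len(subjects),
        # source.png selects edit/insertion mode; otherwise generation from scratch.
        "mode": "edit" if "source.png" in names else "generation",
        "depth_precise": "depth_precise.png" in names,
        "depth_box": "depth_box.png" in names,
    }
```

For a folder with two subjects, two masks, and a `source.png`, the returned summary reports two subjects in `"edit"` mode with both depth flags set to `False`.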

How It Works

  1. Subject tiles — Each generated_nobg_*.png is cropped to its content, resized to fit a 512x512 tile, and encoded as a separate identity condition with a unique subject_id.

  2. Spatial condition — The masks and optional depth maps are combined into an RGB condition image:

    • R channel: Indexed mask (each subject gets a unique intensity value)
    • G channel: depth_precise (depth for precise object silhouettes)
    • B channel: depth_box (depth for bounding box regions)
  3. Dual LoRA — Two LoRA adapters work together:

    • cond1: Identity conditioning (encodes subject appearance)
    • cond2: Spatial conditioning (encodes layout and structure)
  4. Edit mode (when source.png is present) — The source image is noise-encoded at each timestep, and a blend mask ensures the background is preserved while subjects are seamlessly inserted.
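
Steps 2 and 4 above can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: the function names are hypothetical, images are plain nested lists rather than tensors, the evenly spaced subject intensities are an assumption (the text only says each subject gets a unique value), and the blend is the standard masked interpolation rather than a confirmed copy of the pipeline's logic.

```python
def pack_spatial_condition(masks, depth_precise=None, depth_box=None):
    """Combine per-subject binary masks and optional depth maps into RGB pixels.

    masks: list of H x W grids (lists of rows), truthy inside the subject region.
    depth_*: optional H x W grids of 0-255 values; a missing map becomes all zeros.
    Returns an H x W grid of (R, G, B) tuples.
    """
    height, width = len(masks[0]), len(masks[0][0])
    # Assumption: intensities are evenly spaced; the released code may differ.
    step = 255 // len(masks)
    image = []
    for y in range(height):
        row = []
        for x in range(width):
            r = 0  # 0 marks background
            for i, mask in enumerate(masks):
                if mask[y][x]:
                    r = (i + 1) * step  # later subjects overwrite earlier ones
            g = depth_precise[y][x] if depth_precise is not None else 0
            b = depth_box[y][x] if depth_box is not None else 0
            row.append((r, g, b))
        image.append(row)
    return image


def blend_latents(generated, noised_source, blend_mask):
    """Keep generated values where blend_mask is 1 and source values where it is 0."""
    return [
        [m * g + (1 - m) * s for g, s, m in zip(g_row, s_row, m_row)]
        for g_row, s_row, m_row in zip(generated, noised_source, blend_mask)
    ]
```

With two subjects, the sketch assigns R values of 127 and 254. In edit mode, applying `blend_latents` at every denoising step is what keeps the unmasked background tied to the source image.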


Interactive Demo

A Gradio-based interactive demo is also included and hosted on HuggingFace Spaces:

Try the live demo

To run the demo locally:

# Install additional demo dependencies
pip install "gradio>=5.0.0" ./custom_components/dist/gradio_maskeditor-0.0.1-py3-none-any.whl

# Launch
python gradio_demo.py

The demo opens at http://localhost:7860 and supports:

  • Drawing subject regions on a canvas with color-coded rectangles (Red = Subject 1, Green = Subject 2, etc.)
  • Automatic background removal on uploaded subject images
  • Auto-detected mode — upload a background image to edit, or leave blank to generate from scratch
  • Pre-loaded examples for quick experimentation

Acknowledgements

The pipeline code is adapted from OminiControl.


Citation

If you find our work useful, please cite:

@article{saha2025sigma,
  title={SIGMA-GEN: Structure and Identity Guided Multi-subject Assembly for Image Generation},
  author={Saha, Oindrila and Krs, Vojtech and Mech, Radomir and Maji, Subhransu and Blackburn-Matzen, Kevin and Gadelha, Matheus},
  journal={arXiv preprint arXiv:2510.06469},
  year={2025}
}
