- Introduction
- How to Use the Repository
- Repository Structure
- Overview of Overall Inference
- Text to Image
- Workflow 1: AI-Enhanced Image Editing Tools Specs
- Workflow 2: Smart Composition and 3D-Aware Object Insertion Specs
- Ethical Considerations
- Compute Profile
- References
- Team Members
The problem statement asks us to imagine how creative tools, especially Photoshop, will evolve by 2030 in a world where mobile devices and AI-assisted workflows dominate. Current editing tools are powerful but still depend heavily on manual operations, complex interfaces, and high computational resources. In contrast, the brief envisions a future where creators interact with images more naturally and effortlessly, using simple prompts, fluid gestures, and minimal hardware. The challenge is to identify gaps in today's creative ecosystem and propose how AI can fill them, making editing faster, more intuitive, and more context-aware. We are expected to deliver two workflows that demonstrate this shift: features that are not just "automated versions of existing tools", but genuinely rethink how editing should feel when powered by intelligent models. These workflows must be grounded in real user pain points, supported by a clear market rationale, and implemented using open-source AI models capable of region selection, generation, and inpainting. Overall, the problem asks us to blend user research, design thinking, and cutting-edge AI to build a prototype that reflects the creative experience of 2030: lightweight, intelligent, and human-centric.
Our solution is built around two complementary workflows that together represent the future of AI‑assisted, mobile‑friendly image editing. Before entering either workflow, the user can begin by uploading an image or generating one using our user‑style personalized LoRA, ensuring a highly customized starting point. From there, the system branches into two specialized pipelines designed to support different creative needs.
The first workflow focuses on intuitive, fine‑grained image editing using a suite of advanced open‑source AI tools. It includes LeDits++ for image‑to‑image transformation, enabling users to refine or restyle their images with high fidelity. For artistic transformations, we integrate a style‑transfer module that automatically selects the most appropriate style LoRA based on the user’s prompt and applies it seamlessly. Region‑level editing is supported through Segment Anything (SAM), which allows users to isolate any part of the image and then choose to erase it, inpaint new content, or manipulate it using Inpaint4Drag, a state‑of‑the‑art drag-based deformation model. Additionally, the workflow includes Lightning Drag, which enables users to adjust the direction or orientation an object is facing, and Generative Expand, an outpainting tool that extends scenes while preserving visual coherence. Together, these tools form an intelligent, flexible editing environment that reflects the natural, prompt‑driven editing experience envisioned for 2030.
The second workflow is designed for high‑quality object insertion and blending, enabling users to integrate new elements into a scene with realism and spatial coherence. The process begins with Smart Crop, which prepares and focuses the base image. The user then selects any object image to insert, and the system automatically removes its background, isolating the subject. This extracted object is passed through a 2D‑to‑3D generation model, which reconstructs a lightweight 3D representation that allows proper orientation, scaling, and positioning relative to the target image. Once the 3D orientation is finalized, the object is composited back into the scene. The blended result is then refined through a relighting model, ensuring that shadows, highlights, and color temperature align with the background. Finally, the combined and harmonized output is delivered, producing an integrated and realistic image with minimal user effort.
LUMOS/
├── backend/ # On-device backend modules and model inference code
├── cloud/ # Cloud pipelines, training scripts, and processing workflows
├── figma/ # Figma frames, wireframes, and design assets
├── frontend/ # Frontend UI code and application components
└── report/ # All project reports and documentation files
Final Deliverables:
| Submission | Path |
|---|---|
| A) Product Design | /figma, /report/Design Rationale |
| B) Editing Ecosystem | /report/Understanding the Editing Ecosystem |
| C) Execution | /backend, /frontend, /cloud, /report/Technical Report |
| D) Optional Creative Artefacts | /report/Creative Artefacts, /report/Decision Logs |
Follow the instructions in /frontend/README.md
Also, there is a live deployed link in /frontend/README.md for interacting with the UI (Frontend demo).
git clone https://github.com/team76adobe-design/lumos.git
cd lumos/backend
pip install huggingface-hub
hf auth login
NOTE: This HuggingFace Access Token has been generated explicitly for this repository. It is free-of-cost and safe-to-expose.
Separate virtual environments are required for running the different parts of the workflow. In total, eight virtual environments are used for the corresponding models:
1) Virtual Environment 1 - ledits, inpaint4drag, style transfer loras, background removal
python3.10 -m venv venv_1
source venv_1/bin/activate
pip install -r requirements1.txt
2) Virtual Environment 2 - lightningDrag, iopainttest, outpaint
python3.10 -m venv venv_2
source venv_2/bin/activate
pip install -r requirements2.txt
3) Virtual Environment 3 - smartcrop
python3.10 -m venv venv_3
source venv_3/bin/activate
pip install -r requirements3.txt
4) Virtual Environment 4 - Stable-Fast 3D
python3.10 -m venv venv_4
source venv_4/bin/activate
pip install -r requirements4.txt
5) Virtual Environment 5 - LBM Relighting Model, CLIP and Moondream
python3.10 -m venv venv_5
source venv_5/bin/activate
pip install -r requirements5.txt
6) Virtual Environment 6 - Sana 1.6B Text to Image
python3.10 -m venv venv_6
source venv_6/bin/activate
pip install -r requirements6.txt
7) Virtual Environment 7 - MagicQuill
cd magicquill
git submodule update --init --recursive
wget -O models.zip "https://hkustconnect-my.sharepoint.com/:u:/g/personal/zliucz_connect_ust_hk/EWlGF0WfawJIrJ1Hn85_-3gB0MtwImAnYeWXuleVQcukMg?e=Gcjugg&download=1"
unzip models.zip
python -m venv Magic_venv
source Magic_venv/bin/activate
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu118
pip install gradio_magicquill-0.0.1-py3-none-any.whl
#Change "torch==2.1.2", "torchvision==0.16.2" to "torch==2.2.0", "torchvision==0.17.0" in pyproject.toml
cp -f pyproject.toml MagicQuill/LLaVA/
pip install -e MagicQuill/LLaVA/
pip install -r requirements.txt
8) Virtual Environment 8 - Whisper
python3.10 -m venv venv_whisper
source venv_whisper/bin/activate
pip install -r requirements8.txt
Running the Models:
1. Sana 1.6B Text to Image
source venv_6/bin/activate
cd sana
uvicorn main:app --host 0.0.0.0 --port 8008
2. ledits
source venv_1/bin/activate
cd ledits
uvicorn main:app --host 0.0.0.0 --port 8002
3. inpaint4drag
source venv_1/bin/activate
cd inpaint4drag
uvicorn main:app --host 0.0.0.0 --port 8004
4. style transfer loras
Also, update the paths in main.py to match your local setup.
source venv_1/bin/activate
cd style_transfer_loras
gdown --fuzzy "https://drive.google.com/file/d/1ouAGb9GIv6hRhUzu8lAtxWXghL76e6VO/view?usp=sharing"
unzip sd1.5loras.zip
uvicorn main:app --host 0.0.0.0 --port 8002
5. background removal
source venv_1/bin/activate
cd background_removal
uvicorn main:app --host 0.0.0.0 --port 8002
6. lightningDrag
source venv_2/bin/activate
cd lightningDrag
python download.py
uvicorn main:app --host 0.0.0.0 --port 8004
7. inpaint
gdown --fuzzy "https://drive.google.com/file/d/1DaQyf1010x3pYG6yDaaQuLKWh39ZlWzU/view?usp=drive_link" -O sam/
source venv_2/bin/activate
cd inpaint
uvicorn main:app --host 0.0.0.0 --port 8002
8. outpaint
source venv_2/bin/activate
cd outpaint
uvicorn main:app --host 0.0.0.0 --port 8005
9. smartcrop
gdown --fuzzy "https://drive.google.com/file/d/1zxS4Qhm3gbfQUHxp097yz4ytB7EWsTfc/view?usp=sharing" -O backend/smartcrop/smartcrop_utils/
source venv_3/bin/activate
cd smartcrop
uvicorn main:app --host 0.0.0.0 --port 8006
10. Stable-Fast 3D
source venv_4/bin/activate
cd outpaint
uvicorn main:app --host 0.0.0.0 --port 8004
11. LBM Relighting Model
source venv_5/bin/activate
cd LBM
uvicorn main:app --host 0.0.0.0 --port 8003
12. clip and Moondream
source venv_5/bin/activate
cd clipNmoondream
uvicorn main:app --host 0.0.0.0 --port 8000
13. MagicQuill
cd magicquill
source Magic_venv/bin/activate
python main.py
14. InvisMark - for adding a watermark to the image
source venv_1/bin/activate
cd InvisMark
gdown --fuzzy "https://drive.google.com/file/d/1XslNWwvKAyclYrY6cTczqvWV9V9vilv5/view?usp=drive_link"
uvicorn main:app --host 0.0.0.0 --port 8013
15. Guard Rails
source venv_5/bin/activate
cd NSFW
uvicorn main:app --host 0.0.0.0 --port 8014
16. Whisper - for STT
source venv_whisper/bin/activate
cd whisper
python main.py
These are all local pipelines. The cloud pipelines can be set up in the same way from the code provided in CLOUD_FEATURES.
Our system is optimized for low latency, efficient GPU usage, and high-quality image outputs. Below is a detailed explanation of each major component in the inference pipeline.
To ensure fast response times and optimal GPU memory utilization, each model is loaded only when the user selects it. When a tool is activated, a POST /load_model request initializes the corresponding model into GPU memory. After the user provides input—either an image or a prompt—the system executes POST /run to perform inference. Once the user returns to the main menu, POST /unload is triggered, freeing GPU resources. This prevents unnecessary memory consumption, eliminates repetitive heavy loads, and significantly improves inference speed across the system.
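For illustration, here is a minimal client-side sketch of this load/run/unload lifecycle, assuming one tool service (e.g. the ledits backend on port 8002) and a simple base64 JSON payload; the request schema is an assumption, only the endpoint names come from the description above.

```python
import base64
import requests

# Hypothetical base URL of one tool service (e.g. the ledits backend on port 8002).
BASE = "http://localhost:8002"

def edit_with_tool(image_path: str, prompt: str) -> bytes:
    # 1) Load the model into GPU memory only when the user opens the tool.
    requests.post(f"{BASE}/load_model", timeout=300).raise_for_status()

    # 2) Run inference with the user's image and prompt.
    with open(image_path, "rb") as f:
        payload = {"image": base64.b64encode(f.read()).decode(), "prompt": prompt}
    result = requests.post(f"{BASE}/run", json=payload, timeout=300).json()

    # 3) Free GPU memory as soon as the user returns to the main menu.
    requests.post(f"{BASE}/unload", timeout=60).raise_for_status()
    return base64.b64decode(result["image"])

edited_png = edit_with_tool("selfie.png", "add sunglasses")
```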
The pipeline supports images up to 2K resolution, but for faster computation, inputs are first downsampled to 512×512. The core editing or generation task is performed at this reduced resolution to minimize latency. After processing, a diffusion-based upscaler reconstructs the output back to 2K by converting it into latent space, refining it, and producing a detailed, sharp high-resolution result. This approach balances speed with quality by performing heavy operations only where necessary.
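As a rough sketch of this resize-then-upscale strategy: the edit step below is a placeholder function, and the upscaling call uses the Stable Diffusion ×4 upscaler via diffusers; the checkpoint ID and prompt are assumptions.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionUpscalePipeline

def run_selected_tool(img: Image.Image) -> Image.Image:
    # Placeholder for whichever editing model the user activated (LEDITS++, inpainting, ...).
    return img

# Downsample the (up to 2K) input so the heavy editing step runs at 512x512.
original = Image.open("input_2048.png").convert("RGB")
low_res = original.resize((512, 512), Image.LANCZOS)
edited_low_res = run_selected_tool(low_res)

# Diffusion-based x4 upscaling reconstructs a sharp 2048x2048 output in latent space.
upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")
final = upscaler(prompt="high quality photo", image=edited_low_res).images[0]
final.save("output_2048.png")
```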
We incorporate optimized algorithms on top of SD 1.5, dramatically improving both inference speed and output accuracy. These enhancements allow SD 1.5 to operate more efficiently than standard implementations while maintaining strong visual fidelity. Because of its versatility, SD 1.5 is reused for more than 60% of all tasks, minimizing model-switch overhead and reducing GPU load, which results in smoother and faster operations throughout the system.
To further streamline the pipeline, models like Moondream and CLIP are reused across multiple subtasks. Moondream assists with lightweight VLM functions, while CLIP powers semantic reasoning, safety filtering, and classification. Reusing these models avoids repeated initializations, reduces memory fragmentation, and significantly accelerates workflows that rely on vision-language understanding or content validation.
We use openai/whisper-tiny, a lightweight automatic speech-recognition (ASR) model that converts spoken prompts directly into text across the interface. Whisper-tiny is optimized for speed, enabling real-time transcription on CPUs and mobile hardware. Despite its small size, it supports ~100 languages, handles multilingual speech, accents, and background noise, and produces reliable text outputs for voice-based commands and prompts. This makes voice interaction seamless across all windows of the application.
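A minimal transcription sketch using the Hugging Face transformers pipeline; the audio file name is a placeholder.

```python
from transformers import pipeline

# whisper-tiny is small enough for near-real-time CPU transcription of voice prompts.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Returns e.g. {"text": "add a warm sunset glow to the background"}.
spoken_prompt = asr("voice_prompt.wav")["text"]
print(spoken_prompt)
```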
The Sana 1.6B text-to-image pipeline integrates a DiT-based diffusion model with Nunchaku’s Singular Value Decomposition Quantization (SVDQuant)-compressed transformer to enable fast, memory-efficient generation without sacrificing image quality. In this setup, the original Sana denoiser is replaced by a low-rank INT4/FP8 Nunchaku transformer, significantly reducing VRAM usage while maintaining strong visual fidelity. The pipeline processes prompts through a text encoder, performs iterative denoising using the compressed DiT transformer, and finally decodes latents through a VAE to produce the final image. This combination allows Sana 1.6B to run smoothly even on mid-range GPUs, making it ideal for lightweight, high-performance API deployment.
- Inference Time - under 3 seconds for an image size of 2048 × 2048
- Memory Used - under 9 GB VRAM
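A minimal text-to-image sketch with the diffusers SanaPipeline is shown below; the checkpoint ID, resolution, and sampler settings are assumptions, and in the actual deployment the default denoiser is swapped for the Nunchaku SVD-quantized transformer described above.

```python
import torch
from diffusers import SanaPipeline

# Base Sana 1.6B pipeline; our deployment replaces the denoiser with the
# Nunchaku INT4/FP8 SVD-quantized transformer to cut VRAM usage.
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="a cozy reading nook with warm evening light, photorealistic",
    height=1024,
    width=1024,
    guidance_scale=4.5,
    num_inference_steps=20,
).images[0]
image.save("sana_output.png")
```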
Our FLUX.1-dev pipeline combines two powerful components—Nunchaku’s SVD-quantized 4-bit diffusion transformer and a personalized LoRA fine-tuning module—to deliver fast, memory-efficient, identity-aware image generation. During training, we fine-tune FLUX.1-dev using a lightweight LoRA (rank 16) on the UNet while keeping the text encoder frozen, allowing the model to learn a user’s identity from just 12 photos with high fidelity and minimal overfitting. Once personalization is complete, the LoRA module is merged into the quantized Nunchaku FLUX transformer, enabling inference in a low-VRAM environment without sacrificing detail, alignment, or identity consistency. The resulting system is capable of generating high-resolution, photorealistic, and identity-preserving images using simple prompts, while running 2× faster and using only ~30% of full-precision memory. This unified fine-tuning + 4-bit inference pipeline forms one of our core features—allowing rapid, personalized, and cost-efficient image generation on consumer GPUs.
- Training Time - 25 minutes on an A40
- Inference Time - 4 seconds
- Memory Used - 18 GB VRAM with INT4 Singular Value Decomposition Quantization, compared to roughly 50 GB VRAM for the original full-precision FLUX.1-dev
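A simplified inference-side sketch of loading a personalized LoRA onto FLUX.1-dev with diffusers; the LoRA path and prompt are placeholders, and the production path instead merges the LoRA into the Nunchaku 4-bit transformer.

```python
import torch
from diffusers import FluxPipeline

# Full-precision reference path; the deployed system uses the Nunchaku
# SVD-quantized 4-bit FLUX transformer with the same LoRA merged in.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("loras", weight_name="user_identity_lora.safetensors")  # rank-16 personalized LoRA
pipe.to("cuda")

image = pipe(
    prompt="studio portrait of the user as an astronaut, soft rim lighting",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("personalized.png")
```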
The system operates through a unified stroke-driven editing pipeline in which user-provided add, remove, and color strokes are first converted into structured masks that encode localized geometric and appearance cues. These masks, along with the original image, are supplied to a fine-tuned LLaVA-1.5 model that performs "Draw & Guess" inference to interpret the semantic intention behind the strokes, enabling the extraction of high-level intent even from abstract signals; for example, recognizing that subtle wavy strokes on a face are meant to introduce realistic wrinkles. The inferred prompt is then combined with the original image and the stroke-derived masks and passed into a fine-tuned Stable Diffusion 1.5 model augmented with inpainting and control branches. This diffusion-based editing module integrates structural (edge) and appearance (color) conditions to regenerate the modified regions while preserving unedited content, producing high-quality, semantically aligned outputs with fast inference and precise spatial control.
- Inference Time - 5 seconds
- Memory Used - 11 GB VRAM
| Input Image | Brush Strokes Used on Image | Auto VLM Generated Prompt | Output Image |
|---|---|---|---|
| ![]() | ![]() | deer | ![]() |
| ![]() | ![]() | Wrinkles | ![]() |
LEDITS++ first inverts the input image into the diffusion latent space using a fast, error-free DPM-Solver++ inversion, ensuring perfect reconstruction. The model then applies text-guided edit vectors that modify only the intended semantic regions, guided by implicit masks derived from attention and noise-difference maps. Finally, the edited latent is decoded back into an image, producing precise, localized changes without affecting the rest of the content.
- Inference Time - 6 seconds
- Memory Used - approximately 8.5–9 GB VRAM
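A hedged sketch of this invert-then-edit flow with the diffusers LEDITS++ pipeline; the checkpoint and edit parameters are assumptions.

```python
import torch
from PIL import Image
from diffusers import LEditsPPPipelineStableDiffusion

pipe = LEditsPPPipelineStableDiffusion.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = Image.open("portrait.png").convert("RGB").resize((512, 512))

# Step 1: invert the image into the diffusion latent space.
pipe.invert(image=image, num_inversion_steps=50, skip=0.1)

# Step 2: apply a text-guided edit vector; implicit masks keep changes local.
edited = pipe(
    editing_prompt=["sunglasses"],
    edit_guidance_scale=7.5,
    edit_threshold=0.75,
).images[0]
edited.save("portrait_sunglasses.png")
```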
| Input Image | Prompt | Output Image |
|---|---|---|
| ![]() | Add Sunglasses | ![]() |
| ![]() | Increase the fur, while adding a little blackish shade. Also make ears more stiff. Turn eyes to yellow. | ![]() |
Inpaint4Drag first takes a user-defined mask along with drag handles and target points to compute a precise pixel-space deformation of the selected region. The warped image is then analyzed to identify newly exposed or empty areas, which are passed to an inpainting model (such as SD 1.5) to synthesize missing content. This two-stage pipeline—deterministic geometric warping followed by targeted inpainting—enables accurate, high-resolution edits with real-time responsiveness.
- Inference Time - 7 seconds
- Memory Used - 3 GB VRAM
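The two-stage idea can be illustrated with a deliberately simplified sketch: a crude translation of the masked region followed by diffusion inpainting of the exposed hole. This is not the Inpaint4Drag bidirectional warp itself; the file names, drag offset, and inpainting checkpoint are assumptions.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

image = cv2.cvtColor(cv2.imread("scene.png"), cv2.COLOR_BGR2RGB)
mask = cv2.imread("region_mask.png", cv2.IMREAD_GRAYSCALE)  # 255 inside the dragged region
h, w = mask.shape
dx, dy = 40, 0  # handle-to-target drag: move the region 40 px to the right

# Stage 1: deterministic pixel-space shift of the selected region.
ys, xs = np.nonzero(mask > 127)
warped = image.copy()
warped[np.clip(ys + dy, 0, h - 1), np.clip(xs + dx, 0, w - 1)] = image[ys, xs]

# Pixels uncovered by the move (original footprint minus shifted footprint) need new content.
moved = np.zeros_like(mask)
moved[np.clip(ys + dy, 0, h - 1), np.clip(xs + dx, 0, w - 1)] = 255
hole = cv2.bitwise_and(mask, cv2.bitwise_not(moved))

# Stage 2: diffusion inpainting synthesizes the exposed background.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")
result = pipe(
    prompt="seamless background continuation",
    image=Image.fromarray(warped).resize((512, 512)),
    mask_image=Image.fromarray(hole).resize((512, 512)),
).images[0]
result.save("dragged.png")
```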
| Input Image | Output Image |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
LightningDrag takes the input image, user-defined handle–target point pairs, and an optional mask, and encodes them through a point-embedding network while preserving appearance features using a reference-based encoder. These embeddings condition a Stable Diffusion–based inpainting backbone, which generates the manipulated image by following point movements while keeping untouched regions intact. Through this conditional generation workflow, LightningDrag produces fast, accurate drag-based edits with strong structural consistency.
- Inference Time - 0.5 seconds
- Memory Used - 7 GB VRAM
| Input Image | Output Image |
|---|---|
| ![]() | ![]() |
SAM first encodes the input image into a high-dimensional feature map using a ViT-based image encoder. User prompts—such as positive points indicating what to include and negative points indicating what to exclude—are converted into prompt embeddings and fused with the image features in the mask decoder. The decoder then generates precise segmentation masks in real time, using the combination of positive and negative cues to accurately isolate the desired region without requiring any model retraining.
- Inference Time - 2 seconds
- Memory Used - 3 GB VRAM
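A minimal sketch of point-prompted segmentation with the official segment-anything package; the checkpoint file and point coordinates are placeholders.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint once and reuse the predictor for every selection.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("scene.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[420, 310], [600, 120]]),  # user taps on the image
    point_labels=np.array([1, 0]),  # 1 = include, 0 = exclude
    multimask_output=True,
)
best_mask = masks[scores.argmax()]  # boolean HxW mask of the selected region
```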
PowerPaint takes an input image, a user-defined mask, and a text instruction, and feeds them into a diffusion-based backbone enhanced with spatial and semantic conditioning. The model interprets the instruction to decide whether it should erase, inpaint, or outpaint, and then synthesizes the masked region while preserving global scene coherence. This unified pipeline enables consistent, high-quality region editing across multiple tasks without switching models.
- Inference Time - 7 seconds
- Memory Used - 9 GB VRAM
Erase + Inpaint
| Input Image | Output Image |
|---|---|
| ![]() | ![]() |
Generative Expand (Outpainting)
| Input Image | Output Image |
|---|---|
| ![]() | ![]() |
The ×4 upscaler first encodes the low-resolution input image into a compact latent representation using a pretrained VAE, reducing the problem to a lighter and more expressive latent space. A time-conditioned U-Net then performs diffusion-based denoising on these latents, guided through cross-attention layers that integrate text or other conditioning signals to refine structure and detail. After the latent has been fully denoised, the autoencoder decodes it back into pixel space, producing a high-resolution image with enhanced sharpness and fidelity.
- Inference Time - 2 seconds
- Memory Used - 4.5 GB VRAM
Our dynamic style-transfer system uses Stable Diffusion 1.5 together with a scalable semantic LoRA-selection engine to automatically choose the most suitable artistic style for any user input. When the user provides an image and a prompt, the Moondream-2 vision-language model first analyzes the image and enriches the user prompt based on the LoRA’s metadata, improving semantic clarity and artistic intent. The system then embeds this enhanced prompt into a SentenceTransformer index to retrieve the most relevant LoRA through cosine-similarity search. After selection, the LoRA’s trigger words are automatically appended to the refined prompt and the LoRA is injected on-the-fly into the Img2Img pipeline without reloading the base model. This design scales effortlessly to hundreds or thousands of LoRAs, adapts to changing artistic trends, and completely removes the need for users to manually choose styles—offering a fluid, intelligent, and highly adaptive style-transfer workflow.
- Inference Time - 8 seconds
- Memory Used - 11 GB VRAM
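A condensed sketch of the LoRA-selection step, assuming a small in-memory catalog of LoRA metadata; the catalog contents, embedder choice, and file paths are assumptions.

```python
import torch
from PIL import Image
from sentence_transformers import SentenceTransformer, util
from diffusers import StableDiffusionImg2ImgPipeline

lora_catalog = [  # in practice built from the metadata of hundreds of style LoRAs
    {"weight_name": "retro_game.safetensors", "trigger": "retro game art",
     "description": "pixelated retro video game character style"},
    {"weight_name": "watercolor.safetensors", "trigger": "watercolor painting",
     "description": "soft watercolor brush strokes and paper texture"},
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = embedder.encode([l["description"] for l in lora_catalog], convert_to_tensor=True)

enriched_prompt = "turn this man into a retro game character"  # Moondream-enriched prompt
query = embedder.encode(enriched_prompt, convert_to_tensor=True)
chosen = lora_catalog[int(util.cos_sim(query, corpus).argmax())]  # cosine-similarity pick

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("loras", weight_name=chosen["weight_name"])  # injected on the fly

styled = pipe(
    prompt=f"{enriched_prompt}, {chosen['trigger']}",  # trigger words appended automatically
    image=Image.open("portrait.png").convert("RGB").resize((512, 512)),
    strength=0.6,
).images[0]
```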
| Input Image | Prompt | LoRA selected | Output Image |
|---|---|---|---|
| ![]() | Turn this Man into a retro game character | Retro Game Art | ![]() |
A2-RL begins by extracting visual features from the input image and initializes a cropping window that the reinforcement learning agent iteratively adjusts through predefined actions such as scaling, translating, and reshaping. At each step, the agent evaluates the aesthetics-aware reward function to guide the crop toward a more pleasing composition, continuing until the termination action signals that the optimal crop has been found. This sequential decision-making pipeline enables fast, intelligent, and high-quality cropping without exhaustive search.
- Inference Time - 5 seconds
- Memory Used - 8 GB VRAM
| Input Image | Output Image |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
RMBG-2.0 performs high-resolution background removal by leveraging the BiRefNet architecture, which combines global semantic understanding with fine-detail reconstruction. The model first uses a transformer-based Localization Module to generate a coarse foreground map that captures object structure even in cluttered scenes. This is then refined by the bilateral-reference Reconstruction Module, which integrates multi-scale contextual patches with gradient-based edge cues to recover fine contours, hair strands, soft boundaries, and intricate textures. Supported by an auxiliary edge-aware loss, this two-stage pipeline produces sharp, production-grade foreground masks that remain accurate even on large, complex images.
- Inference Time - 7 seconds
- Memory Used - 11 GB VRAM
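A minimal background-removal sketch following the common usage pattern for briaai/RMBG-2.0 on Hugging Face; the preprocessing size and normalization values are assumptions based on that pattern.

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModelForImageSegmentation

model = AutoModelForImageSegmentation.from_pretrained(
    "briaai/RMBG-2.0", trust_remote_code=True
).eval()

preprocess = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open("object.png").convert("RGB")
with torch.no_grad():
    pred = model(preprocess(image).unsqueeze(0))[-1].sigmoid().cpu()[0]

# Use the predicted foreground matte as an alpha channel.
alpha = transforms.ToPILImage()(pred.squeeze()).resize(image.size)
image.putalpha(alpha)
image.save("object_rgba.png")
```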
| Input Image | Output Image |
|---|---|
| ![]() | ![]() |
Stable Fast 3D reconstructs a complete, textured 3D mesh from a single image through a feed-forward pipeline that predicts geometry, materials, and textures in one pass. A transformer-based network first infers 3D structure, surface normals, and material parameters while generating a latent representation for texture synthesis. The system then performs illumination disentanglement to remove baked-in lighting, producing clean albedo textures suitable for realistic relighting. A differentiable mesh extraction stage generates the 3D shape, followed by efficient UV unwrapping and texture baking to produce a high-quality texture atlas rather than simple vertex colors. Finally, SF3D outputs complete material maps and normal maps, enabling accurate shading under novel lighting. This unified pipeline produces production-ready 3D assets in ~0.5 seconds, outperforming traditional multi-view or optimization-heavy methods in both speed and fidelity.
- Inference Time - 5 seconds
- Memory Used - 6 GB VRAM
Output from several angles of rotation:
LBM Relighting performs illumination transfer in a single step by mapping the input image through a learned latent-space transformation that preserves geometry while modifying lighting. The image is first encoded into a VAE latent, where a stochastic latent “bridge” is defined between the source and target illumination conditions, conditioned on parameters such as light direction, strength, or environment maps. A neural denoiser (U-Net) is trained to approximate the drift along this bridge, enabling a direct, one-shot conversion of the source latent into a target latent without iterative diffusion. Once decoded, the output exhibits realistic changes in shading, highlights, shadows, and global illumination while maintaining object boundaries and scene structure. This latent-transport pipeline offers relighting quality comparable to multi-step diffusion models but at a fraction of the computational cost, making it ideal for real-time and interactive workflows.
- Inference Time - 5 seconds
- Memory Used - 10 GB VRAM
| Input Image | Output Image |
|---|---|
| ![]() | ![]() |
The Hybrid CLIP + Moondream2 system performs intelligent photographic defect analysis by first using CLIP to embed the input image and compare it against a curated vocabulary of 101 defect descriptions, ranking the top three most likely issues through cosine similarity. These defect candidates, along with the image, are then passed to the Moondream2 vision-language model, which verifies whether each defect is genuinely present and provides a grounded, human-readable explanation based on its visual reasoning. This combined retrieval-and-verification pipeline delivers fast, scalable, and highly reliable defect detection—catching lighting issues, blur, distortions, color problems, and AI artifacts—while ensuring that every prediction is context-aware, interpretable, and accurate.
- Inference Time - 9 seconds
- Memory Used - 4 GB VRAM
InvisMark embeds an invisible 256-bit watermark by passing the input image through a neural encoder that adds a subtle, imperceptible residual into the pixel space. During training, the embedded image is routed through a robustness module that applies real-world distortions—such as JPEG compression, noise, blur, cropping, and color shifts—to ensure the watermark remains stable under common manipulations. A paired neural decoder is then used to reliably extract the watermark from the distorted outputs, while the loss function jointly optimizes perceptual similarity and extraction accuracy. This feed-forward encode–distort–decode pipeline enables high-capacity, invisible watermarking that remains intact even after aggressive editing or compression.
Using a CLIP-based similarity system as guard rails provides a fast, lightweight, and highly adaptable way to detect harmful or sensitive content across a very broad risk spectrum. While many existing moderation approaches—especially classical detectors and even several commercial VLM-based filters—perform strongly mainly on sexual or nudity-related content, they often miss non-sexual harms such as violence, weapons, extremism, drugs, or psychologically disturbing scenes. In contrast, CLIP embeds both images and a rich taxonomy of safety labels into the same semantic space, enabling it to surface the top 3 highest-scoring NSFW categories and mark them as violations when their similarity exceeds a defined threshold, creating a precise and transparent rule-based moderation pipeline. This makes detection fast (10–50× faster than VLM captioning), far more GPU-efficient, and fully controllable—developers can easily add, remove, or tune categories to match policy requirements. Overall, this CLIP-driven strategy offers a significantly broader, more interpretable, and more scalable guard-rails system compared to methods that specialize only in sexual content moderation.
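A simplified sketch of the thresholded top-3 check is shown below; the label taxonomy excerpt, CLIP checkpoint, and threshold value are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

labels = [  # excerpt of a much larger safety taxonomy
    "graphic violence", "weapons or firearms", "drug use",
    "sexual or nude content", "extremist symbols", "a safe everyday scene",
]

inputs = processor(text=labels, images=Image.open("upload.png"),
                   return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

# Surface the top-3 categories and flag any that exceed the policy threshold.
top = probs.topk(3)
violations = [(labels[i], float(p)) for p, i in zip(top.values, top.indices)
              if float(p) > 0.35 and labels[i] != "a safe everyday scene"]
print(violations or "image passes guard rails")
```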
In 2030, users will be far more sensitive about where their personal images are processed, which is why the core Lumos image-editing workflows run fully on-device, avoiding continuous cloud dependence and preventing routine uploads of private photos. However, for users who choose to personalize the model, we provide an optional secure cloud-based LoRA training service. Only the images the user explicitly selects are uploaded through an encrypted channel, processed inside an isolated training container, and permanently deleted immediately after the LoRA weights are produced. The resulting personalized LoRA is the only artifact returned to the user, and no images, metadata, or embeddings are stored or reused for any secondary purpose. This design balances strong privacy with the ability to learn a user’s preferred style—ensuring personalization remains powerful, controlled, and secure.
All experiments were conducted on a range of mid-tier GPUs on RunPod, while RunPod NVIDIA A40 GPUs were used for training the personalized LoRA models.
- Black Forest Labs. (2024). FLUX.1-dev. Hugging Face. https://huggingface.co/black-forest-labs/FLUX.1-dev
- Boss, M., Huang, Z., Vasishta, A., & Jampani, V. (2024). SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement. arXiv:2408.00653 [cs.CV]. https://arxiv.org/abs/2408.00653
- Brack, M., Friedrich, F., Kornmeier, K., Tsaban, L., Schramowski, P., Kersting, K., & Passos, A. (2023). LEDITS++: Limitless Image Editing using Text-to-Image Models. arXiv:2311.16711 [cs.CV]. https://arxiv.org/abs/2311.16711
- BRIA AI. (2024). BRIA RMBG-2.0: Background Removal Model. Hugging Face. https://huggingface.co/briaai/RMBG-2.0
- Chadebec, C., Tasar, O., Sreetharan, S., & Aubin, B. (2025). LBM: Latent Bridge Matching for Fast Image-to-Image Translation. arXiv:2503.07535 [cs.CV]. https://arxiv.org/abs/2503.07535
- CivitAI Community. (2024). LoRA Models Database. https://civitai.com
- Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., & Girshick, R. (2023). Segment Anything. arXiv:2304.02643 [cs.CV]. https://arxiv.org/abs/2304.02643
- Li, D., Wu, H., Zhang, J., & Huang, K. (2017). A2-RL: Aesthetics Aware Reinforcement Learning for Image Cropping. arXiv:1709.04595 [cs.CV]. https://arxiv.org/abs/1709.04595
- Li, M., Lin, Y., Zhang, Z., Cai, T., Li, X., Guo, J., Xie, E., Meng, C., Zhu, J.-Y., & Han, S. (2024). SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models. arXiv:2411.05007 [cs.CV]. https://arxiv.org/abs/2411.05007
- Lu, J., & Han, K. (2025). Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping. arXiv:2509.04582 [cs.CV]. https://arxiv.org/abs/2509.04582
- MIT HAN Lab & Nunchaku Team. (2024). Nunchaku: 4-Bit Diffusion Model Inference. GitHub. https://github.com/nunchaku-tech/nunchaku
- Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV]. https://arxiv.org/abs/2103.00020
- Sanster. (2024). PowerPaint v2: High-Quality Inpainting and Outpainting. Hugging Face. https://huggingface.co/Sanster/PowerPaint_v2
- Shi, Y., Liew, J. H., Yan, H., Tan, V. Y. F., & Feng, J. (2024). LightningDrag: Lightning Fast and Accurate Drag-based Image Editing Emerging from Videos. arXiv:2405.13722 [cs.CV]. https://arxiv.org/abs/2405.13722
- Stability AI. (2022). Stable Diffusion x4 Upscaler. Hugging Face. https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler
- Liu, Z., Yu, Y., Ouyang, H., Wang, Q., Cheng, K. L., Wang, W., Liu, Z., Chen, Q., & Shen, Y. (2024). MagicQuill: An Intelligent Interactive Image Editing System. arXiv:2411.09703 [cs.CV]. https://arxiv.org/abs/2411.09703
- Vikhyat, K. (2024). Moondream2: A Tiny Vision Language Model. Hugging Face. https://huggingface.co/vikhyatk/moondream2
- Xie, E., Chen, J., Chen, J., Cai, H., Tang, H., Lin, Y., Zhang, Z., Li, M., Zhu, L., Lu, Y., & Han, S. (2024). SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers. arXiv:2410.10629 [cs.CV]. https://arxiv.org/abs/2410.10629
- Xu, R., Hu, M., Lei, D., Li, Y., Lowe, D., Gorevski, A., Wang, M., Ching, E., & Deng, A. (2024). InvisMark: Invisible and Robust Watermarking for AI-generated Image Provenance. arXiv:2411.07795 [cs.CV]. https://arxiv.org/abs/2411.07795
- Shreeyut Maheshwari
- Khush Kumar Singh
- Ishita Saxena
- Niyati Mishra
- Garv Jain
- Divanshi Mehta




































