This paper presents Unify-Agent, an end-to-end unified multimodal agent for world-grounded image synthesis. Unlike conventional text-to-image models that rely only on fixed parametric memory, Unify-Agent can actively access external world knowledge at inference time, enabling more faithful generation of real people, cultural symbols, rare IPs, historical scenes, scientific concepts, and other long-tail entities.
The core challenge of factual image generation is not just producing visually plausible images, but correctly capturing the target's identity-defining visual attributes. Existing agentic systems usually connect retrieval, reasoning, and generation through loosely coupled pipelines, making it difficult to effectively transform external evidence into accurate visual guidance.
To address this, Unify-Agent unifies four capabilities in a single model:
- THINK: understand the prompt and identify missing knowledge.
- RESEARCH: retrieve relevant textual and visual evidence.
- RECAPTION: convert retrieved knowledge into structured generation guidance.
- GENERATE: synthesize the final grounded image.
A key insight is that unifying understanding and generation improves both. By combining high-level semantic representations with low-level generative priors, Unify-Agent can better interpret retrieved references and produce images that are more faithful to real-world knowledge.
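As an illustration only, the THINK → RESEARCH → RECAPTION → GENERATE loop can be sketched in a few lines of Python. Every name below (`run_agent`, `Evidence`, `search_fn`, the title-case gap heuristic) is a toy assumption for exposition, not Unify-Agent's actual interface:

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    """Container for retrieved multimodal evidence (illustrative only)."""
    texts: list = field(default_factory=list)
    images: list = field(default_factory=list)

def run_agent(prompt, search_fn):
    """Sketch of the THINK -> RESEARCH -> RECAPTION -> GENERATE loop.

    `search_fn` stands in for the model's retrieval tools; all names
    here are hypothetical, not the real Unify-Agent API.
    """
    # THINK: identify knowledge gaps in the prompt (toy heuristic:
    # treat title-cased tokens as candidate named entities).
    gaps = [tok for tok in prompt.split() if tok.istitle()]
    # RESEARCH: retrieve textual and visual evidence for each gap.
    evidence = Evidence()
    for gap in gaps:
        texts, images = search_fn(gap)
        evidence.texts.extend(texts)
        evidence.images.extend(images)
    # RECAPTION: fuse evidence into a structured generation caption.
    recaption = prompt + " | facts: " + "; ".join(evidence.texts)
    # GENERATE: the recaption and reference images would condition
    # the image generator; we return them instead of synthesizing.
    return recaption, evidence.images
```

In the real system these four stages share one set of model weights, which is the point of the unification argument above.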
To evaluate this setting, the paper introduces FactIP, a benchmark covering 12 categories focused on rare identities and long-tail concepts. Experiments show that Unify-Agent significantly improves factual visual synthesis, outperforming its base model and strong open-source baselines across FactIP, WiSE, KiTTEN, and T2I-FactualBench.
This work highlights a new paradigm for text-to-image generation: moving from closed-book generation to open-book, agentic generation, where models actively reason over external knowledge before synthesis.
This repository provides everything needed to reproduce, fine-tune, and evaluate Unify-Agent:
| Component | Path | Description |
|---|---|---|
| SFT Training | `SFT/` | Supervised fine-tuning scripts with FSDP, W&B logging, and multi-node support |
| Inference | `infer/` | Multi-turn agentic inference with web search, image judging, recaption, and generation |
| Data Pipeline | `data_pipeline/` | Three-stage pipeline to construct training trajectories (prompt → search trajectory → image) |
| Evaluation | `eval/` | FactIP benchmark evaluation: generate, score with MLLM judge, and aggregate results |
| Model | `csfufu/Unify-Agent` | Pre-trained checkpoint on HuggingFace |
| Benchmark | `csfufu/FactIP` | 2,462 knowledge-intensive prompts across 12 categories |
```
┌───────────────────────────────────────────────────────────────────┐
│                           Unify-Agent                             │
│                                                                   │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐     │
│  │  THINK   │───▶│ RESEARCH │───▶│RECAPTION │───▶│ GENERATE │     │
│  │          │    │          │    │          │    │          │     │
│  │ Identify │    │ Search   │    │ Fuse ref │    │ Produce  │     │
│  │ missing  │    │ text &   │    │ images + │    │ grounded │     │
│  │knowledge │    │ images   │    │ evidence │    │  image   │     │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘     │
└───────────────────────────────────────────────────────────────────┘
```
- Get started → Installation
- Train your own model → SFT Training
- Run inference → Inference
- Build training data → Data Pipeline
- Evaluate on FactIP → FactIP Benchmark Evaluation
High-quality samples from Unify-Agent, highlighting its unified multi-image generation and agentic-search-enhanced world-knowledge integration. It delivers strong cross-image consistency, broad stylistic versatility, and faithful, knowledge-grounded generation across diverse concepts and scenarios, even for up-to-date real-world queries, such as generating images of the top three finishers (Kimi Antonelli, George Russell, Lewis Hamilton) of the 2026 Chinese Grand Prix in Shanghai.
Qualitative comparison of multi-image generation results on knowledge-intensive prompts involving historical figures, fictional characters, products, and stylized toys. Our method consistently produces images that better preserve subject identity, fine-grained attributes, and prompt-specific details, while achieving stronger real-world knowledge grounding than competing baselines, including Flux-1, Bagel-7b, Hunyuan, and Stable Diffusion.
Overview of the agentic pipeline of our method. Given an input prompt, our framework first performs prompt understanding and cognitive gap detection to identify missing but visually critical attributes. It then acquires complementary multimodal evidence through textual evidence searching and visual evidence searching. Based on the collected evidence, the model grounds the generation process with two types of constraints: identity-preserving constraints that capture character-specific visual traits, and scene-compositional constraints that specify pose, environment, garment, and overall mood. These grounded constraints are then integrated into an evidence-grounded recaptioning module, which produces a detailed caption for the downstream image generator to synthesize the final image.
Overview of our data pipeline. Starting from long-tail IP collection, we construct user instructions and ground-truth images, build multimodal research trajectories with textual and visual evidence, and finally perform evidence-grounded recaption annotation to obtain high-quality training samples. The resulting dataset supports both SFT trajectory learning and the FactIP benchmark, which evaluates generation quality in terms of clarity, content, aesthetics, and relevance.
- Image generated for the prompt: "The copper is burning, highlighting the color".
- Image generated for Grigory Perelman scribbling mathematical equations.
Hierarchical category distribution of FactIP Bench, consisting of three major groups (Character, Scene, and Object) and 12 fine-grained subcategories. Category-wise comparison of different methods on FactIP Bench, where the radar chart presents the overall scores across all subcategories.
The full benchmark contains three major categories and 12 fine-grained subcategories, totaling 2,462 prompts. We will also release a 500-prompt mini test subset drawn from the full benchmark, with the hierarchical category distribution (Character, Scene, Object and their 12 fine-grained subcategories) maintained strictly in proportion to the full FactIP Bench.
| Category | Subcategory | Description | Num |
|---|---|---|---|
| CHARACTER | Animation | Animated characters, creatures, equipment, and iconic locations from anime and animated media. | 438 |
| CHARACTER | Comic | Characters and visual elements originating from comic books and manga series. | 363 |
| CHARACTER | Celebrity | Prominent figures across diverse domains, including scientists, political leaders, business executives, athletes, and entertainment personalities. | 300 |
| CHARACTER | Game | Video game characters, weapons, equipment, and other in-game visual elements. | 272 |
| CHARACTER | Mascot | Official mascots representing Olympic Games, regional events, and corporate brands. | 77 |
| CHARACTER | Mythology | Universally recognized mythological narratives and legendary figures, e.g., Kuafu Chasing the Sun. | 50 |
| OBJECT | Food | Cuisines, regional delicacies, desserts, and beverages with cultural significance. | 316 |
| OBJECT | Cultural Relic / Art | National treasures, classical calligraphy, paintings, sculptures, and fine art pieces. | 126 |
| OBJECT | Toy | Collectible figures, designer toys, and model kits with cultural relevance, e.g., Labubu. | 123 |
| OBJECT | Animal / Plant | Individually notable animals and plants with distinct public recognition, e.g., Giant Panda Qizai. | 50 |
| SCENE | Landmark | Renowned scenic spots, architectural landmarks, monuments, and heritage sites. | 297 |
| SCENE | Festival / Celebration | Visual elements and symbols associated with well-known festivals and cultural celebrations. | 50 |
An example of MLLM evaluation for Popovich drawing a play.
```shell
git clone https://github.com/shawn0728/Unify-Agent.git
cd Unify-Agent
pip install -r requirements.txt
```

Note: FlashAttention-2 is recommended but commented out in `requirements.txt`. Install it separately if your GPU supports it:

```shell
pip install flash-attn --no-build-isolation
```
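As a quick sanity check before installing, note that FlashAttention-2 generally requires an Ampere-or-newer GPU (compute capability 8.0+). The helper below is a hypothetical convenience, not part of this repo; the capability tuple can be obtained from `torch.cuda.get_device_capability()`:

```python
def flash_attn_supported(capability):
    """Return True if a GPU compute capability tuple (major, minor)
    meets FlashAttention-2's usual requirement of SM 8.0 (Ampere)
    or newer. Hypothetical helper, not shipped with Unify-Agent;
    obtain the tuple via torch.cuda.get_device_capability().
    """
    major, _minor = capability
    return major >= 8

# Example: an A100 reports (8, 0), a V100 reports (7, 0).
```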
We provide a ready-to-use script for supervised fine-tuning (SFT) of the Unify-Agent model. The training is built on top of Bagel and uses FSDP for distributed training with W&B logging.
| Requirement | Details |
|---|---|
| GPU | 8 × A100/H100 per node (80 GB recommended) |
| Python | 3.10+ |
| PyTorch | 2.5.1+ with CUDA support |
| Base Model | Bagel-7B-MoT checkpoint |
| ViT | SigLIP-SO400M (NaViT variant) checkpoint |
1. Set required environment variables:

```shell
export WANDB_API_KEY="your_wandb_api_key"
export RESUME_FROM="/path/to/Bagel-7B-MoT"
export VIT_PATH="/path/to/siglip-so400m-14-980-flash-attn2-navit"
```

2. Single-node training (8 GPUs):

```shell
bash SFT/scripts/train.sh 1 0 127.0.0.1 29500
```

3. Multi-node training (e.g., 4 nodes):

```shell
# On each node, replace <node_rank> with 0, 1, 2, 3
bash SFT/scripts/train.sh 4 <node_rank> <master_ip> 29500
```

The general invocation is:

```shell
bash SFT/scripts/train.sh <nnodes> <node_rank> <master_addr> <master_port>
```
| Argument | Description |
|---|---|
| `nnodes` | Total number of nodes |
| `node_rank` | Rank of the current node (0-indexed) |
| `master_addr` | IP address of the rank-0 node |
| `master_port` | Port for distributed rendezvous |
All training hyperparameters can be overridden via environment variables:
| Variable | Default | Description |
|---|---|---|
| `RESUME_FROM` | (required) | Path to the base model checkpoint |
| `VIT_PATH` | (required) | Path to the SigLIP ViT checkpoint |
| `WANDB_API_KEY` | (required) | Weights & Biases API key |
| `NPROC_PER_NODE` | `8` | Number of GPUs per node |
| `RESULTS_DIR` | `./outputs/sft` | Directory for logs and metrics |
| `CHECKPOINT_DIR` | `./outputs/sft` | Directory for saving checkpoints |
| `WANDB_PROJECT` | `unify-agent-sft` | W&B project name |
| `WANDB_NAME` | `sft_run` | W&B run name |
| `NUM_WORKERS` | `1` | DataLoader workers per rank |
| `KEEP_LAST_N` | `8` | Number of recent checkpoints to keep |
| `LAUNCH_MODE` | `static` | Launch mode: `static` or `elastic` |
| `COMM_PROFILE` | `socket_safe` | NCCL comm profile: `socket_safe` \| `ib_min` \| `ib_perf` |
The default SFT recipe uses the following hyperparameters:
| Hyperparameter | Value |
|---|---|
| Learning rate | 5e-5 |
| Warmup steps | 500 |
| Max gradient norm | 5.0 |
| Expected tokens per batch | 40240 |
| Max tokens per batch | 41520 |
| Max tokens per sample | 40240 |
| Save every N steps | 500 |
| CE loss weight | 1.0 |
| MSE loss weight | 1.0 |
| Special token CE weight | 3.0 |
The SFT training uses a YAML config file to specify datasets. See SFT/data/configs/agent_data.yaml for the agentic SFT config and SFT/data/configs/example.yaml for the general training config.
For detailed information on data preparation, dataset format, and model/training configuration tables, refer to SFT/train/TRAIN.md.
Unify-Agent performs multi-turn agentic inference: given an input prompt and optional IP name, the model autonomously searches the web for textual and visual evidence, generates a grounded recaption, and synthesizes the final image. The inference pipeline has four stages:
- Search Phase: The model calls `text_search` and `search_image` tools in a multi-turn loop to gather background knowledge and reference images.
- Quality Judging: Downloaded images are scored by an LLM judge; the top-2 highest-quality references are selected.
- Recaption: Using the reference images and search results, the model generates a detailed, evidence-grounded scene description.
- Image Generation: The recaption and reference images are fed to the BAGEL generator to produce the final image.
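The quality-judging stage amounts to ranking candidate images by a judge score and keeping the top-2. A minimal sketch, where `select_references` and the `judge` callable are hypothetical names rather than the repo's actual API:

```python
def select_references(candidates, judge, k=2):
    """Rank candidate reference images by an LLM-judge score and keep
    the top-k (the pipeline above uses k=2). `judge` maps a candidate
    to a numeric quality score; both names are illustrative.
    """
    ranked = sorted(candidates, key=judge, reverse=True)
    return ranked[:k]
```

The same top-k selection applies regardless of whether the judge scores on aesthetics, relevance, or a weighted mix.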
| Requirement | Details |
|---|---|
| GPU | 1 × A100/H100 (80 GB) for single-GPU; N GPUs for multi-GPU mode |
| Model | csfufu/Unify-Agent (auto-downloaded by default) |
| OPENAI_API_KEY | Required when --execute_tools is enabled (for LLM reasoning & image judging) |
| SERPER_API_KEY | Required when --execute_tools is enabled (for web search) |
| JINA_API_KEY | Optional (for webpage content extraction; falls back to snippets if not set) |
A unified entry script infer/run_infer.sh automatically dispatches to the single-GPU or multi-GPU backend based on --num_gpus.
1. Set API keys (required for tool-augmented search):

```shell
export OPENAI_API_KEY="sk-..."
export SERPER_API_KEY="..."
# Optional: export JINA_API_KEY="..."
```

2. Prepare input data, a JSON file mapping IP indices to entries:

```json
{
  "0": {
    "ip_name": "Pikachu",
    "image_prompt": "Pikachu surfing on a giant wave at sunset",
    "language": "en",
    "country": "Japan"
  },
  "1": {
    "ip_name": "Hotpot",
    "image_prompt": "A bowl of authentic Chongqing hotpot with rich red broth",
    "language": "zh",
    "country": "中国"
  }
}
```

When `ip_name` is empty, the model operates in prompt-only mode: it searches for background information based on the scene description directly.
3. Run inference:

```shell
# Single-GPU inference with tool execution
bash infer/run_infer.sh \
    --ip_data path/to/ip_data.json \
    --output_dir ./output/my_run \
    --execute_tools \
    --allow_batch_ip
```

```shell
# Multi-GPU inference (4 GPUs)
bash infer/run_infer.sh \
    --ip_data path/to/ip_data.json \
    --output_dir ./output/my_run \
    --num_gpus 4 \
    --execute_tools \
    --allow_batch_ip
```

4. Process a single IP entry:

```shell
bash infer/run_infer.sh \
    --ip_data path/to/ip_data.json \
    --ip_index 0 \
    --output_dir ./output/single_test \
    --execute_tools
```

The output directory is organized as follows:

```
output/my_run/
├── 0_trajectory.json           # Full multi-turn dialogue trajectory
├── 0_generated.png             # Final generated image
├── intermediate/
│   └── 0/
│       ├── image_1.jpg         # Best reference image from search
│       ├── image_2.jpg         # Second-best reference image
│       └── tmp/                # All candidate images before filtering
└── processing_summary.json     # Batch processing summary
```
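Before launching a batch run, it can help to sanity-check the `ip_data` JSON described in step 2. The loader below is a hypothetical convenience, not one of the repo's scripts, and the required-key set simply mirrors the fields shown in the example entry:

```python
import json

# Fields from the example ip_data entries; assumed, not enforced by the repo.
REQUIRED_KEYS = {"ip_name", "image_prompt", "language", "country"}

def load_ip_data(path):
    """Load an ip_data JSON file (index -> entry) and check that each
    entry carries the expected fields. Hypothetical helper.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for idx, entry in data.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {idx} is missing keys: {sorted(missing)}")
        if not entry["ip_name"]:
            # An empty ip_name is legal: it triggers prompt-only mode,
            # where the model searches by scene description instead.
            print(f"entry {idx}: empty ip_name -> prompt-only mode")
    return data
```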
The following flags are accepted by run_infer.sh and forwarded to the underlying Python scripts:
| Flag | Default | Description |
|---|---|---|
| `--model_path` | `csfufu/Unify-Agent` | Path or HuggingFace repo id of the model |
| `--num_gpus` | `1` | Number of GPUs (1 = single-GPU script, >1 = multi-GPU script) |
| `--ip_data` | (required) | IP data JSON file path |
| `--ip_index` | `None` | Process only this IP index (processes all if not set) |
| `--output_dir` | `None` | Output directory (or use `--output_base` + `--model_name`) |
| `--execute_tools` | `False` | Enable web search and LLM judging (requires API keys) |
| `--allow_batch_ip` | `False` | Allow batch processing when `--ip_index` is not specified |
| `--per_ip_subdir` | `False` | Save each IP's output in a separate subdirectory |
| `--reference_images` | `None` | Provide reference images directly (skip search phase) |
| `--seed` | `42` | Random seed for reproducibility |
| `--mode` | `1` | Model loading mode: 1 = full precision, 2 = NF4, 3 = INT8 |
| `--max_search_turns` | `6` | Maximum turns for the search phase |
| `--stage1_temperature` | `0.7` | Temperature for the initial search turn |
| `--stage2_temperature` | `0.7` | Temperature for subsequent search turns |
| `--enable_think` | `False` | Enable thinking mode (multi-GPU only) |
| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | API key for the OpenAI-compatible reasoning model |
| `OPENAI_BASE_URL` | Custom base URL for the reasoning API (optional) |
| `AGENT_REASONING_MODEL` | Reasoning model name (default: `gpt-4o`) |
| `SERPER_API_KEY` | API key for Serper web search |
| `JINA_API_KEY` | API key for Jina AI webpage reading (optional) |
```
infer/
├── run_infer.sh                 # Unified entry point (dispatches by --num_gpus)
├── infer_utils.py               # Shared utilities: API clients, search, image processing,
│                                #   prompt templates, text parsing (~1150 lines)
├── unify-agent_infer.py         # Single-GPU inference (~860 lines)
├── unify-agent_infer_thread.py  # Multi-GPU inference with parallelism (~1330 lines)
├── model_loader.py              # Model weight loading and Inferencer construction
└── data/
    └── data_utils.py            # Image preprocessing utilities
```
We provide a three-stage data pipeline that constructs multi-turn agentic search trajectories paired with generated images from a list of IP entries. This is used to produce the SFT training data described in the paper.
```
IP CSV / JSON ──▶ Stage 1 ──▶ Stage 2 ──▶ Stage 3 ──▶ Training Data
                 (Prompts)  (Trajectories)  (Images)
```
| Stage | Description |
|---|---|
| 1. Prompt Generation | Uses an LLM to produce structured image-generation prompts from an IP list |
| 2. Trajectory Generation | Conducts multi-turn agent dialogue with web search (Serper + Jina), image quality judging, and evidence-grounded recaption |
| 3. Image Generation | Calls an image generation API using the recaption as the prompt |
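Conceptually, the three stages compose into a single loop over IP entries, each stage consuming the previous stage's output. A sketch with stand-in callables (`run_pipeline`, `gen_prompt`, `gen_trajectory`, and `gen_image` are placeholders for the stage implementations, not the actual `data_pipeline` API):

```python
def run_pipeline(ip_entries, gen_prompt, gen_trajectory, gen_image):
    """Sketch of the three-stage data pipeline: for each IP entry,
    Stage 1 writes a prompt, Stage 2 builds a multi-turn search
    trajectory ending in a recaption, and Stage 3 renders an image
    from that recaption. All callables are illustrative stand-ins.
    """
    samples = []
    for ip in ip_entries:
        prompt = gen_prompt(ip)                       # Stage 1: LLM-written prompt
        trajectory = gen_trajectory(prompt)           # Stage 2: search + recaption
        image = gen_image(trajectory["recaption"])    # Stage 3: image API call
        samples.append({"ip": ip, "prompt": prompt,
                        "trajectory": trajectory, "image": image})
    return samples
```

This mirrors why the stages are run in order: the image prompt sent to the generator is the evidence-grounded recaption, not the raw user instruction.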
```shell
export OPENAI_API_KEY="sk-..."
export SERPER_API_KEY="..."

bash data_pipeline/run_pipeline.sh \
    --source_ip celebrity \
    --ip_data data/ip_prompts.json \
    --output_dir ./output/pipeline
```

| Variable | Required | Description |
|---|---|---|
| `OPENAI_API_KEY` | Yes | OpenAI-compatible API key |
| `SERPER_API_KEY` | Yes (Stage 2) | Serper web & image search |
| `JINA_API_KEY` | Optional | Jina AI full-page content extraction |
| `OPENAI_BASE_URL` | Optional | Custom API endpoint |
| `STAGE1_MODEL` | Optional | Stage 1 model (default: `gpt-4o`) |
| `AGENT_REASONING_MODEL` | Optional | Stage 2 reasoning model (default: `gpt-4o`) |
| `AGENT_MULTITURN_MODEL` | Optional | Stage 2 multi-turn model (default: `gpt-4o`) |
| `IMAGE_GEN_MODEL` | Optional | Stage 3 image model (default: `gpt-image-1`) |
For detailed documentation on each stage, CLI flags, output formats, and architecture, see data_pipeline/DATA.md.
We provide the FactIP benchmark for evaluating knowledge-grounded image generation. The evaluation pipeline consists of three stages: Generate → Score → Calculate.
```shell
# Option A: HuggingFace CLI (recommended)
huggingface-cli download csfufu/FactIP --repo-type dataset --local-dir ./FactIP

# Option B: Git LFS
git clone https://huggingface.co/datasets/csfufu/FactIP
```

The dataset contains prompts (`test.json` / `test_mini.json`), ground-truth reference images, and category metadata.
Use `eval/bagel_infer_batch.py` to generate images from the FactIP prompts:

```shell
python eval/bagel_infer_batch.py \
    --model_path /path/to/your/model \
    --prompt_json ./FactIP/test.json \
    --output_dir ./results/my_model \
    --num_gpus 8 \
    --seed 42 \
    --think   # optional: enable chain-of-thought reasoning
```

Score each generated image against the ground-truth references using an OpenAI-compatible multimodal API:
```shell
export OPENAI_API_KEY="your_api_key"

python eval/score_factip.py \
    --base-dir ./results/my_model \
    --workers 8
```

You can customize the judge model via environment variable or CLI flag:

```shell
# Use a custom judge model
export FACTIP_EVAL_MODEL="gpt-4o"

# Point to ground-truth images (if stored separately)
export FACTIP_GT_DIR="./FactIP/images"
```

Aggregate per-category and overall scores (reported on a 0-100 scale):
```shell
python eval/calculate.py --base_dir ./results/my_model
```

This produces an `overall_score.txt` report with per-subtask breakdowns. The overall score is a weighted sum:

```
Overall = 0.05 × clarity + 0.10 × content_quality + 0.10 × aesthetics + 0.75 × text_relevance_ip
```
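The weighted sum can be reproduced directly. The snippet below mirrors the formula above; the dimension key names are assumed to match the judge's output fields:

```python
# Weights from the FactIP overall-score formula; key names assumed
# to match the per-dimension fields emitted by the MLLM judge.
WEIGHTS = {
    "clarity": 0.05,
    "content_quality": 0.10,
    "aesthetics": 0.10,
    "text_relevance_ip": 0.75,
}

def overall_score(scores):
    """Combine per-dimension judge scores (0-100 scale) into the
    FactIP overall score using the weights above.
    """
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
```

Note that `text_relevance_ip` dominates at 0.75, reflecting the benchmark's focus on factual grounding over pure image quality.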
We also provide a single script that runs all three stages end-to-end:

```shell
bash eval/run_factip_eval.sh \
    --model_path /path/to/your/model \
    --prompt_json ./FactIP/test.json \
    --gt_dir ./FactIP/images \
    --output_dir ./results/my_model \
    --num_gpus 8 \
    --think
```

| Flag | Description |
|---|---|
| `--model_path` | Path to the model checkpoint |
| `--prompt_json` | Path to the FactIP prompt JSON |
| `--gt_dir` | Path to the ground-truth reference images |
| `--output_dir` | Directory for generated images and scores |
| `--num_gpus` | Number of GPUs for parallel generation (default: 8) |
| `--think` | Enable chain-of-thought reasoning |
| `--score_workers` | Concurrent API workers for scoring (default: 4) |
| `--eval_model` | Override the MLLM judge model name |
| `--skip_generate` | Skip generation; run scoring + calculation only |
| `--only_calculate` | Skip generation and scoring; run calculation only |
All the code, the benchmark, and the checkpoints have entered the final approval stage. Stay tuned: once the approval process is complete, we will release them as soon as possible.
We thank the open-source community for Bagel, whose excellent work inspired this project.
For questions, feedback, or collaboration opportunities, feel free to reach out: csfufu0728@gmail.com
If you find our work useful for your research, please consider citing:
@article{chen2026unify,
title={Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis},
author={Chen, Shuang and Shou, Quanxin and Chen, Hangting and Zhou, Yucheng and Feng, Kaituo and Hu, Wenbo and Zhang, Yi-Fan and Lin, Yunlong and Huang, Wenxuan and Song, Mingyang and others},
journal={arXiv preprint arXiv:2603.29620},
year={2026}
}
@article{feng2026gen,
title={Gen-Searcher: Reinforcing Agentic Search for Image Generation},
author={Feng, Kaituo and Zhang, Manyuan and Chen, Shuang and Lin, Yunlong and Fan, Kaixuan and Jiang, Yilei and Li, Hongyu and Zheng, Dian and Wang, Chenyang and Yue, Xiangyu},
journal={arXiv preprint arXiv:2603.28767},
year={2026}
}








