Envision is a comprehensive benchmark designed for evaluating the unified understanding and sequential generation capabilities of multimodal models, specifically focusing on the modeling of causal world processes. The benchmark assesses a model's ability to generate coherent, physically plausible, and aesthetically pleasing sequences of images that follow a complex, step-by-step causal narrative.
Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation encourages overfitting to static pattern matching and semantic fusion, and fundamentally hinders their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision, a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To move evaluation from single images to sequential frames and assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physicality, and aesthetics. A comprehensive evaluation of 15 models (10 specialized T2I models and 5 unified models) shows that specialized T2I models excel at aesthetic rendering yet lack intrinsic world knowledge. Unified multimodal models bridge this gap, consistently outperforming their specialized counterparts in causal narrative coherence. However, even these unified architectures still lag behind closed-source models and struggle with the core challenge of spatiotemporal consistency. This indicates that a focus on causally isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling and ultimately limiting world-knowledge internalization and sequential generation.
You can download the Envision dataset, which contains the sequence prompts and ground-truth process descriptions, with the following Git command:

```bash
git clone https://huggingface.co/datasets/opendatalab-raiser/Envision
```
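Alternatively, if the repository exposes a loadable dataset configuration, the prompts can be pulled directly with the Hugging Face `datasets` library. The snippet below is only a minimal sketch: the split name is an assumption and may differ from the actual dataset layout, so check the dataset card first.

```python
# Minimal sketch: load the Envision prompts via the `datasets` library.
# The split name ("train") is an assumption; inspect the dataset card on
# Hugging Face for the actual splits and column names.
from datasets import load_dataset

ds = load_dataset("opendatalab-raiser/Envision", split="train")
print(ds)      # available columns and number of records
print(ds[0])   # one four-stage prompt record
```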
The evaluation of the generated image sequences is managed by the `eval.py` script, which automates quality assessment using a commercial VLM (e.g., OpenAI models) as the judge. The scoring adheres to a strict hierarchical protocol.

The comprehensive quality score (Envision-Score) is a weighted aggregation of three main dimensions, each of which is broken down into three sub-dimensions:

| Dimension | Weight | Sub-Dimensions |
|---|---|---|
| Consistency | 40% (0.4) | Semantic Consistency, Factual Consistency, Spatial-Temporal Consistency |
| Physicality | 40% (0.4) | Basic Properties, Dynamics and Interactivity, Physical Reliability |
| Aesthetic | 20% (0.2) | Expressiveness, Artistic Quality, Authenticity |
The final score is the weighted sum of the three dimension scores: Envision-Score = 0.4 × Consistency + 0.4 × Physicality + 0.2 × Aesthetic.
Within each main dimension, the three sub-dimensions are weighted almost equally. The exact values are set by the `DIMENSION_WEIGHTS` and `SUB_DIMENSION_WEIGHTS` variables in `eval.py` (a minimal aggregation sketch follows the list below):
- Consistency:
  - Semantic Consistency: 0.33
  - Factual Consistency: 0.33
  - Spatial-Temporal Consistency: 0.34
- Physicality:
  - Basic Properties: 0.33
  - Dynamics and Interactivity: 0.33
  - Physical Reliability: 0.34
- Aesthetic:
  - Expressiveness: 0.33
  - Artistic Quality: 0.33
  - Authenticity: 0.34
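
For reference, here is a minimal sketch of how these weights could combine per-sub-dimension scores into the overall Envision-Score. The weight dictionaries mirror the values listed above; the `envision_score` function name and the input format (a dict of per-sub-dimension scores) are illustrative assumptions, not the exact code in `eval.py`.

```python
# Illustrative aggregation of the Envision-Score from sub-dimension scores.
# The weight values mirror the benchmark's documented configuration; the
# input structure (dict of dimension -> {sub-dimension: score}) is assumed.
DIMENSION_WEIGHTS = {"Consistency": 0.4, "Physicality": 0.4, "Aesthetic": 0.2}
SUB_DIMENSION_WEIGHTS = {
    "Consistency": {"Semantic Consistency": 0.33, "Factual Consistency": 0.33,
                    "Spatial-Temporal Consistency": 0.34},
    "Physicality": {"Basic Properties": 0.33, "Dynamics and Interactivity": 0.33,
                    "Physical Reliability": 0.34},
    "Aesthetic": {"Expressiveness": 0.33, "Artistic Quality": 0.33,
                  "Authenticity": 0.34},
}

def envision_score(sub_scores: dict[str, dict[str, float]]) -> float:
    """Weighted sum over dimensions of the weighted sums over sub-dimensions."""
    total = 0.0
    for dim, dim_weight in DIMENSION_WEIGHTS.items():
        dim_score = sum(weight * sub_scores[dim][name]
                        for name, weight in SUB_DIMENSION_WEIGHTS[dim].items())
        total += dim_weight * dim_score
    return total
```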
The eval.py script utilizes multi-threaded execution and requires the following arguments to run the evaluation against your generated image sequences:
```bash
python eval.py \
    --json_path /path/to/your/sequences.json \
    --image_dir /path/to/your/generated/images \
    --output_dir /path/to/save/results \
    --api_key YOUR_OPENAI_API_KEY \
    --model chatgpt-4o-latest \
    --result_full full_results.json \
    --result_scores scores.jsonl \
    --max_workers 5
```

| Argument | Description |
|---|---|
| `--json_path` | Path to the JSON file containing the sequence prompts and details. |
| `--image_dir` | Root directory containing the index folders with step images. |
| `--output_dir` | Directory to save evaluation results. |
| `--api_key` | OpenAI API key for calling the evaluation model. |
| `--model` | The LLM model name for evaluation (e.g., `gpt-4o`). |
| `--result_full` | Output JSON file for full results. |
| `--result_scores` | Output JSONL file for scores. |
| `--max_workers` | Maximum number of concurrent workers for evaluation. |
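
Once the run finishes, the per-sequence scores in the JSONL output can be summarized with a short script. This is a sketch only: the `overall_score` field name is an assumed key in each JSONL record and should be adjusted to match the actual output schema of `eval.py`.

```python
# Sketch: summarize the per-sequence scores written by eval.py.
# Assumes one JSON object per line; "overall_score" is an assumed field
# name -- replace it with the actual key used in scores.jsonl.
import json

scores = []
with open("scores.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        value = record.get("overall_score")
        if isinstance(value, (int, float)):
            scores.append(value)

if scores:
    print(f"Evaluated sequences: {len(scores)}")
    print(f"Mean score: {sum(scores) / len(scores):.3f}")
```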
For the latest official results and model rankings on the Envision benchmark, please visit our dedicated leaderboard website:
https://opendatalab-raiser.github.io/Envision/
We strongly encourage the research community to expand and enhance the Envision benchmark. We welcome contributions in the form of new model results, additional evaluation metrics, or new causal process categories to further challenge the capabilities of unified multimodal models.
How to Contribute:
- Submit New Results: If you have evaluated a novel model on the Envision benchmark using the provided `eval.py` script, please submit your quantitative results to us. We will periodically update the official leaderboard to reflect the state of the art.
- Code and Data Extensions: We welcome pull requests for any improvements to the evaluation script, bug fixes, or the inclusion of supplementary causal event data to diversify the benchmark's coverage.
By collaborating, we can ensure the Envision benchmark remains a robust and evolving resource for measuring true world knowledge internalization and dynamic process modeling in multimodal generation.
If you use the Envision dataset or benchmark in your research, please cite the following paper:
```bibtex
@article{tian2025envision,
  title={Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights},
  author={Tian, Juanxi and Li, Siyuan and He, Conghui and Wu, Lijun and Tan, Cheng},
  journal={arXiv preprint arXiv:2512.01816},
  year={2025}
}
```