This repository contains the official evaluation framework for SUPERChem, an expert-curated, reasoning-intensive multimodal benchmark for the rigorous evaluation of deep chemical reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs).
License: MIT
Verify your environment with bundled sample data (Gemini 2.5 Pro answers, no API key):
pip install -r requirements.txt
python demo/run_demo.py

See demo/README.md for file descriptions and DAG_eval usage with the same sample.
Install from the repository root:
pip install -r requirements.txt

| Component | Version (tested) |
|---|---|
| Python | 3.10, 3.11, 3.12 |
| pandas | 2.x |
| pyarrow | 14.x–21.x |
| openai | 1.x–2.x |
| PyYAML, loguru, tqdm | see requirements.txt |
| networkx, matplotlib | for DAG_eval/ |
| plotly, scipy, seaborn, Pillow | for analysis/ |
| streamlit | for DAG_eval/view/ (optional) |
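To compare an environment against the tested versions above, the installed versions can be listed with the standard library. This is a minimal sketch, not part of the repository: the `PACKAGES` list and the `installed_versions` helper are illustrative names, and the distribution names are assumed to match those in requirements.txt.

```python
from importlib import metadata

# Distribution names taken from the dependency table (assumed to match
# requirements.txt); extend the list as needed.
PACKAGES = ["pandas", "pyarrow", "openai", "PyYAML", "loguru", "tqdm"]


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:10s} {version or 'NOT INSTALLED'}")
```

A `None` entry means the package is missing from the active virtual environment; any other value can be checked by eye against the table.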
- Ubuntu 22.04 / 24.04 LTS
- macOS 14+ (Apple Silicon and Intel)
- Demo / accuracy scripts: standard desktop or laptop (CPU only).
- Full benchmark inference (eval/): network access to your LLM API; no GPU required in this repo.
- DAG / RPF pipeline (DAG_eval/): API access to judge models; optional external molecule comparison service for structure matching (see DAG_eval/README.md).
git clone <repository-url>
cd SUPERChem_eval
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp eval/config.yaml.sample eval/config.yaml # then add API keys for full eval

Typical install time: 2–5 minutes on a normal desktop (depends on network speed).
python demo/run_demo.py

| Item | Value |
|---|---|
| Data | 10 questions + Gemini 2.5 Pro (text-only, high) answers in demo/ |
| Expected output | Printed pass@1 accuracy (~50%, 5/10) and per-UUID scores |
| Expected runtime | < 5 s (after pip install) |
- demo/questions_demo.parquet: questions
- demo/20251014164938_questions_release_en_false__gemini-2_5-pro_high__1_0_1.jsonl: model outputs
- demo/ground_truth_graphs_detail.jsonl: expert reasoning graphs for RPF
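The pass@1 figure reported by the demo is the fraction of questions answered correctly in a single attempt. The sketch below shows that arithmetic only; it is not the code in demo/run_demo.py, and the `pass_at_1` helper and the placeholder UUIDs are hypothetical.

```python
def pass_at_1(scores):
    """Fraction of questions answered correctly in one attempt.

    scores: dict mapping question UUID -> 1 (correct) or 0 (incorrect).
    """
    if not scores:
        raise ValueError("no scores given")
    return sum(scores.values()) / len(scores)


# Hypothetical per-UUID results: 5 correct out of 10, as in the demo summary.
demo_scores = {f"uuid-{i}": int(i < 5) for i in range(10)}

correct = sum(demo_scores.values())
print(f"pass@1 = {pass_at_1(demo_scores):.1%} ({correct}/{len(demo_scores)})")
# prints: pass@1 = 50.0% (5/10)
```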
The complete benchmark (500 items) is on Hugging Face: ZehuaZhao/SUPERChem. Place downloaded files under data/ following names in eval/eval.sh and DAG_eval/README.md.
- Copy eval/config.yaml.sample to eval/config.yaml and set API endpoints/keys.
- Edit eval/eval.sh (model, INPUT_FILE, multimodal flag).
- Run:

cd eval && bash eval.sh

Outputs: data/*.jsonl.
Details: eval/README.md.
- Place questions parquet, model answers jsonl, and ground_truth_graphs_detail.jsonl under DAG_eval/data/.
- Copy DAG_eval/src/config.example.yaml to DAG_eval/src/config.yaml.
- Run ./run_full_pipeline.sh or individual steps in DAG_eval/src/.
Details: DAG_eval/README.md.
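Before launching the pipeline, it can help to confirm that the three inputs listed above are in place. The following is a minimal pre-flight sketch, not part of the repository: `missing_dag_inputs` is a hypothetical helper, and it assumes that any .parquet file counts as the questions file and that any .jsonl file other than the ground-truth graphs counts as model answers.

```python
from pathlib import Path

GROUND_TRUTH = "ground_truth_graphs_detail.jsonl"


def missing_dag_inputs(data_dir):
    """Return a list describing which expected inputs are absent from data_dir."""
    data_dir = Path(data_dir)
    missing = []
    # Assumption: the questions file is the only kind of .parquet file here.
    if not any(data_dir.glob("*.parquet")):
        missing.append("questions parquet")
    # Assumption: every .jsonl except the ground-truth graphs is a model answer file.
    answers = [p for p in data_dir.glob("*.jsonl") if p.name != GROUND_TRUTH]
    if not answers:
        missing.append("model answers jsonl")
    if not (data_dir / GROUND_TRUTH).is_file():
        missing.append(GROUND_TRUTH)
    return missing


if __name__ == "__main__":
    problems = missing_dag_inputs("DAG_eval/data")
    print("All inputs found." if not problems else f"Missing: {', '.join(problems)}")
```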
Process data/*.jsonl with scripts in analysis/ (e.g. calc_pass_withbaseline.py, draw_radar_plotly.py). Figures go to results/.
Details: analysis/README.md.
- Obtain model answer files for the models reported in the paper (via eval/ or released artifacts).
- Run analysis/calc_pass_withbaseline.py for accuracy tables.
- Run plotting scripts (draw_radar_plotly.py, pass_k_curve.py, etc.) with paths pointing to your data/ files.
Exact figure-to-script mapping may vary by revision; use filenames in results/ as a reference for expected outputs.
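For readers unfamiliar with pass@k curves, the sketch below shows a widely used unbiased estimator: with n sampled answers per question, of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether analysis/pass_k_curve.py uses this exact estimator is an assumption; the function names and sample counts here are illustrative only.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one question.

    n: number of sampled answers, c: number of correct answers, k: budget (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect answers exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_question, k):
    """Average pass@k over questions; per_question is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in per_question) / len(per_question)


# Hypothetical counts: 8 samples per question, three questions.
samples = [(8, 2), (8, 0), (8, 8)]
for k in (1, 4, 8):
    print(k, round(mean_pass_at_k(samples, k), 4))
```

With k = 1 the estimator reduces to c / n, the plain per-question accuracy.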
- [2026-03-16] SUPERChem has been adopted by MiroThinker-1.7.
- [2026-02-14] SUPERChem has been adopted by ByteDance's Seed-2.0.
- [2025-12-06] PDF Preview Released: We have released the PDF version of SUPERChem in both English and Chinese to facilitate easier previewing and manual inspection, especially for non-technical users. You can download SUPERChem-500.zip to access the dataset in PDF format. The password to unzip the file is SUPERChem2025.
Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, ceiling effects, lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%. By combining high difficulty, controlled multimodality, and process-level metrics, SUPERChem provides a rigorous platform for diagnosing and advancing AI chemical reasoning toward expert-level scientific inquiry.
- Expert-Level Challenge: 500 reasoning-intensive problems curated by domain experts.
- Process-Level Evaluation: Reasoning Path Fidelity (RPF) via expert solution DAGs.
- Controlled Multimodality: Text-only and multimodal variants per question.
- Fine-Grained Ability Taxonomy: Tags for knowledge and reasoning skills.
- Contamination Resistant: Expert-authored or non-public sources with human curation.
.
├── demo/             # Small sample dataset + run_demo.py (start here)
├── eval/             # LLM answer generation
├── DAG_eval/         # DAG extraction, matching, RPF scoring
├── data/             # Full benchmark data and evaluation outputs
├── analysis/         # Metrics and plots
├── results/          # Generated figures
├── requirements.txt
└── LICENSE           # MIT
If you use SUPERChem or this evaluation framework in your research, please cite our paper:
@misc{zhao2025superchemmultimodalreasoningbenchmark,
title={SUPERChem: A Multimodal Reasoning Benchmark in Chemistry},
author={Zehua Zhao and Zhixian Huang and Junren Li and Siyu Lin and Junting Zhou and Fengqi Cao and Kun Zhou and Rui Ge and Tingting Long and Yuexiang Zhu and Yan Liu and Jie Zheng and Junnian Wei and Rong Zhu and Peng Zou and Wenyu Li and Zekai Cheng and Tian Ding and Yaxuan Wang and Yizhao Yan and Tingru Wei and Haowei Ming and Weijie Mao and Chen Sun and Yiming Liu and Zichen Wang and Zuo Zhang and Tong Yang and Hao Ma and Zhen Gao and Jian Pei},
year={2025},
eprint={2512.01274},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.01274},
}