MarkushGrapher 2.0


MarkushGrapher 2.0 is an end-to-end multimodal model for recognizing both molecular structures and Markush structures from chemical document images. It jointly encodes vision, text, and layout modalities to auto-regressively generate CXSMILES representations and substituent tables.
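For readers unfamiliar with the format, a CXSMILES string is a plain SMILES followed by an extension block (`|...|`) carrying Markush metadata such as per-atom labels. The snippet below uses a made-up example (not drawn from the datasets) to show the basic shape:

```python
# Illustrative CXSMILES: a phenyl core whose attachment atom is labelled R1.
# The "$...$" extension holds semicolon-separated labels, one slot per atom.
cxsmiles = "*C1=CC=CC=C1 |$R1;;;;;;$|"

smiles, ext = cxsmiles.split(" |")
labels = ext.rstrip("|").strip("$").split(";")

assert smiles == "*C1=CC=CC=C1"
assert labels[0] == "R1" and len(labels) == 7  # 7 atoms, first one labelled
```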

MarkushGrapher 2.0 substantially outperforms state-of-the-art models — including MolParser, MolScribe, GPT-5, and DeepSeek-OCR — on Markush structure recognition benchmarks, while maintaining competitive performance on standard molecular structure recognition (OCSR).

Resources: Model | Datasets | Paper (v2) | Paper (v1)

What's New in 2.0

Compared to MarkushGrapher 1.0, version 2.0 introduces several major improvements (the v1 code is available under the markushgrapher-v1 tag):

  • End-to-End Processing — A dedicated ChemicalOCR module extracts text and bounding boxes directly from images, eliminating the need for external OCR annotations.
  • Two-Phase Training Strategy — Phase 1 (Adaptation) aligns the projector and decoder to pretrained OCSR features; Phase 2 (Fusion) introduces the VTL encoder for joint multimodal training, improving encoder fusion.
  • Universal Recognition — A single model handles both standard molecular images (SMILES) and multimodal Markush structures (CXSMILES + substituent tables).
  • New Training Data Pipeline — Automatic construction of large-scale real-world Markush training data from USPTO MOL files (2010–2025).
  • New Benchmark: IP5-M — 1,000 manually annotated Markush structures from patent documents across all five IP5 patent offices (USPTO, JPO, KIPO, CNIPA, EPO).
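The two-phase strategy can be sketched as a simple schedule of which components are trained in each phase (the component names here are illustrative, not the repository's actual training code):

```python
def trainable_components(phase: int) -> set[str]:
    """Hypothetical sketch of the two-phase training schedule."""
    if phase == 1:
        # Adaptation: align the MLP projector and text decoder
        # to the pretrained OCSR vision features.
        return {"projector", "decoder"}
    if phase == 2:
        # Fusion: introduce the VTL encoder for joint multimodal training.
        return {"projector", "decoder", "vtl_encoder"}
    raise ValueError(f"unknown phase: {phase}")
```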

Installation

Choose the setup path that matches your hardware:

| Hardware | Setup script |
| --- | --- |
| NVIDIA GPU (CUDA) | setup-cuda.sh |
| Apple Silicon (MPS) | setup.sh |
| CPU only | setup.sh |

NVIDIA GPU (CUDA)

ChemicalOCR runs via vllm in a dedicated environment; Python 3.10+ is required:

bash setup-cuda.sh

This creates two virtual environments:

  • chemicalocr-env — vllm + stock transformers (fast batched ChemicalOCR on GPU)
  • markushgrapher-env — custom transformers fork (MarkushGrapher model inference)

The two environments are needed because vllm requires tokenizers >= 0.19 while the custom transformers fork requires tokenizers < 0.14 — these ranges do not overlap.
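The conflict is easy to verify with plain version tuples (a sketch for illustration; the actual pins live in the two environments' requirements):

```python
def parse(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))

VLLM_MIN = parse("0.19")   # vllm: tokenizers >= 0.19
FORK_MAX = parse("0.14")   # transformers fork: tokenizers < 0.14

def satisfies_both(version: str) -> bool:
    v = parse(version)
    return v >= VLLM_MIN and v < FORK_MAX

# No single tokenizers release can satisfy both constraints.
assert not any(satisfies_both(v) for v in ["0.13.3", "0.14.0", "0.19.1", "0.21.0"])
```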

Apple Silicon / CPU

bash setup.sh
source markushgrapher-env/bin/activate

On Apple Silicon, ChemicalOCR uses mlx-vlm. On first run the model is automatically converted to MLX format (one-time operation). On CPU, the transformers backend is used as a fallback (very slow).

Manual Setup

Step-by-step instructions
  1. Create a virtual environment (requires Python 3.10):
python3.10 -m venv markushgrapher-env
source markushgrapher-env/bin/activate
  2. Install MarkushGrapher:
PIP_USE_PEP517=0 pip install -e .
  3. Install the transformers fork (contains the MarkushGrapher architecture, built on UDOP):
git clone https://github.com/lucas-morin/transformers.git ./external/transformers
pip install -e ./external/transformers
  4. Install the MolScribe fork (minor fixes for albumentations compatibility):
git clone https://github.com/lucas-morin/MolScribe.git ./external/MolScribe
pip install -e ./external/MolScribe --no-deps
  5. (Apple Silicon only) For fast ChemicalOCR inference on Mac, install mlx-vlm:
pip install mlx-vlm
  6. Download model weights:
huggingface-cli download docling-project/MarkushGrapher-2 --local-dir ./models/markushgrapher-2
huggingface-cli download docling-project/ChemicalOCR --local-dir ./models/chemicalocr
wget https://huggingface.co/yujieq/MolScribe/resolve/main/swin_base_char_aux_1m680k.pth -P ./external/MolScribe/ckpts/

Inference

End-to-End (Images → CXSMILES)

Place your chemical structure images (.png) in a directory and run:

bash scripts/inference/inference.sh ./data/images

This runs the full pipeline:

  1. Converts images to HuggingFace dataset format
  2. Runs ChemicalOCR to extract text labels and bounding boxes
  3. Runs MarkushGrapher 2.0 to predict CXSMILES and substituent tables

Visualizations are saved to data/visualization/prediction/.

The script selects the Python interpreter and ChemicalOCR backend automatically based on which environments are installed:

| Hardware | Setup used | ChemicalOCR backend | Speed |
| --- | --- | --- | --- |
| NVIDIA GPU | chemicalocr-env (from setup-cuda.sh) | vllm | Fastest (batched GPU) |
| Apple Silicon | markushgrapher-env (from setup.sh) | mlx-vlm | ~1.5s per image |
| CPU | markushgrapher-env (from setup.sh) | transformers | Slow (fallback only) |
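The selection logic amounts to roughly the following (a hypothetical sketch; the actual checks live in scripts/inference/inference.sh):

```python
import platform
import shutil

def pick_chemicalocr_backend() -> str:
    """Illustrative re-statement of the backend table; the real logic is in the shell script."""
    if shutil.which("nvidia-smi"):  # NVIDIA GPU available -> batched vllm
        return "vllm"
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx-vlm"            # Apple Silicon -> MLX backend
    return "transformers"           # CPU fallback (slow)
```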

You can override the interpreter for either stage:

CHEMICALOCR_PYTHON=/path/to/python MARKUSHGRAPHER_PYTHON=/path/to/python \
  bash scripts/inference/inference.sh ./data/images

Note: ChemicalOCR produces reliable results on NVIDIA GPU (vllm) and Apple Silicon (mlx-vlm). The CPU/transformers fallback is available but slow and not recommended for production use.

Step by Step

Step 1: Convert images to a HuggingFace dataset and apply ChemicalOCR.

  • NVIDIA GPU (uses chemicalocr-env with vllm):
PYTHONPATH=. chemicalocr-env/bin/python scripts/dataset/image_dir_to_hf_dataset.py \
  --image_dir ./data/images \
  --output_dir ./data/hf/sample-images \
  --apply_ocr \
  --ocr_model_path ./models/chemicalocr
  • Apple Silicon / CPU (uses markushgrapher-env with mlx-vlm or transformers):
source markushgrapher-env/bin/activate
PYTHONPATH=. python scripts/dataset/image_dir_to_hf_dataset.py \
  --image_dir ./data/images \
  --output_dir ./data/hf/sample-images \
  --apply_ocr \
  --ocr_model_path ./models/chemicalocr

Step 2: Run MarkushGrapher inference (always uses markushgrapher-env):

PYTHONPATH=. markushgrapher-env/bin/python -m markushgrapher.eval config/predict.yaml

The dataset path is configured in config/datasets/datasets_predict.yaml.

Architecture

MarkushGrapher 2.0 Architecture

MarkushGrapher 2.0 employs two complementary encoding pipelines:

  1. Vision Encoder Pipeline — The input image is processed by an OCSR vision encoder (Swin-B ViT, from MolScribe) followed by an MLP projector.
  2. Vision-Text-Layout Pipeline — The image is passed through ChemicalOCR to extract text and bounding boxes, which are then jointly encoded with the image via a VTL encoder (T5-base backbone, UDOP fusion).

The projected vision embedding (e1) is concatenated with the VTL embedding (e2) and fed to a text decoder that auto-regressively generates CXSMILES and substituent tables.
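In shape terms, the fusion step is a concatenation along the sequence axis. A minimal sketch (the lengths and hidden size below are placeholders, not the model's actual dimensions):

```python
# Placeholder sequence lengths and hidden size, for illustration only.
len_vision, len_vtl, d_model = 4, 6, 8

e1 = [[0.0] * d_model for _ in range(len_vision)]  # projected vision embedding
e2 = [[0.0] * d_model for _ in range(len_vtl)]     # VTL embedding

decoder_input = e1 + e2  # concatenate along the sequence axis
assert len(decoder_input) == len_vision + len_vtl
assert all(len(row) == d_model for row in decoder_input)
```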

Model size: 831M parameters (744M trainable)

Results

Markush Structure Recognition (CXSMILES Accuracy)

| Model | M2S | USPTO-M | WildMol-M | IP5-M |
| --- | --- | --- | --- | --- |
| MolParser-Base | 39 | 30 | 38.1 | 47.7 |
| MolScribe | 21 | 7 | 28.1 | 22.3 |
| GPT-5 | 3 | – | – | – |
| DeepSeek-OCR | 0 | 0 | 1.9 | 0.0 |
| MarkushGrapher 1.0 | 38 | 32 | – | – |
| MarkushGrapher 2.0 | 56 | 55 | 48.0 | 53.7 |

Molecular Structure Recognition (SMILES Accuracy)

| Model | WildMol | JPO | UOB | USPTO |
| --- | --- | --- | --- | --- |
| MolParser-Base | 76.9 | 78.9 | 91.8 | 93.0 |
| MolScribe | 66.4 | 76.2 | 87.4 | 93.1 |
| MolGrapher | 45.5 | 67.5 | 94.9 | 91.5 |
| MarkushGrapher 2.0 | 68.4 | 71.0 | 96.6 | 89.8 |

Datasets

Download the datasets from HuggingFace:

huggingface-cli download docling-project/MarkushGrapher-2-Datasets --local-dir ./data/hf --repo-type dataset

Training Data

| Phase | Dataset | Size | Type |
| --- | --- | --- | --- |
| Phase 1 (Adaptation) | MolScribe USPTO | 243k | Real (image-SMILES pairs) |
| Phase 2 (Fusion) | Synthetic CXSMILES | 235k | Synthetic |
| Phase 2 (Fusion) | MolParser | 91k | Real (converted to CXSMILES) |
| Phase 2 (Fusion) | USPTO-MOL-M | 54k | Real (auto-extracted from MOL files) |
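For a sense of scale, the Phase 2 mixture above sums to 380k samples:

```python
# Phase 2 dataset sizes as listed in the training-data table.
phase2_sizes = {
    "Synthetic CXSMILES": 235_000,
    "MolParser": 91_000,
    "USPTO-MOL-M": 54_000,
}
total = sum(phase2_sizes.values())
assert total == 380_000
```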

Benchmarks

Markush Structure Recognition:

  • M2S (103) — Real-world multimodal Markush structures with substituent tables
  • USPTO-M (74) — Real-world Markush structure images
  • WildMol-M (10k) — Large-scale semi-manually annotated Markush structures
  • IP5-M (1,000) — New — Manually annotated Markush structures from IP5 patent offices (1980–2025)

Molecular Structure Recognition (OCSR):

  • USPTO (5,719), JPO (450), UOB (5,740), WildMol (10k)

The synthetic datasets are generated using MarkushGenerator.

Training

PYTHONUNBUFFERED=1 CUDA_VISIBLE_DEVICES=0 python3.10 -m markushgrapher.train config/train.yaml

Configure training in config/train.yaml and config/datasets/datasets.yaml.

Acknowledgments

MarkushGrapher builds on UDOP (Vision-Text-Layout encoder) and MolScribe (OCSR vision encoder). The ChemicalOCR module is based on SmolDocling. Training was initialized from the pretrained UDOP weights available on HuggingFace.

Citation

If you find this repository useful, please consider citing:

MarkushGrapher-2:

@misc{strohmeyer2026markushgrapher2endtoendmultimodalrecognition,
      title={MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures}, 
      author={Tim Strohmeyer and Lucas Morin and Gerhard Ingmar Meijer and Valéry Weber and Ahmed Nassar and Peter Staar},
      year={2026},
      eprint={2603.28550},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.28550}, 
}

MarkushGrapher:

@inproceedings{Morin_2025,
  title     = {MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures},
  url       = {http://dx.doi.org/10.1109/CVPR52734.2025.01352},
  DOI       = {10.1109/cvpr52734.2025.01352},
  booktitle = {2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  publisher = {IEEE},
  author    = {Morin, Lucas and Weber, Val\'{e}ry and Nassar, Ahmed and Meijer, Gerhard Ingmar and Van Gool, Luc and Li, Yawei and Staar, Peter},
  year      = {2025},
  month     = jun,
  pages     = {14505--14515}
}

About

[CVPR 26] MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures
