For the current experimental write-up and results, see report/report.md.
Converting chemical diagrams into machine-readable representations (SMILES, InChI, molecular graphs) is fundamental for indexing the chemical literature. Classical OCSR tools often struggle on complex layouts and scanned pages. Modern deep learning approaches such as DECIMER and MolScribe demonstrate strong performance with dedicated training. Given the rapid evolution of multimodal foundation models, it is natural to ask whether general-purpose VLM/LLM systems can perform OCSR without chemistry-specific training—and what their failure modes look like.
This repository provides a ready-to-use Docker Compose environment for evaluating OCSR with current vision-enabled foundation models.
The project is inspired by MolMole (LG AI Research), which proposes an end-to-end framework for extracting molecules and reactions from full-page patent images and introduces an evaluation benchmark. The MolMole paper evaluates on 550 annotated pages; due to copyright restrictions, only a 300-page patent subset is publicly released on HuggingFace as the MolMole_Patent300 dataset.
The code in this repo runs holistic extraction: the model sees the entire page and is asked to return all structures and reactions in one JSON response. Extractions are evaluated in three output formats:
- `graph`: atoms/bonds JSON (closest to an explicit molecular graph).
- `smiles`: SMILES strings.
- `selfies`: SELFIES strings.
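For illustration, here is the same small molecule (ethanol) in all three formats. The atoms/bonds JSON layout below is only an assumed sketch, not necessarily the exact schema the extractor requests, and the snippet assumes RDKit and the `selfies` package are installed:

```python
# Illustration only: ethanol in the three output formats.
from rdkit import Chem
import selfies as sf

smiles = Chem.CanonSmiles("CCO")      # SMILES, canonicalised with RDKit -> "CCO"
selfies_str = sf.encoder(smiles)      # SELFIES -> "[C][C][O]"
graph = {                             # hypothetical atoms/bonds JSON for this sketch
    "atoms": ["C", "C", "O"],
    "bonds": [[0, 1, 1], [1, 2, 1]],  # [atom_index_i, atom_index_j, bond_order]
}
print(smiles, selfies_str, graph)
```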
All developer workflows run inside Docker/Compose; host-level execution is reserved for CI.
```
.
├── docker-compose.yml
├── Dockerfile
├── Makefile
├── .env.example
├── experiments_openrouter.yaml
├── report/
│   └── report.md
├── src/
│   └── molmole_research/
│       ├── downloader.py   # download dataset + build labels.json from MOL files
│       ├── extractor.py    # holistic OCSR extraction (graph / SMILES / SELFIES)
│       ├── converter.py    # optional conversion helpers
│       ├── evaluator.py    # compute metrics and write logs
│       └── runner.py       # orchestrate multi-model runs
└── tests/
    └── ...
```
This is the intended way to reproduce the pilot runs in results_openrouter_*. Keep secrets in .env (gitignored) and do not put API keys on the command line.
- Create `.env`:

  ```bash
  cp .env.example .env
  # edit .env and set OPENROUTER_KEY=...
  ```

  If you plan to run direct OpenAI API experiments (not via OpenRouter), you can also set `OPENAI_API_KEY` in `.env`.

- Build the image:

  ```bash
  make build
  ```

- Download MolMole_Patent300 and build `labels.json`:

  ```bash
  make download
  ```

- Run OpenRouter experiments (small debug runs; adjust `--limit` as needed):

  ```bash
  docker compose run --rm --user "$(id -u):$(id -g)" research \
    python -m molmole_research.runner run \
    --config experiments_openrouter.yaml \
    --format graph \
    --limit 5 \
    --results-dir results_openrouter_graph
  ```

  Repeat for other formats:

  - `--format smiles --results-dir results_openrouter_smiles`
  - `--format selfies --results-dir results_openrouter_selfies`
- Inspect outputs:

  - Raw model outputs: `results_openrouter_*/<experiment>.jsonl`
  - Metrics: `results_openrouter_*/<experiment>_metrics.json`
  - Logs: `results_openrouter_*/<experiment>_metrics.log`
  - Aggregated summary: `results_openrouter_*/summary.json`
The publicly released dataset lives on HuggingFace as doxa-friend/MolMole_Patent300 (license: CC-BY-NC-ND-4.0). The downloader uses huggingface_hub.snapshot_download to fetch the dataset snapshot and builds labels.json by converting the provided MOL files to canonical SMILES via RDKit.
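In outline, that step looks roughly like the sketch below. This is not the repo's exact code; the file layout and `labels.json` schema are assumptions, so see `downloader.py` for the real logic:

```python
# Sketch only: fetch the dataset snapshot, convert MOL files to canonical SMILES,
# and write labels.json. Assumed layout; see downloader.py for the actual behaviour.
import json
from pathlib import Path

from huggingface_hub import snapshot_download
from rdkit import Chem

snapshot_dir = Path(snapshot_download(repo_id="doxa-friend/MolMole_Patent300",
                                      repo_type="dataset"))

labels = {}
for mol_path in sorted(snapshot_dir.rglob("*.mol")):
    mol = Chem.MolFromMolFile(str(mol_path))
    if mol is not None:                                # skip unparsable MOL files
        labels[mol_path.stem] = Chem.MolToSmiles(mol)  # canonical SMILES

out_path = Path("data/images/labels.json")
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(json.dumps(labels, indent=2))
```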
To download the dataset into `data/images`, run:

```bash
make download
```

If the download fails due to authentication or license acceptance, the downloader prints instructions for manual setup.
All commands below are meant to run via Docker.
If you are using the OpenAI API directly, ensure OPENAI_API_KEY is set (for example via .env).
- Run extraction (example: OpenAI, SMILES output):

  ```bash
  docker compose run --rm --user "$(id -u):$(id -g)" research \
    python -m molmole_research.extractor run \
    --model gpt-4o \
    --dataset-dir data/images \
    --out results \
    --format smiles \
    --limit 5
  ```

- Evaluate:

  ```bash
  docker compose run --rm --user "$(id -u):$(id -g)" research \
    python -m molmole_research.evaluator run \
    --pred results/gpt-4o.jsonl \
    --dataset-dir data/images \
    --out results
  ```

- Optional conversion step (mainly useful for debugging):

  ```bash
  docker compose run --rm --user "$(id -u):$(id -g)" research \
    python -m molmole_research.converter run \
    --pred results/gpt-4o.jsonl \
    --out results
  ```
Notes:
- The extractor resumes by default (appends and skips already-processed pages). For a clean run, delete the output JSONL or use `--no-resume`.
- Use `--timeout` to bound each request, and `--limit` for short debug runs.
To run a YAML-defined set of experiments (recommended for OpenRouter), use:
```bash
docker compose run --rm --user "$(id -u):$(id -g)" research \
  python -m molmole_research.runner run \
  --config experiments_openrouter.yaml \
  --format graph \
  --limit 5 \
  --results-dir results_openrouter_graph
```

The runner writes per-experiment JSONL outputs and per-experiment metrics, plus `summary.json` in the selected results directory.
- `experiments_openrouter.yaml` sets the OpenRouter API base and declares `api_key_env: OPENROUTER_KEY` (a sketch of its structure follows this list).
- The runner reads `OPENROUTER_KEY` from `.env` (or the environment) and passes it to the extractor via environment variables.
- Start with `--limit` and a small model set; OpenRouter runs can be expensive.
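For orientation only, a runner config of this kind might look roughly like the sketch below. Apart from `api_key_env: OPENROUTER_KEY` and the standard OpenRouter base URL, every key and model name here is an assumption; treat `experiments_openrouter.yaml` and `runner.py` as the source of truth.

```yaml
# Sketch only: key names other than api_key_env are assumptions, not the real schema.
api_base: https://openrouter.ai/api/v1
api_key_env: OPENROUTER_KEY
experiments:
  - name: gpt-4o
    model: openai/gpt-4o
  - name: claude-sonnet
    model: anthropic/claude-3.5-sonnet
```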
If you prefer an interactive session:
```bash
make shell
```

Inside the shell, you can run the same `python -m molmole_research.<module> run ...` commands.
If an existing Open-WebUI + Ollama stack is available at http://host.docker.internal:11434/v1 with the model ministral-3:14b, you can run a sample extraction from inside the research container:
```bash
python -m molmole_research.extractor run \
  --model ministral-3:14b \
  --api-base http://host.docker.internal:11434/v1 \
  --api-key placeholder \
  --dataset-dir data/images \
  --out results \
  --format graph \
  --limit 5
```

Notes:
- The Compose service includes `extra_hosts: host.docker.internal` so the container can reach the host’s Ollama port.
- The `--api-key` value is ignored by Ollama but required by the OpenAI client; any non-empty string is fine.
The provided Makefile defines several convenience targets:
| Target | Description |
|---|---|
| `make build` | Build the Docker image (installs dependencies). |
| `make shell` | Open a bash shell inside the research container. |
| `make test` | Run the full test suite inside the container. |
| `make lint` | Check code style with `ruff` inside the container. |
| `make format` | Format the code base with `ruff format` inside the container. |
| `make run` | Run the default runner inside the container. |
| `make download` | Download MolMole_Patent300 and build `labels.json`. |
| `make up` / `make down` | Start or stop the research container stack. |
Automated CI pipelines execute host-level commands (pytest, ruff) to keep runtimes fast. Outside CI, prefer the Docker workflow above and avoid creating local virtual environments. If you must reproduce the CI run locally, mirror its steps in a temporary venv and install requirements.txt, but treat that as an exception rather than the norm.
The extractor uses the OpenAI Python client and targets OpenAI-compatible APIs. For a provider that exposes an OpenAI-compatible API (OpenAI, OpenRouter, local Open-WebUI, etc.), set --api-base and provide credentials via environment variables or the runner configuration.
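As a rough illustration of what "OpenAI-compatible" means here, the snippet below shows a single vision request with the OpenAI Python client pointed at such a base URL. It is not the extractor's actual code: the prompt, model name, and image path are placeholders, and the real request and parsing logic live in `extractor.py`.

```python
# Sketch only: addressing an OpenAI-compatible endpoint (OpenAI, OpenRouter,
# Ollama/Open-WebUI) with the OpenAI Python client.
import base64
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # or whatever --api-base points at
    api_key=os.environ["OPENROUTER_KEY"],
)

# Hypothetical page image; the dataset's real filenames may differ.
page_b64 = base64.b64encode(open("data/images/page_0001.png", "rb").read()).decode()

resp = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all molecules on this page and return them as JSON."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{page_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```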
The `articles/` directory tracks additional papers used to inform this environment (the PDFs themselves are not committed to the repo). A brief summary of each paper is available in `articles/relevant_articles.md`.
This project is released under the MIT License. Individual datasets and published papers retain their respective licenses; please consult the original sources for details.