Posters Science Extraction API

Convert scientific posters (PDF/images) into structured JSON metadata using Large Language Models.

Quick Start

This project uses mise to manage tool versions (Python 3.12, uv) and uv for dependency management.

Prerequisites

Install mise by following the mise installation guide.

Install dependencies

# Install Python 3.12 and uv (as specified in mise.toml)
mise install

# Create the virtual environment (.venv) and activate it (Linux/macOS shell)
uv venv
source .venv/bin/activate

# Install project dependencies
uv pip install -r requirements.txt

Basic Usage

This repository is an API service: it polls the database for extraction jobs and does not expose a batch CLI. Start the server with:

python api.py

See the API Server section below for the environment variables it expects. For standalone poster→JSON conversion outside this service, use the poster2json library directly.

Docker (Recommended for Windows)

docker compose up --build

See Docker Setup for detailed instructions including Windows/WSL2 support.

How It Works

PDF/Image → Raw Text Extraction → LLM JSON Structuring → Structured JSON
                ↓                         ↓
          [pdfplumber]            [Llama 3.1 8B]
           [Qwen2-VL]             verbatim Llama mirror

PDF files → Processed via pdfplumber with XY-cut reading order (PyMuPDF fallback) for layout-aware text extraction
Image files → Processed via Qwen2-VL-7B vision-language model
All files → Structured into JSON by Llama-3.1-8B-Poster-Extraction, a verbatim mirror of Meta Llama 3.1 8B Instruct (not fine-tuned)
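
The routing above can be sketched as a small dispatcher. This is only an illustration of the idea, not the poster2json implementation: the function name `choose_extractor`, the extension sets, and the returned labels are all hypothetical.

```python
from pathlib import Path

# Hypothetical extension sets; the formats the service actually accepts may differ.
PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def choose_extractor(filename: str) -> str:
    """Return a label for the raw-text extraction path a file would take."""
    suffix = Path(filename).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        # PDFs: layout-aware text extraction (pdfplumber, PyMuPDF fallback)
        return "pdfplumber"
    if suffix in IMAGE_EXTENSIONS:
        # Images: vision-language model
        return "qwen2-vl"
    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")


if __name__ == "__main__":
    for name in ("poster.pdf", "poster.JPG"):
        print(name, "->", choose_extractor(name))
```

Whichever path is taken, the resulting raw text goes through the same LLM structuring step.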

Output Format

Output conforms to the poster-json-schema:

{
  "$schema": "https://posters.science/schema/v0.1/poster_schema.json",
  "creators": [
    {
      "name": "LastName, FirstName",
      "givenName": "FirstName",
      "familyName": "LastName",
      "affiliation": ["Institution"]
    }
  ],
  "titles": [{ "title": "Poster Title" }],
  "posterContent": {
    "sections": [
      { "sectionTitle": "Abstract", "sectionContent": "..." },
      { "sectionTitle": "Methods", "sectionContent": "..." }
    ]
  },
  "imageCaptions": [{ "captions": ["Figure 1.", "Description"] }],
  "tableCaptions": [{ "captions": ["Table 1.", "Description"] }]
}
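
As a rough illustration of how a consumer might read this structure, the sketch below pulls the title, creator names, and section titles out of a document shaped like the example. The helper `summarize_poster` and the `EXPECTED_KEYS` tuple are hypothetical; which fields the schema actually requires is an assumption here, so consult the poster-json-schema for the authoritative rules.

```python
import json

# Top-level keys taken from the example above; the real schema may
# require a different set of fields (assumption).
EXPECTED_KEYS = ("creators", "titles", "posterContent")


def summarize_poster(doc: dict) -> dict:
    """Extract a few headline fields from a poster JSON document."""
    missing = [key for key in EXPECTED_KEYS if key not in doc]
    if missing:
        raise KeyError(f"Missing top-level keys: {missing}")
    return {
        "title": doc["titles"][0]["title"] if doc["titles"] else None,
        "creators": [creator["name"] for creator in doc["creators"]],
        "sections": [s["sectionTitle"] for s in doc["posterContent"]["sections"]],
    }


EXAMPLE = json.loads("""
{
  "creators": [{"name": "LastName, FirstName", "affiliation": ["Institution"]}],
  "titles": [{"title": "Poster Title"}],
  "posterContent": {
    "sections": [
      {"sectionTitle": "Abstract", "sectionContent": "..."},
      {"sectionTitle": "Methods", "sectionContent": "..."}
    ]
  }
}
""")

if __name__ == "__main__":
    print(summarize_poster(EXAMPLE))
```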

System Requirements

Requirement	Specification
GPU	CUDA-capable, ≥16GB VRAM
RAM	≥32GB recommended
Python	3.10+
OS	Linux, macOS, Windows (via Docker/WSL2)

API Server

The API does not accept file uploads. The frontend uploads poster files to Bunny storage and creates ExtractionJob records in the database. This service polls the database for new jobs, downloads the file from Bunny, runs extraction, and writes results to PosterMetadata.
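
The polling flow described above can be sketched with in-memory stand-ins. Nothing here reflects the actual code in `job_worker.py`: the `Job` and `InMemoryStore` classes, the status strings, and the `download` and `extract` callables are hypothetical placeholders for the database, Bunny storage, and the poster2json extraction call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Job:
    """Hypothetical stand-in for an ExtractionJob row."""
    job_id: int
    file_path: str
    status: str = "pending"


@dataclass
class InMemoryStore:
    """Stands in for the database: a job list plus a results table."""
    jobs: list = field(default_factory=list)
    poster_metadata: dict = field(default_factory=dict)

    def next_pending(self) -> Optional[Job]:
        return next((job for job in self.jobs if job.status == "pending"), None)


def process_next_job(
    store: InMemoryStore,
    download: Callable[[str], bytes],
    extract: Callable[[bytes], dict],
) -> bool:
    """Handle at most one pending job; return False when none is pending."""
    job = store.next_pending()
    if job is None:
        return False
    job.status = "running"
    try:
        data = download(job.file_path)                      # fetch the poster file
        store.poster_metadata[job.job_id] = extract(data)   # store the extraction result
        job.status = "completed"
    except Exception:
        # Mark the job as failed so it is not picked up again
        job.status = "failed"
    return True


if __name__ == "__main__":
    store = InMemoryStore(jobs=[Job(1, "posters/a.pdf")])
    while process_next_job(store, download=lambda p: b"%PDF", extract=lambda d: {"titles": []}):
        pass
    print(store.jobs[0].status, store.poster_metadata)
```

A long-running worker would typically call a function like this in a loop and sleep between polls when it returns False.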

# Set required environment variables
export DATABASE_URL="postgresql://user:pass@host:5432/dbname"
export BUNNY_STORAGE_ZONE="your-storage-zone"
export BUNNY_ACCESS_KEY="your-storage-zone-password"

# Start the API (starts background job worker)
python api.py

# Health check
curl http://localhost:8000/health
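
Before starting the server it can help to confirm that the three variables above are set. The following is a small pre-flight sketch, not part of this repository; the helper name `missing_env_vars` is hypothetical, and the service may read additional variables not listed here.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("DATABASE_URL", "BUNNY_STORAGE_ZONE", "BUNNY_ACCESS_KEY")


def missing_env_vars(environ=None) -> list:
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```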

See API Reference for full configuration and environment variables.

Documentation

Document	Description
Installation Guide	Detailed setup instructions
Docker Setup	Docker deployment & Windows support
Architecture	Technical details & methodology
Evaluation	Validation metrics & results
API Reference	REST API documentation

Project Structure

posters-science-extraction-api/
├── api.py                   # Flask REST API + job worker startup
├── job_worker.py            # Polls DB, runs extraction via the poster2json library
├── validation.py            # Schema validation of LLM output
├── config.py                # Model IDs and runtime configuration
├── requirements.txt         # Python dependencies
├── Dockerfile               # Container build
├── docker-compose-prod.yml  # Docker orchestration
├── docs/                    # Documentation
├── example_posters/         # Sample poster files
└── test_results/            # Validation outputs

Poster text extraction and JSON structuring are provided by the poster2json library, installed as a dependency; this repository wraps it in a polling API service.

Performance

Extraction runs through the poster2json library, which is validated against a 20-poster annotated corpus. Latest results: 19/20 (95%) passing.

Metric	Score	Threshold
Word Capture	0.92	≥0.75
ROUGE-L	0.85	≥0.75
Number Capture	0.97	≥0.75
Field Proportion	0.88	0.50–1.50

See the poster2json evaluation docs for per-poster results and the methodology behind these metrics.
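
To make the thresholds concrete, the sketch below checks a set of scores against the limits in the table. It assumes a poster passes only when every metric is within its threshold, which is a guess about the pass criterion rather than something this README states; the dictionary keys and the function `failed_metrics` are hypothetical names.

```python
# Thresholds copied from the table above as (minimum, maximum);
# None means there is no upper bound.
THRESHOLDS = {
    "word_capture": (0.75, None),
    "rouge_l": (0.75, None),
    "number_capture": (0.75, None),
    "field_proportion": (0.50, 1.50),
}


def failed_metrics(scores: dict) -> list:
    """Return the names of metrics that fall outside their threshold."""
    failures = []
    for name, (low, high) in THRESHOLDS.items():
        value = scores[name]
        if value < low or (high is not None and value > high):
            failures.append(name)
    return failures


# Scores from the table above
reported = {
    "word_capture": 0.92,
    "rouge_l": 0.85,
    "number_capture": 0.97,
    "field_proportion": 0.88,
}

if __name__ == "__main__":
    print(failed_metrics(reported))  # prints []
```

Field Proportion is the only metric with an upper bound, so a score above 1.50 fails just as a score below 0.50 does.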

License

MIT License - see LICENSE for details.

Citation

Part of the FAIR Data Innovations Hub posters.science project.

Contributing

Contributions welcome! Please open an issue to discuss proposed changes.

