Convert scientific posters (PDF/images) into structured JSON metadata using Large Language Models.
This project uses mise to manage tool versions (Python 3.12, uv) and uv for dependency management.
Install mise by following the mise installation guide.
# Install Python 3.12 and uv (as specified in mise.toml)
mise install
# Create the virtual environment (.venv)
uv venv
# Install project dependencies
uv pip install -r requirements.txt
This repository is an API service: it polls the database for extraction jobs rather than exposing a batch CLI. Start the server with:
python api.py
See the API Server section below for the environment variables it expects. For standalone poster→JSON conversion outside this service, use the poster2json library directly.
docker compose up --build
See Docker Setup for detailed instructions, including Windows/WSL2 support.
PDF/Image → Raw Text Extraction → LLM JSON Structuring → Structured JSON
                    ↓                      ↓
              [pdfplumber]          [Llama 3.1 8B]
               [Qwen2-VL]       verbatim Llama mirror
- PDF files → Processed via pdfplumber with XY-cut reading order (PyMuPDF fallback) for layout-aware text extraction
- Image files → Processed via the Qwen2-VL-7B vision-language model
- All files → Structured into JSON by Llama-3.1-8B-Poster-Extraction, a verbatim mirror of Meta Llama 3.1 8B Instruct (not fine-tuned)
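The routing described in the list above can be sketched in a few lines of Python. This is only an illustration of the dispatch idea, not the poster2json implementation: the function names, the stage labels, and the set of accepted image extensions are assumptions made for the example.

```python
from pathlib import Path

# Assumed extension sets; the real service may accept a different list.
PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def choose_extractor(filename: str) -> str:
    """Return a label for the raw-text extractor a file would be routed to."""
    suffix = Path(filename).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "pdfplumber"  # layout-aware text extraction for PDFs
    if suffix in IMAGE_EXTENSIONS:
        return "qwen2-vl"  # vision-language model for images
    raise ValueError(f"unsupported poster file type: {suffix or '(none)'}")


def plan_pipeline(filename: str) -> list[str]:
    """List the stages for a file; every file ends with LLM JSON structuring."""
    return [choose_extractor(filename), "llama-3.1-8b-json-structuring"]
```

For example, `plan_pipeline("poster.png")` yields the image extractor followed by the shared structuring stage, which mirrors the "All files" bullet above.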
Output conforms to the poster-json-schema:
{
"$schema": "https://posters.science/schema/v0.1/poster_schema.json",
"creators": [
{
"name": "LastName, FirstName",
"givenName": "FirstName",
"familyName": "LastName",
"affiliation": ["Institution"]
}
],
"titles": [{ "title": "Poster Title" }],
"posterContent": {
"sections": [
{ "sectionTitle": "Abstract", "sectionContent": "..." },
{ "sectionTitle": "Methods", "sectionContent": "..." }
]
},
"imageCaptions": [{ "captions": ["Figure 1.", "Description"] }],
"tableCaptions": [{ "captions": ["Table 1.", "Description"] }]
}
| Requirement | Specification |
|---|---|
| GPU | CUDA-capable, ≥16GB VRAM |
| RAM | ≥32GB recommended |
| Python | 3.10+ |
| OS | Linux, macOS, Windows (via Docker/WSL2) |
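Returning to the output format shown earlier: a consumer of the JSON may want a quick structural sanity check before using a record. The sketch below is a hand-rolled illustration based only on the example record in this README. It is not the logic in validation.py, and the choice of which keys to treat as required is an assumption rather than something taken from the poster-json-schema.

```python
# Assumed "must have" keys, chosen from the example record for illustration.
REQUIRED_KEYS = ("creators", "titles", "posterContent")


def find_problems(record: dict) -> list[str]:
    """Return human-readable problems found in a poster record (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in record:
            problems.append(f"missing key: {key}")
    # The example uses "LastName, FirstName" for creator names.
    for i, creator in enumerate(record.get("creators", [])):
        if ", " not in creator.get("name", ""):
            problems.append(f"creators[{i}].name is not 'LastName, FirstName'")
    sections = record.get("posterContent", {}).get("sections", [])
    for i, section in enumerate(sections):
        if not section.get("sectionTitle"):
            problems.append(f"sections[{i}] has no sectionTitle")
    return problems
```

For real validation, checking the record against the published schema with a JSON Schema validator is the more reliable route; this helper only catches the most obvious shape errors.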
The API does not accept file uploads. The frontend uploads poster files to Bunny storage and creates ExtractionJob records in the database. This service polls the database for new jobs, downloads the file from Bunny, runs extraction, and writes results to PosterMetadata.
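The job lifecycle described above (poll, download, extract, write) can be sketched as a single polling step. This is a minimal illustration, not the code in job_worker.py: the four callables and the `id` and `file_path` job fields are hypothetical stand-ins for the database queries, the Bunny download, and the poster2json call.

```python
from typing import Callable, Optional


def process_next_job(
    fetch_pending: Callable[[], Optional[dict]],
    download: Callable[[str], bytes],
    extract: Callable[[bytes], dict],
    save: Callable[[str, dict], None],
) -> bool:
    """Handle at most one pending job; return True if a job was processed."""
    job = fetch_pending()  # e.g. the oldest unprocessed ExtractionJob, or None
    if job is None:
        return False
    data = download(job["file_path"])  # fetch the poster from storage
    metadata = extract(data)  # run text extraction + JSON structuring
    save(job["id"], metadata)  # persist the result (PosterMetadata)
    return True
```

A worker would call this in a loop and sleep for a short interval whenever it returns `False`. Passing the dependencies in as callables keeps the step easy to test without a database or storage account.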
# Set required environment variables
export DATABASE_URL="postgresql://user:pass@host:5432/dbname"
export BUNNY_STORAGE_ZONE="your-storage-zone"
export BUNNY_ACCESS_KEY="your-storage-zone-password"
# Start the API (starts background job worker)
python api.py
# Health check
curl http://localhost:8000/health
See API Reference for full configuration and environment variables.
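Because the service depends on the three environment variables exported above, it can help to fail fast when one is missing. The following sketch shows one way to do that; `load_settings` is a hypothetical helper for illustration, and config.py may handle this differently.

```python
import os
from typing import Mapping

# The variables exported in the commands above.
REQUIRED_VARS = ("DATABASE_URL", "BUNNY_STORAGE_ZONE", "BUNNY_ACCESS_KEY")


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Return the required settings, or raise if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_settings()` once at startup turns a confusing mid-job failure into a clear error message before the worker begins polling.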
| Document | Description |
|---|---|
| Installation Guide | Detailed setup instructions |
| Docker Setup | Docker deployment & Windows support |
| Architecture | Technical details & methodology |
| Evaluation | Validation metrics & results |
| API Reference | REST API documentation |
posters-science-extraction-api/
├── api.py # Flask REST API + job worker startup
├── job_worker.py # Polls DB, runs extraction via the poster2json library
├── validation.py # Schema validation of LLM output
├── config.py # Model IDs and runtime configuration
├── requirements.txt # Python dependencies
├── Dockerfile # Container build
├── docker-compose-prod.yml # Docker orchestration
├── docs/ # Documentation
├── example_posters/ # Sample poster files
└── test_results/ # Validation outputs
Poster text extraction and JSON structuring are provided by the poster2json library, installed as a dependency; this repository wraps it in a polling API service.
Extraction runs through the poster2json library, which is validated against a 20-poster annotated corpus. Latest results: 19/20 (95%) passing.
| Metric | Score | Threshold |
|---|---|---|
| Word Capture | 0.92 | ≥0.75 |
| ROUGE-L | 0.85 | ≥0.75 |
| Number Capture | 0.97 | ≥0.75 |
| Field Proportion | 0.88 | 0.50–1.50 |
See the poster2json evaluation docs for per-poster results and the methodology behind these metrics.
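To make the thresholds in the table concrete, the sketch below checks a set of scores against them. The metric key names and the rule that a poster passes only when every metric is within bounds are assumptions for illustration; the poster2json evaluation docs remain the authority on how pass/fail is actually decided.

```python
# (lower bound, upper bound) per metric; None means no upper bound.
THRESHOLDS = {
    "word_capture": (0.75, None),
    "rouge_l": (0.75, None),
    "number_capture": (0.75, None),
    "field_proportion": (0.50, 1.50),
}


def failing_metrics(scores: dict) -> list[str]:
    """Return the names of metrics that fall outside their thresholds."""
    failed = []
    for name, (low, high) in THRESHOLDS.items():
        value = scores[name]
        if value < low or (high is not None and value > high):
            failed.append(name)
    return failed
```

With the aggregate scores from the table (0.92, 0.85, 0.97, 0.88) the function returns an empty list, while a field proportion of 1.6 would be reported as failing because that metric is bounded on both sides.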
MIT License - see LICENSE for details.
Part of the FAIR Data Innovations Hub posters.science project.
Contributions welcome! Please open an issue to discuss proposed changes.