
fairdataihub/posters-science-extraction-api


Posters Science Extraction API

Convert scientific posters (PDF/images) into structured JSON metadata using Large Language Models.

Quick Start

This project uses mise to manage tool versions (Python 3.12, uv) and uv for dependency management.

Prerequisites

Install mise by following the mise installation guide.

Install dependencies

# Install Python 3.12 and uv (as specified in mise.toml)
mise install

# Create the virtual environment (.venv)
uv venv

# Install project dependencies
uv pip install -r requirements.txt

Basic Usage

This repository is an API service: it polls the database for extraction jobs rather than exposing a batch CLI. Start the server with:

python api.py

See the API Server section below for the environment variables it expects. For standalone poster→JSON conversion outside this service, use the poster2json library directly.

Docker (Recommended for Windows)

docker compose up --build

See Docker Setup for detailed instructions including Windows/WSL2 support.

How It Works

PDF/Image → Raw Text Extraction → LLM JSON Structuring → Structured JSON
                ↓                         ↓
          [pdfplumber]            [Llama 3.1 8B]
           [Qwen2-VL]             verbatim Llama mirror

  1. PDF files → Processed via pdfplumber with XY-cut reading order (PyMuPDF fallback) for layout-aware text extraction
  2. Image files → Processed via Qwen2-VL-7B vision-language model
  3. All files → Structured into JSON by Llama-3.1-8B-Poster-Extraction, a verbatim mirror of Meta Llama 3.1 8B Instruct (not fine-tuned)
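
The routing in the steps above can be sketched as a small dispatch on file extension. This is only an illustration: `choose_extractor` and the extension sets are hypothetical names, not part of this repository or of the poster2json library, and the returned labels simply name the tools listed above.

```python
from pathlib import Path

# Hypothetical extension sets; the formats the service actually accepts may differ.
PDF_SUFFIXES = {".pdf"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def choose_extractor(filename: str) -> str:
    """Return a label for the raw-text extraction step used for this file."""
    suffix = Path(filename).suffix.lower()
    if suffix in PDF_SUFFIXES:
        return "pdfplumber"  # layout-aware text extraction, PyMuPDF as fallback
    if suffix in IMAGE_SUFFIXES:
        return "qwen2-vl"  # vision-language model reads the image
    raise ValueError(f"unsupported poster format: {filename}")
```

For example, `choose_extractor("poster.PNG")` selects the vision-language path because the suffix is compared case-insensitively. Whichever path is taken, the extracted text then goes to the same LLM structuring step.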

Output Format

Output conforms to the poster-json-schema:

{
  "$schema": "https://posters.science/schema/v0.1/poster_schema.json",
  "creators": [
    {
      "name": "LastName, FirstName",
      "givenName": "FirstName",
      "familyName": "LastName",
      "affiliation": ["Institution"]
    }
  ],
  "titles": [{ "title": "Poster Title" }],
  "posterContent": {
    "sections": [
      { "sectionTitle": "Abstract", "sectionContent": "..." },
      { "sectionTitle": "Methods", "sectionContent": "..." }
    ]
  },
  "imageCaptions": [{ "captions": ["Figure 1.", "Description"] }],
  "tableCaptions": [{ "captions": ["Table 1.", "Description"] }]
}
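
A consumer of this output might run a quick structural check before using it. The sketch below uses only the standard library and looks for the top-level keys shown in the example; whether the real schema marks each of them as required is an assumption, and the helper names are hypothetical. It is not a substitute for validating against the published schema.

```python
import json

# Top-level keys taken from the example above; treating all of them as
# expected is an assumption, since the schema itself is not reproduced here.
EXPECTED_KEYS = ("creators", "titles", "posterContent", "imageCaptions", "tableCaptions")


def missing_top_level_keys(document: dict) -> list[str]:
    """Return the expected top-level keys that are absent from the document."""
    return [key for key in EXPECTED_KEYS if key not in document]


def section_titles(document: dict) -> list[str]:
    """Collect sectionTitle values from posterContent.sections, if present."""
    sections = document.get("posterContent", {}).get("sections", [])
    return [section.get("sectionTitle", "") for section in sections]


# A deliberately incomplete document, for illustration only.
sample = json.loads("""
{
  "creators": [{"name": "LastName, FirstName"}],
  "titles": [{"title": "Poster Title"}],
  "posterContent": {
    "sections": [{"sectionTitle": "Abstract", "sectionContent": "..."}]
  }
}
""")
```

Calling `missing_top_level_keys(sample)` reports the two caption fields that the sample omits, and `section_titles(sample)` lists the section headings in document order.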

System Requirements

| Requirement | Specification |
| --- | --- |
| GPU | CUDA-capable, ≥16GB VRAM |
| RAM | ≥32GB recommended |
| Python | 3.10+ |
| OS | Linux, macOS, Windows (via Docker/WSL2) |

API Server

The API does not accept file uploads. The frontend uploads poster files to Bunny storage and creates ExtractionJob records in the database. This service polls the database for new jobs, downloads the file from Bunny, runs extraction, and writes results to PosterMetadata.
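
The poll, download, extract, write cycle described above can be sketched with in-memory stand-ins. Everything here is hypothetical: `Job`, `FakeStore`, `process_next_job`, and the status strings are illustrative names, not the service's actual models or the code in job_worker.py, and the failure handling is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Job:
    # Hypothetical stand-in for an ExtractionJob record.
    job_id: int
    file_path: str
    status: str = "pending"


@dataclass
class FakeStore:
    # In-memory stand-in for the database, including PosterMetadata.
    jobs: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

    def next_pending(self) -> Optional[Job]:
        return next((job for job in self.jobs if job.status == "pending"), None)


def process_next_job(
    store: FakeStore,
    download: Callable[[str], bytes],
    extract: Callable[[bytes], dict],
) -> bool:
    """Handle one pending job; return False when there is nothing to do."""
    job = store.next_pending()
    if job is None:
        return False
    try:
        data = download(job.file_path)  # fetch the poster file from storage
        store.metadata[job.job_id] = extract(data)  # save the structured result
        job.status = "completed"
    except Exception:
        job.status = "failed"  # assumption: failures are recorded, not retried
    return True
```

A worker built this way would call `process_next_job` in a loop and sleep briefly whenever it returns `False`. Passing `download` and `extract` as callables keeps the loop testable without storage credentials or a GPU.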

# Set required environment variables
export DATABASE_URL="postgresql://user:pass@host:5432/dbname"
export BUNNY_STORAGE_ZONE="your-storage-zone"
export BUNNY_ACCESS_KEY="your-storage-zone-password"

# Start the API (starts background job worker)
python api.py

# Health check
curl http://localhost:8000/health

See API Reference for full configuration and environment variables.
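
Since the service depends on these variables, a startup check that fails fast can save debugging time. The sketch below is a hypothetical helper, not code from api.py; it covers only the three variables named above, and the API Reference may list others.

```python
import os
from typing import Mapping

# The three variables shown in the commands above.
REQUIRED_VARS = ("DATABASE_URL", "BUNNY_STORAGE_ZONE", "BUNNY_ACCESS_KEY")


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

A caller could run `missing_settings()` before starting the worker and exit with a clear message if the returned list is non-empty. Accepting any mapping makes the check easy to exercise with a plain dictionary.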

Documentation

| Document | Description |
| --- | --- |
| Installation Guide | Detailed setup instructions |
| Docker Setup | Docker deployment & Windows support |
| Architecture | Technical details & methodology |
| Evaluation | Validation metrics & results |
| API Reference | REST API documentation |

Project Structure

posters-science-extraction-api/
├── api.py                   # Flask REST API + job worker startup
├── job_worker.py            # Polls DB, runs extraction via the poster2json library
├── validation.py            # Schema validation of LLM output
├── config.py                # Model IDs and runtime configuration
├── requirements.txt         # Python dependencies
├── Dockerfile               # Container build
├── docker-compose-prod.yml  # Docker orchestration
├── docs/                    # Documentation
├── example_posters/         # Sample poster files
└── test_results/            # Validation outputs

Poster text extraction and JSON structuring are provided by the poster2json library, installed as a dependency; this repository wraps it in a polling API service.

Performance

Extraction runs through the poster2json library, which is validated against a 20-poster annotated corpus. Latest results: 19/20 (95%) passing.

| Metric | Score | Threshold |
| --- | --- | --- |
| Word Capture | 0.92 | ≥0.75 |
| ROUGE-L | 0.85 | ≥0.75 |
| Number Capture | 0.97 | ≥0.75 |
| Field Proportion | 0.88 | 0.50–1.50 |

See the poster2json evaluation docs for per-poster results and the methodology behind these metrics.
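
The thresholds above can be applied to a set of scores as follows. This is a sketch under one stated assumption: a poster passes only when all four metrics meet their thresholds, which is a plausible reading but not confirmed here. The function and key names are hypothetical, and the scores from the table are used only as sample input.

```python
# Thresholds copied from the Performance table.
MIN_THRESHOLDS = {"word_capture": 0.75, "rouge_l": 0.75, "number_capture": 0.75}
FIELD_PROPORTION_RANGE = (0.50, 1.50)


def passes(metrics: dict) -> bool:
    """Assumed rule: every metric must meet its threshold."""
    low, high = FIELD_PROPORTION_RANGE
    meets_minimums = all(
        metrics[name] >= limit for name, limit in MIN_THRESHOLDS.items()
    )
    return meets_minimums and low <= metrics["field_proportion"] <= high


# Scores from the table, used as sample input.
reported = {
    "word_capture": 0.92,
    "rouge_l": 0.85,
    "number_capture": 0.97,
    "field_proportion": 0.88,
}
```

Field Proportion is the only metric with an upper bound, so it is checked against a range rather than a minimum.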

License

MIT License - see LICENSE for details.

Citation

Part of the FAIR Data Innovations Hub posters.science project.

Contributing

Contributions welcome! Please open an issue to discuss proposed changes.
