soda-curation

soda-curation is a professional Python package for automated data curation of scientific manuscripts using AI capabilities. It specializes in processing and structuring ZIP files containing manuscript data, extracting figure captions, and matching them with corresponding images and panels.

Table of Contents

  1. Features
  2. Installation
  3. Configuration
  4. Docker
  5. Testing
  6. Model Benchmarking
  7. Pipeline Steps
  8. Verbatim Extraction Verification
  9. Output Schema
  10. Code Formatting and Linting
  11. Quality Control (QC) Pipeline
  12. Contributing
  13. License
  14. Changelog

Features

  • Automated processing of scientific manuscript ZIP files
  • AI-powered extraction and structuring of manuscript information
  • Figure and panel detection using advanced object detection models
  • Intelligent caption extraction and matching for figures and panels
  • Support for OpenAI, Anthropic, and Google Gemini models
  • Flexible configuration options for fine-tuning the curation process
  • Debug mode for development and troubleshooting
  • Integrated Quality Control (QC) pipeline for automated figure and data assessment

Installation

  1. Clone the repository:

    git clone https://github.com/source-data/soda-curation.git
    cd soda-curation
  2. Install the package using Poetry:

    poetry install

    For development, benchmark tooling, and linting:

    poetry install --with dev,lint

    Or, if you prefer to use pip:

    pip install -e .
  3. Set up environment variables: Create environment-specific .env files:

    Using environment variables is the recommended way to store sensitive information like API keys:

    OPENAI_API_KEY=your_openai_key
    ANTHROPIC_API_KEY=your_anthropic_key
    GOOGLE_API_KEY=your_google_ai_studio_key
    LANGFUSE_PUBLIC_KEY=your_langfuse_public_key
    LANGFUSE_SECRET_KEY=your_langfuse_secret_key
    LANGFUSE_HOST=https://cloud.langfuse.com
    ENVIRONMENT=test  # or dev or prod

Configuration

The configuration system uses a flexible, hierarchical approach supporting different environments (dev, test, prod) with environment-specific settings. Configuration is managed through:

  1. YAML files for general settings (e.g., config.dev.yaml, config.qc.yaml)
  2. Environment variables for sensitive information
  3. Command-line arguments for runtime options

Configuration Files Structure

  • Main pipeline config: Controls manuscript processing, AI model selection, and pipeline steps.
  • QC config (config.qc.yaml): Controls all quality control tests, test metadata, and versioning. Example:
qc_version: "2"
ai_provider: "openai"
qc_check_metadata:
  panel:
    plot_axis_units:
      name: "Plot Axis Units"
      checklist_type: "fig-checklist"
    error_bars_defined:
      name: "Error Bars Defined"
      checklist_type: "fig-checklist"
      langfuse_name: "checklists/fig-checklist/error-bars-defined"  # optional override
    # ... more tests ...
  figure:
    # Figure-level tests...
  document:
    # Document-level tests...
  
default:
  openai:
    model: "gpt-4o"
    temperature: 0.1
    # ...
  anthropic:
    model: "claude-sonnet-4-6"
    temperature: 0.1
  gemini:
    model: "gemini-2.5-flash"
    temperature: 0.1

Docker

The application supports different environments through Docker:

Building Images

# For CPU-only environments
docker build -t soda-curation-cpu . -f Dockerfile.cpu --target development

# Optional: select environment-specific config at build time (dev/test/prod)
docker build -t soda-curation-cpu . -f Dockerfile.cpu --target development --build-arg DEPLOYMENT_ENV=dev

Running the different environments

1. Development Environment

# Build and run development environment
docker-compose -f docker-compose.dev.yml build
docker-compose -f docker-compose.dev.yml run --rm soda /bin/bash

# Development with console access
docker compose -f docker-compose.dev.yml run --rm --entrypoint=/bin/bash soda

Running the pipelines with Docker

Main pipeline (single docker run)

docker run --rm \
  -v "$(pwd)/data:/app/data" \
  soda-curation-cpu \
  poetry run python -m src.soda_curation.main \
    --zip /app/data/archives/EMM-2023-18636.zip \
    --config /app/config.yaml \
    --output /app/data/output/EMM-2023-18636.json

QC pipeline (single docker run)

Run this after the main pipeline has produced:

  • *_figure_data.json
  • *_zip_structure.pickle

docker run --rm \
  -v "$(pwd)/data:/app/data" \
  soda-curation-cpu \
  poetry run python -m src.soda_curation.qc.main \
    --config /app/config.qc.yaml \
    --figure-data /app/data/output/EMM-2023-18636_figure_data.json \
    --zip-structure /app/data/output/EMM-2023-18636_zip_structure.pickle \
    --output /app/data/output/EMM-2023-18636_qc_results.json

Testing

Running the test suite

# Inside the container
poetry run pytest tests/test_suite

# With coverage report
poetry run pytest tests/test_suite --cov=src --cov-report=html

Model Benchmarking

The package includes a comprehensive benchmarking system for evaluating model performance across different tasks and configurations. The benchmarking system is configured through config.benchmark.yaml and runs using pytest.

Running Benchmarks

# Run the benchmark tests
poetry run pytest tests/test_pipeline/run_benchmark.py

Benchmark Configuration

The benchmarking system is configured through config.benchmark.yaml:

# Global settings
output_dir: "/app/data/benchmark/"
ground_truth_dir: "/app/data/ground_truth"
manuscript_dir: "/app/data/archives"
prompts_source: "/app/config.yaml"

# Test selection
enabled_tests:
  - extract_sections
  - extract_individual_captions
  - assign_panel_source
  - extract_data_availability

# Model configurations to test
providers:
  openai:
    models:
      - name: "gpt-4o"
        temperatures: [0.0, 0.1, 0.5]
        top_p: [0.1, 1.0]

# Test run configuration  
test_runs:
  n_runs: 1  # Number of times to run each configuration
  manuscripts: "all"  # Can be "all", a number, or specific IDs

Benchmark Components

  1. Test Selection: Choose which pipeline components to evaluate:

    • Section extraction
    • Individual caption extraction
    • Panel source assignment
    • Data availability extraction
  2. Model Configuration: Configure different models and parameters:

    • Multiple providers (OpenAI, Anthropic)
    • Various models per provider
    • Temperature and top_p parameter combinations
    • Multiple runs per configuration
  3. Output and Metrics:

    • Results are saved in the specified output directory
    • Generates CSV files with detailed metrics
    • Saves prompts used for each test
    • Creates comprehensive test reports

Benchmark Results

The benchmark system generates several output files:

  1. metrics.csv: Contains detailed performance metrics including:

    • Task-specific scores
    • Model parameters
    • Execution times
    • Input/output comparisons
  2. prompts.csv: Documents the prompts used for each task:

    • System prompts
    • User prompts
    • Task-specific configurations
  3. results.json: Detailed test results including:

    • Raw model outputs
    • Expected outputs
    • Scoring details
    • Error information

Pipeline Steps

The soda-curation pipeline processes scientific manuscripts through the following detailed steps:

1. ZIP Structure Analysis

  • Purpose: Extract and organize the manuscript's structure and components
  • Process:
    • Parses the ZIP file to identify manuscript components (XML, DOCX/PDF, figures, source data)
    • Creates a structured representation of the manuscript's files
    • Establishes relationships between figures and their associated files
    • Extracts manuscript content from DOCX/PDF for further analysis
    • Builds the initial ZipStructure object that will be enriched throughout the pipeline (a minimal sketch follows)
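
A minimal sketch of this structure pass, using only the standard library; it merely buckets archive members by extension, whereas the real ZipStructure object is much richer:

import zipfile
from collections import defaultdict

def scan_archive(path: str) -> dict:
    """Group archive members by coarse type (XML, manuscript, figures, source data)."""
    buckets = defaultdict(list)
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            lower = name.lower()
            if lower.endswith(".xml"):
                buckets["xml"].append(name)
            elif lower.endswith((".docx", ".pdf")):
                buckets["manuscript"].append(name)
            elif lower.endswith((".tif", ".eps", ".png", ".jpg", ".ai")):
                buckets["figures"].append(name)
            else:
                buckets["source_data"].append(name)
    return dict(buckets)

print(scan_archive("data/archives/EMM-2023-18636.zip"))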

2. Section Extraction

  • Purpose: Identify and extract critical manuscript sections
  • Process:
    • Uses AI to locate figure legend sections and data availability sections
    • Extracts these sections verbatim to preserve all formatting and details
    • Verifies extractions against the original document to prevent hallucinations
    • Returns structured content for further processing
    • Preserves HTML formatting from the original document

3. Individual Caption Extraction

  • Purpose: Parse figure captions into structured components
  • Process:
    • Divides full figure legends section into individual figure captions
    • For each figure, extracts:
      • Figure label (e.g., "Figure 1")
      • Caption title (main descriptive heading)
      • Complete caption text with panel descriptions
    • Identifies panel labels (A, B, C, etc.) within each caption
    • Ensures panel labels follow a gapless, monotonically increasing sequence (see the sketch after this list)
    • Associates each panel with its specific description from the caption
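
A minimal sketch of the gapless-sequence check; the function name is illustrative, not the package's API:

import string

def panel_labels_are_gapless(labels):
    """True if the labels form the sequence A, B, C... with no gaps."""
    expected = list(string.ascii_uppercase[:len(labels)])
    return [label.upper() for label in labels] == expected

assert panel_labels_are_gapless(["A", "B", "C"])
assert not panel_labels_are_gapless(["A", "C", "D"])  # gap: B is missing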

4. Data Availability Analysis

  • Purpose: Extract structured data source information
  • Process:
    • Analyzes the data availability section to identify database references
    • Extracts database names, accession numbers, and URLs/DOIs
    • Structures this information for linking to the appropriate figures/panels
    • Creates standardized references to external data sources (see the example below)
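
The structured target of this step mirrors the data_availability block of the Output Schema below; a minimal illustration with invented accession values:

from dataclasses import dataclass

@dataclass
class DataSource:
    database: str
    accession_number: str
    url: str  # may also be a DOI

example = DataSource(
    database="GEO",
    accession_number="GSE123456",  # hypothetical accession
    url="https://identifiers.org/geo:GSE123456",
)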

5. Panel Source Assignment

  • Purpose: Match source data files to specific figure panels
  • Process:
    • Analyzes file names and patterns in source data files
    • Maps each source data file to its corresponding panel(s)
    • Uses panel indicators in filenames, data types, and logical groupings (illustrated after this list)
    • Identifies files that cannot be confidently assigned to specific panels
    • Handles cases where files belong to multiple panels
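
The filename heuristics can be pictured with a toy pattern matcher; this regular expression is a hypothetical stand-in for the pipeline's actual logic:

import re

PANEL_HINT = re.compile(
    r"(?:panel[_ ]?|fig(?:ure)?\d+[_-]?)([A-H])(?=[_\-.]|$)", re.IGNORECASE
)

def guess_panels(sd_file: str) -> list:
    """Return candidate panel labels found in a source data file name."""
    return [match.upper() for match in PANEL_HINT.findall(sd_file)]

print(guess_panels("Figure1_B_western_blot.xlsx"))   # ['B']
print(guess_panels("fig2-a_and_panel_c_stats.csv"))  # ['A', 'C'] (one file, two panels)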

6. Object Detection & Panel Matching

  • Purpose: Detect individual panels within figures and match with captions
  • Process:
    • Panel Detection:

      • Uses a trained YOLOv10 model to detect panel regions within figure images (see the sketch after this list)
      • Identifies bounding boxes for each panel with confidence scores
      • Handles complex multi-panel figures with varying layouts
    • AI-Powered Caption Matching:

      • For each detected panel region, extracts the panel image
      • Uses AI vision capabilities to analyze panel contents
      • Matches visual content with appropriate panel descriptions from the caption
      • Resolves conflicts when multiple detections map to the same panel label
      • Assigns sequential labels (A, B, C...) to any additional detected panels
      • Preserves original caption information while adding visual context
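
A hedged sketch of the detection call, mirroring the YOLOv10-with-fallback import noted in changelog 3.1.3; the weights and image paths are hypothetical:

try:
    from ultralytics import YOLOv10 as YOLO  # builds that still expose YOLOv10
except ImportError:
    from ultralytics import YOLO  # unified ultralytics entrypoint

model = YOLO("models/panel_detection.pt")  # hypothetical weights file
results = model("data/debug_images/figure_1.png")

for box, conf in zip(results[0].boxes.xyxy, results[0].boxes.conf):
    x1, y1, x2, y2 = (float(v) for v in box)
    print(f"panel bbox=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}) confidence={float(conf):.2f}")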

7. Output Generation & Verification

  • Purpose: Compile all processed information and verify quality
  • Process:
    • Assembles the complete manuscript structure with all enriched information
    • Calculates hallucination scores to verify content authenticity
    • Cleans up source data file references
    • Computes token usage and cost metrics for AI operations
    • Generates structured JSON output according to the defined schema

Throughout these steps, the pipeline leverages AI capabilities to enhance the accuracy of caption extraction and panel matching. The process is configurable through the config.yaml file, allowing for adjustments in AI models, detection parameters, and debug options.

In debug mode, the pipeline can be configured to process only the first figure, saving time during development and testing. Debug images and additional logs are saved to help with troubleshooting and refinement of the curation process.

Verbatim Extraction Verification

The pipeline now uses an integrated verification approach to ensure text extractions are verbatim rather than hallucinated or modified by the AI.

How It Works

Instead of post-processing comparison with fuzzy matching, the system now:

  1. Uses AI Agent Tools: Specialized verification tools check if extractions are verbatim during the AI processing, not afterward.
  2. Multi-Attempt Verification: If verification fails, the AI retries up to 5 times to produce a verbatim extraction (sketched after the tool list below).
  3. Explicit Verbatim Flagging: Each extraction includes an is_verbatim field indicating verification success.

Implemented Verification Tools

Three main verification tools have been implemented:

  1. verify_caption_extraction: Ensures figure captions are extracted verbatim from the manuscript text.
  2. verify_panel_sequence: Confirms panel labels follow a complete sequence without gaps (A, B, C... not A, C, D...).
  3. General verification tool: For sections like figure legends and data availability.
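
Conceptually, the retry-and-flag behavior looks like the sketch below; the extractor argument stands in for the real AI call, and the naive substring check is only illustrative:

MAX_ATTEMPTS = 5  # the pipeline retries up to 5 times

def verify_verbatim(extraction: str, source: str) -> bool:
    """Naive check: the extracted text must occur exactly in the source."""
    return extraction in source

def extract_with_verification(source: str, ai_extract) -> dict:
    extraction = ""
    for _ in range(MAX_ATTEMPTS):
        extraction = ai_extract(source)  # one AI extraction attempt
        if verify_verbatim(extraction, source):
            return {"text": extraction, "is_verbatim": True}
    # all attempts failed: keep the last extraction but flag it
    return {"text": extraction, "is_verbatim": False}

doc = "Figure 1. Cells were imaged after 24 h."
print(extract_with_verification(doc, lambda s: s.split(". ", 1)[0] + "."))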

Output Schema

{
  "manuscript_id": "string",
  "xml": "string",
  "docx": "string",
  "pdf": "string",
  "appendix": ["string"],
  "figures": [{
    "figure_label": "string",
    "img_files": ["string"],
    "sd_files": ["string"],
    "panels": [{
      "panel_label": "string",
      "panel_caption": "string",
      "panel_bbox": [number, number, number, number],
      "confidence": number,
      "ai_response": "string",
      "sd_files": ["string"],
      "hallucination_score": number
    }],
    "unassigned_sd_files": ["string"],
    "duplicated_panels": ["object"],
    "ai_response_panel_source_assign": "string",
    "hallucination_score": number,
    "figure_caption": "string",
    "caption_title": "string"
  }],
  "ai_config": {
    "provider": "string",
    "model": "string",
    "temperature": number,
    "top_p": number,
    "max_tokens": number
  },
  "data_availability": {
    "section_text": "string",
    "data_sources": [
      {
        "database": "string",
        "accession_number": "string",
        "url": "string"
      }
    ]
  },
  "errors": ["string"],
  "ai_response_locate_captions": "string",
  "ai_response_extract_individual_captions": "string",
  "non_associated_sd_files": ["string"],
  "locate_captions_hallucination_score": number,
  "locate_data_section_hallucination_score": number,
  "ai_provider": "string",
  "cost": {
    "extract_sections": {
      "prompt_tokens": number,
      "completion_tokens": number,
      "total_tokens": number,
      "cost": number
    },
    "extract_individual_captions": {
      "prompt_tokens": number,
      "completion_tokens": number,
      "total_tokens": number,
      "cost": number
    },
    "assign_panel_source": {
      "prompt_tokens": number,
      "completion_tokens": number,
      "total_tokens": number,
      "cost": number
    },
    "match_caption_panel": {
      "prompt_tokens": number,
      "completion_tokens": number,
      "total_tokens": number,
      "cost": number
    },
    "extract_data_sources": {
      "prompt_tokens": number,
      "completion_tokens": number,
      "total_tokens": number,
      "cost": number
    },
    "total": {
      "prompt_tokens": number,
      "completion_tokens": number,
      "total_tokens": number,
      "cost": number
    }
  }
}

Schema Explanation

  • manuscript_id: Unique identifier for the manuscript
  • xml: Path to the XML file in the ZIP archive
  • docx: Path to the DOCX file in the ZIP archive
  • pdf: Path to the PDF file in the ZIP archive
  • appendix: List of paths to appendix files
  • figures: Array of figure objects, each containing:
    • figure_label: Label of the figure (e.g., "Figure 1")
    • img_files: List of paths to image files for this figure
    • sd_files: List of paths to source data files for this figure
    • figure_caption: Full caption of the figure
    • caption_title: Title of the figure caption
    • hallucination_score: Score between 0-1 indicating possibility of hallucination (0 = verified content, 1 = likely hallucinated)
    • panels: Array of panel objects, each containing:
      • panel_label: Label of the panel (e.g., "A", "B", "C")
      • panel_caption: Caption specific to this panel
      • panel_bbox: Bounding box coordinates of the panel [x1, y1, x2, y2] in relative format
      • confidence: Confidence score of the panel detection
      • ai_response: Raw AI response for this panel
      • sd_files: List of source data files specific to this panel
      • hallucination_score: Score between 0-1 indicating possibility of hallucination (0 = verified content, 1 = likely hallucinated)
    • unassigned_sd_files: Source data files not assigned to specific panels
    • duplicated_panels: List of panels that appear to be duplicates
    • ai_response_panel_source_assign: AI response for panel source assignment
  • errors: List of error messages encountered during processing
  • ai_response_locate_captions: Raw AI response for locating figure captions
  • ai_response_extract_individual_captions: Raw AI response for extracting individual captions
  • non_associated_sd_files: List of source data files not associated with any specific figure or panel
  • locate_captions_hallucination_score: Score between 0-1 indicating possibility of hallucination in the captions extraction
  • locate_data_section_hallucination_score: Score between 0-1 indicating possibility of hallucination in the data section extraction
  • ai_config: Configuration details of the AI processing
  • data_availability: Information about data availability
    • section_text: Text describing the data availability section
    • data_sources: List of data sources with database, accession number, and URL
      • database: Name of the database
      • accession_number: Accession number or identifier
      • url: URL to the data source (can also be a DOI)
  • ai_provider: Identifier for the AI provider used
  • cost: Detailed breakdown of token usage and costs for each processing step
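
A minimal consumer of this output, using only the fields documented above:

import json

with open("data/output/EMM-2023-18636.json") as fh:
    result = json.load(fh)

for figure in result["figures"]:
    labels = [panel["panel_label"] for panel in figure["panels"]]
    print(f'{figure["figure_label"]}: panels {labels}')

totals = result["cost"]["total"]
print(f'total tokens: {totals["total_tokens"]}, cost: ${totals["cost"]:.4f}')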

Code Formatting and Linting

To format and lint your code, run the following command:

# Build the docker-compose image
docker-compose build format
# Run the formatting and linting checks
docker-compose run --rm format

Quality Control (QC) Pipeline

The QC pipeline provides automated, configurable, and extensible quality assessment of scientific figures and data presentation. It can be run independently or as part of the main curation workflow.

Key Features

  • Schema-based analyzer detection: Automatically determines test types (panel/figure/document) by analyzing Pydantic model structures from schemas (see the sketch after this list)
  • Intelligent fallback logic: Uses schema analysis first, then config-based detection, then naming conventions
  • Config-driven test metadata and versioning: Test names/checklist types/versioning are defined in config.qc.yaml, with optional Langfuse prompt bindings.
  • Flexible prompt file naming: Supports arbitrary prompt filenames (e.g., prompt.3.txt, custom_prompt.txt) instead of fixed naming
  • Benchmark.json metadata integration: Enriches test metadata with descriptions and examples from mmQC repository
  • Word document processing: ManuscriptQCAnalyzer can process actual Word documents for document-level analysis
  • Unified output format: Uses qc_checks and check_name structure with enhanced metadata
  • Hierarchical test organization: Tests can be defined at panel, figure, or document level
  • Generic test implementation: New tests can be added without writing custom code
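
A hedged sketch of the schema-analysis idea; the real detection logic is more involved, and these model names are illustrative:

from typing import get_args, get_origin
from pydantic import BaseModel

class PanelCheck(BaseModel):
    panel_label: str
    passed: bool

class PanelCheckList(BaseModel):
    outputs: list[PanelCheck]

def infer_level(model) -> str:
    """Guess panel vs. figure level from a response model's shape."""
    for field in model.model_fields.values():
        if get_origin(field.annotation) is list:
            (item,) = get_args(field.annotation)
            if isinstance(item, type) and issubclass(item, BaseModel) and "panel_label" in item.model_fields:
                return "panel"  # list schema carrying panel_label
    return "figure"  # object schema default (document detection omitted here)

print(infer_level(PanelCheckList))  # -> "panel"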

Running the QC Pipeline

poetry run python -m src.soda_curation.qc.main \
  --config config.qc.yaml \
  --figure-data data/output/your_figure_data.json \
  --zip-structure data/output/your_zip_structure.pickle \
  --output data/output/qc_results.json

poetry run python -m src.soda_curation.qc.main \
  --config config.qc.yaml \
  --figure-data data/output/EMM-2023-18636_figure_data.json \
  --zip-structure data/output/EMM-2023-18636_zip_structure.pickle \
  --output data/output/qc_results.json

Configuration Notes (v3.1.1)

config.qc.yaml is now aligned with the Langfuse-based prompt workflow:

  1. qc_check_metadata is the canonical key for panel/figure/document checks.
  2. Prompt version numbers are no longer required in this repository config.
  3. Prompts can be resolved via Langfuse using langfuse_name where needed.
  4. Schema-based analyzer detection remains active (panel/figure/document inferred from response schemas).
  5. Provider-agnostic QC model calls support openai, anthropic, and gemini.
  6. Agentic mode is provider-specific: OpenAI supports full configured tool mode; Anthropic supports provider-native built-in tools (for example web_search_*, web_fetch_*) in QC; Gemini currently logs a warning and runs non-agentic.
  7. Schema-equivalence enforcement is available: when enabled, QC fails fast if any test cannot use a Langfuse schema-derived model.

What 3.1.1 Adds

  • Provider abstraction layer in QC: The QC pipeline now uses a normalized provider contract and factory, so analyzers stay provider-agnostic (a sketch follows this list).
  • Three provider adapters: OpenAI, Anthropic, and Gemini are available under one API surface.
  • Agentic mode wiring: OpenAI supports configured tool mode; Anthropic supports built-in server tools via model_config.tools; Gemini currently remains non-agentic.
  • Langfuse runtime hint compatibility: Optional runtime hints (e.g., agentic, model_config, tool config) can be merged from prompt config while preserving existing prompt/schema behavior.
  • Stronger documentation + config examples: Clear support matrix and per-test override examples for rollout.
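
The provider contract can be pictured as follows; class and function names are hypothetical, not the package's actual API:

from abc import ABC, abstractmethod

class QCProvider(ABC):
    @abstractmethod
    def run_check(self, prompt: str, schema: dict) -> dict:
        """Execute one QC check and return structured output."""

class OpenAIQCProvider(QCProvider):
    def run_check(self, prompt: str, schema: dict) -> dict:
        return {"provider": "openai", "prompt": prompt}  # SDK call elided

PROVIDERS = {"openai": OpenAIQCProvider}

def create_provider(name: str) -> QCProvider:
    try:
        return PROVIDERS[name]()  # analyzers only ever see this factory
    except KeyError:
        raise ValueError(f"unsupported ai_provider: {name}") from None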

Provider Setup

  • ai_provider: "openai" requires OPENAI_API_KEY.
  • ai_provider: "anthropic" requires ANTHROPIC_API_KEY.
  • ai_provider: "gemini" requires GOOGLE_API_KEY.
  • Langfuse-backed prompts/schemas require LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST.

Agentic Support Matrix

Provider    Structured Output    Agentic / Tools
OpenAI      Yes                  Yes (agentic: true + model_config.tools)
Anthropic   Yes                  Partial: built-in Anthropic tools via model_config.tools (for example web_search_*, web_fetch_*)
Gemini      Yes                  Not yet (logs warning, runs non-agentic)

Note: Anthropic custom client-executed function tools (manual tool-result loop) are not wired yet in QC. Current Anthropic agentic support is limited to provider-native built-in tools.

Anthropic agentic override example:

default:
  pipeline:
    external_data_url_validation_agentic:
      anthropic:
        model: "claude-sonnet-4-6"
        agentic: true
        model_config:
          tools:
            - type: "web_search_20250305"
              name: "web_search"
              max_uses: 5
          tool_choice: "auto"

Minimal example:

qc_version: "3.1.1"
ai_provider: "openai"
enforce_langfuse_schema_equivalence: true
qc_check_metadata:
  panel:
    plot_axis_units:
      name: "Plot Axis Units"
      checklist_type: "fig-checklist"
    stat_significance_level:
      name: "Statistical Significance Level Defined"
      checklist_type: "fig-checklist"
      langfuse_name: "checklists/fig-checklist/stat-significant-level"
  document:
    section_order:
      name: "Manuscript Structure Check"
      checklist_type: "doc-checklist"

Complete Configuration Structure

Here's the full structure of a modern config.qc.yaml:

qc_version: "3.1.1"
ai_provider: "openai"
enforce_langfuse_schema_equivalence: true
qc_check_metadata:
  panel:
    error_bars_defined:
      name: "Error Bars Defined"
      checklist_type: "fig-checklist"
    individual_data_points:
      name: "Individual Data Points Displayed"
      checklist_type: "fig-checklist"
      langfuse_name: "checklists/fig-checklist/individual-data-points-copy"
    plot_axis_units:
      name: "Plot Axis Units"
      checklist_type: "fig-checklist"
  figure:
    # Figure-level tests go here
  document:
    DAS_present_and_correct:
      name: "Data Availability Section Present and Correct"
      checklist_type: "doc-checklist"
    section_order:
      name: "Manuscript Structure Check"
      checklist_type: "doc-checklist"

# OpenAI configuration
default: &default
  openai:
    model: "gpt-5-mini"
    temperature: 0.1
    top_p: 1.0
    max_tokens: 2048
    frequency_penalty: 0.0
    presence_penalty: 0.0
    json_mode: true
  anthropic:
    model: "claude-sonnet-4-6"
    temperature: 0.1
    max_tokens: 4096
  gemini:
    model: "gemini-2.5-flash"
    temperature: 0.1
    top_p: 1.0
    max_tokens: 2048
  pipeline: {}

# Optional: per-test OpenAI agentic override (example)
# default:
#   pipeline:
#     external_data_url_validation_agentic:
#       openai:
#         model: "gpt-5"
#         agentic: true
#         model_config:
#           tools:
#             - type: "web_search_preview"
#           tool_choice: "auto"

Debugging QC Results

The debug visualizer helps you inspect what the AI is analyzing by extracting figure images and captions for visual inspection.

Extract Figure Images

# Extract all figures from a dataset
poetry run python -m src.soda_curation.debug_visualizer \
  data/output/EMM-2023-18636_figure_data.json \
  --output-dir data/debug_images \
  --prefix EMM-2023-18636

# Just analyze image properties without extracting
poetry run python -m src.soda_curation.debug_visualizer \
  data/output/EMM-2023-18636_figure_data.json \
  --analyze

Analyze QC Results Quality

Compare QC results against actual figure content to identify potential issues:

# Run QC analysis to identify issues
poetry run python -m src.soda_curation.qc_analysis \
  data/output/qc_results.json \
  data/output/EMM-2023-18636_figure_data.json \
  --report data/debug_images/qc_analysis_report.html

Debug outputs include:

  • Individual PNG files - Exact images the AI analyzes
  • Caption text files - Complete captions for each figure
  • HTML summary - Browser-viewable overview of all figures
  • QC analysis report - Detailed comparison of QC results vs actual content

This helps identify issues like:

  • Missing statistical notation detection (mean ± SD)
  • Incorrect sample size parsing (n=3)
  • Caption parsing failures
  • Panel identification problems

Output Example (Top Level)

{
  "qc_version": "3.1.1",
  "qc_check_metadata": {
    "plot_axis_units": {
      "name": "Plot Axis Units",
      "checklist_type": "fig-checklist",
      "permalink": "https://github.com/source-data/soda-mmQC"
    },
    "stat_test": {
      "name": "Statistical Test Mentioned",
      "checklist_type": "fig-checklist",
      "permalink": "https://github.com/source-data/soda-mmQC"
    }
  },
  "figures": {
    "figure_1": {
      "panels": [
        {
          "panel_label": "A",
          "qc_checks": [
            {
              "check_name": "plot_axis_units",
              "passed": true,
              "model_output": {
                "panel_label": "A"
              }
            }
          ]
        }
      ]
    }
  },
  "status": "success"
}

How to Add or Update QC Tests

Step-by-Step: Adding a New QC Test

To add a new test to the QC pipeline, follow these steps:

  1. Define test metadata in the config:

    • Open config.qc.yaml.

    • Under qc_check_metadata in the appropriate level (panel, figure, document), add a new entry for your test. Include name and checklist_type (plus optional langfuse_name override). Example:

      qc_check_metadata:
        panel:
          my_new_test:
            name: "My New Test"
            checklist_type: "fig-checklist"
            langfuse_name: "checklists/fig-checklist/my-new-test"  # optional
  2. Configure the test in the pipeline section:

    • Still in config.qc.yaml, add provider settings under default and optional per-test overrides under default.pipeline:

      default:
        openai:
          model: "gpt-4o"
          temperature: 0.1
        anthropic:
          model: "claude-sonnet-4-6"
        gemini:
          model: "gemini-2.5-flash"
        # Optional agentic override (OpenAI):
        pipeline:
          external_data_url_validation_agentic:
            openai:
              agentic: true
              model_config:
                tools:
                  - type: "web_search_preview"
                tool_choice: "auto"
  3. Run the QC pipeline:

    • Execute the pipeline as usual. Your new test will be automatically detected and run.
    • The system will generate appropriate test models and integrate results into the output.
  4. (Optional) Add test documentation:

    • Keep the checklist and prompt definitions synchronized with your Langfuse project and/or mmQC documentation.
  5. (Optional) Add custom analyzer:

    • If your test requires specific logic beyond what the generic analyzers provide, create a custom analyzer class that extends the appropriate base class.

Tip: No custom code is required for most tests; the system automatically generates test implementations based on the configuration.

Example QC Config Section

qc_version: "3.1.1"
qc_check_metadata:
  panel:
    plot_axis_units:
      name: "Plot Axis Units"
      checklist_type: "fig-checklist"
    error_bars_defined:
      name: "Error Bars Defined"
      checklist_type: "fig-checklist"
      langfuse_name: "checklists/fig-checklist/error-bars-defined"
  # ... more tests ...

Contributing

Contributions to soda-curation are welcome! Here are some ways you can contribute:

  1. Report bugs or suggest features by opening an issue
  2. Improve documentation
  3. Submit pull requests with bug fixes or new features

Please ensure that your code adheres to the existing style and passes all tests before submitting a pull request.

Development Setup

  1. Fork the repository and clone your fork
  2. Install development dependencies: poetry install --with dev
  3. Activate the virtual environment: poetry shell
  4. Make your changes and add tests for new functionality
  5. Run tests to ensure everything is working: ./run_tests.sh
  6. Submit a pull request with a clear description of your changes

License

This project is licensed under the MIT License. See the LICENSE file for details.


For any questions or issues, please open an issue on the GitHub repository. We appreciate your interest and contributions to the soda-curation project!

Changelog

3.1.4 (2026-04-21)

  • Main branch: Merged feature/langfuse-v3 into main so the Langfuse 3.x line is the default development line.
  • QC prompts: Panel and figure analyzers default the provider user prompt to include the figure caption when Langfuse/runtime hints leave it empty (Figure caption:\n$figure_caption).
  • CI: Set COMPOSE_HTTP_TIMEOUT for Docker Compose in GitHub Actions to reduce flaky image pulls/builds.

3.1.3 (2026-04-21)

  • Caption HTML sanitization: Strip empty or whitespace-only <li>…</li> entries in figure caption HTML before panel extraction so Word/HTML artifacts do not inflate panel label sequences (OpenAI and Anthropic caption paths).
  • Ultralytics compatibility: Panel object detection imports YOLOv10 when present and transparently falls back to YOLO when the ultralytics package exposes only the unified entrypoint.

3.1.2 (2026-03-26)

  • QC / Langfuse: Stricter schema enforcement for QC prompts and outputs (aligns with Langfuse-managed QC expectations).

3.1.1 (2026-03-26)

  • QC multi-provider architecture: Added provider abstraction and factory with OpenAI, Anthropic, and Gemini adapters.
  • Agentic support in QC: OpenAI tool mode and Anthropic built-in server-side tools (web_search_*, web_fetch_*) supported from runtime model_config.
  • Langfuse compatibility improvements: Prompt/schema sourcing preserved with optional runtime-hint mapping for provider execution.
  • Docs + config refresh: README and config.qc.yaml updated with provider setup, support matrix, and non-agentic/agentic examples.
  • Dependency hygiene pass: Cleaned Poetry runtime dependencies by removing duplicate/unused entries.

Git tag subjects (3.1.x line)

These match the first-line summaries of git show <tag>:

  • 3.1.1 — Bumping to v3.1.1; Anthropic server-side tools (web_search, web_fetch).
  • 3.1.2 — Bumping to v3.1.2; strict schema enforcement for QC.
  • 3.1.3 — Bumping to v3.1.3; caption HTML sanitization and ultralytics YOLO/YOLOv10 import fallback.
  • 3.1.4 — Bumping to v3.1.4; mainline merge, QC default user prompt, CI Docker timeout.

3.0.0 (2026-03-25)

  • Langfuse prompt integration: Prompt management moved to Langfuse for the active 3.x line.
  • Main + QC observability: Added structured logging with safe payload summaries (lengths/counts + short excerpts, no full prompt dumps).
  • Fallback visibility: Explicit warning/error classification and reason codes for retries/fallback transitions.
  • Resilience improvements: Recoverable step failures are surfaced as degraded/warning paths instead of silent failures.
  • QC status clarity: QC outputs now include clearer per-check pass/fail handling and skipped-output warnings.

2.5.17 (Production baseline)

  • Stable production line: Current production branch/tag before the 3.x Langfuse prompt migration.
  • Prompt source: Prompt selection remains configuration-driven in this line.

Additional tagged versions (maintenance releases)

  • 2.6.0: Maintenance release in the 2.x line.
  • 2.5.x patch train: 2.5.0 through 2.5.16 (followed by production 2.5.17).
  • Other tagged maintenance versions: 2.3.2, 2.2.0, 2.0.1–2.0.3, 1.2.2–1.2.7, 1.1.1, 1.1.3–1.1.9.

2.4.0 (2026-01-28)

  • Automatic Request Chunking: Implemented automatic chunking for large OpenAI API requests that exceed token limits
  • Token Counting: Added tiktoken integration for accurate token counting with fallback estimation
  • Intelligent Chunking: System automatically detects oversized requests and splits them into manageable chunks
  • Response Merging: Implemented seamless merging of chunked responses for Pydantic models (AsignedFilesList)
  • Model-Specific Limits: Added configurable token limits per model (GPT-5: 270k, GPT-4o: 120k)
  • Error Handling: Enhanced context_length_exceeded error handling with automatic fallback to chunking
  • Comprehensive Testing: Added test suite for chunking functionality with 15+ test cases
  • Fixes: Resolves issue where manuscripts with large source data files (6500+ files) exceeded token limits
  • Backward Compatible: All existing functionality preserved, chunking is transparent to users
  • Document Format Fallback: Added support for multiple manuscript file formats with automatic fallback
    • Primary: DOCX files (in doc/ folder)
    • Fallback 1: PDF files (in pdf/ folder)
    • Fallback 2: LaTeX (.tex), RTF, ODT files (in doc/ folder)
    • Automatic text extraction from all supported formats using pypandoc and PyPDF2

2.3.1 (2025-07-23)

  • Bug Fix: Fixed CI test failures in test_prompt_registry.py due to incorrect attribute references
  • Test Fix: Updated tests to use prompt_file instead of non-existent prompt_number attribute in PromptMetadata
  • Docker Fix: Ensured Docker environment properly builds with ultralytics/YOLOv10 dependencies
  • Quality Assurance: All 245 tests now passing in Docker environment
  • CI/CD: Resolved build failures that were blocking continuous integration

2.3.0 (2025-07-23)

  • Major QC Pipeline Enhancement: Implemented schema-based analyzer detection
  • Schema-Based Type Detection: Automatically determines panel/figure/document test types by analyzing Pydantic model structures
  • Intelligent Analyzer Selection: List schemas with panel_label → panel-level, object schemas → figure-level, document fields → document-level
  • Flexible Prompt File Naming: Support for arbitrary prompt filenames instead of fixed prompt.1.txt pattern
  • Enhanced Benchmark Integration: Rich metadata from benchmark.json files with automatic description enrichment
  • Word Document Processing: ManuscriptQCAnalyzer now processes actual Word documents (.docx) for manuscript analysis
  • Output Format Modernization: Updated to qc_checks/check_name structure removing deprecated qc_tests/test_name
  • Robust Fallback Logic: Schema detection → config-based detection → naming convention fallbacks
  • Complete Test Coverage: 26/26 QC tests passing with comprehensive validation
  • Removed example_class Dependency: No longer requires manual example_class configuration

2.1.0 (2025-07-18)

  • Complete refactoring of the QC pipeline with abstract base classes and factory pattern
  • Added support for hierarchical test organization (panel, figure, document levels)
  • Implemented generic test analyzers for different test levels
  • Removed individual test modules in favor of dynamic test generation
  • Enhanced error handling and robustness for test execution
  • Improved metadata handling with hierarchical config structure
  • Simplified output format by removing redundant fields
  • Added fallback mechanisms for missing schemas and prompts

2.0.4 (2025-07-11)

  • QC pipeline now sources test metadata and version from config file
  • Output includes qc_check_metadata and qc_version fields for all runs
  • Permalinks for each QC test are included in the config and output
  • Improved handling of test status (passed: null when not needed)
  • Patch-level version bump for both soda-curation and QC pipeline

2.0.0 (2025-06-10)

  • Added Quality Control (QC) module for automated manuscript assessment
  • Implemented statistical test reporting analysis for figures
  • Dynamic loading of QC test modules from configuration
  • Automatic generation of QC data during main pipeline execution
  • 90% test coverage achieved across the codebase

1.2.1 (2025-05-14)

  • Case-insensitive panel caption matching added

1.2.0 (2025-05-12)

  • Normalization of database links
  • Permanent links of identifiers.org added

1.1.2 (2025-05-08)

  • Changed the EPS-to-thumbnail conversion logic so results match the UI

1.1.1 (2025-05-06)

  • Semideterministic individual caption extraction

1.1.0 (2025-04-11)

  • Verbatim check tool for agentic AI added to ensure verbatim caption extractions
  • Remove hallucination score from panels
  • Remove original source data files from figure level source data
  • Replaced fuzzy-matching hallucination detection with AI agent verification tools
  • Added tools for verbatim extraction verification of figure captions, sections, and panel sequences
  • Enhanced panel detection to identify all panels in figures regardless of caption mentions
  • Improved panel labeling to ensure sequential labels (A, B, C...) without gaps

1.0.6 - 1.0.7 (2025-03-24)

  • Adjusted test normalization and fixed several benchmarking errors
  • Corrected two ground-truth files that had wrong values for full-caption extraction
  • Updated the README

1.0.5 (2025-03-19)

  • Added more robust normalization for the detection of possible hallucinated text
  • Test coverage added
  • Current test coverage is 92%

1.0.4 (2025-03-18)

  • Reformatted the benchmark code into a package for better readability
  • Added panel-caption matching to the benchmark
  • Improved the handling of .eps and .tif files with ImageMagick and OpenCV
    • In the future, we could use tifffile, but it requires an upgrade to Python 3.10

1.0.3 (2025-03-13)

  • Figures with no panels or a single panel now return a single panel object
  • Ground truth modified to include HTML and removed manuscript id from internal files

1.0.2 (2025-03-12)

  • Updated README.md
  • Addition of hallucination scores to the output of the pipeline
  • Ensure no panel duplication
  • Output captions are generated keeping the HTML text from the DOCX file
  • No separate panels for figures with a single panel
  • Added the panel label position to the panel-matching prompt to improve performance

1.0.1 (2025-03-11)

  • Removal of manuscript ID from the source data file outputs
  • Correction of non-standard encodings in file names

1.0.0 (2025-03-10)

  • Major changes
    • Changes in the configuration and environment definition
    • Pipeline configurable at every single step, allowing for total flexibility in AI model and parameter selection
    • Extraction of data availability and figure legends sections into a single step
    • Fusion of match panel caption and object detection into a single step
  • Minor changes:
    • Support for large images
    • Support for .ai image files
    • Removal of hallucinated files from the list of sd_files in output
    • Ignoring windows cache files from the file assignation

v0.2.3 (2025-02-05)

  • Updated output schema documentation to match actual output structure
  • Improved panel source data assignment with full path preservation
  • Enhanced error handling in panel caption matching
  • Updated AI configuration handling

v0.2.2 (2024-12-02)

  • Changing from the AI assistant API to the Chat API in OpenAI
  • Supporting test, dev and prod environments
  • Addition of tests and CI/CD pipeline
  • Allow for storage of evaluation and model performance
  • Prompts defined in the configuration file, now keeping configuration separately for each pipeline step

v0.2.2 (2025-01-30)

This tag is the stable version of the soda-curation package, which extracts the following information from papers using OpenAI:

  • XML manifest and structure
  • Figure legends
  • Figure panels
  • Associate each figure panel with its corresponding caption text
  • Associate source data at a panel level
  • Extraction of the data availability section
  • Includes model benchmarking on ten annotated ground truth manuscripts

v0.2.1 (2024-12-02)

  • Obsolete tests removed

v0.2.0 (2024-12-02)

  • Addition of benchmarking capabilities
  • Adding manuscripts as string context to the AI models instead of DOCX or PDF files to improve behavior
  • Ground truth data added

v0.1.0 (2024-10-01)

  • Initial release
  • Support for OpenAI and Anthropic AI providers
  • Implemented figure and panel detection
  • Added caption extraction and matching functionality