diff --git a/README.md b/README.md index 46410a0..9d33fb9 100644 --- a/README.md +++ b/README.md @@ -1,29 +1,223 @@ -# FracFeedExtractor - _LLMs for the fraction of feeding predators_ +# FracFeedExtractor — LLMs for the Fraction of Feeding Predators + +**An automated pipeline that reads ecological literature and extracts predator feeding-rate data — turning hundreds of PDFs into a structured, analysis-ready database.** + +![Python Version](https://img.shields.io/badge/python-3.10%2B-blue?style=flat-square) +![Build Status](https://img.shields.io/badge/build-passing-brightgreen?style=flat-square) +![License](https://img.shields.io/badge/license-pending-lightgrey?style=flat-square) +[![GitHub Issues](https://img.shields.io/github/issues/NovakLabOSU/FracFeedExtractor?style=flat-square)](https://github.com/NovakLabOSU/FracFeedExtractor/issues) + +*2025–2026 Oregon State University Senior Capstone Project, in collaboration with Mark Novak.* + +[**→ Try It Yourself**](#get-started) + +--- + +

+ Predator diet surveys form the foundation for estimating the fraction of feeding individuals across species. +

+

Predator diet surveys form the foundation for estimating the fraction of feeding individuals across species.

## Project Description -This project will contribute to validating a novel metric of predator-prey interactions to inform ecosystem-based resource management and ecological theory. It will do so by using a global database of predator diet surveys to train large language models for the purpose of identifying additional publications and extracting key data to overcome the limitations that have hindered the empirical validation of the new metric thus far. +This project contributes to validating a novel metric of predator-prey interactions, the **fraction of feeding individuals**, which has the potential to inform ecosystem-based resource management and ecological theory at scale. Given a folder of PDFs from the ecological literature, our pipeline screens each paper with a trained XGBoost classifier, routes relevant papers to a locally-run LLM for structured data extraction, and exports structured JSON with classification confidence and extraction provenance attached to every record. This overcomes the data-harvesting bottleneck that has hindered validation of the metric. + +--- + +## What is the Fraction of Feeding Individuals? + +The **fraction of feeding individuals** is defined as the proportion of predators found to have non-empty stomachs at the time of sampling. It can be obtained directly from routine predator diet surveys. Research from [Mark Novak's lab at Oregon State University](https://github.com/NovakLabOSU) has established that this metric is analytically linked to a species' metabolic demand, body size, temperature, mortality rate, extinction susceptibility, biological control effectiveness, and population resilience to perturbation, making it a powerful and underutilized parameter for ecosystem-based resource management. + +Despite its potential, the metric is rarely used in practice. 
The underlying data exists across more than a century of published predator diet surveys, but harvesting it by hand from the primary literature is prohibitively slow at the scale required for meaningful cross-species analysis. FracFeedExtractor was built to solve that bottleneck: given a collection of PDFs, it automatically identifies which papers contain usable diet survey data and extracts the key numbers and covariates needed to compute the fraction of feeding individuals. + +--- + +## Key Features + +- **PDF Classification** — A trained XGBoost classifier identifies which scientific publications contain useful predator diet survey data, filtering out irrelevant papers before they reach the LLM. +- **Structured Data Extraction** — Automatically parses empty and non-empty stomach counts and key covariates (predator identity, survey location, survey year, and more) from tabular and narrative text. +- **Batch Processing** — Accepts a single PDF or an entire folder of PDFs in one command. +- **Provenance & Uncertainty Reporting** — Every result includes the classifier confidence score and an extraction provenance descriptor identifying the source sentence or table for each field, making downstream QA straightforward. +- **Locally-Run LLM** — The extraction model runs entirely on-device via [Ollama](https://ollama.com). Unpublished manuscripts and proprietary datasets never leave the researcher's environment. + +--- ## Motivation -Predator–prey interactions are central to ecosystem stability, yet a key parameter that quantifies predator-prey interaction strength—predator feeding rates—is rarely used in practice because the data required to estimate it are difficult to obtain. 
Our research has shown that the fraction of feeding individuals, defined as the proportion of predators with non‑empty stomachs, can be easily obtained from routine predator diet surveys and is analytically linked to a species' metabolic demand, body size, temperature, mortality rate, extinction susceptibility, biological control effectiveness, and population resilience to perturbations. To validate this metric for mainstream resource management and ecological theory, a scalable method is needed to harvest the untapped data that exists in the vast ecological literature. -The project will train large language models for two tasks: 1) classifying scientific publications as containing useful predator diet survey information, and 2) extracting the numbers of empty- and non-empty stomachs counted and key covariates (predator identity, survey location, survey year, etc.). By fine-tuning with a large database of hand-annotated publications containing diet surveys conducted across the globe over the last 135 years, the models will learn to recognize relevant publications and parse tabular and narrative data into structured fields. The resulting pipeline will enable the generation of a comprehensive, covariate‑rich database for subsequent analyses and applications. +Predator-prey interactions are central to ecosystem stability, yet predator feeding rates are rarely used in practice because the data required to estimate them are difficult to obtain at scale. To validate the fraction of feeding individuals metric for mainstream resource management and ecological theory, a scalable method is needed to harvest the untapped data that already exists in the vast ecological literature, accumulated over more than a century of field surveys conducted across the globe. 
+ +We trained an XGBoost classifier on the [FracFeed global database](https://github.com/marknovak/FracFeed_DB), a hand-annotated collection of predator diet surveys spanning 135 years and multiple continents, to recognize relevant publications so the LLM only processes papers likely to yield usable data. An LLM running locally via Ollama then extracts the numbers of empty and non-empty stomachs and key covariates from each relevant paper. The resulting pipeline enables the generation of a comprehensive database for subsequent analyses and applications. + +--- + + +## System Architecture + +Our two-stage pipeline combines a lightweight classifier with a locally-run LLM to minimize cost and runtime at scale. The classifier acts as a gate — only papers it scores as useful proceed to the more expensive extraction step. + +
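The gating step described above amounts to a simple threshold on the classifier's confidence. The sketch below is illustrative only; the function name and signature are not the repository's actual API, and the default threshold mirrors the pipeline's `--confidence-threshold` flag:

```python
def route_papers(scored_papers, threshold=0.70):
    """Split (filename, confidence) pairs into an extraction queue and a skip list.

    `confidence` is the classifier's probability that a paper is useful.
    Illustrative sketch only -- not the repository's actual API.
    """
    to_extract, skipped = [], []
    for name, confidence in scored_papers:
        # The confidence score is kept either way, so skipped papers
        # still carry provenance for later review.
        if confidence >= threshold:
            to_extract.append((name, confidence))
        else:
            skipped.append((name, confidence))
    return to_extract, skipped
```

With the four confidence scores shown in the demo run further down (0.91, 0.87, 0.38, 0.95), three papers would be queued for extraction and one would be skipped with its score recorded.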

+ Architecture diagram showing the FracFeedExtractor pipeline: PDF input flows through text extraction, cleaning, XGBoost classification, and LLM extraction to produce structured JSON and CSV output +

+ +

Five-stage pipeline architecture. PDF files are preprocessed, filtered, and classified before useful papers proceed to LLM data extraction and structured output.

+ +The pipeline consists of the following components: + +1. **PDF Text Extraction** — PyMuPDF parses each PDF; Tesseract OCR handles scanned documents. +2. **Text Cleaning & Section Filtering** — References, captions, and irrelevant paragraphs are stripped to reduce noise before classification. +3. **XGBoost Classifier** — TF-IDF features feed a trained XGBoost model that labels each paper useful or not useful and reports a confidence score. +4. **LLM Extraction** — Relevant papers are passed to a locally-run LLM (via Ollama) with a structured prompt, returning a `PredatorDietMetrics` JSON object containing stomach counts, predator identity, survey location, and survey year. +5. **Output** — Per-paper JSON files and a pipeline summary CSV are written to `data/results/`. + +--- + +## Pipeline Demo + +Below is a condensed view of a typical pipeline run on a folder of PDFs. The classifier scores each paper and routes it accordingly: papers below the confidence threshold are skipped, while relevant papers proceed to LLM extraction. + +

+ Terminal output showing FracFeedExtractor classifying four PDFs: three marked useful with extracted species data, one marked not useful and skipped +

+

FracFeedExtractor pipeline run on a folder of PDFs.

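Each extracted record reduces to the metric itself with one line of arithmetic. As a worked example (a minimal sketch, not the pipeline's internal code), using the stomach counts from the sample record in the Contributing Guide:

```python
def fraction_feeding(num_nonempty: int, sample_size: int) -> float:
    """Fraction of feeding individuals: the proportion of sampled
    predators whose stomachs were non-empty at the time of sampling."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return num_nonempty / sample_size

# 158 non-empty stomachs out of 200 sampled (the Contributing Guide's sample record)
print(fraction_feeding(158, 200))  # → 0.79
```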
+ +--- + +## Model Performance + +The classifier was evaluated on a held-out test set of 234 papers. It achieves **94% accuracy** across both relevant and irrelevant publications, with strong and balanced precision and recall. + +| Class | Precision | Recall | F1-score | Support | +|---|---|---|---|---| +| Not useful (0) | 0.96 | 0.91 | 0.93 | 110 | +| Useful (1) | 0.92 | 0.97 | 0.94 | 124 | +| **Overall** | **0.94** | **0.94** | **0.94** | **234** | + +
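As a sanity check, the per-class figures follow from a standard confusion-matrix calculation. The counts used here are reconstructed for illustration only (TP=120, FP=10, FN=4 for the useful class reproduces the reported values to two decimals); they are not the actual test-set predictions:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard binary precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts for the "useful" class, consistent with the table above
p, r, f1 = precision_recall_f1(tp=120, fp=10, fn=4)
print(round(p, 2), round(r, 2), round(f1, 2))  # → 0.92 0.97 0.94
```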

+ XGBoost training curve showing train and validation log-loss converging over 600 boosting rounds, with minimum validation loss of 0.193 at the best iteration (round 585) +

+ +

XGBoost classifier training curve. Log-loss for train (blue) and validation (dashed orange) sets across 600 boosting rounds. Early stopping selected round 585 as the best iteration (min val loss: 0.193).

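The early-stopping behavior visible in the curve can be sketched in a few lines of plain Python (an illustrative re-implementation; the actual training uses XGBoost's built-in `early_stopping_rounds`, which the Contributing Guide lists with a default of 20):

```python
def best_iteration(val_losses, patience=20):
    """Track validation loss per boosting round; stop once `patience`
    rounds pass with no improvement. Returns (best_round, best_loss)."""
    best_round, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = i, loss
        elif i - best_round >= patience:
            break  # early stop: no improvement for `patience` rounds
    return best_round, best_loss

# Toy loss curve: improves for three rounds, then plateaus
losses = [0.9, 0.7, 0.5] + [0.51] * 30
print(best_iteration(losses))  # → (2, 0.5)
```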
+ +--- + +## Get Started + +### Prerequisites + +| Dependency | Notes | +|---|---| +| Python 3.10+ | Tested on 3.10–3.12 | +| [Ollama](https://ollama.com) | Must be running locally; 8 GB RAM minimum, 16 GB recommended | +| Tesseract OCR | System-level install required for scanned PDFs — see [Contributing Guide](documentation/CONTRIBUTING.md) for platform-specific instructions | + +Pull the default extraction model before running: + +```bash +ollama pull llama3.1:8b # default extraction model (~5 GB) +ollama list +``` + +### Installation + +```bash +# Linux +git clone https://github.com/NovakLabOSU/FracFeedExtractor.git +cd FracFeedExtractor +python3 -m venv venv +source venv/bin/activate +pip install -r requirements.txt +``` + +```bash +# Windows PowerShell +git clone https://github.com/NovakLabOSU/FracFeedExtractor.git +cd FracFeedExtractor +py -m venv venv +.\venv\Scripts\Activate.ps1 +pip install -r requirements.txt +``` + +### Quick Start + +```bash +# Classify and extract from a folder of PDFs +python classify_extract.py path/to/pdfs/ + +# Adjust the LLM model or confidence threshold +python classify_extract.py path/to/pdfs/ --llm-model qwen2.5:7b --confidence-threshold 0.80 +``` + +Results are written to `data/results/metrics/` (per-paper JSON) and `data/results/summaries/` (pipeline CSV). + +> For virtual environment setup, full CLI flag reference, and contribution guidelines, see the [Contributing Guide](documentation/CONTRIBUTING.md). + +--- + +## Data Source + +We trained the classifier on the [FracFeed global database](https://github.com/marknovak/FracFeed_DB) — a hand-annotated collection of predator diet surveys from the primary ecological literature. + +--- + +## Team + + + + + + + + +
+ + GitHub avatar for Mark Novak +
+ Mark Novak
+ Project Lead
+ Mark.Novak@oregonstate.edu +
+ + GitHub avatar for Sean Clayton +
+ Sean Clayton
+ ML Pipeline & Backend
+ claytose@oregonstate.edu +
+ + GitHub avatar for Zahra Alsulaimawi +
+ Zahra Alsulaimawi
+ LLM Integration & Evaluation
+ alsulaza@oregonstate.edu +
+ + GitHub avatar for Raymond Cen +
+ Raymond Cen
+ Data Processing & Testing
+ cenra@oregonstate.edu +
+ + GitHub avatar for Bradley Rule +
+ Bradley Rule
+ PDF Extraction & OCR
+ ruleb@oregonstate.edu +
+ +--- + +## Questions and Feedback +Found a bug or have a question? +[Open an issue on GitHub](https://github.com/NovakLabOSU/FracFeedExtractor/issues) -## Objectives/Deliverables -1. A fully trained, fine‑tuned Python implementation of a large language model (or pair of models) that ingests a publication's pdf and returns a classification and/or the extracted data as well as descriptors of the classification and extraction provenance and uncertainty. -2. A Python pipeline that accepts a single pdf or a folder of pdfs, parses the text of each, queries the model for each, and exports the classification and data extraction results with clear provenance and uncertainty. -3. A clean, reproducible training and evaluation pipeline (including pdf preprocessing and model evaluation metrics) documented in a GitHub repository. -4. A technical report detailing model architecture, training procedure, validation results, and guidance for future extensions. +--- -## Data sources -[FracFeed: Global database of the fraction of feeding predators](https://github.com/marknovak/FracFeed_DB) +## Documentation -## Team Members -- Mark Novak – Project Owner/Lead -- Sean Clayton – Contributor -- Zahra Zahir Ahmed Alsulaimawi – Contributor -- Raymond Cen – Contributor -- Bradley Rule – Contributor +- [Contributing Guide](documentation/CONTRIBUTING.md) — setup, CLI reference, and contribution workflow +- [System Architecture Diagram](assets/architecture.svg) -License: Pending partner confirmation +*License: Pending partner confirmation.* diff --git a/assets/architecture.svg b/assets/architecture.svg new file mode 100644 index 0000000..4d9099d --- /dev/null +++ b/assets/architecture.svg @@ -0,0 +1,160 @@ + + + + + + + + + + + + + + + + + + + + + STAGE I · INPUT + + + + STAGE II · PREPROCESSING + + + + STAGE III · CLASSIFICATION + + + + STAGE IV · EXTRACTION + + + + STAGE V · OUTPUT + + + + + + PDF Files + Single file or folder + + + + Text Extraction + PyMuPDF · Tesseract OCR + + + + + + + 
Text Cleaning + Section Filtering + + + + TF-IDF Vectorizer + Feature extraction + + + + + + + XGBoost Classifier + Binary classification + + + + LLM Extraction + Ollama (local) + + + + JSON + Structured metrics + + + + Pipeline Summary + Aggregated CSV + + + + + + + + + + + + Useful + + + + + + + + + + + Not Useful + + + confidence score recorded — paper skipped + + diff --git a/assets/fraction-feeding-preds.jfif b/assets/fraction-feeding-preds.jfif new file mode 100644 index 0000000..96216d7 Binary files /dev/null and b/assets/fraction-feeding-preds.jfif differ diff --git a/assets/terminal_demo.svg b/assets/terminal_demo.svg new file mode 100644 index 0000000..beb0a27 --- /dev/null +++ b/assets/terminal_demo.svg @@ -0,0 +1,97 @@ + + + + + + + + + + + + + + + + classify_extract.py + + + + $ + python classify_extract.py data/ + + + + + [1/4] + Bakaloudis_2012.pdf → + useful + (confidence: 0.91) + + + + + Extracted: + Buteo buteo | Greece | 2001–2006 | 143 stomachs (88 non-empty) + + + + + [2/4] + Hales_2008.pdf → + useful + (confidence: 0.87) + + + + + Extracted: + Gadus morhua | North Sea | 2005–2007 | 312 stomachs (201 non-empty) + + + + + [3/4] + Barry_1996.pdf → + not useful + (confidence: 0.38) + + + + + Skipped — confidence below threshold (0.70). 
+ + + + + [4/4] + Insley_2021.pdf → + useful + (confidence: 0.95) + + + + + Extracted: + Enhydra lutris | Alaska | 2018–2020 | 97 stomachs (82 non-empty) + + + + Results written to: + + + + data/results/metrics/ + ← per-paper JSON + + + + + data/results/summaries/ + ← pipeline_summary.csv + + + diff --git a/assets/training_curve.png b/assets/training_curve.png new file mode 100644 index 0000000..68fcc96 Binary files /dev/null and b/assets/training_curve.png differ diff --git a/documentation/CONTRIBUTING.md b/documentation/CONTRIBUTING.md index 7702319..7e95624 100644 --- a/documentation/CONTRIBUTING.md +++ b/documentation/CONTRIBUTING.md @@ -3,23 +3,33 @@ How to set up, code, test, review, and release so contributions meet our Definit of Done. ## Code of Conduct All contributors must follow the Oregon State University Student Code of Conduct and the team’s charter agreement. -* Treat all collaborators with respect and proffessionalism. +* Treat all collaborators with respect and professionalism. * Provide decent participation during meetings and reviews. * Raise the issue privately with the team first. * Issues of academic or ethical concern should be reported directly to the instructor. * Report any inappropriate or unprofessional behavior to the TA, instructor or project manager. **Owner**: Bradley Rule -**Next Review**: 11/26/25 +**Next Review**: 05/15/26 ## Getting Started +> **Pipeline diagram**: [`documentation/architecture.png`](architecture.png) - visual overview of the full extraction pipeline. 
+ * ### Prerequisites * Python 3.10 + * pip installed * Access to GitHub repository + * [Ollama](https://ollama.com) installed and running locally + * Minimum hardware: 8 GB RAM (16 GB recommended for `llama3.1:8b`) + * Pull the required models before running the classify/extract pipeline: + ```bash + ollama pull llama3.1:8b # default extraction model (~5 GB) + ollama pull qwen2.5:7b # alternative model (~5 GB) + ``` + * Verify Ollama is running: `ollama list` * ### Setup Instructions ``` - git clone https://github.com/marknovak/FracFeedExtractor.git + git clone https://github.com/NovakLabOSU/FracFeedExtractor.git cd FracFeedExtractor python -m venv venv source venv/bin/activate @@ -46,12 +56,60 @@ All contributors must follow the Oregon State University Student Code of Conduct python scripts/full_pipeline.py --api ``` * Note: You will need access to the .env file -* ### Enviorment Variables +* ### Running the classify/extract pipeline + Use `classify_extract.py` to classify PDFs and extract structured diet data in a single step. + Requires trained model artifacts in `src/model/models/` (run the full pipeline first, + or see [Retraining the Classifier](#retraining-the-classifier-and-extending-extraction) below). 
+ ```bash + # Single PDF + python classify_extract.py path/to/file.pdf + + # Folder of PDFs (sequential) + python classify_extract.py path/to/pdfs/ + + # All options + python classify_extract.py path/to/pdfs/ \ + --model-dir src/model/models \ + --llm-model llama3.1:8b \ + --output-dir data/results \ + --confidence-threshold 0.70 \ + --max-chars 12000 \ + --num-ctx 4096 \ + --workers 4 + ``` + | Flag | Default | Description | + |------|---------|-------------| + | `--model-dir` | `src/model/models` | Directory containing classifier artifacts | + | `--llm-model` | `llama3.1:8b` | Ollama model for extraction | + | `--output-dir` | `data/results` | Destination for JSON results and summary CSV | + | `--confidence-threshold` | `0.70` | Probability threshold for "useful" classification | + | `--max-chars` | `12000` | Maximum characters sent to the LLM | + | `--num-ctx` | `4096` | Ollama context window size (tokens) | + | `--workers` | `1` | Parallel worker processes (`1` = sequential) | + +* ### Sample Output + Each PDF classified as "useful" produces a JSON file in `data/results/metrics/`: + ```json + { + "source_file": "Smith_2002.pdf", + "extracted_at": "2026-04-24T14:32:00", + "metrics": { + "species_name": "Esox lucius", + "study_location": "Lake Windermere, UK", + "study_date": "1998-2000", + "num_empty_stomachs": 42, + "num_nonempty_stomachs": 158, + "sample_size": 200, + "fraction_feeding": 0.79 + } + } + ``` +* ### Environment Variables * Sensitive information such as API keys will be stored in a local .env file which will be excluded by .gitignore. * Never hardcode secrets **Owner**: Raymond Cen -**Next Review**: 11/26/25 +**Next Review**: 05/15/26 ## Branching & Workflow We will use the feature-branch workflow with all merges handled through PRs. @@ -62,7 +120,7 @@ We will use the feature-branch workflow with all merges handled through PRs. 
* Rebase your working branch with main, and often, before submitting a PR (simpler conflict resolution) **Owner**: Zahra Zahir Ahmed Alsulaimawi -**Next Review**: 11/26/25 +**Next Review**: 05/15/26 ## Issues & Planning Issue titles should start with the following tags to designate intent: @@ -88,7 +146,7 @@ Feature Description: ``` **Owner**: Zahra Zahir Ahmed Alsulaimawi -**Next Review**: 11/26/25 +**Next Review**: 05/15/26 ## Commit Messages We will use the [Conventional Commit](https://www.conventionalcommits.org/en/v1.0.0/) format for clarity and traceability @@ -106,7 +164,7 @@ fix(ci): update pytest command in workflow [#42] docs(readme): add setup section ``` **Owner**: Raymond Cen -**Next Review**: 11/26/25 +**Next Review**: 05/15/26 ## Code Style, Linting & Formatting We use Black for automatic code formatting and Flake8 for linting to maintain consistent style and prevent common Python errors. @@ -123,7 +181,7 @@ We use Black for automatic code formatting and Flake8 for linting to maintain co black src tests ``` -* ### Formatter: Black +* ### Linter: Flake8 - Config file: `pyproject.toml` - Install `pip install flake8` - Local usage: @@ -133,7 +191,7 @@ We use Black for automatic code formatting and Flake8 for linting to maintain co - Configured to ignore line length violations (E501) and other minor style differences. **Owner**: Sean Clayton -**Next Review**: 11/26/25 +**Next Review**: 05/15/26 ## Testing @@ -151,7 +209,7 @@ We use Black for automatic code formatting and Flake8 for linting to maintain co coverage html ``` **Owner**: Sean Clayton -**Next Review**: 11/26/25 +**Next Review**: 05/15/26 * ### Expectations - New features must include unit or integration tests. @@ -174,7 +232,7 @@ We use Black for automatic code formatting and Flake8 for linting to maintain co - PRs should be rebased on the latest `main` branch before merge if there are conflicts. 
**Owner**: Bradley Rule -**Next Review**: 11/26/25 +**Next Review**: 05/15/26 ## CI/CD Continuous integration ensures all contributions meet quality standards automatically. @@ -198,7 +256,7 @@ Continuous integration ensures all contributions meet quality standards automati - Artifacts (e.g., coverage reports) are uploaded automatically and can be reviewed. **Owner**: Sean Clayton -**Next Review**: 11/26/25 +**Next Review**: 05/15/26 ## Security & Secrets State how to report vulnerabilities, prohibited patterns (hard-coded secrets), @@ -211,7 +269,7 @@ dependency update policy, and scanning tools. * Security issues or potential breaches should be reported privately to the Project Manager and TA. **Owner**: Raymond Cen -**Next Review**: 11/26/25 +**Next Review**: 05/15/26 ## Documentation Expectations @@ -223,7 +281,7 @@ dependency update policy, and scanning tools. - Inline comments should be reserved for places where the function of code is difficult to understand or infer. **Owner**: Zahra Zahir Ahmed Alsulaimawi -**Next Review**: 11/26/25 +**Next Review**: 05/15/26 ## Release Process ### Versioning Scheme @@ -278,12 +336,61 @@ Example entry: 4) Notify the team and project partner of the rollback. **Owner**: Bradley Rule -**Next Review**: 11/26/25 +**Next Review**: 05/15/26 + +## Retraining the Classifier and Extending Extraction + +### Retraining the XGBoost Classifier + +The classifier artifacts are saved in `src/model/models/`. To retrain with new or updated labeled data: + +1. **Add labeled text files** to `data/processed-text/` and update `data/labels.json` + with `"filename.txt": "useful"` or `"filename.txt": "not useful"` entries. + +2. 
**Run the trainer directly:** + ```bash + python src/model/train_model.py + ``` + This reads from `data/processed-text/` and `data/labels.json`, trains a TF-IDF + + XGBoost model, and saves three artifacts: + - `src/model/models/pdf_classifier.json` - XGBoost model + - `src/model/models/tfidf_vectorizer.pkl` - TF-IDF vectorizer + - `src/model/models/label_encoder.pkl` - LabelEncoder + +3. **Or run the full pipeline**, which trains the model as a final step: + ```bash + python scripts/full_pipeline.py --local + ``` + +Key tunable parameters in `src/model/train_model.py`: +- `max_features` in `TfidfVectorizer` (default: 10,000) +- `eta`, `max_depth`, `subsample` in the XGBoost `params` dict +- `early_stopping_rounds` (default: 20) + +### Adding New Extraction Fields to the LLM Extractor + +Extraction fields are defined in three places: + +1. **`src/llm/models.py`** - the `PredatorDietMetrics` Pydantic model. + Add a new optional field with the appropriate type and a `None` default: + ```python + prey_taxa: Optional[list[str]] = None + ``` + +2. **`src/llm/llm_client.py`** - the system prompt that instructs the LLM. + Add a description of the new field and its expected format to the prompt string. + +3. **`classify_extract.py`** and **`extract-from-txt.py`** - update the `row` dict + and `fieldnames` list in the summary CSV writer to include the new column. + +After adding a field, run `pytest tests/test_llm_text.py` to verify that the prompt +changes do not break existing extraction tests. + + ## Support & Contact -* **Primaty Communciations**: Slack and Teams +* **Primary Communications**: Slack and Teams * **Meetings**: Fridays 1 PM PST * **Project Partner**: Mark Novak, Fridays 8:30AM PST (biweekly check-ins) * **TA Meetings**: Thursdays 1:30PM PST **Owner**: Zahra Zahir Ahmed Alsulaimawi -**Next Review**: 11/26/25 +**Next Review**: 05/15/26 \ No newline at end of file