An automated pipeline that reads ecological literature and extracts predator feeding-rate data — turning hundreds of PDFs into a structured, analysis-ready database.
2025–2026 Oregon State University Senior Capstone Project, in collaboration with Mark Novak.
Predator diet surveys form the foundation for estimating the fraction of feeding individuals across species.
This project contributes to validating a novel metric of predator-prey interaction, the fraction of feeding individuals, which has the potential to inform ecosystem-based resource management and ecological theory at scale. Given a folder of PDFs from the ecological literature, our pipeline screens each paper with a trained XGBoost classifier, routes relevant papers to a locally-run LLM for structured data extraction, and exports JSON with a classification confidence and extraction provenance attached to every record. This overcomes the data-harvesting bottleneck that has hindered validation of the metric.
The fraction of feeding individuals is defined as the proportion of predators found to have non-empty stomachs at the time of sampling, a quantity that can be obtained directly from routine predator diet surveys. Research from Mark Novak's lab at Oregon State University has established that this metric is analytically linked to a species' metabolic demand, body size, temperature, mortality rate, extinction susceptibility, biological control effectiveness, and population resilience to perturbation, making it a powerful and underutilized parameter for ecosystem-based resource management.
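Concretely, the metric reduces to a simple ratio per survey. A minimal sketch (the function and argument names are illustrative, not taken from the project's codebase):

```python
def fraction_feeding(n_nonempty: int, n_empty: int) -> float:
    """Fraction of feeding individuals: the proportion of sampled
    predators whose stomachs were non-empty."""
    total = n_nonempty + n_empty
    if total == 0:
        raise ValueError("no individuals sampled")
    return n_nonempty / total

# e.g. a survey reporting 37 non-empty and 13 empty stomachs
print(fraction_feeding(37, 13))  # → 0.74
```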
Despite its potential, the metric is rarely used in practice. The underlying data exists across more than a century of published predator diet surveys, but harvesting it by hand from the primary literature is prohibitively slow at the scale required for meaningful cross-species analysis. FracFeedExtractor was built to solve that bottleneck: given a collection of PDFs, it automatically identifies which papers contain usable diet survey data and extracts the key numbers and covariates needed to compute the fraction of feeding individuals.
- PDF Classification — A trained XGBoost classifier identifies which scientific publications contain useful predator diet survey data, filtering out irrelevant papers before they reach the LLM.
- Structured Data Extraction — Automatically parses empty and non-empty stomach counts and key covariates (predator identity, survey location, survey year, and more) from tabular and narrative text.
- Batch Processing — Accepts a single PDF or an entire folder of PDFs in one command.
- Provenance & Uncertainty Reporting — Every result includes the classifier confidence score and an extraction provenance descriptor identifying the source sentence or table for each field, making downstream QA straightforward.
- Locally-Run LLM — The extraction model runs entirely on-device via Ollama. Unpublished manuscripts and proprietary datasets never leave the researcher's environment.
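To make the provenance and confidence reporting concrete, a hypothetical per-record output might look like the following (every field name and value here is illustrative, not the project's actual schema):

```json
{
  "source_pdf": "smith_1998_cod_diets.pdf",
  "classifier_confidence": 0.93,
  "predator": "Gadus morhua",
  "location": "North Sea",
  "survey_year": 1996,
  "n_empty_stomachs": 41,
  "n_nonempty_stomachs": 182,
  "provenance": "Table 2, row 3"
}
```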
Predator-prey interactions are central to ecosystem stability, yet predator feeding rates are rarely used in practice because the data required to estimate them are difficult to obtain at scale. To validate the fraction of feeding individuals metric for mainstream resource management and ecological theory, a scalable method is needed to harvest the untapped data that already exists in the vast ecological literature, accumulated over more than a century of field surveys conducted across the globe.
We trained an XGBoost classifier on the FracFeed global database, a hand-annotated collection of predator diet surveys spanning 135 years and multiple continents, to recognize relevant publications so the LLM only processes papers likely to yield usable data. An LLM running locally via Ollama then extracts the numbers of empty and non-empty stomachs and key covariates from each relevant paper. The resulting pipeline enables the generation of a comprehensive database for subsequent analyses and applications.
Our two-stage pipeline combines a lightweight classifier with a locally-run LLM to minimize cost and runtime at scale. The classifier acts as a gate — only papers it scores as useful proceed to the more expensive extraction step.
Five-stage pipeline architecture. PDF files are preprocessed, filtered, and classified before useful papers proceed to LLM data extraction and structured output.
The pipeline consists of the following components:
- PDF Text Extraction — PyMuPDF parses each PDF; Tesseract OCR handles scanned documents.
- Text Cleaning & Section Filtering — References, captions, and irrelevant paragraphs are stripped to reduce noise before classification.
- XGBoost Classifier — TF-IDF features feed a trained XGBoost model that scores each paper as useful or not useful with a confidence score.
- LLM Extraction — Relevant papers are passed to a locally-run LLM (via Ollama) with a structured prompt, returning a PredatorDietMetrics JSON object containing stomach counts, predator identity, survey location, and survey year.
- Output — Per-paper JSON files and a pipeline summary CSV are written to data/results/.
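The TF-IDF-plus-classifier gate described above can be sketched as follows. This is a simplified stand-in: it uses scikit-learn's GradientBoostingClassifier in place of XGBoost, a toy corpus instead of the FracFeed training data, and an assumed 0.70 threshold:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for cleaned paper text (label 1 = useful)
docs = [
    "stomach contents of 120 cod were examined; 45 were empty",
    "we model hydrodynamic flow in the estuary",
    "diet survey: 80 of 200 perch stomachs contained prey",
    "sediment transport rates were measured monthly",
]
labels = [1, 0, 1, 0]

# Turn each document into a TF-IDF feature vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Gradient-boosted trees score each paper as useful / not useful
clf = GradientBoostingClassifier(random_state=0).fit(X.toarray(), labels)

# Gate: only papers scoring above the confidence threshold reach the LLM
new_doc = ["empty stomachs were recorded for each predator sampled"]
score = clf.predict_proba(vectorizer.transform(new_doc).toarray())[0, 1]
send_to_llm = score >= 0.70
```

In the real pipeline the same gating logic applies, with the XGBoost model trained on the hand-annotated FracFeed corpus.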
Below is a condensed view of a typical pipeline run on a folder of PDFs. The classifier scores each paper; papers scored as relevant proceed to LLM extraction.
FracFeedExtractor pipeline run on a folder of PDFs.
The classifier was evaluated on a held-out test set of 234 papers. It achieves 94% accuracy across both relevant and irrelevant publications, with strong and balanced precision and recall.
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Not useful (0) | 0.96 | 0.91 | 0.93 | 110 |
| Useful (1) | 0.92 | 0.97 | 0.94 | 124 |
| Overall | 0.94 | 0.94 | 0.94 | 234 |
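As a quick sanity check, the F1-score column follows directly from the precision and recall columns; for the useful class:

```python
def f1(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.92, 0.97), 2))  # 0.94, matching the table
```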
XGBoost classifier training curve. Log-loss for train (blue) and validation (dashed orange) sets across 600 boosting rounds. Early stopping selected round 585 as the best iteration (min val loss: 0.193).
| Dependency | Notes |
|---|---|
| Python 3.10+ | Tested on 3.10–3.12 |
| Ollama | Must be running locally; 8 GB RAM minimum, 16 GB recommended |
| Tesseract OCR | System-level install required for scanned PDFs — see Contributing Guide for platform-specific instructions |
Pull the default extraction model before running:
```bash
ollama pull qwen2.5:7b   # ~5 GB
ollama list
```

```bash
# Linux
git clone https://github.com/NovakLabOSU/FracFeedExtractor.git
cd FracFeedExtractor
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

```bash
# Windows PowerShell
git clone https://github.com/NovakLabOSU/FracFeedExtractor.git
cd FracFeedExtractor
py -m venv venv
.\venv\Scripts\activate
pip install -r requirements.txt
```

```bash
# Classify and extract from a folder of PDFs
python classify_extract.py path/to/pdfs/

# Adjust the LLM model or confidence threshold
python classify_extract.py path/to/pdfs/ --llm-model llama3.1:8b --confidence-threshold 0.70
```

Results are written to data/results/metrics/ (per-paper JSON) and data/results/summaries/ (pipeline CSV).
For virtual environment setup, full CLI flag reference, and contribution guidelines, see the Contributing Guide.
We trained the classifier on the FracFeed global database — a hand-annotated collection of predator diet surveys from the primary ecological literature.
| Name | Role | Email |
|---|---|---|
| Mark Novak | Project Lead | Mark.Novak@oregonstate.edu |
| Sean Clayton | ML Pipeline & Backend | claytose@oregonstate.edu |
| Zahra Alsulaimawi | LLM Integration & Evaluation | alsulaza@oregonstate.edu |
| Raymond Cen | Data Processing & Testing | cenra@oregonstate.edu |
| Bradley Rule | PDF Extraction & OCR | ruleb@oregonstate.edu |
Found a bug or have a question?
Open an issue on GitHub
- Contributing Guide — setup, CLI reference, and contribution workflow
- System Architecture Diagram
License: Pending partner confirmation.
