paper-farm

A local-first, privacy-preserving pipeline for systematic literature management. paper-farm monitors a Zotero storage directory for incoming research PDFs, applies a multi-stage extraction and normalization process, generates structured summaries via a locally hosted LLM using a map-reduce strategy, and writes the results as annotated Markdown notes into an Obsidian vault — without transmitting any document content to external services.

Korean documentation: README.ko.md



Architecture


Figure 1. End-to-end processing pipeline. A queue-backed watcher thread polls Zotero storage at a configurable interval; detected PDFs are enqueued and processed sequentially through four stages — text extraction, normalization, LLM summarization, and Obsidian export — each result persisted to disk before the next stage begins.

Pipeline stages

Stage                 Module                      Description
Ingest                watchers/                   Polls Zotero storage; deduplicates by path hash; enqueues new PDFs
Extract & Normalize   extractors/, normalizers/   Two-pass text extraction with quality gating; section boundary detection
Summarize             summarizers/                Map-reduce LLM summarization; structured JSON output
Export                exporters/                  Renders Markdown with YAML front-matter; writes results into the Obsidian vault
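
As a rough sketch of the Ingest stage, assuming the dedup key is a SHA-256 hash of the absolute file path (the actual watchers/ implementation may differ):

import hashlib
import queue
import time
from pathlib import Path

def watch(storage: Path, work: queue.Queue, poll_interval: int = 30) -> None:
    """Scanner thread: poll Zotero storage and enqueue PDFs not seen before."""
    seen: set[str] = set()
    while True:
        for pdf in storage.rglob("*.pdf"):
            key = hashlib.sha256(str(pdf.resolve()).encode()).hexdigest()
            if key not in seen:           # path-hash dedup
                seen.add(key)
                work.put(pdf)             # a worker thread drains the queue sequentially
        time.sleep(poll_interval)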

Each paper produces a self-contained directory in the Obsidian vault:

<obsidian-vault>/
  NNN_<paper-id>/
    summary.md      ← structured summary with YAML front-matter (LLM-generated)
    metadata.json   ← title, authors, year, venue, DOI, paper_num, tags
    notes.md        ← user note template (Research Ideas / Questions / Follow-up)
    paper.pdf       ← copy of the source PDF

NNN is a stable three-digit identifier assigned once at first export and persisted in metadata.json; subsequent pipeline runs preserve the same number.
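One way such a once-assigned identifier can survive re-runs, as a minimal sketch: read paper_num back from metadata.json when it exists, otherwise allocate the next free number. The paper_num field matches the layout above; the helper itself is illustrative.

import json
from pathlib import Path

def stable_paper_num(vault: Path, paper_dir: Path) -> int:
    """Reuse paper_num from an existing metadata.json, else allocate the next free number."""
    meta = paper_dir / "metadata.json"
    if meta.exists():
        existing = json.loads(meta.read_text()).get("paper_num")
        if existing is not None:
            return existing                               # re-runs keep the same number
    used = [json.loads(p.read_text()).get("paper_num", 0)
            for p in vault.glob("*/metadata.json")]
    return max(used, default=0) + 1                       # rendered as NNN via f"{n:03d}"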


Requirements

  • Python 3.11+
  • Ollama running locally — ollama serve
  • A pulled model, e.g. ollama pull phi4:14b
  • (Optional) Rust toolchain — only needed for DocStruct OCR on scanned PDFs

Install

git clone --recurse-submodules <repo-url>
cd paper-farm

# with uv (recommended)
uv sync

# or pip
pip install -e .

Configure

paper-farm init-config        # writes paper-farm.toml in the current directory

Edit the generated file:

[paths]
obsidian_vault = "~/Documents/Obsidian/Research/papers"

[llm]
backend  = "ollama"
model    = "phi4:14b"         # run: ollama pull phi4:14b
timeout  = 600                # seconds; 600 recommended for 14B models

[summary]
language = "en"               # en / ko / ja / zh / fr / de / es

[watcher]
zotero_storage = "~/Zotero/storage"
poll_interval  = 30           # seconds between scans

Zotero storage path — macOS / Windows: ~/Zotero/storage; Linux (snap): ~/snap/zotero-snap/common/Zotero/storage
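
Since Python 3.11 ships tomllib, the generated file can be read with the standard library alone. A sketch of how such settings might be loaded and tilde-expanded — not the actual config.py:

import tomllib
from pathlib import Path

def load_config(path: str = "paper-farm.toml") -> dict:
    """Load settings and expand ~ in the two path-valued options."""
    with open(path, "rb") as f:                 # tomllib requires a binary file handle
        cfg = tomllib.load(f)
    for table, key in (("paths", "obsidian_vault"), ("watcher", "zotero_storage")):
        cfg[table][key] = str(Path(cfg[table][key]).expanduser())
    return cfg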


Usage

Automatic mode (recommended)

Watch Zotero and process new papers as they arrive:

paper-farm watch

Or use the provided shell helpers:

scripts/start-watch.sh       # launches watcher and writes logs to logs/
scripts/monitor.sh           # live dashboard — queue status, progress, recent logs

Manual mode

# Full pipeline in one command
paper-farm run /path/to/paper.pdf --title "Attention Is All You Need" \
    --authors "Vaswani, Shazeer" --year 2017

# Stage-by-stage
paper-farm ingest    /path/to/paper.pdf
paper-farm parse     <paper-id>
paper-farm summarize <paper-id>
paper-farm export    <paper-id>

Inspection

paper-farm list               # all registered papers
paper-farm show <paper-id>    # metadata + artifact status per stage

Text Extraction


Figure 2. Two-stage extraction strategy. pypdf is attempted first; a five-signal quality scorer determines whether the extracted text is sufficient. Papers scoring below the threshold of 60/100 are re-processed with DocStruct — a Rust/Tesseract pipeline for scanned documents — at the cost of higher latency.

Text extraction uses a two-pass strategy with automated quality gating:

  1. Primary pass (pypdf): Direct text layer extraction; fast and sufficient for digitally typeset PDFs.
  2. Secondary pass (DocStruct OCR): Activated when the quality score falls below threshold. Renders each page to a raster image and applies Tesseract OCR, recovering text from scanned or image-based PDFs.

The quality scorer aggregates five signals:

Signal                      Weight   Rationale
Characters per page         30 pts   Proxy for content density; low scores indicate image-only pages
Non-whitespace ratio        20 pts   Detects pages with excessive padding or layout artifacts
Printable-character ratio   20 pts   Discriminates valid text from OCR noise or binary data
Academic keyword hits       20 pts   Verifies presence of structural markers (abstract, introduction, references, …)
Page yield                  10 pts   Fraction of pages producing non-empty output
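
A hedged sketch of how these five signals could combine into the 0–100 score gated at 60; only the weights and threshold come from the table above, the per-signal cutoffs (e.g. 1500 chars/page) are illustrative:

def quality_score(pages: list[str]) -> int:
    """Combine the five signals into a 0-100 score; below 60 triggers the OCR pass."""
    text = "\n".join(pages)
    n_chars = max(len(text), 1)
    n_pages = max(len(pages), 1)
    keywords = ("abstract", "introduction", "method", "results", "references")

    chars_per_page = min(len(text) / n_pages / 1500, 1.0) * 30   # assumed full-text density
    non_whitespace = sum(not c.isspace() for c in text) / n_chars * 20
    printable      = sum(c.isprintable() or c.isspace() for c in text) / n_chars * 20
    keyword_hits   = sum(k in text.lower() for k in keywords) / len(keywords) * 20
    page_yield     = sum(bool(p.strip()) for p in pages) / n_pages * 10

    return round(chars_per_page + non_whitespace + printable + keyword_hits + page_yield)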

Build DocStruct (optional — scanned PDFs only)

git submodule update --init --recursive
cargo build --release --manifest-path external/DocStruct/Cargo.toml
pip install "Pillow>=11,<12" pytesseract pdf2image "opencv-python>=4.8,<5" numpy

If the binary is absent, paper-farm falls back to pypdf automatically.
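
The fallback might look like this sketch, reusing the quality_score sketch above; the binary location and CLI shape are assumptions, not the real DocStructExtractor interface:

import subprocess
from pathlib import Path
from pypdf import PdfReader

DOCSTRUCT_BIN = Path("external/DocStruct/target/release/docstruct")  # assumed location

def extract_text(pdf: Path) -> str:
    """pypdf first; re-process with DocStruct only if scoring poorly and the binary exists."""
    pages = [page.extract_text() or "" for page in PdfReader(pdf).pages]
    if quality_score(pages) >= 60 or not DOCSTRUCT_BIN.exists():
        return "\n".join(pages)                           # good enough, or no OCR binary built
    result = subprocess.run([str(DOCSTRUCT_BIN), str(pdf)],  # CLI shape is an assumption
                            capture_output=True, text=True, check=True)
    return result.stdout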


Summarization

paper-farm employs a map-reduce strategy to generate summaries that provide uniform coverage across all sections of a paper, including long methodological and experimental sections that a single-pass approach would truncate.

Map phase. Each section whose character count exceeds a configurable threshold (default: 2,000 characters) is individually condensed to approximately 150 words via a focused LLM call. The prompt instructs the model to extract the core contribution, methodology, quantitative results, and any stated limitations for that section only. Short sections are passed through verbatim.

Reduce phase. The condensed section texts are concatenated and submitted to the LLM in a single structured extraction call. The model is asked to produce a JSON object with ten fields — summary, problem, key_idea, method, experiment, results, contributions, limitations, future_work, and keywords — following strict rules that preserve technical terminology in English regardless of the output language.

This design separates coverage (map) from synthesis (reduce), improving fidelity on papers with dense Results or Methods sections without increasing the reduce call's context length.
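
In outline, the two phases could be wired to Ollama's public /api/generate endpoint like this; the endpoint and request shape follow the documented Ollama REST API, while the prompts and function names are illustrative:

import json
import urllib.request

def llm(prompt: str, model: str = "phi4:14b",
        url: str = "http://localhost:11434/api/generate") -> str:
    """One non-streaming completion via Ollama's REST API."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def summarize(sections: dict[str, str], threshold: int = 2000) -> dict:
    # Map: condense only sections above the threshold; short ones pass through verbatim.
    condensed = {
        name: llm("Condense this section to ~150 words, keeping the core contribution, "
                  "methodology, quantitative results, and stated limitations:\n\n" + text)
              if len(text) > threshold else text
        for name, text in sections.items()
    }
    # Reduce: one structured extraction call over the concatenated condensations.
    joined = "\n\n".join(f"## {name}\n{text}" for name, text in condensed.items())
    fields = ("summary, problem, key_idea, method, experiment, results, "
              "contributions, limitations, future_work, keywords")
    return json.loads(llm(f"Return only a JSON object with the fields {fields}:\n\n{joined}"))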

The experiment field is a nested JSON object with keys dataset, simulator, and metric, enabling structured querying across a corpus.
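Since each paper directory is plain files on disk, a cross-corpus query reduces to a glob. A sketch, assuming (loudly — the layout above shows only summary.md) that the structured reduce output is also persisted as summary.json per paper:

import json
from pathlib import Path

vault = Path("~/Documents/Obsidian/Research/papers").expanduser()

# List every paper whose experiment ran on a given dataset.
for summary in vault.glob("*/summary.json"):          # artifact name is hypothetical
    fields = json.loads(summary.read_text())
    if fields.get("experiment", {}).get("dataset") == "ImageNet":
        print(summary.parent.name)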


Project layout

src/paper_farm/
  cli.py            CLI entry point (Typer)
  config.py         Settings — loaded from paper-farm.toml
  pipeline/         PipelineService: ingest → parse → summarize → export
  extractors/       SmartExtractor, SimpleTextExtractor, DocStructExtractor
  normalizers/      Text cleaning and section boundary detection
  summarizers/      OllamaSummaryBackend (map-reduce), LocalSummaryBackend (rule-based)
  exporters/        Obsidian Markdown + metadata.json writer
  watchers/         ZoteroWatcher — scanner thread + worker queue
  storage/          File-backed repository (data/)
data/               Pipeline cache — excluded from git (see .gitignore)
scripts/            Shell helpers: start-watch.sh, monitor.sh, sync.sh
external/DocStruct  OCR submodule (Rust + Tesseract)

Development

uv sync
uv run pytest
