Add LiteParse engine adapter and comprehensive benchmarking suite by Mihailorama · Pull Request #3 · Mihailorama/docfold

Mihailorama · 2026-03-22T19:57:54Z

Summary

This PR introduces a new LiteParse engine adapter for fast local document parsing via CLI, along with a comprehensive benchmarking framework to compare document processing engines.

Key Changes

LiteParse Engine Adapter (`src/docfold/engines/liteparse_engine.py`)

- New `LiteParseEngine` class that wraps the LiteParse CLI (`lit parse`) as a subprocess
- Supports multiple document formats: PDF, DOCX, PPTX, XLSX, images (PNG, JPG, etc.)
- Configurable options: OCR enable/disable, OCR language, DPI, worker threads, max pages
- Extracts bounding boxes with page coordinates from LiteParse JSON output
- Handles multiple output formats: TEXT, MARKDOWN, JSON, HTML
- Robust JSON parsing that tolerates log lines printed before the JSON output
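The log-tolerant JSON parsing described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper name `extract_json` is hypothetical, and it assumes the payload is a single JSON object and that no log line printed before it contains a `{` character.

```python
import json


def extract_json(stdout: str) -> dict:
    """Parse the JSON object in CLI output, skipping leading log lines.

    Hypothetical helper: assumes one JSON object in the output and no '{'
    inside the log lines that precede it.
    """
    start = stdout.find("{")
    if start == -1:
        raise ValueError("no JSON object found in CLI output")
    # raw_decode stops at the end of the first complete JSON value, so any
    # text printed after the object does not break parsing.
    payload, _end = json.JSONDecoder().raw_decode(stdout, start)
    return payload


sample = 'info: loading document\ninfo: 1 page\n{"pages": [{"text": "Hello"}]}\n'
print(extract_json(sample)["pages"][0]["text"])  # Hello
```

A log line that itself contains a brace would defeat this approach, so a stricter variant might scan line by line for one that starts with `{`.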

Benchmarking Suite (`benchmark.py`)

- Generates synthetic test PDFs with known ground truth content (invoices, reports, financial docs)
- Runs both PyMuPDF and LiteParse engines on identical documents
- Measures performance metrics:
  - Processing time (milliseconds)
  - Character Error Rate (CER) via Levenshtein distance
  - Word Error Rate (WER)
  - Bounding box coverage
- Produces a detailed JSON report (`docs/benchmark_results.json`) with per-document and aggregate statistics
- Includes sample benchmark results showing PyMuPDF (~4.5ms avg) vs LiteParse (~382ms avg)
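For readers unfamiliar with the accuracy metrics, here is a minimal sketch of CER and WER computed from Levenshtein distance after whitespace normalization. The function names are illustrative and are not taken from `benchmark.py`. Returning 1.0 when the reference is empty but the hypothesis is not is an assumed convention.

```python
from collections.abc import Sequence


def normalize(text: str) -> str:
    """Collapse runs of whitespace so layout differences are not counted as errors."""
    return " ".join(text.split())


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (characters of a str, or a list of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by reference word count."""
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)


print(cer("Invoice  total:\n100", "Invoice total: 100"))  # 0.0
print(wer("a b c d", "a x c d"))                          # 0.25
```

A score of 0.0 on both metrics, as reported for the synthetic digital PDFs, means the normalized extracted text matched the ground truth exactly.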

Test Coverage (`tests/engines/test_liteparse_engine.py`)

- 300+ lines of unit tests (14 tests in the first commit, 15 after the follow-up)
- Tests metadata (name, capabilities, availability detection)
- Tests the TEXT, MARKDOWN, and JSON output formats
- Tests CLI flag generation (OCR, DPI, workers)
- Tests bounding box extraction from `textItems`
- Tests error handling and JSON parsing with log prefixes
- Uses mocking to avoid a Node.js dependency in the test environment
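As an illustration of how mocking removes the Node.js requirement, the sketch below tests a hypothetical availability check without the `lit` CLI installed. It assumes availability is detected by looking for a `lit` executable on `PATH` with `shutil.which`. The PR does not show how `LiteParseEngine` actually detects the CLI, and the function and test names here are made up.

```python
import shutil
from unittest.mock import patch


def lit_available() -> bool:
    """Hypothetical availability check: is a `lit` executable on PATH?"""
    return shutil.which("lit") is not None


def test_available_when_cli_on_path():
    # Pretend the CLI is installed by patching the PATH lookup.
    with patch("shutil.which", return_value="/usr/local/bin/lit"):
        assert lit_available() is True


def test_unavailable_when_cli_missing():
    # Pretend the CLI is absent, whatever the real environment contains.
    with patch("shutil.which", return_value=None):
        assert lit_available() is False
```

The same pattern extends to the subprocess call itself: patching the function that launches the process lets a test feed canned stdout to the parser.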

Integration Updates

- Registered `LiteParseEngine` in the CLI router (`src/docfold/cli.py`)
- Added LiteParse to extension priority routing (`src/docfold/engines/router.py`)
- Updated benchmarks documentation with a LiteParse comparison table
- Added a `[liteparse]` optional dependency group in `pyproject.toml` (no Python dependencies; requires Node.js 18+)

Notable Implementation Details

- LiteParse runs as a subprocess with async I/O for non-blocking execution
- Bounding boxes are computed from `textItems` as `[x, y, x+width, y+height]`
- JSON extraction skips CLI log output by locating the first `{` character
- Benchmark uses text normalization (whitespace collapsing) for fair CER/WER comparison
- Elapsed time for async operations is measured with `time.perf_counter()`
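The non-blocking subprocess call and its timing can be sketched as below. This is a hedged illustration, not the adapter's code: `run_cli` is a hypothetical name, and the demo runs the Python interpreter as a stand-in command so the example works without Node.js. The real `lit parse` flags are deliberately not reproduced here.

```python
import asyncio
import sys
import time


async def run_cli(cmd: list[str]) -> tuple[str, float]:
    """Run a command without blocking the event loop.

    Returns (stdout_text, elapsed_milliseconds). Hypothetical helper.
    """
    start = time.perf_counter()
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    # perf_counter is monotonic, so the difference is a reliable wall-clock duration.
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    if proc.returncode != 0:
        message = stderr.decode("utf-8", errors="replace")
        raise RuntimeError(f"command failed with exit code {proc.returncode}: {message}")
    return stdout.decode("utf-8", errors="replace"), elapsed_ms


if __name__ == "__main__":
    # Stand-in command (assumption): the Python interpreter instead of the LiteParse CLI.
    demo_cmd = [sys.executable, "-c", "print('hello from a stand-in CLI')"]
    output, ms = asyncio.run(run_cli(demo_cmd))
    print(output.strip(), f"({ms:.1f} ms)")
```

Because the timer wraps process start-up as well as the parsing work, a measurement taken this way includes interpreter launch cost, which is consistent with the subprocess overhead noted in the benchmark results.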

https://claude.ai/code/session_017D1jcELbNYGtk9QoUX6FA7

…ngine

Add LiteParse adapter that calls the `lit` CLI via subprocess for fast local document parsing with bounding boxes and confidence scores. No Python dependencies required, only Node.js 18+ with @llamaindex/liteparse installed globally.

Changes:
- New engine adapter: src/docfold/engines/liteparse_engine.py
- Tests: tests/engines/test_liteparse_engine.py (14 tests)
- Router: added liteparse to priority maps for PDF, Office, images
- CLI: registered LiteParseEngine in _build_router()
- pyproject.toml: added [liteparse] optional dependency group
- docs/benchmarks.md: added LiteParse to all comparison tables

https://claude.ai/code/session_017D1jcELbNYGtk9QoUX6FA7

…enchmark

Fixed the JSON parser to handle the actual LiteParse output format:
- Page text at `pages[].text` (not nested in `content`)
- Text items at `pages[].textItems` with {x, y, width, height}
- Added `_extract_json()` to strip log lines from stdout
- Added test for log-prefix handling (15 tests total)

Benchmark results (4 synthetic digital PDFs):
- PyMuPDF: avg 4.5ms, CER=0.0, WER=0.0, avg 6.2 bboxes
- LiteParse: avg 382ms, CER=0.0, WER=0.0, avg 31.8 bboxes

Both engines achieve perfect text extraction on digital PDFs. LiteParse provides ~5x more granular bounding boxes (word-level vs block-level) but is ~85x slower due to Node.js subprocess overhead.

https://claude.ai/code/session_017D1jcELbNYGtk9QoUX6FA7
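Given the output shape named in the commit message above (`pages[].text` and `pages[].textItems` with `x`, `y`, `width`, `height`), the text and bounding box extraction might look like the following sketch. The function name `parse_pages`, the 1-based page numbering, and the shape of the returned box records are assumptions made for illustration. Only the field names and the `[x, y, x+width, y+height]` formula come from this PR.

```python
def parse_pages(payload: dict) -> tuple[str, list[dict]]:
    """Collect page text and per-item bounding boxes from parsed LiteParse JSON.

    Hypothetical helper: the record layout {"page": ..., "bbox": [...]} is an
    illustrative choice, not the adapter's actual data model.
    """
    texts: list[str] = []
    boxes: list[dict] = []
    for page_number, page in enumerate(payload.get("pages", []), start=1):
        texts.append(page.get("text", ""))
        for item in page.get("textItems", []):
            x, y = item["x"], item["y"]
            # Convert origin + size into corner coordinates: [x0, y0, x1, y1].
            bbox = [x, y, x + item["width"], y + item["height"]]
            boxes.append({"page": page_number, "bbox": bbox})
    return "\n\n".join(texts), boxes


payload = {
    "pages": [
        {
            "text": "Invoice 42",
            "textItems": [
                {"x": 10, "y": 20, "width": 30, "height": 5},
                {"x": 45, "y": 20, "width": 12, "height": 5},
            ],
        }
    ]
}
text, boxes = parse_pages(payload)
print(text)              # Invoice 42
print(boxes[0]["bbox"])  # [10, 20, 40, 25]
```

One box per text item is what makes the LiteParse box count higher than a block-level extractor's on the same page.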

claude added 2 commits March 22, 2026 13:50

Mihailorama merged commit bd76598 into main Mar 22, 2026
0 of 9 checks passed

Mihailorama deleted the claude/integrate-engine-benchmarks-cBYKe branch March 22, 2026 19:58
