Add LiteParse engine adapter and comprehensive benchmarking suite #3

Merged
Mihailorama merged 2 commits into main from claude/integrate-engine-benchmarks-cBYKe on Mar 22, 2026

Conversation

@Mihailorama
Owner

Summary

This PR introduces a new LiteParse engine adapter for fast local document parsing via CLI, along with a comprehensive benchmarking framework to compare document processing engines.

Key Changes

LiteParse Engine Adapter (src/docfold/engines/liteparse_engine.py)

  • New LiteParseEngine class that wraps the LiteParse CLI (lit parse) as a subprocess
  • Supports multiple document formats: PDF, DOCX, PPTX, XLSX, images (PNG, JPG, etc.)
  • Configurable options: OCR enable/disable, OCR language, DPI, worker threads, max pages
  • Extracts bounding boxes from LiteParse JSON output with page coordinates
  • Handles multiple output formats: TEXT, MARKDOWN, JSON, HTML
  • JSON parsing that tolerates log lines printed before the JSON output
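
A minimal sketch of how a CLI can be wrapped as a non-blocking subprocess and timed with time.perf_counter(). This is not the adapter's actual code: the function name run_cli, the timeout parameter, and the error handling are assumptions for illustration, and a Python one-liner stands in for the real `lit parse` invocation so the sketch runs without Node.js.

```python
import asyncio
import sys
import time


async def run_cli(argv: list[str], timeout: float = 60.0) -> tuple[str, float]:
    """Run a command without blocking the event loop.

    Returns (stdout_text, elapsed_ms). `run_cli` is a hypothetical helper,
    not a function from the docfold codebase.
    """
    start = time.perf_counter()
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        # Do not leave a runaway child process behind on timeout.
        proc.kill()
        await proc.wait()
        raise
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    if proc.returncode != 0:
        message = stderr.decode("utf-8", errors="replace")
        raise RuntimeError(f"command failed ({proc.returncode}): {message}")
    return stdout.decode("utf-8", errors="replace"), elapsed_ms


if __name__ == "__main__":
    # Stand-in command; the real adapter would pass the LiteParse CLI arguments.
    out, ms = asyncio.run(run_cli([sys.executable, "-c", "print('{\"pages\": []}')"]))
    print(out.strip(), f"({ms:.1f} ms)")
```

Measuring around the whole subprocess call means the reported time includes process startup, which matters when comparing against an in-process library.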

Benchmarking Suite (benchmark.py)

  • Generates synthetic test PDFs with known ground truth content (invoices, reports, financial docs)
  • Runs both PyMuPDF and LiteParse engines on identical documents
  • Measures performance metrics:
    • Processing time (milliseconds)
    • Character Error Rate (CER) via Levenshtein distance
    • Word Error Rate (WER)
    • Bounding box coverage
  • Produces detailed JSON report (docs/benchmark_results.json) with per-document and aggregate statistics
  • Includes sample benchmark results showing PyMuPDF (~4.5ms avg) vs LiteParse (~382ms avg)
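
The CER and WER metrics above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code in benchmark.py: the function names are hypothetical, normalization is limited to whitespace collapsing, and an empty reference is scored as 0.0 or 1.0 by convention.

```python
from typing import Sequence


def normalize(text: str) -> str:
    """Collapse every run of whitespace into a single space and trim the ends."""
    return " ".join(text.split())


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance (insert, delete, substitute) using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by reference word count."""
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)
```

Because both strings are normalized first, differences in line wrapping or spacing between engines do not count as errors; only changed, missing, or extra characters and words do.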

Test Coverage (tests/engines/test_liteparse_engine.py)

  • 300+ lines of comprehensive unit tests
  • Tests metadata (name, capabilities, availability detection)
  • Tests the TEXT, MARKDOWN, and JSON output formats
  • Tests CLI flag generation (OCR, DPI, workers)
  • Tests bounding box extraction from textItems
  • Tests error handling and JSON parsing with log prefixes
  • Uses mocking to avoid Node.js dependency in test environment

Integration Updates

  • Registered LiteParseEngine in CLI router (src/docfold/cli.py)
  • Added LiteParse to extension priority routing (src/docfold/engines/router.py)
  • Updated benchmarks documentation with LiteParse comparison table
  • Added optional dependency group in pyproject.toml (no Python deps, requires Node.js 18+)

Notable Implementation Details

  • LiteParse runs as a subprocess with async I/O for non-blocking execution
  • Bounding boxes are computed from textItems: [x, y, x+width, y+height]
  • JSON extraction handles CLI log output by finding the first { character
  • Benchmark uses text normalization (whitespace collapsing) for fair CER/WER comparison
  • All async operations properly measure elapsed time using time.perf_counter()
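
The "find the first { character" approach to log-prefixed output can be sketched like this. It is a simplified stand-in for the PR's `_extract_json()`, whose real body is not shown here; the function name extract_json and the sample log line are hypothetical. The sketch uses json.JSONDecoder.raw_decode so that anything printed after the JSON object is ignored as well.

```python
import json


def extract_json(stdout: str) -> dict:
    """Parse the first JSON object in CLI output, skipping leading log lines."""
    start = stdout.find("{")
    if start == -1:
        raise ValueError("no JSON object found in CLI output")
    # raw_decode parses one JSON value starting at `start` and reports where
    # it ended, so trailing text after the object does not cause an error.
    obj, _end = json.JSONDecoder().raw_decode(stdout, start)
    return obj


if __name__ == "__main__":
    # Hypothetical output: one log line, then the JSON payload.
    sample = '[info] loading document\n{"pages": [{"text": "Hello"}]}\n'
    print(extract_json(sample))
```

One limitation of this heuristic is worth keeping in mind: a log line that itself contains a { character would be mistaken for the start of the payload.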

https://claude.ai/code/session_017D1jcELbNYGtk9QoUX6FA7

claude added 2 commits March 22, 2026 13:50
…ngine

Add LiteParse adapter that calls the `lit` CLI via subprocess for fast
local document parsing with bounding boxes and confidence scores.
No Python dependencies required — only Node.js 18+ with
@llamaindex/liteparse installed globally.

Changes:
- New engine adapter: src/docfold/engines/liteparse_engine.py
- Tests: tests/engines/test_liteparse_engine.py (14 tests)
- Router: added liteparse to priority maps for PDF, Office, images
- CLI: registered LiteParseEngine in _build_router()
- pyproject.toml: added [liteparse] optional dependency group
- docs/benchmarks.md: added LiteParse to all comparison tables

https://claude.ai/code/session_017D1jcELbNYGtk9QoUX6FA7
…enchmark

Fixed the JSON parser to handle the actual LiteParse output format:
- Page text at `pages[].text` (not nested in `content`)
- Text items at `pages[].textItems` with {x, y, width, height}
- Added `_extract_json()` to strip log lines from stdout
- Added test for log-prefix handling (15 tests total)
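
As an illustration of the output shape described above, the following sketch walks `pages[].text` and `pages[].textItems` and converts each item to an `[x, y, x+width, y+height]` box. The function name, the returned dictionary layout, page numbering by position, and the sample values are assumptions; only the field names come from this commit message.

```python
def parse_pages(doc: dict) -> tuple[str, list[dict]]:
    """Collect page text and per-item bounding boxes from parsed output."""
    texts: list[str] = []
    boxes: list[dict] = []
    # Page numbers are assigned by position here (an assumption).
    for page_number, page in enumerate(doc.get("pages", []), start=1):
        texts.append(page.get("text", ""))
        for item in page.get("textItems", []):
            x, y = item["x"], item["y"]
            boxes.append({
                "page": page_number,
                # Convert origin plus size into two corner points.
                "bbox": [x, y, x + item["width"], y + item["height"]],
            })
    return "\n\n".join(texts), boxes


if __name__ == "__main__":
    # Hypothetical sample mirroring the described structure.
    sample = {"pages": [{"text": "Total: 42",
                         "textItems": [{"x": 10, "y": 20, "width": 30, "height": 5}]}]}
    print(parse_pages(sample))
```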

Benchmark results (4 synthetic digital PDFs):
- PyMuPDF:    avg  4.5ms, CER=0.0, WER=0.0, avg  6.2 bboxes
- LiteParse:  avg 382ms,  CER=0.0, WER=0.0, avg 31.8 bboxes

Both engines achieve perfect text extraction on digital PDFs.
LiteParse provides ~5x more granular bounding boxes (word-level vs
block-level) but is ~85x slower due to Node.js subprocess overhead.
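
The two ratios quoted above follow directly from the reported averages. A short check, using only the figures in this commit message (the variable names are illustrative):

```python
# Averages reported for the 4 synthetic digital PDFs.
pymupdf = {"avg_ms": 4.5, "avg_bboxes": 6.2}
liteparse = {"avg_ms": 382.0, "avg_bboxes": 31.8}

slowdown = liteparse["avg_ms"] / pymupdf["avg_ms"]             # 382 / 4.5
granularity = liteparse["avg_bboxes"] / pymupdf["avg_bboxes"]  # 31.8 / 6.2

print(f"{slowdown:.1f}x slower, {granularity:.1f}x more bounding boxes")
# prints "84.9x slower, 5.1x more bounding boxes"
```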

https://claude.ai/code/session_017D1jcELbNYGtk9QoUX6FA7
@Mihailorama Mihailorama merged commit bd76598 into main Mar 22, 2026
0 of 9 checks passed
@Mihailorama Mihailorama deleted the claude/integrate-engine-benchmarks-cBYKe branch March 22, 2026 19:58