Add Chandra OCR 2 engine: state-of-the-art document OCR #4
Merged
Analyze Chandra OCR 2 (Datalab) as a candidate engine for docfold:
- RESEARCH_CHANDRA_OCR.md: model details, benchmarks (85.9% olmOCR SOTA), capabilities, API usage, confidence scoring, and comparison with existing engines
- TASK_CHANDRA_ENGINE.md: step-by-step integration plan following a TDD approach, covering engine adapter, router, CLI, tests, and documentation updates

Sources: GitHub repo, HuggingFace model card, Datalab blog, confidence scoring docs

https://claude.ai/code/session_014XarzdnTKLSQmJW7VPVNcW
Add Datalab Chandra OCR 2 as the 19th engine in docfold: a 5B VLM achieving 85.9% on the olmOCR benchmark (SOTA). Supports 90+ languages, handwriting, tables, math, and complex layouts via vLLM or HuggingFace.
- New ChandraEngine adapter with dual backend (vllm/hf), lazy model loading
- Tests: 6 unit tests + interface compliance test
- Router: chandra added to PDF, image, and default fallback priorities
- CLI: registered in _build_router()
- pyproject.toml: chandra optional dependency group, added to [all]
- benchmarks.md: quick comparison, engine profile, feature/format/hw/cost matrices

All 279 tests pass, ruff clean.

https://claude.ai/code/session_014XarzdnTKLSQmJW7VPVNcW
Summary
Adds Chandra OCR 2 (by Datalab) as a new document processing engine to docfold. Chandra is a 5B-parameter Vision Language Model achieving 85.9% on the olmOCR benchmark — significantly outperforming existing engines like Marker (76.5%) and Mistral OCR (72.0%). It converts images and PDFs to structured Markdown/HTML/JSON with layout preservation, supports 90+ languages, and excels at handwriting, tables, math, and complex layouts.
Key Changes
New engine adapter (src/docfold/engines/chandra_engine.py):
- Implements the DocumentEngine ABC with support for both vLLM (remote server) and HuggingFace (local) inference backends
- Model loading deferred to the process() call to avoid startup overhead
- EngineResult output with support for JSON/HTML formats
- Supported extensions: pdf, png, jpg, jpeg, tiff, bmp, webp
- Capabilities: table_structure=True, heading_detection=True, reading_order=True

Router integration (src/docfold/engines/router.py):
- Adds "chandra" at high priority in the _IMAGE_PRIORITY and _PDF_PRIORITY lists (given its superior benchmark scores)
- Adds it to the _DEFAULT_FALLBACK list

CLI registration (src/docfold/cli.py):
- Registers ChandraEngine in _build_router() with graceful fallback if dependencies are missing

Optional dependency (pyproject.toml):
- New chandra extra group with a chandra-ocr>=0.1 dependency
- Included in the [all] extra

Comprehensive test coverage (tests/engines/test_adapters.py):
- TestChandraEngine class with 7 test methods covering name, extensions, availability, config storage, defaults, and capabilities
- Added to the TestAllEnginesImplementInterface parametrized tests

Documentation:
- docs/RESEARCH_CHANDRA_OCR.md: detailed research document with benchmarks, capabilities, usage examples, and integration rationale
- docs/TASK_CHANDRA_ENGINE.md: implementation task specification
- docs/benchmarks.md: added Chandra to the Quick Comparison table and Engine Profiles section

Implementation Details
- The backend (vLLM or HuggingFace) is chosen through the method parameter.
- Model loading is deferred to the process() call, keeping startup fast.
- Inference goes through loop.run_in_executor() to offload CPU-bound work to a thread pool.

https://claude.ai/code/session_014XarzdnTKLSQmJW7VPVNcW
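The lazy-loading and executor notes under Implementation Details can be illustrated with a minimal sketch. This is not the ChandraEngine code from the PR and does not call the chandra-ocr library: LazyEngineSketch, SketchResult, and fake_loader are hypothetical names made up for illustration, and the loader stands in for whatever actually builds the model. It only shows the general pattern of deferring an expensive load until process() runs and pushing blocking inference onto the event loop's default thread pool.

```python
import asyncio
import threading
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SketchResult:
    """Hypothetical stand-in for docfold's EngineResult."""
    content: str
    format: str


class LazyEngineSketch:
    """Hypothetical engine: lazy model loading plus executor offload."""

    def __init__(self, loader: Callable[[], Callable[[str], str]],
                 method: str = "vllm") -> None:
        self._loader = loader          # builds the (expensive) inference callable
        self.method = method           # only stored here; nothing is loaded yet
        self._model: Optional[Callable[[str], str]] = None
        self._lock = threading.Lock()  # guards against concurrent double loads
        self.load_count = 0            # exposed so the laziness is observable

    def _infer_sync(self, file_path: str) -> str:
        # Runs in a worker thread. The model is built once, on first use.
        with self._lock:
            if self._model is None:
                self._model = self._loader()
                self.load_count += 1
            model = self._model
        return model(file_path)

    async def process(self, file_path: str,
                      output_format: str = "markdown") -> SketchResult:
        loop = asyncio.get_running_loop()
        # Passing None selects the loop's default ThreadPoolExecutor,
        # so the event loop stays responsive during inference.
        text = await loop.run_in_executor(None, self._infer_sync, file_path)
        return SketchResult(content=text, format=output_format)


def fake_loader() -> Callable[[str], str]:
    """Stand-in for loading a large model; returns a trivial 'inference' function."""
    def infer(path: str) -> str:
        return f"# OCR of {path}"
    return infer


if __name__ == "__main__":
    demo = LazyEngineSketch(fake_loader, method="hf")
    print("loads before process():", demo.load_count)
    result = asyncio.run(demo.process("scan.png"))
    print("loads after process():", demo.load_count)
    print(result.content)
```

Constructing the engine is cheap because the loader is only invoked inside the worker thread, and repeated process() calls reuse the same model. The lock is a choice made for this sketch so that concurrent calls do not trigger two loads; whether the PR's adapter does the same is not stated in the description.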