python: add Docling PDF-to-markdown converter #49

Open
y-71 wants to merge 1 commit into map-reduce-ingestion from python-docling-converter

y-71 (Collaborator) commented Mar 1, 2026

Summary

  • Bootstrap contextrie-convert Python package with Poetry
  • convert_pdf_to_markdown() wraps Docling's DocumentConverter — PDF path in, markdown string out
  • CLI entry point (contextrie-convert) writes markdown to stdout for subprocess capture by the TS side
  • --no-ocr flag for born-digital PDFs (skips EasyOCR, faster)
  • Fixture PDF generated at test time via reportlab (no binary committed)
  • Unit tests for input validation, pipeline options, and CLI error handling
  • Integration test for full PDF→markdown round-trip
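
For reviewers who have not used Docling, the wrapper described above could look roughly like the sketch below. It is not the code in this PR: the `ocr` keyword, the specific validation rules, and the lazy imports are assumptions made for illustration. The Docling calls themselves (`DocumentConverter`, `PdfPipelineOptions.do_ocr`, `export_to_markdown()`) are part of Docling's public API.

```python
from pathlib import Path


def convert_pdf_to_markdown(pdf_path, ocr=True):
    """Convert a PDF file to a markdown string using Docling.

    The `ocr` parameter and the validation rules below are assumptions,
    not necessarily what the PR implements.
    """
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"PDF not found: {path}")
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name}")

    # Imported lazily so that validation errors surface without loading Docling.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = ocr  # False skips the OCR stage for born-digital PDFs

    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
    result = converter.convert(path)
    return result.document.export_to_markdown()
```

Validating the path before importing Docling keeps the failure path cheap, which matters when the TS side spawns the converter as a subprocess for every file.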

Test plan

  • pytest — 5 unit tests pass, 1 integration test (requires Docling install)
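
As an illustration of generating the fixture at test time instead of committing a binary, a helper along these lines would do. The name `make_fixture_pdf`, the sample text, and the page coordinates are hypothetical; only the reportlab `canvas.Canvas` calls are real API.

```python
from pathlib import Path


def make_fixture_pdf(path, lines=("Hello Docling", "Second line")):
    """Write a one-page, born-digital PDF at `path` and return it as a Path."""
    # Imported lazily so importing this module does not require reportlab.
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    path = Path(path)
    pdf = canvas.Canvas(str(path), pagesize=letter)
    y = 720  # start one inch below the top of a US Letter page (792 pt tall)
    for line in lines:
        pdf.drawString(72, y, line)  # 72 pt = one-inch left margin
        y -= 18
    pdf.showPage()
    pdf.save()
    return path
```

In a pytest suite this would typically be called from a fixture, for example `make_fixture_pdf(tmp_path / "sample.pdf")`, so each run gets a fresh file in a temporary directory.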

Bootstrap contextrie-convert package with a minimal Docling wrapper:
PDF path in, markdown string out on stdout. Includes CLI entry point,
Python API, fixture generation via reportlab, and unit/integration tests.
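
A rough sketch of how the CLI entry point might be wired, assuming an `argparse` front end. The `convert` injection parameter, the exit codes, and the `contextrie_convert` import path are assumptions for illustration; only the `contextrie-convert` command name and the `--no-ocr` flag come from the summary above.

```python
import argparse
import sys


def main(argv=None, convert=None):
    """Run the CLI and return a process exit code."""
    parser = argparse.ArgumentParser(
        prog="contextrie-convert",
        description="Convert a PDF to markdown and write it to stdout.",
    )
    parser.add_argument("pdf", help="path to the input PDF")
    parser.add_argument(
        "--no-ocr",
        action="store_true",
        help="skip OCR (faster for born-digital PDFs)",
    )
    args = parser.parse_args(argv)

    if convert is None:
        # Hypothetical import path; the real module name may differ.
        from contextrie_convert import convert_pdf_to_markdown as convert

    try:
        markdown = convert(args.pdf, ocr=not args.no_ocr)
    except (FileNotFoundError, ValueError) as exc:
        # Errors go to stderr so stdout carries only markdown.
        print(f"error: {exc}", file=sys.stderr)
        return 1

    sys.stdout.write(markdown)
    return 0
```

Keeping diagnostics on stderr means the TS side can treat everything captured from stdout as the converted document and rely on the exit code to detect failure.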
y-71 changed the base branch from main to 39-major-refactor-to-orgnaize-and-split-into-packages March 1, 2026 01:39
y-71 changed the base branch from 39-major-refactor-to-orgnaize-and-split-into-packages to map-reduce-ingestion March 1, 2026 01:40