From 5d263f354c1e68d3f1b7151f4bc505433612f0b9 Mon Sep 17 00:00:00 2001
From: Nameet Potnis <nameetpotnis@Nameets-MacBook-Pro.local>
Date: Sun, 26 Apr 2026 00:40:32 +0530
Subject: [PATCH] Add pdfmux to Parsers, OCR and extraction

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 6cb97f1..ba95cd0 100644
--- a/README.md
+++ b/README.md
@@ -34,6 +34,7 @@
 - [CatchTheTornado/pdf-extract-api](https://github.com/CatchTheTornado/pdf-extract-api) - Document (PDF) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown.
 - [climatepolicyradar/navigator-document-parser](https://github.com/climatepolicyradar/navigator-document-parser) - Parsing PDFs and websites containing laws and policies.
 - [Iteration Layer](https://iterationlayer.com) - An AI-powered API that extracts structured data from PDFs, images, DOCX, and text files.
+- [pdfmux](https://github.com/NameetP/pdfmux) - Python PDF-to-Markdown extraction library for RAG pipelines, with per-page confidence scoring and a self-healing fallback that automatically re-extracts low-confidence pages. Includes a built-in MCP server plus LangChain and LlamaIndex loaders. Ranked #2 on opendataloader-bench. MIT licensed.
 
 
 ## Creation and production