Add PDF loading support and update dependencies #1565

Open
iam-tsr wants to merge 2 commits into mofa-org:main from iam-tsr:feat/pdf-extract
iam-tsr commented Apr 2, 2026

🧠 Context

This pull request adds support for PDF document loading to the mofa-foundation RAG pipeline, enabling users to extract and process text from PDF files as part of their retrieval-augmented generation workflows. The implementation introduces a new PdfLoader (behind the pdf feature flag), integrates it into the pipeline and documentation, and provides tests and usage examples. Additionally, a small bug fix is included in the Python example.

PDF Document Loading Support

  • Added a new PdfLoader struct implementing the DocumentLoader trait, allowing extraction of text from PDF files using the pdf-extract crate. The loader handles file extension checks, error handling, and metadata population.
  • Introduced a new LoaderError::PdfParseError variant for robust error reporting when PDF parsing fails.
  • Enabled the pdf feature in mofa-foundation and rag_pipeline examples, updating Cargo.toml files and documentation to describe PDF support and usage.
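The loader flow described above (extension check, text extraction, error mapping, metadata population) can be sketched roughly as follows. This is a minimal std-only illustration, not the PR's actual code: the `Document` struct, the `load` signature, the `UnsupportedExtension` variant, the metadata keys, and the injected `extract` closure are all assumptions made so the example compiles without external crates. In the real loader the extraction step is reportedly backed by the pdf-extract crate and the type is gated behind the `pdf` feature.

```rust
use std::collections::HashMap;
use std::fmt;
use std::path::Path;

/// Hypothetical stand-in for the pipeline's document type.
#[derive(Debug, Clone, PartialEq)]
pub struct Document {
    pub text: String,
    pub metadata: HashMap<String, String>,
}

/// Only `PdfParseError` is named in the PR; the other variant is assumed.
#[derive(Debug, PartialEq)]
pub enum LoaderError {
    UnsupportedExtension(String),
    PdfParseError(String),
}

impl fmt::Display for LoaderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LoaderError::UnsupportedExtension(ext) => write!(f, "unsupported extension: {ext:?}"),
            LoaderError::PdfParseError(msg) => write!(f, "failed to parse PDF: {msg}"),
        }
    }
}

impl std::error::Error for LoaderError {}

/// Assumed shape of the `DocumentLoader` trait; the real signature may differ.
pub trait DocumentLoader {
    fn load(&self, path: &Path) -> Result<Document, LoaderError>;
}

/// The extraction backend is injected so the sketch needs no dependencies.
/// A real build would call into pdf-extract here instead.
pub struct PdfLoader<F>
where
    F: Fn(&Path) -> Result<String, String>,
{
    extract: F,
}

impl<F> PdfLoader<F>
where
    F: Fn(&Path) -> Result<String, String>,
{
    pub fn new(extract: F) -> Self {
        Self { extract }
    }
}

impl<F> DocumentLoader for PdfLoader<F>
where
    F: Fn(&Path) -> Result<String, String>,
{
    fn load(&self, path: &Path) -> Result<Document, LoaderError> {
        // Reject anything that is not a .pdf (case-insensitive) before parsing.
        let ext = path.extension().and_then(|e| e.to_str()).unwrap_or("");
        if !ext.eq_ignore_ascii_case("pdf") {
            return Err(LoaderError::UnsupportedExtension(ext.to_string()));
        }

        // Map backend failures onto the dedicated parse-error variant.
        let text = (self.extract)(path).map_err(LoaderError::PdfParseError)?;

        // Metadata keys are illustrative only.
        let mut metadata = HashMap::new();
        metadata.insert("source".to_string(), path.display().to_string());
        metadata.insert("format".to_string(), "pdf".to_string());

        Ok(Document { text, metadata })
    }
}
```

Injecting the extraction step as a closure also makes the extension check and error mapping testable without a real PDF file on disk.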

Examples and Tests

  • Added integration and unit tests for PdfLoader covering extension checks and metadata, and included a demonstration of PDF loading and chunking in the rag_pipeline example.
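The PR does not say which chunking strategy the rag_pipeline example uses. As a rough illustration of what chunking extracted PDF text can look like, here is a sketch assuming fixed-size character windows with overlap; the function name `chunk_text` and its parameters are hypothetical and not taken from mofa-foundation.

```rust
/// Split `text` into chunks of at most `size` characters, where each chunk
/// starts `size - overlap` characters after the previous one.
/// Works on `char`s rather than bytes so multi-byte UTF-8 text is never
/// split in the middle of a code point.
pub fn chunk_text(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(size > 0 && overlap < size, "require size > 0 and overlap < size");

    let chars: Vec<char> = text.chars().collect();
    let step = size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        // Stop once the window has reached the end of the text, so no
        // trailing chunk made up purely of overlap is emitted.
        if end == chars.len() {
            break;
        }
        start += step;
    }

    chunks
}
```

With `size = 4` and `overlap = 1`, the input `"abcdefghij"` yields `["abcd", "defg", "ghij"]`: consecutive chunks share one character, so a sentence that straddles a boundary still appears partly in both.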

Bug Fix

  • Fixed a typo in the Python analyze.py script: the subscript on the "analyses" key was missing its closing quote and bracket. The line now reads result["analyses"].append("sentiment").

iam-tsr added 2 commits April 2, 2026 15:47
- Introduced PdfLoader for loading and processing PDF documents.
- Added optional PDF support in mofa-foundation and rag_pipeline.
- Updated Cargo.lock with new dependencies: pdf-extract, adobe-cmap-parser, and others.
- Enhanced README and example files to demonstrate PDF functionality.