We could streamline the ingestion pipeline, which is currently implemented in archive_agent/data/FileData and the /archive_agent/data/loaders subpackage.
Options:
Goals:
- Support more file types
- Improve ingestion quality, e.g. retain PDF hierarchy (could also tweak currently used prompt)
For Markitdown there are some points to consider:
- PDF extraction seems to use pdfminer rather than pymupdf, and pdfminer has been reported to be slower.
- The image description feature seems very basic compared to combined OCR and entity extraction, so Archive Agent's native method, as currently implemented, seems superior.
For marker, I have to do similar research.
ESSENTIAL:
Make sure the new module is thread-safe.
The currently used pymupdf module is not thread-safe, which is a performance hit.
Because of this, PDFs are currently handled differently from other file types in IngestionManager.
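To illustrate the constraint, here is a minimal sketch of one way to mix a non-thread-safe loader with thread-safe ones in a single thread pool. It is not how IngestionManager currently works. The names `load_pdf`, `load_text`, `LOADERS`, `ingest`, and `_PDF_LOCK` are hypothetical, and the loaders are placeholders that do not call any real parsing library.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical: one process-wide lock guarding every call into a
# PDF backend that is not thread-safe.
_PDF_LOCK = threading.Lock()


def load_pdf(path: Path) -> str:
    """Placeholder for a PDF loader whose backend is not thread-safe."""
    with _PDF_LOCK:  # only one thread at a time may enter the backend
        return f"pdf:{path.name}"


def load_text(path: Path) -> str:
    """Placeholder for a loader that is safe to call concurrently."""
    return f"text:{path.name}"


# Hypothetical dispatch table from file extension to loader.
LOADERS: Dict[str, Callable[[Path], str]] = {
    ".pdf": load_pdf,
    ".txt": load_text,
}


def ingest(paths: List[Path], max_workers: int = 4) -> List[str]:
    """Load all files in a thread pool; results keep the input order."""

    def load(path: Path) -> str:
        return LOADERS[path.suffix.lower()](path)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load, paths))
```

With this approach, PDF loads are serialized by the lock while other file types still run in parallel. A loader that is thread-safe by design would not need the lock at all, which is why thread safety matters when choosing the new module.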
Also check out MinerU: https://github.com/opendatalab/MinerU