Improve ingestion pipeline

We could streamline the ingestion pipeline, which is currently implemented in `archive_agent/data/FileData` and the `/archive_agent/data/loaders` subpackage.

Options:
- [Markitdown](https://github.com/microsoft/markitdown)
- [Marker](https://github.com/datalab-to/marker)

Goals:
- Support more file types
- Improve ingestion quality, e.g. retain the PDF hierarchy (tweaking the currently used prompt could also help)

---

For Markitdown there are some points to consider:
- PDF extraction seems to use pdfminer instead of pymupdf; pdfminer has been reported to be slower.
- The image description feature seems very basic compared to Archive Agent's native method, which combines OCR and entity extraction and therefore seems to be superior as currently implemented.
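
To check the reported speed difference before committing to a backend, a small timing harness could help. This is only a sketch: `time_extractors`, `fake_fast`, and `fake_slow` are hypothetical names, and the two placeholder functions would need to be swapped for real pdfminer-based and pymupdf-based extraction functions.

```python
import time
from typing import Callable, Dict, List


def time_extractors(
    extractors: Dict[str, Callable[[str], str]],
    paths: List[str],
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the best-of-`repeats` wall time in seconds per extractor."""
    results: Dict[str, float] = {}
    for name, extract in extractors.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            for path in paths:
                extract(path)
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


# Placeholder extractors (assumptions): replace with real backends.
def fake_fast(path: str) -> str:
    return f"text of {path}"


def fake_slow(path: str) -> str:
    time.sleep(0.01)  # simulate a slower backend
    return f"text of {path}"


timings = time_extractors(
    {"fast": fake_fast, "slow": fake_slow},
    ["a.pdf", "b.pdf"],
)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s")
```

Taking the best of several runs reduces noise from caching and scheduling; a representative set of real PDFs would be needed for a meaningful comparison.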

For Marker, I still have to do similar research.

---

ESSENTIAL:

Make sure the new module is thread-safe.

The currently used pymupdf module is **not thread-safe**, which is a performance hit because PDF loading cannot run in parallel threads.
Because of this, PDFs are currently handled differently in IngestionManager.

---

Also check out MinerU: https://github.com/opendatalab/MinerU

Improve ingestion pipeline #49
