Open
Description
Improve the preprocessing module to ensure high-quality text extraction that preserves document structure and handles diverse formats.
Description:
- Enhance table detection and preservation to maintain row/column structure in extracted text
- Add support for multi-column layouts and complex document formatting
- Implement validation testing to detect garbled characters or encoding issues in output
- Investigate higher DPI settings (300+ DPI) and noise reduction techniques for OCR on scanned/image-based documents
- Ensure text passed to the LLM is clean, well-formatted, and structurally accurate
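The validation item above could start from a simple heuristic check like the sketch below. It is only an illustration: the function names, the 1% threshold, and the short list of mojibake markers are assumptions, not part of the existing preprocessing module.

```python
import unicodedata

# Assumed examples of UTF-8 text mis-decoded as Latin-1/Windows-1252.
# A real check would likely need a longer, corpus-specific list.
MOJIBAKE_MARKERS = (
    "\u00c3\u00a9",        # "é" decoded with the wrong codec
    "\u00e2\u20ac\u2122",  # right single quote decoded with the wrong codec
)


def garbled_ratio(text: str) -> float:
    """Return the fraction of characters that look like extraction damage."""
    if not text:
        return 0.0
    suspicious = 0
    for ch in text:
        if ch == "\ufffd":
            # Unicode replacement character: a byte sequence failed to decode.
            suspicious += 1
        elif ch not in "\n\r\t" and unicodedata.category(ch) in ("Cc", "Co", "Cn"):
            # Control, private-use, or unassigned code points.
            suspicious += 1
    return suspicious / len(text)


def looks_garbled(text: str, max_ratio: float = 0.01) -> bool:
    """Flag text whose damage ratio exceeds max_ratio (hypothetical default)
    or that contains a known mojibake marker."""
    if garbled_ratio(text) > max_ratio:
        return True
    return any(marker in text for marker in MOJIBAKE_MARKERS)
```

A check of this kind could run on each extracted page before the text is passed to the LLM, with flagged pages routed to a fallback such as OCR.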
Goal: Provide optimal text quality and format for downstream LLM processing across diverse paper types.
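For the table-preservation item, one possible approach is to serialize each detected table as a pipe-delimited Markdown table so that row/column structure survives in plain text. The sketch below assumes the table detector already yields rows as lists of cell strings; `rows_to_markdown` is a hypothetical helper, and treating the first row as the header is an assumption.

```python
from typing import List, Sequence


def rows_to_markdown(rows: Sequence[Sequence[str]]) -> str:
    """Render table rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    # Pad short rows so every line has the same number of columns.
    width = max(len(row) for row in rows)

    def fmt(row: Sequence[str]) -> str:
        # Escape pipes and flatten newlines so cell content cannot break the layout.
        cells: List[str] = [
            str(cell).replace("|", "\\|").replace("\n", " ").strip() for cell in row
        ]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in rows[1:]]
    return "\n".join(lines)
```

Padding ragged rows keeps columns aligned when the detector misses a cell, which is usually easier for a downstream model to read than a row of shifted values.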