This is a Streamlit project for extracting text, tables, and visual content from PDF files. It lets you compare three extraction libraries and see which one works best for your PDFs.
This project uses three libraries for PDF processing:
- All Text: Extract all document text
- Specific Page: Process a specific page
- Markdown/JSON Output: Structured data format
- Search Text: Text search and location finding
- Table Detection: Automatic table detection
- Image Extraction: Extract embedded images

- All Text: Full text extraction
- Specific Page: Page-based processing
- Table Extraction: Advanced table extraction
- Image Extraction: Image detection and cropping

- Lattice: Table detection based on cell boundaries
- Stream: Detection based on whitespace patterns
- Advanced Options: Line scale, page selection, password support
- Visual Debugging: Visualize detected table boundaries
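
The Lattice and Stream modes and the line scale, page, and password options listed above match the parameter names of Camelot's `read_pdf`. Assuming Camelot is the library behind them, a minimal sketch of how those options could be passed looks like this. The helper names `build_read_pdf_kwargs` and `extract_tables` are hypothetical and are not taken from this project's code.

```python
from typing import Any, Dict, Optional


def build_read_pdf_kwargs(
    flavor: str = "lattice",
    pages: str = "1",
    line_scale: Optional[int] = None,
    password: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect the options above into keyword arguments for camelot.read_pdf."""
    if flavor not in ("lattice", "stream"):
        raise ValueError(f"unknown flavor: {flavor!r}")
    kwargs: Dict[str, Any] = {"flavor": flavor, "pages": pages}
    if password:
        kwargs["password"] = password
    # line_scale is a lattice-only option; Camelot rejects it for stream.
    if flavor == "lattice" and line_scale is not None:
        kwargs["line_scale"] = line_scale
    return kwargs


def extract_tables(pdf_path: str, **options: Any):
    """Run Camelot with the given options.

    Camelot is imported lazily so the helper above works without it.
    """
    import camelot  # assumption: camelot-py is installed

    return camelot.read_pdf(pdf_path, **build_read_pdf_kwargs(**options))
```

For example, `extract_tables("example.pdf", flavor="lattice", pages="1-3", line_scale=40)` would return a list of tables, each exposing its data as a DataFrame through `.df` (the file name is a placeholder). For visual debugging, Camelot also provides `camelot.plot(table, kind="contour")`, which draws the detected table boundaries with matplotlib.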
- Clone the repository:
    git clone https://github.com/Serkan0YLDZ/pdf2text_streamlit.git
    cd pdf2text_streamlit
- Create a virtual environment:
    python -m venv myenv
    source myenv/bin/activate # Linux/Mac
    # or
    myenv\Scripts\activate # Windows
- Install dependencies:
    pip install -r requirements.txt
- Run the application:
    streamlit run main.py

pdf2text_streamlit/
├── main.py # Main application file
├── pages/
│ ├── upload.py # PDF upload page
│ ├── directTextExtraction.py # Text/table extraction page
│ └── docs/ # Folder where uploaded PDFs are stored
├── requirements.txt # Python dependencies
├── packages.txt # System dependencies (Ghostscript)
├── pdf2text.mp4 # Demo video
└── README.md # This file
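
The tree shows that uploaded PDFs end up in `pages/docs/`. As a rough illustration of how an upload page could store a file there, here is a small standard-library sketch. The function name `save_uploaded_pdf` and its behavior are assumptions for illustration, not the actual contents of `upload.py`.

```python
from pathlib import Path


def save_uploaded_pdf(filename: str, data: bytes, docs_dir: str = "pages/docs") -> Path:
    """Write uploaded PDF bytes into docs_dir and return the saved path."""
    # Keep only the final path component so a crafted name cannot escape docs_dir.
    safe_name = Path(filename.replace("\\", "/")).name
    if not safe_name.lower().endswith(".pdf"):
        raise ValueError("only .pdf files are accepted")
    target_dir = Path(docs_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / safe_name
    target.write_bytes(data)
    return target
```

In a Streamlit page this could be combined with `uploaded = st.file_uploader("Upload a PDF", type="pdf")` followed by `save_uploaded_pdf(uploaded.name, uploaded.getvalue())` once a file has been selected.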
- Cannot extract very complex tables and scanned (image-based) tables
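
Because scanned pages carry no text layer, one quick way to spot them before attempting table extraction is to check how much text each page yields. The sketch below is a heuristic only: it assumes you already have the extracted text of each page as a list of strings, and both the function name `pages_without_text` and the 20-character threshold are arbitrary choices made for illustration.

```python
from typing import List


def pages_without_text(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based numbers of pages whose extracted text is nearly empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        # Ignore whitespace so pages containing only blank lines count as empty.
        if len("".join(text.split())) < min_chars
    ]
```

Pages reported by this check are likely image-based and would need OCR, which is outside what this project handles.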

