PDF to Text & Table Extraction Tool

This Streamlit app extracts text, tables, and embedded images from PDF files. It runs three PDF extraction libraries side by side so you can see which one works best for your documents.

Demo Video

A demo recording is included in the repository as pdf2text.mp4.

Features

Multiple PDF Processing Libraries

This project uses three libraries for PDF processing:

1. PyMuPDF (fitz)

PyMuPDF Modes

  • All Text: Extract all document text
  • Specific Page: Process a specific page
  • Markdown/JSON Output: Structured data format
  • Search Text: Text search and location finding
  • Table Detection: Automatic table detection
  • Image Extraction: Extract embedded images
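
The first modes above (all text, a specific page, text search) map onto a handful of PyMuPDF calls. The following is a minimal sketch, not the code used in this app: the helper names and the `summarize_pdf` wrapper are hypothetical, and only `fitz.open`, `page.get_text`, and `page.search_for` come from PyMuPDF itself.

```python
def extract_all_text(doc):
    """Join the plain text of every page, one page after another."""
    return "\n".join(page.get_text() for page in doc)


def extract_page_text(doc, page_number):
    """Return the text of one page; page_number is 1-based, as a user would type it."""
    if not 1 <= page_number <= len(doc):
        raise ValueError(f"page {page_number} is out of range (1-{len(doc)})")
    return doc[page_number - 1].get_text()


def search_text(doc, needle):
    """Return (page_number, (x0, y0, x1, y1)) for every hit of `needle`."""
    hits = []
    for index, page in enumerate(doc):
        for rect in page.search_for(needle):
            hits.append((index + 1, (rect.x0, rect.y0, rect.x1, rect.y1)))
    return hits


def summarize_pdf(path, needle):
    """Hypothetical wrapper: open a PDF with PyMuPDF and run the helpers above."""
    import fitz  # PyMuPDF; imported lazily so the helpers load without it

    with fitz.open(path) as doc:
        return {
            "pages": len(doc),
            "text": extract_all_text(doc),
            "hits": search_text(doc, needle),
        }
```

Calling `summarize_pdf("pages/docs/example.pdf", "total")` (the file name is only an example) would return the page count, the full text, and the bounding box of each match.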

2. PDFplumber

PDFplumber Modes

  • All Text: Full text extraction
  • Specific Page: Page-based processing
  • Table Extraction: Advanced table extraction
  • Image Extraction: Image detection and cropping

3. Camelot

Camelot Modes

  • Lattice: Table detection based on cell boundaries
  • Stream: Detection based on whitespace patterns
  • Advanced Options: Line scale, page selection, password support
  • Visual Debugging: Visualize detected table boundaries
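
The Lattice and Stream modes and the advanced options above correspond to keyword arguments of `camelot.read_pdf`. The sketch below is one possible way to assemble them, with the caveat that `build_read_kwargs` and `read_tables` are hypothetical helpers and not part of Camelot or of this app. It assumes that `line_scale` is only meaningful for the lattice flavor, so it is rejected for stream.

```python
def build_read_kwargs(flavor="lattice", pages="1", line_scale=None, password=None):
    """Build keyword arguments for camelot.read_pdf from the UI options."""
    if flavor not in ("lattice", "stream"):
        raise ValueError(f"unsupported flavor: {flavor!r}")
    kwargs = {"flavor": flavor, "pages": pages}
    if password:
        kwargs["password"] = password
    if line_scale is not None:
        # line_scale tunes ruling-line detection, which only lattice uses
        if flavor != "lattice":
            raise ValueError("line_scale applies only to the lattice flavor")
        kwargs["line_scale"] = line_scale
    return kwargs


def read_tables(path, **options):
    """Hypothetical wrapper: run Camelot and return each table as a DataFrame."""
    import camelot  # third-party; imported lazily so the helper above loads without it

    tables = camelot.read_pdf(path, **build_read_kwargs(**options))
    return [table.df for table in tables]
```

For example, `read_tables("pages/docs/example.pdf", flavor="lattice", pages="1-3", line_scale=40)` would look for ruled tables on the first three pages of an example file.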

Installation

Steps

  1. Clone the repository:
     git clone https://github.com/Serkan0YLDZ/pdf2text_streamlit.git
     cd pdf2text_streamlit
  2. Create and activate a virtual environment:
     python -m venv myenv
     source myenv/bin/activate  # Linux/Mac
     # or
     myenv\Scripts\activate  # Windows
  3. Install dependencies:
     pip install -r requirements.txt

Usage

Start the Application

streamlit run main.py

Project Structure

pdf2text_streamlit/
├── main.py                     # Main application file
├── pages/
│   ├── upload.py               # PDF upload page
│   ├── directTextExtraction.py # Text/table extraction page
│   └── docs/                   # Folder where uploaded PDFs are stored
├── requirements.txt            # Python dependencies
├── packages.txt                # System dependencies (Ghostscript)
├── pdf2text.mp4                # Demo video
└── README.md                   # This file

Project Limitations

  • Very complex tables and tables in scanned (image-based) PDFs cannot be extracted
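
One way to warn users before they hit the scanned-PDF limitation is a simple heuristic: a page with no extractable text but at least one embedded image is probably a scan. The sketch below assumes PyMuPDF page objects (`page.get_text` and `page.get_images`); the function names are hypothetical, the check is not part of this app, and it can misclassify pages such as blank pages with a logo.

```python
def looks_scanned(page):
    """True when a page has no extractable text but contains at least one image."""
    has_text = bool(page.get_text().strip())
    has_images = len(page.get_images()) > 0
    return not has_text and has_images


def scanned_pages(doc):
    """Return the 1-based numbers of pages that look like scans."""
    return [index + 1 for index, page in enumerate(doc) if looks_scanned(page)]
```

With PyMuPDF installed, `scanned_pages(fitz.open(path))` would list the pages where text and table extraction are likely to return nothing.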
