PDF to Text & Table Extraction Tool

This Streamlit project extracts text, tables, and images from text-based PDF files. It lets you compare three extraction libraries to see which one works best for your PDFs.

Features

Multiple PDF Processing Libraries

This project uses three PDF processing libraries:

1. PyMuPDF (fitz)

PyMuPDF Modes

All Text: Extract all document text
Specific Page: Process a specific page
Markdown/JSON Output: Structured data format
Search Text: Text search and location finding
Table Detection: Automatic table detection
Image Extraction: Extract embedded images
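
The text-oriented modes above map onto a few PyMuPDF calls. The following is a minimal sketch, not the project's actual code: the helper names (`extract_all_text`, `extract_page_text`, `find_text`) are hypothetical, and `doc` is assumed to be a document object such as the one returned by `fitz.open(path)`. It relies only on `Page.get_text()` and `Page.search_for()`.

```python
def extract_all_text(doc):
    """All Text mode: join the plain text of every page."""
    return "\n".join(page.get_text() for page in doc)


def extract_page_text(doc, page_number):
    """Specific Page mode: text of a single page (page_number is 1-based)."""
    if not 1 <= page_number <= len(doc):
        raise ValueError(f"page {page_number} is outside 1..{len(doc)}")
    return doc[page_number - 1].get_text()


def find_text(doc, term):
    """Search Text mode: list (page number, bounding rectangle) for each hit."""
    hits = []
    for index, page in enumerate(doc):
        for rect in page.search_for(term):
            hits.append((index + 1, rect))
    return hits
```

Assuming PyMuPDF is installed, `fitz.open("example.pdf")` returns a document these helpers accept. For the table and image modes, PyMuPDF provides `Page.find_tables()` (in recent versions) and `Page.get_images()` together with `Document.extract_image(xref)`.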

2. PDFplumber

PDFplumber Modes

All Text: Full text extraction
Specific Page: Page-based processing
Table Extraction: Advanced table extraction
Image Extraction: Image detection and cropping
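
As a rough illustration of the page and table modes, the sketch below uses pdfplumber's `Page.extract_text()` and `Page.extract_tables()`. The helper names (`plumber_page_text`, `plumber_tables`) are hypothetical and not taken from the project; `pdf` is assumed to be the object returned by `pdfplumber.open(path)`, which exposes a `pages` list.

```python
def plumber_page_text(pdf, page_number):
    """Specific Page mode: text of one page (1-based), or "" if it has none."""
    if not 1 <= page_number <= len(pdf.pages):
        raise ValueError(f"page {page_number} is outside 1..{len(pdf.pages)}")
    # Some pdfplumber versions return None for pages without text.
    return pdf.pages[page_number - 1].extract_text() or ""


def plumber_tables(pdf):
    """Table Extraction mode: yield (page number, rows) for every table found."""
    for number, page in enumerate(pdf.pages, start=1):
        for table in page.extract_tables():
            # Empty cells come back as None; replace them with empty strings.
            yield number, [[cell or "" for cell in row] for row in table]
```

With pdfplumber installed, this could be used as `with pdfplumber.open("example.pdf") as pdf: rows = list(plumber_tables(pdf))`. For the image mode, `page.images` lists image metadata and `page.crop(bbox).to_image()` renders a cropped region.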

3. Camelot

Camelot Modes

Lattice: Table detection based on cell boundaries
Stream: Detection based on whitespace patterns
Advanced Options: Line scale, page selection, password support
Visual Debugging: Visualize detected table boundaries
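
The options above correspond to keyword arguments of `camelot.read_pdf`. The sketch below shows one way to assemble them; the helper name `camelot_options` is hypothetical and this is not the project's actual code. It assumes that `line_scale` is only meaningful for the Lattice flavor, so it is left out for Stream.

```python
def camelot_options(flavor="lattice", pages="1", password=None, line_scale=15):
    """Build keyword arguments for camelot.read_pdf (hypothetical helper)."""
    # Only the two flavors listed in this README are handled here.
    if flavor not in ("lattice", "stream"):
        raise ValueError(f"unsupported flavor: {flavor!r}")
    # pages is a string such as "1", "1,3", "1-3", or "all".
    options = {"flavor": flavor, "pages": pages}
    if password:
        options["password"] = password
    if flavor == "lattice":
        # line_scale tunes line detection and applies to Lattice only.
        options["line_scale"] = line_scale
    return options
```

With Camelot installed, `tables = camelot.read_pdf("example.pdf", **camelot_options("lattice", pages="1-3"))` returns a list of tables, each exposing a pandas DataFrame as `tables[0].df`. For visual debugging, `camelot.plot(tables[0], kind="contour")` draws the detected table boundaries (this requires matplotlib).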

Installation

Steps

Clone the repository:

git clone https://github.com/Serkan0YLDZ/pdf2text_streamlit.git
cd pdf2text_streamlit

Create a virtual environment:

python -m venv myenv
source myenv/bin/activate  # Linux/Mac
# or
myenv\Scripts\activate  # Windows

Install dependencies:

pip install -r requirements.txt

Usage

Start the Application

streamlit run main.py

Project Structure

pdf2text_streamlit/
├── main.py                     # Main application file
├── pages/
│   ├── upload.py               # PDF upload page
│   ├── directTextExtraction.py # Text/table extraction page
│   └── docs/                   # Folder where uploaded PDFs are stored
├── requirements.txt            # Python dependencies
├── packages.txt                # System dependencies (Ghostscript)
├── pdf2text.mp4                # Demo video
└── README.md                   # This file
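
To illustrate how an upload page could store files under pages/docs/, here is a minimal sketch. The function name `save_uploaded_pdf` and its behavior are assumptions for illustration; the real upload.py may work differently.

```python
from pathlib import Path


def save_uploaded_pdf(filename, data, docs_dir="pages/docs"):
    """Write uploaded bytes into docs_dir and return the saved path."""
    # Keep only the final path component so a name like "../x.pdf"
    # cannot place the file outside docs_dir.
    safe_name = Path(filename).name
    if not safe_name.lower().endswith(".pdf"):
        raise ValueError("only .pdf files are accepted")
    target_dir = Path(docs_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / safe_name
    target.write_bytes(data)
    return target
```

In a Streamlit page this might be called as `uploaded = st.file_uploader("Upload a PDF", type="pdf")` followed by `save_uploaded_pdf(uploaded.name, uploaded.getvalue())` when `uploaded` is not `None`.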

Project Limitations

The tool cannot extract very complex tables or tables in scanned (image-based) PDFs.
