Mistral OCR - Advanced PDF Processing & Audio Transcription

📖 For Arabic documentation, see README_ar.md

This project provides Python scripts for OCR processing, PDF text extraction, audio transcription, and document processing using the Mistral AI API. Features include batch processing, cost tracking, multilingual support, and speech-to-text conversion.

Editing Arabic Markdown

To edit Arabic markdown content, use the online editor at https://app.dawin.io/, which is designed for right-to-left text direction and Arabic language features.

🚀 pdf_to_txt_new.py - Advanced PDF OCR Converter

Enhanced PDF to Text/Markdown converter with batch processing, URL support, automatic dependency management, and smart file management.

✨ Key Features

  • 📄 Flexible Output Formats: Plain text (.txt) by default, or markdown (.md) with the --md flag
  • 🧹 Markdown Cleaning: Use the --clean flag to remove repetitive headers from academic papers (requires markdowncleaner)
  • 🌐 URL Support: Download and process PDFs directly from URLs with auto-cleanup
  • 🔄 Batch Processing: Process single files or entire directories recursively
  • 🧠 Smart Skip Logic: Skips a PDF only when an output file with the target extension already exists
  • 🔄 Re-processing: Interactive confirmation before re-processing a single file, with uniquely named outputs
  • 🔑 Custom API Key: Use the --api-key parameter or the MISTRAL_API_KEY environment variable
  • ♻️ Auto-Cleanup: Downloaded PDFs are deleted after OCR unless the --keep flag is used
  • 📦 Dependency Checking: Automatically checks for missing packages and offers to install them
  • Recursive Directory Support: Processes PDFs in all subdirectories
  • 📂 In-Place Processing: Writes output files to the same location as the source PDFs

📋 Usage Examples

Single File Processing

# Process to plain text (default)
python pdf_to_txt_new.py document.pdf

# Process to markdown
python pdf_to_txt_new.py document.pdf --md

# Process to clean markdown (removes repetitive headers)
python pdf_to_txt_new.py document.pdf --md --clean

# Explicit plain text
python pdf_to_txt_new.py document.pdf --txt

# Use custom API key
python pdf_to_txt_new.py document.pdf --api-key your_api_key_here

URL Processing

# Download and process PDF from URL (auto-cleanup)
python pdf_to_txt_new.py --url https://example.com/document.pdf

# Download and convert to markdown
python pdf_to_txt_new.py --url https://example.com/document.pdf --md

# Download and keep the PDF file after OCR
python pdf_to_txt_new.py --url https://example.com/document.pdf --keep

Directory Processing

# Process all PDFs in directory (recursive) to text
python pdf_to_txt_new.py ./documents/

# Process all PDFs to markdown
python pdf_to_txt_new.py ./documents/ --md

🎯 Processing Behavior

Directory Mode

  • Recursively finds all *.pdf files
  • Smart Skip Logic: Skips a PDF only when an output file with the target extension already exists
    • Example: If file.txt exists and you run with --md, the PDF is still processed
  • Shows progress: "Skipping 3 PDF(s) with existing .txt files, 2 remaining"
  • Processes only PDFs that do not yet have an output file with the target extension
  • Outputs files to the same directory as the source PDFs
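
The skip rule above can be sketched in a few lines. This is a minimal illustration, not the script's actual code: the function name split_pending and its return shape are assumptions made for this example.

```python
from pathlib import Path


def split_pending(root: Path, target_ext: str) -> tuple[list[Path], list[Path]]:
    """Return (to_process, skipped) for every PDF found under root.

    Hypothetical helper: a PDF counts as already processed only when a
    sibling file with the *target* extension exists next to it.
    """
    to_process: list[Path] = []
    skipped: list[Path] = []
    for pdf in sorted(root.rglob("*.pdf")):  # recursive search
        if pdf.with_suffix(target_ext).exists():
            skipped.append(pdf)
        else:
            to_process.append(pdf)
    return to_process, skipped
```

With this rule, a directory that already holds file.txt is skipped for a plain text run but still processed for a --md run, because only the .md sibling is checked.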

Single File Mode

  • Checks only for an existing output file with the target extension
    • Example: If file.txt exists and you run with --md, no confirmation is needed
  • Asks for confirmation only when a file with the target extension already exists
  • Creates uniquely named outputs: document_1.txt, document_2.md, etc.
  • Outputs the file to the same directory as the source PDF
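
One way to produce names such as document_1.txt is to count upward until a free name is found. The sketch below assumes that numbering scheme; unique_output_path is a hypothetical helper, and the real script may choose names differently.

```python
from pathlib import Path


def unique_output_path(pdf: Path, ext: str) -> Path:
    """Return an output path next to the PDF that does not exist yet.

    Assumed scheme: document.txt first, then document_1.txt,
    document_2.txt, and so on.
    """
    candidate = pdf.with_suffix(ext)
    counter = 1
    while candidate.exists():
        candidate = pdf.with_name(f"{pdf.stem}_{counter}{ext}")
        counter += 1
    return candidate
```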

URL Mode

  • Downloads PDF from URL to current directory
  • Processes downloaded PDF like a local file
  • Auto-cleanup: Deletes downloaded PDF after OCR (unless --keep flag is used)
  • Maintains all other processing features
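
The download-then-clean-up flow can be illustrated with the standard library alone. This is a sketch under assumptions: the helper names are invented for this example, the fallback file name downloaded_document.pdf is borrowed from the output tree in this README, and the real script may download files differently.

```python
import shutil
import urllib.request
from contextlib import contextmanager
from pathlib import Path
from urllib.parse import unquote, urlparse


def filename_from_url(url: str, default: str = "downloaded_document.pdf") -> str:
    """Use the last URL path segment if it looks like a PDF, else a default."""
    name = Path(unquote(urlparse(url).path)).name
    return name if name.lower().endswith(".pdf") else default


@contextmanager
def downloaded_pdf(url: str, dest_dir: Path, keep: bool = False):
    """Download url into dest_dir and yield the local path.

    The file is deleted on exit unless keep is True, mirroring the
    --keep behavior described above.
    """
    target = dest_dir / filename_from_url(url)
    try:
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        yield target
    finally:
        if not keep and target.exists():
            target.unlink()
```

A caller would wrap the OCR step in `with downloaded_pdf(url, Path.cwd()) as pdf:` so that cleanup also runs when processing fails.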

📦 Automatic Dependency Management

The script automatically checks for required packages and offers to install them:

======================================================================
ERROR: Missing required packages
======================================================================
  ✗ python-dotenv
  ✗ mistralai

To install missing packages, run one of these commands:
  pip install python-dotenv mistralai
  pip install -r requirements.txt

Would you like to install them now? (y/N):

Features:

  • ✅ Automatic detection of missing packages
  • ✅ Interactive installation prompt
  • ✅ Clear installation instructions
  • ✅ Safe permission-based installation
  • ✅ Silent operation when packages are installed
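
A dependency check of this kind can be approximated with importlib. The sketch below is illustrative only: the function names and the REQUIRED mapping are assumptions, and the script's own detection and prompt logic may differ.

```python
import importlib.util
import sys

# pip distribution name -> importable module name
REQUIRED = {"python-dotenv": "dotenv", "mistralai": "mistralai"}


def missing_packages(required: dict[str, str]) -> list[str]:
    """Return the pip names whose top-level module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


def pip_install_command(names: list[str]) -> list[str]:
    """Build a pip command that targets the running interpreter."""
    return [sys.executable, "-m", "pip", "install", *names]
```

Running pip through sys.executable installs into the same environment that runs the script. After the user confirms, the command can be passed to subprocess.check_call.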

Output Structure

your_directory/
├── pdf_to_txt_new.py
├── documents/
│   ├── report.pdf
│   ├── report.txt                   # Default: plain text output
│   ├── data.pdf
│   └── data.md                      # With --md flag
└── subfolder/
    ├── analysis.pdf
    ├── analysis.txt                 # Default output
    └── downloaded_document.pdf      # URL downloads (deleted unless --keep)

Output Format Examples:

  • python pdf_to_txt_new.py doc.pdf → doc.txt (default)
  • python pdf_to_txt_new.py doc.pdf --md → doc.md
  • python pdf_to_txt_new.py --url https://example.com/file.pdf → file.txt (PDF deleted after)
  • python pdf_to_txt_new.py --url https://example.com/file.pdf --keep → file.txt + file.pdf (kept)

🎛️ Command Line Options

python pdf_to_txt_new.py [input] [options]

Arguments:
  input                 Path to PDF file or directory (optional with --url)

Options:
  --url URL            Download and process PDF from URL
  --md                 Convert to markdown instead of plain text
  --clean              Clean markdown output (remove repetitive headers)
  --txt                Explicitly convert to plain text (default)
  --api-key KEY        Use custom Mistral API key
  --keep               Keep downloaded PDF file after processing
  --model MODEL        OCR model name (default: mistral-ocr-latest)
  -h, --help           Show help message

Common Use Cases:

| Command | Description |
| --- | --- |
| pdf_to_txt_new.py file.pdf | Process to plain text (default) |
| pdf_to_txt_new.py file.pdf --md | Process to markdown |
| pdf_to_txt_new.py file.pdf --md --clean | Process to clean markdown (remove headers) |
| pdf_to_txt_new.py --url https://example.com/doc.pdf | Download, OCR, delete PDF |
| pdf_to_txt_new.py --url URL --keep | Download, OCR, keep PDF |
| pdf_to_txt_new.py ./docs/ | Process all PDFs in directory |
| pdf_to_txt_new.py file.pdf --api-key KEY | Use custom API key |
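
For readers who want to see how such an interface maps onto argparse, here is a minimal sketch that reproduces the documented options. It is not the script's actual parser: build_parser and target_extension are hypothetical names, and the rule that --md wins when both format flags are given is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Recreate the documented command line options."""
    parser = argparse.ArgumentParser(prog="pdf_to_txt_new.py")
    parser.add_argument("input", nargs="?",
                        help="Path to PDF file or directory (optional with --url)")
    parser.add_argument("--url", help="Download and process PDF from URL")
    parser.add_argument("--md", action="store_true",
                        help="Convert to markdown instead of plain text")
    parser.add_argument("--clean", action="store_true",
                        help="Clean markdown output (remove repetitive headers)")
    parser.add_argument("--txt", action="store_true",
                        help="Explicitly convert to plain text (default)")
    parser.add_argument("--api-key", help="Use custom Mistral API key")
    parser.add_argument("--keep", action="store_true",
                        help="Keep downloaded PDF file after processing")
    parser.add_argument("--model", default="mistral-ocr-latest",
                        help="OCR model name")
    return parser


def target_extension(args: argparse.Namespace) -> str:
    """Assumed rule: markdown only when --md is given, else plain text."""
    return ".md" if args.md else ".txt"
```

Because the input argument uses nargs="?", a call with only --url parses cleanly and leaves args.input as None.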

📄 pdf_to_txt.py - Basic PDF OCR Converter

Legacy single-file PDF to text converter.

Usage

python pdf_to_txt.py <path_to_pdf_file>

🎵 transcribe_audio.py - Audio Transcription Tool

Advanced audio file transcription using Mistral AI's Voxtral models for high-quality speech-to-text conversion.

✨ Key Features

  • 🎯 High-Quality Transcription: Uses Mistral's Voxtral models for accurate speech recognition
  • 🌍 Multilingual Support: Supports multiple languages including Arabic, English, and more
  • Simple File Processing: Process any audio file with automatic text output
  • 🔧 Command-Line Interface: Easy-to-use CLI with file path input
  • Automatic Output: Saves transcription to .txt file with same base name
  • 🛡️ Error Handling: Comprehensive error handling with user-friendly messages

📋 Usage Examples

Single Audio File Transcription

# Basic usage - transcribe any audio file
python transcribe_audio.py audio.ogg
python transcribe_audio.py recording.mp3
python transcribe_audio.py speech.wav

🎯 Processing Behavior

  • Input: Any audio file (.ogg, .mp3, .wav, .m4a, .flac, etc.)
  • Output: Creates a .txt file with the same base name in the same directory
  • Model: Uses voxtral-mini-latest for optimal transcription quality
  • Encoding: UTF-8 encoding for proper multilingual text support
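
The output naming and encoding rules above amount to a small file-writing step. The sketch below covers only that step and leaves the transcription call out; save_transcript is a hypothetical name, not a function from the script.

```python
from pathlib import Path


def save_transcript(audio_path: Path, text: str) -> Path:
    """Write text next to the audio file, same base name, .txt suffix."""
    out_path = audio_path.with_suffix(".txt")
    # An explicit UTF-8 encoding keeps Arabic and other non-ASCII text
    # intact regardless of the platform default.
    out_path.write_text(text, encoding="utf-8")
    return out_path
```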

Output Example

your_directory/
├── transcribe_audio.py
├── speech.ogg
└── speech.txt              # Transcription output

🛠️ Setup

  1. Clone the repository:

    git clone https://github.com/EngDawood/mistral-ocr.git
    cd mistral-ocr
  2. Install dependencies:

    pip install -r requirements.txt
  3. Get your free Mistral API key:

    • Visit Mistral AI Console
    • Sign up for a free account
    • Navigate to API Keys section
    • Create a new API key
    • Copy the API key (keep it secure)
  4. Set up your API key: Copy the example environment file and fill in your API key:

    cp .env.example .env

    Then edit .env with your actual Mistral API key:

    MISTRAL_API_KEY=your_actual_api_key_here
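
Once the key is available, a script has to decide which source takes priority. The following sketch assumes that an explicit --api-key value overrides the MISTRAL_API_KEY environment variable; that precedence and the name resolve_api_key are assumptions for illustration. Loading the .env file into the environment (for example with python-dotenv's load_dotenv) would happen before this step.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(cli_key: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the API key: command line value first, then the environment."""
    env = os.environ if env is None else env
    key = cli_key or env.get("MISTRAL_API_KEY")
    if not key:
        raise RuntimeError(
            "No Mistral API key found: pass --api-key or set MISTRAL_API_KEY"
        )
    return key
```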
    

📋 Requirements

🔗 Mistral OCR API Information

Free Tier: Mistral offers general OCR processing for up to 1,000 pages for free.

API Limitations:

  • Uploaded document files must not exceed 50 MB in size
  • Documents should be no longer than 1,000 pages
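
A size check before upload can catch oversized files early. This sketch covers only the 50 MB limit; it assumes 1 MB means 1024 × 1024 bytes, which the source does not specify, and exceeds_size_limit is a hypothetical helper. Checking the page count would need a PDF library and is left out.

```python
from pathlib import Path

# Assumption: the 50 MB limit is interpreted as 50 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024


def exceeds_size_limit(path: Path, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True when the file is larger than the upload limit."""
    return path.stat().st_size > limit
```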

OCR Resources & Cookbooks:

🌍 Arabic Language Support

This project fully supports Arabic text processing and multilingual documents. The system has been tested with:

  • Mixed Arabic and English texts
  • UTF-8 encoding for Arabic content
  • Proper right-to-left text direction handling

Important Notes for Arabic Users

  • Ensure PDF files are saved with UTF-8 encoding
  • The system preserves correct Arabic text ordering
  • Multilingual documents can be processed efficiently
  • Arabic README available: README_ar.md

🤝 Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

📄 License

This project is open source. Please check the license file for details.
