For Arabic documentation, see README_ar.md.
This project provides Python scripts for OCR processing, PDF text extraction, audio transcription, and document processing using the Mistral AI API. It features batch processing, cost tracking, multilingual support, and speech-to-text capabilities.
For editing Arabic markdown content, use the online editor at https://app.dawin.io/, which is designed to support right-to-left text direction and Arabic language features.
Enhanced PDF to Text/Markdown converter with batch processing, URL support, automatic dependency management, and smart file management.
- Flexible Output Formats: Default plain text (`.txt`) or markdown (`.md`) with `--md` flag
- Markdown Cleaning: Use `--clean` flag to remove repetitive headers from academic papers (requires `markdowncleaner`)
- URL Support: Download and process PDFs directly from URLs with auto-cleanup
- Batch Processing: Process single files or entire directories recursively
- Smart Skip Logic: Only skips PDFs with existing files of the target extension
- Re-processing: Interactive confirmation for single file re-processing with unique naming
- Custom API Key: Use `--api-key` parameter or environment variable
- Auto-Cleanup: Downloaded PDFs are deleted after OCR unless `--keep` flag is used
- Dependency Checking: Automatically checks and offers to install missing packages
- Recursive Directory Support: Processes PDFs in all subdirectories
- In-Place Processing: Outputs files to the same location as source PDFs
# Process to plain text (default)
python pdf_to_txt_new.py document.pdf
# Process to markdown
python pdf_to_txt_new.py document.pdf --md
# Process to clean markdown (removes repetitive headers)
python pdf_to_txt_new.py document.pdf --md --clean
# Explicit plain text
python pdf_to_txt_new.py document.pdf --txt
# Use custom API key
python pdf_to_txt_new.py document.pdf --api-key your_api_key_here
# Download and process PDF from URL (auto-cleanup)
python pdf_to_txt_new.py --url https://example.com/document.pdf
# Download and convert to markdown
python pdf_to_txt_new.py --url https://example.com/document.pdf --md
# Download and keep the PDF file after OCR
python pdf_to_txt_new.py --url https://example.com/document.pdf --keep
# Process all PDFs in directory (recursive) to text
python pdf_to_txt_new.py ./documents/
# Process all PDFs to markdown
python pdf_to_txt_new.py ./documents/ --md

- Recursively finds all `*.pdf` files
- Smart Skip Logic: Only skips PDFs with existing files of the target extension
  - Example: If `file.txt` exists and you run with `--md`, it will still process
- Shows progress: `"Skipping 3 PDF(s) with existing .txt files, 2 remaining"`
- Processes only new files or files without target extension
- Outputs files to the same directory as the source PDFs
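The skip rule described above can be sketched in a few lines. This is a minimal illustration, not the code in `pdf_to_txt_new.py`; the function names `plan_batch` and `progress_message` are hypothetical.

```python
from pathlib import Path


def plan_batch(root: Path, target_ext: str = ".txt"):
    """Split the PDFs under root into (to_process, skipped) for one target extension."""
    to_process, skipped = [], []
    for pdf in sorted(root.rglob("*.pdf")):  # recursive search
        # Only an existing file with the *target* extension causes a skip:
        # an existing file.txt does not block a run that targets .md.
        if pdf.with_suffix(target_ext).exists():
            skipped.append(pdf)
        else:
            to_process.append(pdf)
    return to_process, skipped


def progress_message(skipped, to_process, target_ext: str) -> str:
    """Format the progress line in the style shown in this README."""
    return (f"Skipping {len(skipped)} PDF(s) with existing {target_ext} files, "
            f"{len(to_process)} remaining")
```

Because the check uses `pdf.with_suffix(target_ext)`, the output path is always next to the source PDF, which matches the in-place behavior described above.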
- Checks if PDF has output file with target extension only
  - Example: If `file.txt` exists and you run with `--md`, no confirmation needed
- Asks for confirmation only when target extension file exists
- Creates uniquely named outputs: `document_1.txt`, `document_2.md`, etc.
- Outputs the file to the same directory as the source PDF
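One way to produce names such as `document_1.txt` is to probe for the first free numeric suffix. The sketch below assumes that scheme; `unique_output_path` is a hypothetical helper, and the script's actual numbering rule may differ.

```python
from pathlib import Path


def unique_output_path(pdf: Path, target_ext: str) -> Path:
    """Return the output path for pdf, adding _1, _2, ... while the name is taken."""
    candidate = pdf.with_suffix(target_ext)
    counter = 1
    while candidate.exists():
        # document.pdf -> document_1.txt, document_2.txt, ...
        candidate = pdf.with_name(f"{pdf.stem}_{counter}{target_ext}")
        counter += 1
    return candidate
```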
- Downloads PDF from URL to current directory
- Processes downloaded PDF like a local file
- Auto-cleanup: Deletes downloaded PDF after OCR (unless `--keep` flag is used)
- Maintains all other processing features
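The download-then-cleanup flow can be expressed as a context manager so the PDF is removed even if OCR fails. This is a sketch under assumptions: `filename_from_url` and `downloaded_pdf` are hypothetical names, and the fallback file name is borrowed from the directory listing in this README.

```python
from contextlib import contextmanager
from pathlib import Path
from urllib.parse import unquote, urlparse
from urllib.request import urlretrieve


def filename_from_url(url: str, default: str = "downloaded_document.pdf") -> str:
    """Pick a local file name from the last path segment of the URL."""
    name = Path(unquote(urlparse(url).path)).name
    return name if name.lower().endswith(".pdf") else default


@contextmanager
def downloaded_pdf(url: str, keep: bool = False, fetch=urlretrieve):
    """Download url into the current directory; delete the file afterwards unless keep."""
    target = Path.cwd() / filename_from_url(url)
    fetch(url, target)
    try:
        yield target  # the caller runs OCR on this path
    finally:
        if not keep:
            target.unlink(missing_ok=True)  # Python 3.8+
```

Usage would look like `with downloaded_pdf(url, keep=args.keep) as pdf: ...`, with the body treating `pdf` like any local file.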
The script automatically checks for required packages and offers to install them:
======================================================================
ERROR: Missing required packages
======================================================================
  python-dotenv
  mistralai
To install missing packages, run one of these commands:
pip install python-dotenv mistralai
pip install -r requirements.txt
Would you like to install them now? (y/N):
Features:
- Automatic detection of missing packages
- Interactive installation prompt
- Clear installation instructions
- Safe permission-based installation
- Silent operation when packages are installed
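A dependency check of this kind can be built from the standard library alone. The sketch below is illustrative and is not taken from the script; `missing_packages` and `offer_install` are hypothetical names, and the printed text is simplified.

```python
import importlib.util
import subprocess
import sys

# pip name -> import name (the two differ for python-dotenv)
REQUIRED = {"python-dotenv": "dotenv", "mistralai": "mistralai"}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]


def offer_install(missing, ask=input) -> bool:
    """Print instructions and install only if the user answers 'y'."""
    if not missing:
        return True  # stay silent when everything is installed
    print("ERROR: Missing required packages")
    for name in missing:
        print(f"  {name}")
    print(f"To install: pip install {' '.join(missing)}")
    if ask("Would you like to install them now? (y/N): ").strip().lower() != "y":
        return False  # default answer is No, so nothing is installed without consent
    cmd = [sys.executable, "-m", "pip", "install", *missing]
    return subprocess.run(cmd).returncode == 0
```

Running pip through `sys.executable -m pip` installs into the same interpreter that is running the script, which avoids mismatches between multiple Python installations.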
your_directory/
├── pdf_to_txt_new.py
├── documents/
│   ├── report.pdf
│   ├── report.txt              # Default: plain text output
│   ├── data.pdf
│   └── data.md                 # With --md flag
├── subfolder/
│   ├── analysis.pdf
│   └── analysis.txt            # Default output
└── downloaded_document.pdf     # URL downloads (deleted unless --keep)
Output Format Examples:
- `python pdf_to_txt_new.py doc.pdf` → `doc.txt` (default)
- `python pdf_to_txt_new.py doc.pdf --md` → `doc.md`
- `python pdf_to_txt_new.py --url https://example.com/file.pdf` → `file.txt` (PDF deleted after)
- `python pdf_to_txt_new.py --url https://example.com/file.pdf --keep` → `file.txt` + `file.pdf` (kept)
python pdf_to_txt_new.py [input] [options]

Arguments:
  input            Path to PDF file or directory (optional with --url)

Options:
  --url URL        Download and process PDF from URL
  --md             Convert to markdown instead of plain text
  --clean          Clean markdown output (remove repetitive headers)
  --txt            Explicitly convert to plain text (default)
  --api-key KEY    Use custom Mistral API key
  --keep           Keep downloaded PDF file after processing
  --model MODEL    OCR model name (default: mistral-ocr-latest)
  -h, --help       Show help message

Common Use Cases:
| Command | Description |
|---|---|
| `pdf_to_txt_new.py file.pdf` | Process to plain text (default) |
| `pdf_to_txt_new.py file.pdf --md` | Process to markdown |
| `pdf_to_txt_new.py file.pdf --md --clean` | Process to clean markdown (remove headers) |
| `pdf_to_txt_new.py --url https://example.com/doc.pdf` | Download, OCR, delete PDF |
| `pdf_to_txt_new.py --url URL --keep` | Download, OCR, keep PDF |
| `pdf_to_txt_new.py ./docs/` | Process all PDFs in directory |
| `pdf_to_txt_new.py file.pdf --api-key KEY` | Use custom API key |
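For readers who want to see how the options listed above map onto `argparse`, here is a minimal sketch. It mirrors the documented flags only; the helper names `build_parser` and `target_extension`, and the rule that either `input` or `--url` must be present, are assumptions rather than the script's confirmed behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the options documented in this README."""
    parser = argparse.ArgumentParser(prog="pdf_to_txt_new.py")
    parser.add_argument("input", nargs="?",
                        help="Path to PDF file or directory (optional with --url)")
    parser.add_argument("--url", help="Download and process PDF from URL")
    parser.add_argument("--md", action="store_true",
                        help="Convert to markdown instead of plain text")
    parser.add_argument("--clean", action="store_true",
                        help="Clean markdown output (remove repetitive headers)")
    parser.add_argument("--txt", action="store_true",
                        help="Explicitly convert to plain text (default)")
    parser.add_argument("--api-key", help="Use custom Mistral API key")
    parser.add_argument("--keep", action="store_true",
                        help="Keep downloaded PDF file after processing")
    parser.add_argument("--model", default="mistral-ocr-latest", help="OCR model name")
    return parser


def target_extension(args: argparse.Namespace) -> str:
    """Plain text is the default; --md switches the output to markdown."""
    return ".md" if args.md else ".txt"


def parse(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.input and not args.url:
        # Assumed validation: one source of PDFs must be given.
        parser.error("provide a PDF path, a directory, or --url")
    return args
```

`argparse` stores `--api-key` as `args.api_key`, and `-h/--help` is added automatically.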
Legacy single-file PDF to text converter.
python pdf_to_txt.py <path_to_pdf_file>

Advanced audio file transcription using Mistral AI's Voxtral models for high-quality speech-to-text conversion.
- High-Quality Transcription: Uses Mistral's Voxtral models for accurate speech recognition
- Multilingual Support: Supports multiple languages including Arabic, English, and more
- Simple File Processing: Process any audio file with automatic text output
- Command-Line Interface: Easy-to-use CLI with file path input
- Automatic Output: Saves transcription to `.txt` file with same base name
- Error Handling: Comprehensive error handling with user-friendly messages
# Basic usage - transcribe any audio file
python transcribe_audio.py audio.ogg
python transcribe_audio.py recording.mp3
python transcribe_audio.py speech.wav

- Input: Any audio file (`.ogg`, `.mp3`, `.wav`, `.m4a`, `.flac`, etc.)
- Output: Creates a `.txt` file with the same base name in the same directory
- Model: Uses `voxtral-mini-latest` for optimal transcription quality
- Encoding: UTF-8 encoding for proper multilingual text support
your_directory/
├── transcribe_audio.py
├── speech.ogg
└── speech.txt    # Transcription output
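The output convention (same base name, `.txt` suffix, UTF-8) is independent of the API call itself, so it can be sketched with the transcription step passed in as a function. `save_transcription` is a hypothetical helper, and `transcribe` stands in for whatever call the script makes to the Voxtral model.

```python
from pathlib import Path
from typing import Callable


def save_transcription(audio: Path, transcribe: Callable[[Path], str]) -> Path:
    """Write the transcript next to the audio file as <base name>.txt in UTF-8."""
    out = audio.with_suffix(".txt")  # speech.ogg -> speech.txt
    # An explicit encoding keeps Arabic and other non-ASCII text intact
    # regardless of the platform's default encoding.
    out.write_text(transcribe(audio), encoding="utf-8")
    return out
```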
- Clone the repository:

      git clone https://github.com/EngDawood/mistral-ocr.git
      cd mistral-ocr

- Install dependencies:

      pip install -r requirements.txt

- Get your free Mistral API key:
  - Visit Mistral AI Console
  - Sign up for a free account
  - Navigate to API Keys section
  - Create a new API key
  - Copy the API key (keep it secure)
- Set up your API key: Copy the example environment file and fill in your API key:

      cp .env.example .env

  Then edit `.env` with your actual Mistral API key:

      MISTRAL_API_KEY=your_actual_api_key_here
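The README mentions two sources for the key: the `--api-key` parameter and the `MISTRAL_API_KEY` environment variable. A sketch of how they might be combined follows; the precedence (command line first) and the name `resolve_api_key` are assumptions. In a real run, `load_dotenv()` from `python-dotenv` would first copy the entries of `.env` into the environment; it is left out here so the snippet runs with the standard library only.

```python
import os

PLACEHOLDER = "your_actual_api_key_here"  # value shown in the example above


def resolve_api_key(cli_key=None, env=os.environ) -> str:
    """Return the key from --api-key if given, else from MISTRAL_API_KEY."""
    key = cli_key or env.get("MISTRAL_API_KEY")
    if not key or key == PLACEHOLDER:
        # Fail early with a clear message instead of sending a bad key to the API.
        raise SystemExit("Set MISTRAL_API_KEY in .env or pass --api-key")
    return key
```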
- Python 3.8+
- Mistral AI API key (get free at console.mistral.ai)
- Required packages: `mistralai`, `python-dotenv` (auto-checked by `pdf_to_txt_new.py`)
- Optional: `markdowncleaner` for cleaning academic papers (see github.com/josk0/markdowncleaner)
Free Tier: Mistral offers general OCR processing for up to 1,000 pages for free.
API Limitations:
- Uploaded document files must not exceed 50 MB in size
- Documents should be no longer than 1,000 pages
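These limits can be checked locally before uploading. The sketch below is an illustration only: `check_upload_limits` is a hypothetical helper, 50 MB is interpreted as 50 × 1024 × 1024 bytes (an assumption), and the page count must come from the caller because the standard library cannot read PDF page counts.

```python
from pathlib import Path

MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # 50 MB, interpreted as MiB (assumption)
MAX_PAGES = 1000


def check_upload_limits(pdf: Path, page_count=None, max_bytes=MAX_UPLOAD_BYTES):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    size = pdf.stat().st_size
    if size > max_bytes:
        problems.append(f"{pdf.name}: {size} bytes exceeds the {max_bytes}-byte limit")
    # page_count is optional and supplied by the caller (e.g. from a PDF library).
    if page_count is not None and page_count > MAX_PAGES:
        problems.append(f"{pdf.name}: {page_count} pages exceeds the {MAX_PAGES}-page limit")
    return problems
```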
This project fully supports Arabic text processing and multilingual documents. The system has been tested with:
- Mixed Arabic and English texts
- UTF-8 encoding for Arabic content
- Proper right-to-left text direction handling
- Ensure PDF files are saved with UTF-8 encoding
- The system preserves correct Arabic text ordering
- Multilingual documents can be processed efficiently
- Arabic README available: README_ar.md
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
This project is open source. Please check the license file for details.