OCRacle is a modular Python tool (optionally portable) that recursively scans directories, extracts text from PDF and image files (native PDF text and OCR), and searches for keywords or regex patterns. Results can be printed on the console and optionally saved to JSON or CSV. It is also useful as a service component for Data Loss Prevention (DLP) software.
- Recursive scan of directories
- File type detection using magic bytes (not extensions)
- Supported file formats:
- PDF (native text extraction via PyMuPDF, fallback to OCR)
- Images (PNG, JPG, TIFF, etc.)
- Detection rules defined in YAML (ocracle/rules/rules.yaml)
- Output options:
- Console
- JSON
- CSV
- Options:
- --include-text: include extracted text in results
- --rules <file>: specify a custom YAML rule file
- -v/--verbose: show files being processed
- Colorized logging
- Scan summary with total matches and start/end timestamps
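As an illustration of the first two features, recursive scanning and magic-byte detection can be sketched with the standard library alone. This is a minimal sketch, not OCRacle's actual code: the names `SIGNATURES`, `detect_type`, and `scan` are hypothetical, and only the PDF, PNG, JPEG, and TIFF signatures are covered.

```python
import os
from pathlib import Path
from typing import Iterator, Optional, Tuple

# Well-known leading byte signatures for the formats listed above.
# The table itself and all names here are illustrative assumptions.
SIGNATURES = (
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
)


def detect_type(path: Path) -> Optional[str]:
    """Return a format label based on the file's first bytes, or None."""
    try:
        with open(path, "rb") as fh:
            header = fh.read(8)  # the longest signature above is 8 bytes
    except OSError:
        return None  # unreadable files are simply skipped
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return None


def scan(root: str) -> Iterator[Tuple[Path, str]]:
    """Walk root recursively and yield (path, type) for supported files."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = Path(dirpath) / name
            kind = detect_type(path)
            if kind is not None:
                yield path, kind
```

Because the decision is made on content, a PDF saved as report.txt is still picked up, while a text file renamed to notes.png is ignored.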
Python 3.8+ is recommended.
All required libraries are listed in the requirements.txt file.
Additionally, all necessary .whl package files are already included in the wheels directory, making OCRacle portable and ready for offline installation as described in the procedure below.
OCRacle includes a portable, pre-configured Tesseract OCR binary and tessdata in the ocracle/core directory, so no extra setup is needed.
To deploy OCRacle in an isolated environment without internet:
- Create and activate a virtual environment:
    python3 -m venv venv
    source venv/bin/activate
- Download all packages as .whl files:
    mkdir wheels
    pip download -r requirements.txt -d wheels
- Copy the entire OCRacle folder (including wheels/, requirements.txt, and code) to the offline machine.
To install dependencies in a typical (online) environment, simply run:
    pip install -r requirements.txt
On the offline machine:
- Create and activate a virtual environment:
    python3 -m venv venv
    source venv/bin/activate
- Install packages from the local wheels:
    pip install --no-index --find-links=wheels -r requirements.txt
- Ensure that Tesseract (binary + tessdata folder) and, if required, Poppler are included in the project directory and configured in core/text_extractor.py.
- Run:
    python OCRacle.py /path/to/folder
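For orientation, the command-line surface described in this README (a folder argument plus --json, --csv, --include-text, --rules, and -v/--verbose) maps naturally onto an argparse parser. The sketch below is an assumption about how such a parser could look, not a copy of OCRacle.py; `build_parser` and `parse_args` are hypothetical names, and defaulting --rules to ocracle/rules/rules.yaml is inferred from the documented rules location.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags documented in this README."""
    parser = argparse.ArgumentParser(
        prog="OCRacle.py",
        description="Scan a folder for keywords or regex patterns in PDFs and images.",
    )
    parser.add_argument("folder", help="directory to scan recursively")
    parser.add_argument("--json", metavar="FILE", help="save results to a JSON file")
    parser.add_argument("--csv", metavar="FILE", help="save results to a CSV file")
    parser.add_argument(
        "--include-text",
        action="store_true",
        help="include extracted text in results",
    )
    parser.add_argument(
        "--rules",
        metavar="FILE",
        default="ocracle/rules/rules.yaml",  # assumed default
        help="custom YAML rule file",
    )
    parser.add_argument(
        "-v", "--verbose",
        action="store_true",
        help="show files being processed",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)
```

Calling `parse_args(["/path/to/folder", "--json", "results.json"])` yields a namespace with `json` set, `csv` left as `None`, and both boolean flags off.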
Console only:
    python OCRacle.py /path/to/folder
Save results:
    python OCRacle.py /path/to/folder --json results.json --csv results.csv
Include extracted text in output:
    python OCRacle.py /path/to/folder --json results.json --csv results.csv --include-text
Enable verbose mode to print files being processed:
    python OCRacle.py /path/to/folder --verbose
Specify a custom YAML rules file:
    python OCRacle.py /path/to/folder --rules custom_rules.yaml
Redirect console output to a file (e.g., output.txt):
    python OCRacle.py /path/to/folder --verbose > output.txt
Rules are stored in ocracle/rules/rules.yaml.
Example:
rules:
  - name: IBAN_Detection
    description: "Detect Italian IBAN numbers (handles optional spaces)"
    pattern: "\\bI\\s*T\\s*\\d{2}(?:\\s*[A-Z0-9]){1,30}\\b"
  - name: SecretWords
    description: "Detect sensitive keywords"
    pattern: "\\b(secret|tlp|classified|password)\\b"

Example JSON:
[
  {
    "index": 1,
    "rule": "Email",
    "file": "sample_data/Screenshot 2025-07-30 114122.png",
    "match_count": 1,
    "matched_text": [
      "aspammer@website.com"
    ],
    "text": " ... "
  },
  {
    "index": 2,
    "rule": "BTC_Wallet",
    "file": "sample_data/b.png",
    "match_count": 2,
    "matched_text": [
      "bc1qwes635e7dl0dxzic2q044arj5h0e6n4z06pl4a",
      "3J98t1WpEZ73CNmQviecrnyiWmqRhWNLy"
    ],
    "text": " ... "
  },
  {
    "index": 3,
    "rule": "Email",
    "file": "sample_data/test-pdf.pdf",
    "match_count": 1,
    "matched_text": [
      "aspammer@website.com"
    ],
    "text": " ... "
  },
  {
    "index": 4,
    "rule": "IBAN_Detection",
    "file": "sample_data/Inside Images 894 x 892 IBAN.jpg",
    "match_count": 1,
    "matched_text": [
      "IT 99 Z 12345 12345 123456789012\nyD"
    ],
    "text": " ... "
  }
]
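To connect the rule definitions with the records shown above, here is a minimal sketch of how regex rules could be applied to already-extracted text and turned into result entries of the same shape. It is not OCRacle's implementation: `RULES`, `match_rules`, and the sample file name are hypothetical, the rules are hard-coded instead of loaded from YAML, and one record per file and rule is inferred from the example JSON. Case-insensitive matching is also an assumption; it is chosen because the sample output contains a lowercase "y" inside an IBAN match, which `[A-Z0-9]` would otherwise reject, but the flags OCRacle really uses are not stated here.

```python
import json
import re
from typing import Dict, List

# Patterns copied from the example rules in this README
# (the YAML string "\\b" corresponds to the regex token \b).
RULES = [
    {"name": "IBAN_Detection",
     "pattern": r"\bI\s*T\s*\d{2}(?:\s*[A-Z0-9]){1,30}\b"},
    {"name": "SecretWords",
     "pattern": r"\b(secret|tlp|classified|password)\b"},
]


def match_rules(texts: Dict[str, str], rules: List[dict],
                include_text: bool = False) -> List[dict]:
    """Apply every rule to every file's text; one record per (file, rule) hit."""
    # re.IGNORECASE is an assumption, see the note above the code.
    compiled = [(r["name"], re.compile(r["pattern"], re.IGNORECASE))
                for r in rules]
    results: List[dict] = []
    for file_name, text in texts.items():
        for name, regex in compiled:
            # finditer + group(0) keeps the whole match even when the
            # pattern contains capture groups (re.findall would not).
            hits = [m.group(0) for m in regex.finditer(text)]
            if not hits:
                continue
            record = {
                "index": len(results) + 1,
                "rule": name,
                "file": file_name,
                "match_count": len(hits),
                "matched_text": hits,
            }
            if include_text:  # mirrors the --include-text option
                record["text"] = text
            results.append(record)
    return results


if __name__ == "__main__":
    # Hypothetical extracted text standing in for PDF or OCR output.
    sample = {
        "sample_data/demo.pdf":
            "IBAN: IT60 X054 2811 1010 0000 0123 456.\nThe Password is secret.",
    }
    print(json.dumps(match_rules(sample, RULES), indent=2))
```

Because the IBAN pattern tolerates whitespace between characters, it can run past the end of the number into following text, as the "\nyD" tail in the sample output shows; a stricter pattern or a checksum validation step would be a reasonable refinement.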

