Deadsimple - Document Extractor

FastAPI service that extracts text from PDF, DOCX, PPTX, EPUB, HTML, TXT, and other document formats using a multi-backend architecture.

🚀 Features

Multiple file formats: Supports PDF, DOCX, PPTX, EPUB, HTML, Markdown, TXT, CSV, and more
Vision LLM: Optional OpenAI-compatible API for extracting text from scanned PDFs/images
Page-wise output: Returns the extracted text one page or slide at a time
Extensible backend architecture: Designed for easy addition of new backends

📦 Installation

pip install -e .

Supported Formats

markitdown supports these file types out of the box:

Format	MIME Type
PDF	`application/pdf`
DOCX	`application/vnd.openxmlformats-officedocument.wordprocessingml.document`
PPTX	`application/vnd.openxmlformats-officedocument.presentationml.presentation`
EPUB	`application/epub+zip`
HTML	`text/html`
Markdown	`text/markdown` or `text/x-markdown`
TXT	`text/plain`
CSV	`text/csv`
DOC (legacy)	`application/msword`
PPT (legacy)	`application/vnd.ms-powerpoint`
XLSX	`application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`
XLS (legacy)	`application/vnd.ms-excel`
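When calling the service from a script, the table above can be turned into a small lookup for the `Content-Type` header. The sketch below is illustrative only: the file extensions and the `content_type_for` helper are assumptions (the table lists format names, not extensions), and it assumes the service relies on the request's `Content-Type`, as the curl example in the API section suggests.

```python
from pathlib import Path

# MIME types copied from the table above; the extension keys are assumed.
CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".epub": "application/epub+zip",
    ".html": "text/html",
    ".md": "text/markdown",
    ".txt": "text/plain",
    ".csv": "text/csv",
    ".doc": "application/msword",
    ".ppt": "application/vnd.ms-powerpoint",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".xls": "application/vnd.ms-excel",
}


def content_type_for(filename: str) -> str:
    """Return the MIME type for a file name, based on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in CONTENT_TYPES:
        raise ValueError(f"unsupported file extension: {suffix!r}")
    return CONTENT_TYPES[suffix]
```

For example, `content_type_for("slides.pptx")` returns the PPTX MIME type from the table, and an unknown extension raises `ValueError` instead of sending a guess to the server.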

🌐 Vision LLM Mode

Vision mode is configured in backends.ini under the [llm] section:

[llm]
# Set these values to enable vision/OCR mode
token = your_api_key          # OpenAI or compatible API key
model = gpt-4o                # Model name (supports vision)
url = https://api.openai.com/v1  # API endpoint

To enable vision mode:

1. Open backends.ini
2. Uncomment and set the [llm] section values
3. Restart the server

Example with local Ollama:

[llm]
token = ollama
model = llava
url = http://localhost:11434/v1

Leave [llm] empty to disable vision mode (uses markitdown without LLM).
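The enable/disable rule can be sketched with Python's standard `configparser`. This is not the service's actual loader: `load_llm_settings` is a hypothetical helper, and treating "any missing value" as "disabled" is an assumption. Note that `configparser` only strips trailing `# ...` comments when `inline_comment_prefixes` is set, which matters for the commented example above.

```python
import configparser
from typing import Dict, Optional


def load_llm_settings(ini_text: str) -> Optional[Dict[str, str]]:
    """Return the [llm] settings, or None when vision mode is disabled.

    Hypothetical helper: vision mode counts as enabled only when token,
    model, and url are all non-empty (an assumption, not confirmed behavior).
    """
    parser = configparser.ConfigParser(
        inline_comment_prefixes=("#",),  # strip trailing "# ..." comments
        interpolation=None,              # allow "%" in tokens or URLs
    )
    parser.read_string(ini_text)
    if not parser.has_section("llm"):
        return None
    settings = {
        key: parser.get("llm", key, fallback="").strip()
        for key in ("token", "model", "url")
    }
    if not all(settings.values()):
        return None
    return settings
```

With the Ollama example above, the function returns the three values as a dict; with an empty `[llm]` section it returns `None`.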

🧪 Running

python deadsimple/main.py

API available at http://localhost:5000

📡 API

Process Document

curl -X POST http://localhost:5000/process \
  -H "Content-Type: application/pdf" \
  --data-binary @document.pdf

Response:

{
  "page_content": "page 1 text\n---\npage 2 text",
  "metadata": {}
}
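The same call can be made from Python with only the standard library. This is a minimal sketch, not an official client: `build_request`, `split_pages`, and `extract_pages` are hypothetical names, and the `\n---\n` page separator is inferred from the sample response rather than from a documented contract.

```python
import json
import urllib.request
from typing import List

PAGE_SEPARATOR = "\n---\n"  # assumption: inferred from the sample response


def build_request(data: bytes, content_type: str,
                  base_url: str = "http://localhost:5000") -> urllib.request.Request:
    """Build the POST /process request with the raw document as the body."""
    return urllib.request.Request(
        f"{base_url}/process",
        data=data,
        headers={"Content-Type": content_type},
        method="POST",
    )


def split_pages(response_body: str) -> List[str]:
    """Split the page_content field of a JSON response into per-page strings."""
    payload = json.loads(response_body)
    return payload["page_content"].split(PAGE_SEPARATOR)


def extract_pages(path: str, content_type: str) -> List[str]:
    """Send a local file to the running service and return its pages."""
    with open(path, "rb") as fh:
        request = build_request(fh.read(), content_type)
    with urllib.request.urlopen(request) as response:
        return split_pages(response.read().decode("utf-8"))
```

With the server running, `extract_pages("document.pdf", "application/pdf")` would return one string per page. If a document's text itself contains a line with only `---`, this split would over-count pages, so treat the separator as a convenience rather than a guarantee.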

🧪 Testing

pytest tests/

Optional Backends

The service supports optional backends that provide extra capabilities such as OCR, document linking, and content analysis. These backends are not required for basic operation; markitdown is always included.

Available backends

ocrflux – OCR extraction backend
doclings – Document linking backend
docstrange – Specialized document processing backend
marker – Marker-based backend

Installation

Each backend is provided as an extra in pyproject.toml. Install the desired backend with pip, for example:

pip install -e ".[ocrflux]"
pip install -e ".[doclings]"
pip install -e ".[docstrange]"
pip install -e ".[marker]"

You can also install multiple extras at once:

pip install -e ".[ocrflux,doclings,docstrange,marker]"
