Docman — Document Processing Pipeline
Docman is a document processing pipeline built on the Heddle framework. It evaluates Heddle's actor-based architecture with a real-world pipeline: extract text from PDF/DOCX documents using Docling, classify and summarize them with LLMs, then persist everything to DuckDB for search and analysis.
Docman is a consumer of Heddle — it provides concrete backends, worker configs, and pipeline definitions. The Heddle framework itself lives in a separate repo.
src/docman/
├── backends/
│ ├── docling_backend.py # DoclingBackend — PDF/DOCX extraction via Docling
│ ├── duckdb_ingest.py # DuckDBIngestBackend — document persistence to DuckDB
│ └── duckdb_query.py # DocmanQueryBackend — thin subclass with Docman schema defaults
│
└── tools/
    └── vector_search.py # DuckDBVectorTool — wrapper with document-specific defaults
configs/
├── workers/
│ ├── doc_extractor.yaml # Docling extraction (processor)
│ ├── doc_extractor_windows.yaml # Windows-specific extractor config
│ ├── doc_classifier.yaml # LLM document classification
│ ├── doc_summarizer.yaml # LLM summarization (standard tier)
│ ├── doc_summarizer_local.yaml # LLM summarization (local tier)
│ ├── doc_ingest.yaml # DuckDB persistence (processor)
│ └── doc_query.yaml # DuckDB query (standalone processor)
│
├── orchestrators/
│ ├── doc_pipeline.yaml # Full pipeline (mixed tiers)
│ └── doc_pipeline_local.yaml # Full pipeline (local tier only)
│
└── mcp/
    └── docman.yaml # MCP gateway config (exposes pipeline + queries as tools)
docs/
├── ARCHITECTURE.md # This file
├── CONTRIBUTING.md # Contribution standards and CLA
├── setup-macos.md # Full macOS environment setup
├── setup-windows.md # Full Windows environment setup
└── docling-setup.md # Docling configuration and tuning
scripts/
├── dev-start.sh # Development pipeline launcher (macOS/Linux)
└── dev-start.ps1 # Development pipeline launcher (Windows)
tests/ # 40 unit tests (no infrastructure needed)
The pipeline processes documents through four stages. Heddle's PipelineOrchestrator
auto-infers dependencies from input_mapping paths and runs independent stages
concurrently. In Docman's case, each stage depends on the previous one, so
execution remains sequential:
PDF/DOCX → [Extract] → [Classify] → [Summarize] → [Ingest] → DuckDB
            Level 0     Level 1       Level 2     Level 3
To process multiple documents concurrently, run multiple pipeline instances — NATS queue groups handle load balancing automatically.
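The dependency inference described above can be pictured as a small level-assignment pass. This is a minimal sketch and not Heddle's implementation: the stage names, the shape of input_mapping, and the "input." / "stages.<name>." path prefixes are assumptions made for illustration.

```python
from typing import Dict, List, Set

# Hypothetical stage definitions. Each input_mapping value is a path that
# points either at the pipeline input ("input.*") or at an earlier stage's
# output ("stages.<name>.*"). This path format is an assumption.
STAGES: Dict[str, Dict[str, str]] = {
    "extract": {"source": "input.file_ref"},
    "classify": {"text_preview": "stages.extract.text_preview"},
    "summarize": {
        "file_ref": "stages.extract.file_ref",
        "document_type": "stages.classify.document_type",
    },
    "ingest": {
        "file_ref": "stages.extract.file_ref",
        "document_type": "stages.classify.document_type",
        "summary": "stages.summarize.summary",
    },
}


def infer_dependencies(stages: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """Map each stage to the set of stages its input paths refer to."""
    return {
        name: {
            path.split(".")[1]
            for path in mapping.values()
            if path.startswith("stages.")
        }
        for name, mapping in stages.items()
    }


def execution_levels(stages: Dict[str, Dict[str, str]]) -> List[List[str]]:
    """Group stages into levels; stages within one level could run concurrently."""
    deps = infer_dependencies(stages)
    levels: List[List[str]] = []
    done: Set[str] = set()
    while len(done) < len(deps):
        ready = sorted(n for n, d in deps.items() if n not in done and d <= done)
        if not ready:
            raise ValueError("cyclic or unresolved stage dependency")
        levels.append(ready)
        done.update(ready)
    return levels


LEVELS = execution_levels(STAGES)
```

With these mappings every stage lands in its own level, which matches the sequential diagram above. Two stages that read only from the pipeline input would share a level.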
- Stage: Extract (Level 0)
- Type: ProcessorWorker + DoclingBackend
- What it does: Reads PDF/DOCX via Docling, extracts text, tables, and figures
- Output: Writes extracted JSON to workspace, returns file_ref + metadata (page count, table presence, section list, text preview)
- Key detail: text_preview (first ~500 words) is included inline so the classifier doesn't need file access
- Stage: Classify (Level 1)
- Type: LLMWorker (local tier)
- What it does: LLM classifies document type from text preview and metadata
- Output: document_type (invoice, report, letter, memo, contract, resume, academic_paper, manual, form, other), confidence score, reasoning
- Stage: Summarize (Level 2)
- Type: LLMWorker (standard or local tier)
- What it does: LLM produces structured summary adapted to document type
- Output: Summary (2-5 paragraphs), key points, word count
- Config pending: Heddle's resolve_file_refs now supports file-ref resolution; this still needs wiring in the config
- Stage: Ingest (Level 3)
- Type: ProcessorWorker + DuckDBIngestBackend
- What it does: Persists all pipeline results to DuckDB — metadata, classification, summary, full text, and optional vector embeddings
- Output: Document UUID and insertion status
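The inline text_preview produced by the extract stage can be sketched as a simple word-based truncation. The function name and the whitespace tokenization are assumptions; DoclingBackend may cut the preview differently.

```python
def make_text_preview(text: str, max_words: int = 500) -> str:
    """Return at most the first max_words whitespace-separated words of text."""
    words = text.split()
    return " ".join(words[:max_words])


# A long document is cut to the word budget; a short one passes through.
LONG_PREVIEW = make_text_preview("lorem " * 2000)
SHORT_PREVIEW = make_text_preview("Invoice 42 for ACME Corp")
```

Keeping the preview small is what lets the classifier work from the message alone, without resolving the file_ref.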
The query worker (doc_query) is not part of the pipeline. It accepts structured query requests against the DuckDB database and supports five actions:
| Action | Description |
|---|---|
| search | Full-text search via DuckDB FTS extension |
| filter | Filter by document_type, has_tables, page range |
| stats | Aggregate counts and averages (grouped by type, tables) |
| get | Single document by ID (includes full text) |
| vector_search | Semantic similarity search via embeddings |
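As an illustration of the filter action, the sketch below builds a parameterized SELECT from optional filter values. It is not DocmanQueryBackend's code: the table name documents, the columns id, page_count, and summary, and the function signature are all assumptions. The ? placeholders follow DuckDB's prepared-statement style.

```python
from typing import Any, List, Optional, Tuple


def build_filter_query(
    document_type: Optional[str] = None,
    has_tables: Optional[bool] = None,
    min_pages: Optional[int] = None,
    max_pages: Optional[int] = None,
    limit: int = 20,
) -> Tuple[str, List[Any]]:
    """Build a parameterized query; only the filters that are set are applied.

    Table and column names are hypothetical. full_text is left out of the
    column list on purpose so that results stay small.
    """
    clauses: List[str] = []
    params: List[Any] = []
    if document_type is not None:
        clauses.append("document_type = ?")
        params.append(document_type)
    if has_tables is not None:
        clauses.append("has_tables = ?")
        params.append(has_tables)
    if min_pages is not None:
        clauses.append("page_count >= ?")
        params.append(min_pages)
    if max_pages is not None:
        clauses.append("page_count <= ?")
        params.append(max_pages)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    columns = "id, document_type, has_tables, page_count, summary"
    sql = f"SELECT {columns} FROM documents{where} LIMIT ?"
    params.append(limit)
    return sql, params


SQL, PARAMS = build_filter_query(document_type="invoice", min_pages=2)
```

Passing values as parameters instead of formatting them into the SQL string keeps user-supplied filter values out of the query text.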
Docman provides thin wrappers around heddle.contrib.duckdb with document-specific
defaults:
- DocmanQueryBackend: subclass of heddle.contrib.duckdb.DuckDBQueryBackend with the Docman table schema (columns, filters, FTS fields, stats aggregates).
- DuckDBVectorTool: wrapper around heddle.contrib.duckdb.DuckDBVectorTool with document table/column defaults; implements SyncToolProvider for LLM function-calling.
- DuckDBViewTool: used directly from heddle.contrib.duckdb (no wrapper needed).
Large data passes via file references in a shared workspace directory. Messages
carry only file_ref strings, not inline content.
- Source file placed in workspace directory
- DoclingBackend reads source, writes extracted JSON to workspace
- Extractor returns file_ref pointing to extracted JSON + inline text_preview
- Classifier uses inline text_preview (no file access needed)
- Summarizer receives file_ref (Heddle's resolve_file_refs can resolve this; config not yet wired)
- Ingest backend reads full text from workspace JSON, persists to DuckDB
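The flow above can be sketched with two plain functions that share a workspace directory. The function names, the JSON layout, and the ".extracted.json" naming are assumptions for illustration; only the idea of passing a file_ref instead of content comes from the steps listed.

```python
import json
import tempfile
from pathlib import Path


def extract_stage(workspace: Path, source_name: str, full_text: str) -> dict:
    """Write the large payload to the workspace and return a small message."""
    out_name = f"{Path(source_name).stem}.extracted.json"
    (workspace / out_name).write_text(
        json.dumps({"full_text": full_text}), encoding="utf-8"
    )
    return {
        "file_ref": out_name,  # reference only, no document content
        "text_preview": " ".join(full_text.split()[:500]),
    }


def ingest_stage(workspace: Path, message: dict) -> str:
    """Follow the file_ref and load the full text that ingest would persist."""
    path = workspace / message["file_ref"]
    return json.loads(path.read_text(encoding="utf-8"))["full_text"]


with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    ORIGINAL = "word " * 2000
    MSG = extract_stage(ws, "report.pdf", ORIGINAL)
    RESTORED = ingest_stage(ws, MSG)
```

The message stays at roughly preview size regardless of how large the document is; only stages that need the full text touch the workspace file.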
DoclingBackend runs synchronously in a thread pool via the asyncio event loop's run_in_executor.
Docling is synchronous; the thread pool keeps it from blocking the async event loop.
DuckDB backends run synchronously via SyncProcessingBackend. DuckDB is
synchronous by nature.
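The thread-pool pattern described for DoclingBackend looks roughly like the sketch below. blocking_extract is a hypothetical stand-in for a synchronous Docling conversion; the asyncio and concurrent.futures calls are standard library.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_extract(path: str) -> dict:
    """Stand-in for a synchronous conversion call (hypothetical)."""
    time.sleep(0.05)  # simulate blocking work
    return {"source": path, "page_count": 1}


async def process(path: str, pool: ThreadPoolExecutor) -> dict:
    loop = asyncio.get_running_loop()
    # Hand the blocking call to a worker thread; the event loop stays free
    # to service other coroutines while the thread is busy.
    return await loop.run_in_executor(pool, blocking_extract, path)


async def main() -> list:
    with ThreadPoolExecutor(max_workers=2) as pool:
        return await asyncio.gather(
            process("a.pdf", pool),
            process("b.pdf", pool),
        )


RESULTS = asyncio.run(main())
```

asyncio.gather returns results in the order the awaitables were passed, so RESULTS lines up with the input paths even if the threads finish out of order.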
Path traversal validation: All file_ref values must resolve within the
configured workspace_dir. The WorkspaceManager enforces this.
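A containment check of this kind can be written with pathlib alone. This is a minimal sketch, not WorkspaceManager's code; the function name and the choice to raise ValueError are assumptions.

```python
from pathlib import Path


def resolve_file_ref(workspace_dir: Path, file_ref: str) -> Path:
    """Resolve file_ref inside workspace_dir, or raise ValueError."""
    root = workspace_dir.resolve()
    # Joining with an absolute file_ref discards root, and resolve()
    # collapses ".." segments, so both escape routes surface here.
    candidate = (root / file_ref).resolve()
    if root not in candidate.parents:
        raise ValueError(f"file_ref escapes workspace: {file_ref!r}")
    return candidate
```

Resolving before comparing is the important step: a string-prefix check on the unresolved path would accept something like "a/../../x".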
DuckDB schema auto-creation: The schema is created on first ingestion. No migration step needed.
FTS extension: DuckDB FTS enables full-text search across full_text,
summary, and text_preview columns.
Query results exclude full_text by default to keep NATS messages small.
Use the get action for full content.
Vector embeddings are optional. Controlled by the embedding config section
in doc_ingest.yaml. When absent, the embedding column stores NULL. Embeddings
use FLOAT[] (variable-length) and list_cosine_similarity.
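For intuition, here is a plain-Python rendering of cosine similarity over variable-length float lists, the measure that DuckDB's list_cosine_similarity computes. The helper names and the handling of missing embeddings (treated as None and skipped) are assumptions of this sketch, not Docman's query code.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(
    a: Optional[Sequence[float]], b: Optional[Sequence[float]]
) -> Optional[float]:
    """Cosine similarity of two equal-length vectors; None if undefined."""
    if a is None or b is None:
        return None  # stands in for a NULL embedding column
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else None


def top_k(
    query: Sequence[float],
    rows: Sequence[Tuple[str, Optional[Sequence[float]]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank (doc_id, embedding) rows by similarity, skipping missing embeddings."""
    scored = [(doc_id, cosine_similarity(query, emb)) for doc_id, emb in rows]
    kept = [(d, s) for d, s in scored if s is not None]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:k]
```

Because the column type is variable-length, nothing at the schema level stops two rows from holding embeddings of different sizes, which is why the sketch checks lengths explicitly.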
Docman depends on heddle[duckdb] as a package and uses these Heddle components:
| Heddle Component | Docman Usage |
|---|---|
| ProcessingBackend ABC | DoclingBackend, DuckDBIngestBackend |
| SyncProcessingBackend | DuckDB backends (synchronous) |
| DuckDBQueryBackend | DocmanQueryBackend subclass |
| DuckDBVectorTool | DuckDBVectorTool wrapper |
| DuckDBViewTool | Used directly (no wrapper) |
| ProcessorWorker | Runs extraction and ingestion stages |
| LLMWorker | Runs classification and summarization stages |
| PipelineOrchestrator | Orchestrates 4-stage pipeline (dependency-aware parallelism) |
| WorkspaceManager | File-ref resolution with path traversal protection |
| MCPGateway | Exposes Docman as MCP server via configs/mcp/docman.yaml |
The CLI loads backends by fully qualified class path from worker configs:
processing_backend: "docman.backends.docling_backend.DoclingBackend"
Docman can be exposed as an MCP (Model Context Protocol) server using Heddle's built-in MCP gateway. A single YAML config maps Docman's pipeline and query backend to MCP tools, so no MCP-specific code is needed.
heddle mcp --config configs/mcp/docman.yaml
The gateway auto-discovers tools from the config:
| MCP Tool | Source |
|---|---|
| process_document | Pipeline: extract → classify → summarize → ingest |
| docman_search | DocmanQueryBackend search action (FTS) |
| docman_filter | DocmanQueryBackend filter action |
| docman_stats | DocmanQueryBackend stats action |
| docman_get | DocmanQueryBackend get action |
Workspace files (PDFs, extracted JSON) are exposed as MCP resources with
workspace:/// URIs.
DoclingBackend reads tuning options from backend_config in doc_extractor.yaml.
Key settings for Apple Silicon (M1 Pro 32GB):
| Setting | Value | Purpose |
|---|---|---|
| device | "mps" | GPU acceleration via Metal Performance Shaders |
| num_threads | 8 | Matches M1 Pro performance cores |
| ocr_engine | "ocrmac" | Native macOS Vision framework OCR |
| layout_batch_size | 4 | Balanced for 32GB RAM |
| ocr_batch_size | 4 | Balanced for 32GB RAM |
Pre-download detection models: docling-tools models download
For comprehensive Docling configuration, see Docling Setup (docs/docling-setup.md).
For environment setup, see macOS Setup (docs/setup-macos.md) or Windows Setup (docs/setup-windows.md).