OCR Provenance MCP Server

Turn thousands of documents into a searchable, AI-queryable knowledge base -- with full provenance.

Point this at a folder of PDFs, Word docs, spreadsheets, images, or presentations. Minutes later, Claude can search, analyze, compare, and answer questions across your entire document collection -- with a cryptographic audit trail proving exactly where every answer came from.



Why This Exists

AI assistants can't read your files natively. They can't search across 500 PDFs, compare contract versions, or find the one email buried in a discovery dump. This server bridges that gap.

It's a Model Context Protocol server that gives Claude (or any MCP client) the ability to ingest, OCR, search, compare, cluster, tag, version-track, and reason over your documents, recording a SHA-256 provenance record for every derived artifact along the way.

What Happens When You Ingest Documents

Your files (PDF, DOCX, XLSX, images, presentations...)
    -> OCR text extraction via Datalab API (3 accuracy modes)
    -> Hybrid section-aware chunking with markdown parsing
    -> GPU vector embeddings (nomic-embed-text-v1.5, 768-dim)
    -> Image extraction + AI vision analysis (Gemini 3 Flash)
    -> Full-text + semantic + hybrid search indexes
    -> Document clustering by similarity (HDBSCAN / agglomerative / k-means)
    -> Cross-entity tagging system
    -> Document version tracking (re-ingestion detects changes)
    -> SHA-256 provenance chain on every artifact

18 supported file types: PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, PNG, JPG, JPEG, TIFF, TIF, BMP, GIF, WEBP, TXT, CSV, MD


Real-World Use Cases

Litigation & Legal Discovery

You have 3,000 documents from a civil case -- contracts, emails, depositions, medical records, invoices, and correspondence spanning 8 years. Normally this takes a team of paralegals weeks to organize.

"Search all documents for references to the March 2024 amendment"
"Compare the original contract with the signed version -- what changed?"
"Find every document mentioning Dr. Rivera and cluster them by topic"
"Which invoices were submitted after the termination date?"
"Build me a timeline of all communications between Smith and Davis Corp"

The provenance chain means you can trace every search result back to its exact source page and document -- critical for legal admissibility and audit.

Medical Records Review

An insurance adjuster needs to review 800+ pages of medical records across 15 providers for a personal injury claim.

"Find all references to lumbar spine across every provider's records"
"What medications were prescribed between June and December 2024?"
"Compare the initial ER report with the orthopedic surgeon's assessment"
"Extract all diagnosis codes and dates from the treatment records"
"Cluster these records by provider and summarize each provider's findings"

Financial Audit & Compliance

A forensic accountant is reviewing 5 years of financial records for a fraud investigation -- bank statements, tax returns, invoices, receipts, and internal reports.

"Find all transactions over $10,000 across every bank statement"
"Compare this year's tax return with last year's -- what changed?"
"Search for any mention of offshore accounts or shell companies"
"Cluster all invoices by vendor and flag any with duplicate amounts"
"Which expense reports don't have matching receipts?"

Insurance Claims Processing

An adjuster is handling a commercial property damage claim with engineering reports, contractor estimates, photographs, and policy documents.

"What is the total estimated repair cost across all contractor bids?"
"Compare the policyholder's damage report with the independent adjuster's assessment"
"Find all photos showing water damage and describe what's in each one"
"Does the policy cover the type of damage described in the engineering report?"
"Cluster all documents by damage category -- structural, electrical, plumbing"

Academic Research

A PhD student is doing a literature review across 200+ papers, supplementary materials, and datasets.

"Find all papers that discuss transformer architectures for protein folding"
"Which papers cite the 2023 AlphaFold study?"
"Compare the methodology sections of these three competing approaches"
"Cluster these papers by research topic and list the top 5 clusters"
"Build me a RAG context block about attention mechanisms for my thesis"

Real Estate Due Diligence

A commercial real estate firm is evaluating a property acquisition -- title reports, environmental assessments, lease agreements, zoning documents, and inspection reports.

"Are there any environmental liens or violations in the Phase I report?"
"Compare the rent rolls from 2023 and 2024 -- which tenants left?"
"Find all lease clauses related to early termination or renewal options"
"What does the zoning report say about permitted commercial uses?"
"Cluster all inspection findings by severity -- critical, major, minor"

HR & Employment Investigations

An HR director is investigating a workplace complaint with emails, performance reviews, chat logs, and policy documents.

"Find all communications between the complainant and the respondent"
"When was the anti-harassment policy last updated and what does it say?"
"Compare the employee's performance reviews from 2023 and 2024"
"Search for any prior complaints or disciplinary actions in these records"

Quick Start

1. Create a database         ->  ocr_db_create { name: "my-case" }
2. Select it                 ->  ocr_db_select { database_name: "my-case" }
3. Ingest a folder           ->  ocr_ingest_directory { directory_path: "/path/to/docs" }
4. Process everything        ->  ocr_process_pending {}
5. Search                    ->  ocr_search_hybrid { query: "breach of contract" }
6. Ask questions             ->  ocr_rag_context { question: "What were the settlement terms?" }
7. Verify provenance         ->  ocr_provenance_verify { item_id: "doc-id" }

Each database is fully isolated. Create one per case, project, or client.
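
The same sequence can be driven programmatically from any MCP client. Below is a minimal sketch using the TypeScript MCP SDK; the tool names and arguments come from the steps above, while the client setup (paths, client name) is illustrative. Claude Code and Claude Desktop issue these calls for you -- see "Connecting to Claude" below.

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the server over stdio; the .env path is a placeholder.
const transport = new StdioClientTransport({
  command: "ocr-provenance-mcp",
  env: { OCR_PROVENANCE_ENV_FILE: "/absolute/path/to/OCR-Provenance/.env" },
});

const client = new Client({ name: "quick-start-demo", version: "0.1.0" }, { capabilities: {} });
await client.connect(transport);

// Steps 1-5 from the Quick Start, expressed as MCP tool calls.
await client.callTool({ name: "ocr_db_create", arguments: { name: "my-case" } });
await client.callTool({ name: "ocr_db_select", arguments: { database_name: "my-case" } });
await client.callTool({ name: "ocr_ingest_directory", arguments: { directory_path: "/path/to/docs" } });
await client.callTool({ name: "ocr_process_pending", arguments: {} });

const hits = await client.callTool({
  name: "ocr_search_hybrid",
  arguments: { query: "breach of contract" },
});
console.log(hits.content);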


Architecture

┌─────────────────────────────────────────────────────────────┐
│                    MCP Server (stdio)                        │
│  TypeScript + @modelcontextprotocol/sdk                     │
│  124 tools across 22 tool modules                           │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │ Ingestion│  │  Search  │  │ Analysis │  │  Reports │   │
│  │ 9 tools  │  │ 12 tools │  │ 35 tools │  │  9 tools │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │              │              │              │          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │   VLM    │  │  Images  │  │  Tags    │  │  Intel   │   │
│  │ 6 tools  │  │ 14 tools │  │ 6 tools  │  │  4 tools │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │              │              │              │          │
│  ┌────┴──────────────┴──────────────┴──────────────┴────┐   │
│  │             Service Layer (11 domains)                │   │
│  │  OCR · Chunking · Embedding · Search · VLM          │   │
│  │  Provenance · Comparison · Clustering · Gemini      │   │
│  │  Images · Storage                                    │   │
│  └────┬──────────────┬──────────────┬───────────────────┘   │
│       │              │              │                         │
│  ┌────┴────┐   ┌────┴────┐   ┌────┴─────┐                  │
│  │ SQLite  │   │sqlite-vec│   │ FTS5     │                  │
│  │ 18 tbls │   │ vectors  │   │ indexes  │                  │
│  └─────────┘   └─────────┘   └──────────┘                  │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │            Python Workers (9 processes)               │   │
│  │  OCR · Embedding · Clustering · Image Extraction    │   │
│  │  DOCX Extraction · Image Optimizer · Form Fill      │   │
│  │  File Manager · Local Reranker                      │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │            External APIs                              │   │
│  │  Datalab (OCR/Forms) · Gemini 3 Flash (VLM/AI)     │   │
│  │  Nomic embed v1.5 (local GPU, 768-dim)              │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
  • TypeScript MCP Server -- 124 tools across 22 modules, Zod validation, provenance tracking
  • Python Workers (9) -- OCR, GPU embedding, image extraction, clustering, form fill, file management, local reranking
  • SQLite + sqlite-vec -- 18 tables, FTS5 full-text search, vector similarity search, WAL mode
  • Gemini 3 Flash -- vision analysis (image description, classification, PDF analysis)
  • Datalab API -- document OCR, form filling, structured extraction, cloud storage
  • nomic-embed-text-v1.5 -- 768-dim local embeddings (CUDA / MPS / CPU)

Hybrid Section-Aware Chunking

The chunking pipeline produces semantically coherent chunks that respect document structure:

OCR text (markdown)
  │
  ├─ Text Normalization ──── Clean whitespace, normalize line breaks
  ├─ Heading Normalization ─ Fix skipped heading levels (h1→h3 becomes h1→h2)
  ├─ Markdown Parsing ────── Parse into heading/paragraph/table/code/list blocks
  ├─ JSON Block Analysis ──── Detect atomic regions (tables, figures) from OCR blocks
  ├─ Section-Aware Splitting ─ Chunk at heading boundaries, respect atomic regions
  ├─ Page Tracking ────────── Assign page numbers via Datalab page separators
  ├─ Chunk Merging ────────── Merge heading-only chunks into their content
  ├─ Chunk Deduplication ──── Remove near-duplicate chunks via fuzzy matching
  ├─ Header/Footer Tagging ── Auto-tag header/footer chunks for search exclusion
  └─ Metadata Enrichment ──── section_path, heading_context, content_types per chunk

Each chunk carries: section_path (e.g., "Introduction > Background"), heading_context, content_types (table/code/text/list), and page_number -- all searchable as filters.
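
For intuition, here is a hypothetical sketch (not the repo's chunker.ts) of how a section_path like "Introduction > Background" can be derived by tracking the heading stack while walking parsed markdown blocks:

// Hypothetical section-path derivation: keep a stack of open headings and
// stamp each non-heading block with the joined path. Types are illustrative.
type Block = { type: "heading" | "paragraph"; level?: number; text: string };

function sectionPaths(blocks: Block[]): { text: string; section_path: string }[] {
  const stack: { level: number; text: string }[] = [];
  const out: { text: string; section_path: string }[] = [];
  for (const b of blocks) {
    if (b.type === "heading" && b.level !== undefined) {
      // Pop headings at the same or deeper level before pushing the new one.
      while (stack.length && stack[stack.length - 1].level >= b.level) stack.pop();
      stack.push({ level: b.level, text: b.text });
    } else {
      out.push({ text: b.text, section_path: stack.map((h) => h.text).join(" > ") });
    }
  }
  return out;
}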

How Search Works

Three search modes, combinable via Reciprocal Rank Fusion:

  • BM25 -- best for exact terms, case numbers, and names; FTS5 full-text with porter stemming
  • Semantic -- best for conceptual queries and paraphrases; vector similarity via nomic-embed-text-v1.5
  • Hybrid (recommended) -- best for general questions; BM25 + semantic fused, with optional local re-ranking
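
Reciprocal Rank Fusion merges the BM25 and semantic rankings by position rather than raw score. A minimal sketch of the standard formula, assuming the common k = 60 constant (the server's exact constant and weighting are not specified here):

// Minimal Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per id.
function rrf(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  // Return ids ordered by fused score, highest first.
  return new Map([...scores.entries()].sort((a, b) => b[1] - a[1]));
}

// Usage: rrf([bm25ChunkIds, semanticChunkIds]) yields the fused ordering.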

Search Enhancement Stack

All three search modes support a shared enhancement stack:

  • Query classification -- heuristic analysis auto-routes queries between exact/semantic/mixed modes (auto_route on hybrid)
  • Query expansion -- legal/medical synonym injection for broader recall (expand_query, default on for hybrid)
  • Local cross-encoder reranking -- Python-based cross-encoder model (ms-marco-MiniLM-L-12-v2) re-scores results locally for relevance (rerank)
  • Quality-weighted ranking -- always-on quality score multiplier (0.8x--1.0x) boosts higher-quality OCR results
  • Chunk-level filters -- content_type_filter, section_path_filter (prefix match), heading_filter (LIKE), page_range_filter, quality_boost, table_columns_contain
  • Metadata filters -- title/author/subject LIKE matching, document ID filtering, cluster filtering, quality score threshold
  • VLM image enrichment -- search results from VLM descriptions include image metadata (path, dimensions, type)
  • Table metadata -- search results include table column headers and row/column counts from OCR blocks
  • Context chunks -- surrounding chunks automatically included with results for broader context
  • Group by document -- deduplicate results by document, returning only the best match per document (group_by_document)
  • Header/footer exclusion -- header/footer chunks auto-tagged during ingestion and excluded from search by default (include_headers_footers)
  • Document context -- optionally attach cluster labels and comparison info to results (include_document_context)
  • Provenance inclusion -- attach full provenance chain to each search result
  • Search persistence -- save, list, retrieve, and re-execute named searches
  • Cross-database search -- BM25 search across all databases simultaneously
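
A hedged example of a hybrid search call combining several of these options; the parameter names are taken from the list above, but the exact value shapes (arrays vs. strings) are assumptions:

// `client` is the connected MCP client from the Quick Start sketch.
// Parameter names come from the enhancement stack above; value shapes are assumed.
const result = await client.callTool({
  name: "ocr_search_hybrid",
  arguments: {
    query: "early termination clause",
    rerank: true,                     // local cross-encoder re-scoring
    expand_query: true,               // legal/medical synonym expansion
    content_type_filter: ["table", "text"],
    section_path_filter: "Lease Terms",
    group_by_document: true,          // best match per document only
    include_headers_footers: false,   // exclude auto-tagged header/footer chunks
  },
});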

Provenance Chain

Every artifact carries a SHA-256 hash chain back to its source document:

DOCUMENT (depth 0)
  +-- OCR_RESULT (depth 1)
  |     +-- CHUNK (depth 2) -> EMBEDDING (depth 3)
  |     +-- IMAGE (depth 2) -> VLM_DESCRIPTION (depth 3) -> EMBEDDING (depth 4)
  |     +-- EXTRACTION (depth 2) -> EMBEDDING (depth 3)
  +-- FORM_FILL (depth 0)
  +-- COMPARISON (depth 2)
  +-- CLUSTERING (depth 2)

Export in JSON, W3C PROV-JSON, or CSV for regulatory compliance. Query provenance with 12+ filters, view processing timelines, and analyze per-processor statistics.
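
Conceptually, each record's hash commits to its own content plus its parent's hash, so altering any upstream artifact invalidates everything below it. A toy illustration (the server's actual record format and hash inputs may differ):

import { createHash } from "node:crypto";

// Toy hash chain: each artifact hash commits to its content and its parent's hash.
function artifactHash(content: string, parentHash: string | null): string {
  return createHash("sha256").update(`${parentHash ?? ""}:${content}`).digest("hex");
}

const documentHash = artifactHash("<raw file bytes>", null);          // DOCUMENT, depth 0
const ocrHash = artifactHash("<extracted markdown>", documentHash);   // OCR_RESULT, depth 1
const chunkHash = artifactHash("<chunk text>", ocrHash);              // CHUNK, depth 2
const embeddingHash = artifactHash("<embedding vector>", chunkHash);  // EMBEDDING, depth 3

// Verification re-computes each hash from stored content and compares it
// against the recorded content_hash, walking the chain back to the source.
console.log({ documentHash, ocrHash, chunkHash, embeddingHash });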

Document Version Tracking

When you re-ingest a file, the system detects changes automatically:

  • Same hash -- skip (already processed)
  • Different hash -- creates a new version linked to the previous via previous_version_id
  • Version history -- retrieve all versions of a document ordered by creation date
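
A small sketch of that decision, assuming SHA-256 over the file bytes (helper names are hypothetical, not the repo's ingestion code):

import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Hypothetical re-ingestion decision following the rules above.
function fileHash(path: string): string {
  return createHash("sha256").update(readFileSync(path)).digest("hex");
}

type Existing = { id: string; file_hash: string } | undefined;

function onReingest(path: string, existing: Existing) {
  const hash = fileHash(path);
  if (existing && existing.file_hash === hash) {
    return { action: "skip" as const };                // same hash: already processed
  }
  if (existing) {
    return { action: "new_version" as const, previous_version_id: existing.id, file_hash: hash };
  }
  return { action: "register" as const, file_hash: hash };  // first time seen
}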

Document Workflow

Tag-based workflow state management for document lifecycle:

  • States: draft, review, approved, published, archived
  • History: every state change is preserved (append-only)
  • Actions: get current state, set new state, view full state history

Requirements

  • Node.js >= 20 -- MCP server runtime
  • Python >= 3.10 -- worker processes
  • PyTorch >= 2.0 -- embedding model inference
  • GPU (optional) -- CUDA or Apple MPS; CPU works fine, just slower

API Keys

  • DATALAB_API_KEY -- get from datalab.to -- OCR, form fill, file upload, structured extraction
  • GEMINI_API_KEY -- get from Google AI Studio -- VLM image description and classification

Installation

# Clone and build
git clone https://github.com/ChrisRoyse/OCR-Provenance.git
cd OCR-Provenance
npm install && npm run build

# Install globally (makes `ocr-provenance-mcp` available everywhere)
npm link

# Python dependencies
pip install torch transformers sentence-transformers numpy scikit-learn hdbscan pymupdf pillow python-docx requests

# Download embedding model (~270MB, one-time)
pip install huggingface_hub
huggingface-cli download nomic-ai/nomic-embed-text-v1.5 --local-dir models/nomic-embed-text-v1.5

# Configure API keys
cp .env.example .env
# Edit .env with your DATALAB_API_KEY and GEMINI_API_KEY

# Verify
ocr-provenance-mcp  # Should print "Tools registered: 124" on stderr

PyTorch GPU note: If pip install torch gives you CPU-only, install the CUDA version explicitly:

pip install torch --index-url https://download.pytorch.org/whl/cu124
Platform-specific notes

Linux / WSL2: Install NVIDIA drivers and CUDA toolkit. For WSL2, install the NVIDIA CUDA on WSL driver from the Windows side.

macOS (Apple Silicon): MPS acceleration works automatically. Just pip install torch torchvision torchaudio.

Windows: Use WSL2 for best compatibility. Native Windows works too -- the server auto-detects python vs python3.

Custom embedding model location

If you install globally and want the model elsewhere:

# In your .env file:
EMBEDDING_MODEL_PATH=/path/to/nomic-embed-text-v1.5

The server checks: EMBEDDING_MODEL_PATH env var -> models/ in the package directory -> ~/.ocr-provenance/models/
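
A sketch of that lookup order (the helper below is illustrative, not the server's actual resolver):

import { existsSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

// Illustrative resolver following the documented order.
function resolveModelDir(packageDir: string): string | undefined {
  const candidates = [
    process.env.EMBEDDING_MODEL_PATH,                                      // 1. env var
    join(packageDir, "models", "nomic-embed-text-v1.5"),                   // 2. package-local models/
    join(homedir(), ".ocr-provenance", "models", "nomic-embed-text-v1.5"), // 3. user home fallback
  ];
  return candidates.find((p): p is string => p !== undefined && existsSync(p));
}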


Connecting to Claude

Claude Code

# Register globally (available in all projects)
claude mcp add ocr-provenance -s user \
  -e OCR_PROVENANCE_ENV_FILE=/path/to/OCR-Provenance/.env \
  -e NODE_OPTIONS=--max-semi-space-size=64 \
  -- ocr-provenance-mcp

Claude Desktop

Add to your config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS, %APPDATA%\Claude\claude_desktop_config.json on Windows):

{
  "mcpServers": {
    "ocr-provenance": {
      "command": "ocr-provenance-mcp",
      "env": {
        "OCR_PROVENANCE_ENV_FILE": "/absolute/path/to/OCR-Provenance/.env",
        "NODE_OPTIONS": "--max-semi-space-size=64"
      }
    }
  }
}

Any MCP Client

The server uses stdio transport (JSON-RPC over stdin/stdout):

ocr-provenance-mcp                    # Global command (after npm link)
node /path/to/dist/index.js           # Direct invocation

Environment variables can be provided via OCR_PROVENANCE_ENV_FILE, direct env vars, or a .env file in the working directory.


Configuration

# .env file
DATALAB_API_KEY=your_key
GEMINI_API_KEY=your_key

# OCR settings
DATALAB_DEFAULT_MODE=accurate          # fast | balanced | accurate
DATALAB_MAX_CONCURRENT=3

# Embeddings (auto-detects CUDA > MPS > CPU)
EMBEDDING_DEVICE=auto
EMBEDDING_BATCH_SIZE=512

# Chunking
CHUNKING_SIZE=2000
CHUNKING_OVERLAP_PERCENT=10

# Auto-clustering (triggers after processing when enabled)
AUTO_CLUSTER_ENABLED=false
AUTO_CLUSTER_THRESHOLD=5               # Minimum documents to trigger
AUTO_CLUSTER_ALGORITHM=hdbscan

# Storage
STORAGE_DATABASES_PATH=~/.ocr-provenance/databases/

Tool Reference (124 Tools)

Database Management (5)
Tool Description
ocr_db_create Create a new isolated database
ocr_db_list List all databases with optional stats
ocr_db_select Select the active database
ocr_db_stats Detailed statistics (documents, chunks, embeddings, images, clusters)
ocr_db_delete Permanently delete a database
Ingestion & Processing (9)
Tool Description
ocr_ingest_directory Scan directory and register documents (18 file types, recursive)
ocr_ingest_files Ingest specific files by path
ocr_process_pending Full pipeline: OCR -> Chunk -> Embed -> Vector -> VLM (with auto-clustering)
ocr_status Check processing status
ocr_retry_failed Reset failed documents for reprocessing
ocr_reprocess Reprocess with different OCR settings
ocr_chunk_complete Repair documents missing chunks/embeddings
ocr_convert_raw One-off OCR conversion without storing
ocr_reembed_document Re-generate embeddings for a document without re-OCRing

Processing options: ocr_mode (fast/balanced/accurate), chunking_strategy (hybrid section-aware), page_range, max_pages, extras (track_changes, chart_understanding, extract_links, table_row_bboxes, infographic, new_block_types)

Version tracking: Re-ingesting a file with a different hash creates a new version linked via previous_version_id.

Search & Retrieval (12)
Tool Description
ocr_search BM25 full-text search (exact terms, codes, IDs)
ocr_search_semantic Vector similarity search (conceptual queries)
ocr_search_hybrid Reciprocal Rank Fusion of BM25 + semantic (recommended)
ocr_rag_context Assemble hybrid search results into a markdown context block for LLMs
ocr_search_export Export results to CSV or JSON
ocr_benchmark_compare Compare search results across databases
ocr_fts_manage Rebuild or check FTS5 index status
ocr_search_save Save a search by name for later retrieval
ocr_search_saved_list List all saved searches
ocr_search_saved_get Retrieve a saved search and its parameters
ocr_search_saved_execute Re-execute a saved search with optional parameter overrides
ocr_search_cross_db BM25 search across all databases simultaneously

Enhancement options: Local cross-encoder reranking (rerank), query expansion (expand_query), auto-routing (auto_route), quality-weighted ranking, chunk-level filters (content type, section path, heading, page range, table columns), metadata filters, cluster filtering, group by document, header/footer exclusion, context chunks, VLM image enrichment, provenance inclusion.

Document Management (12)
Tool Description
ocr_document_list List documents with status filtering
ocr_document_get Full document details (text, chunks, blocks, provenance)
ocr_document_delete Delete document and all derived data (cascade)
ocr_document_find_similar Find similar documents via embedding centroid similarity
ocr_document_structure Analyze document structure (headings, tables, figures, code blocks)
ocr_document_sections Get section hierarchy tree from chunk section paths
ocr_document_update_metadata Batch update document metadata fields
ocr_document_duplicates Detect exact (hash) and near (similarity) duplicates
ocr_document_export Export document to JSON or markdown
ocr_corpus_export Export entire corpus to JSON or markdown archive
ocr_document_versions List all versions of a document by file path
ocr_document_workflow Manage workflow states (draft/review/approved/published/archived)
Provenance (6)
Tool Description
ocr_provenance_get Get the complete provenance chain for any item
ocr_provenance_verify Verify integrity through SHA-256 hash chain
ocr_provenance_export Export provenance (JSON, W3C PROV-JSON, CSV)
ocr_provenance_query Query provenance records with 12+ filters
ocr_provenance_timeline View document processing timeline
ocr_provenance_processor_stats Aggregate statistics per processor type
Document Comparison (6)
Tool Description
ocr_document_compare Text diff + structural metadata diff + similarity ratio
ocr_comparison_list List comparisons with optional filtering
ocr_comparison_get Full comparison details with diff operations
ocr_comparison_discover Auto-discover similar document pairs for comparison
ocr_comparison_batch Batch compare multiple document pairs
ocr_comparison_matrix NxN pairwise cosine similarity matrix across documents
Document Clustering (7)
Tool Description
ocr_cluster_documents Cluster by semantic similarity (HDBSCAN / agglomerative / k-means)
ocr_cluster_list List clusters with filtering by run ID or tag
ocr_cluster_get Cluster details with member documents
ocr_cluster_assign Auto-assign a document to the nearest cluster
ocr_cluster_reassign Move a document to a different cluster
ocr_cluster_merge Merge two clusters into one
ocr_cluster_delete Delete a clustering run
VLM / Vision Analysis (6)
Tool Description
ocr_vlm_describe Describe an image using Gemini 3 Flash (supports thinking mode)
ocr_vlm_classify Classify image type, complexity, text density
ocr_vlm_process_document VLM-process all images in a document
ocr_vlm_process_pending VLM-process all pending images across all documents
ocr_vlm_analyze_pdf Analyze a PDF directly with Gemini 3 Flash (max 20MB)
ocr_vlm_status Service status (API config, rate limits, circuit breaker)

VLM descriptions automatically generate searchable embeddings for semantic image search.

Image Operations (11)
Tool Description
ocr_image_extract Extract images from a PDF via Datalab OCR
ocr_image_list List images extracted from a document
ocr_image_get Get image details
ocr_image_stats Processing statistics
ocr_image_delete Delete an image record
ocr_image_delete_by_document Delete all images for a document
ocr_image_reset_failed Reset failed images for reprocessing
ocr_image_pending List images pending VLM processing
ocr_image_search Search images with 7 filters (type, size, status, confidence, etc.)
ocr_image_semantic_search Semantic search over VLM image descriptions
ocr_image_reanalyze Re-run VLM analysis with a custom prompt
Image Extraction (3)
Tool Description
ocr_extract_images Extract images locally (PyMuPDF for PDF, zipfile for DOCX)
ocr_extract_images_batch Batch extract from all processed documents
ocr_extraction_check Verify Python environment has required packages
Chunks & Pages (4)
Tool Description
ocr_chunk_get Get a chunk by ID with full metadata
ocr_chunk_list List chunks with filtering (content type, section path, page, heading)
ocr_chunk_context Get a chunk with N neighboring chunks for context
ocr_document_page Get all chunks for a specific page number (page-by-page navigation)
Embeddings (4)
Tool Description
ocr_embedding_list List embeddings with filtering
ocr_embedding_stats Embedding statistics (counts, models, coverage)
ocr_embedding_get Get embedding details by ID
ocr_embedding_rebuild Re-generate embeddings for specific targets
Structured Extraction (4)
Tool Description
ocr_extract_structured Extract structured data from OCR'd documents using a JSON schema
ocr_extraction_list List structured extractions for a document
ocr_extraction_get Get a structured extraction by ID
ocr_extraction_search Search across extraction content
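
A hedged example of a structured extraction call: the tool name comes from the table above, while the argument names (document_id, schema) and the schema shape are assumptions illustrating JSON-schema-driven extraction:

// Hypothetical call; only the tool name is documented. Argument names and
// the schema layout are assumptions for illustration.
await client.callTool({
  name: "ocr_extract_structured",
  arguments: {
    document_id: "doc-123",
    schema: {
      type: "object",
      properties: {
        invoice_number: { type: "string" },
        invoice_date: { type: "string" },
        total_amount: { type: "number" },
      },
      required: ["invoice_number", "total_amount"],
    },
  },
});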
Form Fill (2)
Tool Description
ocr_form_fill Fill PDF/image forms via Datalab with field name-value mapping
ocr_form_fill_status Form fill operation status and results
File Management (6)
Tool Description
ocr_file_upload Upload to Datalab cloud (deduplicates by SHA-256)
ocr_file_list List uploaded files with duplicate detection
ocr_file_get File metadata
ocr_file_download Get download URL
ocr_file_delete Delete file record
ocr_file_ingest_uploaded Bridge uploaded files into the document pipeline
Tags (6)
Tool Description
ocr_tag_create Create a tag with optional color and description
ocr_tag_list List tags with usage counts
ocr_tag_apply Apply a tag to any entity (document, chunk, image, cluster, etc.)
ocr_tag_remove Remove a tag from an entity
ocr_tag_search Find entities by tag name
ocr_tag_delete Delete a tag and all associations
Intelligence & Navigation (4)
Tool Description
ocr_guide AI agent navigation -- inspects system state and recommends next tools/actions
ocr_document_tables Extract and parse tables from OCR JSON blocks
ocr_document_recommend Get related document recommendations via embedding similarity
ocr_document_extras Access OCR extras data (charts, links, tracked changes, infographics)
Evaluation (3)
Tool Description
ocr_evaluate_single Evaluate a single image with VLM
ocr_evaluate_document Evaluate all images in a document
ocr_evaluate_pending Evaluate all pending images system-wide
Reports & Analytics (9)
Tool Description
ocr_evaluation_report Comprehensive OCR + VLM metrics report (markdown)
ocr_document_report Single document report (images, extractions, comparisons, clusters)
ocr_quality_summary Quality summary across all documents
ocr_cost_summary Cost analytics by document, mode, month, or total
ocr_pipeline_analytics Pipeline throughput, duration, per-mode/type breakdown
ocr_corpus_profile Corpus content profile (doc sizes, content types, section frequency)
ocr_error_analytics Error/recovery analytics and failure rates
ocr_provenance_bottlenecks Processing bottleneck analysis by processor
ocr_quality_trends Quality trends over time (hourly/daily/weekly/monthly)
Timeline & Analytics (2)
Tool Description
ocr_timeline_analytics Volume metrics over time
ocr_throughput_analytics Processing throughput per time bucket
Health & Diagnostics (1)
Tool Description
ocr_health_check Detect data integrity gaps (missing embeddings, orphaned chunks, etc.) with optional auto-fix
Configuration (2)
Tool Description
ocr_config_get Get current system configuration
ocr_config_set Update configuration at runtime

Processing Pipeline

File on disk
  │
  ├─ 1. REGISTER ──► documents table (status: pending)
  │                  ├─ file_hash computed (SHA-256)
  │                  ├─ version detection (new vs re-ingested)
  │                  └─ provenance record (type: DOCUMENT, depth: 0)
  │
  ├─ 2. OCR ──────► ocr_results table
  │                  ├─ Datalab API call (fast/balanced/accurate)
  │                  ├─ extracted_text (markdown)
  │                  ├─ json_blocks (structural hierarchy)
  │                  ├─ extras_json (charts, links, track changes)
  │                  ├─ page_offsets (page boundaries)
  │                  └─ provenance record (type: OCR_RESULT, depth: 1)
  │
  ├─ 3. CHUNK ────► chunks table
  │                  ├─ Hybrid section-aware chunking
  │                  │   ├─ Text + heading normalization
  │                  │   ├─ Markdown structure parsing
  │                  │   ├─ Atomic region detection (tables, figures)
  │                  │   ├─ Heading-only chunk merging
  │                  │   ├─ Near-duplicate deduplication
  │                  │   └─ Header/footer auto-tagging
  │                  ├─ 2000 chars with 10% overlap
  │                  ├─ section_path, heading_context, content_types
  │                  ├─ page_number assignment via page separators
  │                  └─ provenance records (type: CHUNK, depth: 2)
  │
  ├─ 4. EMBED ────► embeddings + vec_embeddings tables
  │                  ├─ Nomic embed v1.5 (768-dim, local GPU)
  │                  ├─ "search_document: " prefix
  │                  └─ provenance records (type: EMBEDDING, depth: 3)
  │
  ├─ 5. FTS ──────► fts_index (FTS5 virtual table)
  │                  └─ External content index on chunk text
  │
  ├─ 6. IMAGES ───► images table
  │   │              ├─ PyMuPDF extraction (PDF) / zip extraction (DOCX)
  │   │              ├─ Image optimization (resize, format)
  │   │              └─ provenance records (type: IMAGE, depth: 2)
  │   │
  │   └─ 7. VLM ──► images updated + embeddings table
  │                  ├─ Gemini 3 Flash multimodal analysis
  │                  ├─ Description, structured data, confidence
  │                  ├─ VLM description embedding generated (searchable)
  │                  └─ provenance records (type: VLM_DESCRIPTION, depth: 3→4)
  │
  ├─ 8. AUTO-CLUSTER ──► clusters table (when configured)
  │                  └─ Triggers when threshold met and >1hr since last run
  │
  └─ documents.status = 'complete'

Data Architecture (Schema v31)

18 core tables + FTS5 virtual tables + vec_embeddings:

  • documents -- source files -- file_hash, status, page_count, metadata
  • ocr_results -- extracted text -- extracted_text, json_blocks, quality_score, cost
  • chunks -- text segments -- text (2000 chars), section_path, heading_context, content_types
  • embeddings -- 768-dim vectors -- original_text, model_name, source metadata
  • images -- extracted images -- extracted_path, bbox, VLM description, confidence
  • extractions -- structured data -- schema_json, extraction_json
  • form_fills -- form filling results -- field mapping, output path
  • comparisons -- document pair diffs -- similarity_ratio, diff_operations
  • clusters -- document groupings -- label, classification_tag, coherence_score
  • document_clusters -- cluster membership -- document_id, cluster_id
  • provenance -- full audit trail -- type, processor, chain_depth, content_hash
  • tags -- cross-entity labels -- name, color, description
  • entity_tags -- tag associations -- tag_id, entity_type, entity_id
  • saved_searches -- search persistence -- name, search_type, parameters
  • uploaded_files -- cloud file tracking -- datalab_id, file_hash, upload status
  • database_metadata -- DB-level settings -- key-value pairs
  • schema_version -- migration tracking -- version, applied_at
  • fts_index_metadata -- FTS index state -- last_rebuild, chunk count

AI/ML Capabilities

  • Document OCR -- Datalab API (3 modes) -- ocr_process_pending, ocr_convert_raw
  • Text Embeddings -- Nomic embed v1.5 (local GPU) -- automatic during ingestion, ocr_reembed_document
  • Image Description -- Gemini 3 Flash -- ocr_vlm_describe, ocr_vlm_process_*
  • Image Classification -- Gemini 3 Flash -- ocr_vlm_classify
  • Search Reranking -- Python cross-encoder (local, no API) -- rerank parameter on all search tools
  • Query Expansion -- heuristic synonyms -- expand_query parameter
  • Query Classification -- heuristic patterns -- auto_route parameter (hybrid search)
  • Document Clustering -- scikit-learn -- ocr_cluster_documents (HDBSCAN/agglomerative/k-means)
  • Auto-Clustering -- scikit-learn -- configurable auto-trigger after ocr_process_pending
  • Similarity Detection -- embedding centroids -- ocr_document_find_similar, ocr_document_recommend
  • Duplicate Detection -- file hash + embedding similarity -- ocr_document_duplicates
  • Comparison Discovery -- embedding similarity -- ocr_comparison_discover
  • Comparison Matrix -- pairwise cosine similarity -- ocr_comparison_matrix
  • Text Comparison -- npm diff (Sorensen-Dice) -- ocr_document_compare
  • RAG Context Assembly -- hybrid search + markdown -- ocr_rag_context
  • Semantic Image Search -- VLM description embeddings -- ocr_image_semantic_search
  • PDF Direct Analysis -- Gemini 3 Flash multimodal -- ocr_vlm_analyze_pdf
  • Table Extraction -- OCR JSON block parsing -- ocr_document_tables
  • Cross-DB Search -- BM25 across all databases -- ocr_search_cross_db
  • Chunk Deduplication -- fuzzy text matching -- automatic during chunking pipeline
  • AI Agent Navigation -- system state analysis -- ocr_guide
  • Health Diagnostics -- data integrity analysis -- ocr_health_check

Development

npm run build             # Build TypeScript
npm test                  # All tests (2,364 across 109 test suites)
npm run test:unit         # Unit tests only
npm run test:integration  # Integration tests only
npm run lint:all          # TypeScript + Python linting
npm run check             # typecheck + lint + test

Project Structure

src/
  index.ts              # MCP server entry point (tool registration, lifecycle)
  bin.ts                # CLI entry point
  tools/                # 22 tool files + shared.ts
    database.ts         # Database CRUD (5 tools)
    ingestion.ts        # Ingest + process pipeline (9 tools)
    search.ts           # BM25, semantic, hybrid, RAG, cross-DB (12 tools)
    documents.ts        # Document ops, versions, workflow (12 tools)
    provenance.ts       # Audit trail, verification (6 tools)
    comparison.ts       # Diff, batch compare, matrix (6 tools)
    clustering.ts       # Cluster, reassign, merge (7 tools)
    vlm.ts              # Gemini vision analysis (6 tools)
    images.ts           # Image ops, semantic search (11 tools)
    reports.ts          # Analytics + quality reports (9 tools)
    tags.ts             # Cross-entity tagging (6 tools)
    intelligence.ts     # AI guide, tables, recommendations, extras (4 tools)
    embeddings.ts       # Embedding management (4 tools)
    extraction-structured.ts  # JSON schema extraction (4 tools)
    extraction.ts       # Local image extraction (3 tools)
    file-management.ts  # Cloud file ops (6 tools)
    chunks.ts           # Chunk inspection + page navigation (4 tools)
    timeline.ts         # Time-series analytics (2 tools)
    form-fill.ts        # PDF form filling (2 tools)
    evaluation.ts       # VLM evaluation (3 tools)
    config.ts           # Runtime config (2 tools)
    health.ts           # Data integrity check (1 tool)
    shared.ts           # Shared utilities (formatResponse, handleError, etc.)
  services/             # Core services (11 domains, 64 files)
    chunking/           # Hybrid section-aware chunking pipeline
      chunker.ts        # Main chunking orchestrator
      markdown-parser.ts
      heading-normalizer.ts
      text-normalizer.ts
      chunk-merger.ts
      chunk-deduplicator.ts
      json-block-analyzer.ts
    search/             # BM25, semantic, hybrid, fusion, reranker (AI + local), query expansion/classification, quality weighting
    gemini/             # Gemini client with caching, circuit breaker, rate limiting
    storage/            # SQLite database + migrations (19 operation files)
    ...                 # OCR, embedding, VLM, provenance, comparison, clustering, images
  models/               # Zod schemas and TypeScript types
  utils/                # Hash, validation, path sanitization
  server/               # Server state, types, errors (14 custom error classes)
python/                 # 9 Python workers + GPU utils
tests/
  unit/                 # Unit tests
  integration/          # Integration tests
  e2e/                  # End-to-end pipeline tests
  manual/               # Verification tests
  benchmark/            # Chunking benchmark
  fixtures/             # Test fixtures and sample documents
docs/                   # System documentation and reports

Key Metrics

Metric Value
MCP tools 124
Tool modules 22
Database tables 18 core + FTS + vec
Schema version v31 (31 migrations)
Database operation files 19
Service domains 11
Test suites 109
Tests passing 2,364
TypeScript source ~46,000 lines
Python source ~4,700 lines
Test code ~65,000 lines
Production deps 9 packages
Python workers 9
External APIs 3 (Datalab, Gemini, Nomic local)
Custom error classes 14
File types supported 18

Troubleshooting

sqlite-vec loading errors

Run npm install -- sqlite-vec uses a prebuilt binary that must match your platform and Node.js version.

Python not found (Windows)

The server auto-detects python vs python3. Ensure Python is on your PATH: python --version.

GPU not detected
python -c "import torch; print('CUDA:', torch.cuda.is_available()); print('MPS:', hasattr(torch.backends, 'mps') and torch.backends.mps.is_available())"

If both are False, install the CUDA version of PyTorch: pip install torch --index-url https://download.pytorch.org/whl/cu124

Embedding model not found

Download the model (see Installation). Verify config.json, model.safetensors, and tokenizer.json are present in the model directory.

API key warnings at startup

Copy .env.example to .env and fill in your DATALAB_API_KEY and GEMINI_API_KEY.

Data integrity issues

Run ocr_health_check { fix: true } to detect and auto-fix common issues like chunks missing embeddings or orphaned records.


License

MIT
