
Add multi-document batch processing and knowledge base search#139

Open
rajo69 wants to merge 1 commit into VectifyAI:main from rajo69:add-batch-processor

rajo69 commented Mar 3, 2026

Related to #119

What this adds

process_batch() - process multiple PDFs in one call

from pageindex.batch_processor import process_batch

summary = process_batch(
    doc_paths=["report.pdf", "earnings.pdf", "manual.pdf"],
    output_dir="./results",
    model="gpt-4o-2024-11-20",
    if_add_node_summary="yes",
)
# summary["processed"]     -> ["report.pdf", "earnings.pdf", "manual.pdf"]
# summary["kb_index_path"] -> "./results/kb_index.json"

Each document's hierarchical structure is saved as {stem}_structure.json. A kb_index.json manifest is written listing every document and its output file. Failed documents are logged and skipped so the rest of the batch continues unaffected.
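To make the manifest contract concrete, here is a minimal sketch of how a consumer might read kb_index.json. The field names ("documents", "doc_name", "structure_file") are illustrative assumptions; the authoritative schema is whatever process_batch() writes.

```python
import json
import os
import tempfile

# Hypothetical manifest layout -- field names are assumptions, not the
# actual schema produced by process_batch().
manifest = {
    "created_at": "2026-03-03T00:00:00",
    "documents": [
        {"doc_name": "report.pdf",   "structure_file": "report_structure.json"},
        {"doc_name": "earnings.pdf", "structure_file": "earnings_structure.json"},
    ],
}

path = os.path.join(tempfile.mkdtemp(), "kb_index.json")
with open(path, "w") as f:
    json.dump(manifest, f, indent=2)

# A consumer can rediscover every processed document from the manifest:
with open(path) as f:
    names = [d["doc_name"] for d in json.load(f)["documents"]]
print(names)  # ['report.pdf', 'earnings.pdf']
```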

KnowledgeBaseSearch - search across all processed documents

from pageindex.batch_processor import KnowledgeBaseSearch

kb = KnowledgeBaseSearch("./results/kb_index.json")

# Case-insensitive search across node titles and summaries
hits = kb.search("revenue growth", top_k=5)
# [{"doc_name": "earnings.pdf", "title": "Revenue Growth", "score": 2, ...}, ...]

# Retrieve a full document structure
doc = kb.get_document("earnings.pdf")

# List all documents in the knowledge base
names = kb.list_documents()

Search uses case-insensitive keyword/substring matching, consistent with PageIndex's reasoning-first philosophy. Title matches score 2; summary-only matches score 1. Results are ranked by score.
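The scoring rule described above can be sketched in a few lines. This is an illustration of the rule, not the actual KnowledgeBaseSearch implementation; the node shapes are made up for the demo.

```python
# Sketch of the stated scoring rule: title substring match -> 2,
# summary-only match -> 1, no match -> 0; results ranked by score.
def score_node(query: str, title: str, summary: str) -> int:
    q = query.lower()
    if q in title.lower():
        return 2
    if q in summary.lower():
        return 1
    return 0

nodes = [
    {"title": "Revenue Growth", "summary": "Quarterly revenue rose 12%."},
    {"title": "Outlook",        "summary": "Revenue growth expected to slow."},
    {"title": "Appendix",       "summary": "Methodology notes."},
]

scored = [
    dict(n, score=score_node("revenue growth", n["title"], n["summary"]))
    for n in nodes
]
hits = sorted((n for n in scored if n["score"] > 0),
              key=lambda n: n["score"], reverse=True)
print([(h["title"], h["score"]) for h in hits])
# [('Revenue Growth', 2), ('Outlook', 1)]
```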

CLI - --batch-dir flag

python run_pageindex.py --batch-dir ./my_pdfs/ --model gpt-4o-2024-11-20
# Processes all PDFs in ./my_pdfs/, saves results to ./results/

Design decisions

  • No new dependencies - uses only the existing page_index_main() pipeline and standard library (json, os, logging, datetime).
  • Graceful degradation - one bad PDF never aborts the whole batch.
  • Shared config - ConfigLoader is called once and the same opt object is reused for every document, ensuring consistent settings across the batch.
  • Lazy loading in search - structure files are loaded on first access and cached, so large knowledge bases do not load everything upfront.
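The lazy-loading pattern from the last bullet can be sketched as below. This is a generic illustration of load-on-first-access with a dict cache, assuming structure files are plain JSON; it is not the actual KnowledgeBaseSearch internals.

```python
import json
import os
import tempfile

class LazyStructureCache:
    """Illustrative only: read a structure JSON on first access, cache after."""
    def __init__(self):
        self._cache = {}
        self.loads = 0  # counts actual disk reads, to show caching works

    def get(self, path):
        if path not in self._cache:
            self.loads += 1
            with open(path) as f:
                self._cache[path] = json.load(f)
        return self._cache[path]

# Demo: two accesses, one disk read.
path = os.path.join(tempfile.mkdtemp(), "report_structure.json")
with open(path, "w") as f:
    json.dump({"title": "Report"}, f)

cache = LazyStructureCache()
cache.get(path)
cache.get(path)     # served from memory, no second disk read
print(cache.loads)  # 1
```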

Breaking changes

None. batch_processor is a new module. No existing functions or interfaces were modified.

Scope and limitations

This implementation is designed for small-to-medium collections (up to a few hundred PDFs). For larger collections (thousands+), two things would need extending:

  • Processing - sequential calls to page_index_main() would need a job queue (e.g. Celery/RQ) and parallelism to be practical at scale.
  • Search - the current in-memory keyword scan would need replacing with a proper search backend (SQLite FTS5, Elasticsearch, or a vector database).

This PR establishes the foundational API and kb_index.json output format that such extensions can build on.
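As a sketch of the SQLite FTS5 extension path mentioned above (not part of this PR, and assuming an FTS5-enabled SQLite build, which ships with most CPython distributions): node titles and summaries from each structure file would be indexed once, and search becomes a single SQL query.

```python
import sqlite3

# Illustrative schema -- column names are assumptions, not the
# kb_index.json format. In a real extension, rows would be bulk-loaded
# from each {stem}_structure.json after process_batch() completes.
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE nodes USING fts5(doc_name, title, summary)")
con.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [
        ("earnings.pdf", "Revenue Growth", "Quarterly revenue rose 12%."),
        ("report.pdf",   "Safety Manual",  "Handling procedures."),
    ],
)

# FTS5 replaces the in-memory substring scan with an indexed MATCH,
# ranked by SQLite's built-in relevance ("rank") instead of 2/1 scores.
rows = con.execute(
    "SELECT doc_name, title FROM nodes WHERE nodes MATCH ? ORDER BY rank",
    ("revenue",),
).fetchall()
print(rows)  # [('earnings.pdf', 'Revenue Growth')]
```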

Tests

28 mock-based tests in tests/test_batch_processor.py, no API key required:

  • Input validation (empty list, non-PDF files)
  • Success path (files written, index content, doc_description stored)
  • Error handling (missing file, LLM crash, partial failure, auto-create output dir)
  • KnowledgeBaseSearch init, list_documents, get_document, search
  • Search: title match, summary match, case-insensitive, no results, scoring, top_k, required fields, cross-document, nested nodes
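The mock-based testing approach reads roughly as follows. This is a generic sketch of the pattern, with process_one() as a hypothetical stand-in for the batch loop body; the real tests patch the actual page_index_main() pipeline in tests/test_batch_processor.py.

```python
from unittest import mock

def process_one(path, pipeline):
    """Stand-in for one batch iteration: run the pipeline, swallow errors."""
    try:
        return {"doc": path, "structure": pipeline(path)}
    except Exception:
        return {"doc": path, "error": "failed"}

# Success path: the expensive LLM-backed pipeline is a stub,
# so no API key is ever needed.
fake = mock.Mock(return_value={"title": "Root", "nodes": []})
ok = process_one("report.pdf", fake)
fake.assert_called_once_with("report.pdf")

# Error path: a crashing pipeline is recorded, not propagated.
bad = mock.Mock(side_effect=RuntimeError("LLM crash"))
err = process_one("broken.pdf", bad)
print(ok["structure"]["title"], err["error"])  # Root failed
```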

Addresses the use case raised in issue VectifyAI#119 — processing multiple PDFs
and querying the results as a unified knowledge base.

New module: pageindex/batch_processor.py
- process_batch(doc_paths, output_dir, **kwargs)
  Processes a list of PDF files using the existing page_index_main()
  pipeline, saves per-document structure JSONs, and writes a
  kb_index.json manifest listing every document and its output file.
  Skips failed documents without aborting the whole batch.
- KnowledgeBaseSearch(kb_index_path)
  Loads a knowledge base produced by process_batch() and exposes:
    .search(query, top_k)  — case-insensitive substring search across
                             all node titles and summaries; scores title
                             matches higher than summary-only matches.
    .get_document(doc_name) — retrieve one document's full structure.
    .list_documents()       — list all successfully processed doc names.

CLI (run_pageindex.py):
- New --batch-dir flag: point at a directory of PDFs and the batch
  processor runs over all of them, printing a summary on completion.

Tests (tests/test_batch_processor.py):
- 28 mock-based tests covering validation, success path, error handling,
  partial failures, search scoring, cross-document search, nested nodes,
  and edge cases. No API key required.

Closes VectifyAI#119