
Add multi-document batch processing and knowledge base search#139

Open
rajo69 wants to merge 1 commit into VectifyAI:main from rajo69:add-batch-processor

rajo69 commented Mar 3, 2026

Related to #119

What this adds

process_batch() - process multiple PDFs in one call

from pageindex.batch_processor import process_batch

summary = process_batch(
    doc_paths=["report.pdf", "earnings.pdf", "manual.pdf"],
    output_dir="./results",
    model="gpt-4o-2024-11-20",
    if_add_node_summary="yes",
)
# summary["processed"]     -> ["report.pdf", "earnings.pdf", "manual.pdf"]
# summary["kb_index_path"] -> "./results/kb_index.json"

Each document's hierarchical structure is saved as {stem}_structure.json. A kb_index.json manifest is written listing every document and its output file. Failed documents are logged and skipped so the rest of the batch continues unaffected.
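To make the manifest contract concrete, here is a minimal sketch of how a consumer might read kb_index.json. The field names ("documents", "doc_name", "structure_file") are illustrative assumptions; the authoritative schema is whatever process_batch() writes.

```python
import json
import os
import tempfile

# Hypothetical manifest layout -- field names are assumptions, not the
# actual schema produced by process_batch().
manifest = {
    "created_at": "2026-03-03T00:00:00",
    "documents": [
        {"doc_name": "report.pdf",   "structure_file": "report_structure.json"},
        {"doc_name": "earnings.pdf", "structure_file": "earnings_structure.json"},
    ],
}

path = os.path.join(tempfile.mkdtemp(), "kb_index.json")
with open(path, "w") as f:
    json.dump(manifest, f, indent=2)

# A consumer can rediscover every processed document from the manifest:
with open(path) as f:
    names = [d["doc_name"] for d in json.load(f)["documents"]]
print(names)  # ['report.pdf', 'earnings.pdf']
```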

KnowledgeBaseSearch - search across all processed documents

from pageindex.batch_processor import KnowledgeBaseSearch

kb = KnowledgeBaseSearch("./results/kb_index.json")

# Case-insensitive search across node titles and summaries
hits = kb.search("revenue growth", top_k=5)
# [{"doc_name": "earnings.pdf", "title": "Revenue Growth", "score": 2, ...}, ...]

# Retrieve a full document structure
doc = kb.get_document("earnings.pdf")

# List all documents in the knowledge base
names = kb.list_documents()

Search uses case-insensitive keyword/substring matching, consistent with PageIndex's reasoning-first philosophy. Title matches score 2; summary-only matches score 1. Results are ranked by score.
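The scoring rule described above can be sketched in a few lines. This is an illustration of the rule, not the actual KnowledgeBaseSearch implementation; the node shapes are made up for the demo.

```python
# Sketch of the stated scoring rule: title substring match -> 2,
# summary-only match -> 1, no match -> 0; results ranked by score.
def score_node(query: str, title: str, summary: str) -> int:
    q = query.lower()
    if q in title.lower():
        return 2
    if q in summary.lower():
        return 1
    return 0

nodes = [
    {"title": "Revenue Growth", "summary": "Quarterly revenue rose 12%."},
    {"title": "Outlook",        "summary": "Revenue growth expected to slow."},
    {"title": "Appendix",       "summary": "Methodology notes."},
]

scored = [
    dict(n, score=score_node("revenue growth", n["title"], n["summary"]))
    for n in nodes
]
hits = sorted((n for n in scored if n["score"] > 0),
              key=lambda n: n["score"], reverse=True)
print([(h["title"], h["score"]) for h in hits])
# [('Revenue Growth', 2), ('Outlook', 1)]
```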

CLI - --batch-dir flag

python run_pageindex.py --batch-dir ./my_pdfs/ --model gpt-4o-2024-11-20
# Processes all PDFs in ./my_pdfs/, saves results to ./results/

Design decisions

  • No new dependencies - uses only the existing page_index_main() pipeline and standard library (json, os, logging, datetime).
  • Graceful degradation - one bad PDF never aborts the whole batch.
  • Shared config - ConfigLoader is called once and the same opt object is reused for every document, ensuring consistent settings across the batch.
  • Lazy loading in search - structure files are loaded on first access and cached, so large knowledge bases do not load everything upfront.
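The lazy-loading pattern from the last bullet can be sketched as below. This is a generic illustration of load-on-first-access with a dict cache, assuming structure files are plain JSON; it is not the actual KnowledgeBaseSearch internals.

```python
import json
import os
import tempfile

class LazyStructureCache:
    """Illustrative only: read a structure JSON on first access, cache after."""
    def __init__(self):
        self._cache = {}
        self.loads = 0  # counts actual disk reads, to show caching works

    def get(self, path):
        if path not in self._cache:
            self.loads += 1
            with open(path) as f:
                self._cache[path] = json.load(f)
        return self._cache[path]

# Demo: two accesses, one disk read.
path = os.path.join(tempfile.mkdtemp(), "report_structure.json")
with open(path, "w") as f:
    json.dump({"title": "Report"}, f)

cache = LazyStructureCache()
cache.get(path)
cache.get(path)     # served from memory, no second disk read
print(cache.loads)  # 1
```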

Breaking changes

None. batch_processor is a new module. No existing functions or interfaces were modified.

Scope and limitations

This implementation is designed for small-to-medium collections (up to a few hundred PDFs). For larger collections (thousands+), two things would need extending:

  • Processing - sequential calls to page_index_main() would need a job queue (e.g. Celery/RQ) and parallelism to be practical at scale.
  • Search - the current in-memory keyword scan would need replacing with a proper search backend (SQLite FTS5, Elasticsearch, or a vector database).

This PR establishes the foundational API and kb_index.json output format that such extensions can build on.
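As a sketch of the SQLite FTS5 extension path mentioned above (not part of this PR, and assuming an FTS5-enabled SQLite build, which ships with most CPython distributions): node titles and summaries from each structure file would be indexed once, and search becomes a single SQL query.

```python
import sqlite3

# Illustrative schema -- column names are assumptions, not the
# kb_index.json format. In a real extension, rows would be bulk-loaded
# from each {stem}_structure.json after process_batch() completes.
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE nodes USING fts5(doc_name, title, summary)")
con.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [
        ("earnings.pdf", "Revenue Growth", "Quarterly revenue rose 12%."),
        ("report.pdf",   "Safety Manual",  "Handling procedures."),
    ],
)

# FTS5 replaces the in-memory substring scan with an indexed MATCH,
# ranked by SQLite's built-in relevance ("rank") instead of 2/1 scores.
rows = con.execute(
    "SELECT doc_name, title FROM nodes WHERE nodes MATCH ? ORDER BY rank",
    ("revenue",),
).fetchall()
print(rows)  # [('earnings.pdf', 'Revenue Growth')]
```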

Tests

28 mock-based tests in tests/test_batch_processor.py, no API key required:

  • Input validation (empty list, non-PDF files)
  • Success path (files written, index content, doc_description stored)
  • Error handling (missing file, LLM crash, partial failure, auto-create output dir)
  • KnowledgeBaseSearch init, list_documents, get_document, search
  • Search: title match, summary match, case-insensitive, no results, scoring, top_k, required fields, cross-document, nested nodes
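The mock-based testing approach reads roughly as follows. This is a generic sketch of the pattern, with process_one() as a hypothetical stand-in for the batch loop body; the real tests patch the actual page_index_main() pipeline in tests/test_batch_processor.py.

```python
from unittest import mock

def process_one(path, pipeline):
    """Stand-in for one batch iteration: run the pipeline, swallow errors."""
    try:
        return {"doc": path, "structure": pipeline(path)}
    except Exception:
        return {"doc": path, "error": "failed"}

# Success path: the expensive LLM-backed pipeline is a stub,
# so no API key is ever needed.
fake = mock.Mock(return_value={"title": "Root", "nodes": []})
ok = process_one("report.pdf", fake)
fake.assert_called_once_with("report.pdf")

# Error path: a crashing pipeline is recorded, not propagated.
bad = mock.Mock(side_effect=RuntimeError("LLM crash"))
err = process_one("broken.pdf", bad)
print(ok["structure"]["title"], err["error"])  # Root failed
```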

Addresses the use case raised in issue VectifyAI#119 — processing multiple PDFs
and querying the results as a unified knowledge base.

New module: pageindex/batch_processor.py
- process_batch(doc_paths, output_dir, **kwargs)
  Processes a list of PDF files using the existing page_index_main()
  pipeline, saves per-document structure JSONs, and writes a
  kb_index.json manifest listing every document and its output file.
  Skips failed documents without aborting the whole batch.
- KnowledgeBaseSearch(kb_index_path)
  Loads a knowledge base produced by process_batch() and exposes:
    .search(query, top_k)  — case-insensitive substring search across
                             all node titles and summaries; scores title
                             matches higher than summary-only matches.
    .get_document(doc_name) — retrieve one document's full structure.
    .list_documents()       — list all successfully processed doc names.

CLI (run_pageindex.py):
- New --batch-dir flag: point at a directory of PDFs and the batch
  processor runs over all of them, printing a summary on completion.

Tests (tests/test_batch_processor.py):
- 28 mock-based tests covering validation, success path, error handling,
  partial failures, search scoring, cross-document search, nested nodes,
  and edge cases. No API key required.

Closes VectifyAI#119