Add multi-document batch processing and knowledge base search #139
Open
rajo69 wants to merge 1 commit into VectifyAI:main from
Addresses the use case raised in issue VectifyAI#119: processing multiple PDFs and querying the results as a unified knowledge base.

New module: pageindex/batch_processor.py

- process_batch(doc_paths, output_dir, **kwargs)
  Processes a list of PDF files using the existing page_index_main() pipeline, saves per-document structure JSONs, and writes a kb_index.json manifest listing every document and its output file. Skips failed documents without aborting the whole batch.
- KnowledgeBaseSearch(kb_index_path)
  Loads a knowledge base produced by process_batch() and exposes:
  .search(query, top_k): case-insensitive substring search across all node titles and summaries; scores title matches higher than summary-only matches.
  .get_document(doc_name): retrieve one document's full structure.
  .list_documents(): list all successfully processed doc names.

CLI (run_pageindex.py):

- New --batch-dir flag: point at a directory of PDFs and the batch processor runs over all of them, printing a summary on completion.

Tests (tests/test_batch_processor.py):

- 28 mock-based tests covering validation, success path, error handling, partial failures, search scoring, cross-document search, nested nodes, and edge cases. No API key required.

Closes VectifyAI#119
Related to #119
What this adds
process_batch() - process multiple PDFs in one call
Each document's hierarchical structure is saved as {stem}_structure.json. A kb_index.json manifest is written listing every document and its output file. Failed documents are logged and skipped so the rest of the batch continues unaffected.

KnowledgeBaseSearch - search across all processed documents
Search is keyword-based substring matching, consistent with PageIndex's reasoning-first philosophy. Title matches score 2; summary-only matches score 1. Results are ranked by score.
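
The scoring rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `search` and `iter_nodes` functions, the result fields, and the assumption that child nodes live under a `"nodes"` key with `"title"` and `"summary"` fields are all hypothetical stand-ins for whatever `KnowledgeBaseSearch` actually does.

```python
TITLE_SCORE = 2    # query found in the node title
SUMMARY_SCORE = 1  # query found in the summary only


def iter_nodes(nodes):
    """Yield every node in a nested structure, depth first."""
    for node in nodes:
        yield node
        # Assumption: children are stored under a "nodes" key.
        yield from iter_nodes(node.get("nodes", []))


def search(documents, query, top_k=5):
    """Rank nodes across all documents by case-insensitive substring match."""
    needle = query.lower()
    if not needle:
        return []  # an empty query would otherwise match every node
    results = []
    for doc_name, structure in documents.items():
        for node in iter_nodes(structure):
            title = node.get("title", "")
            summary = node.get("summary", "")
            if needle in title.lower():
                score = TITLE_SCORE
            elif needle in summary.lower():
                score = SUMMARY_SCORE
            else:
                continue
            results.append({"doc_name": doc_name, "title": title, "score": score})
    # list.sort is stable, so equal scores keep their traversal order.
    results.sort(key=lambda r: r["score"], reverse=True)
    return results[:top_k]


# Toy in-memory knowledge base: document name -> list of top-level nodes.
docs = {
    "report.pdf": [
        {"title": "Revenue Overview", "summary": "Annual totals.",
         "nodes": [{"title": "Q3 Details", "summary": "Revenue fell in Q3."}]},
    ],
    "memo.pdf": [{"title": "Staffing", "summary": "No revenue impact expected."}],
}

hits = search(docs, "revenue", top_k=2)
for hit in hits:
    print(hit["score"], hit["doc_name"], hit["title"])
```

With this toy data, the title match in report.pdf ranks first and the nested summary-only match follows it; the memo.pdf match is cut off by `top_k=2`.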
CLI - --batch-dir flag
    python run_pageindex.py --batch-dir ./my_pdfs/ --model gpt-4o-2024-11-20
    # Processes all PDFs in ./my_pdfs/, saves results to ./results/
Design decisions
page_index_main() pipeline and standard library (json, os, logging, datetime).
ConfigLoader is called once and the same opt object is reused for every document, ensuring consistent settings across the batch.
Breaking changes
None.
batch_processor is a new module. No existing functions or interfaces were modified.
Scope and limitations
This implementation is designed for small-to-medium collections (up to a few hundred PDFs). For larger collections (thousands+), two things would need extending:
page_index_main() would need a job queue (e.g. Celery/RQ) and parallelism to be practical at scale.
This PR establishes the foundational API and kb_index.json output format that such extensions can build on.
Tests
28 mock-based tests in tests/test_batch_processor.py, no API key required:
KnowledgeBaseSearch: init, list_documents, get_document, search top_k, required fields, cross-document, nested nodes
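
For readers wondering how a partial-failure test can run without an API key, here is one possible shape. It is a sketch under stated assumptions, not a test from this PR: `run_batch` is a hypothetical stand-in for `process_batch()` that takes the per-document processor as an argument so a `Mock` can replace the LLM-backed pipeline, and the manifest fields (`doc_name`, `status`, `output_file`, `error`) are invented for illustration.

```python
import json
import tempfile
from pathlib import Path
from unittest.mock import Mock


def run_batch(doc_paths, output_dir, process_one):
    """Hypothetical stand-in for process_batch(); process_one replaces the real pipeline."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    documents = []
    for path in doc_paths:
        stem = Path(path).stem
        try:
            structure = process_one(path)
        except Exception as exc:
            # A failed document is recorded and skipped; the batch continues.
            documents.append({"doc_name": stem, "status": "failed", "error": str(exc)})
            continue
        out_file = out / f"{stem}_structure.json"
        out_file.write_text(json.dumps(structure))
        documents.append({"doc_name": stem, "status": "ok", "output_file": out_file.name})
    # Assumed manifest layout: one entry per input document.
    (out / "kb_index.json").write_text(json.dumps({"documents": documents}, indent=2))
    return documents


def test_partial_failure_does_not_abort_batch():
    # The mock returns a structure, raises, then returns a structure again.
    fake = Mock(side_effect=[
        [{"title": "Intro", "summary": "..."}],
        RuntimeError("corrupt PDF"),
        [{"title": "Appendix", "summary": "..."}],
    ])
    with tempfile.TemporaryDirectory() as tmp:
        docs = run_batch(["a.pdf", "b.pdf", "c.pdf"], tmp, fake)
        manifest = json.loads((Path(tmp) / "kb_index.json").read_text())
        written = sorted(p.name for p in Path(tmp).glob("*_structure.json"))
    assert fake.call_count == 3
    assert [d["status"] for d in docs] == ["ok", "failed", "ok"]
    assert manifest["documents"] == docs
    assert written == ["a_structure.json", "c_structure.json"]


test_partial_failure_does_not_abort_batch()
```

Passing the processor in as a parameter is only one way to make the loop testable; patching the pipeline function with `unittest.mock.patch` would serve the same purpose.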