
fix: improve JSON extraction and TOC fallback handling #333

Open
KairosMarco wants to merge 1 commit into VectifyAI:main from KairosMarco:fix/json-response-resilience


KairosMarco commented Jun 22, 2026


Summary

This PR improves PageIndex robustness when LLM calls return JSON in common non-ideal formats or omit optional fields during TOC/page-index extraction.

It keeps the existing indexing flow unchanged, but adds safer parsing and fallback behavior for provider responses that include:

  • fenced JSON blocks,
  • explanatory text before JSON,
  • arrays with trailing text,
  • Python-style literal tokens: None, True, False,
  • missing JSON keys,
  • object-shaped TOC output where list-shaped output is expected,
  • missing page-offset or physical_index values.
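
For illustration, a minimal sketch of how the first four response shapes could be parsed tolerantly. It is not the code in this PR: `extract_json_tolerant` and its helpers are hypothetical names, and the approach (unwrap a fence, rewrite bare Python literals outside strings, then let `json.JSONDecoder.raw_decode` ignore trailing text) is one possible technique among several.

```python
import json
import re

# Built at runtime so this example never contains a literal fence marker.
_FENCE = "`" * 3
_FENCE_RE = re.compile(_FENCE + r"(?:json)?\s*(.*?)" + _FENCE, re.DOTALL | re.IGNORECASE)

# Matches either a complete double-quoted string or a bare Python literal.
_TOKEN_RE = re.compile(r'"(?:\\.|[^"\\])*"|\b(None|True|False)\b')
_PY_TO_JSON = {"None": "null", "True": "true", "False": "false"}


def _fix_python_literals(text):
    """Rewrite bare None/True/False as JSON tokens; quoted strings are left as-is."""
    def repl(match):
        literal = match.group(1)
        return _PY_TO_JSON[literal] if literal else match.group(0)
    return _TOKEN_RE.sub(repl, text)


def extract_json_tolerant(content, default=None):
    """Hypothetical helper: return the first JSON object or array found in content."""
    fenced = _FENCE_RE.search(content)
    text = fenced.group(1) if fenced else content
    text = _fix_python_literals(text)

    decoder = json.JSONDecoder()
    # Try every '{' or '[' as a possible start; raw_decode stops at the end of
    # the first valid value, so explanatory text after it is ignored.
    for start in re.finditer(r"[{\[]", text):
        try:
            value, _end = decoder.raw_decode(text, start.start())
        except json.JSONDecodeError:
            continue
        return value
    return default
```

This sketch does not handle single-quoted Python dicts or trailing commas; a caller gets `default` back when nothing parses, which lets the missing-key and wrong-shape cases be handled separately.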

Why

While running PageIndex over a FinanceBench PDF subset, I saw indexing failures caused by the shape of model responses, for example:

KeyError: 'toc_detected'
KeyError: 'page_index_given_in_toc'
AttributeError: 'dict' object has no attribute 'extend'
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
KeyError: 'physical_index'

The failures were not specific to one document format; they came from LLM JSON formatting variance.
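
As a rough illustration of the first two errors: they are what direct indexing such as `parsed['toc_detected']` produces when the model omits the key. A defensive lookup along these lines avoids the exception. The helper name `read_flag` and the `'yes'`/`'no'` value convention are assumptions made for this sketch, not taken from the diff.

```python
def read_flag(parsed, key, default="no"):
    """Hypothetical helper: read a yes/no flag from parsed model output.

    Returns `default` when the output is not a dict, the key is absent,
    or the value is not recognizable as yes/no.
    """
    if not isinstance(parsed, dict):
        return default
    value = parsed.get(key)
    if isinstance(value, bool):
        # Some providers answer with a JSON boolean instead of a string.
        return "yes" if value else "no"
    if isinstance(value, str) and value.strip().lower() in ("yes", "no"):
        return value.strip().lower()
    return default
```

With this shape, `read_flag({}, "toc_detected")` falls back to `"no"` rather than raising `KeyError`.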

Changes

  • Make extract_json() tolerate fenced JSON, embedded JSON, arrays, trailing text, and Python-style literal tokens.
  • Use safe defaults for TOC detector/completeness checks when parsed JSON is missing or not a dict.
  • Normalize TOC generation output to list[dict] before list operations.
  • Skip offset/page repair when the model output is missing required fields.
  • Return a low-confidence no-TOC structure once the fallback attempts are exhausted, instead of raising "Processing failed".
  • Add focused unittest coverage for the JSON parser and TOC fallback helpers.
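
To make the normalization and skip-on-missing-fields items concrete, here is a small sketch under stated assumptions. The function names and the `page` key are hypothetical; only `physical_index` appears in the errors above. The actual helpers in the diff may differ.

```python
def normalize_toc_items(raw):
    """Hypothetical helper: coerce model output into list[dict]."""
    if isinstance(raw, dict):
        # Assumption: an object either wraps the list under some key
        # or is itself a single TOC entry.
        for value in raw.values():
            if isinstance(value, list):
                raw = value
                break
        else:
            raw = [raw]
    if not isinstance(raw, list):
        return []
    # Drop stray non-dict entries so later .get() calls are safe.
    return [item for item in raw if isinstance(item, dict)]


def apply_page_offset(items, offset):
    """Hypothetical helper: set physical_index = page + offset where both exist."""
    if offset is None:
        # No usable offset: skip the repair rather than computing int + None.
        return items
    for item in items:
        page = item.get("page")
        if isinstance(page, int):
            item["physical_index"] = page + offset
    return items
```

Normalizing first means `.extend()` is only ever called on a list, and the offset guard leaves entries without a page untouched instead of failing on `int + NoneType`.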

Validation

python -m unittest discover -s tests
python -m py_compile pageindex\utils.py pageindex\page_index.py tests\test_json_resilience.py

Local result:

Ran 7 tests
OK

I also validated equivalent fixes in a local benchmark workspace:

Expanded PageIndex structures: 24 / 24 source documents
Expanded PageIndex retrieval-only QA: 25 / 25 generated
Expanded LLM QA: 25 / 25 generated

The benchmark artifacts are available here:

https://github.com/KairosMarco/pageindex-benchlab

KairosMarco changed the title from "Improve JSON extraction and TOC fallback handling" to "fix: improve JSON extraction and TOC fallback handling" Jun 22, 2026
KairosMarco (Author) commented

Hi maintainers, I wanted to check whether this scope is aligned with the project direction.

The PR is focused on JSON response resilience and TOC fallback handling, with unittest coverage. If this is too broad, I am happy to split it into a smaller parser-only PR first, then a separate TOC fallback PR.
