fix: improve JSON extraction and TOC fallback handling by KairosMarco · Pull Request #333 · VectifyAI/PageIndex

KairosMarco · 2026-06-22T02:28:02Z

Summary

This PR improves PageIndex robustness when LLM calls return JSON in common non-ideal formats or omit optional fields during TOC/page-index extraction.

It keeps the existing indexing flow unchanged, but adds safer parsing and fallback behavior for provider responses that include:

fenced JSON blocks,
explanatory text before JSON,
arrays with trailing text,
Python-style literal tokens: None, True, False,
missing JSON keys,
object-shaped TOC output where list-shaped output is expected,
missing page-offset or physical_index values.
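As an illustration of the formats listed above, here is a minimal sketch of a lenient parser. It is not the PR's actual extract_json() implementation; the name extract_json_lenient and its helpers are hypothetical. It assumes the first decodable JSON value in the response is the one wanted, and that any double quotes before the JSON are balanced.

```python
import json
import re

# Build the fence marker programmatically so this example nests cleanly.
_FENCE = "`" * 3
_FENCE_RE = re.compile(_FENCE + r"(?:json)?\s*(.*?)" + _FENCE, re.DOTALL | re.IGNORECASE)
_LITERAL_RE = re.compile(r"(?:None|True|False)\b")
_PY_TO_JSON = {"None": "null", "True": "true", "False": "false"}


def _replace_python_literals(text):
    """Swap bare None/True/False for JSON tokens, leaving string contents alone."""
    out = []
    i = 0
    in_string = False
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < len(text):
                # Copy the escaped character so an escaped quote does not end the string.
                out.append(text[i + 1])
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
            continue
        if ch == '"':
            in_string = True
            out.append(ch)
            i += 1
            continue
        match = _LITERAL_RE.match(text, i)
        prev_is_word = i > 0 and (text[i - 1].isalnum() or text[i - 1] == "_")
        if match and not prev_is_word:
            out.append(_PY_TO_JSON[match.group(0)])
            i = match.end()
            continue
        out.append(ch)
        i += 1
    return "".join(out)


def extract_json_lenient(response):
    """Return the first JSON object or array found in response, or None."""
    fenced = _FENCE_RE.search(response)
    candidate = fenced.group(1) if fenced else response
    candidate = _replace_python_literals(candidate)
    decoder = json.JSONDecoder()
    for i, ch in enumerate(candidate):
        if ch in "{[":
            try:
                # raw_decode parses one value starting at i and ignores trailing text.
                value, _end = decoder.raw_decode(candidate, i)
                return value
            except json.JSONDecodeError:
                continue
    return None
```

Returning None on failure, rather than raising, leaves the decision about defaults to the caller.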

Why

While running PageIndex over a FinanceBench PDF subset, I saw indexing failures caused by unexpected model response shapes, such as:

KeyError: 'toc_detected'
KeyError: 'page_index_given_in_toc'
AttributeError: 'dict' object has no attribute 'extend'
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
KeyError: 'physical_index'

The failures were not specific to one document format; they came from LLM JSON formatting variance.
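The KeyError cases above arise when a parsed response is indexed directly. A minimal sketch of a defensive read follows; the helper name read_flag is hypothetical, and the "yes"/"no" string values and the "no" default are assumptions for illustration, not confirmed behavior of this PR.

```python
from typing import Any


def read_flag(parsed: Any, key: str, default: str = "no") -> str:
    """Return parsed[key] if parsed is a dict holding a string there, else default."""
    if not isinstance(parsed, dict):
        # Covers None (parse failure) and list-shaped responses.
        return default
    value = parsed.get(key, default)
    return value if isinstance(value, str) else default


# A response missing the key no longer raises KeyError.
print(read_flag({}, "toc_detected"))
print(read_flag({"page_index_given_in_toc": "yes"}, "page_index_given_in_toc"))
```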

Changes

Make extract_json() tolerate fenced JSON, embedded JSON, arrays, trailing text, and Python-style literal tokens.
Use safe defaults for TOC detector/completeness checks when parsed JSON is missing or not a dict.
Normalize TOC generation output to list[dict] before list operations.
Skip offset/page repair when the model output is missing required fields.
Return a low-confidence no-TOC structure instead of raising "Processing failed" after fallback attempts.
Add focused unittest coverage for the JSON parser and TOC fallback helpers.
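The list normalization and offset-skip items above can be sketched as follows. This is an illustrative outline under assumptions, not the code in this PR: the helper names normalize_toc and apply_page_offset and the "page" field are hypothetical, while physical_index comes from the errors reported above.

```python
from __future__ import annotations

from typing import Any, Optional


def normalize_toc(parsed: Any) -> list[dict]:
    """Coerce a parsed TOC response into a list of dicts."""
    if isinstance(parsed, dict):
        # Assumption: an object-shaped response either wraps the list under
        # some key or is itself a single TOC entry.
        for value in parsed.values():
            if isinstance(value, list):
                parsed = value
                break
        else:
            parsed = [parsed]
    if not isinstance(parsed, list):
        return []
    # Drop non-dict items so later list operations see a uniform shape.
    return [item for item in parsed if isinstance(item, dict)]


def apply_page_offset(toc: list[dict], offset: Optional[int]) -> list[dict]:
    """Set physical_index = page + offset in place; skip when data is missing."""
    if offset is None:
        # Avoids "unsupported operand type(s) for +: 'int' and 'NoneType'".
        return toc
    for item in toc:
        page = item.get("page")
        if isinstance(page, int) and not isinstance(page, bool):
            item["physical_index"] = page + offset
    return toc
```

Entries without a usable page number are left untouched rather than dropped, so a later repair step can still see them.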

Validation

python -m unittest discover -s tests
python -m py_compile pageindex\utils.py pageindex\page_index.py tests\test_json_resilience.py

Local result:

Ran 7 tests
OK

I also validated equivalent fixes in a local benchmark workspace:

Expanded PageIndex structures: 24 / 24 source documents
Expanded PageIndex retrieval-only QA: 25 / 25 generated
Expanded LLM QA: 25 / 25 generated

The benchmark artifacts are available here:

https://github.com/KairosMarco/pageindex-benchlab

KairosMarco · 2026-06-23T02:28:35Z

Hi maintainers, I wanted to check whether this scope is aligned with the project direction.

The PR is focused on JSON response resilience and TOC fallback handling, with unittest coverage. If this is too broad, I am happy to split it into a smaller parser-only PR first, then a separate TOC fallback PR.

Commit 1cf28e5: Improve JSON extraction and TOC fallback handling

KairosMarco changed the title from "Improve JSON extraction and TOC fallback handling" to "fix: improve JSON extraction and TOC fallback handling" on Jun 22, 2026

KairosMarco wants to merge 1 commit into VectifyAI:main from KairosMarco:fix/json-response-resilience
