feat: page memory parse track, TOC refactor, and usage_task cost tracking by EricNGOntos · Pull Request #143 · Ontos-AI/knowhere

EricNGOntos · 2026-06-11T08:40:57Z

Summary

This PR introduces three major improvements to the document parsing pipeline:

1. Page Memory Parse Track

Added a new page_memory service (memory_service.py, normalizer.py) that records per-page parsing metadata for future reuse and incremental re-parsing.
New DB migrations: add_doc_profile_to_document_page_plan and add_parse_track_to_documents.
Extended document_agent/budget.py with richer budget tracking and trace.py with structured trace events.
Updated coordinator.py and planner.py to integrate page memory lookups during document profiling.
New contract test: test_page_memory_parse_track_contract.py.
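
The idea of recording per-page metadata for incremental re-parsing can be sketched roughly as follows. This is a minimal illustration, not the actual page_memory service: `PageRecord`, `PageMemory`, `remember`, `pages_to_reparse`, and the `parse_track` values are hypothetical names, and hashing page bytes is an assumed way to detect changed pages.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PageRecord:
    content_hash: str  # fingerprint of the page content at parse time
    parse_track: str   # hypothetical label for the parser path that handled the page


class PageMemory:
    """In-memory stand-in for a per-page metadata store (assumed design)."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, int], PageRecord] = {}

    def remember(self, doc_id: str, page_no: int, page_bytes: bytes, parse_track: str) -> None:
        digest = hashlib.sha256(page_bytes).hexdigest()
        self._records[(doc_id, page_no)] = PageRecord(digest, parse_track)

    def pages_to_reparse(self, doc_id: str, pages: List[bytes]) -> List[int]:
        """Return 1-based page numbers that are new or whose content changed."""
        stale: List[int] = []
        for page_no, page_bytes in enumerate(pages, start=1):
            record = self._records.get((doc_id, page_no))
            digest = hashlib.sha256(page_bytes).hexdigest()
            if record is None or record.content_hash != digest:
                stale.append(page_no)
        return stale


memory = PageMemory()
memory.remember("doc-1", 1, b"page one", "text")
memory.remember("doc-1", 2, b"page two", "ocr")

# Page 1 is unchanged, page 2 was edited, page 3 is new.
changed = memory.pages_to_reparse("doc-1", [b"page one", b"page two (edited)", b"page three"])
```

Only the pages listed in `changed` would need to go back through the parser; unchanged pages could reuse their stored metadata.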

2. TOC Handling Streamlining

Removed the deprecated PDF_PROFILE_TOC_ENABLED config flag from both API and Worker .env.example files and from config/storage.py.
Simplified TOC detection logic in atlas/parser.py and pdf/parser.py to always use the unified path.
Refactored find_toc_anchor_pages.py with improved anchor detection heuristics.
Extended test_doc_profile_anatomy_contract.py to cover the new TOC behaviors.

3. Usage Task Cost Tracking

Added usage_task parameter to all LLM calls across 14+ service files (document agent tools, parser modules, summary builder, etc.) to enable per-task token cost attribution.
Enhanced token_tracking.py with by_model and by_task bucketed tracking, plus get_current_token_tracker() for mid-pipeline snapshots.
New token_costing.py module (251 lines) that maps token usage to estimated USD costs using a configurable pricing table.
Refactored processing_run.py to own the tracker lifecycle at the top level, with parse_execution.py reusing the existing tracker rather than re-initializing.
Added _refresh_processing_stages() in success_finalization.py to persist cost + timing data into job metadata.
Integrated cost estimates into zip_manifest_schema.py for downstream visibility.
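
A rough sketch of what by_model and by_task bucketed tracking with a mid-pipeline snapshot might look like. Only `by_model`, `by_task`, `usage_task`, and `get_current_token_tracker()` come from the description above; `TokenBucket`, `TokenTracker`, `record`, `start_token_tracking`, and the use of a `ContextVar` are assumptions made for illustration and may differ from token_tracking.py.

```python
from collections import defaultdict
from contextvars import ContextVar
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TokenBucket:
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def add(self, prompt: int, completion: int) -> None:
        self.prompt_tokens += prompt
        self.completion_tokens += completion


@dataclass
class TokenTracker:
    # One bucket per model name and one per usage_task label (assumed layout).
    by_model: Dict[str, TokenBucket] = field(default_factory=lambda: defaultdict(TokenBucket))
    by_task: Dict[str, TokenBucket] = field(default_factory=lambda: defaultdict(TokenBucket))

    def record(self, model: str, usage_task: str, prompt: int, completion: int) -> None:
        self.by_model[model].add(prompt, completion)
        self.by_task[usage_task].add(prompt, completion)


# Holds the tracker for the current run; None when no run is active.
_current_tracker: ContextVar[Optional[TokenTracker]] = ContextVar("token_tracker", default=None)


def start_token_tracking() -> TokenTracker:
    """Create the tracker once, at the top level of a processing run."""
    tracker = TokenTracker()
    _current_tracker.set(tracker)
    return tracker


def get_current_token_tracker() -> Optional[TokenTracker]:
    """Return the active tracker so nested stages reuse it instead of re-initializing."""
    return _current_tracker.get()


tracker = start_token_tracking()
tracker.record("model-a", "toc_detection", prompt=1200, completion=80)
tracker.record("model-a", "summary", prompt=300, completion=150)

snapshot = get_current_token_tracker()
```

Creating the tracker in one place and reading it elsewhere mirrors the lifecycle split described for processing_run.py and parse_execution.py.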

4. Supporting Changes

Added TOKEN_PRICING_TABLE_JSON config field in config/ai.py.
Extended dataframe_chunk_converter.py and publication_service.py for page-number propagation.
New contract test: test_document_agent_budget_contract.py.
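
For the pricing table, a cost estimate could be computed along these lines. The JSON shape, the key names `input_per_million` and `output_per_million`, the model name, the rates, and the `estimate_cost_usd` helper are all hypothetical; the real schema behind TOKEN_PRICING_TABLE_JSON and token_costing.py is not shown in this PR description.

```python
import json
from typing import Dict

# Hypothetical pricing table: USD per one million tokens, keyed by model name.
PRICING_JSON = '{"model-a": {"input_per_million": 2.0, "output_per_million": 8.0}}'


def estimate_cost_usd(
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    pricing: Dict[str, Dict[str, float]],
) -> float:
    """Map token counts to an estimated USD cost using the pricing table."""
    rates = pricing.get(model)
    if rates is None:
        # Assumption: unknown models contribute no cost rather than raising.
        return 0.0
    total = (
        prompt_tokens * rates["input_per_million"]
        + completion_tokens * rates["output_per_million"]
    )
    return total / 1_000_000


pricing = json.loads(PRICING_JSON)
cost = estimate_cost_usd("model-a", 1_500_000, 250_000, pricing)
```

With these made-up rates, 1.5M prompt tokens cost 3.00 USD and 250k completion tokens cost 2.00 USD, for an estimate of 5.00 USD.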

Testing

make check (ruff lint + pyright typecheck) passes with 0 errors.
New contract tests added for page memory, budget, and anatomy features.

- Removed PDF_PROFILE_TOC_ENABLED from configuration and adjusted related logic in document profiling.
- Simplified TOC detection logic in PDF and Atlas parsers, ensuring consistent behavior across document types.
- Updated tests to reflect changes in TOC handling and ensure proper functionality without the deprecated flag.

- Introduced usage_task parameter in LLM calls across multiple services including summary_builder, react_loop, planner, and various document agent tools.
- Enhanced token tracking and stage profiling in document ingestion processes to improve performance monitoring and cost estimation.
- Updated relevant functions to utilize new token and stage tracking methods, ensuring accurate resource usage reporting.

Cherry-picked from feat/wuchengke/2026-06-11 (d5d7d042)

+        if doc is not None:
+            try:
+                doc.close()
+            except Exception:


+                if self._profile_plan_row is not None:
+                    self._db.expunge(self._profile_plan_row)
+                self._profile_plan_row = None
+            except Exception:


The page_memory migration added a NOT NULL parse_track column to the documents table, but the contract test helpers (contract_database.py, test_job_creation_contract.py, test_documents_contract.py) were still inserting documents without it, causing all retrieval contract tests to fail with NotNullViolationError.
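
The failure mode can be reproduced in miniature. The sketch below uses an in-memory SQLite table as a stand-in for the real database, so the exception is `sqlite3.IntegrityError` rather than NotNullViolationError; the `insert_document` helper, the table layout, and the `"legacy"` value are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, title TEXT NOT NULL, parse_track TEXT NOT NULL)"
)


def insert_document(conn, title, parse_track=None):
    """Hypothetical test helper; omitting parse_track mimics the old helpers."""
    if parse_track is None:
        conn.execute("INSERT INTO documents (title) VALUES (?)", (title,))
    else:
        conn.execute(
            "INSERT INTO documents (title, parse_track) VALUES (?, ?)",
            (title, parse_track),
        )


# Old helper behavior: no parse_track, so the NOT NULL constraint rejects the row.
try:
    insert_document(conn, "old helper")
    old_helper_failed = False
except sqlite3.IntegrityError:
    old_helper_failed = True

# Fixed helper behavior: always supply a value for the new column.
insert_document(conn, "fixed helper", parse_track="legacy")
rows = conn.execute("SELECT title, parse_track FROM documents").fetchall()
```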

The worker contract conftest clears shared.core.config and app.* modules from sys.modules between tests. Top-level imports of settings and service functions become stale references pointing at evicted module objects. Moving all imports inside each test function body ensures the monkeypatch target and the function under test share the same settings instance. Also fixes the parse_track NOT NULL test failures from the previous commit (verified: 174 passed, 0 failed locally with the exact CI test command).
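
The stale-reference problem can be illustrated with a throwaway module. In this sketch `fake_config` stands in for shared.core.config, a plain dict stands in for the settings object, and direct mutation stands in for monkeypatch; none of these names come from the repository.

```python
import importlib
import sys
import types


def install_fake_settings_module() -> None:
    """Register a fresh throwaway module under the name fake_config."""
    module = types.ModuleType("fake_config")
    module.settings = {"feature_enabled": False}
    sys.modules["fake_config"] = module


install_fake_settings_module()

# Equivalent of a top-level import: binds the module object that exists right now.
stale_settings = importlib.import_module("fake_config").settings

# Simulate the conftest evicting the module and a later import re-creating it.
del sys.modules["fake_config"]
install_fake_settings_module()


def run_test_body() -> bool:
    # Importing inside the function resolves against the current sys.modules,
    # so the patch target and the code under test share one settings object.
    fresh = importlib.import_module("fake_config")
    fresh.settings["feature_enabled"] = True  # stands in for monkeypatch.setattr
    return fresh.settings["feature_enabled"]


patched_seen_by_fresh = run_test_body()
patched_seen_by_stale = stale_settings["feature_enabled"]
```

The reference captured before the eviction never sees the patch, which is the situation the function-local imports avoid.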

EricNGOntos added 3 commits June 10, 2026 22:38

feat: add page memory parse track

8a6195e

EricNGOntos added the document-parsing label Jun 11, 2026

EricNGOntos self-assigned this Jun 11, 2026

github-advanced-security AI found potential problems Jun 11, 2026


EricNGOntos added 2 commits June 11, 2026 16:56

EricNGOntos merged commit e6ff72d into main Jun 11, 2026
6 checks passed

EricNGOntos merged 5 commits into main from page-based-pr2
