feat: page memory parse track, TOC refactor, and usage_task cost tracking #143

Merged
EricNGOntos merged 5 commits into main from page-based-pr2 on Jun 11, 2026

@EricNGOntos


Summary

This PR introduces three major improvements to the document parsing pipeline, plus a small set of supporting changes:

1. Page Memory Parse Track

  • Added a new page_memory service (memory_service.py, normalizer.py) that records per-page parsing metadata for future reuse and incremental re-parsing.
  • New DB migrations: add_doc_profile_to_document_page_plan and add_parse_track_to_documents.
  • Extended document_agent/budget.py with richer budget tracking and trace.py with structured trace events.
  • Updated coordinator.py and planner.py to integrate page memory lookups during document profiling.
  • New contract test: test_page_memory_parse_track_contract.py.
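As an illustration of what recording per-page parsing metadata for reuse could look like, here is a minimal in-memory sketch. It is not the contents of memory_service.py or normalizer.py: the names `PageMemory`, `PageMemoryEntry`, `normalize_page_text`, and `page_fingerprint`, the hashing approach, and the `"fast"` track label are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


def normalize_page_text(text: str) -> str:
    """Collapse whitespace and lowercase so trivial layout changes hash the same."""
    return " ".join(text.split()).lower()


def page_fingerprint(text: str) -> str:
    """Stable fingerprint of the normalized page text."""
    return hashlib.sha256(normalize_page_text(text).encode("utf-8")).hexdigest()


@dataclass
class PageMemoryEntry:
    page_number: int
    fingerprint: str
    parse_track: str  # hypothetical label for how the page was parsed
    metadata: dict = field(default_factory=dict)


class PageMemory:
    """Hypothetical in-memory store keyed by (document_id, page_number)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, int], PageMemoryEntry] = {}

    def record(self, document_id: str, page_number: int, text: str,
               parse_track: str, **metadata) -> PageMemoryEntry:
        entry = PageMemoryEntry(page_number, page_fingerprint(text), parse_track, metadata)
        self._entries[(document_id, page_number)] = entry
        return entry

    def lookup(self, document_id: str, page_number: int, text: str) -> Optional[PageMemoryEntry]:
        """Return the stored entry only if the page content is unchanged."""
        entry = self._entries.get((document_id, page_number))
        if entry is not None and entry.fingerprint == page_fingerprint(text):
            return entry
        return None


memory = PageMemory()
memory.record("doc-1", 1, "Table of   Contents", "fast", has_table=False)
unchanged_hit = memory.lookup("doc-1", 1, "table of contents")  # same after normalization
changed_hit = memory.lookup("doc-1", 1, "Chapter 1")            # content differs
```

Under these assumptions, an incremental re-parse would skip any page whose lookup returns an entry and re-run the parser only for pages where it returns `None`.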

2. TOC Handling Streamlining

  • Removed the deprecated PDF_PROFILE_TOC_ENABLED config flag from both API and Worker .env.example files and from config/storage.py.
  • Simplified TOC detection logic in atlas/parser.py and pdf/parser.py to always use the unified path.
  • Refactored find_toc_anchor_pages.py with improved anchor detection heuristics.
  • Extended test_doc_profile_anatomy_contract.py to cover the new TOC behaviors.

3. Usage Task Cost Tracking

  • Added usage_task parameter to all LLM calls across 14+ service files (document agent tools, parser modules, summary builder, etc.) to enable per-task token cost attribution.
  • Enhanced token_tracking.py with by_model and by_task bucketed tracking, plus get_current_token_tracker() for mid-pipeline snapshots.
  • New token_costing.py module (251 lines) that maps token usage to estimated USD costs using a configurable pricing table.
  • Refactored processing_run.py to own the tracker lifecycle at the top level, with parse_execution.py reusing the existing tracker rather than re-initializing.
  • Added _refresh_processing_stages() in success_finalization.py to persist cost + timing data into job metadata.
  • Integrated cost estimates into zip_manifest_schema.py for downstream visibility.
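The tracker ownership and bucketing described above can be sketched roughly as follows. Only `by_model`, `by_task`, `usage_task`, and `get_current_token_tracker()` come from this PR description; `TokenCounts`, `TokenTracker`, `start_token_tracking`, `get_or_start_token_tracking`, the use of `contextvars`, and the model and task names are assumptions for illustration, not the actual token_tracking.py.

```python
from collections import defaultdict
from contextvars import ContextVar
from dataclasses import dataclass, field
from typing import DefaultDict, Optional


@dataclass
class TokenCounts:
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def add(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens


def _bucket() -> DefaultDict[str, TokenCounts]:
    return defaultdict(TokenCounts)


@dataclass
class TokenTracker:
    by_model: DefaultDict[str, TokenCounts] = field(default_factory=_bucket)
    by_task: DefaultDict[str, TokenCounts] = field(default_factory=_bucket)

    def record(self, model: str, usage_task: str,
               prompt_tokens: int, completion_tokens: int) -> None:
        # Every LLM call is attributed to both its model and its task.
        self.by_model[model].add(prompt_tokens, completion_tokens)
        self.by_task[usage_task].add(prompt_tokens, completion_tokens)


_current_tracker = ContextVar("token_tracker", default=None)


def start_token_tracking() -> TokenTracker:
    """Called once by the top-level run, which owns the tracker lifecycle."""
    tracker = TokenTracker()
    _current_tracker.set(tracker)
    return tracker


def get_current_token_tracker() -> Optional[TokenTracker]:
    """Mid-pipeline snapshot access; None if no run is active."""
    return _current_tracker.get()


def get_or_start_token_tracking() -> TokenTracker:
    """Nested stages reuse the active tracker instead of re-initializing it."""
    tracker = get_current_token_tracker()
    return tracker if tracker is not None else start_token_tracking()


run_tracker = start_token_tracking()            # top-level run
nested_tracker = get_or_start_token_tracking()  # nested stage reuses it
nested_tracker.record("example-model", "toc_detection", 120, 30)
nested_tracker.record("example-model", "summary", 200, 50)
```

Because the nested stage reuses the existing tracker, usage recorded there lands in the same buckets that the top-level run later reads when it persists cost data.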

4. Supporting Changes

  • Added TOKEN_PRICING_TABLE_JSON config field in config/ai.py.
  • Extended dataframe_chunk_converter.py and publication_service.py for page-number propagation.
  • New contract test: test_document_agent_budget_contract.py.
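To make the idea of a configurable pricing table concrete, here is a hedged sketch of how a JSON value such as TOKEN_PRICING_TABLE_JSON might be parsed and applied. The JSON shape, the key names `input_per_1m_usd` and `output_per_1m_usd`, the model name, the helper functions, and the prices are all placeholders invented for this example; they are not the real schema of token_costing.py and not real vendor pricing.

```python
import json
from typing import Optional

# Placeholder value; the real TOKEN_PRICING_TABLE_JSON format is not shown in this PR.
EXAMPLE_PRICING_JSON = (
    '{"example-model": {"input_per_1m_usd": 1.0, "output_per_1m_usd": 4.0}}'
)


def load_pricing_table(raw: str) -> dict:
    """Parse the JSON config string; an empty string means no pricing is configured."""
    table = json.loads(raw) if raw else {}
    if not isinstance(table, dict):
        raise ValueError("pricing table must be a JSON object keyed by model name")
    return table


def estimate_cost_usd(table: dict, model: str,
                      prompt_tokens: int, completion_tokens: int) -> Optional[float]:
    """Return an estimated USD cost, or None when the model has no pricing entry."""
    prices = table.get(model)
    if prices is None:
        return None
    total = (prompt_tokens * prices["input_per_1m_usd"]
             + completion_tokens * prices["output_per_1m_usd"])
    return total / 1_000_000


pricing = load_pricing_table(EXAMPLE_PRICING_JSON)
known_cost = estimate_cost_usd(pricing, "example-model", 500_000, 250_000)
unknown_cost = estimate_cost_usd(pricing, "unlisted-model", 10, 10)
```

Returning `None` for unlisted models, rather than zero, keeps "no estimate available" distinguishable from "free" in downstream metadata; that is a design choice of this sketch, not something the PR states.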

Testing

  • make check (ruff lint + pyright typecheck) passes with 0 errors.
  • New contract tests added for page memory, budget, and anatomy features.

- Removed PDF_PROFILE_TOC_ENABLED from configuration and adjusted related logic in document profiling.
- Simplified TOC detection logic in PDF and Atlas parsers, ensuring consistent behavior across document types.
- Updated tests to reflect changes in TOC handling and ensure proper functionality without the deprecated flag.
- Introduced usage_task parameter in LLM calls across multiple services including summary_builder, react_loop, planner, and various document agent tools.
- Enhanced token tracking and stage profiling in document ingestion processes to improve performance monitoring and cost estimation.
- Updated relevant functions to utilize new token and stage tracking methods, ensuring accurate resource usage reporting.

Cherry-picked from feat/wuchengke/2026-06-11 (d5d7d042)
if doc is not None:
    try:
        doc.close()
    except Exception:

if self._profile_plan_row is not None:
    self._db.expunge(self._profile_plan_row)
    self._profile_plan_row = None
except Exception:
The page_memory migration added a NOT NULL parse_track column to the
documents table, but the contract test helpers (contract_database.py,
test_job_creation_contract.py, test_documents_contract.py) were still
inserting documents without it, causing all retrieval contract tests
to fail with NotNullViolationError.
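The failure mode above can be reproduced in miniature. This sketch uses the standard-library sqlite3 module as a stand-in for the project's database, so the error raised here is `sqlite3.IntegrityError` rather than NotNullViolationError. The table layout, both helper functions, and the `"legacy"` default value are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in schema: parse_track is NOT NULL with no default.
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, title TEXT NOT NULL, parse_track TEXT NOT NULL)"
)


def insert_document_old(connection: sqlite3.Connection, title: str) -> None:
    """Helper written before the migration: omits parse_track."""
    connection.execute("INSERT INTO documents (title) VALUES (?)", (title,))


def insert_document_fixed(connection: sqlite3.Connection, title: str,
                          parse_track: str = "legacy") -> None:
    """Updated helper: always supplies a parse_track value (hypothetical default)."""
    connection.execute(
        "INSERT INTO documents (title, parse_track) VALUES (?, ?)",
        (title, parse_track),
    )


try:
    insert_document_old(conn, "a.pdf")
    old_helper_failed = False
except sqlite3.IntegrityError:
    # NOT NULL constraint failed: documents.parse_track
    old_helper_failed = True

insert_document_fixed(conn, "a.pdf")
row_count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
```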
The worker contract conftest clears shared.core.config and app.*
modules from sys.modules between tests.  Top-level imports of settings
and service functions become stale references pointing at evicted
module objects.  Moving all imports inside each test function body
ensures the monkeypatch target and the function under test share the
same settings instance.

Also fixes the parse_track NOT NULL test failures from the previous
commit (verified: 174 passed, 0 failed locally with the exact CI
test command).
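The stale-reference problem described in this commit message can be shown in isolation. This is a minimal sketch, not the project's conftest: `demo_config`, `install_fake_config`, and `read_flag` are hypothetical names, and a synthetic module stands in for shared.core.config.

```python
import sys
import types


def install_fake_config(flag: bool) -> None:
    """Register a throwaway module named demo_config in sys.modules."""
    module = types.ModuleType("demo_config")
    module.settings = types.SimpleNamespace(feature_enabled=flag)
    sys.modules["demo_config"] = module


install_fake_config(False)

# Equivalent of a top-level import in a test file, bound once at collection time.
import demo_config
stale_settings = demo_config.settings

# Equivalent of the conftest evicting the module and a later re-import
# creating a brand-new module object with a new settings instance.
del sys.modules["demo_config"]
install_fake_config(False)


def read_flag() -> bool:
    """Function-local import: resolves against whatever is in sys.modules now."""
    from demo_config import settings
    return settings.feature_enabled


# Equivalent of monkeypatching the settings on the currently loaded module.
sys.modules["demo_config"].settings.feature_enabled = True

stale_sees_patch = stale_settings.feature_enabled  # still the evicted object
fresh_sees_patch = read_flag()                     # sees the patched object
```

The top-level reference keeps pointing at the evicted module's settings, so the patch never reaches it, while the import performed inside the function body picks up the patched instance.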
EricNGOntos merged commit e6ff72d into main on Jun 11, 2026
6 checks passed