feat: page memory parse track, TOC refactor, and usage_task cost tracking #143
Merged
- Removed PDF_PROFILE_TOC_ENABLED from configuration and adjusted related logic in document profiling.
- Simplified TOC detection logic in PDF and Atlas parsers, ensuring consistent behavior across document types.
- Updated tests to reflect changes in TOC handling and ensure proper functionality without the deprecated flag.
- Introduced usage_task parameter in LLM calls across multiple services including summary_builder, react_loop, planner, and various document agent tools.
- Enhanced token tracking and stage profiling in document ingestion processes to improve performance monitoring and cost estimation.
- Updated relevant functions to utilize new token and stage tracking methods, ensuring accurate resource usage reporting.
Cherry-picked from feat/wuchengke/2026-06-11 (d5d7d042)
    if doc is not None:
        try:
            doc.close()
        except Exception:

    if self._profile_plan_row is not None:
        self._db.expunge(self._profile_plan_row)
        self._profile_plan_row = None
    except Exception:
The page_memory migration added a NOT NULL parse_track column to the documents table, but the contract test helpers (contract_database.py, test_job_creation_contract.py, test_documents_contract.py) were still inserting documents without it, causing all retrieval contract tests to fail with NotNullViolationError.
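The failure mode described above can be reproduced in miniature. The following is only a sketch: it uses an in-memory SQLite table as a stand-in for the real documents table (so the error is sqlite3.IntegrityError rather than NotNullViolationError), and the helper names, the title column, and the "placeholder_track" value are hypothetical, not taken from the contract helpers.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a stand-in documents table with a NOT NULL parse_track column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, "
        "title TEXT NOT NULL, "
        "parse_track TEXT NOT NULL)"
    )
    return conn


def insert_document_old(conn: sqlite3.Connection, title: str) -> None:
    # Pre-migration style helper: omits parse_track, so the insert is rejected.
    conn.execute("INSERT INTO documents (title) VALUES (?)", (title,))


def insert_document_fixed(
    conn: sqlite3.Connection,
    title: str,
    parse_track: str = "placeholder_track",  # hypothetical value
) -> None:
    # Updated helper: always supplies the new column.
    conn.execute(
        "INSERT INTO documents (title, parse_track) VALUES (?, ?)",
        (title, parse_track),
    )
```

Calling insert_document_old against this table raises an integrity error, while insert_document_fixed succeeds, which mirrors why the contract helpers had to be updated alongside the migration.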
The worker contract conftest clears shared.core.config and app.* modules from sys.modules between tests. Top-level imports of settings and service functions become stale references pointing at evicted module objects. Moving all imports inside each test function body ensures the monkeypatch target and the function under test share the same settings instance. Also fixes the parse_track NOT NULL test failures from the previous commit (verified: 174 passed, 0 failed locally with the exact CI test command).
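The stale-reference problem can be illustrated with a small, self-contained sketch. It assumes a throwaway module named demo_settings_mod (hypothetical, written to a temporary directory) in place of shared.core.config, and a plain attribute assignment in place of monkeypatch.

```python
import importlib
import sys
import tempfile
from pathlib import Path

MODULE_SOURCE = "class Settings:\n    flag = False\n\nsettings = Settings()\n"


def demo_stale_reference() -> tuple[bool, bool]:
    """Return (early is fresh, early.flag) after evicting and re-importing."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "demo_settings_mod.py").write_text(MODULE_SOURCE)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            # Equivalent to a top-level import made before the conftest runs.
            early = importlib.import_module("demo_settings_mod").settings
            # What the conftest does between tests: evict the module.
            del sys.modules["demo_settings_mod"]
            # Equivalent to importing inside the test function body.
            fresh = importlib.import_module("demo_settings_mod").settings
            # Stand-in for monkeypatch.setattr on the freshly imported settings.
            fresh.flag = True
            return early is fresh, early.flag
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_settings_mod", None)
```

The early reference neither matches the fresh object nor sees the patched value, so code holding it would run against unpatched settings. Importing inside the test body avoids holding such a reference in the first place.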
Summary
This PR introduces three major improvements to the document parsing pipeline:
1. Page Memory Parse Track
- page_memory service (memory_service.py, normalizer.py) that records per-page parsing metadata for future reuse and incremental re-parsing.
- add_doc_profile_to_document_page_plan and add_parse_track_to_documents.
- document_agent/budget.py with richer budget tracking and trace.py with structured trace events.
- coordinator.py and planner.py to integrate page memory lookups during document profiling.
- test_page_memory_parse_track_contract.py.
2. TOC Handling Streamlining
- Removed PDF_PROFILE_TOC_ENABLED config flag from both API and Worker .env.example files and from config/storage.py.
- atlas/parser.py and pdf/parser.py to always use the unified path.
- find_toc_anchor_pages.py with improved anchor detection heuristics.
- test_doc_profile_anatomy_contract.py to cover the new TOC behaviors.
3. Usage Task Cost Tracking
- Added usage_task parameter to all LLM calls across 14+ service files (document agent tools, parser modules, summary builder, etc.) to enable per-task token cost attribution.
- token_tracking.py with by_model and by_task bucketed tracking, plus get_current_token_tracker() for mid-pipeline snapshots.
- token_costing.py module (251 lines) that maps token usage to estimated USD costs using a configurable pricing table.
- processing_run.py to own the tracker lifecycle at the top level, with parse_execution.py reusing the existing tracker rather than re-initializing.
- _refresh_processing_stages() in success_finalization.py to persist cost + timing data into job metadata.
- zip_manifest_schema.py for downstream visibility.
4. Supporting Changes
- TOKEN_PRICING_TABLE_JSON config field in config/ai.py.
- dataframe_chunk_converter.py and publication_service.py for page-number propagation.
- test_document_agent_budget_contract.py.
Testing
make check (ruff lint + pyright typecheck) passes with 0 errors.
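For readers unfamiliar with the approach in section 3, bucketed token tracking and cost estimation might look roughly like the sketch below. Only the names by_model, by_task, and usage_task come from the PR description. The TokenTracker class, the estimate_cost_usd function, the pricing-table schema, and the per-million rate keys are assumptions made for illustration and may differ from token_tracking.py and token_costing.py.

```python
import json
from collections import defaultdict


def _empty_bucket() -> dict:
    return {"input_tokens": 0, "output_tokens": 0}


class TokenTracker:
    """Hypothetical tracker that buckets usage by model and by usage_task."""

    def __init__(self) -> None:
        self.by_model = defaultdict(_empty_bucket)
        self.by_task = defaultdict(_empty_bucket)

    def record(
        self, model: str, usage_task: str, input_tokens: int, output_tokens: int
    ) -> None:
        # Each LLM call updates both views of the same usage.
        for bucket in (self.by_model[model], self.by_task[usage_task]):
            bucket["input_tokens"] += input_tokens
            bucket["output_tokens"] += output_tokens


def estimate_cost_usd(tracker: TokenTracker, pricing_table_json: str) -> float:
    """Map per-model usage to USD using an assumed JSON pricing schema."""
    pricing = json.loads(pricing_table_json)
    total = 0.0
    for model, usage in tracker.by_model.items():
        rates = pricing.get(model)
        if rates is None:
            continue  # assumption: unpriced models contribute nothing
        total += usage["input_tokens"] / 1_000_000 * rates["input_per_million"]
        total += usage["output_tokens"] / 1_000_000 * rates["output_per_million"]
    return total
```

Keeping a single tracker instance for the whole run, as the PR describes for processing_run.py, means every stage adds to the same buckets, so a cost estimate taken at the end covers all tasks.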