bug: page-priority selection in extract_key_sections is silently ignored #63

@SeanClay10

Description

In src/llm/llm_text.py (lines 290–305), a page-priority selection algorithm classifies pages, sorts them by priority, builds a ranked selected list, and tracks a character budget. The function then discards all of it by switching to a completely different section-based algorithm on line 306. None of the computed results are ever used.

Problematic Code

pages = split_into_pages(text)          # ← dead
scored: List[Tuple[int, int, str]] = []
for page_num, page_text in pages:
    useful, priority = classify_page(page_text)  # ← dead
    if useful:
        scored.append((priority, page_num, page_text))

scored.sort(key=lambda t: t[0])         # ← dead

selected: List[Tuple[int, str]] = []
budget = max_chars
for _priority, page_num, page_text in scored:
    page_with_marker = f"[PAGE {page_num}]\n{page_text}"
    if len(page_with_marker) <= budget:
        selected.append((page_num, page_with_marker))
        budget -= len(page_with_marker)
lines = text.split("\n")   # ← pivots to full original text; everything above is discarded

budget is then re-initialized at line 345, overwriting the value tracked above:

budget = max_chars - len(preamble_text)

What Goes Wrong

  • split_into_pages, classify_page, scored.sort, selected, and the first budget are all computed but never referenced after line 305.
  • The page-priority ranking has no effect on the final output.
  • The LLM receives sections chosen by paragraph-level keyword scoring alone. That is the correct behavior for this project, but the dead block above still runs on every call for nothing.
  • No error is raised. The pipeline runs normally and only wastes computation on the dead block.
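
A minimal sketch of how the claim above could be checked, assuming simplified stand-ins for the real helpers. The bodies of split_into_pages, classify_page, and section_based below are hypothetical placeholders, not the code in src/llm/llm_text.py; only the dead block itself is copied from this issue. The point is that the return value is identical with or without the block, while classify_page still runs once per page.

```python
from typing import List, Tuple

calls = {"classify_page": 0}  # counts work done by the dead block


def split_into_pages(text: str) -> List[Tuple[int, str]]:
    # Hypothetical stand-in: treat form feeds as page breaks.
    return list(enumerate(text.split("\f"), start=1))


def classify_page(page_text: str) -> Tuple[bool, int]:
    # Hypothetical stand-in: every non-empty page is "useful", priority 0.
    calls["classify_page"] += 1
    return bool(page_text.strip()), 0


def section_based(text: str, max_chars: int) -> str:
    # Hypothetical stand-in for the section-based algorithm below line 306.
    lines = text.split("\n")
    return "\n".join(lines)[:max_chars]


def with_dead_block(text: str, max_chars: int) -> str:
    pages = split_into_pages(text)
    scored: List[Tuple[int, int, str]] = []
    for page_num, page_text in pages:
        useful, priority = classify_page(page_text)
        if useful:
            scored.append((priority, page_num, page_text))
    scored.sort(key=lambda t: t[0])
    selected: List[Tuple[int, str]] = []
    budget = max_chars
    for _priority, page_num, page_text in scored:
        page_with_marker = f"[PAGE {page_num}]\n{page_text}"
        if len(page_with_marker) <= budget:
            selected.append((page_num, page_with_marker))
            budget -= len(page_with_marker)
    # Nothing computed above feeds into the return value.
    return section_based(text, max_chars)


def without_dead_block(text: str, max_chars: int) -> str:
    return section_based(text, max_chars)


sample = "Methods\nn = 42 stomachs\fResults\nprey counts"
same_output = with_dead_block(sample, 30) == without_dead_block(sample, 30)
```

With the two-page sample, same_output is True and classify_page has been called twice, which is the wasted work described in the bullets above.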

Fix

Remove lines 290–305 entirely. The section-based keyword paragraph approach (Phase 1 + Phase 2 below line 306) is the correct algorithm for this project: it scores individual paragraphs by extraction-relevant keywords, guaranteeing that data like stomach counts and sample sizes are included regardless of which page or section they appear in. The page-priority approach is coarser (whole-page granularity) and would waste the character budget on surrounding irrelevant content.
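
For readers unfamiliar with the approach being kept, here is a rough sketch of paragraph-level keyword selection under a character budget. It is an illustration only: KEYWORDS, score_paragraph, and select_paragraphs are hypothetical names, and the real Phase 1 + Phase 2 code in this repository may differ in how it splits, scores, and orders paragraphs.

```python
from typing import List, Tuple

# Hypothetical keyword list; the project's real keywords are not shown in this issue.
KEYWORDS = ("stomach", "sample size", "n =")


def score_paragraph(paragraph: str) -> int:
    # Count keyword occurrences, case-insensitively.
    lowered = paragraph.lower()
    return sum(lowered.count(keyword) for keyword in KEYWORDS)


def select_paragraphs(text: str, max_chars: int) -> str:
    # Split on blank lines and drop empty paragraphs.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    # Highest-scoring paragraphs first; ties keep document order (stable sort).
    ranked = sorted(enumerate(paragraphs), key=lambda item: -score_paragraph(item[1]))
    budget = max_chars
    chosen: List[Tuple[int, str]] = []
    for index, paragraph in ranked:
        if score_paragraph(paragraph) == 0:
            break  # remaining paragraphs contain no keywords
        cost = len(paragraph) + 2  # reserve room for the "\n\n" joiner
        if cost <= budget:
            chosen.append((index, paragraph))
            budget -= cost
    chosen.sort()  # restore original document order
    return "\n\n".join(paragraph for _, paragraph in chosen)


example = (
    "Intro about the study area and its long history.\n\n"
    "We examined 120 stomachs; sample size per site was 30.\n\n"
    "Acknowledgements."
)
picked = select_paragraphs(example, 80)
```

In this example only the middle paragraph matches any keyword, so it is the only one selected, regardless of which page it would have landed on. A whole-page selector would have had to spend budget on the surrounding text as well.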

Metadata

Labels: bug
