Description
In src/llm/llm_text.py (lines 290–305), a page-priority selection algorithm classifies pages, sorts them by priority, builds a ranked selected list, and tracks a budget, then immediately discards all of it by pivoting to a completely different section-based algorithm on line 306. None of the computed results are ever used.
Problematic Code
pages = split_into_pages(text)  # ← dead
scored: List[Tuple[int, int, str]] = []
for page_num, page_text in pages:
    useful, priority = classify_page(page_text)  # ← dead
    if useful:
        scored.append((priority, page_num, page_text))
scored.sort(key=lambda t: t[0])  # ← dead
selected: List[Tuple[int, str]] = []
budget = max_chars
for _priority, page_num, page_text in scored:
    page_with_marker = f"[PAGE {page_num}]\n{page_text}"
    if len(page_with_marker) <= budget:
        selected.append((page_num, page_with_marker))
        budget -= len(page_with_marker)
lines = text.split("\n")  # ← pivots to full original text; everything above is discarded
budget is then re-initialized at line 345:
budget = max_chars - len(preamble_text)
What Goes Wrong
The results of split_into_pages and classify_page, the sorted scored list, selected, and the first budget are all computed but never referenced after line 305.
- The page-priority ranking has no effect on the final output.
- The LLM receives sections chosen by paragraph-level keyword scoring alone. That is the correct behavior for this project, but the dead block above still runs on every call for nothing.
- No error is raised; the pipeline runs normally but wastes computation on the dead block.
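One way to double-check the "never referenced" claim before deleting anything is to scan the module for reads of the suspect names after the block ends. The sketch below is only an illustration: loads_after_line is a hypothetical helper, not something in the project, and the toy source stands in for the real function. Note that budget would need separate handling in the real file, since it is legitimately reassigned and read again later.

```python
import ast
from typing import Dict, Iterable, List


def loads_after_line(source: str, names: Iterable[str], cutoff: int) -> Dict[str, List[int]]:
    """Map each name to the line numbers where it is read after `cutoff`.

    Hypothetical helper for illustration; not part of the project.
    """
    wanted = set(names)
    hits: Dict[str, List[int]] = {name: [] for name in wanted}
    for node in ast.walk(ast.parse(source)):
        # ast.Load marks a read; assignments use ast.Store and are ignored.
        if (
            isinstance(node, ast.Name)
            and isinstance(node.ctx, ast.Load)
            and node.id in wanted
            and node.lineno > cutoff
        ):
            hits[node.id].append(node.lineno)
    return hits


# Toy stand-in for the real function: `pages` and `selected` are built on
# lines 2-5 and never read again; `lines` is read on line 7.
TOY_SOURCE = """\
def build(text, max_chars):
    pages = text.split("|")
    selected = []
    for page in pages:
        selected.append(page)
    lines = text.splitlines()
    return lines[:max_chars]
"""

result = loads_after_line(TOY_SOURCE, ["pages", "selected", "lines"], cutoff=5)
```

An empty list for a name means nothing reads it past the cutoff, which is the condition that makes removing the block safe. This only inspects plain name reads in one file, so it would not catch side effects inside split_into_pages or classify_page.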
Fix
Remove lines 290–305 entirely. The section-based keyword paragraph approach (Phase 1 + Phase 2 below line 306) is the correct algorithm for this project: it scores individual paragraphs by extraction-relevant keywords, guaranteeing that data like stomach counts and sample sizes are included regardless of which page or section they appear in. The page-priority approach is coarser (whole-page granularity) and would waste the character budget on surrounding irrelevant content.
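For readers unfamiliar with the approach being kept, here is a minimal sketch of paragraph-level keyword selection under a character budget. It is not the project's Phase 1 + Phase 2 code: the KEYWORDS tuple, score_paragraph, and select_paragraphs are all assumed names made up for illustration, and the real scoring rules may differ.

```python
from typing import List, Sequence, Tuple

# Hypothetical keyword list; the project's real extraction keywords may differ.
KEYWORDS: Tuple[str, ...] = ("stomach", "sample size")


def score_paragraph(paragraph: str, keywords: Sequence[str] = KEYWORDS) -> int:
    """Count case-insensitive keyword occurrences in one paragraph."""
    lowered = paragraph.lower()
    return sum(lowered.count(keyword) for keyword in keywords)


def select_paragraphs(
    text: str, max_chars: int, keywords: Sequence[str] = KEYWORDS
) -> List[str]:
    """Pick the highest-scoring paragraphs that fit within max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # Rank by score, highest first; sorted() is stable, so ties keep document order.
    ranked = sorted(
        enumerate(paragraphs),
        key=lambda item: -score_paragraph(item[1], keywords),
    )
    budget = max_chars
    chosen: List[int] = []
    for index, paragraph in ranked:
        if score_paragraph(paragraph, keywords) == 0:
            break  # everything after this point also scores zero
        if len(paragraph) <= budget:
            chosen.append(index)
            budget -= len(paragraph)
    # Return the chosen paragraphs in their original document order.
    return [paragraphs[i] for i in sorted(chosen)]


sample = (
    "Intro paragraph about methods.\n\n"
    "We examined 120 stomachs; sample size was 120.\n\n"
    "Acknowledgements and funding."
)
picked = select_paragraphs(sample, max_chars=200)
```

Because the unit of selection is a single paragraph, a relevant sentence can be picked up without spending budget on the rest of its page, which is the granularity argument made above.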