Description
In src/llm/llm_client.py (lines 256–263), extracted numeric field values are stringified and searched in the full document text using .find(), which returns the first occurrence — not necessarily the one corresponding to the actual data field.
Problematic Code
value_str = str(value)
if value_str in original_text:
    pos = original_text.find(value_str)
    page_markers = re.findall(r'\[PAGE (\d+)\]', original_text[:pos])
    if page_markers:
        source_pages.add(int(page_markers[-1]))
What Goes Wrong
For numeric fields like sample_size = 5, num_empty_stomachs = 3, etc., the string "5" is searched across the entire document. If "5" appears first in e.g. "Figure 5" or inside a citation year such as "2015", .find() returns that position, and the source page is attributed to the wrong location in the paper. No error is raised; the result JSON silently contains incorrect source_pages provenance data.
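The following minimal reproduction illustrates the misattribution. The document text is invented for illustration and only assumes the [PAGE n] marker format used in the snippet above; the variable attributed_page is a hypothetical name, not one from llm_client.py.

```python
import re

# Hypothetical document: the real value ("5 stomachs") is on page 3,
# but "Figure 5" appears earlier, on page 1.
original_text = (
    "[PAGE 1] Introduction. See Figure 5 for the study area. "
    "[PAGE 2] Methods. "
    "[PAGE 3] Results. A total of 5 stomachs were examined."
)

value_str = str(5)

# Same logic as the problematic code: first raw occurrence wins.
pos = original_text.find(value_str)
page_markers = re.findall(r'\[PAGE (\d+)\]', original_text[:pos])
attributed_page = int(page_markers[-1]) if page_markers else None

print(attributed_page)  # 1, although the value is reported on page 3
```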
Expected Behavior
Source page attribution should only match the field value in a meaningful context (e.g. surrounded by word boundaries, or within a relevant section), not the first raw string occurrence in the document.
Suggested Fix
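Use word-boundary regex matching (e.g. r'\b5\b') instead of plain string find(), and/or restrict the search to already-identified relevant sections rather than the full document text. Word boundaries alone rule out substring hits such as the "5" inside "2015", but not standalone tokens such as "Figure 5", so the two measures work best in combination.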
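One possible shape for the word-boundary part of the fix is sketched below. It is not the project's implementation: the function name candidate_pages and the sample text are assumptions made for illustration. It uses lookarounds in place of \b so that values such as "-5" also work, splits the text on the [PAGE n] markers so that the digits inside a marker are never matched, and returns every page with a standalone match instead of only the first one.

```python
import re

PAGE_SPLIT = re.compile(r'\[PAGE (\d+)\]')

def candidate_pages(value, original_text):
    """Return all pages on which `value` occurs as a standalone token."""
    # (?<!\w) and (?!\w) act like \b but do not require the value
    # itself to start or end with a word character.
    token = re.compile(r'(?<!\w)' + re.escape(str(value)) + r'(?!\w)')
    # With one capture group, re.split yields
    # [preamble, page_no, page_text, page_no, page_text, ...].
    parts = PAGE_SPLIT.split(original_text)
    pages = set()
    for page_no, page_text in zip(parts[1::2], parts[2::2]):
        if token.search(page_text):
            pages.add(int(page_no))
    return pages

# Hypothetical document text, for illustration only.
text = (
    "[PAGE 1] Published 2015. "
    "[PAGE 2] Methods. See Figure 5. "
    "[PAGE 3] Results. A total of 5 stomachs were examined."
)
print(candidate_pages(5, text))  # pages 2 and 3; the "5" in "2015" is skipped
```

Page 2 is still reported because "Figure 5" is a standalone token, which is why passing only the already-identified relevant section as original_text (or filtering the candidates afterwards) would still be needed.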