bug: source page attribution uses naive string search, causing false-positive page matches #62

@SeanClay10

Description

In src/llm/llm_client.py (lines 256–263), each extracted numeric field value is converted to a string and located in the full document text with .find(). Because .find() returns the first occurrence, the match is not necessarily the one that corresponds to the actual data field.

Problematic Code

value_str = str(value)
if value_str in original_text:
    pos = original_text.find(value_str)
    page_markers = re.findall(r'\[PAGE (\d+)\]', original_text[:pos])
    if page_markers:
        source_pages.add(int(page_markers[-1]))

What Goes Wrong

For numeric fields such as sample_size = 5 or num_empty_stomachs = 3, the stringified value (e.g. "5") is searched across the entire document. If "5" first appears in "Figure 5" or inside a citation year, .find() returns that position, and the source page is attributed to the wrong location in the paper. No error is raised; the result JSON silently contains incorrect source_pages provenance data.
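
A minimal reproduction of the behavior described above. The sample text and the helper name first_match_page are hypothetical; only the lookup logic mirrors the snippet quoted from llm_client.py, and the [PAGE n] marker format is taken from its regex.

```python
import re

# Hypothetical document text; the real value "5" (sample size) is on page 3.
original_text = (
    "[PAGE 1] Introduction. See Figure 5 for the sampling sites.\n"
    "[PAGE 2] Methods. Stomachs were collected in 2015.\n"
    "[PAGE 3] Results. Sample size: 5 individuals; 3 stomachs were empty.\n"
)


def first_match_page(value, text):
    """Mirror the quoted logic: return the page of the first raw occurrence."""
    value_str = str(value)
    pos = text.find(value_str)
    if pos == -1:
        return None
    # Last page marker that appears before the match position.
    page_markers = re.findall(r"\[PAGE (\d+)\]", text[:pos])
    return int(page_markers[-1]) if page_markers else None


# The first "5" is the one in "Figure 5", so page 1 is reported instead of page 3.
print(first_match_page(5, original_text))  # -> 1
```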

Expected Behavior

Source page attribution should only match the field value in a meaningful context (e.g. surrounded by word boundaries, or within a relevant section), not the first raw string occurrence in the document.

Suggested Fix

Use word-boundary regex matching (e.g. r'\b5\b') instead of plain string find(), and/or restrict the search to already-identified relevant sections rather than the full document text.
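
One possible shape for the word-boundary approach, as a sketch rather than a drop-in patch. The function name candidate_pages and the sample text are hypothetical. Instead of a bare \b, it uses lookarounds (a slightly stricter variant, so that "5" does not match inside "2.5" or "5.2"), and it splits the text on the page markers so the markers' own digits are never matched.

```python
import re

PAGE_SPLIT = re.compile(r"\[PAGE (\d+)\]")


def candidate_pages(value, text):
    """Return sorted page numbers where `value` appears as a standalone token."""
    token = re.escape(str(value))
    # Not preceded by a word char or '.', not followed by a word char or '.<digit>'.
    pattern = re.compile(rf"(?<![\w.]){token}(?!\w|\.\d)")
    # Splitting on a capturing group yields:
    # [text before first marker, "1", page 1 body, "2", page 2 body, ...]
    parts = PAGE_SPLIT.split(text)
    pages = set()
    for page_num, body in zip(parts[1::2], parts[2::2]):
        if pattern.search(body):
            pages.add(int(page_num))
    return sorted(pages)


text = (
    "[PAGE 1] Introduction. See Figure 5 for the sampling sites.\n"
    "[PAGE 2] Methods. Stomachs were collected in 2015.\n"
    "[PAGE 3] Results. Sample size: 5 individuals; 3 stomachs were empty.\n"
)
print(candidate_pages(5, text))  # -> [1, 3]
print(candidate_pages(3, text))  # -> [3]
```

In this sample the boundary check drops the match inside "2015", but "Figure 5" on page 1 is still a standalone token and still matches. Boundary matching alone therefore narrows the candidates without resolving them, which is why restricting the search to already-identified relevant sections may still be needed.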

Metadata

Assignees: none
Labels: bug