feat(examples): page-by-page entity extraction with multi-model comparison #35
Open
ana-daniele wants to merge 2 commits into
Adds example_07_page_entity_extraction.py, a standalone script that extracts MODEL/DATASET/KPI entities from PDFs page by page using any LM Studio-served model.

Key design choices vs the per-chunk enricher approach:
- One API call per page (not per text item): ~100x fewer calls, practical for multi-paper/multi-model comparison runs
- Text serialised as plain text; tables as HTML for structure fidelity
- Mention verification: checks each extracted mention against the raw page text to flag hallucinations
- Consolidated CSV output with per-document-per-model aggregate stats
- Disk cache for Docling conversions; resumable (skips completed papers)
- CLI flags: --papers, --out, --url, --test

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
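The mention-verification step listed above can be sketched roughly as follows. This is a minimal illustration and not the script's actual code: the function name `verify_mentions`, the `label`/`mention`/`verified` field names, and the choice to normalise case and whitespace before matching are all assumptions made for the example.

```python
import re


def _normalise(text: str) -> str:
    """Lowercase and collapse whitespace so PDF line breaks do not cause false flags."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_mentions(page_text: str, entities: list[dict]) -> list[dict]:
    """Return a copy of each entity with a boolean 'verified' flag.

    An entity counts as verified when its mention occurs in the raw page
    text after normalisation; anything else is flagged as a possible
    hallucination. Field names here are hypothetical.
    """
    haystack = _normalise(page_text)
    checked = []
    for entity in entities:
        mention = _normalise(entity.get("mention", ""))
        found = bool(mention) and mention in haystack
        checked.append({**entity, "verified": found})
    return checked


# Toy page text and model output, invented for the example.
page = "We fine-tune BERT on SQuAD\nand report F1 score."
extracted = [
    {"label": "MODEL", "mention": "BERT"},
    {"label": "DATASET", "mention": "SQuAD"},
    {"label": "KPI", "mention": "F1  score"},
    {"label": "MODEL", "mention": "GPT-4"},
]
results = verify_mentions(page, extracted)
for r in results:
    print(r["label"], r["mention"], "ok" if r["verified"] else "possible hallucination")
```

A plain substring check is cheap but strict: a model that paraphrases or abbreviates a mention will be flagged even when the entity is real, so a flagged mention is best read as "not found verbatim" instead of as proof of a hallucination.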
Contributor
❌ DCO Check Failed

Hi @ana-daniele, your pull request has failed the Developer Certificate of Origin (DCO) check. This repository supports remediation commits, so you can fix this without rewriting history, but you must follow the required message format.

🛠 Quick Fix: Add a remediation commit

Run this command:

    git commit --allow-empty -s -m "DCO Remediation Commit for Ana Daniele <ana.daniele@ibm.com>
    I, Ana Daniele <ana.daniele@ibm.com>, hereby add my Signed-off-by to this commit: be822be81711b660bc02a704036f532a72854b09
    I, Ana Daniele <ana.daniele@ibm.com>, hereby add my Signed-off-by to this commit: 604efda4599e0e713eb4e6dc8151f7593bc70d9c"
    git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

    git commit --amend --signoff
    git push --force-with-lease

For multiple commits:

    git rebase --signoff origin/main
    git push --force-with-lease

More info: DCO check report
Contributor
Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Summary

Adds `examples/example_07_page_entity_extraction.py`, a standalone script for extracting `MODEL`, `DATASET`, and `KPI` entities from PDFs page by page, comparing multiple LM Studio-served models.

This complements the existing per-chunk enricher in `DoclingEnrichingAgent` and is designed for research/evaluation use cases where you want to run several models side-by-side on a document corpus.

Key design choices:

- Consolidated CSV output with per-document-per-model aggregate stats (`total_entities`, `hallucination_rate_pct`, `total_time_s`)
- CLI flags: `--papers`, `--out`, `--url`, `--test` (single-paper smoke test)

Usage

    python examples/example_07_page_entity_extraction.py --papers ./papers --out ./runs

    # smoke test on first PDF only
    python examples/example_07_page_entity_extraction.py --papers ./papers --test

Test plan

- `--test` flag on a single PDF
- `lms` CLI works (or graceful skip if CLI not present)

🤖 Generated with Claude Code
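For reviewers who want a concrete picture of the aggregate columns named in the summary, the sketch below shows one plausible way to roll per-page results up into one row per document and model. Only the column names `total_entities`, `hallucination_rate_pct`, and `total_time_s` come from this PR; the input field names (`document`, `model`, `n_entities`, `n_unverified`, `time_s`), the `aggregate` helper, and the definition of the rate as unverified mentions over total mentions are assumptions for illustration.

```python
from collections import defaultdict


def aggregate(rows: list[dict]) -> list[dict]:
    """Collapse per-page rows into one summary row per (document, model)."""
    groups = defaultdict(lambda: {"total": 0, "unverified": 0, "time": 0.0})
    for row in rows:
        g = groups[(row["document"], row["model"])]
        g["total"] += row["n_entities"]
        g["unverified"] += row["n_unverified"]
        g["time"] += row["time_s"]

    summary = []
    for (doc, model), g in sorted(groups.items()):
        # Assumed definition: share of mentions not found in the page text.
        # Guard against pages where the model extracted nothing.
        rate = 100.0 * g["unverified"] / g["total"] if g["total"] else 0.0
        summary.append({
            "document": doc,
            "model": model,
            "total_entities": g["total"],
            "hallucination_rate_pct": round(rate, 1),
            "total_time_s": round(g["time"], 2),
        })
    return summary


# Invented per-page results for two hypothetical models on one paper.
pages = [
    {"document": "a.pdf", "model": "m1", "n_entities": 8, "n_unverified": 1, "time_s": 2.5},
    {"document": "a.pdf", "model": "m1", "n_entities": 2, "n_unverified": 1, "time_s": 1.5},
    {"document": "a.pdf", "model": "m2", "n_entities": 0, "n_unverified": 0, "time_s": 1.0},
]
summary = aggregate(pages)
for row in summary:
    print(row)
```

Summing counts before dividing weights every mention equally, so a page with many entities influences the rate more than a sparse page; averaging per-page rates would be a different, equally defensible choice.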