feat(examples): page-by-page entity extraction with multi-model comparison#35

Open
ana-daniele wants to merge 2 commits into
docling-project:mainfrom
ana-daniele:feat/page-entity-extraction-example

@ana-daniele

Contributor

Summary

Adds examples/example_07_page_entity_extraction.py — a standalone script for extracting MODEL, DATASET, and KPI entities from PDFs page by page, comparing multiple LM Studio-served models.

This complements the existing per-chunk enricher in DoclingEnrichingAgent and is designed for research/evaluation use cases where you want to run several models side-by-side on a document corpus.

Key design choices:

  • One API call per page (not per text item) — ~100× fewer calls vs the per-chunk enricher; practical for multi-paper/multi-model batch runs
  • Mixed serialisation: text items as plain text, tables as HTML (better structure fidelity for table-heavy papers)
  • Mention verification: each extracted mention is checked against the raw page text to flag likely hallucinations
  • Consolidated CSV with per-document-per-model aggregate stats (total_entities, hallucination_rate_pct, total_time_s)
  • Resumable: skips papers whose output JSON already exists; Docling conversions are cached on disk
  • CLI flags: --papers, --out, --url, --test (single-paper smoke test)
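The mention verification step and the `hallucination_rate_pct` column can be illustrated with a minimal sketch. This is not the code from the script: the function names, the whitespace and case normalisation, and the substring-matching rule are all assumptions chosen for illustration.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so PDF line breaks do not cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_mentions(mentions: list[str], page_text: str) -> list[dict]:
    """Mark each mention as verified if it occurs in the raw page text.

    Hypothetical helper: a plain substring test after normalisation.
    """
    haystack = normalize(page_text)
    results = []
    for mention in mentions:
        needle = normalize(mention)
        # An empty mention is never counted as verified.
        verified = bool(needle) and needle in haystack
        results.append({"mention": mention, "verified": verified})
    return results


def hallucination_rate_pct(results: list[dict]) -> float:
    """Percentage of mentions that were not found in the page text."""
    if not results:
        return 0.0
    unverified = sum(1 for r in results if not r["verified"])
    return round(100.0 * unverified / len(results), 2)


if __name__ == "__main__":
    page = "We fine-tune BERT on\nSQuAD and report exact\nmatch and F1."
    found = verify_mentions(["BERT", "SQuAD", "exact match", "ImageNet"], page)
    for row in found:
        print(row)
    print("hallucination_rate_pct:", hallucination_rate_pct(found))
```

A substring test is deliberately lenient: a short mention such as "BERT" would also match inside "RoBERTa", so a stricter variant might compare on token boundaries instead.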

Usage

python examples/example_07_page_entity_extraction.py --papers ./papers --out ./runs
# smoke test on first PDF only
python examples/example_07_page_entity_extraction.py --papers ./papers --test
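For readers who want to see how the four flags and the resumable behaviour might fit together, here is a rough sketch using `argparse`. Only the flag names come from this PR. The defaults (including the LM Studio URL `http://localhost:1234/v1` and `./runs`), the helper names, and the `<stem>.json` output naming are assumptions, not the script's actual behaviour.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Parser with the flags listed in the PR; all defaults are assumed."""
    parser = argparse.ArgumentParser(description="Page-by-page entity extraction")
    parser.add_argument("--papers", type=Path, required=True,
                        help="directory containing the input PDFs")
    parser.add_argument("--out", type=Path, default=Path("./runs"),
                        help="output directory (assumed default)")
    parser.add_argument("--url", default="http://localhost:1234/v1",
                        help="base URL of the LM Studio server (assumed default)")
    parser.add_argument("--test", action="store_true",
                        help="smoke test: process only the first PDF")
    return parser


def select_pdfs(papers_dir: Path, test: bool) -> list[Path]:
    """Return all PDFs in sorted order, or only the first one in test mode."""
    pdfs = sorted(papers_dir.glob("*.pdf"))
    return pdfs[:1] if test else pdfs


def pending_papers(pdfs: list[Path], out_dir: Path) -> list[Path]:
    """Skip papers whose output JSON already exists (hypothetical naming scheme)."""
    return [p for p in pdfs if not (out_dir / f"{p.stem}.json").exists()]


if __name__ == "__main__":
    args = build_parser().parse_args()
    todo = pending_papers(select_pdfs(args.papers, args.test), args.out)
    print(f"{len(todo)} paper(s) to process against {args.url}")
```

Checking for the output file before doing any work is what makes a rerun cheap: an interrupted batch resumes at the first paper without a JSON result.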

Test plan

  • Smoke test with --test flag on a single PDF
  • Verify CSV columns and empty-page rows are correct
  • Verify HTML output renders correctly
  • Confirm model switch via lms CLI works (or graceful skip if CLI not present)

🤖 Generated with Claude Code

Adds example_07_page_entity_extraction.py — a standalone script that
extracts MODEL/DATASET/KPI entities from PDFs page by page using any
LM Studio-served model.

Key design choices vs the per-chunk enricher approach:
- One API call per page (not per text item) — ~100x fewer calls,
  practical for multi-paper/multi-model comparison runs
- Text serialised as plain text; tables as HTML for structure fidelity
- Mention verification: checks each extracted mention against the raw
  page text to flag hallucinations
- Consolidated CSV output with per-document-per-model aggregate stats
- Disk cache for Docling conversions; resumable (skips completed papers)
- CLI flags: --papers, --out, --url, --test

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 6, 2026

Contributor

DCO Check Failed

Hi @ana-daniele, your pull request has failed the Developer Certificate of Origin (DCO) check.

This repository supports remediation commits, so you can fix this without rewriting history — but you must follow the required message format.


🛠 Quick Fix: Add a remediation commit

Run this command:

git commit --allow-empty -s -m "DCO Remediation Commit for Ana Daniele <ana.daniele@ibm.com>

I, Ana Daniele <ana.daniele@ibm.com>, hereby add my Signed-off-by to this commit: be822be81711b660bc02a704036f532a72854b09
I, Ana Daniele <ana.daniele@ibm.com>, hereby add my Signed-off-by to this commit: 604efda4599e0e713eb4e6dc8151f7593bc70d9c"
git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

git commit --amend --signoff
git push --force-with-lease

For multiple commits:

git rebase --signoff origin/main
git push --force-with-lease

More info: DCO check report

@mergify

mergify Bot commented Jun 6, 2026

Contributor

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
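To check a title locally before pushing, the pattern above can be applied with Python's `re` module. This is an illustrative sketch: the function name is hypothetical, and it assumes that Mergify's `~=` operator behaves like a regular-expression search, which is anchored here anyway by the leading `^`.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


if __name__ == "__main__":
    titles = [
        "feat(examples): page-by-page entity extraction with multi-model comparison",
        "fix!: drop deprecated flag",
        "Add page entity extraction example",
    ]
    for title in titles:
        print(is_conventional(title), "-", title)
```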
