feat(examples): page-by-page entity extraction with multi-model comparison by ana-daniele · Pull Request #35 · docling-project/docling-agent

ana-daniele · 2026-06-06T10:21:18Z

Summary

Adds examples/example_07_page_entity_extraction.py — a standalone script for extracting MODEL, DATASET, and KPI entities from PDFs page by page, comparing multiple LM Studio-served models.

This complements the existing per-chunk enricher in DoclingEnrichingAgent and is designed for research/evaluation use cases where you want to run several models side-by-side on a document corpus.

Key design choices:

- One API call per page (not per text item): ~100× fewer calls than the per-chunk enricher; practical for multi-paper/multi-model batch runs
- Mixed serialisation: text items as plain text, tables as HTML (better structure fidelity for table-heavy papers)
- Mention verification: each extracted mention is checked against the raw page text to flag likely hallucinations
- Consolidated CSV with per-document-per-model aggregate stats (total_entities, hallucination_rate_pct, total_time_s)
- Resumable: skips papers whose output JSON already exists; Docling conversions are cached on disk
- CLI flags: --papers, --out, --url, --test (single-paper smoke test)
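
The mention-verification idea and the hallucination_rate_pct column can be illustrated roughly as follows. This is a minimal sketch, not the code in example_07: the function names, the whitespace/case normalisation, and the sample page text are all assumptions made for illustration, and the actual script may match mentions differently.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_mentions(mentions: list[str], page_text: str) -> list[dict]:
    """Mark each mention as verified if it occurs in the raw page text (hypothetical helper)."""
    haystack = normalize(page_text)
    results = []
    for mention in mentions:
        needle = normalize(mention)
        # An empty mention is never treated as verified.
        results.append({"mention": mention, "verified": bool(needle) and needle in haystack})
    return results


def hallucination_rate_pct(results: list[dict]) -> float:
    """Share of unverified mentions, as a percentage rounded to two decimals."""
    if not results:
        return 0.0
    flagged = sum(1 for r in results if not r["verified"])
    return round(100.0 * flagged / len(results), 2)


# Illustrative data only: "ImageNet" does not appear on the page, so it is flagged.
page_text = "We fine-tune BERT on SQuAD\nand report an F1 of 88.5."
results = verify_mentions(["BERT", "SQuAD", "ImageNet"], page_text)
print(results)
print(hallucination_rate_pct(results))
```

A plain substring check like this is cheap enough to run on every page, but it will flag paraphrased or abbreviated mentions as hallucinations, so the rate is best read as an upper bound.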

Usage

python examples/example_07_page_entity_extraction.py --papers ./papers --out ./runs
# smoke test on first PDF only
python examples/example_07_page_entity_extraction.py --papers ./papers --test

Test plan

- Smoke test with --test flag on a single PDF
- Verify CSV columns and empty-page rows are correct
- Verify HTML output renders correctly
- Confirm model switch via lms CLI works (or graceful skip if CLI not present)

🤖 Generated with Claude Code

Adds example_07_page_entity_extraction.py, a standalone script that extracts MODEL/DATASET/KPI entities from PDFs page by page using any LM Studio-served model.

Key design choices vs the per-chunk enricher approach:

- One API call per page (not per text item): ~100x fewer calls, practical for multi-paper/multi-model comparison runs
- Text serialised as plain text; tables as HTML for structure fidelity
- Mention verification: checks each extracted mention against the raw page text to flag hallucinations
- Consolidated CSV output with per-document-per-model aggregate stats
- Disk cache for Docling conversions; resumable (skips completed papers)
- CLI flags: --papers, --out, --url, --test

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-06T10:21:28Z

❌ DCO Check Failed

Hi @ana-daniele, your pull request has failed the Developer Certificate of Origin (DCO) check.

This repository supports remediation commits, so you can fix this without rewriting history — but you must follow the required message format.

🛠 Quick Fix: Add a remediation commit

Run this command:

git commit --allow-empty -s -m "DCO Remediation Commit for Ana Daniele <ana.daniele@ibm.com>

I, Ana Daniele <ana.daniele@ibm.com>, hereby add my Signed-off-by to this commit: be822be81711b660bc02a704036f532a72854b09
I, Ana Daniele <ana.daniele@ibm.com>, hereby add my Signed-off-by to this commit: 604efda4599e0e713eb4e6dc8151f7593bc70d9c"
git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

git commit --amend --signoff
git push --force-with-lease

For multiple commits:

git rebase --signoff origin/main
git push --force-with-lease

More info: DCO check report

mergify · 2026-06-06T10:21:54Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
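
To see why this PR's title passes the rule, the pattern can be tried locally. The sketch below assumes the `~=` operator behaves like a Python regular-expression search against the title; the helper name `is_conventional` and the non-matching sample title are made up for illustration.

```python
import re

# Pattern copied from the merge-protection rule above.
PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit type (hypothetical helper)."""
    return PATTERN.search(title) is not None


titles = [
    "feat(examples): page-by-page entity extraction with multi-model comparison",
    "style: apply ruff formatting to example_07",
    "Add page entity extraction example",  # no type prefix, expected to fail
]
for title in titles:
    print(is_conventional(title), title)
```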

style: apply ruff formatting to example_07

604efda

ana-daniele wants to merge 2 commits into docling-project:main from ana-daniele:feat/page-entity-extraction-example
