Summary
Add PDF annotation extraction capability to scitex-io, enabling structured extraction of highlights, comments, sticky notes, strikethrough, and underline annotations from PDF files. Primary use case: processing reviewer feedback on scientific manuscripts.
Reference: ~/proj/todo/scitex/16_PDF_ANNOTATION_EXTRACTION.md
Research Findings
Library Comparison
Four Python libraries were evaluated for PDF annotation extraction:
| Library | Version | Annotation Support | Underlying Text | Author/Date | Performance (100 iter) |
| --- | --- | --- | --- | --- | --- |
| PyMuPDF (fitz) | 1.26.7 | Full (all types via page.annots()) | Yes (get_text('words', clip=rect)) | Yes | 7.0 ms/call |
| pypdf | 6.5.0 | Full (raw /Annots dict access) | No (manual rect lookup needed) | Partial | 4.8 ms/call |
| pdfminer.six | 20251107 | Basic (raw dict, needs resolve1()) | No | Partial | Not benchmarked |
| pdfplumber | 0.11.8 | Inherits from pdfminer | No | Partial | Not benchmarked |
Recommendation: PyMuPDF (fitz)
PyMuPDF is the recommended backend for the following reasons:
- Best API for annotations: page.annots() returns typed Annot objects with .type, .info, .rect -- no manual PDF dict traversal needed.
- Underlying text extraction: page.get_text('words', clip=annot.rect) retrieves the text under markup annotations (highlights, strikeouts, underlines). No other library provides this in a single call.
- Already a dependency: scitex-io already uses fitz as its primary PDF backend (_select_backend defaults to it).
- Rich annotation type coverage: Supports all PDF annotation types (Highlight=8, Text/StickyNote=0, StrikeOut=11, Underline=9, Squiggly=10, FreeText=2, Ink=15, etc.).
- Structured metadata: Each annotation exposes content, title (author), creationDate, modDate, subject, color, and rect coordinates.
pypdf is slightly faster per call (4.8 ms vs. 7.0 ms) but has no equivalent of get_text(clip=rect), so it would need a separate text extraction pass and manual coordinate matching.
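The two PyMuPDF calls named above can be combined as in the following minimal sketch. It assumes PyMuPDF is installed and importable as fitz. The function names read_page_annotations and read_pdf_annotations are hypothetical and are not existing scitex-io functions. The sets of markup and popup type IDs are assumptions taken from the type IDs listed in this issue.

```python
from typing import Any, Dict, List

# Assumption: type IDs as listed in this issue (PyMuPDF numbering).
MARKUP_TYPE_IDS = {8, 9, 10, 11}  # Highlight, Underline, Squiggly, StrikeOut
POPUP_TYPE_ID = 16


def read_page_annotations(page: Any) -> List[Dict[str, Any]]:
    """Collect basic fields for each annotation on one PyMuPDF page.

    `page` is expected to behave like fitz.Page; only annots(),
    get_text() and number are used.
    """
    records = []
    for annot in page.annots():
        type_id, type_name = annot.type[0], annot.type[1]
        if type_id == POPUP_TYPE_ID:
            continue  # viewer UI object, not reviewer input
        info = annot.info  # dict with 'title', 'content', 'creationDate', ...
        marked_text = ""
        if type_id in MARKUP_TYPE_IDS:
            # Each entry: (x0, y0, x1, y1, word, block_no, line_no, word_no)
            words = page.get_text("words", clip=annot.rect)
            marked_text = " ".join(w[4] for w in words)
        records.append(
            {
                "type": type_name,
                "type_id": type_id,
                "page": page.number,
                "author": info.get("title", ""),
                "content": info.get("content", ""),
                "marked_text": marked_text,
            }
        )
    return records


def read_pdf_annotations(path: str) -> List[Dict[str, Any]]:
    """Open a PDF with PyMuPDF and gather annotations from all pages."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no hard dependency

    records = []
    with fitz.open(path) as doc:
        for page in doc:
            records.extend(read_page_annotations(page))
    return records
```

Keeping the per-page helper free of a direct fitz import is a design choice for this sketch only: it lets the helper be exercised with stand-in objects in unit tests.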
Tested Extraction Output
From a synthetic annotated PDF, PyMuPDF extracted:
Annotation 1: Type=(8, 'Highlight'), Author='Reviewer1', Content='This needs revision'
Marked text: "This is sample text for annotation testing."
Annotation 2: Type=(0, 'Text'), Author='Reviewer2', Content='Please clarify this point'
Annotation 3: Type=(11, 'StrikeOut'), Author='Reviewer1', Content='Remove this sentence'
Marked text: "Second line with more content to highlight."
Annotation 4: Type=(9, 'Underline'), Author='Reviewer2'
Marked text: "This is sample text for an"
Key annotation type mapping (PyMuPDF type IDs)
| Type ID | Name | Use in Review |
| --- | --- | --- |
| 0 | Text (Sticky Note) | Reviewer comments |
| 2 | FreeText | Inline comments |
| 8 | Highlight | Marking important text |
| 9 | Underline | Emphasis |
| 10 | Squiggly | Suggested edits |
| 11 | StrikeOut | Text to remove |
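One way the table above could be used downstream is to group extracted records by their review use. The sketch below is illustrative only. It assumes each record carries a type_id key, as in the proposed output schema, and the names REVIEW_USE and group_by_review_use are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping

# Type IDs and review uses copied from the mapping table.
REVIEW_USE: Dict[int, str] = {
    0: "Reviewer comments",
    2: "Inline comments",
    8: "Marking important text",
    9: "Emphasis",
    10: "Suggested edits",
    11: "Text to remove",
}


def group_by_review_use(
    records: Iterable[Mapping[str, Any]],
) -> Dict[str, List[Mapping[str, Any]]]:
    """Group annotation records by review use; unlisted type IDs are skipped."""
    grouped: Dict[str, List[Mapping[str, Any]]] = defaultdict(list)
    for record in records:
        use = REVIEW_USE.get(record["type_id"])
        if use is not None:
            grouped[use].append(record)
    return dict(grouped)
```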
Proposed API Design
Option A: New mode in stx.io.load (recommended)
# Extract annotations only
annotations = stx.io.load("reviewed.pdf", mode="annotations")
# Returns: List[Dict] with keys: type, type_id, page, rect, content, marked_text, author, created, modified, color
# Include annotations in full extraction
data = stx.io.load("reviewed.pdf", mode="full", annotations=True)
# data.annotations -> List[Dict]
Option B: Standalone function
from scitex_io import extract_annotations
annotations = extract_annotations("reviewed.pdf")
Proposed output schema
[
{
"type": "Highlight", # str: annotation type name
"type_id": 8, # int: PyMuPDF annotation type constant
"page": 0, # int: zero-indexed page number
"rect": [72.0, 87.0, 404.0, 106.0], # list: bounding box [x0, y0, x1, y1]
"content": "This needs revision", # str: annotation text/comment
"marked_text": "Sample text...", # str: underlying document text (markup types only)
"author": "Reviewer1", # str: annotation author
"created": "D:20260327...", # str: creation date (if available)
"modified": "D:20260327...", # str: modification date (if available)
"color": [1.0, 1.0, 0.0], # list: RGB color (if available)
},
...
]
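The schema keeps created and modified as raw PDF date strings. As a sketch of how a consumer might type the records and convert those strings, the following uses only the standard library. AnnotationRecord and parse_pdf_date are hypothetical names, and the date layout handled is the PDF form D:YYYYMMDDHHmmSS with an optional UTC offset. The sample value in the docstring is made up for illustration.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import List, Optional, TypedDict


class AnnotationRecord(TypedDict, total=False):
    """Keys mirror the proposed output schema; all are optional here."""

    type: str
    type_id: int
    page: int
    rect: List[float]
    content: str
    marked_text: str
    author: str
    created: str
    modified: str
    color: List[float]


# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z or +HH'mm' / -HH'mm'
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value: str) -> Optional[datetime]:
    """Parse a PDF date string such as "D:20260327101500+09'00'".

    Returns None if the string does not match the PDF date layout.
    Missing month/day default to 1 and missing time fields to 0.
    A value without a UTC offset yields a naive datetime.
    """
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()
    try:
        tz = None
        if sign in ("Z", "z"):
            tz = timezone.utc
        elif sign in ("+", "-"):
            delta = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
            tz = timezone(delta if sign == "+" else -delta)
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tz,
        )
    except ValueError:
        return None  # out-of-range field, e.g. month 13
```

Returning None instead of raising keeps malformed or truncated dates from aborting a whole extraction run. Whether scitex-io should store parsed datetimes or the raw strings is left open here.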
Integration Points
- scitex-io (_load_modules/_pdf.py): Add mode="annotations", plus an annotations=True kwarg for _extract_full/_extract_scientific. The implementation goes in a new _pdf_annotation_extractors.py module, following the existing pattern.
- scitex-writer: A future writer_import_annotations MCP tool could map extracted annotations to manuscript sections for revision tracking. This depends on the scitex-io change above being implemented first.
Implementation Notes
- The existing _pdf.py loader already imports from _pdf_utils, _pdf_text_extractors, and _pdf_content_extractors. A new _pdf_annotation_extractors.py follows this pattern cleanly.
- No new dependencies required -- PyMuPDF is already used.
- For markup annotations (Highlight, Underline, StrikeOut, Squiggly), page.get_text('words', clip=annot.rect) extracts the underlying text.
- Popup annotations (type 16) should be filtered out -- they are UI artifacts, not user-created annotations.
- The annot.info dict provides title (= author in PDF spec), content, creationDate, modDate.
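When a markup annotation spans more than one line, the word list returned by get_text('words', ...) can be regrouped into lines before it is stored as marked_text. The sketch below assumes the documented PyMuPDF word tuple layout (x0, y0, x1, y1, word, block_no, line_no, word_no). The name words_to_text is hypothetical.

```python
from itertools import groupby
from typing import Sequence, Tuple

# Layout of one entry from page.get_text("words"):
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]


def words_to_text(words: Sequence[Word]) -> str:
    """Join word tuples into text, one output line per (block, line) pair."""
    # Sort by block, then line, then word index so reading order is restored.
    ordered = sorted(words, key=lambda w: (w[5], w[6], w[7]))
    lines = []
    for _, line_words in groupby(ordered, key=lambda w: (w[5], w[6])):
        lines.append(" ".join(w[4] for w in line_words))
    return "\n".join(lines)
```

Joining with newlines preserves the line structure of the marked passage. A caller that prefers a single string can replace them with spaces.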
Priority
Medium -- useful for the journal revision workflow, but not blocking current submissions.