223 changes: 223 additions & 0 deletions .claude/skills/pdf/SKILL.md
@@ -0,0 +1,223 @@
---
name: pdf
description: Use this skill whenever the user wants to do anything with PDF files. This includes reading or extracting text/tables from PDFs, combining or merging multiple PDFs into one, splitting PDFs apart, rotating pages, adding watermarks, creating new PDFs, filling PDF forms, encrypting/decrypting PDFs, extracting images, and OCR on scanned PDFs to make them searchable. If the user mentions a .pdf file or asks to produce one, use this skill.
---

# PDF Processing

## Quick Start

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

text = ""
for page in reader.pages:
    text += page.extract_text()
```

## Python Libraries

### pypdf — Basic Operations

**Merge PDFs:**
```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```

**Split PDF:**
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
# Write each page to its own single-page file
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```

**Extract Metadata:**
```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata  # may be None if the PDF has no document info
if meta is not None:
    print(f"Title: {meta.title}")
    print(f"Author: {meta.author}")
```

**Rotate Pages:**
```python
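from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90)  # clockwise; the angle must be a multiple of 90
writer.add_page(page)  # only the rotated first page is written
with open("rotated.pdf", "wb") as output:
    writer.write(output)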
```

### pdfplumber — Text and Table Extraction

**Extract Text with Layout:**
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
```

**Extract Tables:**
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
```

**Extract Tables to DataFrame:**
```python
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)
```

### reportlab — Create PDFs

**Basic PDF Creation:**
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter
c.drawString(100, height - 100, "Hello World!")
c.save()
```

**Multi-page PDF:**
```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

story.append(Paragraph("Report Title", styles['Title']))
story.append(Spacer(1, 12))
story.append(Paragraph("Body content here.", styles['Normal']))
story.append(PageBreak())
story.append(Paragraph("Page 2", styles['Heading1']))
doc.build(story)
```

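⚠️ **IMPORTANT**: Never use Unicode subscript/superscript characters (₀₁₂₃, ⁰¹²³) in ReportLab. The built-in fonts lack most of these glyphs, so they tend to render as solid boxes. Use `<sub>` and `<super>` tags inside `Paragraph` text instead.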
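
When the source text already contains Unicode subscript or superscript digits, one option is to rewrite them as markup before building a `Paragraph`. The sketch below is a hypothetical helper (`to_reportlab_markup` is not part of ReportLab) that uses only the standard library and handles digits only:

```python
import re

# Unicode subscript/superscript digits mapped to their plain ASCII digit.
SUBSCRIPTS = {ch: str(i) for i, ch in enumerate("₀₁₂₃₄₅₆₇₈₉")}
SUPERSCRIPTS = {ch: str(i) for i, ch in enumerate("⁰¹²³⁴⁵⁶⁷⁸⁹")}


def _wrap_runs(text, table, tag):
    """Replace each run of mapped characters with <tag>digits</tag>."""
    pattern = re.compile("[" + "".join(table) + "]+")
    return pattern.sub(
        lambda m: f"<{tag}>" + "".join(table[c] for c in m.group()) + f"</{tag}>",
        text,
    )


def to_reportlab_markup(text):
    """Hypothetical helper: turn Unicode sub/superscript digits into Paragraph markup."""
    text = _wrap_runs(text, SUBSCRIPTS, "sub")
    return _wrap_runs(text, SUPERSCRIPTS, "super")


print(to_reportlab_markup("H₂O and x² + y¹⁰"))
```

The result is meant to be passed to `Paragraph(...)`, which parses the tags; `canvas.drawString` draws the string literally and does not interpret markup.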

## Command-Line Tools

**pdftotext (poppler-utils):**
```bash
pdftotext input.pdf output.txt
pdftotext -layout input.pdf output.txt # preserve layout
pdftotext -f 1 -l 5 input.pdf output.txt # pages 1-5
```

**qpdf:**
```bash
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf output.pdf --rotate=+90:1
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
```
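
If these qpdf commands need to be driven from Python, a thin wrapper around `subprocess` is usually enough. The following is a minimal sketch: the function names are hypothetical, and only the flags shown in the commands above are used.

```python
import shutil
import subprocess


def qpdf_merge_args(inputs, output):
    """Build argv for: qpdf --empty --pages in1.pdf in2.pdf ... -- output.pdf"""
    if not inputs:
        raise ValueError("need at least one input PDF")
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def qpdf_page_range_args(source, page_range, output):
    """Build argv for: qpdf source.pdf --pages . 1-5 -- output.pdf"""
    return ["qpdf", source, "--pages", ".", page_range, "--", output]


def run_qpdf(args):
    """Run qpdf if it is on PATH and fail loudly on errors."""
    if shutil.which(args[0]) is None:
        raise FileNotFoundError("qpdf not found on PATH")
    result = subprocess.run(args)
    # qpdf exits with 3 when it succeeded but emitted warnings.
    if result.returncode not in (0, 3):
        raise RuntimeError(f"qpdf failed with exit code {result.returncode}")


print(qpdf_merge_args(["file1.pdf", "file2.pdf"], "merged.pdf"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.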

## Common Tasks

**OCR Scanned PDFs:**
```python
import pytesseract
from pdf2image import convert_from_path

images = convert_from_path('scanned.pdf')
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"
```

**Add Watermark:**
```python
from pypdf import PdfReader, PdfWriter

watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()
for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)
with open("watermarked.pdf", "wb") as output:
    writer.write(output)
```

**Password Protection:**
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
writer.encrypt("userpassword", "ownerpassword")
with open("encrypted.pdf", "wb") as output:
    writer.write(output)
```

**Extract Images (CLI):**
```bash
pdfimages -j input.pdf output_prefix
```

## Quick Reference

| Task | Best Tool |
|------|-----------|
| Merge PDFs | pypdf |
| Split PDFs | pypdf |
| Extract text | pdfplumber |
| Extract tables | pdfplumber |
| Create PDFs | reportlab |
| CLI merge | qpdf |
| OCR scanned PDFs | pytesseract + pdf2image |
| Fill PDF forms | scripts/ (see references/forms.md) |

## References

- **Form filling**: See [references/forms.md](references/forms.md) — use bundled scripts in `scripts/`
- **Advanced usage** (pypdfium2, pdf-lib JS, batch processing, troubleshooting): See [references/reference.md](references/reference.md)
104 changes: 104 additions & 0 deletions .claude/skills/pdf/references/forms.md
@@ -0,0 +1,104 @@
# PDF Form Filling

## Initial Assessment

First check if the PDF has fillable form fields:
```bash
python scripts/check_fillable_fields.py file.pdf
```

---

## For Fillable Forms

**Step 1 — Extract field info:**
```bash
python scripts/extract_form_field_info.py input.pdf fields.json
```
Outputs JSON cataloging all fields (IDs, locations, types: text/checkbox/radio/choice).

**Step 2 — Visually verify (optional):**
```bash
mkdir pages && python scripts/convert_pdf_to_images.py input.pdf pages/
```

**Step 3 — Create field values JSON:**
```json
[
  {"field_id": "FirstName", "page": 1, "value": "John"},
  {"field_id": "Agree", "page": 1, "value": "/Yes"},
  {"field_id": "Gender", "page": 1, "value": "/Male"}
]
```
- Checkboxes: use the `checked_value` shown in the field info JSON
- Radio groups: use one of the `radio_options[].value` values
- Choice fields: use one of the `choice_options[].value` values
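
The three rules above can be checked mechanically before running the fill script. This is a minimal sketch under stated assumptions: the field info JSON is taken to be a list of objects keyed by `field_id`, with `checked_value`, `radio_options`, and `choice_options` present only on the matching field types. The `unchecked_value` key and both function names are hypothetical.

```python
def allowed_values(info):
    """Return the set of permitted values, or None if any text is acceptable."""
    if "checked_value" in info:
        # "unchecked_value" is an assumed key; it is skipped when absent.
        candidates = (info["checked_value"], info.get("unchecked_value"))
        return {v for v in candidates if v is not None}
    for key in ("radio_options", "choice_options"):
        if key in info:
            return {option["value"] for option in info[key]}
    return None


def check_field_values(field_info, field_values):
    """Return a list of problems; an empty list means nothing was flagged."""
    by_id = {field["field_id"]: field for field in field_info}
    problems = []
    for entry in field_values:
        info = by_id.get(entry["field_id"])
        if info is None:
            problems.append(f"unknown field_id: {entry['field_id']}")
            continue
        allowed = allowed_values(info)
        if allowed is not None and entry["value"] not in allowed:
            problems.append(
                f"{entry['field_id']}: {entry['value']!r} not in {sorted(allowed)}"
            )
    return problems


# Illustrative data only; real field info comes from extract_form_field_info.py.
field_info = [
    {"field_id": "FirstName"},
    {"field_id": "Agree", "checked_value": "/Yes"},
    {"field_id": "Gender", "radio_options": [{"value": "/Male"}, {"value": "/Female"}]},
]
field_values = [
    {"field_id": "FirstName", "page": 1, "value": "John"},
    {"field_id": "Agree", "page": 1, "value": "/On"},
]
print(check_field_values(field_info, field_values))
```

Here the `Agree` entry is flagged because `/On` is not the field's `checked_value`.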

**Step 4 — Fill the form:**
```bash
python scripts/fill_fillable_fields.py input.pdf field_values.json output.pdf
```

---

## For Non-Fillable Forms (Text Annotations)

### Approach A — Preferred (digitally-created PDFs)

Extract structural coordinates:
```bash
python scripts/extract_form_structure.py input.pdf structure.json
```

Build a `fields.json` with a `form_fields` array that uses PDF coordinates:
```json
{
  "pages": [{"page_number": 1, "pdf_width": 612, "pdf_height": 792}],
  "form_fields": [
    {
      "description": "Name field",
      "page_number": 1,
      "label_bounding_box": [72, 100, 150, 115],
      "entry_bounding_box": [155, 98, 400, 116],
      "entry_text": {"text": "John Smith", "font": "Helvetica", "font_size": 11}
    }
  ]
}
```

### Approach B — Fallback (scanned PDFs)

Convert the pages to images, determine pixel coordinates visually, then describe each page by its image dimensions and give the bounding boxes in image pixels:
```json
{
  "pages": [{"page_number": 1, "image_width": 1000, "image_height": 1294}],
  "form_fields": [...]
}
```

### Hybrid

Use Approach A for most fields and Approach B for anything `extract_form_structure.py` misses.
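
If fields located with both approaches end up on the same page, their boxes presumably need to share one coordinate system. The sketch below rescales an image-pixel box to PDF points. It assumes boxes are `[x0, y0, x1, y1]` and that both systems share the same origin and axis directions, so only the scale differs; neither assumption is confirmed here, and `image_box_to_pdf` is a hypothetical helper, not one of the bundled scripts.

```python
def image_box_to_pdf(box, image_size, pdf_size):
    """Scale [x0, y0, x1, y1] from image pixels to PDF points.

    Assumes a shared origin and axis direction; flip the y values first
    if the two coordinate systems differ.
    """
    image_width, image_height = image_size
    pdf_width, pdf_height = pdf_size
    scale_x = pdf_width / image_width
    scale_y = pdf_height / image_height
    x0, y0, x1, y1 = box
    return [
        round(x0 * scale_x, 2),
        round(y0 * scale_y, 2),
        round(x1 * scale_x, 2),
        round(y1 * scale_y, 2),
    ]


# Page sizes taken from the two JSON examples above; the box is illustrative.
print(image_box_to_pdf([250, 160, 650, 190], (1000, 1294), (612, 792)))
```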

---

## Validation

Before generating output, validate bounding boxes:
```bash
python scripts/check_bounding_boxes.py fields.json
```
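
To give a sense of what such validation can catch, here is a minimal sketch of two plausible checks: boxes on the same page that overlap, and entry boxes too short for their font size. It is not the logic of `check_bounding_boxes.py`, which is not shown here. It assumes boxes are `[x0, y0, x1, y1]` in the same units as `font_size`, and the function names are hypothetical.

```python
def boxes_intersect(a, b):
    """True if two [x0, y0, x1, y1] boxes overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def find_box_problems(form_fields):
    """Return a list of problem descriptions; empty means nothing was flagged."""
    problems = []
    boxes = []  # (page_number, name used in messages, box)
    for field in form_fields:
        name, page = field["description"], field["page_number"]
        entry = field["entry_bounding_box"]
        if entry[0] >= entry[2] or entry[1] >= entry[3]:
            problems.append(f"{name}: entry box has no area")
        font_size = field.get("entry_text", {}).get("font_size")
        if font_size is not None and entry[3] - entry[1] < font_size:
            problems.append(f"{name}: entry box is shorter than font size {font_size}")
        boxes.append((page, f"{name} (label)", field["label_bounding_box"]))
        boxes.append((page, f"{name} (entry)", entry))
    # Compare every pair of boxes that share a page.
    for i, (page_a, name_a, box_a) in enumerate(boxes):
        for page_b, name_b, box_b in boxes[i + 1:]:
            if page_a == page_b and boxes_intersect(box_a, box_b):
                problems.append(f"{name_a} overlaps {name_b}")
    return problems


good = {
    "description": "Name field",
    "page_number": 1,
    "label_bounding_box": [72, 100, 150, 115],
    "entry_bounding_box": [155, 98, 400, 116],
    "entry_text": {"text": "John Smith", "font": "Helvetica", "font_size": 11},
}
# Same field with an entry box that is too short and runs into the label.
bad = dict(good, entry_bounding_box=[140, 98, 400, 105])

print(find_box_problems([good]))
print(find_box_problems([bad]))
```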

Visually verify a page:
```bash
python scripts/create_validation_image.py 1 fields.json pages/page_1.png validation_page_1.png
```

Red boxes mark entry areas; blue boxes mark label areas.

**Fill with annotations:**
```bash
python scripts/fill_pdf_form_with_annotations.py input.pdf fields.json output.pdf
```

Then convert the output PDF to images to verify text placement.