CausalInferenceLab · Funbucket · Apr 15, 2026 · May 10, 2026
diff --git a/.claude/skills/pdf/SKILL.md b/.claude/skills/pdf/SKILL.md
@@ -0,0 +1,223 @@
+---
+name: pdf
+description: Use this skill whenever the user wants to do anything with PDF files. This includes reading or extracting text/tables from PDFs, combining or merging multiple PDFs into one, splitting PDFs apart, rotating pages, adding watermarks, creating new PDFs, filling PDF forms, encrypting/decrypting PDFs, extracting images, and OCR on scanned PDFs to make them searchable. If the user mentions a .pdf file or asks to produce one, use this skill.
+---
+
+# PDF Processing
+
+## Quick Start
+
+```python
+from pypdf import PdfReader, PdfWriter
+
+reader = PdfReader("document.pdf")
+print(f"Pages: {len(reader.pages)}")
+
+text = ""
+for page in reader.pages:
+    text += (page.extract_text() or "") + "\n"  # newline keeps pages separated
+```
+
+## Python Libraries
+
+### pypdf — Basic Operations
+
+**Merge PDFs:**
+```python
+from pypdf import PdfWriter, PdfReader
+
+writer = PdfWriter()
+for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
+    reader = PdfReader(pdf_file)
+    for page in reader.pages:
+        writer.add_page(page)
+
+with open("merged.pdf", "wb") as output:
+    writer.write(output)
+```
+
+**Split PDF:**
+```python
+reader = PdfReader("input.pdf")
+for i, page in enumerate(reader.pages):
+    writer = PdfWriter()
+    writer.add_page(page)
+    with open(f"page_{i+1}.pdf", "wb") as output:
+        writer.write(output)
+```
+
+**Extract Metadata:**
+```python
+reader = PdfReader("document.pdf")
+meta = reader.metadata  # None when the PDF has no document info
+print(f"Title: {meta.title if meta else None}")
+print(f"Author: {meta.author if meta else None}")
+```
+
+**Rotate Pages:**
+```python
+reader = PdfReader("input.pdf")
+writer = PdfWriter()
+page = reader.pages[0]
+page.rotate(90)  # clockwise; the angle must be a multiple of 90
+writer.add_page(page)
+with open("rotated.pdf", "wb") as output:
+    writer.write(output)
+```
+
+### pdfplumber — Text and Table Extraction
+
+**Extract Text with Layout:**
+```python
+import pdfplumber
+
+with pdfplumber.open("document.pdf") as pdf:
+    for page in pdf.pages:
+        text = page.extract_text()
+        print(text)
+```
+
+**Extract Tables:**
+```python
+with pdfplumber.open("document.pdf") as pdf:
+    for i, page in enumerate(pdf.pages):
+        tables = page.extract_tables()
+        for j, table in enumerate(tables):
+            print(f"Table {j+1} on page {i+1}:")
+            for row in table:
+                print(row)
+```
+
+**Extract Tables to DataFrame:**
+```python
+import pandas as pd
+import pdfplumber
+
+with pdfplumber.open("document.pdf") as pdf:
+    all_tables = []
+    for page in pdf.pages:
+        tables = page.extract_tables()
+        for table in tables:
+            if table:
+                df = pd.DataFrame(table[1:], columns=table[0])
+                all_tables.append(df)
+
+if all_tables:
+    combined_df = pd.concat(all_tables, ignore_index=True)
+    combined_df.to_excel("extracted_tables.xlsx", index=False)  # needs an Excel engine such as openpyxl
+```
+
+### reportlab — Create PDFs
+
+**Basic PDF Creation:**
+```python
+from reportlab.lib.pagesizes import letter
+from reportlab.pdfgen import canvas
+
+c = canvas.Canvas("hello.pdf", pagesize=letter)
+width, height = letter
+c.drawString(100, height - 100, "Hello World!")
+c.save()
+```
+
+**Multi-page PDF:**
+```python
+from reportlab.lib.pagesizes import letter
+from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
+from reportlab.lib.styles import getSampleStyleSheet
+
+doc = SimpleDocTemplate("report.pdf", pagesize=letter)
+styles = getSampleStyleSheet()
+story = []
+
+story.append(Paragraph("Report Title", styles['Title']))
+story.append(Spacer(1, 12))
+story.append(Paragraph("Body content here.", styles['Normal']))
+story.append(PageBreak())
+story.append(Paragraph("Page 2", styles['Heading1']))
+doc.build(story)
+```
+
+⚠️ **IMPORTANT**: Never use Unicode subscript/superscript characters (₀₁₂₃, ⁰¹²³) in ReportLab. The built-in fonts lack these glyphs, so they render as black boxes. Use `<sub>` and `<super>` tags in `Paragraph` text instead.
+
+## Command-Line Tools
+
+**pdftotext (poppler-utils):**
+```bash
+pdftotext input.pdf output.txt
+pdftotext -layout input.pdf output.txt  # preserve layout
+pdftotext -f 1 -l 5 input.pdf output.txt  # pages 1-5
+```
+
+**qpdf:**
+```bash
+qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
+qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
+qpdf input.pdf output.pdf --rotate=+90:1
+qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
+```
+
+## Common Tasks
+
+**OCR Scanned PDFs:**
+```python
+import pytesseract
+from pdf2image import convert_from_path
+
+images = convert_from_path('scanned.pdf')
+text = ""
+for i, image in enumerate(images):
+    text += f"Page {i+1}:\n"
+    text += pytesseract.image_to_string(image)
+    text += "\n\n"
+```
+
+**Add Watermark:**
+```python
+from pypdf import PdfReader, PdfWriter
+
+watermark = PdfReader("watermark.pdf").pages[0]
+reader = PdfReader("document.pdf")
+writer = PdfWriter()
+for page in reader.pages:
+    page.merge_page(watermark)
+    writer.add_page(page)
+with open("watermarked.pdf", "wb") as output:
+    writer.write(output)
+```
+
+**Password Protection:**
+```python
+from pypdf import PdfReader, PdfWriter
+
+reader = PdfReader("input.pdf")
+writer = PdfWriter()
+for page in reader.pages:
+    writer.add_page(page)
+writer.encrypt("userpassword", "ownerpassword")
+with open("encrypted.pdf", "wb") as output:
+    writer.write(output)
+```
+
+**Extract Images (CLI):**
+```bash
+pdfimages -j input.pdf output_prefix
+```
+
+## Quick Reference
+
+| Task | Best Tool |
+|------|-----------|
+| Merge PDFs | pypdf |
+| Split PDFs | pypdf |
+| Extract text | pdfplumber |
+| Extract tables | pdfplumber |
+| Create PDFs | reportlab |
+| CLI merge | qpdf |
+| OCR scanned PDFs | pytesseract + pdf2image |
+| Fill PDF forms | scripts/ (see references/forms.md) |
+
+## References
+
+- **Form filling**: See [references/forms.md](references/forms.md) — use bundled scripts in `scripts/`
+- **Advanced usage** (pypdfium2, pdf-lib JS, batch processing, troubleshooting): See [references/reference.md](references/reference.md)
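A note on the ReportLab warning in SKILL.md above: when source text already contains Unicode subscript or superscript digits, one option is to convert them to markup before building a `Paragraph`. The sketch below is a minimal, stdlib-only illustration. The helper name `to_reportlab_markup` is hypothetical (it is not part of the skill or of ReportLab), and it handles digits only.

```python
import re

SUB_DIGITS = "₀₁₂₃₄₅₆₇₈₉"
SUPER_DIGITS = "⁰¹²³⁴⁵⁶⁷⁸⁹"
_SUB_TABLE = str.maketrans(SUB_DIGITS, "0123456789")
_SUPER_TABLE = str.maketrans(SUPER_DIGITS, "0123456789")


def to_reportlab_markup(text: str) -> str:
    """Replace runs of Unicode sub/superscript digits with <sub>/<super> tags.

    Hypothetical helper for illustration; letters and symbols are left as-is.
    """
    # Each run of subscript digits becomes one <sub>...</sub> element.
    text = re.sub(
        f"[{SUB_DIGITS}]+",
        lambda m: f"<sub>{m.group().translate(_SUB_TABLE)}</sub>",
        text,
    )
    # Each run of superscript digits becomes one <super>...</super> element.
    text = re.sub(
        f"[{SUPER_DIGITS}]+",
        lambda m: f"<super>{m.group().translate(_SUPER_TABLE)}</super>",
        text,
    )
    return text


print(to_reportlab_markup("H₂O and x¹⁰"))
```

The returned string is meant to be used as `Paragraph` text, for example `Paragraph(to_reportlab_markup("H₂O"), styles['Normal'])`.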
diff --git a/.claude/skills/pdf/references/forms.md b/.claude/skills/pdf/references/forms.md
@@ -0,0 +1,104 @@
+# PDF Form Filling
+
+## Initial Assessment
+
+First, check whether the PDF has fillable form fields:
+```bash
+python scripts/check_fillable_fields.py file.pdf
+```
+
+---
+
+## For Fillable Forms
+
+**Step 1 — Extract field info:**
+```bash
+python scripts/extract_form_field_info.py input.pdf fields.json
+```
+This writes a JSON file cataloging every field: its ID, location, and type (text, checkbox, radio, or choice).
+
+**Step 2 — Visually verify (optional):**
+```bash
+mkdir -p pages && python scripts/convert_pdf_to_images.py input.pdf pages/
+```
+
+**Step 3 — Create field values JSON:**
+```json
+[
+  {"field_id": "FirstName", "page": 1, "value": "John"},
+  {"field_id": "Agree", "page": 1, "value": "/Yes"},
+  {"field_id": "Gender", "page": 1, "value": "/Male"}
+]
+```
+- Checkboxes: use the `checked_value` shown in the field info JSON
+- Radio groups: use one of the `radio_options[].value` values
+- Choice fields: use one of the `choice_options[].value` values
+
+**Step 4 — Fill the form:**
+```bash
+python scripts/fill_fillable_fields.py input.pdf field_values.json output.pdf
+```
+
+---
+
+## For Non-Fillable Forms (Text Annotations)
+
+### Approach A — Preferred (digitally-created PDFs)
+
+Extract structural coordinates:
+```bash
+python scripts/extract_form_structure.py input.pdf structure.json
+```
+
+Build a `fields.json` with `form_fields` array using PDF coordinates:
+```json
+{
+  "pages": [{"page_number": 1, "pdf_width": 612, "pdf_height": 792}],
+  "form_fields": [
+    {
+      "description": "Name field",
+      "page_number": 1,
+      "label_bounding_box": [72, 100, 150, 115],
+      "entry_bounding_box": [155, 98, 400, 116],
+      "entry_text": {"text": "John Smith", "font": "Helvetica", "font_size": 11}
+    }
+  ]
+}
+```
+
+### Approach B — Fallback (scanned PDFs)
+
+Convert the pages to images, read the pixel coordinates off each image, then describe each page by its image dimensions (`image_width`, `image_height`) instead of its PDF dimensions:
+```json
+{
+  "pages": [{"page_number": 1, "image_width": 1000, "image_height": 1294}],
+  "form_fields": [...]
+}
+```
+
+### Hybrid
+
+Use Approach A for most fields and Approach B for anything `extract_form_structure.py` misses.
+
+---
+
+## Validation
+
+Before generating output, validate bounding boxes:
+```bash
+python scripts/check_bounding_boxes.py fields.json
+```
+
+Visually verify a page:
+```bash
+python scripts/create_validation_image.py 1 fields.json pages/page_1.png validation_page_1.png
+```
+
+Red boxes mark entry areas; blue boxes mark label areas.
+
+**Fill with annotations:**
+```bash
+python scripts/fill_pdf_form_with_annotations.py input.pdf fields.json output.pdf
+```
+
+Then convert the output PDF to images (for example with `scripts/convert_pdf_to_images.py`) to verify text placement.
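A note on the hybrid approach in forms.md above: if boxes measured on a page image have to be expressed in PDF units alongside Approach A fields, a plain rescale may be enough. The sketch below assumes both coordinate systems share the same origin and axis direction, which this document does not confirm. The helper name `scale_box` is hypothetical and is not one of the bundled scripts.

```python
def scale_box(box, image_size, pdf_size):
    """Rescale [x0, y0, x1, y1] from image pixels to PDF units.

    Assumption: same origin and axis direction in both systems, so only
    the scale differs. Flip the y values first if that does not hold.
    """
    image_width, image_height = image_size
    pdf_width, pdf_height = pdf_size
    x0, y0, x1, y1 = box
    return [
        round(x0 * pdf_width / image_width, 2),
        round(y0 * pdf_height / image_height, 2),
        round(x1 * pdf_width / image_width, 2),
        round(y1 * pdf_height / image_height, 2),
    ]


# Page sizes taken from the JSON examples: a 1000x1294 image of a 612x792 page.
# The pixel box itself is made up for illustration.
pdf_box = scale_box([250, 160, 650, 190], (1000, 1294), (612, 792))
print(pdf_box)
```

Running `scripts/check_bounding_boxes.py` on the resulting `fields.json` remains the way to confirm the converted boxes are valid.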