google-docs-bfs-export

Purpose

This project exports Google Docs by performing a breadth-first search (BFS) through linked documents. It's designed to help discover and archive valuable discussions scattered across different Google Docs with different owners.

Problem It Solves

Organizations often have important discussions and documentation spread across many Google Docs with poor discoverability. This tool:

Starts from a seed document
Follows all links to other Google Docs
Exports each document as either Markdown (.md) or Word (.docx)
Prevents cycles by tracking visited documents
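
The steps above can be illustrated with a minimal sketch of a breadth-first crawl with cycle prevention. It is not the code in main.py: the function name `bfs_crawl`, the `get_linked_ids` callback, and the `max_docs` limit are hypothetical names chosen for illustration, and a toy in-memory link graph stands in for real API calls.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(
    seed_id: str,
    get_linked_ids: Callable[[str], Iterable[str]],
    max_docs: int = 100,
) -> list[str]:
    """Return document IDs in breadth-first order, visiting each at most once."""
    visited = {seed_id}        # IDs already queued; this is what prevents cycles
    queue = deque([seed_id])   # FIFO queue gives breadth-first order
    order = []

    while queue and len(order) < max_docs:
        doc_id = queue.popleft()
        order.append(doc_id)   # a real crawler would export the document here
        for linked_id in get_linked_ids(doc_id):
            if linked_id not in visited:
                visited.add(linked_id)
                queue.append(linked_id)
    return order


# Toy link graph with a cycle: C links back to A.
links = {"A": ["B", "C"], "B": ["D"], "C": ["A", "D"], "D": []}
print(bfs_crawl("A", lambda doc_id: links.get(doc_id, [])))
```

Marking a document as visited when it is queued, rather than when it is dequeued, keeps the same ID from being queued twice when several documents link to it.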

Technical Approach

Why Python + OAuth (not Service Account)?

Service accounts require either sharing each doc explicitly with the service account's email address or setting up domain-wide delegation (which needs an admin)
OAuth uses personal credentials, so you automatically have access to any doc you can normally view
Much simpler setup for personal/team use cases

Architecture

Authentication: OAuth 2.0 with token caching (token.pickle)
APIs Used:
- Google Docs API v1 (read document content)
- Google Drive API v3 (export as .docx)
Crawl Strategy: BFS using a deque as the queue and a set to track visited documents
Link Extraction: Regex parsing of document URLs from text runs
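
As a rough illustration of regex-based link extraction, the sketch below pulls document IDs out of Google Docs URLs. The pattern and the name `extract_doc_ids` are assumptions, not copied from main.py; the pattern covers the common `/document/d/<ID>` form and the `/document/u/<n>/d/<ID>` variant, and the real crawler may accept more or fewer shapes.

```python
import re

# Assumed pattern: captures the ID segment of a Google Docs document URL.
DOC_URL_RE = re.compile(
    r"https?://docs\.google\.com/document/(?:u/\d+/)?d/([A-Za-z0-9_-]+)"
)


def extract_doc_ids(text: str) -> list[str]:
    """Return Google Docs IDs found in text, in first-seen order, deduplicated."""
    # findall returns the captured group; dict.fromkeys drops repeats
    # while preserving order.
    return list(dict.fromkeys(DOC_URL_RE.findall(text)))


sample = (
    "Notes: https://docs.google.com/document/d/abc_123-XYZ/edit#heading=h.1 "
    "Budget: https://docs.google.com/spreadsheets/d/sheet1/edit "
    "Same doc again: https://docs.google.com/document/d/abc_123-XYZ/view"
)
print(extract_doc_ids(sample))
```

Spreadsheet and slide URLs do not match because the pattern requires the `/document/` path segment.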

Key Components

GoogleDocsCrawler class
- Manages API connections
- Handles BFS traversal
- Exports documents in chosen format
Export formats
- Markdown: Parsed from Docs API structure with basic formatting
- DOCX: Direct export via Drive API (preserves all formatting)
Link extraction
- Recursively processes document elements
- Extracts URLs from text runs
- Filters for Google Docs URLs only
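
The recursive walk described under link extraction might look like the following sketch. It assumes links are read from the `textStyle.link.url` field that the Docs API attaches to text runs, and that tables are reached through `tableRows`, `tableCells`, and `content`; the function name `iter_link_urls` and the sample data are hypothetical, and main.py may instead scan the run text itself.

```python
from typing import Iterator


def iter_link_urls(elements: list[dict]) -> Iterator[str]:
    """Yield every link URL attached to a text run, descending into tables."""
    for element in elements:
        if "paragraph" in element:
            for part in element["paragraph"].get("elements", []):
                style = part.get("textRun", {}).get("textStyle", {})
                url = style.get("link", {}).get("url")
                if url:  # links to headings or bookmarks carry no "url"
                    yield url
        elif "table" in element:
            for row in element["table"].get("tableRows", []):
                for cell in row.get("tableCells", []):
                    # Table cells hold their own list of structural elements.
                    yield from iter_link_urls(cell.get("content", []))


def link(url: str, text: str) -> dict:
    """Build a minimal text-run element for the sample below."""
    return {"textRun": {"content": text, "textStyle": {"link": {"url": url}}}}


# Hand-written stand-in for document["body"]["content"].
sample_body = [
    {"paragraph": {"elements": [
        link("https://docs.google.com/document/d/DOC_A/edit", "Design notes"),
        link("https://docs.google.com/spreadsheets/d/SHEET/edit", "Budget\n"),
    ]}},
    {"table": {"tableRows": [{"tableCells": [{"content": [
        {"paragraph": {"elements": [
            link("https://docs.google.com/document/d/DOC_B/edit", "Minutes\n"),
        ]}},
    ]}]}]}},
]

# Keep Google Docs document URLs only.
docs_links = [u for u in iter_link_urls(sample_body) if "docs.google.com/document/" in u]
print(docs_links)
```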

Current State

Completed

✅ OAuth authentication flow with token caching
✅ BFS crawling with cycle detection
✅ Enhanced markdown export with:
- Bold, italic, bold+italic, strikethrough
- Headings (H1-H6)
- Links with optional localization
- Bulleted and numbered lists (with nesting)
- Tables (basic markdown format)
- Inline code detection
✅ Word (.docx) export via Drive API
✅ Google Drive export (--drive FOLDER_ID flag)
- Save copies directly to a Google Drive folder
- Markdown: Uploads .md files to Drive
- Docx: Copies original Google Docs to target folder
- Index CSV maps source doc IDs to Drive file IDs
- Perfect for shared workspace archiving
✅ Index CSV export (doc ID → local file or Drive ID mapping)
✅ Link localization (Google Docs URLs → local .md files)
✅ Automatic access requests (--request-access flag)
- Detects 403 permission errors
- Sends access request email to document owner
- Uses authenticated user's email address
- Gracefully handles rate limits and other errors
✅ Configurable export limits (safety feature)
✅ Progress tracking and logging
✅ Command-line interface
✅ Setup instructions and documentation
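
To make the inline formatting part of the markdown export concrete, here is a small sketch that converts one Docs API text run into markdown. The `bold`, `italic`, `strikethrough`, and `link` keys are `textStyle` fields in the Docs API; the function name `run_to_markdown` and the choice to keep surrounding whitespace outside the markers are assumptions, not a description of main.py.

```python
def run_to_markdown(text_run: dict) -> str:
    """Render a single text run with bold, italic, strikethrough, and link markup."""
    content = text_run.get("content", "")
    style = text_run.get("textStyle", {})

    core = content.strip()
    if not core:
        return content  # whitespace-only runs are passed through untouched

    # Markdown emphasis breaks if the markers touch whitespace, so split it off.
    lead = content[: len(content) - len(content.lstrip())]
    trail = content[len(content.rstrip()):]

    if style.get("bold") and style.get("italic"):
        core = f"***{core}***"
    elif style.get("bold"):
        core = f"**{core}**"
    elif style.get("italic"):
        core = f"*{core}*"
    if style.get("strikethrough"):
        core = f"~~{core}~~"

    url = style.get("link", {}).get("url")
    if url:
        core = f"[{core}]({url})"
    return lead + core + trail


print(run_to_markdown({"content": "hello ", "textStyle": {"bold": True}}))
```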

File Structure

main.py - Main crawler implementation (599 lines)
README.md - User-facing documentation
LLM.md - This file (technical overview)
pyproject.toml - Dependencies managed by uv
credentials.json - OAuth client credentials (user must provide)
token.pickle - Cached OAuth tokens (generated on first run)
exported_docs/ - Output directory for exported documents
exported_docs/index.csv - Index mapping doc IDs to local files

Dependencies

google-auth-oauthlib - OAuth 2.0 authentication
google-auth-httplib2 - HTTP transport for Google APIs
google-api-python-client - Google Docs and Drive APIs (includes MediaInMemoryUpload for file uploads)

Usage Flow

Setup: python main.py --setup (shows OAuth instructions)
First run: Opens browser for OAuth consent
Subsequent runs: Uses cached token
Export: python main.py --seed-id DOC_ID --format md
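
For orientation, a command-line interface with the flags named in this file could be declared roughly as below. Only the flag names (`--setup`, `--seed-id`, `--format`, `--drive`, `--request-access`) come from this document; the help strings, the `md` default, and the `build_parser` name are assumptions, and main.py may define additional options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser covering the flags mentioned in this document."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Export linked Google Docs breadth-first.",
    )
    parser.add_argument("--setup", action="store_true",
                        help="show OAuth setup instructions and exit")
    parser.add_argument("--seed-id", metavar="DOC_ID",
                        help="ID of the document to start crawling from")
    parser.add_argument("--format", choices=["md", "docx"], default="md",
                        help="export format (the default here is an assumption)")
    parser.add_argument("--drive", metavar="FOLDER_ID",
                        help="also save copies to this Google Drive folder")
    parser.add_argument("--request-access", action="store_true",
                        help="request access when a document returns 403")
    return parser


# Equivalent to: python main.py --seed-id DOC_ID --format md
args = build_parser().parse_args(["--seed-id", "DOC_ID", "--format", "md"])
print(args.seed_id, args.format)
```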

Known Limitations

Markdown export still missing: images, comments, complex table formatting, footnotes
Only follows Google Docs links (not Sheets, Slides, etc.)
Subject to Google API rate limits
No incremental export (re-exports everything each run)
Link localization requires a two-pass approach (slightly slower)
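
One way to picture the second pass of link localization: once the crawl has produced a mapping from doc ID to local file name, each exported markdown file is rewritten against that mapping. A document can link to one that is exported later, which is why the rewrite cannot happen during the first pass. The names `localize_links` and `local_files` and the URL pattern are hypothetical, not taken from main.py.

```python
import re

# Assumed pattern: the ID is captured, the rest of the URL (path, query,
# fragment) is consumed up to whitespace or a closing parenthesis.
DOC_URL_RE = re.compile(
    r"https?://docs\.google\.com/document/(?:u/\d+/)?d/([A-Za-z0-9_-]+)[^\s)]*"
)


def localize_links(markdown: str, local_files: dict[str, str]) -> str:
    """Replace Google Docs URLs with local file names where the doc was exported."""
    def replace(match: re.Match) -> str:
        # Unknown IDs (docs that were not exported) keep their original URL.
        return local_files.get(match.group(1), match.group(0))

    return DOC_URL_RE.sub(replace, markdown)


# Pass 1 (the crawl) would have produced this mapping.
local_files = {"DOC_B": "meeting-notes.md"}
text = (
    "See [notes](https://docs.google.com/document/d/DOC_B/edit#heading=h.x) "
    "and [other](https://docs.google.com/document/d/DOC_Z/edit)."
)
print(localize_links(text, local_files))
```

In this sketch a heading anchor on a localized link is dropped along with the rest of the URL.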

Future Enhancements (Not Implemented)

Better markdown conversion (tables, images, comments)
Incremental export (skip already exported docs)
Export to other formats (HTML, PDF)
Follow links to Sheets/Slides
Parallel downloading for speed
Export metadata (authors, last modified, etc.)
Graph visualization of document relationships
Support for shared drives

Development Notes

Python 3.12+ required
Uses uv for dependency management
OAuth scopes: documents.readonly, drive.readonly, drive.file (for Drive uploads)
All credentials stored locally (no external services)
Drive export uses files.copy() for docx and MediaInMemoryUpload for markdown
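
The token caching flow with these scopes might be sketched as follows. It follows the common installed-app pattern for google-auth-oauthlib and is an assumption about how main.py is organized, not a copy of it; the function name `load_credentials` is hypothetical. The third-party imports sit inside the function so the snippet can be loaded without the Google libraries installed.

```python
import pickle
from pathlib import Path

# Full scope URLs for the three scopes listed above.
SCOPES = [
    "https://www.googleapis.com/auth/documents.readonly",
    "https://www.googleapis.com/auth/drive.readonly",
    "https://www.googleapis.com/auth/drive.file",
]


def load_credentials(token_path: str = "token.pickle",
                     secrets_path: str = "credentials.json"):
    """Return OAuth credentials, reusing or refreshing the cached token if possible."""
    # Deferred imports: only needed when credentials are actually requested.
    from google.auth.transport.requests import Request
    from google_auth_oauthlib.flow import InstalledAppFlow

    token_file = Path(token_path)
    creds = None
    if token_file.exists():
        with token_file.open("rb") as fh:
            creds = pickle.load(fh)  # only unpickle files you created yourself

    if creds and creds.valid:
        return creds

    if creds and creds.expired and creds.refresh_token:
        creds.refresh(Request())  # silent refresh, no browser needed
    else:
        # First run: opens a browser window for the consent screen.
        flow = InstalledAppFlow.from_client_secrets_file(secrets_path, SCOPES)
        creds = flow.run_local_server(port=0)

    with token_file.open("wb") as fh:
        pickle.dump(creds, fh)
    return creds
```

The returned credentials can then be passed to `googleapiclient.discovery.build("docs", "v1", credentials=creds)` and the matching `build("drive", "v3", ...)` call.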

⚠️ IMPORTANT: Documentation Philosophy

DO NOT create excessive documentation files. When implementing features:

✅ Update existing docs (README.md, this file) minimally
✅ Add inline code comments for complex logic
✅ Update help text (--help)
❌ DO NOT create separate IMPROVEMENTS.md, COMPARISON.md, CHANGELOG.md, etc.
❌ DO NOT create test/demo scripts unless specifically requested

Rationale:

Docs longer than source code = tech debt
Docs get out of date quickly
Code should be self-documenting
People can read the source code directly (it's only 599 lines)
