google-docs-bfs-export

Purpose

This project exports Google Docs by performing a breadth-first search (BFS) through linked documents. It's designed to help discover and archive valuable discussions scattered across different Google Docs with different owners.

Problem It Solves

Organizations often have important discussions and documentation spread across many Google Docs with poor discoverability. This tool:

  • Starts from a seed document
  • Follows all links to other Google Docs
  • Exports each document as either Markdown (.md) or Word (.docx)
  • Prevents cycles by tracking visited documents

Technical Approach

Why Python + OAuth (not Service Account)?

  • Service accounts require either sharing each doc explicitly with the service account's email address or setting up domain-wide delegation (which needs an admin)
  • OAuth uses personal credentials, so you automatically have access to any doc you can normally view
  • Much simpler setup for personal/team use cases

Architecture

  • Authentication: OAuth 2.0 with token caching (token.pickle)
  • APIs Used:
    • Google Docs API v1 (read document content)
    • Google Drive API v3 (export as .docx, copy and upload files for Drive export)
  • Crawl Strategy: BFS using a deque as the queue and a set to track visited documents
  • Link Extraction: Regex parsing of document URLs from text runs
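The crawl strategy above can be sketched in a few lines. This is a minimal illustration, not the code in main.py: `bfs_crawl`, `fetch_links`, and `max_docs` are hypothetical names, and `fetch_links` stands in for whatever function returns the doc IDs linked from a given document.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional


def bfs_crawl(
    seed_id: str,
    fetch_links: Callable[[str], Iterable[str]],
    max_docs: Optional[int] = None,
) -> List[str]:
    """Return document IDs in breadth-first order, starting from seed_id."""
    visited = {seed_id}        # IDs already queued; this is what prevents cycles
    queue = deque([seed_id])   # FIFO queue gives breadth-first order
    order: List[str] = []

    while queue:
        # Hypothetical safety limit: stop once enough documents were processed.
        if max_docs is not None and len(order) >= max_docs:
            break
        doc_id = queue.popleft()
        order.append(doc_id)
        for linked_id in fetch_links(doc_id):
            if linked_id not in visited:
                visited.add(linked_id)
                queue.append(linked_id)
    return order


# Toy link graph with a cycle (A -> B -> A) to show that each doc is visited once.
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["D"], "D": []}
print(bfs_crawl("A", lambda doc_id: graph[doc_id]))
```

Marking a document as visited when it is queued, rather than when it is dequeued, keeps the same ID from entering the queue twice.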

Key Components

  1. GoogleDocsCrawler class

    • Manages API connections
    • Handles BFS traversal
    • Exports documents in chosen format
  2. Export formats

    • Markdown: Parsed from Docs API structure with basic formatting
    • DOCX: Direct export via Drive API (preserves all formatting)
  3. Link extraction

    • Recursively processes document elements
    • Extracts URLs from text runs
    • Filters for Google Docs URLs only
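As a rough sketch of the link extraction described above, the function below walks the `body.content` structure returned by the Docs API, looks at each text run's content and link URL, and recurses into table cells. The function name and the URL pattern are assumptions made for illustration; the regex in main.py may differ (for example, in how it treats account-specific URL variants).

```python
import re
from typing import List

# Assumed pattern: captures the document ID from a standard Google Docs URL.
DOC_URL_RE = re.compile(r"https://docs\.google\.com/document/d/([A-Za-z0-9_-]+)")


def extract_doc_ids(elements: List[dict]) -> List[str]:
    """Collect linked Google Docs IDs from a list of structural elements."""
    ids: List[str] = []
    for element in elements:
        if "paragraph" in element:
            for part in element["paragraph"].get("elements", []):
                text_run = part.get("textRun")
                if not text_run:
                    continue
                # A URL can appear as visible text or as the target of a styled link.
                link = text_run.get("textStyle", {}).get("link", {})
                for text in (text_run.get("content", ""), link.get("url", "")):
                    ids.extend(DOC_URL_RE.findall(text))
        elif "table" in element:
            # Table cells hold their own list of structural elements, so recurse.
            for row in element["table"].get("tableRows", []):
                for cell in row.get("tableCells", []):
                    ids.extend(extract_doc_ids(cell.get("content", [])))
    # Remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(ids))
```

Links to Sheets or Slides never match the pattern, which is how the filter for Google Docs URLs falls out of the regex itself.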

Current State

Completed

✅ OAuth authentication flow with token caching

✅ BFS crawling with cycle detection

✅ Enhanced markdown export with:

  • Bold, italic, bold+italic, strikethrough
  • Headings (H1-H6)
  • Links with optional localization
  • Bulleted and numbered lists (with nesting)
  • Tables (basic markdown format)
  • Inline code detection

✅ Word (.docx) export via Drive API

✅ Google Drive export (--drive FOLDER_ID flag)

  • Save copies directly to a Google Drive folder
  • Markdown: Uploads .md files to Drive
  • Docx: Copies original Google Docs to target folder
  • Index CSV maps source doc IDs to Drive file IDs
  • Perfect for shared workspace archiving

✅ Index CSV export (doc ID → local file or Drive ID mapping)

✅ Link localization (Google Docs URLs → local .md files)

✅ Automatic access requests (--request-access flag)

  • Detects 403 permission errors
  • Sends access request email to document owner
  • Uses authenticated user's email address
  • Gracefully handles rate limits and other errors

✅ Configurable export limits (safety feature)

✅ Progress tracking and logging

✅ Command-line interface

✅ Setup instructions and documentation
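To make the inline formatting listed above concrete (bold, italic, bold+italic, strikethrough, links), here is one way a single Docs API text run could be turned into Markdown. It is a sketch under assumptions: `text_run_to_markdown` is a hypothetical helper, and main.py may order or nest the markers differently.

```python
def text_run_to_markdown(text_run: dict) -> str:
    """Convert one Docs API textRun dict into a Markdown fragment."""
    content = text_run.get("content", "")
    style = text_run.get("textStyle", {})

    stripped = content.strip()
    if not stripped:
        return content  # whitespace-only runs (e.g. a bare newline) pass through

    # Keep surrounding whitespace outside the markers, since "** text**" does not render.
    lead = content[: len(content) - len(content.lstrip())]
    trail = content[len(content.rstrip()):]

    text = stripped
    if style.get("bold") and style.get("italic"):
        text = f"***{text}***"
    elif style.get("bold"):
        text = f"**{text}**"
    elif style.get("italic"):
        text = f"*{text}*"
    if style.get("strikethrough"):
        text = f"~~{text}~~"

    url = style.get("link", {}).get("url")
    if url:
        text = f"[{text}]({url})"
    return f"{lead}{text}{trail}"
```

For example, a run with content `"hello "` and `bold` set would come out as `**hello** `, with the trailing space kept after the closing marker.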

File Structure

  • main.py - Main crawler implementation (599 lines)
  • README.md - User-facing documentation
  • LLM.md - This file (technical overview)
  • pyproject.toml - Dependencies managed by uv
  • credentials.json - OAuth client credentials (user must provide)
  • token.pickle - Cached OAuth tokens (generated on first run)
  • exported_docs/ - Output directory for exported documents
  • exported_docs/index.csv - Index mapping doc IDs to local files
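For orientation, the index file could be produced along these lines. The column names (`doc_id`, `title`, `location`) and both function names are assumptions for illustration only; check main.py for the actual layout of index.csv.

```python
import csv
import io
from pathlib import Path
from typing import Dict, Tuple

# Hypothetical column layout; the real index.csv may use different headers.
INDEX_COLUMNS = ["doc_id", "title", "location"]


def render_index(entries: Dict[str, Tuple[str, str]]) -> str:
    """Render {doc_id: (title, location)} as CSV text, sorted by doc ID."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(INDEX_COLUMNS)
    for doc_id in sorted(entries):
        title, location = entries[doc_id]
        writer.writerow([doc_id, title, location])
    return buffer.getvalue()


def write_index(output_dir: str, entries: Dict[str, Tuple[str, str]]) -> Path:
    """Write the rendered index to <output_dir>/index.csv and return its path."""
    path = Path(output_dir) / "index.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" stops Python from translating the line endings chosen above.
    with path.open("w", newline="", encoding="utf-8") as handle:
        handle.write(render_index(entries))
    return path
```

Here `location` would hold either a local file name or a Drive file ID, depending on the export target. Using the `csv` module rather than joining strings means titles that contain commas or quotes are escaped correctly.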

Dependencies

  • google-auth-oauthlib - OAuth 2.0 authentication
  • google-auth-httplib2 - HTTP transport for Google APIs
  • google-api-python-client - Google Docs and Drive APIs (includes MediaInMemoryUpload for file uploads)

Usage Flow

  1. Setup: python main.py --setup (shows OAuth instructions)
  2. First run: Opens browser for OAuth consent
  3. Subsequent runs: Uses cached token
  4. Export: python main.py --seed-id DOC_ID --format md
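The flags mentioned in this file (`--setup`, `--seed-id`, `--format`, `--drive`, `--request-access`) map naturally onto an `argparse` parser. The sketch below is an approximation rather than the parser in main.py: the `docx` choice value, the default format, and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser covering the flags named in this document."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Export Google Docs by following links breadth-first.",
    )
    parser.add_argument("--setup", action="store_true",
                        help="show OAuth setup instructions and exit")
    parser.add_argument("--seed-id", metavar="DOC_ID",
                        help="ID of the document to start crawling from")
    # Assumption: "docx" is the value for Word export and "md" is the default.
    parser.add_argument("--format", choices=["md", "docx"], default="md",
                        help="export format")
    parser.add_argument("--drive", metavar="FOLDER_ID",
                        help="save exports to this Google Drive folder")
    parser.add_argument("--request-access", action="store_true",
                        help="request access when a document returns 403")
    return parser


args = build_parser().parse_args(["--seed-id", "DOC_ID", "--format", "md"])
print(args.seed_id, args.format)
```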

Known Limitations

  • Markdown export does not yet handle images, comments, complex table formatting, or footnotes
  • Only follows Google Docs links (not Sheets, Slides, etc.)
  • Subject to Google API rate limits
  • No incremental export (re-exports everything each run)
  • Link localization requires a two-pass approach (slightly slower)
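The two-pass requirement for link localization exists because a link can only be rewritten once the target's local file name is known, and the target may be exported later in the crawl. A minimal sketch, assuming each document is saved as `<doc_id>.md` (a hypothetical naming scheme) and using hypothetical function names:

```python
import re
from typing import Dict

# Matches a Google Docs URL, capturing the doc ID and consuming any trailing
# path or fragment up to whitespace or a closing parenthesis.
DOC_URL_RE = re.compile(
    r"https://docs\.google\.com/document/d/([A-Za-z0-9_-]+)[^\s)]*"
)


def localize_links(markdown: str, local_files: Dict[str, str]) -> str:
    """Replace links to exported docs with their local file names."""
    def replace(match: "re.Match[str]") -> str:
        local = local_files.get(match.group(1))
        # Leave links to documents that were not exported untouched.
        return local if local else match.group(0)
    return DOC_URL_RE.sub(replace, markdown)


def localize_all(exported: Dict[str, str]) -> Dict[str, str]:
    """Take {doc_id: markdown} and return {file_name: localized markdown}."""
    # Pass 1: decide every document's local file name.
    local_files = {doc_id: f"{doc_id}.md" for doc_id in exported}
    # Pass 2: rewrite links now that all targets are known.
    return {
        local_files[doc_id]: localize_links(text, local_files)
        for doc_id, text in exported.items()
    }
```

Links to documents outside the exported set keep their original URL, so nothing points at a file that does not exist.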

Future Enhancements (Not Implemented)

  • Better markdown conversion (tables, images, comments)
  • Incremental export (skip already exported docs)
  • Export to other formats (HTML, PDF)
  • Follow links to Sheets/Slides
  • Parallel downloading for speed
  • Export metadata (authors, last modified, etc.)
  • Graph visualization of document relationships
  • Support for shared drives

Development Notes

  • Python 3.12+ required
  • Uses uv for dependency management
  • OAuth scopes: documents.readonly, drive.readonly, drive.file (for Drive uploads)
  • All credentials stored locally (no external services)
  • Drive export uses files.copy() for docx and MediaInMemoryUpload for markdown
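The token caching mentioned in these notes might look roughly like the following. This is a simplified sketch, not the code in main.py: `load_credentials` and `authorize_in_browser` are hypothetical names, the refresh of expired tokens is left out, and the browser flow is imported lazily so the caching logic can be read and exercised without the Google libraries installed.

```python
import pickle
from pathlib import Path
from typing import Any, Callable

# Full scope URLs for the three scopes listed above.
SCOPES = [
    "https://www.googleapis.com/auth/documents.readonly",
    "https://www.googleapis.com/auth/drive.readonly",
    "https://www.googleapis.com/auth/drive.file",
]


def load_credentials(token_path: str, authorize: Callable[[], Any]) -> Any:
    """Return cached credentials if still valid, otherwise authorize and cache."""
    path = Path(token_path)
    if path.exists():
        with path.open("rb") as handle:
            creds = pickle.load(handle)
        if getattr(creds, "valid", False):
            return creds
    # No usable cache: run the (possibly interactive) authorization step.
    creds = authorize()
    with path.open("wb") as handle:
        pickle.dump(creds, handle)
    return creds


def authorize_in_browser() -> Any:
    """Run the installed-app OAuth flow using credentials.json."""
    # Lazy import: only needed when no valid cached token exists.
    from google_auth_oauthlib.flow import InstalledAppFlow

    flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
    return flow.run_local_server(port=0)
```

A typical call would be `load_credentials("token.pickle", authorize_in_browser)`. Only load a token.pickle that this tool wrote itself, since unpickling untrusted files can execute arbitrary code.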

⚠️ IMPORTANT: Documentation Philosophy

DO NOT create excessive documentation files. When implementing features:

  • ✅ Update existing docs (README.md, this file) minimally
  • ✅ Add inline code comments for complex logic
  • ✅ Update help text (--help)
  • ❌ DO NOT create separate IMPROVEMENTS.md, COMPARISON.md, CHANGELOG.md, etc.
  • ❌ DO NOT create test/demo scripts unless specifically requested

Rationale:

  • Docs longer than source code = tech debt
  • Docs get out of date quickly
  • Code should be self-documenting
  • People can read the source code directly (it's only 599 lines)