Thirdweb Documentation Scraper

A tool that scrapes the Thirdweb TypeScript API documentation, cleans it, and organizes it into a structured local Markdown repository.

Features

  • Web Scraping: Traverses the Thirdweb documentation site to extract content
  • Markdown Conversion: Converts HTML content to clean, well-formatted Markdown
  • Intelligent Categorization: Organizes documentation into meaningful categories:
    • UI Components
    • React Hooks
    • Core Functions
    • Advanced Topics
  • Index Generation: Creates navigation indexes for each category
  • Content Cleaning: Removes unnecessary boilerplate and formats code blocks
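
The traversal step in the feature list can be illustrated with a small link collector. This is a minimal sketch using only the Python standard library, not the code in improved_scraper.py; the names `LinkCollector` and `same_site_links` and the example URL are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(html, page_url):
    """Return absolute, de-duplicated links on the same host as page_url."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" suffixes.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links
```

A crawler would typically call `same_site_links` on each fetched page and push unseen results onto a queue, which keeps the traversal inside the documentation site.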

Project Components

  • Improved Scraper (improved_scraper.py): Main scraper that traverses the documentation site and extracts content
  • Reorganization Tool (reorganize_docs.py): Sorts and categorizes documentation files
  • Markdown Cleaner (markdown_cleaner.py): Cleans and formats scraped Markdown files
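
To make the cleaning step concrete, here is a minimal sketch of what a Markdown cleaner might do: drop known boilerplate lines, trim trailing whitespace, and collapse runs of blank lines while leaving fenced code untouched. It is not the implementation in markdown_cleaner.py; the `BOILERPLATE` phrases and the `clean_markdown` function are assumptions for illustration.

```python
# Hypothetical boilerplate lines; the real cleaner's list is not shown here.
BOILERPLATE = {"Skip to content", "Edit this page", "Was this page helpful?"}


def clean_markdown(text):
    """Drop boilerplate lines and collapse blank-line runs outside code fences."""
    cleaned = []
    in_fence = False
    for line in text.splitlines():
        if line.strip().startswith("```"):
            in_fence = not in_fence
            cleaned.append(line.strip())
            continue
        if in_fence:
            cleaned.append(line)  # keep code exactly as scraped
            continue
        line = line.rstrip()
        if line.strip() in BOILERPLATE:
            continue
        if line == "" and cleaned and cleaned[-1] == "":
            continue  # collapse consecutive blank lines
        cleaned.append(line)
    return "\n".join(cleaned).strip() + "\n"
```

Tracking whether the current line sits inside a fence matters because blank lines and indentation inside code samples are meaningful and should survive cleaning.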

Requirements

  • Python 3.x
  • The libraries listed in requirements.txt

Setup and Usage

Set Up Environment

./setup_venv.sh

Run Full Documentation Pipeline

For the complete process (scraping, cleaning, and organizing):

./run_improved_scraper.sh

Run Only Reorganization

If you already have scraped documentation and want to reorganize it:

python reorganize_docs.py

Directory Structure

The scraped content is organized as follows:

thirdweb_typescript_docs/
├── UI Components/
│   ├── 00_index.md
│   ├── Component1.md
│   └── ...
├── React Hooks/
│   ├── 00_index.md
│   ├── Hook1.md
│   └── ...
├── Core Functions/
│   ├── 00_index.md
│   ├── Function1.md
│   └── ...
└── Advanced Topics/
    ├── 00_index.md
    ├── Topic1.md
    └── ...
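
The 00_index.md files in the layout above could be produced along these lines. This is a sketch under assumptions rather than the project's actual index generator: the function name `write_index` and the index format (a heading followed by a bullet list of links) are hypothetical.

```python
from pathlib import Path


def write_index(category_dir):
    """Write 00_index.md listing every other Markdown file in category_dir."""
    category_dir = Path(category_dir)
    pages = sorted(
        p for p in category_dir.glob("*.md") if p.name != "00_index.md"
    )
    lines = [f"# {category_dir.name}", ""]
    for page in pages:
        # Link text is the file name without its .md extension.
        lines.append(f"- [{page.stem}]({page.name})")
    index_path = category_dir / "00_index.md"
    index_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return index_path
```

Calling `write_index` once per category directory after reorganization would regenerate each index. The `00_` prefix in the file name keeps the index at the top of an alphabetical listing.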

Additional Resources

  • ScraperBuildGuide.md: Detailed guide for building similar documentation scrapers
  • reorganize_docs.py: Script for categorizing documentation based on content patterns
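
Categorizing by content patterns can be sketched as an ordered list of regular expressions, where the first match decides the category. The rules below are hypothetical and chosen only to illustrate the idea; the patterns that reorganize_docs.py actually uses may differ.

```python
import re

# Hypothetical rules; the patterns used by reorganize_docs.py may differ.
CATEGORY_RULES = [
    ("React Hooks", re.compile(r"\buse[A-Z]\w*\s*\(")),       # e.g. useThing(
    ("UI Components", re.compile(r"<[A-Z]\w*[\s/>]")),        # e.g. <Widget />
    ("Core Functions", re.compile(r"\bfunction\s+\w+\s*\(")),  # e.g. function f(
]
DEFAULT_CATEGORY = "Advanced Topics"


def categorize(markdown_text):
    """Return the first category whose pattern matches, else the default."""
    for category, pattern in CATEGORY_RULES:
        if pattern.search(markdown_text):
            return category
    return DEFAULT_CATEGORY
```

Because the rules are checked in order, a page that mentions both a hook call and a JSX tag lands in "React Hooks" in this sketch. Reordering the list changes that priority.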

Why This Project

This project helps developers maintain an up-to-date local copy of Thirdweb documentation for:

  1. Offline access
  2. Training AI models on the Thirdweb TypeScript SDK
  3. Creating customized knowledge bases
  4. Enhancing developer workflows with searchable documentation