
DocSet Gen


Transform documentation into LLM training datasets

DocSet Gen scrapes documentation websites and generates high-quality Q&A training datasets for fine-tuning LLMs.

Features

  • Smart Scraping - Uses Firecrawl to handle JS-rendered sites and anti-bot measures, with built-in content cleaning
  • AI-Powered Generation - Generates Q&A pairs with OpenAI models (gpt-5.1 by default; gpt-4o and gpt-4o-mini are also supported)
  • llms.txt Generation - Generate llms.txt files to help LLMs understand your documentation
  • Interactive CLI - Guided prompts walk you through the process
  • Quality Controls - Automatic deduplication, validation, and filtering
  • Ready-to-Use Output - JSONL format with automatic train/val/test splits

Installation

# Clone the repository
git clone https://github.com/t21dev/docset-gen.git
cd docset-gen

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Configuration

  1. Copy the example environment file:
cp .env.example .env
  2. Edit .env with your API keys:
FIRECRAWL_API_KEY=fc-your-key-here
OPENAI_API_KEY=sk-your-key-here

Note: The default model is gpt-5.1; make sure this model is enabled for your OpenAI account. To use a different model, set OPENAI_MODEL=gpt-4o (or another supported model) in .env.
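
If you want to sanity-check the environment before kicking off a long crawl, here is a minimal sketch that reads those keys with python-dotenv. The variable names match .env above; everything else is illustrative, not DocSet Gen's own startup code:

# sketch: verify the .env keys are present (assumes python-dotenv is installed)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

for key in ("FIRECRAWL_API_KEY", "OPENAI_API_KEY"):
    if not os.getenv(key):
        raise SystemExit(f"Missing {key} - add it to .env first")
print("API keys found")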

Usage

Just run:

python app.py

The interactive CLI will guide you:

 ██████╗  ██████╗  ██████╗███████╗███████╗████████╗
 ██╔══██╗██╔═══██╗██╔════╝██╔════╝██╔════╝╚══██╔══╝
 ██║  ██║██║   ██║██║     ███████╗█████╗     ██║
 ██║  ██║██║   ██║██║     ╚════██║██╔══╝     ██║
 ██████╔╝╚██████╔╝╚██████╗███████║███████╗   ██║
 ╚═════╝  ╚═════╝  ╚═════╝╚══════╝╚══════╝   ╚═╝
  ██████╗ ███████╗███╗   ██╗
 ██╔════╝ ██╔════╝████╗  ██║
 ██║  ███╗█████╗  ██╔██╗ ██║
 ██║   ██║██╔══╝  ██║╚██╗██║
 ╚██████╔╝███████╗██║ ╚████║
  ╚═════╝ ╚══════╝╚═╝  ╚═══╝
                        by t21.dev
Transform documentation into LLM training datasets

Enter documentation URL: docs.example.com
Crawl depth [3]: 3

──────── Step 1/3: Scraping Documentation ────────

Found 45 pages (23,450 words total)

Output format? [dataset/llms.txt/both]: dataset

How many Q&A pairs to generate? [225]: 200
Output file [dataset.jsonl]:

──────── Step 2/3: Generating Q&A Pairs ──────────

Generated 200 Q&A pairs

──────── Step 3/3: Cleaning and Saving ───────────

┌────────────── Dataset Complete ──────────────┐
│ Pages Scraped       │ 45                     │
│ Q&A Pairs Generated │ 200                    │
│ Train / Val / Test  │ 160 / 20 / 20          │
│ Output              │ dataset.jsonl          │
└──────────────────────────────────────────────┘
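
The Train / Val / Test row follows the split ratio from the configuration (0.8 / 0.1 / 0.1 by default; see the config file section below), so 200 pairs become 160 / 20 / 20. A sketch of that arithmetic, not the tool's actual splitting code:

# sketch: how an 80/10/10 split maps onto 200 generated pairs
pairs = 200
ratios = (0.8, 0.1, 0.1)            # train, val, test
train = int(pairs * ratios[0])      # 160
val = int(pairs * ratios[1])        # 20
test = pairs - train - val          # 20 (remainder, so nothing is lost to rounding)
print(train, val, test)             # 160 20 20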

Manual Commands

You can also run individual steps:

# Just scrape (save for later)
python app.py scrape https://docs.example.com --output ./scraped

# Generate from previously scraped content
python app.py generate ./scraped --pairs 500 --output dataset.jsonl

# Generate llms.txt directly
python app.py llms-txt https://docs.example.com --output llms.txt

# Create config file
python app.py init
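
To drive these steps from a script, a minimal sketch that chains scrape and generate with Python's subprocess module (the commands and flags are exactly the ones shown above; nothing else is assumed):

# sketch: run the scrape and generate steps back to back
import subprocess

subprocess.run(
    ["python", "app.py", "scrape", "https://docs.example.com", "--output", "./scraped"],
    check=True,  # stop if scraping fails
)
subprocess.run(
    ["python", "app.py", "generate", "./scraped", "--pairs", "500", "--output", "dataset.jsonl"],
    check=True,
)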

llms.txt Generation

Generate llms.txt files - a proposed standard to help LLMs understand your documentation structure.

Two modes available:

  • minimal (default) - Links with brief descriptions
  • full - Complete page content included (llms-full.txt)

# Generate minimal llms.txt (links only)
python app.py llms-txt https://docs.example.com

# Generate full llms.txt with complete content
python app.py llms-txt https://docs.example.com --mode full

# Specify project name and output file
python app.py llms-txt https://docs.example.com --name "My Project" --output my-llms.txt

Alternatively, in interactive mode, choose llms.txt or both when prompted for the output format.

Minimal mode output:

# Example Project

> A comprehensive toolkit for building modern web applications.

## Docs
- [Getting Started](https://example.com/docs/getting-started): Quick start guide
- [Configuration](https://example.com/docs/config): Configuration options

## API Reference
- [Authentication](https://example.com/api/auth): Auth endpoints
- [Users](https://example.com/api/users): User management
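
To make the minimal format concrete, here is a sketch that renders the same shape from grouped links. The data structure is an assumption for illustration, not DocSet Gen's internal page model:

# sketch: render a minimal-mode llms.txt from grouped links
sections = {
    "Docs": [
        ("Getting Started", "https://example.com/docs/getting-started", "Quick start guide"),
        ("Configuration", "https://example.com/docs/config", "Configuration options"),
    ],
    "API Reference": [
        ("Authentication", "https://example.com/api/auth", "Auth endpoints"),
    ],
}

out = ["# Example Project", "", "> A comprehensive toolkit for building modern web applications."]
for section, links in sections.items():
    out.append("")
    out.append(f"## {section}")
    for title, url, desc in links:
        out.append(f"- [{title}]({url}): {desc}")

with open("llms.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(out) + "\n")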

Full mode output:

# Example Project

> A comprehensive toolkit for building modern web applications.

## Docs

### [Getting Started](https://example.com/docs/getting-started)

> Quick start guide

# Getting Started

Welcome to the project! This guide will help you get up and running...

### [Configuration](https://example.com/docs/config)

> Configuration options

# Configuration

The following configuration options are available...

Output Format

{"instruction": "What is dependency injection?", "input": "", "output": "Dependency injection is..."}
{"instruction": "How do I configure logging?", "input": "", "output": "To configure logging..."}

Configuration File (Optional)

Create docset-gen.yaml for advanced settings:

firecrawl:
  max_depth: 3
  exclude_patterns:
    - "/changelog/*"
    - "/blog/*"

openai:
  model: gpt-4o-mini
  temperature: 0.7

generation:
  mode: qa
  pairs_per_page: 5

output:
  split_ratio: [0.8, 0.1, 0.1]

llms_txt:
  include_optional_section: true
  max_links_per_section: 20
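
A sketch of reading this file with PyYAML, e.g. to confirm the split ratio sums to 1.0. The key names match the example above; the validation itself is illustrative, not part of the tool:

# sketch: load docset-gen.yaml and validate the split ratio (assumes PyYAML)
import yaml

with open("docset-gen.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

ratio = cfg["output"]["split_ratio"]  # e.g. [0.8, 0.1, 0.1]
assert abs(sum(ratio) - 1.0) < 1e-9, "split_ratio must sum to 1.0"
print(cfg["openai"]["model"], ratio)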

Requirements

  • Python 3 (for the virtual environment and pip install steps above)
  • A Firecrawl API key (FIRECRAWL_API_KEY)
  • An OpenAI API key (OPENAI_API_KEY)

Fine-tune with LoCLI

This tool pairs perfectly with LoCLI - a CLI tool that makes fine-tuning LLMs accessible to developers. Just point it at your dataset and go.

# Generate dataset with DocSet Gen
python app.py

# Fine-tune with LoCLI
lo-cli train --dataset dataset.jsonl

License

MIT License - see LICENSE

Author

Created by @TriptoAfsin | t21dev
