Transform documentation into LLM training datasets
DocSet Gen scrapes documentation websites and generates high-quality Q&A training datasets for fine-tuning LLMs.
- Smart Scraping - Uses Firecrawl to handle JS-rendered sites, anti-bot measures, and content cleaning
- AI-Powered Generation - Generates Q&A pairs using OpenAI models (gpt-5.1 by default; GPT-4o/GPT-4o-mini also supported)
- llms.txt Generation - Generates llms.txt files that help LLMs understand your documentation
- Interactive CLI - Guided prompts walk you through the process
- Quality Controls - Automatic deduplication, validation, and filtering
- Ready-to-Use Output - JSONL format with automatic train/val/test splits
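The deduplication step in the quality controls can be illustrated with a short sketch. This is not DocSet Gen's actual implementation; `normalize` and `dedupe_pairs` are hypothetical helper names, and the real tool may use a stricter similarity check:

```python
# Illustrative sketch of Q&A deduplication (not DocSet Gen's actual code).
# Assumes pairs are dicts with "instruction" and "output" keys.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical questions match."""
    return " ".join(text.lower().split())

def dedupe_pairs(pairs: list[dict]) -> list[dict]:
    """Keep the first occurrence of each normalized instruction."""
    seen = set()
    unique = []
    for pair in pairs:
        key = normalize(pair["instruction"])
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique

pairs = [
    {"instruction": "What is DI?", "output": "..."},
    {"instruction": "what  is DI?", "output": "..."},  # duplicate after normalization
]
print(len(dedupe_pairs(pairs)))  # 1
```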
# Clone the repository
git clone https://github.com/t21dev/docset-gen.git
cd docset-gen
# Create virtual environment
python -m venv venv
# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt

Copy the example environment file:

cp .env.example .env

Edit .env with your API keys:

FIRECRAWL_API_KEY=fc-your-key-here
OPENAI_API_KEY=sk-your-key-here

Note: The default model is gpt-5.1. Make sure you have access to this model enabled in your OpenAI developer account. You can change the model in .env by setting OPENAI_MODEL=gpt-4o or another supported model.
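In Python, the keys can then be read from the environment. A minimal stdlib-only sketch, assuming simple KEY=value lines; the loader itself is an assumption (the tool may use a library such as python-dotenv instead), but the variable names match the .env above:

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: KEY=value lines; blank lines and '#' comments skipped."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: real environment variables win over .env entries
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env_file()

firecrawl_key = os.environ.get("FIRECRAWL_API_KEY")
openai_key = os.environ.get("OPENAI_API_KEY")
model = os.environ.get("OPENAI_MODEL", "gpt-5.1")  # default model per the note above
```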
Just run:
python app.py

The interactive CLI will guide you:
██████╗ ██████╗ ██████╗███████╗███████╗████████╗
██╔══██╗██╔═══██╗██╔════╝██╔════╝██╔════╝╚══██╔══╝
██║ ██║██║ ██║██║ ███████╗█████╗ ██║
██║ ██║██║ ██║██║ ╚════██║██╔══╝ ██║
██████╔╝╚██████╔╝╚██████╗███████║███████╗ ██║
╚═════╝ ╚═════╝ ╚═════╝╚══════╝╚══════╝ ╚═╝
██████╗ ███████╗███╗ ██╗
██╔════╝ ██╔════╝████╗ ██║
██║ ███╗█████╗ ██╔██╗ ██║
██║ ██║██╔══╝ ██║╚██╗██║
╚██████╔╝███████╗██║ ╚████║
╚═════╝ ╚══════╝╚═╝ ╚═══╝
by t21.dev
Transform documentation into LLM training datasets
Enter documentation URL: docs.example.com
Crawl depth [3]: 3
──────── Step 1/3: Scraping Documentation ────────
Found 45 pages (23,450 words total)
Output format? [dataset/llms.txt/both]: dataset
How many Q&A pairs to generate? [225]: 200
Output file [dataset.jsonl]:
──────── Step 2/3: Generating Q&A Pairs ──────────
Generated 200 Q&A pairs
──────── Step 3/3: Cleaning and Saving ───────────
┌────────────── Dataset Complete ──────────────┐
│ Pages Scraped │ 45 │
│ Q&A Pairs Generated │ 200 │
│ Train / Val / Test │ 160 / 20 / 20 │
│ Output │ dataset.jsonl │
└──────────────────────────────────────────────┘
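The 160/20/20 figures above follow the default 0.8/0.1/0.1 ratio. A hypothetical sketch of such a shuffle-and-split; the helper name, seed, and per-split file naming are assumptions, not the tool's actual behavior:

```python
import json
import random

def split_dataset(pairs, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle and split Q&A pairs into train/val/test lists."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(len(pairs) * ratios[0])
    n_val = int(len(pairs) * ratios[1])
    return pairs[:n_train], pairs[n_train:n_train + n_val], pairs[n_train + n_val:]

pairs = [{"instruction": f"Q{i}?", "input": "", "output": f"A{i}"} for i in range(200)]
train, val, test = split_dataset(pairs)
print(len(train), len(val), len(test))  # 160 20 20

# Write each split as JSONL, one record per line
for name, subset in [("train", train), ("val", val), ("test", test)]:
    with open(f"dataset.{name}.jsonl", "w") as f:
        for pair in subset:
            f.write(json.dumps(pair) + "\n")
```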
You can also run individual steps:
# Just scrape (save for later)
python app.py scrape https://docs.example.com --output ./scraped
# Generate from previously scraped content
python app.py generate ./scraped --pairs 500 --output dataset.jsonl
# Generate llms.txt directly
python app.py llms-txt https://docs.example.com --output llms.txt
# Create config file
python app.py init

Generate llms.txt files - a proposed standard to help LLMs understand your documentation structure.
Two modes available:
- minimal (default) - Links with brief descriptions
- full - Complete page content included (llms-full.txt)
# Generate minimal llms.txt (links only)
python app.py llms-txt https://docs.example.com
# Generate full llms.txt with complete content
python app.py llms-txt https://docs.example.com --mode full
# Specify project name and output file
python app.py llms-txt https://docs.example.com --name "My Project" --output my-llms.txt

Or choose llms.txt or both when prompted for the output format in interactive mode.
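An llms.txt file is plain Markdown: a title, a blockquote summary, and link sections. A minimal sketch of assembling one from page metadata, assuming pages are already grouped by section title (`build_llms_txt` is a hypothetical helper, not part of the tool):

```python
def build_llms_txt(name, summary, sections):
    """Render a minimal llms.txt: H1 title, blockquote summary, H2 link sections.
    `sections` maps a section title to a list of (title, url, description)."""
    lines = [f"# {name}", "", f"> {summary}", ""]
    for section, pages in sections.items():
        lines.append(f"## {section}")
        for title, url, desc in pages:
            lines.append(f"- [{title}]({url}): {desc}")
        lines.append("")
    return "\n".join(lines)

text = build_llms_txt(
    "Example Project",
    "A comprehensive toolkit for building modern web applications.",
    {"Docs": [("Getting Started", "https://example.com/docs/getting-started", "Quick start guide")]},
)
print(text)
```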
Minimal mode output:
# Example Project
> A comprehensive toolkit for building modern web applications.
## Docs
- [Getting Started](https://example.com/docs/getting-started): Quick start guide
- [Configuration](https://example.com/docs/config): Configuration options
## API Reference
- [Authentication](https://example.com/api/auth): Auth endpoints
- [Users](https://example.com/api/users): User management

Full mode output:
# Example Project
> A comprehensive toolkit for building modern web applications.
## Docs
### [Getting Started](https://example.com/docs/getting-started)
> Quick start guide
# Getting Started
Welcome to the project! This guide will help you get up and running...
### [Configuration](https://example.com/docs/config)
> Configuration options
# Configuration
The following configuration options are available...

Each dataset record is a single JSONL line in instruction format:

{"instruction": "What is dependency injection?", "input": "", "output": "Dependency injection is..."}
{"instruction": "How do I configure logging?", "input": "", "output": "To configure logging..."}

Create docset-gen.yaml for advanced settings:
firecrawl:
  max_depth: 3
  exclude_patterns:
    - "/changelog/*"
    - "/blog/*"

openai:
  model: gpt-4o-mini
  temperature: 0.7

generation:
  mode: qa
  pairs_per_page: 5

output:
  split_ratio: [0.8, 0.1, 0.1]

llms_txt:
  include_optional_section: true
  max_links_per_section: 20

Requirements:
- Python 3.10+
- Firecrawl API key
- OpenAI API key
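A loader for docset-gen.yaml would typically validate these settings before running. A small sketch of such checks; `validate_config` and its error messages are assumptions, and the config is shown as a plain Python dict rather than parsed YAML:

```python
import math

def validate_config(config: dict) -> None:
    """Check a docset-gen.yaml-style config for common mistakes."""
    ratios = config["output"]["split_ratio"]
    if not math.isclose(sum(ratios), 1.0):
        raise ValueError(f"split_ratio must sum to 1.0, got {sum(ratios)}")
    if config["firecrawl"]["max_depth"] < 1:
        raise ValueError("max_depth must be at least 1")
    if config["generation"]["pairs_per_page"] < 1:
        raise ValueError("pairs_per_page must be at least 1")

config = {
    "firecrawl": {"max_depth": 3, "exclude_patterns": ["/changelog/*", "/blog/*"]},
    "openai": {"model": "gpt-4o-mini", "temperature": 0.7},
    "generation": {"mode": "qa", "pairs_per_page": 5},
    "output": {"split_ratio": [0.8, 0.1, 0.1]},
}
validate_config(config)  # the defaults above pass silently
```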
This tool pairs perfectly with LoCLI - a CLI tool that makes fine-tuning LLMs accessible to developers. Just point it at your dataset and go.
# Generate dataset with DocSet Gen
python app.py
# Fine-tune with LoCLI
lo-cli train --dataset dataset.jsonl

MIT License - see LICENSE
Created by @TriptoAfsin | t21dev
