
DocSet Gen


Transform documentation into LLM training datasets

DocSet Gen scrapes documentation websites and generates high-quality Q&A training datasets for fine-tuning LLMs.

Features

  • Smart Scraping - Uses Firecrawl to handle JS-rendered sites and anti-bot measures, with built-in content cleaning
  • AI-Powered Generation - Generates Q&A pairs with OpenAI models (gpt-5.1 by default; gpt-4o and gpt-4o-mini are also supported)
  • llms.txt Generation - Generate llms.txt files to help LLMs understand your documentation
  • Interactive CLI - Guided prompts walk you through the process
  • Quality Controls - Automatic deduplication, validation, and filtering
  • Ready-to-Use Output - JSONL format with automatic train/val/test splits

Installation

# Clone the repository
git clone https://github.com/t21dev/docset-gen.git
cd docset-gen

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Configuration

  1. Copy the example environment file:
cp .env.example .env
  2. Edit .env with your API keys:
FIRECRAWL_API_KEY=fc-your-key-here
OPENAI_API_KEY=sk-your-key-here

Note: The default model is gpt-5.1; make sure this model is enabled for your OpenAI account. To use a different model, set OPENAI_MODEL=gpt-4o (or another supported model) in .env.
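
If you want to sanity-check the environment before kicking off a long crawl, here is a minimal sketch that reads those keys with python-dotenv. The variable names match .env above; everything else is illustrative, not DocSet Gen's own startup code:

# sketch: verify the .env keys are present (assumes python-dotenv is installed)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

for key in ("FIRECRAWL_API_KEY", "OPENAI_API_KEY"):
    if not os.getenv(key):
        raise SystemExit(f"Missing {key} - add it to .env first")
print("API keys found")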

Usage

Just run:

python app.py

The interactive CLI will guide you:

 ██████╗  ██████╗  ██████╗███████╗███████╗████████╗
 ██╔══██╗██╔═══██╗██╔════╝██╔════╝██╔════╝╚══██╔══╝
 ██║  ██║██║   ██║██║     ███████╗█████╗     ██║
 ██║  ██║██║   ██║██║     ╚════██║██╔══╝     ██║
 ██████╔╝╚██████╔╝╚██████╗███████║███████╗   ██║
 ╚═════╝  ╚═════╝  ╚═════╝╚══════╝╚══════╝   ╚═╝
  ██████╗ ███████╗███╗   ██╗
 ██╔════╝ ██╔════╝████╗  ██║
 ██║  ███╗█████╗  ██╔██╗ ██║
 ██║   ██║██╔══╝  ██║╚██╗██║
 ╚██████╔╝███████╗██║ ╚████║
  ╚═════╝ ╚══════╝╚═╝  ╚═══╝
                        by t21.dev
Transform documentation into LLM training datasets

Enter documentation URL: docs.example.com
Crawl depth [3]: 3

──────── Step 1/3: Scraping Documentation ────────

Found 45 pages (23,450 words total)

Output format? [dataset/llms.txt/both]: dataset

How many Q&A pairs to generate? [225]: 200
Output file [dataset.jsonl]:

──────── Step 2/3: Generating Q&A Pairs ──────────

Generated 200 Q&A pairs

──────── Step 3/3: Cleaning and Saving ───────────

┌────────────── Dataset Complete ──────────────┐
│ Pages Scraped       │ 45                     │
│ Q&A Pairs Generated │ 200                    │
│ Train / Val / Test  │ 160 / 20 / 20          │
│ Output              │ dataset.jsonl          │
└──────────────────────────────────────────────┘
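
The Train / Val / Test row follows the split ratio from the configuration (0.8 / 0.1 / 0.1 by default; see the config file section below), so 200 pairs become 160 / 20 / 20. A sketch of that arithmetic, not the tool's actual splitting code:

# sketch: how an 80/10/10 split maps onto 200 generated pairs
pairs = 200
ratios = (0.8, 0.1, 0.1)            # train, val, test
train = int(pairs * ratios[0])      # 160
val = int(pairs * ratios[1])        # 20
test = pairs - train - val          # 20 (remainder, so nothing is lost to rounding)
print(train, val, test)             # 160 20 20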

Manual Commands

You can also run individual steps:

# Just scrape (save for later)
python app.py scrape https://docs.example.com --output ./scraped

# Generate from previously scraped content
python app.py generate ./scraped --pairs 500 --output dataset.jsonl

# Generate llms.txt directly
python app.py llms-txt https://docs.example.com --output llms.txt

# Create config file
python app.py init
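
To drive these steps from a script, a minimal sketch that chains scrape and generate with Python's subprocess module (the commands and flags are exactly the ones shown above; nothing else is assumed):

# sketch: run the scrape and generate steps back to back
import subprocess

subprocess.run(
    ["python", "app.py", "scrape", "https://docs.example.com", "--output", "./scraped"],
    check=True,  # stop if scraping fails
)
subprocess.run(
    ["python", "app.py", "generate", "./scraped", "--pairs", "500", "--output", "dataset.jsonl"],
    check=True,
)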

llms.txt Generation

Generate llms.txt files - a proposed standard to help LLMs understand your documentation structure.

Two modes available:

  • minimal (default) - Links with brief descriptions
  • full - Complete page content included (llms-full.txt)

# Generate minimal llms.txt (links only)
python app.py llms-txt https://docs.example.com

# Generate full llms.txt with complete content
python app.py llms-txt https://docs.example.com --mode full

# Specify project name and output file
python app.py llms-txt https://docs.example.com --name "My Project" --output my-llms.txt

Alternatively, in interactive mode, choose llms.txt or both when prompted for the output format.

Minimal mode output:

# Example Project

> A comprehensive toolkit for building modern web applications.

## Docs
- [Getting Started](https://example.com/docs/getting-started): Quick start guide
- [Configuration](https://example.com/docs/config): Configuration options

## API Reference
- [Authentication](https://example.com/api/auth): Auth endpoints
- [Users](https://example.com/api/users): User management
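
To make the minimal format concrete, here is a sketch that renders the same shape from grouped links. The data structure is an assumption for illustration, not DocSet Gen's internal page model:

# sketch: render a minimal-mode llms.txt from grouped links
sections = {
    "Docs": [
        ("Getting Started", "https://example.com/docs/getting-started", "Quick start guide"),
        ("Configuration", "https://example.com/docs/config", "Configuration options"),
    ],
    "API Reference": [
        ("Authentication", "https://example.com/api/auth", "Auth endpoints"),
    ],
}

out = ["# Example Project", "", "> A comprehensive toolkit for building modern web applications."]
for section, links in sections.items():
    out.append("")
    out.append(f"## {section}")
    for title, url, desc in links:
        out.append(f"- [{title}]({url}): {desc}")

with open("llms.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(out) + "\n")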

Full mode output:

# Example Project

> A comprehensive toolkit for building modern web applications.

## Docs

### [Getting Started](https://example.com/docs/getting-started)

> Quick start guide

# Getting Started

Welcome to the project! This guide will help you get up and running...

### [Configuration](https://example.com/docs/config)

> Configuration options

# Configuration

The following configuration options are available...

Output Format

{"instruction": "What is dependency injection?", "input": "", "output": "Dependency injection is..."}
{"instruction": "How do I configure logging?", "input": "", "output": "To configure logging..."}

Configuration File (Optional)

Create docset-gen.yaml for advanced settings:

firecrawl:
  max_depth: 3
  exclude_patterns:
    - "/changelog/*"
    - "/blog/*"

openai:
  model: gpt-4o-mini
  temperature: 0.7

generation:
  mode: qa
  pairs_per_page: 5

output:
  split_ratio: [0.8, 0.1, 0.1]

llms_txt:
  include_optional_section: true
  max_links_per_section: 20
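
A sketch of reading this file with PyYAML, e.g. to confirm the split ratio sums to 1.0. The key names match the example above; the validation itself is illustrative, not part of the tool:

# sketch: load docset-gen.yaml and validate the split ratio (assumes PyYAML)
import yaml

with open("docset-gen.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

ratio = cfg["output"]["split_ratio"]  # e.g. [0.8, 0.1, 0.1]
assert abs(sum(ratio) - 1.0) < 1e-9, "split_ratio must sum to 1.0"
print(cfg["openai"]["model"], ratio)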

Requirements

  • Python 3 (for the virtual environment and pip install steps above)
  • A Firecrawl API key (FIRECRAWL_API_KEY)
  • An OpenAI API key (OPENAI_API_KEY)

Fine-tune with LoCLI

This tool pairs perfectly with LoCLI - a CLI tool that makes fine-tuning LLMs accessible to developers. Just point it at your dataset and go.

# Generate dataset with DocSet Gen
python app.py

# Fine-tune with LoCLI
lo-cli train --dataset dataset.jsonl

License

MIT License - see LICENSE

Author

Created by @TriptoAfsin | t21dev
