KothaSet is a powerful CLI tool for generating high-quality datasets using Large Language Models (LLMs) as teacher models. Create diverse training data for fine-tuning smaller models.
- Multi-Provider — OpenAI, and OpenAI-compatible APIs (DeepSeek, vLLM, Ollama)
- Flexible Schemas — Instruction (Alpaca), Chat (ShareGPT), Preference (DPO), Classification
- Streaming Output — Real-time generation with progress tracking
- Resumable — Atomic checkpointing, never lose progress
- JSONL Output — Streaming writes in standard JSONL format
- Reproducible — Optional fixed seed for deterministic-style runs
- Diversity Control — Input files for sequential topic coverage
- Validation — Validate configs, schemas, datasets, and provider connectivity
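The "sequential topic coverage" behaviour can be pictured with a short Python sketch. It is not KothaSet's implementation: it assumes the input file holds one topic per line and that topics are reused from the top once the list runs out. The helper names `load_topics` and `assign_topics` are hypothetical.

```python
from itertools import cycle, islice


def load_topics(text: str) -> list[str]:
    """Return non-empty, stripped lines (assumption: one topic per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def assign_topics(topics: list[str], n: int) -> list[str]:
    """Walk the topic list in order, wrapping around until n samples are covered."""
    if not topics:
        raise ValueError("topic list is empty")
    return list(islice(cycle(topics), n))


# Hypothetical contents of topics.txt, including a blank line that is skipped.
topics = load_topics("sorting algorithms\n\nHTTP caching\nunit testing\n")
plan = assign_topics(topics, 5)
print(plan)
```

Under these assumptions every topic is used once before any topic repeats, which is what keeps a large run from clustering around a few subjects.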
pip install kothaset
npm install -g kothaset
brew install shantoislamdev/tap/kothaset
Download from GitHub Releases.
go install github.com/shantoislamdev/kothaset/cmd/kothaset@latest
- Initialize configuration:
  kothaset init
- Set your API key:
  # Windows PowerShell
  $env:OPENAI_API_KEY = "sk-..."
  # Linux/macOS
  export OPENAI_API_KEY="sk-..."
- Generate a dataset:
  kothaset generate -n 100 -s instruction --seed 42 -i topics.txt -o dataset.jsonl
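Once a run finishes, the JSONL output can be loaded with a few lines of Python. This is a minimal sketch that relies only on the JSONL convention of one JSON object per line. The sample records are made up for illustration and use the instruction fields named in the schema table; `read_jsonl` is a hypothetical helper, not part of KothaSet.

```python
import io
import json


def read_jsonl(stream):
    """Yield one parsed object per non-blank line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Stand-in for open("dataset.jsonl", encoding="utf-8"); the records are illustrative.
sample = io.StringIO(
    '{"instruction": "Add 2 and 3.", "input": "", "output": "5"}\n'
    '{"instruction": "Name a prime number.", "input": "", "output": "7"}\n'
)
records = list(read_jsonl(sample))
print(len(records), records[0]["instruction"])
```

Because the reader is a generator, the same function works on files too large to hold in memory.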
KothaSet uses a two-file configuration system for better security and organization:
The first file contains shared settings, context, and instructions. It is safe to commit to git.
version: "1.0"
global:
  provider: openai
  schema: instruction
  model: gpt-5.2
  concurrency: 4
  output_dir: ./output
  checkpoint_every: 10 # Save checkpoint every N samples (default: 10)
# Context: Background info or persona injected into every prompt
context: |
  Generate high-quality training data for an AI assistant.
  The data should be helpful, accurate, and well-formatted.
# Instructions: Specific rules and guidelines for generation
instructions:
  - Be creative and diverse in topics and approaches
  - Vary the style and complexity of responses
  - Use clear and concise language

The second file contains sensitive provider credentials. Add it to your .gitignore!
kothaset init creates this file with owner-only permissions (0600 on Unix-like systems).
providers:
  - name: openai
    type: openai
    api_key: env.OPENAI_API_KEY # Reads from environment variable
    # api_key: sk-... # Or hardcode key directly
    timeout: 1m
    rate_limit:
      requests_per_minute: 60
  # Custom endpoint example (DeepSeek, vLLM)
  - name: local
    type: openai
    base_url: http://localhost:8000/v1
    api_key: not-needed

rate_limit.requests_per_minute is actively enforced during generation. Lower values reduce request throughput.
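To get a feel for what a given limit means for run time, the arithmetic can be sketched in Python. This is a rough estimate, not a description of KothaSet's scheduler: it assumes one API request per sample, no retries, and that the limit is the only bottleneck. The function name `estimated_seconds` is hypothetical.

```python
def estimated_seconds(n_samples: int, requests_per_minute: int) -> float:
    """Rough lower bound on run time when the rate limit is the bottleneck.

    Assumption: each sample costs exactly one request and nothing is retried.
    """
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return n_samples * 60.0 / requests_per_minute


# 1000 samples at the configured 60 requests per minute.
print(estimated_seconds(1000, 60) / 60)  # minutes
# Halving the limit doubles the estimate.
print(estimated_seconds(1000, 30) / 60)  # minutes
```

Under these assumptions, a higher concurrency setting does not shorten the run once the limit is reached, because requests beyond the per-minute budget have to wait.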
| Schema | Description | Use Case |
|---|---|---|
| `instruction` | Alpaca-style {instruction, input, output} | SFT |
| `chat` | ShareGPT multi-turn conversations | Chat fine-tuning |
| `preference` | {prompt, chosen, rejected} pairs | DPO/RLHF |
| `classification` | {text, label} pairs | Classifiers |
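The field sets in the table lend themselves to a quick sanity check on generated records. The following Python sketch takes its field names straight from the table; the `chat` schema is left out because the table does not list its fields. `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and the check is independent of KothaSet's own validation.

```python
# Field names copied from the schema table; `chat` is omitted because
# its record layout is not spelled out there.
REQUIRED_FIELDS = {
    "instruction": {"instruction", "input", "output"},
    "preference": {"prompt", "chosen", "rejected"},
    "classification": {"text", "label"},
}


def missing_fields(schema: str, record: dict) -> set[str]:
    """Return the required field names that the record does not contain."""
    return REQUIRED_FIELDS[schema] - set(record)


# Illustrative records, not real KothaSet output.
complete = {"instruction": "Explain binary search.", "input": "", "output": "It halves the range."}
incomplete = {"prompt": "Summarise the article.", "chosen": "A faithful summary."}

print(missing_fields("instruction", complete))
print(missing_fields("preference", incomplete))
```

A check like this is cheap to run over a whole JSONL file before the data goes into a fine-tuning job.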
# Instruction dataset
kothaset generate -n 1000 -s instruction --seed 42 -i topics.txt -o instructions.jsonl
# Chat conversations
kothaset generate -n 500 -s chat --seed 123 -i conversations.txt -o conversations.jsonl
# Preference pairs for DPO
kothaset generate -n 500 -s preference --seed 456 -i pairs.txt -o dpo_data.jsonl
# JSONL (default)
kothaset generate -n 100 --seed 42 -i topics.txt -f jsonl -o dataset.jsonl
kothaset generate automatically creates parent directories for --output paths (for example, -o output/data/dataset.jsonl).
# Use custom provider
kothaset generate -n 100 --seed 42 -i topics.txt -p local -o dataset.jsonl
# Control diversity with input file
kothaset generate -n 1000 --seed 42 -i topics.txt -o diverse.jsonl
# Resume interrupted generation
# (use the exact checkpoint filename from `.kothaset/`)
kothaset generate --resume .kothaset/<checkpoint-file>.checkpoint
# Dry run (validate config)
kothaset generate --dry-run -n 100 --seed 42 -i topics.txt
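Resuming depends on checkpoints that are never left half-written. A common way to get that property is to write to a temporary file and then rename it over the target, sketched below in Python. This illustrates the general technique only; it is not KothaSet's code, and the checkpoint fields (`completed`, `seed`) and the file name `demo.checkpoint` are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to path so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # replaces the target in a single step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    """Read a checkpoint written by save_checkpoint."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Demonstration in a throwaway directory; the state fields are made up.
state = {"completed": 40, "seed": 42}
with tempfile.TemporaryDirectory() as workdir:
    checkpoint_path = os.path.join(workdir, ".kothaset", "demo.checkpoint")
    save_checkpoint(checkpoint_path, state)
    restored = load_checkpoint(checkpoint_path)
    leftovers = os.listdir(os.path.dirname(checkpoint_path))
print(restored, leftovers)
```

If the process is interrupted before the rename, the previous checkpoint is still intact, so a resumed run loses at most the samples generated since the last save.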
Contributions welcome! See CONTRIBUTING.md.
Apache 2.0 License. See LICENSE.