KothaSet

KothaSet is a CLI tool that generates datasets using Large Language Models (LLMs) as teacher models. Use it to create diverse training data for fine-tuning smaller models.

Features

Multi-Provider: OpenAI and OpenAI-compatible APIs (DeepSeek, vLLM, Ollama)
Flexible Schemas: Instruction (Alpaca), Chat (ShareGPT), Preference (DPO), Classification
Streaming Output: Real-time generation with progress tracking
Resumable: Atomic checkpointing, so an interrupted run does not lose progress
JSONL Output: Streaming writes in standard JSONL format
Reproducible: Optional fixed seed (`--seed`) for deterministic-style runs
Diversity Control: Input files for sequential topic coverage
Validation: Validate configs, schemas, datasets, and provider connectivity

Installation

pip (Python)

```
pip install kothaset
```

npm (Node.js)

```
npm install -g kothaset
```

Homebrew (macOS/Linux)

```
brew install shantoislamdev/tap/kothaset
```

Binary Download

Download from GitHub Releases.

From Source

```
go install github.com/shantoislamdev/kothaset/cmd/kothaset@latest
```

Quick Start

Initialize configuration:
```
kothaset init
```

Set your API key:

```
# Windows PowerShell
$env:OPENAI_API_KEY = "sk-..."

# Linux/macOS
export OPENAI_API_KEY="sk-..."
```

Generate a dataset:

```
kothaset generate -n 100 -s instruction --seed 42 -i topics.txt -o dataset.jsonl
```

Configuration

KothaSet uses a two-file configuration system for better security and organization:

1. `kothaset.yaml` (Public)

Contains shared settings, context, and instructions. Safe to commit to git.

```
version: "1.0"
global:
  provider: openai
  schema: instruction
  model: gpt-5.2
  concurrency: 4
  output_dir: ./output
  checkpoint_every: 10  # Save checkpoint every N samples (default: 10)

# Context: Background info or persona injected into every prompt
context: |
  Generate high-quality training data for an AI assistant.
  The data should be helpful, accurate, and well-formatted.

# Instructions: Specific rules and guidelines for generation
instructions:
  - Be creative and diverse in topics and approaches
  - Vary the style and complexity of responses
  - Use clear and concise language
```

2. `.secrets.yaml` (Private)

Contains sensitive provider credentials. Add this file to your `.gitignore`! `kothaset init` creates it with owner-only permissions (0600 on Unix-like systems).

```
providers:
  - name: openai
    type: openai
    api_key: env.OPENAI_API_KEY  # Reads from environment variable
    # api_key: sk-...            # Or hardcode key directly
    timeout: 1m
    rate_limit:
      requests_per_minute: 60

  # Custom endpoint example (DeepSeek, vLLM)
  - name: local
    type: openai
    base_url: http://localhost:8000/v1
    api_key: not-needed
```
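The `env.OPENAI_API_KEY` value above refers to an environment variable rather than holding the key itself. The sketch below illustrates that convention only; it is not KothaSet's code. The function name `resolve_secret` and its error behavior are assumptions made for the example.

```python
import os


def resolve_secret(value, environ=None):
    """Resolve an `env.NAME` reference; return any other value unchanged.

    Hypothetical helper: it mirrors the `env.` prefix convention shown in
    `.secrets.yaml`, not KothaSet's actual implementation.
    """
    environ = os.environ if environ is None else environ
    prefix = "env."
    if value.startswith(prefix):
        name = value[len(prefix):]
        if name not in environ:
            raise KeyError(f"environment variable {name} is not set")
        return environ[name]
    return value  # literal value such as a hardcoded key or "not-needed"


# An explicit mapping stands in for the real environment in this example.
print(resolve_secret("env.OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-example"}))
print(resolve_secret("not-needed", {}))
```

A check like this can be run before a long generation job to catch an unset variable early.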

`rate_limit.requests_per_minute` is actively enforced during generation. Lower values reduce request throughput.
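To get a feel for how this limit bounds a run, the following sketch estimates a lower bound on wall-clock time. It assumes one API request per sample and ignores latency, retries, and the `concurrency` setting, none of which this README quantifies. The function name `min_runtime_minutes` is hypothetical.

```python
def min_runtime_minutes(num_samples, requests_per_minute, requests_per_sample=1):
    """Lower bound on run time in minutes under a requests-per-minute cap.

    Assumption: each sample costs `requests_per_sample` requests (1 by
    default); real runs may take longer because of latency and retries.
    """
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return num_samples * requests_per_sample / requests_per_minute


# 1000 samples at 60 requests per minute: roughly 16.7 minutes at best.
print(round(min_runtime_minutes(1000, 60), 1))
```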

Usage

Selecting a Schema

| Schema | Description | Use Case |
|--------|-------------|----------|
| `instruction` | Alpaca-style `{instruction, input, output}` | SFT |
| `chat` | ShareGPT multi-turn conversations | Chat fine-tuning |
| `preference` | `{prompt, chosen, rejected}` pairs | DPO/RLHF |
| `classification` | `{text, label}` pairs | Classifiers |

```
# Instruction dataset
kothaset generate -n 1000 -s instruction --seed 42 -i topics.txt -o instructions.jsonl

# Chat conversations
kothaset generate -n 500 -s chat --seed 123 -i conversations.txt -o conversations.jsonl

# Preference pairs for DPO
kothaset generate -n 500 -s preference --seed 456 -i pairs.txt -o dpo_data.jsonl
```
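As a quick sanity check on generated files, the sketch below scans JSONL lines for the fields named in the schema table. It is an illustration, not part of KothaSet: the helper `find_invalid_records` is hypothetical, extra fields are tolerated because the README does not say whether records carry additional metadata, and the `chat` schema is left out because its field names are not listed here.

```python
import json

# Field names copied from the schema table above.
REQUIRED_FIELDS = {
    "instruction": {"instruction", "input", "output"},
    "preference": {"prompt", "chosen", "rejected"},
    "classification": {"text", "label"},
}


def find_invalid_records(lines, schema):
    """Return (line_number, problem) pairs for records that fail the check."""
    required = REQUIRED_FIELDS[schema]
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are skipped, not reported
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = required - set(record)
        if missing:
            problems.append((number, "missing fields: " + ", ".join(sorted(missing))))
    return problems


sample = [
    '{"instruction": "Add 2 and 3", "input": "", "output": "5"}',
    '{"instruction": "No output here"}',
]
print(find_invalid_records(sample, "instruction"))
```

Because the function accepts any iterable of strings, an open file handle for a generated `.jsonl` file can be passed in place of `sample`.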

Output Formats

```
# JSONL (default)
kothaset generate -n 100 --seed 42 -i topics.txt -f jsonl -o dataset.jsonl
```

`kothaset generate` automatically creates parent directories for `--output` paths (for example, `-o output/data/dataset.jsonl`).

Advanced Options

```
# Use custom provider
kothaset generate -n 100 --seed 42 -i topics.txt -p local -o dataset.jsonl

# Control diversity with input file
kothaset generate -n 1000 --seed 42 -i topics.txt -o diverse.jsonl

# Resume interrupted generation
# (use the exact checkpoint filename from `.kothaset/`)
kothaset generate --resume .kothaset/<checkpoint-file>.checkpoint

# Dry run (validate config)
kothaset generate --dry-run -n 100 --seed 42 -i topics.txt
```
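Resuming requires the exact checkpoint filename from `.kothaset/`. The sketch below is one way to look it up: it lists files ending in `.checkpoint` and picks the most recently modified one. Treating the newest file as the one to resume is an assumption, and `latest_checkpoint` is a hypothetical helper, not a KothaSet command.

```python
from pathlib import Path


def latest_checkpoint(directory=".kothaset"):
    """Return the most recently modified *.checkpoint file, or None."""
    root = Path(directory)
    if not root.is_dir():
        return None
    candidates = sorted(root.glob("*.checkpoint"), key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None


checkpoint = latest_checkpoint()
if checkpoint is None:
    print("no checkpoint found")
else:
    # Print the resume command rather than running it.
    print(f"kothaset generate --resume {checkpoint}")
```

If several runs share the same directory, check the printed filename against the run you intend to continue before resuming.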

Documentation

Getting Started

Reference

Help

Contributing

Contributions welcome! See CONTRIBUTING.md.

License

Apache 2.0 License. See LICENSE.
