KothaSet

KothaSet is a CLI tool for generating high-quality datasets using Large Language Models (LLMs) as teacher models. Use it to create diverse training data for fine-tuning smaller models.

Features

  • Multi-Provider: OpenAI and OpenAI-compatible APIs (DeepSeek, vLLM, Ollama)
  • Flexible Schemas: Instruction (Alpaca), Chat (ShareGPT), Preference (DPO), Classification
  • Streaming Output: Real-time generation with progress tracking
  • Resumable: Atomic checkpointing, so interrupted runs can pick up where they stopped
  • JSONL Output: Streaming writes in standard JSONL format
  • Reproducible: Optional fixed seed for best-effort repeatable runs
  • Diversity Control: Input files for sequential topic coverage
  • Validation: Validate configs, schemas, datasets, and provider connectivity

Installation

pip (Python)

pip install kothaset

npm (Node.js)

npm install -g kothaset

Homebrew (macOS/Linux)

brew install shantoislamdev/tap/kothaset

Binary Download

Download from GitHub Releases.

From Source

go install github.com/shantoislamdev/kothaset/cmd/kothaset@latest

Quick Start

  1. Initialize configuration:

    kothaset init

  2. Set your API key:

    # Windows PowerShell
    $env:OPENAI_API_KEY = "sk-..."

    # Linux/macOS
    export OPENAI_API_KEY="sk-..."

  3. Generate a dataset:

    kothaset generate -n 100 -s instruction --seed 42 -i topics.txt -o dataset.jsonl
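
The generate command above reads topics from topics.txt. As a minimal sketch of preparing that file, the Python snippet below writes a de-duplicated topic list. It assumes the input file holds one topic per line, which this README does not spell out, and the helper name write_topics is hypothetical rather than part of KothaSet.

```python
from pathlib import Path


def write_topics(path, topics):
    """Write unique, non-empty topics to `path`, one per line, keeping order.

    Assumption: KothaSet reads one topic per line from the -i file.
    """
    seen = set()
    cleaned = []
    for topic in topics:
        topic = topic.strip()
        key = topic.lower()  # treat topics differing only by case as duplicates
        if topic and key not in seen:
            seen.add(key)
            cleaned.append(topic)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


if __name__ == "__main__":
    written = write_topics("topics.txt", [
        "Python list comprehensions",
        "Explaining recursion to beginners",
        "python list comprehensions",  # case-insensitive duplicate, dropped
        "",                            # blank entry, dropped
        "SQL joins",
    ])
    print(f"Wrote {len(written)} topics to topics.txt")
```

Dropping duplicates up front keeps repeated topics from taking up slots in the sequential topic coverage.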

Configuration

KothaSet uses a two-file configuration system for better security and organization:

1. kothaset.yaml (Public)

Contains shared settings, context, and instructions. Safe to commit to git.

version: "1.0"
global:
  provider: openai
  schema: instruction
  model: gpt-5.2
  concurrency: 4
  output_dir: ./output
  checkpoint_every: 10  # Save checkpoint every N samples (default: 10)

# Context: Background info or persona injected into every prompt
context: |
  Generate high-quality training data for an AI assistant.
  The data should be helpful, accurate, and well-formatted.

# Instructions: Specific rules and guidelines for generation
instructions:
  - Be creative and diverse in topics and approaches
  - Vary the style and complexity of responses
  - Use clear and concise language

2. .secrets.yaml (Private)

Contains sensitive provider credentials. Add this to your .gitignore! kothaset init creates this file with owner-only permissions (0600 on Unix-like systems).

providers:
  - name: openai
    type: openai
    api_key: env.OPENAI_API_KEY  # Reads from environment variable
    # api_key: sk-...            # Or hardcode key directly
    timeout: 1m
    rate_limit:
      requests_per_minute: 60

  # Custom endpoint example (DeepSeek, vLLM)
  - name: local
    type: openai
    base_url: http://localhost:8000/v1
    api_key: not-needed
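
The api_key comments above describe two forms: an env.NAME reference that reads from an environment variable, and a literal key. The Python sketch below illustrates that convention only. It is not KothaSet's actual implementation, and the function name resolve_api_key is hypothetical.

```python
import os

ENV_PREFIX = "env."


def resolve_api_key(value, environ=None):
    """Return the key referenced by `value`.

    "env.NAME" is looked up in the environment; anything else is
    treated as a literal key (for example "not-needed" for a local server).
    """
    environ = os.environ if environ is None else environ
    if value.startswith(ENV_PREFIX):
        name = value[len(ENV_PREFIX):]
        if name not in environ:
            raise KeyError(f"environment variable {name} is not set")
        return environ[name]
    return value


if __name__ == "__main__":
    fake_env = {"OPENAI_API_KEY": "sk-example"}
    print(resolve_api_key("env.OPENAI_API_KEY", fake_env))
    print(resolve_api_key("not-needed", fake_env))
```

Using the env. form means .secrets.yaml never contains the key itself, even though the file is private.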

rate_limit.requests_per_minute is actively enforced during generation. Lower values reduce request throughput.
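
To estimate how this cap bounds a run, the Python sketch below computes the shortest possible wall-clock time for a given number of requests. It assumes one API request per generated sample and ignores retries and model latency, so treat the result as a lower bound rather than a prediction. The function names are hypothetical.

```python
def min_interval_seconds(requests_per_minute):
    """Average spacing between requests implied by a per-minute cap."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return 60.0 / requests_per_minute


def min_duration_seconds(total_requests, requests_per_minute):
    """Lower bound on run time; assumes one request per sample."""
    return total_requests * min_interval_seconds(requests_per_minute)


if __name__ == "__main__":
    for rpm in (60, 30):
        seconds = min_duration_seconds(1000, rpm)
        print(f"{rpm} requests/min -> at least {seconds / 60:.1f} minutes for 1000 requests")
```

Under these assumptions, raising concurrency would not shorten the run below this bound, because the per-minute cap limits how many requests can start.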


Usage

Selecting a Schema

Schema          Description                                  Use Case
--------------  -------------------------------------------  ----------------
instruction     Alpaca-style {instruction, input, output}    SFT
chat            ShareGPT multi-turn conversations            Chat fine-tuning
preference      {prompt, chosen, rejected} pairs             DPO/RLHF
classification  {text, label} pairs                          Classifiers

# Instruction dataset
kothaset generate -n 1000 -s instruction --seed 42 -i topics.txt -o instructions.jsonl

# Chat conversations
kothaset generate -n 500 -s chat --seed 123 -i conversations.txt -o conversations.jsonl

# Preference pairs for DPO  
kothaset generate -n 500 -s preference --seed 456 -i pairs.txt -o dpo_data.jsonl
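
After generating a file, it can be useful to spot-check that every record carries the fields listed for its schema. The Python sketch below does that for JSONL lines, using the field names from the schema table. It is independent of KothaSet's built-in validation, and the chat schema is left out because its exact field layout is not given here. The function name find_invalid_records is hypothetical.

```python
import json

# Field names taken from the schema table; records may carry extra fields.
REQUIRED_FIELDS = {
    "instruction": {"instruction", "input", "output"},
    "preference": {"prompt", "chosen", "rejected"},
    "classification": {"text", "label"},
}


def find_invalid_records(lines, schema):
    """Return (line_number, problem) pairs for records that fail the check."""
    required = REQUIRED_FIELDS[schema]
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = required - set(record)
        if missing:
            problems.append((number, "missing fields: " + ", ".join(sorted(missing))))
    return problems


if __name__ == "__main__":
    sample = [
        '{"prompt": "p", "chosen": "good answer", "rejected": "bad answer"}',
        '{"prompt": "p", "chosen": "good answer"}',
    ]
    for number, problem in find_invalid_records(sample, "preference"):
        print(f"line {number}: {problem}")
```

To check a real file, pass an open file handle as lines, for example open("dpo_data.jsonl", encoding="utf-8").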

Output Formats

# JSONL (default)
kothaset generate -n 100 --seed 42 -i topics.txt -f jsonl -o dataset.jsonl

kothaset generate automatically creates parent directories for --output paths (for example, -o output/data/dataset.jsonl).

Advanced Options

# Use custom provider
kothaset generate -n 100 --seed 42 -i topics.txt -p local -o dataset.jsonl

# Control diversity with input file
kothaset generate -n 1000 --seed 42 -i topics.txt -o diverse.jsonl

# Resume interrupted generation
# (use the exact checkpoint filename from `.kothaset/`)
kothaset generate --resume .kothaset/<checkpoint-file>.checkpoint

# Dry run (validate config)
kothaset generate --dry-run -n 100 --seed 42 -i topics.txt
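
The resume command needs the exact checkpoint filename from .kothaset/. As a small convenience, the Python sketch below finds the most recently modified *.checkpoint file and prints the matching command. It assumes the newest file by modification time is the one to resume, which may not hold if several runs share the directory. The function name latest_checkpoint is hypothetical.

```python
from pathlib import Path


def latest_checkpoint(directory=".kothaset"):
    """Return the most recently modified *.checkpoint file, or None."""
    candidates = sorted(
        Path(directory).glob("*.checkpoint"),
        key=lambda path: path.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    checkpoint = latest_checkpoint()
    if checkpoint is None:
        print("No checkpoint files found in .kothaset/")
    else:
        print(f"kothaset generate --resume {checkpoint}")
```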

Documentation

Getting Started

Reference

Help


Contributing

Contributions welcome! See CONTRIBUTING.md.

License

Apache 2.0 License. See LICENSE.
