LLM Batch Classifier

Use this project when you need to classify a large CSV/Excel file with an LLM into a fixed set of labels and you want the run to be stable, resumable, and easy to operate.

English | 中文

What You Can Do With It

This repository is for readers who want to:

Classify many rows into a fixed taxonomy with an OpenAI-compatible API
Resume a long run after interruption
Re-run an existing labeled CSV and compare old vs new labels
Keep API usage under control with built-in rate limiting and retry behavior

It is not a good fit for:

Arbitrary structured extraction with custom JSON fields
Open-ended generation or Q&A
A distributed online service with multiple machines sharing one API key

Repository Guide

Start here: 3-Minute Start
Use your own file: 5-Minute Start With Your Own Data
Example files: examples/university-programs
Need help or want to report something: GitHub Issues
Contribution guide: CONTRIBUTING.md
Security policy: SECURITY.md
Community expectations: CODE_OF_CONDUCT.md

Before You Start

You only need 3 things:

Python 3.11 or newer
An API key for an OpenAI-compatible endpoint
A CSV or Excel file with one main text column

3-Minute Start: Run the Built-In Example

If this is your first time here, do not start by writing your own config. Run the built-in example first.

git clone https://github.com/Liang-HZ/llm-batch-classifier-public.git
cd llm-batch-classifier-public
python -m pip install -e .
export LLM_API_KEY=your-api-key
llm-classify run --config examples/university-programs/classify.yaml

Windows PowerShell:

$env:LLM_API_KEY="your-api-key"

This example classifies 20 university program names into 12 public demo categories. The relevant files are in examples/university-programs.

After the run, the 3 files you care about most are:

output/run_TIMESTAMP_.../classification_result.csv: the final labels
output/run_TIMESTAMP_.../classification_report.md: a human-readable report
output/run_TIMESTAMP_.../run_summary.json: machine-readable run stats

5-Minute Start With Your Own Data

1. Prepare a CSV or Excel file

The simplest input looks like this:

text,context
MSc Finance,Finance master's program
MSc Computer Science,Computer science master's program
MBA,Business administration

text: the main field to classify
context: optional extra context

2. Generate a starter config

llm-classify init

This creates classify.yaml.

3. Edit only the minimum required fields

Do not try to understand every config option on day one. For your first run, focus on these:

categories:
  - "Finance"
  - "Computer Science"
  - "Management"

prompt:
  system: |
    You are a classification expert. Classify the input into these categories:
    {categories}

    Output JSON only.
    Output format:
    {{"labels": [{{"name": "Category Name", "confidence": 95, "reason": "why"}}]}}
  user: "{text} / {context}"

model:
  name: deepseek-chat
  api_base: https://api.deepseek.com/v1

input:
  file: data.csv
  text_column: text
  context_column: context

The 4 things to understand are:

categories: your target labels
prompt.system: the instruction that tells the model to choose only from those labels
model: which OpenAI-compatible API to call
input: your file path and column names

If you do not have a context column:

set context_column to an empty string
change prompt.user to "{text}"

4. Run it

llm-classify run --config classify.yaml

If you want a safer first run:

llm-classify run --config classify.yaml --dry-run
llm-classify run --config classify.yaml --test 20

--dry-run: validate config and workload without making API calls
--test 20: process only the first 20 rows

5. Read the output

By default, results go into output/.

The fields most people care about are:

label: final label or labels, joined by |
confidence: highest confidence score
is_low_confidence: whether the result fell below your threshold
parse_status: parsing and validation status
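
If you want to post-process the result file yourself, a minimal sketch could look like the following. The column names come from the list above. The sample rows, the "ok" status value, and the "True"/"False" text encoding of is_low_confidence are assumptions for illustration, so check them against your own classification_result.csv.

```python
import csv
import io

# Hypothetical sample; a real run writes classification_result.csv under output/.
SAMPLE = """text,label,confidence,is_low_confidence,parse_status
MSc Finance,Finance,97,False,ok
MBA,Management|Finance,96,False,ok
MSc Data Ethics,,60,True,ok
"""

def load_rows(handle):
    """Read result rows and add two convenience fields to each one."""
    rows = []
    for row in csv.DictReader(handle):
        # Multiple labels are joined by "|"; an empty cell means no label.
        row["labels"] = [part for part in row["label"].split("|") if part]
        # Assumption: the flag is serialized as the text "True" or "False".
        row["needs_review"] = row["is_low_confidence"].strip().lower() == "true"
        rows.append(row)
    return rows

rows = load_rows(io.StringIO(SAMPLE))
review_queue = [row["text"] for row in rows if row["needs_review"]]
```

To use it on a real run, replace io.StringIO(SAMPLE) with an open file handle for your result CSV.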

If the job stops halfway through, resume it:

llm-classify run --config classify.yaml --resume

If you want to retry failures:

llm-classify retry output/run_xxx/classification_result.csv

Common Beginner Mistakes

1. `401` or `403`

Usually one of these is wrong:

LLM_API_KEY is not set
model.api_base is not the correct OpenAI-compatible endpoint

2. `missing columns`

Your file columns do not match the config. Check:

input.text_column
input.context_column

3. Prompt template errors around JSON braces

If you include a JSON example in YAML, literal braces must be escaped:

{{
}}

Do not use bare { and } in prompt examples.
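
The doubled braces behave the way Python's str.format treats them. The sketch below assumes the prompt is rendered with str.format-style substitution, which matches the escaping rule above but is not confirmed here. The render helper is hypothetical.

```python
# Assumption: the template is rendered with Python's str.format-style substitution.
template = (
    "Classify the input into these categories:\n"
    "{categories}\n"
    "Output format:\n"
    '{{"labels": [{{"name": "Category Name", "confidence": 95}}]}}'
)

def render(template: str, categories: list[str]) -> str:
    """Hypothetical helper: fill {categories} and unescape doubled braces."""
    return template.format(categories="\n".join(categories))

rendered = render(template, ["Finance", "Management"])

# A bare brace is read as a placeholder, so formatting fails.
bad_template = '{"labels": []} {categories}'
try:
    bad_template.format(categories="Finance")
    bare_braces_ok = True
except (KeyError, IndexError, ValueError):
    bare_braces_ok = False
```

In the rendered prompt, each {{ and }} pair comes out as a single literal brace, while the bare-brace template raises an error before any API call is made.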

4. Too many `429` errors

Before increasing retry counts, first check:

rate_limit.rps
rate_limit.tps
concurrency

Lower values for these settings usually give a more stable run than higher retry counts.

Need Help?

Use these paths, depending on what you need:

Usage questions and bug reports: GitHub Issues
Security-sensitive reports: SECURITY.md
Contribution rules and local setup: CONTRIBUTING.md

How It Works, In Plain English

Read your CSV/Excel file
Deduplicate rows by text + optional context
Send each item plus your label list to the LLM
Validate whether returned labels really belong to your label set
Write every item to disk immediately so the run can resume later
Retry timeouts and 429s automatically, and mark bad outputs for follow-up
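
The deduplication and label-validation steps above can be sketched as follows. This is an illustration under assumptions, not the project's actual code: the function names and the status strings ("ok", "needs_follow_up", "invalid_json") are hypothetical.

```python
import json

def dedupe(rows):
    """Keep the first occurrence of each (text, context) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["text"].strip(), row.get("context", "").strip())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def validate_labels(raw_response, categories):
    """Parse the JSON reply and split label names into allowed and unknown."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return [], [], "invalid_json"
    if not isinstance(payload, dict):
        return [], [], "invalid_json"
    names = [
        item.get("name", "")
        for item in payload.get("labels", [])
        if isinstance(item, dict)
    ]
    allowed = [name for name in names if name in categories]
    unknown = [name for name in names if name not in categories]
    # Hypothetical status values; the tool's real parse_status may differ.
    status = "ok" if allowed and not unknown else "needs_follow_up"
    return allowed, unknown, status
```

A label such as "Law" that is not in your categories list ends up in the unknown list, so the row can be flagged instead of silently accepted.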

If you want the technical version:

rate limiting uses a sliding window for RPS/TPS, plus an optional coarser cycle cap
checkpointing writes each item immediately after the API response
retries distinguish transient failures from semantic failures
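
To make the checkpointing idea concrete, here is a minimal sketch of write-as-you-go with resume. It assumes a JSON Lines checkpoint keyed on the text alone, which is a simplification chosen for the example. The real tool writes CSV and deduplicates on text plus context, and fake_classify stands in for the API call.

```python
import json
import os
import tempfile

def load_done_keys(path):
    """Return the set of texts already present in the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as handle:
        return {json.loads(line)["text"] for line in handle if line.strip()}

def run(items, path, classify):
    """Classify items not yet checkpointed, appending each result at once."""
    done = load_done_keys(path)
    processed = 0
    with open(path, "a", encoding="utf-8") as handle:
        for text in items:
            if text in done:
                continue  # finished in an earlier run, skip the API call
            record = {"text": text, "label": classify(text)}
            handle.write(json.dumps(record) + "\n")
            handle.flush()  # push the record out before the next call
            processed += 1
    return processed

def fake_classify(text):
    """Hypothetical stand-in for the LLM call."""
    return "Finance" if "Finance" in text else "Management"

checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.jsonl")
first = run(["MSc Finance", "MBA"], checkpoint, fake_classify)
second = run(["MSc Finance", "MBA", "MSc Accounting and Finance"], checkpoint, fake_classify)
```

The second call only processes the one new item, which is the behavior you get from --resume after an interruption.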

Window vs Cycle

These are two different controls. They are not duplicates.

rate_limit controls short-term pacing.
cycle controls a longer-period total budget.

Use rate_limit when your question is:

"How fast can I send requests right now without spiking too hard?"

rate_limit.rps: maximum requests allowed within the sliding window
rate_limit.tps: maximum estimated tokens allowed within the sliding window
rate_limit.window: how far back the limiter looks when counting requests or tokens
rate_limit.tokens_per_call: estimated tokens consumed by one request when tps is enabled

Use cycle when your question is:

"Over a longer period, how many calls am I allowed to spend in total?"

cycle.duration: length of one budget cycle, in seconds
cycle.max_calls: maximum API calls allowed in that cycle

A simple way to think about it:

rate_limit smooths traffic second by second
cycle caps the total spend over a minute, hour, or other longer interval

Example:

rate_limit.rps: 3 and rate_limit.window: 1 means at most 3 requests in any 1-second window
cycle.duration: 60 and cycle.max_calls: 180 means at most 180 calls in a 60-second cycle

For most users:

start with rate_limit only
add cycle only if your provider or your own budget is expressed as "at most N calls per minute/hour"
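
The difference between the two controls can be sketched in a few lines. This is an illustration, not the project's implementation: the class names are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and the cycle is assumed to be a fixed interval rather than a rolling one.

```python
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` events in any `window`-second span."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.stamps = deque()

    def try_acquire(self, now):
        # Drop timestamps that have slid out of the window.
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()
        if len(self.stamps) < self.limit:
            self.stamps.append(now)
            return True
        return False

class CycleBudget:
    """Allow at most `max_calls` per fixed cycle of `duration` seconds."""

    def __init__(self, max_calls, duration):
        self.max_calls = max_calls
        self.duration = duration
        self.cycle_start = 0.0
        self.used = 0

    def try_acquire(self, now):
        if now - self.cycle_start >= self.duration:
            # Start a new cycle aligned to a multiple of the duration.
            self.cycle_start = now - (now - self.cycle_start) % self.duration
            self.used = 0
        if self.used < self.max_calls:
            self.used += 1
            return True
        return False

limiter = SlidingWindowLimiter(limit=3, window=1.0)
burst = [limiter.try_acquire(t) for t in (0.0, 0.1, 0.2, 0.3, 1.05)]

budget = CycleBudget(max_calls=2, duration=60.0)
spend = [budget.try_acquire(t) for t in (0.0, 1.0, 2.0, 61.0)]
```

With rps 3 and window 1, the fourth request inside the same second is refused and the next one is admitted once the oldest timestamp leaves the window. The cycle budget ignores pacing entirely and only counts calls until the cycle rolls over.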

CLI Reference

llm-classify run --config FILE         Run batch classification
  --resume                             Resume from a previous run (append mode)
  --fresh                              Clear previous results before running
  --dry-run                            Show config and estimated work, no API calls
  --test N                             Process only the first N items
  --random N                           Sample N random items
  --concurrency N                      Override concurrency from config
  --input-csv FILE                     Use existing CSV for re-classification

llm-classify retry SOURCE              Auto-retry failed items from a result CSV
  --config FILE                        YAML config (auto-detected if omitted)
  --max-rounds N                       Maximum retry rounds (default: 3)
  --dry-run                            Show retry plan, no API calls
  --concurrency N                      Override concurrency

llm-classify init                      Generate a starter classify.yaml
  --output FILE                        Output path (default: classify.yaml)

Full Configuration Reference

Read this after your first successful run.

# LLM Batch Classifier Configuration
#
# API keys are read from environment variables, not from YAML:
#   export LLM_API_KEY=your-key
#   export OPENAI_API_KEY=your-key

# List every label the model is allowed to return.
# The returned label names must match this list exactly.
categories:
  - "Category A"
  - "Category B"
  - "Category C"

# Prompt settings used for every input row.
prompt:
  # Main system prompt.
  # {categories} is injected automatically from the list above.
  system: |
    You are a classification expert. Classify the input into these categories:
    {categories}

    Requirements:
    1. Select all matching categories with confidence scores (0-100)
    2. Only include categories with confidence >= 85
    3. Use exact category names from the list above
    4. Output JSON only

    Output format:
    {{"labels": [{{"name": "Category Name", "confidence": 95, "reason": "reason"}}]}}

  # Alternative to prompt.system for long prompts stored in a separate file.
  # system_file: prompt.txt

  # User prompt template built from your input columns.
  # Only {text} and {context} are supported placeholders.
  user: "{text} / {context}"

# Model and API endpoint settings.
model:
  # Model identifier used by your provider.
  name: deepseek-chat

  # Base URL of any OpenAI-compatible API.
  api_base: https://api.deepseek.com/v1

  # Lower values are usually more stable for classification.
  temperature: 0.1

  # Maximum output tokens allowed for one response.
  max_tokens: 500

  # Per-request timeout in seconds.
  timeout: 30

  # Number of retries for transient request failures.
  max_retries: 3

# Sliding-window rate limits for short-term pacing.
# This answers: "How fast can requests be sent right now?"
rate_limit:
  # Maximum requests per second. Use 0 to disable.
  rps: 3

  # Maximum tokens per second. Use 0 to disable.
  tps: 0

  # Sliding-window size in seconds.
  window: 1

  # Estimated tokens consumed by one request. Used when tps > 0.
  tokens_per_call: 850

# Optional longer-period call budget.
# This answers: "How many calls can I spend in total over a longer interval?"
# Set both fields to 0 to disable.
cycle:
  # Cycle length in seconds.
  duration: 60

  # Maximum API calls allowed within one cycle.
  max_calls: 180

# Backoff settings for throttling and 429 responses.
throttle:
  # Maximum number of backoff attempts.
  max_attempts: 10

  # Initial wait time in seconds.
  base_wait: 30.0

  # Upper bound for exponential backoff waits.
  max_wait: 300.0

  # Random jitter added to waits.
  jitter: 0.5

# Input file and column mapping.
input:
  # Path to the source CSV or Excel file.
  file: data.csv

  # Column that contains the main text to classify.
  text_column: text

  # Optional extra context column.
  context_column: context

# Output location and format.
output:
  # Directory where result files and reports are written.
  dir: output

  # Use auto to follow the input type, or force csv / xlsx.
  format: auto

# Labels below this confidence threshold are filtered out.
threshold: 95

# Number of requests allowed in flight at the same time.
concurrency: 15
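
The throttle settings above describe exponential backoff with jitter. The exact formula the tool uses is not documented here, so the sketch below is an assumption: the wait doubles from base_wait on each attempt, jitter is treated as a fraction of the wait added at random, and the result is capped at max_wait. The backoff_wait function is hypothetical.

```python
import random

def backoff_wait(attempt, base_wait=30.0, max_wait=300.0, jitter=0.5, rng=random):
    """Return the wait in seconds before retry number `attempt` (0-based)."""
    wait = min(base_wait * (2 ** attempt), max_wait)
    # Assumption: jitter adds up to `jitter` * wait of random extra delay.
    wait += rng.uniform(0, jitter * wait)
    # Keep max_wait as a hard upper bound, even after jitter.
    return min(wait, max_wait)

# Deterministic part of the schedule, before jitter, for the defaults above.
schedule = [min(30.0 * 2 ** attempt, 300.0) for attempt in range(5)]
```

With the default values, the deterministic part of the schedule reaches the 300-second cap by the fifth attempt, which is why lowering rate_limit.rps or concurrency is usually cheaper than waiting through repeated backoffs.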

When To Use `--input-csv`

If your input file already contains an old label column and you want to compare old vs new results after changing the model or prompt:

llm-classify run --config classify.yaml --input-csv old_results.csv

This adds:

compare_old_label
compare_is_match
classification_diff.csv

That makes it useful for prompt regression testing.
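
As a rough picture of what the comparison gives you, the sketch below recomputes a match flag from an old and a new label column. How the tool itself decides a match is not specified here. This example assumes that two "|"-joined label strings match when they contain the same set of labels, and labels_match is a hypothetical helper.

```python
def labels_match(old_label, new_label):
    """Compare two "|"-joined label strings, ignoring order and whitespace."""
    old = {part.strip() for part in old_label.split("|") if part.strip()}
    new = {part.strip() for part in new_label.split("|") if part.strip()}
    return old == new

# Hypothetical rows shaped like the output columns named above.
rows = [
    {"text": "MSc Finance", "compare_old_label": "Finance", "label": "Finance"},
    {"text": "MBA", "compare_old_label": "Management|Finance", "label": "Finance|Management"},
    {"text": "MSc Computer Science", "compare_old_label": "Management", "label": "Computer Science"},
]
for row in rows:
    row["compare_is_match"] = labels_match(row["compare_old_label"], row["label"])

diff = [row["text"] for row in rows if not row["compare_is_match"]]
match_rate = sum(row["compare_is_match"] for row in rows) / len(rows)
```

Tracking the match rate across prompt or model changes gives you a quick regression signal before you read the individual differences.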

Why Use This Instead of a Simple Script

It is not just `for row in csv: call_llm(row)`
It is designed for long-running batch jobs
Rate limiting, resume, and retries are built in
You switch tasks by editing YAML, not code

Contributing

If you want to contribute, start with CONTRIBUTING.md.

In short:

Fork the repository and create a branch
Install dev dependencies: python -m pip install -e ".[dev]"
Run tests: pytest
Open a pull request with a clear change summary

Please also read CODE_OF_CONDUCT.md and SECURITY.md.
