Skip to content

Liang-HZ/llm-batch-classifier-public

LLM Batch Classifier

Use this project when you need to classify a large CSV/Excel file with an LLM into a fixed set of labels and you want the run to be stable, resumable, and easy to operate.

PyPI version License: MIT Python 3.11+

English | 中文

What You Can Do With It

This repository is for readers who want to:

  • Classify many rows into a fixed taxonomy with an OpenAI-compatible API
  • Resume a long run after interruption
  • Re-run an existing labeled CSV and compare old vs new labels
  • Keep API usage under control with built-in rate limiting and retry behavior

It is not a good fit for:

  • Arbitrary structured extraction with custom JSON fields
  • Open-ended generation or Q&A
  • A distributed online service with multiple machines sharing one API key

Repository Guide

Before You Start

You only need 3 things:

  • Python 3.11 or newer
  • An API key for an OpenAI-compatible endpoint
  • A CSV or Excel file with one main text column

3-Minute Start: Run the Built-In Example

If this is your first time here, do not start by writing your own config. Run the built-in example first.

git clone https://github.com/Liang-HZ/llm-batch-classifier-public.git
cd llm-batch-classifier-public
python -m pip install -e .
export LLM_API_KEY=your-api-key
llm-classify run --config examples/university-programs/classify.yaml

Windows PowerShell:

$env:LLM_API_KEY="your-api-key"

This example classifies 20 university program names into 12 public demo categories. The relevant files are:

After the run, the 3 files you care about most are:

  • output/run_TIMESTAMP_.../classification_result.csv: the final labels
  • output/run_TIMESTAMP_.../classification_report.md: a human-readable report
  • output/run_TIMESTAMP_.../run_summary.json: machine-readable run stats

5-Minute Start With Your Own Data

1. Prepare a CSV or Excel file

The simplest input looks like this:

text,context
MSc Finance,Finance master's program
MSc Computer Science,Computer science master's program
MBA,Business administration
  • text: the main field to classify
  • context: optional extra context

2. Generate a starter config

llm-classify init

This creates classify.yaml.

3. Edit only the minimum required fields

Do not try to understand every config option on day one. For your first run, focus on these:

categories:
  - "Finance"
  - "Computer Science"
  - "Management"

prompt:
  system: |
    You are a classification expert. Classify the input into these categories:
    {categories}

    Output JSON only.
    Output format:
    {{"labels": [{{"name": "Category Name", "confidence": 95, "reason": "why"}}]}}
  user: "{text} / {context}"

model:
  name: deepseek-chat
  api_base: https://api.deepseek.com/v1

input:
  file: data.csv
  text_column: text
  context_column: context

The 4 things to understand are:

  1. categories: your target labels
  2. prompt.system: tell the model it must choose from those labels
  3. model: which OpenAI-compatible API to call
  4. input: your file path and column names

If you do not have a context column:

  • set context_column to an empty string
  • change prompt.user to "{text}"

4. Run it

llm-classify run --config classify.yaml

If you want a safer first run:

llm-classify run --config classify.yaml --dry-run
llm-classify run --config classify.yaml --test 20
  • --dry-run: validate config and workload without making API calls
  • --test 20: process only the first 20 rows

5. Read the output

By default, results go into output/.

The fields most people care about are:

  • label: final label or labels, joined by |
  • confidence: highest confidence score
  • is_low_confidence: whether the result fell below your threshold
  • parse_status: parsing and validation status

If the job stops halfway through, resume it:

llm-classify run --config classify.yaml --resume

If you want to retry failures:

llm-classify retry output/run_xxx/classification_result.csv

Common Beginner Mistakes

1. 401 or 403

Usually one of these is wrong:

  • LLM_API_KEY is not set
  • model.api_base is not the correct OpenAI-compatible endpoint

2. missing columns

Your file columns do not match the config. Check:

  • input.text_column
  • input.context_column

3. Prompt template errors around JSON braces

If you include a JSON example in YAML, literal braces must be escaped:

  • {{
  • }}

Do not use bare { and } in prompt examples.

4. Too many 429 errors

Before increasing retry counts, first check:

  • rate_limit.rps
  • rate_limit.tps
  • concurrency

Being conservative is usually more stable.

Need Help?

Use these paths, depending on what you need:

How It Works, In Plain English

  1. Read your CSV/Excel file
  2. Deduplicate rows by text + optional context
  3. Send each item plus your label list to the LLM
  4. Validate whether returned labels really belong to your label set
  5. Write every item to disk immediately so the run can resume later
  6. Retry timeouts and 429s automatically, and mark bad outputs for follow-up

If you want the technical version:

  • rate limiting uses a sliding window for RPS/TPS, plus an optional coarser cycle cap
  • checkpointing writes each item immediately after the API response
  • retries distinguish transient failures from semantic failures

Window vs Cycle

These are two different controls. They are not duplicates.

  • rate_limit controls short-term pacing.
  • cycle controls a longer-period total budget.

Use rate_limit when your question is:

"How fast can I send requests right now without spiking too hard?"

  • rate_limit.rps: maximum requests allowed within the sliding window
  • rate_limit.tps: maximum estimated tokens allowed within the sliding window
  • rate_limit.window: how far back the limiter looks when counting requests or tokens
  • rate_limit.tokens_per_call: estimated tokens consumed by one request when tps is enabled

Use cycle when your question is:

"Over a longer period, how many calls am I allowed to spend in total?"

  • cycle.duration: length of one budget cycle, in seconds
  • cycle.max_calls: maximum API calls allowed in that cycle

A simple way to think about it:

  • rate_limit smooths traffic second by second
  • cycle caps the total spend over a minute, hour, or other longer interval

Example:

  • rate_limit.rps: 3 and rate_limit.window: 1 means at most 3 requests in any 1-second window
  • cycle.duration: 60 and cycle.max_calls: 180 means at most 180 calls in a 60-second cycle

For most users:

  • start with rate_limit only
  • add cycle only if your provider or your own budget is expressed as "at most N calls per minute/hour"

CLI Reference

llm-classify run --config FILE         Run batch classification
  --resume                             Resume from a previous run (append mode)
  --fresh                              Clear previous results before running
  --dry-run                            Show config and estimated work, no API calls
  --test N                             Process only the first N items
  --random N                           Sample N random items
  --concurrency N                      Override concurrency from config
  --input-csv FILE                     Use existing CSV for re-classification

llm-classify retry SOURCE              Auto-retry failed items from a result CSV
  --config FILE                        YAML config (auto-detected if omitted)
  --max-rounds N                       Maximum retry rounds (default: 3)
  --dry-run                            Show retry plan, no API calls
  --concurrency N                      Override concurrency

llm-classify init                      Generate a starter classify.yaml
  --output FILE                        Output path (default: classify.yaml)

Full Configuration Reference

Read this after your first successful run.

# LLM Batch Classifier Configuration
#
# API keys are read from environment variables, not from YAML:
#   export LLM_API_KEY=your-key
#   export OPENAI_API_KEY=your-key

# List every label the model is allowed to return.
# The returned label names must match this list exactly.
categories:
  - "Category A"
  - "Category B"
  - "Category C"

# Prompt settings used for every input row.
prompt:
  # Main system prompt.
  # {categories} is injected automatically from the list above.
  system: |
    You are a classification expert. Classify the input into these categories:
    {categories}

    Requirements:
    1. Select all matching categories with confidence scores (0-100)
    2. Only include categories with confidence >= 85
    3. Use exact category names from the list above
    4. Output JSON only

    Output format:
    {{"labels": [{{"name": "Category Name", "confidence": 95, "reason": "reason"}}]}}

  # Alternative to prompt.system for long prompts stored in a separate file.
  # system_file: prompt.txt

  # User prompt template built from your input columns.
  # Only {text} and {context} are supported placeholders.
  user: "{text} / {context}"

# Model and API endpoint settings.
model:
  # Model identifier used by your provider.
  name: deepseek-chat

  # Base URL of any OpenAI-compatible API.
  api_base: https://api.deepseek.com/v1

  # Lower values are usually more stable for classification.
  temperature: 0.1

  # Maximum output tokens allowed for one response.
  max_tokens: 500

  # Per-request timeout in seconds.
  timeout: 30

  # Number of retries for transient request failures.
  max_retries: 3

# Sliding-window rate limits for short-term pacing.
# This answers: "How fast can requests be sent right now?"
rate_limit:
  # Maximum requests per second. Use 0 to disable.
  rps: 3

  # Maximum tokens per second. Use 0 to disable.
  tps: 0

  # Sliding-window size in seconds.
  window: 1

  # Estimated tokens consumed by one request. Used when tps > 0.
  tokens_per_call: 850

# Optional longer-period call budget.
# This answers: "How many calls can I spend in total over a longer interval?"
# Set both fields to 0 to disable.
cycle:
  # Cycle length in seconds.
  duration: 60

  # Maximum API calls allowed within one cycle.
  max_calls: 180

# Backoff settings for throttling and 429 responses.
throttle:
  # Maximum number of backoff attempts.
  max_attempts: 10

  # Initial wait time in seconds.
  base_wait: 30.0

  # Upper bound for exponential backoff waits.
  max_wait: 300.0

  # Random jitter added to waits.
  jitter: 0.5

# Input file and column mapping.
input:
  # Path to the source CSV or Excel file.
  file: data.csv

  # Column that contains the main text to classify.
  text_column: text

  # Optional extra context column.
  context_column: context

# Output location and format.
output:
  # Directory where result files and reports are written.
  dir: output

  # Use auto to follow the input type, or force csv / xlsx.
  format: auto

# Labels below this confidence threshold are filtered out.
threshold: 95

# Number of requests allowed in flight at the same time.
concurrency: 15

When To Use --input-csv

If your input file already contains an old label column and you want to compare old vs new results after changing the model or prompt:

llm-classify run --config classify.yaml --input-csv old_results.csv

This adds:

  • compare_old_label
  • compare_is_match
  • classification_diff.csv

That makes it useful for prompt regression testing.

Why Use This Instead of a Simple Script

  • It is not just for row in csv: call_llm(row)
  • It is designed for long-running batch jobs
  • Rate limiting, resume, and retries are built in
  • You switch tasks by editing YAML, not code

Contributing

If you want to contribute, start with CONTRIBUTING.md.

In short:

  1. Fork the repository and create a branch
  2. Install dev dependencies: python -m pip install -e ".[dev]"
  3. Run tests: pytest
  4. Open a pull request with a clear change summary

Please also read:

License

MIT — Copyright (c) 2024 LLM Batch Classifier contributors.

About

A practical LLM tool for batch classification with rate limiting, checkpointing, and auto-retry.

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages