Use this project when you need to classify a large CSV/Excel file with an LLM into a fixed set of labels and you want the run to be stable, resumable, and easy to operate.
This repository is for readers who want to:
- Classify many rows into a fixed taxonomy with an OpenAI-compatible API
- Resume a long run after interruption
- Re-run an existing labeled CSV and compare old vs new labels
- Keep API usage under control with built-in rate limiting and retry behavior
It is not a good fit for:
- Arbitrary structured extraction with custom JSON fields
- Open-ended generation or Q&A
- A distributed online service with multiple machines sharing one API key
- Start here: 3-Minute Start
- Use your own file: 5-Minute Start With Your Own Data
- Example files: examples/university-programs
- Need help or want to report something: GitHub Issues
- Contribution guide: CONTRIBUTING.md
- Security policy: SECURITY.md
- Community expectations: CODE_OF_CONDUCT.md
You only need 3 things:
- Python 3.11 or newer
- An API key for an OpenAI-compatible endpoint
- A CSV or Excel file with one main text column
If this is your first time here, do not start by writing your own config. Run the built-in example first.
git clone https://github.com/Liang-HZ/llm-batch-classifier-public.git
cd llm-batch-classifier-public
python -m pip install -e .
export LLM_API_KEY=your-api-key
llm-classify run --config examples/university-programs/classify.yaml

Windows PowerShell:
$env:LLM_API_KEY="your-api-key"

This example classifies 20 university program names into 12 public demo categories. The relevant files are:
- examples/university-programs/classify.yaml
- examples/university-programs/prompt.txt
- examples/university-programs/sample_input.csv
- examples/university-programs/README.md
After the run, the 3 files you care about most are:
- output/run_TIMESTAMP_.../classification_result.csv: the final labels
- output/run_TIMESTAMP_.../classification_report.md: a human-readable report
- output/run_TIMESTAMP_.../run_summary.json: machine-readable run stats
The simplest input looks like this:
text,context
MSc Finance,Finance master's program
MSc Computer Science,Computer science master's program
MBA,Business administration

- text: the main field to classify
- context: optional extra context
Generate a starter config:
llm-classify init

This creates classify.yaml.
Do not try to understand every config option on day one. For your first run, focus on these:
categories:
  - "Finance"
  - "Computer Science"
  - "Management"
prompt:
  system: |
    You are a classification expert. Classify the input into these categories:
    {categories}
    Output JSON only.
    Output format:
    {{"labels": [{{"name": "Category Name", "confidence": 95, "reason": "why"}}]}}
  user: "{text} / {context}"
model:
  name: deepseek-chat
  api_base: https://api.deepseek.com/v1
input:
  file: data.csv
  text_column: text
  context_column: context

The 4 things to understand are:
- categories: your target labels
- prompt.system: tell the model it must choose from those labels
- model: which OpenAI-compatible API to call
- input: your file path and column names
If you do not have a context column:
- set context_column to an empty string
- change prompt.user to "{text}"
llm-classify run --config classify.yaml

If you want a safer first run:
llm-classify run --config classify.yaml --dry-run
llm-classify run --config classify.yaml --test 20

- --dry-run: validate config and workload without making API calls
- --test 20: process only the first 20 rows
By default, results go into output/.
The fields most people care about are:
- label: final label or labels, joined by |
- confidence: highest confidence score
- is_low_confidence: whether the result fell below your threshold
- parse_status: parsing and validation status
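As a rough illustration of how these fields might be used after a run, the sketch below pulls out the rows that were flagged as low confidence. The sample rows and the exact value formats (True/False strings, ok as a status) are assumptions made for the example; check your own classification_result.csv for the real values.

```python
import csv
import io

# Illustrative sample only. Real files live under output/run_.../classification_result.csv.
# The value formats ("True"/"False", "ok") are assumptions, not documented output.
SAMPLE = """text,label,confidence,is_low_confidence,parse_status
MSc Finance,Finance,97,False,ok
MBA,Management|Finance,88,True,ok
"""

def rows_needing_review(csv_file):
    """Return rows flagged as low confidence, with the label field split on '|'."""
    flagged = []
    for row in csv.DictReader(csv_file):
        if row["is_low_confidence"].strip().lower() == "true":
            row["labels"] = row["label"].split("|") if row["label"] else []
            flagged.append(row)
    return flagged

review = rows_needing_review(io.StringIO(SAMPLE))
for row in review:
    print(row["text"], row["labels"], row["confidence"])
```

To run this against a real result file, pass an open file handle instead of the in-memory sample.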
If the job stops halfway through, resume it:
llm-classify run --config classify.yaml --resume

If you want to retry failures:
llm-classify retry output/run_xxx/classification_result.csv

Usually one of these is wrong:
- LLM_API_KEY is not set
- model.api_base is not the correct OpenAI-compatible endpoint
Your file columns do not match the config. Check:
- input.text_column
- input.context_column
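If you include a JSON example in a prompt inside your YAML, literal braces must be doubled, because single braces are reserved for placeholders such as {categories}, {text}, and {context}:
{{ and }}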
Do not use bare { and } in prompt examples.
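The doubled braces behave like the escaping rules of Python's str.format. The sketch below assumes the prompt template is rendered that way, which this README does not state explicitly. It only shows why {{ and }} come out as literal braces while {categories} is substituted.

```python
import json

# Assumption: placeholders are filled with str.format-style substitution.
template = (
    "Classify the input into these categories:\n"
    "{categories}\n"
    "Output format:\n"
    '{{"labels": [{{"name": "Category Name", "confidence": 95, "reason": "why"}}]}}'
)

rendered = template.format(categories="- Finance\n- Management")
print(rendered)

# After rendering, the last line is plain JSON with single braces.
example = json.loads(rendered.splitlines()[-1])

# A bare brace is read as a placeholder, so formatting fails.
try:
    '{"labels": []}'.format(categories="- Finance")
    bare_brace_ok = True
except (KeyError, ValueError, IndexError):
    bare_brace_ok = False

print("bare braces accepted:", bare_brace_ok)
```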
Before increasing retry counts, first check:
- rate_limit.rps
- rate_limit.tps
- concurrency
Being conservative is usually more stable.
Use these paths, depending on what you need:
- Usage questions and bug reports: GitHub Issues
- Security-sensitive reports: SECURITY.md
- Contribution rules and local setup: CONTRIBUTING.md
- Read your CSV/Excel file
- Deduplicate rows by text + optional context
- Send each item plus your label list to the LLM
- Validate whether returned labels really belong to your label set
- Write every item to disk immediately so the run can resume later
- Retry timeouts and 429s automatically, and mark bad outputs for follow-up
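The label validation step can be pictured roughly as follows. This is a minimal sketch, not the project's actual code: the function name validate_labels and the status strings are hypothetical, and the response shape follows the JSON format shown in the prompt example.

```python
import json

def validate_labels(raw_response, categories, threshold=95):
    """Parse a model response and keep only allowed labels at or above the threshold.

    Returns (labels, status). The status strings are illustrative only.
    """
    try:
        items = json.loads(raw_response)["labels"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return [], "parse_error"
    if not isinstance(items, list) or not all(isinstance(i, dict) for i in items):
        return [], "parse_error"

    allowed = set(categories)
    for item in items:
        name = item.get("name")
        # Any name outside the configured label set marks the whole output as bad.
        if not isinstance(name, str) or name not in allowed:
            return [], "invalid_label"

    kept = [
        item["name"]
        for item in items
        if isinstance(item.get("confidence"), (int, float))
        and item["confidence"] >= threshold
    ]
    return kept, "ok"

cats = ["Finance", "Management"]
good = validate_labels('{"labels": [{"name": "Finance", "confidence": 97, "reason": "x"}]}', cats)
unknown = validate_labels('{"labels": [{"name": "Banking", "confidence": 99}]}', cats)
broken = validate_labels("not json", cats)
print(good, unknown, broken)
```

Rejecting the whole response when any label is unknown is one possible policy. Dropping only the unknown label would be another.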
If you want the technical version:
- rate limiting uses a sliding window for RPS/TPS, plus an optional coarser cycle cap
- checkpointing writes each item immediately after the API response
- retries distinguish transient failures from semantic failures
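To make the checkpointing idea concrete, here is a small sketch of append-as-you-go writing with resume. It is an illustration under assumptions, not the tool's implementation: the file layout, the field names, and the helper names load_done_keys and run are all hypothetical, and the classifier is a stand-in function.

```python
import csv
import os
import tempfile

FIELDS = ["text", "context", "label"]

def load_done_keys(path):
    """Return the (text, context) pairs already written to the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {(row["text"], row["context"]) for row in csv.DictReader(f)}

def run(items, path, classify):
    """Classify items, appending each result to disk as soon as it is available."""
    done = load_done_keys(path)
    write_header = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        for text, context in items:
            if (text, context) in done:
                continue  # finished in an earlier run, or a duplicate row
            label = classify(text, context)
            writer.writerow({"text": text, "context": context, "label": label})
            f.flush()  # an interruption then loses at most the in-flight item
            done.add((text, context))

calls = []

def fake_classify(text, context):
    """Stand-in for the API call; records which items were actually sent."""
    calls.append(text)
    return "Finance"

path = os.path.join(tempfile.mkdtemp(), "checkpoint.csv")
items = [("MSc Finance", ""), ("MBA", ""), ("MSc Finance", "")]
run(items[:1], path, fake_classify)  # simulated interrupted first run
run(items, path, fake_classify)      # resume: only "MBA" is new
print(calls)  # ['MSc Finance', 'MBA']
```

Using text plus context as the key means the same check covers both resume and deduplication.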
These are two different controls. They are not duplicates.
- rate_limit controls short-term pacing.
- cycle controls a longer-period total budget.
Use rate_limit when your question is:
"How fast can I send requests right now without spiking too hard?"
- rate_limit.rps: maximum requests allowed within the sliding window
- rate_limit.tps: maximum estimated tokens allowed within the sliding window
- rate_limit.window: how far back the limiter looks when counting requests or tokens
- rate_limit.tokens_per_call: estimated tokens consumed by one request when tps is enabled
Use cycle when your question is:
"Over a longer period, how many calls am I allowed to spend in total?"
- cycle.duration: length of one budget cycle, in seconds
- cycle.max_calls: maximum API calls allowed in that cycle
A simple way to think about it:
- rate_limit smooths traffic second by second
- cycle caps the total spend over a minute, hour, or other longer interval
Example:
- rate_limit.rps: 3 and rate_limit.window: 1 means at most 3 requests in any 1-second window
- cycle.duration: 60 and cycle.max_calls: 180 means at most 180 calls in a 60-second cycle
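The two controls can be sketched together like this. The class below is a simplified model for illustration, not the project's limiter: the name Pacer is hypothetical, the cycle is assumed to be a fixed interval that resets once its duration has elapsed, and the clock is passed in explicitly so the behavior is easy to follow.

```python
from collections import deque

class Pacer:
    """Sliding-window request limit plus a per-cycle call budget (illustrative)."""

    def __init__(self, rps, window, cycle_duration, cycle_max_calls):
        self.rps = rps
        self.window = window
        self.cycle_duration = cycle_duration
        self.cycle_max_calls = cycle_max_calls
        self.recent = deque()   # timestamps of calls still inside the sliding window
        self.cycle_start = None
        self.cycle_calls = 0

    def try_acquire(self, now):
        """Return True and record the call if both limits allow it at time `now`."""
        # Drop timestamps that have fallen out of the sliding window.
        while self.recent and now - self.recent[0] >= self.window:
            self.recent.popleft()
        # Start a new budget cycle once the previous one has elapsed.
        if self.cycle_start is None or now - self.cycle_start >= self.cycle_duration:
            self.cycle_start = now
            self.cycle_calls = 0
        if len(self.recent) >= self.rps:
            return False  # short-term pacing limit reached
        if self.cycle_calls >= self.cycle_max_calls:
            return False  # longer-period budget exhausted
        self.recent.append(now)
        self.cycle_calls += 1
        return True

pacer = Pacer(rps=3, window=1, cycle_duration=60, cycle_max_calls=180)
burst = [pacer.try_acquire(now=0.0) for _ in range(4)]  # four attempts at the same instant
later = pacer.try_acquire(now=1.0)                      # the window has moved on
print(burst, later)  # [True, True, True, False] True
```

A real limiter would wait and retry instead of returning False, but the accounting is the same.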
For most users:
- start with rate_limit only
- add cycle only if your provider or your own budget is expressed as "at most N calls per minute/hour"
llm-classify run --config FILE    Run batch classification
  --resume                        Resume from a previous run (append mode)
  --fresh                         Clear previous results before running
  --dry-run                       Show config and estimated work, no API calls
  --test N                        Process only the first N items
  --random N                      Sample N random items
  --concurrency N                 Override concurrency from config
  --input-csv FILE                Use existing CSV for re-classification

llm-classify retry SOURCE         Auto-retry failed items from a result CSV
  --config FILE                   YAML config (auto-detected if omitted)
  --max-rounds N                  Maximum retry rounds (default: 3)
  --dry-run                       Show retry plan, no API calls
  --concurrency N                 Override concurrency

llm-classify init                 Generate a starter classify.yaml
  --output FILE                   Output path (default: classify.yaml)
Read this after your first successful run.
# LLM Batch Classifier Configuration
#
# API keys are read from environment variables, not from YAML:
#   export LLM_API_KEY=your-key
#   export OPENAI_API_KEY=your-key

# List every label the model is allowed to return.
# The returned label names must match this list exactly.
categories:
  - "Category A"
  - "Category B"
  - "Category C"

# Prompt settings used for every input row.
prompt:
  # Main system prompt.
  # {categories} is injected automatically from the list above.
  system: |
    You are a classification expert. Classify the input into these categories:
    {categories}
    Requirements:
    1. Select all matching categories with confidence scores (0-100)
    2. Only include categories with confidence >= 85
    3. Use exact category names from the list above
    4. Output JSON only
    Output format:
    {{"labels": [{{"name": "Category Name", "confidence": 95, "reason": "reason"}}]}}
  # Alternative to prompt.system for long prompts stored in a separate file.
  # system_file: prompt.txt
  # User prompt template built from your input columns.
  # Only {text} and {context} are supported placeholders.
  user: "{text} / {context}"

# Model and API endpoint settings.
model:
  # Model identifier used by your provider.
  name: deepseek-chat
  # Base URL of any OpenAI-compatible API.
  api_base: https://api.deepseek.com/v1
  # Lower values are usually more stable for classification.
  temperature: 0.1
  # Maximum output tokens allowed for one response.
  max_tokens: 500
  # Per-request timeout in seconds.
  timeout: 30
  # Number of retries for transient request failures.
  max_retries: 3

# Sliding-window rate limits for short-term pacing.
# This answers: "How fast can requests be sent right now?"
rate_limit:
  # Maximum requests per second. Use 0 to disable.
  rps: 3
  # Maximum tokens per second. Use 0 to disable.
  tps: 0
  # Sliding-window size in seconds.
  window: 1
  # Estimated tokens consumed by one request. Used when tps > 0.
  tokens_per_call: 850

# Optional longer-period call budget.
# This answers: "How many calls can I spend in total over a longer interval?"
# Set both fields to 0 to disable.
cycle:
  # Cycle length in seconds.
  duration: 60
  # Maximum API calls allowed within one cycle.
  max_calls: 180

# Backoff settings for throttling and 429 responses.
throttle:
  # Maximum number of backoff attempts.
  max_attempts: 10
  # Initial wait time in seconds.
  base_wait: 30.0
  # Upper bound for exponential backoff waits.
  max_wait: 300.0
  # Random jitter added to waits.
  jitter: 0.5

# Input file and column mapping.
input:
  # Path to the source CSV or Excel file.
  file: data.csv
  # Column that contains the main text to classify.
  text_column: text
  # Optional extra context column.
  context_column: context

# Output location and format.
output:
  # Directory where result files and reports are written.
  dir: output
  # Use auto to follow the input type, or force csv / xlsx.
  format: auto

# Labels below this confidence threshold are filtered out.
threshold: 95

# Number of requests allowed in flight at the same time.
concurrency: 15

If your input file already contains an old label column and you want to compare old vs new results after changing the model or prompt:
llm-classify run --config classify.yaml --input-csv old_results.csv

This adds:
- compare_old_label
- compare_is_match
- classification_diff.csv
That makes it useful for prompt regression testing.
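A comparison of this kind can be sketched as below. How the tool itself decides that two labels match is not described here, so the order-insensitive set comparison, the function name labels_match, and the sample rows are all assumptions made for illustration.

```python
def labels_match(old_label, new_label):
    """Compare two '|'-joined label strings, ignoring order and surrounding spaces."""
    def to_set(value):
        return {part.strip() for part in value.split("|") if part.strip()}
    return to_set(old_label) == to_set(new_label)

# Made-up rows standing in for an old labeled CSV and a fresh run.
rows = [
    {"text": "MSc Finance", "old": "Finance", "new": "Finance"},
    {"text": "MBA", "old": "Management|Finance", "new": "Finance|Management"},
    {"text": "MSc Computer Science", "old": "Finance", "new": "Computer Science"},
]

diff = [row["text"] for row in rows if not labels_match(row["old"], row["new"])]
match_rate = 1 - len(diff) / len(rows)
print(diff)
print(f"match rate: {match_rate:.2f}")
```

Tracking the match rate across prompt or model changes gives a quick signal of whether a change shifted the labels more than expected.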
- It is not just for row in csv: call_llm(row)
- It is designed for long-running batch jobs
- Rate limiting, resume, and retries are built in
- You switch tasks by editing YAML, not code
If you want to contribute, start with CONTRIBUTING.md.
In short:
- Fork the repository and create a branch
- Install dev dependencies: python -m pip install -e ".[dev]"
- Run tests: pytest
- Open a pull request with a clear change summary
Please also read:
MIT — Copyright (c) 2024 LLM Batch Classifier contributors.