prompt-optimizer-cli

A CLI tool and Python library for optimizing LLM prompts through systematic testing, version control, and performance metrics. Think "pytest for prompts" - test multiple prompt variations, measure quality, and automatically select the best performer.

Features

  • Prompt Testing: Run multiple prompt variations against test cases
  • Quality Metrics: Score outputs on accuracy, conciseness, tone, and cost
  • LLM-as-Judge: Evaluate outputs with a second LLM acting as the judge
  • Prometheus Metrics: Built-in observability via a Prometheus /metrics endpoint
  • Version Control: Track prompt evolution with history and diffs
  • Auto-Selection: Identify and select the best-performing prompt variant
  • CLI & Library: Use as a command-line tool or Python import
  • Multi-LLM Support: Works with Anthropic Claude, OpenAI GPT, and local Ollama models

Quick Start

# Install from PyPI
pip install prompt-optimizer-cli

# Initialize a project
prompt-optimizer init

# Optimize a prompt
prompt-optimizer optimize prompts/example.yaml \
    --test-cases tests/example_tests.yaml \
    --strategies concise,detailed \
    --llm claude-sonnet-4 \
    --output results.json

Installation

From PyPI

pip install prompt-optimizer-cli

From Source

git clone https://github.com/kmcallorum/prompt-optimizer.git
cd prompt-optimizer
pip install -e .

With Development Dependencies

pip install -e ".[dev]"

Using Docker

docker-compose build
docker-compose run prompt-optimizer --help

Usage

CLI Commands

# Initialize new project with example files
prompt-optimizer init

# Test a prompt against test cases
prompt-optimizer test prompt.yaml --test-cases tests.yaml --llm claude-sonnet-4

# Optimize with multiple strategies
prompt-optimizer optimize prompt.yaml \
    --strategies concise,detailed,cot \
    --test-cases tests.yaml \
    --llm claude-sonnet-4 \
    --output results.json

# Use LLM-as-judge for AI-powered evaluation
prompt-optimizer optimize prompt.yaml \
    --test-cases tests.yaml \
    --llm claude-sonnet-4 \
    --judge gpt-4o \
    --output results.json

# Compare two prompts
prompt-optimizer compare prompt1.yaml prompt2.yaml --test-cases tests.yaml

# View prompt history
prompt-optimizer history my-prompt

# Generate report from results
prompt-optimizer report results.json --format html --output report.html

# Display a prompt file
prompt-optimizer show prompt.yaml

Python Library

from prompt_optimizer import Prompt, TestCase, optimize_prompt

# Define a prompt
prompt = Prompt(
    template="Summarize this text in {{ length }}: {{ text }}",
    variables={"length": "one sentence", "text": ""},
    system_message="You are a helpful summarization assistant.",
    name="summarizer",
)

# Define test cases
test_cases = [
    TestCase(
        input_variables={
            "text": "Long article text here...",
            "length": "one sentence"
        },
        expected_properties={"length": "<30 words"}
    )
]

# Run optimization
results = optimize_prompt(
    prompt,
    test_cases,
    strategies=["concise", "detailed"],
    llm="claude-sonnet-4"
)

print(f"Best variant: {results.best_variant.strategy}")
print(f"Score: {results.best_weighted_score:.2%}")

File Formats

Prompt File (YAML)

template: |
  Answer the following question: {{ question }}

  Requirements:
  - Be concise
  - Be accurate

system_message: "You are a helpful AI assistant."

variables:
  question: ""

metadata:
  author: "developer"
  version: "1.0"
  tags: ["qa", "concise"]
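
If you want to build a Prompt from such a file yourself, here is a minimal sketch using PyYAML and the constructor fields shown in the Python Library section (the CLI parses these files for you, and the filename and name below are just placeholders):

import yaml

from prompt_optimizer import Prompt

with open("prompts/example.yaml") as f:
    data = yaml.safe_load(f)

# Map the file's fields onto the Prompt constructor shown earlier.
prompt = Prompt(
    template=data["template"],
    system_message=data.get("system_message", ""),
    variables=data.get("variables", {}),
    name="example",
)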

Test Cases (YAML)

name: "QA Test Suite"

test_cases:
  - input_variables:
      question: "What is the capital of France?"
    expected_output: "Paris"
    expected_properties:
      tone: "neutral"
      length: "<20 words"

  - input_variables:
      question: "Explain quantum computing"
    expected_properties:
      length: "50-150 words"
      includes: ["qubits", "superposition"]
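
From Python, the same suite maps onto TestCase objects. A sketch of that mapping (the CLI does this for you; expected_output as an optional constructor field is an assumption mirroring the YAML key):

import yaml

from prompt_optimizer import TestCase

with open("tests/example_tests.yaml") as f:
    suite = yaml.safe_load(f)

# Build one TestCase per entry, passing through only the fields the file defines.
test_cases = [
    TestCase(
        input_variables=tc["input_variables"],
        expected_output=tc.get("expected_output"),
        expected_properties=tc.get("expected_properties", {}),
    )
    for tc in suite["test_cases"]
]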

Supported LLMs

Provider    Models                               Environment Variable
Anthropic   claude-sonnet-4, claude-opus-4       ANTHROPIC_API_KEY
OpenAI      gpt-4o, gpt-4-turbo, gpt-3.5-turbo   OPENAI_API_KEY
Ollama      llama3, mistral, etc.                N/A (local)

Specify the LLM with the --llm flag:

prompt-optimizer optimize prompt.yaml --llm claude-sonnet-4
prompt-optimizer optimize prompt.yaml --llm gpt-4o
prompt-optimizer optimize prompt.yaml --llm ollama:llama3
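
The same identifiers work from the Python API, for example against a local Ollama model (reusing the prompt and test cases defined in the Python Library section):

results = optimize_prompt(
    prompt,
    test_cases,
    strategies=["concise", "detailed"],
    llm="ollama:llama3",
)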

Optimization Strategies

Strategy    Description
concise     Makes responses shorter and more direct
detailed    Adds context and thorough explanations
cot         Adds chain-of-thought reasoning
structured  Formats output with sections and bullet points
few_shot    Adds example-based prompting
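
Strategies are passed by name, comma-separated on the CLI or as a list from Python. For example, to try every built-in strategy in one run (again reusing the prompt and test cases from the Python Library section):

results = optimize_prompt(
    prompt,
    test_cases,
    strategies=["concise", "detailed", "cot", "structured", "few_shot"],
    llm="claude-sonnet-4",
)
print(f"Best variant: {results.best_variant.strategy}")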

Evaluation Criteria

Built-in scoring functions:

  • accuracy: Compares output to expected result using sequence matching
  • conciseness: Scores based on word count and length constraints
  • includes: Checks for required keywords in response

Custom evaluators can be added:

from prompt_optimizer import TestCase
from prompt_optimizer.evaluator import EVALUATORS

def custom_scorer(response: str, test_case: TestCase) -> float:
    # Your scoring logic
    return 0.8

EVALUATORS["custom"] = custom_scorer
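
As a slightly more concrete sketch, a scorer that rewards short answers (the 0.0-1.0 return range mirrors the placeholder above and is an assumption; the test_case argument is unused in this simple case):

def brevity_scorer(response: str, test_case: TestCase) -> float:
    # Full score for 30 words or fewer, decaying linearly to 0.0 at 130 words.
    words = len(response.split())
    return 1.0 if words <= 30 else max(0.0, 1.0 - (words - 30) / 100)

EVALUATORS["brevity"] = brevity_scorer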

LLM-as-Judge

Use an LLM to evaluate response quality instead of rule-based scoring:

# Use GPT-4o as judge while testing with Claude
prompt-optimizer optimize prompt.yaml \
    --test-cases tests.yaml \
    --llm claude-sonnet-4 \
    --judge gpt-4o

Or from the Python API:

from prompt_optimizer import optimize_prompt, Prompt, TestCase

results = optimize_prompt(
    prompt=my_prompt,
    test_cases=test_cases,
    llm="claude-sonnet-4",
    judge_llm="gpt-4o",  # AI-based evaluation
)

The LLM judge evaluates responses on:

  • accuracy - How well the response matches expected output
  • relevance - How on-topic the response is
  • coherence - How well-structured and logical the response is
  • completeness - Whether all aspects of the prompt are addressed
  • conciseness - Whether the response is appropriately brief

Prometheus Metrics

Built-in observability for production deployments:

# Start metrics server
prompt-optimizer metrics --port 8000

# Metrics available at http://localhost:8000/metrics

Or from the Python API:

from prompt_optimizer import init_metrics, start_http_server

# Initialize and start metrics server
init_metrics()
start_http_server(8000)

# Run optimizations - metrics are automatically recorded
results = optimize_prompt(...)

Available metrics:

  • prompt_optimizer_optimizations_total - Total optimization runs
  • prompt_optimizer_optimization_duration_seconds - Optimization duration
  • prompt_optimizer_variants_evaluated_total - Variants evaluated
  • prompt_optimizer_test_cases_run_total - Test cases run
  • prompt_optimizer_llm_requests_total - LLM API requests
  • prompt_optimizer_llm_tokens_total - Tokens used (input/output)
  • prompt_optimizer_llm_cost_usd_total - Total cost in USD
  • prompt_optimizer_best_variant_score - Best variant score
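
To collect these with Prometheus, point a scrape job at the endpoint started above; a minimal scrape config (job name and interval are placeholders):

scrape_configs:
  - job_name: "prompt-optimizer"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8000"]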

Configuration

Environment variables:

export ANTHROPIC_API_KEY=your-api-key
export OPENAI_API_KEY=your-api-key
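
A quick preflight check before kicking off a run (plain Python, nothing library-specific):

import os

# Warn early if the key for your chosen provider is missing.
for key in ("ANTHROPIC_API_KEY", "OPENAI_API_KEY"):
    if not os.environ.get(key):
        print(f"{key} is not set")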

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=src/prompt_optimizer --cov-report=html

# Lint
ruff check src tests

# Type check
mypy src

Project Structure

prompt-optimizer/
├── src/prompt_optimizer/
│   ├── __init__.py
│   ├── cli.py              # Click-based CLI
│   ├── core.py             # Core optimization logic
│   ├── prompt.py           # Prompt models
│   ├── evaluator.py        # Scoring functions
│   ├── storage.py          # Version control
│   ├── reporters.py        # Result reporting
│   └── llm_clients/        # LLM integrations
├── tests/
├── examples/
├── Dockerfile
└── docker-compose.yml

License

MIT
