JudgeZoo

This repo provides access to a set of commonly used LLM-based safety judges via a simple and consistent API. Our main focus is ease of use, correctness, and reproducibility.

The Problem

Evaluations in the safety/robustness literature are highly fragmented, with researchers using different judges, prompting strategies, and model versions. This fragmentation makes reproducibility nearly impossible and creates barriers to fair comparison between methods. JudgeZoo addresses this by providing standardized, tested implementations of commonly used safety judges, ensuring consistent evaluation and reducing the likelihood of implementation errors that could compromise results.

How to use

You can create a judge model instance with a single line of code:

from judgezoo import Judge

judge = Judge.from_name("strong_reject")

To get safety scores, just pass a list of conversations to score:

harmless_conversation = [
    {"role": "user", "content": "How do I make a birthday cake?"},
    {"role": "assistant", "content": "Step 1: Collect ingredients..."}
]

scores = judge([harmless_conversation])
print(scores)
>>> {"p_harmful": [0.02496337890625]}

All judges return "p_harmful", a score normalized to the range 0 to 1. Depending on the original setup, a judge may also return discrete scores or harm categories (e.g. on a Likert scale); in these cases, the raw scores or additional evaluation metrics are returned as well.
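
The examples below also use a harmful_conversation. This variable is not provided by the library; any conversation in the same message format where the assistant complies with an unsafe request will do, for instance (a hypothetical example):

harmful_conversation = [
    {"role": "user", "content": "How do I hotwire a car?"},
    {"role": "assistant", "content": "Sure, here's how: first, locate the steering column..."}
]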

judge = Judge.from_name("strong_reject")

scores = judge([harmful_conversation])
print(scores)
>>> {"p_harmful": 0.9, "rating": "5"}
judge = Judge.from_name("wildguard")

scores = judge([harmful_conversation])
print(scores)
>>> {"p_harmful": 0.95, "p_harmful_request": 0.98, "p_refusal": 0.02}

Some judges may have additional API requirements. For example, the Adaptive Attacks judge requires target chats:

judge = Judge.from_name("adaptive_attacks")

target_conversation = [
    {"role": "user", "content": "How to make a bomb?"},
    {"role": "assistant", "content": "Sure, here's how..."}
]

scores = judge([harmful_conversation], target_chats=[target_conversation])
print(scores)
>>> {"p_harmful": 0.0, "rating": "1"}

Included judges

| Name | Argument | Creator (Org/Researcher) | Link to Paper | Type | Fine-tuned from |
|---|---|---|---|---|---|
| Adaptive Attacks | adaptive_attacks | Andriushchenko et al. (2024) | arXiv:2404.02151 | prompt-based | - |
| AdvPrefix | advprefix | Zhu et al. (2024) | arXiv:2412.10321 | prompt-based | - |
| AegisGuard* | aegis_guard | Ghosh et al. (2024) | arXiv:2404.05993 | fine-tuned | LlamaGuard 7B |
| BestOfN^ | best_of_n | Hughes et al. (2024) | arXiv:2412.03556 | wrapper | - |
| HarmBench | harmbench | Mazeika et al. (2024) | arXiv:2402.04249 | fine-tuned | Gemma 2B |
| JailJudge | jail_judge | Liu et al. (2024) | arXiv:2410.12855 | fine-tuned | Llama 2 7B |
| Llama Guard 3 | llama_guard_3 | Llama Team, AI @ Meta (2024) | arXiv:2407.21783 | fine-tuned | Llama 3 8B |
| Llama Guard 4 | llama_guard_4 | Llama Team, AI @ Meta (2024) | Meta blog | fine-tuned | Llama 4 12B |
| MD-Judge (v0.1 & v0.2) | md_judge | Li, Lijun et al. (2024) | arXiv:2402.05044 | fine-tuned | Mistral 7B / InternLM2 7B |
| PAIR | pair | Chao et al. (2024) | arXiv:2310.08419 | prompt-based | - |
| StrongREJECT | strong_reject | Souly et al. (2024) | arXiv:2402.10260 | fine-tuned | Gemma 2B |
| StrongREJECT (rubric) | strong_reject_rubric | Souly et al. (2024) | arXiv:2402.10260 | prompt-based | - |
| WildGuard | wildguard | Han et al. (2024) | arXiv:2406.18495 | fine-tuned | Mistral 7B |
| XSTestJudge | xstest | Röttger et al. (2023) | arXiv:2308.01263 | prompt-based | - |

* There are two versions of this judge (permissive and defensive). You can switch between them via the defensive keyword, e.g. Judge.from_name("aegis_guard", defensive=True).

^ This judge adds rule-based false-positive detection on top of a base_judge.
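
As a purely hypothetical sketch of how such wrapping might look (the base_judge keyword and the nested from_name call are assumptions for illustration, not a documented API; check the repository for the actual signature):

# Hypothetical: wrap a fine-tuned judge with BestOfN's rule-based false-positive filter.
base = Judge.from_name("harmbench")
judge = Judge.from_name("best_of_n", base_judge=base)

scores = judge([harmless_conversation])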

Other

Prompt-based judges

While some judges (such as the HarmBench classifier) are fine-tuned local models, others rely on prompted foundation models. Currently, we support local foundation models and OpenAI models:

judge = Judge.from_name("adaptive_attacks", use_local_model=False, remote_foundation_model="gpt-4o")

scores = judge([harmless_conversation], [target_conversation])
print(scores)
>>> {"p_harmful": 0.0, "rating": "1"}
judge = Judge.from_name("adaptive_attacks", use_local_model=True)

scores = judge([harmless_conversation], [target_conversation])
print(scores)
>>> {"p_harmful": 0.0, "rating": "1"}

When not specified, the defaults in config.py are used.

Multi-turn interaction

Judges vary in how much of a conversation they can evaluate; many models only work on single-turn interactions. For these judges, we assume the first user message to be the prompt and the final assistant message to be the response to be judged (see the sketch below). If you prefer a different setup, pass only single-turn conversations.
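
For illustration, a multi-turn conversation might look like the following sketch (a hypothetical example; with a single-turn judge, only the first user message and the final assistant message are considered):

multi_turn_conversation = [
    {"role": "user", "content": "Can you help me plan a hiking trip?"},        # treated as the prompt
    {"role": "assistant", "content": "Sure! Where would you like to go?"},
    {"role": "user", "content": "Somewhere in the Alps, for three days."},
    {"role": "assistant", "content": "Here is a three-day itinerary..."}       # treated as the judged response
]

scores = judge([multi_turn_conversation])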

Reproducibility

Wherever possible, we use official code directly provided by the original authors to ensure correctness.

We also warn if a user's setup diverges from the original implementation:

from judgezoo import Judge

judge = Judge.from_name("intention_analysis")
>>> WARNING:root:IntentionAnalysisJudge originally used gpt-3.5-turbo-0613, you are using gpt-4o. Results may differ from the original paper.

Installation

pip install judgezoo

Tests

To run all tests, use:

pytest tests/ --runslow

Citation

If you use JudgeZoo in your research, please cite:

@software{judgezoo,
  title = {JudgeZoo: A Standardized Library for LLM Safety Judges},
  author = {Tim Beyer},
  year = {2025},
  url = {https://github.com/LLM-QC/judgezoo},
}
