Inference-time scaling for LLMs-as-a-judge.
An end-to-end AI agent project that transcribes audio files, embeds user queries, and searches both Qdrant and the web via the Brave API. A Streamlit interface powered by OpenAI GPT models delivers actionable health insights from both the archive and the latest research.
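As a rough sketch of the query path such a pipeline might use (not this repository's actual code; the collection name, embedding model, and payload field are assumptions), a user query is embedded and then matched against a Qdrant collection:

```python
# Minimal sketch: embed a user query with OpenAI and search a Qdrant collection.
# The collection name, embedding model, and payload field are assumptions.
from openai import OpenAI
from qdrant_client import QdrantClient

openai_client = OpenAI()                        # reads OPENAI_API_KEY from the environment
qdrant = QdrantClient(url="http://localhost:6333")

def search_archive(query: str, top_k: int = 5):
    """Embed the query and return the closest transcript chunks."""
    vector = openai_client.embeddings.create(
        model="text-embedding-3-small",         # assumed embedding model
        input=query,
    ).data[0].embedding
    hits = qdrant.search(
        collection_name="health_transcripts",   # hypothetical collection name
        query_vector=vector,
        limit=top_k,
    )
    return [(hit.score, hit.payload.get("text", "")) for hit in hits]

for score, text in search_archive("effects of intermittent fasting on sleep"):
    print(f"{score:.3f}  {text[:80]}")
```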
ProductionOS v1.0 — Claude Code plugin with 76 agents, 39 commands, and 12 hooks. Deploys specialized agents that review, score, and improve your entire codebase. Smart routing, recursive convergence, self-evaluation.
StructAI offers a robust toolkit for LLM interaction, including structured outputs, context management, and parallel execution.
Extensible benchmarking suite for evaluating AI coding agents on web search tasks. Compare native search vs MCP servers (You.com, expanding) across multiple agents (Claude Code, Gemini, Droid, Codex, expanding) with automated Docker workflows and statistical analysis.
A Streamlit web app that uses a Groq-powered LLM (Llama 3) to act as an impartial judge for evaluating and comparing two model outputs. Supports custom criteria, presets like creativity and brand tone, and returns structured scores, explanations, and a winner. Built end-to-end with Python, Groq API, and Streamlit.
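As an illustration of the pairwise LLM-as-a-judge pattern this kind of app implements (a minimal sketch, not this repository's code; the model id, prompt wording, and JSON schema are assumptions), the judge model is asked to compare two outputs against a criterion and return a structured verdict:

```python
# Minimal sketch of a pairwise LLM-as-a-judge call via the Groq API.
# Illustrative only: model id, prompt wording, and JSON schema are assumptions.
import json
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

def judge_pair(criterion: str, output_a: str, output_b: str) -> dict:
    """Ask the judge model to score two candidate outputs and pick a winner."""
    prompt = (
        f"You are an impartial judge. Criterion: {criterion}\n\n"
        f"Output A:\n{output_a}\n\nOutput B:\n{output_b}\n\n"
        'Reply with JSON only: {"score_a": 1-10, "score_b": 1-10, '
        '"winner": "A" | "B" | "tie", "explanation": "..."}'
    )
    resp = client.chat.completions.create(
        model="llama3-8b-8192",  # assumed model id; any Groq-hosted Llama 3 variant works
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,         # deterministic judging
    )
    # A production app would validate/repair the JSON before parsing.
    return json.loads(resp.choices[0].message.content)

print(judge_pair("brand tone", "Buy now!!!", "Discover what fits you best."))
```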
Prompt Design & LLM Judge
🤖 A conversational chatbot powered by Meta-Llama-3-8B via HuggingFace API, with TrustGuard safety validation using an LLM-as-Judge.
TrajRL-Bench: AI agent skills benchmark. SSH sandbox with mock services, LLM judge scoring, split-half delta evaluation. Leaderboard at trajrl.com/bench
Pondera is a lightweight, YAML-first framework to evaluate AI models and agents with pluggable runners and an LLM-as-a-judge.
Agent QA Mentor: an agentic QA pipeline that evaluates tool-using AI agent trajectories (scores, issue codes, safety/hallucination detection), rewrites prompts with targeted fixes, and stores long-term memory for continuous improvement—plus a CI-style eval gate and demo notebook.
LLM evaluation framework — define what correct, well-formed, and safe mean before you measure.
Autonomous overnight LLM eval pipeline for local GGUF models — multi-turn agentic tasks, dimension-routed dual-judge scoring, SQLite-backed comparison reports. Built for llama.cpp + llama-swap on dual-GPU rigs.
A graph-based evaluation framework for LLMs.
LLM-as-a-Judge system for rubric-based, explainable evaluation of large language model outputs.
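A rubric-based judge of this kind typically scores each criterion separately and attaches a rationale so the verdict is explainable and auditable. A minimal sketch of such a rubric and its aggregation, with hypothetical criteria and weights (in practice the judge LLM would fill in each score and rationale):

```python
# Minimal sketch of rubric-based, explainable scoring. The criteria, weights,
# and score range are hypothetical; a real system would have the judge LLM
# produce `score` and `rationale` for each criterion.
from dataclasses import dataclass

@dataclass
class CriterionResult:
    name: str
    weight: float      # relative importance; weights sum to 1.0
    score: float       # judge-assigned score on a 0-10 scale
    rationale: str     # short explanation making the verdict auditable

def overall_score(results: list[CriterionResult]) -> float:
    """Weighted average across rubric criteria."""
    return sum(r.weight * r.score for r in results)

results = [
    CriterionResult("correctness", 0.5, 8.0, "Answer matches the reference facts."),
    CriterionResult("completeness", 0.3, 6.0, "Misses one edge case."),
    CriterionResult("safety", 0.2, 10.0, "No unsafe content."),
]
print(f"overall: {overall_score(results):.1f}/10")
```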
Eval-driven Customer Support FTE using OpenAI Agents SDK. Multi-agent routing, guardrails, and systematic quality evaluation.
Red-team framework for discovering alignment failures in frontier language models.
Delivers predictive litigation modeling and outcome simulation with enterprise-grade legal analytics for high-stakes trial intelligence.
OpenJudges is an interactive CLI tool that uses LLMs as judges to evaluate AI responses against specific criteria