Inference-time scaling for LLMs-as-a-judge.
An end-to-end AI agent project that transcribes audio files, embeds user queries, and searches both Qdrant and the web via the Brave API. A Streamlit interface powered by OpenAI GPT models delivers actionable health insights from both the archive and the latest research.
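As a rough sketch of the query path such a pipeline might use (not this repository's actual code; the collection name, embedding model, and payload field are assumptions), a user query is embedded and then matched against a Qdrant collection:

```python
# Minimal sketch: embed a user query with OpenAI and search a Qdrant collection.
# The collection name, embedding model, and payload field are assumptions.
from openai import OpenAI
from qdrant_client import QdrantClient

openai_client = OpenAI()                        # reads OPENAI_API_KEY from the environment
qdrant = QdrantClient(url="http://localhost:6333")

def search_archive(query: str, top_k: int = 5):
    """Embed the query and return the closest transcript chunks."""
    vector = openai_client.embeddings.create(
        model="text-embedding-3-small",         # assumed embedding model
        input=query,
    ).data[0].embedding
    hits = qdrant.search(
        collection_name="health_transcripts",   # hypothetical collection name
        query_vector=vector,
        limit=top_k,
    )
    return [(hit.score, hit.payload.get("text", "")) for hit in hits]

for score, text in search_archive("effects of intermittent fasting on sleep"):
    print(f"{score:.3f}  {text[:80]}")
```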
ProductionOS v1.0 — Claude Code plugin with 76 agents, 39 commands, and 12 hooks. Deploys specialized agents that review, score, and improve your entire codebase. Smart routing, recursive convergence, self-evaluation.
StructAI offers a robust toolkit for LLM interaction, including structured outputs, context management, and parallel execution.
Extensible benchmarking suite for evaluating AI coding agents on web search tasks. Compare native search vs MCP servers (You.com, expanding) across multiple agents (Claude Code, Gemini, Droid, Codex, expanding) with automated Docker workflows and statistical analysis.
A Streamlit web app that uses a Groq-powered LLM (Llama 3) to act as an impartial judge for evaluating and comparing two model outputs. Supports custom criteria, presets like creativity and brand tone, and returns structured scores, explanations, and a winner. Built end-to-end with Python, Groq API, and Streamlit.
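As an illustration of the pairwise LLM-as-a-judge pattern this kind of app implements (a minimal sketch, not this repository's code; the model id, prompt wording, and JSON schema are assumptions), the judge model is asked to compare two outputs against a criterion and return a structured verdict:

```python
# Minimal sketch of a pairwise LLM-as-a-judge call via the Groq API.
# Illustrative only: model id, prompt wording, and JSON schema are assumptions.
import json
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

def judge_pair(criterion: str, output_a: str, output_b: str) -> dict:
    """Ask the judge model to score two candidate outputs and pick a winner."""
    prompt = (
        f"You are an impartial judge. Criterion: {criterion}\n\n"
        f"Output A:\n{output_a}\n\nOutput B:\n{output_b}\n\n"
        'Reply with JSON only: {"score_a": 1-10, "score_b": 1-10, '
        '"winner": "A" | "B" | "tie", "explanation": "..."}'
    )
    resp = client.chat.completions.create(
        model="llama3-8b-8192",  # assumed model id; any Groq-hosted Llama 3 variant works
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,         # deterministic judging
    )
    # A production app would validate/repair the JSON before parsing.
    return json.loads(resp.choices[0].message.content)

print(judge_pair("brand tone", "Buy now!!!", "Discover what fits you best."))
```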
Prompt Design & LLM Judge
🤖 A conversational chatbot powered by Meta-Llama-3-8B via HuggingFace API, with TrustGuard safety validation using an LLM-as-Judge.
TrajRL-Bench: AI agent skills benchmark. SSH sandbox with mock services, LLM judge scoring, split-half delta evaluation. Leaderboard at trajrl.com/bench
Pondera is a lightweight, YAML-first framework to evaluate AI models and agents with pluggable runners and an LLM-as-a-judge.
Agent QA Mentor: an agentic QA pipeline that evaluates tool-using AI agent trajectories (scores, issue codes, safety/hallucination detection), rewrites prompts with targeted fixes, and stores long-term memory for continuous improvement—plus a CI-style eval gate and demo notebook.
LLM evaluation framework — define what correct, well-formed, and safe mean before you measure.
Autonomous overnight LLM eval pipeline for local GGUF models — multi-turn agentic tasks, dimension-routed dual-judge scoring, SQLite-backed comparison reports. Built for llama.cpp + llama-swap on dual-GPU rigs.
A graph-based evaluation framework for LLMs.
LLM-as-a-Judge system for rubric-based, explainable evaluation of large language model outputs.
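A rubric-based judge of this kind typically scores each criterion separately and attaches a rationale so the verdict is explainable and auditable. A minimal sketch of such a rubric and its aggregation, with hypothetical criteria and weights (in practice the judge LLM would fill in each score and rationale):

```python
# Minimal sketch of rubric-based, explainable scoring. The criteria, weights,
# and score range are hypothetical; a real system would have the judge LLM
# produce `score` and `rationale` for each criterion.
from dataclasses import dataclass

@dataclass
class CriterionResult:
    name: str
    weight: float      # relative importance; weights sum to 1.0
    score: float       # judge-assigned score on a 0-10 scale
    rationale: str     # short explanation making the verdict auditable

def overall_score(results: list[CriterionResult]) -> float:
    """Weighted average across rubric criteria."""
    return sum(r.weight * r.score for r in results)

results = [
    CriterionResult("correctness", 0.5, 8.0, "Answer matches the reference facts."),
    CriterionResult("completeness", 0.3, 6.0, "Misses one edge case."),
    CriterionResult("safety", 0.2, 10.0, "No unsafe content."),
]
print(f"overall: {overall_score(results):.1f}/10")
```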
Eval-driven Customer Support FTE using OpenAI Agents SDK. Multi-agent routing, guardrails, and systematic quality evaluation.
Red-team framework for discovering alignment failures in frontier language models.
Delivers predictive litigation modeling and outcome simulation with enterprise-grade legal analytics for high-stakes trial intelligence.
OpenJudges is an interactive CLI tool that uses LLMs as judges to evaluate AI responses against specific criteria