UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.
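UQ-based hallucination detection commonly works by sampling several responses to the same prompt and measuring how consistent they are; the sketch below illustrates that idea generically. The `generate` placeholder and the use of `SequenceMatcher` are assumptions for illustration, not UQLM's actual API.

```python
# Generic sketch of consistency-based uncertainty scoring for hallucination
# detection. `generate(prompt)` is a placeholder for any LLM call; it is an
# assumption for illustration, not UQLM's API.
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def consistency_score(prompt: str, n_samples: int = 5) -> float:
    """Sample several responses and return mean pairwise similarity in [0, 1].
    Low consistency across samples is a common signal of hallucination."""
    responses = [generate(prompt) for _ in range(n_samples)]
    return mean(SequenceMatcher(None, a, b).ratio()
                for a, b in combinations(responses, 2))
```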
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
Ranking LLMs on agentic tasks
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
Benchmark and evaluate generative research synthesis.
One click to open multiple AI sites and view their results.
Code scanner to check for issues in prompts and LLM calls
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
Example projects integrated with the Future AGI tech stack for easy AI development.
☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications.
Cost-of-Pass: An Economic Framework for Evaluating Language Models
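The framework's headline quantity is the expected inference cost of obtaining one correct answer, i.e. cost per attempt divided by pass rate; a minimal sketch of that calculation (the variable names and example numbers are illustrative):

```python
def cost_of_pass(cost_per_attempt_usd: float, pass_rate: float) -> float:
    """Expected cost of one correct answer: cost per attempt / pass rate."""
    if pass_rate <= 0:
        return float("inf")  # the model never solves the task
    return cost_per_attempt_usd / pass_rate

# A model costing $0.03 per attempt with a 50% pass rate:
print(cost_of_pass(0.03, 0.5))  # 0.06 -> six cents per correct answer
```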
Running UK AISI's Inspect in the Cloud
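For readers who have not seen Inspect before, a minimal `inspect_ai` task looks roughly like the sketch below; the sample content, file name, and model are placeholders, and the exact API (e.g. `solver=` vs. the older `plan=`) varies across inspect_ai versions.

```python
# minimal_task.py -- a tiny Inspect (inspect_ai) evaluation sketch.
# Sample content and model choice are illustrative placeholders.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def arithmetic_check():
    return Task(
        dataset=[Sample(input="What is 17 + 25? Reply with the number only.",
                        target="42")],
        solver=generate(),
        scorer=match(),
    )
```

It would then be run with something like `inspect eval minimal_task.py --model openai/gpt-4o-mini`; running it in the cloud is largely a question of where that command executes and how credentials are provisioned.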
Adaptive Testing Framework for AI Models (Psychometrics in AI Evaluation)
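Psychometric adaptive testing typically fits an item response model to benchmark items and then serves each model the item that is most informative at its current ability estimate; the sketch below uses the standard two-parameter logistic (2PL) model and is a generic illustration, not this framework's API.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability that a model with ability
    theta answers an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def next_item(theta: float, items: list[tuple[float, float]]) -> int:
    """Index of the item that is most informative at the current ability."""
    return max(range(len(items)),
               key=lambda i: item_information(theta, *items[i]))

# Three items with equal discrimination and difficulties -1.0, 0.0, 1.5:
# at theta = 0 the item whose difficulty matches theta is selected.
print(next_item(0.0, [(1.0, -1.0), (1.0, 0.0), (1.0, 1.5)]))  # 1
```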
A playbook and Colab-based demo for helping engineering teams adopt Generative AI. Includes a working RAG pipeline using LangChain + Chroma, OpenAI GPT-4o embeddings, prompt engineering best practices, and automated LLM evaluations with Ragas.
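A stripped-down version of such a pipeline might look like the sketch below; the import paths, model names, and the Ragas call are assumptions that differ across LangChain and Ragas versions, so treat it as orientation rather than the repo's actual code.

```python
# Sketch: minimal RAG with LangChain + Chroma, then a Ragas evaluation.
# Import paths and model names are illustrative and version-dependent.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

docs = ["Our SLA guarantees 99.9% uptime.", "Support hours are 9am-5pm UTC."]
vectorstore = Chroma.from_texts(docs, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
llm = ChatOpenAI(model="gpt-4o")

question = "What uptime does the SLA promise?"
contexts = [d.page_content for d in retriever.invoke(question)]
answer = llm.invoke(
    f"Answer using only this context:\n{contexts}\n\nQuestion: {question}"
).content

# Automated evaluation with Ragas (metric names per the 0.1.x-style API).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

ds = Dataset.from_dict(
    {"question": [question], "answer": [answer], "contexts": [contexts]}
)
print(evaluate(ds, metrics=[faithfulness, answer_relevancy]))
```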
LLM-as-a-judge for Extractive QA datasets
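For extractive QA, LLM-as-a-judge usually means asking a strong model whether the predicted span is equivalent to the gold answer in context; a hedged sketch using the OpenAI client, where the judge model and prompt wording are assumptions:

```python
# Sketch of an LLM-as-a-judge check for extractive QA predictions.
from openai import OpenAI

client = OpenAI()

def judge_answer(context: str, question: str, gold: str, predicted: str) -> bool:
    prompt = (
        "You are grading an extractive QA system.\n"
        f"Context: {context}\nQuestion: {question}\n"
        f"Gold answer: {gold}\nPredicted answer: {predicted}\n"
        "Is the predicted answer equivalent to the gold answer given the "
        "context? Reply with exactly CORRECT or INCORRECT."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().upper() == "CORRECT"
```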
A Python library providing evaluation metrics to compare generated texts from LLMs, often against reference texts. Features streamlined workflows for model comparison and visualization.
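Reference-based comparison of generated text usually comes down to overlap metrics such as token-level F1 or ROUGE; the sketch below implements a plain token-level F1 as a generic illustration, not that library's actual interface.

```python
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    """Token-level F1 between a generated text and a reference text."""
    gen, ref = generated.lower().split(), reference.lower().split()
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital of France",
               "The capital of France is Paris"))  # 1.0
```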
A comprehensive, implementation-focused guide to evaluating Large Language Models, RAG systems, and Agentic AI in production environments.
JudgeGPT: An empirical research platform for evaluating the authenticity of AI-generated news.
Comprehensive AI evaluation framework with advanced techniques, including Probability-Weighted Scoring. Supports multiple LLM providers and provides evaluation metrics for RAG systems and AI agents; see the project website for the full evaluation service.
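Probability-weighted scoring, as popularized by G-Eval-style judging, takes the expectation over candidate score tokens weighted by their probabilities rather than a single sampled score; a minimal sketch, with the logprob dictionary standing in for what an LLM API would return (a generic illustration, not this framework's implementation):

```python
import math

def probability_weighted_score(score_logprobs: dict[str, float]) -> float:
    """Expected score over candidate score tokens, weighted by probability."""
    probs = {int(token): math.exp(lp) for token, lp in score_logprobs.items()}
    total = sum(probs.values())
    return sum(score * p for score, p in probs.items()) / total

# Example: the judge puts most probability mass on "4".
print(probability_weighted_score({"3": -2.3, "4": -0.4, "5": -1.6}))  # ~4.1
```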
prompt-evaluator is an open-source toolkit for evaluating, testing, and comparing LLM prompts. It provides a GUI-driven workflow for running prompt tests, tracking token usage, visualizing results, and ensuring reliability across OpenAI, Claude, and Gemini models.