UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.
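UQ-based hallucination detection commonly works by sampling several responses to the same prompt and measuring how consistent they are; the sketch below illustrates that idea generically. The `generate` placeholder and the use of `SequenceMatcher` are assumptions for illustration, not UQLM's actual API.

```python
# Generic sketch of consistency-based uncertainty scoring for hallucination
# detection. `generate(prompt)` is a placeholder for any LLM call; it is an
# assumption for illustration, not UQLM's API.
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def consistency_score(prompt: str, n_samples: int = 5) -> float:
    """Sample several responses and return mean pairwise similarity in [0, 1].
    Low consistency across samples is a common signal of hallucination."""
    responses = [generate(prompt) for _ in range(n_samples)]
    return mean(SequenceMatcher(None, a, b).ratio()
                for a, b in combinations(responses, 2))
```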
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
Ranking LLMs on agentic tasks
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
Benchmark and evaluate generative research synthesis.
One click to open multiple AI sites and view their results.
Code scanner to check for issues in prompts and LLM calls
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
Example projects integrated with the Future AGI tech stack for easy AI development.
☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications.
Cost-of-Pass: An Economic Framework for Evaluating Language Models
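The framework's headline quantity is the expected inference cost of obtaining one correct answer, i.e. cost per attempt divided by pass rate; a minimal sketch of that calculation (the variable names and example numbers are illustrative):

```python
def cost_of_pass(cost_per_attempt_usd: float, pass_rate: float) -> float:
    """Expected cost of one correct answer: cost per attempt / pass rate."""
    if pass_rate <= 0:
        return float("inf")  # the model never solves the task
    return cost_per_attempt_usd / pass_rate

# A model costing $0.03 per attempt with a 50% pass rate:
print(cost_of_pass(0.03, 0.5))  # 0.06 -> six cents per correct answer
```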
Running UK AISI's Inspect in the Cloud
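For readers who have not seen Inspect before, a minimal `inspect_ai` task looks roughly like the sketch below; the sample content, file name, and model are placeholders, and the exact API (e.g. `solver=` vs. the older `plan=`) varies across inspect_ai versions.

```python
# minimal_task.py -- a tiny Inspect (inspect_ai) evaluation sketch.
# Sample content and model choice are illustrative placeholders.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def arithmetic_check():
    return Task(
        dataset=[Sample(input="What is 17 + 25? Reply with the number only.",
                        target="42")],
        solver=generate(),
        scorer=match(),
    )
```

It would then be run with something like `inspect eval minimal_task.py --model openai/gpt-4o-mini`; running it in the cloud is largely a question of where that command executes and how credentials are provisioned.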
Adaptive Testing Framework for AI Models (Psychometrics in AI Evaluation)
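Psychometric adaptive testing typically fits an item response model to benchmark items and then serves each model the item that is most informative at its current ability estimate; the sketch below uses the standard two-parameter logistic (2PL) model and is a generic illustration, not this framework's API.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability that a model with ability
    theta answers an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def next_item(theta: float, items: list[tuple[float, float]]) -> int:
    """Index of the item that is most informative at the current ability."""
    return max(range(len(items)),
               key=lambda i: item_information(theta, *items[i]))

# Three items with equal discrimination and difficulties -1.0, 0.0, 1.5:
# at theta = 0 the item whose difficulty matches theta is selected.
print(next_item(0.0, [(1.0, -1.0), (1.0, 0.0), (1.0, 1.5)]))  # 1
```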
A playbook and Colab-based demo for helping engineering teams adopt Generative AI. Includes a working RAG pipeline using LangChain + Chroma, OpenAI GPT-4o embeddings, prompt engineering best practices, and automated LLM evaluations with Ragas.
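A stripped-down version of such a pipeline might look like the sketch below; the import paths, model names, and the Ragas call are assumptions that differ across LangChain and Ragas versions, so treat it as orientation rather than the repo's actual code.

```python
# Sketch: minimal RAG with LangChain + Chroma, then a Ragas evaluation.
# Import paths and model names are illustrative and version-dependent.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

docs = ["Our SLA guarantees 99.9% uptime.", "Support hours are 9am-5pm UTC."]
vectorstore = Chroma.from_texts(docs, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
llm = ChatOpenAI(model="gpt-4o")

question = "What uptime does the SLA promise?"
contexts = [d.page_content for d in retriever.invoke(question)]
answer = llm.invoke(
    f"Answer using only this context:\n{contexts}\n\nQuestion: {question}"
).content

# Automated evaluation with Ragas (metric names per the 0.1.x-style API).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

ds = Dataset.from_dict(
    {"question": [question], "answer": [answer], "contexts": [contexts]}
)
print(evaluate(ds, metrics=[faithfulness, answer_relevancy]))
```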
LLM-as-a-judge for Extractive QA datasets
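For extractive QA, LLM-as-a-judge usually means asking a strong model whether the predicted span is equivalent to the gold answer in context; a hedged sketch using the OpenAI client, where the judge model and prompt wording are assumptions:

```python
# Sketch of an LLM-as-a-judge check for extractive QA predictions.
from openai import OpenAI

client = OpenAI()

def judge_answer(context: str, question: str, gold: str, predicted: str) -> bool:
    prompt = (
        "You are grading an extractive QA system.\n"
        f"Context: {context}\nQuestion: {question}\n"
        f"Gold answer: {gold}\nPredicted answer: {predicted}\n"
        "Is the predicted answer equivalent to the gold answer given the "
        "context? Reply with exactly CORRECT or INCORRECT."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().upper() == "CORRECT"
```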
A Python library providing evaluation metrics to compare generated texts from LLMs, often against reference texts. Features streamlined workflows for model comparison and visualization.
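Reference-based comparison of generated text usually comes down to overlap metrics such as token-level F1 or ROUGE; the sketch below implements a plain token-level F1 as a generic illustration, not that library's actual interface.

```python
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    """Token-level F1 between a generated text and a reference text."""
    gen, ref = generated.lower().split(), reference.lower().split()
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital of France",
               "The capital of France is Paris"))  # 1.0
```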
A comprehensive, implementation-focused guide to evaluating Large Language Models, RAG systems, and Agentic AI in production environments.
JudgeGPT: An empirical research platform for evaluating the authenticity of AI-generated news.
Comprehensive AI evaluation framework with advanced techniques, including Probability-Weighted Scoring. Supports multiple LLM providers and provides evaluation metrics for RAG systems and AI agents; see the project website for the full evaluation service.
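Probability-weighted scoring, as popularized by G-Eval-style judging, takes the expectation over candidate score tokens weighted by their probabilities rather than a single sampled score; a minimal sketch, with the logprob dictionary standing in for what an LLM API would return (a generic illustration, not this framework's implementation):

```python
import math

def probability_weighted_score(score_logprobs: dict[str, float]) -> float:
    """Expected score over candidate score tokens, weighted by probability."""
    probs = {int(token): math.exp(lp) for token, lp in score_logprobs.items()}
    total = sum(probs.values())
    return sum(score * p for score, p in probs.items()) / total

# Example: the judge puts most probability mass on "4".
print(probability_weighted_score({"3": -2.3, "4": -0.4, "5": -1.6}))  # ~4.1
```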
prompt-evaluator is an open-source toolkit for evaluating, testing, and comparing LLM prompts. It provides a GUI-driven workflow for running prompt tests, tracking token usage, visualizing results, and ensuring reliability across OpenAI, Claude, and Gemini models.