agent-evaluation

Here are 279 public repositories matching this topic...

coze-dev / coze-loop

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.

agent open-source playground ai monitoring evaluation openai observability agentops coze langchain llmops prompt-management llm-observability agent-evaluation eino agent-observability

Updated Jun 11, 2026
Go

Giskard-AI / giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

ai-security mlops fairness-ai responsible-ai ml-validation red-team-tools trustworthy-ai ml-testing llm ai-red-team ai-testing llmops llm-security llm-eval llm-evaluation rag-evaluation agent-evaluation

Updated Jun 11, 2026
Python

truera / trulens

Evaluation and Tracking for LLM Experiments and AI Agents

machine-learning neural-networks ai-agents explainable-ml agentops ai-monitoring ai-observability llms llmops llm-eval evals llm-evaluation agent-evaluation

Updated Jun 5, 2026
Python

mozilla-ai / any-agent

A single interface to use and evaluate different agent frameworks

ai mcp agents a2a agent-evaluation

Updated Jun 8, 2026
Python

ifixai-ai / iFixAi

Catch your AI's mistakes and blind spots before your customers or regulators do. iFixAi runs 45 inspections: 32 graded core inspections plus 13 extended inspections for frontier risks such as sabotage, sandbagging, and oversight evasion. It returns a letter grade in under 5 minutes. Industry- and model-agnostic.

Updated Jun 9, 2026
Python

TIGER-AI-Lab / ClawBench

Open-source benchmark for browser AI agents on daily tasks.

Updated Jun 6, 2026
Python

chirpz-ai / pandaprobe

Open-source agent engineering platform: traces, evals, and metrics to debug and improve your AI agents. Integrates with LangGraph, CrewAI, Claude Agent SDK, and more.

open-source monitoring self-hosted tracing crewai langgraph agentic-ai agent-evaluation agent-engineering openai-agents-sdk agent-observability claude-agent-sdk

Updated Jun 10, 2026
Python

rungalileo / agent-leaderboard

Ranking LLMs on agentic tasks

ai evaluation ai-agents synthetic-data ai-evaluation llms ai-benchmark agent-evaluation

Updated May 21, 2026
Jupyter Notebook

alphadl / AdaRubrics

AdaRubric: Adaptive Dynamic Rubric Evaluator for Agent Trajectories

rubric rlhf reward-model llm-evaluation agent-evaluation

Updated Jun 7, 2026
Python

hwfengcs / DM-Code-Agent

Lightweight, auditable Python code agent (~1500 LOC) — ReAct + Planner + Reflexion + Hybrid RAG, with SWE-bench Lite eval and trace replay.

agent mcp rag llm llm-agent react-agent agent-skills agent-evaluation reflexion-agent code-agent swe-bench

Updated Jun 4, 2026
Python

Forsy-AI / forsy-trace-skill

Open skill for capturing AI agent work as structured traces.

reinforcement-learning process-supervision ai-agents post-training tool-use trajectory-data llm-agents agent-evaluation agent-workflows agent-traces

Updated Jun 6, 2026
Python

hidai25 / eval-view

Regression testing for AI agents. Snapshot behavior, diff tool calls, and catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, and Anthropic.

python testing cli mcp evaluation pytest regression-testing ai-agents autogen llm anthropic langchain-agent openai-assistants crewai langgraph agentic-ai agent-evaluation agent-benchmark

Updated Jun 3, 2026
Python

evaleval / every_eval_ever

Every Eval Ever is a shared schema and crowdsourced eval database. It defines a standardized metadata format for storing AI evaluation results — from leaderboard scrapes and research papers to local evaluation runs — so that results from different frameworks can be compared, reproduced, and reused.

evaluations infra ai-evaluation llm-evaluation agent-evaluation

Updated Jun 1, 2026
Python

Cre4T3Tiv3 / ai-agents-reality-check

Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistical validation.

python open-source benchmarking reproducible-research statistical-analysis performance-testing network-resilience llm-agent llm-tools agent-architecture agentic-workflow agentic-ai agent-performance agent-evaluation ai-benchmarking agent-benchmark reality-check-ai-agent architectural-evaluation ensemble-coordination

Updated Apr 2, 2026
Python

samarailly51-pixel / claimpilot-harness

Crash-test insurance claim AI agents before production.

python testing insurance ai-agents prompt-injection llm-evals agent-evaluation

Updated Jun 6, 2026
Python

microsoft / ignite25-PREL13-observe-manage-and-scale-agentic-ai-apps-with-microsoft-foundry

Learn how to observe, manage, and scale agentic AI apps using Azure AI Foundry with this hands-on workshop.

observability quality-evaluation aiops distillation-model azure-openai azure-ai-search safety-evaluation azure-ai-foundry supervised-fine-tuning agent-evaluation azure-ai-foundry-models

Updated Mar 26, 2026
Jupyter Notebook

acoyfellow / cloudbox

Synthetic cloud computers for training and evaluating long-horizon agents on Cloudflare. Persona → filesystem → artifacts → collaborators → simulation → retrospective.

cloudflare alchemy eval ai-agents synthetic-data cloudflare-workers durable-objects agent-evaluation

Updated May 21, 2026
TypeScript

SparkBeyond / agentune

Tune your AI agent to best meet its KPI through a repeating cycle of analysis, improvement, and simulation.

customer-support customer-service conversational-agents ai-agents chatbot-evaluation agent-simulator kpi-analysis agent-evaluation agent-optimization sales-agents customer-facing-agents kpi-optimization

Updated Jan 14, 2026
Python

dokimos-dev / dokimos

LLM and agent evaluation for Java & Kotlin. Runs in JUnit and CI. Spring AI, LangChain4j, Koog, Embabel, and any LLM client.

Updated Jun 10, 2026
Java

chaosync-org / awesome-ai-agent-testing

🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems

testing qa benchmark machine-learning evaluation chaos artificial-intelligence chaos-monkey testing-tools awesome-list quality-assurance ai-safety ai-agents chaos-engineering llm llm-evaluation agentic-ai ai-benchmark agent-evaluation

Updated May 28, 2025
