# AgentDebug

**Where LLM Agents Fail and How They Can Learn From Failures**

AgentDebug is a framework for understanding, detecting, and recovering from LLM agent failures. It provides:

1. **AgentErrorTaxonomy**: A classification system covering 17 error types across 5 modules (memory, reflection, planning, action, system).
2. **AgentErrorBench**: Annotated failure trajectories from ALFWorld, GAIA, and WebShop environments.
3. **AgentDebug Framework**: A two-stage debugging pipeline that isolates root-cause failures and provides corrective feedback.

## Installation

```bash
git clone https://github.com/ulab-uiuc/AgentDebug.git
cd AgentDebug
pip install -e .
```

## Environment Setup

AgentDebug includes vendored environments (ALFWorld, WebShop, GAIA) with modular agent prompts. The rollout system requires these environments to be functional.

**API Keys**: Set the following environment variables (or put them in a `.env` file):

```bash
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."      # optional
export GEMINI_API_KEY="..."         # optional
export TOGETHER_API_KEY="..."       # optional
```
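Since a missing key typically surfaces only mid-run as a provider error, it can help to check up front. A minimal sketch (the `check_api_keys` helper is illustrative, not part of the AgentDebug API; the variable names match the exports above):

```python
import os

# Illustrative startup check (not part of AgentDebug): fail fast if a
# required API key is missing, and warn about unset optional providers.
REQUIRED_KEYS = ["OPENAI_API_KEY"]
OPTIONAL_KEYS = ["ANTHROPIC_API_KEY", "GEMINI_API_KEY", "TOGETHER_API_KEY"]

def check_api_keys() -> list[str]:
    """Return the required keys that are unset (empty list means ready)."""
    for key in OPTIONAL_KEYS:
        if not os.environ.get(key):
            print(f"note: {key} not set; that provider will be unavailable")
    return [k for k in REQUIRED_KEYS if not os.environ.get(k)]

missing = check_api_keys()
if missing:
    print("missing required keys:", ", ".join(missing))
```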

## Repository Structure

```
AgentDebug/
├── detector/                  # Core error detection framework
│   ├── fine_grained_analysis.py    # Phase 1: step-level per-module error detection
│   ├── critical_error_detection.py # Phase 2: critical error identification
│   └── error_definitions.py        # Error taxonomy (5 modules, 17 types)
├── agentdebug/
│   ├── engines/               # Multi-provider LLM abstraction
│   │   ├── openai.py          # OpenAI (GPT-4o, GPT-4.1, etc.)
│   │   ├── anthropic.py       # Anthropic (Claude)
│   │   ├── gemini.py          # Google (Gemini)
│   │   └── together.py        # Together AI (Llama, Qwen, etc.)
│   ├── environments/          # Environment wrappers + modular prompts
│   │   ├── alfworld/          # ALFWorld embodied tasks
│   │   ├── webshop/           # WebShop e-commerce tasks
│   │   └── gaia/              # GAIA general AI assistant tasks
│   └── rollout/               # Trajectory collection
│       ├── rollout.py         # Unified rollout across all environments
│       └── step_to_episode.py # Step-level → episode-level conversion
├── examples/                  # Sample data and demo scripts
└── docs/                      # Documentation
```

## Quick Start

### Run the Detector on a Trajectory

The `await` calls must run inside an event loop, so the two phases are wrapped in an async entry point:

```python
import asyncio

from detector.fine_grained_analysis import ErrorTypeDetector
from detector.critical_error_detection import CriticalErrorAnalyzer

# Configure with your API key
api_config = {
    "base_url": "https://api.openai.com/v1/chat/completions",
    "api_key": "your-api-key",
    "model": "gpt-4o-mini",
    "temperature": 0.0,
    "max_retries": 3,
    "timeout": 60,
}

async def main():
    # Phase 1: step-level error detection
    detector = ErrorTypeDetector(api_config)
    trajectory_data = detector.parse_trajectory("path/to/trajectory.json")
    phase1_results = await detector.analyze_trajectory(trajectory_data)

    # Phase 2: critical error identification
    analyzer = CriticalErrorAnalyzer(api_config)
    return await analyzer.identify_critical_error(phase1_results, trajectory_data)

critical_error = asyncio.run(main())
```

### Collect Rollout Trajectories

```bash
# ALFWorld rollout with Together AI (cheap, fast)
python -m agentdebug.rollout.rollout \
  --env alfworld \
  --provider together \
  --model meta-llama/Llama-3.3-70B-Instruct-Turbo \
  --unique_envs \
  --total_envs 100 \
  --concurrency 4 \
  --dump_path output/alfworld_steps.jsonl

# Convert steps to episodes
python -m agentdebug.rollout.step_to_episode \
  --input_jsonl output/alfworld_steps.jsonl \
  --output_jsonl output/alfworld_episodes.jsonl
```
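The output is JSON Lines (one JSON object per line), so episodes can be inspected with the standard library alone. A minimal loading sketch (the `load_episodes` helper is illustrative; the per-episode schema is whatever `step_to_episode` emits, which this sketch does not assume):

```python
import json

def load_episodes(path: str) -> list[dict]:
    """Read a JSON Lines file: parse each non-blank line as one episode."""
    episodes = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                episodes.append(json.loads(line))
    return episodes

# e.g. episodes = load_episodes("output/alfworld_episodes.jsonl")
#      print(len(episodes), "episodes")
```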

### Multi-Provider LLM Engine

```python
from agentdebug.engines import create_chat_model

# Automatically routes to the correct provider based on the model name
model = create_chat_model("gpt-4o-mini")                              # → OpenAI
model = create_chat_model("claude-sonnet-4-6")                        # → Anthropic
model = create_chat_model("gemini-2.5-flash")                         # → Gemini
model = create_chat_model("meta-llama/Llama-3.3-70B-Instruct-Turbo")  # → Together

response = model("What is 2+2?")
```

## Error Taxonomy

| Module | Error Types | Description |
|---|---|---|
| Memory | `hallucination`, `memory_retrieval_failure`, `over_simplification` | Agent misremembers or fails to recall information |
| Reflection | `progress_misjudge`, `outcome_misinterpretation`, `causal_misattribution`, `hallucination` | Agent incorrectly evaluates its own progress |
| Planning | `constraint_ignorance`, `impossible_action`, `inefficient_plan` | Agent creates flawed plans |
| Action | `misalignment`, `invalid_action`, `format_error`, `parameter_error` | Agent executes wrong actions |
| System | `step_limit`, `tool_execution_error`, `llm_limit`, `environment_error` | External system failures |
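The table above can be sketched as a plain mapping (illustrative only; `detector/error_definitions.py` in the repo is the authoritative source). Note that `hallucination` appears under both memory and reflection, which is why the 5 modules list 18 entries but 17 distinct error types:

```python
# Illustrative version of the taxonomy from the table; the repo's
# error_definitions.py is authoritative.
ERROR_TAXONOMY = {
    "memory": ["hallucination", "memory_retrieval_failure", "over_simplification"],
    "reflection": ["progress_misjudge", "outcome_misinterpretation",
                   "causal_misattribution", "hallucination"],
    "planning": ["constraint_ignorance", "impossible_action", "inefficient_plan"],
    "action": ["misalignment", "invalid_action", "format_error", "parameter_error"],
    "system": ["step_limit", "tool_execution_error", "llm_limit", "environment_error"],
}

# 5 modules, 18 entries, 17 distinct types ("hallucination" is shared).
distinct = {err for errors in ERROR_TAXONOMY.values() for err in errors}
assert len(ERROR_TAXONOMY) == 5 and len(distinct) == 17
```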

## AgentErrorBench Dataset

Download annotated failure trajectories: AgentErrorBench on Google Drive

- **ALFWorld**: 100 trajectories from embodied agent tasks
- **GAIA**: 50 trajectories from general AI assistant tasks
- **WebShop**: 50 trajectories from web navigation tasks

## Key Results

| Metric | Improvement |
|---|---|
| All-Correct Accuracy | +24% |
| Step Accuracy | +17% |
| Task Success Rate | Up to +26% |

## Citation

```bibtex
@article{agentdebug2025,
  title={Where LLM Agents Fail and How They Can Learn From Failures},
  author={Zhu, Kunlun and Liu, Zijia and Li, Bingxuan and Tian, Muxin and Yang, Yingxuan and Zhang, Jiaxun and others},
  journal={arXiv preprint arXiv:2509.25370},
  year={2025}
}
```

## License

This project is licensed under the MIT License; see the LICENSE file for details.

## Contributing

We welcome contributions! Feel free to open issues, submit pull requests, or reach out about collaborations.