Skip to content

higuseonhye/agent-accountability-eval

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 

Repository files navigation

Agent Accountability & Continuous Evaluation Toolkit

An evidence-based system to decide whether an agentic AI should be trusted in production.


Why this project exists

Agentic AI systems are rapidly moving from pilots to production. However, most organizations still lack a clear way to answer a fundamental question:

“Should we trust this agent in our organization?”

This project was built to explore how enterprises can operationalize:

  • Accountability
  • Continuous evaluation
  • Human-in-the-loop governance

Instead of focusing on model accuracy, this toolkit treats agents as actors, and trust as something that must be proven with evidence, not declared.


Core principles

This project is guided by three principles that emerged repeatedly in enterprise discussions (e.g. RAISE 2025):

  1. Agents are actors, not tools Agentic AI initiates actions and influences decisions. Traditional software testing is not sufficient.

  2. Trust is evidence, not intention Trust requires:

    • Observability (what happened)
    • Traceability (why it happened)
    • Auditability (can it be reconstructed later)
  3. Evaluation must be continuous Agent behavior can drift over time. Evaluation should not stop at deployment.


What this toolkit does

At a high level, the system:

  1. Runs an agent against risk-driven scenarios
  2. Logs every agent action as immutable audit evidence
  3. Evaluates behavior using rule-based + LLM-assisted metrics
  4. Produces a deployment readiness signal (GO / CONDITIONAL / NO-GO)

This mirrors how real enterprises assess risk, not how demos are built.


Project structure

The folder structure is intentional. Schemas and evidence come first; code comes second.

agent-accountability-eval/
├─ README.md                  # Project overview, problem framing, and demo entry point
├─ requirements.txt           # Python dependencies for reproducible runs
├─ .gitignore                 # Prevents committing logs, secrets, and local artifacts

├─ schemas/                   # Core evidence schemas (audit & evaluation)
│  ├─ run_record.schema.json  # Schema for agent run audit logs (observability & traceability)
│  └─ eval_result.schema.json # Schema for evaluation results and decision gates

├─ docs/
│  └─ threat_model.md         # Threat model describing key agent failure modes

├─ data/
│  ├─ scenarios/              # Scenario-based behavioral test cases
│  │  ├─ banking/             # Banking and financial risk scenarios
│  │  ├─ shadow_ai/           # Shadow AI and policy bypass scenarios
│  │  ├─ privacy/             # Privacy and data leakage scenarios
│  │  ├─ tool_abuse/          # Tool misuse and privilege escalation scenarios
│  │  └─ hallucination/       # Hallucination and overconfidence traps
│  └─ logs/                   # Append-only audit logs generated from agent runs

├─ src/
│  ├─ agent/
│  │  └─ runner.py            # Executes the agent under defined constraints and scenarios
│  ├─ audit/
│  │  └─ logger.py            # Captures and stores audit logs for every agent action
│  ├─ eval/
│  │  └─ metrics.py           # Behavior-based evaluation metrics and gate logic
│  └─ report/
│     └─ generator.py         # Generates agent readiness and decision reports

└─ demo/
   └─ run_demo.py             # End-to-end demo: scenario → agent → audit → evaluation → report

Audit logging (Accountability)

Every agent run is recorded as an append-only JSONL audit log.

Each log captures:

  • Inputs and context
  • Constraints and permissions
  • Tool usage
  • Outputs and confidence
  • Safety flags (policy, PII, hallucination risk)
  • Human override signals
  • Environment metadata for reproducibility

These logs are treated as evidence, not debug output.


Evaluation & decision gates

Evaluation focuses on behavior, not accuracy.

Typical metrics include:

  • Policy compliance
  • Tool overreach
  • Consistency across repeated runs
  • Hallucination risk
  • Behavior drift over time

Results are summarized using decision gates:

  • GO — Safe to deploy
  • CONDITIONAL — Deploy with restrictions / monitoring
  • NO-GO — Do not deploy

This framing aligns with how enterprises actually make deployment decisions.


Threat-driven scenarios

Scenarios are derived from an explicit threat model, not ad-hoc prompts.

Examples of covered risks:

  • Unauthorized financial advice
  • PII leakage
  • Shadow AI policy bypass
  • Tool privilege escalation
  • Confident hallucinations

Each scenario exists to answer:

“What is the worst realistic failure mode here?”


Quick start (demo)

Run a minimal end-to-end example that generates an audit log:

python demo/run_demo.py

You should see:

  • A generated run_id
  • A new audit log file in data/logs/
  • A JSONL record representing one agent run

This demo intentionally prioritizes clarity over sophistication.


Intended audience

This project is designed for:

  • AI evaluation & quality engineers
  • Responsible AI / AI governance practitioners
  • Enterprise architects assessing agentic AI readiness
  • Researchers exploring human-in-the-loop systems

It is not intended as a production agent framework.


Status & roadmap

Current focus:

  • Solid audit logging (done)
  • Scenario-driven evaluation
  • Clear decision outputs

Planned next steps:

  • Scenario regression testing
  • Drift detection over time
  • Automated Agent Readiness Reports (1-page)

Final note

As agentic AI becomes more autonomous, the most important question is no longer “How smart is the model?” but:

“Is the organization ready to take responsibility for this agent?”

This toolkit is a small step toward answering that question with evidence.

About

An evidence-based system for evaluating agentic AI trustworthiness through accountability, continuous evaluation, and human-in-the-loop governance.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors