Agent Accountability & Continuous Evaluation Toolkit

An evidence-based system to decide whether an agentic AI should be trusted in production.

Why this project exists

Agentic AI systems are rapidly moving from pilots to production. However, most organizations still lack a clear way to answer a fundamental question:

“Should we trust this agent in our organization?”

This project was built to explore how enterprises can operationalize:

Accountability
Continuous evaluation
Human-in-the-loop governance

Instead of focusing on model accuracy, this toolkit treats agents as actors, and trust as something that must be proven with evidence, not declared.

Core principles

This project is guided by three principles that emerged repeatedly in enterprise discussions (e.g. RAISE 2025):

Agents are actors, not tools Agentic AI initiates actions and influences decisions. Traditional software testing is not sufficient.
Trust is evidence, not intention Trust requires:
- Observability (what happened)
- Traceability (why it happened)
- Auditability (can it be reconstructed later)
Evaluation must be continuous Agent behavior can drift over time. Evaluation should not stop at deployment.

What this toolkit does

At a high level, the system:

Runs an agent against risk-driven scenarios
Logs every agent action as immutable audit evidence
Evaluates behavior using rule-based + LLM-assisted metrics
Produces a deployment readiness signal (GO / CONDITIONAL / NO-GO)

This mirrors how real enterprises assess risk, not how demos are built.

Project structure

The folder structure is intentional. Schemas and evidence come first; code comes second.

agent-accountability-eval/
├─ README.md                  # Project overview, problem framing, and demo entry point
├─ requirements.txt           # Python dependencies for reproducible runs
├─ .gitignore                 # Prevents committing logs, secrets, and local artifacts

├─ schemas/                   # Core evidence schemas (audit & evaluation)
│  ├─ run_record.schema.json  # Schema for agent run audit logs (observability & traceability)
│  └─ eval_result.schema.json # Schema for evaluation results and decision gates

├─ docs/
│  └─ threat_model.md         # Threat model describing key agent failure modes

├─ data/
│  ├─ scenarios/              # Scenario-based behavioral test cases
│  │  ├─ banking/             # Banking and financial risk scenarios
│  │  ├─ shadow_ai/           # Shadow AI and policy bypass scenarios
│  │  ├─ privacy/             # Privacy and data leakage scenarios
│  │  ├─ tool_abuse/          # Tool misuse and privilege escalation scenarios
│  │  └─ hallucination/       # Hallucination and overconfidence traps
│  └─ logs/                   # Append-only audit logs generated from agent runs

├─ src/
│  ├─ agent/
│  │  └─ runner.py            # Executes the agent under defined constraints and scenarios
│  ├─ audit/
│  │  └─ logger.py            # Captures and stores audit logs for every agent action
│  ├─ eval/
│  │  └─ metrics.py           # Behavior-based evaluation metrics and gate logic
│  └─ report/
│     └─ generator.py         # Generates agent readiness and decision reports

└─ demo/
   └─ run_demo.py             # End-to-end demo: scenario → agent → audit → evaluation → report

Audit logging (Accountability)

Every agent run is recorded as an append-only JSONL audit log.

Each log captures:

Inputs and context
Constraints and permissions
Tool usage
Outputs and confidence
Safety flags (policy, PII, hallucination risk)
Human override signals
Environment metadata for reproducibility

These logs are treated as evidence, not debug output.

Evaluation & decision gates

Evaluation focuses on behavior, not accuracy.

Typical metrics include:

Policy compliance
Tool overreach
Consistency across repeated runs
Hallucination risk
Behavior drift over time

Results are summarized using decision gates:

GO — Safe to deploy
CONDITIONAL — Deploy with restrictions / monitoring
NO-GO — Do not deploy

This framing aligns with how enterprises actually make deployment decisions.

Threat-driven scenarios

Scenarios are derived from an explicit threat model, not ad-hoc prompts.

Examples of covered risks:

Unauthorized financial advice
PII leakage
Shadow AI policy bypass
Tool privilege escalation
Confident hallucinations

Each scenario exists to answer:

“What is the worst realistic failure mode here?”

Quick start (demo)

Run a minimal end-to-end example that generates an audit log:

python demo/run_demo.py

You should see:

A generated run_id
A new audit log file in data/logs/
A JSONL record representing one agent run

This demo intentionally prioritizes clarity over sophistication.

Intended audience

This project is designed for:

AI evaluation & quality engineers
Responsible AI / AI governance practitioners
Enterprise architects assessing agentic AI readiness
Researchers exploring human-in-the-loop systems

It is not intended as a production agent framework.

Status & roadmap

Current focus:

Solid audit logging (done)
Scenario-driven evaluation
Clear decision outputs

Planned next steps:

Scenario regression testing
Drift detection over time
Automated Agent Readiness Reports (1-page)

Final note

As agentic AI becomes more autonomous, the most important question is no longer “How smart is the model?” but:

“Is the organization ready to take responsibility for this agent?”

This toolkit is a small step toward answering that question with evidence.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Agent Accountability & Continuous Evaluation Toolkit

Why this project exists

Core principles

What this toolkit does

Project structure

Audit logging (Accountability)

Evaluation & decision gates

Threat-driven scenarios

Quick start (demo)

Intended audience

Status & roadmap

Final note

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Agent Accountability & Continuous Evaluation Toolkit

Why this project exists

Core principles

What this toolkit does

Project structure

Audit logging (Accountability)

Evaluation & decision gates

Threat-driven scenarios

Quick start (demo)

Intended audience

Status & roadmap

Final note

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages