An evidence-based system to decide whether an agentic AI should be trusted in production.
Agentic AI systems are rapidly moving from pilots to production. However, most organizations still lack a clear way to answer a fundamental question:
“Should we trust this agent in our organization?”
This project was built to explore how enterprises can operationalize:
- Accountability
- Continuous evaluation
- Human-in-the-loop governance
Instead of focusing on model accuracy, this toolkit treats agents as actors, and trust as something that must be proven with evidence, not declared.
This project is guided by three principles that emerged repeatedly in enterprise discussions (e.g. RAISE 2025):
-
Agents are actors, not tools Agentic AI initiates actions and influences decisions. Traditional software testing is not sufficient.
-
Trust is evidence, not intention Trust requires:
- Observability (what happened)
- Traceability (why it happened)
- Auditability (can it be reconstructed later)
-
Evaluation must be continuous Agent behavior can drift over time. Evaluation should not stop at deployment.
At a high level, the system:
- Runs an agent against risk-driven scenarios
- Logs every agent action as immutable audit evidence
- Evaluates behavior using rule-based + LLM-assisted metrics
- Produces a deployment readiness signal (GO / CONDITIONAL / NO-GO)
This mirrors how real enterprises assess risk, not how demos are built.
The folder structure is intentional. Schemas and evidence come first; code comes second.
agent-accountability-eval/
├─ README.md # Project overview, problem framing, and demo entry point
├─ requirements.txt # Python dependencies for reproducible runs
├─ .gitignore # Prevents committing logs, secrets, and local artifacts
├─ schemas/ # Core evidence schemas (audit & evaluation)
│ ├─ run_record.schema.json # Schema for agent run audit logs (observability & traceability)
│ └─ eval_result.schema.json # Schema for evaluation results and decision gates
├─ docs/
│ └─ threat_model.md # Threat model describing key agent failure modes
├─ data/
│ ├─ scenarios/ # Scenario-based behavioral test cases
│ │ ├─ banking/ # Banking and financial risk scenarios
│ │ ├─ shadow_ai/ # Shadow AI and policy bypass scenarios
│ │ ├─ privacy/ # Privacy and data leakage scenarios
│ │ ├─ tool_abuse/ # Tool misuse and privilege escalation scenarios
│ │ └─ hallucination/ # Hallucination and overconfidence traps
│ └─ logs/ # Append-only audit logs generated from agent runs
├─ src/
│ ├─ agent/
│ │ └─ runner.py # Executes the agent under defined constraints and scenarios
│ ├─ audit/
│ │ └─ logger.py # Captures and stores audit logs for every agent action
│ ├─ eval/
│ │ └─ metrics.py # Behavior-based evaluation metrics and gate logic
│ └─ report/
│ └─ generator.py # Generates agent readiness and decision reports
└─ demo/
└─ run_demo.py # End-to-end demo: scenario → agent → audit → evaluation → report
Every agent run is recorded as an append-only JSONL audit log.
Each log captures:
- Inputs and context
- Constraints and permissions
- Tool usage
- Outputs and confidence
- Safety flags (policy, PII, hallucination risk)
- Human override signals
- Environment metadata for reproducibility
These logs are treated as evidence, not debug output.
Evaluation focuses on behavior, not accuracy.
Typical metrics include:
- Policy compliance
- Tool overreach
- Consistency across repeated runs
- Hallucination risk
- Behavior drift over time
Results are summarized using decision gates:
- GO — Safe to deploy
- CONDITIONAL — Deploy with restrictions / monitoring
- NO-GO — Do not deploy
This framing aligns with how enterprises actually make deployment decisions.
Scenarios are derived from an explicit threat model, not ad-hoc prompts.
Examples of covered risks:
- Unauthorized financial advice
- PII leakage
- Shadow AI policy bypass
- Tool privilege escalation
- Confident hallucinations
Each scenario exists to answer:
“What is the worst realistic failure mode here?”
Run a minimal end-to-end example that generates an audit log:
python demo/run_demo.pyYou should see:
- A generated
run_id - A new audit log file in
data/logs/ - A JSONL record representing one agent run
This demo intentionally prioritizes clarity over sophistication.
This project is designed for:
- AI evaluation & quality engineers
- Responsible AI / AI governance practitioners
- Enterprise architects assessing agentic AI readiness
- Researchers exploring human-in-the-loop systems
It is not intended as a production agent framework.
Current focus:
- Solid audit logging (done)
- Scenario-driven evaluation
- Clear decision outputs
Planned next steps:
- Scenario regression testing
- Drift detection over time
- Automated Agent Readiness Reports (1-page)
As agentic AI becomes more autonomous, the most important question is no longer “How smart is the model?” but:
“Is the organization ready to take responsibility for this agent?”
This toolkit is a small step toward answering that question with evidence.