Agent Contract Tests

CI checks for AI coding agent roles, tool ACLs, call graphs, governed actions, and eval outputs.

This is a toolkit first, with a composite GitHub Action included for teams that want the fastest CI path.

It is not an agent framework, and it makes no claim of novelty. It is a lightweight way to turn agent roles, tool permissions, call graphs, governed actions, and eval assertions into files that can fail CI.

Use it beside Codex, Claude Code, Cursor, Copilot, AutoGen, CrewAI, LangGraph, or your own runner. The goal is simple: make agent behavior reviewable before it becomes production behavior.

Start Here

What breaks without this?

Agent permissions drift into prompts, chats, and tribal memory. Teams lose track of which agents may use which tools, which specialists may call each other, and which actions require approval evidence before merge or deploy.

What command do I run?

Try the local demo:

python3 scripts/agent_ops_init.py --demo --force
python3 scripts/agent_ops_validate.py --root agent-ops-demo --strict
python3 scripts/run_evals.py

What failure does it catch?

It catches contract drift: an agent requests deploy_prod but the ACL does not grant it, a PM agent delegates directly to a backend agent outside the call graph, or a governed action lacks approval and evidence fields.
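
In spirit, each of these failures is a set or field comparison between what an agent spec declares and what the registry allows. A minimal sketch of the tool-ACL case follows; it is not the validator's actual code, the data shapes are assumed, and the agent and tool names are only illustrative:

def find_acl_drift(requested, granted):
    """Return tools each agent requests that its ACL does not grant."""
    drift = {}
    for agent, tools in requested.items():
        missing = set(tools) - set(granted.get(agent, []))
        if missing:
            drift[agent] = sorted(missing)
    return drift

# Example: an agent asks for deploy_prod, but the ACL only grants repo_read.
requested = {"backend-builder": ["repo_read", "deploy_prod"]}
granted = {"backend-builder": ["repo_read"]}

if drift := find_acl_drift(requested, granted):
    raise SystemExit(f"FAIL tool ACL drift: {drift}")  # nonzero exit fails CI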

How do I add it to my repo?

Initialize the starter files:

python3 scripts/agent_ops_init.py --target ../my-product
cd ../my-product
python3 scripts/agent_ops_validate.py --strict

Or add the composite GitHub Action:

- uses: RPSingh1990/agent-contract-tests@v0.1.0
  with:
    root: "."
    strict: "true"
    run-evals: "true"
    eval-dir: ".agent-ops/evals"

How can I contribute one adapter or eval?

Pick one open issue labeled good first issue or help wanted, add a small adapter under adapters/ or an assertion example under examples/evals/, then include a self-test or documented pass/fail example. Useful contributions are concrete checks, not more abstract process.

[Demo: agent contract tests]

What It Catches

Agent Contract Tests fails CI when:

  • an agent asks for a tool that the ACL does not grant
  • a registry grants tools an agent did not declare
  • one agent delegates to another agent outside the call graph
  • governed channels lack approval or evidence requirements
  • governed tasks are missing owner, scope, reviews, tests, rollback, or evidence
  • saved agent outputs fail their deterministic eval assertions

Who This Is For

Use this if you are:

  • building with Codex, Claude Code, Cursor, GitHub Copilot, or other coding agents
  • trying to move beyond "vibe coding" into repeatable engineering
  • worried that AI-generated code will break as the product grows
  • creating specialist agents such as PM, backend, frontend, QA, security, research, or code review
  • trying to keep speed without losing architecture, security, and test discipline

Core Idea

AI agents should be treated like employees, not magic prompts.

Each agent needs:

  • a job description
  • allowed inputs
  • expected outputs
  • permission boundaries
  • escalation rules
  • eval assertions
  • evidence of performance
  • a manager/orchestrator

The goal is not more agents. The goal is fewer, sharper agents with clear accountability.
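
One way to picture that contract in code, purely as an illustration: the repo's own agent specs under examples/agents/ capture the same information in their own format, so treat the field names below as assumptions.

from dataclasses import dataclass, field

@dataclass
class AgentContract:
    name: str
    role: str                        # the job description, kept narrow
    allowed_inputs: list[str]
    expected_outputs: list[str]
    allowed_tools: list[str]         # permission boundaries
    escalates_to: str                # manager/orchestrator and escalation path
    eval_assertions: list[str]       # deterministic checks over saved outputs
    evidence: list[str] = field(default_factory=list)  # reviews, test runs, links

pm = AgentContract(
    name="product-manager",
    role="Turn vague founder requests into acceptance criteria",
    allowed_inputs=["founder requests", "bug reports"],
    expected_outputs=["acceptance criteria", "governed task drafts"],
    allowed_tools=["repo_read"],
    escalates_to="orchestrator",
    eval_assertions=["output contains an acceptance-criteria section"],
)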

What Is Included

docs/
  operating-model.md            How to run a small AI engineering team
  security-model.md             Public-safe and internal-safe boundaries
  evals-and-benchmarks.md       Lightweight eval discipline
  enforcement-model.md          What the validator enforces and what it cannot
  failure-modes.md              What breaks when AI coding scales badly
  prompt-blocks.md              Copy-paste prompts for Codex/Claude/Cursor
  case-study.md                 Multiple sanitized failure cases
  prior-art.md                  Related agent, eval, and security work
  adoption-and-contribution.md  Honest adoption path and contribution guide
  no-code-agent-ops.md          Operator-friendly use without writing code

examples/
  demo-pr/                      Intentionally failing bad-contract PR example
  agents/                       Public-safe agent specifications
  software-team-agents/         Sanitized software-team agent pack
  registry/                     Example call graph, tool ACL, governed channels
  evals/                        Runnable deterministic eval assertions
  worked-example/               End-to-end governed task example
  scenarios/                    Four distinct failure scenarios
  before-after/                 Vibe-coded request vs governed agent task
  demo-repo/                    Minimal initialized Agent Ops example

templates/
  agent-request.md              Hire or upgrade an agent
  governed-task.md              Start governed engineering work
  security-review.md            Pre-release security review
  pr-checklist.md               AI-assisted PR checklist

scripts/
  validate_public_repo.py       Local/CI safety and structure checks
  agent_ops_init.py             Copy Agent Ops starter files into another repo
  agent_ops_validate.py         Enforce agent ACLs, call graph, and governed tasks
  agent_ops_guard.py            Runtime guard helpers for tool/call/channel checks
  run_evals.py                  Run deterministic assertions over saved outputs

.github/workflows/
  validate.yml                  GitHub Actions validation
.gitleaks.toml                  Gitleaks configuration
action.yml                      Composite GitHub Action for downstream repos
adapters/                       Claude Code, Codex, and Cursor instruction blocks

Quick Start

Clone and run the validator:

python3 scripts/validate_public_repo.py

Expected result:

PASS public safety scan
PASS required docs
PASS templates
PASS scripts
PASS security tooling
PASS agent examples
PASS software-team agent pack
PASS eval examples

Run the Agent Ops contract validator:

python3 scripts/agent_ops_validate.py --strict

Expected result:

PASS agent specs
PASS tool ACL enforcement
PASS call graph enforcement
PASS governed channel registry
PASS governed tasks

Run deterministic eval assertions over saved agent outputs:

python3 scripts/run_evals.py

Expected result:

RESULT 4/4 evals passed

Initialize Agent Ops files into another repo:

python3 scripts/agent_ops_init.py --target ../my-product
cd ../my-product
python3 scripts/agent_ops_validate.py --strict

Or generate a local demo:

python3 scripts/agent_ops_init.py --demo --force
python3 scripts/agent_ops_validate.py --root agent-ops-demo --strict

The initializer also copies a GitHub Action into the target repo so future PRs can fail when agent contracts drift.

Use the repo as a composite GitHub Action:

- uses: RPSingh1990/agent-contract-tests@v0.1.0
  with:
    root: "."
    strict: "true"
    run-evals: "true"
    eval-dir: ".agent-ops/evals"

What Is Enforced

scripts/agent_ops_validate.py checks that:

  • agents request only tools granted in tool-acl.yaml
  • blocked tools are not requested or granted
  • agent delegation matches call-graph.yaml
  • governed channels define approval and evidence fields
  • governed tasks include owner, lane, scope, reviews, tests, rollback, and evidence (sketched below)
  • strict mode fails if registry permissions and agent specs drift
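
The governed-task rule in that list amounts to a required-fields check. A rough sketch of the idea, assuming the task has already been parsed into a plain mapping; the real validator reads the governed-task files itself and may handle fields differently:

REQUIRED_FIELDS = {"owner", "lane", "scope", "reviews", "tests", "rollback", "evidence"}

def missing_governed_fields(task: dict) -> set[str]:
    """Return required fields the task omits or leaves empty."""
    return {name for name in REQUIRED_FIELDS if not task.get(name)}

task = {"owner": "jane", "lane": "backend", "scope": "add rate limiting to the login API"}
if missing := missing_governed_fields(task):
    raise SystemExit(f"FAIL governed task missing fields: {sorted(missing)}")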

This is CI-time enforcement. It does not intercept live model tool calls. For runtime enforcement, wire the same registry files into your agent runner or tool middleware.

For a minimal runtime check:

python3 scripts/agent_ops_guard.py tool backend-builder repo_read
python3 scripts/agent_ops_guard.py call product-manager backend-builder

The first command should allow the tool. The second should deny the call because the example call graph routes implementation work through the orchestrator.
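
One way to reuse the same guard inside a runner is to shell out to that CLI before every tool dispatch. A minimal sketch, which assumes the guard exits nonzero when it denies a request; confirm that against scripts/agent_ops_guard.py in your checkout before relying on it:

import subprocess

def tool_allowed(agent: str, tool: str) -> bool:
    """Ask agent_ops_guard.py whether this agent may use this tool."""
    result = subprocess.run(
        ["python3", "scripts/agent_ops_guard.py", "tool", agent, tool],
        capture_output=True,
    )
    return result.returncode == 0  # assumed: zero exit means allowed

def dispatch(agent: str, tool: str, payload: dict):
    if not tool_allowed(agent, tool):
        raise PermissionError(f"{agent} is not granted {tool}")
    # hand the call to your actual tool runner here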

scripts/run_evals.py does not call a model. It scores saved outputs. Generate an answer with any AI coding agent, save it as an output file, and run the same deterministic assertions against it.
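
An assertion over a saved output can stay this simple. The sketch below only shows the shape of such a check; the output path is hypothetical, and the runnable examples under examples/evals/ define their own format:

from pathlib import Path

# Hypothetical path to an output saved from any coding agent.
output = Path("outputs/backend-builder/rate-limit-plan.md").read_text()

checks = {
    "mentions tests": "test" in output.lower(),
    "includes a rollback step": "rollback" in output.lower(),
}

failed = [name for name, ok in checks.items() if not ok]
print(f"RESULT {len(checks) - len(failed)}/{len(checks)} checks passed")
if failed:
    raise SystemExit(f"FAIL: {failed}")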

30-Minute Adoption Path

If you want to try this on a real repo today:

  1. Run python3 scripts/agent_ops_init.py --target <repo>.
  2. In the target repo, run python3 scripts/agent_ops_validate.py --strict.
  3. Pick one real auth, email, deploy, data, or external-action change.
  4. Write it as a governed task before asking agents to implement it.
  5. Keep only the agents and tools you actually need.
  6. Let the copied GitHub Action fail PRs when contracts drift.

For a complete example, read:

  • examples/worked-example/README.md
  • examples/worked-example/agent-request.md
  • examples/worked-example/governed-task.md
  • examples/worked-example/sample-review.md

For the sanitized software-team agent pack, read:

  • examples/software-team-agents/README.md

For tool-specific setup, read:

  • adapters/claude-code/README.md
  • adapters/codex/README.md
  • adapters/cursor/README.md

The Minimum Useful Workflow

  1. Write an agent request before creating a new agent.
  2. Define the agent with a narrow role and trigger.
  3. Add 2-3 realistic eval assertions.
  4. Give the agent the smallest useful permissions.
  5. Route sensitive work through a security review.
  6. Require evidence before merge or deploy.
  7. Promote the agent only after real output is good.

What This Is Not

This is not:

  • an agent framework
  • a model wrapper
  • a prompt marketplace
  • a complete sandbox
  • a marketplace-first action
  • a replacement for Gitleaks, GitHub secret scanning, Promptfoo, DeepEval, Inspect, or AgentOps
  • a replacement for engineering judgment
  • a way to bypass code review, security review, or tests

It is an Agent Ops contract-test toolkit for teams already using AI coding tools. The GitHub Action is the easiest adoption path, not the whole product.

Good First Use Cases

  • Create a backend agent with API-contract discipline.
  • Add a security reviewer before deployment.
  • Add a PM agent that converts vague founder requests into acceptance criteria.
  • Add a QA agent that checks behavior beyond the happy path.
  • Add contract tests that block drift in agent tools, calls, eval outputs, and governed actions.

Contribution Path

If this helps your team, fork it or open an issue with your own agent-operating pattern. The strongest contributions are concrete: better enforcement checks, eval assertions, governed-task examples, and failure cases from real AI-assisted builds.

License

MIT. Use it, adapt it, and improve it.
