CI checks for AI coding agent roles, tool ACLs, call graphs, governed actions, and eval outputs.
This is a toolkit first, with a composite GitHub Action included for teams that want the fastest CI path.
It is not an agent framework and not a claim of novelty. It is a lightweight way to turn agent roles, tool permissions, call graphs, governed actions, and eval assertions into files that can fail CI.
Use it beside Codex, Claude Code, Cursor, Copilot, AutoGen, CrewAI, LangGraph, or your own runner. The goal is simple: make agent behavior reviewable before it becomes production behavior.
Agent permissions drift into prompts, chats, and tribal memory. Teams lose track of which agents may use which tools, which specialists may call each other, and which actions require approval evidence before merge or deploy.
Try the local demo:

```shell
python3 scripts/agent_ops_init.py --demo --force
python3 scripts/agent_ops_validate.py --root agent-ops-demo --strict
python3 scripts/run_evals.py
```

It catches contract drift: an agent requests `deploy_prod` but the ACL does not grant it, a PM agent delegates directly to a backend agent outside the call graph, or a governed action lacks approval and evidence fields.
Initialize the starter files:

```shell
python3 scripts/agent_ops_init.py --target ../my-product
cd ../my-product
python3 scripts/agent_ops_validate.py --strict
```

Or add the composite GitHub Action:

```yaml
- uses: RPSingh1990/agent-contract-tests@v0.1.0
  with:
    root: "."
    strict: "true"
    run-evals: "true"
    eval-dir: ".agent-ops/evals"
```

Pick one open issue labeled good first issue or help wanted, add a small adapter under `adapters/` or an assertion example under `examples/evals/`, then include a self-test or documented pass/fail example. Useful contributions are concrete checks, not more abstract process.
Agent Contract Tests fails CI when:
- an agent asks for a tool that the ACL does not grant
- a registry grants tools an agent did not declare
- one agent delegates to another agent outside the call graph
- governed channels lack approval or evidence requirements
- governed tasks miss owner, scope, reviews, tests, rollback, or evidence
- saved agent outputs fail their deterministic eval assertions
Use this if you are:
- building with Codex, Claude Code, Cursor, GitHub Copilot, or other coding agents
- trying to move beyond "vibe coding" into repeatable engineering
- worried that AI-generated code will break as the product grows
- creating specialist agents such as PM, backend, frontend, QA, security, research, or code review
- trying to keep speed without losing architecture, security, and test discipline
AI agents should be treated like employees, not magic prompts.
Each agent needs:
- a job description
- allowed inputs
- expected outputs
- permission boundaries
- escalation rules
- eval assertions
- evidence of performance
- a manager/orchestrator
The goal is not more agents. The goal is fewer, sharper agents with clear accountability.
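As a sketch of what such a spec can capture — field names here are illustrative, not the toolkit's actual schema; see `examples/agents/` for real specs:

```yaml
# Hypothetical agent spec; field names are illustrative, not the repo's schema.
name: backend-builder
role: Implement backend changes scoped by a governed task
triggers:
  - governed task approved by the orchestrator
inputs:
  - governed-task file
  - API contract
outputs:
  - pull request with tests and evidence links
tools:              # must match the grants in the tool ACL registry
  - repo_read
  - repo_write
  - run_tests
escalation:
  - route auth, deploy, and data changes to security review
evals:
  - output references the governed task's acceptance criteria
```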
```text
docs/
  operating-model.md             How to run a small AI engineering team
  security-model.md              Public-safe and internal-safe boundaries
  evals-and-benchmarks.md        Lightweight eval discipline
  enforcement-model.md           What the validator enforces and what it cannot
  failure-modes.md               What breaks when AI coding scales badly
  prompt-blocks.md               Copy-paste prompts for Codex/Claude/Cursor
  case-study.md                  Multiple sanitized failure cases
  prior-art.md                   Related agent, eval, and security work
  adoption-and-contribution.md   Honest adoption path and contribution guide
  no-code-agent-ops.md           Operator-friendly use without writing code
examples/
  demo-pr/                       Intentionally failing bad-contract PR example
  agents/                        Public-safe agent specifications
  software-team-agents/          Sanitized software-team agent pack
  registry/                      Example call graph, tool ACL, governed channels
  evals/                         Runnable deterministic eval assertions
  worked-example/                End-to-end governed task example
  scenarios/                     Four distinct failure scenarios
  before-after/                  Vibe-coded request vs governed agent task
  demo-repo/                     Minimal initialized Agent Ops example
templates/
  agent-request.md               Hire or upgrade an agent
  governed-task.md               Start governed engineering work
  security-review.md             Pre-release security review
  pr-checklist.md                AI-assisted PR checklist
scripts/
  validate_public_repo.py        Local/CI safety and structure checks
  agent_ops_init.py              Copy Agent Ops starter files into another repo
  agent_ops_validate.py          Enforce agent ACLs, call graph, and governed tasks
  agent_ops_guard.py             Runtime guard helpers for tool/call/channel checks
  run_evals.py                   Run deterministic assertions over saved outputs
.github/workflows/
  validate.yml                   GitHub Actions validation
.gitleaks.toml                   Gitleaks configuration
action.yml                       Composite GitHub Action for downstream repos
adapters/                        Claude Code, Codex, and Cursor instruction blocks
```
Clone and run the validator:

```shell
python3 scripts/validate_public_repo.py
```

Expected result:

```text
PASS public safety scan
PASS required docs
PASS templates
PASS scripts
PASS security tooling
PASS agent examples
PASS software-team agent pack
PASS eval examples
```
Run the Agent Ops contract validator:

```shell
python3 scripts/agent_ops_validate.py --strict
```

Expected result:

```text
PASS agent specs
PASS tool ACL enforcement
PASS call graph enforcement
PASS governed channel registry
PASS governed tasks
```

Run deterministic eval assertions over saved agent outputs:

```shell
python3 scripts/run_evals.py
```

Expected result:

```text
RESULT 4/4 evals passed
```
Initialize Agent Ops files into another repo:

```shell
python3 scripts/agent_ops_init.py --target ../my-product
cd ../my-product
python3 scripts/agent_ops_validate.py --strict
```

Or generate a local demo:

```shell
python3 scripts/agent_ops_init.py --demo --force
python3 scripts/agent_ops_validate.py --root agent-ops-demo --strict
```

The initializer also copies a GitHub Action into the target repo so future PRs can fail when agent contracts drift.
Use the repo as a composite GitHub Action:

```yaml
- uses: RPSingh1990/agent-contract-tests@v0.1.0
  with:
    root: "."
    strict: "true"
    run-evals: "true"
    eval-dir: ".agent-ops/evals"
```

`scripts/agent_ops_validate.py` checks that:

- agents request only tools granted in `tool-acl.yaml`
- blocked tools are not requested or granted
- agent delegation matches `call-graph.yaml`
- governed channels define approval and evidence fields
- governed tasks include owner, lane, scope, reviews, tests, rollback, and evidence
- strict mode fails if registry permissions and agent specs drift
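For orientation only, a minimal registry pair could look like the sketch below. The keys and file layout are hypothetical; the initializer writes the actual schema into your repo, so treat the generated files as the source of truth.

```yaml
# tool-acl.yaml (hypothetical shape)
agents:
  backend-builder:
    allow: [repo_read, repo_write, run_tests]
    block: [deploy_prod]
  product-manager:
    allow: [repo_read]

# call-graph.yaml (hypothetical shape)
edges:
  - from: product-manager
    to: orchestrator
  - from: orchestrator
    to: backend-builder
```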
This is CI-time enforcement. It does not intercept live model tool calls. For runtime enforcement, wire the same registry files into your agent runner or tool middleware.
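One way to wire the registry into a runner, as a sketch: the in-memory dict below stands in for a parsed `tool-acl.yaml`, and the agent and tool names are illustrative.

```python
# Minimal runtime guard: deny tool calls the ACL does not explicitly grant.
# The dict stands in for a parsed tool-acl.yaml; names are illustrative.
ACL = {
    "backend-builder": {"repo_read", "repo_write", "run_tests"},
    "product-manager": {"repo_read"},
}

def check_tool(agent: str, tool: str) -> bool:
    """True only when the ACL explicitly grants the tool to the agent."""
    return tool in ACL.get(agent, set())

def guarded_call(agent: str, tool: str, fn, *args, **kwargs):
    """Wrap a tool invocation so ungranted calls fail loudly instead of running."""
    if not check_tool(agent, tool):
        raise PermissionError(f"{agent} is not granted {tool}")
    return fn(*args, **kwargs)

print(check_tool("backend-builder", "repo_read"))    # True
print(check_tool("product-manager", "deploy_prod"))  # False
```

The design choice is default-deny: an agent missing from the ACL, or a tool missing from its grant set, is rejected rather than silently allowed.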
For a minimal runtime check:

```shell
python3 scripts/agent_ops_guard.py tool backend-builder repo_read
python3 scripts/agent_ops_guard.py call product-manager backend-builder
```

The first command should allow the tool. The second should deny the call because the example call graph routes implementation work through the orchestrator.
scripts/run_evals.py does not call a model. It scores saved outputs. Generate an answer with any AI coding agent, save it as an output file, and run the same deterministic assertions against it.
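As an illustration of the idea only (not the repo's actual assertion format, which lives in `examples/evals/`), a deterministic check over a saved output can be as simple as:

```python
import pathlib

def run_assertions(output_path, assertions):
    """Score a saved agent output with deterministic substring assertions."""
    text = pathlib.Path(output_path).read_text()
    return [(name, needle in text) for name, needle in assertions]

# Save a sample output, then score it; no model call is involved.
pathlib.Path("saved_output.md").write_text("Added input validation and a rollback note.")
checks = [("mentions rollback", "rollback"), ("mentions tests", "unit tests")]
print(run_assertions("saved_output.md", checks))
# [('mentions rollback', True), ('mentions tests', False)]
```

Because the assertions are pure string checks over a file, the same score is reproduced on every run and in CI, regardless of which model produced the output.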
If you want to try this on a real repo today:

- Run `python3 scripts/agent_ops_init.py --target <repo>`.
- In the target repo, run `python3 scripts/agent_ops_validate.py --strict`.
- Pick one real auth, email, deploy, data, or external-action change.
- Write it as a governed task before asking agents to implement it.
- Keep only the agents and tools you actually need.
- Let the copied GitHub Action fail PRs when contracts drift.
For a complete example, read:

- examples/worked-example/README.md
- examples/worked-example/agent-request.md
- examples/worked-example/governed-task.md
- examples/worked-example/sample-review.md

For the sanitized software-team agent pack, read:

- examples/software-team-agents/README.md

For tool-specific setup, read:

- adapters/claude-code/README.md
- adapters/codex/README.md
- adapters/cursor/README.md
- Write an agent request before creating a new agent.
- Define the agent with a narrow role and trigger.
- Add 2-3 realistic eval assertions.
- Give the agent the smallest useful permissions.
- Route sensitive work through a security review.
- Require evidence before merge or deploy.
- Promote the agent only after real output is good.
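A governed task can be as small as one file carrying the required fields. This sketch uses hypothetical field names; the real template is `templates/governed-task.md`:

```yaml
# Hypothetical governed-task fields; the real template is templates/governed-task.md.
owner: jane@example.com
lane: backend
scope: Add rate limiting to the login endpoint only
reviews:
  - security-reviewer approval before merge
tests:
  - unit tests for limit and reset behavior
rollback: Revert the login_rate_limit feature flag
evidence:
  - link to passing CI run
  - link to security review comment
```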
This is not:
- an agent framework
- a model wrapper
- a prompt marketplace
- a complete sandbox
- a marketplace-first action
- a replacement for Gitleaks, GitHub secret scanning, Promptfoo, DeepEval, Inspect, or AgentOps
- a replacement for engineering judgment
- a way to bypass code review, security review, or tests
It is an Agent Ops contract-test toolkit for teams already using AI coding tools. The GitHub Action is the easiest adoption path, not the whole product.
- Create a backend agent with API-contract discipline.
- Add a security reviewer before deployment.
- Add a PM agent that converts vague founder requests into acceptance criteria.
- Add a QA agent that checks behavior beyond the happy path.
- Add contract tests that block drift in agent tools, calls, eval outputs, and governed actions.
If this helps your team, fork it or open an issue with your own agent-operating pattern. The strongest contributions are concrete: better enforcement checks, eval assertions, governed-task examples, and failure cases from real AI-assisted builds.
MIT. Use it, adapt it, and improve it.