FAQ
Strategy, comparison, and frequently-misunderstood questions. For error-level help see Troubleshooting.
When an agent gains the ability to act — refund, email, cancel, deploy, modify a record — every tool change becomes a release event. Code review catches code; eval suites catch behavior; observability catches runtime. None of them answer the release question: "Given the tool surface declared in this PR, do we have explicit approval policies, scope coverage, idempotency evidence, and review readiness for every action?"
Shipgate produces a deterministic answer to that question, before promotion.
scan reviews the whole declared tool surface in one shot — good for a first
audit or a periodic full review. verify is the merge-gate loop: it runs the
same checks on a PR diff (--base/--head), computes the capability change,
and returns a deterministic merge verdict, so a reviewer sees exactly how this
change moves the agent's reach. Both write report.json; the gate is
release_decision.decision either way. In CI the GitHub Action delegates to
verify, so PR comments are diff-aware.
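A CI step that consumes the verdict could look roughly like the sketch below. It assumes only what is stated above: report.json contains a release_decision object with a decision field. The helper names and the decision values ("pass", "block") are hypothetical placeholders, so check the report schema of your Shipgate version for the real ones.

```python
import json
from pathlib import Path


def read_decision(report_path):
    """Return release_decision.decision from a Shipgate report file."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    return report["release_decision"]["decision"]


def gate_exit_code(decision, passing=("pass",)):
    """Map a decision string to a process exit code.

    The passing values are an assumption for illustration; substitute
    the values your Shipgate version actually emits.
    """
    return 0 if decision in passing else 1
```

Because both scan and verify write the same field, a wrapper like this does not need to know which command produced the report.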
| | Evals | Shipgate |
|---|---|---|
| What they test | Did the model behave correctly on this input? | What tool surface, schemas, scopes, and policies are we releasing? |
| When they run | Repeatedly during model/prompt iteration | Before promotion to higher permissions |
| What you do with the output | Tune prompts and retrieval | File a release-review issue |
Evals tell you whether the agent is good. Shipgate tells you whether the agent is reviewable.
Observability records what happened at runtime — useful, but it arrives after behavior exists. Shipgate runs against the manifest and tool sources before any tool calls happen. The two are complementary: traces feed openai_api.trace_samples, and Shipgate flags traces that show approved: false on a tool that needs approval.
A gateway enforces tool access at runtime — can this call go through right now? Shipgate produces release evidence — should this tool be in this release at all? Gateways can't catch a missing approval policy on a refund tool the team forgot to declare. Shipgate can't stop a malicious runtime call. Use both for full coverage.
Scanners (CodeQL, Semgrep, Bandit, etc.) look for code-level vulnerabilities. Shipgate looks at the release-readiness of the agent's declared tool surface: overly broad free-form fields, missing approval policies, scope mismatches, prohibited actions. They complement each other.
No. The risk classifier is rule-based: HTTP method, MCP annotations, tokenized keyword matching against name/description/scopes, and your manifest's risk_overrides. This is an explicit design choice — release decisions need to be deterministic and reviewable. Shipgate is the static-analysis layer; an AI-assisted layer could come later as a separate optional step.
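To make "rule-based" concrete, here is a minimal sketch of tokenized keyword matching combined with an HTTP-method rule. It is an illustration of the technique, not Shipgate's actual classifier: the keyword table, tag names, and function names are all hypothetical, and MCP annotations and risk_overrides are left out for brevity.

```python
import re

# Hypothetical keyword-to-tag rules; the real catalog is not reproduced here.
KEYWORD_TAGS = {
    "refund": "financial",
    "delete": "destructive",
    "cancel": "destructive",
    "email": "external_communication",
}

# HTTP methods treated as state-changing in this sketch.
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def tokenize(text):
    """Split on non-alphanumerics and camelCase boundaries, lowercased."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return {tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", spaced) if tok}


def classify(name, description="", scopes=(), http_method=None):
    """Return sorted risk tags for one tool; same input, same output."""
    tokens = tokenize(name) | tokenize(description)
    for scope in scopes:
        tokens |= tokenize(scope)
    tags = {tag for word, tag in KEYWORD_TAGS.items() if word in tokens}
    if http_method and http_method.upper() in WRITE_METHODS:
        tags.add("write")
    return sorted(tags)
```

Matching on whole tokens rather than substrings means a name like prefunded_balance does not trip the refund rule, and the absence of any model call or randomness is what keeps the result reproducible.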
shipgate.yaml is meant for humans to read in PRs. The strict schema (Pydantic) catches typos at scan time. JSON would be denser; we found YAML wins on review legibility for the kind of mixed structured/free-form content (purposes, prohibited actions, suppression reasons) the manifest contains.
Two reasons:
- Trust posture. Shipgate runs against repositories that may not yet be approved for outbound network access. Static-only means it's safe to run on any repo before connecting any external service.
- Determinism. Static checks always produce the same finding for the same inputs. Runtime tools can't make this guarantee.
The trade-off: Shipgate cannot detect dynamically-built tool surfaces. The SHIP-INVENTORY-LOW-CONFIDENCE-PRODUCTION-SURFACE finding is the safety net that nudges teams toward declarative inventories before promotion.
Negligibly. A 600-tool scan completes in ~290 ms on a standard runner. The action overhead (Python install + dependency resolution) dominates; expect ~30 seconds end-to-end on a cold cache, ~10 seconds with a warm pip cache.
The recommended path:
- Land it in advisory mode. No CI failures yet. Watch PR comments.
- Tune `risk_overrides` and `checks.ignore` based on real false positives.
- Save a baseline with `agents-shipgate baseline save` and commit it.
- Switch to `ci_mode: strict --fail-on critical` with the baseline applied.
- Increment thresholds (`--fail-on critical,high`) when the active list is small.
See Baseline Workflow § Rolling out strict mode.
Three answers:
- Override the heuristic with `risk_overrides.tools.{tool}.remove_tags`.
- Suppress the finding with `checks.ignore` (requires a reason).
- File an issue with the `false-positive` label. The catalog improves through reports.
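The effect of a remove_tags override can be sketched as a set difference over a tool's risk tags. This is an assumption-laden illustration: it presumes the manifest has already been parsed into a dict, that remove_tags is a list of tag names, and that tags are plain strings. The function name, tool name, and tag names are hypothetical.

```python
def apply_risk_overrides(tool_name, tags, manifest):
    """Drop tags listed under risk_overrides.tools.<tool>.remove_tags."""
    overrides = manifest.get("risk_overrides", {}).get("tools", {})
    remove = set(overrides.get(tool_name, {}).get("remove_tags", []))
    return sorted(set(tags) - remove)


# Hypothetical manifest fragment, already parsed from YAML into a dict.
manifest = {
    "risk_overrides": {
        "tools": {"cancel_reminder": {"remove_tags": ["destructive"]}}
    }
}

remaining = apply_risk_overrides("cancel_reminder", ["destructive", "write"], manifest)
```

An override scoped to a single tool leaves the heuristic untouched for every other tool, which is why it is usually preferable to suppressing the finding outright.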
No. There's no posthog/sentry/mixpanel import anywhere in the codebase. Logs go to stderr (or JSON to stderr if AGENTS_SHIPGATE_LOG_FORMAT=json); nothing leaves the host.
The run_id is derived from a hash that includes paths and timestamps — it's a session ID, not a stable identifier. The fields you can rely on across machines:
- `findings[].fingerprint`: deterministic from check_id + tool_name + canonical evidence
- `findings[].id`: fingerprint + content-derived discriminator on collision
- `release_decision.decision`: the deterministic release gate for the same input
- `summary.*_count`: the same for the same input
If you see a finding's fingerprint change between runs on the same input, that's a bug — please file it.
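One way such a fingerprint can be made independent of machine, path, and time is to hash a canonical serialization of its inputs. The sketch below assumes SHA-256 over canonical JSON; the actual hash function, separator, and evidence shape used by Shipgate are not stated here, and the check ID in the usage lines is a made-up placeholder.

```python
import hashlib
import json


def fingerprint(check_id, tool_name, evidence):
    """Hash check_id, tool_name, and canonicalized evidence into a stable ID."""
    # Sorting keys and fixing separators makes the JSON text independent
    # of dict insertion order and whitespace.
    canonical = json.dumps(evidence, sort_keys=True, separators=(",", ":"))
    payload = "\x1f".join([check_id, tool_name, canonical])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical check ID and evidence; key order differs between the two calls.
a = fingerprint("SHIP-EXAMPLE-CHECK", "issue_refund", {"scope": "payments:write", "policy": None})
b = fingerprint("SHIP-EXAMPLE-CHECK", "issue_refund", {"policy": None, "scope": "payments:write"})
```

Nothing in the inputs depends on the checkout path or the clock, so two runs on the same input agree, which is the property the run_id deliberately lacks.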
The model is built around tool-using agents specifically — tool_sources, risk_hints, policies. If you have an OpenAPI spec for a regular service and want to do general scope/auth review, the auth and schema checks would still apply, but you'd be using a small slice of the catalog. Probably better tools exist for that use case (Spectral, OpenAPI Linter).
The ROADMAP.md is the source of truth. The current direction is the deterministic merge-gate / verifier loop — verify on PR diffs, merge verdicts, capability-change review, and routing trust-root edits to human review. Many earlier roadmap items have since shipped: SARIF output, the Release Evidence Packet, GitLab CI / CircleCI / Jenkins recipes, granular API checks, baselines, policy packs, and broader framework coverage (Anthropic, Google ADK, LangChain/LangGraph, CrewAI, Codex plugins, n8n).
The CLI and GitHub Action are open source under Apache-2.0 and free forever. The lab is exploring optional hosted infrastructure for organization-level rollups, history, and policy drift across many repos — but the static checker will always work standalone.
- General feedback: GitHub Discussions
- Bugs / false positives: Issues
- Design partnership: see threemoonslab.com
- Security: see SECURITY.md
Agents Shipgate · Apache-2.0 · maintained by Three Moons Lab · Report a false positive