FAQ

Strategy, comparison, and frequently-misunderstood questions. For error-level help see Troubleshooting.

What problem does Shipgate solve?

When an agent gains the ability to act — refund, email, cancel, deploy, modify a record — every tool change becomes a release event. Code review catches code; eval suites catch behavior; observability catches runtime. None of them answer the release question: "Given the tool surface declared in this PR, do we have explicit approval policies, scope coverage, idempotency evidence, and review readiness for every action?"

Shipgate produces a deterministic answer to that question, before promotion.

What's the difference between `scan` and `verify`?

scan reviews the whole declared tool surface in one shot — good for a first audit or a periodic full review. verify is the merge-gate loop: it runs the same checks on a PR diff (--base/--head), computes the capability change, and returns a deterministic merge verdict, so a reviewer sees exactly how this change moves the agent's reach. Both write report.json; the gate is release_decision.decision either way. In CI the GitHub Action delegates to verify, so PR comments are diff-aware.
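
As a minimal sketch of gating on that field, the snippet below reads `report.json` and maps `release_decision.decision` to an exit code. The set of passing decision values (`PASSING_DECISIONS`) and the helper names are assumptions made for illustration; the real value set comes from Shipgate's report schema.

```python
import json
from pathlib import Path

# Assumption: decision values that let a merge proceed. The actual values
# are defined by Shipgate's report schema, not by this sketch.
PASSING_DECISIONS = frozenset({"pass"})


def read_decision(report):
    """Return release_decision.decision, or raise if the report lacks it."""
    try:
        return report["release_decision"]["decision"]
    except (KeyError, TypeError) as exc:
        raise ValueError("report has no release_decision.decision") from exc


def gate_exit_code(report, passing=PASSING_DECISIONS):
    """Map the decision to an exit code: 0 to proceed, 1 to block."""
    return 0 if read_decision(report) in passing else 1


def gate_from_file(path="report.json"):
    """Load a report written by scan or verify and apply the gate."""
    report = json.loads(Path(path).read_text(encoding="utf-8"))
    return gate_exit_code(report)
```

Because both commands write the same field, a wrapper like this would not need to know which one produced the report.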

How is this different from evals?

| | Evals | Shipgate |
|---|---|---|
| What they test | Did the model behave correctly on this input? | What tool surface, schemas, scopes, and policies are we releasing? |
| When they run | Iteratively during model/prompt iteration | Before promotion to higher permissions |
| What you do with the output | Tune prompts and retrieval | File a release-review issue |

Evals tell you whether the agent is good. Shipgate tells you whether the agent is reviewable.

How is this different from observability/tracing?

Observability records what happened at runtime. That is useful, but the signal arrives only after the agent has already acted. Shipgate runs against the manifest and tool sources before any tool calls happen. The two are complementary: traces feed openai_api.trace_samples, and Shipgate flags traces that show approved: false on a tool that needs approval.
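
The trace check described above can be pictured with a small sketch. It assumes each trace sample is a dict with a hypothetical `tool` key alongside the `approved` flag, and that the set of approval-requiring tools is already known; the real shape of openai_api.trace_samples may differ.

```python
def unapproved_calls(trace_samples, approval_required):
    """Return samples where a tool that needs approval ran with approved == False.

    Assumption: each sample looks like {"tool": "<name>", "approved": <bool>}.
    Samples with no "approved" key are left alone rather than guessed at.
    """
    required = set(approval_required)
    return [
        sample
        for sample in trace_samples
        if sample.get("tool") in required and sample.get("approved") is False
    ]
```

Only an explicit `approved: false` is treated as a violation here; a missing flag is a separate question that this sketch does not try to answer.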

How is this different from an MCP gateway?

A gateway enforces tool access at runtime — can this call go through right now? Shipgate produces release evidence — should this tool be in this release at all? Gateways can't catch a missing approval policy on a refund tool the team forgot to declare. Shipgate can't stop a malicious runtime call. Use both for full coverage.

How is this different from a security scanner?

Scanners (CodeQL, Semgrep, Bandit, etc.) look for code-level vulnerabilities. Shipgate looks at the release-readiness of the agent's declared tool surface: overly broad free-form fields, missing approval policies, scope mismatches, prohibited actions. They complement each other.

Does Shipgate use AI to classify tools?

No. The risk classifier is rule-based: HTTP method, MCP annotations, tokenized keyword matching against name/description/scopes, and your manifest's risk_overrides. This is an explicit design choice — release decisions need to be deterministic and reviewable. Shipgate is the static-analysis layer; an AI-assisted layer could come later as a separate optional step.
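
To make "rule-based" concrete, here is a rough sketch of HTTP-method rules plus tokenized keyword matching, with an override that removes tags. The rule tables, tag names, and function names are hypothetical and much smaller than any real catalog; MCP annotations are left out for brevity.

```python
import re

# Hypothetical rule tables for illustration; not Shipgate's actual catalog.
METHOD_TAGS = {"GET": "read", "POST": "write", "PUT": "write", "DELETE": "destructive"}
KEYWORD_TAGS = {"refund": "financial", "email": "external_communication", "deploy": "deployment"}


def tokenize(text):
    """Split camelCase and punctuation into a set of lowercase tokens."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    return set(re.findall(r"[a-z0-9]+", spaced.lower()))


def classify(name, description="", scopes=(), method=None, remove_tags=()):
    """Return sorted risk tags derived from fixed rules only."""
    tags = set()
    if method and method.upper() in METHOD_TAGS:
        tags.add(METHOD_TAGS[method.upper()])
    tokens = tokenize(" ".join([name, description, *scopes]))
    # Whole-token matching: "refund" matches "refundOrder", not "prefunded".
    tags.update(tag for keyword, tag in KEYWORD_TAGS.items() if keyword in tokens)
    # remove_tags stands in for a manifest override that drops a heuristic tag.
    return sorted(tags - set(remove_tags))
```

Matching on whole tokens rather than substrings is what keeps a name like `prefunded_report` from being tagged as a refund tool, and the same inputs always yield the same tags.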

Why YAML?

shipgate.yaml is meant for humans to read in PRs. The strict schema (Pydantic) catches typos at scan time. JSON would be denser; we found YAML wins on review legibility for the kind of mixed structured/free-form content (purposes, prohibited actions, suppression reasons) the manifest contains.
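
The value of a strict schema is easiest to see with a typo. Shipgate's own validation is Pydantic-based; the sketch below uses only the standard library to show the same idea of rejecting unknown keys. Treating the section names mentioned in this FAQ as the complete set of top-level keys is an assumption for illustration.

```python
import difflib

# Assumption: section names taken from this FAQ; the real manifest schema
# may define more keys or nest these differently.
KNOWN_KEYS = ("tool_sources", "risk_hints", "policies", "risk_overrides", "checks")


def unknown_key_errors(manifest, known=KNOWN_KEYS):
    """Return one error message per top-level key that the schema does not know."""
    errors = []
    for key in manifest:
        if key in known:
            continue
        # Suggest the closest known key so a typo is easy to spot in review.
        hint = difflib.get_close_matches(key, known, n=1)
        suffix = f" (did you mean '{hint[0]}'?)" if hint else ""
        errors.append(f"unknown key '{key}'{suffix}")
    return errors
```

A lenient parser would silently ignore a misspelled `polices:` section; a strict one turns it into a reviewable error.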

Why static-only?

Two reasons:

1. Trust posture. Shipgate runs against repositories that may not yet be approved for outbound network access. Static-only means it's safe to run on any repo before connecting any external service.
2. Determinism. Static checks always produce the same finding for the same inputs. Runtime tools can't make this guarantee.

The trade-off: Shipgate cannot detect dynamically-built tool surfaces. The SHIP-INVENTORY-LOW-CONFIDENCE-PRODUCTION-SURFACE finding is the safety net that nudges teams toward declarative inventories before promotion.

Will this slow down my CI?

Negligibly. A 600-tool scan completes in ~290 ms on a standard runner. The action overhead (Python install + dependency resolution) dominates; expect ~30 seconds end-to-end on a cold cache, ~10 seconds with a warm pip cache.

How do I roll this out without breaking everyone's workflow?

The recommended path:

1. Land it in advisory mode. No CI failures yet. Watch PR comments.
2. Tune risk_overrides and checks.ignore based on real false positives.
3. Save a baseline with agents-shipgate baseline save and commit it.
4. Switch to ci_mode: strict with --fail-on critical, keeping the baseline applied.
5. Widen the failure threshold (--fail-on critical,high) once the active list is small.

See Baseline Workflow § Rolling out strict mode.
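
The interaction between a baseline and a failure threshold in the rollout path can be sketched as follows. It assumes a baseline is a collection of finding fingerprints and that each finding carries a `severity` field; both are assumptions for illustration, and the function names are hypothetical.

```python
def active_findings(findings, baseline_fingerprints):
    """Drop findings whose fingerprint is already recorded in the baseline."""
    known = set(baseline_fingerprints)
    return [f for f in findings if f["fingerprint"] not in known]


def should_fail(findings, fail_on):
    """True if any finding has a severity listed in fail_on.

    fail_on mirrors the comma-separated form used above, e.g. "critical,high".
    Assumption: findings expose their level under a "severity" key.
    """
    levels = {level.strip() for level in fail_on.split(",") if level.strip()}
    return any(f.get("severity") in levels for f in findings)
```

With a baseline in place, only findings introduced after the baseline was saved count toward the threshold, which is what lets strict mode be switched on without failing every open PR.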

What about false positives?

Three answers:

1. Override the heuristic with risk_overrides.tools.{tool}.remove_tags.
2. Suppress the finding with checks.ignore (requires a reason).
3. File an issue with the false-positive label. The catalog improves through reports.

Does Shipgate ship with telemetry?

No. There's no posthog/sentry/mixpanel import anywhere in the codebase. Logs go to stderr (or JSON to stderr if AGENTS_SHIPGATE_LOG_FORMAT=json); nothing leaves the host.

Why does my report change when I run it on a different machine?

The run_id is derived from a hash that includes paths and timestamps — it's a session ID, not a stable identifier. The fields you can rely on across machines:

- findings[].fingerprint: deterministic from check_id + tool_name + canonical evidence
- findings[].id: fingerprint + content-derived discriminator on collision
- release_decision.decision: the deterministic release gate for the same input
- summary.*_count: the same for the same input

If you see a finding's fingerprint change between runs on the same input, that's a bug — please file it.
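
One way to compare reports from two machines is to project each onto the stable fields listed above and ignore everything else. This is a sketch under the assumption that `findings` is a list of dicts and `summary` is a flat dict; the helper names are hypothetical.

```python
def stable_view(report):
    """Keep only the fields expected to match across machines for the same input."""
    summary = report.get("summary", {})
    return {
        "decision": report["release_decision"]["decision"],
        # Sorted so that ordering differences do not register as changes.
        "fingerprints": sorted(f["fingerprint"] for f in report.get("findings", [])),
        "counts": {k: v for k, v in summary.items() if k.endswith("_count")},
    }


def same_result(report_a, report_b):
    """True if two reports agree on every stable field; run_id is ignored."""
    return stable_view(report_a) == stable_view(report_b)
```

If `same_result` is false for the same input, the differing entry in `stable_view` points at what to include in a bug report.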

Can I use this for non-agent codebases?

The model is built around tool-using agents specifically — tool_sources, risk_hints, policies. If you have an OpenAPI spec for a regular service and want to do general scope/auth review, the auth and schema checks would still apply, but you'd be using a small slice of the catalog. Probably better tools exist for that use case (Spectral, OpenAPI Linter).

Roadmap

ROADMAP.md is the source of truth. The current direction is the deterministic merge-gate / verifier loop: verify on PR diffs, merge verdicts, capability-change review, and routing trust-root edits to human review. Many earlier roadmap items have since shipped: SARIF output, the Release Evidence Packet, GitLab CI / CircleCI / Jenkins recipes, granular API checks, baselines, policy packs, and broader framework coverage (Anthropic, Google ADK, LangChain/LangGraph, CrewAI, Codex plugins, n8n).

Pricing? Hosted version?

The CLI and GitHub Action are open source under Apache-2.0 and free forever. The lab is exploring optional hosted infrastructure for organization-level rollups, history, and policy drift across many repos — but the static checker will always work standalone.

How do I contact the maintainers?

- General feedback: GitHub Discussions
- Bugs / false positives: Issues
- Design partnership: see threemoonslab.com
- Security: see SECURITY.md

Agents Shipgate · Apache-2.0 · maintained by Three Moons Lab · Report a false positive
