feat(0.2): Gate pillar lift — hero verdict + adapter diagnostics + policy redesign + schema docs (#168)
Merged
feat(0.2): hero verdict block + adapter ingestion diagnostics (Gate pillar lift)
Lifts three Gate-pillar cells from synthetic-fixture floor toward
publicly-claimable: ai_eval_ingestion.E3 (2→4),
ai_execution_gating.V2 (2→4), ai_execution_gating.E3 (2→4).
internal/uitokens/uitokens.go:
- HeroVerdict(verdict, headline) — designed three-line block with
rule / indented badge + headline / rule. The block frames the
gating decision so it carries visual weight beyond the rest of
the report. Color-and-symbol via existing token vocabulary
(Alert/Warn/Ok + SymFail/SymWarn/SymOK).
- HeroVerdictMarkdown(verdict, headline, reason) — markdown variant
for PR-comment / GitHub surfaces. Blockquote callout (tints on
GitHub) + horizontal rule. Optional reason as italic line.
- heroVerdictBadge / bracketVerdict helpers handle the BLOCKED /
WARN / PASS vocabulary distinct from VerdictBadge so the hero
presentation can use a heavier shape ("[BLOCKED]") without
changing VerdictBadge's contract.
- Tests: TestHeroVerdict + TestHeroVerdictMarkdown lock both shapes.
cmd/terrain/cmd_ai.go:
- `terrain ai run` text output now leads with HeroVerdict block,
followed by structured Reason / Command / AI Signals /
Ingestion diagnostics sections — the previous single-line
`Decision: BLOCKED — reason` is replaced.
- aiRunHeroLines() centralizes the (action, reason, signalCount)
→ (verdict, headline) mapping so JSON / text / downstream PR
surfaces stay consistent.
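The centralized mapping might look roughly like the sketch below. The action strings, signature, and fallback wording are assumptions inferred from the commit message, not the actual `aiRunHeroLines()` code in cmd_ai.go.

```go
package main

import "fmt"

// heroLines maps a gate action, reason, and signal count to the
// (verdict, headline) pair the hero block renders. Hypothetical
// stand-in for aiRunHeroLines().
func heroLines(action, reason string, signalCount int) (verdict, headline string) {
	switch action {
	case "block":
		return "BLOCKED", reason
	case "warn":
		return "WARN", reason
	default:
		return "PASS", fmt.Sprintf("%d AI signal(s) evaluated", signalCount)
	}
}

func main() {
	v, h := heroLines("block", "cost regression exceeded budget", 2)
	fmt.Printf("%s: %s\n", v, h)
}
```

Keeping this mapping in one function is what lets the JSON, text, and PR-comment surfaces agree on the verdict wording.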
internal/airun/eval_result.go:
- New IngestionDiagnostic{Field, Kind, Detail} type capturing
per-field fallbacks during adapter ingestion (kinds: missing,
computed, default-applied, coerced).
- EvalRunResult.Diagnostics field surfaces these to consumers.
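The diagnostic shape, as described, can be sketched like this. The field names and the four kinds come from the commit message; the example values and rendering are illustrative, not copied from eval_result.go.

```go
package main

import "fmt"

// IngestionDiagnostic records one per-field fallback taken during
// adapter ingestion: which field, how it fell back, and a
// human-readable detail. Sketch of the contract described above.
type IngestionDiagnostic struct {
	Field  string // e.g. "tokenUsage.cost"
	Kind   string // "missing" | "computed" | "default-applied" | "coerced"
	Detail string
}

func main() {
	diags := []IngestionDiagnostic{
		{Field: "aggregates", Kind: "computed", Detail: "stats block absent; derived from per-case results"},
		{Field: "tokenUsage.cost", Kind: "missing", Detail: "aiCostRegression will no-op"},
	}
	for _, d := range diags {
		fmt.Printf("%s: %s (%s)\n", d.Kind, d.Field, d.Detail)
	}
}
```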
internal/airun/{promptfoo,deepeval,ragas}.go:
- Each adapter records diagnostics for the fallbacks that matter
to gating decisions: derived aggregates when stats block is
absent, missing tokenUsage.cost (aiCostRegression no-ops),
defaulted timestamps, missing metricsData (DeepEval), and
missing quality axes (Ragas — when no faithfulness /
context_recall / answer_relevancy in any row).
- Tests in promptfoo_test.go lock the canonical diagnostic
emissions.
cmd/terrain/cmd_ai.go (rendering):
- New "Ingestion diagnostics (N):" block in `terrain ai run`
output surfaces every IngestionDiagnostic with its kind and
detail. Adopters auditing a gating decision can see exactly
which fields fell back.
docs/release/parity/scores.yaml:
- ai_eval_ingestion.E3: 2→4
- ai_execution_gating.V2: 2→4
- ai_execution_gating.E3: 2→4
These three cells were among the audit's specifically-named
Gate-pillar gaps. Several other Gate cells remain at 3 (the
publicly-claimable bar requires labeled-PR precision corpus
and additional doc/UX lifts) — this is one focused step toward
the Gate floor=4 target.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(0.2): policy report redesign + AI eval onboarding doc (Gate pillar lift)
Lifts two more Gate-pillar cells: policy_governance.V2 (3→4) and
ai_execution_gating.P4 (2→4).
internal/reporting/policy_report.go:
- Redesigned `terrain policy check` rendering. Hero verdict block
at top via uitokens.HeroVerdict — PASS / BLOCKED / WARN with
violation count, replacing the previous single Status: PASS/FAIL
line.
- Violations grouped by severity (critical → low) with
BracketedSeverity badges per violation.
- Per-violation now shows `[CRIT] type (Category) — explanation`
with a `location:` follow-on, replacing the flat
` - <type>: <explanation>` rendering.
- New helpers: severityRenderOrder (canonical ordering),
groupViolationsBySeverity (deterministic grouping with category
+ type tiebreakers), policyHeroLines (verdict + headline mapping).
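The deterministic grouping described above might be sketched as follows. The `Violation` shape and severity names are assumptions; the canonical ordering and the category-then-type tiebreakers follow the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// Violation is a hypothetical stand-in for the policy violation type.
type Violation struct {
	Severity, Category, Type string
}

// severityRenderOrder: canonical critical → low rendering order.
var severityRenderOrder = []string{"critical", "high", "medium", "low"}

// groupBySeverity buckets violations and sorts each bucket by
// category, then type, so output is deterministic run-to-run.
func groupBySeverity(vs []Violation) map[string][]Violation {
	groups := make(map[string][]Violation)
	for _, v := range vs {
		groups[v.Severity] = append(groups[v.Severity], v)
	}
	for _, g := range groups {
		sort.Slice(g, func(i, j int) bool {
			if g[i].Category != g[j].Category {
				return g[i].Category < g[j].Category
			}
			return g[i].Type < g[j].Type
		})
	}
	return groups
}

func main() {
	groups := groupBySeverity([]Violation{
		{"low", "docs", "missing-doc"},
		{"critical", "gating", "unreviewed-ai-surface"},
	})
	for _, sev := range severityRenderOrder {
		for _, v := range groups[sev] {
			fmt.Printf("[%s] %s (%s)\n", sev, v.Type, v.Category)
		}
	}
}
```

Sorting inside each severity bucket is what makes the report stable under map-iteration nondeterminism.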
docs/user-guides/ai-eval-onboarding.md (new):
- First-10-minutes walkthrough closing the audit's
ai_execution_gating.P4 finding ("users new to AI evals don't know
whether to run Promptfoo first").
- Three-step flow: ai list → run framework yourself → ai run.
- Explicit "what Terrain does vs. what you do" table to clarify
the trust boundary up-front.
- Per-framework commands for Promptfoo, DeepEval, Ragas with their
output-flag invocations.
- Step 4 covers ingestion-diagnostics interpretation (introduced
in the previous commit) so adopters can audit gate-decision
data lineage.
- Common-questions section addresses sandboxing, custom
frameworks, audit trail.
docs/release/parity/scores.yaml:
- ai_execution_gating.P4: 2→4
- policy_governance.V2: 3→4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(0.2): published eval-adapter schema contract (Gate pillar lift)
Lifts ai_eval_ingestion.E4: 3→4.
docs/schema/eval-adapters.md (new):
- Documents the canonical EvalRunResult / EvalCase / EvalAggregates /
TokenUsage / IngestionDiagnostic shape every adapter (Promptfoo,
DeepEval, Ragas, Gauntlet) emits.
- Field-level "Stability: Stable" annotations make the long-lived
contract explicit per FIELD_TIERS.md tiers.
- Adapter-authoring checklist: parse the canonical format, populate
Stable fields, emit an IngestionDiagnostic per fallback, add
conformance fixtures, lock new diagnostics with unit tests.
- Cross-references per-framework integration docs + the conformance
suite.
The schema doc closes the audit's E4 concern that adapters "consume
each upstream's shape and we won't notice when upstream changes."
The published contract + diagnostic mechanism + conformance tests
collectively give us notice on shape drift.
docs/release/parity/scores.yaml:
- ai_eval_ingestion.E4: 3→4
Net `make pillar-parity` after this commit: the AI eval ingestion
area floor lifts 2→3 (cells E3=4 + E4=4 from this PR, with V2/V3
still at 3 carrying the area).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
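A compressed sketch of the canonical contract the schema doc publishes, using only the type names this PR cites (EvalRunResult, EvalCase, EvalAggregates, TokenUsage, IngestionDiagnostic). The exact field sets and types here are assumptions for illustration, not a copy of docs/schema/eval-adapters.md.

```go
package main

import "fmt"

// TokenUsage: token counts and cost, if the framework reports them.
type TokenUsage struct {
	Total int
	Cost  float64
}

// EvalCase: one evaluated case. Field set is assumed.
type EvalCase struct {
	Name   string
	Passed bool
	Score  float64
}

// EvalAggregates: run-level pass/fail totals.
type EvalAggregates struct {
	Passed, Failed int
}

// IngestionDiagnostic: one per-field fallback taken during ingestion.
type IngestionDiagnostic struct {
	Field, Kind, Detail string
}

// EvalRunResult: the canonical shape every adapter emits.
type EvalRunResult struct {
	Framework   string // e.g. "promptfoo", "deepeval", "ragas", "gauntlet"
	Cases       []EvalCase
	Aggregates  EvalAggregates
	Tokens      TokenUsage
	Diagnostics []IngestionDiagnostic
}

func main() {
	r := EvalRunResult{
		Framework:  "promptfoo",
		Cases:      []EvalCase{{Name: "c1", Passed: true, Score: 0.9}},
		Aggregates: EvalAggregates{Passed: 1},
	}
	fmt.Printf("%s: %d/%d passed, %d diagnostic(s)\n",
		r.Framework, r.Aggregates.Passed, len(r.Cases), len(r.Diagnostics))
}
```

An adapter author, per the checklist above, would populate the Stable fields and append one `IngestionDiagnostic` for every fallback taken.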
[RISK] Terrain — Merge with caution
- Coverage gaps in changed code
- 12 pre-existing issues on changed files
- Recommended tests: 10 test(s) with exact coverage of 34 impacted unit(s). 17 impacted unit(s) have no covering tests in the selected set.
- AI Risk Review: 1 advisory finding
Owners: pmclSF
Limitations: Terrain selected 10 test(s) instead of the full suite.
Generated by Terrain · Targeted Test Results

Terrain AI Risk Review
Decision: PASS — AI surfaces are covered.
This was referenced May 5, 2026
chore: fix go.mod indirect annotation for golang.org/x/sync
PR #132 introduced internal/server/server.go's direct import of
golang.org/x/sync/singleflight, but go.mod was never re-tidied, so
the require line still carries `// indirect`. CI's `go mod tidy &&
git diff --exit-code go.mod go.sum` step now fails on every PR
because of this drift.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix: cross-platform path handling in suppression + portfolio tests
Two pre-existing Windows-only test failures were blocking CI on every
PR in the 0.2 stack.
internal/suppression/suppression.go:
- pathMatch was using filepath.Match on inputs already normalized to
forward slashes via filepath.ToSlash. On Windows, filepath.Match
treats `\` as the separator, so `*.go` matched the entire
forward-slashed `sub/foo.go` (the `/` wasn't a separator in its
semantics). Switched to path.Match (Unix semantics) via a
pathPkgMatch helper. Forward-slash inputs + a Unix-semantics matcher
= correct behavior on every host OS.
internal/portfolio/manifest_test.go:
- TestResolveRepoPath_Absolute constructs `\elsewhere\repo` expecting
filepath.IsAbs to recognize it as absolute. Windows treats this as
relative (a drive letter is required), so the fixture wasn't testing
what it intended. Skip on Windows, where the rooted-without-drive
case is a different edge the function doesn't claim to handle.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pmclSF added a commit that referenced this pull request on May 9, 2026.
Summary
First batch of Gate-pillar parity-gate lift work, targeting cells the audit specifically named as gating-decision-quality gaps.
Cells lifted:
- `ai_eval_ingestion.E3`: `IngestionDiagnostic` from every adapter, surfaced in `terrain ai run`
- `ai_eval_ingestion.E4`: `docs/schema/eval-adapters.md` publishes the canonical contract
- `ai_execution_gating.V2`: `uitokens.HeroVerdict` block at top of `terrain ai run` output
- `ai_execution_gating.E3`: ingestion diagnostics in `terrain ai run` output
- `ai_execution_gating.P4`: `docs/user-guides/ai-eval-onboarding.md` first-10-minutes flow
- `policy_governance.V2`: `terrain policy check` redesign — hero block, severity grouping, BracketedSeverity badges

Net effect:
`ai_eval_ingestion` area floor lifted 2→3. The other Gate areas are still hard-blocked at floor 2 by irreducible 0.3 work (labeled-PR precision corpus for E2 cells, sandboxed execution for `ai_execution_gating.P1`, and the `EmptyNoAISurfaces` / `EmptyNoPolicyFile` wiring on PR #167).

Branch order: stacked on PR #167 conceptually; can land in either order. PR #167 carries the V3 empty-state wiring this PR's `policy_governance.V3` and `ai_execution_gating.V3` evidence will reference once both merge.

What's in this PR (commit-by-commit)
- `76370e7` — Hero verdict block (`internal/uitokens/`) + adapter diagnostics (`internal/airun/`) + `terrain ai run` rendering changes
- `e731ee3` — Policy report redesign (`internal/reporting/policy_report.go`) + AI eval onboarding doc
- `b7ce44f` — Published eval-adapter schema contract (`docs/schema/eval-adapters.md`)

Test plan
- `go test ./...` green
- `go build ./...` clean
- `make pillar-parity`: `ai_eval_ingestion` lifted from floor 2 to floor 3
- `terrain ai run` shows the new hero verdict block + ingestion diagnostics
- `terrain policy check` (with violations) shows the redesigned severity-grouped output

Why Gate floor=4 isn't yet reached
The plan's bar for Gate is "publicly-claimable, hostile-review-defensible" (floor=4). Several cells require multi-week work that's outside this PR's scope:
- E2 cells across the gating areas (`pr_change_scoped`, `ai_execution_gating`, `ai_eval_ingestion`) need a labeled-PR precision corpus — that's 0.3 work.

This PR moves the needle on six cells; the remaining cells get attention in subsequent stacked PRs.