feat(0.2): Gate pillar lift — hero verdict + adapter diagnostics + policy redesign + schema docs by pmclSF · Pull Request #168 · pmclSF/terrain

pmclSF · 2026-05-05T02:51:09Z

Summary

First batch of Gate-pillar parity-gate lift work, targeting cells the audit specifically named as gating-decision-quality gaps.

Cells lifted:

| Cell | Before | After | Mechanism |
| --- | --- | --- | --- |
| `ai_eval_ingestion.E3` | 2 | 4 | per-field `IngestionDiagnostic` from every adapter, surfaced in `terrain ai run` |
| `ai_eval_ingestion.E4` | 3 | 4 | `docs/schema/eval-adapters.md` publishes the canonical contract |
| `ai_execution_gating.V2` | 2 | 4 | `uitokens.HeroVerdict` block at top of `terrain ai run` output |
| `ai_execution_gating.E3` | 2 | 4 | "Ingestion diagnostics" block in `terrain ai run` |
| `ai_execution_gating.P4` | 2 | 4 | `docs/user-guides/ai-eval-onboarding.md` first-10-minutes flow |
| `policy_governance.V2` | 3 | 4 | `terrain policy check` redesign: hero block, severity grouping, BracketedSeverity badges |

Net effect: ai_eval_ingestion area floor lifted 2→3. The other Gate areas remain hard-blocked at floor 2 by irreducible 0.3 work (labeled-PR precision corpus for E2 cells, sandboxed execution for ai_execution_gating.P1, and the EmptyNoAISurfaces / EmptyNoPolicyFile wiring on PR #167).

Branch order: stacked on PR #167 conceptually; can land in either order. PR #167 carries the V3 empty-state wiring this PR's policy_governance.V3 and ai_execution_gating.V3 evidence will reference once both merge.

What's in this PR (commit-by-commit)

76370e7 — Hero verdict block (internal/uitokens/) + adapter diagnostics (internal/airun/) + terrain ai run rendering changes
e731ee3 — Policy report redesign (internal/reporting/policy_report.go) + AI eval onboarding doc
b7ce44f — Published eval-adapter schema contract (docs/schema/eval-adapters.md)

Test plan

go test ./... green
go build ./... clean
make pillar-parity: ai_eval_ingestion lifted from floor 2 to floor 3
CI green
Manual smoke: terrain ai run shows the new hero verdict block + ingestion diagnostics
Manual smoke: terrain policy check (with violations) shows the redesigned severity-grouped output

Why Gate floor=4 isn't yet reached

The plan's bar for Gate is "publicly-claimable, hostile-review-defensible" (floor=4). Several cells require multi-week work that's outside this PR's scope:

E2 cells (pr_change_scoped, ai_execution_gating, ai_eval_ingestion) need a labeled-PR precision corpus — that's 0.3 work.
P1 cells for AI execution gating need sandboxed eval execution — also 0.3.
V3 / V1 / V2 polish across remaining Gate cells — incremental work to follow this PR.

This PR moves the needle on six cells; the remaining cells get attention in subsequent stacked PRs.

…illar lift)

Lifts three Gate-pillar cells from synthetic-fixture floor toward publicly-claimable: ai_eval_ingestion.E3 (2→4), ai_execution_gating.V2 (2→4), ai_execution_gating.E3 (2→4).

internal/uitokens/uitokens.go:
- HeroVerdict(verdict, headline): designed three-line block with rule / indented badge + headline / rule. The block frames the gating decision so it carries visual weight beyond the rest of the report. Color-and-symbol via existing token vocabulary (Alert/Warn/Ok + SymFail/SymWarn/SymOK).
- HeroVerdictMarkdown(verdict, headline, reason): markdown variant for PR-comment / GitHub surfaces. Blockquote callout (tints on GitHub) + horizontal rule. Optional reason as italic line.
- heroVerdictBadge / bracketVerdict helpers handle the BLOCKED / WARN / PASS vocabulary distinct from VerdictBadge so the hero presentation can use a heavier shape ("[BLOCKED]") without changing VerdictBadge's contract.
- Tests: TestHeroVerdict + TestHeroVerdictMarkdown lock both shapes.

cmd/terrain/cmd_ai.go:
- `terrain ai run` text output now leads with HeroVerdict block, followed by structured Reason / Command / AI Signals / Ingestion diagnostics sections; the previous single-line `Decision: BLOCKED — reason` is replaced.
- aiRunHeroLines() centralizes the (action, reason, signalCount) → (verdict, headline) mapping so JSON / text / downstream PR surfaces stay consistent.

internal/airun/eval_result.go:
- New IngestionDiagnostic{Field, Kind, Detail} type capturing per-field fallbacks during adapter ingestion (kinds: missing, computed, default-applied, coerced).
- EvalRunResult.Diagnostics field surfaces these to consumers.

internal/airun/{promptfoo,deepeval,ragas}.go:
- Each adapter records diagnostics for the fallbacks that matter to gating decisions: derived aggregates when stats block is absent, missing tokenUsage.cost (aiCostRegression no-ops), defaulted timestamps, missing metricsData (DeepEval), and missing quality axes (Ragas, when no faithfulness / context_recall / answer_relevancy in any row).
- Tests in promptfoo_test.go lock the canonical diagnostic emissions.

cmd/terrain/cmd_ai.go (rendering):
- New "Ingestion diagnostics (N):" block in `terrain ai run` output surfaces every IngestionDiagnostic with its kind and detail. Adopters auditing a gating decision can see exactly which fields fell back.

docs/release/parity/scores.yaml:
- ai_eval_ingestion.E3: 2→4
- ai_execution_gating.V2: 2→4
- ai_execution_gating.E3: 2→4

These three cells were among the audit's specifically-named Gate-pillar gaps. Several other Gate cells remain at 3 (the publicly-claimable bar requires labeled-PR precision corpus and additional doc/UX lifts); this is one focused step toward the Gate floor=4 target.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ar lift)

Lifts two more Gate-pillar cells: policy_governance.V2 (3→4) and ai_execution_gating.P4 (2→4).

internal/reporting/policy_report.go:
- Redesigned `terrain policy check` rendering. Hero verdict block at top via uitokens.HeroVerdict (PASS / BLOCKED / WARN with violation count), replacing the previous single Status: PASS/FAIL line.
- Violations grouped by severity (critical → low) with BracketedSeverity badges per violation.
- Per-violation now shows `[CRIT] type (Category) — explanation` with a `location:` follow-on, replacing the flat ` - <type>: <explanation>` rendering.
- New helpers: severityRenderOrder (canonical ordering), groupViolationsBySeverity (deterministic grouping with category + type tiebreakers), policyHeroLines (verdict + headline mapping).

docs/user-guides/ai-eval-onboarding.md (new):
- First-10-minutes walkthrough closing the audit's ai_execution_gating.P4 finding ("users new to AI evals don't know whether to run Promptfoo first").
- Three-step flow: ai list → run framework yourself → ai run.
- Explicit "what Terrain does vs. what you do" table to clarify the trust boundary up-front.
- Per-framework commands for Promptfoo, DeepEval, Ragas with their output-flag invocations.
- Step 4 covers ingestion-diagnostics interpretation (introduced in the previous commit) so adopters can audit gate-decision data lineage.
- Common-questions section addresses sandboxing, custom frameworks, audit trail.

docs/release/parity/scores.yaml:
- ai_execution_gating.P4: 2→4
- policy_governance.V2: 3→4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
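
The deterministic grouping this commit attributes to groupViolationsBySeverity (bucket by severity, then break ties by category and type) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `internal/reporting/policy_report.go`: the `violation` struct, the `groupBySeverity` name, and the intermediate severity levels `high` and `medium` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// violation is a hypothetical shape used only for this sketch.
type violation struct {
	Severity, Category, Type, Explanation string
}

// severityOrder mirrors the "critical → low" ordering from the commit message.
// The intermediate levels are assumed.
var severityOrder = []string{"critical", "high", "medium", "low"}

// groupBySeverity buckets violations by severity and sorts each bucket by
// category, then type, so the rendered output is stable across runs.
func groupBySeverity(vs []violation) map[string][]violation {
	groups := make(map[string][]violation)
	for _, v := range vs {
		groups[v.Severity] = append(groups[v.Severity], v)
	}
	for _, g := range groups {
		// Sorting swaps elements in the shared backing array, so the map's
		// slices end up ordered without reassignment.
		sort.SliceStable(g, func(i, j int) bool {
			if g[i].Category != g[j].Category {
				return g[i].Category < g[j].Category
			}
			return g[i].Type < g[j].Type
		})
	}
	return groups
}

func main() {
	groups := groupBySeverity([]violation{
		{"low", "style", "naming", "inconsistent test names"},
		{"critical", "coverage", "untested-export", "exported API lacks tests"},
		{"critical", "ai", "unpinned-model", "model tag floats"},
	})
	// Iterate the fixed order slice, never the map, to keep output deterministic.
	for _, sev := range severityOrder {
		for _, v := range groups[sev] {
			fmt.Printf("[%s] %s (%s): %s\n", sev, v.Type, v.Category, v.Explanation)
		}
	}
}
```

Driving the render loop from a fixed order slice matters because Go randomizes map iteration order; ranging over the map directly would make snapshot tests flaky.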

Lifts ai_eval_ingestion.E4: 3 → 4.

docs/schema/eval-adapters.md (new):
- Documents the canonical EvalRunResult / EvalCase / EvalAggregates / TokenUsage / IngestionDiagnostic shape every adapter (Promptfoo, DeepEval, Ragas, Gauntlet) emits.
- Field-level "Stability: Stable" annotations make the long-lived contract explicit per FIELD_TIERS.md tiers.
- Adapter-authoring checklist: parse canonical format, populate Stable fields, emit IngestionDiagnostic per fallback, add conformance fixtures, lock new diagnostics with unit tests.
- Cross-references per-framework integration docs + conformance suite.

The schema doc closes the audit's E4 concern that adapters "consume each upstream's shape and we won't notice when upstream changes." The published contract + diagnostic mechanism + conformance tests collectively give us notice on shape drift.

docs/release/parity/scores.yaml:
- ai_eval_ingestion.E4: 3→4

Net `make pillar-parity` after this commit: AI eval ingestion area floor lifted 2 → 3 (from cells E3=4 + E4=4 this PR plus V2/V3 still at 3 carrying the area).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-05T02:51:57Z

[RISK] Terrain — Merge with caution

High-severity gaps found in changed code.

| Metric | Value |
| --- | --- |
| Changed files | 15 (8 source · 3 test) |
| Impacted units | 51 |
| Protection gaps | 16 |
| Tests selected | 10 of 796 (1% of suite) |

Coverage gaps in changed code

cmd/terrain/cmd_ai.go [LOW] — cmd_ai.go has no observed test coverage.
→ Add unit tests for cmd_ai.go.
internal/airun/deepeval.go [MED] — Exported function LoadDeepEvalFile has no observed test coverage.
→ Add unit tests for exported function LoadDeepEvalFile — this is public API surface.
internal/airun/eval_result.go [MED] — Exported method CaseCount has no observed test coverage.
→ Add unit tests for exported method CaseCount — this is public API surface.
internal/airun/eval_result.go [MED] — Exported method SuccessRate has no observed test coverage.
→ Add unit tests for exported method SuccessRate — this is public API surface.
internal/airun/eval_result.go [MED] — Exported class IngestionDiagnostic has no observed test coverage.
→ Add unit tests for exported class IngestionDiagnostic — this is public API surface.
internal/airun/promptfoo.go [MED] — Exported function LoadPromptfooFile has no observed test coverage.
→ Add unit tests for exported function LoadPromptfooFile — this is public API surface.
internal/airun/promptfoo.go [MED] — Exported method UnmarshalJSON has no observed test coverage.
→ Add unit tests for exported method UnmarshalJSON — this is public API surface.
internal/airun/promptfoo.go [MED] — Exported method IsArray has no observed test coverage.
→ Add unit tests for exported method IsArray — this is public API surface.
internal/airun/promptfoo.go [MED] — Exported method IsNested has no observed test coverage.
→ Add unit tests for exported method IsNested — this is public API surface.
internal/airun/ragas.go [MED] — Exported function LoadRagasFile has no observed test coverage.
→ Add unit tests for exported function LoadRagasFile — this is public API surface.
...and 6 more (6 medium)

12 pre-existing issues on changed files

internal/airun/promptfoo.go [MED] — [aiModelDeprecationRisk] model tag gpt-4 resolves to whatever the provider currently maps it to; pin a dated variant (e.g. gpt-4-0613)
internal/portfolio/manifest_test.go [LOW] — [staticSkippedTest] 1 of 12 tests statically skipped (8%) in internal/portfolio/manifest_test.go.
cmd/terrain/cmd_ai.go [HIGH] — [blastRadiusHotspot] Changes to this file propagate to 169 tests (169 direct, 0 indirect). High blast radius increases regression risk.
internal/airun/deepeval.go [HIGH] — [blastRadiusHotspot] Changes to this file propagate to 436 tests (69 direct, 367 indirect). High blast radius increases regression risk.
internal/airun/eval_result.go [HIGH] — [blastRadiusHotspot] Changes to this file propagate to 436 tests (69 direct, 367 indirect). High blast radius increases regression risk.
...and 7 more

Recommended tests

10 test(s) with exact coverage of 34 impacted unit(s). 17 impacted unit(s) have no covering tests in the selected set.

| Test | Confidence | Why |
| --- | --- | --- |
| `internal/aidetect/cost_regression_test.go` | exact | exact coverage of `EvalCase`, `EvalRunResult`, `TokenUsage` |
| `internal/aidetect/hallucination_rate_test.go` | exact | exact coverage of `EvalCase`, `EvalRunResult` |
| `internal/aidetect/retrieval_regression_test.go` | exact | exact coverage of `EvalCase` |
| `internal/airun/deepeval_test.go` | exact | exact coverage of `ParseDeepEvalJSON` |
| `internal/airun/envelope_test.go` | exact | exact coverage of `EvalAggregates`, `EvalCase`, `EvalRunResult` + 1 more |
| `internal/airun/promptfoo_test.go` | exact | exact coverage of `ParsePromptfooJSON` |
| `internal/airun/ragas_test.go` | exact | exact coverage of `ParseRagasJSON` |
| `internal/portfolio/manifest_test.go` | exact | test file directly changed |
| `internal/suppression/suppression_test.go` | exact | exact coverage of `Apply`, `Entry`, `File` + 1 more |
| `internal/uitokens/uitokens_test.go` | exact | exact coverage of `Accent`, `Alert`, `Bar` + 20 more |

AI Risk Review

Scenarios: 0 of 17 selected

1 advisory finding

internal/airun/promptfoo.go:316, 317 — Model tag is sunset or floats — the next API call could break or silently re-resolve.
→ Pin to a dated model variant (e.g. gpt-4-0613) or upgrade to a current tier.

Owners: PMCLSF

Limitations

No coverage artifacts provided; protection gaps reflect missing data, not measured absence. Provide --coverage to improve accuracy.
Mixed test cultures reduce cross-framework optimization confidence. Consider standardizing on fewer frameworks.

Generated by Terrain · `terrain pr --json` for machine-readable output

Targeted Test Results

Terrain selected 10 test(s) instead of the full suite.

Go tests: passed

github-actions · 2026-05-05T02:52:01Z

Terrain AI Risk Review

| Metric | Value |
| --- | --- |
| AI surfaces | 13 |
| Eval scenarios | 17 |
| Impacted scenarios | 0 |
| Uncovered surfaces | 13 |

Decision: PASS — AI surfaces are covered.

PR #132 introduced internal/server/server.go's direct import of golang.org/x/sync/singleflight, but go.mod was never re-tidied so the require line still carries // indirect. CI's `go mod tidy && git diff --exit-code go.mod go.sum` step now fails on every PR because of this drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two pre-existing Windows-only test failures blocking CI on every PR in the 0.2 stack.

internal/suppression/suppression.go:
- pathMatch was using filepath.Match on inputs already normalized to forward-slashes via filepath.ToSlash. On Windows filepath.Match treats `\` as the separator, so `*.go` matched the entire forward-slashed `sub/foo.go` (the `/` wasn't a separator in its semantics). Switch to path.Match (Unix semantics) via a pathPkgMatch helper. Forward-slash inputs + Unix-semantics matcher = correct behavior on every host OS.

internal/portfolio/manifest_test.go:
- TestResolveRepoPath_Absolute constructs `\elsewhere\repo` expecting filepath.IsAbs to recognize it as absolute. Windows treats this as relative (drive letter required), so the test fixture isn't actually testing what it intends. Skip on Windows where the rooted-without-drive case is a different edge case the function doesn't claim to handle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
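
The suppression fix comes down to pairing forward-slash inputs with a matcher that always treats `/` as the separator. A minimal sketch of that idea follows; `matchSlashPath` is a hypothetical name and is not the `pathPkgMatch` helper from the commit.

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
)

// matchSlashPath matches a glob against a path normalized to forward slashes.
// path.Match always treats "/" as the separator, regardless of host OS,
// whereas filepath.Match uses the OS-specific separator.
func matchSlashPath(pattern, name string) (bool, error) {
	return path.Match(pattern, filepath.ToSlash(name))
}

func main() {
	ok, _ := matchSlashPath("*.go", "sub/foo.go")
	fmt.Println(ok) // false: "*" does not cross "/"

	ok, _ = matchSlashPath("sub/*.go", "sub/foo.go")
	fmt.Println(ok) // true
}
```

With `path.Match`, the `*` wildcard never spans a `/`, so a bare `*.go` pattern only matches files at the top level on every platform.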

…licy redesign + schema docs (#168)

* feat(0.2): hero verdict block + adapter ingestion diagnostics (Gate pillar lift)
* feat(0.2): policy report redesign + AI eval onboarding doc (Gate pillar lift)
* feat(0.2): published eval-adapter schema contract (Gate pillar lift)
* chore: fix go.mod indirect annotation for golang.org/x/sync
* fix: cross-platform path handling in suppression + portfolio tests

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pmclSF and others added 3 commits May 4, 2026 19:44

This was referenced May 5, 2026

feat(0.2): Gate pillar lift batch 2 — final lifts before 0.3 corpus work #169

Merged

feat(0.2): final pillar lifts — Understand pillar PASSES at floor=3 #171

Merged

pmclSF and others added 2 commits May 6, 2026 17:02

pmclSF merged commit 4e65439 into main May 7, 2026
11 checks passed

pmclSF deleted the feat/0.2-gate-pillar-lift branch May 7, 2026 00:11

pmclSF mentioned this pull request May 7, 2026

feat(0.2): pillar lift batch 3 — portfolio + explain schemas + benchmarks + analyze error UX #172

Merged

2 tasks

pmclSF merged 5 commits into main from feat/0.2-gate-pillar-lift
