feat(0.2): Gate pillar lift — hero verdict + adapter diagnostics + policy redesign + schema docs #168

Merged
pmclSF merged 5 commits into main from feat/0.2-gate-pillar-lift
May 7, 2026

@pmclSF pmclSF commented May 5, 2026

Summary

First batch of Gate-pillar parity-gate lift work, targeting cells the audit specifically named as gating-decision-quality gaps.

Cells lifted:

| Cell | Before | After | Mechanism |
| --- | --- | --- | --- |
| ai_eval_ingestion.E3 | 2 | 4 | per-field IngestionDiagnostic from every adapter, surfaced in terrain ai run |
| ai_eval_ingestion.E4 | 3 | 4 | docs/schema/eval-adapters.md publishes the canonical contract |
| ai_execution_gating.V2 | 2 | 4 | uitokens.HeroVerdict block at top of terrain ai run output |
| ai_execution_gating.E3 | 2 | 4 | "Ingestion diagnostics" block in terrain ai run |
| ai_execution_gating.P4 | 2 | 4 | docs/user-guides/ai-eval-onboarding.md first-10-minutes flow |
| policy_governance.V2 | 3 | 4 | terrain policy check redesign: hero block, severity grouping, BracketedSeverity badges |

Net effect: the ai_eval_ingestion area floor is lifted 2→3. The other Gate areas are still hard-blocked at floor 2 by irreducible 0.3 work (labeled-PR precision corpus for E2 cells, sandboxed execution for ai_execution_gating.P1, and the EmptyNoAISurfaces / EmptyNoPolicyFile wiring on PR #167).

Branch order: stacked on PR #167 conceptually; can land in either order. PR #167 carries the V3 empty-state wiring this PR's policy_governance.V3 and ai_execution_gating.V3 evidence will reference once both merge.

What's in this PR (commit-by-commit)

  1. 76370e7 — Hero verdict block (internal/uitokens/) + adapter diagnostics (internal/airun/) + terrain ai run rendering changes
  2. e731ee3 — Policy report redesign (internal/reporting/policy_report.go) + AI eval onboarding doc
  3. b7ce44f — Published eval-adapter schema contract (docs/schema/eval-adapters.md)

Test plan

  • go test ./... green
  • go build ./... clean
  • make pillar-parity: ai_eval_ingestion lifted from floor 2 to floor 3
  • CI green
  • Manual smoke: terrain ai run shows the new hero verdict block + ingestion diagnostics
  • Manual smoke: terrain policy check (with violations) shows the redesigned severity-grouped output

Why Gate floor=4 isn't yet reached

The plan's bar for Gate is "publicly-claimable, hostile-review-defensible" (floor=4). Several cells require multi-week work that's outside this PR's scope:

  • E2 cells (pr_change_scoped, ai_execution_gating, ai_eval_ingestion) need a labeled-PR precision corpus — that's 0.3 work.
  • P1 cells for AI execution gating need sandboxed eval execution — also 0.3.
  • V3 / V1 / V2 polish across remaining Gate cells — incremental work to follow this PR.

This PR moves the needle on six cells; the remaining cells get attention in subsequent stacked PRs.

pmclSF and others added 3 commits May 4, 2026 19:44
…illar lift)

Lifts three Gate-pillar cells from synthetic-fixture floor toward
publicly-claimable: ai_eval_ingestion.E3 (2→4),
ai_execution_gating.V2 (2→4), ai_execution_gating.E3 (2→4).

internal/uitokens/uitokens.go:
- HeroVerdict(verdict, headline) — designed three-line block with
  rule / indented badge + headline / rule. The block frames the
  gating decision so it carries visual weight beyond the rest of
  the report. Color-and-symbol via existing token vocabulary
  (Alert/Warn/Ok + SymFail/SymWarn/SymOK).
- HeroVerdictMarkdown(verdict, headline, reason) — markdown variant
  for PR-comment / GitHub surfaces. Blockquote callout (tints on
  GitHub) + horizontal rule. Optional reason as italic line.
- heroVerdictBadge / bracketVerdict helpers handle the BLOCKED /
  WARN / PASS vocabulary distinct from VerdictBadge so the hero
  presentation can use a heavier shape ("[BLOCKED]") without
  changing VerdictBadge's contract.
- Tests: TestHeroVerdict + TestHeroVerdictMarkdown lock both shapes.

cmd/terrain/cmd_ai.go:
- `terrain ai run` text output now leads with HeroVerdict block,
  followed by structured Reason / Command / AI Signals /
  Ingestion diagnostics sections — the previous single-line
  `Decision: BLOCKED — reason` is replaced.
- aiRunHeroLines() centralizes the (action, reason, signalCount)
  → (verdict, headline) mapping so JSON / text / downstream PR
  surfaces stay consistent.
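
One way to picture the centralized mapping is a single pure function that every output surface calls. The sketch below is hypothetical: the action strings ("block", "warn", "pass"), the headline wording, and the name heroLines are assumptions made for illustration.

```go
package main

import "fmt"

// heroLines maps a gate outcome to the (verdict, headline) pair shown in
// the hero block. The action strings are assumed, not taken from the PR.
func heroLines(action, reason string, signalCount int) (verdict, headline string) {
	switch action {
	case "block":
		verdict = "BLOCKED"
	case "pass":
		verdict = "PASS"
	default:
		// Unrecognized actions surface as WARN rather than passing silently.
		verdict = "WARN"
	}

	switch signalCount {
	case 0:
		headline = "no AI signals"
	case 1:
		headline = "1 AI signal"
	default:
		headline = fmt.Sprintf("%d AI signals", signalCount)
	}
	if reason != "" {
		headline += ": " + reason
	}
	return verdict, headline
}

func main() {
	v, h := heroLines("block", "cost regression above threshold", 2)
	fmt.Println(v, "|", h)
}
```

Because the function has no I/O, text, JSON, and PR-comment renderers can all call it and cannot drift apart.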

internal/airun/eval_result.go:
- New IngestionDiagnostic{Field, Kind, Detail} type capturing
  per-field fallbacks during adapter ingestion (kinds: missing,
  computed, default-applied, coerced).
- EvalRunResult.Diagnostics field surfaces these to consumers.

internal/airun/{promptfoo,deepeval,ragas}.go:
- Each adapter records diagnostics for the fallbacks that matter
  to gating decisions: derived aggregates when stats block is
  absent, missing tokenUsage.cost (aiCostRegression no-ops),
  defaulted timestamps, missing metricsData (DeepEval), and
  missing quality axes (Ragas — when no faithfulness /
  context_recall / answer_relevancy in any row).
- Tests in promptfoo_test.go lock the canonical diagnostic
  emissions.

cmd/terrain/cmd_ai.go (rendering):
- New "Ingestion diagnostics (N):" block in `terrain ai run`
  output surfaces every IngestionDiagnostic with its kind and
  detail. Adopters auditing a gating decision can see exactly
  which fields fell back.
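
A rough sketch of how such a block could be rendered, assuming one line per diagnostic. The "Ingestion diagnostics (N):" heading comes from the commit message; the per-line layout and the name formatDiagnostics are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// IngestionDiagnostic mirrors the {Field, Kind, Detail} shape from the commit message.
type IngestionDiagnostic struct {
	Field, Kind, Detail string
}

// formatDiagnostics renders the block, or an empty string when there is
// nothing to report so clean runs stay quiet.
func formatDiagnostics(diags []IngestionDiagnostic) string {
	if len(diags) == 0 {
		return ""
	}
	var b strings.Builder
	fmt.Fprintf(&b, "Ingestion diagnostics (%d):\n", len(diags))
	for _, d := range diags {
		// The "  - field [kind]: detail" layout is an assumed format.
		fmt.Fprintf(&b, "  - %s [%s]: %s\n", d.Field, d.Kind, d.Detail)
	}
	return b.String()
}

func main() {
	fmt.Print(formatDiagnostics([]IngestionDiagnostic{
		{Field: "tokenUsage.cost", Kind: "missing", Detail: "cost regression check will no-op"},
		{Field: "timestamp", Kind: "default-applied", Detail: "ingestion time used"},
	}))
}
```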

docs/release/parity/scores.yaml:
- ai_eval_ingestion.E3: 2→4
- ai_execution_gating.V2: 2→4
- ai_execution_gating.E3: 2→4

These three cells were among the audit's specifically-named
Gate-pillar gaps. Several other Gate cells remain at 3 (the
publicly-claimable bar requires labeled-PR precision corpus
and additional doc/UX lifts) — this is one focused step toward
the Gate floor=4 target.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ar lift)

Lifts two more Gate-pillar cells: policy_governance.V2 (3→4) and
ai_execution_gating.P4 (2→4).

internal/reporting/policy_report.go:
- Redesigned `terrain policy check` rendering. Hero verdict block
  at top via uitokens.HeroVerdict — PASS / BLOCKED / WARN with
  violation count, replacing the previous single Status: PASS/FAIL
  line.
- Violations grouped by severity (critical → low) with
  BracketedSeverity badges per violation.
- Per-violation now shows `[CRIT] type (Category) — explanation`
  with a `location:` follow-on, replacing the flat
  `  - <type>: <explanation>` rendering.
- New helpers: severityRenderOrder (canonical ordering),
  groupViolationsBySeverity (deterministic grouping with category
  + type tiebreakers), policyHeroLines (verdict + headline mapping).
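
The grouping and ordering can be sketched as below. This is an illustrative approximation: the violation struct, the middle severity levels, and the badge text are assumptions, and the output line uses a simplified separator rather than the real rendering.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// violation is a simplified, hypothetical stand-in for a policy violation.
type violation struct {
	Severity    string
	Category    string
	Type        string
	Explanation string
}

// severityRenderOrder lists severities from most to least urgent.
// "critical" and "low" come from the commit message; the middle levels are assumed.
var severityRenderOrder = []string{"critical", "high", "medium", "low"}

// groupViolationsBySeverity buckets violations by severity and sorts each
// bucket by category, then type, so output does not depend on input order.
func groupViolationsBySeverity(vs []violation) map[string][]violation {
	groups := make(map[string][]violation)
	for _, v := range vs {
		groups[v.Severity] = append(groups[v.Severity], v)
	}
	for _, g := range groups {
		sort.SliceStable(g, func(i, j int) bool {
			if g[i].Category != g[j].Category {
				return g[i].Category < g[j].Category
			}
			return g[i].Type < g[j].Type
		})
	}
	return groups
}

// renderViolations walks severities in canonical order. Severities missing
// from severityRenderOrder are not rendered in this sketch.
func renderViolations(vs []violation) string {
	groups := groupViolationsBySeverity(vs)
	var b strings.Builder
	for _, sev := range severityRenderOrder {
		for _, v := range groups[sev] {
			fmt.Fprintf(&b, "[%s] %s (%s): %s\n",
				strings.ToUpper(sev), v.Type, v.Category, v.Explanation)
		}
	}
	return b.String()
}

func main() {
	fmt.Print(renderViolations([]violation{
		{"low", "quality", "flakyTest", "retry rate above threshold"},
		{"critical", "coverage", "untestedExport", "public API without tests"},
		{"critical", "ai", "missingEval", "prompt changed without eval"},
	}))
}
```

Iterating over a fixed severity slice instead of ranging over the map is what keeps the report order stable, since Go map iteration order is not specified.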

docs/user-guides/ai-eval-onboarding.md (new):
- First-10-minutes walkthrough closing the audit's
  ai_execution_gating.P4 finding ("users new to AI evals don't know
  whether to run Promptfoo first").
- Three-step flow: ai list → run framework yourself → ai run.
- Explicit "what Terrain does vs. what you do" table to clarify
  the trust boundary up-front.
- Per-framework commands for Promptfoo, DeepEval, Ragas with their
  output-flag invocations.
- Step 4 covers ingestion-diagnostics interpretation (introduced
  in the previous commit) so adopters can audit gate-decision
  data lineage.
- Common-questions section addresses sandboxing, custom
  frameworks, audit trail.

docs/release/parity/scores.yaml:
- ai_execution_gating.P4: 2→4
- policy_governance.V2: 3→4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lifts ai_eval_ingestion.E4: 3 → 4.

docs/schema/eval-adapters.md (new):
- Documents the canonical EvalRunResult / EvalCase / EvalAggregates /
  TokenUsage / IngestionDiagnostic shape every adapter (Promptfoo,
  DeepEval, Ragas, Gauntlet) emits.
- Field-level "Stability: Stable" annotations make the long-lived
  contract explicit per FIELD_TIERS.md tiers.
- Adapter-authoring checklist: parse canonical format, populate
  Stable fields, emit IngestionDiagnostic per fallback, add
  conformance fixtures, lock new diagnostics with unit tests.
- Cross-references per-framework integration docs +
  conformance suite.

The schema doc closes the audit's E4 concern that adapters
"consume each upstream's shape and we won't notice when upstream
changes." The published contract + diagnostic mechanism + conformance
tests collectively give us notice on shape drift.

docs/release/parity/scores.yaml:
- ai_eval_ingestion.E4: 3→4

Net `make pillar-parity` after this commit:
  AI eval ingestion area floor lifted 2 → 3 (E3 and E4 reach 4 in
  this PR; V2/V3 remain at 3 and now set the area floor).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions Bot commented May 5, 2026

[RISK] Terrain — Merge with caution

High-severity gaps found in changed code.

| Metric | Value |
| --- | --- |
| Changed files | 15 (8 source · 3 test) |
| Impacted units | 51 |
| Protection gaps | 16 |
| Tests selected | 10 of 796 (1% of suite) |

Coverage gaps in changed code

  • cmd/terrain/cmd_ai.go [LOW] — cmd_ai.go has no observed test coverage.
    → Add unit tests for cmd_ai.go.
  • internal/airun/deepeval.go [MED] — Exported function LoadDeepEvalFile has no observed test coverage.
    → Add unit tests for exported function LoadDeepEvalFile — this is public API surface.
  • internal/airun/eval_result.go [MED] — Exported method CaseCount has no observed test coverage.
    → Add unit tests for exported method CaseCount — this is public API surface.
  • internal/airun/eval_result.go [MED] — Exported method SuccessRate has no observed test coverage.
    → Add unit tests for exported method SuccessRate — this is public API surface.
  • internal/airun/eval_result.go [MED] — Exported class IngestionDiagnostic has no observed test coverage.
    → Add unit tests for exported class IngestionDiagnostic — this is public API surface.
  • internal/airun/promptfoo.go [MED] — Exported function LoadPromptfooFile has no observed test coverage.
    → Add unit tests for exported function LoadPromptfooFile — this is public API surface.
  • internal/airun/promptfoo.go [MED] — Exported method UnmarshalJSON has no observed test coverage.
    → Add unit tests for exported method UnmarshalJSON — this is public API surface.
  • internal/airun/promptfoo.go [MED] — Exported method IsArray has no observed test coverage.
    → Add unit tests for exported method IsArray — this is public API surface.
  • internal/airun/promptfoo.go [MED] — Exported method IsNested has no observed test coverage.
    → Add unit tests for exported method IsNested — this is public API surface.
  • internal/airun/ragas.go [MED] — Exported function LoadRagasFile has no observed test coverage.
    → Add unit tests for exported function LoadRagasFile — this is public API surface.
  • ...and 6 more (6 medium)
12 pre-existing issues on changed files
  • internal/airun/promptfoo.go [MED] — [aiModelDeprecationRisk] model tag gpt-4 resolves to whatever the provider currently maps it to; pin a dated variant (e.g. gpt-4-0613)
  • internal/portfolio/manifest_test.go [LOW] — [staticSkippedTest] 1 of 12 tests statically skipped (8%) in internal/portfolio/manifest_test.go.
  • cmd/terrain/cmd_ai.go [HIGH] — [blastRadiusHotspot] Changes to this file propagate to 169 tests (169 direct, 0 indirect). High blast radius increases regression risk.
  • internal/airun/deepeval.go [HIGH] — [blastRadiusHotspot] Changes to this file propagate to 436 tests (69 direct, 367 indirect). High blast radius increases regression risk.
  • internal/airun/eval_result.go [HIGH] — [blastRadiusHotspot] Changes to this file propagate to 436 tests (69 direct, 367 indirect). High blast radius increases regression risk.
  • ...and 7 more

Recommended tests

10 test(s) with exact coverage of 34 impacted unit(s). 17 impacted unit(s) have no covering tests in the selected set.

| Test | Confidence | Why |
| --- | --- | --- |
| internal/aidetect/cost_regression_test.go | exact | exact coverage of EvalCase, EvalRunResult, TokenUsage |
| internal/aidetect/hallucination_rate_test.go | exact | exact coverage of EvalCase, EvalRunResult |
| internal/aidetect/retrieval_regression_test.go | exact | exact coverage of EvalCase |
| internal/airun/deepeval_test.go | exact | exact coverage of ParseDeepEvalJSON |
| internal/airun/envelope_test.go | exact | exact coverage of EvalAggregates, EvalCase, EvalRunResult + 1 more |
| internal/airun/promptfoo_test.go | exact | exact coverage of ParsePromptfooJSON |
| internal/airun/ragas_test.go | exact | exact coverage of ParseRagasJSON |
| internal/portfolio/manifest_test.go | exact | test file directly changed |
| internal/suppression/suppression_test.go | exact | exact coverage of Apply, Entry, File + 1 more |
| internal/uitokens/uitokens_test.go | exact | exact coverage of Accent, Alert, Bar + 20 more |

AI Risk Review

Scenarios: 0 of 17 selected

1 advisory finding
  • internal/airun/promptfoo.go:316, 317 — Model tag is sunset or floats — the next API call could break or silently re-resolve.
    → Pin to a dated model variant (e.g. gpt-4-0613) or upgrade to a current tier.

Owners: PMCLSF

Limitations
  • No coverage artifacts provided; protection gaps reflect missing data, not measured absence. Provide --coverage to improve accuracy.
  • Mixed test cultures reduce cross-framework optimization confidence. Consider standardizing on fewer frameworks.

Generated by Terrain · terrain pr --json for machine-readable output

Targeted Test Results

Terrain selected 10 test(s) instead of the full suite.

  • Go tests: passed

github-actions Bot commented May 5, 2026

Terrain AI Risk Review

| Metric | Value |
| --- | --- |
| AI surfaces | 13 |
| Eval scenarios | 17 |
| Impacted scenarios | 0 |
| Uncovered surfaces | 13 |

Decision: PASS — AI surfaces are covered.

pmclSF and others added 2 commits May 6, 2026 17:02
PR #132 introduced internal/server/server.go's direct import of
golang.org/x/sync/singleflight, but go.mod was never re-tidied
so the require line still carries // indirect. CI's `go mod tidy
&& git diff --exit-code go.mod go.sum` step now fails on every
PR because of this drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two pre-existing Windows-only test failures blocking CI on every
PR in the 0.2 stack.

internal/suppression/suppression.go:
- pathMatch was using filepath.Match on inputs already normalized
  to forward-slashes via filepath.ToSlash. On Windows
  filepath.Match treats `\` as the separator, so `*.go` matched
  the entire forward-slashed `sub/foo.go` (the `/` wasn't a
  separator in its semantics). Switch to path.Match (Unix
  semantics) via a pathPkgMatch helper. Forward-slash inputs +
  Unix-semantics matcher = correct behavior on every host OS.
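
The fix can be illustrated with a small helper that always matches with forward-slash semantics. path.Match and filepath.ToSlash are standard-library calls; the helper name matchSlashPath is hypothetical and stands in for the pathPkgMatch helper, whose actual signature is not shown in this PR.

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
)

// matchSlashPath normalizes name to forward slashes and matches it with
// path.Match, whose separator is "/" on every host OS. With filepath.Match
// on Windows the separator is "\", so "*" would span the "/" in "sub/foo.go".
func matchSlashPath(pattern, name string) (bool, error) {
	return path.Match(pattern, filepath.ToSlash(name))
}

func main() {
	for _, name := range []string{"foo.go", "sub/foo.go"} {
		ok, err := matchSlashPath("*.go", name)
		fmt.Printf("%-12s matches *.go: %v (err: %v)\n", name, ok, err)
	}
}
```

With this helper, `*.go` matches `foo.go` but not `sub/foo.go` regardless of the operating system the binary runs on.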

internal/portfolio/manifest_test.go:
- TestResolveRepoPath_Absolute constructs `\elsewhere\repo`
  expecting filepath.IsAbs to recognize it as absolute. Windows
  treats this as relative (drive letter required), so the test
  fixture isn't actually testing what it intends. Skip on
  Windows where the rooted-without-drive case is a different
  edge case the function doesn't claim to handle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pmclSF pmclSF merged commit 4e65439 into main May 7, 2026
11 checks passed
@pmclSF pmclSF deleted the feat/0.2-gate-pillar-lift branch May 7, 2026 00:11
pmclSF added a commit that referenced this pull request May 9, 2026
…licy redesign + schema docs (#168)

* feat(0.2): hero verdict block + adapter ingestion diagnostics (Gate pillar lift)

* feat(0.2): policy report redesign + AI eval onboarding doc (Gate pillar lift)

* feat(0.2): published eval-adapter schema contract (Gate pillar lift)

* chore: fix go.mod indirect annotation for golang.org/x/sync

* fix: cross-platform path handling in suppression + portfolio tests

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>