Plan: Generic Product-Use Contracts

Written at: 2026-05-14T23:55:17-07:00

Objective

Make Iris evaluate real product use by carrying a generic product-use contract from Discovery through Explorer, Judge, validation, and report output.

Steps

  1. Add ProductUseContract schema to Discovery.

    • Why: Discovery is where Iris decides what "real use" means for the product.
    • Verify: Discovery unit tests prove contract fields survive normalization and report JSON extraction.
  2. Update Discovery prompt and Explorer context.

    • Why: Explorer needs archetype-level user jobs, required actions, expected artifacts, and weak proof warnings before acting.
    • Verify: prompt/context tests assert canvas/editor examples require artifact creation rather than activation-only proof.
  3. Update Judge prompt.

    • Why: Judge should score task/artifact coverage separately from surface coverage and avoid high scores for sampled controls.
    • Verify: prompt tests assert the real-use/value-loop guidance is present.
  4. Extend goal-claim validation with generic weak-proof detection.

    • Why: LLM judgement alone will drift. Verified goals should be downgraded when notes/evidence only show toolbar/menu/focus/activation.
    • Verify: validator unit tests cover activation-only downgrade and durable artifact pass.
  5. Render real-use depth in the report.

    • Why: The reader should immediately see whether Iris exercised the primary value loop and what proof was accepted or rejected.
    • Verify: HTML tests assert the summary renders and old reports remain compatible.
  6. Rebuild and re-render the current tldraw report.

    • Why: The current report is the regression artifact the user is auditing.
    • Verify: served report contains product-use contract metadata and real-use depth summary.

Plan Gate

  • Worktree/branch: /Users/yuxuan/work/prod-critic, main.
  • Owned files: packages/core/src/discovery/*, packages/core/src/judge/*, packages/core/src/report/*, packages/adapter-web/src/contract.ts, focused tests, and report re-render artifacts.
  • Risk: overfitting the rules to canvas products. Mitigation: schema uses generic product kinds and generic required-action/artifact/weak-evidence arrays.
  • Risk: old reports break. Mitigation: all report fields are optional and rendering is conditional.
  • Verification: focused tests, package typechecks, full build, report re-render and served-page check.
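
The contract described above can be pictured as a small, product-agnostic record. The following is a minimal sketch only: the field names, the `PRODUCT_KINDS` list, and `normalizeContract` are hypothetical and are not taken from packages/core. It shows the two mitigations named in the plan gate, generic arrays instead of canvas-specific rules and an optional result so old reports keep rendering.

```typescript
// Hypothetical shape: names are illustrative, not copied from packages/core.
const PRODUCT_KINDS = [
  "canvas_editor",
  "crud",
  "data_grid",
  "form_calculator",
  "content_search",
  "other",
] as const;
type ProductKind = (typeof PRODUCT_KINDS)[number];

interface ProductUseContract {
  product_kind: ProductKind;
  primary_value_loop: string;
  required_actions: string[];
  expected_artifacts: string[];
  accepted_evidence: string[];
  weak_evidence: string[];
}

// Keep only non-empty strings so malformed model output cannot leak into the report.
function toStringArray(value: unknown): string[] {
  if (!Array.isArray(value)) return [];
  return value
    .filter((v): v is string => typeof v === "string")
    .map((v) => v.trim())
    .filter((v) => v.length > 0);
}

// Returns undefined when no usable contract exists, so a renderer can skip the section.
function normalizeContract(raw: unknown): ProductUseContract | undefined {
  if (typeof raw !== "object" || raw === null) return undefined;
  const r = raw as Record<string, unknown>;
  const loop = typeof r.primary_value_loop === "string" ? r.primary_value_loop.trim() : "";
  if (loop.length === 0) return undefined;
  // Unknown product kinds fall back to "other" instead of failing the whole contract.
  const kind: ProductKind = PRODUCT_KINDS.find((k) => k === r.product_kind) ?? "other";
  return {
    product_kind: kind,
    primary_value_loop: loop,
    required_actions: toStringArray(r.required_actions),
    expected_artifacts: toStringArray(r.expected_artifacts),
    accepted_evidence: toStringArray(r.accepted_evidence),
    weak_evidence: toStringArray(r.weak_evidence),
  };
}
```

Because the normalizer returns `undefined` rather than throwing, a report written before the contract existed would pass through unchanged under this sketch.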

Implementation Result

  • Added a generic product_use_contract to Discovery output and normalized Explorer context.
  • Updated Discovery, Explorer, and Judge prompts so real-use depth is defined by product kind, primary value loop, required actions, expected artifacts/state, accepted evidence, and weak evidence.
  • Added deterministic goal validation against the product-use contract. The validator now rejects weak proof only when the claim lacks outcome language, and it recognizes keyboard shortcuts/style-control clicks as valid required actions.
  • Rendered the real-use contract in HTML and Markdown reports next to the Discovery coverage map.
  • Added iris report --revalidate so stored judge.raw.txt can be replayed through the current evidence/goal validators before re-rendering old run reports.
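
As a rough illustration of the deterministic weak-proof check, the sketch below downgrades a verified goal only when its notes mention activation-style signals and contain no outcome language. The two regular expressions, the `GoalClaim` shape, and `validateGoalClaim` are assumptions made for this example; the real validator works against the product-use contract and is not reproduced here.

```typescript
type GoalStatus = "verified" | "partial";

interface GoalClaim {
  status: GoalStatus;
  notes: string;
}

interface ValidatedClaim extends GoalClaim {
  caveat?: string;
}

// Hypothetical word lists: activation-style signals versus outcome language.
const WEAK_PROOF = /\b(toolbar|menu|focus(ed)?|activat(ed|ion))\b/i;
const OUTCOME = /\b(created|appears?|visible|saved|added|rendered|persist(ed|s)?)\b/i;

function validateGoalClaim(claim: GoalClaim): ValidatedClaim {
  // Only verified claims can be downgraded; partial claims pass through untouched.
  if (claim.status !== "verified") return { ...claim };
  const weakOnly = WEAK_PROOF.test(claim.notes) && !OUTCOME.test(claim.notes);
  if (!weakOnly) return { ...claim };
  return { ...claim, status: "partial", caveat: "activation-only proof" };
}
```

Notes that mix toolbar chrome with a concrete outcome ("clicked the toolbar, a rectangle appears") stay verified in this sketch, which mirrors the rule that weak proof is rejected only when outcome language is missing.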

Verification Log

  • pnpm --filter @iris/core exec vitest run src/discovery/discovery.test.ts src/explorer/prompts.test.ts src/judge/prompts.test.ts src/judge/goal-claim-validator.test.ts src/report/report-json.test.ts src/report/report-html.test.ts src/report/report-md.test.ts --reporter=dot
  • pnpm --filter @iris/core run typecheck
  • pnpm --filter @iris/adapter-web run typecheck
  • pnpm --filter @iris/cli run typecheck
  • pnpm -r run build
  • pnpm -r run test -- --reporter=dot
  • pnpm --filter @iris/cli run test -- --reporter=dot
  • Fresh real-product run: node packages/cli/dist/bin.js eval https://www.tldraw.com --transport codex-appserver --explorer-model gpt-5.4 --judge-model gpt-5.4 --out iris-runs/tldraw-product-use-contract-20260515-001105 --parallel 1 --timeout 900 --steps-per-goal 14 --free-exploration-steps 12 --max-steps 220 --print-summary --verbose
  • Revalidated served report: node packages/cli/dist/bin.js report iris-runs/tldraw-product-use-contract-20260515-001105 --revalidate
  • Served-page checks: HTTP 200 for report.html, claim clip, and screenshot assets; a Playwright snapshot confirmed that the page shows 6/7 verified, the Real-use contract, and visible per-goal clips.

Follow-up: Evidence Presentation Quality

Written at: 2026-05-15T01:14:00-07:00

The tldraw report still makes the reader work too hard:

  • Claim clips are raw browser recordings, so canvas gestures often look static and low-signal.
  • A failed journey can appear once as a goal and again as a finding, with nearly identical clips, which reads as contradictory even when the JSON is internally consistent.
  • Action-tool friction can become a UX finding without visible product impact.
  • The primary canvas-editor goal can pass after one basic object, which is technically real use but weaker than a real user drawing a meaningful board artifact.

Remediation plan:

  1. Generate trace-based storyboard clips from observation and screenshot frames for every goal/finding claim, and use raw page video only as fallback.

    • Verify: a unit test proves claim storyboards are created from trace screenshots and used before adapter raw-video slicing.
  2. Link findings to overlapping goal rows and suppress duplicate finding media when the goal row already shows the same evidence context.

    • Verify: HTML tests assert a partial goal displays the linked finding and the finding row points back to that goal instead of embedding a duplicate clip.
  3. Discard low-signal action-result-only UX findings about click/focus/retry friction unless there is visible user-facing failure evidence.

    • Verify: validator tests cover tool-friction discard while preserving real, visible failure findings.
  4. Strengthen generic discovery/explorer guidance for artifact-editor products: the primary value loop should produce a small meaningful artifact through multiple normal actions when the product exposes those tools.

    • Verify: prompt tests assert this guidance is present and remains product-kind generic.
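
Step 2 of the remediation plan (link findings to overlapping goals and suppress duplicate media) could look roughly like the sketch below. It assumes, purely for illustration, that goals and findings each carry a trace step range and that "overlapping" means those ranges intersect; `StepRange`, `GoalRow`, `Finding`, and `linkFindings` are hypothetical names.

```typescript
interface StepRange {
  start: number;
  end: number;
}

interface GoalRow {
  id: string;
  steps: StepRange;
  clip?: string;
}

interface Finding {
  id: string;
  steps: StepRange;
  clip?: string;
}

interface LinkedFinding extends Finding {
  linkedGoalId?: string;
}

// Two inclusive ranges overlap when each starts before the other ends.
function overlaps(a: StepRange, b: StepRange): boolean {
  return a.start <= b.end && b.start <= a.end;
}

function linkFindings(goals: GoalRow[], findings: Finding[]): LinkedFinding[] {
  return findings.map((f): LinkedFinding => {
    const goal = goals.find((g) => overlaps(g.steps, f.steps));
    if (!goal) return { ...f };
    // Drop the finding's clip only when the goal row already shows media for the same context.
    if (goal.clip !== undefined) {
      return { id: f.id, steps: f.steps, linkedGoalId: goal.id };
    }
    return { ...f, linkedGoalId: goal.id };
  });
}
```

A finding with no overlapping goal keeps its own clip, so unrelated failures remain fully evidenced.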

Plan: Product Score Provenance and Visual Proof

Written at: 2026-05-18T10:29:15-07:00

Objective

Prevent Iris from presenting audit/tooling limitations as product-quality scores, while preserving true product failures and improving visual proof for canvas/editor products.

Steps

  1. Classify rubric score limitations during report normalization.

    • Why: Judge rationales can describe Iris proof/tooling limits while still assigning product-looking numbers.
    • Verify: report JSON tests show weak evidence, automation/tooling friction, one-persona scope, and untested dimensions become score:null; true product defect dimensions remain scored.
  2. Recompute profile and overall scores after limitation normalization.

    • Why: Nulling CSP/weak-proof dimensions is insufficient if the old profile/overall score is kept.
    • Verify: tests prove failed axe/tool limitations do not drag down or inflate accessibility/overall scores, and the profile renders n/a when all dimensions are unscored.
  3. Add limitation buckets to evaluation.

    • Why: report consumers need structured reasons: automation_limit, evidence_weakness, coverage_gap, not_tested.
    • Verify: tests assert each bucket is populated and downgrades product-score authority/evidence confidence without creating product findings.
  4. Make goal-claim validation distinguish proof gaps from product failures when calibrating scores.

    • Why: a partial caused by missing screenshot/vision proof should not cap mature-product scores as if the product failed.
    • Verify: validator tests cover proof-gap partials that add caveats without capping, and blocked/not-satisfied product outcomes that still cap goal-dependent scores.
  5. Patch visual proof plumbing for Codex App Server and canvas evidence.

    • Why: canvas-heavy products need visual artifact evidence beyond DOM text and state deltas.
    • Verify: runner/adapter tests prove vision_describe descriptions are emitted and canvas-like visual changes can be represented as proof or limitations.
  6. Run downstream verification on similar and different products.

    • Why: the fix must generalize beyond tldraw and must not regress CRUD/content/calculator/grid products.
    • Verify: live runs or re-rendered reports for tldraw, TodoMVC, DataTables, BMI Calculator, and Wikipedia are compared on goal counts, score authority, evidence confidence, limitation buckets, and false-positive findings.
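
Steps 1 to 3 can be sketched together: dimensions tagged with a limitation bucket lose their score, the profile score is recomputed from what remains, and the buckets are counted for report consumers. This is an assumption-laden sketch, not the actual normalizer: the `Dimension` and `Profile` shapes are hypothetical, and the unweighted mean is a stand-in for whatever aggregation Iris really uses.

```typescript
type LimitationBucket = "automation_limit" | "evidence_weakness" | "coverage_gap" | "not_tested";

interface Dimension {
  name: string;
  score: number | null;
  limitation?: LimitationBucket;
}

interface Profile {
  dimensions: Dimension[];
  score: number | null; // null renders as "n/a"
  buckets: Record<LimitationBucket, number>;
}

function normalizeProfile(dimensions: Dimension[]): Profile {
  // A dimension limited by tooling or proof gaps is unscored, not scored low.
  const normalized: Dimension[] = dimensions.map((d) =>
    d.limitation ? { ...d, score: null } : d,
  );

  // Recompute from the surviving dimensions instead of keeping the old aggregate.
  const scores = normalized
    .map((d) => d.score)
    .filter((s): s is number => s !== null);
  const score =
    scores.length === 0 ? null : scores.reduce((sum, s) => sum + s, 0) / scores.length;

  const buckets: Record<LimitationBucket, number> = {
    automation_limit: 0,
    evidence_weakness: 0,
    coverage_gap: 0,
    not_tested: 0,
  };
  for (const d of normalized) {
    if (d.limitation) buckets[d.limitation] += 1;
  }
  return { dimensions: normalized, score, buckets };
}
```

In this sketch a failed axe run tagged `automation_limit` neither lowers nor raises the aggregate; it simply leaves the average and shows up in the bucket counts.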

Non-Regression and Generalization Bar

Added at: 2026-05-18T11:24:00-07:00

This root fix is not complete unless it improves the proof model without hiding real product defects or overfitting to tldraw.

Hard requirements:

  • Product-name agnostic: rules must trigger from product capabilities and evidence shape, such as canvas/editor/media artifact workflows, DOM-visible CRUD workflows, data-grid workflows, forms/calculators, and content/search flows. No tldraw-only scoring or prompt special cases.
  • Visual proof required where the product surface is visual: canvas/editor/media goals need post-action evidence that either names the user-visible artifact through vision_describe, proves a meaningful screenshot/visual delta paired with semantic context, or exposes equivalent structured state. A bare screenshot id, toolbar activation, pointer movement, or unchanged DOM text is not enough for authoritative verification.
  • Playwright remains the executor, not the oracle: browser actions, DOM snapshots, screenshots, and probes are necessary plumbing, but the critic must classify proof weakness as an audit limitation unless the trace shows a real product error, visible broken state, console/network failure, or semantic visual evidence.
  • Mature-product scoring guard: proof gaps, CSP/axe instrumentation failures, missing visual semantics, and one-persona scope must lower score authority/evidence confidence and populate limitation buckets; they must not become low product-quality scores or false findings.
  • Real defect preservation: visible product failures, broken flows, failed network writes, inaccessible controls with probe evidence, or artifact outcomes that are semantically absent after valid interactions must still produce findings and affect scores.
  • Regression suite spans product classes: every substantive change must be checked on at least one visual editor/canvas product, one CRUD/task product, one content/search product, one form/calculator product, and one data-grid/table product. Passing means no false product-defect inflation from tooling limits, no silent loss of verified goals, and no authoritative score when core proof is weak.
  • Score comparison is audit-based, not number-based: lower or higher goal counts and scores are acceptable only when the report evidence, limitation buckets, and authority explain the difference. The gate is downstream report trustworthiness, not maximizing scores.
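
The mature-product scoring guard and the real-defect requirement pull in opposite directions, so a small sketch may help show how they can coexist. Everything here is hypothetical: the `GoalOutcome` labels, the cap value of 5, and the rule that any proof gap makes the score provisional are illustrative choices, not Iris behavior.

```typescript
type GoalOutcome = "verified" | "proof_gap" | "blocked" | "not_satisfied";
type Authority = "authoritative" | "provisional";

interface CalibratedScore {
  score: number;
  authority: Authority;
  caveats: string[];
}

// Hypothetical cap applied only when the product itself failed a goal.
const PRODUCT_FAILURE_CAP = 5;

function calibrate(rawScore: number, outcomes: GoalOutcome[]): CalibratedScore {
  const productFailed = outcomes.some((o) => o === "blocked" || o === "not_satisfied");
  const proofGaps = outcomes.filter((o) => o === "proof_gap").length;
  return {
    // Real product failures cap the score; proof gaps never do.
    score: productFailed ? Math.min(rawScore, PRODUCT_FAILURE_CAP) : rawScore,
    // Proof gaps lower how much the number can be trusted instead.
    authority: proofGaps > 0 ? "provisional" : "authoritative",
    caveats: proofGaps > 0 ? [`${proofGaps} goal(s) lack sufficient proof`] : [],
  };
}
```

The design point is that the two signals travel on separate channels: product failures move the score, audit limitations move authority and caveats.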

Verification additions:

  • Unit/regression tests must cover visual-proof absence, successful semantic vision proof, visual-delta-without-semantics, automation limitations, and real product failures.
  • Downstream reports must be reviewed for: goal realism, evidence sufficiency, false-positive findings, product_score.authority, evidence confidence, limitation bucket counts, and whether a normal user would recognize the tested value loop.
  • Existing non-visual products must keep authoritative/high-confidence reports when the trace provides strong DOM/state evidence, even if optional dimensions remain untested.

Plan Review

  • /codex-gate MCP is not available in this session.
  • Fallback review used three parallel read-only Codex explorers:
    • scoring/report pipeline audit,
    • Playwright/adapter/canvas evidence audit,
    • verification-matrix audit.
  • Review findings accepted into this plan: profile/overall scores must be recomputed after nulling non-product dimensions; single-persona coverage is audit scope; score authority must account for CSP/visual caveats; canvas proof needs visual evidence; Playwright remains the right base executor but not a sufficient oracle by itself.

Implementation Result

Updated at: 2026-05-18T11:25:00-07:00

  • Added structured score limitation normalization for automation_limit, evidence_weakness, coverage_gap, and not_tested, with profile/overall score recomputation after run-limit dimensions are nulled.
  • Added product-score authority/confidence downgrades from limitation buckets so proof/tooling gaps cannot silently appear as authoritative product scores.
  • Changed goal-score calibration so proof-gap partials do not cap product scores; blocked/not-satisfied product outcomes still cap goal-dependent dimensions.
  • Added visual_delta as a deterministic screenshot-before/after probe and registered it in the web adapter outcome contract as supporting evidence.
  • Hardened vision_describe: empty descriptions now fail, and Codex App Server vision descriptions are preserved into action-result trace events.
  • Added semantic visual-proof validation for visual/canvas/media artifact goals. visual_delta or a bare screenshot id can support proof, but cannot by itself verify a visual artifact. Verified visual goals need a semantic observation, structured state, or vision_describe text naming the artifact.
  • Applied the semantic visual-proof guard both when Discovery provides product_use_contract and for targeted/no-discovery visual goals inferred from the goal text.
  • Updated core, Agent SDK, and Codex App Server Explorer prompts to ask for visual_delta plus semantic vision_describe when DOM text cannot name a visual artifact.
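
The semantic visual-proof rule above (supporting evidence alone cannot verify a visual artifact) can be illustrated with a short sketch. The `Evidence` shape, the kind labels other than `visual_delta` and `vision_describe`, and the substring match on artifact terms are assumptions for the example; the real validator's matching is presumably richer.

```typescript
type EvidenceKind =
  | "screenshot"
  | "visual_delta"
  | "vision_describe"
  | "structured_state"
  | "semantic_observation";

interface Evidence {
  kind: EvidenceKind;
  text?: string;
}

// Kinds that can name an artifact; screenshot ids and visual_delta only support proof.
const SEMANTIC_KINDS: ReadonlySet<EvidenceKind> = new Set<EvidenceKind>([
  "vision_describe",
  "structured_state",
  "semantic_observation",
]);

function hasSemanticVisualProof(artifactTerms: string[], evidence: Evidence[]): boolean {
  return evidence.some((e) => {
    if (!SEMANTIC_KINDS.has(e.kind)) return false;
    const text = (e.text ?? "").toLowerCase();
    // Empty descriptions never count, matching the hardened vision_describe behavior.
    if (text.trim().length === 0) return false;
    return artifactTerms.some((term) => text.includes(term.toLowerCase()));
  });
}

function validateVisualGoal(
  artifactTerms: string[],
  evidence: Evidence[],
): "verified" | "partial" {
  return hasSemanticVisualProof(artifactTerms, evidence) ? "verified" : "partial";
}
```

A goal backed only by `visual_delta` and a screenshot id comes out partial here, which matches the Excalidraw screenshot-only downgrade recorded in the verification log.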

Updated at: 2026-05-18T11:46:00-07:00

  • Fixed the tldraw 3/7 validator regression by replacing the over-broad toolbar/focus veto with a side-effect-dominance check. Mixed observations that contain real artifact labels/object counts plus toolbar chrome are valid; toolbar-only observations are still partial.
  • Added goal-specific visual proof for freehand/annotation goals. A canvas observation with generic board text or a typed "No obvious extra stray mark" note no longer verifies a freehand circle/underline/correction; the evidence must name the mark or include a concrete mark object signal.
  • Stopped treating negative visual quality conditions such as "No obvious extra stray mark after correction" as literal visible text requirements.
  • Fixed capability coverage accounting so one partial scenario does not poison every capability it is mapped to when another mapped scenario already verifies the capability. Capabilities with only partial mapped scenarios remain partial.
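
The capability coverage fix in the last item amounts to taking the best status among a capability's mapped scenarios rather than the worst. A minimal sketch, assuming a hypothetical mapping from capability ids to scenario ids and a three-value status:

```typescript
type CoverageStatus = "verified" | "partial" | "untested";

// Best-of aggregation: one verified scenario is enough to cover the capability.
function capabilityStatus(scenarioStatuses: CoverageStatus[]): CoverageStatus {
  if (scenarioStatuses.includes("verified")) return "verified";
  if (scenarioStatuses.includes("partial")) return "partial";
  return "untested";
}

function capabilityCoverage(
  mapping: Record<string, string[]>, // capability id -> scenario ids (hypothetical shape)
  scenarios: Record<string, CoverageStatus>,
): Record<string, CoverageStatus> {
  const result: Record<string, CoverageStatus> = {};
  for (const [capability, scenarioIds] of Object.entries(mapping)) {
    // Scenarios missing from the run are treated as untested.
    result[capability] = capabilityStatus(
      scenarioIds.map((id) => scenarios[id] ?? "untested"),
    );
  }
  return result;
}
```

With this rule a partial G5 no longer drags down a capability that another goal already verifies, while a capability mapped only to partial scenarios stays partial.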

Verification Log

Updated at: 2026-05-18T11:25:00-07:00

  • Core focused tests: pnpm --filter @iris/core exec vitest run src/discovery/discovery.test.ts src/scenario/scenario-data.test.ts src/explorer/prompts.test.ts src/judge/goal-claim-validator.test.ts src/judge/evidence-validator.test.ts src/report/report-json.test.ts src/report/report-md.test.ts src/report/report-html.test.ts src/report/testing-plan.test.ts --pool=forks -> 302 tests passed.
  • Adapter focused tests: pnpm --filter @iris/adapter-web exec vitest run src/contract.test.ts src/tools/vision.test.ts src/index.test.ts --pool=forks -> 34 tests passed. An earlier full adapter probe/tool run covering console/notifications/interactions also passed (53 tests).
  • CLI focused tests: pnpm --filter @iris/cli exec vitest run src/commands/report.test.ts src/codex-app-server-client.test.ts src/scenario-completion-gate.test.ts src/agent-sdk-orchestrator.test.ts src/codex-app-server-orchestrator.test.ts --pool=forks -> 56 tests passed.
  • Typechecks: pnpm --filter @iris/core run typecheck, pnpm --filter @iris/adapter-web run typecheck, and pnpm --filter @iris/cli run typecheck passed.
  • Build: pnpm -r run build passed after the final semantic-proof fallback change.
  • Lint/organize-import check: pnpm exec biome check --formatter-enabled=false ... passed on touched TypeScript files.
  • Visual editor downstream revalidation:
    • iris-runs/tldraw-semantic-proof-rerender-20260518-1120/report.html: 3/7 verified, 4 partial, product score authority insufficient, evidence confidence low. No false product findings; visual proof gaps are explicit goal-validator caveats and score limitations.
    • iris-runs/verify-excalidraw-semantic-proof-final-20260518-1123/report.html: screenshot-only targeted visual goal downgraded from verified to partial; product score authority insufficient, evidence confidence low.
  • Non-visual downstream no-regression revalidation:
    • iris-runs/verify-todomvc-semantic-proof-final-20260518-1123/report.html: 1/1 verified, authority authoritative, confidence high, no findings.
    • iris-runs/verify-bmi-semantic-proof-final-20260518-1123/report.html: 1/1 verified, authority authoritative, confidence high, no findings.
    • iris-runs/verify-wikipedia-semantic-proof-final-20260518-1123/report.html: 1/1 verified, authority authoritative, confidence high, no findings.

Updated at: 2026-05-18T11:46:00-07:00

  • Focused regression tests: pnpm --filter @iris/core exec vitest run src/scenario/scenario-data.test.ts src/judge/goal-claim-validator.test.ts --pool=forks -> 86 tests passed.
  • Focused report/capability tests: pnpm --filter @iris/core exec vitest run src/report/report-json.test.ts src/judge/goal-claim-validator.test.ts src/scenario/scenario-data.test.ts --pool=forks -> 122 tests passed.
  • Core focused suite after the regression fixes: pnpm --filter @iris/core exec vitest run src/discovery/discovery.test.ts src/scenario/scenario-data.test.ts src/explorer/prompts.test.ts src/judge/goal-claim-validator.test.ts src/judge/evidence-validator.test.ts src/report/report-json.test.ts src/report/report-md.test.ts src/report/report-html.test.ts src/report/testing-plan.test.ts --pool=forks -> 307 tests passed.
  • Typecheck/build: pnpm --filter @iris/core run typecheck and pnpm -r run build passed.
  • Full workspace tests: pnpm -r test -- --pool=forks passed across all workspace packages. Where applicable, each package test run reported 77 test files passed, with 705 tests passed and 1 skipped.
  • Visual editor downstream revalidation:
    • iris-runs/tldraw-goal-specific-coverage-rerender-20260518-1144/report.html: 6/7 verified, G5 partial for missing goal-specific freehand/annotation evidence, authority provisional, confidence medium, 6/6 core capabilities covered, no false product findings.
    • iris-runs/verify-excalidraw-goal-specific-coverage-20260518-1145/report.html: screenshot-only targeted visual goal remains partial; authority insufficient, confidence low.
  • Non-visual downstream no-regression revalidation:
    • iris-runs/verify-todomvc-goal-specific-coverage-20260518-1145/report.html: 1/1 verified, authority authoritative, confidence high.
    • iris-runs/verify-bmi-goal-specific-coverage-20260518-1145/report.html: 1/1 verified, authority authoritative, confidence high.
    • iris-runs/verify-wikipedia-goal-specific-coverage-20260518-1145/report.html: 1/1 verified, authority authoritative, confidence high.