Written at: 2026-05-14T23:55:17-07:00
Make Iris evaluate real product use by carrying a generic product-use contract from Discovery through Explorer, Judge, validation, and report output.
- Add ProductUseContract schema to Discovery.
  - Why: Discovery is where Iris decides what "real use" means for the product.
  - Verify: Discovery unit tests prove contract fields survive normalization and report JSON extraction.
- Update Discovery prompt and Explorer context.
  - Why: Explorer needs archetype-level user jobs, required actions, expected artifacts, and weak proof warnings before acting.
  - Verify: prompt/context tests assert canvas/editor examples require artifact creation rather than activation-only proof.
- Update Judge prompt.
  - Why: Judge should score task/artifact coverage separately from surface coverage and avoid high scores for sampled controls.
  - Verify: prompt tests assert the real-use/value-loop guidance is present.
- Extend goal-claim validation with generic weak-proof detection.
  - Why: LLM judgement alone will drift. Verified goals should be downgraded when notes/evidence only show toolbar/menu/focus/activation.
  - Verify: validator unit tests cover activation-only downgrade and durable artifact pass.
- Render real-use depth in the report.
  - Why: The reader should immediately see whether Iris exercised the primary value loop and what proof was accepted or rejected.
  - Verify: HTML tests assert the summary renders and old reports remain compatible.
- Rebuild and re-render the current tldraw report.
  - Why: The current report is the regression artifact the user is auditing.
  - Verify: served report contains product-use contract metadata and real-use depth summary.
- Worktree/branch: `/Users/yuxuan/work/prod-critic`, `main`.
- Owned files: `packages/core/src/discovery/*`, `packages/core/src/judge/*`, `packages/core/src/report/*`, `packages/adapter-web/src/contract.ts`, focused tests, and report re-render artifacts.
- Risk: overfitting the rules to canvas products. Mitigation: schema uses generic product kinds and generic required-action/artifact/weak-evidence arrays.
- Risk: old reports break. Mitigation: all report fields are optional and rendering is conditional.
- Verification: focused tests, package typechecks, full build, report re-render and served-page check.
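
The generic contract and its normalization step could look roughly like the sketch below. This is a minimal illustration, not the actual Iris schema: the field names are assumptions derived from the terms used in this plan (product kind, primary value loop, required actions, expected artifacts, accepted and weak evidence), and `normalizeProductUseContract` is a hypothetical helper. Returning `undefined` for a missing contract is one way to keep old reports compatible with conditional rendering.

```typescript
// Hypothetical shape: field names are assumptions drawn from the plan text,
// not the actual Iris schema.
interface ProductUseContract {
  productKind: string; // e.g. "canvas-editor", "crud", "calculator"
  primaryValueLoop: string; // what "real use" means for this product
  requiredActions: string[];
  expectedArtifacts: string[];
  acceptedEvidence: string[];
  weakEvidence: string[]; // e.g. "toolbar activation", "focus change"
}

// Coerce unknown input into a clean string array (drops non-strings and blanks).
function toStringArray(value: unknown): string[] {
  if (!Array.isArray(value)) return [];
  return value
    .filter((v): v is string => typeof v === "string")
    .map((v) => v.trim())
    .filter((v) => v.length > 0);
}

function toTrimmedString(value: unknown): string {
  return typeof value === "string" ? value.trim() : "";
}

// Returns undefined when the contract is absent or unusable, so reports
// produced before the contract existed can skip the section entirely.
function normalizeProductUseContract(raw: unknown): ProductUseContract | undefined {
  if (typeof raw !== "object" || raw === null) return undefined;
  const r = raw as Record<string, unknown>;
  const productKind = toTrimmedString(r["productKind"]);
  const primaryValueLoop = toTrimmedString(r["primaryValueLoop"]);
  if (productKind === "" || primaryValueLoop === "") return undefined;
  return {
    productKind,
    primaryValueLoop,
    requiredActions: toStringArray(r["requiredActions"]),
    expectedArtifacts: toStringArray(r["expectedArtifacts"]),
    acceptedEvidence: toStringArray(r["acceptedEvidence"]),
    weakEvidence: toStringArray(r["weakEvidence"]),
  };
}
```

Keeping every list as a plain string array is what makes the contract product-kind generic: nothing in the shape is specific to canvas products.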
- Added a generic `product_use_contract` to Discovery output and normalized Explorer context.
- Updated Discovery, Explorer, and Judge prompts so real-use depth is defined by product kind, primary value loop, required actions, expected artifacts/state, accepted evidence, and weak evidence.
- Added deterministic goal validation against the product-use contract. The validator now rejects weak proof only when the claim lacks outcome language, and it recognizes keyboard shortcuts/style-control clicks as valid required actions.
- Rendered the real-use contract in HTML and Markdown reports next to the Discovery coverage map.
- Added `iris report --revalidate` so stored `judge.raw.txt` can be replayed through the current evidence/goal validators before re-rendering old run reports.
- `pnpm --filter @iris/core exec vitest run src/discovery/discovery.test.ts src/explorer/prompts.test.ts src/judge/prompts.test.ts src/judge/goal-claim-validator.test.ts src/report/report-json.test.ts src/report/report-html.test.ts src/report/report-md.test.ts --reporter=dot`
- `pnpm --filter @iris/core run typecheck`
- `pnpm --filter @iris/adapter-web run typecheck`
- `pnpm --filter @iris/cli run typecheck`
- `pnpm -r run build`
- `pnpm -r run test -- --reporter=dot`
- `pnpm --filter @iris/cli run test -- --reporter=dot`
- Fresh real-product run: `node packages/cli/dist/bin.js eval https://www.tldraw.com --transport codex-appserver --explorer-model gpt-5.4 --judge-model gpt-5.4 --out iris-runs/tldraw-product-use-contract-20260515-001105 --parallel 1 --timeout 900 --steps-per-goal 14 --free-exploration-steps 12 --max-steps 220 --print-summary --verbose`
- Revalidated served report: `node packages/cli/dist/bin.js report iris-runs/tldraw-product-use-contract-20260515-001105 --revalidate`
- Served-page checks: HTTP 200 for `report.html`, claim clip, and screenshot assets; Playwright snapshot verified `6/7 verified`, `Real-use contract`, and visible per-goal clips.
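
The deterministic weak-proof rule described above (downgrade a verified goal when the notes show only toolbar/menu/focus/activation and contain no outcome language) can be sketched as follows. The regular expressions and the `GoalClaim` shape are illustrative assumptions; the real validator in `goal-claim-validator` presumably takes its weak-evidence terms from the product-use contract rather than from hard-coded patterns.

```typescript
type GoalStatus = "verified" | "partial";

// Hypothetical claim shape: `notes` stands in for judge notes plus evidence summaries.
interface GoalClaim {
  status: GoalStatus;
  notes: string;
}

// Illustrative patterns only (assumptions, not the Iris rule set).
const WEAK_PROOF = /\b(toolbar|menu|focus(ed)?|activat(ed|ion)|selected tool)\b/i;
const OUTCOME = /\b(created|added|saved|appears|rendered|drawn|visible on)\b/i;

// Downgrade only when weak proof is present AND no outcome language accompanies it.
function downgradeWeakProof(claim: GoalClaim): GoalClaim {
  if (claim.status !== "verified") return claim;
  const weak = WEAK_PROOF.test(claim.notes);
  const hasOutcome = OUTCOME.test(claim.notes);
  if (weak && !hasOutcome) return { ...claim, status: "partial" };
  return claim;
}
```

Because the rule only fires when outcome language is absent, a note that mentions the toolbar and also describes a durable artifact stays verified.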
Written at: 2026-05-15T01:14:00-07:00
The tldraw report still makes the reader work too hard:
- Claim clips are raw browser recordings, so canvas gestures often look static and low-signal.
- A failed journey can appear once as a goal and again as a finding, with nearly identical clips, which reads as contradictory even when the JSON is internally consistent.
- Action-tool friction can become a UX finding without visible product impact.
- The primary canvas-editor goal can pass after one basic object, which is technically real use but weaker than a real user drawing a meaningful board artifact.
Remediation plan:
- Generate trace-based storyboard clips from observation and screenshot frames for every goal/finding claim, and use raw page video only as fallback.
  - Verify: a unit test proves claim storyboards are created from trace screenshots and used before adapter raw-video slicing.
- Link findings to overlapping goal rows and suppress duplicate finding media when the goal row already shows the same evidence context.
  - Verify: HTML tests assert a partial goal displays the linked finding and the finding row points back to that goal instead of embedding a duplicate clip.
- Discard low-signal action-result-only UX findings about click/focus/retry friction unless there is visible user-facing failure evidence.
  - Verify: validator tests cover tool-friction discard while preserving real, visible failure findings.
- Strengthen generic discovery/explorer guidance for artifact-editor products: the primary value loop should produce a small meaningful artifact through multiple normal actions when the product exposes those tools.
  - Verify: prompt tests assert this guidance is present and remains product-kind generic.
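
The storyboard-first media rule from the first remediation item can be sketched as below. `TraceFrame`, `ClaimMedia`, and `selectClaimMedia` are hypothetical names, and the step-range input is an assumption about how a claim maps onto the trace; the only behavior taken from the plan is the ordering: trace screenshots first, raw page video only as fallback.

```typescript
// Hypothetical trace frame: one entry per explorer step, screenshot optional.
interface TraceFrame {
  step: number;
  screenshotPath?: string;
}

type ClaimMedia =
  | { kind: "storyboard"; frames: string[] }
  | { kind: "raw-video"; videoPath: string }
  | { kind: "none" };

// Prefer a storyboard built from trace screenshots inside the claim's step range;
// fall back to the raw page recording only when no screenshots exist.
function selectClaimMedia(
  frames: TraceFrame[],
  stepRange: [number, number],
  rawVideoPath?: string,
): ClaimMedia {
  const [from, to] = stepRange;
  const shots = frames
    .filter((f) => f.step >= from && f.step <= to && f.screenshotPath !== undefined)
    .sort((a, b) => a.step - b.step)
    .map((f) => f.screenshotPath as string);
  if (shots.length > 0) return { kind: "storyboard", frames: shots };
  if (rawVideoPath !== undefined) return { kind: "raw-video", videoPath: rawVideoPath };
  return { kind: "none" };
}
```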
Written at: 2026-05-18T10:29:15-07:00
Prevent Iris from presenting audit/tooling limitations as product-quality scores, while preserving true product failures and improving visual proof for canvas/editor products.
- Classify rubric score limitations during report normalization.
  - Why: Judge rationales can describe Iris proof/tooling limits while still assigning product-looking numbers.
  - Verify: report JSON tests show weak evidence, automation/tooling friction, one-persona scope, and untested dimensions become `score:null`; true product defect dimensions remain scored.
- Recompute profile and overall scores after limitation normalization.
  - Why: Nulling CSP/weak-proof dimensions is insufficient if the old profile/overall score is kept.
  - Verify: tests prove failed axe/tool limitations do not drag or inflate accessibility/overall scores, and profile `n/a` renders when all dimensions are unscored.
- Add limitation buckets to `evaluation`.
  - Why: report consumers need structured reasons: `automation_limit`, `evidence_weakness`, `coverage_gap`, `not_tested`.
  - Verify: tests assert each bucket is populated and downgrades product-score authority/evidence confidence without creating product findings.
- Make goal-claim validation distinguish proof gaps from product failures when calibrating scores.
  - Why: a partial caused by missing screenshot/vision proof should not cap mature-product scores as if the product failed.
  - Verify: validator tests cover proof-gap partials that add caveats without capping, and blocked/not-satisfied product outcomes that still cap goal-dependent scores.
- Patch visual proof plumbing for Codex App Server and canvas evidence.
  - Why: canvas-heavy products need visual artifact evidence beyond DOM text and state deltas.
  - Verify: runner/adapter tests prove `vision_describe` descriptions are emitted and canvas-like visual changes can be represented as proof or limitations.
- Run downstream verification on similar and different products.
  - Why: the fix must generalize beyond tldraw and must not regress CRUD/content/calculator/grid products.
  - Verify: live runs or re-rendered reports for tldraw, TodoMVC, DataTables, BMI Calculator, and Wikipedia are compared on goal counts, score authority, evidence confidence, limitation buckets, and false-positive findings.
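
The first three steps above (null out limited dimensions, count limitation buckets, then recompute the profile score) can be sketched as follows. The bucket names come from this plan; everything else is an assumption: the `Dimension` shape, the idea that an upstream classifier has already tagged each dimension, and the use of a plain mean as the profile aggregate, since the plan does not say how Iris combines dimension scores.

```typescript
type LimitationBucket = "automation_limit" | "evidence_weakness" | "coverage_gap" | "not_tested";

// Hypothetical dimension shape; `limitation` is assumed to be set by a classifier (not shown).
interface Dimension {
  name: string;
  score: number | null;
  limitation?: LimitationBucket;
}

interface ProfileResult {
  dimensions: Dimension[];
  score: number | null; // null would render as "n/a"
  buckets: Record<LimitationBucket, number>;
}

function normalizeProfile(dimensions: Dimension[]): ProfileResult {
  const buckets: Record<LimitationBucket, number> = {
    automation_limit: 0,
    evidence_weakness: 0,
    coverage_gap: 0,
    not_tested: 0,
  };
  // Limited dimensions lose their number and are counted in a bucket instead.
  const normalized: Dimension[] = dimensions.map((d) => {
    const bucket = d.limitation;
    if (bucket === undefined) return d;
    buckets[bucket] += 1;
    return { ...d, score: null };
  });
  // Recompute from the remaining scored dimensions only (mean is an assumption).
  const scored = normalized
    .map((d) => d.score)
    .filter((s): s is number => s !== null);
  const score =
    scored.length === 0 ? null : scored.reduce((a, b) => a + b, 0) / scored.length;
  return { dimensions: normalized, score, buckets };
}
```

Recomputing after nulling is the point of the second step: keeping the old aggregate would let a tooling failure continue to drag or inflate the profile score.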
Added at: 2026-05-18T11:24:00-07:00
This root fix is not complete unless it improves the proof model without hiding real product defects or overfitting to tldraw.
Hard requirements:
- Product-name agnostic: rules must trigger from product capabilities and evidence shape, such as canvas/editor/media artifact workflows, DOM-visible CRUD workflows, data-grid workflows, forms/calculators, and content/search flows. No tldraw-only scoring or prompt special cases.
- Visual proof required where the product surface is visual: canvas/editor/media goals need post-action evidence that either names the user-visible artifact through `vision_describe`, proves a meaningful screenshot/visual delta paired with semantic context, or exposes equivalent structured state. A bare screenshot id, toolbar activation, pointer movement, or unchanged DOM text is not enough for authoritative verification.
- Playwright remains the executor, not the oracle: browser actions, DOM snapshots, screenshots, and probes are necessary plumbing, but the critic must classify proof weakness as an audit limitation unless the trace shows a real product error, visible broken state, console/network failure, or semantic visual evidence.
- Mature-product scoring guard: proof gaps, CSP/axe instrumentation failures, missing visual semantics, and one-persona scope must lower score authority/evidence confidence and populate limitation buckets; they must not become low product-quality scores or false findings.
- Real defect preservation: visible product failures, broken flows, failed network writes, inaccessible controls with probe evidence, or artifact outcomes that are semantically absent after valid interactions must still produce findings and affect scores.
- Regression suite spans product classes: every substantive change must be checked on at least one visual editor/canvas product, one CRUD/task product, one content/search product, one form/calculator product, and one data-grid/table product. Passing means no false product-defect inflation from tooling limits, no silent loss of verified goals, and no authoritative score when core proof is weak.
- Score comparison is audit-based, not number-based: lower or higher goal counts and scores are acceptable only when the report evidence, limitation buckets, and authority explain the difference. The gate is downstream report trustworthiness, not maximizing scores.
Verification additions:
- Unit/regression tests must cover visual-proof absence, successful semantic vision proof, visual-delta-without-semantics, automation limitations, and real product failures.
- Downstream reports must be reviewed for: goal realism, evidence sufficiency, false-positive findings, `product_score.authority`, evidence confidence, limitation bucket counts, and whether a normal user would recognize the tested value loop.
- Existing non-visual products must keep authoritative/high-confidence reports when the trace provides strong DOM/state evidence, even if optional dimensions remain untested.
- `/codex-gate` MCP is not available in this session.
- Fallback review used three parallel read-only Codex explorers:
  - scoring/report pipeline audit,
  - Playwright/adapter/canvas evidence audit,
  - verification-matrix audit.
- Review findings accepted into this plan: profile/overall scores must be recomputed after nulling non-product dimensions; single-persona coverage is audit scope; score authority must account for CSP/visual caveats; canvas proof needs visual evidence; Playwright remains the right base executor but not a sufficient oracle by itself.
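
The scoring guard and the defect-preservation requirement pull in opposite directions, so a small sketch may help show how they can coexist: proof-gap partials add a caveat and leave the score alone, while blocked or not-satisfied outcomes still cap it. The outcome labels, the `Calibration` shape, and the cap value `PRODUCT_FAILURE_CAP` are all hypothetical; the plan does not state the actual cap.

```typescript
// Hypothetical outcome labels; "partial_proof_gap" marks a partial caused by missing proof.
type GoalOutcome = "verified" | "partial_proof_gap" | "blocked" | "not_satisfied";

interface Calibration {
  score: number;
  caveats: string[];
}

// Placeholder cap: the plan does not specify the real number.
const PRODUCT_FAILURE_CAP = 5;

function calibrateGoalDependentScore(score: number, outcomes: GoalOutcome[]): Calibration {
  const caveats: string[] = [];
  let calibrated = score;
  for (const outcome of outcomes) {
    if (outcome === "partial_proof_gap") {
      // Audit limitation: record it, but do not treat it as a product failure.
      caveats.push("proof gap: goal not fully evidenced (audit limitation)");
    } else if (outcome === "blocked" || outcome === "not_satisfied") {
      // Real product outcome: cap the goal-dependent score.
      calibrated = Math.min(calibrated, PRODUCT_FAILURE_CAP);
      caveats.push(`product outcome ${outcome}: score capped`);
    }
  }
  return { score: calibrated, caveats };
}
```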
Updated at: 2026-05-18T11:25:00-07:00
- Added structured score limitation normalization for `automation_limit`, `evidence_weakness`, `coverage_gap`, and `not_tested`, with profile/overall score recomputation after run-limit dimensions are nulled.
- Added product-score authority/confidence downgrades from limitation buckets so proof/tooling gaps cannot silently appear as authoritative product scores.
- Changed goal-score calibration so proof-gap partials do not cap product scores; blocked/not-satisfied product outcomes still cap goal-dependent dimensions.
- Added `visual_delta` as a deterministic screenshot-before/after probe and registered it in the web adapter outcome contract as supporting evidence.
- Hardened `vision_describe`: empty descriptions now fail, and Codex App Server vision descriptions are preserved into action-result trace events.
- Added semantic visual-proof validation for visual/canvas/media artifact goals. `visual_delta` or a bare screenshot id can support proof, but cannot by itself verify a visual artifact. Verified visual goals need a semantic observation, structured state, or `vision_describe` text naming the artifact.
- Applied the semantic visual-proof guard both when Discovery provides `product_use_contract` and for targeted/no-discovery visual goals inferred from the goal text.
- Updated core, Agent SDK, and Codex App Server Explorer prompts to ask for `visual_delta` plus semantic `vision_describe` when DOM text cannot name a visual artifact.
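
The semantic visual-proof guard described above can be sketched as a small classifier: supporting evidence (`visual_delta`, a bare screenshot id) never verifies on its own, while `vision_describe` text, a semantic observation, or structured state verifies only if it names the artifact. The `Evidence` shape and the `artifactTerms` parameter are hypothetical; substring matching is a deliberate simplification of whatever the real validator does.

```typescript
type EvidenceKind =
  | "screenshot_id"
  | "visual_delta"
  | "vision_describe"
  | "semantic_observation"
  | "structured_state";

// Hypothetical evidence record; `text` is empty or absent for bare screenshot ids.
interface Evidence {
  kind: EvidenceKind;
  text?: string;
}

// `artifactTerms` is an assumed input, e.g. derived from the goal's expected artifacts.
function visualGoalStatus(evidence: Evidence[], artifactTerms: string[]): "verified" | "partial" {
  const terms = artifactTerms.map((t) => t.toLowerCase());
  const namesArtifact = (e: Evidence): boolean => {
    const text = (e.text ?? "").toLowerCase();
    return text.length > 0 && terms.some((t) => text.includes(t));
  };
  const hasSemanticProof = evidence.some(
    (e) =>
      (e.kind === "vision_describe" ||
        e.kind === "semantic_observation" ||
        e.kind === "structured_state") &&
      namesArtifact(e),
  );
  // visual_delta and screenshot ids are supporting evidence only.
  return hasSemanticProof ? "verified" : "partial";
}
```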
Updated at: 2026-05-18T11:46:00-07:00
- Fixed the tldraw `3/7` validator regression by replacing the over-broad toolbar/focus veto with a side-effect-dominance check. Mixed observations that contain real artifact labels/object counts plus toolbar chrome are valid; toolbar-only observations are still partial.
- Added goal-specific visual proof for freehand/annotation goals. A canvas observation with generic board text or a typed "No obvious extra stray mark" note no longer verifies a freehand circle/underline/correction; the evidence must name the mark or include a concrete mark object signal.
- Stopped treating negative visual quality conditions such as "No obvious extra stray mark after correction" as literal visible text requirements.
- Fixed capability coverage accounting so one partial scenario does not poison every capability it is mapped to when another mapped scenario already verifies the capability. Capabilities with only partial mapped scenarios remain partial.
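
The capability coverage fix in the last item reduces to a simple rollup rule, sketched below with hypothetical status names: a capability counts as covered if any scenario mapped to it is verified, and is partial only when its best mapped scenario is partial.

```typescript
// Hypothetical status labels for scenarios and capabilities.
type ScenarioStatus = "verified" | "partial" | "not_run";
type CapabilityStatus = "covered" | "partial" | "uncovered";

// One verified scenario is enough; a partial sibling does not poison the capability.
function capabilityStatus(mapped: ScenarioStatus[]): CapabilityStatus {
  if (mapped.some((s) => s === "verified")) return "covered";
  if (mapped.some((s) => s === "partial")) return "partial";
  return "uncovered";
}
```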
Updated at: 2026-05-18T11:25:00-07:00
- Core focused tests: `pnpm --filter @iris/core exec vitest run src/discovery/discovery.test.ts src/scenario/scenario-data.test.ts src/explorer/prompts.test.ts src/judge/goal-claim-validator.test.ts src/judge/evidence-validator.test.ts src/report/report-json.test.ts src/report/report-md.test.ts src/report/report-html.test.ts src/report/testing-plan.test.ts --pool=forks` -> 302 tests passed.
- Adapter focused tests: `pnpm --filter @iris/adapter-web exec vitest run src/contract.test.ts src/tools/vision.test.ts src/index.test.ts --pool=forks` -> 34 tests passed. Earlier full adapter probe/tool pass with console/notifications/interactions also passed 53 tests.
- CLI focused tests: `pnpm --filter @iris/cli exec vitest run src/commands/report.test.ts src/codex-app-server-client.test.ts src/scenario-completion-gate.test.ts src/agent-sdk-orchestrator.test.ts src/codex-app-server-orchestrator.test.ts --pool=forks` -> 56 tests passed.
- Typechecks: `pnpm --filter @iris/core run typecheck`, `pnpm --filter @iris/adapter-web run typecheck`, and `pnpm --filter @iris/cli run typecheck` passed.
- Build: `pnpm -r run build` passed after the final semantic-proof fallback change.
- Lint/organize-import check: `pnpm exec biome check --formatter-enabled=false ...` passed on touched TypeScript files.
- Visual editor downstream revalidation:
  - `iris-runs/tldraw-semantic-proof-rerender-20260518-1120/report.html`: 3/7 verified, 4 partial, product score authority `insufficient`, evidence confidence `low`. No false product findings; visual proof gaps are explicit goal-validator caveats and score limitations.
  - `iris-runs/verify-excalidraw-semantic-proof-final-20260518-1123/report.html`: screenshot-only targeted visual goal downgraded from verified to partial; product score authority `insufficient`, evidence confidence `low`.
- Non-visual downstream no-regression revalidation:
  - `iris-runs/verify-todomvc-semantic-proof-final-20260518-1123/report.html`: 1/1 verified, authority `authoritative`, confidence `high`, no findings.
  - `iris-runs/verify-bmi-semantic-proof-final-20260518-1123/report.html`: 1/1 verified, authority `authoritative`, confidence `high`, no findings.
  - `iris-runs/verify-wikipedia-semantic-proof-final-20260518-1123/report.html`: 1/1 verified, authority `authoritative`, confidence `high`, no findings.
Updated at: 2026-05-18T11:46:00-07:00
- Focused regression tests: `pnpm --filter @iris/core exec vitest run src/scenario/scenario-data.test.ts src/judge/goal-claim-validator.test.ts --pool=forks` -> 86 tests passed.
- Focused report/capability tests: `pnpm --filter @iris/core exec vitest run src/report/report-json.test.ts src/judge/goal-claim-validator.test.ts src/scenario/scenario-data.test.ts --pool=forks` -> 122 tests passed.
- Core focused suite after the regression fixes: `pnpm --filter @iris/core exec vitest run src/discovery/discovery.test.ts src/scenario/scenario-data.test.ts src/explorer/prompts.test.ts src/judge/goal-claim-validator.test.ts src/judge/evidence-validator.test.ts src/report/report-json.test.ts src/report/report-md.test.ts src/report/report-html.test.ts src/report/testing-plan.test.ts --pool=forks` -> 307 tests passed.
- Typecheck/build: `pnpm --filter @iris/core run typecheck` and `pnpm -r run build` passed.
- Full workspace tests: `pnpm -r test -- --pool=forks` passed across all workspace packages. Each package test run reported 77 test files passed and 705 passed / 1 skipped tests where applicable.
- Visual editor downstream revalidation:
  - `iris-runs/tldraw-goal-specific-coverage-rerender-20260518-1144/report.html`: 6/7 verified, G5 partial for missing goal-specific freehand/annotation evidence, authority `provisional`, confidence `medium`, 6/6 core capabilities covered, no false product findings.
  - `iris-runs/verify-excalidraw-goal-specific-coverage-20260518-1145/report.html`: screenshot-only targeted visual goal remains partial; authority `insufficient`, confidence `low`.
- Non-visual downstream no-regression revalidation:
  - `iris-runs/verify-todomvc-goal-specific-coverage-20260518-1145/report.html`: 1/1 verified, authority `authoritative`, confidence `high`.
  - `iris-runs/verify-bmi-goal-specific-coverage-20260518-1145/report.html`: 1/1 verified, authority `authoritative`, confidence `high`.
  - `iris-runs/verify-wikipedia-goal-specific-coverage-20260518-1145/report.html`: 1/1 verified, authority `authoritative`, confidence `high`.