coder · ThomasK33 · Jun 6, 2026 · Jun 5, 2026 · Jun 5, 2026 · Jun 6, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -84,6 +84,12 @@ jobs:
       - name: Test unit
         run: mise run test-unit
 
+  # Integration and e2e tests drive real PTY hosts and headless-browser
+  # renderers, which can transiently fail under machine load (e.g. a screenshot
+  # render or RPC hiccup) even when the code is correct. The `test:integration`
+  # and `test:e2e` npm scripts pass `--retry=2`, so a flaky attempt is retried
+  # in place instead of failing the shard; a genuine failure still fails all
+  # three attempts. Unit tests (`test:unit`) deliberately do NOT retry.
   test-integration:
     runs-on: ubuntu-latest
     timeout-minutes: 20

diff --git a/CONTEXT.md b/CONTEXT.md
@@ -43,6 +43,10 @@ _Avoid_: Visual wait, snapshot wait
 A render condition where the visible text content of a **Semantic Snapshot** has remained unchanged for a requested duration.
 _Avoid_: Settled screen
 
+**Screen Hash**:
+A stable digest of a **Session**'s normalized visible screen text at a captured event-log sequence, used to tell whether the rendered screen content changed between two observations. It is computed from the same canonical visible text that the **Screen Stability** check and text **Render Wait** matching use, so the three never disagree.
+_Avoid_: Screen checksum, frame hash, screenshot hash
+
 **Batch**:
 An ordered sequence of **Batch Steps** driven through one **Command Target** in a single `batch` invocation. It runs fail-fast: the first failed **Batch Step** stops the run unless the caller opts into continuing.
 _Avoid_: Pipeline, script, macro
@@ -228,6 +232,9 @@ _Avoid_: bare "agent", "Coder agent"
 - A **Render Wait** may include text, regex, cursor, or **Screen Stability** conditions.
 - A **Render Wait** may be evaluated by live host polling for a **Live Host Eligible Session** or by offline replay fallback for an **Offline Replay Eligible Session**.
 - Offline replay fallback can evaluate snapshot content and cursor position, but cannot prove elapsed **Screen Stability** duration from a single latest **Semantic Snapshot**.
+- A **Screen Hash** changes exactly when the canonical visible text that the **Screen Stability** check compares changes; the two share one definition.
+- A **Screen Hash** covers visible screen text only — not scrollback, cursor position, or styles — and is distinct from the pixel `sha256` recorded on a **Screenshot Result**.
+- A result carries the **Screen Hash** of the **Semantic Snapshot** it observed: a **Snapshot Result**, a matched **Render Wait** result, and the offline host-unreachable fallback that still observed a snapshot (even when it reports `matched: false` because **Screen Stability** duration could not be proven offline). The hash is keyed on whether a snapshot was observed, not on whether the wait matched; a **Render Wait** that observes no snapshot — a live timeout, a consecutive-failure giveup, or a replay error — carries none.
 - A **Waited Run** may produce one **Run Completion**, time out for its caller, or be interrupted by **Session** exit.
 - Caller timeout does not cancel the underlying **Run Completion**; it may still be observed later to keep internal completion bytes out of artifacts.
 - After **Session** exit, an unobserved **Run Completion** can no longer arrive.
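
The glossary rule above keys the **Screen Hash** on whether a snapshot was observed, not on whether the wait matched. A minimal TypeScript sketch of that rule follows. The `WaitOutcome` union, its `kind` names, and `observedScreenHash` are hypothetical illustrations, not the project's actual types; a replay error is modeled as a thrown error and so has no result variant.

```typescript
// Hypothetical outcome shapes (illustrative names only, not the real protocol types).
type WaitOutcome =
  | { kind: "live-matched"; matched: true; screenHash: string }
  // Offline fallback observed a latest snapshot, so it carries a hash
  // even when it reports matched: false.
  | { kind: "offline-fallback"; matched: boolean; screenHash: string }
  // No snapshot observed: no hash.
  | { kind: "live-timeout"; matched: false }
  | { kind: "consecutive-failure-giveup"; matched: false };

// Returns the hash only for outcomes that observed a snapshot.
function observedScreenHash(outcome: WaitOutcome): string | undefined {
  return "screenHash" in outcome ? outcome.screenHash : undefined;
}
```

Under this modeling, a caller never has to inspect `matched` to decide whether a hash is available; the presence of the field alone answers that.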

diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md
@@ -46,6 +46,15 @@ If you touch the public bootstrap under `skills/` or the bundled runtime skills
 npm run intent:validate
 ```
 
+### Flaky integration and e2e tests
+
+Integration and e2e tests drive real PTY hosts and headless-browser renderers, so an individual test can transiently fail under machine load (most often a screenshot render or host RPC hiccup) even when the code is correct. To keep these flakes from causing spurious red CI runs:
+
+- `npm run test:integration`, `npm run test:e2e`, and the combined `npm run test` retry a failing test in place (`--retry=2`, up to three attempts). A genuine failure still fails all three attempts.
+- `npm run test:unit` deliberately does **not** retry — unit tests must be deterministic, and the dedicated unit CI gate is the authority that catches real unit flakes.
+- If an integration/e2e test fails _consistently_ (not just on one attempt), treat it as a real failure and investigate; do not raise the retry count to paper over it.
+- When debugging a single browser-backed test locally, run it in isolation (`npm run test:e2e -- <file>`); the full serial suite is the heaviest load and the most flake-prone.
+
 ## Documentation and proof expectations
 
 - Keep the root docs split clear: `README.md` for overview and `RELEASE.md` for supported scope.
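
The "Flaky integration and e2e tests" section added above describes `--retry=2` as up to three attempts, with a genuine failure failing all three. The sketch below models that arithmetic only. It is an illustrative model, not Vitest's implementation, and `runWithRetry` and `AttemptResult` are hypothetical names.

```typescript
// Illustrative model of retry-in-place; NOT how Vitest is implemented.
interface AttemptResult {
  passed: boolean;
  attempts: number;
}

// `retry` is the number of extra attempts after the first, so retry = 2
// allows at most three attempts in total.
function runWithRetry(test: () => void, retry: number): AttemptResult {
  const maxAttempts = retry + 1;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      test();
      return { passed: true, attempts: attempt };
    } catch {
      // Swallow the failure and fall through to the next attempt, if any.
    }
  }
  return { passed: false, attempts: maxAttempts };
}
```

In this model a test that fails once and then passes is reported as passed on its second attempt, while a test that always throws is reported as failed after three attempts.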

diff --git a/docs/USAGE.md b/docs/USAGE.md
@@ -104,6 +104,14 @@ Useful flags:
 - `--exit`: wait for the process to exit.
 - `--timeout <ms>`: maximum wait time in milliseconds, with `0` meaning infinite.
 
+### Screen Hash
+
+`snapshot` results (both `--format structured` and `--format text`) and a **matched** `wait` result carry an optional `screenHash`: a lowercase 64-character hex SHA-256 of the visible screen text. Compare it across two calls to tell whether the visible screen actually changed — equal hashes mean identical visible content, even if the event-log sequence advanced on a no-op repaint.
+
+- It hashes the visible screen only. It is **not** a hash of the `--format text` output, which also includes scrollback, so the hash ignores scrollback growth.
+- It is distinct from the `screenshot` result's pixel `sha256`: `screenHash` is content identity, the screenshot `sha256` is pixel identity, and the two are not interchangeable.
+- A `wait` that times out (or finds the host unreachable with no observed screen) omits `screenHash`, so a missing hash unambiguously means "no screen was observed" rather than an error.
+
 ## `batch`
 
 Use `batch` to run an ordered sequence of input-and-`wait` steps against one session in a single invocation, instead of coordinating separate `run`/`type`/`paste`/`send-keys`/`wait` calls. Each `wait` step is anchored to a Wait Baseline — it only considers screen state produced _after_ the preceding input step — so a batch cannot race ahead and match a stale screen the way a hand-written shell loop can.
@@ -174,7 +182,7 @@ The `--json` result is a per-step envelope:
 }
 ```
 
-Each step record carries its `index`, `kind`, `status` (`completed` | `failed` | `not-run` | `interrupted`), and `durationMs`. Input steps report the Event Log `seq` they produced; `wait` steps report the `waitBaseline` they were anchored to plus `matched` / `timedOut` / `matchedText` / `capturedAtSeq`. `completedCount` and `failedIndices` summarize the run. A fail-fast batch exits non-zero with the failed step's exit code (e.g. `11` for a `WAIT_TIMEOUT`); `--keep-going` exits `1` if any step failed. If the process is interrupted by SIGINT/SIGTERM, batch flushes the same envelope with the in-flight step marked `interrupted` and later steps `not-run`, then exits non-zero.
+Each step record carries its `index`, `kind`, `status` (`completed` | `failed` | `not-run` | `interrupted`), and `durationMs`. Input steps report the Event Log `seq` they produced; `wait` steps report the `waitBaseline` they were anchored to plus `matched` / `timedOut` / `matchedText` / `capturedAtSeq`, and a matched `wait` step also carries the `screenHash` of the screen it observed (see [Screen Hash](#screen-hash)). `completedCount` and `failedIndices` summarize the run. A fail-fast batch exits non-zero with the failed step's exit code (e.g. `11` for a `WAIT_TIMEOUT`); `--keep-going` exits `1` if any step failed. If the process is interrupted by SIGINT/SIGTERM, batch flushes the same envelope with the in-flight step marked `interrupted` and later steps `not-run`, then exits non-zero.
 
 The Wait Baseline fixes stale-match only. It does **not** fix echo-match: a `wait` can still match the terminal's echo of a just-typed command (the echo renders _after_ the baseline). Use a distinctive output token or a `screenStableMs` wait rather than waiting for text you just typed. Interrupting a batch mid-`wait` leaves that wait's command still running on the session (the wait is abandoned, not cancelled), exactly like a caller timeout on `run`.
 

diff --git a/docs/prd/screen-hash/PRD.md b/docs/prd/screen-hash/PRD.md
@@ -0,0 +1,63 @@
+# PRD: Screen Hash on snapshot and wait results
+
+## Problem Statement
+
+A caller — often an AI coding agent — driving a **Session** repeatedly needs a cheap, reliable way to answer "did the rendered screen actually change since I last looked?" Today the only per-result identifier is the captured event-log sequence, but that advances on every chunk of output, including output that changes nothing visible: cursor-position queries, terminal-mode toggles, a spinner repainting the same glyphs. So two observations with different sequences can be the identical screen, and a caller comparing sequences sees changes that are not there. There is no stable token for the screen's content itself.
+
+## Solution
+
+Snapshot results and matched **Render Wait** results gain an optional **Screen Hash**: a stable digest of the **Session**'s normalized visible screen text at the captured event-log sequence. Equal hashes mean the visible content is identical; a changed hash means it genuinely changed. The **Screen Hash** is computed from the same canonical visible text that the **Screen Stability** check and text **Render Wait** matching already use, so "the hash changed" and "the stability check saw a change" can never disagree.
+
+## User Stories
+
+1. As an AI coding agent, I want a stable hash of the screen content on each snapshot, so that I can tell across two CLI calls whether the visible screen actually changed without diffing full text myself.
+2. As an agent, I want the hash to stay equal when only the cursor moved, so that cursor motion alone does not look like a content change.
+3. As an agent, I want the hash to stay equal when output occurred that changed nothing visible, so that I am not misled by the captured sequence advancing on a no-op repaint.
+4. As an agent, I want the hash to change whenever the visible text changes, so that I can trust it as a content-changed signal.
+5. As a caller, I want the **Screen Hash** on the snapshot result in both structured and text formats, so that I get it regardless of how I read the screen.
+6. As a caller, I want the **Screen Hash** on a matched render-wait result, so that I know the content identity at the moment my wait condition was satisfied.
+7. As a caller, I want the hash present whenever a result holds an **observed** **Semantic Snapshot** — including the offline host-unreachable `matched: false` fallback that still observed a snapshot — and omitted only when no snapshot was observed (a live timeout, a consecutive-failure giveup, or a replay error), so that a missing hash unambiguously means "no screen was observed" rather than signalling an error.
+8. As a tooling author, I want the **Screen Hash** to be renderer-independent — the same screen yields the same hash under either renderer backend — so that I can compare hashes across sessions rendered by different backends.
+9. As a maintainer, I want the **Screen Hash**, the **Screen Stability** compare, and text **Render Wait** matching to share one canonical visible-text definition, so that they can never disagree about what "the screen" is.
+10. As a maintainer, I want adding the **Screen Hash** and routing the three consumers through one shared canonical-text definition to leave the shipped screen-stability behavior unchanged in itself, so that the only behavior change is the deliberate, characterization-pinned Phase 1 renderer convergence, not an accidental side effect of the hash.
+11. As a caller, I want to understand that the **Screen Hash** is distinct from a screenshot's pixel digest, so that I use the right identity for content versus pixels.
+12. As a caller, I want to understand that the **Screen Hash** covers the visible screen only, even though the text snapshot format also includes scrollback, so that I am not surprised that the hash ignores scrollback growth.
+13. As a tool building recordings, I want a per-frame content hash, so that I can dedup consecutive identical frames in artifacts.
+14. As a caller using `--json`, I want the hash as a lowercase 64-character hex string validated by the same digest schema as other hashes, so that the field shape is predictable.
+15. As a caller, I want the **Screen Hash** to be optional on results, so that older artifacts and hosts that predate it still parse.
+
+## Implementation Decisions
+
+- Add an optional **Screen Hash** field — a lowercase 64-character SHA-256 hex digest — to the snapshot result (both structured and text formats) and to the matched render-wait result.
+- In scope: a **Batch Step** record for a matched **Render Wait** step also carries the **Screen Hash**, mirrored from that step's render-wait result, so a batch run exposes the same content identity per wait step that a standalone wait does.
+- The **Screen Hash** is the SHA-256 of the canonical visible-text string: the visible lines joined by newline, exactly as the host's screen-stability compare and the text matcher already build it. The shared canonical-text **definition** — `visibleLines[].text` joined by `\n`, sourced only from the snapshot (never `backend.getVisibleText()` or `cells[]`) — is unchanged by adding the hash. Cursor position, text styles, and scrollback are excluded.
+- Converging the two renderer backends on one canonical screen form (Phase 1) intentionally changes the **default** `ghostty-web` backend's stability and text-wait **comparand** on screens with grapheme clusters, interior blank-cell gaps, or non-ASCII trailing characters: the canonical form is exactly `rows` lines, each decoded with full grapheme clusters with blank/zero cells as `' '`, then right-trimmed of trailing ASCII spaces (`0x20`) only. This is a deliberate, narrow change pinned by characterization tests, not a free behavior-preserving add; on plain ASCII screens the comparand is unchanged.
+- Extract one shared canonical-screen-text helper and route the **Screen Hash**, the host **Screen Stability** compare, and the text **Render Wait** matcher through it, so the three share a single definition and cannot diverge.
+- The hash is keyed on whether a result holds an **observed** **Semantic Snapshot**, not on whether the wait matched. A result carries the **Screen Hash** of the snapshot it observed: a matched live wait, a snapshot capture, and the offline host-unreachable fallback that still observed a latest snapshot (even when it returns `matched: false` because the **Screen Stability** duration could not be proven offline). The hash is omitted only when no snapshot was observed: a live wait that times out, a consecutive-failure giveup, or a replay error throw.
+- Do not surface the **Screen Hash** on inspection or any path that does not already render a **Semantic Snapshot**; computing it must never force a renderer bootstrap that would not otherwise happen.
+- Reuse the existing SHA-256 hex validator. The consolidation set is exactly: export `Sha256HexSchema` from `protocol/schemas.ts` and import it in `renderer/types.ts`. Deliberately left out of scope: the standalone regex copies in `storage/artifactManifest.ts` and the `invariant(/^[a-f0-9]{64}$/u.test(...))` checks (for example in `renderer/profiles.ts` and `renderer/bundledFont.ts`), which are not Zod schemas and are not part of this consolidation.
+- The field is optional so existing persisted artifacts and older hosts continue to parse.
+
+## Testing Decisions
+
+Good tests assert external behavior, not implementation details.
+
+- **Canonical-text and hash helper (unit).** Same screen yields the same hash; cursor-only movement yields the same hash; a single visible-glyph change yields a different hash; a trailing-whitespace-only difference (before right-trim of ASCII spaces) yields a different hash — proving the canonical form is exactly what is hashed and the behavior is as specified.
+- **UTF-8 encoding pinned (unit).** The hash is the SHA-256 of the UTF-8 bytes of the canonical visible text, asserted against a concrete golden digest so the encoding can never silently drift. Golden: a three-row screen whose canonical text is `"a\nb\nc"` hashes to `ea7fb08b7a2dc4619ffb7c7bb38d95a2047935fa165d71b12efd3852a2e6d0cc`.
+- **Shared definition (unit).** The host **Screen Stability** compare and the **Render Wait** matcher consume the same canonical string the hash uses, so a later change to one cannot silently diverge from the others, and screen-stability behavior is demonstrably unchanged.
+- **Cross-backend hash equality.** The same event log produces the same **Screen Hash** under both renderer backends, pinning the renderer-independence guarantee that is currently only an assumption. This test requires the optional native addon (`@coder/libghostty-vt-node`) and skips gracefully where the addon is absent (including the sandbox). It must therefore run on at least one CI job that has the addon installed, so the renderer-independence guarantee is not silently left unverified.
+- **Snapshot and wait envelope (integration).** Against an isolated home: the **Screen Hash** is present on a snapshot (structured and text), on a matched live wait, and on the offline host-unreachable `matched: false` fallback that still observed a snapshot; and absent on a timed-out live wait. The existing CLI integration tests are prior art.
+
+## Out of Scope
+
+- Per-frame **Screen Hash**es on recordings / `record export` (user story 13). v1 attaches the hash only where a result already holds an observed **Semantic Snapshot**; the export paths render no **Semantic Snapshot** per frame, so a recording-frame dedup hash is future scope rather than a v1 deliverable.
+- A scrollback hash. The **Screen Hash** is visible-screen-only; a separate scrollback digest can be added later if a concrete need appears.
+- A styled or per-cell hash. Transient style churn would make such a hash flap; the **Screen Hash** is text-content identity only.
+- Pixel-level identity, and any **Screen Hash** on the **Screenshot Result**. A **Screenshot Result** carries only its pixel `sha256`; the content hash lives on the snapshot and wait results. The **Screen Hash** is the semantic counterpart to the pixel digest and the two are not interchangeable.
+- New wait semantics built on the hash (for example, "wait until the screen content changes"). v1 only exposes the field; any hash-driven wait is future scope.
+- Any change to the screen-stability behavior **beyond** the Phase 1 renderer-convergence change described in the Implementation Decisions. The canonical-text definition and the shared single-source unification are behavior-preserving; the only intended behavior change is the default `ghostty-web` backend's comparand on grapheme / interior-gap / non-ASCII-trailing screens, pinned by characterization tests. No new wait semantics are added.
+
+## Further Notes
+
+- The motivation differs from the comparable tool virtui, which hashes to avoid shipping screen bytes over a socket. agent-tty is a local CLI, so the value here is the stable content change-token and frame dedup, not transfer avoidance.
+- The **Screen Hash** term is defined in the project glossary; this PRD and that term are on branch `feat/screen-hash`. No ADR was needed: the field is an optional add over the canonical string that already exists, and the one intended behavior change — the Phase 1 renderer convergence — is narrow, characterization-pinned, and easily reversible.
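
The PRD above defines the **Screen Hash** as the SHA-256 of the UTF-8 bytes of `visibleLines[].text` joined by `\n`, with cursor position, styles, and scrollback excluded. A minimal sketch of that definition follows, assuming Node's built-in `node:crypto`. The `SnapshotLike` shape and the function names `canonicalScreenText`, `screenHash`, and `screenChanged` are hypothetical and only illustrate the idea; the sketch hashes the line text as given and does not reproduce the Phase 1 renderer normalization.

```typescript
import { createHash } from "node:crypto";

// Hypothetical minimal shape; the real Semantic Snapshot carries more fields.
interface VisibleLine {
  text: string;
}

interface SnapshotLike {
  visibleLines: VisibleLine[];
  // Present on the snapshot but deliberately ignored by the hash.
  cursor?: { row: number; col: number };
}

// Canonical visible text: each visible line's text joined by "\n".
export function canonicalScreenText(snapshot: SnapshotLike): string {
  return snapshot.visibleLines.map((line) => line.text).join("\n");
}

// SHA-256 over the UTF-8 bytes of the canonical text, as lowercase hex.
export function screenHash(snapshot: SnapshotLike): string {
  return createHash("sha256")
    .update(canonicalScreenText(snapshot), "utf8")
    .digest("hex");
}

// Compare two observations. A missing hash means no screen was observed,
// so the comparison is reported as "unknown" rather than as a change.
export function screenChanged(
  previous: string | undefined,
  next: string | undefined,
): "changed" | "unchanged" | "unknown" {
  if (previous === undefined || next === undefined) return "unknown";
  return previous === next ? "unchanged" : "changed";
}
```

A caller would keep the `screenHash` from one `snapshot` or matched `wait` result and pass it with the next one to something like `screenChanged`; because the hash is optional, treating a missing value as "unknown" avoids reporting a change that was never observed.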
diff --git a/package.json b/package.json
@@ -55,9 +55,9 @@
     "release:finalize": "node ./scripts/release-finalize.mjs",
     "review-bundle": "tsx src/tools/review-bundle.ts",
     "smoke:install": "node ./scripts/smoke-install.mjs",
-    "test": "vitest run",
-    "test:e2e": "vitest run --maxWorkers=1 test/e2e",
-    "test:integration": "vitest run --maxWorkers=1 test/integration",
+    "test": "vitest run --retry=2",
+    "test:e2e": "vitest run --maxWorkers=1 --retry=2 test/e2e",
+    "test:integration": "vitest run --maxWorkers=1 --retry=2 test/integration",
     "test:unit": "vitest run test/unit",
     "test:watch": "vitest",
     "typecheck": "tsc -p tsconfig.json --noEmit",