Merged
6 changes: 6 additions & 0 deletions .github/workflows/ci.yml
@@ -84,6 +84,12 @@ jobs:
- name: Test unit
run: mise run test-unit

# Integration and e2e tests drive real PTY hosts and headless-browser
# renderers, which can transiently fail under machine load (e.g. a screenshot
# render or RPC hiccup) even when the code is correct. The `test:integration`
# and `test:e2e` npm scripts pass `--retry=2`, so a flaky attempt is retried
# in place instead of failing the shard; a genuine failure still fails all
# three attempts. Unit tests (`test:unit`) deliberately do NOT retry.
test-integration:
runs-on: ubuntu-latest
timeout-minutes: 20
7 changes: 7 additions & 0 deletions CONTEXT.md
@@ -43,6 +43,10 @@ _Avoid_: Visual wait, snapshot wait
A render condition where the visible text content of a **Semantic Snapshot** has remained unchanged for a requested duration.
_Avoid_: Settled screen

**Screen Hash**:
A stable digest of a **Session**'s normalized visible screen text at a captured event-log sequence, used to tell whether the rendered screen content changed between two observations. It is computed from the same canonical visible text that the **Screen Stability** check and text **Render Wait** matching use, so the three never disagree.
_Avoid_: Screen checksum, frame hash, screenshot hash

**Batch**:
An ordered sequence of **Batch Steps** driven through one **Command Target** in a single `batch` invocation. It runs fail-fast: the first failed **Batch Step** stops the run unless the caller opts into continuing.
_Avoid_: Pipeline, script, macro
@@ -228,6 +232,9 @@ _Avoid_: bare "agent", "Coder agent"
- A **Render Wait** may include text, regex, cursor, or **Screen Stability** conditions.
- A **Render Wait** may be evaluated by live host polling for a **Live Host Eligible Session** or by offline replay fallback for an **Offline Replay Eligible Session**.
- Offline replay fallback can evaluate snapshot content and cursor position, but cannot prove elapsed **Screen Stability** duration from a single latest **Semantic Snapshot**.
- A **Screen Hash** changes exactly when the canonical visible text that the **Screen Stability** check compares changes; the two share one definition.
- A **Screen Hash** covers visible screen text only — not scrollback, cursor position, or styles — and is distinct from the pixel `sha256` recorded on a **Screenshot Result**.
- A result carries the **Screen Hash** of the **Semantic Snapshot** it observed: a **Snapshot Result**, a matched **Render Wait** result, and the offline host-unreachable fallback that still observed a snapshot (even when it reports `matched: false` because **Screen Stability** duration could not be proven offline). The hash is keyed on whether a snapshot was observed, not on whether the wait matched; a **Render Wait** that observes no snapshot — a live timeout, a consecutive-failure giveup, or a replay error — carries none.
- A **Waited Run** may produce one **Run Completion**, time out for its caller, or be interrupted by **Session** exit.
- Caller timeout does not cancel the underlying **Run Completion**; it may still be observed later to keep internal completion bytes out of artifacts.
- After **Session** exit, an unobserved **Run Completion** can no longer arrive.
9 changes: 9 additions & 0 deletions docs/CONTRIBUTING.md
@@ -46,6 +46,15 @@ If you touch the public bootstrap under `skills/` or the bundled runtime skills
npm run intent:validate
```

### Flaky integration and e2e tests

Integration and e2e tests drive real PTY hosts and headless-browser renderers, so an individual test can transiently fail under machine load (most often a screenshot render or host RPC hiccup) even when the code is correct. To keep these flakes from producing spurious red CI runs:

- `npm run test:integration`, `npm run test:e2e`, and the combined `npm run test` retry a failing test in place (`--retry=2`, up to three attempts). A genuine failure still fails all three attempts.
- `npm run test:unit` deliberately does **not** retry — unit tests must be deterministic, and the dedicated unit CI gate is the authority that catches real unit flakes.
- If an integration/e2e test fails _consistently_ (not just on one attempt), treat it as a real failure and investigate; do not raise the retry count to paper over it.
- When debugging a single browser-backed test locally, run it in isolation (`npm run test:e2e -- <file>`); the full serial suite is the heaviest load and the most flake-prone.
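
The retry-in-place behavior described above can be pictured with a small model. This is only a sketch of the semantics (`--retry=2` meaning up to three attempts), not Vitest's implementation; `runWithRetry`, `Attempt`, and `RetryOutcome` are hypothetical names introduced for illustration.

```typescript
// Illustrative model of "retry in place"; not Vitest's internals.
type Attempt = () => void; // a test body that throws on failure

interface RetryOutcome {
  passed: boolean;
  attempts: number;
}

// `retries` mirrors the value given to `--retry`; total attempts = retries + 1.
function runWithRetry(test: Attempt, retries: number): RetryOutcome {
  for (let attempt = 1; attempt <= retries + 1; attempt += 1) {
    try {
      test();
      return { passed: true, attempts: attempt };
    } catch {
      // Swallow the failure and fall through to the next attempt, if any.
    }
  }
  return { passed: false, attempts: retries + 1 };
}

// A flaky test that fails once and then passes is retried and reported green.
let calls = 0;
const flaky: Attempt = () => {
  calls += 1;
  if (calls === 1) throw new Error("transient render hiccup");
};

// A genuinely broken test fails every attempt, so the run still fails.
const broken: Attempt = () => {
  throw new Error("real regression");
};

const flakyOutcome = runWithRetry(flaky, 2);
const brokenOutcome = runWithRetry(broken, 2);
console.log(flakyOutcome); // { passed: true, attempts: 2 }
console.log(brokenOutcome); // { passed: false, attempts: 3 }
```

The model also shows why raising the retry count does not help with a consistent failure: every extra attempt fails the same way and only adds run time.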

## Documentation and proof expectations

- Keep the root docs split clear: `README.md` for overview and `RELEASE.md` for supported scope.
10 changes: 9 additions & 1 deletion docs/USAGE.md
@@ -104,6 +104,14 @@ Useful flags:
- `--exit`: wait for the process to exit.
- `--timeout <ms>`: maximum wait time in milliseconds, with `0` meaning infinite.

### Screen Hash

`snapshot` results (both `--format structured` and `--format text`) and a `wait` result that observed a screen (a **matched** `wait`, or the offline host-unreachable fallback that still observed a snapshot) carry an optional `screenHash`: a lowercase 64-character hex SHA-256 of the visible screen text. Compare it across two calls to tell whether the visible screen actually changed: equal hashes mean identical visible content, even if the event-log sequence advanced on a no-op repaint.

- It hashes the visible screen only. It is **not** a hash of the `--format text` output, which also includes scrollback, so the hash ignores scrollback growth.
- It is distinct from the `screenshot` result's pixel `sha256`: `screenHash` is content identity, the screenshot `sha256` is pixel identity, and the two are not interchangeable.
- A `wait` that times out (or finds the host unreachable with no observed screen) omits `screenHash`, so a missing hash unambiguously means "no screen was observed" rather than an error.
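
A caller-side comparison might look like the sketch below. It assumes the JSON result has already been parsed; `ObservedResult`, `compareScreens`, and the presence of a `capturedAtSeq` field on the result are assumptions made for illustration, and the digests are placeholders rather than real output.

```typescript
// Hypothetical minimal result shape: only the fields this sketch reads.
// `screenHash` is optional, as described above; `capturedAtSeq` is assumed here.
interface ObservedResult {
  capturedAtSeq?: number;
  screenHash?: string;
}

type ScreenChange = "changed" | "unchanged" | "unknown";

// Compare two observations by content identity, never by sequence number.
function compareScreens(previous: ObservedResult, next: ObservedResult): ScreenChange {
  if (previous.screenHash === undefined || next.screenHash === undefined) {
    // A missing hash means no screen was observed, so nothing can be concluded.
    return "unknown";
  }
  return previous.screenHash === next.screenHash ? "unchanged" : "changed";
}

// Placeholder digests for illustration; real values come from the CLI's JSON output.
const hashA = "a".repeat(64);
const hashB = "b".repeat(64);

const first: ObservedResult = { capturedAtSeq: 10, screenHash: hashA };
const repaint: ObservedResult = { capturedAtSeq: 14, screenHash: hashA }; // sequence advanced, same content
const edited: ObservedResult = { capturedAtSeq: 21, screenHash: hashB };
const noScreen: ObservedResult = {}; // e.g. a wait that timed out

console.log(compareScreens(first, repaint)); // unchanged
console.log(compareScreens(first, edited)); // changed
console.log(compareScreens(first, noScreen)); // unknown
```

Treating a missing hash as "unknown" rather than "changed" keeps the comparison honest: the absence of `screenHash` says nothing about the screen's content.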

## `batch`

Use `batch` to run an ordered sequence of input-and-`wait` steps against one session in a single invocation, instead of coordinating separate `run`/`type`/`paste`/`send-keys`/`wait` calls. Each `wait` step is anchored to a Wait Baseline — it only considers screen state produced _after_ the preceding input step — so a batch cannot race ahead and match a stale screen the way a hand-written shell loop can.
@@ -174,7 +182,7 @@ The `--json` result is a per-step envelope:
}
```

-Each step record carries its `index`, `kind`, `status` (`completed` | `failed` | `not-run` | `interrupted`), and `durationMs`. Input steps report the Event Log `seq` they produced; `wait` steps report the `waitBaseline` they were anchored to plus `matched` / `timedOut` / `matchedText` / `capturedAtSeq`. `completedCount` and `failedIndices` summarize the run. A fail-fast batch exits non-zero with the failed step's exit code (e.g. `11` for a `WAIT_TIMEOUT`); `--keep-going` exits `1` if any step failed. If the process is interrupted by SIGINT/SIGTERM, batch flushes the same envelope with the in-flight step marked `interrupted` and later steps `not-run`, then exits non-zero.
+Each step record carries its `index`, `kind`, `status` (`completed` | `failed` | `not-run` | `interrupted`), and `durationMs`. Input steps report the Event Log `seq` they produced; `wait` steps report the `waitBaseline` they were anchored to plus `matched` / `timedOut` / `matchedText` / `capturedAtSeq`, and a matched `wait` step also carries the `screenHash` of the screen it observed (see [Screen Hash](#screen-hash)). `completedCount` and `failedIndices` summarize the run. A fail-fast batch exits non-zero with the failed step's exit code (e.g. `11` for a `WAIT_TIMEOUT`); `--keep-going` exits `1` if any step failed. If the process is interrupted by SIGINT/SIGTERM, batch flushes the same envelope with the in-flight step marked `interrupted` and later steps `not-run`, then exits non-zero.

The Wait Baseline fixes stale-match only. It does **not** fix echo-match: a `wait` can still match the terminal's echo of a just-typed command (the echo renders _after_ the baseline). Use a distinctive output token or a `screenStableMs` wait rather than waiting for text you just typed. Interrupting a batch mid-`wait` leaves that wait's command still running on the session (the wait is abandoned, not cancelled), exactly like a caller timeout on `run`.
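
Given the per-step records described above, a consumer could pull out the content identity of each matched `wait` step roughly as follows. This is a sketch under assumptions: `BatchStepRecord` is a hypothetical partial type, the `"wait"` and `"run"` values of `kind` are assumed, and the name of the envelope field that holds the step array is deliberately not guessed.

```typescript
// Hypothetical partial step record: field names follow the prose above, but the
// exact envelope shape and the `kind` values are assumptions of this sketch.
interface BatchStepRecord {
  index: number;
  kind: string;
  status: "completed" | "failed" | "not-run" | "interrupted";
  durationMs: number;
  matched?: boolean;
  screenHash?: string;
}

// Collect index -> screenHash for every matched wait step that carries a hash.
function waitStepHashes(steps: readonly BatchStepRecord[]): Map<number, string> {
  const hashes = new Map<number, string>();
  for (const step of steps) {
    if (step.kind === "wait" && step.matched === true && step.screenHash !== undefined) {
      hashes.set(step.index, step.screenHash);
    }
  }
  return hashes;
}

// Placeholder records for illustration only.
const steps: BatchStepRecord[] = [
  { index: 0, kind: "run", status: "completed", durationMs: 12 },
  { index: 1, kind: "wait", status: "completed", durationMs: 340, matched: true, screenHash: "c".repeat(64) },
  { index: 2, kind: "wait", status: "failed", durationMs: 5000, matched: false },
];

const hashes = waitStepHashes(steps);
console.log(Array.from(hashes.keys())); // [ 1 ]
```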

63 changes: 63 additions & 0 deletions docs/prd/screen-hash/PRD.md
@@ -0,0 +1,63 @@
# PRD: Screen Hash on snapshot and wait results

## Problem Statement

A caller — often an AI coding agent — driving a **Session** repeatedly needs a cheap, reliable way to answer "did the rendered screen actually change since I last looked?" Today the only per-result identifier is the captured event-log sequence, but that advances on every chunk of output, including output that changes nothing visible: cursor-position queries, terminal-mode toggles, a spinner repainting the same glyphs. So two observations with different sequences can be the identical screen, and a caller comparing sequences sees changes that are not there. There is no stable token for the screen's content itself.

## Solution

Snapshot results and matched **Render Wait** results gain an optional **Screen Hash**: a stable digest of the **Session**'s normalized visible screen text at the captured event-log sequence. Equal hashes mean the visible content is identical; a changed hash means it genuinely changed. The **Screen Hash** is computed from the same canonical visible text that the **Screen Stability** check and text **Render Wait** matching already use, so "the hash changed" and "the stability check saw a change" can never disagree.

## User Stories

1. As an AI coding agent, I want a stable hash of the screen content on each snapshot, so that I can tell across two CLI calls whether the visible screen actually changed without diffing full text myself.
2. As an agent, I want the hash to stay equal when only the cursor moved, so that cursor motion alone does not look like a content change.
3. As an agent, I want the hash to stay equal when output occurred that changed nothing visible, so that I am not misled by the captured sequence advancing on a no-op repaint.
4. As an agent, I want the hash to change whenever the visible text changes, so that I can trust it as a content-changed signal.
5. As a caller, I want the **Screen Hash** on the snapshot result in both structured and text formats, so that I get it regardless of how I read the screen.
6. As a caller, I want the **Screen Hash** on a matched render-wait result, so that I know the content identity at the moment my wait condition was satisfied.
7. As a caller, I want the hash present whenever a result holds an **observed** **Semantic Snapshot** — including the offline host-unreachable `matched: false` fallback that still observed a snapshot — and omitted only when no snapshot was observed (a live timeout, a consecutive-failure giveup, or a replay error), so that a missing hash unambiguously means "no screen was observed" rather than signalling an error.
8. As a tooling author, I want the **Screen Hash** to be renderer-independent — the same screen yields the same hash under either renderer backend — so that I can compare hashes across sessions rendered by different backends.
9. As a maintainer, I want the **Screen Hash**, the **Screen Stability** compare, and text **Render Wait** matching to share one canonical visible-text definition, so that they can never disagree about what "the screen" is.
10. As a maintainer, I want adding the **Screen Hash**, and routing the three consumers through one shared canonical-text definition, to leave the shipped screen-stability behavior unchanged on its own, so that the only behavior change is the deliberate, characterization-pinned Phase 1 renderer convergence and not an accidental side effect of the hash.
11. As a caller, I want to understand that the **Screen Hash** is distinct from a screenshot's pixel digest, so that I use the right identity for content versus pixels.
12. As a caller, I want to understand that the **Screen Hash** covers the visible screen only, even though the text snapshot format also includes scrollback, so that I am not surprised that the hash ignores scrollback growth.
13. As a tool building recordings, I want a per-frame content hash, so that I can dedup consecutive identical frames in artifacts.
14. As a caller using `--json`, I want the hash as a lowercase 64-character hex string validated by the same digest schema as other hashes, so that the field shape is predictable.
15. As a caller, I want the **Screen Hash** to be optional on results, so that older artifacts and hosts that predate it still parse.

## Implementation Decisions

- Add an optional **Screen Hash** field — a lowercase 64-character SHA-256 hex digest — to the snapshot result (both structured and text formats) and to the matched render-wait result.
- In scope: a **Batch Step** record for a matched **Render Wait** step also carries the **Screen Hash**, mirrored from that step's render-wait result, so a batch run exposes the same content identity per wait step that a standalone wait does.
- The **Screen Hash** is the SHA-256 of the canonical visible-text string: the visible lines joined by newline, exactly as the host's screen-stability compare and the text matcher already build it. The shared canonical-text **definition** — `visibleLines[].text` joined by `\n`, sourced only from the snapshot (never `backend.getVisibleText()` or `cells[]`) — is unchanged by adding the hash. Cursor position, text styles, and scrollback are excluded.
- Converging the two renderer backends on one canonical screen form (Phase 1) intentionally changes the **default** `ghostty-web` backend's stability and text-wait **comparand** on screens with grapheme clusters, interior blank-cell gaps, or non-ASCII trailing characters: the canonical form is exactly `rows` lines, each decoded into full grapheme clusters with blank/zero cells rendered as `' '`, then right-trimmed of trailing ASCII spaces (`0x20`) only. This is a deliberate, narrow change pinned by characterization tests, not a free behavior-preserving addition; on plain ASCII screens the comparand is unchanged.
- Extract one shared canonical-screen-text helper and route the **Screen Hash**, the host **Screen Stability** compare, and the text **Render Wait** matcher through it, so the three share a single definition and cannot diverge.
- The hash is keyed on whether a result holds an **observed** **Semantic Snapshot**, not on whether the wait matched. A result carries the **Screen Hash** of the snapshot it observed: a matched live wait, a snapshot capture, and the offline host-unreachable fallback that still observed a latest snapshot (even when it returns `matched: false` because the **Screen Stability** duration could not be proven offline). The hash is omitted only when no snapshot was observed: a live wait that times out, a consecutive-failure giveup, or a replay error throw.
- Do not surface the **Screen Hash** on inspection or any path that does not already render a **Semantic Snapshot**; computing it must never force a renderer bootstrap that would not otherwise happen.
- Reuse the existing SHA-256 hex validator. The consolidation set is exactly: export `Sha256HexSchema` from `protocol/schemas.ts` and import it in `renderer/types.ts`. Deliberately left out of scope: the standalone regex copies in `storage/artifactManifest.ts` and the `invariant(/^[a-f0-9]{64}$/u.test(...))` checks (for example in `renderer/profiles.ts` and `renderer/bundledFont.ts`), which are not Zod schemas and are not part of this consolidation.
- The field is optional so existing persisted artifacts and older hosts continue to parse.
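
The hashing decision above can be sketched in a few lines. This is a minimal illustration, not the project's helper: `VisibleLine`, `canonicalScreenText`, `screenHash`, and `isSha256Hex` are hypothetical names, the input lines are assumed to already be in canonical form (the sketch performs no trimming itself), and the regex is the plain pattern quoted above rather than the Zod `Sha256HexSchema`.

```typescript
import { createHash } from "node:crypto";

// Hypothetical minimal line shape: only `text` participates in the hash.
interface VisibleLine {
  text: string;
}

// Canonical visible text: `visibleLines[].text` joined by "\n".
// Cursor position, styles, and scrollback never enter this string.
function canonicalScreenText(visibleLines: readonly VisibleLine[]): string {
  return visibleLines.map((line) => line.text).join("\n");
}

// SHA-256 over the UTF-8 bytes of the canonical text, as lowercase hex.
function screenHash(visibleLines: readonly VisibleLine[]): string {
  return createHash("sha256")
    .update(canonicalScreenText(visibleLines), "utf8")
    .digest("hex");
}

// Shape check equivalent to the /^[a-f0-9]{64}$/u pattern mentioned above.
function isSha256Hex(value: string): boolean {
  return /^[a-f0-9]{64}$/u.test(value);
}

const screen: VisibleLine[] = [{ text: "$ ls" }, { text: "README.md" }];
const sameScreen: VisibleLine[] = [{ text: "$ ls" }, { text: "README.md" }];
const editedScreen: VisibleLine[] = [{ text: "$ ls" }, { text: "README.mD" }];

console.log(isSha256Hex(screenHash(screen))); // true
console.log(screenHash(screen) === screenHash(sameScreen)); // true
console.log(screenHash(screen) === screenHash(editedScreen)); // false
```

Because the function takes only line text, cursor-only movement cannot change the result by construction, which is the property user story 2 asks for.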

## Testing Decisions

Good tests assert external behavior, not implementation details.

- **Canonical-text and hash helper (unit).** Same screen yields the same hash; cursor-only movement yields the same hash; a single visible-glyph change yields a different hash; a trailing-whitespace-only difference (before right-trim of ASCII spaces) yields a different hash — proving the canonical form is exactly what is hashed and the behavior is as specified.
- **UTF-8 encoding pinned (unit).** The hash is the SHA-256 of the UTF-8 bytes of the canonical visible text, asserted against a concrete golden digest so the encoding can never silently drift. Golden: a three-row screen whose canonical text is `"a\nb\nc"` hashes to `ea7fb08b7a2dc4619ffb7c7bb38d95a2047935fa165d71b12efd3852a2e6d0cc`.
- **Shared definition (unit).** The host **Screen Stability** compare and the **Render Wait** matcher consume the same canonical string the hash uses, so a later change to one cannot silently diverge from the others, and screen-stability behavior is demonstrably unchanged.
- **Cross-backend hash equality.** The same event log produces the same **Screen Hash** under both renderer backends, pinning the renderer-independence guarantee that is currently only an assumption. This test requires the optional native addon (`@coder/libghostty-vt-node`) and so must run on at least one CI job that has the addon installed; it skips gracefully where the addon is absent (including the sandbox), so the renderer-independence guarantee is not silently unverified.
- **Snapshot and wait envelope (integration).** Against an isolated home: the **Screen Hash** is present on a snapshot (structured and text), on a matched live wait, and on the offline host-unreachable `matched: false` fallback that still observed a snapshot; and absent on a timed-out live wait. The existing CLI integration tests are prior art.
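
The presence rule that the envelope test pins (hash present when a snapshot was observed, absent otherwise) can be modeled as below. This is a sketch only: `WaitOutcome`, `WaitResult`, `toWaitResult`, and the outcome labels are hypothetical and do not reflect the project's internal types; the replay-error case is left out because the PRD describes it as a throw rather than a result.

```typescript
// Hypothetical internal outcome of a wait; labels are illustrative only.
interface WaitOutcome {
  label: "matched" | "offline-fallback" | "live-timeout" | "consecutive-failure-giveup";
  matched: boolean;
  // Set only when a Semantic Snapshot was actually observed.
  observedScreenHash?: string;
}

// Hypothetical public result shape: `screenHash` is optional.
interface WaitResult {
  matched: boolean;
  screenHash?: string;
}

// The hash is keyed on whether a snapshot was observed, not on `matched`.
function toWaitResult(outcome: WaitOutcome): WaitResult {
  if (outcome.observedScreenHash === undefined) {
    return { matched: outcome.matched };
  }
  return { matched: outcome.matched, screenHash: outcome.observedScreenHash };
}

const placeholderHash = "d".repeat(64); // placeholder digest, not real output

const matchedResult = toWaitResult({ label: "matched", matched: true, observedScreenHash: placeholderHash });
const offlineResult = toWaitResult({ label: "offline-fallback", matched: false, observedScreenHash: placeholderHash });
const timeoutResult = toWaitResult({ label: "live-timeout", matched: false });

console.log("screenHash" in matchedResult); // true
console.log("screenHash" in offlineResult); // true
console.log("screenHash" in timeoutResult); // false
```

The offline fallback case is the one worth noticing: it reports `matched: false` yet still carries a hash, because a snapshot was observed.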

## Out of Scope

- Per-frame **Screen Hash**es on recordings / `record export` (user story 13). v1 attaches the hash only where a result already holds an observed **Semantic Snapshot**; the export paths render no **Semantic Snapshot** per frame, so a recording-frame dedup hash is future scope rather than a v1 deliverable.
- A scrollback hash. The **Screen Hash** is visible-screen-only; a separate scrollback digest can be added later if a concrete need appears.
- A styled or per-cell hash. Transient style churn would make such a hash flap; the **Screen Hash** is text-content identity only.
- Pixel-level identity, and any **Screen Hash** on the **Screenshot Result**. A **Screenshot Result** carries only its pixel `sha256`; the content hash lives on the snapshot and wait results. The **Screen Hash** is the semantic counterpart to the pixel digest and the two are not interchangeable.
- New wait semantics built on the hash (for example, "wait until the screen content changes"). v1 only exposes the field; any hash-driven wait is future scope.
- Any change to the screen-stability behavior **beyond** the Phase 1 renderer-convergence change described in the Implementation Decisions. The canonical-text definition and the shared single-source unification are behavior-preserving; the only intended behavior change is the default `ghostty-web` backend's comparand on grapheme / interior-gap / non-ASCII-trailing screens, pinned by characterization tests. No new wait semantics are added.

## Further Notes

- The motivation differs from the comparable tool virtui, which hashes to avoid shipping screen bytes over a socket. agent-tty is a local CLI, so the value here is the stable content change-token and frame dedup, not transfer avoidance.
- The **Screen Hash** term is defined in the project glossary; this PRD and that term are on branch `feat/screen-hash`. No ADR was needed: the field is an optional add over the canonical string that already exists, and the one intended behavior change — the Phase 1 renderer convergence — is narrow, characterization-pinned, and easily reversible.
6 changes: 3 additions & 3 deletions package.json
@@ -55,9 +55,9 @@
"release:finalize": "node ./scripts/release-finalize.mjs",
"review-bundle": "tsx src/tools/review-bundle.ts",
"smoke:install": "node ./scripts/smoke-install.mjs",
-"test": "vitest run",
-"test:e2e": "vitest run --maxWorkers=1 test/e2e",
-"test:integration": "vitest run --maxWorkers=1 test/integration",
+"test": "vitest run --retry=2",
+"test:e2e": "vitest run --maxWorkers=1 --retry=2 test/e2e",
+"test:integration": "vitest run --maxWorkers=1 --retry=2 test/integration",
"test:unit": "vitest run test/unit",
"test:watch": "vitest",
"typecheck": "tsc -p tsconfig.json --noEmit",