Skip to content

test(e2e-prod): per-suite correctness sweeps + nits#169

Merged
jiashuoz merged 1 commit into
mainfrom
test/e2e-prod-correctness-sweeps
May 28, 2026
Merged

test(e2e-prod): per-suite correctness sweeps + nits#169
jiashuoz merged 1 commit into
mainfrom
test/e2e-prod-correctness-sweeps

Conversation

@jiashuoz
Copy link
Copy Markdown
Member

Summary

Companion to #168 (the harness + critical assertion hardening). This is the broader correctness sweep — smaller issues that aren't critical-class but each undermines a specific test's ability to verify what its name claims.

Per-suite changes

Suite Issue Fix
01-basic unused `before` import removed
02-authz "cross-tenant DELETE" had dead code — the `assert.ok` above already threw on non-4xx, so the explicit 200/204 `fail()` was unreachable reordered: hard-fail on 200/204 first via `assert.fail()`, then 4xx assertion
08-mcp / 12-mcp-extended hardcoded `/Users/joshzhang/Desktop/e2a/mcp/dist/index.js` broke for any other contributor / CI repo-relative default via `new URL(...).pathname` + `E2A_MCP_DIST` env override
09-postfix "/send 429 Retry-After" was a no-op recording an `info()` skip but counted as a passing test, inflating the green count `test.skip()` so it shows correctly in the reporter
11-messaging reply-empty test pulled from user-scoped `/messages` with no inbound filter — usually picked outbound → 404 on `/reply` (which only operates on inbound) → `>= 400` assert passed without ever testing "empty body returns 400" now uses agent-scoped inbound listing + asserts `status === 400` exactly; 404-on-listed-id surfaces as info() (would be a listing/storage skew bug, not the thing we're trying to test)
13-billing-surface usage-vs-list drift tolerance was purely informational at any size ±1 is silently fine (timing race); >1 records info() with detail; >3 hard-fails as counter staleness

Why two PRs

PR #168 is the high-leverage harness + critical-class assertions — narrow and easy to land. This PR is broader (touches 7 files) and worth reviewing independently.

Test plan

  • Smoke against prod with all changes — clean
  • Once both PRs are merged, run the full 13-suite e2e against prod and confirm everything stays green (any new failure would be a real regression the previous suite was hiding)

Companion to PR #168 (the harness + contract-assertion hardening).
This is the broader correctness sweep from the same review: smaller
issues that aren't critical-class but undermine specific tests'
ability to actually verify what their names claim.

Per-suite fixes:

01-basic.test.ts
 - Remove unused `before` import.

02-authz.test.ts
 - "cross-tenant DELETE" had dead code: the assert.ok above already
   throws on non-4xx, so the explicit 200/204 check could never run.
   Reordered: hard-fail on 200/204 first (and assert.fail() so node:test
   sees the regression), then the 4xx ok-range assertion.

08-mcp.test.ts + 12-mcp-extended.test.ts
 - Replace hardcoded absolute path to mcp/dist/index.js with a
   repo-relative default + E2A_MCP_DIST env override. The hardcoded
   path was `/Users/joshzhang/Desktop/e2a/...` — broke for any other
   contributor or CI environment.

09-postfix.test.ts
 - The "/send 429 Retry-After" test was a pure no-op recording an
   `info()` skip — but it counted as a passing test in the totals
   and inflated the green count. Marked test.skip() so it shows up
   correctly in the spec-reporter output. The reason (avoiding fan-
   out of 60+ HITL notification emails) is preserved in the comment.

11-messaging.test.ts (reply-empty)
 - Previous candidate selection (`/api/v1/messages` user-scoped +
   fallback to any direction) routinely picked an outbound message,
   which 404'd at the /reply endpoint (reply only operates on inbound).
   The 404 satisfied the `>= 400 && < 500` assert so the test passed
   without ever testing the "empty body returns 400" branch. Now uses
   the agent-scoped inbound listing and asserts the exact 400; if a
   listing-vs-storage skew causes a 404 on an id we just listed, that
   surfaces as info() rather than masking the empty-body check.

13-billing-surface.test.ts
 - usage-agents drift tolerance was purely informational at any size.
   Now: ±1 is silently fine (timing race), >1 records info() with
   detail, but >3 hard-fails as a real counter-staleness regression.

Smoke against prod passes with all changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jiashuoz jiashuoz merged commit 5269e1d into main May 28, 2026
10 checks passed
@jiashuoz jiashuoz deleted the test/e2e-prod-correctness-sweeps branch May 28, 2026 00:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant