fix(ci): test stability improvements #2466

lidel · 2026-01-24T21:04:30Z

It seems CI broke at some point in the past 7 months and no longer passes at all.

This is a WIP PR to figure out the cause, then fix and stabilize the tests so we don't have to babysit them in the future.

removed the 10-shard matrix, which added complexity and CI time overhead without proportional benefit. tests now run in a single job with a 10-minute timeout, which is sufficient given that the test suite completes in ~15 seconds locally. also merged the two conditional test runs (repeated vs non-repeated) into a single step that always uses --reporter=list for clearer output.

removed conditional logic that changed behavior based on process.env.CI. this was causing inconsistency between local and CI test runs, making it harder to reproduce CI failures locally. now uses consistent settings: 30s timeout per test, 5-minute global timeout for entire suite, no retries, and always starts fresh server.
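A rough sketch of what such CI-independent settings could look like in playwright.config.js. The option names are Playwright's standard ones; the `command`, `url` and port are placeholder assumptions, not values taken from this PR.

```javascript
// playwright.config.js (sketch): identical values locally and on CI,
// with no branching on process.env.CI
const config = {
  timeout: 30 * 1000, // 30s per test
  globalTimeout: 5 * 60 * 1000, // 5 minutes for the entire suite
  retries: 0, // no retries
  webServer: {
    // command, url and port are placeholders (assumptions)
    command: 'npx http-server ./build -p 3001',
    url: 'http://127.0.0.1:3001/',
    reuseExistingServer: false // always start a fresh server
  }
}

module.exports = config
```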

added timestamped logging to global-setup.js and ipfs-backend.js to help diagnose CI hangs when they occur. each major step now logs progress.

added timeout wrappers around async operations that could hang indefinitely:
- ipfs-backend startup: 60s timeout
- kubo daemon spawn: 30s timeout

also fixed two issues:
- disabled DHT bootstrapping (Bootstrap: []) for faster daemon startup
- changed addInitScript to page.evaluate so localStorage values are captured by storageState() before browser closes
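A minimal sketch of a timeout wrapper of this kind, assuming a Promise.race approach. The name `withTimeout` matches the helper mentioned in later commits, but the signature and body here are illustrative and may differ from the actual code.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
// Signature and error message are assumptions for illustration.
function withTimeout (promise, ms, label) {
  let timer
  // `_resolve` is intentionally unused; only `reject` is needed here
  const timeout = new Promise((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms)
  })
  // whichever settles first wins; always clear the timer so it
  // cannot keep the process alive after the operation finishes
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer))
}
```

Usage might look like `await withTimeout(startBackend(), 60000, 'ipfs-backend startup')`, where `startBackend` is a hypothetical name for the wrapped operation.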

files.test.js:
- changed file verification to only check the two files we uploaded instead of iterating all MFS files. other tests may have added files that would cause unexpected matches.

grid-view.test.js:
- added focusGrid() helper that tries multiple approaches to establish keyboard focus on the grid container. this fixes intermittent failures where arrow key navigation would not work because focus was not set.
- simplified test assertions to use playwright's built-in waiters instead of manual count checks.

grid.js helper:
- selectViewMode now waits for files view to be ready before checking current mode, preventing race conditions during page load.

the global teardown only removed the JSON config file but never called ipfsd.stop() on the spawned Kubo daemon. this left orphaned processes accumulating on CI runners, causing port conflicts and resource exhaustion.
- export stop() function from ipfs-backend.js
- call stop() in global-teardown.js before removing config file
- add logging for teardown progress
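The teardown order described above can be sketched as follows. This is not the actual global-teardown.js: `stop` stands in for the function exported from ipfs-backend.js, and the `{ stop, configPath }` signature and log messages are assumptions made so the example is self-contained.

```javascript
const fs = require('node:fs')

// Stop the daemon first, then delete the JSON config written during setup.
// `stop` and `configPath` are injected here purely for illustration.
async function globalTeardown ({ stop, configPath }) {
  console.log('[teardown] stopping kubo daemon')
  try {
    await stop()
  } catch (err) {
    // keep going: the config file should be removed even if stop() fails
    console.error('[teardown] failed to stop daemon:', err.message)
  }
  fs.rmSync(configPath, { force: true }) // force: ignore a missing file
  console.log('[teardown] removed', configPath)
}
```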

trace: 'on-first-retry' was ineffective because retries=0, meaning traces would never be captured. changed to 'retain-on-failure' so traces are available when debugging test failures.
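As a config fragment, the change amounts to something like the following (a sketch; the surrounding options in the real playwright.config.js are omitted):

```javascript
// With retries disabled there is never a "first retry", so
// 'on-first-retry' records nothing. 'retain-on-failure' records a trace
// for every test and keeps it only when the test fails.
const traceConfig = {
  retries: 0,
  use: {
    trace: 'retain-on-failure'
  }
}
```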

migrate from deprecated waitForSelector() pattern to modern locator API which provides better error messages and auto-waiting behavior. changes include:
- replace page.waitForSelector() with page.locator().waitFor()
- remove force:true clicks, use proper waits instead
- fix missing await on click operations (files.test.js, ipns.test.js)
- replace custom checkClassWithTimeout polling with waitForFunction
- use .first() where multiple elements match to satisfy strict mode
- use more specific selectors (button#id, [role="menuitem"]) to avoid ambiguous matches
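A before/after sketch of this migration pattern. The function name and the `button#import-button` selector are hypothetical examples, not taken from the test files; `page` is a Playwright Page.

```javascript
// Hypothetical helper showing the locator-based pattern.
async function openImportMenu (page) {
  // before (sketch of the old pattern):
  //   await page.waitForSelector('#import-button')
  //   page.click('#import-button', { force: true }) // missing await, forced click
  const button = page.locator('button#import-button').first() // .first() avoids strict-mode violations
  await button.waitFor({ state: 'visible' })
  await button.click() // click() waits for actionability itself, so force:true is not needed
}
```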

the `promise/param-names` rule requires Promise constructor parameters to match `^_?resolve$` and `^_?reject$` patterns. changed `_` to `_resolve` in the timeout wrapper functions.

always run `playwright install --with-deps` regardless of cache status to ensure system dependencies are present. previously this was only run on cache miss, which could cause failures if deps were missing.

tests pass locally (~17s) but hang on CI until 10-minute timeout. added comprehensive timestamped logging at every async operation:
- global-setup.js: log each step (port check, daemon spawn, browser launch, navigation)
- ipfs-backend.js: log kubo lifecycle (factory, spawn, identity, config write)
- global-teardown.js: log cleanup operations
- test-e2e.yml: add shell timestamps, enable DEBUG=pw:api for Playwright logging

logs output to both stdout and stderr to ensure CI captures output. timeout wrappers now warn at 80% of timeout before failing. after CI run, last log message before timeout identifies the hanging operation.
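A small sketch of a timestamped logger that writes to both streams. The function names and the line format are assumptions for illustration; the actual helpers in the setup files may look different.

```javascript
// Prefix a message with an ISO timestamp and a scope such as 'global-setup'.
function formatLog (scope, message, now = new Date()) {
  return `[${now.toISOString()}] [${scope}] ${message}`
}

// Write the same line to stdout and stderr so that CI captures
// at least one copy even if one stream is buffered or dropped.
function log (scope, message) {
  const line = formatLog(scope, message) + '\n'
  process.stdout.write(line)
  process.stderr.write(line)
}
```

Calling `log('global-setup', 'port check')` before each async step means the last line printed before a timeout names the step that hung.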

address CI hang by:
- make webui port configurable via WEBUI_PORT env var
- use dynamic port allocation in CI workflow
- increase webServer timeout from 5s to 30s for CI
- add stdout/stderr piping to capture webServer output
- add build directory check before running tests
- add config logging to track port and cwd
- use 127.0.0.1 instead of localhost for consistency

the CI was hanging with no output because Playwright initialization was blocking before globalSetup even ran. these changes will help identify exactly where the hang occurs.
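One way the port could be read from the environment is sketched below. The helper name `webuiUrl` and the fallback port 3001 are assumptions for illustration; only the `WEBUI_PORT` variable and the 127.0.0.1 host come from the description above.

```javascript
// Build the webui base URL from the environment.
// The fallback port is a placeholder (assumption).
function webuiUrl (env = process.env) {
  const port = Number.parseInt(env.WEBUI_PORT, 10) || 3001
  // use the literal IPv4 loopback address everywhere for consistency
  return `http://127.0.0.1:${port}/`
}
```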

previous CI runs showed 8-minute hangs with zero output from Playwright. this indicates cross-env or npm is buffering stdout. changes:
- run playwright directly in CI instead of via npm script
- add playwright version check before running tests
- use fs.writeSync for config logging to bypass Node.js buffering
- log environment variables to verify they're passed correctly

trying to isolate the CI hang:
- removed DEBUG=pw:api which might cause infinite output buffering
- removed NODE_OPTIONS which might affect behavior
- added direct node test to load playwright.config.js independently
- added timeout wrapper around playwright test command
- added chromium installation dry-run check

if config load test fails, the issue is in config/node setup. if config loads but playwright test hangs, issue is in playwright runner.

the CI was hanging because npx http-server was not starting. replaced it with a simple inline node http server that:
- serves files from ./build directory
- handles common MIME types
- starts immediately without npx overhead

also fixed eslint single-quote error.
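A minimal sketch of such a static server using only Node.js built-ins. It is not the code from this PR: the function names, the MIME table and the path-escape check are illustrative assumptions.

```javascript
const http = require('node:http')
const fs = require('node:fs')
const path = require('node:path')

// A few common MIME types; anything else falls back to a generic binary type.
const MIME = {
  '.html': 'text/html',
  '.js': 'text/javascript',
  '.css': 'text/css',
  '.json': 'application/json',
  '.svg': 'image/svg+xml',
  '.png': 'image/png',
  '.wasm': 'application/wasm'
}

function contentTypeFor (file) {
  return MIME[path.extname(file).toLowerCase()] || 'application/octet-stream'
}

// Create (but do not start) a server for files under `dir`.
function createBuildServer (dir = 'build') {
  const root = path.resolve(dir)
  return http.createServer((req, res) => {
    let pathname
    try {
      pathname = decodeURIComponent(new URL(req.url, 'http://127.0.0.1').pathname)
    } catch (err) {
      res.writeHead(400).end('bad request')
      return
    }
    const file = path.join(root, pathname === '/' ? 'index.html' : pathname)
    // refuse paths that escape the build directory
    if (!file.startsWith(root + path.sep)) {
      res.writeHead(403).end('forbidden')
      return
    }
    fs.readFile(file, (err, data) => {
      if (err) {
        res.writeHead(404).end('not found')
        return
      }
      res.writeHead(200, { 'Content-Type': contentTypeFor(file) }).end(data)
    })
  })
}
```

Starting it would then be `createBuildServer().listen(port, '127.0.0.1')`, with `port` chosen by the caller.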

replace inline node command with dedicated serve-build.js script to avoid issues with shell escaping and ES module requirements

- remove timeout wrapper that was hiding failures (exit code 124)
- remove || echo that swallowed error codes
- change webServer stdout/stderr from pipe to inherit for visibility
- clean up unnecessary diagnostic steps

hypothesis: Playwright's webServer spawn may be hanging on CI
test: start server manually with visible output before running playwright

strace analysis showed the CI hang occurs after Playwright's ESM config transpilation: a worker thread signals completion on the wrong futex address, leaving the main thread waiting forever. the last successful CI run (Nov 6, 2025) used Node 24.11.0, while failing runs use Node 24.13.0. testing whether pinning to 24.11.0 avoids the issue. this is a hypothesis, not a confirmed fix.

- remove strace debugging from test command
- define NODE_VERSION as workflow-level env var (24.11.0)
- document why Node.js is pinned (ESM loader hang in 24.13.0)

- simplify node_modules cache key to use only package-lock.json
- remove github.sha from cache keys (defeats cache reuse across commits)
- add restore-keys fallback patterns for all caches
- remove redundant npm ci (node_modules cache handles it)

- remove verbose step-by-step debug logging
- keep start/end markers and error messages
- keep withTimeout wrappers (prevent silent hangs)
- keep stop() export for proper daemon cleanup

- remove serve-build.js, restore npx http-server
- remove withTimeout() wrappers (not needed with Node.js fix)
- keep stop() export and proper daemon shutdown
- fix grid.js to use 127.0.0.1 (matches playwright config)
- keep: locator API, Bootstrap:[], cache optimization, Node pin

- add data-testid attributes to File, FilesList, FilesGrid, GridFile
- create fixtures.js with worker-scoped peerNode fixture for speed
- create locators.js with centralized selector definitions
- replace brittle CSS selectors with getByRole/getByTestId
- replace waitFor() calls with web-first assertions (toBeVisible, toHaveClass)
- update coverage.js to re-export from fixtures.js for backward compat
- modernize all test files to use shared locators and fixtures
- add test/e2e/test/ to gitignore (artifact from running tests)
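To illustrate the centralized-selector idea only (the fixture part is not covered here), a locators.js module could be shaped roughly like this. The test ids, the button name and the exported keys are hypothetical; the real file may be organized differently.

```javascript
// locators.js (sketch): one place for selectors shared across tests.
// All ids and names below are hypothetical placeholders.
const locators = {
  filesList: (page) => page.getByTestId('files-list'),
  fileRow: (page, name) => page.getByTestId('file-row').filter({ hasText: name }),
  importButton: (page) => page.getByRole('button', { name: 'Import' })
}

module.exports = { locators }
```

A test would then use a web-first assertion such as `await expect(locators.filesList(page)).toBeVisible()` instead of a manual waitFor() call.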

…ions
- add .tool-versions for nodejs 24.11.0 and golang 1.25
- upgrade actions/checkout v4 to v6
- upgrade actions/cache v4 to v5
- upgrade actions/setup-node v4 to v6
- upgrade actions/upload-artifact v4 to v6
- upgrade actions/download-artifact v4 to v7
- upgrade actions/setup-go v5 to v6
- switch setup-node and setup-go to use go-version-file/node-version-file
- remove NODE_VERSION env and node-version inputs from reusable workflows
- update README to point to .tool-versions for version info

lidel added 4 commits January 24, 2026 22:00


lidel mentioned this pull request Jan 24, 2026 in feat: DHT Provide Sweep Diagnostic Screen #2463 (open, 7 tasks)

lidel added 3 commits January 24, 2026 23:15

fix(e2e): enable trace capture on test failure

f2ae4ae


fix(e2e): use proper Promise parameter names for eslint

f39bad2


lidel force-pushed the fix/e2e-test-improvements branch from e6cd362 to 38393f3 Compare January 24, 2026 23:39


ci(e2e): always install playwright deps

b9ed885


lidel force-pushed the fix/e2e-test-improvements branch from 38393f3 to b9ed885 Compare January 24, 2026 23:56


test(e2e): add dedicated server script for static files

9b4cb09


test(e2e): fix CI error masking and show server output

7fd1b42


test(e2e): bypass npx to run playwright directly

1a870e5


test(e2e): try starting webui server manually in CI

973e4ff


test(e2e): add diagnostic to isolate globalSetup hang

fa9c3ab


test(e2e): disable webServer on CI, remove pipe that may cause hang

add3197


test(e2e): add strace and browser launch diagnostics

1e48a0b


test(e2e): add verbose strace with periodic snapshots, single worker

4148dd1


test(e2e): strace all syscalls from npm run test:e2e

d3f2406


lidel added 3 commits January 26, 2026 02:15

ci(e2e): remove strace and use global NODE_VERSION env

2da893c


test(e2e): simplify logging in setup files

8a1ba8a


lidel force-pushed the fix/e2e-test-improvements branch from 37ee0ab to 1e6740d Compare January 26, 2026 04:40


lidel marked this pull request as ready for review January 26, 2026 05:39

lidel requested a review from a team as a code owner January 26, 2026 05:39
