Skip to content

Conversation

@lidel
Copy link
Member

@lidel lidel commented Jan 24, 2026

Seems that CI died in past 7 months and no longer passes at all.

This is WIP PR to figure out the cause, and fix + stabilize tests so we dont have to babysit them in the future.

lidel added 4 commits January 24, 2026 22:00
removed 10-shard matrix that was adding complexity and CI time overhead
without providing proportional benefits. tests now run in a single job
with a 10-minute timeout, which is sufficient given test suite completes
in ~15 seconds locally.

also simplified the two conditional test runs (repeated vs non-repeated)
into a single step that always uses --reporter=list for clearer output.
removed conditional logic that changed behavior based on process.env.CI.
this was causing inconsistency between local and CI test runs, making it
harder to reproduce CI failures locally.

now uses consistent settings: 30s timeout per test, 5-minute global
timeout for entire suite, no retries, and always starts fresh server.
added timestamped logging to global-setup.js and ipfs-backend.js to help
diagnose CI hangs when they occur. each major step now logs progress.

added timeout wrappers around async operations that could hang indefinitely:
- ipfs-backend startup: 60s timeout
- kubo daemon spawn: 30s timeout

also fixed two issues:
- disabled DHT bootstrapping (Bootstrap: []) for faster daemon startup
- changed addInitScript to page.evaluate so localStorage values are
  captured by storageState() before browser closes
files.test.js:
- changed file verification to only check the two files we uploaded
  instead of iterating all MFS files. other tests may have added files
  that would cause unexpected matches.

grid-view.test.js:
- added focusGrid() helper that tries multiple approaches to establish
  keyboard focus on the grid container. this fixes intermittent failures
  where arrow key navigation would not work because focus was not set.
- simplified test assertions to use playwright's built-in waiters
  instead of manual count checks.

grid.js helper:
- selectViewMode now waits for files view to be ready before checking
  current mode, preventing race conditions during page load.
lidel added 3 commits January 24, 2026 23:15
the global teardown only removed the JSON config file but never called
ipfsd.stop() on the spawned Kubo daemon. this left orphaned processes
accumulating on CI runners, causing port conflicts and resource exhaustion.

- export stop() function from ipfs-backend.js
- call stop() in global-teardown.js before removing config file
- add logging for teardown progress
trace: 'on-first-retry' was ineffective because retries=0, meaning
traces would never be captured. changed to 'retain-on-failure' so
traces are available when debugging test failures.
migrate from deprecated waitForSelector() pattern to modern locator API
which provides better error messages and auto-waiting behavior.

changes include:
- replace page.waitForSelector() with page.locator().waitFor()
- remove force:true clicks, use proper waits instead
- fix missing await on click operations (files.test.js, ipns.test.js)
- replace custom checkClassWithTimeout polling with waitForFunction
- use .first() where multiple elements match to satisfy strict mode
- use more specific selectors (button#id, [role="menuitem"]) to avoid
  ambiguous matches
the `promise/param-names` rule requires Promise constructor parameters
to match `^_?resolve$` and `^_?reject$` patterns. changed `_` to
`_resolve` in the timeout wrapper functions.
always run `playwright install --with-deps` regardless of cache status
to ensure system dependencies are present. previously this was only run
on cache miss, which could cause failures if deps were missing.
@lidel lidel force-pushed the fix/e2e-test-improvements branch from 38393f3 to b9ed885 Compare January 24, 2026 23:56
tests pass locally (~17s) but hang on CI until 10-minute timeout.
added comprehensive timestamped logging at every async operation:

- global-setup.js: log each step (port check, daemon spawn, browser launch, navigation)
- ipfs-backend.js: log kubo lifecycle (factory, spawn, identity, config write)
- global-teardown.js: log cleanup operations
- test-e2e.yml: add shell timestamps, enable DEBUG=pw:api for Playwright logging

logs output to both stdout and stderr to ensure CI captures output.
timeout wrappers now warn at 80% of timeout before failing.

after CI run, last log message before timeout identifies the hanging operation.
address CI hang by:
- make webui port configurable via WEBUI_PORT env var
- use dynamic port allocation in CI workflow
- increase webServer timeout from 5s to 30s for CI
- add stdout/stderr piping to capture webServer output
- add build directory check before running tests
- add config logging to track port and cwd
- use 127.0.0.1 instead of localhost for consistency

the CI was hanging with no output because Playwright initialization
was blocking before globalSetup even ran. these changes will help
identify exactly where the hang occurs.
previous CI runs showed 8-minute hangs with zero output from Playwright.
this indicates cross-env or npm is buffering stdout.

changes:
- run playwright directly in CI instead of via npm script
- add playwright version check before running tests
- use fs.writeSync for config logging to bypass Node.js buffering
- log environment variables to verify they're passed correctly
trying to isolate the CI hang:
- removed DEBUG=pw:api which might cause infinite output buffering
- removed NODE_OPTIONS which might affect behavior
- added direct node test to load playwright.config.js independently
- added timeout wrapper around playwright test command
- added chromium installation dry-run check

if config load test fails, the issue is in config/node setup.
if config loads but playwright test hangs, issue is in playwright runner.
the CI was hanging because npx http-server was not starting.
replaced it with a simple inline node http server that:
- serves files from ./build directory
- handles common MIME types
- starts immediately without npx overhead

also fixed eslint single-quote error.
replace inline node command with dedicated serve-build.js script
to avoid issues with shell escaping and ES module requirements
- remove timeout wrapper that was hiding failures (exit code 124)
- remove || echo that swallowed error codes
- change webServer stdout/stderr from pipe to inherit for visibility
- clean up unnecessary diagnostic steps
hypothesis: Playwright's webServer spawn may be hanging on CI
test: start server manually with visible output before running playwright
strace analysis showed the CI hang occurs after Playwright's ESM config
transpilation: a worker thread signals completion on the wrong futex
address, leaving the main thread waiting forever.

last successful CI (Nov 6, 2025) used Node 24.11.0, failing runs use
Node 24.13.0. testing if pinning to this version avoids the issue.

this is a hypothesis, not a confirmed fix.
lidel added 3 commits January 26, 2026 02:15
- remove strace debugging from test command
- define NODE_VERSION as workflow-level env var (24.11.0)
- document why Node.js is pinned (ESM loader hang in 24.13.0)
- simplify node_modules cache key to use only package-lock.json
- remove github.sha from cache keys (defeats cache reuse across commits)
- add restore-keys fallback patterns for all caches
- remove redundant npm ci (node_modules cache handles it)
- remove verbose step-by-step debug logging
- keep start/end markers and error messages
- keep withTimeout wrappers (prevent silent hangs)
- keep stop() export for proper daemon cleanup
- remove serve-build.js, restore npx http-server
- remove withTimeout() wrappers (not needed with Node.js fix)
- keep stop() export and proper daemon shutdown
- fix grid.js to use 127.0.0.1 (matches playwright config)
- keep: locator API, Bootstrap:[], cache optimization, Node pin
- add data-testid attributes to File, FilesList, FilesGrid, GridFile
- create fixtures.js with worker-scoped peerNode fixture for speed
- create locators.js with centralized selector definitions
- replace brittle CSS selectors with getByRole/getByTestId
- replace waitFor() calls with web-first assertions (toBeVisible, toHaveClass)
- update coverage.js to re-export from fixtures.js for backward compat
- modernize all test files to use shared locators and fixtures
- add test/e2e/test/ to gitignore (artifact from running tests)
@lidel lidel force-pushed the fix/e2e-test-improvements branch from 37ee0ab to 1e6740d Compare January 26, 2026 04:40
…ions

- add .tool-versions for nodejs 24.11.0 and golang 1.25
- upgrade actions/checkout v4 to v6
- upgrade actions/cache v4 to v5
- upgrade actions/setup-node v4 to v6
- upgrade actions/upload-artifact v4 to v6
- upgrade actions/download-artifact v4 to v7
- upgrade actions/setup-go v5 to v6
- switch setup-node and setup-go to use go-version-file/node-version-file
- remove NODE_VERSION env and node-version inputs from reusable workflows
- update README to point to .tool-versions for version info
@lidel lidel marked this pull request as ready for review January 26, 2026 05:39
@lidel lidel requested a review from a team as a code owner January 26, 2026 05:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants