-
Notifications
You must be signed in to change notification settings - Fork 528
fix(ci): test stability improvements #2466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
lidel
wants to merge
30
commits into
main
Choose a base branch
from
fix/e2e-test-improvements
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
removed 10-shard matrix that was adding complexity and CI time overhead without providing proportional benefits. tests now run in a single job with a 10-minute timeout, which is sufficient given test suite completes in ~15 seconds locally. also simplified the two conditional test runs (repeated vs non-repeated) into a single step that always uses --reporter=list for clearer output.
removed conditional logic that changed behavior based on process.env.CI. this was causing inconsistency between local and CI test runs, making it harder to reproduce CI failures locally. now uses consistent settings: 30s timeout per test, 5-minute global timeout for entire suite, no retries, and always starts fresh server.
added timestamped logging to global-setup.js and ipfs-backend.js to help diagnose CI hangs when they occur. each major step now logs progress. added timeout wrappers around async operations that could hang indefinitely: - ipfs-backend startup: 60s timeout - kubo daemon spawn: 30s timeout also fixed two issues: - disabled DHT bootstrapping (Bootstrap: []) for faster daemon startup - changed addInitScript to page.evaluate so localStorage values are captured by storageState() before browser closes
files.test.js: - changed file verification to only check the two files we uploaded instead of iterating all MFS files. other tests may have added files that would cause unexpected matches. grid-view.test.js: - added focusGrid() helper that tries multiple approaches to establish keyboard focus on the grid container. this fixes intermittent failures where arrow key navigation would not work because focus was not set. - simplified test assertions to use playwright's built-in waiters instead of manual count checks. grid.js helper: - selectViewMode now waits for files view to be ready before checking current mode, preventing race conditions during page load.
7 tasks
the global teardown only removed the JSON config file but never called ipfsd.stop() on the spawned Kubo daemon. this left orphaned processes accumulating on CI runners, causing port conflicts and resource exhaustion. - export stop() function from ipfs-backend.js - call stop() in global-teardown.js before removing config file - add logging for teardown progress
trace: 'on-first-retry' was ineffective because retries=0, meaning traces would never be captured. changed to 'retain-on-failure' so traces are available when debugging test failures.
migrate from deprecated waitForSelector() pattern to modern locator API which provides better error messages and auto-waiting behavior. changes include: - replace page.waitForSelector() with page.locator().waitFor() - remove force:true clicks, use proper waits instead - fix missing await on click operations (files.test.js, ipns.test.js) - replace custom checkClassWithTimeout polling with waitForFunction - use .first() where multiple elements match to satisfy strict mode - use more specific selectors (button#id, [role="menuitem"]) to avoid ambiguous matches
the `promise/param-names` rule requires Promise constructor parameters to match `^_?resolve$` and `^_?reject$` patterns. changed `_` to `_resolve` in the timeout wrapper functions.
e6cd362 to
38393f3
Compare
always run `playwright install --with-deps` regardless of cache status to ensure system dependencies are present. previously this was only run on cache miss, which could cause failures if deps were missing.
38393f3 to
b9ed885
Compare
tests pass locally (~17s) but hang on CI until 10-minute timeout. added comprehensive timestamped logging at every async operation: - global-setup.js: log each step (port check, daemon spawn, browser launch, navigation) - ipfs-backend.js: log kubo lifecycle (factory, spawn, identity, config write) - global-teardown.js: log cleanup operations - test-e2e.yml: add shell timestamps, enable DEBUG=pw:api for Playwright logging logs output to both stdout and stderr to ensure CI captures output. timeout wrappers now warn at 80% of timeout before failing. after CI run, last log message before timeout identifies the hanging operation.
address CI hang by: - make webui port configurable via WEBUI_PORT env var - use dynamic port allocation in CI workflow - increase webServer timeout from 5s to 30s for CI - add stdout/stderr piping to capture webServer output - add build directory check before running tests - add config logging to track port and cwd - use 127.0.0.1 instead of localhost for consistency the CI was hanging with no output because Playwright initialization was blocking before globalSetup even ran. these changes will help identify exactly where the hang occurs.
previous CI runs showed 8-minute hangs with zero output from Playwright. this indicates cross-env or npm is buffering stdout. changes: - run playwright directly in CI instead of via npm script - add playwright version check before running tests - use fs.writeSync for config logging to bypass Node.js buffering - log environment variables to verify they're passed correctly
trying to isolate the CI hang: - removed DEBUG=pw:api which might cause infinite output buffering - removed NODE_OPTIONS which might affect behavior - added direct node test to load playwright.config.js independently - added timeout wrapper around playwright test command - added chromium installation dry-run check if config load test fails, the issue is in config/node setup. if config loads but playwright test hangs, issue is in playwright runner.
the CI was hanging because npx http-server was not starting. replaced it with a simple inline node http server that: - serves files from ./build directory - handles common MIME types - starts immediately without npx overhead also fixed eslint single-quote error.
replace inline node command with dedicated serve-build.js script to avoid issues with shell escaping and ES module requirements
- remove timeout wrapper that was hiding failures (exit code 124) - remove || echo that swallowed error codes - change webServer stdout/stderr from pipe to inherit for visibility - clean up unnecessary diagnostic steps
hypothesis: Playwright's webServer spawn may be hanging on CI test: start server manually with visible output before running playwright
strace analysis showed the CI hang occurs after Playwright's ESM config transpilation: a worker thread signals completion on the wrong futex address, leaving the main thread waiting forever. last successful CI (Nov 6, 2025) used Node 24.11.0, failing runs use Node 24.13.0. testing if pinning to this version avoids the issue. this is a hypothesis, not a confirmed fix.
- remove strace debugging from test command - define NODE_VERSION as workflow-level env var (24.11.0) - document why Node.js is pinned (ESM loader hang in 24.13.0)
- simplify node_modules cache key to use only package-lock.json - remove github.sha from cache keys (defeats cache reuse across commits) - add restore-keys fallback patterns for all caches - remove redundant npm ci (node_modules cache handles it)
- remove verbose step-by-step debug logging - keep start/end markers and error messages - keep withTimeout wrappers (prevent silent hangs) - keep stop() export for proper daemon cleanup
- remove serve-build.js, restore npx http-server - remove withTimeout() wrappers (not needed with Node.js fix) - keep stop() export and proper daemon shutdown - fix grid.js to use 127.0.0.1 (matches playwright config) - keep: locator API, Bootstrap:[], cache optimization, Node pin
- add data-testid attributes to File, FilesList, FilesGrid, GridFile - create fixtures.js with worker-scoped peerNode fixture for speed - create locators.js with centralized selector definitions - replace brittle CSS selectors with getByRole/getByTestId - replace waitFor() calls with web-first assertions (toBeVisible, toHaveClass) - update coverage.js to re-export from fixtures.js for backward compat - modernize all test files to use shared locators and fixtures - add test/e2e/test/ to gitignore (artifact from running tests)
37ee0ab to
1e6740d
Compare
…ions - add .tool-versions for nodejs 24.11.0 and golang 1.25 - upgrade actions/checkout v4 to v6 - upgrade actions/cache v4 to v5 - upgrade actions/setup-node v4 to v6 - upgrade actions/upload-artifact v4 to v6 - upgrade actions/download-artifact v4 to v7 - upgrade actions/setup-go v5 to v6 - switch setup-node and setup-go to use go-version-file/node-version-file - remove NODE_VERSION env and node-version inputs from reusable workflows - update README to point to .tool-versions for version info
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Seems that CI died in past 7 months and no longer passes at all.
This is WIP PR to figure out the cause, and fix + stabilize tests so we dont have to babysit them in the future.