Add retry and caching to our GitHub actions #3855
Open
samwgoldman wants to merge 36 commits into
@samwgoldman has imported this pull request. If you are a Meta employee, you can view this in D108927688.
Add a repo-local composite action for running shell commands with three attempts and exponential backoff. Keeping this local avoids another marketplace dependency while giving workflows a consistent way to handle transient network failures from package registries, release asset uploads, and other external services. Store the caller command in an environment variable before writing it to a temporary script, so command text cannot accidentally terminate a heredoc in the action implementation.
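The retry behavior described above can be sketched in plain POSIX shell. This is a minimal illustration, not the action's actual implementation: the function name `retry_command` and the variables `RETRY_COMMAND`, `RETRY_MAX_ATTEMPTS`, and `RETRY_BASE_DELAY` are hypothetical, and the 5-second base delay is an assumption.

```shell
# Hypothetical sketch: run the command text held in $RETRY_COMMAND up to
# three times, doubling the delay between attempts.
retry_command() {
  max_attempts="${RETRY_MAX_ATTEMPTS:-3}"   # assumed default: three attempts
  delay="${RETRY_BASE_DELAY:-5}"            # assumed base delay in seconds
  attempt=1

  # The command text arrives via the environment and is written with printf,
  # so no heredoc is involved and the text cannot terminate one early.
  script="$(mktemp)"
  printf '%s\n' "$RETRY_COMMAND" > "$script"

  while :; do
    status=0
    sh "$script" || status=$?
    # Stop on success, or once the final attempt has failed.
    if [ "$status" -eq 0 ] || [ "$attempt" -ge "$max_attempts" ]; then
      break
    fi
    echo "attempt $attempt failed (exit $status); retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done

  rm -f "$script"
  # A deterministic failure still surfaces with its original exit code.
  return "$status"
}
```

A caller would set `RETRY_COMMAND` to the command text and then invoke `retry_command`; the exit code of the last attempt is what the step reports.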
The Scrut install downloads from crates.io and can fail for transient Cargo registry or network issues. Retrying the pinned install gives fresh runners a chance to recover without hiding deterministic failures after the final attempt.
Cargo dependency resolution and downloads have shown transient registry and Git network failures on recent main-branch runs. Fetch dependencies explicitly with retries so later build, clippy, and test steps are less likely to fail mid-build from the same temporary network problem.
The website dependency install reaches the Yarn/npm registry and can fail when network or registry responses are transiently unavailable. Use the shared retry action around the frozen-lockfile install so lockfile-invalidated runs can recover from short registry outages.
Installing wasm-pack and wasm-opt downloads from crates.io, which is one of the network surfaces that has failed transiently in recent CI. Pin the tool versions and retry the installs so cache misses still have a reliable path to rebuild the tools.
The website wasm build needs the wasm32 target and Cargo dependencies. Fetching them before the build with retries isolates transient rustup, registry, or Git network failures from the later build step and gives invalidated caches a chance to refill successfully.
Ubuntu container setup depends on apt mirrors and can fail for temporary mirror or network issues. Retrying the apt update/install step makes the ARM Linux extension builds less sensitive to short-lived package repository failures.
The extension package step runs npm ci against the npm registry. Recent CI has seen registry/network flakes, so retry the install to let transient failures recover while keeping the lockfile-controlled dependency set unchanged.
Extension test jobs also run npm ci before compiling and testing. Retrying this registry-backed step reduces failures from temporary npm/network issues, especially on cache misses or newly provisioned runners.
Primer comparison installs pyright and mypy from Python package indexes. Retrying this install protects scheduled primer runs from transient package index or network failures without changing the selected package versions.
Primer comparison builds Pyrefly after resolving Cargo dependencies. Adding a retried fetch step lets temporary Cargo registry or Git fetch failures recover before the expensive primer shard work starts.
The optional issue-ranking primer job installs pyright and mypy from external package indexes. Retrying the install makes manual runs more resilient to transient package index and network failures.
Issue ranking's optional primer path builds Pyrefly, so it has the same Cargo registry and Git fetch exposure as primer comparison. Fetch with retries before building so temporary dependency download failures do not fail the whole run.
Snippet checking installs pyrefly, pyright, and mypy from package indexes. Retrying that install helps manual issue-ranking runs survive temporary PyPI or network failures while preserving the existing install command and diagnostics.
mypy_primer is installed from GitHub via pip, which depends on both package index and Git network access. Retrying the dependency install reduces PR job failures from short-lived network problems before the expensive shard work begins.
Each mypy_primer shard builds both the PR commit and base commit. Fetching their Cargo registry or Git dependencies can fail on transient network errors, so retry cargo fetch for each checkout before building. The cargo build steps remain single-shot so compiler errors are reported without repeated backoff.
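The split between a retried fetch and a single-shot build might look roughly like the sketch below. The helper names `retry3` and `fetch_then_build` and the `RETRY_BASE_DELAY` variable are hypothetical, and the workflow itself uses the local retry action rather than an inline function.

```shell
# Hypothetical helper: run a command up to three times with doubling delay.
retry3() {
  attempt=1
  delay="${RETRY_BASE_DELAY:-5}"   # assumed base delay in seconds
  until "$@"; do
    if [ "$attempt" -ge 3 ]; then
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# $1 is a checkout directory (for example the PR commit or the base commit).
fetch_then_build() {
  (
    cd "$1" || exit 1
    # Network-bound step: retried, since failures here are often transient.
    retry3 cargo fetch || exit 1
    # Compiler-bound step: run once, so real build errors are reported
    # immediately instead of after repeated backoff.
    cargo build
  )
}
```

Calling `fetch_then_build` once per checkout keeps the retry budget on the download step only.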
The musllinux wheel smoke test installs python3 from Alpine repositories inside Docker. Retrying apk add protects the test from temporary Alpine mirror or network failures without changing the wheel being tested.
Release container setup depends on apt mirrors and the GitHub CLI package repository. Retrying the setup step makes release binary builds more robust to temporary package mirror, keyring download, or network failures.
Uploading release assets calls the GitHub Releases API and can fail for transient service or network issues. Retrying the Unix upload step improves release reliability while keeping --clobber behavior unchanged.
Windows release asset uploads use the same GitHub Releases API as the Unix jobs and can hit transient network or service errors. Retry the upload with the same artifact names and --clobber semantics.
Cache the Yarn package-manager cache used by the website install. The key is derived from website/yarn.lock through setup-node, so dependency changes invalidate the cache while unchanged lockfiles avoid repeated registry downloads.
Cache the installed wasm-pack and wasm-opt binaries so website runs do not rebuild tool dependencies on every cacheable run. The cache key includes OS, runner architecture, and explicit tool versions, which are the inputs that determine whether the cached binaries are compatible.
Cache Cargo registry and Git dependency data for the website wasm build, and keep rust-cache's default CARGO_HOME/bin caching for installed Cargo CLI entries. The size-sensitive part is cache-targets: false: that avoids caching Pyrefly or wasm target outputs while still reducing dependency and tool download work on warm runs. rust-cache includes the Rust environment and Cargo lockfiles in its key, so dependency or toolchain changes invalidate the cache. wasm-pack and wasm-opt also have a separate OS/architecture/version-keyed cache.
The extension build runs npm ci in lsp, so setup-node should key the npm cache from lsp/package-lock.json rather than relying on repository-root defaults. That invalidates the cache when extension dependencies change and avoids stale or unrelated root lockfile cache entries.
Extension test jobs also install dependencies from lsp/package-lock.json. Using that lockfile as the setup-node cache dependency path makes cache invalidation match the dependency set that npm ci actually installs.
Cache the VS Code test binary directories used by @vscode/test-electron so test jobs do not repeatedly download the large editor archive. The key includes OS, extension platform/arch, and the extension lockfile because those determine which test dependencies and which editor binary are expected.
Wheel builds download the same Cargo registry and Git dependencies across a large maturin matrix. Cache that Cargo dependency data and the default CARGO_HOME/bin contents, but keep cache-targets: false so platform-specific wheel build outputs are not stored in GitHub's cache. rust-cache keys include the job family plus the Rust environment and Cargo lockfiles, which is the right invalidation boundary for shared dependency downloads without creating huge target-dir caches for every wheel target.
Primer comparison builds Pyrefly in each shard and repeatedly needs Cargo registry and Git dependency data. Cache those downloads and CARGO_HOME/bin while leaving target directories uncached, which keeps cache size bounded for scheduled runs. rust-cache's Rust environment and lockfile keying invalidates this cache when Cargo inputs or the toolchain change, without tying it to Pyrefly build artifacts.
The optional issue-ranking primer shards build Pyrefly and benefit from the same Cargo dependency cache as primer comparison. Cache registry/Git downloads and CARGO_HOME/bin, but not target outputs, so manual runs avoid repeated network work without storing bulky build artifacts. The rust-cache environment hash covers the Cargo lockfiles and toolchain, so dependency changes invalidate the cache naturally.
mypy_primer PR jobs build Pyrefly for both new and base commits across many shards. Cache Cargo registry/Git data and CARGO_HOME/bin for the checked-out workspace, but keep target outputs disabled so the cache does not balloon with per-shard build artifacts. Using the pyrefly_to_test workspace lets rust-cache compute invalidation from that checkout's Cargo lockfiles and Rust environment, which is the dependency boundary these builds need.
The cached wasm-pack and wasm-opt binaries are architecture-specific. Include runner.arch in the cache key so x64 and arm64 runners cannot restore incompatible binaries while still sharing cache entries across identical OS/architecture/version combinations.
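To illustrate why the architecture belongs in the key, here is a small sketch that composes a key from the same inputs. The function name `wasm_tools_cache_key` and the key layout are hypothetical; `RUNNER_OS` and `RUNNER_ARCH` are the environment variables GitHub-hosted runners set, with `uname` as an assumed fallback for local runs.

```shell
# Hypothetical sketch: compose a cache key from OS, architecture, and the
# pinned tool versions. $1 = wasm-pack version, $2 = wasm-opt version.
wasm_tools_cache_key() {
  printf 'wasm-tools-%s-%s-wasm-pack-%s-wasm-opt-%s\n' \
    "${RUNNER_OS:-$(uname -s)}" \
    "${RUNNER_ARCH:-$(uname -m)}" \
    "$1" "$2"
}
```

Two runners with the same OS and tool versions but different architectures produce different keys, so neither can restore the other's binaries, while identical combinations still share an entry.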
The VS Code test config does not pin an editor version, so the downloaded stable VS Code binary can change even when package-lock.json does not. Add a weekly key component and hash the test config so caches stay useful but periodically refresh as upstream VS Code stable advances.
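A key with a weekly component and content hashes could be built along these lines. This is only a sketch: the function names, the argument order, and the key layout are hypothetical, and in a workflow the hashing would more likely be done by the expression language than by a shell helper.

```shell
# Hash one file with whichever SHA-256 tool is available.
hash_file() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d ' ' -f 1
  else
    shasum -a 256 "$1" | cut -d ' ' -f 1
  fi
}

# Hypothetical sketch. $1 = extension platform label, $2 = test config path,
# $3 = lockfile path.
vscode_test_cache_key() {
  # ISO year and week number, e.g. 2025-W23: rolls over once a week, so the
  # cache refreshes even when neither hashed file changes.
  week="$(date -u +%G-W%V)"
  printf 'vscode-test-%s-%s-%s-%s-%s\n' \
    "${RUNNER_OS:-unknown}" "$1" "$week" \
    "$(hash_file "$2")" "$(hash_file "$3")"
}
```

Within a given week the key is stable as long as the config and lockfile are unchanged; editing either file, or crossing into the next ISO week, yields a new key.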
Scrut is installed with cargo install on every non-Windows pyrefly run. That install can require registry downloads and local build work, so cache the installed scrut binary plus Cargo's install metadata under a key that includes OS, runner architecture, and the pinned Scrut version. The versioned key gives precise invalidation when the workflow moves to a different Scrut release, while leaving Pyrefly target outputs untouched. The install step still runs on cache misses and is retried for transient crates.io or network failures.
The repository already has pyrefly_wasm/rust-toolchain as a symlink to ../pyrefly/rust-toolchain, and the pyrefly package metadata includes a rust-toolchain file in source distributions. The target file was missing, leaving the symlink broken. Swatinem/rust-cache scans rust-toolchain files when constructing its environment hash, so the broken symlink made cache setup emit ENOENT in the branch workflow runs and prevented clean cache restore/save behavior. Add the missing stable toolchain file so the symlink resolves and the cache key can include the toolchain input normally.
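A dangling symlink like this one can be spotted locally before it reaches CI. The helper below is a hypothetical sketch (the name `find_broken_symlinks` is not part of the repository); it lists every symlink under a directory whose target does not exist.

```shell
# Hypothetical sketch: print each symlink under $1 that does not resolve.
find_broken_symlinks() {
  find "$1" -type l | while IFS= read -r link; do
    # -e follows the link, so it is false when the target is missing.
    [ -e "$link" ] || printf '%s\n' "$link"
  done
}
```

Run against the repository root, this would have reported `pyrefly_wasm/rust-toolchain` while the target file was missing, and nothing once the file exists.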
The extension workflows build Pyrefly before packaging or testing the VS Code extension. A branch run hit a transient crates.io connection reset inside the actions-rust-cross cargo build step while fetching serde, which bypassed the retry wrapper because the fetch happened inside the third-party build action. Fetch the Pyrefly Cargo dependencies explicitly through the local retry-command action before either the cross-compile action or native cargo build runs. For cross targets, include the matrix target in cargo fetch so target-specific dependencies are covered. This keeps the existing Rust cache useful on warm runs and gives cold or invalidated caches retry coverage before the expensive build step.
According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅
Summary
I noticed that we've had a bunch of actions fail recently due to transient network issues, where a step failed on one run, but succeeded on nearby runs both before and after.
This PR adds retry logic around these network requests, so we are more likely to recover from a transient failure. The retry logic is generic and reused throughout.
I also added caching, which in many cases can avoid the network requests altogether. Each cache has conservatively defined invalidation.
This PR is broken into small commits which can be reviewed one-by-one for clarity.
Test Plan
The modified actions ran on the ci-retry-cache-stack branch. I inspected the action output and confirmed that the retry and caching code paths were exercised.