Add retry and caching to our GitHub actions #3855

Open
samwgoldman wants to merge 36 commits into main from ci-retry-cache-stack

@samwgoldman

Member

Summary

I noticed that we've had a bunch of actions fail recently due to transient network issues, where a step failed on one run, but succeeded on nearby runs both before and after.

This PR adds retry logic around these network requests, so we are more likely to recover from a transient failure. The retry logic is generic and reused throughout.

I also added caching, which in many cases can avoid the network requests altogether. Each cache has a conservatively defined invalidation key.

This PR is broken into small commits which can be reviewed one-by-one for clarity.

Test Plan

The modified actions ran on the ci-retry-cache-stack branch. I inspected the action output and confirmed that both the retry and caching code paths were exercised.

@meta-codesync

meta-codesync Bot commented Jun 18, 2026

Contributor

@samwgoldman has imported this pull request. If you are a Meta employee, you can view this in D108927688.

@samwgoldman samwgoldman force-pushed the ci-retry-cache-stack branch from f70703c to ba19d64 Compare June 18, 2026 18:57
@github-actions github-actions Bot added size/xl and removed size/xl labels Jun 18, 2026
@samwgoldman samwgoldman force-pushed the ci-retry-cache-stack branch from ba19d64 to 2569940 Compare June 18, 2026 19:09
@github-actions github-actions Bot added size/xl and removed size/xl labels Jun 18, 2026
@samwgoldman samwgoldman force-pushed the ci-retry-cache-stack branch from 2569940 to 809d999 Compare June 18, 2026 19:26
@github-actions github-actions Bot added size/xl and removed size/xl labels Jun 18, 2026
@samwgoldman samwgoldman force-pushed the ci-retry-cache-stack branch from 809d999 to 16c6ef4 Compare June 18, 2026 19:57
@github-actions github-actions Bot added size/xl and removed size/xl labels Jun 18, 2026

Add a repo-local composite action for running shell commands with three attempts and exponential backoff. Keeping this local avoids another marketplace dependency while giving workflows a consistent way to handle transient network failures from package registries, release asset uploads, and other external services.

Store the caller command in an environment variable before writing it to a temporary script, so command text cannot accidentally terminate a heredoc in the action implementation.
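
As a rough illustration of the behavior described above, the following shell sketch retries a command up to three times with a doubling delay. It is not the action's actual implementation: the function name `retry_command`, the `RETRY_COMMAND`, `RETRY_MAX_ATTEMPTS`, and `RETRY_INITIAL_DELAY` variables, and the 5-second initial delay are all assumptions made for the example. The command text is read from an environment variable and written to a temporary script with `printf`, so no heredoc is involved and the text cannot terminate one.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a retry wrapper; names and defaults are assumptions.

# Reads the command text from RETRY_COMMAND, writes it to a temporary script,
# and runs it up to RETRY_MAX_ATTEMPTS times, doubling the delay each time.
retry_command() {
  local max_attempts="${RETRY_MAX_ATTEMPTS:-3}"
  local delay="${RETRY_INITIAL_DELAY:-5}"
  local attempt=1
  local status=0
  local script
  script="$(mktemp)" || return 1
  # printf copies the command text verbatim; no heredoc delimiter is involved.
  printf '%s\n' "$RETRY_COMMAND" > "$script"

  while true; do
    status=0
    bash "$script" || status=$?
    if [ "$status" -eq 0 ]; then
      break
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "attempt $attempt/$max_attempts failed with status $status; giving up" >&2
      break
    fi
    echo "attempt $attempt/$max_attempts failed with status $status; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done

  rm -f "$script"
  # Propagate the last exit status so deterministic failures stay visible.
  return "$status"
}
```

A caller would export `RETRY_COMMAND` (for example `cargo fetch`) and then invoke `retry_command`. A failure that persists through the final attempt still surfaces, with the command's own exit status.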

The Scrut install downloads from crates.io and can fail for transient Cargo registry or network issues. Retrying the pinned install gives fresh runners a chance to recover without hiding deterministic failures after the final attempt.

Cargo dependency resolution and downloads have shown transient registry and Git network failures on recent main-branch runs. Fetch dependencies explicitly with retries so later build, clippy, and test steps are less likely to fail mid-build from the same temporary network problem.

The website dependency install reaches the Yarn/npm registry and can fail when network or registry responses are transiently unavailable. Use the shared retry action around the frozen-lockfile install so lockfile-invalidated runs can recover from short registry outages.

Installing wasm-pack and wasm-opt downloads from crates.io, which is one of the network surfaces that has failed transiently in recent CI. Pin the tool versions and retry the installs so cache misses still have a reliable path to rebuild the tools.

The website wasm build needs the wasm32 target and Cargo dependencies. Fetching them before the build with retries isolates transient rustup, registry, or Git network failures from the later build step and gives invalidated caches a chance to refill successfully.

Ubuntu container setup depends on apt mirrors and can fail for temporary mirror or network issues. Retrying the apt update/install step makes those arm Linux extension builds less sensitive to short-lived package repository failures.

The extension package step runs npm ci against the npm registry. Recent CI has seen registry/network flakes, so retry the install to let transient failures recover while keeping the lockfile-controlled dependency set unchanged.

Extension test jobs also run npm ci before compiling and testing. Retrying this registry-backed step reduces failures from temporary npm/network issues, especially on cache misses or newly provisioned runners.

Primer comparison installs pyright and mypy from Python package indexes. Retrying this install protects scheduled primer runs from transient package index or network failures without changing the selected package versions.

Primer comparison builds Pyrefly after resolving Cargo dependencies. Adding a retried fetch step lets temporary Cargo registry or Git fetch failures recover before the expensive primer shard work starts.

The optional issue-ranking primer job installs pyright and mypy from external package indexes. Retrying the install makes manual runs more resilient to transient package index and network failures.

Issue ranking's optional primer path builds Pyrefly, so it has the same Cargo registry and Git fetch exposure as primer comparison. Fetch with retries before building so temporary dependency download failures do not fail the whole run.

Snippet checking installs pyrefly, pyright, and mypy from package indexes. Retrying that install helps manual issue-ranking runs survive temporary PyPI or network failures while preserving the existing install command and diagnostics.

mypy_primer is installed from GitHub via pip, which depends on both package index and Git network access. Retrying the dependency install reduces PR job failures from short-lived network problems before the expensive shard work begins.

Each mypy_primer shard builds both the PR commit and base commit. Fetching their Cargo registry or Git dependencies can fail on transient network errors, so retry cargo fetch for each checkout before building.

The cargo build steps remain single-shot so compiler errors are reported without repeated backoff.

The musllinux wheel smoke test installs python3 from Alpine repositories inside Docker. Retrying apk add protects the test from temporary Alpine mirror or network failures without changing the wheel being tested.

Release container setup depends on apt mirrors and the GitHub CLI package repository. Retrying the setup step makes release binary builds more robust to temporary package mirror, keyring download, or network failures.

Uploading release assets calls the GitHub Releases API and can fail for transient service or network issues. Retrying the Unix upload step improves release reliability while keeping --clobber behavior unchanged.

Windows release asset uploads use the same GitHub Releases API as the Unix jobs and can hit transient network or service errors. Retry the upload with the same artifact names and --clobber semantics.

Cache the Yarn package-manager cache used by the website install. The key is derived from website/yarn.lock through setup-node, so dependency changes invalidate the cache while unchanged lockfiles avoid repeated registry downloads.

Cache the installed wasm-pack and wasm-opt binaries so website runs do not rebuild tool dependencies on every cacheable run. The cache key includes OS, runner architecture, and explicit tool versions, which are the inputs that determine whether the cached binaries are compatible.
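
The key shape described here could look like the following sketch. The function name, the `wasm-tools` prefix, and the `0.0.0` version numbers are placeholders for illustration, not values from this PR. `RUNNER_OS` and `RUNNER_ARCH` are environment variables that GitHub Actions runners set by default; fallbacks are included so the snippet also runs locally.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compose a cache key from the inputs that decide
# whether a cached tool binary is compatible with the current runner.
tool_cache_key() {
  local os="$1" arch="$2" wasm_pack_version="$3" wasm_opt_version="$4"
  printf 'wasm-tools-%s-%s-wasm-pack-%s-wasm-opt-%s\n' \
    "$os" "$arch" "$wasm_pack_version" "$wasm_opt_version"
}

# Placeholder versions; the real pinned versions live in the workflow.
tool_cache_key "${RUNNER_OS:-Linux}" "${RUNNER_ARCH:-X64}" "0.0.0" "0.0.0"
```

Because every input appears in the key, changing a pinned version or moving to a different architecture yields a different key, and the old entry is simply never restored.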

Cache Cargo registry and Git dependency data for the website wasm build, and keep rust-cache's default CARGO_HOME/bin caching for installed Cargo CLI entries. The size-sensitive part is cache-targets: false: that avoids caching Pyrefly or wasm target outputs while still reducing dependency and tool download work on warm runs.

rust-cache includes the Rust environment and Cargo lockfiles in its key, so dependency or toolchain changes invalidate the cache. wasm-pack and wasm-opt also have a separate OS/architecture/version-keyed cache.

The extension build runs npm ci in lsp, so setup-node should key the npm cache from lsp/package-lock.json rather than relying on repository-root defaults. That invalidates the cache when extension dependencies change and avoids stale or unrelated root lockfile cache entries.

Extension test jobs also install dependencies from lsp/package-lock.json. Using that lockfile as the setup-node cache dependency path makes cache invalidation match the dependency set that npm ci actually installs.

Cache the VS Code test binary directories used by @vscode/test-electron so test jobs do not repeatedly download the large editor archive. The key includes OS, extension platform/arch, and the extension lockfile because those determine which test dependency and binary shape is expected.

Wheel builds download the same Cargo registry and Git dependencies across a large maturin matrix. Cache that Cargo dependency data and the default CARGO_HOME/bin contents, but keep cache-targets: false so platform-specific wheel build outputs are not stored in GitHub's cache.

rust-cache keys include the job family plus the Rust environment and Cargo lockfiles, which is the right invalidation boundary for shared dependency downloads without creating huge target-dir caches for every wheel target.

Primer comparison builds Pyrefly in each shard and repeatedly needs Cargo registry and Git dependency data. Cache those downloads and CARGO_HOME/bin while leaving target directories uncached, which keeps cache size bounded for scheduled runs.

rust-cache's Rust environment and lockfile keying invalidates this cache when Cargo inputs or the toolchain change, without tying it to Pyrefly build artifacts.

The optional issue-ranking primer shards build Pyrefly and benefit from the same Cargo dependency cache as primer comparison. Cache registry/Git downloads and CARGO_HOME/bin, but not target outputs, so manual runs avoid repeated network work without storing bulky build artifacts.

The rust-cache environment hash covers the Cargo lockfiles and toolchain, so dependency changes invalidate the cache naturally.

mypy_primer PR jobs build Pyrefly for both new and base commits across many shards. Cache Cargo registry/Git data and CARGO_HOME/bin for the checked-out workspace, but keep target outputs disabled so the cache does not balloon with per-shard build artifacts.

Using the pyrefly_to_test workspace lets rust-cache compute invalidation from that checkout's Cargo lockfiles and Rust environment, which is the dependency boundary these builds need.

The cached wasm-pack and wasm-opt binaries are architecture-specific. Include runner.arch in the cache key so x64 and arm64 runners cannot restore incompatible binaries while still sharing cache entries across identical OS/architecture/version combinations.

The VS Code test config does not pin an editor version, so the downloaded stable VS Code binary can change even when package-lock.json does not. Add a weekly key component and hash the test config so caches stay useful but periodically refresh as upstream VS Code stable advances.
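
One way to picture the weekly component is the shell sketch below. In a workflow the hashing would more likely use the `hashFiles` expression; this version uses `sha256sum` (or `shasum` where that is unavailable) only so it can run anywhere. The function names, the `vscode-test` prefix, the argument order, and the `CACHE_WEEK` override are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a cache key with a weekly rotation component.

# Print the SHA-256 hex digest of a file.
file_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d' ' -f1
  else
    shasum -a 256 "$1" | cut -d' ' -f1
  fi
}

# Build a key from OS, extension platform, ISO week, lockfile, and test config.
vscode_test_cache_key() {
  local os="$1" platform="$2" lockfile="$3" test_config="$4"
  local week
  # ISO year and week number in the form YYYY-Www; CACHE_WEEK overrides it.
  week="${CACHE_WEEK:-$(date -u +%G-W%V)}"
  printf 'vscode-test-%s-%s-%s-%s-%s\n' \
    "$os" "$platform" "$week" \
    "$(file_sha256 "$lockfile")" "$(file_sha256 "$test_config")"
}
```

Within a given week the key is stable as long as the lockfile and test config are unchanged, so the cache is reused. When the week rolls over, the key changes and the next run downloads whatever VS Code stable is current.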

Scrut is installed with cargo install on every non-Windows pyrefly run. That install can require registry downloads and local build work, so cache the installed scrut binary plus Cargo's install metadata under a key that includes OS, runner architecture, and the pinned Scrut version.

The versioned key gives precise invalidation when the workflow moves to a different Scrut release, while leaving Pyrefly target outputs untouched. The install step still runs on cache misses and is retried for transient crates.io or network failures.

The repository already has pyrefly_wasm/rust-toolchain as a symlink to ../pyrefly/rust-toolchain, and the pyrefly package metadata includes a rust-toolchain file in source distributions. The target file was missing, leaving the symlink broken.

Swatinem/rust-cache scans rust-toolchain files when constructing its environment hash, so the broken symlink made cache setup emit ENOENT in the branch workflow runs and prevented clean cache restore/save behavior. Add the missing stable toolchain file so the symlink resolves and the cache key can include the toolchain input normally.

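
A dangling symlink like this is easy to detect locally. The sketch below is a hypothetical helper, not part of this PR: it lists symlinks under a directory whose targets do not exist, relying on `test -e` following the link.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every symlink under a directory whose target
# is missing.
find_broken_symlinks() {
  local root="${1:-.}"
  # -type l selects symlinks; `test -e` follows the link and fails when the
  # target does not exist, so the negated -exec keeps only dangling links.
  find "$root" -type l ! -exec test -e {} \; -print
}
```

Run from the repository root before the fix, `find_broken_symlinks .` would be expected to include `./pyrefly_wasm/rust-toolchain` in its output, and to stop listing it once the target file exists.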
The extension workflows build Pyrefly before packaging or testing the VS Code extension. A branch run hit a transient crates.io connection reset inside the actions-rust-cross cargo build step while fetching serde, which bypassed the retry wrapper because the fetch happened inside the third-party build action.

Fetch the Pyrefly Cargo dependencies explicitly through the local retry-command action before either the cross-compile action or native cargo build runs. For cross targets, include the matrix target in cargo fetch so target-specific dependencies are covered. This keeps the existing Rust cache useful on warm runs and gives cold or invalidated caches retry coverage before the expensive build step.
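
The target-aware fetch could be assembled along these lines. This is a sketch under assumptions: the helper name is invented and the example triple is only illustrative. `cargo fetch` accepts `--target <triple>` to fetch for a specific target.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the cargo fetch command line, adding --target
# only when a cross-compilation target is supplied.
cargo_fetch_command() {
  local target="${1:-}"
  if [ -n "$target" ]; then
    printf 'cargo fetch --target %s\n' "$target"
  else
    printf 'cargo fetch\n'
  fi
}

# Cross build: the matrix target is passed through (example triple).
cargo_fetch_command "aarch64-unknown-linux-gnu"
# Native build: no target argument.
cargo_fetch_command
```

The printed string is what would be handed to the retry wrapper as its command, ahead of the build step, which itself stays single-shot.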
@samwgoldman samwgoldman force-pushed the ci-retry-cache-stack branch from 16c6ef4 to bd492e1 Compare June 18, 2026 23:09
@github-actions github-actions Bot added size/xl and removed size/xl labels Jun 18, 2026
@github-actions

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅
