Add retry and caching to our GitHub actions by samwgoldman · Pull Request #3855 · facebook/pyrefly

samwgoldman · 2026-06-17T20:19:46Z

Summary

I noticed that several of our actions have failed recently due to transient network issues: a step failed on one run but succeeded on nearby runs both before and after.

This PR adds retry logic around these network requests, so we are more likely to recover from a transient failure. The retry logic is generic and reused throughout.

I also added caching, which in many cases can avoid the network requests altogether. Each cache has conservatively defined invalidation.

This PR is broken into small commits that can be reviewed one by one.

Test Plan

The modified actions ran on the ci-retry-cache-stack branch. I inspected the action output and confirmed that the retry and caching code paths were hit.

meta-codesync · 2026-06-18T18:10:42Z

@samwgoldman has imported this pull request. If you are a Meta employee, you can view this in D108927688.

Add a repo-local composite action for running shell commands with three attempts and exponential backoff. Keeping this local avoids another marketplace dependency while giving workflows a consistent way to handle transient network failures from package registries, release asset uploads, and other external services. Store the caller command in an environment variable before writing it to a temporary script, so command text cannot accidentally terminate a heredoc in the action implementation.
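For reviewers who want the shape of the retry wrapper without reading the action source, here is a minimal Python sketch of the same idea: three attempts, exponential backoff, and the command written to a temporary script rather than spliced into another shell string. It is not the action's code. The function name, the `base_delay` value, and the injectable `sleep` parameter are assumptions for illustration, and it assumes a POSIX `sh` is on the PATH.

```python
import os
import subprocess
import tempfile
import time


def run_with_retry(command, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run a shell command up to `attempts` times with exponential backoff.

    Returns 0 on the first success, otherwise the exit code of the last attempt.
    """
    # Write the command text to its own script file so that nothing in the
    # text can break the quoting of a surrounding shell string.
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as script:
        script.write(command + "\n")
        path = script.name
    try:
        code = 1
        for attempt in range(1, attempts + 1):
            code = subprocess.run(["sh", path]).returncode
            if code == 0:
                return 0
            if attempt < attempts:
                # Delay doubles after each failed attempt (hypothetical base).
                sleep(base_delay * 2 ** (attempt - 1))
        # Deterministic failures still surface after the final attempt.
        return code
    finally:
        os.unlink(path)
```

A caller could pass the command text from an environment variable, for example `run_with_retry(os.environ["RETRY_COMMAND"])`, where `RETRY_COMMAND` is a hypothetical variable name.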

The Scrut install downloads from crates.io and can fail for transient Cargo registry or network issues. Retrying the pinned install gives fresh runners a chance to recover without hiding deterministic failures after the final attempt.

Cargo dependency resolution and downloads have shown transient registry and Git network failures on recent main-branch runs. Fetch dependencies explicitly with retries so later build, clippy, and test steps are less likely to fail mid-build from the same temporary network problem.

The website dependency install reaches the Yarn/npm registry and can fail when network or registry responses are transiently unavailable. Use the shared retry action around the frozen-lockfile install so lockfile-invalidated runs can recover from short registry outages.

Installing wasm-pack and wasm-opt downloads from crates.io, which is one of the network surfaces that has failed transiently in recent CI. Pin the tool versions and retry the installs so cache misses still have a reliable path to rebuild the tools.

The website wasm build needs the wasm32 target and Cargo dependencies. Fetching them before the build with retries isolates transient rustup, registry, or Git network failures from the later build step and gives invalidated caches a chance to refill successfully.

Ubuntu container setup depends on apt mirrors and can fail for temporary mirror or network issues. Retrying the apt update/install step makes those arm Linux extension builds less sensitive to short-lived package repository failures.

The extension package step runs npm ci against the npm registry. Recent CI has seen registry/network flakes, so retry the install to let transient failures recover while keeping the lockfile-controlled dependency set unchanged.

Extension test jobs also run npm ci before compiling and testing. Retrying this registry-backed step reduces failures from temporary npm/network issues, especially on cache misses or newly provisioned runners.

Primer comparison installs pyright and mypy from Python package indexes. Retrying this install protects scheduled primer runs from transient package index or network failures without changing the selected package versions.

Primer comparison builds Pyrefly after resolving Cargo dependencies. Adding a retried fetch step lets temporary Cargo registry or Git fetch failures recover before the expensive primer shard work starts.

The optional issue-ranking primer job installs pyright and mypy from external package indexes. Retrying the install makes manual runs more resilient to transient package index and network failures.

Issue ranking's optional primer path builds Pyrefly, so it has the same Cargo registry and Git fetch exposure as primer comparison. Fetch with retries before building so temporary dependency download failures do not fail the whole run.

Snippet checking installs pyrefly, pyright, and mypy from package indexes. Retrying that install helps manual issue-ranking runs survive temporary PyPI or network failures while preserving the existing install command and diagnostics.

mypy_primer is installed from GitHub via pip, which depends on both package index and Git network access. Retrying the dependency install reduces PR job failures from short-lived network problems before the expensive shard work begins.

Each mypy_primer shard builds both the PR commit and base commit. Fetching their Cargo registry or Git dependencies can fail on transient network errors, so retry cargo fetch for each checkout before building. The cargo build steps remain single-shot so compiler errors are reported without repeated backoff.

The musllinux wheel smoke test installs python3 from Alpine repositories inside Docker. Retrying apk add protects the test from temporary Alpine mirror or network failures without changing the wheel being tested.

Release container setup depends on apt mirrors and the GitHub CLI package repository. Retrying the setup step makes release binary builds more robust to temporary package mirror, keyring download, or network failures.

Uploading release assets calls the GitHub Releases API and can fail for transient service or network issues. Retrying the Unix upload step improves release reliability while keeping --clobber behavior unchanged.

Windows release asset uploads use the same GitHub Releases API as the Unix jobs and can hit transient network or service errors. Retry the upload with the same artifact names and --clobber semantics.

Cache the Yarn package-manager cache used by the website install. The key is derived from website/yarn.lock through setup-node, so dependency changes invalidate the cache while unchanged lockfiles avoid repeated registry downloads.

Cache the installed wasm-pack and wasm-opt binaries so website runs do not rebuild tool dependencies on every cacheable run. The cache key includes OS, runner architecture, and explicit tool versions, which are the inputs that determine whether the cached binaries are compatible.
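As a rough illustration of why those three inputs are enough, the Python sketch below builds a key from the OS, the runner architecture, and the pinned tool versions. The function name, the key layout, and the version numbers in the usage note are hypothetical, not the values used in the workflow.

```python
def tool_cache_key(os_name, arch, versions):
    """Build a cache key from the OS, CPU architecture, and pinned tool versions."""
    # Sort by tool name so the key does not depend on dict insertion order.
    tools = "-".join(
        f"{name}-{version}" for name, version in sorted(versions.items())
    )
    return f"{os_name}-{arch}-{tools}"
```

With made-up versions, `tool_cache_key("Linux", "X64", {"wasm-pack": "1.2.3", "wasm-opt": "4.5.6"})` and the same call with `"ARM64"` produce different keys, so a runner can only restore binaries built for its own architecture. Changing either version also changes the key.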

Cache Cargo registry and Git dependency data for the website wasm build, and keep rust-cache's default CARGO_HOME/bin caching for installed Cargo CLI entries. The size-sensitive part is cache-targets: false: that avoids caching Pyrefly or wasm target outputs while still reducing dependency and tool download work on warm runs. rust-cache includes the Rust environment and Cargo lockfiles in its key, so dependency or toolchain changes invalidate the cache. wasm-pack and wasm-opt also have a separate OS/architecture/version-keyed cache.

The extension build runs npm ci in lsp, so setup-node should key the npm cache from lsp/package-lock.json rather than relying on repository-root defaults. That invalidates the cache when extension dependencies change and avoids stale or unrelated root lockfile cache entries.

Extension test jobs also install dependencies from lsp/package-lock.json. Using that lockfile as the setup-node cache dependency path makes cache invalidation match the dependency set that npm ci actually installs.

Cache the VS Code test binary directories used by @vscode/test-electron so test jobs do not repeatedly download the large editor archive. The key includes OS, extension platform/arch, and the extension lockfile because those determine which test dependencies and binary shape are expected.

Wheel builds download the same Cargo registry and Git dependencies across a large maturin matrix. Cache that Cargo dependency data and the default CARGO_HOME/bin contents, but keep cache-targets: false so platform-specific wheel build outputs are not stored in GitHub's cache. rust-cache keys include the job family plus the Rust environment and Cargo lockfiles, which is the right invalidation boundary for shared dependency downloads without creating huge target-dir caches for every wheel target.

Primer comparison builds Pyrefly in each shard and repeatedly needs Cargo registry and Git dependency data. Cache those downloads and CARGO_HOME/bin while leaving target directories uncached, which keeps cache size bounded for scheduled runs. rust-cache's Rust environment and lockfile keying invalidates this cache when Cargo inputs or the toolchain change, without tying it to Pyrefly build artifacts.

The optional issue-ranking primer shards build Pyrefly and benefit from the same Cargo dependency cache as primer comparison. Cache registry/Git downloads and CARGO_HOME/bin, but not target outputs, so manual runs avoid repeated network work without storing bulky build artifacts. The rust-cache environment hash covers the Cargo lockfiles and toolchain, so dependency changes invalidate the cache naturally.

mypy_primer PR jobs build Pyrefly for both new and base commits across many shards. Cache Cargo registry/Git data and CARGO_HOME/bin for the checked-out workspace, but keep target outputs disabled so the cache does not balloon with per-shard build artifacts. Using the pyrefly_to_test workspace lets rust-cache compute invalidation from that checkout's Cargo lockfiles and Rust environment, which is the dependency boundary these builds need.

The cached wasm-pack and wasm-opt binaries are architecture-specific. Include runner.arch in the cache key so x64 and arm64 runners cannot restore incompatible binaries while still sharing cache entries across identical OS/architecture/version combinations.

The VS Code test config does not pin an editor version, so the downloaded stable VS Code binary can change even when package-lock.json does not. Add a weekly key component and hash the test config so caches stay useful but periodically refresh as upstream VS Code stable advances.
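One way to picture the weekly component is the Python sketch below, which combines an ISO year and week number with a hash of the lockfile and test config. The function name, key layout, and choice of SHA-256 are assumptions for illustration. The workflow itself may compute these pieces differently.

```python
import datetime
import hashlib


def vscode_test_cache_key(os_name, arch, lockfile_bytes, config_bytes, today):
    """Build a cache key that changes weekly and whenever the inputs change."""
    # ISO year and week number give a component that rolls over every Monday.
    year, week, _ = today.isocalendar()
    # Hash the lockfile and the test config together; the separator keeps
    # the two inputs from running into each other.
    digest = hashlib.sha256(lockfile_bytes + b"\0" + config_bytes).hexdigest()[:16]
    return f"vscode-test-{os_name}-{arch}-{year}-W{week:02d}-{digest}"
```

Two runs in the same ISO week with unchanged files share a key, so the cache stays warm. The first run of the following week misses and downloads whatever VS Code stable is current at that point.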

Scrut is installed with cargo install on every non-Windows pyrefly run. That install can require registry downloads and local build work, so cache the installed scrut binary plus Cargo's install metadata under a key that includes OS, runner architecture, and the pinned Scrut version. The versioned key gives precise invalidation when the workflow moves to a different Scrut release, while leaving Pyrefly target outputs untouched. The install step still runs on cache misses and is retried for transient crates.io or network failures.

The repository already has pyrefly_wasm/rust-toolchain as a symlink to ../pyrefly/rust-toolchain, and the pyrefly package metadata includes a rust-toolchain file in source distributions. The target file was missing, leaving the symlink broken. Swatinem/rust-cache scans rust-toolchain files when constructing its environment hash, so the broken symlink made cache setup emit ENOENT in the branch workflow runs and prevented clean cache restore/save behavior. Add the missing stable toolchain file so the symlink resolves and the cache key can include the toolchain input normally.
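A quick local check for this class of problem is to walk the tree and report symlinks whose targets do not exist. The Python sketch below is a hypothetical helper, not part of this PR.

```python
import os


def broken_symlinks(root):
    """Return the sorted paths under `root` that are symlinks to missing targets."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink inspects the link itself, while exists follows it,
            # so this pair is true only for dangling links.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)
```

Run against a checkout from before this commit, a helper like this would be expected to list pyrefly_wasm/rust-toolchain. After the target file is added it should return an empty list.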

The extension workflows build Pyrefly before packaging or testing the VS Code extension. A branch run hit a transient crates.io connection reset inside the actions-rust-cross cargo build step while fetching serde, which bypassed the retry wrapper because the fetch happened inside the third-party build action. Fetch the Pyrefly Cargo dependencies explicitly through the local retry-command action before either the cross-compile action or native cargo build runs. For cross targets, include the matrix target in cargo fetch so target-specific dependencies are covered. This keeps the existing Rust cache useful on warm runs and gives cold or invalidated caches retry coverage before the expensive build step.
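The difference between the native and cross-compile cases comes down to whether `cargo fetch` receives a `--target` triple. A small Python sketch of that choice follows. The function name is hypothetical, and the actual workflow passes the command to the retry-command action rather than building it in Python.

```python
def cargo_fetch_command(target=None):
    """Return the argv for `cargo fetch`, optionally limited to one target triple."""
    command = ["cargo", "fetch"]
    if target:
        # With --target, Cargo fetches the dependencies needed for that triple.
        command += ["--target", target]
    return command
```

For example, `cargo_fetch_command("aarch64-unknown-linux-gnu")` yields `["cargo", "fetch", "--target", "aarch64-unknown-linux-gnu"]`, while a native build calls it with no argument.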

github-actions · 2026-06-18T23:28:04Z

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

meta-cla Bot added the cla signed label Jun 17, 2026

github-actions Bot added the size/xl label Jun 17, 2026

samwgoldman force-pushed the ci-retry-cache-stack branch from 094dc52 to f70703c on June 18, 2026 at 16:23

github-actions Bot added size/xl and removed size/xl labels Jun 18, 2026

samwgoldman force-pushed the ci-retry-cache-stack branch from f70703c to ba19d64 on June 18, 2026 at 18:57

github-actions Bot added size/xl and removed size/xl labels Jun 18, 2026

samwgoldman force-pushed the ci-retry-cache-stack branch from ba19d64 to 2569940 on June 18, 2026 at 19:09

github-actions Bot added size/xl and removed size/xl labels Jun 18, 2026

samwgoldman force-pushed the ci-retry-cache-stack branch from 2569940 to 809d999 on June 18, 2026 at 19:26

github-actions Bot added size/xl and removed size/xl labels Jun 18, 2026

samwgoldman force-pushed the ci-retry-cache-stack branch from 809d999 to 16c6ef4 on June 18, 2026 at 19:57

github-actions Bot added size/xl and removed size/xl labels Jun 18, 2026

samwgoldman added 9 commits June 18, 2026 16:06

Retry Scrut install in pyrefly workflow

9a850e2

The Scrut install downloads from crates.io and can fail for transient Cargo registry or network issues. Retrying the pinned install gives fresh runners a chance to recover without hiding deterministic failures after the final attempt.

Retry website wasm tool install

a738cfb

Installing wasm-pack and wasm-opt downloads from crates.io, which is one of the network surfaces that has failed transiently in recent CI. Pin the tool versions and retry the installs so cache misses still have a reliable path to rebuild the tools.

Retry extension container dependency install

820e223

Ubuntu container setup depends on apt mirrors and can fail for temporary mirror or network issues. Retrying the apt update/install step makes those arm Linux extension builds less sensitive to short-lived package repository failures.

Retry extension npm install

b9c644a

The extension package step runs npm ci against the npm registry. Recent CI has seen registry/network flakes, so retry the install to let transient failures recover while keeping the lockfile-controlled dependency set unchanged.

Retry extension test npm install

9d0da0d

Extension test jobs also run npm ci before compiling and testing. Retrying this registry-backed step reduces failures from temporary npm/network issues, especially on cache misses or newly provisioned runners.

samwgoldman added 26 commits June 18, 2026 16:06

Retry primer type checker install

e441905

Primer comparison installs pyright and mypy from Python package indexes. Retrying this install protects scheduled primer runs from transient package index or network failures without changing the selected package versions.

Retry primer Cargo dependency fetch

e7a0e29

Primer comparison builds Pyrefly after resolving Cargo dependencies. Adding a retried fetch step lets temporary Cargo registry or Git fetch failures recover before the expensive primer shard work starts.

Retry issue ranking primer type checker install

5934260

The optional issue-ranking primer job installs pyright and mypy from external package indexes. Retrying the install makes manual runs more resilient to transient package index and network failures.

Retry issue ranking Cargo dependency fetch

31fe9ee

Issue ranking's optional primer path builds Pyrefly, so it has the same Cargo registry and Git fetch exposure as primer comparison. Fetch with retries before building so temporary dependency download failures do not fail the whole run.

Retry issue ranking snippet tool install

80c225f

Snippet checking installs pyrefly, pyright, and mypy from package indexes. Retrying that install helps manual issue-ranking runs survive temporary PyPI or network failures while preserving the existing install command and diagnostics.

Retry mypy primer dependency install

1ca8e73

mypy_primer is installed from GitHub via pip, which depends on both package index and Git network access. Retrying the dependency install reduces PR job failures from short-lived network problems before the expensive shard work begins.

Retry Alpine package install in wheel tests

998b43d

The musllinux wheel smoke test installs python3 from Alpine repositories inside Docker. Retrying apk add protects the test from temporary Alpine mirror or network failures without changing the wheel being tested.

Retry release container dependency install

49ae806

Release container setup depends on apt mirrors and the GitHub CLI package repository. Retrying the setup step makes release binary builds more robust to temporary package mirror, keyring download, or network failures.

Retry Unix release asset upload

ca5e813

Uploading release assets calls the GitHub Releases API and can fail for transient service or network issues. Retrying the Unix upload step improves release reliability while keeping --clobber behavior unchanged.

Retry Windows release asset upload

4b4b503

Windows release asset uploads use the same GitHub Releases API as the Unix jobs and can hit transient network or service errors. Retry the upload with the same artifact names and --clobber semantics.

Cache website Yarn dependencies

05697eb

Cache the Yarn package-manager cache used by the website install. The key is derived from website/yarn.lock through setup-node, so dependency changes invalidate the cache while unchanged lockfiles avoid repeated registry downloads.

Use extension lockfile for test npm cache

483c8a2

Extension test jobs also install dependencies from lsp/package-lock.json. Using that lockfile as the setup-node cache dependency path makes cache invalidation match the dependency set that npm ci actually installs.

samwgoldman force-pushed the ci-retry-cache-stack branch from 16c6ef4 to bd492e1 on June 18, 2026 at 23:09

github-actions Bot added size/xl and removed size/xl labels Jun 18, 2026

samwgoldman wants to merge 36 commits into main from ci-retry-cache-stack
