docs(xet): update HF_XET_CLIENT_READ_TIMEOUT default to 300s #2419
Draft
rajatarya wants to merge 1 commit into
Reflects the xet-core default change in huggingface/xet-core#807.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
rajatarya added a commit to huggingface/xet-core that referenced this pull request on May 18, 2026
…og (#808)

Two connected cleanups from the [2026-04-21 Julien upload-stuck investigation](https://www.notion.so/huggingface2/Julien-upload-stuck-upload_xorb-120s-timeouts-2026-04-21-3491384ebcac81a19d0af5394745cfff). Closes #807. Docs PR: huggingface/hub-docs#2419.

## Change 1: raise `HF_XET_CLIENT_READ_TIMEOUT` default 120s → 300s

**Files:** `xet_runtime/src/config/groups/client.rs`, `xet_client/src/cas_client/remote_client.rs` (stale comment).

The 120s client read timeout was firing before legitimate `upload_xorb` requests could complete on high-latency / transatlantic / bursty links. Fleet-wide this produced a **chronic 30–50% xorb POST failure rate** (1,092–4,196 `error uploading xorb` events per hour sustained over 24h, peaking at 49.1% in the investigation window). 267 successful uploads in the same 24h had latency > 120s (max 37 min), so 120s wasn't protecting anything legitimate; it was only cutting off slow-but-healthy streams. 300s preserves stall-detection semantics (still an order of magnitude under the 3600s ALB idle). The env override `HF_XET_CLIENT_READ_TIMEOUT` is unchanged.

## Change 2: log `query_dedup` 404 as cache miss, not "Fatal Error"

**Files:** `xet_client/src/cas_client/retry_wrapper.rs`, `xet_client/src/cas_client/remote_client.rs`.

A 404 from `cas::query_dedup` is an expected cache miss: the caller converts it to `Ok(None)` and proceeds to upload. Today the retry wrapper logs it as `Fatal Error: \"cas::query_dedup\" api call failed ... 404 Not Found`, producing **20+ alarming-looking lines per upload session** with no actual failure behind them (Hoyt flagged this in the incident Slack thread).

Fix: add `RetryWrapper::with_expected_404()`, mirroring the existing `with_expected_416()` pattern, and opt `query_dedup` into it. The 404 still short-circuits retries and surfaces as a fatal error to the caller (preserving the existing `Ok(None)` conversion), but the log line now reads `Not Found (cache miss): \"cas::query_dedup\" api call failed ... 404 Not Found`.

## Test plan

- [x] `cargo +nightly fmt --all --check` clean
- [x] `cargo test -p xet-client --lib cas_client::retry_wrapper`: 5 passed (incl. new `test_404_expected_is_fatal_and_not_retried`)
- [ ] Manually verify `HF_XET_CLIENT_READ_TIMEOUT=120` still overrides via env
- [ ] Confirm a session run produces no `Fatal Error:` lines for the `query_dedup` 404s
- [ ] Watch the xorb POST error-rate panel on the [CAS Grafana dashboard](https://grafana.huggingface.tech/d/dejp4w2hael1cb/cas) after release; expect the 120s-clustered p50 to disappear

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Adjusts client networking defaults (read timeout) and alters retry-wrapper handling/logging for HTTP 404s, which can change behavior and observability for slow uploads and cache-miss paths.
>
> **Overview**
> Raises the default `HF_XET_CLIENT_READ_TIMEOUT` from 120s to 300s to better tolerate slow-but-progressing transfers.
>
> Adds `RetryWrapper::with_expected_404()` and opts `cas::query_dedup` into it so 404 responses are still non-retried/fatal to the caller but are logged as an expected *cache miss* (with a new unit test covering the no-retry behavior).
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 3e88f9c. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
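For readers without the xet-core source at hand, here is a minimal sketch of the behaviour Change 2 describes: a 404 on an opted-in call is still fatal and never retried, and only the log prefix differs. This is not the actual `RetryWrapper` code. `RetryPolicy`, `Outcome`, `classify`, and the set of retried status codes are hypothetical stand-ins chosen for illustration; only the `with_expected_404()` name and the two log prefixes come from the commit message.

```rust
/// How one HTTP response is treated by a (hypothetical) retry loop.
#[derive(Debug, PartialEq)]
enum Outcome {
    Success,
    /// Try the request again.
    Retry,
    /// Stop retrying and hand the error to the caller, logged with this prefix.
    Fatal { log_prefix: &'static str },
}

/// Hypothetical stand-in for the retry wrapper's configuration.
#[derive(Default)]
struct RetryPolicy {
    expected_404: bool,
}

impl RetryPolicy {
    /// Opt a call into "404 is an expected cache miss" logging.
    fn with_expected_404(mut self) -> Self {
        self.expected_404 = true;
        self
    }

    /// Classify a status code. Which codes are retried is an assumption here.
    fn classify(&self, status: u16) -> Outcome {
        match status {
            200..=299 => Outcome::Success,
            // Still fatal and not retried; only the log prefix changes.
            404 if self.expected_404 => Outcome::Fatal { log_prefix: "Not Found (cache miss)" },
            429 | 500..=599 => Outcome::Retry,
            _ => Outcome::Fatal { log_prefix: "Fatal Error" },
        }
    }
}

fn main() {
    let dedup_policy = RetryPolicy::default().with_expected_404();
    let default_policy = RetryPolicy::default();

    for (name, policy) in [("query_dedup", &dedup_policy), ("other call", &default_policy)] {
        if let Outcome::Fatal { log_prefix } = policy.classify(404) {
            println!("{name}: {log_prefix}: api call failed ... 404 Not Found");
        }
    }
}
```

In this sketch the caller of `query_dedup` would still see a fatal outcome for the 404 and can convert it to `Ok(None)` as before; the opt-in flag only affects how the event is labelled in the log.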
Summary
Bumps the documented default for `HF_XET_CLIENT_READ_TIMEOUT` in the Xet env-var reference table from `120s` to `300s`, matching the xet-core default change in huggingface/xet-core#807.

Why
The 120s default was cutting off slow-but-progressing uploads on high-latency / transatlantic links, producing a chronic 30–50% xorb POST failure rate fleet-wide (see Notion investigation).
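To make the default-versus-override relationship concrete, the following is a rough sketch of how a client might resolve the read timeout: use the environment value when it is present and parseable, otherwise fall back to the new 300s default. It is not the xet-core implementation. `resolve_read_timeout` is a hypothetical helper, and the accepted value formats (`120` or `120s`) are an assumption based on how the value appears in the docs table and in the xet-core test plan.

```rust
use std::time::Duration;

/// Documented default after this change (previously 120s).
const DEFAULT_READ_TIMEOUT: Duration = Duration::from_secs(300);

/// Hypothetical helper: turn an optional raw env value into a timeout.
/// Accepts whole seconds with or without a trailing `s` (an assumption);
/// anything missing or unparseable falls back to the default.
fn resolve_read_timeout(raw: Option<&str>) -> Duration {
    raw.map(str::trim)
        .and_then(|v| v.strip_suffix('s').unwrap_or(v).parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(DEFAULT_READ_TIMEOUT)
}

fn main() {
    // Read the override, if any, from the process environment.
    let raw = std::env::var("HF_XET_CLIENT_READ_TIMEOUT").ok();
    let timeout = resolve_read_timeout(raw.as_deref());
    println!("read timeout: {}s", timeout.as_secs());
}
```

Under these assumptions, running with `HF_XET_CLIENT_READ_TIMEOUT=120` would restore the previous 120s behaviour, while leaving the variable unset yields 300s.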
Test plan
- `docs/hub/xet/using-xet-storage.md` renders the updated value

🤖 Generated with Claude Code