docs(xet): update HF_XET_CLIENT_READ_TIMEOUT default to 300s#2419

Draft
rajatarya wants to merge 1 commit into main from rajat/xet-raise-client-read-timeout-to-300s

@rajatarya

Contributor

Summary

Bumps the documented default for HF_XET_CLIENT_READ_TIMEOUT in the Xet env-var reference table from 120s to 300s, matching the xet-core default change in huggingface/xet-core#807.

Why

The 120s default was cutting off slow-but-progressing uploads on high-latency / transatlantic links, producing a chronic 30–50% xorb POST failure rate fleet-wide (see Notion investigation).

Test plan

  • Preview docs/hub/xet/using-xet-storage.md renders the updated value
  • Ship in the same release window as the xet-core PR so docs and runtime agree

🤖 Generated with Claude Code

Reflects the xet-core default change in huggingface/xet-core#807.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

rajatarya added a commit to huggingface/xet-core that referenced this pull request May 18, 2026
…og (#808)

Two connected cleanups from the [2026-04-21 Julien upload-stuck
investigation](https://www.notion.so/huggingface2/Julien-upload-stuck-upload_xorb-120s-timeouts-2026-04-21-3491384ebcac81a19d0af5394745cfff).
Closes #807. Docs PR: huggingface/hub-docs#2419.

## Change 1 — raise `HF_XET_CLIENT_READ_TIMEOUT` default 120s → 300s

**Files:** `xet_runtime/src/config/groups/client.rs`,
`xet_client/src/cas_client/remote_client.rs` (stale comment).

The 120s client read timeout was firing before legitimate `upload_xorb`
requests could complete on high-latency / transatlantic / bursty links.
Fleet-wide this produced a **chronic 30–50% xorb POST failure rate**
(1,092–4,196 `error uploading xorb` events per hour sustained over 24h,
peaking at 49.1% in the investigation window). 267 successful uploads in
the same 24h had latency > 120s (max 37 min), so 120s wasn't protecting
anything legitimate — it was only cutting off slow-but-healthy streams.

300s preserves stall-detection semantics (still an order of magnitude
under the 3600s ALB idle timeout). The `HF_XET_CLIENT_READ_TIMEOUT` env
override is unchanged.
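
To illustrate the override semantics described above, here is a minimal sketch of resolving the timeout from the environment with a 300s fallback. It is not the code in `xet_runtime/src/config/groups/client.rs`: the helper names `resolve_read_timeout` and `read_timeout_from_env` are hypothetical, and the sketch assumes the value is a plain integer number of seconds and that an unparsable value falls back to the default.

```rust
use std::env;
use std::time::Duration;

/// New default from this change (previously 120s).
const DEFAULT_READ_TIMEOUT_SECS: u64 = 300;

/// Hypothetical helper: turn an optional raw value into a timeout.
/// Assumption: the value is whole seconds; anything unparsable
/// falls back to the default instead of failing.
fn resolve_read_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .map(str::trim)
        .and_then(|v| v.parse::<u64>().ok())
        .unwrap_or(DEFAULT_READ_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

/// Hypothetical helper: read the override from the process environment.
fn read_timeout_from_env() -> Duration {
    let raw = env::var("HF_XET_CLIENT_READ_TIMEOUT").ok();
    resolve_read_timeout(raw.as_deref())
}
```

Under these assumptions, running with `HF_XET_CLIENT_READ_TIMEOUT=120` would restore the previous 120s behavior, while leaving the variable unset would yield 300s.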

## Change 2 — log `query_dedup` 404 as cache miss, not "Fatal Error"

**Files:** `xet_client/src/cas_client/retry_wrapper.rs`,
`xet_client/src/cas_client/remote_client.rs`.

A 404 from `cas::query_dedup` is an expected cache miss — the caller
converts it to `Ok(None)` and proceeds to upload. Today the retry
wrapper logs it as `Fatal Error: "cas::query_dedup" api call failed
... 404 Not Found`, producing **20+ alarming-looking lines per upload
session** with no actual failure behind them (Hoyt flagged this in the
incident Slack thread).

Fix: add `RetryWrapper::with_expected_404()` — mirroring the existing
`with_expected_416()` pattern — and opt `query_dedup` into it. The 404
still short-circuits retries and surfaces as a fatal error to the caller
(preserving the existing `Ok(None)` conversion), but the log line now
reads `Not Found (cache miss): "cas::query_dedup" api call failed ...
404 Not Found`.
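
The opt-in pattern can be sketched roughly as follows. This is a simplified model, not the real `RetryWrapper` in `retry_wrapper.rs`: the `RetryOptions` and `Decision` types and the `decide` method are hypothetical, and the set of status codes treated as retryable (408, 429, 5xx) is an assumption made for illustration. Only the `with_expected_404()` name and the two log prefixes come from the description above.

```rust
/// What a retry loop should do with a response (hypothetical model).
#[derive(Debug, PartialEq)]
enum Decision {
    /// Request succeeded; return the response.
    Done,
    /// Transient failure; try again.
    Retry,
    /// Stop retrying; carries the prefix used for the log line.
    Fatal(&'static str),
}

/// Hypothetical stand-in for per-call retry options.
#[derive(Default)]
struct RetryOptions {
    expected_404: bool,
}

impl RetryOptions {
    /// Builder-style opt-in, mirroring the name used in the change.
    fn with_expected_404(mut self) -> Self {
        self.expected_404 = true;
        self
    }

    /// Classify an HTTP status code. A 404 is never retried in either
    /// mode; opting in only changes the prefix that gets logged.
    fn decide(&self, status: u16) -> Decision {
        match status {
            200..=299 => Decision::Done,
            404 if self.expected_404 => Decision::Fatal("Not Found (cache miss)"),
            // Assumption: these are the transient statuses worth retrying.
            408 | 429 | 500..=599 => Decision::Retry,
            _ => Decision::Fatal("Fatal Error"),
        }
    }
}
```

In this model a call site such as `query_dedup` would build its options with `RetryOptions::default().with_expected_404()`, and the caller would still map the resulting fatal 404 to `Ok(None)` as before.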

## Test plan

- [x] `cargo +nightly fmt --all --check` clean
- [x] `cargo test -p xet-client --lib cas_client::retry_wrapper` — 5
passed (incl. new `test_404_expected_is_fatal_and_not_retried`)
- [ ] Manually verify `HF_XET_CLIENT_READ_TIMEOUT=120` still overrides
via env
- [ ] Confirm a session run produces no `Fatal Error:` lines for the
`query_dedup` 404s
- [ ] Watch the xorb POST error-rate panel on the [CAS Grafana
dashboard](https://grafana.huggingface.tech/d/dejp4w2hael1cb/cas) after
release; expect the 120s-clustered p50 to disappear

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Adjusts client networking defaults (read timeout) and alters
retry-wrapper handling/logging for HTTP 404s, which can change behavior
and observability for slow uploads and cache-miss paths.
> 
> **Overview**
> Raises the default `HF_XET_CLIENT_READ_TIMEOUT` from 120s to 300s to
better tolerate slow-but-progressing transfers.
> 
> Adds `RetryWrapper::with_expected_404()` and opts `cas::query_dedup`
into it so 404 responses are still non-retried/fatal to the caller but
are logged as an expected *cache miss* (with a new unit test covering
the no-retry behavior).
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
3e88f9c. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>