chore: update torch/cuda related dependencies and Bazel configurations #122

Open
wolegechu wants to merge 139 commits into main from ychu/update-deps-to-torch280

@wolegechu
Contributor

@wolegechu wolegechu commented Dec 23, 2025

Note

Upgrade Torch/CUDA stack and Bazel deps

  • Set -D_GLIBCXX_USE_CXX11_ABI=1 in .bazelrc and bump to torch==2.8.0+cu128 across pyproject.toml
  • Refresh MODULE.bazel http_archive entries (cudnn, nccl, cusparselt, cupti, cublas, cufft, cudart, curand, cusparse, cusolver, nvjitlink, nvrtc, nvtx, triton) with new versions, integrities, and URLs; update libtorch to 2.8.0+cu128 source
  • Adjust third_party BUILD files (e.g., cusparselt paths, nvrtc builtins 12.8, explicit libgomp.so.1)

Build/package improvements

  • setup.py: compute torch CUDA suffix robustly; add BuildPyCommand to generate/link Python protos before build; include tensorcast/proto/**/*.py[i] in package_data
  • Add minimal __init__.py modules for new package namespaces
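
The "compute torch CUDA suffix" step might look roughly like the sketch below. The helper name `torch_cuda_suffix` is hypothetical and not taken from setup.py; the only assumption is that the version string uses a local-version tag like the `2.8.0+cu128` pinned above.

```python
from typing import Optional


def torch_cuda_suffix(version: str) -> Optional[str]:
    """Return the CUDA tag (e.g. 'cu128') from a torch version string.

    Hypothetical helper for illustration; returns None for builds that
    carry no CUDA tag (plain '2.8.0' or '2.8.0+cpu').
    """
    _, sep, local = version.partition("+")
    if not sep:
        return None  # no local-version segment at all
    # A local version may hold several dot-separated segments; take the
    # first one that looks like 'cu' followed by digits.
    for segment in local.split("."):
        if segment.startswith("cu") and segment[2:].isdigit():
            return segment
    return None
```

Returning `None` instead of raising lets the caller decide whether a CPU-only torch is an error for this build.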

New tooling

  • Add tools/torch_version_manager.py (unified version validation/update and uv.lock cache helpers)
  • Add tools/update_module_http_archives.py to rewrite MODULE.bazel from uv.lock

Project config

  • pyproject.toml: add duckdb, pydantic, psutil; refine pyyaml; add uv indexes (private and pytorch); minor lint/pyright formatting tweaks

Written by Cursor Bugbot for commit 440d6ac.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 20 to 23
     name = "torch",
-    srcs = glob(["torch/lib/libgomp-*.so.1"]) + [
+    srcs = [
+        "torch/lib/libgomp.so.1",
         "torch/lib/libtorch.so",

P1: Keep glob for hashed libgomp library

The torch cc_library now hardcodes torch/lib/libgomp.so.1, replacing the previous glob(["torch/lib/libgomp-*.so.1"]). Torch wheels (including the 2.8.0+cu128 wheel referenced in MODULE.bazel) ship a hashed libgomp filename such as libgomp-<hash>.so.1, not an unversioned libgomp.so.1. With the explicit path, Bazel won’t find the file after extracting the wheel, so the libtorch target will fail to build on any fetch of the new torch wheel.
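
One way to settle which filename the wheel actually ships is to list the matching members before choosing between the glob and the explicit path. The sketch below is a hypothetical check (the helper name `libgomp_members` is not from this repo); it only relies on a wheel being a zip archive.

```python
import fnmatch
import zipfile


def libgomp_members(wheel) -> list[str]:
    """Return every libgomp shared object under torch/lib/ in a wheel.

    `wheel` may be a path or a binary file object. The pattern matches
    both 'libgomp.so.1' and hashed names such as 'libgomp-<hash>.so.1'.
    """
    with zipfile.ZipFile(wheel) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if fnmatch.fnmatchcase(name, "torch/lib/libgomp*.so.1")
        )
```

If the result is a hashed name, the explicit `torch/lib/libgomp.so.1` entry will not resolve. A Bazel glob of the form `torch/lib/libgomp*.so.1` should cover both spellings, since `*` may match zero characters.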

@cursor cursor Bot left a comment

Mirror URLs contain duplicate packages path segment

The MIRROR_PREFIX constant is set to http://mirrors.i.basemind.com/pypi/packages/packages/ with a duplicate /packages/packages/ path. The previous MODULE.bazel used single /packages/ paths like http://mirrors.i.basemind.com/pypi/packages/9f/fd/..., but the new URLs have double paths like http://mirrors.i.basemind.com/pypi/packages/packages/ba/51/.... This causes all mirror URL downloads to target incorrect paths. The PyPI fallback URLs still work, but the primary mirror URLs are broken.

tools/update_module_http_archives.py#L31-L32

MIRROR_PREFIX = "http://mirrors.i.basemind.com/pypi/packages/packages/"
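
A possible shape for the fix, sketched under the assumption that the PyPI URL path already begins with `/packages/`, so the mirror base must stop at `/pypi`. The names `MIRROR_BASE` and `mirror_url` are illustrative, not the script's actual identifiers.

```python
from urllib.parse import urlsplit

PYPI_HOST = "files.pythonhosted.org"
# Base taken from the mirror URLs quoted in this review; it deliberately
# ends before '/packages' because the PyPI path supplies that segment.
MIRROR_BASE = "http://mirrors.i.basemind.com/pypi"


def mirror_url(pypi_url: str) -> str:
    """Map a files.pythonhosted.org wheel URL onto the mirror layout."""
    parts = urlsplit(pypi_url)
    if parts.netloc != PYPI_HOST or not parts.path.startswith("/packages/"):
        raise ValueError(f"not a PyPI file URL: {pypi_url!r}")
    return MIRROR_BASE.rstrip("/") + parts.path
```

Reusing the parsed path rather than concatenating a prefix with a hand-trimmed suffix makes a doubled `/packages/packages/` segment impossible by construction.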

MODULE.bazel#L95-L96

tensorcast/MODULE.bazel

Lines 95 to 96 in 440d6ac

urls = [
"http://mirrors.i.basemind.com/pypi/packages/packages/ba/51/e123d997aa098c61d029f76663dedbfb9bc8dcf8c60cbd6adbe42f76d049/nvidia_cudnn_cu12-9.10.2.21-py3-none-manylinux_2_27_x86_64.whl",


- Add `Binding` API (`artifact.bind`/`bind_into`) to hide slot mechanics
- Add `SwapKeyMapping` RPC + key mapping generation + cache TTL hints (daemon/GS)
- Allow `ttl_ms=0` for long-lived VRAM regions/published replicas; tighten PID watches
- Update Global Store schema/migration and extend tests
…r backfill

- implement startup auto mode singleflight flow with strict fail-fast semantics

- recover from stale READY/FAILED auto-state records when owner is dead, then re-elect and recreate daemon

- switch store runtime fallback from connect->create to connect->auto

- update API/runtime error hints and docs to include mode='auto', plus add Python tests

- allow ResolveArtifactFromDisk to backfill stale descriptor fields when verify_checksums=false (keep strict mismatch failure when true), with daemon tests
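
For readers unfamiliar with the singleflight pattern named above, here is a minimal in-process sketch with fail-fast error propagation. It is an illustration only: the commit's flow appears to coordinate across processes through auto-state records, which this thread-based version does not model, and the class names are hypothetical.

```python
import threading


class _Call:
    """State shared by the leader and followers of one in-flight call."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Run fn once per key at a time; concurrent callers share the outcome."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call
        if not leader:
            call.done.wait()  # follower: wait for the leader's outcome
        else:
            try:
                call.result = fn()
            except Exception as exc:
                call.error = exc  # fail fast: followers see the same error
            finally:
                with self._lock:
                    del self._calls[key]  # the next caller starts fresh
                call.done.set()
        if call.error is not None:
            raise call.error
        return call.result
```

Because the key is dropped once the call finishes, a failed startup is not cached: the next caller becomes a new leader and retries.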
wolegechu and others added 30 commits February 22, 2026 18:40
…den chaos validation

- propagate request budget/transport wait timeout from daemon RPC into materialization hints
- replace fixed reselection retries with budget-aware retry loops and bounded GS transport waits
- improve global-store transport safety via claim rollback, idempotent completion, and profile-based stale cleanup
- enrich Python retry mapping/diagnostics with transient reason buckets and budget metadata
- add and harden cross-host chaos tooling (runner/gate scripts, schema/configs, failure accounting, GPU preflight)
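
The difference between a fixed retry count and a budget-aware loop can be sketched as follows. All names here (`retry_within_budget`, `TransientError`, `BudgetExhausted`) are hypothetical and the backoff constants are arbitrary; the daemon's real retry mapping is not shown in this PR text.

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable failure (hypothetical)."""


class BudgetExhausted(Exception):
    """Raised when the time budget runs out before an attempt succeeds."""


def retry_within_budget(op, budget_s, base_delay_s=0.05, max_delay_s=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Retry op() on TransientError until it succeeds or budget_s elapses."""
    deadline = clock() + budget_s
    delay = base_delay_s
    attempts = 0
    while True:
        attempts += 1
        try:
            return op()
        except TransientError as exc:
            remaining = deadline - clock()
            if remaining <= 0:
                raise BudgetExhausted(
                    f"gave up after {attempts} attempts") from exc
            # Never sleep past the deadline; back off exponentially.
            sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay_s)
```

The number of attempts now follows from how long each attempt and wait actually took, so a slow transport gets fewer retries and a fast one gets more, all within the same caller-supplied budget.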
…hmark workflow

- add unregister drain + MTCP completion tracking and tensor read leases for safe retirement
- enable stable_dram stage_on_gpu=false CPU streaming with registration chunk ingest and pending alias cleanup to avoid stale-retention OOM
- improve GS/daemon timeout-cancel handling, heartbeat buffering/metrics, alias TTL policy, endpoint takeover, and BatchGetReplicaCounts
- add cross-host runner/config/docs and related C++/Python tests
… bind retry path

- return FAILED_PRECONDITION in key-mapping upsert/swap when artifact index is not ready
- inject artifact/index repositories into KeyMappingRpcHandler and extend RPC tests
- reuse preallocated TP rank targets across bind retries; add CUDA OOM fast-fail and compact retry errors
- harden cross-host daemon start/stop cleanup with timeout and TERM/KILL fallback
- add TP bind retry regression test and update GS/benchmark documentation
- generate deterministic mapped view IDs (mapped:v1:sha256) from canonical index, source view, copy plan, and target layout
- enable mapped binding publish on bind/swap (publish=True) by returning target_write_token and publishing VIEW byte-space replicas
- accept opaque selection.view_id and prefer view-byte-space transport with canonical fallback for mixed-version compatibility
- refactor feed upload to span streaming with configurable gRPC message limits, chunk sizing, timeout, and progress logging
- add stable_dram CPU-stream uploader worker auto-tuning and concurrent non-overlapping span uploads
- allow out-of-order stable_dram chunk ingest with overlap rejection and commit-time full-coverage checks
- propagate cpu_shared_memory_enabled through registration and fix CPU memfd lookup to use replica allocation keys
- harden cross-host daemon orchestration, add CPU stream micro-bench tooling, and sync tests/docs
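
A deterministic ID of the kind described for mapped views can be derived by hashing a canonical serialization of the inputs. The sketch below is an assumption about the general approach, not the repo's encoding: the function name, the JSON canonicalization, and the exact `mapped:v1:<hex digest>` layout are all illustrative.

```python
import hashlib
import json


def mapped_view_id(canonical_index, source_view, copy_plan, target_layout):
    """Derive a stable ID from the four inputs named in the commit message.

    Inputs must be JSON-serializable. Keys are sorted and separators are
    fixed so that equal inputs always produce identical bytes.
    """
    payload = json.dumps(
        {
            "canonical_index": canonical_index,
            "source_view": source_view,
            "copy_plan": copy_plan,
            "target_layout": target_layout,
        },
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"mapped:v1:{digest}"
```

The `v1` component leaves room to change the canonicalization later without colliding with IDs minted under the old scheme.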
… diffusion fixes

- extend stable_dram proto handshake with publish_cpu_memfd handle/lease and add stable_dram_write_progress range feed
- add registration CPU memfd info + range-only ingest APIs in StoreEngine/RegistrationBackend, with strict overlap/full-span validation
- split stable_dram commit modes (cpu_stream vs cpu_memfd_publish), return stable_cache_admitted, and skip redundant local stable admission
- mint/release stable_dram publish handle leases in registration controller and release lease before commit to avoid double stable-budget charging
- add Python uploader cpu_memfd path (LocalHandle FD exchange + mmap write + written-range ack) while keeping stream fallback
- relax subset publish for view-scoped byte-space and update inplace slot publishability checks
- improve transport behavior with short view-route probe timeout then canonical fallback; force export on target materialization for better diffusion
- preserve replica in-flight counters on re-registration and prefer new idle sources on tie-break
- add/expand C++ and Python tests plus 0082 design/plan/benchmark documentation updates
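
The overlap rejection and full-span validation mentioned for range ingest can be illustrated with a small interval tracker. This is a sketch of the general technique in Python with a hypothetical class name; the actual checks live in the C++ StoreEngine/RegistrationBackend and may differ.

```python
import bisect


class RangeTracker:
    """Accept byte ranges in any order, reject overlaps, check coverage."""

    def __init__(self, total_bytes: int):
        self.total = total_bytes
        self._starts = []  # sorted range starts
        self._ends = []    # matching exclusive ends

    def add(self, offset: int, length: int) -> None:
        end = offset + length
        if offset < 0 or length <= 0 or end > self.total:
            raise ValueError(f"range [{offset}, {end}) is out of bounds")
        i = bisect.bisect_left(self._starts, offset)
        # Stored ranges are disjoint and sorted, so only the two
        # neighbours of the insertion point can overlap the new range.
        if i > 0 and self._ends[i - 1] > offset:
            raise ValueError(f"range [{offset}, {end}) overlaps earlier data")
        if i < len(self._starts) and self._starts[i] < end:
            raise ValueError(f"range [{offset}, {end}) overlaps later data")
        self._starts.insert(i, offset)
        self._ends.insert(i, end)

    def is_complete(self) -> bool:
        """True when the stored ranges tile [0, total) with no gaps."""
        cursor = 0
        for start, end in zip(self._starts, self._ends):
            if start != cursor:
                return False
            cursor = end
        return cursor == self.total
```

Rejecting overlaps at ingest time and deferring the gap check to commit is what allows chunks to arrive out of order without weakening the final guarantee.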
…nt transport requests

- add transport request_id/requester_worker_id/scheduling_group across proto, SDK, daemon, and store client

- introduce pending transport queue and GROUP_DISPATCH scheduler with fairness/aging/scan/batch controls

- require explicit completion outcome/detail and persist transport outcome for progress/accountability

- add source-balance dispatch metrics and grouped TP-version benchmark wiring

- update schema/config/docs and extend transport service/repository/integration tests
…isher flow

- Add QueryTransportWindow RPC in Global Store (proto/repository/service/rpc) with coverage tests

- Switch cross-host runner transport probe/throughput audit to GS RPC and add publish/put bandwidth metrics

- Default daemon/store to cpu_shared_memory enabled, auto-fill stable_bytes=64MB, and auto-select local handle socket

- Relax fake-CUDA startup sizing checks and improve SDK cleanup/validation paths (region force-cleanup + fake-backend CPU put support)

- Simplify weight_publisher/e2e behavior (always pre-publish trim, p2p-only receiver fallback, tensor_dict replica unload) and update docs
…oling

- fix dataplane pump shutdown/drain flow and recover producer-owned streaming buffer slots
- close timed-out remote key channels and add richer pinned-buffer wait context for debugging
- decouple HA heartbeat RPC from control-plane lock contention and guard liveness by epoch
- add global-store transient transaction conflict handling and cluster runtime RPC support
- add cross-host transport probes, early gate/summarizer tooling, and new benchmark daemon/global configs
- refresh cross-host benchmark docs/playbook and add regression tests for adapters, repos/services, probes, and TP bind retry paths
…bench flow

- add group-source spread and soft-cap controls to group dispatch scheduling
- recycle failed request ids, normalize malformed transport rows, and relax tp_version replay contract on view/artifact variance
- keep tp bind request ids stable across retries, preserve completed ranks on retry, and tune per-rank timeout handling
- add pre-publish trim margin, ABBA suite/preflight tooling, remote daemon startup retry, and update 0083 steady benchmark docs
…rt flow

- add SDK PortConfig overrides for daemon and Global Store ports
- support launcher-only daemon envs from config and pass explicit p2p port overrides
- make local Global Store start fail on existing healthy instances and rotate stale cluster tokens
- strengthen auto-mode singleflight startup and connect-mode error handling
- prewarm GPU hash NVRTC kernels at daemon startup and disable real-CUDA CPU fallback
- trust/backfill artifact descriptors on disk import and skip redundant rehashing
- cap materialization concurrency by daemon worker budget and improve disk fallback behavior
- remove GetArtifactOptions.prefer and document execution-only source policy semantics
* fix(dev): 1) fix torch version check; 2) sync bazel module sources with uv.lock

* fix(dev): explicitly add pytz as dependency to avoid import error in duckDB (e.g. with python>=3.11)

* fix(store): 1) add more logging points for strided load in byte_range_mapped_source; 2) reuse StridedBlock cache; 3) add direct gather path for CPU mem in strided load

* fix(store): add CPU memfd region in local replica handle
- route artifact.bind() through bind_into with preallocated CUDA targets
- drop DeferredLoader/InplaceSlot from the public SDK surface
- update docs, examples, and tests to use the unified Binding contract
- prefer disk-first materialization for local-import view/subset bindings
…ading

- add DiskArtifactContext to reuse disk scans, safetensors fd/mmap state, and index metadata
- route DiskLoader and metadata_stage through the shared context with new tests
- relax replica byte-mapping thresholds for mmap-capable disk sources
- add the h2d_nccl_broadcast_baseline benchmark mode and docs
- stop marking canonicalized indexes as safetensors when only a source index is backfilled
- build mmap mapping options before moving the seekable source into the mapped wrapper
- switch dataplane tests to scoped temp dirs and align AUTO variant fallback assertions