chore: update torch/cuda related dependencies and Bazel configurations by wolegechu · Pull Request #122 · tensorcast-ai/tensorcast

wolegechu · 2025-12-23T04:19:46Z

Note

Upgrade Torch/CUDA stack and Bazel deps

Set -D_GLIBCXX_USE_CXX11_ABI=1 in .bazelrc and bump to torch==2.8.0+cu128 across pyproject.toml
Refresh MODULE.bazel http_archive entries (cudnn, nccl, cusparselt, cupti, cublas, cufft, cudart, curand, cusparse, cusolver, nvjitlink, nvrtc, nvtx, triton) with new versions, integrities, and URLs; update libtorch to 2.8.0+cu128 source
Adjust third_party BUILD files (e.g., cusparselt paths, nvrtc builtins 12.8, explicit libgomp.so.1)

Build/package improvements

setup.py: compute the torch CUDA suffix robustly; add BuildPyCommand to generate/link Python protos before build; include tensorcast/proto/**/*.py[i] in package_data
Add minimal __init__.py modules for new package namespaces
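
The PR does not show how setup.py derives the CUDA suffix, so the following is only a minimal sketch of one way to do it. It assumes the suffix is the local version label after the `+` in a string such as `2.8.0+cu128`; the function name `cuda_suffix` is hypothetical.

```python
import re
from typing import Optional


def cuda_suffix(torch_version: str) -> Optional[str]:
    """Return the CUDA tag (e.g. 'cu128') from a torch version string, or None.

    Hypothetical helper: the real setup.py logic is not shown in this PR.
    """
    # PEP 440 local version labels follow a '+', e.g. '2.8.0+cu128'.
    _, sep, local = torch_version.partition("+")
    if not sep:
        return None  # plain version such as '2.8.0' carries no CUDA tag
    # Only accept labels of the form 'cu<digits>'; '+cpu' builds yield None.
    match = re.fullmatch(r"cu\d+", local.split(".")[0])
    return match.group(0) if match else None


print(cuda_suffix("2.8.0+cu128"))
print(cuda_suffix("2.8.0+cpu"))
print(cuda_suffix("2.8.0"))
```

Returning `None` rather than raising leaves the caller free to decide whether a missing CUDA tag is an error.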

New tooling

Add tools/torch_version_manager.py (unified version validation/update and uv.lock cache helpers)
Add tools/update_module_http_archives.py to rewrite MODULE.bazel from uv.lock
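
One step such a rewrite probably needs is converting lockfile hashes into the form Bazel expects. uv.lock records wheel hashes as `sha256:<hex>`, while the `integrity` attribute of `http_archive` takes a Subresource Integrity string (`sha256-<base64>`). The sketch below shows only that conversion. The function name is hypothetical and it is not taken from the tool added in this PR.

```python
import base64
import hashlib


def sri_from_lock_hash(lock_hash: str) -> str:
    """Convert 'sha256:<hex digest>' into an SRI string 'sha256-<base64 digest>'.

    Hypothetical helper; the actual tool's implementation is not shown in the PR.
    """
    algo, _, hex_digest = lock_hash.partition(":")
    if algo != "sha256" or len(hex_digest) != 64:
        raise ValueError(f"unsupported hash entry: {lock_hash!r}")
    raw = bytes.fromhex(hex_digest)  # raises ValueError on non-hex input
    return "sha256-" + base64.b64encode(raw).decode("ascii")


# Usage with a digest computed locally, so no real wheel hash is assumed.
example = "sha256:" + hashlib.sha256(b"example wheel bytes").hexdigest()
print(sri_from_lock_hash(example))
```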

Project config

pyproject.toml: add duckdb, pydantic, psutil; refine pyyaml; add uv indexes (private and pytorch); minor lint/pyright formatting tweaks

Written by Cursor Bugbot for commit 440d6ac.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.


chatgpt-codex-connector · 2025-12-23T04:24:00Z

    name = "torch",
-    srcs = glob(["torch/lib/libgomp-*.so.1"]) + [
+    srcs = [
+        "torch/lib/libgomp.so.1",
        "torch/lib/libtorch.so",


Keep glob for hashed libgomp library

The torch cc_library now hardcodes torch/lib/libgomp.so.1, replacing the previous glob(["torch/lib/libgomp-*.so.1"]). Torch wheels (including the 2.8.0+cu128 wheel referenced in MODULE.bazel) ship a hashed libgomp filename such as libgomp-<hash>.so.1, not an unversioned libgomp.so.1. With the explicit path, Bazel won’t find the file after extracting the wheel, so the libtorch target will fail to build on any fetch of the new torch wheel.
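
The mismatch described above can be reproduced outside Bazel. This is a rough sketch that uses Python's `fnmatch` as a stand-in for Bazel's `glob` (the two differ in details, for example in how `**` is handled). The hashed filename is hypothetical, and the tolerant pattern is only a suggestion, not something proposed in the PR.

```python
from fnmatch import fnmatch

# Hypothetical wheel contents; the real hash suffix varies per wheel build.
HASHED_LAYOUT = ["torch/lib/libgomp-a1b2c3d4.so.1", "torch/lib/libtorch.so"]
PLAIN_LAYOUT = ["torch/lib/libgomp.so.1", "torch/lib/libtorch.so"]

OLD_GLOB = "torch/lib/libgomp-*.so.1"      # pattern removed by this PR
NEW_PATH = "torch/lib/libgomp.so.1"        # explicit path added by this PR
TOLERANT_GLOB = "torch/lib/libgomp*.so.1"  # assumption: one pattern covering both


def select(files, pattern):
    """Return the files matching a shell-style pattern (rough stand-in for glob)."""
    return [f for f in files if fnmatch(f, pattern)]


for name, layout in [("hashed", HASHED_LAYOUT), ("plain", PLAIN_LAYOUT)]:
    print(name, "old glob:", select(layout, OLD_GLOB))
    print(name, "explicit:", [f for f in layout if f == NEW_PATH])
    print(name, "tolerant:", select(layout, TOLERANT_GLOB))
```

Whichever layout the 2.8.0+cu128 wheel actually ships, only the tolerant pattern finds the library in both cases. Listing the extracted wheel's `torch/lib` directory would settle which layout applies.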


cursor

Mirror URLs contain duplicate packages path segment

The MIRROR_PREFIX constant is set to http://mirrors.i.basemind.com/pypi/packages/packages/ with a duplicate /packages/packages/ path. The previous MODULE.bazel used single /packages/ paths like http://mirrors.i.basemind.com/pypi/packages/9f/fd/..., but the new URLs have double paths like http://mirrors.i.basemind.com/pypi/packages/packages/ba/51/.... This causes all mirror URL downloads to target incorrect paths. The PyPI fallback URLs still work, but the primary mirror URLs are broken.
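
A duplicated segment like this typically appears when a prefix that already ends in `packages/` is joined with a path that also starts with `packages/`. The sketch below is one hedged way to build the mirror URL and guard against the duplication. It assumes the upstream URLs are `files.pythonhosted.org` URLs of the usual `/packages/<aa>/<bb>/<hash>/<file>` shape; `MIRROR_ROOT` and `to_mirror_url` are hypothetical names, not the ones used in tools/update_module_http_archives.py.

```python
from urllib.parse import urlsplit

# Host taken from the PR; treating '/pypi/' as the mirror root is an assumption.
MIRROR_ROOT = "http://mirrors.i.basemind.com/pypi/"


def to_mirror_url(upstream_url: str, mirror_root: str = MIRROR_ROOT) -> str:
    """Map an upstream wheel URL onto the mirror, keeping its '/packages/...' path."""
    path = urlsplit(upstream_url).path.lstrip("/")
    if not path.startswith("packages/"):
        raise ValueError(f"unexpected upstream path: {path!r}")
    url = mirror_root.rstrip("/") + "/" + path
    # Guard against the bug reported above: a root that already ends in 'packages/'.
    if "/packages/packages/" in url:
        raise ValueError(f"duplicate 'packages' segment in {url!r}")
    return url


upstream = (
    "https://files.pythonhosted.org/packages/ba/51/"
    "e123d997aa098c61d029f76663dedbfb9bc8dcf8c60cbd6adbe42f76d049/"
    "nvidia_cudnn_cu12-9.10.2.21-py3-none-manylinux_2_27_x86_64.whl"
)
print(to_mirror_url(upstream))
```

Failing loudly on the duplicate segment means a bad prefix is caught when MODULE.bazel is regenerated, instead of surfacing later as a silent fallback to PyPI.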

tools/update_module_http_archives.py#L31-L32

tensorcast/tools/update_module_http_archives.py

Lines 31 to 32 in 440d6ac


	MIRROR_PREFIX = "http://mirrors.i.basemind.com/pypi/packages/packages/"

MODULE.bazel#L95-L96

tensorcast/MODULE.bazel

Lines 95 to 96 in 440d6ac

    
    urls = [
        "http://mirrors.i.basemind.com/pypi/packages/packages/ba/51/e123d997aa098c61d029f76663dedbfb9bc8dcf8c60cbd6adbe42f76d049/nvidia_cudnn_cu12-9.10.2.21-py3-none-manylinux_2_27_x86_64.whl",


- Add `Binding` API (`artifact.bind`/`bind_into`) to hide slot mechanics
- Add `SwapKeyMapping` RPC + key mapping generation + cache TTL hints (daemon/GS)
- Allow `ttl_ms=0` for long-lived VRAM regions/published replicas; tighten PID watches
- Update Global Store schema/migration and extend tests


…r backfill
- implement startup auto mode singleflight flow with strict fail-fast semantics
- recover from stale READY/FAILED auto-state records when owner is dead, then re-elect and recreate daemon
- switch store runtime fallback from connect->create to connect->auto
- update API/runtime error hints and docs to include mode='auto', plus add Python tests
- allow ResolveArtifactFromDisk to backfill stale descriptor fields when verify_checksums=false (keep strict mismatch failure when true), with daemon tests

…den chaos validation
- propagate request budget/transport wait timeout from daemon RPC into materialization hints
- replace fixed reselection retries with budget-aware retry loops and bounded GS transport waits
- improve global-store transport safety via claim rollback, idempotent completion, and profile-based stale cleanup
- enrich Python retry mapping/diagnostics with transient reason buckets and budget metadata
- add and harden cross-host chaos tooling (runner/gate scripts, schema/configs, failure accounting, GPU preflight)


…hmark workflow
Add unregister drain + MTCP completion tracking and tensor read leases for safe retirement; enable stable_dram stage_on_gpu=false CPU streaming with registration chunk ingest and pending alias cleanup to avoid stale-retention OOM; improve GS/daemon timeout-cancel handling, heartbeat buffering/metrics, alias TTL policy, endpoint takeover, and BatchGetReplicaCounts; add cross-host runner/config/docs and related C++/Python tests.

… bind retry path
- return FAILED_PRECONDITION in key-mapping upsert/swap when artifact index is not ready
- inject artifact/index repositories into KeyMappingRpcHandler and extend RPC tests
- reuse preallocated TP rank targets across bind retries; add CUDA OOM fast-fail and compact retry errors
- harden cross-host daemon start/stop cleanup with timeout and TERM/KILL fallback
- add TP bind retry regression test and update GS/benchmark documentation

- generate deterministic mapped view IDs (mapped:v1:sha256) from canonical index, source view, copy plan, and target layout
- enable mapped binding publish on bind/swap (publish=True) by returning target_write_token and publishing VIEW byte-space replicas
- accept opaque selection.view_id and prefer view-byte-space transport with canonical fallback for mixed-version compatibility
- refactor feed upload to span streaming with configurable gRPC message limits, chunk sizing, timeout, and progress logging
- add stable_dram CPU-stream uploader worker auto-tuning and concurrent non-overlapping span uploads
- allow out-of-order stable_dram chunk ingest with overlap rejection and commit-time full-coverage checks
- propagate cpu_shared_memory_enabled through registration and fix CPU memfd lookup to use replica allocation keys
- harden cross-host daemon orchestration, add CPU stream micro-bench tooling, and sync tests/docs

… diffusion fixes
- extend stable_dram proto handshake with publish_cpu_memfd handle/lease and add stable_dram_write_progress range feed
- add registration CPU memfd info + range-only ingest APIs in StoreEngine/RegistrationBackend, with strict overlap/full-span validation
- split stable_dram commit modes (cpu_stream vs cpu_memfd_publish), return stable_cache_admitted, and skip redundant local stable admission
- mint/release stable_dram publish handle leases in registration controller and release lease before commit to avoid double stable-budget charging
- add Python uploader cpu_memfd path (LocalHandle FD exchange + mmap write + written-range ack) while keeping stream fallback
- relax subset publish for view-scoped byte-space and update inplace slot publishability checks
- improve transport behavior with short view-route probe timeout then canonical fallback; force export on target materialization for better diffusion
- preserve replica in-flight counters on re-registration and prefer new idle sources on tie-break
- add/expand C++ and Python tests plus 0082 design/plan/benchmark documentation updates

…nt transport requests
- add transport request_id/requester_worker_id/scheduling_group across proto, SDK, daemon, and store client
- introduce pending transport queue and GROUP_DISPATCH scheduler with fairness/aging/scan/batch controls
- require explicit completion outcome/detail and persist transport outcome for progress/accountability
- add source-balance dispatch metrics and grouped TP-version benchmark wiring
- update schema/config/docs and extend transport service/repository/integration tests

…isher flow
- Add QueryTransportWindow RPC in Global Store (proto/repository/service/rpc) with coverage tests
- Switch cross-host runner transport probe/throughput audit to GS RPC and add publish/put bandwidth metrics
- Default daemon/store to cpu_shared_memory enabled, auto-fill stable_bytes=64MB, and auto-select local handle socket
- Relax fake-CUDA startup sizing checks and improve SDK cleanup/validation paths (region force-cleanup + fake-backend CPU put support)
- Simplify weight_publisher/e2e behavior (always pre-publish trim, p2p-only receiver fallback, tensor_dict replica unload) and update docs

…oling
- fix dataplane pump shutdown/drain flow and recover producer-owned streaming buffer slots
- close timed-out remote key channels and add richer pinned-buffer wait context for debugging
- decouple HA heartbeat RPC from control-plane lock contention and guard liveness by epoch
- add global-store transient transaction conflict handling and cluster runtime RPC support
- add cross-host transport probes, early gate/summarizer tooling, and new benchmark daemon/global configs
- refresh cross-host benchmark docs/playbook and add regression tests for adapters, repos/services, probes, and TP bind retry paths

…bench flow
- add group-source spread and soft-cap controls to group dispatch scheduling
- recycle failed request ids, normalize malformed transport rows, and relax tp_version replay contract on view/artifact variance
- keep tp bind request ids stable across retries, preserve completed ranks on retry, and tune per-rank timeout handling
- add pre-publish trim margin, ABBA suite/preflight tooling, remote daemon startup retry, and update 0083 steady benchmark docs

…rt flow
- add SDK PortConfig overrides for daemon and Global Store ports
- support launcher-only daemon envs from config and pass explicit p2p port overrides
- make local Global Store start fail on existing healthy instances and rotate stale cluster tokens
- strengthen auto-mode singleflight startup and connect-mode error handling
- prewarm GPU hash NVRTC kernels at daemon startup and disable real-CUDA CPU fallback
- trust/backfill artifact descriptors on disk import and skip redundant rehashing
- cap materialization concurrency by daemon worker budget and improve disk fallback behavior
- remove GetArtifactOptions.prefer and document execution-only source policy semantics

* fix(dev): 1) fix torch version check; 2) sync bazel module sources with uv.lock
* fix(dev): explicitly add pytz as dependency to avoid import error in duckDB (e.g. with python>=3.11)
* fix(store): 1) add more logging points for strided load in byte_range_mapped_source; 2) reuse StridedBlock cache; 3) add direct gather path for CPU mem in strided load
* fix(store): add CPU memfd region in local replica handle

- route artifact.bind() through bind_into with preallocated CUDA targets
- drop DeferredLoader/InplaceSlot from the public SDK surface
- update docs, examples, and tests to use the unified Binding contract
- prefer disk-first materialization for local-import view/subset bindings

…ading
- add DiskArtifactContext to reuse disk scans, safetensors fd/mmap state, and index metadata
- route DiskLoader and metadata_stage through the shared context with new tests
- relax replica byte-mapping thresholds for mmap-capable disk sources
- add the h2d_nccl_broadcast_baseline benchmark mode and docs


- stop marking canonicalized indexes as safetensors when only a source index is backfilled
- build mmap mapping options before moving the seekable source into the mapped wrapper
- switch dataplane tests to scoped temp dirs and align AUTO variant fallback assertions

chore: update torch/cuda related dependencies and Bazel configurations

440d6ac

chatgpt-codex-connector Bot reviewed Dec 23, 2025


cursor Bot reviewed Dec 23, 2025


wolegechu added 27 commits December 30, 2025 14:58

Merge branch 'main' into ychu/update-deps-to-torch280

3c0aa73

Merge branch 'main' into ychu/update-deps-to-torch280

9d10122

Merge branch 'main' into ychu/update-deps-to-torch280

4fefb35

fix: add cufile

2326d3c

Merge branch 'main' into ychu/update-deps-to-torch280

04d2769

fix: fix build

4b09da8

Merge branch 'main' into ychu/update-deps-to-torch280

4dad4b2

Merge branch 'main' into ychu/update-deps-to-torch280

2756d26

fix: enhance artifact resolution with tensor index support and some fix

cd1dbf5

fix: improve error handling in gRPC operations

9b16bf9

feat(materialization): add safetensors canonical bytespace + source-o…

540a45a

…rdered window scheduling

feat(api,daemon): add mapped binding materialization

0345069

feat(persistence,daemon,api): add managed shared-disk persistence and…

2513305

… disk index tracking

feat(daemon): implement startup memory preflight checks

b168e3b

feat(api,cli): enhance Global Store initialization with config path s…

fc2084a

…upport

refactor: simplify materialization error handling

818358b

feat: add replica export_state + transport metadata

035a0e8

Merge branch 'main' into ychu/update-deps-to-torch280

00e4cc4

feat(weight-publisher): introduce Weight Publisher tool

4a8ba2d

feat(materialization): enhance materialization process with DiskSourc…

aabf7b0

…e support

refactor(global_store): extract operation rpc handler

81ba104

refactor(global_store): extract binding and key mapping rpc handlers

ee32cec

refactor(global_store): extract transport and index rpc handlers

b10f62e

refactor(global_store): extract worker instance and ha rpc handler

faf8f50

refactor(global_store): extract chunk placement and disk rpc handlers

a82031a

wolegechu and others added 30 commits February 22, 2026 18:40

feat(cross-host): add chaos launcher and fix inactive endpoint re-reg…

904bf35

…istration

Merge branch 'main' into ychu/update-deps-to-torch280

1b608fe

fix: fix daemon config

fe2c9e4

feat(binding): add daemon-owned owner path for Artifact.bind

5c3ee9d

feat(materialization): add collective disk load and metadata repair

506c467

chore(build): package generated protos and add maintenance tools

73741e5

feat(core/store): defer daemon startup and optimize collective disk l…

6a09f72

…oads

feat(python-sdk): make collective disk loads explicit and expose bind…

179d7e3

… diagnostics

feat(runtime): distinguish daemon listening from rpc readiness

7edcea6

docs(binding): add 0085 design and steptron integration guidance

a352e0e

feat(communicator): add RDMA benchmark binary and transport updates

6b196f6

feat(tools): add remote communicator benchmark runners and reports

52f1bc2

feat(assembly): persist binding contributions and wire coordinator flow

045b0df

feat(python): add binding state workflow and inplace slot updates

88437bb

docs: update single-nic RDMA benchmark charts

21e102d

docs: refresh aligned communicator single-nic reference

918de80


wolegechu wants to merge 139 commits into main from ychu/update-deps-to-torch280
