Conversation

@pragupta pragupta commented Dec 5, 2025

rocm_base: f742da3

cyyever and others added 30 commits December 2, 2025 00:36
This PR applies Python 3.10 typing syntax to some files.
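A minimal before/after illustration of the Python 3.10 typing syntax referred to here (the specific functions are made up for the example):

```python
from typing import List, Optional, Union  # pre-3.10 style


def old_style(x: Optional[int], y: Union[int, str]) -> List[int]:
    return [x or 0]


# Python 3.10+ syntax: PEP 604 unions and builtin generics
def new_style(x: int | None, y: int | str) -> list[int]:
    return [x or 0]
```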

Pull Request resolved: pytorch#169229
Approved by: https://github.com/Lucaskabela
…#168988)

Previously would fail deserialization since

```
SDPBackend.__name__      → "SDPBackend"
SDPBackend.__qualname__  → "_SDPBackend"
```
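A toy reproduction of how the two attributes can diverge and trip up a serializer that records one attribute but resolves the class via the other (the enum below is only illustrative, not the real torch binding):

```python
import enum


class _SDPBackend(enum.Enum):  # defined under a private name
    FLASH_ATTENTION = 1


SDPBackend = _SDPBackend
SDPBackend.__name__ = "SDPBackend"  # public name patched, __qualname__ is not

print(SDPBackend.__name__)      # "SDPBackend"
print(SDPBackend.__qualname__)  # "_SDPBackend"
# A serializer that stores __qualname__ but looks the class up again by its
# public name (or vice versa) fails on deserialization.
```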

Pull Request resolved: pytorch#168988
Approved by: https://github.com/zhxchen17
ghstack dependencies: pytorch#168989, pytorch#169008
…orch#169244)

As discussed offline with @jamesjwu and @aorenste, in a precompile world
it's actually not necessary to ensure we can safely generate a unique
cache key, since there is no implicit sharing. This PR adds a fallback so that
in cases where we can't safely generate a key for caching (e.g. certain HOPs),
we still generate a random nonce key for precompile.
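A rough sketch of the fallback described above, with hypothetical helper names (`stable_cache_key`, `BypassCacheKey`) standing in for the real key-derivation path:

```python
import uuid


class BypassCacheKey(Exception):
    """Hypothetical: raised when a stable key cannot be derived safely."""


def stable_cache_key(artifact) -> str:
    """Hypothetical stand-in for the real content-derived key."""
    raise BypassCacheKey  # e.g. the graph contains a HOP we cannot hash


def cache_key_or_nonce(artifact) -> str:
    # If a safe, content-derived key cannot be produced, fall back to a
    # random nonce: precompile has no implicit sharing, so uniqueness
    # rather than stability is all that is needed.
    try:
        return stable_cache_key(artifact)
    except BypassCacheKey:
        return f"precompile-nonce-{uuid.uuid4().hex}"
```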

Pull Request resolved: pytorch#169244
Approved by: https://github.com/Lucaskabela
ghstack dependencies: pytorch#168989, pytorch#169008, pytorch#168988
)

This PR updates the NCCL version for CUDA13 from 2.27.7 to 2.28.9.

2.28.9 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.9-1
2.28.7 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.7-1
2.28.3 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.3-1

CUDA 12 remains at 2.27.5 and is untouched by this PR.

Reference PR: pytorch#166174
Pull Request resolved: pytorch#168091
Approved by: https://github.com/atalman
- When tensor numel is 0, we let the hash be 0 instead of hashing, because torch.hash_tensor doesn't work for 0 numel tensors
- Add some tests for distributed
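A minimal sketch of the guard described in the first bullet (the zero value and its dtype here are assumptions for illustration):

```python
import torch


def safe_hash_tensor(t: torch.Tensor) -> torch.Tensor:
    # torch.hash_tensor does not support empty tensors, so report a
    # fixed hash of 0 when numel() == 0 instead of calling it.
    if t.numel() == 0:
        return torch.zeros((), dtype=torch.uint64, device=t.device)
    return torch.hash_tensor(t)
```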
Pull Request resolved: pytorch#169027
Approved by: https://github.com/xmfan, https://github.com/ngimel
Summary: Fix pytorch#167630. There was a reference cycle between GraphLowering and CppWrapperCpu due to caching, which made GraphLowering unnecessarily hold some constant tensors, causing GPU memory leaks. This PR fixes that by changing the cache to use the object id of GraphLowering as part of the key.
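An illustrative sketch of the caching change (names simplified): keying on `id(graph)` rather than holding the `GraphLowering` (or one of its bound methods) means the cache no longer keeps a strong reference to it, so the graph and its constant tensors can be freed:

```python
_wrapper_cache: dict[tuple[int, str], str] = {}


def get_cached_code(graph, op_name: str) -> str:
    # The object id is part of the key; the graph object itself is not
    # stored, so no reference cycle keeps it (and its GPU constants) alive.
    key = (id(graph), op_name)
    if key not in _wrapper_cache:
        _wrapper_cache[key] = f"generated wrapper code for {op_name}"
    return _wrapper_cache[key]
```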

Pull Request resolved: pytorch#168063
Approved by: https://github.com/yushangdi
… and bool datatypes (pytorch#165885)

**Summary:**
Support masked vectorization for the tail_loop for int32, int64 and bool datatypes

Pull Request resolved: pytorch#165885
Approved by: https://github.com/mingfeima, https://github.com/jansel
Summary: Fix pytorch#168398. When Inductor creates subgraphs, they should inherit device information from the parent graph.

Pull Request resolved: pytorch#169001
Approved by: https://github.com/yushangdi
pytorch#169088)

Debugging numeric differences for AutoParallel PP between Local Tensor and a multi-process
setup revealed differences in how rng offsets are computed. This change refactors the DTensor
implementation so that it can be shared with Local Tensor. The existing Local Tensor
implementation incorrectly computed the shard linear index based on the number of elements
in the tensor instead of the shard coordinates.

AutoParallel PP slices the world mesh into "pp" submeshes for MPMD execution and "dp_mod_ep, ep"
submeshes for SPMD execution. Local Tensor uses the default process group (corresponding to the
world mesh) to compute collective groups and assumes input local tensors have ranks from
the world mesh. Local Tensor mode can be created with a subset of ranks; this feature is
used in the AutoParallel PP integration. Therefore, this change modifies Local Tensor collectives
to execute only if all ranks from the deduced rank groups are present on the local tensor inputs.
Pull Request resolved: pytorch#169088
Approved by: https://github.com/dolpm
…Dict variable tracker (pytorch#169204)

Fixes pytorch#167956

## Summary

  This PR decentralizes and improves the hash implementation for dictionary keys in Dynamo's ConstDictVariable tracker. Instead of maintaining a centralized list of hashable types and custom equality logic in
  _HashableTracker, we now delegate hashability checks, hash computation, and equality comparison to individual VariableTracker subclasses.

## Motivation

  The previous implementation had several issues:

  1. Centralized logic: All hashability checks and hash computations were centralized in dicts.py, making it difficult to add support for new hashable types
  2. Maintainability: Adding a new hashable type required modifying multiple locations in _HashableTracker (underlying_value, _eq_impl, and the is_hashable function)
  3. Scattered knowledge: Type-specific hashing logic was separated from the type's own implementation
  4. Limited extensibility: No clear protocol for VariableTracker subclasses to declare themselves as hashable

## Changes

  New Protocol Methods

  Added three new methods to the VariableTracker base class:

  1. is_python_hashable(): Returns whether the underlying Python object is hashable
  2. get_python_hash(): Computes the hash value for the underlying Python object
  3. is_python_equal(other): Checks Python-level equality between two VariableTrackers

  The base implementation raises unimplemented() with helpful error messages, and subclasses override these methods as appropriate.

## Simplified _HashableTracker

  The _HashableTracker class in ConstDictVariable is now much simpler:

  - Removed underlying_value property (centralized type handling)
  - Removed _eq_impl static method (centralized equality logic)
  - Simplified __hash__() to delegate to vt.get_python_hash()
  - Simplified __eq__() to delegate to vt.is_python_equal()

## Decentralized Implementations

  Implemented the new protocol methods across relevant VariableTracker subclasses:

  - ConstantVariable, TensorVariable, TupleVariable, ListVariable
  - FrozensetVariable, FrozenDataClassVariable
  - BuiltinVariable, UserFunctionVariable, SkipFunctionVariable
  - FunctoolsPartialVariable, WeakRefVariable
  - NumpyVariable, NNModuleVariable, MethodWrapperVariable
  - TorchInGraphFunctionVariable, TorchHigherOrderOperatorVariable
  - TypingVariable, UserDefinedObjectVariable, UserDefinedClassVariable
  - SymNodeVariable, EnumVariable

## Enhanced Test Coverage

  Added 14 new test cases covering various hashable types as dictionary keys:

  - range, tuples, enums, frozensets
  - Typing constructs (e.g., typing.Union)
  - NumPy dtypes, method wrappers
  - Torch builtin functions, frozen dataclasses
  - Custom objects with __hash__
  - Negative test for unhashable types (lists)

## Improved Error Messages

  Updated error messages to be more informative when encountering unhashable types, showing both the Python type and the VariableTracker type.

Pull Request resolved: pytorch#169204
Approved by: https://github.com/jansel
…ytorch#169064)

Summary:
Opts to capture timeout errors during compilation without forcing process failure. This is useful to avoid hangs in MAST jobs.

We may want to add a configuration option for this, since never pruning bad config options can waste compute.

Test Plan: Tested with local model reproducers.

Differential Revision: D87866423

Pull Request resolved: pytorch#169064
Approved by: https://github.com/PaulZhang12
…ytorch#169207)

This PR avoids unnecessary return-value construction in `qconv_unpack.cpp`.

Pull Request resolved: pytorch#169207
Approved by: https://github.com/Skylion007
…ved_exception (pytorch#168337)

Fixes pytorch#168291

# Summary
Removes `ConstantVariable.create` wrapping in `raise_observed_exception` calls within `torch/_dynamo/variables/functions.py`.

# Context
The `raise_observed_exception` function handles the exception creation internally. Wrapping the error strings in `ConstantVariable` is unnecessary and can be simplified to passing raw strings.

# Test Plan
- [x] Verified syntax validity via `python3 -m py_compile torch/_dynamo/variables/functions.py`
- [ ] CI/CD (Existing tests should pass as this is a refactor of error reporting paths)

Pull Request resolved: pytorch#168337
Approved by: https://github.com/williamwen42, https://github.com/guilhermeleobas, https://github.com/cyyever
Summary:
Add the `torch.mtia.graph_pool_handle` API as the counterpart of `torch.cuda.graph_pool_handle`, which is used in vllm, e.g.
https://www.internalfb.com/code/fbsource/[f6d024bd45964d71810cbe1ed859f132f7f734cd]/fbcode/vllm/trunk/vllm/compilation/cuda_graph.py?lines=170
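A hedged usage sketch of the intended parity with CUDA (requires an MTIA build of PyTorch; the CUDA line is shown only for comparison):

```python
import torch

# Counterpart APIs for obtaining a graph memory-pool handle:
cuda_pool = torch.cuda.graph_pool_handle()  # existing CUDA API
mtia_pool = torch.mtia.graph_pool_handle()  # new MTIA API added by this change

# The handle is then passed wherever a shared graph memory pool is expected,
# e.g. when capturing multiple graphs that should share allocations.
```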

Test Plan:
```
buck2 run mtia/host_runtime/torch_mtia/tests:test_mtia_graph_py -- -r test_graph_pool_handle
```

Differential Revision: D88059625

Pull Request resolved: pytorch#169283
Approved by: https://github.com/patrick-toulme
…163249)

Add sum support for the qlinear_binary templated GEMM, and also allow sum as a choice for `QLinearPointwiseBinaryPT2E` in cases where x2 happens to be the output of `QLinearPointwiseBinaryPT2E`.

Pull Request resolved: pytorch#163249
Approved by: https://github.com/Xia-Weiwen, https://github.com/leslie-fang-intel, https://github.com/jansel
…r when CP is enabled (pytorch#167381)

Previously, CP-specific sharding strategies (which shard on the sequence dimension) were directly
included in the base sharding strategies for scaled_dot_product_attention operators in
`_matrix_ops.py`. This meant these strategies were always available, even when CP was not enabled,
which could lead to incorrect sharding behavior, as these sharding rules are not mathematically correct without CP.

1. **Created new module**:
`torch/distributed/tensor/experimental/_context_parallel/_sharding_rules.py`
   - Implements `op_strategy_context()` - a context manager for temporarily
registering/unregistering strategies (see the sketch after this list)
   - Defines CP-enhanced strategy functions for all 6 scaled_dot_product_attention ops (forward and
 backward for flash, efficient, and cudnn variants)
   - Provides `register_cp_sharding_rules()` and `unregister_cp_sharding_rules()` APIs

2. **Updated `_matrix_ops.py`**
   - Removed all CP-specific sharding rules (sequence dimension sharding strategies)
   - Base strategies now only contain replicate, tensor parallelism, and batch sharding
strategies

3. **Updated `_attention.py`**
   - `_enable_cp_dtensor_dispatcher()` now calls `register_cp_sharding_rules()` to dynamically add
CP strategies
   - ~`_disable_cp_dtensor_dispatcher()` now calls `unregister_cp_sharding_rules()` to restore
original strategies~ This will invalidate all the sharding prop caches. Disable it for now.
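A rough, generic sketch of the register/unregister context-manager pattern from item 1 (the registry and strategy objects here are placeholders; the real implementation operates on DTensor's sharding propagator):

```python
from contextlib import contextmanager


@contextmanager
def op_strategy_context(registry: dict, op, strategy):
    previous = registry.get(op)      # remember whatever was registered before
    registry[op] = strategy          # install the CP-enhanced strategy
    try:
        yield
    finally:
        if previous is None:         # restore the original registration state
            registry.pop(op, None)
        else:
            registry[op] = previous
```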

Pull Request resolved: pytorch#167381
Approved by: https://github.com/wconstab
…se_observed_exception (pytorch#168337)"

This reverts commit fb5be22.

Reverted pytorch#168337 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail some dynamo tests in trunk ([comment](pytorch#168337 (comment)))
…uards and verbose_code_parts (pytorch#169102)

Fix pytorch#168379.

1. The first solution sorts the ``___dict_contains`` guards based on their verbose part, and the improved tests validate that ordering. This was also suggested in https://fb.workplace.com/groups/1075192433118967/permalink/1650742858897252/ by sorting ``get_leaf_guards()`` in ``construct_manager_string``.
2. The second solution adopts an ``OrderedSet`` in setGuards during guard construction so that the ``contain_dict`` guards are displayed in the order they were added.

We decided to pursue the second option for simplicity and to avoid the sorting overhead.

Pull Request resolved: pytorch#169102
Approved by: https://github.com/anijain2305
Summary: Compress aoti stack (replace full paths with filenames).

Test Plan:
```
[nbeloborodov@devgpu031]~/fbsource/fbcode% strobe gpuevent --duration-ms=60000 --collect-kernel-events --kernel-sample-interval=0 --pids 1016951
Running "gpuevent" with run id -4456078642709746 and group_trace_id "" on hosts: ["::1"]
Press Ctrl-C to stop the run
> Queuing...	 (00:00:00.001)
> Preparing...	 (00:00:04.055)
> Profiling...	 (00:01:00.383)
> Processing...	 (00:00:00.643)
> Logging...	 (00:00:00.025)
> Finished
| Host | Return Code | Samples | Result Links                                               |
|------|-------------|---------|------------------------------------------------------------|
| ::1  | SUCCESS     | 4       | Raw samples:                                               |
|      |             |         | https://fburl.com/scuba/strobelight_gpu/on_demand/zsglu6sc |
|      |             |         |                                                            |
|      |             |         | Run Details:                                               |
|      |             |         | https://fburl.com/scuba/strobelight_runs/hmcuaz8u          |
```

Differential Revision: D88005763

Pull Request resolved: pytorch#169291
Approved by: https://github.com/yushangdi
Add reduce_scatter_tensor_out for use in fx passes to efficiently decompose reduce_scatter without concatenation.

Pull Request resolved: pytorch#168260
Approved by: https://github.com/wconstab
# Motivation
There are several issues related to the data type and precision that an accelerator supports (see pytorch#165038 and pytorch#143112). Sometimes, we have to check for these capabilities in the document, and then hard-code.  This PR proposes a new unified API for users to check their accelerator capabilities.

# Changes
This PR creates a new data structure `DeviceCapability` containing the capabilities that an accelerator commonly has:
- Supported dtypes (reported as supported by default):
  - `fp16`, `int32`, `complex`, etc.
- Other capabilities (to be discussed)

To access the structure, this PR defines a new Python API in the Accelerator module -- `get_device_capability`. It takes `device` as input and returns a dictionary of capabilities (currently `supported_dtypes` is the only key).

# Usage
```python
>>> import torch
>>> import torch_openreg
>>> torch.accelerator.get_device_capability('openreg:0')
{'supported_dtypes': [torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64, torch.float16, torch.float32, torch.float64, torch.complex32, torch.complex64, torch.complex128, torch.bool, torch.qint8, torch.quint8, torch.qint32, torch.bfloat16, torch.quint4x2, torch.quint2x4, torch.bits1x8, torch.bits2x4, torch.bits4x2, torch.bits8, torch.bits16, torch.float8_e5m2, torch.float8_e4m3fn, torch.float8_e5m2fnuz, torch.float8_e4m3fnuz, torch.uint16, torch.uint32, torch.uint64, torch.uint1, torch.uint2, torch.uint3, torch.uint4, torch.uint5, torch.uint6, torch.uint7, torch.int1, torch.int2, torch.int3, torch.int4, torch.int5, torch.int6, torch.int7, torch.float8_e8m0fnu, torch.float4_e2m1fn_x2]}
```
# TODO
- So far, supported dtypes/precision is the only capability being tracked, but more capabilities may be added over time, so the API should be designed for easy extension.
- It will support other in-tree accelerators, such as **cuda** and **mps**.
- Clarify whether the capabilities are software or hardware supported. (By @guangyey )

Pull Request resolved: pytorch#165631
Approved by: https://github.com/guangyey, https://github.com/albanD

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
chuanqi129 and others added 25 commits December 5, 2025 02:48
Follows pytorch#166723, including the changes below:

- Add XPU support package 2025.3 build and test in CI for both Linux and Windows
- Keep the XPU support package 2025.2 build in CI to ensure nothing breaks until the PyTorch 2.10 release
- Upgrade XPU support package from 2025.2 to 2025.3 in CD for both Linux and Windows
- Update XPU runtime pypi packages dependencies of CD wheels

Pull Request resolved: pytorch#166829
Approved by: https://github.com/atalman
Remove unneeded header inclusion in C++ source files.

Pull Request resolved: pytorch#169200
Approved by: https://github.com/albanD
This allows compiling a matmul on 2 DTensors with fully unbacked sizes, when a zero-cost strategy is available.

Changes with the PR:
- `mark_unbacked()` would previously error on tensor subclasses; now for DTensors it allocates unbacked symbols for both inner & outer sizes. The main motivation here is for testing, so happy to tweak semantics. The unbacked binding search process also now matches on DTensor outer sizes.
- Selecting an op strategy in sharding propagation is based on minimal redistribution costs, and these costs are functions of tensor shapes, so can be unbacked expressions. This PR makes this process more unbacked-friendly, choosing negative or zero-cost strategies when they're available. When these "trivial" strategies aren't available, selection requires comparing unbacked costs, addressed in the next PR (with usage of fallback hints).
- For matmul strategies, sharding prop rules filter out strategies where the matmul inputs fail the `is_tensor_shardable` check on the given DeviceMesh. In eager, this filters out cases where `size of sharded dim < num shards`. In the compiled & unbacked case, we'll often encounter dim size `u_` where `u_` can be both larger and smaller than num shards. This PR assumes such cases are shardable by default, and the implication is that strategies that shard on unbacked dimensions are included for consideration, and if selected, can lead to uneven sharding/zero-size shards at runtime. Alternatives would be 1) the current state of things: DDE and force the user to pick a path: `torch._check(size of sharded dim < or >= num shards)`, or 2) assume the non-shardable case and never include sharded strategies, unless the user picks the shardable path. More discussion in pytorch#165034 (comment).
- Lastly, testing traced redistribution decisions required using aot_eager backend, so that the collectives/ops were hardcoded (eager backend would go through DTensor.dispatch again). This seemed to require re-enabling proxy tracking during shard prop, basically reverting pytorch#163126. Otherwise, errors like `RuntimeError: Max(1, u2) (<class 'torch.SymInt'>, 140294330350224)is not tracked with proxy for <torch.fx.experimental.proxy_tensor.PythonKeyTracer object at 0x7f98d1b14af0>` show up for DTensor outer strides...

Pull Request resolved: pytorch#168051
Approved by: https://github.com/laithsakka
Caffe2 used to have an OpenCV integration path, but this is not the case for PyTorch.
Pull Request resolved: pytorch#169609
Approved by: https://github.com/jathu, https://github.com/atalman
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: pytorch#165274
Approved by: https://github.com/pytorchbot
…rgs (pytorch#168127)

Fixes: pytorch#168124
This PR fixes Triton compilation failures in combo kernels when combining multiple kernels with random ops (or any ops that create args with a value equal to 1). The fix adds the missing logic to populate the `constants` for args marked as compile-time constants, matching the behavior of regular Triton kernels.

Pull Request resolved: pytorch#168127
Approved by: https://github.com/mlazos
ghstack dependencies: pytorch#167781
This PR fixes some slotscheck warnings, for example:
```
ERROR: 'torch._inductor.cudagraph_trees:AliasesNewOutput' has slots but superclass does not.
ERROR: 'torch._inductor.cudagraph_trees:AliasesPriorGraphOutput' has slots but superclass does not.
ERROR: 'torch._subclasses.fake_tensor:_BypassDispatchCache' has slots but superclass does not.
ERROR: 'torch.distributed._functional_collectives:AsyncCollectiveTensor' has slots but superclass does not.
ERROR: 'torch.distributed.elastic.timer.file_based_local_timer:FileTimerRequest' defines overlapping slots.
ERROR: 'torch.distributed.tensor._shards_wrapper:LocalShardsWrapper' has slots but superclass does not.
ERROR: 'torch.distributed.tensor:DTensor' has slots but superclass does not.
ERROR: 'torch.multiprocessing.spawn:ProcessException' has slots but superclass does not.
ERROR: 'torch.package.package_importer:_ModuleNode' has slots but superclass does not.
ERROR: 'torch.sparse.semi_structured:SparseSemiStructuredTensor' has slots but superclass does not.
ERROR: 'torch.testing._internal.logging_tensor:LoggingTensor' has slots but superclass does not.
```
The fixes work by adding `__slots__` to the parent classes.
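A generic illustration of the warning and the fix pattern (the class names are made up):

```python
class Node:
    __slots__ = ("name",)            # added: the parent previously had no __slots__,
                                     # so subclasses still carried a __dict__

    def __init__(self, name: str) -> None:
        self.name = name


class ModuleNode(Node):
    __slots__ = ("children",)        # now effective: no per-instance __dict__

    def __init__(self, name: str) -> None:
        super().__init__(name)
        self.children: list["ModuleNode"] = []


assert not hasattr(ModuleNode("root"), "__dict__")
```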

Pull Request resolved: pytorch#169348
Approved by: https://github.com/Skylion007
…)"

This reverts commit 8b683e5.

Reverted pytorch#169609 on behalf of https://github.com/malfet due to It causes inductor tests to fail ([comment](pytorch#169609 (comment)))
…uards and verbose_code_parts (pytorch#169102)

Fix pytorch#168379.

1. The first solution sorts the ``___dict_contains`` guards based on their verbose part, and the improved tests validate that ordering. This was also suggested in https://fb.workplace.com/groups/1075192433118967/permalink/1650742858897252/ by sorting ``get_leaf_guards()`` in ``construct_manager_string``.
2. The second solution adopts an ``OrderedSet`` in setGuards during guard construction so that the ``contain_dict`` guards are displayed in the order they were added.

We decided to pursue the second option for simplicity and to avoid the sorting overhead.

Pull Request resolved: pytorch#169102
Approved by: https://github.com/anijain2305
This PR fixes unused context managers detected by pylint.

Pull Request resolved: pytorch#169447
Approved by: https://github.com/malfet
)

This PR fixes a FailedMatch format-string bug, part of issue pytorch#169440.

Pull Request resolved: pytorch#169611
Approved by: https://github.com/ProExpertProg, https://github.com/zou3519
ghstack dependencies: pytorch#169603
This PR is for the **yellow block** of the design flow below.

<img width="3880" height="2240" alt="image" src="https://github.com/user-attachments/assets/1774b9c1-5a72-4ebe-9e5c-70072cc7f320" />

Assuming users mark some nodes with metadata `should_offload` (which could later be replaced by automatic compiler decisions), the implementation first checks whether the tensors are safe to offload via the `can_offload()` function. For tensors that are marked as `should_offload` and pass the `can_offload()` check, we modify the forward and backward graphs: in the forward graph we offload the tensors to CPU, and in the backward graph we reload them to GPU.

We introduce two flags:
* `enable_activation_offloading` -- with this turned on, we insert offload/reload nodes into the fwd/bwd graphs.
* `activation_offload_separate_stream` -- with this turned on, we wrap the offload/reload nodes in a separate stream with the correct waits for, e.g., data dependencies. (Note that from a "frontend" perspective everything is the same; the only difference is that the memcpy is now moved to a separate stream, which prepares for subsequent reordering and overlapping. See the sketch below.)
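A conceptual sketch (not the Inductor pass itself) of offloading to CPU on a side stream in forward and reloading in backward, with explicit waits for the data dependencies; pinned CPU memory is assumed for the copies to actually be asynchronous:

```python
import torch

copy_stream = torch.cuda.Stream()


def offload(t: torch.Tensor) -> torch.Tensor:
    copy_stream.wait_stream(torch.cuda.current_stream())  # producer finished
    with torch.cuda.stream(copy_stream):
        cpu_t = t.to("cpu", non_blocking=True)            # D2H memcpy off the main stream
    t.record_stream(copy_stream)                          # keep t alive until the copy is done
    return cpu_t


def reload(cpu_t: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    with torch.cuda.stream(copy_stream):
        gpu_t = cpu_t.to(device, non_blocking=True)       # H2D memcpy off the main stream
    torch.cuda.current_stream().wait_stream(copy_stream)  # consumer waits for the reload
    return gpu_t
```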

For an [example model](https://gist.github.com/xuanzhang816/ee2e3648123670f14ced9963858ee3b4), we have
1. baseline (i.e., no activation offloading)

- 9160 MB peak memory
- 157.42 ms per-iteration runtime
- trace for fwd and bwd as below:
<img width="845" height="52" alt="image" src="https://github.com/user-attachments/assets/2868416e-e5a6-42a6-985f-891c5d8edb72" />

2. AO (i.e., with `enable_activation_offloading=True`)

- 7460 MB peak memory
- 224.15 ms per-iteration runtime
- trace for fwd and bwd as below (the "M" blocks are for Memcpy):
<img width="864" height="52" alt="image" src="https://github.com/user-attachments/assets/ec7b6326-b192-4915-af3f-824a2d3f7c49" />

3. AO on separate stream (i.e., additionally with `activation_offload_separate_stream = True`)

- 7460 MB peak memory
- 224.11 ms per-iteration runtime
- trace for fwd and bwd as below (the "M" blocks are for Memcpy):
<img width="891" height="121" alt="image" src="https://github.com/user-attachments/assets/2137691a-2c97-4208-80c6-cd7ae7fac7e0" />

Pull Request resolved: pytorch#167880
Approved by: https://github.com/eellison
)

We introduce two flags for computation <> communication overlap:
* `activation_offload_sink_wait` -- with this turned on, the wait events in the forward graph are sunk to the end of the graph so that the offload ops are not blocking
* `activation_reload_prefetch` -- with this turned on, the reloads in the backward graph are prefetched just early enough that the reload operations are fully overlapped with computation.

Continue with the [example model](https://gist.github.com/xuanzhang816/ee2e3648123670f14ced9963858ee3b4) in the first PR in the stack:

4. AO on separate stream with reorders (i.e., additionally with `activation_offload_sink_wait = True` and `activation_reload_prefetch = True`)

- 7534 MB peak memory
- 160.86 ms per-iteration runtime
- trace for fwd and bwd as below (the "M" blocks are for Memcpy):
<img width="869" height="123" alt="image" src="https://github.com/user-attachments/assets/6aa0c5df-5cc4-4711-8c1a-20f4a9d9145b" />

Pull Request resolved: pytorch#168316
Approved by: https://github.com/eellison
ghstack dependencies: pytorch#167880
…ecord_artifact (pytorch#169242)

I ran into this when trying to precompile simplefsdp and realized that deepcloning
a DeviceMesh within a fake mode causes the following error:

```
[rank0]:   File "/home/bobren/local/a/pytorch/spc.py", line 111, in <module>
[rank0]:     .aot_compile(((input_tensor, d_input_tensor), {"mesh": mesh}))
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_dynamo/eval_frame.py", line 800, in aot_compile
[rank0]:     return aot_compile_fullgraph(
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_dynamo/aot_compile.py", line 235, in aot_compile_fullgraph
[rank0]:     compiled_fn = backend(
[rank0]:   File "/home/bobren/local/a/pytorch/torch/__init__.py", line 2445, in __call__
[rank0]:     return compile_fx(model_, inputs_, config_patches=self.config)
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx.py", line 2525, in compile_fx
[rank0]:     return _maybe_wrap_and_compile_fx_main(
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx.py", line 2602, in _maybe_wrap_and_compile_fx_main
[rank0]:     return _compile_fx_main(
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx.py", line 2797, in _compile_fx_main
[rank0]:     return aot_autograd(
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_dynamo/backends/common.py", line 117, in __call__
[rank0]:     cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_functorch/aot_autograd.py", line 1119, in aot_module_simplified
[rank0]:     compiled_fn, _ = aot_stage2_compile(
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/graph_compile.py", line 348, in aot_stage2_compile
[rank0]:     return aot_stage2_autograd(aot_state, aot_graph_capture)
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/graph_compile.py", line 2017, in aot_stage2_autograd
[rank0]:     try_save_cache_entry, entry = _cache_autograd_info(
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/graph_compile.py", line 2196, in _cache_autograd_info
[rank0]:     entry = try_save_cache_entry(
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/graph_compile.py", line 2186, in try_save_cache_entry
[rank0]:     AOTAutogradCache.save(
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 905, in save
[rank0]:     raise e
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 889, in save
[rank0]:     PrecompileContext.record_artifact(artifact)
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_dynamo/precompile_context.py", line 147, in record_artifact
[rank0]:     cls._backend_artifacts_by_key[_BackendId(artifact.key)] = copy.deepcopy(
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 172, in deepcopy
[rank0]:     y = _reconstruct(x, memo, *rv)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 271, in _reconstruct
[rank0]:     state = deepcopy(state, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 146, in deepcopy
[rank0]:     y = copier(x, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 231, in _deepcopy_dict
[rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 172, in deepcopy
[rank0]:     y = _reconstruct(x, memo, *rv)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 271, in _reconstruct
[rank0]:     state = deepcopy(state, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 146, in deepcopy
[rank0]:     y = copier(x, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 231, in _deepcopy_dict
[rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 172, in deepcopy
[rank0]:     y = _reconstruct(x, memo, *rv)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 271, in _reconstruct
[rank0]:     state = deepcopy(state, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 146, in deepcopy
[rank0]:     y = copier(x, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 231, in _deepcopy_dict
[rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 146, in deepcopy
[rank0]:     y = copier(x, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 206, in _deepcopy_list
[rank0]:     append(deepcopy(a, memo))
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 172, in deepcopy
[rank0]:     y = _reconstruct(x, memo, *rv)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 271, in _reconstruct
[rank0]:     state = deepcopy(state, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 146, in deepcopy
[rank0]:     y = copier(x, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 231, in _deepcopy_dict
[rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 146, in deepcopy
[rank0]:     y = copier(x, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 211, in _deepcopy_tuple
[rank0]:     y = [deepcopy(a, memo) for a in x]
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 211, in <listcomp>
[rank0]:     y = [deepcopy(a, memo) for a in x]
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 172, in deepcopy
[rank0]:     y = _reconstruct(x, memo, *rv)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 271, in _reconstruct
[rank0]:     state = deepcopy(state, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 146, in deepcopy
[rank0]:     y = copier(x, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 231, in _deepcopy_dict
[rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 172, in deepcopy
[rank0]:     y = _reconstruct(x, memo, *rv)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 271, in _reconstruct
[rank0]:     state = deepcopy(state, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 146, in deepcopy
[rank0]:     y = copier(x, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 231, in _deepcopy_dict
[rank0]:     y[deepcopy(key, memo)] = deepcopy(value, memo)
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 153, in deepcopy
[rank0]:     y = copier(memo)
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_tensor.py", line 142, in __deepcopy__
[rank0]:     return handle_torch_function(Tensor.__deepcopy__, (self,), self, memo)
[rank0]:   File "/home/bobren/local/a/pytorch/torch/overrides.py", line 1733, in handle_torch_function
[rank0]:     result = mode.__torch_function__(public_api, types, args, kwargs)
[rank0]:   File "/home/bobren/local/a/pytorch/torch/utils/_device.py", line 109, in __torch_function__
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_tensor.py", line 180, in __deepcopy__
[rank0]:     new_storage = self._typed_storage()._deepcopy(memo)
[rank0]:   File "/home/bobren/local/a/pytorch/torch/storage.py", line 1139, in _deepcopy
[rank0]:     return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
[rank0]:   File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 153, in deepcopy
[rank0]:     y = copier(memo)
[rank0]:   File "/home/bobren/local/a/pytorch/torch/storage.py", line 243, in __deepcopy__
[rank0]:     new_storage = self.clone()
[rank0]:   File "/home/bobren/local/a/pytorch/torch/storage.py", line 257, in clone
[rank0]:     return type(self)(self.nbytes(), device=self.device).copy_(self)
[rank0]:   File "/home/bobren/local/a/pytorch/torch/utils/_stats.py", line 29, in wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1397, in __torch_dispatch__
[rank0]:     return self.dispatch(func, types, args, kwargs)
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2155, in dispatch
[rank0]:     return self._cached_dispatch_impl(func, types, args, kwargs)
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1544, in _cached_dispatch_impl
[rank0]:     output = self._dispatch_impl(func, types, args, kwargs)
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2823, in _dispatch_impl
[rank0]:     r = func(*args, **kwargs)
[rank0]:   File "/home/bobren/local/a/pytorch/torch/_ops.py", line 836, in __call__
[rank0]:     return self._op(*args, **kwargs)
[rank0]: RuntimeError: Attempted to set the storage of a tensor on device "meta" to a storage on different device "cpu".  This is no longer
 allowed; the devices must match.
 ```

As you can see, the underlying problem is that 1) we do a clone on the storage, 2) which under the hood calls `_copy`, 3) but when we call `copy_` the fake mode turns `self`, which is on "cpu", into a meta-device tensor. This PR fixes the issue by temporarily disabling dispatch when doing the deepclone.

Pull Request resolved: pytorch#169242
Approved by: https://github.com/bdhirsh

Co-authored-by: Bob Ren <bobrenjc@fb.com>
…orch#169487)

Summary:
PyTorch's expandable segments IPC capability was disabled in fbcode
due to job failures (see
pytorch#132890).  However, some use
cases like CTran require IPC functionality for multi-process GPU
communication. This change introduces
PYTORCH_CUDA_EXPANDABLE_SEGMENTS_IPC environment variable to allow
opt-in enablement of IPC handle types for expandable segments in
fbcode builds while maintaining backward compatibility.

IPC is enabled by default in non-fbcode builds and disabled by default
in fbcode builds (existing behavior). In both cases, it can be
explicitly enabled by setting PYTORCH_CUDA_EXPANDABLE_SEGMENTS_IPC=1.
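A small opt-in sketch; the allocator reads these settings at initialization, so they must be set before the first CUDA allocation (the exact timing requirement is an assumption here):

```python
import os

# Enable expandable segments and opt in to IPC handle types for them.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
os.environ["PYTORCH_CUDA_EXPANDABLE_SEGMENTS_IPC"] = "1"

import torch  # import (and any CUDA use) only after the env vars are set
```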

Test Plan: CI

Differential Revision: D88274246

Pull Request resolved: pytorch#169487
Approved by: https://github.com/ngimel
If it gets to the end of the list and doesn't find a green commit, the LATEST_SHA is None
Pull Request resolved: pytorch#169585
Approved by: https://github.com/huydhn
Summary:

Makes `torch.allclose(a, b, atol=0, rtol=0)` work for `a` and `b`
with dtype `torch.float4_e2m1fn_x2`. This is useful for testing.
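A minimal sketch; it assumes a `torch.float4_e2m1fn_x2` tensor can be obtained by viewing packed `uint8` data (two fp4 values per byte):

```python
import torch

packed = torch.randint(0, 256, (16,), dtype=torch.uint8)
a = packed.view(torch.float4_e2m1fn_x2)
b = packed.clone().view(torch.float4_e2m1fn_x2)

# With atol=0 and rtol=0 this amounts to an exact comparison, which is
# what you want when testing packed low-precision formats.
assert torch.allclose(a, b, atol=0, rtol=0)
```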

Test Plan:

```
pytest test/quantization/core/experimental/test_floatx.py -s -k test_float4_e2m1fn_x2
```

Pull Request resolved: pytorch#169575
Approved by: https://github.com/eqy, https://github.com/drisspg
Summary:

Enables `copy_` support for the `torch.float4_e2m1fn_x2` dtype. This is
useful when slicing a tensor across dim1 and then calling contiguous,
which can happen in vllm and therefore should be supported.
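A sketch of the slice-then-contiguous pattern this enables (same packed-`uint8` view assumption as above):

```python
import torch

x = torch.randint(0, 256, (4, 8), dtype=torch.uint8).view(torch.float4_e2m1fn_x2)
col_slice = x[:, 2:6]        # non-contiguous view across dim1
y = col_slice.contiguous()   # materializes the slice via copy_
```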

Test Plan:

```
pytest test/quantization/core/experimental/test_floatx.py -s -k test_float4_e2m1fn_x2
```

Pull Request resolved: pytorch#169595
Approved by: https://github.com/drisspg
ghstack dependencies: pytorch#169575
…ch#169261)

Add missing shape validation between `input` and `indices` tensors for `nn.MaxUnpool{1,2,3}d` on MPS backend

Fixes pytorch#169235

Pull Request resolved: pytorch#169261
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
…pytorch#167926)

Correct some grammatical and expression errors in the CONTRIBUTING.md file.
Fixes #ISSUE_NUMBER

Pull Request resolved: pytorch#167926
Approved by: https://github.com/mikaylagawarecki
This PR replaces unnecessary uses of thrust::pair with std::pair.

Pull Request resolved: pytorch#168941
Approved by: https://github.com/albanD
# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
#	.ci/docker/triton_version.txt

rocm-repo-management-api bot commented Dec 5, 2025

Jenkins build for b6e3d448ef3aff527c474515d81b3d0fd938bb7c commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

Detected error during base docker image building:

#61 17.17 + sudo -E -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env PATH=/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.12/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH= git clone --recursive https://github.com/ROCm/triton triton
#61 17.18 Cloning into 'triton'...
#61 30.03 + cd triton
#61 30.03 + as_jenkins git checkout '<<<<<<<' HEAD 5df9c723de8c23508773b07fe16dd34e4c444541 ======= 5261b27331eb1dd09df9ec1bd6acc21cbb184481 '>>>>>>>' upstream/main
#61 30.03 + sudo -E -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env PATH=/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.12/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH= git checkout '<<<<<<<' HEAD 5df9c723de8c23508773b07fe16dd34e4c444541 ======= 5261b27331eb1dd09df9ec1bd6acc21cbb184481 '>>>>>>>' upstream/main
#61 30.04 error: pathspec '<<<<<<<' did not match any file(s) known to git
#61 30.04 error: pathspec 'HEAD' did not match any file(s) known to git
#61 30.04 error: pathspec '5df9c723de8c23508773b07fe16dd34e4c444541' did not match any file(s) known to git
#61 30.04 error: pathspec '=======' did not match any file(s) known to git
#61 30.04 error: pathspec '5261b27331eb1dd09df9ec1bd6acc21cbb184481' did not match any file(s) known to git
#61 30.04 error: pathspec '>>>>>>>' did not match any file(s) known to git
