forked from pytorch/pytorch
[AUTOGENERATED] develop_IFU_20251205 #2857
Open
pragupta wants to merge 339 commits into develop from develop_IFU_20251205
+26,000 −11,536
Conversation
This PR applies Python 3.10 typing syntax to some files. Pull Request resolved: pytorch#169229 Approved by: https://github.com/Lucaskabela
…#168988) Previously would fail deserialization since ``` SDPBackend.__name__ → "SDPBackend" SDPBackend.__qualname__ → "_SDPBackend" ``` Pull Request resolved: pytorch#168988 Approved by: https://github.com/zhxchen17 ghstack dependencies: pytorch#168989, pytorch#169008
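A minimal sketch of the failure mode, outside of PyTorch internals: an enum class whose public `__name__` was rebound differs from its `__qualname__`, so any serializer that round-trips members via the qualified name fails on lookup. The rebinding below is illustrative, not how SDPBackend is actually constructed.

```python
import enum

class _SDPBackend(enum.Enum):  # defined under a private name
    FLASH_ATTENTION = 0

_SDPBackend.__name__ = "SDPBackend"  # re-exported publicly; __qualname__ stays "_SDPBackend"

print(_SDPBackend.__name__)      # SDPBackend
print(_SDPBackend.__qualname__)  # _SDPBackend

# A deserializer that resolves "module.<__qualname__>" would look for an
# attribute named "_SDPBackend" that the module only exposes as "SDPBackend".
```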
…orch#169244) As discussed offline with @jamesjwu and @aorenste, in a precompile world it's actually not necessary to ensure we can safely generate a unique cache key since there is no implicit sharing. This PR adds a fallback so in the case where we can't safely generate a key for caching (e.g. certain HOPs), we still generate a random nonce key for precompile. Pull Request resolved: pytorch#169244 Approved by: https://github.com/Lucaskabela ghstack dependencies: pytorch#168989, pytorch#169008, pytorch#168988
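A minimal sketch of that fallback pattern; `stable_cache_key` and `BypassCacheError` are hypothetical stand-ins for the real cache-key machinery.

```python
import uuid

class BypassCacheError(Exception):
    """Hypothetical signal that a safe, unique cache key cannot be derived."""

def stable_cache_key(fn) -> str:  # hypothetical: e.g. certain HOPs defeat safe keying
    raise BypassCacheError

def precompile_key(fn) -> str:
    try:
        return stable_cache_key(fn)
    except BypassCacheError:
        # Precompile has no implicit sharing, so a random nonce is acceptable.
        return f"nonce-{uuid.uuid4().hex}"
```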
) This PR updates the NCCL version for CUDA 13 from 2.27.7 to 2.28.9. 2.28.9 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.9-1 2.28.7 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.7-1 2.28.3 release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.28.3-1 CUDA 12 remains at 2.27.5 and is untouched by this PR. Reference PR: pytorch#166174 Pull Request resolved: pytorch#168091 Approved by: https://github.com/atalman
- When tensor numel is 0, we let the hash be 0 instead of hashing, because torch.hash_tensor doesn't work for 0 numel tensors - Add some tests for distributed Pull Request resolved: pytorch#169027 Approved by: https://github.com/xmfan, https://github.com/ngimel
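A minimal sketch of the guard described above, assuming `torch.hash_tensor` is available in the build; the uint64 scalar return type is an assumption.

```python
import torch

def hash_tensor_safe(t: torch.Tensor) -> torch.Tensor:
    if t.numel() == 0:
        # torch.hash_tensor does not support empty tensors; pin the hash to 0.
        return torch.zeros((), dtype=torch.uint64)
    return torch.hash_tensor(t)
```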
Summary: Fix pytorch#167630. There was a reference cycle between GraphLowering and CppWrapperCpu due to caching, which made GraphLowering unnecessarily hold some constant tensors, causing GPU memory leaks. This PR fixes that by changing the cache to use the object id of GraphLowering as part of the key. Pull Request resolved: pytorch#168063 Approved by: https://github.com/yushangdi
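A minimal sketch of the fix pattern (illustrative names, not the actual Inductor code): keying a module-level cache on `id(graph)` instead of capturing the graph object in a bound cache, so the cache no longer pins the GraphLowering and its constant tensors alive.

```python
_wrapper_cache: dict[tuple[int, str], str] = {}

def cached_codegen(graph, key: str) -> str:
    cache_key = (id(graph), key)  # identity of the GraphLowering is part of the key
    if cache_key not in _wrapper_cache:
        _wrapper_cache[cache_key] = f"generated code for {key}"  # placeholder work
    return _wrapper_cache[cache_key]
```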
… and bool datatypes (pytorch#165885) **Summary:** Support masked vectorization for the tail_loop for int32, int64 and bool datatypes Pull Request resolved: pytorch#165885 Approved by: https://github.com/mingfeima, https://github.com/jansel
Summary: Fix pytorch#168398. When Inductor creates subgraphs, they should inherit device information from the parent graph. Pull Request resolved: pytorch#169001 Approved by: https://github.com/yushangdi
pytorch#169088) Debugging numeric differences for AutoParallel PP between Local Tensor and the multi-process setup revealed differences in how rng offsets are computed. This change refactors the DTensor implementation so that it can be shared with Local Tensor. The existing Local Tensor implementation was incorrectly computing the shard linear index based on the number of elements in the tensor instead of the shard coordinates. AutoParallel PP slices the world mesh into "pp" submeshes for MPMD execution and "dp_mod_ep, ep" submeshes for SPMD execution. Local Tensor uses the default process group (corresponding to the world mesh) to compute collective groups and assumes input local tensors have ranks from the world mesh. Local Tensor mode can be created with a subset of ranks; this feature is used in the AutoParallel PP integration. Therefore this change modifies Local Tensor collectives to execute only if all ranks from the deduced rank groups are present on the local tensor inputs. Pull Request resolved: pytorch#169088 Approved by: https://github.com/dolpm
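A minimal sketch of the corrected computation: a shard's linear index comes from its mesh coordinates (row-major flattening), not from the tensor's element count. The helper name is illustrative; the actual DTensor code may differ.

```python
def shard_linear_index(coords: list[int], mesh_shape: list[int]) -> int:
    idx = 0
    for coord, dim_size in zip(coords, mesh_shape):
        idx = idx * dim_size + coord  # row-major flattening of mesh coordinates
    return idx

assert shard_linear_index([1, 2], [2, 4]) == 6  # coordinate (1, 2) on a 2x4 mesh
```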
…Dict variable tracker (pytorch#169204) Fixes pytorch#167956 ## Summary This PR decentralizes and improves the hash implementation for dictionary keys in Dynamo's ConstDictVariable tracker. Instead of maintaining a centralized list of hashable types and custom equality logic in _HashableTracker, we now delegate hashability checks, hash computation, and equality comparison to individual VariableTracker subclasses. ## Motivation The previous implementation had several issues: 1. Centralized logic: All hashability checks and hash computations were centralized in dicts.py, making it difficult to add support for new hashable types 2. Maintainability: Adding a new hashable type required modifying multiple locations in _HashableTracker (underlying_value, _eq_impl, and the is_hashable function) 3. Scattered knowledge: Type-specific hashing logic was separated from the type's own implementation 4. Limited extensibility: No clear protocol for VariableTracker subclasses to declare themselves as hashable ## Changes New Protocol Methods Added three new methods to the VariableTracker base class: 1. is_python_hashable(): Returns whether the underlying Python object is hashable 2. get_python_hash(): Computes the hash value for the underlying Python object 3. is_python_equal(other): Checks Python-level equality between two VariableTrackers The base implementation raises unimplemented() with helpful error messages, and subclasses override these methods as appropriate. ## Simplified _HashableTracker The _HashableTracker class in ConstDictVariable is now much simpler: - Removed underlying_value property (centralized type handling) - Removed _eq_impl static method (centralized equality logic) - Simplified __hash__() to delegate to vt.get_python_hash() - Simplified __eq__() to delegate to vt.is_python_equal() ## Decentralized Implementations Implemented the new protocol methods across relevant VariableTracker subclasses: - ConstantVariable, TensorVariable, TupleVariable, ListVariable - FrozensetVariable, FrozenDataClassVariable - BuiltinVariable, UserFunctionVariable, SkipFunctionVariable - FunctoolsPartialVariable, WeakRefVariable - NumpyVariable, NNModuleVariable, MethodWrapperVariable - TorchInGraphFunctionVariable, TorchHigherOrderOperatorVariable - TypingVariable, UserDefinedObjectVariable, UserDefinedClassVariable - SymNodeVariable, EnumVariable ## Enhanced Test Coverage Added 14 new test cases covering various hashable types as dictionary keys: - range, tuples, enums, frozensets - Typing constructs (e.g., typing.Union) - NumPy dtypes, method wrappers - Torch builtin functions, frozen dataclasses - Custom objects with __hash__ - Negative test for unhashable types (lists) ## Improved Error Messages Updated error messages to be more informative when encountering unhashable types, showing both the Python type and the VariableTracker type. Pull Request resolved: pytorch#169204 Approved by: https://github.com/jansel
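A minimal sketch of the decentralized protocol; the method names match the PR description, but the bodies are illustrative, not the actual Dynamo code.

```python
class VariableTracker:
    def is_python_hashable(self) -> bool:
        return False  # base: not hashable unless a subclass opts in

    def get_python_hash(self) -> int:
        raise NotImplementedError(f"{type(self).__name__} is not hashable")

    def is_python_equal(self, other: "VariableTracker") -> bool:
        raise NotImplementedError

class ConstantVariable(VariableTracker):
    def __init__(self, value):
        self.value = value

    def is_python_hashable(self) -> bool:
        return True

    def get_python_hash(self) -> int:
        return hash(self.value)  # delegate to the underlying Python object

    def is_python_equal(self, other: "VariableTracker") -> bool:
        return isinstance(other, ConstantVariable) and self.value == other.value

class _HashableTracker:
    """Dict-key wrapper, now a thin delegator to the wrapped VariableTracker."""

    def __init__(self, vt: VariableTracker):
        self.vt = vt

    def __hash__(self) -> int:
        return self.vt.get_python_hash()

    def __eq__(self, other) -> bool:
        return isinstance(other, _HashableTracker) and self.vt.is_python_equal(other.vt)
```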
…ytorch#169064) Summary: Opts to capture timeout errors during compilation without forcing process failure. Useful to avoid hangs in MAST jobs. We may want to consider a configuration option for this to avoid wasted compute by never pruning bad config options. Test Plan: Tested with local model reproducers. Differential Revision: D87866423 Pull Request resolved: pytorch#169064 Approved by: https://github.com/PaulZhang12
…ytorch#169207) This PR avoids unnecessary return-value construction in `qconv_unpack.cpp`. Pull Request resolved: pytorch#169207 Approved by: https://github.com/Skylion007
…ved_exception (pytorch#168337) Fixes pytorch#168291 # Summary Removes `ConstantVariable.create` wrapping in `raise_observed_exception` calls within `torch/_dynamo/variables/functions.py`. # Context The `raise_observed_exception` function handles the exception creation internally. Wrapping the error strings in `ConstantVariable` is unnecessary and can be simplified to passing raw strings. # Test Plan - [x] Verified syntax validity via `python3 -m py_compile torch/_dynamo/variables/functions.py` - [ ] CI/CD (Existing tests should pass as this is a refactor of error reporting paths) Pull Request resolved: pytorch#168337 Approved by: https://github.com/williamwen42, https://github.com/guilhermeleobas, https://github.com/cyyever
Summary: Add the `torch.mtia.graph_pool_handle` API as the counterpart of `torch.cuda.graph_pool_handle`, which is used in vllm, e.g. https://www.internalfb.com/code/fbsource/[f6d024bd45964d71810cbe1ed859f132f7f734cd]/fbcode/vllm/trunk/vllm/compilation/cuda_graph.py?lines=170 Test Plan: ``` buck2 run mtia/host_runtime/torch_mtia/tests:test_mtia_graph_py -- -r test_graph_pool_handle ``` Differential Revision: D88059625 Pull Request resolved: pytorch#169283 Approved by: https://github.com/patrick-toulme
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: pytorch#169198 Approved by: https://github.com/pytorchbot
…163249) Add sum support for the qlinear_binary templated gemm, and also allow sum as the choice for `QLinearPointwiseBinaryPT2E` in cases where x2 happens to be the output of `QLinearPointwiseBinaryPT2E`. Pull Request resolved: pytorch#163249 Approved by: https://github.com/Xia-Weiwen, https://github.com/leslie-fang-intel, https://github.com/jansel
Pull Request resolved: pytorch#168928 Approved by: https://github.com/anijain2305
Pull Request resolved: pytorch#168927 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#168928
…r when CP is enabled (pytorch#167381) Previously, CP-specific sharding strategies (which shard on the sequence dimension) were directly included in the base sharding strategies for scaled_dot_product_attention operators in `_matrix_ops.py`. This meant these strategies were always available, even when CP was not enabled, which could lead to incorrect sharding behavior, as these sharding rules are not mathematically correct without CP. 1. **Created new module**: `torch/distributed/tensor/experimental/_context_parallel/_sharding_rules.py` - Implements `op_strategy_context()` - a context manager for temporarily registering/unregistering strategies - Defines CP-enhanced strategy functions for all 6 scaled_dot_product_attention ops (forward and backward for flash, efficient, and cudnn variants) - Provides `register_cp_sharding_rules()` and `unregister_cp_sharding_rules()` APIs 2. **Updated `_matrix_ops.py`** - Removed all CP-specific sharding rules (sequence dimension sharding strategies) - Base strategies now only contain replicate, tensor parallelism, and batch sharding strategies 3. **Updated `_attention.py`** - `_enable_cp_dtensor_dispatcher()` now calls `register_cp_sharding_rules()` to dynamically add CP strategies - ~`_disable_cp_dtensor_dispatcher()` now calls `unregister_cp_sharding_rules()` to restore original strategies~ This would invalidate all the sharding prop caches. Disable it for now. Pull Request resolved: pytorch#167381 Approved by: https://github.com/wconstab
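A minimal sketch of the temporary-registration pattern that `op_strategy_context()` described above follows; the plain dict here stands in for DTensor's sharding-propagation table.

```python
from contextlib import contextmanager

_MISSING = object()
_strategy_registry: dict[str, object] = {}  # stand-in for the sharding-prop table

@contextmanager
def op_strategy_context(op: str, strategy):
    prev = _strategy_registry.get(op, _MISSING)
    _strategy_registry[op] = strategy  # register the CP-enhanced strategy
    try:
        yield
    finally:
        if prev is _MISSING:
            del _strategy_registry[op]    # unregister
        else:
            _strategy_registry[op] = prev  # restore the base strategy
```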
Pull Request resolved: pytorch#169227 Approved by: https://github.com/huydhn
…169323) 80 more passing tests Pull Request resolved: pytorch#169323 Approved by: https://github.com/yarongmu-google, https://github.com/jansel
Pull Request resolved: pytorch#169324 Approved by: https://github.com/malfet, https://github.com/yarongmu-google, https://github.com/jansel ghstack dependencies: pytorch#169323
…atio_chain (pytorch#169309) Fixes https://www.internalfb.com/tasks/?t=246834114 Pull Request resolved: pytorch#169309 Approved by: https://github.com/ezyang
…pytorch#169310) Fixes https://www.internalfb.com/tasks/?t=246782196 Pull Request resolved: pytorch#169310 Approved by: https://github.com/williamwen42 ghstack dependencies: pytorch#169309
…se_observed_exception (pytorch#168337)" This reverts commit fb5be22. Reverted pytorch#168337 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail some dynamo tests in trunk ([comment](pytorch#168337 (comment)))
…uards and verbose_code_parts (pytorch#169102) Fix pytorch#168379. Two solutions were considered: 1. Sort the ``___dict_contains`` guards by their verbose parts, e.g. by sorting ``get_leaf_guards()`` in ``construct_manager_string`` (also suggested in https://fb.workplace.com/groups/1075192433118967/permalink/1650742858897252/); the improved tests validate this ordering. 2. Adopt an ``OrderedSet`` in setGuards during guard construction so that the ``___dict_contains`` guards are displayed in the order they were added. We decided to pursue the second option, to avoid the sorting overhead and for simplicity. Pull Request resolved: pytorch#169102 Approved by: https://github.com/anijain2305
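A minimal sketch contrasting the two options in plain Python; `dict` stands in for the OrderedSet, since Python dicts preserve insertion order.

```python
guards = ["___dict_contains('b', x)", "___dict_contains('a', x)"]

# Option 1: sort at string-construction time -- deterministic, but O(n log n).
sorted_view = sorted(guards)

# Option 2 (adopted): dedupe while keeping insertion order -- no sort needed.
ordered_view = list(dict.fromkeys(guards))
```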
Summary: Compress aoti stack (replace full paths with filenames). Test Plan: ``` [nbeloborodov@devgpu031]~/fbsource/fbcode% strobe gpuevent --duration-ms=60000 --collect-kernel-events --kernel-sample-interval=0 --pids 1016951 Running "gpuevent" with run id -4456078642709746 and group_trace_id "" on hosts: ["::1"] Press Ctrl-C to stop the run > Queuing... (00:00:00.001) > Preparing... (00:00:04.055) > Profiling... (00:01:00.383) > Processing... (00:00:00.643) > Logging... (00:00:00.025) > Finished | Host | Return Code | Samples | Result Links | |------|-------------|---------|------------------------------------------------------------| | ::1 | SUCCESS | 4 | Raw samples: | | | | | https://fburl.com/scuba/strobelight_gpu/on_demand/zsglu6sc | | | | | | | | | | Run Details: | | | | | https://fburl.com/scuba/strobelight_runs/hmcuaz8u | ``` Differential Revision: D88005763 Pull Request resolved: pytorch#169291 Approved by: https://github.com/yushangdi
… is defined (pytorch#167496) Fixes pytorch#161660 This extends the `TORCH_STABLE_ONLY` stopgap added in pytorch#161658 Pull Request resolved: pytorch#167496 Approved by: https://github.com/janeyx99, https://github.com/malfet, https://github.com/atalman
Adds reduce_scatter_tensor_out for use in fx passes to efficiently decompose reduce_scatter without concatenation. Pull Request resolved: pytorch#168260 Approved by: https://github.com/wconstab
# Motivation There are several issues related to the data types and precision that an accelerator supports (see pytorch#165038 and pytorch#143112). Sometimes we have to look these capabilities up in the documentation and then hard-code them. This PR proposes a new unified API for users to check their accelerator's capabilities. # Changes This PR creates a new data structure `DeviceCapability` containing the capabilities that an accelerator commonly has: - Supported data types (enabled by default): - `fp16`, `int32`, `complex`, etc. - Other capabilities (to be discussed) To access the structure, this PR defines a new Python API in the Accelerator module -- `get_device_capability`. It takes `device` as an input and returns a dictionary containing the capabilities (currently with `supported_dtypes` as the key). # Usage ```python >>> import torch >>> import torch_openreg >>> torch.accelerator.get_device_capability('openreg:0') {'supported_dtypes': [torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64, torch.float16, torch.float32, torch.float64, torch.complex32, torch.complex64, torch.complex128, torch.bool, torch.qint8, torch.quint8, torch.qint32, torch.bfloat16, torch.quint4x2, torch.quint2x4, torch.bits1x8, torch.bits2x4, torch.bits4x2, torch.bits8, torch.bits16, torch.float8_e5m2, torch.float8_e4m3fn, torch.float8_e5m2fnuz, torch.float8_e4m3fnuz, torch.uint16, torch.uint32, torch.uint64, torch.uint1, torch.uint2, torch.uint3, torch.uint4, torch.uint5, torch.uint6, torch.uint7, torch.int1, torch.int2, torch.int3, torch.int4, torch.int5, torch.int6, torch.int7, torch.float8_e8m0fnu, torch.float4_e2m1fn_x2]} ``` # TODO - So far, precision is the only capability tracked, to my knowledge, but more common capabilities may be found, and the API should be designed for easy extension. - It will support other in-tree accelerators, such as **cuda** and **mps**. - Clarify whether the capabilities are software- or hardware-supported. (By @guangyey) Pull Request resolved: pytorch#165631 Approved by: https://github.com/guangyey, https://github.com/albanD Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com> Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
Follows pytorch#166723. Includes the changes below: - Add XPU support package 2025.3 build and test in CI for both Linux and Windows - Keep the XPU support package 2025.2 build in CI to guard against breakage until the PyTorch 2.10 release - Upgrade the XPU support package from 2025.2 to 2025.3 in CD for both Linux and Windows - Update the XPU runtime pypi package dependencies of CD wheels Pull Request resolved: pytorch#166829 Approved by: https://github.com/atalman
Remove unneeded header inclusion in C++ source files. Pull Request resolved: pytorch#169200 Approved by: https://github.com/albanD
This allows compiling a matmul on 2 DTensors with fully unbacked sizes, when a zero-cost strategy is available. Changes with the PR: - `mark_unbacked()` would previously error on tensor subclasses; now for DTensors it allocates unbacked symbols for both inner & outer sizes. The main motivation here is for testing, so happy to tweak semantics. The unbacked binding search process also now matches on DTensor outer sizes. - Selecting an op strategy in sharding propagation is based on minimal redistribution costs, and these costs are functions of tensor shapes, so can be unbacked expressions. This PR makes this process more unbacked-friendly, choosing negative or zero-cost strategies when they're available. When these "trivial" strategies aren't available, selection requires comparing unbacked costs, addressed in the next PR (with usage of fallback hints). - For matmul strategies, sharding prop rules filter out strategies where the matmul inputs fail the `is_tensor_shardable` check on the given DeviceMesh. In eager, this filters out cases where `size of sharded dim < num shards`. In the compiled & unbacked case, we'll often encounter dim size `u_` where `u_` can be both larger and smaller than num shards. This PR assumes such cases are shardable by default, and the implication is that strategies that shard on unbacked dimensions are included for consideration, and if selected, can lead to uneven sharding/zero-size shards at runtime. Alternatives would be 1) the current state of things: DDE and force the user to pick a path: `torch._check(size of sharded dim < or >= num shards)`, or 2) assume the non-shardable case and never include sharded strategies, unless the user picks the shardable path. More discussion in pytorch#165034 (comment). - Lastly, testing traced redistribution decisions required using aot_eager backend, so that the collectives/ops were hardcoded (eager backend would go through DTensor.dispatch again). This seemed to require re-enabling proxy tracking during shard prop, basically reverting pytorch#163126. Otherwise, errors like `RuntimeError: Max(1, u2) (<class 'torch.SymInt'>, 140294330350224)is not tracked with proxy for <torch.fx.experimental.proxy_tensor.PythonKeyTracer object at 0x7f98d1b14af0>` show up for DTensor outer strides... Pull Request resolved: pytorch#168051 Approved by: https://github.com/laithsakka
Caffe2 used to have an OpenCV integration path, but PyTorch does not. Pull Request resolved: pytorch#169609 Approved by: https://github.com/jathu, https://github.com/atalman
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: pytorch#165274 Approved by: https://github.com/pytorchbot
…9603) Fixes: debug logging part of the issue pytorch#169440 Pull Request resolved: pytorch#169603 Approved by: https://github.com/ProExpertProg, https://github.com/zou3519
…rgs (pytorch#168127) Fixes: pytorch#168124 This PR fixes triton compilation failures in combo kernels when combining multiple kernels with random ops (or any ops that create args with value equal to 1). The fix adds the missing logic to populate `constants` for args marked as compile-time constants, matching the behavior of regular Triton kernels. Pull Request resolved: pytorch#168127 Approved by: https://github.com/mlazos ghstack dependencies: pytorch#167781
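A minimal sketch of the missing logic (illustrative, not the actual combo-kernel codegen): Triton specializes integer arguments equal to 1 as compile-time constants, so they must be carried in the kernel's `constants` mapping just as regular kernel launches do.

```python
def collect_equal_to_1_constants(arg_names, arg_values):
    # Any integer argument whose value is 1 is treated by the Triton compiler
    # as a compile-time constant and must be recorded as such.
    return {
        name: value
        for name, value in zip(arg_names, arg_values)
        if isinstance(value, int) and value == 1
    }

assert collect_equal_to_1_constants(["xnumel", "seed"], [1, 42]) == {"xnumel": 1}
```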
Fixes pytorch#168478 Fixes pytorch#168557 Fixes pytorch#168573 Fixes pytorch#168581 Fixes pytorch#168586 Fixes pytorch#168625 Fixes pytorch#168647 Fixes pytorch#168649 Fixes pytorch#168672 Fixes pytorch#168676 Fixes pytorch#168677 Fixes pytorch#168678 Fixes pytorch#168679 Fixes pytorch#168684 Fixes pytorch#168683 Fixes pytorch#168681 Unskip some UTs Pull Request resolved: pytorch#169564 Approved by: https://github.com/jeffdaily
This PR fixes some of the slotscheck warnings, including: ``` ERROR: 'torch._inductor.cudagraph_trees:AliasesNewOutput' has slots but superclass does not. ERROR: 'torch._inductor.cudagraph_trees:AliasesPriorGraphOutput' has slots but superclass does not. ERROR: 'torch._subclasses.fake_tensor:_BypassDispatchCache' has slots but superclass does not. ERROR: 'torch.distributed._functional_collectives:AsyncCollectiveTensor' has slots but superclass does not. ERROR: 'torch.distributed.elastic.timer.file_based_local_timer:FileTimerRequest' defines overlapping slots. ERROR: 'torch.distributed.tensor._shards_wrapper:LocalShardsWrapper' has slots but superclass does not. ERROR: 'torch.distributed.tensor:DTensor' has slots but superclass does not. ERROR: 'torch.multiprocessing.spawn:ProcessException' has slots but superclass does not. ERROR: 'torch.package.package_importer:_ModuleNode' has slots but superclass does not. ERROR: 'torch.sparse.semi_structured:SparseSemiStructuredTensor' has slots but superclass does not. ERROR: 'torch.testing._internal.logging_tensor:LoggingTensor' has slots but superclass does not. ``` The fixes work by adding `__slots__` to the parent classes. Pull Request resolved: pytorch#169348 Approved by: https://github.com/Skylion007
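A minimal sketch of why the warning matters and of the fix applied: if any base class lacks `__slots__`, every instance still gets a `__dict__`, defeating the memory savings.

```python
class Base:            # no __slots__ -> instances still carry a __dict__
    pass

class Leaf(Base):
    __slots__ = ("x",)

leaf = Leaf()
leaf.not_declared = 1  # silently allowed via Base's __dict__
assert hasattr(leaf, "__dict__")

class FixedBase:
    __slots__ = ()     # the fix: give the parent (empty) slots too

class FixedLeaf(FixedBase):
    __slots__ = ("x",)

assert not hasattr(FixedLeaf(), "__dict__")  # slots now actually take effect
```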
…)" This reverts commit 8b683e5. Reverted pytorch#169609 on behalf of https://github.com/malfet due to It causes inductor tests to fail ([comment](pytorch#169609 (comment)))
This PR fixes unused context managers detected by pylint. Pull Request resolved: pytorch#169447 Approved by: https://github.com/malfet
) This PR fixes a FailedMatch format-string bug; part of issue pytorch#169440. Pull Request resolved: pytorch#169611 Approved by: https://github.com/ProExpertProg, https://github.com/zou3519 ghstack dependencies: pytorch#169603
) Related to pytorch#169492 Pull Request resolved: pytorch#169630 Approved by: https://github.com/huydhn
This PR is for the **yellow block** of the design flow below. <img width="3880" height="2240" alt="image" src="https://github.com/user-attachments/assets/1774b9c1-5a72-4ebe-9e5c-70072cc7f320" /> Assuming users mark some nodes with metadata `should_offload` (which could later be replaced as compiler automatic decisions), the implementation first checks if the tensors are safe to offload via the `can_offload()` function. For tensors that are marked as `should_offload` and pass the verification in `can_offload()`, we modify the forward and backward graph, where in the forward graph, we offload tensors to CPU, and in the backward graph, we reload tensors to GPU. We introduce two flags: * `enable_activation_offloading` -- with this turned on, we insert offload/reload nodes to the fwd/bwd graphs. * `activation_offload_separate_stream` -- with this turned on, we wrap the offload/reload nodes in a separate stream with correct waits for e.g., data dependencies. (Note that from a "frontend" perspective, everything is the same, the only difference is that the memcpy is now moved to a separate stream, which prepares for subsequent reordering and overlapping) For an [example model](https://gist.github.com/xuanzhang816/ee2e3648123670f14ced9963858ee3b4), we have 1. baseline (i.e., no activation offloading) - 9160 MB peak memory - 157.42 ms per-iteration runtime - trace for fwd and bwd as below: <img width="845" height="52" alt="image" src="https://github.com/user-attachments/assets/2868416e-e5a6-42a6-985f-891c5d8edb72" /> 2. AO (i.e., with `enable_activation_offloading=True`) - 7460 MB peak memory - 224.15 ms per-iteration runtime - trace for fwd and bwd as below (the "M" blocks are for Memcpy): <img width="864" height="52" alt="image" src="https://github.com/user-attachments/assets/ec7b6326-b192-4915-af3f-824a2d3f7c49" /> 3. AO on separate stream (i.e., additionally with `activation_offload_separate_stream = True`) - 7460 MB peak memory - 224.11 ms per-iteration runtime - trace for fwd and bwd as below (the "M" blocks are for Memcpy): <img width="891" height="121" alt="image" src="https://github.com/user-attachments/assets/2137691a-2c97-4208-80c6-cd7ae7fac7e0" /> Pull Request resolved: pytorch#167880 Approved by: https://github.com/eellison
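A conceptual sketch of the offload/reload idea using autograd's saved-tensor hooks: an eager-mode analogue, not the Inductor graph pass this PR implements.

```python
import torch

def pack_to_cpu(t: torch.Tensor):
    # Forward: offload the saved activation to host memory.
    return t.to("cpu", non_blocking=True)

def unpack_to_gpu(t: torch.Tensor) -> torch.Tensor:
    # Backward: reload the activation to the GPU right before it is needed.
    return t.to("cuda", non_blocking=True)

with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_to_gpu):
    ...  # run the forward of the region whose activations should be offloaded
```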
) We introduce two flags for computation <> communication overlap: * `activation_offload_sink_wait` -- with this turned on, the wait events in the forward are sunk to the end of the graph so that the offload ops are not blocking * `activation_reload_prefetch` -- with this turned on, the reload in the backward is prefetched just early enough that the reload operation is perfectly overlapped with computation. Continuing with the [example model](https://gist.github.com/xuanzhang816/ee2e3648123670f14ced9963858ee3b4) from the first PR in the stack: 4. AO on separate stream with reorders (i.e., additionally with `activation_offload_sink_wait = True` and `activation_reload_prefetch = True`) - 7534 MB peak memory - 160.86 ms per-iteration runtime - trace for fwd and bwd as below (the "M" blocks are for Memcpy): <img width="869" height="123" alt="image" src="https://github.com/user-attachments/assets/6aa0c5df-5cc4-4711-8c1a-20f4a9d9145b" /> Pull Request resolved: pytorch#168316 Approved by: https://github.com/eellison ghstack dependencies: pytorch#167880
…ecord_artifact (pytorch#169242) I ran into this when trying to precompile simplefsdp and realizing that deepcloning a DeviceMesh within a fake mode causes the following error: ``` [rank0]: File "/home/bobren/local/a/pytorch/spc.py", line 111, in <module> [rank0]: .aot_compile(((input_tensor, d_input_tensor), {"mesh": mesh})) [rank0]: File "/home/bobren/local/a/pytorch/torch/_dynamo/eval_frame.py", line 800, in aot_compile [rank0]: return aot_compile_fullgraph( [rank0]: File "/home/bobren/local/a/pytorch/torch/_dynamo/aot_compile.py", line 235, in aot_compile_fullgraph [rank0]: compiled_fn = backend( [rank0]: File "/home/bobren/local/a/pytorch/torch/__init__.py", line 2445, in __call__ [rank0]: return compile_fx(model_, inputs_, config_patches=self.config) [rank0]: File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx.py", line 2525, in compile_fx [rank0]: return _maybe_wrap_and_compile_fx_main( [rank0]: File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx.py", line 2602, in _maybe_wrap_and_compile_fx_main [rank0]: return _compile_fx_main( [rank0]: File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx.py", line 2797, in _compile_fx_main [rank0]: return aot_autograd( [rank0]: File "/home/bobren/local/a/pytorch/torch/_dynamo/backends/common.py", line 117, in __call__ [rank0]: cg = aot_module_simplified(gm, example_inputs, **self.kwargs) [rank0]: File "/home/bobren/local/a/pytorch/torch/_functorch/aot_autograd.py", line 1119, in aot_module_simplified [rank0]: compiled_fn, _ = aot_stage2_compile( [rank0]: File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/graph_compile.py", line 348, in aot_stage2_compile [rank0]: return aot_stage2_autograd(aot_state, aot_graph_capture) [rank0]: File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/graph_compile.py", line 2017, in aot_stage2_autograd [rank0]: try_save_cache_entry, entry = _cache_autograd_info( [rank0]: File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/graph_compile.py", line 2196, in _cache_autograd_info [rank0]: entry = try_save_cache_entry( [rank0]: File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/graph_compile.py", line 2186, in try_save_cache_entry [rank0]: AOTAutogradCache.save( [rank0]: File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 905, in save [rank0]: raise e [rank0]: File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 889, in save [rank0]: PrecompileContext.record_artifact(artifact) [rank0]: File "/home/bobren/local/a/pytorch/torch/_dynamo/precompile_context.py", line 147, in record_artifact [rank0]: cls._backend_artifacts_by_key[_BackendId(artifact.key)] = copy.deepcopy( [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 172, in deepcopy [rank0]: y = _reconstruct(x, memo, *rv) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 271, in _reconstruct [rank0]: state = deepcopy(state, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 146, in deepcopy [rank0]: y = copier(x, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 231, in _deepcopy_dict [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 172, in deepcopy [rank0]: y = _reconstruct(x, memo, *rv) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 271, in _reconstruct [rank0]: state = 
deepcopy(state, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 146, in deepcopy [rank0]: y = copier(x, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 231, in _deepcopy_dict [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 172, in deepcopy [rank0]: y = _reconstruct(x, memo, *rv) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 271, in _reconstruct [rank0]: state = deepcopy(state, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 146, in deepcopy [rank0]: y = copier(x, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 231, in _deepcopy_dict [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 146, in deepcopy [rank0]: y = copier(x, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 206, in _deepcopy_list [rank0]: append(deepcopy(a, memo)) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 172, in deepcopy [rank0]: y = _reconstruct(x, memo, *rv) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 271, in _reconstruct [rank0]: state = deepcopy(state, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 146, in deepcopy [rank0]: y = copier(x, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 231, in _deepcopy_dict [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 146, in deepcopy [rank0]: y = copier(x, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 211, in _deepcopy_tuple [rank0]: y = [deepcopy(a, memo) for a in x] [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 211, in <listcomp> [rank0]: y = [deepcopy(a, memo) for a in x] [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 172, in deepcopy [rank0]: y = _reconstruct(x, memo, *rv) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 271, in _reconstruct [rank0]: state = deepcopy(state, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 146, in deepcopy [rank0]: y = copier(x, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 231, in _deepcopy_dict [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 172, in deepcopy [rank0]: y = _reconstruct(x, memo, *rv) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 271, in _reconstruct [rank0]: state = deepcopy(state, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 146, in deepcopy [rank0]: y = copier(x, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 231, in _deepcopy_dict [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 153, in deepcopy [rank0]: y = copier(memo) [rank0]: File "/home/bobren/local/a/pytorch/torch/_tensor.py", line 142, in __deepcopy__ [rank0]: return handle_torch_function(Tensor.__deepcopy__, (self,), self, memo) [rank0]: File "/home/bobren/local/a/pytorch/torch/overrides.py", line 
1733, in handle_torch_function [rank0]: result = mode.__torch_function__(public_api, types, args, kwargs) [rank0]: File "/home/bobren/local/a/pytorch/torch/utils/_device.py", line 109, in __torch_function__ [rank0]: return func(*args, **kwargs) [rank0]: File "/home/bobren/local/a/pytorch/torch/_tensor.py", line 180, in __deepcopy__ [rank0]: new_storage = self._typed_storage()._deepcopy(memo) [rank0]: File "/home/bobren/local/a/pytorch/torch/storage.py", line 1139, in _deepcopy [rank0]: return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo)) [rank0]: File "/home/bobren/local/a/pytorch-env/lib/python3.10/copy.py", line 153, in deepcopy [rank0]: y = copier(memo) [rank0]: File "/home/bobren/local/a/pytorch/torch/storage.py", line 243, in __deepcopy__ [rank0]: new_storage = self.clone() [rank0]: File "/home/bobren/local/a/pytorch/torch/storage.py", line 257, in clone [rank0]: return type(self)(self.nbytes(), device=self.device).copy_(self) [rank0]: File "/home/bobren/local/a/pytorch/torch/utils/_stats.py", line 29, in wrapper [rank0]: return fn(*args, **kwargs) [rank0]: File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1397, in __torch_dispatch__ [rank0]: return self.dispatch(func, types, args, kwargs) [rank0]: File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2155, in dispatch [rank0]: return self._cached_dispatch_impl(func, types, args, kwargs) [rank0]: File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1544, in _cached_dispatch_impl [rank0]: output = self._dispatch_impl(func, types, args, kwargs) [rank0]: File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2823, in _dispatch_impl [rank0]: r = func(*args, **kwargs) [rank0]: File "/home/bobren/local/a/pytorch/torch/_ops.py", line 836, in __call__ [rank0]: return self._op(*args, **kwargs) [rank0]: RuntimeError: Attempted to set the storage of a tensor on device "meta" to a storage on different device "cpu". This is no longer allowed; the devices must match. ``` As you can see the underlying problem is 1) we do a clone on storage 2) which under the hood calls `_copy` 3) but when we call `copy_` the fake mode turns a self, which is on "cpu", into a meta device. This PR fixes the issue by temporarily disabling dispatch when doing the deepclone. Pull Request resolved: pytorch#169242 Approved by: https://github.com/bdhirsh Co-authored-by: Bob Ren <bobrenjc@fb.com>
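A minimal sketch of the fix's shape, assuming `unset_fake_temporarily` from `torch._subclasses.fake_tensor` as the mechanism for leaving fake-tensor dispatch; the actual PR may disable dispatch differently.

```python
import copy

from torch._subclasses.fake_tensor import unset_fake_temporarily

def deepcopy_outside_fake_mode(artifact):
    # Temporarily leave fake-tensor dispatch so storage clones during the
    # deepcopy run on the tensors' real devices instead of "meta".
    with unset_fake_temporarily():
        return copy.deepcopy(artifact)
```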
…orch#169487) Summary: PyTorch's expandable segments IPC capability was disabled in fbcode due to job failures (see pytorch#132890). However, some use cases like CTran require IPC functionality for multi-process GPU communication. This change introduces PYTORCH_CUDA_EXPANDABLE_SEGMENTS_IPC environment variable to allow opt-in enablement of IPC handle types for expandable segments in fbcode builds while maintaining backward compatibility. IPC is enabled by default in non-fbcode builds and disabled by default in fbcode builds (existing behavior). In both cases, it can be explicitly enabled by setting PYTORCH_CUDA_EXPANDABLE_SEGMENTS_IPC=1. Test Plan: CI Differential Revision: D88274246 Pull Request resolved: pytorch#169487 Approved by: https://github.com/ngimel
If it gets to the end of the list without finding a green commit, LATEST_SHA is None. Pull Request resolved: pytorch#169585 Approved by: https://github.com/huydhn
Summary: Makes `torch.allclose(a, b, atol=0, rtol=0)` work for `a` and `b` with dtype `torch.float4_e2m1fn_x2`. This is useful for testing. Test Plan: ``` pytest test/quantization/core/experimental/test_floatx.py -s -k test_float4_e2m1fn_x2 ``` Pull Request resolved: pytorch#169575 Approved by: https://github.com/eqy, https://github.com/drisspg
Summary: Enables `copy_` support for the `torch.float4_e2m1fn_x2` dtype. This is useful when slicing a tensor across dim1 and then calling contiguous, which can happen in vllm and therefore should be supported. Test Plan: ``` pytest test/quantization/core/experimental/test_floatx.py -s -k test_float4_e2m1fn_x2 ``` Pull Request resolved: pytorch#169595 Approved by: https://github.com/drisspg ghstack dependencies: pytorch#169575
…ch#169261) Add missing shape validation between `input` and `indices` tensors for `nn.MaxUnpool{1,2,3}d` on MPS backend Fixes pytorch#169235 Pull Request resolved: pytorch#169261 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
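A minimal sketch of the kind of check added (illustrative Python, not the MPS kernel code): `indices` must have the same shape as `input` before unpooling.

```python
import torch

def check_max_unpool_shapes(input: torch.Tensor, indices: torch.Tensor) -> None:
    if input.shape != indices.shape:
        raise RuntimeError(
            f"MaxUnpool: shape of indices {tuple(indices.shape)} must match "
            f"shape of input {tuple(input.shape)}"
        )
```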
…pytorch#167926) Correct some grammatical and expression errors in the CONTRIBUTING.md file. Pull Request resolved: pytorch#167926 Approved by: https://github.com/mikaylagawarecki
This PR replaces unnecessary uses of thrust::pair with std::pair. Pull Request resolved: pytorch#168941 Approved by: https://github.com/albanD
# Conflicts: # .ci/docker/ci_commit_pins/triton.txt # .ci/docker/triton_version.txt
Jenkins build for commit b6e3d448ef3aff527c474515d81b3d0fd938bb7c finished as FAILURE. Detected error during base docker image building.
rocm_base: f742da3