@iupaikov-amd

Related to one of the customer issues, but will be useful later

jithunnair-amd and others added 30 commits October 10, 2025 14:55
(cherry picked from commit e294d4d with modifications for release/2.8)

Reintroduce CIRCLE_TAG to be able to set PYTORCH_BUILD_VERSION without date

(cherry picked from commit 71a30ea)
…for py3.9;

upgrade tensorboard to be compatible with numpy 2

Co-authored-by: Ethan Wee <Ethan.Wee@amd.com>
(cherry picked from commit e867a3d)
(cherry picked from commit c7a1e32)
(cherry picked from commit 2a215e4)
(cherry picked from commit 866cc1d)
(cherry picked from commit 4b46310)
(cherry picked from commit 3d102a0)
(cherry picked from commit cb98724)
(cherry picked from commit ba1ba26)
(cherry picked from commit 4e3462e)
(cherry picked from commit 85ac538)
…_rcpf(x) instead of 1.f/x (#1800)

Cherry-pick of #1688

Co-authored-by: Michael Halkenhäuser <michaelhalk@web.de>
Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
(cherry picked from commit f8544af)
(cherry picked from commit ed48754)
(cherry picked from commit d62a39e)
(cherry picked from commit b26ddb8)
Related to c7a1e32
Fixes https://ontrack-internal.amd.com/browse/SWDEV-537835

Not a Navi-specific failure:
```
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1412, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/jenkins/pytorch/test/test_binary_ufuncs.py", line 1671, in test_cuda_tensor_pow_scalar_tensor
    self._test_pow(base, exp)
  File "/var/lib/jenkins/pytorch/test/test_binary_ufuncs.py", line 1482, in _test_pow
    self.assertEqual(actual, expected)
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 4052, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: The values for attribute 'dtype' do not match: torch.float32 != torch.float64.
```

Using `.to(actual)` without specifying dtype/device assumes `actual` is a tensor or tensor-like, which may fail silently or promote. Fixed by explicitly matching dtype and device, following pytorch#107302.
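A minimal sketch of the pitfall and the explicit form; the variable names are illustrative, not the actual test code:
```
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
actual = torch.tensor([4.0], dtype=torch.float32, device=device)
expected = torch.tensor([4.0], dtype=torch.float64)  # e.g. a NumPy-derived reference

# Fragile: requires `actual` to be a tensor and silently adopts
# whatever dtype it happens to have.
fragile = expected.to(actual)

# Explicit: dtype and device are matched deliberately, so any
# promotion is intentional rather than accidental.
explicit = expected.to(dtype=actual.dtype, device=actual.device)
```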
Fix:
```
root@ubb4-rack-22:/var/lib/jenkins/pytorch# TEST_CONFIG=default HIP_VISIBLE_DEVICES=0 PYTORCH_TEST_WITH_ROCM=1 python test/test_binary_ufuncs.py TestBinaryUfuncsCUDA.test_cuda_tensor_pow_scalar_tensor_cuda
/opt/conda/envs/py_3.12/lib/python3.12/site-packages/hypothesis/entry_points.py:23: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources

Running tests...
----------------------------------------------------------------------
.
----------------------------------------------------------------------
Ran 1 test in 0.141s

OK

Generating XML reports...
root@ubb4-rack-22:/var/lib/jenkins/pytorch# pip list | grep numpy
numpy                   2.1.2

```

(cherry picked from commit a4d60fa)
(cherry picked from commit 9f11871)
This PR fixes the following unit test failure:

test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction FAILED [0.1163s]

```
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
    tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```
This error occurs only on gfx1101 arch.

The error comes from an integer overflow: another unit test,
test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel,
creates a tensor with a huge numel, which inflates
torch.cuda.max_memory_reserved() when
test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction runs
afterward. To avoid this, we call torch.cuda.empty_cache() and
torch.cuda.reset_peak_memory_stats() to clean up the CUDA state.
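A minimal sketch of the cleanup, assuming it runs between the two tests; the exact placement in test_cuda.py may differ:
```
import torch

def reset_cuda_memory_state():
    # Release cached allocator blocks so max_memory_reserved() no longer
    # reflects the previous test's huge allocation.
    torch.cuda.empty_cache()
    # Reset the peak-memory counters that
    # test_set_per_process_memory_fraction relies on.
    torch.cuda.reset_peak_memory_stats()
```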

JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295
(cherry picked from commit f86d184)
(cherry picked from commit 1b44228)
…g torch and numpy tensors (#2362)

Cherry-pick of #2340

Co-authored-by: Dmitry Nikolaev <139769634+dnikolaev-amd@users.noreply.github.com>
(cherry picked from commit 22c98ea)
(cherry picked from commit 2d72fcd)
pip-installed requirements.txt and .ci/docker/requirements-ci.txt

Local validation: `Successfully installed jinja2-3.1.6 lintrunner-0.12.7
mypy-1.14.0 onnxscript-0.2.2 sympy-1.13.3 tlparse-0.3.30
z3-solver-4.12.6.0`

(cherry picked from commit 30508ff)
(cherry picked from commit 22d02e8)
Adds initial autotuning for foreach support required for
https://ontrack-internal.amd.com/browse/SWDEV-539076

4x improvement for some kernels

Before:
triton_for_fused_18.kd 🔍 | 4.986 ms | 4.986 ms | 2.493 ms | 2 |  
triton_for_fused_6.kd 🔍 | 0.098 ms | 0.098 ms | 0.049 ms | 2 |  
triton_for_fused_7.kd 🔍 | 0.036 ms | 0.036 ms | 0.018 ms | 2 |  

After:
triton_for_fused_18.kd 🔍 | 1.273 ms | 1.273 ms | 0.636 ms | 2 |  
triton_for_fused_6.kd 🔍 | 0.044 ms | 0.044 ms | 0.022 ms | 2 |  
triton_for_fused_7.kd 🔍 | 0.024 ms | 0.024 ms | 0.012 ms | 2 |  
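A hedged sketch of exercising a foreach-style fused kernel under Inductor autotuning; `mode="max-autotune"` is standard `torch.compile` API, and whether it picks up the new foreach configs here is an assumption:
```
import torch

params = [torch.randn(1024, device="cuda") for _ in range(8)]
grads = [torch.randn(1024, device="cuda") for _ in range(8)]

@torch.compile(mode="max-autotune")
def foreach_step(params, grads, lr=0.1):
    # torch._foreach_add_ maps onto the horizontally fused
    # triton_for_fused_* kernels benchmarked above.
    torch._foreach_add_(params, grads, alpha=-lr)
    return params

foreach_step(params, grads)
```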

(cherry picked from commit f07b7f7)
(cherry picked from commit ed0d0a7)
Relands #2416 with caching fix

Upstream equivalent pytorch#159146

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit f0aebdc)
(cherry picked from commit 9c429dd)
… Fix warps runtime part 2 (#2455)

Cherry-pick of #2442

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
(cherry picked from commit 77a6760)
…ersistent reduction and no_x_dim removal (#2454)

Cherry-pick of #2417
Need to resolve conflicts

---------

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
(cherry picked from commit eb47158)
Perf improvement for triton tanh

(cherry picked from commit 4febbd8)
… rocm version (#2529)

Cherry-pick of #2518

Co-authored-by: Ethan Wee <Ethan.Wee@amd.com>
(cherry picked from commit c03be63)
Fixes SWDEV-543698
(https://ontrack-internal.amd.com/browse/SWDEV-543698)

Cherry-picked from #2502

This PR fixes errors like the one below:
```
[rank3]: RuntimeError: The following operation failed in the TorchScript interpreter.
[rank3]: Traceback of TorchScript (most recent call last):
[rank3]: RuntimeError: /tmp/comgr-28f951/input/CompileSourceACC062:67:7: error: unknown type name 'uint32_t'; did you mean '__hip_internal::uint32_t'?
[rank3]:    67 |       uint32_t int32;
[rank3]:       |       ^~~~~~~~
[rank3]:       |       __hip_internal::uint32_t
```
Previously, `uint32_t` was defined in the `std` namespace in the HIP
headers. As of ROCm 7.0 it has moved to the `__hip_internal` namespace.

(cherry picked from commit b2fb688)
…2598)

Cherry-pick of #2597

Co-authored-by: Jerry Mannil <65309407+jerrymannil@users.noreply.github.com>
(cherry picked from commit 9ea02c4)
The original PR (#2417) had incorrect indentation. Updated the PR so
that autotune always adds the tiny configs and otherwise uses only the
hinted configs.

Tested locally on test_torchinductor:
Ran 894 tests in 952.242s
FAILED (failures=1, skipped=28)

Also completed autotune runs for the microbench models:
Microbenchmark for network : resnet152
Num devices: 1
Dtype: FP32
Mini batch size [img] : 64
Time per mini-batch : 0.09107530117034912
Throughput [img/sec] : 702.7152167226226

(cherry picked from commit db3ba66)
Cherry-pick of 8d42697

(cherry picked from commit 0b82d9a)
cherry-pick of pytorch#163869

(cherry picked from commit dfd386f)
[AUTOGENERATED] release/2.9_IFU_2025-10-14
Cherry-pick of #2693

Co-authored-by: Gheorghe-Teodor Bercea <gt.bercea@gmail.com>
Cherry-pick of #2710

Co-authored-by: Jerry Mannil <65309407+jerrymannil@users.noreply.github.com>
jataylo and others added 16 commits October 17, 2025 09:18
…2722)

These changes from upstream break loading an external library:
```
     61170:     calling init: /opt/venv/lib/python3.12/site-packages/torchvision/_C.so
     61170:
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Fatal Python error: Aborted
 
Current thread 0x00007f229fb36080 (most recent call first):
  File "/usr/lib/python3.12/ctypes/__init__.py", line 379 in __init__
  File "/pytorch/torch/_ops.py", line 1488 in load_library
  File "/opt/venv/lib/python3.12/site-packages/torchvision/extension.py", line 34 in <module>
  File "<frozen importlib._bootstrap>", line 488 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 995 in exec_module
  File "<frozen importlib._bootstrap>", line 935 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1331 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1360 in _find_and_load
  File "/opt/venv/lib/python3.12/site-packages/torchvision/__init__.py", line 9 in <module>
```

This was already reverted in rocm/7.1_internal_testing; we need to
investigate whether upstream needs a fix.
These changes are currently being upstreamed. Bringing them into
release 2.9 for a customer model perf improvement.

---------

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
Co-authored-by: Sampsa Riikonen <sriikone@amd.com>
Co-authored-by: Nichols A. Romero <165712832+naromero77amd@users.noreply.github.com>
Co-authored-by: AmdSampsa <sampsa.riikonen@amd.com>
…indows. (pytorch#162330)

Enables flash attention and/or memory-efficient attention on Windows with scaled_dot_product_attention via aotriton.
Already tested to be working on Windows with TheRock.

Steps to enable: simply set `USE_FLASH_ATTENTION=1` and `USE_MEM_EFF_ATTENTION=1` as usual. See https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py#L578-L604
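A hedged sketch for verifying, after such a build, that SDPA actually dispatches to these backends; `sdpa_kernel` and `SDPBackend` are standard `torch.nn.attention` API:
```
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Restricting to these backends raises if neither can serve the input,
# so a silent fallback to the math backend cannot mask a broken build.
with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
```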

Pull Request resolved: pytorch#162330
Approved by: https://github.com/jeffdaily

Co-authored-by: Scott Todd <scott.todd0@gmail.com>
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`
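A hedged sketch of pinning the TF32 state in a test so an inherited env-var setting cannot skew tolerances; `torch.backends.cuda.matmul.allow_tf32` is standard torch API, and the `HIPBLASLT_ALLOW_TF32` interaction on ROCm is as described here:
```
import os
import torch

# Surface any inherited setting so failures are reproducible.
print("HIPBLASLT_ALLOW_TF32 =", os.environ.get("HIPBLASLT_ALLOW_TF32"))

# Pin full-precision float32 matmuls so TF32 cannot loosen test tolerances.
torch.backends.cuda.matmul.allow_tf32 = False
```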

Fixes pytorch#157094
Fixes pytorch#157093
Fixes pytorch#157092
Fixes pytorch#157091
Fixes pytorch#157064
Fixes pytorch#157063
Fixes pytorch#157062
Fixes pytorch#157061
Fixes pytorch#157042
Fixes pytorch#157041
Fixes pytorch#157039
Fixes pytorch#157004

Pull Request resolved: pytorch#162998
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
…3373)

Early assignment of `__AOTRITON_LIB` breaks the usage of the environment variable `$AOTRITON_INSTALLED_PREFIX`.
Pull Request resolved: pytorch#163373
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
## Major Changes

* Efficient Attention on ROCm requires the last dimension of the input tensors to be aligned to 16 bytes (see the sketch after this list).
  - Unlike FA, ME does not pad input tensors in `scaled_dot_product_attention`, hence this requirement.
* Fix `atomic_counter` handling in the varlen FA API.
* Unskip a few unit tests.
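A minimal sketch of the alignment requirement, assuming a hypothetical `pad_head_dim` helper (illustrative, not the kernel code):
```
import torch
import torch.nn.functional as F

def pad_head_dim(t, align_bytes=16):
    # Pad the last dimension so its size in bytes is a multiple of 16.
    elems = align_bytes // t.element_size()   # e.g. 8 elements for float16
    pad = (-t.shape[-1]) % elems
    return F.pad(t, (0, pad)) if pad else t

q = torch.randn(2, 8, 128, 60, device="cuda", dtype=torch.float16)
q_aligned = pad_head_dim(q)   # head dim 60 -> 64 (128 bytes, 16-byte aligned)
```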

Fixes pytorch#157120
Fixes pytorch#157121
Fixes pytorch#157122
Fixes pytorch#157167
Fixes pytorch#155217
Fixes pytorch#157043
Fixes pytorch#157060

Pull Request resolved: pytorch#163745
Approved by: https://github.com/jeffdaily
- TheRock build system for ROCm builds OpenBLAS from source and uses a
custom name for the library.
- Follows the existing conventions in `FindOpenBLAS.cmake` to support
finding a custom-named version of OpenBLAS.
Cherry-pick of #2738

Co-authored-by: Jerry Mannil <65309407+jerrymannil@users.noreply.github.com>
Cherry-pick of #2740

Co-authored-by: Jerry Mannil <65309407+jerrymannil@users.noreply.github.com>
Cherry-pick of #2743

Co-authored-by: Jerry Mannil <65309407+jerrymannil@users.noreply.github.com>
@iupaikov-amd iupaikov-amd marked this pull request as ready for review October 31, 2025 15:43
@rocm-repo-management-api

rocm-repo-management-api bot commented Oct 31, 2025

Jenkins build for 2b5fc74a7e1d7eb3a745955b17eaa0e125d0f266 commit finished as FAILURE
