Add Python 3.14 support for PyTorch builds #2783
subodh-dubey-amd wants to merge 27 commits into main from
marbre left a comment
Please link to the logs. I don't see any runs when I filter to your branch: https://github.com/ROCm/TheRock/actions/workflows/release_portable_linux_pytorch_wheels.yml?query=branch%3Ausers%2Fsubodh-dubey-amd%2Fpytorch-314
There are way more changes on that branch (main...refs/heads/users/subodh-dubey-amd/pytorch-py310-py314-support) so this is not giving a good signal. If we add 3.14 we want to make sure it is green when it lands. Sequencing is important here.
- Add Python 3.14 to Linux PyTorch release workflow (2.9 and nightly)
- Add Python 3.14 to Windows PyTorch release workflow (2.9 and nightly)
- Update RELEASES.md to document Python 3.14 support

Python 3.14 support:
- PyTorch 2.9: Preview support
- PyTorch nightly: Full support
- PyTorch 2.7/2.8: Not supported (no upstream Python 3.14 support)

Fixes #2640
…r of the other include entries. Co-authored-by: Marius Brehler <marius.brehler@amd.com>
Match the key ordering (pytorch_git_ref first) with other include entries as suggested by @marbre
- Changed the default behavior of the caching option from disabled to enabled.
- Updated help text to clarify usage of --cache and --no-cache arguments.
- Adjusted logic to check for caching based on the new argument structure.
- Fix Python -P flag compatibility for Python 3.10 in rocm_sdk tests and CLI
- Update S3 package management for networkx and MarkupSafe compatibility
- Add data-requires-python attributes for proper version resolution
- Support multiple versions of dependencies (networkx 3.2.1 for Py3.10, latest for Py3.11+)
- Enable cp310 wheels in update_dependencies.py

Fixes compatibility issues when running rocm-sdk tests on Python 3.10 and resolves dependency version conflicts with networkx and MarkupSafe.
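The `-P` compatibility fix mentioned in this commit message can be illustrated with a minimal sketch. The `-P` interpreter flag was added in Python 3.11, so a launcher that also has to run on Python 3.10 needs to add it conditionally. The helper name `interpreter_args` is hypothetical and is not taken from the rocm_sdk sources.

```python
import sys


def interpreter_args(script: str) -> list[str]:
    """Build an interpreter command line that uses -P only where it exists.

    Hypothetical helper for illustration; not the rocm_sdk implementation.
    """
    cmd = [sys.executable]
    # -P (do not prepend a potentially unsafe path to sys.path) was added
    # in Python 3.11; older interpreters reject it as an unknown option.
    if sys.version_info >= (3, 11):
        cmd.append("-P")
    cmd.append(script)
    return cmd


print(interpreter_args("demo.py"))
```

Checking `sys.version_info` only describes the interpreter that runs the check, so this pattern assumes the child process is started with the same `sys.executable`.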
Fixed formatting issues in:
- _cli.py: Added blank line after function
- base_test.py, devel_test.py: Removed unnecessary line continuations
marbre left a comment
Tests:
Release portable Linux PyTorch Wheels Tests:
The latest run https://github.com/ROCm/TheRock/actions/runs/20939661850 was tested on commit a4ba9bb:
Those test results are not valid, and making a reviewer check this is a waste of time.
Provide valid test results and take into account what was discussed on earlier PRs before requesting a review.
    # Python 3.14 compatibility - PEP 649 changed __annotations__ behavior
    # AttributeError: 'Model' object has no attribute '__annotations__'
    # https://github.com/ROCm/TheRock/actions/runs/20955499125/job/60224765842
    "test_autocast_cat_jit",
    # Floating-point precision issue: int(fraction * memory) differs by 1
    # due to rounding, causing assertion failure and UnboundLocalError in cleanup.
    # Expected 874512384 but got 874512383.
    "test_max_split_expandable",
As discussed for Python 3.10, such changes should go into a separate PR and should not be part of the PR that applies the workflow changes.
The change above does not seem to be in your updated PR. Close the issue eventually?
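The off-by-one described in the skip-list comment for `test_max_split_expandable` is the usual effect of truncating a floating-point product with `int()`. A minimal sketch, using illustrative values rather than the ones from the failing test, and a hypothetical `within_one` helper:

```python
fraction = 0.29  # illustrative values, not the ones from the failing test
total = 100

product = fraction * total  # stored as a binary float slightly below 29
truncated = int(product)    # int() truncates toward zero
nearest = round(product)    # round() picks the nearest integer


def within_one(actual: int, expected: int) -> bool:
    """Tolerant comparison for sizes derived from a float fraction (hypothetical)."""
    return abs(actual - expected) <= 1


print(product, truncated, nearest)
```

Whether a tolerance like this would be acceptable for the upstream test is a separate question; the sketch only shows why two ways of computing the same size can differ by exactly 1.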
    # Python 3.14 compatibility issues - storage deallocation tests fail
    # https://github.com/ROCm/TheRock/actions/runs/20955499125/job/60224765842
    "test_storage_dealloc_subclass_resurrected",
    "test_storage_dealloc_subclass_zombie",
GitHub logs will vanish after the retention period. File a new issue and make sure it gets the proper attention in the triage meeting. Furthermore, as above, this needs to go to a separate PR.
As above, the change above does not seem to be in your updated PR. Close the issue eventually?
     parser.add_argument(
-        "--no-cache",
-        default=False,
+        "--cache",
+        default=True,
         required=False,
         action=argparse.BooleanOptionalAction,
-        help="""Disable pytest caching. Useful when only having read-only access to pytorch directory""",
+        help="""Enable pytest caching (default: enabled). Use --no-cache to disable, useful when only having read-only access to pytorch directory""",
     )

     args = parser.parse_args(argv)

     if not args.pytorch_dir.exists():
         parser.error(
             f"Directory at '{args.pytorch_dir}' does not exist, checkout pytorch and then set the path via --pytorch-dir or check it out in TheRock/external-build/pytorch/<your pytorch directory>"
         )

     return args, passthrough_pytest_args


 def main() -> int:
     """Main entry point for the PyTorch test runner.

     Returns:
         Exit code from pytest (0 for success, non-zero for failures).
     """
     try:
         args, passthrough_pytest_args = cmd_arguments(sys.argv[1:])

         pytorch_dir = args.pytorch_dir

         # CRITICAL: Determine AMDGPU family and set HIP_VISIBLE_DEVICES
         # BEFORE importing torch/running pytest. Once torch.cuda is initialized,
         # changing HIP_VISIBLE_DEVICES has no effect.
         # For unit tests, run only on the first supported device (policy="single")
         ((first_arch, _),) = set_gpu_execution_policy(
             args.amdgpu_family, policy="single"
         )
         print(f"Using AMDGPU family: {first_arch}")

         # Determine PyTorch version
         pytorch_version = args.pytorch_version
         if not pytorch_version:
             pytorch_version = detect_pytorch_version()
         print(f"Using PyTorch version: {pytorch_version}")

         # Get tests to skip
         tests_to_skip = get_tests(
             amdgpu_family=first_arch,
             pytorch_version=pytorch_version,
             platform=platform.system(),
             create_skip_list=not args.debug,
         )

         # Allow manual override of test selection
         if args.k:
             tests_to_skip = args.k

         setup_env(pytorch_dir)

         pytest_args = [
             f"{pytorch_dir}/test/test_nn.py",
             f"{pytorch_dir}/test/test_torch.py",
             f"{pytorch_dir}/test/test_cuda.py",
             f"{pytorch_dir}/test/test_unary_ufuncs.py",
             f"{pytorch_dir}/test/test_binary_ufuncs.py",
             f"{pytorch_dir}/test/test_autograd.py",
             f"-k={tests_to_skip}",
             # "-n 0",  # TODO does this need rework? why should we not run this multithreaded? this does not seem to exist?
             # -n numprocesses, --numprocesses=numprocesses
             #         Shortcut for '--dist=load --tx=NUM*popen'.
             #         With 'logical', attempt to detect logical CPU count (requires psutil, falls back to 'auto').
             #         With 'auto', attempt to detect physical CPU count. If physical CPU count cannot be determined, falls back to 1.
             #         Forced to 0 (disabled) when used with --pdb.
         ]

-        if args.no_cache:
+        if not args.cache:
Why is the logic changed here? This is not related to 3.14.
The workflow failed with
Error: Exception in PyTorch unit-tests runner: invalid option name '--no-cache' for BooleanOptionalAction
Error: Process completed with exit code 1.
https://github.com/ROCm/TheRock/actions/runs/20942960623/job/60216725811
Should we create a separate issue for this also?
Of course this fails with --no-cache as invalid option because you renamed --no-cache to --cache above!
So asking again, why is the logic changed here?
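For context on what the `cache` flag would control on the pytest side: the body of the `if not args.cache:` branch is not shown in the diff, so the following is only a sketch under the assumption that the runner disables pytest's cache plugin there. The function name `build_pytest_args` is hypothetical; `-p no:cacheprovider` is pytest's standard way to turn the cache plugin off.

```python
def build_pytest_args(test_files: list[str], skip_expr: str, cache: bool) -> list[str]:
    """Assemble pytest arguments (hypothetical helper, not the runner's code)."""
    args = [*test_files, f"-k={skip_expr}"]
    if not cache:
        # Disabling the cacheprovider plugin stops pytest from writing a
        # .pytest_cache directory, which matters for read-only checkouts.
        args += ["-p", "no:cacheprovider"]
    return args


print(build_pytest_args(["test/test_nn.py"], "not test_foo", cache=False))
```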
Updated the PR Description with the latest workflow runs.
marbre left a comment
Test runs look good, but there are still unrelated changes with no rationale given for why they should be made.
    # The callstack for this one points to _fill_mem_eff_dropout_mask, so it may be related to aotriton?
    "test_cublas_config_nondeterministic_alert_cuda",
    # Large test that isn't very CI-friendly (takes ~2 seconds, possibly hanging)
    "test_memory_format_operators_cuda"
Why is this dropped? Comment (which was not deleted) states the reason for excluding this.
    # Move to gfx1151-specific skip list? Check if passing on Linux.
    # We could also skip all test_grad_*.
    "test_grad_scale_will_not_overflow_cuda",
    "test_grad_scaling_unscale_sparse_cuda_float32",
Have you verified this isn't flaky any more on gfx1151?
ScottTodd left a comment
Too much outdated or unrelated code in this PR to review.
    # Python 3.14 is supported for PyTorch 2.9 (preview) and nightly (full)
    - pytorch_git_ref: release/2.9
      python_version: "3.14"
      pytorch_patchset: rocm_2.9
    - pytorch_git_ref: nightly
      python_version: "3.14"
      pytorch_patchset: nightly
This code is outdated. Please keep active review branches up to date or convert your PR back to a draft.
- release/2.10 is now included, nightly is 2.11+
- pytorch_patchset no longer exists
…--no- prefix) (#3535)

## Summary

Python 3.14 no longer allows `--no-` prefixed option names with `argparse.BooleanOptionalAction` (python/cpython#117941). `run_pytorch_tests.py` defined `--no-cache` with `BooleanOptionalAction`, which crashes on Python 3.14.

## Failing job

- https://github.com/ROCm/TheRock/actions/runs/22133299032/job/64272029131 for branch `users/subodh-dubey-amd/pytorch-314`

## Testing

- **Test py 3.10–3.13:** https://github.com/ROCm/TheRock/actions/runs/22222064763
- **Python 3.14 fix:** https://github.com/ROCm/TheRock/actions/runs/22222485256

## Dependency chain

#3535 (this) → #3540 (skip test failures) → #2783 (add py3.14 to matrix)

## Related

- Fixes #2640
- Tracks #2985
- Unblocks #2783

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Laura Promberger <laura.promberger@amd.com>
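The fix described in this commit message can be sketched as follows. With `argparse.BooleanOptionalAction`, declaring the positive option `--cache` is enough: argparse generates the matching `--no-cache` form itself, so the negative name never has to be passed as an option string. The parser name and help text below are illustrative and not copied from `run_pytorch_tests.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser; only the --cache option mirrors the discussion above."""
    parser = argparse.ArgumentParser(prog="run_pytorch_tests_sketch")
    parser.add_argument(
        "--cache",
        default=True,
        # BooleanOptionalAction registers both --cache and --no-cache.
        action=argparse.BooleanOptionalAction,
        help="Enable pytest caching. Use --no-cache to disable.",
    )
    return parser


parser = build_parser()
print(parser.parse_args([]).cache)              # no flag given: default applies
print(parser.parse_args(["--no-cache"]).cache)  # generated negative form
```

Existing invocations that pass `--no-cache` on the command line keep working with this shape; only the declaration and the attribute name (`args.cache` instead of `args.no_cache`) change.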
What's the status here? The code diff seems reasonably up to date now. Are there recent test runs? Why is this still a draft?
Thanks for working on this — adding Python 3.14 support is definitely valuable. I reviewed this against the current state of A few things stood out:
So overall: the core approach makes sense, but I’d prefer to see this rebased and updated for the current branch matrix / dependency stack before merging.
Closing this PR. A clean and updated version is available here: #4145
## Summary

- Add Python 3.14 to the PyTorch release workflow matrices for release/2.9, release/2.10, and nightly
- Add `cp314` to the S3 dependency index allowlist so third-party wheels (numpy, etc.) are uploaded for Python 3.14
- Update RELEASES.md to document Python 3.14 support

## Dependency chain

This PR depends on the following being merged first:

1. **TheRock** #3540
2. **ROCm/pytorch** ROCm/pytorch#3099
3. **ROCm/pytorch** ROCm/pytorch#3100

The ROCm/pytorch PRs fix `numpy==2.1.2` having no cp314 wheels, which causes the build to fail when pip falls back to a source build under sccache/meson.

## Python 3.14 support matrix

| PyTorch version | Python 3.14 support |
|-----------------|---------------------|
| 2.8 | Not supported |
| 2.9 | Supported |
| 2.10 | Supported |
| nightly | Supported |

## Changes

- `release_portable_linux_pytorch_wheels.yml`: add py3.14 include entries for release/2.9, release/2.10, nightly
- `release_windows_pytorch_wheels.yml`: add py3.14 to the version list
- `update_dependencies.py`: add `cp314` to `_ALLOWED_CPYTHON_TAGS`
- `RELEASES.md`: update supported Python versions

## Test results

- Nightly py3.14 (already passing): https://github.com/ROCm/TheRock/actions/runs/23488558012/job/68350335178
- release/2.9 py3.14 (with numpy fix): https://github.com/ROCm/TheRock/actions/runs/23516862647/job/68451235656
  Latest after this merge - ROCm/pytorch#3099: https://github.com/ROCm/TheRock/actions/runs/23546615437/job/68548629107
- release/2.10 py3.14 (with numpy fix): https://github.com/ROCm/TheRock/actions/runs/23516828149/job/68451129501
  Latest after this merge - ROCm/pytorch#3100: https://github.com/ROCm/TheRock/actions/runs/23546731231/job/68549058264

## Related

- Fixes #2640
- Supersedes #2783
- Tracks #2985

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
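To illustrate what adding `cp314` to an allowlist of CPython tags means in practice, here is a minimal sketch of filtering wheel filenames by their Python tag. It is not the code from `update_dependencies.py`: the set contents, the helper names, the example filenames, and the choice to let non-CPython tags such as `py3` pass through are all assumptions made for the example.

```python
# Assumed allowlist contents; the real _ALLOWED_CPYTHON_TAGS may differ.
ALLOWED_CPYTHON_TAGS = {"cp310", "cp311", "cp312", "cp313", "cp314"}


def python_tag(wheel_filename: str) -> str:
    """Return the Python tag from a wheel filename.

    Wheel names end in {python tag}-{abi tag}-{platform tag}.whl, so the
    Python tag is the third dash-separated field from the end.
    """
    stem = wheel_filename.removesuffix(".whl")
    return stem.split("-")[-3]


def is_allowed(wheel_filename: str) -> bool:
    """Keep CPython wheels only if their tag is allowlisted (sketch)."""
    tags = python_tag(wheel_filename).split(".")  # tags can be compressed, e.g. py2.py3
    cpython_tags = [t for t in tags if t.startswith("cp")]
    if not cpython_tags:
        return True  # assumption: pure-Python wheels are not filtered
    return any(t in ALLOWED_CPYTHON_TAGS for t in cpython_tags)


print(is_allowed("numpy-2.3.0-cp314-cp314-manylinux_2_28_x86_64.whl"))
```

Without `cp314` in the set, a wheel tagged for Python 3.14 would be filtered out of the index, and pip would have to look for another distribution or fall back to a source build.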
Summary
Adds Python 3.14 support for PyTorch builds on both Linux and Windows platforms.
Python 3.14 Support Matrix
Changes
- .github/workflows/release_portable_linux_pytorch_wheels.yml to include Python 3.14 for PyTorch 2.9 and nightly
- .github/workflows/release_windows_pytorch_wheels.yml to include Python 3.14 for PyTorch 2.9 and nightly
- RELEASES.md to document Python 3.14 support (PyTorch 2.9+ only)
- build_tools/third_party/s3_management/update_dependencies.py to update s3 index for pip
Motivation
#2640: Add Python 3.14 support for PyTorch
References
Tests:
Release portable Linux PyTorch Wheels Tests:
Release Windows PyTorch Wheels:
Related
Submission Checklist
main after Fix run_pytorch_tests.py crash on Python 3.14 (BooleanOptionalAction --no- prefix) #3535 is merged