Generate a manifest file for Pytorch builds#2547
Conversation
marbre
left a comment
There was a problem hiding this comment.
Some additional feedback but need to take a closer look.
| if: ${{ github.repository_owner == 'ROCm' }} | ||
| run: | | ||
| aws s3 cp ${{ env.PACKAGE_DIST_DIR }}/manifests/ \ | ||
| s3://${{ env.S3_BUCKET_ARTIFACTS }}/external-builds/pytorch/${{ github.run_id }}/${{ inputs.amdgpu_family }}/ \ |
There was a problem hiding this comment.
Let's think carefully about the top level folder structure.
Here's an example Linux release package build: https://github.com/ROCm/TheRock/actions/runs/20737916627#summary-59538918041
- https://therock-nightly-artifacts.s3.amazonaws.com/20737916627-linux/
- https://therock-nightly-artifacts.s3.amazonaws.com/20737916627-linux/logs/gfx94X-dcgpu/index.html
- https://therock-nightly-artifacts.s3.amazonaws.com/20737916627-linux/logs/gfx94X-dcgpu/build_time_analysis.html
- https://therock-nightly-artifacts.s3.amazonaws.com/20737916627-linux/index-gfx94X-dcgpu.html
- https://therock-nightly-artifacts.s3.amazonaws.com/20737916627-linux/manifests/gfx94X-dcgpu/therock_manifest.json
Could we match that and write to ${{ github.run_id }}-{{ platform }}/manifests/{{ inputs.amdgpu_family }}/therock_torch_manifest.json?
For example:
20737916627-linux/manifests/gfx94X-dcgpu/therock_rocm_manifest.json
20737916627-linux/manifests/gfx94X-dcgpu/therock_torch_manifest.json
20737916627-linux/manifests/gfx94X-dcgpu/therock_jax_manifest.json
There was a problem hiding this comment.
I think there is an issue when we have single name as therock_torch_manifest.json per run. We need six different json files (json per job) from a particular run: e.g. https://github.com/ROCm/TheRock/actions/runs/20808841882/job/59768375462. So the last completed job will overwrite the therock_torch_manifest.json file.
There was a problem hiding this comment.
When #1236 is solved, each release build should use matching commits. Until then, we can upload one manifest per job with a descriptive name. We'll at least be able to compare manifests across workflow runs and jobs then to see how often the commits used differ.
There was a problem hiding this comment.
I see, sounds good to me.
There was a problem hiding this comment.
Tracking the future issue in here: #2827
There was a problem hiding this comment.
Actually, I think we should sequence this work differently.
- Solve Release workflows do not freeze the commits they select #1236 first. Generate manifest files at the start of the release pipelines for all releases, all repositories (ROCm, PyTorch, JAX, etc.).
- Trigger release builds based on those computed manifests.
That should be much cleaner architecturally. The build scripts won't need to know anything about manifests, they'll continue to operate as they currently do, based on whatever code has been checked out ahead of time by the workflows. We can add a mode to the checkout scripts that reads from the manifest as needed.
| if: ${{ github.repository_owner == 'ROCm' }} | ||
| run: | | ||
| aws s3 cp ${{ env.PACKAGE_DIST_DIR }}/manifests/ \ | ||
| s3://${{ env.S3_BUCKET_ARTIFACTS }}/external-builds/pytorch/${{ github.run_id }}/${{ inputs.amdgpu_family }}/ \ |
There was a problem hiding this comment.
Let's think carefully about the top level folder structure.
Here's an example Linux release package build: https://github.com/ROCm/TheRock/actions/runs/20737916627#summary-59538918041
- https://therock-nightly-artifacts.s3.amazonaws.com/20737916627-linux/
- https://therock-nightly-artifacts.s3.amazonaws.com/20737916627-linux/logs/gfx94X-dcgpu/index.html
- https://therock-nightly-artifacts.s3.amazonaws.com/20737916627-linux/logs/gfx94X-dcgpu/build_time_analysis.html
- https://therock-nightly-artifacts.s3.amazonaws.com/20737916627-linux/index-gfx94X-dcgpu.html
- https://therock-nightly-artifacts.s3.amazonaws.com/20737916627-linux/manifests/gfx94X-dcgpu/therock_manifest.json
Could we match that and write to ${{ github.run_id }}-{{ platform }}/manifests/{{ inputs.amdgpu_family }}/therock_torch_manifest.json?
For example:
20737916627-linux/manifests/gfx94X-dcgpu/therock_rocm_manifest.json
20737916627-linux/manifests/gfx94X-dcgpu/therock_torch_manifest.json
20737916627-linux/manifests/gfx94X-dcgpu/therock_jax_manifest.json
d49c64c to
3cb6cab
Compare
|
Tests: https://github.com/ROCm/TheRock/actions/runs/21955293350 Example from the test: |
ScottTodd
left a comment
There was a problem hiding this comment.
Looking pretty good. A few remaining comments about file organization and code style.
| --pytorch-dir "external-builds/pytorch/pytorch" \ | ||
| --pytorch-audio-dir "external-builds/pytorch/pytorch_audio" \ | ||
| --pytorch-vision-dir "external-builds/pytorch/pytorch_vision" \ | ||
| --triton-dir "external-builds/pytorch/triton" |
There was a problem hiding this comment.
Let's add apex now too. Fine to be a follow-up PR.
See 39d840b
|
Test: https://github.com/ROCm/TheRock/actions/runs/21970407618 CC: @ScottTodd |
There was a problem hiding this comment.
We could also include these changes in https://github.com/ROCm/TheRock/blob/main/.github/workflows/build_portable_linux_pytorch_wheels_ci.yml (new workflow for CI), but the longer term plans will solve for that naturally:
- In the "setup" job, compute a single manifests for all projects and upload it (or multiple manifests: one for ROCm, one for PyTorch, one for JAX, etc.)
- In all jobs follow that, use the manifests to decide what to checkout and build
## Motivation Add Apex to the Pytorch Manifest file Follow up PR for #2547 See the comment here: #2547 (comment) ## Test Plan Test on CI ## Test Result The test runs: https://github.com/ROCm/TheRock/actions/runs/22070943413 Example Manifest from the test: ``` { "pytorch": { "commit": "f466ea7ac710374ce7066d9ead92e57dc42dc76f", "repo": "https://github.com/ROCm/pytorch.git", "branch": "release/2.7" }, "pytorch_audio": { "commit": "95c61b4168fc5133be8dd8c1337d929d066ae6cf", "repo": "https://github.com/pytorch/audio" }, "pytorch_vision": { "commit": "59a3e1f9f78cfe44cb989877cc6f4ea77c8a75ca", "repo": "https://github.com/pytorch/vision" }, "triton": { "commit": "95abd4daa02d789ed257178ba35b4ef2d746ca29", "repo": "https://github.com/ROCm/triton.git" }, "apex": { "commit": "7a57becffa31a4394369152e978fa6f8a0f005c2", "repo": "https://github.com/ROCm/apex" }, "therock": { "commit": "02024427089134e951fe4063100b93c28e7e01a8", "repo": "https://github.com/ROCm/TheRock.git", "branch": "users/erman-gurses/add-apex-to-pytorch-manifest" } } ``` Link: https://therock-dev-artifacts.s3.us-east-2.amazonaws.com/22070943413-linux/manifests/gfx94X-dcgpu/therock-manifest_torch_py3.10_release-2.7.json ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Closes #2481
Generate a manifest file for Pytorch builds.
Example:
therock-manifest_torch_py3.11_release-2.7.jsonhttps://therock-dev-artifacts.s3.us-east-2.amazonaws.com/21179896705-linux/manifests/gfx110X-all/therock-manifest_torch_py3.11_release-2.7.json