Deduplicate TheRock CI#2771
Conversation
- Introduced logic to set the TheRock directory based on external source checkout input. - Added auto-detection of external repository configurations for rocm-libraries and rocm-systems. - Implemented conditional checkouts for TheRock and composable_kernel repositories. - Enhanced the fetch_sources.py script to handle external source exclusions. - Added a step to apply patches from TheRock to the external repository if applicable. - Updated paths in various steps to utilize the dynamically set TheRock directory.
- Added a step to set the TheRock directory based on the external source checkout input. - Updated various steps to utilize the dynamically set TheRock directory for Python dependencies, script execution, and patch application. - Simplified conditional logic for script paths to enhance maintainability.
- Improved the logic for setting the TheRock directory across both Linux and Windows build workflows. - Streamlined the integration of external source configurations and patching processes. - Updated scripts to ensure consistent usage of the dynamically set TheRock directory for dependency management and execution.
- Moved the test execution step for build_tools to a consistent position in both Linux and Windows build workflows. - Ensured that the test step is conditionally executed based on the external source checkout input. - Improved maintainability by standardizing the test execution logic across different platforms.
There was a problem hiding this comment.
looks great overall, thanks for this! some cleanup comments but no major concerns for how it is set up!
I remember the original concern for not combining was that updating in TheRock CI to update in rocm-libs/rocm-systems was really annoying (as it was being brought up), but now it's fairly stable and can be combined!
This commit addresses all review feedback from PR #2771, improving code quality, maintainability, and fixing a critical concurrency bug. Review Comments Addressed: 1. Use env conditional for THEROCK_DIR instead of bash script 2. Move repo detection logic to Python script with full test coverage 3. Python script generates complete CMake options (cleaner YAML) 4. Add Windows composable_kernel support to miopen and hipdnn 5. Remove redundant ENABLE_ALL=OFF flags (auto-added by code) 6. Move inline comments above code blocks Bug Fixes: - Fix concurrency deadlock when multiple external repos call TheRock CI with same PR numbers by including repository name in concurrency group Changes: - build_portable_linux_artifacts.yml: Simplify THEROCK_DIR, move TheRock checkout before auto-detect, call Python script for repo detection - build_windows_artifacts.yml: Same improvements as Linux workflow - ci.yml: Add github.repository to concurrency group to prevent deadlocks - configure_ci.py: Move comment above if statement per code style - detect_external_repo_config.py: NEW - Centralized repo detection with --workspace flag to generate formatted CMake options - test_detect_external_repo_config.py: NEW - 9 unit tests for repo detection - external_repo_project_maps.py: Add Windows composable_kernel support, remove redundant ENABLE_ALL=OFF flags - test_external_repo_project_maps.py: Update tests for Windows CK support Testing: - All 22 unit tests pass - All pre-commit checks pass - No linter errors Reviewer: @geomin12
This commit addresses more review feedback from PR #2771. Review Comments Addressed: 1. Use ternary operator in setup.yml to consolidate checkout steps 2. Combine duplicate functions - detect_external_projects.py now imports from configure_ci.py instead of duplicating code 3. Pass therock_ref through to Linux/Windows build workflows so external repos can specify which TheRock branch to use 4. Simplify configure_ci.py imports - remove unnecessary try/except since external repos always have TheRock checked out Changes: - setup.yml: Consolidated 3 checkout steps into 2 using ternary for path - detect_external_projects.py: Import shared functions from configure_ci.py - configure_ci.py: Removed try/except import pattern, simplified - ci.yml: Pass therock_ref to Linux and Windows jobs - ci_linux.yml: Accept and pass therock_ref to build workflow - ci_windows.yml: Accept and pass therock_ref to build workflow - build_portable_linux_artifacts.yml: Accept therock_ref, use in checkout - build_windows_artifacts.yml: Accept therock_ref, use in checkout Testing: - All linter checks pass - No import errors Reviewer: @geomin12
…ting This commit adds comprehensive support for testing external repository CI integration directly from TheRock, enabling automated regression testing. Key Features: 1. repository_override parameter - Allows testing external repo scenarios from TheRock by overriding github.repository detection 2. projects parameter - Allows overriding project detection for targeted testing (comma-separated format) 3. Automated PR testing - New workflow runs automatically on CI file changes 4. Input validation - Prevents misuse of repository_override Changes: Workflows: - ci.yml: Added repository_override and projects inputs - setup.yml: Added validation, fixed checkout to use override, passes parameters as env vars (GITHUB_REPOSITORY_OVERRIDE, PROJECTS) - ci_linux.yml/ci_windows.yml: Pass through parameters - build_portable_linux_artifacts.yml: Use override in detection, added repository_override input - build_windows_artifacts.yml: Use override in detection, added repository_override input - test_external_repo_integration.yml: NEW - Automated testing workflow with PR trigger and separate project inputs per repo Scripts: - configure_ci.py: Check GITHUB_REPOSITORY_OVERRIDE env var before path-based detection - detect_external_projects.py: Changed to comma-separated format, projects input now acts as override (blank = auto-detect from files) Other: - .gitignore: Added CURSOR_SESSION.md exclusion Testing: - Automatic: Runs on PRs modifying CI infrastructure files - Manual: workflow_dispatch with per-repo project overrides - Projects input behavior: * Blank (default) = auto-detect from modified files * "all" = test all projects (override) * "proj1,proj2" = test specific projects (override) This enables: - Automated regression testing for external repo integration - Targeted testing without CODEOWNERS issues - Future PR automation for CI changes Related: ROCm/rocm-libraries#3629, ROCm/rocm-systems#2495
Changed project input format from space-separated to comma-separated and added pass-through to TheRock's CI workflows. Changes: - Updated projects input description to specify comma-separated format - Clarified that projects input is an override (blank = auto-detect) - Added projects parameter to workflow_call (passes to TheRock) Example usage: - Blank: Auto-detect projects from PR file changes (normal behavior) - "all": Test all projects (override) - "projects/clr,projects/rocminfo": Test specific projects (override) This aligns with TheRock's updated external repo testing infrastructure and enables targeted testing when needed. Related: ROCm/TheRock#2771
Changed project input format from space-separated to comma-separated and added pass-through to TheRock's CI workflows. Changes: - Updated projects input description to specify comma-separated format - Clarified that projects input is an override (blank = auto-detect) - Added projects parameter to workflow_call (passes to TheRock) Example usage: - Blank: Auto-detect projects from PR file changes (normal behavior) - "all": Test all projects (override) - "projects/rocprim,projects/hipcub": Test specific projects (override) This aligns with TheRock's updated external repo testing infrastructure and enables targeted testing when needed. Related: ROCm/TheRock#2771
|
Thanks for the initial feedback. Addressed the review comments. I also added override inputs for repository and projects affected so that testing of the external repo workflows can be done from those repos through |
This commit consolidates multiple fixes to the external repository integration testing infrastructure, addressing path resolution issues and workflow configuration problems that prevented proper testing of rocm-libraries and rocm-systems. Key Changes: Workflow Configuration Fixes: - Fixed git config operations for external repo checkouts (safe.directory, fetch.parallel) - Corrected working-directory settings across Linux and Windows workflows - Fixed duplicate TheRock checkout issue in external repo workflows - Added proper concurrency group handling to prevent conflicts during parallel tests - Improved PR_LABELS handling for empty label cases Path Resolution Improvements: - Corrected CMake source directory paths for external repos (container vs host paths) - Fixed detect_external_repo_config.py path resolution using THEROCK_DIR - Standardized relative path handling between external and TheRock repos - Fixed build directory paths for proper artifact generation External Repo Testing Enhancements: - Added repository_override parameter for testing external repos from TheRock - Added projects parameter for fine-grained project selection testing - Improved diagnostic logging for external repo workflows - Fixed Python setup timing in Windows workflows Configuration Script Updates: - Updated configure_ci.py to honor explicitly provided GPU families in PRs - Applied code formatting (black) to Python scripts - Enhanced external repo detection and configuration logic The changes ensure that external repository CI integration tests can properly: 1. Checkout both the external repo and TheRock in correct directory structure 2. Run detection scripts with proper working directories and path references 3. Generate correct CMake configuration for external source builds 4. Handle git operations in the correct repository contexts
7b27e55 to
5e55941
Compare
Enables rocm-libraries and rocm-systems to use TheRock's CI infrastructure via workflow_call, eliminating duplicate workflow definitions. Key fixes: - Remove REPOSITORY_OVERRIDE from artifact downloads (artifacts accessed via S3 paths, not GitHub API) - Remove project prefix from artifact_group for external repos (artifact names don't include project prefix) - Enable builds for workflow_dispatch from external repos - Generate platform-specific configs in single detection pass - Use composablekernel from rocm-libraries instead of TheRock submodule - Update test coverage for external repo integration External repo workflows call TheRock's ci.yml with external_source_checkout=true, allowing them to leverage TheRock's build/test infrastructure while building from their own source repositories.
HereThereBeDragons
left a comment
There was a problem hiding this comment.
had some issue with the github ui and needed to change to classic.
if there is anything odd e.g. same comment twice please ignore.
- Stop shelling out for external project detection; centralize logic in external_repo_project_maps - Accept canonical GPU family suffixes in overrides; update integration workflow overrides - Standardize env var PROJECT_TO_TEST and tighten detect_external_repo_config CLI/validation - Replace placeholder tests with real configure_ci/detect_external_repo_config unit tests - Improve external repo detection to use path components instead of substring matching - Add matrix size logging to help detect potential matrix explosion early
- Extract _add_gpu_families_from_input() helper to deduplicate GPU family parsing logic between workflow_dispatch and pull_request paths - Refactor main() in configure_ci.py into smaller functions: * _extract_event_flags(): Extract and log event type flags * _generate_base_matrices(): Generate GPU family matrices for both platforms * _apply_external_project_cross_product(): Cross-product external projects with GPU families Main function reduced from 114 lines to ~45 lines - Improve detect_external_repo_config.py: * Enhance output_github_actions_vars() docstring with Args and Returns sections * Pre-flatten platform-specific config values in main() for cleaner separation of concerns * Remove platform parameter from output_github_actions_vars() (now takes pre-resolved config) * Improve ArgumentParser description with output format examples and better epilog * Remove duplicate repository argument (positional + --repository) - Refactor external_repo_project_maps.py: * Change collect_projects_to_run() to take repo_name instead of 4 separate dict fields * Cleaner API - function internally calls get_repo_config() rather than requiring caller to unpack config * Update all tests to use new simpler signature All tests pass (22/22). Pre-commit hooks pass.
|
Applied feedback and CI runs look OK. TheRock CI (internal): Run #10715 passes External repo integration test (TheRock): Run #50 passes TheRock CI from rocm-libraries: Run #18050 passes TheRock CI from rocm-systems: Run #14327 compiled, and as of this message hip-tests started on gfx94X Linux to match inputs to the workflow, gfx120X Windows was skipped because there are no runners set for this gfx-OS combination. |
|
Since today, I cannot manually run CI Nightly jobs anymore. They behave as-if I didn't specify any amdgpu family in the workflow dispatch on the github UI even though I did. Example Could this be caused by this PR? |
|
@mgehre-amd can you please provide which inputs you set? |
the rest stays at its defaults. |
|
@jayhawk-commits looking at the different runs from matthias,i see that setup.yml does not get amdgpu_family set. so i think the change from recommendation is to add these params to the |
|
Adding |
|
Taking a look. |
…#2951) After #2771, these workflows were calling setup.yml but not forwarding their workflow_dispatch inputs (linux_amdgpu_families, test_labels, etc). This caused the setup job to use default empty values instead of the user-provided inputs or the workflow's own defaults. Forward all relevant inputs from workflow_dispatch to setup.yml's workflow_call to fix this. **Affected workflows:** - ci_nightly.yml - ci_asan.yml - multi_arch_ci.yml
ScottTodd
left a comment
There was a problem hiding this comment.
Too many design issues here. Please revert.
| - name: Patch external repo | ||
| if: ${{ inputs.external_source_checkout }} | ||
| working-directory: source-repo | ||
| run: | | ||
| # Apply patches from TheRock to the external repo | ||
| PATCHES_DIR="../$THEROCK_DIR/patches/amd-mainline/${{ steps.detect.outputs.patches_dir }}" | ||
| if [ -d "$PATCHES_DIR" ] && [ "$(ls -A $PATCHES_DIR/*.patch 2>/dev/null)" ]; then | ||
| git -c user.name="therockbot" -c "user.email=therockbot@amd.com" \ | ||
| am --whitespace=nowarn $PATCHES_DIR/*.patch | ||
| else | ||
| echo "No patches found in $PATCHES_DIR or directory doesn't exist" | ||
| fi |
There was a problem hiding this comment.
How will this work? If the external repository changes patches, will it have any way to edit this code?
This will need documentation updates in https://github.com/ROCm/TheRock/tree/main/patches#using-patches-from-rocm-libraries-rocm-systems-and-other-repositories.
| path: ${{ inputs.external_source_checkout && 'source-repo' || '.' }} | ||
|
|
||
| # safe.directory must be set before Runner Health Status | ||
| - name: Adjust git config | ||
| working-directory: ${{ inputs.external_source_checkout && 'source-repo' || '.' }} |
There was a problem hiding this comment.
This logic is duplicated a few times. We may want to pull that up into env and then use ${{ env. }} throughout to minimize how much logic there is after job startup time.
| - name: Detect external source configuration | ||
| if: ${{ inputs.external_source_checkout }} | ||
| id: detect | ||
| run: python3 $THEROCK_DIR/build_tools/github_actions/detect_external_repo_config.py --repository "${{ inputs.repository_override || github.repository }}" --platform linux |
There was a problem hiding this comment.
Please brace-delimit all bash variables. Either use ${THEROCK_DIR} (common in scripts) or ${{ env.THEROCK_DIR }} (common in github actions workflows)
- https://github.com/ROCm/TheRock/blob/main/docs/development/style_guides/bash_style_guide.md#core-principles
- https://google.github.io/styleguide/shellguide.html#variable-expansion
Just typing $THEROCK_DIR does not make it clear when the variable name ends
| extra_cmake_options: >- | ||
| ${{ inputs.external_source_checkout && format('-D{0}=../source-repo', steps.detect.outputs.cmake_source_var) || '' }} | ||
| ${{ inputs.extra_cmake_options }} | ||
| BUILD_DIR: build | ||
| run: | | ||
| python3 build_tools/github_actions/build_configure.py --manylinux | ||
| python3 $THEROCK_DIR/build_tools/github_actions/build_configure.py ${{ inputs.external_source_checkout && '' || '--manylinux' }} |
There was a problem hiding this comment.
Too much magic here with format('-D{0} and ${{ inputs.external_source_checkout && '' || '--manylinux' }}
At a minimum this needs comments explaining the intent.
I think changes to the build_configure.py script may be more appropriate though.
| - name: Test Packaging | ||
| if: ${{ github.event.repository.name == 'TheRock' }} | ||
| if: ${{ github.event.repository.name == 'TheRock' && !inputs.external_source_checkout }} | ||
| run: | | ||
| ctest --test-dir build --output-on-failure | ||
| ctest --test-dir $THEROCK_DIR/build --output-on-failure |
There was a problem hiding this comment.
Why skip packaging tests if there is an external source checkout? Is that ever set when the event repository name is not TheRock?
There was a problem hiding this comment.
Getting the GitHub unicorn right now when trying to look up the existing code on the super-repos, but the behaviour for this PR is to first match what is being done on the CI workflows on the super-repos before adjusting/improving them. This would apply for many of your commented sections. I'll find the lines of code when GitHub is loading for me.
There was a problem hiding this comment.
The ctest step after the build is not run on the super-repos: https://github.com/ROCm/rocm-libraries/blob/develop/.github/workflows/therock-ci-linux.yml#L111
| # If the dependent job failed/cancelled, this job will not be run | ||
| # The use_prebuilt_artifacts "or" statement ensures that tests will run if | ||
| # previous build step is run or skipped.concurrency. | ||
| # Skip for external repos (they don't need Python packages) |
There was a problem hiding this comment.
Huh? They absolutely do need Python packages.
There was a problem hiding this comment.
The wording on 'need' is incorrect in the comment, and more that we are matching current state as external repo workflows do not build python packages as-is. See https://github.com/ROCm/rocm-libraries/actions/runs/21027099988 as example.
| run: | | ||
| if [[ "${{ inputs.external_source_checkout }}" == "true" ]]; then | ||
| cd source-repo | ||
| python ../TheRock/build_tools/github_actions/configure_ci.py | ||
| else | ||
| ./build_tools/github_actions/configure_ci.py | ||
| fi |
There was a problem hiding this comment.
There is a different configure script in rocm-libraries: https://github.com/ROCm/rocm-libraries/blob/develop/.github/scripts/therock_configure_ci.py. Will this stop using that? The logic has diverged quite a bit.
There was a problem hiding this comment.
The configure_ci.py logic from the super-repos is captured now in TheRock through its configure_ci.py and the new python scripts added in this PR.
| run: | | ||
| if [[ "${{ inputs.external_source_checkout }}" == "true" ]]; then | ||
| # Use ADHOCBUILD for external repos to avoid building Python packages | ||
| echo "rocm_package_version=ADHOCBUILD" >> $GITHUB_OUTPUT | ||
| else | ||
| python ./build_tools/compute_rocm_package_version.py --release-type=dev | ||
| fi |
There was a problem hiding this comment.
No! This is all wrong - we need to build all packages in all projects. Don't use ADHOCBUILD as the version.
There was a problem hiding this comment.
ADHOCBUILD is the version in the external repo workflows: https://github.com/ROCm/rocm-libraries/blob/develop/.github/workflows/therock-ci-linux.yml#L100
There was a problem hiding this comment.
It's wrong there. I fixed it here. Please don't generate more work to clean up later.
|
Please revert - https://llvm.org/docs/CodeReview.html#can-code-be-reviewed-after-it-is-committed explains better than I can:
|
Will do. Was waiting for unicorn issues to be resolved. |
This reverts commit a1898df.
This PR reverts the CI deduplication work from PR #2771 and its follow-up fix #2951. ## Reverted commits - a1898df - "Deduplicate TheRock CI (#2771)" - f6b0e0e - "[GitHub Actions] Update workflows calling setup.yml with their inputs (#2951)" ## Reason for Revert Deduplication design requires further changes. --------- Co-authored-by: Joseph Macaranas <145489236+amd-jmacaran@users.noreply.github.com>

CI Deduplication: Centralize configuration and simplify workflows
Summary
This PR implements a comprehensive CI deduplication initiative across TheRock, rocm-libraries, and rocm-systems repositories.
Key achievements:
This PR enables:
Result: One place to maintain CI configuration instead of three.
Architecture: Before vs After
Before: Duplicated Configuration ❌
graph TB subgraph TheRock_Repo [TheRock Repository] TR_Matrix[amdgpu_family_matrix.py] TR_ConfigCI[configure_ci.py] TR_Workflows[CI Workflows] end subgraph RocmLibs [rocm-libraries Repository - DUPLICATES] RL_Matrix["therock_matrix.py<br/>⚠️ DUPLICATE"] RL_ConfigCI["therock_configure_ci.py<br/>⚠️ DUPLICATE"] RL_Workflows["therock-ci-*.yml<br/>⚠️ DUPLICATE"] end subgraph RocmSystems [rocm-systems Repository - DUPLICATES] RS_Matrix["therock_matrix.py<br/>⚠️ DUPLICATE"] RS_ConfigCI["therock_configure_ci.py<br/>⚠️ DUPLICATE"] RS_Workflows["therock-ci-*.yml<br/>⚠️ DUPLICATE"] end TR_Matrix -.manual copies.-> RL_Matrix TR_Matrix -.manual copies.-> RS_Matrix TR_ConfigCI -.manual copies.-> RL_ConfigCI TR_ConfigCI -.manual copies.-> RS_ConfigCI TR_Workflows -.manual copies.-> RL_Workflows TR_Workflows -.manual copies.-> RS_WorkflowsAfter: Centralized Configuration ✅
graph TB subgraph TheRock_Repo ["TheRock Repository ✅ SINGLE SOURCE OF TRUTH"] TR_Matrix[amdgpu_family_matrix.py] TR_ConfigCI["configure_ci.py<br/>with external repo detection"] TR_Build_Linux["build_portable_linux_artifacts.yml"] TR_Build_Windows["build_windows_artifacts.yml"] TR_Test["test_component.yml<br/>test_artifacts.yml"] end subgraph RocmLibs [rocm-libraries Repository] RL_CI["therock-ci.yml<br/>simple caller"] RL_Nightly["therock-ci-nightly.yml<br/>simple caller"] end subgraph RocmSystems [rocm-systems Repository] RS_CI["therock-ci.yml<br/>simple caller"] end RL_CI -->|external_source_checkout=true| TR_ConfigCI RL_Nightly -->|external_source_checkout=true| TR_ConfigCI RS_CI -->|external_source_checkout=true| TR_ConfigCI TR_ConfigCI -->|detects rocm-libraries| TR_Build_Linux TR_ConfigCI -->|detects rocm-libraries| TR_Build_Windows TR_ConfigCI -->|detects rocm-systems| TR_Build_Linux TR_ConfigCI -->|detects rocm-systems| TR_Build_Windows TR_Build_Linux --> TR_Test TR_Build_Windows --> TR_TestFile Deletion Mapping
rocm-libraries - 6 files deleted
.github/scripts/therock_matrix.pybuild_tools/github_actions/amdgpu_family_matrix.py.github/scripts/therock_configure_ci.pybuild_tools/github_actions/configure_ci.py.github/workflows/therock-ci-linux.yml.github/workflows/build_portable_linux_artifacts.ymlrocm-libraries, checks out TheRock+CK.github/workflows/therock-ci-windows.yml.github/workflows/build_windows_artifacts.ymlrocm-libraries, checks out TheRock+CK.github/workflows/therock-test-component.yml.github/workflows/test_component.yml.github/workflows/therock-test-packages.yml.github/workflows/test_artifacts.ymlrocm-systems - 5 files deleted
.github/scripts/therock_matrix.pybuild_tools/github_actions/amdgpu_family_matrix.py.github/scripts/therock_configure_ci.pybuild_tools/github_actions/configure_ci.py.github/workflows/therock-ci-linux.yml.github/workflows/build_portable_linux_artifacts.ymlrocm-systems, checks out TheRock.github/workflows/therock-ci-windows.yml.github/workflows/build_windows_artifacts.ymlrocm-systems, checks out TheRock.github/workflows/therock-test-packages.yml.github/workflows/test_artifacts.ymlKey Changes
New External Repo Infrastructure
.github/workflows/test_external_repo_integration.yml(NEW)github.repositoryvaluebuild_tools/github_actions/detect_external_repo_config.py(NEW)build_tools/github_actions/external_repo_project_maps.py(NEW)collect_projects_to_run()function with clean 3-parameter API (subtrees, platform, repo_name)build_tools/github_actions/configure_ci.py(ENHANCED)main()from 114 → 45 lines with helper functions:_extract_event_flags()- Extract and log event type flags_generate_base_matrices()- Generate GPU family matrices_apply_external_project_cross_product()- Cross-product projects × GPU families_add_gpu_families_from_input()helper eliminates duplicate GPU family parsingWorkflow Files (10 files updated)
external_source_checkout,therock_ref,repository_override,projectsparameterssource-repo/, TheRock totr/THEROCK_DIRenvironment variable (trfor external,.for TheRock)detect_external_repo_config.pypatches/amd-mainline/{repo}/Unit Tests (3 new test files)
test_configure_ci_external.py- External repo detection and project exit logictest_detect_external_repo_config.py- Repository detection and configuration (9 tests)test_external_repo_project_maps.py- Project mappings with path validation (13 tests)Testing
Verified CI Runs
All scenarios have been tested and verified:
TheRock CI (internal): Run #10715
External repo integration test (TheRock): Run #50
TheRock CI from rocm-libraries: Run #18050
TheRock CI from rocm-systems: Run #14327
Key difference: The external integration test (#50) vs external repo triggers (#17546, #13905) differ only in S3 bucket permissions. The integration test uses TheRock's internal role since it runs from TheRock, while real external repo triggers use the external role. All other behavior is identical, ensuring the integration test accurately validates external repo functionality.
Related PRs