-
Notifications
You must be signed in to change notification settings - Fork 23
Training Benchmark #464
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Training Benchmark #464
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR modifies the CI/CD pipeline to focus on Docker image building and temporarily disables unit testing. The primary changes involve switching from v25.10 to v25.09 for the AINIC Docker image build and commenting out the entire unit test workflow for both Torch and JAX frameworks.
Key Changes:
- Docker image build configuration updated from v25.10-ainic to v25.09-ainic
- Unit test workflows (
run-unittest-torchandrun-unittest-jax) completely disabled via comments
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # start_time=$(date +%s) | ||
| # mkdir -p $GITHUB_WORKSPACE/.github/workflows/docker/ainic | ||
| # cp /apps/tas/0_public/primus_docker_ci/ainic/ainic_bundle_1.117.5-a-38.tar.gz $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ || { echo "Error: Failed to copy ainic bundle"; exit 1; } | ||
| # cp /apps/tas/0_public/primus_docker_ci/ainic/amd-anp-v1.3.0.patch $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ |
Copilot
AI
Dec 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commented-out v25.10 build still references the patch file amd-anp-v1.3.0.patch, but this line is no longer present in the active v25.09 build section (line 128). If the v25.09 build requires this patch file, it should be explicitly included; if not, this discrepancy could cause issues if the v25.10 build is re-enabled in the future without updating the patch dependency.
| # cp /apps/tas/0_public/primus_docker_ci/ainic/amd-anp-v1.3.0.patch $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ |
| cp /apps/tas/0_public/primus_docker_ci/ainic/amd-anp-v1.3.0.patch $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ | ||
| docker build -f $GITHUB_WORKSPACE/.github/workflows/docker/Dockerfile_v25.10_ainic \ | ||
| -t tasimage/primus:${{env.IMAGE_TAG}}-v25.10-ainic \ | ||
| cp /apps/tas/0_public/primus_docker_ci/ainic/ainic_bundle_1.117.1-a-42.tar.gz $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ || { echo "Error: Failed to copy ainic bundle"; exit 1; } |
Copilot
AI
Dec 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The AINIC bundle version has changed from 1.117.5-a-38 to 1.117.1-a-42. This appears to be a downgrade in the major version number (117.5 to 117.1), which is unusual. Verify that this version change is intentional and that v1.117.1-a-42 is the correct bundle for the v25.09 build.
| cp /apps/tas/0_public/primus_docker_ci/ainic/ainic_bundle_1.117.1-a-42.tar.gz $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ || { echo "Error: Failed to copy ainic bundle"; exit 1; } | |
| cp /apps/tas/0_public/primus_docker_ci/ainic/ainic_bundle_1.117.5-a-38.tar.gz $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ || { echo "Error: Failed to copy ainic bundle"; exit 1; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| # unset NCCL_IB_GID_INDEX | ||
| export NCCL_IB_TC=${NCCL_IB_TC:-104} | ||
| export NCCL_IB_FIFO_TC=${NCCL_IB_FIFO_TC:-184} # 192 |
Copilot
AI
Dec 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The inline comment "# 192" suggests this was the previous value, but appears as trailing documentation. Consider removing this comment or clarifying why 192 is noted if it provides important context for future reference.
| export NCCL_IB_FIFO_TC=${NCCL_IB_FIFO_TC:-184} # 192 | |
| export NCCL_IB_FIFO_TC=${NCCL_IB_FIFO_TC:-184} # previous default: 192 |
| # start_time=$(date +%s) | ||
| # mkdir -p $GITHUB_WORKSPACE/.github/workflows/docker/ainic | ||
| # cp /apps/tas/0_public/primus_docker_ci/ainic/ainic_bundle_1.117.5-a-38.tar.gz $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ || { echo "Error: Failed to copy ainic bundle"; exit 1; } | ||
| # cp /apps/tas/0_public/primus_docker_ci/ainic/amd-anp-v1.3.0.patch $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ |
Copilot
AI
Dec 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The patch file copy command was removed when switching from v25.10 to v25.09, but this line still references it. If the v25.09 build doesn't require this patch file, this line should be removed to match the v25.10 pattern where it was commented out.
| # cp /apps/tas/0_public/primus_docker_ci/ainic/amd-anp-v1.3.0.patch $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ENV USE_UCX_VERSION="1.15.0" | ||
| RUN cd ${WORKDIR} && wget https://github.com/openucx/ucx/releases/download/v${USE_UCX_VERSION}/ucx-${USE_UCX_VERSION}.tar.gz && \ | ||
| mkdir -p ucx-${USE_UCX_VERSION} && \ | ||
| tar -zxf ucx-${USE_UCX_VERSION}.tar.gz -C ucx-${USE_UCX_VERSION} --strip-components=1 && \ | ||
| cd ucx-${USE_UCX_VERSION} && mkdir build && cd build && \ | ||
| ../configure --prefix=${WORKDIR}/ucx-${USE_UCX_VERSION}/install --with-rocm=${ROCM_PATH} 2>&1 | tee log_ucx_configure.txt && \ | ||
| make -j 16 2>&1 | tee log_ucx_build.txt && \ | ||
| make install && \ | ||
| cd ${WORKDIR} | ||
|
|
||
| ENV UCX_INSTALL_DIR=${WORKDIR}/ucx-${USE_UCX_VERSION}/install |
Copilot
AI
Dec 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The environment variable name 'USE_UCX_VERSION' is inconsistent with naming conventions. Consider using 'UCX_VERSION' to match the pattern of other version variables in the file.
| ENV USE_UCX_VERSION="1.15.0" | |
| RUN cd ${WORKDIR} && wget https://github.com/openucx/ucx/releases/download/v${USE_UCX_VERSION}/ucx-${USE_UCX_VERSION}.tar.gz && \ | |
| mkdir -p ucx-${USE_UCX_VERSION} && \ | |
| tar -zxf ucx-${USE_UCX_VERSION}.tar.gz -C ucx-${USE_UCX_VERSION} --strip-components=1 && \ | |
| cd ucx-${USE_UCX_VERSION} && mkdir build && cd build && \ | |
| ../configure --prefix=${WORKDIR}/ucx-${USE_UCX_VERSION}/install --with-rocm=${ROCM_PATH} 2>&1 | tee log_ucx_configure.txt && \ | |
| make -j 16 2>&1 | tee log_ucx_build.txt && \ | |
| make install && \ | |
| cd ${WORKDIR} | |
| ENV UCX_INSTALL_DIR=${WORKDIR}/ucx-${USE_UCX_VERSION}/install | |
| ENV UCX_VERSION="1.15.0" | |
| RUN cd ${WORKDIR} && wget https://github.com/openucx/ucx/releases/download/v${UCX_VERSION}/ucx-${UCX_VERSION}.tar.gz && \ | |
| mkdir -p ucx-${UCX_VERSION} && \ | |
| tar -zxf ucx-${UCX_VERSION}.tar.gz -C ucx-${UCX_VERSION} --strip-components=1 && \ | |
| cd ucx-${UCX_VERSION} && mkdir build && cd build && \ | |
| ../configure --prefix=${WORKDIR}/ucx-${UCX_VERSION}/install --with-rocm=${ROCM_PATH} 2>&1 | tee log_ucx_configure.txt && \ | |
| make -j 16 2>&1 | tee log_ucx_build.txt && \ | |
| make install && \ | |
| cd ${WORKDIR} | |
| ENV UCX_INSTALL_DIR=${WORKDIR}/ucx-${UCX_VERSION}/install |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ENV MPI_PATH=${WORKDIR}/ompi-4.1.6/install | ||
|
|
||
| # =============================== Build RCCL =============================== | ||
| # cd rccl && git checkout rocm-7.1.0 && \ |
Copilot
AI
Dec 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This commented-out line appears to be leftover code. If the git checkout command is intentionally disabled, add a comment explaining why; otherwise, remove it to avoid confusion.
| # cd rccl && git checkout rocm-7.1.0 && \ |
| # sed -i '5a CFLAGS += --offload-arch=gfx950' ./Makefile && head -10 ./Makefile && \ | ||
|
|
Copilot
AI
Dec 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This commented-out line appears redundant since line 102 performs the same operation. Consider removing this commented line to reduce clutter.
| # sed -i '5a CFLAGS += --offload-arch=gfx950' ./Makefile && head -10 ./Makefile && \ |
| # sed -i '5a CFLAGS += --offload-arch=gfx950' ./Makefile && head -10 ./Makefile && \ | ||
|
|
||
| WORKDIR /opt | ||
| # COPY ${AINIC_BUNDLE_PATH}/amd-anp-v1.3.0.patch /opt/ | ||
| # git apply /opt/amd-anp-v1.3.0.patch && \ |
Copilot
AI
Dec 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These commented-out patch-related lines suggest incomplete implementation. Either implement the patch application if needed or remove these lines with an explanation of why the patch is no longer required.
| # sed -i '5a CFLAGS += --offload-arch=gfx950' ./Makefile && head -10 ./Makefile && \ | |
| WORKDIR /opt | |
| # COPY ${AINIC_BUNDLE_PATH}/amd-anp-v1.3.0.patch /opt/ | |
| # git apply /opt/amd-anp-v1.3.0.patch && \ | |
| # Note: amd-anp v1.3.0 is built directly from the upstream tag; any required | |
| # customization is applied via the sed-based CFLAGS modification below, so no | |
| # additional patch file is applied here. | |
| WORKDIR /opt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # cd rccl && git checkout rocm-7.1.0 && \ | ||
| RUN cd ${WORKDIR} && \ | ||
| git clone https://github.com/ROCm/rccl.git && \ | ||
| cd rccl && git checkout release/rocm-rel-7.1 && \ |
Copilot
AI
Jan 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RCCL checkout uses a branch name 'release/rocm-rel-7.1' instead of a specific tag or commit. This could lead to non-reproducible builds as the branch may be updated. Consider using a specific tag or commit hash for reproducibility.
| cd rccl && git checkout release/rocm-rel-7.1 && \ | |
| cd rccl && git checkout rocm-7.1.0 && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| # unset NCCL_IB_GID_INDEX | ||
| export NCCL_IB_TC=${NCCL_IB_TC:-104} | ||
| export NCCL_IB_FIFO_TC=${NCCL_IB_FIFO_TC:-184} # 192 |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The inline comment '# 192' is unclear - it should explain why 192 is referenced when the default is 184, or be removed if it's no longer relevant.
| export NCCL_IB_FIFO_TC=${NCCL_IB_FIFO_TC:-184} # 192 | |
| export NCCL_IB_FIFO_TC=${NCCL_IB_FIFO_TC:-184} |
| # if [ -f "${ANP_HOME_DIR}/build/librccl-net.so" ]; then | ||
| # export LD_PRELOAD="${ANP_HOME_DIR}/build/librccl-net.so:${RCCL_HOME_DIR}/build/release/librccl.so.1.0" | ||
| # elif [ -f "${ANP_HOME_DIR}/build/librccl-anp.so" ]; then | ||
| # export LD_PRELOAD="${ANP_HOME_DIR}/build/librccl-anp.so:${RCCL_HOME_DIR}/build/release/librccl.so.1.0" | ||
| # else | ||
| # echo "ERROR: Neither librccl-net.so nor librccl-anp.so was found in ${ANP_HOME_DIR}/build" | ||
| # exit 1 | ||
| # fi |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The entire LD_PRELOAD configuration logic is commented out without explanation. This appears to be critical library loading code for AINIC functionality. Either provide documentation explaining why it's disabled or remove the dead code if it's permanently deprecated.
| return None, None, None, None, None | ||
|
|
||
| assert data_iterator is not None, f"data_iterator is None vp_stage: {vp_stage}" | ||
| # assert data_iterator is not None, f"data_iterator is None vp_stage: {vp_stage}" |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertion is commented out without explanation. If this check is no longer needed for benchmarking scenarios, add a comment explaining why (e.g., '# Skip assertion for mock data benchmarks'). Otherwise, restore the assertion to maintain data validation.
| # assert data_iterator is not None, f"data_iterator is None vp_stage: {vp_stage}" | |
| assert data_iterator is not None, f"data_iterator is None vp_stage: {vp_stage}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 8 out of 10 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # place the collection before the timer to minimize its impact on latency measurements for iterations ≥ 3. | ||
| if args.log_throughput: | ||
| if args.use_rocm_mem_info or iteration in args.use_rocm_mem_info_iters: | ||
| if args.use_rocm_mem_info or (args.use_rocm_mem_info_iters is not None and iteration in args.use_rocm_mem_info_iters): |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition is repeated identically at lines 1947, 2005, and 2042. Consider extracting this logic into a helper function to improve maintainability and reduce code duplication.
No description provided.