Skip to content

Conversation

@wenxie-amd
Copy link
Contributor

No description provided.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR modifies the CI/CD pipeline to focus on Docker image building and temporarily disables unit testing. The primary changes involve switching from v25.10 to v25.09 for the AINIC Docker image build and commenting out the entire unit test workflow for both Torch and JAX frameworks.

Key Changes:

  • Docker image build configuration updated from v25.10-ainic to v25.09-ainic
  • Unit test workflows (run-unittest-torch and run-unittest-jax) completely disabled via comments

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# start_time=$(date +%s)
# mkdir -p $GITHUB_WORKSPACE/.github/workflows/docker/ainic
# cp /apps/tas/0_public/primus_docker_ci/ainic/ainic_bundle_1.117.5-a-38.tar.gz $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ || { echo "Error: Failed to copy ainic bundle"; exit 1; }
# cp /apps/tas/0_public/primus_docker_ci/ainic/amd-anp-v1.3.0.patch $GITHUB_WORKSPACE/.github/workflows/docker/ainic/
Copy link

Copilot AI Dec 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commented-out v25.10 build still references the patch file amd-anp-v1.3.0.patch, but this line is no longer present in the active v25.09 build section (line 128). If the v25.09 build requires this patch file, it should be explicitly included; if not, this discrepancy could cause issues if the v25.10 build is re-enabled in the future without updating the patch dependency.

Suggested change
# cp /apps/tas/0_public/primus_docker_ci/ainic/amd-anp-v1.3.0.patch $GITHUB_WORKSPACE/.github/workflows/docker/ainic/

Copilot uses AI. Check for mistakes.
cp /apps/tas/0_public/primus_docker_ci/ainic/amd-anp-v1.3.0.patch $GITHUB_WORKSPACE/.github/workflows/docker/ainic/
docker build -f $GITHUB_WORKSPACE/.github/workflows/docker/Dockerfile_v25.10_ainic \
-t tasimage/primus:${{env.IMAGE_TAG}}-v25.10-ainic \
cp /apps/tas/0_public/primus_docker_ci/ainic/ainic_bundle_1.117.1-a-42.tar.gz $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ || { echo "Error: Failed to copy ainic bundle"; exit 1; }
Copy link

Copilot AI Dec 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AINIC bundle version has changed from 1.117.5-a-38 to 1.117.1-a-42. This appears to be a downgrade in the major version number (117.5 to 117.1), which is unusual. Verify that this version change is intentional and that v1.117.1-a-42 is the correct bundle for the v25.09 build.

Suggested change
cp /apps/tas/0_public/primus_docker_ci/ainic/ainic_bundle_1.117.1-a-42.tar.gz $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ || { echo "Error: Failed to copy ainic bundle"; exit 1; }
cp /apps/tas/0_public/primus_docker_ci/ainic/ainic_bundle_1.117.5-a-38.tar.gz $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ || { echo "Error: Failed to copy ainic bundle"; exit 1; }

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings December 29, 2025 14:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


# unset NCCL_IB_GID_INDEX
export NCCL_IB_TC=${NCCL_IB_TC:-104}
export NCCL_IB_FIFO_TC=${NCCL_IB_FIFO_TC:-184} # 192
Copy link

Copilot AI Dec 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inline comment "# 192" suggests this was the previous value, but appears as trailing documentation. Consider removing this comment or clarifying why 192 is noted if it provides important context for future reference.

Suggested change
export NCCL_IB_FIFO_TC=${NCCL_IB_FIFO_TC:-184} # 192
export NCCL_IB_FIFO_TC=${NCCL_IB_FIFO_TC:-184} # previous default: 192

Copilot uses AI. Check for mistakes.
# start_time=$(date +%s)
# mkdir -p $GITHUB_WORKSPACE/.github/workflows/docker/ainic
# cp /apps/tas/0_public/primus_docker_ci/ainic/ainic_bundle_1.117.5-a-38.tar.gz $GITHUB_WORKSPACE/.github/workflows/docker/ainic/ || { echo "Error: Failed to copy ainic bundle"; exit 1; }
# cp /apps/tas/0_public/primus_docker_ci/ainic/amd-anp-v1.3.0.patch $GITHUB_WORKSPACE/.github/workflows/docker/ainic/
Copy link

Copilot AI Dec 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch file copy command was removed when switching from v25.10 to v25.09, but this line still references it. If the v25.09 build doesn't require this patch file, this line should be removed to match the v25.10 pattern where it was commented out.

Suggested change
# cp /apps/tas/0_public/primus_docker_ci/ainic/amd-anp-v1.3.0.patch $GITHUB_WORKSPACE/.github/workflows/docker/ainic/

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings December 30, 2025 03:23
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +56 to +66
ENV USE_UCX_VERSION="1.15.0"
RUN cd ${WORKDIR} && wget https://github.com/openucx/ucx/releases/download/v${USE_UCX_VERSION}/ucx-${USE_UCX_VERSION}.tar.gz && \
mkdir -p ucx-${USE_UCX_VERSION} && \
tar -zxf ucx-${USE_UCX_VERSION}.tar.gz -C ucx-${USE_UCX_VERSION} --strip-components=1 && \
cd ucx-${USE_UCX_VERSION} && mkdir build && cd build && \
../configure --prefix=${WORKDIR}/ucx-${USE_UCX_VERSION}/install --with-rocm=${ROCM_PATH} 2>&1 | tee log_ucx_configure.txt && \
make -j 16 2>&1 | tee log_ucx_build.txt && \
make install && \
cd ${WORKDIR}

ENV UCX_INSTALL_DIR=${WORKDIR}/ucx-${USE_UCX_VERSION}/install
Copy link

Copilot AI Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The environment variable name 'USE_UCX_VERSION' is inconsistent with naming conventions. Consider using 'UCX_VERSION' to match the pattern of other version variables in the file.

Suggested change
ENV USE_UCX_VERSION="1.15.0"
RUN cd ${WORKDIR} && wget https://github.com/openucx/ucx/releases/download/v${USE_UCX_VERSION}/ucx-${USE_UCX_VERSION}.tar.gz && \
mkdir -p ucx-${USE_UCX_VERSION} && \
tar -zxf ucx-${USE_UCX_VERSION}.tar.gz -C ucx-${USE_UCX_VERSION} --strip-components=1 && \
cd ucx-${USE_UCX_VERSION} && mkdir build && cd build && \
../configure --prefix=${WORKDIR}/ucx-${USE_UCX_VERSION}/install --with-rocm=${ROCM_PATH} 2>&1 | tee log_ucx_configure.txt && \
make -j 16 2>&1 | tee log_ucx_build.txt && \
make install && \
cd ${WORKDIR}
ENV UCX_INSTALL_DIR=${WORKDIR}/ucx-${USE_UCX_VERSION}/install
ENV UCX_VERSION="1.15.0"
RUN cd ${WORKDIR} && wget https://github.com/openucx/ucx/releases/download/v${UCX_VERSION}/ucx-${UCX_VERSION}.tar.gz && \
mkdir -p ucx-${UCX_VERSION} && \
tar -zxf ucx-${UCX_VERSION}.tar.gz -C ucx-${UCX_VERSION} --strip-components=1 && \
cd ucx-${UCX_VERSION} && mkdir build && cd build && \
../configure --prefix=${WORKDIR}/ucx-${UCX_VERSION}/install --with-rocm=${ROCM_PATH} 2>&1 | tee log_ucx_configure.txt && \
make -j 16 2>&1 | tee log_ucx_build.txt && \
make install && \
cd ${WORKDIR}
ENV UCX_INSTALL_DIR=${WORKDIR}/ucx-${UCX_VERSION}/install

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings December 30, 2025 07:29
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

ENV MPI_PATH=${WORKDIR}/ompi-4.1.6/install

# =============================== Build RCCL ===============================
# cd rccl && git checkout rocm-7.1.0 && \
Copy link

Copilot AI Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commented-out line appears to be leftover code. If the git checkout command is intentionally disabled, add a comment explaining why; otherwise, remove it to avoid confusion.

Suggested change
# cd rccl && git checkout rocm-7.1.0 && \

Copilot uses AI. Check for mistakes.
Comment on lines +95 to +96
# sed -i '5a CFLAGS += --offload-arch=gfx950' ./Makefile && head -10 ./Makefile && \

Copy link

Copilot AI Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commented-out line appears redundant since line 102 performs the same operation. Consider removing this commented line to reduce clutter.

Suggested change
# sed -i '5a CFLAGS += --offload-arch=gfx950' ./Makefile && head -10 ./Makefile && \

Copilot uses AI. Check for mistakes.
Comment on lines +95 to +99
# sed -i '5a CFLAGS += --offload-arch=gfx950' ./Makefile && head -10 ./Makefile && \

WORKDIR /opt
# COPY ${AINIC_BUNDLE_PATH}/amd-anp-v1.3.0.patch /opt/
# git apply /opt/amd-anp-v1.3.0.patch && \
Copy link

Copilot AI Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These commented-out patch-related lines suggest incomplete implementation. Either implement the patch application if needed or remove these lines with an explanation of why the patch is no longer required.

Suggested change
# sed -i '5a CFLAGS += --offload-arch=gfx950' ./Makefile && head -10 ./Makefile && \
WORKDIR /opt
# COPY ${AINIC_BUNDLE_PATH}/amd-anp-v1.3.0.patch /opt/
# git apply /opt/amd-anp-v1.3.0.patch && \
# Note: amd-anp v1.3.0 is built directly from the upstream tag; any required
# customization is applied via the sed-based CFLAGS modification below, so no
# additional patch file is applied here.
WORKDIR /opt

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings January 3, 2026 04:13
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# cd rccl && git checkout rocm-7.1.0 && \
RUN cd ${WORKDIR} && \
git clone https://github.com/ROCm/rccl.git && \
cd rccl && git checkout release/rocm-rel-7.1 && \
Copy link

Copilot AI Jan 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RCCL checkout uses a branch name 'release/rocm-rel-7.1' instead of a specific tag or commit. This could lead to non-reproducible builds as the branch may be updated. Consider using a specific tag or commit hash for reproducibility.

Suggested change
cd rccl && git checkout release/rocm-rel-7.1 && \
cd rccl && git checkout rocm-7.1.0 && \

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings January 5, 2026 08:38
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


# unset NCCL_IB_GID_INDEX
export NCCL_IB_TC=${NCCL_IB_TC:-104}
export NCCL_IB_FIFO_TC=${NCCL_IB_FIFO_TC:-184} # 192
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inline comment '# 192' is unclear - it should explain why 192 is referenced when the default is 184, or be removed if it's no longer relevant.

Suggested change
export NCCL_IB_FIFO_TC=${NCCL_IB_FIFO_TC:-184} # 192
export NCCL_IB_FIFO_TC=${NCCL_IB_FIFO_TC:-184}

Copilot uses AI. Check for mistakes.
Comment on lines +181 to +188
# if [ -f "${ANP_HOME_DIR}/build/librccl-net.so" ]; then
# export LD_PRELOAD="${ANP_HOME_DIR}/build/librccl-net.so:${RCCL_HOME_DIR}/build/release/librccl.so.1.0"
# elif [ -f "${ANP_HOME_DIR}/build/librccl-anp.so" ]; then
# export LD_PRELOAD="${ANP_HOME_DIR}/build/librccl-anp.so:${RCCL_HOME_DIR}/build/release/librccl.so.1.0"
# else
# echo "ERROR: Neither librccl-net.so nor librccl-anp.so was found in ${ANP_HOME_DIR}/build"
# exit 1
# fi
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire LD_PRELOAD configuration logic is commented out without explanation. This appears to be critical library loading code for AINIC functionality. Either provide documentation explaining why it's disabled or remove the dead code if it's permanently deprecated.

Copilot uses AI. Check for mistakes.
return None, None, None, None, None

assert data_iterator is not None, f"data_iterator is None vp_stage: {vp_stage}"
# assert data_iterator is not None, f"data_iterator is None vp_stage: {vp_stage}"
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion is commented out without explanation. If this check is no longer needed for benchmarking scenarios, add a comment explaining why (e.g., '# Skip assertion for mock data benchmarks'). Otherwise, restore the assertion to maintain data validation.

Suggested change
# assert data_iterator is not None, f"data_iterator is None vp_stage: {vp_stage}"
assert data_iterator is not None, f"data_iterator is None vp_stage: {vp_stage}"

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings January 6, 2026 08:42
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 10 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# place the collection before the timer to minimize its impact on latency measurements for iterations ≥ 3.
if args.log_throughput:
if args.use_rocm_mem_info or iteration in args.use_rocm_mem_info_iters:
if args.use_rocm_mem_info or (args.use_rocm_mem_info_iters is not None and iteration in args.use_rocm_mem_info_iters):
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition is repeated identically at lines 1947, 2005, and 2042. Consider extracting this logic into a helper function to improve maintainability and reduce code duplication.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants