feat(tiflow): add DM CI jobs for next-gen TiDB #4495

Closed
joechenrh wants to merge 1 commit into PingCAP-QE:main from joechenrh:dm-next-gen-ci

Conversation

@joechenrh
Contributor

Summary

  • Add DM integration test and compatibility test CI jobs targeting next-gen TiDB
  • Download next-gen TiDB/PD/TiKV binaries from OCI registry (hub-zot.pingcap.net/mirrors/tidbx) with master-next-gen / dedicated-next-gen tags, instead of the classic fileserver
  • Both jobs are set as optional: true initially for validation
  • Can be triggered via /test dm-next-gen or /test pull-dm-integration-test-next-gen / /test pull-dm-compatibility-test-next-gen

Files added

| File | Purpose |
| --- | --- |
| prow-jobs/pingcap/tiflow/latest-presubmits-next-gen.yaml | Prow trigger definitions |
| jobs/pingcap/tiflow/latest/pull_dm_integration_test_next_gen.groovy | Job DSL for integration test |
| jobs/pingcap/tiflow/latest/pull_dm_compatibility_test_next_gen.groovy | Job DSL for compatibility test |
| pipelines/pingcap/tiflow/latest/pull_dm_integration_test_next_gen.groovy | Integration test pipeline |
| pipelines/pingcap/tiflow/latest/pull_dm_compatibility_test_next_gen.groovy | Compatibility test pipeline |
| pipelines/pingcap/tiflow/latest/pod-pull_dm_integration_test_next_gen.yaml | Pod template (with utils container for oras) |
| pipelines/pingcap/tiflow/latest/pod-pull_dm_compatibility_test_next_gen.yaml | Pod template (with utils container for oras) |

Key changes vs classic DM CI

  • Binary download: fileserver + wget → OCI registry + oras (via download_pingcap_oci_artifact.sh)
  • Added utils container to pod templates for OCI artifact downloads
  • Set NEXT_GEN=1 environment variable
  • OCI tags resolve to master-next-gen (for master branch) or the release branch name (for release-nextgen-* branches)
  • TiKV uses dedicated-next-gen tag
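
The tag rules above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names `resolve_oci_tag` and `resolve_tikv_tag` are hypothetical and do not come from the PR, the branch `release-nextgen-example` is a placeholder, and rejecting other branches is an assumed behavior rather than something the description specifies.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): map a target branch to the OCI tag
# for next-gen TiDB/PD artifacts, following the rules listed above.
resolve_oci_tag() {
  local branch="$1"
  case "$branch" in
    master)            echo "master-next-gen" ;;
    release-nextgen-*) echo "$branch" ;;
    *)
      # Assumption: any other branch is treated as unsupported.
      echo "unsupported branch for next-gen CI: $branch" >&2
      return 1
      ;;
  esac
}

# Assumption: the TiKV tag is fixed and does not depend on the branch.
resolve_tikv_tag() {
  echo "dedicated-next-gen"
}

resolve_oci_tag master                     # prints master-next-gen
resolve_oci_tag release-nextgen-example    # prints release-nextgen-example
```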

Test plan

  • Trigger /test pull-dm-integration-test-next-gen on a tiflow PR to validate the pipeline
  • Trigger /test pull-dm-compatibility-test-next-gen on a tiflow PR to validate the pipeline
  • Verify next-gen TiDB binaries are downloaded correctly
  • Verify DM test groups pass with next-gen TiDB downstream
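
One way to approach the "binaries are downloaded correctly" check is a presence test after the download step. The sketch below is an assumption, not part of the PR: `verify_binaries` is a hypothetical name, and the binary names in the usage comment (`tidb-server`, `pd-server`, `tikv-server`) and the `./bin` directory are assumed rather than confirmed by the description. It only checks that files exist and are executable, not that they are next-gen builds.

```shell
#!/usr/bin/env bash
# Hypothetical check (not from the PR): confirm each expected binary exists
# in a directory and is executable. Reports every missing name on stderr and
# returns non-zero if any are absent.
# Usage (assumed names): verify_binaries ./bin tidb-server pd-server tikv-server
verify_binaries() {
  local dir="$1"
  shift
  local missing=0
  local name
  for name in "$@"; do
    if [ ! -x "$dir/$name" ]; then
      echo "missing or not executable: $dir/$name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```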

🤖 Generated with Claude Code

Add integration test and compatibility test CI jobs that run DM tests
against next-gen TiDB binaries downloaded from OCI registry instead of
the classic fileserver. Jobs are optional and can be triggered via
`/test dm-next-gen` or `/test pull-dm-integration-test-next-gen`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ti-chi-bot

ti-chi-bot bot commented Apr 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign wuhuizuo for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot left a comment

I have already done a preliminary review for you, and I hope to help you do a better job.

Summary

This pull request introduces new CI jobs to test DM (Data Migration) compatibility and integration with next-gen TiDB. It downloads binaries from an OCI registry and adds pod templates and pipelines for these jobs. Both jobs start as optional and are triggered by specific /test commands. Overall, the implementation is well-structured, but error handling, resource utilization, and adherence to best practices could be improved.


Critical Issues

  1. Resource Limits Misalignment in Pod Templates

    • File: pipelines/pingcap/tiflow/latest/pod-pull_dm_compatibility_test_next_gen.yaml, pipelines/pingcap/tiflow/latest/pod-pull_dm_integration_test_next_gen.yaml
    • Line: Containers mysql1 and mysql2 have resources.limits set but no corresponding resources.requests.
    • Issue: Kubernetes recommends defining both requests and limits for better resource scheduling and preventing overcommitment.
    • Solution:
      resources:
        requests:
          memory: 2Gi
          cpu: "1"
        limits:
          memory: 4Gi
          cpu: "2"
  2. Hardcoded Retry Counts

    • File: pipelines/pingcap/tiflow/latest/pull_dm_compatibility_test_next_gen.groovy, pipelines/pingcap/tiflow/latest/pull_dm_integration_test_next_gen.groovy
    • Lines: Retry logic in stages like prepare and downloads (retry(2)).
    • Issue: A hardcoded retry count fails the stage whenever a transient error outlasts the fixed number of attempts.
    • Solution: Use exponential backoff or configurable retry parameters to handle transient errors dynamically.
  3. Potential Data Loss in Log Collection

    • File: pipelines/pingcap/tiflow/latest/pull_dm_compatibility_test_next_gen.groovy, pipelines/pingcap/tiflow/latest/pull_dm_integration_test_next_gen.groovy
    • Lines: Log collection during test failures (tar -cvzf log.tar.gz).
    • Issue: Missing error handling for empty or inaccessible log files may lead to silent failures.
    • Solution:
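      sh label: "collect logs", script: '''
        # Single-quoted Groovy string: $(...) is passed to the shell as-is.
        # In a """ string, Groovy would reject the "$(" sequence.
        if [ -d /tmp/dm_test ]; then
          tar -cvzf log.tar.gz $(find /tmp/dm_test/ -type f -name "*.log") || echo "No logs found"
        else
          echo "Log directory does not exist"
        fi
      '''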
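
The exponential backoff suggested in item 2 could look roughly like the following when applied to commands inside an `sh` step. This is a sketch under assumptions: `retry_with_backoff`, `MAX_ATTEMPTS`, and `INITIAL_DELAY_SECONDS` are hypothetical names, and it wraps shell commands rather than replacing the Jenkins `retry(2)` step the pipelines use.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): run a command up to MAX_ATTEMPTS
# times, doubling the wait between attempts. Both knobs come from the
# environment, with defaults of 3 attempts and a 1 second initial delay.
retry_with_backoff() {
  local max_attempts="${MAX_ATTEMPTS:-3}"
  local delay="${INITIAL_DELAY_SECONDS:-1}"
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

A download could then be wrapped as `MAX_ATTEMPTS=5 retry_with_backoff wget ...`, which keeps the retry policy configurable per call.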

Code Improvements

  1. Environment Variable Validation

    • File: pipelines/pingcap/tiflow/latest/pull_dm_compatibility_test_next_gen.groovy, pipelines/pingcap/tiflow/latest/pull_dm_integration_test_next_gen.groovy
    • Lines: Usage of environment variables like OCI_TAG_TIDB, OCI_TAG_PD, OCI_TAG_TIKV.
    • Issue: No validation for these variables being empty or improperly set.
    • Solution: Add a validation step at the beginning of the pipeline.
      // Read through env so an unset variable evaluates to null instead of throwing.
      if (!env.OCI_TAG_TIDB || !env.OCI_TAG_PD || !env.OCI_TAG_TIKV) {
          error("OCI tag environment variables are not set correctly.")
      }
  2. Dynamic Pod Affinity

    • File: pod-pull_dm_integration_test_next_gen.yaml, pod-pull_dm_compatibility_test_next_gen.yaml
    • Lines: Node affinity is statically set to kubernetes.io/arch: amd64.
    • Issue: Static configuration limits scalability to other architectures.
    • Solution: Make node affinity configurable based on CI needs.
  3. Test Matrix Parallelization

    • File: pipelines/pingcap/tiflow/latest/pull_dm_integration_test_next_gen.groovy
    • Lines: Test matrix (axes for TEST_GROUP).
    • Issue: No dynamic partitioning of test groups for better resource utilization.
    • Solution: Introduce a dynamic partitioning mechanism based on available resources and job priority.
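
For item 3, a much simpler alternative to resource-aware partitioning is a static round-robin split of test groups across a fixed number of shards. The sketch below shows only that simpler technique; `groups_for_shard` is a hypothetical name and the group names `G00` to `G04` are placeholders, not the actual TEST_GROUP values in the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): print the test groups assigned to
# one shard when groups are dealt out round-robin across <shard_count> shards.
# Usage: groups_for_shard <shard_index> <shard_count> <group>...
groups_for_shard() {
  local index="$1"
  local count="$2"
  shift 2
  local i=0
  local group
  for group in "$@"; do
    if [ $((i % count)) -eq "$index" ]; then
      echo "$group"
    fi
    i=$((i + 1))
  done
}

# Shard 0 of 2 receives every other group, starting with the first.
groups_for_shard 0 2 G00 G01 G02 G03 G04   # prints G00, G02, G04 on separate lines
```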

Best Practices

  1. Improve Documentation

    • File: pipelines/pingcap/tiflow/latest/pull_dm_compatibility_test_next_gen.groovy, pipelines/pingcap/tiflow/latest/pull_dm_integration_test_next_gen.groovy
    • Issue: Lack of comments explaining critical steps (e.g., why make dm_integration_test_build is needed twice).
    • Solution: Add comments for steps that might not be obvious to new contributors.
  2. Consistent Naming Conventions

    • File: prow-jobs/pingcap/tiflow/latest-presubmits-next-gen.yaml
    • Issue: Job names like pingcap/tiflow/pull_dm_integration_test_next_gen are inconsistent with common naming conventions.
    • Solution: Use hyphens consistently and ensure names follow a standard format.
  3. Test Coverage Validation

    • File: pipelines/pingcap/tiflow/latest/pull_dm_compatibility_test_next_gen.groovy
    • Issue: No validation for whether all relevant test cases are executed.
    • Solution: Add a summary report at the end of the pipeline to confirm coverage.
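
The coverage summary in item 3 could be approximated by comparing a list of expected test names with the names that actually ran. This is a sketch under assumptions: `missing_tests` and both input files (one test name per line) are hypothetical, and the pipelines may not produce such lists today.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): print test names that appear in the
# expected list but not in the executed list. Each file holds one name per line.
# Usage: missing_tests <expected_file> <executed_file>
missing_tests() {
  # -F fixed strings, -x whole-line match, -v invert: keep expected names
  # that match no executed name. grep exits 1 when nothing is printed,
  # which is the "full coverage" case here, so that status is ignored.
  grep -Fxv -f "$2" "$1" || true
}
```

An empty result means every expected test ran; any printed name points to a skipped case.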

These changes will improve the reliability, scalability, and maintainability of the CI pipelines while ensuring compliance with Kubernetes and Jenkins best practices.

@ti-chi-bot

ti-chi-bot bot commented Apr 9, 2026

@joechenrh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-verify-pr-content-policy | de0f822 | link | true | /test pull-verify-pr-content-policy |
| pull-verify-prow-jobs | de0f822 | link | true | /test pull-verify-prow-jobs |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@joechenrh
Contributor Author

Merged into #4485 instead.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces new 'next-gen' CI pipelines for DM compatibility and integration testing, including Jenkins job definitions, Kubernetes pod templates, and Prow configurations. Feedback focuses on improving CI reliability and consistency: specifically, dependencies like gh-ost and mydumper should be baked into Docker images rather than downloaded at runtime, and the Go version in the integration test pod should be updated to 1.25 for consistency. Additionally, mydumper appears to be missing from the integration test pipeline, and the use of return 0 in the Groovy script should be standardized to a simple return.

Comment on lines +119 to +124
sh label: "download gh-ost", script: """
wget --no-verbose --retry-connrefused --waitretry=1 -t 3 -O ./bin/gh-ost.tar.gz https://github.com/github/gh-ost/releases/download/v1.1.0/gh-ost-binary-linux-20200828140552.tar.gz
tar -xz -C ./bin -f ./bin/gh-ost.tar.gz
rm -f ./bin/gh-ost.tar.gz
chmod +x ./bin/gh-ost
"""

high

This pipeline downloads gh-ost at runtime but is missing the download for mydumper, which is present in the compatibility test pipeline. DM integration tests typically require mydumper for full data migration scenarios. Additionally, downloading these tools at runtime is discouraged; they should be baked into the CI Docker image.

References
  1. For better CI performance and reliability, bake dependencies into the Docker image instead of installing them at runtime in CI scripts.

          name: "mysql-config-volume"
  containers:
    - name: golang
      image: "hub.pingcap.net/jenkins/rocky8_golang-1.23:tini"

medium

The golang container is using Go 1.23 (rocky8_golang-1.23:tini), which is inconsistent with the init-mysql-config container in this pod (line 8) and the golang container in the compatibility test pod, both of which use Go 1.25. For the 'next-gen' CI environment, it is recommended to use a consistent Go version, preferably 1.25.

      image: "hub.pingcap.net/jenkins/rocky8_golang-1.25:latest"

Comment on lines +134 to +144
sh label: "download extra tools", script: """
wget --no-verbose --retry-connrefused --waitretry=1 -t 3 -O /tmp/mydumper.tar.gz http://download.pingcap.org/tidb-enterprise-tools-latest-linux-amd64.tar.gz
tar -xz -C /tmp -f /tmp/mydumper.tar.gz tidb-enterprise-tools-latest-linux-amd64/bin/mydumper
mv /tmp/tidb-enterprise-tools-latest-linux-amd64/bin/mydumper ./bin/
rm -rf /tmp/mydumper.tar.gz /tmp/tidb-enterprise-tools-latest-linux-amd64

wget --no-verbose --retry-connrefused --waitretry=1 -t 3 -O /tmp/gh-ost.tar.gz https://github.com/github/gh-ost/releases/download/v1.1.0/gh-ost-binary-linux-20200828140552.tar.gz
tar -xz -C ./bin -f /tmp/gh-ost.tar.gz
rm -f /tmp/gh-ost.tar.gz
chmod +x ./bin/gh-ost ./bin/mydumper
"""

medium

Downloading mydumper and gh-ost at runtime via wget can lead to flaky CI jobs and slower execution times. These dependencies should be baked into the Docker image used for the CI job to ensure better performance and reliability.

References
  1. For better CI performance and reliability, bake dependencies into the Docker image instead of installing them at runtime in CI scripts.

println "not matched, all files full path not start with dm/, sync_diff_inspector/ or pkg/ or go.mod, current pr not releate to dm, so skip the dm integration test"
currentBuild.result = 'SUCCESS'
skipRemainingStages = true
return 0

medium

Using return 0 to exit a script block is unusual in Jenkins pipelines and inconsistent with the pull_dm_compatibility_test_next_gen.groovy script which uses a simple return. It's better to use return for consistency.

                            return
