feat(tiflow): add DM CI jobs for next-gen TiDB#4495
feat(tiflow): add DM CI jobs for next-gen TiDB#4495joechenrh wants to merge 1 commit intoPingCAP-QE:mainfrom
Conversation
Add integration test and compatibility test CI jobs that run DM tests against next-gen TiDB binaries downloaded from OCI registry instead of the classic fileserver. Jobs are optional and can be triggered via `/test dm-next-gen` or `/test pull-dm-integration-test-next-gen`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
I have already done a preliminary review for you, and I hope to help you do a better job.
Summary
This pull request introduces new CI jobs to test DM (Data Migration) compatibility and integration with next-gen TiDB. It uses OCI registry for binary downloads and adds pod templates and pipelines for these jobs. The approach includes optional initial runs and triggers via specific commands. Overall, the implementation is well-structured, but it has some areas for improvement in error handling, resource utilization, and adherence to best practices.
Critical Issues
-
Resource Limits Misalignment in Pod Templates
- File:
pipelines/pingcap/tiflow/latest/pod-pull_dm_compatibility_test_next_gen.yaml,pipelines/pingcap/tiflow/latest/pod-pull_dm_integration_test_next_gen.yaml - Line: Containers
mysql1andmysql2haveresources.limitsset but no correspondingresources.requests. - Issue: Kubernetes recommends defining both
requestsandlimitsfor better resource scheduling and preventing overcommitment. - Solution:
resources: requests: memory: 2Gi cpu: "1" limits: memory: 4Gi cpu: "2"
- File:
-
Hardcoded Retry Counts
- File:
pipelines/pingcap/tiflow/latest/pull_dm_compatibility_test_next_gen.groovy,pipelines/pingcap/tiflow/latest/pull_dm_integration_test_next_gen.groovy - Lines: Retry logic in stages like
prepareand downloads (retry(2)). - Issue: Hardcoded retries can result in inefficiency if a transient error persists beyond the retry count.
- Solution: Use exponential backoff or configurable retry parameters to handle transient errors dynamically.
- File:
-
Potential Data Loss in Log Collection
- File:
pipelines/pingcap/tiflow/latest/pull_dm_compatibility_test_next_gen.groovy,pipelines/pingcap/tiflow/latest/pull_dm_integration_test_next_gen.groovy - Lines: Log collection during test failures (
tar -cvzf log.tar.gz). - Issue: Missing error handling for empty or inaccessible log files may lead to silent failures.
- Solution:
sh label: "collect logs", script: """ if [ -d /tmp/dm_test ]; then tar -cvzf log.tar.gz $(find /tmp/dm_test/ -type f -name "*.log") || echo "No logs found" else echo "Log directory does not exist" fi """
- File:
Code Improvements
-
Environment Variable Validation
- File:
pipelines/pingcap/tiflow/latest/pull_dm_compatibility_test_next_gen.groovy,pipelines/pingcap/tiflow/latest/pull_dm_integration_test_next_gen.groovy - Lines: Usage of environment variables like
OCI_TAG_TIDB,OCI_TAG_PD,OCI_TAG_TIKV. - Issue: No validation for these variables being empty or improperly set.
- Solution: Add a validation step at the beginning of the pipeline.
if (!OCI_TAG_TIDB || !OCI_TAG_PD || !OCI_TAG_TIKV) { error("OCI tag environment variables are not set correctly.") }
- File:
-
Dynamic Pod Affinity
- File:
pod-pull_dm_integration_test_next_gen.yaml,pod-pull_dm_compatibility_test_next_gen.yaml - Lines: Node affinity is statically set to
kubernetes.io/arch: amd64. - Issue: Static configuration limits scalability to other architectures.
- Solution: Make node affinity configurable based on CI needs.
- File:
-
Test Matrix Parallelization
- File:
pipelines/pingcap/tiflow/latest/pull_dm_integration_test_next_gen.groovy - Lines: Test matrix (
axesforTEST_GROUP). - Issue: No dynamic partitioning of test groups for better resource utilization.
- Solution: Introduce a dynamic partitioning mechanism based on available resources and job priority.
- File:
Best Practices
-
Improve Documentation
- File:
pipelines/pingcap/tiflow/latest/pull_dm_compatibility_test_next_gen.groovy,pipelines/pingcap/tiflow/latest/pull_dm_integration_test_next_gen.groovy - Issue: Lack of comments explaining critical steps (e.g., why
make dm_integration_test_buildis needed twice). - Solution: Add comments for steps that might not be obvious to new contributors.
- File:
-
Consistent Naming Conventions
- File:
prow-jobs/pingcap/tiflow/latest-presubmits-next-gen.yaml - Issue: Job names like
pingcap/tiflow/pull_dm_integration_test_next_genare inconsistent with common naming conventions. - Solution: Use hyphens consistently and ensure names follow a standard format.
- File:
-
Test Coverage Validation
- File:
pipelines/pingcap/tiflow/latest/pull_dm_compatibility_test_next_gen.groovy - Issue: No validation for whether all relevant test cases are executed.
- Solution: Add a summary report at the end of the pipeline to confirm coverage.
- File:
These changes will improve the reliability, scalability, and maintainability of the CI pipelines while ensuring compliance with Kubernetes and Jenkins best practices.
|
@joechenrh: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Merged into #4485 instead. |
There was a problem hiding this comment.
Code Review
This pull request introduces new 'next-gen' CI pipelines for DM compatibility and integration testing, including Jenkins job definitions, Kubernetes pod templates, and Prow configurations. Feedback focuses on improving CI reliability and consistency: specifically, dependencies like gh-ost and mydumper should be baked into Docker images rather than downloaded at runtime, and the Go version in the integration test pod should be updated to 1.25 for consistency. Additionally, mydumper appears to be missing from the integration test pipeline, and the use of return 0 in the Groovy script should be standardized to a simple return.
| sh label: "download gh-ost", script: """ | ||
| wget --no-verbose --retry-connrefused --waitretry=1 -t 3 -O ./bin/gh-ost.tar.gz https://github.com/github/gh-ost/releases/download/v1.1.0/gh-ost-binary-linux-20200828140552.tar.gz | ||
| tar -xz -C ./bin -f ./bin/gh-ost.tar.gz | ||
| rm -f ./bin/gh-ost.tar.gz | ||
| chmod +x ./bin/gh-ost | ||
| """ |
There was a problem hiding this comment.
This pipeline downloads gh-ost at runtime but is missing the download for mydumper, which is present in the compatibility test pipeline. DM integration tests typically require mydumper for full data migration scenarios. Additionally, downloading these tools at runtime is discouraged; they should be baked into the CI Docker image.
References
- For better CI performance and reliability, bake dependencies into the Docker image instead of installing them at runtime in CI scripts.
| name: "mysql-config-volume" | ||
| containers: | ||
| - name: golang | ||
| image: "hub.pingcap.net/jenkins/rocky8_golang-1.23:tini" |
There was a problem hiding this comment.
The golang container is using Go 1.23 (rocky8_golang-1.23:tini), which is inconsistent with the init-mysql-config container in this pod (line 8) and the golang container in the compatibility test pod, both of which use Go 1.25. For the 'next-gen' CI environment, it is recommended to use a consistent Go version, preferably 1.25.
image: "hub.pingcap.net/jenkins/rocky8_golang-1.25:latest"| sh label: "download extra tools", script: """ | ||
| wget --no-verbose --retry-connrefused --waitretry=1 -t 3 -O /tmp/mydumper.tar.gz http://download.pingcap.org/tidb-enterprise-tools-latest-linux-amd64.tar.gz | ||
| tar -xz -C /tmp -f /tmp/mydumper.tar.gz tidb-enterprise-tools-latest-linux-amd64/bin/mydumper | ||
| mv /tmp/tidb-enterprise-tools-latest-linux-amd64/bin/mydumper ./bin/ | ||
| rm -rf /tmp/mydumper.tar.gz /tmp/tidb-enterprise-tools-latest-linux-amd64 | ||
|
|
||
| wget --no-verbose --retry-connrefused --waitretry=1 -t 3 -O /tmp/gh-ost.tar.gz https://github.com/github/gh-ost/releases/download/v1.1.0/gh-ost-binary-linux-20200828140552.tar.gz | ||
| tar -xz -C ./bin -f /tmp/gh-ost.tar.gz | ||
| rm -f /tmp/gh-ost.tar.gz | ||
| chmod +x ./bin/gh-ost ./bin/mydumper | ||
| """ |
There was a problem hiding this comment.
Downloading mydumper and gh-ost at runtime via wget can lead to flaky CI jobs and slower execution times. These dependencies should be baked into the Docker image used for the CI job to ensure better performance and reliability.
References
- For better CI performance and reliability, bake dependencies into the Docker image instead of installing them at runtime in CI scripts.
| println "not matched, all files full path not start with dm/, sync_diff_inspector/ or pkg/ or go.mod, current pr not releate to dm, so skip the dm integration test" | ||
| currentBuild.result = 'SUCCESS' | ||
| skipRemainingStages = true | ||
| return 0 |
Summary
hub-zot.pingcap.net/mirrors/tidbx) withmaster-next-gen/dedicated-next-gentags, instead of the classic fileserveroptional: trueinitially for validation/test dm-next-genor/test pull-dm-integration-test-next-gen//test pull-dm-compatibility-test-next-genFiles added
prow-jobs/pingcap/tiflow/latest-presubmits-next-gen.yamljobs/pingcap/tiflow/latest/pull_dm_integration_test_next_gen.groovyjobs/pingcap/tiflow/latest/pull_dm_compatibility_test_next_gen.groovypipelines/pingcap/tiflow/latest/pull_dm_integration_test_next_gen.groovypipelines/pingcap/tiflow/latest/pull_dm_compatibility_test_next_gen.groovypipelines/pingcap/tiflow/latest/pod-pull_dm_integration_test_next_gen.yamlpipelines/pingcap/tiflow/latest/pod-pull_dm_compatibility_test_next_gen.yamlKey changes vs classic DM CI
download_pingcap_oci_artifact.sh)utilscontainer to pod templates for OCI artifact downloadsNEXT_GEN=1environment variablemaster-next-gen(for master branch) or the release branch name (forrelease-nextgen-*branches)dedicated-next-gentagTest plan
/test pull-dm-integration-test-next-genon a tiflow PR to validate the pipeline/test pull-dm-compatibility-test-next-genon a tiflow PR to validate the pipeline🤖 Generated with Claude Code