
Parallelize microbenchmarks and run them more times #5313

Merged
igoragoli merged 20 commits into master from augusto/2544.flakiness
Mar 3, 2026

Conversation

@igoragoli
Contributor

@igoragoli igoragoli commented Feb 4, 2026

⚠️ The PR description was modified to describe new changes.

  • TODO: Resolve all TODOs before merging to use production branches/images instead of testing ones.

What does this PR do?

  • Runs benchmarks 6 times (configurable via REPETITIONS in benchmarks/execution.yml) to reduce inter-run variability.
    • I had initially tested with 10 repetitions, which would make the CI job go from 30 minutes (the current state, without repetitions) to 50 minutes. I brought it down to 6 repetitions so the CI job duration stays at 30 minutes.
    • If flakiness is still high, we can bump the number of repetitions or configure the threshold on the analysis step over at benchmarking-platform, on the dd-trace-rb branch.
  • To prevent extremely long CI jobs, runs benchmarks in parallel with CPU isolation.
    • Since we have 24 available CPUs, the 13 benchmarks were split into two arbitrary groups. Each group is defined in benchmarks/execution.yml.
    • Every benchmark runs with 2 CPUs (configurable via CPUS_PER_BENCHMARK in the benchmarks/execution.yml job).
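The grouping-plus-CPU-isolation scheme described above can be sketched roughly as follows. CPUS_PER_BENCHMARK matches the variable name mentioned in the PR; the cpu_range_for helper and the taskset usage note are illustrative, not the actual job script:

```shell
#!/usr/bin/env bash
# Illustrative sketch of the CPU-isolation scheme from the PR description.
# CPUS_PER_BENCHMARK is the PR's variable name; cpu_range_for is hypothetical.
set -euo pipefail

CPUS_PER_BENCHMARK=2

# Compute the CPU range ("start-end") assigned to the i-th benchmark slot,
# so concurrent benchmarks never share cores.
cpu_range_for() {
  local index="$1"
  local start=$(( index * CPUS_PER_BENCHMARK ))
  local end=$(( start + CPUS_PER_BENCHMARK - 1 ))
  echo "${start}-${end}"
}

# On Linux, each benchmark could then be pinned with taskset(1), e.g.:
#   taskset -c "$(cpu_range_for "$i")" ruby "benchmarks/${name}.rb" &
# ...with the job waiting for all background benchmarks via `wait`.
```

With 2 CPUs per benchmark and 24 available CPUs, up to 12 benchmarks can run concurrently without overlapping core assignments, which is why the 13 benchmarks were split into two groups.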

Motivation:

https://datadoghq.atlassian.net/browse/APMSP-2544

Change log entry

None.

Additional Notes:

How to test the change?

Execution and reporting

Reducing flakiness
The effect of multiple repetitions and CPU isolation on result variability was tested and reported in this document: https://datadoghq.atlassian.net/wiki/x/egJ3cAE

25 out of ~45 scenarios were flaky before fixes, 0 are flaky after fixes.

These tests used 10 repetitions; while this PR introduces only 6, that should still bring the flakiness down substantially.

@github-actions

github-actions Bot commented Feb 4, 2026

Thank you for updating Change log entry section 👏


@datadog-official

datadog-official Bot commented Feb 4, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 95.10% (-0.00%)

Commit SHA: 61a7f68

@igoragoli igoragoli changed the title [DO NOT MERGE] test: mitigations for benchmark stability Reduce microbenchmark flakiness Feb 6, 2026
@igoragoli igoragoli marked this pull request as ready for review February 6, 2026 10:51
@igoragoli igoragoli requested a review from a team as a code owner February 6, 2026 10:51
@pr-commenter

pr-commenter Bot commented Feb 6, 2026

Benchmarks

Benchmark execution time: 2026-03-03 11:20:54

Comparing candidate commit 61a7f68 in PR branch augusto/2544.flakiness with baseline commit 158e037 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 46 metrics, 0 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

Comment thread .gitlab/benchmarks.yml Outdated
@ivoanjo
Member

ivoanjo commented Feb 6, 2026

I think the PR seems in good shape? My only question is -- I see the new jobs coming up in GitLab -- is there a way to check that the way we're exposing the files is still correct?

E.g. how do we test that we haven't broken our benchmark reporting code?

Comment thread benchmarks/run_all.sh Outdated
@igoragoli
Contributor Author

Hey @ivoanjo and @p-datadog, thank you for the reviews!

I'll answer Oleg on the conversation thread. To answer Ivo:

E.g. how do we test that we haven't broken our benchmark reporting code?

Great point, and while reports on the BP UI are working as expected, PR comments from one microbenchmarking job will overwrite the other. Results have to be combined somehow.

@igoragoli igoragoli force-pushed the augusto/2544.flakiness branch from 4fe6f54 to c7a672e Compare February 19, 2026 13:11
@igoragoli igoragoli changed the title Reduce microbenchmark flakiness Parallelize microbenchmarks and run them more times Feb 20, 2026
@igoragoli
Contributor Author

Hi! For visibility, I re-requested reviews from everyone who had reviewed, since you each pointed out different fixes.

Member

@ivoanjo ivoanjo left a comment


👍 LGTM

We're in the middle of a release, so merging to master is blocked; it should be unblocked later today/by Monday.

At this point I don't see any reason not to give this a try, and if some extra adjustment is needed we'll do it as a follow-up PR.

Comment thread benchmarks/Dockerfile
@igoragoli
Contributor Author

#5313 (comment) is awesome. We're comparing benchmarks from this branch against master. Since there were no code changes that should impact performance, we would indeed expect to see 0 improvements/regressions.

This corroborates the Ruby microbenchmark stability experiments, which showed that 10 repetitions took out all flakiness. 6 repetitions seem to be sufficient.

@p-datadog
Member

p-datadog commented Feb 23, 2026

I asked Claude to review this PR and it came up with a couple of pages of feedback. I DM'd @igoragoli to go over it, since for about half of it I can't tell whether it addresses real issues or just theoretical ones.

I wanted to see about DRYing the file names (which take up a large part of the diff), but Claude also spotted some missing validation, which seems at least plausible, and improvements in error reporting.

@ivoanjo
Member

ivoanjo commented Feb 24, 2026

I asked claude to review this PR and it came up with a couple of pages of feedback, I DM'd @igoragoli to go over it since I don't understand for about half of it whether it addresses real issues or just theoretical ones.

I wanted to see about DRYing the file names (which take up a large part of the diff) but claude also spotted some missing validation which seems to be at least plausible and improvements in error reporting.

To be clear, is any of that feedback blocking? In this age of "feedback from AI is free", I think it's even more important for the line between "hey, maybe this would be a cool improvement" and "hey, we should really fix this, it's a confusing/dangerous/weird footgun" to be clear for folks :)

@p-datadog
Member

I am not sure - that's what I wanted to find out.

I think we are all at the summit at the moment anyway.

@igoragoli igoragoli merged commit c14b280 into master Mar 3, 2026
368 checks passed
@igoragoli igoragoli deleted the augusto/2544.flakiness branch March 3, 2026 14:02
@github-actions github-actions Bot added this to the 2.30.0 milestone Mar 3, 2026
p-datadog pushed a commit that referenced this pull request Mar 4, 2026
Adds a safety check to ensure BENCHMARKS variable is populated before
proceeding with benchmark execution. Without this validation, if the
GROUP variable is invalid or misspelled, the job would silently run
with zero benchmarks and report success.

This change prevents silent failures by:
- Validating that the evaluated BENCHMARKS_${GROUP} variable is non-empty
- Providing a clear error message identifying the invalid group name
- Failing the CI job early before attempting to run bp-runner

Example failure message: "Error: No benchmarks defined for group 'typo'"

Related to PR #5313 - Parallelize microbenchmarks and run them more times
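The check this commit describes could look roughly like the following sketch. The BENCHMARKS_GROUP_A contents and the benchmark names are illustrative placeholders; only the BENCHMARKS/GROUP variable names and the error-message shape come from the commit message:

```shell
#!/usr/bin/env bash
# Sketch of the safety check described above: fail early if the selected
# GROUP has no benchmark list defined. The group contents are illustrative.
BENCHMARKS_GROUP_A="tracing_trace gem_loading"
GROUP="${GROUP:-GROUP_A}"

# Indirect expansion of BENCHMARKS_${GROUP}, as the job does via eval.
eval "BENCHMARKS=\"\${BENCHMARKS_${GROUP}:-}\""

if [ -z "$BENCHMARKS" ]; then
  echo "Error: No benchmarks defined for group '${GROUP}'" >&2
  exit 1
fi
echo "Running benchmarks: $BENCHMARKS"
```

With a misspelled group such as GROUP=typo, the eval expands an unset variable, BENCHMARKS ends up empty, and the job now fails with a clear message instead of silently running zero benchmarks.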
p-datadog pushed a commit that referenced this pull request Mar 4, 2026
Validates that the GROUP variable contains only alphanumeric characters
and underscores before using it in an eval statement. This prevents:
- Shell injection risks if GROUP contains special characters
- Cryptic eval errors from malformed variable names
It also produces better error messages for configuration mistakes.

Example: If someone adds GROUP: "my-new-group" (with hyphens) to the
parallel matrix, the job will now fail with a clear error message
explaining the format requirement, rather than succeeding with zero
benchmarks or producing an eval syntax error.

Related to PR #5313 - Parallelize microbenchmarks and run them more times
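A minimal sketch of the format check this commit describes, using a POSIX case pattern to reject anything other than letters, digits, and underscores (the validate_group helper name is illustrative):

```shell
#!/usr/bin/env bash
# Sketch of the GROUP format check described above: only alphanumerics and
# underscores are allowed before GROUP is interpolated into an eval.
validate_group() {
  case "$1" in
    *[!A-Za-z0-9_]*|"")
      echo "Error: GROUP '$1' must contain only letters, digits, and underscores" >&2
      return 1
      ;;
  esac
}

validate_group "GROUP_A" && echo "GROUP_A accepted"
validate_group "my-new-group" || echo "my-new-group rejected"
```

Rejecting the value before the eval means a hyphenated entry like "my-new-group" fails with a readable configuration error rather than an eval syntax error.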
p-datadog pushed a commit that referenced this pull request Mar 4, 2026
Ensures the DD_API_KEY is successfully retrieved from AWS SSM Parameter
Store before proceeding with benchmark execution. Without this check,
if the AWS command fails (due to permissions, network issues, or missing
parameter), the variable would be set to an empty string and the job
would continue, causing silent failures when attempting to upload results.

This prevents:
- Benchmark results being lost due to failed uploads
- Jobs appearing successful when API key retrieval failed
- Difficult debugging of upload failures

The job now fails early with a clear error message if the API key
cannot be retrieved.

Related to PR #5313 - Parallelize microbenchmarks and run them more times
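The empty-variable early-fail pattern this commit describes (and which the later CI_JOB_TOKEN and CI_COMMIT_SHA commits reuse) could be sketched like this. In the real job DD_API_KEY comes from `aws ssm get-parameter`; here the value is stubbed and the require_var helper is illustrative:

```shell
#!/usr/bin/env bash
# Sketch of the early-fail check described above. require_var is a
# hypothetical helper; in CI the value would come from AWS SSM.
require_var() {
  name="$1"
  # Indirect lookup so one helper covers DD_API_KEY, CI_JOB_TOKEN, etc.
  eval "value=\${${name}:-}"
  if [ -z "$value" ]; then
    echo "Error: ${name} is empty or unset; failing early" >&2
    return 1
  fi
}

DD_API_KEY="dummy-key-for-illustration"
require_var DD_API_KEY && echo "DD_API_KEY present"
```

Failing here, before any benchmarks run, turns a silent lost-upload into an immediate, attributable CI error.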
p-datadog pushed a commit that referenced this pull request Mar 4, 2026
Validates that the artifacts directory is successfully created before
proceeding with benchmark execution. This change addresses three issues:

1. ddprof-benchmark job: Removes dangerous `|| :` pattern that suppressed
   all mkdir errors, which could hide real failures like permission issues
   or disk full errors.

2. microbenchmarks job: Adds explicit validation that directory creation
   succeeded.

3. microbenchmarks-pr-comment job: Adds explicit validation that directory
   creation succeeded.

Without this validation, if directory creation fails, the job would
continue and benchmark results would be lost. The job might appear
successful even though no artifacts were collected.

Now the jobs fail early with a clear error message if the artifacts
directory cannot be created.

Related to PR #5313 - Parallelize microbenchmarks and run them more times
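The directory-creation check this commit describes might look like the sketch below, replacing the `|| :` pattern that swallowed mkdir errors. The ARTIFACTS_DIR default here is illustrative (a temp path, not the real job's location):

```shell
#!/usr/bin/env bash
# Sketch of the validation described above: fail early if the artifacts
# directory cannot be created, instead of suppressing errors with `|| :`.
ARTIFACTS_DIR="${ARTIFACTS_DIR:-$(mktemp -d)/artifacts}"

if ! mkdir -p "$ARTIFACTS_DIR"; then
  echo "Error: failed to create artifacts directory '${ARTIFACTS_DIR}'" >&2
  exit 1
fi
echo "Artifacts will be collected in ${ARTIFACTS_DIR}"
```

A permission or disk-full failure now stops the job immediately, rather than letting benchmarks run and lose their results.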
p-datadog pushed a commit that referenced this pull request Mar 4, 2026
Validates that CI_JOB_TOKEN is set before using it in git URL
configuration. If the token is empty or undefined, git config would
succeed but create a malformed URL, leading to authentication failures
during git clone with misleading error messages.

This affects two jobs:
- microbenchmarks: Uses token to clone benchmarking-platform
- microbenchmarks-pr-comment: Uses token to clone benchmarking-platform

Without this check, an empty CI_JOB_TOKEN would cause:
- Malformed git URLs like "https://gitlab-ci-token:@gitlab.ddbuild.io/..."
- Cryptic authentication errors instead of clear token validation errors
- Difficult debugging of CI configuration issues

The jobs now fail early with a clear error message if the token is
missing.

Related to PR #5313 - Parallelize microbenchmarks and run them more times
p-datadog pushed a commit that referenced this pull request Mar 4, 2026
Improves error handling for git clone operations by:
1. Providing explicit error messages when clones fail
2. Separating git clone from cd command for clearer error messages
3. Including branch name in error message for easier debugging

Previously, when git clone failed, the error would appear to be about
the subsequent 'cd' command ("cd: platform: No such file or directory"),
which is misleading. The actual issue was the clone failure, not the cd.

This affects four jobs:
- .macrobenchmarks (clones ruby/gitlab branch)
- ddprof-benchmark (clones ruby/ddprof-benchmark branch)
- microbenchmarks (clones dd-trace-rb branch)
- microbenchmarks-pr-comment (clones dd-trace-rb branch)

Benefits:
- Clear error messages identifying clone failures
- Branch name in error helps identify wrong branch configurations
- Easier debugging of repository access or branch name issues
- No more misleading "directory not found" errors

Related to PR #5313 - Parallelize microbenchmarks and run them more times
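Separating the clone from the cd, as this commit describes, could be sketched like this. The clone_and_enter helper and its parameters are illustrative; the real jobs clone specific benchmarking-platform branches:

```shell
#!/usr/bin/env bash
# Sketch of the clone error handling described above: report clone failures
# as clone failures, not as a misleading "cd: No such file or directory".
clone_and_enter() {
  local url="$1" branch="$2" dest="$3"
  if ! git clone --branch "$branch" --depth 1 "$url" "$dest" 2>/dev/null; then
    echo "Error: failed to clone '${url}' branch '${branch}'" >&2
    return 1
  fi
  cd "$dest" || return 1
}
```

Including the branch name in the message makes wrong-branch configurations obvious from the job log.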
p-datadog pushed a commit that referenced this pull request Mar 4, 2026
Validates that CI_COMMIT_SHA is set before executing the ddprof-benchmark
job. This variable is used as LATEST_COMMIT_ID to tag benchmark results
in the monitoring system.

While GitLab CI normally always sets CI_COMMIT_SHA automatically, this
validation provides defense in depth against:
- Manual job execution without proper CI context
- Broken CI configurations
- Edge cases in CI platform behavior

Without this check, if CI_COMMIT_SHA were somehow empty, benchmark
results would be tagged with an empty commit SHA, making them:
- Impossible to correlate with specific commits
- Orphaned in the monitoring system
- Useless for tracking performance over time

The job now fails early with a clear error message if the commit SHA
is missing, rather than proceeding with invalid metadata.

Related to PR #5313 - Parallelize microbenchmarks and run them more times
p-datadog pushed a commit that referenced this pull request Mar 4, 2026
Rewrites all validation checks from the `|| (echo "..." && exit 1)`
pattern to explicit if statements with proper stderr redirection.

Issues with the previous pattern:
1. Parentheses create a subshell - exit 1 only exits the subshell, not
   the main script in some contexts
2. Error messages went to stdout instead of stderr

Changes:
- All validations now use `if ! command` or `if [ -z "$VAR" ]`
- All error messages redirect to stderr with `>&2`
- Uses multi-line YAML blocks (`|`) for readability
- Eliminates subshell exit issues

Affects all validation checks added in previous commits:
- CI_COMMIT_SHA validation
- ARTIFACTS_DIR creation validation
- DD_API_KEY retrieval validation
- GROUP variable format validation
- BENCHMARKS variable validation
- CI_JOB_TOKEN validation
- Git clone error handling

Related to PR #5313 - Parallelize microbenchmarks and run them more times
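The difference between the two patterns this commit describes can be shown in a few lines (the VAR name and check_var helper are illustrative):

```shell
#!/usr/bin/env bash
# Before: the parentheses spawn a subshell, so `exit 1` only leaves the
# subshell; without `set -e` the script keeps going, and the message goes
# to stdout rather than stderr:
#   [ -n "$VAR" ] || (echo "VAR is empty" && exit 1)
#   echo "this line still runs"
#
# After: an explicit if statement, message on stderr, top-level failure.
check_var() {
  if [ -z "${VAR:-}" ]; then
    echo "Error: VAR is empty" >&2
    return 1
  fi
  echo "VAR is set to '$VAR'"
}

VAR="hello"
check_var
```

Returning a real failure status (or calling exit at the top level) is what lets the CI job abort, and stderr redirection keeps error output separate from the benchmark logs.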
