
Parallelize microbenchmarks and run them more times #5313

Merged
igoragoli merged 20 commits into master from augusto/2544.flakiness
Mar 3, 2026

Conversation

@igoragoli
Contributor

@igoragoli igoragoli commented Feb 4, 2026

⚠️ The PR description was modified to describe new changes.

  • TODO: Resolve all TODOs before merging to use production branches/images instead of testing ones.

What does this PR do?

  • Runs benchmarks 6 times (configurable via REPETITIONS in benchmarks/execution.yml) to reduce inter-run variability.
    • I had initially tested with 10 repetitions, which would make the CI job go from 30 minutes (the current state, without repetitions) to 50 minutes. I brought it down to 6 repetitions so the CI job duration stays at 30 minutes.
    • If flakiness is still high, we can bump the number of repetitions or configure the threshold on the analysis step over at benchmarking-platform, on the dd-trace-rb branch.
  • To prevent extremely long CI jobs, runs benchmarks in parallel with CPU isolation.
    • Since we have 24 available CPUs, the 13 benchmarks were split into two arbitrary groups. Each group is defined in benchmarks/execution.yml.
    • Every benchmark runs with 2 CPUs (configurable via CPUS_PER_BENCHMARK in the benchmarks/execution.yml job).
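The grouping-plus-CPU-isolation scheme described above can be sketched roughly as follows. CPUS_PER_BENCHMARK matches the variable name mentioned in the PR; the cpu_range_for helper and the taskset usage note are illustrative, not the actual job script:

```shell
#!/usr/bin/env bash
# Illustrative sketch of the CPU-isolation scheme from the PR description.
# CPUS_PER_BENCHMARK is the PR's variable name; cpu_range_for is hypothetical.
set -euo pipefail

CPUS_PER_BENCHMARK=2

# Compute the CPU range ("start-end") assigned to the i-th benchmark slot,
# so concurrent benchmarks never share cores.
cpu_range_for() {
  local index="$1"
  local start=$(( index * CPUS_PER_BENCHMARK ))
  local end=$(( start + CPUS_PER_BENCHMARK - 1 ))
  echo "${start}-${end}"
}

# On Linux, each benchmark could then be pinned with taskset(1), e.g.:
#   taskset -c "$(cpu_range_for "$i")" ruby "benchmarks/${name}.rb" &
# ...with the job waiting for all background benchmarks via `wait`.
```

With 2 CPUs per benchmark and 24 available CPUs, up to 12 benchmarks can run concurrently without overlapping core assignments, which is why the 13 benchmarks were split into two groups.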

Motivation:

https://datadoghq.atlassian.net/browse/APMSP-2544

Change log entry

None.

Additional Notes:

How to test the change?

Execution and reporting

Reducing flakiness
The effect of multiple repetitions and CPU isolation on result variability was tested and reported in this document: https://datadoghq.atlassian.net/wiki/x/egJ3cAE

25 out of ~45 scenarios were flaky before fixes, 0 are flaky after fixes.

These tests used 10 repetitions; while this PR introduces only 6, that should still bring the flakiness down substantially.

@github-actions

github-actions Bot commented Feb 4, 2026

Thank you for updating Change log entry section 👏


@datadog-official

datadog-official Bot commented Feb 4, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 95.10% (-0.00%)

Commit SHA: 61a7f68

@igoragoli igoragoli changed the title [DO NOT MERGE] test: mitigations for benchmark stability Reduce microbenchmark flakiness Feb 6, 2026
@igoragoli igoragoli marked this pull request as ready for review February 6, 2026 10:51
@igoragoli igoragoli requested a review from a team as a code owner February 6, 2026 10:51
@pr-commenter

pr-commenter Bot commented Feb 6, 2026

Benchmarks

Benchmark execution time: 2026-03-03 11:20:54

Comparing candidate commit 61a7f68 in PR branch augusto/2544.flakiness with baseline commit 158e037 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 46 metrics, 0 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

Comment thread .gitlab/benchmarks.yml Outdated
@ivoanjo
Member

ivoanjo commented Feb 6, 2026

I think the PR seems in good shape? My only question is -- I see the new jobs coming up in GitLab -- is there a way to check that the way we're exposing the files is still correct?

E.g. how do we test that we haven't broken our benchmark reporting code?

Comment thread benchmarks/run_all.sh Outdated
@igoragoli
Contributor Author

Hey @ivoanjo and @p-datadog, thank you for the reviews!

I'll answer Oleg on the conversation thread. To answer Ivo:

E.g. how do we test that we haven't broken our benchmark reporting code?

Great point, and while reports on the BP UI are working as expected, PR comments from one microbenchmarking job will overwrite the other. Results have to be combined somehow.

@igoragoli igoragoli force-pushed the augusto/2544.flakiness branch from 4fe6f54 to c7a672e Compare February 19, 2026 13:11
@igoragoli igoragoli changed the title Reduce microbenchmark flakiness Parallelize microbenchmarks and run them more times Feb 20, 2026
@igoragoli
Contributor Author

Hi! For visibility, I re-requested reviews from everyone who had reviewed, since you each pointed out different fixes.

Member

@ivoanjo ivoanjo left a comment


👍 LGTM

We're in the middle of a release, so merging to master is blocked; it should be unblocked later today/by Monday.

At this point I don't see any reason not to give this a try, and if some extra adjustment is needed we'll do it as a follow-up PR.

Comment thread benchmarks/Dockerfile
@igoragoli
Contributor Author

#5313 (comment) is awesome. We're comparing benchmarks from this branch against master. Since there were no code changes that should impact performance, we would indeed expect to see 0 improvements/regressions.

This corroborates the Ruby microbenchmark stability experiments, which showed that 10 repetitions took out all flakiness. 6 repetitions seem to be sufficient.

@p-datadog
Member

p-datadog commented Feb 23, 2026

I asked Claude to review this PR and it came up with a couple of pages of feedback. I DM'd @igoragoli to go over it, since for about half of it I can't tell whether it addresses real issues or just theoretical ones.

I wanted to see about DRYing the file names (which take up a large part of the diff), but Claude also spotted some missing validation, which seems at least plausible, and improvements in error reporting.

@ivoanjo
Member

ivoanjo commented Feb 24, 2026

I asked claude to review this PR and it came up with a couple of pages of feedback, I DM'd @igoragoli to go over it since I don't understand for about half of it whether it addresses real issues or just theoretical ones.

I wanted to see about DRYing the file names (which take up a large part of the diff) but claude also spotted some missing validation which seems to be at least plausible and improvements in error reporting.

To be clear, is any of that feedback blocking? In this age of "feedback from AI is free", I think it's even more important for the line between "hey, maybe this would be a cool improvement" and "hey, we should really fix this, it's a confusing/dangerous/weird footgun" to be clear for folks :)

@p-datadog
Member

I am not sure - that's what I wanted to find out.

I think we are all at the summit at the moment anyway.

@igoragoli igoragoli merged commit c14b280 into master Mar 3, 2026
368 checks passed
@igoragoli igoragoli deleted the augusto/2544.flakiness branch March 3, 2026 14:02
@github-actions github-actions Bot added this to the 2.30.0 milestone Mar 3, 2026
p-datadog pushed a commit that referenced this pull request Mar 4, 2026
Adds a safety check to ensure BENCHMARKS variable is populated before
proceeding with benchmark execution. Without this validation, if the
GROUP variable is invalid or misspelled, the job would silently run
with zero benchmarks and report success.

This change prevents silent failures by:
- Validating that the evaluated BENCHMARKS_${GROUP} variable is non-empty
- Providing a clear error message identifying the invalid group name
- Failing the CI job early before attempting to run bp-runner

Example failure message: "Error: No benchmarks defined for group 'typo'"

Related to PR #5313 - Parallelize microbenchmarks and run them more times
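The check this commit describes could look roughly like the following sketch. The BENCHMARKS_GROUP_A contents and the benchmark names are illustrative placeholders; only the BENCHMARKS/GROUP variable names and the error-message shape come from the commit message:

```shell
#!/usr/bin/env bash
# Sketch of the safety check described above: fail early if the selected
# GROUP has no benchmark list defined. The group contents are illustrative.
BENCHMARKS_GROUP_A="tracing_trace gem_loading"
GROUP="${GROUP:-GROUP_A}"

# Indirect expansion of BENCHMARKS_${GROUP}, as the job does via eval.
eval "BENCHMARKS=\"\${BENCHMARKS_${GROUP}:-}\""

if [ -z "$BENCHMARKS" ]; then
  echo "Error: No benchmarks defined for group '${GROUP}'" >&2
  exit 1
fi
echo "Running benchmarks: $BENCHMARKS"
```

With a misspelled group such as GROUP=typo, the eval expands an unset variable, BENCHMARKS ends up empty, and the job now fails with a clear message instead of silently running zero benchmarks.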
p-datadog pushed a commit that referenced this pull request Mar 4, 2026
Validates that the GROUP variable contains only alphanumeric characters
and underscores before using it in an eval statement. This prevents:
- Shell injection risks if GROUP contains special characters
- Cryptic eval errors from malformed variable names
It also produces better error messages for configuration mistakes.

Example: If someone adds GROUP: "my-new-group" (with hyphens) to the
parallel matrix, the job will now fail with a clear error message
explaining the format requirement, rather than succeeding with zero
benchmarks or producing an eval syntax error.

Related to PR #5313 - Parallelize microbenchmarks and run them more times
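A minimal sketch of the format check this commit describes, using a POSIX case pattern to reject anything other than letters, digits, and underscores (the validate_group helper name is illustrative):

```shell
#!/usr/bin/env bash
# Sketch of the GROUP format check described above: only alphanumerics and
# underscores are allowed before GROUP is interpolated into an eval.
validate_group() {
  case "$1" in
    *[!A-Za-z0-9_]*|"")
      echo "Error: GROUP '$1' must contain only letters, digits, and underscores" >&2
      return 1
      ;;
  esac
}

validate_group "GROUP_A" && echo "GROUP_A accepted"
validate_group "my-new-group" || echo "my-new-group rejected"
```

Rejecting the value before the eval means a hyphenated entry like "my-new-group" fails with a readable configuration error rather than an eval syntax error.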
p-datadog pushed a commit that referenced this pull request Mar 4, 2026
Ensures the DD_API_KEY is successfully retrieved from AWS SSM Parameter
Store before proceeding with benchmark execution. Without this check,
if the AWS command fails (due to permissions, network issues, or missing
parameter), the variable would be set to an empty string and the job
would continue, causing silent failures when attempting to upload results.

This prevents:
- Benchmark results being lost due to failed uploads
- Jobs appearing successful when API key retrieval failed
- Difficult debugging of upload failures

The job now fails early with a clear error message if the API key
cannot be retrieved.

Related to PR #5313 - Parallelize microbenchmarks and run them more times
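The empty-variable early-fail pattern this commit describes (and which the later CI_JOB_TOKEN and CI_COMMIT_SHA commits reuse) could be sketched like this. In the real job DD_API_KEY comes from `aws ssm get-parameter`; here the value is stubbed and the require_var helper is illustrative:

```shell
#!/usr/bin/env bash
# Sketch of the early-fail check described above. require_var is a
# hypothetical helper; in CI the value would come from AWS SSM.
require_var() {
  name="$1"
  # Indirect lookup so one helper covers DD_API_KEY, CI_JOB_TOKEN, etc.
  eval "value=\${${name}:-}"
  if [ -z "$value" ]; then
    echo "Error: ${name} is empty or unset; failing early" >&2
    return 1
  fi
}

DD_API_KEY="dummy-key-for-illustration"
require_var DD_API_KEY && echo "DD_API_KEY present"
```

Failing here, before any benchmarks run, turns a silent lost-upload into an immediate, attributable CI error.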
p-datadog pushed a commit that referenced this pull request Mar 4, 2026
Validates that the artifacts directory is successfully created before
proceeding with benchmark execution. This change addresses three issues:

1. ddprof-benchmark job: Removes dangerous `|| :` pattern that suppressed
   all mkdir errors, which could hide real failures like permission issues
   or disk full errors.

2. microbenchmarks job: Adds explicit validation that directory creation
   succeeded.

3. microbenchmarks-pr-comment job: Adds explicit validation that directory
   creation succeeded.

Without this validation, if directory creation fails, the job would
continue and benchmark results would be lost. The job might appear
successful even though no artifacts were collected.

Now the jobs fail early with a clear error message if the artifacts
directory cannot be created.

Related to PR #5313 - Parallelize microbenchmarks and run them more times
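The directory-creation check this commit describes might look like the sketch below, replacing the `|| :` pattern that swallowed mkdir errors. The ARTIFACTS_DIR default here is illustrative (a temp path, not the real job's location):

```shell
#!/usr/bin/env bash
# Sketch of the validation described above: fail early if the artifacts
# directory cannot be created, instead of suppressing errors with `|| :`.
ARTIFACTS_DIR="${ARTIFACTS_DIR:-$(mktemp -d)/artifacts}"

if ! mkdir -p "$ARTIFACTS_DIR"; then
  echo "Error: failed to create artifacts directory '${ARTIFACTS_DIR}'" >&2
  exit 1
fi
echo "Artifacts will be collected in ${ARTIFACTS_DIR}"
```

A permission or disk-full failure now stops the job immediately, rather than letting benchmarks run and lose their results.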
p-datadog pushed a commit that referenced this pull request Mar 4, 2026
Validates that CI_JOB_TOKEN is set before using it in git URL
configuration. If the token is empty or undefined, git config would
succeed but create a malformed URL, leading to authentication failures
during git clone with misleading error messages.

This affects two jobs:
- microbenchmarks: Uses token to clone benchmarking-platform
- microbenchmarks-pr-comment: Uses token to clone benchmarking-platform

Without this check, an empty CI_JOB_TOKEN would cause:
- Malformed git URLs like "https://gitlab-ci-token:@gitlab.ddbuild.io/..."
- Cryptic authentication errors instead of clear token validation errors
- Difficult debugging of CI configuration issues

The jobs now fail early with a clear error message if the token is
missing.

Related to PR #5313 - Parallelize microbenchmarks and run them more times
p-datadog pushed a commit that referenced this pull request Mar 4, 2026
Improves error handling for git clone operations by:
1. Providing explicit error messages when clones fail
2. Separating git clone from cd command for clearer error messages
3. Including branch name in error message for easier debugging

Previously, when git clone failed, the error would appear to be about
the subsequent 'cd' command ("cd: platform: No such file or directory"),
which is misleading. The actual issue was the clone failure, not the cd.

This affects four jobs:
- .macrobenchmarks (clones ruby/gitlab branch)
- ddprof-benchmark (clones ruby/ddprof-benchmark branch)
- microbenchmarks (clones dd-trace-rb branch)
- microbenchmarks-pr-comment (clones dd-trace-rb branch)

Benefits:
- Clear error messages identifying clone failures
- Branch name in error helps identify wrong branch configurations
- Easier debugging of repository access or branch name issues
- No more misleading "directory not found" errors

Related to PR #5313 - Parallelize microbenchmarks and run them more times
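Separating the clone from the cd, as this commit describes, could be sketched like this. The clone_and_enter helper and its parameters are illustrative; the real jobs clone specific benchmarking-platform branches:

```shell
#!/usr/bin/env bash
# Sketch of the clone error handling described above: report clone failures
# as clone failures, not as a misleading "cd: No such file or directory".
clone_and_enter() {
  local url="$1" branch="$2" dest="$3"
  if ! git clone --branch "$branch" --depth 1 "$url" "$dest" 2>/dev/null; then
    echo "Error: failed to clone '${url}' branch '${branch}'" >&2
    return 1
  fi
  cd "$dest" || return 1
}
```

Including the branch name in the message makes wrong-branch configurations obvious from the job log.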
p-datadog pushed a commit that referenced this pull request Mar 4, 2026
Validates that CI_COMMIT_SHA is set before executing the ddprof-benchmark
job. This variable is used as LATEST_COMMIT_ID to tag benchmark results
in the monitoring system.

While GitLab CI normally always sets CI_COMMIT_SHA automatically, this
validation provides defense in depth against:
- Manual job execution without proper CI context
- Broken CI configurations
- Edge cases in CI platform behavior

Without this check, if CI_COMMIT_SHA were somehow empty, benchmark
results would be tagged with an empty commit SHA, making them:
- Impossible to correlate with specific commits
- Orphaned in the monitoring system
- Useless for tracking performance over time

The job now fails early with a clear error message if the commit SHA
is missing, rather than proceeding with invalid metadata.

Related to PR #5313 - Parallelize microbenchmarks and run them more times
p-datadog pushed a commit that referenced this pull request Mar 4, 2026
Rewrites all validation checks from the `|| (echo "..." && exit 1)`
pattern to explicit if statements with proper stderr redirection.

Issues with the previous pattern:
1. Parentheses create a subshell - exit 1 only exits the subshell, not
   the main script in some contexts
2. Error messages went to stdout instead of stderr

Changes:
- All validations now use `if ! command` or `if [ -z "$VAR" ]`
- All error messages redirect to stderr with `>&2`
- Uses multi-line YAML blocks (`|`) for readability
- Eliminates subshell exit issues

Affects all validation checks added in previous commits:
- CI_COMMIT_SHA validation
- ARTIFACTS_DIR creation validation
- DD_API_KEY retrieval validation
- GROUP variable format validation
- BENCHMARKS variable validation
- CI_JOB_TOKEN validation
- Git clone error handling

Related to PR #5313 - Parallelize microbenchmarks and run them more times
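The difference between the two patterns this commit describes can be shown in a few lines (the VAR name and check_var helper are illustrative):

```shell
#!/usr/bin/env bash
# Before: the parentheses spawn a subshell, so `exit 1` only leaves the
# subshell; without `set -e` the script keeps going, and the message goes
# to stdout rather than stderr:
#   [ -n "$VAR" ] || (echo "VAR is empty" && exit 1)
#   echo "this line still runs"
#
# After: an explicit if statement, message on stderr, top-level failure.
check_var() {
  if [ -z "${VAR:-}" ]; then
    echo "Error: VAR is empty" >&2
    return 1
  fi
  echo "VAR is set to '$VAR'"
}

VAR="hello"
check_var
```

Returning a real failure status (or calling exit at the top level) is what lets the CI job abort, and stderr redirection keeps error output separate from the benchmark logs.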
