Add timeout to debugger captures #4003

Open
bwoebi wants to merge 4 commits into master from bob/debugger-limit

Conversation

bwoebi (Collaborator) commented Jun 22, 2026

Adds a DD_DYNAMIC_INSTRUMENTATION_CAPTURE_TIMEOUT_MS config option to enforce a limit on capture time.

@bwoebi bwoebi requested a review from a team as a code owner June 22, 2026 11:17
datadog-datadog-prod-us1-2 (Bot) commented Jun 22, 2026

⚠️ Warnings

🚦 5 Pipeline jobs failed

Profiling ASAN/UBSAN Tests | PHP 8.5 zts UBSAN (arm-8core-linux) (GitHub Actions)

DataDog/apm-reliability/dd-trace-php | test_extension_ci: [8.2] (GitLab)

DataDog/apm-reliability/dd-trace-php | Zend Abstract Interface Tests: [8.3, nts] (GitLab)

❄️ 2 New flaky tests detected

    tmp/build_extension/tests/ext/live-debugger/debugger_log_probe_capture_timeout.phpt (Live debugger log probe capture timeout with large data structure) from PHP.tmp.build_extension.tests.ext.live.debugger

    tmp/build_extension/tests/ext/live-debugger/debugger_log_probe_capture_timeout.phpt (Live debugger log probe capture timeout with large data structure) from php.tmp.build_extension.tests.ext.live.debugger

ℹ️ Info

No other issues found (see more)

🧪 All tests passed

🔄 Datadog auto-retried 2 jobs - 2 passed on retry

🎯 Code Coverage
Patch Coverage: 100.00%
Overall Coverage: 54.08% (-0.04%)

🔗 Commit SHA: 89a1dcd

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 793daeec61

Comment thread tracer/live_debugger.c Outdated
usec = dd_find_lowest_dealine_timer();
#endif
struct itimerval it = {
.it_value = { .tv_sec = usec / 10000000, .tv_usec = usec % 1000000 },

P2: Use microseconds-per-second for setitimer

On the non-Linux setitimer path, usec is already in microseconds, so tv_sec must divide by 1,000,000 rather than 10,000,000. When DD_DYNAMIC_INSTRUMENTATION_CAPTURE_TIMEOUT_MS is configured above 999 ms on macOS/BSD, values like 1000 ms or 2000 ms produce {0, 0} and disarm the timeout, while other multi-second values fire much too early; the same conversion should be fixed in the stop/re-arm paths as well.
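
The arithmetic behind this finding can be checked in isolation. The following is a minimal sketch, not the PR's code: `usec_to_timeval` and `usec_to_timeval_buggy` are hypothetical helper names introduced here for illustration, and only the two divisors are taken from the diff above.

```c
#include <stdint.h>
#include <sys/time.h>

#define USEC_PER_SEC 1000000ULL

/* Split a duration in microseconds into whole seconds plus leftover microseconds. */
static struct timeval usec_to_timeval(uint64_t usec) {
    struct timeval tv;
    tv.tv_sec = (time_t)(usec / USEC_PER_SEC);
    tv.tv_usec = (suseconds_t)(usec % USEC_PER_SEC);
    return tv;
}

/* Same split, but with the 10,000,000 divisor flagged in the review (comparison only). */
static struct timeval usec_to_timeval_buggy(uint64_t usec) {
    struct timeval tv;
    tv.tv_sec = (time_t)(usec / 10000000ULL);
    tv.tv_usec = (suseconds_t)(usec % USEC_PER_SEC);
    return tv;
}
```

For 2000 ms (2,000,000 µs) the first helper yields {2, 0}, while the second yields {0, 0}, which `setitimer` interprets as a request to disarm the timer.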

Contributor

yeah same here indeed

pr-commenter (Bot) commented Jun 22, 2026

Benchmarks [ tracer ]

Benchmark execution time: 2026-06-29 16:36:22

Comparing candidate commit 89a1dcd in PR branch bob/debugger-limit with baseline commit 303fa81 in branch master.

Found 0 performance improvements and 7 performance regressions! Performance is the same for 187 metrics, 0 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'
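
The significance rule described above can be sketched as a small predicate. This is an illustration only: `ci_is_significant` is a hypothetical name, and treating the threshold as symmetric around 0% is an assumption made here, not something this report states.

```c
#include <stdbool.h>

/* Returns true when the whole interval [ci_low, ci_high] lies outside
 * [-threshold, +threshold]. All values are relative differences in percent.
 * Assumption: the threshold applies symmetrically around 0%. */
static bool ci_is_significant(double ci_low, double ci_high, double threshold) {
    return ci_low > threshold || ci_high < -threshold;
}
```

With a 1% threshold, the interval [1.3%, 3.1%] from the second diagram counts as significant, while [-0.6%, +1.2%] from the first diagram does not.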

scenario:EmptyFileBench/benchEmptyFileBaseline

  • 🟥 execution_time [+84.337µs; +286.443µs] or [+2.709%; +9.199%]

scenario:MessagePackSerializationBench/benchMessagePackSerialization

  • 🟥 execution_time [+4.196µs; +6.924µs] or [+4.143%; +6.837%]

scenario:MessagePackSerializationBench/benchMessagePackSerialization-opcache

  • 🟥 execution_time [+2.843µs; +5.537µs] or [+2.751%; +5.356%]

scenario:SamplingRuleMatchingBench/benchRegexMatching1

  • 🟥 execution_time [+56.720ns; +130.080ns] or [+3.816%; +8.751%]

scenario:SamplingRuleMatchingBench/benchRegexMatching2

  • 🟥 execution_time [+78.046ns; +150.354ns] or [+5.377%; +10.358%]

scenario:SamplingRuleMatchingBench/benchRegexMatching3

  • 🟥 execution_time [+70.971ns; +146.629ns] or [+4.788%; +9.893%]

scenario:SamplingRuleMatchingBench/benchRegexMatching4

  • 🟥 execution_time [+43.686ns; +136.314ns] or [+2.910%; +9.080%]

@bwoebi bwoebi force-pushed the bob/debugger-limit branch 2 times, most recently from efc9646 to db71d0a Compare June 25, 2026 16:26
bwoebi added 2 commits June 25, 2026 19:20
Signed-off-by: Bob Weinand <bob.weinand@datadoghq.com>
Signed-off-by: Bob Weinand <bob.weinand@datadoghq.com>
@bwoebi bwoebi force-pushed the bob/debugger-limit branch from db71d0a to 92fe6c4 Compare June 25, 2026 17:20
Signed-off-by: Bob Weinand <bob.weinand@datadoghq.com>
@bwoebi bwoebi force-pushed the bob/debugger-limit branch from a3a2bcb to c6da6b1 Compare June 26, 2026 19:08
Comment thread ext/remote_config.c Outdated
if (next_deadline != ~0ull) { // re-arm the timer, for ZTS concurrency
uint64_t usec = (next_deadline - now_ns) / 1000ull;
struct itimerval it = {
.it_value = { .tv_sec = usec / 10000000, .tv_usec = usec % 1000000 },

Contributor

Suggested change
- .it_value = { .tv_sec = usec / 10000000, .tv_usec = usec % 1000000 },
+ .it_value = { .tv_sec = usec / 1000000, .tv_usec = usec % 1000000 },

There is an extra 0 here, no?

Comment thread tracer/live_debugger.c Outdated
#ifdef __linux__
#include <sys/syscall.h>
#elif defined(ZTS)
uint64_t dd_find_lowest_dealine_timer(void) {

Contributor

Suggested change
- uint64_t dd_find_lowest_dealine_timer(void) {
+ uint64_t dd_find_lowest_deadline_timer(void) {

Needs to be changed in other places as well :D

Comment thread tracer/live_debugger.c Outdated

void dd_stop_debugger_timeout(void) {
if (DDTRACE_G(capture_timer_handle)) {
DeleteTimerQueueTimer(NULL, DDTRACE_G(capture_timer_handle), NULL);

Contributor

Suggested change
- DeleteTimerQueueTimer(NULL, DDTRACE_G(capture_timer_handle), NULL);
+ DeleteTimerQueueTimer(NULL, DDTRACE_G(capture_timer_handle), INVALID_HANDLE_VALUE);

We can pass INVALID_HANDLE_VALUE to block until any running callback has completed.

Comment thread ext/remote_config.c
#if !defined(__linux__) && defined(ZTS)
} ZEND_HASH_FOREACH_END();
if (next_deadline != ~0ull) { // re-arm the timer, for ZTS concurrency
uint64_t usec = (next_deadline - now_ns) / 1000ull;

Contributor

Suggested change
- uint64_t usec = (next_deadline - now_ns) / 1000ull;
+ uint64_t usec = next_deadline > now_ns ? (next_deadline - now_ns) / 1000ull : 0;

Should we check, just in case?

Collaborator Author

That check happens on line 92: if (now_ns >= deadline) {

Contributor

oh yeah indeed, missed it
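
The underflow that the suggested guard protects against can be sketched as follows. In the PR itself, the author notes that an earlier `now_ns >= deadline` check already covers this path, so this is only an illustration of the general hazard. The helper names `usec_until` and `usec_until_unguarded` are hypothetical and do not appear in the PR.

```c
#include <stdint.h>

/* Microseconds remaining until deadline_ns, or 0 if the deadline has already passed. */
static uint64_t usec_until(uint64_t deadline_ns, uint64_t now_ns) {
    return deadline_ns > now_ns ? (deadline_ns - now_ns) / 1000ULL : 0;
}

/* Unguarded variant: unsigned subtraction wraps around when now_ns > deadline_ns,
 * producing a huge value instead of a negative one. */
static uint64_t usec_until_unguarded(uint64_t deadline_ns, uint64_t now_ns) {
    return (deadline_ns - now_ns) / 1000ULL;
}
```

When the deadline lies in the past, the guarded helper returns 0, whereas the unguarded one returns a value close to `UINT64_MAX / 1000`, which would arm a timer far in the future.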

Comment thread tracer/live_debugger.c
struct timespec now;
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &now);
uint64_t now_ns = (uint64_t)now.tv_sec * 1000000000ULL + (uint64_t)now.tv_nsec;
usec = (next_deadline - now_ns) / 1000ull;

Contributor

Suggested change
- usec = (next_deadline - now_ns) / 1000ull;
+ usec = next_deadline > now_ns ? (next_deadline - now_ns) / 1000ull : 0;

Same here.

Comment thread tracer/live_debugger.c
#endif

// SIGEV_THREAD_ID delivers SIGVTALRM to exactly this thread, not a random one (critical for ZTS).
void dd_start_debugger_timeout(void) {

Contributor

I think we should add a guard that checks whether a timer is already active before starting a new one.
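
One possible shape for such a guard is sketched below. It is not the PR's implementation: it uses the process-wide `setitimer(ITIMER_VIRTUAL, ...)` rather than the per-thread timer the comment above refers to, and `capture_timer_active`, `start_capture_timeout`, and `stop_capture_timeout` are hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical flag tracking whether a capture timer is currently armed. */
static bool capture_timer_active = false;

/* Arms a one-shot CPU-time timer unless one is already active.
 * Returns true if a new timer was armed, false otherwise. */
static bool start_capture_timeout(unsigned long timeout_ms) {
    if (capture_timer_active) {
        return false; /* guard: do not stack a second timer on top of the first */
    }
    struct itimerval it = {
        .it_value = { .tv_sec = timeout_ms / 1000, .tv_usec = (timeout_ms % 1000) * 1000 },
    };
    if (setitimer(ITIMER_VIRTUAL, &it, NULL) != 0) {
        return false;
    }
    capture_timer_active = true;
    return true;
}

/* Disarms the timer if one is active; a zeroed it_value means "disarm". */
static void stop_capture_timeout(void) {
    if (!capture_timer_active) {
        return;
    }
    struct itimerval off = { { 0, 0 }, { 0, 0 } };
    setitimer(ITIMER_VIRTUAL, &off, NULL);
    capture_timer_active = false;
}
```

A second call to `start_capture_timeout` before `stop_capture_timeout` returns false and leaves the first deadline untouched. In a ZTS build the flag would presumably need to be per-thread, which this sketch does not attempt.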

Signed-off-by: Bob Weinand <bob.weinand@datadoghq.com>

Leiyks (Contributor) left a comment

lgtm 👍

2 participants