Skip to content

fix(live-debugger): retire removed probe configs to prevent use-after-free#4024

Draft
Leiyks wants to merge 2 commits into
masterfrom
leiyks/fix-ci-live-debugger-uaf
Draft

fix(live-debugger): retire removed probe configs to prevent use-after-free#4024
Leiyks wants to merge 2 commits into
masterfrom
leiyks/fix-ci-live-debugger-uaf

Conversation

@Leiyks

@Leiyks Leiyks commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Problem

test_extension_ci: [7.1] (the valgrind master job) fails with a LEAKED TEST SUMMARY on tests/ext/live-debugger/debugger_enable_dynamic_config.phpt. The .mem artifact shows it is not a leak but a use-after-free:

Invalid read of size 1
  ... probe.id.to_utf8_lossy()                    send_data.rs:523
  ddog_send_debugger_diagnostics                  remote_config.rs:786
  dd_probe_mark_active                            tracer/live_debugger.c:226
  dd_span_probe_begin                             tracer/live_debugger.c:239
 Address ... is 0 bytes inside a block of size 1 free'd
  drop String -> Probe -> ... -> Box<(RemoteConfigParsed, MaybeShmLimiter)>
  ddog_process_remote_configs                     components-rs/remote_config.rs:445  (LiveDebugger Remove)

The freed block is size 1 — the 1-byte probe id "1" the test generates.

Root cause

The C side (tracer/live_debugger.c:162, def->probe = *probe) keeps a shallow copy of the FFI ddog_Probe, whose ddog_CharSlice fields (id, capture config, shm limiter, …) borrow strings owned by the boxed RemoteConfigParsed stored in live_debugger.active. The Box<> exists specifically to give that data a stable heap address for these borrows (per its own comment).

On a LiveDebugger Remove (and on an Entry::Occupied re-add) the box was dropped immediately, freeing those strings. But zai_hook_remove can defer the actual hook teardown, so a still-installed probe can fire afterwards and read the freed strings via ddog_send_debugger_diagnostics(&def->probe, …) → UAF.

Latent since the feature landed (f87446c4d, 2024-10); only the valgrind job observes it because plain runs read the freed-but-intact bytes and pass. It is timing-gated, which is why it surfaces in the full shuffled valgrind suite rather than in isolation.

Fix

Retire removed/replaced boxes into a new live_debugger.retired vec instead of dropping them, and free them in ddog_rshutdown_remote_config — by which point all probe hooks are torn down and no PHP user code can fire a probe. retired is cleared every request, so it stays bounded.

Verification

  • Builds on PHP 7.1 with the master-pinned libdatadog (6a6d4a535).
  • The previously-leaking test passes 10/10 under valgrind with the fix; non-regressing.
  • Note: the UAF needs full-suite (shared request-replayer + shuffle) timing and does not reproduce in isolation, so the authoritative signal is the test_extension_ci: [7.1] valgrind job on this branch.

…-free

The C side (tracer/live_debugger.c) stores a shallow copy of the FFI ddog_Probe
in def->probe, whose ddog_CharSlice fields (probe id, capture config, etc.) borrow
strings owned by the boxed RemoteConfigParsed kept in live_debugger.active. The Box
exists specifically to give that data a stable heap address for these borrows.

On a LiveDebugger Remove (and on an Entry::Occupied re-add), the box was dropped
immediately, freeing those strings. But zai_hook_remove can defer the actual hook
teardown, so a still-installed probe could fire afterwards and read the freed
strings via ddog_send_debugger_diagnostics(&def->probe, ...) -> probe.id, a
use-after-free. Only the valgrind master job observes it (plain runs read the
freed-but-intact bytes); confirmed by the .mem from test_extension_ci: [7.1].

Retire removed/replaced boxes into live_debugger.retired instead of dropping them,
and free them in ddog_rshutdown_remote_config, by which point all probe hooks are
torn down and no PHP user code can fire a probe.

Verified: builds on PHP 7.1 (libdatadog 6a6d4a535); the leaking test passes 10/10
under valgrind with the fix and is non-regressing.
@datadog-datadog-prod-us1

datadog-datadog-prod-us1 Bot commented Jun 30, 2026

Copy link
Copy Markdown

Pipelines  Tests

Fix all issues with BitsAI

⚠️ Warnings

🚦 4 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-php | ASAN test_c with multiple observers: [8.3]   View in Datadog   GitLab

DataDog/apm-reliability/dd-trace-php | test_extension_ci: [7.0]   View in Datadog   GitLab

DataDog/apm-reliability/dd-trace-php | test_extension_ci: [7.1]   View in Datadog   GitLab

View all 4 failed jobs.

ℹ️ Info

No other issues found (see more)

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 54.08% (-0.03%)

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: a7cfd7d | Docs | Datadog PR Page | Give us feedback!

@pr-commenter

pr-commenter Bot commented Jun 30, 2026

Copy link
Copy Markdown

Benchmarks [ tracer ]

Benchmark execution time: 2026-06-30 15:27:25

Comparing candidate commit a7cfd7d in PR branch leiyks/fix-ci-live-debugger-uaf with baseline commit bffd346 in branch master.

Found 0 performance improvements and 1 performance regressions! Performance is the same for 193 metrics, 0 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

scenario:ContextPropagationBench/benchInject64Bit-opcache

  • 🟥 execution_time [+503.212ns; +750.788ns] or [+3.616%; +5.396%]

…rc rshutdown

Code review found the previous commit freed live_debugger.retired in
ddog_rshutdown_remote_config, which runs at the very start of RSHUTDOWN
(ext/datadog.c:647) -- BEFORE ddtrace_rshutdown -> dd_force_shutdown_tracing /
zai_hook_clean (tracer/ddtrace.c:582,588) tear down the probe hooks that borrow
into those boxes. The shutdown-time span flush can itself fire a probe, so the
use-after-free window was not actually closed.

Move the free to a new FFI ddog_live_debugger_free_retired, invoked at the end of
ddtrace_live_debugger_rshutdown() (tracer/ddtrace.c:618) -- after both zai_hook_clean
calls and the active_live_debugger_hooks destroy, i.e. the last point at which a
probe can fire. Still per-request, so retired stays bounded.

Verified on PHP 7.1 (libdatadog 6a6d4a535): builds/links; debugger_enable_dynamic_config
passes 10/10 under valgrind and the whole live-debugger dir is clean.
@bwoebi

bwoebi commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

My problem with this PR is that it misidentifies the root cause:
There's no reason why the begin hook should be called in the first place on an already removed hook. This should not be possible - zai_hook_continue carefully uses the ht iterators API to that extent and checks for negative hook id, which signals removal.

And end hooks are anyway guarded with def->removed where needed.

I don't see where ddog_process_remote_configs was called from the shortened trace, so that gives me no hint.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants