Commit e87d26f
fix(profiling): fix absurdly high lock hold times in Lock Profiler [backport 4.0] (#15462)
Backport 2cfb922 from #15408 to 4.0.
## Fix lock profiler generating release samples for non-sampled acquires
### Problem
As reported by @dchayes-dd in our Slack channel, the Lock Profiler
displays unrealistic lock hold times (either far too large or negative).
The negative values are genuine int64 overflows and are themselves a red
herring; the root cause is described below.
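To make the overflow remark concrete, here is a minimal sketch of how summing uptime-sized durations can wrap a signed 64-bit counter into negative territory. Where exactly the aggregation happens is not stated in this description; `wrap_int64` and the 30-day sample size are hypothetical and chosen only for illustration.

```python
def wrap_int64(value: int) -> int:
    """Reinterpret a Python int as a two's-complement signed 64-bit value."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


NS_PER_DAY = 86_400 * 1_000_000_000
# Assumption: each bogus release sample carries 30 days of uptime, in nanoseconds.
bogus_sample_ns = 30 * NS_PER_DAY

total_ns = 0
samples_until_negative = 0
while total_ns >= 0:
    # Emulate fixed-width accumulation; Python ints alone would never overflow.
    total_ns = wrap_int64(total_ns + bogus_sample_ns)
    samples_until_negative += 1
```

Under these assumptions a few thousand bogus samples are enough to push the total past the int64 maximum, which is why negative hold times are a symptom rather than a separate bug.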
**Example 1**
* [Flame
graph](https://app.datadoghq.com/profiling/explorer?query=service%3Awatchdog_explorer%20datacenter%3Aus1.prod.dog&agg_m=%40prof_core_cpu_cores&agg_m_source=base&agg_q=service&agg_q_source=base&agg_t=sum&fromUser=true&my_code=enabled&profile_type=lock-hold-time&profiling-timeline__stack_order=top-down&profiling-timeline__summary_tab=breakdown&profiling-timeline__tb=1763580147405000000&profiling-timeline__tf_f_0=0&profiling-timeline__tf_f_1=121233000000&profiling-timeline__tf_us_0=0&profiling-timeline__tf_us_1=121233000000&refresh_mode=paused&selected_tf=1763579722252.7107%2C1763580268638&top_n=100&top_o=top&viz=flame_graph&x_missing=true&from_ts=1763568998228&to_ts=1763583398228&live=false)
with negative lock hold time: `-970 days / min`
<img width="1546" height="322" alt="Screenshot 2025-11-24 at 10 08
09 PM"
src="https://github.com/user-attachments/assets/e2f2e38b-3282-493f-bd17-8f7545776d51"
/>
**Example 2**
* [Flame
graph](https://app.datadoghq.com/profiling/explorer?query=service%3Awatchdog_explorer%20datacenter%3Aus1.prod.dog%20runtime-id%3Adc67db679fe041f1b1d04dbc8a903e22&agg_m=%40prof_python_cpu_cores&agg_m_source=base&agg_q=service&agg_q_source=base&agg_t=sum&extra_search_fields=%7B%22filters_query%22%3A%22%22%2C%22sample_type%22%3A%22cpu-time%22%7D&fromUser=true&my_code=enabled&profile_type=lock-hold-time&profiling-timeline__stack_order=top-down&profiling-timeline__summary_tab=breakdown&profiling-timeline__tb=1763580092026000000&profiling-timeline__tf_f_0=0&profiling-timeline__tf_f_1=121249000000&profiling-timeline__tf_us_0=0&profiling-timeline__tf_us_1=121249000000&refresh_mode=paused&selected_tf=1763580092026%2C1763580213275&top_n=100&top_o=top&viz=flame_graph&x_missing=true&from_ts=1763579722252&to_ts=1763580268638&live=false)
with absurdly high lock hold time: `19,000 days / min`
<img width="1532" height="313" alt="Screenshot 2025-11-24 at 10 16
05 PM"
src="https://github.com/user-attachments/assets/c7e056ed-faf5-4707-9ddd-002eb3b2234a"
/>
### Impact
_**This affected virtually ALL customers, unless they set the sampling
rate very high (close to 100%).**_
With default `capture_pct = 1.0` (1%):
- **99% of acquires** → ARE NOT sampled (`capture()` returns `False`)
- **99% of releases** → ARE sampled with `duration = system_uptime`
(hours, days, months?)
==> Only **1% of logged release samples** are legitimate
In general, the lower the sampling rate, the worse the problem.
### Root Cause
When `capture_sampler.capture()` returned `False` (due to the sampling
rate), the acquire event was correctly skipped, but the release event
was still sampled. Because `acquire_time` was still at its initial value
of `0`, that sample's lock hold time was equal to `system_uptime`
(hours, days, months?).
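The failure mode can be reproduced with a small model. This is a sketch, not the ddtrace code: `BuggyProfiledLock` is a hypothetical class, the `sampled` argument stands in for the result of `capture_sampler.capture()`, and the `is not None` guard is an assumption based on the `0 is not None` remark in this description.

```python
import time


class BuggyProfiledLock:
    """Hypothetical model of the pre-fix behavior; not the ddtrace code."""

    def __init__(self):
        self.acquire_time = 0  # buggy default: 0 is not None
        self.release_samples = []

    def acquire(self, sampled):
        # Only sampled acquires record a timestamp.
        if sampled:
            self.acquire_time = time.monotonic_ns()

    def release(self):
        # Buggy guard: passes even when the acquire was never sampled.
        if self.acquire_time is not None:
            self.release_samples.append(time.monotonic_ns() - self.acquire_time)


lock = BuggyProfiledLock()
lock.acquire(sampled=False)  # acquire skipped by the sampler
lock.release()               # a release sample is recorded anyway
bogus_ns = lock.release_samples[0]  # raw clock reading, not a hold time
```

With a monotonic clock, subtracting `0` leaves the clock reading itself, which is consistent with the uptime-sized durations reported above.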
### Fix
Initialize `acquire_time` to `None` instead of `0`. Since `0 is not None`
and `0` is also a valid timestamp, the buggy check in `_release` treated
the initial value as a real acquire time and let fake samples through.
Non-goal: removed the `try`/`except` that guarded deletion of the sample's
`acquire_time` attribute, since deleting it is racy when multiple
threads try to release the same lock (which Python allows). Instead,
we now explicitly reset the value to `None`, which is not subject to
races.
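A minimal sketch of the fixed behavior, under the same caveats: `FixedProfiledLock` is a hypothetical model rather than the ddtrace implementation, and `sampled` stands in for the result of `capture_sampler.capture()`.

```python
import time


class FixedProfiledLock:
    """Hypothetical model of the post-fix behavior; not the ddtrace code."""

    def __init__(self):
        self.acquire_time = None  # None means "no sampled acquire pending"
        self.release_samples = []

    def acquire(self, sampled):
        if sampled:
            self.acquire_time = time.monotonic_ns()

    def release(self):
        start = self.acquire_time
        # Plain assignment instead of `del`: it cannot raise AttributeError
        # if the value was already cleared by an earlier release.
        self.acquire_time = None
        if start is not None:
            self.release_samples.append(time.monotonic_ns() - start)


lock = FixedProfiledLock()
lock.acquire(sampled=False)
lock.release()
unsampled_count = len(lock.release_samples)  # no sample for an unsampled acquire

lock.acquire(sampled=True)
lock.release()
lock.release()  # repeated release records nothing and does not raise
```

The design point is that `None` is an unambiguous "not sampled" marker, so the release path needs neither attribute deletion nor an exception handler.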
### Testing
#### Unit test
* test commit - no fix yet - FAIL
```
$ git checkout ec010c3
$ scripts/ddtest riot run --pass-env 116bda6 -- -k test_release_not_sampled_when_acquire_not_sampled
...
# release samples should NOT be generated when acquire wasn't sampled
> assert len(release_samples) == 0, (
f"Expected no release samples when acquire wasn't sampled, got {len(release_samples)}"
)
E AssertionError: Expected no release samples when acquire wasn't sampled, got 1
E assert 1 == 0
```
* fix commit - PASS
```
$ git checkout ffeb5c6
$ scripts/ddtest riot run --pass-env 116bda6 -- -k test_release_not_sampled_when_acquire_not_sampled
...
collected 142 items / 140 deselected / 2 selected
tests/profiling/collector/test_threading.py::TestThreadingLockCollector::test_release_not_sampled_when_acquire_not_sampled[py3.13] PASSED
tests/profiling/collector/test_threading.py::TestThreadingRLockCollector::test_release_not_sampled_when_acquire_not_sampled[py3.13] PASSED
```
#### Manual Validation
Tested with [reproduction script
(`repro_lock_profiler.py`)](https://github.com/DataDog/dd-trace-py/compare/vlad/lockprof-fix-absurdly-inflated-lock-hold-times-test-script)
running 2000 lock operations with 1ms hold time at 1% sampling rate:
**Expected Profile**
The script acquires locks for 1ms each with a 1% sampling rate.
2000 lock ops × 1ms = 2 seconds total hold time.
At a 1% sampling rate, that yields ~20ms of sampled lock hold time.
**BEFORE (buggy):** **1.04 seconds/min**
- [Profile
link](https://app.datadoghq.com/profiling/explorer?query=host%3ACOMP-LR7JK0FKW1%20service%3Alockprof-repro-before&profile_type=lock-hold-time)
- Flamegraph dominated by 809ms in `threading.py:522` and other bogus
samples
- Expected ~20ms of legitimate samples, got **1040ms total** (52x
inflation)
<img width="1615" height="690" alt="high_hold_times_repro_before"
src="https://github.com/user-attachments/assets/0baf8041-2b7d-4324-ad21-8db178d15dda"
/>
**AFTER (fixed):** **23 milliseconds/min**
- [Profile
link](https://app.datadoghq.com/profiling/explorer?query=host%3ACOMP-LR7JK0FKW1%20service%3Alockprof-repro-after&profile_type=lock-hold-time)
- Clean flamegraph showing only actual lock hold at
`repro_lock_profiler.py:56`
- Expected ~20ms, got **23ms**
- **~45x reduction** in bogus lock time
<img width="1611" height="584" alt="high_hold_times_repro_after"
src="https://github.com/user-attachments/assets/68c7811a-a00a-4eec-84ab-0f4d875d80d5"
/>
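The figures in this section can be cross-checked with a few lines of arithmetic. The variable names are illustrative; the inputs are the values quoted above.

```python
lock_ops = 2000
hold_ms = 1
capture_pct = 1.0  # percent, the default sampling rate

total_hold_ms = lock_ops * hold_ms                       # total time locks were held
expected_sampled_ms = total_hold_ms * capture_pct / 100  # share visible at 1% sampling

before_ms = 1040  # reported before the fix (1.04 seconds/min)
after_ms = 23     # reported after the fix

inflation = before_ms / expected_sampled_ms  # how far "before" overshoots the expectation
reduction = before_ms / after_ms             # how much the fix shrinks the reported value
```

This gives an expected 20 ms of sampled hold time, a 52x inflation before the fix, and roughly a 45x reduction after it, matching the numbers in the bullets.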
Co-authored-by: Vlad Scherbich <vlad.scherbich@datadoghq.com>

1 parent 59632ea commit e87d26f
File tree: 4 files changed, +44 -16 lines changed
- ddtrace/profiling/collector
- releasenotes/notes
- tests
- profiling_v2/collector
- profiling/collector