Skip to content

Commit e87d26f

Browse files
fix(profiling): fix absurdly high lock hold times in Lock Profiler [backport 4.0] (#15462)
Backport 2cfb922 from #15408 to 4.0. ## Fix lock profiler generating release samples for non-sampled acquires ### Problem As reported by @dchayes-dd in our Slack channel: Lock Profiler displays unrealistic lock hold times (either too large or even negative). The negative values are legitimate int64 overflows, which in itself is a red herring. The root cause is described below. **Example 1** * [Flame graph](https://app.datadoghq.com/profiling/explorer?query=service%3Awatchdog_explorer%20datacenter%3Aus1.prod.dog&agg_m=%40prof_core_cpu_cores&agg_m_source=base&agg_q=service&agg_q_source=base&agg_t=sum&fromUser=true&my_code=enabled&profile_type=lock-hold-time&profiling-timeline__stack_order=top-down&profiling-timeline__summary_tab=breakdown&profiling-timeline__tb=1763580147405000000&profiling-timeline__tf_f_0=0&profiling-timeline__tf_f_1=121233000000&profiling-timeline__tf_us_0=0&profiling-timeline__tf_us_1=121233000000&refresh_mode=paused&selected_tf=1763579722252.7107%2C1763580268638&top_n=100&top_o=top&viz=flame_graph&x_missing=true&from_ts=1763568998228&to_ts=1763583398228&live=false) with negative lock hold time: `-970 days / min` <img width="1546" height="322" alt="Screenshot 2025-11-24 at 10 08 09 PM" src="https://github.com/user-attachments/assets/e2f2e38b-3282-493f-bd17-8f7545776d51" /> **Example 2** * [Flame graph](https://app.datadoghq.com/profiling/explorer?query=service%3Awatchdog_explorer%20datacenter%3Aus1.prod.dog%20runtime-id%3Adc67db679fe041f1b1d04dbc8a903e22&agg_m=%40prof_python_cpu_cores&agg_m_source=base&agg_q=service&agg_q_source=base&agg_t=sum&extra_search_fields=%7B%22filters_query%22%3A%22%22%2C%22sample_type%22%3A%22cpu-time%22%7D&fromUser=true&my_code=enabled&profile_type=lock-hold-time&profiling-timeline__stack_order=top-down&profiling-timeline__summary_tab=breakdown&profiling-timeline__tb=1763580092026000000&profiling-timeline__tf_f_0=0&profiling-timeline__tf_f_1=121249000000&profiling-timeline__tf_us_0=0&profiling-timeline__tf_us_1=121249000000&refresh_mode=paused&selected_tf=1763580092026%2C1763580213275&top_n=100&top_o=top&viz=flame_graph&x_missing=true&from_ts=1763579722252&to_ts=1763580268638&live=false) with absurdly high lock hold time: `19,000 days / min` <img width="1532" height="313" alt="Screenshot 2025-11-24 at 10 16 05 PM" src="https://github.com/user-attachments/assets/c7e056ed-faf5-4707-9ddd-002eb3b2234a" /> ### Impact _**This affected virtually ALL customers, unless they set the sampling rate very high (close to 100%).**_ With default `capture_pct = 1.0` (1%): - **99% of acquires** → ARE NOT sampled (`capture()` returns `False`) - **99% of releases** → ARE sampled with `duration = system_uptime` (hours, days, months?) ==> Only **1% of logged releases samples** are legitimate In general, the lower the sampling rate, the worse the problem. ### Root Cause When `capture_sampler.capture()` returned `False` (due to sampling rate), the acquire event was correctly skipped, but the release event was still being sampled; and that sample's lock hold time was equal to `system_uptime` (hours, days, months?) ### Fix Initialize `acquire_time` to None, instead of 0, because `0 is not None` and is a valid timestamp (a buggy check in `_release` that was letting through fake samples.) Non-goal: removed the try / catch guarding deletion of sample's `acquire_time` attribute, since there is a race condition if multiple threads try to release one lock (which is allowed in Python.) Instead, we now explicitly reset the value to None, which is not subject to races. ### Testing #### Unit test * test commit - no fix yet - FAIL ``` $ git checkout ec010c3 $ scripts/ddtest riot run --pass-env 116bda6 -- -k test_release_not_sampled_when_acquire_not_sampled ... # release samples should NOT be generated when acquire wasn't sampled > assert len(release_samples) == 0, ( f"Expected no release samples when acquire wasn't sampled, got {len(release_samples)}" ) E AssertionError: Expected no release samples when acquire wasn't sampled, got 1 E assert 1 == 0 ``` * fix commit - PASS ``` $ git checkout ffeb5c6 $ scripts/ddtest riot run --pass-env 116bda6 -- -k test_release_not_sampled_when_acquire_not_sampled ... collected 142 items / 140 deselected / 2 selected tests/profiling/collector/test_threading.py::TestThreadingLockCollector::test_release_not_sampled_when_acquire_not_sampled[py3.13] PASSED tests/profiling/collector/test_threading.py::TestThreadingRLockCollector::test_release_not_sampled_when_acquire_not_sampled[py3.13] PASSED ``` #### Manual Validation Tested with [reproduction script (`repro_lock_profiler.py`)](https://github.com/DataDog/dd-trace-py/compare/vlad/lockprof-fix-absurdly-inflated-lock-hold-times-test-script) running 2000 lock operations with 1ms hold time at 1% sampling rate: **Expected Profile** The script acquires locks for 1ms each with a 1% sampling rate. 2000 lock ops × 1ms = 2 seconds total hold time. With 1% sampling rate = ~20ms of lock hold samples. **BEFORE (buggy):** **1.04 seconds/min** - [Profile link](https://app.datadoghq.com/profiling/explorer?query=host%3ACOMP-LR7JK0FKW1%20service%3Alockprof-repro-before&profile_type=lock-hold-time) - Flamegraph dominated by 809ms in `threading.py:522` and other bogus samples - Expected ~20ms of legitimate samples, got **1040ms total** (52x inflation) <img width="1615" height="690" alt="high_hold_times_repro_before" src="https://github.com/user-attachments/assets/0baf8041-2b7d-4324-ad21-8db178d15dda" /> **AFTER (fixed):** **23 milliseconds/min** - [Profile link](https://app.datadoghq.com/profiling/explorer?query=host%3ACOMP-LR7JK0FKW1%20service%3Alockprof-repro-after&profile_type=lock-hold-time) - Clean flamegraph showing only actual lock hold at `repro_lock_profiler.py:56` - Expected ~20ms, got **23ms** - **~45x reduction** in bogus lock time <img width="1611" height="584" alt="high_hold_times_repro_after" src="https://github.com/user-attachments/assets/68c7811a-a00a-4eec-84ab-0f4d875d80d5" /> Co-authored-by: Vlad Scherbich <vlad.scherbich@datadoghq.com>
1 parent 59632ea commit e87d26f

File tree

4 files changed

+44
-16
lines changed

4 files changed

+44
-16
lines changed

ddtrace/profiling/collector/_lock.py

Lines changed: 10 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -69,7 +69,7 @@ def __init__(
6969
frame: FrameType = sys._getframe(3)
7070
code: CodeType = frame.f_code
7171
self.init_location: str = f"{os.path.basename(code.co_filename)}:{frame.f_lineno}"
72-
self.acquired_time: int = 0
72+
self.acquired_time: Optional[int] = None
7373
self.name: Optional[str] = None
7474

7575
### DUNDER methods ###
@@ -106,6 +106,13 @@ def __aenter__(self, *args: Any, **kwargs: Any) -> Any:
106106

107107
def _acquire(self, inner_func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
108108
if not self.capture_sampler.capture():
109+
if config.enable_asserts:
110+
# Ensure acquired_time is not set when acquire is not sampled
111+
# (else a bogus release sample is produced)
112+
assert (
113+
self.acquired_time is None
114+
), f"Expected acquired_time to be None when acquire is not sampled, got {self.acquired_time!r}" # nosec
115+
109116
return inner_func(*args, **kwargs)
110117

111118
start: int = time.monotonic_ns()
@@ -136,21 +143,12 @@ def __aexit__(self, *args: Any, **kwargs: Any) -> Any:
136143

137144
def _release(self, inner_func: Callable[..., Any], *args: Any, **kwargs: Any) -> None:
138145
start: Optional[int] = getattr(self, "acquired_time", None)
139-
try:
140-
# Though it should generally be avoided to call release() from
141-
# multiple threads, it is possible to do so. In that scenario, the
142-
# following statement code will raise an AttributeError. This should
143-
# not be propagated to the caller and to the users. The inner_func
144-
# will raise an RuntimeError as the threads are trying to release()
145-
# and unlocked lock, and the expected behavior is to propagate that.
146-
del self.acquired_time
147-
except AttributeError:
148-
pass
146+
self.acquired_time = None
149147

150148
try:
151149
return inner_func(*args, **kwargs)
152150
finally:
153-
if start is not None:
151+
if start:
154152
self._flush_sample(start, end=time.monotonic_ns(), is_acquire=False)
155153

156154
def _flush_sample(self, start: int, end: int, is_acquire: bool) -> None:
Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
---
2+
fixes:
3+
- |
4+
profiling: This fix resolves a critical issue where the Lock Profiler generated
5+
release samples for non-sampled lock acquires, resulting in inflated or negative (when integer overflows)
6+
lock hold times (e.g., "3.24k days per minute", "-970 days per minute").
7+
This affected virtually all customers using sampling rates < 100% (which should be the majority).

tests/profiling/collector/pprof_utils.py

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -135,7 +135,7 @@ def __init__(self, *args, **kwargs):
135135
super().__init__(event_type=LockEventType.RELEASE, *args, **kwargs)
136136

137137

138-
def parse_newest_profile(filename_prefix: str) -> pprof_pb2.Profile:
138+
def parse_newest_profile(filename_prefix: str, assert_samples: bool = True) -> pprof_pb2.Profile:
139139
"""Parse the newest profile that has given filename prefix. The profiler
140140
outputs profile file with following naming convention:
141141
<filename_prefix>.<pid>.<counter>.pprof, and in tests, we'd want to parse
@@ -150,7 +150,10 @@ def parse_newest_profile(filename_prefix: str) -> pprof_pb2.Profile:
150150
serialized_data = dctx.stream_reader(fp).read()
151151
profile = pprof_pb2.Profile()
152152
profile.ParseFromString(serialized_data)
153-
assert len(profile.sample) > 0, "No samples found in profile"
153+
154+
if assert_samples:
155+
assert len(profile.sample) > 0, "No samples found in profile"
156+
154157
return profile
155158

156159

tests/profiling_v2/collector/test_threading.py

Lines changed: 22 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,7 @@
55
import os
66
import sys
77
import threading
8+
import time
89
from typing import Callable
910
from typing import List
1011
from typing import Optional
@@ -1139,8 +1140,6 @@ def test_lock_slots_enforced(self) -> None:
11391140

11401141
def test_lock_profiling_overhead_reasonable(self) -> None:
11411142
"""Test that profiling overhead with 0% capture is bounded."""
1142-
import time
1143-
11441143
# Measure without profiling (collector stopped)
11451144
regular_lock: LockClassInst = self.lock_class()
11461145
start: float = time.perf_counter()
@@ -1168,6 +1167,27 @@ def test_lock_profiling_overhead_reasonable(self) -> None:
11681167
overhead_multiplier < 50
11691168
), f"Overhead too high: {overhead_multiplier}x (regular: {regular_time:.6f}s, profiled: {profiled_time_zero:.6f}s)" # noqa: E501
11701169

1170+
def test_release_not_sampled_when_acquire_not_sampled(self) -> None:
1171+
"""Test that lock release events are NOT sampled if their corresponding acquire was not sampled."""
1172+
# Use capture_pct=0 to ensure acquire is NEVER sampled
1173+
with self.collector_class(capture_pct=0):
1174+
lock: LockClassInst = self.lock_class()
1175+
# Do multiple acquire/release cycles
1176+
for _ in range(10):
1177+
lock.acquire()
1178+
time.sleep(0.001)
1179+
lock.release()
1180+
1181+
ddup.upload()
1182+
1183+
profile: pprof_pb2.Profile = pprof_utils.parse_newest_profile(self.output_filename, assert_samples=False)
1184+
release_samples: List[pprof_pb2.Sample] = pprof_utils.get_samples_with_value_type(profile, "lock-release")
1185+
1186+
# release samples should NOT be generated when acquire wasn't sampled
1187+
assert (
1188+
len(release_samples) == 0
1189+
), f"Expected no release samples when acquire wasn't sampled, got {len(release_samples)}"
1190+
11711191

11721192
class TestThreadingLockCollector(BaseThreadingLockCollectorTest):
11731193
"""Test Lock profiling"""

0 commit comments

Comments
 (0)