fix(profiling): fix absurdly high lock hold times in Lock Profiler [backport 4.0] (#15462)

dd-octo-sts[bot] · vlad-scherbich · web-flow · commit e87d26f8e25c · 2025-12-05T17:22:52.000Z
Backport 2cfb922 from #15408 to 4.0.

## Fix lock profiler generating release samples for non-sampled acquires

### Problem

As reported by @dchayes-dd in our Slack channel: Lock Profiler displays unrealistic lock hold times (either too large or even negative). The negative values are genuine int64 overflows and are a red herring. The root cause is described below.

**Example 1**

* [Flame graph](https://app.datadoghq.com/profiling/explorer?query=service%3Awatchdog_explorer%20datacenter%3Aus1.prod.dog&agg_m=%40prof_core_cpu_cores&agg_m_source=base&agg_q=service&agg_q_source=base&agg_t=sum&fromUser=true&my_code=enabled&profile_type=lock-hold-time&profiling-timeline__stack_order=top-down&profiling-timeline__summary_tab=breakdown&profiling-timeline__tb=1763580147405000000&profiling-timeline__tf_f_0=0&profiling-timeline__tf_f_1=121233000000&profiling-timeline__tf_us_0=0&profiling-timeline__tf_us_1=121233000000&refresh_mode=paused&selected_tf=1763579722252.7107%2C1763580268638&top_n=100&top_o=top&viz=flame_graph&x_missing=true&from_ts=1763568998228&to_ts=1763583398228&live=false) with negative lock hold time: `-970 days / min`

<img width="1546" height="322" alt="Screenshot 2025-11-24 at 10 08 09 PM" src="https://github.com/user-attachments/assets/e2f2e38b-3282-493f-bd17-8f7545776d51" />

**Example 2**

* [Flame graph](https://app.datadoghq.com/profiling/explorer?query=service%3Awatchdog_explorer%20datacenter%3Aus1.prod.dog%20runtime-id%3Adc67db679fe041f1b1d04dbc8a903e22&agg_m=%40prof_python_cpu_cores&agg_m_source=base&agg_q=service&agg_q_source=base&agg_t=sum&extra_search_fields=%7B%22filters_query%22%3A%22%22%2C%22sample_type%22%3A%22cpu-time%22%7D&fromUser=true&my_code=enabled&profile_type=lock-hold-time&profiling-timeline__stack_order=top-down&profiling-timeline__summary_tab=breakdown&profiling-timeline__tb=1763580092026000000&profiling-timeline__tf_f_0=0&profiling-timeline__tf_f_1=121249000000&profiling-timeline__tf_us_0=0&profiling-timeline__tf_us_1=121249000000&refresh_mode=paused&selected_tf=1763580092026%2C1763580213275&top_n=100&top_o=top&viz=flame_graph&x_missing=true&from_ts=1763579722252&to_ts=1763580268638&live=false) with absurdly high lock hold time: `19,000 days / min`

<img width="1532" height="313" alt="Screenshot 2025-11-24 at 10 16 05 PM" src="https://github.com/user-attachments/assets/c7e056ed-faf5-4707-9ddd-002eb3b2234a" />

### Impact

_**This affected virtually ALL customers, unless they set the sampling rate very high (close to 100%).**_

With default `capture_pct = 1.0` (1%):

- **99% of acquires** → ARE NOT sampled (`capture()` returns `False`)
- **99% of releases** → ARE sampled with `duration = system_uptime` (hours, days, months?)

==> Only **1% of logged release samples** are legitimate

In general, the lower the sampling rate, the worse the problem.

### Root Cause

When `capture_sampler.capture()` returned `False` (due to sampling rate), the acquire event was correctly skipped, but the release event was still being sampled, and that sample's lock hold time was equal to `system_uptime` (hours, days, months?).

### Fix

Initialize `acquired_time` to `None` instead of `0`. Because `0 is not None` evaluates to `True`, the `start is not None` check in `_release` treated the initial `0` as a valid timestamp and let fake samples through.
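
The sentinel mismatch described in the Fix can be reproduced in isolation. The sketch below is a toy model, not the ddtrace implementation: `ToyProfiledLock`, its `sampled` argument, and `release_durations_ns` are hypothetical names introduced only for illustration, and it models just the initial-value problem (it does not reproduce the real sampler or the old `del`-based cleanup).

```python
import time
from typing import List, Optional


class ToyProfiledLock:
    """Hypothetical stand-in for a profiled lock wrapper (not the ddtrace class)."""

    def __init__(self, initial_acquired_time: Optional[int]) -> None:
        # The buggy variant starts at 0, the fixed variant starts at None.
        self.acquired_time: Optional[int] = initial_acquired_time
        self.release_durations_ns: List[int] = []

    def acquire(self, sampled: bool) -> None:
        # Only a sampled acquire records a real timestamp.
        if sampled:
            self.acquired_time = time.monotonic_ns()

    def release(self) -> None:
        start = self.acquired_time
        self.acquired_time = None  # reset for the next acquire/release cycle
        # Pre-fix style check: an initial 0 passes it, because 0 is not None.
        if start is not None:
            self.release_durations_ns.append(time.monotonic_ns() - start)


# Acquire is NOT sampled in either case.
buggy = ToyProfiledLock(initial_acquired_time=0)
buggy.acquire(sampled=False)
buggy.release()

fixed = ToyProfiledLock(initial_acquired_time=None)
fixed.acquire(sampled=False)
fixed.release()

print(len(buggy.release_durations_ns), len(fixed.release_durations_ns))
```

In the buggy variant the recorded "duration" is the raw monotonic clock reading minus zero, which is what the description above refers to as `system_uptime`; the fixed variant records nothing for an unsampled acquire.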
Non-goal: removed the `try` / `except` guarding deletion of the sample's `acquired_time` attribute, since there is a race condition if multiple threads try to release one lock (which is allowed in Python). Instead, we now explicitly reset the value to `None`, which is not subject to races.

### Testing

#### Unit test

* test commit - no fix yet - FAIL

```
$ git checkout ec010c3
$ scripts/ddtest riot run --pass-env 116bda6 -- -k test_release_not_sampled_when_acquire_not_sampled
...
        # release samples should NOT be generated when acquire wasn't sampled
>       assert len(release_samples) == 0, (
            f"Expected no release samples when acquire wasn't sampled, got {len(release_samples)}"
        )
E       AssertionError: Expected no release samples when acquire wasn't sampled, got 1
E       assert 1 == 0
```

* fix commit - PASS

```
$ git checkout ffeb5c6
$ scripts/ddtest riot run --pass-env 116bda6 -- -k test_release_not_sampled_when_acquire_not_sampled
...
collected 142 items / 140 deselected / 2 selected

tests/profiling/collector/test_threading.py::TestThreadingLockCollector::test_release_not_sampled_when_acquire_not_sampled[py3.13] PASSED
tests/profiling/collector/test_threading.py::TestThreadingRLockCollector::test_release_not_sampled_when_acquire_not_sampled[py3.13] PASSED
```

#### Manual Validation

Tested with [reproduction script (`repro_lock_profiler.py`)](https://github.com/DataDog/dd-trace-py/compare/vlad/lockprof-fix-absurdly-inflated-lock-hold-times-test-script) running 2000 lock operations with 1ms hold time at 1% sampling rate:

**Expected Profile**

The script acquires locks for 1ms each with a 1% sampling rate. 2000 lock ops × 1ms = 2 seconds total hold time. At a 1% sampling rate, that yields ~20ms of lock hold samples.
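
The expected-profile arithmetic can be written out as a short sketch. The constant names and the `inflation` helper are hypothetical and exist only for this illustration; the inputs are the figures stated above (2000 operations, 1ms hold, 1% sampling).

```python
LOCK_OPS = 2000        # lock operations performed by the repro script
HOLD_TIME_US = 1_000   # 1 ms hold per operation, in microseconds
CAPTURE_PCT = 1        # sampling rate, in percent

# Total time the locks are actually held: 2000 x 1 ms = 2 s.
total_hold_us = LOCK_OPS * HOLD_TIME_US

# Only ~1% of acquire/release pairs should be sampled: ~20 ms.
expected_sampled_us = total_hold_us * CAPTURE_PCT // 100


def inflation(observed_ms: float) -> float:
    """Ratio of an observed lock hold total (ms) to the expected sampled total."""
    return observed_ms * 1000 / expected_sampled_us


print(total_hold_us, expected_sampled_us)
```

Plugging in the totals reported in the results that follow, `inflation(1040)` gives the 52x figure for the buggy run, and `inflation(23)` comes out close to 1 for the fixed run.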
**BEFORE (buggy):** **1.04 seconds/min**

- [Profile link](https://app.datadoghq.com/profiling/explorer?query=host%3ACOMP-LR7JK0FKW1%20service%3Alockprof-repro-before&profile_type=lock-hold-time)
- Flamegraph dominated by 809ms in `threading.py:522` and other bogus samples
- Expected ~20ms of legitimate samples, got **1040ms total** (52x inflation)

<img width="1615" height="690" alt="high_hold_times_repro_before" src="https://github.com/user-attachments/assets/0baf8041-2b7d-4324-ad21-8db178d15dda" />

**AFTER (fixed):** **23 milliseconds/min**

- [Profile link](https://app.datadoghq.com/profiling/explorer?query=host%3ACOMP-LR7JK0FKW1%20service%3Alockprof-repro-after&profile_type=lock-hold-time)
- Clean flamegraph showing only actual lock hold at `repro_lock_profiler.py:56`
- Expected ~20ms, got **23ms**
- **~45x reduction** in bogus lock time

<img width="1611" height="584" alt="high_hold_times_repro_after" src="https://github.com/user-attachments/assets/68c7811a-a00a-4eec-84ab-0f4d875d80d5" />

Co-authored-by: Vlad Scherbich <vlad.scherbich@datadoghq.com>
diff --git a/ddtrace/profiling/collector/_lock.py b/ddtrace/profiling/collector/_lock.py
@@ -69,7 +69,7 @@ def __init__(
         frame: FrameType = sys._getframe(3)
         code: CodeType = frame.f_code
         self.init_location: str = f"{os.path.basename(code.co_filename)}:{frame.f_lineno}"
-        self.acquired_time: int = 0
+        self.acquired_time: Optional[int] = None
         self.name: Optional[str] = None
 
     ### DUNDER methods ###
@@ -106,6 +106,13 @@ def __aenter__(self, *args: Any, **kwargs: Any) -> Any:
 
     def _acquire(self, inner_func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
         if not self.capture_sampler.capture():
+            if config.enable_asserts:
+                # Ensure acquired_time is not set when acquire is not sampled
+                # (else a bogus release sample is produced)
+                assert (
+                    self.acquired_time is None
+                ), f"Expected acquired_time to be None when acquire is not sampled, got {self.acquired_time!r}"  # nosec
+
             return inner_func(*args, **kwargs)
 
         start: int = time.monotonic_ns()
@@ -136,21 +143,12 @@ def __aexit__(self, *args: Any, **kwargs: Any) -> Any:
 
     def _release(self, inner_func: Callable[..., Any], *args: Any, **kwargs: Any) -> None:
         start: Optional[int] = getattr(self, "acquired_time", None)
-        try:
-            # Though it should generally be avoided to call release() from
-            # multiple threads, it is possible to do so. In that scenario, the
-            # following statement code will raise an AttributeError. This should
-            # not be propagated to the caller and to the users. The inner_func
-            # will raise an RuntimeError as the threads are trying to release()
-            # and unlocked lock, and the expected behavior is to propagate that.
-            del self.acquired_time
-        except AttributeError:
-            pass
+        self.acquired_time = None
 
         try:
             return inner_func(*args, **kwargs)
         finally:
-            if start is not None:
+            if start:
                 self._flush_sample(start, end=time.monotonic_ns(), is_acquire=False)
 
     def _flush_sample(self, start: int, end: int, is_acquire: bool) -> None:
diff --git a/releasenotes/notes/lock-profiler-fix-inflated-lock-hold-times-c0da83d00a6d704e.yaml b/releasenotes/notes/lock-profiler-fix-inflated-lock-hold-times-c0da83d00a6d704e.yaml
@@ -0,0 +1,7 @@
+---
+fixes:
+  - |
+    profiling: This fix resolves a critical issue where the Lock Profiler generated
+    release samples for non-sampled lock acquires, resulting in inflated or negative (on integer overflow)
+    lock hold times (e.g., "3.24k days per minute", "-970 days per minute").
+    This affected virtually all customers using sampling rates < 100% (which should be the majority).
diff --git a/tests/profiling/collector/pprof_utils.py b/tests/profiling/collector/pprof_utils.py
@@ -135,7 +135,7 @@ def __init__(self, *args, **kwargs):
         super().__init__(event_type=LockEventType.RELEASE, *args, **kwargs)
 
 
-def parse_newest_profile(filename_prefix: str) -> pprof_pb2.Profile:
+def parse_newest_profile(filename_prefix: str, assert_samples: bool = True) -> pprof_pb2.Profile:
     """Parse the newest profile that has given filename prefix. The profiler
     outputs profile file with following naming convention:
     <filename_prefix>.<pid>.<counter>.pprof, and in tests, we'd want to parse
@@ -150,7 +150,10 @@ def parse_newest_profile(filename_prefix: str) -> pprof_pb2.Profile:
         serialized_data = dctx.stream_reader(fp).read()
     profile = pprof_pb2.Profile()
     profile.ParseFromString(serialized_data)
-    assert len(profile.sample) > 0, "No samples found in profile"
+
+    if assert_samples:
+        assert len(profile.sample) > 0, "No samples found in profile"
+
     return profile
 
 
diff --git a/tests/profiling_v2/collector/test_threading.py b/tests/profiling_v2/collector/test_threading.py
@@ -5,6 +5,7 @@
 import os
 import sys
 import threading
+import time
 from typing import Callable
 from typing import List
 from typing import Optional
@@ -1139,8 +1140,6 @@ def test_lock_slots_enforced(self) -> None:
 
     def test_lock_profiling_overhead_reasonable(self) -> None:
         """Test that profiling overhead with 0% capture is bounded."""
-        import time
-
         # Measure without profiling (collector stopped)
         regular_lock: LockClassInst = self.lock_class()
         start: float = time.perf_counter()
@@ -1168,6 +1167,27 @@ def test_lock_profiling_overhead_reasonable(self) -> None:
             overhead_multiplier < 50
         ), f"Overhead too high: {overhead_multiplier}x (regular: {regular_time:.6f}s, profiled: {profiled_time_zero:.6f}s)"  # noqa: E501
 
+    def test_release_not_sampled_when_acquire_not_sampled(self) -> None:
+        """Test that lock release events are NOT sampled if their corresponding acquire was not sampled."""
+        # Use capture_pct=0 to ensure acquire is NEVER sampled
+        with self.collector_class(capture_pct=0):
+            lock: LockClassInst = self.lock_class()
+            # Do multiple acquire/release cycles
+            for _ in range(10):
+                lock.acquire()
+                time.sleep(0.001)
+                lock.release()
+
+        ddup.upload()
+
+        profile: pprof_pb2.Profile = pprof_utils.parse_newest_profile(self.output_filename, assert_samples=False)
+        release_samples: List[pprof_pb2.Sample] = pprof_utils.get_samples_with_value_type(profile, "lock-release")
+
+        # release samples should NOT be generated when acquire wasn't sampled
+        assert (
+            len(release_samples) == 0
+        ), f"Expected no release samples when acquire wasn't sampled, got {len(release_samples)}"
+
 
 class TestThreadingLockCollector(BaseThreadingLockCollectorTest):
     """Test Lock profiling"""