[DBMON-6589] Clean up dangling references on cancel to improve GC#23640
[DBMON-6589] Clean up dangling references on cancel to improve GC#23640eric-weaver wants to merge 13 commits intomasterfrom
Conversation
_dynamic_queries holds partial(db_pool.get_connection, ...) closures and _query_manager.executor holds a partial(self.execute_query_raw, db=self.db) closure. Both root the check instance and pool manager in memory after cancel.
…next commit message suggestion Null _job_loop_future after join in cancel() to release completed Futures Completed Future objects are lightweight but keep the async job reachable in the reference graph. Nulling them immediately after .result() returns removes an unnecessary link that delays GC of the check instance. ok next Let me check the current state of each _shutdown method and the caches each job holds. Now I'll update each _shutdown to clear caches. StatementMetrics -- _full_statement_text_cache, _state, _query_calls_cache, _baseline_metrics: self._check = None self._full_statement_text_cache = None self._state = None self._query_calls_cache = None self._baseline_metrics = None except Exception: StatementSamples -- _collection_strategy_cache, _explain_errors_cache, _explained_statements_ratelimiter, _seen_samples_ratelimiter, _raw_statement_text_cache: self._check = None self._collection_strategy_cache = None self._explain_errors_cache = None self._explained_statements_ratelimiter = None self._seen_samples_ratelimiter = None self._raw_statement_text_cache = None except Exception: Metadata -- _schema_collector (holds back-ref to check) and _compiled_patterns_cache: self._check = None self._schema_collector = None self._compiled_patterns_cache = None except Exception: DataObservability only has a small _last_execution dict -- not worth clearing. Its _shutdown already just nulls _check, which is sufficient. All pre-existing. Done. Ready for your commit. commit message suggestion Clear caches in async job shutdown callbacks to free memory eagerly
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3901ffe755
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
🎉 All green!❄️ No new flaky tests detected 🎯 Code Coverage (details) 🔗 Commit SHA: bef3c81 | Docs | Datadog PR Page | Give us feedback! |
|
@codex review |
Codecov Report❌ Patch coverage is Additional details and impacted files🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99c8cff551
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Validation ReportAll 20 validations passed. Show details
|
What does this PR do?
Ensures a Postgres check instance is fully reclaimable by CPython's reference-counting GC immediately after
cancel()completes, without relying on the non-deterministic cyclic garbage collector.Async job cleanup
_shutdown()methods to all async jobs — EachDBMAsyncJobsubclass (PostgresStatementMetrics,PostgresStatementSamples,PostgresMetadata,PostgresDataObservability) implements_shutdown()to null its_checkback-reference and release caches (TTLCache,RateLimitingTTLCache,ExplainParameterizedQueries,SchemaCollector)._shutdown()explicitly fromcancel()— A new_cancel_async_job()helper cancels the job, joins the future, nulls the future, and invokes_shutdown()— ensuring cleanup happens regardless of whether the job loop was ever started.db_poolreferences —PostgresStatementSamplesandPostgresMetadatastored their own copy ofdb_pool. Removed in favor of accessing throughself._check.db_pool.Check-level cleanup
self._dbwas never closed duringcancel(), leaking an open socket. Extracts a_close_db()helper to deduplicate the close-and-null pattern shared with thedb()context manager._query_managerandhealth— Drops theQueryManagerandPostgresHealthobjects that hold back-references to the check.check_initializations— The deque holds bound methods and a lambda that captureself, forming reference cycles._diagnosis— TheDiagnosisobject holds afunctools.partialand_sanitizelambda that pin the check.CheckLoggingAdapter.check— The logging adapter holds a back-reference to the check that is only self-cleared after the first log call with a resolvedcheck_id;cancel()before that point leaves the cycle intact.Testing
test_check_gc_after_cancel— A strict regression test that disables CPython's cyclic GC and verifies the check is reclaimed by refcount alone aftercancel(). On failure, the assertion message lists the exact types still pinning the check for easy diagnosis._job_loop_future is None,db_poolaccessed via_check).Motivation
When the agent rescheduled a Postgres check (config reload, autodiscovery churn), the check instance and all its async job objects were pinned in memory by bidirectional reference cycles until CPython's cyclic GC ran — which is non-deterministic and can be delayed significantly. In environments with frequent check rescheduling (e.g., dozens of autodiscovered databases), this caused unnecessary memory retention of connections, caches, and thread state. These changes ensure the entire object graph is reclaimable by reference counting immediately after
cancel()completes.Review checklist (to be filled by reviewers)
qa/skip-qalabel if the PR doesn't need to be tested during QA.backport/<branch-name>label to the PR and it will automatically open a backport PR once this one is merged