
[DBMON-6589] Clean up dangling references on cancel to improve GC #23640

Open
eric-weaver wants to merge 13 commits into master from eric.weaver/DBMON-6589


@eric-weaver eric-weaver commented May 8, 2026

What does this PR do?

Ensures a Postgres check instance is fully reclaimable by CPython's reference-counting GC immediately after cancel() completes, without relying on the non-deterministic cyclic garbage collector.

Async job cleanup

  • Add _shutdown() methods to all async jobs — Each DBMAsyncJob subclass (PostgresStatementMetrics, PostgresStatementSamples, PostgresMetadata, PostgresDataObservability) implements _shutdown() to null its _check back-reference and release caches (TTLCache, RateLimitingTTLCache, ExplainParameterizedQueries, SchemaCollector).
  • Call _shutdown() explicitly from cancel() — A new _cancel_async_job() helper cancels the job, joins the future, nulls the future, and invokes _shutdown() — ensuring cleanup happens regardless of whether the job loop was ever started.
  • Remove redundant db_pool references. PostgresStatementSamples and PostgresMetadata stored their own copy of db_pool. Removed in favor of accessing through self._check.db_pool.
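
A minimal sketch of the cancel, join, null, shutdown sequence described above. SketchJob and cancel_async_job are hypothetical stand-ins, not the classes or helper in this PR; the attribute names _check and _job_loop_future follow the description, and _cache stands in for the real caches.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SketchJob:
    """Hypothetical stand-in for a DBMAsyncJob subclass."""

    def __init__(self, check):
        self._check = check            # back-reference that forms the cycle
        self._cache = {"q": "plan"}    # stands in for TTLCache and similar caches
        self._cancel_event = threading.Event()
        self._job_loop_future = None

    def run_job_loop(self, executor):
        # The loop body here only waits for cancellation.
        self._job_loop_future = executor.submit(self._cancel_event.wait)

    def cancel(self):
        self._cancel_event.set()

    def _shutdown(self):
        # Drop the back-reference and release caches.
        self._check = None
        self._cache = None


def cancel_async_job(job):
    """Cancel the job, join its future, null the future, then shut down."""
    job.cancel()
    future = job._job_loop_future
    if future is not None:
        future.result()                # join the job loop
        job._job_loop_future = None
    # Called unconditionally, so cleanup also runs if the loop never started.
    job._shutdown()


with ThreadPoolExecutor(max_workers=1) as executor:
    started = SketchJob(check=object())
    started.run_job_loop(executor)
    cancel_async_job(started)

never_started = SketchJob(check=object())
cancel_async_job(never_started)
```

Calling _shutdown() outside the job loop is what covers the never-started case; a callback that only fires when the loop exits would be skipped there.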

Check-level cleanup

  • Close the standalone main DB connection on cancel. self._db was never closed during cancel(), leaking an open socket. Extracts a _close_db() helper to deduplicate the close-and-null pattern shared with the db() context manager.
  • Null _query_manager and health — Drops the QueryManager and PostgresHealth objects that hold back-references to the check.
  • Clear check_initializations — The deque holds bound methods and a lambda that capture self, forming reference cycles.
  • Drop _diagnosis — The Diagnosis object holds a functools.partial and _sanitize lambda that pin the check.
  • Null CheckLoggingAdapter.check — The logging adapter holds a back-reference to the check that is only self-cleared after the first log call with a resolved check_id; cancel() before that point leaves the cycle intact.
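
The close-and-null helper and the deque cleanup might look roughly like the sketch below. FakeConnection and SketchCheck are hypothetical, and closing the connection when db() sees an exception is an assumption about the context manager, not something confirmed by this description.

```python
from collections import deque
from contextlib import contextmanager


class FakeConnection:
    """Hypothetical connection; only records whether close() ran."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class SketchCheck:
    def __init__(self):
        self._db = None
        # A bound method and a lambda both capture self, so the deque
        # creates self -> deque -> callable -> self cycles.
        self.check_initializations = deque(
            [self.load_version, lambda: self.load_version()]
        )

    def load_version(self):
        return "sketch"

    def _connect(self):
        if self._db is None:
            self._db = FakeConnection()
        return self._db

    def _close_db(self):
        # Single close-and-null helper shared by db() and cancel().
        if self._db is not None:
            try:
                self._db.close()
            finally:
                self._db = None

    @contextmanager
    def db(self):
        try:
            yield self._connect()
        except Exception:
            # Assumed behavior: discard the connection on error.
            self._close_db()
            raise

    def cancel(self):
        self._close_db()
        self.check_initializations.clear()


check = SketchCheck()
with check.db() as conn:
    pass
check.cancel()
```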

Testing

  • Add test_check_gc_after_cancel — A strict regression test that disables CPython's cyclic GC and verifies the check is reclaimed by refcount alone after cancel(). On failure, the assertion message lists the exact types still pinning the check for easy diagnosis.
  • Existing tests updated for the new post-cancel state (_job_loop_future is None, db_pool accessed via _check).
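
A sketch of how a refcount-only regression check can be written with the standard gc and weakref modules. It is not the test added in this PR: Check, LeakyCheck, Job and the helper names are hypothetical, and the referrer listing can include incidental holders besides the ones that matter.

```python
import gc
import weakref


class Job:
    def __init__(self, check):
        self._check = check

    def _shutdown(self):
        self._check = None


class Check:
    def __init__(self):
        self.job = Job(self)       # check -> job -> check cycle

    def cancel(self):
        self.job._shutdown()       # breaks the cycle


class LeakyCheck(Check):
    def cancel(self):
        pass                       # leaves the cycle intact


def pinning_types(obj):
    """Type names of gc-tracked objects that still refer to obj."""
    return sorted({type(r).__name__ for r in gc.get_referrers(obj)})


def assert_reclaimed_by_refcount(make_check):
    was_enabled = gc.isenabled()
    gc.disable()                   # only reference counting may free objects
    try:
        check = make_check()
        check.cancel()
        ref = weakref.ref(check)
        del check
        survivor = ref()
        if survivor is not None:
            raise AssertionError(
                f"check still pinned by: {pinning_types(survivor)}"
            )
    finally:
        if was_enabled:
            gc.enable()


assert_reclaimed_by_refcount(Check)
```

Passing LeakyCheck instead raises an AssertionError whose message names the types that still hold the instance.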

Motivation

When the agent rescheduled a Postgres check (config reload, autodiscovery churn), the check instance and all its async job objects were pinned in memory by bidirectional reference cycles until CPython's cyclic GC ran — which is non-deterministic and can be delayed significantly. In environments with frequent check rescheduling (e.g., dozens of autodiscovered databases), this caused unnecessary memory retention of connections, caches, and thread state. These changes ensure the entire object graph is reclaimable by reference counting immediately after cancel() completes.
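
To illustrate the retention described above, the sketch below builds a two-object cycle and shows that dropping the last external reference does not free it until a cyclic collection runs. Check and Job are hypothetical stand-ins, and gc.collect() stands in for whenever the collector would next run on its own.

```python
import gc
import weakref


class Job:
    def __init__(self, check):
        self._check = check        # back-reference to the owner


class Check:
    def __init__(self):
        self.job = Job(self)       # forward reference closes the cycle


was_enabled = gc.isenabled()
gc.disable()                       # make the collector's timing explicit
try:
    check = Check()
    ref = weakref.ref(check)
    del check
    # The cycle keeps both refcounts above zero.
    alive_after_del = ref() is not None
    gc.collect()
    alive_after_collect = ref() is not None
finally:
    if was_enabled:
        gc.enable()
```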

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

_dynamic_queries holds partial(db_pool.get_connection, ...) closures
and _query_manager.executor holds a partial(self.execute_query_raw, db=self.db)
closure. Both root the check instance and pool manager in memory after cancel.
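
A small sketch of why such partials pin their targets: functools.partial stores the bound method in .func, and a bound method holds its instance in __self__. Pool and Check here are hypothetical stand-ins, executor is stored directly on the check for brevity, and the arguments are illustrative only.

```python
import functools
import weakref


class Pool:
    """Hypothetical stand-in for the connection pool manager."""

    def get_connection(self, dbname):
        return f"conn:{dbname}"


class Check:
    """Hypothetical stand-in for the check."""

    def __init__(self):
        self.db_pool = Pool()
        # A partial over a bound method keeps the method's __self__ alive.
        self._dynamic_queries = [
            functools.partial(self.db_pool.get_connection, "postgres")
        ]
        self.executor = functools.partial(self.execute_query_raw, db="main")

    def execute_query_raw(self, query, db):
        return (db, query)

    def cancel(self):
        # Dropping the partials removes the check -> partial -> check cycle.
        self._dynamic_queries = None
        self.executor = None


check = Check()
pins_check = check.executor.func.__self__ is check
pins_pool = check._dynamic_queries[0].func.__self__ is check.db_pool

ref = weakref.ref(check)
check.cancel()
del check
freed_by_refcount = ref() is None
```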
…next commit message suggestion

Null _job_loop_future after join in cancel() to release completed Futures

Completed Future objects are lightweight but keep the async job reachable in the reference graph. Nulling them immediately after .result() returns removes an unnecessary link that delays GC of the check instance.

Each _shutdown() nulls _check and clears the caches its job holds:

  • StatementMetrics: _full_statement_text_cache, _state, _query_calls_cache, _baseline_metrics
  • StatementSamples: _collection_strategy_cache, _explain_errors_cache, _explained_statements_ratelimiter, _seen_samples_ratelimiter, _raw_statement_text_cache
  • Metadata: _schema_collector (holds a back-reference to the check) and _compiled_patterns_cache
  • DataObservability: only has a small _last_execution dict, which is not worth clearing. Its _shutdown() already nulls _check, which is sufficient.

commit message suggestion

Clear caches in async job shutdown callbacks to free memory eagerly

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3901ffe755

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread postgres/datadog_checks/postgres/statement_samples.py Outdated
Comment thread postgres/datadog_checks/postgres/statement_samples.py

datadog-prod-us1-3 Bot commented May 8, 2026

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 93.33%
Overall Coverage: 93.16% (+5.92%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: bef3c81

@eric-weaver
Contributor Author

@codex review


codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 93.33333% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.40%. Comparing base (e529a0b) to head (bef3c81).



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99c8cff551

Comment thread postgres/datadog_checks/postgres/postgres.py Outdated

dd-octo-sts Bot commented May 8, 2026

Validation Report

All 20 validations passed.

Validation          Description                                                      Status
agent-reqs          Verify check versions match the Agent requirements file         passed
ci                  Validate CI configuration and Codecov settings                   passed
codeowners          Validate every integration has a CODEOWNERS entry                passed
config              Validate default configuration files against spec.yaml           passed
dep                 Verify dependency pins are consistent and Agent-compatible       passed
http                Validate integrations use the HTTP wrapper correctly             passed
imports             Validate check imports do not use deprecated modules             passed
integration-style   Validate check code style conventions                            passed
jmx-metrics         Validate JMX metrics definition files and config                 passed
labeler             Validate PR labeler config matches integration directories       passed
legacy-signature    Validate no integration uses the legacy Agent check signature    passed
license-headers     Validate Python files have proper license headers                passed
licenses            Validate third-party license attribution list                    passed
metadata            Validate metadata.csv metric definitions                         passed
models              Validate configuration data models match spec.yaml               passed
openmetrics         Validate OpenMetrics integrations disable the metric limit       passed
package             Validate Python package metadata and naming                      passed
readmes             Validate README files have required sections                     passed
saved-views         Validate saved view JSON file structure and fields               passed
version             Validate version consistency between package and changelog       passed
