
Fix API server memory leak: bound DBDagBag version cache with LRU eviction#64326

Open
dheerajturaga wants to merge 2 commits into apache:main from dheerajturaga:bugfix/api-memory-leak

dheerajturaga (Member) commented Mar 27, 2026

DBDagBag._dags is an unbounded in-memory cache, which causes steady memory
growth in the API server.

DBDagBag was designed for the scheduler, which works with a bounded set
of currently active DAG versions. As an API server singleton, it is exposed to
the full history of DAG versions in the database, with no bound on how
many it will cache.

This PR replaces the plain dict in DBDagBag._dags with a bounded,
OrderedDict-based LRU cache. Previously, in long-running API server processes,
every unique dag_version_id accessed was inserted and never evicted, causing
unbounded RSS growth (observed: 9.4 GiB after 7 days with ~70k DAG versions
in the DB).

The cache is now capped at 512 entries by default (configurable via
core.max_dag_version_cache_size). A cache hit promotes the entry to the
most-recently-used (MRU) position, so frequently accessed versions are
retained over stale historical ones.
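
For readers unfamiliar with the pattern, here is a minimal sketch of a bounded, OrderedDict-based LRU cache. It is not the code in this PR: the class name `BoundedLRUCache`, the `get` method, and the `loader` callback are hypothetical names chosen for illustration, and the real DBDagBag loads DAGs from the database rather than from a callback.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class BoundedLRUCache:
    """Illustrative LRU cache; names are hypothetical, not Airflow's API."""

    def __init__(self, max_size: int = 512) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._max_size = max_size
        self._data = OrderedDict()  # oldest entry first, newest entry last

    def get(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        if key in self._data:
            # Cache hit: promote the entry to the MRU (last) position.
            self._data.move_to_end(key)
            return self._data[key]
        # Cache miss: load the value, insert it as MRU, then enforce the bound.
        value = loader(key)
        self._data[key] = value
        if len(self._data) > self._max_size:
            # Evict the least-recently-used (first) entry.
            self._data.popitem(last=False)
        return value

    def __contains__(self, key: Hashable) -> bool:
        return key in self._data

    def __len__(self) -> int:
        return len(self._data)


# Usage: a cache of size 2, with a stand-in loader that records each load.
loads = []


def load(version_id):
    loads.append(version_id)
    return f"dag-for-{version_id}"


cache = BoundedLRUCache(max_size=2)
cache.get("v1", load)
cache.get("v2", load)
cache.get("v1", load)  # hit: v1 becomes MRU, no reload
cache.get("v3", load)  # miss: inserts v3 and evicts v2, the LRU entry
```

In this sketch the hit on `v1` protects it from eviction, so `v2` is the entry dropped when `v3` arrives.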

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)
    ClaudeCode

boring-cyborg bot added the area:API (Airflow's REST/HTTP API) and area:ConfigTemplates labels Mar 27, 2026
eladkal added this to the Airflow 3.2.0 milestone Mar 27, 2026
eladkal added the type:bug-fix (Changelog: Bug Fixes) label Mar 27, 2026
Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
shivaam (Contributor) commented Mar 28, 2026

Nice. Seems like a real production bug. A few thoughts:

  1. The default of 512 may be too low. The scheduler processes all active DAGs every cycle. With 1000+ DAGs, a cache of 512 means constant eviction and re-fetching from the DB on every loop. The API server's Execution API also serves worker requests for every task state transition, so it can accumulate entries quickly too. Consider starting higher (2048+) and letting people tune down: it's easier to reduce a known number than to discover you need to increase one you didn't know existed.
  2. A single config for both the scheduler and the API server may not be ideal. The scheduler's working set is bounded (latest version per active DAG) and performance-sensitive, so it needs a cache big enough to hold all active DAGs. There are also no metrics for the cache, which will make problems harder to debug.
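
The eviction concern in point 1 and the metrics concern in point 2 can be illustrated with a small simulation. This is a sketch under a simplifying assumption: the working set is scanned in the same fixed order on every pass, which is a worst case for LRU and may not match the scheduler's real access pattern. The function name `simulate_hit_rate` and the hit/miss counters are hypothetical and are not part of Airflow.

```python
from collections import OrderedDict


def simulate_hit_rate(cache_size: int, working_set: int, passes: int) -> float:
    """Scan keys 0..working_set-1 repeatedly through an LRU cache; return the hit rate."""
    cache = OrderedDict()
    hits = 0
    misses = 0
    for _ in range(passes):
        for key in range(working_set):
            if key in cache:
                cache.move_to_end(key)  # promote to MRU on a hit
                hits += 1
            else:
                misses += 1
                cache[key] = True  # stand-in for a DAG loaded from the DB
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict the LRU entry
    return hits / (hits + misses)


# 1000 keys scanned cyclically through a 512-entry cache: each key is
# evicted before it is requested again, so no request is ever a hit.
thrash_rate = simulate_hit_rate(cache_size=512, working_set=1000, passes=10)

# The same workload with a 2048-entry cache: only the first pass misses.
fits_rate = simulate_hit_rate(cache_size=2048, working_set=1000, passes=10)

print(f"size 512:  hit rate {thrash_rate:.2f}")
print(f"size 2048: hit rate {fits_rate:.2f}")
```

Under this assumption, a cache smaller than the working set gives no benefit at all. Hit and miss counters like these would make that situation visible to operators.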

