Store a hash of dag_version.version_data to avoid loading/comparing large manifests on parse #68567

@ephraimbuddy

Description

SerializedDagModel.write_dag's "serialized hash unchanged" fast path refreshes DagVersion.bundle_version/version_data in place, comparing the full stored version_data against the incoming value:

# airflow-core/src/airflow/models/serialized_dag.py
bundle_metadata_changed = (
    dag_version.bundle_version != bundle_version or dag_version.version_data != version_data
)

version_data is a free-form JSON column (e.g. an S3/custom-bundle manifest). When it is large, two things get expensive on every parse:

  1. _prefetch_dag_write_metadata loads the full DagVersion row — including the entire version_data JSON — for every DAG in the bulk write.
  2. The steady-state same-bundle case re-compares the full version_data dict on every parse. The comparison is skipped only when bundle_version already differs, because Python's or operator short-circuits.

Proposal: persist a version_data_hash (e.g. md5 of the canonical JSON) on dag_version and compare/prefetch that instead of the full blob. The prefetch then loads only the small hash, and the change check compares hashes.
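A minimal sketch of the idea, not the actual Airflow implementation. The helper names version_data_hash and bundle_metadata_changed, and the idea of passing a stored hash into the check, are hypothetical and only illustrate the proposal above. The canonical form here is assumed to be JSON with sorted keys and compact separators.

```python
from __future__ import annotations

import hashlib
import json
from typing import Any


def version_data_hash(version_data: Any) -> str | None:
    """Return a stable hex digest of version_data, or None when it is unset.

    Hypothetical helper: keys are sorted and whitespace is removed so that
    two equal JSON documents always serialize to the same bytes.
    """
    if version_data is None:
        return None
    canonical = json.dumps(version_data, sort_keys=True, separators=(",", ":"))
    # md5 is used as a cheap fingerprint here, not for security.
    return hashlib.md5(canonical.encode("utf-8"), usedforsecurity=False).hexdigest()


def bundle_metadata_changed(
    stored_bundle_version: str | None,
    stored_version_data_hash: str | None,
    bundle_version: str | None,
    version_data: Any,
) -> bool:
    """Hypothetical change check that compares hashes instead of full blobs."""
    return (
        stored_bundle_version != bundle_version
        or stored_version_data_hash != version_data_hash(version_data)
    )


# Usage: only the small stored hash is needed, never the stored manifest.
stored_hash = version_data_hash({"manifest": ["a.py", "b.py"]})
unchanged = not bundle_metadata_changed("v1", stored_hash, "v1", {"manifest": ["a.py", "b.py"]})
```

Hashing the incoming value still costs time proportional to its size, but the stored manifest no longer has to be loaded from the database or compared field by field. The hash is only as stable as the canonical form, so any value that JSON serialization would alter (for example non-string dict keys) would need to be normalized before hashing.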

Use case/motivation

Keep DB-side parsing cheap and memory-flat as version_data grows (large manifests from S3/custom bundles). Today the built-in bundles keep version_data small or empty (GitDagBundle doesn't set it), so this is a forward-looking optimization rather than a current hotspot — surfaced in review of #68336.

Related issues

Follow-up from review on #68336 (review comment by @uranusjr). The in-place refresh logic was introduced there.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct


Drafted-by: Claude Code (Opus 4.8); reviewed by @ephraimbuddy before posting
