Description
SerializedDagModel.write_dag's "serialized hash unchanged" fast path refreshes DagVersion.bundle_version/version_data in place, comparing the full stored version_data against the incoming value:
# airflow-core/src/airflow/models/serialized_dag.py
bundle_metadata_changed = (
    dag_version.bundle_version != bundle_version or dag_version.version_data != version_data
)
version_data is a free-form JSON column (e.g. an S3/custom-bundle manifest). When it is large, two things get expensive on every parse:
- _prefetch_dag_write_metadata loads the full DagVersion row, including the entire version_data JSON, for every DAG in the bulk write.
- The steady-state same-bundle case re-compares the full version_data dict on each parse (the comparison is skipped only when bundle_version already differs, because Python's "or" operator short-circuits).
Proposal: persist a version_data_hash (e.g. md5 of the canonical JSON) on dag_version and compare/prefetch that instead of the full blob. The prefetch then loads only the small hash, and the change check compares hashes.
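A minimal sketch of what the hash-based check could look like, assuming md5 over sorted-key JSON as the canonical form. The helper names version_data_hash and bundle_metadata_changed, and the idea of passing the stored hash in as a plain string, are hypothetical and only illustrate the proposal; they are not existing Airflow APIs.

```python
from __future__ import annotations

import hashlib
import json
from typing import Any


def version_data_hash(version_data: dict[str, Any] | None) -> str | None:
    """Hypothetical helper: md5 hex digest of the canonical JSON form of version_data."""
    if version_data is None:
        return None
    # sort_keys plus fixed separators make the encoding independent of key order
    # and whitespace, so equal dicts always produce the same digest.
    canonical = json.dumps(version_data, sort_keys=True, separators=(",", ":"))
    # The digest is a change detector only, not a security boundary.
    return hashlib.md5(canonical.encode("utf-8"), usedforsecurity=False).hexdigest()


def bundle_metadata_changed(
    stored_bundle_version: str | None,
    stored_version_data_hash: str | None,
    bundle_version: str | None,
    version_data: dict[str, Any] | None,
) -> bool:
    """Hypothetical change check that compares hashes instead of the full blob."""
    return (
        stored_bundle_version != bundle_version
        or stored_version_data_hash != version_data_hash(version_data)
    )
```

With this shape, the prefetch would only need to select bundle_version and the short hash column for each DAG, and the incoming version_data is hashed once per write rather than compared against a stored blob.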
Use case/motivation
Keep DB-side parsing cheap and memory-flat as version_data grows (large manifests from S3/custom bundles). Today the built-in bundles keep version_data small or empty (GitDagBundle doesn't set it), so this is a forward-looking optimization rather than a current hotspot — surfaced in review of #68336.
Related issues
Follow-up from review on #68336 (review comment by @uranusjr). The in-place refresh logic was introduced there.
Are you willing to submit a PR?
Code of Conduct
Drafted-by: Claude Code (Opus 4.8); reviewed by @ephraimbuddy before posting