Fix DAG-level on_failure_callback not firing #63692

Open
Sathvik-Chowdary-Veerapaneni wants to merge 7 commits into apache:main from
Sathvik-Chowdary-Veerapaneni:dag-failure-callback-not-firing-63374

@Sathvik-Chowdary-Veerapaneni Sathvik-Chowdary-Veerapaneni commented Mar 16, 2026

Fixed DAG-level on_failure_callback not firing with KubernetesExecutor.

The DagRunContext validator caught only DetachedInstanceError when accessing ORM relationships. Other SQLAlchemy exceptions (such as InvalidRequestError) made produce_dag_callback fail silently: no callback was produced and no log message was emitted.

  • Broadened exception handling in the DagRunContext validator to catch all exceptions
  • Made produce_dag_callback robust: if the DagRunContext validator fails, it now sends the callback with minimal context instead of silently sending nothing
  • Added a warning log message when the DAG processor skips callbacks due to a bundle_name mismatch
  • Added an info log message when the scheduler sends a callback to the DAG processor
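
The fallback behaviour described in the list above can be sketched in isolation. This is a minimal illustration, not the code in this PR: `StubDagRun`, `RelationshipLoadError`, `CallbackContext`, `build_full_context` and `produce_callback` are hypothetical names, and the stub exception merely stands in for an ORM error such as InvalidRequestError.

```python
import logging
from dataclasses import dataclass, field

log = logging.getLogger("callback_sketch")


class RelationshipLoadError(Exception):
    """Hypothetical stand-in for an ORM error such as InvalidRequestError."""


class StubDagRun:
    """Hypothetical DagRun stand-in whose relationship access can fail."""

    def __init__(self, dag_id: str, run_id: str, broken: bool = False):
        self.dag_id = dag_id
        self.run_id = run_id
        self._broken = broken

    @property
    def consumed_asset_events(self) -> list[str]:
        # Simulates a lazy relationship load that may raise.
        if self._broken:
            raise RelationshipLoadError("relationship could not be loaded")
        return ["asset-event-1"]


@dataclass
class CallbackContext:
    """Hypothetical context object; only dag_id and run_id are mandatory."""

    dag_id: str
    run_id: str
    consumed_asset_events: list[str] = field(default_factory=list)


def build_full_context(dag_run: StubDagRun) -> CallbackContext:
    # Touching the relationship is the step that can raise.
    events = list(dag_run.consumed_asset_events)
    return CallbackContext(dag_run.dag_id, dag_run.run_id, events)


def produce_callback(dag_run: StubDagRun) -> CallbackContext:
    """Always return a context: full if possible, minimal otherwise."""
    try:
        return build_full_context(dag_run)
    except Exception:
        # Catch broadly and log, so a failure is never silent.
        log.warning(
            "Could not build full context for %s/%s; sending minimal context",
            dag_run.dag_id,
            dag_run.run_id,
            exc_info=True,
        )
        return CallbackContext(dag_id=dag_run.dag_id, run_id=dag_run.run_id)
```

Calling `produce_callback(StubDagRun("d", "r", broken=True))` in this sketch still returns a context carrying `dag_id` and `run_id`, with an empty `consumed_asset_events` list and a warning in the log.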

closes #63374

boring-cyborg bot commented Mar 16, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally. It's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

@boring-cyborg boring-cyborg bot added area:DAG-processing area:Scheduler including HA (high availability) scheduler labels Mar 16, 2026

Previously only DetachedInstanceError was caught when accessing
consumed_asset_events on ORM DagRun objects. Other SQLAlchemy
exceptions (e.g. InvalidRequestError) crashed the scheduler.

closes: apache#63374

DagRunContext creation could crash when ORM relationship access
failed, preventing the callback from being produced entirely.
The callback is now sent with minimal context on failure.

@Sathvik-Chowdary-Veerapaneni Sathvik-Chowdary-Veerapaneni force-pushed the dag-failure-callback-not-firing-63374 branch from f49ef66 to dfd11a3 March 27, 2026 22:32
@eladkal eladkal added this to the Airflow 3.2.0 milestone Mar 27, 2026
@eladkal eladkal added the type:bug-fix Changelog: Bug Fixes label Mar 27, 2026
@eladkal eladkal requested a review from vatsrahul1001 March 27, 2026 22:33
Labels

area:DAG-processing area:Scheduler including HA (high availability) scheduler type:bug-fix Changelog: Bug Fixes

Development

Successfully merging this pull request may close these issues.

DAG-level on_failure_callback never fires

3 participants