Feature/data 2449 improve failed deferrable batch #61
Summary
Improve error logging for failed deferred AWS Batch jobs by displaying actual CloudWatch logs instead of generic "Trigger failed" messages.
Problem
When deferred AWS Batch jobs fail, users only see generic error messages like "Trigger failed" without the actual CloudWatch logs that show the real failure reason. This makes debugging failed batch jobs difficult as users cannot see the container logs that contain the actual error details.
Solution
Enhanced the AWSBatchOperator to:
1. Retrieve job_id from XCom: When a deferred task resumes after trigger failure, retrieve the job_id from the existing batch_job_details XCom that's automatically created by the BatchOperator.
2. Fetch CloudWatch logs: When TaskDeferralError occurs (trigger failure), fetch the actual CloudWatch logs using the retrieved job_id.
3. Enhanced error messages: Include the CloudWatch logs and direct link in the error message, showing users the real failure reason instead of just "Trigger failed".
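The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names (job_id_from_xcom, fetch_log_lines, format_failure_message) are hypothetical, the batch_job_details XCom is assumed to be a dict with a job_id key, and the log group is assumed to be the AWS Batch default /aws/batch/job. The client calls describe_jobs and get_log_events are the standard boto3 Batch and CloudWatch Logs methods, and the clients are passed in so the logic can be exercised without AWS access.

```python
from typing import Any, Callable, List, Mapping, Optional

# Assumption: jobs log to the default AWS Batch log group.
DEFAULT_LOG_GROUP = "/aws/batch/job"


def job_id_from_xcom(xcom_pull: Callable[..., Any]) -> Optional[str]:
    """Step 1: read job_id from the batch_job_details XCom.

    xcom_pull is expected to behave like TaskInstance.xcom_pull(key=...).
    The XCom value is assumed to be a dict containing "job_id".
    """
    details = xcom_pull(key="batch_job_details")
    if isinstance(details, Mapping):
        return details.get("job_id")
    return None


def fetch_log_lines(
    batch_client: Any,
    logs_client: Any,
    job_id: str,
    log_group: str = DEFAULT_LOG_GROUP,
    limit: int = 50,
) -> List[str]:
    """Step 2: fetch the most recent container log lines for a Batch job."""
    jobs = batch_client.describe_jobs(jobs=[job_id]).get("jobs", [])
    if not jobs:
        return []
    # The container section carries the CloudWatch log stream name once
    # the job has started; it may be missing for jobs that never ran.
    stream = jobs[0].get("container", {}).get("logStreamName")
    if not stream:
        return []
    response = logs_client.get_log_events(
        logGroupName=log_group,
        logStreamName=stream,
        limit=limit,
        startFromHead=False,  # return the latest events, not the oldest
    )
    return [event["message"] for event in response.get("events", [])]


def format_failure_message(
    job_id: str, log_lines: List[str], base: str = "Trigger failed"
) -> str:
    """Step 3: build an error message that includes the container logs."""
    if not log_lines:
        return f"{base} (job_id={job_id}; no CloudWatch logs found)"
    body = "\n".join(log_lines)
    return f"{base} (job_id={job_id})\nCloudWatch logs:\n{body}"
```

In an operator, these would be called from the failure path with boto3.client("batch") and boto3.client("logs"); the CloudWatch console link mentioned above is omitted here because its exact format is not shown in this description.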
Implementation Details
- resume_execution(): Handles trigger failures by retrieving job_id from batch_job_details XCom and fetching CloudWatch logs
- execute_complete(): Ensures logs are fetched for successful deferred tasks before status checking
- _fetch_and_log_cloudwatch(): Helper method that fetches CloudWatch logs and returns them for error messages
- _format_extra_info(): Formats enhanced error messages with logs and CloudWatch links
Before vs After
Before:
After:
Test Plan