Fix orphaned batch_retain parents when child fails via unhandled exception #618
Merged
nicoloboschi merged 2 commits into main on Mar 19, 2026
…ption

When a child retain operation fails with an unhandled exception (e.g. a DB constraint violation), the memory engine's transaction is rolled back entirely, including any call to _maybe_update_parent_operation. The poller's fallback _mark_failed then updates the child status but leaves the parent batch_retain permanently stuck in 'pending'.

Fix: wrap _mark_failed in a transaction and call a new poller-level _maybe_update_parent_operation after marking the child failed. This mirrors the memory engine's own parent-update logic and ensures the parent is resolved to completed/failed regardless of how the child failure was detected.

The poller's implementation locks the parent row, checks all siblings, and only finalises the parent once all siblings have reached a terminal state. Errors in parent propagation are logged but do not affect the child failure path, which is the critical state change.
Tests cover the new _maybe_update_parent_operation logic:

- Last sibling fails → parent batch_retain becomes failed
- Sole child fails → parent becomes failed
- Sibling still pending → parent stays pending (no premature resolution)
- No parent in result_metadata → safe no-op
- End-to-end: unhandled exception via execute_task propagates to parent
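The resolution rule these tests exercise can be sketched as a small pure function. This is an illustration only, not the code from this PR: the name resolve_parent_status and the TERMINAL set are hypothetical, and it assumes that completed and failed are the only terminal statuses.

```python
from typing import Optional, Sequence

# Assumption: these are the only terminal statuses; anything else
# (for example "pending") means the child is still in flight.
TERMINAL = {"completed", "failed"}


def resolve_parent_status(sibling_statuses: Sequence[str]) -> Optional[str]:
    """Return the parent's final status, or None if it must stay pending.

    Hypothetical helper: the parent is finalised only once every sibling
    is terminal, and it is failed if any sibling failed.
    """
    if not sibling_statuses:
        # No children recorded: nothing to resolve.
        return None
    if any(status not in TERMINAL for status in sibling_statuses):
        # At least one sibling is still running: no premature resolution.
        return None
    return "failed" if "failed" in sibling_statuses else "completed"
```

Keeping the decision separate from the database access makes each listed case (last sibling fails, sole child fails, sibling still pending) checkable without a database.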
nicoloboschi
approved these changes
Mar 19, 2026
Collaborator
nicoloboschi
left a comment
LGTM
this was a very bad code design initially, this one is much better
Problem

When a child retain operation fails with an unhandled exception (e.g. a database constraint violation), two things happen:

- The memory engine's transaction is rolled back entirely, including the call to _maybe_update_parent_operation that would normally update the parent batch_retain.
- The poller's fallback _mark_failed runs and correctly marks the child as failed, but it has no awareness of the parent relationship.

This leaves the parent batch_retain operation permanently stuck in pending state, which causes the worker queue to appear degraded. Failures of this kind have been observed across documents in various languages (Turkish, Polish, Russian, and Spanish content) where character encoding edge cases can trigger constraint violations.

Fix

Wrap _mark_failed in a transaction and call a new poller-level _maybe_update_parent_operation after marking the child failed. This mirrors the memory engine's own parent-update logic and ensures the parent is resolved regardless of how the child failure was detected.

The poller's implementation:

- Locks the parent row
- Checks all siblings and finalises the parent only once every sibling has reached a terminal state
- Resolves the parent to failed if any sibling failed, completed if all succeeded
- Logs errors in parent propagation without letting them affect the child failure path

Tests

Added TestMarkFailedParentPropagation covering:

- Last sibling fails → parent batch_retain becomes failed
- Sole child fails → parent becomes failed
- Sibling still pending → parent stays pending (no premature resolution)
- No parent_operation_id in result_metadata → safe no-op
- End-to-end: unhandled exception via execute_task propagates to parent
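For readers who want to see the shape of the fix, here is a minimal sketch using Python's built-in sqlite3 module. It is not the code from this PR. The operations table, its columns, and the function names mark_failed and maybe_update_parent are assumptions made for illustration; only the parent_operation_id key inside result_metadata and the status names come from the description above. SQLite has no row-level locks, so the parent-row lock is indicated by a comment only.

```python
import json
import logging
import sqlite3

logger = logging.getLogger(__name__)

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE operations (
    id TEXT PRIMARY KEY,
    status TEXT NOT NULL,
    error TEXT,
    result_metadata TEXT  -- JSON; children carry "parent_operation_id"
)
"""


def maybe_update_parent(conn: sqlite3.Connection, child_id: str) -> None:
    """Finalise the parent once every sibling is completed or failed."""
    row = conn.execute(
        "SELECT result_metadata FROM operations WHERE id = ?", (child_id,)
    ).fetchone()
    metadata = json.loads(row[0]) if row and row[0] else {}
    parent_id = metadata.get("parent_operation_id")
    if parent_id is None:
        return  # no parent recorded: safe no-op

    # A PostgreSQL-backed implementation would lock the parent row here
    # (e.g. SELECT ... FOR UPDATE); SQLite cannot express that.
    statuses = [
        status
        for status, meta in conn.execute(
            "SELECT status, result_metadata FROM operations"
        )
        if meta and json.loads(meta).get("parent_operation_id") == parent_id
    ]
    if any(s not in ("completed", "failed") for s in statuses):
        return  # a sibling is still in flight: leave the parent pending

    final = "failed" if "failed" in statuses else "completed"
    conn.execute(
        "UPDATE operations SET status = ? WHERE id = ? AND status = 'pending'",
        (final, parent_id),
    )


def mark_failed(conn: sqlite3.Connection, operation_id: str, error: str) -> None:
    """Mark a child failed and try to resolve its parent in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE operations SET status = 'failed', error = ? WHERE id = ?",
            (error, operation_id),
        )
        try:
            maybe_update_parent(conn, operation_id)
        except Exception:
            # Propagation errors are logged and must not undo the child
            # update. In PostgreSQL a failed statement aborts the enclosing
            # transaction, so a savepoint around this call would be needed.
            logger.exception("parent propagation failed for %s", operation_id)
```

The parent update is guarded with `AND status = 'pending'` so that a parent already finalised by another path is not overwritten.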