Skip to content

Fix orphaned batch_retain parents when child fails via unhandled exception#618

Merged
nicoloboschi merged 2 commits intomainfrom
fix/orphaned-batch-retain-parents
Mar 19, 2026
Merged

Fix orphaned batch_retain parents when child fails via unhandled exception#618
nicoloboschi merged 2 commits intomainfrom
fix/orphaned-batch-retain-parents

Conversation

@cdbartholomew
Copy link
Contributor

Problem

When a child retain operation fails with an unhandled exception (e.g. a database constraint violation), two things happen:

  1. The memory engine's transaction is rolled back entirely — including the call to _maybe_update_parent_operation that would normally update the parent batch_retain.
  2. The poller's fallback _mark_failed runs and correctly marks the child as failed, but it has no awareness of the parent relationship.

This leaves the parent batch_retain operation permanently stuck in pending state, which causes the worker queue to appear degraded. Failures of this kind have been observed across documents in various languages (Turkish, Polish, Russian, and Spanish content) where character encoding edge cases can trigger constraint violations.

Fix

Wrap _mark_failed in a transaction and call a new poller-level _maybe_update_parent_operation after marking the child failed. This mirrors the memory engine's own parent-update logic and ensures the parent is resolved regardless of how the child failure was detected.

The poller's implementation:

  • Locks the parent row to prevent concurrent sibling race conditions
  • Checks all siblings before finalising — only resolves parent once all are in a terminal state
  • Marks parent failed if any sibling failed, completed if all succeeded
  • Errors in parent propagation are logged but do not affect the child failure path

Tests

Added TestMarkFailedParentPropagation covering:

  • Last sibling fails → parent batch_retain becomes failed
  • Sole child fails → parent becomes failed
  • Sibling still pending → parent stays pending (no premature resolution)
  • No parent_operation_id in result_metadata → safe no-op
  • End-to-end: unhandled exception via execute_task propagates to parent

…ption

When a child retain operation fails with an unhandled exception (e.g. a DB
constraint violation), the memory engine's transaction is rolled back entirely,
including any call to _maybe_update_parent_operation. The poller's fallback
_mark_failed then updates the child status but leaves the parent batch_retain
permanently stuck in 'pending'.

Fix: wrap _mark_failed in a transaction and call a new poller-level
_maybe_update_parent_operation after marking the child failed. This mirrors
the memory engine's own parent-update logic and ensures the parent is
resolved to completed/failed regardless of how the child failure was detected.

The poller's implementation locks the parent row, checks all siblings, and
only finalises the parent once all siblings have reached a terminal state.
Errors in parent propagation are logged but do not affect the child failure
path, which is the critical state change.
Tests cover the new _maybe_update_parent_operation logic:
- Last sibling fails → parent batch_retain becomes failed
- Sole child fails → parent becomes failed
- Sibling still pending → parent stays pending (no premature resolution)
- No parent in result_metadata → safe no-op
- End-to-end: unhandled exception via execute_task propagates to parent
Copy link
Collaborator

@nicoloboschi nicoloboschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

this was a very bad code design initially, this one is much better

@nicoloboschi nicoloboschi merged commit 4394245 into main Mar 19, 2026
39 of 40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants