fix: resolve race condition in FuturesDict and unsafe __del__ methods#6759

Open
Suraj Sahani (surajsahani) wants to merge 4 commits into langchain-ai:main from surajsahani:main

@surajsahani Suraj Sahani (surajsahani) commented Feb 6, 2026

Problem

This PR fixes two concurrency bugs that could cause deadlocks and crashes in production:

  1. Race condition in FuturesDict: The callback could fire before the counter was incremented, or a future could complete between counter increment and callback registration, leading to incorrect counter state and potential deadlocks.

  2. Unsafe __del__ methods: Multiple stores and cache implementations had __del__ methods that could crash during interpreter shutdown when logging or attributes become unavailable.

Solution

Race Condition Fix (libs/langgraph/langgraph/pregel/_runner.py)

  • Increment the counter before registering the callback to prevent the race
  • Check whether the future is already done after callback registration
  • Make on_done() idempotent by checking whether the future was already processed
  • Prevent double-decrement of the counter
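
The ordering described above can be sketched as follows. This is a minimal illustration, not the code in `_runner.py`: the class name `CountingFuturesDict` and the attributes `counter`, `processed`, and `all_done` are assumptions made for the example, and the real `FuturesDict` may be structured differently.

```python
import concurrent.futures
import threading


class CountingFuturesDict(dict):
    """Hypothetical stand-in for FuturesDict: counts futures still pending."""

    def __init__(self):
        super().__init__()
        self.lock = threading.Lock()
        self.counter = 0
        self.processed = set()  # futures whose completion was already counted
        self.all_done = threading.Event()
        self.all_done.set()  # nothing pending yet

    def __setitem__(self, fut, task):
        super().__setitem__(fut, task)
        with self.lock:
            # Increment first, so a callback that fires immediately
            # never decrements a counter that has not been raised yet.
            self.counter += 1
            self.all_done.clear()
        fut.add_done_callback(self.on_done)
        # Defensive re-check after registration; on_done() is idempotent,
        # so calling it a second time for the same future is harmless.
        if fut.done():
            self.on_done(fut)

    def on_done(self, fut):
        with self.lock:
            if fut in self.processed:
                return  # already counted: avoid a double decrement
            self.processed.add(fut)
            self.counter -= 1
            if self.counter == 0:
                self.all_done.set()
```

Registering a future that has already completed exercises both paths at once: `add_done_callback` runs the callback immediately, and the follow-up `fut.done()` check calls `on_done()` again, which returns early because the future is in `processed`.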

Resource Cleanup Fix

Added proper exception handling in __del__ methods:

  • libs/checkpoint-sqlite/langgraph/store/sqlite/base.py
  • libs/checkpoint-postgres/langgraph/store/postgres/base.py
  • libs/checkpoint-sqlite/langgraph/cache/sqlite/__init__.py
  • libs/checkpoint/langgraph/store/base/batch.py
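
A defensive `__del__` of the kind described might look like the sketch below. The class `ResourceOwner` and its `conn` and `closed` attributes are hypothetical; the actual stores and cache in the files listed above own different resources and may guard them differently.

```python
class ResourceOwner:
    """Hypothetical store-like object that owns a closable connection."""

    def __init__(self, conn):
        self.conn = conn
        self.closed = False

    def close(self):
        if self.closed:
            return
        self.closed = True
        self.conn.close()

    def __del__(self):
        # At interpreter shutdown, or if __init__ failed part-way,
        # attributes and module globals may be missing. Swallow any
        # error here rather than let it escape from the finalizer.
        try:
            self.close()
        except Exception:
            pass
```

The `try`/`except` also covers an instance whose `__init__` never ran to completion, where `self.closed` does not exist and `close()` raises `AttributeError`.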

Testing

  • test_pregel_loop_refcount - Memory leak test passes
  • test_concurrent_emit_sends - Concurrency test passes
  • ✅ All linting and formatting checks pass

Impact

These fixes prevent:

  • Deadlocks in high-concurrency scenarios with many parallel tasks
  • Crashes during interpreter shutdown
  • Incorrect task counter state leading to hung executions

Particularly important for production workloads with hundreds of concurrent graph executions.

- Fix race condition in FuturesDict.__setitem__ where callback could fire
  before counter was incremented, leading to incorrect counter state
- Make on_done() idempotent to handle edge case where future completes
  between counter increment and callback registration
- Add proper exception handling in __del__ methods to prevent crashes
  during interpreter shutdown in stores and cache
- Affected files: pregel/_runner.py, sqlite/postgres stores, batch.py

Tests: test_pregel_loop_refcount, test_concurrent_emit_sends pass
@mdrxy Mason Daugherty (mdrxy) added the bypass-issue-check label (Maintainer override: skip issue-link enforcement) Mar 24, 2026
…o LLM

_filter_validation_errors manually reconstructed injected arg names from
only state/store/runtime, missing custom InjectedToolArg subclasses.
This meant validation errors for custom injected args (e.g., InjectedAuth)
would leak back to the LLM in error messages.

Fix: Use all_injected_keys (added by Sydney in 08363db) instead of
manually rebuilding the set, ensuring all injected args are filtered.

Companion fix to 08363db (injected arg stripping in _inject_tool_args).
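
The filtering idea in this commit message can be illustrated with a small sketch. The function `filter_injected_errors` is hypothetical and is not the body of `_filter_validation_errors`; it assumes each error is a dict with a `loc` tuple, the shape Pydantic's `ValidationError.errors()` returns, and that `all_injected_keys` is a set of argument names.

```python
def filter_injected_errors(errors, all_injected_keys):
    """Drop validation errors whose location starts at an injected argument.

    errors: list of dicts with a "loc" tuple (assumed shape).
    all_injected_keys: names of every injected argument, including custom ones.
    """
    return [
        err
        for err in errors
        if not (err.get("loc") and err["loc"][0] in all_injected_keys)
    ]


errors = [
    {"loc": ("query",), "msg": "Field required"},
    {"loc": ("auth",), "msg": "Field required"},      # custom injected arg
    {"loc": ("state", "x"), "msg": "Input should be a valid integer"},
]

# Rebuilding the set from only state/store/runtime lets "auth" through.
partial = filter_injected_errors(errors, {"state", "store", "runtime"})

# Using the full set of injected keys filters it as well.
full = filter_injected_errors(errors, {"state", "store", "runtime", "auth"})
```

With the partial set, the error for `auth` survives and would be shown to the LLM; with the full set, only the error for `query` remains.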
Labels

bypass-issue-check Maintainer override: skip issue-link enforcement external
