
Variant translation jobs hang indefinitely when run concurrently on score sets with overlapping CAIDs #733

@bencap

Description

Summary

When two variant translation jobs run concurrently for score sets that share ClinGen allele IDs (CAIDs), both jobs can hang indefinitely — silently, with no exception raised — until the 3-hour JOB_TIMEOUT_SECONDS fires. This is caused by a combination of synchronous psycopg2 blocking the asyncio event loop and long-lived uncommitted transactions holding row locks on variant_translations.

Problem

populate_variant_translations_for_score_set in worker/jobs/external_services/variant_translation.py calls upsert_variant_translations (in lib/variant_translations.py) for each allele. That function issues an INSERT ... ON CONFLICT DO NOTHING against the variant_translations table using a synchronous SQLAlchemy Session backed by psycopg2.
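For context, the statement presumably has roughly this shape (a sketch only: the column names come from this issue, but the table definition and `build_upsert` helper are assumptions, not the actual `lib/variant_translations.py` code):

```python
from sqlalchemy import Column, MetaData, String, Table
from sqlalchemy.dialects import postgresql
from sqlalchemy.dialects.postgresql import insert

metadata = MetaData()

# Minimal assumed shape of the variant_translations table.
variant_translations = Table(
    "variant_translations",
    metadata,
    Column("aa_clingen_id", String, primary_key=True),
    Column("nt_clingen_id", String, primary_key=True),
)

def build_upsert(pairs):
    """Build INSERT ... ON CONFLICT DO NOTHING over (aa, nt) pairs."""
    return insert(variant_translations).values(
        [{"aa_clingen_id": aa, "nt_clingen_id": nt} for aa, nt in pairs]
    ).on_conflict_do_nothing()

sql = str(build_upsert([("CA1", "CA2")]).compile(dialect=postgresql.dialect()))
# ON CONFLICT DO NOTHING skips rows that already exist, but when the
# conflicting row is an *uncommitted* insert from another transaction,
# this statement must wait on that transaction's row lock to find out.
```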

The hang occurs through the following sequence:

  1. Both jobs run as coroutines in the same asyncio event loop (MAX_JOBS = 2 in worker/settings/worker.py).
  2. Score sets from the same experiment share CAIDs. When a CA allele is resolved through a shared PA, both jobs discover and attempt to insert the same (aa_clingen_id, nt_clingen_id) pairs into variant_translations.
  3. Transactions are long-lived: db.execute() in upsert_variant_translations flushes but does not commit. Commits only happen inside update_progress every ~10 alleles.
  4. Job B's db.execute() blocks the OS thread while waiting for a row lock held by Job A's open transaction.
  5. Because psycopg2 is synchronous, blocking the OS thread freezes the entire asyncio event loop. Job A cannot advance to its next await point, cannot call update_progress, and cannot commit — so it never releases its locks.
  6. From Postgres's perspective, only Job B is waiting. Job A's transaction is idle. There is no circular wait, so Postgres does not detect a deadlock and raises no exception.
  7. The set-based deduplication in upsert_variant_translations (list({(aa, nt) for ...})) produces non-deterministic row ordering, which means in a multi-process scenario the jobs can also acquire locks in opposite orders — a true circular deadlock that would raise an exception in separate-process deployments but still manifests as an indefinite hang in the shared event loop case.
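The freeze in steps 4 and 5 can be reproduced in miniature with a blocking call standing in for psycopg2 (a self-contained sketch, not the worker code; `time.sleep` plays the role of a synchronous `db.execute()` waiting on a row lock):

```python
import asyncio
import time

order = []

async def job_a():
    order.append("A: executing INSERT")
    time.sleep(0.2)  # synchronous call: the entire event loop is frozen here
    order.append("A: INSERT returned")

async def job_b():
    await asyncio.sleep(0)  # yield to the loop, as a cooperative job would
    order.append("B: ran")

async def main():
    await asyncio.gather(job_a(), job_b())

asyncio.run(main())
# order == ["A: executing INSERT", "A: INSERT returned", "B: ran"]
```

Even though `job_b` is ready to run, it cannot be scheduled until `job_a`'s blocking call returns. In the real bug the blocking call never returns, because the lock it waits on can only be released by a commit that the frozen loop can never reach.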

Steps to Reproduce

  1. Create two score sets within the same experiment that share mapped variants resolving to overlapping CAIDs (e.g., urn:mavedb:00001268-a-1 and urn:mavedb:00001268-b-1).
  2. Trigger variant translation jobs for both score sets such that they execute concurrently within the same worker process.
  3. Observe that both jobs log progress; they may even complete successfully. On some runs, however, both jobs stop executing and hang.
  4. No error or exception is logged. Both jobs remain in RUNNING state until JOB_TIMEOUT_SECONDS (3 hours) elapses.

Expected Behavior

Concurrent variant translation jobs on overlapping score sets should either:

  • Complete successfully (one waits briefly for the other to commit, then continues), or
  • Fail fast with a recoverable error and be retried, rather than hanging silently for hours.

Proposed Behavior

Two changes to lib/variant_translations.py:

  1. Sort rows before inserting. Change list({(aa, nt) for ...}) to sorted({(aa, nt) for ...}). This ensures all transactions acquire row locks in the same canonical (aa_clingen_id, nt_clingen_id) order, eliminating any circular wait in multi-process deployments and reducing the overlap window in the shared event loop case.
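The first change is a one-token diff. A dependency-free sketch of why it matters (the pair values are made up):

```python
# Hypothetical deduplicated (aa_clingen_id, nt_clingen_id) pairs.
pairs = [("CA2", "CA9"), ("CA1", "CA3"), ("CA2", "CA9"), ("CA1", "CA2")]

# Current code: set iteration order is arbitrary per process, so two jobs
# can build their VALUES lists (and thus take row locks) in opposite orders.
rows_unordered = list({pair for pair in pairs})

# Proposed: every transaction locks rows in the same canonical order.
rows = sorted({pair for pair in pairs})
assert rows == [("CA1", "CA2"), ("CA1", "CA3"), ("CA2", "CA9")]
```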

  2. Set a per-statement lock timeout using SET LOCAL. Issue db.execute(text("SET LOCAL lock_timeout = '5s'")) immediately before the INSERT. SET LOCAL scopes the timeout to the current transaction only — it expires at the next commit and does not affect unrelated jobs or statements. When Job B's insert blocks on Job A's lock, Postgres will raise ERROR: canceling statement due to lock timeout after 5 seconds. This propagates as an OperationalError through SQLAlchemy, is caught by the with_pipeline_management decorator's exception handler, and the job is marked failed and retried. On retry, the overlapping job has typically already committed its batch, so the conflict does not recur.
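A sketch of the second change (the helper name is hypothetical; `db` is the synchronous Session the job already holds and `stmt` is the existing INSERT construct):

```python
from sqlalchemy import text

def execute_insert_with_lock_timeout(db, stmt, timeout="5s"):
    """Run the INSERT under a transaction-scoped lock timeout.

    SET LOCAL must be issued on the same session (same connection, same
    open transaction) as the INSERT. It expires at the next COMMIT or
    ROLLBACK, so later transactions on this session see the default.
    """
    db.execute(text(f"SET LOCAL lock_timeout = '{timeout}'"))
    # If the row lock is not granted within the timeout, Postgres cancels
    # the statement and SQLAlchemy raises OperationalError, which the
    # with_pipeline_management handler turns into a failed-and-retried job.
    db.execute(stmt)
```

In the job this would replace the bare `db.execute()` call in `upsert_variant_translations`; the existing commit cadence in `update_progress` is unchanged.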

The long-term fix is tracked in #715. Once all worker DB sessions are async, db.execute() will yield to the event loop on lock waits rather than blocking the OS thread, making the lock timeout unnecessary.
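With async sessions, a lock wait becomes an `await`, and the rest of the loop keeps running. In miniature (`asyncio.sleep` standing in for an async driver awaiting a row lock; not the worker code):

```python
import asyncio

order = []

async def job_a():
    order.append("A: INSERT waiting on a lock")
    await asyncio.sleep(0.05)  # an async driver awaits here, yielding the loop
    order.append("A: INSERT returned")

async def job_b():
    order.append("B: ran while A was waiting")

async def main():
    await asyncio.gather(job_a(), job_b())

asyncio.run(main())
# order == ["A: INSERT waiting on a lock", "B: ran while A was waiting",
#           "A: INSERT returned"]
```

Because Job A's lock wait no longer freezes the loop, the other job can reach its own commit, release its locks, and let Job A proceed.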

Acceptance Criteria

  • upsert_variant_translations sorts the deduplicated (aa_clingen_id, nt_clingen_id) pairs before constructing the INSERT values list.
  • upsert_variant_translations issues SET LOCAL lock_timeout = '5s' on the session before executing the INSERT.
  • Concurrent variant translation jobs on score sets with fully overlapping CAIDs do not hang; at least one job completes successfully and the other either completes or fails with a logged OperationalError and is retried.
  • Unrelated jobs and statements outside of upsert_variant_translations are not affected by the lock timeout (verified by confirming SET LOCAL scope).
  • Existing unit tests for upsert_variant_translations continue to pass.

Implementation Notes

  • The SET LOCAL statement must be issued on the same Session (and therefore the same underlying connection) as the INSERT, within the same transaction. Issuing it on a separate connection or after a commit would have no effect.
  • The 5-second timeout value is a starting point. It should be long enough to avoid spurious failures under normal load but short enough to unblock the event loop well before any downstream timeout fires.
  • worker/settings/worker.py already contains a comment explaining the MAX_JOBS = 2 cap and the psycopg2 event loop starvation risk. That comment should be updated to reference this fix and note that the lock timeout is a mitigation, not a resolution.

Metadata


    Labels

    app: worker (Task implementation touches the worker)
    type: bug (Something isn't working)
