Skip to content

DPSJob.wait_for_completion() returns immediately with incorrect status after job submission #126

@sujen1412

Description

@sujen1412

Summary

DPSJob.wait_for_completion() returns immediately after job submission with an incorrect status (e.g., "Deleted") instead of waiting for the job to reach a terminal state. When called again after the job completes, it correctly returns "Succeeded".

Description

The issue occurs because of a race condition between job submission and database visibility:

  1. When a job is submitted via the DPS API, it takes up to 5 seconds to appear in the OpenSearch database (default commit write time)
  2. If wait_for_completion() is called immediately after submission, before the job appears in the database, the API returns an unexpected status (currently "Deleted")
  3. The function returns immediately with this incorrect status instead of retrying
  4. When called again later (after the job actually completes), it correctly returns "Succeeded"

Steps to Reproduce

from maap.maap import MAAP

maap = MAAP()

# Submit a job
job = maap.submit_job(
    identifier='my_analysis',
    algo_id='my_algorithm',
    version='main',
    queue='maap-dps-worker-8gb'
)

# Immediately wait for completion
job.wait_for_completion()
print(f"Status: {job.status}")  # Prints "Deleted" (incorrect!)

# Wait a moment and check again
time.sleep(5)
job.wait_for_completion()
print(f"Status: {job.status}")  # Prints "Succeeded" (correct)

Expected Behavior

wait_for_completion() should wait for the job to reach a terminal state (Succeeded, Failed, Dismissed, etc.), accounting for the initial database visibility delay.

Actual Behavior

The function returns immediately with status "Deleted" if called before the job appears in the OpenSearch database. It does not retry when encountering this status.

Root Cause

In dps_job.py:wait_for_completion(), the backoff decorator only retries when a RuntimeError is raised. The function only raises this error when status is "accepted" or "running":

if self.status.lower() in ["accepted", "running"]:
    logger.debug('Current Status is {}. Backing off.'.format(self.status))
    raise RuntimeError
return self

When the job hasn't appeared in the database yet, the API returns "Deleted" (or similar), which is not in the list of statuses that trigger a retry. The function returns immediately instead of backing off.

Suggested Fix

The function should either:

  1. Treat unexpected status values (like "Deleted") as a reason to retry, or
  2. Only return when the status is a known terminal state (Succeeded, Failed, Dismissed, Deduped, Offline), or
  3. Add an initial delay or retry mechanism for the first few checks after job submission

Environment

  • maap-py library
  • DPS backend with OpenSearch database
  • Default OpenSearch commit write time: ~5 seconds

Related

  • This affects the common pattern of immediately calling wait_for_completion() after job submission
  • Documentation currently recommends this pattern (see docstring example)

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions