Summary
DPSJob.wait_for_completion() returns immediately after job submission with an incorrect status (e.g., "Deleted") instead of waiting for the job to reach a terminal state. When called again after the job completes, it correctly returns "Succeeded".
Description
The issue occurs because of a race condition between job submission and database visibility:
- When a job is submitted via the DPS API, it takes up to 5 seconds to appear in the OpenSearch database (default commit write time)
- If
wait_for_completion() is called immediately after submission, before the job appears in the database, the API returns an unexpected status (currently "Deleted")
- The function returns immediately with this incorrect status instead of retrying
- When called again later (after the job actually completes), it correctly returns "Succeeded"
Steps to Reproduce
from maap.maap import MAAP
maap = MAAP()
# Submit a job
job = maap.submit_job(
identifier='my_analysis',
algo_id='my_algorithm',
version='main',
queue='maap-dps-worker-8gb'
)
# Immediately wait for completion
job.wait_for_completion()
print(f"Status: {job.status}") # Prints "Deleted" (incorrect!)
# Wait a moment and check again
time.sleep(5)
job.wait_for_completion()
print(f"Status: {job.status}") # Prints "Succeeded" (correct)
Expected Behavior
wait_for_completion() should wait for the job to reach a terminal state (Succeeded, Failed, Dismissed, etc.), accounting for the initial database visibility delay.
Actual Behavior
The function returns immediately with status "Deleted" if called before the job appears in the OpenSearch database. It does not retry when encountering this status.
Root Cause
In dps_job.py:wait_for_completion(), the backoff decorator only retries when a RuntimeError is raised. The function only raises this error when status is "accepted" or "running":
if self.status.lower() in ["accepted", "running"]:
logger.debug('Current Status is {}. Backing off.'.format(self.status))
raise RuntimeError
return self
When the job hasn't appeared in the database yet, the API returns "Deleted" (or similar), which is not in the list of statuses that trigger a retry. The function returns immediately instead of backing off.
Suggested Fix
The function should either:
- Treat unexpected status values (like "Deleted") as a reason to retry, or
- Only return when the status is a known terminal state (Succeeded, Failed, Dismissed, Deduped, Offline), or
- Add an initial delay or retry mechanism for the first few checks after job submission
Environment
- maap-py library
- DPS backend with OpenSearch database
- Default OpenSearch commit write time: ~5 seconds
Related
- This affects the common pattern of immediately calling
wait_for_completion() after job submission
- Documentation currently recommends this pattern (see docstring example)
Summary
DPSJob.wait_for_completion()returns immediately after job submission with an incorrect status (e.g., "Deleted") instead of waiting for the job to reach a terminal state. When called again after the job completes, it correctly returns "Succeeded".Description
The issue occurs because of a race condition between job submission and database visibility:
wait_for_completion()is called immediately after submission, before the job appears in the database, the API returns an unexpected status (currently "Deleted")Steps to Reproduce
Expected Behavior
wait_for_completion()should wait for the job to reach a terminal state (Succeeded, Failed, Dismissed, etc.), accounting for the initial database visibility delay.Actual Behavior
The function returns immediately with status "Deleted" if called before the job appears in the OpenSearch database. It does not retry when encountering this status.
Root Cause
In
dps_job.py:wait_for_completion(), the backoff decorator only retries when aRuntimeErroris raised. The function only raises this error when status is "accepted" or "running":When the job hasn't appeared in the database yet, the API returns "Deleted" (or similar), which is not in the list of statuses that trigger a retry. The function returns immediately instead of backing off.
Suggested Fix
The function should either:
Environment
Related
wait_for_completion()after job submission