feat: implement asynchronous scan execution with background worker#129
Conversation
SHAURYAKSHARMA24
left a comment
There was a problem hiding this comment.
Requesting changes.
The async direction is right and the 202 + polling API shape is useful, but I don’t think this is safe to merge yet. There are several correctness issues in the worker/queue design and migration path.
Blocking issues:
-
Migration may re-run historical scans
ALTER TABLE scans ADD COLUMN status TEXT DEFAULT 'pending'appears to backfill existing rows aspending. That means old completed scans could be picked up by the worker and re-executed after deployment. This risks duplicate findings, unnecessary Azure API calls, and real cloud cost. Existing completed rows should be backfilled tocompleted, notpending. -
Scan claiming is not atomic
The worker currently does a read-then-write flow: fetch pending scans, then update each one torunning. If more than one worker/container is active, multiple workers can select the same pending scan before either updates it. This needs an atomic claim pattern, such asUPDATE ... WHERE status='pending' RETURNING *orSELECT ... FOR UPDATE SKIP LOCKED. -
Running scans can get stuck forever
If the worker dies after setting a scan torunningbut before saving the completed result or marking failure, that scan remainsrunningpermanently. There is no heartbeat, lease, timeout, retry counter, or stale-job recovery. Please add a recovery mechanism for stale running scans. -
Serialization regression in
ScanEngine.run_scan()
run_scan()previously returnedmake_serializable(result)but now appears to returnresultdirectly. That risks failures when findings contain datetimes, sets, or Azure SDK objects, because persistence later callsjson.dumps(...). Please restore serialization before returning or move serialization into the save path.
Other concerns:
The worker is started in startup.sh with python3 -m scanner.worker &, but there is no supervision or restart handling. If the worker crashes, the API can stay healthy while scans stop processing.
error_message = str(exc) may expose implementation details through the scan status endpoint. Public API responses should use sanitized error messages.
Timestamp handling looks inconsistent between datetime.now().isoformat(), SQL CURRENT_TIMESTAMP, and Flask JSON serialization.
The worker tests are too mocked to catch the major failure modes. They do not cover race conditions, migration backfill behaviour, stuck running scans, or serialization failure during real persistence.
Please address the migration backfill, atomic scan claiming, stale-running recovery, and serialization regression before merge. Happy to re-review after those are fixed.
|
Applied the changes and here are the summary @SHAURYAKSHARMA24 and ready for the reivew
Verification Results
|
- Sanitize worker error messages to prevent sensitive exception details from being exposed through the public API - Revert unrelated schema and search_path changes to maintain compatibility with existing public-schema deployments - Add column to preserve as the original queue timestamp - Improve migration logic to correctly backfill historical scans and repair incorrect statuses - Update worker tests to reflect generic error handling and the new atomic scan-claiming workflow
|
PR ready for review " 1. Atomic Scan Claiming: Refactored the worker to use a single atomic query with SKIP LOCKED. This eliminates the race condition where two workers could pick up the
|
What does this PR do?
This PR transitions the OpenShield scan execution model from a synchronous, blocking request path to a decoupled, asynchronous architecture. It introduces a
database-backed background worker to handle long-running Azure posture scans, ensuring the API remains highly responsive and immune to web server timeouts even when
scanning enterprise-scale subscriptions.
Type of change
Detailed Summary of Changes
unique scan ID. It no longer waits for the scan to finish.
database to function as a persistent, ACID-compliant task queue without requiring additional infrastructure like Redis.
the core scanning logic. It includes robust error handling to capture and persist tracebacks upon failure.
deployment experience.
developers.
Technical Rationale
Moving to a decoupled worker model addresses the fundamental limitation of synchronous web requests for security scanning. By using a database-backed queue rather
than ephemeral threads or complex message brokers, the system achieves maximum reliability with minimal infrastructure overhead. This architecture allows OpenShield
to compete with enterprise CSPM products by handling thousands of Azure resources without performance degradation.
Testing and Verification
components.
Checklist
Closes Issue #112