feat: implement asynchronous scan execution with background worker by ritiksah141 · Pull Request #129 · openshield-org/openshield

ritiksah141 · 2026-06-06T01:10:34Z

What does this PR do?
This PR transitions the OpenShield scan execution model from a synchronous, blocking request path to a decoupled, asynchronous architecture. It introduces a
database-backed background worker to handle long-running Azure posture scans, ensuring the API remains highly responsive and immune to web server timeouts even when
scanning enterprise-scale subscriptions.

Type of change

API endpoint (Added status polling and async trigger)
Documentation (Added async architecture guide)
Background worker implementation (Database-backed queue logic)
Stability/Performance improvement (Isolated external API latency)

Detailed Summary of Changes

Asynchronous Lifecycle: The POST /api/scans/trigger endpoint was modified to validate requests and immediately return an HTTP 202 Accepted response along with a
unique scan ID. It no longer waits for the scan to finish.
Database-Backed Queue: The scans table in PostgreSQL was enhanced with status (pending, running, completed, failed) and error_message columns. This allows the
database to function as a persistent, ACID-compliant task queue without requiring additional infrastructure like Redis.
Dedicated Background Worker: A new process, scanner/worker.py, was implemented to independently poll for pending scans, manage their state transitions, and execute
the core scanning logic. It includes robust error handling to capture and persist tracebacks upon failure.
Status Polling Endpoint: Added GET /api/scans/<scan_id> to provide the frontend with real-time feedback on scan progress, completion timestamps, and error details.
Automatic Process Management: The startup.sh script was updated to automatically spawn the background worker alongside the Gunicorn web server, ensuring a seamless
deployment experience.
Refined Documentation: Created docs/async-scan-architecture.md to explain the new system flow, technical rationale, and integration patterns for frontend
developers.
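
The trigger and polling flow described above can be sketched roughly as follows. This is a minimal illustration rather than the PR's actual code: it uses plain functions returning `(body, http_status)` instead of Flask routes, SQLite as a stand-in for PostgreSQL, and a hypothetical `subscription_id` request field.

```python
import sqlite3
import uuid


def init_db(conn):
    # Simplified scans table; the real schema has more columns.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scans ("
        " id TEXT PRIMARY KEY,"
        " subscription_id TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " error_message TEXT)"
    )


def trigger_scan(conn, payload):
    """Validate the request, enqueue a pending scan, and return at once."""
    subscription_id = (payload or {}).get("subscription_id")  # hypothetical field
    if not subscription_id:
        return {"error": "subscription_id is required"}, 400
    scan_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO scans (id, subscription_id, status) VALUES (?, ?, 'pending')",
        (scan_id, subscription_id),
    )
    conn.commit()
    # 202 Accepted: the scan is queued, not finished.
    return {"scan_id": scan_id, "status": "pending"}, 202


def get_scan_status(conn, scan_id):
    """Return the current queue state for one scan (the polling endpoint)."""
    row = conn.execute(
        "SELECT id, status, error_message FROM scans WHERE id = ?", (scan_id,)
    ).fetchone()
    if row is None:
        return {"error": "scan not found"}, 404
    return {"scan_id": row[0], "status": row[1], "error_message": row[2]}, 200


conn = sqlite3.connect(":memory:")
init_db(conn)
body, code = trigger_scan(conn, {"subscription_id": "sub-123"})
status_body, status_code = get_scan_status(conn, body["scan_id"])
```

A frontend would call the trigger once, then poll the status endpoint until `status` leaves `pending`/`running`.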

Technical Rationale
Moving to a decoupled worker model addresses the fundamental limitation of synchronous web requests for security scanning: a long scan can outlive the web server's
request timeout. By using a database-backed queue rather than ephemeral threads or a separate message broker, the system keeps queued scans durable across restarts
with minimal infrastructure overhead. This lets OpenShield scan subscriptions with thousands of Azure resources without holding an HTTP request open.

Testing and Verification

Unit Tests: Implemented tests/test_worker.py using industry-standard mocking to verify the worker state machine.
E2E Smoke Tests: Hardened tests/smoke_test.py to verify the full async lifecycle, including successful 202 responses and status polling.
Local CI: Successfully ran the consolidated ci.yml logic locally, verifying syntax, rule structure, and security measures across all 44 rule files and new backend
components.
Dependency Audit: Verified that requirements.txt correctly covers all imports used in the new asynchronous logic.

Checklist

My code follows the rule template in CONTRIBUTING.md
I have not committed any real Azure credentials
My branch name follows the convention: feat/description

Closes Issue #112


SHAURYAKSHARMA24

Requesting changes.

The async direction is right and the 202 + polling API shape is useful, but I don’t think this is safe to merge yet. There are several correctness issues in the worker/queue design and migration path.

Blocking issues:

Migration may re-run historical scans
ALTER TABLE scans ADD COLUMN status TEXT DEFAULT 'pending' appears to backfill existing rows as pending. That means old completed scans could be picked up by the worker and re-executed after deployment. This risks duplicate findings, unnecessary Azure API calls, and real cloud cost. Existing completed rows should be backfilled to completed, not pending.
Scan claiming is not atomic
The worker currently does a read-then-write flow: fetch pending scans, then update each one to running. If more than one worker/container is active, multiple workers can select the same pending scan before either updates it. This needs an atomic claim pattern, such as UPDATE ... WHERE status='pending' RETURNING * or SELECT ... FOR UPDATE SKIP LOCKED.
Running scans can get stuck forever
If the worker dies after setting a scan to running but before saving the completed result or marking failure, that scan remains running permanently. There is no heartbeat, lease, timeout, retry counter, or stale-job recovery. Please add a recovery mechanism for stale running scans.
Serialization regression in ScanEngine.run_scan()
run_scan() previously returned make_serializable(result) but now appears to return result directly. That risks failures when findings contain datetimes, sets, or Azure SDK objects, because persistence later calls json.dumps(...). Please restore serialization before returning or move serialization into the save path.
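
To illustrate the serialization concern: the repository's `make_serializable` is not shown in this thread, so the function below is a hypothetical re-implementation of what such a helper might do, and `FakeSdkModel` is an invented stand-in for an Azure SDK object.

```python
import json
from datetime import date, datetime, timezone


def make_serializable(value):
    """Recursively convert a value into something json.dumps accepts."""
    if isinstance(value, dict):
        return {str(k): make_serializable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [make_serializable(v) for v in value]
    if isinstance(value, (set, frozenset)):
        # Sets have no JSON form; emit a list in a stable order.
        return sorted((make_serializable(v) for v in value), key=repr)
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if hasattr(value, "__dict__"):
        # SDK-style objects: fall back to their attribute dictionary.
        return make_serializable(vars(value))
    return str(value)


class FakeSdkModel:
    """Invented stand-in for an SDK model object."""

    def __init__(self):
        self.name = "storage1"
        self.tags = {"prod"}


result = {
    "scanned_at": datetime(2026, 6, 6, tzinfo=timezone.utc),
    "regions": {"westeurope", "eastus"},
    "resource": FakeSdkModel(),
}

# json.dumps(result) would raise TypeError; the converted form does not.
encoded = json.dumps(make_serializable(result))
```

Whether conversion happens in `run_scan()` or in the save path matters less than guaranteeing it happens before `json.dumps(...)`.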

Other concerns:

The worker is started in startup.sh with python3 -m scanner.worker &, but there is no supervision or restart handling. If the worker crashes, the API can stay healthy while scans stop processing.

error_message = str(exc) may expose implementation details through the scan status endpoint. Public API responses should use sanitized error messages.
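
One way to do this, sketched under assumptions: log the full exception internally and persist only a fixed, generic message. The logger name, the message text, and the `sanitize_failure` helper are all hypothetical.

```python
import logging

logger = logging.getLogger("scanner.worker")  # logger name is an assumption

PUBLIC_ERROR = "Scan failed due to an internal error."


def sanitize_failure(scan_id, exc):
    """Log full details server-side; return the text safe to store and expose."""
    logger.error(
        "scan %s failed", scan_id, exc_info=(type(exc), exc, exc.__traceback__)
    )
    return PUBLIC_ERROR


try:
    raise RuntimeError("AuthorizationFailed for subscription 0000-secret-id")
except RuntimeError as exc:
    # Only the generic message would be written to scans.error_message.
    error_message = sanitize_failure("scan-1", exc)
```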

Timestamp handling looks inconsistent between datetime.now().isoformat(), SQL CURRENT_TIMESTAMP, and Flask JSON serialization.
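
A possible way to make this consistent, as a sketch: always write timezone-aware UTC values, and normalize on read. Treating legacy naive values as UTC is an assumption here, and both helper names are hypothetical.

```python
from datetime import datetime, timezone


def utc_now_iso():
    """Timezone-aware UTC timestamp, unlike naive datetime.now().isoformat()."""
    return datetime.now(timezone.utc).isoformat()


def parse_iso(value):
    """Parse an ISO 8601 string and normalize it to UTC."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: legacy naive values were written in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


naive = parse_iso("2026-06-06 01:10:34")
offset = parse_iso("2026-06-06T03:10:34+02:00")
```

Both inputs above normalize to the same instant, which is what the API would then serialize.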

The worker tests are too mocked to catch the major failure modes. They do not cover race conditions, migration backfill behaviour, stuck running scans, or serialization failure during real persistence.

Please address the migration backfill, atomic scan claiming, stale-running recovery, and serialization regression before merge. Happy to re-review after those are fixed.
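
For reference, a sketch of an atomic claim. The PostgreSQL statement is included as a string only and is not executed here. The runnable part uses SQLite and a compare-and-set `UPDATE` (a portable variant of the same idea, not `SKIP LOCKED` itself); the table layout and the helper names `find_candidate` and `try_claim` are assumptions.

```python
import sqlite3

# PostgreSQL form of the pattern named in the review (shown, not executed).
PG_CLAIM_SQL = """
UPDATE scans SET status = 'running'
WHERE id = (
    SELECT id FROM scans
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""


def find_candidate(conn):
    """Read step: pick the oldest pending scan id, if any."""
    row = conn.execute(
        "SELECT id FROM scans WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    return None if row is None else row[0]


def try_claim(conn, scan_id):
    """Write step: succeeds only if the row is still pending."""
    cur = conn.execute(
        "UPDATE scans SET status = 'running' WHERE id = ? AND status = 'pending'",
        (scan_id,),
    )
    conn.commit()
    return cur.rowcount == 1


def claim_next_pending_scan(conn):
    scan_id = find_candidate(conn)
    if scan_id is not None and try_claim(conn, scan_id):
        return scan_id
    return None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scans (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.executemany("INSERT INTO scans (status) VALUES (?)", [("pending",), ("pending",)])
conn.commit()

# Simulate the race: both workers read the same candidate before either writes.
seen_by_a = find_candidate(conn)
seen_by_b = find_candidate(conn)
a_won = try_claim(conn, seen_by_a)
b_won = try_claim(conn, seen_by_b)  # loses: the row is no longer pending
```

The `AND status = 'pending'` guard is what makes the second writer lose; a plain read-then-write without it would let both workers proceed.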

ritiksah141 · 2026-06-11T16:58:38Z

@SHAURYAKSHARMA24 I have applied the requested changes; a summary follows. Ready for review.

Migration Correctness:
- Updated api/models/finding.py to backfill existing scans as 'completed' instead of 'pending'. This prevents historical scans from being picked up and
  re-executed by the new worker.
Atomic Scan Claiming:
- Added claim_next_pending_scan() to DatabaseManager, utilizing PostgreSQL's FOR UPDATE SKIP LOCKED pattern.
- Refactored the worker to use this atomic method, ensuring multiple worker instances won't process the same scan.
Stale Scan Recovery:
- Implemented recover_stale_scans() in DatabaseManager to mark scans stuck in the 'running' state for over 60 minutes as 'failed'.
- The worker now runs this cleanup at the start of its polling loop.
Serialization Regression:
- Restored the call to make_serializable(result) in ScanEngine.run_scan(), ensuring all results are safe for JSON persistence.
Worker Supervision:
- Modified startup.sh to wrap the background worker in a while loop, providing basic auto-restart capabilities if the process crashes.
Error Sanitization & Timestamps:
- The worker now sanitizes public-facing error messages to avoid exposing stack traces via the API.
- Standardized timestamp handling across the persistence layer using timezone-aware ISO formats.
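
The stale-scan recovery described above might look roughly like this. It is a sketch, not the PR's `DatabaseManager` method: it runs against SQLite, and the `claimed_at` column name, the timeout text, and the injectable `now` parameter are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=60)


def recover_stale_scans(conn, now=None):
    """Mark scans that have been 'running' longer than STALE_AFTER as failed."""
    now = now or datetime.now(timezone.utc)
    # Uniform ISO strings in UTC compare correctly as text.
    cutoff = (now - STALE_AFTER).isoformat(timespec="seconds")
    cur = conn.execute(
        "UPDATE scans SET status = 'failed', error_message = ? "
        "WHERE status = 'running' AND claimed_at < ?",
        ("Scan timed out.", cutoff),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scans (id INTEGER PRIMARY KEY, status TEXT,"
    " claimed_at TEXT, error_message TEXT)"
)
conn.executemany(
    "INSERT INTO scans (status, claimed_at) VALUES (?, ?)",
    [
        ("running", "2026-06-12T10:30:00+00:00"),    # 90 minutes old: stale
        ("running", "2026-06-12T11:30:00+00:00"),    # 30 minutes old: fresh
        ("completed", "2026-06-12T09:00:00+00:00"),  # not running: untouched
    ],
)
conn.commit()

recovered = recover_stale_scans(
    conn, now=datetime(2026, 6, 12, 12, 0, tzinfo=timezone.utc)
)
statuses = [row[0] for row in conn.execute("SELECT status FROM scans ORDER BY id")]
```

A fixed timeout is the simplest recovery rule; a scan that legitimately runs longer than the threshold would also be marked failed.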

Verification Results

Unit Tests: Updated tests/test_worker.py to verify the new atomic state machine and recovery logic. All 3 tests passed successfully.
Code Quality: Verified that make_serializable is correctly applied and that database connections are handled safely within the new methods.

- Sanitize worker error messages to prevent sensitive exception details from being exposed through the public API
- Revert unrelated schema and search_path changes to maintain compatibility with existing public-schema deployments
- Add claimed_at column to preserve started_at as the original queue timestamp
- Improve migration logic to correctly backfill historical scans and repair incorrect statuses
- Update worker tests to reflect generic error handling and the new atomic scan-claiming workflow

ritiksah141 · 2026-06-12T16:53:23Z

@SHAURYAKSHARMA24

PR ready for review.

1. Atomic Scan Claiming: Refactored the worker to use a single atomic query with SKIP LOCKED. This eliminates the race condition where two workers could pick up the same scan.
2. Robust Error Sanitization: The GET /api/scans/<scan_id> endpoint now only exposes a generic internal error message. Detailed Azure exceptions (which often contain sensitive IDs) are now restricted to internal server logs.
3. Preserved Queued Timestamps: Added a claimed_at column. started_at now represents the time the scan was queued, while claimed_at tracks when the worker started processing. This provides accurate "Time in Queue" metrics.
4. Stale Job Recovery: Added a background cleanup task that identifies scans stuck in the running state for more than 60 minutes and marks them as failed with a timeout message.
5. Serialization Regression: Restored make_serializable in the ScanEngine, preventing crashes when findings contain non-standard objects (like Azure SDK models or datetimes).
6. Worker Supervision: Updated startup.sh to wrap the worker in a Bash until loop, ensuring the process respawns automatically if it crashes.

Migration checks:

- Existing Data Safety: Confirmed that historical scans are backfilled with status = 'completed', preventing them from being re-scanned.
- Fix-up Logic: Added logic to "fix up" any scans that might have been incorrectly marked as pending during previous partial deployments of this PR.
- Running Job Stability: Verified that scans already in progress during the upgrade are correctly backfilled with a claimed_at value, ensuring they aren't immediately aborted by the new stale-recovery logic.
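
The migration behaviour described above could be sketched as follows. This is an illustration on SQLite, not the code in api/models/finding.py. Two details are assumptions, since the thread does not state them: wrongly pending rows are detected by a non-null `completed_at`, and in-progress rows get the migration time as their `claimed_at`.

```python
import sqlite3
from datetime import datetime, timezone


def migrate_scans(conn, now=None):
    """Idempotent migration: add queue columns and backfill safely."""
    now_iso = (now or datetime.now(timezone.utc)).isoformat(timespec="seconds")
    columns = {row[1] for row in conn.execute("PRAGMA table_info(scans)")}
    if "status" not in columns:
        conn.execute("ALTER TABLE scans ADD COLUMN status TEXT DEFAULT 'pending'")
        # Rows that predate the queue are historical: never re-run them.
        conn.execute("UPDATE scans SET status = 'completed'")
    if "claimed_at" not in columns:
        conn.execute("ALTER TABLE scans ADD COLUMN claimed_at TEXT")
    # Fix-up (assumed rule): a pending row that already has a completion time
    # was mislabelled by an earlier partial deployment.
    conn.execute(
        "UPDATE scans SET status = 'completed' "
        "WHERE status = 'pending' AND completed_at IS NOT NULL"
    )
    # In-progress rows get a claimed_at so stale recovery does not abort them.
    conn.execute(
        "UPDATE scans SET claimed_at = ? "
        "WHERE status = 'running' AND claimed_at IS NULL",
        (now_iso,),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scans (id INTEGER PRIMARY KEY, started_at TEXT, completed_at TEXT)"
)
conn.executemany(
    "INSERT INTO scans (started_at, completed_at) VALUES (?, ?)",
    [
        ("2026-05-01T10:00:00+00:00", "2026-05-01T10:05:00+00:00"),
        ("2026-05-02T10:00:00+00:00", "2026-05-02T10:07:00+00:00"),
    ],
)
migrate_scans(conn)
historical = [row[0] for row in conn.execute("SELECT status FROM scans ORDER BY id")]

# Simulate leftovers from a partial deployment, then migrate again.
conn.execute(
    "INSERT INTO scans (started_at, completed_at, status) VALUES "
    "('2026-06-01T09:00:00+00:00', '2026-06-01T09:04:00+00:00', 'pending')"
)
conn.execute(
    "INSERT INTO scans (started_at, status) VALUES "
    "('2026-06-12T11:50:00+00:00', 'running')"
)
conn.execute(
    "INSERT INTO scans (started_at, status) VALUES "
    "('2026-06-12T11:55:00+00:00', 'pending')"
)
migrate_scans(conn, now=datetime(2026, 6, 12, 12, 0, tzinfo=timezone.utc))
after = conn.execute("SELECT status, claimed_at FROM scans ORDER BY id").fetchall()
```

The genuinely queued scan in the last row stays `pending`, so only mislabelled rows are repaired.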

TFT444

all good ready to merge

ritiksah141 added 3 commits June 6, 2026 01:49

feat: implement asynchronous scan execution with background worker

8746489

chore: async scan architecture with 100% verified test suite

93552e2

feat: complete transition to async scan architecture with verified E2…

9fa418e

…E suite and docs

ritiksah141 requested review from Vishnu2707 and m-khan-97 June 6, 2026 01:15

ritiksah141 self-assigned this Jun 6, 2026

Vishnu2707 requested review from SHAURYAKSHARMA24, TFT444 and safidnadaf and removed request for Vishnu2707, m-khan-97 and safidnadaf June 10, 2026 23:02

SHAURYAKSHARMA24 requested changes Jun 11, 2026


fix: addressed the requested changes

a3992f5

SHAURYAKSHARMA24 self-requested a review June 12, 2026 16:54

SHAURYAKSHARMA24 approved these changes Jun 12, 2026


TFT444 approved these changes Jun 12, 2026


Vishnu2707 merged commit d7c59db into openshield-org:dev Jun 13, 2026
1 check passed

ritiksah141 deleted the feat/async-scan-execution branch June 13, 2026 01:40

Vishnu2707 merged 5 commits into openshield-org:dev from ritiksah141:feat/async-scan-execution
