feat: implement asynchronous scan execution with background worker #129

Merged
Vishnu2707 merged 5 commits into openshield-org:dev from ritiksah141:feat/async-scan-execution on Jun 13, 2026

@ritiksah141

Collaborator

What does this PR do?
This PR transitions the OpenShield scan execution model from a synchronous, blocking request path to a decoupled, asynchronous architecture. It introduces a
database-backed background worker to handle long-running Azure posture scans, ensuring the API remains highly responsive and immune to web server timeouts even when
scanning enterprise-scale subscriptions.

Type of change

  • API endpoint (Added status polling and async trigger)
  • Documentation (Added async architecture guide)
  • Background worker implementation (Database-backed queue logic)
  • Stability/Performance improvement (Isolated external API latency)

Detailed Summary of Changes

  • Asynchronous Lifecycle: The POST /api/scans/trigger endpoint was modified to validate requests and immediately return an HTTP 202 Accepted response along with a
    unique scan ID. It no longer waits for the scan to finish.
  • Database-Backed Queue: The scans table in PostgreSQL was enhanced with status (pending, running, completed, failed) and error_message columns. This allows the
    database to function as a persistent, ACID-compliant task queue without requiring additional infrastructure like Redis.
  • Dedicated Background Worker: A new process, scanner/worker.py, was implemented to independently poll for pending scans, manage their state transitions, and execute
    the core scanning logic. It includes robust error handling to capture and persist tracebacks upon failure.
  • Status Polling Endpoint: Added GET /api/scans/<scan_id> to provide the frontend with real-time feedback on scan progress, completion timestamps, and error details.
  • Automatic Process Management: The startup.sh script was updated to automatically spawn the background worker alongside the Gunicorn web server, ensuring a seamless
    deployment experience.
  • Refined Documentation: Created docs/async-scan-architecture.md to explain the new system flow, technical rationale, and integration patterns for frontend
    developers.
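
For frontend or script authors, the 202-then-poll flow described above can be sketched roughly as follows. This is an illustrative client loop, not code from this PR: `poll_scan`, `fetch_status`, and the interval/timeout defaults are hypothetical, and the response is assumed to be a JSON object with a `status` field taking the values listed above.

```python
import time
from typing import Callable

# Statuses after which polling can stop (taken from the PR description).
TERMINAL_STATES = {"completed", "failed"}


def poll_scan(
    fetch_status: Callable[[str], dict],
    scan_id: str,
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status(scan_id) until the scan reaches a terminal state.

    fetch_status is a hypothetical callable that performs the status request
    for a scan ID and returns the decoded JSON body as a dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        body = fetch_status(scan_id)
        if body.get("status") in TERMINAL_STATES:
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError(f"scan {scan_id} did not finish within {timeout}s")
        sleep(interval)
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any HTTP library and makes it easy to exercise in tests.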

Technical Rationale
Moving to a decoupled worker model addresses the fundamental limitation of synchronous web requests for security scanning. By using a database-backed queue rather
than ephemeral threads or complex message brokers, the system achieves maximum reliability with minimal infrastructure overhead. This architecture allows OpenShield
to compete with enterprise CSPM products by handling thousands of Azure resources without performance degradation.

Testing and Verification

  • Unit Tests: Implemented tests/test_worker.py using industry-standard mocking to verify the worker state machine.
  • E2E Smoke Tests: Hardened tests/smoke_test.py to verify the full async lifecycle, including successful 202 responses and status polling.
  • Local CI: Successfully ran the consolidated ci.yml logic locally, verifying syntax, rule structure, and security measures across all 44 rule files and new backend
    components.
  • Dependency Audit: Verified that requirements.txt correctly covers all imports used in the new asynchronous logic.

Checklist

  • My code follows the rule template in CONTRIBUTING.md
  • I have not committed any real Azure credentials
  • My branch name follows the convention: feat/description

Closes Issue #112

@ritiksah141 ritiksah141 self-assigned this Jun 6, 2026
@Vishnu2707 Vishnu2707 requested review from SHAURYAKSHARMA24, TFT444 and safidnadaf and removed request for Vishnu2707, m-khan-97 and safidnadaf June 10, 2026 23:02

@SHAURYAKSHARMA24 SHAURYAKSHARMA24 left a comment

Collaborator

Requesting changes.

The async direction is right and the 202 + polling API shape is useful, but I don’t think this is safe to merge yet. There are several correctness issues in the worker/queue design and migration path.

Blocking issues:

  1. Migration may re-run historical scans
    ALTER TABLE scans ADD COLUMN status TEXT DEFAULT 'pending' appears to backfill existing rows as pending. That means old completed scans could be picked up by the worker and re-executed after deployment. This risks duplicate findings, unnecessary Azure API calls, and real cloud cost. Existing completed rows should be backfilled to completed, not pending.

  2. Scan claiming is not atomic
    The worker currently does a read-then-write flow: fetch pending scans, then update each one to running. If more than one worker/container is active, multiple workers can select the same pending scan before either updates it. This needs an atomic claim pattern, such as UPDATE ... WHERE status='pending' RETURNING * or SELECT ... FOR UPDATE SKIP LOCKED.

  3. Running scans can get stuck forever
    If the worker dies after setting a scan to running but before saving the completed result or marking failure, that scan remains running permanently. There is no heartbeat, lease, timeout, retry counter, or stale-job recovery. Please add a recovery mechanism for stale running scans.

  4. Serialization regression in ScanEngine.run_scan()
    run_scan() previously returned make_serializable(result) but now appears to return result directly. That risks failures when findings contain datetimes, sets, or Azure SDK objects, because persistence later calls json.dumps(...). Please restore serialization before returning or move serialization into the save path.
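
To illustrate item 4, here is a rough sketch of what a serialization helper of this kind might do. It is a hypothetical stand-in, not the repository's actual `make_serializable`; the handling of objects through `vars()` is an assumption about how SDK-style models could be flattened.

```python
import json
from datetime import date, datetime, timezone


def make_serializable(value):
    """Recursively convert a value into something json.dumps accepts.

    Hypothetical stand-in for the helper named in this review.
    """
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, dict):
        return {str(k): make_serializable(v) for k, v in value.items()}
    if isinstance(value, (set, frozenset)):
        # Sets have no JSON form; emit a list in a stable order.
        return sorted((make_serializable(v) for v in value), key=repr)
    if isinstance(value, (list, tuple)):
        return [make_serializable(v) for v in value]
    if hasattr(value, "__dict__"):
        # Assumption: plain objects are flattened via their public attributes.
        public = {k: v for k, v in vars(value).items() if not k.startswith("_")}
        return make_serializable(public)
    return str(value)  # Last resort: fall back to the string representation.
```

With something like this applied before `run_scan()` returns (or inside the save path), `json.dumps(...)` no longer depends on every finding containing only JSON-native types.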

Other concerns:

The worker is started in startup.sh with python3 -m scanner.worker &, but there is no supervision or restart handling. If the worker crashes, the API can stay healthy while scans stop processing.
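
As one possible shape for supervision, here is a minimal restart loop with exponential backoff. It is only a sketch: `supervise`, `run_once`, and the delay values are hypothetical, and a process manager such as systemd or a container restart policy would serve the same purpose.

```python
import time
from typing import Callable


def supervise(
    run_once: Callable[[], int],
    max_restarts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-run run_once until it exits with 0 or the restart budget is spent.

    run_once is a hypothetical callable that runs the worker to completion
    and returns its exit code. Returns the number of restarts performed.
    """
    restarts = 0
    while True:
        code = run_once()
        if code == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(f"worker exited with {code} after {restarts} restarts")
        # Back off exponentially so a crash loop does not spin the CPU.
        sleep(min(max_delay, base_delay * 2 ** restarts))
        restarts += 1
```

In practice `run_once` could be something like `lambda: subprocess.call([sys.executable, "-m", "scanner.worker"])`, which returns the child's exit code.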

error_message = str(exc) may expose implementation details through the scan status endpoint. Public API responses should use sanitized error messages.
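
A sanitized message could look roughly like the sketch below: the full exception goes to the server log, and the API only ever sees a generic string plus a correlation reference. The function name, logger name, and message text are all hypothetical.

```python
import logging
import uuid

logger = logging.getLogger("scanner.worker")  # hypothetical logger name

PUBLIC_ERROR = "Scan failed due to an internal error."


def public_error_message(exc: BaseException) -> str:
    """Log the full exception internally and return a generic public message."""
    ref = uuid.uuid4().hex[:8]  # short reference to correlate API errors with logs
    logger.error("scan failed (ref=%s)", ref, exc_info=exc)
    return f"{PUBLIC_ERROR} Reference: {ref}"
```

The value stored in `error_message` would then be the return value of this function rather than `str(exc)`.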

Timestamp handling looks inconsistent between datetime.now().isoformat(), SQL CURRENT_TIMESTAMP, and Flask JSON serialization.
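
One way to make this consistent is to always write timezone-aware UTC timestamps and normalize on read. The helpers below are a sketch with hypothetical names; treating naive values as UTC is an assumption that would need checking against how existing rows were written.

```python
from datetime import datetime, timezone


def utc_now_iso() -> str:
    """Current time as a timezone-aware ISO 8601 string in UTC."""
    return datetime.now(timezone.utc).isoformat()


def parse_iso_utc(text: str) -> datetime:
    """Parse an ISO 8601 string and normalize it to UTC."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: naive values are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```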

The worker tests are too mocked to catch the major failure modes. They do not cover race conditions, migration backfill behaviour, stuck running scans, or serialization failure during real persistence.

Please address the migration backfill, atomic scan claiming, stale-running recovery, and serialization regression before merge. Happy to re-review after those are fixed.
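
To make the backfill concern concrete, here is a small reproduction using an in-memory SQLite database as a stand-in for PostgreSQL. The table layout (`id`, `completed_at`) and the function name are assumptions for illustration; the point is only that rows existing at migration time should end up as completed, while rows inserted afterwards still default to pending.

```python
import sqlite3


def migrate_add_status(conn: sqlite3.Connection) -> None:
    """Add the status column and backfill pre-existing rows as completed."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(scans)")}
    if "status" in columns:
        return  # Already migrated; keep the migration idempotent.
    with conn:
        # New rows default to 'pending' so the worker picks them up.
        conn.execute("ALTER TABLE scans ADD COLUMN status TEXT DEFAULT 'pending'")
        # Every row present at this point predates the worker: mark it done.
        conn.execute("UPDATE scans SET status = 'completed'")
```

In PostgreSQL the `ALTER TABLE` and the `UPDATE` can run inside one transaction, so the worker never observes historical rows in the pending state.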

@ritiksah141

Collaborator Author

Applied the changes. Here is the summary, @SHAURYAKSHARMA24. Ready for review.

  1. Migration Correctness:
    • Updated api/models/finding.py to backfill existing scans as 'completed' instead of 'pending'. This prevents historical scans from being picked up and
      re-executed by the new worker.
  2. Atomic Scan Claiming:
    • Added claim_next_pending_scan() to DatabaseManager, utilizing PostgreSQL's FOR UPDATE SKIP LOCKED pattern.
    • Refactored the worker to use this atomic method, ensuring multiple worker instances won't process the same scan.
  3. Stale Scan Recovery:
    • Implemented recover_stale_scans() in DatabaseManager to mark scans stuck in the 'running' state for over 60 minutes as 'failed'.
    • The worker now runs this cleanup at the start of its polling loop.
  4. Serialization Regression:
    • Restored the call to make_serializable(result) in ScanEngine.run_scan(), ensuring all results are safe for JSON persistence.
  5. Worker Supervision:
    • Modified startup.sh to wrap the background worker in a while loop, providing basic auto-restart capabilities if the process crashes.
  6. Error Sanitization & Timestamps:
    • The worker now sanitizes public-facing error messages to avoid exposing stack traces via the API.
    • Standardized timestamp handling across the persistence layer using timezone-aware ISO formats.
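
For readers unfamiliar with the claiming pattern in item 2, here is a rough sketch. The PostgreSQL statement is shown as a string only and is not executed; the runnable part uses SQLite with a compare-and-set `UPDATE`, a simpler variant of the same idea. Table and column names follow this thread, but the function body is hypothetical and is not the `claim_next_pending_scan()` added in this PR.

```python
import sqlite3
from typing import Optional

# PostgreSQL form of the claim (illustrative, not executed in this sketch).
CLAIM_SQL_POSTGRES = """
UPDATE scans
   SET status = 'running', claimed_at = CURRENT_TIMESTAMP
 WHERE id = (SELECT id
               FROM scans
              WHERE status = 'pending'
              ORDER BY id
              LIMIT 1
                FOR UPDATE SKIP LOCKED)
RETURNING id
"""


def claim_next_pending(conn: sqlite3.Connection) -> Optional[int]:
    """Claim one pending scan, or return None if the queue is empty."""
    while True:
        row = conn.execute(
            "SELECT id FROM scans WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        with conn:
            # The status check in WHERE makes the claim a single atomic step:
            # if another worker got there first, no row is updated.
            cur = conn.execute(
                "UPDATE scans SET status = 'running' "
                "WHERE id = ? AND status = 'pending'",
                (row[0],),
            )
        if cur.rowcount == 1:
            return row[0]
        # Lost the race for this row; look for the next pending scan.
```

The `SKIP LOCKED` form avoids the retry loop because rows locked by another worker are skipped at selection time.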

Verification Results

  • Unit Tests: Updated tests/test_worker.py to verify the new atomic state machine and recovery logic. All 3 tests passed successfully.
  • Code Quality: Verified that make_serializable is correctly applied and that database connections are handled safely within the new methods.

- Sanitize worker error messages to prevent sensitive exception details from being exposed through the public API

- Revert unrelated schema and search_path changes to maintain compatibility with existing public-schema deployments

- Add  column to preserve  as the original queue timestamp

- Improve migration logic to correctly backfill historical scans and repair incorrect  statuses

- Update worker tests to reflect generic error handling and the new atomic scan-claiming workflow
@ritiksah141

Collaborator Author

@SHAURYAKSHARMA24

PR ready for review.

  1. Atomic Scan Claiming: Refactored the worker to use a single atomic query with SKIP LOCKED. This eliminates the race condition where two workers could pick up the same scan.
  2. Robust Error Sanitization: The GET /api/scans/ endpoint now only exposes a generic internal error message. Detailed Azure exceptions (which often contain sensitive IDs) are now restricted to internal server logs only.
  3. Preserved Queued Timestamps: Added a claimed_at column. started_at now represents the time the scan was queued, while claimed_at tracks when the worker started processing. This provides accurate "Time in Queue" metrics.
  4. Stale Job Recovery: Added a background cleanup task that identifies scans stuck in the running state for more than 60 minutes and marks them as failed with a timeout message.
  5. Serialization Regression: Restored make_serializable in the ScanEngine, preventing crashes when findings contain non-standard objects (like Azure SDK models or datetimes).
  6. Worker Supervision: Updated startup.sh to wrap the worker in a Bash until loop, ensuring the process respawns automatically if it crashes.

  • Existing Data Safety: Confirmed that historical scans are backfilled with status = 'completed', preventing them from being re-scanned.
  • Fix-up Logic: Added logic to "fix-up" any scans that might have been incorrectly marked as pending during previous partial deployments of this PR.
  • Running Job Stability: Verified that scans already in progress during the upgrade are correctly backfilled with a claimed_at value, ensuring they aren't immediately aborted by the new stale-recovery logic.
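
The stale-job recovery described above might look roughly like this. It is a sketch against SQLite with ISO 8601 UTC strings in `claimed_at`; the function body, the timeout message, and the storage format are assumptions, not the `recover_stale_scans()` from this PR.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_AFTER = timedelta(minutes=60)  # threshold mentioned in this thread


def recover_stale_scans(conn: sqlite3.Connection, now: Optional[datetime] = None) -> int:
    """Mark scans stuck in 'running' past the threshold as failed.

    Returns the number of scans that were marked as failed.
    """
    now = now or datetime.now(timezone.utc)
    # Assumption: claimed_at holds UTC ISO 8601 strings, which sort
    # chronologically when compared as text.
    cutoff = (now - STALE_AFTER).isoformat()
    with conn:
        cur = conn.execute(
            "UPDATE scans SET status = 'failed', error_message = ? "
            "WHERE status = 'running' AND claimed_at < ?",
            ("Scan timed out", cutoff),
        )
    return cur.rowcount
```

Rows with a NULL `claimed_at` never match the comparison, which is why running scans need a backfilled `claimed_at` value to be covered by recovery at all.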

@SHAURYAKSHARMA24 SHAURYAKSHARMA24 self-requested a review June 12, 2026 16:54

@TFT444 TFT444 left a comment

Collaborator

All good, ready to merge.

@Vishnu2707 Vishnu2707 merged commit d7c59db into openshield-org:dev Jun 13, 2026
1 check passed
@ritiksah141 ritiksah141 deleted the feat/async-scan-execution branch June 13, 2026 01:40
4 participants