fix(scrapy): async-thread startup race, shutdown lifecycle, and timeout setting#979

Open
vdusek wants to merge 12 commits into master from fix/async-thread-startup-race

@vdusek vdusek commented Jun 12, 2026

Contributor

Description

Fixes several defects in the Scrapy integration's background event-loop thread (AsyncThread), the scheduler, and the HTTP cache storage, and makes the loop timeout configurable.

Fixes

  • run_coro startup race — the is_running() guard fired spuriously when a coroutine was submitted before the loop thread reached run_forever() (observed ~122/500 in scheduler.open()). It now guards on is_closed(). A coroutine queued on a not-yet-running loop runs once the loop starts; only a closed loop raises.
  • close() thread leak — if task cancellation timed out or raised, the loop was never stopped or joined. Stop, join, and the forced-shutdown fallback now run in a finally, and the original error still propagates.
  • close() second call — a repeated close raised RuntimeError: Event loop is closed. An is_closed() early-return makes it a no-op.
  • close() ignored its timeout for the cancellation step (it used the constructor default). It now passes the caller's timeout through.
  • run_coro timeout left the coroutine running. It now cancels the future on timeout.
  • HTTP cache open/cleanup thread leaks: open_spider now closes the thread if opening the key-value store fails (matching ApifyScheduler.open). The expiration sweep runs inside try with close() in a finally.
  • Configurable timeout (refactor(scrapy): make AsyncThread timeout configurable #955) — new APIFY_ASYNC_THREAD_TIMEOUT_SECS setting, wired into the scheduler (via from_crawler) and the cache storage.
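
The lifecycle fixes listed above can be illustrated with a minimal sketch. This is not the SDK's code: SketchAsyncThread, its method bodies, and the 60-second default are assumptions made for illustration, and the forced-shutdown fallback is omitted. It shows only the is_closed() guard, cancellation on timeout, the idempotent close, and stop/join inside a finally.

```python
import asyncio
import threading
from concurrent.futures import TimeoutError as FutureTimeoutError
from typing import Any, Coroutine, Optional


class SketchAsyncThread:
    """Hypothetical stand-in for AsyncThread (illustration only)."""

    def __init__(self, default_timeout: float = 60.0) -> None:
        self._default_timeout = default_timeout
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        asyncio.set_event_loop(self._loop)
        try:
            self._loop.run_forever()
        finally:
            self._loop.close()

    def run_coro(self, coro: Coroutine[Any, Any, Any], timeout: Optional[float] = None) -> Any:
        # Guard on is_closed(), not is_running(): a coroutine queued before the
        # thread reaches run_forever() simply runs once the loop starts.
        if self._loop.is_closed():
            coro.close()  # avoid a "coroutine was never awaited" warning
            raise RuntimeError('Event loop is closed.')
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        try:
            return future.result(self._default_timeout if timeout is None else timeout)
        except FutureTimeoutError:
            # Cancelling the concurrent future also cancels the task in the loop.
            future.cancel()
            raise

    async def _cancel_pending_tasks(self) -> None:
        current = asyncio.current_task()
        tasks = [t for t in asyncio.all_tasks() if t is not current]
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)

    def close(self, timeout: Optional[float] = None) -> None:
        if self._loop.is_closed():
            return  # a repeated close is a no-op
        effective = self._default_timeout if timeout is None else timeout
        try:
            # The caller's timeout is passed through to the cancellation step.
            self.run_coro(self._cancel_pending_tasks(), timeout=effective)
        finally:
            # Stop and join even if cancellation timed out or raised;
            # the original error still propagates out of the finally.
            self._loop.call_soon_threadsafe(self._loop.stop)
            self._thread.join(timeout=effective)
```

In this sketch the loop is closed by its own thread once run_forever() returns, so a second close() observes is_closed() and returns immediately.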

Error logging

The integration now follows consistent conventions for caught exceptions:

  • except … as exc: logger.warning(f'… {exc}'), swallowed. Used for expected, recoverable conditions handled locally: a malformed or legacy stored payload skipped as a cache/queue miss, or non-UTF-8 headers preserved in the serialized request. The log is a short message plus the exception text, with no traceback, because it is not a bug.
  • except Exception: logger.exception('…'), swallowed. Used for unexpected failures handled at a terminal point: the cleanup sweep, shutdown, or skip-and-continue. logger.exception attaches the full traceback, and nothing re-raises because the error is handled here.
  • except …: raise (no logging). Used when the error is re-raised and the caller or Scrapy logs it with a traceback anyway. run_coro's timeout path cancels the future and re-raises without logging, so the failure is reported once.
  • except Exception: logger.exception('…'); raise. The boundary log, used only where local context materially helps and the propagated error would otherwise be logged only generically or not at all. The scheduler's next_request / enqueue_request / has_pending_requests are called synchronously by the Scrapy engine (not inside a Deferred), so without this log the Apify-specific context would be lost.
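
The four conventions above can be sketched side by side. The function names, log messages, and the 'sketch' logger are hypothetical and chosen only for illustration; they are not taken from the integration's code.

```python
import concurrent.futures
import json
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger('sketch')


def read_cached(raw: bytes) -> Optional[Any]:
    """Expected and recoverable: warn with the exception text, no traceback, swallow."""
    try:
        return json.loads(raw)
    except ValueError as exc:
        logger.warning(f'Skipping malformed cache entry: {exc}')
        return None


def cleanup(sweep: Callable[[], None]) -> None:
    """Unexpected failure at a terminal point: log with traceback, swallow."""
    try:
        sweep()
    except Exception:
        logger.exception('Cache cleanup failed')


def wait_for(future: 'concurrent.futures.Future[Any]', timeout: float) -> Any:
    """Re-raised and logged by the caller anyway: cancel, re-raise, do not log."""
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        future.cancel()
        raise


def next_request(fetch: Callable[[], Any]) -> Any:
    """Boundary log: add local context, then re-raise."""
    try:
        return fetch()
    except Exception:
        logger.exception('Failed to fetch the next request')
        raise
```

Only the first pattern interpolates the exception into the message; the logger.exception calls rely on the attached traceback instead.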

Why logger.exception replaced traceback.print_exc(): traceback.print_exc() writes a bare traceback straight to stderr, bypassing logging entirely. It has no level, no logger name, and no message, and it ignores Scrapy's and the SDK's log configuration and handlers. logger.exception(msg) logs at ERROR through the configured logging system, so it is routed, formatted, and filterable like every other log line. It adds a message explaining what failed and still attaches the full traceback automatically, which makes including the exception object in the message ({exc}) redundant (ruff TRY401).
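
The difference can be seen in a small standalone comparison. The logger name 'demo.cleanup', the message, and the format string are assumptions for illustration. The snippet captures both outputs in memory so they can be inspected.

```python
import io
import logging
import traceback
from contextlib import redirect_stderr

# A handler with its own format, standing in for the configured logging setup.
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter('%(levelname)s %(name)s: %(message)s'))
logger = logging.getLogger('demo.cleanup')
logger.addHandler(handler)
logger.propagate = False

stderr_stream = io.StringIO()
try:
    raise ValueError('boom')
except ValueError:
    # Writes a bare traceback to whatever sys.stderr currently is.
    with redirect_stderr(stderr_stream):
        traceback.print_exc()
    # Goes through the handler: level, logger name, message, then the traceback.
    logger.exception('Cleanup sweep failed')

print(stderr_stream.getvalue())
print(log_stream.getvalue())
```

The captured stderr text begins directly with the traceback header, while the logged text begins with the formatted level, logger name, and message, followed by the same traceback.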

Tests

New tests/unit/scrapy/test_async_thread.py covers the startup race, run-after-close, timeout cancellation, idempotent close, the caller timeout reaching the shutdown step, and stop/join when task cancellation fails. The scheduler and HTTP cache test modules gain coverage for the timeout setting, closing the thread on open failure, and the cleanup-failure path still closing the thread.
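
The contents of the new test module are not shown here. As a rough idea of what a startup-race test and a run-after-close test can check, the sketch below exercises only the standard-library behaviour the fix relies on. The test names and the answer() coroutine are hypothetical.

```python
import asyncio
import threading


async def answer() -> int:
    return 42


def test_coroutine_queued_before_loop_starts_still_runs() -> None:
    loop = asyncio.new_event_loop()
    assert not loop.is_running()
    # Submit before any thread has called run_forever().
    future = asyncio.run_coroutine_threadsafe(answer(), loop)
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    try:
        assert future.result(timeout=5) == 42
    finally:
        loop.call_soon_threadsafe(loop.stop)
        thread.join(timeout=5)
        loop.close()


def test_closed_loop_rejects_submission() -> None:
    loop = asyncio.new_event_loop()
    loop.close()
    coro = answer()
    try:
        asyncio.run_coroutine_threadsafe(coro, loop)
    except RuntimeError:
        rejected = True
    else:
        rejected = False
    finally:
        coro.close()  # avoid a "coroutine was never awaited" warning
    assert rejected
```

Both functions follow pytest naming, but they use plain assert statements and can also be called directly.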

@vdusek vdusek added adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. labels Jun 12, 2026
@vdusek vdusek self-assigned this Jun 12, 2026
@github-actions github-actions Bot added this to the 142nd sprint - Tooling team milestone Jun 12, 2026
@github-actions github-actions Bot added the tested Temporary label used only programmatically for some analytics. label Jun 12, 2026
@codecov

codecov Bot commented Jun 12, 2026

Codecov Report

❌ Patch coverage is 70.52632% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.59%. Comparing base (0daca28) to head (4a5a77e).
⚠️ Report is 30 commits behind head on master.

Files with missing lines                    Patch %   Lines
src/apify/scrapy/extensions/_httpcache.py   75.00%    14 Missing ⚠️
src/apify/scrapy/scheduler.py               55.55%    8 Missing ⚠️
src/apify/scrapy/_async_thread.py           78.94%    4 Missing ⚠️
src/apify/scrapy/requests.py                0.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #979      +/-   ##
==========================================
+ Coverage   89.90%   91.59%   +1.68%     
==========================================
  Files          49       49              
  Lines        3091     3164      +73     
==========================================
+ Hits         2779     2898     +119     
+ Misses        312      266      -46     

Flag          Coverage Δ
e2e           35.77% <0.00%> (-0.14%) ⬇️
integration   56.54% <0.00%> (-0.34%) ⬇️
unit          80.46% <70.52%> (+1.72%) ⬆️

@vdusek vdusek changed the title fix: Resolve AsyncThread.run_coro startup race fix(scrapy): Resolve AsyncThread.run_coro startup race Jun 12, 2026
@vdusek vdusek requested a review from Pijukatel June 12, 2026 17:03
@vdusek vdusek marked this pull request as ready for review June 12, 2026 17:04
@vdusek vdusek changed the title fix(scrapy): Resolve AsyncThread.run_coro startup race fix(scrapy): async-thread startup race, shutdown lifecycle, and timeout setting Jun 13, 2026
@vdusek vdusek marked this pull request as draft June 13, 2026 08:14
@vdusek vdusek removed the request for review from Pijukatel June 13, 2026 08:14
@vdusek vdusek marked this pull request as ready for review June 17, 2026 12:24
@vdusek vdusek requested a review from Pijukatel June 17, 2026 12:56