Set time limit and interrupt crawling that takes too long #1586
Replies: 1 comment
-

Use `CrawlerRunConfig` timeouts:

```python
config = CrawlerRunConfig(
    page_timeout=30000,      # 30s hard cutoff for page navigation
    wait_for_timeout=10000,  # 10s cutoff for any wait_for conditions
)
results = await crawler.arun_many(urls, config=config)
```

If a page doesn't load within `page_timeout`, that URL's crawl is cut off and returned as a failure instead of blocking the rest of the batch. For your 10K URL case, you can also tune the dispatcher:

```python
from crawl4ai import MemoryAdaptiveDispatcher

dispatcher = MemoryAdaptiveDispatcher(
    max_session_permit=20,   # max concurrent crawls
    fairness_timeout=120.0,  # prioritize URLs waiting > 2 min
)
results = await crawler.arun_many(urls, config=config, dispatcher=dispatcher)
```
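If your crawl4ai version supports streaming (`stream=True` on `CrawlerRunConfig`), you can also consume results as each URL finishes, so a straggler only delays its own result rather than the whole batch. A minimal sketch under that assumption, where `handle()` is a placeholder for your own processing:

```python
stream_config = CrawlerRunConfig(
    page_timeout=30000,
    stream=True,  # assumption: streaming mode is available in your version
)

# Results arrive one by one as each crawl completes
async for result in await crawler.arun_many(urls, config=stream_config, dispatcher=dispatcher):
    if result.success:
        handle(result)  # placeholder: your own per-result processing
    else:
        print(f"Gave up on {result.url}: {result.error_message}")
```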
If you're seeing URLs that hang for hours even with these timeouts set, also disable the optional waits:

```python
config = CrawlerRunConfig(
    page_timeout=30000,
    wait_for_timeout=10000,
    delay_before_return_html=0,  # no extra delay before capturing HTML
    scan_full_page=False,        # don't scroll the full page if not needed
)
```

Failed URLs will have `success=False` on their result object, so you can filter them out afterwards.
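Since each result carries `success` and `error_message`, you can collect the stragglers and retry them in a second pass. A minimal sketch — the longer retry timeout is an illustrative choice, not a recommendation:

```python
# Inspect why each URL was cut off
for r in results:
    if not r.success:
        print(f"{r.url}: {r.error_message}")

# Optional second pass with a more generous timeout for the stragglers
failed_urls = [r.url for r in results if not r.success]
retry_config = CrawlerRunConfig(page_timeout=120000)
retry_results = await crawler.arun_many(failed_urls, config=retry_config)
```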
-
Hi,

I am crawling 10K URLs using `crawler.arun_many` with a batch size of 100. In some batches, a few URLs take extremely long (several hours) and block the entire process. I want to know if it is possible to add a time limit so that any URL job taking longer than X seconds is cancelled, or keeps only the content crawled so far. I'm willing to sacrifice those URLs to keep the runtime of the entire dataset bounded. Thank you.
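For reference, an outer safeguard on the caller's side can cap each batch regardless of library settings. A minimal sketch of the setup described above, where `BATCH_TIMEOUT` and the batching loop are illustrative rather than crawl4ai features:

```python
import asyncio

BATCH_SIZE = 100
BATCH_TIMEOUT = 600  # seconds; hypothetical upper bound per batch

all_results = []
for i in range(0, len(urls), BATCH_SIZE):
    batch = urls[i:i + BATCH_SIZE]
    try:
        # wait_for cancels the batch task if it overruns the deadline;
        # results from a cancelled batch are discarded
        batch_results = await asyncio.wait_for(
            crawler.arun_many(batch, config=config),
            timeout=BATCH_TIMEOUT,
        )
        all_results.extend(batch_results)
    except asyncio.TimeoutError:
        print(f"Batch starting at index {i} exceeded {BATCH_TIMEOUT}s; skipped")
```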