Description
Which package is the feature request for? If unsure which one to select, leave blank
@crawlee/basic (BasicCrawler)
Feature
Add the ability to attach or reference a Session (or at minimum its relevant state — cookies, headers, proxy session ID) to a Request object, so that when the request is processed by a different crawler (or in a later run), it reuses the same session context without requiring re-initialization.
Motivation
When splitting work across multiple crawlers (e.g. CrawlerA enqueues requests into CrawlerB's queue), sessions cannot be transferred alongside their requests. In cases where a session has been "prepared" — e.g. a location has been set via a preNavigation hook that makes requests under a specific proxy session — CrawlerB has no way to know about this and must repeat the entire session setup step.
I can also see this being useful in a similar scenario within a single Actor.
```js
// CrawlerA - processes search pages, sets location per session, enqueues product URLs
const crawlerA = new PuppeteerCrawler({
    useSessionPool: true,
    async requestHandler({ request, session, page }) {
        // Sets location tied to this session (slow, expensive step)
        await setLocation(page, session, targetLocation);
        const productUrls = await scrapeProductUrls(page);
        // We want CrawlerB to reuse the same session/location,
        // but there's no way to pass the session here
        await crawlerB.addRequests(
            productUrls.map((url) => ({ url, /* session: session ?? */ })),
        );
    },
});

// CrawlerB - processes product pages, but has to redo the location setup
const crawlerB = new PuppeteerCrawler({
    useSessionPool: true,
    preNavigationHooks: [
        async ({ request, session, page }) => {
            // Redundant! CrawlerA already did this for this request's session
            await setLocation(page, session, getLocationForRequest(request));
        },
    ],
    async requestHandler({ page }) {
        // scrape product data
    },
});
```

Ideal solution or implementation, and any additional constraints
Allow a Session (or a serializable snapshot of it) to be attached to a Request, so a downstream crawler can restore and reuse it:
```js
// CrawlerA
await crawlerBQueue.addRequest({
    url: productUrl,
    session: session.exportSnapshot(), // serialize cookies, headers, proxy session ID
});

// CrawlerB
const crawlerB = new PuppeteerCrawler({
    useSessionPool: true,
    async requestHandler({ request, session }) {
        // session was restored from request snapshot — location already set, no re-init needed
    },
});
```

Or just allow SessionPool sharing (#3445) and reference the session (by ID) that the request should get from the pool.
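For reference, the snapshot idea can be approximated by hand today by serializing the relevant state into `request.userData`, which already round-trips through the request queue as plain JSON. A minimal sketch in plain JavaScript, with no Crawlee APIs involved: `makeSnapshot`/`restoreSession` are hypothetical helper names, and the snapshot fields (`cookies`, `headers`, `proxySessionId`) are assumptions about what "relevant state" means, not an existing Crawlee interface:

```javascript
// Hypothetical helpers sketching the snapshot round-trip. The field names
// below are assumptions, not a Crawlee API.
function makeSnapshot(session) {
    return {
        id: session.id,
        cookies: session.cookies,
        headers: session.headers,
        proxySessionId: session.proxySessionId,
    };
}

function restoreSession(snapshot) {
    // A downstream crawler could look the session up by ID (if the pool is
    // shared), or rebuild it from the serialized fields when it is not.
    return {
        id: snapshot.id,
        cookies: snapshot.cookies,
        headers: snapshot.headers,
        proxySessionId: snapshot.proxySessionId,
    };
}

// Attach the snapshot to the request via userData.
const session = {
    id: 'session_1',
    cookies: [{ name: 'loc', value: 'NYC' }],
    headers: { 'accept-language': 'en-US' },
    proxySessionId: 'proxy_1',
};
const request = {
    url: 'https://example.com/product/1',
    userData: { sessionSnapshot: makeSnapshot(session) },
};

// Simulate the queue's JSON round-trip, then restore on the consumer side.
const restored = restoreSession(
    JSON.parse(JSON.stringify(request.userData)).sessionSnapshot,
);
console.log(restored.id); // session_1
```

This is exactly the kind of glue code the feature request would make unnecessary: every project re-invents its own snapshot shape, and nothing ties the restored state back to the SessionPool's bookkeeping (usage counts, retirement, error scores).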
Alternative solutions or implementations
No response
Other context
- Related to the broader Session / SessionPool limitations acknowledged for v4 improvements
- The available workarounds (manually copying session state through request metadata) are very messy
- Would be valuable for any workflow where sessions require expensive setup (geo-location, auth, CAPTCHA solving) that should not be repeated per crawler stage