Allow passing/binding a Session to a Request (when enqueuing across crawlers) #3446

@MatousMarik

Description

Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/basic (BasicCrawler)

Feature

Add the ability to attach or reference a Session (or at minimum its relevant state — cookies, headers, proxy session ID) to a Request object, so that when the request is processed by a different crawler (or in a later run), it reuses the same session context without requiring re-initialization.

Motivation

When splitting work across multiple crawlers (e.g. CrawlerA enqueues requests into CrawlerB's queue), sessions cannot be transferred alongside their requests. In cases where a session has been "prepared" — e.g. a location has been set via a preNavigation hook that makes requests under a specific proxy session — CrawlerB has no way to know about this and must repeat the entire session setup step.

This would also be useful within a single Actor in similar multi-stage scenarios.

// CrawlerA - processes search pages, sets location per session, enqueues product URLs
const crawlerA = new PuppeteerCrawler({
  useSessionPool: true,
  async requestHandler({ request, session, page }) {
    // Sets location tied to this session (slow, expensive step)
    await setLocation(page, session, targetLocation);

    const productUrls = await scrapeProductUrls(page);

    // We want CrawlerB to reuse the same session/location,
    // but there's no way to pass the session here
    await crawlerB.addRequests(
      productUrls.map(url => ({ url, /* session: session ?? */ }))
    );
  },
});

// CrawlerB - processes product pages, but has to redo the location setup
const crawlerB = new PuppeteerCrawler({
  useSessionPool: true,
  preNavigationHooks: [
    async ({ request, session, page }) => {
      // Redundant! CrawlerA already did this for this request's session
      await setLocation(page, session, getLocationForRequest(request));
    },
  ],
  async requestHandler({ page }) {
    // scrape product data
  },
});
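Until something like this exists, the usual workaround is to smuggle the prepared state through `request.userData`, which already survives the request queue. A rough sketch (the helper names here are invented for illustration, not a Crawlee API; it assumes Puppeteer's plain, JSON-serializable cookie objects as captured by `page.cookies()`):

```javascript
// Pack the prepared session state onto each enqueued request.
// In CrawlerA's requestHandler: crawlerB.addRequests(
//   withSessionState(productUrls, await page.cookies(), session.id))
function withSessionState(urls, cookies, proxySessionId) {
  return urls.map((url) => ({
    url,
    userData: { sessionState: { cookies, proxySessionId } },
  }));
}

// Unpack it on the receiving side, e.g. inside CrawlerB's preNavigationHook,
// where the cookies could be replayed via page.setCookie(...state.cookies).
function readSessionState(request) {
  return request.userData?.sessionState ?? null;
}
```

This duplicates the same state onto every enqueued request and still doesn't tie the request back to a live session in CrawlerB's pool, which is exactly the messiness the feature would remove.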

Ideal solution or implementation, and any additional constraints

Allow a Session (or a serializable snapshot of it) to be attached to a Request, so a downstream crawler can restore and reuse it:

// CrawlerA
await crawlerBQueue.addRequest({
  url: productUrl,
  session: session.exportSnapshot(), // serialize cookies, headers, proxy session ID
});

// CrawlerB
const crawlerB = new PuppeteerCrawler({
  useSessionPool: true,
  async requestHandler({ request, session }) {
    // session was restored from request snapshot — location already set, no re-init needed
  },
});
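Since the request queue persists requests as JSON, the snapshot would just need to be a plain object that round-trips losslessly. A minimal sketch of what `exportSnapshot` might produce (the shape and field names are assumptions, not an existing Crawlee API):

```javascript
// Hypothetical snapshot: everything needed to resume a prepared session
// as a plain, JSON-serializable object.
function exportSnapshot(session) {
  return {
    id: session.id,
    cookies: session.cookies,
    proxySessionId: session.proxySessionId ?? null,
  };
}

// The queue stores requests as JSON, so restoring is just parsing the payload back.
function restoreSnapshot(serialized) {
  return JSON.parse(serialized);
}

// Example round-trip through a JSON string, as the queue would do:
const snapshot = exportSnapshot({
  id: 'session_1',
  cookies: [{ name: 'location', value: 'NYC' }],
  proxySessionId: 'proxy_42',
});
const restored = restoreSnapshot(JSON.stringify(snapshot));
```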

Alternatively, allow SessionPool sharing (#3445) and let a request reference, by ID, the session it should receive from the pool.
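With a shared pool, the request would only need to carry a session ID. A minimal sketch of the lookup side (a plain `Map` stands in for the shared SessionPool here; this is not the real SessionPool API):

```javascript
// Stand-in for a shared session pool keyed by session ID.
const sharedPool = new Map();

// CrawlerA side: make a prepared session visible to other crawlers.
function registerSession(session) {
  sharedPool.set(session.id, session);
}

// CrawlerB side: resolve the session a request asked for, if any.
function sessionForRequest(request) {
  const id = request.userData?.sessionId;
  return id ? sharedPool.get(id) ?? null : null;
}
```

The advantage over per-request snapshots is that the session stays a single live object: its usage counters, error score, and retirement logic keep working across both crawlers.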

Alternative solutions or implementations

No response

Other context

  • Related to the broader Session / SessionPool limitations acknowledged for v4 improvements
  • The workarounds are very messy
  • Would be valuable for any workflow where sessions require expensive setup (geo-location, auth, CAPTCHA solving) that should not be repeated per crawler stage

Labels

  • feature — Issues that represent new features or improvements to existing features.
  • t-tooling — Issues with this label are in the ownership of the tooling team.
