ContextPipeline changes break skipNavigation with CheerioCrawler #3304

@barjin

Description

The following snippet works with Crawlee v3, but will break on current v4:

import { CheerioCrawler } from "@crawlee/cheerio";

const crawler = new CheerioCrawler({
    requestHandler: async () => {
        // pass
    },
});

await crawler.run([{
    url: 'http://example.com',
    skipNavigation: true,
}]);

INFO  CheerioCrawler: Starting the crawler.
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. The `contentType` property is not available - `skipNavigation` was used
    at get contentType (file:///home/jindrichbar/Desktop/apify/crawlee/packages/http-crawler/dist/internals/http-crawler.js:207:27) {"id":"8OamqXBCpPHxyH9","url":"http://example.com","retryCount":1}
ERROR CheerioCrawler: An exception occurred during handling of failed request. This places the crawler and its underlying storages into an unknown state and crawling will be terminated. 
  The `request.loadedUrl` property is not available - `skipNavigation` was used
      at Object.get (file:///home/jindrichbar/Desktop/apify/crawlee/packages/http-crawler/dist/internals/http-crawler.js:177:35)
      at Function.entries (<anonymous>)
      at _ObjectValidator.handleIgnoreStrategy (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:2089:41)
      at _ObjectValidator.handlePassthroughStrategy (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:2170:25)
      at _ObjectValidator.handleStrategy (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:1982:47)
      at _ObjectValidator.handle (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:2081:17)
      at _ObjectValidator.parse (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:964:90)
      at RequestQueueClient.updateRequest (file:///home/jindrichbar/Desktop/apify/crawlee/packages/memory-storage/dist/resource-clients/request-queue.js:366:22)
      at RequestQueue.reclaimRequest (file:///home/jindrichbar/Desktop/apify/crawlee/packages/core/dist/storages/request_provider.js:386:35)
      at RequestQueue.reclaimRequest (file:///home/jindrichbar/Desktop/apify/crawlee/packages/core/dist/storages/request_queue_v2.js:219:33)

The crawler gets a double whammy, first from CheerioCrawler's parseContent (accesses crawlingContext.contentType):

const isXml = crawlingContext.contentType.type.includes('xml');

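and then from Shapeshift's validation in updateRequest while handling the error above. The validator enumerates the request's own properties (the Function.entries frame in the stack trace), which reads request.loadedUrl.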

Both failures are caused by the validation Proxy added to CrawlingContext and Request in HttpCrawler (link and link).
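
To illustrate the mechanism, here is a minimal sketch of a Proxy whose `get` trap throws for selected properties. The helper `guardSkippedNavigation`, its parameters, and the object shapes are hypothetical stand-ins and not Crawlee's actual implementation; only the error wording is modeled on the log above. It shows that both a direct property read and a plain `Object.entries` call hit the trap:

```javascript
// Hypothetical stand-in for the validation Proxy; NOT Crawlee's actual code.
function guardSkippedNavigation(target, guardedKeys) {
    return new Proxy(target, {
        get(obj, prop, receiver) {
            // Throw on any read of a guarded property, however it is triggered.
            if (guardedKeys.includes(prop)) {
                throw new Error(
                    `The \`${String(prop)}\` property is not available - \`skipNavigation\` was used`,
                );
            }
            return Reflect.get(obj, prop, receiver);
        },
    });
}

// Assumed, simplified shapes for the request and the crawling context.
const request = guardSkippedNavigation(
    { url: 'http://example.com', loadedUrl: undefined, retryCount: 1 },
    ['loadedUrl'],
);
const crawlingContext = guardSkippedNavigation(
    { request, contentType: undefined },
    ['contentType'],
);

// Reading an unguarded property still works.
const url = request.url;

// First failure: a direct read, as in the parseContent line quoted above.
let accessError;
try {
    crawlingContext.contentType.type.includes('xml');
} catch (err) {
    accessError = err;
}

// Second failure: Object.entries reads every own enumerable property
// through the get trap, so the guarded key throws during enumeration.
let enumerationError;
try {
    Object.entries(request);
} catch (err) {
    enumerationError = err;
}

console.log(accessError.message);
console.log(enumerationError.message);
```

Under these assumptions, any code that enumerates a guarded object (a validator, a serializer, a logger) fails even though it never asks for the guarded property by name.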

Labels

t-tooling: Issues with this label are in the ownership of the tooling team.
