The following snippet works with Crawlee v3, but breaks on the current v4:

    import { CheerioCrawler } from "@crawlee/cheerio";

    const crawler = new CheerioCrawler({
        requestHandler: async () => {
            // pass
        },
    });

    await crawler.run([{
        url: 'http://example.com',
        skipNavigation: true,
    }]);

On v4, the run logs:

    INFO CheerioCrawler: Starting the crawler.
    WARN CheerioCrawler: Reclaiming failed request back to the list or queue. The `contentType` property is not available - `skipNavigation` was used
        at get contentType (file:///home/jindrichbar/Desktop/apify/crawlee/packages/http-crawler/dist/internals/http-crawler.js:207:27) {"id":"8OamqXBCpPHxyH9","url":"http://example.com","retryCount":1}
    ERROR CheerioCrawler: An exception occurred during handling of failed request. This places the crawler and its underlying storages into an unknown state and crawling will be terminated.
    The `request.loadedUrl` property is not available - `skipNavigation` was used
        at Object.get (file:///home/jindrichbar/Desktop/apify/crawlee/packages/http-crawler/dist/internals/http-crawler.js:177:35)
        at Function.entries (<anonymous>)
        at _ObjectValidator.handleIgnoreStrategy (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:2089:41)
        at _ObjectValidator.handlePassthroughStrategy (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:2170:25)
        at _ObjectValidator.handleStrategy (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:1982:47)
        at _ObjectValidator.handle (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:2081:17)
        at _ObjectValidator.parse (file:///home/jindrichbar/Desktop/apify/crawlee/node_modules/@sapphire/shapeshift/dist/esm/index.mjs:964:90)
        at RequestQueueClient.updateRequest (file:///home/jindrichbar/Desktop/apify/crawlee/packages/memory-storage/dist/resource-clients/request-queue.js:366:22)
        at RequestQueue.reclaimRequest (file:///home/jindrichbar/Desktop/apify/crawlee/packages/core/dist/storages/request_provider.js:386:35)
        at RequestQueue.reclaimRequest (file:///home/jindrichbar/Desktop/apify/crawlee/packages/core/dist/storages/request_queue_v2.js:219:33)

The crawler gets a double whammy. First, CheerioCrawler's `parseContent` accesses `crawlingContext.contentType` (crawlee/packages/cheerio-crawler/src/internals/cheerio-crawler.ts, line 195 in bca7d7a):

    const isXml = crawlingContext.contentType.type.includes('xml');

Then, while handling the error above, Shapeshift's validation in `updateRequest` accesses `request.loadedUrl` (crawlee/packages/memory-storage/src/resource-clients/request-queue.ts, line 514 in bca7d7a):

    requestShape.parse(request);

Both failures are caused by the addition of the validation `Proxy` on `CrawlingContext` and `Request` in `HttpCrawler` (link and link).
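
To illustrate why the second failure happens even though nothing names `loadedUrl` explicitly, here is a minimal standalone sketch. It does not use Crawlee: `guardSkippedNavigation` and the plain `request` object are hypothetical stand-ins for the guard in `HttpCrawler`, and only the error text is borrowed from the log above. It shows that `Object.entries` (which appears in the stack trace) reads every own enumerable property through the `get` trap, so any generic validator or serializer will trip a throwing Proxy.

```typescript
// Hypothetical stand-in for the validation Proxy; NOT Crawlee's actual implementation.
function guardSkippedNavigation<T extends object>(target: T, blocked: string[]): T {
    return new Proxy(target, {
        get(obj, prop, receiver) {
            if (typeof prop === 'string' && blocked.includes(prop)) {
                throw new Error(`The \`${prop}\` property is not available - \`skipNavigation\` was used`);
            }
            return Reflect.get(obj, prop, receiver);
        },
    });
}

// Assumed shape: `loadedUrl` exists as an own enumerable property, even if unset.
const request = guardSkippedNavigation(
    { url: 'http://example.com', loadedUrl: undefined as string | undefined },
    ['loadedUrl'],
);

// Reading a property that is not blocked works as usual.
const url = request.url;

// Object.entries() reads each own enumerable property via the `get` trap,
// so enumeration alone is enough to hit the guard.
let enumerationError = '';
try {
    Object.entries(request);
} catch (err) {
    enumerationError = (err as Error).message;
}
```

In this sketch the guard cannot tell a deliberate read of `loadedUrl` from an incidental one made during enumeration, which matches the pattern in the trace: `Function.entries` calls `Object.get` on the proxied request.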