Open
Labels
t-tooling: Issues with this label are in the ownership of the tooling team.
Description
At recent events I attended, I was asked about AI/LLM-based HTML parsing. I also found a few dedicated AI-based scraping frameworks, such as ScrapeGraphAI and Parsera, that appear to be gaining traction.
Right now, we provide an AI-selector workflow only through the guide on using PlaywrightCrawler with Stagehand.
This means:
- AI-based selectors are supported only for Playwright, not for HTTP-based crawlers.
- Even for PlaywrightCrawler, the integration is not very smooth compared to the tools mentioned above.
Example from ScrapeGraphAI:
from scrapegraphai.graphs import SmartScraperGraph

# graph_config (LLM provider and model settings) is assumed to be defined earlier
# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract useful information from the webpage, including a description of what the company does, founders and social media links",
    source="https://scrapegraphai.com/",
    config=graph_config,
)
# Run the pipeline
result = smart_scraper_graph.run()

It might be worth exploring a more native solution:
- Better Stagehand integration so that AI-based selectors in Playwright crawlers are as straightforward as in the dedicated AI-scraping libraries.
- Introduce an AI/LLM-powered crawler built on top of AbstractHttpCrawler, enabling AI/LLM selectors for HTTP-based scraping as well.
This could make Crawlee more usable for AI/LLM-based extraction and for quickly prototyping scrapers without hand-written CSS/XPath selectors.
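To make the HTTP-based idea more concrete, here is a minimal sketch of what an LLM "selector" step could look like once the HTML has been downloaded. Everything in it is an assumption for illustration: `html_to_text`, `llm_extract`, and the `complete` callable are hypothetical names, not part of Crawlee, Stagehand, or ScrapeGraphAI, and the LLM client is left as a plain function so any provider could be plugged in.

```python
import json
from html.parser import HTMLParser
from typing import Callable


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Reduce raw HTML to its text content to keep the LLM prompt small."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def llm_extract(html: str, prompt: str, complete: Callable[[str], str]) -> dict:
    """Hypothetical helper: ask an LLM to extract what `prompt` describes.

    `complete` is any function that takes a prompt string and returns the
    model's reply as a string (an assumption; wire in a real client here).
    """
    full_prompt = (
        f"{prompt}\n\n"
        "Answer with a single JSON object and nothing else.\n\n"
        f"Page content:\n{html_to_text(html)}"
    )
    # Raises json.JSONDecodeError if the model does not return valid JSON.
    return json.loads(complete(full_prompt))
```

In a crawler built on AbstractHttpCrawler, a helper like this would presumably be called from the request handler with the response body, so the user supplies only a prompt instead of CSS/XPath selectors. How that would be exposed on the crawling context is an open design question.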
monk3yd