Closed
Labels
t-tooling: Issues with this label are in the ownership of the tooling team.
Description
For the purposes of testability and nice code structure, I thought I'd have some information dependency-injected top-down from the main function, but I don't know how to do that. I'll illustrate the problem with a constant, but imagine my program has some Click options that affect how the scraper behaves, so the value isn't necessarily immutable and the issue is the same. This is my program:

import re
import asyncio
from enum import StrEnum, auto

import click

from crawlee.beautifulsoup_crawler import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
)
from crawlee.router import Router

LENGTH_RE = re.compile(r"(\d+)\s+min")


class Label(StrEnum):
    DETAIL = auto()


router = Router[BeautifulSoupCrawlingContext]()


@click.command()
def edison():
    asyncio.run(scrape())


async def scrape():
    crawler = BeautifulSoupCrawler(request_handler=router)
    await crawler.run(["https://edisonfilmhub.cz/program"])
    await crawler.export_data("edison.json", dataset_name="edison")


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext):
    await context.enqueue_links(selector=".program_table .name a", label=Label.DETAIL)


@router.handler(Label.DETAIL)
async def detail_handler(context: BeautifulSoupCrawlingContext):
    context.log.info(f"Scraping {context.request.url}")
    description = context.soup.select_one(".filmy_page .desc3").text
    length_min = LENGTH_RE.search(description).group(1)
    # TODO get starts_at, then calculate ends_at
    await context.push_data(
        {
            "url": context.request.url,
            "title": context.soup.select_one(".filmy_page h1").text.strip(),
            "csfd_url": context.soup.select_one(".filmy_page .hrefs a")["href"],
        },
        dataset_name="edison",
    )

In the main function, I have certain information I want to pass down. For example, I want "edison" to be an argument:
@click.command()
def edison():
    slug = "edison"
    asyncio.run(scrape(slug))

Then this is easy:
async def scrape(slug: str):
    crawler = BeautifulSoupCrawler(request_handler=router)
    await crawler.run(["https://edisonfilmhub.cz/program"])
    await crawler.export_data(f"{slug}.json", dataset_name=slug)

But then, how do I pass that slug down to the handlers? I have no idea. What do you suggest as the best approach?