Implement/document a way to pass custom information to handlers #525

@honzajavorek

Description

For testability and a clean code structure, I'd like to have some information dependency-injected top-down from the main function, but I don't know how to do that. I'll illustrate the problem with a constant, but imagine my program has click options that affect how the scraper behaves, so the value isn't necessarily immutable and the issue is the same. This is my program:

```python
import re
import asyncio
from enum import StrEnum, auto

import click
from crawlee.beautifulsoup_crawler import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
)
from crawlee.router import Router


LENGTH_RE = re.compile(r"(\d+)\s+min")


class Label(StrEnum):
    DETAIL = auto()


router = Router[BeautifulSoupCrawlingContext]()


@click.command()
def edison():
    asyncio.run(scrape())


async def scrape():
    crawler = BeautifulSoupCrawler(request_handler=router)
    await crawler.run(["https://edisonfilmhub.cz/program"])
    await crawler.export_data("edison.json", dataset_name="edison")


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext):
    await context.enqueue_links(selector=".program_table .name a", label=Label.DETAIL)


@router.handler(Label.DETAIL)
async def detail_handler(context: BeautifulSoupCrawlingContext):
    context.log.info(f"Scraping {context.request.url}")

    description = context.soup.select_one(".filmy_page .desc3").text
    length_min = LENGTH_RE.search(description).group(1)
    # TODO get starts_at, then calculate ends_at

    await context.push_data(
        {
            "url": context.request.url,
            "title": context.soup.select_one(".filmy_page h1").text.strip(),
            "csfd_url": context.soup.select_one(".filmy_page .hrefs a")["href"],
        },
        dataset_name="edison",
    )
```

In the main function, I have certain information I want to pass down. For example, I want "edison" to be an argument:

```python
@click.command()
def edison():
    slug = "edison"
    asyncio.run(scrape(slug))
```

Then this is easy:

```python
async def scrape(slug: str):
    crawler = BeautifulSoupCrawler(request_handler=router)
    await crawler.run(["https://edisonfilmhub.cz/program"])
    await crawler.export_data(f"{slug}.json", dataset_name=slug)
```

But then, how do I pass that slug down to the handlers? I have no idea. What do you suggest as the best approach?
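
A minimal sketch of one possible approach, using only the APIs already shown above: instead of registering handlers at module level, build the router inside a factory function so the handlers close over the injected value (`create_router` is a hypothetical name, not a Crawlee API):

```python
def create_router(slug: str) -> Router[BeautifulSoupCrawlingContext]:
    # Handlers defined here capture `slug` from the enclosing scope,
    # so no module-level state is needed.
    router = Router[BeautifulSoupCrawlingContext]()

    @router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.enqueue_links(
            selector=".program_table .name a", label=Label.DETAIL
        )

    @router.handler(Label.DETAIL)
    async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.push_data(
            {"url": context.request.url},
            dataset_name=slug,  # the injected value, via the closure
        )

    return router


async def scrape(slug: str):
    crawler = BeautifulSoupCrawler(request_handler=create_router(slug))
    await crawler.run(["https://edisonfilmhub.cz/program"])
    await crawler.export_data(f"{slug}.json", dataset_name=slug)
```

Another possible route is the per-request `user_data` mapping, sketched here under the assumption that `Request.from_url` accepts a `user_data` argument and that the installed version of `enqueue_links` forwards `user_data` to the enqueued requests (check the signatures in your version):

```python
from crawlee import Request  # import path may differ between versions


async def scrape(slug: str):
    crawler = BeautifulSoupCrawler(request_handler=router)
    # Attach the slug to the initial request; handlers read it back
    # from context.request.user_data.
    await crawler.run(
        [Request.from_url("https://edisonfilmhub.cz/program", user_data={"slug": slug})]
    )
    await crawler.export_data(f"{slug}.json", dataset_name=slug)


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext):
    slug = context.request.user_data["slug"]
    # Assumes enqueue_links accepts user_data; if not, a
    # transform_request_function could set it on each new request.
    await context.enqueue_links(
        selector=".program_table .name a",
        label=Label.DETAIL,
        user_data={"slug": slug},
    )
```

The closure version keeps the handlers easy to unit-test with a fake context; the `user_data` version ties the value to each request, so it should survive if the request queue is persisted.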
