This repository was archived by the owner on Jul 3, 2023. It is now read-only.

Investigate how to follow a link in a job-toot and index the body of that link as well #7

@berkes

A possible candidate to follow is the link in the "card", when one is present.

We'd need

  • A sane timeout, to avoid hanging when the host of the vacancy is unavailable or blocking.
  • Content-type checking: accept plain text and HTML only. PDF support can come later. Anything else should be discarded.
  • A length check. Anything longer than X bytes should be chopped off. 500 kB? The timeout will catch many of these too, but a very fast host might still serve us megabytes, on which we then choke.
  • A sanitizer or semantic text analyzer, so we can parse HTML in a somewhat sane way and remove things like menus, footers and sidebars. What FLOSS options are there for this?
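
To make the four requirements above concrete, here is a rough sketch using only the Python standard library. It is not taken from this repository: the names `fetch_text`, `html_to_text`, `is_allowed` and `TextExtractor`, the 5-second timeout, the user-agent string and the list of skipped tags are all assumptions for illustration. Note that the `timeout` passed to `urlopen` applies per socket operation, not to the whole download, so a real implementation would probably also want an overall deadline. The tag-skipping parser is a crude stand-in for a proper readability-style extractor.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

TIMEOUT_SECONDS = 5                      # assumed value, tune as needed
MAX_BYTES = 500 * 1024                   # the "500 kB?" cap from the list above
ALLOWED_TYPES = {"text/plain", "text/html"}
# Assumed set of elements that usually hold menus, footers and sidebars.
SKIPPED_TAGS = {"nav", "footer", "aside", "header", "script", "style"}


class TextExtractor(HTMLParser):
    """Collects text, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def is_allowed(content_type_header):
    """True when the media type (parameters stripped) is plain text or HTML."""
    media_type = content_type_header.split(";")[0].strip().lower()
    return media_type in ALLOWED_TYPES


def html_to_text(html):
    """Return the visible text of an HTML string, minus skipped elements."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_text(url):
    """Fetch url and return indexable text, or None for unsupported types."""
    request = Request(url, headers={"User-Agent": "job-indexer-sketch"})
    with urlopen(request, timeout=TIMEOUT_SECONDS) as response:
        content_type = response.headers.get("Content-Type", "")
        if not is_allowed(content_type):
            return None                      # discard PDFs, images, etc.
        body = response.read(MAX_BYTES)      # read at most MAX_BYTES bytes
        charset = response.headers.get_content_charset() or "utf-8"
    text = body.decode(charset, errors="replace")
    if content_type.lower().startswith("text/html"):
        return html_to_text(text)
    return text
```

Calling `fetch_text` with the card's URL would return either the stripped text or `None`; network errors and timeouts raise exceptions that the caller would need to catch and treat as "nothing to index".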
