Add Supplements spider + integrated crawling/postprocessing by heroheman · Pull Request #5 · scp-data/scp_crawler

heroheman · 2025-12-15T12:05:51Z

This PR adds first-class support for “supplement” pages from the SCP Wiki and integrates them into the existing crawl + post-processing workflow.

- Adds a new Scrapy spider `scp_supplement` to crawl pages tagged `supplement` and export them to `scp_supplement.json`.
- Introduces the `ScpSupplement` item type.
- Extends post-processing with `run_postproc_supplement` to generate processed outputs under `supplement`:
  - `content_supplement.json` (full content + history/source/images)
  - `index.json` (metadata + `content_file`)
- Adds `parent_scp` (best-effort extracted from the link) and `parent_tale` (best-effort extracted from *- patterns).
- Updates the Makefile so Supplements are included in `scp_crawl` and `scp_postprocess` (and still available via dedicated `supplement_*` targets).
- Updates README documentation to mention the new spider, output file, and processed output.
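
The best-effort `parent_scp` extraction mentioned above could look roughly like the following. This is a minimal sketch, not the code in this PR: the function name `guess_parent_scp`, the regex, and the assumption that a supplement link embeds a slug such as `scp-173` are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical pattern: an "scp-<digits>" slug anywhere in the link.
_SCP_SLUG = re.compile(r"scp-(\d+)", re.IGNORECASE)


def guess_parent_scp(link: str) -> Optional[str]:
    """Return a normalized 'SCP-<number>' string, or None if no slug is found."""
    match = _SCP_SLUG.search(link)
    if match is None:
        # Best effort only: many supplement links will not name a parent.
        return None
    return f"SCP-{match.group(1)}"
```

Returning `None` instead of raising keeps the field optional, which fits a best-effort extraction where some supplements have no identifiable parent.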

How to test

make scp (includes supplements crawl + postprocess)

Or individually:

make supplement_crawl
make supplement_postprocess
scrapy crawl scp_supplement -o data/scp_supplement.json
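
After a crawl, a quick sanity check of the exported file can help confirm that items were written. The sketch below assumes the file is a JSON array of objects, which is what Scrapy's JSON feed export writes for a fresh `-o file.json` target. The helper name `summarize_items` is hypothetical, and no particular item fields are assumed.

```python
import json
from pathlib import Path


def summarize_items(path: str) -> dict:
    """Load a JSON array of crawled items and report its size and field names."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(items, list):
        raise ValueError(f"expected a JSON array in {path}")
    # Collect every key that appears on at least one item.
    fields = sorted({key for item in items if isinstance(item, dict) for key in item})
    return {"count": len(items), "fields": fields}
```

For example, `summarize_items("data/scp_supplement.json")` returns the item count and the union of field names found in the export.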

This PR should fix #2

tedivm

Overall this looks really good, but there are some changes to be made to the GitHub workflows.

heroheman · 2025-12-15T16:55:32Z

I adjusted the workflow file, but keep in mind that I added it by accident. If the edits do not work out, I would prefer to remove the yml file, since I really do not know much about GitHub Actions :)

Also, for testing purposes I forked the API repo, too.

See:
https://github.com/heroheman/scp-api/
https://github.com/heroheman/scp-api/actions/runs/20237800863

Result Data:
https://github.com/heroheman/scp-api/tree/main/docs/data/scp/supplement

Note: I had removed a link rule for "TALES", so in the test run "Tales" is empty. I reverted that in the last commit and will start another workflow run to make sure this works.

- Handle empty history cases by returning an empty list
- Support both dict and list formats for history input
- Safely parse date strings with error handling
- Sort revisions by date, ensuring robustness against missing values
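
The history handling listed above can be sketched as follows. This is an illustration under assumptions, not the PR's code: the function name `normalize_history`, the `date` key, and the ISO 8601 date format are hypothetical, and the real revision fields may differ.

```python
from datetime import datetime
from typing import Any, Optional


def _parse_date(value: Any) -> Optional[datetime]:
    """Parse an ISO 8601 string; return None instead of raising on bad input."""
    if not isinstance(value, str):
        return None
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None


def _sort_key(revision: dict) -> tuple:
    """Sort undated revisions first, then dated ones in chronological order."""
    parsed = _parse_date(revision.get("date"))  # "date" is an assumed key
    if parsed is None:
        return (0, 0.0)
    return (1, parsed.timestamp())


def normalize_history(history: Any) -> list:
    """Return revisions as a list sorted by date, oldest first."""
    if not history or not isinstance(history, (dict, list)):
        return []  # covers None, {}, [] and unexpected types
    # Accept either {revision_id: revision} or [revision, ...].
    revisions = list(history.values()) if isinstance(history, dict) else history
    return sorted((rev for rev in revisions if isinstance(rev, dict)), key=_sort_key)
```

Callers can pair this with `item.get("history")` so that a missing key and an empty history both end up as an empty list.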

heroheman · 2025-12-17T15:46:37Z

I have split all the changes into feature branches and created #6, #7, and #8.

heroheman added 4 commits December 15, 2025 12:11

chore(.gitignore): add .DS_Store to ignore list

774a7e8

feat(spiders): add SCP Supplement spider and item

aeb4640

- Introduce ScpSupplement class for item representation
- Implement ScpSupplementSpider to crawl supplement pages
- Update makefile to include supplement in data targets

feat(postprocessing): add supplement processing command

b188362

- Implement run_postproc_supplement to process SCP supplement data
- Create necessary directories and handle data extraction
- Store processed supplements in JSON format for further use

docs(README): update supplement crawl instructions and content structure

146435d

- Added instructions for crawling pages tagged as 'supplement'
- Updated content structure to include multiple content types
- Clarified post-processing details for supplements

heroheman mentioned this pull request Dec 15, 2025

Feature request: crawling of supplements #2

Open

heroheman added 2 commits December 15, 2025 15:24

fix(spiders): refine tale parsing rules to exclude unwanted links

84d709f

- Updated the LinkExtractor in ScpTaleSpider to deny links matching specific patterns, improving the relevance of parsed tales.

feat(ci): add GitHub Actions workflow for SCP crawling

3a21940

- for reference

tedivm requested changes Dec 15, 2025

Comment thread .github/workflows/scp-items.yml Outdated

Comment thread .github/workflows/scp-items.yml Outdated

Comment thread .github/workflows/scp-items.yml Outdated

Comment thread .github/workflows/scp-items.yml Outdated

heroheman added 2 commits December 15, 2025 17:45

chore(workflow): adjusted workflow after review

166a50b

- This change prevents automatic pushing to the SCP API during CI
- it should run on pull requests to allow for manual review first
- remove clone step as it's unnecessary
- resolves scp-data#5

fix(spiders): update tale parsing rules to allow all links

49aec69

- this fixes the unintended removal of the LinkExtractor rule

heroheman requested a review from tedivm December 15, 2025 16:55

heroheman added 4 commits December 17, 2025 11:24

fix(spiders): enhance error handling in history lookup

d68338f

- Add checks for empty responses and missing 'body' in JSON
- Log errors for various failure scenarios to improve debugging
- Ensure robust parsing of history HTML to prevent crashes
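
The checks described in this commit might look roughly like the sketch below. It is not the PR's implementation: `extract_history_html`, the logger name, and the assumption that the history lookup returns a JSON object whose `body` field holds HTML are illustrative. Only the `body` key itself is named in the commit message.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("history_lookup")  # hypothetical logger name


def extract_history_html(raw: Optional[str], page: str = "<unknown>") -> Optional[str]:
    """Return the HTML in the response's 'body' field, or None on any failure."""
    if not raw or not raw.strip():
        logger.error("Empty history response for %s", page)
        return None
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        logger.error("Invalid JSON in history response for %s: %s", page, exc)
        return None
    if not isinstance(payload, dict) or "body" not in payload:
        logger.error("History response for %s has no 'body' field", page)
        return None
    return payload["body"]
```

Each failure path logs and returns `None`, so a single bad response can be skipped without stopping the crawl.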

chore(workflow): enable workflow_dispatch for SCP crawling

cca296c

- Allows manual triggering of the workflow
- Improves flexibility for testing and updates

fix(history): ensure history key always exists in items

45d680b

- Use `get` method to safely access history in hubs and items
- Prevent potential KeyError by ensuring history key is present

heroheman closed this Dec 17, 2025

heroheman wants to merge 12 commits into scp-data:main from heroheman:main
