Add Supplements spider + integrated crawling/postprocessing #5
heroheman wants to merge 12 commits into
- Introduce ScpSupplement class for item representation
- Implement ScpSupplementSpider to crawl supplement pages
- Update makefile to include supplement in data targets

- Implement run_postproc_supplement to process SCP supplement data
- Create necessary directories and handle data extraction
- Store processed supplements in JSON format for further use

- Added instructions for crawling pages tagged as 'supplement'
- Updated content structure to include multiple content types
- Clarified post-processing details for supplements
- Updated the LinkExtractor in ScpTaleSpider to deny links matching specific patterns, improving the relevance of parsed tales.
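To illustrate the idea behind denying links by pattern, here is a minimal standard-library sketch. Scrapy's `LinkExtractor` accepts a `deny` argument of regular expressions; the snippet below only mimics that filtering with `re`. The patterns and the `filter_links` helper are hypothetical and are not taken from this PR.

```python
import re

# Hypothetical deny patterns; the actual patterns used in ScpTaleSpider are not shown here.
DENY_PATTERNS = [r"/system:", r"/forum/"]


def filter_links(urls, deny_patterns=DENY_PATTERNS):
    """Keep only the URLs that match none of the deny patterns."""
    compiled = [re.compile(pattern) for pattern in deny_patterns]
    return [url for url in urls if not any(rx.search(url) for rx in compiled)]
```

A URL is dropped as soon as any one pattern matches anywhere in it, so broad patterns should be used with care.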
- for reference
tedivm left a comment
Overall this looks really good, but there are some changes to be made to the GitHub workflows.
- This change prevents automatic pushing to the SCP API during CI
- it should run on pull requests to allow for manual review first
- remove clone step as it's unnecessary
- resolves scp-data#5
- This fixes the unintended removal of the LinkExtractor rule
I adjusted the workflow file, but keep in mind that I only added it by accident. If the edits do not work out, I would prefer to remove the yml file, since I really do not know much about GitHub Actions :)

Also, for testing purposes I forked the API repo, too.
See:
Result Data:

Note: I had actually removed a link rule for "TALES", so in the test run "Tales" is empty. I reverted that with the last commit. I will start another workflow run to make sure this works out.
- Add checks for empty responses and missing 'body' in JSON
- Log errors for various failure scenarios to improve debugging
- Ensure robust parsing of history HTML to prevent crashes
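As a rough illustration of the checks this commit describes, the sketch below validates a raw history response before its HTML is parsed. The function name `extract_history_html` and its `url` parameter are assumptions made for this example and may not match the code in the PR.

```python
import json
import logging

logger = logging.getLogger(__name__)


def extract_history_html(response_text, url="<unknown>"):
    """Return the HTML stored under 'body', or None if the response is unusable.

    Hypothetical helper: the name and signature are assumptions.
    """
    # Empty or whitespace-only responses cannot be parsed at all.
    if not response_text or not response_text.strip():
        logger.error("Empty history response for %s", url)
        return None
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError as exc:
        logger.error("Invalid JSON in history response for %s: %s", url, exc)
        return None
    # The payload must be an object that carries a 'body' key.
    if not isinstance(payload, dict) or "body" not in payload:
        logger.error("History response for %s has no 'body' field", url)
        return None
    return payload["body"]
```

Returning `None` instead of raising lets the caller skip a single bad page and keep crawling.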
- Allows manual triggering of the workflow
- Improves flexibility for testing and updates

- Handle empty history cases by returning an empty list
- Support both dict and list formats for history input
- Safely parse date strings with error handling
- Sort revisions by date, ensuring robustness against missing values
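The history handling listed in this commit could look roughly like the following sketch. The function names, the `date` key, and the date format are all assumptions for illustration; the real post-processing code may differ.

```python
from datetime import datetime

# Assumed date format; the format actually used in the crawled data is not shown here.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_date(value):
    """Parse a date string, returning None instead of raising on bad input."""
    if not isinstance(value, str):
        return None
    try:
        return datetime.strptime(value, DATE_FORMAT)
    except ValueError:
        return None


def normalize_history(history):
    """Return revisions as a list sorted by date; undated revisions go last."""
    if not history:
        return []
    if isinstance(history, dict):
        revisions = list(history.values())
    elif isinstance(history, list):
        revisions = list(history)
    else:
        return []
    # Ignore entries that are not revision dicts.
    revisions = [rev for rev in revisions if isinstance(rev, dict)]

    def sort_key(rev):
        parsed = parse_date(rev.get("date"))
        # The boolean sorts revisions with a missing or invalid date after dated ones.
        return (parsed is None, parsed or datetime.min)

    return sorted(revisions, key=sort_key)
```

Because Python's sort is stable, revisions without a usable date keep their original relative order at the end of the list.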
- Use `get` method to safely access history in hubs and items
- Prevent potential KeyError by ensuring history key is present
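A small sketch of the `get`-based access described here, using made-up hub data. The helper name `get_history` and the sample entries are hypothetical.

```python
def get_history(entry):
    """Return the entry's history, or an empty list when the key is absent or empty."""
    return entry.get("history") or []


# Made-up sample data: one hub with a history key and one without.
hubs = {
    "hub-a": {"title": "Hub A", "history": [{"date": "2020-01-01"}]},
    "hub-b": {"title": "Hub B"},
}

# entry["history"] would raise KeyError for "hub-b"; get_history does not.
counts = {name: len(get_history(hub)) for name, hub in hubs.items()}
```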
This PR adds first-class support for “supplement” pages from the SCP Wiki and integrates them into the existing crawl + post-processing workflow.
- `scp_supplement` to crawl pages tagged supplement and export them to `scp_supplement.json`.
- `run_postproc_supplement` to generate processed outputs under supplement:
  - `content_supplement.json` (full content + history/source/images)
  - `index.json` (metadata + content_file)
- `parent_scp` (best-effort extracted from the link) and parent_tale (best-effort extracted from *- patterns).
- `scp_crawl` and `scp_postprocess` (and still available via dedicated `supplement_*` targets).

How to test

- `make scp` (includes supplements crawl + postprocess)

Or individually:

- `make supplement_crawl`
- `make supplement_postprocess`
- `scrapy crawl scp_supplement -o data/scp_supplement.json`

This PR should fix #2
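For readers wondering what "best-effort extracted from the link" might mean for `parent_scp`, here is one possible sketch. The regular expression, the `guess_parent_scp` name, and the example links are assumptions and are not taken from the PR's implementation.

```python
import re

# Assumed pattern: an "scp-" prefix followed by a 3- or 4-digit number.
SCP_PATTERN = re.compile(r"scp-(\d{3,4})", re.IGNORECASE)


def guess_parent_scp(link):
    """Return a normalized 'SCP-NNN' id found in the link, or None if there is none."""
    if not link:
        return None
    match = SCP_PATTERN.search(link)
    if match is None:
        return None
    return f"SCP-{match.group(1)}"
```

For example, `guess_parent_scp("/scp-173-supplement")` yields `"SCP-173"`, while a link without an SCP number yields `None`, so the field can simply be left empty when nothing is found.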