Fix homepage URL restriction #1013

Draft
ljluestc wants to merge 2 commits into codelucas:master from ljluestc:fix-homepage-url-restriction

@ljluestc

Fixes issue #134 (originally #455): the number of href URLs in a news site's HTML source differs from the number of URLs that Newspaper scrapes.

Changes

  • Added restrict_to_homepage_urls option to newspaper.build to limit articles to homepage <a href> links.
  • Integrated BeautifulSoup for homepage URL extraction.
  • Fixed indexing bug in user example code.
  • Added test case for Reuters homepage scraping.
  • Updated documentation with new option.
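
The homepage restriction described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the PR uses BeautifulSoup, whereas this sketch substitutes the standard library's html.parser so it runs without extra dependencies. The helper names homepage_links and restrict_to_homepage are hypothetical and are not part of newspaper's API.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class _AnchorCollector(HTMLParser):
    """Collects the raw href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def homepage_links(html, base_url):
    """Return the set of absolute, fragment-free http(s) URLs linked from the page.

    Hypothetical helper: stands in for the BeautifulSoup-based extraction.
    """
    parser = _AnchorCollector()
    parser.feed(html)
    links = set()
    for href in parser.hrefs:
        # Resolve relative links against the homepage and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if absolute.startswith(("http://", "https://")):
            links.add(absolute)
    return links


def restrict_to_homepage(candidate_urls, html, base_url):
    """Keep only the candidates that appear as <a href> links on the homepage."""
    allowed = homepage_links(html, base_url)
    return [url for url in candidate_urls if urldefrag(url)[0] in allowed]
```

With a filter like this, a candidate article URL survives only if the homepage markup links to it directly, which is the behaviour the restrict_to_homepage_urls option is described as providing.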

Testing

  • Verified ~300 articles scraped from Reuters homepage.
  • Ensured article URLs match homepage patterns.
  • Tested error handling for failed downloads.
  • Ran existing test suite to confirm no regressions.
