A robust, feature-rich web scraper for extracting article URLs from Substack archive pages with support for multiple browsers, output formats, and resume capabilities.
## Core Features
- Handles infinite scrolling automatically
- Extracts publication dates and titles
- Multiple output formats (TXT, CSV, JSON)
- Resume capability with checkpoint files
- Multi-browser support (Chrome, Firefox, Edge)
- Progress bars and logging
- Configurable via YAML/JSON files
- Smart anti-bot detection avoidance
## Installation

Install from PyPI:

```bash
pip install substack-scraper
```

Or install from source:

```bash
git clone https://github.com/yourusername/substack-scraper.git
cd substack-scraper
pip install -e .
```

### Requirements

- Python 3.8 or later
- One of: Google Chrome, Firefox, or Microsoft Edge
## Usage

Basic scrape:

```bash
substack-scraper https://example.substack.com/archive
```

Export to CSV with dates and titles:

```bash
substack-scraper https://example.substack.com/archive \
    --format csv \
    --show-dates \
    --show-titles \
    --output my_articles
```

Sort by date (newest first):

```bash
substack-scraper https://example.substack.com/archive --sort-by-date
```

Use Firefox instead of Chrome:

```bash
substack-scraper https://example.substack.com/archive --browser firefox
```

Export to JSON:

```bash
substack-scraper https://example.substack.com/archive --format json --output articles
```

Resume from a previous checkpoint:

```bash
substack-scraper https://example.substack.com/archive --resume
```

Debug mode (save HTML for inspection):
```bash
substack-scraper https://example.substack.com/archive --debug
```

## Command-Line Options

- `--config FILE` - Path to configuration file (YAML or JSON)
- `--browser {chrome,firefox,edge}` - Browser engine to use (default: chrome)
- `--no-headless` - Run browser in visible mode
- `--debug` - Enable debug mode and save HTML
- `--resume` - Resume from previous checkpoint
- `--show-dates` - Show publication dates
- `--show-titles` - Show article titles
- `--sort-by-date` - Sort articles by publication date
- `--ascending` - Sort in ascending order (oldest first)
- `--format {txt,csv,json}` - Output format (default: txt)
- `--output NAME` - Output filename (without extension)
- `--output-dir DIR` - Output directory (default: output)
- `--no-console` - Don't print results to console
- `--log-level {DEBUG,INFO,WARNING,ERROR}` - Logging level (default: INFO)
- `--log-file FILE` - Save logs to file
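These flags combine freely. If you want to batch-scrape several archives from a script, one option is to drive the CLI with `subprocess`; a minimal sketch, assuming the `substack-scraper` entry point is on your PATH and using placeholder archive URLs:

```python
import subprocess

# Placeholder archive URLs; replace with the publications you follow.
archives = [
    "https://example.substack.com/archive",
    "https://another.substack.com/archive",
]

for url in archives:
    subprocess.run(
        [
            "substack-scraper", url,
            "--format", "csv",
            "--show-dates",
            "--sort-by-date",
            "--output-dir", "output",
        ],
        check=True,  # stop the batch if one run fails
    )
```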
## Configuration

Create a `config.yaml` or `config.json` file to customize defaults:

```yaml
browser:
  engine: chrome
  headless: true
  timeout: 30

scraping:
  initial_wait:
    min: 3
    max: 6
  max_retries: 3

output:
  format: csv
  directory: output
  include_dates: true
```

Use it with:

```bash
substack-scraper https://example.substack.com/archive --config config.yaml
```
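If you prefer to generate the configuration from a script, the same structure can be written out with PyYAML. A minimal sketch; PyYAML (`pip install pyyaml`) is an assumption here, not a dependency the scraper itself requires:

```python
import yaml  # pip install pyyaml

# Mirror of the example configuration above.
settings = {
    "browser": {"engine": "chrome", "headless": True, "timeout": 30},
    "scraping": {"initial_wait": {"min": 3, "max": 6}, "max_retries": 3},
    "output": {"format": "csv", "directory": "output", "include_dates": True},
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(settings, f, sort_keys=False)
```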
## Output Formats

TXT:

```
01.01.2024 - Article Title - https://example.substack.com/p/article
02.01.2024 - Another Article - https://example.substack.com/p/another
```

CSV:

```csv
date,title,url
01.01.2024,Article Title,https://example.substack.com/p/article
02.01.2024,Another Article,https://example.substack.com/p/another
```

JSON:

```json
{
  "total_articles": 2,
  "articles": [
    {
      "url": "https://example.substack.com/p/article",
      "date": "01.01.2024",
      "title": "Article Title"
    }
  ]
}
```

## Python API

You can also use the scraper programmatically:
```python
from substack_scraper import SubstackScraper, ArticleParser, Exporter
from substack_scraper.config import Config

# Initialize with config
config = Config()
scraper = SubstackScraper(config)
parser = ArticleParser()
exporter = Exporter("output")

# Scrape articles
with scraper:
    html = scraper.scrape_page("https://example.substack.com/archive")
    articles = parser.parse_articles(html, "https://example.substack.com")

# Sort if needed
articles = parser.sort_articles(articles, by_date=True, ascending=False)

# Export
exporter.export(articles, format="csv", filename="articles", include_dates=True)
```
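Since scraping, parsing, and exporting are separate steps, one scrape can feed several outputs. A short sketch building on the example above; it assumes `Exporter.export` can safely be called once per format, which the snippet above doesn't explicitly confirm:

```python
from substack_scraper import SubstackScraper, ArticleParser, Exporter
from substack_scraper.config import Config

parser = ArticleParser()
exporter = Exporter("output")

# Fetch the rendered archive once...
scraper = SubstackScraper(Config())
with scraper:
    html = scraper.scrape_page("https://example.substack.com/archive")

# ...then reuse the parsed article list for every output format.
articles = parser.parse_articles(html, "https://example.substack.com")
for fmt in ("txt", "csv", "json"):
    exporter.export(articles, format=fmt, filename="articles", include_dates=True)
```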
## Troubleshooting

**No articles found:**

- Use `--debug` to save the HTML and inspect the page structure
- Verify the URL is correct and accessible
- Check if your IP is blocked (try opening the page in a normal browser)

**Browser or WebDriver errors:**

- Update the WebDriver: `pip install --upgrade webdriver-manager`
- Try a different browser: `--browser firefox`
- Run in visible mode: `--no-headless`
- Ensure your browser is up to date

**Timeouts:**

- Increase the timeout in the config file (see the retry sketch after this list)
- Check your internet connection
- The site may be slow or blocking requests

**Getting blocked:**

- Reduce scraping frequency
- Use a VPN if necessary
- Increase delays in the config file
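For persistent timeouts, an outer retry loop with growing, jittered pauses can complement the scraper's own `max_retries` setting. A minimal sketch; the exact exception raised on a failed page load depends on the underlying driver and isn't specified here, so it catches broadly:

```python
import random
import time

from substack_scraper import SubstackScraper
from substack_scraper.config import Config

def scrape_with_backoff(url: str, attempts: int = 3) -> str:
    """Retry a flaky page load with a growing, jittered pause between tries."""
    scraper = SubstackScraper(Config())
    with scraper:
        for attempt in range(1, attempts + 1):
            try:
                return scraper.scrape_page(url)
            except Exception:  # narrow this to your driver's timeout error in practice
                if attempt == attempts:
                    raise
                # Back off harder after each failure, with jitter to look less robotic.
                time.sleep(attempt * 5 + random.uniform(0, 2))
```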
## Development

Set up a development environment:

```bash
git clone https://github.com/yourusername/substack-scraper.git
cd substack-scraper
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt -r requirements-dev.txt
```

Run the tests:

```bash
pytest tests/ -v
```

Format, lint, and type-check:

```bash
black src/substack_scraper tests
flake8 src/substack_scraper
mypy src/substack_scraper
```

## Contributing

Contributions are welcome! Please read CONTRIBUTING.md for guidelines.
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## Changelog

See CHANGELOG.md for version history and changes.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- Built with Selenium and BeautifulSoup
- Inspired by the need for better Substack content management

## Support

- Issues: GitHub Issues
- Discussions: GitHub Discussions

**Note:** This tool is for personal use and research. Please respect Substack's Terms of Service and robots.txt. Use responsibly and ethically.

Happy scraping!