A CLI tool for scraping historical newspaper trend data from the Swedish National Library (Kungliga biblioteket).
- Modern CLI built with Typer and Rich for a great user experience
- Flexible keyword loading from .txt, .csv, or .tsv files
- Proximity search support with customizable markers
- Configuration validation via SHA256 hashing to prevent data corruption
- SQLite database with SQLAlchemy ORM for reliable data storage
- Type-safe with full type hints and mypy validation
# Install with pipx
pipx install kb-trend
# Or with pip
python -m pip install kb-trend
# Or from source, with development dependencies
git clone https://github.com/matjoha/kb-trend
cd kb-trend
pip install -e ".[dev]"
Run the interactive setup wizard:
kb-trend init
Or use non-interactive mode with defaults:
kb-trend init --non-interactive
This creates:
- settings.yaml: Configuration file
- kb_trend.sqlite3: SQLite database
- Wildcard query for baseline measurements
Load keywords from a file:
# From plain text file (one keyword per line)
kb-trend add-keywords keywords.txt
# From CSV file
kb-trend add-keywords keywords.csv
# From TSV file
kb-trend add-keywords keywords.tsv
Example CSV format:
title,gender,category
gosse,male,youth
flicka,female,youth
All columns are stored as metadata. The keyword_column setting in settings.yaml selects which column holds the keyword.
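The column selection described above can be sketched with the standard library. This is a minimal illustration, not KB-Trend's actual loader: the function name `load_keywords` and its return shape are hypothetical, and only the `keyword_column` idea comes from the settings file.

```python
import csv
import io


def load_keywords(text, keyword_column, delimiter=","):
    """Return (keyword, metadata) pairs from delimited text with a header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for row in reader:
        keyword = row[keyword_column].strip()
        if keyword:  # skip rows with an empty keyword cell
            # Keep every column (including the keyword column) as metadata.
            rows.append((keyword, dict(row)))
    return rows


sample = "title,gender,category\ngosse,male,youth\nflicka,female,youth\n"
keywords = load_keywords(sample, keyword_column="title")
for keyword, metadata in keywords:
    print(keyword, metadata)
```

For a TSV file, the same sketch would be called with `delimiter="\t"`.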
Execute the scraping queue:
kb-trend run
Options:
- --limit N: Process only N items
- --resume/--restart: Resume from last run or restart
- --config PATH: Use alternate config file
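To illustrate how a limit and resume/restart could interact with a status-tracked queue, here is a small sketch using an in-memory SQLite database. The table layout, the status values `pending` and `done`, and the function `run_queue` are assumptions made for this example; KB-Trend's real schema and behaviour may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical queue table: one row per (query, year) still to be fetched.
conn.execute(
    "CREATE TABLE queue (id INTEGER PRIMARY KEY, query TEXT, year INTEGER, "
    "status TEXT DEFAULT 'pending')"
)
conn.executemany(
    "INSERT INTO queue (query, year) VALUES (?, ?)",
    [("gosse", 1900), ("gosse", 1901), ("flicka", 1900)],
)


def run_queue(conn, limit=None, restart=False):
    """Process pending items and return how many were handled."""
    if restart:
        # Assumed restart behaviour: mark every item as pending again.
        conn.execute("UPDATE queue SET status = 'pending'")
    sql = "SELECT id FROM queue WHERE status = 'pending' ORDER BY id"
    params = ()
    if limit is not None:
        sql += " LIMIT ?"
        params = (limit,)
    ids = [row[0] for row in conn.execute(sql, params).fetchall()]
    for item_id in ids:
        # A real run would fetch and store a hit count here.
        conn.execute("UPDATE queue SET status = 'done' WHERE id = ?", (item_id,))
    conn.commit()
    return len(ids)


first = run_queue(conn, limit=2)       # handles two items, then stops
second = run_queue(conn)               # resumes with the remaining item
third = run_queue(conn, restart=True)  # starts over with all three
print(first, second, third)
```

Because finished items are marked in the database, an interrupted run in this sketch picks up where it stopped without repeating requests.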
Normalize counts against baseline:
kb-trend process
View database statistics:
kb-trend status
The settings.yaml file controls all aspects of the scraper:
db_path: kb_trend.sqlite3
min_year: 1820 # Optional: filter start year
max_year: 2020 # Optional: filter end year
journals: # List of newspapers
- "None" # "None" searches all journals
- "DAGENS NYHETER"
sleep_timer: 1.0 # Seconds between requests
request_timeout: 30 # HTTP timeout
keyword_column: "title" # Which CSV column is the keyword
marker_templates: # Empty = plain search
- "SÖKES"
- "PLATS"
- "ERHÅLLES"
proximity_distance: 5 # Proximity search window
KB-Trend calculates a SHA256 hash of your configuration and stores it in the database. This prevents accidental data corruption if settings change after the database is created.
If you modify settings.yaml, you'll need to either:
- Restore the original settings, or
- Create a new database with kb-trend init --force
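The idea behind the configuration hash can be sketched as follows. How KB-Trend actually serializes the settings and which fields it includes are not stated here, so the canonical-JSON approach and the `config_hash` helper are assumptions for illustration only.

```python
import hashlib
import json


def config_hash(settings):
    """SHA256 hex digest of a settings dict, independent of key order."""
    # Sorting keys gives the same text for the same settings every time.
    canonical = json.dumps(settings, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


settings = {
    "min_year": 1820,
    "max_year": 2020,
    "journals": ["None", "DAGENS NYHETER"],
    "marker_templates": ["SÖKES", "PLATS", "ERHÅLLES"],
    "proximity_distance": 5,
}
stored = config_hash(settings)  # would be saved when the database is created

changed = dict(settings, proximity_distance=10)
print(config_hash(settings) == stored)  # unchanged settings still match
print(config_hash(changed) == stored)   # edited settings no longer match
```

A mismatch between the stored and the freshly computed digest is what signals that the settings no longer correspond to the data already collected.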
Validate your configuration:
kb-trend validate
When marker_templates is empty:
Query: "gosse"
When markers are configured:
Query: "gosse SÖKES"~5 OR "gosse PLATS"~5 OR "gosse ERHÅLLES"~5
This finds "gosse" within 5 words of the markers.
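The two query forms above can be reproduced with a few lines of Python. The function name `build_query` is hypothetical; the output format follows the examples shown in this section.

```python
def build_query(keyword, markers, distance):
    """Build a plain or proximity search phrase for one keyword."""
    if not markers:
        # No markers configured: search for the keyword alone.
        return f'"{keyword}"'
    # One quoted phrase per marker, each with a proximity window.
    parts = [f'"{keyword} {marker}"~{distance}' for marker in markers]
    return " OR ".join(parts)


plain = build_query("gosse", [], 5)
proximity = build_query("gosse", ["SÖKES", "PLATS", "ERHÅLLES"], 5)
print(plain)
print(proximity)
```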
KB-Trend uses the new KB.se data API:
https://data.kb.se/search/?q=PHRASE&searchGranularity=part&from=YYYY-MM-DD&to=YYYY-MM-DD&isPartOf=JOURNAL
This replaces the old Selenium-based scraping of the tidningar.kb.se interface, providing:
- Faster, more reliable scraping
- JSON responses instead of HTML parsing
- No browser dependencies
- Better error handling
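As a rough sketch of how a request URL matching the pattern above could be assembled, the snippet below uses only the standard library and does not contact the API. The helper `build_search_url`, the one-year date window, and the choice to omit isPartOf when no journal is given are assumptions, not confirmed KB-Trend behaviour.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://data.kb.se/search/"


def build_search_url(phrase, year, journal=None):
    """Return a search URL covering one calendar year."""
    params = {
        "q": phrase,
        "searchGranularity": "part",
        "from": f"{year}-01-01",
        "to": f"{year}-12-31",
    }
    if journal is not None:
        # Assumption: leaving out isPartOf searches all journals.
        params["isPartOf"] = journal
    # urlencode percent-encodes quotes, spaces and non-ASCII characters.
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url('"gosse SÖKES"~5', 1900, journal="DAGENS NYHETER")
print(url)
# Decoding the query string recovers the original phrase.
print(parse_qs(urlsplit(url).query)["q"])
```

The resulting URL could then be fetched with an HTTP client such as httpx, pausing between requests according to sleep_timer.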
The database contains the following tables:
- metadata: Configuration hash, schema version
- query: Search queries with metadata from CSV
- journal: Newspaper definitions
- counts: Hit counts by year/query/journal
- queue: Processing queue with status tracking
| Command | Description |
|---|---|
| kb-trend init | Run configuration wizard |
| kb-trend add-keywords <file> | Load keywords from file |
| kb-trend run | Execute scraping queue |
| kb-trend process | Calculate relative frequencies |
| kb-trend status | Show database statistics |
| kb-trend validate | Validate configuration hash |
| kb-trend reset | Reset queue to pending |
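The process command is described as calculating relative frequencies against a baseline. One plausible reading, sketched below, divides each keyword's yearly hit count by the baseline (wildcard) count for the same year. The function `relative_frequencies` and the numbers are made up for illustration and are not taken from KB-Trend or from real data.

```python
def relative_frequencies(counts, baseline):
    """Divide each yearly hit count by the baseline count for that year."""
    result = {}
    for year, hits in counts.items():
        total = baseline.get(year, 0)
        # Years without baseline hits have no meaningful ratio.
        result[year] = hits / total if total else None
    return result


# Illustrative numbers only.
counts = {1900: 12, 1901: 30, 1902: 5}
baseline = {1900: 4000, 1901: 6000, 1902: 0}
freqs = relative_frequencies(counts, baseline)
print(freqs)
```

Normalizing this way keeps years with more digitized material from appearing as spikes in keyword usage.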
# Run all tests with coverage
pytest
# Run with verbose output
pytest -v
# Run specific test file
pytest tests/test_keywords/test_loader.py
mypy src/kb_trend
ruff check src/kb_trend tests
The original KB_TrendScraper used Selenium to scrape the tidningar.kb.se interface. This new version:
- Uses the official KB data API (faster, more reliable)
- Provides a proper CLI with subcommands
- Supports flexible keyword file formats
- Validates configuration to prevent errors
- Has comprehensive test coverage
No automatic migration is provided. To migrate:
- Export your old data if needed
- Run kb-trend init to create a new configuration
- Load your keywords with kb-trend add-keywords
- Run the scraper
CC BY-NC 4.0
Based on the original KB_TrendScraper project, modernized with:
- Typer for CLI
- httpx for HTTP requests
- SQLAlchemy for database
- Pydantic for configuration validation
- pytest for comprehensive testing
If you use KB-Trend in your research, please cite it as:
@software{johansson2025kbtrend,
author = {Johansson, Mathias},
title = {{KB-Trend: Swedish National Library newspaper trend scraper}},
year = {2025},
version = {1.0.1},
url = {https://github.com/DigitalHistory-Lund/kb-trend},
license = {CC-BY-NC-4.0}
}
Or in APA format:
Johansson, M. (2025). KB-Trend: Swedish National Library newspaper trend scraper (Version 1.0.1) [Computer software]. https://github.com/DigitalHistory-Lund/kb-trend