A Rust workspace for collecting, normalizing, storing, and rating news articles at scale. It fetches thousands of articles from curated sources, parses the article content through a lightweight parser service, stores everything in SurrealDB, then uses OpenAI to score and tag each article for positivity, travel potential, and editorial discovery.
The browsing and search UI lives in gusnews, a dedicated frontend for exploring this service's SurrealDB dataset with rich filters and search options.
```mermaid
flowchart LR
    sources[News sources]
    fetcher[fetcher]
    parser[article-parser]
    db[(SurrealDB)]
    rater[rater]
    openai[OpenAI ratings + tags]
    gusnews[gusnews UI]
    sources --> fetcher
    fetcher -->|fetch + parse request| parser
    parser -->|normalized article payload| fetcher
    fetcher -->|deduplicated news records| db
    db -->|unrated recent articles| rater
    rater --> openai
    openai -->|positivity, travel score, tags| rater
    rater -->|rated articles| db
    db -->|searchable dataset| gusnews
```
| Crate | Purpose |
|---|---|
| `fetcher` | Runs source adapters, fetches article pages, deduplicates URLs, cleans article bodies, and writes news records. |
| `rater` | Continuously rates unrated recent articles with OpenAI and stores positivity, travel, and tag metadata. |
| `shared` | Shared config, database model, sanitization helpers, Telegram notifications, and rating logic. |
| `bun-article-parser` | Local article parsing service used by `fetcher` through `ARTICLE_PARSER_URL`. |
- Parallel scraping across dozens of French, Belgian, Québécois, African, and travel/lifestyle sources.
- Per-source selection with `fetcher --enable source-a,source-b`.
- URL deduplication with tag merging when multiple providers discover the same article.
- HTML sanitization and text extraction before persistence.
- Scheduled Docker runtime for repeated fetches during the day.
- OpenAI-powered structured ratings using `gpt-5-nano` and JSON schema output.
- SurrealDB schema for authenticated UI access, article metadata, ratings, notes, and tags.
- Optional Telegram alerts for fetcher/rater failures.
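The URL deduplication with tag merging can be sketched roughly as below. This is a minimal illustration, not the fetcher's actual code; the `Article` struct and its field names are hypothetical stand-ins for the real model in `shared`.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified article record for illustration only.
#[derive(Debug, Clone)]
struct Article {
    url: String,
    tags: Vec<String>,
}

/// Deduplicate by URL, merging tags when several providers
/// discovered the same article.
fn dedup_merge(articles: Vec<Article>) -> Vec<Article> {
    let mut by_url: HashMap<String, Article> = HashMap::new();
    for article in articles {
        by_url
            .entry(article.url.clone())
            .and_modify(|existing| {
                // Keep the first record, folding in any tags it is missing.
                for tag in &article.tags {
                    if !existing.tags.contains(tag) {
                        existing.tags.push(tag.clone());
                    }
                }
            })
            .or_insert(article);
    }
    by_url.into_values().collect()
}
```

The key design point is that a duplicate URL is not simply dropped: its tags survive on the record that is kept, so provider-specific labels are preserved.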
- Rust stable, edition 2024.
- Docker and Docker Compose for the full stack.
- A SurrealDB instance, or the included Compose service.
- An OpenAI API key for `rater`.
- Chromium when running `fetcher` outside Docker.
CI builds and tests the workspace on Rust stable with `cargo build --verbose` and `cargo test --verbose`.
```shell
git clone git@github.com:mirsella/news-scraper.git
cd news-scraper
cp .env.example .env
docker compose up --build
```

The Compose stack starts:

- `surrealdb` on `127.0.0.1:8000`
- `article_parser` on `127.0.0.1:8081`
- `fetcher`, scheduled by cron in the container
- `rater`, running continuously
Create `.env` from `.env.example` and fill in the required values:
| Variable | Description |
|---|---|
| `DB_USER` | SurrealDB root username. |
| `DB_PASSWORD` | SurrealDB root password. |
| `ARTICLE_PARSER_URL` | URL of the article parser service. Compose uses `http://article_parser:8081`. |
| `SURREALDB_HOST` | SurrealDB HTTP host, for example `127.0.0.1:8000`. |
| `OPENAI_API_KEY` | OpenAI API key used by `rater`. |
| `RATING_CHAT_PROMPT` | Rating prompt override kept for env compatibility; the binary currently embeds `rating-prompt.md`. |
| `PARALLEL_RATING` | Maximum number of concurrent OpenAI rating tasks. |
| `TELEGRAM_TOKEN` | Telegram bot token for alerts. |
| `TELEGRAM_ID` | Telegram chat id for alerts. |
| `NO_TELEGRAM` | Set to any value to disable sending Telegram messages at runtime. |
| `CHROME_HEADLESS` | Optional fetcher Chrome headless mode. |
| `CHROME_CONCURRENT` | Optional number of concurrent Chrome-backed fetch tasks. |
| `CHROME_DATA_DIR` | Optional Chrome profile/data directory. |
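As an illustration, a local (non-Compose) `.env` might look like the following. Every value here is a placeholder, not a working credential, and the optional Chrome settings are one plausible choice:

```env
DB_USER=root
DB_PASSWORD=changeme
ARTICLE_PARSER_URL=http://127.0.0.1:8081
SURREALDB_HOST=127.0.0.1:8000
OPENAI_API_KEY=your-openai-key
PARALLEL_RATING=4
TELEGRAM_TOKEN=your-bot-token
TELEGRAM_ID=your-chat-id
# NO_TELEGRAM=1
CHROME_HEADLESS=true
CHROME_CONCURRENT=2
CHROME_DATA_DIR=/tmp/news-scraper-chrome
```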
Build and test everything:
```shell
cargo build
cargo test
```

List available source adapters:

```shell
cargo run -p fetcher -- --list
```

Run the fetcher against selected sources:

```shell
cargo run -p fetcher -- --enable fr::positivr,fr::goodnewsnetwork --ignore-empty-db
```

Run the rater:

```shell
cargo run -p rater
```

The fetcher reads `.env` by default. Use `--env-file path/to/file` if you need another configuration file.
The SurrealDB schema is defined in `schema.surql`. The primary table is `news`, with fields for:

- article title, link, provider, date, caption, HTML body, and extracted text
- `rating`, a 0-100 positivity score
- `rating_travel`, a 0-100 travel/extraordinary-story score
- tags, notes, and `used` state
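For example, a query in the spirit of what the rater needs might look like this. This is a hedged sketch only: the authoritative field definitions are in `schema.surql`, and the one-day recency window here is an assumption, not the rater's actual policy.

```surql
-- Illustrative only: recent articles that have no rating yet.
SELECT title, link, provider, date FROM news
WHERE rating = NONE AND date > time::now() - 1d
ORDER BY date DESC;
```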
gusnews consumes this dataset directly and provides the human-facing UI for filtering, searching, and selecting articles.
The multi-stage `cargo.dockerfile` builds the Rust workspace with musl and produces two runtime targets:

- `fetcher`, based on Alpine with Chromium and cron
- `rater`, based on Alpine with CA certificates and timezone data
The fetcher container is scheduled at 07:00, 09:00, 13:00, 16:00, and 18:00 Europe/Paris time.
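With those times, the container's cron table would look roughly like this. This is a sketch: the actual schedule ships inside the image, and the binary path is hypothetical.

```cron
# Europe/Paris local time inside the container (illustrative)
0 7 * * * /usr/local/bin/fetcher
0 9 * * * /usr/local/bin/fetcher
0 13 * * * /usr/local/bin/fetcher
0 16 * * * /usr/local/bin/fetcher
0 18 * * * /usr/local/bin/fetcher
```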
gusnews is the companion UI frontend for this service. Use it to browse the SurrealDB news database, combine search filters, inspect ratings/tags, and work with the articles collected by this scraper.
This project is licensed under the MIT License.