A modern news article diff tracker that monitors RSS feeds for changes in titles and content, displays diffs on a web frontend, and optionally syndicates diff posts to Bluesky and Mastodon. It draws inspiration from three legacy projects (newsdiffs, diffengine, NYTdiff) but is built with modern technologies.
- Monitor RSS/Atom/JSON Feed for article changes (titles, content, metadata)
- Extract article content automatically (no per-site parsers)
- Detect and visualize diffs between article versions
- Web frontend as the primary interface for browsing and viewing diffs
- Social syndication — post diff summaries/images to Bluesky and ActivityPub (Fediverse)
- Internet Archive — archive each version to the Wayback Machine
- Atom output feeds — subscribe to diffs via RSS reader
- WebSub — instant push updates from hubs that support it
- Sitemap import — seed baseline versions for an entire site
- No Twitter/X — explicitly excluded from scope
| Component | Library |
|---|---|
| Framework | SvelteKit (adapter-node) |
| Database | PostgreSQL (Cloudron addon) |
| ORM | Drizzle |
| Job Queue | BullMQ (Redis-backed, Cloudron addon) |
| Feed Parsing | rss-parser + JSON Feed |
| Content Extraction | Defuddle (primary) + @mozilla/readability (fallback) + JSDOM |
| Diffing | diff (jsdiff) — word-level |
| Bluesky SDK | @atproto/api |
| ActivityPub | @fedify/botkit |
| Diff Card Images | satori + sharp |
| Runtime | Node.js 22 |
| Deployment | Cloudron or Docker Compose |
npm install # Install dependencies
npm run dev # Development server
npm run build # Production build
npm run migrate # Run database migrations
npm test # Run 64 unit tests (Vitest)
64 tests across 10 files. Run with `npm test`.
Covered modules: differ (16), feed-parser (10), schema (10), bluesky (6), auth (5), atom-builder (4), rate-limit (4), extractor (3), card-generator (3), websub (3).
Not unit-tested (integration): feed-poller, syndicator, bot/index.ts — these require Redis/Postgres and are tested against the live deployment.
src/
  lib/
    server/
      db/                 # Drizzle schema + migrations
      workers/            # BullMQ job processors (feed poller, syndicator)
      services/           # Content extraction, diffing, social posting
    components/           # Svelte UI components
  routes/                 # SvelteKit pages and API routes
static/                   # Static assets
CloudronManifest.json     # Cloudron app manifest
Dockerfile                # Cloudron deployment image
start.sh                  # Cloudron startup script
This project is informed by analysis of three existing news diff tools. All are functional concepts with rotting implementations.
What it does (newsdiffs): Full article body diffing with a Django web UI. Scrapes 5 hardcoded news sites (NYT, CNN, Politico, BBC, WashPo), stores every version as a file in Git repos (one per month), and renders diffs client-side with Google diff-match-patch.
Architecture:
- Scraper (cron) -> per-site BeautifulSoup parsers -> Git repo storage -> Django web frontend
- `Articles` and `Version` tables in SQL; article text retrieved via `git show <sha>:<path>`
- Adaptive check frequency: 15 min for new articles, tapering to monthly for old ones
- "Boring" version filtering (skips whitespace-only / encoding-only changes)
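The adaptive check frequency described above can be sketched as a small tier table. This is a minimal illustration, not the newsdiffs implementation: only the 15-minute and monthly endpoints come from the description, and the intermediate tiers, `TIERS`, `checkIntervalMs`, and `isDue` are hypothetical.

```typescript
const MINUTE = 60_000;
const HOUR = 60 * MINUTE;
const DAY = 24 * HOUR;

interface Tier {
  maxAgeMs: number;   // tier applies while the article is younger than this
  intervalMs: number; // minimum wait between two checks
}

// Assumed tiers: only the first interval (15 min) and the final
// fallback (monthly) are taken from the text; the rest are made up.
const TIERS: Tier[] = [
  { maxAgeMs: 3 * HOUR, intervalMs: 15 * MINUTE },
  { maxAgeMs: 1 * DAY, intervalMs: 1 * HOUR },
  { maxAgeMs: 7 * DAY, intervalMs: 6 * HOUR },
  { maxAgeMs: 30 * DAY, intervalMs: 1 * DAY },
];
const OLDEST_INTERVAL_MS = 30 * DAY;

// Pick the check interval for an article of the given age.
function checkIntervalMs(ageMs: number): number {
  for (const tier of TIERS) {
    if (ageMs < tier.maxAgeMs) return tier.intervalMs;
  }
  return OLDEST_INTERVAL_MS;
}

// True when enough time has passed since the last check.
function isDue(firstSeenMs: number, lastCheckedMs: number, nowMs: number): boolean {
  const ageMs = nowMs - firstSeenMs;
  return nowMs - lastCheckedMs >= checkIntervalMs(ageMs);
}
```

A poller could call `isDue` for each tracked article on every tick and enqueue a fetch job only for the ones that return `true`.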
Web UI pages: Homepage with URL lookup, browse recent changes (filterable by source), article history, side-by-side diff view, Atom feeds per source.
What's broken:
- Python 2 only (fatal: `urllib2`, `cookielib`, `print` statements, `except X, e:` syntax)
- Django 1.5 (current is 5.x; uses removed APIs everywhere)
- Requires BOTH BeautifulSoup 3.2 AND 4.0 simultaneously
- All 5 site parsers target 2012-era HTML that no longer exists
- Hardcoded MIT infrastructure paths (`/mit/newsdiffs/.my.cnf`)
- `cleanup.py` has a bare variable name that causes `NameError` at runtime
- Pagination disabled ("overloading the server")
- Secret key committed to repo
Key insight to preserve: Full body diffing + web UI + adaptive scheduling + boring-version filtering. Git-based storage is space-efficient and provides queryable history.
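One way to approximate boring-version filtering is to normalize both versions before comparing them. The sketch below is an assumption about what "whitespace-only / encoding-only" could mean in practice; `normalizeForComparison` and `isBoringChange` are hypothetical names, and NFKC folding is deliberately aggressive (it also folds ligatures and similar compatibility characters).

```typescript
// Reduce a text to a canonical form so cosmetic differences disappear.
function normalizeForComparison(text: string): string {
  return text
    .normalize("NFKC")                // fold compatibility forms, e.g. no-break space
    .replace(/[\u2018\u2019]/g, "'")  // curly single quotes -> straight
    .replace(/[\u201C\u201D]/g, '"')  // curly double quotes -> straight
    .replace(/\s+/g, " ")             // collapse runs of whitespace
    .trim();
}

// A change is "boring" when the two versions are equal after normalization.
function isBoringChange(previous: string, current: string): boolean {
  return normalizeForComparison(previous) === normalizeForComparison(current);
}
```

Running this check before storing a new version keeps whitespace and quote-style churn out of the diff history.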
What it does (diffengine): Feed-agnostic article diff tracker. Monitors any RSS/Atom feed, extracts content via readability-lxml, generates HTML diffs + PNG screenshots, submits to Internet Archive, and notifies via Twitter and SendGrid email.
Architecture:
- Single-package Python app, monolith `__init__.py` (743 lines)
- Peewee ORM (Feed, Entry, FeedEntry, EntryVersion, Diff tables)
- Config via YAML with `${ENV_VAR}` interpolation (envyaml)
- Diff files stored at `{home}/diffs/{id % 257}/{id}.html|.png|thumb.png`
- Selenium (geckodriver/chromedriver) for screenshots
- htmldiff2 for server-side HTML diffing, Jinja2 for diff page template
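The `{id % 257}` segment in the diff path spreads files across a fixed number of directories so that no single directory grows without bound. A minimal sketch of that layout, assuming numeric ids; `diffPaths` is a hypothetical helper and only the `.html` and `.png` names are reproduced here:

```typescript
const SHARD_COUNT = 257; // from the path pattern above

// Build the sharded file paths for one diff id.
function diffPaths(home: string, id: number): { html: string; png: string } {
  const dir = `${home}/diffs/${id % SHARD_COUNT}`;
  return {
    html: `${dir}/${id}.html`,
    png: `${dir}/${id}.png`,
  };
}
```

For example, id 1000 lands in shard 229 because 1000 = 3 × 257 + 229.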
Data flow: RSS feed -> feedparser -> readability-lxml extraction -> fingerprint comparison -> htmldiff2 diff -> Selenium screenshot -> Twitter thread / SendGrid email -> Internet Archive submission
What's broken:
- Twitter API completely broken (`update_with_media` removed June 2023)
- Selenium `executable_path` deprecated/removed in Selenium 4.x
- `stale` property uses `.seconds` instead of `.total_seconds()` (wraps at 86400s)
- Archive.org dependency: if snapshot fails, no notification is sent at all
- `blogged` field is dead code
- Travis CI on Python 3.7 only
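The `.seconds` wrap is worth spelling out, since the same mistake is easy to make in any language. Python's `timedelta.seconds` holds only the sub-day remainder, so an entry last checked 25 hours ago looks as if it was checked 1 hour ago. The sketch below models that arithmetic; the function names and the 2-hour threshold are illustrative assumptions, not diffengine's actual values.

```typescript
const SECONDS_PER_DAY = 86_400;

// Mirrors timedelta.seconds for non-negative durations: the day part is dropped.
function secondsComponent(elapsedSeconds: number): number {
  return elapsedSeconds % SECONDS_PER_DAY;
}

// Buggy staleness check: compares only the sub-day remainder.
function isStaleBuggy(elapsedSeconds: number, thresholdSeconds: number): boolean {
  return secondsComponent(elapsedSeconds) > thresholdSeconds;
}

// Fixed staleness check: compares the full elapsed time.
function isStaleFixed(elapsedSeconds: number, thresholdSeconds: number): boolean {
  return elapsedSeconds > thresholdSeconds;
}

const elapsed = 25 * 3600;  // 25 hours since the last check
const threshold = 2 * 3600; // hypothetical 2-hour staleness threshold
const buggyResult = isStaleBuggy(elapsed, threshold); // remainder is 3600 s
const fixedResult = isStaleFixed(elapsed, threshold); // full value is 90000 s
```

In a TypeScript port the safe pattern is to subtract epoch milliseconds directly, which never discards the day component.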
Key insight to preserve: Feed-agnostic design + readability for automatic content extraction (no per-site parsers) + Internet Archive integration + YAML config with env var support.
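Env var support in config values does not need a dedicated library. The sketch below expands `${NAME}` placeholders in a string from a plain lookup object; `interpolateEnv` is a hypothetical helper and makes no attempt to match envyaml's full feature set (defaults, escaping, and so on are not handled).

```typescript
// Replace every ${NAME} placeholder with the matching value from env.
// Throws on a missing variable rather than silently leaving a gap.
function interpolateEnv(
  template: string,
  env: Record<string, string | undefined>,
): string {
  return template.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match, name: string) => {
    const value = env[name];
    if (value === undefined) {
      throw new Error(`Missing environment variable: ${name}`);
    }
    return value;
  });
}
```

In Node.js the second argument would typically be `process.env`, applied to each string value after the config file has been parsed.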
What it does (NYTdiff): Monitors NYT Top Stories API for metadata changes (headline, abstract, kicker, URL). Generates visual diff images via Selenium screenshots and posts them to Twitter and Bluesky as threaded replies.
Architecture:
- Single file `nytdiff.py` (617 lines), two classes (BaseParser, NYTParser)
- SQLite via `dataset` (nyt_ids + nyt_versions tables)
- simplediff for word-level diffing -> HTML with `<ins>`/`<del>` -> Selenium screenshot -> PNG
- Thread continuity: all diffs for one article accumulate in one social media thread
- Bluesky support via atproto SDK, with image aspect ratio hints and alt text
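The word-level diff to `<ins>`/`<del>` step can be illustrated with a small longest-common-subsequence diff. This is a dependency-free sketch of the general technique, not simplediff's or jsdiff's algorithm; `wordDiff` and `renderHtml` are hypothetical names, and whitespace is normalized to single spaces in the output.

```typescript
type Op = { kind: "equal" | "insert" | "delete"; text: string };

function tokenize(text: string): string[] {
  return text.split(/\s+/).filter((t) => t.length > 0);
}

// Word-level diff via a longest-common-subsequence table.
function wordDiff(oldText: string, newText: string): Op[] {
  const a = tokenize(oldText);
  const b = tokenize(newText);
  // lcs[i][j] = length of the LCS of a[i..] and b[j..]
  const lcs: number[][] = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0),
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j] ? lcs[i + 1][j + 1] + 1 : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  // Walk the table, emitting deletions before insertions on ties.
  const ops: Op[] = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ kind: "equal", text: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ kind: "delete", text: a[i++] });
    } else {
      ops.push({ kind: "insert", text: b[j++] });
    }
  }
  while (i < a.length) ops.push({ kind: "delete", text: a[i++] });
  while (j < b.length) ops.push({ kind: "insert", text: b[j++] });
  return ops;
}

function escapeHtml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Render the operations as HTML with <ins>/<del> markers.
function renderHtml(ops: Op[]): string {
  return ops
    .map((op) => {
      const t = escapeHtml(op.text);
      if (op.kind === "insert") return `<ins>${t}</ins>`;
      if (op.kind === "delete") return `<del>${t}</del>`;
      return t;
    })
    .join(" ");
}
```

The quadratic table is fine for headlines and abstracts; for full article bodies a library implementation is likely the better choice.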
What's broken:
- `PHANTOMJS_PATH` env var required but unused (crashes on startup if missing)
- `TemporaryDirectory(delete=False)` in `with` block = resource leak
- Bare `except:` clauses throughout
- Hardcoded `America/Buenos_Aires` timezone
- Chinese article filter hardcoded
- Hash covers thumbnail/byline but diffs are only generated for 4 fields (silent version bumps)
- No Mastodon code despite README claiming Mastodon support
- Only tracks metadata, not article body text
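The silent version bumps come from the change-detection hash and the diff generator looking at different field sets. One way to rule that out is to derive both from a single field list, as sketched below. The four field names follow the metadata listed for NYTdiff; `TRACKED_FIELDS`, `versionKey`, and `changedFields` are hypothetical, and the key is a plain string that could be hashed before storage.

```typescript
// Single source of truth for what counts as a change.
const TRACKED_FIELDS = ["headline", "abstract", "kicker", "url"] as const;
type TrackedField = (typeof TRACKED_FIELDS)[number];

// Untracked fields may be present but never affect versioning.
type Metadata = Record<TrackedField, string> & {
  thumbnail?: string;
  byline?: string;
};

// Canonical key built only from tracked fields.
function versionKey(meta: Metadata): string {
  return JSON.stringify(TRACKED_FIELDS.map((field) => meta[field]));
}

// Fields that need a diff; empty exactly when the keys are equal.
function changedFields(prev: Metadata, next: Metadata): TrackedField[] {
  return TRACKED_FIELDS.filter((field) => prev[field] !== next[field]);
}
```

With this shape, a new version is recorded only when `changedFields` is non-empty, so every stored version has at least one diff to show.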
Key insight to preserve: Thread-based social posting (all diffs for one article in one thread) + Bluesky support + visual diff images with CSS styling + alt text accessibility.
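Thread continuity boils down to remembering two post references per article: the first post (the thread root) and the most recent one (the parent for the next reply). The sketch below keeps that state as plain data. The `root`/`parent` shape with `uri` and `cid` follows the reply reference used by Bluesky post records; `ThreadState`, `replyFor`, and `advance` are hypothetical names, and persistence and the actual posting call are left out.

```typescript
interface PostRef {
  uri: string;
  cid: string;
}

interface ReplyRef {
  root: PostRef;   // first post in the thread
  parent: PostRef; // post being replied to directly
}

interface ThreadState {
  root: PostRef;
  latest: PostRef;
}

// Reply reference for the next diff post, or undefined to start a new thread.
function replyFor(state: ThreadState | undefined): ReplyRef | undefined {
  return state ? { root: state.root, parent: state.latest } : undefined;
}

// New thread state after a post has been published.
function advance(state: ThreadState | undefined, posted: PostRef): ThreadState {
  return { root: state ? state.root : posted, latest: posted };
}
```

A syndicator would look up the state by article id, pass `replyFor(state)` to its posting call, then store `advance(state, result)`; in this project that state would presumably live in Postgres rather than in memory.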
| Aspect | newsdiffs | diffengine | NYTdiff |
|---|---|---|---|
| Content scope | Full body + title + byline | Full body + title + URL | Metadata only |
| Source flexibility | 5 hardcoded sites | Any RSS/Atom feed | NYT API only |
| Content extraction | Per-site BS parsers | readability-lxml (automatic) | API (no extraction) |
| Diff algorithm | diff-match-patch (client JS) | htmldiff2 (server HTML) | simplediff (word-level) |
| Storage | Git repos + SQL | Peewee SQL + filesystem | dataset/SQLite |
| Web UI | Yes (Django) | No | No |
| Social posting | No | Twitter (broken) + email | Twitter + Bluesky |
| Screenshots | No | Selenium | Selenium |
| Archive.org | No | Yes | No |
| Scheduling | Adaptive backoff | External cron | External cron |
- Feed-agnostic RSS monitoring (from diffengine)
- Automatic content extraction via readability (from diffengine, eliminates per-site parsers)
- Full article body diffing (from newsdiffs — the most valuable feature)
- Web UI for browsing diffs (from newsdiffs — primary interface)
- Thread-based social syndication (from NYTdiff — Bluesky thread model)
- Adaptive check frequency (from newsdiffs — check new articles more often)
- Boring version filtering (from newsdiffs — skip noise)
- Visual diff images for social posts (from NYTdiff/diffengine)
- Internet Archive integration (from diffengine — preserve evidence)
- Alt text on diff images (from NYTdiff — accessibility)
- Per-site HTML parsers (newsdiffs) — use readability instead
- Selenium for screenshots — heavy, fragile, slow
- Single-file monoliths (NYTdiff, diffengine) — proper module structure
- Bare except clauses — proper error handling
- Twitter/X dependency — dead platform for bots
- Hardcoded news sources — must be configurable
- Git subprocess calls for storage (newsdiffs) — use a proper database
- Python 2 anything