Concise guide for AI agents and humans working on LinuxReport. Focus: correct mental model, key files, critical toggles, and non-obvious rules.
LinuxReport is a Python/Flask-based news aggregation platform that builds multiple reports (Linux, AI, COVID-19, PV/Solar, Space, Trump, Detroit Techno, etc.) from RSS/HTML sources. It uses:
- Multi-layer caching (memory + SQLite + file) for performance.
- Background workers for feed fetching and processing.
- Optional Selenium/Tor for complex or sensitive sites.
- LLMs for automatic headlines and summaries.
License: GNU LGPL v3.
- Python 3.x, Flask.
- Package management: uv (recommended, 10-100x faster) or pip (traditional)
- Diskcache (SQLite-backed) and Cacheout for caching.
- Feedparser, BeautifulSoup4 for feeds/HTML.
- Selenium (+ optional Tor) for JS-heavy sites.
- LLMs: OpenAI, sentence_transformers.
- Jinja2 templates; Flask-Assets bundles JS/CSS from templates/ into static/.
Only the most relevant files for contributors/agents:
-
app.py:
- Creates Flask app and core extensions.
- Registers routes and assets, configures compression and security headers.
-
shared.py:
- Mode enums for report types.
- Global caches (g_c = diskcache, g_cm = memory).
- Important feature flags and rate limiting configuration.
- Common utilities used across the project.
-
routes.py:
- Main routes and blueprint-style initialization for feature modules.
- Error handlers, security headers, CORS.
- Page-level caching patterns.
-
workers.py:
- Threaded feed fetching and processing pipeline.
- Integrates report settings, caching, image selection, and deduplication.
-
*_report_settings.py:
- Per-report CONFIG definitions (feeds, schedule, titles, prompts, special handling).
- Adding or changing a report type starts here.
-
Reddit.py:
- Reddit API client integration and helpers.
-
custom_site_handlers.py:
- Site-specific scraping/normalization for tricky domains.
-
image_parser.py / image_utils.py / image_processing.py:
- Image candidate extraction, scoring, and JS-heavy site handling.
-
templates/:
- *.html Jinja templates.
- app.js, core.js, config.js, chat.js, weather.js, infinitescroll.js.
- themes.css, core.css, weather.css, chat.css, config.css.
- These are the source of truth for front-end; they are bundled into static/.
-
static/:
- Generated/bundled JS/CSS and static assets.
- Do not hand-edit generated linuxreport.js / linuxreport.css.
-
tests/:
- Pytest tests for core behavior (locking, dedup, extraction, etc.).
-
config.yaml:
- Runtime configuration (domains, credentials, feature toggles). Never hardcode sensitive values.
Each report type is driven by a dedicated settings file:
- Files: *report_settings.py (ai, linux_, covid_, space_, trump_, pv_, techno_, etc.).
- Each defines a CONFIG object specifying:
- ALL_URLS: map of feed URL → RssInfo.
- SITE_URLS: ordered list of feeds to process.
- CUSTOM_FETCH_CONFIG: special cases (Selenium/Tor/custom handlers).
- SCHEDULE: hours to auto-update.
- WEB_TITLE / WEB_DESCRIPTION, REPORT_PROMPT, and related metadata.
Agents:
- When adding/editing a report, adjust CONFIG here instead of scattering constants.
- Keep schedules, feeds, and prompts consistent and centralized.
- Core endpoint: /api/weather.
- Takes lat/lon; otherwise uses backend logic to infer/fallback.
- Behavior is controlled by flags:
- DISABLE_CLIENT_GEOLOCATION (front-end): when True, do not ask the browser for location.
- DISABLE_IP_GEOLOCATION (shared.py): when True, fallback is Detroit; when False, use IP geolocation.
- Results are cached to avoid repeated lookups.
- Agents modifying weather behavior must:
- Respect these flags.
- Preserve caching to avoid hammering external APIs.
LinuxReport relies heavily on caching. Agents must integrate with these layers correctly.
Core layers:
- Disk cache (g_c, via diskcache / SQLite)
- Persistent and shared across processes.
- Stores:
- Weather and geolocation buckets.
- Chat comments, banned IPs, rate limiting and lock state.
- Feed content and other durable data.
- Use for: values that must survive restarts or be shared between workers.
- Memory cache (g_cm, via cacheout)
- In-process, fast, TTL-based.
- Use for:
- Hot data, e.g., full page HTML, parsed feeds, small computed results.
- Do not rely on it for correctness; it is an optimization only.
- File-based cache
- AI-generated HTML snippets like {mode}reportabove.html.
- Used as a lightweight content store and mtime indicator.
- Avoid extra file I/O when mtime/content matches expectations.
- Assets and CDN
- JS/CSS are built from templates/ into static/linuxreport.js and static/linuxreport.css (often served via CDN).
- Long cache lifetimes are assumed.
- When changing front-end behavior, update templates/*; let the build/bundling pipeline regenerate static assets.
Agent rules:
- Do:
- Prefer existing cache helpers instead of creating ad-hoc caches.
- Reuse cache keys/patterns where the codebase already uses them.
- Do not:
- Introduce separate SQLite/databases for similar data.
- Bypass caching for hot paths (feeds, weather, front page) unless debugging.
- Modify generated static/* files by hand.
- workers.py:
- Uses thread pools for concurrent feed fetching and processing.
- Integrates:
- *_report_settings.py for which feeds to hit and how.
- Deduplication, caching, and image selection.
- Optional Tor/Selenium paths for complex sources.
- Custom site handlers:
- Implement site-specific scraping/normalization logic without cluttering generic pipelines.
- Tor/Selenium:
- Encapsulated in seleniumfetch.py, Tor.py, and custom handlers.
- Toggle or extend here; do not scatter Selenium/Tor usage.
- Controlled by ENABLE_REDDIT_API_FETCH in shared.py.
- When ENABLE_REDDIT_API_FETCH = False (default):
- Legacy behavior:
- Reddit URLs fetched via legacy RedditFetcher in workers.py using Tor and/or RSS/feedparser flows.
- Legacy behavior:
- When ENABLE_REDDIT_API_FETCH = True:
- New behavior:
- Reddit URLs fetched via fetch_reddit_feed_as_feedparser() in Reddit.py through RedditFetcher in workers.py.
- Returns feedparser-like entries; workers do not need special handling.
- New behavior:
- Bootstrapping:
- Run Reddit.py once interactively on the server to obtain tokens and write reddit_token.json (mode 0600).
- Do not commit credentials or reddit_token.json.
This flag only affects how Reddit feeds are fetched; other feeds are unaffected.
- Follow PEP 8.
- Indent with 4 spaces.
- Target max line length: 160 chars.
- Use existing patterns:
- Classes: PascalCase (SiteConfig, RssInfo).
- Functions/vars: snake_case.
- Constants: UPPER_SNAKE_CASE.
- Filenames: lowercase_with_underscores.py.
- Place:
- New core modules at repo root.
- New report configs in *_report_settings.py.
- New Jinja templates in templates/.
- New static assets under static/images/ (or appropriate subdir).
- Admin:
- Flask-Login for authentication; credentials/config from config.yaml.
- Forms:
- Flask-WTF with CSRF—reuse existing patterns for new forms.
- Rate limiting:
- Flask-Limiter configured centrally; follow its helpers/utilities for new endpoints.
- CORS and headers:
- Security headers and allowed domains are set in core routes; align new endpoints with these defaults.
- IP blocking:
- Banned IPs stored persistently (diskcache); use the same mechanism when extending protections.
- Never:
- Commit secrets, tokens, passwords, or private keys.
- Introduce unauthenticated admin-style endpoints.
- Use
picklefor data from untrusted or external sources (risk of arbitrary code execution). - Use MD5 for security-sensitive hashing; prefer SHA-256 or better.
- Edit JS/CSS source only in templates/:
- JS: app.js, core.js, config.js, chat.js, weather.js, infinitescroll.js.
- CSS: themes.css, core.css, weather.css, chat.css, config.css.
- These are bundled into:
- static/linuxreport.js
- static/linuxreport.css
- Rules:
- Do not edit static/linuxreport.* directly.
- Keep CSRF handling and existing event/delegation patterns consistent.
- Respect theme, font-size, infinite scroll, and auto-refresh behavior already implemented.
- config.yaml:
- Central runtime config: domains, credentials, feature flags, storage settings.
- *_report_settings.py:
- Per-report source and behavior.
- Guidelines:
- Never hardcode sensitive values into Python/JS.
- Keep configuration-driven behavior in these files instead of scattering constants.
Adding a new report type:
- Create {type}_report_settings.py with a CONFIG object:
- Use existing *_report_settings.py as reference.
- Add {type}reportabove.html template and relevant branding in static/images/.
- Wire any domain/routing needs in config.yaml and routes.py (following existing patterns).
- Ensure workers pick it up (typically automatic via CONFIG structure).
Adding a new feature/route:
- Create a module, e.g. my_feature.py, that exposes init_my_feature_routes(app).
- Call that initializer from routes.py.
- Add templates/static assets under templates/ and static/ as needed.
- Add or update tests under tests/ (see "Relevant tests" below).
Modifying caching behavior:
- For page caching: adjust keys/TTLs where caching decorators or helpers are used (often in routes.py or shared utilities).
- For data caching:
- Use g_c for persistent/shared data.
- Use g_cm for hot ephemeral data.
- When changing data flows:
- Implement appropriate invalidation or key versioning.
- After cache-related changes, run the relevant cache/perf tests.
Setup:
- git clone
- cd LinuxReport
- Copy/edit config.yaml for local settings.
Dependency installation options:
Option 1: uv (recommended - 10-100x faster, reproducible builds)
- Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh - uv sync
- uv run python app.py
Option 2: pip (traditional, widely supported)
- pip install -r requirements.txt
- python app.py
Note: The dev environment uses .venv virtual environment, but both pyproject.toml (for uv users) and requirements.txt (for pip users) are maintained to support different development workflows and deployment scenarios.
Testing:
- Run the full suite when in doubt:
- pytest tests/
- Prefer targeted tests after specific changes.
Assets:
- Edit JS/CSS in templates/.
- Let the asset pipeline (Flask-Assets or equivalent) regenerate static bundles.
When modifying specific areas, run at least the corresponding tests:
-
Caching / compression / infra:
- tests/test_compression.py
- tests/test_sqlitelock.py
-
Feed parsing, titles, deduplication:
- tests/test_extract_titles.py
- tests/test_dedup.py
-
Forms and input handling:
- tests/test_forms.py
-
Browser / scraper behavior:
- tests/test_browser_switch.py
- tests/playwright_test.py
- tests/playwright_simple_test.py
- tests/selenium_test.py
-
Tor / network behavior:
- tests/tortest.py
Note:
- Additional benchmark scripts exist under tests/ for historical performance experiments. They are optional and not part of the normal regression suite.
- Use caching correctly before optimizing elsewhere.
- Keep I/O-bound work in workers/thread pools rather than blocking request handlers.
- Leverage existing minified/bundled assets in production.
- Respect rate limiting and avoid adding unbounded per-request work.
- Cache/db:
- Verify cache.db exists and is writable.
- Confirm cache keys and TTLs are sane when debugging stale/fresh data.
- Feeds:
- Check network/SSL.
- Verify related *_report_settings.py and custom_site_handlers.py entries.
- Inspect workers and Tor/Selenium configuration for special sites.
- Assets:
- Ensure bundling runs; check console/network for 404s or JS errors.
- Auth:
- Confirm config.yaml credentials.
- Verify Flask-Login wiring and protected routes.
Do:
- Use existing caching infrastructure and helpers.
- Follow the modular route pattern for new features.
- Keep behavior configuration-driven (config.yaml, *_report_settings.py).
- Use type hints and match existing style/patterns.
- Add or update tests for non-trivial changes.
Avoid:
- Bypassing caches on hot paths.
- Hardcoding config, secrets, or environment-specific values.
- Editing generated static/ assets directly.
- Creating circular imports.
- Shipping credentials, tokens, or private keys in the repo.
- Create a Reddit "script" app in your Reddit preferences.
- Set redirect URI to a URL handled by /reddit/callback (see routes.py); this route is cosmetic.
- On the server, run Reddit.py once:
- Provide client_id, client_secret, username, password.
- It writes reddit_token.json (mode 0600) with tokens.
- Toggle:
- ENABLE_REDDIT_API_FETCH in shared.py.
- False: legacy Tor/Selenium/RSS pipeline.
- True: Reddit API via Reddit.py + workers.py.
- ENABLE_REDDIT_API_FETCH in shared.py.
- Never commit reddit_token.json or credentials.
If present in the repo/deploy environment, see:
- README.md: setup and high-level overview.
- Caching.md: deeper caching internals.
- README_object_storage_sync.md: CDN/object storage sync configuration.
- PWA.md, Scaling.md, ROADMAP.md: advanced deployment and roadmap details.
- httpd-vhosts-sample.conf: sample Apache deployment.
External references:
- Flask, Jinja2, PEP 8, Flask-Assets, Diskcache docs for library specifics.