AI Agent Guide for LinuxReport

Concise guide for AI agents and humans working on LinuxReport. Focus: correct mental model, key files, critical toggles, and non-obvious rules.

Project Overview

LinuxReport is a Python/Flask-based news aggregation platform that builds multiple reports (Linux, AI, COVID-19, PV/Solar, Space, Trump, Detroit Techno, etc.) from RSS/HTML sources. It uses:

Multi-layer caching (memory + SQLite + file) for performance.
Background workers for feed fetching and processing.
Optional Selenium/Tor for complex or sensitive sites.
LLMs for automatic headlines and summaries.

License: GNU LGPL v3.

Key Technologies (what matters to your code)

Python 3.x, Flask.
Package management: uv (recommended, 10-100x faster) or pip (traditional)
Diskcache (SQLite-backed) and Cacheout for caching.
Feedparser, BeautifulSoup4 for feeds/HTML.
Selenium (+ optional Tor) for JS-heavy sites.
LLMs: OpenAI, sentence_transformers.
Jinja2 templates; Flask-Assets bundles JS/CSS from templates/ into static/.

Key Files and Layout (mental map)

Only the most relevant files for contributors/agents:

app.py:
- Creates Flask app and core extensions.
- Registers routes and assets, configures compression and security headers.
shared.py:
- Mode enums for report types.
- Global caches (g_c = diskcache, g_cm = memory).
- Important feature flags and rate limiting configuration.
- Common utilities used across the project.
routes.py:
- Main routes and blueprint-style initialization for feature modules.
- Error handlers, security headers, CORS.
- Page-level caching patterns.
workers.py:
- Threaded feed fetching and processing pipeline.
- Integrates report settings, caching, image selection, and deduplication.
*_report_settings.py:
- Per-report CONFIG definitions (feeds, schedule, titles, prompts, special handling).
- Adding or changing a report type starts here.
Reddit.py:
- Reddit API client integration and helpers.
custom_site_handlers.py:
- Site-specific scraping/normalization for tricky domains.
image_parser.py / image_utils.py / image_processing.py:
- Image candidate extraction, scoring, and JS-heavy site handling.
templates/:
- *.html Jinja templates.
- app.js, core.js, config.js, chat.js, weather.js, infinitescroll.js.
- themes.css, core.css, weather.css, chat.css, config.css.
- These are the source of truth for front-end; they are bundled into static/.
static/:
- Generated/bundled JS/CSS and static assets.
- Do not hand-edit generated linuxreport.js / linuxreport.css.
tests/:
- Pytest tests for core behavior (locking, dedup, extraction, etc.).
config.yaml:
- Runtime configuration (domains, credentials, feature toggles). Never hardcode sensitive values.

Report Type System

Each report type is driven by a dedicated settings file:

Files: *report_settings.py (ai, linux_, covid_, space_, trump_, pv_, techno_, etc.).
Each defines a CONFIG object specifying:
- ALL_URLS: map of feed URL → RssInfo.
- SITE_URLS: ordered list of feeds to process.
- CUSTOM_FETCH_CONFIG: special cases (Selenium/Tor/custom handlers).
- SCHEDULE: hours to auto-update.
- WEB_TITLE / WEB_DESCRIPTION, REPORT_PROMPT, and related metadata.

Agents:

When adding/editing a report, adjust CONFIG here instead of scattering constants.
Keep schedules, feeds, and prompts consistent and centralized.

Weather System (short)

Core endpoint: /api/weather.
Takes lat/lon; otherwise uses backend logic to infer/fallback.
Behavior is controlled by flags:
- DISABLE_CLIENT_GEOLOCATION (front-end): when True, do not ask the browser for location.
- DISABLE_IP_GEOLOCATION (shared.py): when True, fallback is Detroit; when False, use IP geolocation.
Results are cached to avoid repeated lookups.
Agents modifying weather behavior must:
- Respect these flags.
- Preserve caching to avoid hammering external APIs.

Caching and Storage (read this; do not bypass)

LinuxReport relies heavily on caching. Agents must integrate with these layers correctly.

Core layers:

Disk cache (g_c, via diskcache / SQLite)

Persistent and shared across processes.
Stores:
- Weather and geolocation buckets.
- Chat comments, banned IPs, rate limiting and lock state.
- Feed content and other durable data.
Use for: values that must survive restarts or be shared between workers.

Memory cache (g_cm, via cacheout)

In-process, fast, TTL-based.
Use for:
- Hot data, e.g., full page HTML, parsed feeds, small computed results.
Do not rely on it for correctness; it is an optimization only.

File-based cache

AI-generated HTML snippets like {mode}reportabove.html.
Used as a lightweight content store and mtime indicator.
Avoid extra file I/O when mtime/content matches expectations.

Assets and CDN

JS/CSS are built from templates/ into static/linuxreport.js and static/linuxreport.css (often served via CDN).
Long cache lifetimes are assumed.
When changing front-end behavior, update templates/*; let the build/bundling pipeline regenerate static assets.

Agent rules:

Do:
- Prefer existing cache helpers instead of creating ad-hoc caches.
- Reuse cache keys/patterns where the codebase already uses them.
Do not:
- Introduce separate SQLite/databases for similar data.
- Bypass caching for hot paths (feeds, weather, front page) unless debugging.
- Modify generated static/* files by hand.

Feed Processing and Workers

workers.py:
- Uses thread pools for concurrent feed fetching and processing.
- Integrates:
  - *_report_settings.py for which feeds to hit and how.
  - Deduplication, caching, and image selection.
  - Optional Tor/Selenium paths for complex sources.
Custom site handlers:
- Implement site-specific scraping/normalization logic without cluttering generic pipelines.
Tor/Selenium:
- Encapsulated in seleniumfetch.py, Tor.py, and custom handlers.
- Toggle or extend here; do not scatter Selenium/Tor usage.

Reddit Integration (important toggle)

Controlled by ENABLE_REDDIT_API_FETCH in shared.py.
When ENABLE_REDDIT_API_FETCH = False (default):
- Legacy behavior:
  - Reddit URLs fetched via legacy RedditFetcher in workers.py using Tor and/or RSS/feedparser flows.
When ENABLE_REDDIT_API_FETCH = True:
- New behavior:
  - Reddit URLs fetched via fetch_reddit_feed_as_feedparser() in Reddit.py through RedditFetcher in workers.py.
  - Returns feedparser-like entries; workers do not need special handling.
Bootstrapping:
- Run Reddit.py once interactively on the server to obtain tokens and write reddit_token.json (mode 0600).
- Do not commit credentials or reddit_token.json.

This flag only affects how Reddit feeds are fetched; other feeds are unaffected.

Coding Conventions (brief but binding)

Follow PEP 8.
Indent with 4 spaces.
Target max line length: 160 chars.
Use existing patterns:
- Classes: PascalCase (SiteConfig, RssInfo).
- Functions/vars: snake_case.
- Constants: UPPER_SNAKE_CASE.
- Filenames: lowercase_with_underscores.py.
Place:
- New core modules at repo root.
- New report configs in *_report_settings.py.
- New Jinja templates in templates/.
- New static assets under static/images/ (or appropriate subdir).

Security and Authentication (project-specific points)

Admin:
- Flask-Login for authentication; credentials/config from config.yaml.
Forms:
- Flask-WTF with CSRF—reuse existing patterns for new forms.
Rate limiting:
- Flask-Limiter configured centrally; follow its helpers/utilities for new endpoints.
CORS and headers:
- Security headers and allowed domains are set in core routes; align new endpoints with these defaults.
IP blocking:
- Banned IPs stored persistently (diskcache); use the same mechanism when extending protections.
Never:
- Commit secrets, tokens, passwords, or private keys.
- Introduce unauthenticated admin-style endpoints.
- Use pickle for data from untrusted or external sources (risk of arbitrary code execution).
- Use MD5 for security-sensitive hashing; prefer SHA-256 or better.

JavaScript and CSS Architecture (minimal operational view)

Edit JS/CSS source only in templates/:
- JS: app.js, core.js, config.js, chat.js, weather.js, infinitescroll.js.
- CSS: themes.css, core.css, weather.css, chat.css, config.css.
These are bundled into:
- static/linuxreport.js
- static/linuxreport.css
Rules:
- Do not edit static/linuxreport.* directly.
- Keep CSRF handling and existing event/delegation patterns consistent.
- Respect theme, font-size, infinite scroll, and auto-refresh behavior already implemented.

Configuration Management

config.yaml:
- Central runtime config: domains, credentials, feature flags, storage settings.
*_report_settings.py:
- Per-report source and behavior.
Guidelines:
- Never hardcode sensitive values into Python/JS.
- Keep configuration-driven behavior in these files instead of scattering constants.

Common Tasks (quick recipes)

Adding a new report type:

Create {type}_report_settings.py with a CONFIG object:
- Use existing *_report_settings.py as reference.
Add {type}reportabove.html template and relevant branding in static/images/.
Wire any domain/routing needs in config.yaml and routes.py (following existing patterns).
Ensure workers pick it up (typically automatic via CONFIG structure).

Adding a new feature/route:

Create a module, e.g. my_feature.py, that exposes init_my_feature_routes(app).
Call that initializer from routes.py.
Add templates/static assets under templates/ and static/ as needed.
Add or update tests under tests/ (see "Relevant tests" below).

Modifying caching behavior:

For page caching: adjust keys/TTLs where caching decorators or helpers are used (often in routes.py or shared utilities).
For data caching:
- Use g_c for persistent/shared data.
- Use g_cm for hot ephemeral data.
When changing data flows:
- Implement appropriate invalidation or key versioning.
After cache-related changes, run the relevant cache/perf tests.

Development Workflows (minimal)

Setup:

git clone
cd LinuxReport
Copy/edit config.yaml for local settings.

Dependency installation options:

Option 1: uv (recommended - 10-100x faster, reproducible builds)

Install uv: curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
uv run python app.py

Option 2: pip (traditional, widely supported)

pip install -r requirements.txt
python app.py

Note: The dev environment uses .venv virtual environment, but both pyproject.toml (for uv users) and requirements.txt (for pip users) are maintained to support different development workflows and deployment scenarios.

Testing:

Run the full suite when in doubt:
- pytest tests/
Prefer targeted tests after specific changes.

Assets:

Edit JS/CSS in templates/.
Let the asset pipeline (Flask-Assets or equivalent) regenerate static bundles.

Relevant tests (what to run after changes)

When modifying specific areas, run at least the corresponding tests:

Caching / compression / infra:
- tests/test_compression.py
- tests/test_sqlitelock.py
Feed parsing, titles, deduplication:
- tests/test_extract_titles.py
- tests/test_dedup.py
Forms and input handling:
- tests/test_forms.py
Browser / scraper behavior:
- tests/test_browser_switch.py
- tests/playwright_test.py
- tests/playwright_simple_test.py
- tests/selenium_test.py
Tor / network behavior:
- tests/tortest.py

Note:

Additional benchmark scripts exist under tests/ for historical performance experiments. They are optional and not part of the normal regression suite.

Performance Considerations

Use caching correctly before optimizing elsewhere.
Keep I/O-bound work in workers/thread pools rather than blocking request handlers.
Leverage existing minified/bundled assets in production.
Respect rate limiting and avoid adding unbounded per-request work.

Troubleshooting (short checklist)

Cache/db:
- Verify cache.db exists and is writable.
- Confirm cache keys and TTLs are sane when debugging stale/fresh data.
Feeds:
- Check network/SSL.
- Verify related *_report_settings.py and custom_site_handlers.py entries.
- Inspect workers and Tor/Selenium configuration for special sites.
Assets:
- Ensure bundling runs; check console/network for 404s or JS errors.
Auth:
- Confirm config.yaml credentials.
- Verify Flask-Login wiring and protected routes.

Important Guidelines (agent checklist)

Do:

Use existing caching infrastructure and helpers.
Follow the modular route pattern for new features.
Keep behavior configuration-driven (config.yaml, *_report_settings.py).
Use type hints and match existing style/patterns.
Add or update tests for non-trivial changes.

Avoid:

Bypassing caches on hot paths.
Hardcoding config, secrets, or environment-specific values.
Editing generated static/ assets directly.
Creating circular imports.
Shipping credentials, tokens, or private keys in the repo.

Reddit API Setup (for agents and operators)

Create a Reddit "script" app in your Reddit preferences.
Set redirect URI to a URL handled by /reddit/callback (see routes.py); this route is cosmetic.
On the server, run Reddit.py once:
- Provide client_id, client_secret, username, password.
- It writes reddit_token.json (mode 0600) with tokens.
Toggle:
- ENABLE_REDDIT_API_FETCH in shared.py.
  - False: legacy Tor/Selenium/RSS pipeline.
  - True: Reddit API via Reddit.py + workers.py.
Never commit reddit_token.json or credentials.

Additional Documentation (if available)

If present in the repo/deploy environment, see:

README.md: setup and high-level overview.
Caching.md: deeper caching internals.
README_object_storage_sync.md: CDN/object storage sync configuration.
PWA.md, Scaling.md, ROADMAP.md: advanced deployment and roadmap details.
httpd-vhosts-sample.conf: sample Apache deployment.

External references:

Flask, Jinja2, PEP 8, Flask-Assets, Diskcache docs for library specifics.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AI Agent Guide for LinuxReport

Project Overview

Key Technologies (what matters to your code)

Key Files and Layout (mental map)

Report Type System

Weather System (short)

Caching and Storage (read this; do not bypass)

Feed Processing and Workers

Reddit Integration (important toggle)

Coding Conventions (brief but binding)

Security and Authentication (project-specific points)

JavaScript and CSS Architecture (minimal operational view)

Configuration Management

Common Tasks (quick recipes)

Development Workflows (minimal)

Relevant tests (what to run after changes)

Performance Considerations

Troubleshooting (short checklist)

Important Guidelines (agent checklist)

Reddit API Setup (for agents and operators)

Additional Documentation (if available)

FilesExpand file tree

agents.md

Latest commit

History

agents.md

File metadata and controls

AI Agent Guide for LinuxReport

Project Overview

Key Technologies (what matters to your code)

Key Files and Layout (mental map)

Report Type System

Weather System (short)

Caching and Storage (read this; do not bypass)

Feed Processing and Workers

Reddit Integration (important toggle)

Coding Conventions (brief but binding)

Security and Authentication (project-specific points)

JavaScript and CSS Architecture (minimal operational view)

Configuration Management

Common Tasks (quick recipes)

Development Workflows (minimal)

Relevant tests (what to run after changes)

Performance Considerations

Troubleshooting (short checklist)

Important Guidelines (agent checklist)

Reddit API Setup (for agents and operators)

Additional Documentation (if available)