Exclude bot traffic and non-content endpoints from analytics #77

@x3ek

Description

The analytics middleware records every HTTP request as a page view in the database. This means search engine crawlers (Googlebot, Bingbot, etc.) hitting endpoints like /robots.txt, /sitemap.xml, /feed.xml, and content pages inflate view counts with non-human traffic.

Problems

  • Non-content endpoints tracked: /robots.txt, /sitemap.xml, /feed.xml, /health, /favicon.ico, /pygments.css all generate analytics rows despite not being user-facing page views
  • Bot traffic not filtered: Crawler User-Agents are counted as regular views, skewing admin dashboard metrics
  • Unnecessary DB writes: Every bot hit creates a database row, adding write load with no analytical value

Possible Approaches

  • Filter by Content-Type — only track responses with text/html content type
  • Filter by path — exclude known non-content paths (/robots.txt, /sitemap.xml, /feed.xml, /health, etc.)
  • Filter by User-Agent — detect common bot UA strings and skip tracking
  • Combination of the above
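
A combined first pass could live behind a single predicate. The sketch below is illustrative, not the project's actual code: the path set, bot UA markers, and function name `should_track` are assumptions, and the real middleware in main.py may structure this differently.

```python
# Hypothetical combined filter: path + User-Agent + Content-Type.
# All lists here are illustrative placeholders, not the project's.

EXCLUDED_PATHS = {"/robots.txt", "/sitemap.xml", "/feed.xml",
                  "/health", "/favicon.ico", "/pygments.css"}
BOT_UA_MARKERS = ("bot", "crawler", "spider", "slurp")

def should_track(path: str, user_agent: str, content_type: str) -> bool:
    """Return True only for plausible human page views."""
    # Known non-content endpoints never count as views.
    if path in EXCLUDED_PATHS:
        return False
    # Cheap substring check against common crawler UA strings.
    ua = user_agent.lower()
    if any(marker in ua for marker in BOT_UA_MARKERS):
        return False
    # Only HTML responses are page views; excludes XML, JSON, CSS, text.
    return content_type.split(";")[0].strip().lower() == "text/html"
```

The middleware would call this once per request and skip the DB write when it returns False, which also addresses the unnecessary-write problem.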

Implementation Notes

  • Analytics middleware is in main.py
  • Simplest first pass: only track text/html responses, which covers all real page views and excludes XML, JSON, CSS, and plain text endpoints
  • Bot filtering by User-Agent could be a follow-up enhancement
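
The simplest first pass reduces to one Content-Type check. A minimal sketch, assuming the middleware has access to response headers as a dict (helper name `is_html_response` is hypothetical):

```python
# Hypothetical helper for the text/html-only first pass. The real
# middleware in main.py would call this before writing the analytics row.

def is_html_response(headers: dict) -> bool:
    # Header lookup is case-insensitive in most frameworks; normalize here
    # in case the sketch is used with a plain dict.
    ctype = headers.get("content-type") or headers.get("Content-Type", "")
    # Strip parameters like "; charset=utf-8" before comparing.
    return ctype.split(";")[0].strip().lower() == "text/html"
```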

— Claude
