[BUG]: time-to-read userscript activates on non-article websites #13

@bittricky

Description

Userscript

Problem:

The time-to-read userscript currently activates on many websites that don't contain article content or long-form text. This causes unnecessary script execution and occasionally displays reading time indicators on inappropriate pages.

Potential Causes:

After looking through the code, I've found several factors contributing to this issue:

  1. Overly Broad URL Matching:

    • Using @match *://*/* in the userscript metadata means the script loads on every website
    • While the script does have exclusion logic, it only blocks specific URL patterns rather than identifying article sites
  2. Generic Content Selectors:

    • Fallback selectors like main, .content, and generic article tags match elements on many non-article websites
    • The "catch-all" pattern { domain: "*", contentSelector: "main" } is particularly problematic as nearly all websites have a <main> element
  3. Liberal Text Block Detection:

    • The findLargestTextBlock() function selects any block with sufficient text density
    • Many sites with long navigation menus, documentation, or code examples can trigger this detection
  4. Minimal Word Count Threshold:

    • Current MIN_WORD_COUNT of 100 is too low for reliably identifying article content
    • Many non-article pages (product listings, documentation indexes) can exceed this threshold
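To make cause 4 concrete, here is a minimal sketch of a word-count gate. The helpers `countWords` and `passesThreshold` and the sample listing are hypothetical and not taken from the userscript; only the `MIN_WORD_COUNT` value of 100 comes from this issue.

```javascript
// Hypothetical sketch: only the threshold value of 100 comes from the issue.
const MIN_WORD_COUNT = 100;

// Count whitespace-separated tokens in a block of text.
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// True when the text has at least `minWords` words.
function passesThreshold(text, minWords = MIN_WORD_COUNT) {
  return countWords(text) >= minWords;
}

// A made-up product listing: 30 cards of 5 words each, 150 words in total.
const listing = Array.from(
  { length: 30 },
  (_, i) => `Widget ${i + 1} Add to cart`
).join("\n");

console.log(countWords(listing));            // 150
console.log(passesThreshold(listing));       // true: clears the current limit
console.log(passesThreshold(listing, 250));  // false: a higher limit rejects it
```

A longer listing would still clear a higher limit, so the word count probably works best as one signal among several rather than as the only gate.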

Steps to reproduce:

  1. Visit sites like GitHub repository pages, documentation sites, or e-commerce product listings
  2. Observe that the reading time indicator appears despite no article content being present
  3. Specific examples:
    • Shopping cart pages on e-commerce sites
    • GitHub repository main pages
    • API documentation home pages
    • Social media feed pages

Proposed Solutions:

  1. Improved Site Detection:

    • Implement a more robust content detection algorithm
    • Consider checking meta tags (e.g., <meta property="og:type" content="article">)
    • Look for article schema markup (itemtype="http://schema.org/Article")
  2. More Specific URL Matching:

    • Limit script execution to known content sites (news, blogs)
    • Add more excluded URL patterns for common non-article paths
  3. Better Content Heuristics:

    • Assess text-to-HTML ratio in candidate content blocks
    • Check for article-specific patterns (date published, author byline, etc.)
    • Increase MIN_WORD_COUNT to a more selective threshold (e.g., 250-300)
  4. User Configuration:

    • Add a whitelist feature where users can specify which sites should run the script
    • Implement a "training mode" where users can manually select article blocks
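The proposals above could be combined roughly as follows. This is a sketch under stated assumptions, not the script's current code: every function name, the excluded path list, the 0.25 text-to-HTML cut-off, and the shape of the `signals` object are hypothetical, and the 300-word limit is simply the upper end of the range suggested above. The decision logic is kept separate from DOM access so it can be tested without a browser.

```javascript
// Hypothetical sketch: all names and thresholds here are illustrative assumptions.
const EXCLUDED_PATH = /^\/(cart|checkout|search|login)(\/|$)/;
const RAISED_MIN_WORD_COUNT = 300;   // upper end of the range proposed in this issue
const MIN_TEXT_TO_HTML_RATIO = 0.25; // assumed cut-off, would need tuning
const ARTICLE_TYPE = /^https?:\/\/schema\.org\/(Article|NewsArticle|BlogPosting)$/;

// Proposal 1: og:type or schema.org article markup.
function hasArticleMarkup({ ogType, itemTypes }) {
  return ogType === "article" || itemTypes.some((t) => ARTICLE_TYPE.test(t));
}

// Proposal 3: share of the candidate block's HTML that is visible text.
function textToHtmlRatio({ textLength, htmlLength }) {
  return htmlLength === 0 ? 0 : textLength / htmlLength;
}

// Combine the signals; a user allowlist entry (proposal 4) overrides everything.
function looksLikeArticle(signals) {
  if (signals.allowlisted) return true;
  if (EXCLUDED_PATH.test(signals.pathname)) return false;          // proposal 2
  if (signals.wordCount < RAISED_MIN_WORD_COUNT) return false;
  if (textToHtmlRatio(signals) < MIN_TEXT_TO_HTML_RATIO) return false;
  return hasArticleMarkup(signals) || signals.hasByline;
}

// Browser-side helper that gathers the signals for a candidate content block.
function collectSignals(doc, loc, block, allowlist = []) {
  const text = (block.textContent || "").trim();
  return {
    allowlisted: allowlist.includes(loc.hostname),
    pathname: loc.pathname,
    ogType:
      doc.querySelector('meta[property="og:type"]')?.getAttribute("content") ?? null,
    itemTypes: Array.from(doc.querySelectorAll("[itemtype]")).flatMap((el) =>
      el.getAttribute("itemtype").split(/\s+/)
    ),
    hasByline:
      doc.querySelector('[rel="author"], [itemprop="author"], [itemprop="datePublished"]') !== null,
    wordCount: text === "" ? 0 : text.split(/\s+/).length,
    textLength: text.length,
    htmlLength: block.innerHTML.length,
  };
}
```

In the userscript this might be called as `looksLikeArticle(collectSignals(document, location, candidateBlock, userAllowlist))`, where `candidateBlock` and `userAllowlist` are placeholders for whatever the script already selects and stores. Requiring markup or a byline on top of the size checks is a deliberately strict choice; it could be relaxed if it hides the indicator on real articles that lack metadata.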

Labels: bug
