miriamgoldman/playwright-image-validation

Website Image Checker

A Playwright application that scans websites for broken images and produces detailed reports. It helps you maintain website quality by identifying images that fail to load, detecting image sources (GCP, Pantheon, etc.), and analyzing entire websites through their sitemaps.

Features

  • 🔍 Comprehensive Image Detection: Finds images in <img> tags, srcset attributes, and CSS background images
  • 🚨 Broken Image Detection: Identifies images that return 4xx/5xx status codes or have invalid content types
  • 🌐 Image Source Detection: Automatically detects image sources (GCP, Pantheon, AWS CloudFront, Cloudflare, Fastly, etc.)
  • 📊 Detailed Reporting: Provides comprehensive console and markdown reports with broken image URLs, status codes, and error messages
  • ⚡ Fast & Efficient: Uses HTTP HEAD/GET requests for image checking with intelligent parallel processing
  • 🎨 Beautiful Output: Color-coded console output for easy reading
  • 🔧 Configurable: Extensive customization options for timeouts, concurrency, rate limiting, and more
  • 🗺️ Sitemap Support: Check entire websites using XML sitemaps (including sitemap indexes)
  • 🚦 Advanced Rate Limiting: Built-in throttling, adaptive concurrency, exponential backoff, and intelligent retry logic
  • 🧠 Dynamic Parameter Calculation: Automatically calculates optimal batch sizes, concurrency, and delays based on site size
  • 📈 Performance Metrics: Tracks and reports timing information, images per second, and other performance metrics
  • 🎯 Smart Filtering: Optionally restricts checks to wp-content/uploads/ images and external images (well suited to WordPress sites)

Installation

  1. Clone or download this repository
  2. Install dependencies:
    npm install
  3. Install Playwright browsers:
    npx playwright install

Usage

Basic Usage

Check all images on a single page:

npm start https://example.com

Check Entire Website Using Sitemap

Check all images across your entire website using an XML sitemap:

npm start https://example.com --sitemap https://example.com/sitemap.xml

Check Sitemap Index

The tool automatically handles sitemap indexes (common in WordPress/Yoast SEO):

npm start https://example.com --sitemap https://example.com/sitemap_index.xml

Recommended Production Run

For full website checks with optimal settings:

npm start https://example.com \
  --sitemap https://example.com/sitemap_index.xml \
  --enable-learning \
  --max-concurrent-batches 3 \
  --max-concurrent-pages 8 \
  --max-concurrent-images 15 \
  --filter-uploads-only

Complete Command-Line Options

Basic Options

| Option | Short | Description | Default |
|---|---|---|---|
| --sitemap <url> | -s | XML sitemap URL to crawl | None |
| --limit <n> | -l | Limit number of pages to check from sitemap | No limit |
| --batch-size <n> | -b | Process pages in batches of N | No batching (uses dynamic calculation if enabled) |
| --concurrency <n> | -c | Max concurrent image checks | 5 |
| --delay <ms> | | Delay between image checks (ms) | 0 |
| --timeout <ms> | -t | Page load timeout (ms) | 30000 |
| --user-agent <ua> | | Custom User-Agent string | WebsiteImageChecker/1.0 |
| --no-cache | | Disable in-memory cache for image results | Cache enabled |
| --headless | -h | Run browser in headless mode | true |
| --dev | -d | Run in development mode (non-headless) | false |

Rate Limiting & Retry Options

| Option | Description | Default |
|---|---|---|
| --max-retries <n> | Maximum number of retries for failed requests | 3 |
| --retry-delay <ms> | Initial delay between retries (ms) | 1000 |
| --max-retry-delay <ms> | Maximum delay between retries (ms) | 30000 |
| --backoff-multiplier <n> | Multiplier for exponential backoff | 2 |
| --no-adaptive-concurrency | Disable adaptive concurrency control | Adaptive enabled |
| --min-concurrency <n> | Minimum concurrency level | 1 |
| --no-rate-limit-detection | Disable automatic rate limit detection | Detection enabled |

Advanced Parallelism Options

| Option | Description | Default |
|---|---|---|
| --max-concurrent-batches <n> | Maximum number of batches to process concurrently | 3 |
| --max-concurrent-pages <n> | Maximum number of pages to process concurrently per batch | 10 |
| --max-concurrent-images <n> | Maximum number of images to process concurrently per page | 20 |

Dynamic Calculation & Learning Options

| Option | Description | Default |
|---|---|---|
| --no-dynamic-calculations | Disable dynamic parameter calculations | Dynamic calculations enabled |
| --enable-learning | Enable adaptive learning during execution | Learning disabled |
| --filter-uploads-only | Only check wp-content/uploads/ images and external images | true |
| --no-filter-uploads-only | Check all images (including theme images) | false |

CDN Bypass Option

| Option | Description | Default |
|---|---|---|
| --bypass-cdn | Automatically retry failed images against local server | false |

Usage Examples

Basic Single Page Check

npm start https://example.com

Sitemap with Custom Settings

npm start https://example.com \
  --sitemap https://example.com/sitemap.xml \
  --concurrency 10 \
  --delay 500 \
  --timeout 60000

Limited Pages from Sitemap

npm start https://example.com \
  --sitemap https://example.com/sitemap.xml \
  --limit 50

With Rate Limiting Protection

npm start https://example.com \
  --sitemap https://example.com/sitemap.xml \
  --max-retries 5 \
  --retry-delay 2000 \
  --max-retry-delay 60000

With Dynamic Calculations and Learning

npm start https://example.com \
  --sitemap https://example.com/sitemap_index.xml \
  --enable-learning \
  --max-concurrent-batches 3 \
  --max-concurrent-pages 8 \
  --max-concurrent-images 15

WordPress Site (Filter Uploads Only)

npm start https://example.com \
  --sitemap https://example.com/sitemap_index.xml \
  --filter-uploads-only \
  --enable-learning

Development Mode (See Browser)

npm start https://example.com --dev

With CDN Bypass

npm start https://example.com \
  --sitemap https://example.com/sitemap.xml \
  --bypass-cdn

Report Format

The tool generates two types of reports:

Console Report

The console output provides real-time feedback and a final summary:

🔍 Checking website: https://example.com
⏰ Started at: 1/15/2025, 10:30:00 AM

📋 Found 100 URLs in sitemap
🧠 Dynamic Parameters Calculated:
  📦 Batch Size: 25
  🔄 Max Retries: 3
  ⚡ Initial Concurrency: 10
  ⏱️  Base Delay: 100ms
  📊 Estimated Duration: 45 minutes

🚀 Enhanced Parallelism:
  🔄 Concurrent Batches: 3
  📄 Concurrent Pages: 8
  🖼️  Concurrent Images: 15

🚦 Processing batch group 1/4 (3 batches)
📦 Batch 1/12 (25 pages)

📄 Checking page 1/100: https://example.com/
📸 Found 12 images on this page (filtered: wp-content/uploads/ + external only)
  Checking image 1/12: https://example.com/wp-content/uploads/2025/image.jpg
  ✅ Working: https://example.com/wp-content/uploads/2025/image.jpg
  ...
  ✅ Working: 10 images
  ❌ Broken: 2 images

📄 Checking page 2/100: https://example.com/about
📸 Found 8 images on this page (2 already checked, 6 new)
  ⏭️  All images on this page were already checked, skipping...

...

✅ Check complete! Processed 100 pages
📊 Total images: 850
✅ Working: 820
❌ Broken: 30
⏱️  Duration: 42m 15s
⏰ Completed at: 1/15/2025, 11:12:15 AM

📊 IMAGE CHECK REPORT
==================================================
Pages checked: 100
Total images found: 850
Working images: 820
Broken images: 30

🌐 IMAGE SOURCES:
------------------------------
🏛️ PANTHEON: 450 images
☁️ GCP: 120 images
⚡ AWS-CLOUDFRONT: 50 images
❓ UNKNOWN: 230 images

⚠️  GCP IMAGES DETECTED: 120 images served from Google Cloud Platform
  ✅ Working GCP images: 115
  ❌ Broken GCP images: 5

❌ BROKEN IMAGES:
------------------------------
URL: https://example.com/wp-content/uploads/2025/broken.jpg
Page: https://example.com/
Status Code: 404
Error: Failed to load image
Source: PANTHEON
==================================================

📄 Markdown report saved: /Users/username/Documents/image-check-report-2025-01-15T11-12-15-123Z.md

📋 To share this report:
1. Copy the contents of image-check-report-2025-01-15T11-12-15-123Z.md
2. Go to https://gist.github.com
3. Create a new gist and paste the content
4. Share the gist URL for easy access

Markdown Report

A comprehensive markdown report is automatically generated and saved to your Documents folder. The report includes:

Header Information

  • Generation timestamp
  • Target site URL
  • Sitemap URL (if used)
  • Start time, completion time, and total duration

Summary Section

  • Pages checked count
  • Total images found
  • Working images count and percentage
  • Broken images count and percentage

Performance Metrics Section

  • Total duration (minutes and seconds)
  • Images per second
  • Pages per second
  • Average time per image
  • Average time per page
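
The metrics listed above are simple ratios of counts to elapsed time. The following is a minimal sketch of that arithmetic, not the tool's actual code; the function name `computePerformanceMetrics` and its input shape are assumptions made for illustration.

```javascript
// Hypothetical helper: derives the report's performance metrics from raw totals.
// The input shape { totalImages, totalPages, durationMs } is an assumption.
function computePerformanceMetrics({ totalImages, totalPages, durationMs }) {
  const totalSeconds = Math.floor(durationMs / 1000);
  const seconds = durationMs / 1000;
  return {
    // Human-readable duration in minutes and seconds
    duration: `${Math.floor(totalSeconds / 60)}m ${totalSeconds % 60}s`,
    // Throughput (guard against division by zero for instant runs)
    imagesPerSecond: seconds > 0 ? totalImages / seconds : 0,
    pagesPerSecond: seconds > 0 ? totalPages / seconds : 0,
    // Average cost per item, in milliseconds
    avgMsPerImage: totalImages > 0 ? durationMs / totalImages : 0,
    avgMsPerPage: totalPages > 0 ? durationMs / totalPages : 0,
  };
}

const metrics = computePerformanceMetrics({
  totalImages: 850,
  totalPages: 100,
  durationMs: 2535000,
});
console.log(metrics.duration, metrics.imagesPerSecond.toFixed(2));
```

With the totals from the sample console report (850 images, 100 pages, 42 minutes 15 seconds), this works out to roughly one image every three seconds.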

Image Sources Section

  • Breakdown by source (Pantheon, GCP, AWS CloudFront, etc.)
  • Count and percentage for each source

GCP Images Detected Section (if applicable)

  • Total GCP images count
  • Working vs broken breakdown
  • Detailed table of all GCP images with URLs, pages, and status

Broken Images Section

  • Complete table of all broken images with:
    • Image URL
    • Page URL (clickable link)
    • Status code
    • Source (if detected)
    • Error message

Skipped Images Section (if applicable)

  • Data URI images that were skipped
  • URL preview, page URL, and reason

How It Works

Image Detection

The application treats an image as broken when any of the following checks fails:

  1. HTTP Status Codes: Images returning 4xx or 5xx status codes
  2. Content Type Validation: Ensuring responses have valid image content types
  3. Network Errors: Failed requests due to network issues or invalid URLs
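
The three checks above can be sketched as a single classification function. This is an illustration only: `classifyImageResponse` and the `{ status, contentType, error }` result shape are hypothetical and may differ from what the tool does internally.

```javascript
// Hypothetical helper: decides whether a completed image check counts as broken.
// `result` is an assumed shape: { status, contentType, error }.
function classifyImageResponse(result) {
  // Network-level failure (DNS error, timeout, invalid URL): no response at all.
  if (result.error) {
    return { broken: true, reason: `Network error: ${result.error}` };
  }
  // 4xx and 5xx status codes mean the image could not be served.
  if (result.status >= 400) {
    return { broken: true, reason: `HTTP ${result.status}` };
  }
  // A successful status with a non-image body (for example an HTML error page)
  // is still treated as broken.
  const type = (result.contentType || '').toLowerCase();
  if (!type.startsWith('image/')) {
    return { broken: true, reason: `Unexpected content type: ${type || 'none'}` };
  }
  return { broken: false, reason: null };
}

console.log(classifyImageResponse({ status: 404, contentType: 'text/html' }));
console.log(classifyImageResponse({ status: 200, contentType: 'image/jpeg' }));
```

Checking the status code before the content type keeps the reported reason specific: a 404 that returns an HTML error page is reported as a 404 rather than as a content type mismatch.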

Image Source Detection

Images are automatically categorized by source based on response headers:

  • GCP: Detected by x-goog-hash header
  • Pantheon: Detected by x-pantheon-styx-hostname header
  • AWS CloudFront: Detected by x-amz-cf-id or x-cache headers
  • Cloudflare: Detected by cf-ray header
  • Fastly: Detected by x-served-by header
  • Data URI: Base64 encoded images
  • Unknown: No identifying headers found
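
A header-based lookup along these lines could implement the categorization above. Treat it as a sketch under assumptions: the function name, the plain-object header shape, the label strings for Cloudflare, Fastly, and data URIs, and the order in which headers are tested are not confirmed by the source.

```javascript
// Hypothetical helper: maps an image URL and its response headers to a source label.
// `headers` is assumed to be a plain object of header name to value.
function detectImageSource(url, headers) {
  // Data URIs never produce a network response, so check the URL first.
  if (url.startsWith('data:')) return 'DATA-URI';

  // HTTP header names are case-insensitive, so normalize them once.
  const h = {};
  for (const [name, value] of Object.entries(headers || {})) {
    h[name.toLowerCase()] = value;
  }

  // The precedence below follows the order of the list above (an assumption).
  if ('x-goog-hash' in h) return 'GCP';
  if ('x-pantheon-styx-hostname' in h) return 'PANTHEON';
  if ('x-amz-cf-id' in h || 'x-cache' in h) return 'AWS-CLOUDFRONT';
  if ('cf-ray' in h) return 'CLOUDFLARE';
  if ('x-served-by' in h) return 'FASTLY';
  return 'UNKNOWN';
}

console.log(detectImageSource('https://example.com/a.png', { 'X-Goog-Hash': 'crc32c=abc' }));
```

Because a response can carry headers from more than one layer, the order of the checks determines the label, so it is worth confirming against the tool's own output for your site.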

Image Filtering (WordPress)

When --filter-uploads-only is enabled (default):

  • Same-domain images: Only checks images in /wp-content/uploads/ path
  • External images: All external images are checked
  • Theme images: Images in theme directories are skipped
  • Data URIs: Always skipped (already inline)
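
The filtering rules above amount to a small predicate over the image URL and the page it was found on. The sketch below is one way to express them; `shouldCheckImage` is a hypothetical name, and comparing hostnames to decide what counts as "external" is an assumption.

```javascript
// Hypothetical predicate: returns true when an image should be checked.
// Uses the standard WHATWG URL class, available globally in Node.js.
function shouldCheckImage(imageUrl, pageUrl, filterUploadsOnly = true) {
  // Data URIs are inline content and are always skipped.
  if (imageUrl.startsWith('data:')) return false;
  // With filtering disabled, every remaining image is checked.
  if (!filterUploadsOnly) return true;

  // Resolve relative image URLs against the page URL.
  const image = new URL(imageUrl, pageUrl);
  const page = new URL(pageUrl);

  // External images are always checked.
  if (image.hostname !== page.hostname) return true;
  // Same-domain images are checked only when they live under the uploads path.
  return image.pathname.includes('/wp-content/uploads/');
}

const page = 'https://example.com/about';
console.log(shouldCheckImage('/wp-content/uploads/2025/photo.jpg', page));
console.log(shouldCheckImage('/wp-content/themes/mytheme/logo.png', page));
```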

Rate Limiting & Retry Logic

The tool includes sophisticated rate limiting:

  • Automatic Detection: Detects rate limit errors (ERR_ABORTED, timeouts, 429, etc.)
  • Exponential Backoff: Increases delay between retries exponentially
  • Adaptive Concurrency: Automatically reduces concurrency on errors
  • Configurable Retries: Set maximum retries and delays
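
As an illustration of exponential backoff, the sketch below computes retry delays from the documented defaults (--retry-delay 1000, --backoff-multiplier 2, --max-retry-delay 30000, --max-retries 3). The function names `backoffDelay` and `withRetry` and the injectable `sleep` option are hypothetical; the tool's internal retry code may be structured differently.

```javascript
// Delay before retry number `attempt` (0-based): retryDelay * multiplier^attempt,
// capped at maxRetryDelay.
function backoffDelay(attempt, options = {}) {
  const { retryDelay = 1000, backoffMultiplier = 2, maxRetryDelay = 30000 } = options;
  return Math.min(retryDelay * Math.pow(backoffMultiplier, attempt), maxRetryDelay);
}

// Runs `task` and retries on failure, waiting longer after each failed attempt.
// `sleep` is injectable so the wait can be stubbed out (an illustrative choice).
async function withRetry(task, options = {}) {
  const {
    maxRetries = 3,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    ...backoffOptions
  } = options;

  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries) break; // out of retries
      await sleep(backoffDelay(attempt, backoffOptions));
    }
  }
  throw lastError;
}

// Delays for the first six retries with the default settings
console.log([0, 1, 2, 3, 4, 5].map((attempt) => backoffDelay(attempt)));
```

With the defaults, the delay doubles from 1 second and reaches the 30 second cap on the sixth retry.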

Dynamic Parameter Calculation

When enabled, the tool automatically calculates:

  • Optimal Batch Size: Based on total number of URLs
  • Optimal Retry Count: Based on site size
  • Optimal Concurrency: Based on site size and system capabilities
  • Optimal Delays: Based on expected load
  • Estimated Duration: Calculated completion time

Adaptive Learning

When --enable-learning is enabled:

  • Performance Monitoring: Tracks success/failure rates and response times
  • Dynamic Adjustment: Automatically adjusts batch sizes and concurrency during execution
  • Error Rate Tracking: Monitors error rates and adapts accordingly
  • Optimization: Continuously optimizes parameters based on observed performance
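
The source does not describe the adjustment rule, so the following is only a sketch of one common approach (additive increase, multiplicative decrease), not the tool's algorithm. The function name and the 20% and 5% error-rate thresholds are hypothetical.

```javascript
// Hypothetical adjustment rule: halve concurrency when errors are frequent,
// raise it by one when the run is healthy, otherwise leave it unchanged.
// The thresholds (20% and 5%) are illustrative assumptions.
function adjustConcurrency(current, stats, limits = {}) {
  const { min = 1, max = 20 } = limits;
  const total = stats.successes + stats.failures;
  if (total === 0) return current; // nothing observed yet

  const errorRate = stats.failures / total;
  if (errorRate > 0.2) {
    // Back off quickly, but never below the configured minimum.
    return Math.max(min, Math.floor(current / 2));
  }
  if (errorRate < 0.05) {
    // Probe for more capacity slowly, up to the ceiling.
    return Math.min(max, current + 1);
  }
  return current;
}

console.log(adjustConcurrency(10, { successes: 70, failures: 30 }));
console.log(adjustConcurrency(10, { successes: 100, failures: 0 }));
```

Reducing quickly and increasing slowly is a deliberate asymmetry: it protects a struggling server immediately while still letting a healthy run speed up over time.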

Parallelism

The tool uses multi-level parallelism:

  1. Batch Level: Multiple batches processed concurrently
  2. Page Level: Multiple pages within each batch processed simultaneously
  3. Image Level: Multiple images within each page checked in parallel
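
To make the three levels concrete, here is a minimal sketch of a bounded-concurrency map and how it might be nested. `mapWithConcurrency` and `processBatches` are hypothetical names, the `page.images` shape is an assumption, and the limits shown are the documented CLI defaults.

```javascript
// Runs `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in input order.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function run() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results;
}

// Nesting the helper gives batch-level, page-level, and image-level parallelism.
// `batches` is assumed to be an array of arrays of { images: [...] } objects.
async function processBatches(batches, checkImage, limits = { batches: 3, pages: 10, images: 20 }) {
  return mapWithConcurrency(batches, limits.batches, (pages) =>
    mapWithConcurrency(pages, limits.pages, (page) =>
      mapWithConcurrency(page.images, limits.images, checkImage)));
}
```

Note that nested limits multiply: with the defaults, up to 3 × 10 × 20 = 600 image checks could be in flight at once in this sketch, which is why lowering the per-level limits is the first thing to try when a server starts rate limiting.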

Image Deduplication

  • Images are tracked across all pages
  • Each unique image URL is only checked once
  • Duplicate detections are logged and skipped
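
Deduplication of this kind is typically done with a shared set of URLs already seen. The sketch below shows the idea; `splitNewAndSeen` is a hypothetical name, and exact string comparison of URLs is an assumption (the tool may normalize URLs first).

```javascript
// Hypothetical helper: separates a page's image URLs into ones not seen before
// and a count of duplicates. `seen` is a Set shared across all pages and is
// updated in place.
function splitNewAndSeen(imageUrls, seen) {
  const fresh = [];
  let duplicates = 0;
  for (const url of imageUrls) {
    if (seen.has(url)) {
      duplicates++; // already checked on this page or an earlier one
      continue;
    }
    seen.add(url);
    fresh.push(url);
  }
  return { fresh, duplicates };
}

const seen = new Set();
const pageOne = splitNewAndSeen(['/logo.png', '/hero.jpg', '/logo.png'], seen);
const pageTwo = splitNewAndSeen(['/logo.png', '/team.jpg'], seen);
console.log(pageOne, pageTwo);
```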

Sitemap Support

The tool supports standard XML sitemaps and sitemap indexes:

Standard Sitemap

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>

Sitemap Index

Automatically handles sitemap indexes (common in WordPress/Yoast SEO):

  • Detects sitemap indexes containing <sitemapindex> root element
  • Fetches all sub-sitemap URLs from the index
  • Processes each sub-sitemap to collect all page URLs
  • Combines all URLs and applies limits/batching as specified
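
The steps above can be sketched as follows. This is an illustration under assumptions, not the tool's parser: it extracts <loc> values with a regular expression rather than a full XML parser (so CDATA sections are not handled), and `parseSitemap`, `collectPageUrls`, and the injected `fetchText` function are hypothetical names.

```javascript
// Extracts <loc> values from sitemap XML and reports whether it is an index.
function parseSitemap(xml) {
  const isIndex = /<sitemapindex[\s>]/i.test(xml);
  const locs = [];
  const pattern = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  let match;
  while ((match = pattern.exec(xml)) !== null) {
    // URLs in XML escape "&" as "&amp;", so decode it.
    locs.push(match[1].replace(/&amp;/g, '&'));
  }
  return { isIndex, locs };
}

// Collects page URLs, following sub-sitemaps when the document is an index.
// `fetchText(url)` must return a promise for the XML text; it is injected so
// the sketch stays independent of any particular HTTP client.
async function collectPageUrls(sitemapUrl, fetchText, limit = Infinity) {
  const { isIndex, locs } = parseSitemap(await fetchText(sitemapUrl));
  if (!isIndex) return locs.slice(0, limit);

  const nested = await Promise.all(
    locs.map((subSitemapUrl) => collectPageUrls(subSitemapUrl, fetchText))
  );
  // Combine all sub-sitemap URLs, then apply the page limit.
  return nested.flat().slice(0, limit);
}
```

On Node.js 18 or higher, `fetchText` could be as simple as `(url) => fetch(url).then((response) => response.text())`.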

Testing

Run the test suite:

npm test

Run tests with UI:

npx playwright test --ui

API Usage

You can also use the ImageChecker class programmatically:

const ImageChecker = require('./src/image-checker');

async function checkMyWebsite() {
  const checker = new ImageChecker({
    headless: true,
    maxConcurrent: 3,
    delay: 100,
    userAgent: 'MyBot/1.0',
    filterUploadsOnly: true,
    enableLearning: true
  });

  try {
    // Initialize the checker before running any checks
    await checker.init();

    // Check a single page
    const results = await checker.checkWebsite('https://example.com');

    // Or check an entire website using its sitemap
    const sitemapResults = await checker.checkWebsiteWithSitemap('https://example.com/sitemap.xml');

    await checker.generateReport(results);
  } finally {
    // Always release resources, even if a check throws
    await checker.close();
  }
}

// Surface failures instead of leaving an unhandled promise rejection
checkMyWebsite().catch((error) => {
  console.error(error);
  process.exitCode = 1;
});

Requirements

  • Node.js 18 or higher
  • npm or yarn
  • Internet connection for downloading Playwright browsers

Troubleshooting

Common Issues

  1. Playwright browsers not installed: Run npx playwright install
  2. Permission errors: Ensure you have write permissions in the project directory and Documents folder
  3. Network timeouts: Increase the timeout value for slow-loading websites
  4. CORS issues: Some images may be blocked due to CORS policies
  5. Sitemap parsing errors: Ensure the sitemap URL is accessible and in valid XML format
  6. ERR_ABORTED errors: These are handled automatically with retries, but may indicate rate limiting or server issues

Debug Mode

Run in development mode to see the browser window:

npm start https://example.com --dev

Performance Tips

  • Use --enable-learning for automatic optimization during execution
  • Use --max-concurrent-batches, --max-concurrent-pages, and --max-concurrent-images to fine-tune parallelism
  • Use --delay to add pauses between requests (helps with rate limiting)
  • Use --no-cache if you want fresh results every time
  • Use --filter-uploads-only for WordPress sites to skip theme images
  • Use sitemaps for large websites to ensure comprehensive coverage
  • Let dynamic calculations handle parameter optimization for best results

Best Practices

✅ Throttle requests: Use the delay and concurrency options to avoid overwhelming servers
✅ Use sitemaps: Ensures comprehensive coverage of all pages
✅ Enable learning: Allows automatic optimization during execution
✅ Filter appropriately: Use --filter-uploads-only for WordPress to focus on uploaded images
✅ Monitor performance: Check timing and performance metrics in reports
✅ Use retry logic: Configurable retries handle transient network issues
✅ Test against staging: Point the tool at any URL (staging, production, etc.)
✅ Use a distinct User-Agent: Set a custom User-Agent string with --user-agent
✅ Review markdown reports: Detailed reports are saved to the Documents folder for easy sharing

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request

License

MIT License - see LICENSE file for details.
