Website Image Checker

A Playwright-based tool that scans websites for broken images and produces detailed reports. It helps you maintain website quality by identifying images that fail to load, detecting image sources (GCP, Pantheon, etc.), and analyzing entire websites through their sitemaps.

Features

🔍 Comprehensive Image Detection: Finds images in <img> tags, srcset attributes, and CSS background images
🚨 Broken Image Detection: Identifies images that return 4xx/5xx status codes or have invalid content types
🌐 Image Source Detection: Automatically detects image sources (GCP, Pantheon, AWS CloudFront, Cloudflare, Fastly, etc.)
📊 Detailed Reporting: Provides comprehensive console and markdown reports with broken image URLs, status codes, and error messages
⚡ Fast & Efficient: Uses HTTP HEAD/GET requests for image checking with intelligent parallel processing
🎨 Beautiful Output: Color-coded console output for easy reading
🔧 Configurable: Extensive customization options for timeouts, concurrency, rate limiting, and more
🗺️ Sitemap Support: Check entire websites using XML sitemaps (including sitemap indexes)
🚦 Advanced Rate Limiting: Built-in throttling, adaptive concurrency, exponential backoff, and intelligent retry logic
🧠 Dynamic Parameter Calculation: Automatically calculates optimal batch sizes, concurrency, and delays based on site size
📈 Performance Metrics: Tracks and reports timing information, images per second, and other performance metrics
🎯 Smart Filtering: Restricts checks to wp-content/uploads/ images and external images by default (well suited to WordPress sites); disable with --no-filter-uploads-only

Installation

Clone or download this repository
Install dependencies:
```
npm install
```
Install Playwright browsers:
```
npx playwright install
```

Usage

Basic Usage

Check all images on a single page:

```
npm start https://example.com
```

Check Entire Website Using Sitemap

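Check all images across your entire website using an XML sitemap. The `--` separator tells npm to pass the flags that follow to the tool instead of interpreting them itself:

```
npm start -- https://example.com --sitemap https://example.com/sitemap.xml
```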

Check Sitemap Index

The tool automatically handles sitemap indexes (common in WordPress/Yoast SEO):

```
npm start -- https://example.com --sitemap https://example.com/sitemap_index.xml
```

Recommended Production Run

For full website checks with the recommended settings:

```
npm start -- https://example.com \
  --sitemap https://example.com/sitemap_index.xml \
  --enable-learning \
  --max-concurrent-batches 3 \
  --max-concurrent-pages 8 \
  --max-concurrent-images 15 \
  --filter-uploads-only
```

Complete Command-Line Options

Basic Options

| Option | Short | Description | Default |
| --- | --- | --- | --- |
| `--sitemap <url>` | `-s` | XML sitemap URL to crawl | None |
| `--limit <n>` | `-l` | Limit number of pages to check from sitemap | No limit |
| `--batch-size <n>` | `-b` | Process pages in batches of N | No batching (uses dynamic calculation if enabled) |
| `--concurrency <n>` | `-c` | Max concurrent image checks | 5 |
| `--delay <ms>` | | Delay between image checks (ms) | 0 |
| `--timeout <ms>` | `-t` | Page load timeout (ms) | 30000 |
| `--user-agent <ua>` | | Custom User-Agent string | WebsiteImageChecker/1.0 |
| `--no-cache` | | Disable in-memory cache for image results | Cache enabled |
| `--headless` | `-h` | Run browser in headless mode | true |
| `--dev` | `-d` | Run in development mode (non-headless) | false |

Rate Limiting & Retry Options

| Option | Description | Default |
| --- | --- | --- |
| `--max-retries <n>` | Maximum number of retries for failed requests | 3 |
| `--retry-delay <ms>` | Initial delay between retries (ms) | 1000 |
| `--max-retry-delay <ms>` | Maximum delay between retries (ms) | 30000 |
| `--backoff-multiplier <n>` | Multiplier for exponential backoff | 2 |
| `--no-adaptive-concurrency` | Disable adaptive concurrency control | Adaptive enabled |
| `--min-concurrency <n>` | Minimum concurrency level | 1 |
| `--no-rate-limit-detection` | Disable automatic rate limit detection | Detection enabled |

Advanced Parallelism Options

| Option | Description | Default |
| --- | --- | --- |
| `--max-concurrent-batches <n>` | Maximum number of batches to process concurrently | 3 |
| `--max-concurrent-pages <n>` | Maximum number of pages to process concurrently per batch | 10 |
| `--max-concurrent-images <n>` | Maximum number of images to process concurrently per page | 20 |

Dynamic Calculation & Learning Options

| Option | Description | Default |
| --- | --- | --- |
| `--no-dynamic-calculations` | Disable dynamic parameter calculations | Dynamic calculations enabled |
| `--enable-learning` | Enable adaptive learning during execution | Learning disabled |
| `--filter-uploads-only` | Only check wp-content/uploads/ images and external images | true |
| `--no-filter-uploads-only` | Check all images (including theme images) | false |

CDN Bypass Option

| Option | Description | Default |
| --- | --- | --- |
| `--bypass-cdn` | Automatically retry failed images against local server | false |

Usage Examples

Basic Single Page Check

```
npm start https://example.com
```

Sitemap with Custom Settings

```
npm start -- https://example.com \
  --sitemap https://example.com/sitemap.xml \
  --concurrency 10 \
  --delay 500 \
  --timeout 60000
```

Limited Pages from Sitemap

```
npm start -- https://example.com \
  --sitemap https://example.com/sitemap.xml \
  --limit 50
```

With Rate Limiting Protection

```
npm start -- https://example.com \
  --sitemap https://example.com/sitemap.xml \
  --max-retries 5 \
  --retry-delay 2000 \
  --max-retry-delay 60000
```

With Dynamic Calculations and Learning

```
npm start -- https://example.com \
  --sitemap https://example.com/sitemap_index.xml \
  --enable-learning \
  --max-concurrent-batches 3 \
  --max-concurrent-pages 8 \
  --max-concurrent-images 15
```

WordPress Site (Filter Uploads Only)

```
npm start -- https://example.com \
  --sitemap https://example.com/sitemap_index.xml \
  --filter-uploads-only \
  --enable-learning
```

Development Mode (See Browser)

```
npm start -- https://example.com --dev
```

With CDN Bypass

```
npm start -- https://example.com \
  --sitemap https://example.com/sitemap.xml \
  --bypass-cdn
```

Report Format

The tool generates two types of reports:

Console Report

The console output provides real-time feedback and a final summary:

```
🔍 Checking website: https://example.com
⏰ Started at: 1/15/2025, 10:30:00 AM

📋 Found 100 URLs in sitemap
🧠 Dynamic Parameters Calculated:
  📦 Batch Size: 25
  🔄 Max Retries: 3
  ⚡ Initial Concurrency: 10
  ⏱️  Base Delay: 100ms
  📊 Estimated Duration: 45 minutes

🚀 Enhanced Parallelism:
  🔄 Concurrent Batches: 3
  📄 Concurrent Pages: 8
  🖼️  Concurrent Images: 15

🚦 Processing batch group 1/2 (3 batches)
📦 Batch 1/4 (25 pages)

📄 Checking page 1/100: https://example.com/
📸 Found 12 images on this page (filtered: wp-content/uploads/ + external only)
  Checking image 1/12: https://example.com/wp-content/uploads/2025/image.jpg
  ✅ Working: https://example.com/wp-content/uploads/2025/image.jpg
  ...
  ✅ Working: 10 images
  ❌ Broken: 2 images

📄 Checking page 2/100: https://example.com/about
📸 Found 8 images on this page (8 already checked, 0 new)
  ⏭️  All images on this page were already checked, skipping...

...

✅ Check complete! Processed 100 pages
📊 Total images: 850
✅ Working: 820
❌ Broken: 30
⏱️  Duration: 42m 15s
⏰ Completed at: 1/15/2025, 11:12:15 AM

📊 IMAGE CHECK REPORT
==================================================
Pages checked: 100
Total images found: 850
Working images: 820
Broken images: 30

🌐 IMAGE SOURCES:
------------------------------
🏛️ PANTHEON: 450 images
☁️ GCP: 120 images
⚡ AWS-CLOUDFRONT: 50 images
❓ UNKNOWN: 230 images

⚠️  GCP IMAGES DETECTED: 120 images served from Google Cloud Platform
  ✅ Working GCP images: 115
  ❌ Broken GCP images: 5

❌ BROKEN IMAGES:
------------------------------
URL: https://example.com/wp-content/uploads/2025/broken.jpg
Page: https://example.com/
Status Code: 404
Error: Failed to load image
Source: PANTHEON
==================================================

📄 Markdown report saved: /Users/username/Documents/image-check-report-2025-01-15T11-12-15-123Z.md

📋 To share this report:
1. Copy the contents of image-check-report-2025-01-15T11-12-15-123Z.md
2. Go to https://gist.github.com
3. Create a new gist and paste the content
4. Share the gist URL for easy access
```

Markdown Report

A comprehensive markdown report is automatically generated and saved to your Documents folder. The report includes:

Header Information

Generation timestamp
Target site URL
Sitemap URL (if used)
Start time, completion time, and total duration

Summary Section

Pages checked count
Total images found
Working images count and percentage
Broken images count and percentage

Performance Metrics Section

Total duration (minutes and seconds)
Images per second
Pages per second
Average time per image
Average time per page
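
The figures in this section all follow from three totals: pages checked, images checked, and elapsed time. As a rough sketch of the arithmetic (the function name `computeMetrics` and its field names are hypothetical, not the tool's actual internals):

```javascript
// Hypothetical helper: derives the report's throughput figures from raw totals.
function computeMetrics({ pages, images, durationMs }) {
  const totalSeconds = Math.round(durationMs / 1000);
  // Guard against division by zero for empty or instantaneous runs.
  const perSecond = (count) => (totalSeconds > 0 ? count / totalSeconds : 0);
  const average = (count) => (count > 0 ? durationMs / count : 0);
  return {
    duration: `${Math.floor(totalSeconds / 60)}m ${totalSeconds % 60}s`,
    imagesPerSecond: perSecond(images),
    pagesPerSecond: perSecond(pages),
    avgMsPerImage: average(images),
    avgMsPerPage: average(pages),
  };
}

// Totals from the sample console report: 100 pages, 850 images, 42m 15s.
const sample = computeMetrics({ pages: 100, images: 850, durationMs: (42 * 60 + 15) * 1000 });
console.log(sample.duration, sample.imagesPerSecond.toFixed(2));
```

With the sample totals this works out to roughly one image every three seconds, which is the kind of number to compare between runs when tuning concurrency.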

Image Sources Section

Breakdown by source (Pantheon, GCP, AWS CloudFront, etc.)
Count and percentage for each source

GCP Images Detected Section (if applicable)

Total GCP images count
Working vs broken breakdown
Detailed table of all GCP images with URLs, pages, and status

Broken Images Section

Complete table of all broken images with:
- Image URL
- Page URL (clickable link)
- Status code
- Source (if detected)
- Error message

Skipped Images Section (if applicable)

Data URI images that were skipped
URL preview, page URL, and reason
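
The report file name in the sample console output (`image-check-report-2025-01-15T11-12-15-123Z.md`) looks like an ISO 8601 timestamp with `:` and `.` replaced by `-`. A minimal sketch of how such a path could be built, assuming that naming scheme and a `Documents` folder under the user's home directory (`reportPath` is a hypothetical name):

```javascript
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: builds a filesystem-safe report path from a Date.
// Assumption: ':' and '.' in the ISO timestamp are replaced with '-',
// which reproduces the file name shown in the sample output.
function reportPath(date = new Date(), dir = path.join(os.homedir(), 'Documents')) {
  const stamp = date.toISOString().replace(/[:.]/g, '-');
  return path.join(dir, `image-check-report-${stamp}.md`);
}

console.log(reportPath(new Date('2025-01-15T11:12:15.123Z')));
```

Replacing the colons matters because `:` is not allowed in file names on Windows.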

How It Works

Image Detection

The application detects broken images by:

HTTP Status Codes: Images returning 4xx or 5xx status codes
Content Type Validation: Ensuring responses have valid image content types
Network Errors: Failed requests due to network issues or invalid URLs
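
The three checks above can be combined into one decision function. This is only a sketch: `classifyImageResponse` is a hypothetical name, and treating any `image/*` media type as valid is an assumption about what "valid image content type" means here.

```javascript
// Hypothetical helper: decides whether a response counts as a broken image.
// status: HTTP status code, or null when the request itself failed
//         (network error, invalid URL).
// contentType: value of the Content-Type header, if any.
function classifyImageResponse(status, contentType) {
  if (status === null) {
    return { broken: true, reason: 'network error' };
  }
  if (status >= 400) {
    return { broken: true, reason: `HTTP ${status}` };
  }
  // Strip parameters such as "; charset=utf-8" before comparing.
  const type = (contentType || '').split(';')[0].trim().toLowerCase();
  // Assumption: any image/* media type is accepted as valid.
  if (!type.startsWith('image/')) {
    return { broken: true, reason: `unexpected content type: ${type || 'none'}` };
  }
  return { broken: false, reason: null };
}

console.log(classifyImageResponse(200, 'text/html; charset=utf-8'));
```

The content-type check catches a common failure mode that status codes miss: a server that answers a missing image with a `200 OK` HTML error page.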

Image Source Detection

Images are automatically categorized by source based on response headers:

GCP: Detected by x-goog-hash header
Pantheon: Detected by x-pantheon-styx-hostname header
AWS CloudFront: Detected by x-amz-cf-id or x-cache headers (x-cache is also sent by other caches, so x-amz-cf-id is the stronger signal)
Cloudflare: Detected by cf-ray header
Fastly: Detected by x-served-by header
Data URI: Inline data: images (typically Base64-encoded), identified from the URL itself because there is no HTTP response
Unknown: No identifying headers found
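
The header rules above translate naturally into a lookup table. The sketch below assumes the rules are evaluated in the order listed and that the first match wins. Neither that order, the function name `detectImageSource`, nor the `CLOUDFLARE`, `FASTLY`, and `DATA-URI` labels is confirmed by the tool (the other labels appear in the sample report).

```javascript
// Header-to-source rules in the order this README lists them.
// Assumption: first match wins.
const SOURCE_RULES = [
  ['GCP', ['x-goog-hash']],
  ['PANTHEON', ['x-pantheon-styx-hostname']],
  ['AWS-CLOUDFRONT', ['x-amz-cf-id', 'x-cache']],
  ['CLOUDFLARE', ['cf-ray']],
  ['FASTLY', ['x-served-by']],
];

// Hypothetical helper: maps an image URL and its response headers to a source label.
function detectImageSource(url, headers = {}) {
  if (url.startsWith('data:')) return 'DATA-URI';
  // HTTP header names are case-insensitive, so normalise before matching.
  const names = new Set(Object.keys(headers).map((name) => name.toLowerCase()));
  for (const [source, markers] of SOURCE_RULES) {
    if (markers.some((marker) => names.has(marker))) return source;
  }
  return 'UNKNOWN';
}

console.log(detectImageSource('https://example.com/a.png', { 'X-Goog-Hash': 'crc32c=abc' }));
```

Because several caches emit `x-cache`, rule order affects the result when a response carries more than one of these headers.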

Image Filtering (WordPress)

When --filter-uploads-only is enabled (default):

Same-domain images: Only checks images in /wp-content/uploads/ path
External images: All external images are checked
Theme images: Images in theme directories are skipped
Data URIs: Always skipped (already inline)
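
As a sketch of the filtering rules above (the name `shouldCheckImage` is hypothetical, and comparing hostnames exactly is an assumption about what "same-domain" means, so `www.example.com` and `example.com` count as different hosts here):

```javascript
// Hypothetical helper: decides whether an image should be checked
// when uploads-only filtering is on.
function shouldCheckImage(imageUrl, pageUrl) {
  // Data URIs are inline, so there is nothing to fetch.
  if (imageUrl.startsWith('data:')) return false;

  let image;
  try {
    // Resolves relative URLs such as "/wp-content/..." against the page URL.
    image = new URL(imageUrl, pageUrl);
  } catch {
    // Assumption: unparseable URLs are passed on so the checker can report them.
    return true;
  }

  const page = new URL(pageUrl);
  // External images are always checked.
  if (image.hostname !== page.hostname) return true;
  // Same-domain images are checked only when they live under the uploads path.
  return image.pathname.includes('/wp-content/uploads/');
}

console.log(shouldCheckImage('/wp-content/themes/site/logo.png', 'https://example.com/about'));
```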

Rate Limiting & Retry Logic

The tool includes sophisticated rate limiting:

Automatic Detection: Detects rate limit errors (ERR_ABORTED, timeouts, 429, etc.)
Exponential Backoff: Increases delay between retries exponentially
Adaptive Concurrency: Automatically reduces concurrency on errors
Configurable Retries: Set maximum retries and delays
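
To make the backoff arithmetic concrete, here is a small sketch using the documented defaults (`--max-retries 3`, `--retry-delay 1000`, `--max-retry-delay 30000`, `--backoff-multiplier 2`). The function names and the exact formula are assumptions, since the tool's own implementation may add jitter or differ in detail.

```javascript
// Defaults taken from the option tables in this README.
const RETRY_DEFAULTS = { maxRetries: 3, retryDelay: 1000, maxRetryDelay: 30000, backoffMultiplier: 2 };

// Delay before retry number `attempt` (1-based), capped at maxRetryDelay.
function backoffDelay(attempt, opts = RETRY_DEFAULTS) {
  const delay = opts.retryDelay * Math.pow(opts.backoffMultiplier, attempt - 1);
  return Math.min(delay, opts.maxRetryDelay);
}

// Hypothetical wrapper: runs `operation`, retrying with exponential backoff on failure.
// `sleep` is injectable so the logic can be exercised without real waiting.
async function withRetries(
  operation,
  opts = RETRY_DEFAULTS,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
) {
  let lastError;
  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt === opts.maxRetries) break; // retries exhausted
      await sleep(backoffDelay(attempt + 1, opts));
    }
  }
  throw lastError;
}

console.log([1, 2, 3].map((attempt) => backoffDelay(attempt))); // [ 1000, 2000, 4000 ]
```

With the defaults, a request that keeps failing waits 1 s, 2 s, and 4 s before it is finally reported; the 30 s cap only comes into play with more retries or a larger multiplier.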

Dynamic Parameter Calculation

When enabled, the tool automatically calculates:

Optimal Batch Size: Based on total number of URLs
Optimal Retry Count: Based on site size
Optimal Concurrency: Based on site size and system capabilities
Optimal Delays: Based on expected load
Estimated Duration: Calculated completion time

Adaptive Learning

When --enable-learning is enabled:

Performance Monitoring: Tracks success/failure rates and response times
Dynamic Adjustment: Automatically adjusts batch sizes and concurrency during execution
Error Rate Tracking: Monitors error rates and adapts accordingly
Optimization: Continuously optimizes parameters based on observed performance
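
One common way to implement this kind of adjustment is additive-increase/multiplicative-decrease: cut concurrency sharply when errors spike, then recover one step at a time. The sketch below illustrates that general technique only. It is not the tool's confirmed algorithm, and `adjustConcurrency` and the 10% error threshold are hypothetical.

```javascript
// Illustrative AIMD rule (an assumption, not the tool's confirmed behaviour).
// min defaults to the documented --min-concurrency default of 1;
// max and errorThreshold are hypothetical values.
function adjustConcurrency(current, errorRate, { min = 1, max = 20, errorThreshold = 0.1 } = {}) {
  if (errorRate > errorThreshold) {
    // Multiplicative decrease: back off quickly when the server is struggling.
    return Math.max(min, Math.floor(current / 2));
  }
  // Additive increase: recover slowly while requests keep succeeding.
  return Math.min(max, current + 1);
}

// Error rates observed over four consecutive windows of requests.
let concurrency = 10;
for (const errorRate of [0.0, 0.3, 0.0, 0.0]) {
  concurrency = adjustConcurrency(concurrency, errorRate);
}
console.log(concurrency); // 7
```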

Parallelism

The tool uses multi-level parallelism:

Batch Level: Multiple batches processed concurrently
Page Level: Multiple pages within each batch processed simultaneously
Image Level: Multiple images within each page checked in parallel
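
The three levels can be pictured as nested, bounded worker pools. The following sketch shows the general pattern under assumed names (`chunk`, `mapWithConcurrency`, and `checkAll` are all hypothetical); the real scheduler may be organised differently.

```javascript
// Split a list into consecutive batches of at most `size` items.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Run `worker` over `items` with at most `limit` calls in flight.
// Results are returned in input order.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    // Each runner pulls the next unclaimed index until the list is exhausted.
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}

// Batch level wraps page level; `checkPage` would apply the same helper at image level.
async function checkAll(pageUrls, checkPage, { batchSize = 25, maxBatches = 3, maxPages = 10 } = {}) {
  const perBatch = await mapWithConcurrency(chunk(pageUrls, batchSize), maxBatches, (batch) =>
    mapWithConcurrency(batch, maxPages, checkPage)
  );
  return perBatch.flat();
}
```

Note that the limits multiply: 3 concurrent batches of 10 concurrent pages can mean up to 30 pages in flight at once, each checking up to 20 images, which is why lowering these values is the first thing to try when a server starts rate limiting.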

Image Deduplication

Images are tracked across all pages
Each unique image URL is only checked once
Duplicate detections are logged and skipped
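
A `Set` of already-seen URLs is enough to get this behaviour. The class below is a minimal sketch with hypothetical names (`ImageTracker`, `takeNew`), not the tool's actual data structure:

```javascript
// Hypothetical tracker: remembers every image URL seen so far across pages.
class ImageTracker {
  constructor() {
    this.seen = new Set();
  }

  // Returns only the URLs not seen on an earlier page (or earlier in this list)
  // and marks them as seen.
  takeNew(urls) {
    const fresh = [];
    for (const url of urls) {
      if (!this.seen.has(url)) {
        this.seen.add(url);
        fresh.push(url);
      }
    }
    return fresh;
  }
}

const tracker = new ImageTracker();
tracker.takeNew(['https://example.com/logo.png', 'https://example.com/hero.jpg']);
// The logo appears again on the second page, so only the new image is returned.
console.log(tracker.takeNew(['https://example.com/logo.png', 'https://example.com/team.jpg']));
```

Shared assets such as logos and icons appear on nearly every page, so deduplication removes most of the redundant requests on a large site.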

Sitemap Support

The tool supports standard XML sitemaps and sitemap indexes:

Standard Sitemap

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
```

Sitemap Index

Automatically handles sitemap indexes (common in WordPress/Yoast SEO):

Detects sitemap indexes by their <sitemapindex> root element
Fetches all sub-sitemap URLs from the index
Processes each sub-sitemap to collect all page URLs
Combines all URLs and applies limits/batching as specified
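
The steps above can be sketched as a small recursive collector. This is an illustration only: `parseSitemap`, `collectPageUrls`, and the injected `fetchText` function are hypothetical names, the regex-based extraction ignores CDATA sections, and there is no guard against an index that references itself.

```javascript
// Decode the five predefined XML entities; &amp; goes last to avoid double decoding.
function decodeXml(text) {
  return text
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, '&');
}

// Extract <loc> values and report whether the document is a sitemap index.
function parseSitemap(xml) {
  const isIndex = /<sitemapindex[\s>]/i.test(xml);
  const locs = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/gi)].map((match) => decodeXml(match[1]));
  return { isIndex, locs };
}

// Collect page URLs, following sub-sitemaps when the document is an index.
// `fetchText(url)` must resolve to the XML text; on Node.js 18+ it could be
// (url) => fetch(url).then((response) => response.text()).
async function collectPageUrls(sitemapUrl, fetchText, limit = Infinity) {
  const { isIndex, locs } = parseSitemap(await fetchText(sitemapUrl));
  if (!isIndex) return locs.slice(0, limit);

  const pages = [];
  for (const subSitemap of locs) {
    if (pages.length >= limit) break;
    pages.push(...(await collectPageUrls(subSitemap, fetchText, limit - pages.length)));
  }
  return pages;
}
```

Passing the fetch function in as a parameter keeps the parsing logic testable without network access.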

Testing

Run the test suite:

```
npm test
```

Run tests with UI:

```
npx playwright test --ui
```

API Usage

You can also use the ImageChecker class programmatically:

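```javascript
const ImageChecker = require('./src/image-checker');

async function checkMyWebsite() {
  const checker = new ImageChecker({
    headless: true,
    maxConcurrent: 3,
    delay: 100,
    userAgent: 'MyBot/1.0',
    filterUploadsOnly: true,
    enableLearning: true
  });

  try {
    // Initialize the checker before running any checks
    await checker.init();

    // Check a single page and report on it
    const results = await checker.checkWebsite('https://example.com');
    await checker.generateReport(results);

    // Or check an entire website through its sitemap
    const sitemapResults = await checker.checkWebsiteWithSitemap('https://example.com/sitemap.xml');
    await checker.generateReport(sitemapResults);
  } finally {
    // Always release resources, even if a check throws
    await checker.close();
  }
}

// Surface failures instead of leaving an unhandled promise rejection
checkMyWebsite().catch((error) => {
  console.error(error);
  process.exitCode = 1;
});
```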

Requirements

Node.js 18 or higher
npm or yarn
Internet connection for downloading Playwright browsers

Troubleshooting

Common Issues

Playwright browsers not installed: Run npx playwright install
Permission errors: Ensure you have write permissions in the project directory and Documents folder
Network timeouts: Increase the timeout value for slow-loading websites
CORS issues: Some images may be blocked due to CORS policies
Sitemap parsing errors: Ensure the sitemap URL is accessible and in valid XML format
ERR_ABORTED errors: These are handled automatically with retries, but may indicate rate limiting or server issues

Debug Mode

Run in development mode to see the browser window:

```
npm start -- https://example.com --dev
```

Performance Tips

Use --enable-learning for automatic optimization during execution
Use --max-concurrent-batches, --max-concurrent-pages, and --max-concurrent-images to fine-tune parallelism
Use --delay to add pauses between requests (helps with rate limiting)
Use --no-cache if you want fresh results every time
Keep --filter-uploads-only (enabled by default) for WordPress sites to skip theme images; use --no-filter-uploads-only to check everything
Use sitemaps for large websites to ensure comprehensive coverage
Let dynamic calculations handle parameter optimization for best results

Best Practices

✅ Throttle requests: Use delay and concurrency options to avoid overwhelming servers
✅ Use sitemaps: Ensures comprehensive coverage of all pages
✅ Enable learning: Allows automatic optimization during execution
✅ Filter appropriately: Keep --filter-uploads-only (enabled by default) on WordPress sites to focus on uploaded images
✅ Monitor performance: Check timing and performance metrics in reports
✅ Use retry logic: Tune --max-retries and --retry-delay so transient network issues are retried
✅ Test against staging: The tool accepts any URL, so you can check staging before production
✅ Use a distinct User-Agent: Set --user-agent so the checker's traffic is easy to identify in server logs
✅ Review markdown reports: Detailed reports saved to Documents folder for easy sharing

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests for new functionality
Submit a pull request

License

MIT License - see LICENSE file for details.
