A powerful Playwright application that scans websites for broken images and provides detailed reports. This tool helps you maintain website quality by identifying images that fail to load, detecting image sources (GCP, Pantheon, etc.), and providing comprehensive analysis across entire websites using sitemaps.
- **Comprehensive Image Detection**: Finds images in `<img>` tags, `srcset` attributes, and CSS background images
- **Broken Image Detection**: Identifies images that return 4xx/5xx status codes or have invalid content types
- **Image Source Detection**: Automatically detects image sources (GCP, Pantheon, AWS CloudFront, Cloudflare, Fastly, etc.)
- **Detailed Reporting**: Provides comprehensive console and markdown reports with broken image URLs, status codes, and error messages
- **Fast & Efficient**: Uses HTTP HEAD/GET requests for image checking with intelligent parallel processing
- **Beautiful Output**: Color-coded console output for easy reading
- **Configurable**: Extensive customization options for timeouts, concurrency, rate limiting, and more
- **Sitemap Support**: Check entire websites using XML sitemaps (including sitemap indexes)
- **Advanced Rate Limiting**: Built-in throttling, adaptive concurrency, exponential backoff, and intelligent retry logic
- **Dynamic Parameter Calculation**: Automatically calculates optimal batch sizes, concurrency, and delays based on site size
- **Performance Metrics**: Tracks and reports timing information, images per second, and other performance metrics
- **Smart Filtering**: Option to check only `wp-content/uploads/` images and external images (perfect for WordPress sites)
- Clone or download this repository
- Install dependencies:

  ```bash
  npm install
  ```

- Install Playwright browsers:

  ```bash
  npx playwright install
  ```
Check all images on a single page:

```bash
npm start https://example.com
```

Check all images across your entire website using an XML sitemap:

```bash
npm start https://example.com --sitemap https://example.com/sitemap.xml
```

The tool automatically handles sitemap indexes (common in WordPress/Yoast SEO):

```bash
npm start https://example.com --sitemap https://example.com/sitemap_index.xml
```

For full website checks with optimal settings:

```bash
npm start https://example.com \
  --sitemap https://example.com/sitemap_index.xml \
  --enable-learning \
  --max-concurrent-batches 3 \
  --max-concurrent-pages 8 \
  --max-concurrent-images 15 \
  --filter-uploads-only
```

Available options:

| Option | Short | Description | Default |
|---|---|---|---|
| `--sitemap <url>` | `-s` | XML sitemap URL to crawl | None |
| `--limit <n>` | `-l` | Limit number of pages to check from sitemap | No limit |
| `--batch-size <n>` | `-b` | Process pages in batches of N | No batching (uses dynamic calculation if enabled) |
| `--concurrency <n>` | `-c` | Max concurrent image checks | 5 |
| `--delay <ms>` | | Delay between image checks (ms) | 0 |
| `--timeout <ms>` | `-t` | Page load timeout (ms) | 30000 |
| `--user-agent <ua>` | | Custom User-Agent string | WebsiteImageChecker/1.0 |
| `--no-cache` | | Disable in-memory cache for image results | Cache enabled |
| `--headless` | `-h` | Run browser in headless mode | true |
| `--dev` | `-d` | Run in development mode (non-headless) | false |
Rate limiting and retry options:

| Option | Description | Default |
|---|---|---|
| `--max-retries <n>` | Maximum number of retries for failed requests | 3 |
| `--retry-delay <ms>` | Initial delay between retries (ms) | 1000 |
| `--max-retry-delay <ms>` | Maximum delay between retries (ms) | 30000 |
| `--backoff-multiplier <n>` | Multiplier for exponential backoff | 2 |
| `--no-adaptive-concurrency` | Disable adaptive concurrency control | Adaptive enabled |
| `--min-concurrency <n>` | Minimum concurrency level | 1 |
| `--no-rate-limit-detection` | Disable automatic rate limit detection | Detection enabled |
Parallelism options:

| Option | Description | Default |
|---|---|---|
| `--max-concurrent-batches <n>` | Maximum number of batches to process concurrently | 3 |
| `--max-concurrent-pages <n>` | Maximum number of pages to process concurrently per batch | 10 |
| `--max-concurrent-images <n>` | Maximum number of images to process concurrently per page | 20 |
Dynamic calculation and filtering options:

| Option | Description | Default |
|---|---|---|
| `--no-dynamic-calculations` | Disable dynamic parameter calculations | Dynamic calculations enabled |
| `--enable-learning` | Enable adaptive learning during execution | Learning disabled |
| `--filter-uploads-only` | Only check `wp-content/uploads/` images and external images | true |
| `--no-filter-uploads-only` | Check all images (including theme images) | false |
CDN bypass option:

| Option | Description | Default |
|---|---|---|
| `--bypass-cdn` | Automatically retry failed images against local server | false |
Basic check of a single page:

```bash
npm start https://example.com
```

Sitemap check with custom concurrency, delay, and timeout:

```bash
npm start https://example.com \
  --sitemap https://example.com/sitemap.xml \
  --concurrency 10 \
  --delay 500 \
  --timeout 60000
```

Limit the check to the first 50 pages from the sitemap:

```bash
npm start https://example.com \
  --sitemap https://example.com/sitemap.xml \
  --limit 50
```

Custom retry behavior:

```bash
npm start https://example.com \
  --sitemap https://example.com/sitemap.xml \
  --max-retries 5 \
  --retry-delay 2000 \
  --max-retry-delay 60000
```

Full website check with learning and tuned parallelism:

```bash
npm start https://example.com \
  --sitemap https://example.com/sitemap_index.xml \
  --enable-learning \
  --max-concurrent-batches 3 \
  --max-concurrent-pages 8 \
  --max-concurrent-images 15
```

WordPress site with upload filtering:

```bash
npm start https://example.com \
  --sitemap https://example.com/sitemap_index.xml \
  --filter-uploads-only \
  --enable-learning
```

Development mode (visible browser):

```bash
npm start https://example.com --dev
```

Retry failed images against the local server, bypassing the CDN:

```bash
npm start https://example.com \
  --sitemap https://example.com/sitemap.xml \
  --bypass-cdn
```

The tool generates two types of reports:
The console output provides real-time feedback and a final summary:
```text
Checking website: https://example.com
Started at: 1/15/2025, 10:30:00 AM

Found 100 URLs in sitemap

Dynamic Parameters Calculated:
  Batch Size: 25
  Max Retries: 3
  Initial Concurrency: 10
  Base Delay: 100ms
  Estimated Duration: 45 minutes

Enhanced Parallelism:
  Concurrent Batches: 3
  Concurrent Pages: 8
  Concurrent Images: 15

Processing batch group 1/4 (3 batches)
  Batch 1/12 (25 pages)

Checking page 1/100: https://example.com/
  Found 12 images on this page (filtered: wp-content/uploads/ + external only)
  Checking image 1/12: https://example.com/wp-content/uploads/2025/image.jpg
  Working: https://example.com/wp-content/uploads/2025/image.jpg
  ...
  Working: 10 images
  Broken: 2 images

Checking page 2/100: https://example.com/about
  Found 8 images on this page (2 already checked, 6 new)
  All images on this page were already checked, skipping...
...

Check complete! Processed 100 pages
Total images: 850
Working: 820
Broken: 30
Duration: 42m 15s
Completed at: 1/15/2025, 11:12:15 AM

IMAGE CHECK REPORT
==================================================
Pages checked: 100
Total images found: 850
Working images: 820
Broken images: 30

IMAGE SOURCES:
------------------------------
PANTHEON: 450 images
GCP: 120 images
AWS-CLOUDFRONT: 50 images
UNKNOWN: 230 images

GCP IMAGES DETECTED: 120 images served from Google Cloud Platform
Working GCP images: 115
Broken GCP images: 5

BROKEN IMAGES:
------------------------------
URL: https://example.com/wp-content/uploads/2025/broken.jpg
Page: https://example.com/
Status Code: 404
Error: Failed to load image
Source: PANTHEON
==================================================

Markdown report saved: /Users/username/Documents/image-check-report-2025-01-15T11-12-15-123Z.md

To share this report:
1. Copy the contents of image-check-report-2025-01-15T11-12-15-123Z.md
2. Go to https://gist.github.com
3. Create a new gist and paste the content
4. Share the gist URL for easy access
```
A comprehensive markdown report is automatically generated and saved to your Documents folder. The report includes:
- Generation timestamp
- Target site URL
- Sitemap URL (if used)
- Start time, completion time, and total duration
- Pages checked count
- Total images found
- Working images count and percentage
- Broken images count and percentage
- Total duration (minutes and seconds)
- Images per second
- Pages per second
- Average time per image
- Average time per page
- Breakdown by source (Pantheon, GCP, AWS CloudFront, etc.)
- Count and percentage for each source
- Total GCP images count
- Working vs broken breakdown
- Detailed table of all GCP images with URLs, pages, and status
- Complete table of all broken images with:
  - Image URL
  - Page URL (clickable link)
  - Status code
  - Source (if detected)
  - Error message
- Data URI images that were skipped:
  - URL preview, page URL, and reason
The application detects broken images by:
- HTTP Status Codes: Images returning 4xx or 5xx status codes
- Content Type Validation: Ensuring responses have valid image content types
- Network Errors: Failed requests due to network issues or invalid URLs
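The three checks above can be sketched as a small predicate. This is a minimal illustration; `isBrokenImage` is a hypothetical name, not the tool's actual API:

```javascript
// Illustrative sketch of the three broken-image checks.
// isBrokenImage is a hypothetical name, not part of the tool's API.
function isBrokenImage(statusCode, contentType, networkError = null) {
  if (networkError) return true;                  // request never completed
  if (statusCode >= 400) return true;             // 4xx/5xx status codes
  if (!contentType || !contentType.startsWith('image/')) {
    return true;                                  // response is not an image
  }
  return false;
}
```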
Images are automatically categorized by source based on response headers:
- **GCP**: Detected by the `x-goog-hash` header
- **Pantheon**: Detected by the `x-pantheon-styx-hostname` header
- **AWS CloudFront**: Detected by the `x-amz-cf-id` or `x-cache` headers
- **Cloudflare**: Detected by the `cf-ray` header
- **Fastly**: Detected by the `x-served-by` header
- **Data URI**: Base64-encoded inline images
- **Unknown**: No identifying headers found
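A hedged sketch of this header-based detection, using the headers listed above; the tool's actual logic (and the order in which headers are checked) may differ:

```javascript
// Illustrative header-to-source mapping; detectImageSource is a
// hypothetical name and the check order is an assumption.
function detectImageSource(headers) {
  if (headers['x-goog-hash']) return 'GCP';
  if (headers['x-pantheon-styx-hostname']) return 'PANTHEON';
  if (headers['x-amz-cf-id'] || headers['x-cache']) return 'AWS-CLOUDFRONT';
  if (headers['cf-ray']) return 'CLOUDFLARE';
  if (headers['x-served-by']) return 'FASTLY';
  return 'UNKNOWN';
}
```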
When `--filter-uploads-only` is enabled (the default):
- **Same-domain images**: Only images under the `/wp-content/uploads/` path are checked
- **External images**: All external images are checked
- **Theme images**: Images in theme directories are skipped
- **Data URIs**: Always skipped (already inline)
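The filtering rules above amount to a simple predicate. A minimal sketch (illustrative names, not the tool's code):

```javascript
// Sketch of the --filter-uploads-only rule; shouldCheckImage is a
// hypothetical name. Returns true if the image should be checked.
function shouldCheckImage(imageUrl, siteOrigin) {
  if (imageUrl.startsWith('data:')) return false;        // data URIs always skipped
  const url = new URL(imageUrl, siteOrigin);
  if (url.origin !== new URL(siteOrigin).origin) {
    return true;                                          // external images are checked
  }
  // Same-domain: only uploads; theme images fail this test and are skipped
  return url.pathname.includes('/wp-content/uploads/');
}
```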
The tool includes sophisticated rate limiting:
- Automatic Detection: Detects rate limit errors (ERR_ABORTED, timeouts, 429, etc.)
- Exponential Backoff: Increases delay between retries exponentially
- Adaptive Concurrency: Automatically reduces concurrency on errors
- Configurable Retries: Set maximum retries and delays
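Under the default settings (`--retry-delay 1000`, `--backoff-multiplier 2`, `--max-retry-delay 30000`), the backoff schedule behaves roughly like this sketch (`retryDelay` is an illustrative name, not the tool's internal function):

```javascript
// Exponential backoff with a cap, using the documented default values.
function retryDelay(attempt, initialMs = 1000, multiplier = 2, maxMs = 30000) {
  return Math.min(initialMs * Math.pow(multiplier, attempt), maxMs);
}
// First retry waits 1000ms, then 2000ms, then 4000ms, ... capped at 30000ms.
```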
When enabled, the tool automatically calculates:
- Optimal Batch Size: Based on total number of URLs
- Optimal Retry Count: Based on site size
- Optimal Concurrency: Based on site size and system capabilities
- Optimal Delays: Based on expected load
- Estimated Duration: Calculated completion time
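To make the idea concrete, here is one purely illustrative heuristic for deriving parameters from site size. The tool's actual formulas are internal and may differ; every threshold below is an assumption:

```javascript
// Hypothetical dynamic-parameter heuristic (illustrative only; not the
// tool's actual formulas). Scales batch size with total URLs and backs
// off concurrency and delay for larger sites.
function calculateParameters(totalUrls) {
  return {
    batchSize: Math.min(Math.max(Math.ceil(totalUrls / 4), 10), 50),
    maxRetries: totalUrls > 500 ? 5 : 3,
    initialConcurrency: totalUrls > 200 ? 5 : 10,
    baseDelayMs: totalUrls > 200 ? 250 : 100,
  };
}
```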
When `--enable-learning` is enabled:
- Performance Monitoring: Tracks success/failure rates and response times
- Dynamic Adjustment: Automatically adjusts batch sizes and concurrency during execution
- Error Rate Tracking: Monitors error rates and adapts accordingly
- Optimization: Continuously optimizes parameters based on observed performance
The tool uses multi-level parallelism:
- Batch Level: Multiple batches processed concurrently
- Page Level: Multiple pages within each batch processed simultaneously
- Image Level: Multiple images within each page checked in parallel
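Each of these levels can be built from a simple promise pool that caps how many tasks run at once. A minimal sketch, not the tool's actual implementation:

```javascript
// Minimal promise pool: runs `worker` over `items` with at most `limit`
// tasks in flight. The same pattern can be reused at each level
// (batches, pages, images) with its own limit.
async function runWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function lane() {
    while (next < items.length) {
      const i = next++; // safe: JS runs single-threaded between awaits
      results[i] = await worker(items[i]);
    }
  }
  const lanes = Array.from({ length: Math.min(limit, items.length) }, lane);
  await Promise.all(lanes);
  return results;
}
```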
- Images are tracked across all pages
- Each unique image URL is only checked once
- Duplicate detections are logged and skipped
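This cross-page deduplication can be sketched with a `Set` of already-checked URLs (illustrative; the tool's bookkeeping may differ):

```javascript
// Tracks every image URL seen so far and returns only the new ones.
const checkedUrls = new Set();
function filterNewImages(imageUrls) {
  const fresh = imageUrls.filter((u) => !checkedUrls.has(u));
  fresh.forEach((u) => checkedUrls.add(u));
  return fresh;
}
```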
The tool supports standard XML sitemaps and sitemap indexes:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
```

It automatically handles sitemap indexes (common in WordPress/Yoast SEO):
- Detects sitemap indexes by their `<sitemapindex>` root element
- Fetches all sub-sitemap URLs from the index
- Processes each sub-sitemap to collect all page URLs
- Combines all URLs and applies limits/batching as specified
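The steps above can be sketched as a small recursive collector. This uses naive regex parsing for illustration only; `collectSitemapUrls` and the injected `fetchXml` function are hypothetical names:

```javascript
// Hedged sketch of sitemap-index handling: if the document root is
// <sitemapindex>, recurse into each sub-sitemap; otherwise collect page
// URLs from <loc> elements. fetchXml(url) returns the XML as a string.
async function collectSitemapUrls(sitemapUrl, fetchXml) {
  const xml = await fetchXml(sitemapUrl);
  const locs = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);
  if (!xml.includes('<sitemapindex')) return locs; // plain <urlset>
  const nested = await Promise.all(locs.map((u) => collectSitemapUrls(u, fetchXml)));
  return nested.flat();
}
```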
Run the test suite:

```bash
npm test
```

Run tests with UI:

```bash
npx playwright test --ui
```

You can also use the ImageChecker class programmatically:
```javascript
const ImageChecker = require('./src/image-checker');

async function checkMyWebsite() {
  const checker = new ImageChecker({
    headless: true,
    maxConcurrent: 3,
    delay: 100,
    userAgent: 'MyBot/1.0',
    filterUploadsOnly: true,
    enableLearning: true
  });

  try {
    await checker.init();

    // Check a single page
    const results = await checker.checkWebsite('https://example.com');

    // Or check the entire website with a sitemap
    const sitemapResults = await checker.checkWebsiteWithSitemap('https://example.com/sitemap.xml');

    await checker.generateReport(results);
  } finally {
    await checker.close();
  }
}

checkMyWebsite();
```

Requirements:
- Node.js 18 or higher
- npm or yarn
- Internet connection for downloading Playwright browsers
- **Playwright browsers not installed**: Run `npx playwright install`
- **Permission errors**: Ensure you have write permissions in the project directory and Documents folder
- **Network timeouts**: Increase the timeout value for slow-loading websites
- **CORS issues**: Some images may be blocked due to CORS policies
- **Sitemap parsing errors**: Ensure the sitemap URL is accessible and in valid XML format
- **ERR_ABORTED errors**: These are handled automatically with retries but may indicate rate limiting or server issues
Run in development mode to see the browser window:

```bash
npm start https://example.com --dev
```

Performance tips:
- Use `--enable-learning` for automatic optimization during execution
- Use `--max-concurrent-batches`, `--max-concurrent-pages`, and `--max-concurrent-images` to fine-tune parallelism
- Use `--delay` to add pauses between requests (helps with rate limiting)
- Use `--no-cache` if you want fresh results every time
- Use `--filter-uploads-only` for WordPress sites to skip theme images
- Use sitemaps for large websites to ensure comprehensive coverage
- Let dynamic calculations handle parameter optimization for best results
- **Throttle requests**: Use delay and concurrency options to avoid overwhelming servers
- **Use sitemaps**: Ensures comprehensive coverage of all pages
- **Enable learning**: Allows automatic optimization during execution
- **Filter appropriately**: Use `--filter-uploads-only` for WordPress to focus on uploaded images
- **Monitor performance**: Check timing and performance metrics in reports
- **Use retry logic**: Configurable retries handle transient network issues
- **Test against staging**: Point to any URL (staging, production, etc.)
- **Use a distinct User-Agent**: Customize the User-Agent string to identify your checker
- **Review markdown reports**: Detailed reports are saved to the Documents folder for easy sharing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
MIT License - see LICENSE file for details.