feat: page snapshot scripts for offline selector testing #10

blubblubservice wants to merge 1 commit into main
Conversation
- scripts/save-snapshots.ts: Playwright script to fetch avto.net pages
- scripts/test-selectors.ts: Load saved HTML and test all selectors offline
- snapshots/README.md: Instructions for manual snapshot saving (Cloudflare blocks automation)
- npm run snapshot / npm run test-selectors

Note: Cloudflare blocks automated Playwright access. Save HTML manually from a real browser (see snapshots/README.md), then use test-selectors to verify.
📝 Walkthrough

Introduces infrastructure for offline HTML snapshot capture and selector testing of avto.net pages. Adds two snapshot-saving scripts (save-snapshots.ts and save-snapshots-v2.ts) that use Playwright to capture search and detail page HTML while bypassing Cloudflare. Includes a test-selectors.ts script to validate DOM selectors against snapshots. Updates package.json with new npm scripts and adds documentation.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 4
🧹 Nitpick comments (3)
scripts/test-selectors.ts (1)
**31-44:** Empty `catch {}` blocks in a diagnostic tool hide selector errors.

In the snapshot scripts, swallowing errors is fine (best-effort). But in a testing tool whose purpose is to report what works and what doesn't, a silently caught exception will show up as ❌ null instead of revealing that the selector threw. Consider logging the error so you can distinguish "field not found" from "selector crashed."

Proposed fix (for both blocks):

```diff
- } catch {}
+ } catch (e) { r.priceError = String(e); }
```

```diff
- } catch {}
+ } catch (e) { r.specsError = String(e); }
```

Also applies to: 52-78
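To make the distinction concrete, here is a small standalone sketch of the pattern the review suggests; the `probe` helper and `FieldResult` type are hypothetical names, not part of the PR:

```typescript
// Hypothetical helper: run one selector probe and record a thrown error
// instead of swallowing it, so "null with an error" is distinguishable
// from "null because the field is absent".
type FieldResult = { value: string | null; error?: string };

function probe(fn: () => string | null): FieldResult {
  try {
    return { value: fn() };
  } catch (e) {
    return { value: null, error: String(e) };
  }
}

const found = probe(() => "42.900 EUR");                          // field present
const missing = probe(() => null);                                // absent, selector fine
const crashed = probe(() => { throw new Error("bad selector"); }); // selector threw

console.log(found, missing, crashed);
```

A test report built from `FieldResult` can then print the error text next to the ❌ instead of a bare null.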
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/test-selectors.ts` around lines 31-44, the empty catch block in the TreeWalker block swallows selector errors, causing tests to report null instead of an error; update the catch to capture the exception and record or log it (e.g., set r.priceComment or a new r.priceError with the error.message/stack and/or call console.error) so failures that are selector exceptions are distinguishable from "not found" results — apply the same change to the other similar block that inspects nodes (lines referencing document.createTreeWalker, r.priceComment, r.priceFromComment).

scripts/save-snapshots.ts (2)
**73-81:** Snapshots are saved without verifying Cloudflare was bypassed.

Unlike `save-snapshots-v2.ts`, which checks page content and applies a `-cf-blocked` suffix, this script blindly saves whatever HTML is returned (line 80). If CF blocks the request, you get a challenge page named `search-1.html` that looks like a valid snapshot. When fed to `test-selectors`, every selector silently reports ❌ null with no indication the snapshot itself is bad.

Consider adding a basic content check (e.g., look for `challenge-platform` or the absence of expected markers) and either renaming or logging a warning.

Proposed fix:

```diff
  const html = await page.content();
+ const cfBlocked = html.includes('challenge-platform') || (!html.includes('details.asp') && !html.includes('ResultsAd'));
- const filename = `search-${i + 1}.html`;
+ const filename = cfBlocked ? `search-${i + 1}-cf-blocked.html` : `search-${i + 1}.html`;
+ if (cfBlocked) console.warn(`  ⚠️ Cloudflare may have blocked this page`);
  writeFileSync(join(SNAPSHOT_DIR, filename), html, 'utf-8');
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/save-snapshots.ts` around lines 73 - 81, The script saves snapshots without checking for Cloudflare challenge pages, so update the block around page.goto / page.content in scripts/save-snapshots.ts to detect CF by inspecting the HTML for known markers (e.g., 'challenge-platform', 'Cloudflare', or absence of expected page markers used by test-selectors), and if detected either append a '-cf-blocked' suffix to the filename (constructed as `search-${i + 1}.html`) or emit a clear warning log before writing; use the existing SNAPSHOT_DIR, filename variable and writeFileSync call, mirror the detection/handling logic from save-snapshots-v2.ts so test-selectors can skip or flag these snapshots.
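A standalone sketch of such a check (the marker strings `challenge-platform`, `details.asp`, and `ResultsAd` are taken from the review's suggestion and would need verifying against real avto.net HTML before relying on them):

```typescript
// Hypothetical heuristic: treat a snapshot as Cloudflare-blocked when a
// challenge marker is present, or when none of the expected page markers are.
function looksCfBlocked(html: string): boolean {
  const challengeMarkers = ["challenge-platform"];
  const expectedMarkers = ["details.asp", "ResultsAd"];
  if (challengeMarkers.some((m) => html.includes(m))) return true;
  return !expectedMarkers.some((m) => html.includes(m));
}

// Filename selection mirroring the proposed diff above.
function snapshotName(index: number, html: string): string {
  return looksCfBlocked(html)
    ? `search-${index}-cf-blocked.html`
    : `search-${index}.html`;
}
```

With this in place, `test-selectors` can skip any `*-cf-blocked.html` file instead of reporting every field as null.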
**1-144:** Significant overlap with `save-snapshots-v2.ts`.

Both scripts share browser config, anti-detection setup, delay helpers, URL patterns, and file-saving logic. If you plan to iterate on these further, consider extracting shared config/utilities into a common module to avoid the two scripts drifting apart. Fine to defer given these are dev tools.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/save-snapshots.ts` around lines 1 - 144, Extract the duplicated setup into a shared dev-utils module: move the delay(ms) helper, SNAPSHOT_DIR and SEARCH_URLS constants, the Chromium launch/config object and the anti-detection init script (the context.addInitScript block) into a common export (e.g., create functions getBrowserOptions(), getContextInitScript(), getSnapshotDir(), getSearchUrls(), delay()). Update both save-snapshots.ts and save-snapshots-v2.ts to import these symbols and call them from their main() functions (keep per-script logic like which pages to fetch and file naming in each file); ensure writeFileSync usage remains local but uses the shared SNAPSHOT_DIR value.
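One possible shape for the shared module; the file name `scripts/snapshot-utils.ts` is an assumption, and the option values are copied from the v2 snippet quoted later in this review, so treat this as a sketch rather than the final config:

```typescript
// Hypothetical contents of a shared scripts/snapshot-utils.ts. In the real
// module each of these would be exported and imported by both snapshot scripts.
const SNAPSHOT_DIR = "snapshots";

// Simple promise-based sleep used between page loads.
const delay = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Chromium launch options shared by save-snapshots.ts and save-snapshots-v2.ts.
function getBrowserOptions() {
  return {
    headless: false,
    args: [
      "--disable-blink-features=AutomationControlled",
      "--no-sandbox",
      "--disable-setuid-sandbox",
      "--disable-dev-shm-usage",
      "--window-size=1920,1080",
    ],
  };
}
```

Per-script concerns (which pages to fetch, file naming) would stay in each script's `main()`.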
```ts
async function main() {
  console.log('Launching browser v2 (longer waits, Turnstile interaction)...');

  const browser = await chromium.launch({
    headless: false,
    args: [
      '--disable-blink-features=AutomationControlled',
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--window-size=1920,1080',
    ],
  });
```
Browser is not closed if an error occurs mid-execution.
If any unhandled error is thrown between chromium.launch() and browser.close(), the browser process leaks. Wrap the body of main() in a try/finally to ensure cleanup.
🛡️ Proposed fix

```diff
 async function main() {
   console.log('Launching browser v2 (longer waits, Turnstile interaction)...');
   const browser = await chromium.launch({
     headless: false,
     args: [ /* ... */ ],
   });
+  try {
     const context = await browser.newContext({ /* ... */ });
     // ... rest of main body ...
-    await browser.close();
-    console.log('\nDone!');
+  } finally {
+    await browser.close();
+    console.log('\nDone!');
+  }
 }
```

Also applies to: 171-172
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/save-snapshots-v2.ts` around lines 46 - 58, The browser launched by
chromium.launch() in main() can leak if an exception occurs before
browser.close(); declare let browser; before calling chromium.launch(), then
wrap the launch and the rest of main()'s body in a try/finally block and call
await browser?.close() in the finally to guarantee cleanup; apply the same
try/finally pattern around the other launch/close pair referenced at lines
171-172 so every browser instance is closed on error.
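The cleanup guarantee can be demonstrated generically, without a Playwright dependency; `FakeBrowser` and `withBrowser` below are stand-ins for the real browser handle and `main()` body, purely for illustration:

```typescript
// Minimal sketch of the try/finally lifecycle: close() runs whether or not
// the body throws, so the browser process cannot leak.
class FakeBrowser {
  closed = false;
  async close(): Promise<void> {
    this.closed = true;
  }
}

async function withBrowser<T>(
  body: (b: FakeBrowser) => Promise<T>,
): Promise<T> {
  const browser = new FakeBrowser(); // stands in for await chromium.launch(...)
  try {
    return await body(browser);
  } finally {
    await browser.close(); // guaranteed cleanup, even on throw
  }
}
```

The same shape applies to both launch/close pairs flagged in this review.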
```ts
// If we got through, try a second search page via pagination
if (detailUrls.length > 0) {
  console.log('\nTrying to navigate to page 2 via pagination...');
  await delay(10000);
  try {
    const nextLink = await page.$('a:has-text("Naslednja"), a:has-text("»"), a:has-text("2")');
    if (nextLink) {
      await nextLink.click();
      await page.waitForLoadState('domcontentloaded', { timeout: 30000 });
      await waitForCloudflare(page, 60);
      const html = await page.content();
      writeFileSync(join(SNAPSHOT_DIR, 'search-2.html'), html, 'utf-8');
      console.log(`Saved search-2.html (${(html.length / 1024).toFixed(0)} KB)`);
    }
  } catch (e) {
    console.log('Could not get page 2:', e);
  }
}
```
Pagination snapshot doesn't apply the cf-blocked naming convention.
Search page 1 (line 110) conditionally names the file with -cf-blocked when CF isn't passed, but the paginated search-2 (line 139) always saves as search-2.html even though waitForCloudflare is called on line 137. This could produce a snapshot that looks valid but actually contains the CF challenge page.
Proposed fix

```diff
-      await waitForCloudflare(page, 60);
+      const passed = await waitForCloudflare(page, 60);
       const html = await page.content();
-      writeFileSync(join(SNAPSHOT_DIR, 'search-2.html'), html, 'utf-8');
-      console.log(`Saved search-2.html (${(html.length / 1024).toFixed(0)} KB)`);
+      const filename = passed ? 'search-2.html' : 'search-2-cf-blocked.html';
+      writeFileSync(join(SNAPSHOT_DIR, filename), html, 'utf-8');
+      console.log(`Saved ${filename} (${(html.length / 1024).toFixed(0)} KB)`);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/save-snapshots-v2.ts` around lines 128 - 145, The paginated snapshot
code always writes "search-2.html" even when the page is Cloudflare-challenged;
update the block that calls waitForCloudflare(page, 60) to mirror the earlier
search-1 logic: capture the page HTML into html, determine whether the page is
CF-blocked (use the same check/flag used for the first search capture), and set
the filename to include "-cf-blocked" when blocked before calling
writeFileSync(join(SNAPSHOT_DIR, filename), html, 'utf-8') and logging. Ensure
you reference the same detection mechanism used with waitForCloudflare and reuse
symbols html, SNAPSHOT_DIR, writeFileSync and waitForCloudflare so naming is
consistent with the initial search snapshot.
```ts
async function main() {
  if (!existsSync(SNAPSHOT_DIR)) {
    console.error('No snapshots/ directory found. Save HTML snapshots first (see snapshots/README.md).');
    process.exit(1);
  }

  const files = readdirSync(SNAPSHOT_DIR).filter(f => f.endsWith('.html'));
  if (files.length === 0) {
    console.error('No .html files in snapshots/. Save avto.net pages manually (see snapshots/README.md).');
    process.exit(1);
  }

  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  const detailFiles = files.filter(f => f.startsWith('detail'));
  const searchFiles = files.filter(f => f.startsWith('search'));

  if (searchFiles.length > 0) {
    console.log('\n📋 SEARCH RESULTS PAGES');
    for (const f of searchFiles) await testSearchPage(page, f);
  }

  if (detailFiles.length > 0) {
    console.log('\n📄 DETAIL PAGES');
    for (const f of detailFiles) await testDetailPage(page, f);
  }

  await browser.close();
  console.log('\nDone.');
}
```
Browser not cleaned up on mid-execution failure; inconsistent exit behavior.
Same resource-leak pattern as save-snapshots-v2.ts: if an error occurs between chromium.launch() and browser.close(), the process leaks. Additionally, line 174 uses main().catch(console.error) which logs but exits with code 0, unlike the snapshot scripts that call process.exit(1). For a dev tool invoked via npm run test-selectors, a silent zero-exit on failure can be misleading.
Proposed fix

```diff
 async function main() {
   // ... validation ...
   const browser = await chromium.launch({ headless: true });
+  try {
     const page = await browser.newPage();
     // ... test logic ...
-    await browser.close();
-    console.log('\nDone.');
+  } finally {
+    await browser.close();
+    console.log('\nDone.');
+  }
 }

-main().catch(console.error);
+main().catch((e) => {
+  console.error('Fatal:', e);
+  process.exit(1);
+});
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/test-selectors.ts` around lines 142 - 172, The main() function may
leak the Playwright browser if an exception occurs between chromium.launch() and
browser.close(); wrap the browser lifecycle in a try/finally (or use a local
variable and close in finally) so browser.close() always runs even on errors,
and change the top-level caller (currently main().catch(console.error)) to log
the error and exit non-zero (e.g., process.exit(1)) so failures don’t return
code 0; update references inside main (browser, chromium.launch(),
browser.close(), testSearchPage, testDetailPage) accordingly to ensure
deterministic cleanup and non-zero exit on failure.
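The exit-code half of this can be isolated in a tiny wrapper; the `run` helper below is a hypothetical name used for illustration, with the real entry point calling `run(main).then((code) => process.exit(code))`:

```typescript
// Hypothetical entry-point wrapper: maps success to exit code 0 and any
// thrown error to 1, so `npm run test-selectors` fails visibly on error.
async function run(mainFn: () => Promise<void>): Promise<number> {
  try {
    await mainFn();
    return 0;
  } catch (e) {
    console.error("Fatal:", e);
    return 1;
  }
}
```

Keeping the mapping in one place also makes the behavior easy to unit-test without spawning a process.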
````markdown
# HTML Snapshots

Save HTML snapshots of avto.net pages here for offline selector development.

## How to save snapshots

Cloudflare blocks automated access. Save pages manually from a real browser:

1. Open Chrome, navigate to avto.net
2. Find a search results page → Right-click → "Save as" → "Webpage, HTML Only"
3. Save as `search-1.html`, `search-2.html`, etc.
4. Open 3-5 individual listing detail pages → Save each as `detail-1.html`, `detail-2.html`, etc.
5. Commit and push

Or use the automated script (may be blocked by Cloudflare):

```bash
npm run snapshot
```

## Testing selectors offline

Once you have HTML files here, test selectors against them:

```bash
npm run test-selectors
```

This loads each snapshot and runs the current detail/search selectors, reporting which fields are populated vs null.

## Files

- `search-*.html` — Search results pages
- `detail-*.html` — Individual listing detail pages
- `.gitkeep` — Keeps this directory in git
````
Create snapshots/.gitignore to prevent HTML snapshots from being committed, and document or configure the v2 script.
The README instructs users to "Commit and push" HTML snapshots (step 5), but there is no snapshots/.gitignore to prevent large HTML files from bloating the repository history. Add a .gitignore file that ignores *.html while preserving .gitkeep.
Additionally, save-snapshots-v2.ts exists in the codebase with enhanced Cloudflare handling, but it is not:
- Added to npm scripts (only v1 is configured)
- Documented in the README or elsewhere
Either add v2 to the npm scripts and document when to use it, or remove the file if v1 is the intended approach.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@snapshots/README.md` around lines 1 - 33, Add a snapshots/.gitignore that
ignores *.html but keeps .gitkeep (e.g., add a negation for .gitkeep) so saved
HTML snapshots aren’t committed; then update README.md to remove the "Commit and
push" step and document the new behavior. Also either wire up
save-snapshots-v2.ts by adding an npm script (e.g., "snapshot:v2": "ts-node
scripts/save-snapshots-v2.ts" or similar) and a short README note describing
when to use v2 (Cloudflare handling) or delete save-snapshots-v2.ts if v1 is the
intended/only supported script. Ensure references use the exact names
snapshots/.gitignore, README.md, save-snapshots-v2.ts and the npm script key you
add (e.g., snapshot:v2).
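A minimal version of the suggested ignore file; the exact contents are an assumption following the review's description, not taken from the repo:

```gitignore
# snapshots/.gitignore: ignore saved page HTML, keep the placeholder
*.html
!.gitkeep
```

If v2 is kept, the package.json addition might look like the following (the runner, ts-node here, is a guess and should match however the existing `snapshot` script is invoked):

```json
"scripts": {
  "snapshot:v2": "ts-node scripts/save-snapshots-v2.ts"
}
```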
What

- `scripts/save-snapshots.ts` — Playwright script to fetch avto.net pages (may be blocked by Cloudflare)
- `scripts/test-selectors.ts` — Load saved HTML snapshots and test all current selectors offline, reporting populated vs null fields
- `snapshots/` directory with README instructions
- `npm run snapshot` and `npm run test-selectors` scripts

Why
Cloudflare blocks automated access to avto.net from our server. This tooling lets the team iterate on selectors offline against saved HTML, including the `<!-- DATA -->` / `<!-- PRICE -->` comment walkers.

Note
Cloudflare Turnstile challenge blocks all headless/headed Playwright from our server IP. HTML snapshots need to be saved manually from a real browser (Ctrl+S → HTML only). See
`snapshots/README.md`.