
Fix broken link checker to await link validations before closing the browser #5965

Draft
tkuhemiya wants to merge 2 commits into wso2:master from tkuhemiya:fix/broken-link-checker-race

Conversation


@tkuhemiya tkuhemiya commented Mar 14, 2026

Purpose

Fix the async flow in the broken link checker so link validations complete before the Puppeteer browser is closed.

Previously, markdown-link-check was triggered inside the page crawl loop without being awaited, which could allow the crawl to finish and the browser to close before all link checks had completed. This change extracts link validation into a dedicated checkLink() helper and waits for all pending checks before closing the browser.
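The core of the fix is the pattern of collecting every validation promise and awaiting them all before tearing anything down. A minimal, self-contained sketch of that pattern (the `checkLink` and `crawlAndValidate` names here are illustrative stand-ins; a `setTimeout` simulates network latency in place of Puppeteer and markdown-link-check):

```javascript
// Illustrative sketch only: checkLink is a stand-in for the real
// markdown-link-check-backed helper; it resolves with the link if
// "broken", otherwise null.
function checkLink(link) {
  return new Promise(resolve => {
    setTimeout(() => resolve(link.includes('missing') ? link : null), 50);
  });
}

async function crawlAndValidate(links) {
  const pendingChecks = links.map(link => checkLink(link)); // start all checks
  const results = await Promise.all(pendingChecks);         // await them BEFORE any cleanup
  return results.filter(Boolean);                           // keep broken links only
}

crawlAndValidate(['https://example.com/ok', 'https://example.com/missing'])
  .then(broken => console.log(broken));
```

Without the `await Promise.all(...)`, cleanup (closing the browser) could run while checks are still in flight, which is exactly the race this PR removes.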

Related Issue

No related issue

Related PRs

No related PRs

Test environment

  • macOS (Darwin 24.6.0)
  • Node.js v25.6.1
  • No JDK or database applicable for this change
  • No manual browser testing performed
  • Validation:
    • Hosted a local page containing test links
    • Configured the linked endpoints to respond with 404 after a 3-second delay
    • Ran node check-broken-links.js
    • Verified the script reported all broken links and completed link checking before closing the browser

Security checks

Summary by CodeRabbit

  • Refactor
    • Enhanced the link checking mechanism with improved error handling and optimized concurrent processing for better reliability and efficiency in detecting broken links.

Contributor

coderabbitai bot commented Mar 14, 2026

Walkthrough

The link checking logic in the script was refactored from a nested callback-based approach to a promise-based concurrent approach. A new helper function checkLink was introduced, browser instantiation was simplified to launch once per page, and error handling was consolidated while maintaining recursive depth-based crawling.

Changes

Cohort / File(s) Summary
Promise-Based Link Checking Refactor
en/check-broken-links.js
Introduced checkLink helper function, replaced callback-based link checking with Promise.all-based concurrency, consolidated error handling with centralized catch/finally blocks, and simplified browser lifecycle management.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hops through broken links so fine,
Promises all in a line,
One browser, many checks at once,
No more callbacks—clever hunch!
The web is healthier today,
Thanks to refactoring's way! 🌐

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check — Passed: The title accurately describes the main change: fixing the async flow in the broken link checker by awaiting link validations before closing the browser.
  • Description check — Passed: The description covers the purpose, test environment, and security checks as required by the template. All critical sections are completed with relevant details.
  • Docstring Coverage — Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


Tip

You can customize the tone of the review comments and chat replies.

Configure the tone_instructions setting to customize the tone of the review comments and chat replies. For example, you can set the tone to Act like a strict teacher, Act like a pirate and more.


CLAassistant commented Mar 14, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@en/check-broken-links.js`:
- Around line 81-84: The code currently calls puppeteer.launch() then
browser.newPage() before entering the guarded try, which can leak a browser
process if newPage() throws; move the browser initialization (await
puppeteer.launch()) into the try block so both launch and newPage are protected,
and in the finally block check that browser is non-null/defined before calling
browser.close(); apply the same change to the second occurrence that initializes
a browser (the block around lines 125-126) so both browser launch usages are
created inside their respective try blocks and safely null-checked in finally.
- Around line 87-90: The anchor hrefs collected in linksWithInfo include
fragment hashes which cause duplicate visits (e.g., /page#intro vs /page#api);
normalize by stripping the URL.hash (or anything after '#') when mapping anchor
tags in page.evaluate so returned links are fragment-free, and also ensure you
strip fragments again before deduping against visitedUrls and before calling
checkLinksOnPage; update the same logic used around the checkLinksOnPage
invocation (lines referenced 117-118) to use the normalized (fragment-stripped)
URL.
- Around line 53-66: Update the markdownLinkCheck call and results handling to
use the documented API: replace the options object { retry: true } with the
supported retry options (e.g., retryOn429, retryCount, fallbackRetryDelay) when
invoking markdownLinkCheck, and in the results.forEach callback use the
documented result fields (status, statusCode, err, link) instead of the
undocumented dead field; specifically, mark a link broken if result.err is
truthy or result.statusCode (or result.status) indicates an HTTP error (>=400 or
equals 404), and continue calling logBrokenLink(link, sourceUrl) and console.log
for those cases. Ensure you reference the markdownLinkCheck invocation and the
results.forEach block when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8d77c0bf-03a0-40e6-bb0a-15b0f0b149d6

📥 Commits

Reviewing files that changed from the base of the PR and between cecd4c9 and 3db88f0.

📒 Files selected for processing (1)
  • en/check-broken-links.js

Comment on lines +53 to +66
return new Promise((resolve, reject) => {
  markdownLinkCheck(link, { retry: true }, (err, results) => {
    if (err) {
      reject(err);
      return;
    }

    results.forEach(result => {
      const { dead, statusCode } = result;

      if (dead || statusCode === 404) {
        logBrokenLink(link, sourceUrl);
        console.log(`\n[Broken Link] Found: ${link}\n`);
      }
Contributor

@coderabbitai coderabbitai bot Mar 14, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

What fields does the markdown-link-check JavaScript API expose on each result object, and what are the documented retry-related option names?

💡 Result:

Result object fields (JS API)

Each element in the results array has these documented properties: link, status, statusCode, err.

Retry-related option names (documented)

The documented retry options are: retryOn429, retryCount, and fallbackRetryDelay.

Use the documented markdown-link-check API fields and retry options.

The JS API result objects expose status, statusCode, err, and link. The supported retry options are retryOn429, retryCount, and fallbackRetryDelay. The current code uses the undocumented dead field and { retry: true } option, causing only explicit 404s to be reported and no retry logic to be applied.

♻️ Proposed fix
-    markdownLinkCheck(link, { retry: true }, (err, results) => {
+    markdownLinkCheck(link, { retryOn429: true }, (err, results) => {
@@
-        const { dead, statusCode } = result;
-
-        if (dead || statusCode === 404) {
+        if (result.status === 'dead') {
           logBrokenLink(link, sourceUrl);
           console.log(`\n[Broken Link] Found: ${link}\n`);
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return new Promise((resolve, reject) => {
  markdownLinkCheck(link, { retryOn429: true }, (err, results) => {
    if (err) {
      reject(err);
      return;
    }

    results.forEach(result => {
      if (result.status === 'dead') {
        logBrokenLink(link, sourceUrl);
        console.log(`\n[Broken Link] Found: ${link}\n`);
      }

Author


These behaviors were already present before this PR. The purpose of this change was only to ensure that link checks are properly awaited so validation completes before the browser closes.

If maintainers prefer, I can address the additional issues mentioned here in this PR as well.

Contributor


@tkuhemiya, that's a fair clarification — the undocumented dead field and { retry: true } option were indeed pre-existing and are outside the scope of this PR's fix. The async/await correction is the right focus here.

If you'd prefer to keep this PR scoped, I'd suggest addressing the API correctness improvements in a separate follow-up. Would you like me to open a GitHub issue to track that?

Comment on lines +81 to +84
const browser = await puppeteer.launch();
const page = await browser.newPage();

try {
Contributor


⚠️ Potential issue | 🟠 Major

Move browser creation inside the guarded block.

If browser.newPage() fails after puppeteer.launch() succeeds, control never reaches finally and the browser process is left running. Initialize browser inside try and null-check it in finally.

♻️ Proposed fix
-  const browser = await puppeteer.launch();
-  const page = await browser.newPage();
-
-  try {
+  let browser;
+  let page;
+
+  try {
+    browser = await puppeteer.launch();
+    page = await browser.newPage();
@@
-    await browser.close(); // Ensure the browser is closed
+    if (browser) {
+      await browser.close();
+    }

Also applies to: 125-126
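A standalone sketch of the lifecycle the review suggests, with a stubbed launch() in place of Puppeteer (all names and the stub browser are illustrative, not the real API):

```javascript
// Stub browser standing in for Puppeteer; launch/newPage/close here are
// hypothetical stand-ins used only to demonstrate the try/finally shape.
const events = [];

function launch() {
  return Promise.resolve({
    newPage: () => Promise.resolve({ url: 'about:blank' }),
    close: () => {
      events.push('closed');
      return Promise.resolve();
    },
  });
}

async function withBrowser(run) {
  let browser;
  try {
    browser = await launch();             // launch inside try, so a failing
    const page = await browser.newPage(); // newPage() still reaches finally
    return await run(page);
  } finally {
    if (browser) await browser.close();   // null-check: skip close() if launch failed
  }
}
```

If launch() itself rejects, browser stays undefined and the finally block skips close() instead of throwing a second error that would mask the first.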


Comment on lines +87 to +90
const linksWithInfo = await page.evaluate(() => {
  const anchorTags = Array.from(document.querySelectorAll('a'));
  return anchorTags.map(tag => tag.href).filter(link => link.startsWith('http'));
});
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Normalize fragment-only variants before crawling.

tag.href includes #fragment, so the crawler treats /page#intro and /page#api as different entries in visitedUrls. That re-visits the same document, duplicates broken-link reports, and inflates the link counters. Strip hash before deduping and before calling checkLinksOnPage.

♻️ Proposed fix
 async function checkLinksOnPage(url, depth = 2) {
+  const normalizedUrl = new URL(url);
+  normalizedUrl.hash = '';
+  url = normalizedUrl.toString();
+
   if (isShuttingDown || depth < 0 || visitedUrls.has(url)) return; // Stop if shutting down, depth limit reached, or already visited
@@
-    const linksWithInfo = await page.evaluate(() => {
+    const linksWithInfo = (await page.evaluate(() => {
       const anchorTags = Array.from(document.querySelectorAll('a'));
       return anchorTags.map(tag => tag.href).filter(link => link.startsWith('http'));
-    });
+    })).map(link => {
+      const normalizedLink = new URL(link);
+      normalizedLink.hash = '';
+      return normalizedLink.toString();
+    });
@@
-    for (const link of linksWithInfo) {
+    for (const link of new Set(linksWithInfo)) {

Also applies to: 117-118
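The normalization itself is just hash-stripping with the standard URL class; a hypothetical helper name is used here for illustration:

```javascript
// Strip the fragment so /page#intro and /page#api dedupe to the same entry.
// stripFragment is an illustrative name, not a function from the PR.
function stripFragment(link) {
  const u = new URL(link);
  u.hash = '';
  return u.toString();
}

console.log(stripFragment('https://example.com/page#intro')); // https://example.com/page
```

Applying this both when collecting hrefs and before the visitedUrls check keeps the dedupe set consistent no matter which code path produced the URL.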


@tkuhemiya tkuhemiya marked this pull request as draft March 19, 2026 13:55
