
Fix broken link checker to await link validations before closing the browser #5965

Draft
tkuhemiya wants to merge 2 commits into wso2:master from tkuhemiya:fix/broken-link-checker-race

Conversation


@tkuhemiya tkuhemiya commented Mar 14, 2026

Purpose

Fix the async flow in the broken link checker so link validations complete before the Puppeteer browser is closed.

Previously, markdown-link-check was triggered inside the page crawl loop without being awaited, which could allow the crawl to finish and the browser to close before all link checks had completed. This change extracts link validation into a dedicated checkLink() helper and waits for all pending checks before closing the browser.
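The core of the fix is the pattern of collecting every validation promise and awaiting them all before tearing anything down. A minimal, self-contained sketch of that pattern (the `checkLink` and `crawlAndValidate` names here are illustrative stand-ins; a `setTimeout` simulates network latency in place of Puppeteer and markdown-link-check):

```javascript
// Illustrative sketch only: checkLink is a stand-in for the real
// markdown-link-check-backed helper; it resolves with the link if
// "broken", otherwise null.
function checkLink(link) {
  return new Promise(resolve => {
    setTimeout(() => resolve(link.includes('missing') ? link : null), 50);
  });
}

async function crawlAndValidate(links) {
  const pendingChecks = links.map(link => checkLink(link)); // start all checks
  const results = await Promise.all(pendingChecks);         // await them BEFORE any cleanup
  return results.filter(Boolean);                           // keep broken links only
}

crawlAndValidate(['https://example.com/ok', 'https://example.com/missing'])
  .then(broken => console.log(broken));
```

Without the `await Promise.all(...)`, cleanup (closing the browser) could run while checks are still in flight, which is exactly the race this PR removes.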

Related Issue

No related issue

Related PRs

No related PRs

Test environment

  • macOS (Darwin 24.6.0)
  • Node.js v25.6.1
  • No JDK or database applicable for this change
  • No manual browser testing performed
  • Validation:
    • Hosted a local page containing test links
    • Configured the linked endpoints to respond with 404 after a 3-second delay
    • Ran node check-broken-links.js
    • Verified the script reported all broken links and completed link checking before closing the browser

Security checks

Summary by CodeRabbit

  • Refactor
    • Enhanced the link checking mechanism with improved error handling and optimized concurrent processing for better reliability and efficiency in detecting broken links.

Contributor

coderabbitai bot commented Mar 14, 2026

Walkthrough

The link checking logic in the script was refactored from a nested callback-based approach to a promise-based concurrent approach. A new helper function checkLink was introduced, browser instantiation was simplified to launch once per page, and error handling was consolidated while maintaining recursive depth-based crawling.

Changes

Cohort / File(s) Summary
Promise-Based Link Checking Refactor
en/check-broken-links.js
Introduced checkLink helper function, replaced callback-based link checking with Promise.all-based concurrency, consolidated error handling with centralized catch/finally blocks, and simplified browser lifecycle management.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hops through broken links so fine,
Promises all in a line,
One browser, many checks at once,
No more callbacks—clever hunch!
The web is healthier today,
Thanks to refactoring's way! 🌐

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check — Passed: The title accurately describes the main change: fixing the async flow in the broken link checker by awaiting link validations before closing the browser.
  • Description check — Passed: The description covers the purpose, test environment, and security checks as required by the template. All critical sections are completed with relevant details.
  • Docstring Coverage — Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


Tip

You can customize the tone of the review comments and chat replies.

Configure the tone_instructions setting to customize the tone of the review comments and chat replies. For example, you can set the tone to Act like a strict teacher, Act like a pirate and more.


CLAassistant commented Mar 14, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@en/check-broken-links.js`:
- Around line 81-84: The code currently calls puppeteer.launch() then
browser.newPage() before entering the guarded try, which can leak a browser
process if newPage() throws; move the browser initialization (await
puppeteer.launch()) into the try block so both launch and newPage are protected,
and in the finally block check that browser is non-null/defined before calling
browser.close(); apply the same change to the second occurrence that initializes
a browser (the block around lines 125-126) so both browser launch usages are
created inside their respective try blocks and safely null-checked in finally.
- Around line 87-90: The anchor hrefs collected in linksWithInfo include
fragment hashes which cause duplicate visits (e.g., /page#intro vs /page#api);
normalize by stripping the URL.hash (or anything after '#') when mapping anchor
tags in page.evaluate so returned links are fragment-free, and also ensure you
strip fragments again before deduping against visitedUrls and before calling
checkLinksOnPage; update the same logic used around the checkLinksOnPage
invocation (lines referenced 117-118) to use the normalized (fragment-stripped)
URL.
- Around line 53-66: Update the markdownLinkCheck call and results handling to
use the documented API: replace the options object { retry: true } with the
supported retry options (e.g., retryOn429, retryCount, fallbackRetryDelay) when
invoking markdownLinkCheck, and in the results.forEach callback use the
documented result fields (status, statusCode, err, link) instead of the
undocumented dead field; specifically, mark a link broken if result.err is
truthy or result.statusCode (or result.status) indicates an HTTP error (>=400 or
equals 404), and continue calling logBrokenLink(link, sourceUrl) and console.log
for those cases. Ensure you reference the markdownLinkCheck invocation and the
results.forEach block when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8d77c0bf-03a0-40e6-bb0a-15b0f0b149d6

📥 Commits

Reviewing files that changed from the base of the PR and between cecd4c9 and 3db88f0.

📒 Files selected for processing (1)
  • en/check-broken-links.js

Comment on lines +53 to +66
return new Promise((resolve, reject) => {
  markdownLinkCheck(link, { retry: true }, (err, results) => {
    if (err) {
      reject(err);
      return;
    }

    results.forEach(result => {
      const { dead, statusCode } = result;

      if (dead || statusCode === 404) {
        logBrokenLink(link, sourceUrl);
        console.log(`\n[Broken Link] Found: ${link}\n`);
      }
Contributor

@coderabbitai coderabbitai bot Mar 14, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

What fields does the markdown-link-check JavaScript API expose on each result object, and what are the documented retry-related option names?

💡 Result:

Result object fields (JS API)

Each element in the results array has these documented properties: link, status, statusCode, err.

Retry-related option names (documented)

The documented retry options are: retryOn429, retryCount, and fallbackRetryDelay.

Use the documented markdown-link-check API fields and retry options.

The JS API result objects expose status, statusCode, err, and link. The supported retry options are retryOn429, retryCount, and fallbackRetryDelay. The current code uses the undocumented dead field and { retry: true } option, causing only explicit 404s to be reported and no retry logic to be applied.

♻️ Proposed fix
-    markdownLinkCheck(link, { retry: true }, (err, results) => {
+    markdownLinkCheck(link, { retryOn429: true }, (err, results) => {
@@
-        const { dead, statusCode } = result;
-
-        if (dead || statusCode === 404) {
+        if (result.status === 'dead') {
           logBrokenLink(link, sourceUrl);
           console.log(`\n[Broken Link] Found: ${link}\n`);
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return new Promise((resolve, reject) => {
  markdownLinkCheck(link, { retryOn429: true }, (err, results) => {
    if (err) {
      reject(err);
      return;
    }

    results.forEach(result => {
      if (result.status === 'dead') {
        logBrokenLink(link, sourceUrl);
        console.log(`\n[Broken Link] Found: ${link}\n`);
      }

Author


These behaviors were already present before this PR. The purpose of this change was only to ensure that link checks are properly awaited so validation completes before the browser closes.

If maintainers prefer, I can address the additional issues mentioned here in this PR as well.

Contributor


@tkuhemiya, that's a fair clarification — the undocumented dead field and { retry: true } option were indeed pre-existing and are outside the scope of this PR's fix. The async/await correction is the right focus here.

If you'd prefer to keep this PR scoped, I'd suggest addressing the API correctness improvements in a separate follow-up. Would you like me to open a GitHub issue to track that?

Comment on lines +81 to +84
const browser = await puppeteer.launch();
const page = await browser.newPage();

try {
Contributor


⚠️ Potential issue | 🟠 Major

Move browser creation inside the guarded block.

If browser.newPage() fails after puppeteer.launch() succeeds, control never reaches finally and the browser process is left running. Initialize browser inside try and null-check it in finally.

♻️ Proposed fix
-  const browser = await puppeteer.launch();
-  const page = await browser.newPage();
-
-  try {
+  let browser;
+  let page;
+
+  try {
+    browser = await puppeteer.launch();
+    page = await browser.newPage();
@@
-    await browser.close(); // Ensure the browser is closed
+    if (browser) {
+      await browser.close();
+    }

Also applies to: 125-126
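A standalone sketch of the lifecycle the review suggests, with a stubbed launch() in place of Puppeteer (all names and the stub browser are illustrative, not the real API):

```javascript
// Stub browser standing in for Puppeteer; launch/newPage/close here are
// hypothetical stand-ins used only to demonstrate the try/finally shape.
const events = [];

function launch() {
  return Promise.resolve({
    newPage: () => Promise.resolve({ url: 'about:blank' }),
    close: () => {
      events.push('closed');
      return Promise.resolve();
    },
  });
}

async function withBrowser(run) {
  let browser;
  try {
    browser = await launch();             // launch inside try, so a failing
    const page = await browser.newPage(); // newPage() still reaches finally
    return await run(page);
  } finally {
    if (browser) await browser.close();   // null-check: skip close() if launch failed
  }
}
```

If launch() itself rejects, browser stays undefined and the finally block skips close() instead of throwing a second error that would mask the first.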


Comment on lines +87 to +90
const linksWithInfo = await page.evaluate(() => {
  const anchorTags = Array.from(document.querySelectorAll('a'));
  return anchorTags.map(tag => tag.href).filter(link => link.startsWith('http'));
});
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Normalize fragment-only variants before crawling.

tag.href includes #fragment, so the crawler treats /page#intro and /page#api as different entries in visitedUrls. That re-visits the same document, duplicates broken-link reports, and inflates the link counters. Strip hash before deduping and before calling checkLinksOnPage.

♻️ Proposed fix
 async function checkLinksOnPage(url, depth = 2) {
+  const normalizedUrl = new URL(url);
+  normalizedUrl.hash = '';
+  url = normalizedUrl.toString();
+
   if (isShuttingDown || depth < 0 || visitedUrls.has(url)) return; // Stop if shutting down, depth limit reached, or already visited
@@
-    const linksWithInfo = await page.evaluate(() => {
+    const linksWithInfo = (await page.evaluate(() => {
       const anchorTags = Array.from(document.querySelectorAll('a'));
       return anchorTags.map(tag => tag.href).filter(link => link.startsWith('http'));
-    });
+    })).map(link => {
+      const normalizedLink = new URL(link);
+      normalizedLink.hash = '';
+      return normalizedLink.toString();
+    });
@@
-    for (const link of linksWithInfo) {
+    for (const link of new Set(linksWithInfo)) {

Also applies to: 117-118
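The normalization itself is just hash-stripping with the standard URL class; a hypothetical helper name is used here for illustration:

```javascript
// Strip the fragment so /page#intro and /page#api dedupe to the same entry.
// stripFragment is an illustrative name, not a function from the PR.
function stripFragment(link) {
  const u = new URL(link);
  u.hash = '';
  return u.toString();
}

console.log(stripFragment('https://example.com/page#intro')); // https://example.com/page
```

Applying this both when collecting hrefs and before the visitedUrls check keeps the dedupe set consistent no matter which code path produced the URL.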


@tkuhemiya tkuhemiya marked this pull request as draft March 19, 2026 13:55
