Extend API to allow string-checking, improve link-checking, and optimize performance#1
Conversation
…ation Introduce `checkCodeString`, `checkLinksString`, and `minifyString` to the API, enabling string-based pipelines for validation, link checking, and minification. Updated documentation, type definitions, and tests to support the new methods. (This commit message was AI-generated.) Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
Enhanced link-checking functionality to detect URLs in unquoted attributes such as `href=https://example.com`. Updated regex, tests, and documentation to reflect this improvement. (This commit message was AI-generated.) Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
Optimized directory traversal by parallelizing subdirectory handling. Cached HtmlValidate and HTML Minifier Next instances per preset to reduce redundant initialization. Enhanced link-checker performance with precompiled regexes and efficient ignore-list processing. (This commit message was AI-generated.) Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughRelease PR adding string-based programmatic APIs ( ChangesString APIs and Core Optimizations
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/adapters/validate.js (2)
36-55:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRace condition in validator cache population.
When multiple
validate()calls with the same preset execute concurrently before the cache is populated, each will triggergetValidator(preset)simultaneously. SincevalidatorCache.has(preset)returnsfalsefor all, each call will dynamically importhtml-validate, construct a newHtmlValidateinstance, and race to store it in the cache. The last write wins, causing wasted initialization work and potentially leaving orphaned validator instances.🔒 Proposed fix using promise caching
Cache the promise itself instead of the resolved validator to ensure only one initialization occurs per preset:
-/** `@type` {Map<string, import('html-validate').HtmlValidate>} */ +/** `@type` {Map<string, Promise<import('html-validate').HtmlValidate>>} */ const validatorCache = new Map(); /** * Return a cached HtmlValidate instance for the given preset, creating one if needed. * `@param` {string} preset * `@returns` {Promise<import('html-validate').HtmlValidate>} */ async function getValidator(preset) { - if (validatorCache.has(preset)) return /** `@type` {import('html-validate').HtmlValidate} */ (validatorCache.get(preset)); + if (validatorCache.has(preset)) return validatorCache.get(preset); + + const promise = (async () => { + let HtmlValidate; + try { + ({ HtmlValidate } = await import('html-validate')); + } catch { + throw new Error('Could not load HTML-validate. Ensure it is installed and check for breaking API changes.'); + } - let HtmlValidate; - try { - ({ HtmlValidate } = await import('html-validate')); - } catch { - throw new Error('Could not load HTML-validate. Ensure it is installed and check for breaking API changes.'); - } - - let validator; - try { - validator = new HtmlValidate({ extends: [`html-validate:${preset}`] }); - } catch (err) { - throw new Error(`HTML-validate initialization failed—the package may have breaking changes: ${err instanceof Error ? err.message : String(err)}`, { cause: err }); - } + let validator; + try { + validator = new HtmlValidate({ extends: [`html-validate:${preset}`] }); + } catch (err) { + throw new Error(`HTML-validate initialization failed—the package may have breaking changes: ${err instanceof Error ? err.message : String(err)}`, { cause: err }); + } + return validator; + })(); - validatorCache.set(preset, validator); - return validator; + validatorCache.set(preset, promise); + return promise; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/adapters/validate.js` around lines 36 - 55, The validator cache has a race because getValidator checks validatorCache.has(preset) and then performs async initialization; change caching to store the initialization Promise itself so concurrent callers share it: inside getValidator, when no cache entry exists, create and set an init promise that imports 'html-validate', constructs new HtmlValidate({ extends: [`html-validate:${preset}`] }) and resolves to the instance, then await that cached promise and return the instance; ensure you handle rejection by removing the cached promise on error (so future calls can retry). Reference: getValidator, validatorCache, HtmlValidate.
81-81:⚠️ Potential issue | 🟠 MajorAddress thread safety of cached HtmlValidate instance in concurrent validation.
The cached
HtmlValidateinstance (line 65, reused at line 81) is shared across concurrent validator.validateString() calls withinrunWithConcurrency. The official html-validate documentation does not guarantee thread-safety or document whether validateString() is safe to call concurrently on the same instance. If the validator maintains mutable state during validation, concurrent calls can race and produce incorrect results.Consider either: (a) creating one HtmlValidate instance per concurrent worker instead of caching a single shared instance, or (b) serializing access to the validator with an application-level mutex or queue.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/adapters/validate.js` at line 81, The cached HtmlValidate instance is reused concurrently by runWithConcurrency causing potential race conditions when calling validator.validateString; either instantiate a fresh HtmlValidate per worker (move HtmlValidate creation into the function run by runWithConcurrency so each concurrent task gets its own validator instance) or serialize calls to validator.validateString by guarding that call with an application-level mutex/semaphore/queue so only one validateString executes at a time; update the code paths that reference the cached instance (the variable holding the HtmlValidate and the validator.validateString invocation) to follow the chosen approach.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@README.md`:
- Around line 159-170: The checkLinks documentation is missing the callback
options present in the implementation; update the README `checkLinks` section so
it lists `options.onProgress` and `options.onStart` (same as `checkLinksString`)
with their brief descriptions ("Called after each URL is checked" and "Called
once with the total number of URLs to check") to match the `checkLinks`
implementation and ensure consistency with `checkLinksString`.
In `@src/adapters/check-links.js`:
- Around line 15-16: The RE_SRCSET regex currently allows mismatched quotes;
replace RE_SRCSET with a pattern that enforces matching quotes (e.g., use an
alternation that captures either double-quoted or single-quoted values) and
update the extraction logic that reads RE_SRCSET matches to check both capture
groups (the double-quote group and the single-quote group) when retrieving the
srcset value; specifically modify the symbol RE_SRCSET and the code that
iterates/execs its matches so it reads match[1] || match[2] instead of assuming
a single capture.
In `@src/adapters/validate.js`:
- Around line 28-29: The validatorCache Map is unbounded and can grow
indefinitely; replace it with or wrap it in an LRU-capable structure (e.g.,
implement a simple LRU by tracking usage order or use an LRU cache lib) and
enforce a MAX_VALIDATOR_CACHE_SIZE constant so that when adding a new
HtmlValidate instance via validatorCache (the Map currently declared as
validatorCache) you evict the least-recently-used entry once size >
MAX_VALIDATOR_CACHE_SIZE; update all places that read/write validatorCache to
mark entries as recently used (get/set) and ensure HtmlValidate instances are
properly disposed if needed, or alternatively add a clear documented note where
validatorCache is declared that it is intentionally unbounded for small fixed
preset sets.
In `@src/lib/files.js`:
- Around line 72-86: The parallelized walk() calls mutate the shared results
array concurrently causing a race; change walk to use a local array (e.g.,
localResults) for its own findings, push into that instead of the shared
results, and have walk return its localResults; in the caller where subdirs are
awaited, await Promise.all(subdirs) and concatenate/merge the returned arrays
into the shared results (or replace the shared results usage by collecting and
returning the aggregated array) so no concurrent push() happens on the same
array; update references to results and the walk() return type accordingly.
---
Outside diff comments:
In `@src/adapters/validate.js`:
- Around line 36-55: The validator cache has a race because getValidator checks
validatorCache.has(preset) and then performs async initialization; change
caching to store the initialization Promise itself so concurrent callers share
it: inside getValidator, when no cache entry exists, create and set an init
promise that imports 'html-validate', constructs new HtmlValidate({ extends:
[`html-validate:${preset}`] }) and resolves to the instance, then await that
cached promise and return the instance; ensure you handle rejection by removing
the cached promise on error (so future calls can retry). Reference:
getValidator, validatorCache, HtmlValidate.
- Line 81: The cached HtmlValidate instance is reused concurrently by
runWithConcurrency causing potential race conditions when calling
validator.validateString; either instantiate a fresh HtmlValidate per worker
(move HtmlValidate creation into the function run by runWithConcurrency so each
concurrent task gets its own validator instance) or serialize calls to
validator.validateString by guarding that call with an application-level
mutex/semaphore/queue so only one validateString executes at a time; update the
code paths that reference the cached instance (the variable holding the
HtmlValidate and the validator.validateString invocation) to follow the chosen
approach.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 44d64aa9-4b38-4ccb-863f-4d06827d0633
⛔ Files ignored due to path filters (1)
package-lock.jsonis excluded by!**/package-lock.json
📒 Files selected for processing (12)
CHANGELOG.mdREADME.mdbin/hihtml.test.jspackage.jsonsrc/adapters/check-code.jssrc/adapters/check-links.jssrc/adapters/minify.jssrc/adapters/validate.jssrc/index.d.tssrc/index.jssrc/index.types.tssrc/lib/files.js
Updated validator caching to store promises, ensuring concurrent callers share initialization and preventing redundant instantiation. Added error handling to remove failed promises from the cache for consistency. (This commit message was AI-generated.) Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
Updated the `RE_SRCSET` regex to handle alternatives with consistent capture groups. Adjusted the parsing logic to ensure compatibility with the updated regex, preventing potential parsing errors. (This commit message was AI-generated.) Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
Added a comment explaining the unbounded nature of `validatorCache`, noting that it will only contain a small, fixed number of entries due to the limited set of presets exposed by `html-validate`. This enhances code readability and maintainability. (This commit message was AI-generated.) Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/adapters/check-links.js`:
- Around line 15-16: The attribute regexes RE_ATTR and RE_SRCSET don't allow
spaces around the '=' so attributes like `href = "..."` or `srcset = '...'` are
missed; update both patterns to permit optional whitespace by inserting \s*
between the attribute group and the '=' and also between '=' and the value
capture (i.e., change from (?:href|src|action)=... to
(?:href|src|action)\s*=\s*... and likewise for srcset) so the expressions match
attributes with or without spaces around '='.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: f6b96888-5b1b-4555-b6e8-52e61573171f
📒 Files selected for processing (4)
README.mdpackage.jsonsrc/adapters/check-links.jssrc/adapters/validate.js
Updated the `RE_ATTR` and `RE_SRCSET` regexes to allow spaces around the `=` character for attributes like `href`, `src`, and `srcset`. Added corresponding test cases to ensure proper functionality and accuracy. (This commit message was AI-generated.) Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/adapters/check-links.js`:
- Around line 176-180: The ignore entries are stored verbatim so hostnames like
"EXAMPLE.COM" or values with surrounding whitespace won't match URL.hostname
(which is lowercased); update buildIgnoreList to normalize entries by trimming
whitespace and lowercasing hostnames and prefixes (and for prefixes also
normalize and possibly remove redundant trailing slashes) before putting them
into hostnames Set and prefixes array, and apply the same normalization in the
corresponding isIgnored/usage logic around the second block (lines 189-198) so
comparisons use the same trimmed, lowercased form.
- Around line 15-16: extractUrls currently uses RE_ATTR/RE_SRCSET to regex-scan
the raw HTML string (via matchAll), which picks up href/src/action inside
comments, <script> and <style> and causes false positives; update extractUrls to
parse the HTML into a DOM (or at minimum remove HTML comments and all
<script>/<style> content) before extracting attributes so only real markup nodes
are inspected; specifically, replace the raw-string matchAll approach in
extractUrls (and the logic around RE_ATTR/RE_SRCSET and the countChecked
handling in the block around lines 54–60) with a DOM traversal that reads
href/src/action/srcset attributes from element nodes (or strip
comments/scripts/styles first) and then parse srcset tokens, ensuring commented
or script/style text is ignored.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4331e872-38ad-408e-80e2-7df05308c5a4
📒 Files selected for processing (2)
bin/hihtml.test.jssrc/adapters/check-links.js
Updated URL extraction logic to disregard URLs inside HTML comments, `<script>` and `<style>` elements while still processing `<script src>`. Enhanced ignore list handling by normalizing entries for case-insensitivity and trimming trailing slashes. Added corresponding test cases to validate these improvements. (This commit message was AI-generated.) Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
Renamed type definitions across adapters for clearer naming and alignment with the overall naming conventions. This refactor ensures consistency and improves code readability across the project. (This commit message was AI-generated.) Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/index.d.ts (1)
4-79:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd backward-compatible type aliases for renamed exports.
The public type exports have been renamed without compatibility aliases:
MessageValidation(wasValidationMessage)ResultCodeValidation(wasValidationResult)ResultCode(wasCheckResult)ResultLinks(wasLinkCheckResult)ResultMinification(wasMinificationResult)HihtmlConfig(wasHiHTMLConfig)Consumers importing the old names will experience breaking changes. Add type aliases for at least one release cycle to ease migration:
+// Backward-compatible aliases (deprecated) +export type ValidationMessage = MessageValidation; +export type ValidationResult = ResultCodeValidation; +export type DeprecationResult = ResultCodeDeprecation; +export type CheckResult = ResultCode; +export type LinkCheckResult = ResultLinks; +export type MinificationResult = ResultMinification; +export type HiHTMLConfig = HihtmlConfig;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/index.d.ts` around lines 4 - 79, The public type names were renamed but no backward-compatible aliases were provided; add exported type aliases in this declaration file mapping the old names to the new ones (e.g. export type ValidationMessage = MessageValidation; export type ValidationResult = ResultCodeValidation; export type CheckResult = ResultCode; export type LinkCheckResult = ResultLinks; export type MinificationResult = ResultMinification; export type HiHTMLConfig = HihtmlConfig) so consumers importing the legacy names continue to work for one release cycle—place these aliases alongside the existing interfaces in src/index.d.ts.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/adapters/check-links.js`:
- Around line 185-200: The prefix matching in isIgnored is too broad because it
compares the full lowercased URL string with prefix strings using startsWith;
update isIgnored to parse the URL (new URL(urlLower)) and match prefixes against
the URL pathname for path-only prefixes (ensure prefixes that start with '/' are
normalized and compared to parsed.pathname) and for prefixes that include a
hostname/origin (e.g., "https://example.com/foo") compare against parsed.origin
+ parsed.pathname or parsed.href as appropriate; keep existing hostnames Set
handling unchanged and normalize trailing slashes before comparison to avoid
false matches like "example.com.evil" or "/docs2".
In `@src/adapters/minify.js`:
- Around line 29-46: The cache is populated only after the async import and
preset resolution in loadMinifier, allowing concurrent callers to duplicate
work; fix by storing a placeholder Promise result in minifierCache immediately
(e.g., set minifierCache.set(preset, initPromise) before awaiting
import('html-minifier-next')) so subsequent callers await the same Promise, and
then resolve that Promise with the final { htmlMinify, presetOptions } (using
the existing getPreset/htmlMinify logic) or replace the placeholder with the
resolved value to avoid duplicate initializations.
---
Outside diff comments:
In `@src/index.d.ts`:
- Around line 4-79: The public type names were renamed but no
backward-compatible aliases were provided; add exported type aliases in this
declaration file mapping the old names to the new ones (e.g. export type
ValidationMessage = MessageValidation; export type ValidationResult =
ResultCodeValidation; export type CheckResult = ResultCode; export type
LinkCheckResult = ResultLinks; export type MinificationResult =
ResultMinification; export type HiHTMLConfig = HihtmlConfig) so consumers
importing the legacy names continue to work for one release cycle—place these
aliases alongside the existing interfaces in src/index.d.ts.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: cda635d5-3db8-4c28-9329-4483839f004e
📒 Files selected for processing (12)
CHANGELOG.mdREADME.mdSECURITY.mdbin/hihtml.test.jssrc/adapters/check-code.jssrc/adapters/check-links.jssrc/adapters/minify.jssrc/adapters/validate.jssrc/index.d.tssrc/index.types.tssrc/lib/config.jssrc/lib/output.js
Updated `minifierCache` to store asynchronous promises for lazy initialization of minifiers. This improves efficiency and ensures proper async handling during minifier loading. (This commit message was AI-generated.) Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
Updated the `isIgnored` function to enhance URL parsing and prefix matching by using `new URL` consistently. Fixed edge cases related to path handling, ensuring proper comparison for ignored URLs. (This commit message was AI-generated.) Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
Signed-off-by: Jens Oliver Meiert <jens@meiert.com>
Summary by CodeRabbit
New Features
Enhancements
Performance Improvements
Documentation
Tests