A blazing-fast HTML parser for Node.js, powered by Rust and NAPI-RS
@xcrap/html-parser is an experimental HTML parsing library written in Rust, exposed to Node.js through the NAPI-RS framework. It is designed to be fast, lightweight, and to support both CSS selectors and XPath queries — with built-in support for result limits and element nesting.
Although part of the Xcrap scraping ecosystem, this library can be used as a standalone package in any Node.js project.
- ✨ Features
- ⚡ Performance
- 📦 Installation
- 🚀 Quick Start
- 📖 API Reference
- 🔍 Usage Examples
- 🏗️ Architecture
- 🛠️ Development
- 🤝 Contributing
- 📝 License
- ⚡ Blazing Fast — Core parsing done in Rust; significantly faster than JS-based parsers at instance initialization.
- 🎯 Dual Query Support — Query elements using both CSS selectors (via `scraper`) and XPath expressions (via `sxd-xpath`).
- 🦥 Lazy Loading — Internal CSS and XPath engines are only initialized when first needed, reducing unnecessary overhead.
- 🔢 Built-in Limits — Pass a `limit` option to `selectMany` to cap the number of returned elements.
- 🌲 Element Traversal — Navigate nested elements using `selectFirst` and `selectMany` directly on `HTMLElement` instances.
- 🔒 Type-Safe — Fully typed TypeScript declarations included (`index.d.ts`).
- 🖥️ Platform Support — Pre-built native binary currently available for Windows x64 only. Other platforms require compilation from source (see Development).
Benchmarks below compare parser initialization speed (instantiation time per file):
```
@xcrap/html-parser : 0.246214 ms/file ± 0.136808 ✅ Fastest
html-parser        : 36.825500 ms/file ± 28.855100
htmljs-parser      : 0.501577 ms/file ± 1.210800
html-dom-parser    : 2.180280 ms/file ± 1.796170
html5parser        : 1.674640 ms/file ± 1.222790
cheerio            : 8.679980 ms/file ± 6.328520
parse5             : 4.821180 ms/file ± 2.668220
htmlparser2        : 1.497390 ms/file ± 1.398040
htmlparser         : 16.171200 ms/file ± 109.076000
high5              : 2.982290 ms/file ± 1.927480
node-html-parser   : 2.901670 ms/file ± 1.908040
```
Benchmarks sourced from node-html-parser repository.
The performance advantage comes from lazy loading: the internal Html (CSS engine) and Package (XPath engine) instances are only initialized on first use and reused across subsequent calls on the same parser instance.
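The reuse pattern described above can be sketched in plain TypeScript. This is a hypothetical analogue for illustration only, not the library's actual Rust internals; the class and member names are invented:

```typescript
type EngineKind = "css" | "xpath"

// Hypothetical sketch of the lazy-engine pattern: each engine is built only
// when a query of that kind first arrives, then cached for every later call
// on the same parser instance.
class LazyEngines {
  private engines = new Map<EngineKind, { kind: EngineKind }>()
  parseCount = 0 // counts how many full parsing passes were paid

  constructor(readonly html: string) {}

  query(kind: EngineKind): { kind: EngineKind } {
    let engine = this.engines.get(kind)
    if (!engine) {
      // First use of this engine: pay the parsing cost exactly once.
      this.parseCount++
      engine = { kind }
      this.engines.set(kind, engine)
    }
    // Every later call of the same kind reuses the cached engine.
    return engine
  }
}

const p = new LazyEngines("<p>hi</p>")
p.query("css")
p.query("css")
p.query("xpath")
console.log(p.parseCount) // 2: one parse per engine, however many queries
```

The point of the design is that a parser used only for CSS queries never pays for XPath setup, and vice versa.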
Install via your preferred package manager:
```bash
# npm
npm install @xcrap/html-parser

# yarn
yarn add @xcrap/html-parser

# pnpm
pnpm add @xcrap/html-parser
```

Requirements:
- Node.js >= 18.0.0
Native binaries are pre-built and distributed for the following platforms:
| Platform | Architecture | Support |
|---|---|---|
| Windows | x64 | ✅ Pre-built |
| macOS | x64 | 🔧 Build from source |
| macOS | ARM64 | 🔧 Build from source |
| Linux | x64 (GNU) | 🔧 Build from source |
⚠️ Note: Currently only the Windows x64 binary is pre-built and included in the published package. Users on other platforms must compile the native addon locally — see the Development section for instructions.
```ts
import { HtmlParser, css, xpath } from "@xcrap/html-parser"

const html = `
<html>
    <body>
        <h1 class="title">Hello World</h1>
        <ul>
            <li class="item">Item 1</li>
            <li class="item">Item 2</li>
            <li class="item">Item 3</li>
        </ul>
    </body>
</html>
`

const parser = new HtmlParser(html)

// Select a single element using a CSS selector
const heading = parser.selectFirst({ query: css("h1") })
console.log(heading?.text) // "Hello World"

// Select multiple elements and limit results
const items = parser.selectMany({ query: css("li.item"), limit: 2 })
console.log(items.map(el => el.text)) // ["Item 1", "Item 2"]

// Use XPath instead
const firstItem = parser.selectFirst({ query: xpath("//li[@class='item']") })
console.log(firstItem?.text) // "Item 1"
```

CommonJS is also fully supported via `require`:

```ts
const { parse, css, xpath } = require("@xcrap/html-parser")

const parser = parse(html)
```
The main entry point for parsing an HTML string. CSS and XPath engines are lazily initialized on first use and reused across subsequent queries.
```ts
new HtmlParser(content: string): HtmlParser
```

| Parameter | Type | Description |
|---|---|---|
| `content` | `string` | The raw HTML string to parse. |

Alias: You can also use the `parse(content: string)` function as a convenience wrapper:

```ts
import { parse } from "@xcrap/html-parser"

const parser = parse(html)
```
Selects the first element matching the given query.
```ts
parser.selectFirst(options: SelectFirstOptions): HTMLElement | null
```

| Parameter | Type | Description |
|---|---|---|
| `options.query` | `QueryConfig` | A query config built with `css()` or `xpath()`. |

Returns `HTMLElement | null` — `null` if no element matches.
Selects all elements matching the given query.
```ts
parser.selectMany(options: SelectManyOptions): HTMLElement[]
```

| Parameter | Type | Description |
|---|---|---|
| `options.query` | `QueryConfig` | A query config built with `css()` or `xpath()`. |
| `options.limit` | `number?` | Optional. Maximum number of elements to return. Values `<= 0` are ignored (returns all). |

Returns `HTMLElement[]` — an empty array if no matches.
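The `limit` handling described above can be mirrored in plain TypeScript. This is a hypothetical sketch of the documented semantics, not the library's code; `applyLimit` is an invented helper name:

```typescript
// Hypothetical helper mirroring the documented `limit` semantics:
// undefined or values <= 0 mean "no limit"; positive values cap the result.
function applyLimit<T>(items: T[], limit?: number): T[] {
  if (limit === undefined || limit <= 0) return items
  return items.slice(0, limit)
}

const matches = ["Item 1", "Item 2", "Item 3"]
console.log(applyLimit(matches, 2)) // ["Item 1", "Item 2"]
console.log(applyLimit(matches, 0)) // all three items: limit <= 0 is ignored
console.log(applyLimit(matches)) // all three items: no limit given
```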
Represents a matched DOM element. Provides properties and methods to inspect and traverse its content.
> Note: `HTMLElement` instances also support `selectFirst` and `selectMany`, allowing scoped queries within a found element.
| Property | Type | Description |
|---|---|---|
| `outerHTML` | `string` | The full HTML of the element, including its opening and closing tags. |
| `innerHTML` | `string` (getter) | The inner HTML content (children only, excluding the element's own tags). |
| `text` | `string` (getter) | The concatenated plain-text content of the element and its descendants. |
| `id` | `string \| null` (getter) | The element's `id` attribute, or `null` if not present. |
| `tagName` | `string` (getter) | The element's tag name in uppercase (e.g., `"DIV"`, `"H1"`). |
| `className` | `string` (getter) | The full `class` attribute string (e.g., `"post featured"`). |
| `classList` | `string[]` (getter) | An array of individual class names. Empty array if no class. |
| `attributes` | `Record<string, string>` (getter) | All attributes as a key-value object. |
| `firstChild` | `HTMLElement \| null` (getter) | The first child element, or `null` if none. |
| `lastChild` | `HTMLElement \| null` (getter) | The last child element, or `null` if none. |
```ts
element.getAttribute(name: string): string | null
```

Returns the value of the named attribute, or `null` if the attribute does not exist.

```ts
element.selectFirst(options: SelectFirstOptions): HTMLElement | null
```

Scoped version of `HtmlParser.selectFirst`. Searches within the current element.

```ts
element.selectMany(options: SelectManyOptions): HTMLElement[]
```

Scoped version of `HtmlParser.selectMany`. Searches within the current element.

```ts
element.toString(): string
```

Returns the `outerHTML` string of the element.
Helper functions to create typed QueryConfig objects.
```ts
css(query: string): QueryConfig
xpath(query: string): QueryConfig
```

These functions are the recommended way to build query configurations. They ensure the correct query type is set.

```ts
import { css, xpath } from "@xcrap/html-parser"

css("article.post") // → { query: "article.post", type: QueryType.CSS }
xpath("//article[@class]") // → { query: "//article[@class]", type: QueryType.XPath }
```

```ts
// Identifies the query engine to use
export declare const enum QueryType {
  CSS = 0,
  XPath = 1,
}

// Holds a raw query string and its associated engine type
export interface QueryConfig {
  query: string
  type: QueryType
}

// Options for single-element selection
export interface SelectFirstOptions {
  query: QueryConfig
}

// Options for multi-element selection
export interface SelectManyOptions {
  query: QueryConfig
  limit?: number // <= 0 or undefined means no limit
}
```

```ts
import { HtmlParser, css } from "@xcrap/html-parser"

const html = `
<main>
    <article id="post-1" class="post featured" data-author="alice">
        <h2 class="post-title">First Post</h2>
        <p class="excerpt">A short description.</p>
    </article>
    <article id="post-2" class="post" data-author="bob">
        <h2 class="post-title">Second Post</h2>
        <p class="excerpt">Another description.</p>
    </article>
</main>
`

const parser = new HtmlParser(html)

// Select by tag name
const firstArticle = parser.selectFirst({ query: css("article") })
console.log(firstArticle?.id) // "post-1"

// Select by class
const allPosts = parser.selectMany({ query: css(".post") })
console.log(allPosts.length) // 2

// Select by attribute
const featuredPost = parser.selectFirst({ query: css("[data-author='alice']") })
console.log(featuredPost?.getAttribute("data-author")) // "alice"

// Select with limit
const limited = parser.selectMany({ query: css("article"), limit: 1 })
console.log(limited.length) // 1
```

```ts
import { HtmlParser, xpath } from "@xcrap/html-parser"

const html = `
<ul>
    <li class="tag">rust</li>
    <li class="tag">napi</li>
    <li class="tag">nodejs</li>
</ul>
`

const parser = new HtmlParser(html)

// Select all <li> with class "tag"
const tags = parser.selectMany({ query: xpath("//li[@class='tag']") })
console.log(tags.map(t => t.text)) // ["rust", "napi", "nodejs"]

// Limit XPath results
const limited = parser.selectMany({ query: xpath("//li"), limit: 2 })
console.log(limited.length) // 2
```

```ts
import { HtmlParser, css } from "@xcrap/html-parser"

const html = `
<nav id="main-nav">
    <ul>
        <li><a href="/home">Home</a></li>
        <li><a href="/about">About</a></li>
        <li><a href="/contact">Contact</a></li>
    </ul>
</nav>
`

const parser = new HtmlParser(html)

// Find the nav, then narrow down inside it
const nav = parser.selectFirst({ query: css("#main-nav") })

if (nav) {
    const links = nav.selectMany({ query: css("a") })

    links.forEach(link => {
        console.log(`${link.text} → ${link.getAttribute("href")}`)
        // "Home → /home"
        // "About → /about"
        // "Contact → /contact"
    })

    // First and last child shortcuts
    console.log(nav.firstChild?.tagName) // "UL"
    console.log(nav.lastChild?.tagName) // "UL"
}
```

```ts
import { HtmlParser, css } from "@xcrap/html-parser"

const html = `
<a
    id="cta"
    class="btn btn-primary"
    href="https://example.com"
    target="_blank"
    data-track="click"
>
    Click here
</a>
`

const parser = new HtmlParser(html)
const link = parser.selectFirst({ query: css("a") })

if (link) {
    console.log(link.id) // "cta"
    console.log(link.tagName) // "A"
    console.log(link.className) // "btn btn-primary"
    console.log(link.classList) // ["btn", "btn-primary"]
    console.log(link.getAttribute("href")) // "https://example.com"
    console.log(link.getAttribute("target")) // "_blank"
    console.log(link.getAttribute("missing")) // null
    console.log(link.attributes)
    // {
    //     id: "cta",
    //     class: "btn btn-primary",
    //     href: "https://example.com",
    //     target: "_blank",
    //     "data-track": "click"
    // }
}
```

The library is structured as a native Node.js addon written in Rust, bridged via NAPI-RS.
```
src/
├── lib.rs            # Crate entry point; exposes the parse() function via NAPI
├── parser.rs         # HTMLParser struct — lazy-loads CSS (scraper) and XPath (sxd) engines
├── types.rs          # HTMLElement struct — all DOM properties and methods
├── engines.rs        # Internal: select_first/many by CSS and XPath (pure Rust)
└── query_builders.rs # css() and xpath() helper functions exposed to JS
```
- **Lazy Initialization**: `HTMLParser` holds `Option<Html>` and `Option<Package>` fields. Each engine is only allocated on first use and reused automatically, so calling `selectFirst` (CSS) and then `selectMany` (XPath) on the same parser creates only two parsing passes total — one per engine.
- **Dual Engine**: CSS queries use the `scraper` crate; XPath queries use `sxd-xpath` with `sxd_html` for HTML→XML normalization.
- **Zero-copy Approach**: Elements are represented by their `outerHTML` string, avoiding complex lifetime management across the FFI boundary.
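One way to picture the string-snapshot design is a TypeScript analogue in which each element owns only its `outerHTML` and derives everything else from it on access. This is an invented illustration (the `SnapshotElement` class and its regex are not the library's code):

```typescript
// Hypothetical analogue of the design: the element owns only its outerHTML
// string, so nothing borrowed from the parsed document has to cross the
// FFI boundary or be kept alive by the Rust side.
class SnapshotElement {
  constructor(readonly outerHTML: string) {}

  // Derived from the stored string on each access. A simplistic regex stands
  // in for real parsing; it only reads the leading tag name.
  get tagName(): string {
    const m = /^<\s*([a-zA-Z][\w-]*)/.exec(this.outerHTML)
    return m ? m[1].toUpperCase() : ""
  }

  toString(): string {
    return this.outerHTML
  }
}

const el = new SnapshotElement('<h1 class="title">Hello</h1>')
console.log(el.tagName) // "H1"
console.log(el.toString()) // '<h1 class="title">Hello</h1>'
```

The trade-off is that properties are recomputed per access rather than read from a live DOM node, in exchange for much simpler ownership across the native boundary.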
| Crate | Version | Role |
|---|---|---|
| `napi` | 3.0.0 | NAPI-RS runtime for Node.js integration |
| `napi-derive` | 3.0.0 | Procedural macros for NAPI bindings |
| `scraper` | 0.25.0 | HTML parsing and CSS selector engine |
| `sxd-document` | 0.3.2 | XML document model (used for XPath) |
| `sxd-xpath` | 0.4.2 | XPath expression evaluator |
| `sxd_html` | 0.1.2 | HTML → sxd document converter |
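Assembled from the crates listed above, the dependency section of the crate's `Cargo.toml` would look roughly like the following. This is a sketch reconstructed from the table, not the actual manifest:

```toml
[dependencies]
napi = "3.0.0"
napi-derive = "3.0.0"
scraper = "0.25.0"
sxd-document = "0.3.2"
sxd-xpath = "0.4.2"
sxd_html = "0.1.2"
```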
- Rust (stable toolchain) — Install
- Node.js >= 18 — Install
- Yarn >= 4 — `npm install -g yarn`
- NAPI-RS CLI — installed automatically via dev dependencies
```bash
# Clone the repository
git clone https://github.com/Xcrap-Cloud/html-parser.git
cd html-parser

# Install Node.js dependencies
yarn install
```

```bash
# Build native addon in release mode
yarn build

# Build in debug mode (faster compilation, slower runtime)
yarn build:debug
```

The output binary (`html-parser.<platform>.node`) will be placed in the project root.
```bash
yarn test
```

Tests are written with AVA and located in the `__test__/` directory.
```bash
# Format all (TypeScript/JS, Rust, TOML)
yarn format

# Individual formatters
yarn format:prettier # Prettier for TS/JS/JSON/YAML/Markdown
yarn format:rs       # cargo fmt for Rust
yarn format:toml     # Taplo for TOML files
```

```bash
yarn lint # OXLint for TypeScript/JavaScript files
```

Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a branch: `git checkout -b feat/your-feature` or `git checkout -b fix/your-bug`.
- Make your changes, ensuring all tests pass: `yarn test`.
- Format your code: `yarn format`.
- Commit with a descriptive message: `git commit -m "feat: add support for XYZ"`.
- Push your branch: `git push origin feat/your-feature`.
- Open a Pull Request with a clear description of the changes.
Please see CONTRIBUTING.md for detailed guidelines.
Distributed under the MIT License.
© Marcuth and contributors.