🕷️ @xcrap/html-parser

A blazing-fast HTML parser for Node.js, powered by Rust and NAPI-RS


@xcrap/html-parser is an experimental HTML parsing library written in Rust, exposed to Node.js through the NAPI-RS framework. It is designed to be fast, lightweight, and to support both CSS selectors and XPath queries — with built-in support for result limits and element nesting.

Although part of the Xcrap scraping ecosystem, this library can be used as a standalone package in any Node.js project.


✨ Features

  • ⚡ Blazing Fast — Core parsing done in Rust; significantly faster than JS-based parsers at instance initialization.
  • 🎯 Dual Query Support — Query elements using both CSS selectors (via scraper) and XPath expressions (via sxd-xpath).
  • 🦥 Lazy Loading — Internal CSS and XPath engines are only initialized when first needed, reducing unnecessary overhead.
  • 🔢 Built-in Limits — Pass a limit option to selectMany to cap the number of returned elements.
  • 🌲 Element Traversal — Navigate nested elements using selectFirst and selectMany directly on HTMLElement instances.
  • 🔒 Type-Safe — Fully typed TypeScript declarations included (index.d.ts).
  • 🖥️ Platform Support — Pre-built native binary currently available for Windows x64 only. Other platforms require compilation from source (see Development).

⚡ Performance

Benchmarks below compare parser initialization speed (instantiation time per file):

@xcrap/html-parser    :  0.246214 ms/file  ±  0.136808  ✅ Fastest
html-parser           : 36.825500 ms/file  ± 28.855100
htmljs-parser         :  0.501577 ms/file  ±  1.210800
html-dom-parser       :  2.180280 ms/file  ±  1.796170
html5parser           :  1.674640 ms/file  ±  1.222790
cheerio               :  8.679980 ms/file  ±  6.328520
parse5                :  4.821180 ms/file  ±  2.668220
htmlparser2           :  1.497390 ms/file  ±  1.398040
htmlparser            : 16.171200 ms/file  ± 109.076000
high5                 :  2.982290 ms/file  ±  1.927480
node-html-parser      :  2.901670 ms/file  ±  1.908040

Benchmarks sourced from the node-html-parser repository.

The performance advantage comes from lazy loading: the internal Html (CSS engine) and Package (XPath engine) instances are only initialized on first use and reused across subsequent calls on the same parser instance.


📦 Installation

Install via your preferred package manager:

# npm
npm install @xcrap/html-parser

# yarn
yarn add @xcrap/html-parser

# pnpm
pnpm add @xcrap/html-parser

Requirements:

  • Node.js >= 18.0.0

Platform support status:

Platform   Architecture   Support
Windows    x64            ✅ Pre-built
macOS      x64            🔧 Build from source
macOS      ARM64          🔧 Build from source
Linux      x64 (GNU)      🔧 Build from source

⚠️ Note: Currently only the Windows x64 binary is pre-built and included in the published package. Users on other platforms must compile the native addon locally — see the Development section for instructions.


🚀 Quick Start

import { HtmlParser, css, xpath } from "@xcrap/html-parser"

const html = `
  <html>
    <body>
      <h1 class="title">Hello World</h1>
      <ul>
        <li class="item">Item 1</li>
        <li class="item">Item 2</li>
        <li class="item">Item 3</li>
      </ul>
    </body>
  </html>
`

const parser = new HtmlParser(html)

// Select a single element using a CSS selector
const heading = parser.selectFirst({ query: css("h1") })
console.log(heading?.text) // "Hello World"

// Select multiple elements and limit results
const items = parser.selectMany({ query: css("li.item"), limit: 2 })
console.log(items.map(el => el.text)) // ["Item 1", "Item 2"]

// Use XPath instead
const firstItem = parser.selectFirst({ query: xpath("//li[@class='item']") })
console.log(firstItem?.text) // "Item 1"

CommonJS is also fully supported via require:

const { parse, css, xpath } = require("@xcrap/html-parser")
const parser = parse(html)

📖 API Reference

HtmlParser / HTMLParser

The main entry point for parsing an HTML string. CSS and XPath engines are lazily initialized on first use and reused across subsequent queries.

Constructor

new HtmlParser(content: string): HtmlParser

Parameter   Type     Description
content     string   The raw HTML string to parse.

Alias: You can also use the parse(content: string) function as a convenience wrapper:

import { parse } from "@xcrap/html-parser"
const parser = parse(html)

selectFirst(options)

Selects the first element matching the given query.

parser.selectFirst(options: SelectFirstOptions): HTMLElement | null

Parameter       Type          Description
options.query   QueryConfig   A query config built with css() or xpath().

Returns HTMLElement | null — null if no element matches.

selectMany(options)

Selects all elements matching the given query.

parser.selectMany(options: SelectManyOptions): HTMLElement[]

Parameter       Type          Description
options.query   QueryConfig   A query config built with css() or xpath().
options.limit   number?       Optional. Maximum number of elements to return. Values <= 0 are ignored (returns all).

Returns HTMLElement[] — an empty array if no matches.
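The documented limit semantics can be sketched as a small helper — values of undefined or <= 0 fall through to "return everything" (this is an illustration of the behavior described above, not the library's actual source):

```javascript
// Sketch of selectMany's documented limit handling (assumption: illustrative
// re-implementation, not the real code).
const applyLimit = (elements, limit) =>
  limit !== undefined && limit > 0 ? elements.slice(0, limit) : elements

console.log(applyLimit(["a", "b", "c"], 2)) // ["a", "b"]
console.log(applyLimit(["a", "b", "c"], 0)) // ["a", "b", "c"] — limit ignored
console.log(applyLimit(["a", "b", "c"]))    // ["a", "b", "c"]
```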


HTMLElement

Represents a matched DOM element. Provides properties and methods to inspect and traverse its content.

Note: HTMLElement instances also support selectFirst and selectMany, allowing scoped queries within a found element.

Properties

Property     Type                              Description
outerHTML    string                            The full HTML of the element, including its opening and closing tags.
innerHTML    string (getter)                   The inner HTML content (children only, excluding the element's own tags).
text         string (getter)                   The concatenated plain-text content of the element and its descendants.
id           string | null (getter)            The element's id attribute, or null if not present.
tagName      string (getter)                   The element's tag name in UPPERCASE (e.g., "DIV", "H1").
className    string (getter)                   The full class attribute string (e.g., "post featured").
classList    string[] (getter)                 An array of individual class names. Empty array if no class.
attributes   Record<string, string> (getter)   All attributes as a key-value object.
firstChild   HTMLElement | null (getter)       The first child element, or null if none.
lastChild    HTMLElement | null (getter)       The last child element, or null if none.
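The relationship between className and classList described in the table can be sketched as a whitespace split (an assumption drawn from the documented behavior, not the library's internals):

```javascript
// Illustrative only: derives a classList-style array from a className-style
// string, per the table above (empty string → empty array).
const classNameToList = (className) =>
  className.trim() === "" ? [] : className.trim().split(/\s+/)

console.log(classNameToList("post featured")) // ["post", "featured"]
console.log(classNameToList(""))              // []
```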

Methods

getAttribute(name)
element.getAttribute(name: string): string | null

Returns the value of the named attribute, or null if the attribute does not exist.

selectFirst(options)
element.selectFirst(options: SelectFirstOptions): HTMLElement | null

Scoped version of HtmlParser.selectFirst. Searches within the current element.

selectMany(options)
element.selectMany(options: SelectManyOptions): HTMLElement[]

Scoped version of HtmlParser.selectMany. Searches within the current element.

toString()
element.toString(): string

Returns the outerHTML string of the element.


css() and xpath()

Helper functions to create typed QueryConfig objects.

css(query: string): QueryConfig
xpath(query: string): QueryConfig

These functions are the recommended way to build query configurations. They ensure the correct query type is set.

import { css, xpath } from "@xcrap/html-parser"

css("article.post")           // → { query: "article.post", type: QueryType.CSS }
xpath("//article[@class]")    // → { query: "//article[@class]", type: QueryType.XPath }
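Since QueryConfig is a plain object, an equivalent config can also be built by hand, assuming the documented const enum values (CSS = 0, XPath = 1). This is a sketch of the shape only; prefer the css()/xpath() helpers in real code:

```javascript
// Hand-built QueryConfig objects matching the helper output shown above.
// Assumption: QueryType.CSS === 0 and QueryType.XPath === 1, per the Types section.
const QueryType = { CSS: 0, XPath: 1 }

const cssConfig = { query: "article.post", type: QueryType.CSS }
const xpathConfig = { query: "//article[@class]", type: QueryType.XPath }

console.log(cssConfig.type)   // 0
console.log(xpathConfig.type) // 1
```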

Types

// Identifies the query engine to use
export declare const enum QueryType {
  CSS   = 0,
  XPath = 1,
}

// Holds a raw query string and its associated engine type
export interface QueryConfig {
  query: string
  type: QueryType
}

// Options for single-element selection
export interface SelectFirstOptions {
  query: QueryConfig
}

// Options for multi-element selection
export interface SelectManyOptions {
  query: QueryConfig
  limit?: number  // <= 0 or undefined means no limit
}

🔍 Usage Examples

CSS Selectors

import { HtmlParser, css } from "@xcrap/html-parser"

const html = `
  <main>
    <article id="post-1" class="post featured" data-author="alice">
      <h2 class="post-title">First Post</h2>
      <p class="excerpt">A short description.</p>
    </article>
    <article id="post-2" class="post" data-author="bob">
      <h2 class="post-title">Second Post</h2>
      <p class="excerpt">Another description.</p>
    </article>
  </main>
`

const parser = new HtmlParser(html)

// Select by tag name
const firstArticle = parser.selectFirst({ query: css("article") })
console.log(firstArticle?.id) // "post-1"

// Select by class
const allPosts = parser.selectMany({ query: css(".post") })
console.log(allPosts.length) // 2

// Select by attribute
const featuredPost = parser.selectFirst({ query: css("[data-author='alice']") })
console.log(featuredPost?.getAttribute("data-author")) // "alice"

// Select with limit
const limited = parser.selectMany({ query: css("article"), limit: 1 })
console.log(limited.length) // 1

XPath Queries

import { HtmlParser, xpath } from "@xcrap/html-parser"

const html = `
  <ul>
    <li class="tag">rust</li>
    <li class="tag">napi</li>
    <li class="tag">nodejs</li>
  </ul>
`

const parser = new HtmlParser(html)

// Select all <li> with class "tag"
const tags = parser.selectMany({ query: xpath("//li[@class='tag']") })
console.log(tags.map(t => t.text)) // ["rust", "napi", "nodejs"]

// Limit XPath results
const limited = parser.selectMany({ query: xpath("//li"), limit: 2 })
console.log(limited.length) // 2

Navigating Nested Elements

import { HtmlParser, css } from "@xcrap/html-parser"

const html = `
  <nav id="main-nav">
    <ul>
      <li><a href="/home">Home</a></li>
      <li><a href="/about">About</a></li>
      <li><a href="/contact">Contact</a></li>
    </ul>
  </nav>
`

const parser = new HtmlParser(html)

// Find the nav, then narrow down inside it
const nav = parser.selectFirst({ query: css("#main-nav") })

if (nav) {
  const links = nav.selectMany({ query: css("a") })
  links.forEach(link => {
    console.log(`${link.text} → ${link.getAttribute("href")}`)
    // "Home → /home"
    // "About → /about"
    // "Contact → /contact"
  })

  // First and last child shortcuts
  console.log(nav.firstChild?.tagName)  // "UL"
  console.log(nav.lastChild?.tagName)   // "UL"
}

Working with Attributes

import { HtmlParser, css } from "@xcrap/html-parser"

const html = `
  <a
    id="cta"
    class="btn btn-primary"
    href="https://example.com"
    target="_blank"
    data-track="click"
  >
    Click here
  </a>
`

const parser = new HtmlParser(html)
const link = parser.selectFirst({ query: css("a") })

if (link) {
  console.log(link.id)                        // "cta"
  console.log(link.tagName)                   // "A"
  console.log(link.className)                 // "btn btn-primary"
  console.log(link.classList)                 // ["btn", "btn-primary"]
  console.log(link.getAttribute("href"))      // "https://example.com"
  console.log(link.getAttribute("target"))    // "_blank"
  console.log(link.getAttribute("missing"))   // null
  console.log(link.attributes)
  // {
  //   id: "cta",
  //   class: "btn btn-primary",
  //   href: "https://example.com",
  //   target: "_blank",
  //   "data-track": "click"
  // }
}

🏗️ Architecture

The library is structured as a native Node.js addon written in Rust, bridged via NAPI-RS.

src/
├── lib.rs             # Crate entry point; exposes the `parse()` function via NAPI
├── parser.rs          # HTMLParser struct — lazy-loads CSS (scraper) and XPath (sxd) engines
├── types.rs           # HTMLElement struct — all DOM properties and methods
├── engines.rs         # Internal: select_first/many by CSS and XPath (pure Rust)
└── query_builders.rs  # css() and xpath() helper functions exposed to JS

Key Design Decisions

  • Lazy Initialization: HTMLParser holds Option<Html> and Option<Package> fields. Each engine is only allocated on first use and reused automatically, so calling selectFirst (CSS) and then selectMany (XPath) on the same parser creates only two parsing passes total — one per engine.

  • Dual Engine: CSS queries use the scraper crate; XPath queries use sxd-xpath with sxd_html for HTML→XML normalization.

  • Zero-copy Approach: Elements are represented by their outerHTML string, avoiding complex lifetime management across the FFI boundary.
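The lazy-initialization pattern described above can be sketched in Rust as an Option field populated on first use via get_or_insert_with. This is an illustrative model, not the crate's actual code; the String engine stands in for scraper::Html:

```rust
// Sketch of the lazy-engine pattern: the expensive parse runs exactly once
// per parser instance, and later calls reuse the cached engine.
struct Parser {
    content: String,
    css_engine: Option<String>, // placeholder for scraper::Html (assumption)
}

impl Parser {
    fn new(content: &str) -> Self {
        Parser { content: content.to_string(), css_engine: None }
    }

    // Builds the engine on the first call, returns the cached one afterwards.
    fn css_engine(&mut self) -> &String {
        let content = self.content.clone();
        self.css_engine.get_or_insert_with(|| {
            // Stand-in for the real parsing pass.
            format!("parsed:{}", content)
        })
    }
}

fn main() {
    let mut p = Parser::new("<html></html>");
    let first = p.css_engine().clone();
    let second = p.css_engine().clone();
    assert_eq!(first, second); // second call reused the cached engine
    println!("ok");
}
```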

Internal Rust Dependencies

Crate          Version   Role
napi           3.0.0     NAPI-RS runtime for Node.js integration
napi-derive    3.0.0     Procedural macros for NAPI bindings
scraper        0.25.0    HTML parsing and CSS selector engine
sxd-document   0.3.2     XML document model (used for XPath)
sxd-xpath      0.4.2     XPath expression evaluator
sxd_html       0.1.2     HTML → sxd document converter

🛠️ Development

Prerequisites

  • Rust (stable toolchain) — Install
  • Node.js >= 18 — Install
  • Yarn >= 4 — npm install -g yarn
  • NAPI-RS CLI — installed automatically via dev dependencies

Setup

# Clone the repository
git clone https://github.com/Xcrap-Cloud/html-parser.git
cd html-parser

# Install Node.js dependencies
yarn install

Building

# Build native addon in release mode
yarn build

# Build in debug mode (faster compilation, slower runtime)
yarn build:debug

The output binary (html-parser.<platform>.node) will be placed in the project root.

Running Tests

yarn test

Tests are written with AVA and located in the __test__/ directory.

Formatting

# Format all (TypeScript/JS, Rust, TOML)
yarn format

# Individual formatters
yarn format:prettier   # Prettier for TS/JS/JSON/YAML/Markdown
yarn format:rs         # cargo fmt for Rust
yarn format:toml       # Taplo for TOML files

Linting

yarn lint   # OXLint for TypeScript/JavaScript files

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository.
  2. Create a branch: git checkout -b feat/your-feature or git checkout -b fix/your-bug.
  3. Make your changes, ensuring all tests pass: yarn test.
  4. Format your code: yarn format.
  5. Commit with a descriptive message: git commit -m "feat: add support for XYZ".
  6. Push your branch: git push origin feat/your-feature.
  7. Open a Pull Request with a clear description of the changes.

Please see CONTRIBUTING.md for detailed guidelines.


📝 License

Distributed under the MIT License.
© Marcuth and contributors.
