🕷️ @xcrap/html-parser

A blazing-fast HTML parser for Node.js, powered by Rust and NAPI-RS


@xcrap/html-parser is an experimental HTML parsing library written in Rust, exposed to Node.js through the NAPI-RS framework. It is designed to be fast, lightweight, and to support both CSS selectors and XPath queries — with built-in support for result limits and element nesting.

Although part of the Xcrap scraping ecosystem, this library can be used as a standalone package in any Node.js project.


✨ Features

  • ⚡ Blazing Fast — Core parsing done in Rust; significantly faster than JS-based parsers at instance initialization.
  • 🎯 Dual Query Support — Query elements using both CSS selectors (via scraper) and XPath expressions (via sxd-xpath).
  • 🦥 Lazy Loading — Internal CSS and XPath engines are only initialized when first needed, reducing unnecessary overhead.
  • 🔢 Built-in Limits — Pass a limit option to selectMany to cap the number of returned elements.
  • 🌲 Element Traversal — Navigate nested elements using selectFirst and selectMany directly on HTMLElement instances.
  • 🔒 Type-Safe — Fully typed TypeScript declarations included (index.d.ts).
  • 🖥️ Platform Support — Pre-built native binary currently available for Windows x64 only. Other platforms require compilation from source (see Development).

⚡ Performance

Benchmarks below compare parser initialization speed (instantiation time per file):

@xcrap/html-parser    :  0.246214 ms/file  ±  0.136808  ✅ Fastest
html-parser           : 36.825500 ms/file  ± 28.855100
htmljs-parser         :  0.501577 ms/file  ±  1.210800
html-dom-parser       :  2.180280 ms/file  ±  1.796170
html5parser           :  1.674640 ms/file  ±  1.222790
cheerio               :  8.679980 ms/file  ±  6.328520
parse5                :  4.821180 ms/file  ±  2.668220
htmlparser2           :  1.497390 ms/file  ±  1.398040
htmlparser            : 16.171200 ms/file  ± 109.076000
high5                 :  2.982290 ms/file  ±  1.927480
node-html-parser      :  2.901670 ms/file  ±  1.908040

Benchmarks sourced from the node-html-parser repository.

The performance advantage comes from lazy loading: the internal Html (CSS engine) and Package (XPath engine) instances are only initialized on first use and reused across subsequent calls on the same parser instance.


📦 Installation

Install via your preferred package manager:

# npm
npm install @xcrap/html-parser

# yarn
yarn add @xcrap/html-parser

# pnpm
pnpm add @xcrap/html-parser

Requirements:

  • Node.js >= 18.0.0

Platform support status:

Platform   Architecture   Support
Windows    x64            ✅ Pre-built
macOS      x64            🔧 Build from source
macOS      ARM64          🔧 Build from source
Linux      x64 (GNU)      🔧 Build from source

⚠️ Note: Currently only the Windows x64 binary is pre-built and included in the published package. Users on other platforms must compile the native addon locally — see the Development section for instructions.


🚀 Quick Start

import { HtmlParser, css, xpath } from "@xcrap/html-parser"

const html = `
  <html>
    <body>
      <h1 class="title">Hello World</h1>
      <ul>
        <li class="item">Item 1</li>
        <li class="item">Item 2</li>
        <li class="item">Item 3</li>
      </ul>
    </body>
  </html>
`

const parser = new HtmlParser(html)

// Select a single element using a CSS selector
const heading = parser.selectFirst({ query: css("h1") })
console.log(heading?.text) // "Hello World"

// Select multiple elements and limit results
const items = parser.selectMany({ query: css("li.item"), limit: 2 })
console.log(items.map(el => el.text)) // ["Item 1", "Item 2"]

// Use XPath instead
const firstItem = parser.selectFirst({ query: xpath("//li[@class='item']") })
console.log(firstItem?.text) // "Item 1"

CommonJS is also fully supported via require:

const { parse, css, xpath } = require("@xcrap/html-parser")
const parser = parse(html)

📖 API Reference

HtmlParser / HTMLParser

The main entry point for parsing an HTML string. CSS and XPath engines are lazily initialized on first use and reused across subsequent queries.

Constructor

new HtmlParser(content: string): HtmlParser

Parameter   Type     Description
content     string   The raw HTML string to parse.

Alias: You can also use the parse(content: string) function as a convenience wrapper:

import { parse } from "@xcrap/html-parser"
const parser = parse(html)

selectFirst(options)

Selects the first element matching the given query.

parser.selectFirst(options: SelectFirstOptions): HTMLElement | null

Parameter       Type          Description
options.query   QueryConfig   A query config built with css() or xpath().

Returns HTMLElement | null — null if no element matches.

selectMany(options)

Selects all elements matching the given query.

parser.selectMany(options: SelectManyOptions): HTMLElement[]

Parameter       Type          Description
options.query   QueryConfig   A query config built with css() or xpath().
options.limit   number?       Optional. Maximum number of elements to return. Values <= 0 are ignored (returns all).

Returns HTMLElement[] — an empty array if no matches.
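The documented limit semantics can be sketched as a small helper — values of undefined or <= 0 fall through to "return everything" (this is an illustration of the behavior described above, not the library's actual source):

```javascript
// Sketch of selectMany's documented limit handling (assumption: illustrative
// re-implementation, not the real code).
const applyLimit = (elements, limit) =>
  limit !== undefined && limit > 0 ? elements.slice(0, limit) : elements

console.log(applyLimit(["a", "b", "c"], 2)) // ["a", "b"]
console.log(applyLimit(["a", "b", "c"], 0)) // ["a", "b", "c"] — limit ignored
console.log(applyLimit(["a", "b", "c"]))    // ["a", "b", "c"]
```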


HTMLElement

Represents a matched DOM element. Provides properties and methods to inspect and traverse its content.

Note: HTMLElement instances also support selectFirst and selectMany, allowing scoped queries within a found element.

Properties

Property     Type                              Description
outerHTML    string                            The full HTML of the element, including its opening and closing tags.
innerHTML    string (getter)                   The inner HTML content (children only, excluding the element's own tags).
text         string (getter)                   The concatenated plain-text content of the element and its descendants.
id           string | null (getter)            The element's id attribute, or null if not present.
tagName      string (getter)                   The element's tag name in UPPERCASE (e.g., "DIV", "H1").
className    string (getter)                   The full class attribute string (e.g., "post featured").
classList    string[] (getter)                 An array of individual class names. Empty array if no class.
attributes   Record<string, string> (getter)   All attributes as a key-value object.
firstChild   HTMLElement | null (getter)       The first child element, or null if none.
lastChild    HTMLElement | null (getter)       The last child element, or null if none.
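The relationship between className and classList described in the table can be sketched as a whitespace split (an assumption drawn from the documented behavior, not the library's internals):

```javascript
// Illustrative only: derives a classList-style array from a className-style
// string, per the table above (empty string → empty array).
const classNameToList = (className) =>
  className.trim() === "" ? [] : className.trim().split(/\s+/)

console.log(classNameToList("post featured")) // ["post", "featured"]
console.log(classNameToList(""))              // []
```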

Methods

getAttribute(name)
element.getAttribute(name: string): string | null

Returns the value of the named attribute, or null if the attribute does not exist.

selectFirst(options)
element.selectFirst(options: SelectFirstOptions): HTMLElement | null

Scoped version of HtmlParser.selectFirst. Searches within the current element.

selectMany(options)
element.selectMany(options: SelectManyOptions): HTMLElement[]

Scoped version of HtmlParser.selectMany. Searches within the current element.

toString()
element.toString(): string

Returns the outerHTML string of the element.


css() and xpath()

Helper functions to create typed QueryConfig objects.

css(query: string): QueryConfig
xpath(query: string): QueryConfig

These functions are the recommended way to build query configurations. They ensure the correct query type is set.

import { css, xpath } from "@xcrap/html-parser"

css("article.post")           // → { query: "article.post", type: QueryType.CSS }
xpath("//article[@class]")    // → { query: "//article[@class]", type: QueryType.XPath }
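Since QueryConfig is a plain object, an equivalent config can also be built by hand, assuming the documented const enum values (CSS = 0, XPath = 1). This is a sketch of the shape only; prefer the css()/xpath() helpers in real code:

```javascript
// Hand-built QueryConfig objects matching the helper output shown above.
// Assumption: QueryType.CSS === 0 and QueryType.XPath === 1, per the Types section.
const QueryType = { CSS: 0, XPath: 1 }

const cssConfig = { query: "article.post", type: QueryType.CSS }
const xpathConfig = { query: "//article[@class]", type: QueryType.XPath }

console.log(cssConfig.type)   // 0
console.log(xpathConfig.type) // 1
```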

Types

// Identifies the query engine to use
export declare const enum QueryType {
  CSS   = 0,
  XPath = 1,
}

// Holds a raw query string and its associated engine type
export interface QueryConfig {
  query: string
  type: QueryType
}

// Options for single-element selection
export interface SelectFirstOptions {
  query: QueryConfig
}

// Options for multi-element selection
export interface SelectManyOptions {
  query: QueryConfig
  limit?: number  // <= 0 or undefined means no limit
}

🔍 Usage Examples

CSS Selectors

import { HtmlParser, css } from "@xcrap/html-parser"

const html = `
  <main>
    <article id="post-1" class="post featured" data-author="alice">
      <h2 class="post-title">First Post</h2>
      <p class="excerpt">A short description.</p>
    </article>
    <article id="post-2" class="post" data-author="bob">
      <h2 class="post-title">Second Post</h2>
      <p class="excerpt">Another description.</p>
    </article>
  </main>
`

const parser = new HtmlParser(html)

// Select by tag name
const firstArticle = parser.selectFirst({ query: css("article") })
console.log(firstArticle?.id) // "post-1"

// Select by class
const allPosts = parser.selectMany({ query: css(".post") })
console.log(allPosts.length) // 2

// Select by attribute
const featuredPost = parser.selectFirst({ query: css("[data-author='alice']") })
console.log(featuredPost?.getAttribute("data-author")) // "alice"

// Select with limit
const limited = parser.selectMany({ query: css("article"), limit: 1 })
console.log(limited.length) // 1

XPath Queries

import { HtmlParser, xpath } from "@xcrap/html-parser"

const html = `
  <ul>
    <li class="tag">rust</li>
    <li class="tag">napi</li>
    <li class="tag">nodejs</li>
  </ul>
`

const parser = new HtmlParser(html)

// Select all <li> with class "tag"
const tags = parser.selectMany({ query: xpath("//li[@class='tag']") })
console.log(tags.map(t => t.text)) // ["rust", "napi", "nodejs"]

// Limit XPath results
const limited = parser.selectMany({ query: xpath("//li"), limit: 2 })
console.log(limited.length) // 2

Navigating Nested Elements

import { HtmlParser, css } from "@xcrap/html-parser"

const html = `
  <nav id="main-nav">
    <ul>
      <li><a href="/home">Home</a></li>
      <li><a href="/about">About</a></li>
      <li><a href="/contact">Contact</a></li>
    </ul>
  </nav>
`

const parser = new HtmlParser(html)

// Find the nav, then narrow down inside it
const nav = parser.selectFirst({ query: css("#main-nav") })

if (nav) {
  const links = nav.selectMany({ query: css("a") })
  links.forEach(link => {
    console.log(`${link.text} → ${link.getAttribute("href")}`)
    // "Home → /home"
    // "About → /about"
    // "Contact → /contact"
  })

  // First and last child shortcuts
  console.log(nav.firstChild?.tagName)  // "UL"
  console.log(nav.lastChild?.tagName)   // "UL"
}

Working with Attributes

import { HtmlParser, css } from "@xcrap/html-parser"

const html = `
  <a
    id="cta"
    class="btn btn-primary"
    href="https://example.com"
    target="_blank"
    data-track="click"
  >
    Click here
  </a>
`

const parser = new HtmlParser(html)
const link = parser.selectFirst({ query: css("a") })

if (link) {
  console.log(link.id)                        // "cta"
  console.log(link.tagName)                   // "A"
  console.log(link.className)                 // "btn btn-primary"
  console.log(link.classList)                 // ["btn", "btn-primary"]
  console.log(link.getAttribute("href"))      // "https://example.com"
  console.log(link.getAttribute("target"))    // "_blank"
  console.log(link.getAttribute("missing"))   // null
  console.log(link.attributes)
  // {
  //   id: "cta",
  //   class: "btn btn-primary",
  //   href: "https://example.com",
  //   target: "_blank",
  //   "data-track": "click"
  // }
}

🏗️ Architecture

The library is structured as a native Node.js addon written in Rust, bridged via NAPI-RS.

src/
├── lib.rs             # Crate entry point; exposes the `parse()` function via NAPI
├── parser.rs          # HTMLParser struct — lazy-loads CSS (scraper) and XPath (sxd) engines
├── types.rs           # HTMLElement struct — all DOM properties and methods
├── engines.rs         # Internal: select_first/many by CSS and XPath (pure Rust)
└── query_builders.rs  # css() and xpath() helper functions exposed to JS

Key Design Decisions

  • Lazy Initialization: HTMLParser holds Option<Html> and Option<Package> fields. Each engine is only allocated on first use and reused automatically, so calling selectFirst (CSS) and then selectMany (XPath) on the same parser creates only two parsing passes total — one per engine.

  • Dual Engine: CSS queries use the scraper crate; XPath queries use sxd-xpath with sxd_html for HTML→XML normalization.

  • Zero-copy Approach: Elements are represented by their outerHTML string, avoiding complex lifetime management across the FFI boundary.
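The lazy-initialization pattern described above can be sketched in Rust as an Option field populated on first use via get_or_insert_with. This is an illustrative model, not the crate's actual code; the String engine stands in for scraper::Html:

```rust
// Sketch of the lazy-engine pattern: the expensive parse runs exactly once
// per parser instance, and later calls reuse the cached engine.
struct Parser {
    content: String,
    css_engine: Option<String>, // placeholder for scraper::Html (assumption)
}

impl Parser {
    fn new(content: &str) -> Self {
        Parser { content: content.to_string(), css_engine: None }
    }

    // Builds the engine on the first call, returns the cached one afterwards.
    fn css_engine(&mut self) -> &String {
        let content = self.content.clone();
        self.css_engine.get_or_insert_with(|| {
            // Stand-in for the real parsing pass.
            format!("parsed:{}", content)
        })
    }
}

fn main() {
    let mut p = Parser::new("<html></html>");
    let first = p.css_engine().clone();
    let second = p.css_engine().clone();
    assert_eq!(first, second); // second call reused the cached engine
    println!("ok");
}
```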

Internal Rust Dependencies

Crate          Version   Role
napi           3.0.0     NAPI-RS runtime for Node.js integration
napi-derive    3.0.0     Procedural macros for NAPI bindings
scraper        0.25.0    HTML parsing and CSS selector engine
sxd-document   0.3.2     XML document model (used for XPath)
sxd-xpath      0.4.2     XPath expression evaluator
sxd_html       0.1.2     HTML → sxd document converter

🛠️ Development

Prerequisites

  • Rust (stable toolchain) — Install
  • Node.js >= 18 — Install
  • Yarn >= 4 — npm install -g yarn
  • NAPI-RS CLI — installed automatically via dev dependencies

Setup

# Clone the repository
git clone https://github.com/Xcrap-Cloud/html-parser.git
cd html-parser

# Install Node.js dependencies
yarn install

Building

# Build native addon in release mode
yarn build

# Build in debug mode (faster compilation, slower runtime)
yarn build:debug

The output binary (html-parser.<platform>.node) will be placed in the project root.

Running Tests

yarn test

Tests are written with AVA and located in the __test__/ directory.

Formatting

# Format all (TypeScript/JS, Rust, TOML)
yarn format

# Individual formatters
yarn format:prettier   # Prettier for TS/JS/JSON/YAML/Markdown
yarn format:rs         # cargo fmt for Rust
yarn format:toml       # Taplo for TOML files

Linting

yarn lint   # OXLint for TypeScript/JavaScript files

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository.
  2. Create a branch: git checkout -b feat/your-feature or git checkout -b fix/your-bug.
  3. Make your changes, ensuring all tests pass: yarn test.
  4. Format your code: yarn format.
  5. Commit with a descriptive message: git commit -m "feat: add support for XYZ".
  6. Push your branch: git push origin feat/your-feature.
  7. Open a Pull Request with a clear description of the changes.

Please see CONTRIBUTING.md for detailed guidelines.


📝 License

Distributed under the MIT License.
© Marcuth and contributors.
