validate-xml: High-Performance XML Schema Validator

CI License: MIT Rust 1.81+

A fast CLI tool for validating XML files against XML Schemas (XSD), built in Rust with a focus on concurrent processing, schema caching, and low memory overhead.

Validate 20,000 files in seconds with automatic schema caching, concurrent validation, and comprehensive error reporting.


Features

Core Capabilities

  • Concurrent Validation: Uses all available CPU cores for parallel XML/XSD validation
  • Schema Caching: Two-tier caching (L1 memory, L2 disk) prevents redundant downloads
  • Batch Processing: Validate entire directory trees (100,000+ files) without memory exhaustion
  • Output: Text (human-readable) or Compact Summary
  • Smart Error Reporting: Line/column numbers, clear error messages, detailed diagnostics

Performance

  • Pure Rust: Uses xmloxide for XML/XSD validation — no system dependencies or unsafe code
  • Async I/O: Tokio-based async operations for files and HTTP downloads
  • Two-Tier Caching: Schemas are downloaded on first use, held in memory for the current run, and kept in a disk cache for reuse across runs
  • Bounded Memory: Concurrent validation with configurable limits

🏗️ Architecture

  • Hybrid Async/Sync: Async I/O (files, HTTP, caching) + sync CPU-bound validation (xmloxide)
  • True Parallel Validation: No global locks - 10x throughput on multi-core CPUs
  • Parse Once, Validate Many: Schemas are parsed once and shared safely across threads
  • Modular Design: Clean separation of concerns (discovery, loading, validation, reporting)
  • Non-Blocking: CPU-intensive tasks are offloaded to spawn_blocking to keep the async runtime responsive.

Prerequisites

  • Rust: 1.81+ (stable toolchain) with Cargo

Installation

From Source

git clone https://github.com/franklinchen/validate-xml-rust.git
cd validate-xml-rust
cargo install --path .

This installs the validate-xml binary to ~/.cargo/bin. Add ~/.cargo/bin to your $PATH if not already present.


Quick Start

Basic Usage

Validate all XML files in a directory:

# Validate all .xml files (recursive)
validate-xml /path/to/xml/files

# Validate files with custom extensions
validate-xml --extensions xml,xsd /path/to/files

# Validate with verbose output and progress bar
validate-xml --verbose /path/to/files

Schema Override

Validate XML files against a specific XSD schema, even if the XML files don't contain xsi:schemaLocation attributes:

# Validate using an explicit schema file
validate-xml --schema /path/to/schema.xsd /path/to/xml/files

Output Formats

Standard output includes validation status per file (in verbose mode) and a final summary.

Validation Summary:
  Total files: 20000
  Valid: 19950
  Invalid: 50
  Errors: 0
  Skipped: 0
  Success rate: 99.8%
  Duration: 4.20s

Error Message Format

Validation errors are reported with precise location information for easy IDE integration:

path/to/file.xml:42:15: Missing required element 'id'
path/to/file.xml:87:3: Element 'invalid' not allowed here
path/to/file.xml:120:1: Schema error: Could not locate schema resource
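
Because each line follows a `path:line:column: message` layout, a consuming tool can split it mechanically. The following is a minimal sketch of such a consumer, not part of validate-xml itself; the `Diagnostic` struct and `parse_diagnostic` function are hypothetical names introduced for illustration.

```rust
/// One parsed diagnostic line in `path:line:column: message` form.
/// Hypothetical type for illustration; not part of validate-xml.
#[derive(Debug, PartialEq)]
struct Diagnostic {
    path: String,
    line: u32,
    column: u32,
    message: String,
}

/// Splits on the first three colons only, so colons inside the message survive.
/// Limitation: paths that themselves contain ':' (for example Windows drive
/// letters) are not handled by this sketch.
fn parse_diagnostic(input: &str) -> Option<Diagnostic> {
    let mut parts = input.splitn(4, ':');
    let path = parts.next()?;
    let line = parts.next()?.trim().parse().ok()?;
    let column = parts.next()?.trim().parse().ok()?;
    let message = parts.next()?.trim();
    if path.is_empty() {
        return None;
    }
    Some(Diagnostic {
        path: path.to_string(),
        line,
        column,
        message: message.to_string(),
    })
}

fn main() {
    let raw = "path/to/file.xml:120:1: Schema error: Could not locate schema resource";
    match parse_diagnostic(raw) {
        Some(d) => println!("{} at {}:{} -> {}", d.path, d.line, d.column, d.message),
        None => eprintln!("not a diagnostic line: {raw}"),
    }
}
```

Lines that do not match the layout, such as the summary block, yield `None` and can be passed through unchanged.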

Remote Schema Example

A practical way to test remote schema validation is with an Apache Maven POM file, which references a remotely hosted schema. A sample is provided in samples/pom.xml:

# Validate the provided sample which uses a remote Apache Maven schema
validate-xml samples/pom.xml

Command-Line Reference

Basic Syntax

validate-xml [OPTIONS] <DIRECTORY>

Options

Option               Default            Description
--extensions <EXT>   xml                XML file extensions to match (comma-separated)
--threads <N>        CPU cores          Max concurrent validation threads
--cache-dir <PATH>   Platform specific  Schema cache directory
--cache-ttl <HOURS>  24                 Schema cache TTL in hours
--verbose            -                  Show detailed output
--quiet              -                  Suppress non-error output
--progress           Auto               Show progress bar
--schema <PATH>      -                  Validate against a specific XSD (overrides schema references in XML)
--fail-fast          -                  Stop validation on first error
--help               -                  Show help message
--version            -                  Show version information

Exit Codes

Code  Meaning
0     All files valid
1     Configuration or CLI error
2     Errors occurred during validation (system/network)
3     Invalid files found (schema violations)
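
A wrapper script or CI step can branch on these codes. The sketch below maps each code to the meaning listed above; the function names and the retry policy are assumptions made for illustration, not behavior documented by validate-xml.

```rust
/// Maps an exit code to the meaning given in the table above.
fn exit_code_meaning(code: i32) -> &'static str {
    match code {
        0 => "All files valid",
        1 => "Configuration or CLI error",
        2 => "Errors occurred during validation (system/network)",
        3 => "Invalid files found (schema violations)",
        _ => "Unknown exit code (not listed in the table)",
    }
}

/// Hypothetical wrapper policy (an assumption, not tool behavior):
/// only system/network failures are treated as worth retrying, since
/// schema violations and CLI mistakes will not fix themselves.
fn should_retry(code: i32) -> bool {
    code == 2
}

fn main() {
    for code in [0, 1, 2, 3, 42] {
        println!(
            "{code}: {} (retry: {})",
            exit_code_meaning(code),
            should_retry(code)
        );
    }
}
```

In a real wrapper the code would typically come from `std::process::Command::status()`, whose `ExitStatus::code()` returns an `Option<i32>`.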

How It Works

Architecture

The validator consists of four main components:

1. File Discovery

  • Recursively traverses directory tree and filters by extension.
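
The discovery step can be pictured with the standard library alone. This is a simplified sketch, not the project's implementation: the `discover` and `has_extension` names are hypothetical, matching is case-insensitive by assumption, and symbolic links are skipped.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collects regular files under `root` whose extension matches
/// one of `extensions`. Uses an explicit stack instead of recursion.
fn discover(root: &Path, extensions: &[&str]) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    let mut pending = vec![root.to_path_buf()];
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            let path = entry.path();
            // file_type() does not follow symlinks, so links are ignored here.
            let file_type = entry.file_type()?;
            if file_type.is_dir() {
                pending.push(path);
            } else if file_type.is_file() && has_extension(&path, extensions) {
                found.push(path);
            }
        }
    }
    found.sort(); // deterministic order for reporting
    Ok(found)
}

/// Case-insensitive extension match (an assumption of this sketch).
fn has_extension(path: &Path, extensions: &[&str]) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| extensions.iter().any(|want| want.eq_ignore_ascii_case(ext)))
        .unwrap_or(false)
}

fn main() -> io::Result<()> {
    let root = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    for path in discover(Path::new(&root), &["xml"])? {
        println!("{}", path.display());
    }
    Ok(())
}
```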

2. Schema Loading

  • Extracts schema URLs (xsi:schemaLocation, xsi:noNamespaceSchemaLocation).
  • Downloads remote schemas (HTTP/HTTPS) and caches raw bytes to memory and disk.
  • Parse Once: Parsed schema structures are cached in memory and shared safely across threads.
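
An `xsi:schemaLocation` value is a whitespace-separated list of namespace/location pairs, whereas `xsi:noNamespaceSchemaLocation` carries a location only. The sketch below shows one way to split the former and to decide whether a location needs downloading; the function names are hypothetical, and the sample value is of the kind commonly found in Maven POM files rather than a quote from samples/pom.xml.

```rust
/// Splits an `xsi:schemaLocation` value into (namespace, location) pairs.
/// Returns `None` when the token count is odd, i.e. a pair is incomplete.
fn parse_schema_location(value: &str) -> Option<Vec<(String, String)>> {
    let tokens: Vec<&str> = value.split_whitespace().collect();
    if tokens.len() % 2 != 0 {
        return None;
    }
    Some(
        tokens
            .chunks(2)
            .map(|pair| (pair[0].to_string(), pair[1].to_string()))
            .collect(),
    )
}

/// True when the location must be fetched over HTTP or HTTPS.
fn is_remote(location: &str) -> bool {
    location.starts_with("http://") || location.starts_with("https://")
}

fn main() {
    // Illustrative value only; not read from any file.
    let value = "http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd";
    match parse_schema_location(value) {
        Some(pairs) => {
            for (namespace, location) in pairs {
                println!("{namespace} -> {location} (remote: {})", is_remote(&location));
            }
        }
        None => eprintln!("malformed schemaLocation: {value}"),
    }
}
```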

3. Concurrent Validation

  • Spawns async tasks bounded by --threads.
  • Heavy CPU tasks (parsing, validation) are offloaded to spawn_blocking.
  • Thread Safety: xmloxide is fully thread-safe, so XML validation runs fully in parallel with no global locks.
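
The idea of bounded parallel validation over a shared, already-parsed schema can be sketched with standard-library threads. Note the substitution: the tool itself uses Tokio tasks and spawn_blocking, while this sketch uses `std::thread::scope` so that it runs without external crates. The `Schema` type and its toy `validate` method are hypothetical stand-ins for xmloxide's types.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical stand-in for a parsed schema; not xmloxide's real type.
struct Schema {
    required_tag: String,
}

impl Schema {
    /// Toy check: a document counts as valid if it contains the required tag.
    fn validate(&self, document: &str) -> bool {
        document.contains(&self.required_tag)
    }
}

/// Validates `documents` on at most `max_workers` threads. The schema is
/// shared by reference, so it is built once and never copied or locked.
fn validate_all(schema: &Schema, documents: &[String], max_workers: usize) -> Vec<bool> {
    let next = AtomicUsize::new(0); // index of the next unclaimed document
    let workers = max_workers.clamp(1, documents.len().max(1));
    let mut results = vec![false; documents.len()];
    thread::scope(|scope| {
        let mut handles = Vec::with_capacity(workers);
        for _ in 0..workers {
            handles.push(scope.spawn(|| {
                let mut local = Vec::new();
                loop {
                    let index = next.fetch_add(1, Ordering::Relaxed);
                    if index >= documents.len() {
                        break;
                    }
                    local.push((index, schema.validate(&documents[index])));
                }
                local
            }));
        }
        for handle in handles {
            for (index, ok) in handle.join().expect("worker panicked") {
                results[index] = ok;
            }
        }
    });
    results
}

fn main() {
    let schema = Schema { required_tag: "<id>".to_string() };
    let documents = vec!["<item><id>1</id></item>".to_string(), "<item/>".to_string()];
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let results = validate_all(&schema, &documents, workers);
    let valid = results.iter().filter(|ok| **ok).count();
    println!("{valid} of {} documents valid", results.len());
}
```

Each worker claims the next index atomically, so the number of validations in flight never exceeds the worker count, which plays the role of --threads in this sketch.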

4. Error Reporting

  • Aggregates and formats errors with line/column information.

Caching Strategy

  • L1 Parsed Cache: In-memory moka cache storing compiled XsdSchema. Ensures we parse any XSD exactly once.
  • L2 Raw Cache: Disk-backed cacache for persistent cross-run storage of schema bytes.
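
The lookup order implied by the two tiers can be sketched as follows. Plain `HashMap`s stand in for moka and cacache here, and `ParsedSchema`, `SchemaCache`, and `get_or_load` are hypothetical names; the real crates' APIs are not shown.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical parsed-schema type standing in for the compiled schema in L1.
#[derive(Debug)]
struct ParsedSchema {
    byte_len: usize,
}

/// L1 holds parsed schemas; L2 holds raw bytes. Both are in-memory maps in
/// this sketch, whereas the real L2 is a disk-backed store.
struct SchemaCache {
    parsed: Mutex<HashMap<String, Arc<ParsedSchema>>>,
    raw: Mutex<HashMap<String, Vec<u8>>>,
}

impl SchemaCache {
    fn new() -> Self {
        SchemaCache { parsed: Mutex::new(HashMap::new()), raw: Mutex::new(HashMap::new()) }
    }

    /// Checks L1, then L2, and calls `fetch` only when both miss.
    fn get_or_load<F>(&self, url: &str, fetch: F) -> Arc<ParsedSchema>
    where
        F: FnOnce(&str) -> Vec<u8>,
    {
        let hit = self.parsed.lock().unwrap().get(url).cloned();
        if let Some(schema) = hit {
            return schema;
        }
        let stored = self.raw.lock().unwrap().get(url).cloned();
        let bytes = match stored {
            Some(bytes) => bytes,
            None => {
                let bytes = fetch(url);
                self.raw.lock().unwrap().insert(url.to_string(), bytes.clone());
                bytes
            }
        };
        // "Parsing" is simulated by recording the byte length.
        let schema = Arc::new(ParsedSchema { byte_len: bytes.len() });
        self.parsed.lock().unwrap().insert(url.to_string(), Arc::clone(&schema));
        schema
    }

    /// Drops L1 only, imitating a fresh process that still has its L2 store.
    fn clear_parsed(&self) {
        self.parsed.lock().unwrap().clear();
    }
}

fn main() {
    let cache = SchemaCache::new();
    let url = "https://example.com/schema.xsd"; // hypothetical URL
    let first = cache.get_or_load(url, |_| b"<xs:schema/>".to_vec());
    let second = cache.get_or_load(url, |_| unreachable!("served from L1"));
    println!("same parsed schema: {}", Arc::ptr_eq(&first, &second));
    cache.clear_parsed();
    let third = cache.get_or_load(url, |_| unreachable!("served from L2"));
    println!("re-parsed from cached bytes: {} bytes", third.byte_len);
}
```

This sketch does not coalesce concurrent misses, so two threads that miss at the same moment could both parse the same schema; a production cache would need to guard against that to uphold the parse-once guarantee.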

Performance Characteristics

Benchmarks (divan)

Micro-benchmarks measuring the core validation engine (on Apple M1 Max):

Operation               Median Time  Throughput
Schema Parsing          ~6.0 µs      166,000/sec
Valid XML Validation    ~17.2 µs     58,000/sec
Invalid XML Validation  ~17.6 µs     56,000/sec

Note: Validation includes reading the XML file and checking against the cached schema.
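
The throughput column is consistent with taking the reciprocal of the median time, that is, one second divided by the per-operation latency. The short sketch below reproduces that arithmetic; the function name is hypothetical, and the result is a single-threaded figure derived from the table, not a new measurement.

```rust
/// Converts a median latency in microseconds into operations per second.
fn ops_per_second(median_micros: f64) -> f64 {
    1_000_000.0 / median_micros
}

fn main() {
    let rows = [
        ("Schema Parsing", 6.0),
        ("Valid XML Validation", 17.2),
        ("Invalid XML Validation", 17.6),
    ];
    for (name, micros) in rows {
        println!("{name}: {:.0} ops/sec", ops_per_second(micros));
    }
}
```

The table's figures appear to be these values rounded down to the nearest thousand.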


Development

Building from Source

# Clone repository
git clone https://github.com/franklinchen/validate-xml-rust.git
cd validate-xml-rust

# Build (release, optimized)
cargo build --release

# Run tests
cargo test

# Run benchmarks
cargo bench

Testing & Quality

Before submitting changes, ensure you run:

  • cargo fmt
  • cargo clippy
  • cargo test

License

MIT License - See LICENSE file for details


Acknowledgements

Google Gemini was used as an aid in improving this project, particularly in streamlining the architecture and test suite.