Website | Docs | crates.io | GitHub
A fast, open-source web retrieval engine built in Rust. Fetches pages via lightweight HTTP with automatic fallback to headless Chromium for JavaScript-heavy sites. Returns clean, LLM-ready Markdown.
91% quality win rate against Firecrawl across 35 real-world URLs, judged by LLM evaluation. 1.5x faster on average.
cargo install essence-engine- Two-tier rendering -- lightweight HTTP fetch for most pages, automatic Chromium fallback for SPAs and JS-heavy sites
- LLM-optimized output -- clean Markdown with noise removal, heading hierarchy, code preservation, and structured metadata
- Structured extraction -- pull typed JSON from pages using CSS selectors or LLM-based extraction with schema validation
- 6 REST endpoints + MCP server -- scrape, crawl, map, search, extract, llms.txt -- usable by humans and AI agents alike
- OpenAPI spec -- auto-generated at
/api/docs/openapi.jsonfor SDK generation and tooling - Official SDKs -- Python and TypeScript client libraries
- Self-hosted & open-source -- no API keys, no rate limits, no vendor lock-in
- Respectful crawling -- robots.txt compliance, per-domain rate limits, configurable politeness
- Fast -- median response under 1 second for HTTP-rendered pages
# macOS
brew install chromium
# Ubuntu/Debian
sudo apt install chromium-browser
# Or skip Chromium -- the HTTP engine handles most pages without a browsergit clone https://github.com/ruchit-p/essence.git
cd essence/backend
cp .env.example .env
cargo build --release
cargo run --release
# Server starts at http://localhost:8080docker-compose up -d
# Service available at http://localhost:8080# Scrape a page to Markdown
curl -s -X POST http://localhost:8080/api/v1/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "formats": ["markdown"]}' | jq .
# Discover all URLs on a site
curl -s -X POST http://localhost:8080/api/v1/map \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}' | jq .
# Crawl a site (follow links, respect robots.txt)
curl -s -X POST http://localhost:8080/api/v1/crawl \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "maxDepth": 2, "limit": 10}' | jq .
# Search the web
curl -s -X POST http://localhost:8080/api/v1/search \
-H "Content-Type: application/json" \
-d '{"query": "rust web scraping", "limit": 5}' | jq .
# Extract structured data
curl -s -X POST http://localhost:8080/api/v1/extract \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"], "mode": "css", "selectors": {"title": "h1", "links": "a"}, "schema": {"properties": {"title": {"type": "string"}, "links": {"type": "array"}}}}' | jq .All endpoints accept JSON POST requests at /api/v1/.
Single-page capture with automatic engine selection.
Request:
| Field | Type | Default | Description |
|---|---|---|---|
url |
string | required | URL to scrape |
formats |
string[] | ["markdown"] |
Output formats: "markdown", "html", "links", "metadata" |
engine |
string | "auto" |
Rendering engine: "auto", "http", or "browser" |
timeout |
int | 30000 |
Timeout in milliseconds |
headers |
object | none | Custom HTTP headers |
onlyMainContent |
bool | true |
Extract only the main content area |
Response:
{
"success": true,
"data": {
"markdown": "# Page Title\n\nPage content in clean Markdown...",
"metadata": {
"title": "Page Title",
"description": "Page description",
"language": "en",
"url": "https://example.com",
"statusCode": 200,
"wordCount": 1234,
"readingTime": 5
},
"links": ["https://example.com/about", "https://example.com/contact"],
"images": ["https://example.com/hero.jpg"]
}
}Multi-page traversal with dedup, robots.txt compliance, and rate limiting.
Request:
| Field | Type | Default | Description |
|---|---|---|---|
url |
string | required | Starting URL |
maxDepth |
int | 2 |
Maximum link depth to follow |
limit |
int | 100 |
Maximum pages to crawl |
includePaths |
string[] | none | Glob patterns to include (e.g. ["/blog/*"]) |
excludePaths |
string[] | none | Glob patterns to exclude (e.g. ["/admin/*"]) |
allowBackwardLinks |
bool | false |
Follow links up the URL hierarchy |
allowExternalLinks |
bool | false |
Follow links to other domains |
detectPagination |
bool | true |
Automatically follow paginated content |
Response:
{
"success": true,
"data": [
{
"markdown": "# Home\n\nWelcome...",
"metadata": { "title": "Home", "url": "https://example.com", "statusCode": 200 }
},
{
"markdown": "# About\n\nAbout us...",
"metadata": { "title": "About", "url": "https://example.com/about", "statusCode": 200 }
}
]
}URL discovery via sitemaps and in-page link extraction.
Request:
| Field | Type | Default | Description |
|---|---|---|---|
url |
string | required | Site URL to discover links from |
includeSubdomains |
bool | true |
Include URLs from subdomains |
limit |
int | 5000 |
Maximum URLs to return |
search |
string | none | Filter URLs by search query |
ignoreSitemap |
bool | false |
Skip sitemap.xml discovery |
Response:
{
"success": true,
"links": [
"https://example.com/",
"https://example.com/about",
"https://example.com/blog",
"https://example.com/blog/post-1"
]
}Web search via DuckDuckGo with optional scraping of result pages.
Request:
| Field | Type | Default | Description |
|---|---|---|---|
query |
string | required | Search query |
limit |
int | 10 |
Number of search results |
scrapeResults |
bool | false |
Scrape full content of each result |
Response:
{
"success": true,
"data": [
{
"title": "Web Scraping in Rust",
"url": "https://example.com/rust-scraping",
"snippet": "A guide to web scraping with Rust..."
}
]
}Structured data extraction from web pages using CSS selectors, LLM-based extraction, or both.
Request:
| Field | Type | Default | Description |
|---|---|---|---|
urls |
string[] | required | URLs to extract from (1-10) |
schema |
object | none | JSON Schema for output structure |
selectors |
object | none | CSS selector → field name mappings |
prompt |
string | none | Natural language extraction instruction |
mode |
string | "auto" |
"auto", "css", or "llm" |
llmBaseUrl |
string | none | OpenAI-compatible API URL (required for "llm" mode) |
llmModel |
string | "gpt-4o-mini" |
LLM model name |
llmApiKey |
string | none | LLM API key |
Response:
{
"success": true,
"data": [
{
"title": "Example Domain",
"links": ["https://www.iana.org/domains/example"]
}
]
}Streaming variant of the crawl endpoint. Returns results via Server-Sent Events (SSE) as pages are crawled, rather than waiting for the full crawl to complete. Accepts the same parameters as /api/v1/crawl.
Auto-generated OpenAPI 3.1 specification for all endpoints. Use with Swagger UI, Postman, or SDK generators.
Health check endpoint. Returns 200 OK when the server is running. Used by Docker's HEALTHCHECK.
Essence includes a built-in Model Context Protocol server, allowing AI agents (Claude, Cursor, Windsurf, etc.) to use scrape, crawl, map, search, extract, and llmstxt as tools directly.
The MCP endpoint is available at http://localhost:8080/mcp using the Streamable HTTP transport. Start the Essence server first, then connect your MCP client.
| Tool | Description |
|---|---|
scrape |
Fetch a single page and return Markdown content. Params: url (required), formats, engine, timeout_ms |
crawl |
Crawl a website up to a given depth and page limit. Params: url (required), max_depth, limit, include_paths, exclude_paths |
map |
Discover URLs on a site via sitemaps and link extraction. Params: url (required), search, limit, include_subdomains |
search |
Search the web and optionally scrape results. Params: query (required), limit, scrape_results |
extract |
Extract structured data from pages. Params: urls (required), schema, selectors, prompt, mode |
llmstxt |
Generate llms.txt from a website. Params: url (required), max_urls, llm_base_url, show_full_text |
Note: The MCP server uses
timeout_mswhile the REST API usestimeout. Both accept milliseconds.
Add to your project's .mcp.json (or global ~/.claude/.mcp.json):
{
"mcpServers": {
"essence": {
"type": "http",
"url": "http://localhost:8080/mcp"
}
}
}Then use Essence tools directly in Claude Code conversations -- Claude will call scrape, crawl, map, search, extract, and llmstxt as needed.
Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):
{
"mcpServers": {
"essence": {
"type": "http",
"url": "http://localhost:8080/mcp"
}
}
}Point your MCP client at http://localhost:8080/mcp using the HTTP transport. No API key required.
Request → HTTP Engine (fast, lightweight)
↓ content density too low?
Browser Engine (Chromium CDP)
↓
Markdown Output (clean, LLM-ready)
- HTTP Engine (default) -- lightweight fetch via reqwest. Used when the HTML contains sufficient content density.
- Browser Engine (fallback) -- full Chromium automation via CDP. Activated for SPAs, anti-bot pages, and JavaScript-rendered content.
Auto-detection triggers browser fallback based on:
- Content density analysis (too little text in HTML)
- Hydration markers (
__NEXT_DATA__,__NUXT__, etc.) - Meta-refresh redirects
- Anti-fetch response headers
backend/src/
api/ # Endpoint handlers (scrape, crawl, map, search)
engines/ # HTTP and Browser rendering engines
format/ # Markdown and metadata output formatters
crawler/ # URL frontier, robots.txt, rate limiting, sitemaps, pagination
search/ # DuckDuckGo HTML search integration
cache/ # In-memory caching (moka)
rate_limit/ # Per-domain rate limiting (governor)
utils/ # DNS cache, URL normalization, robots.txt parsing
mcp.rs # MCP server (Model Context Protocol)
types.rs # Request/response types
error.rs # Error types
config.rs # Configuration
main.rs # Server startup with graceful shutdown
Copy backend/.env.example to backend/.env and customize:
| Variable | Default | Description |
|---|---|---|
PORT |
8080 |
Server port |
RUST_LOG |
essence=info |
Log level (debug, info, warn, error) |
BROWSER_HEADLESS |
true |
Run Chromium in headless mode |
BROWSER_POOL_SIZE |
5 |
Number of browser instances in pool |
BROWSER_TIMEOUT_MS |
30000 |
Browser page timeout |
ENGINE_WATERFALL_ENABLED |
true |
Enable HTTP-to-browser fallback |
ENGINE_WATERFALL_DELAY_MS |
5000 |
Delay before browser fallback |
CRAWL_RATE_LIMIT_PER_SEC |
2 |
Crawl rate limit per domain |
MAX_CONCURRENT_REQUESTS |
10 |
Max concurrent crawl requests |
MAX_PARALLEL_SCRAPES |
5 |
Parallel scrapes for search results |
MAX_REQUEST_SIZE_MB |
1 |
Maximum request body size |
CRAWL_MAX_DURATION_SEC |
300 |
Maximum crawl duration in seconds |
RETRY_MAX_ATTEMPTS |
3 |
Max retry attempts for failed requests |
RETRY_INITIAL_DELAY_MS |
500 |
Initial retry delay |
RETRY_MAX_DELAY_MS |
30000 |
Maximum retry delay |
cd backend
# Unit tests (377 tests, no network required)
cargo test --lib
# Lint
cargo clippy -- -D warnings
# Integration tests (requires network)
cargo test --test integration -- --ignored
# Browser engine tests (requires Chromium)
cargo test --test browser_engine_tests
# Compile all tests without running
cargo test --no-runEssence is benchmarked head-to-head against Firecrawl across 35 real-world URLs spanning 7 categories, evaluated by an LLM judge on 5 quality dimensions.
| Metric | Essence | Target | Status |
|---|---|---|---|
| Quality Win Rate (LLM Judge) | 91.2% (31-2-1) | >= 70% | Exceeded |
| Speed Win Rate | 74% (26-9-0) | >= 50% | Exceeded |
| Success Rate | 100% | 100% | Met |
| Category | Essence Wins | Win Rate |
|---|---|---|
| Structured (Wikipedia, APIs) | 5/5 | 100% |
| News (BBC, Reuters, Ars) | 5/5 | 100% |
| Reference (HN, arXiv, dev.to) | 5/5 | 100% |
| Content (essays, blogs) | 4/5 | 80% |
| Dynamic (React, Next.js, Vercel) | 4/5 | 80% |
| Docs (Rust, Python, Go, MDN) | 3/5 | 60% |
| E-Commerce (Newegg, eBay, IKEA) | 3/5 | 60% |
See the full benchmark methodology for detailed results.
See CONTRIBUTING.md for development setup, code style, and PR process.
This repository includes a CLAUDE.md for AI-assisted development.