Adapted: HireBase provides a local service for Markdowner, maintained by people who use it to understand content on the internet.

Markdowner v2.0

High-performance URL to markdown converter with multi-engine browser support and intelligent content extraction.

Features

  • Multi-engine support: Choose between Playwright (fast, reliable) and Camoufox (Firefox-based, better anti-bot bypass)
  • Two extraction methods:
    • html - Direct HTTP fetch (fast, for server-side rendered pages)
    • hydration - Full browser rendering with JavaScript execution
  • Intelligent content extraction: a three-tier extraction system:
    • XPath - Direct extraction using user-provided path (fastest)
    • Content Type Profiles - Specialized extraction for 15+ content types (job, blog, docs, etc.)
    • Generic - Per-section evaluation for maximum recall
  • Anti-bot bypass: Configurable stealth levels (none, low, high) with human simulation
  • 30+ conversion rules: Handles alerts, FAQs, cards, tables, code blocks, and more
  • Bright Data proxy integration: Session rotation for IP rotation
  • Dual-layer caching: LRU in-memory + file-based persistence

Quick Start

Using Docker Compose (Recommended)

```bash
# Clone and configure
git clone https://github.com/your-repo/markdowner.git
cd markdowner
cp env.example .env

# Build and run
docker-compose up -d

# Test
curl "http://localhost:3000/convert?url=https://example.com"
```

Local Development

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install dependencies
pip install -r requirements.txt

# Install browser engines
playwright install chromium firefox --with-deps
python -m camoufox fetch  # Optional: for Camoufox engine

# Run the server
python -m uvicorn src.main:app --host 0.0.0.0 --port 3000 --reload
```

API Reference

GET /convert

Convert a URL to markdown.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `url` | string | required | The website URL to convert |
| `method` | `html` \| `hydration` | `hydration` | Extraction method |
| `engine` | `playwright` \| `camoufox` | `playwright` | Browser engine for hydration |
| `enableDetailedResponse` | boolean | `false` | Include full page content (skip extraction) |
| `crawlSubpages` | boolean | `false` | Convert linked subpages (max 10) |
| `useProxy` | boolean | `false` | Use Bright Data proxy |
| `stealthLevel` | `none` \| `low` \| `high` | `none` | Anti-bot bypass level |
| `urlsPostfix` | boolean | `false` | Append all extracted links at the bottom |
| `xpath` | string | `null` | XPath expression for direct content extraction |
| `contentType` | string | `null` | Content type hint for specialized extraction |
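For programmatic use, a request URL can be assembled from these parameters with the standard library. This is a minimal client-side sketch, assuming the default local deployment on port 3000:

```python
from urllib.parse import urlencode

BASE = "http://localhost:3000"  # assumed local deployment

def convert_url(url: str, **params: str) -> str:
    """Build a /convert request URL; parameter names follow the table above."""
    query = {"url": url, **params}
    return f"{BASE}/convert?{urlencode(query)}"

print(convert_url("https://example.com", method="html", stealthLevel="low"))
# http://localhost:3000/convert?url=https%3A%2F%2Fexample.com&method=html&stealthLevel=low
```

`urlencode` handles percent-encoding of the target URL, which is easy to get wrong when concatenating query strings by hand.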

Content Types

When you know the type of page you're extracting, use contentType for better results:

| Type | Description | Best For |
| --- | --- | --- |
| `job` | Job postings | LinkedIn, Greenhouse, Lever, Gusto job boards |
| `blog` | Blog posts, articles | Medium, Dev.to, personal blogs |
| `news` | News articles | News sites, journalism |
| `docs` | Technical documentation | MDN, ReadTheDocs, GitHub docs |
| `product` | E-commerce products | Amazon, Shopify stores |
| `recipe` | Cooking recipes | AllRecipes, food blogs |
| `forum` | Forum threads, Q&A | Reddit, StackOverflow, Discourse |
| `wiki` | Wiki pages | Wikipedia, MediaWiki sites |
| `profile` | User profiles | About pages, team pages |
| `listing` | Search results | Directories, catalogs |
| `event` | Event pages | Conferences, meetups |
| `research` | Academic papers | Research articles |
| `legal` | Legal documents | Terms of service, privacy policies |
| `changelog` | Release notes | Version history |
| `landing` | Marketing pages | Landing pages |

Stealth Levels

| Level | Description | Use Case |
| --- | --- | --- |
| `none` | Basic stealth (default) | Most websites |
| `low` | Stealth plugin + basic evasion | Sites with light bot detection |
| `high` | Full human simulation (15-20s) | Cloudflare Turnstile, aggressive anti-bot |

Examples:

```bash
# Basic conversion with JavaScript rendering
curl "http://localhost:3000/convert?url=https://example.com"

# Fast SSR conversion (no browser)
curl "http://localhost:3000/convert?url=https://example.com&method=html"

# Use Camoufox for anti-bot bypass
curl "http://localhost:3000/convert?url=https://example.com&engine=camoufox"

# High stealth for Cloudflare-protected sites
curl "http://localhost:3000/convert?url=https://example.com&engine=camoufox&stealthLevel=high"

# Extract job posting with specialized profile
curl "http://localhost:3000/convert?url=https://jobs.example.com/posting&contentType=job"

# Extract documentation
curl "http://localhost:3000/convert?url=https://docs.example.com/api&contentType=docs"

# Direct XPath extraction (fastest when you know the structure)
curl "http://localhost:3000/convert?url=https://example.com&xpath=//div[@class='content']"

# Get all links appended at the bottom
curl "http://localhost:3000/convert?url=https://example.com&urlsPostfix=true"

# JSON response
curl -H "Accept: application/json" \
  "http://localhost:3000/convert?url=https://example.com"

# With proxy
curl "http://localhost:3000/convert?url=https://example.com&useProxy=true"
```

Other Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| `/` | GET | API documentation |
| `/health` | GET | Health check |
| `/metrics` | GET | Engine and cache metrics |
| `/cache/stats` | GET | Cache statistics |
| `/cache` | DELETE | Clear all cached entries |

Content Extraction

Markdowner uses a three-tier extraction system to intelligently find and extract main content:

1. XPath Mode (Fastest)

When you provide an xpath parameter, content is extracted directly without any analysis:

```bash
curl "http://localhost:3000/convert?url=https://example.com&xpath=//article"
```
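The idea can be sketched with the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset; a production extractor would typically use a full XPath 1.0 engine such as lxml, and this is not Markdowner's actual code:

```python
import xml.etree.ElementTree as ET

html = """
<html><body>
  <nav>menu</nav>
  <article><h1>Title</h1><p>Main content.</p></article>
</body></html>
"""

root = ET.fromstring(html)
# The user-supplied expression, in ElementTree's relative form.
matches = root.findall(".//article")
print(matches[0].find("p").text)  # Main content.
```

Because the path points straight at the target node, no scoring or content analysis runs at all, which is why this mode is the fastest.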

2. Content Type Profiles

When you provide a contentType, specialized selectors and scoring weights are used:

  • Priority selectors: Common CSS selectors for that content type (tried first)
  • Weighted scoring: Features like lists, paragraphs, and code blocks are weighted differently per type
  • Example: Job postings weight lists higher (for requirements), docs weight code blocks higher
```bash
curl "http://localhost:3000/convert?url=https://jobs.example.com&contentType=job"
```
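The weighted-scoring idea can be illustrated as follows. The weight values and feature names here are assumptions chosen to show the shape of per-type profiles, not Markdowner's actual numbers:

```python
# Per-content-type feature weights (illustrative values).
PROFILES = {
    "job":  {"lists": 3.0, "paragraphs": 1.0, "code_blocks": 0.2},
    "docs": {"lists": 1.0, "paragraphs": 1.0, "code_blocks": 3.0},
}

def score_section(features: dict[str, int], content_type: str) -> float:
    """Score a page section by weighting its structural feature counts."""
    weights = PROFILES[content_type]
    return sum(weights.get(name, 0.0) * count for name, count in features.items())

section = {"lists": 4, "paragraphs": 2, "code_blocks": 0}
print(score_section(section, "job"))   # 14.0
print(score_section(section, "docs"))  # 6.0
```

The same section scores very differently under the two profiles: a list-heavy block looks like a requirements section to the `job` profile but is unremarkable to `docs`.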

3. Generic Mode (Maximum Recall)

When no hints are provided, each section of the page is evaluated independently:

  • Computes structural features (text length, paragraphs, headings, lists, images, etc.)
  • Scores each section using a balanced algorithm
  • Includes all sections that pass the threshold
  • Falls back to Readability if needed
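The per-section pass above can be sketched like this. The feature extraction and the threshold value are simplified assumptions for illustration:

```python
def features(text: str) -> dict[str, int]:
    """Compute a couple of simple structural features for one section."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return {"chars": len(text), "paragraphs": len(paragraphs)}

def keep_sections(sections: list[str], threshold: float = 40.0) -> list[str]:
    """Keep every section whose score clears the threshold."""
    kept = []
    for s in sections:
        f = features(s)
        score = f["chars"] * 0.5 + f["paragraphs"] * 10
        if score >= threshold:
            kept.append(s)
    return kept

sections = [
    "Nav | Home | About",
    "A long article body...\n\nwith two paragraphs of real text.",
]
print(len(keep_sections(sections)))  # 1
```

Evaluating each section independently is what gives this mode its recall: a page with several distinct content blocks keeps all of them, instead of only the single highest-scoring one.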

Browser Engines

Playwright (Default)

  • Fast and reliable
  • Chromium-based
  • Good for most websites

Camoufox

  • Firefox-based with anti-fingerprinting
  • Better for bypassing Cloudflare Turnstile
  • Automatic fallback to Playwright on failure
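The fallback behavior can be sketched as follows. The render functions are stand-ins; the real engines are driven asynchronously by the server, and this is not its actual code:

```python
def render(url: str, engine: str, engines: dict) -> str:
    """Render with the requested engine, falling back to Playwright on failure."""
    if engine == "camoufox":
        try:
            return engines["camoufox"](url)
        except Exception:
            pass  # any Camoufox failure falls through to Playwright
    return engines["playwright"](url)

def failing_camoufox(url: str) -> str:
    raise RuntimeError("blocked")  # simulate an engine failure

engines = {
    "camoufox": failing_camoufox,
    "playwright": lambda url: f"rendered {url} with playwright",
}
print(render("https://example.com", "camoufox", engines))
# rendered https://example.com with playwright
```

The request still succeeds even though the preferred engine failed, at the cost of losing Camoufox's anti-fingerprinting for that page.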

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `PORT` | `3000` | Server port |
| `CACHE_ENABLED` | `true` | Enable caching |
| `CACHE_TTL_SECONDS` | `3600` | Cache TTL (1 hour) |
| `CACHE_MAX_MEMORY` | `1000` | Max in-memory cache entries |
| `BROWSER_HEADLESS` | `true` | Run browsers headless |
| `BROWSER_TIMEOUT` | `30000` | Page load timeout (ms) |
| `WORKER_POOL` | `1` | Number of browser workers |
| `BRIGHTDATA_USERNAME` | - | Bright Data username |
| `BRIGHTDATA_PASSWORD` | - | Bright Data password |
| `BRIGHTDATA_PROXY` | - | Bright Data proxy `host:port` |
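A loader for these variables might look like the sketch below. Names and defaults follow the table; the loader itself is illustrative, not Markdowner's actual configuration code:

```python
import os

def load_config(env: dict) -> dict:
    """Read the documented environment variables with their defaults."""
    return {
        "port": int(env.get("PORT", "3000")),
        "cache_enabled": env.get("CACHE_ENABLED", "true").lower() == "true",
        "cache_ttl_seconds": int(env.get("CACHE_TTL_SECONDS", "3600")),
        "cache_max_memory": int(env.get("CACHE_MAX_MEMORY", "1000")),
        "browser_headless": env.get("BROWSER_HEADLESS", "true").lower() == "true",
        "browser_timeout_ms": int(env.get("BROWSER_TIMEOUT", "30000")),
        "worker_pool": int(env.get("WORKER_POOL", "1")),
    }

config = load_config({"PORT": "8080"})
print(config["port"])  # 8080
```

In the server itself, `load_config(os.environ)` would pick up values set in `.env` via Docker Compose or the shell.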

License

MIT
