High-performance URL to markdown converter with multi-engine browser support and intelligent content extraction.
- Multi-engine support: Choose between Playwright (fast, reliable) and Camoufox (Firefox-based, better anti-bot bypass)
- Two extraction methods:
  - `html` - Direct HTTP fetch (fast, for server-side rendered pages)
  - `hydration` - Full browser rendering with JavaScript execution
- Intelligent content extraction via a three-tier system:
  - XPath - direct extraction using a user-provided path (fastest)
  - Content type profiles - specialized extraction for 15+ content types (job, blog, docs, etc.)
  - Generic - per-section evaluation for maximum recall
- Anti-bot bypass: Configurable stealth levels (none, low, high) with human simulation
- 30+ conversion rules: Handles alerts, FAQs, cards, tables, code blocks, and more
- Bright Data proxy integration: session-based IP rotation
- Dual-layer caching: LRU in-memory + file-based persistence
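On that last point, a dual-layer cache pairs a bounded in-memory LRU with a file layer that survives restarts. A minimal sketch of the pattern, not the project's actual implementation (the class name and file layout are illustrative):

```python
import hashlib
import json
import time
from collections import OrderedDict
from pathlib import Path

class DualLayerCache:
    """Illustrative LRU-in-memory + file-backed cache with TTL expiry."""

    def __init__(self, cache_dir: str = "cache", max_memory: int = 1000, ttl: int = 3600):
        self.memory = OrderedDict()  # key -> (expires_at, value)
        self.max_memory = max_memory
        self.ttl = ttl
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Stable hash so the file layer survives process restarts
        return self.dir / (hashlib.sha256(key.encode()).hexdigest() + ".json")

    def get(self, key: str):
        entry = self.memory.get(key)
        if entry and entry[0] > time.time():
            self.memory.move_to_end(key)  # refresh LRU position
            return entry[1]
        path = self._path(key)  # memory miss: fall back to the file layer
        if path.exists():
            expires_at, value = json.loads(path.read_text())
            if expires_at > time.time():
                self.set(key, value)  # promote back into memory
                return value
        return None

    def set(self, key: str, value: str) -> None:
        expires_at = time.time() + self.ttl
        self.memory[key] = (expires_at, value)
        self.memory.move_to_end(key)
        if len(self.memory) > self.max_memory:
            self.memory.popitem(last=False)  # evict least-recently-used entry
        self._path(key).write_text(json.dumps([expires_at, value]))
```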
Quick start with Docker:

```bash
# Clone and configure
git clone https://github.com/your-repo/markdowner.git
cd markdowner
cp env.example .env

# Build and run
docker-compose up -d

# Test
curl "http://localhost:3000/convert?url=https://example.com"
```

To run locally without Docker:

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install dependencies
pip install -r requirements.txt

# Install browser engines
playwright install chromium firefox --with-deps
python -m camoufox fetch  # Optional: for the Camoufox engine

# Run the server
python -m uvicorn src.main:app --host 0.0.0.0 --port 3000 --reload
```

Convert a URL to markdown with `GET /convert`:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | The website URL to convert |
| `method` | `html` \| `hydration` | `hydration` | Extraction method |
| `engine` | `playwright` \| `camoufox` | `playwright` | Browser engine for hydration |
| `enableDetailedResponse` | boolean | `false` | Include full page content (skip extraction) |
| `crawlSubpages` | boolean | `false` | Convert linked subpages (max 10) |
| `useProxy` | boolean | `false` | Use Bright Data proxy |
| `stealthLevel` | `none` \| `low` \| `high` | `none` | Anti-bot bypass level |
| `urlsPostfix` | boolean | `false` | Append all extracted links at the bottom |
| `xpath` | string | `null` | XPath expression for direct content extraction |
| `contentType` | string | `null` | Content type hint for specialized extraction |
When you know the type of page you're extracting, use `contentType` for better results:
| Type | Description | Best For |
|---|---|---|
| `job` | Job postings | LinkedIn, Greenhouse, Lever, Gusto job boards |
| `blog` | Blog posts, articles | Medium, Dev.to, personal blogs |
| `news` | News articles | News sites, journalism |
| `docs` | Technical documentation | MDN, ReadTheDocs, GitHub docs |
| `product` | E-commerce products | Amazon, Shopify stores |
| `recipe` | Cooking recipes | AllRecipes, food blogs |
| `forum` | Forum threads, Q&A | Reddit, StackOverflow, Discourse |
| `wiki` | Wiki pages | Wikipedia, MediaWiki sites |
| `profile` | User profiles | About pages, team pages |
| `listing` | Search results | Directories, catalogs |
| `event` | Event pages | Conferences, meetups |
| `research` | Academic papers | Research articles |
| `legal` | Legal documents | Terms of service, privacy policies |
| `changelog` | Release notes | Version history |
| `landing` | Marketing pages | Landing pages |
Supported `stealthLevel` values:

| Level | Description | Use Case |
|---|---|---|
| `none` | Basic stealth (default) | Most websites |
| `low` | Stealth plugin + basic evasion | Sites with light bot detection |
| `high` | Full human simulation (15-20 s) | Cloudflare Turnstile, aggressive anti-bot |
Examples:
```bash
# Basic conversion with JavaScript rendering
curl "http://localhost:3000/convert?url=https://example.com"

# Fast SSR conversion (no browser)
curl "http://localhost:3000/convert?url=https://example.com&method=html"

# Use Camoufox for anti-bot bypass
curl "http://localhost:3000/convert?url=https://example.com&engine=camoufox"

# High stealth for Cloudflare-protected sites
curl "http://localhost:3000/convert?url=https://example.com&engine=camoufox&stealthLevel=high"

# Extract a job posting with the specialized profile
curl "http://localhost:3000/convert?url=https://jobs.example.com/posting&contentType=job"

# Extract documentation
curl "http://localhost:3000/convert?url=https://docs.example.com/api&contentType=docs"

# Direct XPath extraction (fastest when you know the structure)
curl "http://localhost:3000/convert?url=https://example.com&xpath=//div[@class='content']"

# Get all links appended at the bottom
curl "http://localhost:3000/convert?url=https://example.com&urlsPostfix=true"

# JSON response
curl -H "Accept: application/json" \
  "http://localhost:3000/convert?url=https://example.com"

# With proxy
curl "http://localhost:3000/convert?url=https://example.com&useProxy=true"
```

Other endpoints:

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | API documentation |
| `/health` | GET | Health check |
| `/metrics` | GET | Engine and cache metrics |
| `/cache/stats` | GET | Cache statistics |
| `/cache` | DELETE | Clear all cached entries |
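These endpoints can also be exercised from Python; a short sketch assuming `requests` (the exact JSON shape of `/metrics` is server-defined and not documented here):

```python
import requests

BASE = "http://localhost:3000"

# Liveness check
assert requests.get(f"{BASE}/health", timeout=5).ok

# Engine and cache metrics (shape depends on the server)
print(requests.get(f"{BASE}/metrics", timeout=5).json())

# Clear all cached entries
requests.delete(f"{BASE}/cache", timeout=5).raise_for_status()
```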
Markdowner uses a three-tier extraction system to intelligently find and extract main content:
When you provide an `xpath` parameter, content is extracted directly without any analysis:

```bash
curl "http://localhost:3000/convert?url=https://example.com&xpath=//article"
```

When you provide a `contentType`, specialized selectors and scoring weights are used:
- Priority selectors: Common CSS selectors for that content type (tried first)
- Weighted scoring: Features like lists, paragraphs, and code blocks are weighted differently per type
- Example: Job postings weight lists higher (for requirements), docs weight code blocks higher
curl "http://localhost:3000/convert?url=https://jobs.example.com&contentType=job"When no hints are provided, each section of the page is evaluated independently:
- Computes structural features (text length, paragraphs, headings, lists, images, etc.)
- Scores each section using a balanced algorithm (see the sketch below)
- Includes all sections that pass the threshold
- Falls back to Readability if needed
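A minimal sketch of what per-section weighted scoring can look like; the feature names, weights, and threshold here are hypothetical, not the project's actual values:

```python
# Hypothetical feature weights; a content-type profile overrides them,
# e.g. job postings weight lists higher, docs weight code blocks higher.
DEFAULT_WEIGHTS = {"paragraphs": 2.0, "headings": 1.5, "lists": 1.0,
                   "code_blocks": 1.0, "images": 0.5, "text_kb": 1.0}
JOB_WEIGHTS = {**DEFAULT_WEIGHTS, "lists": 3.0}

THRESHOLD = 10.0  # illustrative cut-off; sections scoring above it are kept

def score_section(features: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of a section's structural features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def select_sections(sections: list[dict[str, float]],
                    weights: dict[str, float] = DEFAULT_WEIGHTS) -> list[int]:
    """Indices of sections whose score passes the threshold."""
    return [i for i, feats in enumerate(sections)
            if score_section(feats, weights) >= THRESHOLD]
```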
Playwright (default):
- Fast and reliable
- Chromium-based
- Good for most websites

Camoufox:
- Firefox-based with anti-fingerprinting
- Better for bypassing Cloudflare Turnstile
- Automatic fallback to Playwright on failure
Configuration is set through environment variables:

| Variable | Default | Description |
|---|---|---|
| `PORT` | `3000` | Server port |
| `CACHE_ENABLED` | `true` | Enable caching |
| `CACHE_TTL_SECONDS` | `3600` | Cache TTL (1 hour) |
| `CACHE_MAX_MEMORY` | `1000` | Max in-memory cache entries |
| `BROWSER_HEADLESS` | `true` | Run browsers headless |
| `BROWSER_TIMEOUT` | `30000` | Page load timeout (ms) |
| `WORKER_POOL` | `1` | Number of browser workers |
| `BRIGHTDATA_USERNAME` | - | Bright Data username |
| `BRIGHTDATA_PASSWORD` | - | Bright Data password |
| `BRIGHTDATA_PROXY` | - | Bright Data proxy `host:port` |
License: MIT