Adapted: HireBase provides a local service for Markdowner, maintained by people who use it to understand content on the internet.

Markdowner v2.0

High-performance URL to markdown converter with multi-engine browser support and intelligent content extraction.

Features

  • Multi-engine support: Choose between Playwright (fast, reliable) and Camoufox (Firefox-based, better anti-bot bypass)
  • Two extraction methods:
    • html - Direct HTTP fetch (fast, for server-side rendered pages)
    • hydration - Full browser rendering with JavaScript execution
  • Intelligent content extraction: a three-tier extraction system:
    • XPath - Direct extraction using user-provided path (fastest)
    • Content Type Profiles - Specialized extraction for 15+ content types (job, blog, docs, etc.)
    • Generic - Per-section evaluation for maximum recall
  • Anti-bot bypass: Configurable stealth levels (none, low, high) with human simulation
  • 30+ conversion rules: Handles alerts, FAQs, cards, tables, code blocks, and more
  • Bright Data proxy integration: Session rotation for IP rotation
  • Dual-layer caching: LRU in-memory + file-based persistence

Quick Start

Using Docker Compose (Recommended)

```bash
# Clone and configure
git clone https://github.com/your-repo/markdowner.git
cd markdowner
cp env.example .env

# Build and run
docker-compose up -d

# Test
curl "http://localhost:3000/convert?url=https://example.com"
```

Local Development

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install dependencies
pip install -r requirements.txt

# Install browser engines
playwright install chromium firefox --with-deps
python -m camoufox fetch  # Optional: for Camoufox engine

# Run the server
python -m uvicorn src.main:app --host 0.0.0.0 --port 3000 --reload
```

API Reference

GET /convert

Convert a URL to markdown.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `url` | string | required | The website URL to convert |
| `method` | `html` \| `hydration` | `hydration` | Extraction method |
| `engine` | `playwright` \| `camoufox` | `playwright` | Browser engine for hydration |
| `enableDetailedResponse` | boolean | `false` | Include full page content (skip extraction) |
| `crawlSubpages` | boolean | `false` | Convert linked subpages (max 10) |
| `useProxy` | boolean | `false` | Use Bright Data proxy |
| `stealthLevel` | `none` \| `low` \| `high` | `none` | Anti-bot bypass level |
| `urlsPostfix` | boolean | `false` | Append all extracted links at the bottom |
| `xpath` | string | `null` | XPath expression for direct content extraction |
| `contentType` | string | `null` | Content type hint for specialized extraction |
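For programmatic use, a request URL can be assembled from these parameters with the standard library. This is a minimal client-side sketch, assuming the default local deployment on port 3000:

```python
from urllib.parse import urlencode

BASE = "http://localhost:3000"  # assumed local deployment

def convert_url(url: str, **params: str) -> str:
    """Build a /convert request URL; parameter names follow the table above."""
    query = {"url": url, **params}
    return f"{BASE}/convert?{urlencode(query)}"

print(convert_url("https://example.com", method="html", stealthLevel="low"))
# http://localhost:3000/convert?url=https%3A%2F%2Fexample.com&method=html&stealthLevel=low
```

`urlencode` handles percent-encoding of the target URL, which is easy to get wrong when concatenating query strings by hand.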

Content Types

When you know the type of page you're extracting, use contentType for better results:

| Type | Description | Best For |
| --- | --- | --- |
| `job` | Job postings | LinkedIn, Greenhouse, Lever, Gusto job boards |
| `blog` | Blog posts, articles | Medium, Dev.to, personal blogs |
| `news` | News articles | News sites, journalism |
| `docs` | Technical documentation | MDN, ReadTheDocs, GitHub docs |
| `product` | E-commerce products | Amazon, Shopify stores |
| `recipe` | Cooking recipes | AllRecipes, food blogs |
| `forum` | Forum threads, Q&A | Reddit, StackOverflow, Discourse |
| `wiki` | Wiki pages | Wikipedia, MediaWiki sites |
| `profile` | User profiles | About pages, team pages |
| `listing` | Search results | Directories, catalogs |
| `event` | Event pages | Conferences, meetups |
| `research` | Academic papers | Research articles |
| `legal` | Legal documents | Terms of service, privacy policies |
| `changelog` | Release notes | Version history |
| `landing` | Marketing pages | Landing pages |

Stealth Levels

| Level | Description | Use Case |
| --- | --- | --- |
| `none` | Basic stealth (default) | Most websites |
| `low` | Stealth plugin + basic evasion | Sites with light bot detection |
| `high` | Full human simulation (15-20s) | Cloudflare Turnstile, aggressive anti-bot |

Examples:

```bash
# Basic conversion with JavaScript rendering
curl "http://localhost:3000/convert?url=https://example.com"

# Fast SSR conversion (no browser)
curl "http://localhost:3000/convert?url=https://example.com&method=html"

# Use Camoufox for anti-bot bypass
curl "http://localhost:3000/convert?url=https://example.com&engine=camoufox"

# High stealth for Cloudflare-protected sites
curl "http://localhost:3000/convert?url=https://example.com&engine=camoufox&stealthLevel=high"

# Extract job posting with specialized profile
curl "http://localhost:3000/convert?url=https://jobs.example.com/posting&contentType=job"

# Extract documentation
curl "http://localhost:3000/convert?url=https://docs.example.com/api&contentType=docs"

# Direct XPath extraction (fastest when you know the structure)
curl "http://localhost:3000/convert?url=https://example.com&xpath=//div[@class='content']"

# Get all links appended at the bottom
curl "http://localhost:3000/convert?url=https://example.com&urlsPostfix=true"

# JSON response
curl -H "Accept: application/json" \
  "http://localhost:3000/convert?url=https://example.com"

# With proxy
curl "http://localhost:3000/convert?url=https://example.com&useProxy=true"
```

Other Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| `/` | GET | API documentation |
| `/health` | GET | Health check |
| `/metrics` | GET | Engine and cache metrics |
| `/cache/stats` | GET | Cache statistics |
| `/cache` | DELETE | Clear all cached entries |

Content Extraction

Markdowner uses a three-tier extraction system to intelligently find and extract main content:

1. XPath Mode (Fastest)

When you provide an xpath parameter, content is extracted directly without any analysis:

```bash
curl "http://localhost:3000/convert?url=https://example.com&xpath=//article"
```
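The idea can be sketched with the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset; a production extractor would typically use a full XPath 1.0 engine such as lxml, and this is not Markdowner's actual code:

```python
import xml.etree.ElementTree as ET

html = """
<html><body>
  <nav>menu</nav>
  <article><h1>Title</h1><p>Main content.</p></article>
</body></html>
"""

root = ET.fromstring(html)
# The user-supplied expression, in ElementTree's relative form.
matches = root.findall(".//article")
print(matches[0].find("p").text)  # Main content.
```

Because the path points straight at the target node, no scoring or content analysis runs at all, which is why this mode is the fastest.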

2. Content Type Profiles

When you provide a contentType, specialized selectors and scoring weights are used:

  • Priority selectors: Common CSS selectors for that content type (tried first)
  • Weighted scoring: Features like lists, paragraphs, and code blocks are weighted differently per type
  • Example: Job postings weight lists higher (for requirements), docs weight code blocks higher
```bash
curl "http://localhost:3000/convert?url=https://jobs.example.com&contentType=job"
```
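The weighted-scoring idea can be illustrated as follows. The weight values and feature names here are assumptions chosen to show the shape of per-type profiles, not Markdowner's actual numbers:

```python
# Per-content-type feature weights (illustrative values).
PROFILES = {
    "job":  {"lists": 3.0, "paragraphs": 1.0, "code_blocks": 0.2},
    "docs": {"lists": 1.0, "paragraphs": 1.0, "code_blocks": 3.0},
}

def score_section(features: dict[str, int], content_type: str) -> float:
    """Score a page section by weighting its structural feature counts."""
    weights = PROFILES[content_type]
    return sum(weights.get(name, 0.0) * count for name, count in features.items())

section = {"lists": 4, "paragraphs": 2, "code_blocks": 0}
print(score_section(section, "job"))   # 14.0
print(score_section(section, "docs"))  # 6.0
```

The same section scores very differently under the two profiles: a list-heavy block looks like a requirements section to the `job` profile but is unremarkable to `docs`.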

3. Generic Mode (Maximum Recall)

When no hints are provided, each section of the page is evaluated independently:

  • Computes structural features (text length, paragraphs, headings, lists, images, etc.)
  • Scores each section using a balanced algorithm
  • Includes all sections that pass the threshold
  • Falls back to Readability if needed
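The per-section pass above can be sketched like this. The feature extraction and the threshold value are simplified assumptions for illustration:

```python
def features(text: str) -> dict[str, int]:
    """Compute a couple of simple structural features for one section."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return {"chars": len(text), "paragraphs": len(paragraphs)}

def keep_sections(sections: list[str], threshold: float = 40.0) -> list[str]:
    """Keep every section whose score clears the threshold."""
    kept = []
    for s in sections:
        f = features(s)
        score = f["chars"] * 0.5 + f["paragraphs"] * 10
        if score >= threshold:
            kept.append(s)
    return kept

sections = [
    "Nav | Home | About",
    "A long article body...\n\nwith two paragraphs of real text.",
]
print(len(keep_sections(sections)))  # 1
```

Evaluating each section independently is what gives this mode its recall: a page with several distinct content blocks keeps all of them, instead of only the single highest-scoring one.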

Browser Engines

Playwright (Default)

  • Fast and reliable
  • Chromium-based
  • Good for most websites

Camoufox

  • Firefox-based with anti-fingerprinting
  • Better for bypassing Cloudflare Turnstile
  • Automatic fallback to Playwright on failure
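The fallback behavior can be sketched as follows. The render functions are stand-ins; the real engines are driven asynchronously by the server, and this is not its actual code:

```python
def render(url: str, engine: str, engines: dict) -> str:
    """Render with the requested engine, falling back to Playwright on failure."""
    if engine == "camoufox":
        try:
            return engines["camoufox"](url)
        except Exception:
            pass  # any Camoufox failure falls through to Playwright
    return engines["playwright"](url)

def failing_camoufox(url: str) -> str:
    raise RuntimeError("blocked")  # simulate an engine failure

engines = {
    "camoufox": failing_camoufox,
    "playwright": lambda url: f"rendered {url} with playwright",
}
print(render("https://example.com", "camoufox", engines))
# rendered https://example.com with playwright
```

The request still succeeds even though the preferred engine failed, at the cost of losing Camoufox's anti-fingerprinting for that page.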

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `PORT` | `3000` | Server port |
| `CACHE_ENABLED` | `true` | Enable caching |
| `CACHE_TTL_SECONDS` | `3600` | Cache TTL (1 hour) |
| `CACHE_MAX_MEMORY` | `1000` | Max in-memory cache entries |
| `BROWSER_HEADLESS` | `true` | Run browsers headless |
| `BROWSER_TIMEOUT` | `30000` | Page load timeout (ms) |
| `WORKER_POOL` | `1` | Number of browser workers |
| `BRIGHTDATA_USERNAME` | - | Bright Data username |
| `BRIGHTDATA_PASSWORD` | - | Bright Data password |
| `BRIGHTDATA_PROXY` | - | Bright Data proxy `host:port` |
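A loader for these variables might look like the sketch below. Names and defaults follow the table; the loader itself is illustrative, not Markdowner's actual configuration code:

```python
import os

def load_config(env: dict) -> dict:
    """Read the documented environment variables with their defaults."""
    return {
        "port": int(env.get("PORT", "3000")),
        "cache_enabled": env.get("CACHE_ENABLED", "true").lower() == "true",
        "cache_ttl_seconds": int(env.get("CACHE_TTL_SECONDS", "3600")),
        "cache_max_memory": int(env.get("CACHE_MAX_MEMORY", "1000")),
        "browser_headless": env.get("BROWSER_HEADLESS", "true").lower() == "true",
        "browser_timeout_ms": int(env.get("BROWSER_TIMEOUT", "30000")),
        "worker_pool": int(env.get("WORKER_POOL", "1")),
    }

config = load_config({"PORT": "8080"})
print(config["port"])  # 8080
```

In the server itself, `load_config(os.environ)` would pick up values set in `.env` via Docker Compose or the shell.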

License

MIT
