A developer-friendly tool that turns any website into a structured, deployable set of AI-ready Markdown artifacts with a host-level llms.txt index.
site2llms discovers pages, extracts readable content, summarizes each page via a local Ollama model, and writes structured output under output/<host>/ — ready to serve, embed, or feed into any LLM workflow.
Most tools in this space either stop at SEO analysis, produce a single flat index file, or require manual curation. site2llms is different:
- Deployable artifacts, not reports. The output is a complete directory of Markdown files with YAML frontmatter and an `llms.txt` index — ready to drop into a static site, a docs bundle, or a RAG pipeline. It's not a one-off analysis; it's a repeatable build step.
- Full pages, not just links. Every discovered page gets its own structured summary with TL;DR, key points, FAQ, and metadata. An LLM consuming these files gets real content, not a table of contents.
- Incremental by default. A content-hash manifest (`manifest.json`) tracks what changed. Re-running the tool only processes new or updated pages — suitable for CI/CD or scheduled regeneration.
- Handles the real web. Cloudflare challenges, SiteGround CAPTCHAs, JS-rendered SPAs, WordPress Elementor themes — the layered fetch pipeline (HTTP → headless Chromium → cookie injection) deals with sites as they actually exist, not as they ideally should.
- Developer-friendly. Interactive prompts for quick runs, structured output for automation, clean separation of discover → fetch → extract → summarize → write stages. No external SaaS dependencies — just .NET, Ollama, and optionally Playwright.
- Discovers URLs using ordered strategies: WordPress REST API → Sitemap → RSS/Atom feeds → Crawl fallback.
- Detects WordPress sites automatically and uses the REST API to get server-rendered content (bypasses JS-dependent themes like Elementor).
- Fetches page HTML with browser-like headers; automatically retries with a headless Chromium browser when bot-protection or challenge pages are detected.
- Supports cookie injection from a Netscape/JSON cookie file to bypass CAPTCHAs and authentication gates.
- Supports optional URL include/exclude filters to constrain discovery to specific sections of a site.
- Extracts main content (`main`, `article`, role/content selectors, then `body`) and strips boilerplate.
- Converts extracted content to Markdown.
- Calls Ollama `/api/generate` to produce structured summaries (TL;DR, key points, FAQ, context).
- Writes one summary file per page in `output/<host>/ai/pages/*.md` with YAML frontmatter.
- Builds/updates `output/<host>/llms.txt` — a sorted, host-level index of all summarized pages.
- Optionally builds `output/<host>/llms-full.txt` — a single full-corpus text file containing the index plus every generated summary.
- Maintains `output/<host>/manifest.json` for content-hash caching so unchanged pages are skipped on re-runs.
| Component | Purpose |
|---|---|
| .NET 8.0 console app | Runtime |
| AngleSharp 1.4.0 | HTML parsing & DOM querying |
| Microsoft.Playwright 1.55.0 | Headless Chromium for JS-rendered/protected sites |
| ReverseMarkdown 5.2.0 | HTML → Markdown conversion |
| System.ServiceModel.Syndication | RSS/Atom feed parsing |
| Ollama API | Local LLM summarization |
- Only if building from source: .NET SDK 8.x
- Go to the Releases page.
- Download the archive for your operating system:

  | OS | Asset to download |
  |---|---|
  | Windows x64 | site2llms-win-x64.zip |
  | Linux x64 | site2llms-linux-x64.tar.gz |
  | macOS x64 | site2llms-osx-x64.tar.gz |
  | macOS Apple Silicon | site2llms-osx-arm64.tar.gz |
- Extract the archive:

  Windows (PowerShell):

  ```powershell
  Expand-Archive site2llms-win-x64.zip -DestinationPath site2llms
  ```

  Linux / macOS:

  ```shell
  tar xzf site2llms-linux-x64.tar.gz
  ```
- (First run only) Install the Playwright Chromium browser if you need headless fallback:

  Windows:

  ```powershell
  pwsh site2llms/playwright.ps1 install chromium
  ```

  Linux / macOS:

  ```shell
  ./site2llms/playwright.sh install chromium
  ```
- Run the tool — see Usage below.
- Ensure you have the .NET 8 SDK installed.
- Clone and build:

  ```shell
  git clone https://github.com/giacomo1215/site2llms.git
  cd site2llms
  dotnet build
  ```
- (First run only) Install Playwright browsers:

  ```shell
  pwsh bin/Debug/net8.0/playwright.ps1 install chromium
  ```
- Run via `dotnet run` (append `--` before any CLI flags):

  ```shell
  dotnet run
  dotnet run -- --url https://example.com --max-pages 50
  ```
site2llms supports two modes: CLI (pass arguments) and interactive (answer prompts).
Pass at least --url to activate CLI mode. All other flags are optional with sensible defaults.
```shell
site2llms --url https://example.com
```

| Flag | Description | Default |
|---|---|---|
| `--url <URL>` | Root URL to crawl (required) | — |
| `--max-pages <N>` | Maximum number of pages to process | 200 |
| `--max-depth <N>` | Maximum BFS crawl depth for discovery | 3 |
| `--delay <ms>` | Politeness delay between requests (ms) | 250 |
| `--ollama-url <URL>` | Ollama API base URL | http://localhost:11434 |
| `--ollama-model <NAME>` | Ollama model identifier | minimax-m2.5:cloud |
| `--cookies <PATH>` | Path to a Netscape/JSON cookie file | — |
| `--include <PATTERN>` | Include only URLs matching the pattern; repeatable | — |
| `--exclude <PATTERN>` | Exclude URLs matching the pattern; repeatable | — |
| `--same-host-only` | Restrict discovery to same host | (on by default) |
| `--no-same-host` | Allow cross-host discovery | — |
| `--dry-run` | Discover URLs only — skip fetching, summarization, and output | — |
| `--llms-full` | Also generate llms-full.txt with the full page corpus | — |
| `-h, --help` | Show help message and exit | — |
```shell
# Minimal — crawl with defaults
site2llms --url https://example.com

# Limit scope
site2llms --url https://example.com --max-pages 50 --max-depth 2

# Use a different model
site2llms --url https://example.com --ollama-model llama3

# Bypass protection with cookies
site2llms --url https://protected-site.com --cookies cookies.txt

# Keep only documentation pages and skip blog/archive areas
site2llms --url https://example.com --include "*docs*" --exclude "*blog*" --exclude "*tag*"

# Preview discovered URLs without processing
site2llms --url https://example.com --dry-run

# Also emit llms-full.txt
site2llms --url https://example.com --llms-full

# Full example
site2llms --url https://example.com --max-pages 50 --delay 500 --cookies cookies.txt --ollama-model llama3 --include "*/docs/*" --llms-full
```

When building from source, prefix with `dotnet run --`:

```shell
dotnet run -- --url https://example.com --dry-run
```
Run without arguments to enter interactive mode, where the tool prompts for each option:

```shell
site2llms
```

Answer the interactive prompts:
| Prompt | Default |
|---|---|
| Root URL | https://example.com |
| Max pages | 200 |
| Max depth for crawl fallback | 3 |
| Delay ms between requests | 250 |
| Ollama base URL | http://localhost:11434 |
| Ollama model | minimax-m2.5:cloud |
| Cookie file (Netscape/JSON) | (blank to skip) |
| Include URL patterns (comma-separated, blank for none) | (blank to skip) |
| Exclude URL patterns (comma-separated, blank for none) | (blank to skip) |
| Generate llms-full.txt with full page corpus | No |
```text
site2llms - Universal website summarizer
Root URL [https://example.com]: https://example.com
Max pages [200]: 3
Max depth for crawl fallback [3]: 2
Delay ms between requests [250]: 100
Ollama base URL [http://localhost:11434]:
Ollama model [minimax-m2.5:cloud]:
Cookie file (Netscape/JSON, blank to skip) []:
Include URL patterns (comma-separated, blank for none) []:
Exclude URL patterns (comma-separated, blank for none) []:
Generate llms-full.txt with full page corpus [y/N]: n
WP REST detected: no
Discovered 3 pages.
Processing: https://example.com/
...
Run completed.
Discovered: 3
Processed: 2
Skipped: 1 (cache hits: 1)
Failed: 0
Output: C:\...\output\example.com
```
```text
Processing: https://protected-site.com/
Protection detected: SiteGround CAPTCHA (SGCaptcha) — retrying with headless browser...
Headless browser also blocked: SiteGround CAPTCHA (SGCaptcha)
Tip: supply a cookie file (--cookies) from a real browser session to bypass this protection.
Skipped: Extracted markdown too short (<50 chars)
```
For a root URL like https://example.com:
```text
output/
  example.com/
    llms.txt           # host-level index of all summarized pages
    llms-full.txt      # optional full-corpus export with all summaries
    manifest.json      # content-hash cache for incremental runs
    ai/
      pages/
        home.md        # structured summary with YAML frontmatter
        about_us.md
        contattaci_php.md
```
Each page file contains YAML frontmatter and a structured Markdown body:
```yaml
---
title: "Page Title"
source_url: "https://example.com/page"
fetched_at: "2025-01-15T10:30:00Z"
content_hash: "sha256hex..."
generator: "site2llms + Ollama"
---
```

The body follows a consistent template: TL;DR (2–4 bullets), Key points (5–10 bullets), Useful context (content type, services, deliverables), FAQ (5–8 Q&A pairs), and a Reference link back to the source URL.
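As an illustration, the frontmatter above can be read back with a few lines of Python — a minimal sketch that assumes flat `key: "value"` pairs between the `---` delimiters; a real consumer should use a proper YAML parser:

```python
def parse_frontmatter(text: str) -> tuple[dict, str]:
    """Split a page file into (frontmatter dict, markdown body).

    Minimal sketch: handles only flat `key: "value"` pairs, which is
    all the template above uses.
    """
    meta: dict[str, str] = {}
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return meta, text  # no frontmatter block at all
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            return meta, "\n".join(lines[i + 1:])
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip().strip('"')
    return meta, ""
```

This makes the files trivially ingestible by a RAG loader that wants `source_url` and `content_hash` as metadata alongside the Markdown body.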
A sorted index with the site root, a short description, and one entry per page:
```text
# llms.txt for example.com
Site root: https://example.com
Short description: AI-friendly markdown summaries generated by site2llms.

## Index
- About Us: https://example.com/ai/pages/about_us.md
- Home: https://example.com/ai/pages/home.md
```
Generated only when requested with --llms-full or by answering yes in interactive mode. It contains the same top-level site metadata, a sorted index, and then the full markdown body for every generated page summary in a single file.
Per-URL cache metadata (url, contentHash, relativeOutputPath, lastGeneratedAt, title). If a page's content hash hasn't changed since the last run, it's skipped as a cache hit.
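The cache-hit check can be modeled in a few lines of Python — an illustrative sketch, not the tool's C# implementation, and the exact input canonicalization before hashing is an assumption:

```python
import hashlib

def content_hash(markdown: str) -> str:
    # SHA-256 over the extracted markdown, mirroring the manifest's
    # contentHash field.
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def is_cache_hit(manifest: dict, url: str, markdown: str) -> bool:
    """True when the freshly extracted content hashes to the same value
    recorded in the manifest, so summarization can be skipped."""
    entry = manifest.get(url)
    return entry is not None and entry.get("contentHash") == content_hash(markdown)
```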
Discover → Fetch → Extract → Summarize → Write → Build index
- Discover — `CompositeDiscovery` runs strategies in order (WP REST → Sitemap → RSS/Atom → Crawl), merges their results, and deduplicates URLs; when the same URL is found multiple times, the earliest strategy keeps precedence.
- Fetch — WordPress REST `content.rendered` (if WP detected) → HTTP with browser headers → headless Chromium fallback. Cookies are injected into both HTTP and headless paths.
- Filter — include/exclude URL patterns are applied during discovery; `--exclude` takes precedence over `--include`.
- Extract — `HeuristicContentExtractor` selects the best content container, strips boilerplate, and converts to Markdown.
- Cache check — the SHA-256 content hash is compared against `manifest.json`; unchanged pages are skipped.
- Summarize — `OllamaSummarizer` calls `/api/generate` (temperature 0.2) with a structured prompt template.
- Write — `FileOutputWriter` persists the summary file; `ManifestStore` updates the cache.
- Build index — `LlmsTxtBuilder` generates the `llms.txt` file (sorted by title, deduplicated by filename slug).
Strategies are run in order, and their results are merged. If the same URL is found by multiple strategies, the earliest strategy keeps precedence for that URL.
| Strategy | When it's used | How it works |
|---|---|---|
| WordPress REST | WP sites (auto-detected) | Probes /wp-json/ and /?rest_route=/, fetches wp/v2/pages + wp/v2/posts with pagination, skips attachments and password-protected posts, caches content.rendered in-memory, then applies URL filters |
| Sitemap | Any site with XML sitemaps | Tries /sitemap.xml, /sitemap_index.xml, /wp-sitemap.xml; supports both sitemapindex and urlset |
| RSS/Atom | Feed-enabled sites | Tries /feed/, /rss, /rss.xml, /feed.xml; extracts page links from feed items |
| Crawl | Fallback for all other sites | BFS crawl from root URL; honors MaxDepth, MaxPages, DelayMs, same-host filtering, and skips the root entirely if it does not match the active include/exclude rules |
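The earliest-strategy-wins merge boils down to "first writer wins". A sketch in Python (illustrative, not the actual `CompositeDiscovery` code):

```python
def merge_discoveries(results_in_order: list[tuple[str, list[str]]]) -> dict[str, str]:
    """Merge per-strategy URL lists, keeping the earliest strategy for
    each URL. Python dicts preserve insertion order, so setdefault()
    gives first-writer-wins deduplication for free."""
    merged: dict[str, str] = {}
    for strategy, urls in results_in_order:
        for url in urls:
            merged.setdefault(url, strategy)
    return merged
```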
Use URL filters when you want to limit processing to a subsection of a site or skip noisy areas such as tag pages, archives, account screens, or search results.
- `--include` is repeatable in CLI mode and acts as an allow-list. If at least one include pattern is provided, a URL must match one of them to be processed.
- `--exclude` is repeatable in CLI mode and always wins over `--include`.
- Interactive mode accepts comma-separated include/exclude patterns and splits them into multiple rules.
- Pattern matching is case-insensitive.
- Patterns containing `*` are treated as wildcards and matched against the full absolute URL.
- Patterns without `*` use a case-insensitive substring match against the full absolute URL.
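The matching rules above can be sketched in Python — illustrative only, with `fnmatch` standing in for whatever wildcard matcher the tool actually uses:

```python
from fnmatch import fnmatchcase

def _matches(pattern: str, url: str) -> bool:
    p, u = pattern.lower(), url.lower()          # matching is case-insensitive
    if "*" in p:
        return fnmatchcase(u, p)                 # wildcard match on the full URL
    return p in u                                # plain substring match otherwise

def url_allowed(url: str, includes: list[str], excludes: list[str]) -> bool:
    if any(_matches(p, url) for p in excludes):  # exclude always wins
        return False
    if includes:                                 # includes act as an allow-list
        return any(_matches(p, url) for p in includes)
    return True                                  # no includes: everything passes
```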
Examples:
```shell
# Only product pages
site2llms --url https://example.com --include "*/products/*"

# Keep docs, but skip changelog and tag pages
site2llms --url https://example.com --include "*docs*" --exclude "*changelog*" --exclude "*tag*"
```

Interactive mode accepts comma-separated values:

```text
Include URL patterns (comma-separated, blank for none) []: *docs*, */guides/*
Exclude URL patterns (comma-separated, blank for none) []: *tag*, *page/*
```

- Preferred containers: `main` → `article` → `[role='main']` → `.content`/`.entry-content`/`.post-content` → `body`
- Boilerplate removal: strips `script`, `style`, `noscript`, `nav`, `footer`, `header`, `aside`
- Markdown conversion: ReverseMarkdown with GitHub-flavored output; plain-text fallback if HTML→MD yields empty
- Skip threshold: pages with extracted markdown shorter than 50 characters are skipped
The fetch pipeline has three layers:
| Layer | Description |
|---|---|
| HTTP fetch | Fast, lightweight HttpClient with browser-like headers and automatic gzip/brotli decompression |
| Headless Chromium | Automatic fallback when the HTTP response is blocked or too thin (<600 bytes). Uses Playwright with NetworkIdle wait and stealth settings (--disable-blink-features=AutomationControlled, navigator.webdriver removal) |
| Cookie injection | Cookies from a Netscape/JSON file are injected into both HttpClient (CookieContainer) and the Playwright browser context, domain-filtered |
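The escalation rule from HTTP to headless Chromium can be sketched as follows — an illustrative Python model; the <600-byte threshold comes from the table above, while the specific status codes and challenge markers checked here are assumptions:

```python
def needs_headless(status: int, body: str, min_bytes: int = 600) -> bool:
    """Decide whether to escalate from plain HTTP to headless Chromium:
    on a blocked response, a challenge marker, or a too-thin body."""
    challenge_markers = ("sgcaptcha", "just a moment", "checking your browser")
    if status in (403, 429, 503):                 # assumed "blocked" statuses
        return True
    if len(body.encode("utf-8")) < min_bytes:     # too thin to be real content
        return True
    return any(m in body.lower() for m in challenge_markers)
```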
The app recognizes 13 common protection patterns and reports each one explicitly:
| Pattern | Label |
|---|---|
| `SGCaptcha` / `.well-known/sgcaptcha` | SiteGround CAPTCHA |
| `cf-challenge` / "Just a moment" | Cloudflare challenge |
| "Attention Required" | Cloudflare block page |
| hCaptcha / g-recaptcha | CAPTCHA challenges |
| "Checking your browser" | Browser verification |
| "DDoS protection by" | DDoS interstitial |
| "enable javascript" | JS-required gate |
When a challenge is detected on the root URL, a warm-up session launches a headless browser to solve it. If successful, the browser session is reused for all subsequent requests.
Two formats are supported:
Netscape/Mozilla `cookie.txt` (exported by browser extensions like "Get cookies.txt LOCALLY"):

```text
# Netscape HTTP Cookie File
.example.com	TRUE	/	FALSE	0	session_id	abc123
.example.com	TRUE	/	TRUE	0	__cf_bm	xyz789
```
JSON array:
```json
[
  { "name": "session_id", "value": "abc123", "domain": ".example.com", "path": "/" },
  { "name": "__cf_bm", "value": "xyz789", "domain": ".example.com", "path": "/" }
]
```

- Open the target site in a real browser and solve the CAPTCHA/challenge.
- Export cookies using a browser extension (e.g., "Get cookies.txt LOCALLY") as a `.txt` file.
- Run site2llms and provide the path when prompted:

  ```text
  Cookie file (Netscape/JSON, blank to skip) []: cookies.txt
  Cookies loaded from: cookies.txt
  ```
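Parsing the Netscape format is straightforward — the real format is tab-separated with seven fields per line. An illustrative Python sketch (not the tool's loader):

```python
def parse_netscape_cookies(text: str) -> list[dict]:
    """Parse Netscape-format cookie lines into dicts. Splits on any
    whitespace so it also tolerates space-aligned exports."""
    cookies = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip comments and blank lines
        fields = line.split(None, 6)      # domain, subdomain flag, path, secure, expiry, name, value
        if len(fields) != 7:
            continue                      # malformed line: ignore
        domain, _, path, secure, expiry, name, value = fields
        cookies.append({"name": name, "value": value, "domain": domain,
                        "path": path, "secure": secure.upper() == "TRUE",
                        "expires": int(expiry)})
    return cookies
```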
- `SameHostOnly` is set to `true` — only same-host URLs are discovered and processed.
- URL filters are evaluated against the full absolute URL and are enforced during discovery, before fetch/extract/summarize work begins.
- HTTP client timeout is 90 seconds.
- Both website fetching and Ollama calls use browser-like HTTP headers.
- Automatic decompression (gzip, deflate, Brotli) is enabled.
- WordPress REST requests include retry with exponential backoff on 429/503 responses.
- Ollama summarization uses temperature 0.2 for deterministic, low-variance output.
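The retry-on-429/503 behavior can be sketched generically — illustrative Python; the tool's actual retry count and base delay are not documented here:

```python
import time

def retry_with_backoff(call, retries: int = 4, base_delay: float = 1.0,
                       retryable=(429, 503), sleep=time.sleep):
    """Call `call()` (returning (status, body)) until the status is not
    retryable, sleeping base_delay * 2**attempt between attempts."""
    status, body = call()
    for attempt in range(retries):
        if status not in retryable:
            return status, body
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
        status, body = call()
    return status, body                      # give up: return last response
```

The `sleep` parameter is injected only so the schedule can be tested without real waiting.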
- Check website responsiveness and network reachability.
- Reduce `MaxPages` for first runs.
- Increase `DelayMs` for rate-limited sites (this does not increase the HTTP timeout).
- If needed, increase `HttpClient.Timeout` in `Program.cs`.
- Ensure Ollama is reachable at the configured base URL.
- Verify the selected model exists (`ollama list`).
- Pull the model if needed (`ollama pull <model-name>`).
- Some pages are mostly navigation or script-rendered and may be skipped.
- Try specific content URLs instead of generic landing pages.
- For WordPress/Elementor sites, the WP REST API path usually gets real content even when HTML fetch fails.
- Check whether `--include` is too restrictive; once present, only matching URLs are allowed.
- Check whether an `--exclude` pattern is overriding an include rule.
- Remember that matching is done against the full absolute URL, not only the path segment.
- In interactive mode, separate multiple patterns with commas.
If you see:
```text
Protection detected: SiteGround CAPTCHA (SGCaptcha) — retrying with headless browser...
Tip: supply a cookie file (--cookies) from a real browser session to bypass this protection.
```
- Visit the site in a real browser and complete the challenge.
- Export cookies (Netscape `.txt` or JSON) and provide the file path when running the app.
- Cookies are injected into both HTTP and headless browser requests automatically.
```text
Program.cs                      # Entry point — interactive prompts, DI wiring, run
Core/
  Models/                       # Data records (CrawlOptions, PageContent, Manifest, …)
  Pipeline/
    SummarizationPipeline.cs    # Orchestrates discover → fetch → extract → summarize → write
  Services/
    Discovery/                  # URL discovery strategies (WP REST, Sitemap, RSS, Crawl)
    Extraction/                 # Heuristic HTML → Markdown content extraction
    Fetching/                   # HTTP, Headless Chromium, WP REST content fetchers
    Output/                     # File writer, manifest store, llms.txt builder
    Summarization/              # Ollama summarizer
    WordPress/                  # WP REST API client
  Utils/                        # Challenge detection, cookie loading, hashing, Playwright session
```
```shell
dotnet build
dotnet run
```

- Primarily interactive prompts; CLI flags are available but still limited
- Single model provider (Ollama)
- Heuristic extraction may miss content in complex SPA frameworks
- Cookie files must be manually exported from a browser session
- Headless browser adds latency (~5–15s per page) when triggered
Possible next steps:
- External cache source fallback (Google Cache / Wayback Machine)
- Stealth browser patches for more aggressive bot protection