site2llms


A developer-friendly tool that turns any website into a structured, deployable set of AI-ready Markdown artifacts with a host-level llms.txt index.

site2llms discovers pages, extracts readable content, summarizes each page via a local Ollama model, and writes structured output under output/<host>/ — ready to serve, embed, or feed into any LLM workflow.

Why site2llms?

Most tools in this space either stop at SEO analysis, produce a single flat index file, or require manual curation. site2llms is different:

  • Deployable artifacts, not reports. The output is a complete directory of Markdown files with YAML frontmatter and an llms.txt index — ready to drop into a static site, a docs bundle, or a RAG pipeline. It's not a one-off analysis; it's a repeatable build step.
  • Full pages, not just links. Every discovered page gets its own structured summary with TL;DR, key points, FAQ, and metadata. An LLM consuming these files gets real content, not a table of contents.
  • Incremental by default. A content-hash manifest (manifest.json) tracks what changed. Re-running the tool only processes new or updated pages — suitable for CI/CD or scheduled regeneration.
  • Handles the real web. Cloudflare challenges, SiteGround CAPTCHAs, JS-rendered SPAs, WordPress Elementor themes — the layered fetch pipeline (HTTP → headless Chromium → cookie injection) deals with sites as they actually exist, not as they ideally should.
  • Developer-friendly. Interactive prompts for quick runs, structured output for automation, clean separation of discover → fetch → extract → summarize → write stages. No external SaaS dependencies — just .NET, Ollama, and optionally Playwright.

What it does

  • Discovers URLs using ordered strategies: WordPress REST API → Sitemap → RSS/Atom feeds → Crawl fallback.
  • Detects WordPress sites automatically and uses the REST API to get server-rendered content (bypasses JS-dependent themes like Elementor).
  • Fetches page HTML with browser-like headers; automatically retries with a headless Chromium browser when bot-protection or challenge pages are detected.
  • Supports cookie injection from a Netscape/JSON cookie file to bypass CAPTCHAs and authentication gates.
  • Supports optional URL include/exclude filters to constrain discovery to specific sections of a site.
  • Extracts main content (main, article, role/content selectors, then body) and strips boilerplate.
  • Converts extracted content to Markdown.
  • Calls Ollama /api/generate to produce structured summaries (TL;DR, key points, FAQ, context).
  • Writes one summary file per page in output/<host>/ai/pages/*.md with YAML frontmatter.
  • Builds/updates output/<host>/llms.txt — a sorted, host-level index of all summarized pages.
  • Optionally builds output/<host>/llms-full.txt — a single full-corpus text file containing the index plus every generated summary.
  • Maintains output/<host>/manifest.json for content-hash caching so unchanged pages are skipped on re-runs.

Tech stack

Component                         Purpose
.NET 8.0 console app              Runtime
AngleSharp 1.4.0                  HTML parsing & DOM querying
Microsoft.Playwright 1.55.0       Headless Chromium for JS-rendered/protected sites
ReverseMarkdown 5.2.0             HTML → Markdown conversion
System.ServiceModel.Syndication   RSS/Atom feed parsing
Ollama API                        Local LLM summarization

Requirements

  • A running Ollama instance with the chosen model pulled (used for summarization; see Troubleshooting → Ollama errors)
  • Only if building from source: .NET SDK 8.x

Getting started

Option 1 — Download a pre-built release (recommended)

  1. Go to the Releases page.

  2. Download the archive for your operating system:

    OS                   Asset to download
    Windows x64          site2llms-win-x64.zip
    Linux x64            site2llms-linux-x64.tar.gz
    macOS x64            site2llms-osx-x64.tar.gz
    macOS Apple Silicon  site2llms-osx-arm64.tar.gz
  3. Extract the archive:

    Windows (PowerShell):

    Expand-Archive site2llms-win-x64.zip -DestinationPath site2llms

    Linux / macOS:

    tar xzf site2llms-linux-x64.tar.gz
  4. (First run only) Install the Playwright Chromium browser if you need headless fallback:

    Windows:

    pwsh site2llms/playwright.ps1 install chromium

    Linux / macOS:

    ./site2llms/playwright.sh install chromium
  5. Run the tool — see Usage below.

Option 2 — Build from source

  1. Ensure you have the .NET 8 SDK installed.

  2. Clone and build:

    git clone https://github.com/giacomo1215/site2llms.git
    cd site2llms
    dotnet build
  3. (First run only) Install Playwright browsers:

    pwsh bin/Debug/net8.0/playwright.ps1 install chromium
  4. Run via dotnet run (insert -- before any CLI flags):

    dotnet run
    dotnet run -- --url https://example.com --max-pages 50

Usage

site2llms supports two modes: CLI (pass arguments) and interactive (answer prompts).

CLI mode

Pass at least --url to activate CLI mode. All other flags are optional with sensible defaults.

site2llms --url https://example.com

Available flags

Flag                    Description                                                     Default
--url <URL>             Root URL to crawl (required)
--max-pages <N>         Maximum number of pages to process                              200
--max-depth <N>         Maximum BFS crawl depth for discovery                           3
--delay <ms>            Politeness delay between requests (ms)                          250
--ollama-url <URL>      Ollama API base URL                                             http://localhost:11434
--ollama-model <NAME>   Ollama model identifier                                         minimax-m2.5:cloud
--cookies <PATH>        Path to a Netscape/JSON cookie file
--include <PATTERN>     Include only URLs matching the pattern; repeatable
--exclude <PATTERN>     Exclude URLs matching the pattern; repeatable
--same-host-only        Restrict discovery to same host (on by default)
--no-same-host          Allow cross-host discovery
--dry-run               Discover URLs only — skip fetching, summarization and output
--llms-full             Also generate llms-full.txt with the full page corpus
-h, --help              Show help message and exit

Examples

# Minimal — crawl with defaults
site2llms --url https://example.com

# Limit scope
site2llms --url https://example.com --max-pages 50 --max-depth 2

# Use a different model
site2llms --url https://example.com --ollama-model llama3

# Bypass protection with cookies
site2llms --url https://protected-site.com --cookies cookies.txt

# Keep only documentation pages and skip blog/archive areas
site2llms --url https://example.com --include "*docs*" --exclude "*blog*" --exclude "*tag*"

# Preview discovered URLs without processing
site2llms --url https://example.com --dry-run

# Also emit llms-full.txt
site2llms --url https://example.com --llms-full

# Full example
site2llms --url https://example.com --max-pages 50 --delay 500 --cookies cookies.txt --ollama-model llama3 --include "*/docs/*" --llms-full

When building from source, prefix with dotnet run -- :

dotnet run -- --url https://example.com --dry-run

Interactive mode

Run without arguments to enter interactive mode, where the tool prompts for each option:

site2llms
Answer the interactive prompts (press Enter to accept each default):

Prompt                                                  Default
Root URL                                                https://example.com
Max pages                                               200
Max depth for crawl fallback                            3
Delay ms between requests                               250
Ollama base URL                                         http://localhost:11434
Ollama model                                            minimax-m2.5:cloud
Cookie file (Netscape/JSON)                             (blank to skip)
Include URL patterns (comma-separated, blank for none)  (blank to skip)
Exclude URL patterns (comma-separated, blank for none)  (blank to skip)
Generate llms-full.txt with full page corpus            No

Example run

site2llms - Universal website summarizer
Root URL [https://example.com]: https://example.com
Max pages [200]: 3
Max depth for crawl fallback [3]: 2
Delay ms between requests [250]: 100
Ollama base URL [http://localhost:11434]:
Ollama model [minimax-m2.5:cloud]:
Cookie file (Netscape/JSON, blank to skip) []:
Include URL patterns (comma-separated, blank for none) []:
Exclude URL patterns (comma-separated, blank for none) []:
Generate llms-full.txt with full page corpus [y/N]: n
WP REST detected: no
Discovered 3 pages.
Processing: https://example.com/
...
Run completed.
Discovered: 3
Processed:  2
Skipped:    1 (cache hits: 1)
Failed:     0
Output:     C:\...\output\example.com

Example with a protected site

Processing: https://protected-site.com/
  Protection detected: SiteGround CAPTCHA (SGCaptcha) — retrying with headless browser...
  Headless browser also blocked: SiteGround CAPTCHA (SGCaptcha)
  Tip: supply a cookie file (--cookies) from a real browser session to bypass this protection.
  Skipped: Extracted markdown too short (<50 chars)

Output structure

For a root URL like https://example.com:

output/
  example.com/
    llms.txt              # host-level index of all summarized pages
    llms-full.txt         # optional full-corpus export with all summaries
    manifest.json         # content-hash cache for incremental runs
    ai/
      pages/
        home.md           # structured summary with YAML frontmatter
        about_us.md
        contattaci_php.md

ai/pages/*.md

Each page file contains YAML frontmatter and a structured Markdown body:

---
title: "Page Title"
source_url: "https://example.com/page"
fetched_at: "2025-01-15T10:30:00Z"
content_hash: "sha256hex..."
generator: "site2llms + Ollama"
---

The body follows a consistent template: TL;DR (2–4 bullets), Key points (5–10 bullets), Useful context (content type, services, deliverables), FAQ (5–8 Q&A pairs), and a Reference link back to the source URL.

llms.txt

A sorted index with the site root, a short description, and one entry per page:

# llms.txt for example.com

Site root: https://example.com
Short description: AI-friendly markdown summaries generated by site2llms.

## Index
- About Us: https://example.com/ai/pages/about_us.md
- Home: https://example.com/ai/pages/home.md

llms-full.txt

Generated only when requested with --llms-full or by answering yes in interactive mode. It contains the same top-level site metadata, a sorted index, and then the full markdown body for every generated page summary in a single file.

manifest.json

Per-URL cache metadata (url, contentHash, relativeOutputPath, lastGeneratedAt, title). If a page's content hash hasn't changed since the last run, it's skipped as a cache hit.
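
In code, the cache check reduces to a hash comparison. A minimal sketch, assuming a ManifestEntry record that mirrors the fields above (the real types live in ManifestStore and may differ):

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical record mirroring the manifest fields listed above.
public sealed record ManifestEntry(
    string Url, string ContentHash, string RelativeOutputPath,
    DateTimeOffset LastGeneratedAt, string Title);

public static class CacheCheck
{
    // SHA-256 over the extracted markdown, hex-encoded like the
    // content_hash frontmatter field.
    public static string Hash(string markdown) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(markdown)))
               .ToLowerInvariant();

    // A page is a cache hit when its current hash matches the manifest.
    public static bool IsCacheHit(
        IReadOnlyDictionary<string, ManifestEntry> manifest,
        string url, string markdown) =>
        manifest.TryGetValue(url, out var entry) &&
        entry.ContentHash == Hash(markdown);
}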

Processing pipeline

Discover → Fetch → Extract → Summarize → Write → Build index
  1. Discover — CompositeDiscovery runs strategies in order (WP REST → Sitemap → RSS/Atom → Crawl), merges their results, and deduplicates URLs; when the same URL is found multiple times, the earliest strategy keeps precedence.
  2. Fetch — WordPress REST content.rendered (if WP detected) → HTTP with browser headers → Headless Chromium fallback. Cookies injected into both HTTP and headless paths.
  3. Filter — include/exclude URL patterns are applied during discovery; --exclude takes precedence over --include.
  4. Extract — HeuristicContentExtractor selects the best content container, strips boilerplate, converts to Markdown.
  5. Cache check — SHA-256 content hash compared against manifest.json; unchanged pages are skipped.
  6. Summarize — OllamaSummarizer calls /api/generate (temperature 0.2) with a structured prompt template (sketched below).
  7. Write — FileOutputWriter persists the summary file; ManifestStore updates the cache.
  8. Build index — LlmsTxtBuilder generates the llms.txt file (sorted by title, deduplicated by filename slug).
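
Step 6 boils down to one non-streaming Ollama request per page. A hedged sketch of such a call (OllamaClient and SummarizeAsync are illustrative names, not the project's actual types; the prompt template itself is not shown):

using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json;
using System.Threading.Tasks;

public static class OllamaClient
{
    // One non-streaming /api/generate call, as in pipeline step 6.
    public static async Task<string> SummarizeAsync(
        HttpClient http, string baseUrl, string model, string prompt)
    {
        var payload = new
        {
            model,
            prompt,
            stream = false,
            options = new { temperature = 0.2 }
        };
        using var response = await http.PostAsJsonAsync(
            $"{baseUrl.TrimEnd('/')}/api/generate", payload);
        response.EnsureSuccessStatusCode();

        using var doc = JsonDocument.Parse(
            await response.Content.ReadAsStringAsync());
        // With stream = false, Ollama returns the whole completion in
        // the "response" field of a single JSON object.
        return doc.RootElement.GetProperty("response").GetString() ?? "";
    }
}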

Discovery strategies

Strategies are run in order, and their results are merged. If the same URL is found by multiple strategies, the earliest strategy keeps precedence for that URL.

Strategy         When it's used                How it works
WordPress REST   WP sites (auto-detected)      Probes /wp-json/ and /?rest_route=/, fetches wp/v2/pages + wp/v2/posts with pagination, skips attachments and password-protected posts, caches content.rendered in-memory, then applies URL filters
Sitemap          Any site with XML sitemaps    Tries /sitemap.xml, /sitemap_index.xml, /wp-sitemap.xml; supports both sitemapindex and urlset
RSS/Atom         Feed-enabled sites            Tries /feed/, /rss, /rss.xml, /feed.xml; extracts page links from feed items
Crawl            Fallback for all other sites  BFS crawl from root URL; honors MaxDepth, MaxPages, DelayMs, same-host filtering, and skips the root entirely if it does not match the active include/exclude rules
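
The first-wins merge described above is easy to sketch (DiscoveryMerge is an illustrative name; the project's CompositeDiscovery may differ in detail):

using System;
using System.Collections.Generic;

public static class DiscoveryMerge
{
    // Merge results in strategy order; the first strategy to yield a URL
    // keeps it, which is the precedence rule described above.
    public static IReadOnlyList<string> Merge(
        IEnumerable<IEnumerable<string>> resultsInStrategyOrder)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var merged = new List<string>();
        foreach (var strategyResults in resultsInStrategyOrder)
            foreach (var url in strategyResults)
                if (seen.Add(url))        // Add returns false on duplicates
                    merged.Add(url);
        return merged;
    }
}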

URL filtering

Use URL filters when you want to limit processing to a subsection of a site or skip noisy areas such as tag pages, archives, account screens, or search results.

  • --include is repeatable in CLI mode and acts as an allow-list. If at least one include pattern is provided, a URL must match one of them to be processed.
  • --exclude is repeatable in CLI mode and always wins over --include.
  • Interactive mode accepts comma-separated include/exclude patterns and splits them into multiple rules.
  • Pattern matching is case-insensitive.
  • Patterns containing * are treated as wildcards and matched against the full absolute URL.
  • Patterns without * use a case-insensitive substring match against the full absolute URL.
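
A minimal sketch of the matching rules above (UrlFilter is an illustrative name; the tool's exact semantics may differ):

using System;
using System.Text.RegularExpressions;

public static class UrlFilter
{
    // '*' patterns become anchored regexes; everything else is a
    // case-insensitive substring test, per the rules above.
    public static bool Matches(string url, string pattern)
    {
        if (pattern.Contains('*'))
        {
            var regex = "^" + Regex.Escape(pattern).Replace(@"\*", ".*") + "$";
            return Regex.IsMatch(url, regex, RegexOptions.IgnoreCase);
        }
        return url.Contains(pattern, StringComparison.OrdinalIgnoreCase);
    }

    // Exclude always wins; includes act as an allow-list once present.
    public static bool IsAllowed(string url, string[] includes, string[] excludes)
    {
        foreach (var e in excludes)
            if (Matches(url, e)) return false;
        if (includes.Length == 0) return true;
        foreach (var i in includes)
            if (Matches(url, i)) return true;
        return false;
    }
}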

Examples:

# Only product pages
site2llms --url https://example.com --include "*/products/*"

# Keep docs, but skip changelog and tag pages
site2llms --url https://example.com --include "*docs*" --exclude "*changelog*" --exclude "*tag*"

# Interactive mode accepts comma-separated values
Include URL patterns (comma-separated, blank for none) []: *docs*, */guides/*
Exclude URL patterns (comma-separated, blank for none) []: *tag*, *page/*

Extraction heuristics

  • Preferred containers: main → article → [role='main'] → .content / .entry-content / .post-content → body
  • Boilerplate removal: strips script, style, noscript, nav, footer, header, aside
  • Markdown conversion: ReverseMarkdown with GitHub-flavored output; plain-text fallback if HTML→MD yields empty
  • Skip threshold: pages with extracted markdown shorter than 50 characters are skipped
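
With AngleSharp this heuristic is only a few lines. A sketch under the assumptions above (ContentPick is an illustrative name; the project's HeuristicContentExtractor may choose differently):

using AngleSharp.Dom;
using AngleSharp.Html.Parser;

public static class ContentPick
{
    // Walk the preferred-container list above, fall back to <body>,
    // then strip the boilerplate elements before Markdown conversion.
    public static IElement? PickMain(string html)
    {
        var doc = new HtmlParser().ParseDocument(html);
        var node = doc.QuerySelector("main")
                ?? doc.QuerySelector("article")
                ?? doc.QuerySelector("[role='main']")
                ?? doc.QuerySelector(".content, .entry-content, .post-content")
                ?? (IElement?)doc.Body;
        if (node is null) return null;
        foreach (var junk in node.QuerySelectorAll(
                     "script, style, noscript, nav, footer, header, aside"))
            junk.Remove();
        return node;
    }
}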

Fetching & protection bypass

The fetch pipeline has three layers:

Layer              Description
HTTP fetch         Fast, lightweight HttpClient with browser-like headers and automatic gzip/brotli decompression
Headless Chromium  Automatic fallback when the HTTP response is blocked or too thin (<600 bytes). Uses Playwright with NetworkIdle wait and stealth settings (--disable-blink-features=AutomationControlled, navigator.webdriver removal)
Cookie injection   Cookies from a Netscape/JSON file are injected into both HttpClient (CookieContainer) and the Playwright browser context, domain-filtered
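
The escalation decision between the first two layers can be sketched as a single predicate (FetchEscalation is an illustrative name; the byte threshold comes from the table above):

using System.Text;

public static class FetchEscalation
{
    // Fall back to headless Chromium when the HTTP response is blocked
    // or too thin (<600 bytes), per the table above.
    public static bool NeedsHeadless(string html, bool blocked) =>
        blocked || Encoding.UTF8.GetByteCount(html) < 600;
}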

Challenge detection

The app recognizes 13 common protection patterns and reports each one explicitly. Examples include:

Pattern                            Label
SGCaptcha / .well-known/sgcaptcha  SiteGround CAPTCHA
cf-challenge / "Just a moment"     Cloudflare challenge
"Attention Required"               Cloudflare block page
hCaptcha / g-recaptcha             CAPTCHA challenges
"Checking your browser"            Browser verification
"DDoS protection by"               DDoS interstitial
"enable javascript"                JS-required gate

When a challenge is detected on the root URL, a warm-up session launches a headless browser to solve it. If successful, the browser session is reused for all subsequent requests.
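
Detection amounts to case-insensitive substring checks against the fetched body. A sketch covering a subset of the patterns tabled above (ChallengeDetector is an illustrative name; the real detector may also inspect status codes and headers):

public static class ChallengeDetector
{
    // (substring, label) pairs for a few of the patterns tabled above.
    private static readonly (string Needle, string Label)[] Patterns =
    {
        ("sgcaptcha",             "SiteGround CAPTCHA (SGCaptcha)"),
        ("just a moment",         "Cloudflare challenge"),
        ("attention required",    "Cloudflare block page"),
        ("g-recaptcha",           "CAPTCHA challenge"),
        ("checking your browser", "Browser verification"),
        ("enable javascript",     "JS-required gate"),
    };

    // Returns the matched label, or null when the page looks clean.
    public static string? Detect(string html)
    {
        var lower = html.ToLowerInvariant();
        foreach (var (needle, label) in Patterns)
            if (lower.Contains(needle))
                return label;
        return null;
    }
}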

Cookie file format

Two formats are supported:

Netscape/Mozilla cookie.txt (exported by browser extensions like "Get cookies.txt LOCALLY"):

# Netscape HTTP Cookie File
.example.com	TRUE	/	FALSE	0	session_id	abc123
.example.com	TRUE	/	TRUE	0	__cf_bm	xyz789

JSON array:

[
  { "name": "session_id", "value": "abc123", "domain": ".example.com", "path": "/" },
  { "name": "__cf_bm", "value": "xyz789", "domain": ".example.com", "path": "/" }
]
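
Parsing the Netscape format is straightforward: each non-comment line carries seven tab-separated fields. A sketch that loads them into System.Net.Cookie objects (NetscapeCookies is an illustrative name):

using System;
using System.Collections.Generic;
using System.Net;

public static class NetscapeCookies
{
    // Fields per line: domain, include-subdomains, path, secure,
    // expiry, name, value.
    public static List<Cookie> Parse(IEnumerable<string> lines)
    {
        var cookies = new List<Cookie>();
        foreach (var line in lines)
        {
            if (line.StartsWith('#') || string.IsNullOrWhiteSpace(line))
                continue;                      // skip comments and blanks
            var f = line.Split('\t');
            if (f.Length < 7) continue;
            cookies.Add(new Cookie(f[5], f[6], f[2], f[0])
            {
                Secure = f[3].Equals("TRUE", StringComparison.OrdinalIgnoreCase)
            });
        }
        return cookies;
    }
}

The resulting list can be added to an HttpClient's CookieContainer and mapped into the Playwright browser context, which matches the dual injection described above.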

How to bypass a protected site

  1. Open the target site in a real browser and solve the CAPTCHA/challenge.
  2. Export cookies using a browser extension (e.g., "Get cookies.txt LOCALLY") as a .txt file.
  3. Run site2llms and provide the path when prompted:
    Cookie file (Netscape/JSON, blank to skip) []: cookies.txt
    Cookies loaded from: cookies.txt
    

Configuration notes

  • SameHostOnly defaults to true — only same-host URLs are discovered and processed (override with --no-same-host).
  • URL filters are evaluated against the full absolute URL and are enforced during discovery, before fetch/extract/summarize work begins.
  • HTTP client timeout is 90 seconds.
  • Both website fetching and Ollama calls use browser-like HTTP headers.
  • Automatic decompression (gzip, deflate, Brotli) is enabled.
  • WordPress REST requests include retry with exponential backoff on 429/503 responses.
  • Ollama summarization uses temperature 0.2 for deterministic, low-variance output.
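
The 429/503 retry note can be sketched as a small wrapper (Backoff is an illustrative name; the WP REST client's actual attempt count and delays are not documented here, so the values below are guesses):

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class Backoff
{
    // Retry GETs on 429/503 with doubling delays.
    public static async Task<HttpResponseMessage> GetWithRetryAsync(
        HttpClient http, string url, int maxAttempts = 4)
    {
        var delay = TimeSpan.FromSeconds(1);
        for (var attempt = 1; ; attempt++)
        {
            var response = await http.GetAsync(url);
            var retryable = response.StatusCode is HttpStatusCode.TooManyRequests
                                                or HttpStatusCode.ServiceUnavailable;
            if (!retryable || attempt == maxAttempts)
                return response;
            response.Dispose();
            await Task.Delay(delay);
            delay *= 2;                         // exponential backoff
        }
    }
}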

Troubleshooting

Timeout errors

  • Check website responsiveness and network reachability.
  • Reduce MaxPages for first runs.
  • Increase DelayMs for rate-limited sites (does not increase HTTP timeout).
  • If needed, increase HttpClient.Timeout in Program.cs.

Ollama errors

  • Ensure Ollama is reachable at the configured base URL.
  • Verify the selected model exists (ollama list).
  • Pull the model if needed (ollama pull <model-name>).

Empty or skipped content

  • Some pages are mostly navigation or script-rendered and may be skipped.
  • Try specific content URLs instead of generic landing pages.
  • For WordPress/Elementor sites, the WP REST API path usually gets real content even when HTML fetch fails.

Unexpectedly missing pages

  • Check whether --include is too restrictive; once present, only matching URLs are allowed.
  • Check whether an --exclude pattern is overriding an include rule.
  • Remember that matching is done against the full absolute URL, not only the path segment.
  • In interactive mode, separate multiple patterns with commas.

Protected / CAPTCHA sites

If you see:

Protection detected: SiteGround CAPTCHA (SGCaptcha) — retrying with headless browser...
Tip: supply a cookie file (--cookies) from a real browser session to bypass this protection.

  1. Visit the site in a real browser and complete the challenge.
  2. Export cookies (Netscape .txt or JSON) and provide the file path when running the app.
  3. Cookies are injected into both HTTP and headless browser requests automatically.

Project structure

Program.cs                          # Entry point — interactive prompts, DI wiring, run
Core/
  Models/                           # Data records (CrawlOptions, PageContent, Manifest, …)
  Pipeline/
    SummarizationPipeline.cs        # Orchestrates discover → fetch → extract → summarize → write
  Services/
    Discovery/                      # URL discovery strategies (WP REST, Sitemap, RSS, Crawl)
    Extraction/                     # Heuristic HTML → Markdown content extraction
    Fetching/                       # HTTP, Headless Chromium, WP REST content fetchers
    Output/                         # File writer, manifest store, llms.txt builder
    Summarization/                  # Ollama summarizer
    WordPress/                      # WP REST API client
  Utils/                            # Challenge detection, cookie loading, hashing, Playwright session

Development

dotnet build
dotnet run

Current limitations

  • Primarily interactive prompts; CLI flags are available but still limited
  • Single model provider (Ollama)
  • Heuristic extraction may miss content in complex SPA frameworks
  • Cookie files must be manually exported from a browser session
  • Headless browser adds latency (~5–15s per page) when triggered

Possible next steps:

  • External cache source fallback (Google Cache / Wayback Machine)
  • Stealth browser patches for more aggressive bot protection
