A lightweight CLI web scraper written in Kiwi. Fetches pages, extracts data with CSS/XPath selectors, supports controlled crawling, and outputs in multiple formats.
```
kiwi scrape [OPTIONS] URL
```

If the script is marked executable (`chmod +x scrape.kiwi`) and your system supports shebangs, you can also run it as:

```
./scrape.kiwi [OPTIONS] URL
```

To view the full help text:

```
kiwi scrape --help
```

```
Usage: scrape [OPTIONS] URL

Lightweight, fast CLI web scraper. Fetches pages, extracts data with CSS/XPath,
supports controlled crawling, and outputs in multiple formats.

Options:
  -h, --help             Show this help message and exit.
  -o, --output PATH      Output path. For single page: file. For crawl (depth>0):
                         directory (default: ./scraped_<domain>).
  -f, --format FORMAT    Output format: html, text, markdown, json (default: text).
  --css SELECTOR         CSS selector to extract matching elements.
  --xpath XPATH          XPath selector to extract matching elements.
  --attr ATTRIBUTE       Extract only this attribute (e.g. href, src, alt).
  --text                 Extract only visible text content (default behavior).
  --links                Extract all <a> hrefs (one per line or JSON array).
  --images               Extract all <img> src + alt (overrides other modes).
  -d, --depth INT        Crawl depth (0 = single page only, default: 0).
  --max-pages INT        Maximum pages to crawl (default: 20).
  -u, --user-agent TEXT  Custom User-Agent.
  -H, --header TEXT      Custom header (KEY:VALUE).
  --cookies TEXT         Cookie string (name=value; name2=value2).
  --delay FLOAT          Seconds between requests during crawl (default: 1.0).
  --timeout INT          Request timeout in seconds (default: 15).
  --proxy URL            Proxy URL (not yet implemented).
  --no-verify            Skip SSL certificate verification (not yet implemented).
  -v, --verbose          Verbose logging (shows progress, skipped URLs).
  -q, --quiet            Silent except errors.
  --json                 Force JSON output (structured when selectors used).
  --no-respect-robots    Disable robots.txt enforcement.

Examples:
  scrape https://example.com
  scrape --links https://example.com
  scrape -d 2 --max-pages 10 https://example.com
  scrape --css "article h2" --text https://news.ycombinator.com
```
| Flag | Description | Default |
|---|---|---|
| `-h, --help` | Show help and exit | – |
| `-o, --output PATH` | File (single page) or directory (crawl) for output | stdout / `./scraped_<domain>` |
| `-f, --format FORMAT` | `html`, `text`, `markdown`, `json` | `text` |
| `--css SELECTOR` | CSS selector (tag, `.class`, `#id`, `tag.class`, comma-separated) | – |
| `--xpath EXPR` | XPath expression (`//tag`, `//tag[@attr='val']`) | – |
| `--attr ATTRIBUTE` | Extract only this attribute from matched elements | – |
| `--text` | Extract visible text content (default mode) | – |
| `--links` | Extract all `<a>` hrefs | – |
| `--images` | Extract `<img>` src + alt text | – |
| `-d, --depth INT` | Crawl depth; follows internal links N levels deep | `0` |
| `--max-pages INT` | Hard limit on pages crawled | `20` |
| `-u, --user-agent TEXT` | Custom User-Agent string | `scrape/2.1 (+https://github.com/fuseraft/kiwi)` |
| `-H, --header TEXT` | Custom HTTP header as `KEY:VALUE` | – |
| `--cookies TEXT` | Cookie string (`name=value; name2=value2`) | – |
| `--delay FLOAT` | Seconds between crawl requests | `1.0` |
| `--timeout INT` | Request timeout in seconds | `15` |
| `--proxy URL` | HTTP/HTTPS proxy (not yet implemented) | – |
| `--no-verify` | Skip SSL certificate verification (not yet implemented) | – |
| `-v, --verbose` | Show progress and debug info | – |
| `-q, --quiet` | Suppress all output except errors | – |
| `--json` | Structured JSON output (array of objects when selectors used) | – |
| `--no-respect-robots` | Disable robots.txt enforcement | robots.txt honored by default |
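The `-H` and `--cookies` options take plain string formats: `KEY:VALUE` for a header and `name=value; name2=value2` for cookies. As a rough illustration of how such strings decompose (a Python sketch for clarity only; the scraper itself is written in Kiwi and its parsing may differ):

```python
# Conceptual sketch of the documented -H and --cookies string formats.
# Illustration only; not the Kiwi implementation.

def parse_header(raw: str) -> tuple[str, str]:
    """Split a KEY:VALUE header string on the first colon."""
    key, _, value = raw.partition(":")
    return key.strip(), value.strip()

def parse_cookies(raw: str) -> dict[str, str]:
    """Parse 'name=value; name2=value2' into a dict of cookie pairs."""
    cookies = {}
    for pair in raw.split(";"):
        if "=" in pair:
            name, _, value = pair.partition("=")
            cookies[name.strip()] = value.strip()
    return cookies

print(parse_header("Authorization: Bearer token123"))
print(parse_cookies("session=abc; theme=dark"))
```

Splitting on the first colon only is what lets header values like `Bearer token123` (or URLs) survive intact.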
Notes:
- If both `--css` and `--xpath` are given, CSS takes precedence.
- `--images` and `--links` override selector modes.
- JSON output with selectors returns objects like `{"tag": "a", "text": "...", "href": "..."}`.
- When `--depth > 0`, only same-domain links are followed and each page is saved as `page-<n>.<ext>` inside the output directory.
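To make the crawl rules above concrete, here is a hedged Python sketch (illustration only; the Kiwi logic may differ) of same-domain link filtering and `page-<n>.<ext>` naming. The `.md` extension for markdown matches the crawl example below; the other extensions are assumptions:

```python
# Illustration of the documented crawl rules: follow only links on the
# start URL's domain and save each page as page-<n>.<ext>.
from urllib.parse import urlparse

def same_domain(start_url: str, link: str) -> bool:
    """True if the link points at the same host as the crawl's start URL."""
    return urlparse(link).netloc == urlparse(start_url).netloc

def page_filename(n: int, fmt: str) -> str:
    """Map the output format to the per-page filename used in a crawl."""
    ext = {"html": "html", "text": "txt", "markdown": "md", "json": "json"}[fmt]
    return f"page-{n}.{ext}"

start = "https://example.com"
links = ["https://example.com/about", "https://iana.org/domains/example"]
print([link for link in links if same_domain(start, link)])
print(page_filename(1, "markdown"))
```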
1. Basic single-page scrape (plain text)
```
kiwi scrape https://example.com
```

```
Example Domain Example Domain This domain is for use in documentation examples without needing permission. Avoid use in operations. Learn more
```
2. Save full HTML to file
```
kiwi scrape --format html -o example.html https://example.com
```

```
✓ Saved 1 page to example.html (0.5 KB)
```
3. Extract all links (one per line)
```
kiwi scrape --links https://example.com
```

```
https://iana.org/domains/example
```
4. Extract elements with a CSS selector
```
kiwi scrape --css "h1" https://example.com
```

```
Example Domain
```
5. CSS selector with JSON output
```
kiwi scrape --css "h1" --json https://example.com
```

```
[{"tag": "h1", "text": "Example Domain"}]
```

6. Extract images with verbose logging
```
kiwi scrape --images -v https://example.com
```

```
[INFO] Fetching https://example.com...
[INFO] Found 0 images
```
7. Crawl a site (depth 1, markdown output)
```
kiwi scrape -d 1 --max-pages 5 --format markdown -o ./myblog https://example.com
```

```
Crawling https://example.com (depth 1)...
✓ Page 1/5: https://example.com → ./myblog/page-1.md
✓ Crawl finished. 1 pages saved to ./myblog
```
8. Custom header and CSS selector with JSON
```
kiwi scrape -H "Authorization: Bearer token123" --css "h1, h2" --json https://example.com
```

```
[{"tag": "h1", "text": "Example Domain"}]
```

9. Save crawl output quietly
```
kiwi scrape -d 2 --max-pages 10 -q -o ./out https://example.com
```

Testing

Run the test suite:

```
kiwi tests/test.kiwi
```

The test suite covers all scraper functions: URL parsing, link/image extraction, CSS/XPath selectors, Markdown conversion, robots.txt parsing, and more.
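The scraper honors robots.txt by default (see `--no-respect-robots`). As a rough illustration of what that enforcement means, here is a Python sketch using the standard library's robotparser; the scraper ships its own Kiwi-side robots.txt parsing, so this only demonstrates the concept of checking a URL before fetching it:

```python
# Conceptual demo of robots.txt enforcement: a URL is checked against
# the site's rules before any fetch. Illustration only, not the Kiwi code.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Allowed path: fetch proceeds.
print(rp.can_fetch("scrape/2.1", "https://example.com/index.html"))  # True
# Disallowed path: with robots honored, the crawler skips this URL.
print(rp.can_fetch("scrape/2.1", "https://example.com/private/x"))   # False
```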