fuseraft/kiwi-scrape

scrape

A lightweight CLI web scraper written in Kiwi. Fetches pages, extracts data with CSS/XPath selectors, supports controlled crawling, and outputs in multiple formats.

Usage

kiwi scrape [OPTIONS] URL

If the script is marked executable (chmod +x scrape.kiwi) and your system supports shebangs, you can also run it as:

./scrape.kiwi [OPTIONS] URL

Help

kiwi scrape --help
Usage: scrape [OPTIONS] URL

  Lightweight, fast CLI web scraper. Fetches pages, extracts data with CSS/XPath,
  supports controlled crawling, and outputs in multiple formats.

Options:
  -h, --help                Show this help message and exit.
  -o, --output PATH         Output path. For single page: file. For crawl (depth>0):
                            directory (default: ./scraped_<domain>).
  -f, --format FORMAT       Output format: html, text, markdown, json (default: text).
      --css SELECTOR        CSS selector to extract matching elements.
      --xpath XPATH         XPath selector to extract matching elements.
      --attr ATTRIBUTE      Extract only this attribute (e.g. href, src, alt).
      --text                Extract only visible text content (default behavior).
      --links               Extract all <a> hrefs (one per line or JSON array).
      --images              Extract all <img> src + alt (overrides other modes).
  -d, --depth INT           Crawl depth (0 = single page only, default: 0).
      --max-pages INT       Maximum pages to crawl (default: 20).
  -u, --user-agent TEXT     Custom User-Agent.
  -H, --header TEXT         Custom header (KEY:VALUE).
      --cookies TEXT        Cookie string (name=value; name2=value2).
      --delay FLOAT         Seconds between requests during crawl (default: 1.0).
      --timeout INT         Request timeout in seconds (default: 15).
      --proxy URL           Proxy URL (not yet implemented).
      --no-verify           Skip SSL certificate verification (not yet implemented).
  -v, --verbose             Verbose logging (shows progress, skipped URLs).
  -q, --quiet               Silent except errors.
      --json                Force JSON output (structured when selectors used).
      --no-respect-robots   Disable robots.txt enforcement.

Examples:
  scrape https://example.com
  scrape --links https://example.com
  scrape -d 2 --max-pages 10 https://example.com
  scrape --css "article h2" --text https://news.ycombinator.com

Flags Reference

| Flag | Description | Default |
| --- | --- | --- |
| -h, --help | Show help and exit | - |
| -o, --output PATH | File (single page) or directory (crawl) for output | stdout / ./scraped_<domain> |
| -f, --format FORMAT | html, text, markdown, json | text |
| --css SELECTOR | CSS selector (tag, .class, #id, tag.class, comma-separated) | - |
| --xpath EXPR | XPath expression (//tag, //tag[@attr='val']) | - |
| --attr ATTRIBUTE | Extract only this attribute from matched elements | - |
| --text | Extract visible text content (default mode) | - |
| --links | Extract all <a> hrefs | - |
| --images | Extract <img> src + alt text | - |
| -d, --depth INT | Crawl depth; follows internal links N levels deep | 0 |
| --max-pages INT | Hard limit on pages crawled | 20 |
| -u, --user-agent TEXT | Custom User-Agent string | scrape/2.1 (+https://github.com/fuseraft/kiwi) |
| -H, --header TEXT | Custom HTTP header as KEY:VALUE | - |
| --cookies TEXT | Cookie string (name=value; name2=value2) | - |
| --delay FLOAT | Seconds between crawl requests | 1.0 |
| --timeout INT | Request timeout in seconds | 15 |
| --proxy URL | HTTP/HTTPS proxy (not yet implemented) | - |
| --no-verify | Skip SSL certificate verification (not yet implemented) | - |
| -v, --verbose | Show progress and debug info | - |
| -q, --quiet | Suppress all output except errors | - |
| --json | Structured JSON output (array of objects when selectors used) | - |
| --no-respect-robots | Disable robots.txt enforcement | robots.txt honored by default |
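By default the scraper honors robots.txt, and --no-respect-robots turns that check off. As a rough sketch of what such a check involves (illustrative Python using the standard library's robot parser, not the Kiwi implementation):

```python
from urllib.robotparser import RobotFileParser

# Sketch of a robots.txt allow/deny check; the rules below are a made-up example.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A crawler would call can_fetch() with its User-Agent before each request.
print(rp.can_fetch("scrape/2.1", "https://example.com/"))           # True
print(rp.can_fetch("scrape/2.1", "https://example.com/private/x"))  # False
```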

Notes:

  • If both --css and --xpath are given, CSS takes precedence.
  • --images and --links override selector modes.
  • JSON output with selectors returns objects like {"tag": "a", "text": "...", "href": "..."}.
  • When --depth > 0, only same-domain links are followed and each page is saved as page-<n>.<ext> inside the output directory.
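The same-domain rule and the page-<n>.<ext> naming above can be sketched like this (illustrative Python; function names are hypothetical and not taken from the Kiwi source):

```python
from urllib.parse import urlparse, urljoin

def same_domain_links(base_url, hrefs):
    # Resolve each href against the page URL and keep only links whose
    # host matches the start URL, as the crawler does when depth > 0.
    host = urlparse(base_url).netloc
    kept = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == host:
            kept.append(absolute)
    return kept

def page_filename(n, fmt):
    # page-<n>.<ext> naming used inside the crawl output directory.
    ext = {"text": "txt", "markdown": "md", "html": "html", "json": "json"}[fmt]
    return f"page-{n}.{ext}"

print(same_domain_links("https://example.com/a", ["/b", "https://other.org/c"]))
print(page_filename(1, "markdown"))  # page-1.md
```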

Examples

1. Basic single-page scrape (plain text)

kiwi scrape https://example.com
Example Domain Example Domain This domain is for use in documentation examples without needing permission. Avoid use in operations. Learn more

2. Save full HTML to file

kiwi scrape --format html -o example.html https://example.com
✓ Saved 1 page to example.html (0.5 KB)

3. Extract all links (one per line)

kiwi scrape --links https://example.com
https://iana.org/domains/example
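Conceptually, --links collects every <a> href and resolves relative paths against the page URL. A minimal sketch with Python's standard-library HTML parser (the Kiwi implementation may differ):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    # Collect href attributes from <a> tags, resolved to absolute URLs.
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

parser = LinkExtractor("https://example.com")
parser.feed('<p><a href="/domains/example">more</a></p>')
print(parser.links)  # ['https://example.com/domains/example']
```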

4. Extract elements with a CSS selector

kiwi scrape --css "h1" https://example.com
Example Domain

5. CSS selector with JSON output

kiwi scrape --css "h1" --json https://example.com
[{"tag": "h1", "text": "Example Domain"}]
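The object shape above ({"tag": ..., "text": ...}) could be produced by walking matched elements and pairing each tag name with its text content. A hedged Python sketch of that idea (the class name is hypothetical, and scrape's actual selector engine is written in Kiwi):

```python
import json
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    # Build {"tag", "text"} records for a set of wanted tag names,
    # mirroring the shape of scrape's --json selector output.
    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.records = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self._current = {"tag": tag, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current["tag"]:
            self.records.append(self._current)
            self._current = None

collector = TagTextCollector(["h1"])
collector.feed("<html><body><h1>Example Domain</h1></body></html>")
print(json.dumps(collector.records))  # [{"tag": "h1", "text": "Example Domain"}]
```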

6. Extract images with verbose logging

kiwi scrape --images -v https://example.com
[INFO] Fetching https://example.com...
[INFO] Found 0 images

7. Crawl a site (depth 1, markdown output)

kiwi scrape -d 1 --max-pages 5 --format markdown -o ./myblog https://example.com
Crawling https://example.com (depth 1)...
✓ Page 1/5: https://example.com                           → ./myblog/page-1.md
✓ Crawl finished. 1 pages saved to ./myblog

8. Custom header and CSS selector with JSON

kiwi scrape -H "Authorization: Bearer token123" --css "h1, h2" --json https://example.com
[{"tag": "h1", "text": "Example Domain"}]

9. Save crawl output quietly

kiwi scrape -d 2 --max-pages 10 -q -o ./out https://example.com

Running the Tests

kiwi tests/test.kiwi

The test suite covers all scraper functions: URL parsing, link/image extraction, CSS/XPath selectors, Markdown conversion, robots.txt parsing, and more.
