fuseraft/kiwi-scrape

scrape

A lightweight CLI web scraper written in Kiwi. Fetches pages, extracts data with CSS/XPath selectors, supports controlled crawling, and outputs in multiple formats.

Usage

kiwi scrape [OPTIONS] URL

If the script is marked executable (chmod +x scrape.kiwi) and your system supports shebangs, you can also run it as:

./scrape.kiwi [OPTIONS] URL

Help

kiwi scrape --help
Usage: scrape [OPTIONS] URL

  Lightweight, fast CLI web scraper. Fetches pages, extracts data with CSS/XPath,
  supports controlled crawling, and outputs in multiple formats.

Options:
  -h, --help                Show this help message and exit.
  -o, --output PATH         Output path. For single page: file. For crawl (depth>0):
                            directory (default: ./scraped_<domain>).
  -f, --format FORMAT       Output format: html, text, markdown, json (default: text).
      --css SELECTOR        CSS selector to extract matching elements.
      --xpath XPATH         XPath selector to extract matching elements.
      --attr ATTRIBUTE      Extract only this attribute (e.g. href, src, alt).
      --text                Extract only visible text content (default behavior).
      --links               Extract all <a> hrefs (one per line or JSON array).
      --images              Extract all <img> src + alt (overrides other modes).
  -d, --depth INT           Crawl depth (0 = single page only, default: 0).
      --max-pages INT       Maximum pages to crawl (default: 20).
  -u, --user-agent TEXT     Custom User-Agent.
  -H, --header TEXT         Custom header (KEY:VALUE).
      --cookies TEXT        Cookie string (name=value; name2=value2).
      --delay FLOAT         Seconds between requests during crawl (default: 1.0).
      --timeout INT         Request timeout in seconds (default: 15).
      --proxy URL           Proxy URL (not yet implemented).
      --no-verify           Skip SSL certificate verification (not yet implemented).
  -v, --verbose             Verbose logging (shows progress, skipped URLs).
  -q, --quiet               Silent except errors.
      --json                Force JSON output (structured when selectors used).
      --no-respect-robots   Disable robots.txt enforcement.

Examples:
  scrape https://example.com
  scrape --links https://example.com
  scrape -d 2 --max-pages 10 https://example.com
  scrape --css "article h2" --text https://news.ycombinator.com

Flags Reference

| Flag | Description | Default |
| --- | --- | --- |
| -h, --help | Show help and exit | - |
| -o, --output PATH | File (single page) or directory (crawl) for output | stdout / ./scraped_<domain> |
| -f, --format FORMAT | html, text, markdown, json | text |
| --css SELECTOR | CSS selector (tag, .class, #id, tag.class, comma-separated) | - |
| --xpath EXPR | XPath expression (//tag, //tag[@attr='val']) | - |
| --attr ATTRIBUTE | Extract only this attribute from matched elements | - |
| --text | Extract visible text content (default mode) | - |
| --links | Extract all <a> hrefs | - |
| --images | Extract <img> src + alt text | - |
| -d, --depth INT | Crawl depth; follows internal links N levels deep | 0 |
| --max-pages INT | Hard limit on pages crawled | 20 |
| -u, --user-agent TEXT | Custom User-Agent string | scrape/2.1 (+https://github.com/fuseraft/kiwi) |
| -H, --header TEXT | Custom HTTP header as KEY:VALUE | - |
| --cookies TEXT | Cookie string (name=value; name2=value2) | - |
| --delay FLOAT | Seconds between crawl requests | 1.0 |
| --timeout INT | Request timeout in seconds | 15 |
| --proxy URL | HTTP/HTTPS proxy (not yet implemented) | - |
| --no-verify | Skip SSL certificate verification (not yet implemented) | - |
| -v, --verbose | Show progress and debug info | - |
| -q, --quiet | Suppress all output except errors | - |
| --json | Structured JSON output (array of objects when selectors used) | - |
| --no-respect-robots | Disable robots.txt enforcement | robots.txt honored by default |
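By default the scraper honors robots.txt, and --no-respect-robots turns that check off. As a rough sketch of what such a check involves (illustrative Python using the standard library's robot parser, not the Kiwi implementation):

```python
from urllib.robotparser import RobotFileParser

# Sketch of a robots.txt allow/deny check; the rules below are a made-up example.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A crawler would call can_fetch() with its User-Agent before each request.
print(rp.can_fetch("scrape/2.1", "https://example.com/"))           # True
print(rp.can_fetch("scrape/2.1", "https://example.com/private/x"))  # False
```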

Notes:

  • If both --css and --xpath are given, CSS takes precedence.
  • --images and --links override selector modes.
  • JSON output with selectors returns objects like {"tag": "a", "text": "...", "href": "..."}.
  • When --depth > 0, only same-domain links are followed and each page is saved as page-<n>.<ext> inside the output directory.
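The same-domain rule and the page-<n>.<ext> naming above can be sketched like this (illustrative Python; function names are hypothetical and not taken from the Kiwi source):

```python
from urllib.parse import urlparse, urljoin

def same_domain_links(base_url, hrefs):
    # Resolve each href against the page URL and keep only links whose
    # host matches the start URL, as the crawler does when depth > 0.
    host = urlparse(base_url).netloc
    kept = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == host:
            kept.append(absolute)
    return kept

def page_filename(n, fmt):
    # page-<n>.<ext> naming used inside the crawl output directory.
    ext = {"text": "txt", "markdown": "md", "html": "html", "json": "json"}[fmt]
    return f"page-{n}.{ext}"

print(same_domain_links("https://example.com/a", ["/b", "https://other.org/c"]))
print(page_filename(1, "markdown"))  # page-1.md
```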

Examples

1. Basic single-page scrape (plain text)

kiwi scrape https://example.com
Example Domain Example Domain This domain is for use in documentation examples without needing permission. Avoid use in operations. Learn more

2. Save full HTML to file

kiwi scrape --format html -o example.html https://example.com
✓ Saved 1 page to example.html (0.5 KB)

3. Extract all links (one per line)

kiwi scrape --links https://example.com
https://iana.org/domains/example
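Conceptually, --links collects every <a> href and resolves relative paths against the page URL. A minimal sketch with Python's standard-library HTML parser (the Kiwi implementation may differ):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    # Collect href attributes from <a> tags, resolved to absolute URLs.
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

parser = LinkExtractor("https://example.com")
parser.feed('<p><a href="/domains/example">more</a></p>')
print(parser.links)  # ['https://example.com/domains/example']
```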

4. Extract elements with a CSS selector

kiwi scrape --css "h1" https://example.com
Example Domain

5. CSS selector with JSON output

kiwi scrape --css "h1" --json https://example.com
[{"tag": "h1", "text": "Example Domain"}]
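The object shape above ({"tag": ..., "text": ...}) could be produced by walking matched elements and pairing each tag name with its text content. A hedged Python sketch of that idea (the class name is hypothetical, and scrape's actual selector engine is written in Kiwi):

```python
import json
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    # Build {"tag", "text"} records for a set of wanted tag names,
    # mirroring the shape of scrape's --json selector output.
    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.records = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self._current = {"tag": tag, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current["tag"]:
            self.records.append(self._current)
            self._current = None

collector = TagTextCollector(["h1"])
collector.feed("<html><body><h1>Example Domain</h1></body></html>")
print(json.dumps(collector.records))  # [{"tag": "h1", "text": "Example Domain"}]
```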

6. Extract images with verbose logging

kiwi scrape --images -v https://example.com
[INFO] Fetching https://example.com...
[INFO] Found 0 images

7. Crawl a site (depth 1, markdown output)

kiwi scrape -d 1 --max-pages 5 --format markdown -o ./myblog https://example.com
Crawling https://example.com (depth 1)...
✓ Page 1/5: https://example.com                           → ./myblog/page-1.md
✓ Crawl finished. 1 pages saved to ./myblog

8. Custom header and CSS selector with JSON

kiwi scrape -H "Authorization: Bearer token123" --css "h1, h2" --json https://example.com
[{"tag": "h1", "text": "Example Domain"}]

9. Save crawl output quietly

kiwi scrape -d 2 --max-pages 10 -q -o ./out https://example.com

Running the Tests

kiwi tests/test.kiwi

The test suite covers all scraper functions: URL parsing, link/image extraction, CSS/XPath selectors, Markdown conversion, robots.txt parsing, and more.
