An open-source, community-driven web scraping agent powered by Google ADK and Agent Skills. Clone it, add your API key, and start scraping any website through a conversational interface — no cloud infrastructure required.
Site-specific scraping knowledge lives in Agent Skill files (plain Markdown), not in the agent code. Anyone can contribute new scraping recipes by adding a SKILL.md file without touching Python.
| Tool | Purpose | Install |
|---|---|---|
| Python 3.11+ | Runtime | python.org |
| Poetry | Python package manager | pip install poetry |
| Node.js + npx | Runs @playwright/mcp browser tool | nodejs.org |
| uv + uvx | Runs mcp-server-fetch HTTP tool | pip install uv |
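Before installing, it can help to confirm that the tools in the table are available. The following is a minimal sketch and not part of the repository: the function names `missing_tools` and `python_is_supported` are hypothetical, and the command names (`poetry`, `npx`, `uvx`) are taken from the table above.

```python
import shutil
import sys

# Commands from the prerequisites table; adjust if your install differs.
REQUIRED_COMMANDS = ("poetry", "npx", "uvx")


def missing_tools(commands=REQUIRED_COMMANDS):
    """Return the commands that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


def python_is_supported(version_info=sys.version_info):
    """The table lists Python 3.11+ as the required runtime."""
    return tuple(version_info[:2]) >= (3, 11)


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.11+ is required")
    for cmd in missing_tools():
        print(f"missing: {cmd}")
```

If the script prints nothing, the runtime and all three commands were found.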
# 1. Clone the repository
git clone https://github.com/DamiMartinez/scrapeagent.git
cd scrapeagent
# 2. Install Python dependencies
poetry install
# 3. Pre-install the MCP browser tool (avoids timeout on first run)
npm install -g @playwright/mcp
# 4. Configure your API key
cp .env.example .env
# Edit .env — set GOOGLE_API_KEY (or the key for your chosen provider)
# 5. Start the agent
poetry run adk web
# 6. Open http://localhost:8000 and start scraping
ScrapeAgent uses LiteLLM for model-agnostic support. Set LITELLM_MODEL in your .env and provide the matching API key.
| Model | LITELLM_MODEL value | API key env var | Notes |
|---|---|---|---|
| Gemini 2.5 Flash | gemini/gemini-2.5-flash | GOOGLE_API_KEY | Default. Free tier available. |
| Gemini 2.0 Flash | gemini/gemini-2.0-flash | GOOGLE_API_KEY | Lighter alternative |
| GPT-4o | openai/gpt-4o | OPENAI_API_KEY | OpenAI hosted |
| Claude Sonnet 4.6 | anthropic/claude-sonnet-4-6 | ANTHROPIC_API_KEY | Anthropic hosted |
| Llama 3.2 (local) | ollama/llama3.2 | (none required) | Fully local via Ollama |
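The pairing between a LITELLM_MODEL value and its API key can be checked before starting the agent. This is a sketch under assumptions, not code from the repository: `missing_api_key` is a hypothetical helper, and the mapping simply restates the table above, keyed by the provider prefix before the slash.

```python
import os
from typing import Mapping, Optional

# Provider prefix -> API key variable, copied from the table above.
KEY_FOR_PROVIDER = {
    "gemini": "GOOGLE_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "ollama": None,  # local models need no key
}


def missing_api_key(model: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the required-but-unset key variable, or None."""
    env = os.environ if env is None else env
    provider = model.split("/", 1)[0]
    if provider not in KEY_FOR_PROVIDER:
        raise ValueError(f"provider {provider!r} is not in the table")
    key_name = KEY_FOR_PROVIDER[provider]
    if key_name is not None and not env.get(key_name):
        return key_name
    return None
```

For example, `missing_api_key("ollama/llama3.2", {})` returns `None`, while `missing_api_key("openai/gpt-4o", {})` returns `"OPENAI_API_KEY"`.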
Skills are the brain of ScrapeAgent. Each skill documents how to scrape a specific website.
| Skill | What it scrapes | Trigger phrases |
|---|---|---|
| hacker-news | Front page stories: title, URL, score, author, comments | "Hacker News", "HN", "ycombinator" |
| github-trending | Trending repos by language and time period | "GitHub trending", "popular repos" |
| skill-creator | Helps you create new skills through conversation | "Create a skill", "document how to scrape" |
ScrapeAgent is intentionally simple: one root agent with MCP browser tools and an ADK SkillToolset. The "expertise" lives in skill files, not in the agent's prompt.
| Level | What loads | When | Size |
|---|---|---|---|
| L1: Metadata | name + description only | At startup, for every skill | ~100 tokens per skill |
| L2: Instructions | Full SKILL.md body | Only when the agent decides the skill is relevant | <5000 tokens (1 skill) |
| L3: Resources | Files in references/ and assets/ | Only if skill instructions reference them | On demand |
Why this matters: A community library of 50 skills costs ~5,000 tokens at startup (50 × ~100 token descriptions). The full instruction payload for any given task is only 1–2 skill bodies. Without this pattern, baking all scraping instructions into a monolithic system prompt would blow the context window and degrade quality as the skill library grows.
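The arithmetic behind that claim can be worked through directly. The sketch below uses only the figures from the table (~100 tokens per description, under 5000 tokens per skill body); treating 5000 as the body size is an assumption that gives an upper bound, and the function names are illustrative.

```python
DESCRIPTION_TOKENS = 100  # ~100 tokens per skill at L1 (from the table)
MAX_BODY_TOKENS = 5000    # upper bound for one SKILL.md body at L2


def startup_cost(num_skills: int) -> int:
    """Tokens loaded at startup: one description per skill."""
    return num_skills * DESCRIPTION_TOKENS


def task_cost(num_skills: int, skills_used: int) -> int:
    """Upper bound for one task: all descriptions plus the bodies actually loaded."""
    return startup_cost(num_skills) + skills_used * MAX_BODY_TOKENS


def monolithic_cost(num_skills: int) -> int:
    """Upper bound if every skill body lived in the system prompt."""
    return num_skills * MAX_BODY_TOKENS


print(startup_cost(50))     # 5000
print(task_cost(50, 2))     # 15000
print(monolithic_cost(50))  # 250000
```

With 50 skills, a task that loads two skill bodies stays at or below 15,000 tokens, while a monolithic prompt could reach 250,000.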
scrapeagent/
├── agent.py # Wires together LiteLLM model, MCP toolsets, and SkillToolset
├── prompt.py # Minimal orchestration prompt — expertise is in skills
├── tools/
│   └── file_tools.py # save_output (CSV/JSON/MD) and create_skill
└── skills/
    ├── hacker-news/ # One directory per skill
    ├── github-trending/
    └── skill-creator/
"Create a skill to scrape quotes from quotes.toscrape.com"
The agent will investigate the site, identify CSS selectors, and call create_skill to write the SKILL.md file. Restart the agent (Ctrl+C, then poetry run adk web) to load the new skill.
Create a directory under scrapeagent/skills/ following the Agent Skills spec:
scrapeagent/skills/
└── my-site/
    ├── SKILL.md # required
    ├── references/ # optional: extra .md files
    └── assets/ # optional: templates, examples
SKILL.md structure:
---
name: my-site
description: One or two sentences describing what this scrapes and when to use it.
metadata:
  author: your-github-handle
  version: "1.0"
---
# My Site Scraper
## Overview
...
## Instructions
...
Important: The directory name (e.g. my-site/) must exactly match the name field in the SKILL.md frontmatter. The skill loader enforces this and will raise an error if they differ.
See scrapeagent/skills/hacker-news/SKILL.md for a complete example.
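The name-matching rule can be checked locally before restarting the agent. This is a minimal sketch, not the project's actual skill loader: `frontmatter_name` and `check_skill_dir` are hypothetical helpers, and the parser only looks for an unindented `name:` line inside the leading `---` block rather than parsing full YAML.

```python
from pathlib import Path


def frontmatter_name(skill_md_text: str):
    """Return the `name` value from the leading '---' block, or None if absent."""
    lines = skill_md_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            break
        if line.startswith("name:"):
            return line.split(":", 1)[1].strip().strip("\"'")
    return None


def check_skill_dir(skill_dir: Path) -> None:
    """Raise ValueError if the directory name and frontmatter name differ."""
    text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    name = frontmatter_name(text)
    if name != skill_dir.name:
        raise ValueError(
            f"directory {skill_dir.name!r} does not match frontmatter name {name!r}"
        )
```

Calling `check_skill_dir(Path("scrapeagent/skills/my-site"))` returns silently when the names agree and raises `ValueError` otherwise.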
Scraped data is saved to ./output/ (gitignored). Three formats are supported:
| Format | Flag | Example filename |
|---|---|---|
| CSV | format="csv" (default) | output/hn_stories_2024-02-26.csv |
| JSON | format="json" | output/github_trending_python.json |
| Markdown | format="md" | output/hn_stories_today.md |
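To illustrate how the three formats differ, here is a sketch of a writer that takes a list of dicts. It is an assumption-laden stand-in, not the repository's save_output tool, whose real signature is not shown here: `save_rows`, its parameters, and the `output_dir` default are hypothetical, and all rows are assumed to share the same keys.

```python
import csv
import json
from pathlib import Path


def save_rows(rows, name, format="csv", output_dir="output"):
    """Write a list of dicts to output_dir/name.<format> and return the path."""
    if format not in ("csv", "json", "md"):
        raise ValueError(f"unsupported format: {format!r}")
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.{format}"
    columns = list(rows[0].keys()) if rows else []

    if format == "json":
        path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
    elif format == "csv":
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=columns)
            writer.writeheader()
            writer.writerows(rows)
    else:  # "md": a pipe table, one line per record
        lines = ["| " + " | ".join(columns) + " |", "|" + "---|" * len(columns)]
        for row in rows:
            cells = (str(row.get(col, "")) for col in columns)
            lines.append("| " + " | ".join(cells) + " |")
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

For instance, `save_rows([{"title": "A", "score": 1}], "hn_stories_today", format="md")` would write a two-column pipe table to output/hn_stories_today.md.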
See CONTRIBUTING.md for how to write and submit new skills. All skill contributions are welcome — no Python knowledge required.