ScrapeAgent

An open-source, community-driven web scraping agent powered by Google ADK and Agent Skills. Clone it, add your API key, and start scraping any website through a conversational interface — no cloud infrastructure required.

Site-specific scraping knowledge lives in Agent Skill files (plain Markdown), not in the agent code. Anyone can contribute new scraping recipes by adding a SKILL.md file without touching Python.

Prerequisites

Tool	Purpose	Install
Python 3.11+	Runtime	python.org
Poetry	Python package manager	`pip install poetry`
Node.js + npx	Runs `@playwright/mcp` browser tool	nodejs.org
uv + uvx	Runs `mcp-server-fetch` HTTP tool	`pip install uv`

Quick Start

# 1. Clone the repository
git clone https://github.com/DamiMartinez/scrapeagent.git
cd scrapeagent

# 2. Install Python dependencies
poetry install

# 3. Pre-install the MCP browser tool (avoids timeout on first run)
npm install -g @playwright/mcp

# 4. Configure your API key
cp .env.example .env
# Edit .env — set GOOGLE_API_KEY (or the key for your chosen provider)

# 5. Start the agent
poetry run adk web

# 6. Open http://localhost:8000 and start scraping

Supported Models

ScrapeAgent uses LiteLLM for model-agnostic support. Set LITELLM_MODEL in your .env and provide the matching API key.

Model	`LITELLM_MODEL` value	API key env var	Notes
Gemini 2.5 Flash	`gemini/gemini-2.5-flash`	`GOOGLE_API_KEY`	Default. Free tier available.
Gemini 2.0 Flash	`gemini/gemini-2.0-flash`	`GOOGLE_API_KEY`	Lighter alternative
GPT-4o	`openai/gpt-4o`	`OPENAI_API_KEY`	OpenAI hosted
Claude Sonnet 4.6	`anthropic/claude-sonnet-4-6`	`ANTHROPIC_API_KEY`	Anthropic hosted
Llama 3.2 (local)	`ollama/llama3.2`	(none required)	Fully local via Ollama

Available Skills

Skills are the brain of ScrapeAgent. Each skill documents how to scrape a specific website.

Skill	What it scrapes	Trigger phrases
`hacker-news`	Front page stories: title, URL, score, author, comments	"Hacker News", "HN", "ycombinator"
`github-trending`	Trending repos by language and time period	"GitHub trending", "popular repos"
`skill-creator`	Helps you create new skills through conversation	"Create a skill", "document how to scrape"

Architecture: How It Works

ScrapeAgent is intentionally simple: one root agent with MCP browser tools and an ADK SkillToolset. The "expertise" lives in skill files, not in the agent's prompt. Skills load in three levels, so only the content a task actually needs enters the context window:

┌───────────────────┬─────────────────────────┬────────────────────────────────┬────────────────────────┐
│       Level       │       What loads        │              When              │          Size          │
├───────────────────┼─────────────────────────┼────────────────────────────────┼────────────────────────┤
│ L1 - Metadata     │ name + description only │ At startup, for every skill    │ ~100 tokens per skill  │
├───────────────────┼─────────────────────────┼────────────────────────────────┼────────────────────────┤
│ L2 - Instructions │ Full SKILL.md body      │ Only when the agent decides    │ <5000 tokens (1 skill) │
│                   │                         │ the skill is relevant          │                        │
├───────────────────┼─────────────────────────┼────────────────────────────────┼────────────────────────┤
│ L3 - Resources    │ Files in references/    │ Only if skill instructions     │ On demand              │
│                   │ and assets/             │ reference them                 │                        │
└───────────────────┴─────────────────────────┴────────────────────────────────┴────────────────────────┘

Why this matters: A community library of 50 skills costs about 5,000 tokens at startup (50 × ~100-token descriptions). The full instruction payload for any given task is only one or two skill bodies. Without this pattern, all scraping instructions would sit in one monolithic system prompt, which would fill the context window and degrade quality as the skill library grows.
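
The arithmetic behind those figures can be written out as a small calculation. This is a back-of-the-envelope sketch using the approximate sizes quoted above. The function names are made up for illustration, and the 5,000-token body size is an upper bound, so the monolithic figure is a worst case rather than a measurement.

```python
# Rough token budget for progressive skill loading, using the figures quoted above.
METADATA_TOKENS = 100   # ~100 tokens per skill description (level L1)
BODY_TOKENS = 5000      # upper bound for one SKILL.md body (level L2)


def progressive_cost(num_skills: int, active_skills: int) -> int:
    """Startup metadata for every skill plus full bodies for the active ones."""
    return num_skills * METADATA_TOKENS + active_skills * BODY_TOKENS


def monolithic_cost(num_skills: int) -> int:
    """Worst case if every skill body were baked into one system prompt."""
    return num_skills * BODY_TOKENS


if __name__ == "__main__":
    print(progressive_cost(50, 0))  # startup only: 5000
    print(progressive_cost(50, 2))  # a task that loads two skills: 15000
    print(monolithic_cost(50))      # everything in the prompt: 250000
```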

Components

scrapeagent/
├── agent.py          # Wires together LiteLLM model, MCP toolsets, and SkillToolset
├── prompt.py         # Minimal orchestration prompt — expertise is in skills
├── tools/
│   └── file_tools.py # save_output (CSV/JSON/MD) and create_skill
└── skills/
    ├── hacker-news/  # One directory per skill
    ├── github-trending/
    └── skill-creator/

Adding New Skills

Method 1: Ask the agent

"Create a skill to scrape quotes from quotes.toscrape.com"

The agent will investigate the site, identify CSS selectors, and call create_skill to write the SKILL.md file. Restart the agent (Ctrl+C, then poetry run adk web) to load the new skill.

Method 2: Write it manually

Create a directory under scrapeagent/skills/ following the Agent Skills spec:

scrapeagent/skills/
└── my-site/
    ├── SKILL.md          # required
    ├── references/       # optional: extra .md files
    └── assets/           # optional: templates, examples

SKILL.md structure:

---
name: my-site
description: One or two sentences describing what this scrapes and when to use it.
metadata:
  author: your-github-handle
  version: "1.0"
---

# My Site Scraper

## Overview
...

## Instructions
...

Important: The directory name (e.g. my-site/) must exactly match the name field in SKILL.md frontmatter. The skill loader enforces this and will raise an error if they differ.

See scrapeagent/skills/hacker-news/SKILL.md for a complete example.

Output Files

Scraped data is saved to ./output/ (gitignored). Three formats are supported:

Format	`format` argument	Example filename
CSV	`format="csv"` (default)	`output/hn_stories_2024-02-26.csv`
JSON	`format="json"`	`output/github_trending_python.json`
Markdown	`format="md"`	`output/hn_stories_today.md`
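
For readers curious what writing the three formats involves, here is a rough standalone sketch. It is not the project's `save_output` tool: `save_rows` and its parameters are hypothetical, and it assumes every row is a flat dict with the same keys.

```python
import csv
import json
from pathlib import Path


def save_rows(rows: list[dict], name: str, format: str = "csv",
              out_dir: Path = Path("output")) -> Path:
    """Write a list of flat dicts to out_dir/name.<format> and return the path."""
    if format not in {"csv", "json", "md"}:
        raise ValueError(f"Unsupported format: {format!r}")
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.{format}"
    columns = list(rows[0]) if rows else []  # column order follows the first row

    if format == "json":
        path.write_text(json.dumps(rows, indent=2, ensure_ascii=False), encoding="utf-8")
    elif format == "csv":
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=columns)
            writer.writeheader()
            writer.writerows(rows)
    else:  # Markdown table: header row, separator row, then one line per record
        lines = ["| " + " | ".join(columns) + " |",
                 "| " + " | ".join("---" for _ in columns) + " |"]
        for row in rows:
            lines.append("| " + " | ".join(str(row.get(c, "")) for c in columns) + " |")
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

For example, `save_rows([{"title": "Example", "score": 1}], "hn_stories_today", format="md")` would write `output/hn_stories_today.md`.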

Contributing

See CONTRIBUTING.md for how to write and submit new skills. All skill contributions are welcome — no Python knowledge required.
