Reference DB Intelligence Agent

An OpenCode-powered agent for discovering and extracting product data from Canadian grocery websites.

Built for the Reference DB project — a reliable database of Canadian food products.

Quick Start

# Install dependencies
uv sync

# Install Playwright browsers
uv run playwright install chromium

# Copy environment file
cp .env.example .env

# Run the agent
uv run rdai run --url https://voila.ca

# Discovery only (no extraction)
uv run rdai discover --url https://voila.ca

# Extract specific fields
uv run rdai run --url https://voila.ca --fields barcode,brand,product_name

# Limit extraction
uv run rdai run --url https://voila.ca --max-products 500

Project Structure

reference-db-intelligence-agent/
├── AGENTS.md               # OpenCode agent instructions
├── PROJECT_CONTEXT.md      # Project background
├── SCRAPING_RULES.md       # Rules for all scraping code
├── opencode.json           # OpenCode configuration
├── pyproject.toml          # Python dependencies
├── .env.example            # Environment variables template
│
├── src/
│   ├── cli.py              # CLI entry point (rdai command)
│   ├── run_manager.py      # Orchestrates full pipeline
│   ├── models/
│   │   ├── product.py      # RawProduct data model
│   │   ├── run.py          # RunConfig, RunContext
│   │   └── intelligence.py # WebsiteIntelligence, ApiEndpoint, etc.
│   ├── discovery/
│   │   └── discoverer.py   # robots.txt, sitemaps, API discovery
│   ├── protection/
│   │   └── analyzer.py     # Cloudflare, CAPTCHA, bot detection
│   ├── extractors/
│   │   ├── base.py         # BaseExtractor
│   │   ├── json_ld.py      # JSON-LD extractor (priority 1)
│   │   └── playwright_extractor.py  # Playwright extractor (priority 2)
│   └── reports/
│       └── generator.py    # Markdown report generator
│
├── runs/                   # One folder per run
│   └── 2026-06-22_abc123/
│       ├── report.md
│       ├── raw_products.jsonl
│       ├── website_intelligence.json
│       ├── sitemaps.json
│       ├── protection_analysis.json
│       ├── coverage_report.json
│       ├── statistics.json
│       └── logs.json
│
└── artifacts/              # Reusable artifacts across runs

Extraction Priority

The agent always tries free methods first:

Priority	Method	When Used
1	JSON-LD	Always (fastest)
2	Playwright	JS-rendered pages
3	Crawl4AI	Dynamic content
4	Scrapling	Anti-bot evasion
5	BeautifulSoup	Simple HTML
6	Bright Data MCP	Last resort (requires `--bright-data`)

Run Artifacts

Every run saves:

File	Contents
`report.md`	Full human-readable run report
`raw_products.jsonl`	All extracted products (JSONL)
`website_intelligence.json`	All discovered APIs, sitemaps, protections
`protection_analysis.json`	Protection mechanisms detected
`sitemaps.json`	Discovered sitemaps
`coverage_report.json`	Field coverage statistics
`statistics.json`	Run statistics
`logs.json`	Full event log

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Reference DB Intelligence Agent

Quick Start

Project Structure

Extraction Priority

Run Artifacts

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.opencode/agent		.opencode/agent
src		src
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
PROJECT_CONTEXT.md		PROJECT_CONTEXT.md
README.md		README.md
SCRAPING_RULES.md		SCRAPING_RULES.md
opencode.json		opencode.json
pyproject.toml		pyproject.toml

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Reference DB Intelligence Agent

Quick Start

Project Structure

Extraction Priority

Run Artifacts

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages