---
title: "Choosing the Data Extractor"
description: "Compare /agent, /extract, and /scrape (JSON mode) to pick the right tool for structured data extraction"
description: "Compare /agent and /scrape (JSON mode) to pick the right tool for structured data extraction"
og:title: "Choosing the Data Extractor | Firecrawl"
og:description: "Compare /agent, /extract, and /scrape (JSON mode) to pick the right tool for structured data extraction"
og:description: "Compare /agent and /scrape (JSON mode) to pick the right tool for structured data extraction"
sidebarTitle: "Choosing the Data Extractor"
---

import AgentWithSchemaPython from "/snippets/v2/agent/with-schema/python.mdx";
import AgentWithSchemaJS from "/snippets/v2/agent/with-schema/js.mdx";
import AgentWithSchemaCURL from "/snippets/v2/agent/with-schema/curl.mdx";
import ScrapeJsonPython from "/snippets/v2/scrape/json/base/python.mdx";
import ScrapeJsonNode from "/snippets/v2/scrape/json/base/js.mdx";
import ScrapeJsonCURL from "/snippets/v2/scrape/json/base/curl.mdx";

Firecrawl offers two approaches for extracting structured data from web pages. Each serves different use cases with varying levels of automation and control.

## Quick Comparison

| Feature | `/agent` | `/scrape` (JSON mode) |
|---------|----------|----------------------|
| **URL Required** | No (optional) | Yes (single URL) |
| **Scope** | Web-wide discovery, multi-page, or single page | Single page |
| **URL Discovery** | Autonomous web search | None |
| **Processing** | Asynchronous | Synchronous |
| **Schema Required** | No (prompt or schema) | No (prompt or schema) |
| **Pricing** | Dynamic (5 free runs/day); 10 credits/cell on Parallel Agents fast path | 5 credits/page (1 base + 4 for JSON mode) |
| **Best For** | Research, discovery, multi-page or batch gathering | Known single-page extraction |

## 1. `/agent` Endpoint

The `/agent` endpoint is Firecrawl's most advanced offering. It uses AI agents to autonomously search, navigate, and gather data from across the web.

### Key Characteristics

- **URLs Optional**: Just describe what you need via `prompt`; URLs can be omitted entirely
- **Autonomous Navigation**: The agent searches and navigates deep into sites to find your data
- **Deep Web Search**: Autonomously discovers information across multiple domains and pages
- **Parallel Processing**: Processes multiple sources simultaneously for faster results
- **Models Available**: `spark-1-fast` (cheapest, used by Parallel Agents — 10 credits/cell), `spark-1-mini` (default, balanced cost/quality), and `spark-1-pro` (highest accuracy)

### Example

<CodeGroup>

<AgentWithSchemaPython />
<AgentWithSchemaJS />
<AgentWithSchemaCURL />

</CodeGroup>

### Best Use Case: Research and Discovery

**Scenario**: You want a structured list of YC W24 companies with their founders and funding raised, but you don't know which sites hold the data.

**Why `/agent`**: You don't know which websites contain this information. The agent will autonomously search the web, navigate to relevant sources (Crunchbase, news sites, company pages), and compile the structured data for you.
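
As a rough sketch of that call in Python (the `Firecrawl` client class and the exact `agent()` signature with `prompt`, `schema`, and `model` are assumptions based on the SDK patterns in this guide, not a guaranteed API):

```python
from firecrawl import Firecrawl  # assumed v2 SDK client class

app = Firecrawl(api_key="fc-YOUR-API-KEY")

# JSON Schema describing the structured rows we want back.
schema = {
    "type": "object",
    "properties": {
        "companies": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "founders": {"type": "array", "items": {"type": "string"}},
                    "funding_raised": {"type": "string"},
                },
            },
        }
    },
}

# No URLs: the agent searches the web, picks sources, and extracts.
result = app.agent(
    prompt="Find YC W24 companies with their founders and total funding raised",
    schema=schema,
    model="spark-1-mini",  # or "spark-1-pro" for higher accuracy
)
```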

### Parallel Agents

For high-volume batch extraction, such as enriching a list of 1,000 companies with funding data, use Parallel Agents. They run an intelligent waterfall: `spark-1-fast` handles simple cells at a flat 10 credits/cell, and the run escalates to `spark-1-mini` only when needed. This is the right tool for grid/batch workflows where you would otherwise loop over many similar prompts.
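
Parallel Agents manages that waterfall for you. If you wanted to approximate the fan-out pattern from your own script, a hypothetical sketch (reusing the assumed `agent()` signature from above, with the model pinned to `spark-1-fast`) might look like this:

```python
from concurrent.futures import ThreadPoolExecutor
from firecrawl import Firecrawl  # assumed v2 SDK client class

app = Firecrawl(api_key="fc-YOUR-API-KEY")
companies = ["Acme AI", "Example Labs", "Hooli"]  # ...one cell per row

def enrich(name: str) -> dict:
    # One cheap, targeted run per cell; escalate manually if a row fails.
    return app.agent(
        prompt=f"Find the latest funding round and total raised for {name}",
        model="spark-1-fast",
    )

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(enrich, companies))
```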

For more details, see the [Agent documentation](/features/agent).

---

## 2. `/scrape` Endpoint with JSON Mode

The `/scrape` endpoint with JSON mode is the most controlled approach—it extracts structured data from a single known URL using an LLM to parse the page content into your specified schema.
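
A minimal sketch, assuming the v2 Python SDK's `scrape()` accepts a JSON-format object carrying a `prompt` (and optionally a `schema`), and that the extracted object comes back on a `json` field:

```python
from firecrawl import Firecrawl  # assumed v2 SDK client class

app = Firecrawl(api_key="fc-YOUR-API-KEY")

# One known URL; JSON mode bills 1 base credit + 4 = 5 credits for the page.
doc = app.scrape(
    "https://example.com/pricing",
    formats=[{
        "type": "json",
        "prompt": "Extract each plan's name, monthly price, and feature list",
    }],
)
print(doc.json)  # assumed field holding the extracted object
```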

For more details, see the [JSON mode documentation](/features/llm-extract).

---

## Decision Guide

**Do you know which URL(s) contain the data?**

- **NO** → Use `/agent`
- **YES**
  - **Single page?** → Use `/scrape` with JSON mode
  - **Multiple pages?** → Use `/agent` with URLs (or batch `/scrape`)
  - **Many similar prompts across a list?** → Use `/agent` Parallel Agents

### Recommendations by Scenario

| Scenario | Recommended Approach |
|----------|----------------------|
| "Get all blog posts from competitor.com" | `/agent` with URL |
| "Monitor prices across multiple known URLs" | `/scrape` with batch processing |
| "Research companies in a specific industry" | `/agent` |
| "Enrich 1,000 companies with funding data" | `/agent` Parallel Agents |
| "Extract contact info from 50 known company pages" | `/scrape` with batch processing |

---
## Pricing Comparison
| Endpoint | Cost | Notes |
|----------|------|-------|
| `/scrape` (JSON mode) | 5 credits/page (1 base + 4 for JSON mode) | Fixed, predictable |
| `/agent` | Dynamic | 5 free runs/day; typical run ~100–500 credits |
| `/agent` (Parallel Agents) | 10 credits/cell on the fast path | Batch/grid workflows |

### Example: "Find the founders of Firecrawl"

| Endpoint | How It Works | Credits Used |
|----------|--------------|--------------|
| `/scrape` | You find the URL manually, then scrape 1 page | ~5 credits |
| `/agent` | Just send the prompt—agent finds and extracts | ~100–500 credits |

**Tradeoff**: `/scrape` is cheapest but requires you to know the URL. `/agent` costs more but handles discovery automatically.
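
The same arithmetic scales with volume: enriching 1,000 companies on the Parallel Agents fast path costs a flat 1,000 × 10 = 10,000 credits, looping full `/agent` runs at roughly 100–500 credits each could cost 100,000–500,000 credits, and 1,000 JSON-mode scrapes would be 5,000 credits if you already had every URL.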
For detailed pricing, see [Firecrawl Pricing](https://firecrawl.dev/pricing).

---


## Key Takeaways

1. **Know the exact URL?** Use `/scrape` with JSON mode—it's the cheapest (5 credits/page), fastest (synchronous), and most predictable option.

2. **Need autonomous research?** Use `/agent`—it handles discovery automatically with 5 free runs/day, then dynamic pricing based on complexity.

3. **Running batch workflows over a list?** Use `/agent` Parallel Agents with `spark-1-fast` for a flat 10 credits/cell.

4. **Cost vs. convenience tradeoff**: `/scrape` is most cost-effective when you know your URLs; `/agent` costs more but eliminates manual URL discovery.

---
- [Agent documentation](/features/agent)
- [Agent models](/features/models)
- [JSON mode documentation](/features/llm-extract)
- [Batch scraping](/features/batch-scrape)