🌐 web2json-agent

Stop Coding Scrapers, Start Getting Data — from Hours to Seconds

📖 What is web2json-agent?

An AI-powered web scraping agent that automatically generates production-ready parser code from HTML samples, with no manual XPath or CSS selector writing required.

📋 Demo

Demo video: 20260120204054.mp4

📊 SWDE Benchmark Results

The SWDE dataset covers 8 vertical fields, 80 websites, and 124,291 pages.

| Method | Precision | Recall | F1 Score |
|---|---|---|---|
| COT | 87.75 | 79.90 | 76.95 |
| Reflexion | 93.28 | 82.76 | 82.40 |
| AUTOSCRAPER | 92.49 | 89.13 | 88.69 |
| Web2JSON-Agent | 91.50 | 90.46 | 89.93 |
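
Note that the F1 column is not the harmonic mean of the Precision and Recall values in the same row: for COT, 2 × 87.75 × 79.90 / (87.75 + 79.90) ≈ 83.64, not 76.95. One plausible explanation, which is an assumption here and is not stated with the table, is that each metric is averaged over websites separately. The sketch below uses made-up per-site numbers (not SWDE data) to show how a macro-averaged F1 can fall below both the averaged Precision and the averaged Recall.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-site (precision, recall) pairs, NOT taken from SWDE.
per_site = [(1.0, 0.2), (0.2, 1.0)]

# Average each metric over sites independently (macro averaging).
macro_p = sum(p for p, _ in per_site) / len(per_site)
macro_r = sum(r for _, r in per_site) / len(per_site)
macro_f1 = sum(f1(p, r) for p, r in per_site) / len(per_site)

print(round(macro_p, 3), round(macro_r, 3), round(macro_f1, 3))
# F1 computed from the already-averaged precision and recall, for comparison.
print(round(f1(macro_p, macro_r), 3))
```

With these toy numbers the averaged per-site F1 is well below the F1 computed from the averaged Precision and Recall, which is the same pattern the table shows.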

🚀 Quick Start

Install via pip

# 1. Install package
pip install web2json-agent

# 2. Initialize configuration
web2json setup

Install for Developers

# 1. Clone the repository
git clone https://github.com/ccprocessor/web2json-agent
cd web2json-agent

# 2. Install in editable mode
pip install -e .

# 3. Initialize configuration
web2json setup

📚 Complete User Guide

For a comprehensive tutorial covering installation, configuration, and all usage scenarios, see:

📖 Web2JSON-Agent Complete User Guide (in Chinese)

This guide includes:

- Detailed installation steps
- Configuration methods (interactive wizard, config file, environment variables)
- Layout clustering for mixed HTML types
- Complete API examples and use cases
- FAQ and troubleshooting

🐍 API Usage

web2json-agent provides five simple Python APIs. Each returns its results in memory, which makes them a good fit for database pipelines, API backends, and real-time processing.

API 1: `extract_data` - Complete Workflow

Extract structured data from HTML in one step (schema + parser + data).

Auto Mode - Let AI automatically discover and extract fields:

from web2json import Web2JsonConfig, extract_data

config = Web2JsonConfig(
    name="my_project",
    html_path="html_samples/",
    # save=['schema', 'code', 'data'],  # Save to local disk
    # output_path="./results",  # Custom output directory (default: "output")
)

result = extract_data(config)

# Results are always returned in memory
print(result.final_schema)        # Dict: extracted schema
print(result.parser_code)          # str: generated parser code
print(result.parsed_data[0])       # Dict: first record (parsed_data is a List[Dict])

Predefined Mode - Extract only specific fields:

from web2json import Web2JsonConfig, extract_data

config = Web2JsonConfig(
    name="articles",
    html_path="html_samples/",
    schema={
        "title": "string",
        "author": "string",
        "date": "string",
        "content": "string"
    },
    # save=['schema', 'code', 'data'],  # Save to local disk
    # output_path="./results",  # Custom output directory
)

result = extract_data(config)
# Returns: ExtractDataResult with schema, code, and data in memory
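
A predefined schema also gives a cheap way to sanity-check parsed records before they go anywhere else. The following is a minimal sketch that does not call web2json at all: it assumes a flat schema of field name to type name, as in the example above, and a parsed record that is a plain dict. Only `"string"` appears in this README; the other entries in `TYPE_MAP` and the helper name `check_record` are hypothetical.

```python
from typing import Any, Dict, List

# Assumed mapping from schema type names to Python types. Only "string"
# appears in the examples above; the other names are hypothetical.
TYPE_MAP = {"string": str, "number": (int, float), "boolean": bool}


def check_record(record: Dict[str, Any], schema: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record fits."""
    problems = []
    for field, type_name in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        expected = TYPE_MAP.get(type_name)
        value = record[field]
        # None is tolerated as "field present but nothing extracted".
        if expected and value is not None and not isinstance(value, expected):
            problems.append(
                f"{field}: expected {type_name}, got {type(value).__name__}"
            )
    # Fields the parser produced that the schema never asked for.
    for field in sorted(record.keys() - schema.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


schema = {"title": "string", "author": "string", "date": "string", "content": "string"}
record = {"title": "Hello", "author": None, "date": 20240101, "content": "Body"}
print(check_record(record, schema))
```

Here the only reported problem is the integer `date`; a missing author is accepted because `None` is treated as "nothing extracted".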

API 2: `extract_schema` - Extract Schema Only

Generate a JSON schema describing the data structure in HTML.

from web2json import Web2JsonConfig, extract_schema

config = Web2JsonConfig(
    name="schema_only",
    html_path="html_samples/",
    # save=['schema'],  # Save schema to disk
    # output_path="./schemas",  # Custom output directory
)

result = extract_schema(config)

print(result.final_schema)         # Dict: final schema
print(result.intermediate_schemas) # List[Dict]: iteration history
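
To see how the schema evolved across iterations, the intermediate schemas can be compared pairwise. This is a sketch under the assumption that each entry in the history is a flat dict of field name to type name; real schemas may be nested, and the helper name `schema_changes` is made up for illustration.

```python
from typing import Dict, List


def schema_changes(history: List[Dict[str, str]]) -> List[Dict[str, List[str]]]:
    """Compare each schema with the one before it and list field-level changes."""
    changes = []
    for before, after in zip(history, history[1:]):
        shared = before.keys() & after.keys()
        changes.append({
            "added": sorted(after.keys() - before.keys()),
            "removed": sorted(before.keys() - after.keys()),
            "retyped": sorted(k for k in shared if before[k] != after[k]),
        })
    return changes


# Stand-in for result.intermediate_schemas (flat schemas assumed).
history = [
    {"title": "string", "date": "string"},
    {"title": "string", "date": "string", "author": "string"},
    {"title": "string", "author": "string", "content": "string"},
]
for step, change in enumerate(schema_changes(history), start=1):
    print(f"round {step} -> {step + 1}: {change}")
```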

API 3: `infer_code` - Generate Parser Code

Generate parser code from a schema, either a hand-written Dict or the output of `extract_schema`.

from web2json import Web2JsonConfig, infer_code

# Use schema from previous step or define manually
my_schema = {
    "title": "string",
    "author": "string",
    "content": "string"
}

config = Web2JsonConfig(
    name="my_parser",
    html_path="html_samples/",
    schema=my_schema,
    # save=['code'],  # Save parser code and schema to disk
    # output_path="./parsers",  # Custom output directory
)

result = infer_code(config)

print(result.parser_code)  # str: BeautifulSoup parser code
print(result.schema)       # Dict: schema used
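
Because `parser_code` is returned as a string, one way to try it out is to write it to a `.py` file and import that file as a module. The sketch below uses only the standard library and a stand-in code string. The entry point name `parse` is hypothetical, so inspect the generated code to find the real one, and only execute generated code you have reviewed.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_parser_module(parser_code: str, module_name: str = "generated_parser"):
    """Write parser source to a temporary .py file and import it as a module."""
    tmp_dir = Path(tempfile.mkdtemp())
    parser_path = tmp_dir / f"{module_name}.py"
    parser_path.write_text(parser_code, encoding="utf-8")
    spec = importlib.util.spec_from_file_location(module_name, parser_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module, parser_path


# Stand-in for result.parser_code. The function name `parse` is an assumption.
fake_parser_code = '''
def parse(html):
    start = html.find("<title>") + len("<title>")
    end = html.find("</title>")
    return {"title": html[start:end]}
'''

module, parser_path = load_parser_module(fake_parser_code)
print(module.parse("<html><title>Hello</title></html>"))
print(parser_path.name)
```

The saved file path can then be kept alongside the project so the same parser is reused on later batches.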

API 4: `extract_data_with_code` - Parse with Code

Use parser code to extract data from HTML files.

from web2json import Web2JsonConfig, extract_data_with_code

config = Web2JsonConfig(
    name="parse_demo",
    html_path="new_html_files/",
    parser_code="output/blog/parsers/final_parser.py",  # Path to parser .py file
    save=['data'],  # Save parsed data to disk
    output_path="./parse_results",  # Custom output directory
)

result = extract_data_with_code(config)

print(f"Success: {result.success_count}, Failed: {result.failed_count}")
for item in result.parsed_data:
    print(f"File: {item['filename']}")
    print(f"Data: {item['data']}")
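
Since each parsed item carries a `filename` and a `data` payload, loading the results into a database is straightforward. The following sketch stores stand-in records in SQLite using only the standard library; the table name `pages` and the helper `store_records` are illustrative choices, not part of web2json.

```python
import json
import sqlite3
from typing import Any, Dict, List


def store_records(conn: sqlite3.Connection, records: List[Dict[str, Any]]) -> int:
    """Insert parsed records as JSON text, one row per source file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "filename TEXT PRIMARY KEY, data TEXT NOT NULL)"
    )
    rows = [(r["filename"], json.dumps(r["data"], ensure_ascii=False)) for r in records]
    # INSERT OR REPLACE keeps re-runs idempotent for the same filename.
    conn.executemany(
        "INSERT OR REPLACE INTO pages (filename, data) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]


# Stand-in for result.parsed_data, using the 'filename' and 'data' keys shown above.
parsed_data = [
    {"filename": "a.html", "data": {"title": "First"}},
    {"filename": "b.html", "data": {"title": "Second"}},
]
conn = sqlite3.connect(":memory:")
print(store_records(conn, parsed_data))
```

Keying rows on `filename` means a second run over the same files overwrites earlier rows instead of duplicating them.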

API 5: `classify_html_dir` - Classify HTML by Layout

Group HTML files by layout similarity (for mixed-layout datasets).

from web2json import Web2JsonConfig, classify_html_dir

config = Web2JsonConfig(
    name="classify_demo",
    html_path="mixed_html/",
    # save=['report', 'files'],  # Save cluster report and copy files to subdirectories
    # output_path="./cluster_analysis",  # Custom output directory
)

result = classify_html_dir(config)

print(f"Found {result.cluster_count} layout types")
print(f"Noise files: {len(result.noise_files)}")

for cluster_name, files in result.clusters.items():
    print(f"{cluster_name}: {len(files)} files")
    for file in files[:3]:
        print(f"  - {file}")
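
To build intuition for what "layout similarity" can mean, the sketch below compares pages by the set of tag paths they contain and scores pairs with Jaccard similarity. This is an illustrative stand-in written with the standard library only. It is not the algorithm `classify_html_dir` uses, which this README does not describe.

```python
from html.parser import HTMLParser
from typing import List, Set


class TagPathCollector(HTMLParser):
    """Collect the set of tag paths (e.g. 'html/body/div') seen in a document."""

    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []
        self.paths: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:  # void elements never get a closing tag
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unbalanced HTML.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def layout_signature(html: str) -> Set[str]:
    collector = TagPathCollector()
    collector.feed(html)
    return collector.paths


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of two documents' tag-path sets (1.0 = same layout)."""
    sa, sb = layout_signature(a), layout_signature(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


detail_1 = "<html><body><h1>A</h1><article><p>x</p></article></body></html>"
detail_2 = "<html><body><h1>B</h1><article><p>y</p><p>z</p></article></body></html>"
listing = "<html><body><ul><li><a href='/a'>A</a></li></ul></body></html>"

print(similarity(detail_1, detail_2))  # two detail pages
print(similarity(detail_1, listing))   # detail page vs. list page
```

The two detail pages share every tag path, while the detail and list pages share only `html` and `html/body`, so a threshold on this score would separate them.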

Configuration Reference

Web2JsonConfig Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | Required | Project name (for identification) |
| `html_path` | `str` | Required | HTML directory or file path |
| `output_path` | `str` | `"output"` | Output directory (used when `save` is specified) |
| `iteration_rounds` | `int` | `3` | Number of samples for learning |
| `schema` | `Dict` | `None` | Predefined schema (`None` = auto mode) |
| `enable_schema_edit` | `bool` | `False` | Enable manual schema editing |
| `parser_code` | `str` | `None` | Parser code, or path to a parser `.py` file (for `extract_data_with_code`) |
| `save` | `List[str]` | `None` | Items to save locally (e.g., `['schema', 'code', 'data']`). `None` = memory only |

Standalone API Parameters:

| API | Parameters | Returns |
|---|---|---|
| `extract_data` | `config: Web2JsonConfig` | `ExtractDataResult` |
| `extract_schema` | `config: Web2JsonConfig` | `ExtractSchemaResult` |
| `infer_code` | `config: Web2JsonConfig` | `InferCodeResult` |
| `extract_data_with_code` | `config: Web2JsonConfig` | `ParseResult` |
| `classify_html_dir` | `config: Web2JsonConfig` | `ClusterResult` |

All result objects provide:

- Direct access to data via object attributes
- `.to_dict()` method for serialization
- `.get_summary()` method for quick stats

Which API Should I Use?

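from web2json import (
    Web2JsonConfig,
    classify_html_dir,
    extract_data,
    extract_data_with_code,
    extract_schema,
    infer_code,
)

# Need data immediately? → extract_data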
config = Web2JsonConfig(name="my_run", html_path="html_samples/")
result = extract_data(config)
print(result.parsed_data)

# Want to review/edit schema first? → extract_schema + infer_code
config = Web2JsonConfig(name="schema_run", html_path="html_samples/")
schema_result = extract_schema(config)

# Edit schema if needed, then generate code
config = Web2JsonConfig(
    name="code_run",
    html_path="html_samples/",
    schema=schema_result.final_schema
)
code_result = infer_code(config)

# Parse with the generated code
config = Web2JsonConfig(
    name="parse_run",
    html_path="new_html_files/",
    parser_code=code_result.parser_code
)
data_result = extract_data_with_code(config)

# Have parser code, need to parse more files? → extract_data_with_code
my_parser_code = code_result.parser_code  # parser code generated earlier
config = Web2JsonConfig(
    name="parse_more",
    html_path="more_files/",
    parser_code=my_parser_code
)
result = extract_data_with_code(config)

# Mixed layouts (list + detail pages)? → classify_html_dir
config = Web2JsonConfig(name="classify", html_path="mixed_html/")
result = classify_html_dir(config)

📄 License

Apache-2.0 License

Made with ❤️ by the web2json-agent team

