AI Agent Web Scraper (Beginner-Friendly Tutorial)

A step-by-step guide and code repository for building an intelligent, autonomous web scraping agent using Python, LangChain, and Selenium.

💡 Key Takeaways

  • AI Agents use Large Language Models (LLMs) to dynamically decide how to scrape, making them more resilient than traditional scripts.
  • The core components are the Orchestrator (LLM), Browser Automation (Selenium/Playwright), and a Defense Bypass Mechanism (CAPTCHA Solver).
  • Anti-bot measures like CAPTCHAs are the biggest challenge, requiring specialized tools for reliable data collection.

🚀 Quick Start

This tutorial uses Python, LangChain, and Selenium.

1. Setup Environment

# Create a new directory
mkdir ai-scraper-agent
cd ai-scraper-agent

# Install core libraries
pip install langchain langchain-openai selenium

2. Define Agent Tools (tools.py)

The agent needs a tool to interact with the web, simulating a browser.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from langchain.tools import tool
import time

def get_driver():
    """Initializes a headless Chrome WebDriver."""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    # Selenium 4.6+ locates a matching chromedriver automatically via
    # Selenium Manager; pass a Service(executable_path=...) only if you
    # need a specific driver binary.
    driver = webdriver.Chrome(options=options)
    return driver

@tool
def browse_website(url: str) -> str:
    """Navigates to a URL and returns the page content."""
    driver = get_driver()
    try:
        driver.get(url)
        time.sleep(3) # Wait for dynamic content to load
        return driver.page_source
    finally:
        driver.quit()
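One practical refinement, not part of the original tool: `driver.page_source` can run to tens of thousands of characters, most of it markup, which wastes LLM tokens. A minimal stdlib-only helper (the name `clean_html` and the `max_chars` default are illustrative) can strip scripts, styles, and tags and truncate the result before it reaches the model, for example by changing the tool's return line to `return clean_html(driver.page_source)`:

```python
import re

def clean_html(page_source: str, max_chars: int = 8000) -> str:
    """Strip scripts, styles, and tags; collapse whitespace; truncate."""
    # Remove <script> and <style> blocks entirely, including their contents
    text = re.sub(r'<(script|style)[^>]*>.*?</\1>', ' ', page_source,
                  flags=re.DOTALL | re.IGNORECASE)
    # Drop all remaining tags, keeping only the text between them
    text = re.sub(r'<[^>]+>', ' ', text)
    # Collapse runs of whitespace into single spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text[:max_chars]
```

A dedicated HTML parser (e.g. BeautifulSoup) is more robust for messy real-world pages; the regex version just avoids an extra dependency.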

3. Create the AI Orchestrator (agent.py)

The orchestrator uses the LLM to decide when and how to use the browse_website tool.

from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from tools import browse_website  # the tool defined in step 2 (tools.py)

# 1. Define the Prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert web scraping agent. Use the available tools to fulfill the user's request."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}")
])

# 2. Initialize the LLM (Requires OPENAI_API_KEY)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# 3. Create the Agent and Executor
tools = [browse_website]
agent = create_react_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Example run
# result = agent_executor.invoke({"input": "What is the main headline on the CapSolver homepage?"})
# print(result)

🛡️ Defense Bypass: Solving CAPTCHAs

For a production-ready AI agent, handling anti-bot measures is critical: in practice, a large share of scraping failures come from anti-bot systems such as CAPTCHAs, rate limiting, and browser fingerprinting, so integrating a CAPTCHA solver is effectively a requirement.

We recommend CapSolver for its high success rate and straightforward API integration. Your agent can be programmed to call the CapSolver API automatically when a CAPTCHA is detected, receive a solution token, and submit it to bypass the challenge.
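The detect-then-solve flow can be sketched with two small helpers. The detection heuristic below is our own simplification (real sites vary), and the payload field names (`clientKey`, `task.type`, `websiteURL`, `websiteKey`) follow CapSolver's `createTask` schema as we understand it; verify them against the current API documentation before relying on them:

```python
def looks_like_captcha(page_source: str) -> bool:
    """Heuristic: check the page HTML for common CAPTCHA widget markers."""
    markers = ("g-recaptcha", "h-captcha", "cf-turnstile", "captcha")
    lowered = page_source.lower()
    return any(m in lowered for m in markers)

def build_recaptcha_task(api_key: str, page_url: str, site_key: str) -> dict:
    """Build a createTask request body (field names assumed from CapSolver docs)."""
    return {
        "clientKey": api_key,
        "task": {
            "type": "ReCaptchaV2TaskProxyLess",
            "websiteURL": page_url,
            "websiteKey": site_key,
        },
    }
```

At runtime, when `looks_like_captcha` fires, the agent would POST this payload to CapSolver's `createTask` endpoint, poll `getTaskResult` for the solution token, and inject it into the page (for example via `driver.execute_script`) before retrying the scrape.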
