WebScraper Pro 🕸️


A configurable Python web scraping tool that extracts structured data from multiple webpages and exports the results to CSV.
Built for automation, data collection, and Upwork-style client projects.


✨ Features

  • Scrapes multiple pages using a URL pattern that contains a {page} placeholder
  • Fully configurable via JSON (no code changes needed)
  • Extracts data using CSS selectors (quotes, authors, tags, or any other fields)
  • Saves clean structured data to CSV
  • Logs scraping progress to logs/scraper.log
  • Simple command-line interface for clients and non-technical users
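
The logging feature can be pictured with a minimal sketch like the one below. It uses only the standard logging module; the helper name make_logger, the logger name, and the message format are assumptions for illustration, not the project's actual code.

```python
import logging
from pathlib import Path


def make_logger(log_path: str, name: str = "webscraper_sketch") -> logging.Logger:
    """Hypothetical helper: return a logger that appends to log_path."""
    # Create the logs/ directory if it does not exist yet.
    Path(log_path).parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = make_logger("logs/scraper.log")
logger.info("Fetched page %d of %d", 1, 3)
```

Each call to logger.info appends one timestamped line to logs/scraper.log, so progress can be followed while a long run is in flight.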

🧱 Project Structure

webscraper_pro/
├─ README.md
├─ LICENSE
├─ requirements.txt
├─ .gitignore
├─ data/
│  ├─ sample_urls.txt
│  └─ output/
├─ logs/
├─ webscraper/
│  ├─ __init__.py
│  ├─ config_example.json
│  ├─ cli.py
│  ├─ scraper.py
│  ├─ parser.py
│  └─ storage.py

⚙️ Configuration

Example config file: webscraper/config_example.json

{
    "base_url": "https://quotes.toscrape.com/page/{page}/",
    "start_page": 1,
    "end_page": 3,
    "selectors": {
        "quote": ".quote .text",
        "author": ".quote .author",
        "tags": ".quote .tags .tag"
    }
}

Fields explained:

  • base_url: must contain the {page} placeholder so the scraper can iterate over pages
  • start_page / end_page: first and last page numbers to scrape (inclusive)
  • selectors: one CSS selector per extracted field; each key becomes a field name

You can adapt this JSON to other sites that paginate through a {page}-style URL, not just the quotes example.
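
As a rough illustration of how these fields fit together, the sketch below parses a config of the same shape and expands {page} into one URL per page. The helper build_page_urls is hypothetical and not taken from the project's source; it assumes the page range is inclusive, as the pages 1 to 3 example suggests.

```python
import json

# Assumption: same shape as webscraper/config_example.json.
CONFIG_TEXT = """
{
    "base_url": "https://quotes.toscrape.com/page/{page}/",
    "start_page": 1,
    "end_page": 3,
    "selectors": {"quote": ".quote .text"}
}
"""


def build_page_urls(config: dict) -> list:
    """Hypothetical helper: expand {page} for every page in the inclusive range."""
    base_url = config["base_url"]
    if "{page}" not in base_url:
        raise ValueError("base_url must contain the {page} placeholder")
    pages = range(config["start_page"], config["end_page"] + 1)
    # str.replace is used instead of str.format so other braces in the URL are left alone.
    return [base_url.replace("{page}", str(page)) for page in pages]


config = json.loads(CONFIG_TEXT)
urls = build_page_urls(config)
for url in urls:
    print(url)
```

Checking for the placeholder up front turns a misconfigured base_url into an immediate error instead of fetching the same page repeatedly.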


▶️ How to Run

Create and activate a virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Run the scraper:

python -m webscraper.cli --config webscraper/config_example.json --output data/output/quotes.csv

Result:

  • Fetches pages 1–3
  • Extracts quotes, authors, and tags
  • Saves them to data/output/quotes.csv
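
The export step might look roughly like the sketch below, which uses the standard csv module. The helper write_rows, the "|" separator for multi-valued fields such as tags, and the sample row are all assumptions made for illustration; the project's storage.py may work differently.

```python
import csv
import io


def write_rows(rows, fieldnames, stream):
    """Hypothetical helper: write dict rows as CSV, flattening list values."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        # Join list values (e.g. several tags) into a single cell.
        flat = {
            key: "|".join(value) if isinstance(value, list) else value
            for key, value in row.items()
        }
        writer.writerow(flat)


# Made-up sample row, not real scraped data.
rows = [
    {"quote": "Example quote text", "author": "Example Author", "tags": ["tag-a", "tag-b"]}
]
buffer = io.StringIO()
write_rows(rows, ["quote", "author", "tags"], buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a StringIO buffer, open it with newline="" as the csv module documentation recommends, for example open(path, "w", newline="", encoding="utf-8").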

📜 License

This project is licensed under the MIT License.
You are free to use, modify, distribute, and incorporate the code into your own projects.

See the full license in the included LICENSE file.


📝 Notes

  • This project is for demonstration and educational purposes.
  • Always respect website terms of service and robots.txt when scraping real websites.
  • The scraper is modular and easy to extend for more complex automation.
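
One way to act on the robots.txt note is to check each URL before fetching it. The sketch below uses the standard library's urllib.robotparser on an in-memory rules string; the helper is_allowed and the sample rules are hypothetical, and the README does not say whether the scraper performs such a check itself.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Hypothetical helper: check a URL against robots.txt rules given as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only.
SAMPLE_RULES = "User-agent: *\nDisallow: /private/\n"

print(is_allowed(SAMPLE_RULES, "https://example.com/page/1/"))
print(is_allowed(SAMPLE_RULES, "https://example.com/private/report"))
```

For a live site, RobotFileParser can also download the rules itself via set_url() followed by read().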