Web Scraper — books.toscrape.com

Overview

A modular Python web scraping project that extracts book data from books.toscrape.com, a sandbox website designed for scraping practice. The project follows a structured pipeline: scrape, clean, validate, and export. It is built for clarity and real-world applicability, making it suitable for entry- to intermediate-level developers.

Features

  • Multi-page scraping with configurable page count
  • Extraction of book title, price, and rating
  • Automatic retry logic for failed HTTP requests
  • Request timeout to prevent hanging connections
  • Data cleaning: currency symbols removed, ratings converted from text to numeric
  • Data validation: missing values dropped, price and rating range enforced
  • Output in both CSV and JSON formats
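
The retry and timeout behaviour listed above can be sketched generically. This is a minimal illustration, not the repo's actual code: `fetch_with_retry`, its `backoff` parameter, and the injectable `opener` are assumptions, and it uses stdlib `urllib` here even though the project itself uses requests.

```python
import time
import urllib.request


def fetch_with_retry(url, retries=3, timeout=10, backoff=0.5,
                     opener=urllib.request.urlopen):
    """Fetch a URL, retrying up to `retries` times with a linear backoff.

    `opener` is injectable so the retry logic can be exercised
    without a real network connection.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except Exception as error:  # timeouts, connection resets, HTTP errors
            last_error = error
            if attempt < retries:
                time.sleep(backoff * attempt)
    raise last_error
```

Injecting the opener keeps the retry policy testable: a stub that fails twice and then succeeds verifies the "retry up to 3 times" behaviour without touching the network.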

Tech Stack

Tool            Purpose
Python 3.8+     Core language
requests        HTTP requests
BeautifulSoup   HTML parsing
pandas          Data manipulation and export
lxml            Fast HTML parser backend

Project Structure

web-scrapper/
├── main.py            # Entry point — orchestrates the full pipeline
├── extractor.py       # Fetches pages and parses HTML for book data
├── cleaner.py         # Cleans price and rating fields
├── validator.py       # Validates data quality (missing values, ranges)
├── requirements.txt   # Python dependencies
├── README.md
└── output/            # Generated at runtime
    ├── books_raw.csv
    ├── books_cleaned.csv
    └── books_cleaned.json

How It Works

  1. Scraping — extractor.py sends HTTP requests to each catalogue page of books.toscrape.com. It retries up to 3 times on failure and uses a 10-second timeout. BeautifulSoup parses the HTML to extract the title, price, and rating from each book listing.

  2. Raw Export — The unprocessed data is saved to output/books_raw.csv for reference.

  3. Cleaning — cleaner.py converts price strings (e.g., £51.77) to floats and rating words (e.g., Three) to integers (e.g., 3).

  4. Validation — validator.py removes rows that have missing values, prices less than or equal to zero, or ratings outside the 1-5 range.

  5. Final Export — The validated dataset is saved to both output/books_cleaned.csv and output/books_cleaned.json.
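
Steps 3 and 4 above can be sketched in plain Python. The function names and the rating map below are assumptions about what cleaner.py and validator.py do, not the repo's actual code, and pandas is deliberately left out to keep the sketch self-contained.

```python
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def clean_book(raw):
    """Convert one raw scraped record into typed values.

    Example: {'title': 'A', 'price': '£51.77', 'rating': 'Three'}
         ->  {'title': 'A', 'price': 51.77, 'rating': 3}
    """
    price = raw.get("price")
    return {
        "title": raw.get("title"),
        # Strip the currency symbol before converting to float.
        "price": float(str(price).lstrip("£")) if price else None,
        # Unknown rating words map to None and are dropped by validation.
        "rating": RATING_WORDS.get(raw.get("rating")),
    }


def is_valid(book):
    """Enforce the rules from step 4: no missing values,
    price > 0, rating within 1-5."""
    return (
        all(book.get(field) is not None for field in ("title", "price", "rating"))
        and book["price"] > 0
        and 1 <= book["rating"] <= 5
    )


def clean_and_validate(raw_books):
    """Run cleaning, then keep only rows that pass validation."""
    return [book for book in map(clean_book, raw_books) if is_valid(book)]
```

For instance, a row priced £0.00 or rated with an unrecognised word is cleaned but then filtered out, matching the validation rules described in step 4.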

How to Run

Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)

Installation

# Clone the repository
git clone https://github.com/your-username/web-scrapper.git
cd web-scrapper

# Install dependencies
pip install -r requirements.txt

Execution

python main.py

The script will display progress in the terminal and generate output files in the output/ directory.

Output

File                        Description
output/books_raw.csv        Raw scraped data before any processing
output/books_cleaned.csv    Cleaned and validated data in CSV format
output/books_cleaned.json   Cleaned and validated data in JSON format
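
The JSON export can be read back with the standard library. The record shape below (title, price, rating fields) is inferred from the pipeline description, not verified against the repo's output:

```python
import json

# A plausible one-record excerpt of output/books_cleaned.json
# (field names assumed from the pipeline description).
sample = '[{"title": "A Light in the Attic", "price": 51.77, "rating": 3}]'

books = json.loads(sample)
first = books[0]
```

In real use, replace `sample` with `open("output/books_cleaned.json").read()` after running the scraper.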

Purpose

This project demonstrates a practical, real-world scraping workflow broken into discrete, reusable modules. It is designed as a learning project that covers:

  • HTTP request handling with error recovery
  • HTML parsing and data extraction
  • Data cleaning and type conversion
  • Data validation and quality enforcement
  • Structured file output

It is well-suited for developers building a portfolio or learning how to work with web data in Python.

Future Improvements

  • Add command-line arguments for page count and output directory
  • Support scraping additional fields (availability, description, category)
  • Store data in a SQLite or PostgreSQL database
  • Add logging with the logging module instead of print statements
  • Implement asynchronous scraping with aiohttp for better performance
  • Add unit tests for each module
