xAsmodeus/Web-Scraping-Pipeline


Web Scraping Pipeline


🧭 Overview

A web scraping pipeline that gathers verified business contact data from publicly available directories across selected industries.

  1. Collects and saves all winery profile URLs from Europages.
  2. Visits those pages (or the websites they link to) and extracts valid email addresses.
  3. Produces and submits the final cleaned datasets (CSV).

The output CSV files must follow these exact column formats:

NAME | EMAILS | COUNTRY
NAME | LINK

🛠️ Approach

Step 1 – Setup

  • Imported the required libraries (requests, BeautifulSoup, pandas, urllib.parse, re).
  • Identified the three source pages.

Step 2 – Data Extraction

  • Scraped data from each page.
  • Saved the .html links from each page into a separate CSV file.
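
The extraction step can be sketched as below. This is a minimal illustration, not the repository's actual code: the sample HTML, the link paths and the helper names `extract_html_links` and `rows_to_csv` are assumptions. In the real pipeline the HTML would presumably come from `requests.get(url).text`.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical listing markup; the real directory pages will differ.
SAMPLE_HTML = """
<ul>
  <li><a href="/en/company/alpha-winery-123.html">Alpha Winery</a></li>
  <li><a href="/en/company/beta-cellars-456.html">Beta Cellars</a></li>
  <li><a href="/en/search?page=2">Next page</a></li>
</ul>
"""


def extract_html_links(html):
    """Return (name, href) pairs for every anchor whose href ends in .html."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.endswith(".html"):
            rows.append((anchor.get_text(strip=True), href))
    return rows


def rows_to_csv(rows):
    """Serialise the pairs as CSV text with a NAME,LINK header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["NAME", "LINK"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_html_links(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Writing `csv_text` to one file per source page would give the separate CSV files described in this step.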

Step 3 – Data Merging

  • Merged the three CSV files into one master file.
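
One possible way to do the merge with pandas is shown below. The three small DataFrames stand in for the per-page CSV files (which would normally be loaded with `pd.read_csv`), and dropping duplicate links is an assumption rather than something the project states it does. The function name `merge_link_tables` is hypothetical.

```python
import pandas as pd

# Stand-ins for the three per-page CSV files (hypothetical data).
page_1 = pd.DataFrame({"NAME": ["Alpha", "Beta"], "LINK": ["/a.html", "/b.html"]})
page_2 = pd.DataFrame({"NAME": ["Gamma"], "LINK": ["/c.html"]})
page_3 = pd.DataFrame({"NAME": ["Beta", "Delta"], "LINK": ["/b.html", "/d.html"]})


def merge_link_tables(frames):
    """Stack the per-page tables and keep the first row for each LINK."""
    merged = pd.concat(frames, ignore_index=True)
    # Assumption: a company listed on several pages should appear only once.
    return merged.drop_duplicates(subset="LINK").reset_index(drop=True)


master = merge_link_tables([page_1, page_2, page_3])
print(master)
```

Calling `master.to_csv("master.csv", index=False)` would then write the master file; the filename here is only an example.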

Step 4 – URL Construction

  • Combined the fixed base URL with the variable part of each scraped link.
  • Generated full industry URLs.
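
A short sketch of the URL construction using `urllib.parse.urljoin`. The base URL is taken from the Europages link in the Sources list; the relative path and the helper name `build_full_url` are hypothetical.

```python
from urllib.parse import urljoin

# Base taken from the Europages source URL; assumed to be the right root.
BASE_URL = "https://www.europages.co.uk/"


def build_full_url(link_part, base=BASE_URL):
    """Join the base URL with a scraped link part.

    urljoin copes with a leading slash, with no leading slash, and with
    links that are already absolute (those are returned unchanged).
    """
    return urljoin(base, link_part)


full_url = build_full_url("/en/company/alpha-winery-123.html")
print(full_url)
```

Using `urljoin` instead of plain string concatenation avoids doubled or missing slashes when the scraped parts are not formatted consistently.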

Step 5 – Final Outputs

  • Created the NAME | LINK CSV.
  • Created the NAME | EMAILS | COUNTRY CSV.
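
The EMAILS column depends on pulling addresses out of page text. The sketch below shows one way to do that with `re`. The pattern is a deliberately simple assumption, not a full address validator and not necessarily the one the project uses. The sample text and the name `extract_emails` are hypothetical.

```python
import re

# Simple pattern: local part, "@", domain, and a TLD of two or more letters.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return unique, lower-cased email addresses in order of first appearance."""
    found = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email not in found:
            found.append(email)
    return found


sample_text = (
    'Contact: Info@Example-Winery.com or '
    '<a href="mailto:sales@example-winery.com">sales</a> '
    "(again: info@example-winery.com)"
)
emails = extract_emails(sample_text)
print(emails)
```

Joining the result with a separator such as `"; "` would give a single value for the EMAILS column; the choice of separator is an assumption.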

📊 Results

  1. NAME | LINK (.html)
Screenshot_2
  2. NAME | LINK
Screenshot_2
  3. NAME | LINK | COUNTRY
Screenshot_1

🧰 Technology Stack

  • Language: Python 3
  • Modules:
    • requests
    • BeautifulSoup (bs4)
    • pandas
    • urllib.parse
    • re
  • Platform: Google Colab

🛡️ Licence

MIT License


🙌 Credits

Sources
https://www.notion.so/digiole/Scalable-Web-Scraping-Pipeline-21425969342680b7a99ef9f999a96f06
https://www.europages.co.uk/en/search?isPserpFirst=1&q=winery+supplies
https://beautiful-soup-4.readthedocs.io/en/latest/#quick-start
https://pandas.pydata.org/docs/user_guide/index.html
W3Schools
Stack Overflow
YouTube
ChatGPT
Gemini AI
GeeksforGeeks
