A web scraping pipeline that gathers verified business contact data from publicly available directories across selected industries.
- Collects and saves all winery profile URLs from Europages.
- Visits those profile pages (or linked websites) and extracts valid email addresses.
- Produces and submits the final cleaned datasets (CSV).
The final output files must follow these exact formats:

| NAME | EMAILS | COUNTRY |
|---|---|---|

| NAME | LINK |
|---|---|
Step 1 – Setup
- Imported the libraries.
- Identified the 3 source pages.
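A minimal sketch of the setup step. The three Europages search-result URLs are assumptions for illustration (the actual page parameters may differ); the imports match the modules listed in the tech stack below.

```python
# Libraries used throughout the pipeline (requests, bs4, and pandas
# come pre-installed on Google Colab).
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical example: the three Europages search-result pages
# (the page-number query parameter is an assumption for illustration).
BASE_URL = "https://www.europages.co.uk"
SOURCE_PAGES = [
    f"{BASE_URL}/en/search?isPserpFirst=1&q=winery+supplies&page={n}"
    for n in (1, 2, 3)
]
```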
Step 2 – Data Extraction
- Scraped data from each page.
- Saved each page's .html links into separate CSV files.
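The extraction step can be sketched as below. To stay self-contained, it parses a small inline HTML sample instead of a live Europages response; the CSS selector, company names, and output file name are assumptions for illustration.

```python
import csv
from bs4 import BeautifulSoup

# Inline stand-in for one fetched search-result page.
sample_html = """
<ul>
  <li><a href="/en/company/example-winery-supplies-1.html">Example Winery Supplies 1</a></li>
  <li><a href="/en/company/example-winery-supplies-2.html">Example Winery Supplies 2</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")
rows = [
    {"NAME": a.get_text(strip=True), "LINK": a["href"]}
    for a in soup.select("a[href$='.html']")  # keep only .html profile links
]

# One CSV per source page, in the NAME | LINK format.
with open("page_1_links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["NAME", "LINK"])
    writer.writeheader()
    writer.writerows(rows)
```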
Step 3 – Data Merging
- Merged the three CSV files into one master file.
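A sketch of the merge step, assuming the three per-page CSVs share the same NAME | LINK columns. Small in-memory tables stand in for `pd.read_csv` on the real files, and the de-duplication on LINK is an assumption about how repeats across pages are handled.

```python
import pandas as pd

# Stand-ins for the three per-page CSVs (in the pipeline these would
# come from pd.read_csv("page_N_links.csv")).
parts = [
    pd.DataFrame({"NAME": ["A"], "LINK": ["/en/company/a.html"]}),
    pd.DataFrame({"NAME": ["B"], "LINK": ["/en/company/b.html"]}),
    pd.DataFrame({"NAME": ["A"], "LINK": ["/en/company/a.html"]}),  # duplicate
]

# Stack the tables and drop repeated profile links.
master = pd.concat(parts, ignore_index=True).drop_duplicates(subset="LINK")
master.to_csv("master_links.csv", index=False)
```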
Step 4 – URL Construction
- Combined the standard base URL with the variable link parts.
- Generated full industry URLs.
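The URL construction step amounts to joining the fixed base URL with each saved relative link; `urllib.parse.urljoin` handles this. The relative path below is an illustrative example, not a real profile.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.europages.co.uk"

# A relative .html link as saved in Step 2 (illustrative example).
relative_link = "/en/company/example-winery-supplies-1.html"

# urljoin combines the base with the variable link part.
full_url = urljoin(BASE_URL, relative_link)
```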
Step 5 – Final Outputs
- Created the NAME | LINK CSV.
- Created the NAME | EMAILS | COUNTRY CSV.
Output file formats:
- NAME | LINK (.html)
- NAME | LINK
- NAME | LINK | COUNTRY
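The final-output step can be sketched as follows: pull email addresses out of page text with a regex and write the NAME | EMAILS | COUNTRY CSV. The snippet runs on an inline string rather than a fetched page, the company and country values are illustrative, and the regex is a common pragmatic pattern, not a full RFC 5322 validator.

```python
import re
import pandas as pd

# Pragmatic email pattern (assumption; not a full RFC 5322 validator).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Stand-in for the text of one fetched profile page.
page_text = "Contact us at info@example-winery.com or sales@example-winery.com."

# Deduplicate and sort the matches for a stable EMAILS field.
emails = sorted(set(EMAIL_RE.findall(page_text)))

# Assemble one row of the final NAME | EMAILS | COUNTRY table
# (NAME and COUNTRY values here are illustrative).
final = pd.DataFrame(
    [{"NAME": "Example Winery Supplies", "EMAILS": "; ".join(emails), "COUNTRY": "France"}]
)
final.to_csv("winery_contacts.csv", index=False)
```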
- Language: Python 3
- Modules: requests, BeautifulSoup (bs4), pandas, urllib.parse, re
- Platform: Google Colab
MIT License
| Sources |
|---|
| https://www.notion.so/digiole/Scalable-Web-Scraping-Pipeline-21425969342680b7a99ef9f999a96f06 |
| https://www.europages.co.uk/en/search?isPserpFirst=1&q=winery+supplies |
| https://beautiful-soup-4.readthedocs.io/en/latest/#quick-start |
| https://pandas.pydata.org/docs/user_guide/index.html |
| W3Schools |
| Stack Overflow |
| YouTube |
| ChatGPT |
| Gemini AI |
| GeeksforGeeks |