This project performs end‑to‑end automated web data extraction, data cleaning, Excel generation, and email delivery, with optional GitHub Actions CI/CD and n8n orchestration.
It demonstrates a multi‑environment automation architecture:
Wikipedia Page
↓
Python Script
(scrape → clean → Excel → email)
↓
Excel Output (.xlsx)
↓
Email Delivery (if SMTP configured)

n8n Trigger (Cron or Manual)
↓
HTTP Request → GitHub Actions (workflow_dispatch)
↓
GitHub Actions
- Install dependencies
- Run Python script in cloud
- Upload Excel as artifact
- Email sent by Python script
↓
Artifact Storage (GitHub)
↓
n8n
- List artifacts
- Pick latest
- Download binary (Excel)
The diagrams above show the actual execution order of the pipeline.
The Python script (html_table_scraper.py) performs 5 fully automated steps:
- Uses `requests` with custom headers
- Fetches a Wikipedia table page
- Includes timeout & error handling
- `BeautifulSoup` selects the table
- `pandas.read_html()` converts it to a DataFrame
- Handles multi‑index columns
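The fetch step above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the URL, the `User-Agent` value, the 15-second timeout, and the function name `fetch_html` are all assumptions made for the example.

```python
import requests

# Assumed target page and header values; the real script may differ.
URL = "https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue"
HEADERS = {"User-Agent": "html-table-scraper-example/0.1"}


def fetch_html(url, timeout=15):
    """Return the page HTML, raising RuntimeError on any request failure."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        # Turn 4xx/5xx responses into exceptions as well.
        response.raise_for_status()
    except requests.RequestException as exc:
        raise RuntimeError(f"Failed to fetch {url}: {exc}") from exc
    return response.text
```

Calling `fetch_html(URL)` would return the HTML that the parsing step then hands to `BeautifulSoup` and `pandas.read_html()`.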
✔ Flattens multi‑index column headers
✔ Normalizes column names
✔ Renames technical columns:
- `revenue_usd_in_millions` → `revenue_usd_million`
- `employees_employees` → `employees`
- `headquartersnote_1` → `headquarters`
✔ Removes:
- unnamed columns
- empty rows
- irrelevant metadata columns (state-owned, reference)
✔ Automatically sorts by rank
This results in a clean, analysis‑ready dataset.
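The header flattening, normalization, and renaming described above could look something like the sketch below. The normalization rules (join levels with `_`, lowercase, drop punctuation, replace whitespace with `_`) are an assumption chosen because they reproduce the intermediate names listed above; the helper names `normalize_label` and `clean_column_names` are hypothetical.

```python
import re

# Rename map taken from the list above.
RENAMES = {
    "revenue_usd_in_millions": "revenue_usd_million",
    "employees_employees": "employees",
    "headquartersnote_1": "headquarters",
}


def normalize_label(label):
    """Flatten one (possibly multi-level) column label into snake_case."""
    parts = label if isinstance(label, tuple) else (label,)
    joined = "_".join(str(p).strip() for p in parts if str(p).strip())
    joined = re.sub(r"[^\w\s]", "", joined.lower())  # drop punctuation
    return re.sub(r"\s+", "_", joined)               # whitespace -> underscore


def clean_column_names(labels):
    """Normalize every label, then apply the rename map."""
    normalized = (normalize_label(label) for label in labels)
    return [RENAMES.get(name, name) for name in normalized]
```

With a DataFrame `df`, this would be applied as `df.columns = clean_column_names(df.columns)`, since iterating over a pandas `MultiIndex` yields tuples.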
Excel is exported to:
outputs/largest_companies_by_revenue.xlsx
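As a rough sketch of the row cleanup, sorting, and export, assuming column names have already been normalized: the metadata column names `state_owned` and `reference`, the `unnamed` prefix check, and the function names are hypothetical, and `to_excel` needs an Excel engine such as `openpyxl` installed.

```python
from pathlib import Path

import pandas as pd


def tidy_rows(df, metadata_columns=("state_owned", "reference")):
    """Drop unnamed columns, empty rows, and metadata columns; sort by rank."""
    keep = [c for c in df.columns if not str(c).startswith("unnamed")]
    df = df[keep].dropna(how="all")  # remove rows with no values at all
    df = df.drop(columns=[c for c in metadata_columns if c in df.columns])
    if "rank" in df.columns:
        df = df.sort_values("rank")
    return df.reset_index(drop=True)


def export_excel(df, path="outputs/largest_companies_by_revenue.xlsx"):
    """Write the DataFrame to the output path, creating the folder if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    df.to_excel(out, index=False)
    return out
```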
If .env contains valid Gmail App Password credentials:
- The script generates an email
- Attaches the Excel file
- Sends it via Gmail SMTP
👉 If SMTP is not configured, the script skips the email step without failing.
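The send-or-skip behavior can be illustrated with the standard library alone. This is a sketch under assumptions: it expects the `.env` values to be loaded into the environment already, sends the mail to the sender's own address, and uses Gmail's SSL endpoint on port 465. The subject, body text, and function names are placeholders, and the actual script may use STARTTLS on port 587 or a separate recipient setting.

```python
import os
import smtplib
from email.message import EmailMessage
from pathlib import Path

XLSX_SUBTYPE = "vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def build_message(sender, recipient, attachment):
    """Create an email with the Excel file attached."""
    path = Path(attachment)
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Largest companies by revenue"  # placeholder subject
    msg.set_content("The latest Excel export is attached.")
    msg.add_attachment(path.read_bytes(), maintype="application",
                       subtype=XLSX_SUBTYPE, filename=path.name)
    return msg


def send_report(attachment, env=os.environ):
    """Send the report; return False (and skip) when credentials are missing."""
    user = env.get("SCRAPER_SMTP_USER")
    password = env.get("SCRAPER_SMTP_PASSWORD")
    if not user or not password:
        print("SMTP not configured; skipping email.")
        return False
    msg = build_message(user, user, attachment)  # assumed: send to self
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login(user, password)
        smtp.send_message(msg)
    return True
```

Returning a boolean instead of raising keeps a missing `.env` from failing the whole run, which matches the skip behavior described above.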
File: .github/workflows/automation.yml
GitHub Actions provides:
✔ Cloud execution
✔ Dependency isolation
✔ Reproducibility
✔ Secure secrets management
✔ Artifact generation
Steps:
- Setup Python
- Install requirements
- Run the scraper script
- Email is sent by Python
- Excel is uploaded as a GitHub Artifact
This allows fully cloud‑based automation without local execution.
n8n provides:
- Manual execution
- Scheduled execution (cron)
- Triggering GitHub Actions via HTTP
- Downloading the latest artifact
- UI‑based binary download
n8n does not send email in this project;
email is always handled by the Python script.
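For readers who want to see what the n8n HTTP nodes are doing, here is a hedged Python equivalent of two of them: building the `workflow_dispatch` request and picking the latest artifact from a "list artifacts" response. The endpoint paths follow the GitHub REST API; the owner, repository, branch `main`, and function names are assumptions for illustration, and the n8n workflow itself is configured in its UI rather than in Python.

```python
import json
import urllib.request

API = "https://api.github.com"


def build_dispatch_request(owner, repo, workflow_file, token, ref="main"):
    """Build (but do not send) the POST that triggers workflow_dispatch."""
    url = f"{API}/repos/{owner}/{repo}/actions/workflows/{workflow_file}/dispatches"
    body = json.dumps({"ref": ref}).encode("utf-8")
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def pick_latest_artifact(payload):
    """Return the newest non-expired artifact from a list-artifacts response."""
    artifacts = [a for a in payload.get("artifacts", []) if not a.get("expired")]
    if not artifacts:
        return None
    # created_at is an ISO 8601 UTC timestamp, so string order is time order.
    return max(artifacts, key=lambda a: a["created_at"])
```

Sending the request with `urllib.request.urlopen(...)` would trigger the workflow, and the chosen artifact's `archive_download_url` is what the download step then fetches.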
html-table-scraper-gmail-automation/
│
├── html_table_scraper.py
├── requirements.txt
├── README.md
│
├── outputs/
│   └── largest_companies_by_revenue.xlsx
│
├── .github/
│   └── workflows/
│       └── automation.yml
│
└── docs/
    ├── README_N8N.md
    ├── html-table-scraper-gmail-automation.json
    └── automation.png
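Install dependencies:
pip install -r requirements.txt
Configure SMTP credentials in .env (optional, needed only for email):
SCRAPER_SMTP_USER=your_email@gmail.com
SCRAPER_SMTP_PASSWORD=your_app_password
Run the script:
python html_table_scraper.py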
| Method | Sends Email | Generates Excel | Stores Artifact | Requires Setup |
|---|---|---|---|---|
| Local Python | ✅ | ✅ | ❌ | .env for SMTP |
| GitHub Actions | ✅ | ✅ | ✅ | GitHub Secrets |
| n8n | (via Python) | (via Python) | Download available | GitHub token |
Özge Güneş
Automation • Python • Workflow Engineering