web-scraper-demo

Python web scraper for CNN headlines. Respects robots.txt, avoids duplicates, and saves output to CSV/JSON.

Prerequisites

Before you start, make sure you have:

Python 3.11+ installed (download from https://www.python.org/downloads/)
Git installed (for cloning the repo)

Clone the Repo

git clone https://github.com/Annette3125/web-scraper-demo.git
cd web-scraper-demo

Create a virtual environment in the project's root directory:

For Linux/Mac:

python3 -m venv venv

For Windows:

python -m venv venv

Activate the virtual environment

For Linux/Mac:

source venv/bin/activate

For Windows:

venv\Scripts\activate

Dependencies

pip install -r requirements.txt

Development

I use Black for code formatting and isort for import sorting.

pip install black isort
isort .
black .

Run

python scrape.py

Output files

The script generates:

data/headlines.csv
data/headlines.json

CSV format

Columns: source, url, heading, author, date

Example of CSV output:

source,url,heading,author,date
edition.cnn.com,https://edition.cnn.com/2026/03/04/politics/us-troop-deaths-iran-trump-hegseth,Trump’s and Hegseth’s awkward comments about US troop deaths in Iran war,Aaron Blake,"PUBLISHED Mar 4, 2026, 4:45 PM ET"
edition.cnn.com,https://edition.cnn.com/travel/social-bathhouses-north-america,The new going-out spot isn’t a bar. It’s so much hotter than that,,"PUBLISHED Mar 4, 2026, 8:24 AM ET"
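The CSV layout above can be reproduced with the standard library. The following is a minimal sketch, not the code from scrape.py: the function name write_headlines_csv and the choice to detect duplicates by URL are assumptions made for illustration.

```python
import csv
import io

# Column order taken from the README; everything else here is illustrative.
FIELDS = ["source", "url", "heading", "author", "date"]


def write_headlines_csv(rows, fileobj):
    """Write dict rows as CSV, skipping any row whose URL was already written."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    seen_urls = set()
    written = 0
    for row in rows:
        if row["url"] in seen_urls:
            continue  # assumed duplicate rule: same URL means same headline
        seen_urls.add(row["url"])
        # A missing field (for example, no author) becomes an empty cell.
        writer.writerow({key: row.get(key, "") for key in FIELDS})
        written += 1
    return written


sample = [
    {
        "source": "edition.cnn.com",
        "url": "https://edition.cnn.com/travel/social-bathhouses-north-america",
        "heading": "The new going-out spot isn’t a bar. It’s so much hotter than that",
        "date": "PUBLISHED Mar 4, 2026, 8:24 AM ET",
    },
]
sample.append(dict(sample[0]))  # deliberate duplicate to show it is skipped

buffer = io.StringIO()
count = write_headlines_csv(sample, buffer)
print(buffer.getvalue())
```

Because the date contains commas, the csv module quotes that field automatically, which matches the quoted date column in the example output.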

JSON output

A list of objects with keys: source, url, heading, author, date

Example of JSON output:

[
    {
        "source": "edition.cnn.com",
        "url": "https://edition.cnn.com/2026/03/12/economy/costs-iran-war-price-groceries",
        "heading": "What Iran war could soon cost you",
        "date": "PUBLISHED Mar 12, 2026, 7:00 AM ET",
        "author": "Elisabeth Buchwald"
    }
]
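A list of objects like this can be produced and read back with the json module. This is a hedged sketch rather than the project's actual code: dump_headlines is a hypothetical helper, and the indent=4 and ensure_ascii=False settings are assumptions chosen to resemble the sample output.

```python
import json

records = [
    {
        "source": "edition.cnn.com",
        "url": "https://edition.cnn.com/2026/03/12/economy/costs-iran-war-price-groceries",
        "heading": "What Iran war could soon cost you",
        "date": "PUBLISHED Mar 12, 2026, 7:00 AM ET",
        "author": "Elisabeth Buchwald",
    }
]


def dump_headlines(items):
    """Serialize headline dicts as an indented JSON array (hypothetical helper)."""
    # ensure_ascii=False keeps characters such as curly apostrophes readable.
    return json.dumps(items, indent=4, ensure_ascii=False)


text = dump_headlines(records)
loaded = json.loads(text)  # round trip: parse the text back into Python objects
print(text)
```

Note that json.dumps never emits a trailing comma after the last object, so the result is always valid JSON that json.loads can parse.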

Notes

The scraper respects robots.txt and uses small delays to avoid overloading the website.
Output files are generated in the data/ directory.
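The robots.txt check and the delay between requests can be sketched with the standard library's urllib.robotparser. This is an illustration under assumptions, not the logic in scrape.py: the user agent demo-bot, the example.com URLs, the robots.txt rules, and the function names are all hypothetical, and the rules are parsed from a string so that no network access is needed.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical rules; a real run would load the site's own robots.txt.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_allowed(urls, parser, user_agent="demo-bot", delay=0.0):
    """Return the URLs the rules allow, pausing after each accepted one."""
    allowed = []
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt, so never requested
        allowed.append(url)  # a real scraper would send the request here
        time.sleep(delay)  # small pause to avoid overloading the site
    return allowed


parser = build_parser(EXAMPLE_ROBOTS)
urls = ["https://example.com/news/a", "https://example.com/private/b"]
allowed = filter_allowed(urls, parser, delay=0.1)
print(allowed)
```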

License

This project is MIT-licensed. See LICENSE.

Future Improvements

Add SQLite integration for persistent storage.
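One possible shape for that SQLite storage is sketched below with the standard sqlite3 module. Nothing here exists in the repository yet: the headlines table, its schema, and the helper names are assumptions, and using the URL as primary key is just one way to keep duplicates out.

```python
import sqlite3


def init_db(conn):
    """Create the (hypothetical) headlines table if it does not exist."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS headlines (
            url     TEXT PRIMARY KEY,
            source  TEXT NOT NULL,
            heading TEXT NOT NULL,
            author  TEXT,
            date    TEXT
        )
        """
    )


def save_headlines(conn, rows):
    """Insert rows, ignoring URLs already stored; return how many were added."""
    before = conn.execute("SELECT COUNT(*) FROM headlines").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO headlines (source, url, heading, author, date) "
        "VALUES (:source, :url, :heading, :author, :date)",
        rows,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM headlines").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
init_db(conn)
rows = [
    {
        "source": "edition.cnn.com",
        "url": "https://edition.cnn.com/2026/03/12/economy/costs-iran-war-price-groceries",
        "heading": "What Iran war could soon cost you",
        "author": "Elisabeth Buchwald",
        "date": "PUBLISHED Mar 12, 2026, 7:00 AM ET",
    }
]
inserted_first = save_headlines(conn, rows)
inserted_again = save_headlines(conn, rows)  # same URL, so nothing is added
print(inserted_first, inserted_again)
```

With the URL as primary key, repeated runs of the scraper would not store the same article twice.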

A complete Python web scraping pipeline (requests → BeautifulSoup → JSON/CSV).
