A small end-to-end ETL-style data pipeline demonstrating real-world data engineering:
- Extract stock data from the Alpha Vantage API
- Transform with Pandas (cleaning, validation, typing)
- Load locally into CSV + Parquet
- Optionally upload to AWS S3
- Process with PySpark (local Databricks-style ETL)
- Run tests via GitHub Actions CI
This project demonstrates how Python, CI, AWS, and Spark can work together in a small, clear, end-to-end data pipeline.
git clone https://github.com/Annette3125/api-collector.git
cd api-collector
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.sample .env
python -m api_collector.get_data

Expected: creates data/new/stock_data_latest.csv and timestamped CSV/Parquet snapshots.
Pipeline: Alpha Vantage → get_data.py → CSV/Parquet → (optional) S3 → (optional) Spark summary → tests/CI.
Output contract: CSV columns are always: date, open, high, low, close, volume, symbol.
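The column contract above can be checked with a small pandas snippet (a sketch; `check_contract` is a hypothetical helper, not part of the package):

```python
import pandas as pd

# The documented output contract.
EXPECTED_COLUMNS = ["date", "open", "high", "low", "close", "volume", "symbol"]

def check_contract(csv_path) -> pd.DataFrame:
    """Load a pipeline CSV and verify it matches the output contract."""
    df = pd.read_csv(csv_path, parse_dates=["date"])
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"CSV is missing contract columns: {sorted(missing)}")
    # Reorder to the canonical column order.
    return df[EXPECTED_COLUMNS]
```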
- Pulls TIME_SERIES_DAILY stock data via Alpha Vantage API
- Supports multiple symbols (configurable in .env)
- Handles rate limits and HTTP errors
- Renames and normalizes columns
- Converts datatypes (numeric, datetime)
- Drops invalid rows (negative or missing prices)
- Sorts by date, removes duplicates
- Saves fresh snapshot as:
- data/new/stock_data_latest.csv
- timestamped history files
- Parquet outputs for further analytics
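The transform steps listed above amount to a few lines of pandas. A minimal sketch (column names follow the output contract; the actual logic in get_data.py may differ):

```python
import pandas as pd

PRICE_COLS = ["open", "high", "low", "close"]

def clean_stock_frame(df: pd.DataFrame, symbol: str) -> pd.DataFrame:
    """Normalize a raw Alpha Vantage daily frame into the output contract."""
    # Alpha Vantage keys look like "1. open"; strip the numeric prefixes.
    df = df.rename(columns=lambda c: c.split(". ")[-1].strip().lower())
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    for col in PRICE_COLS + ["volume"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Drop rows with missing dates/prices or non-positive prices.
    df = df.dropna(subset=["date"] + PRICE_COLS)
    df = df[(df[PRICE_COLS] > 0).all(axis=1)]
    # Sort by date and remove duplicate days.
    df = df.sort_values("date").drop_duplicates(subset=["date"]).reset_index(drop=True)
    df["symbol"] = symbol
    return df[["date"] + PRICE_COLS + ["volume", "symbol"]]
```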
Example CSV schema:
date,open,high,low,close,volume,symbol

Requirements
- Python 3.11+
- Git
Environment variables
Create .env from sample:
cp .env.sample .env

ALPHA_VANTAGE_API_KEY=your-api-key
SYMBOLS=AAPL,GOOGL,MSFT
DATA_DIR=data/new
RATE_LIMIT_SLEEP=15

Get a free API key from Alpha Vantage: https://www.alphavantage.co/support/#api-key
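Under the hood, the extract step roughly amounts to the following (a sketch using only the standard library; the endpoint and JSON keys follow the Alpha Vantage docs, but the function names here are illustrative, not the package's actual API):

```python
import json
import time
import urllib.parse
import urllib.request

BASE_URL = "https://www.alphavantage.co/query"

def build_daily_url(symbol: str, api_key: str) -> str:
    """Build the TIME_SERIES_DAILY request URL for one symbol."""
    params = {"function": "TIME_SERIES_DAILY", "symbol": symbol, "apikey": api_key}
    return f"{BASE_URL}?{urllib.parse.urlencode(params)}"

def parse_daily(payload: dict, symbol: str) -> list[dict]:
    """Flatten the 'Time Series (Daily)' mapping into one record per day."""
    series = payload.get("Time Series (Daily)", {})
    return [
        # Keys arrive as "1. open", "2. high", ...; keep only the field name.
        {"date": day, "symbol": symbol, **{k.split(". ")[1]: v for k, v in values.items()}}
        for day, values in series.items()
    ]

def fetch_daily(symbol: str, api_key: str, rate_limit_sleep: float = 15) -> list[dict]:
    """Fetch one symbol, then sleep to respect the free-tier rate limit."""
    with urllib.request.urlopen(build_daily_url(symbol, api_key)) as resp:
        payload = json.load(resp)
    time.sleep(rate_limit_sleep)  # mirrors RATE_LIMIT_SLEEP from .env
    return parse_daily(payload, symbol)
```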
Extract + Transform + Load
python -m api_collector.get_data

Sample output files:
- data/new/stock_data_latest.csv
- data/new/stock_data_<timestamp>.csv
- data/new/stock_data_<timestamp>.parquet
Timestamp format: YYYYMMDD_HHMMSS (UTC).
Daily scheduler (optional)
python -m api_collector.scheduler

S3 upload (optional)
To use S3 upload, configure AWS credentials locally (or use an IAM role) and set S3_BUCKET_NAME in .env.
- Configure AWS credentials locally:
aws configure
- Set S3 bucket in .env:
S3_BUCKET_NAME=your-bucket-name
S3_KEY=raw/stock_data_latest.csv
- Upload latest CSV:
python -m api_collector.upload_to_s3

Example output:
s3://<your-bucket-name>/raw/stock_data_latest.csv

Spark ETL (optional)
macOS prerequisites:
export JAVA_HOME="$(/usr/libexec/java_home -v 17)"
export SPARK_LOCAL_IP=127.0.0.1

Run Spark ETL:
python -m api_collector.databricks_etl
or:
./scripts/run_spark.sh

Output: data/processed/stock_summary.parquet
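The exact aggregation in databricks_etl isn't shown here; a plausible per-symbol summary, sketched in pandas for illustration (the real job runs the equivalent on Spark):

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Per-symbol summary (hypothetical: the real Spark job may aggregate differently)."""
    return (
        df.groupby("symbol", as_index=False)
          .agg(first_date=("date", "min"),
               last_date=("date", "max"),
               avg_close=("close", "mean"),
               total_volume=("volume", "sum"))
    )
```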
Tests
pytest -q

CI runs automatically on every GitHub push.
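A minimal test in the same spirit as the suite (a hypothetical example; the real tests live in tests/):

```python
import tempfile
from pathlib import Path
import pandas as pd

EXPECTED_COLUMNS = ["date", "open", "high", "low", "close", "volume", "symbol"]

def test_output_contract():
    """Any CSV the pipeline writes must match the documented column contract."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "stock_data_latest.csv"
        pd.DataFrame([{c: 1 for c in EXPECTED_COLUMNS}]).to_csv(path, index=False)
        df = pd.read_csv(path)
        assert list(df.columns) == EXPECTED_COLUMNS
```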
Future work
- Upload processed Parquet to S3
- Orchestrator (Airflow/Prefect)
- Streamlit dashboard
- Small FastAPI service for querying results
