A data platform that ingests, stores, cleans, and analyzes stock market data via a REST API. The project is built incrementally to demonstrate a complete ELT pipeline using modern tools.
- FastAPI – REST API framework
- PostgreSQL – Database with JSONB column for raw data storage
- Pydantic – Data validation via schemas
- psycopg3 + psycopg_pool – Database connection with connection pooling
- Pandas – Data cleaning, transformation, and analysis
- Docker – PostgreSQL and FastAPI running in containers
- python-dotenv – Environment variable management
- yfinance – Fetches historical stock data from Yahoo Finance
- schedule – Schedules daily data fetching
```
stock-data-pipeline/
├── data/
│   ├── cleaned_data.csv       # Cleaned stock data
│   ├── flagged_data.csv       # Rows flagged as suspicious
│   └── rejected_data.csv      # Rows rejected as invalid
├── src/
│   ├── __init__.py
│   ├── fetcher.py             # Automated data fetching and scheduling
│   ├── main.py                # FastAPI application and endpoints
│   ├── schemas.py             # Pydantic model for stock data
│   ├── database.py            # Database connection and pool
│   ├── processor.py           # Data ingestion and cleaning
│   ├── stock_analysis.py      # Metrics: pct_change, mean price, volatility
│   ├── quality_data.py        # Data flagging and rejection logic
│   └── daily_stats.py         # Top gainers, losers and volume logic
├── tests/
│   ├── __init__.py
│   └── test_main.py           # Pytest
├── dashboard.html             # Live stock dashboard
├── .env                       # Environment variables (not committed)
├── .gitignore
├── .dockerignore
├── Dockerfile
├── docker-compose.yaml
├── pyproject.toml
└── uv.lock
```
PostgreSQL runs in Docker on port 5440. The stocks_raw table stores stock data as JSONB:
| Column | Type | Description |
|---|---|---|
| id | bigint (PK) | Auto-generated ID |
| stock | jsonb | Stock data as JSON |
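For illustration, a record could be written into that JSONB column through the psycopg3 connection pool roughly as follows. The connection string values mirror the example `.env` shown later, and `insert_stock` is a hypothetical helper rather than the project's exact `database.py` code.

```python
from psycopg.types.json import Jsonb
from psycopg_pool import ConnectionPool

# Placeholder connection string; the real values come from the .env file shown below
conninfo = "host=localhost port=5440 user=postgres password=your_password dbname=stock_db"
pool = ConnectionPool(conninfo, min_size=1, max_size=4)

def insert_stock(stock: dict) -> None:
    """Hypothetical helper: insert one record into the stocks_raw JSONB column."""
    with pool.connection() as conn:
        conn.execute(
            "INSERT INTO stocks_raw (stock) VALUES (%s)",
            (Jsonb(stock),),
        )

insert_stock({
    "ticker": "AAPL",
    "price": 189.5,
    "currency": "USD",
    "date": "2024-01-05",
    "volume": 1_000_000,
})
```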
The validation schema is defined in `src/schemas.py` using Pydantic:
```python
from datetime import date
from typing import Optional

from pydantic import BaseModel

class StockData(BaseModel):
    ticker: str
    price: float
    currency: str = "USD"
    date: date
    volume: Optional[int] = None
```

The API exposes the following endpoints:

- Health check – returns a greeting message.
- Inserts a single stock object.
- Returns all rows from `stocks_raw`.
- Inserts multiple stock objects in a single request.
- Returns the top 10 tickers with the highest daily percentage gain.
- Returns the top 10 tickers with the highest daily percentage loss.
- Returns the top 10 tickers by trading volume for the latest date.
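As a quick smoke test, a record matching the `StockData` schema can be posted with `requests`. The `/stocks` path below is a placeholder; use the actual route defined in `src/main.py`.

```python
import requests

# Hypothetical route: replace "/stocks" with the actual path from src/main.py
payload = {
    "ticker": "AAPL",
    "price": 189.5,
    "currency": "USD",
    "date": "2024-01-05",
    "volume": 1_000_000,
}

resp = requests.post("http://localhost:8000/stocks", json=payload, timeout=10)
resp.raise_for_status()
print(resp.json())
```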
Raw data from PostgreSQL is processed through the following steps:
- Extract – `processor.py` fetches all rows from `stocks_raw` via the connection pool
- Clean – strips whitespace, validates dates, removes duplicates and invalid prices
- Flag – `quality_data.py` flags suspicious values (e.g. price > 10 000, empty fields)
- Reject – rejects impossible values (e.g. price > 50 000, malformed currency)
- Analyze – `stock_analysis.py` calculates:
  - Daily percentage change per ticker (`pct_change`)
  - Mean price per ticker
  - Rolling volatility per ticker (std over a 2-day window)
Output is saved to the data/ directory as CSV files.
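A rough sketch of these steps in Pandas, using the thresholds listed above; function and column names are illustrative rather than the exact code in `processor.py`, `quality_data.py`, and `stock_analysis.py`:

```python
import pandas as pd

def clean_and_analyze(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative version of the clean/flag/reject/analyze steps."""
    # Clean: strip whitespace, validate dates, drop duplicates and invalid prices
    df["ticker"] = df["ticker"].str.strip()
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df = df.dropna(subset=["date"]).drop_duplicates()
    df = df[df["price"] > 0]

    # Flag suspicious rows, reject impossible ones (thresholds from the steps above)
    flagged = df[(df["price"] > 10_000) & (df["price"] <= 50_000)]
    rejected = df[df["price"] > 50_000]
    df = df[df["price"] <= 50_000]

    # Analyze per ticker: daily % change, mean price, rolling volatility
    df = df.sort_values(["ticker", "date"])
    df["pct_change"] = df.groupby("ticker")["price"].pct_change() * 100
    df["mean_price"] = df.groupby("ticker")["price"].transform("mean")
    df["volatility"] = df.groupby("ticker")["price"].transform(
        lambda s: s.rolling(2).std()
    )

    flagged.to_csv("data/flagged_data.csv", index=False)
    rejected.to_csv("data/rejected_data.csv", index=False)
    df.to_csv("data/cleaned_data.csv", index=False)
    return df
```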
Raw stock data is automatically fetched from Yahoo Finance and inserted into PostgreSQL via fetcher.py.
- Fetch – `yfinance` downloads minute-by-minute OHLCV data for 70+ tickers
- Transform – each row is mapped to a dict with `ticker`, `price`, `currency`, `date`, `volume`
- Load – data is inserted into `stocks_raw` as JSONB via the connection pool
- Schedule – the `schedule` library runs `fetch_data()` automatically every minute during market hours
The script runs continuously with a while True loop, checking every second if a scheduled job is pending.
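A minimal sketch of that loop, assuming a `fetch_data()` helper built on `yfinance` and the `schedule` library; the ticker list, interval, and insert call are placeholders rather than the exact contents of `fetcher.py`:

```python
import time

import schedule
import yfinance as yf

TICKERS = ["AAPL", "MSFT", "NVDA"]  # placeholder subset of the 70+ tickers

def insert_stock(record: dict) -> None:
    """Placeholder for the JSONB insert via the connection pool."""
    print(record)

def fetch_data() -> None:
    """Download the latest 1-minute bars and map each row to the raw schema."""
    data = yf.download(TICKERS, period="1d", interval="1m", group_by="ticker")
    for ticker in TICKERS:
        bars = data[ticker].dropna()
        for ts, row in bars.iterrows():
            insert_stock({
                "ticker": ticker,
                "price": float(row["Close"]),
                "currency": "USD",
                "date": ts.isoformat(),
                "volume": int(row["Volume"]),
            })

schedule.every(1).minutes.do(fetch_data)  # only useful during market hours

while True:
    schedule.run_pending()
    time.sleep(1)
```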
- Image – built on `python:3.13-slim` (Debian-based Linux)
- Build – `docker build -t stock-api .` packages the FastAPI app with all dependencies via `uv`
- Run – `docker run -p 8000:8000 --env-file .env stock-api` starts the container with credentials injected at runtime
- Result – FastAPI runs on `http://0.0.0.0:8000` inside an isolated Linux container, connecting to PostgreSQL on the host machine via `host.docker.internal`
A live stock dashboard served as a static HTML file, fetching data from the FastAPI endpoints.
- Top Gainers – tickers with highest daily % gain (green)
- Top Losers – tickers with highest daily % loss (red)
- Top Volume – tickers with highest trading volume (gold)
- Percentage change calculated against previous day's closing price
- Stock prices are slightly delayed via Yahoo Finance
Open dashboard.html in a browser with the API running to view the dashboard.
Create a .env file in the project root:
```
DB_HOST=localhost
DB_PORT=5440
DB_USERNAME=postgres
DB_PASSWORD=your_password
DB_NAME=stock_db
```
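These variables are read with `python-dotenv`; a minimal sketch of how they might be loaded (the project's actual `database.py` may structure this differently):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

DB_SETTINGS = {
    "host": os.getenv("DB_HOST", "localhost"),
    "port": int(os.getenv("DB_PORT", "5440")),
    "user": os.getenv("DB_USERNAME"),
    "password": os.getenv("DB_PASSWORD"),
    "dbname": os.getenv("DB_NAME"),
}
```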
```bash
docker compose up -d
```

```bash
cd src
python fetcher.py
```

Note: `fetcher.py` must be running continuously to collect intraday data. It fetches minute-by-minute data for 70+ tickers during market hours (15:30–22:00 CET).

```bash
cd src
fastapi dev main.py
```

Navigate to http://localhost:8000/docs for Swagger UI.
Open dashboard.html in your browser.
Developing a real-time financial pipeline presented several "real-world" engineering hurdles. Below are the most significant challenges and how they were resolved:
Challenge: During testing on a Monday, the dashboard showed 0% price changes for all stocks.
Diagnosis: The system was comparing Monday's live price against the most recent record in the database, which was also from Monday. In financial terms, pct_change must be calculated against the previous trading day's close (Friday), not the current day's opening.
Solution: Refactored the analysis engine to use dynamic Pandas indexing (iloc[-2]). This ensures the reference point is always the last valid close, regardless of weekends or market holidays.
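A simplified illustration of that fix, assuming a per-ticker price series sorted by date (names are illustrative):

```python
import pandas as pd

def pct_change_vs_previous_close(prices: pd.Series) -> float:
    """% change of the latest price vs. the last valid close before it."""
    if len(prices) < 2:
        return 0.0
    previous_close = prices.iloc[-2]  # last valid close (e.g. Friday on a Monday)
    latest = prices.iloc[-1]
    return (latest - previous_close) / previous_close * 100

# Example: Friday close 100.0, Monday live price 102.5 -> +2.5 %
print(pct_change_vs_previous_close(pd.Series([98.0, 100.0, 102.5])))
```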
Challenge: The Python process would occasionally crash with RuntimeError: couldn't stop thread or database connection timeouts.
Diagnosis: The data fetcher was accidentally triggered every second inside a while True loop, causing hundreds of overlapping yfinance threads and exhausting the PostgreSQL connection pool.
Solution: Decoupled the execution logic. The fetcher is now strictly managed by the schedule library, with a 5-minute interval during market hours, ensuring each batch of 70+ tickers completes before the next one starts.
Challenge: Tickers would randomly disappear from the "Top Volume" or "Gainers" lists.
Diagnosis: The API filtered data based on a global max_date. Since different tickers update at slightly different intervals (latency), a ticker that hadn't updated in the last 5 seconds was excluded.
Solution: Implemented .groupby('ticker').tail(1) in the Pandas transformation layer. This ensures the dashboard always displays the latest known state for every ticker, providing a consistent user experience despite network jitter.
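In Pandas terms, the fix looks roughly like this (sample data for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT"],
    "date":   ["2024-01-05 15:59", "2024-01-05 16:00", "2024-01-05 15:58"],
    "price":  [189.0, 189.5, 372.1],
})

# Latest known row per ticker, instead of filtering on a single global max_date
latest = df.sort_values("date").groupby("ticker").tail(1)
print(latest)
```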
Challenge: Manual restarts of the fetcher risked creating duplicate entries for the same minute.
Solution: Leveraged PostgreSQL's ON CONFLICT DO NOTHING combined with a unique constraint on the raw data. This makes the ELT pipeline idempotent, meaning it can be restarted at any time to "fill the gaps" without corrupting the historical dataset.
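A sketch of such an idempotent insert with psycopg3; the conflict target is omitted here because the exact unique constraint depends on the project's schema:

```python
import psycopg
from psycopg.types.json import Jsonb

def insert_stock_idempotent(conn: psycopg.Connection, stock: dict) -> None:
    """Insert that silently skips duplicates, so the fetcher can be restarted safely."""
    conn.execute(
        # ON CONFLICT DO NOTHING relies on a unique constraint on the raw data;
        # its exact definition (e.g. ticker + timestamp) depends on the schema.
        "INSERT INTO stocks_raw (stock) VALUES (%s) ON CONFLICT DO NOTHING",
        (Jsonb(stock),),
    )
```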
| Phase | Status | Description |
|---|---|---|
| 1 – Foundation | ✅ Done | FastAPI, PostgreSQL, Pydantic, manual data ingestion |
| 2 – Transform with Pandas | ✅ Done | Clean data, flag/reject bad data, calculate key metrics |
| 3 – Automated data fetching | ✅ Done | yfinance integration, daily scheduling, full ELT pipeline |
| 4 – Linux & Docker | ✅ Done | Dockerfile for FastAPI, full containerization |
| 5 – Dashboard / Visualization | ✅ Done | Top gainers/losers/volume endpoints, live HTML dashboard |
Claude was utilized as a pair-programming partner to troubleshoot specific library errors (e.g., yfinance thread issues) and to assist with the boilerplate CSS for the dashboard. All core logic, ELT-pipeline architecture, and database schemas were designed and implemented manually to ensure deep understanding.
- `dashboard.html` was generated with AI assistance and not written manually.
- CORS middleware was added to `main.py` to allow the browser to make requests to the API from a different origin (the HTML file). Without it, the browser blocks requests between different ports/domains for security reasons.