Distributed real-time analytics platform for streaming stock market data.
StockStream ingests live and historical OHLCV data from Yahoo Finance, streams it through Apache Kafka, and processes it with Spark Structured Streaming. Price time-series is written to InfluxDB for fast range queries and aggregations; stock metadata (symbol, sector, industry, 52-week high/low) is stored in PostgreSQL. A rule engine runs on each batch to fire alerts on price moves, volume spikes, and volatility. Grafana dashboards connect to InfluxDB for real-time charts. The stack runs in Docker: Zookeeper, Kafka, Spark master/worker, InfluxDB, PostgreSQL, Grafana, and a Python producer container. Designed for fault tolerance (consumer groups, checkpointing, backpressure) and horizontal scaling via Kafka partitions and Spark workers.
| Layer | Technology | Version |
|---|---|---|
| Message broker | Apache Kafka | 2.8 |
| Stream processing | Apache Spark Structured Streaming | 3.3.3 |
| Time-series DB | InfluxDB | 2.5.1 |
| Relational DB | PostgreSQL | latest |
| Visualization | Grafana OSS | 8.4.3 |
| Runtime | Python | 3.10+ |
Python dependencies: confluent-kafka, findspark, psycopg2-binary, influxdb-client, yfinance, python-dotenv, schedule, pytz
- Kafka topics: `real-time-stock-prices`, `stock-general-information` (4 partitions each)
- Consumer group: `stock-stream-consumer-group`
- Backpressure: `maxOffsetsPerTrigger` = 10,000
- Checkpoint recovery at `/tmp/spark-checkpoint`
- PostgreSQL: stock metadata. Indexes on `Symbol`, `Entry_Date`, `Sector`, `Industry`, plus a composite `(Symbol, Entry_Date)` index
- InfluxDB: OHLCV time series. Symbols as tags, written via line protocol
| Rule | Default | Severity |
|---|---|---|
| Price drop | 5% | HIGH |
| Price spike | 5% | MEDIUM |
| High volume | 1M shares | INFO |
| Volatility (high-low spread) | 3% | MEDIUM |
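The exact rule interface in `consumer/alert_rules.py` is not shown here; a minimal sketch of the price-drop rule, with an assumed `check(row)` method, might look like:

```python
class PriceDropAlert:
    """Fires when close falls more than threshold_percent below open."""

    severity = "HIGH"

    def __init__(self, threshold_percent=5.0):
        self.threshold_percent = threshold_percent

    def check(self, row):
        """Return an alert message for the row, or None if no alert."""
        drop_pct = (row["open"] - row["close"]) / row["open"] * 100
        if drop_pct >= self.threshold_percent:
            return f"[{self.severity}] {row['symbol']} dropped {drop_pct:.1f}%"
        return None
```

In this shape, the consumer would call `check` on every row of each micro-batch and log any non-`None` results.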
Logs: `logs/alerts.log`, `logs/producer.log`, `logs/consumer.log`
Prerequisites: Python 3.10+, Docker, Docker Compose
```bash
# 1. Start infrastructure
docker-compose up -d

# 2. Install Python deps
pip install -r requirements.txt

# 3. Run producer
cd producer && python producer.py
```

Consumer (inside Spark container):

```bash
docker exec -it ktech_spark_submit bash -c "spark-submit --master spark://spark-master:7077 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.3 \
  --jars /opt/bitnami/spark/jars/postgresql-42.5.4.jar \
  /app/consumer/consumer.py"
```

Project layout:

```
├── producer/
│   ├── producer.py
│   ├── producer_utils.py
│   └── stock_info_producer.py
├── consumer/
│   ├── consumer.py
│   ├── InfluxDBWriter.py
│   └── alert_rules.py
├── script/
│   ├── initdb.sql
│   └── utils.py
├── logs/
├── docs/                  # GitHub Pages landing site
├── docker-compose.yaml
├── requirements.txt
└── pyproject.toml
```
- Ingestion: yfinance → producer → Kafka (`real-time-stock-prices`, `stock-general-information`)
- Processing: Spark Structured Streaming → parse, validate, transform
- Storage: InfluxDB (OHLCV), PostgreSQL (metadata via `stock_info_producer`)
- Alerting: `consumer/alert_rules.py` runs on each batch
- Viz: Grafana → InfluxDB (port 3000)
| Variable | Description |
|---|---|
| `STOCKS` | Comma-separated symbols (e.g. `AAPL,MSFT,GOOGL`) |
| `INFLUXDB_BUCKET` | InfluxDB bucket |
| `INFLUXDB_MEASUREMENT` | Measurement name |
| `INFLUX_ORG`, `INFLUX_TOKEN` | InfluxDB auth |
| `POSTGRES_*` | PostgreSQL connection |
Edit `consumer/alert_rules.py`:

```python
PriceDropAlert(threshold_percent=5.0)
PriceSpikeAlert(threshold_percent=5.0)
HighVolumeAlert(threshold_volume=1000000)
VolatilityAlert(threshold_percent=3.0)
```

Spark options in `consumer/consumer.py`:

- `maxOffsetsPerTrigger`: 10000 (backpressure)
- `kafka.group.id`: `stock-stream-consumer-group`
Yahoo Finance API → Producer → Kafka → Spark Streaming → InfluxDB / PostgreSQL → Grafana. All services run via Docker Compose.
Grafana at localhost:3000, wired to InfluxDB. Candlestick, gauge, and line charts for real-time and historical data.

