This repository contains a complete, production-ready data pipeline for ingesting, transforming, analyzing, and modeling vehicle tracking data using AWS services and Python-based tools. The project demonstrates real-time data engineering and machine learning capabilities in the cloud.
- Overview
- Architecture Diagram
- Features
- Technologies Used
- Directory Structure
- Pipeline Components
- Setup Instructions
- Results Summary
This pipeline was designed to simulate real-world vehicle tracking data and process it through a complete cloud-based ETL and ML workflow. The system includes:
- Simulated real-time vehicle tracking using AWS Kinesis.
- Data ingestion and cleaning using AWS Glue and PySpark.
- Schema inference and cataloging with Glue Crawlers.
- Data storage in S3 (raw and processed formats).
- Querying with Amazon Athena.
- Modeling and anomaly detection using Amazon SageMaker and Python.
- Real-time data simulation and ingestion using `boto3` and Kinesis.
- Scalable data transformation using AWS Glue + PySpark.
- Automated schema discovery using Glue Crawlers.
- Querying processed data directly from S3 using Amazon Athena.
- Predictive modeling and anomaly detection (Z-Score, Isolation Forest, LOF).
- Exploratory Data Analysis (EDA) visualized using Matplotlib and Seaborn.
- AWS Services: Kinesis, S3, Lambda, Glue, Glue Crawler, Athena, SageMaker
- Programming Languages: Python (3.8+)
- Libraries:
  - Data: `pandas`, `numpy`, `scikit-learn`, `pyspark`
  - AWS: `boto3`, `Athena`, `AWS Glue`, `S3`, `Kinesis`, `Lambda`, `SageMaker`, `IAM`
  - ML: `scikit-learn`, `xgboost`, `LinearRegression`, `Random Forest`, `SVR`
  - Visualization: `matplotlib`, `seaborn`
The vehicle-data-stream.py script streams rows from a CSV file to an AWS Kinesis Data Stream using boto3.
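
As a rough illustration of this step, the sketch below sends each CSV row to Kinesis as one JSON record through `put_record`. It is not the repository's script: the function names, the `key_field` parameter, and the choice of one record per row are assumptions made for the example. The client is passed in so the logic can be exercised without AWS credentials.

```python
import csv
import json


def stream_rows(rows, client, stream_name, key_field):
    """Send each row dict to Kinesis as one JSON record; return the count sent."""
    sent = 0
    for row in rows:
        client.put_record(
            StreamName=stream_name,
            Data=json.dumps(row).encode("utf-8"),
            # Records that share a partition key are routed to the same shard.
            PartitionKey=str(row[key_field]),
        )
        sent += 1
    return sent


def stream_csv(path, client, stream_name, key_field):
    """Read a CSV file with a header row and stream it row by row."""
    with open(path, newline="") as f:
        return stream_rows(csv.DictReader(f), client, stream_name, key_field)
```

With `boto3` installed and credentials configured, `client` would be `boto3.client("kinesis")`; the stream name and key column depend on your own setup.
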
AWS Lambda is triggered by Kinesis and writes raw events to an S3 bucket in JSON format.
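
A handler for this step might look roughly like the following. Lambda delivers Kinesis payloads base64-encoded under `Records[i].kinesis.data`; the `raw/` prefix, the use of the sequence number as the object key, and the helper name `write_events` are assumptions of this sketch, not details taken from the repository.

```python
import base64
import json


def write_events(event, s3_client, bucket, prefix="raw/"):
    """Decode each Kinesis record in a Lambda event and store it in S3 as JSON."""
    keys = []
    for record in event["Records"]:
        kinesis = record["kinesis"]
        # The Kinesis payload arrives base64-encoded inside the Lambda event.
        payload = json.loads(base64.b64decode(kinesis["data"]))
        # Assumption: one object per record, keyed by the record's sequence number.
        key = f"{prefix}{kinesis['sequenceNumber']}.json"
        s3_client.put_object(
            Bucket=bucket,
            Key=key,
            Body=json.dumps(payload).encode("utf-8"),
            ContentType="application/json",
        )
        keys.append(key)
    return keys
```

Inside Lambda, a `lambda_handler(event, context)` function would call `write_events(event, boto3.client("s3"), bucket)` with the name of your raw-data bucket.
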
The pyspark_cleaning.py script (run in AWS Glue) performs:
- JSON parsing from the `details` column
- Type casting and schema normalization
- Missing value imputation
- Duplicate removal
- Writing clean data to S3 in Parquet and CSV formats
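
The logic of the first four steps can be illustrated in plain Python, as below. This is a stand-in for the record-level logic, not the Glue job itself: in PySpark the same steps would typically map onto `from_json`, `cast`, `fillna`, and `dropDuplicates`. Only the `details` column name comes from the description above; the other field names and the fixed-default imputation are hypothetical.

```python
import json


def clean_records(records, numeric_fields, defaults):
    """Apply the cleaning steps to a list of raw record dicts."""
    cleaned, seen = [], set()
    for raw in records:
        row = {k: v for k, v in raw.items() if k != "details"}
        # 1. Parse the JSON string in `details` and flatten it into the row.
        row.update(json.loads(raw.get("details") or "{}"))
        # 2. Cast numeric fields; values that cannot be parsed become missing.
        for field in numeric_fields:
            try:
                row[field] = float(row[field])
            except (KeyError, TypeError, ValueError):
                row[field] = None
        # 3. Impute missing values with fixed defaults (an assumed strategy).
        for field, default in defaults.items():
            if row.get(field) is None:
                row[field] = default
        # 4. Drop exact duplicates of rows already kept.
        fingerprint = json.dumps(row, sort_keys=True)
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(row)
    return cleaned
```
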
AWS Glue Crawler infers schema from the processed data and registers it in the Glue Data Catalog.
The athena_connection.py script allows querying the S3-structured data via Athena and returns results as Pandas DataFrames.
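
Athena queries are asynchronous, so a helper of this kind has to start the query, poll for completion, and then fetch the results. The sketch below shows that flow with the `boto3` Athena client calls `start_query_execution`, `get_query_execution`, and `get_query_results`. The function name `run_query` and its parameters are assumptions, and it reads only the first page of results.

```python
import time


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Run an Athena query and return the first page of results as a list of dicts."""
    execution_id = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query leaves the QUEUED/RUNNING states.
    while True:
        status = client.get_query_execution(QueryExecutionId=execution_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    rows = client.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    # For SELECT queries the first row holds the column names.
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]
```

Passing the returned list to `pandas.DataFrame(...)` gives a DataFrame. All values arrive as strings, so numeric columns still need casting, and larger result sets need pagination over `get_query_results`.
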
- EDA: Speed distribution, traffic trends by hour/day, vehicle class frequency
- Prediction: Estimating `estimated_speed` using:
  - Linear Regression
  - Random Forest
  - SVR
  - XGBoost
- Anomaly Detection:
  - Z-Score method
  - Isolation Forest
  - Local Outlier Factor
  - Consensus approach
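
To make the Z-Score and consensus ideas concrete, here is a minimal standard-library sketch. The threshold of 3.0 and the "at least two detectors agree" voting rule are assumptions; the README does not state which values the project uses. Isolation Forest and Local Outlier Factor are left out here, and their flagged indices would simply be passed in as additional sets.

```python
from collections import Counter
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return the indices whose absolute Z-score exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return set()
    return {i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold}


def consensus(flag_sets, min_votes=2):
    """Return the indices flagged by at least `min_votes` detectors."""
    votes = Counter(i for flags in flag_sets for i in flags)
    return {i for i, n in votes.items() if n >= min_votes}
```

A majority-style vote keeps points that several methods agree on while discarding flags raised by a single method alone.
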
| Model | RMSE | MAE | R² |
|---|---|---|---|
| Linear Regression | 13.61 | 9.26 | 0.57 |
| Random Forest | 13.71 | 9.47 | 0.56 |
| SVR | 20.76 | 11.65 | -0.00 |
| XGBoost | 13.58 | 9.38 | 0.57 |
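
For reference, the three metrics in the table can be computed as follows. This uses the standard definitions; the README does not say how the project computes them, so treat it as an illustration and not the project's evaluation code.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, and R² for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }
```

Under these definitions, a model that always predicts the mean of the targets scores an R² of 0, which is the level the SVR row sits at.
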
| Method | Description | Anomalies | % of Data |
|---|---|---|---|
| Z-Score | Statistical method | 322 | 1.15% |
| Isolation Forest | Tree-based method | 1404 | 5.00% |
| Local Outlier Factor | Density-based (local density comparisons) | 1404 | 5.00% |
| Consensus Approach | Agreement across methods | 383 | 1.36% |
