This project optimizes taxi fleet management using Big Data tools such as Apache Spark. The project objectives are:
- Analyze historical and real-time taxi trip data from the City of Chicago
- Identify high-demand pickup areas, peak hours, and traffic congestion hotspots
- Propose fleet reallocation strategies
- Utilize Databricks and Apache Spark for distributed data processing and analysis
- Preprocess, clean, and aggregate the dataset to analyze trip volumes, fare distribution, and pickup/drop-off locations across timeframes and geographic regions
The City of Chicago taxi trip dataset was used for this analysis. It contains fare amounts, trip miles, trip durations, pickup and drop-off timestamps, and the geographical location (community area) of each pickup and drop-off. The analysis proceeds in five stages: Data Pre-processing, Descriptive Statistics, Demand Analysis, Congestion Analysis, and Fleet Reallocation Recommendations.
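As an illustration of the Data Pre-processing stage, the sketch below shows the kind of record-level cleaning that can be applied before aggregation. The field names (`fare`, `trip_miles`, `trip_seconds`, `pickup_community_area`, `dropoff_community_area`) mirror the dataset's columns, but the filtering rules are assumptions for illustration, not the project's exact logic; in the project this runs at scale in Spark rather than in plain Python.

```python
def clean_trips(records):
    """Drop trip records that are unusable for analysis.

    A record is kept only if it has a positive fare, a positive trip
    distance and duration, and both pickup and drop-off community
    areas. These thresholds are illustrative assumptions.
    """
    cleaned = []
    for r in records:
        try:
            fare = float(r["fare"])
            miles = float(r["trip_miles"])
            seconds = float(r["trip_seconds"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or malformed numeric fields
        if fare <= 0 or miles <= 0 or seconds <= 0:
            continue  # zero/negative trips carry no analytical signal
        if not r.get("pickup_community_area") or not r.get("dropoff_community_area"):
            continue  # community area is required for demand analysis
        cleaned.append(r)
    return cleaned
```

The same predicates translate directly into Spark `filter`/`dropna` calls when applied to the full dataset.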
- API Call - Historic Data (Jan 2024 - Sep 2024) and Real-time Data (Oct 2024 onward)
- Data Cleaning and Transformation
- Load Data to Parquet files on Cloud Storage
- Create Data Warehouse tables with PySpark
- Load Data from Parquet files to DW tables
- Data Analysis and Visualization
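The Data Analysis step above centers on demand by location and time of day. The sketch below shows the core aggregation, counting pickups per (community area, hour-of-day), in plain Python for clarity; the field names are assumptions based on the dataset, and in the project the equivalent `groupBy`/`count` runs in Spark:

```python
from collections import Counter
from datetime import datetime

def pickup_demand(trips):
    """Count pickups per (community_area, hour-of-day) pair.

    trips: iterable of dicts with 'pickup_community_area' and an
    ISO-8601 'trip_start_timestamp'. Returns a Counter keyed by
    (area, hour), so .most_common() ranks high-demand slots.
    """
    demand = Counter()
    for t in trips:
        hour = datetime.fromisoformat(t["trip_start_timestamp"]).hour
        demand[(t["pickup_community_area"], hour)] += 1
    return demand
```

Ranking the counter's entries surfaces the high-demand pickup areas and peak hours that drive the fleet reallocation recommendations.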
The data warehouse uses the following tables:
- Trip (Fact Table)
- Company (Dimension Table)
- Transaction (Fact Table)
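To illustrate how the fact and dimension tables above relate, the sketch below joins Trip fact rows to the Company dimension and rolls up trip miles per company, a typical warehouse aggregation. The column names and sample rows are illustrative assumptions; in the project this would be a Spark SQL join over the Parquet-backed tables.

```python
# Illustrative star-schema rows; column names are assumptions.
companies = [  # Company (Dimension Table)
    {"company_id": 1, "company_name": "Flash Cab"},
    {"company_id": 2, "company_name": "Taxi Affiliation Services"},
]

trips = [  # Trip (Fact Table): each row references a company_id
    {"trip_id": "t1", "company_id": 1, "trip_miles": 3.2},
    {"trip_id": "t2", "company_id": 1, "trip_miles": 1.1},
    {"trip_id": "t3", "company_id": 2, "trip_miles": 7.5},
]

def miles_by_company(trips, companies):
    """Join the Trip fact table to the Company dimension and
    total trip miles per company name (a typical DW rollup)."""
    names = {c["company_id"]: c["company_name"] for c in companies}
    totals = {}
    for t in trips:
        name = names[t["company_id"]]
        totals[name] = totals.get(name, 0.0) + t["trip_miles"]
    return totals
```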