From 0a7911f38970597e42299a1181383805403164e5 Mon Sep 17 00:00:00 2001
From: Chris Johns
Date: Mon, 21 Aug 2023 11:11:42 -0400
Subject: [PATCH 1/3] Fix linked list index out of bounds

Fix index out of bounds in Doubly Linked List
---
 .../algorithms/datastructures/linkedlist/DoublyLinkedList.h | 2 +-
 .../algorithms/datastructures/linkedlist/DoublyLinkedList.py | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/main/cpp/algorithms/datastructures/linkedlist/DoublyLinkedList.h b/src/main/cpp/algorithms/datastructures/linkedlist/DoublyLinkedList.h
index 66414c5..b2fc545 100644
--- a/src/main/cpp/algorithms/datastructures/linkedlist/DoublyLinkedList.h
+++ b/src/main/cpp/algorithms/datastructures/linkedlist/DoublyLinkedList.h
@@ -194,7 +194,7 @@ class DoublyLinkedList
   // Add an element at a specified index
   void addAt(int index, const T &data) {
-    if (index < 0) {
+    if (index < 0 || index > size_) {
       throw ("Illegal Index");
     }
     if (index == 0) {
diff --git a/src/main/python/algorithms/datastructures/linkedlist/DoublyLinkedList.py b/src/main/python/algorithms/datastructures/linkedlist/DoublyLinkedList.py
index 4435927..78a5748 100644
--- a/src/main/python/algorithms/datastructures/linkedlist/DoublyLinkedList.py
+++ b/src/main/python/algorithms/datastructures/linkedlist/DoublyLinkedList.py
@@ -110,8 +110,8 @@ def addAt(self, index, data):
     """
     Add an element at a specified index
     """
-    if index < 0:
-      raise Exception('index should not be negative. The value of index was: {}'.format(index))
+    if index < 0 or index > self.llSize:
+      raise Exception('Illegal index. The value of index was: {}'.format(index))
     if index == 0:
       self.addFirst(data)
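A quick sanity check of the new bounds behavior (a hypothetical usage sketch; the import path and no-argument constructor below are assumed, not taken from the repo):

```python
# Hypothetical usage of the patched addAt; assumes the class in
# src/main/python/algorithms/datastructures/linkedlist/DoublyLinkedList.py
# is importable under this name.
from DoublyLinkedList import DoublyLinkedList

dll = DoublyLinkedList()
dll.addAt(0, 'a')      # index == 0 on an empty list: allowed
dll.addAt(1, 'b')      # index == size: allowed, appends at the tail
try:
    dll.addAt(5, 'c')  # index > size: now rejected up front
except Exception as e:
    print(e)           # Illegal index. The value of index was: 5
```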
From 66f7230f0b1dad7183f41493e2c001da05bd3505 Mon Sep 17 00:00:00 2001
From: Chris Johns
Date: Tue, 17 Oct 2023 11:35:42 -0400
Subject: [PATCH 2/3] Update project notebook

---
 Student_MLE_MiniProject_EDA.ipynb | 424 ++++++++++++++++++++++++++++++
 1 file changed, 424 insertions(+)
 create mode 100644 Student_MLE_MiniProject_EDA.ipynb

diff --git a/Student_MLE_MiniProject_EDA.ipynb b/Student_MLE_MiniProject_EDA.ipynb
new file mode 100644
index 0000000..ba49e09
--- /dev/null
+++ b/Student_MLE_MiniProject_EDA.ipynb
@@ -0,0 +1,424 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": [],
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "\"Open"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Mini Project: Exploratory Data Analysis"
+ ],
+ "metadata": {
+ "id": "hVq3RjKJpJI0"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Exploratory Data Analysis: Unveiling Insights from the NYC Taxi Dataset\n",
+ "\n",
+ "Data has become the lifeblood of the modern world, permeating every aspect of our lives and transforming the way we make decisions. In this era of vast information, the ability to extract meaningful insights from raw data has emerged as a crucial skill. Enter exploratory data analysis (EDA), a powerful approach that allows us to unravel hidden patterns, detect anomalies, and generate valuable knowledge from the vast volumes of data at our disposal.\n",
+ "\n",
+ "Exploratory data analysis serves as the initial step in any data-driven investigation, offering a comprehensive understanding of the dataset's structure, distributions, and relationships between variables. By applying statistical and visual techniques, analysts gain a deeper insight into the data, paving the way for more accurate predictions, informed decision-making, and the discovery of actionable insights.\n",
+ "\n",
+ "To illustrate the significance of exploratory data analysis, we delve into one of the most popular and widely studied datasets in the field—the [NYC Taxi Dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). As the name suggests, this dataset captures detailed information about taxi trips within the bustling city of New York. The NYC Taxi Dataset is an ideal choice for learning and practicing EDA techniques due to its richness, complexity, and real-world applicability.\n",
+ "\n",
+ "The dataset encompasses a vast range of attributes, including pickup and drop-off locations, timestamps, trip durations, passenger counts, payment information, and much more. By exploring this data, we can gain valuable insights into the dynamics of taxi usage, understand travel patterns across different neighborhoods, identify peak hours of demand, analyze fare structures, and even uncover interesting anecdotes about the city's vibrant life.\n",
+ "\n",
+ "The NYC Taxi Dataset is an excellent resource for aspiring ML practitioners to develop their EDA skills. Its scale, complexity, and real-world relevance make it an engaging playground for uncovering hidden patterns, generating hypotheses, and forming data-driven narratives.\n",
+ "\n",
+ "In this mini project, we will dive deep into the NYC Taxi Dataset. We will leverage various EDA techniques to unveil meaningful insights, visualize data distributions, identify outliers, and pose insightful questions that will fuel further analysis and exploration. By the end of this colab, students will have a solid foundation in exploratory data analysis and be equipped to tackle real-world data challenges with confidence."
+ ],
+ "metadata": {
+ "id": "ajb94WgIRdgC"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns"
+ ],
+ "metadata": {
+ "id": "lbJFWLELlI6N"
+ },
+ "execution_count": 1,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Load the NYC taxi dataset into a Pandas DataFrame and do a few basic checks to ensure the data is loaded properly. Note, there are several months of data that can be used. For simplicity, use the Yellow Taxi 2022-01 parquet file [here](https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet). Here are your tasks:\n",
+ "\n",
+ " 1. Load the `yellow_tripdata_2022-01.parquet` file into Pandas.\n",
+ " 2. Print the first 5 rows of data. Study the schema and make sure you understand what each of the fields means by referencing the [documentation](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf).\n",
+ " 3. How many rows are in the dataset? How many unique columns are in the dataset?\n",
+ " 4. Which columns have NULL values and how many NULL values are present in each of these columns?\n",
+ " 5. Generate summary statistics using Pandas' [describe method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html). Do you notice anything unusual in the dataset? Find at least one anomaly and try to come up with a hypothesis to explain it. \n",
+ " 6. Drop all rows with NULL values and store the result. We'll ignore NULL valued rows in this mini-project.\n"
+ ],
+ "metadata": {
+ "id": "sgK6-XtjVnjj"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Load parquet file into a Pandas DataFrame\n",
+ "df = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet')"
+ ],
+ "metadata": {
+ "id": "db--eb8zlNNg"
+ },
+ "execution_count": 2,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Display the first few rows of the dataset\n",
+ "print(df.head())"
+ ],
+ "metadata": {
+ "id": "TslBuHoXl_o1",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "d2f632dc-913c-47dc-822d-9b75ab3a29f7"
+ },
+ "execution_count": 3,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ " VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n",
+ "0 1 2022-01-01 00:35:40 2022-01-01 00:53:29 2.0 \n",
+ "1 1 2022-01-01 00:33:43 2022-01-01 00:42:07 1.0 \n",
+ "2 2 2022-01-01 00:53:21 2022-01-01 01:02:19 1.0 \n",
+ "3 2 2022-01-01 00:25:21 2022-01-01 00:35:23 1.0 \n",
+ "4 2 2022-01-01 00:36:48 2022-01-01 01:14:20 1.0 \n",
+ "\n",
+ " trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID \\\n",
+ "0 3.80 1.0 N 142 236 \n",
+ "1 2.10 1.0 N 236 42 \n",
+ "2 0.97 1.0 N 166 166 \n",
+ "3 1.09 1.0 N 114 68 \n",
+ "4 4.30 1.0 N 68 163 \n",
+ "\n",
+ " payment_type fare_amount extra mta_tax tip_amount tolls_amount \\\n",
+ "0 1 14.5 3.0 0.5 3.65 0.0 \n",
+ "1 1 8.0 0.5 0.5 4.00 0.0 \n",
+ "2 1 7.5 0.5 0.5 1.76 0.0 \n",
+ "3 2 8.0 0.5 0.5 0.00 0.0 \n",
+ "4 1 23.5 0.5 0.5 3.00 0.0 \n",
+ "\n",
+ " improvement_surcharge total_amount congestion_surcharge airport_fee \n",
+ "0 0.3 21.95 2.5 0.0 \n",
+ "1 0.3 13.30 0.0 0.0 \n",
+ "2 0.3 10.56 0.0 0.0 \n",
+ "3 0.3 11.80 2.5 0.0 \n",
+ "4 0.3 30.30 2.5 0.0 \n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Display the shape of the dataset"
+ ],
+ "metadata": {
+ "id": "jkQN2lBymKH-"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Check for missing values"
+ ],
+ "metadata": {
+ "id": "N8JeOtV5mQJ7"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Summary statistics of the dataset"
+ ],
+ "metadata": {
+ "id": "h9q5SOrql5mS"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Drop rows with missing values."
+ ],
+ "metadata": {
+ "id": "VP3EWMIEo4sp"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
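One possible way to complete the empty cells above (a sketch; `df_clean` is a name chosen here for the NULL-free copy, not one the notebook prescribes):

```python
import pandas as pd

df = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet')

print(df.shape)           # number of rows and columns
print(df.isnull().sum())  # NULL count per column
print(df.describe())      # summary stats; scan for oddities such as negative fares

df_clean = df.dropna()    # keep a NULL-free copy for the rest of the project
```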
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Feature engineering is a critical process in machine learning that involves selecting, transforming, and creating features from raw data to improve the performance and accuracy of predictive models. While machine learning algorithms have the capability to automatically learn patterns from data, the quality and relevance of the features used as inputs greatly influence the model's ability to generalize and make accurate predictions. Feature engineering, therefore, plays a crucial role in extracting meaningful information and representing it in a format that best captures the underlying relationships within the data.\n",
+ "\n",
+ "Here are your tasks:\n",
+ "\n",
+ " 1. Create a new feature that calculates the trip duration in minutes.\n",
+ " 2. Create additional features for the pick-up day of week and pick-up hour.\n",
+ " 3. Use the Seaborn library to create a [line plot](https://seaborn.pydata.org/generated/seaborn.lineplot.html) depicting the number of trips as a function of the hour of day. What's the busiest time of day?\n",
+ " 4. Create another lineplot depicting the number of trips as a function of the day of week. What day of the week is the least busy?"
+ ],
+ "metadata": {
+ "id": "u2i5l3QNFAf3"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Create a new column for trip duration in minutes"
+ ],
+ "metadata": {
+ "id": "Wef2rR1Npl8f"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Create new columns for pickup hour and day of week"
+ ],
+ "metadata": {
+ "id": "d13PYaN2FPFt"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Create a lineplot displaying the number of trips by pickup hour"
+ ],
+ "metadata": {
+ "id": "-Bf7bnS9uU-h"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Create a lineplot displaying the number of trips by pickup day"
+ ],
+ "metadata": {
+ "id": "iGjHC9lHuO8r"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
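A sketch of one way to fill in the four cells above, assuming the `df_clean` frame from the earlier sketch:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Trip duration in minutes from the pickup/dropoff timestamps
df_clean['trip_duration'] = (
    df_clean['tpep_dropoff_datetime'] - df_clean['tpep_pickup_datetime']
).dt.total_seconds() / 60

# Pickup hour and day of week (Monday=0) from the pickup timestamp
df_clean['pickup_hour'] = df_clean['tpep_pickup_datetime'].dt.hour
df_clean['pickup_dayofweek'] = df_clean['tpep_pickup_datetime'].dt.dayofweek

# Number of trips per hour of day
trips_by_hour = df_clean.groupby('pickup_hour').size().reset_index(name='num_trips')
sns.lineplot(data=trips_by_hour, x='pickup_hour', y='num_trips')
plt.show()

# Number of trips per day of week
trips_by_day = df_clean.groupby('pickup_dayofweek').size().reset_index(name='num_trips')
sns.lineplot(data=trips_by_day, x='pickup_dayofweek', y='num_trips')
plt.show()
```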
+ {
+ "cell_type": "markdown",
+ "source": [
+ "In the realm of machine learning, understanding the relationships between variables is crucial for building accurate and effective predictive models. One powerful tool for exploring these relationships is the correlation matrix. A correlation matrix provides a comprehensive overview of the pairwise correlations between variables in a dataset, allowing practitioners to quantify and visualize the strength and direction of these associations. This matrix is an essential component of exploratory data analysis and offers several key benefits:\n",
+ "\n",
+ "1. Relationship Assessment: The correlation matrix provides a quantitative measure of the relationship between variables. By calculating correlation coefficients, typically using methods like Pearson's correlation coefficient, analysts can determine if variables are positively correlated (increase together), negatively correlated (one increases as the other decreases), or uncorrelated (no systematic relationship). These measures offer insights into the direction and strength of the relationships, helping to identify important variables that may influence the target variable.\n",
+ "\n",
+ "2. Feature Selection: Correlation matrices are invaluable in feature selection, which involves identifying the most relevant variables for building predictive models. By examining the correlations between the target variable and other features, analysts can identify highly correlated variables that may be strong predictors. This knowledge enables informed decisions regarding which variables to include in the model, potentially reducing dimensionality, enhancing model efficiency, and preventing issues such as multicollinearity.\n",
+ "\n",
+ "3. Multicollinearity Detection: Multicollinearity occurs when two or more independent variables in a model are highly correlated. This can lead to problems such as instability in coefficient estimates, difficulty in interpreting feature importance, and reduced model robustness. By examining the correlation matrix, analysts can identify highly correlated variables and make informed decisions about which ones to include or exclude to mitigate multicollinearity. Removing redundant variables improves model interpretability and generalization.\n",
+ "\n",
+ "Here is your task:\n",
+ "\n",
+ " 1. Compute a correlation matrix between the variables 'trip_distance', 'fare_amount', 'tip_amount', 'total_amount', 'trip_duration' and use Seaborn to create a heatmap of the results. Which variables are strongly correlated?"
+ ],
+ "metadata": {
+ "id": "yVQLUFXPGe4e"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Compute correlation matrix of numerical variables\n",
+ "\n",
+ "# Create a heatmap of the correlation matrix"
+ ],
+ "metadata": {
+ "id": "XCY8MrLAppQz"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
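A minimal sketch for the correlation cell, assuming `df_clean` and the `trip_duration` feature created earlier:

```python
import matplotlib.pyplot as plt
import seaborn as sns

cols = ['trip_distance', 'fare_amount', 'tip_amount', 'total_amount', 'trip_duration']
corr = df_clean[cols].corr()  # pairwise Pearson correlations

sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()
```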
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Pairplots, also known as scatterplot matrices, allow for the visualization of pairwise relationships between multiple variables simultaneously. Each subplot in the pairplot represents the scatterplot of one variable against another. Pairplots offer several advantages in EDA:\n",
+ "\n",
+ " - Variable Relationships: Pairplots enable analysts to explore the relationships between variables, revealing patterns such as linear or nonlinear correlations, clusters, or other associations. These visual cues guide further analysis, feature selection, or modeling decisions.\n",
+ "\n",
+ " - Multivariate Analysis: Pairplots help identify multivariate dependencies and interactions, highlighting how different variables jointly influence one another. This is particularly valuable in identifying potential confounding factors or discovering hidden interactions that may not be apparent when considering variables in isolation.\n",
+ "\n",
+ " - Outlier Detection: Pairplots can reveal potential outliers by showing data points that deviate significantly from the general pattern observed between variables. Outliers can indicate data anomalies or influential observations that may impact model performance.\n",
+ "\n",
+ " - Feature Importance: Pairplots provide an intuitive representation of the relative importance of different features. Variables exhibiting strong correlations or clear patterns may be more relevant for predictive modeling or feature selection.\n",
+ "\n",
+ " - Data Quality: Pairplots can help identify data quality issues, such as data entry errors or measurement inconsistencies. Patterns that do not align with expectations or exhibit unusual trends may signal data problems that require further investigation or preprocessing.\n",
+ "\n",
+ "Here is your task:\n",
+ "\n",
+ " 1. Create a [pairplot matrix](https://seaborn.pydata.org/generated/seaborn.pairplot.html) using Seaborn to observe the relationship between the following variables: trip_distance, fare_amount, tip_amount, total_amount, trip_duration. Note, pairplots can be memory intensive. Try sampling the dataset using the [sample method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) in Pandas. Which variables appear to have a strong relationship? Which variables seem to have no relationship?"
+ ],
+ "metadata": {
+ "id": "Eh6TJ8iRJHDm"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Create a scatter plot matrix of numerical variables. If you run into memory issues, try the df.sample method."
+ ],
+ "metadata": {
+ "id": "l_v24Ym3p8A-"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
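One way to approach the pairplot cell; the sample size and seed are arbitrary choices:

```python
import matplotlib.pyplot as plt
import seaborn as sns

cols = ['trip_distance', 'fare_amount', 'tip_amount', 'total_amount', 'trip_duration']
sample = df_clean[cols].sample(n=10_000, random_state=42)  # keep memory in check

sns.pairplot(sample)
plt.show()
```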
+ {
+ "cell_type": "markdown",
+ "source": [
+ "A count plot is a type of categorical plot that displays the number of occurrences of each category in a dataset. It is particularly useful for visualizing the distribution and frequency of categorical variables. Here are some key uses and benefits of count plots:\n",
+ "\n",
+ " - Categorical Variable Exploration: Count plots provide a quick and concise summary of the distribution of categorical variables. They allow analysts to understand the frequency or count of each category, revealing the proportions and imbalances within the dataset. This information is crucial for gaining insights into the composition and characteristics of categorical variables.\n",
+ "\n",
+ " - Class Imbalance Assessment: In classification tasks, count plots help assess the balance or imbalance of different target classes. They provide a visual representation of the distribution of classes, highlighting any significant discrepancies in the sample sizes across categories. Identifying imbalanced classes is important in machine learning as it can affect model performance and bias the predictions towards the majority class.\n",
+ "\n",
+ " - Data Quality Inspection: Count plots can be utilized to detect data quality issues in categorical variables. They allow analysts to identify unexpected or erroneous categories that may indicate data entry errors, missing data, or inconsistencies in the dataset. By observing the counts for each category, anomalies or discrepancies can be easily spotted, enabling data cleaning or further investigation if necessary.\n",
+ "\n",
+ " - Feature Importance Evaluation: Count plots can provide insights into the importance or relevance of different categorical features in relation to the target variable. By visualizing the distribution of categories within each class or target level, analysts can determine which categories are more prevalent or have higher frequencies for specific outcomes. This understanding helps in assessing the discriminatory power of categorical features and their potential impact on predictive models.\n",
+ "\n",
+ "Here is your task:\n",
+ "\n",
+ " 1. Use Seaborn to create a [countplot](https://seaborn.pydata.org/generated/seaborn.countplot.html) for the variables PULocationID and DOLocationID. Keep only the top 15 pick-up and drop-off locations. What's the most popular pick-up location?"
+ ],
+ "metadata": {
+ "id": "OA1p83hesFkH"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Create a Seaborn countplot for PULocationID and DOLocationID. Only plot the top 15 categories by value counts."
+ ],
+ "metadata": {
+ "id": "eigDjtkTruQD"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
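A sketch for the countplot cell; repeating the same pattern with DOLocationID gives the drop-off view:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Top 15 pick-up locations by trip count
top_pu = df_clean['PULocationID'].value_counts().nlargest(15).index
subset = df_clean[df_clean['PULocationID'].isin(top_pu)]

sns.countplot(data=subset, x='PULocationID', order=top_pu)
plt.show()
```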
+ {
+ "cell_type": "markdown",
+ "source": [
+ "A box plot, also known as a box-and-whisker plot, is a powerful visualization tool for displaying the distribution, variability, and outliers within a numerical dataset. It provides a concise summary of key statistical measures and offers several important uses:\n",
+ "\n",
+ " - Data Distribution and Skewness: Box plots offer a visual representation of the distribution of numerical data, providing insights into its central tendency, spread, and skewness. The box represents the interquartile range (IQR), which contains the middle 50% of the data, with the median indicated by a horizontal line within the box. By observing the length and symmetry of the box, analysts can assess whether the data is skewed or symmetrically distributed.\n",
+ "\n",
+ " - Outlier Detection: Box plots are highly effective in identifying outliers, which are data points that deviate significantly from the rest of the distribution. The whiskers of the plot extend to the minimum and maximum non-outlier values, with any data points beyond the whiskers considered potential outliers. Outliers can indicate data errors, anomalies, or important observations requiring further investigation.\n",
+ "\n",
+ " - Comparing Groups or Categories: Box plots are useful for comparing the distributions of numerical data across different groups or categories. By creating side-by-side or grouped box plots, analysts can easily compare the central tendencies, spreads, and shapes of distributions between different groups. This allows for the identification of differences, similarities, or patterns within the data.\n",
+ "\n",
+ " - Variability and Spread: Box plots provide insights into the variability and spread of the data. The length of the box indicates the spread of the middle 50% of the data, while the whiskers show the range of non-outlier values. By comparing the lengths of the boxes and whiskers, analysts can assess the relative variability between different groups or categories, aiding in the understanding of the data's dispersion.\n",
+ "\n",
+ " - Skewness and Symmetry: Box plots offer a visual assessment of the skewness or symmetry of the data distribution. A symmetrical distribution is represented by a box plot with an equal length on both sides of the median, while a skewed distribution is indicated by a longer box on one side. This visual cue helps in understanding the shape and characteristics of the data, assisting in further analysis and modeling decisions.\n",
+ "\n",
+ " - Data Range and Quartiles: Box plots display the quartiles of the data distribution. The lower quartile (Q1) represents the 25th percentile, the upper quartile (Q3) represents the 75th percentile, and the interquartile range (IQR) is the range between Q1 and Q3. These quartiles provide a summary of the range and spread of the central portion of the data, aiding in the understanding of the data's variability and dispersion.\n",
+ "\n",
+ "Your task is:\n",
+ "\n",
+ " 1. Use Seaborn's [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html) to discern the relationship between payment_type and total_amount. Does anything look weird? Can you explain what's going on?"
+ ],
+ "metadata": {
+ "id": "Bbl-WZtctVbH"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "fM_2IGKrk8Vy"
+ },
+ "outputs": [],
+ "source": [
+ "# Create a box plot of total amount by payment type. Do you see anything odd?"
+ ]
+ },
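A minimal sketch for the boxplot cell; payment_type is cast to a string here so Seaborn treats it as categorical:

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(x=df_clean['payment_type'].astype(str), y=df_clean['total_amount'])
plt.show()
```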
+ {
+ "cell_type": "markdown",
+ "source": [
+ "A histogram is a graphical representation that displays the distribution of a continuous or discrete numerical variable. It provides insights into the underlying data distribution and helps uncover patterns, frequencies, and ranges within the dataset. Here are some key uses and benefits of histogram plots:\n",
+ "\n",
+ " - Data Distribution: Histograms allow analysts to visualize the shape, central tendency, and spread of the data. They provide an overview of the data distribution, helping to identify if it follows a particular pattern, such as a normal distribution, skewed distribution, bimodal distribution, or multimodal distribution. Understanding the data distribution aids in selecting appropriate analysis techniques and understanding the characteristics of the data.\n",
+ "\n",
+ " - Frequency Analysis: Histograms display the frequency or count of data points within predefined bins or intervals along the x-axis. By observing the height or count of each bin, analysts can identify the frequency of occurrence for different values or value ranges. This information helps assess the concentration of data points and identify peaks or modes in the distribution.\n",
+ "\n",
+ " - Outlier Detection: Histograms can assist in identifying outliers, which are data points that significantly deviate from the rest of the distribution. Outliers may indicate data errors, unusual observations, or important anomalies requiring further investigation. By examining the tails or extreme values in the histogram, analysts can identify potential outliers that may require additional scrutiny.\n",
+ "\n",
+ " - Data Range and Spread: Histograms provide insights into the range and spread of the data. The x-axis represents the variable's values, while the y-axis represents the frequency or count of occurrences. By observing the width and span of the histogram, analysts can assess the data's range and variability. This information helps understand the data's spread and aids in subsequent analysis or decision-making processes.\n",
+ "\n",
+ " - Feature Engineering: Histograms can guide feature engineering processes by informing appropriate transformations, binning strategies, or encoding techniques for numerical variables. They assist in identifying nonlinear relationships or determining optimal cut-off points for converting continuous variables into categorical ones. Histograms also help identify data skewness and guide transformation methods to address the skew if necessary.\n",
+ "\n",
+ " - Data Quality Inspection: Histograms can be useful in detecting data quality issues or anomalies. Unusual spikes, gaps, or unexpected patterns in the histogram may indicate data entry errors, measurement inconsistencies, or missing data. By observing the histogram, analysts can identify potential data quality issues that require further investigation or preprocessing.\n",
+ "\n",
+ "Your task is:\n",
+ "\n",
+ " 1. Use Seaborn's [histplot](https://seaborn.pydata.org/generated/seaborn.histplot.html) to explore the data distributions for fare_amount, trip_distance, and extra. Use kernel density estimators to better visualize the distribution. Use sampling if you run into any memory issues."
+ ],
+ "metadata": {
+ "id": "CnV0dIHquPtf"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Explore data distributions for 'fare_amount', 'trip_distance' and 'extra' using Seaborn's histplot. Sample the data if you run into memory issues."
+ ],
+ "metadata": {
+ "id": "8HI3uXHCyI_B"
+ },
+ "execution_count": null,
+ "outputs": []
+ }
+ ]
+}
\ No newline at end of file
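One possible completion of the notebook's final histplot cell (the sample size is an arbitrary choice):

```python
import matplotlib.pyplot as plt
import seaborn as sns

for col in ['fare_amount', 'trip_distance', 'extra']:
    sns.histplot(df_clean[col].sample(n=50_000, random_state=0), kde=True)
    plt.show()
```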
From f6149b2722e32c9ec81873ee0f7a86c3721227f8 Mon Sep 17 00:00:00 2001
From: Chris Johns
Date: Tue, 17 Oct 2023 11:38:20 -0400
Subject: [PATCH 3/3] Delete Student_MLE_MiniProject_EDA.ipynb

---
 Student_MLE_MiniProject_EDA.ipynb | 424 ------------------------------
 1 file changed, 424 deletions(-)
 delete mode 100644 Student_MLE_MiniProject_EDA.ipynb

diff --git a/Student_MLE_MiniProject_EDA.ipynb b/Student_MLE_MiniProject_EDA.ipynb
deleted file mode 100644
index ba49e09..0000000
--- a/Student_MLE_MiniProject_EDA.ipynb
+++ /dev/null
@@ -1,424 +0,0 @@
-{
- "nbformat": 4,
- "nbformat_minor": 0,
- "metadata": {
- "colab": {
- "provenance": [],
- "include_colab_link": true
- },
- "kernelspec": {
- "name": "python3",
- "display_name": "Python 3"
- },
- "language_info": {
- "name": "python"
- }
- },
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "view-in-github",
- "colab_type": "text"
- },
- "source": [
- "\"Open"
- ]
- },
- {
- "cell_type": "markdown",
- "source": [
- "# Mini Project: Exploratory Data Analysis"
- ],
- "metadata": {
- "id": "hVq3RjKJpJI0"
- }
- },
- {
- "cell_type": "markdown",
- "source": [
- "# Exploratory Data Analysis: Unveiling Insights from the NYC Taxi Dataset\n",
- "\n",
- "Data has become the lifeblood of the modern world, permeating every aspect of our lives and transforming the way we make decisions. In this era of vast information, the ability to extract meaningful insights from raw data has emerged as a crucial skill. Enter exploratory data analysis (EDA), a powerful approach that allows us to unravel hidden patterns, detect anomalies, and generate valuable knowledge from the vast volumes of data at our disposal.\n",
- "\n",
- "Exploratory data analysis serves as the initial step in any data-driven investigation, offering a comprehensive understanding of the dataset's structure, distributions, and relationships between variables. By applying statistical and visual techniques, analysts gain a deeper insight into the data, paving the way for more accurate predictions, informed decision-making, and the discovery of actionable insights.\n",
- "\n",
- "To illustrate the significance of exploratory data analysis, we delve into one of the most popular and widely studied datasets in the field—the [NYC Taxi Dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). As the name suggests, this dataset captures detailed information about taxi trips within the bustling city of New York. The NYC Taxi Dataset is an ideal choice for learning and practicing EDA techniques due to its richness, complexity, and real-world applicability.\n",
- "\n",
- "The dataset encompasses a vast range of attributes, including pickup and drop-off locations, timestamps, trip durations, passenger counts, payment information, and much more. By exploring this data, we can gain valuable insights into the dynamics of taxi usage, understand travel patterns across different neighborhoods, identify peak hours of demand, analyze fare structures, and even uncover interesting anecdotes about the city's vibrant life.\n",
- "\n",
- "The NYC Taxi Dataset is an excellent resource for aspiring ML practitioners to develop their EDA skills. Its scale, complexity, and real-world relevance make it an engaging playground for uncovering hidden patterns, generating hypotheses, and forming data-driven narratives.\n",
- "\n",
- "In this mini project, we will dive deep into the NYC Taxi Dataset. We will leverage various EDA techniques to unveil meaningful insights, visualize data distributions, identify outliers, and pose insightful questions that will fuel further analysis and exploration. By the end of this colab, students will have a solid foundation in exploratory data analysis and be equipped to tackle real-world data challenges with confidence."
- ],
- "metadata": {
- "id": "ajb94WgIRdgC"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "import pandas as pd\n",
- "import matplotlib.pyplot as plt\n",
- "import seaborn as sns"
- ],
- "metadata": {
- "id": "lbJFWLELlI6N"
- },
- "execution_count": 1,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "source": [
- "Load the NYC taxi dataset into a Pandas DataFrame and do a few basic checks to ensure the data is loaded properly. Note, there are several months of data that can be used. For simplicity, use the Yellow Taxi 2022-01 parquet file [here](https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet). Here are your tasks:\n",
- "\n",
- " 1. Load the `yellow_tripdata_2022-01.parquet` file into Pandas.\n",
- " 2. Print the first 5 rows of data. Study the schema and make sure you understand what each of the fields means by referencing the [documentation](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf).\n",
- " 3. How many rows are in the dataset? How many unique columns are in the dataset?\n",
- " 4. Which columns have NULL values and how many NULL values are present in each of these columns?\n",
- " 5. Generate summary statistics using Pandas' [describe method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html). Do you notice anything unusual in the dataset? Find at least one anomaly and try to come up with a hypothesis to explain it. \n",
- " 6. Drop all rows with NULL values and store the result. 
We'll ignore NULL valued rows in this mini-project.\n" - ], - "metadata": { - "id": "sgK6-XtjVnjj" - } - }, - { - "cell_type": "code", - "source": [ - "# Load parquet file into a Pandas DataFrame\n", - "df = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet')" - ], - "metadata": { - "id": "db--eb8zlNNg" - }, - "execution_count": 2, - "outputs": [] - }, - { - "cell_type": "code", - "source": [ - "# Display the first few rows of the dataset\n", - "print(df.head())" - ], - "metadata": { - "id": "TslBuHoXl_o1", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "d2f632dc-913c-47dc-822d-9b75ab3a29f7" - }, - "execution_count": 3, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - " VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n", - "0 1 2022-01-01 00:35:40 2022-01-01 00:53:29 2.0 \n", - "1 1 2022-01-01 00:33:43 2022-01-01 00:42:07 1.0 \n", - "2 2 2022-01-01 00:53:21 2022-01-01 01:02:19 1.0 \n", - "3 2 2022-01-01 00:25:21 2022-01-01 00:35:23 1.0 \n", - "4 2 2022-01-01 00:36:48 2022-01-01 01:14:20 1.0 \n", - "\n", - " trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID \\\n", - "0 3.80 1.0 N 142 236 \n", - "1 2.10 1.0 N 236 42 \n", - "2 0.97 1.0 N 166 166 \n", - "3 1.09 1.0 N 114 68 \n", - "4 4.30 1.0 N 68 163 \n", - "\n", - " payment_type fare_amount extra mta_tax tip_amount tolls_amount \\\n", - "0 1 14.5 3.0 0.5 3.65 0.0 \n", - "1 1 8.0 0.5 0.5 4.00 0.0 \n", - "2 1 7.5 0.5 0.5 1.76 0.0 \n", - "3 2 8.0 0.5 0.5 0.00 0.0 \n", - "4 1 23.5 0.5 0.5 3.00 0.0 \n", - "\n", - " improvement_surcharge total_amount congestion_surcharge airport_fee \n", - "0 0.3 21.95 2.5 0.0 \n", - "1 0.3 13.30 0.0 0.0 \n", - "2 0.3 10.56 0.0 0.0 \n", - "3 0.3 11.80 2.5 0.0 \n", - "4 0.3 30.30 2.5 0.0 \n" - ] - } - ] - }, - { - "cell_type": "code", - "source": [ - "# Display the shape of the dataset" - ], - "metadata": { - "id": "jkQN2lBymKH-" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "source": [ - "# Check for missing values" - ], - "metadata": { - "id": "N8JeOtV5mQJ7" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "source": [ - "# Summary statistics of the dataset" - ], - "metadata": { - "id": "h9q5SOrql5mS" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "source": [ - "# Drop rows with missing values." - ], - "metadata": { - "id": "VP3EWMIEo4sp" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "source": [ - "Feature engineering is a critical process in machine learning that involves selecting, transforming, and creating features from raw data to improve the performance and accuracy of predictive models. While machine learning algorithms have the capability to automatically learn patterns from data, the quality and relevance of the features used as inputs greatly influence the model's ability to generalize and make accurate predictions. Feature engineering, therefore, plays a crucial role in extracting meaningful information and representing it in a format that best captures the underlying relationships within the data.\n", - "\n", - "Here are your tasks:\n", - "\n", - " 1. Create a new feature that calculates the trip duration in minutes.\n", - " 2. Create additional features for the pick-up day of week and pick-up hour.\n", - " 3. 
Use the Seaborn library to create a [line plot](https://seaborn.pydata.org/generated/seaborn.lineplot.html) depicting the number of trips as a function of the hour of day. What's the busiest time of day?\n", - " 4. Create another lineplot depicting the number of trips as a function of the day of week. What day of the week is the least busy?" - ], - "metadata": { - "id": "u2i5l3QNFAf3" - } - }, - { - "cell_type": "code", - "source": [ - "# Create a new column for trip duration in minutes" - ], - "metadata": { - "id": "Wef2rR1Npl8f" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "source": [ - "# Create new columns for pickup hour and day of week" - ], - "metadata": { - "id": "d13PYaN2FPFt" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "source": [ - "# Create a lineplot displaying the number of trips by pickup hour" - ], - "metadata": { - "id": "-Bf7bnS9uU-h" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "source": [ - "# Create a lineplot displaying the number of trips by pickup day" - ], - "metadata": { - "id": "iGjHC9lHuO8r" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "source": [ - "In the realm of machine learning, understanding the relationships between variables is crucial for building accurate and effective predictive models. One powerful tool for exploring these relationships is the correlation matrix. A correlation matrix provides a comprehensive overview of the pairwise correlations between variables in a dataset, allowing practitioners to quantify and visualize the strength and direction of these associations. This matrix is an essential component of exploratory data analysis and offers several key benefits:\n", - "\n", - "1. Relationship Assessment: The correlation matrix provides a quantitative measure of the relationship between variables. By calculating correlation coefficients, typically using methods like Pearson's correlation coefficient, analysts can determine if variables are positively correlated (increase together), negatively correlated (one increases as the other decreases), or uncorrelated (no systematic relationship). These measures offer insights into the direction and strength of the relationships, helping to identify important variables that may influence the target variable.\n", - "\n", - "2. Feature Selection: Correlation matrices are invaluable in feature selection, which involves identifying the most relevant variables for building predictive models. By examining the correlations between the target variable and other features, analysts can identify highly correlated variables that may be strong predictors. This knowledge enables informed decisions regarding which variables to include in the model, potentially reducing dimensionality, enhancing model efficiency, and preventing issues such as multicollinearity.\n", - "\n", - "3. Multicollinearity Detection: Multicollinearity occurs when two or more independent variables in a model are highly correlated. This can lead to problems such as instability in coefficient estimates, difficulty in interpreting feature importance, and reduced model robustness. By examining the correlation matrix, analysts can identify highly correlated variables and make informed decisions about which ones to include or exclude to mitigate multicollinearity. Removing redundant variables improves model interpretability and generalization.\n", - "\n", - "Here is your task:\n", - "\n", - " 1. 
Compute a correlation matrix between the variables 'trip_distance', 'fare_amount', 'tip_amount', 'total_amount', 'trip_duration' and use Seaborn to create a heatmap of the results. Which variables are strongly correlated?"
- ],
- "metadata": {
- "id": "yVQLUFXPGe4e"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "# Compute correlation matrix of numerical variables\n",
- "\n",
- "# Create a heatmap of the correlation matrix"
- ],
- "metadata": {
- "id": "XCY8MrLAppQz"
- },
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "source": [
- "Pairplots, also known as scatterplot matrices, allow for the visualization of pairwise relationships between multiple variables simultaneously. Each subplot in the pairplot represents the scatterplot of one variable against another. Pairplots offer several advantages in EDA:\n",
- "\n",
- " - Variable Relationships: Pairplots enable analysts to explore the relationships between variables, revealing patterns such as linear or nonlinear correlations, clusters, or other associations. These visual cues guide further analysis, feature selection, or modeling decisions.\n",
- "\n",
- " - Multivariate Analysis: Pairplots help identify multivariate dependencies and interactions, highlighting how different variables jointly influence one another. This is particularly valuable in identifying potential confounding factors or discovering hidden interactions that may not be apparent when considering variables in isolation.\n",
- "\n",
- " - Outlier Detection: Pairplots can reveal potential outliers by showing data points that deviate significantly from the general pattern observed between variables. Outliers can indicate data anomalies or influential observations that may impact model performance.\n",
- "\n",
- " - Feature Importance: Pairplots provide an intuitive representation of the relative importance of different features. Variables exhibiting strong correlations or clear patterns may be more relevant for predictive modeling or feature selection.\n",
- "\n",
- " - Data Quality: Pairplots can help identify data quality issues, such as data entry errors or measurement inconsistencies. Patterns that do not align with expectations or exhibit unusual trends may signal data problems that require further investigation or preprocessing.\n",
- "\n",
- "Here is your task:\n",
- "\n",
- " 1. Create a [pairplot matrix](https://seaborn.pydata.org/generated/seaborn.pairplot.html) using Seaborn to observe the relationship between the following variables: trip_distance, fare_amount, tip_amount, total_amount, trip_duration. Note, pairplots can be memory intensive. Try sampling the dataset using the [sample method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) in Pandas. Which variables appear to have a strong relationship? Which variables seem to have no relationship?"
- ],
- "metadata": {
- "id": "Eh6TJ8iRJHDm"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "# Create a scatter plot matrix of numerical variables. If you run into memory issues, try the df.sample method."
- ],
- "metadata": {
- "id": "l_v24Ym3p8A-"
- },
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "source": [
- "A count plot is a type of categorical plot that displays the number of occurrences of each category in a dataset. It is particularly useful for visualizing the distribution and frequency of categorical variables. Here are some key uses and benefits of count plots:\n",
- "\n",
- " - Categorical Variable Exploration: Count plots provide a quick and concise summary of the distribution of categorical variables. They allow analysts to understand the frequency or count of each category, revealing the proportions and imbalances within the dataset. This information is crucial for gaining insights into the composition and characteristics of categorical variables.\n",
- "\n",
- " - Class Imbalance Assessment: In classification tasks, count plots help assess the balance or imbalance of different target classes. They provide a visual representation of the distribution of classes, highlighting any significant discrepancies in the sample sizes across categories. Identifying imbalanced classes is important in machine learning as it can affect model performance and bias the predictions towards the majority class.\n",
- "\n",
- " - Data Quality Inspection: Count plots can be utilized to detect data quality issues in categorical variables. They allow analysts to identify unexpected or erroneous categories that may indicate data entry errors, missing data, or inconsistencies in the dataset. By observing the counts for each category, anomalies or discrepancies can be easily spotted, enabling data cleaning or further investigation if necessary.\n",
- "\n",
- " - Feature Importance Evaluation: Count plots can provide insights into the importance or relevance of different categorical features in relation to the target variable. By visualizing the distribution of categories within each class or target level, analysts can determine which categories are more prevalent or have higher frequencies for specific outcomes. This understanding helps in assessing the discriminatory power of categorical features and their potential impact on predictive models.\n",
- "\n",
- "Here is your task:\n",
- "\n",
- " 1. Use Seaborn to create a [countplot](https://seaborn.pydata.org/generated/seaborn.countplot.html) for the variables PULocationID and DOLocationID. Keep only the top 15 pick-up and drop-off locations. What's the most popular pick-up location?"
- ],
- "metadata": {
- "id": "OA1p83hesFkH"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "# Create a Seaborn countplot for PULocationID and DOLocationID. Only plot the top 15 categories by value counts."
- ],
- "metadata": {
- "id": "eigDjtkTruQD"
- },
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "source": [
- "A box plot, also known as a box-and-whisker plot, is a powerful visualization tool for displaying the distribution, variability, and outliers within a numerical dataset. It provides a concise summary of key statistical measures and offers several important uses:\n",
- "\n",
- " - Data Distribution and Skewness: Box plots offer a visual representation of the distribution of numerical data, providing insights into its central tendency, spread, and skewness. The box represents the interquartile range (IQR), which contains the middle 50% of the data, with the median indicated by a horizontal line within the box. By observing the length and symmetry of the box, analysts can assess whether the data is skewed or symmetrically distributed.\n",
- "\n",
- " - Outlier Detection: Box plots are highly effective in identifying outliers, which are data points that deviate significantly from the rest of the distribution. The whiskers of the plot extend to the minimum and maximum non-outlier values, with any data points beyond the whiskers considered potential outliers. Outliers can indicate data errors, anomalies, or important observations requiring further investigation.\n",
- "\n",
- " - Comparing Groups or Categories: Box plots are useful for comparing the distributions of numerical data across different groups or categories. By creating side-by-side or grouped box plots, analysts can easily compare the central tendencies, spreads, and shapes of distributions between different groups. This allows for the identification of differences, similarities, or patterns within the data.\n",
- "\n",
- " - Variability and Spread: Box plots provide insights into the variability and spread of the data. The length of the box indicates the spread of the middle 50% of the data, while the whiskers show the range of non-outlier values. By comparing the lengths of the boxes and whiskers, analysts can assess the relative variability between different groups or categories, aiding in the understanding of the data's dispersion.\n",
- "\n",
- " - Skewness and Symmetry: Box plots offer a visual assessment of the skewness or symmetry of the data distribution. A symmetrical distribution is represented by a box plot with an equal length on both sides of the median, while a skewed distribution is indicated by a longer box on one side. This visual cue helps in understanding the shape and characteristics of the data, assisting in further analysis and modeling decisions.\n",
- "\n",
- " - Data Range and Quartiles: Box plots display the quartiles of the data distribution. The lower quartile (Q1) represents the 25th percentile, the upper quartile (Q3) represents the 75th percentile, and the interquartile range (IQR) is the range between Q1 and Q3. These quartiles provide a summary of the range and spread of the central portion of the data, aiding in the understanding of the data's variability and dispersion.\n",
- "\n",
- "Your task is:\n",
- "\n",
- " 1. Use Seaborn's [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html) to discern the relationship between payment_type and total_amount. Does anything look weird? Can you explain what's going on?"
- ],
- "metadata": {
- "id": "Bbl-WZtctVbH"
- }
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "fM_2IGKrk8Vy"
- },
- "outputs": [],
- "source": [
- "# Create a box plot of total amount by payment type. Do you see anything odd?"
- ]
- },
- {
- "cell_type": "markdown",
- "source": [
- "A histogram is a graphical representation that displays the distribution of a continuous or discrete numerical variable. It provides insights into the underlying data distribution and helps uncover patterns, frequencies, and ranges within the dataset. Here are some key uses and benefits of histogram plots:\n",
- "\n",
- " - Data Distribution: Histograms allow analysts to visualize the shape, central tendency, and spread of the data. They provide an overview of the data distribution, helping to identify if it follows a particular pattern, such as a normal distribution, skewed distribution, bimodal distribution, or multimodal distribution. Understanding the data distribution aids in selecting appropriate analysis techniques and understanding the characteristics of the data.\n",
- "\n",
- " - Frequency Analysis: Histograms display the frequency or count of data points within predefined bins or intervals along the x-axis. 
By observing the height or count of each bin, analysts can identify the frequency of occurrence for different values or value ranges. This information helps assess the concentration of data points and identify peaks or modes in the distribution.\n", - "\n", - " - Outlier Detection: Histograms can assist in identifying outliers, which are data points that significantly deviate from the rest of the distribution. Outliers may indicate data errors, unusual observations, or important anomalies requiring further investigation. By examining the tails or extreme values in the histogram, analysts can identify potential outliers that may require additional scrutiny.\n", - "\n", - " - Data Range and Spread: Histograms provide insights into the range and spread of the data. The x-axis represents the variable's values, while the y-axis represents the frequency or count of occurrences. By observing the width and span of the histogram, analysts can assess the data's range and variability. This information helps understand the data's spread and aids in subsequent analysis or decision-making processes.\n", - "\n", - " - Feature Engineering: Histograms can guide feature engineering processes by informing appropriate transformations, binning strategies, or encoding techniques for numerical variables. They assist in identifying nonlinear relationships or determining optimal cut-off points for converting continuous variables into categorical ones. Histograms also help identify data skewness and guide transformation methods to address the skew if necessary.\n", - "\n", - " - Data Quality Inspection: Histograms can be useful in detecting data quality issues or anomalies. Unusual spikes, gaps, or unexpected patterns in the histogram may indicate data entry errors, measurement inconsistencies, or missing data. By observing the histogram, analysts can identify potential data quality issues that require further investigation or preprocessing.\n", - "\n", - "Your task is:\n", - "\n", - " 1. Use Seaborn's [histplot](https://seaborn.pydata.org/generated/seaborn.histplot.html) to explore the data distributions for fare_amount, trip_distance, and extra. Use kernel density estimators to better visualize the distribution. Use sampling if you run into any memory issues." - ], - "metadata": { - "id": "CnV0dIHquPtf" - } - }, - { - "cell_type": "code", - "source": [ - "# Explore data distributions for 'fare_amount', 'trip_distance' and 'extra' using Seaborn's histplot. Sample the data if you run into memory issues." - ], - "metadata": { - "id": "8HI3uXHCyI_B" - }, - "execution_count": null, - "outputs": [] - } - ] -} \ No newline at end of file