diff --git a/docs/index.mdx b/docs/index.mdx
index 20061af..51029e1 100644
--- a/docs/index.mdx
+++ b/docs/index.mdx
@@ -102,7 +102,7 @@ Select a technology below to dive into our structured tutorials. Each path is de

Learn NoSQL database concepts with MongoDB. Store, query, and manage data efficiently for modern applications.

- +

Explore artificial intelligence, machine learning, and neural networks with beginner-friendly examples.

diff --git a/docs/machine-learning/statistics/basic-concepts.mdx b/docs/machine-learning/statistics/basic-concepts.mdx
index e69de29..9169411 100644
--- a/docs/machine-learning/statistics/basic-concepts.mdx
+++ b/docs/machine-learning/statistics/basic-concepts.mdx
@@ -0,0 +1,72 @@
---
title: "Basic Statistical Concepts"
sidebar_label: Basic Concepts
description: "Introduction to the fundamental pillars of statistics in ML: Populations vs. Samples, Descriptive vs. Inferential statistics, and Data Types."
tags: [statistics, mathematics-for-ml, data-types, population, sample, descriptive-statistics]
---

Statistics is the science of collecting, analyzing, and interpreting data. In Machine Learning, statistics provides the tools to handle uncertainty, validate models, and determine whether the patterns we find are "real" or just random noise.

## 1. Population vs. Sample

The most fundamental distinction in statistics is between the group we want to know about and the group we actually observe.

* **Population:** The entire group of individuals or instances about which we want to draw conclusions.
  * *Example:* All people who use a specific social media app.
* **Sample:** A subset of the population from which we actually collect data.
  * *Example:* 1,000 users who responded to a survey.

:::important The Goal of ML
In Machine Learning, our training data is a **sample**. Our goal is to build a model that generalizes well to the entire **population** (unseen data).
:::

## 2. Descriptive vs. Inferential Statistics

Statistics is generally divided into two main branches:

### A. Descriptive Statistics
This branch focuses on summarizing and describing the characteristics of a dataset. We use numbers and graphs to tell the story of the data we have in hand.
* **Tools:** Mean, Median, Mode, Standard Deviation, Histograms.

### B. Inferential Statistics
This branch focuses on making predictions or generalizations about a population based on a sample.
* **Tools:** Hypothesis testing, P-values, Confidence Intervals, Regression.

## 3. Types of Data

Not all data is created equal. The way we process features in ML depends entirely on their statistical type.

| Data Type | Sub-type | Description | Example |
| :--- | :--- | :--- | :--- |
| **Qualitative** (Categorical) | **Nominal** | Categories with no inherent order. | Eye color, Gender, Zip Code. |
| | **Ordinal** | Categories with a meaningful order. | Education level (Bachelor's, Master's, PhD). |
| **Quantitative** (Numerical) | **Discrete** | Values that can be counted (integers). | Number of rooms in a house, number of clicks. |
| | **Continuous** | Values that can be measured (real numbers). | Temperature, Weight, Stock price. |

## 4. Parameters vs. Statistics

* **Parameter:** A numerical value that describes a characteristic of the entire **population**. (Usually denoted by Greek letters, like $\mu$ for the mean.)
* **Statistic:** A numerical value that describes a characteristic of a **sample**. (Usually denoted by Roman letters, like $\bar{x}$ for the mean.)

In ML, we use **sample statistics** (like the error on our training set) to estimate the true **population parameters** (the true error the model would make on all possible data).

## 5. Why Statistics Matters in the ML Pipeline

1. **Exploratory Data Analysis (EDA):** Before building a model, we use descriptive statistics to find outliers, understand distributions, and identify correlations.
2. **Feature Engineering:** Understanding data types helps us decide how to encode variables (e.g., One-Hot Encoding for Nominal data).
3. **Model Validation:** We use inferential statistics to determine whether a model's performance improvement is statistically significant or just due to a lucky split of the data.
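The parameter-vs-statistic idea from Section 4 can be sketched with a quick, self-contained simulation using only Python's standard library (the "population" and all numbers below are hypothetical, purely for illustration):

```python
import random
import statistics

random.seed(42)  # reproducible illustration

# Simulated population: daily in-app minutes for 100,000 users (hypothetical).
population = [random.gauss(mu=50, sigma=10) for _ in range(100_000)]
mu = statistics.mean(population)  # population parameter (the "true" mean)

# The data we actually observe: a sample of 1,000 users.
sample = random.sample(population, 1_000)
x_bar = statistics.mean(sample)   # sample statistic (our estimate of mu)

print(f"population mean (parameter) = {mu:.2f}")
print(f"sample mean (statistic)     = {x_bar:.2f}")  # close to mu, but not exact
```

As the sample grows, the sample mean concentrates around the population mean, which is exactly why a training set can stand in for the population of unseen data.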
---

## References for More Details

* **StatQuest with Josh Starmer - Statistics Fundamentals:**
  * [YouTube Link](https://www.youtube.com/playlist?list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9)
  * *Best for:* Highly visual and intuitive explanations of population vs. sample and other core concepts.
* **Khan Academy - Summarizing Quantitative Data:**
  * [Website Link](https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data)
  * *Best for:* Interactive practice with mean, median, and variance.

---

Now that we have the vocabulary, let's look at the specific numerical tools we use to describe the center and spread of our data.
\ No newline at end of file
diff --git a/docs/machine-learning/statistics/data-visualization.mdx b/docs/machine-learning/statistics/data-visualization.mdx
index e69de29..fb75cce 100644
--- a/docs/machine-learning/statistics/data-visualization.mdx
+++ b/docs/machine-learning/statistics/data-visualization.mdx
@@ -0,0 +1,66 @@
---
title: "Data Visualization in Statistics"
sidebar_label: Data Visualization
description: "Exploring the essential plots and charts used in statistical analysis to identify patterns, distributions, and outliers in Machine Learning datasets."
tags: [statistics, data-visualization, eda, histograms, box-plots, scatter-plots, mathematics-for-ml]
---

Numerical summaries like the Mean or Standard Deviation tell only half the story. **Data Visualization** allows us to see the shape, spread, and anomalies in our data that numbers might hide. In Machine Learning, visualization is the primary tool used during **Exploratory Data Analysis (EDA)**.

## 1. Visualizing Distributions (Univariate Analysis)

To understand a single feature, we look at its distribution.

### A. Histograms
A histogram groups continuous data into "bins" and shows the frequency of data points in each bin. It is the best tool for identifying the **shape** of the data (Normal, Skewed, Bimodal).

### B. Box Plots (Whisker Plots)
Box plots are excellent for identifying **outliers** and understanding the quartiles of your data.
* **The Box:** Represents the Interquartile Range (IQR), containing the middle 50% of the data.
* **The Line:** The Median.
* **The Whiskers:** Usually extend to $1.5 \times \text{IQR}$ beyond the box.
* **Dots:** Data points outside the whiskers are considered outliers.

## 2. Visualizing Relationships (Bivariate Analysis)

To understand how two features interact, we use relational plots.

### A. Scatter Plots

![Scatter plots showing positive correlation, negative correlation, and no correlation](/img/tutorials/ml/scatter-plots.jpg)

Scatter plots display individual data points on an XY plane. They are the first step in identifying **Correlation**.
* **Linear Relationship:** Points form a straight line.
* **Non-linear Relationship:** Points form a curve.
* **No Relationship:** Points look like a random cloud.

### B. Bar Charts vs. Pie Charts
* **Bar Charts:** Best for comparing a numerical value across different categories.
* **Pie Charts:** Best for showing parts of a whole (though bar charts are often preferred for readability).

---

## 3. Visualizing Multiple Variables (Multivariate)

### A. Heatmaps (Correlation Matrices)
In ML, we often have dozens of features. A heatmap uses color to represent the correlation coefficient between every pair of features. This helps in **Feature Selection** by identifying redundant variables.

### B. Pair Plots
A grid of scatter plots for every pair of features in a dataset. It allows you to see relationships across the entire dataset at once.

---

## 4. Anscombe's Quartet: Why Visualization Matters
The most famous example of why we visualize is **Anscombe's Quartet**. It consists of four datasets that have nearly identical descriptive statistics (mean, variance, correlation), yet look completely different when graphed.

:::tip ML Best Practice
Never start training a model before visualizing your data.
Plots often reveal data quality issues (like sensors being stuck at a maximum value) that summary statistics would miss.
:::

---

Visualizing our data often reveals a specific "bell-shaped" curve that appears everywhere in nature and math. Understanding this curve is our next major step.
\ No newline at end of file
diff --git a/docs/machine-learning/statistics/descriptive-statistics.mdx b/docs/machine-learning/statistics/descriptive-statistics.mdx
index e69de29..d74ed47 100644
--- a/docs/machine-learning/statistics/descriptive-statistics.mdx
+++ b/docs/machine-learning/statistics/descriptive-statistics.mdx
@@ -0,0 +1,70 @@
---
title: "Descriptive Statistics"
sidebar_label: Descriptive Statistics
description: "Mastering measures of central tendency (mean, median, mode) and dispersion (variance, standard deviation, range) to summarize and understand data distributions."
tags: [statistics, mean, median, variance, standard-deviation, descriptive-statistics, mathematics-for-ml]
---

Descriptive statistics allow us to summarize large volumes of raw data into a few meaningful numbers. In Machine Learning, we use these to understand the "center" and the "spread" of our features, which is essential for data cleaning and feature scaling.

## 1. Measures of Central Tendency

These measures tell us where the "middle" of the data lies.

![Mean, Median, and Mode in normal and skewed distributions](/img/tutorials/ml/measures-central-tendency.jpg)

### A. Mean (Average)
The sum of all values divided by the total number of values. It is highly sensitive to **outliers**.
$$ \mu = \frac{\sum x_i}{N} $$

### B. Median
The middle value when the data is sorted. It is **robust** to outliers, making it better for skewed distributions (like house prices or salaries).

### C. Mode
The value that appears most frequently. Useful for categorical data (e.g., finding the most common car color).

---

## 2. Measures of Dispersion (Spread)

Knowing the center isn't enough; we also need to know how "spread out" the data is.

### A. Range
The difference between the maximum and minimum values. Simple, but very sensitive to extreme outliers.

### B. Variance ($\sigma^2$)
The average of the squared differences from the Mean. It measures how far each number in the set is from the mean.
$$ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} $$

### C. Standard Deviation ($\sigma$)
The square root of the variance. It is the most common measure of spread because it is in the **same units** as the original data.

* **Low $\sigma$:** Data points are close to the mean.
* **High $\sigma$:** Data points are spread out over a wide range.

---

## 3. Measures of Shape

Beyond center and spread, we look at the symmetry and "peakedness" of the data.

### A. Skewness
Measures the asymmetry of the distribution.
* **Positive (Right) Skew:** Long tail on the right side.
* **Negative (Left) Skew:** Long tail on the left side.

### B. Kurtosis
Measures how "fat" or "thin" the tails of the distribution are compared to a normal distribution. High kurtosis indicates the presence of frequent outliers.

---

## 4. Why this matters for ML

1. **Handling Outliers:** If the Mean and Median are far apart, you likely have outliers that could skew your model's training.
2. **Missing Value Imputation:** When filling in missing data, we often choose the **Mean** (for normal data), **Median** (for skewed data), or **Mode** (for categorical data).
3. **Feature Scaling:** Techniques like **Z-Score Normalization** (Standardization) directly use the Mean and Standard Deviation to rescale features:
   $$ z = \frac{x - \mu}{\sigma} $$

---

Visualizing these numbers is often more intuitive than reading a table. Next, we'll explore the most important probability distribution in all of science and ML.
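All of these measures can be computed with Python's standard `statistics` module. A small sketch with hypothetical house prices shows why a large mean-median gap signals outliers and how z-score standardization uses $\mu$ and $\sigma$:

```python
import statistics

# Hypothetical house prices in $1000s, with one extreme outlier.
prices = [250, 260, 270, 280, 290, 300, 310, 2500]

mean = statistics.mean(prices)      # dragged upward by the outlier
median = statistics.median(prices)  # robust: stays near the bulk of the data
sigma = statistics.pstdev(prices)   # population standard deviation

print(f"mean = {mean}, median = {median}")  # mean = 557.5, median = 285.0

# Z-score standardization: z = (x - mean) / sigma
z_scores = [(x - mean) / sigma for x in prices]
print(f"outlier z-score = {z_scores[-1]:.2f}")  # far larger than the others
```

The mean lands nowhere near any typical house, while the median barely moves: exactly the "Mean and Median far apart" outlier signal described above.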
\ No newline at end of file
diff --git a/docs/machine-learning/statistics/inferential-statistics.mdx b/docs/machine-learning/statistics/inferential-statistics.mdx
index e69de29..a075c47 100644
--- a/docs/machine-learning/statistics/inferential-statistics.mdx
+++ b/docs/machine-learning/statistics/inferential-statistics.mdx
@@ -0,0 +1,108 @@
---
title: "Inferential Statistics"
sidebar_label: Inferential Statistics
description: "Understanding how to make predictions and inferences about populations using samples, hypothesis testing, and p-values."
tags: [statistics, inference, hypothesis-testing, p-value, confidence-intervals, mathematics-for-ml]
---

In Descriptive Statistics, we describe the data we have. In **Inferential Statistics**, we use that data to make "educated guesses" or predictions about data we *don't* have. This is the foundation of scientific discovery and model validation in Machine Learning.

## 1. The Core Workflow

Inferential statistics allows us to take a small sample and project those findings onto a larger population.

```mermaid
sankey-beta
%% source,target,value
Population,Sample,30
Sample,Analysis,30
Analysis,Point Estimates,15
Analysis,Confidence Intervals,15
Point Estimates,Population Inference,15
Confidence Intervals,Population Inference,15
```

## 2. Point Estimation

A **Point Estimate** is a single value (a statistic) used to estimate a population parameter.

* **Sample Mean ($\bar{x}$)** estimates the **Population Mean ($\mu$)**.
* **Sample Variance ($s^2$)** estimates the **Population Variance ($\sigma^2$)**.

However, because samples are smaller than populations, point estimates are rarely 100% accurate. We use **Confidence Intervals** to express our uncertainty.

## 3. Hypothesis Testing

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics.

### The Two Hypotheses

1. **Null Hypothesis ($H_0$):** The "status quo." It assumes there is no effect or no difference. (e.g., "This new feature does not improve model accuracy.")
2. **Alternative Hypothesis ($H_a$):** What we want to prove. (e.g., "This new feature improves model accuracy.")

### The Decision Process

We use the **P-value** to decide whether to reject the Null Hypothesis.

```mermaid
flowchart TD
    Start["State Hypotheses H0 and Ha"] --> Alpha["Set Significance Level α (usually 0.05)"]
    Alpha --> Test["Perform Statistical Test (t-test, Z-test)"]
    Test --> PVal{"Calculate P-value"}
    PVal -- "P < α" --> Reject["Reject H0: Results are Statistically Significant"]
    PVal -- "P ≥ α" --> Fail["Fail to Reject H0: No significant effect found"]
```

## 4. Confidence Intervals

A **Confidence Interval (CI)** provides a range of values that is likely to contain the population parameter.

$$
\text{CI} = \text{Point Estimate} \pm (\text{Critical Value} \times \text{Standard Error})
$$

:::note Example
We are 95% confident that the true accuracy of our model on all future data is between 88% and 92%.
:::

## 5. Common Statistical Tests in ML

| Test | Use Case | Example in ML |
| --- | --- | --- |
| **Z-Test** | Comparing means with a large sample size (n > 30). | Comparing the average spend of two large user groups. |
| **T-Test** | Comparing means with a small sample size (n < 30). | Comparing the performance of two model architectures on a small dataset. |
| **Chi-Square Test** | Testing relationships between categorical variables. | Is the "Click" rate independent of the "Device Type"? |
| **ANOVA** | Comparing means across 3 or more groups. | Does the choice of optimizer (Adam, SGD, RMSprop) significantly change accuracy? |

## 6. Type I and Type II Errors

When making inferences, we can be wrong in two ways:

```mermaid
quadrantChart
    title Statistical Decision Matrix
    x-axis "Null Hypothesis is True" --> "Null Hypothesis is False"
    y-axis "Reject Null" --> "Fail to Reject"
    quadrant-1 "Type II Error (False Negative)"
    quadrant-2 "Correct Decision (True Negative)"
    quadrant-3 "Type I Error (False Positive)"
    quadrant-4 "Correct Decision (True Positive)"
```
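Under a true null hypothesis, the decision rule from the flowchart ("reject when P < α") should fire about α of the time, and that rate *is* the Type I error. A standard-library-only simulation (a hypothetical two-sample z-test at α = 0.05; the data is synthetic) illustrates this:

```python
import math
import random
import statistics

random.seed(0)  # reproducible illustration

def rejects_h0(a, b, z_crit=1.96):
    """Two-sample z-test for equal means: reject H0 if |z| exceeds the
    two-sided critical value for alpha = 0.05."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return abs(z) > z_crit

# H0 is TRUE by construction: both groups come from the same distribution,
# so every rejection below is a false positive (Type I error).
trials = 2000
false_positives = sum(
    rejects_h0([random.gauss(0, 1) for _ in range(200)],
               [random.gauss(0, 1) for _ in range(200)])
    for _ in range(trials)
)
print(f"observed Type I error rate ≈ {false_positives / trials:.3f}")  # near α = 0.05
```

Raising `z_crit` lowers the Type I rate but raises the Type II rate: the exact trade-off the decision matrix above captures.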
1. **Type I Error ($\alpha$):** You claim there is an effect when there isn't. (False Positive).
2. **Type II Error ($\beta$):** You fail to detect an effect that actually exists. (False Negative).

## 7. Why this matters for ML Engineers

* **A/B Testing:** Inferential statistics is the engine behind A/B testing new model versions in production.
* **Feature Selection:** We use tests like Chi-Square to see if a feature actually has a relationship with the target variable.
* **Model Comparison:** If Model A has 91% accuracy and Model B has 91.5%, is that difference "real" or just luck? Inferential statistics tells you whether the improvement is **statistically significant**.

---

Understanding inference allows us to trust our model's results. Now, we dive into the specific probability distributions that model the randomness we see in the real world.
\ No newline at end of file
diff --git a/static/img/tutorials/ml/measures-central-tendency.jpg b/static/img/tutorials/ml/measures-central-tendency.jpg
new file mode 100644
index 0000000..46e69f4
Binary files /dev/null and b/static/img/tutorials/ml/measures-central-tendency.jpg differ
diff --git a/static/img/tutorials/ml/scatter-plots.jpg b/static/img/tutorials/ml/scatter-plots.jpg
new file mode 100644
index 0000000..d76fac6
Binary files /dev/null and b/static/img/tutorials/ml/scatter-plots.jpg differ