Lectures: CSE206_Fa24-01.pdf Lab/Tutorial: week01.pdf
- Probability gives a precise mathematical way to talk about uncertainty in models of the real world.
- Statistics uses data (observations) to learn about the unknown “state of nature” behind those models.
- We describe data sets using descriptive statistics: measures of location (center) and measures of scattering (spread).
- The sample mean, median, and quantiles are different ways to say “where the data sit” on the number line.
- The sample standard deviation and mean absolute deviation describe how tightly or loosely data are clustered.
- Dot plots and histograms are visual tools to see the overall shape of a data set (center, spread, skewness, outliers).
- Outliers are atypical data points that can strongly influence the mean and standard deviation, but much less the median.
- The lab for week 1 gives practice computing these summaries, drawing dot plots and histograms, and comparing two groups.
- Plain-language definition.
- Probability is a branch of mathematics that encodes uncertainty and lets us compute chances of events under a model.
- Statistics uses observed data to infer the unknown parameters or probabilities that define that model.
- Formal definition.
- In probability, we assume a probability model is known: basic events have assigned probabilities, and we derive the probabilities of more complex events from them.
- In statistics, we treat those probabilities or parameters as unknown and use data samples to estimate or test them.
- Intuition / mental model.
- Probability: “If I know how the world works, what is the chance of this outcome?” (e.g., probability of five heads in a row for a fair coin).
- Statistics: “From what I have observed, what can I say about how the world works?” (e.g., is this coin fair given the toss results?).
- Tiny example.
- Probability question: assuming a fair coin, how likely is it that heads appears nearly half the time in many tosses?
- Statistics question: I tossed a coin ten times and got eight heads; can I reasonably claim the coin is not fair?
- Plain-language definition.
- The state of nature is the underlying reality or mechanism that generates data.
- Observations are the actual values we record from that mechanism; the collection of observed values is a sample.
- Formal definition.
- We assume there is an unknown probability distribution describing the phenomenon.
- A sample is a finite sequence of real numbers
$x_1, \dots, x_n$ drawn from that distribution.
- Intuition / mental model.
- Think of the state of nature as the complete list of all possible outputs, which we can never fully see, and the sample as a small subset we actually measure.
- Examples from the lecture: all possible responses of a large AI model, all heights in a city, or all device lifetimes from a factory.
- Tiny example.
- Population heights: the true heights of everyone in Kazan form the state of nature; the thousand people you measure in a study form your sample.
- Plain-language definition.
- The sample mean is the average value of the data; it is one way to describe the center of a sample.
- Formal definition.
- For a sample
${x_1, \dots, x_n}$ , the sample mean is $$ \bar{x} = \frac{1}{n}\sum_{j=1}^n x_j. $$ - It is the unique number
$\mu$ for which$\sum_{j=1}^n (x_j - \mu) = 0$ .
- For a sample
- Intuition / mental model.
- Imagine the data points as weights on a number line; the mean is the balance point where the system is in equilibrium.
- It is very sensitive to extreme values (outliers), because every data point enters the sum directly.
- Tiny example.
- If reaction times in a tiny sample are
$180, 220, 200$ milliseconds, then$\bar{x} = (180 + 220 + 200)/3 = 200$ ms.
- If reaction times in a tiny sample are
- Plain-language definition.
- The median is the “middle” value of an ordered sample.
- Quantiles (such as quartiles) mark positions that split the ordered data into fixed fractions (like quarters).
- Formal definition.
- Reorder the sample into an increasing sequence
$y_1 \le \dots \le y_n$ . - If
$n$ is odd, the median$m$ is$y_{(n+1)/2}$ . - If
$n$ is even, the median is $$ m = \frac{y_{n/2} + y_{n/2 + 1}}{2}. $$ - For a positive integer
$q$ , the$k$ -th$q$ -quantile is the value$y_j$ such that roughly$k/q$ of the values lie below$y_j$ and the rest above it.
- Reorder the sample into an increasing sequence
- Intuition / mental model.
- The median is a center that is robust: moving a few extreme points far away does not change it much, as long as the order around the middle stays the same.
- Quartiles cut the ordered data into four blocks of equal size and help describe spread and skewness.
- Tiny example.
- For the ordered data
${1, 3, 5, 9, 10}$ ($n = 5$ ), the median is the third value:$m = 5$ . - For
${1, 3, 5, 9}$ ($n = 4$ ), the median is$(3 + 5)/2 = 4$ .
- For the ordered data
- Plain-language definition.
- The standard deviation measures how far, on average in a squared sense, data points are from the mean.
- The mean absolute deviation (m.a.d.) measures how far, on average, data points are from the mean using absolute distances.
- Formal definition.
- For a sample
${x_1, \dots, x_n}$ with mean$\bar{x}$ , the (sample) standard deviation is $$ s = \sqrt{\frac{1}{n - 1}\sum_{j=1}^n (x_j - \bar{x})^2}. $$The denominator
$n-1$ is the “degrees of freedom” correction (Bessel’s correction): we already used the data once to estimate$\bar{x}$ , so if we divided by$n$ the average squared deviation would systematically underestimate the true variance of the population. With$1/(n-1)$ ,$s^2$ is (under standard i.i.d. assumptions) an unbiased estimator of the population variance. - The sample variance ($\operatorname{Var}(x)$) is
$s^2$ , the square of the standard deviation. - If instead you are treating the data set as the whole population (for example, measuring spread of all exam scores in a fixed class, or computing loss over all points in a finite training set), it is natural to drop the correction and define the population variance by dividing by
$n$ . - The mean absolute deviation is $$ \text{m.a.d.} = \frac{1}{n}\sum_{j=1}^n |x_j - \bar{x}|. $$
- For a sample
- Intuition / mental model.
- Both numbers describe how “together” or “spread out” the data are, but standard deviation penalizes large deviations more strongly because of the square.
- Standard deviation and m.a.d. are insensitive to adding a constant to all data (a shift), but they scale when all data are multiplied by a constant.
- Tiny example.
- If a small sample is
${9, 10, 11}$ , then$\bar{x} = 10$ . - The deviations are
$-1, 0, +1$ ; the m.a.d. is$(1 + 0 + 1)/3 = 2/3$ . - The standard deviation is
$\sqrt{[(1^2 + 0^2 + 1^2)/(3 - 1)]} = \sqrt{1} = 1$ .
- If a small sample is
- Plain-language definition.
- A dot plot places a dot for each data point in one of several equal-width cells along the horizontal axis.
- A histogram uses rectangles over those cells; the height of each rectangle is the frequency (or relative frequency) of data in that cell.
- Formal definition.
- Let
$x_{\min}$ and$x_{\max}$ be the minimum and maximum data values. - Choose a number of cells
$N$ and partition$[x_{\min}, x_{\max}]$ into$N$ subintervals of equal length. - Dot plot: for each subinterval, draw a column of dots, one per data point in that interval.
- Histogram: for each subinterval, draw a rectangle with base the subinterval and height equal to the number of data points (or relative frequency) in that interval.
- Let
- Intuition / mental model.
- Dot plots are very clear for small to moderate samples: you can see each point and where it falls.
- Histograms are better for large samples, letting you see overall shape (e.g., symmetric, skewed, multimodal) without tracking individual values.
- Tiny example.
- For the sample
${10, -2, -4.5, -7.8, -1, 0, 0, 0, 3, 3, 0, 5, -1, 0, -2, -2, -7.7, 10, 10, 11, -2, -3}$ from the lecture, the instructor shows dot plots and histograms that change shape when the number of cells$N$ changes.
- For the sample
|
|
|
|
- Plain-language definition.
- Outliers are data points that look atypical or far from the bulk of the sample.
- The empirical distribution function (EDF), denoted
$F_{\text{emp}}(x)$ , describes, for each$x$ , the proportion of data values that are$\le x$ .
- Formal definition.
- For a sample
${x_1, \dots, x_n}$ , the EDF is $$ F_{\text{emp}}(x) = \frac{1}{n}#{ j \in {1, \dots, n} : x_j \le x}. $$ - Outliers are not defined by a single formula in the lecture; their identification depends on the underlying state of nature and context.
- For a sample
- Intuition / mental model.
-
$F_{\text{emp}}(x)$ is a step function that jumps up by$1/n$ at each data point; it shows how the sample “accumulates” as$x$ increases. - Outliers may lie outside a range such as
$(\bar{x} - s, \bar{x} + s)$ , but in many real problems this simple rule can mislead, so context is crucial.
-
- Tiny example.
- For the sample
${22, 24, 35, 1, 1, 5, 6, 10, 13, 21, 10, 10, 40, 41, 47}$ from the lecture,$F_{\text{emp}}(x)$ is$0$ for$x < 1$ , jumps to$2/15$ at$x = 1$ , and eventually reaches$1$ after the largest value$47$ .
- For the sample
- Formula.
- $$ \bar{x} = \frac{1}{n}\sum_{j=1}^n x_j. $$
- Symbols.
-
$x_1, \dots, x_n$ : observed data values (the sample). -
$n$ : sample size. -
$\bar{x}$ : sample mean (average).
-
- When to use it.
- To summarize the central tendency of quantitative data when you are not too worried about extreme values.
- In many later results in probability and statistics, the sample mean is a basic building block.
- Common mistakes.
- Forgetting to divide by
$n$ after summing. - Using rounded intermediate sums that cause unnecessary rounding error.
- Treating the mean as robust to outliers (it is not).
- Forgetting to divide by
- Formula / definition.
- Order the data:
$y_1 \le \dots \le y_n$ . - Median
$m$ :- If
$n$ is odd,$m = y_{(n+1)/2}$ . - If
$n$ is even,$m = (y_{n/2} + y_{n/2+1})/2$ .
- If
- First and third quartiles are special quantiles located at roughly 25% and 75% positions in the ordered list.
- Order the data:
- Symbols.
-
$y_j$ :$j$ -th ordered value of the sample. -
$n$ : sample size. -
$m$ : sample median.
-
- When to use it.
- To describe the center of data sets with clear skewness or outliers, where the mean might be misleading.
- To split the data into parts and reason about where most values lie.
- Common mistakes.
- Forgetting to order the data before taking the middle.
- Miscounting indices when
$n$ is even. - Confusing median (position-based) with mean (value-based).
- Formula / definition.
- For a
$p%$ trimmed mean:- Order the data.
- Remove the smallest
$p%$ and largest$p%$ of observations. - Take the ordinary mean of the remaining data.
- The lab uses a
$10%$ trimmed mean.
- For a
- Symbols.
-
$p$ : trimming percentage. -
$n$ : sample size;$pn$ (rounded appropriately) observations are trimmed from each tail.
-
- When to use it.
- When you want a measure of center that behaves like the mean but is less sensitive to a few extreme values.
- To check whether outliers are heavily influencing the plain mean by comparing mean vs trimmed mean vs median.
- Common mistakes.
- Forgetting to sort the data before trimming.
- Removing the wrong number of observations from each tail.
- Mixing up which data remain after trimming when computing the average.
- Formula.
- Standard deviation: $$ s = \sqrt{\frac{1}{n - 1}\sum_{j=1}^n (x_j - \bar{x})^2}. $$
- Variance: $$ s^2 = \frac{1}{n - 1}\sum_{j=1}^n (x_j - \bar{x})^2. $$
- Symbols.
-
$x_j$ : data points. -
$\bar{x}$ : sample mean. -
$n$ : sample size. -
$s$ : sample standard deviation. -
$s^2$ : sample variance.
-
- When to use it.
- To measure spread around the mean and compare variability between samples (e.g., control vs treatment groups in the lab).
- As input to many later statistical procedures (confidence intervals, tests, etc.).
- Common mistakes.
- Forgetting the context of the denominator:
- use
$n-1$ when the data are a sample from a larger population and you want$s^2$ to estimate the unknown population variance (unbiased “sample” variance); - use
$n$ when you are describing the spread of a finite population you have completely observed (population variance).
- use
- Forgetting to square the deviations before summing.
- Taking the square root too early or forgetting it when going from variance to standard deviation.
- Forgetting the context of the denominator:
- Formula.
- $$ \text{m.a.d.} = \frac{1}{n}\sum_{j=1}^n |x_j - \bar{x}|. $$
- Symbols.
-
$x_j$ : data points. -
$\bar{x}$ : sample mean. -
$n$ : sample size. -
$|\cdot|$ : absolute value.
-
- When to use it.
- When you want a measure of spread that treats all deviations linearly instead of squaring them.
- As a way to see how much data typically differ from the mean without giving extra emphasis to outliers.
- Common mistakes.
- Confusing m.a.d. with standard deviation; they are different quantities.
- Forgetting to take absolute values before summing.
- Formula.
- $$ F_{\text{emp}}(x) = \frac{1}{n}#{ j \in {1, \dots, n} : x_j \le x}. $$
- In words: the EDF takes your sample values
$x_1,\dots,x_n$ and, for any cutoff$x$ you choose, gives back the fraction of sample points that are$\le x$ . If you plot$x\mapsto F_{\text{emp}}(x)$ , you get a step graph that shows how the data accumulate from left to right.
- Symbols.
-
$x$ : point on the horizontal axis where we evaluate the function. -
$x_j$ : data points. -
$n$ : sample size. -
$#{\cdot}$ : number of elements in the set.
-
- When to use it.
- To visualize how the sample accumulates as values increase.
- To connect sample behavior with quantiles (e.g., where
$F_{\text{emp}}$ crosses 0.25, 0.5, 0.75).
- Common mistakes.
- Counting only strictly less than instead of “less than or equal”, which changes the step locations.
- Forgetting to divide by
$n$ , giving counts instead of proportions.
- Setup.
- From the lecture: suppose we have one thousand data points, one of which is
$1000$ and all the rest are$0$ . - So the sample is
$x_1, \dots, x_{1000}$ with 999 zeros and a single 1000.
- From the lecture: suppose we have one thousand data points, one of which is
- Step 1: compute the sample mean.
- The sum of all values is
$1000 + 0 + \dots + 0 = 1000$ . - The sample size is
$n = 1000$ . - The mean is $$ \bar{x} = \frac{1000}{1000} = 1. $$
- The sum of all values is
- Step 2: compute the median.
- Order the data: 999 zeros, then one 1000 at the end.
- For
$n = 1000$ , the median is the average of the 500th and 501st ordered values. - Both the 500th and 501st values are 0, so the median is $$ m = \frac{0 + 0}{2} = 0. $$
- Step 3: compute the standard deviation.
- Deviations from the mean: 999 values equal to
$(0 - 1) = -1$ , and one value equal to$(1000 - 1) = 999$ . - Sum of squared deviations: $$ \sum_{j=1}^{1000} (x_j - \bar{x})^2 = 999^2 + 999 \cdot 1^2 = 999 \cdot 1000 = 999000. $$
- The standard deviation is $$ s = \sqrt{\frac{999000}{1000 - 1}} = \sqrt{\frac{999000}{999}} = \sqrt{1000} \approx 31.6. $$
- Deviations from the mean: 999 values equal to
- Step 4: compute the mean absolute deviation.
- Absolute deviations:
$|1000 - 1| = 999$ , and 999 copies of$|0 - 1| = 1$ . - Sum of absolute deviations is
$999 + 999 \cdot 1 = 1998$ . - m.a.d. is $$ \text{m.a.d.} = \frac{1998}{1000} = 1.998. $$
- Absolute deviations:
- Check your intuition.
- The mean (1) lies close to the many zeros but is pulled away by the single large value.
- The median (0) completely ignores the extreme 1000.
- The standard deviation is very large because the squared deviation of 999 dominates.
- The m.a.d. is closer in size to the mean and much smaller than the standard deviation.
- Setup.
- From the lecture, consider the sample
${22, 24, 35, 1, 1, 5, 6, 10, 13, 21, 10, 10, 40, 41, 47}$ . - There are
$n = 15$ observations.
- From the lecture, consider the sample
- Step 1: order the data.
- Ordered sample:
${1, 1, 5, 6, 10, 10, 10, 13, 21, 22, 24, 35, 40, 41, 47}$ .
- Ordered sample:
- Step 2: compute
$F_{\text{emp}}(x)$ at a few points.- For
$x < 1$ , no observations are$\le x$ , so$F_{\text{emp}}(x) = 0$ . - At
$x = 1$ , two observations are$\le 1$ , so $$ F_{\text{emp}}(1) = \frac{2}{15}. $$ - At
$x = 10$ , seven observations are$\le 10$ , so $$ F_{\text{emp}}(10) = \frac{7}{15}. $$ - For any
$x$ between 41 and 47 (but less than 47), 14 observations are$\le x$ , so$F_{\text{emp}}(x) = 14/15$ . - For
$x \ge 47$ , all 15 observations are$\le x$ , so$F_{\text{emp}}(x) = 1$ .
- For
- Step 3: relate to quantiles.
- The median is the 8th ordered value (since
$n = 15$ ), which is$13$ . - At
$x = 13$ , exactly 8 observations are$\le 13$ , so $$ F_{\text{emp}}(13) = \frac{8}{15} \approx 0.53. $$ - This is just above 0.5, consistent with the interpretation of the median as the 50%-quantile.
- The median is the 8th ordered value (since
- Check your intuition.
- As
$x$ moves along the horizontal axis, the EDF jumps up by$1/15$ at each data point. - Reading off approximate levels like 0.25, 0.5, and 0.75 on the vertical axis helps locate quartiles and median.
- As
- Problem 1 (Walpole 1.2): reaction time of 20 amateur racers.
- Compute the sample mean and sample median of the reaction-time data.
- Compute the 10% trimmed mean.
- Draw a dot plot of the data.
- Compare mean, median, and trimmed mean to discuss possible outliers.
- Problem 2 (Walpole 1.5 and 1.11): effect of regular exercise on cholesterol.
- Data are changes in cholesterol levels for a control and a treatment group.
- Draw a dot plot for both groups on the same graph.
- Compute mean, median, and 10% trimmed mean for each group.
- Explain why the mean difference suggests one conclusion while median/trimmed mean suggest another.
- Compute sample variance and sample standard deviation for each group.
- Problem 3 (Walpole 1.13): battery lifetime data (hours).
- Compute sample mean and median.
- Identify which value causes a substantial difference between mean and median.
- Recalculate the mean without that value and compare.
- Problem 4 (Walpole 1.17): time to fall asleep for smokers (A) and nonsmokers (B).
- Compute sample mean and sample standard deviation for each group.
- Draw a dot plot for both groups on the same line.
- Comment on how smoking appears to affect time to fall asleep.
- Problem 5 (Walpole 1.18): final exam scores in an elementary statistics course.
- Construct a relative frequency histogram.
- Sketch a smooth curve that approximates the distribution and discuss skewness.
- Compute sample mean, sample median, and sample standard deviation.
- Problem 6 (Walpole 1.20): lifetime (seconds) of 50 fruit flies under a new spray.
- Set up a relative frequency distribution.
- Construct a relative frequency histogram.
- Find the median.
- Problem 7: complete the Colab notebook linked at the end of the lab (practice using Python/Colab for descriptive statistics).
- Computing means, medians, trimmed means.
- Sum all values and divide by the sample size to get the mean.
- Sort the data; take the middle value(s) to find the median.
- For a 10% trimmed mean, sort the data, remove the smallest and largest 10% of observations, then compute the mean of what remains.
- Dot plots and histograms.
- For dot plots, choose a horizontal axis that covers all data, split into equal-width cells, and place a dot in the appropriate cell for each observation.
- For histograms, count how many observations fall in each cell (frequency) or divide by the total number of observations (relative frequency), then draw rectangles with those heights.
- When comparing two groups (e.g., control vs treatment, smokers vs nonsmokers), plot them on the same axis to compare centers, spreads, and outliers visually.
- Interpreting outliers and robustness.
- If the mean, median, and trimmed mean are close, there is little evidence of strong outliers.
- If the mean is far away from the median and trimmed mean, this suggests that a few extreme values may be pulling the mean.
- Recomputing the mean after removing a suspected extreme value is a way (used in the lab) to see how influential it is.
- Comparing groups with descriptive statistics.
- For cholesterol and sleep-time problems, compare means to see differences in average level between groups.
- Compare medians and trimmed means to see whether group differences are robust to outliers.
- Compare standard deviations to see which group has more variability.
- When commenting on “impact” (e.g., of smoking or treatment), base your reasoning on these differences and on the plots (dot plots, histograms).
- Relative frequency distributions and skewness.
- To build a relative frequency distribution, define class intervals (bins), count how many observations fall in each, and divide by the total number of observations.
- Plot these relative frequencies as a histogram and inspect whether the distribution is symmetric, right-skewed (long tail to the right), or left-skewed.
- Use the shape to interpret how the bulk of the data and extreme values behave.
- Practice 1: mean vs median vs outlier.
- Question: Consider a data set of 9 exam scores where 8 scores are between 60 and 80, and one score is 10. Which measure of center (mean or median) is more affected by the single low score?
- Brief answer: The mean is much more affected, because it averages all values; the median depends only on the order and usually stays near the middle of the bulk of the data.
- Practice 2: interpreting a dot plot of two groups.
- Question: In a dot plot, group A’s dots are mostly clustered tightly around 0 with a few large positive values, while group B’s dots are spread more evenly from negative to positive values. Which group likely has larger standard deviation, and what might that mean?
- Brief answer: Group B likely has a larger standard deviation, indicating more variability in its measurements. Group A has a tighter core but a few outliers.
- Practice 3: trimmed mean reasoning.
- Question: You compute a mean of 20 for a sample, but the 10% trimmed mean and the median are both close to 15. What does this suggest about the data?
- Brief answer: It suggests there are some extreme high values that pull the plain mean upward; trimming and using the median reduce their influence and give a center near 15.
- Probability assumes a model with known probabilities and computes chances of events; statistics uses data to infer those unknown probabilities or parameters.
- The state of nature is the underlying reality (or probability distribution); a sample is the finite list of observed values.
- The sample mean
$\bar{x} = \frac{1}{n}\sum x_j$ is a natural measure of center but is sensitive to outliers. - The median and other quantiles are location measures based on ordered positions and are more robust to extreme values.
- The sample standard deviation
$s$ and variance$s^2$ measure how spread out the data are around the mean, with large deviations heavily weighted. - The mean absolute deviation measures spread using absolute distances and usually gives a smaller value than the standard deviation for the same data.
- Dot plots and histograms turn raw data into visual summaries that reveal center, spread, skewness, and potential outliers.
- Outliers are atypical points whose importance must be judged in context; simple rules based on
$(\bar{x} - s, \bar{x} + s)$ can be misleading. - The empirical distribution function
$F_{\text{emp}}(x)$ counts the proportion of observations$\le x$ and connects the sample to quantiles like the median. - The week 1 lab reinforces these ideas by having you compute means, medians, trimmed means, standard deviations, draw dot plots and histograms, and interpret differences between groups.



