
Sampling and Descriptive Statistics


Chapter 7 of your statistics book, *Sampling and Descriptive Statistics*, introduces the fundamental transition from theoretical probability (what should happen) to practical statistics (what we observe and infer). It focuses on using data collected from a sample to estimate and describe unknown properties of the larger population distribution.

The core idea is that when a distribution is unknown, we can look at the empirical distribution derived solely from the sampled data, and use it to estimate key characteristics like the average and the variance.


The Statistical Context

Concept: Probability vs. Statistics

The book draws a sharp distinction: Probability studies experiments in which the model (the underlying distribution or laws) is fully known. Statistics tries to infer unknown aspects of that model from observed outcomes.

For instance, if we flip a coin 100 times, the results $(X_1, X_2, \dots, X_{100})$ are random variables drawn from a Bernoulli distribution with an unknown probability $p$. Statistics seeks to use the 60 heads observed in the sample $(\sum X_i = 60)$ to say something about the unknown, long-run probability $p$.
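To make the distinction concrete, here is a minimal Python sketch. The value `true_p = 0.6` and the seed are arbitrary assumptions: probability generates the flips from a known $p$, while the statistician sees only the outcomes and must estimate it.

```python
import random

random.seed(42)  # arbitrary seed so the sketch is reproducible

true_p = 0.6  # hypothetical: known to the simulation, hidden from the "statistician"
n = 100
flips = [1 if random.random() < true_p else 0 for _ in range(n)]  # Bernoulli(p) draws

heads = sum(flips)  # analogous to observing "60 heads"
p_hat = heads / n   # the statistic used to estimate the unknown p
print(f"{heads} heads in {n} flips -> p_hat = {p_hat:.2f}")
```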

Concept: The Sample

We typically model repeated measurements (like height, weight, or arsenic content) taken from a population as independent and identically distributed (i.i.d.) random variables $(X_1, X_2, \dots, X_n)$. While sampling is sometimes done “without replacement,” if the sample size $(n)$ is small compared to the population size, the results are nearly identical to i.i.d. sampling “with replacement”.
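A quick simulation of that claim (the population of 10,000 items and the 30% trait rate are hypothetical choices for illustration): when $n$ is tiny relative to the population, the two sampling schemes produce nearly identical sampling distributions for the sample proportion.

```python
import random
import statistics

random.seed(0)  # arbitrary seed

# Hypothetical population: 10,000 items, 30% of which carry a trait.
population = [1] * 3000 + [0] * 7000
n = 10  # small relative to the population size
trials = 10_000

props_without = [sum(random.sample(population, n)) / n for _ in range(trials)]
props_with = [sum(random.choices(population, k=n)) / n for _ in range(trials)]

# Both schemes give mean ~0.3 and nearly the same spread:
print(statistics.mean(props_without), statistics.stdev(props_without))
print(statistics.mean(props_with), statistics.stdev(props_with))
```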


7.1 The Empirical Distribution

Since the true underlying distribution of the population is unknown, the empirical distribution is defined purely from the observed data points.

Concept and Definition

💡

Empirical Distribution

The empirical distribution is a discrete distribution that assigns equal probability to every observed data point.

If $X_1, X_2, \dots, X_n$ are the observed i.i.d. samples, the probability mass function $f(t)$ of the empirical distribution is: $$f(t) = \frac{1}{n}\,\#\{X_i = t\}$$ where $\#\{X_i = t\}$ is the count of how many times the value $t$ appeared in the sample.

Intuition: If you have five data points $\{1, 1, 2, 4, 6\}$, the empirical distribution states that $P(X=1) = 2/5$, and $P(X=t) = 1/5$ for $t \in \{2, 4, 6\}$.
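The definition translates directly into code; here is a minimal sketch using the five data points from the intuition above:

```python
from collections import Counter

data = [1, 1, 2, 4, 6]
n = len(data)

# f(t) = #{X_i = t} / n for every value t observed in the sample
empirical_pmf = {t: count / n for t, count in Counter(data).items()}
print(empirical_pmf)  # {1: 0.4, 2: 0.2, 4: 0.2, 6: 0.2}
```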

Key Properties

  1. Randomness: The empirical distribution is itself a random quantity, as every new set of sampled data will generate a different distribution.
  2. Inference: It is expected that as the sample size $(n)$ grows, the empirical distribution should converge to, and provide information about, the true underlying distribution.

7.2 Descriptive Statistics

Descriptive statistics are summary numerical measures derived from the empirical distribution to estimate the actual parameters of the underlying distribution.

7.2.1 Sample Mean ($\bar{X}$)

The sample mean is the familiar average of the observed data points. It serves as an estimator for the true population mean ($\mu$).

💡

Sample Mean

$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n}$$

  • Unbiased Estimator: $E[\bar{X}] = \mu$. On average, $\bar{X}$ accurately targets the unknown population mean.
  • Consistency: $SD[\bar{X}] = \frac{\sigma}{\sqrt{n}}$. As $n \to \infty$, the standard deviation shrinks to zero, so the sample mean converges to the true mean (the Law of Large Numbers); both facts are derived just below.
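Both properties follow in one line each from linearity of expectation and, for the variance, independence; this is the standard derivation, stated here for completeness:

$$E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{n\mu}{n} = \mu, \qquad \mathrm{Var}(\bar{X}) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{\sigma^2}{n} \;\Rightarrow\; SD[\bar{X}] = \frac{\sigma}{\sqrt{n}}$$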

7.2.2 Sample Variance ($S^2$)

The sample variance estimates the true population variance ($\sigma^2$).

💡

Sample Variance

$$S^2 = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2}{n - 1}$$

  • Note: Using $(n-1)$ rather than $n$ ensures that $S^2$ is an unbiased estimator ($E[S^2] = \sigma^2$); see the derivation just below.
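Why $(n-1)$ works: the standard argument (a textbook identity, sketched here for completeness) rests on the decomposition

$$\sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n}(X_i - \mu)^2 - n(\bar{X} - \mu)^2$$

Taking expectations, the first term contributes $n\sigma^2$ and the second $n \cdot \mathrm{Var}(\bar{X}) = \sigma^2$, giving $E\left[\sum_i (X_i - \bar{X})^2\right] = (n-1)\sigma^2$; dividing by $(n-1)$ yields $E[S^2] = \sigma^2$.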

7.2.3 Sample Proportion ($\hat{p}$) and Empirical CDF

When an event $A$ occurs in the population with probability $p = P(X \in A)$, the sample version of this is the sample proportion ($\hat{p}$).

💡

Sample Proportion

$$\hat{p} = \frac{\#\{X_i \in A\}}{n}$$

The sample proportion is an excellent estimator: $E[\hat{p}] = p$ and $\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n}$. (Both facts follow from noting that $\hat{p}$ is simply the sample mean of the indicator variables $\mathbf{1}\{X_i \in A\}$, which are Bernoulli($p$).)

The sample analog of the CDF is the Empirical Cumulative Distribution Function (ECDF): $$F_n(t) = \frac{\#\{X_i \leq t\}}{n}$$
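Both $\hat{p}$ and $F_n(t)$ are simple counting operations. A minimal Python sketch, reusing the machine-weight sample from the example below; the threshold $t = 11$ and the event “weight exceeds 11” are arbitrary illustrative choices:

```python
def ecdf(data, t):
    """Empirical CDF F_n(t): the fraction of observations at or below t."""
    return sum(1 for x in data if x <= t) / len(data)

sample = [10, 12, 9, 13]
print(ecdf(sample, 11))  # 0.5 -- two of the four weights are <= 11

# A sample proportion is the same count with a different membership test,
# here for the (hypothetical) event A = "weight exceeds 11":
p_hat = sum(1 for x in sample if x > 11) / len(sample)
print(p_hat)  # 0.5
```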


Example: Sample Mean and Variance

Q1

Machine Weights

Scenario: We sample $n=4$ items from a machine population and find their measured weights are $\{10, 12, 9, 13\}$. The true mean weight $\mu$ and variance $\sigma^2$ are unknown.

Question: What are the estimates for the true mean and variance?

📝 Detailed Solution
  1. Calculate Sample Mean ($\bar{X}$): $$\bar{X} = \frac{10 + 12 + 9 + 13}{4} = \frac{44}{4} = 11$$ The point estimate for the true mean $\mu$ is $\bar{X} = 11$.

  2. Calculate Sample Variance ($S^2$): First, find the squared deviations from the mean:

    • $(10-11)^2 = 1$
    • $(12-11)^2 = 1$
    • $(9-11)^2 = 4$
    • $(13-11)^2 = 4$

    $$S^2 = \frac{1 + 1 + 4 + 4}{4 - 1} = \frac{10}{3} \approx 3.33$$ The estimate for the true variance $\sigma^2$ is $S^2 \approx 3.33$.
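The hand computation is easy to check against Python's standard library, whose `statistics.variance` uses the same $n-1$ divisor defined in Section 7.2.2:

```python
import statistics

weights = [10, 12, 9, 13]

x_bar = statistics.mean(weights)   # 11
s2 = statistics.variance(weights)  # sample variance (n-1 divisor): 10/3

print(f"mean = {x_bar:g}, variance = {s2:.2f}")  # mean = 11, variance = 3.33
```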

Analogy: If the population distribution is a vast, unseen ocean, Sampling and Descriptive Statistics are like dropping a small cup (the sample) into the ocean and analyzing its contents. The sample mean and variance are your best unbiased guesses at the average depth and wave height of the entire ocean, knowing that if you use a big enough cup (large $n$), your guess will be highly accurate (consistency).