
Sampling and Descriptive Statistics


Chapter 7 of your statistics book, *Sampling and Descriptive Statistics*, introduces the fundamental transition from theoretical probability (what should happen) to practical statistics (what we observe and infer). It focuses on using data collected from a sample to estimate and describe unknown properties of the larger population distribution.

The core idea is that when a distribution is unknown, we can look at the empirical distribution derived solely from the sampled data, and use it to estimate key characteristics like the average and the variance.


The Statistical Context

Concept: Probability vs. Statistics

The book draws a sharp distinction: Probability studies experiments in which the model (the underlying distribution or laws) is fully known. Statistics tries to infer unknown aspects of that model from observed outcomes.

For instance, if we flip a coin 100 times, the results $(X_1, X_2, \dots, X_{100})$ are random variables drawn from a Bernoulli distribution with an unknown probability $p$. Statistics seeks to use the 60 heads observed in the sample $(\sum X_i = 60)$ to say something about the unknown, long-run probability $p$.
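To make the distinction concrete, here is a minimal Python sketch. The value `true_p = 0.6` and the seed are arbitrary assumptions: probability generates the flips from a known $p$, while the statistician sees only the outcomes and must estimate it.

```python
import random

random.seed(42)  # arbitrary seed so the sketch is reproducible

true_p = 0.6  # hypothetical: known to the simulation, hidden from the "statistician"
n = 100
flips = [1 if random.random() < true_p else 0 for _ in range(n)]  # Bernoulli(p) draws

heads = sum(flips)  # analogous to observing "60 heads"
p_hat = heads / n   # the statistic used to estimate the unknown p
print(f"{heads} heads in {n} flips -> p_hat = {p_hat:.2f}")
```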

Concept: The Sample

We typically model repeated measurements (like height, weight, or arsenic content) taken from a population as independent and identically distributed (i.i.d.) random variables $(X_1, X_2, \dots, X_n)$. While sampling is sometimes done “without replacement,” if the sample size $(n)$ is small compared to the population size, the results are nearly identical to i.i.d. sampling “with replacement”.
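A quick simulation of that claim (the population of 10,000 items and the 30% trait rate are hypothetical choices for illustration): when $n$ is tiny relative to the population, the two sampling schemes produce nearly identical sampling distributions for the sample proportion.

```python
import random
import statistics

random.seed(0)  # arbitrary seed

# Hypothetical population: 10,000 items, 30% of which carry a trait.
population = [1] * 3000 + [0] * 7000
n = 10  # small relative to the population size
trials = 10_000

props_without = [sum(random.sample(population, n)) / n for _ in range(trials)]
props_with = [sum(random.choices(population, k=n)) / n for _ in range(trials)]

# Both schemes give mean ~0.3 and nearly the same spread:
print(statistics.mean(props_without), statistics.stdev(props_without))
print(statistics.mean(props_with), statistics.stdev(props_with))
```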


7.1 The Empirical Distribution

Since the true underlying distribution of the population is unknown, the empirical distribution is defined purely from the observed data points.

Concept and Definition

💡

Empirical Distribution

The empirical distribution is a discrete distribution that assigns equal probability to every observed data point.

If $X_1, X_2, \dots, X_n$ are the observed i.i.d. samples, the probability mass function $f(t)$ of the empirical distribution is: $$f(t) = \frac{1}{n}\,\#\{X_i = t\}$$ where $\#\{X_i = t\}$ is the count of how many times the value $t$ appeared in the sample.

Intuition: If you have five data points $\{1, 1, 2, 4, 6\}$, the empirical distribution states that $P(X=1) = 2/5$, and $P(X=t) = 1/5$ for $t \in \{2, 4, 6\}$.
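The definition translates directly into code; here is a minimal sketch using the five data points from the intuition above:

```python
from collections import Counter

data = [1, 1, 2, 4, 6]
n = len(data)

# f(t) = #{X_i = t} / n for every value t observed in the sample
empirical_pmf = {t: count / n for t, count in Counter(data).items()}
print(empirical_pmf)  # {1: 0.4, 2: 0.2, 4: 0.2, 6: 0.2}
```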

Key Properties

  1. Randomness: The empirical distribution is itself a random quantity, as every new set of sampled data will generate a different distribution.
  2. Inference: It is expected that as the sample size $(n)$ grows, the empirical distribution should converge to, and provide information about, the true underlying distribution.

7.2 Descriptive Statistics

Descriptive statistics are summary numerical measures derived from the empirical distribution to estimate the actual parameters of the underlying distribution.

7.2.1 Sample Mean ($\bar{X}$)

The sample mean is the familiar average of the observed data points. It serves as an estimator for the true population mean ($\mu$).

💡

Sample Mean

$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n}$$

  • Unbiased Estimator: $E[\bar{X}] = \mu$. On average, $\bar{X}$ accurately targets the unknown population mean.
  • Consistency: $SD[\bar{X}] = \frac{\sigma}{\sqrt{n}}$. As $n \to \infty$, the standard deviation shrinks to zero, so the sample mean converges to the true mean (the Law of Large Numbers); both facts are derived just below.
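Both properties follow in one line each from linearity of expectation and, for the variance, independence; this is the standard derivation, stated here for completeness:

$$E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{n\mu}{n} = \mu, \qquad \mathrm{Var}(\bar{X}) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{\sigma^2}{n} \;\Rightarrow\; SD[\bar{X}] = \frac{\sigma}{\sqrt{n}}$$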

7.2.2 Sample Variance ($S^2$)

The sample variance estimates the true population variance ($\sigma^2$).

💡

Sample Variance

$$S^2 = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2}{n - 1}$$

  • Note: Using $(n-1)$ rather than $n$ ensures that $S^2$ is an unbiased estimator ($E[S^2] = \sigma^2$); see the derivation just below.
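Why $(n-1)$ works: the standard argument (a textbook identity, sketched here for completeness) rests on the decomposition

$$\sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n}(X_i - \mu)^2 - n(\bar{X} - \mu)^2$$

Taking expectations, the first term contributes $n\sigma^2$ and the second $n \cdot \mathrm{Var}(\bar{X}) = \sigma^2$, giving $E\left[\sum_i (X_i - \bar{X})^2\right] = (n-1)\sigma^2$; dividing by $(n-1)$ yields $E[S^2] = \sigma^2$.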

7.2.3 Sample Proportion ($\hat{p}$) and Empirical CDF

When an event $A$ occurs in the population with probability $p = P(X \in A)$, the sample version of this is the sample proportion ($\hat{p}$).

💡

Sample Proportion

$$\hat{p} = \frac{\#\{X_i \in A\}}{n}$$

The sample proportion is an excellent estimator: $E[\hat{p}] = p$ and $\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n}$. (Both facts follow from noting that $\hat{p}$ is simply the sample mean of the indicator variables $\mathbf{1}\{X_i \in A\}$, which are Bernoulli($p$).)

The sample analog of the CDF is the Empirical Cumulative Distribution Function (ECDF): $$F_n(t) = \frac{\#\{X_i \leq t\}}{n}$$
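Both $\hat{p}$ and $F_n(t)$ are simple counting operations. A minimal Python sketch, reusing the machine-weight sample from the example below; the threshold $t = 11$ and the event “weight exceeds 11” are arbitrary illustrative choices:

```python
def ecdf(data, t):
    """Empirical CDF F_n(t): the fraction of observations at or below t."""
    return sum(1 for x in data if x <= t) / len(data)

sample = [10, 12, 9, 13]
print(ecdf(sample, 11))  # 0.5 -- two of the four weights are <= 11

# A sample proportion is the same count with a different membership test,
# here for the (hypothetical) event A = "weight exceeds 11":
p_hat = sum(1 for x in sample if x > 11) / len(sample)
print(p_hat)  # 0.5
```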


Example: Sample Mean and Variance

Q1

Machine Weights

Scenario: We sample $n=4$ items from a machine population and find their measured weights are $\{10, 12, 9, 13\}$. The true mean weight $\mu$ and variance $\sigma^2$ are unknown.

Question: What are the estimates for the true mean and variance?

📝 Detailed Solution
  1. Calculate Sample Mean ($\bar{X}$): $$\bar{X} = \frac{10 + 12 + 9 + 13}{4} = \frac{44}{4} = 11$$ The point estimate for the true mean $\mu$ is $\bar{X} = 11$.

  2. Calculate Sample Variance ($S^2$): First, find the squared deviations from the mean:

    • $(10-11)^2 = 1$
    • $(12-11)^2 = 1$
    • $(9-11)^2 = 4$
    • $(13-11)^2 = 4$

    $$S^2 = \frac{1 + 1 + 4 + 4}{4 - 1} = \frac{10}{3} \approx 3.33$$ The estimate for the true variance $\sigma^2$ is $S^2 \approx 3.33$.
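The hand computation is easy to check against Python's standard library, whose `statistics.variance` uses the same $n-1$ divisor defined in Section 7.2.2:

```python
import statistics

weights = [10, 12, 9, 13]

x_bar = statistics.mean(weights)   # 11
s2 = statistics.variance(weights)  # sample variance (n-1 divisor): 10/3

print(f"mean = {x_bar:g}, variance = {s2:.2f}")  # mean = 11, variance = 3.33
```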

Analogy: If the population distribution is a vast, unseen ocean, Sampling and Descriptive Statistics are like dropping a small cup (the sample) into the ocean and analyzing its contents. The sample mean and variance are your best unbiased guesses at the average depth and wave height of the entire ocean, knowing that if you use a big enough cup (large $n$), your guess will be highly accurate (consistency).