Sampling and Descriptive Statistics
Chapter 7 of your statistics book, Sampling and Descriptive Statistics, introduces the fundamental transition from theoretical probability (what should happen) to practical statistics (what we observe and infer). It focuses on using data collected from a sample to estimate and describe unknown properties of the larger population distribution.
The core idea is that when a distribution is unknown, we can look at the empirical distribution derived solely from the sampled data, and use it to estimate key characteristics like the average and the variance.
7.1 The Statistical Context
Concept: Probability vs. Statistics
The book draws a distinction: Probability involves studying experiments when the model (the underlying distribution or laws) is fully known. Statistics involves trying to infer unknown aspects of that model based on observed outcomes.
For instance, if we flip a coin 100 times, the results ($X_1, X_2, \ldots, X_{100}$) are random variables drawn from a Bernoulli distribution with an unknown probability $p$ of heads. Statistics seeks to use the 60 heads observed in the sample (a sample proportion of $60/100 = 0.6$) to say something about the unknown, long-run probability $p$.
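The coin-flip setup can be sketched in a few lines of Python. This is only an illustration: the value `p_true = 0.6`, the seed, and the helper name `flip_coins` are assumptions made here, and in a real study `p_true` is exactly the quantity we would not know.

```python
import random

def flip_coins(n, p_true, seed=0):
    """Simulate n Bernoulli(p_true) flips; 1 means heads, 0 means tails."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_true else 0 for _ in range(n)]

# p_true is a hypothetical value chosen for the simulation only.
flips = flip_coins(100, p_true=0.6)

# The statistician sees only `flips` and estimates p by the fraction of heads.
p_hat = sum(flips) / len(flips)
print(f"heads: {sum(flips)}, estimate of p: {p_hat:.2f}")
```

Changing the seed gives a different sample and therefore a different estimate, which is the sense in which the estimate is itself random.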
Concept: The Sample
We typically model repeated measurements (like height, weight, or arsenic content) taken from a population as independent and identically distributed (i.i.d.) random variables ($X_1, X_2, \ldots, X_n$). While sampling is sometimes done "without replacement," if the sample size ($n$) is small compared to the population size, the results closely approximate i.i.d. sampling "with replacement".
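One way to see why the two sampling schemes nearly coincide: a with-replacement sample that happens to contain no repeated individual is distributed exactly like a without-replacement sample. The sketch below computes the probability of drawing no repeats. The function name and the population sizes are illustrative choices, not taken from the book.

```python
def prob_no_repeats(population_size, sample_size):
    """Probability that sample_size draws with replacement from a
    population of population_size individuals are all distinct."""
    prob = 1.0
    for k in range(sample_size):
        # The (k+1)-th draw must avoid the k individuals already drawn.
        prob *= (population_size - k) / population_size
    return prob

# Small sample from a large population: repeats are very unlikely.
print(prob_no_repeats(1_000_000, 10))

# Sample that is not small relative to the population: repeats are common.
print(prob_no_repeats(365, 23))
```

With 10 draws from a million individuals a repeat almost never occurs, so the two schemes are practically indistinguishable. With 23 draws from 365 a repeat occurs more often than not, and the approximation is much weaker.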
7.1 The Empirical Distribution
Since the true underlying distribution of the population is unknown, the empirical distribution is defined purely from the observed data points.
Concept and Definition
Empirical Distribution
The empirical distribution is a discrete distribution that assigns probability $1/n$ to each of the $n$ observed data points, so a value observed more than once accumulates mass accordingly.
If $X_1, X_2, \ldots, X_n$ are the observed i.i.d. samples, the probability mass function (p.m.f.) of the empirical distribution is:
$$f_n(t) = \frac{\#\{i : X_i = t\}}{n}$$
where $\#\{i : X_i = t\}$ is the count of how many times the value $t$ appeared in the sample.
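Intuition: If you have five data points, the empirical distribution gives each observation probability $1/5$. A value that appears $k$ times among the five therefore has probability $k/5$, and any value that was never observed has probability $0$.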
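A minimal sketch of the empirical p.m.f. in Python follows. The five data values are hypothetical, chosen here so that one value repeats, and `empirical_pmf` is an illustrative helper name.

```python
from collections import Counter
from fractions import Fraction

def empirical_pmf(sample):
    """Map each observed value to (number of occurrences) / (sample size)."""
    n = len(sample)
    counts = Counter(sample)
    return {value: Fraction(count, n) for value, count in counts.items()}

data = [2, 3, 3, 5, 7]  # hypothetical observations
pmf = empirical_pmf(data)

for value in sorted(pmf):
    print(value, pmf[value])
```

The value 3 appears twice, so it receives probability 2/5, while each of the other observed values receives 1/5. The probabilities sum to 1, as they must for any p.m.f.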
Key Properties
- Randomness: The empirical distribution is itself a random quantity, as every new set of sampled data will generate a different distribution.
- Inference: It is expected that as the sample size ($n$) grows, the empirical distribution should converge to and provide information about the true, underlying distribution.
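Both properties can be observed in a small simulation. The sketch below assumes a fair six-sided die as the "unknown" distribution (an illustrative choice) and measures the largest gap between the empirical p.m.f. and the true probability $1/6$.

```python
import random
from collections import Counter

def max_pmf_error(n, seed=0):
    """Roll a fair die n times and return the largest absolute difference
    between the empirical p.m.f. and the true value 1/6 over all faces."""
    rng = random.Random(seed)
    rolls = [rng.randint(1, 6) for _ in range(n)]
    counts = Counter(rolls)
    return max(abs(counts[face] / n - 1 / 6) for face in range(1, 7))

for n in (10, 1_000, 100_000):
    print(n, round(max_pmf_error(n), 4))
```

Different seeds give different errors for the same `n` (randomness), and the error typically shrinks as `n` grows (convergence), although it need not decrease at every single step.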
7.2 Descriptive Statistics
Descriptive statistics are summary numerical measures derived from the empirical distribution to estimate the actual parameters of the underlying distribution.
7.2.1 Sample Mean ($\bar{X}$)
The sample mean is the familiar average of the observed data points. It serves as an estimator for the true population mean ($\mu$).
Sample Mean
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
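- Unbiased Estimator: $E[\bar{X}] = \mu$. On average, the sample mean $\bar{X}$ accurately targets the unknown population mean.
- Consistency: $\mathrm{Var}(\bar{X}) = \sigma^2 / n$. As $n \to \infty$, the variance shrinks to zero and the sample mean converges to the true mean.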
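These two properties can be checked exactly for a small case. The sketch below assumes a fair six-sided die as the population (an illustrative choice, with mean $7/2$ and variance $35/12$), enumerates every possible sample of size `n`, and computes the mean and variance of the sample mean with exact fractions.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)  # fair die: population mean 7/2, variance 35/12

def sample_mean_moments(n):
    """Exact expectation and variance of the sample mean of n die rolls,
    found by enumerating all 6**n equally likely samples."""
    means = [Fraction(sum(s), n) for s in product(FACES, repeat=n)]
    expected = sum(means) / len(means)
    variance = sum((m - expected) ** 2 for m in means) / len(means)
    return expected, variance

for n in (1, 2, 4):
    expected, variance = sample_mean_moments(n)
    print(n, expected, variance)
# prints:
# 1 7/2 35/12
# 2 7/2 35/24
# 4 7/2 35/48
```

The expectation stays at $7/2$ for every `n` (unbiasedness), while the variance is the population variance divided by `n`, which is what drives the convergence.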
7.2.2 Sample Variance ($S^2$)
The sample variance estimates the true population variance ($\sigma^2$).
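Sample Variance
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$
- Note: Using $n - 1$ rather than $n$ in the denominator ensures that $S^2$ is an unbiased estimator ($E[S^2] = \sigma^2$).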
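The effect of the denominator can be verified exactly on a small case. As an illustration (not an example from the book), the sketch below takes a fair die as the population, whose variance is $35/12$, enumerates every sample of size 3, and averages the variance estimate under both denominators.

```python
from fractions import Fraction
from itertools import product

def variance_estimate(sample, denominator):
    """Sum of squared deviations from the sample mean, over denominator."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample) / denominator

def expected_estimate(n, use_n_minus_1):
    """Exact average of the estimate over all 6**n samples of die rolls."""
    samples = list(product(range(1, 7), repeat=n))
    denominator = n - 1 if use_n_minus_1 else n
    total = sum(variance_estimate(s, denominator) for s in samples)
    return total / len(samples)

print(expected_estimate(3, use_n_minus_1=True))   # 35/12
print(expected_estimate(3, use_n_minus_1=False))  # 35/18
```

Dividing by $n - 1$ recovers the population variance $35/12$ on average. Dividing by $n$ gives $35/18$, which is too small by the factor $(n-1)/n = 2/3$.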
7.2.3 Sample Proportion ($\hat{p}$) and Empirical CDF
When an event $A$ occurs in the population with probability $p$, the sample version of this probability is the sample proportion ($\hat{p}$).
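Sample Proportion
$$\hat{p} = \frac{\#\{i : X_i \in A\}}{n}$$
The sample proportion is an unbiased estimator whose variance shrinks as the sample grows: $E[\hat{p}] = p$ and $\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n}$.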
The sample analog of the CDF is the Empirical Cumulative Distribution Function (ECDF):
$$\hat{F}_n(t) = \frac{\#\{i : X_i \le t\}}{n}$$
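Both quantities are simple counts divided by the sample size, as the following sketch shows. The data values, the event "greater than 4", and the helper names are hypothetical choices made for illustration.

```python
from bisect import bisect_right

def sample_proportion(sample, event):
    """Fraction of observations x for which event(x) is True."""
    return sum(1 for x in sample if event(x)) / len(sample)

def make_ecdf(sample):
    """Return a function t -> fraction of observations that are <= t."""
    ordered = sorted(sample)
    n = len(ordered)

    def ecdf(t):
        # bisect_right counts how many sorted values are <= t.
        return bisect_right(ordered, t) / n

    return ecdf

data = [2, 3, 3, 5, 7]  # hypothetical observations
p_hat = sample_proportion(data, lambda x: x > 4)
F = make_ecdf(data)

print(p_hat)  # 0.4
print(F(3))   # 0.6
```

The ECDF at a point `t` is just the sample proportion for the event "the observation is at most `t`", so it inherits the properties of the sample proportion.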
Example: Sample Mean and Variance
Machine Weights
Scenario: We sample $n$ items from a machine population and record their measured weights $X_1, X_2, \ldots, X_n$. The true mean weight $\mu$ and variance $\sigma^2$ are unknown.
Question: What are the estimates for the true mean and variance?
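- Calculate Sample Mean ($\bar{X}$): Add the observed weights and divide by the sample size, $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. This value is the point estimate for the true mean $\mu$.
- Calculate Sample Variance ($S^2$): First, find the squared deviations from the mean, $(X_i - \bar{X})^2$. Then sum them and divide by $n - 1$: $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$. This value is the estimate for the true variance $\sigma^2$.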
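The two steps of the solution can be traced in Python. The five weights below are hypothetical stand-ins, not the values from the original example, and are chosen so the arithmetic is easy to follow by hand. The result is cross-checked against the standard library's `statistics.mean` and `statistics.variance`, the latter of which divides by $n - 1$.

```python
import statistics

weights = [10, 12, 9, 11, 13]  # hypothetical measured weights
n = len(weights)

# Step 1: sample mean.
sample_mean = sum(weights) / n

# Step 2: squared deviations from the mean, then divide their sum by n - 1.
squared_deviations = [(w - sample_mean) ** 2 for w in weights]
sample_variance = sum(squared_deviations) / (n - 1)

print(sample_mean)         # 11.0
print(squared_deviations)  # [1.0, 1.0, 4.0, 0.0, 4.0]
print(sample_variance)     # 2.5

# Cross-check with the standard library.
assert sample_mean == statistics.mean(weights)
assert sample_variance == statistics.variance(weights)
```

For these stand-in values the point estimates would be 11 for the mean and 2.5 for the variance.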
Analogy: If the population distribution is a vast, unseen ocean, Sampling and Descriptive Statistics are like dropping a small cup (the sample) into the ocean and analyzing its contents. The sample mean and variance are your best unbiased guesses as to the average depth and wave height of the entire ocean, knowing that if you use a big enough cup (large $n$), your guess will be highly accurate (consistency).