Describing Categorical Data
Chapter 3, Describing Categorical Data: Frequency Distribution, focuses on the tools and methods used to organise, visualise, and summarise qualitative information. Below is a detailed breakdown of the topics, with examples and practice exercises.
1. Frequency and Relative Frequency
Frequency Distribution
This is a list of distinct values (categories) and their frequencies (the count of how many times each occurs).
Relative Frequency
This is the ratio of the frequency of a category to the total number of observations. It is expressed as a value between 0 and 1 and is crucial for comparing datasets of different sizes.
Example Calculation
Suppose you have the following data: A, A, B, C, A, D, A, B, D, C (Total = 10).
| Category | Frequency | Relative Frequency |
|---|---|---|
| A | 4 | 4/10 = 0.4 |
| B | 2 | 2/10 = 0.2 |
| C | 2 | 2/10 = 0.2 |
| D | 2 | 2/10 = 0.2 |
| Total | 10 | 1.0 |
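The table above can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the function name `frequency_table` is an illustrative choice, not something defined in the chapter.

```python
from collections import Counter

# The ten observations from the example calculation.
data = ["A", "A", "B", "C", "A", "D", "A", "B", "D", "C"]


def frequency_table(observations):
    """Map each category to (frequency, relative frequency)."""
    counts = Counter(observations)
    total = len(observations)
    # Relative frequency = frequency of the category / total observations.
    return {cat: (n, n / total) for cat, n in sorted(counts.items())}


table = frequency_table(data)
for category, (freq, rel) in table.items():
    print(f"{category}: frequency={freq}, relative frequency={rel:.1f}")
```

The relative frequencies always sum to 1, which is a quick check that no observation was missed.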
2. Graphical Displays
Pie Chart
A circle divided into slices proportional to the relative frequencies. It is best used for comparing parts of a whole.
Figure 1: Pie Chart representing relative frequencies of categories A, B, C, and D.
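Since a full circle is 360 degrees, each slice's angle is its relative frequency multiplied by 360. The sketch below applies this to the example data A, A, B, C, A, D, A, B, D, C; the helper name `slice_angles` is hypothetical and only illustrates the arithmetic.

```python
from collections import Counter

data = ["A", "A", "B", "C", "A", "D", "A", "B", "D", "C"]


def slice_angles(observations):
    """Return the pie-slice angle in degrees for each category."""
    counts = Counter(observations)
    total = len(observations)
    # Slice angle = relative frequency * 360 degrees.
    return {cat: (n / total) * 360 for cat, n in counts.items()}


angles = slice_angles(data)
for category, angle in sorted(angles.items()):
    print(f"{category}: {angle:.0f} degrees")
```

Category A, with relative frequency 0.4, takes up 144 degrees of the circle; B, C, and D take 72 degrees each.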
Bar Chart
Displays categories on the horizontal axis and frequencies (or percentages) on the vertical axis. The bars should not touch, and the height represents the count.
Pareto Chart
A specific type of bar chart where the categories are sorted by frequency from highest to lowest. Because the most frequent categories appear first, it is used to identify the most significant issues.
Figure 2: Pareto Chart showing sorted frequency of defects.
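The ordering behind a Pareto chart can be sketched in a few lines. The defect labels below are made-up illustration data, not values taken from Figure 2.

```python
from collections import Counter

# Hypothetical defect log, for illustration only.
defects = ["scratch", "dent", "scratch", "crack",
           "scratch", "dent", "scratch", "misprint"]

# most_common() returns (category, frequency) pairs sorted from the
# highest frequency to the lowest, which is the Pareto ordering.
pareto_order = Counter(defects).most_common()

for category, freq in pareto_order:
    print(f"{category:<10} {'#' * freq} ({freq})")
```

Plotting the bars in this order puts the most frequent defect on the far left.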
Ordinal Data Rule
If categories have a natural rank (e.g., S, M, L), the bar chart must preserve that ordering rather than sorting by frequency.
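For ordinal data, the bars are arranged by rank rather than by count. A minimal sketch, assuming a hypothetical list of shirt sizes and a fixed rank order `SIZE_ORDER`:

```python
from collections import Counter

# Natural rank of the categories, smallest to largest.
SIZE_ORDER = ["S", "M", "L"]

# Hypothetical observations, for illustration only.
sizes = ["M", "L", "M", "S", "M", "L"]

counts = Counter(sizes)

# Follow the natural rank, not the frequency, when laying out the bars.
ordered_bars = [(size, counts[size]) for size in SIZE_ORDER]
print(ordered_bars)
```

Sorting by frequency would place M first and S last, which would hide the natural S, M, L progression.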
3. The Area Principle
The Area Principle
The area occupied by a part of a graph should correspond exactly to the amount of data it represents. Violating this is a common way to mislead with statistics.
Misleading Graphs
- Decorated Graphs: Using 3D images or shapes (like bottles or boxes) often distorts the visual area compared to the data.
- Truncated Graphs: When the vertical axis (baseline) does not start at zero, it can exaggerate small differences between categories.
- Manipulated Y-Axis: Expanding or compressing the scale can make changes look more or less significant than they are.
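The effect of a truncated baseline can be quantified by comparing how tall two bars appear relative to each other. The numbers below are hypothetical, and `apparent_ratio` is an illustrative helper rather than a standard function.

```python
def apparent_ratio(value_a, value_b, baseline=0):
    """Ratio of the drawn bar heights when the axis starts at `baseline`."""
    return (value_a - baseline) / (value_b - baseline)


# Hypothetical frequencies for two categories.
a, b = 52, 50

honest = apparent_ratio(a, b, baseline=0)      # axis starts at zero
truncated = apparent_ratio(a, b, baseline=48)  # axis starts at 48

print(f"Baseline 0:  bar A looks {honest:.2f} times as tall as bar B")
print(f"Baseline 48: bar A looks {truncated:.2f} times as tall as bar B")
```

With the axis starting at zero, A is drawn 1.04 times as tall as B, matching the data. Starting the axis at 48 makes A look twice as tall, although the real difference is only 4%.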
4. Summarising Categorical Data
Mode
The most common category (the one with the highest frequency).
- Bimodal/Multimodal: If two categories tie for the highest frequency, the data is bimodal; if more than two tie, it is multimodal.
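A mode calculation that reports ties could look like the sketch below. The function name `modes` and the second data set are illustrative assumptions.

```python
from collections import Counter


def modes(observations):
    """Return every category that reaches the highest frequency."""
    counts = Counter(observations)
    top = max(counts.values())
    return [cat for cat, n in counts.items() if n == top]


# Example data from the frequency table: A occurs 4 times.
single = modes(["A", "A", "B", "C", "A", "D", "A", "B", "D", "C"])

# Hypothetical data in which M and L tie with 2 occurrences each.
tied = modes(["S", "M", "M", "L", "L"])

print(single)
print(tied)
```

Returning a list rather than a single value makes a tie visible instead of silently picking one of the tied categories.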
Median
Only applicable to ordinal data. It is the category of the middle observation after the data has been sorted by rank.
- Calculation: If there are 15 observations, the median is the 8th value in the sorted list.
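The median of ordinal data can be found by sorting the observations by rank and picking the middle one. The sketch below uses 15 hypothetical shirt sizes and handles only an odd number of observations, since that is the case described above; `ordinal_median` is an illustrative name.

```python
def ordinal_median(observations, order):
    """Median category of ordinal data with an odd number of observations."""
    n = len(observations)
    if n % 2 == 0:
        # Assumption: the even case is not covered by the text above.
        raise ValueError("this sketch handles an odd number of observations")
    rank = {category: position for position, category in enumerate(order)}
    ranked = sorted(observations, key=lambda category: rank[category])
    # For n = 15, index 7 is the 8th value in the sorted list.
    return ranked[n // 2]


# Hypothetical data: 4 S, 6 M, 5 L (15 observations).
sizes = ["M", "S", "L", "M", "L", "S", "M", "M",
         "L", "S", "M", "L", "M", "S", "L"]

median_size = ordinal_median(sizes, order=["S", "M", "L"])
print(median_size)
```

After sorting, positions 1 to 4 are S and positions 5 to 10 are M, so the 8th value is M.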
Practice Session
Calculating Frequencies
A total of 2000 cases of Covid-19 were registered in 5 districts. Given the proportions below, find the relative frequency for Nagpur and the total cases for Pune.
| District | Relative Frequency |
|---|---|
| Mumbai | 0.35 |
| Pune | 0.20 |
| Nagpur | |
| Thane | 0.25 |
| Nashik | 0.08 |
Solution
- Finding Nagpur: The sum of all relative frequencies must equal 1.0, so Nagpur = 1.0 - (0.35 + 0.20 + 0.25 + 0.08) = 1.0 - 0.88 = 0.12.
- Cases in Pune: Multiply the total cases by the relative frequency for Pune: 2000 × 0.20 = 400 cases.
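The two steps can be checked with a few lines of Python. This is a sketch of the arithmetic only; the variable names are illustrative.

```python
TOTAL_CASES = 2000

# Relative frequencies given in the exercise table.
known = {"Mumbai": 0.35, "Pune": 0.20, "Thane": 0.25, "Nashik": 0.08}

# All relative frequencies sum to 1.0, so Nagpur takes the remainder.
# round() removes floating-point noise from the subtraction.
nagpur = round(1.0 - sum(known.values()), 2)

# Frequency = total observations * relative frequency.
pune_cases = round(TOTAL_CASES * known["Pune"])

print(f"Nagpur relative frequency: {nagpur}")
print(f"Cases in Pune: {pune_cases}")
```

The same multiplication gives the case count for any other district in the table.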
Identifying Concepts
Which descriptive measure can be used for both nominal and ordinal data?
(a) Mean (b) Median (c) Mode
Solution
(c) Mode.
The mode is the only measure listed that applies to both nominal (labels without order) and ordinal (ranked labels) data. The median requires an order, and the mean requires numerical values.
Fruit Salad Analogy
Think of describing categorical data like sorting a mixed bag of fruit.
- The Frequency Distribution is your final count: 10 apples, 5 bananas, 2 oranges.
- The Pie Chart shows you what portion of your bag is made up of apples.
- The Mode is simply pointing out that apples are the most common fruit you found.
- But be careful: if you draw a picture of the fruit and make a tiny orange look as big as a giant apple just to be artistic, you have violated the Area Principle and misled your audience!