Describing Categorical Data

Chapter 3, Describing Categorical Data: Frequency Distribution, covers the tools and methods used to organise, visualise, and summarise qualitative information. Below is a detailed breakdown of the topics, with examples and practice exercises.

1. Frequency and Relative Frequency

💡

Frequency Distribution

This is a list of distinct values (categories) and their frequencies (the count of how many times each occurs).

💡

Relative Frequency

This is the ratio of the frequency of a category to the total number of observations. It is expressed as a value between 0 and 1 and is crucial for comparing datasets of different sizes.

Example Calculation

Suppose you have the following data: A, A, B, C, A, D, A, B, D, C (Total = 10).

Category	Frequency	Relative Frequency
A	4	$4/10 = 0.4$
B	2	$2/10 = 0.2$
C	2	$2/10 = 0.2$
D	2	$2/10 = 0.2$
Total	10	1.0
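The table above can be reproduced in a few lines. This is a minimal sketch using Python's standard library; the function name `frequency_table` is an illustrative choice, not something defined in the chapter.

```python
from collections import Counter

# The ten observations from the example above.
data = ["A", "A", "B", "C", "A", "D", "A", "B", "D", "C"]

def frequency_table(observations):
    """Return {category: (frequency, relative_frequency)}, sorted by category."""
    counts = Counter(observations)
    total = len(observations)
    return {cat: (n, n / total) for cat, n in sorted(counts.items())}

table = frequency_table(data)
for cat, (freq, rel) in table.items():
    print(f"{cat}\t{freq}\t{rel:.1f}")
```

The relative frequencies always sum to 1 (up to floating-point rounding), which is a quick check that no observation was missed.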

2. Graphical Displays

💡

Pie Chart

A circle divided into slices whose areas are proportional to the relative frequencies. It is best used for comparing parts of a whole.

Figure 1: Pie Chart representing relative frequencies of categories A, B, C, and D.

💡

Bar Chart

Displays categories on the horizontal axis and frequencies (or percentages) on the vertical axis. The bars should not touch, and the height of each bar represents the frequency (or percentage) of its category.

💡

Pareto Chart

A specific type of bar chart in which the categories are sorted by frequency from highest to lowest. It is used to identify the most significant issues, because the most frequent categories appear first.

Figure 2: Pareto Chart showing sorted frequency of defects.
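The sorting step behind a Pareto chart can be sketched as follows. The defect names and counts are hypothetical, chosen only for illustration; they are not the data behind Figure 2.

```python
# Hypothetical defect counts (illustrative values only).
defects = {"scratch": 12, "dent": 30, "misalignment": 7, "discolouration": 18}

def pareto_order(counts):
    """Return (category, frequency) pairs sorted from highest to lowest frequency."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

for category, freq in pareto_order(defects):
    print(f"{category}: {freq}")
```

Plotting the bars in this order gives the Pareto layout; the plotting itself is left out to keep the sketch free of third-party libraries.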

💡

Ordinal Data Rule

If categories have a natural rank (e.g., S, M, L), the bar chart must preserve that ordering rather than sorting by frequency.
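For ordinal data the bar order comes from the ranking, not from the counts. A minimal sketch, assuming a hypothetical set of size counts and an explicit rank list `size_order`:

```python
# Natural rank of the categories, smallest to largest.
size_order = ["S", "M", "L"]

# Hypothetical counts, deliberately stored out of rank order.
size_counts = {"L": 5, "S": 9, "M": 14}

def in_rank_order(counts, order):
    """Return (category, frequency) pairs following the given rank order."""
    return [(cat, counts.get(cat, 0)) for cat in order]

bars = in_rank_order(size_counts, size_order)
print(bars)
```

Sorting these bars by frequency would put M first and break the S, M, L progression that readers expect.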

3. The Area Principle

💡

The Area Principle

The area occupied by a part of a graph should correspond exactly to the amount of data it represents. Violating this is a common way to mislead with statistics.

Misleading Graphs

Decorated Graphs: Using 3D images or shapes (like bottles or boxes) often distorts the visual area compared to the data.
Truncated Graphs: A vertical axis (baseline) that does not start at zero can exaggerate small differences between categories.
Manipulated Y-Axis: Expanding or compressing the scale can make changes look more or less significant than they are.
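The effect of a truncated baseline can be quantified by comparing drawn bar heights. The sketch below uses two hypothetical frequencies (50 and 55) and a hypothetical truncated baseline of 45; the function name is illustrative.

```python
def drawn_height_ratio(smaller, larger, baseline=0):
    """Ratio of the drawn bar heights when the vertical axis starts at `baseline`."""
    if baseline >= smaller:
        raise ValueError("baseline must be below both values")
    return (larger - baseline) / (smaller - baseline)

# Hypothetical frequencies for two categories.
honest = drawn_height_ratio(50, 55, baseline=0)      # axis starts at zero
truncated = drawn_height_ratio(50, 55, baseline=45)  # axis starts at 45

print(f"axis from 0:  larger bar is {honest:.2f}x as tall")
print(f"axis from 45: larger bar is {truncated:.2f}x as tall")
```

With the axis starting at zero the larger bar is 10% taller, matching the data. With the axis starting at 45 it is drawn twice as tall, even though the underlying difference is unchanged.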

4. Summarising Categorical Data

💡

Mode

The most common category (the one with the highest frequency).

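Bimodal/Multimodal: If two or more categories tie for the highest frequency, each of them is a mode. The data are bimodal when there are two modes and multimodal when there are more than two.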
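Finding the mode, including ties, can be sketched like this. The helper name `modes` is an illustrative choice; the first dataset is the one from the frequency example earlier in the chapter, and the second is a made-up tie.

```python
from collections import Counter

def modes(observations):
    """Return every category that attains the highest frequency, sorted by name."""
    counts = Counter(observations)
    highest = max(counts.values())
    return sorted(cat for cat, n in counts.items() if n == highest)

data = ["A", "A", "B", "C", "A", "D", "A", "B", "D", "C"]
print(modes(data))                       # one mode

tied = ["x", "y", "x", "y", "z"]         # hypothetical data with a tie
print(modes(tied))                       # two modes, so the data are bimodal
```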

💡

Median

Only applicable to ordinal data. It is the category of the middle observation after the data has been sorted by rank.

Calculation: If there are 15 observations, the median is the 8th value in the sorted list, since $(15 + 1)/2 = 8$.
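The 15-observation case can be sketched as below. The size data are hypothetical, and this sketch only handles an odd number of observations, where the middle position $(n + 1)/2$ is unambiguous.

```python
def ordinal_median(observations, order):
    """Return the middle category after sorting the observations by rank."""
    rank = {cat: i for i, cat in enumerate(order)}
    ranked = sorted(observations, key=lambda cat: rank[cat])
    n = len(ranked)
    if n % 2 == 0:
        raise ValueError("this sketch only handles an odd number of observations")
    middle_position = (n + 1) // 2          # 1-based position, e.g. 8 when n = 15
    return ranked[middle_position - 1]      # convert to a 0-based index

# Hypothetical sample of 15 shirt sizes, stored unsorted.
sizes = ["L"] * 5 + ["S"] * 4 + ["M"] * 6
median_size = ordinal_median(sizes, order=["S", "M", "L"])
print(median_size)
```

After sorting, positions 1 to 4 are S, 5 to 10 are M, and 11 to 15 are L, so the 8th value is M.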

Practice Session

Calculating Frequencies

A total of 2000 cases of Covid-19 were registered in 5 districts. Given the proportions below, find the relative frequency for Nagpur and the total cases for Pune.

District	Relative Frequency
Mumbai	0.35
Pune	0.20
Nagpur	$x$
Thane	0.25
Nashik	0.08

Solution

Finding Nagpur ($x$): The sum of all relative frequencies must equal 1.0.
$0.35 + 0.20 + x + 0.25 + 0.08 = 1.0$
$0.88 + x = 1.0 \implies \mathbf{x = 0.12}$
Cases in Pune: Multiply the total cases by the relative frequency for Pune.
$2000 \times 0.20 = \mathbf{400 \text{ cases}}$
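The same two steps can be checked numerically. This is a small sketch using the values from the exercise; the variable names are illustrative, and rounding is used to absorb floating-point error.

```python
# Known relative frequencies from the exercise (Nagpur is the unknown x).
rel_freq = {"Mumbai": 0.35, "Pune": 0.20, "Thane": 0.25, "Nashik": 0.08}
total_cases = 2000

# Relative frequencies sum to 1, so x is whatever is left over.
nagpur = round(1.0 - sum(rel_freq.values()), 2)

# Frequency = total number of observations * relative frequency.
pune_cases = round(total_cases * rel_freq["Pune"])

print(f"Nagpur relative frequency: {nagpur}")
print(f"Cases in Pune: {pune_cases}")
```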

Identifying Concepts

Which descriptive measure can be used for both nominal and ordinal data?

(a) Mean (b) Median (c) Mode

Solution

(c) Mode.

The mode is the only measure listed that applies to both nominal (labels without order) and ordinal (ranked labels) data. The median requires an order, and the mean requires numerical values.

💡

Fruit Salad Analogy

Think of describing categorical data like sorting a mixed bag of fruit.

The Frequency Distribution is your final count: 10 apples, 5 bananas, 2 oranges.
The Pie Chart shows you what portion of your bag is made up of apples.
The Mode is simply pointing out that apples are the most common fruit you found.
But be careful: if you draw a picture of the fruit and make a tiny orange look as big as a giant apple just to be artistic, you have violated the Area Principle and misled your audience!
