How to Compare Data Sets
Common graphical displays (e.g., dotplots, boxplots, stemplots, bar charts)
can be effective tools for comparing data from two or more data sets.
Four Ways to Describe Data Sets
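When you compare two or more data sets, focus on four features:
- Center. Graphically, the center of a distribution is the point where
about half of the observations lie on either side.
- Spread. The spread of a distribution refers to the variability of the
data. Observations that cover a wide range have a larger spread than
observations clustered around a single value.
- Shape. The shape of a distribution is described by its symmetry,
skewness, and number of peaks.
- Unusual features. Unusual features refer to gaps (areas of the
distribution where there are no observations) and outliers.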
The remainder of this lesson shows how to use
various graphs to compare data sets in terms of center, spread, shape, and unusual
features. (This is a skill that students are expected to master for the
Advanced Placement Statistics Exam.)
Dotplots
When dotplots are used to compare data sets, they are positioned one above
the other, using the same scale of measurement, as shown below.
The dotplots show pet ownership in homes
on two city blocks.
Pet ownership is a little lower in block A. In block A,
most households have zero or one pet; in block B, most
households have two or more pets. In block A, pet ownership is
skewed right; in block B, it is roughly bell-shaped. In block
B, pet ownership ranges from 0 to 6 pets per household versus
0 to 4 pets in block A; so there is more variability in the
block B distribution.
There are no outliers or gaps in either data set.
Back-to-Back Stemplots
Back-to-back stemplots are another graphic option for comparing data from
two groups. The center of a back-to-back stemplot consists of a column of
stems, with a vertical line on each side. Leaves representing one data set
extend from the right, and leaves representing the other data set extend
from the left.
         Boys | Stem | Girls
            7 |  0   |
            1 |  1   | 1
        1 4 6 |  2   | 2 6 8
        4 5 8 |  3   | 3 4 4 6 6 8 9
  1 2 2 2 8 9 |  4   | 3 4 6
      3 4 7 9 |  5   | 4
        2 5 8 |  6   |
          1 3 |  7   |
Key: stem 4 with leaf 2 represents $42.
The back-to-back stemplot above shows the amount of cash (in dollars)
carried by a random sample of teenage boys and girls. The boys carried more
cash than the girls: a median of $42 for the boys versus $36 for the girls.
Both distributions were roughly bell-shaped, although there was more
variation among the boys. Finally, there were no gaps or outliers in either
group.
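The comparison above can be checked numerically. The following Python sketch is an illustration, not part of the original lesson. It expands the stemplot into raw values and computes each group's median and range. It assumes that stems are tens of dollars and leaves are single dollars, that the boys' leaves occupy stems 0 through 7, and that the girls' leaves occupy stems 1 through 5 (the reading that reproduces the stated medians). The helper name `expand` is hypothetical.

```python
from statistics import median

def expand(stemplot):
    """Turn {stem: [leaves]} into a sorted list of values (stem = tens digit)."""
    return sorted(10 * stem + leaf
                  for stem, leaves in stemplot.items()
                  for leaf in leaves)

# Assumed reading of the stemplot: boys on stems 0-7, girls on stems 1-5.
boys = expand({0: [7], 1: [1], 2: [1, 4, 6], 3: [4, 5, 8],
               4: [1, 2, 2, 2, 8, 9], 5: [3, 4, 7, 9],
               6: [2, 5, 8], 7: [1, 3]})
girls = expand({1: [1], 2: [2, 6, 8], 3: [3, 4, 4, 6, 6, 8, 9],
                4: [3, 4, 6], 5: [4]})

# Center (median) and spread (range) for each group.
for name, data in (("boys", boys), ("girls", girls)):
    print(name, "median:", median(data), "range:", max(data) - min(data))
```

Under these assumptions the medians come out to 42 and 36, matching the text, and the boys' range (66) exceeds the girls' range (43), which is consistent with the claim of greater variation among the boys.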
Parallel Boxplots
With parallel boxplots (also known as side-by-side boxplots), data from two
groups are displayed on the same chart, using the same measurement scale.
The boxplot above summarizes results from a medical study.
The treatment group received an experimental drug to relieve cold
symptoms, and the control group received a placebo. The boxplot
shows the number of days each group continued to report symptoms.
Neither boxplot reveals unusual features, such as gaps or outliers.
Both plots are skewed to the right, although the skew is more
prominent in the treatment group. The range of patient response was
about the same in both groups. In the treatment group, cold symptoms
lasted 1 to 15 days (range = 14) versus 3 to 17 days (range = 14) for
the control group. The median recovery time is more telling:
about 6 days for the treatment group versus about 9 days for the control
group. It appears that the drug may have had a positive effect on
patient recovery.
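A boxplot is a picture of a five-number summary: minimum, first quartile, median, third quartile, and maximum. The Python sketch below shows one way to compute those values for two groups. The lesson does not give the study's raw data, so the two lists are hypothetical values chosen only so that the minimum, maximum, and median match the figures quoted above. The function name `five_number_summary` is also an assumption. Quartile conventions differ between textbooks and software, so the quartiles printed here may not match another tool exactly.

```python
from statistics import median, quantiles

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), the values a boxplot draws."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, median(data), q3, max(data)

# Hypothetical symptom durations in days (NOT the study's data); chosen only
# so that min, max, and median match the values quoted in the text.
treatment = [1, 3, 4, 5, 6, 6, 7, 9, 12, 15]
control = [3, 6, 7, 8, 9, 9, 10, 12, 14, 17]

for name, days in (("treatment", treatment), ("control", control)):
    low, q1, med, q3, high = five_number_summary(days)
    print(f"{name}: min={low} Q1={q1} median={med} Q3={q3} "
          f"max={high} range={high - low}")
```

With these illustrative values, both groups have a range of 14 days, while the medians differ by 3 days, which mirrors the comparison made in the text.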
Double Bar Charts
A double bar chart is similar to a regular bar chart, except that it
provides two pieces of information for each category rather than just one.
Often, the charts are color-coded, with a differently colored bar
representing each piece of information.
The double bar chart above shows customer satisfaction
ratings for different cars, broken out by gender. The
blue bars represent males; the red bars, females.
Both groups prefer the Japanese cars to the American cars, with
Honda receiving the highest ratings and Ford receiving the
lowest ratings. Moreover, both genders agree on the rank
order in which the cars are rated. As a group, the men seem to be tougher
raters; they gave lower ratings to each car than the women gave.
Test Your Understanding
Problem
The back-to-back stemplot below shows the number of books read
in a year by a random sample of college and high school students.
     College | Stem | High school
           7 |  0   |
       3 6 6 |  1   | 0 0 3 5
     1 2 3 4 |  2   | 1 2 4 4 6
     6 8 8 9 |  3   | 1 8 9
         2 8 |  4   | 0 1
             |  5   |
             |  6   |
           3 |  7   |
Which of the following statements are true?
I. Seven college students did not read any books.
II. The college median is equal to the high school median.
III. The mean is greater than the median in both groups.
(A) I only
(B) II only
(C) III only
(D) I and II
(E) II and III
Solution
The correct answer is (E). None of the college students failed to read a
book during the year; the fewest read was seven. In both groups, the
median is equal to 24. And the mean number of books read per year
is 25.3 for high school students versus 30.4
for college students; so the mean is greater than the median
in both groups.
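The figures in this solution can be verified with a short Python sketch, offered as an illustration rather than as part of the original lesson. It assumes the stemplot is read with stems as tens digits, with the college leaves on stems 0 through 4 plus a single leaf on stem 7, and the high school leaves on stems 1 through 4. That is the placement that reproduces the stated medians and means.

```python
from statistics import mean, median

# Values read from the stemplot under the assumptions stated above.
college = [7, 13, 16, 16, 21, 22, 23, 24, 36, 38, 38, 39, 42, 48, 73]
high_school = [10, 10, 13, 15, 21, 22, 24, 24, 26, 31, 38, 39, 40, 41]

for name, books in (("college", college), ("high school", high_school)):
    print(f"{name}: min={min(books)} median={median(books)} "
          f"mean={mean(books):.1f}")
```

Under this reading, both medians are 24, the means round to 30.4 and 25.3, and the smallest college value is 7, so statement I is false while statements II and III are true.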