CHAPTER 3 Displaying and Describing Categorical Data.

CHAPTER 3

Displaying and Describing

Categorical Data

Objectives

• Frequency Table• Relative Frequency Table• Distribution• Area Principle• Bar Chart Pie Chart• Contingency Table• Marginal Distribution• Conditional Distribution• Independence• Segmented Bar Chart• Simpson’s Paradox

The Three Rules of Data Analysis

• The three rules of data analysis won’t be difficult to remember:

1. Make a picture—things may be revealed that are not obvious in the raw data. These will be things to think about.

2. Make a picture—important features of and patterns in the data will show up. You may also see things that you did not expect.

3. Make a picture—the best way to tell others about your data is with a well-chosen picture.

Frequency Table• What is a frequency Table? A

frequency table is an organization of raw data in tabular form, using classes (or intervals) and frequencies.

• What is a frequency count? The frequency or the frequency count for a data value is the number of times the value occurs in the data set.

Categorical Frequency Tables

• NOTE: Later we will consider qualitative frequency tables.

• What is a categorical frequency table? A categorical frequency table represents data that can be placed in specific categories, such as gender, hair color, political affiliation etc.

A frequency table is a tabular summary of data showing the frequency (or number) of items in each of several non-overlapping categories.

The objective is to provide insights about the data that cannot be quickly obtained by looking only at the original data.

Frequency Table

Frequency Tables:

• We can “organize” the data by counting the number of data values in each category of interest.

• We can organize these counts into a frequency table, which records the totals and the category names.

Categorical Frequency Table

• Example: The blood types of 25 blood donors are given below. Summarize the data using a frequency distribution.

AB B A O B O B O A O B O B B B A O AB AB O A B AB O A

Categorical Frequency Table for the Blood Types

Note: The classes for the distribution are the blood types.

Your Turn

• Guests staying at Marada Inn were asked to rate the quality of their accommodations as being excellent (E),above average (AA), average (A), below average (BA), or poor (P). The ratings provided by a sample of 20 guests:

• BA, AA, AA, A, AA, A, AA, A, AA, BA, P, E, AA, A, AA, AA, BA, P, AA, A .

• Make a frequency table.

Categorical Frequency Distribution

Rating Counts

Poor (P) 2

Below Average (BA) 3

Average (A) 5

Above Average (AA) 9

Excellent (E) 1

Total 20

Relative Frequency Tables:• A relative frequency table is similar, but gives the relative

frequency, a decimal or percentage (instead of counts) for each category.

The relative frequency of a class is the fraction or proportion of the total number of data items belonging to the class.

A relative frequency table is a tabular summary of a set of data showing the relative frequency for each class.

Relative Frequency Table

Percent Frequency Table

The percent frequency of a class is the relative frequency multiplied by 100.

A percent frequency table is a tabular summary of a set of data showing the percent frequency for each class.

Relative & Percent Categorical Frequency Table

• Using the frequency table below, from the Marada Inn problem, create a relative and percent frequency table.

• Add two additional columns labeled relative frequency and percent frequency.

Rating Counts

Poor (P) 2

Below Average (BA) 3

Average (A) 5

Above Average (AA) 9

Excellent (E) 1

Total 20

Relative & Percent Categorical Frequency Table

Rating Frequency Rel. Freq. %Freq

Poor 2 2/20 = .10 10%

Below avg. 3 3/20 = .15 15%

Avg. 5 5/20 = .25 25%

Above avg. 9 9/20 = .45 45%

Excellent 1 1/20 = .05 5%

Total: 20 1 100%

Frequency Tables:

• All three types of tables show how cases are distributed across the categories.

• They describe the distribution of a categorical variable because they name the possible categories and tell how frequently each occurs.

There are three kinds of lies: lies, damned lies, and statistics.

Benjamin Disraeli (1804 - 1881)

http://www.quotationspage.com/quote/487.html



http://www.quotationspage.com/myquotations.php?add=487


http://www.quotationspage.com/quotes/Benjamin_Disraeli/

Misleading Statistics

• Now that we have the frequency table, we are ready to make a picture or a graph of the data.

• Misleading graphs

• Scale• Pictographs

Misleading Statistics - Scale

• Adjusting the scale of a graph is a common way to mislead (or lie) with statistics.

• Example:

Misleading Scale

Misleading Statistics

• The best data displays observe a fundamental principle of graphing data called the area principle.

• The area principle says that the area occupied by a part of the graph should correspond to the magnitude of the value it represents.

• Violations of the area principle are another common way of misleading with statistics.

What’s Wrong With This Picture?

• You might think that

a good way to show

the Titanic data is

with this display:

What’s Wrong With This Picture?

• The ship display makes it look like most of the people on the Titanic were crew members, with a few passengers along for the ride.

• When we look at each ship, we see the area taken up by the ship, instead of the length of the ship.

• The ship display violates the area principle: • The area occupied by a part of the

graph should correspond to the magnitude of the value it represents. Slide 3 - 24

Double the length, width, and height of a cube, and the volume increases

by a factor of eight

Area Principle - Pictographs

Area Principle Pictographs

GRAPHING CATEGORICAL OR QUALITATIVE DATA

Ways to Graph Categorical Data

Because the variable is categorical, the data in the graph can be ordered any way we want (alphabetical, by increasing value, by year, by personal preference, etc.).

1. Bar Charts – Each category is represented by a bar.

2. Pie Charts - The slices must represent the parts of one whole.

Bar Charts

• A bar chart displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison.

• A bar chart stays true to the area principle.

• Thus, a better display for the ship data is:

Bar Charts (cont.)

• A relative frequency bar chart displays the relative proportion of counts for each category.

• A relative frequency bar chart also stays true to the area principle.

• Replacing counts with percentages in the ship data:

Slide 3 - 30

Bar Charts

A bar chart is a graphical device for depicting qualitative data.

On one axis (usually the horizontal axis), we specify the labels that are used for each of the categories. A frequency, relative frequency, or percent frequency scale can be used for the other axis (usually the vertical axis).

Using a bar of fixed width drawn above each class label, we extend the height appropriately.

The bars are separated to emphasize the fact that each class is a separate category.

Bar Charts

• Either counts (frequency bar chart) or proportions (relative frequency bar chart) may be shown on the y-axis. This will not change the shape or relationships of the graph.

• Make sure all graphs have a descriptive title and that the axes are labeled (this is true for all graphs in AP Stats).

Pie Charts

• When you are interested in parts of the whole, a pie chart might be your display of choice.

• Pie charts show the whole group of cases as a circle.

• They slice the circle into pieces whose size is proportional to the fraction of the whole in each category.

Slide 3 - 33

Pie Chart The pie chart is a commonly used graphical device for presenting relative frequency distributions for qualitative data.

First draw a circle; then subdivide the circle into sectors that correspond in area to the relative frequency for each category.

Since there are 360 degrees in a circle, a category with a relative frequency of .25 would consume .25(360) = 90 degrees of the circle.

Relations between Two Categorical Variables

• Examples:• Is gender or race related to political

preference? • What type of music can make people

relax? • Will different packaging of the same

product attract people with different social-economic background?

• A contingency table or two-way table is a way to display the data from two categorical variables. A sort of Venn Diagram which shows how a population splits according to two factors.

Contingency Tables• A contingency table allows us to look at two categorical

variables together. • It shows how individuals are distributed along each variable,

contingent on the value of the other variable.• Example: we can examine the class of ticket and whether a

person survived the Titanic:

Contingency Tables (cont.)• The margins of the table, both on the right and on the

bottom, give totals and the frequency distributions for each of the variables.

• Each frequency distribution is called a marginal distribution of its respective variable.• The marginal distribution of Survival is:

Contingency Tables (cont.)• Each cell of the table gives the count for a combination

of values of the two values.• For example, the second cell in the crew column

tells us that 673 crew members died when the Titanic sunk.

Conditional Distributions

• A conditional distribution shows the distribution of one variable for just the individuals who satisfy some condition on another variable.• The following is the conditional distribution of ticket

Class, conditional on having survived:

Conditional Distributions (cont.)

• The following is the conditional distribution of ticket Class, conditional on having perished:


• The conditional distributions tell us that there is a difference in class for those who survived and those who perished.

• This is better shown with pie charts of the two distributions:


• We see that the distribution of Class for the survivors is different from that of the nonsurvivors.

• This leads us to believe that Class and Survival are associated, that they are not independent.

• The variables would be considered independent when the distribution of one variable in a contingency table is the same for all categories of the other variable.

Segmented Bar Charts

• A segmented bar chart displays the same information as a pie chart, but in the form of bars instead of circles.

• Each bar is treated as the “whole” and is divided proportionally into segments corresponding o the percentage in each group.

• Here is the segmented bar chart for ticket Class by Survival status:

Slide 3 - 43

Example: Income level vs. Job Satisfaction

Income Job Satisfaction Row Total

1 2 3 4 < 30K 20 24 80 82

206 30K-50K 22 38 104 125

289 50K-80K 13 28 81 113

235 > 80K 7 18 54 92

171 C. Total 62 108 319 412

901

• This is a Contingency table with Income Level as the Row Variable and Job Satisfaction as the Column Variable.

• The distributions of income to job satisfaction or job satisfaction to income are called Conditional Distributions.

• The distributions of income alone and job satisfaction alone are called Marginal Distributions.

• Relationships between categorical variables are described by calculating appropriate percents from the counts given in each cell.

Conditionaldistribution Marginal

distribution

Tabletotal

Example:

• A Statistics class reports the following data on sex and eye color for students in the class:

1. What percent of females are brown-eyed?

2. What percent of brown-eyed students are female?

3. What percent of students are brown-eyed females?

4. What’s the distribution of eye color?

5. What’s the conditional distribution of eye color for the males?

6. Compare the percent who are female among the blue-eyed students to the percent of all students who are female?

7. Does it seem that eye color and sex are independent? Explain.

Eye Color

Sex

Blue Brown Green/Hazel/Other Total

Males 6 20 6 32

Females 4 16 12 32

Total 10 36 18 64

Solution: Eye Color

Sex

Blue Brown Green/Hazel/Other Total

Males 6 20 6 32

Females 4 16 12 32

Total 10 36 18 64

1. What percent of females are brown-eyed? 16/32 = .5 or 50%

2. What percent of brown-eyed students are female? 16/36 = .444 or 44.4%

3. What percent of students are brown-eyed females? 16/64 = .25 or 25%

4. What’s the distribution of eye color? 10/64 = .156 or 15.6% Blue, 36/64 = .563 or 56.3% Brown, 18/64 =.281 or 28.1% Green/Hazel/Other

5. What’s the conditional distribution of eye color for the males? 6/32 = .188 or 18.8% Blue, 20/32 = .625 or 62.5% Brown, 6/32 = .188 or 18.8% Green/Hazel/Other

6. Compare the percent who are female among the blue-eyed students to the percent of all students who are female? 4/10 = .4 or 40% of the blue-eyed students are female, while 32/64 = .5 or 50% of all students are female.

7. Does it seem that eye color and sex are independent? Explain. Since blue-eyed students appear less likely to be female, it seems that Sex and Eye Color may not be independent. (But the numbers are small.)

SIMPSON’S PARADOX

Simpson’s Paradox• Discovered by E. H. Simpson in 1951.• Occurs when averaging different samples of different

sizes• Two groups from one sample are compared to two similar

groups from another sample• One sample’s success rate for both

groups is higher than the success rates for the other sample

Not E. H. Simpson

Simpson’s Paradox

• However, when both groups’ respective success rates are combined, the sample with the lower success rate ends up with the better overall proportion of successes. Thus, the paradox.

• One sample usually has a considerably smaller number of members than the other groups

• Simpson’s Paradox does not occur in populations with similar amounts

What is Simpson’s Paradox?

• Simpson’s Paradox occurs when an association between two variables is reversed upon observing a third variable.

• Simpson’s paradox lurking variable creates a reversal in the direction of an association (“confounding”)

• To uncover Simpson’s Paradox, divide data into subgroups based on the lurking variable

Simpson’s Paradox

Recent Cleveland Indians season records2003—68-94, 42.0% winning percentage

2004—80-82, 49.4% winning percentage

Two-season record: 148-176, 45.7% win percentage

Recent Minnesota Twins season records2003—90-72, 55.6% win percentage

2004—92-70, 56.8% win percentage

Two-season record: 182-142, 56.2% win percentage

Notice that the Twins had a higher percentage in both 2003 and 2004, as well as in the two-year period. Not Simpson’s Paradox.

Simpson’s Paradox at workRonnie Belliard

2002—61/289, .211 of his at-bats were hits


Two-season average: 185/736, hits .2514 of the time

Casey Blake



Two-season average: 147/577, hits .2548 of the time

The two season batting avg. for Belliard was lower than Blake’s, but divided into separate seasons, Belliard’s had a higher batting avg. both seasons.

This is Simpson’s Paradox.

Consider college acceptance rates by sex

Discrimination? (Simpson’s Paradox)

Accepted Notaccepted

Total

Men 198 162 360

Women 88 112 200

Total 286 274 560

198 of 360 (55%) of men accepted 88 of 200 (44%) of women accepted Is there a sex bias?


• Or is there a lurking variable that explains the association?

• To evaluate this, split applications according to the lurking variable "major applied to”• Business School (240 applicants) • Art School (320 applicants)


18 of 120 men (15%) of men were accepted to B-school24 of 120 (20%) of women were accepted to B-schoolA higher percentage of women were accepted

BUSINESS SCHOOL


Total

Men 18 102 120

Women 24 96 120

Total 42 198 240

Discrimination (Simpson’s Paradox)

ART SCHOOL

180 of 240 men (75%) of men were accepted64 of 80 (80%) of women were accepted A higher percentage of women were accepted.


Total

Men 180 60 240

Women 64 16 80

Total 244 76 320

• Within each school, a higher percentage of women were accepted than men.

• No discrimination against women• Possible discrimination against men• This is an example of Simpson’s Paradox.

• When the lurking variable (School applied to) was ignored, the data suggest discrimination against women.

• When the School applied to was considered, the association is reversed.


Colin R. Blyth’s example of Simpson’s Paradox

• A doctor was planning to try a new treatment on patients mostly local (C) and a few in Chicago (C’). A statistician advised him to use a table of random numbers and as each C patient became available, assign him to the new treatment with probability .91, leave him to the standard treatment with probability .09; and the same for C’ patient with probability .01 and .99 respectively. When the doctor returned with the data the statistician told him that the new treatment was obviously a very bad one, and criticized him for having continued trying it on so many patients.

Treatment Standard New

Dead 5950 9005

Alive 5050 1095

(46%) (11%)

• The doctor replied that he continued because the new treatment was obviously a very good one, having nearly doubled the recovery rate in both cities.

C patient only C’ patient only

Treatment Standard New Standard New

Dead 950 9000 5000 5

Alive 50 1000 5000 95

5% 10% 50% 95%

Smokers’ Example

• In England a study was conducted to examine the survival rates of smokers and non-smokers. The result implied a significant positive correlation between smoking & survival rates because only 24% of smokers died as compared to 31% of non-smokers. When the data were broken down by age group in a contingency table, it was found that there were more older people in the non-smoker group. Thus age played a very significant role in the outcome but since it was overlooked the researchers were left with deceiving results. (Appleton & French, 1996).

The Paradox

What’s true for the parts isn’t true for the whole.

CONCLUSION!!!!

Simpson’s paradox is a rare phenomenon! It does not occur often! Thus statisticians must be trained academically & ethically well enough to make sure that if it has occurred they will detect and correct it. This is where practice, critical thinking skills, and repetition come into play!

What Can Go Wrong?

• Don’t violate the area principle.

• While some people might like the pie chart on the left better, it is harder to compare fractions of the whole, which a well-done pie chart does.

What Can Go Wrong? (cont.)

• Keep it honest—make sure your display shows what it says it shows.

• This plot of the percentage of high-school students who engage in specified dangerous behaviors has a problem. Can you see it?


• Don’t confuse similar-sounding percentages—pay particular attention to the wording of the context.

• Don’t forget to look at the variables separately too—examine the marginal distributions, since it is important to know how many cases are in each category.


• Be sure to use enough individuals!

• Do not make a report like “We found that 66.67% of

the rats improved their performance with training.

The other rat died.”


• Don’t overstate your case—don’t claim something you can’t.

• Don’t use unfair or silly averages—this could lead to Simpson’s Paradox, so be careful when you average one variable across different levels of a second variable.

What have we learned?

• We can summarize categorical data by counting the number of cases in each category (expressing these as counts or percents).

• We can display the distribution in a bar chart or pie chart.

• And, we can examine two-way tables called contingency tables, examining marginal and/or conditional distributions of the variables.

Assignment

• Exercises pg. 37 – 43: #5, 7, 11, 16, 21, 23, 27 -31 odd, 37

• Read Ch-4, pg. 44 - 71

CHAPTER 3 Displaying and Describing Categorical Data.

Documents

Transcript of CHAPTER 3 Displaying and Describing Categorical Data.