Displaying and describing categorical data

36

Transcript of Displaying and describing categorical data

Page 1: Displaying and describing categorical data
Page 2: Displaying and describing categorical data

The three rules of data analysis won’t

be difficult to remember:1. Make a picture—things may be revealed

that are not obvious in the raw data. These will be things to think about.

2. Make a picture—important features of and patterns in the data will show up. You may also see things that you did not expect.

3. Make a picture—the best way to tell others about your data is with a well-chosen picture.

The Three Rules of Data Analysis

Page 3: Displaying and describing categorical data

We can “pile” the data by counting the number of data values in each category of interest.

We can organize these counts into a frequency table, which records the totals and the category names.

Frequency Tables

Page 4: Displaying and describing categorical data

What is your favorite color on this list? Red Blue Green Purple

Frequency Tables Example

Color FrequencyRedBlueGreenPurple

Page 5: Displaying and describing categorical data

A relative frequency table is similar, but gives the percentages (instead of counts) for each category.

Cumulative Frequency – calculates the sum of all relative frequencies for that particular category and all previous categories

Relative Frequency Tables

Page 6: Displaying and describing categorical data

Both types of tables show how cases are

distributed across the categories. They describe the distribution of a categorical

variable because they name the possible categories and tell how frequently each occurs.

To calculate the relative frequency: Divide the count by the total number to cases Multiply by 100 to express as a percentage

To calculate the cumulative relative frequency: Take the current category’s relative frequency and

add it all previous relative frequencies

Creating Frequency Tables

Page 7: Displaying and describing categorical data

Let’s use the previous data we collected to create a relative frequency table.

Relative Frequency: Example

Color Frequency Relative Frequency

Cumulative Relative Frequency

Red

Blue

Green

Purple

TOTAL

Page 8: Displaying and describing categorical data

You might think that a good way to show the Titanic data is with this display. Is it?

What’s Wrong With This Picture?

Page 9: Displaying and describing categorical data

The ship display makes it look like most of the people on the Titanic were crew members, with a few passengers along for the ride.

When we look at each ship, we see the area taken up by the ship, instead of the length of the ship.

The ship display violates the area principle:

The area occupied by a part of the graph should correspond to the magnitude of the value it represents.

The Area Principle

Page 10: Displaying and describing categorical data

A bar chart displays the distribution of a

categorical variable, showing the counts for each category next to each other for easy comparison.

A bar chart stays true to the area principle.

Thus, a better display for the ship data is:

Bar charts have spaces between the categories, while histograms do not.

Bar Charts

Page 11: Displaying and describing categorical data

A relative frequency bar chart displays the relative proportion of counts for each category.

A relative frequency bar chart also stays true to the area principle.

Replacing counts with percentages in the ship data:

Bar Charts (cont.)

Page 12: Displaying and describing categorical data

Use equal increments along the dependent (y) axis. Follow the area principle and be sure to keep the

area equal in each segment (the x-axis should also have equal increments).

Example: Create a bar graph based on the collected data:

How to Create Bar Charts

Color FrequencyRedBlueGreenPurple

Page 13: Displaying and describing categorical data

When you are interested in parts of the whole, a pie chart might be your display of choice.

Pie charts show the whole group of cases as a circle.

They slice the circle into pieces whose size is proportional to the fraction of the whole in each category.

Pie Charts

Page 14: Displaying and describing categorical data

Convert Your Data:

Convert all data values to percentages of the whole data set. For example, four radishes, three cucumbers, two carrots and one pepper equals 40 percent radishes, 30 percent cucumbers, 20 percent carrots and 10 percent peppers.

Convert the percentages into angles. Since a full circle is 360 degrees, multiply this by the percentages to get the angle for each section of the pie. For the radishes, 0.4 X 360 = 144 degrees. For the cucumbers, 0.3 X 360 = 108 degrees. For the carrots, 0.2 X 360 = 72 degrees. For the peppers, 0.1 X 360 = 36 degrees.

Make sure the angle calculations are correct by adding all the angles. The total should be 360. 144 + 108 + 72 + 36 = 360. You may be off by a tenth or so due to rounding, so be

careful.

How to Accurately Create Pie Charts

Page 15: Displaying and describing categorical data

Draw the Chart: Draw a circle on a blank sheet of paper, using a compass. While a

compass is not necessary, using one will make the chart much neater and clearer by ensuring the circle is even.

Draw a radius, from the center to the right edge of the circle, using the ruler or straight edge. This will be the first base line.

Measure the largest angle in the data with the protractor, starting at the baseline, and mark it on the edge of the circle. Use the ruler to draw another radius to that point. Use this new radius as a base line for your next largest angle and continue this process until you get to the last data point. You will only need to measure the last angle to verify its value since both lines will already be drawn.

Label and shade the sections of the pie chart to highlight whatever data is important for your use.

How to Accurately Create Pie Charts (cont.)

Page 16: Displaying and describing categorical data

What’s your favorite academic subject?

Create your own pie chart to represent the class’s distribution of subject preferences.

Class Activity

Color FrequencyMathEnglishHistoryScience

Day 2

Page 17: Displaying and describing categorical data

A contingency table allows us to look at

two categorical variables together. It shows how individuals are distributed

along each variable, depending on the value of the other variable. Example: we can examine the class of ticket and whether a person survived the Titanic:

Contingency Tables

Page 18: Displaying and describing categorical data

The margins of the table, both on the right

and on the bottom, give totals and the frequency distributions for each of the variables.

Each frequency distribution is called a marginal distribution of its respective variable. The marginal distribution of Survival is:

Contingency Tables (cont.)

Page 19: Displaying and describing categorical data

Each cell of the table gives the count for

a combination of values of the two values. For example, the second cell in the crew column tells us that 673 crew members died when the Titanic sunk.

Contingency Tables (cont.)

Page 20: Displaying and describing categorical data

The percentage of passengers who were both in 1st class and survived:

The percentage of 1st class passengers who survived:

The percentage of the survivors who were in 1st class:

Contingency tables allow us to make many comparisons.

Contingency Table Example

Page 21: Displaying and describing categorical data

This two-way table shows the favorite leisure activities for 50 adults - 20 men and 30 women.

What is the marginal distribution of the responses?

Example: Finding Marginal Distributions

Page 22: Displaying and describing categorical data

A conditional distribution shows the distribution of one variable for just the individuals who satisfy some condition on another variable. The following is the conditional distribution of ticket Class, conditional on having survived:

Conditional Distributions

Page 23: Displaying and describing categorical data

The following is the conditional distribution of ticket Class, conditional on having passed:

Conditional Distributions (cont.)

Page 24: Displaying and describing categorical data

The conditional distributions tell us that there is a difference in class for those who survived and those who perished.

This is better shown with pie charts of the two distributions:

Conditional Distributions (cont.)

Page 25: Displaying and describing categorical data

We see that the distribution of Class for the survivors is different from that of the nonsurvivors..

This leads us to believe that Class and Survival are associated, that they are not independent.

The variables would be considered independent when the distribution of one variable in a contingency table is the same for all categories of the other variable.

Conditional Distributions (cont.)

Page 26: Displaying and describing categorical data

1. What % of the class are girls with liberal political views?

2. What % of conservatives are boys?

3. What % of girls are moderate?

4. What is the marginal frequency distribution of political views?

5. What is the conditional distribution of gender among conservatives?

Activity: Political Views

Conservative Liberal Moderate TotalFemaleMaleTOTAL

Page 27: Displaying and describing categorical data

When averages are taken across different groups, they can appear to contradict the overall averages.

Simpson’s ParadoxDay 3

Page 28: Displaying and describing categorical data

It’s the last inning of an important game.

Your team is one run down with the bases loaded and two outs. The pitcher is due up, so you’ll be sending in a pinch-hitter. There are two batters available on the bench. Whom should you send in to bat?

Do the averages line up with the overall?

Simpson’s Paradox (cont.)

Page 29: Displaying and describing categorical data

A segmented bar

chart displays the same information as a pie chart, but in the form of bars instead of circles.

Each bar is treated as the “whole” and is divided proportionally into segments corresponding o the percentage in each group.

Here is the segmented bar chart for ticket Class by Survival status:

Segmented Bar Charts

Page 30: Displaying and describing categorical data

Don’t violate the area principle.

While some people might like the pie chart on the left better, it is harder to compare fractions of the whole, which a well-done pie chart does.

What Can Go Wrong?

Page 31: Displaying and describing categorical data

Keep it honest—make sure your display shows what it says it shows.

This plot of the percentage of high-school students who engage in specified dangerous behaviors has a problem. Can you see it?

What Can Go Wrong? (cont.)

Page 32: Displaying and describing categorical data

Don’t confuse similar-sounding percentages—pay particular attention to the wording of the context.

Don’t forget to look at the variables separately too—examine the marginal distributions, since it is important to know how many cases are in each category.

What Can Go Wrong? (cont.)

Page 33: Displaying and describing categorical data

Be sure to use enough individuals!

Do not make a report like “We found that 66.67% of the rats improved their performance with training. The other rat died.”

What Can Go Wrong? (cont.)

Page 34: Displaying and describing categorical data

Don’t overstate your case—don’t claim something you can’t.

Don’t use unfair or silly averages—this could lead to Simpson’s Paradox, so be careful when you average one variable across different levels of a second variable.

What Can Go Wrong? (cont.)

Page 35: Displaying and describing categorical data

We can summarize categorical data by counting the number of cases in each category (expressing these as counts or percents).

We can display the distribution in a bar chart or pie chart.

And, we can examine two-way tables called contingency tables, examining marginal and/or conditional distributions of the variables.

Recap

Page 36: Displaying and describing categorical data

Day 1: #5 – 9 ODD, 11, 13 – 16 Day 2: # 17, 19 – 21 Day 3: # 23, 25, 28 Day 4: # 32, 33, 39

Read Chapter 4

Assignments: pp. 38-39