Copyright © 2014, 2011 Pearson Education, Inc. 1 Chapter 5 Association between Categorical...

Copyright © 2014, 2011 Pearson Education, Inc. 1

Chapter 5 Association between Categorical Variables

5.1 Contingency Tables

Which hosts send more buyers to Amazon.com?

To answer this question, we must gather data on two categorical variables: Host and Purchase.

Host identifies the originating site (Comcast, Google, or Nextag); Purchase indicates whether or not the visit results in a sale.

5.1 Contingency Tables

Consider Two Categorical Variables Simultaneously

Contingency table: a table that shows counts of cases on one categorical variable contingent on the value of another (for every combination of both variables)

Cells in a contingency table are mutually exclusive
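As a rough illustration of how such a table is tallied, the sketch below counts (Purchase, Host) pairs from a handful of visit records. The records are made up for illustration and are not the textbook's Amazon data; the names `visits`, `cells`, and `table` are likewise hypothetical.

```python
from collections import Counter

# Hypothetical visit records as (host, purchase) pairs; NOT the textbook's data.
visits = [
    ("Comcast", "Yes"), ("Comcast", "No"), ("Google", "No"),
    ("Google", "No"), ("Google", "Yes"), ("Nextag", "No"),
    ("Nextag", "No"), ("Comcast", "No"),
]

# Each visit lands in exactly one (purchase, host) cell,
# which is why the cells are mutually exclusive.
cells = Counter((purchase, host) for host, purchase in visits)

hosts = ["Comcast", "Google", "Nextag"]
outcomes = ["No", "Yes"]

# Rows are Purchase outcomes, columns are Hosts; missing cells count as 0.
table = {p: {h: cells[(p, h)] for h in hosts} for p in outcomes}

for p in outcomes:
    print(p, [table[p][h] for h in hosts])
```

Because every case is counted once, the cell counts add up to the number of records.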

5.1 Contingency Tables

Contingency Table for Web Shopping

5.1 Contingency Tables

Marginal and Conditional Distributions

• Marginal distributions appear in the “margins” of a contingency table and represent the totals (frequencies) for each categorical variable separately

• Conditional distributions refer to counts within a row or column of a contingency table (restricted to cases satisfying a condition)
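The two definitions can be sketched in a few lines of code. The counts below are hypothetical placeholders, not the values from the textbook's table; rows are Purchase and columns are Host.

```python
# Hypothetical counts (NOT the textbook's data): rows = Purchase, columns = Host.
table = {
    "No":  {"Comcast": 90, "Google": 190, "Nextag": 98},
    "Yes": {"Comcast": 10, "Google": 10, "Nextag": 2},
}
hosts = ["Comcast", "Google", "Nextag"]

# Marginal distributions: totals for each variable taken separately.
row_totals = {p: sum(row.values()) for p, row in table.items()}
col_totals = {h: sum(table[p][h] for p in table) for h in hosts}
n = sum(row_totals.values())

# Conditional distribution of Purchase given Host: restrict to one column
# and divide each count by that column's total.
conditional = {
    h: {p: table[p][h] / col_totals[h] for p in table}
    for h in hosts
}

for h in hosts:
    print(h, {p: round(share, 3) for p, share in conditional[h].items()})
```

Each conditional distribution sums to 1 within its own column, which is what makes the purchase rates comparable across hosts of different sizes.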

5.1 Contingency Tables

Conditional Distribution of Purchase for each Host (Column Counts and Percentages)

5.1 Contingency Tables

Conditional Distribution

• Reveals that Comcast has the highest rate of purchases, more than twice that of Nextag

• Host and Purchase are associated

5.1 Contingency Tables

Stacked Bar Charts

• Used to display conditional distributions

• Divide the bars in a bar chart proportionally into segments corresponding to the percentage in each group of a second variable

5.1 Contingency Tables

Contingency Table of Purchase by Region

5.1 Contingency Tables

Stacked Bar Chart Shows No Association

5.1 Contingency Tables

Mosaic Plots

Alternative to stacked bar chart

A plot in which the size of each “tile” is proportional to the count in a cell of a contingency table
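One common way to lay out the tiles (an assumption here, since the slide does not spell out the construction) is to give each column a width equal to its share of all cases and each tile a height equal to its share within the column. The sketch below checks that this makes every tile's area proportional to its cell count. The shirt counts and the style names are hypothetical.

```python
# Hypothetical counts (NOT the textbook's data): rows = shirt Size, columns = Style.
counts = {
    "Small":  {"Button-Down": 20, "Polo": 30},
    "Medium": {"Button-Down": 50, "Polo": 40},
    "Large":  {"Button-Down": 30, "Polo": 30},
}
styles = ["Button-Down", "Polo"]

n = sum(sum(row.values()) for row in counts.values())
col_totals = {s: sum(counts[size][s] for size in counts) for s in styles}

# Assumed layout: column width = marginal share of the style,
# tile height = conditional share of the size within that style.
tiles = {}
for s in styles:
    width = col_totals[s] / n
    for size in counts:
        height = counts[size][s] / col_totals[s]
        tiles[(size, s)] = (width, height)

for (size, s), (w, h) in tiles.items():
    print(size, s, "area =", round(w * h, 3), "count share =", counts[size][s] / n)
```

With this layout, width times height reduces to count / n, so the tile areas mirror the cells of the contingency table.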

5.1 Contingency Tables

Contingency Table of Shirt Size by Style

5.1 Contingency Tables

Mosaic Plot Shows Association

4M Example 5.1: CAR THEFT

Motivation

Should insurance companies vary the premiums for different car models (are some cars more likely to be stolen than others)?

4M Example 5.1: CAR THEFT

Method

Data obtained from the National Highway Traffic Safety Administration (NHTSA) on car theft for seven popular models (two categorical variables: type of car and whether the car was stolen).

4M Example 5.1: CAR THEFT

Mechanics

4M Example 5.1: CAR THEFT

Mechanics

4M Example 5.1: CAR THEFT

Message

The Dodge Intrepid is more likely to be stolen than other popular models. The data suggest that higher premiums for theft insurance should be charged for models that are more likely to be stolen.

5.2 Lurking Variables and Simpson’s Paradox

Association Not Necessarily Causation

Lurking Variable: a concealed variable that affects the apparent relationship between two other variables

Simpson’s Paradox: a change in the association between two variables when data are separated into groups defined by a third variable
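A small numerical sketch may make the reversal concrete. The flight counts below are invented for two hypothetical airlines, "A" and "B", and two hypothetical destination types; they are not the Bureau of Transportation Statistics data used in the chapter's example.

```python
# (on_time, total) flights per airline and destination; all numbers hypothetical.
flights = {
    "A": {"easy": (810, 900), "hard": (60, 100)},
    "B": {"easy": (95, 100), "hard": (630, 900)},
}

def rate(on_time, total):
    """On-time rate as a fraction between 0 and 1."""
    return on_time / total

# Overall rates: pool the destinations, ignoring the third variable.
overall = {}
for airline, by_dest in flights.items():
    on_time = sum(o for o, _ in by_dest.values())
    total = sum(t for _, t in by_dest.values())
    overall[airline] = rate(on_time, total)

# Rates within each destination: separate the data by the third variable.
within = {
    airline: {dest: rate(*cell) for dest, cell in by_dest.items()}
    for airline, by_dest in flights.items()
}

print("overall:", overall)
print("within destination:", within)
```

In these made-up numbers, airline A looks better overall only because most of its flights go to the easy destination; within each destination, airline B has the higher on-time rate. Destination plays the role of the lurking variable.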

5.2 Lurking Variables and Simpson’s Paradox

Hidden Lurking Variable (Weight)

5.2 Lurking Variables and Simpson’s Paradox

Adjusted for Lurking Variable (Weight)

4M Example 5.2: AIRLINE ARRIVALS

Motivation

Does it matter which of two airlines a corporate CEO chooses when flying to meetings if he wants to avoid delays?

4M Example 5.2: AIRLINE ARRIVALS

Method

Data obtained from US Bureau of Transportation Statistics on flight delays for two airlines (two categorical variables: airline and whether the flight arrived on time).

4M Example 5.2: AIRLINE ARRIVALS

Mechanics

4M Example 5.2: AIRLINE ARRIVALS

Mechanics – Is destination a lurking variable?

4M Example 5.2: AIRLINE ARRIVALS

Mechanics – This is Simpson’s Paradox

4M Example 5.2: AIRLINE ARRIVALS

Message

The CEO should book on US Airways as it is more likely to arrive on time regardless of destination.

5.3 Strength of Association

Chi-Squared Statistic

A measure of association in a contingency table

Calculated based on a comparison of the observed contingency table to an artificial table with the same marginal totals but no association
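In symbols, the comparison can be written as follows. The notation is an assumption introduced here rather than taken from the slides: $O_{ij}$ is the observed count in row $i$ and column $j$, $E_{ij}$ is the corresponding count in the artificial no-association table, and $n$ is the total number of cases.

$$
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n},
\qquad
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$

Building $E_{ij}$ from the marginal totals is what keeps the margins of the artificial table equal to those of the observed table while removing any association between the two variables.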

5.3 Strength of Association

Contingency Table

5.3 Strength of Association

Calculating the Chi-Squared Statistic

5.3 Strength of Association

Calculating the Chi-Squared Statistic

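$$
\begin{aligned}
\chi^2 &= \frac{(30-40)^2}{40} + \frac{(70-60)^2}{60} + \frac{(50-40)^2}{40} + \frac{(50-60)^2}{60} \\
&= \frac{(-10)^2}{40} + \frac{10^2}{60} + \frac{10^2}{40} + \frac{(-10)^2}{60} \\
&\approx 2.5 + 1.67 + 2.5 + 1.67 \\
&\approx 8.33
\end{aligned}
$$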
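The arithmetic on this slide can be checked with a short sketch. The observed counts (30, 70, 50, 50) and expected counts (40, 60, 40, 60) are the ones that appear in the calculation; how they are arranged into rows and columns is an assumption, since the table itself did not survive in this transcript.

```python
# Observed counts from the slide's calculation. The row/column arrangement
# is an assumption; the original table is not available in the transcript.
observed = [
    [30, 70],
    [50, 50],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under no association: same margins, independent variables.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Sum of (observed - expected)^2 / expected over all cells.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected)               # [[40.0, 60.0], [40.0, 60.0]]
print(round(chi_squared, 2))  # 8.33
```

Under the assumed arrangement, the expected counts computed from the margins reproduce the 40 and 60 used in the denominators on the slide.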

5.3 Strength of Association

Cramer’s V

Derived from the Chi-Squared Statistic

Ranges in value from 0 (variables are not associated) to 1 (variables are perfectly associated)

5.3 Strength of Association

Calculating Cramer’s V

$$
V = \sqrt{\frac{\chi^2}{n \, \min(r-1,\; c-1)}}
$$

V = 0.20 for our example

There is a weak association between group (students or staff) and attitude toward sharing copyrighted music
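As a quick check of the value reported for the example, the sketch below applies the formula with χ² = 25/3 (about 8.33), r = c = 2, and n = 200. The value of n is inferred here by adding the four observed counts from the chi-squared calculation (30 + 70 + 50 + 50); the function name `cramers_v` is a hypothetical helper.

```python
import math

def cramers_v(chi_squared, n, n_rows, n_cols):
    """Cramer's V from a chi-squared statistic, total count n, and table size."""
    return math.sqrt(chi_squared / (n * min(n_rows - 1, n_cols - 1)))

# chi-squared = 25/3 (about 8.33) for the 2 x 2 example;
# n = 200 is inferred from the observed counts, not stated on the slide.
v = cramers_v(25 / 3, 200, 2, 2)
print(round(v, 2))  # 0.2
```

For a 2 × 2 table, min(r − 1, c − 1) = 1, so V reduces to the square root of χ² / n.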

5.3 Strength of Association

Checklist: Chi-Squared and Cramer’s V

Verify that variables are categorical

Verify that there are no obvious lurking variables

4M Example 5.3: REAL ESTATE

Motivation

Do people who heat their homes with gas prefer to cook with gas as well? What heating systems and appliances should a developer select for newly built homes?

4M Example 5.3: REAL ESTATE

Method

The developer contacts homeowners to obtain the data. Two categorical variables: type of fuel used for home heating (gas or electric) and type of fuel used for cooking (gas or electric).

4M Example 5.3: REAL ESTATE

Mechanics

Chi-Squared = 98.62; Cramer’s V = 0.47

4M Example 5.3: REAL ESTATE

Message

Homeowners prefer gas to electric heat by about 2 to 1. The developer should build about two-thirds of new homes with gas heat. Put electric appliances in all homes with electric heat and in half of the homes with gas heat (assuming that buyers for new homes have the same preferences).

Best Practices

Use contingency tables to find and summarize association between two categorical variables.

Be on the lookout for lurking variables.

Use plots to show association.

Exploit the absence of association.

Pitfalls

Don’t interpret association as causation.

Don’t display too many numbers in a table.