Copyright © 2013, 2010 and 2007 Pearson Education, Inc.

Chapter 4: Describing the Relation between Two Variables

Section 4.1: Scatter Diagrams and Correlation

The response (dependent) or “output” variable is the variable whose value can be “predicted” or explained by the value of the explanatory/predictor (independent) or “input” variable.

A scatter diagram is a graph that shows the relationship between two quantitative variables.

The predictor (independent / “x”) variable is plotted on the horizontal axis, and the response (dependent / “y”) variable is plotted on the vertical axis.

EXAMPLE Drawing and Interpreting a Scatter Diagram

The data to the right are based on a study of drilling through rock.

The researchers wanted to determine whether the time it takes to drill through 5 feet of rock increases with the depth at which the drilling begins.

Depth at which drilling begins is the predictor variable, “x”, and time (min) to drill five feet is the response variable, “y”.

Draw a scatter diagram of the data.

Various Types of Relations in a Scatter Diagram

Two variables that are linearly related are positively correlated when higher values of one variable are associated with higher values of the other (positive slope), and lower values of one variable are associated with lower values of the other.

That is, two variables are “positively correlated” if, as one variable increases, the other variable also increases.

Two variables that are linearly related are negatively correlated when higher values of one variable are associated with lower values of the other (negative slope), and lower values of one variable are associated with higher values of the other.

That is, two variables are “negatively correlated” if, as one variable increases, the other variable decreases.

The linear correlation coefficient, or Pearson correlation coefficient, is a measure of the strength and direction of the linear relation between two quantitative variables.

The Greek letter “ρ” (rho) represents the population correlation coefficient, and “r” represents the sample correlation coefficient.

Larson & Farber, Elementary Statistics: Picturing the World, 3e

Linear Correlation Coefficient

A measure of the strength and direction of a linear relationship between two variables.

The range of r is from –1 to +1.

If r is close to 1, there is a strong positive correlation.

If r is close to –1, there is a strong negative correlation.

If r is close to 0, there is no linear correlation.

Properties of the Pearson Correlation Coefficient

1. –1 ≤ r ≤ 1.

2. If r = +1, then a perfect positive correlation exists between the two variables.

3. If r = –1, then a perfect negative correlation exists between the two variables.

4. The closer r is to +1, the stronger the positive correlation between the two variables.

5. The closer r is to –1, the stronger the negative correlation between the two variables.

6. If r is close to 0, then little or no evidence exists of a linear correlation between the two variables. So r close to 0 does not imply no relation, just no linear relation.

7. The “r” coefficient is dimensionless.

8. The Pearson correlation coefficient is not resistant. Therefore, just one observation that does not follow the overall data pattern (think outlier) could strongly affect the value of r.
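
Properties 6 and 8 can be checked numerically. The following is a minimal Python sketch using made-up data (not the drilling data); the helper name `pearson_r` is an assumption introduced here for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Property 6: a perfect but non-linear relation (y = x^2) can give r = 0.
xs_curve = [-3, -2, -1, 0, 1, 2, 3]
ys_curve = [x ** 2 for x in xs_curve]
r_curve = pearson_r(xs_curve, ys_curve)

# Property 8: r is not resistant. Hypothetical data on a perfect line ...
xs_line = [1, 2, 3, 4, 5, 6]
ys_line = [2, 4, 6, 8, 10, 12]
r_line = pearson_r(xs_line, ys_line)

# ... and the same data with one added outlier at (7, 0).
r_outlier = pearson_r(xs_line + [7], ys_line + [0])

print(r_curve, r_line, r_outlier)
```

In this made-up example a single outlier pulls r from 1 down to 0.25, and the y = x² data give r = 0 even though y is completely determined by x.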

EXAMPLE Determining the Pearson Correlation Coefficient

Determine the Pearson correlation coefficient “r” of the drilling data:

1. By algebra (boo!)

2. By calculator (yeah!)

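[Table: for each data pair, the columns are $x_i$, $y_i$, $\frac{x_i - \bar{x}}{s_x}$, $\frac{y_i - \bar{y}}{s_y}$, and the product of the two standardized values.]

$$r = \frac{\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1} = \frac{8.501037}{12 - 1} = 0.773$$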
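
The “by algebra” route can be sketched in Python: standardize each x and y, multiply the pairs, add them up, and divide by n – 1. The data below are hypothetical (the drilling data table is not reproduced here), and the function name `pearson_r_zscores` is an assumption for illustration. `statistics.stdev` is the sample standard deviation.

```python
from statistics import mean, stdev

def pearson_r_zscores(xs, ys):
    """r = (sum of products of z-scores) / (n - 1)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical paired data, NOT the drilling data from the slides.
depth = [10, 20, 30, 40, 50]
time = [2.1, 2.9, 3.2, 4.8, 5.0]
r = pearson_r_zscores(depth, time)
print(round(r, 3))

# Last step of the slide's calculation: sum of products 8.501037, n = 12.
r_slide = 8.501037 / (12 - 1)
print(round(r_slide, 3))
```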

Better way …

1. Enter all the “x” data in List 1, and all the “y” data in List 2. Make sure you keep the related pairs of x,y data in the same order.

2. Set your calculator to “Diagnostics On”.

3. Go to STAT: CALC: 4: LinReg: L1, L2.

4. Look for the value of Pearson “r”.

TI-84 Line of Regression (LOR)

X data (horiz) to L1, Y data (vert) to L2.

STAT PLOT: Plot 1: Scatter Plot

Zoom: 9: Stat

STAT: CALC: 4: LinReg(ax+b): L1, L2, Y1. This will generate the LOR through the points on the Stat Plot, and will show the equation at Y1.

To predict the “y” value when x = 9, find Y1(9). Y1 is found at VARS, Y-VARS, Func, Y1.

Testing for a Linear Relation

1. Determine the absolute value of the Pearson correlation coefficient: |r|.

2. Find the critical value in Table II from Appendix A (or handout) for the given sample size.

3. If |r| is greater than the critical value, then a usable linear relation (one that can be used to make predictions) exists between the two variables. Otherwise, no linear relation exists.

EXAMPLE Does a Linear Relation Exist?

Determine whether a linear relation exists between time and depth of the drilling. What type of relation appears to exist between time to drill five feet and depth at which drilling begins?

The Pearson |r| value for the two variables (time/depth) is 0.773.

The critical value for n = 12 observations is 0.576.

Since 0.773 > 0.576, there is a positive linear correlation between time to drill five feet and depth at which drilling begins. We can use this correlation to make predictions.
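
The three-step test above can be sketched in Python. The critical value still has to be looked up in the table; the code only does the comparison. The function name `linear_relation_exists` is an assumption for illustration; the numbers are the ones quoted in the example.

```python
def linear_relation_exists(r, critical_value):
    """Return True when |r| exceeds the critical value from the table."""
    return abs(r) > critical_value

# Values quoted in the example: r = 0.773, critical value 0.576 for n = 12.
r = 0.773
critical_value = 0.576

exists = linear_relation_exists(r, critical_value)
direction = "positive" if r > 0 else "negative"
print(exists, direction)
```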

Another way that two variables can be related even though there is not a causal relation is through a “lurking variable”.

A lurking variable is related to both the explanatory and response variables.

For example, ice cream sales and crime rates have a very high positive correlation. Does this mean that sales of ice cream cause crime rates to go up?

The lurking variable is temperature. As temperatures rise, both ice cream sales and crime rates rise.

Something to remember…

Correlation between variables does not imply “causation” (the independent causes the dependent) unless the results come from a controlled experiment.

Correlation of variables in an observational study only implies “association” between the variables and not “causation” of one by the other.

Section 4.2: Least-squares Regression

EXAMPLE Finding an Equation that Describes Linearly Correlated Data

Find a linear equation that relates x (predictor variable) and y (response variable) by selecting any two points and finding the equation of the line between those points.

Use points: (2, 5.7) and (6, 1.9)

$m = \frac{5.7 - 1.9}{2 - 6} = -0.95$

$y - y_1 = m(x - x_1)$

$y - 5.7 = -0.95(x - 2)$

$y - 5.7 = -0.95x + 1.9$

$y = -0.95x + 7.6$

Graph the equation on the scatter diagram.

Use the equation to predict y if x = 3.

$\hat{y} = -0.95x + 7.6 = -0.95(3) + 7.6 = 4.75$

Note: (3, 5.2) is an actual data point.

The difference between the observed value of y and the predicted value of y is the error, or residual.

Using the line and the predicted value at x = 3:

residual = observed y – predicted y = 5.2 – 4.75 = 0.45 (error)
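
The two-point line, the prediction at x = 3, and the residual can be reproduced with a short Python sketch. The points and the observed value 5.2 come from the example; the helper name `line_through` is an assumption for illustration.

```python
def line_through(p1, p2):
    """Slope and intercept of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y1 - y2) / (x1 - x2)
    intercept = y1 - slope * x1
    return slope, intercept

# The two points chosen in the example.
slope, intercept = line_through((2, 5.7), (6, 1.9))

# Prediction at x = 3 and residual against the observed point (3, 5.2).
predicted = slope * 3 + intercept
residual = 5.2 - predicted

print(round(slope, 2), round(intercept, 2),
      round(predicted, 2), round(residual, 2))
```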

Least-Squares Regression Criterion

The least-squares regression line (LOR or COBF) is the line that minimizes the sum of the squared errors (residuals).

This LOR line minimizes the sum of the squared vertical distances between the observed values of y and those predicted by the line (“y-hat”, $\hat{y}$).

In other words: minimize $\sum \text{residuals}^2 = \sum (y - \hat{y})^2$.
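
To see the criterion at work, here is a minimal Python sketch on hypothetical data (not the drilling data). It fits the least-squares line with the standard closed-form formulas and compares its sum of squared residuals with that of another candidate line, the one through the first and last points. The function names are assumptions for illustration.

```python
def least_squares_line(xs, ys):
    """Slope and intercept that minimize the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of (observed y - predicted y)^2 for a given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2.2, 2.8, 4.5, 4.9, 6.1]

b1, b0 = least_squares_line(xs, ys)
sse_best = sum_squared_residuals(xs, ys, b1, b0)

# Another candidate: the line through the first and last points.
other_slope = (ys[-1] - ys[0]) / (xs[-1] - xs[0])
other_intercept = ys[0] - other_slope * xs[0]
sse_other = sum_squared_residuals(xs, ys, other_slope, other_intercept)

print(sse_best < sse_other)
```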

Key Concepts

LOR stands for “Line of Regression”.

COBF stands for “Curve of Best Fit”.

Both terms refer to the Least-Squares Regression Line and are used interchangeably.

EXAMPLE Finding the Least-squares Regression Line

(a) Find the LOR line.

(b) Predict the drilling time if drilling starts at 130 feet.

(c) Is the observed drilling time at 130 feet above or below the predicted time?

(d) Draw the LOR on the scatter diagram of the data.

We agree to round the estimates of the slope and intercept to four decimal places.

(a) $\hat{y} = 0.0116x + 5.5273$

(b) $\hat{y} = 0.0116(130) + 5.5273 = 7.035$

(c) The observed drilling time is 6.93 minutes and the predicted drilling time is 7.035 minutes, so the observed time is below the predicted time. The LOR-predicted drilling time is 1.52% above the observed time.
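
The prediction at 130 feet and the comparison with the observed time can be checked with a few lines of Python. The slope, intercept, and observed time are the values quoted in this example; the function name `predict_time` is an assumption for illustration.

```python
SLOPE = 0.0116      # minutes per foot of starting depth (from the example)
INTERCEPT = 5.5273  # minutes (from the example)

def predict_time(depth_ft):
    """LOR-predicted time (min) to drill five feet from a given depth."""
    return SLOPE * depth_ft + INTERCEPT

predicted = predict_time(130)
observed = 6.93  # observed time (min) at 130 feet, as quoted in the example

residual = observed - predicted  # negative: observed is below predicted
percent_over = (predicted - observed) / observed * 100

print(round(predicted, 3), round(residual, 3), round(percent_over, 2))
```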

Interpretation of the slope of the line:

The slope of the LOR regression line is 0.0116.

Therefore, for each additional foot of depth at which drilling begins, the time to drill five feet increases by 0.0116 min (about 0.7 sec), on average.

If the LOR is used to make predictions based on values of the predictor (independent) variable that are significantly outside the observed values, then the researcher is working outside the scope of the model.

Never use an LOR to make predictions outside the scope of the model because the linear relation may not still exist.

Section 4.3: Diagnostics on the Least-squares Regression (LOR) Line

The coefficient of determination, R², measures the proportion of total variation in the response variable that is explained by the LOR line.

The coefficient of determination is a number between 0 and 1, inclusive: 0 ≤ R² ≤ 1.

If R² = 0, the LOR has no predictive value.

If R² = 1, 100% of the variation in the response variable is explained by the LOR line.

Depth at which drilling begins is the predictor variable, “x”.

Time (min) to drill five feet is the response variable, “y”.

Regression Analysis

The regression equation (LOR) is:

y (time) = 0.0116x (depth) + 5.53 (min)

Sample Statistics

         Mean     Standard Deviation
Depth    126.2    52.2
Time     6.99     0.781

Correlation Between Depth and Time: 0.773
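
The regression equation can be recovered from the summary statistics alone, using the standard identities slope = r · s_y / s_x and intercept = ȳ – slope · x̄. These identities are not stated on the slide; the sketch below is a consistency check under that assumption, using only the printed values.

```python
# Summary statistics as printed in the regression output.
mean_depth, sd_depth = 126.2, 52.2
mean_time, sd_time = 6.99, 0.781
r = 0.773

# Standard least-squares identities (assumed, not stated on the slide).
slope = r * sd_time / sd_depth
intercept = mean_time - slope * mean_depth

print(round(slope, 4), round(intercept, 2))
```

Rounded, this gives a slope of 0.0116 and an intercept of 5.53, which agrees with the regression equation shown above.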

Suppose we were asked to predict the time to drill an additional 5 feet, but we did not know the current depth of the drill. What would be our best “guess”?

ANSWER:

The mean time to drill an additional 5 feet: 6.99 minutes (see Sample Statistics).

Now suppose that we are asked to predict the time to drill an additional 5 feet if we know that the current depth of the drill is 160 feet.

ANSWER:

$\hat{y} = 5.53 + 0.0116(160) = 7.39$

Our “guess” increased from 6.99 minutes to 7.39 minutes because we knew the drill depth and the LOR equation.

Definitions

The “observed” value of the response (dependent) variable: $y$

The “predicted” value (by the LOR): $\hat{y}$

The “mean” value (of all the “y” values): $\bar{y}$

Total Deviation = Unexplained Deviation + Explained Deviation

$y - \bar{y} = (y - \hat{y}) + (\hat{y} - \bar{y})$

Total Variation = Unexplained Variation + Explained Variation

$R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}}$

To determine R² for the linear regression model, simply square the value of the Pearson correlation coefficient “r”.

The TI-84 gives you both “r” and R² when you use the “LinReg” subroutine (remember to set “Diagnostics On”).

EXAMPLE Determining the Coefficient of Determination

Find and interpret the coefficient of determination for the drilling data.

The Pearson correlation coefficient, “r”, is 0.773, so

R² = 0.773² = 0.5975 = 59.75%.

So, 59.75% of the variation in drilling time is explained by the LOR (the linear relation with drilling depth).
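
The link between the variation decomposition and R² = r² can be verified numerically. The sketch below uses hypothetical data (not the drilling data) to compute total, explained, and unexplained variation, then squares the slide's r = 0.773 as a final check.

```python
import math

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2.2, 2.8, 4.5, 4.9, 6.1]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Least-squares line and its predictions.
slope = sxy / sxx
intercept = y_bar - slope * x_bar
y_hat = [slope * x + intercept for x in xs]

total = sum((y - y_bar) ** 2 for y in ys)
explained = sum((yh - y_bar) ** 2 for yh in y_hat)
unexplained = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))

r = sxy / math.sqrt(sxx * syy)
r_squared = explained / total

print(round(total, 4), round(explained + unexplained, 4))
print(round(r_squared, 4), round(r ** 2, 4))

# The drilling example: square the quoted r.
print(round(0.773 ** 2, 4))
```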

Data Set A, Data Set B, Data Set C

A: 99.99% of the variation in y is explained by the variation in x (LOR)

B: 94.7% of the variation in y is explained by the variation in x (LOR)

C: 9.4% of the variation in y is explained by the variation in x (LOR)

Residuals play an important role in determining the adequacy of the linear model.

If a plot of the residuals against the predictor (indep) variable shows a discernible pattern, such as a curve, then the response (dep) and predictor variables may not be linearly related.
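
As an illustration of a “discernible pattern”, the sketch below fits a least-squares line to hypothetical data that actually follow y = x² and lists the residuals in order of x. The data and function name are assumptions for illustration only.

```python
def least_squares_line(xs, ys):
    """Slope and intercept of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical curved data: y = x^2.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

slope, intercept = least_squares_line(xs, ys)

# residual = observed y - predicted y, listed in order of x.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(residuals)
```

The residuals start positive, dip negative in the middle, and end positive again. That U shape is the kind of curve in a residual plot that signals a non-linear relation.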

If a plot of the residuals versus the predictor (x) variable shows the spread of the residuals increasing or decreasing as the “x” variable increases, then a requirement for a linear model is violated.

This requirement is called “constant error variance”.

A plot of residuals against the predictor (indep) variable may also reveal outliers.

These values will be easy to identify because the residual will lie far from the others in the plot.

EXAMPLE Residual Analysis

Draw a residual plot of the drilling time data.

Comment on the appropriateness (validity) of the LOR least-squares model.

An influential observation is an observation (data pair) that significantly affects either:

1. the LOR’s slope and/or y-intercept,

or

2. the value of the Pearson linear correlation coefficient “r”.

Influential observations typically exist when the point is an outlier relative to the LOR.

So, Case 3 is likely to be influential.

EXAMPLE Influential Observations

Suppose an additional data point is added to the drilling data. At a depth of 300 feet, it took 12.49 minutes to drill 5 feet.

Is this point influential?

[Figure: scatter diagram showing the LOR with the influential point and the LOR without it.]
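
One way to check for influence is to fit the LOR with and without the suspect point and compare the results. The drilling data table is not reproduced here, so the sketch below uses hypothetical data and a hypothetical added point; the function name `fit` is also an assumption.

```python
def fit(xs, ys):
    """Slope and intercept of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data, NOT the drilling data.
base_x = [1, 2, 3, 4, 5]
base_y = [2.2, 2.8, 4.5, 4.9, 6.1]

# Hypothetical added point, far from the other x values and off the trend.
extra_x, extra_y = 12, 3.0

slope_without, intercept_without = fit(base_x, base_y)
slope_with, intercept_with = fit(base_x + [extra_x], base_y + [extra_y])

print(round(slope_without, 3), round(slope_with, 3))
```

A large change in slope or intercept between the two fits suggests the added point is influential.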

Section 4.4: Contingency Tables and Association

A college professor conducted a study to assess the effectiveness of teaching a statistics course via (1) traditional lecture method, (2) online delivery (no classroom meetings), and (3) hybrid instruction (online course + weekly meetings).

The grades (A – F) that students received in each of the courses were tallied.

The table is referred to as a contingency table. The row (response) variable is “grade” and the column (predictor) variable is “delivery method”. Each position inside the table is referred to as a cell.

A marginal distribution of a variable is either a frequency or relative frequency distribution of the row or column variable from the contingency table.

(It gets its name from the fact that the frequencies are displayed in either the bottom or right margin of the table.)

EXAMPLE Frequency Marginal Distributions

Find the frequency marginal distributions for course grade (rows) and delivery method (cols).

EXAMPLE Relative Frequency Marginal Distributions

Determine the relative frequency marginal distribution for course grade and delivery method.
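
Marginal distributions are just row and column totals, optionally divided by the grand total. The sketch below uses hypothetical counts and only three grade categories; it is not the professor's table, which is not reproduced here.

```python
# Hypothetical counts: rows are grades, columns are delivery methods.
table = {
    "A": {"Traditional": 10, "Online": 8, "Hybrid": 12},
    "B": {"Traditional": 15, "Online": 10, "Hybrid": 15},
    "F": {"Traditional": 5, "Online": 12, "Hybrid": 3},
}
methods = ["Traditional", "Online", "Hybrid"]

# Frequency marginal distributions (the right and bottom margins).
row_totals = {grade: sum(counts.values()) for grade, counts in table.items()}
col_totals = {m: sum(table[g][m] for g in table) for m in methods}
grand_total = sum(row_totals.values())

# Relative frequency marginal distributions.
row_rel = {g: t / grand_total for g, t in row_totals.items()}
col_rel = {m: t / grand_total for m, t in col_totals.items()}

print(row_totals, col_totals, grand_total)
print(row_rel, col_rel)
```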

A conditional distribution lists the relative frequency of each category of the response variable “y” for a given value of the predictor variable “x” in the contingency table.

In other words, each cell contains a relative frequency value.
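
A conditional distribution divides each cell by the total for its predictor category (here, the column total). The counts below are hypothetical, not the professor's data.

```python
# Hypothetical counts: rows are grades, columns are delivery methods.
table = {
    "A": {"Traditional": 10, "Online": 8, "Hybrid": 12},
    "B": {"Traditional": 15, "Online": 10, "Hybrid": 15},
    "F": {"Traditional": 5, "Online": 12, "Hybrid": 3},
}
methods = ["Traditional", "Online", "Hybrid"]

col_totals = {m: sum(table[g][m] for g in table) for m in methods}

# Conditional distribution of grade (response) given delivery method (predictor).
conditional = {
    m: {g: table[g][m] / col_totals[m] for g in table}
    for m in methods
}

for m in methods:
    print(m, {g: round(p, 3) for g, p in conditional[m].items()})
```

Within each delivery method the relative frequencies sum to 1, so the methods can be compared even when their enrollments differ.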

EXAMPLE Determining a Conditional Distribution

Comment on any association that may exist between course grade and delivery method.

“It appears that students in the hybrid course are more likely to pass than in the other two methods.”

EXAMPLE Drawing a Bar Graph of a Conditional Distribution

Using the results of the previous example, draw a bar graph that represents the conditional distribution of method of delivery (y) by grade earned (x).

The following contingency table shows the survival status by category of passenger on the RMS Titanic on 15 Apr 1912.

The actual total death toll was 1502/2224, or 67.5%.

Draw a conditional bar graph of survival status (y) by passenger category (x).

Simpson’s Paradox represents a situation in which an association between two variables inverts or disappears when the effect of a third (“lurking”) variable is introduced to the analysis.

For example, UC Berkeley was sued for favoring males over females (“gender bias”) in its acceptance rates because the acceptance rate for males was 0.460 and for females was 0.304. However, when the variable “Program of Study” was included, the acceptance rate for females in most programs was actually higher.
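
The reversal can be reproduced with a small Python sketch. The numbers below are invented for illustration (two made-up programs); they are not the Berkeley figures.

```python
# Hypothetical (admitted, applied) counts per program and sex.
applicants = {
    "Program X": {"male": (80, 100), "female": (18, 20)},
    "Program Y": {"male": (4, 20), "female": (30, 100)},
}

# Overall acceptance rate by sex, ignoring program.
overall = {}
for sex in ("male", "female"):
    admitted = sum(applicants[p][sex][0] for p in applicants)
    applied = sum(applicants[p][sex][1] for p in applicants)
    overall[sex] = admitted / applied

# Acceptance rate by sex within each program.
by_program = {
    p: {sex: a / n for sex, (a, n) in groups.items()}
    for p, groups in applicants.items()
}

print(overall)
print(by_program)
```

In this made-up example males have the higher overall rate (0.70 versus 0.40), yet females have the higher rate within each program, because most females applied to the program with the lower acceptance rate.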
