Topics 26 - 28



Page 1: Topics 26 - 28

Topics 26 - 28

Page 2: Topics 26 - 28

Relationships in Data

Topic 26

Graphical Displays of Association

Page 3: Topics 26 - 28

Activity 26-1: House Prices – Page 570

Scatterplot – a graphical display of the data (Page 570)

Horizontal axis – explanatory variable
Vertical axis – response variable

Association – Two variables display an association if knowing the value of one variable is useful in predicting the value of the other variable (Page 571)

Three aspects of the association between quantitative variables: (Page 571)

Direction (positive or negative)
Strength (strong, moderate, or weak)
Form (linear or curved)
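Below is a minimal sketch of how such a scatterplot might be drawn in Python; the house-size and price numbers are made up for illustration and are not the data from Activity 26-1.

import matplotlib.pyplot as plt

# Hypothetical data: house size (explanatory) and price (response)
size_sqft = [1200, 1500, 1800, 2100, 2400, 2700]
price_thousands = [180, 210, 260, 300, 330, 390]

plt.scatter(size_sqft, price_thousands)
plt.xlabel("Size (square feet)")     # explanatory variable on the horizontal axis
plt.ylabel("Price ($1000s)")         # response variable on the vertical axis
plt.title("Scatterplot of house price vs. size")
plt.show()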

Page 4: Topics 26 - 28

A categorical variable can be incorporated into a scatterplot by constructing a labeled scatterplot, which assigns different labels to the dots based on the category of the observational unit.

For example, you might indicate observations coming from males with the label M and from females with the label F.
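As a sketch, a labeled scatterplot like this could be produced as follows; the foot-length, height, and sex values are hypothetical placeholders, not data from the text.

import matplotlib.pyplot as plt

# Hypothetical data: foot length (explanatory), height (response), and sex labels
foot_length = [24, 25, 26, 27, 28, 29]
height = [160, 163, 170, 172, 178, 183]
sex = ["F", "F", "M", "F", "M", "M"]

# Plot each observation as its category label instead of a plain dot
for x, y, label in zip(foot_length, height, sex):
    plt.text(x, y, label)

plt.xlim(23, 30)
plt.ylim(155, 188)
plt.xlabel("Foot length (cm)")
plt.ylabel("Height (cm)")
plt.title("Labeled scatterplot (M = male, F = female)")
plt.show()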

Remember that an observed association does not imply that a cause-and-effect relationship exists between the two variables.

Page 5: Topics 26 - 28

Activity 26-3: Car Data – Page 573

Page 6: Topics 26 - 28

Relationships in Data

Topic 27

Correlation Coefficient

Page 7: Topics 26 - 28

Correlation

Correlation measures the degree of linear association between two quantitative variables.

But even when two variables display a nonlinear relationship, the correlation between them still might be quite high when there is a strong increasing or decreasing trend.

Even when a relationship is clearly curved rather than linear, the correlation can still be fairly high. Do not assume from a high correlation coefficient that the relationship between the variables must be linear.

Always look at a scatterplot, in conjunction with the correlation coefficient, to assess the form (linear or not) of the association.

Page 8: Topics 26 - 28

Correlation Coefficient (r)

No matter how close a correlation coefficient (r) is to 1, and no matter how strong the association between two variables, a cause-and-effect conclusion cannot necessarily be drawn from observational data.

There are far more plausible explanations than cause and effect for why countries with many televisions per thousand people tend to have long life expectancies. For example, a country's technological sophistication is related to both its number of televisions and its life expectancy.

Page 9: Topics 26 - 28

Correlation Coefficient (r)

The correlation coefficient (r) is a number that measures the direction and strength of linear association between two quantitative variables.

A correlation coefficient is a number! In fact, it is a number between -1 and 1, inclusive.

Always examine a scatterplot in addition to calculating a correlation coefficient. A clear nonlinear relationship can have a small (close to zero) correlation, and a correlation can be close to -1 or 1 even if the relationship follows a curve or other nonlinear pattern.
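A quick numerical illustration of both cautions, as a sketch with made-up data (not from the text): a U-shaped relationship has a correlation near zero, while a curved but steadily increasing relationship has a correlation close to 1.

import numpy as np

# A U-shaped (parabolic) relationship: strong pattern, but r is near zero
x = np.linspace(-3, 3, 31)
r_parabola = np.corrcoef(x, x ** 2)[0, 1]

# A curved but steadily increasing relationship: not linear, yet r is high
x_pos = np.linspace(1, 10, 31)
r_curve = np.corrcoef(x_pos, np.sqrt(x_pos))[0, 1]

print(f"r for the U-shaped pattern: {r_parabola:.3f}")   # approximately 0
print(f"r for the increasing curve: {r_curve:.3f}")      # close to 1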

Page 10: Topics 26 - 28

Correlation Coefficient (r)

The slope, or steepness, of the points in a scatterplot is unrelated to the value of the correlation coefficient.

If the points fall on a perfectly straight line with a positive slope, then the correlation coefficient equals 1.0 whether that slope is very steep or not steep at all.

What matters for the magnitude of the correlation is how closely the points concentrate around a line, not the steepness of that line.
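A small sketch that makes this concrete with illustrative numbers: two sets of points, one on a very steep line and one on a nearly flat line, both give r equal to 1.

import numpy as np

x = np.arange(1, 11)
steep = 50 * x + 3      # points on a very steep line
shallow = 0.1 * x + 3   # points on a nearly flat line

# Both sets fall exactly on a straight line with positive slope,
# so the correlation coefficient is 1 in both cases.
print(np.corrcoef(x, steep)[0, 1])
print(np.corrcoef(x, shallow)[0, 1])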

Page 11: Topics 26 - 28

Correlation Coefficient (r)

Before calculating the correlation, you need to enable the diagnostics option: press 2nd, 0 (CATALOG), scroll down to DiagnosticOn, and press ENTER twice.

Enter the data into L1 and L2 in the calculator: go to STAT, EDIT.

Run least squares regression: go to STAT, CALC, 8: LinReg(a+bx), enter L1, L2 (the lists where you entered the data), and press ENTER to calculate the correlation coefficient (r) and/or the coefficient of determination (r²).
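The same quantities can be checked outside the calculator. A minimal Python sketch, with placeholder numbers standing in for the lists L1 and L2:

import numpy as np

# Placeholder data standing in for the calculator lists L1 and L2
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])    # explanatory variable (L1)
y = np.array([3.1, 5.9, 7.2, 9.8, 13.1])   # response variable (L2)

r = np.corrcoef(x, y)[0, 1]     # correlation coefficient
print(f"r   = {r:.4f}")
print(f"r^2 = {r ** 2:.4f}")    # coefficient of determination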

Page 12: Topics 26 - 28

Relationships in Data

Topic 28

Least Squares Regression

Page 13: Topics 26 - 28

Linear Equation

The equation of a generic line can be written as ŷ = a + bx (in algebra class, y = mx + b), where:

y denotes the response variable
x denotes the explanatory variable (also called the predictor variable)
a is the y-intercept
b is the slope of the line

The terms "least squares line" and "regression line" are used interchangeably. For example, if x represents foot length and y represents height, it is good form to use those variable names in the equation.

The caret on the y (read as "y-hat") indicates that its values are predicted, not actual, heights.

Page 14: Topics 26 - 28

Residuals

One way to measure the "fit" of a line is to calculate the residuals for all of the observational units.

A residual is the difference between the observed y value and the y value predicted by your line for the corresponding x value.

In other words, the residual is the vertical distance from an observation to the regression line.
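As a sketch of the residual calculation, assuming a candidate line ŷ = a + bx has already been chosen (the data and line below are placeholders, not from the text):

import numpy as np

# Placeholder data and a candidate line y_hat = a + b*x
x = np.array([24.0, 26.0, 27.0, 29.0, 31.0])        # e.g., foot length
y = np.array([163.0, 169.0, 170.0, 178.0, 183.0])   # e.g., observed height
a, b = 100.0, 2.6                                    # intercept and slope of the candidate line

y_hat = a + b * x        # values predicted by the line
residuals = y - y_hat    # observed minus predicted (vertical distances)

for xi, yi, ri in zip(x, y, residuals):
    print(f"x = {xi:5.1f}   observed y = {yi:6.1f}   residual = {ri:+6.2f}")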

Page 15: Topics 26 - 28

Regression Line

One of the primary uses of regression is prediction.

You can use the regression line to predict the value of the y-variable for a given value of the x-variable simply by plugging that value of x into the equation of the regression line. This process is equivalent to finding the y-value of the point on the regression line corresponding to the x-value of interest.

A common criterion for determining the "best" line is the sum of squared residuals (SSE).

The line that achieves the exact minimum value of the sum of the squared residuals is called the least squares line, or the regression line.

Remember to provide measurement units when reporting predictions. In other words, be clear that the predicted height is in inches, not centimeters or any other units.
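A sketch that ties these ideas together: computing SSE for a guessed line, finding the least squares line, and using it for a prediction. The data are placeholders, and numpy's polyfit stands in for the coefficient formulas given later on Page 20.

import numpy as np

x = np.array([24.0, 26.0, 27.0, 29.0, 31.0])        # placeholder explanatory values
y = np.array([163.0, 169.0, 170.0, 178.0, 183.0])   # placeholder response values

def sse(a, b):
    # Sum of squared residuals for the line y_hat = a + b*x
    return np.sum((y - (a + b * x)) ** 2)

print("SSE for a guessed line:", sse(100.0, 2.6))

# The least squares line is the intercept/slope pair that minimizes SSE
b_ls, a_ls = np.polyfit(x, y, 1)    # polyfit returns [slope, intercept]
print("Least squares line: y_hat = %.2f + %.3f x" % (a_ls, b_ls))
print("SSE for the least squares line:", sse(a_ls, b_ls))

# Prediction: plug an x-value into the equation of the regression line
print("Predicted y at x = 28:", a_ls + b_ls * 28.0)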

Page 16: Topics 26 - 28

Interpolation

Interpolation means trying to predict the response variable for values of the explanatory variable within the range of those contained in the data.

Page 17: Topics 26 - 28

Extrapolation

Extrapolation means trying to predict the response variable for values of the explanatory variable beyond those contained in the data. When you have no information about the behavior of the data outside the values contained in your dataset (e.g., you have no reason to believe the relationship between height and foot length remains roughly linear beyond these values), extrapolation is not advisable.

Page 18: Topics 26 - 28

An observation is considered influential if removing it from the dataset substantially changes the least squares regression equation.

Typically, observations that have extreme explanatory (x) variable values (far below or far above the sample mean x-bar) have more potential to be influential.
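One way to check influence numerically is to fit the least squares line with and without the suspect observation and compare the two equations. A sketch with placeholder data, where the last point has an extreme x-value:

import numpy as np

# Placeholder data; the final observation has an extreme x-value
x = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 20.0])
y = np.array([4.1, 6.2, 7.8, 10.1, 12.0, 10.0])

def fit(xs, ys):
    slope, intercept = np.polyfit(xs, ys, 1)
    return intercept, slope

a_all, b_all = fit(x, y)
a_drop, b_drop = fit(x[:-1], y[:-1])    # refit with the extreme point removed

print("With all points:        y_hat = %.2f + %.3f x" % (a_all, b_all))
print("Without the last point: y_hat = %.2f + %.3f x" % (a_drop, b_drop))
# A large change in the slope or intercept suggests the point is influential.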

Page 19: Topics 26 - 28

The Coefficient of Determination

The coefficient of determination is equal to the square of the correlation coefficient, so it is denoted by r² (where r is the correlation coefficient).

r² does not represent the proportion of points that fall on the line, nor the proportion of the y-variable that is explained by the x-variable.

Rather, r² is the proportion of the variability in the y-variable that is explained by the least squares line with the x-variable.

Of course, when writing your interpretation in a given context, use the variable names rather than generic x and y labels.
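As a sketch, this interpretation can be checked numerically: r² should equal the proportion of variability in y explained by the regression line, that is, 1 - SSE/SST (placeholder data again):

import numpy as np

x = np.array([24.0, 26.0, 27.0, 29.0, 31.0])        # placeholder data
y = np.array([163.0, 169.0, 170.0, 178.0, 183.0])

b, a = np.polyfit(x, y, 1)      # least squares slope and intercept
y_hat = a + b * x

sse = np.sum((y - y_hat) ** 2)        # variability left unexplained (residuals)
sst = np.sum((y - y.mean()) ** 2)     # total variability in y

r = np.corrcoef(x, y)[0, 1]
print("r^2 from the correlation:   ", r ** 2)
print("1 - SSE/SST from regression:", 1 - sse / sst)   # the two agree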

Page 20: Topics 26 - 28

You have not yet considered how to calculate the slope and intercept coefficients of the least squares line. Let the equation of a generic least squares line be ŷ = a + bx. The most convenient expressions for calculating the intercept and slope coefficients of the least squares line involve the means and standard deviations of the two variables, along with the correlation coefficient between them. It turns out that the slope can be calculated as b = r * (sy / sx). The intercept coefficient can then be calculated as a = y-bar - b * x-bar.
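A sketch verifying these formulas against a direct least squares fit, with placeholder data (the sample standard deviations are used, i.e. ddof=1 in numpy):

import numpy as np

x = np.array([24.0, 26.0, 27.0, 29.0, 31.0])        # placeholder data
y = np.array([163.0, 169.0, 170.0, 178.0, 183.0])

r = np.corrcoef(x, y)[0, 1]
sx, sy = x.std(ddof=1), y.std(ddof=1)    # sample standard deviations

b = r * sy / sx                 # slope from the formula
a = y.mean() - b * x.mean()     # intercept from the formula

# Compare with a direct least squares fit
b_fit, a_fit = np.polyfit(x, y, 1)
print("formula: a = %.4f, b = %.4f" % (a, b))
print("polyfit: a = %.4f, b = %.4f" % (a_fit, b_fit))   # the two agree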

Page 21: Topics 26 - 28

Transformation

When a straight line is not the best mathematical model for a relationship, you can often transform one or both variables to make the association more linear. A transformation is a mathematical function applied to a variable, re-expressing that variable on a different scale. Common transformations include the logarithm, square root, and other powers. Often trial and error is needed to select the transformation that establishes a linear relationship.
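As a sketch with made-up data that follow an exponential pattern, a log transformation of y makes the relationship essentially linear:

import numpy as np

# Made-up data following an exponential pattern y = 2 * 1.5^x
x = np.arange(1, 11)
y = 2.0 * 1.5 ** x

r_raw = np.corrcoef(x, y)[0, 1]           # curved relationship
r_log = np.corrcoef(x, np.log(y))[0, 1]   # after the log transformation

print(f"r using y:      {r_raw:.3f}")
print(f"r using log(y): {r_log:.3f}")     # essentially 1: the pattern is now linear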

Page 22: Topics 26 - 28

Exercise 28-6: Airfares – Page 634

Enter the data into L1 and L2 in the calculator.

Go to STAT, EDIT

Run least squares regression: go to STAT, CALC, 8: LinReg(a+bx), enter L1, L2 (the lists where you entered the data), and press ENTER to calculate the regression line.

Page 23: Topics 26 - 28

Review

Notice that the word "coefficient" appears often here. Be especially careful not to confuse a slope coefficient with a correlation coefficient.

A common theme in statistical modeling is to think of each data point as being composed of two parts: the part that is explained by the model (often called the fit) and the "leftover" part (often called the residual), which is the result either of chance variation or of variables you have not yet considered or measured.

In the context of least squares regression, the fitted value for an observation is simply the y-value that the regression line would predict for the x-value of that observation (i.e., the fitted value is ŷ). The residual is the difference between the actual y-value and the fitted value ŷ (residual = actual - fitted), so the residual measures the vertical distance from the observed y-value to the regression line.

Page 24: Topics 26 - 28

Review

Be sure to subtract in the correct order (observed minus predicted) when calculating a residual. (Remember that points above the line have positive residuals.) Never take a prediction very seriously if it results from extrapolating well beyond the actual data. Remember not to generalize from the sample data to a larger population unless the sample was drawn randomly or you have some other reason to believe the sample is representative of the population.

Page 25: Topics 26 - 28

Exercise 28-26: Cricket Thermometers – Page 642