AP Stats: 12.1, part 1 - MATH WITH MAYER

AP Stats: 12.1, part 1 Inference for Linear Regression: Intro, conditions, parameters, computer outputs


Page 2: AP Stats: 12.1, part 1 - MATH WITH MAYER

Is there a linear relationship between the height from which a paper helicopter is released and the time it takes to hit the ground?

Page 3: AP Stats: 12.1, part 1 - MATH WITH MAYER

In this activity, we will perform an experiment to investigate this question.

• Explanatory variable: drop height (inches)
• Response variable: time (seconds)
• Experimental units: 50 rotocopters
• Treatments: 5 different heights
• What kind of lurking variables could there be, and what could we do to control them?
• What/how should we randomize?
• Let’s set some ground rules:
  • Where do we start the measurement?
  • Where do we end the measurement? (Once it hits the ground vs. once it stops spinning… does it matter?)
  • How is it dropped?

Page 4: AP Stats: 12.1, part 1 - MATH WITH MAYER

Within your group, you need to assign jobs:
• Recorder: Records the time for each rotocopter.
• Measurer: Initially measures the perpendicular distance each will drop.
• Dropper: Drops the rotocopter.
• Caller: Calls time when the rotocopter hits the ground (in whatever fashion we consistently define that to be).
• Police (could be more than one person): Makes sure that the procedure is as consistent as possible.

Page 5: AP Stats: 12.1, part 1 - MATH WITH MAYER

Before you start dropping…
• You and your group will find a consistent place to drop the rotocopters. It must be a different height than any other group’s, so get my approval before you start.
• Measure this height to the nearest 0.5 inch.
• Please don’t put yourself in danger by balancing on an unstable surface. I don’t have time to fill out injury reports.
• When you return, add your data to the table and dotplot. Record the data and dotplot on your page as well.
• You have 10 minutes. Go.

Page 6: AP Stats: 12.1, part 1 - MATH WITH MAYER

Let’s review…
• (Every time you see this, the information is written on the “relevant review” page.)

Page 7: AP Stats: 12.1, part 1 - MATH WITH MAYER

Scatterplot
• A scatterplot displays the relationship between two quantitative variables measured on the same individuals.
• An explanatory variable helps explain or influences changes in a response variable; it’s the x.
• A response variable measures an outcome of a study; it’s the y.
• Calling one variable “explanatory” and the other “response” doesn’t necessarily mean that changes in one CAUSE changes in the other. Association does not imply causation.

Page 8: AP Stats: 12.1, part 1 - MATH WITH MAYER

Interpreting Scatterplots
• Direction: Positive vs. negative
• Form: Curved? Linear? Clusters?
• Strength: Determined by how closely the points follow a clear form
• Outliers: Do we see any deviations from the pattern?

Page 9: AP Stats: 12.1, part 1 - MATH WITH MAYER

Correlation
• The correlation measures the strength and direction of the linear relationship between two quantitative variables.
• The correlation r always falls between -1 and +1. The closer r is to zero, the weaker the linear relationship between the two variables. Values of r close to -1 or +1 indicate that the points lie close to a straight line.
• Correlation is written as r.
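To make the definition concrete, here is a minimal Python sketch that computes r as the sum of the products of the z-scores divided by n - 1. The heights and times are made-up illustration values (an assumption, not the class data), and the helper names are hypothetical.

```python
from math import sqrt

# Hypothetical drop heights (inches) and descent times (seconds); NOT the class data.
heights = [60, 72, 84, 96, 108]
times = [1.1, 1.4, 1.5, 1.9, 2.0]

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """r = sum of (z-score of x) * (z-score of y), divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

r = correlation(heights, times)
print(f"r = {r:.3f}")
```

A value of r near +1 here would match a scatterplot whose points rise and lie close to a straight line.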

Page 10: AP Stats: 12.1, part 1 - MATH WITH MAYER

A strong linear relationship can be modeled with a line: $\hat{y} = a + bx$

In the equation…
• $\hat{y}$ is the predicted value of the response variable y for a given value of the explanatory variable x
• b is the slope
  • The predicted change in y (response variable) for each one-unit increase in x (explanatory variable)
• a is the y-intercept
  • The predicted value of the response when the explanatory variable is zero

The least-squares regression line is the best-fitting line for a scatterplot: it makes the sum of the squared residuals as small as possible.
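As a rough illustration of where a and b come from, the sketch below computes the least-squares slope and intercept directly from their formulas. The data are made-up values (an assumption, not the class results), and `least_squares_line` is a hypothetical helper name.

```python
# Hypothetical drop heights (inches) and descent times (seconds); NOT the class data.
heights = [60, 72, 84, 96, 108]
times = [1.1, 1.4, 1.5, 1.9, 2.0]

def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x-bar, y-bar)
    return a, b

a, b = least_squares_line(heights, times)
predicted_at_90 = a + b * 90   # predicted descent time for a 90-inch drop
print(f"y-hat = {a:.4f} + {b:.4f}x; predicted time at 90 in = {predicted_at_90:.3f} s")
```

The slope b reads as the predicted change in descent time (seconds) for each additional inch of drop height.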

Page 11: AP Stats: 12.1, part 1 - MATH WITH MAYER

Given our rotocopter scatterplot, do you think we have a strong enough relationship to model it with a least-squares regression line? How do we find that equation again? Let’s review…

Page 12: AP Stats: 12.1, part 1 - MATH WITH MAYER

To find and plot the line of regression on a graphing calculator

Page 13: AP Stats: 12.1, part 1 - MATH WITH MAYER

Warm up: a and b from AP Stats: Chapter 12, Lesson 1: Inference for Linear Regression

Example: Tipping at a buffet Do customers who stay longer at buffets give larger tips? Charlotte, an AP statistics student who worked at an Asian buffet, decided to investigate this question for her second semester project. While she was doing her job as a hostess, she obtained a random sample of receipts, which included the length of time (in minutes) the party was in the restaurant and the amount of the tip (in dollars). Do these data provide convincing evidence that customers who stay longer give larger tips? Here are the data:

Page 14: AP Stats: 12.1, part 1 - MATH WITH MAYER

After you make the scatterplot…
1. Find the least-squares regression line for predicting descent time from drop height.
2. What part of the least-squares regression line establishes a relationship between the explanatory and response variables?
3. Interpret the slope of the regression line from the previous prompt in context. What is your best guess for the increase in descent time for each additional foot of drop height?

Page 15: AP Stats: 12.1, part 1 - MATH WITH MAYER

Analyzing the Least-Squares Regression Line
• Since the plotted data are only a sample of the true relationship between these two variables, the regression line is only an estimate of the true (population) relationship between the two variables.
• Chances are, a different set of data would produce a different least-squares regression line.

Page 16: AP Stats: 12.1, part 1 - MATH WITH MAYER
Page 17: AP Stats: 12.1, part 1 - MATH WITH MAYER
Page 18: AP Stats: 12.1, part 1 - MATH WITH MAYER

Further analyzing a least-squares regression line
• When analyzing a scatterplot, the question remains: is the relationship really linear, or did it look linear just by chance?
• In the population, how much can we expect the y-variable to change with every one-unit increase in the x-variable? What is the margin of error of this change?

Page 19: AP Stats: 12.1, part 1 - MATH WITH MAYER

Meaning:
• If we want to know the TRUE population slope, we need to do inference.
• If we want to challenge or test assumptions about the relationship between two variables, we may need inference.
• Since we are making inferences (predictions and conclusions about given statements) about scatterplots, we’re examining how well the line fits the data.
• What is truly being examined in inference for linear regression is the residuals: how far the observed y-values fall from the regression line.

Page 20: AP Stats: 12.1, part 1 - MATH WITH MAYER

Speaking of inference… On the back of your scatterplot, #4…
1. Does it seem plausible that there is really no linear relationship between descent time and drop height, and that the observed slope happened just by chance due to the random assignment?
   a. Meaning, if we randomly assign the times to drop heights, then there should be no relationship between time and drop height… is that the case?
   b. How could we use simulation to investigate what values of the slope we could expect to happen by chance due to the random assignment?
• Let’s simulate slope! And as we do so, let’s examine the sampling distribution of slope.
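One way to run such a simulation is sketched below in Python: shuffle the times so they are randomly re-assigned to drop heights, recompute the slope, and repeat many times. The data, the seed, and the 1000 repetitions are all assumptions chosen for illustration; they are not the class results.

```python
import random

# Hypothetical data: two drops at each of five heights (inches), times in seconds.
heights = [60, 60, 72, 72, 84, 84, 96, 96, 108, 108]
times = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.8, 1.9, 2.0, 2.1]

def slope(xs, ys):
    """Least-squares slope b for predicting y from x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

observed = slope(heights, times)

rng = random.Random(12)  # fixed seed so the run is repeatable
simulated = []
for _ in range(1000):
    shuffled = times[:]       # copy the times...
    rng.shuffle(shuffled)     # ...and break any link between height and time
    simulated.append(slope(heights, shuffled))

# Proportion of shuffles whose slope is at least as far from 0 as the observed slope.
as_extreme = sum(1 for s in simulated if abs(s) >= abs(observed))
proportion = as_extreme / len(simulated)
print(f"observed slope = {observed:.4f}, proportion as extreme = {proportion:.3f}")
```

If almost none of the shuffled slopes are as extreme as the observed one, "no relationship, just chance" is not a plausible explanation for the observed slope.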

Page 21: AP Stats: 12.1, part 1 - MATH WITH MAYER

Sampling distribution of slope, b
• Your SRS of n observations (x, y) from a population of size N has a defined relationship (that you probably do not know) of predicted $y = \alpha + \beta x$.
• When conditions are met, the sampling distribution for the slope of the sample regression line has the following characteristics:
  • Shape: approximately normal (as long as the values of the response variable, y, follow a normal distribution for each value of the explanatory variable x)
  • Center: $\mu_b = \beta$
  • Spread: the standard deviation of the sampling distribution is $\sigma_b = \dfrac{\sigma}{\sigma_x \sqrt{n}}$ (as long as the 10% condition is satisfied)
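The center and spread claims can be checked with a small simulation. The Python sketch below assumes a made-up population model (the values of α, β, and σ are hypothetical), keeps the x-values fixed, generates many samples with normal scatter about the true line, and compares the simulated slopes with β and with σ divided by (σ_x times √n), where σ_x is the standard deviation of the x-values computed with divisor n.

```python
import random
from math import sqrt
from statistics import mean, pstdev, stdev

# Hypothetical population parameters (assumptions for illustration only).
ALPHA, BETA, SIGMA = 0.0, 0.02, 0.15
xs = [60, 72, 84, 96, 108] * 6   # fixed x-values, n = 30

def slope(xs, ys):
    """Least-squares slope b for predicting y from x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(1)  # fixed seed so the run is repeatable
slopes = []
for _ in range(2000):
    # Each y is normal around the true line with the same SD at every x.
    ys = [ALPHA + BETA * x + rng.gauss(0, SIGMA) for x in xs]
    slopes.append(slope(xs, ys))

theory_sd = SIGMA / (pstdev(xs) * sqrt(len(xs)))
print(f"mean of slopes = {mean(slopes):.5f} (beta = {BETA})")
print(f"SD of slopes   = {stdev(slopes):.5f} (formula = {theory_sd:.5f})")
```

The simulated mean should land near β and the simulated SD near the formula value, with small differences from run to run.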

Page 22: AP Stats: 12.1, part 1 - MATH WITH MAYER

Conditions for Regression Inference: LINER
• Linear model is appropriate
• Independent observations
• Normal distribution of y for each x
• Equal SD of y for each x
• Random sample or randomized experiment

Page 23: AP Stats: 12.1, part 1 - MATH WITH MAYER

How to check conditions for inference
Start by making a histogram and a normal probability plot of the residuals, plus a residual plot.
• Linear: Does the scatterplot look linear-ish? Look at the residual plot: it shouldn’t look curved, and the residuals should be scattered around the residual = 0 line.

Page 24: AP Stats: 12.1, part 1 - MATH WITH MAYER

How to check conditions for inference (continued)
• Independent: There should be random sampling or random assignment to ensure independence; check the 10% condition if sampling is done without replacement.
• Normal: Check a stemplot or histogram of the residuals for normality; check the normal probability plot of the residuals (it should be roughly a line).
• Equal SD: The vertical spread of the residuals on the residual plot should be roughly the same for all x-values (constant-ish height of residuals along the residual = 0 line).
• Random: Again, check for random assignment/sampling.

Page 25: AP Stats: 12.1, part 1 - MATH WITH MAYER

Let’s Review…
• Residual wha?

Page 26: AP Stats: 12.1, part 1 - MATH WITH MAYER

Residuals defined
• A residual is the difference between an observed value of the response variable and the value predicted by the regression line.
• Residual = observed y – predicted y
• Residual = $y - \hat{y}$
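The definition translates directly into a few lines of Python. This is a small sketch using made-up heights and times (an assumption, not the class data): it fits the least-squares line, then subtracts each predicted value from the observed value.

```python
# Hypothetical drop heights (inches) and descent times (seconds); NOT the class data.
heights = [60, 72, 84, 96, 108]
times = [1.1, 1.4, 1.5, 1.9, 2.0]

# Fit the least-squares line y-hat = a + b*x.
n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(times) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, times))
     / sum((x - x_bar) ** 2 for x in heights))
a = y_bar - b * x_bar

# Residual = observed y - predicted y, one per data point.
predicted = [a + b * x for x in heights]
residuals = [y - y_hat for y, y_hat in zip(times, predicted)]

for x, y, e in zip(heights, times, residuals):
    print(f"height {x:3d} in: observed {y:.2f} s, residual {e:+.3f} s")
```

A positive residual means the point sits above the line; for a least-squares line the residuals always add up to zero (apart from rounding).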

Page 27: AP Stats: 12.1, part 1 - MATH WITH MAYER

Residual Plots
• It’s a scatterplot of the regression residuals against the x variable.
• They help us assess how well a regression line fits the data.
• The closer the residuals are to the horizontal line at residual = 0, the better the fit of the line to the data.
• If a line is the best model for a scatterplot, the residual plot should show no obvious pattern; a curved pattern shows that the relationship is not linear and a straight line may not be the best model.

Page 28: AP Stats: 12.1, part 1 - MATH WITH MAYER

Making a residual plot on the calculators

1. X should be in L1, Y in L2.
2. Put cursor over L3; press 2nd, Stat/Resid. Press Enter.
3. Press Enter again. A list of residuals should be in L3.
4. Turn all plots off under 2nd/Y= to see Stat Plot. Also, check Y=. If you had a regression line plotted, its equation is there, and if it confuses you to have it there, clear it from Y1.
5. Create new scatterplot. Xlist: L1 (explanatory variable) and Ylist: L3 (residuals)
6. Zoom/9:ZoomStat

Page 29: AP Stats: 12.1, part 1 - MATH WITH MAYER

Normal Probability Plot
• The more nearly normal the distribution of the data, the closer the plot is to a straight line.

Page 30: AP Stats: 12.1, part 1 - MATH WITH MAYER

Normal Probability Plot
• Step 1: Take your data and put it into a list.
• Step 2: Make a Stat Plot: 2nd button, Y=
• Step 3: Turn Stat Plot on
• Step 4: Choose Probability Plot (last icon of the six)
• Make sure your list is identified in “Data List” (you can change it by using 2nd, and buttons 1-6, depending on which list it is)
• Zoom: choose 9:ZoomStat
• Press Graph: a roughly straight line suggests a normal distribution; a clearly curved pattern does not.
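For anyone curious what the calculator is plotting, here is a rough Python sketch of the idea: sort the data, then pair each value with the z-score a normal distribution would expect at that position. The percentile rule (i + 0.5)/n is one common convention and is an assumption here (the calculator may use a slightly different one); the residuals are made-up values and `normal_plot_points` is a hypothetical helper name.

```python
from statistics import NormalDist

# Hypothetical residuals (seconds); NOT the class data.
residuals = [-0.02, 0.05, -0.08, 0.09, -0.04]

def normal_plot_points(data):
    """Pair each sorted data value with its expected standard normal z-score."""
    n = len(data)
    ordered = sorted(data)
    std_normal = NormalDist()  # mean 0, SD 1
    return [(value, std_normal.inv_cdf((i + 0.5) / n))
            for i, value in enumerate(ordered)]

points = normal_plot_points(residuals)
for value, z in points:
    print(f"data value {value:+.2f}  expected z {z:+.2f}")
```

Plotting these pairs gives the normal probability plot: if the points fall close to a straight line, the data are roughly normal.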

Page 31: AP Stats: 12.1, part 1 - MATH WITH MAYER

Let’s make these displays for our rotocopter data using the graphing calculators.

Page 32: AP Stats: 12.1, part 1 - MATH WITH MAYER

Example 1: Does seat location matter?
• Many people believe that students learn better if they sit closer to the front of the classroom. Does sitting closer cause higher achievement, or do better students simply choose to sit in the front? To investigate, an AP Stats teacher randomly assigned students to seat locations in his classroom for a particular chapter and recorded the test score for each student at the end of the chapter. The explanatory variable in this experiment is which row the student was assigned (row 1 is the closest to the front and row 7 is the farthest away). Here are the results. A scatterplot, residual plot, histogram, and Normal probability plot of the residuals are also shown.

Row 1: 76, 77, 94, 99
Row 2: 83, 85, 74, 79
Row 3: 90, 88, 68, 78
Row 4: 94, 72, 101, 72, 79
Row 5: 76, 65, 90, 67, 96
Row 6: 88, 79, 90, 83
Row 7: 79, 76, 77, 63

Page 33: AP Stats: 12.1, part 1 - MATH WITH MAYER
Page 34: AP Stats: 12.1, part 1 - MATH WITH MAYER

Example 1:
a. Check whether the conditions for performing inference about the regression model are met.

Solution:
• Linear: The scatterplot shows a weak linear relationship and the residual plot does not show any obvious leftover patterns.
• Independent: Students were randomly assigned to seats and were monitored for cheating, so knowing the score for one student should give no additional information about another student’s score.
• Normal: The histogram is roughly symmetric and unimodal, and the Normal probability plot is roughly linear.
• Equal SD: Although there is a different amount of variability in each row, the differences aren’t very large and there is no systematic pattern, such as increasing variability as x increases.
• Random: The students were assigned to seats at random.

Because there are no serious violations of the conditions, we should be safe performing inference about the regression model in this setting.
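The Equal SD check can also be backed up with numbers. The Python sketch below computes the standard deviation of the scores in each row, using the scores as listed in the example. Because every student in a row shares the same x-value, the SD of the scores within a row equals the SD of that row’s residuals. This is only a supplementary check (an assumption about how one might do it), not part of the original solution.

```python
from statistics import stdev

# Test scores by assigned row, as listed in Example 1.
scores_by_row = {
    1: [76, 77, 94, 99],
    2: [83, 85, 74, 79],
    3: [90, 88, 68, 78],
    4: [94, 72, 101, 72, 79],
    5: [76, 65, 90, 67, 96],
    6: [88, 79, 90, 83],
    7: [79, 76, 77, 63],
}

# Sample SD of the scores within each row.
sd_by_row = {row: stdev(scores) for row, scores in scores_by_row.items()}

for row, sd in sd_by_row.items():
    print(f"Row {row}: n = {len(scores_by_row[row])}, SD = {sd:.2f}")
```

What to look for is a systematic trend, such as the SDs steadily growing from row 1 to row 7, rather than exact equality.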

Page 35: AP Stats: 12.1, part 1 - MATH WITH MAYER

Parameters in the Regression Model

                                      Statistic that estimates…   …the parameter
y-intercept                           a                           α
slope                                 b                           β
Standard deviation of the residuals   s                           σ

Page 36: AP Stats: 12.1, part 1 - MATH WITH MAYER

Reading Computer Outputs

Predictor   Coef      SE Coef   T       P
Constant    85.706    4.239     20.22   0.000
Row         -1.1171   0.9472    -1.18   0.248

S = 10.0673   R-Sq = 4.7%   R-Sq(adj) = 1.3%

a = y-intercept, estimates α
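The other entries in the output can be read the same way. As a small sketch, the Python below copies the values from the output into named variables (the variable names are just labels chosen here) and uses them to write the regression equation and make a prediction.

```python
# Values copied from the computer output.
a = 85.706      # "Constant" Coef: the y-intercept, estimates alpha
b = -1.1171     # "Row" Coef: the slope, estimates beta
se_b = 0.9472   # "Row" SE Coef: the standard error of the slope
s = 10.0673     # "S": standard deviation of the residuals, estimates sigma

def predicted_score(row):
    """Predicted test score for a student assigned to the given row."""
    return a + b * row

print(f"predicted score = {a} + ({b})(row)")
print(f"row 1: {predicted_score(1):.2f}, row 7: {predicted_score(7):.2f}")
```

The slope row of the output is the one that matters most for inference: it holds b, its standard error, and the t statistic and P-value for testing the slope.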

Page 37: AP Stats: 12.1, part 1 - MATH WITH MAYER

Example 1 continued:
b. Using the output, write the equation for the least-squares regression line for predicting grade from the row assigned.
c. The model for regression inference has three parameters: α, β, and σ. Explain what each parameter represents in context. Then provide an estimate for each.
d. Identify the standard error of the slope SEb from the computer output. Interpret this value in context.

Page 38: AP Stats: 12.1, part 1 - MATH WITH MAYER

Example 1: Solutions
b. Using the output, write the equation for the least-squares regression line for predicting grade from the row assigned.
c. The model for regression inference has three parameters: α, β, and σ. Explain what each parameter represents in context. Then provide an estimate for each.
d. Identify the standard error of the slope SEb from the computer output. Interpret this value in context.

Page 39: AP Stats: 12.1, part 1 - MATH WITH MAYER

Inference Method #1: Confidence Interval of slope, β
• It provides an interval of plausible values for the true slope.
• Formula: $b \pm t^* \, SE_b$ → __ < β < __ (df = n – 2)
• (The standard error of the slope is usually provided; you are not expected to know how to find it.)
• 4 steps!

Page 40: AP Stats: 12.1, part 1 - MATH WITH MAYER

• …context.
• (b) Calculate the 95% confidence interval for the true slope. Show your work.
• (c) Interpret the interval from part (b) in context.
• (d) Is there convincing evidence that seat location affects scores?
• Solution:
  (a) SEb = 0.9472. If we repeated the random assignment many times, the slope of the estimated regression line would typically vary by about 0.9472 from the slope of the true regression line for predicting test score from row number.
  (b) Because n = 30, df = 30 – 2 = 28 and t* = 2.048. The 95% confidence interval is −1.1171 ± 2.048(0.9472) = −1.1171 ± 1.9399 = (−3.0570, 0.8228).
  (c) We are 95% confident that the interval from −3.0570 to 0.8228 captures the slope of the true regression line relating a student’s test score y and the student’s row number x.
  (d) Because the interval of plausible slopes includes 0, we do not have convincing evidence that seat location affects test scores.
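The arithmetic in part (b) can be double-checked with a few lines of Python. This sketch takes b and SEb from the computer output and the critical value t* = 2.048 from the solution (it is read from a t table, not computed here).

```python
b = -1.1171      # slope from the computer output
se_b = 0.9472    # standard error of the slope from the computer output
n = 30           # number of students
t_star = 2.048   # critical value for 95% confidence, taken from the solution

df = n - 2
margin = t_star * se_b                 # margin of error
lower, upper = b - margin, b + margin  # b +/- t* x SEb

print(f"df = {df}, margin of error = {margin:.4f}")
print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
print("Interval contains 0:", lower < 0 < upper)
```

Since 0 lies inside the interval, a true slope of 0 (no linear relationship) remains plausible, which matches the conclusion in part (d).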

Page 41: AP Stats: 12.1, part 1 - MATH WITH MAYER

Inference Method #2: Significance Test of the Slope
• Test statistic formula: $t = \dfrac{b - \beta_0}{SE_b}$, with df = n – 2
• Big ideas are the same: use 4 steps
• The value of the test statistic and the P-value are typically provided on computer output
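As a quick check of the formula, the Python sketch below plugs in the slope and standard error from the seat-location output, assuming the usual null hypothesis that the true slope is 0. The P-value itself is read from the output rather than computed here.

```python
b = -1.1171      # slope from the computer output
se_b = 0.9472    # standard error of the slope from the computer output
beta_0 = 0       # hypothesized slope under the null (assumed: no linear relationship)
n = 30           # number of students

t = (b - beta_0) / se_b   # test statistic
df = n - 2                # degrees of freedom

print(f"t = {t:.2f} with df = {df}")
```

The result agrees with the T column in the Row line of the output (-1.18), whose P-value of 0.248 is listed next to it.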

Page 42: AP Stats: 12.1, part 1 - MATH WITH MAYER