AP Stats: 12.1, part 1 - MATH WITH MAYER

AP Stats: 12.1, part 1 Inference for Linear Regression: Intro, conditions, parameters, computer outputs

Is there a linear relationship between the height from which a paper helicopter is released and the time it takes to hit the ground?

In this activity, we will perform an experiment to investigate this question.

• Explanatory variable: drop height (inches)
• Response variable: time (seconds)
• Experimental units: 50 rotocopters
• Treatments: 5 different heights

• What kind of lurking variables could there be, and what could we do to control them?
• What/how should we randomize?

Let’s set some ground rules:

• Where do we start the measurement?
• Where do we end the measurement? (Once it hits the ground vs. once it stops spinning… does it matter?)
• How is it dropped?

Within your group, you need to assign jobs:

• Recorder: Records the time for each rotocopter.
• Measurer: Initially measures the perpendicular distance each will drop.
• Dropper: Drops the rotocopter.
• Caller: Calls time when the rotocopter hits the ground (in whatever fashion we consistently define that to be).
• Police (could be more than one person): Makes sure that the procedure is as consistent as possible.

Before you start dropping…

• You and your group will find a consistent place to drop the rotocopters. It must be a different height than any other group, so get my approval before you start.
• Measure this height to the nearest 0.5 inch.
• Please don’t put yourself in danger by balancing on an unstable surface. I don’t have time to fill out injury reports.
• When you return, add your data to the table and dotplot. Record the data and dotplot on your page as well.
• You have 10 minutes. Go.

Let’s review…

• (Every time you see this, this information is written on the “relevant review” page.)

Scatterplot

• A scatterplot displays the relationship between two quantitative variables measured on the same individuals.
• An explanatory variable helps explain or influences changes in a response variable; it’s the x.
• A response variable measures an outcome of a study; it’s the y.
• Calling one variable “explanatory” and the other “response” doesn’t necessarily mean that changes in one CAUSE changes in the other. Association does not imply causation.

Interpreting Scatterplots

• Direction: Positive vs. negative
• Form: Curved? Linear? Clusters?
• Strength: Determined by how closely the points follow a clear form
• Outliers: Do we see any deviations from the pattern?

Correlation

• The correlation, written as r, measures the strength and direction of the linear relationship between two quantitative variables.
• The correlation r always falls between -1 and +1. The closer r is to zero, the weaker the linear relationship between the two variables. Values of r close to -1 or +1 indicate that the points lie close to a straight line.

A strong linear relationship can be modeled with a line: $\hat{y} = a + bx$

In the equation…

• $\hat{y}$ is the predicted value of the response variable y for a given value of the explanatory variable x.
• b is the slope: the predicted change in y (response variable) for each one-unit increase in x (explanatory variable).
• a is the y-intercept: the predicted value of the response when the explanatory variable is zero.

The least-squares regression line gives the best fit for a scatterplot: it makes the sum of the squared residuals as small as possible.
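For anyone who wants to see the arithmetic behind the calculator, here is a minimal Python sketch that finds a, b, and r using the standard relationships b = r(s_y / s_x) and a = ȳ − b·x̄. The five (x, y) pairs are made-up illustrative values, not class data, and the function name `least_squares_line` is just a label chosen for this sketch.

```python
import statistics

def least_squares_line(x, y):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)
    # Correlation: sum of the products of z-scores, divided by n - 1.
    r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
            for xi, yi in zip(x, y)) / (n - 1)
    b = r * s_y / s_x          # slope
    a = y_bar - b * x_bar      # y-intercept
    return a, b, r

# Made-up illustrative data (NOT the class rotocopter data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b, r = least_squares_line(x, y)
print(f"y-hat = {a:.2f} + {b:.2f}x, r = {r:.4f}")
```

The result should agree with what LinReg(a+bx) reports on the calculator for the same two lists.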

Given our rotocopter scatterplot, do you think we have a strong enough relationship to model it with a least-squares regression line? How do we find that equation again? Let’s review…

To find and plot the line of regression on graphing calculator

Warm up: a and b from AP Stats: Chapter 12, Lesson 1: Inference for Linear Regression

Example: Tipping at a buffet

Do customers who stay longer at buffets give larger tips? Charlotte, an AP Statistics student who worked at an Asian buffet, decided to investigate this question for her second semester project. While she was doing her job as a hostess, she obtained a random sample of receipts, which included the length of time (in minutes) the party was in the restaurant and the amount of the tip (in dollars). Do these data provide convincing evidence that customers who stay longer give larger tips? Here are the data:

After you make the scatterplot…

1. Find the least-squares regression line for predicting descent time from drop height.
2. What part of the least-squares regression line establishes a relationship between the explanatory and response variables?
3. Interpret the slope of the regression line from the previous prompt in context. What is your best guess for the increase in descent time for each additional foot of drop height?

Analyzing the Least-Squares Regression Line

• Since the plotted data are only a sample from the true relationship between these two variables, the regression line is only an estimate of the true population relationship between the two variables.
• Chances are, a different set of data would produce a different least-squares regression line.

Further analyzing a least-squares regression line

• When analyzing a scatterplot, the question remains: is the relationship really linear, or did it look linear just by chance?
• In the population, how much can we expect the y-variable to change with each one-unit increase in the x-variable? What is the margin of error of this change?

Meaning:

• If we want to know the TRUE population slope, we need to do inference.
• If we want to challenge or test assumptions about the relationship between two variables, we may need inference.
• Since we are making inferences (predictions and conclusions about given statements) about scatterplots, we’re examining how well the line fits the data.
• What is truly being examined in inference for linear regression is the residuals of the bivariate data.

Speaking of inference… On the back of your scatterplot, #4…

1. Does it seem plausible that there is really no linear relationship between descent time and drop height, and that the observed slope happened just by chance due to the random assignment?
   a. Meaning, if we randomly assign the times to the drop heights, then there should be no relationship between time and drop height… is that the case?
   b. How could we use simulation to investigate what values of the slope we could expect to happen by chance due to the random assignment?

• Let's simulate slope! And as we do so, let’s examine the sampling distribution of slope.
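One way to run that simulation outside of class is sketched below in Python: keep the drop heights fixed, shuffle the times among them (as if time had nothing to do with height), and record the slope each time. The heights and times here are made-up illustrative values, NOT the class results, and the 1000 repetitions and the seed are arbitrary choices.

```python
import random

def slope(x, y):
    """Slope b of the least-squares line for the points (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

# Made-up illustrative data (NOT the class results):
# drop heights in inches and descent times in seconds.
heights = [24, 24, 36, 36, 48, 48, 60, 60, 72, 72]
times = [0.9, 1.0, 1.3, 1.4, 1.8, 1.9, 2.2, 2.4, 2.7, 2.8]

observed = slope(heights, times)

random.seed(12)  # fixed seed so the run is repeatable
shuffled_slopes = []
for _ in range(1000):
    shuffled = times[:]        # copy the times, then re-deal them at random
    random.shuffle(shuffled)
    shuffled_slopes.append(slope(heights, shuffled))

# Proportion of shuffles whose slope is at least as large as the observed one
# (small tolerance guards against floating-point ties).
p_value = sum(s >= observed - 1e-9 for s in shuffled_slopes) / len(shuffled_slopes)

print(f"observed slope = {observed:.4f} seconds per inch")
print(f"largest shuffled slope = {max(shuffled_slopes):.4f}")
print(f"proportion of shuffles with slope >= observed: {p_value}")
```

If almost none of the shuffled slopes reach the observed slope, "it happened just by chance" is not a plausible explanation.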

Sampling distribution of slope, b

• Your SRS of n observations (x, y) from a population of size N has a defined relationship (that you probably do not know) of predicted $y = \alpha + \beta x$.
• When conditions are met, the sampling distribution for the slope of the sample regression line has the following characteristics:
  • Shape: approximately normal (as long as the values of the response variable, y, follow a normal distribution for each value of the explanatory variable x).
  • Center: $\mu_b = \beta$
  • Spread: the standard deviation of the sampling distribution is $\sigma_b = \dfrac{\sigma}{\sigma_x \sqrt{n}}$ (as long as the 10% condition is satisfied).
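The center and spread facts can be checked with a quick simulation. The Python sketch below assumes a hypothetical population model (the values of ALPHA, BETA, and SIGMA are invented for illustration), holds the x-values fixed, generates many samples, and compares the mean and standard deviation of the sample slopes with β and with σ/(σ_x √n). Here σ_x is taken to be the standard deviation of the x-values computed with a divisor of n.

```python
import math
import random
import statistics

# Assumed (hypothetical) population model: y = ALPHA + BETA*x + error,
# where the errors are normal with mean 0 and standard deviation SIGMA.
ALPHA, BETA, SIGMA = 0.2, 0.04, 0.15
x_values = [24, 36, 48, 60, 72] * 4   # n = 20 fixed x-values

def slope(x, y):
    """Slope b of the least-squares line for the points (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

random.seed(1)  # fixed seed so the run is repeatable
slopes = []
for _ in range(2000):
    # One simulated sample: normal scatter around the true line at each x.
    y = [ALPHA + BETA * xi + random.gauss(0, SIGMA) for xi in x_values]
    slopes.append(slope(x_values, y))

n = len(x_values)
sigma_x = statistics.pstdev(x_values)          # SD of the x-values (divide by n)
theory_sd = SIGMA / (sigma_x * math.sqrt(n))   # sigma_b = sigma / (sigma_x * sqrt(n))

print(f"mean of sample slopes = {statistics.mean(slopes):.5f} (beta = {BETA})")
print(f"SD of sample slopes   = {statistics.stdev(slopes):.5f} (formula: {theory_sd:.5f})")
```

A histogram of `slopes` would also let you eyeball the approximately normal shape.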

Conditions for Regression Inference: LINER

• Linear model is appropriate
• Independent observations
• Normal distribution of y for each x
• Equal SD of y for each x
• Random sample or randomized experiment

How to check conditions for inference

Start by making a histogram of the residuals, a normal probability plot of the residuals, and a residual plot.

• Linear: Does the scatterplot look linear-ish? Look at the residual plot: it shouldn’t look curved, and the residuals should be clustered around the residual = 0 line.

How to check conditions for inference (continued)

• Independent: There should be random sampling or random assignment to ensure independence; check the 10% condition if sampling is done without replacement.
• Normal: Check a stemplot or histogram of the residuals for normality; check the normal probability plot of the residuals (it should be close to a line).
• Equal SD: The vertical spread of the residuals on the residual plot should be roughly the same for all x-values (constant-ish height of residuals along the residual = 0 line).
• Random: Again, check for random assignment/sampling.

Let’s Review…

• Residual wha?

Residuals defined

• A residual is the difference between an observed value of the response variable and the value predicted by the regression line.
• Residual = observed y – predicted y
• Residual = $y - \hat{y}$

Residual Plots

• A residual plot is a scatterplot of the regression residuals against the x variable.
• Residual plots help us assess how well a regression line fits the data.
• The closer the residuals are to the horizontal line, the better the fit of the line to the data.
• If a line is the best model for a scatterplot, the residual plot should show no obvious pattern; a curved pattern shows that the relationship is not linear and a straight line may not be the best model.

Making a residual plot on the calculators

1. X should be in L1, Y in L2.
2. Put cursor over L3; press 2nd, Stat/Resid. Press Enter.
3. Press Enter again. A list of residuals should be in L3.
4. Turn all plots off under 2nd/y= to see stat plot. Also, check y=. If you had a regression line plotted, its equation is there, and if it confuses you to be there, clear it from y1.
5. Create a new scatterplot. Xlist: L1 (explanatory variable) and Ylist: L3 (residuals).
6. Zoom/9:ZoomStat
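The same residual list can be built away from the calculator. This Python sketch fits the least-squares line and then computes residual = observed y − predicted y for each point, which is what ends up in L3. The data are made-up illustrative values (not class data), and the function name `residuals` is chosen for this sketch.

```python
def residuals(x, y):
    """Residuals (observed y minus predicted y) from the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Made-up illustrative data (NOT class data).
x = [1, 2, 3, 4, 5]   # plays the role of L1
y = [2, 4, 5, 4, 5]   # plays the role of L2

resid = residuals(x, y)   # plays the role of L3
for xi, ri in zip(x, resid):
    print(f"x = {xi}, residual = {ri:+.2f}")
print(f"sum of residuals = {sum(resid):.10f}")
```

Plotting `resid` against `x` gives the residual plot. For a least-squares line the residuals always sum to zero (up to rounding), which is a handy check that the list was built correctly.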

Normal Probability Plot

• The closer the data are to a normal distribution, the closer the plot is to a straight line.

Normal Probability Plot

• Step 1: Take your data and put it into a list.
• Step 2: Make a Stat Plot: 2nd button, y=
• Step 3: Turn Stat Plot on.
• Step 4: Choose Probability Plot (last icon of the six).
• Make sure your list is identified in “Data List” (you can change it by using 2nd, and buttons 1-6, depending on which list it is).
• Zoom: choose 9:ZoomStat
• Press Graph: a roughly straight line suggests a normal distribution; a clearly curved pattern does not.
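To see what the calculator is plotting, here is a rough Python sketch that pairs each sorted data value with the z-score expected for its rank. It assumes the plotting position (i − 0.5)/n, which is one common convention; the calculator may use a slightly different one. The data values are made up for illustration, and `normal_plot_points` is a name chosen for this sketch.

```python
from statistics import NormalDist

def normal_plot_points(data):
    """Pair each sorted value with the z-score expected for its rank."""
    ordered = sorted(data)
    n = len(ordered)
    # Assumed plotting position (i - 0.5) / n; other conventions exist.
    z_scores = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(ordered, z_scores))

# Made-up illustrative values (for example, a short list of residuals).
data = [-0.8, 0.6, 1.0, -0.6, -0.2]

points = normal_plot_points(data)
for value, z in points:
    print(f"value = {value:+.1f}, expected z = {z:+.3f}")
```

Plotting the values against the expected z-scores gives the normal probability plot; points that fall close to a line are consistent with a normal distribution.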

Let’s make these displays for our rotocopter data using the graphing calculators.

Example 1: Does seat location matter?

• Many people believe that students learn better if they sit closer to the front of the classroom. Does sitting closer cause higher achievement, or do better students simply choose to sit in the front? To investigate, an AP Stats teacher randomly assigned students to seat locations in his classroom for a particular chapter and recorded the test score for each student at the end of the chapter. The explanatory variable in this experiment is which row the student was assigned (row 1 is the closest to the front and row 7 is the farthest away). Here are the results; a scatterplot, residual plot, histogram, and Normal probability plot of the residuals are also shown.

Row 1 76, 77, 94, 99

Row 2 83, 85, 74, 79

Row 3 90, 88, 68, 78

Row 4 94, 72, 101, 72, 79

Row 5 76, 65, 90, 67, 96

Row 6 88, 79, 90, 83

Row 7 79, 76, 77, 63

Example 1: a. Check whether the conditions for performing inference about the regression model are met.

Solution:

• Linear: The scatterplot shows a weak linear relationship and the residual plot does not show any obvious leftover patterns.
• Independent: Students were randomly assigned to seats and were monitored for cheating, so knowing the score for one student should give no additional information about another student’s score.
• Normal: The histogram is roughly symmetric and unimodal, and the Normal probability plot is roughly linear.
• Equal SD: Although there is a different amount of variability in each row, the differences aren’t very large and there is no systematic pattern, such as increasing variability as x increases.
• Random: The students were assigned to seats at random.

Because there are no serious violations of the conditions, we should be safe performing inference about the regression model in this setting.

Parameters in the Regression Model

                                      Statistic   that estimates the parameter
y-intercept                           a           α
slope                                 b           β
Standard deviation of the residuals   s           σ

Reading Computer Outputs

Predictor   Coef      SE Coef   T       P
Constant    85.706    4.239     20.22   0.000
Row         -1.1171   0.9472    -1.18   0.248

S = 10.0673   R-Sq = 4.7%   R-Sq(adj) = 1.3%

a = y-intercept, estimates α
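The columns of the output are tied together: the Coef column supplies a and b for the regression equation, and each T value is just Coef divided by SE Coef. The short Python sketch below uses the numbers from the output above to show both facts; the variable and function names are chosen for this sketch.

```python
# Values copied from the computer output above.
a, se_a = 85.706, 4.239      # "Constant" row: Coef, SE Coef
b, se_b = -1.1171, 0.9472    # "Row" row: Coef, SE Coef

def predict(row):
    """Predicted test score for a given row, using y-hat = a + b*(row)."""
    return a + b * row

t_constant = a / se_a   # should match the T column for Constant
t_slope = b / se_b      # should match the T column for Row

print(f"predicted score = {a} + ({b})(row)")
print(f"T for Constant: {t_constant:.2f}")
print(f"T for Row:      {t_slope:.2f}")
print(f"predicted score in row 1: {predict(1):.2f}")
print(f"predicted score in row 7: {predict(7):.2f}")
```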

Example 1 continued:

b. Using the output, write the equation for the least-squares regression line for predicting grade from the row assigned.
c. The model for regression inference has three parameters α, β, and σ. Explain what each parameter represents in context. Then provide an estimate for each.
d. Identify the standard error of the slope SEb from the computer output. Interpret this value in context.


Inference Method #1: Confidence Interval of slope, β

• It provides an interval of plausible values for the true slope.
• Formula: $b \pm t^* SE_b$ → __ < β < __ (df = n – 2)
  • (The standard error of the slope is usually provided; you are not expected to know how to find it.)
• 4 steps!

• (a) Identify the standard error of the slope SEb from the computer output. Interpret this value in context.
• (b) Calculate the 95% confidence interval for the true slope. Show your work.
• (c) Interpret the interval from part (b) in context.
• (d) Is there convincing evidence that seat location affects scores?

Solution:

(a) SEb = 0.9472. If we repeated the random assignment many times, the slope of the estimated regression line would typically vary by about 0.9472 from the slope of the true regression line for predicting test score from row number.
(b) Because n = 30, df = 30 – 2 = 28 and t* = 2.048. The 95% confidence interval is $-1.1171 \pm 2.048(0.9472) = -1.1171 \pm 1.9399 = (-3.0570, 0.8228)$.
(c) We are 95% confident that the interval from −3.0570 to 0.8228 captures the slope of the true regression line relating a student’s test score y and the student’s row number x.
(d) Because the interval of plausible slopes includes 0, we do not have convincing evidence that seat location affects test scores.
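The interval arithmetic in part (b) can be double-checked with a few lines of Python. This sketch plugs the slope and standard error from the computer output, plus the t* value from the solution, into b ± t*·SEb; the variable names are chosen for this sketch.

```python
# Values taken from the computer output and the solution above.
b = -1.1171       # sample slope
se_b = 0.9472     # standard error of the slope
t_star = 2.048    # t* for 95% confidence with df = 30 - 2 = 28

margin = t_star * se_b
lower, upper = b - margin, b + margin

print(f"margin of error = {margin:.4f}")
print(f"95% CI for the true slope: ({lower:.4f}, {upper:.4f})")
print("0 is inside the interval:", lower < 0 < upper)
```

Since 0 lands inside the interval, a true slope of 0 (no linear relationship) remains plausible, which matches the conclusion in part (d).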

Inference Method #2: Significance Test of the Slope

• Test statistic formula: $t = \dfrac{b - \beta_0}{SE_b}$, with df = n – 2
• Big ideas are the same: use 4 steps.
• The value of the test statistic and the P-value are typically provided on computer output.
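To see where the T and P entries on a computer output come from, here is a Python sketch for the seat-location numbers (b = −1.1171, SEb = 0.9472, n = 30), testing a hypothesized slope of 0. It assumes the printed P-value is two-sided, and it approximates the area under the t curve with Simpson's rule so that nothing beyond the standard library is needed; the function names are chosen for this sketch.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided P-value: 1 - 2 * (area under the t curve from 0 to |t|).
    The area is approximated with Simpson's rule (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 1 - 2 * area

b, se_b, n = -1.1171, 0.9472, 30   # from the seat-location computer output
beta_0 = 0                          # hypothesized slope: no linear relationship

t = (b - beta_0) / se_b
df = n - 2
p = two_sided_p(t, df)

print(f"t = {t:.2f}, df = {df}, two-sided P-value = {p:.3f}")
```

The t value and P-value should land close to the −1.18 and 0.248 printed in the Row line of the output.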
