3.2: Least Squares Regressions
Section 3.2 Least-Squares Regression
After this section, you should be able to…
• INTERPRET a regression line
• CALCULATE the equation of the least-squares regression line
• CALCULATE residuals
• CONSTRUCT and INTERPRET residual plots
• DETERMINE how well a line fits observed data
• INTERPRET computer regression output
Regression Lines
A regression line summarizes the relationship between two variables, but only in settings where one of the variables helps explain or predict the other.
A regression line is a line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.
Regression Lines
Regression lines are used to conduct analysis.
• Colleges use students' SAT scores and GPAs to predict college success.
• Professional sports teams use players' vital stats (40-yard dash, height, weight) to predict success.
• Macy's uses shipping, sales, and inventory data to predict future sales.
• MDCPS uses student data to evaluate teachers using the VAM model.
Regression Line Equation
Suppose that y is a response variable (plotted on the vertical axis) and x is an explanatory variable (plotted on the horizontal axis). A regression line relating y to x has an equation of the form:
ŷ = ax + b
In this equation,
• ŷ (read “y hat”) is the predicted value of the response variable y for a given value of the explanatory variable x.
• a is the slope, the amount by which y is predicted to change when x increases by one unit.
• b is the y-intercept, the predicted value of y when x = 0.
Example: ŷ = 0.0908x + 16.3
Format of Regression Lines
Format 1: ŷ = 0.0908x + 16.3
ŷ = predicted backpack weight
x = student's weight
Format 2: Predicted backpack weight = 16.3 + 0.0908(student's weight)
Interpreting Linear Regression
• Y-intercept: A student weighing zero pounds is predicted to have a backpack weight of 16.3 pounds (no practical interpretation).
• Slope: For each additional pound that the student weighs, their backpack is predicted to weigh an additional 0.0908 pounds, on average.
Interpreting Linear Regression
Interpret the y-intercept and slope values in context. Is there any practical interpretation?
ŷ = 37x + 270
x = hours studied for the SAT
ŷ = predicted SAT Math score
Interpreting Linear Regression: ŷ = 37x + 270
Slope: For each additional hour the student studies, his/her score is predicted to increase by 37 points, on average. This makes sense, OR this does not make sense; it is unreasonable for scores to increase by 37 points for JUST one hour of studying.
Y-intercept: If a student studies for zero hours, then the student's predicted SAT Math score is 270 points. This makes sense, OR this does not make sense because an SAT score of 270 is very low regardless of study.
Predicted Value
What is the predicted SAT Math score for a student who studies 12 hours?
ŷ = 37x + 270
x = hours studied for the SAT
ŷ = predicted SAT Math score
ŷ = 37(12) + 270 = 714
Predicted score: 714 points
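The same plug-in calculation can be sketched in a few lines of Python. The function name `predict` and its parameter names are illustrative choices, not part of the slides; only the line ŷ = 37x + 270 and the value x = 12 come from the example above.

```python
def predict(slope: float, intercept: float, x: float) -> float:
    """Return the predicted response y-hat = slope * x + intercept."""
    return slope * x + intercept


# SAT line from the slide: y-hat = 37x + 270, where x = hours studied.
sat_12_hours = predict(slope=37, intercept=270, x=12)
print(sat_12_hours)  # 714
```

Keeping slope and intercept as arguments means the same helper can be reused for any of the regression lines in this section.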
Self Check Quiz: Calculate the Regression Equation
A crazy professor believes that a child with IQ 100 should have a reading test score of 50, and that reading score should increase by 1 point for every additional point of IQ. What is the equation of the professor's regression line for predicting reading score from IQ? Be sure to identify all variables used.
Answer: ŷ = 50 + x
ŷ = predicted reading score
x = number of IQ points above 100
Self Check Quiz: Interpreting Regression Lines & Predicted Value
Data on the IQ test scores and reading test scores for a group of fifth-grade children resulted in the following regression line:
predicted reading score = −33.4 + 0.882(IQ score)
(a) What's the slope of this line? Interpret this value in context.
(b) What's the y-intercept? Explain why the value of the intercept is not statistically meaningful.
(c) Find the predicted reading scores for two children with IQ scores of 90 and 130, respectively.
Answers:
(a) Slope = 0.882. For each 1-point increase in IQ score, the reading score is predicted to increase by 0.882 points, on average.
(b) Y-intercept = −33.4. If the student has an IQ of zero, which is essentially impossible (the student would not be able to hold a pencil to take the exam), the predicted score would be −33.4. This has no practical interpretation.
(c) Predicted values:
IQ 90: −33.4 + 0.882(90) = 45.98 points
IQ 130: −33.4 + 0.882(130) = 81.26 points
Least-Squares Regression Line
Different regression lines produce different residuals. The regression line we use in AP Stats is the least-squares regression line (LSRL). The least-squares regression line of y on x is the line that makes the sum of the squared residuals as small as possible.
Residuals
A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is,
residual = actual y - predicted y (remember AP: Actual minus Predicted)
residual = y - ŷ
Positive residuals: points above the line
Negative residuals: points below the line
How to Calculate the Residual
1. Calculate the predicted value by plugging x into the least-squares regression equation (LSRE).
2. Determine the observed/actual value.
3. Subtract: observed - predicted.
Calculate the Residual
1. If a student weighs 170 pounds and their backpack weighs 35 pounds, what is the value of the residual?
Predicted: ŷ = 16.3 + 0.0908(170) = 31.736
Observed: 35
Residual: 35 - 31.736 = 3.264 pounds
The student's backpack weighs 3.264 pounds more than predicted.
2. If a student weighs 105 pounds and their backpack weighs 24 pounds, what is the value of the residual?
Predicted: ŷ = 16.3 + 0.0908(105) = 25.834
Observed: 24
Residual: 24 - 25.834 = -1.834 pounds
The student's backpack weighs 1.834 pounds less than predicted.
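As a rough check on the two residual calculations above, here is a minimal Python sketch of the three steps (predict, observe, subtract). The function name `residual` and the constant names are made up for illustration; the backpack line ŷ = 16.3 + 0.0908x and the two students are taken from the slides.

```python
def residual(observed_y: float, x: float, slope: float, intercept: float) -> float:
    """Residual = actual y - predicted y, with y-hat = intercept + slope * x."""
    predicted_y = intercept + slope * x
    return observed_y - predicted_y


# Backpack line from the slides: y-hat = 16.3 + 0.0908 * (student's weight).
BACKPACK_SLOPE = 0.0908
BACKPACK_INTERCEPT = 16.3

res_170 = residual(35, 170, BACKPACK_SLOPE, BACKPACK_INTERCEPT)
res_105 = residual(24, 105, BACKPACK_SLOPE, BACKPACK_INTERCEPT)

# A positive residual means the point lies above the line (the line underpredicts).
print(round(res_170, 3), round(res_105, 3))
```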
Check Your Understanding
Some data were collected on the weight of a male white laboratory rat for the first 25 weeks after its birth. A scatterplot of y = weight (in grams) and x = time since birth (in weeks) shows a fairly strong, positive linear relationship. The regression equation ŷ = 100 + 40x models the data well.
A. Predict the rat's weight at 16 weeks old.
B. Calculate and interpret the residual if the rat weighed 700 grams at 16 weeks old.
C. Should you use this line to predict the rat's weight at 2 years old?
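One possible way to work through parts A to C is sketched below in Python. The slides do not give an answer key for this exercise, so this is a worked attempt, not the official solution. It assumes the observed x values run from about 0 to 25 weeks (based on "the first 25 weeks after its birth") and that 2 years is 104 weeks; the helper names are hypothetical.

```python
SLOPE = 40        # predicted grams gained per week
INTERCEPT = 100   # predicted grams at x = 0 weeks
OBSERVED_RANGE = (0, 25)  # ASSUMPTION: weeks covered by the data


def predicted_weight(weeks: float) -> float:
    """y-hat = 100 + 40x, in grams."""
    return INTERCEPT + SLOPE * weeks


def is_extrapolation(weeks: float) -> bool:
    """True when x lies outside the range of x values used to fit the line."""
    low, high = OBSERVED_RANGE
    return not (low <= weeks <= high)


# A. Predicted weight at 16 weeks.
weight_16 = predicted_weight(16)

# B. Residual = actual - predicted for a rat weighing 700 g at 16 weeks.
residual_16 = 700 - weight_16

# C. Two years is about 104 weeks (assuming 52 weeks per year).
weight_104 = predicted_weight(104)

print(weight_16, residual_16, weight_104, is_extrapolation(104))
```

Under these assumptions the rat is predicted to weigh 740 grams at 16 weeks, so a 700-gram rat is 40 grams lighter than predicted. The 2-year prediction of 4260 grams lies far outside the 25 weeks of data, which is why the line should not be trusted there.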
Residual Plots
A residual plot is a scatterplot of the residuals against the explanatory variable. Residual plots help us assess how well a regression line fits the data.
TI-Nspire: Residual Plots
1. Press MENU, 4: Analyze
2. Option 6: Residual, Option 2: Show Residual Plot
Interpreting Residual Plots
A residual plot magnifies the deviations of the points from the line, making it easier to see unusual observations and patterns.
1) The residual plot should show no obvious patterns.
2) The residuals should be relatively small in size.
A valid residual plot should look like the “night sky,” with approximately equal amounts of positive and negative residuals.
Pattern in residuals: linear model not appropriate.
Should You Use the LSRL?
[Figures 1 and 2 are not included in the transcript.]
Interpreting Computer Regression Output
Be sure you can locate the slope and the y-intercept, and determine the equation of the LSRL.
ŷ = -0.0034415x + 3.5051
ŷ = predicted ....
x = explanatory variable
Determine the equation of the LSRL.
ŷ = 174.40x + 72.95
x = customers in line
ŷ = predicted seconds it takes to check out
r²: Coefficient of Determination
r² tells us how much better the LSRL does at predicting values of y than simply guessing the mean ȳ for each value in the dataset.
In this example, r² equals 60.6%.
60.6% of the variation in pack weight is explained by the linear relationship with body weight.
Template: (insert r²)% of the variation in y is explained by the linear relationship with x.
Interpret r²
Interpret in a sentence (how much variation is accounted for?)
1. r² = 0.875, x = hours studied, y = SAT score
2. r² = 0.523, x = hours slept, y = alertness score
Answers:
1. 87.5% of the variation in SAT score is explained by the linear relationship with the number of hours studied.
2. 52.3% of the variation in alertness score is explained by the linear relationship with the number of hours slept.
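s: Standard Deviation of the Residuals
If we use a least-squares regression line to predict the values of a response variable y from an explanatory variable x, the standard deviation of the residuals (s) is given by
$$ s = \sqrt{\frac{\sum \text{residuals}^2}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}} $$
s represents the typical or average ERROR (residual).
For an individual residual: positive = the line UNDERpredicts; negative = the line OVERpredicts.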
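The formula for s can be sketched in Python as follows. The four data points and the line ŷ = 1.9x below are made-up illustration values, not data from the slides, and the function name is hypothetical. The sketch only shows the mechanics: square each residual, add them up, divide by n - 2, and take the square root.

```python
import math


def std_dev_of_residuals(xs, ys, slope, intercept):
    """s = sqrt( sum of (y_i - y-hat_i)^2 / (n - 2) )."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sum_sq = sum(r ** 2 for r in residuals)
    return math.sqrt(sum_sq / (len(xs) - 2))


# HYPOTHETICAL data, chosen so the least-squares line is y-hat = 0 + 1.9x.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

s = std_dev_of_residuals(xs, ys, slope=1.9, intercept=0.0)
print(round(s, 3))
```

Because the residuals are squared before averaging, s carries no sign: it describes the typical size of a prediction error, not whether the line tends to overpredict or underpredict.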
s: Standard Deviation of the Residuals
1. Identify and interpret the standard deviation of the residuals.
Answer: s = 0.740
Interpretation: When using the least-squares regression line, the actual fat gain typically differs from the predicted fat gain by about 0.740 kilograms.
Self Check Quiz!
The data are a random sample of 10 trains comparing number of cars on the train and fuel consumption in pounds of coal.
• What is the regression equation? Be sure to define all variables.
• What is r² telling you?
• Define and interpret the slope in context. Does it have a practical interpretation?
• Define and interpret the y-intercept in context.
• What is s telling you?
Answers:
1. ŷ = 2.1495x + 10.667
ŷ = predicted fuel consumption in pounds of coal
x = number of rail cars
2. 96.7% of the variation in fuel consumption is explained by the linear relationship with the number of rail cars.
3. Slope = 2.1495. For each additional car, fuel consumption is predicted to increase by 2.1495 pounds of coal, on average. This makes practical sense.
4. Y-intercept = 10.667. When there are no cars attached to the train, the predicted fuel consumption is 10.667 pounds of coal. This has no practical interpretation because there is always at least one car, the engine.
5. s = 4.361. When using the least-squares regression line, the actual fuel consumption typically differs from the predicted value by about 4.361 pounds of coal.
Extrapolation
We can use a regression line to predict the response ŷ for a specific value of the explanatory variable x. The accuracy of the prediction depends on how much the data scatter about the line. Exercise caution in making predictions outside the observed values of x.
Extrapolation is the use of a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate.
Outliers and Influential Points
• An outlier is an observation that lies outside the overall pattern of the other observations.
• An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation.
• Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.
• Note: Not all influential points are outliers, nor are all outliers influential points.
The left graph is perfectly linear. In the right graph, the last value was changed from (5, 5) to (8, 5). This point is clearly influential, because it changed the graph significantly. However, its residual is very small.
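The effect described above can be explored with a small Python sketch. The graphs themselves are not in the transcript, so the five points (1, 1) through (5, 5) are an ASSUMPTION about what "perfectly linear" looked like; only the change from (5, 5) to (8, 5) is taken from the slide. The helper `least_squares_line` is an illustrative name and uses the standard least-squares formulas for slope and intercept.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# ASSUMED left graph: five points lying exactly on y = x.
left_x, left_y = [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]
# Right graph: the last point moved from (5, 5) to (8, 5), as on the slide.
right_x, right_y = [1, 2, 3, 4, 8], [1, 2, 3, 4, 5]

slope_left, intercept_left = least_squares_line(left_x, left_y)
slope_right, intercept_right = least_squares_line(right_x, right_y)

# Residual of the moved point relative to the NEW line.
residual_moved = 5 - (intercept_right + slope_right * 8)

print(slope_left, round(slope_right, 3), round(residual_moved, 3))
```

Under these assumed points the slope drops from 1 to roughly 0.55, while the moved point ends up less than half a unit from the new line. Because the line is pulled toward the point, a small residual alone does not show that a point is harmless.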
Identify the Outlier…
Check Your Understanding
The scatterplot shows the payroll (in millions of dollars) and number of wins for Major League Baseball teams in 2016, along with the least-squares regression line. The points highlighted in red represent the Los Angeles Dodgers (far right) and the Cleveland Indians (upper left).
A. Describe what influence the point representing the Los Angeles Dodgers has on the equation of the least-squares regression line. Explain your reasoning.
B. Describe what influence the point representing the Cleveland Indians has on the standard deviation of the residuals and r². Explain your reasoning.
Correlation and Regression Limitations
The distinction between explanatory and response variables is important in regression.
Correlation and regression lines describe only linear relationships.
Correlation and least-squares regression lines are not resistant.
Correlation and Regression Wisdom
An association between an explanatory variable x and a response variable y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.
Association Does Not Imply Causation
A serious study once found that people with two cars live longer than people who only own one car. Owning three cars is even better, and so on. There is a substantial positive correlation between number of cars x and length of life y. Why?
FRQ 2018 #1
Additional Calculations & Proofs
Least-Squares Regression Line
We can use technology to find the equation of the least-squares regression line. We can also write it in terms of the means and standard deviations of the two variables and their correlation.
Equation of the least-squares regression line
We have data on an explanatory variable x and a response variable y for n individuals. From the data, calculate the means $\bar{x}$ and $\bar{y}$ and the standard deviations $s_x$ and $s_y$ of the two variables, and their correlation r. The least-squares regression line is the line ŷ = a + bx with slope
$$ b = r \frac{s_y}{s_x} $$
and y-intercept
$$ a = \bar{y} - b\bar{x} $$
Calculate the Least-Squares Regression Line
Some people think that the behavior of the stock market in January predicts its behavior for the rest of the year. Take the explanatory variable x to be the percent change in a stock market index in January and the response variable y to be the change in the index for the entire year. We expect a positive correlation between x and y because the change during January contributes to the full year's change. Calculation from data for an 18-year period gives
Mean x = 1.75%, s_x = 5.36%, Mean y = 9.07%, s_y = 15.35%, r = 0.596
Find the equation of the least-squares line for predicting full-year change from January change. Show your work.
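A minimal Python sketch of this calculation, using the slope formula b = r(s_y / s_x) and the intercept formula a = ȳ - b·x̄ together with the summary statistics given in the problem. The function name `lsrl_from_summary` is an illustrative choice, and the slides do not include a worked answer, so treat the result as one way to check your own work.

```python
def lsrl_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (intercept a, slope b) for y-hat = a + b*x from summary statistics."""
    slope = r * s_y / s_x                # b = r * (s_y / s_x)
    intercept = y_bar - slope * x_bar    # a = y-bar - b * x-bar
    return intercept, slope


# Summary statistics from the problem (all in percent).
a, b = lsrl_from_summary(x_bar=1.75, s_x=5.36, y_bar=9.07, s_y=15.35, r=0.596)

print(f"y-hat = {a:.3f} + {b:.3f}x")  # y-hat = 6.083 + 1.707x
```

Here x is the January percent change and ŷ is the predicted full-year percent change, so each additional percentage point in January is associated with a predicted increase of about 1.707 percentage points for the year.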
The Role of r² in Regression
The standard deviation of the residuals gives us a numerical estimate of the average size of our prediction errors.
The coefficient of determination r² is the fraction of the variation in the values of y that is accounted for by the least-squares regression line of y on x. We can calculate r² using the following formula:
$$ r^2 = 1 - \frac{SSE}{SST}, \qquad SSE = \sum \text{residual}^2, \qquad SST = \sum (y_i - \bar{y})^2 $$
In practice, just square the correlation r.
Accounted for Error
If we use the mean backpack weight as our prediction, the sum of the squared residuals is 83.87. SST = 83.87
If we use the LSRL to make our predictions, the sum of the squared residuals is 30.90. SSE = 30.90
r² = 1 - SSE/SST = 1 - 30.90/83.87 = 0.632
63.2% of the variation in backpack weight is accounted for by the linear model relating pack weight to body weight.
Unaccounted for Error
SSE/SST = 30.90/83.87 = 0.368
Therefore, 36.8% of the variation in pack weight is unaccounted for by the least-squares regression line.
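The accounted-for and unaccounted-for fractions can be checked with a short Python sketch. It uses SSE = 30.90 and SST = 83.87 as stated for the backpack example; the function name `r_squared` is a hypothetical helper, not something from the slides.

```python
def r_squared(sse: float, sst: float) -> float:
    """r^2 = 1 - SSE/SST: fraction of variation in y accounted for by the LSRL."""
    return 1 - sse / sst


SSE = 30.90  # sum of squared residuals around the LSRL
SST = 83.87  # sum of squared residuals around the mean of y

accounted = r_squared(SSE, SST)
unaccounted = SSE / SST

print(round(accounted, 3), round(unaccounted, 3))  # 0.632 0.368
```

The two fractions always add to 1, which is a quick sanity check when reading r² off computer output.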
Interpreting a Regression Line
Consider the regression line from the example (pg. 164) “Does Fidgeting Keep You Slim?” Identify the slope and y-intercept and interpret each value in context.
The y-intercept a = 3.505 kg is the fat gain estimated by this model if NEA does not change when a person overeats.
The slope b = -0.00344 tells us that the amount of fat gained is predicted to go down by 0.00344 kg for each added calorie of NEA.
fatgain = 3.505 - 0.00344(NEA change)