Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is...
Transcript of Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is...
©Prof.AndyField,2016 www.discoveringstatistics.com Page1
Introducing the Linear Model What is Correlational Research? Correlationaldesignsarewhenmanyvariablesaremeasuredsimultaneouslybutunlikeinanexperimentnoneofthemaremanipulated.Whenweusecorrelationaldesignswecan’tlookforcause-effectrelationshipsbecausewehaven’tmanipulatedanyofthevariables,andalsobecauseallofthevariableshavebeenmeasuredatthesamepointintime(if you’re really bored, Field, 2013, Chapter 1 explains why experiments allow us to make causal inferences butcorrelational researchdoesnot). Inpsychology, themostcommoncorrelational researchconsistsof the researcheradministeringseveralquestionnairesthatmeasuredifferentaspectsofbehaviourtoseewhichaspectsofbehaviourarerelated.Manyofyouwilldothissortofresearchforyourfinalyearresearchproject(sopayattention!).
The linear model Inthefirstfewlectureswesawthattheonlyequationweeverreallyneedisthisone:
outcome! = Model! + error!
Wealsosawthatweoftenfitalinearmodel,whichinitssimplestformcanbewrittenas:
outcome! = 𝑏𝑏* + 𝑏𝑏+𝑋𝑋! + error!
y! = 𝑏𝑏* + 𝑏𝑏+𝑋𝑋! + ε!Eq.1
Thefundamentalideaisthatanoutcomeforanentitycanbepredictedfromamodelandsomeerrorassociatedwiththat prediction (ei).We are predicting an outcome variable (yi) from a predictor variable (Xi) and a parameter,b1,associatedwiththepredictorvariablethatquantifiestherelationshipithaswiththeoutcomevariable.Wealsoneedaparameterthattellsusthevalueoftheoutcomewhenthepredictoriszero;thisparameterisb0.
Youmightrecognizethismodelas‘theequationofastraightline’.Ihavetalkedaboutfitting‘linearmodels’,andlinearsimplymeans‘straightline’.Anystraightlinecanbedefinedbytwothings:(1)theslope(orgradient)oftheline(usuallydenotedbyb1);and(2)thepointatwhichthelinecrossestheverticalaxisofthegraph(knownastheinterceptoftheline,b0).Theseparametersb1andb0areknownastheregressioncoefficientsandwillcropuptimeandtimeagain,whereyoumayseethemreferredtogenerallyasb(withoutanysubscript)orbn(meaningthebassociatedwithvariablen).Aparticularline(i.e.,model)willhavehasaspecificinterceptandgradient.
Figure1:Showslineswiththesamegradientsbutdifferentintercepts,andlinesthatsharethesameinterceptbuthavedifferentgradients
Same intercepts, different gradients Same gradients, different intercepts
©Prof.AndyField,2016 www.discoveringstatistics.com Page2
Figure1showsasetoflinesthathavethesameinterceptbutdifferentgradients.Forthesethreemodels,b0willbethesameineachbutthevaluesofb1willdifferineach;thisfigurealsoshowsmodelsthathavethesamegradients(b1isthesame ineachmodel)butdifferent intercepts (theb0isdifferent ineachmodel). I’vementionedalready thatb1quantifiestherelationshipbetweenthepredictorvariableandtheoutcome,andFigure1illustratesthispoint.Amodelwithapositiveb1describesapositiverelationship,whereasalinewithanegativeb1describesanegativerelationship.Looking at Figure1 (left) the red linedescribes a positive relationshipwhereas the green linedescribes a negativerelationship. As such, we can use a linearmodel (i.e., a straight line) to summarize the relationship between twovariables:thegradient(b1)tellsuswhatthemodellookslike(itsshape)andtheintercept(b0)tellsuswherethemodelis(itslocationingeometricspace).
Thisisallquiteabstractsolet’slookatanexample.ImaginethatIwasinterestedinpredictingphysicalanddownloadedalbumsales(outcome)fromtheamountofmoneyspentadvertisingthatalbum(predictor).WecouldsummarizethisrelationshipusingalinearmodelbyreplacingthenamesofourvariablesintoEq.1.
y! = 𝑏𝑏* + 𝑏𝑏+𝑋𝑋! + ε!
albumsales! = 𝑏𝑏* + 𝑏𝑏+advertisingbudget! + ε!Eq.2
Oncewehaveestimatedthevaluesofthebswewouldbeabletomakeapredictionaboutalbumsalesbyreplacing‘advertising’withanumberrepresentinghowmuchwewantedtospendadvertisinganalbum.Forexample,imaginethatb0turnedouttobe50andb1turnedouttobe100,ourmodelwouldbe:
albumsales! = 50 + 100×advertisingbudget! + ε! Eq.3
NotethatIhavereplacedthebetaswiththeirnumericvalues.Now,wecanmakeaprediction.Imaginewewantedtospend£5onadvertising,wecan replace thevariable ‘advertisingbudget’with thisvalueandsolve theequation todiscoverhowmanyalbumsaleswewillget:
albumsales! = 50 + 100×5 + ε!= 550 + ε!
So,basedonourmodelwecanpredictthatifwespend£5onadvertising,we’llsell550albums.I’velefttheerrortermintheretoremindyouthatthispredictionwillprobablynotbeperfectlyaccurate.Thisvalueof550albumsalesisknownasapredictedvalue.
The linear model with several predictors Wehaveseenthatwecanuseastraightlineto‘model’therelationshipbetweentwovariables.However,lifeisusuallymorecomplicatedthanthat:thereareoftennumerousvariablesthatmightberelatedtotheoutcomeofinterest.Totakeouralbumsalesexample,wemightexpectvariablesotherthansimplyadvertisingtohaveaneffect.Forexample,howmuchsomeonehearssongsfromthealbumontheradio,orthe‘look’ofthebandmighthaveaninfluence.Oneofthebeautifulthingsaboutthelinearmodelisthatitcanbeexpandedtoincludeasmanypredictorsasyoulike.Toaddapredictorallweneedtodoisplaceitintothemodelandgiveitabthatestimatestherelationshipinthepopulationbetweenthatpredictorandtheoutcome.Forexample,ifwewantedtoaddthenumberofplaysofthebandontheradioperweek(airplay),wecouldaddthissecondpredictoringeneralas:
𝑌𝑌! = 𝑏𝑏* + 𝑏𝑏+𝑋𝑋+! + 𝑏𝑏>𝑋𝑋>! + 𝜀𝜀! Eq.4
Notethatall thathaschanged istheadditionofasecondpredictor(X2)andanassociatedparameter(b2).Tomakethingsmoreconcrete,let’susethevariablenamesinstead:
albumsales! = 𝑏𝑏* + 𝑏𝑏+advertisingbudget! + 𝑏𝑏>airplay! + 𝜀𝜀! Eq.5
Thenewmodelincludesab-valueforbothpredictors(and,ofcourse,theconstant,b0).Ifweestimatetheb-values,wecouldmakepredictionsaboutalbumsalesbasednotonlyontheamountspentonadvertisingbutalsointermsofradioplay.
©Prof.AndyField,2016 www.discoveringstatistics.com Page3
Multiple regression can be usedwith three, four or even ten ormore predictors. In general we can add asmanypredictorsaswelike,andthelinearmodelwillexpandaccordingly:
𝑌𝑌! = 𝑏𝑏* + 𝑏𝑏+𝑋𝑋+! + 𝑏𝑏>𝑋𝑋>! ⋯ 𝑏𝑏D𝑋𝑋D! + 𝜀𝜀! Eq.6
Inwhich,Y istheoutcomevariable,b1 isthecoefficientofthefirstpredictor(X1),b2 isthecoefficientofthesecondpredictor(X2),bnisthecoefficientofthenthpredictor(Xni),andeiistheerrorfortheithparticipant.(Thebracketsaren’tnecessary,they’rejusttomaketheconnectiontoEq.1).Thisequationillustratesthatwecanaddinasmanypredictorsaswelikeuntilwereachthefinalone(Xn),buteachtimewedo,weassignitaregressioncoefficient(b).
Estimating the model Linearmodelscanbedescribedentirelybyaconstant(b0)andbyparametersassociatedwitheachpredictor(bs).Theseparameters are estimated using themethod of least squares (described in your lecture). Thismethod is known asordinaryleastsquares(OLS)regression.Inotherwords,SPSSfindsthevaluesoftheparametersthathavetheleastamountoferror(relativetoanyothervalue)forthedatayouhave.
Assessing the goodness of fit, sums of squares, R and R2 OnceNephwickandClungglewadhavefoundthemodelofbestfititisimportantthatweassesshowwellthismodelfitstheactualdata(weassessthegoodnessoffitofthemodel).Wedothisbecauseeventhoughthemodelisthebestoneavailable, itcanstillbea lousyfit tothedata.Onemeasureoftheadequacyofamodel is thesumofsquareddifferences(thinkbacktolecture2,orField,2013,Chapter2).Thereareseveralsumsofsquaresthatcanbecalculatedtohelpusgaugethecontributionofourmodeltopredictingtheoutcome.Let’sgobacktoourexampleofpredictingalbumsales(Y)fromtheamountofmoneyspentadvertisingthatalbum(X).Onedaymybosscameintomyofficeandsaid‘Andy,Iknowyouwantedtobearockstarandyou’veendedupworkingasmystats-monkey,buthowmanyalbumswillwesellifwespend£100,000onadvertising?’IntheabsenceofanydataprobablythebestanswerIcouldgivewouldbe themeannumberofalbumsales (say,200,000)becauseonaveragethat’showmanyalbumsweexpect tosell.However,whatifhetheasks‘Howmanyalbumswillwesellifwespend£1onadvertising?’Again,intheabsenceofanyaccurateinformation,mybestguesswouldbethemean.Thereisaproblem:whateveramountofmoneyisspentonadvertisingIalwayspredictthesamelevelsofsales.Assuch,themeanisafairlyuselessmodelofarelationshipbetweentwovariables.—butitisthesimplestmodelavailable.
Usingthemeanasamodel,wecancalculatethedifferencebetweentheobservedvalues,andthevaluespredictedbythemean.Wesawinlecture1thatwesquareallofthesedifferencestogiveusthesumofsquareddifferences.Thissumofsquareddifferencesisknownasthetotalsumofsquares(denotedSST)becauseitisthetotalamountoferrorpresentwhenthemostbasicmodelisappliedtothedata(Figure2).Now,ifwefitthemoresophisticatedmodeltothedata,suchasalineofbestfit,wecanworkoutthedifferencesbetweenthisnewmodelandtheobserveddata.Evenifanoptimalmodelisfittedtothedatathereisstillsomeinaccuracy,whichisrepresentedbythedifferencesbetweeneachobserveddatapointandthevaluepredictedbytheregressionline.Thesedifferencesaresquaredbeforetheyareaddedupsothatthedirectionsofthedifferencesdonotcancelout.Theresultisknownasthesumofsquaredresiduals(SSR).Thisvaluerepresentsthedegreeofinaccuracywhenthebestmodelisfittedtothedata.Wecanusethesetwovaluestocalculatehowmuchbettertheregressionline(thelineofbestfit)isthanjustusingthemeanasamodel(i.e.howmuchbetteristhebestpossiblemodelthantheworstmodel?).ThisimprovementinpredictionisthedifferencebetweenSSTandSSR.Thisdifferenceshowsusthereductionintheinaccuracyofthemodelresultingfromfittingtheregressionmodeltothedata.Thisimprovementisthemodelsumofsquares(SSM).
If thevalueofSSM is largethentheregressionmodel isverydifferent fromusingthemeantopredict theoutcomevariable.Thisimpliesthattheregressionmodelhasmadeabigimprovementtohowwelltheoutcomevariablecanbepredicted. However, if SSM is small then using the regression model is little better than using the mean (i.e. theregressionmodelisnobetterthantakingour‘bestguess’).Ausefulmeasurearisingfromthesesumsofsquaresistheproportionofimprovementduetothemodel.Thisiseasilycalculatedbydividingthesumofsquaresforthemodelbythetotalsumofsquares.TheresultingvalueiscalledR2andtoexpressthisvalueasapercentageyoushouldmultiplyitby100.So,R2representstheamountofvarianceintheoutcomeexplainedbythemodel(SSM)relativetohowmuchvariationtherewastoexplaininthefirstplace(SST).
©Prof.AndyField,2016 www.discoveringstatistics.com Page4
𝑅𝑅> =𝑆𝑆𝑆𝑆G𝑆𝑆𝑆𝑆H
Eq.7
Figure2:Diagramshowingfromwheretheregressionsumsofsquaresderive
A second use of the sums of squares in assessing themodel is the F-test. This test is based upon the ratio of theimprovementduetothemodel(SSM)andthedifferencebetweenthemodelandtheobserveddata(SSR).Ratherthanusingthesumsofsquares,itusesthemeansumsofsquares(referredtoasthemeansquaresorMS).Theresultisthemeansquaresforthemodel(MSM)andtheresidualmeansquares(MSR)—seeField2013formoredetail.Atthisstageitisn’tessentialthatyouunderstandhowthemeansquaresarederived(itisexplainedinField,2013).However,itisimportantthatyouunderstandthattheF-ratio(Eq.8)isameasureofhowmuchthemodelhasimprovedthepredictionoftheoutcomecomparedtothelevelofinaccuracyofthemodel.Ifamodelisgood,thentheimprovementinpredictionduetothemodel(MSM)tobelargeandthedifferencebetweenthemodelandtheobserveddata(MSR)tobesmall.Inshort,agoodmodelshouldhavealargeF-ratio.
𝐹𝐹 =𝑀𝑀𝑆𝑆G𝑀𝑀𝑆𝑆K
Eq.8
SST uses the differences between the observed data
and the mean value of Y
SSR uses the differences between the observed data
and the regression line
SSM uses the differences between the mean value of Y
and the regression line
©Prof.AndyField,2016 www.discoveringstatistics.com Page5
Assessing individual predictors Thevalueofbrepresentsthechangeintheoutcomeresultingfromaunitchangeinthepredictor.Ifthemodelisverybadthenwewouldexpectthechangeintheoutcometobezero.Aregressioncoefficientof0means:(1)aunitchangeinthepredictorvariableresultsinnochangeinthepredictedvalueoftheoutcome(thepredictedvalueoftheoutcomedoesnotchangeatall).Logically ifavariablesignificantlypredictsanoutcome,thenitshouldhaveab-valuethatisdifferentfromzero.Thet-statisticteststhenullhypothesisthatthevalueofbis0:therefore,ifitissignificantwegainconfidenceinthehypothesisthattheb-valueissignificantlydifferentfrom0andthatthepredictorvariablecontributessignificantlytoourabilitytoestimatevaluesoftheoutcome.
LikeF, the t-statistic isbasedon the ratioofexplainedvarianceagainstunexplainedvarianceorerror.Whatwe’reinterestedinhereisnotsomuchvariancebutwhetherthebisbigcomparedtotheamountoferrorinthatestimate.Toestimatehowmucherrorwecouldexpect to find inbweuse the standarderrorbecause it tellsusabouthowdifferentb-valueswouldbeacrossdifferentsamples.Eq.9showshowthet-testiscalculated:Thebexpectedisthevalueofbthatwewouldexpecttoobtainifthenullhypothesisweretrue(i.e.,zero)andsothisvaluecanbereplacedby0.Theequationsimplifiestobecometheobservedvalueofbdividedbythestandarderrorwithwhichitisassociated:
𝑡𝑡 =𝑏𝑏MNOPQRPS − 𝑏𝑏PUVPWXPS
𝑆𝑆𝑆𝑆Z=𝑏𝑏MNOPQRPS𝑆𝑆𝑆𝑆Z Eq.9
Thevaluesofthaveaspecialdistributionthatdiffersaccordingtothedegreesoffreedomforthetest.Inthiscontext,the degrees of freedom areN−p−1,whereN is the total sample size andp is the number of predictors. In simpleregressionwhenwehaveonlyonepredictor,thisreducesdowntoN−2.SPSSprovidestheexactprobabilitythattheobservedvalue(ora largerone)oftwouldoccur ifthevalueofbwas, infact,0.Asageneralrule, if thisobservedsignificanceislessthan.05,thenscientistsassumethatbissignificantlydifferentfrom0;putanotherway,thepredictormakesasignificantcontributiontopredictingtheoutcome.
Generalization and Bootstrapping Rememberfromyourlectureonbiasthatlinearmodelsassume:
• Linearityandadditivity:therelationshipyou’retryingtomodelis,infact,linearandwithseveralpredictors,theycombineadditively.
• Normality:Forbestimatestobeoptimaltheresidualsshouldbenormallydistributed.ForCIsandconfidenceintervalstobeaccurate,thesamplingdistributionofbsshouldbenormal.
• Homoscedasticity:necessaryforbestimatestobeoptimalandsignificancetestsandCIsoftheparameterstobeaccurate.
Iftheseassumptionsaremetthenwecantrusttheestimatesofourbs,whichmeansthatwecangeneralizeourmodel(i.e.assumethatitworksinsamplesotherthantheonefromwhichwecollecteddata).Ifwehaveconcernsabouttheseassumptions we can use bootstrapping to compute robust estimates of bs and their confidence intervals. Lack ofnormalitypreventsusfromknowingtheshapeofthesamplingdistributionunlesswehavebigsamples.BootstrappingEfron&Tibshirani,1993)getsaroundthisproblembyestimatingthepropertiesofthesamplingdistributionfromthesample data. In effect, the sample data are treated as a population fromwhich smaller samples (called bootstrapsamples)aretaken(puttingeachscorebackbeforeanewoneisdrawnfromthesample).Theparameterofinterest(e.g.,theregressionparameter)iscalculatedineachbootstrapsample.Thisprocessisrepeatedperhaps2000times.Theendresultisthatwehave2000parameterestimates,onefromeachbootstrapsample.Therearetwothingswecandowiththeseestimates:thefirstistoorderthemandworkoutthelimitswithinwhich95%ofthemfall.Wecanusethesevaluesasanestimateofthelimitsofthe95%confidenceintervaloftheparameter.Theresultisknownasapercentilebootstrapconfidenceinterval(becauseit isbasedonthevaluesbetweenwhich95%ofbootstrapsampleestimatesfall).Thesecondthingwecandoistocalculatethestandarddeviationoftheparameterestimatesfromthebootstrapsamplesanduseitasthestandarderrorofparameterestimates.Animportantpointtorememberisthatbecausebootstrappingisbasedontakingrandomsamplesfromthedatayou’vecollectedtheestimatesyougetwillbeslightly different every time. This is nothing to worry about. For a fairly gentle introduction to the concept ofbootstrappingseeWright,LondonandField(2011).
©Prof.AndyField,2016 www.discoveringstatistics.com Page6
Fitting a linear model Figure3showsthegeneralprocessofconductingregressionanalysis.First,weshouldproducescatterplotstogetsomeideaofwhethertheassumptionoflinearityismet,andalsotolookforanyoutliersorobviousunusualcases.Atthisstagewemighttransformthedatatocorrectproblems.Havingdonethisinitialscreenforproblemswefitamodelandsavethevariousdiagnosticstatisticsthatwewilldiscussnextweek. Ifwewanttogeneralizeourmodelbeyondthesample,orweareinterestedininterpretingsignificancetestsandconfidenceintervalsthenweexaminetheseresidualstocheckforhomoscedasticity,normality,independenceandlinearity.Ifwefindproblemsthenwetakecorrectiveactionandre-estimatethemodel.Also,it’sprobablywisetousebootstrappedconfidenceintervalswhenwefirstestimatethemodelbecausethenwecanbasicallyforgetaboutthingslikenormality.
Figure3:Theprocessoffittingaregressionmodel
Regression using SPSS TherearesomedatafromField2013inthefileAlbumSales.sav.Thisdatafilehas200rows,eachonerepresentingadifferentalbum.Therearealsotwocolumns,onerepresentingthesalesofeachalbumintheweekafterreleaseandtheotherrepresentingtheamount(inpounds)spentpromotingthealbumbeforerelease.Thisistheformatforenteringregressiondata:theoutcomevariableandanypredictorsshouldbeenteredindifferentcolumns,andeachrowshouldrepresentindependentvaluesofthosevariables.
ThepatternofthedataisshowninFigure4anditshouldbeclearthatapositiverelationshipexists:so,themoremoneyspentadvertisingthealbum,themoreitislikelytosell.Ofcoursetherearesomealbumsthatsellwellregardlessofadvertising(topleftofscatterplot),buttherearenonethatsellbadlywhenadvertisinglevelsarehigh(bottomrightofscatterplot).Thescatterplotalsoshows the lineofbest fit for thesedata:bearing inmind that themeanwouldberepresentedbyaflatlineataroundthe200,000salesmark,theregressionlineisnoticeabledifferent.
Check residuals
Assumptions met and no bias
No normality
Graphs: zpred vs. zresid
Re-run analysis: Bootstrap CIs, transform
data
Model can be generalized
Linearity
Normality
Initial checks Graphs: scatterplotsLinearity and unusual cases
Run initial regression Save diagnostics
Homoscedasticity
Graphs: histogram
Independence
HeteroscedasticityRe-run analysis: weighted least squares regression
Lack of linearityTransform data
Lack of independence Use a multilevel model (Chapter 19)
©Prof.AndyField,2016 www.discoveringstatistics.com Page7
Figure4:Scatterplotshowingtherelationshipbetweenalbumsalesandtheamountspentpromotingthealbum.
Running a basic Analysis Tofindouttheparametersthatdescribetheregressionline,andtoseewhetherthislineisausefulmodel,weneedtorun a regression analysis. To do the analysis you need to access the main dialog box by selecting
Figure4showstheresultingdialogbox.ThereisaspacelabelledDependentinwhichyoushouldplacetheoutcomevariable(inthisexamplesales).So,selectsalesfromthelistontheleft-handside,andtransferitbydraggingitorclickingon .ThereisanotherspacelabelledIndependent(s)inwhichanypredictorvariableshouldbeplaced.Insimpleregressionweuseonlyonepredictor(inthisexample,adverts)andsoyoushouldselectadvertsfromthelistandclickon totransferittothelistofpredictors.Thereareavarietyofoptionsavailable,butthesewillbeexploredwithinthecontextofmultipleregression.
Figure5:Maindialogboxforregression
Ifweareworriedaboutassumptionsthenwecangetbootstrappedconfidenceintervalsfortheregressioncoefficientsbyclicking .Select toactivatebootstrapping,andtogeta95%confidenceintervalclick
.Clickon inthemaindialogboxtorunthebasicanalysis.
©Prof.AndyField,2016 www.discoveringstatistics.com Page8
Figure6:Bootstrapdialogbox
Output from SPSS
Overall Fit of the Model
Output1
ThefirsttableprovidedbySPSSisasummaryofthemodelthatgivesthevalueofRandR2forthemodel.Forthesedata,Ris0.578andbecausethereisonlyonepredictor,thisvaluerepresentsthesimplecorrelationbetweenadvertisingandalbumsales(youcanconfirmthisbyrunningacorrelation). The value of R2 is 0.335, which tells us that advertisingexpenditurecanaccountfor33.5%ofthevariationinalbumsales.Theremightbemany factors that canexplain this variation, butourmodel,which includesonly advertisingexpenditureexplains33%:66%of thevariationinalbumsalesisunexplained.Therefore,theremustbeothervariablesthathaveaninfluencealso
Thenextpartoftheoutputreportsananalysisofvariance(ANOVA—seeField,2013,Chapter11).ThemostimportantpartofthetableistheF-ratio,whichiscalculatedusingEq.8,andtheassociatedsignificancevalue.Forthesedata,Fis99.59,whichissignificantatp<.001(becausethevalueinthecolumnlabelledSig.islessthan.001).Thisresulttellsusthatthereislessthana0.1%chancethatanF-ratiothislargewouldhappeniftherewerenoeffect.Therefore,wecanconcludethatourregressionmodelresultsinsignificantlybetterpredictionofalbumsalesthanifweusedthemeanvalueofalbumsales.Inshort,theregressionmodeloverallpredictsalbumsalessignificantlywell.
Output2
SSR
SSM
SSTMSR
MSM
©Prof.AndyField,2016 www.discoveringstatistics.com Page9
Model Parameters
TheANOVAtellsuswhether themodel,overall, results ina significantlygooddegreeofpredictionof theoutcomevariable.However,theANOVAdoesn’ttellusabouttheindividualcontributionofvariablesinthemodel(althoughinthissimplecasethereisonlyonevariableinthemodelandsowecaninferthatthisvariableisagoodpredictor).ThetableinOutput3providesdetailsofthemodelparameters(thebetavalues)andthesignificanceofthesevalues.WesawinEq.1thatb0wastheYinterceptandthisvalueisthevalueBfortheconstant.So,fromthetable,wecansaythatb0is134.14,andthiscanbeinterpretedasmeaningthatwhennomoneyisspentonadvertising(whenX=0),themodelpredictsthat134,140albumswillbesold(rememberthatourunitofmeasurementwasthousandsofalbums).Wecanalsoreadoffthevalueofb1fromthetableandthisvaluerepresentsthegradientoftheregressionline.Itis0.096.1Althoughthisvalueistheslopeoftheregressionline,itismoreusefultothinkofthisvalueasrepresentingthechangeintheoutcomeassociatedwithaunitchangeinthepredictor.Therefore,ifourpredictorvariableisincreasedby1unit(iftheadvertisingbudgetisincreasedby1),thenourmodelpredictsthat0.096extraalbumswillbesold.Ourunitsofmeasurementwerethousandsofpoundsandthousandsofalbumssold,sowecansaythatforanincreaseinadvertisingof£1000themodelpredicts96(0.096´1000=96)extraalbumsales.Asyoumightimagine,thisinvestmentisprettybadfortherecordcompany:theyinvest£1000andgetonly96extrasales!Fortunately,aswealreadyknow,advertisingaccountsforonlyone-thirdofalbumsales.
Output3
Wesawearlierthat,ingeneral,valuesoftheregressioncoefficientbrepresentthechangeintheoutcomeresultingfromaunitchangeinthepredictorandthatifapredictorishavingasignificantimpactonourabilitytopredicttheoutcomethenthisbshouldbedifferentfrom0(andbigrelativetoitsstandarderror).Wealsosawthatthet-testtellsuswhethertheb-valueisdifferentfrom0.SPSSprovidestheexactprobabilitythattheobservedvalueoftwouldoccurif thevalueofb in thepopulationwerezero. If thisobservedsignificance is less than .05, thentheresult reflectsagenuineeffect.Forbothts,theprobabilitiesare.000(zeroto3decimalplaces)andsowecansaythattheprobabilityofthesetvalues(orlarger)occurringifthevaluesofbinthepopulationwerezeroislessthan.001.Therefore,thebsaresignificantlydifferentfrom0.Inthecaseofthebforadvertisingbudgetthisresultmeansthattheadvertisingbudgetmakesasignificantcontribution(p<.001)topredictingalbumsales.
Thebootstrapconfidenceintervaltellsusthatthepopulationvalueofbforadvertisingbudgetislikelytofallbetween.08 and .11 and because this interval doesn’t include zero we would conclude that there is a genuine positiverelationshipbetweenadvertisingbudgetandalbumsalesinthepopulation.Also,thesignificanceassociatedwiththisconfidenceintervalisp=.001,whichishighlysignificant.Also,notethatthebootstrapprocessinvolvesre-estimatingthestandarderror(itchangesfrom.01intheoriginaltabletoabootstrapestimateof.009).Thisisaverysmallchange.Fortheconstant, thestandarderror is7.537comparedtothebootstrapestimateof8.214,which isadifferenceof
1SometimessmallvaluesarereportedbySPSSasthingslike9.612E-02andmanystudentsfindthisnotationconfusing.Well,thinkofE-02asmeaning‘movethedecimalplace2stepstotheleft’,so9.612E-02becomes0.09612.
©Prof.AndyField,2016 www.discoveringstatistics.com Page10
0.677.Thebootstrapconfidenceintervalsandsignificancevaluesareusefultoreportandinterpretbecausetheydonotrelyonassumptionsofnormalityorhomoscedasticity.
Using the Model Sofar,wehavediscoveredthatwehaveausefulmodel,onethatsignificantlyimprovesourabilitytopredictalbumsales.However,thenextstage isoftentousethatmodeltomakesomepredictions.Thefirststage istodefinethemodelbyreplacingtheb-valuesinEq.2withthevaluesfromtheoutput.Inaddition,wecanreplacetheXandYwiththevariablenamessothatthemodelbecomes:
albumsales! = 𝑏𝑏* + 𝑏𝑏+advertisingbudget!
= 134.14 + 0.096×advertisingbudget! Eq.10
Itisnowpossibletomakeapredictionaboutalbumsales,byreplacingtheadvertisingbudgetwithavalueofinterest.For example, imagine a recording company executive wanted to spend £100,000 on advertising a new album.Rememberingthatourunitsarealreadyinthousandsofpounds,wecansimplyreplacetheadvertisingbudgetwith100.Hewoulddiscoverthatalbumsalesshouldbearound144,000forthefirstweekofsales:
albumsales! = 134.14 + 0.096×advertisingbudget!
= 134.14 + 0.096×100
= 143.74
Eq.11
Regression with several predictors using SPSS
SELF-TEST: Produce a matrix scatterplot of Sales, Adverts, Airplay and Attract including theregressionline.
Main options Theexecutivehaspastresearchindicatingthatadvertisingbudgetisasignificantpredictorofalbumsales,andsoheshouldincludethisvariableinthemodelfirst.Hisnewvariables(airplayandattract)should,therefore,beenteredintothemodelafteradvertisingbudget.Thismethodishierarchical(theresearcherdecidesinwhichordertoentervariablesintothemodelbasedonpastresearch).TodoahierarchicalregressioninSPSSwehavetoenterthevariablesinblocks(eachblockrepresentingonestepinthehierarchy).Togettothemainregressiondialogboxselect .Tosetupthefirstblockwedoexactlywhatwedidbefore.Selecttheoutcomevariable(albumsales)anddragittotheboxlabelledDependent(orclickon ).Wealsoneedtospecifythepredictorvariableforthefirstblock.We’vedecidedthatadvertisingbudgetshouldbeenteredintothemodelfirst,soselectthisvariableinthelistanddragittotheboxlabelledIndependent(s)(orclickon ).UnderneaththeIndependent(s)box,thereisadrop-downmenuforspecifyingtheMethodofregression.Youcanselectadifferentmethodofvariableentryforeachblockbyclickingon ,nexttowhereitsaysMethod(seeFigure5).Thedefaultoptionisforcedentry,andthisistheoptionwewant,butnextweekwe’lllookatotherapproaches.
Havingspecifiedthefirstblockinthehierarchy,weneedtomoveontotothesecond.Totellthecomputerthatyouwanttospecifyanewblockofpredictorsyoumustclickon .ThisprocessclearstheIndependent(s)boxsothatyoucanenterthenewpredictors(youshouldalsonotethatabovethisboxitnowreadsBlock2of2indicatingthatyouareinthesecondblockofthetwothatyouhavesofarspecified).WedecidedthatthesecondblockwouldcontainbothofthenewpredictorsandsoyoushouldclickonAirplayandAttract(whileholdingdownCtrl,orCmdifyouuseaMac)inthevariableslistanddragthemtotheIndependent(s)boxorclickon .ThedialogboxshouldnowlooklikeFigure7.Tomovebetweenblocksusethe and buttons(soforexample,tomovebacktoblock1,clickon).Wecangetbootstrappedconfidenceintervalsfortheregressioncoefficientsbyclicking aswedidbefore.
©Prof.AndyField,2016 www.discoveringstatistics.com Page11
Figure7:Maindialogboxforblock2ofthemultipleregression
Output from SPSS
Overall Fit of the Model
Output4
Notethattherearetwomodels.Model1referstothefirst stage in the hierarchy when only advertisingbudgetisusedasapredictor.Model2referstowhenallthreepredictorsareused.UnderthistableSPSStellsus what the dependent variable (outcome) was andwhatthepredictorswere ineachof thetwomodels.The column labelled R contains the values of themultiplecorrelationcoefficientbetweenthepredictorsand the outcome. When only advertising budget isused as a predictor, this is the simple correlationbetweenadvertisingandalbumsales (0.578). In fact,all of the statistics formodel 1 are the same as theregressionmodelearlier(Output1).
Thenext columngivesusavalueofR2,whichwealreadyknow isameasureofhowmuchof thevariability in theoutcomeisaccountedforbythepredictors.Forthefirstmodelitsvalueis.335,whichmeansthatadvertisingbudgetaccountsfor33.5%ofthevariationinalbumsales.However,whentheothertwopredictorsareincludedaswell(model2),thisvalueincreasesto.665or66.5%ofthevarianceinalbumsales.Therefore,ifadvertisingaccountsfor33.5%,wecantellthatattractivenessandradioplayaccountforanadditional33%.2So,theinclusionofthetwonewpredictorshasexplainedquitealargeamountofthevariationinalbumsales
Output5containsanANOVAthattestswhetherthemodelissignificantlybetteratpredictingtheoutcomethanusingthemeanasa‘bestguess’.FortheinitialmodeltheF-ratiois99.59,p<.001.ForthesecondmodelthevalueofFis129.498, which is also highly significant (p < .001). We can interpret these results as meaning that both modelssignificantlyimprovedourabilitytopredicttheoutcomevariablecomparedtonotfittingthemodel.
2Thatis,33%=66.5%-33.5%.
©Prof.AndyField,2016 www.discoveringstatistics.com Page12
Output5
Model Parameters
RememberthatinmultipleregressionthemodeltakestheformofEq.6andinthatequationthereareseveralunknownparameters(theb-values).ThefirstpartofOutput6givesusestimatesfortheseb-valuesandthesevaluesindicatetheindividualcontributionofeachpredictortothemodel.Wecanusetheseestimatestodefineourmodel:
sales! = 𝑏𝑏* + 𝑏𝑏+advertising! + 𝑏𝑏>airplay! + 𝑏𝑏aattractiveness!
= −26.61 + 0.08advertising! + 3.37airplay! + 11.09attractiveness! Eq.12
Theb-valuestellusabouttherelationshipbetweenalbumsalesandeachpredictor.Allthreepredictorshavepositiveb-valuesindicatingpositiverelationships.So,asadvertisingbudgetincreases,albumsalesincrease;asplaysontheradioincrease,sodoalbumsales;andfinallymoreattractivebandswillsellmorealbums.Theb-valuestellusmorethanthis,though.Theytellustowhatdegreeeachpredictoraffectstheoutcome iftheeffectsofallotherpredictorsareheldconstant:
• Advertisingbudget (b=0.085):asadvertisingbudget increasesbyoneunit,albumsales increaseby0.085units.
• Airplay(b=3.367):asthenumberofplaysonradiointheweekbeforereleaseincreasesbyone,albumsalesincreaseby3.367units.
• Attractiveness (b=11.086):abandratedoneunithigherontheattractivenessscalecanexpectadditionalalbumsalesof11.086units.
Forthismodel,theadvertisingbudget,t(196)=12.26,p<.001,theamountofradioplaypriortorelease,t(196)=12.12,p<.001andattractivenessoftheband,t(196)=4.55,p<.001,areallsignificantpredictorsofalbumsales.3Rememberthatthesesignificancetestsareaccurateonlyiftheassumptionsdiscussedinyourlecturearemet.
If these assumptions aren’tmet, or youwant to ignore them, you could look at the table of bootstrap confidenceintervalsforeachpredictorandtheirsignificancevalue4.Thesetellusthatadvertising,b=0.09[0.07,0.10],p=.001,airplay,b=3.37[2.74,4.02],p=.001,andattractivenessoftheband,b=11.09[6.46,15.01],p=.001,allsignificantlypredictalbumsales.Theconfidenceintervalsareconstructedsuchthatin95%ofsamplestheboundariescontainthepopulationvalueofb.Therefore,it’slikelythattheconfidenceintervalwehaveconstructedforthissamplewillcontainthetruevalueofb inthepopulation.Therefore,wecanusetheconfidence intervalstotellusthe likelysizeoftheparameterinthepopulation(i.e.,thetruevalue).Iftheconfidenceintervalcontainszerothenthismeansthatthetruevaluemightbezero(i.e.,noeffectatall)oroppositetowhatweobservedinthesample(e.g.,anegativebinsteadof
3ForallofthesepredictorsIwrotet(196).Thenumberinbracketsisthedegreesoffreedom.InregressionthedegreesoffreedomareN-p-1,whereNisthetotalsamplesize(inthiscase200)andpisthenumberofpredictors(inthiscase3).Forthesedataweget200–3–1=196.4Rememberthatbecauseofhowbootstrappingworksthevaluesinyouroutputwillbeslightlydifferenttomine,anddifferentifyoure-runtheanalysis.
©Prof.AndyField,2016 www.discoveringstatistics.com Page13
thepositiveonethatweobserved).Therefore,becausezerodoesnotfallwithintheboundariesofanyofourbootstrapconfidenceintervals,wecanconcludeveryconfidentlythatthepopulationvaluesofbarepositive—inotherwords,allofthepredictorvariablesaregenuinepredictorsofalbumsales.
Output6
Tasks Task 1 Lacourseetal.(2001)conductedastudytoseewhethersuicideriskwasrelatedtolisteningtoheavymetalmusic.Theydevisedascaletomeasurepreferenceforbands falling intothecategoryofheavymetal.Thisscale includedheavymetalbands(BlackSabbath,IronMaiden),speedmetalbands(Slayer,Metallica),death/blackmetalbands(Obituary,Burzum)andgothicbands(MarilynManson,SistsersofMercy).Theythenusedthis(andothervariables)aspredictorsofsuicideriskbasedonascalemeasuringsuicidalideationetc.devisedbyTousignantetal.,(1988).
• Lacourse,E.,Claes,M.,&Villeneuve,M.(2001).HeavyMetalMusicandAdolescentSuicidalRisk.JournalofYouthandAdolescence,30(3),321-332.[AvailablethroughtheSussexElectronicLibrary].
Let’simaginewereplicatedthisstudy.ThedatafileHMSuicide.sav(onthecoursewebsite)containsthedatafromsuchareplication.Therearetwovariablesrepresentingscoresonthescalesdescribedabove:hm(theextenttowhichthepersonlistenstoheavymetalmusic)andsuicide(theextenttowhichsomeonehassuicidalideationandsoon).Usingthesedatacarryoutaregressionanalysistoseewhetherlisteningtoheavymetalpredictssuiciderisk.
Howmuchvariancedoesthefinalmodelexplain?
YourAnswers:
©Prof.AndyField,2016 www.discoveringstatistics.com Page14
Doeslisteningtoheavymetalsignificantlypredictsuiciderisk(quoterelevantstatistics)?
YourAnswers:
Whatisthenatureoftherelationshipbetweenlisteningtoheavymetalandsuiciderisk?(sketchascatterplotifithelpsyoutoexplain).
YourAnswers:
Writeouttheregressionequationforthemodel
YourAnswers:
Aslisteningtoheavymetalincreasesby1unit,howmuchdoessuicideriskincrease?
YourAnswers:
Task 2 Oneofmyfavouriteactivities,especiallywhentryingtodobrain-meltingthingslikewritingstatisticsbooks,istodrinktea. IamEnglishafterall.Fortunately, tea improvesyourcognitivefunction,well, inoldChinesepeopleatanyrate(Feng,Gwee,Kua,&Ng,2010).ImaynotbeChineseandI’mnotthatold,butIneverthelessenjoytheideathatteamighthelpmethink.Therearesomedata(TeaMakesYouBrainy716.sav)basedonFengetal.’sstudythatmeasuredthenumberofcupsofteadrunkandcognitivefunctioning.Useregressiontoconstructamodelthatpredictscognitivefunctioning from tea drinking, what would cognitive functioning be if someone drank 10 cups of tea? Is there asignificanteffect?
©Prof.AndyField,2016 www.discoveringstatistics.com Page15
Task 3 Inthelecturewesawanexampleofoutliersvs.influentialcasesinwhichwepredictedmortalityratesformthenumberofpubsinthearea.Runaregressionanalysisforthepubs.savdatapredictingmortalityfromthenumberofpubs.Tryrepeatingtheanalysisbutbootstrappingtheconfidenceintervals.
Task 4 TheHonestyLab (www.honestylab.com) lookedathowpeopleevaluateddishonestacts.Participantsevaluated thedishonestyofactsbasedonwatchingvideosofpeopleconfessingtothoseacts.Themediawouldhaveusbelievethatthemore likeable theperpetratorwas, themorepositively theirdishonestactswereviewed. Imaginewe took100peopleandgavethemarandomdishonestact,describedbytheperpetrator.Weaskedthemtoevaluatethehonestyoftheact(0=appallingbehaviourto10=it’sOkreally)andhowmuchtheylikedtheperson(0=notatall,10=alot).I’vefabricatedsomedata(HonestyLab.sav)relatingtopeople’sratingsofdishonestactsandthelikeablenessoftheperpetrator. Run a regression using bootstrapping to predict ratings of dishonesty from the likeableness of theperpetrator.
References Feng,L.,Gwee,X.,Kua,E.H.,&Ng,T.P.(2010).Cognitivefunctionandteaconsumptionincommunitydwellingolder
ChineseinSingapore.JournalofNutritionHealth&Aging,14(6),433-438.
Wright, D. B., London, K., & Field, A. P. (2011). Using Bootstrap Estimation and the Plug-in Principle for ClinicalPsychologyData.JournalofExperimentalPsychopathology.,2(2),252–270.doi:doi:10.5127/jep.013611
Terms of Use Thishandoutcontainsmaterialfrom:
Field,A.P.(2013).DiscoveringstatisticsusingSPSS:andsexanddrugsandrock‘n’roll(4thEdition).London:Sage.
ThismaterialiscopyrightAndyField(2000-2016).
This document is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 InternationalLicense,basicallyyoucanuseitforteachingandnon-profitactivitiesbutnotmeddlewithitwithoutpermissionfromtheauthor.