Exploring Data: The Beast of Bias
Transcript of Exploring Data: The Beast of Bias
©Prof.AndyField,2016 www.discoveringstatistics.com Page1
Exploring Data: The Beast of Bias Sources of Bias Abitofrevision.We’veseenthathavingcollecteddataweusuallyfitamodelthatrepresentsthehypothesisthatwewanttotest.Thismodelisusuallyalinearmodel,whichtakestheformof:
outcome! = 𝑏𝑏$𝑋𝑋$! + 𝑏𝑏'𝑋𝑋'! ⋯ 𝑏𝑏)𝑋𝑋)! + error! Eq.1
Therefore,wepredictanoutcomevariable,fromoneormorepredictorvariables(theXs)andparameters(thebsintheequation)thattellussomethingabouttherelationshipbetweenthepredictorandtheoutcomevariable.Finally,themodelwillnotpredicttheoutcomeperfectlysoforeachobservationtherewillbesomeerror.
Whenwefitamodel,weoftenestimatetheparameters(b)usingthemethodofleastsquares(knownasordinaryleastsquaresorOLS).We’renotinterestedinoursamplesomuchasageneralpopulation,soweusethesampledatatoestimatethevalueoftheparametersinthepopulation(that’swhywecallthemestimatesratherthanvalues).Whenweestimateaparameterwealsocomputeanestimateofhowwell itrepresentsthepopulationsuchasastandarderrororconfidence interval.Wecantesthypothesesabout theseparametersbycomputingteststatisticsandtheirassociated probabilities (p-values). Therefore, when we think about bias, we need to think about it within threecontexts:
1. Thingsthatbiastheparameterestimates.
2. Thingsthatbiasstandarderrorsandconfidenceintervals.
3. Thingsthatbiasteststatisticsandp-values.
These situations are related because confidence intervals and test statistics both rely on the standard error. It isimportantthatweidentifyandeliminateanythingthatmightaffecttheinformationthatweusetodrawconclusionsabouttheworld,soweexploredatatolookforbias.
Outliers Anoutlierisascoreverydifferentfromtherestofthedata.WhenIpublishedmyfirstbook(Field,2000),Iobsessivelycheckedthebook’sratingsonAmazon.co.uk.Customerratingscanrangefrom1to5stars,where5isthebest.Backin2002,myfirstbookhadsevenratings(intheordergiven)of2,5,4,5,5,5,and5.Allbutoneoftheseratingsarefairlysimilar(mainly5and4)butthefirstratingwasquitedifferentfromtherest—itwasaratingof2(ameanandhorriblerating).Themeanofthesescoreswas4.43.Theonlyscorethatwasn’ta4or5wasthefirstratingof2.Thisscoreisanexampleofanoutlier—aweirdandunusualperson(sorry, Imeanscore) thatdeviates fromtherestofhumanity (Imean,dataset).Themeanofthescoreswhentheoutlierisnotincludedis4.83(itincreasesby0.4).Thisexampleshowshowasinglescore,fromsomemean-spiritedbadgerturd,canbiasaparametersuchasthemean:thefirstratingof2dragstheaveragedown.Basedonthisbiasedestimatenewcustomersmighterroneouslyconcludethatmybookisworsethanthepopulationactuallythinksitis.
Spotting outliers with graphs Abiologistwasworriedaboutthepotentialhealtheffectsofmusicfestivals.So,oneyearshewenttotheDownloadMusicFestival(http://www.downloadfestival.co.uk)andmeasuredthehygieneof810concertgoersoverthethreedaysofthefestival.Intheoryeachpersonwasmeasuredoneachdaybutbecauseitwasdifficulttotrackpeopledown,thereweresomemissingdataondays2and3.Hygienewasmeasuredusingastandardisedtechnique(don’tworryitwasn’tlickingtheperson’sarmpit)thatresultsinascorerangingbetween0(yousmelllikearottingcorpsethat’shidingupaskunk’sarse)and5(yousmellofsweetrosesonafreshspringday).NowIknowfrombitterexperiencethatsanitationisnotalwaysgreatattheseplaces(Readingfestivalseemsparticularlybad…)andsothisresearcherpredictedthatpersonal hygiene would go down dramatically over the three days of the festival. The data file is calledDownloadFestival.sav.
©Prof.AndyField,2016 www.discoveringstatistics.com Page2
ToplotahistogramweuseChartBuilder(seelastweek’shandout)whichisaccessedthroughthe menu.SelectHistograminthelistlabelledChoosefromtobringupthegalleryshowninFigure1.Thisgalleryhasfouriconsrepresentingdifferenttypesofhistogram,andyoushouldselecttheappropriateoneeitherbydouble-clickingonit,orbydraggingitontothecanvasintheChartBuilder:
® Simplehistogram:Usethisoptionwhenyoujustwanttoseethefrequenciesofscoresforasinglevariable(i.e.mostofthetime).
® Stacked histogram andPopulation Pyramid: If you had a grouping variable (e.g.whethermen orwomenattendedthefestival)youcouldproduceahistograminwhicheachbarissplitbygroup(stackedhistogram)ortheoutcome(inthiscasehygiene)ontheverticalaxisandeachgroup(i.e.menvs.women)onthehorizontal(i.e.,thehistogramsformenandwomenappearbacktobackonthegraph).
® FrequencyPolygon:Thisoptiondisplaysthesamedataasthesimplehistogramexceptthatitusesalineinsteadofbarstoshowthefrequency,andtheareabelowthelineisshaded.
Figure1:Thehistogramgallery
Wearegoingtodoasimplehistogramsodouble-clickontheiconforasimplehistogram(Figure1).TheChartBuilderdialogboxwillnowshowapreviewofthegraphinthecanvasarea.Atthemomentit’snotveryexcitingbecausewehaven’ttoldSPSSwhichvariableswewanttoplot.Notethatthevariablesinthedataeditorarelistedontheleft-handsideoftheChartBuilder,andanyofthesevariablescanbedraggedintoanyofthedropzones(spacessurroundedbybluedottedlines).
Ahistogramplotsasinglevariable(x-axis)againstthefrequencyofscores(y-axis),soallweneedtodoisselectavariablefromthelistanddragitinto .Let’sdothisforthehygienescoresonday1ofthefestival.Clickonthisvariableinthelistanddragitto asshowninFigure2;youwillnowfindthehistogrampreviewedonthecanvas.Todrawthehistogramclickon .
TheresultinghistogramisshowninFigure3andthefirstthingthatshouldleapoutatyouisthatthereappearstobeonecasethat isverydifferenttotheothers.Allofthescoresappeartobesquisheduponeendofthedistributionbecausetheyarealllessthan5(yieldingwhatisknownasaleptokurticdistribution!)exceptforone,whichhasavalueof20!Thisscoreisanoutlier.What’soddaboutthisoutlieristhatithasascoreof20,whichisabovethetopofourscale(rememberourhygienescalerangedonlyfrom0-5)andso itmustbeamistake(orthepersonhadobsessivecompulsivedisorderandhadwashedthemselvesintoastateofextremecleanliness).However,with810cases,howonearthdowe findoutwhich case itwas?You could just look through thedata, but thatwould certainly give youaheadacheandsoinsteadwecanuseaboxplot.
You encountered boxplots or box-whisker diagrams last year. At the centre of the plot is themedian, which issurroundedbyaboxthetopandbottomofwhicharethelimitswithinwhichthemiddle50%ofobservationsfall(theinterquartilerange).Stickingoutofthetopandbottomoftheboxaretwowhiskerswhichextendtothemostandleastextremescoresrespectively.Outliersasshownasdotsorstars(seemybookfordetails).
Simple Histogram
Stacked Histogram
Frequency Polygon
Population Pyramid
©Prof.AndyField,2016 www.discoveringstatistics.com Page3
Figure2:Plottingahistogram
Figure3:HistogramoftheDay1DownloadFestivalhygienescores
Selectthe menu,thenselectBoxplotinthelistlabelledChoosefromtobringupthegalleryshowninFigure4.Therearethreetypesofboxplotyoucanchoose:
® Simpleboxplot:Usethisoptionwhenyouwanttoplotaboxplotofasinglevariable,butyouwantdifferentboxplots produced for different categories in thedata (for thesehygienedatawe couldproduce separateboxplotsformenandwomen).
® Clusteredboxplot:Thisoptionisthesameasthesimpleboxplotexceptthatyoucanselectasecondcategoricalvariableonwhichtosplit thedata.Boxplots for thissecondvariableareproduced indifferentcolours.Forexample,wemighthavemeasuredwhetherourfestival-goerwasstayinginatentoranearbyhotelduringthefestival.Wecouldproduceboxplotsnotjustformenandwomen,butwithinmenandwomenwecouldhavedifferent-colouredboxplotsforthosewhostayedintentsandthosewhostayedinhotels.
Click on the HygieneDay 1 variable and drag
it to this Drop Zone
©Prof.AndyField,2016 www.discoveringstatistics.com Page4
® 1-DBoxplot:Usethisoptionwhenyoujustwanttoseeaboxplotforasinglevariable.(Thisdiffersfromthesimpleboxplotonlyinthatnocategoricalvariableisselectedforthex-axis.)
Figure4:Theboxplotgallery
Tomakeourboxplotof theday1hygienescores,double-clickonthesimpleboxplot icon (Figure4), then fromthevariablelistselectthehygieneday1scorevariableanddragitinto .ThedialogshouldnowlooklikeFigure5—clickon toproducethegraph.
Figure5:Boxplotforthedownloadfestivaldata
Theoutlierthatwedetectedinthehistogramhasshownupasanextremescore(*)ontheboxplot.SPSShelpfullytellsusthenumberofthecase(611)that’sproducingthisoutlier.Ifwegotothedataeditor(dataview),wecanlocatethiscasequicklybyclickingon andtyping611inthedialogboxthatappears.Thattakesusstraighttocase611.Lookingatthiscaserevealsascoreof20.02,whichisprobablyamistypingof2.02.We’dhavetogobacktotherawdataandcheck.We’llassumewe’vecheckedtherawdataandthisscoreshouldbe2.02,soreplacethevalue20.02withthevalue2.02beforewecontinuethisexample
SELFTEST:Nowwehaveremovedtheoutlierinthedata,re-plotthehistogramandboxplot.
Simple Boxplot
Clustered Boxplot
1-D Boxplot
©Prof.AndyField,2016 www.discoveringstatistics.com Page5
Figure6:Histogram(left)andboxplot(right)ofhygienescoresonday1oftheDownloadFestival
Figure6showsthehistogramandboxplotforthedataaftertheextremecasehasbeencorrected.Thedistributionisnicely symmetrical anddoesn’t seem toopointy or flat.Neither plot indicates any particularly extreme scores: theboxplotsuggeststhatcase574isamildoutlier,butthehistogramdoesn’tseemtoshowanycasesasbeingparticularlyoutoftheordinary
Assumptions Mostofourpotentialsourcesofbiascomeintheformof‘violationsofassumptions’.Anassumptionisaconditionthatensuresthatwhatyou’reattemptingtodoworks.Forexample,whenweassessamodelusingateststatistic,wehaveusuallymadesomeassumptions,andiftheseassumptionsaretruethenweknowthatwecantaketheteststatistic(and,therefore,p-value)associatedwithamodelatfacevalueandinterpretitaccordingly.Conversely,ifanyoftheassumptionsarenottrue(usuallyreferredtoasaviolation)thentheteststatisticandp-valuewillbeinaccurateandcouldleadustothewrongconclusionifweinterpretthematfacevalue.
Assumptionsareoftenpresentedasthoughdifferentstatisticalprocedureshavetheirownuniquesetofassumptions.However,becausewe’reusuallyfittingvariationsofthelinearmodeltoourdata,allofthetestsinmybook(Field,2013)basicallyhavethesameassumptions.Theseassumptionsrelatetothequalityofthemodelitself,andtheteststatisticsusedtoassessit(whichareusuallyparametrictestsbasedonthenormaldistribution).Themainassumptionsthatwe’lllookatare:
• Additivityandlinearity
• Normalityofsomethingorother
• Homoscedasticity/homogeneityofvariance
• Independence
Additivity and Linearity Thevastmajorityofstatisticalmodelsinmybookarebasedonthelinearmodel,whichtakesthisform:
outcome! = 𝑏𝑏$𝑋𝑋$! + 𝑏𝑏'𝑋𝑋'! ⋯ 𝑏𝑏)𝑋𝑋)! + error!
The assumption of additivity and linearity means that the outcome variable is, in reality, linearly related to anypredictors(i.e.,theirrelationshipcanbesummedupbyastraightline)andthatifyouhaveseveralpredictorsthentheircombinedeffect isbestdescribedbyaddingtheireffects together. Inotherwords, itmeansthat theprocesswe’retryingtomodelcanbeaccuratelydescribedas:
𝑏𝑏$𝑋𝑋$! + 𝑏𝑏'𝑋𝑋'! ⋯ 𝑏𝑏)𝑋𝑋)!
Thisassumptionisthemostimportantbecauseifitisnottruethenevenifallotherassumptionsaremet,yourmodelisinvalidbecauseyouhavedescribeditincorrectly.It’sabitlikecallingyourpetcatadog:youcantrytogetittogoin
©Prof.AndyField,2016 www.discoveringstatistics.com Page6
akennel,ortofetchsticks,ortositwhenyoutellitto,butdon’tbesurprisedwhenit’sbehaviourisn’twhatyouexpectbecauseeventhoughyou’veacalleditadog,itisinfactacat.
The assumption of normality
What does it mean?
Manypeopletakethe‘assumptionofnormality’tomeanthatyourdataneedtobenormallydistributed.However,thatisn’twhatitmeans.Whatitdoesmeanis:
1. Forconfidenceintervalsaroundaparameterestimate(e.g.,themean,orab)tobeaccurate,thatestimatemustcomefromanormaldistribution.
2. Forsignificancetestsofmodels(andtheparameterestimatesthatdefinethem)tobeaccuratethesamplingdistributionofwhat’sbeingtestedmustbenormal.Forexample,iftestingwhethertwomeansaredifferent,the data do not need to be normally distributed, but the sampling distribution of means (or differencesbetweenmeans) does. Similarly, if looking at relationshipsbetween variables, the significance tests of theparameterestimatesthatdefinethoserelationships(thebsinEq.1)willbeaccurateonlywhenthesamplingdistributionoftheestimateisnormal.
3. Fortheestimatesoftheparametersthatdefineamodel(thebsinEq.1)tobeoptimal(usingthemethodofleastsquares)theresiduals(theerroriinEq.1)inthepopulationmustbenormallydistributed.
Themisconception thatpeopleoftenhaveabout thedata themselvesneeding tobenormallydistributedprobablystemsfromthefactthatifthedataarenormallydistributedthenit’sreasonabletoassumethattheerrorsinthemodelandthesamplingdistributionarealso(andremember,wedon’thavedirectaccesstothesamplingdistributionsowehavetomakeeducatedguessesaboutitsshape).
Figure7:Adistributionthatlooksnon-normal(left)couldbemadeupofdifferentgroupsofnormally-distributedscores
Whenyouhaveacategoricalpredictorvariable(suchaspeoplefallingintodifferentgroups)youwouldn’texpecttheoveralldistributionoftheoutcome(orresiduals)tobenormal.Forexample,ifyouhaveseenthemovie‘theMuppets’,youwillknowthatMuppetsliveamongus.ImagineyoupredictedthatMuppetsarehappierthanhumans(onTVtheyseemtobe).YoucollecthappinessscoresinsomeMuppetsandsomeHumansandplotthefrequencydistribution.YougetthegraphontheleftofFigure7anddecidethatyourdataarenotnormal:youthinkthatyouhaveviolatedtheassumption of normality. However, you haven’t because you predicted that Humans and Muppets will differ inhappiness; in other words, you predict that they come from different populations. If we plot separate frequencydistributionsforhumansandMuppets(rightofFigure7)you’llnoticethatwithineachgroupthedistributionofscoresisverynormal.Thedataareasyoupredicted:Muppetsarehappierthanhumansandsothecentreoftheirdistributionishigherthanthatofhumans.Whenyoucombineallofthescoresthisgivesyouabimodaldistribution(i.e.,twohumps).Thisexampleillustratesthatitisnotthenormalityoftheoutcome(orresiduals)overallthatmatters,butnormalityateachuniquelevelofthepredictorvariable.
Happiness
Density
0.00
0.05
0.10
0.15
0.20
5 10 15 20
PopulationHumanMuppets
Happiness
Density
0.00
0.05
0.10
0.15
0.20
5 10 15 20
Entire Population Population Split Into Groups
©Prof.AndyField,2016 www.discoveringstatistics.com Page7
When does the assumption of normality matter?
Thecentrallimittheoremmeansthatthereareavarietyofsituationsinwhichwecanassumenormalityregardlessoftheshapeofoursampledata(Lumley,Diehr,Emerson,&Chen,2002):
1. Confidenceintervals:Thecentrallimittheoremtellsusthatinlargesamples,theestimatewillhavecomefromanormaldistributionregardlessofwhatthesampleorpopulationdatalooklike.Therefore,ifweareinterestedin computing confidence intervals thenwe don’t need toworry about the assumption of normality if oursampleislargeenough.
2. Significancetests:thecentral limittheoremtellsusthattheshapeofourdatashouldn’taffectsignificancetestsprovidedoursampleislargeenough.However,theextenttowhichteststatisticsperformastheyshoulddoinlargesamplesvariesacrossdifferentteststatistics—formoreinformationreadField(2013).
3. Parameterestimates:Themethodofleastsquareswillalwaysgiveyouanestimateofthemodelparametersthatminimizeserror,sointhatsenseyoudon’tneedtoassumenormalityofanythingtofitalinearmodelandestimatetheparametersthatdefineit(Gelman&Hill,2007).However,thereareothermethodsforestimatingmodelparameters,andifyouhappentohavenormallydistributederrorsthentheestimatesthatyouobtainedusingthemethodofleastsquareswillhavelesserrorthantheestimatesyouwouldhavegotusinganyoftheseothermethods.
Tosumupthen,ifallyouwanttodoisestimatetheparametersofyourmodelthennormalitydoesn’treallymatter.Ifyouwanttoconstructconfidenceintervalsaroundthoseparameters,orcomputesignificancetestsrelatingtothoseparametersthentheassumptionofnormalitymattersinsmallsamples,butbecauseofthecentrallimittheoremwedon’treallyneedtoworryaboutthisassumptioninlargersamples(butseeField(2013)foradiscussionofwhatwemightmeanbyalargersample).Inpracticalterms,aslongasyoursampleisfairlylarge,outliersareamorepressingconcernthannormality.
Homogeneity of Variance/Homoscedasticity Thesecondassumptionwe’llexplorerelatestovarianceanditcanimpactonthetwomainthingsthatwemightdowhenwefitmodelstodata:
• Parameters:Ifweusethemethodofleastsquarestoestimatetheparametersinthemodel,thenthiswillgiveusoptimalestimatesifthevarianceoftheoutcomevariableisequalacrossdifferentvaluesofthepredictorvariable.
• Nullhypothesissignificancetesting:teststatisticsoftenassumethatthevarianceoftheoutcomevariableisequalacrossdifferentvaluesofthepredictorvariable.Ifthisisnotthecasethentheseteststatisticswillbeinaccurate.
Therefore,tomakesureourestimatesoftheparametersthatdefineourmodelandsignificancetestsareaccuratewehavetoassumehomoscedasticity(alsoknownashomogeneityofvariance).
What is homoscedasticity/homogeneity of variance?
Indesignsinwhichyoutestseveralgroupsofparticipantsthisassumptionmeansthateachofthesesamplescomesfrompopulationswith the same variance. In correlational designs, this assumptionmeans that the varianceof theoutcomevariableshouldbestableatalllevelsofthepredictorvariable.Inotherwords,asyougothroughlevelsofthepredictorvariable,thevarianceoftheoutcomevariableshouldnotchange.
When does homoscedasticity/homogeneity of variance matter?
Intermsofestimatingtheparameterswithinalinearmodelifweassumeequalityofvariancethentheestimateswegetusingthemethodofleastsquareswillbeoptimal.Ifvariancesfortheoutcomevariabledifferalongthepredictorvariablethentheestimatesoftheparameterswithinthemodelwillnotbeoptimal.Theywillbe‘unbiased’(Hayes&Cai,2007)butnotoptimal.
Unequal variances/heteroscedasticity also creates bias and inconsistency in the estimate of the standard errorassociated with the parameter estimates (Hayes & Cai, 2007). This basically means that confidence intervals and
©Prof.AndyField,2016 www.discoveringstatistics.com Page8
significance testswillbebiased (because theyarecomputedusing thestandarderror).Confidence intervalscanbe‘extremelyinaccurate’whenhomogeneityofvariance/homoscedasticitycannotbeassumed(Wilcox,2010).Someteststatisticsaredesignedtobeaccurateevenwhenthisassumptionisviolated.
Independence Thisassumptionmeansthattheerrorsinyourmodel(theerroriinEq.1)arenotrelatedtoeachother.ImaginePauland Julie were participants in an experiment where they had to indicate whether they remembered having seenparticularphotos.IfPaulandJulieweretoconferaboutwhetherthey’dseencertainphotosthentheiranswerswouldnotbeindependent:Julie’sresponsetoagivenquestionwoulddependonPaul’sanswer.Weknowalreadythatifweestimateamodeltopredicttheirresponses,therewillbeerrorinthosepredictionsandbecausePaulandJulie’sscoresarenotindependenttheerrorsassociatedwiththesepredictedvalueswillalsonotbeindependent.IfPaulandJuliewereunabletoconfer (if theywere locked indifferentrooms) thentheerror termsshouldbe independent (unlessthey’retelepathic):theerrorinpredictingPaul’sresponseshouldnotbeinfluencedbytheerrorinpredictingJulie’sresponse.
Theequationthatweusetoestimatethestandarderrorisvalidonlyifobservationsareindependent.Rememberthatweusethestandarderrortocomputeconfidenceintervalsandsignificancetests,soifweviolatetheassumptionofindependencethenourconfidenceintervalsandsignificancetestswillbeinvalid.Ifweusethemethodofleastsquares,thenmodelparameterestimateswill still bevalidbutnotoptimal (wecouldgetbetterestimatesusingadifferentmethod).Ingeneral,ifthisassumptionisviolated,therearetechniquesyoucanusedescribedinChapter20of(Field,2013).
Testing Assumptions Testing normality Youcanlookfornormalityinthreeways:(1)graphs;(2)numerically;and(3)significancetests.WecandoallthreeusingtheExplorecommandinSPSS.Intermsofgraphswecanlookathistograms(whichwe’vealreadylearntabout)andP-PplotsandQ-Qplots.P-PandQ-Qplotsbasicallyshowthesamething:aP-Pplotplotsthecumulativeprobabilityofavariableagainstthecumulativeprobabilityofaparticulardistribution(inthiscaseanormaldistribution).AQ-Qplotdoesthesamethingbutexpressedasquantiles.WithlargedatasetsQ-Qplotsareabiteasiertointerpret.Inasensetheyplotthe‘actualdata’against ‘whatyou’dexpecttogetfromanormaldistribution’,so ifthedataarenormallydistributedthentheactualscoreswillbethesameastheexpectedscoresandyou’llgetalovelystraightdiagonalline.Thisidealscenarioishelpfullyplottedonthegraphandyourjobistocomparethedatapointstothisline.Ifvaluesfallonthediagonaloftheplotthenthevariableisnormallydistributed;however,whenthedatasagconsistentlyaboveorbelowthediagonalthenthisshowsthatthekurtosisdiffersfromanormaldistribution,andwhenthedatapointsareS-shaped,theproblemisskewness.
Numerically, SPSS usesmethods to calculate skew and kurtosis (see Field (2013) if you have forgottenwhat theseconceptsare)thatgivevaluesofzeroinanormaldistribution.Positivevaluesofskewnessindicateapile-upofscoresontheleftofthedistribution,whereasnegativevaluesindicateapile-upontheright.Positivevaluesofkurtosisindicateapointyandheavy-taileddistribution,whereasnegativevaluesindicateaflatandlight-taileddistribution.Thefurtherthevalueisfromzero,themorelikelyitisthatthedataarenotnormallydistributed.
Finally,wecanseewhetherthedistributionofscoresdeviatesfromacomparablenormaldistribution.TheKolmogorov-SmirnovtestandShapiro-Wilktestdothis: theycomparethescores inthesampletoanormallydistributedsetofscoreswiththesamemeanandstandarddeviation.
• Ifthetestisnon-significant(p>.05)ittellsusthatthedistributionofthesampleisnotsignificantlydifferentfromanormaldistribution(i.e.,itisprobablynormal).
• If,however, thetest issignificant (p< .05)thenthedistribution inquestion issignificantlydifferentfromanormaldistribution(i.e.,itisnon-normal).
Thesetestsseemgreat:inoneeasyproceduretheytelluswhetherourscoresarenormallydistributed(nice!).However,theJaneSuperbrainBoxexplainssomereallygoodreasonsnottousethem.Ifyouinsistonusingthem,bearJane’sadviceinmindandalwaysplotyourdataaswellandtrytomakeaninformeddecisionabouttheextentofnon-normalitybasedonconvergingevidence.
©Prof.AndyField,2016 www.discoveringstatistics.com Page9
Figure8showsthedialogboxesfortheExplorecommand( ).First,enteranyvariablesofinterestintheboxlabelledDependentListbyhighlightingthemontheleft-handsideandtransferringthembyclickingon .Forthisexample,selectthehygienescoresforthethreedays.Ifyouclickon adialogboxappears,butthedefaultoptionisfine(itwillproducemeans,standarddeviationsandsoon).Themoreinterestingoption for our current purposes is accessed by clicking on . In this dialog box select the option
, and thiswill produceboth theK-S test and somenormalQ-Qplots. You can also split theanalysis by a factor or grouping varaiable (for example,we could do a separete analysis formales and females bydragginggendertotheFactorListbox—we’lldothislaterinthehandout).
Figure8:Dialogboxesfortheexplorecommand
Wealsoneedtoclickon totellSPSShowtodealwithmisisngvalues.Thisisimportantbecausealthoughwestartoffwith810scoresonday1,bydaytwowehaveonly264andthisisreducedto123onday3.BydefaultSPSSwilluseonlycasesforwhichtherearevalidscoresonalloftheselectedvariables.Thiswouldmeanthatforday1,eventhoughwehave810scores,itwilluseonlythe123casesforwhichtherearescoresonallthreedays.Thisisknownasexlcudingcaseslistwise.However,wewantittouseallofthescoresithasonagivenday,whichisknownaspairwise.Onceyouhaveclickedon selectExcludecasespairwise,thenclickon toreturntothemaindialogboxandclickon toruntheanalysis.
©Prof.AndyField,2016 www.discoveringstatistics.com Page10
SELF-TEST:Wehavealreadyplottedahistogramoftheday1scores,usingwhatyouleantbeforeplothistogramsforthehygienescoresfordays2and3oftheDownloadFestival.
Figure9:Histograms(left)andP-Pplots(right)ofthehygienescoresoverthethreedaysoftheDownloadFestival
Figure9showsthehistograms(fromtheself-testtasks)andthecorrespondingQ-Qplots.Theday1scoreslookquitenormal;TheQ-Qplotechoesthisviewbecausethedatapointsallfallveryclosetothe‘ideal’diagonalline.However,thedistributionsfordays2and3arenotnearlyassymmetricalasday1:theybothlookpositivelyskewed.Again,thiscanbeseenintheQ-Qplotsbythedatapointsdeviatingawayfromthediagonal.Ingeneral,thisseemstosuggestthatbydays2and3,hygienescoresweremuchmoreclusteredaroundthelowendofthescale.Rememberthatthelowerthescore, the lesshygienic theperson is, sogenerallypeoplebecamesmellieras the festivalprogressed.Theskewoccursbecauseasubstantialminorityinsistedonupholdingtheirlevelsofhygiene(againstallodds)overthecourseofthefestival(babywet-wipesareindispensableIfind).
Output1showsthetableofdescriptivestatisticsforthethreevariablesinthisexample.Onaverage,hygienescoreswere1.77(outof5)onday1ofthefestival,butwentdownto0.96and0.98ondays2and3respectively.Theotherimportantmeasuresforourpurposesaretheskewnessandthekurtosis,bothofwhichhaveanassociatedstandarderror.Forday1theskewvalueisveryclosetozero(whichisgood)andkurtosisisalittlenegative.Fordays2and3,though,thereisaskewnessofaround1(positiveskew).
©Prof.AndyField,2016 www.discoveringstatistics.com Page11
Wecanconvertthesevaluestoz-scoreswhichenablesusto(1)compareskewandkurtosisvaluesindifferentsamplesthatuseddifferentmeasures,and(2)calculateap-valuethattellsusifthevaluesaresignificantlydifferentfrom0(i.e.,normal).Althoughtherearegoodreasonsnottodothis (seeJaneSuperbrainBox), ifyouwanttoyoucando itbysubtractingthemeanofthedistribution(inthiscasezero)fromthescoreandthendividingbythestandarderrorofthedistribution.
𝑧𝑧+,-./-++ =𝑆𝑆 − 0
𝑆𝑆𝑆𝑆+,-./-++ 𝑧𝑧,4567+8+ =
𝐾𝐾 − 0𝑆𝑆𝑆𝑆,4567+8+
Intheaboveequations,thevaluesofS(skewness)andK(kurtosis)andtheirrespectivestandarderrorsareproducedbySPSS.Thesez-scorescanbecomparedagainstvaluesthatyouwouldexpecttogetifskewandkurtosiswerenotdifferentfrom0.So,anabsolutevaluegreaterthan1.96issignificantatp<.05,above2.58issignificantatp<.01andabove3.29issignificantatp<.001.However,youreallyshouldusethesecriteriaonlyinsmallsamples:inlargersampleslookattheshapeofthedistributionvisually,interpretthevalueoftheskewnessandkurtosisstatistics,andpossiblydon’tevenworryaboutnormalityatall(JaneSuperbrainBox).
Output1
UsingthevaluesinOutput1,calculatethez-scoresforskewnessandKurtosisforeachdayoftheDownloadfestival.
©Prof.AndyField,2016 www.discoveringstatistics.com Page12
YourAnswers:
Output2
Output2showstheK-Stest.Rememberthatasignificantvalue(Sig.lessthan.05)indicatesadeviationfromnormality.
AretheresultsoftheK-Stestssurprisinggiventhehistogramswehavealreadyseen?
YourAnswers:
Forday1theK-Stestisjustaboutnotsignificant(p=.097),whichissurprisinglyclosetosignificantgivennownormaltheday1scoreslookedinthehistogram(Figure3).However,thesamplesizeonday1isverylarge(N=810)andthesignificanceof theK-S test for thesedatashowshow in largesamplesevensmallandunimportantdeviations fromnormalitymightbedeemedsignificantbythistest(JaneSuperbrainBox).Fordays2and3thetestishighlysignificant,indicatingthatthesedistributionsarenotnormal,whichislikelytoreflecttheskewseeninthehistogramsforthesedatabutcouldagainbedowntothelargesample(Figure9).
Reporting the K-S test
TheteststatisticfortheK-StestisdenotedbyDandweshouldreportthedegreesoffreedom(df)fromthetableinbracketsaftertheD.WecanreporttheresultsinOutput2inthefollowingway:
ü Thehygienescoresonday1,D(810)=0.029,p=.097,didnotdeviatesignificantlyfromnormal;however,day2,D(264)=0.121,p<.001,andday3,D(123)=0.140,p<.001,scoreswerebothsignificantlynon-normal.
JaneSuperbrainBox:Significancetestsandassumptions
©Prof.AndyField,2016 www.discoveringstatistics.com Page13
Throughoutthishandoutwewilllookatvarioussignificanceteststhathavebeendevisedtolookatwhetherassumptions are violated. These include tests of whether a distribution is normal (the Kolmorgorov-Smirnoff and Shapiro-Wilk tests), tests of homogeneity of variances (Levene’s test), and tests ofsignificanceofskewandkurtosis.Allofthesetestsarebasedonnullhypothesissignificancetestingandthismeansthat(1)inlargesamplestheycanbesignificantevenforsmallandunimportanteffects,and(2)insmallsamplestheywilllackpowertodetectviolationsof.
Thecentrallimittheoremmeansthatassamplesizesgetlarger,thelesstheassumptionofnormalitymattersbecausethesamplingdistributionwillbenormalregardlessofwhatourpopulation(orindeed
sample)datalooklike.So,theproblemisthatinlargesamples,wherewedon’tneedtoworryaboutnormality,atestofnormalityismorelikelytobesignificant,andthereforelikelytomakeusworryaboutandcorrectforsomethingthatdoesn’tneedtobecorrectedorworriedabout.Conversely, insmallsamples,wherewemightwanttoworryaboutnormality,asignificancetestwon’thavethepowertodetectnon-normalityandsoislikelytoencourageusnottoworryaboutsomethingthatweprobablyoughtto.Therefore,thebestadviceisthatifyoursampleis largethendon’tusesignificancetestsofnormality,infactdon’tworrytoomuchaboutnormalityatall.Insmallsamplesthenpayattentionifyoursignificancetestsaresignificantbutresistbeinglulledintoafalsesenseofsecurityiftheyarenot.
Testing homogeneity of variance/homoscedasticity Likenormality,youcanlookatthevariancesusinggraphs,numbersandsignificancetests.Graphically,wecancreateascatterplotofthevaluesoftheresidualsplottedagainstthevaluesoftheoutcomepredictedbyourmodel.Indoingsowe’relookingatwhetherthereisasystematicrelationshipbetweenwhatcomesoutofthemodel(thepredictedvalues)andtheerrorsinthemodel.Normallyweconvertthepredictedvaluesanderrorstoz-scores1sothisplotissometimesreferred to as zpred vs. zresid. If linearity and homoscedasticity hold true then there should be no systematicrelationshipbetweentheerrorsinthemodelandwhatthemodelpredicts.Lookingatthisgraphcan,therefore,killtwobirdswithonestone.Ifthisgraphfunnelsout,thenthechancesarethatthereisheteroscedasticityinthedata.Ifthereisanysortofcurveinthisgraphthenthechancesarethatthedatahavebrokentheassumptionoflinearity.
Figure10showsseveralexamplesoftheplotofstandardizedresidualsagainststandardizedpredictedvalues.Thetopleftpanelshowsasituationinwhichtheassumptionsoflinearityandhomoscedasticityhavebeenmet.Thetoprightpanelshowsasimilarplotforadatasetthatviolatestheassumptionofhomoscedasticity.Notethatthepointsformafunnel:theybecomemorespreadoutacrossthegraph.Thisfunnelshapeistypicalofheteroscedasticityandindicatesincreasingvarianceacrosstheresiduals.Thebottomleftpanelshowsaplotofsomedatainwhichthereisanon-linearrelationshipbetweentheoutcomeandthepredictor:thereisaclearcurveintheresiduals.Finally,thebottomrightpanelillustratesdatathatnotonlyhaveanon-linearrelationship,butalsoshowheteroscedasticity.Notefirstthecurvedtrendintheresiduals,andthenalsonotethatatoneendoftheplotthepointsareveryclosetogetherwhereasattheotherendtheyarewidelydispersed.Whentheseassumptionshavebeenviolatedyouwillnotseetheseexactpatterns,buthopefullytheseplotswillhelpyoutounderstandthegeneralanomaliesyoushouldlookoutfor.
Numerically,wecansimplylookatthevaluesofthevariancesindifferentgroups.SomepeoplelookatHartley’sFMaxalsoknownasthevarianceratio(Pearson&Hartley,1954).Thisistheratioofthevariancesbetweenthegroupwiththebiggestvarianceandthegroupwiththesmallestvariance.Theacceptablesizeofthisratiodependsonthenumberofvariancescomparedandthesamplesize—see(Field,2013)formoredetail.
MorecommonlyusedisLevene’stest(Levene,1960),whichteststhenullhypothesisthatthevariancesindifferentgroupsareequal.
• If Levene’s test is significant at p£ .05 then you conclude that the variances are significantly different—therefore,theassumptionofhomogeneityofvarianceshasbeenviolated.
• If, however, Levene’s test is non-significant (i.e., p > .05) then the variances are roughly equal and theassumptionistenable.
1Thesesstandardizederrorsarecalledstandardizedresiduals.
©Prof.AndyField,2016 www.discoveringstatistics.com Page14
AlthoughLevene’stestcanbeselectedasanoptioninmanyofthestatisticalteststhatrequireit,it’sbesttolookatitwhenyou’reexploringdatabecauseitinformsthemodelyoufit.AswiththeK-StestyouneedtotakeLevene’stestwithapinchofsalt(JaneSuperbrainBox).
Figure10:Plotsofstandardizedresidualsagainstpredicted(fitted)values
WecangetLevene’stestusingtheexploremenuthatweusedintheprevioussection.Stickingwiththehygienescores,we’ll compare the variances of males and females on day 1 of the festival. Use
toopenthedialogboxinFigure11.Transfertheday1variablefromthelistontheleft-handsidetotheboxlabelledDependentListbyclickingonthe nexttothisbox;becausewewanttosplit theoutputbythegroupingvariabletocomparethevariances,selectthevariablegenderandtransferittotheboxlabelledFactorListbyclickingontheappropriate .Thenclickon toopentheotherdialogboxinFigure11.TogetLevene’stestweneedtoselectoneoftheoptionswhereitsaysSpreadvs.levelwithLevenetest.Ifyouselect Levene’stestiscarriedoutontherawdata(agoodplacetostart).Whenyou’vefinishedwiththisdialogboxclickon toreturntothemainExploredialogboxandthenclickon toruntheanalysis.
Output3showsthetableforLevene’stest.Levene’stestcanbebasedondifferencesbetweenscoresandthemean,andscoresandthemedian.Themedianisslightlypreferable(becauseitislessbiasedbyoutliers).Whenusingboththemean(p=.030)andthemedian(p=.037)thesignificancevaluesarelessthan.05indicatingasignificantdifferencebetweenthemaleandfemalevariances.Tocalculatethevarianceratio,weneedtodividethelargestvariancebythesmallest.Youshould findthevariances inyouroutput: themalevariancewas0.413andthe femaleone0.496, thevarianceratiois,therefore,0.496/0.413=1.2.Inessencethevariancesarepracticallyequal.So,whydoesLevene’stesttellustheyaresignificantlydifferent?Theanswerisbecausethesamplesizesaresolarge:wehad315malesand495femalessoeventhisverysmalldifferenceinvariancesisshownupassignificantbyLevene’stest(JaneSuperbrainBox).Hopefullythisexampleconvincesyoutotreatthistestcautiously.
©Prof.AndyField,2016 www.discoveringstatistics.com Page15
Figure11:ExploringgroupsofdataandobtainingLevene’stest
Output3
Reporting Levene’s test
Levene’stestcanbedenotedwiththeletterFandtherearetwodifferentdegreesoffreedom.Assuchyoucanreportit,ingeneralform,asF(df1,df2)=value,sig:
ü Forthehygienescoresonday1ofthefestival,thevarianceswereunequalforformalesandfemales,F(1,808)=4.74,p=.03.
Whatistheassumptionofhomogeneityofvariance?
YourAnswers:
Reducing Bias Havinglookedatpotentialsourcesofbias,thenextissueishowtoreducetheimpactofbias.Essentiallytherearefourmethodsforcorrectingproblemswiththedata,whichcanberememberedwiththehandyacronymofTWAT:
©Prof.AndyField,2016 www.discoveringstatistics.com Page16
• Trimthedata:deleteacertainamountofscoresfromtheextremes.
• Windsorizing:substituteoutlierswiththehighestvaluethatisn’tanoutlier.
• AnalysewithRobustMethods:thistypicallyinvolvesatechniqueknownasbootstrapping.
• Transformthedata:thisinvolvesapplyingamathematicalfunctiontoscorestotrytocorrectanyproblemswiththem.
Probablythebestofthesechoicesistouserobusttests,whichisatermappliedtoafamilyofprocedurestoestimatestatistics thatare reliableevenwhen thenormalassumptionsof the statisticarenotmet. For thepurposesof thishandoutwe’lllookattransformingdata,andthroughoutthemodulewe’llusebootstrapping(whichisarobustmethodexplainedinyourlecture),butyoucanfindmoredetailonthesetechniquesandtheotherinChapter5of(Field,2013).
Bootstrapping (robust methods) SomeSPSSprocedureshaveabootstrapoption,whichcanbeaccessedbyclickingon toactivatethedialogboxinFigure12.Select toactivatebootstrappingfortheprocedureyou’recurrentlydoing.Intermsoftheoptions,SPSSwillcomputea95%percentileconfidenceinterval( ),butyoucanchangethemethodtoaslightlymoreaccurateone(Efron&Tibshirani,1993)calledabiascorrectedandacceleratedconfidenceinterval(
).Youcanalsochangetheconfidencelevelbytypinganumberotherthan95intheboxlabelled Level(%). By default, SPSS uses 1000 bootstrap samples, which is a reasonable number and you certainlywouldn’tneedtousemorethan2000.
http://youtu.be/mNrxixgwA2M
Figure12:Thestandardbootstrapdialogbox
Transforming Data Thefinalthingthatyoucandotocombatproblemswithnormalityandlinearity istotransformyourdata.Theideabehindtransformationsisthatyoudosomethingtoeveryscoretocorrectfordistributionalproblems,outliers,lackoflinearityorunequalvariances.Ifyouarelookingatrelationshipsbetweenvariables(e.g.,regression)justtransformtheproblematicvariable,butifyouarelookingatdifferencesbetweenvariables(e.g.,changeinavariableovertime)thenyouneedtotransformallofthosevariables.Forexample,ourfestivalhygienedatawerenotnormalondays2and3ofthefestival.Now,wemightwanttolookathowhygienelevelschangedacrossthethreedays(i.e.,comparethemean
©Prof.AndyField,2016 www.discoveringstatistics.com Page17
onday1tothemeansondays2and3toseeifpeoplegotsmellier).Thedatafordays2and3wereskewedandneedtobetransformed,butbecausewemightlatercomparethedatatoscoresonday1,wewouldalsohavetotransformtheday1data(eventhoughscoreswerenotskewed).Ifwedon’tchangetheday1dataaswell,thenanydifferencesinhygienescoreswe find fromday1 today2or3willbeduetous transformingonevariableandnot theothers.However, ifweweregoingto lookattherelationshipbetweenday1andday2scores(notthedifferencebetweenthem)wecouldtransformonlytheday2scoresandleavetheday1scoresalone.
Choosing a transformation
Therearevarioustransformationsthatyoucandotothedatathatarehelpfulincorrectingvariousproblems.Table1:showssomecommontransformationsandtheiruses.Thewaytodecidewhichtransformationtouseisbygoodoldfashionedtrialanderror:tryoneout,seeifithelpsandifitdoesn’tthentryadifferentone.
Table1:Datatransformationsandtheiruses
DataTransformation CanCorrectFor
Logtransformation(log(Xi)):Takingthelogarithmofasetofnumberssquashestherighttailofthedistribution.Assuchit’sa goodway to reducepositive skew. This transformation is also veryuseful if youhaveproblemswith linearity (it cansometimesmakeacurvilinearrelationshiplinear).However,youcan’tgetalogvalueofzeroornegativenumbers,soifyourdata tend to zero or produce negative numbers you need to add a constant to all of the data before you do thetransformation.Forexample,ifyouhavezerosinthedatathendolog(Xi+1),orifyouhavenegativenumbersaddwhatevervaluemakesthesmallestnumberinthedatasetpositive.
Positive skew,positive kurtosis,unequal variances,lackoflinearity
Squareroottransformation(ÖÖXi):Takingthesquarerootoflargevalueshasmoreofaneffectthantakingthesquarerootofsmallvalues.Consequently,takingthesquarerootofeachofyourscoreswillbringanylargescoresclosertothecentre—ratherlikethelogtransformation.Assuch,thiscanbeausefulwaytoreducepositiveskew;however,youstillhavethesameproblemwithnegativenumbers(negativenumbersdon’thaveasquareroot).
Positive skew,positive kurtosis,unequal variances,lackoflinearity
Reciprocaltransformation(1/Xi):Dividing1byeachscorealsoreducestheimpactoflargescores.Thetransformedvariablewillhavealowerlimitof0(verylargenumberswillbecomecloseto0).Onethingtobearinmindwiththistransformationis that it reverses the scores: scores that were originally large in the data set become small (close to zero) after thetransformation,butscoresthatwereoriginallysmallbecomebigafterthetransformation.Forexample,imaginetwoscoresof1and10;afterthetransformationtheybecome1/1=1,and1/10=0.1:thesmallscorebecomesbiggerthanthelargescoreafterthetransformation.However,youcanavoidthisbyreversingthescoresbeforethetransformation,byfindingthe highest score and changing each score to the highest score minus the score you’re looking at. So, you do atransformation1/(XHighest−Xi).Likethelogtransformation,youcan’ttakethereciprocalof0(because1/0=infinity)soifyouhavezerosinthedatayouneedtoaddaconstanttoallscoresbeforedoingthetransformation.
Positive skew,positive kurtosis,unequalvariances
Reversescoretransformations:Anyoneoftheabovetransformationscanbeusedtocorrectnegativelyskeweddata,butfirstyouhavetoreversethescores.Todothis,subtracteachscorefromthehighestscoreobtained,orthehighestscore+1(dependingonwhetheryouwantyourlowestscoretobe0or1).Ifyoudothis,don’tforgettoreversethescoresbackafterwards,ortorememberthatthe interpretationofthevariable isreversed:bigscoreshavebecomesmallandsmallscoreshavebecomebig.
Negativeskew
Tryingoutdifferenttransformationscanbequitetimeconsuming;however,ifheterogeneityofvarianceisyourissuethenwecanseetheeffectofatransformationquitequickly.WhenweranLevene’stest(Figure11)werantheanalysisselectingtherawscores( ).However,ifthevariancesturnouttobeunequal,astheydidinourexample,youcanusethesamedialogboxbutselect .Whenyoudothisyoushouldnoticeadrop-downlistthatbecomesactiveandifyouclickonthisyou’llnoticethatitlistsseveraltransformationsincludingtheonesthatIhavejustdescribed.Ifyouselectatransformationfromthislist(NaturallogperhapsorSquareroot)thenSPSSwillcalculatewhatLevene’stestwouldbeifyouweretotransformthedatausingthismethod.Thiscansaveyoualotoftimetryingoutdifferenttransformations.
Using SPSS’s Compute command
Thecomputecommandenablesustocarryoutvariousfunctionsoncolumnsofdatainthedataeditor.Sometypicalfunctionsareaddingscoresacrossseveralcolumns,takingthesquarerootofthescoresinacolumn,orcalculatingthemeanofseveralvariables.Toaccessthecomputedialogbox,usethemousetoselect .TheresultingdialogboxisshowninFigure13;ithasalistoffunctionsontheright-handside,acalculator-likekeyboardinthecentreandablankspacethatI’velabelledthecommandarea.ThebasicideaisthatyoutypeanameforanewvariableinthearealabelledTargetVariableandthenyouwritesomekindofcommandinthecommandareatotellSPSShowtocreatethisnewvariable.Youuseacombinationofexistingvariablesselectedfromthelistontheleftandnumericexpressions.So,forexample,youcoulduseitlikeacalculatortoaddvariables(i.e.addtwocolumnsinthedata
©Prof.AndyField,2016 www.discoveringstatistics.com Page18
editortomakeathird).Therearehundredsofbuilt-infunctionsthatSPSShasgroupedtogether.InthedialogboxitliststhesegroupsinthearealabelledFunctiongroup;uponselectingafunctiongroup,alistofavailablefunctionswithinthatgroupwillappearintheboxlabelledFunctionsandSpecialVariables.Ifyouselectafunction,thenadescriptionofthatfunctionappearsintheboxindicatedinFigure13.Youcanentervariablenamesintothecommandareabyselectingthevariablerequiredfromthevariableslistandthenclickingon .Likewise,youcanselectacertainfunctionfromthelistofavailablefunctionsandenteritintothecommandareabyclickingon .
FirsttypeavariablenameintheboxlabelledTargetVariable,thenclickon andanotherdialogboxappears,whereyou cangive thevariableadescriptive label and specifywhether it is anumericor stringvariable (seeyourhandout fromweek1). Thenwhenyouhavewrittenyour command for SPSS toexecute, clickon to run thecommandandcreate thenewvariable.Thereare functions forcalculatingmeans, standarddeviationsandsumsofcolumns.We’regoingtousethesquarerootandlogarithmfunctions,whichareusefulfortransformingdatathatareskewed.
Figure13:DialogboxfortheComputecommand
Log Transformation
Let’sreturntoourDownloadfestivaldata.OpenthemainComputedialogboxbyselecting .Enter the name logday1 into the box labelled Target Variable, click on and give the variable a moredescriptivenamesuchasLogtransformedhygienescoresforday1ofDownloadfestival.InthelistboxlabelledFunctiongroup clickonArithmetic and then in thebox labelledFunctionsandSpecialVariables clickonLg10 (this is the logtransformation tobase10,Ln is thenatural log) and transfer it to the commandareaby clickingon .When the
Command area
Categories of functions
Functions within the selected category
Use this dialog box to select certain cases
in the data
Variable list
Description ofselected function
©Prof.AndyField,2016 www.discoveringstatistics.com Page19
commandistransferred,itappearsinthecommandareaas‘LG10(?)’andthequestionmarkshouldbereplacedwithavariablename(whichcanbetypedmanuallyortransferredfromthevariableslist).Soreplacethequestionmarkwiththevariableday1byeitherselectingthevariableinthelistanddraggingitacross,clickingon ,orjusttyping‘day1’wherethequestionmarkis.
For theday2hygienescores there isavalueof0 in theoriginaldata,and there isno logarithmof thevalue0.Toovercometheproblemweaddaconstanttoouroriginalscoresbeforewetakethelogofthosescores.Anyconstantwilldo(althoughsometimesitcanmatter),providedthatitmakesallofthescoresgreaterthan0.Inthiscaseourlowestscoreis0inthedatasowecouldadd1toallofthescorestoensurethatallscoresaregreaterthanzero.Eventhoughthisproblemaffectstheday2scores,weneedtobeconsistentanddothesametotheday1scoresaswewilldowiththeday2scores.Therefore,makesurethecursorisstillinsidethebracketsandclickon andthen .ThefinaldialogboxshouldlooklikeFigure13.NotethattheexpressionreadsLG10(day1+1);thatis,SPSSwilladdonetoeachoftheday1scoresandthentakethelogoftheresultingvalues.Clickon tocreateanewvariablelogday1containingthetransformedvalues.
SELFTEST:Haveagoatcreatingsimilarvariableslogday2andlogday3fortheday2andday3data.Plothistogramsofthetransformedscoresforallthreedays.
Square Root Transformation
Tousethesquareroottransformation,wecouldrunthroughthesameprocess,byusinganamesuchassqrtday1andselectingtheSQRT(numexpr)functionfromthelist.ThiswillappearintheboxlabelledNumericExpression:asSQRT(?),and you can simply replace the questionmarkwith the variable youwant to change—in this caseday1. The finalexpressionwillreadSQRT(day1).
SELFTEST:Repeatthisprocessforday2andday3tocreatevariablescalledsqrtday2andsqrtday3.Plothistogramsofthetransformedscoresforallthreedays.
Reciprocal Transformation
Todoareciprocaltransformationonthedatafromday1,wecoulduseanamesuchasrecday1 intheboxlabelledTargetVariable.Thenwecansimplyclickon andthen .Ordinarilyyouwouldselectthevariablenamethatyouwanttotransformfromthelistanddragitacross,clickon orjusttypethenameofthevariable.However,theday2datacontainazerovalueandifwetrytodivide1by0thenwe’llgetanerrormessage(youcan’tdivideby0).Weneedtoaddaconstanttoourvariablejustaswedidforthelogtransformation.Anyconstantwilldo,but1isaconvenientnumberforthesedata.So, insteadofselectingthevariablewewanttotransform,clickon .ThisplacesapairofbracketsintotheboxlabelledNumericExpression;thenmakesurethecursorisbetweenthesetwobracketsandselectthevariableyouwanttotransformfromthelistandtransferitacrossbyclickingon (ortypethenameofthevariablemanually).Nowclickon andthen (ortype+1usingyourkeyboard).TheboxlabelledNumericExpressionshouldnowcontainthetext1/(day1+1).Clickon tocreateanewvariablecontainingthetransformedvalues.
SELFTEST:Repeatthisprocessforday2andday3.Plothistogramsofthetransformedscoresforallthreedays.
The effect of transformations
Figure14showsthedistributionsfordays1and2ofthefestivalafterthethreedifferenttransformations.ComparethesetotheuntransformeddistributionsinFigure9.Now,youcanseethatallthreetransformationshavecleanedupthehygienescoresforday2:thepositiveskewisreduced(thesquareroottransformationinparticularhasbeenuseful).However,becauseourhygienescoresonday1weremoreorlesssymmetricaltobeginwith,theyhavenowbecome
©Prof.AndyField,2016 www.discoveringstatistics.com Page20
slightly negatively skewed for the log and square root transformation, and positively skewed for the reciprocaltransformation2Ifwe’reusingscoresfromday2aloneorlookingattherelationshipbetweenday1andday2,thenwecouldusethetransformedscores;however,ifwewantedtolookatthechangeinscoresthenwe’dhavetoweighupwhetherthebenefitsofthetransformationfortheday2scoresoutweightheproblemsitcreatesintheday1scores—dataanalysiscanbefrustratingsometimesJ
Figure14:Distributionsofthehygienedataonday1andday2aftervarioustransformations
Multiple Choice Test Gotohttps://studysites.uk.sagepub.com/field4e/study/mcqs.htmandtestyourselfonthemultiplechoicequestionsforChapter5.Ifyougetanywrong,re-readthishandout(orField,2013,Chapter5)anddothemagainuntilyougetthemallcorrect.
2Thereversaloftheskewforthereciprocaltransformationisbecause,asImentionedearlier,thereciprocalhastheeffectofreversingthescores.
Day 1 Day 2
Log
Square root
1/x
©Prof.AndyField,2016 www.discoveringstatistics.com Page21
References Efron,B.,&Tibshirani,R.(1993).Anintroductiontothebootstrap.BocaRaton,FL:Chapman&Hall/CRC.Field,A.P.(2000).DiscoveringstatisticsusingSPSSforWindows:Advancedtechniquesforthebeginner.London:Sage.Field,A.P.(2013).DiscoveringstatisticsusingIBMSPSSStatistics:Andsexanddrugsandrock'n'roll(4thed.).London:
Sage.Gelman,A.,&Hill,J.(2007).DataAnalysisUsingRegressionandMultilevel/HierarchicalModels.Cambridge:Cambridge
UniversityPress.Hayes, A. F., & Cai, L. (2007). Using heteroskedasticity-consistent standard error estimators in OLS regression: An
introductionandsoftwareimplementation.BehaviorResearchMethods,39(4),709-722.Levene,H.(1960).Robusttestsforequalityofvariances.InI.Olkin,S.G.Ghurye,W.Hoeffding,W.G.Madow&H.B.
Mann (Eds.),Contributions to Probability and Statistics: Essays inHonor ofHaroldHotelling (pp. 278-292).Stanford,CA:StanfordUniversityPress.
Lumley,T.,Diehr,P.,Emerson,S.,&Chen,L.(2002).Theimportanceofthenormalityassumptioninlargepublichealthdatasets.AnnualReviewofPublicHealth,23,151-169.
Pearson,E.S.,&Hartley,H.O. (1954).Biometrikatablesforstatisticians,volumeI.NewYork:CambridgeUniversityPress.
Wilcox,R.R.(2010).Fundamentalsofmodernstatisticalmethods:substantially improvingpowerandaccuracy.NewYork:Springer.
Terms of Use Thishandoutcontainsmaterialfrom:
Field,A.P.(2013).DiscoveringstatisticsusingSPSS:andsexanddrugsandrock‘n’roll(4thEdition).London:Sage.
ThismaterialiscopyrightAndyField(2000-2016).
This document is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 InternationalLicense (https://creativecommons.org/licenses/by-nc-nd/4.0/), basically you can use it for teaching and non-profitactivitiesbutnotmeddlewithitwithoutpermissionfromtheauthor.