Exploring Data: The Beast of Bias

©Prof.AndyField,2016 www.discoveringstatistics.com Page1

Exploring Data: The Beast of Bias Sources of Bias Abitofrevision.We’veseenthathavingcollecteddataweusuallyfitamodelthatrepresentsthehypothesisthatwewanttotest.Thismodelisusuallyalinearmodel,whichtakestheformof:

outcome! = 𝑏𝑏$𝑋𝑋$! + 𝑏𝑏'𝑋𝑋'! ⋯ 𝑏𝑏)𝑋𝑋)! + error! Eq.1

Therefore,wepredictanoutcomevariable,fromoneormorepredictorvariables(theXs)andparameters(thebsintheequation)thattellussomethingabouttherelationshipbetweenthepredictorandtheoutcomevariable.Finally,themodelwillnotpredicttheoutcomeperfectlysoforeachobservationtherewillbesomeerror.

Whenwefitamodel,weoftenestimatetheparameters(b)usingthemethodofleastsquares(knownasordinaryleastsquaresorOLS).We’renotinterestedinoursamplesomuchasageneralpopulation,soweusethesampledatatoestimatethevalueoftheparametersinthepopulation(that’swhywecallthemestimatesratherthanvalues).Whenweestimateaparameterwealsocomputeanestimateofhowwell itrepresentsthepopulationsuchasastandarderrororconfidence interval.Wecantesthypothesesabout theseparametersbycomputingteststatisticsandtheirassociated probabilities (p-values). Therefore, when we think about bias, we need to think about it within threecontexts:

1. Thingsthatbiastheparameterestimates.

2. Thingsthatbiasstandarderrorsandconfidenceintervals.

3. Thingsthatbiasteststatisticsandp-values.

These situations are related because confidence intervals and test statistics both rely on the standard error. It isimportantthatweidentifyandeliminateanythingthatmightaffecttheinformationthatweusetodrawconclusionsabouttheworld,soweexploredatatolookforbias.

Outliers Anoutlierisascoreverydifferentfromtherestofthedata.WhenIpublishedmyfirstbook(Field,2000),Iobsessivelycheckedthebook’sratingsonAmazon.co.uk.Customerratingscanrangefrom1to5stars,where5isthebest.Backin2002,myfirstbookhadsevenratings(intheordergiven)of2,5,4,5,5,5,and5.Allbutoneoftheseratingsarefairlysimilar(mainly5and4)butthefirstratingwasquitedifferentfromtherest—itwasaratingof2(ameanandhorriblerating).Themeanofthesescoreswas4.43.Theonlyscorethatwasn’ta4or5wasthefirstratingof2.Thisscoreisanexampleofanoutlier—aweirdandunusualperson(sorry, Imeanscore) thatdeviates fromtherestofhumanity (Imean,dataset).Themeanofthescoreswhentheoutlierisnotincludedis4.83(itincreasesby0.4).Thisexampleshowshowasinglescore,fromsomemean-spiritedbadgerturd,canbiasaparametersuchasthemean:thefirstratingof2dragstheaveragedown.Basedonthisbiasedestimatenewcustomersmighterroneouslyconcludethatmybookisworsethanthepopulationactuallythinksitis.

Spotting outliers with graphs Abiologistwasworriedaboutthepotentialhealtheffectsofmusicfestivals.So,oneyearshewenttotheDownloadMusicFestival(http://www.downloadfestival.co.uk)andmeasuredthehygieneof810concertgoersoverthethreedaysofthefestival.Intheoryeachpersonwasmeasuredoneachdaybutbecauseitwasdifficulttotrackpeopledown,thereweresomemissingdataondays2and3.Hygienewasmeasuredusingastandardisedtechnique(don’tworryitwasn’tlickingtheperson’sarmpit)thatresultsinascorerangingbetween0(yousmelllikearottingcorpsethat’shidingupaskunk’sarse)and5(yousmellofsweetrosesonafreshspringday).NowIknowfrombitterexperiencethatsanitationisnotalwaysgreatattheseplaces(Readingfestivalseemsparticularlybad…)andsothisresearcherpredictedthatpersonal hygiene would go down dramatically over the three days of the festival. The data file is calledDownloadFestival.sav.


ToplotahistogramweuseChartBuilder(seelastweek’shandout)whichisaccessedthroughthe menu.SelectHistograminthelistlabelledChoosefromtobringupthegalleryshowninFigure1.Thisgalleryhasfouriconsrepresentingdifferenttypesofhistogram,andyoushouldselecttheappropriateoneeitherbydouble-clickingonit,orbydraggingitontothecanvasintheChartBuilder:

® Simplehistogram:Usethisoptionwhenyoujustwanttoseethefrequenciesofscoresforasinglevariable(i.e.mostofthetime).

® Stacked histogram andPopulation Pyramid: If you had a grouping variable (e.g.whethermen orwomenattendedthefestival)youcouldproduceahistograminwhicheachbarissplitbygroup(stackedhistogram)ortheoutcome(inthiscasehygiene)ontheverticalaxisandeachgroup(i.e.menvs.women)onthehorizontal(i.e.,thehistogramsformenandwomenappearbacktobackonthegraph).

® FrequencyPolygon:Thisoptiondisplaysthesamedataasthesimplehistogramexceptthatitusesalineinsteadofbarstoshowthefrequency,andtheareabelowthelineisshaded.

Figure1:Thehistogramgallery

Wearegoingtodoasimplehistogramsodouble-clickontheiconforasimplehistogram(Figure1).TheChartBuilderdialogboxwillnowshowapreviewofthegraphinthecanvasarea.Atthemomentit’snotveryexcitingbecausewehaven’ttoldSPSSwhichvariableswewanttoplot.Notethatthevariablesinthedataeditorarelistedontheleft-handsideoftheChartBuilder,andanyofthesevariablescanbedraggedintoanyofthedropzones(spacessurroundedbybluedottedlines).

Ahistogramplotsasinglevariable(x-axis)againstthefrequencyofscores(y-axis),soallweneedtodoisselectavariablefromthelistanddragitinto .Let’sdothisforthehygienescoresonday1ofthefestival.Clickonthisvariableinthelistanddragitto asshowninFigure2;youwillnowfindthehistogrampreviewedonthecanvas.Todrawthehistogramclickon .

TheresultinghistogramisshowninFigure3andthefirstthingthatshouldleapoutatyouisthatthereappearstobeonecasethat isverydifferenttotheothers.Allofthescoresappeartobesquisheduponeendofthedistributionbecausetheyarealllessthan5(yieldingwhatisknownasaleptokurticdistribution!)exceptforone,whichhasavalueof20!Thisscoreisanoutlier.What’soddaboutthisoutlieristhatithasascoreof20,whichisabovethetopofourscale(rememberourhygienescalerangedonlyfrom0-5)andso itmustbeamistake(orthepersonhadobsessivecompulsivedisorderandhadwashedthemselvesintoastateofextremecleanliness).However,with810cases,howonearthdowe findoutwhich case itwas?You could just look through thedata, but thatwould certainly give youaheadacheandsoinsteadwecanuseaboxplot.

You encountered boxplots or box-whisker diagrams last year. At the centre of the plot is themedian, which issurroundedbyaboxthetopandbottomofwhicharethelimitswithinwhichthemiddle50%ofobservationsfall(theinterquartilerange).Stickingoutofthetopandbottomoftheboxaretwowhiskerswhichextendtothemostandleastextremescoresrespectively.Outliersasshownasdotsorstars(seemybookfordetails).

Simple Histogram

Stacked Histogram

Frequency Polygon

Population Pyramid


Figure2:Plottingahistogram

Figure3:HistogramoftheDay1DownloadFestivalhygienescores

Selectthe menu,thenselectBoxplotinthelistlabelledChoosefromtobringupthegalleryshowninFigure4.Therearethreetypesofboxplotyoucanchoose:

® Simpleboxplot:Usethisoptionwhenyouwanttoplotaboxplotofasinglevariable,butyouwantdifferentboxplots produced for different categories in thedata (for thesehygienedatawe couldproduce separateboxplotsformenandwomen).

® Clusteredboxplot:Thisoptionisthesameasthesimpleboxplotexceptthatyoucanselectasecondcategoricalvariableonwhichtosplit thedata.Boxplots for thissecondvariableareproduced indifferentcolours.Forexample,wemighthavemeasuredwhetherourfestival-goerwasstayinginatentoranearbyhotelduringthefestival.Wecouldproduceboxplotsnotjustformenandwomen,butwithinmenandwomenwecouldhavedifferent-colouredboxplotsforthosewhostayedintentsandthosewhostayedinhotels.

Click on the HygieneDay 1 variable and drag

it to this Drop Zone


® 1-DBoxplot:Usethisoptionwhenyoujustwanttoseeaboxplotforasinglevariable.(Thisdiffersfromthesimpleboxplotonlyinthatnocategoricalvariableisselectedforthex-axis.)

Figure4:Theboxplotgallery

Tomakeourboxplotof theday1hygienescores,double-clickonthesimpleboxplot icon (Figure4), then fromthevariablelistselectthehygieneday1scorevariableanddragitinto .ThedialogshouldnowlooklikeFigure5—clickon toproducethegraph.

Figure5:Boxplotforthedownloadfestivaldata

Theoutlierthatwedetectedinthehistogramhasshownupasanextremescore(*)ontheboxplot.SPSShelpfullytellsusthenumberofthecase(611)that’sproducingthisoutlier.Ifwegotothedataeditor(dataview),wecanlocatethiscasequicklybyclickingon andtyping611inthedialogboxthatappears.Thattakesusstraighttocase611.Lookingatthiscaserevealsascoreof20.02,whichisprobablyamistypingof2.02.We’dhavetogobacktotherawdataandcheck.We’llassumewe’vecheckedtherawdataandthisscoreshouldbe2.02,soreplacethevalue20.02withthevalue2.02beforewecontinuethisexample

SELFTEST:Nowwehaveremovedtheoutlierinthedata,re-plotthehistogramandboxplot.

Simple Boxplot

Clustered Boxplot

1-D Boxplot


Figure6:Histogram(left)andboxplot(right)ofhygienescoresonday1oftheDownloadFestival

Figure6showsthehistogramandboxplotforthedataaftertheextremecasehasbeencorrected.Thedistributionisnicely symmetrical anddoesn’t seem toopointy or flat.Neither plot indicates any particularly extreme scores: theboxplotsuggeststhatcase574isamildoutlier,butthehistogramdoesn’tseemtoshowanycasesasbeingparticularlyoutoftheordinary

Assumptions Mostofourpotentialsourcesofbiascomeintheformof‘violationsofassumptions’.Anassumptionisaconditionthatensuresthatwhatyou’reattemptingtodoworks.Forexample,whenweassessamodelusingateststatistic,wehaveusuallymadesomeassumptions,andiftheseassumptionsaretruethenweknowthatwecantaketheteststatistic(and,therefore,p-value)associatedwithamodelatfacevalueandinterpretitaccordingly.Conversely,ifanyoftheassumptionsarenottrue(usuallyreferredtoasaviolation)thentheteststatisticandp-valuewillbeinaccurateandcouldleadustothewrongconclusionifweinterpretthematfacevalue.

Assumptionsareoftenpresentedasthoughdifferentstatisticalprocedureshavetheirownuniquesetofassumptions.However,becausewe’reusuallyfittingvariationsofthelinearmodeltoourdata,allofthetestsinmybook(Field,2013)basicallyhavethesameassumptions.Theseassumptionsrelatetothequalityofthemodelitself,andtheteststatisticsusedtoassessit(whichareusuallyparametrictestsbasedonthenormaldistribution).Themainassumptionsthatwe’lllookatare:

• Additivityandlinearity

• Normalityofsomethingorother

• Homoscedasticity/homogeneityofvariance

• Independence

Additivity and Linearity Thevastmajorityofstatisticalmodelsinmybookarebasedonthelinearmodel,whichtakesthisform:

outcome! = 𝑏𝑏$𝑋𝑋$! + 𝑏𝑏'𝑋𝑋'! ⋯ 𝑏𝑏)𝑋𝑋)! + error!

The assumption of additivity and linearity means that the outcome variable is, in reality, linearly related to anypredictors(i.e.,theirrelationshipcanbesummedupbyastraightline)andthatifyouhaveseveralpredictorsthentheircombinedeffect isbestdescribedbyaddingtheireffects together. Inotherwords, itmeansthat theprocesswe’retryingtomodelcanbeaccuratelydescribedas:

𝑏𝑏$𝑋𝑋$! + 𝑏𝑏'𝑋𝑋'! ⋯ 𝑏𝑏)𝑋𝑋)!

Thisassumptionisthemostimportantbecauseifitisnottruethenevenifallotherassumptionsaremet,yourmodelisinvalidbecauseyouhavedescribeditincorrectly.It’sabitlikecallingyourpetcatadog:youcantrytogetittogoin


akennel,ortofetchsticks,ortositwhenyoutellitto,butdon’tbesurprisedwhenit’sbehaviourisn’twhatyouexpectbecauseeventhoughyou’veacalleditadog,itisinfactacat.

The assumption of normality

What does it mean?

Manypeopletakethe‘assumptionofnormality’tomeanthatyourdataneedtobenormallydistributed.However,thatisn’twhatitmeans.Whatitdoesmeanis:

1. Forconfidenceintervalsaroundaparameterestimate(e.g.,themean,orab)tobeaccurate,thatestimatemustcomefromanormaldistribution.

2. Forsignificancetestsofmodels(andtheparameterestimatesthatdefinethem)tobeaccuratethesamplingdistributionofwhat’sbeingtestedmustbenormal.Forexample,iftestingwhethertwomeansaredifferent,the data do not need to be normally distributed, but the sampling distribution of means (or differencesbetweenmeans) does. Similarly, if looking at relationshipsbetween variables, the significance tests of theparameterestimatesthatdefinethoserelationships(thebsinEq.1)willbeaccurateonlywhenthesamplingdistributionoftheestimateisnormal.

3. Fortheestimatesoftheparametersthatdefineamodel(thebsinEq.1)tobeoptimal(usingthemethodofleastsquares)theresiduals(theerroriinEq.1)inthepopulationmustbenormallydistributed.

Themisconception thatpeopleoftenhaveabout thedata themselvesneeding tobenormallydistributedprobablystemsfromthefactthatifthedataarenormallydistributedthenit’sreasonabletoassumethattheerrorsinthemodelandthesamplingdistributionarealso(andremember,wedon’thavedirectaccesstothesamplingdistributionsowehavetomakeeducatedguessesaboutitsshape).

Figure7:Adistributionthatlooksnon-normal(left)couldbemadeupofdifferentgroupsofnormally-distributedscores

Whenyouhaveacategoricalpredictorvariable(suchaspeoplefallingintodifferentgroups)youwouldn’texpecttheoveralldistributionoftheoutcome(orresiduals)tobenormal.Forexample,ifyouhaveseenthemovie‘theMuppets’,youwillknowthatMuppetsliveamongus.ImagineyoupredictedthatMuppetsarehappierthanhumans(onTVtheyseemtobe).YoucollecthappinessscoresinsomeMuppetsandsomeHumansandplotthefrequencydistribution.YougetthegraphontheleftofFigure7anddecidethatyourdataarenotnormal:youthinkthatyouhaveviolatedtheassumption of normality. However, you haven’t because you predicted that Humans and Muppets will differ inhappiness; in other words, you predict that they come from different populations. If we plot separate frequencydistributionsforhumansandMuppets(rightofFigure7)you’llnoticethatwithineachgroupthedistributionofscoresisverynormal.Thedataareasyoupredicted:Muppetsarehappierthanhumansandsothecentreoftheirdistributionishigherthanthatofhumans.Whenyoucombineallofthescoresthisgivesyouabimodaldistribution(i.e.,twohumps).Thisexampleillustratesthatitisnotthenormalityoftheoutcome(orresiduals)overallthatmatters,butnormalityateachuniquelevelofthepredictorvariable.

Happiness

Density

0.00

0.05

0.10

0.15

0.20

5 10 15 20

PopulationHumanMuppets

Happiness

Density

0.00

0.05

0.10

0.15

0.20

5 10 15 20

Entire Population Population Split Into Groups


When does the assumption of normality matter?

Thecentrallimittheoremmeansthatthereareavarietyofsituationsinwhichwecanassumenormalityregardlessoftheshapeofoursampledata(Lumley,Diehr,Emerson,&Chen,2002):

1. Confidenceintervals:Thecentrallimittheoremtellsusthatinlargesamples,theestimatewillhavecomefromanormaldistributionregardlessofwhatthesampleorpopulationdatalooklike.Therefore,ifweareinterestedin computing confidence intervals thenwe don’t need toworry about the assumption of normality if oursampleislargeenough.

2. Significancetests:thecentral limittheoremtellsusthattheshapeofourdatashouldn’taffectsignificancetestsprovidedoursampleislargeenough.However,theextenttowhichteststatisticsperformastheyshoulddoinlargesamplesvariesacrossdifferentteststatistics—formoreinformationreadField(2013).

3. Parameterestimates:Themethodofleastsquareswillalwaysgiveyouanestimateofthemodelparametersthatminimizeserror,sointhatsenseyoudon’tneedtoassumenormalityofanythingtofitalinearmodelandestimatetheparametersthatdefineit(Gelman&Hill,2007).However,thereareothermethodsforestimatingmodelparameters,andifyouhappentohavenormallydistributederrorsthentheestimatesthatyouobtainedusingthemethodofleastsquareswillhavelesserrorthantheestimatesyouwouldhavegotusinganyoftheseothermethods.

Tosumupthen,ifallyouwanttodoisestimatetheparametersofyourmodelthennormalitydoesn’treallymatter.Ifyouwanttoconstructconfidenceintervalsaroundthoseparameters,orcomputesignificancetestsrelatingtothoseparametersthentheassumptionofnormalitymattersinsmallsamples,butbecauseofthecentrallimittheoremwedon’treallyneedtoworryaboutthisassumptioninlargersamples(butseeField(2013)foradiscussionofwhatwemightmeanbyalargersample).Inpracticalterms,aslongasyoursampleisfairlylarge,outliersareamorepressingconcernthannormality.

Homogeneity of Variance/Homoscedasticity Thesecondassumptionwe’llexplorerelatestovarianceanditcanimpactonthetwomainthingsthatwemightdowhenwefitmodelstodata:

• Parameters:Ifweusethemethodofleastsquarestoestimatetheparametersinthemodel,thenthiswillgiveusoptimalestimatesifthevarianceoftheoutcomevariableisequalacrossdifferentvaluesofthepredictorvariable.

• Nullhypothesissignificancetesting:teststatisticsoftenassumethatthevarianceoftheoutcomevariableisequalacrossdifferentvaluesofthepredictorvariable.Ifthisisnotthecasethentheseteststatisticswillbeinaccurate.

Therefore,tomakesureourestimatesoftheparametersthatdefineourmodelandsignificancetestsareaccuratewehavetoassumehomoscedasticity(alsoknownashomogeneityofvariance).

What is homoscedasticity/homogeneity of variance?

Indesignsinwhichyoutestseveralgroupsofparticipantsthisassumptionmeansthateachofthesesamplescomesfrompopulationswith the same variance. In correlational designs, this assumptionmeans that the varianceof theoutcomevariableshouldbestableatalllevelsofthepredictorvariable.Inotherwords,asyougothroughlevelsofthepredictorvariable,thevarianceoftheoutcomevariableshouldnotchange.

When does homoscedasticity/homogeneity of variance matter?

Intermsofestimatingtheparameterswithinalinearmodelifweassumeequalityofvariancethentheestimateswegetusingthemethodofleastsquareswillbeoptimal.Ifvariancesfortheoutcomevariabledifferalongthepredictorvariablethentheestimatesoftheparameterswithinthemodelwillnotbeoptimal.Theywillbe‘unbiased’(Hayes&Cai,2007)butnotoptimal.

Unequal variances/heteroscedasticity also creates bias and inconsistency in the estimate of the standard errorassociated with the parameter estimates (Hayes & Cai, 2007). This basically means that confidence intervals and


significance testswillbebiased (because theyarecomputedusing thestandarderror).Confidence intervalscanbe‘extremelyinaccurate’whenhomogeneityofvariance/homoscedasticitycannotbeassumed(Wilcox,2010).Someteststatisticsaredesignedtobeaccurateevenwhenthisassumptionisviolated.

Independence Thisassumptionmeansthattheerrorsinyourmodel(theerroriinEq.1)arenotrelatedtoeachother.ImaginePauland Julie were participants in an experiment where they had to indicate whether they remembered having seenparticularphotos.IfPaulandJulieweretoconferaboutwhetherthey’dseencertainphotosthentheiranswerswouldnotbeindependent:Julie’sresponsetoagivenquestionwoulddependonPaul’sanswer.Weknowalreadythatifweestimateamodeltopredicttheirresponses,therewillbeerrorinthosepredictionsandbecausePaulandJulie’sscoresarenotindependenttheerrorsassociatedwiththesepredictedvalueswillalsonotbeindependent.IfPaulandJuliewereunabletoconfer (if theywere locked indifferentrooms) thentheerror termsshouldbe independent (unlessthey’retelepathic):theerrorinpredictingPaul’sresponseshouldnotbeinfluencedbytheerrorinpredictingJulie’sresponse.

Theequationthatweusetoestimatethestandarderrorisvalidonlyifobservationsareindependent.Rememberthatweusethestandarderrortocomputeconfidenceintervalsandsignificancetests,soifweviolatetheassumptionofindependencethenourconfidenceintervalsandsignificancetestswillbeinvalid.Ifweusethemethodofleastsquares,thenmodelparameterestimateswill still bevalidbutnotoptimal (wecouldgetbetterestimatesusingadifferentmethod).Ingeneral,ifthisassumptionisviolated,therearetechniquesyoucanusedescribedinChapter20of(Field,2013).

Testing Assumptions Testing normality Youcanlookfornormalityinthreeways:(1)graphs;(2)numerically;and(3)significancetests.WecandoallthreeusingtheExplorecommandinSPSS.Intermsofgraphswecanlookathistograms(whichwe’vealreadylearntabout)andP-PplotsandQ-Qplots.P-PandQ-Qplotsbasicallyshowthesamething:aP-Pplotplotsthecumulativeprobabilityofavariableagainstthecumulativeprobabilityofaparticulardistribution(inthiscaseanormaldistribution).AQ-Qplotdoesthesamethingbutexpressedasquantiles.WithlargedatasetsQ-Qplotsareabiteasiertointerpret.Inasensetheyplotthe‘actualdata’against ‘whatyou’dexpecttogetfromanormaldistribution’,so ifthedataarenormallydistributedthentheactualscoreswillbethesameastheexpectedscoresandyou’llgetalovelystraightdiagonalline.Thisidealscenarioishelpfullyplottedonthegraphandyourjobistocomparethedatapointstothisline.Ifvaluesfallonthediagonaloftheplotthenthevariableisnormallydistributed;however,whenthedatasagconsistentlyaboveorbelowthediagonalthenthisshowsthatthekurtosisdiffersfromanormaldistribution,andwhenthedatapointsareS-shaped,theproblemisskewness.

Numerically, SPSS usesmethods to calculate skew and kurtosis (see Field (2013) if you have forgottenwhat theseconceptsare)thatgivevaluesofzeroinanormaldistribution.Positivevaluesofskewnessindicateapile-upofscoresontheleftofthedistribution,whereasnegativevaluesindicateapile-upontheright.Positivevaluesofkurtosisindicateapointyandheavy-taileddistribution,whereasnegativevaluesindicateaflatandlight-taileddistribution.Thefurtherthevalueisfromzero,themorelikelyitisthatthedataarenotnormallydistributed.

Finally,wecanseewhetherthedistributionofscoresdeviatesfromacomparablenormaldistribution.TheKolmogorov-SmirnovtestandShapiro-Wilktestdothis: theycomparethescores inthesampletoanormallydistributedsetofscoreswiththesamemeanandstandarddeviation.

• Ifthetestisnon-significant(p>.05)ittellsusthatthedistributionofthesampleisnotsignificantlydifferentfromanormaldistribution(i.e.,itisprobablynormal).

• If,however, thetest issignificant (p< .05)thenthedistribution inquestion issignificantlydifferentfromanormaldistribution(i.e.,itisnon-normal).

Thesetestsseemgreat:inoneeasyproceduretheytelluswhetherourscoresarenormallydistributed(nice!).However,theJaneSuperbrainBoxexplainssomereallygoodreasonsnottousethem.Ifyouinsistonusingthem,bearJane’sadviceinmindandalwaysplotyourdataaswellandtrytomakeaninformeddecisionabouttheextentofnon-normalitybasedonconvergingevidence.


Figure8showsthedialogboxesfortheExplorecommand( ).First,enteranyvariablesofinterestintheboxlabelledDependentListbyhighlightingthemontheleft-handsideandtransferringthembyclickingon .Forthisexample,selectthehygienescoresforthethreedays.Ifyouclickon adialogboxappears,butthedefaultoptionisfine(itwillproducemeans,standarddeviationsandsoon).Themoreinterestingoption for our current purposes is accessed by clicking on . In this dialog box select the option

, and thiswill produceboth theK-S test and somenormalQ-Qplots. You can also split theanalysis by a factor or grouping varaiable (for example,we could do a separete analysis formales and females bydragginggendertotheFactorListbox—we’lldothislaterinthehandout).

Figure8:Dialogboxesfortheexplorecommand

Wealsoneedtoclickon totellSPSShowtodealwithmisisngvalues.Thisisimportantbecausealthoughwestartoffwith810scoresonday1,bydaytwowehaveonly264andthisisreducedto123onday3.BydefaultSPSSwilluseonlycasesforwhichtherearevalidscoresonalloftheselectedvariables.Thiswouldmeanthatforday1,eventhoughwehave810scores,itwilluseonlythe123casesforwhichtherearescoresonallthreedays.Thisisknownasexlcudingcaseslistwise.However,wewantittouseallofthescoresithasonagivenday,whichisknownaspairwise.Onceyouhaveclickedon selectExcludecasespairwise,thenclickon toreturntothemaindialogboxandclickon toruntheanalysis.


SELF-TEST:Wehavealreadyplottedahistogramoftheday1scores,usingwhatyouleantbeforeplothistogramsforthehygienescoresfordays2and3oftheDownloadFestival.

Figure9:Histograms(left)andP-Pplots(right)ofthehygienescoresoverthethreedaysoftheDownloadFestival

Figure9showsthehistograms(fromtheself-testtasks)andthecorrespondingQ-Qplots.Theday1scoreslookquitenormal;TheQ-Qplotechoesthisviewbecausethedatapointsallfallveryclosetothe‘ideal’diagonalline.However,thedistributionsfordays2and3arenotnearlyassymmetricalasday1:theybothlookpositivelyskewed.Again,thiscanbeseenintheQ-Qplotsbythedatapointsdeviatingawayfromthediagonal.Ingeneral,thisseemstosuggestthatbydays2and3,hygienescoresweremuchmoreclusteredaroundthelowendofthescale.Rememberthatthelowerthescore, the lesshygienic theperson is, sogenerallypeoplebecamesmellieras the festivalprogressed.Theskewoccursbecauseasubstantialminorityinsistedonupholdingtheirlevelsofhygiene(againstallodds)overthecourseofthefestival(babywet-wipesareindispensableIfind).

Output1showsthetableofdescriptivestatisticsforthethreevariablesinthisexample.Onaverage,hygienescoreswere1.77(outof5)onday1ofthefestival,butwentdownto0.96and0.98ondays2and3respectively.Theotherimportantmeasuresforourpurposesaretheskewnessandthekurtosis,bothofwhichhaveanassociatedstandarderror.Forday1theskewvalueisveryclosetozero(whichisgood)andkurtosisisalittlenegative.Fordays2and3,though,thereisaskewnessofaround1(positiveskew).


Wecanconvertthesevaluestoz-scoreswhichenablesusto(1)compareskewandkurtosisvaluesindifferentsamplesthatuseddifferentmeasures,and(2)calculateap-valuethattellsusifthevaluesaresignificantlydifferentfrom0(i.e.,normal).Althoughtherearegoodreasonsnottodothis (seeJaneSuperbrainBox), ifyouwanttoyoucando itbysubtractingthemeanofthedistribution(inthiscasezero)fromthescoreandthendividingbythestandarderrorofthedistribution.

𝑧𝑧+,-./-++ =𝑆𝑆 − 0

𝑆𝑆𝑆𝑆+,-./-++ 𝑧𝑧,4567+8+ =

𝐾𝐾 − 0𝑆𝑆𝑆𝑆,4567+8+

Intheaboveequations,thevaluesofS(skewness)andK(kurtosis)andtheirrespectivestandarderrorsareproducedbySPSS.Thesez-scorescanbecomparedagainstvaluesthatyouwouldexpecttogetifskewandkurtosiswerenotdifferentfrom0.So,anabsolutevaluegreaterthan1.96issignificantatp<.05,above2.58issignificantatp<.01andabove3.29issignificantatp<.001.However,youreallyshouldusethesecriteriaonlyinsmallsamples:inlargersampleslookattheshapeofthedistributionvisually,interpretthevalueoftheskewnessandkurtosisstatistics,andpossiblydon’tevenworryaboutnormalityatall(JaneSuperbrainBox).

Output1

UsingthevaluesinOutput1,calculatethez-scoresforskewnessandKurtosisforeachdayoftheDownloadfestival.


YourAnswers:

Output2

Output2showstheK-Stest.Rememberthatasignificantvalue(Sig.lessthan.05)indicatesadeviationfromnormality.

AretheresultsoftheK-Stestssurprisinggiventhehistogramswehavealreadyseen?

YourAnswers:

Forday1theK-Stestisjustaboutnotsignificant(p=.097),whichissurprisinglyclosetosignificantgivennownormaltheday1scoreslookedinthehistogram(Figure3).However,thesamplesizeonday1isverylarge(N=810)andthesignificanceof theK-S test for thesedatashowshow in largesamplesevensmallandunimportantdeviations fromnormalitymightbedeemedsignificantbythistest(JaneSuperbrainBox).Fordays2and3thetestishighlysignificant,indicatingthatthesedistributionsarenotnormal,whichislikelytoreflecttheskewseeninthehistogramsforthesedatabutcouldagainbedowntothelargesample(Figure9).

Reporting the K-S test

TheteststatisticfortheK-StestisdenotedbyDandweshouldreportthedegreesoffreedom(df)fromthetableinbracketsaftertheD.WecanreporttheresultsinOutput2inthefollowingway:

ü Thehygienescoresonday1,D(810)=0.029,p=.097,didnotdeviatesignificantlyfromnormal;however,day2,D(264)=0.121,p<.001,andday3,D(123)=0.140,p<.001,scoreswerebothsignificantlynon-normal.

JaneSuperbrainBox:Significancetestsandassumptions


Throughoutthishandoutwewilllookatvarioussignificanceteststhathavebeendevisedtolookatwhetherassumptions are violated. These include tests of whether a distribution is normal (the Kolmorgorov-Smirnoff and Shapiro-Wilk tests), tests of homogeneity of variances (Levene’s test), and tests ofsignificanceofskewandkurtosis.Allofthesetestsarebasedonnullhypothesissignificancetestingandthismeansthat(1)inlargesamplestheycanbesignificantevenforsmallandunimportanteffects,and(2)insmallsamplestheywilllackpowertodetectviolationsof.

Thecentrallimittheoremmeansthatassamplesizesgetlarger,thelesstheassumptionofnormalitymattersbecausethesamplingdistributionwillbenormalregardlessofwhatourpopulation(orindeed

sample)datalooklike.So,theproblemisthatinlargesamples,wherewedon’tneedtoworryaboutnormality,atestofnormalityismorelikelytobesignificant,andthereforelikelytomakeusworryaboutandcorrectforsomethingthatdoesn’tneedtobecorrectedorworriedabout.Conversely, insmallsamples,wherewemightwanttoworryaboutnormality,asignificancetestwon’thavethepowertodetectnon-normalityandsoislikelytoencourageusnottoworryaboutsomethingthatweprobablyoughtto.Therefore,thebestadviceisthatifyoursampleis largethendon’tusesignificancetestsofnormality,infactdon’tworrytoomuchaboutnormalityatall.Insmallsamplesthenpayattentionifyoursignificancetestsaresignificantbutresistbeinglulledintoafalsesenseofsecurityiftheyarenot.

Testing homogeneity of variance/homoscedasticity Likenormality,youcanlookatthevariancesusinggraphs,numbersandsignificancetests.Graphically,wecancreateascatterplotofthevaluesoftheresidualsplottedagainstthevaluesoftheoutcomepredictedbyourmodel.Indoingsowe’relookingatwhetherthereisasystematicrelationshipbetweenwhatcomesoutofthemodel(thepredictedvalues)andtheerrorsinthemodel.Normallyweconvertthepredictedvaluesanderrorstoz-scores1sothisplotissometimesreferred to as zpred vs. zresid. If linearity and homoscedasticity hold true then there should be no systematicrelationshipbetweentheerrorsinthemodelandwhatthemodelpredicts.Lookingatthisgraphcan,therefore,killtwobirdswithonestone.Ifthisgraphfunnelsout,thenthechancesarethatthereisheteroscedasticityinthedata.Ifthereisanysortofcurveinthisgraphthenthechancesarethatthedatahavebrokentheassumptionoflinearity.

Figure10showsseveralexamplesoftheplotofstandardizedresidualsagainststandardizedpredictedvalues.Thetopleftpanelshowsasituationinwhichtheassumptionsoflinearityandhomoscedasticityhavebeenmet.Thetoprightpanelshowsasimilarplotforadatasetthatviolatestheassumptionofhomoscedasticity.Notethatthepointsformafunnel:theybecomemorespreadoutacrossthegraph.Thisfunnelshapeistypicalofheteroscedasticityandindicatesincreasingvarianceacrosstheresiduals.Thebottomleftpanelshowsaplotofsomedatainwhichthereisanon-linearrelationshipbetweentheoutcomeandthepredictor:thereisaclearcurveintheresiduals.Finally,thebottomrightpanelillustratesdatathatnotonlyhaveanon-linearrelationship,butalsoshowheteroscedasticity.Notefirstthecurvedtrendintheresiduals,andthenalsonotethatatoneendoftheplotthepointsareveryclosetogetherwhereasattheotherendtheyarewidelydispersed.Whentheseassumptionshavebeenviolatedyouwillnotseetheseexactpatterns,buthopefullytheseplotswillhelpyoutounderstandthegeneralanomaliesyoushouldlookoutfor.

Numerically,wecansimplylookatthevaluesofthevariancesindifferentgroups.SomepeoplelookatHartley’sFMaxalsoknownasthevarianceratio(Pearson&Hartley,1954).Thisistheratioofthevariancesbetweenthegroupwiththebiggestvarianceandthegroupwiththesmallestvariance.Theacceptablesizeofthisratiodependsonthenumberofvariancescomparedandthesamplesize—see(Field,2013)formoredetail.

MorecommonlyusedisLevene’stest(Levene,1960),whichteststhenullhypothesisthatthevariancesindifferentgroupsareequal.

• If Levene’s test is significant at p£ .05 then you conclude that the variances are significantly different—therefore,theassumptionofhomogeneityofvarianceshasbeenviolated.

• If, however, Levene’s test is non-significant (i.e., p > .05) then the variances are roughly equal and theassumptionistenable.

1Thesesstandardizederrorsarecalledstandardizedresiduals.


AlthoughLevene’stestcanbeselectedasanoptioninmanyofthestatisticalteststhatrequireit,it’sbesttolookatitwhenyou’reexploringdatabecauseitinformsthemodelyoufit.AswiththeK-StestyouneedtotakeLevene’stestwithapinchofsalt(JaneSuperbrainBox).

Figure10:Plotsofstandardizedresidualsagainstpredicted(fitted)values

WecangetLevene’stestusingtheexploremenuthatweusedintheprevioussection.Stickingwiththehygienescores,we’ll compare the variances of males and females on day 1 of the festival. Use

toopenthedialogboxinFigure11.Transfertheday1variablefromthelistontheleft-handsidetotheboxlabelledDependentListbyclickingonthe nexttothisbox;becausewewanttosplit theoutputbythegroupingvariabletocomparethevariances,selectthevariablegenderandtransferittotheboxlabelledFactorListbyclickingontheappropriate .Thenclickon toopentheotherdialogboxinFigure11.TogetLevene’stestweneedtoselectoneoftheoptionswhereitsaysSpreadvs.levelwithLevenetest.Ifyouselect Levene’stestiscarriedoutontherawdata(agoodplacetostart).Whenyou’vefinishedwiththisdialogboxclickon toreturntothemainExploredialogboxandthenclickon toruntheanalysis.

Output3showsthetableforLevene’stest.Levene’stestcanbebasedondifferencesbetweenscoresandthemean,andscoresandthemedian.Themedianisslightlypreferable(becauseitislessbiasedbyoutliers).Whenusingboththemean(p=.030)andthemedian(p=.037)thesignificancevaluesarelessthan.05indicatingasignificantdifferencebetweenthemaleandfemalevariances.Tocalculatethevarianceratio,weneedtodividethelargestvariancebythesmallest.Youshould findthevariances inyouroutput: themalevariancewas0.413andthe femaleone0.496, thevarianceratiois,therefore,0.496/0.413=1.2.Inessencethevariancesarepracticallyequal.So,whydoesLevene’stesttellustheyaresignificantlydifferent?Theanswerisbecausethesamplesizesaresolarge:wehad315malesand495femalessoeventhisverysmalldifferenceinvariancesisshownupassignificantbyLevene’stest(JaneSuperbrainBox).Hopefullythisexampleconvincesyoutotreatthistestcautiously.


Figure11:ExploringgroupsofdataandobtainingLevene’stest

Output3

Reporting Levene’s test

Levene’stestcanbedenotedwiththeletterFandtherearetwodifferentdegreesoffreedom.Assuchyoucanreportit,ingeneralform,asF(df1,df2)=value,sig:

ü Forthehygienescoresonday1ofthefestival,thevarianceswereunequalforformalesandfemales,F(1,808)=4.74,p=.03.

Whatistheassumptionofhomogeneityofvariance?

YourAnswers:

Reducing Bias Havinglookedatpotentialsourcesofbias,thenextissueishowtoreducetheimpactofbias.Essentiallytherearefourmethodsforcorrectingproblemswiththedata,whichcanberememberedwiththehandyacronymofTWAT:


• Trimthedata:deleteacertainamountofscoresfromtheextremes.

• Windsorizing:substituteoutlierswiththehighestvaluethatisn’tanoutlier.

• AnalysewithRobustMethods:thistypicallyinvolvesatechniqueknownasbootstrapping.

• Transformthedata:thisinvolvesapplyingamathematicalfunctiontoscorestotrytocorrectanyproblemswiththem.

Probablythebestofthesechoicesistouserobusttests,whichisatermappliedtoafamilyofprocedurestoestimatestatistics thatare reliableevenwhen thenormalassumptionsof the statisticarenotmet. For thepurposesof thishandoutwe’lllookattransformingdata,andthroughoutthemodulewe’llusebootstrapping(whichisarobustmethodexplainedinyourlecture),butyoucanfindmoredetailonthesetechniquesandtheotherinChapter5of(Field,2013).

Bootstrapping (robust methods) SomeSPSSprocedureshaveabootstrapoption,whichcanbeaccessedbyclickingon toactivatethedialogboxinFigure12.Select toactivatebootstrappingfortheprocedureyou’recurrentlydoing.Intermsoftheoptions,SPSSwillcomputea95%percentileconfidenceinterval( ),butyoucanchangethemethodtoaslightlymoreaccurateone(Efron&Tibshirani,1993)calledabiascorrectedandacceleratedconfidenceinterval(

).Youcanalsochangetheconfidencelevelbytypinganumberotherthan95intheboxlabelled Level(%). By default, SPSS uses 1000 bootstrap samples, which is a reasonable number and you certainlywouldn’tneedtousemorethan2000.

http://youtu.be/mNrxixgwA2M

Figure12:Thestandardbootstrapdialogbox

Transforming Data Thefinalthingthatyoucandotocombatproblemswithnormalityandlinearity istotransformyourdata.Theideabehindtransformationsisthatyoudosomethingtoeveryscoretocorrectfordistributionalproblems,outliers,lackoflinearityorunequalvariances.Ifyouarelookingatrelationshipsbetweenvariables(e.g.,regression)justtransformtheproblematicvariable,butifyouarelookingatdifferencesbetweenvariables(e.g.,changeinavariableovertime)thenyouneedtotransformallofthosevariables.Forexample,ourfestivalhygienedatawerenotnormalondays2and3ofthefestival.Now,wemightwanttolookathowhygienelevelschangedacrossthethreedays(i.e.,comparethemean


onday1tothemeansondays2and3toseeifpeoplegotsmellier).Thedatafordays2and3wereskewedandneedtobetransformed,butbecausewemightlatercomparethedatatoscoresonday1,wewouldalsohavetotransformtheday1data(eventhoughscoreswerenotskewed).Ifwedon’tchangetheday1dataaswell,thenanydifferencesinhygienescoreswe find fromday1 today2or3willbeduetous transformingonevariableandnot theothers.However, ifweweregoingto lookattherelationshipbetweenday1andday2scores(notthedifferencebetweenthem)wecouldtransformonlytheday2scoresandleavetheday1scoresalone.

Choosing a transformation

Therearevarioustransformationsthatyoucandotothedatathatarehelpfulincorrectingvariousproblems.Table1:showssomecommontransformationsandtheiruses.Thewaytodecidewhichtransformationtouseisbygoodoldfashionedtrialanderror:tryoneout,seeifithelpsandifitdoesn’tthentryadifferentone.

Table1:Datatransformationsandtheiruses

DataTransformation CanCorrectFor

Logtransformation(log(Xi)):Takingthelogarithmofasetofnumberssquashestherighttailofthedistribution.Assuchit’sa goodway to reducepositive skew. This transformation is also veryuseful if youhaveproblemswith linearity (it cansometimesmakeacurvilinearrelationshiplinear).However,youcan’tgetalogvalueofzeroornegativenumbers,soifyourdata tend to zero or produce negative numbers you need to add a constant to all of the data before you do thetransformation.Forexample,ifyouhavezerosinthedatathendolog(Xi+1),orifyouhavenegativenumbersaddwhatevervaluemakesthesmallestnumberinthedatasetpositive.

Positive skew,positive kurtosis,unequal variances,lackoflinearity

Squareroottransformation(ÖÖXi):Takingthesquarerootoflargevalueshasmoreofaneffectthantakingthesquarerootofsmallvalues.Consequently,takingthesquarerootofeachofyourscoreswillbringanylargescoresclosertothecentre—ratherlikethelogtransformation.Assuch,thiscanbeausefulwaytoreducepositiveskew;however,youstillhavethesameproblemwithnegativenumbers(negativenumbersdon’thaveasquareroot).

Positive skew,positive kurtosis,unequal variances,lackoflinearity

Reciprocaltransformation(1/Xi):Dividing1byeachscorealsoreducestheimpactoflargescores.Thetransformedvariablewillhavealowerlimitof0(verylargenumberswillbecomecloseto0).Onethingtobearinmindwiththistransformationis that it reverses the scores: scores that were originally large in the data set become small (close to zero) after thetransformation,butscoresthatwereoriginallysmallbecomebigafterthetransformation.Forexample,imaginetwoscoresof1and10;afterthetransformationtheybecome1/1=1,and1/10=0.1:thesmallscorebecomesbiggerthanthelargescoreafterthetransformation.However,youcanavoidthisbyreversingthescoresbeforethetransformation,byfindingthe highest score and changing each score to the highest score minus the score you’re looking at. So, you do atransformation1/(XHighest−Xi).Likethelogtransformation,youcan’ttakethereciprocalof0(because1/0=infinity)soifyouhavezerosinthedatayouneedtoaddaconstanttoallscoresbeforedoingthetransformation.

Positive skew,positive kurtosis,unequalvariances

Reversescoretransformations:Anyoneoftheabovetransformationscanbeusedtocorrectnegativelyskeweddata,butfirstyouhavetoreversethescores.Todothis,subtracteachscorefromthehighestscoreobtained,orthehighestscore+1(dependingonwhetheryouwantyourlowestscoretobe0or1).Ifyoudothis,don’tforgettoreversethescoresbackafterwards,ortorememberthatthe interpretationofthevariable isreversed:bigscoreshavebecomesmallandsmallscoreshavebecomebig.

Negativeskew

Tryingoutdifferenttransformationscanbequitetimeconsuming;however,ifheterogeneityofvarianceisyourissuethenwecanseetheeffectofatransformationquitequickly.WhenweranLevene’stest(Figure11)werantheanalysisselectingtherawscores( ).However,ifthevariancesturnouttobeunequal,astheydidinourexample,youcanusethesamedialogboxbutselect .Whenyoudothisyoushouldnoticeadrop-downlistthatbecomesactiveandifyouclickonthisyou’llnoticethatitlistsseveraltransformationsincludingtheonesthatIhavejustdescribed.Ifyouselectatransformationfromthislist(NaturallogperhapsorSquareroot)thenSPSSwillcalculatewhatLevene’stestwouldbeifyouweretotransformthedatausingthismethod.Thiscansaveyoualotoftimetryingoutdifferenttransformations.

Using SPSS’s Compute command

Thecomputecommandenablesustocarryoutvariousfunctionsoncolumnsofdatainthedataeditor.Sometypicalfunctionsareaddingscoresacrossseveralcolumns,takingthesquarerootofthescoresinacolumn,orcalculatingthemeanofseveralvariables.Toaccessthecomputedialogbox,usethemousetoselect .TheresultingdialogboxisshowninFigure13;ithasalistoffunctionsontheright-handside,acalculator-likekeyboardinthecentreandablankspacethatI’velabelledthecommandarea.ThebasicideaisthatyoutypeanameforanewvariableinthearealabelledTargetVariableandthenyouwritesomekindofcommandinthecommandareatotellSPSShowtocreatethisnewvariable.Youuseacombinationofexistingvariablesselectedfromthelistontheleftandnumericexpressions.So,forexample,youcoulduseitlikeacalculatortoaddvariables(i.e.addtwocolumnsinthedata


editortomakeathird).Therearehundredsofbuilt-infunctionsthatSPSShasgroupedtogether.InthedialogboxitliststhesegroupsinthearealabelledFunctiongroup;uponselectingafunctiongroup,alistofavailablefunctionswithinthatgroupwillappearintheboxlabelledFunctionsandSpecialVariables.Ifyouselectafunction,thenadescriptionofthatfunctionappearsintheboxindicatedinFigure13.Youcanentervariablenamesintothecommandareabyselectingthevariablerequiredfromthevariableslistandthenclickingon .Likewise,youcanselectacertainfunctionfromthelistofavailablefunctionsandenteritintothecommandareabyclickingon .

FirsttypeavariablenameintheboxlabelledTargetVariable,thenclickon andanotherdialogboxappears,whereyou cangive thevariableadescriptive label and specifywhether it is anumericor stringvariable (seeyourhandout fromweek1). Thenwhenyouhavewrittenyour command for SPSS toexecute, clickon to run thecommandandcreate thenewvariable.Thereare functions forcalculatingmeans, standarddeviationsandsumsofcolumns.We’regoingtousethesquarerootandlogarithmfunctions,whichareusefulfortransformingdatathatareskewed.

Figure13:DialogboxfortheComputecommand

Log Transformation

Let’sreturntoourDownloadfestivaldata.OpenthemainComputedialogboxbyselecting .Enter the name logday1 into the box labelled Target Variable, click on and give the variable a moredescriptivenamesuchasLogtransformedhygienescoresforday1ofDownloadfestival.InthelistboxlabelledFunctiongroup clickonArithmetic and then in thebox labelledFunctionsandSpecialVariables clickonLg10 (this is the logtransformation tobase10,Ln is thenatural log) and transfer it to the commandareaby clickingon .When the

Command area

Categories of functions

Functions within the selected category

Use this dialog box to select certain cases

in the data

Variable list

Description ofselected function


commandistransferred,itappearsinthecommandareaas‘LG10(?)’andthequestionmarkshouldbereplacedwithavariablename(whichcanbetypedmanuallyortransferredfromthevariableslist).Soreplacethequestionmarkwiththevariableday1byeitherselectingthevariableinthelistanddraggingitacross,clickingon ,orjusttyping‘day1’wherethequestionmarkis.

For theday2hygienescores there isavalueof0 in theoriginaldata,and there isno logarithmof thevalue0.Toovercometheproblemweaddaconstanttoouroriginalscoresbeforewetakethelogofthosescores.Anyconstantwilldo(althoughsometimesitcanmatter),providedthatitmakesallofthescoresgreaterthan0.Inthiscaseourlowestscoreis0inthedatasowecouldadd1toallofthescorestoensurethatallscoresaregreaterthanzero.Eventhoughthisproblemaffectstheday2scores,weneedtobeconsistentanddothesametotheday1scoresaswewilldowiththeday2scores.Therefore,makesurethecursorisstillinsidethebracketsandclickon andthen .ThefinaldialogboxshouldlooklikeFigure13.NotethattheexpressionreadsLG10(day1+1);thatis,SPSSwilladdonetoeachoftheday1scoresandthentakethelogoftheresultingvalues.Clickon tocreateanewvariablelogday1containingthetransformedvalues.

SELFTEST:Haveagoatcreatingsimilarvariableslogday2andlogday3fortheday2andday3data.Plothistogramsofthetransformedscoresforallthreedays.

Square Root Transformation

Tousethesquareroottransformation,wecouldrunthroughthesameprocess,byusinganamesuchassqrtday1andselectingtheSQRT(numexpr)functionfromthelist.ThiswillappearintheboxlabelledNumericExpression:asSQRT(?),and you can simply replace the questionmarkwith the variable youwant to change—in this caseday1. The finalexpressionwillreadSQRT(day1).

SELFTEST:Repeatthisprocessforday2andday3tocreatevariablescalledsqrtday2andsqrtday3.Plothistogramsofthetransformedscoresforallthreedays.

Reciprocal Transformation

Todoareciprocaltransformationonthedatafromday1,wecoulduseanamesuchasrecday1 intheboxlabelledTargetVariable.Thenwecansimplyclickon andthen .Ordinarilyyouwouldselectthevariablenamethatyouwanttotransformfromthelistanddragitacross,clickon orjusttypethenameofthevariable.However,theday2datacontainazerovalueandifwetrytodivide1by0thenwe’llgetanerrormessage(youcan’tdivideby0).Weneedtoaddaconstanttoourvariablejustaswedidforthelogtransformation.Anyconstantwilldo,but1isaconvenientnumberforthesedata.So, insteadofselectingthevariablewewanttotransform,clickon .ThisplacesapairofbracketsintotheboxlabelledNumericExpression;thenmakesurethecursorisbetweenthesetwobracketsandselectthevariableyouwanttotransformfromthelistandtransferitacrossbyclickingon (ortypethenameofthevariablemanually).Nowclickon andthen (ortype+1usingyourkeyboard).TheboxlabelledNumericExpressionshouldnowcontainthetext1/(day1+1).Clickon tocreateanewvariablecontainingthetransformedvalues.

SELFTEST:Repeatthisprocessforday2andday3.Plothistogramsofthetransformedscoresforallthreedays.

The effect of transformations

Figure14showsthedistributionsfordays1and2ofthefestivalafterthethreedifferenttransformations.ComparethesetotheuntransformeddistributionsinFigure9.Now,youcanseethatallthreetransformationshavecleanedupthehygienescoresforday2:thepositiveskewisreduced(thesquareroottransformationinparticularhasbeenuseful).However,becauseourhygienescoresonday1weremoreorlesssymmetricaltobeginwith,theyhavenowbecome


slightly negatively skewed for the log and square root transformation, and positively skewed for the reciprocaltransformation2Ifwe’reusingscoresfromday2aloneorlookingattherelationshipbetweenday1andday2,thenwecouldusethetransformedscores;however,ifwewantedtolookatthechangeinscoresthenwe’dhavetoweighupwhetherthebenefitsofthetransformationfortheday2scoresoutweightheproblemsitcreatesintheday1scores—dataanalysiscanbefrustratingsometimesJ

Figure14:Distributionsofthehygienedataonday1andday2aftervarioustransformations

Multiple Choice Test Gotohttps://studysites.uk.sagepub.com/field4e/study/mcqs.htmandtestyourselfonthemultiplechoicequestionsforChapter5.Ifyougetanywrong,re-readthishandout(orField,2013,Chapter5)anddothemagainuntilyougetthemallcorrect.

2Thereversaloftheskewforthereciprocaltransformationisbecause,asImentionedearlier,thereciprocalhastheeffectofreversingthescores.

Day 1 Day 2

Log

Square root

1/x


References Efron,B.,&Tibshirani,R.(1993).Anintroductiontothebootstrap.BocaRaton,FL:Chapman&Hall/CRC.Field,A.P.(2000).DiscoveringstatisticsusingSPSSforWindows:Advancedtechniquesforthebeginner.London:Sage.Field,A.P.(2013).DiscoveringstatisticsusingIBMSPSSStatistics:Andsexanddrugsandrock'n'roll(4thed.).London:

Sage.Gelman,A.,&Hill,J.(2007).DataAnalysisUsingRegressionandMultilevel/HierarchicalModels.Cambridge:Cambridge

UniversityPress.Hayes, A. F., & Cai, L. (2007). Using heteroskedasticity-consistent standard error estimators in OLS regression: An

introductionandsoftwareimplementation.BehaviorResearchMethods,39(4),709-722.Levene,H.(1960).Robusttestsforequalityofvariances.InI.Olkin,S.G.Ghurye,W.Hoeffding,W.G.Madow&H.B.

Mann (Eds.),Contributions to Probability and Statistics: Essays inHonor ofHaroldHotelling (pp. 278-292).Stanford,CA:StanfordUniversityPress.

Lumley,T.,Diehr,P.,Emerson,S.,&Chen,L.(2002).Theimportanceofthenormalityassumptioninlargepublichealthdatasets.AnnualReviewofPublicHealth,23,151-169.

Pearson,E.S.,&Hartley,H.O. (1954).Biometrikatablesforstatisticians,volumeI.NewYork:CambridgeUniversityPress.

Wilcox,R.R.(2010).Fundamentalsofmodernstatisticalmethods:substantially improvingpowerandaccuracy.NewYork:Springer.

Terms of Use Thishandoutcontainsmaterialfrom:

Field,A.P.(2013).DiscoveringstatisticsusingSPSS:andsexanddrugsandrock‘n’roll(4thEdition).London:Sage.

ThismaterialiscopyrightAndyField(2000-2016).

This document is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 InternationalLicense (https://creativecommons.org/licenses/by-nc-nd/4.0/), basically you can use it for teaching and non-profitactivitiesbutnotmeddlewithitwithoutpermissionfromtheauthor.

Exploring Data: The Beast of Bias

Documents

Transcript of Exploring Data: The Beast of Bias