Making Data Visual: A Practical Guide to Using Visualization for Insight
Danyel Fisher and Miriah Meyer
Making Data Visual by Danyel Fisher and Miriah Meyer
Copyright © 2018 Miriah Meyer, Microsoft. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: [email protected].
Editor: Rachel Roumeliotis
Production Editor: Justin Billing
Copyeditor: Rachel Head
Proofreader: Gillian McGarvey
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
Additional illustrations: Dominik Moritz and Kanit “Ham” Wongsuphasawat
January 2018: First Edition
Revision History for the First Edition
2017-12-19: First Release
See http://bit.ly/making-data-visual for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Making Data Visual, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-92846-2
[LSI]
Preface
Visualization is a vital tool for understanding and sharing insights around data. The right visualization can help express a core idea or open a space to examination; it can get the world talking about a dataset or sharing an insight (Figure P-1).
Figure P-1. Visualizations can take many forms, from views that support exploratory analysis (top left), to those that provide quick overviews in a dashboard (bottom), to an infographic about popular topics (top right).
Visualizations provide a direct and tangible representation of data. They allow people to confirm hypotheses and gain insights. When incorporated into the data analysis process early and often, visualizations can even fundamentally alter the questions that someone is asking.
Creating effective visualizations is hard. Not because a dataset requires an exotic and bespoke visual representation: for many problems, standard statistical charts will suffice. And not because creating a visualization requires coding expertise in an unfamiliar programming language: off-the-shelf tools like Excel, Tableau, and R are ample enough to suffice.
Rather, creating effective visualizations is difficult because the problems that are best addressed by visualization are often complex and ill-formed. The task of figuring out what attributes of a dataset are important is often conflated with figuring out what type of visualization to use. Picking a chart type to represent specific attributes in a dataset is comparatively easy. Deciding on which data attributes will help answer a question, however, is a complex, poorly defined, and user-driven process that can require several rounds of visualization and exploration to resolve. In this book, we focus on the process of going from high-level questions to well-defined data analysis tasks, and on how to incorporate visualizations along the way to clarify understanding and gain insights.
Who Is This Book For?
This book is for people who have access to data and, perhaps, a suite of computational tools but who are less than sure how to turn that data into visual insights. We find that many data science books assume that you can figure out how to visualize the data once collected, and visualization books assume that you already have a well-defined question, ready to be visualized. If, like us, you would like to address these assumptions, then this book is for you.
This book does not cover how to clean and manage data in detail or how to write visualization code. There are already great books on these topics (and, when relevant, we point to some of them). Rather, this book speaks to why those processes are important. Similarly, this book does not address how to choose a beautiful color map or select a typeface. Instead, we lay out a framework for how to think about data given the possibilities and constraints of visual exploration. Our goal is to show how to effectively use visualizations to make sense of data.
Who Are We?
The authors of this book have a combined three decades of experience in making sense of data through designing and using visualizations. We have worked with data from a broad range of fields: biology and urban transportation, business intelligence and scientific visualization, debugging code and building maps. We have worked with analysts from a variety of organizations, from small, academic science labs to teams of data analysts embedded in large companies. Some of the projects we have worked on have resulted in sophisticated, bespoke visualization systems designed collaboratively with domain specialists, and at other times we have pointed people to off-the-shelf visualization tools after a few conversations. We have taught university classes in visualization and have given lectures and tutorials. All in all, we have visualized hundreds of datasets.
We have found that our knowledge about visualization techniques, solutions, and systems shapes the way that we think and reason about data. Visualization is fundamentally about presenting data in a way that elicits human reasoning, makes room for individual interpretations, and supports exploration. We help our collaborators make their questions and data reflect these values. The process we lay out in this book describes our method for doing this.
Overview of Chapters
Chapter 1 illustrates the process of making sense with visualizations through a quick example, exposing the role that a visual representation can play in data discovery.
Chapter 2 starts to get into details. It discusses a mechanism to help narrow a question from a broad task into something that can be addressed with an iterative visualization process. For example, the broad question “Who are the best movie directors?” does not necessarily suggest a specific visualization, but “Find movie directors who directed top-grossing movies using an IMDB dataset” can lead more directly to an answer by way of a visualization or two. This process creates an operationalized question, one that consists of particular tasks that can be directly addressed with data.
This process of narrowing a question down to actionable tasks requires input from multiple stakeholders. Chapter 3 lays out an iterative set of steps for getting to the operationalization, which we call data counseling. These steps include finding the right people to talk to, asking effective questions, and rapidly exploring the data through increasingly sophisticated prototypes.
The numerical nitty-gritty of the book follows. Chapter 4 discusses types and relations of data, and defines terms like dimensions, measures, categorical, and quantitative. Chapter 5 then organizes common visualization types by the tasks they fulfill and the data they use. Then, Chapter 6 explores powerful visualization techniques that use multiple views and interaction to support analysis of large, complex datasets. These three chapters are meant to provide an overview of some of the most effective and commonly used ideas for supporting sensemaking with visualizations, and are framed using the operationalization and data counseling process to help guide decision-making about which visualizations to choose.
With this understanding of getting to insight (from questions to data to visualizations), the remainder of the book illustrates two examples of carrying out these steps. The case study in Chapter 7 describes the creation of a business intelligence dashboard in collaboration with a team of developers and analysts at Microsoft. The one in Chapter 8 draws from science, presenting an example with a team of scientists who work with biological data. These case studies illustrate the flexibility of the process laid out in this book, as well as the diverse types of outcomes that are possible.
This book is accompanied by a companion website. From this site you can download the code and interactive versions of the visualizations presented in Chapters 5 and 6, as well as other code and supplementary material.
Acknowledgments
Danyel and Miriah would like to thank Danyel’s colleagues at Microsoft, including Steven Drucker, Mary Czerwinski, and Sue Dumais, for their enthusiasm and encouragement. We also thank Miriah’s research group, the Visualization Design Lab at the University of Utah, including Alex Lex, for helping the project to evolve and providing feedback on ideas. We are both deeply appreciative of our work organizations for supporting the time and energy required by projects like this one, and for seeing the value in communicating our research broadly. At O’Reilly Media, we thank Mike Loukides for encouraging us to start this work, and Shannon Cutt and Rachel Roumeliotis for guiding it from start to finish.
Portions of this work were presented at the IPAM Workshop on Culture Analytics of 2016, Microsoft Data Insights Summit of 2016, University of Illinois Urbana-Champaign HCI Seminar Series of 2016, University of British Columbia HCI Seminar Series of 2016, Women in Data Science Conference at Stanford University in 2017, and O’Reilly Velocity Conference in 2017. Our thanks to the organizers of those events, and to participants who gave us critical feedback and helped clarify our thoughts.
Early feedback on the operationalization process came from Christian Canton of Microsoft. Michael Twidale and Andrea Thomer, both of UIUC, helped inform the discussion of data counseling with their insights on how reference librarians do their work.
We are grateful to Dominik Moritz and Kanit “Ham” Wongsuphasawat for putting together the examples used in Chapters 5 and 6. Their work, as well as that of the rest of the Vega-Lite team, is helping shape the future of data visualization. We also thank Alex Bigelow for supplying the skateboarding visualization figure in the Preface.
We thank Jacqueline Richards for her review and discussion of the case study in Chapter 7. Similarly, the collaboration with Angela DePace and her group at the Harvard Medical School for the case study in Chapter 8 provided valuable and rich insights into the process of designing visualizations for domain experts. The projects described in both of these chapters were deeply influential in our work practices.
Our technical reviewers, Michael Freeman, Jeff Heer, and Jerry Overton, helped clarify and strengthen the arguments we make.
Finally, Miriah thanks Brian Price for his endless support and encouragement, without which she could never do the things she does.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/making-data-visual.
To comment or ask technical questions about this book, send email to [email protected].
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Chapter 1. Getting to an Effective Visualization
Choosing or designing a good visualization is rarely a straightforward process. It is tempting to believe that there is one beautiful visualization that will show all the critical aspects of a dataset. That the right visual representation will reveal hidden insights. That a perfect, simple, and elegant visualization (perhaps just a line chart or a well-chosen scatterplot) will show precisely what the important variable was and how it varied in precisely the way to illustrate a critical lesson.
This is often the impression that we, at least, are left with after reading data science case studies. But in our experience, this does not match the reality of visual data analysis. It takes hard work, and trial and error, to get to an insightful visualization. We start by thinking about what we want to know, and we refine fuzzy questions into actionable, concrete tasks. We clean, reshape, and restructure the data into forms that we can put into a visualization. We work around limitations in the data, and we try to understand what the user wants to learn. We have to consider which visual representations to use and what interaction mechanisms to support. Along the way, we find other variables that tell us more about the dataset and that help clarify our thinking. And no single visualization is ever quite able to show all of the important aspects of our data at once: there just are not enough visual encoding channels.
Designing effective visualizations presents a paradox. On the one hand, visualizations are intended to help users learn about parts of their data that they don’t know about. On the other hand, the more we know about the users’ needs and the context of their data, the better we can design a visualization to serve them. The process described in this book embraces this paradox: it leverages the knowledge users have of their datasets, the context the data lives in, and the ways it was collected (including its likely flaws, challenges, and errors) in order to figure out the aspects of it that matter.
Put another way, this book is about the path from “I have some data…” to “We know this because of these clear, concise, and insightful visualizations.” We believe that creating effective visualizations is itself a process of exploration and discovery. A good visualization design requires a deep understanding of the problem, data, and users.
Getting to Insight
We most often work with other people who have a dataset they are trying to make sense of. The process of designing a visualization usually starts when people walk into our office.
CLIENT: I have some data that I’d like to visualize. How should I draw it?
The client seems to expect us to pull a visualization off the shelf, to sculpt that perfect visualization. We almost always frustrate them by asking what they hope to see.
Q: What is it about the data that you would like to visualize?
CLIENT: I want to see how profitable our stores are.
Q: What in your data indicates a store being profitable?
CLIENT: It means that the store has lots of sales of high-profit items.
Q: How does profit vary by store?
And so on.
By the end of this process, we often find that the clients do not have a visualization problem, but an operationalization one. Their struggles to choose a visualization stem from a lack of clarity about which attributes of the data are most important and how those attributes relate to one another. Once they can describe how the data attributes relate to the question they are trying to answer, finding an appropriate visualization becomes much easier.
We have learned over the years that designing effective visualizations to make sense of data is not an art; it is a systematic and repeatable process. We have systematized this process into what we believe are reproducible and clear steps.
This process tracks our understanding of four components:
Data
What data is available, and what does it mean? What does the data look like, and what are its important aspects? Where did it come from, and why was it originally collected?
Tasks
What needs to happen with the data? What are the low-level questions and tasks that will support high-level goals?
Stakeholders
Who is involved with the data, the problem, and the goals? What can they say about the problem to help design an effective visualization? Who will view the final visualization, and what sorts of things do we expect them to learn from it? What domain knowledge do they bring to the table? What answers would they find satisfying?
Visualization
How does the understanding of data, tasks, and stakeholders come together? What representations of this data will fulfill the tasks for the users?
Regardless of the visualization outcome, this process will almost certainly lead to new discoveries and insights. These discoveries help to inform the operationalization, but they will also likely steer the process down new and unexpected paths. The guidance and framework in this book are meant to help identify opportunities for discovering new knowledge and to make an otherwise messy process a bit more structured.
Hotmap: Making Decisions with Data
As an example of how visualizations can help you to better understand a problem, and help an organization make decisions, we can look back to 2006. Microsoft was rolling out its new mapping tool, Virtual Earth, a zoomable world map. The team behind Virtual Earth had lots of questions about how their users were using this new tool, so they collected usage data.
The usage data was based on traditional telemetry: it had great information on what cities were most viewed, how many viewers were in “street” mode versus “photograph” mode, and even information about viewers’ displays. They instrumented search and navigation, and they collected counts for the number of times that users looked at certain sentinel regions. And because Virtual Earth was built on top of a set of progressively higher-resolution image tiles, the team was also archiving server logs that tracked how often individual tiles were downloaded.
Interviews with team members suggested that they did not have an intuitive notion of how their tool was being used. In conversation, one team member argued that people were likely to look at their own homes; another thought that the overhead photography would mostly be used over mountains. The goals were varied: they included seeing whether the user experience was well balanced across user needs and deciding how and where to invest in future rounds of photography.
We addressed these questions with a visualization tool called Hotmap. Figure 1-1 shows a screen capture from the visualization tool, focusing on the central United States. Hotmap uses a heatmap encoding of the tile access values. This is a visualization technique that uses a color map to encode the access values at the geospatial locations of the tiles. Colored spots on the map are places where more users have accessed image tiles. The color map is a logarithmic color scale, so bright spots have many more accesses than dim ones.
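To make the idea of a logarithmic color scale concrete, here is a minimal sketch in Python. It is not Hotmap's actual implementation, which the text does not show; the function name `log_intensity` and the tile counts are hypothetical, chosen only to illustrate how a log scale compresses a very wide range of access counts into a usable brightness range.

```python
import math

def log_intensity(count: int, max_count: int) -> float:
    """Map a tile's access count to a brightness in [0, 1] on a log scale."""
    if count <= 0 or max_count <= 0:
        return 0.0
    # log1p(x) = log(1 + x), so a tile with a single access still gets
    # a small nonzero brightness instead of log(1) = 0.
    return min(1.0, math.log1p(count) / math.log1p(max_count))

# Hypothetical access counts keyed by (x, y) tile coordinates.
tile_counts = {(0, 0): 1, (0, 1): 10, (1, 0): 1_000, (1, 1): 100_000}
max_count = max(tile_counts.values())

brightness = {
    tile: log_intensity(count, max_count)
    for tile, count in tile_counts.items()
}

for tile, value in sorted(brightness.items()):
    print(tile, round(value, 2))
```

On a linear scale, the tile with 1,000 accesses would be only 1% as bright as the busiest tile and effectively invisible; on the log scale it remains clearly visible, which is why a small difference in brightness can stand for a very large difference in accesses.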
Figure 1-1. Hotmap, looking at the central United States. The white box surrounds an anomaly in South Dakota.
Some of the brightest areas correspond to major population centers: Chicago and Minneapolis on the right, Denver and Salt Lake City in the middle, and West Coast cities on the left. Near the center, though, is an anomalous shape: a bright spot where no big city exists. There is a star shape around the bright spot, and an arc of bright colors nearby. The spot is in a sparsely populated bit of South Dakota; there was no obvious reason to the team why users might zoom in there.
That point is, however, very close to the center of a map of the continental US. In fact, the team learned that the center of the star corresponds to the center of the default placement of the map in many browsers. The bright spot with the star most likely corresponds to users sliding around after inadvertently zooming in, trying to figure out where they’ve landed; the arc seems to correspond to variations in monitor proportions.
As a result of this usability challenge, many mapping tools, including Bing Maps (the successor product to Virtual Earth), no longer offer a zoom slider, which keeps users from accidentally zooming all the way in on a single click.
A second screen capture, shown in Figure 1-2, reveals a bright spot off the coast of Ghana. This spot exhibits the same star pattern created by users scrolling around to try to figure out what part of the map they are viewing. This spot is likely only bright because it is at 0 degrees latitude, 0 degrees longitude, a point that GIS tools run into often. While computers might find (0, 0) appealing, it is unlikely that there is much there for the typical Virtual Earth user to find interesting.¹
Figure 1-2. Hotmap, looking at the map origin (0, 0).
This second bright spot inspired a hunt for bugs. The team rapidly learned that Virtual Earth’s search facility would sometimes fail, and instead of returning an error message, typos and erroneous searches would sometimes redirect the user to (0, 0). Interestingly, the bug had been on the backlog for some time because the team had decided that it was not likely to surface often. Seeing this image made it clear that some users really were being confused by the error, so the team prioritized the bug.
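The general failure pattern described here, a failed lookup silently turning into the coordinate (0, 0), can be sketched in a few lines of Python. This is purely illustrative: the Virtual Earth search code is not shown in the text, and the names `KNOWN_PLACES`, `search`, and `center_map` are hypothetical. The sketch shows one common way to avoid the problem, which is to represent a failed search explicitly rather than with a default coordinate.

```python
from typing import Optional, Tuple

Coordinate = Tuple[float, float]

# Hypothetical gazetteer standing in for a real search index.
KNOWN_PLACES = {
    "seattle": (47.61, -122.33),
    "chicago": (41.88, -87.63),
}

def search(query: str) -> Optional[Coordinate]:
    """Return coordinates for a query, or None when nothing matches."""
    return KNOWN_PLACES.get(query.strip().lower())

def center_map(query: str, current: Coordinate) -> Tuple[Coordinate, Optional[str]]:
    """Return the new map center and an optional error message.

    On a failed search the current view is kept, so a typo never
    sends the user to latitude 0, longitude 0.
    """
    result = search(query)
    if result is None:
        return current, f"No results for {query!r}"
    return result, None

print(center_map("Seattle", (0.0, 0.0)))
print(center_map("Seatle", (41.88, -87.63)))
```

The design choice is to make "no result" a distinct value (`None`) that callers must handle, instead of a valid-looking coordinate that can leak into the map view and, eventually, into the usage logs.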
Although the Virtual Earth team started out using the Hotmap visualization expecting to find out about how users interacted with maps, they gleaned much more than just a characterization of usage patterns. As with many (dare we say most?) new visualizations, the most interesting insights were those that the viewers did not anticipate finding.²
Where Visualization Is Useful
Is visualization the silver bullet to help us make sense of data? Not always. There are two questions to consider to help you decide if your data analysis problem is a good candidate for a visualization solution.
First, could the analysis tasks be supported with an algorithm? A crisp task such as “I want to know the total number of users who looked at Seattle” suggests that an algorithm, statistical test, or even a table of numbers might be the best way to answer the question. On the other hand, “How do users explore the map?” is much fuzzier. Fuzzy tasks are great candidates for a visualization solution because they require you to look at the data from different angles and perspectives, and to be able to make decisions and inferences based on your own knowledge and understanding.
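To illustrate what makes a task "crisp," the Seattle question can be answered with a few lines of code and no visualization at all. The sketch below assumes a hypothetical view log of (user, city) pairs; the real telemetry format is not described in the text.

```python
# Hypothetical view log: one (user_id, city) pair per map view.
view_log = [
    ("u1", "Seattle"),
    ("u2", "Denver"),
    ("u1", "Seattle"),   # repeat view by the same user
    ("u3", "Seattle"),
    ("u2", "Chicago"),
]

def count_viewers(log, city):
    """Count distinct users who viewed the given city at least once."""
    return len({user for user, viewed in log if viewed == city})

seattle_viewers = count_viewers(view_log, "Seattle")
print(seattle_viewers)
```

No comparably short function answers "How do users explore the map?", because nothing in the log directly records an exploration style. That gap is what the fuzzier question leaves for a person, aided by a visualization, to fill.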
The second question to consider is “Is all the necessary information contained in the dataset?” If answering the question depends on information that is not in the dataset, so that an expert must interpret the data that is there, then visualization is a great solution. Going back to our fuzzy question about exploring a map, we can imagine that it is unlikely that there will be an explicit attribute in the data that classifies a user’s exploration style. Instead, answering this question requires someone to interpret other aspects of the data, bringing knowledge to bear about which aspects imply an exploration style. Again, visualization enables this sort of flexible and user-centric analysis.
For all but the crispest questions about explicitly measured phenomena, visualization is probably a good tool to throw at a problem. In our experience, we have almost never come up against a problem that cannot benefit from some amount of visualization.
Further Reading
The Hotmap project is discussed in:
Fisher, Danyel. “Hotmap: Looking at Geographic Attention.” IEEE Transactions on Visualization and Computer Graphics 13 (2007): 1184–1191.
Fisher, Danyel. “The Impact of Hotmap.” The Infovis 2009 Discovery Exhibition. Redmond, WA: Microsoft, 2009.
¹ So many datasets have references to (0, 0) that GIS practitioners refer to that location as “null island.”
² See “Further Reading” for other stories of how Hotmap has been used.
Chapter 2. From Questions to Tasks
All visualization begins with a question about data. An analyst wants to know something about a phenomenon in the world, or wants to share their knowledge about it with someone else. They believe the phenomenon they wish to examine is represented somehow in the data.
The challenge in this process is that the question the analyst wishes to address can seem far from the data. The analyst might be working on a broad goal: say, “Are high-salary employees more productive than less well-paid ones?” This leads to a process of making the question measurable. What does the analyst mean by high-salary, and productive? What visualization or set of visualizations would demonstrate the relationship between these variables?
The process of breaking down these questions into something that can actually be computed from the data is iterative, exploratory, and sometimes surprising. This chapter describes how to refine high-level questions into specific, data-driven tasks. The outcome of that process is a set of concise design requirements for a visualization tool that supports finding answers to those questions.
The general concept of refining questions into tasks appears across all of the sciences. In many fields, the process is called operationalization, and refers to the process of reducing a complex set of factors to a single metric. The field of visualization takes on that goal more broadly: rather than attempting to identify a single metric, the analyst instead tries to look more holistically across the data to get a usable, actionable answer. Arriving at that answer might involve exploring multiple attributes, and using a number of views that allow the ideas to come together. Thus, operationalization in the context of visualization is the process of identifying tasks to be performed over the dataset that are a reasonable approximation of the high-level question of interest.
A visualization is not the inevitable outcome of operationalization. Exploring the data might show that the goal is best achieved with a statistical analysis or with machine learning. Similarly, the outcome of the process might show that a cluster analysis across multiple attributes is more useful than a plot. Even so, we find that more often than not, visualization is a vital component of getting to a successful operationalization.
This chapter emphasizes the data aspects of this process. The next chapter moves to the human side of the process: how to get the information necessary to effectively operationalize the high-level questions. Later chapters then look at how to translate the operationalized questions into specific visualizations.
Example: Identifying Good Movie Directors
To guide the process through operationalization, this chapter examines an exemplar question: “Who are the best movie directors?”
Nonspecific questions like this are how many data explorations start. Answering a question like this requires a much more specific task that can be precisely addressed with a dataset. Before we can be more specific, we first need to take a step back: who needs to know the answer to this question? The use case might be a film student trying to assert that his dissertation is about one of the most influential directors, or a hiring manager looking to hire a director for an upcoming project, or a journalist putting together a splashy article that will feature a top list.
Each of these users’ needs suggests a different interpretation of the notion of best director. The film student is looking for a way to quantify and defend a notion of influence, whereas the hiring manager might want to limit themselves to people working today who are less accomplished and thus more affordable. For this example, though, the user will be a journalist who is putting together an article about a new movie and wants to include a list of the best directors.
The goal of operationalization is to refine and clarify the question until the analyst can forge an explicit link between the data that they can find and the questions they would like to answer. For this example, the dataset at hand contains a list of movies rated by the film-aficionado community. Each movie is associated with a director, a number of raters, and an average rating score.
A NOTE ON THE DATA
The dataset used for this chapter is composed of two of IMDB’s downloadable lists, directors.list and ratings.list. A copy of the Jupyter notebook that parses them into cleaner CSVs is available on the book’s companion website. The script cleans the data to remove entries that the database refers to as not being movies, such as video games and TV shows. The analysis and visualizations in this chapter are carried out in Python and recorded in a second Jupyter notebook available at the same site.
With both data and a high-level question in hand, the visualization work can begin. Data alone is not enough to dictate a set of design requirements for constructing a visualization. What is missing here is a translation of the high-level question “Who are the best movie directors?” into a set of concrete tasks over the data.
The choice of dataset and operationalization represents a specific perspective on a problem; both stand in for what the analyst wishes to understand. In this example, there are other ways to frame the inquiry and other types of data that could be collected. This is a large part of why visualization is so important for answering questions like these: it allows an analyst’s experience and knowledge to layer directly on top of the data that is ultimately shown. The analyst’s skills and experience allow them to make inferences about the more abstract questions they are really interested in.
Making a Question Concrete
The process of operationalization winds its way from a general goal or a broad question to specific tasks, and then to visualizations that support those specific tasks based on concrete data.
To achieve this, the analyst searches for proxies. Proxies are partial and imperfect representations of the abstract thing that the analyst is really interested in. For example, high movie ratings may be a reasonable proxy for best in our movie example. Selecting and interpreting proxies requires judgment and expertise to assess how well, and with what sorts of limitations, they represent the abstract concept.
In operationalization, there are two important types of proxies:
A proxy task is a lower-level task that stands in for the original. The result of a proxy task reflects on the answer to the original question, but the proxy task itself is more closely related to the data; it can be accomplished with quantitative tools, such as a visualization or a statistical analysis.
A proxy value is an attribute in the data that stands in for a more abstract concept. This can be an existing attribute, or it can be derived from the data.
Operationalizing a question often results in more questions, which require further articulation of proxies. One step in this process is to find places where a question is underspecified or does not directly reference the data on hand, in order to identify where proxies are necessary.
Collaboration with stakeholders crucially informs the process of operationalizing questions. It helps to learn what data is available and how the results will be used. Interviews help to identify the questions and goals of the stakeholders with respect to the data and to further understand what data is available or can be made available. Throughout the process, an analyst translates questions and goals into a description of the problem that is amenable to a data solution. Interview techniques and prototyping are discussed in more detail in Chapter 3.
In this book we advocate an approach of systematic operationalization in order to bolster explicit acknowledgment, validation, and support of the range of possible proxy decisions for a question. This systematic approach leaves open future possibilities and provides guidance for making downstream decisions. The start to this process is getting to understand the question and what is available in the data, and appreciating the gaps between them.
This chapter both describes and illustrates the operationalization process. It uses the movie director example to show how to refine a question into detailed, specific tasks. It discusses the four components that we use to describe an operationalized task.
A Concrete Movie Question
The example started with the high-level question “Who are the best directors?” The dataset is a list of directors and a list of movies. The first task is to operationalize best director. As a rough definition, a good director has directed many good movies. But many good movies is also ill-defined, and thus a proxy for good movie might in turn be based on its rating on IMDB.
These decisions replaced one bit of ambiguity with three more. How many of these best directors need to appear in the results? What counts as good IMDB ratings, and what are many of them? For that matter, a quick glance at the IMDB data reveals that there are short films, TV episodes, video games, and so on, so what counts as a movie?
It is possible to choose the measures arbitrarily: “More than five movies with IMDB ratings greater than 9.8,” or “average movie rating higher than 8.2,” or “no movies with a rating less than 5.” While it is not uncommon to make these sorts of decisions based on rough knowledge of the data, or even based on choosing nice, round numbers, looking at the actual data is important. The top-rated items on IMDB might turn out to have very high ratings but only one review. Great directors might direct a few stinkers, so just looking at the average rating might turn out to be a poor choice. The only way to learn what the data says is to start digging into it.
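The three arbitrary rules above can be written down directly, which makes it easy to see that they select different directors. The following is a minimal sketch in Python; the directors and scores are invented for illustration and are not taken from the IMDB data, and the count and score thresholds in the first rule are scaled down (two movies above 8.5 rather than five above 9.8) so that a six-row toy dataset can exercise it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (director, score) pairs; not real IMDB data.
movies = [
    ("Director A", 9.0), ("Director A", 8.8), ("Director A", 4.5),
    ("Director B", 8.3), ("Director B", 8.4),
    ("Director C", 9.9),
]

# Aggregate each director's movie scores into one list.
scores_by_director = defaultdict(list)
for director, score in movies:
    scores_by_director[director].append(score)

def many_high_rated(scores, min_count=2, threshold=8.5):
    """'Many good movies': at least min_count scores above threshold."""
    return sum(s > threshold for s in scores) >= min_count

def high_average(scores, threshold=8.2):
    """'Average movie rating higher than 8.2'."""
    return mean(scores) > threshold

def no_stinkers(scores, floor=5.0):
    """'No movies with a rating less than 5'."""
    return min(scores) >= floor

rules = {
    "many_high_rated": many_high_rated,
    "high_average": high_average,
    "no_stinkers": no_stinkers,
}

selected = {
    name: sorted(d for d, scores in scores_by_director.items() if rule(scores))
    for name, rule in rules.items()
}

for name, directors in selected.items():
    print(f"{name}: {directors}")
```

In this toy data, the director with two highly rated films and one stinker passes only the first rule, while a director with a single very highly rated film passes the other two. Each rule is a different proxy for "best," and the disagreement between them is exactly why the choice deserves a look at the real data.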
Choosing a proxy allows the analyst to sanity-check their decisions; it can be valuable to do this iteratively at each step, checking both the quality of the data and of the proxy.
A quick glance at the first five data items in the dataset reveals non-mainstream movies (Table 2-1). The alphabetically first movie in the dataset is called #1, with a total of 12 raters; the second is the similarly obscure #1 Serial Killer. Since the scenario targets a general audience, it should probably focus on movies that most people are likely to know. A different scenario could suggest very different proxies.
Table 2-1. A quick glance at the first data items in the movie dataset (which is sorted alphabetically) reveals that there could be movies with positive ratings that have very few raters, implying an obscure (but decent) movie.
ID  Raters  Score  Title                                              Director
0   12      6.4    #1 (2005)                                          Breen, James (V)
1   35      6.0    #1 Serial Killer (2013)                            Yung, Stanley (I)
2   5       5.8    #137 (2011)                                        Elliott, Frances
3   11      7.4    #140 Characters: A Documentary About Twitter (2…   Beasley, Bryan (I)
4   23      6.7    #30 (2013)                                         Wilde, Timothy
…   …       …      …                                                  …
The decision to stick with mainstream movies suggests a need for a proxy for popularity. One choice could be the number of ratings for each movie. By plotting the distribution of the number of ratings by movies (Figure 2-1), we see that the vast majority of movies in the dataset actually have very few ratings.
Figure 2-1. Distribution of ratings. This histogram shows the count of films by number of ratings. Almost all the films have few ratings, with a very long tail.
This first plot shows that the number of ratings is heavily skewed. One way to make this distribution more interpretable is to plot it on a logarithmic scale. In Figure 2-2, the data has been bucketed; a film with 1,000 ratings now appears in the bucket for $\log_{10}(1000) = 3$. Taking the log of the number of ratings smooths the distribution, more effectively showing its shape.
Figure 2-2. Distribution of the logarithm (base 10) of the number of ratings. The peak is under 2: most films have under 100 ratings.
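The bucketing behind a plot like Figure 2-2 can be sketched with the Python standard library. This is not the code from the book's notebook; the helper `log_bucket` and the list of rater counts are assumptions made for illustration (a few of the counts echo Table 2-1, the rest are invented).

```python
import math
from collections import Counter

# Hypothetical numbers of ratings per film.
rater_counts = [5, 11, 12, 23, 35, 132, 1_000, 45_000]

def log_bucket(n: int) -> int:
    """Return the integer log10 bucket for a positive rating count."""
    if n < 1:
        raise ValueError("rating counts must be at least 1")
    return math.floor(math.log10(n))

histogram = Counter(log_bucket(n) for n in rater_counts)

# Print a simple text histogram, one row per bucket.
for bucket in sorted(histogram):
    low, high = 10 ** bucket, 10 ** (bucket + 1) - 1
    print(f"bucket {bucket} ({low}-{high} ratings): {'#' * histogram[bucket]}")
```

Each bucket spans a factor of ten, so a film with 1,000 ratings falls in bucket 3 and films with 10 to 99 ratings share bucket 1. Equal-width buckets on the raw counts would instead place nearly every film in the first bar.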
We can also compute some basic summarizing statistics about the number of ratings: the median movie in the dataset has just 26 ratings, while the 75th percentile is at 132 ratings. By looking up the number of ratings for a sample of blockbusters, we note that movies that anyone can name offhand have tens of thousands of ratings. These are useful observations; perhaps it would be valuable to trim to a slimmer set of movies to ensure that most are ones that a reasonable number of people have seen.
We want to choose a number, though, that’s fair to good movies, even if they are not very popular. In this case, we pick, somewhat arbitrarily, the most-rated 25% of movies. This amounts to around 70,000 films with more than 132 ratings.
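The summary statistics and the "most-rated 25%" filter can be sketched as follows, using only the standard library. The film titles and rater counts are hypothetical, so the cutoff computed here differs from the 132 ratings reported for the real dataset; the choice of the `inclusive` quantile method is also an assumption, since the text does not say how the percentile was computed.

```python
from statistics import median, quantiles

# Hypothetical (title, number_of_raters) pairs.
films = [
    ("Film A", 5), ("Film B", 11), ("Film C", 12), ("Film D", 23),
    ("Film E", 35), ("Film F", 132), ("Film G", 1_000), ("Film H", 45_000),
]

rater_counts = [raters for _, raters in films]

median_raters = median(rater_counts)

# quantiles(..., n=4) returns the three quartile cut points;
# index 2 is the 75th percentile.
cutoff = quantiles(rater_counts, n=4, method="inclusive")[2]

# Keep only films with more raters than the 75th-percentile cutoff.
popular = [title for title, raters in films if raters > cutoff]

print("median:", median_raters)
print("75th percentile:", cutoff)
print("most-rated quarter:", popular)
```

Filtering on a percentile rather than a fixed count keeps the rule meaningful if the dataset grows or is re-downloaded, although the percentile itself remains a judgment call, as the text notes.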
We next pivot and look at the distribution of ratings for the slimmed-down set of movies, shown in Figure 2-3. This distribution shows a distinct curve with a clear peak and noticeable drop-off: ratings above 7.5 seem different from lower ratings. (This distribution has a median score of 6.6, and a 75th percentile of 7.4.)
1
Figure2-3.Distributionofscore.Thishistogramshowsthecountofratings,bybucket.Almostallratingsareextremelylow,withaverygradualtail.
Stepping back from our dive into the data, we can observe that we have proceeded some distance along the operationalization. We have defined a good director and decided that it is based on their movies; we have focused on movies and chosen a set that are popular enough to be part of the analysis. But there are still unanswered questions: How will we rank directors against each other? What makes for a "best" director?
A systematic approach to operationalization allows an analyst to see the full range of decisions and helps in pulling together the set of proxies that can inform a final answer. Ultimately, an interactive visualization tool can enable exploration of multiple proxies to allow for a set of justified, and validated, answers. For our running example, we will continue with the operationalization after describing a framework for making decisions explicit throughout the process.
Breaking Down a Task
Throughout the operationalization, we need to identify where in a question or task there is a need for a more refined proxy. Doing so systematically can make it easier to validate those decisions, as well as to produce a roadmap of the process. This allows the analyst to effectively revisit decisions once a better understanding of the problem is gained.
An analyst can refine a task by first breaking it down into four specific components. Identifying these components and how they do or do not directly reference the data becomes a template for choosing more specific tasks. The components are:
Objects
Things or events that exist in the world: in our example, a director and a movie are both objects. In other contexts, objects might be a user or a sale of a single item. When a task is specific enough, each object will be something that can be represented in, or computed from, the data. Fairly often, when the task is at its most specific, an object will correspond to a single row in a database.
Measures
The outcome variables that will be measured for the objects. Quality of a director, happiness of a user, and sales of a store are all measures. In a sufficiently specific task, the measure is either an existing attribute in the dataset or one that can be directly computed from the data. A measure is sometimes aggregated across many items of data. In our example, a number of movies are aggregated together to get a score for a single director.
Groupings (or partitions)
Attributes or characteristics of the data that separate the data items into groups. For example, groupings might include store region (western versus eastern), start date of players, whether users have purchased an upgrade, or sales by year. In a specific task, partitions are attributes of the objects or can be calculated directly from those attributes. When the visualization is created, partitions will often manifest as groupings, separations across charts, or filters.
Actions
Words that articulate the specific thing being done with the data, such as compare, identify, characterize, etc. Actions guide the process of choosing appropriate visualizations.
The action is useful for identifying the other components. Take this task: Compare the amount of money spent in-game by players who play more hours versus those who play fewer hours. The action is compare. What is compared? The players (the object). What is it about players that we want to compare? The money spent (the measure). Finally, there is a specific partition on the objects. They will be broken into two groups: those that play many hours and those that play few hours.
These components are at the heart of the following iterative process:
1. Refine the question into one or more tasks that, individually or together, address the general question.
2. For each task:
a. Identify the components of the task.
b. Look for ambiguous components: namely, components that are not directly addressable by the dataset.
c. For each ambiguous component, define a proxy by creating a new question that addresses the component, and return to step 1 with those questions.
d. If there are no ambiguous components, then the task is deemed actionable, and thus can be addressed with a visualization or other computational technique.
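One way to keep track of this loop is to record each task's components along with whether each is addressable by the data or still needs a proxy. The sketch below is an illustrative data structure, not part of the book's method: the class names `Component` and `Task`, the `in_data` flag, and the helper `open_questions` are all assumptions, and the judgments about which components are "in the data" are hand-entered guesses for the movie example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Component:
    description: str
    in_data: bool = False            # readable or computable from the dataset?
    proxy: Optional["Task"] = None   # subtask that defines a proxy (step 2c)

    def resolved(self) -> bool:
        return self.in_data or (self.proxy is not None and self.proxy.actionable())

@dataclass
class Task:
    name: str
    action: str
    obj: Component
    measure: Component
    grouping: Component

    def components(self) -> List[Component]:
        return [self.obj, self.measure, self.grouping]

    def actionable(self) -> bool:
        # Step 2d: actionable once no ambiguous components remain.
        return all(c.resolved() for c in self.components())

def open_questions(task: Task) -> List[Tuple[str, str]]:
    """List (task name, component) pairs that still need a proxy."""
    found = []
    for c in task.components():
        if c.in_data:
            continue
        if c.proxy is None:
            found.append((task.name, c.description))
        else:
            found.extend(open_questions(c.proxy))
    return found

filter_unpopular = Task(
    "Filter out movies with significantly fewer ratings", "Filter",
    Component("Movie", in_data=True),
    Component("Number of ratings", in_data=True),
    Component("Most popular versus least popular", in_data=True))
good_movie = Task(
    'Quantify "a good movie"', "Quantify",
    Component("Movie", in_data=True),
    Component("Goodness"),               # still ambiguous
    Component("None", in_data=True))
rank_directors = Task(
    "Rank order of directors", "Rank order",
    Component("Director", in_data=True),
    Component("Number of good movies", proxy=good_movie),
    Component("Filter out non-movies and unpopular movies", proxy=filter_unpopular))

print(rank_directors.actionable())
print(open_questions(rank_directors))
```

Walking the tree surfaces the one component that still lacks a proxy (the goodness of a movie), which is the next question the analyst would need to refine.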
Next, we'll explicate some of the questions from the movie example to illustrate how the components work in practice, beginning with Example 2-1.
Example 2-1. Breaking down the task to find good directors
Task: Identify the top directors who have directed many good, popular movies
Action: Identify
Object: Director
Measure: Number of good, popular movies
Grouping: Filter out non-movies
Identifying top directors implies that there's a meaningful sort on the directors so that the top can be found (Example 2-2). Thus, we can further refine the action to specify an ordering. Also, our first look at the data showed that many movies are unpopular, which implied a grouping to filter out unpopular movies.
Example 2-2. Refined task for good directors
Task: Rank order of directors by those who have directed many good, popular movies
Action: Rank order
Object: Director
Measure: Number of good movies
Grouping: Filter out non-movies and unpopular movies
Filtering out unpopular movies is a subtask (Example 2-3), which we addressed with a histogram of the number of ratings for movies. The visualization of the distribution allowed us to determine a good cut point for popular versus unpopular: namely, popular movies were the 25% of movies with the highest number of ratings.
Example 2-3. Subtask for filtering unpopular movies
Task: Filter out movies with significantly fewer ratings
Action: Filter
Object: Movie
Measure: Number of ratings
Grouping: Separate into most popular and least popular movies
This subtask can be brought back into Example 2-2 as a proxy for unpopular movies.
However, we still have some work to do on Example 2-2: the measure number of good movies is ill-defined with respect to the data. We need to refine this component by developing a proxy for a good movie. Once we do that, we can then examine what it means to have directed many of them.
These proxies require further elaboration. How many high-scoring movies are required from directors? Do low-scoring movies count against them? This process of identifying reasonable proxies is often iterative. For example, in exploring and validating a proxy with the data, it might become obvious that filtering by the number of ratings was a mistaken approach.
At this point, we can recognize that we need a proxy measure for good (Example 2-4). There are a variety of proxies that we can try here, with various visualizations. The process continues onward.
Example 2-4. Subtask for good movies
Task: Quantify "a good movie"
Action: Quantify
Object: Movie
Measure: Goodness
Grouping: None
Breaking down a task into components helps in guiding refinement of a task into one that can be addressed with the data. The most direct way to do so is to consider the question "Are the object, measure, and grouping each directly described in the data?" For each of these three components, is it clear which aspects of the data are important or how to derive what we need from the data? If not, repeat the process of formulating a subquestion in order to derive a more specific answer.
Let's take a look at a very different example, this time from a gameplay metrics scenario (Example 2-5).
Example 2-5. Exemplar task for analyzing a game
Task: Compare the amount of money spent in-game by players who play more hours versus those who play fewer hours.
Action: Compare
Object: Players
Measure: Money spent
Grouping: Players who play many hours; players who play few hours
In Example 2-5, the partition divides between many and few hours. This component needs to be refined further, which leads to a new question: "In the game, how many is 'many' hours for a player?" The analyst might take a series of steps. They might look at the distribution of hours played, or they might choose to filter out players who have played zero hours or those who haven't made it past the tutorial, or they might look at other metrics that are important to the game. These steps would help the analyst figure out good proxies for many and few hours.
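Once a proxy for "many" hours has been chosen, the comparison in Example 2-5 becomes a small computation. The sketch below is an assumption-laden illustration: the record layout (`hours`, `spend`), the function name `compare_spend`, the use of the mean as the summary, the cutoff value, and the sample players are all invented for the example.

```python
import statistics

def compare_spend(players, cutoff_hours):
    """Split players at cutoff_hours and return the mean spend of each group."""
    active = [p for p in players if p["hours"] > 0]  # drop zero-hour players
    many = [p["spend"] for p in active if p["hours"] >= cutoff_hours]
    few = [p["spend"] for p in active if p["hours"] < cutoff_hours]
    return {
        "many": statistics.mean(many) if many else None,
        "few": statistics.mean(few) if few else None,
    }

# Invented records; a real analysis would load these from game telemetry.
players = [
    {"hours": 0, "spend": 0.0},
    {"hours": 1, "spend": 0.0},
    {"hours": 2, "spend": 5.0},
    {"hours": 8, "spend": 20.0},
    {"hours": 40, "spend": 60.0},
]
result = compare_spend(players, cutoff_hours=6)
print(result)
```

Keeping the cutoff as a parameter makes it easy to rerun the comparison as the proxy for "many" hours is revised.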
When Tasks Lead to New Questions
There are four broad categories of new lines of inquiry that can emerge from refining a question. First, as in the movie example, the refinement process often reveals that a new analysis is needed to answer these questions.
Second, operationalizing can also lead in new directions. In the process of exploring who the best directors are, the analyst might notice that some directors stick to a single genre; they might decide that this analysis would be interesting divided across multiple genres. They might also notice that both IMDB and Rotten Tomatoes have scores on movies, and want to see how these results vary based on Rotten Tomatoes scores instead of IMDB.
Third, the data itself can lead to new questions too. In exploratory data analysis (EDA), for example, the data analyst discovers new questions based on the data. The process of looking at the data to address some of these questions generates incidental visualizations that reveal odd patterns, outliers, or surprising correlations worth looking into further.
Finally, doing some analysis often leads to doing a round of data cleaning. While data cleaning is largely out of the scope of this book, odd outliers and surprising trends are, as often as not, the result of dirty data.
Returning to the Example: Exploring Different Definitions
There are several different possible definitions of best director.
Here is one: the best director has the most movies with more than 132 ratings. Table 2-2 shows the top scorers. The most prolific directors in our dataset are Chuck Jones and Fritz Freleng (who directed classic Looney Tunes animations), William Hanna (who directed Tom and Jerry and other classic Hanna-Barbera cartoons), and Georges Méliès (an early inventor of special effects and shorts).
Table 2-2. Top five directors by number of films over threshold
Director            Avg. raters  Avg. score  Count  Total raters
Jones, Chuck (I)    719          7.4         148    106,397
Freleng, Fritz      402          7.2         141    56,730
Hanna, William (I)  591          7.5         119    70,315
Méliès, Georges     717          6.1         114    81,769
White, Jules (I)    235          7.1         102    23,969
Georges Méliès has 526 films on his IMDB page; only 114, however, made it over the threshold of raters. The huge number of films is explained by the fact that the films are shorts, a format more familiar on television now but once also shown in theaters. This is an opportunity to do more data cleaning: joining in another table that tells us whether a film is a short or not, and filtering those out. IMDB has a film duration data table; in a typical analysis process, the next step would be to merge in this table, adding a new proxy for what makes for a short film.
We might explore other definitions of best directors. For example, the best directors might make the movies that people want to rate the most. Table 2-3 is a list of the directors whose movies have, in total, the most ratings.
Table 2-3. Top five directors by total number of ratings across all movies
Director                Avg. raters  Avg. score  Count  Total raters
Spielberg, Steven       245,717      7.2         36     8,845,795
Nolan, Christopher (I)  778,737      8.2         11     8,566,104
Tarantino, Quentin      526,689      7.8         13     6,846,955
Jackson, Peter (I)      371,219      7.6         16     5,939,505
Scorsese, Martin (I)    144,823      7.5         41     5,937,725
This list makes sense. These are very famous names who have directed very familiar movies.
Different proxies yield different results. Ordering by the average score for all movies by a single director might be one way to find the very best directors. As seen in Table 2-4, the first on this list is a director who has only one movie over the threshold: a Mongolian movie from 2016 with 624 raters and an average score of 9.7. This measure of popularity returns a very different set of results than the previous measure: ten thousand times more people rated Quentin Tarantino's movies than Uranchimeg Urtnasan's work.
Table 2-4. Top five directors by average score
Director                 Avg. raters  Avg. score  Count  Total raters
Urtnasan, Uranchimeg     624          9.7         1      624
Miller, George (XXXVII)  394          9.6         2      787
Chowdhury, Amitabh Reza  14,628       9.6         1      14,628
Biebert, Aaron           1,204        9.6         1      1,204
Arsyn, Ken               619          9.5         6      3,712
But can the quality of a director be measured based on just one or two movies? Each step of data exploration leads to another step of refining the question. Is it more important to have many raters, a high average score, or a high minimum score?
The choice of metrics leads to very different outcomes. A slight tweak determines whether you find directors of animated cartoons, blockbuster directors, or a very diverse set of international directors.
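The effect of swapping proxies can be reproduced on a toy dataset. The sketch below aggregates per-movie rows into per-director summaries and ranks them three ways, mirroring the columns of Tables 2-2 through 2-4. The function names `summarize` and `top`, the row layout, and the directors and numbers are invented for illustration and do not come from the real dataset.

```python
from collections import defaultdict

def summarize(movies):
    """Aggregate (director, raters, score) rows into per-director summaries."""
    by_director = defaultdict(list)
    for director, raters, score in movies:
        by_director[director].append((raters, score))
    summary = {}
    for director, rows in by_director.items():
        summary[director] = {
            "count": len(rows),
            "total_raters": sum(r for r, _ in rows),
            "avg_score": sum(s for _, s in rows) / len(rows),
        }
    return summary

def top(summary, key, n=1):
    """Return the n directors with the highest value for the given proxy."""
    return sorted(summary, key=lambda d: summary[d][key], reverse=True)[:n]

# Invented rows: a prolific director, a blockbuster director, a one-film director.
movies = [
    ("Prolific", 200, 7.0), ("Prolific", 300, 7.5), ("Prolific", 250, 7.2),
    ("Blockbuster", 500000, 8.0), ("Blockbuster", 400000, 7.8),
    ("OneFilm", 624, 9.7),
]
summary = summarize(movies)
for proxy in ("count", "total_raters", "avg_score"):
    print(proxy, top(summary, proxy))
```

Each proxy puts a different director first, which is the behavior the three tables show on the real data.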
How Specific Does the Process Get?
This process of refinement leads to a scary scenario. In Disney's Fantasia, in the Sorcerer's Apprentice sequence, Mickey Mouse attempts to stop an enchanted broom by chopping it in half and instead produces two half-size enchanted brooms. Will our analysis subtasks forever multiply?
The operationalization process is an iterative one and the end point is not precisely defined. The answer to the question of how far to go is, simply, far enough. The process is done when the task is directly actionable, using the data at hand. The analyst knows how to describe the objects, measures, and groupings in terms of the data: where to find them, how to compute them, and how to aggregate them. At this point, they know what the question will look like and they know what they can do to get the answer.
An actionable task means that it is possible to act on its result. That action might be to present a useful result to a decision maker or to proceed to a next step in a different result. An answer is actionable when it no longer needs further work to make sense of it.
Low-level objects are ready to be interpreted from the data. Sometimes they can be read directly off the data table, but more often it is more indirect; the analyst may need to carry out transformations on the data, whether mathematical transformations or database joins. For instance, in the movie example, the object is the director; the proxy for the director is the result of aggregating multiple movies together. Partitions and measures at the lowest level will resolve to concrete manipulations of the objects.
The process ends when all the tools needed to answer a question are in place, whether as a number, a visualization, or even as an interaction across multiple visualizations representing multiple proxies. The analyst might decide that the right cutoff for many hours of gameplay is six hours (a number), or the hours played by the top 10% of players (a formula), or above the logical breakpoint, which might be represented by a distribution. These results get propagated back into any other tasks that depend on them.
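The difference between a cutoff expressed as a number and one expressed as a formula can be made concrete by treating each as a rule that is handed to whatever task depends on it. This is a hedged sketch: the function names, the strict greater-than comparison, and the sample hours are assumptions made for illustration.

```python
import statistics

def fixed_cutoff(hours_played, value=6):
    """A number: the same threshold regardless of the data."""
    return value

def top_fraction_cutoff(hours_played, fraction=0.10):
    """A formula: the hours above which a player is in the top `fraction`."""
    n = round(1 / fraction)  # 0.10 -> deciles
    return statistics.quantiles(hours_played, n=n, method="inclusive")[-1]

def count_many(hours_played, cutoff_rule):
    """Count players above whatever cutoff the supplied rule produces."""
    cutoff = cutoff_rule(hours_played)
    return sum(1 for h in hours_played if h > cutoff)

# Invented data: 21 players with 1, 2, ..., 21 hours played.
hours_played = list(range(1, 22))
print(count_many(hours_played, fixed_cutoff))
print(count_many(hours_played, top_fraction_cutoff))
```

Passing the rule rather than a precomputed number means the dependent task picks up a new cutoff automatically when the data or the rule changes.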
Making Use of Results
This process of propagating results back into higher-level questions is flexible. Sometimes the low-level question does not have an exact answer but instead resolves in its own visualization or interaction. That visualization might help an analyst in making a decision, but it might also imply parameters on the data. For example, the journalist might realize that there are several possible cutoffs for defining what it means to be a good movie. Rather than simply picking a specific threshold, an analyst might instead propagate a mechanism for dynamically determining cutoffs into higher-level tasks. Seeing a variable propagated like this can be a cue that an interactive visualization, rather than a static image, might be helpful.
Visualization is also important for supporting the operationalization process, even if the end result is not an interactive visualization. In the movie example, visualization helped us to understand the nature and distribution of the data. Visualization can be more prominent with more complex analysis tasks. If the analyst wanted to compare ratings against popularity, it would be difficult to display that on a list; if they wanted to explore hypotheses about how the popularity of directors changes over time, more visual representations would help them explore the data.
Conclusion: A Well-Operationalized Task
A well-operationalized task, relative to the underlying data, fulfills the following criteria:
Can be computed based on the data
Makes specific reference to the attributes of the data
Has a traceable path from the high-level abstract questions to a set of concrete, actionable tasks
A well-operationalized task is a first step toward creating a visualization. Chapter 4 begins to describe the ways in which the objects, measures, and partitions can be shaped into aspects of a visualization. Chapters 5 and 6 construct visualizations based on them.
Written out in detail, this process can seem tedious, but in practice, it is abbreviated and simplified. There are two important uses for this systematization. First, the process of explicitly looking at components can help untangle knotty problems, decomposing places where the analyst has made assumptions about the data. Explaining precisely why the number of IMDB ratings is a proxy for popularity forces the analyst to explore whether it is a good choice and, perhaps, to revise that choice later.
Second, the process helps guide questions and interviews. Chapter 3 explains how to carry out operationalization with domain experts. Recognizing the need to make decisions about proxies helps guide these conversations. Every dataset has subtleties; it can be far too easy to slip down rabbit holes of complications. Being systematic about the operationalization can help focus our conversations with experts, only introducing complications when needed.
Further Reading
The process outlined here is similar (and in many ways parallel) to the Goal, Question, Metric (GQM) process found in the software engineering space. GQM refines from a general goal to a specific metric, usually oriented around process improvement so that the consumer can have a single number that helps them know whether they are succeeding in improving that process.
Our process is more exploratory and often comes earlier in the cycle. A GQM analysis might choose a goal like "improve user retention." In contrast, exploratory operationalization might start with a question like "Do users come back to our site?" with the awareness that the problem is multifaceted and complex, and might require a variety of different metrics to describe. For more on GQM, see:
Basili, Victor, Gianluigi Caldiera, and Dieter Rombach. "The Goal Question Metric Approach." Encyclopedia of Software Engineering. New York: Wiley, 1994.
The data visualization field has spent a great deal of effort trying to understand the tasks that can be accomplished in a visualization. Amar and Stasko, for example, explore a low-level analysis of tasks carried out on a specific visualization. At the other end of the spectrum, Brehmer and Munzner explore high-level tasks for visualization, starting with comparing presentation and exploration:
Amar, Robert and John Stasko. "A Knowledge Task-Based Framework for the Design and Evaluation of Information Visualizations." Proceedings of the IEEE Symposium on Information Visualization (2004): 143–150.
Brehmer, Mathew and Tamara Munzner. "A Multi-Level Typology of Abstract Visualization Tasks." IEEE Transactions on Visualization and Computer Graphics 19 (2013): 2376–2385.
¹ Median and percentile are ways of characterizing a distribution of numbers. If one were to sort the numbers, the 75th percentile would be 75% of the way down in the list. The median would be at the halfway point.