
Making Data Visual: A Practical Guide to Using Visualization for Insight

Danyel Fisher and Miriah Meyer

Making Data Visual by Danyel Fisher and Miriah Meyer

Copyright © 2018 Miriah Meyer, Microsoft. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: [email protected].

Editor: Rachel Roumeliotis
Production Editor: Justin Billing
Copyeditor: Rachel Head
Proofreader: Gillian McGarvey
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
Additional illustrations: Dominik Moritz and Kanit “Ham” Wongsuphasawat

January 2018: First Edition

Revision History for the First Edition

2017-12-19: First Release

See http://bit.ly/making-data-visual for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Making Data Visual, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-92846-2

[LSI]


Preface

Visualization is a vital tool for understanding and sharing insights around data. The right visualization can help express a core idea or open a space to examination; it can get the world talking about a dataset or sharing an insight (Figure P-1).

Figure P-1. Visualizations can take many forms, from views that support exploratory analysis (top left), to those that provide quick overviews in a dashboard (bottom), to an infographic about popular topics (top right).

Visualizations provide a direct and tangible representation of data. They allow people to confirm hypotheses and gain insights. When incorporated into the data analysis process early and often, visualizations can even fundamentally alter the questions that someone is asking.

Creating effective visualizations is hard. Not because a dataset requires an exotic and bespoke visual representation: for many problems, standard statistical charts will suffice. And not because creating a visualization requires coding expertise in an unfamiliar programming language: off-the-shelf tools like Excel, Tableau, and R are ample enough to suffice.

Rather, creating effective visualizations is difficult because the problems that are best addressed by visualization are often complex and ill-formed. The task of figuring out what attributes of a dataset are important is often conflated with figuring out what type of visualization to use. Picking a chart type to represent specific attributes in a dataset is comparatively easy. Deciding on which data attributes will help answer a question, however, is a complex, poorly defined, and user-driven process that can require several rounds of visualization and exploration to resolve. In this book, we focus on the process of going from high-level questions to well-defined data analysis tasks, and on how to incorporate visualizations along the way to clarify understanding and gain insights.

Who Is This Book For?

This book is for people who have access to data and, perhaps, a suite of computational tools but who are less than sure how to turn that data into visual insights. We find that many data science books assume that you can figure out how to visualize the data once collected, and visualization books assume that you already have a well-defined question, ready to be visualized. If, like us, you would like to address these assumptions, then this book is for you.

This book does not cover how to clean and manage data in detail or how to write visualization code. There are already great books on these topics (and, when relevant, we point to some of them). Rather, this book speaks to why those processes are important. Similarly, this book does not address how to choose a beautiful colormap or select a typeface. Instead, we lay out a framework for how to think about data given the possibilities and constraints of visual exploration. Our goal is to show how to effectively use visualizations to make sense of data.

Who Are We?

The authors of this book have a combined three decades of experience in making sense of data through designing and using visualizations. We have worked with data from a broad range of fields: biology and urban transportation, business intelligence and scientific visualization, debugging code and building maps. We have worked with analysts from a variety of organizations, from small, academic science labs to teams of data analysts embedded in large companies. Some of the projects we have worked on have resulted in sophisticated, bespoke visualization systems designed collaboratively with domain specialists, and at other times we have pointed people to off-the-shelf visualization tools after a few conversations. We have taught university classes in visualization and have given lectures and tutorials. All in all, we have visualized hundreds of datasets.

We have found that our knowledge about visualization techniques, solutions, and systems shapes the way that we think and reason about data. Visualization is fundamentally about presenting data in a way that elicits human reasoning, makes room for individual interpretations, and supports exploration. We help our collaborators make their questions and data reflect these values. The process we lay out in this book describes our method for doing this.

Overview of Chapters

Chapter 1 illustrates the process of making sense with visualizations through a quick example, exposing the role that a visual representation can play in data discovery.

Chapter 2 starts to get into details. It discusses a mechanism to help narrow a question from a broad task into something that can be addressed with an iterative visualization process. For example, the broad question “Who are the best movie directors?” does not necessarily suggest a specific visualization, but “Find movie directors who directed top-grossing movies using an IMDB dataset” can lead more directly to an answer by way of a visualization or two. This process creates an operationalized question, one that consists of particular tasks that can be directly addressed with data.

This process of narrowing a question down to actionable tasks requires input from multiple stakeholders. Chapter 3 lays out an iterative set of steps for getting to the operationalization, which we call data counseling. These steps include finding the right people to talk to, asking effective questions, and rapidly exploring the data through increasingly sophisticated prototypes.

The numerical nitty-gritty of the book follows. Chapter 4 discusses types and relations of data, and defines terms like dimensions, measures, categorical, and quantitative. Chapter 5 then organizes common visualization types by the tasks they fulfill and the data they use. Then, Chapter 6 explores powerful visualization techniques that use multiple views and interaction to support analysis of large, complex datasets. These three chapters are meant to provide an overview of some of the most effective and commonly used ideas for supporting sensemaking with visualizations, and are framed using the operationalization and data counseling process to help guide decision-making about which visualizations to choose.

With this understanding of getting to insight (from questions to data to visualizations), the remainder of the book illustrates two examples of carrying out these steps. The case study in Chapter 7 describes the creation of a business intelligence dashboard in collaboration with a team of developers and analysts at Microsoft. The one in Chapter 8 draws from science, presenting an example with a team of scientists who work with biological data. These case studies illustrate the flexibility of the process laid out in this book, as well as the diverse types of outcomes that are possible.

This book is accompanied by a companion website. From this site you can download the code and interactive versions of the visualizations presented in Chapters 5 and 6, as well as other code and supplementary material.

Acknowledgments

Danyel and Miriah would like to thank Danyel’s colleagues at Microsoft, including Steven Drucker, Mary Czerwinski, and Sue Dumais, for their enthusiasm and encouragement. We also thank Miriah’s research group, the Visualization Design Lab at the University of Utah, including Alex Lex, for helping the project to evolve and providing feedback on ideas. We are both deeply appreciative of our work organizations for supporting the time and energy required by projects like this one, and for seeing the value in communicating our research broadly. At O’Reilly Media, we thank Mike Loukides for encouraging us to start this work, and Shannon Cutt and Rachel Roumeliotis for guiding it from start to finish.

Portions of this work were presented at the IPAM Workshop on Culture Analytics of 2016, Microsoft Data Insights Summit of 2016, University of Illinois Urbana-Champaign HCI Seminar Series of 2016, University of British Columbia HCI Seminar Series of 2016, Women in Data Science Conference at Stanford University in 2017, and O’Reilly Velocity Conference in 2017. Our thanks to the organizers of those events, and to participants who gave us critical feedback and helped clarify our thoughts.

Early feedback on the operationalization process came from Christian Canton of Microsoft. Michael Twidale and Andrea Thomer, both of UIUC, helped inform the discussion of data counseling with their insights on how reference librarians do their work.

We are grateful to Dominik Moritz and Kanit “Ham” Wongsuphasawat for putting together the examples used in Chapters 5 and 6. Their work, as well as that of the rest of the Vega-Lite team, is helping shape the future of data visualization. We also thank Alex Bigelow for supplying the skateboarding visualization figure in the Preface.

We thank Jacqueline Richards for her review and discussion of the case study in Chapter 7. Similarly, the collaboration with Angela DePace and her group at the Harvard Medical School for the case study in Chapter 8 provided valuable and rich insights into the process of designing visualizations for domain experts. The projects described in both of these chapters were deeply influential in our work practices.

Our technical reviewers, Michael Freeman, Jeff Heer, and Jerry Overton, helped clarify and strengthen the arguments we make.

Finally, Miriah thanks Brian Price for his endless support and encouragement, without which she could never do the things she does.

O’Reilly Safari

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/making-data-visual.

To comment or ask technical questions about this book, [email protected].

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia


Chapter 1. Getting to an Effective Visualization

Choosing or designing a good visualization is rarely a straightforward process. It is tempting to believe that there is one beautiful visualization that will show all the critical aspects of a dataset. That the right visual representation will reveal hidden insights. That a perfect, simple, and elegant visualization (perhaps just a line chart or a well-chosen scatterplot) will show precisely what the important variable was and how it varied in precisely the way to illustrate a critical lesson.

This is often the impression that we, at least, are left with after reading data science case studies. But in our experience, this does not match the reality of visual data analysis. It takes hard work, and trial and error, to get to an insightful visualization. We start by thinking about what we want to know, and we refine fuzzy questions into actionable, concrete tasks. We clean, reshape, and restructure the data into forms that we can put into a visualization. We work around limitations in the data, and we try to understand what the user wants to learn. We have to consider which visual representations to use and what interaction mechanisms to support. Along the way, we find other variables that tell us more about the dataset and that help clarify our thinking. And no single visualization is ever quite able to show all of the important aspects of our data at once: there just are not enough visual encoding channels.

Designing effective visualizations presents a paradox. On the one hand, visualizations are intended to help users learn about parts of their data that they don’t know about. On the other hand, the more we know about the users’ needs and the context of their data, the better we can design a visualization to serve them. The process described in this book embraces this paradox: it leverages the knowledge users have of their datasets, the context the data lives in, and the ways it was collected (including its likely flaws, challenges, and errors) in order to figure out the aspects of it that matter.

Put another way, this book is about the path from “I have some data…” to “We know this because of these clear, concise, and insightful visualizations.” We believe that creating effective visualizations is itself a process of exploration and discovery. A good visualization design requires a deep understanding of the problem, data, and users.

Getting to Insight

We most often work with other people who have a dataset they are trying to make sense of. The process of designing a visualization usually starts when people walk into our office.

CLIENT: I have some data that I’d like to visualize. How should I draw it?

The client seems to expect us to pull a visualization off the shelf, to sculpt that perfect visualization. We almost always frustrate them by asking what they hope to see.

Q: What is it about the data that you would like to visualize?

CLIENT: I want to see how profitable our stores are.

Q: What in your data indicates a store being profitable?

CLIENT: It means that the store has lots of sales of high-profit items.

Q: How does profit vary by store?

And so on.

By the end of this process, we often find that the clients do not have a visualization problem, but an operationalization one. Their struggles to choose a visualization stem from a lack of clarity about which attributes of the data are most important and how those attributes relate to one another. Once they can describe how the data attributes relate to the question they are trying to answer, finding an appropriate visualization becomes much easier.

We have learned over the years that designing effective visualizations to make sense of data is not an art; it is a systematic and repeatable process. We have systematized this process into what we believe are reproducible and clear steps.

This process tracks our understanding of four components:

Data

What data is available, and what does it mean? What does the data look like, and what are its important aspects? Where did it come from, and why was it originally collected?

Tasks

What needs to happen with the data? What are the low-level questions and tasks that will support high-level goals?

Stakeholders

Who is involved with the data, the problem, and the goals? What can they say about the problem to help design an effective visualization? Who will view the final visualization, and what sorts of things do we expect them to learn from it? What domain knowledge do they bring to the table? What answers would they find satisfying?

Visualization

How does the understanding of data, tasks, and stakeholders come together? What representations of this data will fulfill the tasks for the users?

Regardless of the visualization outcome, this process will almost certainly lead to new discoveries and insights. These discoveries help to inform the operationalization, but they will also likely steer the process down new and unexpected paths. The guidance and framework in this book are meant to help identify opportunities for discovering new knowledge and to make an otherwise messy process a bit more structured.

Hotmap: Making Decisions with Data

As an example of how visualizations can help you to better understand a problem, and help an organization make decisions, we can look back to 2006. Microsoft was rolling out its new mapping tool, Virtual Earth, a zoomable world map. The team behind Virtual Earth had lots of questions about how their users were using this new tool, so they collected usage data.

The usage data was based on traditional telemetry: it had great information on what cities were most viewed, how many viewers were in “street” mode versus “photograph” mode, and even information about viewers’ displays. They instrumented search and navigation, and they collected counts for the number of times that users looked at certain sentinel regions. And because Virtual Earth was built on top of a set of progressively higher-resolution image tiles, the team was also archiving server logs that tracked how often individual tiles were downloaded.

Interviews with team members suggested that they did not have an intuitive notion of how their tool was being used. In conversation, one team member argued that people were likely to look at their own homes; another thought that the overhead photography would mostly be used over mountains. The goals were varied: they included seeing whether the user experience was well balanced across user needs and deciding how and where to invest in future rounds of photography.

We addressed these questions with a visualization tool called Hotmap. Figure 1-1 shows a screen capture from the visualization tool, focusing on the central United States. Hotmap uses a heatmap encoding of the tile access values. This is a visualization technique that uses a colormap to encode the access values at the geospatial locations of the tiles. Colored spots on the map are places where more users have accessed image tiles. The colormap is a logarithmic color scale, so bright spots have many more accesses than dim ones.
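To make the idea of a logarithmic color scale concrete, here is a minimal Python sketch. It is not Hotmap's actual code: the function name `log_intensity`, the tile keys, and the access counts are all hypothetical, chosen only to show how a log scale compresses a wide range of counts into a usable brightness range.

```python
import math

def log_intensity(count, max_count):
    """Map an access count to a brightness in [0, 1] on a log scale.

    Hypothetical helper for illustration; not taken from Hotmap.
    """
    if count <= 0 or max_count <= 0:
        return 0.0
    # log1p(x) = log(1 + x), so a count of zero would map to zero.
    return math.log1p(count) / math.log1p(max_count)

# Made-up access counts keyed by (tile_x, tile_y).
tile_counts = {(0, 0): 1, (0, 1): 10, (1, 0): 100, (1, 1): 10000}
peak = max(tile_counts.values())

# Brightness per tile; the busiest tile gets 1.0.
intensity = {tile: log_intensity(n, peak) for tile, n in tile_counts.items()}
```

On a linear scale the tile with 100 accesses would get 1% of full brightness and be nearly invisible next to the tile with 10,000. On the log scale it lands at roughly half brightness, which is why a moderately bright spot on such a map can still represent far fewer accesses than the brightest one.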

Figure 1-1. Hotmap, looking at the central United States. The white box surrounds an anomaly in South Dakota.

Some of the brightest areas correspond to major population centers: Chicago and Minneapolis on the right, Denver and Salt Lake City in the middle, and West Coast cities on the left. Near the center, though, is an anomalous shape: a bright spot where no big city exists. There is a star shape around the bright spot, and an arc of bright colors nearby. The spot is in a sparsely populated bit of South Dakota; there was no obvious reason to the team why users might zoom in there.

That point is, however, very close to the center of a map of the continental US. In fact, the team learned that the center of the star corresponds to the center of the default placement of the map in many browsers. The bright spot with the star most likely corresponds to users sliding around after inadvertently zooming in, trying to figure out where they’ve landed; the arc seems to correspond to variations in monitor proportions.

As a result of this usability challenge, many mapping tools, including Bing Maps (the successor product to Virtual Earth), no longer offer a zoom slider, which keeps users from accidentally zooming all the way in on a single click.

A second screen capture, shown in Figure 1-2, reveals a bright spot off the coast of Ghana. This spot exhibits the same star pattern created by users scrolling around to try to figure out what part of the map they are viewing. This spot is likely only bright because it is at 0 degrees latitude, 0 degrees longitude, a point that GIS tools run into often. While computers might find (0,0) appealing, it is unlikely that there is much there for the typical Virtual Earth user to find interesting.[1]

Figure 1-2. Hotmap, looking at the map origin (0,0).

This second bright spot inspired a hunt for bugs. The team rapidly learned that Virtual Earth’s search facility would sometimes fail, and instead of returning an error message, typos and erroneous searches would sometimes redirect the user to (0,0). Interestingly, the bug had been on the backlog for some time because the team had decided that it was not likely to surface often. Seeing this image made it clear that some users really were being confused by the error, so the team prioritized the bug.
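The team found this problem by looking at the picture, not by running a check. Once a pattern like this is known, though, a simple filter can flag it in future data. The sketch below is an illustration under assumptions: the function name `split_null_island`, the tolerance, and the sample coordinates are hypothetical and do not come from the Virtual Earth logs.

```python
def split_null_island(points, tolerance=1e-9):
    """Separate (lat, lon) pairs that sit at the map origin from the rest.

    Hypothetical helper; the tolerance is an arbitrary illustrative value.
    """
    suspicious, plausible = [], []
    for lat, lon in points:
        if abs(lat) <= tolerance and abs(lon) <= tolerance:
            # A view centered exactly on (0, 0) is more likely a failed
            # lookup than a deliberate visit.
            suspicious.append((lat, lon))
        else:
            plausible.append((lat, lon))
    return suspicious, plausible

# Made-up view centers: two real cities and two origin hits.
views = [(47.61, -122.33), (0.0, 0.0), (41.88, -87.63), (0.0, 0.0)]
suspicious, plausible = split_null_island(views)
```

A check like this only answers a crisp question that someone already knew to ask. The visualization is what revealed that the question was worth asking.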

Although the Virtual Earth team started out using the Hotmap visualization expecting to find out about how users interacted with maps, they gleaned much more than just a characterization of usage patterns. Like many (dare we say most?) new visualizations, the most interesting insights were those that the viewers were not expecting to find.[2]

Where Visualization Is Useful

Is visualization the silver bullet to help us make sense of data? Not always. There are two questions to consider to help you decide if your data analysis problem is a good candidate for a visualization solution.

First, could the analysis tasks be supported with an algorithm? A crisp task such as “I want to know the total number of users who looked at Seattle” suggests that an algorithm, statistical test, or even a table of numbers might be the best way to answer the question. On the other hand, “How do users explore the map?” is much fuzzier. Fuzzy tasks are great candidates for a visualization solution because they require you to look at the data from different angles and perspectives, and to be able to make decisions and inferences based on your own knowledge and understanding.

The second question to consider is “Is all the necessary information contained in the dataset?” If some information about the problem is not in the dataset, so that an expert is needed to interpret the data that is there, then visualization is a great solution. Going back to our fuzzy question about exploring a map, we can imagine that it is unlikely that there will be an explicit attribute in the data that classifies a user’s exploration style. Instead, answering this question requires someone to interpret other aspects of the data, bringing knowledge to bear about which aspects of the data imply an exploration style. Again, visualization enables this sort of flexible and user-centric analysis.

For all but the crispest questions about explicitly measured phenomena, visualization is probably a good tool to throw at a problem. In our experience, we have almost never come up against a problem that cannot benefit from some amount of visualization.

Further Reading

The Hotmap project is discussed in:

Fisher, Danyel. “Hotmap: Looking at Geographic Attention.” IEEE Transactions on Visualization and Computer Graphics 13 (2007): 1184–1191.

Fisher, Danyel. “The Impact of Hotmap.” The Infovis 2009 Discovery Exhibition. Redmond, WA: Microsoft, 2009.

[1] So many datasets have references to (0,0) that GIS practitioners refer to that location as “null island.”

[2] See “Further Reading” for other stories of how Hotmap has been used.


Chapter 2. From Questions to Tasks

All visualization begins with a question about data. An analyst wants to know something about a phenomenon in the world, or wants to share their knowledge about it with someone else. They believe the phenomenon they wish to examine is represented somehow in the data.

The challenge in this process is that the question the analyst wishes to address can seem far from the data. The analyst might be working on a broad goal: say, “Are high-salary employees more productive than less well-paid ones?” This leads to a process of making the question measurable. What does the analyst mean by high-salary, and productive? What visualization or set of visualizations would demonstrate the relationship between these variables?

The process of breaking down these questions into something that can actually be computed from the data is iterative, exploratory, and sometimes surprising. This chapter describes how to refine high-level questions into specific, data-driven tasks. The outcome of that process is a set of concise design requirements for a visualization tool that supports finding answers to those questions.

The general concept of refining questions into tasks appears across all of the sciences. In many fields, the process is called operationalization, and refers to the process of reducing a complex set of factors to a single metric. The field of visualization takes on that goal more broadly: rather than attempting to identify a single metric, the analyst instead tries to look more holistically across the data to get a usable, actionable answer. Arriving at that answer might involve exploring multiple attributes, and using a number of views that allow the ideas to come together. Thus, operationalization in the context of visualization is the process of identifying tasks to be performed over the dataset that are a reasonable approximation of the high-level question of interest.

A visualization is not the inevitable outcome of operationalization. Exploring the data might show that the goal is best achieved with a statistical analysis or with machine learning. Similarly, the outcome of the process might show that a cluster analysis across multiple attributes is more useful than a plot. We find that more often than not, visualization is a vital component of getting to a successful operationalization.

This chapter emphasizes the data aspects of this process. The next chapter moves to the human side of the process: how to get the information necessary to effectively operationalize the high-level questions. Later chapters then look at how to translate the operationalized questions into specific visualizations.

Example: Identifying Good Movie Directors

To guide the process through operationalization, this chapter examines an exemplar question: “Who are the best movie directors?”

Nonspecific questions like this are how many data explorations start. Answering a question like this requires a much more specific task that can be precisely addressed with a dataset. Before we can be more specific, we first need to take a step back: who needs to know the answer to this question? The use case might be a film student trying to assert that his dissertation is about one of the most influential directors, or a hiring manager looking to hire a director for an upcoming project, or a journalist putting together a splashy article that will feature a top list.

Each of these users’ needs suggests a different interpretation of the notion of best director. The film student is looking for a way to quantify and defend a notion of influence, whereas the hiring manager might want to limit themselves to people working today who are less accomplished and thus more affordable. For this example, though, the user will be a journalist who is putting together an article about a new movie and wants to include a list of the best directors.

The goal of operationalization is to refine and clarify the question until the analyst can forge an explicit link between the data that they can find and the questions they would like to answer. For this example, the dataset at hand contains a list of movies rated by the film-aficionado community. Each movie is associated with a director, a number of raters, and an average rating score.

A NOTE ON THE DATA

The dataset used for this chapter consists of two of IMDB’s downloadable lists, directors.list and ratings.list. A copy of the Jupyter notebook that parses them into cleaner CSVs is on the book’s companion website. The script cleans the data to remove entries that the database refers to as not being movies, such as video games and TV shows. The analysis and visualizations in this chapter are carried out in Python and recorded in a second Jupyter notebook available at the same site.

With both data and a high-level question in hand, the visualization work can begin. Data alone is not enough to dictate a set of design requirements for constructing a visualization. What is missing here is a translation of the high-level question “Who are the best movie directors?” into a set of concrete tasks over the data.

The choice of dataset and operationalization is fundamentally a specific perspective on a problem; they stand in for what the analyst wishes to understand. In this example, there are other ways to frame the inquiry and other types of data that could be collected. This is a large part of why visualization is so important for answering questions like these: it allows an analyst’s experience and knowledge to layer directly on top of the data that is ultimately shown. The analyst’s skills and experience allow them to make inferences about the more abstract questions they are really interested in.

Making a Question Concrete

The process of operationalization winds its way from a general goal or a broad question to specific tasks, and then to visualizations that support those specific tasks based on concrete data.

To achieve this, the analyst searches for proxies. Proxies are partial and imperfect representations of the abstract thing that the analyst is really interested in. For example, high movie ratings may be a reasonable proxy for best in our movie example. Selecting and interpreting proxies requires judgment and expertise to assess how well, and with what sorts of limitations, they represent the abstract concept.

In operationalization, there are two important types of proxies:

A proxy task is a lower-level task that stands in for the original. The result of a proxy task reflects on the answer to the original question, but the proxy task itself is more closely related to the data; it can be accomplished with quantitative tools, such as a visualization or a statistical analysis.

A proxy value is an attribute in the data that stands in for a more abstract concept. This can be an existing attribute, or it can be derived from the data.
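As a small illustration of a derived proxy value, the Python sketch below counts, for each director, the movies rated at or above a threshold. Everything in it is an assumption made for the example: the rows are invented, the names `GOOD_RATING` and `good_movie_counts` are hypothetical, and the threshold of 8.0 is arbitrary. It is not the analysis from the book's notebooks.

```python
from collections import Counter

# Invented rows: (director, title, average_rating).
movies = [
    ("Director A", "Film 1", 8.4),
    ("Director A", "Film 2", 7.9),
    ("Director A", "Film 3", 8.1),
    ("Director B", "Film 4", 9.0),
    ("Director B", "Film 5", 5.2),
]

GOOD_RATING = 8.0  # arbitrary illustrative threshold, not a recommendation

def good_movie_counts(rows, threshold=GOOD_RATING):
    """Derive a per-director proxy value: the number of well-rated movies."""
    counts = Counter()
    for director, _title, rating in rows:
        if rating >= threshold:
            counts[director] += 1
    return counts

counts = good_movie_counts(movies)
```

The average rating is an existing attribute used as a proxy for a good movie. The per-director count is a derived attribute that could stand in for a good director. Both choices, and the threshold between them, are judgment calls that the analyst should be able to defend.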

Operationalizing a question often results in more questions, which require further articulation of proxies. One step in this process is to find places where a question is underspecified or does not directly reference the data on hand, in order to identify where proxies are necessary.

Collaboration with stakeholders crucially informs the process of operationalizing questions. It helps to learn what data is available and how the results will be used. Interviews help to identify the questions and goals of the stakeholders with respect to the data and to further understand what data is available or can be made available. Throughout the process, an analyst translates questions and goals into a description of the problem that is amenable to a data solution. Interview techniques and prototyping are discussed in more detail in Chapter 3.

In this book we advocate an approach of systematic operationalization in order to bolster explicit acknowledgment, validation, and support of the range of possible proxy decisions for a question. This systematic approach leaves open future possibilities and provides guidance for making downstream decisions. The start to this process is getting to understand the question and what is available in the data, and appreciating the gaps between them.

This chapter both describes and illustrates the operationalization process. It uses the movie director example to show how to refine a question into detailed, specific tasks. It discusses the four components that we use to describe an operationalized task.

A Concrete Movie Question

The example started with the high-level question “Who are the best directors?” The dataset is a list of directors and a list of movies. The first task is to operationalize “best director.” As a rough definition, a good director has directed many good movies. But “many good movies” is also ill-defined, and thus a proxy for “good movie” might in turn be based on its rating on IMDB.

These decisions replaced one bit of ambiguity with three more. How many of these best directors need to appear in the results? What counts as “good” IMDB ratings, and what are “many” of them? For that matter, a quick glance at the IMDB data reveals that there are short films, TV episodes, video games, and so on, so what counts as a movie?

It is possible to choose the measures arbitrarily: “More than five movies with IMDB ratings greater than 9.8,” or “average movie rating higher than 8.2,” or “no movies with a rating less than 5.” While it is not uncommon to make these sorts of decisions based on rough knowledge of the data, or even based on choosing nice, round numbers, looking at the actual data is important. The top-rated items on IMDB might turn out to have very high ratings but only one review. Great directors might direct a few stinkers, so just looking at the average rating might turn out to be a poor choice. The only way to learn what the data says is to start digging into it.

Choosing a proxy allows the analyst to sanity-check their decisions; it can be valuable to do this iteratively at each step, checking both the quality of the data and of the proxy.

A quick glance at the first five data items in the dataset reveals non-mainstream movies (Table 2-1). The alphabetical first movie in the dataset is called #1, with a total of 12 raters; the second is the similarly obscure #1 Serial Killer. Since the scenario targets a general audience, it should probably focus on movies that most people are likely to know. A different scenario could suggest very different proxies.

Table 2-1. A quick glance at the first data items in the movie dataset (which is sorted alphabetically) reveals that there could be movies with positive ratings that have very few raters, implying an obscure (but decent) movie.

ID  Raters  Score  Title                                              Director
0   12      6.4    #1 (2005)                                          Breen, James (V)
1   35      6.0    #1 Serial Killer (2013)                            Yung, Stanley (I)
2   5       5.8    #137 (2011)                                        Elliott, Frances
3   11      7.4    #140 Characters: A Documentary About Twitter (2…   Beasley, Bryan (I)
4   23      6.7    #30 (2013)                                         Wilde, Timothy
…   …       …      …                                                  …

The decision to stick with mainstream movies suggests a need for a proxy for popularity. One choice could be the number of ratings for each movie. By plotting the distribution of the number of ratings by movies (Figure 2-1), we see that the vast majority of movies in the dataset actually have very few ratings.

Figure 2-1. Distribution of ratings. This histogram shows the count of number of ratings per film. Almost all the films have few ratings, with a very long tail.

This first plot shows that the number of ratings is heavily skewed. One way to make this distribution more interpretable is to plot it on a logarithmic scale. In Figure 2-2, the data has been bucketed; a film with 1,000 ratings now appears in the bucket for log₁₀(1000) = 3. Taking the log of the number of ratings smooths the distribution, more effectively showing its shape.
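As a minimal sketch of this bucketing, assuming whole-number buckets and using made-up rating counts (neither is taken from the book's data or figure), each film can be assigned to the bucket given by the integer part of the base-10 logarithm of its number of ratings:

```python
import math
from collections import Counter


def log10_bucket(num_ratings):
    """Return the whole-number log10 bucket for a positive rating count."""
    return math.floor(math.log10(num_ratings))


# Hypothetical rating counts, for illustration only.
rating_counts = [5, 12, 35, 23, 11, 1000, 48000]

# Count how many films fall into each bucket.
histogram = Counter(log10_bucket(n) for n in rating_counts)
for bucket in sorted(histogram):
    print(f"bucket {bucket} (10^{bucket} and up): {histogram[bucket]} film(s)")
```

With these illustrative counts, the four films with between 10 and 99 ratings land in bucket 1, while the film with 1,000 ratings lands in bucket 3.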

Figure 2-2. Distribution of the logarithm (base 10) of the number of ratings. The peak is under 2: most films have under 100 ratings.

We can also compute some basic summarizing statistics about the number of ratings: the median movie in the dataset has just 26 ratings while the 75th percentile is at 132 ratings.¹ By looking up the number of ratings for a sample of blockbusters, we note that movies that anyone can name offhand have tens of thousands of ratings. These are useful observations; perhaps it would be valuable to trim to a slimmer set of movies to ensure that most are ones that a reasonable number of people have seen.

We want to choose a number, though, that’s fair to good movies, even if they are not very popular; in this case, we pick, somewhat arbitrarily, the most-rated 25% of movies. This amounts to around 70,000 films with more than 132 ratings.
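The summary statistics and the cut at the 75th percentile could be computed along the following lines. This is only a sketch: the titles and rater counts are hypothetical, and the quartile is computed with Python's default interpolation method, which may differ slightly from whatever tool produced the numbers quoted above.

```python
import statistics

# Hypothetical (title, number_of_raters) pairs, for illustration only.
movies = [("A", 5), ("B", 12), ("C", 26), ("D", 40),
          ("E", 132), ("F", 900), ("G", 48000), ("H", 20)]

rater_counts = [raters for _, raters in movies]
median_raters = statistics.median(rater_counts)

# quantiles(n=4) returns the three quartile cut points;
# the last one is the 75th percentile.
_, _, percentile_75 = statistics.quantiles(rater_counts, n=4)

# Keep only the most-rated quarter of the movies.
popular = [(title, raters) for title, raters in movies if raters > percentile_75]

print("median:", median_raters)
print("75th percentile:", percentile_75)
print("popular:", popular)
```

Choosing the threshold from the data, rather than hardcoding 132, keeps the proxy for popularity valid if the dataset is refreshed.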

We next pivot and look at the distribution of ratings for the slimmed-down set of movies, shown in Figure 2-3. This distribution shows a distinct curve with a clear peak and noticeable drop-off: ratings above 7.5 seem different from lower ratings. (This distribution has a median score of 6.6, and a 75th percentile of 7.4.)

Figure 2-3. Distribution of score. This histogram shows the count of ratings, by bucket. The distribution has a clear peak, with a noticeable drop-off above a score of 7.5.

Stepping back from our dive into the data, we can observe that we have proceeded some distance along the operationalization. We have defined a good director and decided that it is based on their movies; we have focused on movies and chosen a set that are popular enough to be part of the analysis. But there are still unanswered questions: How will we rank directors against each other? What makes for a “best” director?

A systematic approach to operationalization allows an analyst to see the full range of decisions and helps in pulling together the set of proxies that can inform a final answer. Ultimately, an interactive visualization tool can enable exploration of multiple proxies to allow for a set of justified, and validated, answers. For our running example, we will continue with the operationalization after describing a framework for making decisions explicit throughout the process.

Breaking Down a Task

Throughout the operationalization, we need to identify where in a question or task there is a need for a more refined proxy. Doing so systematically can make it easier to validate those decisions, as well as to produce a roadmap of the process. This allows the analyst to effectively revisit decisions once a better understanding of the problem is gained.

An analyst can refine a task by first breaking it down into four specific components. Identifying these components and how they do or do not directly reference the data becomes a template for choosing more specific tasks. The components are:

Objects

Things or events that exist in the world: in our example, a director and a movie are both objects. In other contexts, objects might be a user or a sale of a single item. When a task is specific enough, each object will be something that can be represented in, or computed from, the data. Fairly often, when the task is at its most specific, an object will correspond to a single row in a database.

Measures

The outcome variables that will be measured for the objects. Quality of a director, happiness of a user, and sales of a store are all measures. In a sufficiently specific task, the measure is either an existing attribute in the dataset or one that can be directly computed from the data. A measure is sometimes aggregated across many items of data. In our example, a number of movies are aggregated together to get a score for a single director.

Groupings (or partitions)

Attributes or characteristics of the data that separate the data items into groups. For example, groupings might include store region (western versus eastern), start date of players, whether users have purchased an upgrade, or sales by year. In a specific task, partitions are attributes of the objects or can be calculated directly from those attributes. When the visualization is created, partitions will often manifest as groupings, separations across charts, or filters.

Actions

Words that articulate the specific thing being done with the data, such as compare, identify, characterize, etc. Actions guide the process of choosing appropriate visualizations.

The action is useful for identifying the other components. Take this task: Compare the amount of money spent in-game by players who play more hours versus those who play fewer hours. The action is compare. What is compared? The players (the object). What is it about players that we want to compare? The money spent (the measure). Finally, there is a specific partition on the objects. They will be broken into two groups: those that play many hours and those that play few hours.

These components are the heart of the following iterative process:

1. Refine the question into one or more tasks that, individually or together, address the general question.
2. For each task:
   a. Identify the components of the task.
   b. Look for ambiguous components, namely components that are not directly addressable by the dataset.
   c. For each ambiguous component, define a proxy by creating a new question that addresses the component, and return to step 1 with those questions.
   d. If there are no ambiguous components then the task is deemed actionable, and thus can be addressed with a visualization or other computational technique.

Next, we’ll explicate some of the questions from the movie example to illustrate how the components work in practice, beginning with Example 2-1.

Example 2-1. Breaking down the task to find good directors

Task: Identify the top directors who have directed many good, popular movies
Action: Identify
Object: Director
Measure: Number of good, popular movies
Grouping: Filter out non-movies

Identifying top directors implies that there’s a meaningful sort on the directors so that the top can be found (Example 2-2). Thus, we can further refine the action to specify an ordering. Also, our first look at the data showed that many movies are unpopular, which implied a grouping to filter out unpopular movies.

Example 2-2. Refined task for good directors

Task: Rank order of directors by those who have directed many good, popular movies
Action: Rank order
Object: Director
Measure: Number of good movies
Grouping: Filter out non-movies and unpopular movies

Filtering out unpopular movies is a subtask (Example 2-3), which we addressed with a histogram of the number of ratings for movies. The visualization of the distribution allowed us to determine a good cut point for popular versus unpopular: namely, popular movies were those in the top 25% of movies with the highest number of ratings.

Example 2-3. Subtask for filtering unpopular movies

Task: Filter out movies with significantly fewer ratings
Action: Filter
Object: Movie
Measure: Number of ratings
Grouping: Separate into most popular and least popular movies

This subtask can be brought back into Example 2-2 as a proxy for unpopular movies.

However, we still have some work to do on Example 2-2: the measure “number of good movies” is ill-defined with respect to the data. We need to refine this component by developing a proxy for a good movie. Once we do that, we can then examine what it means to have directed many of them.

These proxies require further elaboration. How many high-scoring movies are required from directors? Do low-scoring movies count against them? This process of identifying reasonable proxies is often iterative. For example, in exploring and validating a proxy with the data, it might become obvious that filtering by the number of ratings was a mistaken approach.

At this point, we can recognize that we need a proxy measure for “good” (Example 2-4). There are a variety of proxies that we can try here, with various visualizations. The process continues onward.

Example 2-4. Subtask for good movies

Task: Quantify “a good movie”
Action: Quantify
Object: Movie
Measure: Goodness
Grouping: None

Breaking down a task into components helps in guiding refinement of a task into one that can be addressed with the data. The most direct way to do so is to consider the question “Are the object, measure, and grouping each directly described in the data?” For each of these three components, is it clear which aspects of the data are important or how to derive what we need from the data? If not, repeat the process of formulating a subquestion in order to derive a more specific answer.
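One way to keep track of this check is to record each task's components together with the part of the data that realizes them. The sketch below is an illustrative bookkeeping structure, not something the text prescribes: the class names `Component` and `Task`, and the convention that a missing `data_reference` marks an ambiguous component, are assumptions made for the example. The column names `Director` and `Raters` come from Table 2-1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Component:
    description: str
    # Column or derived value in the data that realizes this component.
    # None means no proxy has been chosen yet, so the component is ambiguous.
    data_reference: Optional[str] = None


@dataclass
class Task:
    name: str
    action: str
    obj: Component
    measure: Component
    grouping: Component
    subtasks: List["Task"] = field(default_factory=list)

    def ambiguous_components(self) -> List[str]:
        parts = {"object": self.obj, "measure": self.measure, "grouping": self.grouping}
        return [label for label, part in parts.items() if part.data_reference is None]

    def is_actionable(self) -> bool:
        # Actionable only if nothing is ambiguous here or in any subtask.
        return not self.ambiguous_components() and all(
            sub.is_actionable() for sub in self.subtasks
        )


# Example 2-3: every component maps onto the data, so the subtask is actionable.
filter_unpopular = Task(
    name="Filter out movies with significantly fewer ratings",
    action="Filter",
    obj=Component("Movie", "one row of the movie table"),
    measure=Component("Number of ratings", "Raters"),
    grouping=Component("Most popular versus least popular", "Raters > 75th percentile"),
)

# Example 2-2: "good" and "non-movie" have no proxies yet.
rank_directors = Task(
    name="Rank order of directors",
    action="Rank order",
    obj=Component("Director", "Director"),
    measure=Component("Number of good movies"),
    grouping=Component("Filter out non-movies and unpopular movies"),
    subtasks=[filter_unpopular],
)

print(rank_directors.ambiguous_components())
print(rank_directors.is_actionable(), filter_unpopular.is_actionable())
```

Listing the ambiguous components gives the analyst the next set of subquestions to formulate; once every component has a data reference, the task is ready for a visualization or computation.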

Let’s take a look at a very different example: this time, from a gameplay metrics scenario (Example 2-5).

Example 2-5. Exemplar task for analyzing a game

Task: Compare the amount of money spent in-game by players who play more hours versus those who play fewer hours.
Action: Compare
Object: Players
Measure: Money spent
Grouping: Players who play many hours; players who play few hours

In Example 2-5, the partition divides between many and few hours. This component needs to be refined further, which leads to a new question: “In the game, how many is ‘many’ hours for a player?” The analyst might take a series of steps. They might look at the distribution of hours played, or they might choose to filter out players who have played zero hours or those who haven’t made it past the tutorial, or they might look at other metrics that are important to the game. These steps would help the analyst figure out good proxies for many and few hours.
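Once a proxy for “many hours” has been chosen, the comparison itself is a small computation. The following sketch uses hypothetical player records and, purely as an assumption for illustration, takes the median of hours played as the dividing line:

```python
import statistics

# Hypothetical (hours_played, money_spent) records, for illustration only.
players = [(0.5, 0.0), (1.0, 0.0), (2.0, 4.99),
           (6.0, 9.99), (8.0, 0.0), (40.0, 59.97)]


def compare_spending(players, many_hours_cutoff):
    """Average money spent for players at or above the cutoff versus below it."""
    many = [spent for hours, spent in players if hours >= many_hours_cutoff]
    few = [spent for hours, spent in players if hours < many_hours_cutoff]
    return {
        "many": statistics.mean(many) if many else None,
        "few": statistics.mean(few) if few else None,
    }


# Assumed proxy: "many hours" means at or above the median hours played.
cutoff = statistics.median(hours for hours, _ in players)
result = compare_spending(players, cutoff)
print(cutoff, result)
```

The action (compare), object (players), measure (money spent), and grouping (the cutoff) each appear as a distinct piece of the code, which makes it easy to swap in a different proxy for the grouping later.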

When Tasks Lead to New Questions

There are four broad categories of new lines of inquiry that can emerge from refining a question. First, as in the movie example, the refinement process often reveals that a new analysis is needed to answer these questions.

Second, operationalizing can also lead in new directions. In the process of exploring who the best directors are, the analyst might notice that some directors stick to a single genre; they might decide that this analysis might be interesting divided across multiple genres. They might also notice that both IMDB and Rotten Tomatoes have scores on movies, and want to see how these results vary based on Rotten Tomatoes scores instead of IMDB.

Third, the data itself can lead to new questions too. In exploratory data analysis (EDA), for example, the data analyst discovers new questions based on the data. The process of looking at the data to address some of these questions generates incidental visualizations: odd patterns, outliers, or surprising correlations that are worth looking into further.

Finally, doing some analysis often leads to doing a round of data cleaning. While data cleaning is largely out of the scope of this book, odd outliers and surprising trends are, as often as not, the result of dirty data.

Returning to the Example: Exploring Different Definitions

There are several different possible definitions of best director.

Here is one: the best director has the most movies with more than 132 ratings. Table 2-2 shows the top scorers. The most prolific directors in our dataset are Chuck Jones and Fritz Freleng (who directed classic Looney Tunes animations), William Hanna (who directed Tom and Jerry and other classic Hanna-Barbera cartoons), and Georges Méliès (an early inventor of special effects and shorts).

Table 2-2. Top five directors by number of films over threshold

Director            Avg. raters  Avg. score  Count  Total raters
Jones, Chuck (I)    719          7.4         148    106,397
Freleng, Fritz      402          7.2         141    56,730
Hanna, William (I)  591          7.5         119    70,315
Méliès, Georges     717          6.1         114    81,769
White, Jules (I)    235          7.1         102    23,969

Georges Méliès has 526 films on his IMDB page; only 114, however, made it over the threshold of raters. The huge number of films is explained by the fact that the films are shorts, more familiar on television now, but once also shown in theaters. This should be an opportunity to do more data cleaning to join in another table that will tell us whether a film is a short or not, and filter those out. IMDB has a film duration data table; in a typical analysis process, the next step would be to merge in this table, adding a new proxy for what makes for a short film.
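A sketch of that merge step might look like the following. The table layouts, the movie IDs, and the 40-minute cutoff are all assumptions made for illustration; the cutoff in particular is itself a proxy for “short film” that would need to be checked against the data.

```python
# Hypothetical tables keyed by movie ID: titles, and running times in minutes.
movies = {0: "Feature A", 1: "Cartoon B", 2: "Feature C", 3: "Unknown D"}
durations = {0: 112, 1: 7, 2: 95}

# Assumed proxy for "short film"; tune against the data.
SHORT_FILM_MAX_MINUTES = 40


def feature_films(movies, durations, max_short=SHORT_FILM_MAX_MINUTES):
    """Join movies with durations on ID and keep films longer than max_short.

    Movies with no duration record are dropped, as in an inner join.
    """
    return {
        movie_id: title
        for movie_id, title in movies.items()
        if movie_id in durations and durations[movie_id] > max_short
    }


print(feature_films(movies, durations))
```

Dropping movies with no recorded duration is a decision in its own right; an analyst might instead keep them and flag them for a later round of data cleaning.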

We might explore other definitions of best directors. For example, the best directors might make the movies that people want to rate the most. Table 2-3 is a list of the directors whose movies have, in total, the most ratings.

Table 2-3. Top five directors by total number of ratings across all movies

Director                Avg. raters  Avg. score  Count  Total raters
Spielberg, Steven       245,717      7.2         36     8,845,795
Nolan, Christopher (I)  778,737      8.2         11     8,566,104
Tarantino, Quentin      526,689      7.8         13     6,846,955
Jackson, Peter (I)      371,219      7.6         16     5,939,505
Scorsese, Martin (I)    144,823      7.5         41     5,937,725

This list makes sense. These are very famous names who have directed very familiar movies.

Different proxies yield different results. Ordering by the average score for all movies by a single director might be one way to find the very best directors. As seen in Table 2-4, the first on this list is a director who has only one movie over the threshold: a Mongolian movie from 2016 with 624 raters and an average score of 9.7. This measure of popularity returns a very different set of results than the previous measure: ten thousand times more people rated Quentin Tarantino’s movies than Uranchimeg Urtnasan’s work.

Table 2-4. Top five directors by average score

Director                 Avg. raters  Avg. score  Count  Total raters
Urtnasan, Uranchimeg     624          9.7         1      624
Miller, George (XXXVII)  394          9.6         2      787
Chowdhury, Amitabh Reza  14,628       9.6         1      14,628
Biebert, Aaron           1,204        9.6         1      1,204
Arsyn, Ken               619          9.5         6      3,712

But can the quality of a director be measured based on just one or two movies? Each step of data exploration leads to another step of refining the question. Is it more important to have many raters, a high average score, or a high minimum score?

The choice of metrics leads to very different outcomes. A slight tweak determines whether you find directors of animated cartoons, blockbuster directors, or a very diverse set of international directors.
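The three proxies used in Tables 2-2 through 2-4 (count of films, total raters, and average score) can be computed from the same rows, which makes it easy to see how the “best” director changes with the proxy. The rows below are hypothetical and chosen only so that each proxy favors a different director; they are not the IMDB data.

```python
from collections import defaultdict

# Hypothetical (director, raters, score) rows that already passed
# the popularity filter; for illustration only.
movies = [
    ("Director A", 700, 7.4), ("Director A", 650, 7.0), ("Director A", 800, 7.6),
    ("Director B", 500000, 8.2), ("Director B", 900000, 8.0),
    ("Director C", 624, 9.7),
]


def summarize(movies):
    """Aggregate movie rows into one summary record per director."""
    rows_by_director = defaultdict(list)
    for director, raters, score in movies:
        rows_by_director[director].append((raters, score))
    return {
        director: {
            "count": len(rows),
            "total_raters": sum(raters for raters, _ in rows),
            "avg_score": sum(score for _, score in rows) / len(rows),
        }
        for director, rows in rows_by_director.items()
    }


def best_by(summary, proxy):
    """Return the director with the highest value for the chosen proxy."""
    return max(summary, key=lambda director: summary[director][proxy])


summary = summarize(movies)
for proxy in ("count", "total_raters", "avg_score"):
    print(proxy, "->", best_by(summary, proxy))
```

Here the object (director) is built by aggregating movie rows, and only the measure changes between runs, which mirrors how the three tables differ.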

How Specific Does the Process Get?

This process of refinement leads to a scary scenario. In Disney’s Fantasia, in the Sorcerer’s Apprentice sequence, Mickey Mouse attempts to stop an enchanted broom by chopping it in half and instead produces two half-size enchanted brooms. Will our analysis subtasks forever multiply?

The operationalization process is an iterative one and the end point is not precisely defined. The answer to the question of how far to go is, simply, far enough. The process is done when the task is directly actionable, using the data at hand. The analyst knows how to describe the objects, measures, and groupings in terms of the data: where to find it, how to compute it, and how to aggregate it. At this point, they know what the question will look like and they know what they can do to get the answer.

An actionable task means that it is possible to act on its result. That action might be to present a useful result to a decision maker or to proceed to a next step in a different result. An answer is actionable when it no longer needs further work to make sense of it.

Low-level objects are ready to be interpreted from the data. Sometimes they can be read directly off the data table, but more often it is more indirect; the analyst may need to carry out transformations on the data, whether mathematical transformations or database joins. For instance, in the movie example, the object is the director; the proxy for the director is the result of aggregating multiple movies together. Partitions and measures at the lowest level will resolve to concrete manipulations of the objects.

The process ends when all the tools needed to answer a question are in place, whether as a number, a visualization, or even as an interaction across multiple visualizations representing multiple proxies. The analyst might decide that the right cutoff for many hours of gameplay is six hours (a number), or the hours played by the top 10% of players (a formula), or above the logical breakpoint, which might be represented by a distribution. These results get propagated back into any other tasks that depend on them.
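The difference between a cutoff expressed as a number and one expressed as a formula can be sketched as two interchangeable rules. The hours below are hypothetical, and the function names are illustrative; the 90th percentile is computed with Python's default interpolation method.

```python
import statistics

# Hypothetical hours played per player, for illustration only.
hours_played = [0.5, 1, 1, 2, 2, 3, 4, 6, 8, 12, 20, 40]


def fixed_cutoff(hours):
    """A number: 'many hours' means at least six hours."""
    return 6.0


def top_decile_cutoff(hours):
    """A formula: the 90th percentile of hours played."""
    return statistics.quantiles(hours, n=10)[-1]


def players_with_many_hours(hours, cutoff_rule):
    """Apply whichever rule was chosen and return the qualifying hours."""
    cutoff = cutoff_rule(hours)
    return [h for h in hours if h >= cutoff]


for rule in (fixed_cutoff, top_decile_cutoff):
    print(rule.__name__, players_with_many_hours(hours_played, rule))
```

Because both rules share one signature, any task that depends on the “many hours” grouping can accept either without being rewritten.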

Making Use of Results

This process of propagating results back into higher-level questions is flexible. Sometimes the low-level question does not have an exact answer but instead resolves in its own visualization or interaction. That visualization might help an analyst in making a decision, but it might also imply parameters on the data. For example, the journalist might realize that there are several possible cutoffs for defining what it means to be a good movie. Rather than simply picking a specific threshold, an analyst might instead propagate a mechanism for dynamically determining cutoffs into higher-level tasks. Seeing a variable propagated like this can be a cue that an interactive visualization (rather than a static image) might be helpful.
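Propagating a cutoff as a variable, rather than a fixed value, amounts to leaving it as a parameter of the higher-level task. In this sketch the rows and the two candidate cutoffs are hypothetical; in an interactive visualization, the parameter might be bound to a slider.

```python
from collections import Counter

# Hypothetical (director, score) rows, for illustration only.
movies = [("Director A", 7.6), ("Director A", 7.7), ("Director A", 7.8),
          ("Director B", 8.4), ("Director B", 8.6), ("Director B", 6.0)]


def good_movie_counts(movies, good_cutoff):
    """Count movies per director whose score is at least good_cutoff."""
    return Counter(director for director, score in movies if score >= good_cutoff)


# The same higher-level task, evaluated under two candidate cutoffs.
for cutoff in (7.5, 8.0):
    print(cutoff, good_movie_counts(movies, cutoff).most_common())
```

With these rows, moving the cutoff from 7.5 to 8.0 changes which director comes out on top, which is exactly the kind of sensitivity an analyst would want to expose rather than hide behind a single threshold.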

Visualization is also important for supporting the operationalization process, even if the end result is not an interactive visualization. In the movie example, visualization helped us to understand the nature and distribution of the data. Visualization can be more prominent with more complex analysis tasks. If the analyst wanted to compare ratings against popularity, it would be difficult to display that on a list; if they wanted to explore hypotheses about how the popularity of directors changes over time, more visual representations would help them explore the data.

Conclusion: A Well-Operationalized Task

A well-operationalized task, relative to the underlying data, fulfills the following criteria:

Can be computed based on the data

Makes specific reference to the attributes of the data

Has a traceable path from the high-level abstract questions to a set of concrete, actionable tasks

A well-operationalized task is a first step toward creating a visualization. Chapter 4 begins to describe the ways in which the objects, measures, and partitions can be shaped into aspects of a visualization. Chapters 5 and 6 construct visualizations based on them.

Written out in detail, this process can seem tedious, but in practice, it is abbreviated and simplified. There are two important uses for this systematization. First, the process of explicitly looking at components can help untangle knotty problems, decomposing places where the analyst has made assumptions about the data. Explaining precisely why the number of IMDB ratings is a proxy for popularity forces the analyst to explore whether it is a good choice and, perhaps, to revise that choice later.

The process also helps guide questions and interviews. Chapter 3 explains how to carry out operationalization with domain experts. Recognizing the need to make decisions about proxies helps guide these conversations. Every dataset has subtleties; it can be far too easy to slip down rabbit holes of complications. Being systematic about the operationalization can help focus our conversations with experts, only introducing complications when needed.

Further Reading

The process outlined here is similar, and in many ways parallel, to the Goal, Question, Metric (GQM) process found in the software engineering space. GQM refines from a general goal to a specific metric, usually oriented around process improvement so that the consumer can have a single number that helps them know whether they are succeeding in improving that process.

Our process is more exploratory and often comes earlier in the cycle. A GQM analysis might choose a goal like “improve user retention.” In contrast, exploratory operationalization might start with a question like “Do users come back to our site?” with the awareness that the problem is multifaceted and complex, and might require a variety of different metrics to describe. For more on GQM, see:

Basili, Victor, Gianluigi Caldiera, and Dieter Rombach. “The Goal Question Metric Approach.” Encyclopedia of Software Engineering. New York: Wiley, 1994.

The data visualization field has spent a great deal of effort trying to understand the tasks that can be accomplished in a visualization. Amar and Stasko, for example, explore a low-level analysis of tasks carried out on a specific visualization. At the other end of the spectrum, Brehmer and Munzner explore high-level tasks for visualization, starting with comparing presentation and exploration:

Amar, Robert and John Stasko. “A Knowledge Task-Based Framework for the Design and Evaluation of Information Visualizations.” Proceedings of the IEEE Symposium on Information Visualization (2004): 143–150.

Brehmer, Mathew and Tamara Munzner. “A Multi-Level Typology of Abstract Visualization Tasks.” IEEE Transactions on Visualization and Computer Graphics 19 (2013): 2376–2385.

¹ Median and percentile are ways of characterizing a distribution of numbers. If one were to sort the numbers, the 75th percentile would be 75% of the way down the list. The median would be at the halfway point.