Data Visualization and Exploration with R A Practical Guide to Using R RStudio and Tidyverse for...

Page 1: Data Visualization and Exploration with R A Practical Guide to Using R RStudio and Tidyverse for Data Visualization Exploration and Data Science Applications

Data Visualization and Exploration with R

A practical guide to using R, RStudio, and Tidyverse for data visualization, exploration, and data science applications.

Eric Pimpler

Introduction to Data Visualization and Exploration with R

A practical guide to using R, RStudio, and tidyverse for data visualization, exploration, and data science applications.

Eric Pimpler

Geospatial Training Services
215 W Bandera #114-104
Boerne, TX 78006
PH: 210-260-4992
Email: [email protected]
http://geospatialtraining.com
Twitter: @gistraining

Copyright © 2017 by Eric Pimpler – Geospatial Training Services. All rights reserved.

No part of this book may be reproduced in any form or by any electronic or mechanical means, including information storage and retrieval systems, without written permission from the author, except for the use of brief quotations in a book review.

About the Author

Eric Pimpler

Eric Pimpler is the founder and owner of Geospatial Training Services (geospatialtraining.com) and has over 25 years of experience implementing and teaching GIS solutions using Esri software. Currently he focuses on data science applications with R along with ArcGIS Pro and Desktop scripting with Python and the development of custom ArcGIS Enterprise (Server) and ArcGIS Online web and mobile applications with JavaScript.

Eric is also the author of several other books including Introduction to Programming ArcGIS Pro with Python (https://www.amazon.com/dp/1979451079), Programming ArcGIS with Python Cookbook (https://www.packtpub.com/application-development/programming-arcgis-python-cookbook-second-edition), Spatial Analytics with ArcGIS (https://www.packtpub.com/application-development/spatial-analytics-arcgis), Building Web and Mobile ArcGIS Server Applications with JavaScript (https://www.packtpub.com/application-development/building-web-and-mobile-arcgis-server-applications-javascript), and ArcGIS Blueprints (https://www.packtpub.com/application-development/arcgis-blueprints).

If you need consulting assistance with your data science or GIS projects, please contact Eric at eric@geospatialtraining.com or sales@geospatialtraining.com. Geospatial Training Services provides contract application development and programming expertise for R, ArcGIS Pro, ArcGIS Desktop, ArcGIS Enterprise (Server), and ArcGIS Online using Python, .NET/ArcObjects, and JavaScript.

Downloading and Installing Exercise Data for this Book

This is intended as a hands-on exercise book and is designed to give you as much hands-on coding experience with R as possible. Many of the exercises in this book require that you load data from a file-based data source such as a CSV file. These files will need to be installed on your computer before continuing with the exercises in this chapter as well as the rest of the book. Please follow the instructions below to download and install the exercise data.

1. In a web browser, go to one of the links below to download the exercise data:

https://www.dropbox.com/s/5p7j7nl8hgijsnx/IntroR.zip?dl=0
https://s3.amazonaws.com/VirtualGISClassroom/IntroR/IntroR.zip

2. This will download a file called IntroR.zip.

3. The exercise data can be unzipped to any location on your computer. After unzipping the IntroR.zip file you will have a folder structure that includes IntroR as the top-most folder with sub-folders called Data and Solutions. The Data folder contains the data that will be used in the exercises in the book, while the Solutions folder contains solution files for the R scripts that you will write.

RStudio can be used on Windows, Mac, or Linux, so rather than specifying a particular folder for the data I will leave the installation location up to you. Just remember where you unzip the data because you’ll need to reference the location when you set the working directory.

4. For reference purposes I have installed the data to the desktop of my Mac computer under IntroR/Data. You will see this location referenced at various places throughout the book. However, keep in mind that you can install the data anywhere.

Table of Contents

CHAPTER 1: Introduction to R and RStudio
  Introduction to RStudio
  Exercise 1: Creating variables and assigning data
  Exercise 2: Using vectors and factors
  Exercise 3: Using lists
  Exercise 4: Using data classes
  Exercise 5: Looping statements
  Exercise 6: Decision support statements – if | else
  Exercise 7: Using functions
  Exercise 8: Introduction to tidyverse

CHAPTER 2: The Basics of Data Exploration and Visualization with R
  Exercise 1: Installing and loading tidyverse
  Exercise 2: Loading and examining a dataset
  Exercise 3: Filtering a dataset
  Exercise 4: Grouping and summarizing a dataset
  Exercise 5: Plotting a dataset
  Exercise 6: Graphing burglaries by month and year

CHAPTER 3: Loading Data into R
  Exercise 1: Loading a csv file with read.table()
  Exercise 2: Loading a csv file with read.csv()
  Exercise 3: Loading a tab delimited file with read.table()
  Exercise 4: Using readr to load data

CHAPTER 4: Transforming Data
  Exercise 1: Filtering records to create a subset
  Exercise 2: Narrowing the list of columns with select()
  Exercise 3: Arranging Rows
  Exercise 4: Adding Rows with mutate()
  Exercise 5: Summarizing and Grouping
  Exercise 6: Piping
  Exercise 7: Challenge

CHAPTER 5: Creating Tidy Data
  Exercise 1: Gathering
  Exercise 2: Spreading
  Exercise 3: Separating
  Exercise 4: Uniting

CHAPTER 6: Basic Data Exploration Techniques in R
  Exercise 1: Measuring Categorical Variation with a Bar Chart
  Exercise 2: Measuring Continuous Variation with a Histogram
  Exercise 3: Measuring Covariation with Box Plots
  Exercise 4: Measuring Covariation with Symbol Size
  Exercise 5: 2D bin and hex charts
  Exercise 6: Generating Summary Statistics

CHAPTER 7: Basic Data Visualization Techniques
  Step 1: Creating a scatterplot
  Step 2: Adding a regression line to the scatterplot
  Step 3: Plotting categories
  Step 4: Labeling the graph
  Step 5: Legend layouts
  Step 6: Creating a facet
  Step 7: Theming
  Step 8: Creating bar charts
  Step 9: Creating Violin Plots
  Step 10: Creating density plots

CHAPTER 8: Visualizing Geographic Data with ggmap
  Exercise 1: Creating a basemap
  Exercise 2: Adding operational data layers
  Exercise 3: Adding Layers from Shapefiles

CHAPTER 9: R Markdown
  Exercise 1: Creating an R Markdown file
  Exercise 2: Adding Code Chunks and Text to an R Markdown File
  Exercise 3: Code chunk and header options
  Exercise 4: Caching
  Exercise 5: Using Knit to output an R Markdown file

CHAPTER 10: Case Study – Wildfire Activity in the Western United States
  Exercise 1: Have the number of wildfires increased or decreased in the past few decades?
  Exercise 2: Has the acreage burned increased over time?
  Exercise 3: Is the size of individual wildfires increasing over time?
  Exercise 4: Has the length of the fire season increased over time?
  Exercise 5: Does the average wildfire size differ by federal organization?

CHAPTER 11: Case Study – Single Family Residential Home and Rental Values
  Exercise 1: What is the trend for home values in the Austin metro area?
  Exercise 2: What is the trend for rental rates in the Austin metro area?
  Exercise 3: Determining the Price-Rent Ratio for the Austin metropolitan area
  Exercise 4: Comparing residential home values in Austin to other Texas and U.S. metropolitan areas

Chapter 1

Introduction to R and RStudio

The R Project for Statistical Computing, or simply R, is a free software environment for statistical computing and graphics. It is also a programming language that is widely used among statisticians and data miners for developing statistical software and data analysis. Over the last few years, they were joined by enterprises who discovered the potential of R, as well as technology vendors that offer R support or R-based products.

Although there are other programming languages for handling statistics, R has become the de facto language of statistical routines, offering a package repository with over 6400 problem-solving packages. It also offers versatile and powerful plotting, and it has the advantage of treating tabular and multi-dimensional data as a labeled, indexed series of observations. This is a game changer over typical software, like Excel, that only provides a 2D layout.

In this chapter we’ll cover the following topics:

• Introduction to RStudio
• Creating variables and assigning data
• Using vectors and factors
• Using lists
• Using data classes
• Looping statements
• Decision support statements
• Using functions
• Introduction to tidyverse

Introduction to RStudio

There are a number of integrated development environments (IDEs) that you can use to write R code, including Visual Studio for R, Eclipse, R Console, and RStudio, among others. You could also use a plain text editor. However, we’re going to use RStudio for the exercises in this book. RStudio is a free, open source IDE for R. It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management.

RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux).

You can get more information on RStudio at

https://www.rstudio.com/products/rstudio/

The RStudio Interface

The RStudio interface, displayed in the screenshot below, looks quite complex initially, but when you break the interface down into sections it isn’t so overwhelming. We’ll cover much of the interface in the sections below. Keep in mind though that the interface is customizable, so if you find the default interface isn’t exactly what you like it can be changed. You’ll learn how to customize the interface in a later section.

To simplify the overview of RStudio we’ll break the IDE into quadrants to make it easier to reference each component of the interface. The screenshot below illustrates each of the quadrants. We’ll start with the panes in quadrant 1 and work through each of the quadrants.

Files Pane – (Q1)

The Files pane functions like a file explorer similar to Windows Explorer on a Windows operating system or Finder on a Mac. This tab, displayed in the screenshot below, provides the following functionality:

1. Delete files and folders
2. Create new folders
3. Rename folders
4. Folder navigation
5. Copy or move files
6. Set working directory or go to working directory
7. View files
8. Import datasets

Plots Pane – (Q1)

The Plots pane, displayed in the screenshot below, is used to view output visualizations produced when typing code into the Console window or running a script. Plots can be created using a variety of different packages, but we’ll primarily be using the ggplot2 package in this book. Once a plot is produced, you can zoom in, export it as an image or PDF, copy it to the clipboard, and remove plots. You can also navigate to previous and next plots.

Packages Pane – (Q1)

The Packages pane, shown in the screenshot below, displays all currently installed packages along with a brief description and version number for each package. Packages can also be removed using the x icon to the right of the version number for the package. Clicking on the package name will display the help file for the package in the Help tab. Clicking on the checkbox to the left of the package name loads the library so that it can be used when writing code in the Console window.
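
Ticking a package’s checkbox is the point-and-click counterpart of calling library() in code. The lines below are a minimal sketch of that equivalent; the tools package is used only because it ships with every R installation, so the choice of package is an assumption made for illustration.

```r
# List the names of the packages installed on this machine.
installed <- rownames(installed.packages())
head(installed)

# Load a package for the current session; this mirrors ticking its checkbox.
library(tools)

# Confirm that the package is now attached to the search path.
"package:tools" %in% search()
```

A package has to be loaded again in each new session, whether you use the checkbox or library().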

Help Pane – (Q1)

The Help pane, shown in the screenshot below, displays linked help documentation for any packages that you have installed.

Viewer Pane – (Q1)

RStudio includes a Viewer pane that can be used to view local web content, for example web graphics generated using packages like googleVis, htmlwidgets, and rCharts, or even a local web application created with Shiny. However, keep in mind that the Viewer pane can only be used for local web content in the form of static HTML pages written in the session’s temporary directory or a locally run web application. The Viewer pane can’t be used to view online content.

Environment Pane – (Q2)

The Environment pane contains a listing of variables that you have created for the current session. Each variable is listed in the tab and can be expanded to view the contents of the variable. You can see an example of this in the screenshot below by taking a look at the df variable. The rectangle surrounding the df variable displays the columns for the variable.

Clicking the table icon on the far-right side of the display (highlighted with the arrow in the screenshot above) will open the data in a tabular viewer as seen in the screenshot below.

Other functionality provided by the Environment pane includes opening or saving a workspace and importing datasets from text files, Excel spreadsheets, and various statistical package formats. You can also clear the current workspace.

History Pane – (Q2)

The History pane, shown in the screenshot below, displays a list of all commands that have been executed in the current session. This tab includes a number of useful functions including the ability to save these commands to a file or load historical commands from an existing file. You can also select specific commands from the History tab and send them directly to the console or an open script. You can also remove items from the History pane.

Connections Pane – (Q2)

The Connections tab can be used to access existing or create new connections to ODBC and Spark data sources.

Source Pane – (Q3)

The Source pane in RStudio, seen in the screenshot below, is used to create scripts and display datasets. An R script is simply a text file containing a series of commands that are executed together. Commands can also be written line by line from the Console pane. When written from the Console pane, each line of code is executed when you press the Enter (Return) key. However, scripts are executed as a group.

Multiple scripts can be open at the same time with each script occupying a separate tab as seen in the screenshot. RStudio provides the ability to execute the entire script, only the current line, or a highlighted group of lines. This gives you a lot of control over the execution of the code in a script.

The Source pane can also be used to display datasets. In the screenshot below, a data frame is displayed. Data frames can be displayed in this manner by calling the View(<data frame>) function.

Console Pane – (Q4)

The Console pane in RStudio is used to interactively write and run lines of code. Each time you enter a line of code and press Enter (Return) it will execute that line of code. Any warning or error messages will be displayed in the Console window as well as output from print() statements.

Terminal Pane – (Q4)

The RStudio Terminal pane provides access to the system shell from within the RStudio IDE. It supports xterm emulation, enabling use of full-screen terminal applications (e.g. text editors, terminal multiplexers) as well as regular command-line operations with line editing and shell history.

There are many potential uses of the shell including advanced source control operations, execution of long-running jobs, remote logins, and system administration of RStudio.

The Terminal pane is unlike most of the other features found in RStudio in that its capabilities are platform specific. In general, these differences can be categorized as either Windows capabilities or other (Mac, Linux, RStudio Server).

Customizing the Interface

If you don’t like the default RStudio interface, you can customize the appearance. To do so, go to Tools | Global Options (RStudio | Preferences on a Mac).

The dialog seen in the screenshot below will be displayed.

The Pane Layout tab is used to change the locations of console, source editor, and tab panes, and set which tabs are included in each pane.

Menu Options

There are also a multitude of options that can be accessed from the RStudio menu items. Covering these items in depth is beyond the scope of this book, but in general here are some of the more useful functions that can be accessed through the menus.

1. Create new files and projects
2. Import datasets
3. Hide, show, and zoom in and out of panes
4. Work with plots (save, zoom, clear)
5. Set the working directory
6. Save and load workspace
7. Start a new session
8. Debugging tools
9. Profiling tools
10. Install packages
11. Access help system

You’ll learn how to use various components of the RStudio interface as we move through the exercises in the book.

Installing RStudio

If you haven’t already done so, now is a good time to download and install RStudio. There are a number of versions of RStudio, including a free open source version which will be sufficient for this book. Versions are also available for various operating systems including Windows, Mac, and Linux.

1. Go to https://www.rstudio.com/products/rstudio/download/, find RStudio for Desktop, the Open Source License version, and follow the instructions to download and install the software.

In the next section we’ll explore the basic programming constructs of the R language, including the creation and assigning of data to variables, as well as the data types and objects that can be assigned to variables.

Installing the Exercise Data

This is intended as a hands-on exercise book and is designed to give you as much hands-on coding experience with R as possible. Many of the exercises in this book require that you load data from a file-based data source such as a CSV file. These files will need to be installed on your computer before continuing with the exercises in this chapter as well as the rest of the book. Please follow the instructions below to download and install the exercise data.

1. In a web browser go to https://www.dropbox.com/s/5p7j7nl8hgijsnx/IntroR.zip?dl=0.

2. This will download a file called IntroR.zip.

3. The exercise data can be unzipped to any location on your computer. After unzipping the IntroR.zip file you will have a folder structure that includes IntroR as the top-most folder with sub-folders called Data and Solutions. The Data folder contains the data that will be used in the exercises in the book, while the Solutions folder contains solution files for the R scripts that you will write.

RStudio can be used on Windows, Mac, or Linux, so rather than specifying a particular folder for the data I will leave the installation location up to you. Just remember where you unzip the data because you’ll need to reference the location when you set the working directory.

4. For reference purposes I have installed the data to the desktop of my Mac computer under IntroR/Data. You will see this location referenced at various places throughout the book. However, keep in mind that you can install the data anywhere.

Exercise 1: Creating variables and assigning data

In the R programming language, like other languages, variables are given a name and assigned data. Each variable has a name that represents its area in memory. In R, variables are case sensitive, so use care in naming your variables and referring to them later in your code.

There are two ways that variables can be assigned in R. In the first code example below, a variable named x is created. The variable name is followed by a less-than sign and then, immediately, a dash. This is the operator used to assign data to a variable in R. On the right-hand side of this operator is the value being assigned to the variable. In this case, the value 10 has been assigned to the variable x. To print the value of a variable in R you can simply type the variable name and then press the Enter key on your keyboard.

x <- 10
x
[1] 10

The other way of creating and assigning data to a variable is to use the equal sign. In the second code example we create a variable called y and assign the value 10 to the variable. This second method of creating and assigning data to a variable is probably more familiar to you if you’ve used other languages like Python or JavaScript.

y = 10
y
[1] 10
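
The two assignment forms can be compared side by side. This is a small sketch rather than part of the exercise files; it relies only on base R, where both operators behave the same way when used at the top level of a script or in the Console.

```r
# Assign with the arrow operator.
x <- 10

# Assign with the equal sign.
y = 10

# Both variables now hold the same value.
identical(x, y)
```

Inside a function call the equal sign names an argument instead of creating a variable, which is one reason a lot of R code prefers the arrow for assignment.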

In this exercise you’ll learn how to create variables in R and assign data.

1. Open RStudio and find the Console window. It should be on the left-hand side of your screen at the bottom.

2. The first thing you’ll need to do is set the working directory for the RStudio session. The working directory for all chapters in this book will be the location where you installed the exercise data. Please refer back to the section Installing the Exercise Data for exercise data installation instructions if you haven’t already completed this step.

The working directory can be set by typing the code you see below into the Console pane or by going to Session | Set Working Directory | Choose Directory from the RStudio menu. You will need to specify the location of the IntroR/Data folder where you installed the exercise data.

setwd(<installation directory for exercise data>)
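
As a concrete illustration, the sketch below builds a path and switches to it only if the folder exists. The Desktop location is an assumption that mirrors the author’s Mac setup; substitute the folder where you unzipped the data.

```r
# Hypothetical location; replace it with the folder where you unzipped the data.
data_dir <- file.path("~", "Desktop", "IntroR", "Data")

# Only switch directories if the folder actually exists on this machine.
if (dir.exists(data_dir)) {
  setwd(data_dir)
}

# Report the working directory that is currently in effect.
getwd()
```

Building the path with file.path() joins the pieces with forward slashes, which R accepts on Windows, Mac, and Linux. A single backslash inside an R string is treated as an escape character, so Windows-style paths need either forward slashes or doubled backslashes.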

3. As I mentioned in the introduction to this exercise, there are two ways to create and assign data to variables in R. We’ll examine both in this section. First, create a variable called x and assign the value 10 as seen below. Notice the use of the less-than sign (<) followed immediately by a dash (-). This operator can be used to assign data to a variable. The variable name is on the left-hand side of the operator, and the data we’re assigning to the variable is on the right-hand side of the operator.

Note: The screenshot below displays a working directory of ~/Desktop/IntroR/Data/ which may or may not be your working directory. This is simply the working directory that I’ve defined for my RStudio session on a Mac computer. This will depend entirely on where you installed the exercise data for the book and the working directory you have set for your RStudio session.

4. The second way of creating a variable is to use the equal sign. Create a second variable using this method as seen in the screenshot below. Assign the value as y = 20. I will use the equal sign throughout the book in future exercises since it is used in other programming languages and is easier to understand and type. However, you are free to use either operator.

5. Finally, create a third variable called z and assign it the value of x + y. The variables x, y, and z have all been assigned numeric data. Variables in R can be assigned other types of data as well, including characters (also known as strings), Booleans, and a number of data objects including vectors, factors, lists, matrices, data frames, and others.
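
Steps 3 through 5 can be run together as follows. This is a sketch of the exercise as described in the text, not a copy of the solution file; the flag variable is a hypothetical extra added to show a Boolean value.

```r
# Steps 3 and 4: create two numeric variables.
x <- 10
y = 20

# Step 5: a variable can be assigned the result of an expression.
z = x + y
z              # 30

# A Boolean (logical) value, added here only for illustration.
flag = TRUE
class(flag)    # "logical"
```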

6. The three variables that you’ve created (x, y, and z) are all numeric data types. This should be self-explanatory: any number typed without quotes, whether a whole number or a floating-point value, is stored as a numeric data type by default (complex numbers are the exception and have their own complex type). However, if you surround a number with quotes it will be interpreted by R as a character data type.
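
The class() function reports how R has stored a value, which makes the effect of the quotes easy to check. A brief sketch using base R only:

```r
# Numbers typed without quotes are numeric.
class(10)       # "numeric"
class(10.5)     # "numeric"

# The same digits inside quotes are stored as text.
class("10")     # "character"

# Text has to be converted before it can be used in arithmetic.
as.numeric("10") + 1   # 11
```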

7. You can view the value of any variable simply by typing the variable name as seen in the screenshot below. Do that now to see how it works. Typing the name of a variable and pressing the Enter/Return key will implicitly call the print() function.

8. The same thing can be accomplished using the print() function as seen below.

9. Variables in R are case sensitive. To illustrate this, create a new variable called myName and assign it the value of your name as I have done in the screenshot below. In this case, since we’ve enclosed the value with quotes, R will assign it as a character (string) data type. Any sequence of characters, whether they be letters, numbers, or special characters, will be defined as a character data type if surrounded by quotes.

Notice that when I type the name of the variable (with the correct case) it will report the value associated with the variable, but when I type myname (all lowercase) it reports an error. Even though the name is the same the casing is different, so you must always refer to your variable names with the same case that they were created.
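
Case sensitivity can also be checked in code. In this sketch the value "Your Name" is a placeholder, and exists() and tryCatch() are used so the lookup of the misspelled name does not halt the script.

```r
# Create the variable with a capital N.
myName <- "Your Name"

# The exact name is found; the all-lowercase spelling is not.
exists("myName")   # TRUE
exists("myname")   # FALSE, assuming nothing else in the session is called myname

# Referring to the wrong case raises an error, which is captured here as text.
msg <- tryCatch(myname, error = function(e) conditionMessage(e))
msg
```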

10. To see a list of all variables in your current workspace you can type the ls() function. Do that now to see a list of all the variables you have created in this session. Each variable and its current value is also displayed in the Environment pane on the right-hand side of RStudio.

11. There are many data types that can be assigned to variables. In this brief exercise we assigned both character (string) and numeric data to variables. As we dive further into the book we'll examine additional data types that can be assigned to variables in R. The syntax will remain the same though no matter what type of data is being assigned to a variable.

12. You can check your work against the solution file Chapter1_1.R.
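The steps in this exercise can be condensed into a few lines. The sketch below is not taken from the solution file; the value 10 for x and the name "Eric" are placeholders I chose for illustration, and class() and exists() are base R functions used here only to confirm what each step claims.

```r
# Assign numeric values with either operator; both create the same kind of variable.
x <- 10   # placeholder value chosen for this sketch
y = 20
z <- x + y

# A number wrapped in quotes is stored as character data, not numeric data.
quoted <- "20"

class(z)       # "numeric"
class(quoted)  # "character"

# Variable names are case sensitive: myName exists, myname does not.
myName <- "Eric"
exists("myName")  # TRUE
exists("myname")  # FALSE
```

Calling class() on a variable is a quick way to check what kind of data R believes it is holding before you use it in a calculation.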

Exercise 2: Using vectors and factors

In R, a vector is a sequence of data elements that have the same data type. Vectors are used primarily as container style variables used to hold multiple values that can then be manipulated or extracted as needed. The key though is that all the values must be of the same type. For example, all the values must be numeric, character, or Boolean. You can't include any sort of combination of data types.

To create a vector in R you call the c() function and pass in a list of values of the same type. After creating a vector there are a number of ways that you can examine, manipulate, and extract data. In this exercise you'll learn the basics of working with vectors.

1. Open RStudio and find the Console pane. It should be on the left-hand side of your screen at the bottom.

2. In the R Console pane create a new vector as seen in the code example below. The c() function is used to create the vector object. This vector is composed of character data types. Remember that all values in the vector must be of the same data type.

layers <- c('Parcels', 'Streets', 'Railroads', 'Streams', 'Buildings')

3. Get the length of the vector using the length() function. This should return a value of 5.

length(layers)
[1] 5

4. You can retrieve individual items from a vector by passing in an index number. Retrieve the Railroads value by passing in an index number of 3, which corresponds to the positional order of this value. R is a 1-based language so the first item in the list occupies position 1.

layers[3]
[1] "Railroads"

5. You can extract a contiguous sequence of values by passing in two index numbers as seen below.

layers[3:5]
[1] "Railroads" "Streams"   "Buildings"

6. Values can be left out of a vector by passing in a negative integer as seen below. This returns the vector without Streams. The layers variable itself is unchanged unless you assign the result back to it.

layers
[1] "Parcels"   "Streets"   "Railroads" "Streams"   "Buildings"
layers[-4]
[1] "Parcels"   "Streets"   "Railroads" "Buildings"

7. Create a second vector containing numbers as seen below.

layerIds <- c(1, 2, 3, 4)

8. In this next step we're going to combine the layers and layerIds vectors into a single vector. You'll recall that all the items in a vector must be of the same data type. In a case like this where one vector contains characters and the other numbers, R will automatically convert the numbers to characters. Enter the following code to see this in action.

layerIds <- c(1, 2, 3, 4)
combinedVector <- c(layers, layerIds)
combinedVector
[1] "Parcels"   "Streets"   "Railroads" "Streams"   "Buildings"
[6] "1"         "2"         "3"         "4"

9. Now let's create two new sets of vectors to see how vector arithmetic works. Add the following lines of code.

x <- c(10, 20, 30, 40, 50)
y <- c(100, 200, 300, 400, 500)

10. Now add the values of the vectors.

x + y
[1] 110 220 330 440 550

11. Subtract the values.

y - x
[1]  90 180 270 360 450

12. Multiply the values.

10 * x
[1] 100 200 300 400 500
20 * y
[1]  2000  4000  6000  8000 10000

13. You can also use the built-in R functions against the values of a vector. Enter the following lines of code to see how the built-in functions work.

sum(x)
[1] 150
mean(y)
[1] 300
median(y)
[1] 300
max(y)
[1] 500
min(x)
[1] 10
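To make the type conversion and the element-by-element arithmetic from the vector steps easier to see, here is a small sketch that repeats them with a few checks added. The explicit for loop and the logical filter at the end are my own additions for comparison, not part of the exercise.

```r
layers <- c("Parcels", "Streets", "Railroads", "Streams", "Buildings")
layerIds <- c(1, 2, 3, 4)

# Mixing characters and numbers converts everything to character.
combinedVector <- c(layers, layerIds)
class(layerIds)        # "numeric"
class(combinedVector)  # "character"

# The converted values can be turned back into numbers when needed.
ids <- as.numeric(combinedVector[6:9])

# Arithmetic is applied element by element, so no loop is required.
x <- c(10, 20, 30, 40, 50)
y <- c(100, 200, 300, 400, 500)
total <- x + y

# The same result written as an explicit loop, for comparison only.
loop_total <- numeric(length(x))
for (i in seq_along(x)) {
  loop_total[i] <- x[i] + y[i]
}
identical(total, loop_total)  # TRUE

# A logical test inside the brackets keeps only the matching values.
y[y > 250]  # 300 400 500
```

The vectorized form x + y is both shorter and the more idiomatic choice in R, which is why the explicit loop rarely appears in practice.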

14. A factor is basically a vector but with categories, so it will look familiar to you. Go ahead and clear the R Console by selecting the Edit menu item and then Clear Console in RStudio.

15. Add the following code block. Note that you can easily use line continuation in R simply by pressing the Enter (Return) key on your keyboard. The Console will automatically add the "+" at the beginning of the line indicating that it is simply a continuation of the last line.

land.type <- factor(c("Residential", "Commercial", "Agricultural", "Commercial", "Commercial", "Residential"), levels = c("Residential", "Commercial"))

table(land.type)
land.type
Residential  Commercial
          2           3

Because Agricultural is not one of the declared levels, R stores that value as NA and it does not appear in the table.

16. Now let's talk about ordering of factors. There may be times when you want to order the output of the factor. For example, you may want to order the results by month. Enter the following code:

mons <- c("March", "April", "January", "November", "January",
  "September", "October", "September", "November", "August",
  "January", "November", "November", "February", "May", "August",
  "July", "December", "August", "August", "September", "November",
  "February", "April")

mons <- factor(mons)
table(mons)
mons
    April    August  December  February   January      July
        2         4         1         2         3         1
    March       May  November   October September
        1         1         5         1         3

17. The output is less than desirable in this case. It would be preferable to have the months listed in the order that they occur during the year. Creating an ordered factor resolves this issue. Add the following code to see how this works.

mons <- factor(mons, levels = c('January', 'February', 'March',
  'April', 'May', 'June', 'July', 'August', 'September',
  'October', 'November', 'December'), ordered = TRUE)

table(mons)
mons
  January  February     March     April       May      June
        3         2         1         2         1         0
     July    August September   October  November  December
        1         4         3         1         5         1

The months are now listed in calendar order. In the next exercise you'll learn how to use lists, which are similar in many ways to vectors in that they are a container style object, but as you'll see they differ in an important way as well. You can check your work against the solution file Chapter1_2.R.
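A short sketch may help show what an ordered factor stores behind the scenes. It uses a shortened month vector of my own and the built-in month.name constant in place of typing the twelve levels by hand. Neither choice comes from the exercise.

```r
# A shortened month vector, used only for this illustration.
mons <- c("March", "January", "November", "January", "August")

# month.name is a built-in constant holding the twelve month names in calendar order.
mons <- factor(mons, levels = month.name, ordered = TRUE)

# Each value is stored as the position of its level.
as.integer(mons)   # 3 1 11 1 8

# Ordered factors support comparisons and sort in level order.
mons[1] > mons[2]  # TRUE, March comes after January
sort(mons)

# A value that is missing from the declared levels becomes NA.
land.type <- factor(c("Residential", "Commercial", "Agricultural"),
                    levels = c("Residential", "Commercial"))
is.na(land.type)   # FALSE FALSE TRUE
```

Declaring the levels explicitly is what lets table() report a zero count for a month such as June that never occurs in the data.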

Exercise 3: Using lists

A list is an ordered collection of elements, in many ways very similar to vectors. However, there are some important differences between a list and a vector. With lists you can include any combination of data types. This differs from other data structures like vectors, matrices, and factors which must contain the same data type. Lists are highly versatile and useful data types. A list in R acts as a container style object in that it can hold many values that you store temporarily and pull out as needed.

1. Clear the R Console by selecting the Edit menu item and then Clear Console in RStudio.

2. Lists can be created through the use of the list() function. It's also common to call a function that returns a list variable as well, but for the sake of simplicity in this exercise we'll use the list() function to create the list.

Each value that you intend to place inside the list should be separated by a comma. The values placed into the list can be of any type, which differs from vectors that must all be of the same type. Add the code you see below in the Console pane.

my.list <- list("Streets", 2000, "Parcels", 5000, TRUE, FALSE)

In this example a list called my.list has been created with a number of character, numeric, and Boolean values.

3. Because lists are container style objects you will need to pull values out of a list at various times. This is done by passing an index number inside square brackets, with the index number one referring to the first value in the list, and each successive value occupying the next index number in order. However, accessing items in a list can be a little confusing as you'll see. Add the following code and then we'll discuss.

my.list[2]
[[1]]
[1] 2000

The index number 2 is a reference to the second value in the my.list object, which in this case is the number 2000. However, when you pass an index number inside a single pair of square brackets it actually returns another list object, this time with a single value. In this case, 2000 is the only value in the list, but it is a list object rather than a number.

4. Now add the code you see below to see how to pull out the actual value from the list rather than returning another list with a single value.

my.list[[2]]

In this case we pass a value of 2 inside a double pair of square brackets. Using two square brackets on either side of the index number will pull the actual value out of the list rather than returning a new list with a single value. In this case, the value 2000 is returned as a numeric value. This can be a little confusing the first few times you see and use this, but lists are a commonly used data type in R so you'll want to make sure you understand this concept.

5. There may be times when you want to pull multiple values from a list rather than just a single value. This is called list slicing and can be accomplished using the syntax you see below. In this case we pass in a vector of index numbers, created with c(), that identifies the positions of the values that should be retrieved. Try this on your own.

new.list <- my.list[c(1, 2)]
new.list
[[1]]
[1] "Streets"

[[2]]
[1] 2000

6. This returned a new list object stored in the variable new.list. Using basic list indexing you can then pull a value out of this list.

new.list[[2]]
[1] 2000

7. You can get the number of items in a list by calling the length() function. This returns the number of top-level items in the list. A nested list counts as a single item no matter how many values it contains. Calling the length() function in this exercise on the my.list variable should produce a result of 6.

length(my.list)

8. Finally, there may be times when you are uncertain if a variable is stored as a vector or a list. You can use the is.list() function, which will return a TRUE or FALSE value that indicates whether the variable is a list object.

is.list(my.list)
[1] TRUE

9. You can check your work against the solution file Chapter1_3.R.
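The difference between single and double brackets is easier to remember once you check the class of what comes back. The sketch below does that with the same my.list values. The named list at the end, along with its element names, is a hypothetical example I added to show another common way to reach into a list.

```r
my.list <- list("Streets", 2000, "Parcels", 5000, TRUE, FALSE)

# Single brackets return a list, even when only one item is requested.
single <- my.list[2]
class(single)   # "list"
length(single)  # 1

# Double brackets return the item itself.
value <- my.list[[2]]
class(value)    # "numeric"
value + 1       # 2001, arithmetic works because value is a number

# Hypothetical named list: items can be retrieved by name instead of position.
layer.info <- list(name = "Streets", count = 2000, visible = TRUE)
layer.info$count      # 2000
layer.info[["name"]]  # "Streets"
```

Naming the items is optional, but it usually makes code that works with lists easier to read than code that relies on index numbers.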

Exercise 4: Using data classes

In this exercise we'll take a look at matrices and data frames. A matrix in R is a structure very similar to a table in that it has columns and rows. This type of structure is commonly used in statistical operations. A matrix is created using the matrix() function. The number of columns and rows can be passed in as arguments to the function to define the attributes and data values of the matrix. A matrix might be created from the values found in the attribute table of a feature class. However, keep in mind that all the values in the matrix must be of the same data type.

Data frames in R are very similar to tables in that they have columns and rows. This makes them very similar to matrix objects as well. In statistics, a dataset will often contain multiple variables. For example, if you are analyzing real estate sales for an area there will be many factors including income, job growth, immigration, and others.

These individual variables are stored as the columns in a data frame. Data frames are most commonly created by loading an external file, database table, or URL containing tabular information using one of the many functions provided by R for importing a dataset. You can also manually enter the values. When manually entering the data the R console will display a spreadsheet style interface that you can use to define the column names as well as the row values. R includes many built-in datasets that you can use for learning purposes and these are stored as data frames.

1. Open RStudio and find the Console pane. It should be on the bottom, left-hand side of your screen.

2. Let's start with matrices. In the R Console create a new matrix as seen in the code example below. The c() function is used to define the data for the object. This matrix is composed of numeric data types. Remember that all values in the matrix must be of the same data type.

matrx <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, ncol = 3, byrow = TRUE)
matrx
     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7

3. You can name the columns in a matrix. Add the code you see below to name your columns, then type the name of the matrix to display it with the new column names.

colnames(matrx) <- c("POP2000", "POP2005", "POP2010")
matrx
     POP2000 POP2005 POP2010
[1,]       2       4       3
[2,]       1       5       7

4. Now let's retrieve a value from the matrix with the code you see below. The format is matrix[row, column].

matrx[2, 3]
POP2010
      7

5. You can also extract an entire row using the code you see below. Here we just provide a row value but no column.

matrx[2, ]
POP2000 POP2005 POP2010
      1       5       7

6. Or you can extract an entire column using the format you see below.

matrx[, 3]
[1] 3 7

7. You can also extract multiple columns at a time.

matrx[, c(1, 3)]
     POP2000 POP2010
[1,]       2       3
[2,]       1       7

8. You can also access columns or rows by name if you have named them.

matrx[, "POP2005"]
[1] 4 5

9. You can use the colSums(), colMeans() or rowSums() functions against the data as well.

colSums(matrx)
POP2000 POP2005 POP2010
      3       9      10
colMeans(matrx)
POP2000 POP2005 POP2010
    1.5     4.5     5.0
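Here is a compact sketch that rebuilds the same matrix and checks the summary values. The row names TractA and TractB are hypothetical labels I introduced so that rows can be addressed by name. The exercise itself only names the columns.

```r
matrx <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, ncol = 3, byrow = TRUE)
colnames(matrx) <- c("POP2000", "POP2005", "POP2010")

# Hypothetical row labels, added only for this illustration.
rownames(matrx) <- c("TractA", "TractB")

dim(matrx)                    # 2 3 (rows, then columns)
matrx["TractB", "POP2010"]    # 7

# Column and row summaries.
colSums(matrx)                # 3 9 10
colMeans(matrx)               # 1.5 4.5 5.0
rowSums(matrx)                # 9 13

# apply() runs any function over rows (1) or columns (2).
apply(matrx, 2, max)          # 2 5 7
```

The apply() call is a general-purpose alternative when there is no ready-made function such as colSums() for the summary you need.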

10. Now we'll turn our attention to data frames. Clear the R console and execute the data() function as seen below. This displays a list of all the sample datasets that are part of R. You can use any of these datasets.

11. For this exercise we'll use the USArrests data frame. Add the code you see below to display the contents of the USArrests data frame.

12. Next, we'll pull out the data for all rows from the Assault column.

USArrests$Assault
 [1] 236 263 294 190 276 204 110 238 335 211  46 120 249 113  56 115
[17] 109 249  83 300 149 255  72 259 178 109 102 252  57 159 285 254
[33] 337  45 120 151 159 106 174 279  86 188 201 120  48 156 145  81
[49]  53 161

13. A value from a specific row, column combination can be extracted using the code seen below where the row is specified as the first offset and the column is the second. This particular code extracts the assault value for Wyoming.

USArrests[50, 2]
[1] 161

14. If you leave off the column it will return all columns for that row.

USArrests[50, ]
        Murder Assault UrbanPop Rape
Wyoming    6.8     161       60 15.6
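Row and column numbers work, but data frames can also be indexed by name, which tends to be easier to read. The sketch below shows a few equivalent ways to inspect USArrests. The filter on assaults above 300 is an example I made up to show logical indexing, not a step from the exercise.

```r
# USArrests ships with R, so no import is needed.
nrow(USArrests)    # 50
names(USArrests)   # "Murder" "Assault" "UrbanPop" "Rape"

# Show only the first three rows instead of printing all 50.
head(USArrests, 3)

# Index by row name and column name rather than by position.
USArrests["Wyoming", "Assault"]   # 161

# Keep only the rows that satisfy a condition.
high <- USArrests[USArrests$Assault > 300, ]
rownames(high)     # "Florida" "North Carolina"
```

Indexing by name keeps working even if the order of rows or columns changes later, which positional offsets such as [50, 2] do not guarantee.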

The sample datasets included with R are good for learning purposes, but of limited usefulness beyond that. You're going to want to load datasets that are relevant to your line of work, and many of these datasets have a tabular structure that is conducive to the data frame object. Most of these datasets will need to be loaded from an external source such as delimited text files, database tables, web services, and others. You'll learn how to load these external datasets using R code in a later chapter of the book, but as you'll see in the next few steps you can also use the RStudio interface to load them as well.

15. In RStudio go to the File menu and select Import Dataset | From Text (readr). This will display the dialog seen in the screenshot below. We'll discuss the readr package in much more detail in a future chapter, but this package is used to efficiently read external data into a data frame.

16. Use the Browse button to browse to the StudyArea.csv file found in the Data folder where you installed the exercise data for this book. The StudyArea.csv file is a comma separated list of wildfires from 1980-2016 for the Western United States.

The data will be loaded into a preview window as seen below. There are a number of import options along with the code that will be executed. You can leave the default values in this case.

17. Click Import from this Import Text Data dialog. This will load the data into a data frame (technically called a tibble in tidyverse) called StudyArea. It will also use the View() function to display the results in a tabular view displayed in the screenshot below.

18. Messages, warnings, and errors from the import will be displayed in the Console window. You can ignore these messages for now. We'll discuss them in more detail in a later chapter.

This StudyArea data frame can then be used for data exploration and visualization, which we'll cover in future chapters.

19. You can check your work against the solution file Chapter1_4.R.

Exercise 5: Looping statements

Looping statements aren't used as much in R as they are in other languages because R has built-in support for vectorization. Vectorization is a built-in structure that automatically loops through a data structure without the need to write looping code. However, there may be times when you need to write looping code to accomplish a specific task that isn't handled by vectorization so you need to understand the syntax of looping statements in R. We'll take a look at a simple block of code that loops through the rows in a data frame.

For loops are used when you know exactly how many times to repeat a block of code. This includes the use of data frame objects that have a specific number of rows. For loops are typically used with vector and data frame structures.

1. For this brief exercise we'll use the StudyArea data frame that you imported from an external file in the last exercise. You will also learn how to create and execute an R script. A script is simply a series of commands that are run as a group rather than entering and running your code one line at a time from the Console window.

2. Create a new R script by going to File | New File | R Script from the RStudio interface.

3. Save the file with a name of Chapter1_5.R. You can place the script file wherever you'd like, but it is recommended that you save it to the folder where your exercise data is loaded.

4. Add the following lines of code to the Chapter1_5.R script.

for (fire in 1:nrow(StudyArea)) {
  print(StudyArea[fire, "TOTALACRES"])
}

5. Run the code by selecting Code | Run Region | Run All from the RStudio menu or by clicking the Source button on the script tab.

This will produce a stream of data that looks similar to what you see below. You will want to stop the execution of this script after it begins displaying data because of the amount of data and time it will take to print out all the information. The for loop syntax assigns each row number from the StudyArea data frame, in turn, to a variable called fire, which is then used to index that row. The total number of acres burned for each fire is then printed.

# A tibble: 1 x 1
  TOTALACRES
       <dbl>
1      0.100
# A tibble: 1 x 1
  TOTALACRES
       <dbl>
1         3.
# A tibble: 1 x 1
  TOTALACRES
       <dbl>
1      0.500
# A tibble: 1 x 1
  TOTALACRES
       <dbl>
1      0.100
# A tibble: 1 x 1
  TOTALACRES
       <dbl>

As I mentioned earlier, you won't often need to use for loops in R because of the built-in support for vectorization, but sooner or later you'll run into a situation where you need to create these looping structures.

6. You can check your work against the solution file Chapter1_5.R.
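To compare a for loop with its vectorized equivalent without waiting on the full wildfire dataset, the sketch below uses a three-row stand-in data frame. The fires object, its fire names, and its acreage values are invented for illustration and are not rows from StudyArea.csv.

```r
# Hypothetical stand-in for StudyArea with invented values.
fires <- data.frame(
  FIRENAME = c("ALPHA", "BRAVO", "CHARLIE"),
  TOTALACRES = c(0.1, 3, 0.5),
  stringsAsFactors = FALSE
)

# Loop version: fire holds the row number on each pass.
total_loop <- 0
for (fire in seq_len(nrow(fires))) {
  acres <- fires$TOTALACRES[fire]
  total_loop <- total_loop + acres
}

# Vectorized version: one call handles the whole column.
total_vec <- sum(fires$TOTALACRES)

all.equal(total_loop, total_vec)  # TRUE
```

I used seq_len(nrow(fires)) rather than 1:nrow(fires) because it produces an empty sequence when a data frame has zero rows, whereas 1:0 would run the loop body twice.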

Exercise 6: Decision support statements – if | else

Decision support statements enable you to write code that branches based upon specific conditions. The basic if | else statement in R is used for decision support. Basically, if statements are used to branch code based on a test expression. If the test expression evaluates to TRUE, then a block of code is executed. If the test evaluates to FALSE then the processing skips down to the first else if statement or an else statement if you don't include any else if statements.

Each if | else if | else statement has an associated code block that will execute when the statement evaluates to TRUE. Code blocks are denoted in R using curly braces as seen in the code example below.

You can include zero or more else if statements depending on what you're attempting to accomplish in your code. If no statements evaluate to TRUE, processing will execute the code block associated with the else statement.

1. In this exercise we'll build on the looping exercise by adding in an if | else if | else block that displays the fire names according to size.

2. Create a new R script by going to File | New File | R Script from the RStudio interface.

3. Save the file with a name of Chapter1_6.R. You can place the script file wherever you'd like, but it is recommended that you save it to the folder where your exercise data is loaded.

4. Copy and paste the for loop you created in the last exercise and saved to the Chapter1_5.R file into your new Chapter1_6.R file.

for (fire in 1:nrow(StudyArea)) {
  print(StudyArea[fire, "TOTALACRES"])
}

5. Add the if | else if block you see below. This script loops through all the rows in the StudyArea data frame and prints out messages that indicate when a fire has burned more than the specified number of acres for each category.

for (fire in 1:nrow(StudyArea)) {
  if (StudyArea[fire, "TOTALACRES"] > 100000) {
    print(paste("100K Fire: ", StudyArea[fire, "FIRENAME"], sep = ""))
  } else if (StudyArea[fire, "TOTALACRES"] > 75000) {
    print(paste("75K Fire: ", StudyArea[fire, "FIRENAME"], sep = ""))
  } else if (StudyArea[fire, "TOTALACRES"] > 50000) {
    print(paste("50K Fire: ", StudyArea[fire, "FIRENAME"], sep = ""))
  }
}

6. Run the code by selecting Code | Run Region | Run All from the RStudio menu or by clicking the Source button on the script tab. The script should start producing output in the Console pane similar to what you see below.

[1] "50K Fire: PIRU"
[1] "100K Fire: CEDAR"
[1] "50K Fire: MINE"
[1] "100K Fire: 24COMMAND"
[1] "50K Fire: RANCH"
[1] "75K Fire: HARRIS"
[1] "50K Fire: SUNNYSIDETURNOFF"
[1] "100K Fire: Range12"

7. You can optionally add an else block at the end that will print a message for any fire that isn't greater than 50,000 acres. Most of the fires in this dataset are less than 50,000 so you'll see a lot of messages that indicate this if you add the else block below.

for (fire in 1:nrow(StudyArea)) {
  if (StudyArea[fire, "TOTALACRES"] > 100000) {
    print(paste("100K Fire: ", StudyArea[fire, "FIRENAME"], sep = ""))
  } else if (StudyArea[fire, "TOTALACRES"] > 75000) {
    print(paste("75K Fire: ", StudyArea[fire, "FIRENAME"], sep = ""))
  } else if (StudyArea[fire, "TOTALACRES"] > 50000) {
    print(paste("50K Fire: ", StudyArea[fire, "FIRENAME"], sep = ""))
  } else {
    print("Not a MEGAFIRE")
  }
}

8. You can check your work against the solution file Chapter1_6.R.
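The branching logic from this exercise can be tested on its own by wrapping it in a small function and feeding it a handful of numbers. In the sketch below, classify_fire and the four acreage values are hypothetical and exist only to show which branch each size falls into.

```r
# Hypothetical helper that returns a label instead of printing it.
classify_fire <- function(acres) {
  if (acres > 100000) {
    "100K Fire"
  } else if (acres > 75000) {
    "75K Fire"
  } else if (acres > 50000) {
    "50K Fire"
  } else {
    "Not a MEGAFIRE"
  }
}

# Invented acreage values, one for each branch.
acres <- c(120000, 80000, 60000, 10)

# vapply() calls the function once per value and returns a character vector.
labels <- vapply(acres, classify_fire, character(1))
labels  # "100K Fire" "75K Fire" "50K Fire" "Not a MEGAFIRE"
```

The order of the tests matters: because the largest threshold is checked first, a 120,000 acre fire is labelled once as a 100K fire rather than also matching the 75K and 50K conditions. Note too that each else sits on the same line as the closing brace before it, which R requires when the code is typed at the top level of the Console.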

Exercise 7: Using functions

A function is a group of statements that execute together to accomplish some sort of task. Input variables can be passed into functions through what are known as parameters. Another name for parameters is arguments. These parameters become variables inside the function to which they are passed.

R packages include many pre-built functions that you can use to accomplish specific tasks, but you can also build your own functions. Functions take the form seen in the screenshot below.

Functions are assigned a name, can take zero or more arguments, each separated by a comma, have a body of statements that execute as a group, and can return a value. The body of a function is enclosed by curly braces (R lets you omit them only when the body is a single expression). This is where the work of the function is accomplished. Any variables defined inside the function or passed as arguments to the function become local variables that are only accessible from inside the function. The return() function is used to return a value to the code that initially called the function.

The way you call a function can differ a little. The basic form of calling a function is to reference the name of the function followed by any arguments inside parentheses just after the name of the function. When passing arguments to the function using this default syntax, you simply pass the value for the parameter, and it is assumed that you are passing them in the order that they were defined. In this case the order that you pass in the arguments is very important. The order must match the order that was used to define the function. This is illustrated in the code example below.

myfunction(2, 4)

If the function returns a value that you want to keep, assign the function call to a variable as seen in the code example below that creates a variable called z.

z = myfunction(2, 4)

Finally, while you don't have to specify the name of the argument you can do so if you'd like. In this case you simply pass in the name of the argument followed by an equal sign and then the value being passed for that argument. The code example below illustrates this optional way of calling a function.

myfunction(arg1 = 2, arg2 = 4)

In this exercise you'll learn how to call some of the built-in R functions.
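The calls to myfunction shown above assume that such a function already exists. The book does not define it at this point, so the sketch below supplies a hypothetical body, chosen so that swapping the two arguments visibly changes the result.

```r
# Hypothetical definition; the arithmetic was picked only so that argument order matters.
myfunction <- function(arg1, arg2) {
  result <- arg1 * 10 + arg2   # result is a local variable
  return(result)
}

# Positional call: 2 goes to arg1 and 4 goes to arg2.
z <- myfunction(2, 4)
z                                # 24

# Reversing the positions changes the answer.
myfunction(4, 2)                 # 42

# Named arguments can be supplied in any order.
myfunction(arg2 = 4, arg1 = 2)   # 24
```

Naming the arguments costs a little extra typing, but it protects you from passing values in the wrong order when a function takes several parameters.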

1. R includes a number of built-in functions for generating summary statistics for a dataset. In this exercise we'll call some of the functions on the StudyArea data frame that was created in Exercise 4: Using data classes. In the Console pane add the line of code you see below to call the mean() function. In this case, the TOTALACRES column from the StudyArea data frame will be passed as a parameter to the function. This function calculates the mean of a numeric dataset, which in this case will be 191.0917.

mean(StudyArea$TOTALACRES)
[1] 191.0917

2. Repeat this same process with the min(), max(), and median() functions.

3. The YEAR_ field in the StudyArea data frame contains the year in which the fire occurred. The substr() function can be used to extract a series of characters from a variable. Use the substr() function as seen below to extract the last two digits of the year.

substr(StudyArea$YEAR_, 3, 4)

4. You've seen examples of a number of other built-in R functions in previous exercises including print(), ls(), rm(), and others. The base R package contains many functions that can be used to accomplish various tasks. There are thousands of other third-party R packages that you can use as well, and they all contain additional functions for performing specific tasks. You can also create your own functions, and we'll do that in a future chapter.

5. You can check your work against the solution file Chapter1_7.R.
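If you don't have the StudyArea data frame loaded, the same built-in functions can be tried on a few throwaway values. The years and acres vectors below are invented for illustration. The na.rm argument is a standard option of mean() that the exercise doesn't need, but it is worth knowing for columns that contain missing values.

```r
# Invented sample values standing in for the YEAR_ and TOTALACRES columns.
years <- c("1980", "1999", "2016")
acres <- c(0.1, 3, 0.5, NA)

# substr(x, start, stop) keeps the characters from position start through stop.
substr(years, 3, 4)   # "80" "99" "16"

# Numbers are converted to text before the characters are extracted.
substr(2016, 3, 4)    # "16"

# A single missing value makes the plain mean NA.
mean(acres)                 # NA

# na.rm = TRUE drops missing values before calculating.
mean(acres, na.rm = TRUE)   # 1.2
max(acres, na.rm = TRUE)    # 3
```

If a summary function unexpectedly returns NA on your own data, a missing value somewhere in the column is the usual cause.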

Exercise 8: Introduction to tidyverse

While the base R package includes many useful functions and data structures that you can use to accomplish a wide variety of data science tasks, the third-party tidyverse package supports a comprehensive data science workflow as illustrated in the diagram below. The tidyverse ecosystem includes many sub-packages designed to address specific components of the workflow.

Tidyverse is a coherent system of packages for importing, tidying, transforming, exploring, and visualizing data. The packages of the tidyverse ecosystem were mostly developed by Hadley Wickham, but they are now being expanded by several contributors. Tidyverse packages are intended to make statisticians and data scientists more productive by guiding them through workflows that facilitate communication, and result in reproducible work products. Fundamentally, the tidyverse is about the connections between the tools that make the workflow possible. Let's briefly discuss the core packages that are part of tidyverse, and then we'll do a deeper dive into the specifics of the packages as we move through the book. We'll use these tools extensively throughout the book.

readr

The goal of readr is to facilitate the import of file-based data into a structured data format. The readr package includes seven functions for importing file-based datasets including csv, tsv, delimited, fixed width, white space separated, and web log files.

Data is imported into a data structure called a tibble. Tibbles are the tidyverse implementation of a data frame. They are quite similar to data frames, but are basically a newer, more advanced version. However, there are some important differences between tibbles and data frames. Tibbles never convert data types of variables. They never change the names of variables or create row names. Tibbles also have a refined print method that shows only the first 10 rows, and all columns that will fit on the screen. Tibbles also print the column type along with the name. We'll refer to tibbles as data frames throughout the remainder of the book to keep things simple, but keep in mind that you're actually going to be working with tibble objects. In the next chapter you'll learn how to use the read_csv() function to load csv files into a tibble object.

tidyr

Data tidying is a consistent way of organizing data in R, and can be facilitated through the tidyr package. There are three rules that we can follow to make a dataset tidy. First, each variable must have its own column. Second, each observation must have its own row, and finally, each value must have its own cell.

dplyr

The dplyr package is a very important part of tidyverse. It includes five key functions for transforming your data in various ways. These functions include filter(), arrange(), select(), mutate(), and summarize(). In addition, these functions all work very closely with the group_by() function. All five functions work in a very similar manner where the first argument is the data frame you're operating on, and the next N number of arguments are the variables to include. The result of calling all five functions is the creation of a new data frame that is a transformed version of the data frame passed to the function. We'll cover the specifics of each function in a later chapter.
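As a preview of how the five functions fit together, here is a minimal sketch that assumes the dplyr package is already installed. The fires data frame, its STATE column, and all of its values are invented for illustration. Only the FIRENAME and TOTALACRES column names mirror the wildfire data used earlier.

```r
library(dplyr)

# Hypothetical data with invented values.
fires <- data.frame(
  FIRENAME = c("ALPHA", "BRAVO", "CHARLIE", "DELTA"),
  STATE = c("CA", "CA", "OR", "OR"),
  TOTALACRES = c(120000, 300, 60000, 15),
  stringsAsFactors = FALSE
)

# filter() keeps rows, arrange() sorts them, select() keeps columns.
large <- filter(fires, TOTALACRES > 1000)
large <- arrange(large, desc(TOTALACRES))
large <- select(large, FIRENAME, TOTALACRES)

# mutate() adds a column calculated from existing ones.
fires_k <- mutate(fires, THOUSANDS = TOTALACRES / 1000)

# group_by() plus summarize() collapses each group to a single row.
by_state <- summarize(group_by(fires, STATE),
                      TOTAL = sum(TOTALACRES),
                      FIRES = n())
by_state
```

Each call takes a data frame as its first argument and returns a new one, so the original fires object is never modified.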

ggplot2

The ggplot2 package is a data visualization package for R, created by Hadley Wickham in 2005 and is an implementation of Leland Wilkinson's Grammar of Graphics.

Grammar of Graphics is a term used to express the idea of creating individual blocks that are combined into a graphical display. The building blocks used in ggplot2 to implement the Grammar of Graphics include data, aesthetic mapping, geometric objects, statistical transformations, scales, coordinate systems, position adjustments, and faceting.

Using ggplot2 you can create many different kinds of charts and graphs including bar charts, box plots, violin plots, scatterplots, regression lines, and more. There are a number of advantages to using ggplot2 versus other visualization techniques available in R. These advantages include a consistent style for defining the graphics, a high level of abstraction for specifying plots, flexibility, a built-in theming system for plot appearance, mature and complete graphics system, and access to many other ggplot2 users for support.
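To give a feel for how these building blocks are stacked together, here is a minimal sketch that assumes ggplot2 is installed. It plots the built-in USArrests data rather than anything from the book's exercise files, and the title and axis labels are my own wording.

```r
library(ggplot2)

# Data and aesthetic mapping come first; each + adds another building block.
p <- ggplot(data = USArrests, mapping = aes(x = UrbanPop, y = Assault)) +
  geom_point() +                             # geometric object: one point per state
  geom_smooth(method = "lm", se = FALSE) +   # statistical transformation: fitted line
  labs(title = "Assault arrests versus urban population",
       x = "Urban population (%)",
       y = "Assault arrests per 100,000") +
  theme_minimal()                            # built-in theme

# Printing the object draws the plot.
print(p)
```

Because the plot is stored in a variable, you can keep adding layers or swap the theme later without rewriting the earlier lines.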

Other tidyverse packages

The tidyverse ecosystem includes a number of other supporting packages including stringr, purrr, forcats, and others. In this book we'll focus primarily on the packages already described, but to round out your knowledge of tidyverse you can reference tidyverse.org.

Conclusion

In this chapter you learned the basics of using the RStudio interface for data visualization and exploration as well as some of the basic capabilities of the R language. After learning how to create variables and assign data, you learned some of the basic R data types including characters, vectors, factors, lists, matrices, and data frames. You also learned about some of the basic programming constructs including looping, decision support statements, and functions. Finally, you received an overview of the tidyverse package. In the next chapter you'll learn some basic data exploration and visualization techniques before we dive into the specifics in future chapters.

Chapter 2

The Basics of Data Exploration and Visualization with R

Now that you've gotten your feet wet with the basics of R we're going to turn our attention to covering some of the fundamental concepts of data exploration and visualization using tidyverse. This chapter is going to be a gentle introduction to some of the topics that we're going to cover in much more exhaustive detail in coming chapters. For now, I just want you to get a sense of what is possible using various tools in the tidyverse package.

This chapter will teach you fundamental techniques for how to use the readr package to load external data from a CSV file into R, the dplyr package to massage and manipulate data, and ggplot2 to visualize data. You'll also learn how to install the tidyverse ecosystem of packages and load the packages into the RStudio environment.

As I mentioned previously, this chapter is intended as a gentle introduction to what is possible rather than a detailed inspection of the packages. Future chapters will go into extensive detail on these topics. For now, I just want you to get a sense of what is possible even if you don't completely understand the details.

In this chapter we'll cover the following topics:

• Installing and loading tidyverse
• Loading and examining a dataset
• Filtering a dataset
• Grouping and summarizing a dataset
• Plotting a dataset

Exercise 1: Installing and loading tidyverse

In Chapter 1: Introduction to R you learned the basic concepts of the tidyverse package. We'll be using various packages from the tidyverse ecosystem throughout this book including readr, dplyr, and ggplot2 among others. Tidyverse is a third-party package so you'll need to install the package using RStudio so that it can be used in the exercises in this book. In this exercise you'll learn how to install tidyverse and load the package into your scripts.

1. Open RStudio.

2. The tidyverse package is really more an ecosystem of packages that can be used to carry out various data science tasks. When you install tidyverse it installs all of the packages that are part of tidyverse, many of which we discussed in the last chapter. Alternatively, you can install them individually as well. There are a couple ways that you can install packages in RStudio.

Locate the Packages pane in the lower right portion of the RStudio window. To install a new package using this pane, click the Install button shown in the screenshot below.

In the Packages textbox, type tidyverse. Alternatively, you can install the packages individually so instead of typing tidyverse you would type readr or ggplot2 or whatever package you want to install. We're going to use the readr, dplyr, and ggplot2 packages in this chapter and in many others so you can either install the entire tidyverse package, which includes the packages we'll use in this chapter plus a number of others, or install them individually. Go ahead and do that now.


3. The other way of installing packages is to use the install.packages() function as seen below. This function should be typed in the Console pane.

install.packages(<package>)

For example, if you wanted to install the dplyr package you would type:

install.packages("dplyr")

4. To use the functionality provided by a package it also needs to be loaded, either in an individual script that will use the package or from the Packages pane. To load a package from the Packages pane, simply click the checkbox next to the package as seen in the screenshot below.


5. You can also load a package from either a script or the Console pane by typing library(<package>). For example, to load the readr package you would type the following:

library(readr)
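The install and load steps above can also be scripted. The sketch below is one possible way to check which of the packages used in this chapter are already installed before attaching them. The variable names needed and installed are illustrative and are not part of the exercise files.

```r
# Packages used in this chapter.
needed <- c("readr", "dplyr", "ggplot2")

# requireNamespace() returns TRUE when a package is installed and can be loaded.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Report anything that still has to be installed from the Console.
if (any(!installed)) {
  message("Install first with install.packages(): ",
          paste(needed[!installed], collapse = ", "))
}

# Attach the packages that are available.
for (pkg in needed[installed]) {
  library(pkg, character.only = TRUE)
}
```

Checking before attaching keeps a script from stopping with an error on a machine where one of the packages has not been installed yet.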

Exercise 2: Loading and examining a dataset

The tidyverse package is designed to work with data stored in an object called a tibble. Tibbles are the tidyverse implementation of a data frame. They are quite similar to data frames, but are basically a newer, more advanced version.

There are some important differences between tibbles and data frames. Tibbles never convert the data types of variables. Also, they never change the names of variables or create row names. Tibbles also have a refined print method that shows only the first 10 rows, and all columns that will fit on the screen. Tibbles also print the column type along with the name. We’ll refer to tibbles as data frames throughout the remainder of this chapter to keep things simple, but keep in mind that you’re actually going to be working with tibble objects as opposed to the older data frame objects.

Getting data into a tibble object for manipulation, analysis, and visualization is normally accomplished through the use of one of the read functions found in the readr package. In this exercise you’ll learn how to read the contents of a CSV file into R using the read_csv() function found in the readr package.

1. Open RStudio.


2. In the Packages pane scroll down until you see the readr package and check the box just to the left, as seen in the screenshot from the last exercise in this chapter. Note: If you don’t see the readr package in the Packages pane it means that the package hasn’t been installed. You’ll need to go back to the last exercise and follow the instructions provided.

3. You will also need to set the working directory for the RStudio session. The easiest way to do this is to go to Session | Set Working Directory | Choose Directory and then navigate to the IntroR\Data folder where you installed the exercise data for this book.

4. The read_csv() function is going to be used to read the contents of a file called Crime_Data.csv. This file contains approximately 481,000 crime reports from Seattle, WA covering a span of approximately 10 years. If you have Microsoft Excel or some other spreadsheet type software, take a few moments to examine the contents of this file.

For each crime offense this file includes date and time information, crime categories and description, police department information including sector, beat, and precinct, and neighborhood name.

5. Find the RStudio Console pane and add the code you see below. This will read the data stored in the Crime_Data.csv file into a data frame (actually a tibble as discussed in the introduction) called dfCrime.

dfCrime = read_csv("Crime_Data.csv", col_names = TRUE)

6. You’ll see some messages indicating the column names and data types for each as seen below.

Parsed with column specification:
cols(
  `Report Number` = col_double(),
  `Occurred Date` = col_character(),
  `Occurred Time` = col_integer(),
  `Reported Date` = col_character(),
  `Reported Time` = col_integer(),
  `Crime Subcategory` = col_character(),
  `Primary Offense Description` = col_character(),
  Precinct = col_character(),
  Sector = col_character(),
  Beat = col_character(),
  Neighborhood = col_character()
)

7. You can get a count of the number of records with the nrow() function.

nrow(dfCrime)
[1] 481376

8. The View() function can be used to view the data in a tabular format as seen in the screenshot below.

View(dfCrime)

9. It will often be the case that you don’t need all the columns in the data that you import. The dplyr package includes a select() function that can be used to limit the fields in the data frame. In the Packages pane, load the dplyr library. Again, if you don’t see the dplyr library then it (or the entire tidyverse) will need to be installed.

10. In this case we’ll limit the columns to the following: Reported Date, Crime Subcategory, Primary Offense Description, Precinct, Sector, Beat, and Neighborhood. Add the code you see below to accomplish this.

dfCrime = select(dfCrime, 'Reported Date', 'Crime Subcategory', 'Primary Offense Description', 'Precinct', 'Sector', 'Beat', 'Neighborhood')

11. View the results.

View(dfCrime)

12. You may also want to rename columns to make them more reader friendly or perhaps simplify the names. The select() function can be used to do this as well. Add the code you see below to see how this works. You simply pass in the new name of the column followed by an equal sign and then the old column name.


dfCrime = select(dfCrime, 'CrimeDate' = 'Reported Date', 'Category' = 'Crime Subcategory', 'Description' = 'Primary Offense Description', 'Precinct', 'Sector', 'Beat', 'Neighborhood')
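To see the select-and-rename pattern in isolation, here is a minimal sketch that does not need the exercise data. The small crimes data frame and its values are made up for illustration. Only the column names mirror the ones used in this exercise.

```r
library(dplyr)

# A tiny stand-in for the crime data; the values are invented for illustration.
crimes <- data.frame(
  `Reported Date` = c("01/05/2015", "02/11/2016"),
  `Crime Subcategory` = c("BURGLARY-RESIDENTIAL", "CAR PROWL"),
  Beat = c("Q1", "D2"),
  Sector = c("Q", "D"),
  check.names = FALSE,      # keep the spaces in the column names
  stringsAsFactors = FALSE
)

# Keep three columns and rename two of them in the same call (new = old).
slim <- select(crimes,
               CrimeDate = `Reported Date`,
               Category = `Crime Subcategory`,
               Beat)

names(slim)
```

Columns that are not listed in the call, such as Sector here, are dropped from the result.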

Exercise 3: Filtering a dataset

In addition to limiting the columns that are part of a data frame, it’s also common to subset or filter the rows using a where clause. Filtering the dataset enables you to focus on a subset of the rows instead of the entire dataset. The dplyr package includes a filter() function that supports this capability. In this exercise you’ll filter the dataset so that only rows from a specific neighborhood are included.

1. In the RStudio Console pane add the following code. This will ensure that only crimes from the QUEEN ANNE neighborhood are included.

dfCrime2 = filter(dfCrime, Neighborhood == 'QUEEN ANNE')

2. Get the number of rows and view the data if you’d like with the View() function.

nrow(dfCrime2)
[1] 25172

3. You can also include multiple expressions in a filter() function. For example, the line of code below would filter the data frame to include only residential burglaries that occurred in the Queen Anne neighborhood. There is no need to add the line of code below. It’s just meant as an example. We’ll examine more complex filter expressions in a later chapter.

dfCrime3 = filter(dfCrime, Neighborhood == 'QUEEN ANNE', Category == 'BURGLARY-RESIDENTIAL')
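The following sketch shows how each additional expression passed to filter() narrows the result. It uses a small made-up data frame rather than the Seattle file, so the row counts are only meaningful for this toy data.

```r
library(dplyr)

# Invented rows that reuse the column names from this exercise.
crimes <- data.frame(
  Neighborhood = c("QUEEN ANNE", "QUEEN ANNE", "BALLARD", "QUEEN ANNE"),
  Category = c("BURGLARY-RESIDENTIAL", "CAR PROWL",
               "BURGLARY-RESIDENTIAL", "BURGLARY-RESIDENTIAL"),
  stringsAsFactors = FALSE
)

# One expression: every Queen Anne row is kept.
queen_anne <- filter(crimes, Neighborhood == "QUEEN ANNE")

# Two expressions: a row must satisfy both to be kept.
qa_burglary <- filter(crimes,
                      Neighborhood == "QUEEN ANNE",
                      Category == "BURGLARY-RESIDENTIAL")

nrow(queen_anne)   # 3
nrow(qa_burglary)  # 2
```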

Exercise 4: Grouping and summarizing a dataset

The group_by() function, found in the dplyr package, is commonly used to group data by one or more variables. Once grouped, summary statistics can then be generated for the group or you can visualize the data in various ways. For example, the crime dataset we’re using in this chapter could be grouped by offense, neighborhood, and year, and then summary statistics including the count, mean, and median number of burglaries by year could be generated.

It’s also very common to visualize these grouped datasets in different ways. Bar charts, scatterplots, or other graphs could be produced for the grouped dataset. In this exercise you’ll learn how to group data and produce summary statistics.

1. In the RStudio console window add the code you see below to group the crimes by police beat.

dfCrime2 = group_by(dfCrime2, Beat)

2. The n() function is used to get a count of the number of records for each group. Add the code you see below.

dfCrime2 = summarise(dfCrime2, n = n())

3. Use the head() function to examine the results.

head(dfCrime2)

# A tibble: 4 x 2
  Beat      n
  <chr> <int>
1 D2     4373
2 Q1       88
3 Q2    10851
4 Q3     9860
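The group-then-count pattern used above can be tried on a few rows without loading the crime file. This is a minimal sketch with invented beat values. The names by_beat and counts are illustrative.

```r
library(dplyr)

# Six made-up crime records, each tagged with a police beat.
crimes <- data.frame(
  Beat = c("Q1", "Q2", "Q2", "D2", "Q2", "D2"),
  stringsAsFactors = FALSE
)

# Step 1: mark the rows as belonging to groups, one group per beat.
by_beat <- group_by(crimes, Beat)

# Step 2: collapse each group to a single row holding its record count.
counts <- summarise(by_beat, n = n())

counts
```

The counts always add up to the number of rows in the ungrouped data, which is a quick sanity check after any group and count operation.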

Exercise 5: Plotting a dataset

The ggplot2 package can be used to create various types of charts and graphs from a data frame. The ggplot() function is used to define plots, and can be passed a number of parameters and joined with other functions to ultimately produce an output chart.

The first parameter passed to ggplot() will be the data frame you want to plot. Typically this will be a data frame object, but it can also be a subset of a data frame defined with the subset() function. For example, you could pass a variable called housing that contains a data frame, or you could pass a call to subset() that defines a filter to include only rows where the State variable is equal to MA or TX.

In this exercise you will create a simple bar chart from the data frame created in the previous exercises in this chapter.


1. In the RStudio console add the code you see below. The ggplot() function in this case is passed the dfCrime2 data frame created in the previous exercise. The geom_col() function is used to define the geometry of the graph (bar chart) and is passed a mapping parameter, which is defined by calling the aes() function and passing in the columns for the x axis (Beat) and the y axis (n = count).

ggplot(data = dfCrime2) + geom_col(mapping = aes(x = Beat, y = n), fill = "red")

2. This will produce the chart you see below in the Plots pane.
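If you don’t have dfCrime2 available, the same chart can be sketched from the four beat counts printed at the end of Exercise 4. The data frame name beat_counts and the labs() titles are additions for illustration and are not part of the exercise code.

```r
library(ggplot2)

# Beat counts copied from the summarise() output shown in Exercise 4.
beat_counts <- data.frame(
  Beat = c("D2", "Q1", "Q2", "Q3"),
  n = c(4373, 88, 10851, 9860),
  stringsAsFactors = FALSE
)

# geom_col() draws one bar per row, using n as the bar height.
p <- ggplot(data = beat_counts) +
  geom_col(mapping = aes(x = Beat, y = n), fill = "red") +
  labs(title = "Crimes by police beat", x = "Beat", y = "Number of crimes")

print(p)
```

Storing the plot in a variable before printing makes it easy to add more layers or labels later.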

Exercise 6: Graphing burglaries by month and year

In this exercise we’ll create something a little more complex. We’ll create a couple of bar charts that display the number of burglaries by year and by month for the Queen Anne neighborhood. In addition to the dplyr and ggplot2 packages we used previously in this chapter we’ll also use the lubridate package to manipulate date information.

1. In the RStudio Packages pane, load the lubridate package. The lubridate package is part of tidyverse and is used to work with dates and times. Also, make sure the readr, dplyr, and ggplot2 packages are loaded.

2. Load the crime data from the Crime_Data.csv file.

dfCrime = read_csv("Crime_Data.csv", col_names = TRUE)

3. Specify the columns and column names.

dfCrime = select(dfCrime, 'CrimeDate' = 'Reported Date', 'Category' = 'Crime Subcategory', 'Description' = 'Primary Offense Description', 'Precinct', 'Sector', 'Beat', 'Neighborhood')

4. Filter the records so that only residential burglaries in the Queen Anne neighborhood are retained.

dfCrime2 = filter(dfCrime, Neighborhood == 'QUEEN ANNE', Category == 'BURGLARY-RESIDENTIAL')


5. The dplyr package includes the ability to dynamically create new columns in a data frame through the manipulation of data from existing columns in the data frame. The mutate() function is used to create the new columns. Here the mutate() function will be used to extract the year from the CrimeDate column.

Add the following code to see this in action. The second parameter creates a new column called YEAR and populates it by using the year() function from the lubridate package. Inside the year() function the CrimeDate column, which is a character column, is converted to a date, and the format argument describes how the dates are written (month/day/year).

dfCrime3 = mutate(dfCrime2, YEAR = year(as.Date(dfCrime2$CrimeDate, format = '%m/%d/%Y')))

6. View the result. Notice the YEAR column at the end of the data frame. By default, the mutate() function adds new columns to the end of the data frame.

View(dfCrime3)

7. Now we’ll group the data by year and summarize by getting a count of the number of crimes per year. Add the following lines of code.

dfCrime4 = group_by(dfCrime3, YEAR)
dfCrime4 = summarise(dfCrime4, n = n())

8. View the result.

View(dfCrime4)


9. Create a bar chart by calling the ggplot() and geom_col() functions as seen below. Define YEAR as the column for the x axis and the number of crimes for the y axis. This should produce the chart you see below in the Plots pane.

ggplot(data = dfCrime4) + geom_col(mapping = aes(x = YEAR, y = n), fill = "red")


10. Now we’ll create another bar chart that displays the number of crimes by month instead of year. First, create a MONTH column using the mutate() function.

dfCrime3 = mutate(dfCrime2, MONTH = month(as.Date(dfCrime2$CrimeDate, format = '%m/%d/%Y')))

11. Group and summarize the data by month.

dfCrime4 = group_by(dfCrime3, MONTH)
dfCrime4 = summarise(dfCrime4, n = n())

12. View the result.

View(dfCrime4)

13. Create the bar chart.

ggplot(data = dfCrime4) + geom_col(mapping = aes(x = MONTH, y = n), fill = "red")


14. You can check your work against the solution file Chapter2_6.R.
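The date handling in this exercise is easier to follow on a few values. The sketch below uses three made-up dates written in the same month/day/year style. The ParsedDate column is an extra intermediate column added here for readability and does not appear in the exercise code.

```r
library(dplyr)
library(lubridate)

# Three invented report dates stored as text, as they are in the CSV file.
burglaries <- data.frame(
  CrimeDate = c("01/15/2015", "07/04/2016", "12/31/2016"),
  stringsAsFactors = FALSE
)

# Convert the text to Date values, then pull out the year and the month.
burglaries <- mutate(
  burglaries,
  ParsedDate = as.Date(CrimeDate, format = "%m/%d/%Y"),
  YEAR = year(ParsedDate),
  MONTH = month(ParsedDate)
)

# Count the records per year, as in steps 7 and 8.
per_year <- summarise(group_by(burglaries, YEAR), n = n())
per_year
```

Creating YEAR and MONTH in one mutate() call avoids converting the date column twice.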

Conclusion

In this chapter you learned some basic techniques for data exploration and visualization using the tidyverse package and its ecosystem of sub-packages. After installing and loading the package using RStudio you performed a number of tasks using the R programming language with a number of tidyverse sub-packages. You loaded a dataset from a CSV file using readr. Afterward, you manipulated the data in various ways using the dplyr package. The select() function was used to include and rename columns, and the contents of the data frame were filtered using the filter() function. The data was then grouped and summarized, and finally several graphs were produced using ggplot2.

In the next chapter you will learn more about how to use the readr package to load data from external data sources.

Chapter 3

Loading Data into R

Large data objects, typically stored as data frames in R, are most often read from external files. R, along with tidyverse, includes a number of functions that can read external data files from a wide variety of sources including text files of many varieties, relational databases, and web services. External text files need to have a specific format with the first line, called the header, containing the column names. Each additional line in the file will have values for each variable. In this chapter, we’ll examine a number of functions that can be used to read data.

There are a number of common data formats that can be read into and out of R. This includes text files in formats such as csv, txt, html, and json. It also includes files output from statistical applications including SAS and SPSS. Online resources including web services and HTML pages can also be read into R. Finally, relational and non-relational database tables can be read as well. There are a number of functions provided by R and tidyverse which will enable you to read these various sources.

In this chapter we’ll cover the following topics:

• Loading a csv file with read.table()
• Loading a csv file with read.csv()
• Loading a tab delimited file with read.table()
• Using readr to load data

Exercise 1: Loading a csv file with read.table()

The first function we’ll examine is read.table(). The read.table() function is a built-in R function that can be used to read various file formats into a data frame. This is probably the most common built-in function used for reading simple files into R. However, as we’ll see later in this chapter, tidyverse includes similar functions which are actually more efficient at reading external data into R.

The syntax for read.table() is to accept a filename, which will be the path and file name, along with a TRUE|FALSE indicator for the header. If set to TRUE the assumption is that column names are in the header line of the file. The path is not necessary if you have already set the working directory. The output of the read.table() function is a data frame object.

The header line, if included in the text file, supplies the column names for the data frame. Default names (V1, V2, and so on) will be used for the column headers if these are not provided. The file.choose() function is a handy function that you can use to interactively select the file you want imported rather than having to hard code the path to the dataset.

In this exercise you’ll learn how to use the read.table() function to load a csv format file.

1. Open RStudio and find the Console pane.

2. If necessary, set the working directory by typing the code you see below into the Console pane or by going to Session | Set Working Directory | Choose Directory from the RStudio menu.

setwd(<installation directory for exercise data>)

3. The Data folder contains a file called StudyArea.csv, which is a comma separated file containing wildfire data from the years 1980-2016 for the states of California, Oregon, Washington, Idaho, Montana, Wyoming, Colorado, Utah, Nevada, Arizona, and New Mexico. There are a little over 439,000 records in this file and there are 37 columns of information that describe each fire during this period.

Use the read.table() function to load this data into a new data frame object. What happens when you run this line of code?

df = read.table("StudyArea.csv", header = TRUE)

You will get an error message when you attempt to run this line of code. The error message should appear as seen below.

Error in read.table("StudyArea.csv", header = TRUE) : more columns than column names

The reason an error message was generated in this case is that the read.table() function uses whitespace as the default delimiter between fields, and our file uses commas as the delimiter.

4. Update your call to read.table() as seen below to include the sep argument, which should be a comma.

df = read.table("StudyArea.csv", sep = ",", header = TRUE)


When you run this line of code you’ll see a new error.

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 12 did not have 14 elements

The read.table() function will NOT automatically fill in any missing values with a default value such as NA, so because some of the columns are empty in our rows we get an error message that indicates a particular line didn’t have all 14 columns of information. We can fix this by adding the fill parameter and setting it equal to TRUE.

5. Update your code as seen below to add the fill parameter.

df = read.table("StudyArea.csv", header = TRUE, fill = TRUE, sep = ",")

When you run this line of code it will import the contents of the file into a data frame object. However, if you look at the Environment tab in RStudio you will see that it only loaded 153,095 records and yet we know there are over 400,000 records in the file. Quotes (single or double) in a csv file can cause records not to be loaded.

6. Let’s add one more parameter to handle records that were thrown out due to quotes.

df = read.table("StudyArea.csv", header = TRUE, fill = TRUE, quote = "", sep = ",")

When you execute this line of code, 440,476 records should be imported. The data is loaded into an R data frame object, which is a structure that resembles a table. Detailed information about data frame objects will be covered in a later section of the course. For now, you can think of them as tables containing columns and rows.

My point in showing you this is to show how difficult it can be to use the read.table() function to load the contents of a csv file. The read.table() function is typically used to load tab delimited text files, but many people will attempt to use the read.table() function with csv format files without understanding all the parameters that may need to be included. Instead, you should use read.csv() as we’ll do in the next exercise.

7. You can check your work against the solution file Chapter3_1.R.
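The three arguments added in this exercise can be tried out on a tiny file that has the same kinds of problems: a name containing a space, a name containing an apostrophe, and a row with a missing trailing value. The sketch below writes such a file to a temporary location. The file contents and the name fires are made up for illustration and are not taken from StudyArea.csv.

```r
# Write a small comma separated file to a temporary path.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "FIRENAME,STATE,TOTALACRES",
  "PUMP HOUSE,California,0.1",   # value containing a space
  "O'BRIEN,Oregon,12.5",         # value containing a single quote
  "HILL,Utah"                    # last field is missing
), csv_path)

# sep = ","   : split fields on commas instead of whitespace
# fill = TRUE : pad short rows with NA instead of raising an error
# quote = ""  : treat quote characters as ordinary text
fires <- read.table(csv_path, header = TRUE, sep = ",",
                    fill = TRUE, quote = "")

fires
```

With these settings all three data rows are read, and the missing acreage on the last row comes through as NA.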

Exercise 2: Loading a csv file with read.csv()


The read.csv() function is also a built-in R function that is almost identical to read.table(), with the exception that the separator defaults to a comma and the header and fill arguments are set to TRUE by default. In this exercise you’ll see how much easier it is to load a csv file using read.csv().

1. The read.csv() function automatically handles most of the situations you are required to identify when using read.table() to load a csv file. Enter and run the code you see below to see how much easier this is with read.csv().

df = read.csv("StudyArea.csv")

2. This will load the 400,000+ records from the csv file! See how much easier that is? There will be a few records missing, but overall this function is much easier to use than read.table().

3. You can check your work against the solution file Chapter3_2.R.
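As a small illustration of those defaults, the sketch below reads a few lines of comma separated text with no extra arguments. The rows are invented, and the text argument is used only so the example does not depend on a file on disk.

```r
# Each element of the vector is one line of a csv file.
csv_lines <- c(
  "FIRENAME,STATE,TOTALACRES",
  "PUMP HOUSE,California,0.1",
  "I5,California,3",
  "HILL,Utah"                 # last field is missing
)

# header = TRUE, sep = "," and fill = TRUE are already the defaults here.
fires <- read.csv(text = csv_lines)

fires
```

The short last row is padded with NA without having to ask for it, which is the main convenience over read.table().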

Exercise 3: Loading a tab delimited file with read.table()

The read.table() function is most often used to read the contents of a tab delimited file. In this exercise you’ll learn how to do that.

1. Your Data folder includes a file called all_genes_pombase.txt, which is tab delimited. Open this file with Excel or some other application to see the field structure and delimiters.

2. In the R Console window enter and run the code you see below to import the file.

df2 = read.table("all_genes_pombase.txt", header = TRUE, sep = "\t", quote = "")

3. This should load 7019 records into the data frame. You’ll notice that many of the parameters still need to be used when loading the dataset, so it’s not as easy to use as you might hope even in this case.

4. You can check your work against the solution file Chapter3_3.R.
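Here is a minimal sketch of the same call on a tiny tab delimited file. The column names gene_id and product and both rows are hypothetical and are not taken from all_genes_pombase.txt. They only show why quote = "" matters when a text field contains apostrophes.

```r
# Write a two-row tab delimited file to a temporary path.
tsv_path <- tempfile(fileext = ".txt")
writeLines(c(
  "gene_id\tproduct",
  "geneA\tprotein kinase",
  "geneB\t5'-3' exonuclease"   # text containing single quotes
), tsv_path)

# sep = "\t" splits on tabs; quote = "" keeps the apostrophes as plain text.
genes <- read.table(tsv_path, header = TRUE, sep = "\t", quote = "",
                    stringsAsFactors = FALSE)

genes
```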

Exercise 4: Using readr to load data

So far in this chapter we’ve been looking at various built-in R functions for reading external files into R as data frames. The tidyverse package includes a sub-package called readr that can also be used to load external data. The readr package includes a read_csv() function that loads data much faster than the built-in read.csv() function.

In addition to loading the data faster it also includes a progress dialog, and the output includes the data frame column structure along with any parsing errors. Overall, the read_csv() function in the readr package is preferred over the functions found in the basic installation of R. The readr package also includes some other functions for loading various file formats including read_delim(), read_csv2(), and read_tsv(). Each of the functions accepts the same parameters, so once you’ve learned to use any of the readr functions for loading data you can easily use any of the others.

In this exercise you’re going to use the read_csv() function found in the readr package to load data into a data frame.

1. Load the readr library.

library(readr)

2. The read_csv() function in the readr package can be used to load csv files. Compared to the base loading functions we looked at previously in this chapter, readr functions are significantly faster (10x), include a helpful progress bar to provide feedback on the progress of the load for large files, and all the functions work exactly the same way.

Add and run the code you see below. Notice how much more quickly the data loads into the data frame object. The col_types argument was used in this case to load all the columns as a character data type for simplification purposes. Otherwise we’d have to do some additional preprocessing of the data to account for various column data types.

dfReadr = read_csv("StudyArea.csv", col_types = cols(.default = "c"), col_names = TRUE)

Other loading functions found in the readr package include read_delim(), read_csv2(), and read_tsv().

3. Now let’s run this function again, but this time take off the col_types argument so you can see an example of some of the potential loading errors that can occur. Update and run your code as follows:


dfReadr = read_csv("StudyArea.csv", col_names = TRUE)

4. The first thing you’ll see is a list of the columns that will be imported along with the column data type. Your output should appear as follows:

Parsed with column specification:
cols(
  .default = col_character(),
  FID = col_integer(),
  UNIT = col_integer(),
  FIRENUMBER = col_integer(),
  SPECCAUSE = col_integer(),
  STATCAUSE = col_integer(),
  SIZECLASSN = col_integer(),
  FIRETYPE = col_integer(),
  PROTECTION = col_integer(),
  FIREPROTTY = col_integer(),
  YEAR_ = col_integer(),
  FiscalYear = col_integer(),
  STATE_FIPS = col_integer(),
  FIPS = col_integer(),
  DLATITUDE = col_double(),
  DLONGITUDE = col_double(),
  TOTALACRES = col_double(),
  TRPGENCAUS = col_integer(),
  TRPSPECCAU = col_integer(),
  Duplicate_ = col_integer()
)

5. A warning message will be displayed below that indicating that there were parsing errors on the load.

Warning: 196742 parsing failures.
     row col   expected   actual file
1 242621 UNIT  an integer EOR    'StudyArea.csv'
2 242622 UNIT  an integer EOR    'StudyArea.csv'
3 242623 UNIT  an integer EOR    'StudyArea.csv'
4 242624 UNIT  an integer EOR    'StudyArea.csv'
5 242625 UNIT  an integer EOR    'StudyArea.csv'

6. You can use the problems() function to get a list of the parsing errors. Add and run the code you see below.

problems(dfReadr)

# A tibble: 196,742 x 5
      row col   expected   actual file
    <int> <chr> <chr>      <chr>  <chr>
 1 242621 UNIT  an integer EOR    'StudyArea.csv'
 2 242622 UNIT  an integer EOR    'StudyArea.csv'
 3 242623 UNIT  an integer EOR    'StudyArea.csv'
 4 242624 UNIT  an integer EOR    'StudyArea.csv'
 5 242625 UNIT  an integer EOR    'StudyArea.csv'
 6 242626 UNIT  an integer EOR    'StudyArea.csv'
 7 242627 UNIT  an integer EOR    'StudyArea.csv'
 8 242628 UNIT  an integer EOR    'StudyArea.csv'
 9 242629 UNIT  an integer EOR    'StudyArea.csv'
10 242630 UNIT  an integer EOR    'StudyArea.csv'
# ... with 196,732 more rows

7. From the looks of the error messages it appears there is an issue with the UNIT column. If you look back up to the list of columns and data types, you’ll notice that the UNIT column was created as an integer data type. However, if you open the StudyArea.csv file in Excel or another application you’ll quickly see that not all the values are numeric. Some include letters. This accounts for the parsing errors in the dataset.

Update your code as seen below and run it again. This sets the UNIT column to a character (text) data type.

dfReadr = read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)

This time you should get a clean load of the dataset. That doesn’t mean the data won’t need some additional preparation and cleanup. For example, there are some date fields including STARTDATED that were loaded as character but might be better off as date fields. We can save this additional preparation work for a later exercise though.

8. You can examine the first few lines of the data frame by entering the head() function as seen below.

head(dfReadr)

# A tibble: 6 x 14
    FID ORGANIZATI UNIT  SUBUNIT SUBUNIT2       FIRENAME  CAUSE YEAR_ STARTDATED  CONTRDATED OUTDATED STATE STATE_FIPS
  <int> <chr>      <chr> <chr>   <chr>          <chr>     <chr> <int> <chr>       <chr>      <chr>    <chr>      <int>
1     0 FWS        81682 USCADBR San Diego Bay… PUMP HOU… Human  2001 1/1/01 0:00 1/1/01 0:… NA       Cali…          6
2     1 FWS        81682 USCADBR San Diego Bay… I5        Human  2002 5/3/02 0:00 5/3/02 0:… NA       Cali…          6
3     2 FWS        81682 USCADBR San Diego Bay… SOUTH BAY Human  2002 6/1/02 0:00 6/1/02 0:… NA       Cali…          6
4     3 FWS        81682 USCADBR San Diego Bay… MARINA    Human  2001 7/12/01 0:… 7/12/01 0… NA       Cali…          6
5     4 FWS        81682 USCADBR San Diego Bay… HILL      Human  1994 9/13/94 0:… 9/13/94 0… NA       Cali…          6
6     5 FWS        81682 USCADBR San Diego Bay… IRRIGATI… Human  1994 4/22/94 0:… 4/22/94 0… NA       Cali…          6
# ... with 1 more variable: TOTALACRES <dbl>

9. You can check your work against the solution file Chapter3_4.R.
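The effect of col_types can be reproduced on a tiny file in which the UNIT column mixes numbers and letters, as it does in StudyArea.csv. The three rows below are made up for illustration. The sketch shows the two approaches used in this exercise: overriding one column, and reading everything as character.

```r
library(readr)

# A small file where UNIT looks numeric at first but also contains text.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "FID,UNIT,TOTALACRES",
  "0,81682,0.1",
  "1,EOR,3.0",
  "2,81683,0.5"
), csv_path)

# Override only UNIT; the remaining column types are still guessed by readr.
fires <- read_csv(csv_path, col_types = cols(UNIT = col_character()))

# Alternatively, read every column as character and convert later.
fires_chr <- read_csv(csv_path, col_types = cols(.default = "c"))

fires
```

Overriding a single column keeps the numeric columns usable for calculations, while the all-character load defers every type decision to a later cleanup step.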

Conclusion

In this chapter you learned various functions for loading an external data file including the built-in R functions read.table() and read.csv(). While these functions can certainly get the job done, the read_csv() function found in the readr package is a much more efficient function for loading external data. In the next chapter you will learn how to transform your datasets using the dplyr package. You’ll learn techniques for filtering the contents of a data frame, selecting specific columns to be used, arranging rows in ascending or descending order, and summarizing and grouping a dataset.

Chapter 4

Transforming Data

Before a dataset can be analyzed in R it often needs to be manipulated or transformed in various ways. The dplyr package, part of the larger tidyverse package, provides a set of functions that allow you to transform a dataset in various ways. The dplyr package is a very important part of tidyverse since the functions provided through this package are used so frequently to transform data in different ways prior to doing more advanced data exploration, visualization, and modeling.

There are five key functions that are part of dplyr: filter(), arrange(), select(), mutate(), and summarize(). All five functions work in a similar manner: the first argument is the data frame to manipulate, the remaining arguments define the columns or expressions to use, and each function returns a data frame as a result.

The dplyr functions are often used in conjunction with the group_by() dplyr function to manipulate a dataset that has been grouped in some way. The group_by() function creates a new data frame object that has been grouped by one or more variables.

In this chapter we’ll cover the following topics:

• Filtering records to create a subset
• Narrowing the list of columns
• Arranging rows in ascending or descending order
• Adding rows
• Summarizing and grouping
• Piping for code efficiency

Exercise 1: Filtering records to create a subset

The first dplyr function that we’ll examine is filter(). The filter() function is used to create a subset of records based on some value. For example, you might want to create a data frame of wildfires containing incidents that have burned more than 25,000 acres. As long as you have an existing data frame that includes a column that measures the number of acres burned, you can accomplish the creation of this subset using the filter() function.


As will be the case with all the dplyr functions we examine, the first argument passed to the filter() function is a data frame object. Each additional parameter passed to the function is a conditional expression used to filter the data frame. For example, take a look at the line of code below. This statement calls the filter() function to create a new variable called df25k, which will contain only rows where the ACRES column contains a value greater than or equal to 25000.

df25k = filter(df, ACRES >= 25000)

This is an example of calling the filter() function and passing a single conditional expression. In the next code example, two conditional expressions are passed. The first is used to filter records so that the number of acres is greater than or equal to 25000, and the second filters records so that only records where the YEAR column contains a value of 2016 will be retained.

df25k = filter(df, ACRES >= 25000, YEAR == 2016)

In this case, the df25k variable will include records where both conditions are matched: acreage burned is at least 25000 and the fire year was 2016. This can also be rewritten as a single parameter that uses the & operator to combine expressions as seen below.

df25k = filter(df, ACRES >= 25000 & YEAR == 2016)

In this exercise you’ll learn how to use the filter() function to create a subset of records based on some value.

1. The exercises in this chapter require the following packages: readr, dplyr, ggplot2. They can be loaded from the Packages pane, the Console pane, or a script.

2. Open RStudio and find the Console pane.

3. If necessary, set the working directory by typing the code you see below into the Console pane or by going to Session | Set Working Directory | Choose Directory from the RStudio menu.

setwd(<installation directory for exercise data>)

4. The Data folder contains a file called StudyArea.csv, which is a comma separated file containing wildfire data from the years 1980-2016 for the states of California, Oregon, Washington, Idaho, Montana, Wyoming, Colorado, Utah, Nevada, Arizona, and New Mexico. There are a little over 439,000 records in this file and there are 37 columns of information that describe each fire during this period.

Use the read_csv() function to load the dataset into a data frame.

dfFires = read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)

5. Use the nrow() function to make sure that the approximately 439,000 records were loaded.

nrow(dfFires)
[1] 439362

6. Initially we’ll use a single conditional expression with the filter() function to create a subset of records that contains only wildfires of 25,000 acres or more. Add the code you see below to run the filter() function. All dplyr functions, including filter(), return a new data frame object, so you need to specify a new variable that will contain the output data frame. The df25k variable will hold the output data frame in this case.

df25k = filter(dfFires, TOTALACRES >= 25000)

Get a count of the number of records that match the filter. There should be 655 rows. You may also want to use the View(df25k) function to see the data in a tabular format.

nrow(df25k)
[1] 655

7. You can also include multiple conditional expressions as part of the filter. Each expression (argument) is combined with an “and” clause by default. This means that all expressions must be matched for a record to be returned. Add and run the code you see below to see an example.

df1k = filter(dfFires, TOTALACRES >= 1000, YEAR_ == 2016)
nrow(df1k)
[1] 152

8. You can also combine the expressions into a single expression with multiple conditions as seen below. This will accomplish the same thing as the previous line of code. Which of the two you use is a matter of personal preference in this case since we're using an "and" clause. The & character is the "and" operator. You would need to use the | character to include an "or" operator.

df1k = filter(dfFires, TOTALACRES >= 1000 & YEAR_ == 2016)

9. Finally, when you have a list of potential values that you want to be included by the filter, the %in% operator can be used. Add the line of code below to see how this works. This particular line of code would create a data frame containing fires that occurred in the years 2010, 2011, or 2012.

dfYear = filter(dfFires, YEAR_ %in% c(2010, 2011, 2012))

10. You can view any of these data frames in a tabular view using the View(<data frame>) syntax. For example, View(dfYear).

11. You can check your work against the solution file Chapter4_1.R.
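The three filtering styles from this exercise can be compared side by side on a tiny data frame. This is a minimal sketch only: the fires data frame and its values are made up for illustration, with column names borrowed from StudyArea.csv.

```r
library(dplyr)

# Hypothetical toy data; the values are invented, not taken from StudyArea.csv
fires <- tibble(
  FIRENAME   = c("A", "B", "C", "D"),
  TOTALACRES = c(500, 1500, 30000, 2000),
  YEAR_      = c(2016, 2016, 2011, 2010)
)

# Comma-separated conditions and & both mean "and"
and_comma <- filter(fires, TOTALACRES >= 1000, YEAR_ == 2016)
and_amp   <- filter(fires, TOTALACRES >= 1000 & YEAR_ == 2016)
identical(and_comma$FIRENAME, and_amp$FIRENAME)

# | keeps rows that satisfy either condition
either <- filter(fires, TOTALACRES >= 25000 | YEAR_ == 2016)

# %in% tests membership in a vector of values
early <- filter(fires, YEAR_ %in% c(2010, 2011, 2012))
```

Only fire B passes both "and" filters, while the "or" filter keeps A, B, and C.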

Exercise 2: Narrowing the list of columns with select()

Many datasets that you load from external data sources include dozens of columns. The StudyArea.csv file that you've been working with in the exercises includes 37 columns of information. In most cases you won't need all the columns.

The select() function can be used to narrow down the list of columns to include only those needed for a task. To use the select() function, simply pass in the name of the data frame along with the columns to include.

1. Use the read_csv() function to load the dataset into a data frame.

Note: For the sake of completeness you will be loading the external data from the StudyArea.csv file to the dfFires data frame, but this step isn't absolutely necessary if you're doing the exercises in sequence in the same RStudio session.

dfFires = read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)

2. On a new line, add a call to the select() function as seen below to limit the columns that are returned.

dfFires2 = select(dfFires, FIRENAME, TOTALACRES, YEAR_)

3. Display the first few rows and notice that we now have only three columns.

head(dfFires2)

  FIRENAME   TOTALACRES YEAR_
  <chr>           <dbl> <int>
1 PUMP HOUSE      0.100  2001
2 I5              3.00   2002
3 SOUTHBAY        0.500  2002
4 MARINA          0.100  2001
5 HILL            1.00   1994
6 IRRIGATION      0.100  1994

4. Many of the column names that you import will not be very reader friendly, so it's not uncommon to want to rename the columns. This can also be accomplished using the select() function. Rename your columns by adding and running the code you see below.

dfFires2 = select(dfFires, "FIRE" = "FIRENAME", "ACRES" = "TOTALACRES", "YR" = "YEAR_")

5. Display the first few lines.

head(dfFires2)

  FIRE       ACRES    YR
  <chr>      <dbl> <int>
1 PUMP HOUSE 0.100  2001
2 I5         3.00   2002
3 SOUTHBAY   0.500  2002
4 MARINA     0.100  2001
5 HILL       1.00   1994
6 IRRIGATION 0.100  1994

6. There are also a number of handy helper functions that you can use with the select() function to filter the returned columns. These include starts_with(), ends_with(), contains(), matches(), and num_range(). To see how this works, add and run the code you see below. This will return any columns that contain the word DATE.

dfFires3 = select(dfFires, contains("DATE"))
head(dfFires3)

  STARTDATED   CONTRDATED   OUTDATED
  <chr>        <chr>        <chr>
1 1/1/01 0:00  1/1/01 0:00  NA
2 5/3/02 0:00  5/3/02 0:00  NA
3 6/1/02 0:00  6/1/02 0:00  NA
4 7/12/01 0:00 7/12/01 0:00 NA
5 9/13/94 0:00 9/13/94 0:00 NA
6 4/22/94 0:00 4/22/94 0:00 NA

7. You can also make multiple calls to these helper functions.

dfFires3 = select(dfFires, contains("DATE"), starts_with("TOTAL"))
head(dfFires3)

  STARTDATED   CONTRDATED   OUTDATED TOTALACRES
  <chr>        <chr>        <chr>         <dbl>
1 1/1/01 0:00  1/1/01 0:00  NA            0.100
2 5/3/02 0:00  5/3/02 0:00  NA            3.00
3 6/1/02 0:00  6/1/02 0:00  NA            0.500
4 7/12/01 0:00 7/12/01 0:00 NA            0.100
5 9/13/94 0:00 9/13/94 0:00 NA            1.00
6 4/22/94 0:00 4/22/94 0:00 NA            0.100

8. You can check your work against the solution file Chapter4_2.R.
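As a small illustration of the column-selection ideas in this exercise, the sketch below contrasts select() with dplyr's rename() function, which is not used in the exercise itself. The fires data frame is a made-up stand-in for dfFires.

```r
library(dplyr)

# Hypothetical toy data with a few of the StudyArea.csv column names
fires <- tibble(
  FIRENAME   = c("A", "B"),
  TOTALACRES = c(0.1, 3),
  YEAR_      = c(2001L, 2002L),
  STARTDATED = c("1/1/01 0:00", "5/3/02 0:00"),
  OUTDATED   = c(NA_character_, NA_character_)
)

# select() keeps only the listed columns and can rename them (new = old)
picked <- select(fires, FIRE = FIRENAME, ACRES = TOTALACRES)

# rename() changes a name but keeps every column
renamed <- rename(fires, YR = YEAR_)

# Helper functions match columns by name pattern
dates <- select(fires, ends_with("DATED"), starts_with("TOTAL"))
names(dates)
```

Use select() when you want to drop columns at the same time, and rename() when the full set of columns should be preserved.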

Exercise 3: Arranging Rows

The arrange() function in the dplyr package can be used to order the rows in a data frame. This function accepts a set of columns to order by, with the default row ordering being ascending. However, you can pass the desc() helper function to order the rows in descending order. Missing values will be placed at the end of the data frame.

1. Use the read_csv() function to load the dataset into a data frame.

dfFires = read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)

2. Filter the dataset so that it contains only fires of 1,000 acres or more from the year 2016.

df1k = filter(dfFires, TOTALACRES >= 1000, YEAR_ == 2016)

3. Add and run the code you see below to create a subset of columns and rename them.

df1k = select(df1k, "NAME" = "FIRENAME", "ACRES" = "TOTALACRES", "YR" = "YEAR_")

4. Sort the rows so that they are in ascending order.

arrange(df1k, ACRES)

   NAME        ACRES    YR
   <chr>       <dbl> <int>
 1 Crackerbox  1000.  2016
 2 Lakes       1000.  2016
 3 Choulic 2   1008.  2016
 4 Amigo Wash  1020.  2016
 5 Granite     1030.  2016
 6 Tie         1031.  2016
 7 Black       1040.  2016
 8 Bybee Creek 1072.  2016
 9 MARSHES     1080.  2016
10 Bug Creek   1089.  2016

5. Use the desc() helper function to order the rows in descending order.

arrange(df1k, desc(ACRES))

   NAME         ACRES    YR
   <chr>        <dbl> <int>
 1 PIONEER    188404.  2016
 2 Junkins    181320.  2016
 3 Range 12   171915.  2016
 4 Erskine     48007.  2016
 5 Cedar       45977.  2016
 6 Maple       45425.  2016
 7 Rail        43799.  2016
 8 North Fire  42102.  2016
 9 Laidlaw     39813.  2016
10 BLUE CUT    36274.  2016

6. You can use the View() function as a wrapper around these calls to view the data in a tabular grid view by adding the code you see below.

View(arrange(df1k, desc(ACRES)))

7. You can check your work against the solution file Chapter4_3.R.
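The handling of missing values and of multiple sort columns can be seen in a short sketch. The data frame below is invented for illustration; the second sort column acts as a tie breaker.

```r
library(dplyr)

# Hypothetical toy data with one missing acreage value
fires <- tibble(
  NAME  = c("A", "B", "C", "D"),
  ACRES = c(1500, NA, 1000, 1500),
  YR    = c(2016, 2016, 2015, 2014)
)

# Ascending order: the row with the missing value (B) is placed last
asc <- arrange(fires, ACRES)

# Descending order: the missing value is still placed last
dsc <- arrange(fires, desc(ACRES))

# Ties in ACRES (A and D) are broken by the second column, YR
tie <- arrange(fires, desc(ACRES), YR)
tie$NAME
```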

Exercise 4: Adding Columns with mutate()

The mutate() function is used to add new columns to a data frame that are the result of a function you run on other columns in the data frame. Any new columns created with the mutate() function will be added to the end of the data frame. This function can be incredibly useful for dynamically creating new columns that are the result of operations performed on other columns from the data frame. In this exercise you'll learn how the mutate() function can be used to create new columns in a data frame.

1. You're going to need the lubridate package for this exercise. The lubridate package is part of tidyverse and is used to work with dates and times. In RStudio, check the Packages tab to make sure that lubridate has been installed and loaded as seen in the screenshot below. If not, you'll need to do so now using the instructions for installing and loading a package covered in Chapter 1: Introduction to R.

2. Recall from Chapter 1: Introduction to R that you can also load an installed library using the syntax seen below.

library(lubridate)

3. Use the read_csv() function to load the dataset into a data frame.

dfFires = read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)

4. Use the select() function to define a set of columns for the data frame.

df = select(dfFires, ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE, STARTDATED)

5. Do some basic filtering of the data so that only fires that burned 1,000 acres or more and have a cause of Human or Natural are included. There are some records marked as Unknown in the dataset, so we'll remove those for this exercise.

df = filter(df, TOTALACRES >= 1000 & CAUSE %in% c('Human', 'Natural'))

6. Use the mutate() function to create a new DOY column that contains the day of the year that the fire started. The yday() function from the lubridate package is used to return the day of the year using a formatted input date from the STARTDATED column.

df = mutate(df, DOY = yday(as.Date(STARTDATED, format = '%m/%d/%y %H:%M')))

7. View the resulting DOY column.

View(df)

8. You can check your work against the solution file Chapter4_4.R.

9. In the next exercise the mutate() function will be used again when we create a column that holds the decade of the fire and then calculate the mean acreage burned by decade.
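To check how the date format string and yday() behave, it can help to run them on a few known dates first. This is a minimal sketch with made-up start dates; the intermediate START column is an assumption added for illustration and is not part of the exercise.

```r
library(dplyr)
library(lubridate)

# Hypothetical start dates in the same text format as the STARTDATED column
fires <- tibble(
  STARTDATED = c("1/1/01 0:00", "7/12/01 0:00", "12/31/16 0:00")
)

fires <- mutate(
  fires,
  # Parse the text into a Date; %y is a two-digit year
  START = as.Date(STARTDATED, format = "%m/%d/%y %H:%M"),
  # A column created earlier in the same mutate() call can be reused
  DOY   = yday(START)
)
fires$DOY
```

January 1 is day 1, July 12, 2001 is day 193, and December 31 of the leap year 2016 is day 366.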

Exercise 5: Summarizing and Grouping

Summary statistics for a data frame can be produced with the summarize() function. The summarize() function produces a single row of data containing summary statistics from a data frame. This function is normally paired with the group_by() function to produce group summary statistics.

The grouping of data in a data frame facilitates the split-apply-combine paradigm. This paradigm first splits the data into groups, using the group_by() function in dplyr, then applies analysis to each group, and finally combines the results. The group_by() function handles the split portion of the paradigm by creating groups of data using one or more columns. For example, you might group all wildfires by state and cause.

In this exercise you'll use the mutate(), summarize(), and group_by() functions to group wildfires by decade and produce a summary of the mean wildfire size for each decade.

1. Use the read_csv() function to load the dataset into a data frame.

dfFires = read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)

2. Select the columns that will be used in the exercise.

df = select(dfFires, ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE)

3. Filter the records.

df = filter(df, TOTALACRES >= 1000)

4. Use the mutate() function to create a new column called DECADE that defines the decade in which each fire occurred. In this case a nested ifelse() function is called to produce the values for each decade.

df = mutate(df, DECADE = ifelse(YEAR_ %in% 1980:1989, "1980-1989",
       ifelse(YEAR_ %in% 1990:1999, "1990-1999",
       ifelse(YEAR_ %in% 2000:2009, "2000-2009",
       ifelse(YEAR_ %in% 2010:2016, "2010-2016", "-99")))))

5. View the result.

View(df)

6. Use the group_by() function to group the data frame by decade.

grp = group_by(df, DECADE)

7. Summarize the mean size of wildfires by decade using the summarize() function.

sm = summarize(grp, mean(TOTALACRES))

8. View the result.

View(sm)

9. Let's tidy things up by renaming the new column produced by the summarize() function.

names(sm) <- c("DECADE", "MEAN_ACRES_BURNED")

10. Finally, let's create a bar chart of the results. We'll discuss the creation of many different types of charts and graphs as we move through later chapters of the book, so detailed discussion of these topics will be saved for later.

ggplot(data = sm) + geom_col(mapping = aes(x = DECADE, y = MEAN_ACRES_BURNED), fill = "red")

11. You can check your work against the solution file Chapter4_5.R.
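The split-apply-combine steps in this exercise can be sketched on a small, made-up data frame. Two details below are assumptions that differ from the exercise: the decade is computed arithmetically with floor() rather than with nested ifelse() calls, and the summary columns are named directly inside summarize(), which avoids the separate names() step. N_FIRES is a hypothetical extra column.

```r
library(dplyr)

# Hypothetical toy data; the values are invented
fires <- tibble(
  YEAR_      = c(1985, 1988, 1994, 2003, 2007, 2012),
  TOTALACRES = c(1000, 3000, 2000, 4000, 6000, 9000)
)

# Split: label each fire with the first year of its decade (1985 -> 1980)
fires <- mutate(fires, DECADE = floor(YEAR_ / 10) * 10)
grp   <- group_by(fires, DECADE)

# Apply and combine: one row per decade, with named summary columns
sm <- summarize(grp,
                MEAN_ACRES_BURNED = mean(TOTALACRES),
                N_FIRES           = n())
sm
```

The result has one row per decade, so it can be passed to geom_col() in the same way as the sm data frame in step 10.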

Exercise 6: Piping

As you've probably noticed in some of these exercises, it is not unusual to run a series of dplyr functions as part of a larger processing routine. As you'll recall, each dplyr function returns a new data frame, and this data frame is typically used as the input to the next dplyr function in the series. These data frames are intermediate datasets not needed beyond the current step. However, you are still required to name and code each of these datasets.

Piping is a more efficient way of handling these temporary, intermediate datasets. In sum, piping is an efficient way of sending the output of one function to another function without creating an intermediate dataset and is most useful when you have a series of functions to run. The syntax for piping is to use the %>% characters at the end of each statement that you want to pipe. In this exercise you'll learn how to use piping to chain together input and output data frames.

1. In the last exercise the select(), filter(), mutate(), group_by(), and summarize() functions were all used in a series that ultimately produced a bar chart showing the mean acreage burned by wildfires in the past few decades. Each of these functions returns a data frame, which is then used as input to the next function in the series. Piping is a more efficient way of coding this chaining of function calls. Rewrite the code produced in Exercise 4 (the mutate() exercise) as seen below and then we'll discuss how piping works.

library(lubridate)
df = read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE) %>%
  select(ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE, STARTDATED) %>%
  filter(TOTALACRES >= 1000 & CAUSE %in% c('Human', 'Natural')) %>%
  mutate(DOY = yday(as.Date(STARTDATED, format = '%m/%d/%y %H:%M')))
View(df)

The first line of code reads the contents of the external StudyArea.csv file into a data frame as we've done in all the other exercises in this chapter. However, you'll notice the inclusion of the piping statement (%>%) at the end of the line. This ensures that the data frame returned by read_csv() will automatically be sent to the select() function.

Notice that the select() function does not create a variable like we have done in the past exercises, and that we have left off the first parameter, which would normally have been the data frame variable. It is implied that the output of the previous line will be passed to the select() function. This same process of including the piping statement at the end of each line and leaving off the first parameter is repeated for all the additional lines of code where we want to automatically pass the data frame to the next dplyr function. Only the result of the final function in the chain is assigned to the df variable. Finally, we view the contents of the df variable using the View() function on the last line.

Piping makes your code more streamlined and easier to read and also takes away the need to create and populate variables that are only used as intermediate datasets.

2. You can check your work against the solution file Chapter4_6.R.
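Put another way, x %>% f(y) is evaluated as f(x, y). The sketch below, which uses a made-up data frame, shows that a nested call and a piped chain produce the same result.

```r
library(dplyr)

# Hypothetical toy data; the values are invented
fires <- tibble(
  TOTALACRES = c(500, 1500, 30000),
  CAUSE      = c("Human", "Natural", "Human")
)

# Nested calls are read from the inside out
nested <- summarize(filter(fires, TOTALACRES >= 1000), MEAN = mean(TOTALACRES))

# The piped version is read top to bottom, one step per line
piped <- fires %>%
  filter(TOTALACRES >= 1000) %>%
  summarize(MEAN = mean(TOTALACRES))

nested$MEAN == piped$MEAN
```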

Exercise 7: Challenge

The challenge step is optional, but it will give you a chance to reinforce what you've learned in this module. Create a new data frame that is a subset of the original dfFires data frame. The subset should contain all fires from the State of Idaho and the columns should be limited so that only the YEAR_, CAUSE, and TOTALACRES columns are present. Rename the columns if you wish. Group the data by CAUSE and YEAR_ and then summarize by total acres burned. Plot the results.

Conclusion

In this chapter you learned how to use the dplyr package to perform various data transformation functions. You learned how to limit columns with the select() function, filter a data frame based on one or more expressions, add columns with mutate(), and summarize and group data. Finally, you learned how to use piping to make your code more efficient.

In the next chapter you'll learn how to create tidy datasets with the tidyr package.

Chapter 5

Creating Tidy Data

Let's first describe what we mean by "tidy data", because the term doesn't necessarily fully describe the concept. Data tidying is a consistent way of organizing data in R and can be facilitated through the tidyr package found in the tidyverse ecosystem. There are three rules that we can follow to make a dataset tidy. First, each variable must have its own column. Second, each observation must have its own row, and finally, each value must have its own cell. This is illustrated by the diagram below.

There are two main advantages of having tidy data. One is more of a general advantage and the other is more specific. First, having a consistent, uniform data structure is very important. The other packages that are part of tidyverse, including dplyr and ggplot2, are designed to work with tidy data, so ensuring that your data is uniform facilitates the efficient processing of your data. In addition, placing variables into columns makes it easy to take advantage of vectorization in R.

Many datasets that you encounter will not be tidy and will require some work on your end. There can be many reasons why a dataset isn't tidy. Oftentimes the people who created the dataset aren't familiar with the principles of tidy data. Unless you are trained in the practice of creating tidy datasets or spend a lot of time working with data structures, these concepts aren't readily apparent. Another common reason that datasets aren't tidy is that data is often organized to facilitate something other than analysis. Data entry is perhaps the most common of the reasons that fall into this category. To make data entry as easy as possible, people will often arrange data in ways that aren't tidy. So, many datasets require some sort of tidying before you can begin your analysis.

The first step is to figure out what the variables and observations are for the dataset. This will facilitate your understanding of what the columns and rows should be. In addition, you will also need to resolve one or two common problems. You will need to figure out if one variable is spread across multiple columns, and you will need to figure out if one observation is scattered across multiple rows. These concepts are known as gathering and spreading. We'll examine these concepts further in the exercises in this chapter.

In this chapter we'll cover the following topics:

• Gathering
• Spreading
• Separating
• Uniting

Exercise 1: Gathering

A common problem in many datasets is that the column names are not variables but rather values of a variable. In the figure below, the 1999 and 2000 columns are actually values of the variable YEAR. Each row in the existing table actually represents two observations. The tidyr package can be used to gather these existing columns into a new variable. In this case, we need to create a new column called YEAR and then gather the existing values in the 1999 and 2000 columns into the new YEAR column.

The gather() function from the tidyr package can be used to accomplish the gathering of data. Take a look at the line of code below to see how this function works.

gather('1999', '2000', key = 'year', value = 'cases')

In addition to the data frame itself (omitted in the line above, as it would be when the data is piped in), there are three parameters of the gather() function. The first is the set of columns that represent what should be values and not variables. These would be the 1999 and 2000 columns in the example we have been following. Next, you'll need to name the variable of the new column. This is also called the key, and in this case will be the year variable. Finally, you'll need to provide the value, which is the name of the variable whose values are spread over the cells.

In this exercise you'll learn how to use the gather() function to resolve the types of problems we discussed in the introduction to this topic.

1. In the Data folder where you installed the exercise data for this book is a file called CountryPopulation.csv. Open this file, preferably in Microsoft Excel, or some other type of spreadsheet software. The file should look similar to the screenshot below. This spreadsheet includes population data by country for the years 2010 through 2017. The columns for each year represent values, not variables. These columns need to be gathered into a new pair of variables that represent the Year and Population. In this exercise you'll use the gather() function to accomplish this data tidying task.

2. Open RStudio and find the Console pane.

3. If necessary, set the working directory by typing the code you see below into the Console pane or by going to Session | Set Working Directory | Choose Directory from the RStudio menu.

setwd(<installation directory for exercise data>)

4. If necessary, load the readr and tidyr packages by clicking the check boxes in the Packages pane or by including the following lines of code.

library(readr)
library(tidyr)

5. Load the CountryPopulation.csv file into RStudio by writing the code you see below in the Console pane.

dfPop = read_csv("CountryPopulation.csv", col_names = TRUE)

You should see the following output in the Console pane.

Parsed with column specification:
cols(
  `Country Name` = col_character(),
  `Country Code` = col_character(),
  `2010` = col_double(),
  `2011` = col_double(),
  `2012` = col_double(),
  `2013` = col_double(),
  `2014` = col_double(),
  `2015` = col_double(),
  `2016` = col_double(),
  `2017` = col_double()
)

6. Use the View() function to display the data in a tabular structure.

View(dfPop)

7. Use the gather() function as seen below.

dfPop2 = gather(dfPop, `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, key = 'YEAR', value = 'POPULATION')

8. View the output.

View(dfPop2)

9. You can check your work against the solution file Chapter5_1.R.
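The effect of gathering is easiest to see on a very small table. The sketch below uses a made-up two-country data frame (the names and numbers are invented). It also shows pivot_longer(), which newer versions of tidyr (1.0.0 and later) provide as the successor to gather(); it is included here only as an alternative and is not used in the exercise.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per year
pop <- tibble(
  Country = c("A", "B"),
  `2010`  = c(10, 20),
  `2011`  = c(11, 21)
)

# Gather the year columns into YEAR / POPULATION pairs
long <- gather(pop, `2010`, `2011`, key = "YEAR", value = "POPULATION")

# The same reshaping with the newer tidyr function
long2 <- pivot_longer(pop, cols = c(`2010`, `2011`),
                      names_to = "YEAR", values_to = "POPULATION")

nrow(long)  # 2 countries x 2 years
```

Both results contain the same four observations, although the rows may come back in a different order.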

Exercise 2: Spreading

Spreading is the opposite of gathering and is used when an observation is spread across multiple rows. In the diagram below, table2 should define an observation of one country per year. However, you'll notice that this is spread across two rows: one row for cases and another for population.

We can use the spread() function to fix this problem. The spread() function takes two parameters: the column that contains variable names, known as the key, and a column that contains values from multiple variables, known as the value.

spread(table2, key, value)

In this exercise you'll learn how to use the spread() function to resolve the types of problems we discussed in the introduction to this topic.

1. For this exercise you'll download some sample data that needs to be spread. Install the devtools package and the DSR datasets using the code you see below by typing in the Console pane. Alternatively, you can use the Packages pane to install the packages.

install.packages("devtools")
devtools::install_github("garrettgman/DSR")

2. Load the DSR library by going to the Packages pane and clicking the check box next to DSR.

3. View table2. In this case, an observation is one country per year, but you'll notice that each observation is actually spread into two rows.

View(table2)

4. Use the spread() function to correct this problem.

table2b = spread(table2, key = type, value = count)

5. View the results.

View(table2b)

6. You can check your work against the solution file Chapter5_2.R.
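A small, self-contained sketch of spreading is shown below. The obs data frame is a made-up miniature with the same column layout as table2 (country, year, type, count). pivot_wider() is the successor to spread() in tidyr 1.0.0 and later and is shown only for comparison.

```r
library(dplyr)
library(tidyr)

# Hypothetical table: each country-year observation occupies two rows
obs <- tibble(
  country = c("A", "A", "B", "B"),
  year    = 1999,
  type    = c("cases", "population", "cases", "population"),
  count   = c(5, 100, 8, 200)
)

# One row per country-year, with a column for each value of type
wide <- spread(obs, key = type, value = count)

# The same reshaping with the newer tidyr function
wide2 <- pivot_wider(obs, names_from = type, values_from = count)

names(wide)
```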

Exercise 3: Separating

Another common case involves two variables being placed into the same column. For example, the spreadsheet below has a State-CountyName column that actually contains two variables separated by a slash.

The separate() function can be used to split a column into multiple columns by splitting on a separator. By default, the separate() function will automatically look for any non-alphanumeric character, or you can define a specific character.

Here, the separate() function will split the values of the State-CountyName column into two variables: StateAbbrev and CountyName.

The separate() function accepts parameters for the name of the column to separate along with the names of the columns to separate into, and an optional separator. You can see an example of how the separate() function works below.

separate(table3, rate, into = c("cases", "population"))

In this exercise you'll learn how to use the separate() function to resolve the types of problems we discussed in the introduction to this topic.

1. In the Data folder where you installed the exercise data for this book is a file called usco2005.csv. Open this file, preferably in Microsoft Excel, or some other type of spreadsheet software. The file should look similar to the screenshot below.

2. Load the usco2005.csv file into RStudio by writing the code you see below in the Console pane.

df = read_csv("usco2005.csv", col_names = TRUE)

3. View the imported data.

View(df)

4. Use the separate() function to separate the contents of the State-CountyName column into StateAbbrev and CountyName columns.

df2 = separate(df, "State-CountyName", into = c("StateAbbrev", "CountyName"))

5. View the results.

View(df2)

6. You can check your work against the solution file Chapter5_3.R.
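Because the default separator matches any non-alphanumeric character, values that contain spaces or periods can be split in more places than intended. The sketch below uses a made-up StateCounty column (a hypothetical name, not the column in usco2005.csv) to compare the default behavior with an explicit sep = "/".

```r
library(dplyr)
library(tidyr)

# Hypothetical values in "state/county" form
counties <- tibble(StateCounty = c("TX/Bexar", "AL/St. Clair"))

# Default separator: "St. Clair" is also split at the period and space,
# and the extra piece is dropped with a warning (suppressed here)
loose <- suppressWarnings(
  separate(counties, StateCounty, into = c("StateAbbrev", "CountyName"))
)

# Explicit separator: split only at the slash
exact <- separate(counties, StateCounty,
                  into = c("StateAbbrev", "CountyName"), sep = "/")

exact$CountyName
```

If your own data contains multi-word names, specifying the separator explicitly is the safer choice.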

Exercise 4: Uniting

The unite() function is the exact opposite of separate() in that it combines multiple columns into a single column. While not used nearly as often as separate(), there may be times when you need the functionality provided by unite(). In this exercise you'll unite the data frame that was separated in the last exercise.

1. In the Console pane, add the code you see below to unite the StateAbbrev and CountyName columns back into a single column.

df3 = unite(df2, State_County_Name, StateAbbrev, CountyName)

2. View the result.

View(df3)

3. You can check your work against the solution file Chapter5_4.R.
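By default unite() joins the values with an underscore, so the united column will not look exactly like the original slash-separated column unless a separator is supplied. A minimal sketch with a made-up data frame:

```r
library(dplyr)
library(tidyr)

# Hypothetical separated columns
parts <- tibble(
  StateAbbrev = c("TX", "AL"),
  CountyName  = c("Bexar", "Autauga")
)

# Default separator is an underscore
under <- unite(parts, State_County_Name, StateAbbrev, CountyName)

# Supply sep to restore a slash-separated value
slash <- unite(parts, State_County_Name, StateAbbrev, CountyName, sep = "/")

slash$State_County_Name
```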

Conclusion

In this chapter you were introduced to the tidyr package and its set of functions for creating tidy datasets. The next chapter will teach you the basics of data exploration using R and tidyverse.

Chapter 6

Basic Data Exploration Techniques in R

Exploratory Data Analysis (EDA) is a workflow designed to gain a better understanding of your data. The workflow consists of three steps. The first is to generate questions about your data. In this step you want to be as broad as possible because at this point you don't really have a good feel for the data. Next, search for answers to these questions by visualizing, transforming, and modeling the data. Finally, refine your questions and/or generate new questions. In R there are two primary tools that support the data exploration process: plots and summary statistics.

Data can generally be divided into categorical or continuous types. Categorical variables consist of a small set of values, while continuous variables have a potentially infinite set of ordered values. Categorical variables are often visualized with bar charts, and continuous variables with histograms. Both categorical and continuous data can be represented through various charts created with R.

When performing basic visualization of variables, we tend to measure either variation or covariation. Variation is the tendency of the values of a variable to change from measurement to measurement. The variable being measured is the same, though. This would include things like the total acres burned by a wildfire (continuous) or the number of crimes by police district (categorical). Covariation is the tendency of the values of two or more variables to vary together in a related way.

In this chapter we'll cover the following topics:

• Measuring categorical variation with a bar chart
• Measuring continuous variation with a histogram
• Measuring covariation with box plots
• Measuring covariation with symbol size
• Creating 2D bins and hex charts
• Generating summary statistics

Exercise 1: Measuring Categorical Variation with a Bar Chart

A bar chart is a great way to visualize categorical data. It separates each category into a separate bar, and the height of each bar is defined by the number of occurrences in that category.

1. The exercises in this chapter require the following packages: readr, dplyr, ggplot2. They can be loaded from the Packages pane, the Console pane, or a script.

2. Open RStudio and find the Console pane.

3. If necessary, set the working directory by typing the code you see below into the Console pane or by going to Session | Set Working Directory | Choose Directory from the RStudio menu.

setwd(<installation directory for exercise data>)

4. Use the read_csv() function to load the dataset into a data frame.

dfFires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)

5. For this analysis, we'll filter the data so that only fires that burned 1,000 acres or more in the years 2010 through 2016 are represented. Add the code you see below to filter the data and send the results to a bar chart.

df <- filter(dfFires, TOTALACRES >= 1000, YEAR_ %in% c(2010, 2011, 2012, 2013, 2014, 2015, 2016))
ggplot(data = df) + geom_bar(mapping = aes(x = YEAR_))

This will produce a bar chart that appears as seen in the screenshot below.

6. Use the count() function to get the actual count for each category.

View(count(df, YEAR_))

Exercise 2: Measuring Continuous Variation with a Histogram

The distribution of a continuous variable can be measured with the use of a histogram. In this exercise you'll create a histogram of wildfire acres burned.

1. On a new line, use the read_csv() function to load the StudyArea.csv file.

dfFires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)

2. Pipe the df data frame and use the select() function to limit the columns and filter the rows so that only fires of 1,000 acres or more are included. Since we have a large number of wildfires that burned only a small number of acres, we'll focus on fires that are a little larger in this case.

df %>%
  select(ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE) %>%
  filter(TOTALACRES >= 1000)

3. Create the histogram using ggplot() with geom_histogram() and a bin width of 500. The data is obviously still skewed toward the lower end of the number of acres burned. Add the final pipe and the ggplot() line you see below to produce the chart.

df %>%
  select(ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE) %>%
  filter(TOTALACRES >= 1000) %>%
  ggplot() + geom_histogram(mapping = aes(x = TOTALACRES), binwidth = 500)

4. You can also get an actual count of the number of fires that fell into each bin. From viewing the histogram and the count, it's obvious that the vast majority of fires are small.

df %>% count(cut_width(TOTALACRES, 500))

   `cut_width(TOTALACRES, 500)`     n
   <fct>                        <int>
 1 [750,1250]                     154
 2 (1250,1750]                    178
 3 (1750,2250]                    144
 4 (2250,2750]                     82
 5 (2750,3250]                     70
 6 (3250,3750]                     39
 7 (3750,4250]                     59
 8 (4250,4750]                     42
 9 (4750,5250]                     40
10 (5250,5750]                     37

5. Challenge: Recreate the histogram using a bin width of 5000. What is the effect on the output?
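The effect of bin width on the counts can be explored without drawing a chart. The sketch below applies cut_width() from ggplot2 to a handful of made-up acreage values with two different widths; the acres data frame is hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical acreage values
acres <- tibble(TOTALACRES = c(1000, 1200, 1300, 1800, 2400, 5100))

# Narrow bins (width 500): the values fall into five separate bins
narrow <- count(acres, cut_width(TOTALACRES, 500))

# Wide bins (width 5000): almost everything lands in a single bin
wide <- count(acres, cut_width(TOTALACRES, 5000))

narrow$n
wide$n
```

Wider bins hide detail at the low end of the range, which is where most of the fires in a skewed distribution are found.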

Exercise 3: Measuring Covariation with Box Plots

Box plots provide a visual representation of the spread of data for a variable. These plots display the range of values for a variable along with the median and quartiles. Follow the instructions provided below to create a box plot that measures covariation between organization and total acreage burned.

1. Use the read_csv() function to load the dataset into a data frame.

dfFires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)

2. Pipe the data frame and filter the rows so that only fires between 5,000 and 10,000 acres are included. Then, group the data by organization. The ORGANIZATI column in the dataset contains categorical data for the U.S. federal government agencies that have had land affected by wildfires. Finally, use ggplot() with geom_boxplot() to create a box plot showing the distribution of wildfires by organization.

df %>%
  filter(TOTALACRES >= 5000 & TOTALACRES <= 10000) %>%
  group_by(ORGANIZATI) %>%
  ggplot(mapping = aes(x = ORGANIZATI, y = TOTALACRES)) + geom_boxplot()

The organization is listed on the X axis and the total acreage burned on the Y axis. The box contains a horizontal line that represents the median for the variable, and the box itself spans the Interquartile Range (IQR), from the first quartile to the third quartile. The vertical lines that extend on either side of the box are known as the whiskers and cover the lowest and highest quarters of the data. In ggplot2 the whiskers extend to the most extreme values within 1.5 times the IQR of the box, and any points beyond that are drawn individually as outliers. A larger box and longer whiskers indicate a wider spread of data.

3. Challenge: Create a new box plot that maps the covariation of CAUSE and TOTALACRES.
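The numbers that a box plot summarizes can also be computed directly. The sketch below calculates the quartiles, median, and interquartile range per group for a made-up data frame; the ORG and ACRES columns and the SPREAD column name are hypothetical.

```r
library(dplyr)

# Hypothetical acreage values for two organizations
fires <- tibble(
  ORG   = rep(c("BLM", "FS"), each = 5),
  ACRES = c(5000, 6000, 7000, 8000, 9000,
            5000, 5500, 6000, 6500, 10000)
)

# One row per organization: box edges (Q1, Q3), median line, and box height
stats <- fires %>%
  group_by(ORG) %>%
  summarize(
    Q1     = quantile(ACRES, 0.25),
    MED    = median(ACRES),
    Q3     = quantile(ACRES, 0.75),
    SPREAD = IQR(ACRES)
  )
stats
```

In this toy data the FS group has the smaller box (an IQR of 1,000 against 2,000 for BLM), even though it contains the single largest fire.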

Exercise 4: Measuring Covariation with Symbol Size

The geom_count() function can be used with ggplot() to measure covariation between variables using different symbol sizes. Follow the instructions provided below to measure the covariation between organization and wildfire cause using symbol size.

1. Use the read_csv() function to load the dataset into a data frame.

dfFires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)

2. Pipe the data frame and filter the rows so that only wildfires that originated due to Natural or Human causes are included. This will remove any records that are Unknown or have missing values. Then, use geom_count() to create a graduated symbol chart based on the number of fires by organization.

df %>%
  filter(CAUSE == 'Natural' | CAUSE == 'Human') %>%
  group_by(ORGANIZATI) %>%
  ggplot() + geom_count(mapping = aes(x = ORGANIZATI, y = CAUSE))

3. You can also get an exact count of the number of fires by organization and cause. Because this call is run on the unfiltered df data frame, causes other than Human and Natural (such as Undetermined) also appear in the output.

df %>% count(ORGANIZATI, CAUSE)

   ORGANIZATI CAUSE            n
   <chr>      <chr>        <int>
 1 BIA        Human           49
 2 BIA        Natural         91
 3 BLM        Human          187
 4 BLM        Natural        386
 5 FS         Human          158
 6 FS         Natural        431
 7 FWS        Human           10
 8 FWS        Natural          7
 9 FWS        Undetermined     6
10 NPS        Human            6
11 NPS        Natural         46

Exercise 5: 2D bin and hex charts

You can also use 2D bin and hex charts as an alternative way of viewing the distribution of two variables. Follow the instructions provided below to create 2D bin and hex charts that visualize the relationship between the year and total acreage burned.

1. Use the read_csv() function to load the dataset into a data frame.

dfFires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)

2. Create a 2D bin map with YEAR_ on the X axis and TOTALACRES on the Y axis.

ggplot(data = dfFires) + geom_bin2d(mapping = aes(x = YEAR_, y = TOTALACRES))

3. Create a 2D hex map with YEAR_ on the X axis and TOTALACRES on the Y axis. Note that geom_hex() requires the hexbin package to be installed.

ggplot(data = df) + geom_hex(mapping = aes(x = YEAR_, y = TOTALACRES))

Exercise 6: Generating Summary Statistics

Another basic technique for performing exploratory data analysis is to generate various summary statistics on a dataset. R includes a number of individual functions for generating specific summary statistics, or you can use the summary() function to generate a set of summary statistics.

1. Reload the StudyArea.csv file into a data frame.

    df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)

2. Restrict the list of columns.

    df <- select(df, ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE)

3. Filter the list to include only wildfires of at least 1,000 acres.

    df <- filter(df, TOTALACRES >= 1000)

4. Call the mean() function, passing in a reference to the data frame and the TOTALACRES column.

    mean(df$TOTALACRES)
    [1] 10813.06

5. Call the median() function.

    median(df$TOTALACRES)
    [1] 3240

6. Instead of calling the individual summary statistics functions you can simply use the summary() function to return a list of summary statistics.

    summary(df$TOTALACRES)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       1000    1670    3240   10813    8282  590620

7. You can check your work against the solution file Chapter6_6.R.
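The gap between the mean (10813.06) and the median (3240) in the output above is typical of skewed data. A minimal sketch with an invented vector shows the same effect on a scale that is easy to verify by hand; the values are hypothetical and are not drawn from StudyArea.csv.

```r
# Hypothetical fire sizes: four modest fires and one very large one
acres <- c(1000, 1500, 2500, 4000, 91000)

avg <- mean(acres)    # pulled upward by the single large fire
mid <- median(acres)  # the middle value, unaffected by the extreme

print(avg)
print(mid)
print(summary(acres))
```

When the two statistics differ this much, the median is usually the better description of a typical observation.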

Conclusion

In this chapter you learned some basic data exploration techniques using R. You learned how to measure categorical and continuous variation with bar charts and histograms, and covariation with box plots and different symbol sizes. Finally, you learned how to generate summary statistics and create 2D bin and hex charts.

In the next chapter you'll learn how to visualize data using the ggplot2 package.

Chapter 7

Basic Data Visualization Techniques

The ggplot2 package is a library that enables the creation of many types of data visualization including various types of charts and graphs. This library was first created by Hadley Wickham in 2005 and is an R implementation of Leland Wilkinson's Grammar of Graphics. The idea behind this package is to specify plot building blocks and then combine them to create a graphical display. Building blocks of ggplot2 include data, aesthetic mapping, geometric objects, statistical transformations, scales, coordinate systems, position adjustments, and faceting.

There are a number of advantages to using ggplot2 versus other visualization techniques available in R. These advantages include a consistent style for defining the graphics, a high level of abstraction for specifying plots, flexibility, a built-in theming system for plot appearance, a mature and complete graphics system, and access to many other ggplot2 users for support.

In this chapter we'll cover the following topics:

• Creating a scatterplot
• Adding a regression line to a scatterplot
• Plotting categories
• Labeling the graph
• Legend layouts
• Creating a facet
• Theming
• Creating bar charts
• Creating violin plots
• Creating density plots

Step 1: Creating a scatterplot

A scatterplot is a graph in which the values of two variables are plotted along two axes, with the pattern of the resulting points revealing any correlation present.

1. The exercises in this chapter require the following packages: readr, dplyr, ggplot2. They can be loaded from the Packages pane, the Console pane, or a script.

2. Open RStudio and find the Console pane.

3. If necessary, set the working directory by typing the code you see below into the Console pane or by going to Session | Set Working Directory | Choose Directory from the RStudio menu.

    setwd(<installation directory for exercise data>)

4. Load the contents of the StudyArea.csv file into a data frame.

    dfWildfires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)

5. Create a subset of columns.

    df <- select(dfWildfires, ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE)

6. Group the records by year.

    grp <- group_by(df, YEAR_)

7. Summarize the data by total number of acres burned.

    sm <- summarize(grp, totalacres = sum(TOTALACRES))

8. Use ggplot() to create a scatterplot with the year on the x axis and the total acres burned on the y axis.

    ggplot(data = sm) + geom_point(mapping = aes(x = YEAR_, y = totalacres))

9. There are times when it makes sense to use a logarithmic scale in charts and graphs. One reason is to respond to skewness towards large values, i.e., cases in which one or a few points are much larger than the bulk of the data. In the graph that we just created there are a couple of points that fall into this category on the y axis.

Create the graph again, but this time use the log() function on the totalacres column.

    ggplot(data = sm) + geom_point(mapping = aes(x = YEAR_, y = log(totalacres)))

10. You can check your work against the solution file Chapter7_1.R.
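An alternative to wrapping the variable in log() is to leave the data alone and put the axis itself on a logarithmic scale with scale_y_log10(). The sketch below is one way to do that under stated assumptions: the data frame is invented for illustration, and only the column names follow the exercise. Note that log() in R is the natural logarithm, while scale_y_log10() uses base 10 and labels the axis in the original units.

```r
library(dplyr)
library(ggplot2)

# Hypothetical wildfire records, not taken from StudyArea.csv
wild <- data.frame(
  YEAR_      = c(2000, 2000, 2001, 2001, 2002),
  TOTALACRES = c(1200, 800, 15000, 5000, 300)
)

# Same group-and-summarize pattern as steps 6 and 7
sm <- wild %>%
  group_by(YEAR_) %>%
  summarize(totalacres = sum(TOTALACRES))

# The y values stay in acres; only the axis spacing becomes logarithmic
p_log <- ggplot(data = sm) +
  geom_point(mapping = aes(x = YEAR_, y = totalacres)) +
  scale_y_log10()

print(sm)
```

The shape of the point cloud is the same with either approach; the difference is that the axis labels remain readable as acres rather than as logarithms.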

Step 2: Adding a regression line to the scatterplot

Plots constructed with ggplot() can have more than one geometry. It's common to add a prediction (regression) line to the plot.

1. There are several ways that you can add a regression line to the scatterplot, one of which is to use the geom_smooth() function with the method set to lm (straight line) and the se parameter set to FALSE. Add the line of code you see below to the console window.

    ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE)

2. Change the method to loess to see the effect on the fitted line, which becomes a smooth curve that follows local trends in the data.

    ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) +
      geom_point() +
      geom_smooth(method = "loess", se = FALSE)

3. You can add a confidence interval around the fitted line by setting se = TRUE.

    ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) +
      geom_point() +
      geom_smooth(method = "loess", se = TRUE)

4. You can check your work against the solution file Chapter7_2.R.
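The straight line drawn by geom_smooth(method = "lm") is the same line you would get by fitting a linear model yourself with lm(). The sketch below checks that on a small invented data set; the column names year and acres are hypothetical and chosen only for this example.

```r
library(ggplot2)

# Hypothetical yearly values with a rough upward trend
d <- data.frame(
  year  = 2000:2009,
  acres = c(3, 5, 4, 6, 8, 7, 9, 11, 10, 12)
)

# Fit the model directly
fit <- lm(acres ~ year, data = d)

# Let ggplot2 fit and draw the same model
p <- ggplot(data = d, aes(x = year, y = acres)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 2 is the smooth; its data holds the fitted line evaluated along x
line <- layer_data(p, 2)
print(coef(fit))
```

Fitting the model separately is useful when you need the slope or intercept as numbers rather than only as a picture.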

Step 3: Plotting categories

Rather than graphing the entire set of wildfires you might want to better understand the trends by state. In this step you'll create a new scatterplot that visualizes wildfire trends over time by state.

1. Regroup the wildfires data frame by state and year.

    grp <- group_by(df, STATE, YEAR_)

2. Summarize the groups by total acres burned.

    sm <- summarize(grp, totalacres = sum(TOTALACRES))

3. Add a colour parameter to the aes() function so that the points and regression lines are colored according to the state in which the fires occurred.

    ggplot(data = sm, aes(x = YEAR_, y = totalacres, colour = STATE)) +
      geom_point(aes(colour = STATE)) +
      stat_smooth(method = "lm", se = FALSE)

4. You can check your work against the solution file Chapter7_3.R.

Step 4: Labeling the graph

You can add labels to your graph through either the geom_text() function or the geom_label() function.

1. Label each of the points on the scatterplot using geom_text() with a label size of 3.

    ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) +
      geom_point() +
      geom_smooth(method = "loess", se = TRUE) +
      geom_text(aes(label = STATE), size = 3)

Now this obviously doesn't work very well. The display is extremely cluttered, so let's adjust a few parameters to make this easier to read.

2. You can use the check_overlap parameter to remove any overlapping labels. Update your code as seen below.

    ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) +
      geom_point() +
      geom_smooth(method = "loess", se = TRUE) +
      geom_text(aes(label = STATE), size = 3, check_overlap = TRUE)

3. This looks quite a bit better, but if you change the label size to 2 it will further reduce the clutter and overlapping while hopefully still being readable.

4. You may have noticed that the labels sit directly on top of the points. You can use the nudge_x and nudge_y parameters to move the labels relative to the point. Use nudge_x as seen below to see how this moves the labels horizontally.

    ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) +
      geom_point() +
      geom_smooth(method = "loess", se = TRUE) +
      geom_text(aes(label = STATE), size = 2, check_overlap = TRUE, nudge_x = 1.0)

5. You can also color the labels by category by adding the color parameter to the aes() for geom_text().

    ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) +
      geom_point() +
      geom_smooth(method = "loess", se = TRUE) +
      geom_text(aes(label = STATE, color = STATE), size = 2, check_overlap = TRUE, nudge_x = 1.0)

6. You can also add a title, subtitle, and caption with the code you see below.

    ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) +
      geom_point() +
      geom_smooth(method = "loess", se = TRUE) +
      labs(title = "Acreage Burned by Wildfires Has Increased In the Past Few Decades",
           subtitle = "1980-2016",
           caption = "Data from USGS")

7. You can also update the X and Y labels for the graph. Update these labels on your graph using the code you see below.

    ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) +
      geom_point() +
      geom_smooth(method = "loess", se = TRUE) +
      labs(title = "Acreage Burned by Wildfires Has Increased In the Past Few Decades",
           subtitle = "1980-2016",
           caption = "Data from USGS") +
      scale_y_continuous(name = "Log of Total Acres Burned") +
      scale_x_continuous(name = "Burn Year")

8. You can check your work against the solution file Chapter7_4.R.

Step 5: Legend layouts

The theme() function can be used to control the location of the legend and the guides() function can be used to provide additional legend control.

1. The theme() function along with the legend.position argument is used to control the location of the legend on the graph. By default, the legend we've seen so far has been placed on the right side of the graph with a vertical orientation. Reposition the legend to the bottom with the code below.

    ggplot(data = sm, aes(x = YEAR_, y = log(totalacres), color = STATE)) +
      geom_point() +
      labs(title = "Acreage Burned by Wildfires Has Increased In the Past Few Decades",
           subtitle = "1980-2016",
           caption = "Data from USGS") +
      scale_y_continuous(name = "Log of Total Acres Burned") +
      scale_x_continuous(name = "Burn Year") +
      theme(legend.position = "bottom")

2. You can also explicitly remove a legend by setting legend.position = "none". Try that now if you'd like.

3. Other aspects of the legend, such as the number of rows in the legend and the symbol size, can be controlled through the guides() function. Use the code you see below to update the legend to use two rows with each symbol set to size 4.

    ggplot(data = sm, aes(x = YEAR_, y = log(totalacres), color = STATE)) +
      geom_point() +
      labs(title = "Acreage Burned by Wildfires Has Increased In the Past Few Decades",
           subtitle = "1980-2016",
           caption = "Data from USGS") +
      scale_y_continuous(name = "Log of Total Acres Burned") +
      scale_x_continuous(name = "Burn Year") +
      theme(legend.position = "bottom") +
      guides(color = guide_legend(nrow = 2, override.aes = list(size = 4)))

4. You can check your work against the solution file Chapter7_5.R.

Step 6: Creating a facet

A particularly good way of graphing categorical variables is to split your plot into facets, which are subplots that each display one subset of the data. The facet_wrap() and facet_grid() functions can be used to create facets.

1. Use the facet_wrap() function displayed in the code below to create a faceted plot that displays total acres burned by state.

    ggplot(data = sm, mapping = aes(x = YEAR_, y = log(totalacres))) +
      geom_point() +
      facet_wrap(~STATE) +
      geom_smooth(method = "loess", se = TRUE)

2. You can check your work against the solution file Chapter7_6.R.

Step 7: Theming

ggplot2 includes eight built-in themes that can be used to customize the styling of the non-data elements of your plot.

1. The eight themes included in ggplot2 are theme_bw, theme_classic, theme_dark, theme_gray, theme_light, theme_linedraw, theme_minimal, and theme_void.

Add the code you see below to change the faceted plot to theme_dark.

    ggplot(data = sm, mapping = aes(x = YEAR_, y = log(totalacres))) +
      geom_point() +
      facet_wrap(~STATE) +
      geom_smooth(method = "loess", se = TRUE) +
      theme_dark()

2. Experiment with the themes to see the differences in styling.

3. You can check your work against the solution file Chapter7_7.R.

Step 8: Creating bar charts

You can use geom_bar() or geom_col() to create bar charts with ggplot2. However, there is a significant difference between the two. The geom_bar() function counts the number of rows for each value of the x variable. In other words, it computes a new statistic and ignores any value that has already been generated for the group. The geom_col() function plots the y values already present in the data, stacking them when several rows share the same x value. To see the difference, complete the following steps.

1. Load the StudyArea.csv file and get a subset of columns.

    dfWildfires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
    df <- select(dfWildfires, ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE)

2. Filter the data frame so that only wildfires for California are included.

    df <- filter(df, STATE == 'California')

3. Group the data frame by YEAR_.

    grp <- group_by(df, YEAR_)

4. Plot the data using geom_bar() as seen below. Notice that the bar chart that is produced is a count of the number of fires for each year.

    ggplot(data = grp) + geom_bar(mapping = aes(x = YEAR_), fill = "red")

5. Now use geom_col() to see the difference. The TOTALACRES variable is maintained in this case, so each bar shows the total acreage burned for the year.

    ggplot(data = grp) + geom_col(mapping = aes(x = YEAR_, y = TOTALACRES), fill = "red")

6. You can check your work against the solution file Chapter7_8.R.
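The difference between the two functions is easiest to see with numbers small enough to check by hand. The following sketch uses six invented fire records (not from StudyArea.csv) and inspects the bar heights that ggplot2 computes for each chart.

```r
library(ggplot2)

# Hypothetical records: two fires in 2014, one in 2015, three in 2016
fires <- data.frame(
  YEAR_      = c(2014, 2014, 2015, 2016, 2016, 2016),
  TOTALACRES = c(100, 300, 50, 10, 20, 30)
)

# geom_bar(): bar height is the number of rows per year
p_bar <- ggplot(data = fires) + geom_bar(mapping = aes(x = YEAR_))
bar_heights <- layer_data(p_bar)$count

# geom_col(): rows sharing a year are stacked, so the top of each
# stack is the total acreage for that year
p_col <- ggplot(data = fires) + geom_col(mapping = aes(x = YEAR_, y = TOTALACRES))
col_data <- layer_data(p_col)
col_heights <- as.vector(tapply(col_data$ymax, col_data$x, max))

print(bar_heights)  # number of fires per year
print(col_heights)  # acres burned per year
```

Here 2016 has the tallest bar under geom_bar() because it has the most fires, while 2014 has the tallest bar under geom_col() because its fires burned the most acreage.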

Step 9: Creating Violin Plots

Violin plots, which are similar to box plots, also show the probability density at various values. Thicker areas of the violin plot indicate a higher probability at that value. Typically, violin plots also include a marker for the median along with the Inter-Quartile Range (IQR). The geom_violin() function is used to create violin plots in ggplot2.

1. Load the StudyArea.csv file and get a subset of columns.

    dfWildfires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
    df <- select(dfWildfires, ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE)

2. Filter the data frame so that only wildfires of at least 5,000 acres are included.

    dfWildfires <- filter(dfWildfires, TOTALACRES >= 5000)

3. Group the wildfires by organization.

    grpWildfires <- group_by(dfWildfires, ORGANIZATI)

4. Create a basic violin plot.

    ggplot(data = grpWildfires, mapping = aes(x = ORGANIZATI, y = log(TOTALACRES))) +
      geom_violin()

5. You can add the individual observations using geom_jitter().

    ggplot(data = grpWildfires, mapping = aes(x = ORGANIZATI, y = log(TOTALACRES))) +
      geom_violin() +
      geom_jitter(height = 0, width = 0.1)

6. The mean can be added using stat_summary() as seen below. Current versions of ggplot2 name this argument fun; the older name fun.y is deprecated.

    ggplot(data = grpWildfires, mapping = aes(x = ORGANIZATI, y = log(TOTALACRES))) +
      geom_violin() +
      geom_jitter(height = 0, width = 0.1) +
      stat_summary(fun = mean, geom = "point", size = 2, color = "red")

7. The geom_boxplot() function can be used to add the median and IQR.

    ggplot(data = grpWildfires, mapping = aes(x = ORGANIZATI, y = log(TOTALACRES))) +
      geom_violin() +
      geom_boxplot(width = 0.1)

8. You can check your work against the solution file Chapter7_9.R.
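The point that stat_summary() draws is simply the summary function applied to each group's y values. The sketch below confirms that on an invented data frame; the column names org and acres are hypothetical stand-ins for ORGANIZATI and TOTALACRES.

```r
library(ggplot2)

# Hypothetical large fires for two organizations
d <- data.frame(
  org   = rep(c("BLM", "FS"), each = 5),
  acres = c(5000, 6000, 8000, 12000, 40000,
            5500, 7000, 9000, 20000, 90000)
)

# Violin outline plus one red point per group at the mean of log(acres)
p <- ggplot(data = d, mapping = aes(x = org, y = log(acres))) +
  geom_violin() +
  stat_summary(fun = mean, geom = "point", size = 2, color = "red")

# Layer 2 is the summary layer; its y values are the plotted means
plotted_means <- layer_data(p, 2)$y

# The same means computed directly
direct_means <- tapply(log(d$acres), d$org, mean)
print(direct_means)
```

Because the plot uses log(acres), the red point marks the mean of the logged values, which is not the same as the log of the mean acreage.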

Step 10: Creating density plots

Density plots, created with geom_density(), compute a density estimate, which is a smoothed version of a histogram and is used with continuous data. ggplot2 can also compute 2D versions of density, including contour and polygon styled density plots.

1. In this first portion of the exercise you'll create a basic density plot. Load the StudyArea.csv file and get a subset of columns.

    dfWildfires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
    df <- select(dfWildfires, ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE)

2. Filter the data frame so that only wildfires of at least 1,000 acres are included.

    dfWildfires <- filter(dfWildfires, TOTALACRES >= 1000)

3. Create a density plot with the geom_density() function.

    ggplot(dfWildfires, aes(TOTALACRES)) + geom_density()

4. You may also want to create the same density plot with a logged version of the data.

    ggplot(dfWildfires, aes(log(TOTALACRES))) + geom_density()

5. Next, you'll create 2D plots of the data starting with contours. Add the code you see below.

    ggplot(dfWildfires, aes(x = YEAR_, y = log(TOTALACRES))) +
      geom_point() +
      geom_density_2d()

6. Finally, create a 2D density surface using stat_density_2d(). The after_stat() helper refers to a value computed by the layer; it replaces the older ..density.. notation.

    ggplot(dfWildfires, aes(x = YEAR_, y = log(TOTALACRES))) +
      geom_density_2d() +
      stat_density_2d(geom = "raster", aes(fill = after_stat(density)), contour = FALSE)

7. You can check your work against the solution file Chapter7_10.R.
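To see what a density layer actually computes, you can pull the calculated values out of the plot with layer_data(). The sketch below is a minimal illustration on randomly generated data; the column names and the distribution parameters are assumptions made for this example and have nothing to do with StudyArea.csv.

```r
library(ggplot2)

# Hypothetical skewed acreage values and years
set.seed(42)
d <- data.frame(
  year  = sample(1980:2016, 200, replace = TRUE),
  acres = rlnorm(200, meanlog = 8, sdlog = 1)
)

# 1D density of the logged values
p1 <- ggplot(d, aes(log(acres))) + geom_density()
dens <- layer_data(p1)

# 2D density surface; after_stat(density) maps the computed density to fill
p2 <- ggplot(d, aes(x = year, y = log(acres))) +
  stat_density_2d(geom = "raster", aes(fill = after_stat(density)), contour = FALSE)
surface <- layer_data(p2)

# The curve is evaluated at a fixed number of x positions
print(nrow(dens))
print(head(dens[, c("x", "density")]))
```

The density column holds the height of the curve at each x position, which is the same quantity that after_stat(density) exposes to the fill aesthetic in the 2D version.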

Conclusion

In this chapter you learned various data visualization techniques using ggplot2. We started with basic scatterplots, added regression lines, labeled the graphs in various ways, and created a legend. In addition, you learned how to create facet plots and work with ggplot2's built-in theming options. You also learned how to create bar charts, violin plots, and density plots.

In the next chapter you will learn how to create maps using the ggmap package.

Chapter 8

Visualizing Geographic Data with ggmap

The ggmap package enables the visualization of spatial data and spatial statistics in a map format using the layered approach of ggplot2. This package also includes basemaps that give your visualizations context including Google Maps, OpenStreetMap, Stamen Maps, and CloudMade maps. In addition, utility functions are provided for accessing various Google services including Geocoding, Distance Matrix, and Directions.

The ggmap package is based on ggplot2, which means it takes a layered approach and consists of the same five components found in ggplot2. These include a default dataset with aesthetic mappings where x is longitude, y is latitude, and the coordinate system is fixed to Mercator. Other components include one or more layers defined with a geometric object and statistical transformation, a scale for each aesthetic mapping, a coordinate system, and a facet specification. Because ggmap is built on ggplot2, it has access to the full range of ggplot2 capabilities that you learned about in the previous exercises.

In this chapter we'll cover the following topics:

• Creating a basemap
• Adding operational layers
• Adding layers from a shapefile

Exercise 1: Creating a basemap

There are two basic steps to create a map with ggmap. The details are more complex than these two steps might imply, but in general you just need to download the map raster (basemap) and then plot operational data on the basemap. The first step is to download the map raster, also known as the basemap. This is accomplished using the get_map() function, which can be used to create a basemap from Google, Stamen, OpenStreetMap, or CloudMade. You'll learn how to do that in this step. In a future step you'll learn how to add and style operational data in various ways.

1. Open RStudio and find the Console pane.

2. If necessary, set the working directory by typing the code you see below into the Console pane or by going to Session | Set Working Directory | Choose Directory from the RStudio menu.

    setwd(<installation directory for exercise data>)

3. Load the ggmap package by going to the Packages pane in RStudio and clicking on the checkbox next to the package name. Alternatively, you can load it from the Console by typing:

    library(ggmap)

4. Create a variable called myLocation and set it to California.

    myLocation <- "California"

5. Call the get_map() function and pass in the location variable along with a zoom level of 6.

    myMap <- get_map(location = myLocation, zoom = 6)

6. In RStudio you should see some return messages that look similar to what you see below. It isn't uncommon to get an error message when calling the get_map() function from RStudio. If this happens, simply re-execute the code until you get something that is similar to what you see below.

    Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=California&zoom=6&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
    Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=California&sensor=false

7. Call the ggmap() function, passing in the myMap variable. The Plots pane should display the map as seen below. The default map type is Google Maps with a style of Terrain.

    ggmap(myMap)

The Google source includes a number of map types including those you see in the screenshot below.

8. Add and execute the code you see below to add a Google satellite map.

    myMap <- get_map(location = myLocation, zoom = 6, source = "google", maptype = "satellite")
    ggmap(myMap)

9. There are a number of ways that you can define the input location: a longitude/latitude coordinate pair, a character string, or a bounding box. The character string tends to be a more practical solution in many situations since you can simply pass in the name of the location. For example, you could define the location as Houston Texas or The White House or The Grand Canyon. When a character string is passed to the location parameter it is then passed to the geocoding service to obtain the latitude/longitude coordinate pair. Add the code you see below to see how passing in a character string works.

    myMap <- get_map(location = "Grand Canyon, Arizona", zoom = 11)
    ggmap(myMap)

The zoom level can be set between 3 and 21, with 3 representing a continent level view and 21 representing a building level view. Take some time to experiment with the zoom level to see the effect of various settings.

10. You can check your work against the solution file Chapter8_1.R.
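Recent versions of ggmap require a Google Maps Platform API key before get_map() can download Google basemaps or geocode a place name, which is a likely cause of errors in the steps above. The sketch below shows one way to register a key with register_google(). The environment variable name GOOGLE_MAPS_API_KEY is an assumption made for this example; store your key wherever suits your setup. The code skips the download when no key is available.

```r
# Assumption: the key is kept in an environment variable with this (hypothetical) name
api_key <- Sys.getenv("GOOGLE_MAPS_API_KEY")

# Only attempt the download when a key exists and ggmap is installed
can_download <- nzchar(api_key) && requireNamespace("ggmap", quietly = TRUE)

if (can_download) {
  # Register the key for the current R session
  ggmap::register_google(key = api_key)

  # Same request as step 5, with the source and map type spelled out
  myMap <- ggmap::get_map(location = "California", zoom = 6,
                          source = "google", maptype = "terrain")
  print(ggmap::ggmap(myMap))
} else {
  message("No API key found or ggmap is not installed; skipping the basemap download.")
}
```

Keeping the key in an environment variable rather than in the script avoids accidentally sharing it when you share your code.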

Exercise 2: Adding operational data layers

ggmap() returns a ggplot object, meaning that it acts as a base layer in the ggplot2 framework. This allows for the full range of ggplot2 capabilities, meaning that you can plot points on the map, add contours and 2D heat maps, and more. We'll examine some of these capabilities in this section.

1. Initially we'll just load the wildfire events as points. Add the code you see below to produce a map of California that displays wildfires from the years 1980-2016 that burned at least 1,000 acres.

    # read_csv() comes from readr; select() and filter() come from dplyr
    library(readr)
    library(dplyr)

    myLocation <- "California"

    # get the basemap layer
    myMap <- get_map(location = myLocation, zoom = 6)

    # read in the wildfire data to a data frame (tibble)
    dfWildfires <- read_csv("StudyArea_SmallFile.csv", col_names = TRUE)

    # select specific columns of information
    df <- select(dfWildfires, STATE, YEAR_, TOTALACRES, DLATITUDE, DLONGITUDE)

    # filter the data frame so that only fires of at least 1,000 acres burned in California are present
    df <- filter(df, TOTALACRES >= 1000 & STATE == 'California')

    # use geom_point() to display the points. The x and y properties of the aes() function are used to define the geometry
    ggmap(myMap) + geom_point(data = df, aes(x = DLONGITUDE, y = DLATITUDE))

2. Now let's do something a little more interesting. First, use the dplyr function mutate() to group the fires by decade.

    df <- mutate(df, DECADE = ifelse(YEAR_ %in% 1980:1989, "1980-1989",
                              ifelse(YEAR_ %in% 1990:1999, "1990-1999",
                              ifelse(YEAR_ %in% 2000:2009, "2000-2009",
                              ifelse(YEAR_ %in% 2010:2016, "2010-2016", "-99")))))

3. Next, color code the wildfires by DECADE and create a graduated symbol map based on the size of each fire. The colour property defines the column to use for grouping, and the size property defines the column to use for the size of each symbol.

    ggmap(myMap) +
      geom_point(data = df, aes(x = DLONGITUDE, y = DLATITUDE, colour = DECADE, size = TOTALACRES))

This should produce a map that appears as seen in the screenshot below.

4. Let's change the map view to focus more on southern California, and in particular the area just north of Los Angeles.

    myMap <- get_map(location = "Santa Clarita, California", zoom = 10)
    ggmap(myMap) +
      geom_point(data = df, aes(x = DLONGITUDE, y = DLATITUDE, colour = DECADE, size = TOTALACRES))

5. Next, we'll add contour and heat layers. The geom_density2d() function is used to create the contours while the stat_density2d() function creates the heat map. Add the following code to produce the map you see below. You can experiment with the colors using the low and high arguments of scale_fill_gradient(). Here we've set them to green and red respectively, but you may want to change the color scheme.

    myMap <- get_map(location = "California", zoom = 6)

    ggmap(myMap, extent = "device") +
      geom_density2d(data = df, aes(x = DLONGITUDE, y = DLATITUDE), size = 0.3) +
      stat_density2d(data = df,
                     aes(x = DLONGITUDE, y = DLATITUDE, fill = after_stat(level), alpha = after_stat(level)),
                     size = 0.01, bins = 16, geom = "polygon") +
      scale_fill_gradient(low = "green", high = "red") +
      scale_alpha(range = c(0, 0.3), guide = "none")

6. If you'd prefer to see the heat map without contours, the code can be simplified as follows:

    ggmap(myMap, extent = "device") +
      stat_density2d(data = df,
                     aes(x = DLONGITUDE, y = DLATITUDE, fill = after_stat(level), alpha = after_stat(level)),
                     size = 0.01, bins = 16, geom = "polygon") +
      scale_fill_gradient(low = "green", high = "red") +
      scale_alpha(range = c(0, 0.3), guide = "none")

7. Finally, let's create a facet map that depicts hot spots for each year from 2010 onward. Add the following code to see how this works. The dataset contains information up through the year 2016.

    df <- filter(df, YEAR_ %in% c(2010, 2011, 2012, 2013, 2014, 2015, 2016))

    ggmap(myMap, extent = "device") +
      stat_density2d(data = df,
                     aes(x = DLONGITUDE, y = DLATITUDE, fill = after_stat(level), alpha = after_stat(level)),
                     size = 0.01, bins = 16, geom = "polygon") +
      scale_fill_gradient(low = "green", high = "red") +
      scale_alpha(range = c(0, 0.3), guide = "none") +
      facet_wrap(~YEAR_)

8. You can check your work against the solution file Chapter8_2.R.
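Step 2 of this exercise assigns each fire to a decade with nested ifelse() calls. As a hedged alternative, base R's cut() function can express the same binning more compactly. The sketch below compares the two approaches on a few invented years; the vector name years and the chosen values are assumptions for illustration only.

```r
# Hypothetical years covering each decade boundary
years <- c(1980, 1989, 1990, 2005, 2016)

# Nested ifelse(), following the pattern used in the exercise
decade_nested <- ifelse(years %in% 1980:1989, "1980-1989",
                 ifelse(years %in% 1990:1999, "1990-1999",
                 ifelse(years %in% 2000:2009, "2000-2009",
                 ifelse(years %in% 2010:2016, "2010-2016", "-99"))))

# cut() with explicit breaks; each interval excludes its lower bound
# and includes its upper bound, so 1989 falls in the first bin
decade_cut <- cut(years,
                  breaks = c(1979, 1989, 1999, 2009, 2016),
                  labels = c("1980-1989", "1990-1999", "2000-2009", "2010-2016"))

print(decade_nested)
print(as.character(decade_cut))
```

One difference to keep in mind: years outside the breaks become NA with cut(), whereas the nested ifelse() version labels them "-99".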

Exercise 3: Adding Layers from Shapefiles

While they are somewhat of an older GIS data format, shapefiles are still commonly used to represent geographic features. With a little bit of manipulation, you can plot data from shapefiles on a ggmap basemap.

1. For this exercise you'll need to install an additional package called rgdal. Use the Packages pane to find and install rgdal or enter the code you see below.

    install.packages("rgdal")

2. Load the rgdal package through the Packages pane or enter the code you see below.

    library(rgdal)

3. The Data folder that contains the exercise data for this book contains a shapefile called S_USA.Wilderness. You'll actually see a number of files with this name, but a different file extension. These files combine to create what is called a shapefile. This file contains the boundaries of designated wilderness areas in the United States. Use the readOGR() function from rgdal to load the data into a variable.

    wild <- readOGR('.', 'S_USA.Wilderness')

4. The fortify() function, which is part of ggplot2, converts all the individual points that define each boundary into a data frame that can then be used to plot the polygon boundaries.

    wild <- fortify(wild)

5. Use the ggmap qmap() function (qmap means quick map) to create the basemap that will be used as the reference for the wilderness boundaries. Center the map on Montana.

    montana <- qmap("Montana", zoom = 6)

6. Before plotting the wilderness boundaries as polygons on the map, take a look at the data frame that was created by the fortify() function so you'll have a better understanding of the structure created by this function.

    View(wild)

Take a look at the group column. This column uniquely identifies each wilderness boundary. The wilderness boundaries are polygons, and polygons are defined by a set of points which define the structure of the polygon. It's sort of like playing connect the dots, where each dot is a latitude/longitude coordinate pair defined by the long and lat columns in the data frame.

For example, take a look at group 0.1. Notice that there are multiple rows that contain the value 0.1, and that each row has unique long and lat values. These are all the points used to define the boundaries of that polygon.

7. Now plot the wilderness boundaries on the basemap. Notice the use of the group column for grouping the polygons. The alpha value is set outside of aes() so that it is applied as a fixed transparency rather than mapped as a variable. It does take some time to plot the boundaries on the map, so be patient with this step. Eventually you should see a map similar to the screenshot below.

    montana +
      geom_polygon(aes(x = long, y = lat, group = group), data = wild, fill = 'white', alpha = 0.25) +
      geom_polygon(aes(x = long, y = lat, group = group), data = wild, color = 'black', fill = NA)

8. Optional: Use the color, fill, and alpha (used to define transparency) parameters to change the symbology to different colors and styles.

9. You can check your work against the solution file Chapter8_3.R.
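The long, lat, and group structure described in step 6 can be reproduced by hand, which makes the connect-the-dots idea concrete without needing a shapefile. The sketch below builds a two-polygon data frame with invented coordinates and plots it with plain ggplot2; the shapes are hypothetical and do not correspond to real wilderness areas. As a side note, the rgdal package has since been retired from CRAN, and the sf package is a commonly used replacement for reading shapefiles, though that is beyond what this exercise covers.

```r
library(ggplot2)

# Hypothetical boundary points: a four-point polygon and a three-point polygon.
# Rows sharing a group value belong to the same polygon, in drawing order.
wild_toy <- data.frame(
  long  = c(-114, -113, -113, -114,   -111, -110, -110.5),
  lat   = c(  47,   47,   48,   48,     45,   45,   46),
  group = c(rep("0.1", 4), rep("1.1", 3))
)

# Without group = group, ggplot2 would connect all seven points into one shape
p <- ggplot(wild_toy, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", alpha = 0.25, color = "black") +
  coord_quickmap()

# Number of boundary points that define each polygon
points_per_polygon <- table(wild_toy$group)
print(points_per_polygon)
```

The same aes(x = long, y = lat, group = group) mapping is what lets geom_polygon() draw each wilderness boundary separately in step 7.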

Conclusion

In this chapter you learned how to use the ggmap package to create compelling data visualizations in map format. You learned how to create basemaps using Google as a data source, add operational data layers, create various types of map visualizations using external data sources, and load shapefiles.

In the next chapter you will learn how to use R Markdown to share your work with others.

Chapter 9

R Markdown

R Markdown is an authoring framework for data science that combines code, results, and commentary. Output formats include PDF, Word, HTML, slideshows, and more. An R Markdown document essentially serves three purposes: communication, collaboration, and as a modern-day lab environment that captures not only what you did, but also what you were thinking. From a communication perspective it enables decision makers to focus on the results of your analysis rather than the code. However, because it also enables you to include the code, it functions as a means of collaboration between data scientists.

R Markdown uses the rmarkdown package, but you don’t have to explicitly load the package in RStudio. RStudio will automatically load the package as needed. An R Markdown file is a plain text file with an extension of .Rmd. These files contain a mixture of three types of content: a YAML header, R code, and text mixed with simple text formatting.

When it is run in RStudio, an R Markdown file displays both the code and the output of the code. Using the RStudio interface you can run sections of the code or all the code in the file. You can see an example of this in the screenshot below. Notice that the code is enclosed by three back-ticks and that the output of the code appears directly below it.
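As a rough illustration of the three types of content, the sketch below assembles a tiny R Markdown file from R and writes it to a temporary location. The file name example.Rmd, the title, and the chunk body are placeholders chosen for this example, not files used in the exercises.

```r
# Three back-ticks, built with strrep() so the delimiter is easy to spot
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                                    # YAML header starts
  "title: \"Creating Maps with R\"",
  "output: html_document",
  "---",                                    # YAML header ends
  "",
  "## A heading",                           # text with simple formatting
  "Some *commentary* about the analysis.",
  "",
  paste0(fence, "{r}"),                     # code chunk starts
  "summary(cars)",
  fence                                     # code chunk ends
)

# Write the document to a temporary .Rmd file (hypothetical file name)
rmd_path <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_path)
```

Opening the resulting file in RStudio should show the header between the two lines of dashes, the formatted text, and a single code chunk.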

If you want to export the contents to a specific file type you can use the Knit functionality embedded in RStudio to export to HTML, PDF, and Word formats. This will export a complete file containing text, code, and results.

In this chapter we’ll cover the following topics:

• Creating an R Markdown file
• Adding code chunks and text to an R Markdown file
• Code chunk and header options
• Caching
• Using Knit to output an R Markdown file

Exercise 1: Creating an R Markdown file

An R Markdown file is simply a plain text file with a file extension of .Rmd. You can use RStudio to create new markdown files, which is what you’ll do in this brief exercise.

1. The exercises in this chapter require the following packages: readr, dplyr, ggplot2, and ggmap. They can be loaded from the Packages pane, the Console pane, or a script.

2. Open RStudio and go to File | New File | R Markdown. This will display the dialog you see below. There are different types of markdown that can be created, but for this exercise we’ll keep it simple and create a document.

3. Select Document (which is the default), give it a title of Creating Maps with R, change the author name if you’d like, and select PDF as the output.

4. This will create a file with some header information, text, and code. Your file should look similar to the screenshot below.

5. At the very top of the file is the header information, which is surrounded by dashes. We’ll add some content to this section in a later exercise, but for now we’ll leave it as is.

6. Code sections are grouped through the use of back-ticks as seen in the screenshot below.

7. Plain text and formatted text can be included in a markdown file as well. Text that needs to be formatted must follow a specific syntax. For example, you can format text as italics or bold, and you can add headings, links, and images. Below is an example of both plain text and text that has been formatted.

8. Other than the header information we aren’t going to use any of the default code or text provided, so go ahead and delete everything other than the header.

9. Save the file to your working directory with a name of CreatingMapsWithR.Rmd.

Exercise 2: Adding Code Chunks and Text to an R Markdown File

R code can be included in the R Markdown file through the use of chunks, which are defined through the use of three back-ticks followed by an r enclosed within curly braces. Inside the curly braces are options that can be included. These options can include TRUE|FALSE parameters for turning various types of messaging on and off.

Chunks define a single task, sort of like a function. They should be self-contained and tightly defined pieces of code. There are three ways to insert chunks into an R Markdown file: Cmd/Ctrl-Alt-I, the Insert button on the editor toolbar, and by manually typing the chunk delimiters.

You can also add plain text and formatted text to an R Markdown file. Formatted text has to be defined according to a specific syntax. We’ll see various examples of formatted text as we move through this exercise.

In this exercise you’ll learn how to add code chunks to an R Markdown file.

1. First, we’ll add some descriptive text that will be included in the output R Markdown file. Add the text you see below to the file just below the header. If you have a digital copy of the book you can copy and paste rather than typing everything. Notice that the text Step 1: Creating a Basemap has been preceded by two pound signs: ## Step 1: Creating a Basemap. The pound signs are used to define headings. In this case two pound signs translate to an HTML <h2> tag, which defines a second-level heading. You’ll also notice that some of the words like ggmap and ggplot2 are surrounded by back-ticks. Back-ticks are used to apply a different style to a word to indicate that it is programmatic code.

The `ggmap` package enables the visualization of spatial data and spatial statistics in a map format using the layered approach of `ggplot2`. This package also includes basemaps that give your visualizations context including Google Maps, OpenStreetMap, Stamen Maps, and CloudMade maps. In addition, utility functions are provided for accessing various Google services including Geocoding, Distance Matrix, and Directions.

The `ggmap` package is based on `ggplot2`, which means it will take a layered approach and will consist of the same five components found in `ggplot2`. These include a default dataset with aesthetic mappings where x is longitude, y is latitude, and the coordinate system is fixed to Mercator. Other components include one or more layers defined with a geometric object and statistical transformation, a scale for each aesthetic mapping, coordinate system, and facet specification. Because `ggmap` is built on `ggplot2` it has access to the full range of `ggplot2` functionality. In this exercise you’ll learn how to use the `ggmap` package to plot various types of spatial visualizations.

## Step 1: Creating a Basemap

There are two basic steps to create a map with `ggmap`. The details are more complex than these two steps might imply, but in general you just need to download the map raster and then plot operational data on the basemap. Step 1 is to download the map raster, also known as the basemap. This is accomplished using the `get_map()` function, which can be used to create a basemap from Google, Stamen, OpenStreetMap, or CloudMade. You’ll learn how to do that in this step. In a future step you’ll learn how to add and style operational data in various ways.

1. First, load the libraries that we’ll need for this exercise.

2. Click Insert and then R to insert a new code chunk as seen below. The code you add will go in between the set of back-ticks. Most markdown files will have a number of code chunks, with each defining a specific task. They are similar in many ways to functions.

3. For this code chunk we’ll just load the libraries that will be used in this exercise. Add the code you see below inside the code chunk boundaries.

```{r}
library(ggplot2)
library(ggmap)
library(readr)
library(dplyr)
```

4. Add some additional text that describes the next step.

2. Create a variable called `myLocation` and set it to `California`. Call the `get_map()` function with a zoom level of 6, and plot the map using the `ggmap()` function, passing in a reference to the variable returned by the `get_map()` function. The default map type is Google Maps with a style of Terrain.

5. Insert a new code chunk just below the descriptive text and add the following code.

```{r}
myLocation <- "California"
myMap <- get_map(location = myLocation, zoom = 6)
ggmap(myMap)
```

6. Let’s run the code that has been added so far to see the result. Select Run | Run All from the RStudio interface. This should produce the output you see below. The output is included inside the markdown document. If not, check your code and try running it again.

7. Add descriptive text for the next section.

3. The code you see below will create a Google satellite basemap layer. Other basemap layers include Stamen, OSM, and CloudMade.

8. Create a new code chunk and add the code you see below.

```{r}
myMap <- get_map(location = myLocation, zoom = 6, source = "google", maptype = "satellite")
ggmap(myMap)
```

9. Add descriptive text for the next section.

4. There are a number of ways that you can define the input location: a longitude/latitude coordinate pair, a character string, or a bounding box. The character string tends to be a more practical solution in many situations since you can simply pass in the name of the location. For example, you could define the location as Houston Texas or The White House or The Grand Canyon. When a character string is passed to the location parameter it is then passed to the geocoding service to obtain the latitude/longitude coordinate pair. Add the code you see below to see how passing in a character string works.

10. Create a new code chunk and add the code you see below.

```{r}
myMap <- get_map(location = "Grand Canyon, Arizona", zoom = 11)
ggmap(myMap)
```
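The three ways of defining a location can be sketched as plain R values. The coordinates below are approximate, illustrative numbers for the Grand Canyon area and are not taken from the book. The map requests are wrapped in if (FALSE) so that nothing is downloaded; as far as I know, recent versions of ggmap also expect a Google API key to be registered first, so treat that part as an assumption about your setup.

```r
# 1. A character string, resolved through the geocoding service
loc_name <- "Grand Canyon, Arizona"

# 2. A longitude/latitude pair (longitude first); illustrative values
loc_point <- c(lon = -112.11, lat = 36.11)

# 3. A bounding box given as left/bottom/right/top; illustrative values
loc_bbox <- c(left = -113.0, bottom = 35.8, right = -111.6, top = 36.5)

# Quick sanity check: the point should fall inside the bounding box
point_in_bbox <-
  loc_point[["lon"]] > loc_bbox[["left"]] &&
  loc_point[["lon"]] < loc_bbox[["right"]] &&
  loc_point[["lat"]] > loc_bbox[["bottom"]] &&
  loc_point[["lat"]] < loc_bbox[["top"]]

# Not run here: these calls need network access and an API key
if (FALSE) {
  library(ggmap)
  register_google(key = "YOUR_API_KEY")  # placeholder, not a real key
  ggmap(get_map(location = loc_name, zoom = 11))
  ggmap(get_map(location = loc_point, zoom = 11))
  ggmap(get_map(location = loc_bbox))
}
```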

11. Let’s stop adding code for now and run what is currently in the file to see the result. Select Run | Run All. Several maps will be produced inside the markdown document including the one seen below, which will be produced at the very end. If you don’t see the maps you may need to check your code. We haven’t yet added parameters that will output warnings and errors, but will do so in a later step.

12. Add descriptive text for the next section.

The zoom level can be set between 3 and 21, with 3 representing a continent-level view and 21 representing a building-level view.

## Step 2: Adding Operational Data Layers

`ggmap()` returns a `ggplot` object, meaning that it acts as a base layer in the `ggplot2` framework. This allows for the full range of `ggplot2` capabilities, meaning that you can plot points on the map, add contours and 2D heat maps, and more. We’ll examine some of these capabilities in this section.

1. For this section we’ll use the historical wildfire information found in the StudyArea_SmallFile.csv file. Load this dataset using the `read_csv()` function. You can download this file at: https://www.dropbox.com/s/9ouh21a6ym62nsl/StudyArea.csv?dl=0

13. Create a new code chunk and add the code you see below. This will load wildfire data from a csv file. Note: The path to your StudyArea_SmallFile.csv file may differ from the one you see below.

```{r}
dfWildfires <- read_csv("~/Desktop/IntroR/Data/StudyArea_SmallFile.csv",
                        col_types = list(FIRENUMBER = col_character(), UNIT = col_character()),
                        col_names = TRUE)
```
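The col_types argument fixes the type of selected columns up front instead of letting readr guess it from the data. The sketch below shows the effect on a tiny made-up file; the file contents are invented for illustration, and only the FIRENUMBER, UNIT, and TOTALACRES column names come from the exercise.

```r
library(readr)

# Write a small, hypothetical CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("FIRENUMBER,UNIT,TOTALACRES",
             "1201,12,1500",
             "3457,7,230"), csv_path)

# Let readr guess every column type: the code-like columns come back numeric
guessed <- read_csv(csv_path)

# Declare the identifier columns as text; unnamed columns are still guessed
typed <- read_csv(csv_path,
                  col_types = list(FIRENUMBER = col_character(),
                                   UNIT = col_character()))
```

Identifier columns such as these are labels rather than quantities, so reading them as character avoids surprises if a later row contains a value that is not a number.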

14. Add descriptive text for the next section.

2. Initially we’ll just load the wildfire events as points. Add the code you see below to produce a map of California that displays wildfires from the years 1980-2016 that burned at least 1,000 acres.

15. Create a new code chunk and add the code you see below. This code chunk will display each of the wildfires as a point on the map.

```{r}
myLocation <- 'California'
# get the basemap
myMap <- get_map(location = myLocation, zoom = 6)
# use the select() function to limit the columns from the data frame
df <- select(dfWildfires, STATE, YEAR_, TOTALACRES, DLATITUDE, DLONGITUDE)
# use the filter() function to get only fires in California with
# 1,000 or more acres burned
df <- filter(df, TOTALACRES >= 1000 & STATE == 'California')
# produce the final map
ggmap(myMap) + geom_point(data = df, aes(x = DLONGITUDE, y = DLATITUDE))
```

16. Add the following descriptive text.

3. Now let’s do something a little more interesting. First, use the `dplyr` `mutate()` function to group the fires by decade.

17. Create a new code chunk and add the code you see below. The mutate() function is used in this code chunk to create a new column called DECADE and then populate each row with a value for the decade in which the fire occurred.

```{r}
# the first line of this statement is reconstructed from the step description;
# it is cut off in the source text
df <- mutate(df, DECADE = ifelse(YEAR_ %in% 1980:1989, "1980-1989",
                          ifelse(YEAR_ %in% 1990:1999, "1990-1999",
                          ifelse(YEAR_ %in% 2000:2009, "2000-2009",
                          ifelse(YEAR_ %in% 2010:2016, "2010-2016", "-99")))))
```
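Nested ifelse() calls become hard to read as the number of bins grows. As an alternative sketch, and not the approach used in the book, dplyr's case_when() expresses the same binning as a flat list of conditions. The fires data frame below is made up for illustration.

```r
library(dplyr)

# Hypothetical fire years standing in for the YEAR_ column
fires <- data.frame(YEAR_ = c(1980, 1989, 1995, 2004, 2016))

# Each condition is checked in order; the final TRUE catches anything else
fires <- mutate(fires, DECADE = case_when(
  YEAR_ %in% 1980:1989 ~ "1980-1989",
  YEAR_ %in% 1990:1999 ~ "1990-1999",
  YEAR_ %in% 2000:2009 ~ "2000-2009",
  YEAR_ %in% 2010:2016 ~ "2010-2016",
  TRUE ~ "-99"
))

print(fires)
```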

18. Add the following descriptive text.

4. Next, color code the wildfires by `DECADE` and create a graduated symbol map based on the size of each fire. The `colour` property defines the column to use for grouping, and the `size` property defines the column to use for the size of each symbol.

19. Create a new code chunk and add the code you see below. This code chunk will color code the fires by decade and size each symbol by the acreage burned.

```{r}
ggmap(myMap) + geom_point(data = df, aes(x = DLONGITUDE, y = DLATITUDE, colour = DECADE, size = TOTALACRES))
```
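The colour and size mappings work the same way on an ordinary ggplot, which makes them easy to try without downloading a basemap. The sketch below uses a few invented points; toy_fires and its values are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical fires: location, decade label, and acres burned
toy_fires <- data.frame(
  DLONGITUDE = c(-120.5, -118.2, -121.9, -117.4),
  DLATITUDE  = c(38.1, 34.6, 40.2, 33.9),
  DECADE     = c("1980-1989", "1990-1999", "2000-2009", "2010-2016"),
  TOTALACRES = c(1200, 5400, 22000, 96000)
)

# colour groups the points by decade; size scales each symbol by acreage
p <- ggplot(toy_fires, aes(x = DLONGITUDE, y = DLATITUDE)) +
  geom_point(aes(colour = DECADE, size = TOTALACRES))

# The aesthetics mapped in the point layer
mapped <- names(p$layers[[1]]$mapping)
```

Because both properties sit inside aes(), ggplot2 builds a legend for each of them. A value placed outside aes() would apply to every point and produce no legend.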

20. Let’s stop adding code for now and run what is currently in the file to see the result. Before running the code again go ahead and clear the past results by clicking the small X in the upper right-hand corner of the output for each map as seen in the screenshot below.

21. Select Run | Run All. The output produced will include several maps with the final map appearing as seen in the screenshot below.

5. Let’s change the map view to focus more on southern California, and in particular the area just north of Los Angeles.

23. Create a new code chunk and add the code you see below. This code chunk will color code the fires by decade and size the symbols according to the total acreage burned.

```{r}
myMap <- get_map(location = "Santa Clarita, California", zoom = 10)
ggmap(myMap) + geom_point(data = df, aes(x = DLONGITUDE, y = DLATITUDE, colour = DECADE, size = TOTALACRES))
```

24. Add the following descriptive text.

6. Next we’ll add contour and heat layers. The `geom_density2d()` function is used to create the contours while the `stat_density2d()` function creates the heat map. Add the following code to produce the map you see below. You can experiment with the colors using the `low` and `high` properties of `scale_fill_gradient()`. Here we’ve set them to green and red respectively, but you may want to change the color scheme.

25. Create a new code chunk and add the code you see below. This code chunk will create a heat map and add contours.

```{r}
myMap <- get_map(location = "Santa Clarita, California", zoom = 8)

ggmap(myMap, extent = "device") +
  geom_density2d(data = df, aes(x = DLONGITUDE, y = DLATITUDE), size = 0.3) +
  stat_density2d(data = df, aes(x = DLONGITUDE, y = DLATITUDE, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 16, geom = "polygon") +
  scale_fill_gradient(low = "green", high = "red") +
  scale_alpha(range = c(0, 0.3), guide = FALSE)
```

7. If you’d prefer to see the heat map without contours, the code can be simplified as follows:

27. Create a new code chunk and add the code you see below. This code chunk will remove the contours.

```{r}
ggmap(myMap, extent = "device") +
  stat_density2d(data = df, aes(x = DLONGITUDE, y = DLATITUDE, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 16, geom = "polygon") +
  scale_fill_gradient(low = "green", high = "red") +
  scale_alpha(range = c(0, 0.3), guide = FALSE)
```

28. Add the following descriptive text.

8. Finally, let’s create a facet map that depicts hot spots for each year in the current decade. Add the following code to see how this works. The dataset contains information up through the year 2016.

29. Create a code chunk and add the code you see below.

```{r}
df <- filter(dfWildfires, STATE == 'California')
df <- filter(df, YEAR_ %in% c(2010, 2011, 2012, 2013, 2014, 2015, 2016))
myMap <- get_map(location = "Santa Clarita, California", zoom = 9)

ggmap(myMap, extent = "device") +
  stat_density2d(data = df, aes(x = DLONGITUDE, y = DLATITUDE, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 16, geom = "polygon") +
  scale_fill_gradient(low = "green", high = "red") +
  scale_alpha(range = c(0, 0.3), guide = FALSE) +
  facet_wrap(~YEAR_)
```

30. That completes the code for this R Markdown file. Go ahead and run the code again to see the final output by selecting Run | Run All.

Exercise 3: Code chunk and header options

Chunk options are arguments supplied to the chunk header. Currently there are approximately 60 such options. We’ll examine some of the more commonly used and important options in this exercise. All code chunk options are placed inside the {r} block.

Code chunks can be given an optional name as seen in the example code below where the code chunk has been given a name of MapSetup.

```{r MapSetup, warning=FALSE, error=FALSE, message=FALSE}

The advantages of naming chunks include easier navigation using the code navigator in RStudio, useful names given to graphics produced by chunks, and the ability to cache chunks to avoid re-performing computations on each run. This last advantage is perhaps the most useful.

1. The R Markdown pane includes a quick access menu for easily navigating to different sections of your R Markdown page. The arrow in the screenshot below displays the location of this functionality.

2. Click on the quick access button now to see the different sections of the R Markdown file. You should see something similar to the screenshot below. You’ll notice that it is sectioned by headings and then code chunks. To make navigation easier you can name each of these chunks.

Select Chunk 1 under Step 1: Creating a Basemap to return to the first code chunk you created in an earlier exercise. This code chunk simply defines the libraries that will be used in the file.

In the {r} section of the header name the chunk libs.

```{r libs}

3. Notice that the value has now been updated in the quick access dropdown menu.

4. Rename the rest of your code chunks. You can use whatever name makes the most sense for each.

5. Next, we’ll add some code options. Although there are currently 60+ options that can be applied to a code chunk we’ll examine only a few of the more important options. You can get a list of all the available code chunk options at https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf.

6. Messaging is one of the most commonly used and useful options. There are actually three messaging options: message, warning, and error. All three are TRUE|FALSE values. When a document is knitted, message and warning default to TRUE, while error defaults to FALSE, which stops rendering at the first error. Navigate to Chunk 2 and add the options you see highlighted below. Setting all three to TRUE explicitly ensures that general information messages, warnings, and errors are all shown.

```{r error=TRUE, warning=TRUE, message=TRUE}
myLocation <- "California"
myMap <- get_map(location = myLocation, zoom = 6)
ggmap(myMap)
```

7. Now when you run this section any of these messages will be printed out along with the output. Rather than running the entire markdown file each time you want to test something, you can limit the run to a particular code chunk by clicking the arrow on the far right-hand side of the code chunk as seen in the screenshot below.

8. The output window includes two overview windows: the output visualization and the R Console. If you click the R Console overview window as seen in the screenshot below it will display any messages that were written to the console as a result of the execution of this code block.

Clicking the R Console window should produce an output similar to the screenshot below.

9. Now add the same message, warning, and error options to your other code chunks.

10. Run the code chunks one at a time and examine the output. Any warnings and errors will be prominently displayed as seen in the screenshot below.
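Repeating the same three options in every chunk header works, but knitr also lets you set defaults once for all chunks. The sketch below is one way to do that, assuming the code is placed in a chunk near the top of the file; it is an alternative to the step above, not something the exercise requires.

```r
library(knitr)

# Set document-wide defaults; individual chunk headers can still override them
opts_chunk$set(message = TRUE, warning = TRUE, error = TRUE)

# Read the defaults back to confirm what is currently in effect
current <- opts_chunk$get(c("message", "warning", "error"))
print(current)
```

With defaults set this way, a chunk header only needs to list an option when it should differ from the default, for example message=FALSE on a chunk that loads noisy packages.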

11. You can also define document-wide options. In this step we’ll look at a common option defined in the header. The content of the header defines parameters that control various settings for the entire document.

The header can include basic descriptive information including the title, author, date, and output format along with other settings including parameters, bibliographies, and citations. Parameters are used when you need to re-render the same report but with distinct values for inputs. The params field controls these parameters.

You’ll notice in the code example below that a state parameter has been defined with a value of California. This value can then be accessed elsewhere in the R Markdown file using the syntax params$<parameter> or params$state in this example.

Add the params option with a parameter of state and set it equal to California in your file exactly as seen in the screenshot above.

12. Navigate to Chunk 2 and find the line you see below.

myLocation <- "California"

13. Change this line as seen below to access the state parameter.

```{r error=TRUE, warning=TRUE, message=TRUE}
myLocation <- params$state
myMap <- get_map(location = myLocation, zoom = 6)
ggmap(myMap)
```
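Inside a chunk, params behaves like an ordinary named list. The sketch below imitates it with a hand-built list so the lookup can be tried in a plain R session. In a real document the list is created from the header, so the first assignment here is only a stand-in, and map_title is a hypothetical variable added for illustration.

```r
# Stand-in for the list that is built from the params field in the header
params <- list(state = "Montana")

# The same lookup used in the chunk above
myLocation <- params$state

# The value can be reused anywhere a string is needed, such as a plot title
map_title <- paste("Basemap centered on", myLocation)
print(map_title)
```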

14. Run the code for Chunk 2 only and you should see the same output map centered on California.

15. Clear the output for chunk 2 by clicking the X in the upper right-hand corner of the output.

16. Return to the state parameter in the header and change the value to Montana.

---
title: "Creating Maps with R"
author: "Eric Pimpler"
date: "7/18/2018"
output: html_document
params:
  state: 'Montana'
---

17. Run code chunk 2 again and now the map should be centered on Montana.

Exercise 4: Caching

Code chunks can also be cached, which is great for computations that take a long time to execute. To enable caching the cache parameter should be set to TRUE. This will save the output of the code chunk to a specially named file on disk. On any subsequent runs, knitr checks to see if the code has changed, and if not, it will reuse the cached results.

You do need to be careful with caching though, as it will only re-run a code chunk if the code changes. It doesn’t take into account things such as changes to underlying data sources. For example, the data in an underlying data source could change, but because the code in the chunk is unchanged the chunk will not be re-run and the cached results will be out of date.

1. Find the code chunk you see below that maps the individual wildfire points. You may have named the chunk something other than what I have named the chunk (point_map).

```{r point_map, error=TRUE, warning=TRUE, message=TRUE}
myLocation <- 'California'
# get the basemap
myMap <- get_map(location = myLocation, zoom = 6)
# use the select() function to limit the columns from the data frame
df <- select(dfWildfires, STATE, YEAR_, TOTALACRES, DLATITUDE, DLONGITUDE)
# use the filter() function to get only fires in California with
# 1,000 or more acres burned
df <- filter(df, TOTALACRES >= 1000 & STATE == 'California')
# produce the final map
ggmap(myMap) + geom_point(data = df, aes(x = DLONGITUDE, y = DLATITUDE))
```

2. Add the cache parameter to the options for the chunk as seen below.

```{r point_map, cache=TRUE, error=TRUE, warning=TRUE,

3. This code chunk is dependent upon the data in the dfWildfires data frame, which is loaded in the code chunk directly preceding this chunk. The code chunk that loads the data from a csv file into the dfWildfires variable can be seen below. You may have named the chunk differently (load_data).

```{r load_data, error=TRUE, warning=TRUE, message=TRUE}
dfWildfires <- read_csv("~/Desktop/IntroR/Data/StudyArea_SmallFile.csv",
                        col_types = list(FIRENUMBER = col_character(), UNIT = col_character()),
                        col_names = TRUE)
```

4. Because the point_map code chunk is dependent upon the data in the dfWildfires data frame you need to add a dependson parameter to the point_map code chunk.

```{r point_map, cache=TRUE, dependson='load_data', error=TRUE, warning=TRUE, message=TRUE}

This will cover situations where the read_csv() call changes. For example, a different file might be read by the function.

5. Keep in mind that the cache and dependson parameters only monitor for changes in the .Rmd file. What would happen if the underlying data in the StudyArea_SmallFile.csv file changes? The answer is that the changes wouldn’t be picked up. To handle this sort of situation you can use the cache.extra option along with the file.info() function.

```{r load_data, cache.extra=file.info('~/Desktop/IntroR/Data/StudyArea_SmallFile.csv'), error=TRUE, warning=TRUE,
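The idea behind pairing cache.extra with file.info() is that file.info() returns details such as the file size and modification time, so its value changes whenever the file changes, and a changed option value invalidates the cache. The sketch below shows the first half of that reasoning on a temporary file; the file and its contents are invented for illustration.

```r
# Create a small, hypothetical data file
data_path <- tempfile(fileext = ".csv")
writeLines("STATE,TOTALACRES", data_path)

# Record the file details before the data changes
before <- file.info(data_path)[, c("size", "mtime")]

# Simulate an update to the underlying data by appending a row
cat("California,1500\n", file = data_path, append = TRUE)

# The same call now returns different details
after <- file.info(data_path)[, c("size", "mtime")]

print(before$size)
print(after$size)
```

The modification time may not change if two writes happen within the same clock tick, which is why the comparison above focuses on size. If you want a check that depends only on the file contents, tools::md5sum() is another option you could pass to cache.extra.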

Exercise 5: Using Knit to output an R Markdown file

The Knit functionality built into RStudio can be used to export an R Markdown file to various formats including HTML, PDF, and Word. Knit can be accessed from the dropdown menu seen in the screenshot below.

1. To simplify the output of the R Markdown file you’re going to remove some of the options that were added in the previous exercise. In the CreatingMapsWithR.Rmd file remove the cache, dependson, and cache.extra parameters added in the last exercise.

2. Select Knit and find the Knit Directory menu item from the RStudio interface. By default, it is set to Document Directory. This simply means that the output file will go into the same directory where the R Markdown file has been saved.

3. Select Knit | Knit to HTML. Knit will begin processing the file and you’ll see output messaging information written to the Console pane. If everything goes as expected an output HTML file called CreatingMapsWithR.html will be created in the same folder where the CreatingMapsWithR.Rmd file was saved. The output file will be fairly lengthy, but the top part should look similar to the screenshot below.

4. You can check your work against the CreatingMapsWithR.Rmd solution file.
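Knitting can also be started from code with the render() function in the rmarkdown package, which is handy when the same report has to be produced for several parameter values. The wrapper below is a sketch: render_report is a hypothetical helper name, and it assumes the input file declares a state parameter in its header as described in Exercise 3.

```r
# Hypothetical helper that knits a file to HTML for a given state
render_report <- function(input, state = "California") {
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("The rmarkdown package is required to render the report.")
  }
  rmarkdown::render(
    input,
    output_format = "html_document",   # same target as Knit to HTML
    params = list(state = state)       # overrides the value in the header
  )
}

# Not run here: needs the .Rmd file from the exercises and a network connection
if (FALSE) {
  render_report("CreatingMapsWithR.Rmd", state = "Montana")
}
```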

Conclusion

In this chapter you learned how to create an R Markdown file, which can be used to share your work with others in various formats including PDF, Word, HTML, slideshows, and more. R Markdown files can include code, results, and commentary, making them a perfect resource for explaining not only the results of a project, but also the mechanics of how the work was accomplished.

In the next chapter you’ll tackle a case study that examines wildfire activity in the western United States.

Chapter 10

Case Study – Wildfire Activity in the Western United States

Studies suggest that over the past few decades, the number and size of wildfires have increased throughout the western United States. The average length of wildfire season has increased significantly as well in some areas. According to the Union of Concerned Scientists (UCS), every state in the western US has experienced an increase in the average annual number of large wildfires (greater than 1,000 acres) over the past few decades. The Pacific Northwest, including Washington, Oregon, Idaho, and the western half of Montana, has had particularly challenging wildfire seasons in recent years.

The 2017 wildfire season shattered records and cost the U.S. Forest Service an unprecedented $2 billion. From the Oregon wildfires to late season fires in Montana, and the highly unusual timing of the California fires in December, it was a busy year in the western United States. While 2017 was a particularly notable wildfire season, the trend is nothing new, and research suggests we can expect it to continue due to climate change and other factors. A recent study suggests that over the next two decades, as many as 11 states are predicted to see the average annual area burned increase by 500 percent.

Extensive studies have found that large forest fires in the western US have been occurring nearly five times more often since the 1970s and 80s. Such fires are burning more than six times the land area as before and lasting almost five times longer.

Climate change is thought to be the primary cause of the increase in large wildfires, with rising temperatures leading to earlier and decreased volume of snowmelt, decreased precipitation, and forest conditions that are drier for longer periods of time. An increase in forest tree disease from insect disturbance has also been associated with climate change and can lead to large areas of highly flammable dead or dying forests. Other potential causes of increased wildfire activity include forest management practices and an increase in human-caused wildfires due to accidents or arson.

In this case study you will use the skills you have gained in this book along with wildfire data from the Federal Wildland Fire Occurrence Database (https://wildfire.cr.usgs.gov/firehistory/data.html), provided by the U.S. Geological Survey (USGS), to visualize the change in wildfire activity from 1980 to 2016. Analysis will be limited to the western United States including California, Arizona, New Mexico, Colorado, Utah, Nevada, Oregon, Washington, Idaho, Montana, and Wyoming. We are particularly interested in the surge of large wildland fires, categorized as fires that burn more than 1,000 acres.

So, have wildfire activity and size actually increased, or does it just seem that way because we’re tuned in more to bad news and social media? In this chapter you’ll answer those questions and more using R with the tidyverse package.

In this chapter we’ll answer the following questions:

• Has the number of wildfires increased or decreased in the past few decades?
• Has the acreage burned increased over time?
• Is the size of individual wildfires increasing over time?
• Has the length of the fire season increased over time?
• Does the acreage burned differ by federal organization?

Exercise 1: Has the number of wildfires increased or decreased in the past few decades?

The StudyArea.csv file in your IntroR\Data folder contains all non-prescribed wildfire activity from 1980-2016 for the 11 states in our study area, which include California, Oregon, Washington, Idaho, Nevada, Arizona, Utah, Montana, Wyoming, Colorado, and New Mexico. We’ll use this file for all the exercises in this chapter. We’re going to focus primarily on large wildfires in this study, defined here as any non-prescribed fire greater than 1,000 acres.

1. In your IntroR folder create a new folder called CaseStudy1. You can do this inside RStudio by going to the Files pane and selecting New Folder inside your working directory.

2. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise1.R.

3. At the top of the script, load the packages that will be used in this exercise.

library(readr)
library(dplyr)
library(ggplot2)

4. Use the read_csv() function from the readr package to load the data into a data frame.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)

5. Check the number of rows in the data frame. This should return 439362 or something close to that.

nrow(df)
[1] 439362

6. We only need a few of the columns from the data frame for this exercise so use the select() function to retrieve the STATE, YEAR_, TOTALACRES, and CAUSE columns. We’ll also rename some of these columns in this step. Piping will be used for the rest of the code in this exercise so begin the statement as seen below.

df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%

7. Next, filter the data frame so that only wildfires that burned 1,000 acres or more are included. Add the code highlighted in bold below.

df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%

8.Grouptherecordsbyyear.

df%>%select(STATE,YR=YEAR_,ACRES=TOTALACRES,CAUSE)%>%filter(ACRES>=1000)%>%group_by(YR)%>%

9.Getacountofthenumberofwildfiresforeachyearbyusingthesummarize()functionwiththecount=n()parameter.

df%>%select(STATE,YR=YEAR_,ACRES=TOTALACRES,CAUSE)%>%filter(ACRES>=1000)%>%group_by(YR)%>%summarize(count=n())%>%
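To see what the filter, group, and count steps produce before any plotting is added, the same pattern can be tried on a tiny data frame. This is only a sketch: the fires tibble and its values are made up for illustration and do not come from StudyArea.csv.

```r
library(dplyr)

# Hypothetical toy data standing in for StudyArea.csv (values are made up)
fires <- tibble(
  YR    = c(1980, 1980, 1981, 1981, 1981, 1982),
  ACRES = c(1500, 200, 3000, 1200, 50, 9000)
)

yearly <- fires %>%
  filter(ACRES >= 1000) %>%   # keep only large fires
  group_by(YR) %>%            # one group per year
  summarize(count = n())      # n() counts the rows in each group

# count() is shorthand for group_by() followed by summarize(n())
yearly_alt <- fires %>%
  filter(ACRES >= 1000) %>%
  count(YR, name = "count")

print(yearly)
```

Both yearly and yearly_alt hold one row per year with the number of large fires, which is the table that ggplot() receives in the next step.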

10. Finally, create a scatterplot with a regression line that depicts the number of wildfires over the years.

df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(YR) %>%
  summarize(count = n()) %>%
  ggplot(mapping = aes(x = YR, y = count)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Large Fires Are Becoming More Common in the West - 1980-2016") +
  xlab("Year") +
  ylab("Number of Wildfires")

11. You can check your work against the solution file CS1_Exercise1.R.

12. Save the script and then click the Run button. If you’ve coded everything correctly you should see the plot displayed in the screenshot below.

13. Based on this visualization it appears as though large wildfires have indeed become more common over the past few decades. But let’s expand this to see if all the states in the study area have the same pattern.

14. Create a new R script and save it with a name of CS1_Exercise1B.R.

15. Add the following code to your script and save it. We’ll discuss the differences between this script and the previous one afterward.

library(readr)
library(dplyr)
library(ggplot2)

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(STATE, YR) %>%
  summarize(cnt = n()) %>%
  ggplot(mapping = aes(x = YR, y = cnt)) +
  geom_point() +
  facet_wrap(~STATE) +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Number of Fires by State and Year") +
  xlab("Year") +
  ylab("Number of Fires")

16. Save the script and then run it to see the output shown in the screenshot below.

This script groups the dataset by STATE and YR and then summarizes the data by generating a count of the number of fires in each group. Finally, the facet_wrap() function is used with ggplot() to create a facet plot that depicts the number of fires by state over time. A number of the individual states show a slight upward trend over time, but many have an almost flat regression line.
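Whether a regression line is "almost flat" can also be checked numerically by fitting a linear model per state and reading off the slope. The sketch below assumes a summarized table shaped like the one the script produces (STATE, YR, cnt); the two states and their counts are invented for illustration.

```r
library(dplyr)

# Hypothetical yearly counts for two states (made-up numbers)
counts <- tibble(
  STATE = rep(c("A", "B"), each = 5),
  YR    = rep(2000:2004, times = 2),
  cnt   = c(1, 2, 3, 4, 5,   4, 4, 4, 4, 4)
)

# Fit cnt ~ YR separately within each state and keep only the slope,
# i.e. the estimated change in the number of fires per year
slopes <- counts %>%
  group_by(STATE) %>%
  summarize(slope = coef(lm(cnt ~ YR))[["YR"]])

print(slopes)
```

A slope near zero corresponds to a flat line in that state’s facet; a positive slope corresponds to an upward trend.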

17. You can check your work against the solution file CS1_Exercise1B.R.

18. Challenge 1: Repeat this process to see the results for wildfires greater than 5,000 acres, 25,000 acres, and 100,000 acres. Are these findings consistent with the results for wildfires greater than 1,000 acres?

19. Challenge 2: Repeat the process but this time group the data by year and by wildfires that are naturally occurring. The CAUSE column includes a value of Natural that can be used to group the data. You’ll need a compound grouping statement.
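One possible way to approach Challenge 1 without copying the script for every threshold is to wrap the filter, group, and count steps in a small function. The helper name count_large_fires() and the toy data are hypothetical; they are not part of the book’s solution files.

```r
library(dplyr)

# Hypothetical helper: count fires per year at or above a size threshold
count_large_fires <- function(data, min_acres) {
  data %>%
    filter(ACRES >= min_acres) %>%
    group_by(YR) %>%
    summarize(count = n())
}

# Made-up fires used only to exercise the helper
fires <- tibble(
  YR    = c(1990, 1990, 1991, 1991),
  ACRES = c(1200, 6000, 30000, 800)
)

thresholds <- c(1000, 5000, 25000)
results <- lapply(thresholds, function(t) count_large_fires(fires, t))
names(results) <- thresholds   # list names become "1000", "5000", "25000"

print(results[["5000"]])
```

With the real data frame in place of fires, each element of results could be piped into the same ggplot() code used earlier in this exercise.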

Exercise 2: Has the acreage burned increased over time?

Measuring the number of fires over time only tells part of the story. The amount of acreage burned during that time may give us more insight into the patterns in wildfire activity. In this exercise we’ll create visualizations that illustrate how much acreage is being burned each year as a result of wildfires.

1. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise2.R.

2. At the top of the script, load the packages that will be used in this exercise.

library(readr)
library(dplyr)
library(ggplot2)

3. The first few lines of this script will be similar to the previous exercises, so I won’t discuss the details of each line. By now you should be able to determine what each of these lines will accomplish anyway. Add the lines shown below.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%

4. Group the data by year.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(YR) %>%

5. Use the summarize() function to sum the total acreage burned by year.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(YR) %>%
  summarize(totalacres = sum(ACRES)) %>%

6. Create a scatterplot with a regression line that displays the total acreage burned by year. In this case you’ll convert the total acres burned to a logarithmic scale as well.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(YR) %>%
  summarize(totalacres = sum(ACRES)) %>%
  ggplot(mapping = aes(x = YR, y = log(totalacres))) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Total Acres Burned") +
  xlab("Year") +
  ylab("Log of Total Acres Burned")

7. You can check your work against the solution file CS1_Exercise2.R.

8. Save the script and then run it to see the output shown in the screenshot below. It’s clear from this graph that there has been a significant increase in the acreage burned over the past few decades.
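Because the y axis is log(totalacres), and log() in R is the natural logarithm, the slope of the regression line can be read as a growth rate: exponentiating the slope and subtracting 1 gives the estimated proportional change per year. The sketch below illustrates this with invented totals that grow by exactly 5% per year; the numbers are not taken from the wildfire dataset.

```r
# Hypothetical yearly totals that grow by exactly 5% per year (made-up numbers)
yearly <- data.frame(YR = 1980:1989)
yearly$totalacres <- 100000 * 1.05^(yearly$YR - 1980)

# Same model that geom_smooth(method = lm) fits to the log-scaled points
fit <- lm(log(totalacres) ~ YR, data = yearly)

# Back-transform the slope to an annual growth rate
annual_growth <- exp(coef(fit)[["YR"]]) - 1
print(annual_growth)
```

Applied to the real summarized data, the same back-transformation would turn the slope of the plotted line into an approximate yearly percentage increase in acreage burned.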

9. Now let’s see if this trend holds for all states in the study area. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise2B.R.

10. At the top of the script, load the packages that will be used in this exercise.

library(readr)
library(dplyr)
library(ggplot2)

11. The first few lines of this script will be similar to the previous exercises, so I won’t discuss the details of each line. Add the lines shown below.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%

12. Group the data by STATE and YR.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(STATE, YR) %>%

13. Use the summarize() function to calculate the total acreage burned by state and year.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(STATE, YR) %>%
  summarize(totalacres = sum(ACRES)) %>%

14. Create a facet plot that displays the total acreage burned by state and year.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(STATE, YR) %>%
  summarize(totalacres = sum(ACRES)) %>%
  ggplot(mapping = aes(x = YR, y = log(totalacres))) +
  geom_point() +
  facet_wrap(~STATE) +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Total Acres Burned") +
  xlab("Year") +
  ylab("Log of Total Acres Burned")

15. You can check your work against the solution file CS1_Exercise2B.R.

16. Save the script and then run it to see the output shown in the screenshot below. It’s clear from this graph that there has been an increase in the acreage burned over the past few decades for all the states in the study area.

17. You may have wondered if there is a difference in the size of wildfires that were caused naturally as opposed to those that were human induced. In the next few steps we’ll write a script to find out. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise2C.R.

18. At the top of the script, load the packages that will be used in this exercise.

library(readr)
library(dplyr)
library(ggplot2)

19. The first few lines of this script will be similar to the previous exercises, so I won’t discuss the details of each line. Add the lines shown below.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%

20. For this script we’ll filter so that only Natural and Human values are selected from the CAUSE column, in addition to requiring that only fires of 1,000 acres or more be included.

The CAUSE column contains additional values, including UNKNOWN and a few other stray values, which is why we’re taking this extra step. The dataset does not include prescribed fires, so we don’t have to worry about that in this case.

The %in% operator can be used with a vector in R to define multiple values, as is the case here.
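Before it is applied in the filter, the behavior of %in% can be checked on a small vector. The values below are made up for illustration; the point is that %in% returns TRUE or FALSE for every element, including missing values, which makes it safe to use inside filter().

```r
# Hypothetical CAUSE values, including a missing one
causes <- c("Human", "Natural", "UNKNOWN", "Natural", NA)

# TRUE where the element matches any value in the right-hand vector
keep <- causes %in% c("Human", "Natural")
print(keep)

# Use the logical vector to subset, which is what filter() does row by row
kept <- causes[keep]
print(kept)
```

Unlike a comparison with ==, which yields NA for a missing value, %in% yields FALSE, so rows with a missing CAUSE are simply dropped.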

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000 & CAUSE %in% c('Human', 'Natural')) %>%

21. Group the data by CAUSE and YR.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000 & CAUSE %in% c('Human', 'Natural')) %>%
  group_by(CAUSE, YR) %>%

22. Sum the total acreage burned.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000 & CAUSE %in% c('Human', 'Natural')) %>%
  group_by(CAUSE, YR) %>%
  summarize(totalacres = sum(ACRES)) %>%

23. Plot the dataset. Use the colour property from the aes() function to color code the values by CAUSE.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000 & CAUSE %in% c('Human', 'Natural')) %>%
  group_by(CAUSE, YR) %>%
  summarize(totalacres = sum(ACRES)) %>%
  ggplot(mapping = aes(x = YR, y = log(totalacres), colour = CAUSE)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Total Acres Burned") +
  xlab("Year") +
  ylab("Log of Total Acres Burned")

24. You can check your work against the solution file CS1_Exercise2C.R.

25. Save the script and then run it to see the output shown in the screenshot below. Both human and naturally caused wildfires have seen a significant increase in the amount of acreage burned over the past few decades, but the amount of acreage burned by naturally occurring fires appears to be increasing at a more rapid pace.

26. Finally, let’s create a violin plot to see the distribution of acres burned by state. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise2D.R.

27. At the top of the script, load the packages that will be used in this exercise.

library(readr)
library(dplyr)
library(ggplot2)

28. The first few lines of this script will be similar to the previous exercises, so I won’t discuss the details of each line. Add the lines shown below.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(STATE) %>%

29. Create a violin plot with an embedded box plot.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(STATE) %>%
  ggplot(mapping = aes(x = STATE, y = log(ACRES))) +
  geom_violin() +
  geom_boxplot(width = 0.1) +
  ggtitle("Wildfires by State Greater than 1,000 Acres") +
  xlab("State") +
  ylab("Acres Burned (Log)")

30. You can check your work against the solution file CS1_Exercise2D.R.

31. Save the script and then run it to see the output shown in the screenshot below.
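A violin plot shows the shape of each state’s distribution, and the embedded box plot marks its median and quartiles. The same quantities can be tabulated directly, which is sometimes easier to compare across states. The sketch below uses two invented states and made-up fire sizes; the column names med_acres, q25, q75, and max_acres are our own choices.

```r
library(dplyr)

# Hypothetical fire sizes for two states (made-up values)
fires <- tibble(
  STATE = c("A", "A", "A", "B", "B", "B"),
  ACRES = c(1000, 2000, 9000, 1500, 1500, 120000)
)

# Numeric counterpart of the box plot drawn inside each violin
size_summary <- fires %>%
  group_by(STATE) %>%
  summarize(
    n         = n(),
    med_acres = median(ACRES),
    q25       = quantile(ACRES, 0.25),
    q75       = quantile(ACRES, 0.75),
    max_acres = max(ACRES)
  )

print(size_summary)
```

A large gap between the median and the maximum, as in state B here, is the kind of long upper tail that motivates plotting log(ACRES) rather than raw acreage.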

Exercise 3: Is the size of individual wildfires increasing over time?

In the first exercise we found that the number of wildfires appears to be increasing over the past few decades. In this exercise we’ll determine whether the size of those fires has increased as well. The StudyArea.csv file contains a TOTALACRES column that defines the number of acres burned by each fire. We’ll group the fires by year and then by decade and determine the mean fire size for each.

1. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise3.R.

2. At the top of the script, load the packages that will be used in this exercise.

library(readr)
library(dplyr)
library(ggplot2)

3. The first few lines of this script will be the same as the previous exercises, so I won’t discuss the details of each line. By now you should be able to determine what each of these lines will accomplish anyway. Add the lines shown below.

dfWildfires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df <- select(dfWildfires, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE)
df <- filter(df, ACRES >= 1000)
grp <- group_by(df, CAUSE, YR)

4. Summarize the data by determining the mean acreage burned for each group.

sm <- summarize(grp, mean(ACRES))

5. The summarize() function will create a new column called mean(ACRES) and add it to the output data frame. This isn’t exactly a user-friendly name, so we’ll change the name of this column in the next step. You can see the output of the summarize() function in the screenshot below.

6. Change the column name.

colnames(sm)[3] <- 'MEAN'

7. Create a scatterplot of the results.

ggplot(data = sm, mapping = aes(x = YR, y = MEAN)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Average Size of Wildfires Has Increased for both Human and Natural Causes") +
  xlab("Year") +
  ylab("Average Wildfire Size")

8. The entire script should appear as seen below.

library(readr)
library(dplyr)
library(ggplot2)

dfWildfires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df <- select(dfWildfires, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE)
df <- filter(df, ACRES >= 1000)
grp <- group_by(df, CAUSE, YR)
sm <- summarize(grp, mean(ACRES))
colnames(sm)[3] <- 'MEAN'
ggplot(data = sm, mapping = aes(x = YR, y = MEAN)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Average Size of Wildfires Has Increased for both Human and Natural Causes") +
  xlab("Year") +
  ylab("Average Wildfire Size")

9. You can check your work against the solution file CS1_Exercise3.R.

10. Save and run the script. If everything has been coded correctly you should see the following output. This graph indicates a clear trend toward larger wildfires over time.
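The separate renaming step is needed only because the summary column was left unnamed. As an alternative sketch, summarize() accepts a name for each summary, so the MEAN column can be created directly. The toy data below is made up, and the .groups = "drop" argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical fires (made-up values) grouped the same way as in the exercise
fires <- tibble(
  CAUSE = c("Human", "Human", "Natural", "Natural"),
  YR    = c(2000, 2000, 2000, 2001),
  ACRES = c(1000, 3000, 5000, 7000)
)

# Naming the summary inside summarize() removes the need for colnames()
sm <- fires %>%
  group_by(CAUSE, YR) %>%
  summarize(MEAN = mean(ACRES), .groups = "drop")

print(sm)
```

The resulting data frame has the columns CAUSE, YR, and MEAN, so it can be passed to the ggplot() call from the exercise unchanged.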

11. Now let’s group the wildfires by decade, calculate the mean fire size during each decade, and create a bar chart to display the results.

12. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise3B.R.

13. At the top of the script, load the packages that will be used in this exercise.

library(readr)
library(dplyr)
library(ggplot2)

14. Load, select, and filter the data in the same way we’ve done with the other exercises in this chapter.

dfWildfires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df <- select(dfWildfires, ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE)
df <- filter(df, ACRES >= 1000)

15. In this step we’ll use the mutate() function along with nested ifelse() calls to create a new column called DECADE and then populate this column based on the value of the YR column for each row. Rows whose year falls outside 1980-2016 receive the placeholder value "-99". Add the code you see below.

df <- mutate(df, DECADE = ifelse(YR %in% 1980:1989, "1980-1989",
                          ifelse(YR %in% 1990:1999, "1990-1999",
                          ifelse(YR %in% 2000:2009, "2000-2009",
                          ifelse(YR %in% 2010:2016, "2010-2016", "-99")))))

16. Group the dataset by DECADE.

grp <- group_by(df, DECADE)

17. Summarize the data by calculating the mean value of acres burned.

sm <- summarize(grp, mean(ACRES))

18. Rename the columns of the data frame created by the summarize() function.

names(sm) <- c("DECADE", "MEAN_ACRES_BURNED")

19. Use the geom_col() function along with ggplot() to create a bar chart that displays the mean wildfire size by decade.

ggplot(data = sm) +
  geom_col(mapping = aes(x = DECADE, y = MEAN_ACRES_BURNED), fill = "red")

20. You can check your work against the solution file CS1_Exercise3B.R.

21. Save and run the script. If everything has been coded correctly you should see the following output. This bar chart indicates a clear trend toward larger wildfires with each passing decade, although it should be noted that the dataset only extends through 2016, so the results for the current decade may be different in a few years.
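The nested ifelse() calls in step 15 work, but they become hard to read as the number of bins grows. One alternative sketch uses base R’s cut() function, which assigns each year to an interval. The years vector below is made up; note also that, unlike the exercise code, cut() returns a factor and gives NA (rather than "-99") for years outside the breaks.

```r
# Hypothetical years used only to demonstrate the binning
years <- c(1980, 1989, 1990, 2005, 2016)

decade <- cut(
  years,
  breaks = c(1980, 1990, 2000, 2010, 2017),
  labels = c("1980-1989", "1990-1999", "2000-2009", "2010-2016"),
  right  = FALSE   # intervals are closed on the left: [1980, 1990), ...
)

print(as.character(decade))
```

Inside the exercise script this would take the form mutate(df, DECADE = cut(YR, ...)), and the rest of the grouping and plotting code would stay the same.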

Exercise 4: Has the length of the fire season increased over time?

Wildfire season is generally defined as the time period between the year’s first and last large wildfires. The infographic below, from the Union of Concerned Scientists (https://www.ucsusa.org/global-warming/science-and-impacts/impacts/infographic-wildfiresclimate-change.html#.W1cji9hKj_Q), highlights the length of the wildfire season for the Western U.S. as a region. Local wildfire seasons vary by location but have almost universally become longer over the past 40 years.

In this exercise we’ll measure the length of the wildfire season over the past few decades for the region as a whole, as well as for individual states.

1. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise4.R.

2. At the top of the script, load the packages that will be used in this exercise. Note that you will need to load the lubridate library for this exercise since we’ll be dealing with dates.

library(readr)
library(dplyr)
library(lubridate)
library(ggplot2)

3. The first few lines of this script will be similar to the previous exercises, so I won’t discuss the details of each line. By now you should be able to determine what each of these lines will accomplish anyway. Add the lines shown below to load the data, select the columns, and filter the data.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000) %>%

4. To measure the length of the wildfire season we’re going to convert the start date of each fire into the day of the year. For example, if a fire occurred on February 1st, it would be the 32nd day of the year. Use the mutate() function as seen below to accomplish this. The mutate() function uses the yday() lubridate function to convert the value of the STARTDATED column into the day of the year.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000) %>%
  mutate(DOY = yday(as.Date(STARTDATED, format = '%m/%d/%y %H:%M'))) %>%

5. Group the data by year.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000) %>%
  mutate(DOY = yday(as.Date(STARTDATED, format = '%m/%d/%y %H:%M'))) %>%
  group_by(YR) %>%

6. Get the earliest and latest start dates of the wildfires using the summarize() function.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000) %>%
  mutate(DOY = yday(as.Date(STARTDATED, format = '%m/%d/%y %H:%M'))) %>%
  group_by(YR) %>%
  summarize(dtEarly = min(DOY, na.rm = TRUE), dtLate = max(DOY, na.rm = TRUE)) %>%

7. Finally, use ggplot() with two calls to geom_line() to create two line graphs that display the earliest and latest fire start dates by year. You’ll also add a linear regression line to both line graphs.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000) %>%
  mutate(DOY = yday(as.Date(STARTDATED, format = '%m/%d/%y %H:%M'))) %>%
  group_by(YR) %>%
  summarize(dtEarly = min(DOY, na.rm = TRUE), dtLate = max(DOY, na.rm = TRUE)) %>%
  ggplot() +
  geom_line(mapping = aes(x = YR, y = dtEarly, color = 'B')) +
  geom_line(mapping = aes(x = YR, y = dtLate, color = 'R')) +
  geom_smooth(method = lm, se = TRUE, aes(x = YR, y = dtEarly, color = "B")) +
  geom_smooth(method = lm, se = TRUE, aes(x = YR, y = dtLate, color = "R")) +
  xlab("Year") +
  ylab("Day of Year") +
  scale_colour_manual(name = "Legend",
                      values = c("R" = "#FF0000", "B" = "#000000"),
                      labels = c("First Fire", "Last Fire"))

8. You can check your work against the solution file CS1_Exercise4.R.

9. Save and run the script. If everything has been coded correctly you should see the following output. This chart shows a clear lengthening of the wildfire season, with the first fire date coming significantly earlier in recent years and the start date of the last fire coming later as well.
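The day-of-year conversion from step 4 can be checked in isolation. The sketch below assumes that STARTDATED holds text such as "02/01/98 00:00" (month/day/two-digit year, a space, then hours and minutes), which is what the format string in the exercise implies; the three dates themselves are made up.

```r
library(lubridate)

# Hypothetical start dates in the assumed month/day/year hour:minute layout
start <- c("02/01/98 00:00", "07/04/12 14:30", "12/31/16 09:15")

# as.Date() parses the text with the given format and drops the time of day
d <- as.Date(start, format = "%m/%d/%y %H:%M")

# yday() returns the day of the year, counting January 1st as day 1
doy <- yday(d)
print(doy)
```

February 1st comes out as day 32, matching the example in the text, and December 31st of a leap year such as 2016 comes out as day 366.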

10. The last script examined the trends in wildfire season length for the entire study area, but you might want to examine these trends at a state level instead. This can be easily accomplished by adding a second condition to the filter. Update the filter as seen below and re-run the script to see the result.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000 & STATE == 'Arizona') %>%
  mutate(DOY = yday(as.Date(STARTDATED, format = '%m/%d/%y %H:%M'))) %>%
  group_by(YR) %>%
  summarize(dtEarly = min(DOY, na.rm = TRUE), dtLate = max(DOY, na.rm = TRUE)) %>%
  ggplot() +
  geom_line(mapping = aes(x = YR, y = dtEarly, color = 'B')) +
  geom_line(mapping = aes(x = YR, y = dtLate, color = 'R')) +
  geom_smooth(method = lm, se = TRUE, aes(x = YR, y = dtEarly, color = "B")) +
  geom_smooth(method = lm, se = TRUE, aes(x = YR, y = dtLate, color = "R")) +
  xlab("Year") +
  ylab("Day of Year") +
  scale_colour_manual(name = "Legend",
                      values = c("R" = "#FF0000", "B" = "#000000"),
                      labels = c("First Fire", "Last Fire"))

The State of Arizona shows an even bigger trend toward longer wildfire seasons. Try a few other states as well.
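Rather than re-running the script once per state, the season length can also be computed for every state at once by grouping on both STATE and YR and subtracting the first day of year from the last. This is a sketch on invented data; the season_days column name is our own, and the DOY values are made up.

```r
library(dplyr)

# Hypothetical day-of-year values for large fires (made-up numbers)
fires <- tibble(
  STATE = c("Arizona", "Arizona", "Arizona", "Utah", "Utah", "Utah"),
  YR    = c(2000, 2000, 2001, 2000, 2000, 2001),
  DOY   = c(120, 250, 100, 150, 240, NA)
)

season <- fires %>%
  filter(!is.na(DOY)) %>%                    # drop fires with no usable start date
  group_by(STATE, YR) %>%
  summarize(dtEarly = min(DOY),
            dtLate  = max(DOY),
            .groups = "drop") %>%
  mutate(season_days = dtLate - dtEarly)     # days between first and last fire

print(season)
```

Filtering out missing values first avoids the situation where a group contains only NA, in which case min() with na.rm = TRUE returns Inf with a warning. A year with a single large fire yields a season length of 0.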

Exercise 5: Does the average wildfire size differ by federal organization?

To wrap up this chapter we’ll examine whether the average wildfire size differs by federal organization. The StudyArea.csv file includes a column (ORGANIZATI) that indicates the jurisdiction where the fire started. This column can be used to group the wildfires.

1. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise5.R.

2. At the top of the script, load the packages that will be used in this exercise.

library(readr)
library(dplyr)
library(ggplot2)

3. The first few lines of this script will be similar to the previous exercises, so I won’t discuss the details of each line. By now you should be able to determine what each of these lines will accomplish anyway. Add the lines shown below to load the data, select the columns, and filter the data.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORG = ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000) %>%

4. Extend the filter so that only the five federal organizations (BIA, BLM, FS, FWS, and NPS) are included, and then group the dataset by ORG and YR.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORG = ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000 & ORG %in% c('BIA', 'BLM', 'FS', 'FWS', 'NPS')) %>%
  group_by(ORG, YR) %>%

5. Summarize the data by calculating the mean acreage burned by organization and year.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORG = ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000 & ORG %in% c('BIA', 'BLM', 'FS', 'FWS', 'NPS')) %>%
  group_by(ORG, YR) %>%
  summarize(meanacres = mean(ACRES)) %>%

6. Create a facet plot for the mean acreage burned by year for each organization.

df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORG = ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000 & ORG %in% c('BIA', 'BLM', 'FS', 'FWS', 'NPS')) %>%
  group_by(ORG, YR) %>%
  summarize(meanacres = mean(ACRES)) %>%
  ggplot(mapping = aes(x = YR, y = log(meanacres))) +
  geom_point() +
  facet_wrap(~ORG) +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Acres Burned by Federal Organization") +
  xlab("Year") +
  ylab("Log of Mean Acres Burned")

7. You can check your work against the solution file CS1_Exercise5.R.

8. Save and run the script. If everything has been coded correctly you should see the following output. It appears as though all the federal agencies have experienced similar increases in the size of wildfires since 1980.

Chapter 11

Case Study – Single Family Residential Home and Rental Values

The Zillow Research group publishes several different measures of home values on a monthly basis, including median list prices, median sale prices, and the Zillow Home Value Index (ZHVI). The ZHVI is based on Zillow’s internal methodology for measuring home values over time. In addition, Zillow publishes a similar measure of rental values (ZRI) as well as a number of other real estate related datasets.

The methodology for ZHVI can be read in detail at https://www.zillow.com/research/zhvi-methodology-6032/, but the simple explanation is that Zillow takes all estimated home values for a given region and month (Zestimate), takes a median of these values, applies some adjustments to account for seasonality or errors in individual home estimates, and then does the same across all months over the past 20 years and for many different geography levels (ZIP, neighborhood, city, county, metro, state, and country). For example, if ZHVI was $400,000 in Seattle one month, that indicates that 50 percent of homes in the area are worth more than $400,000 and 50 percent are worth less (adjusting for seasonal fluctuations; for example, prices tend to be low in December).

Zillow recommends using ZHVI to track home values over time for the very simple reason that ZHVI represents the whole housing stock and not just the homes that list or sell in a given month. Imagine a month where no homes outside of California sold. A national median sale price series or median list price series would both spike. ZHVI, however, would remain a median of all homes across the country and wouldn’t skew toward California any more than in the previous month. ZHVI will always reflect the value of all homes and not just the ones that list or sell in a given month. In this chapter we’ll use some basic R visualization techniques to better understand residential real estate values and rental prices in the Austin, TX metropolitan area.
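The difference between a median of sold homes and a median of all homes can be illustrated with a few made-up numbers. This is only a sketch of the core idea and not Zillow’s actual methodology, which also applies the adjustments described above; the six home values and the sold flags are invented.

```r
# Hypothetical estimated values (in dollars) for six homes in one region
homes <- data.frame(
  value = c(150000, 200000, 250000, 300000, 700000, 900000),
  sold  = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Median over the whole housing stock (the idea behind ZHVI)
median_all <- median(homes$value)

# Median over only the homes that happened to sell this month
median_sold <- median(homes$value[homes$sold])

print(c(all_homes = median_all, sold_only = median_sold))
```

Because only the two most expensive homes sold in this invented month, the sale-based median is far higher than the median of all homes, even though no individual home changed in value.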

In this chapter we’ll cover the following topics:

• What is the trend for home values in the Austin metropolitan area?
• What is the trend for rental values in the Austin metropolitan area?
• Determining the price-rent ratio for the Austin metropolitan area.
• Comparing residential home values in Austin to other Texas metropolitan areas.

Exercise 1: What is the trend for home values in the Austin metro area?

The County_Zhvi_SingleFamilyResidence.csv file in your IntroR\Data folder contains home value data from Zillow. The Zillow Home Value Index (ZHVI) is a smoothed, seasonally adjusted measure of the median estimated home value across a given region and housing type. It is a dollar-denominated alternative to repeat-sales indices. Zillow also publishes home value and other housing data for local markets, as well as a more detailed methodology and a comparison of ZHVI to the S&P CoreLogic Case-Shiller Home Price Indices. We’ll use this file for this particular exercise.

In this first exercise we’ll examine home values over the past couple of decades in the Austin metropolitan area.

1. In your IntroR folder create a new folder called CaseStudy2. You can do this inside RStudio by going to the Files pane and selecting New Folder inside your working directory.

2. In RStudio select File | New File | R Script and then save the file to the CaseStudy2 folder with a name of CS2_Exercise1.R.

3. At the top of the script, load the packages that will be used in this exercise. The tidyr package is included because it provides the gather() function used later in this exercise.

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

4. Use the read_csv() function from the readr package to load the data into a data frame.

df <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)

5. Start a piping expression and define the columns that should be included in the data frame.

df %>%
  select(RegionName, State, Metro,
         `1996` = `1996-05`, `1997` = `1997-05`, `1998` = `1998-05`,
         `1999` = `1999-05`, `2000` = `2000-05`, `2001` = `2001-05`,
         `2002` = `2002-05`, `2003` = `2003-05`, `2004` = `2004-05`,
         `2005` = `2005-05`, `2006` = `2006-05`, `2007` = `2007-05`,
         `2008` = `2008-05`, `2009` = `2009-05`, `2010` = `2010-05`,
         `2011` = `2011-05`, `2012` = `2012-05`, `2013` = `2013-05`,
         `2014` = `2014-05`, `2015` = `2015-05`, `2016` = `2016-05`,
         `2017` = `2017-05`, `2018` = `2018-05`) %>%

6.FilterthedataframetoincludetheAustinmetropolitanareafromthestateofTexas.

df%>%select(RegionName,State,Metro,`1996`=`1996-05`,`1997`=`1997-05`,`1998`=`1998-05`,`1999`=`1999-05`,`2000`==`1997-05`,`1998`=`1998-05`,`1999`=`1999-05`,`2000`==`1997-05`,`1998`=`1998-05`,`1999`=`1999-05`,`2000`=05`,`2007`=`2007-05`,`2008`=`2008-05`,`2009`=`2009-05`,`2010`=`2010-05`,`2011`=`2011-05`,`2012`=`2012-05`,`2013`=`2013-05`,`2014`=`2014-05`,`2015`=`2015-05`,`2016`=`2016-05`,`2017`=`2017-05`,`2018`=`2018-05`)%>%filter(State==‘TX’&Metro==‘Austin’)%>%

7.Ifyouweretoviewthestructureofthedataframeatthispointitwouldlooklikethescreenshotbelow.Acommonprobleminmanydatasetsisthatthecolumnnamesarenotvariablesbutrathervaluesofavariable.Inthefigureprovidedbelow,thecolumnsthatrepresenteachyearinthestudyareactuallyvaluesofthevariableYEAR.Eachrowintheexistingtableactuallyrepresentsmanyannualobservations.Thetidyrpackagecanbeusedtogathertheseexistingcolumnsintoanewvariable.Inthiscase,weneedtocreateanewcolumncalledYRandthengathertheexistingvaluesintheannualcolumnsintothenewYRcolumn.

Inthenextstepwe’llusethegather()functiontoaccomplishthis.

8.Usethegather()functiontotidyupthedatasothatanewYRcolumniscreated,androwsforeachcounty(RegionName)andyearvalueareadded.

df<-read_csv(“County_Zhvi_SingleFamilyResidence.csv”,col_names=TRUE)df%>%

Page 217: Data Visualization and Exploration with R A Practical Guide to Using R RStudio and Tidyverse for Data Visualization Exploration and Data Science Applications

select(RegionName,State,Metro,`1996`=`1996-05`,`1997`=`1997-05`,`1998`=`1998-05`,`1999`=`1999-05`,`2000`==`1997-05`,`1998`=`1998-05`,`1999`=`1999-05`,`2000`==`1997-05`,`1998`=`1998-05`,`1999`=`1999-05`,`2000`=05`,`2007`=`2007-05`,`2008`=`2008-05`,`2009`=`2009-05`,`2010`=`2010-05`,`2011`=`2011-05`,`2012`=`2012-05`,`2013`=`2013-05`,`2014`=`2014-05`,`2015`=`2015-05`,`2016`=`2016-05`,`2017`=`2017-05`,`2018`=`2018-05`)%>%filter(State==‘TX’&Metro==‘Austin’)%>%gather(`1996`,`1997`,`1998`,`1999`,`2000`,`2001`,`2002`,`2003`,`2004`,`2005`,`2006`,`2007`,`2008`,`2009`,`2010`,`2011`,`2012`,`2013`,`2014`,`2015`,`2016`,`2017`,`2018`,key=’YR’,value=’ZHVI’)%>%

9. If you were to view the result, the data frame would now appear as seen in the figure below.

10. Now we’re ready to plot the data. Add the code you see below to create a point plot that is grouped by RegionName (County).

df <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
df %>%
  select(RegionName, State, Metro,
         `1996` = `1996-05`, `1997` = `1997-05`, `1998` = `1998-05`, `1999` = `1999-05`,
         `2000` = `2000-05`, `2001` = `2001-05`, `2002` = `2002-05`, `2003` = `2003-05`,
         `2004` = `2004-05`, `2005` = `2005-05`, `2006` = `2006-05`, `2007` = `2007-05`,
         `2008` = `2008-05`, `2009` = `2009-05`, `2010` = `2010-05`, `2011` = `2011-05`,
         `2012` = `2012-05`, `2013` = `2013-05`, `2014` = `2014-05`, `2015` = `2015-05`,
         `2016` = `2016-05`, `2017` = `2017-05`, `2018` = `2018-05`) %>%
  filter(State == 'TX' & Metro == 'Austin') %>%
  gather(`1996`, `1997`, `1998`, `1999`, `2000`, `2001`, `2002`, `2003`,
         `2004`, `2005`, `2006`, `2007`, `2008`, `2009`, `2010`, `2011`,
         `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, `2018`,
         key = 'YR', value = 'ZHVI') %>%
  ggplot(mapping = aes(x = YR, y = ZHVI, colour = RegionName)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Single Family Homes Values Have Increased in the Austin Metro Area") +
  xlab("Year") +
  ylab("Home Values")

11. You can check your work against the solution file CS2_Exercise1.R.

12. Save the script and then run it to see the output shown in the screenshot below. All counties in the Austin metropolitan area have experienced significantly increased values in the past couple of decades. The increase has been particularly noticeable since 2012.
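One detail worth knowing: gather() stores YR as text, so ggplot2 treats the x axis as discrete, and geom_smooth(method = lm) may not be able to draw a trend line per county. The sketch below is one possible workaround, not part of the solution file. It uses made-up values and converts YR to an integer before plotting; the data frame names long and long_num are placeholders.

```r
library(tibble)
library(dplyr)
library(ggplot2)

# Made-up values standing in for the gathered Austin data
long <- tibble(
  RegionName = rep(c("Travis", "Hays"), each = 4),
  YR = rep(c("2014", "2015", "2016", "2017"), times = 2),
  ZHVI = c(250000, 270000, 295000, 320000, 190000, 200000, 215000, 235000)
)

# Convert the year labels to integers so the x axis is continuous
long_num <- long %>% mutate(YR = as.integer(YR))

p <- ggplot(long_num, mapping = aes(x = YR, y = ZHVI, colour = RegionName)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  xlab("Year") +
  ylab("Home Values")
print(p)
```

With a numeric year, each county gets its own fitted line and confidence band.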


13. Instead of a simple dot plot, you might want to create a bar chart. Comment out the line of code that calls the existing ggplot() function and add a new line as seen below.

ggplot(mapping = aes(x = YR, y = ZHVI, colour = RegionName)) +
  geom_col() +
  ggtitle("Single Family Homes Values Have Increased in the Austin Metro Area") +
  xlab("Year") +
  ylab("Home Values")

14. Save and run the script and the output should now appear as seen in the screenshot below. The upward trend in values seems even more obvious when viewed in this manner.
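In a bar chart the colour aesthetic only controls the bar outlines, while fill controls the interior. If you would like each county's bars to be filled and drawn side by side, a variation along the following lines may help. This is a sketch with made-up values; the choice of position = "dodge" is an assumption about the desired layout, not what the solution file does.

```r
library(tibble)
library(ggplot2)

# Made-up values for two counties and two years
long <- tibble(
  RegionName = rep(c("Travis", "Hays"), each = 2),
  YR = rep(c("2016", "2017"), times = 2),
  ZHVI = c(295000, 320000, 215000, 235000)
)

# fill colours the bar interiors; position = "dodge" places the
# counties side by side instead of stacking them
p <- ggplot(long, mapping = aes(x = YR, y = ZHVI, fill = RegionName)) +
  geom_col(position = "dodge") +
  xlab("Year") +
  ylab("Home Values")
print(p)
```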


Exercise 2: What is the trend for rental rates in the Austin metro area?

The County_Zri_SingleFamilyResidenceRental.csv file in your IntroR\Data folder contains single-family residential rental values from Zillow. The Zillow Rent Index (ZRI) is a smoothed, seasonally adjusted measure of the median estimated market rate rent across a given region and housing type. ZRI is a dollar-denominated alternative to repeat-rent indices.

In this exercise we’ll examine rent values over the past few years from the Austin metropolitan area.

1. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS2_Exercise2.R.

2. At the top of the script, load the packages that will be used in this exercise.

library(readr)
library(dplyr)
library(tidyr)    # provides gather()
library(ggplot2)

3. Use the read_csv() function from the readr package to load the data into a data frame.

df <- read_csv("County_Zri_SingleFamilyResidenceRental.csv", col_names = TRUE)

4. Select the columns and filter the data. This dataset contains data from 2010 going forward. We’ll use data from December of the years 2010 to 2017 for the Austin, TX metropolitan area.

df <- read_csv("County_Zri_SingleFamilyResidenceRental.csv", col_names = TRUE)
df %>%
  select(RegionName, State, Metro,
         `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
         `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`) %>%
  filter(State == 'TX' & Metro == 'Austin')

5. Gather the data.

df <- read_csv("County_Zri_SingleFamilyResidenceRental.csv", col_names = TRUE)
df %>%
  select(RegionName, State, Metro,
         `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
         `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`) %>%
  filter(State == 'TX' & Metro == 'Austin') %>%
  gather(`2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`,
         key = 'YR', value = 'ZRI')

6. Call the ggplot() function to plot the data. In this plot we’ll also add labels to each point.

df <- read_csv("County_Zri_SingleFamilyResidenceRental.csv", col_names = TRUE)
df %>%
  select(RegionName, State, Metro,
         `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
         `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`) %>%
  filter(State == 'TX' & Metro == 'Austin') %>%
  gather(`2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`,
         key = 'YR', value = 'ZRI') %>%
  ggplot(mapping = aes(x = YR, y = ZRI, colour = RegionName)) +
  geom_point() +
  geom_text(aes(label = ZRI, vjust = -0.5), size = 3) +
  ggtitle("Single Family Rental Values Have Increased in the Austin Metro Area") +
  xlab("Year") +
  ylab("Rental Values")

7. You can check your work against the solution file CS2_Exercise2.R.

8. Save the script and then run it to see the output shown in the screenshot below.


Exercise 3: Determining the Price-Rent Ratio for the Austin metropolitan area

The price-to-rent ratio is a measure of the relative affordability of renting and buying in a given housing market. It is calculated as the ratio of home prices to annual rental rates. So, for example, in a real estate market where, on average, a home worth $200,000 could rent for $1,000 a month, the price-rent ratio is 16.67. That’s determined using the formula: $200,000 ÷ (12 x $1,000). In general, the lower the ratio, the more favorable the market is to real estate investors looking for residential property.
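The arithmetic in the example above can be checked directly in R. The helper name price_rent_ratio() is made up for this illustration and is not part of any package or of the exercise files.

```r
# Price-to-rent ratio: home price divided by one year of rent.
# price_rent_ratio() is a hypothetical helper written for this example.
price_rent_ratio <- function(home_price, monthly_rent) {
  home_price / (12 * monthly_rent)
}

ratio <- price_rent_ratio(200000, 1000)
print(round(ratio, 2))  # 16.67
```

The same expression, applied to whole columns, is what the mutate() call later in this exercise computes.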

In this exercise you’ll join the Zillow home value data to the rental data, create a new column to hold the price-rent ratio, calculate the ratio, and plot the data as a bar chart.

1. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS2_Exercise3.R.

2. At the top of the script, load the packages that will be used in this exercise.

library(readr)
library(dplyr)
library(tidyr)    # provides gather()
library(ggplot2)

3. In this step you’ll read the residential valuation information from the Zillow file, define the columns that should be used, filter the data, and gather the data. In this case we’re going to filter the data so that only Travis County is included. Add the following lines of code to your script to accomplish this task.

dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals <- select(dfHomeVals, RegionName, State, Metro,
                     `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
                     `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`)
dfHomeVals <- filter(dfHomeVals, State == 'TX' & Metro == 'Austin' & RegionName == 'Travis')
dfHomeVals <- gather(dfHomeVals, `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`,
                     key = 'YR', value = 'ZHVI')

4. Now do the same for the rental data.

dfRentVals <- read_csv("County_Zri_SingleFamilyResidenceRental.csv", col_names = TRUE)
dfRentVals <- select(dfRentVals, RegionName, State, Metro,
                     `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
                     `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`)
dfRentVals <- filter(dfRentVals, State == 'TX' & Metro == 'Austin' & RegionName == 'Travis')
dfRentVals <- gather(dfRentVals, `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`,
                     key = 'YR', value = 'ZRI')

5. The two previous steps created data frames for the residential home value and rental data. In this step we’ll join those two data frames together using the dplyr package. Add the line of code you see below to your script. This uses the inner_join() function, which is the simplest type of join. An inner join matches pairs of observations whenever their keys are equal.

df <- inner_join(dfHomeVals, dfRentVals, by = 'YR')

6. If you were to view the resulting data frame at this point it would look like the screenshot below. Notice that the ZHVI (residential home value) and ZRI (rental value) columns are attached.
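To see what matching on equal keys means in practice, here is a small sketch with two made-up tables (the values are invented, and the table names home and rent are placeholders). Only the years that appear in both tables survive the join.

```r
library(tibble)
library(dplyr)

# Made-up home values and rents with partially overlapping years
home <- tibble(YR = c("2015", "2016", "2017"), ZHVI = c(270000, 295000, 320000))
rent <- tibble(YR = c("2016", "2017", "2018"), ZRI = c(1600, 1650, 1700))

# Rows are kept only where YR exists in both tables
joined <- inner_join(home, rent, by = "YR")
print(joined)
```

The year 2015 (home values only) and 2018 (rents only) are dropped, leaving rows for 2016 and 2017 with both ZHVI and ZRI columns.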

7. Next, use the mutate() function to create a column called PriceRentRatio, and populate the rows using the calculation seen below.

df <- mutate(df, PriceRentRatio = ZHVI / (12 * ZRI))

8. If you were to view the results of the mutate() function it would appear as seen in the screenshot below. Notice that each year includes a PriceRentRatio value that has been calculated.

9. Finally, create a bar chart using geom_col() with PriceRentRatio as the y axis, and YR as the x axis.

ggplot(data = df) +
  geom_col(mapping = aes(x = YR, y = PriceRentRatio), fill = "red")

10. Your entire script should appear as seen below.

library(readr)
library(dplyr)
library(tidyr)    # provides gather()
library(ggplot2)

dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals <- select(dfHomeVals, RegionName, State, Metro,
                     `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
                     `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`)
dfHomeVals <- filter(dfHomeVals, State == 'TX' & Metro == 'Austin' & RegionName == 'Travis')
dfHomeVals <- gather(dfHomeVals, `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`,
                     key = 'YR', value = 'ZHVI')

dfRentVals <- read_csv("County_Zri_SingleFamilyResidenceRental.csv", col_names = TRUE)
dfRentVals <- select(dfRentVals, RegionName, State, Metro,
                     `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
                     `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`)
dfRentVals <- filter(dfRentVals, State == 'TX' & Metro == 'Austin' & RegionName == 'Travis')
dfRentVals <- gather(dfRentVals, `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`,
                     key = 'YR', value = 'ZRI')

# join the two data frames on the year column
df <- inner_join(dfHomeVals, dfRentVals, by = 'YR')
df <- mutate(df, PriceRentRatio = ZHVI / (12 * ZRI))
ggplot(data = df) +
  geom_col(mapping = aes(x = YR, y = PriceRentRatio), fill = "red")

11. You can also check your work against the solution file CS2_Exercise3.R.

12. Save the script and then run it to see the output shown in the screenshot below. Price-rent ratios have been steadily increasing during the current decade.

Exercise 4: Comparing residential home values in Austin to other Texas and U.S. metropolitan areas

In this exercise we’ll compare residential home values from the Austin metropolitan area to other large metropolitan areas in Texas including San Antonio, Dallas, and Houston. For this exercise we’ll create a box plot contained within a violin plot.

1. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS2_Exercise4.R.

2. At the top of the script, load the packages that will be used in this exercise.

library(readr)
library(dplyr)
library(tidyr)    # provides gather()
library(ggplot2)

3. Use the read_csv() function from the readr package to load the data into a data frame.

dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)

4. Select the columns. We’ll use data from December of the years 2010 to 2017.

dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
         `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`)

5. Filter the data frame to include only Austin, San Antonio, Dallas-Fort Worth, and Houston. These are the four major metropolitan areas in the state.

dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
         `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`) %>%
  filter(State == 'TX' & Metro %in% c("Austin", "San Antonio", "Dallas-Fort Worth", "Houston"))

6. Gather the data frame.

dfHomeVals %>%
  select(RegionName, State, Metro,
         `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
         `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`) %>%
  filter(State == 'TX' & Metro %in% c("Austin", "San Antonio", "Dallas-Fort Worth", "Houston")) %>%
  gather(`2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`,
         key = 'YR', value = 'ZHVI')

7. Group the data by metropolitan area.

dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
         `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`) %>%
  filter(State == 'TX' & Metro %in% c("Austin", "San Antonio", "Dallas-Fort Worth", "Houston")) %>%
  gather(`2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`,
         key = 'YR', value = 'ZHVI') %>%
  group_by(Metro)

8. Use ggplot() with geom_violin() and geom_boxplot() to create the plot.

dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
         `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`) %>%
  filter(State == 'TX' & Metro %in% c("Austin", "San Antonio", "Dallas-Fort Worth", "Houston")) %>%
  gather(`2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`,
         key = 'YR', value = 'ZHVI') %>%
  group_by(Metro) %>%
  ggplot(mapping = aes(x = Metro, y = ZHVI)) +
  geom_violin() +
  geom_boxplot(width = 0.1) +
  ggtitle("ZHVI for Metro Texas") +
  xlab("Metro") +
  ylab("ZHVI")

9. You can also check your work against the solution file CS2_Exercise4.R.

10. Save the script and then run it to see the output shown in the screenshot below.
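If you want to experiment with the violin and box plot combination without loading the Zillow file, the following sketch uses randomly generated values (the means and spreads are invented, and vals is a placeholder name). It omits group_by(Metro), since mapping x = Metro already separates the metropolitan areas in the plot.

```r
library(tibble)
library(ggplot2)

# Randomly generated stand-in values for two metros
set.seed(42)
vals <- tibble(
  Metro = rep(c("Austin", "Houston"), each = 50),
  ZHVI = c(rnorm(50, mean = 300000, sd = 40000),
           rnorm(50, mean = 200000, sd = 30000))
)

# The violin shows the shape of the distribution; the narrow box plot
# drawn on top adds the median and quartiles
p <- ggplot(vals, mapping = aes(x = Metro, y = ZHVI)) +
  geom_violin() +
  geom_boxplot(width = 0.1) +
  xlab("Metro") +
  ylab("ZHVI")
print(p)
```

Setting width = 0.1 keeps the box plot narrow enough to sit inside each violin.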

11. Challenge: Update the script to include the following metropolitan areas: Austin, Denver, Phoenix, Salt Lake City, Boise, Portland. You can check your code against the solution file CS2_Exercise4.R. The output plot should appear as seen in the screenshot below.

12. Finally, we’ll create a script that displays the ZHVI values for each metropolitan area in a facet plot. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS2_Exercise4B.R.

13. At the top of the script, load the packages that will be used in this exercise.

library(readr)
library(dplyr)
library(tidyr)    # provides gather()
library(ggplot2)

14. Use the read_csv() function from the readr package to load the data into a data frame.

dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)

15. Define the columns to use. In this case we’ll use the years 2000-2017.

dfHomeVals %>%
  select(RegionName, State, Metro,
         `2000` = `2000-05`, `2001` = `2001-05`, `2002` = `2002-05`, `2003` = `2003-05`,
         `2004` = `2004-05`, `2005` = `2005-05`, `2006` = `2006-05`, `2007` = `2007-05`,
         `2008` = `2008-05`, `2009` = `2009-05`, `2010` = `2010-05`, `2011` = `2011-05`,
         `2012` = `2012-05`, `2013` = `2013-05`, `2014` = `2014-05`, `2015` = `2015-05`,
         `2016` = `2016-05`, `2017` = `2017-05`, `2018` = `2018-05`)

16. Filter the data frame to include only specific metropolitan areas.

dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2000` = `2000-05`, `2001` = `2001-05`, `2002` = `2002-05`, `2003` = `2003-05`,
         `2004` = `2004-05`, `2005` = `2005-05`, `2006` = `2006-05`, `2007` = `2007-05`,
         `2008` = `2008-05`, `2009` = `2009-05`, `2010` = `2010-05`, `2011` = `2011-05`,
         `2012` = `2012-05`, `2013` = `2013-05`, `2014` = `2014-05`, `2015` = `2015-05`,
         `2016` = `2016-05`, `2017` = `2017-05`, `2018` = `2018-05`) %>%
  filter(Metro %in% c("Austin", "Denver", "Phoenix", "Portland", "Salt Lake City"))

17. Gather the data.

dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2000` = `2000-05`, `2001` = `2001-05`, `2002` = `2002-05`, `2003` = `2003-05`,
         `2004` = `2004-05`, `2005` = `2005-05`, `2006` = `2006-05`, `2007` = `2007-05`,
         `2008` = `2008-05`, `2009` = `2009-05`, `2010` = `2010-05`, `2011` = `2011-05`,
         `2012` = `2012-05`, `2013` = `2013-05`, `2014` = `2014-05`, `2015` = `2015-05`,
         `2016` = `2016-05`, `2017` = `2017-05`, `2018` = `2018-05`) %>%
  filter(Metro %in% c("Austin", "Denver", "Phoenix", "Portland", "Salt Lake City")) %>%
  gather(`2000`, `2001`, `2002`, `2003`, `2004`, `2005`, `2006`, `2007`, `2008`,
         `2009`, `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`,
         key = 'YR', value = 'ZHVI')

18. Group the data by metropolitan area.

dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2000` = `2000-05`, `2001` = `2001-05`, `2002` = `2002-05`, `2003` = `2003-05`,
         `2004` = `2004-05`, `2005` = `2005-05`, `2006` = `2006-05`, `2007` = `2007-05`,
         `2008` = `2008-05`, `2009` = `2009-05`, `2010` = `2010-05`, `2011` = `2011-05`,
         `2012` = `2012-05`, `2013` = `2013-05`, `2014` = `2014-05`, `2015` = `2015-05`,
         `2016` = `2016-05`, `2017` = `2017-05`, `2018` = `2018-05`) %>%
  filter(Metro %in% c("Austin", "Denver", "Phoenix", "Portland", "Salt Lake City")) %>%
  gather(`2000`, `2001`, `2002`, `2003`, `2004`, `2005`, `2006`, `2007`, `2008`,
         `2009`, `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`,
         key = 'YR', value = 'ZHVI') %>%
  group_by(Metro)

19. Plot the data as a facet plot.

dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2000` = `2000-05`, `2001` = `2001-05`, `2002` = `2002-05`, `2003` = `2003-05`,
         `2004` = `2004-05`, `2005` = `2005-05`, `2006` = `2006-05`, `2007` = `2007-05`,
         `2008` = `2008-05`, `2009` = `2009-05`, `2010` = `2010-05`, `2011` = `2011-05`,
         `2012` = `2012-05`, `2013` = `2013-05`, `2014` = `2014-05`, `2015` = `2015-05`,
         `2016` = `2016-05`, `2017` = `2017-05`, `2018` = `2018-05`) %>%
  filter(Metro %in% c("Austin", "Denver", "Phoenix", "Portland", "Salt Lake City")) %>%
  gather(`2000`, `2001`, `2002`, `2003`, `2004`, `2005`, `2006`, `2007`, `2008`,
         `2009`, `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`,
         key = 'YR', value = 'ZHVI') %>%
  group_by(Metro) %>%
  ggplot(mapping = aes(x = YR, y = ZHVI)) +
  geom_point() +
  facet_wrap(~Metro) +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("ZHVI by Metro Area") +
  xlab("Year") +
  ylab("ZHVI")

20. You can also check your work against the solution file CS2_Exercise4B.R.

21. Save the script and then run it to see the output shown in the screenshot below.
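For a quick look at how faceting behaves on its own, the sketch below uses made-up values for two metropolitan areas (vals is a placeholder name). The years are stored as integers here, which is an assumption made so that the linear fit has a continuous x axis to work with.

```r
library(tibble)
library(ggplot2)

# Made-up values for two metros over four years
vals <- tibble(
  Metro = rep(c("Austin", "Denver"), each = 4),
  YR = rep(2014:2017, times = 2),
  ZHVI = c(250000, 270000, 295000, 320000, 300000, 330000, 365000, 400000)
)

# facet_wrap(~ Metro) draws one panel per metro, each with its own
# points and fitted line
p <- ggplot(vals, mapping = aes(x = YR, y = ZHVI)) +
  geom_point() +
  facet_wrap(~ Metro) +
  geom_smooth(method = lm, se = TRUE) +
  xlab("Year") +
  ylab("ZHVI")
print(p)
```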

Data Visualization and Exploration with R

Today, data science is an indispensable tool for any organization, allowing for the analysis and optimization of decisions and strategy. R has become the preferred software for data science, thanks to its open source nature, simplicity, applicability to data analysis, and the abundance of libraries for any type of algorithm.

This book will allow the student to learn, in detail, the fundamentals of the R language and additionally master some of the most efficient libraries for data visualization in chart, graph, and map formats. The reader will learn the language and applications through examples and practice. No prior programming skills are required.

We begin with the installation and configuration of the R environment through RStudio. As you progress through the exercises in this hands-on book you’ll become thoroughly acquainted with R’s features and the popular tidyverse package. With this book, you will learn about the basic concepts of R programming, work efficiently with graphs, charts, and maps, and create publication-ready documents using real world data. The detailed step-by-step instructions will enable you to get a clean set of data, produce engaging visualizations, and create reports for the results.

What you will learn to do in this book:

Introduction to the R programming language and RStudio

Using the tidyverse package for data loading, transformation, and visualization

Get a tour of the most important data structures in R

Learn techniques for importing data, manipulating data, performing analysis, and producing useful data visualization

Data visualization techniques with ggplot2

Geographic visualization and maps with ggmap

Turning your analyses into high quality documents, reports, and presentations with R Markdown.

Hands-on case studies designed to replicate real world projects and reinforce the knowledge you learn in the book

For more information visit geospatialtraining.com!