Introduction to Data Visualization and Exploration with R
A practical guide to using R, RStudio, and tidyverse for data visualization, exploration, and data science applications.
Eric Pimpler
Geospatial Training Services
215 W Bandera #114-104
Boerne, TX 78006
PH: 210-260-4992
Email: [email protected]
http://geospatialtraining.com
Twitter: @gistraining
Copyright © 2017 by Eric Pimpler – Geospatial Training Services. All rights reserved.
No part of this book may be reproduced in any form or by any electronic or mechanical means, including information storage and retrieval systems, without written permission from the author, except for the use of brief quotations in a book review.
About the Author
Eric Pimpler
Eric Pimpler is the founder and owner of Geospatial Training Services (geospatialtraining.com) and has over 25 years of experience implementing and teaching GIS solutions using Esri software. Currently he focuses on data science applications with R, along with ArcGIS Pro and Desktop scripting with Python and the development of custom ArcGIS Enterprise (Server) and ArcGIS Online web and mobile applications with JavaScript.
Eric is also the author of several other books, including Introduction to Programming ArcGIS Pro with Python (https://www.amazon.com/dp/1979451079/re1&keywords=Programming+ArcGIS+Pro+with+Python), Programming ArcGIS with Python Cookbook (https://www.packtpub.com/application-development/programmingarcgis-python-cookbook-second-edition), Spatial Analytics with ArcGIS (https://www.packtpub.com/application-development/spatial-analytics-arcgis), Building Web and Mobile ArcGIS Server Applications with JavaScript (https://www.packtpub.com/application-development/building-weband-mobile-arcgis-server-applicationsjavascript), and ArcGIS Blueprints (https://www.packtpub.com/applicationdevelopment/arcgis-blueprints).
If you need consulting assistance with your data science or GIS projects, please contact Eric at eric@geospatialtraining.com or sales@geospatialtraining.com. Geospatial Training Services provides contract application development and programming expertise for R, ArcGIS Pro, ArcGIS Desktop, ArcGIS Enterprise (Server), and ArcGIS Online using Python, .NET/ArcObjects, and JavaScript.
Downloading and Installing Exercise Data for this Book
This is intended as a hands-on exercise book and is designed to give you as much hands-on coding experience with R as possible. Many of the exercises in this book require that you load data from a file-based data source such as a CSV file. These files will need to be installed on your computer before continuing with the exercises in this chapter as well as the rest of the book. Please follow the instructions below to download and install the exercise data.
1. In a web browser go to one of the links below to download the exercise data:
https://www.dropbox.com/s/5p7j7nl8hgijsnx/IntroR.zip?dl=0
https://s3.amazonaws.com/VirtualGISClassroom/IntroR/IntroR.zip
2. This will download a file called IntroR.zip.
3. The exercise data can be unzipped to any location on your computer. After unzipping the IntroR.zip file you will have a folder structure that includes IntroR as the top-most folder with sub-folders called Data and Solutions. The Data folder contains the data that will be used in the exercises in the book, while the Solutions folder contains solution files for the R scripts that you will write.
RStudio can be used on Windows, Mac, or Linux, so rather than specifying a particular folder for the data I will leave the installation location up to you. Just remember where you unzip the data because you’ll need to reference the location when you set the working directory.
4. For reference purposes I have installed the data to the desktop of my Mac computer under IntroR/Data. You will see this location referenced at various points throughout the book. However, keep in mind that you can install the data anywhere.
Table of Contents
CHAPTER 1: Introduction to R and RStudio
    Introduction to RStudio
    Exercise 1: Creating variables and assigning data
    Exercise 2: Using vectors and factors
    Exercise 3: Using lists
    Exercise 4: Using data classes
    Exercise 5: Looping statements
    Exercise 6: Decision support statements – if | else
    Exercise 7: Using functions
    Exercise 8: Introduction to tidyverse
CHAPTER 2: The Basics of Data Exploration and Visualization with R
    Exercise 1: Installing and loading tidyverse
    Exercise 2: Loading and examining a dataset
    Exercise 3: Filtering a dataset
    Exercise 4: Grouping and summarizing a dataset
    Exercise 5: Plotting a dataset
    Exercise 6: Graphing burglaries by month and year
CHAPTER 3: Loading Data into R
    Exercise 1: Loading a csv file with read.table()
    Exercise 2: Loading a csv file with read.csv()
    Exercise 3: Loading a tab delimited file with read.table()
    Exercise 4: Using readr to load data
CHAPTER 4: Transforming Data
    Exercise 1: Filtering records to create a subset
    Exercise 2: Narrowing the list of columns with select()
    Exercise 3: Arranging Rows
    Exercise 4: Adding Rows with mutate()
    Exercise 5: Summarizing and Grouping
    Exercise 6: Piping
    Exercise 7: Challenge
CHAPTER 5: Creating Tidy Data
    Exercise 1: Gathering
    Exercise 2: Spreading
    Exercise 3: Separating
    Exercise 4: Uniting
CHAPTER 6: Basic Data Exploration Techniques in R
    Exercise 1: Measuring Categorical Variation with a Bar Chart
    Exercise 2: Measuring Continuous Variation with a Histogram
    Exercise 3: Measuring Covariation with Box Plots
    Exercise 4: Measuring Covariation with Symbol Size
    Exercise 5: 2D bin and hex charts
    Exercise 6: Generating Summary Statistics
CHAPTER 7: Basic Data Visualization Techniques
    Step 1: Creating a scatterplot
    Step 2: Adding a regression line to the scatterplot
    Step 3: Plotting categories
    Step 4: Labeling the graph
    Step 5: Legend layouts
    Step 6: Creating a facet
    Step 7: Theming
    Step 8: Creating bar charts
    Step 9: Creating Violin Plots
    Step 10: Creating density plots
CHAPTER 8: Visualizing Geographic Data with ggmap
    Exercise 1: Creating a basemap
    Exercise 2: Adding operational data layers
    Exercise 3: Adding Layers from Shapefiles
CHAPTER 9: R Markdown
    Exercise 1: Creating an R Markdown file
    Exercise 2: Adding Code Chunks and Text to an R Markdown File
    Exercise 3: Code chunk and header options
    Exercise 4: Caching
    Exercise 5: Using Knit to output an R Markdown file
CHAPTER 10: Case Study – Wildfire Activity in the Western United States
    Exercise 1: Have the number of wildfires increased or decreased in the past few decades?
    Exercise 2: Has the acreage burned increased over time?
    Exercise 3: Is the size of individual wildfires increasing over time?
    Exercise 4: Has the length of the fire season increased over time?
    Exercise 5: Does the average wildfire size differ by federal organization?
CHAPTER 11: Case Study – Single Family Residential Home and Rental Values
    Exercise 1: What is the trend for home values in the Austin metro area?
    Exercise 2: What is the trend for rental rates in the Austin metro area?
    Exercise 3: Determining the Price-Rent Ratio for the Austin metropolitan area
    Exercise 4: Comparing residential home values in Austin to other Texas and U.S. metropolitan areas
Chapter 1
Introduction to R and RStudio
The R Project for Statistical Computing, or simply R, is a free software environment for statistical computing and graphics. It is also a programming language that is widely used among statisticians and data miners for developing statistical software and data analysis. Over the last few years, they were joined by enterprises who discovered the potential of R, as well as technology vendors that offer R support or R-based products.
Although there are other programming languages for handling statistics, R has become the de facto language of statistical routines, offering a package repository with over 6400 problem-solving packages. It also offers versatile and powerful plotting, and it has the advantage of treating tabular and multi-dimensional data as a labeled, indexed series of observations. This is a game changer over typical software which is just doing 2D layout, like Excel.
In this chapter we’ll cover the following topics:
• Introduction to RStudio
• Creating variables and assigning data
• Using vectors and factors
• Using lists
• Using data classes
• Looping statements
• Decision support statements
• Using functions
• Introduction to tidyverse
Introduction to RStudio
There are a number of integrated development environments (IDEs) that you can use to write R code, including Visual Studio for R, Eclipse, R Console, and RStudio, among others. You could also use a plain text editor. However, we’re going to use RStudio for the exercises in this book. RStudio is a free, open source IDE for R. It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management.
RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux).
You can get more information on RStudio at https://www.rstudio.com/products/rstudio/
The RStudio Interface
The RStudio interface, displayed in the screenshot below, looks quite complex initially, but when you break the interface down into sections it isn’t so overwhelming. We’ll cover much of the interface in the sections below. Keep in mind though that the interface is customizable, so if you find the default interface isn’t exactly what you like it can be changed. You’ll learn how to customize the interface in a later section.
To simplify the overview of RStudio we’ll break the IDE into quadrants to make it easier to reference each component of the interface. The screenshot below illustrates each of the quadrants. We’ll start with the panes in quadrant 1 and work through each of the quadrants.
Files Pane – (Q1)
The Files pane functions like a file explorer similar to Windows Explorer on a Windows operating system or Finder on a Mac. This tab, displayed in the screenshot below, provides the following functionality:
1. Delete files and folders
2. Create new folders
3. Rename folders
4. Folder navigation
5. Copy or move files
6. Set working directory or go to working directory
7. View files
8. Import datasets
Plots Pane – (Q1)
The Plots pane, displayed in the screenshot below, is used to view output visualizations produced when typing code into the Console window or running a script. Plots can be created using a variety of different packages, but we’ll primarily be using the ggplot2 package in this book. Once a plot is produced, you can zoom in, export it as an image or PDF, copy it to the clipboard, or remove it. You can also navigate to previous and next plots.
Packages Pane – (Q1)
The Packages pane, shown in the screenshot below, displays all currently installed packages along with a brief description and version number for each package. Packages can also be removed using the x icon to the right of the version number for the package. Clicking on the package name will display the help file for the package in the Help tab. Clicking on the checkbox to the left of the package name loads the library so that it can be used when writing code in the Console window.
Help Pane – (Q1)
The Help pane, shown in the screenshot below, displays linked help documentation for any packages that you have installed.
Viewer Pane – (Q1)
RStudio includes a Viewer pane that can be used to view local web content. For example, it can display web graphics generated using packages like googleVis, htmlwidgets, and RCharts, or even a local web application created with Shiny. However, keep in mind that the Viewer pane can only be used for local web content in the form of static HTML pages written in the session’s temporary directory or a locally run web application. The Viewer pane can’t be used to view online content.
Environment Pane – (Q2)
The Environment pane contains a listing of variables that you have created for the current session. Each variable is listed in the tab and can be expanded to view the contents of the variable. You can see an example of this in the screenshot below by taking a look at the df variable. The rectangle surrounding the df variable displays the columns for the variable.
Clicking the table icon on the far-right side of the display (highlighted with the arrow in the screenshot above) will open the data in a tabular viewer as seen in the screenshot below.
Other functionality provided by the Environment pane includes opening or saving a workspace and importing datasets from text files, Excel spreadsheets, and various statistical package formats. You can also clear the current workspace.
History Pane – (Q2)
The History pane, shown in the screenshot below, displays a list of all commands that have been executed in the current session. This tab includes a number of useful functions, including the ability to save these commands to a file or load historical commands from an existing file. You can also select specific commands from the History tab and send them directly to the console or an open script. You can also remove items from the History pane.
Connections Pane – (Q2)
The Connections tab can be used to access existing connections or create new connections to ODBC and Spark data sources.
Source Pane – (Q3)
The Source pane in RStudio, seen in the screenshot below, is used to create scripts and display datasets. An R script is simply a text file containing a series of commands that are executed together. Commands can also be written line by line from the Console pane. When written from the Console pane, each line of code is executed when you press the Enter (Return) key. However, scripts are executed as a group.
Multiple scripts can be open at the same time with each script occupying a separate tab as seen in the screenshot. RStudio provides the ability to execute the entire script, only the current line, or a highlighted group of lines. This gives you a lot of control over the execution of the code in a script.
The Source pane can also be used to display datasets. In the screenshot below, a data frame is displayed. Data frames can be displayed in this manner by calling the View(<data frame>) function.
Console Pane – (Q4)
The Console pane in RStudio is used to interactively write and run lines of code. Each time you enter a line of code and press Enter (Return) it will execute that line of code. Any warning or error messages will be displayed in the Console window as well as output from print() statements.
Terminal Pane – (Q4)
The RStudio Terminal pane provides access to the system shell from within the RStudio IDE. It supports xterm emulation, enabling use of full-screen terminal applications (e.g. text editors, terminal multiplexers) as well as regular command-line operations with line editing and shell history.
There are many potential uses of the shell including advanced source control operations, execution of long-running jobs, remote logins, and system administration of RStudio.
The Terminal pane is unlike most of the other features found in RStudio in that its capabilities are platform specific. In general, these differences can be categorized as either Windows capabilities or other (Mac, Linux, RStudio Server).
Customizing the Interface
If you don’t like the default RStudio interface, you can customize the appearance. To do so, go to Tools | Options (RStudio | Preferences on a Mac).
The dialog seen in the screenshot below will be displayed.
The Pane Layout tab is used to change the locations of the console, source editor, and tab panes, and to set which tabs are included in each pane.
Menu Options
There are also a multitude of options that can be accessed from the RStudio menu items. Covering these items in depth is beyond the scope of this book, but in general here are some of the more useful functions that can be accessed through the menus.
1. Create new files and projects
2. Import datasets
3. Hide, show, and zoom in and out of panes
4. Work with plots (save, zoom, clear)
5. Set the working directory
6. Save and load workspace
7. Start a new session
8. Debugging tools
9. Profiling tools
10. Install packages
11. Access help system
You’ll learn how to use various components of the RStudio interface as we move through the exercises in the book.
Installing RStudio
If you haven’t already done so, now is a good time to download and install RStudio. There are a number of versions of RStudio, including a free open source version which will be sufficient for this book. Versions are also available for various operating systems including Windows, Mac, and Linux.
1. Go to https://www.rstudio.com/products/rstudio/download/, find RStudio for Desktop, the Open Source License version, and follow the instructions to download and install the software.
In the next section we’ll explore the basic programming constructs of the R language, including the creation and assigning of data to variables, as well as the data types and objects that can be assigned to variables.
Exercise 1: Creating variables and assigning data
In the R programming language, like other languages, variables are given a name and assigned data. Each variable has a name that represents its area in memory. In R, variable names are case sensitive, so use care in naming your variables and referring to them later in your code.
There are two ways that variables can be assigned in R. In the first code example below, a variable named x is created. The variable name is followed by a less than sign and then, immediately, a dash. This is the operator used to assign data to a variable in R. On the right-hand side of this operator is the value being assigned to the variable. In this case, the value 10 has been assigned to the variable x. To print the value of a variable in R you can simply type the variable name and then press the Enter key on your keyboard.
    x <- 10
    x
    [1] 10
The other way of creating and assigning data to a variable is to use the equal sign. In the second code example we create a variable called y and assign the value 10 to the variable. This second method of creating and assigning data to a variable is probably more familiar to you if you’ve used other languages like Python or JavaScript.
    y = 10
    y
    [1] 10
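As a small sketch of how the two operators compare, the lines below assign the same value both ways and then show one place where they behave differently: inside a function call. The variable names a, b, and v are made up for this illustration and do not come from the exercise data.

```r
# At the top level of a session or script, both operators create a variable.
a <- 10
b = 10
identical(a, b)          # TRUE

# Inside a function call, = names an argument rather than creating a variable.
mean(x = c(1, 2, 3))     # 2; x here is only mean()'s parameter name

# The arrow still performs an assignment, even inside a call.
mean(v <- c(1, 2, 3))    # 2; v now exists in the session
v                        # 1 2 3
```

For the exercises in this chapter the two operators can be used interchangeably, since all assignments are made at the top level.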
In this exercise you’ll learn how to create variables in R and assign data.
1. Open RStudio and find the Console window. It should be on the left-hand side of your screen at the bottom.
2. The first thing you’ll need to do is set the working directory for the RStudio session. The working directory for all chapters in this book will be the location where you installed the exercise data. Please refer back to the section on installing the exercise data for installation instructions if you haven’t already completed this step.
The working directory can be set by typing the code you see below into the Console pane or by going to Session | Set Working Directory | Choose Directory from the RStudio menu. You will need to specify the location of the IntroR/Data folder where you installed the exercise data.
    setwd(<installation directory for exercise data>)
3. As I mentioned in the introduction to this exercise, there are two ways to create and assign data to variables in R. We’ll examine both in this section. First, create a variable called x and assign the value 10 as seen below. Notice the use of the less than sign (<) followed immediately by a dash (-). This operator can be used to assign data to a variable. The variable name is on the left-hand side of the operator, and the data we’re assigning to the variable is on the right-hand side of the operator.
Note: The screenshot below displays a working directory of ~/Desktop/IntroR/Data/ which may or may not be your working directory. This is simply the working directory that I’ve defined for my RStudio session on a Mac computer. Yours will depend entirely on where you installed the exercise data for the book and the working directory you have set for your RStudio session.
4. The second way of creating a variable is to use the equal sign. Create a second variable using this method as seen in the screenshot below. Assign the value as y = 20. I will use the equal sign throughout the book in future exercises since it is used in other programming languages and is easier to understand and type. However, you are free to use either operator.
5. Finally, create a third variable called z and assign it the value of x + y. The variables x, y, and z have all been assigned numeric data. Variables in R can be assigned other types of data as well, including characters (also known as strings), Booleans, and a number of data objects including vectors, factors, lists, matrices, data frames, and others.
6. The three variables that you’ve created (x, y, and z) are all numeric data types. This should be self-explanatory: a number typed without quotes, whether a whole number or a floating point value, is stored as a numeric data type by default. (R also has separate integer and complex types, which we won’t need here.) However, if you surround a number with quotes it will be interpreted by R as a character data type.
7. You can view the value of any variable simply by typing the variable name as seen in the screenshot below. Do that now to see how it works. Typing the name of a variable and pressing the Enter/Return key will implicitly call the print() function.
8. The same thing can be accomplished using the print() function as seen below.
9. Variables in R are case sensitive. To illustrate this, create a new variable called myName and assign it the value of your name as I have done in the screenshot below. In this case, since we’ve enclosed the value with quotes, R will assign it as a character (string) data type. Any sequence of characters, whether they be letters, numbers, or special characters, will be defined as a character data type if surrounded by quotes.
Notice that when I type the name of the variable (with the correct case) it will report the value associated with the variable, but when I type myname (all lowercase) it reports an error. Even though the name is the same the casing is different, so you must always refer to your variable names with the same case that they were created with.
10. To see a list of all variables in your current workspace you can call the ls() function. Do that now to see a list of all the variables you have created in this session. Each variable and its current value is also displayed in the Environment pane on the right-hand side of RStudio.
11. There are many data types that can be assigned to variables. In this brief exercise we assigned both character (string) and numeric data to variables. As we dive further into the book we’ll examine additional data types that can be assigned to variables in R. The syntax will remain the same though no matter what type of data is being assigned to a variable.
12. You can check your work against the solution file Chapter1_1.R.
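The steps in this exercise can be recapped in a few lines of code. This is only a sketch and not the contents of the solution file: the value "Eric" assigned to myName is a placeholder, and the tryCatch() wrapper is an addition of mine so that the case-sensitivity error can be shown without stopping a script.

```r
x <- 10
y = 20
z <- x + y
print(z)                 # [1] 30

# class() reports the data type stored in a variable or value.
class(z)                 # "numeric"
class("30")              # "character": quotes turn a number into a string

myName <- "Eric"         # placeholder value; use your own name

# myname (all lowercase) was never created, so looking it up is an error.
# tryCatch() intercepts the error and returns a short message instead.
lookup <- tryCatch(myname, error = function(e) "object not found")
lookup                   # "object not found"

# ls() returns the names of the variables created so far.
c("x", "y", "z", "myName") %in% ls()
```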
Exercise 2: Using vectors and factors
In R, a vector is a sequence of data elements that have the same data type. Vectors are used primarily as container style variables used to hold multiple values that can then be manipulated or extracted as needed. The key though is that all the values must be of the same type. For example, all the values must be numeric, character, or Boolean. You can’t include any sort of combination of data types.
To create a vector in R you call the c() function and pass in a list of values of the same type. After creating a vector there are a number of ways that you can examine, manipulate, and extract data. In this exercise you’ll learn the basics of working with vectors.
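One way to see the single-type rule in action is to pass mixed values to c() and inspect the result. R does not raise an error here; it converts every value to one common type. The variable names below are hypothetical and chosen only for this sketch.

```r
nums  <- c(1, 2, 3)
mixed <- c(1, "two", TRUE)   # numeric, character, and Boolean together

class(nums)                  # "numeric"
class(mixed)                 # "character": everything became a string
mixed                        # "1" "two" "TRUE"

# Booleans mixed with numbers become numbers (TRUE is 1, FALSE is 0).
flags <- c(TRUE, FALSE, 5)
flags                        # 1 0 5
```

Because the conversion happens silently, checking class() after building a vector is a quick way to confirm it holds the type you expected.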
1. Open RStudio and find the Console pane. It should be on the left-hand side of your screen at the bottom.
2. In the R Console pane create a new vector as seen in the code example below. The c() function is used to create the vector object. This vector is composed of character data types. Remember that all values in the vector must be of the same data type.
    layers <- c('Parcels', 'Streets', 'Railroads', 'Streams', 'Buildings')
3. Get the length of the vector using the length() function. This should return a value of 5.
    length(layers)
    [1] 5
4. You can retrieve individual items from a vector by passing in an index number. Retrieve the Railroads value by passing in an index number of 3, which corresponds to the positional order of this value. R is a 1-based language, so the first item in the vector occupies position 1.
    layers[3]
    [1] "Railroads"
5. You can extract a contiguous sequence of values by passing in a range of index numbers as seen below.
    layers[3:5]
    [1] "Railroads" "Streams"   "Buildings"
6. Values can be excluded from a vector by passing in a negative integer as seen below. This returns the vector without Streams; the layers variable itself is unchanged unless you assign the result back to it.
    layers
    [1] "Parcels"   "Streets"   "Railroads" "Streams"   "Buildings"
    layers[-4]
    [1] "Parcels"   "Streets"   "Railroads" "Buildings"
7. Create a second vector containing numbers as seen below.
    layerIds <- c(1, 2, 3, 4)
8. In this next step we’re going to combine the layers and layerIds vectors into a single vector. You’ll recall that all the items in a vector must be of the same data type. In a case like this where one vector contains characters and the other numbers, R will automatically convert the numbers to characters. Enter the following code to see this in action.
    layerIds <- c(1, 2, 3, 4)
    combinedVector <- c(layers, layerIds)
    combinedVector
    [1] "Parcels"   "Streets"   "Railroads" "Streams"   "Buildings"
    [6] "1"         "2"         "3"         "4"
9. Now let’s create two new vectors to see how vector arithmetic works. Add the following lines of code.
    x <- c(10, 20, 30, 40, 50)
    y <- c(100, 200, 300, 400, 500)
10. Now add the values of the vectors.
    x + y
    [1] 110 220 330 440 550
11. Subtract the values.
    y - x
    [1]  90 180 270 360 450
12. Multiply the values.
    10 * x
    [1] 100 200 300 400 500
    20 * y
    [1]  2000  4000  6000  8000 10000
13. You can also use R’s built-in functions against the values of a vector. Enter the following lines of code to see how the built-in functions work.
    sum(x)
    [1] 150
    mean(y)
    [1] 300
    median(y)
    [1] 300
    max(y)
    [1] 500
    min(x)
    [1] 10
14. A factor is basically a vector but with categories, so it will look familiar to you. Go ahead and clear the R Console by selecting the Edit menu item and then Clear Console in RStudio.
15. Add the following code block. Note that you can easily use line continuation in R simply by pressing the Enter (Return) key on your keyboard. The Console will automatically add a "+" at the beginning of the next line, indicating that it is simply a continuation of the last line.
    land.type <- factor(c("Residential", "Commercial", "Agricultural", "Commercial", "Commercial", "Residential"), levels = c("Residential", "Commercial"))
    table(land.type)
    land.type
    Residential  Commercial
              2           3
16. Now let’s talk about ordering of factors. There may be times when you want to order the output of the factor. For example, you may want to order the results by month. Enter the following code:
    mons <- c("March", "April", "January", "November", "January",
        "September", "October", "September", "November", "August",
        "January", "November", "November", "February", "May", "August",
        "July", "December", "August", "August", "September", "November",
        "February", "April")
    mons <- factor(mons)
    table(mons)
    mons
        April    August  December  February   January      July
            2         4         1         2         3         1
        March       May  November   October September
            1         1         5         1         3
17. The output is less than desirable in this case, because the months are listed alphabetically. It would be preferable to have the months listed in the order that they occur during the year. Creating an ordered factor resolves this issue. Add the following code to see how this works.
    mons <- factor(mons, levels = c('January', 'February', 'March',
        'April', 'May', 'June', 'July', 'August', 'September',
        'October', 'November', 'December'), ordered = TRUE)
    table(mons)
    mons
      January  February     March     April       May      June
            3         2         1         2         1         0
         July    August September   October  November  December
            1         4         3         1         5         1
In the next exercise you’ll learn how to use lists, which are similar in many ways to vectors in that they are a container style object, but as you’ll see they differ in an important way as well. You can check your work against the solution file Chapter1_2.R.
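Two details of the factor examples above are easy to miss, and the sketch below tries to make them visible. First, "Agricultural" was not included in the levels for land.type, so R stores it as a missing value (NA), which is why table() reported only Residential and Commercial. Second, an ordered factor can be compared and sorted by its level order. The short months vector and the use of R's built-in month.name constant are my own choices for brevity, not part of the exercise.

```r
land.type <- factor(c("Residential", "Commercial", "Agricultural",
                      "Commercial", "Commercial", "Residential"),
                    levels = c("Residential", "Commercial"))

# Values that are not among the levels become NA.
sum(is.na(land.type))                 # 1

# useNA = "ifany" adds a column for the missing values to the counts.
counts <- table(land.type, useNA = "ifany")
counts

# month.name holds the twelve month names in calendar order.
mons <- factor(c("March", "January", "November"),
               levels = month.name, ordered = TRUE)

mons < "June"                         # TRUE TRUE FALSE
as.character(sort(mons))              # "January" "March" "November"
```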
Exercise 3: Using lists
A list is an ordered collection of elements, in many ways very similar to a vector. However, there are some important differences between a list and a vector. With lists you can include any combination of data types. This differs from other data structures like vectors, matrices, and factors, which must contain the same data type. Lists are highly versatile and useful data types. A list in R acts as a container style object in that it can hold many values that you store temporarily and pull out as needed.
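To illustrate the difference in a rough way, the sketch below stores the same three values in a vector and in a list and then checks the type of each element. The names as.vec and as.list.obj are hypothetical.

```r
as.vec      <- c("Streets", 2000, TRUE)      # a vector converts to one type
as.list.obj <- list("Streets", 2000, TRUE)   # a list keeps each type

class(as.vec)                # "character"

# sapply() applies class() to every element of the list.
sapply(as.list.obj, class)   # "character" "numeric" "logical"
```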
1.CleartheRConsolebyselectingtheEditmenuitemandthenClearConsoleinRStudio.
2.Listscanbecreatedthroughtheuseofthelist()function.It’salsocommontocallafunctionthatreturnsalistvariableaswell,butforthesakeofsimplicityinthisexercisewe’llusethelist()functiontocreatethelist.
Eachvaluethatyouintendtoplaceinsidethelistshouldbeseparatedbyacomma.Thevaluesplacedintothelistcanbeofanytype,whichdiffersfromvectorsthatmustallbeofthesametype.AddthecodeyouseebelowintheConsolepane.
my.list<-list(“Streets”,2000,“Parcels”,5000,TRUE,FALSE)Inthisexamplealistcalledmy.listhasbeencreatedwithanumberofcharacter,numeric,andBooleanvalues.
3. Because lists are container style objects you will need to pull values out of a list at various times. This is done by passing an index number inside square brackets, with the index number one referring to the first value in the list, and each successive value occupying the next index number in order. However, accessing items in a list can be a little confusing as you’ll see. Add the following code and then we’ll discuss.
my.list[2]
[[1]]
[1] 2000
The index number 2 is a reference to the second value in the my.list object, which in this case is the number 2000. However, when you pass an index number inside a single pair of square brackets it actually returns another list object, this time with a single value. In this case, 2000 is the only value in the list, but it is a list object rather than a number.
4. Now add the code you see below to see how to pull out the actual value from the list rather than returning another list with a single value.
my.list[[2]]
In this case we pass a value of 2 inside two pairs of square brackets. Using double square brackets around the index number will pull the actual value out of the list rather than returning a new list with a single value. In this case, the value 2000 is returned as a numeric value. This can be a little confusing the first few times you see and use this, but lists are a commonly used data type in R so you’ll want to make sure you understand this concept.
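To make the difference between the two bracket styles concrete, here is a minimal sketch that recreates my.list and inspects what each form returns. The variable names single and double are only illustrative.

```r
my.list <- list("Streets", 2000, "Parcels", 5000, TRUE, FALSE)

single <- my.list[2]     # single brackets: a list holding one value
double <- my.list[[2]]   # double brackets: the value itself

class(single)   # "list"
class(double)   # "numeric"

# Arithmetic works on the extracted value...
double * 2

# ...but not on the one-element list; single * 2 would raise an error
```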
5. There may be times when you want to pull multiple values from a list rather than just a single value. This is called list slicing and can be accomplished using the syntax you see below. In this case we pass in a vector of index numbers, created with c(), that identifies the positions of the values that should be retrieved. Try this on your own.
new.list <- my.list[c(1, 2)]
new.list
[[1]]
[1] "Streets"

[[2]]
[1] 2000
6. This returned a new list object stored in the variable new.list. Using basic list indexing you can then pull a value out of this list.
new.list[[2]]
[1] 2000
7. You can get the number of items in a list by calling the length() function. This will return the number of top-level values in the list; a nested list counts as a single item regardless of how many values it contains. Calling the length() function in this exercise on the my.list variable should produce a result of 6.
length(my.list)
8. Finally, there may be times when you are uncertain if a variable is stored as a vector or a list. You can use the is.list() function, which will return a TRUE or FALSE value that indicates whether the variable is a list object.
is.list(my.list)
[1] TRUE
9. You can check your work against the solution file Chapter1_3.R.
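The short sketch below illustrates how length() treats nested lists and how is.list() distinguishes a list from an ordinary vector. The nested object is a made-up example rather than part of the exercise.

```r
# A list whose third item is itself a list (illustrative values)
nested <- list("Streets", 2000, list("Parcels", 5000))

length(nested)        # counts top-level items; the inner list counts once
length(nested[[3]])   # counts the items inside the inner list

is.list(nested)       # TRUE
is.list(c(1, 2, 3))   # FALSE: a plain numeric vector is not a list
```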
Exercise 4: Using data classes
In this exercise we’ll take a look at matrices and data frames. A matrix in R is a structure very similar to a table in that it has columns and rows. This type of structure is commonly used in statistical operations. A matrix is created using the matrix() function. The number of columns and rows can be passed in as arguments to the function to define the attributes and data values of the matrix. A matrix might be created from the values found in the attribute table of a feature class. However, keep in mind that all the values in the matrix must be of the same data type.
Data frames in R are very similar to tables in that they have columns and rows. This makes them very similar to matrix objects as well. In statistics, a dataset will often contain multiple variables. For example, if you are analyzing real estate sales for an area there will be many factors including income, job growth, immigration, and others.
These individual variables are stored as the columns in a data frame. Data frames are most commonly created by loading an external file, database table, or URL containing tabular information using one of the many functions provided by R for importing a dataset. You can also manually enter the values. When manually entering the data the R console will display a spreadsheet style interface that you can use to define the column names as well as the row values. R includes many built-in datasets that you can use for learning purposes and these are stored as data frames.
1. Open RStudio and find the Console pane. It should be on the bottom, left hand side of your screen.
2. Let’s start with matrices. In the R Console create a new matrix as seen in the code example below. The c() function is used to define the data for the object. This matrix is composed of numeric data types. Remember that all values in the matrix must be of the same data type.
matrx <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, ncol = 3, byrow = TRUE)
matrx
     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7
3. You can name the columns in a matrix. Add the code you see below to name your columns.
colnames(matrx) <- c("POP2000", "POP2005", "POP2010")
matrx
     POP2000 POP2005 POP2010
[1,]       2       4       3
[2,]       1       5       7
4. Now let’s retrieve a value from the matrix with the code you see below. The format is matrix[row, column].
matrx[2, 3]
POP2010
      7
5. You can also extract an entire row using the code you see below. Here we just provide a row value but no column.
matrx[2, ]
POP2000 POP2005 POP2010
      1       5       7
6. Or you can extract an entire column using the format you see below.
matrx[, 3]
[1] 3 7
7. You can also extract multiple columns at a time.
matrx[, c(1, 3)]
     POP2000 POP2010
[1,]       2       3
[2,]       1       7
8. You can also access columns or rows by name if you have named them.
matrx[, "POP2005"]
[1] 4 5
9. You can use the colSums(), colMeans() or rowSums() functions against the data as well.
colSums(matrx)
POP2000 POP2005 POP2010
      3       9      10
colMeans(matrx)
POP2000 POP2005 POP2010
    1.5     4.5     5.0
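Step 9 mentions rowSums() without showing it, so here is a brief sketch using the same matrix. The row names AreaA and AreaB are hypothetical labels added only to show that rows can be named and accessed the same way as columns.

```r
matrx <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, ncol = 3, byrow = TRUE)
colnames(matrx) <- c("POP2000", "POP2005", "POP2010")
rownames(matrx) <- c("AreaA", "AreaB")   # hypothetical row labels

# Sum across the columns of each row
rowSums(matrx)

# With both dimensions named, a cell can be retrieved by name
matrx["AreaB", "POP2010"]
```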
10. Now we’ll turn our attention to data frames. Clear the R console and execute the data() function as seen below. This displays a list of all the sample datasets that are part of R. You can use any of these datasets.
data()
11. For this exercise we’ll use the USArrests data frame. Add the code you see below to display the contents of the USArrests data frame.
12. Next, we’ll pull out the data for all rows from the Assault column.
USArrests$Assault
 [1] 236 263 294 190 276 204 110 238 335 211  46 120 249 113  56 115
[17] 109 249  83 300 149 255  72 259 178 109 102 252  57 159 285 254
[33] 337  45 120 151 159 106 174 279  86 188 201 120  48 156 145  81
[49]  53 161
13. A value from a specific row, column combination can be extracted using the code seen below where the row is specified as the first offset and the column is the second. This particular code extracts the assault value for Wyoming.
USArrests[50, 2]
[1] 161
14. If you leave off the column it will return all columns for that row.
USArrests[50, ]
        Murder Assault UrbanPop Rape
Wyoming    6.8     161       60 15.6
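Because the rows of USArrests are named by state, the same values can be retrieved by name rather than by position, which is often easier to read. The following is a small sketch of that idea; the variable two.states is illustrative.

```r
# USArrests ships with R, so no import is needed

# Same cell as USArrests[50, 2], addressed by row and column name
USArrests["Wyoming", "Assault"]

# Several rows and columns at once, selected by name
two.states <- USArrests[c("Texas", "Wyoming"), c("Murder", "Assault")]
two.states

# Quick structural checks on any data frame
dim(USArrests)     # number of rows, number of columns
names(USArrests)   # column names
```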
The sample datasets included with R are good for learning purposes, but of limited usefulness beyond that. You’re going to want to load datasets that are relevant to your line of work, and many of these datasets have a tabular structure that is conducive to the data frame object. Most of these datasets will need to be loaded from an external source such as delimited text files, database tables, web services, and others. You’ll learn how to load these external datasets using R code in a later chapter of the book, but as you’ll see in the next few steps you can also use the RStudio interface to load them.
15. In RStudio go to the File menu and select Import Dataset | From Text (readr). This will display the dialog seen in the screenshot below. We’ll discuss the readr package in much more detail in a future chapter, but this package is used to efficiently read external data into a data frame.
16. Use the Browse button to browse to the StudyArea.csv file found in the Data folder where you installed the exercise data for this book. The StudyArea.csv file is a comma separated list of wildfires from 1980-2016 for the Western United States.
The data will be loaded into a preview window as seen below. There are a number of import options along with the code that will be executed. You can leave the default values in this case.
17. Click Import from this Import Text Data dialog. This will load the data into a data frame (technically called a tibble in tidyverse) called StudyArea. It will also use the View() function to display the results in a tabular view displayed in the screenshot below.
18. Messages, warnings, and errors from the import will be displayed in the Console window. You can ignore these messages for now. We’ll discuss them in more detail in a later chapter.
This StudyArea data frame can then be used for data exploration and visualization, which we’ll cover in future chapters.
19. You can check your work against the solution file Chapter1_4.R.
Exercise 5: Looping statements
Looping statements aren’t used as much in R as they are in other languages because R has built-in support for vectorization. Vectorization is a built-in feature that automatically loops through a data structure without the need to write looping code. However, there may be times when you need to write looping code to accomplish a specific task that isn’t handled by vectorization, so you need to understand the syntax of looping statements in R. We’ll take a look at a simple block of code that loops through the rows in a data frame.
For loops are used when you know exactly how many times to repeat a block of code. This includes the use of data frame objects that have a specific number of rows. For loops are typically used with vector and data frame structures.
1. For this brief exercise we’ll use the StudyArea data frame that you imported from an external file in the last exercise. You will also learn how to create an R script and how to execute the script. A script is simply a series of commands that are run as a group rather than entering and running your code one line at a time from the Console window.
2. Create a new R script by going to File | New File | R Script from the RStudio interface.
3. Save the file with a name of Chapter1_5.R. You can place the script file wherever you’d like, but it is recommended that you save it to the folder where your exercise data is loaded.
4. Add the following lines of code to the Chapter1_5.R script.
for (fire in 1:nrow(StudyArea)) {
  print(StudyArea[fire, "TOTALACRES"])
}
5. Run the code by selecting Code | Run Region | Run All from the RStudio menu or by clicking the Source button on the script tab.
This will produce a stream of data that looks similar to what you see below. You will want to stop the execution of this script after it begins displaying data because of the amount of data and time it will take to print out all the information. The for loop syntax assigns each row number of the StudyArea data frame, in turn, to a variable called fire. The total number of acres burned for each fire is then printed.
# A tibble: 1 x 1
  TOTALACRES
       <dbl>
1      0.100
# A tibble: 1 x 1
  TOTALACRES
       <dbl>
1         3.
# A tibble: 1 x 1
  TOTALACRES
       <dbl>
1      0.500
# A tibble: 1 x 1
  TOTALACRES
       <dbl>
1      0.100
As I mentioned earlier, you won’t often need to use for loops in R because of the built-in support for vectorization, but sooner or later you’ll run into a situation where you need to create these looping structures.
6. You can check your work against the solution file Chapter1_5.R.
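To show what vectorization means in practice, the sketch below performs the same calculation twice, once with a for loop and once with a single vectorized expression. The acres vector holds a few made-up values and the acre-to-hectare conversion is used purely as an example task; neither comes from the StudyArea exercise.

```r
# Illustrative values only
acres <- c(0.1, 3, 0.5, 0.1)

# Loop version: convert each value to hectares one at a time
hectares.loop <- numeric(length(acres))
for (i in seq_along(acres)) {
  hectares.loop[i] <- acres[i] * 0.404686
}

# Vectorized version: R applies the multiplication to every element
hectares.vec <- acres * 0.404686

# Both approaches give the same result
all.equal(hectares.loop, hectares.vec)
```

The vectorized form is shorter and is generally the preferred style in R when the operation applies uniformly to every element.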
Exercise 6: Decision support statements – if | else
Decision support statements enable you to write code that branches based upon specific conditions. The basic if | else statement in R is used for decision support. Basically, if statements are used to branch code based on a test expression. If the test expression evaluates to TRUE, then a block of code is executed. If the test evaluates to FALSE then the processing skips down to the first else if statement, or to an else statement if you don’t include any else if statements.
Each if | else if | else statement has an associated code block that will execute when the statement evaluates to TRUE. Code blocks are denoted in R using curly braces as seen in the code example below.
You can include zero or more else if statements depending on what you’re attempting to accomplish in your code. If no statements evaluate to TRUE, processing will execute the code block associated with the else statement.
1. In this exercise we’ll build on the looping exercise by adding in an if | else if | else block that displays the fire names according to size.
2. Create a new R script by going to File | New File | R Script from the RStudio interface.
3. Save the file with a name of Chapter1_6.R. You can place the script file wherever you’d like, but it is recommended that you save it to the folder where your exercise data is loaded.
4. Copy and paste the for loop you created in the last exercise and saved to the Chapter1_5.R file into your new Chapter1_6.R file.
for (fire in 1:nrow(StudyArea)) {
  print(StudyArea[fire, "TOTALACRES"])
}
5. Add the if | else if block you see below. This script loops through all the rows in the StudyArea data frame and prints out messages that indicate when a fire has burned more than the specified number of acres for each category.
for (fire in 1:nrow(StudyArea)) {
  if (StudyArea[fire, "TOTALACRES"] > 100000) {
    print(paste("100K Fire: ", StudyArea[fire, "FIRENAME"], sep = ""))
  } else if (StudyArea[fire, "TOTALACRES"] > 75000) {
    print(paste("75K Fire: ", StudyArea[fire, "FIRENAME"], sep = ""))
  } else if (StudyArea[fire, "TOTALACRES"] > 50000) {
    print(paste("50K Fire: ", StudyArea[fire, "FIRENAME"], sep = ""))
  }
}
6. Run the code by selecting Code | Run Region | Run All from the RStudio menu or by clicking the Source button on the script tab. The script should start producing output in the Console pane similar to what you see below.
[1] "50K Fire: PIRU"
[1] "100K Fire: CEDAR"
[1] "50K Fire: MINE"
[1] "100K Fire: 24 COMMAND"
[1] "50K Fire: RANCH"
[1] "75K Fire: HARRIS"
[1] "50K Fire: SUNNYSIDE TURNOFF"
[1] "100K Fire: Range 12"
7. You can optionally add an else block at the end that will print a message for any fire that isn’t greater than 50,000 acres. Most of the fires in this dataset are less than 50,000 acres, so you’ll see a lot of messages that indicate this if you add the else block below.
for (fire in 1:nrow(StudyArea)) {
  if (StudyArea[fire, "TOTALACRES"] > 100000) {
    print(paste("100K Fire: ", StudyArea[fire, "FIRENAME"], sep = ""))
  } else if (StudyArea[fire, "TOTALACRES"] > 75000) {
    print(paste("75K Fire: ", StudyArea[fire, "FIRENAME"], sep = ""))
  } else if (StudyArea[fire, "TOTALACRES"] > 50000) {
    print(paste("50K Fire: ", StudyArea[fire, "FIRENAME"], sep = ""))
  } else {
    print("Not a MEGAFIRE")
  }
}
8. You can check your work against the solution file Chapter1_6.R.
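As an aside, the same size classification can be done without a loop. The sketch below is one possible vectorized alternative, not the approach used in the solution file: it applies the base R cut() function to a small made-up data frame (the fires object and its fire names are hypothetical) using the same three thresholds.

```r
# Hypothetical stand-in for the StudyArea data frame
fires <- data.frame(
  FIRENAME   = c("ALPHA", "BRAVO", "CHARLIE", "DELTA"),
  TOTALACRES = c(120000, 80000, 60000, 200)
)

# cut() assigns each value to an interval; with the default right = TRUE
# the intervals are (-Inf, 50000], (50000, 75000], (75000, 100000], (100000, Inf],
# which matches the > comparisons in the if | else if | else block
fires$SIZECLASS <- cut(
  fires$TOTALACRES,
  breaks = c(-Inf, 50000, 75000, 100000, Inf),
  labels = c("Not a MEGAFIRE", "50K Fire", "75K Fire", "100K Fire")
)

fires
```

The loop version is still useful when each branch needs to do something more involved than assigning a label.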
Exercise 7: Using functions
Functions are a group of statements that execute as a group and are action-oriented structures in that they accomplish some sort of task. Input variables can be passed into functions through what are known as parameters. Another name for parameters is arguments. These parameters become variables inside the function to which they are passed.
R packages include many pre-built functions that you can use to accomplish specific tasks, but you can also build your own functions. Functions take the form seen in the screenshot below.
Functions are assigned a name, can take zero or more arguments, each separated by a comma, have a body of statements that execute as a group, and can return a value. The body of a function is always enclosed by curly braces. This is where the work of the function is accomplished. Any variables defined inside the function or passed as arguments to the function become local variables that are only accessible from inside the function. The return keyword is used to return a value to the code that initially called the function.
The way you call a function can differ a little. The basic form of calling a function is to reference the name of the function followed by any arguments inside parentheses just after the name of the function. When passing arguments to the function using this default syntax, you simply pass the value for the parameter, and it is assumed that you are passing them in the order that they were defined. In this case the order that you pass in the arguments is very important. The order must match the order that was used to define the function. This is illustrated in the code example below.
myfunction(2, 4)
If the function returns a value, you will typically want to assign the result of the function call to a variable, as seen in the code example below that creates a variable called z.
z = myfunction(2, 4)
Finally, while you don’t have to specify the name of the argument you can do so if you’d like. In this case you simply pass in the name of the argument followed by an equal sign and then the value being passed for that argument. The code example below illustrates this optional way of calling a function.
myfunction(arg1 = 2, arg2 = 4)
In this exercise you’ll learn how to call some of the built-in R functions.
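The text never defines myfunction, so the sketch below supplies a hypothetical definition (dividing the first argument by the second is an arbitrary choice made for illustration) in order to show how positional and named calls behave.

```r
# Hypothetical definition: the body is an assumption, chosen so that
# argument order visibly affects the result
myfunction <- function(arg1, arg2) {
  result <- arg1 / arg2   # local variable, visible only inside the function
  return(result)
}

# Positional call: 2 is matched to arg1 and 4 to arg2
z <- myfunction(2, 4)
z

# Named call: the order no longer matters
myfunction(arg2 = 4, arg1 = 2)

# Swapping positional arguments changes the result
myfunction(4, 2)
```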
1. R includes a number of built-in functions for generating summary statistics for a dataset. In this exercise we’ll call some of the functions on the StudyArea data frame that was created in Exercise 4: Using data classes. In the Console pane add the line of code you see below to call the mean() function. In this case, the TOTALACRES column from the StudyArea data frame will be passed as a parameter to the function. This function calculates the mean of a numeric dataset, which in this case will be 191.0917.
mean(StudyArea$TOTALACRES)
[1] 191.0917
2. Repeat this same process with the min(), max(), and median() functions.
3. The YEAR_ field in the StudyArea data frame contains the year in which the fire occurred. The substr() function can be used to extract a series of characters from a variable. Use the substr() function as seen below to extract the last two digits of the year.
substr(StudyArea$YEAR_, 3, 4)
4. You’ve seen examples of a number of other built-in R functions in previous exercises including print(), ls(), rm(), and others. The base R package contains many functions that can be used to accomplish various tasks. There are thousands of other third-party R packages that you can use as well, and they all contain additional functions for performing specific tasks. You can also create your own functions, and we’ll do that in a future chapter.
5. You can check your work against the solution file Chapter1_7.R.
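If you don’t have the StudyArea data loaded, the sketch below exercises the same functions on small made-up vectors (acres and years are illustrative, not the exercise data). It also shows that substr() accepts numbers, which it converts to text before extracting characters.

```r
# Illustrative values only
acres <- c(0.1, 3, 0.5, 0.1, 1200)
years <- c("1980", "1999", "2016")

mean(acres)
median(acres)
min(acres)
max(acres)

# Characters 3 through 4 of each value: the last two digits of the year
substr(years, 3, 4)

# A numeric year is coerced to character first
substr(2016, 3, 4)
```

Comparing mean(acres) with median(acres) here also shows how a single very large fire pulls the mean well above the median.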
Exercise 8: Introduction to tidyverse
While the base R package includes many useful functions and data structures that you can use to accomplish a wide variety of data science tasks, the third-party tidyverse package supports a comprehensive data science workflow as illustrated in the diagram below. The tidyverse ecosystem includes many sub-packages designed to address specific components of the workflow.
Tidyverse is a coherent system of packages for importing, tidying, transforming, exploring, and visualizing data. The packages of the tidyverse ecosystem were mostly developed by Hadley Wickham, but they are now being expanded by several contributors. Tidyverse packages are intended to make statisticians and data scientists more productive by guiding them through workflows that facilitate communication, and result in reproducible work products. Fundamentally, the tidyverse is about the connections between the tools that make the workflow possible. Let’s briefly discuss the core packages that are part of tidyverse, and then we’ll do a deeper dive into the specifics of the packages as we move through the book. We’ll use these tools extensively throughout the book.
readr
The goal of readr is to facilitate the import of file-based data into a structured data format. The readr package includes seven functions for importing file-based datasets including csv, tsv, delimited, fixed width, white space separated, and web log files.
Data is imported into a data structure called a tibble. Tibbles are the tidyverse implementation of a data frame. They are quite similar to data frames, but are basically a newer, more advanced version. However, there are some important differences between tibbles and data frames. Tibbles never convert data types of variables. They never change the names of variables or create row names. Tibbles also have a refined print method that shows only the first 10 rows, and all columns that will fit on the screen. Tibbles also print the column type along with the name. We’ll refer to tibbles as data frames throughout the remainder of the book to keep things simple, but keep in mind that you’re actually going to be working with tibble objects. In the next chapter you’ll learn how to use the read_csv() function to load csv files into a tibble object.
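One of the differences described above, that tibbles never change variable names, can be seen in a few lines. This sketch assumes the tibble package is installed (it is installed along with tidyverse); the column name and values are made up.

```r
library(tibble)

# data.frame() rewrites names that are not valid R identifiers
# (check.names = TRUE is the default)
df <- data.frame(`Crime Date` = "01/02/2015", n = 1L)

# tibble() keeps the name exactly as given
tb <- tibble(`Crime Date` = "01/02/2015", n = 1L)

names(df)   # the space is replaced with a period
names(tb)   # the space is preserved

# A tibble is still a data frame underneath
inherits(tb, "data.frame")
```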
tidyr
Data tidying is a consistent way of organizing data in R, and can be facilitated through the tidyr package. There are three rules that we can follow to make a dataset tidy. First, each variable must have its own column. Second, each observation must have its own row, and finally, each value must have its own cell.
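As a hedged illustration of the three rules, the sketch below starts with a small made-up table in which the years are spread across column headings, and reshapes it so that year becomes a variable with its own column. It assumes tidyr version 1.0.0 or later, which provides pivot_longer(); the data and the column names YEAR and n are hypothetical.

```r
library(tidyr)

# Untidy layout: the variable "year" is stored in the column headings
wide <- data.frame(
  Beat   = c("Q1", "Q2"),
  `2015` = c(10, 20),
  `2016` = c(12, 18),
  check.names = FALSE   # keep the numeric-looking column names as-is
)

# Tidy layout: one row per Beat and year, one column per variable
long <- pivot_longer(
  wide,
  cols      = c("2015", "2016"),
  names_to  = "YEAR",
  values_to = "n"
)

long
```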
dplyr
The dplyr package is a very important part of tidyverse. It includes five key functions for transforming your data in various ways. These functions include filter(), arrange(), select(), mutate(), and summarize(). In addition, these functions all work very closely with the group_by() function. All five functions work in a very similar manner where the first argument is the data frame you’re operating on, and the next N number of arguments are the variables to include. The result of calling all five functions is the creation of a new data frame that is a transformed version of the data frame passed to the function. We’ll cover the specifics of each function in a later chapter.
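The sketch below runs each of the five functions once on a tiny made-up data frame so you can see the shared pattern: the data frame goes in as the first argument and a new data frame comes out. It assumes dplyr is installed; the crimes object and its values are hypothetical.

```r
library(dplyr)

# Hypothetical data
crimes <- data.frame(
  Beat     = c("Q1", "Q2", "Q2", "Q3", "Q3", "Q3"),
  Category = c("THEFT", "BURGLARY", "THEFT", "BURGLARY", "BURGLARY", "THEFT"),
  Count    = c(3, 5, 2, 4, 6, 1)
)

burglaries <- filter(crimes, Category == "BURGLARY")     # keep matching rows
sorted     <- arrange(burglaries, desc(Count))           # reorder rows
trimmed    <- select(sorted, Beat, Count)                # keep chosen columns
scaled     <- mutate(trimmed, Share = Count / sum(Count)) # add a column
totals     <- summarize(group_by(scaled, Beat),          # collapse by group
                        Total = sum(Count))

totals
```

Each step leaves its input untouched, which is why every result is assigned to a new variable.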
ggplot2
The ggplot2 package is a data visualization package for R, created by Hadley Wickham in 2005, and is an implementation of Leland Wilkinson’s Grammar of Graphics.
Grammar of Graphics is a term used to express the idea of creating individual blocks that are combined into a graphical display. The building blocks used in ggplot2 to implement the Grammar of Graphics include data, aesthetic mapping, geometric objects, statistical transformations, scales, coordinate systems, position adjustments, and faceting.
Using ggplot2 you can create many different kinds of charts and graphs including bar charts, box plots, violin plots, scatterplots, regression lines, and more. There are a number of advantages to using ggplot2 versus other visualization techniques available in R. These advantages include a consistent style for defining the graphics, a high level of abstraction for specifying plots, flexibility, a built-in theming system for plot appearance, a mature and complete graphics system, and access to many other ggplot2 users for support.
Other tidyverse packages
The tidyverse ecosystem includes a number of other supporting packages including stringr, purrr, forcats, and others. In this book we’ll focus primarily on the packages already described, but to round out your knowledge of tidyverse you can reference tidyverse.org.
Conclusion
In this chapter you learned the basics of using the RStudio interface for data visualization and exploration as well as some of the basic capabilities of the R language. After learning how to create variables and assign data, you learned some of the basic R data types including characters, vectors, factors, lists, matrices, and data frames. You also learned about some of the basic programming constructs including looping, decision support statements, and functions. Finally, you received an overview of the tidyverse package. In the next chapter you’ll learn some basic data exploration and visualization techniques before we dive into the specifics in future chapters.
Chapter 2
The Basics of Data Exploration and Visualization with R
Now that you’ve gotten your feet wet with the basics of R we’re going to turn our attention to covering some of the fundamental concepts of data exploration and visualization using tidyverse. This chapter is going to be a gentle introduction to some of the topics that we’re going to cover in much more exhaustive detail in coming chapters. For now, I just want you to get a sense of what is possible using various tools in the tidyverse package.
This chapter will teach you fundamental techniques for how to use the readr package to load external data from a CSV file into R, the dplyr package to massage and manipulate data, and ggplot2 to visualize data. You’ll also learn how to install the tidyverse ecosystem of packages and load the packages into the RStudio environment.
As I mentioned previously, this chapter is intended as a gentle introduction to what is possible rather than a detailed inspection of the packages. Future chapters will go into extensive detail on these topics. For now, I just want you to get a sense of what is possible even if you don’t completely understand the details.
In this chapter we’ll cover the following topics:
• Installing and loading tidyverse
• Loading and examining a dataset
• Filtering a dataset
• Grouping and summarizing a dataset
• Plotting a dataset
Exercise 1: Installing and loading tidyverse
In Chapter 1: Introduction to R you learned the basic concepts of the tidyverse package. We’ll be using various packages from the tidyverse ecosystem throughout this book including readr, dplyr, and ggplot2 among others. Tidyverse is a third-party package so you’ll need to install the package using RStudio so that it can be used in the exercises in this book. In this exercise you’ll learn how to install tidyverse and load the package into your scripts.
1. Open RStudio.
2. The tidyverse package is really more an ecosystem of packages that can be used to carry out various data science tasks. When you install tidyverse it installs all of the packages that are part of tidyverse, many of which we discussed in the last chapter. Alternatively, you can install them individually as well. There are a couple of ways that you can install packages in RStudio.
Locate the Packages pane in the lower right portion of the RStudio window. To install a new package using this pane, click the Install button shown in the screenshot below.
In the Packages text box, type tidyverse. Alternatively, you can install the packages individually, so instead of typing tidyverse you would type readr or ggplot2 or whatever package you want to install. We’re going to use the readr, dplyr, and ggplot2 packages in this chapter and in many others, so you can either install the entire tidyverse package, which includes the packages we’ll use in this chapter plus a number of others, or install them individually. Go ahead and do that now.
3. The other way of installing packages is to use the install.packages() function as seen below. This function should be typed in the Console pane.
install.packages(<package>)
For example, if you wanted to install the dplyr package you would type:
install.packages("dplyr")
4. To use the functionality provided by a package it also needs to be loaded either into an individual script that will use the package, or it can also be loaded from the Packages pane. To load a package from the Packages pane, simply click the checkbox next to the package as seen in the screenshot below.
5. You can also load a package from either a script or the Console pane by typing library(<package>). For example, to load the readr package you would type the following:
library(readr)
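Before loading packages in a script it can be handy to confirm that they are installed. The following is a minimal sketch of one way to do that with the base R requireNamespace() function; the variable names pkgs and not.installed are illustrative, and the install line is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Packages used in this chapter
pkgs <- c("readr", "dplyr", "ggplot2")

# requireNamespace() returns TRUE when a package is available
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not.installed <- pkgs[!installed]

if (length(not.installed) > 0) {
  message("Not installed: ", paste(not.installed, collapse = ", "))
  # install.packages(not.installed)   # uncomment to install them
} else {
  message("All packages are installed.")
}
```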
Exercise 2: Loading and examining a dataset
The tidyverse package is designed to work with data stored in an object called a tibble. Tibbles are the tidyverse implementation of a data frame. They are quite similar to data frames, but are basically a newer, more advanced version.
There are some important differences between tibbles and data frames. Tibbles never convert the data types of variables. Also, they never change the names of variables or create row names. Tibbles also have a refined print method that shows only the first 10 rows, and all columns that will fit on the screen. Tibbles also print the column type along with the name. We’ll refer to tibbles as data frames throughout the remainder of this chapter to keep things simple, but keep in mind that you’re actually going to be working with tibble objects as opposed to the older data frame objects.
Getting data into a tibble object for manipulation, analysis, and visualization is normally accomplished through the use of one of the read functions found in the readr package. In this exercise you’ll learn how to read the contents of a CSV file into R using the read_csv() function found in the readr package.
1. Open RStudio.
2. In the Packages pane scroll down until you see the readr package and check the box just to the left, as seen in the screenshot from the last exercise in this chapter. Note: If you don’t see the readr package in the Packages pane it means that the package hasn’t been installed. You’ll need to go back to the last exercise and follow the instructions provided.
3. You will also need to set the working directory for the RStudio session. The easiest way to do this is to go to Session | Set Working Directory | Choose Directory and then navigate to the IntroR\Data folder where you installed the exercise data for this book.
4. The read_csv() function is going to be used to read the contents of a file called Crime_Data.csv. This file contains approximately 481,000 crime reports from Seattle, WA covering a span of approximately 10 years. If you have Microsoft Excel or some other spreadsheet type software take a few moments to examine the contents of this file.
For each crime offense this file includes date and time information, crime categories and description, police department information including sector, beat, and precinct, and neighborhood name.
5. Find the RStudio Console pane and add the code you see below. This will read the data stored in the Crime_Data.csv file into a data frame (actually a tibble as discussed in the introduction) called dfCrime.
dfCrime = read_csv("Crime_Data.csv", col_names = TRUE)
6. You’ll see some messages indicating the column names and data types for each as seen below.
Parsed with column specification:
cols(
  `Report Number` = col_double(),
  `Occurred Date` = col_character(),
  `Occurred Time` = col_integer(),
  `Reported Date` = col_character(),
  `Reported Time` = col_integer(),
  `Crime Subcategory` = col_character(),
  `Primary Offense Description` = col_character(),
  Precinct = col_character(),
  Sector = col_character(),
  Beat = col_character(),
  Neighborhood = col_character()
)
7. You can get a count of the number of records with the nrow() function.
nrow(dfCrime)
[1] 481376
8. The View() function can be used to view the data in a tabular format as seen in the screenshot below.
View(dfCrime)
9. It will often be the case that you don’t need all the columns in the data that you import. The dplyr package includes a select() function that can be used to limit the fields in the data frame. In the Packages pane, load the dplyr library. Again, if you don’t see the dplyr library then it (or the entire tidyverse) will need to be installed.
10. In this case we’ll limit the columns to the following: Reported Date, Crime Subcategory, Primary Offense Description, Precinct, Sector, Beat, and Neighborhood. Add the code you see below to accomplish this.
dfCrime = select(dfCrime, "Reported Date", "Crime Subcategory", "Primary Offense Description", "Precinct", "Sector", "Beat", "Neighborhood")
11. View the results.
View(dfCrime)
12. You may also want to rename columns to make them more reader friendly or perhaps simplify the names. The select() function can be used to do this as well. Add the code you see below to see how this works. You simply pass in the new name of the column followed by an equal sign and then the old column name.
dfCrime = select(dfCrime, "CrimeDate" = "Reported Date", "Category" = "Crime Subcategory", "Description" = "Primary Offense Description", "Precinct", "Sector", "Beat", "Neighborhood")
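One thing to keep in mind when renaming with select() is that any column you don’t list is dropped. If you only want to rename and keep everything else, dplyr also provides a rename() function. The sketch below contrasts the two on a tiny made-up tibble (the df object and its values are hypothetical) and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data with the same style of column names as the crime file
df <- tibble(
  `Reported Date`     = c("01/02/2015", "03/04/2016"),
  `Crime Subcategory` = c("THEFT", "BURGLARY"),
  Beat                = c("Q1", "Q2")
)

# select(): renames and keeps ONLY the listed columns
kept <- select(df, CrimeDate = `Reported Date`, Beat)
names(kept)

# rename(): renames and keeps every column
renamed <- rename(df, CrimeDate = `Reported Date`)
names(renamed)
```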
Exercise 3: Filtering a dataset
In addition to limiting the columns that are part of a data frame, it’s also common to subset or filter the rows using a where clause. Filtering the dataset enables you to focus on a subset of the rows instead of the entire dataset. The dplyr package includes a filter() function that supports this capability. In this exercise you’ll filter the dataset so that only rows from a specific neighborhood are included.
1. In the RStudio Console pane add the following code. This will ensure that only crimes from the QUEEN ANNE neighborhood are included.
dfCrime2 = filter(dfCrime, Neighborhood == "QUEEN ANNE")
2. Get the number of rows and view the data if you’d like with the View() function.
nrow(dfCrime2)
[1] 25172
3. You can also include multiple expressions in a filter() function. For example, the line of code below would filter the data frame to include only residential burglaries that occurred in the Queen Anne neighborhood. There is no need to add the line of code below. It’s just meant as an example. We’ll examine more complex filter expressions in a later chapter.
dfCrime3 = filter(dfCrime, Neighborhood == "QUEEN ANNE", Category == "BURGLARY-RESIDENTIAL")
Exercise 4: Grouping and summarizing a dataset
The group_by() function, found in the dplyr package, is commonly used to group data by one or more variables. Once grouped, summary statistics can then be generated for the group or you can visualize the data in various ways. For example, the crime dataset we’re using in this chapter could be grouped by offense, neighborhood, and year, and then summary statistics including the count, mean, and median number of burglaries by year generated.
It’s also very common to visualize these grouped datasets in different ways. Bar charts, scatterplots, or other graphs could be produced for the grouped dataset. In this exercise you’ll learn how to group data and produce summary statistics.
1. In the RStudio console window add the code you see below to group the crimes by police beat.
dfCrime2 = group_by(dfCrime2, Beat)
2. The n() function is used to get a count of the number of records for each group. Add the code you see below.
dfCrime2 = summarise(dfCrime2, n = n())
3. Use the head() function to examine the results.
head(dfCrime2)
# A tibble: 4 x 2
  Beat      n
  <chr> <int>
1 D2     4373
2 Q1       88
3 Q2    10851
4 Q3     9860
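A quick sanity check after grouping and counting is to confirm that the group counts add back up to the number of rows you started with. The sketch below shows the idea on a small made-up tibble (crimes and by.beat are illustrative names) and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data: one row per crime report
crimes <- tibble(Beat = c("D2", "Q1", "Q2", "Q2", "Q3", "Q3", "Q3"))

# Count the records in each beat
by.beat <- summarise(group_by(crimes, Beat), n = n())
by.beat

# The counts should sum to the original number of rows
sum(by.beat$n) == nrow(crimes)

# Sorting by the count puts the busiest beat first
arrange(by.beat, desc(n))
```

The same check applied to the output above gives 4373 + 88 + 10851 + 9860 = 25172, which matches the row count reported for the Queen Anne subset in the previous exercise.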
Exercise 5: Plotting a dataset
The ggplot2 package can be used to create various types of charts and graphs from a data frame. The ggplot() function is used to define plots, and can be passed a number of parameters and joined with other functions to ultimately produce an output chart.
The first parameter passed to ggplot() will be the data frame you want to plot. Typically this will be a data frame object, but it can also be a subset of a data frame defined with the subset() function, for example a subset that includes only the rows where a State variable is equal to MA or TX.
In this exercise you will create a simple bar chart from the data frame created in the previous exercises in this chapter.
1. In the RStudio console add the code you see below. The ggplot() function in this case is passed the dfCrime2 data frame created in the previous exercise. The geom_col() function is used to define the geometry of the graph (bar chart) and is passed a mapping parameter which is defined by calling the aes() function and passing in the columns for the x axis (Beat), and the y axis (n = count).
ggplot(data = dfCrime2) + geom_col(mapping = aes(x = Beat, y = n), fill = "red")
2. This will produce the chart you see below in the Plots pane.
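If you want to experiment with the chart without loading the full crime file, the sketch below rebuilds the four beat counts shown in the previous exercise as a small data frame and plots them. It assumes ggplot2 is installed; the beat.counts name, the labs() call, and the label text are additions for illustration.

```r
library(ggplot2)

# The beat counts reported in the previous exercise
beat.counts <- data.frame(
  Beat = c("D2", "Q1", "Q2", "Q3"),
  n    = c(4373, 88, 10851, 9860)
)

# Build the plot in layers: data, then geometry, then labels
p <- ggplot(data = beat.counts) +
  geom_col(mapping = aes(x = Beat, y = n), fill = "red") +
  labs(x = "Police beat", y = "Number of crimes")

# Typing the object name (or calling print(p)) draws it in the Plots pane
p
```

Storing the plot in a variable, rather than drawing it immediately, makes it easy to add further layers later.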
Exercise6:Graphingburglariesbymonthandyear
Inthisexercisewe’llcreatesomethingalittlemorecomplex.We’llcreateacouplebarchartsthatdisplaythenumberofburglariesbyyearandbymonthfortheQueenAnneneighborhood.Inadditiontothedplyrandggplot2packagesweusedpreviouslyinthischapterwe’llalsousethelubridatepackagetomanipulatedateinformation.
1.IntheRStudioPackagespane,loadthelubridatepackage.Thelubridatepackageispartoftidyverseandisusedtoworkwithdatesandtimes.Also,makesurethereadr,dplyrandggplot2packagesareloaded.
2.LoadthecrimedatafromtheCrime_Data.csvfile.dfCrime=read_csv(“Crime_Data.csv”,col_names=TRUE)3.Specifythecolumnsandcolumnnames.
dfCrime=select(dfCrime,‘CrimeDate’=‘ReportedDate’,‘Category’=‘CrimeSubcategory’,‘Description’=‘PrimaryOffenseDescription’,‘Precinct’,‘Sector’,‘Beat’,‘Neighborhood’)
4.FiltertherecordssothatonlyresidentialburglariesintheQueenAnneneighborhoodareretained.dfCrime2=filter(dfCrime,Neighborhood==‘QUEENANNE’,Category==‘BURGLARY-RESIDENTIAL’)
5.Thedplyrpackageincludestheabilitytodynamicallycreatenewcolumnsinadataframethroughthemanipulationofdatafromexistingcolumnsinthedataframe.Themutate()functionisusedtocreatethenewcolumns.Herethemutate()functionwillbeusedtoextracttheyearfromtheCrimeDatecolumn.
Addthefollowingcodetoseethisinaction.ThesecondparametercreatesanewcolumncalledYEARandpopulatesitbyusingtheyear()functionfromthelubridatepackage.Insidetheyear()functiontheCrimeDatecolumn,whichisacharactercolumn,isconvertedtoadateandtheformatofthedate
dfCrime3=mutate(dfCrime2,YEAR=year(as.Date(dfCrime2$CrimeDate,format=’%m/%d/%Y’)))
6.Viewtheresult.NoticetheYEARcolumnattheendofthedataframe.Themutate()functionalwaysaddsnewcolumnstotheendofthedataframe.
View(dfCrime3)
7.Nowwe’llgroupthedatabyyearandsummarizebygettingacountofthenumberofcrimesperyear.Addthefollowinglinesofcode.dfCrime4=group_by(dfCrime3,YEAR)dfCrime4=summarise(dfCrime4,n=n())8.Viewtheresult.View(dfCrime4)
9. Create a bar chart by calling the ggplot() and geom_col() functions as seen below. Define YEAR as the column for the x axis and the number of crimes for the y axis. This should produce the chart you see below in the Plots pane.
ggplot(data = dfCrime4) + geom_col(mapping = aes(x = YEAR, y = n), fill = "red")
10. Now we'll create another bar chart that displays the number of crimes by month instead of year. First, create a MONTH column using the mutate() function.
dfCrime3 = mutate(dfCrime2, MONTH = month(as.Date(dfCrime2$CrimeDate, format = '%m/%d/%Y')))
11. Group and summarize the data by month.
dfCrime4 = group_by(dfCrime3, MONTH)
dfCrime4 = summarise(dfCrime4, n = n())
12. View the result.
View(dfCrime4)
13. Create the bar chart.
ggplot(data = dfCrime4) + geom_col(mapping = aes(x = MONTH, y = n), fill = "red")
14. You can check your work against the solution file Chapter2_6.R.
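The year and month steps in this exercise follow the same pattern, so they can be sketched together on a tiny data frame. The sample values below are invented for illustration and do not come from Crime_Data.csv; only the CrimeDate column name and the date format are taken from the exercise, and ParsedDate is a hypothetical helper column.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for the filtered crime data (values invented)
dfSample <- tibble(CrimeDate = c("01/15/2016", "03/02/2016", "03/20/2017"))

# Parse the text dates once, then derive YEAR and MONTH from the parsed date
dfSample <- mutate(dfSample,
                   ParsedDate = as.Date(CrimeDate, format = "%m/%d/%Y"),
                   YEAR = year(ParsedDate),
                   MONTH = month(ParsedDate))

# Count the records per year, as in the group and summarize steps
byYear <- summarise(group_by(dfSample, YEAR), n = n())
print(byYear)
```

Parsing the date once and reusing it avoids repeating the as.Date() call for each derived column.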
Conclusion
In this chapter you learned some basic techniques for data exploration and visualization using the tidyverse package and its ecosystem of sub-packages. After installing and loading the package using RStudio you performed a number of tasks using the R programming language with a number of tidyverse sub-packages. You loaded a dataset from a CSV file using readr. Afterward, you manipulated the data in various ways using the dplyr package. The select() function was used to include and rename columns, and the contents of the data frame were filtered using the filter() function. The data was then grouped and summarized, and finally several graphs were produced using ggplot2.
In the next chapter you will learn more about how to use the readr package to load data from external data sources.
Chapter 3
Loading Data into R
Large data objects, typically stored as data frames in R, are most often read from external files. R, along with tidyverse, includes a number of functions that can read external data files from a wide variety of sources including text files of many varieties, relational databases, and web services. External text files need to have a specific format with the first line, called the header, containing the column names. Each additional line in the file will have values for each variable. In this chapter, we'll examine a number of functions that can be used to read data.
There are a number of common data formats that can be read into and out of R. These include text files in formats such as csv, txt, html, and json. They also include files output from statistical applications including SAS and SPSS. Online resources including web services and HTML pages can also be read into R. Finally, relational and non-relational database tables can be read as well. There are a number of functions provided by R and tidyverse which will enable you to read these various sources.
In this chapter we'll cover the following topics:
• Loading a csv file with read.table()
• Loading a csv file with read.csv()
• Loading a tab delimited file with read.table()
• Using readr to load data
Exercise 1: Loading a csv file with read.table()
The first function we'll examine is read.table(). The read.table() function is a built-in R function that can be used to read various file formats into a data frame. This is probably the most common base R function used for reading simple files into R. However, as we'll see later in the chapter, tidyverse includes similar functions which are actually more efficient at reading external data into R.
The read.table() function accepts a file name, which will be the path and file name, along with a TRUE|FALSE indicator for the header. If set to TRUE the assumption is that column names are in the header line of the file. The path is not necessary if you have already set the working directory. The output of the read.table() function is a data frame object.
If the text file includes a header line, its values are used as the column names of the data frame. Default values will be used for the column headers if these are not provided. The file.choose() function is a handy function that you can use to interactively select the file you want imported rather than having to hard code the path to the dataset.
In this exercise you'll learn how to use the read.table() function to load a csv format file.
1. Open RStudio and find the Console pane.
2. If necessary, set the working directory by typing the code you see below into the Console pane or by going to Session | Set Working Directory | Choose Directory from the RStudio menu.
setwd(<installation directory for exercise data>)
3. The Data folder contains a file called StudyArea.csv, which is a comma separated file containing wildfire data from the years 1980-2016 for the states of California, Oregon, Washington, Idaho, Montana, Wyoming, Colorado, Utah, Nevada, Arizona, and New Mexico. There are a little over 439,000 records in this file and there are 37 columns of information that describe each fire during this period.
Use the read.table() function to load this data into a new data frame object. What happens when you run this line of code?
df = read.table("StudyArea.csv", header = TRUE)
You will get an error message when you attempt to run this line of code. The error message should appear as seen below.
Error in read.table("StudyArea.csv", header = TRUE) : more columns than column names
The reason an error message was generated in this case is that the read.table() function uses white space as the default delimiter between fields and our file uses commas as the delimiter.
4. Update your call to read.table() as seen below to include the sep argument, which should be a comma.
df = read.table("StudyArea.csv", sep = ",", header = TRUE)
When you run this line of code you'll see a new error.
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 12 did not have 14 elements
The read.table() function will NOT automatically fill in any missing values with a default value such as NA, so because some of the columns are empty in our rows we get an error message that indicates a particular line didn't have all 14 columns of information. We can fix this by adding the fill parameter and setting it equal to TRUE.
5. Update your code as seen below to add the fill parameter.
df = read.table("StudyArea.csv", header = TRUE, fill = TRUE, sep = ",")
When you run this line of code it will import the contents of the file into a data frame object. However, if you look at the Environment tab in RStudio you will see that it only loaded 153,095 records and yet we know there are over 400,000 records in the file. Quotes (single or double) in a csv file can cause records not to be loaded.
6. Let's add one more parameter to handle records that were thrown out due to quotes.
df = read.table("StudyArea.csv", header = TRUE, fill = TRUE, quote = "", sep = ",")
When you execute this line of code, 440,476 records should be imported. The data is loaded into an R data frame object, which is a structure that resembles a table. Detailed information about data frame objects will be covered in a later section of the book. For now, you can think of them as tables containing columns and rows.
My point in showing you this is to show how difficult it can be to use the read.table() function to load the contents of a csv file. The read.table() function is typically used to load tab delimited text files, but many people will attempt to use the read.table() function with csv format files without understanding all the parameters that may need to be included. Instead, you should use read.csv() as we'll do in the next exercise.
7. You can check your work against the solution file Chapter3_1.R.
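The effect of the sep, fill, and quote arguments can be tried on a very small file without loading StudyArea.csv. The sketch below writes an invented three-row csv to a temporary file (the names and values are made up): one field contains an apostrophe and the last row is missing a value.

```r
# Invented sample file: one name contains an apostrophe, one row is short
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,acres",
             "1,Pine Ridge,120.5",
             "2,O'Brien Gulch,30",
             "3,Cedar"), path)

# sep = ","  : commas separate the fields
# fill = TRUE: short rows are padded with NA instead of raising an error
# quote = "" : ' and " are treated as ordinary characters
dfSmall <- read.table(path, header = TRUE, sep = ",", fill = TRUE, quote = "")
print(dfSmall)
```

Removing any one of the three arguments and running the call again is a quick way to see which problem each argument addresses.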
Exercise 2: Loading a csv file with read.csv()
The read.csv() function is also a built-in R function that is almost identical to read.table(), with the exception that the separator defaults to a comma and the header and fill arguments are set to TRUE by default. In this exercise you'll see how much easier it is to load a csv file using read.csv().
1. The read.csv() function automatically handles most of the situations you are required to identify when using read.table() to load a csv file. Enter and run the code you see below to see how much easier this is with read.csv().
df = read.csv("StudyArea.csv")
2. This will load the 400,000+ records from the csv file. See how much easier that is? A few records will still be missing, but overall this function is much easier to use than read.table().
3. You can check your work against the solution file Chapter3_2.R.
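One way to see what read.csv() does for you is to spell out its defaults in an equivalent read.table() call. The sketch below uses an invented two-row file; the argument values shown are the documented defaults of read.csv() in base R.

```r
# Invented sample file with one empty numeric field
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,acres",
             "1,Pine Ridge,120.5",
             "2,Cedar,"), path)

viaCsv <- read.csv(path)

# The same load with the read.csv() defaults written out explicitly
viaTable <- read.table(path, header = TRUE, sep = ",", quote = "\"",
                       dec = ".", fill = TRUE, comment.char = "")

identical(viaCsv, viaTable)
```

Note that read.csv() still treats double quotes as quoting characters, which is one reason its record count can differ from a read.table() call that uses quote = "".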
Exercise 3: Loading a tab delimited file with read.table()
The read.table() function is most often used to read the contents of a tab delimited file. In this exercise you'll learn how to do that.
1. Your Data folder includes a file called all_genes_pombase.txt, which is tab delimited. Open this file with Excel or some other application to see the field structure and delimiters.
2. In the R Console window enter and run the code you see below to import the file.
df2 = read.table("all_genes_pombase.txt", header = TRUE, sep = "\t", quote = "")
3. This should load 7019 records into the data frame. You'll notice that many of the parameters still need to be used when loading the dataset, so it's not as easy to use as you might hope even in this case.
4. You can check your work against the solution file Chapter3_3.R.
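A small tab delimited example shows why quote = "" is still useful here. The gene identifiers and descriptions below are invented for illustration and are not taken from all_genes_pombase.txt.

```r
# Invented tab delimited file; the last description contains apostrophes
path <- tempfile(fileext = ".txt")
writeLines(c("gene_id\tname\tproduct",
             "G0001\tabc1\tprotein kinase",
             "G0002\txyz2\t5'-3' exonuclease"), path)

# sep = "\t" splits on tabs; quote = "" keeps the apostrophes as plain text
dfGenes <- read.table(path, header = TRUE, sep = "\t", quote = "")
print(dfGenes)
```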
Exercise 4: Using readr to load data
So far in this chapter we've been looking at various built-in R functions for reading external files into R as data frames. The tidyverse package includes a sub-package called readr that can also be used to load external data. The readr package includes a read_csv() function that loads data much faster than the built-in read.csv() function.
In addition to loading the data faster it also includes a progress dialog, and the output includes the data frame column structure along with any parsing errors. Overall, the read_csv() function in the readr package is preferred over the functions found in the basic installation of R. The readr package also includes some other functions for loading various file formats including read_delim(), read_csv2(), and read_tsv(). Each of the functions accepts the same parameters, so once you've learned to use one of these functions for loading data you can easily use any of the others.
In this exercise you're going to use the read_csv() function found in the readr package to load data into a data frame.
1. Load the readr library.
library(readr)
2. The read_csv() function in the readr package can be used to load csv files. Compared to the base loading functions we looked at previously in this chapter, readr functions are significantly faster (roughly 10x), include a helpful progress bar to provide feedback on the progress of the load for large files, and all the functions work exactly the same way.
Add and run the code you see below. Notice how much more quickly the data loads into the data frame object. The col_types argument was used in this case to load all the columns as a character data type for simplification purposes. Otherwise we'd have to do some additional preprocessing of the data to account for various column data types.
dfReadr = read_csv("StudyArea.csv", col_types = cols(.default = "c"), col_names = TRUE)
Other loading functions found in the readr package include read_delim(), read_csv2(), and read_tsv().
3. Now let's run this function again, but this time take off the col_types argument so you can see an example of some of the potential loading errors that can occur. Update and run your code as follows:
dfReadr = read_csv("StudyArea.csv", col_names = TRUE)
4. The first thing you'll see is a list of the columns that will be imported along with the column data type. Your output should appear as follows:
Parsed with column specification:
cols(
  .default = col_character(),
  FID = col_integer(),
  UNIT = col_integer(),
  FIRENUMBER = col_integer(),
  SPECCAUSE = col_integer(),
  STATCAUSE = col_integer(),
  SIZECLASSN = col_integer(),
  FIRETYPE = col_integer(),
  PROTECTION = col_integer(),
  FIREPROTTY = col_integer(),
  YEAR_ = col_integer(),
  FiscalYear = col_integer(),
  STATE_FIPS = col_integer(),
  FIPS = col_integer(),
  DLATITUDE = col_double(),
  DLONGITUDE = col_double(),
  TOTALACRES = col_double(),
  TRPGENCAUS = col_integer(),
  TRPSPECCAU = col_integer(),
  Duplicate_ = col_integer()
)
5. A warning message will be displayed below that indicating that there were parsing errors on the load.
Warning: 196742 parsing failures.
   row col  expected   actual file
242621 UNIT an integer EOR    'StudyArea.csv'
242622 UNIT an integer EOR    'StudyArea.csv'
242623 UNIT an integer EOR    'StudyArea.csv'
242624 UNIT an integer EOR    'StudyArea.csv'
242625 UNIT an integer EOR    'StudyArea.csv'
6. You can use the problems() function to get a list of the parsing errors. Add and run the code you see below.
problems(dfReadr)
# A tibble: 196,742 x 5
      row col   expected   actual file
    <int> <chr> <chr>      <chr>  <chr>
 1 242621 UNIT  an integer EOR    'StudyArea.csv'
 2 242622 UNIT  an integer EOR    'StudyArea.csv'
 3 242623 UNIT  an integer EOR    'StudyArea.csv'
 4 242624 UNIT  an integer EOR    'StudyArea.csv'
 5 242625 UNIT  an integer EOR    'StudyArea.csv'
 6 242626 UNIT  an integer EOR    'StudyArea.csv'
 7 242627 UNIT  an integer EOR    'StudyArea.csv'
 8 242628 UNIT  an integer EOR    'StudyArea.csv'
 9 242629 UNIT  an integer EOR    'StudyArea.csv'
10 242630 UNIT  an integer EOR    'StudyArea.csv'
# ... with 196,732 more rows
7. From the looks of the error messages it appears there is an issue with the UNIT column. If you look back up to the list of columns and data types, you'll notice that the UNIT column was created as an integer data type. However, if you open the StudyArea.csv file in Excel or another application you'll quickly see that not all the values are numeric. Some include letters. This accounts for the parsing errors in the dataset.
Update your code as seen below and run it again. This sets the UNIT column to a character (text) data type.
dfReadr = read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
This time you should get a clean load of the dataset. That doesn't mean the data won't need some additional preparation and cleanup. For example, there are some date fields including STARTDATED that were loaded as character but might be better off as date fields. We can save this additional preparation work for a later exercise though.
8. You can examine the first few lines of the data frame by entering the head() function as seen below.
head(dfReadr)
# A tibble: 6 x 14
    FID ORGANIZATI UNIT  SUBUNIT SUBUNIT2       FIRENAME  CAUSE YEAR_ STARTDATED  CONTRDATED OUTDATED STATE STATE_FIPS
  <int> <chr>      <chr> <chr>   <chr>          <chr>     <chr> <int> <chr>       <chr>      <chr>    <chr>      <int>
1     0 FWS        81682 USCADBR San Diego Bay… PUMP HOU… Human  2001 1/1/01 0:00 1/1/01 0:… NA       Cali…          6
2     1 FWS        81682 USCADBR San Diego Bay… I5        Human  2002 5/3/02 0:00 5/3/02 0:… NA       Cali…          6
3     2 FWS        81682 USCADBR San Diego Bay… SOUTH BAY Human  2002 6/1/02 0:00 6/1/02 0:… NA       Cali…          6
4     3 FWS        81682 USCADBR San Diego Bay… MARINA    Human  2001 7/12/01 0:… 7/12/01 0… NA       Cali…          6
5     4 FWS        81682 USCADBR San Diego Bay… HILL      Human  1994 9/13/94 0:… 9/13/94 0… NA       Cali…          6
6     5 FWS        81682 USCADBR San Diego Bay… IRRIGATI… Human  1994 4/22/94 0:… 4/22/94 0… NA       Cali…          6
# ... with 1 more variable: TOTALACRES <dbl>
9. You can check your work against the solution file Chapter3_4.R.
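The UNIT parsing problem can be reproduced on a much smaller scale. The sketch below writes an invented two-row file to a temporary location; forcing UNIT to an integer on purpose stands in for the incorrect type guess seen in this exercise, and the file contents are made up apart from the column names.

```r
library(readr)

# Invented sample: the second UNIT value contains letters
path <- tempfile(fileext = ".csv")
writeLines(c("FID,UNIT,TOTALACRES",
             "0,81682,0.1",
             "1,USCADBR,3.0"), path)

# Forcing UNIT to integer triggers a parsing failure for the second row
dfBad <- suppressWarnings(
  read_csv(path, col_types = cols(UNIT = col_integer()))
)
problems(dfBad)   # lists the rows and columns that failed to parse

# Declaring UNIT as character gives a clean load
dfGood <- read_csv(path, col_types = cols(UNIT = col_character()))
print(dfGood)
```

Only the column that needs an override has to be named in col_types; readr guesses the types of the remaining columns.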
Conclusion
In this chapter you learned various functions for loading an external data file including the built-in R functions read.table() and read.csv(). While these functions can certainly get the job done, the read_csv() function found in the readr package is a much more efficient function for loading external data. In the next chapter you will learn how to transform your datasets using the dplyr package. You'll learn techniques for filtering the contents of a data frame, selecting specific columns to be used, arranging rows in ascending or descending order, and summarizing and grouping a dataset.
Chapter 4
Transforming Data
Before a dataset can be analyzed in R it often needs to be manipulated or transformed in various ways. The dplyr package, part of the larger tidyverse package, provides a set of functions that allow you to transform a dataset in various ways. The dplyr package is a very important part of tidyverse since the functions provided through this package are used so frequently to transform data in different ways prior to doing more advanced data exploration, visualization, and modeling.
There are five key functions that are part of dplyr: filter(), arrange(), select(), mutate(), and summarize(). All five functions work in a similar manner: the first argument is the data frame to manipulate, the remaining arguments describe what to do with it (for example, the conditions to match or the columns to include), and all return a data frame as a result.
The dplyr functions are often used in conjunction with the group_by() dplyr function to manipulate a dataset that has been grouped in some way. The group_by() function creates a new data frame object that has been grouped by one or more variables.
In this chapter we'll cover the following topics:
• Filtering records to create a subset
• Narrowing the list of columns
• Arranging rows in ascending or descending order
• Adding columns
• Summarizing and grouping
• Piping for code efficiency
Exercise 1: Filtering records to create a subset
The first dplyr function that we'll examine is filter(). The filter() function is used to create a subset of records based on some value. For example, you might want to create a data frame of wildfires containing incidents that have burned at least 25,000 acres. As long as you have an existing data frame that includes a column that measures the number of acres burned, you can accomplish the creation of this subset using the filter() function.
As will be the case with all the dplyr functions we examine, the first argument passed to the filter() function is a data frame object. Each additional parameter passed to the function is a conditional expression used to filter the data frame. For example, take a look at the line of code below. This statement calls the filter() function to create a new variable called df25k, which will contain only rows where the ACRES column contains a value greater than or equal to 25000.
df25k = filter(df, ACRES >= 25000)
This is an example of calling the filter() function and passing a single conditional expression. In the next code example, two conditional expressions are passed. The first is used to filter records so that the number of acres is greater than or equal to 25000, and the second filters records so that only records where the YEAR column contains a value of 2016 will be retained.
df25k = filter(df, ACRES >= 25000, YEAR == 2016)
In this case, the df25k variable will include records where both conditions are matched: acreage burned is greater than or equal to 25000 and the fire year was 2016. This can also be rewritten as a single parameter that uses the & operator to combine expressions as seen below.
df25k = filter(df, ACRES >= 25000 & YEAR == 2016)
In this exercise you'll learn how to use the filter() function to create a subset of records based on some value.
1. The exercises in this chapter require the following packages: readr, dplyr, ggplot2. They can be loaded from the Packages pane, the Console pane, or a script.
2. Open RStudio and find the Console pane.
3. If necessary, set the working directory by typing the code you see below into the Console pane or by going to Session | Set Working Directory | Choose Directory from the RStudio menu.
setwd(<installation directory for exercise data>)
4. The Data folder contains a file called StudyArea.csv, which is a comma separated file containing wildfire data from the years 1980-2016 for the states of California, Oregon, Washington, Idaho, Montana, Wyoming, Colorado, Utah, Nevada, Arizona, and New Mexico. There are a little over 439,000 records in this file and there are 37 columns of information that describe each fire during this period.
Use the read_csv() function to load the dataset into a data frame.
dfFires = read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
5. Use the nrow() function to make sure that the approximately 439,000 records were loaded.
nrow(dfFires)
[1] 439362
6. Initially we'll use a single conditional expression with the filter() function to create a subset of records that contains only wildfires of at least 25,000 acres. Add the code you see below to run the filter() function. All dplyr functions, including filter(), return a new data frame object, so you need to specify a new variable that will contain the output data frame. The df25k variable will hold the output data frame in this case.
df25k = filter(dfFires, TOTALACRES >= 25000)
Get a count of the number of records that match the filter. There should be 655 rows. You may also want to use the View(df25k) function to see the data in a tabular format.
nrow(df25k)
[1] 655
7. You can also include multiple conditional expressions as part of the filter. Each expression (argument) is combined with an "and" clause by default. This means that all expressions must be matched for a record to be returned. Add and run the code you see below to see an example.
df1k = filter(dfFires, TOTALACRES >= 1000, YEAR_ == 2016)
nrow(df1k)
[1] 152
8. You can also combine the expressions into a single expression with multiple conditions as seen below. This will accomplish the same thing as the previous line of code. Which of the two you use is a matter of personal preference in this case since we're using an "and" clause. The & character is the "and" operator. You would need to use the | character to include an "or" operator.
df1k = filter(dfFires, TOTALACRES >= 1000 & YEAR_ == 2016)
9. Finally, when you have a list of potential values that you want to be included by the filter, the %in% operator can be used. Add the line of code below to see how this works. This particular line of code would create a data frame containing fires that occurred in the years 2010, 2011, or 2012.
dfYear = filter(dfFires, YEAR_ %in% c(2010, 2011, 2012))
10. You can view any of these data frames in a tabular view using the View(<data frame>) syntax. For example, View(dfYear)
11. You can check your work against the solution file Chapter4_1.R.
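The different ways of combining conditions in this exercise can be compared side by side on a small data frame. The four rows below are invented for illustration; only the TOTALACRES and YEAR_ column names come from the wildfire dataset.

```r
library(dplyr)

# Invented sample rows using the wildfire column names
dfDemo <- tibble(TOTALACRES = c(500, 1200, 30000, 2500),
                 YEAR_ = c(2016, 2016, 2015, 2011))

# Comma separated conditions and & give the same result
viaComma <- filter(dfDemo, TOTALACRES >= 1000, YEAR_ == 2016)
viaAmp   <- filter(dfDemo, TOTALACRES >= 1000 & YEAR_ == 2016)
identical(viaComma, viaAmp)

# | keeps rows that match either condition
eitherOne <- filter(dfDemo, TOTALACRES >= 25000 | YEAR_ == 2011)

# %in% keeps rows whose value appears in the supplied vector
inYears <- filter(dfDemo, YEAR_ %in% c(2010, 2011, 2012))
```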
Exercise 2: Narrowing the list of columns with select()
Many datasets that you load from external data sources include dozens of columns. The StudyArea.csv file that you've been working with in the exercises includes 37 columns of information. In most cases you won't need all the columns.
The select() function can be used to narrow down the list of columns to include only those needed for a task. To use the select() function, simply pass in the name of the data frame along with the columns to include.
1. Use the read_csv() function to load the dataset into a data frame.
Note: For the sake of completeness you will be loading the external data from the StudyArea.csv file to the dfFires data frame, but this step isn't absolutely necessary if you're doing the exercises in sequence in the same RStudio session.
dfFires = read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
2. On a new line, add a call to the select() function as seen below to limit the columns that are returned.
dfFires2 = select(dfFires, FIRENAME, TOTALACRES, YEAR_)
3. Display the first few rows and notice that we now have only three columns.
head(dfFires2)
  FIRENAME   TOTALACRES YEAR_
  <chr>           <dbl> <int>
1 PUMP HOUSE      0.100  2001
2 I5              3.00   2002
3 SOUTH BAY       0.500  2002
4 MARINA          0.100  2001
5 HILL            1.00   1994
6 IRRIGATION      0.100  1994
4. Many of the column names that you import will not be very reader friendly, so it's not uncommon to want to rename the columns as well. This can also be accomplished using the select() function. Rename your columns by adding and running the code you see below.
dfFires2 = select(dfFires, "FIRE" = "FIRENAME", "ACRES" = "TOTALACRES", "YR" = "YEAR_")
5. Display the first few lines.
head(dfFires2)
  FIRE       ACRES    YR
  <chr>      <dbl> <int>
1 PUMP HOUSE 0.100  2001
2 I5         3.00   2002
3 SOUTH BAY  0.500  2002
4 MARINA     0.100  2001
5 HILL       1.00   1994
6 IRRIGATION 0.100  1994
6. There are also a number of handy helper functions that you can use with the select() function to filter the returned columns. These include starts_with(), ends_with(), contains(), matches(), and num_range(). To see how this works, add and run the code you see below. This will return any columns that contain the word DATE.
dfFires3 = select(dfFires, contains("DATE"))
head(dfFires3)
  STARTDATED   CONTRDATED   OUTDATED
  <chr>        <chr>        <chr>
1 1/1/01 0:00  1/1/01 0:00  NA
2 5/3/02 0:00  5/3/02 0:00  NA
3 6/1/02 0:00  6/1/02 0:00  NA
4 7/12/01 0:00 7/12/01 0:00 NA
5 9/13/94 0:00 9/13/94 0:00 NA
6 4/22/94 0:00 4/22/94 0:00 NA
7. You can also make multiple calls to these helper functions.
dfFires3 = select(dfFires, contains("DATE"), starts_with("TOTAL"))
head(dfFires3)
  STARTDATED   CONTRDATED   OUTDATED TOTALACRES
  <chr>        <chr>        <chr>         <dbl>
1 1/1/01 0:00  1/1/01 0:00  NA            0.100
2 5/3/02 0:00  5/3/02 0:00  NA            3.00
3 6/1/02 0:00  6/1/02 0:00  NA            0.500
4 7/12/01 0:00 7/12/01 0:00 NA            0.100
5 9/13/94 0:00 9/13/94 0:00 NA            1.00
6 4/22/94 0:00 4/22/94 0:00 NA            0.100
8. You can check your work against the solution file Chapter4_2.R.
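Renaming and the helper functions can be checked quickly by looking at the resulting column names. The two-row data frame below is a made-up stand-in that reuses a few of the wildfire column names.

```r
library(dplyr)

# Invented sample rows using wildfire column names
dfDemo <- tibble(FIRENAME = c("HILL", "MARINA"),
                 TOTALACRES = c(1.0, 0.1),
                 YEAR_ = c(1994L, 2001L),
                 STARTDATED = c("9/13/94 0:00", "7/12/01 0:00"))

# new_name = old_name renames while selecting
renamed <- select(dfDemo, FIRE = FIRENAME, ACRES = TOTALACRES, YR = YEAR_)
names(renamed)

# Helper functions pick columns by pattern; the result keeps the order requested
helpers <- select(dfDemo, contains("DATE"), starts_with("TOTAL"))
names(helpers)
```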
Exercise 3: Arranging Rows
The arrange() function in the dplyr package can be used to order the rows in a data frame. This function accepts a set of columns to order by, with the default row ordering being in ascending order. However, you can pass the desc() helper function to order the rows in descending order. Missing values will be placed at the end of the data frame.
1. Use the read_csv() function to load the dataset into a data frame.
dfFires = read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
2. Filter the dataset so that it contains only fires of at least 1,000 acres burned from the year 2016.
df1k = filter(dfFires, TOTALACRES >= 1000, YEAR_ == 2016)
3. Add and run the code you see below to create a subset of columns and rename them.
df1k = select(df1k, "NAME" = "FIRENAME", "ACRES" = "TOTALACRES", "YR" = "YEAR_")
4. Sort the rows so that they are in ascending order.
arrange(df1k, ACRES)
   NAME        ACRES    YR
   <chr>       <dbl> <int>
 1 Crackerbox  1000.  2016
 2 Lakes       1000.  2016
 3 Choulic 2   1008.  2016
 4 Amigo Wash  1020.  2016
 5 Granite     1030.  2016
 6 Tie         1031.  2016
 7 Black       1040.  2016
 8 Bybee Creek 1072.  2016
 9 MARSHES     1080.  2016
10 Bug Creek   1089.  2016
5. Use the desc() helper function to order the rows in descending order.
arrange(df1k, desc(ACRES))
   NAME         ACRES    YR
   <chr>        <dbl> <int>
 1 PIONEER    188404.  2016
 2 Junkins    181320.  2016
 3 Range 12   171915.  2016
 4 Erskine     48007.  2016
 5 Cedar       45977.  2016
 6 Maple       45425.  2016
 7 Rail        43799.  2016
 8 North Fire  42102.  2016
 9 Laidlaw     39813.  2016
10 BLUE CUT    36274.  2016
6. You can use the View() function as a wrapper around these calls to view the data in a tabular grid view by adding the code you see below.
View(arrange(df1k, desc(ACRES)))
7. You can check your work against the solution file Chapter4_3.R.
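The note about missing values is easy to verify on a small example. The sketch below adds an invented row with a missing ACRES value to a few rows modeled on the output above and sorts in both directions.

```r
library(dplyr)

# Sample rows; the "Unknown" row with NA acres is invented for illustration
dfDemo <- tibble(NAME = c("Tie", "Unknown", "Erskine", "Granite"),
                 ACRES = c(1031, NA, 48007, 1030))

asc <- arrange(dfDemo, ACRES)        # smallest first
dsc <- arrange(dfDemo, desc(ACRES))  # largest first

asc$NAME
dsc$NAME
```

In both results the row with the missing value comes last, regardless of the sort direction.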
Exercise 4: Adding Columns with mutate()
The mutate() function is used to add new columns to a data frame that are the result of a function you run on other columns in the data frame. Any new columns created with the mutate() function will be added to the end of the data frame. This function can be incredibly useful for dynamically creating new columns that are the result of operations performed on other columns from the data frame. In this exercise you'll learn how the mutate() function can be used to create new columns in a data frame.
1. You're going to need the lubridate package for this exercise. The lubridate package is part of tidyverse and is used to work with dates and times. In RStudio, check the Packages tab to make sure that lubridate has been installed and loaded as seen in the screenshot below. If not, you'll need to do so now using the instructions for installing and loading a package covered in Chapter 1: Introduction to R.
2. Recall from Chapter 1: Introduction to R that you can also load an installed library using the syntax seen below.
library(lubridate)
3. Use the read_csv() function to load the dataset into a data frame.
dfFires = read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
4. Use the select() function to define a set of columns for the data frame.
df = select(dfFires, ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE, STARTDATED)
5. Do some basic filtering of the data so that only fires that burned at least 1,000 acres and have a cause of Human or Natural are included. There are some records marked as Unknown in the dataset, so we'll remove those for this exercise.
df = filter(df, TOTALACRES >= 1000 & CAUSE %in% c('Human', 'Natural'))
6. Use the mutate() function to create a new DOY column that contains the day of the year that the fire started. The yday() function from the lubridate package is used to return the day of the year using a formatted input date from the STARTDATED column.
df = mutate(df, DOY = yday(as.Date(df$STARTDATED, format = '%m/%d/%y %H:%M')))
7. View the resulting DOY column.
View(df)
8. You can check your work against the solution file Chapter4_4.R.
9. In the next exercise the mutate() function will be used again when we create a column that holds the decade of the fire and then calculate the mean acreage burned by decade.
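The day-of-year calculation can be checked by hand on a few dates. The sketch below uses three start dates in the same text format as the STARTDATED column; START is a hypothetical intermediate column added here only to make the parsed date visible.

```r
library(dplyr)
library(lubridate)

# Three start dates in the same month/day/two-digit-year layout as STARTDATED
dfDemo <- tibble(STARTDATED = c("1/1/01 0:00", "5/3/02 0:00", "9/13/94 0:00"))

# Parse the text into a Date first, then take the day of the year
dfDemo <- mutate(dfDemo,
                 START = as.Date(STARTDATED, format = "%m/%d/%y %H:%M"),
                 DOY = yday(START))
print(dfDemo)
```

January 1 is day 1, May 3 of a non-leap year is day 123, and September 13 of a non-leap year is day 256.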
Exercise 5: Summarizing and Grouping
Summary statistics for a data frame can be produced with the summarize() function. The summarize() function produces a single row of data containing summary statistics from a data frame. This function is normally paired with the group_by() function to produce group summary statistics.
The grouping of data in a data frame facilitates the split-apply-combine paradigm. This paradigm first splits the data into groups, using the group_by() function in dplyr, then applies analysis to each group, and finally combines the results. The group_by() function handles the split portion of the paradigm by creating groups of data using one or more columns. For example, you might group all wildfires by state and cause.
In this exercise you'll use the mutate(), summarize(), and group_by() functions to group wildfires by decade and produce a summary of the mean wildfire size for each decade.
1. Use the read_csv() function to load the dataset into a data frame.
dfFires = read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
2. Select the columns that will be used in the exercise.
df = select(dfFires, ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE)
3. Filter the records.
df = filter(df, TOTALACRES >= 1000)
4. Use the mutate() function to create a new column called DECADE that defines the decade in which each fire occurred. In this case a nested ifelse() function is called to produce the values for each decade.
df = mutate(df, DECADE = ifelse(YEAR_ %in% 1980:1989, "1980-1989", ifelse(YEAR_ %in% 1990:1999, "1990-1999", ifelse(YEAR_ %in% 2000:2009, "2000-2009", ifelse(YEAR_ %in% 2010:2016, "2010-2016", "-99")))))
5. View the result.
View(df)
6. Use the group_by() function to group the data frame by decade.
grp = group_by(df, DECADE)
7. Summarize the mean size of wildfires by decade using the summarize() function.
sm = summarize(grp, mean(TOTALACRES))
8. View the result.
View(sm)
9. Let's tidy things up by renaming the new column produced by the summarize() function.
names(sm) <- c("DECADE", "MEAN_ACRES_BURNED")
10. Finally, let's create a bar chart of the results. We'll discuss the creation of many different types of charts and graphs as we move through later chapters of the book, so detailed discussion of these topics will be saved for later.
ggplot(data = sm) + geom_col(mapping = aes(x = DECADE, y = MEAN_ACRES_BURNED), fill = "red")
11. You can check your work against the solution file Chapter4_5.R.
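The split-apply-combine steps can be seen in miniature below. This is a sketch on invented values, and it departs from the exercise in two labeled ways: the decade label is computed with arithmetic rather than nested ifelse() calls (so the labels read "1980s" instead of "1980-1989"), and the summary column is named inside summarize() rather than renamed afterward.

```r
library(dplyr)

# Invented sample rows using the wildfire column names
dfDemo <- tibble(YEAR_ = c(1985, 1988, 1995, 2012),
                 TOTALACRES = c(1000, 3000, 2000, 5000))

# Alternative to nested ifelse(): round the year down to the start of its decade
dfDemo <- mutate(dfDemo, DECADE = paste0(floor(YEAR_ / 10) * 10, "s"))

# Split by decade, apply mean(), and combine into one row per group;
# naming the result here removes the need for a separate names() step
sm <- summarize(group_by(dfDemo, DECADE),
                MEAN_ACRES_BURNED = mean(TOTALACRES))
print(sm)
```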
Exercise 6: Piping
As you've probably noticed in some of these exercises, it is not unusual to run a series of dplyr functions as part of a larger processing routine. As you'll recall, each dplyr function returns a new data frame, and this data frame is typically used as the input to the next dplyr function in the series. These data frames are intermediate datasets not needed beyond the current step. However, you are still required to name and code each of these datasets.
Piping is a more efficient way of handling these temporary, intermediate datasets. In sum, piping is an efficient way of sending the output of one function to another function without creating an intermediate dataset and is most useful when you have a series of functions to run. The syntax for piping is to use the %>% characters at the end of each statement that you want to pipe. In this exercise you'll learn how to use piping to chain together input and output data frames.
1. In the last exercise the select(), filter(), mutate(), group_by(), and summarize() functions were all used in a series that ultimately produced a bar chart showing the mean acreage burned by wildfires in the past few decades. Each of these functions returns a data frame, which is then used as input to the next function in the series. Piping is a more efficient way of coding this chaining of function calls. Rewrite the code produced in Exercise 4 (the mutate() exercise) as seen below and then we'll discuss how piping works.
library(lubridate)
df = read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE) %>%
  select(ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE, STARTDATED) %>%
  filter(TOTALACRES >= 1000 & CAUSE %in% c('Human', 'Natural')) %>%
  mutate(DOY = yday(as.Date(STARTDATED, format = '%m/%d/%y %H:%M')))
View(df)
The first line of code reads the contents of the external StudyArea.csv file into a data frame as we've done in all the other exercises in this chapter. However, you'll notice the inclusion of the piping operator (%>%) at the end of the line. This ensures that the data frame returned by read_csv() will automatically be sent to the select() function. The df variable on the left side of the assignment receives the final result of the whole chain.
Notice that the select() function does not create a variable like we have done in the past exercises, and that we have left off the first parameter, which would normally have been the data frame variable. It is implied that the data frame produced by the previous line will be passed to the select() function. This same process of including the piping operator at the end of each line and leaving off the first parameter is repeated for all the additional lines of code where we want to automatically pass the data frame to the next dplyr function. Finally, we view the contents of the df variable using the View() function on the last line.
Piping makes your code more streamlined and easier to read and also takes away the need to create and populate variables that are only used as intermediate datasets.
2. You can check your work against the solution file Chapter4_6.R.
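To confirm that piping changes only how the code is written and not what it produces, the same short chain can be run both ways and compared. The rows below are invented; the step1 and step2 variables are the kind of intermediate datasets that piping removes.

```r
library(dplyr)

# Invented sample rows using wildfire column names
dfDemo <- tibble(STATE = c("Idaho", "Idaho", "Utah"),
                 CAUSE = c("Human", "Natural", "Human"),
                 TOTALACRES = c(1500, 800, 2200))

# Step by step, naming each intermediate data frame
step1 <- select(dfDemo, STATE, TOTALACRES)
step2 <- filter(step1, TOTALACRES >= 1000)
stepwise <- arrange(step2, desc(TOTALACRES))

# The same chain with pipes; each result becomes the first argument of the next call
piped <- dfDemo %>%
  select(STATE, TOTALACRES) %>%
  filter(TOTALACRES >= 1000) %>%
  arrange(desc(TOTALACRES))

identical(stepwise, piped)
```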
Exercise 7: Challenge
The challenge step is optional, but it will give you a chance to reinforce what you've learned in this chapter. Create a new data frame that is a subset of the original dfFires data frame. The subset should contain all fires from the State of Idaho and the columns should be limited so that only the YEAR_, CAUSE, and TOTALACRES columns are present. Rename the columns if you wish. Group the data by CAUSE and YEAR_ and then summarize by total acres burned. Plot the results.
Conclusion
In this chapter you learned how to use the dplyr package to perform various data transformation functions. You learned how to limit columns with the select() function, filter a data frame based on one or more expressions, add columns with mutate(), and summarize and group data. Finally, you learned how to use piping to make your code more efficient.
In the next chapter you'll learn how to create tidy datasets with the tidyr package.
Chapter 5
Creating Tidy Data
Let's first describe what we mean by "tidy data", because the term doesn't necessarily fully describe the concept. Data tidying is a consistent way of organizing data in R and can be facilitated through the tidyr package found in the tidyverse ecosystem. There are three rules that we can follow to make a dataset tidy. First, each variable must have its own column. Second, each observation must have its own row, and finally, each value must have its own cell. This is illustrated by the diagram below.
There are two main advantages of having tidy data. One is more of a general advantage and the other is more specific. First, having a consistent, uniform data structure is very important. The other packages that are part of tidyverse, including dplyr and ggplot2, are designed to work with tidy data, so ensuring that your data is uniform facilitates the efficient processing of your data. In addition, placing variables into columns makes it easy to take advantage of vectorized operations in R.
Many datasets that you encounter will not be tidy and will require some work on your end. There can be many reasons why a dataset isn't tidy. Oftentimes the people who created the dataset aren't familiar with the principles of tidy data. Unless you are trained in the practice of creating tidy datasets or spend a lot of time working with data structures, these concepts aren't readily apparent. Another common reason that datasets aren't tidy is that data is often organized to facilitate something other than analysis. Data entry is perhaps the most common of the reasons that fall into this category. To make data entry as easy as possible, people will often arrange data in ways that aren't tidy. So, many datasets require some sort of tidying before you can begin your analysis.
The first step is to figure out what the variables and observations are for the dataset. This will facilitate your understanding of what the columns and rows should be. In addition, you will also need to resolve one or two common problems. You will need to figure out if one variable is spread across multiple columns, and you will need to figure out if one observation is scattered across multiple rows. The fixes for these two problems are known as gathering and spreading, respectively. We'll examine these concepts further in the exercises in this chapter.
In this chapter we'll cover the following topics:
• Gathering
• Spreading
• Separating
• Uniting
Exercise 1: Gathering
A common problem in many datasets is that the column names are not variables but rather values of a variable. In the figure below, the 1999 and 2000 columns are actually values of the variable YEAR. Each row in the existing table actually represents two observations. The tidyr package can be used to gather these existing columns into a new variable. In this case, we need to create a new column called YEAR and then gather the existing values in the 1999 and 2000 columns into the new YEAR column.
The gather() function from the tidyr package can be used to accomplish the gathering of data. Take a look at the line of code below to see how this function works. The data frame itself is omitted here; it is passed as the first argument or piped in.
gather(`1999`, `2000`, key = "year", value = "cases")
There are three parameters of the gather() function. The first is the set of columns that represent what should be values and not variables. These would be the 1999 and 2000 columns in the example we have been following. Next, you’ll need to name the variable of the new column. This is also called the key, and in this case it will be the year variable. Finally, you’ll need to provide the value, which is the name of the variable whose values are spread over the cells.
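The three parameters can be tried out on a small table before working with the exercise data. The sketch below is only an illustration: the data frame wide and its values are made up, and it assumes the tidyr package is installed. Newer tidyr releases also offer pivot_longer() for the same job, while gather() remains available.

```r
library(tidyr)

# Hypothetical wide table: one column per year, cells hold case counts
wide <- data.frame(
  country = c("A", "B"),
  `1999` = c(10, 20),
  `2000` = c(15, 25),
  check.names = FALSE,       # keep the year column names as written
  stringsAsFactors = FALSE
)

# Gather the two year columns into a key column (year) and a value column (cases)
long <- gather(wide, `1999`, `2000`, key = "year", value = "cases")
print(long)
```

Each original row becomes two rows, one per year, which matches the observation that every row in the wide table held two observations.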
In this exercise you’ll learn how to use the gather() function to resolve the types of problems we discussed in the introduction to this topic.
1. In the Data folder where you installed the exercise data for this book is a file called CountryPopulation.csv. Open this file, preferably in Microsoft Excel, or some other type of spreadsheet software. The file should look similar to the screenshot below. This spreadsheet includes a separate column for each year from 2010 through 2017. The columns for each year represent values, not variables. These columns need to be gathered into a new pair of variables that represent the Year and Population. In this exercise you’ll use the gather() function to accomplish this data tidying task.
2. Open RStudio and find the Console pane.
3. If necessary, set the working directory by typing the code you see below into the Console pane or by going to Session | Set Working Directory | Choose Directory from the RStudio menu.
setwd(<installation directory for exercise data>)
4. If necessary, load the readr and tidyr packages by clicking the checkboxes in the Packages pane or by including the following lines of code.
library(readr)
library(tidyr)
5. Load the CountryPopulation.csv file into RStudio by writing the code you see below in the Console pane.
dfPop = read_csv("CountryPopulation.csv", col_names = TRUE)
You should see the following output in the Console pane.
Parsed with column specification:
cols(
  `Country Name` = col_character(),
  `Country Code` = col_character(),
  `2010` = col_double(),
  `2011` = col_double(),
  `2012` = col_double(),
  `2013` = col_double(),
  `2014` = col_double(),
  `2015` = col_double(),
  `2016` = col_double(),
  `2017` = col_double()
)
6. Use the View() function to display the data in a tabular structure.
View(dfPop)
7. Use the gather() function as seen below.
dfPop2 = gather(dfPop, `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, key = "YEAR", value = "POPULATION")
8. View the output.
View(dfPop2)
9. You can check your work against the solution file Chapter5_1.R.
Exercise 2: Spreading
Spreading is the opposite of gathering and is used when an observation is spread across multiple rows. In the diagram below, table2 should define an observation of one country per year. However, you’ll notice that each observation is spread across two rows: one row for cases and another for population.
We can use the spread() function to fix this problem. Besides the data frame, the spread() function takes two parameters: the column that contains variable names, known as the key, and the column that contains values from multiple variables, known as the value.
spread(table2, key, value)
In this exercise you’ll learn how to use the spread() function to resolve the types of problems we discussed in the introduction to this topic.
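To see the key and value parameters at work, the sketch below spreads a tiny hand-made table. The data frame long and its numbers are hypothetical and only mimic the layout described above; the tidyr package is assumed to be installed.

```r
library(tidyr)

# Hypothetical long table: each country/year observation occupies two rows
long <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1999, 1999, 1999, 1999),
  type    = c("cases", "population", "cases", "population"),
  count   = c(10, 1000, 20, 2000),
  stringsAsFactors = FALSE
)

# 'type' holds the variable names (key); 'count' holds their values (value)
wide <- spread(long, key = type, value = count)
print(wide)
```

After spreading, each country and year pair occupies a single row, with cases and population in their own columns.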
1. For this exercise you’ll download some sample data that needs to be spread. Install the devtools package and DSR datasets using the code you see below by typing in the Console pane. Alternatively, you can use the Packages pane to install the packages.
install.packages("devtools")
devtools::install_github("garrettgman/DSR")
2. Load the DSR library by going to the Packages pane and clicking the checkbox next to DSR.
3. View table2. In this case, an observation is one country per year, but you’ll notice that each observation is actually spread into two rows.
View(table2)
4. Use the spread() function to correct this problem.
table2b = spread(table2, key = type, value = count)
5. View the results.
View(table2b)
6. You can check your work against the solution file Chapter5_2.R.
Exercise 3: Separating
Another common case involves two variables being placed into the same column. For example, the spreadsheet below has a State-CountyName column that actually contains two variables separated by a slash.
The separate() function can be used to split a column into multiple columns by splitting on a separator.
Here, the separate() function will split the values of the State-CountyName column into two variables: StateAbbrev and CountyName.
The separate() function accepts parameters for the name of the column to separate along with the names of the columns to separate into, and an optional separator. By default, separate() will look for any non-alphanumeric character to use as the separator, but you can also define a specific separator. You can see an example of how the separate() function works below.
separate(table3, rate, into = c("cases", "population"))
In this exercise you’ll learn how to use the separate() function to resolve the types of problems we discussed in the introduction to this topic.
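The sketch below shows separate() with an explicit separator on a made-up table (the data frame rates and its values are hypothetical; tidyr is assumed to be installed). Naming the separator with sep can be a safer choice than the default when the values themselves may contain spaces or punctuation, and convert = TRUE asks tidyr to turn the resulting text into numbers where possible.

```r
library(tidyr)

# Hypothetical table with two variables packed into one column
rates <- data.frame(
  country = c("A", "B"),
  rate    = c("10/1000", "20/2000"),
  stringsAsFactors = FALSE
)

# Split on the slash only, and convert the pieces to numeric columns
split_rates <- separate(rates, rate,
                        into = c("cases", "population"),
                        sep = "/", convert = TRUE)
print(split_rates)
```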
1. In the Data folder where you installed the exercise data for this book is a file called usco2005.csv. Open this file, preferably in Microsoft Excel, or some other type of spreadsheet software. The file should look similar to the screenshot below.
2. Load the usco2005.csv file into RStudio by writing the code you see below in the Console pane.
df = read_csv("usco2005.csv", col_names = TRUE)
3. View the imported data.
View(df)
4. Use the separate() function to separate the contents of the State-CountyName column into StateAbbrev and CountyName columns.
df2 = separate(df, "State-CountyName", into = c("StateAbbrev", "CountyName"))
5. View the results.
View(df2)
6. You can check your work against the solution file Chapter5_3.R.
Exercise 4: Uniting
The unite() function is the exact opposite of separate() in that it combines multiple columns into a single column. While it is not used nearly as often as separate(), there may be times when you need the functionality provided by unite(). In this exercise you’ll unite the data frame that was separated in the last exercise.
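As a quick illustration of how unite() assembles its output, the sketch below joins two made-up columns (the data frame parts and its values are hypothetical; tidyr is assumed to be installed). The first name after the data frame is the new column, followed by the columns to combine. By default unite() joins values with an underscore, so sep is set here to restore a slash.

```r
library(tidyr)

# Hypothetical table with the two pieces in separate columns
parts <- data.frame(
  StateAbbrev = c("TX", "CA"),
  CountyName  = c("Bexar", "Kern"),
  stringsAsFactors = FALSE
)

# New column name first, then the columns to combine
joined <- unite(parts, State_County_Name, StateAbbrev, CountyName, sep = "/")
print(joined)
```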
1. In the Console pane, add the code you see below to unite the StateAbbrev and CountyName columns back into a single column.
df3 = unite(df2, State_County_Name, StateAbbrev, CountyName)
2. View the result.
View(df3)
3. You can check your work against the solution file Chapter5_4.R.
Conclusion
In this chapter you were introduced to the tidyr package and its set of functions for creating tidy datasets. The next chapter will teach you the basics of data exploration using R and tidyverse.
Chapter 6
Basic Data Exploration Techniques in R
Exploratory Data Analysis (EDA) is a workflow designed to gain a better understanding of your data. The workflow consists of three steps. The first is to generate questions about your data. In this step you want to be as broad as possible because at this point you don’t really have a good feel for the data. Next, search for answers to these questions by visualizing, transforming, and modeling the data. Finally, refine your questions and/or generate new questions. In R there are two primary tools that support the data exploration process: plots and summary statistics.
Data can generally be divided into categorical or continuous types. Categorical variables consist of a small set of values, while continuous variables have a potentially infinite set of ordered values. Categorical variables are often visualized with bar charts, and continuous variables with histograms. Both categorical and continuous data can be represented through various charts created with R.
When performing basic visualization of variables, we tend to measure either variation or covariation. Variation is the tendency of the values of a variable to change from measurement to measurement. The variable being measured is the same though. This would include things like the total acres burned by a wildfire (continuous) or the number of crimes by police district (categorical). Covariation is the tendency of the values of two or more variables to vary together in a related way.
In this chapter we’ll cover the following topics:
• Measuring categorical variation with a bar chart
• Measuring continuous variation with a histogram
• Measuring covariation with box plots
• Measuring covariation with symbol size
• Creating 2D bins and hex charts
• Generating summary statistics
Exercise 1: Measuring Categorical Variation with a Bar Chart
A bar chart is a great way to visualize categorical data. It draws a separate bar for each category, and the height of each bar is defined by the number of occurrences in that category.
1. The exercises in this chapter require the following packages: readr, dplyr, ggplot2. They can be loaded from the Packages pane, the Console pane, or a script.
2. Open RStudio and find the Console pane.
3. If necessary, set the working directory by typing the code you see below into the Console pane or by going to Session | Set Working Directory | Choose Directory from the RStudio menu.
setwd(<installation directory for exercise data>)
4. Use the read_csv() function to load the dataset into a data frame.
dfFires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
5. For this analysis, we’ll filter the data so that only fires that burned 1,000 acres or more in the years 2010 through 2016 are represented. Add the code you see below to filter the data and send the results to a bar chart.
df <- filter(dfFires, TOTALACRES >= 1000, YEAR_ %in% c(2010, 2011, 2012, 2013, 2014, 2015, 2016))
ggplot(data = df) + geom_bar(mapping = aes(x = YEAR_))
This will produce a bar chart that appears as seen in the screenshot below.
6. Use the count() function to get the actual count for each category.
View(count(df, YEAR_))
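The relationship between the bar heights and count() can be checked on a toy table. In the sketch below the data frame fires and its values are hypothetical, and dplyr is assumed to be installed. Each value of n is the height that geom_bar() would draw for that year.

```r
library(dplyr)

# Hypothetical fire records: one row per fire
fires <- data.frame(YEAR_ = c(2010, 2010, 2011, 2012, 2012, 2012))

# Tally the rows in each category, as geom_bar() does by default
counts <- count(fires, YEAR_)
print(counts)
```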
Exercise 2: Measuring Continuous Variation with a Histogram
The distribution of a continuous variable can be measured with the use of a histogram. In this exercise you’ll create a histogram of wildfire acres burned.
1. On a new line, use the read_csv() function to load the StudyArea.csv file.
dfFires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
2. Pipe the data frame through select() to limit the columns and filter() to keep only fires of 1,000 acres or more, storing the result in df. Since we have a large number of wildfires that burned only a small number of acres, we’ll focus on fires that are a little larger in this case.
df <- dfFires %>%
  select(ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE) %>%
  filter(TOTALACRES >= 1000)
3. Create the histogram using ggplot() with geom_histogram() and a bin width of 500. The data is still clearly skewed toward the lower end of the number of acres burned. Pipe df into the plotting code you see below to produce the chart.
df %>%
  ggplot() + geom_histogram(mapping = aes(x = TOTALACRES), binwidth = 500)
4. You can also get an exact count of the number of fires that fell into each bin. From viewing the histogram and the count it’s obvious that the vast majority of fires are small.
df %>% count(cut_width(TOTALACRES, 500))
   `cut_width(TOTALACRES, 500)`     n
   <fct>                        <int>
 1 [750,1250]                     154
 2 (1250,1750]                    178
 3 (1750,2250]                    144
 4 (2250,2750]                     82
 5 (2750,3250]                     70
 6 (3250,3750]                     39
 7 (3750,4250]                     59
 8 (4250,4750]                     42
 9 (4750,5250]                     40
10 (5250,5750]                     37
5. Challenge: Recreate the histogram using a bin width of 5000. What is the effect on the output?
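The effect of the bin width can also be explored without drawing anything. The sketch below is only an illustration on a made-up vector (acres is hypothetical; ggplot2 is assumed to be installed for cut_width()). It tabulates the same values with a narrow and a wide bin width.

```r
library(ggplot2)

# Hypothetical acreage values, skewed toward the low end
acres <- c(1000, 1200, 1800, 2600, 5200, 9800, 14000)

# Tabulate the same values with two different bin widths
narrow_bins <- table(cut_width(acres, 500))
wide_bins   <- table(cut_width(acres, 5000))

print(narrow_bins)
print(wide_bins)
```

A wider bin produces fewer, taller bars and hides detail at the low end of the distribution; a narrower bin shows more detail but leaves many bins empty.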
Exercise 3: Measuring Covariation with Box Plots
Box plots provide a visual representation of the spread of data for a variable. These plots display the range of values for a variable along with the median and quartiles. Follow the instructions provided below to create a box plot that measures covariation between organization and total acreage burned.
1. Use the read_csv() function to load the dataset into a data frame.
dfFires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
2. Pipe the data frame and filter the rows so that only fires between 5,000 and 10,000 acres are included. Then, group the data by organization. The ORGANIZATI column in the dataset contains categorical data for the U.S. federal government agencies that have had land affected by wildfires. Finally, use ggplot() with geom_boxplot() to create a box plot showing the distribution of wildfires by organization.
dfFires %>%
  filter(TOTALACRES >= 5000 & TOTALACRES <= 10000) %>%
  group_by(ORGANIZATI) %>%
  ggplot(mapping = aes(x = ORGANIZATI, y = TOTALACRES)) + geom_boxplot()
The organization is listed on the X axis and the total acreage burned on the Y axis. The box contains a horizontal line that represents the median for the variable, and the box itself spans the Inter-Quartile Range (IQR), from the first quartile to the third quartile. The vertical lines that extend from either end of the box are known as the whiskers. In ggplot2 they reach to the most extreme values that lie within 1.5 times the IQR of the box, and any points beyond them are drawn individually as outliers. A taller box and longer whiskers indicate a wider spread of data.
3. Challenge: Create a new box plot that maps the covariation of CAUSE and TOTALACRES.
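The quantities a box plot displays can be computed directly with base R. The sketch below uses a made-up vector (acres is hypothetical) and follows the common convention, also used by geom_boxplot(), of ending each whisker at the most extreme value within 1.5 times the IQR of the box.

```r
# Hypothetical acreage values for a single organization
acres <- c(5000, 5200, 5600, 6100, 6800, 7400, 8100, 9900)

# Box: first quartile, median, third quartile
q   <- quantile(acres, c(0.25, 0.5, 0.75))
iqr <- IQR(acres)

# Whiskers: most extreme observations within 1.5 * IQR of the box
upper_limit  <- q[["75%"]] + 1.5 * iqr
lower_limit  <- q[["25%"]] - 1.5 * iqr
whisker_high <- max(acres[acres <= upper_limit])
whisker_low  <- min(acres[acres >= lower_limit])

print(q)
print(c(low = whisker_low, high = whisker_high))
```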
Exercise 4: Measuring Covariation with Symbol Size
The geom_count() function can be used with ggplot() to measure covariation between variables using different symbol sizes. Follow the instructions provided below to measure the covariation between organization and wildfire cause using symbol size.
1. Use the read_csv() function to load the dataset into a data frame.
dfFires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
2. Pipe the data frame and filter the rows so that only wildfires that originated due to Natural or Human causes are included. This will remove any records that are Unknown or have missing values. Then, use geom_count() to create a graduated symbol chart based on the number of fires by organization and cause.
dfFires %>%
  filter(CAUSE == "Natural" | CAUSE == "Human") %>%
  group_by(ORGANIZATI) %>%
  ggplot() + geom_count(mapping = aes(x = ORGANIZATI, y = CAUSE))
3. You can also get an exact count of the number of fires by organization and cause. The output will look similar to what you see below.
dfFires %>% count(ORGANIZATI, CAUSE)
   ORGANIZATI CAUSE            n
   <chr>      <chr>        <int>
 1 BIA        Human           49
 2 BIA        Natural         91
 3 BLM        Human          187
 4 BLM        Natural        386
 5 FS         Human          158
 6 FS         Natural        431
 7 FWS        Human           10
 8 FWS        Natural          7
 9 FWS        Undetermined     6
10 NPS        Human            6
11 NPS        Natural         46
Exercise 5: 2D Bin and Hex Charts
You can also use 2D bin and hex charts as an alternative way of viewing the distribution of two variables. Follow the instructions provided below to create 2D bin and hex charts that visualize the relationship between the year and total acreage burned.
1. Use the read_csv() function to load the dataset into a data frame.
dfFires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
2. Create a 2D bin map with YEAR_ on the X axis and TOTALACRES on the Y axis.
ggplot(data = dfFires) + geom_bin2d(mapping = aes(x = YEAR_, y = TOTALACRES))
3. Create a 2D hex map with YEAR_ on the X axis and TOTALACRES on the Y axis. The geom_hex() function requires the hexbin package to be installed.
ggplot(data = dfFires) + geom_hex(mapping = aes(x = YEAR_, y = TOTALACRES))
Exercise 6: Generating Summary Statistics
Another basic technique for performing exploratory data analysis is to generate various summary statistics on a dataset. R includes a number of individual functions for generating specific summary statistics, or you can use the summary() function to generate a set of summary statistics.
1. Reload the StudyArea.csv file into a data frame.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
2. Restrict the list of columns.
df <- select(df, ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE)
3. Filter the list to include only wildfires of 1,000 acres or more.
df <- filter(df, TOTALACRES >= 1000)
4. Call the mean() function, passing in a reference to the data frame and the TOTALACRES column.
mean(df$TOTALACRES)
[1] 10813.06
5. Call the median() function.
median(df$TOTALACRES)
[1] 3240
6. Instead of calling the individual summary statistics functions you can simply use the summary() function to return a list of summary statistics.
summary(df$TOTALACRES)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1000    1670    3240   10813    8282  590620
7. You can check your work against the solution file Chapter6_6.R.
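The gap between the mean and the median in the output above is itself informative: a mean well above the median suggests a few very large values are pulling the average up. The sketch below reproduces that pattern on a small made-up vector (acres is hypothetical) using only base R.

```r
# Hypothetical acreage values with one very large fire
acres <- c(1000, 1670, 3240, 8282, 590620)

acres_mean   <- mean(acres)
acres_median <- median(acres)

# A mean far above the median points to a right-skewed distribution
print(c(mean = acres_mean, median = acres_median))
print(summary(acres))
```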
Conclusion
In this chapter you learned some basic data exploration techniques using R. You learned how to measure categorical and continuous variation with bar charts and histograms, and covariation with box plots and different symbol sizes. Finally, you learned how to generate summary statistics and create 2D bins and hex charts.
In the next chapter you’ll learn how to visualize data using the ggplot2 package.
Chapter 7
Basic Data Visualization Techniques
The ggplot2 package is a library that enables the creation of many types of data visualization including various types of charts and graphs. This library was first created by Hadley Wickham in 2005 and is an R implementation of Leland Wilkinson’s Grammar of Graphics. The idea behind this package is to specify plot building blocks and then combine them to create a graphical display. Building blocks of ggplot2 include data, aesthetic mapping, geometric objects, statistical transformations, scales, coordinate systems, position adjustments, and faceting.
There are a number of advantages to using ggplot2 versus other visualization techniques available in R. These advantages include a consistent style for defining the graphics, a high level of abstraction for specifying plots, flexibility, a built-in theming system for plot appearance, a mature and complete graphics system, and access to many other ggplot2 users for support.
In this chapter we’ll cover the following topics:
• Creating a scatterplot
• Adding a regression line to a scatterplot
• Plotting categories
• Labeling the graph
• Legend layouts
• Creating a facet
• Theming
• Creating bar charts
• Creating violin plots
• Creating density plots
Step 1: Creating a scatterplot
A scatterplot is a graph in which the values of two variables are plotted along two axes, with the pattern of the resulting points revealing any correlation present.
1. The exercises in this chapter require the following packages: readr, dplyr, ggplot2. They can be loaded from the Packages pane, the Console pane, or a script.
2. Open RStudio and find the Console pane.
3. If necessary, set the working directory by typing the code you see below into the Console pane or by going to Session | Set Working Directory | Choose Directory from the RStudio menu.
setwd(<installation directory for exercise data>)
4. Load the contents of the StudyArea.csv file into a data frame.
dfWildfires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
5. Create a subset of columns.
df <- select(dfWildfires, ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE)
6. Group the records by year.
grp <- group_by(df, YEAR_)
7. Summarize the data by total number of acres burned.
sm <- summarize(grp, totalacres = sum(TOTALACRES))
8. Use ggplot() to create a scatterplot with the year on the x axis and the total acres burned on the y axis.
ggplot(data = sm) + geom_point(mapping = aes(x = YEAR_, y = totalacres))
9. There are times when it makes sense to use a logarithmic scale in charts and graphs. One reason is to respond to skewness towards large values, i.e., cases in which one or a few points are much larger than the bulk of the data. In the graph that we just created there are a couple of points that fall into this category on the y axis. Create the graph again, but this time use the log() function on the totalacres column.
ggplot(data = sm) + geom_point(mapping = aes(x = YEAR_, y = log(totalacres)))
10. You can check your work against the solution file Chapter7_1.R.
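To get a feel for what the log() transformation does to skewed values, the sketch below compares the spread of a made-up vector before and after taking logs (totals is hypothetical; only base R is used). Note that log() in R is the natural logarithm. ggplot2 also provides scale_y_log10(), which changes the axis scale rather than the data and may be worth trying as an alternative.

```r
# Hypothetical yearly totals with one extreme year
totals <- c(12000, 15000, 18000, 22000, 950000)

# Natural log of each value
logged <- log(totals)

# Ratio of largest to smallest value, before and after the transform
ratio_raw <- max(totals) / min(totals)
ratio_log <- max(logged) / min(logged)

print(c(raw = ratio_raw, logged = ratio_log))
```

The extreme year dominates the raw scale but sits much closer to the other points on the log scale, which is why the remaining points become easier to compare.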
Step 2: Adding a regression line to the scatterplot
Plots constructed with ggplot() can have more than one geometry. It’s common to add a prediction (regression) line to the plot.
1. There are several ways that you can add a regression line to the scatterplot, one of which is to use the geom_smooth() function with the method set to lm (straight line) and the se parameter set to FALSE. Add the line of code you see below to the Console pane.
ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) + geom_point() + geom_smooth(method = lm, se = FALSE)
2. Change the method to loess to see the effect on the regression line.
ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) + geom_point() + geom_smooth(method = loess, se = FALSE)
3. You can add a confidence interval around the regression line by setting se = TRUE.
ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) + geom_point() + geom_smooth(method = loess, se = TRUE)
4. You can check your work against the solution file Chapter7_2.R.
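The straight line drawn with method = lm corresponds to an ordinary linear model of y on x. As a rough illustration, the sketch below fits such a model directly with base R on made-up data (years and acres are hypothetical and perfectly linear so that the result is easy to verify).

```r
# Hypothetical data: acres increase by exactly 5 per year
years <- 2000:2009
acres <- 100 + 5 * (years - 2000)

# Fit the straight-line model acres = intercept + slope * years
fit   <- lm(acres ~ years)
slope <- unname(coef(fit)[2])

print(coef(fit))
```

The slope is the change in the y variable per unit of x, which is what the tilt of the regression line on the scatterplot represents.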
Step 3: Plotting categories
Rather than graphing the entire set of wildfires you might want to better understand the trends by state. In this step you’ll create a new scatterplot that visualizes wildfire trends over time by state.
1. Regroup the wildfires data frame by state and year.
grp <- group_by(df, STATE, YEAR_)
2. Summarize the groups by total acres burned.
sm <- summarize(grp, totalacres = sum(TOTALACRES))
3. Add a colour parameter to the aes() function so that the points and regression line are mapped according to the state in which they occurred.
ggplot(data = sm, aes(x = YEAR_, y = totalacres, colour = STATE)) + geom_point(aes(colour = STATE)) + stat_smooth(method = lm, se = FALSE)
4. You can check your work against the solution file Chapter7_3.R.
Step 4: Labeling the graph
You can add labels to your graph through either the geom_text() function or the geom_label() function.
1. Label each of the points on the scatterplot using geom_text() with a label size of 3.
ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) + geom_point() + geom_smooth(method = loess, se = TRUE) + geom_text(aes(label = STATE), size = 3)
Now this obviously doesn’t work very well. The display is extremely cluttered, so let’s adjust a few parameters to make this easier to read.
2. You can use the check_overlap parameter to remove any overlapping labels. Update your code as seen below.
ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) + geom_point() + geom_smooth(method = loess, se = TRUE) + geom_text(aes(label = STATE), size = 3, check_overlap = TRUE)
3. This looks quite a bit better, but if you change the label size to 2 it will further reduce the clutter and overlapping while hopefully still being readable.
4. You may have noticed that the labels sit directly on top of the points. You can use the nudge_x and nudge_y parameters to move the labels relative to the point. Use nudge_x as seen below to see how this moves the labels horizontally.
ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) + geom_point() + geom_smooth(method = loess, se = TRUE) + geom_text(aes(label = STATE), size = 2, check_overlap = TRUE, nudge_x = 1.0)
5. You can also color the labels by category by adding the color parameter to the aes() for geom_text().
ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) + geom_point() + geom_smooth(method = loess, se = TRUE) + geom_text(aes(label = STATE, color = STATE), size = 2, check_overlap = TRUE, nudge_x = 1.0)
6. You can also add a title, subtitle, and caption with the code you see below.
ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) + geom_point() + geom_smooth(method = loess, se = TRUE) + labs(title = paste("Acreage Burned by Wildfires Has Increased In the Past Few Decades"), subtitle = paste("1980-2016"), caption = "Data from USGS")
7. You can also update the X and Y labels for the graph. Update these labels on your graph using the code you see below.
ggplot(data = sm, aes(x = YEAR_, y = log(totalacres))) + geom_point() + geom_smooth(method = loess, se = TRUE) + labs(title = paste("Acreage Burned by Wildfires Has Increased In the Past Few Decades"), subtitle = paste("1980-2016"), caption = "Data from USGS") + scale_y_continuous(name = "Log of Total Acres Burned") + scale_x_continuous(name = "Burn Year")
8. You can check your work against the solution file Chapter7_4.R.
Step 5: Legend layouts
The theme() function can be used to control the location of the legend and the guides() function can be used to provide additional legend control.
1. The theme() function along with the legend.position argument is used to control the location of the legend on the graph. By default, the legend we’ve seen so far has been placed on the right side of the graph with a vertical orientation. Reposition the legend to the bottom with the code below.
ggplot(data = sm, aes(x = YEAR_, y = log(totalacres), color = STATE)) + geom_point() + labs(title = paste("Acreage Burned by Wildfires Has Increased In the Past Few Decades"), subtitle = paste("1980-2016"), caption = "Data from USGS") + scale_y_continuous(name = "Log of Total Acres Burned") + scale_x_continuous(name = "Burn Year") + theme(legend.position = "bottom")
2. You can also explicitly remove a legend by setting legend.position = "none". Try that now if you’d like.
3. Other aspects of the legend, such as the number of rows in the legend as well as the symbol size, can be controlled through the guides() function. Use the code you see below to update the legend to be two rows and with each symbol set to size 4.
ggplot(data = sm, aes(x = YEAR_, y = log(totalacres), color = STATE)) +
  geom_point() +
  labs(title = paste("Acreage Burned by Wildfires Has Increased In the Past Few Decades"), subtitle = paste("1980-2016"), caption = "Data from USGS") +
  scale_y_continuous(name = "Log of Total Acres Burned") +
  scale_x_continuous(name = "Burn Year") +
  theme(legend.position = "bottom") +
  guides(color = guide_legend(nrow = 2, override.aes = list(size = 4)))
4. You can check your work against the solution file Chapter7_5.R.
Step 6: Creating a facet
A particularly good way of graphing categorical variables is to split your plot into facets, which are subplots that each display one subset of the data. The facet_wrap() and facet_grid() functions can be used to create facets.
1. Use the facet_wrap() function displayed in the code below to create a facet map that displays total acres burned by state.
ggplot(data = sm, mapping = aes(x = YEAR_, y = log(totalacres))) + geom_point() + facet_wrap(~STATE) + geom_smooth(method = loess, se = TRUE)
2. You can check your work against the solution file Chapter7_6.R.
Step 7: Theming
The ggplot2 package includes eight built-in themes that can be used to customize the styling of the non-data elements of your plot.
1. The eight themes included in ggplot2 are theme_bw, theme_classic, theme_dark, theme_gray, theme_light, theme_linedraw, theme_minimal, and theme_void. Add the code you see below to change the facet to theme_dark.
ggplot(data = sm, mapping = aes(x = YEAR_, y = log(totalacres))) + geom_point() + facet_wrap(~STATE) + geom_smooth(method = loess, se = TRUE) + theme_dark()
2. Experiment with the themes to see the differences in styling.
3. You can check your work against the solution file Chapter7_7.R.
Step 8: Creating bar charts
You can use geom_bar() or geom_col() to create bar charts with ggplot2. However, there is a significant difference between the two. The geom_bar() function counts the number of rows for each value of the x variable, so the height of each bar is a count rather than a value taken from your data. The geom_col() function uses a variable that is already in the data as the height of the bar. To see the difference, complete the following steps.
1. Load the StudyArea.csv file and get a subset of columns.
dfWildfires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df <- select(dfWildfires, ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE)
2. Filter the data frame so that only wildfires for California are included.
df <- filter(df, STATE == "California")
3. Group the data frame by YEAR_.
grp <- group_by(df, YEAR_)
4. Plot the data using geom_bar() as seen below. Notice that the bar chart that is produced is a count of the number of fires for each year.
ggplot(data = grp) + geom_bar(mapping = aes(x = YEAR_), fill = "red")
5. Now use geom_col() to see the difference. The TOTALACRES variable is maintained in this case.
ggplot(data = grp) + geom_col(mapping = aes(x = YEAR_, y = TOTALACRES), fill = "red")
6. You can check your work against the solution file Chapter7_8.R.
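The two bar heights can be reproduced numerically. The sketch below uses a made-up table (fires and its values are hypothetical; dplyr is assumed to be installed). The first result corresponds to what geom_bar() draws, a row count per year. The second corresponds to geom_col(): because rows that share a year are stacked by default, the visible bar height is the sum of TOTALACRES for that year.

```r
library(dplyr)

# Hypothetical fire records: two fires in 2015, one in 2016
fires <- data.frame(
  YEAR_      = c(2015, 2015, 2016),
  TOTALACRES = c(1000, 3000, 2500)
)

# Heights drawn by geom_bar(): number of rows per year
bar_heights <- count(fires, YEAR_)

# Heights drawn by geom_col() after stacking: total acres per year
col_heights <- fires %>%
  group_by(YEAR_) %>%
  summarize(totalacres = sum(TOTALACRES))

print(bar_heights)
print(col_heights)
```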
Step 9: Creating Violin Plots
Violin plots, which are similar to box plots, also show the probability density at various values. Thicker areas of the violin plot indicate a higher probability at that value. Typically, violin plots also include a marker for the median along with the Inter-Quartile Range (IQR). The geom_violin() function is used to create violin plots in ggplot2.
1. Load the StudyArea.csv file and get a subset of columns.
dfWildfires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df <- select(dfWildfires, ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE)
2. Filter the data frame so that only wildfires of 5,000 acres or more are included.
dfWildfires <- filter(dfWildfires, TOTALACRES >= 5000)
3. Group the wildfires by organization.
grpWildfires <- group_by(dfWildfires, ORGANIZATI)
4. Create a basic violin plot.
ggplot(data = grpWildfires, mapping = aes(x = ORGANIZATI, y = log(TOTALACRES))) + geom_violin()
5. You can add the individual observations using geom_jitter().
ggplot(data = grpWildfires, mapping = aes(x = ORGANIZATI, y = log(TOTALACRES))) + geom_violin() + geom_jitter(height = 0, width = 0.1)
6. The mean can be added using stat_summary() as seen below.
ggplot(data = grpWildfires, mapping = aes(x = ORGANIZATI, y = log(TOTALACRES))) + geom_violin() + geom_jitter(height = 0, width = 0.1) + stat_summary(fun.y = mean, geom = "point", size = 2, color = "red")
7. The geom_boxplot() function can be used to add the median and IQR.
ggplot(data = grpWildfires, mapping = aes(x = ORGANIZATI, y = log(TOTALACRES))) + geom_violin() + geom_boxplot(width = 0.1)
8. You can check your work against the solution file Chapter7_9.R.
Step 10: Creating density plots
Density plots, created with geom_density(), compute a density estimate, which is a smoothed version of a histogram and is used with continuous data. ggplot2 can also compute 2D versions of density, including contours and polygon-styled density plots.
1. In this first portion of the exercise you’ll create a basic density plot. Load the StudyArea.csv file and get a subset of columns.
dfWildfires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df <- select(dfWildfires, ORGANIZATI, STATE, YEAR_, TOTALACRES, CAUSE)
2. Filter the data frame so that only wildfires of 1,000 acres or more are included.
dfWildfires <- filter(dfWildfires, TOTALACRES >= 1000)
3. Create a density plot with the geom_density() function.
ggplot(dfWildfires, aes(TOTALACRES)) + geom_density()
4. You may also want to create the same density plot with a logged version of the data.
ggplot(dfWildfires, aes(log(TOTALACRES))) + geom_density()
5. Next, you’ll create 2D plots of the data starting with contours. Add the code you see below.
ggplot(dfWildfires, aes(x = YEAR_, y = log(TOTALACRES))) + geom_point() + geom_density_2d()
6. Finally, create a 2D density surface using stat_density_2d().
ggplot(dfWildfires, aes(x = YEAR_, y = log(TOTALACRES))) + geom_density_2d() + stat_density_2d(geom = "raster", aes(fill = ..density..), contour = FALSE)
7. You can check your work against the solution file Chapter7_10.R.
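A density estimate can also be computed outside of a plot with the base R density() function, which performs the same kind of kernel smoothing. The sketch below is an illustration on simulated values (acres is randomly generated, not taken from the exercise data). It also checks a defining property of a density curve: the area under it is approximately 1.

```r
# Simulated, right-skewed acreage values (hypothetical data)
set.seed(42)
acres <- exp(rnorm(200, mean = 8, sd = 1))

# Kernel density estimate of the logged values
d <- density(log(acres))

# Approximate the area under the curve on the evenly spaced grid
area <- sum(d$y) * diff(d$x[1:2])
print(area)
```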
Conclusion
In this chapter you learned various data visualization techniques using ggplot2. We started with basic scatterplots, added regression lines, labeled the graphs in various ways, and created a legend. In addition, you learned how to create facet plots and work with ggplot2’s built-in theming options. You also learned how to create bar charts, violin plots, and density plots.
In the next chapter you will learn how to create maps using the ggmap package.
Chapter 8
Visualizing Geographic Data with ggmap
The ggmap package enables the visualization of spatial data and spatial statistics in a map format using the layered approach of ggplot2. This package also includes basemaps that give your visualizations context, including Google Maps, OpenStreetMap, Stamen Maps, and CloudMade maps. In addition, utility functions are provided for accessing various Google services including Geocoding, Distance Matrix, and Directions.
The ggmap package is based on ggplot2, which means it takes a layered approach and consists of the same five components found in ggplot2. These include a default dataset with aesthetic mappings where x is longitude, y is latitude, and the coordinate system is fixed to Mercator. Other components include one or more layers defined with a geometric object and statistical transformation, a scale for each aesthetic mapping, a coordinate system, and a facet specification. Because ggmap is built on ggplot2, it has access to the full range of ggplot2 functionality that you learned about in the previous chapter.
In this chapter we’ll cover the following topics:
• Creating a basemap
• Adding operational layers
• Adding layers from a shapefile
Exercise 1: Creating a basemap
There are two basic steps to create a map with ggmap. The details are more complex than these two steps might imply, but in general you just need to download the map raster (basemap) and then plot operational data on the basemap. The first step is accomplished using the get_map() function, which can be used to create a basemap from Google, Stamen, OpenStreetMap, or CloudMade. You’ll learn how to do that in this exercise. In the next exercise you’ll learn how to add and style operational data in various ways.
1. Open RStudio and find the Console pane.
2. If necessary, set the working directory by typing the code you see below into the Console pane or by going to Session | Set Working Directory | Choose Directory from the RStudio menu.
setwd(<installation directory for exercise data>)
3. Load the ggmap package by going to the Packages pane in RStudio and clicking on the checkbox next to the package name. Alternatively, you can load it from the Console by typing:
library(ggmap)
4. Create a variable called myLocation and set it to California.
myLocation <- "California"
5. Call the get_map() function and pass in the location variable along with a zoom level of 6.
myMap <- get_map(location = myLocation, zoom = 6)
6. In RStudio you should see some return messages that look similar to what you see below. It isn’t uncommon to get an error message when calling the get_map() function from RStudio. If this happens, simply re-execute the code until you get something that is similar to what you see below.
Map from URL: http://maps.googleapis.com/maps/api/staticmap?center=California&zoom=6&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
Information from URL: http://maps.googleapis.com/maps/api/geocode/json?address=California&sensor=false
7. Call the ggmap() function, passing in the myMap variable. The Plots pane should display the map as seen below. The default map type is Google Maps with a style of Terrain.
ggmap(myMap)
The Google source includes a number of map types including those you see in the screenshot below.
8. Add and execute the code you see below to add a Google satellite map.
myMap <- get_map(location = myLocation, zoom = 6, source = "google", maptype = "satellite")
ggmap(myMap)
9. There are a number of ways that you can define the input location: a longitude/latitude coordinate pair, a character string, or a bounding box. The character string tends to be a more practical solution in many situations since you can simply pass in the name of the location. For example, you could define the location as Houston Texas or The White House or The Grand Canyon. When a character string is passed to the location parameter it is then passed to the geocoding service to obtain the latitude/longitude coordinate pair. Add the code you see below to see how passing in a character string works.
myMap <- get_map(location = "Grand Canyon, Arizona", zoom = 11)
ggmap(myMap)
The zoom level can be set between 3 and 21, with 3 representing a continent level view and 21 representing a building level view. Take some time to experiment with the zoom level to see the effect of various settings.
10. You can check your work against the solution file Chapter8_1.R.
Exercise 2: Adding operational data layers
ggmap() returns a ggplot object, meaning that it acts as a base layer in the ggplot2 framework. This allows for the full range of ggplot2 capabilities, meaning that you can plot points on the map, add contours and 2D heat maps, and more. We’ll examine some of these capabilities in this section.
1. Initially we’ll just load the wildfire events as points. Add the code you see below to produce a map of California that displays wildfires from the years 1980-2016 that burned 1,000 acres or more.
myLocation <- "California"
# get the basemap layer
myMap <- get_map(location = myLocation, zoom = 6)
# read in the wildfire data to a data frame (tibble)
dfWildfires <- read_csv("StudyArea_SmallFile.csv", col_names = TRUE)
# select specific columns of information
df <- select(dfWildfires, STATE, YEAR_, TOTALACRES, DLATITUDE, DLONGITUDE)
# filter the data frame so that only fires of 1,000 acres or more that burned in California are present
df <- filter(df, TOTALACRES >= 1000 & STATE == "California")
# use geom_point() to display the points. The x and y properties of the aes() function are used to define the geometry
ggmap(myMap) + geom_point(data = df, aes(x = DLONGITUDE, y = DLATITUDE))
2. Now let’s do something a little more interesting. First, use the dplyr function mutate() to group the fires by decade.
df <- mutate(df, DECADE = ifelse(YEAR_ %in% 1980:1989, "1980-1989", ifelse(YEAR_ %in% 1990:1999, "1990-1999", ifelse(YEAR_ %in% 2000:2009, "2000-2009", ifelse(YEAR_ %in% 2010:2016, "2010-2016", "-99")))))
3. Next, color code the wildfires by DECADE and create a graduated symbol map based on the size of each fire. The colour property defines the column to use for grouping, and the size property defines the column to use for the size of each symbol.
ggmap(myMap) + geom_point(data = df, aes(x = DLONGITUDE, y = DLATITUDE, colour = DECADE, size = TOTALACRES))
This should produce a map that appears as seen in the screenshot below.
4. Let’s change the map view to focus more on southern California, and in particular the area just north of Los Angeles.
myMap <- get_map(location = "Santa Clarita, California", zoom = 10)
ggmap(myMap) + geom_point(data = df, aes(x = DLONGITUDE, y = DLATITUDE, colour = DECADE, size = TOTALACRES))
5. Next, we’ll add contour and heat layers. The geom_density2d() function is used to create the contours while the stat_density2d() function creates the heat map. Add the following code to produce the map you see below. You can experiment with the colors using the low and high properties of scale_fill_gradient(). Here we’ve set them to green and red respectively, but you may want to change the color scheme.
myMap <- get_map(location = "California", zoom = 6)
ggmap(myMap, extent = "device") +
  geom_density2d(data = df, aes(x = DLONGITUDE, y = DLATITUDE), size = 0.3) +
  stat_density2d(data = df, aes(x = DLONGITUDE, y = DLATITUDE, fill = ..level.., alpha = ..level..), size = 0.01, bins = 16, geom = "polygon") +
  scale_fill_gradient(low = "green", high = "red") +
  scale_alpha(range = c(0, 0.3), guide = FALSE)
6. If you’d prefer to see the heat map without contours, the code can be simplified as follows:
ggmap(myMap, extent = "device") +
  stat_density2d(data = df, aes(x = DLONGITUDE, y = DLATITUDE, fill = ..level.., alpha = ..level..), size = 0.01, bins = 16, geom = "polygon") +
  scale_fill_gradient(low = "green", high = "red") +
  scale_alpha(range = c(0, 0.3), guide = FALSE)
7. Finally, let’s create a facet map that depicts hot spots for each year in the current decade. Add the following code to see how this works. The dataset contains information up through the year 2016.
df <- filter(df, YEAR_ %in% c(2010, 2011, 2012, 2013, 2014, 2015, 2016))
ggmap(myMap, extent = "device") +
  stat_density2d(data = df, aes(x = DLONGITUDE, y = DLATITUDE, fill = ..level.., alpha = ..level..), size = 0.01, bins = 16, geom = "polygon") +
  scale_fill_gradient(low = "green", high = "red") +
  scale_alpha(range = c(0, 0.3), guide = FALSE) +
  facet_wrap(~YEAR_)
8. You can check your work against the solution file Chapter8_2.R.
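The nested ifelse() calls used in step 2 to assign each fire to a decade can be hard to read. One possible alternative, shown here as a sketch rather than as the book's solution, is dplyr's case_when(), which lists each condition on its own line. The data frame below is a made-up stand-in containing only a YEAR_ column.

```r
library(dplyr)

# Toy data frame with only the column needed here; the years are made up
fires <- tibble(YEAR_ = c(1985, 1994, 2003, 2012, 2016, 1979))

# Conditions are evaluated top to bottom; the first match wins
fires <- mutate(fires, DECADE = case_when(
  YEAR_ %in% 1980:1989 ~ "1980-1989",
  YEAR_ %in% 1990:1999 ~ "1990-1999",
  YEAR_ %in% 2000:2009 ~ "2000-2009",
  YEAR_ %in% 2010:2016 ~ "2010-2016",
  TRUE                 ~ "-99"      # anything outside 1980-2016
))

fires$DECADE
```

The resulting DECADE column can be mapped to the colour aesthetic in exactly the same way as the ifelse() version.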
Exercise 3: Adding Layers from Shapefiles
While they are somewhat of an older GIS data format, shapefiles are still commonly used to represent geographic features. With a little bit of manipulation, you can plot data from shapefiles on top of a ggmap basemap.
1. For this exercise you’ll need to install an additional package called rgdal. Use the Packages pane to find and install rgdal or enter the code you see below.
install.packages("rgdal")
2. Load the rgdal package through the Packages pane or enter the code you see below.
library(rgdal)
3. The Data folder that contains the exercise data for this book contains a shapefile called S_USA.Wilderness. You’ll actually see a number of files with this name, but a different file extension. These files combine to create what is called a shapefile. This file contains the boundaries of designated wilderness areas in the United States. Use the readOGR() function from rgdal to load the data into a variable.
wild = readOGR('.', 'S_USA.Wilderness')
4. The fortify() function, which is part of ggplot2, converts all the individual points that define each boundary into a data frame that can then be used to plot the polygon boundaries.
wild <- fortify(wild)
5. Use the ggmap qmap() function (qmap means quick map) to create the basemap that will be used as the reference for the wilderness boundaries. Center the map on Montana.
montana <- qmap("Montana", zoom = 6)
6. Before plotting the wilderness boundaries as polygons on the map, take a look at the data frame that was created by the fortify() function so you’ll have a better understanding of the structure created by this function.
View(wild)
Take a look at the group column. This column uniquely identifies each wilderness boundary. The wilderness boundaries are polygons, and polygons are defined by a set of points which define the structure of the polygon. It’s sort of like playing connect the dots, where each dot is a latitude/longitude coordinate pair defined by the long and lat columns in the data frame.
For example, take a look at group 0.1. Notice that there are multiple rows that contain the value 0.1, and that each row has unique long and lat values. These are all the points used to define the boundaries of that polygon.
7. Now plot the wilderness boundaries on the basemap. Notice the use of the group column for grouping the polygons. It does take some time to plot the boundaries on the map so be patient with this step. Eventually you should see a map similar to the screenshot below.
montana + geom_polygon(aes(x = long, y = lat, group = group, alpha = 0.25), data = wild, fill = 'white') +
  geom_polygon(aes(x = long, y = lat, group = group), data = wild, color = 'black', fill = NA)
8. Optional – Use the color, fill, and alpha (used to define transparency) parameters to change the symbology to different colors and styles.
9. You can check your work against the solution file Chapter8_3.R.
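The role of the group column described in step 6 is easier to see on a tiny example. The sketch below builds a hand-made data frame shaped like the output of fortify() (long, lat, and group columns) with two invented square polygons. It is not derived from the S_USA.Wilderness shapefile, and the name wild_toy is hypothetical.

```r
library(ggplot2)

# Hand-made stand-in for fortify() output: two square polygons,
# each identified by its own value in the group column
wild_toy <- data.frame(
  long  = c(-114, -113, -113, -114,   -111, -110, -110, -111),
  lat   = c(  47,   47,   48,   48,     45,   45,   46,   46),
  group = rep(c("0.1", "1.1"), each = 4)
)

# Number of vertices that define each polygon
table(wild_toy$group)

# group = group keeps the two outlines separate; without it the
# points would be joined into a single shape
p <- ggplot(wild_toy, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", colour = "black", alpha = 0.25)
p
```

Dropping group = group from aes() and re-running the plot shows the "connect the dots" effect going wrong, with a line drawn between the two squares.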
Conclusion
In this chapter you learned how to use the ggmap package to create compelling data visualizations in map format. You learned how to create basemaps using Google as a data source, add operational data layers, create various types of map visualizations using external data sources, and load shapefiles.
In the next chapter you will learn how to use R Markdown to share your work with others.
Chapter 9
R Markdown
R Markdown is an authoring framework for data science that combines code, results, and commentary. Output formats include PDF, Word, HTML, slideshows, and more. An R Markdown document essentially serves three purposes: communication, collaboration, and as a modern-day lab environment that captures not only what you did, but also what you were thinking. From a communication perspective it enables decision makers to focus more on the results of your analysis rather than the code. However, because it enables you to also include the code, it functions as a means of collaboration between data scientists.
R Markdown uses the rmarkdown package, but you don’t have to explicitly load the package in RStudio. RStudio will automatically load the package as needed. An R Markdown file is a plain text file with an extension of .Rmd. These files contain a mixture of three types of content: a YAML header, R code, and text mixed with simple text formatting.
When opened in RStudio, the R Markdown file displays both the code and the output of the code. Using the RStudio interface you can run sections of the code or all the code in the file. You can see an example of this in the screenshot below. Notice that the code is enclosed by three back-ticks, with the output of the code shown below it.
If you want to export the contents to a specific file type you can use the Knit functionality embedded in RStudio to export to HTML, PDF, and Word formats. This will export a complete file containing text, code, and results.
In this chapter we’ll cover the following topics:
• Creating an R Markdown file
• Adding code chunks and text to an R Markdown file
• Code chunk and header options
• Caching
• Using Knit to output an R Markdown file
Exercise 1: Creating an R Markdown file
An R Markdown file is simply a plain text file with a file extension of .Rmd. You can use RStudio to create new markdown files, which is what you’ll do in this brief exercise.
1. The exercises in this chapter require the following packages: readr, dplyr, ggplot2, and ggmap. They can be loaded from the Packages pane, the Console pane, or a script.
2. Open RStudio and go to File | New File | R Markdown. This will display the dialog you see below. There are different types of markdown that can be created, but for this exercise we’ll keep it simple and create a document.
3. Select Document (which is the default), give it a title of Creating Maps with R, change the author name if you’d like, and select PDF as the output.
4. This will create a file with some header information, text, and code. Your file should look similar to the screenshot below.
5. At the very top of the file is the header information, which is surrounded by dashes. We’ll add some content to this section in a later exercise, but for now we’ll leave it as is.
6. Code sections are grouped through the use of back-ticks as seen in the screenshot below.
7. Plain text and formatted text can be included in a markdown file as well. Text that needs to be formatted must follow a specific syntax. For example, you can format text as italics, bold font, headings, links, and images. Below is an example of both plain text and text that has been formatted.
8. Other than the header information we aren’t going to use any of the default code or text provided, so go ahead and delete everything other than the header.
9. Save the file to your working directory with a name of CreatingMapsWithR.Rmd.
Exercise 2: Adding Code Chunks and Text to an R Markdown File
R code can be included in the R Markdown file through the use of chunks, which are defined through the use of three back-ticks followed by an r enclosed within curly braces. Inside the curly braces are options that can be included. These options can include TRUE | FALSE parameters for turning various types of messaging on and off.
Chunks define a single task, sort of like a function. They should be self-contained and tightly defined pieces of code. There are three ways to insert chunks into an R Markdown file: Cmd/Ctrl-Alt-I, the Insert button on the editor toolbar, and by manually typing the chunk delimiters.
You can also add plain text and formatted text to an R Markdown file. Formatted text has to be defined according to a specific syntax. We’ll see various examples of formatted text as we move through this exercise.
In this exercise you’ll learn how to add code chunks to an R Markdown file.
1. First, we’ll add some descriptive text that will be included in the output R Markdown file. Add the text you see below to the file just below the header. If you have a digital copy of the book you can copy and paste rather than typing everything. Notice that the text Step 1: Creating a Basemap has been preceded by two pound signs: ## Step 1: Creating a Basemap. The pound signs are used to define headings. In this case two pound signs translate to an HTML <h2> tag, which simply defines the size of the text. You’ll also notice that some of the words like ggmap and ggplot2 are surrounded by back-ticks. Back-ticks are used to define a different style for the word that indicates this word is programmatic code.
The `ggmap` package enables the visualization of spatial data and spatial statistics in a map format using the layered approach of `ggplot2`. This package also includes basemaps that give your visualizations context including Google Maps, OpenStreetMap, Stamen Maps, and CloudMade maps. In addition, utility functions are provided for accessing various Google services including Geocoding, Distance Matrix, and Directions.
The `ggmap` package is based on `ggplot2`, which means it will take a layered approach and will consist of the same five components found in `ggplot2`. These include a default dataset with aesthetic mappings where x is longitude, y is latitude, and the coordinate system is fixed to Mercator. Other components include one or more layers defined with a geometric object and statistical transformation, a scale for each aesthetic mapping, coordinate system, and facet specification. Because `ggmap` is built on `ggplot2` it has access to the full range of `ggplot2` functionality. In this exercise you’ll learn how to use the `ggmap` package to plot various types of spatial visualizations.
## Step 1: Creating a Basemap
There are two basic steps to create a map with `ggmap`. The details are more complex than these two steps might imply, but in general you just need to download the map raster and then plot operational data on the basemap. Step 1 is to download the map raster, also known as the basemap. This is accomplished using the `get_map()` function, which can be used to create a basemap from Google, Stamen, OpenStreetMap, or CloudMade. You’ll learn how to do that in this step. In a future step you’ll learn how to add and style operational data in various ways.
1. First, load the libraries that we’ll need for this exercise.
2. Click Insert and then R to insert a new code chunk as seen below. The code you add will go in between the set of back-ticks. Most markdown files will have a number of code chunks, with each defining a specific task. They are similar in many ways to functions.
3. For this code chunk we’ll just load the libraries that will be used in this exercise. Add the code you see below inside the code chunk boundaries.
```{r}
library(ggplot2)
library(ggmap)
library(readr)
library(dplyr)
```
4. Add some additional text that describes the next step.
2. Create a variable called `myLocation` and set it to `California`. Call the `get_map()` function with a zoom level of 6, and plot the map using the `ggmap()` function, passing in a reference to the variable returned by the `get_map()` function. The default map type is Google Maps with a style of Terrain.
5. Insert a new code chunk just below the descriptive text and add the following code.
```{r}
myLocation <- "California"
myMap <- get_map(location = myLocation, zoom = 6)
ggmap(myMap)
```
6. Let’s run the code that has been added so far to see the result. Select Run | Run All from the RStudio interface. This should produce the output you see below. The output is included inside the markdown document. If not, check your code and try running it again.
7. Add descriptive text for the next section.
3. The code you see below will create a Google satellite basemap layer. Other basemap layers include Stamen, OSM, and CloudMade.
8. Create a new code chunk and add the code you see below.
```{r}
myMap <- get_map(location = myLocation, zoom = 6, source = "google", maptype = "satellite")
ggmap(myMap)
```
9. Add descriptive text for the next section.
4. There are a number of ways that you can define the input location: a longitude/latitude coordinate pair, a character string, or a bounding box. The character string tends to be the more practical solution in many situations since you can simply pass in the name of the location. For example, you could define the location as Houston Texas or The White House or The Grand Canyon. When a character string is passed to the location parameter it is then passed to the geocoding service to obtain the latitude/longitude coordinate pair. Add the code you see below to see how passing in a character string works.
10. Create a new code chunk and add the code you see below.
```{r}
myMap <- get_map(location = "Grand Canyon, Arizona", zoom = 11)
ggmap(myMap)
```
11. Let’s stop adding code for now and run what is currently in the file to see the result. Select Run | Run All. Several maps will be produced inside the markdown document including the one seen below, which will be produced at the very end. If you don’t see the maps you may need to check your code. We haven’t yet added parameters that will output warnings and errors, but will do so in a later step.
12. Add descriptive text for the next section.
The zoom level can be set between 3 and 21, with 3 representing a continent-level view and 21 representing a building-level view.
## Step 2: Adding Operational Data Layers
`ggmap()` returns a `ggplot` object, meaning that it acts as a base layer in the `ggplot2` framework. This allows for the full range of `ggplot2` capabilities, meaning that you can plot points on the map, add contours and 2D heat maps, and more. We’ll examine some of these capabilities in this section.
1. For this section we’ll use the historical wildfire information found in the StudyArea_SmallFile.csv file. Load this dataset using the `read_csv()` function. You can download this file at: https://www.dropbox.com/s/9ouh21a6ym62nsl/StudyArea.csv?dl=0
13. Create a new code chunk and add the code you see below. This will load wildfire data from a csv file. Note: The path to your StudyArea_SmallFile.csv file may differ from the one you see below.
```{r}
dfWildfires <- read_csv("~/Desktop/IntroR/Data/StudyArea_SmallFile.csv", col_types = list(FIRENUMBER = col_character(), UNIT = col_character()), col_names = TRUE)
```
14. Add descriptive text for the next section.
2. Initially we’ll just load the wildfire events as points. Add the code you see below to produce a map of California that displays wildfires from the years 1980-2016 that burned more than 1,000 acres.
15. Create a new code chunk and add the code you see below. This code chunk will display each of the wildfires as a point on the map.
```{r}
myLocation <- 'California'
# get the basemap
myMap <- get_map(location = myLocation, zoom = 6)
# use the select() function to limit the columns from the data frame
df <- select(dfWildfires, STATE, YEAR_, TOTALACRES, DLATITUDE, DLONGITUDE)
# use the filter() function to get only fires in California with acres
# burned of 1000 or more
df <- filter(df, TOTALACRES >= 1000 & STATE == 'California')
# produce the final map
ggmap(myMap) + geom_point(data = df, aes(x = DLONGITUDE, y = DLATITUDE))
```
16. Add the following descriptive text.
3. Now let’s do something a little more interesting. First, use the `dplyr` `mutate()` function to group the fires by decade.
17. Create a new code chunk and add the code you see below. The mutate() function is used in this code chunk to create a new column called DECADE and then populate each row with a value for the decade in which the fire occurred.
```{r}
df <- mutate(df, DECADE = ifelse(YEAR_ %in% 1980:1989, "1980-1989", ifelse(YEAR_ %in% 1990:1999, "1990-1999", ifelse(YEAR_ %in% 2000:2009, "2000-2009", ifelse(YEAR_ %in% 2010:2016, "2010-2016", "-99")))))
```
18. Add the following descriptive text.
4. Next, color code the wildfires by `DECADE` and create a graduated symbol map based on the size of each fire. The `colour` property defines the column to use for grouping, and the `size` property defines the column to use for the size of each symbol.
19. Create a new code chunk and add the code you see below. This code chunk will color code the fires by decade.
```{r}
ggmap(myMap) + geom_point(data = df, aes(x = DLONGITUDE, y = DLATITUDE, colour = DECADE, size = TOTALACRES))
```
20. Let’s stop adding code for now and run what is currently in the file to see the result. Before running the code again go ahead and clear the past results by clicking the small X in the upper right-hand corner of the output for each map as seen in the screenshot below.
21. Select Run | Run All. The output produced will include several maps with the final map appearing as seen in the screenshot below.
5. Let’s change the map view to focus more on southern California, and in particular the area just north of Los Angeles.
23. Create a new code chunk and add the code you see below. This code chunk will color code the fires by decade and size the symbols according to the total acreage burned.
```{r}
myMap <- get_map(location = "Santa Clarita, California", zoom = 10)
ggmap(myMap) + geom_point(data = df, aes(x = DLONGITUDE, y = DLATITUDE, colour = DECADE, size = TOTALACRES))
```
24. Add the following descriptive text.
6. Next we’ll add contour and heat layers. The `geom_density2d()` function is used to create the contours while the `stat_density2d()` function creates the heat map. Add the following code to produce the map you see below. You can experiment with the colors using the `low` and `high` properties of `scale_fill_gradient()`. Here we’ve set them to green and red respectively, but you may want to change the color scheme.
25. Create a new code chunk and add the code you see below. This code chunk will create a heat map and add contours.
```{r}
myMap <- get_map(location = "Santa Clarita, California", zoom = 8)
ggmap(myMap, extent = "device") +
  geom_density2d(data = df, aes(x = DLONGITUDE, y = DLATITUDE), size = 0.3) +
  stat_density2d(data = df, aes(x = DLONGITUDE, y = DLATITUDE, fill = ..level.., alpha = ..level..), size = 0.01, bins = 16, geom = "polygon") +
  scale_fill_gradient(low = "green", high = "red") +
  scale_alpha(range = c(0, 0.3), guide = FALSE)
```
7. If you’d prefer to see the heat map without contours, the code can be simplified as follows:
27. Create a new code chunk and add the code you see below. This code chunk will remove the contours.
```{r}
ggmap(myMap, extent = "device") +
  stat_density2d(data = df, aes(x = DLONGITUDE, y = DLATITUDE, fill = ..level.., alpha = ..level..), size = 0.01, bins = 16, geom = "polygon") +
  scale_fill_gradient(low = "green", high = "red") +
  scale_alpha(range = c(0, 0.3), guide = FALSE)
```
28. Add the following descriptive text.
8. Finally, let’s create a facet map that depicts hot spots for each year in the current decade. Add the following code to see how this works. The dataset contains information up through the year 2016.
29. Create a code chunk and add the code you see below.
```{r}
df <- filter(dfWildfires, STATE == 'California')
df <- filter(df, YEAR_ %in% c(2010, 2011, 2012, 2013, 2014, 2015, 2016))
myMap <- get_map(location = "Santa Clarita, California", zoom = 9)
ggmap(myMap, extent = "device") +
  stat_density2d(data = df, aes(x = DLONGITUDE, y = DLATITUDE, fill = ..level.., alpha = ..level..), size = 0.01, bins = 16, geom = "polygon") +
  scale_fill_gradient(low = "green", high = "red") +
  scale_alpha(range = c(0, 0.3), guide = FALSE) +
  facet_wrap(~YEAR_)
```
30. That completes the code for this R Markdown file. Go ahead and run the code again to see the final output by selecting Run | Run All.
Exercise 3: Code chunk and header options
Chunk options are arguments supplied to the chunk header. Currently there are approximately 60 such options. We’ll examine some of the more commonly used and important options in this exercise. All code chunk options are placed inside the {r} block.
Code chunks can be given an optional name as seen in the example code below where the code chunk has been given a name of MapSetup.
```{r MapSetup, warning=FALSE, error=FALSE, message=FALSE}
The advantages of naming chunks include easier navigation using the code navigator in RStudio, useful names given to graphics produced by chunks, and the ability to cache chunks to avoid re-performing computations on each run. This last advantage is perhaps the most useful.
1. The R Markdown pane includes a quick access menu for easily navigating to different sections of your R Markdown page. The arrow in the screenshot below displays the location of this functionality.
2. Click on the quick access button now to see the different sections of the R Markdown file. You should see something similar to the screenshot below. You’ll notice that it is sectioned by headings and then code chunks. To make navigation easier you can name each of these chunks.
Select Chunk 1 under Step 1: Creating a Basemap to return to the first code chunk you created in an earlier exercise. This code chunk simply defines the libraries that will be used in the file.
In the {r} section of the header name the chunk libs.
```{r libs}
3. Notice that the value has now been updated in the quick access dropdown menu.
4. Rename the rest of your code chunks. You can use whatever name makes the most sense for each.
5. Next, we’ll add some code options. Although there are currently 60+ options that can be applied to a code chunk we’ll examine only a few of the more important options. You can get a list of all the available code chunk options at https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf.
6. Messaging is one of the most commonly used and useful options. There are actually three messaging options: message, warning, and error. All three take TRUE | FALSE values. In R Markdown, message and warning are TRUE by default while error is FALSE by default. Navigate to Chunk 2 and add the options you see highlighted below. This sets all three explicitly so that general information messages, warnings, and errors are all reported.
```{r error=TRUE, warning=TRUE, message=TRUE}
myLocation <- "California"
myMap <- get_map(location = myLocation, zoom = 6)
ggmap(myMap)
```
7. Now when you run this section any of these messages will be printed out along with the output. Rather than running the entire markdown file each time you want to test something, you can limit the run to a particular code chunk by clicking the arrow on the far right-hand side of the code chunk as seen in the screenshot below.
8. The output window includes two overview windows: the output visualization and the R Console. If you click the R Console overview window as seen in the screenshot below it will display any messages that were written to the console as a result of the execution of this code block.
Clicking the R Console window should produce an output similar to the screenshot below.
9. Now add the same message, warning, and error options to your other code chunks.
10. Run the code chunks one at a time and examine the output. Any warnings and errors will be prominently displayed as seen in the screenshot below.
11. You can also define document-wide options. In this step we’ll look at a common option defined in the header. The content of the header defines parameters that control various settings for the entire document.
The header can include basic descriptive information including the title, author, date, and output format along with other settings including parameters, bibliographies, and citations. Parameters are used when you need to re-render the same report but with distinct values for inputs. The params field controls these parameters.
You’ll notice in the code example below that a state parameter has been defined with a value of California. This value can then be accessed elsewhere in the R Markdown file using the syntax params$<parameter>, or params$state in this example.
Add the params option with a parameter of state and set it equal to California in your file exactly as seen in the screenshot above.
12. Navigate to Chunk 2 and find the line you see below.
myLocation <- "California"
13. Change this line as seen below to access the state parameter.
```{r error=TRUE, warning=TRUE, message=TRUE}
myLocation <- params$state
myMap <- get_map(location = myLocation, zoom = 6)
ggmap(myMap)
```
14. Run the code for Chunk 2 only and you should see the same output map centered on California.
15. Clear the output for chunk 2 by clicking the X in the upper right-hand corner of the output.
16. Return to the state parameter in the header and change the value to Montana.
---
title: "Creating Maps with R"
author: "Eric Pimpler"
date: "7/18/2018"
output: html_document
params:
  state: 'Montana'
---
17. Run code chunk 2 again and now the map should be centered on Montana.
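The header is one place for document-wide settings. Chunk options such as error, warning, and message can also be set once for the whole document instead of being repeated in every chunk header. A common approach, sketched here under the assumption that the knitr package is installed, is to call knitr's opts_chunk$set() in the first chunk of the file. The specific values simply mirror the ones used earlier in this exercise.

```r
# Intended to be placed inside the first chunk of the .Rmd file;
# it also runs in a plain R session with knitr installed.
library(knitr)

# Apply the same messaging options to every chunk that follows
opts_chunk$set(error = TRUE, warning = TRUE, message = TRUE)

# Individual chunks can still override these values in their own header,
# e.g. {r libs, message=FALSE}

# Inspect the current defaults
opts_chunk$get(c("error", "warning", "message"))
```

Setting the options in one place keeps the individual chunk headers short, with only a chunk name and any exceptions.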
Exercise 4: Caching
Code chunks can also be cached, which is great for computation that takes a long time to execute. To enable caching the cache parameter should be set to TRUE. This will save the output of the code chunk to a specially named file on disk. On any subsequent runs, knitr checks to see if the code has changed, and if not, it will reuse the cached results.
You do need to be careful with caching though, as it will only re-run a code chunk if the code changes. It doesn’t take into account things such as changes to underlying data sources. For example, the data in an underlying data source could change, but because the R Markdown file will only re-run the code chunk if the code changes, the cached results would be out of date.
1. Find the code chunk you see below that maps the individual wildfire points. You may have named the chunk something other than what I have named the chunk (point_map).
```{r point_map, error=TRUE, warning=TRUE, message=TRUE}
myLocation <- 'California'
# get the basemap
myMap <- get_map(location = myLocation, zoom = 6)
# use the select() function to limit the columns from the data frame
df <- select(dfWildfires, STATE, YEAR_, TOTALACRES, DLATITUDE, DLONGITUDE)
# use the filter() function to get only fires in California with acres
# burned of 1000 or more
df <- filter(df, TOTALACRES >= 1000 & STATE == 'California')
# produce the final map
ggmap(myMap) + geom_point(data = df, aes(x = DLONGITUDE, y = DLATITUDE))
```
2. Add the cache parameter to the options for the chunk as seen below.
```{r point_map, cache=TRUE, error=TRUE, warning=TRUE, message=TRUE}
3. This code chunk is dependent upon the data in the dfWildfires data frame, which is loaded in the code chunk directly preceding this chunk. The code chunk that loads the data from a csv file into the dfWildfires variable can be seen below. You may have named the chunk differently (load_data).
```{r load_data, error=TRUE, warning=TRUE, message=TRUE}
dfWildfires <- read_csv("~/Desktop/IntroR/Data/StudyArea_SmallFile.csv", col_types = list(FIRENUMBER = col_character(), UNIT = col_character()), col_names = TRUE)
```
4. Because the point_map code chunk is dependent upon the data in the dfWildfires data frame you need to add a dependson parameter to the point_map code chunk.
```{r point_map, cache=TRUE, dependson='load_data', error=TRUE, warning=TRUE, message=TRUE}
This will cover situations where the read_csv() call changes. For example, a different file might be read by the function.
5. Keep in mind that the cache and dependson parameters only monitor for changes in the .Rmd file. What would happen if the underlying data in the StudyArea_SmallFile.csv file changes? The answer is that the changes wouldn’t be picked up. To handle this sort of situation you can use the cache.extra option along with the file.info() function.
```{r load_data, cache.extra=file.info('~/Desktop/IntroR/Data/StudyArea_SmallFile.csv'), error=TRUE, warning=TRUE, message=TRUE}
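The idea behind cache.extra is that its value acts as a signature of the data file: when the signature changes, the cached chunk is treated as out of date. The sketch below shows, on a throwaway temporary file with made-up rows, that both file.info() and a content hash from tools::md5sum() change when the file is modified. Using md5sum() here is an illustrative alternative, not something the exercise requires.

```r
# Work on a throwaway file so nothing in the exercise folder is touched
csv_path <- tempfile(fileext = ".csv")
writeLines(c("STATE,TOTALACRES", "California,1200"), csv_path)  # made-up rows

# Two candidate "signatures" for the data file
info_before <- file.info(csv_path)[, c("size", "mtime")]
hash_before <- tools::md5sum(csv_path)

# Simulate the underlying data changing
write("Montana,5400", file = csv_path, append = TRUE)

info_after <- file.info(csv_path)[, c("size", "mtime")]
hash_after <- tools::md5sum(csv_path)

# Both signatures change, so either one could be used to invalidate the cache
info_before$size != info_after$size
unname(hash_before != hash_after)
```

A hash depends only on the file contents, whereas file.info() also reflects the modification time, so copying or re-saving an unchanged file would alter the file.info() signature but not the hash.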
Exercise 5: Using Knit to output an R Markdown file
The Knit functionality built into RStudio can be used to export an R Markdown file to various formats including HTML, PDF, and Word. Knit can be accessed from the dropdown menu seen in the screenshot below.
1. To simplify the output of the R Markdown file you’re going to remove some of the options that were added in the previous exercise. In the CreatingMapsWithR.Rmd file remove the cache, dependson, and cache.extra parameters added in the last exercise.
2. Select Knit and find the Knit Directory menu item from the RStudio interface. By default, it is set to Document Directory. This simply means that the output file will go into the same directory where the R Markdown file has been saved.
3. Select Knit | Knit to HTML. Knit will begin processing the file and you’ll see output messaging information written to the Console pane. If everything goes as expected an output HTML file called CreatingMapsWithR.html will be created in the same folder where the CreatingMapsWithR.Rmd file was saved. The output file will be fairly lengthy, but the top part should look similar to the screenshot below.
4. You can check your work against the CreatingMapsWithR.Rmd solution file.
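Knitting can also be scripted rather than triggered from the menu, using the render() function from the rmarkdown package. The sketch below writes a tiny stand-alone .Rmd file to a temporary folder and renders it to HTML, overriding the state parameter at render time. The file name, title, and chunk name are invented for this illustration, and the render step is skipped when rmarkdown or pandoc is not available.

```r
# Write a tiny stand-alone .Rmd file to a temporary folder.
# The file name, title and chunk name are made up for this sketch.
fence <- strrep("`", 3)   # the three back-ticks that delimit a chunk
rmd_path <- file.path(tempdir(), "KnitSketch.Rmd")
writeLines(c(
  "---",
  "title: \"Knit Sketch\"",
  "output: html_document",
  "params:",
  "  state: \"California\"",
  "---",
  "",
  "## Selected state",
  "",
  paste0(fence, "{r show_state}"),
  "params$state",
  fence
), rmd_path)

# Render only when rmarkdown and pandoc are available
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(
    rmd_path,
    output_format = "html_document",
    params = list(state = "Montana"),  # overrides the value in the header
    quiet = TRUE
  )
  out_file  # path of the generated HTML file
}
```

Passing params to render() makes it possible to produce the same report for several states from a loop without editing the header each time.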
Conclusion
In this chapter you learned how to create an R Markdown file, which can be used to share your work with others in various formats including PDF, Word, HTML, slideshows, and more. R Markdown files can include code, results, and commentary, making them a perfect resource for explaining not only the results of a project, but also the mechanics of how the work was accomplished.
In the next chapter you’ll tackle a case study that examines wildfire activity in the western United States.
Chapter 10
Case Study – Wildfire Activity in the Western United States
Studies suggest that over the past few decades, the number and size of wildfires have increased throughout the western United States. The average length of wildfire season has increased significantly as well in some areas. According to the Union of Concerned Scientists (UCS), every state in the western US has experienced an increase in the average annual number of large wildfires (greater than 1,000 acres) over the past few decades. The Pacific Northwest, including Washington, Oregon, Idaho, and the western half of Montana, has had particularly challenging wildfire seasons in recent years.
The 2017 wildfire season shattered records and cost the U.S. Forest Service an unprecedented $2 billion. From the Oregon wildfires to late season fires in Montana, and the highly unusual timing of the California fires in December, it was a busy year in the western United States. While 2017 was a particularly notable wildfire season, this trend is nothing new, and research suggests we can expect it to continue due to climate change and other factors. A recent study suggests that over the next two decades, as many as 11 states are predicted to see the average annual area burned increase by 500 percent.
Extensive studies have found that large forest fires in the western US have been occurring nearly five times more often since the 1970s and 80s. Such fires are burning more than six times the land area as before and lasting almost five times longer.
Climate change is thought to be the primary cause of the increase in large wildfires, with rising temperatures leading to earlier and decreased volume of snowmelts, decreased precipitation, and forest conditions that are drier for longer periods of time. An increase in forest tree disease from insect disturbance has also been associated with climate change and can lead to large areas of highly flammable dead or dying forests. Other potential causes of increased wildfire activity include forest management practices and an increase in human-caused wildfires due to accidents or arson.
In this case study you will use the skills you have gained in this book along with wildfire data from the Federal Wildland Fire Occurrence Database (https://wildfire.cr.usgs.gov/firehistory/data.html), provided by the U.S. Geological Survey (USGS), to visualize the change in wildfire activity from 1980 to 2016. Analysis will be limited to the western United States including California, Arizona, New Mexico, Colorado, Utah, Nevada, Oregon, Washington, Idaho, Montana, and Wyoming. We are particularly interested in the surge of large wildland fires, categorized as fires that burn greater than 1,000 acres.
So, has wildfire activity and size actually increased, or does it just seem that way because we’re tuned in more to bad news and social media? In this chapter you’ll answer those questions and more using R with the tidyverse package.
In this chapter we’ll answer the following questions:
• Has the number of wildfires increased or decreased in the past few decades?
• Has the acreage burned increased over time?
• Is the size of individual wildfires increasing over time?
• Has the length of the fire season increased over time?
• Does the acreage burned differ by federal organization?
Exercise 1: Has the number of wildfires increased or decreased in the past few decades?
The StudyArea.csv file in your IntroR\Data folder contains all non-prescribed wildfire activity from 1980-2016 for the 11 states in our study area, which include California, Oregon, Washington, Idaho, Nevada, Arizona, Utah, Montana, Wyoming, Colorado, and New Mexico. We’ll use this file for all the exercises in this chapter. We’re going to focus primarily on large wildfires in this study, defined here as any non-prescribed fire greater than 1,000 acres.
1. In your IntroR folder create a new folder called CaseStudy1. You can do this inside RStudio by going to the Files pane and selecting New Folder inside your working directory.
2. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise1.R.
3. At the top of the script, load the packages that will be used in this exercise.
library(readr)
library(dplyr)
library(ggplot2)
4. Use the read_csv() function from the readr package to load the data into a data frame.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
5. Check the number of rows in the data frame. This should return 439362 or something close to that.
nrow(df)
[1] 439362
6. We only need a few of the columns from the data frame for this exercise so use the select() function to retrieve the STATE, YEAR_, TOTALACRES, and CAUSE columns. We’ll also rename some of these columns in this step. Piping will be used for the rest of the code in this exercise so begin the statement as seen below.
df%>%select(STATE,YR=YEAR_,ACRES=TOTALACRES,CAUSE)%>%7.Next,filterthedataframesothatonlywildfiresthatburned1,000acresormoreareincluded.Addthecodehighlightedinboldbelow.
df%>%select(STATE,YR=YEAR_,ACRES=TOTALACRES,CAUSE)%>%filter(ACRES>=1000)%>%
8.Grouptherecordsbyyear.
df%>%select(STATE,YR=YEAR_,ACRES=TOTALACRES,CAUSE)%>%filter(ACRES>=1000)%>%group_by(YR)%>%
9.Getacountofthenumberofwildfiresforeachyearbyusingthesummarize()functionwiththecount=n()parameter.
df%>%select(STATE,YR=YEAR_,ACRES=TOTALACRES,CAUSE)%>%filter(ACRES>=1000)%>%group_by(YR)%>%summarize(count=n())%>%
10.Finally,createascatterplotwitharegressionlinethatdepictsthenumberofwildfiresovertheyears.
df%>%select(STATE,YR=YEAR_,ACRES=TOTALACRES,CAUSE)%>%filter(ACRES>=1000)%>%group_by(YR)%>%summarize(count=n())%>%ggplot(mapping=aes(x=YR,y=count))+geom_point()+geom_smooth(method=lm,se=TRUE)+ggtitle(“LargeFiresAreBecomingMoreCommonintheWest-1980-2016”)+xlab(“Year”)+ylab(“NumberofWildfires”)
11. You can check your work against the solution file CS1_Exercise1.R.
12. Save the script and then click the Run button. If you’ve coded everything correctly you should see the plot displayed in the screenshot below.
13. Based on this visualization it appears as though large wildfires have indeed become more common over the past few decades. But let’s expand this to see if all the states in the study area have the same pattern.
14. Create a new R script and save it with a name of CS1_Exercise1B.R.
15. Add the following code to your script and save it. We’ll discuss the differences between this script and the previous one afterward.
library(readr)
library(dplyr)
library(ggplot2)
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(STATE, YR) %>%
  summarize(cnt = n()) %>%
  ggplot(mapping = aes(x = YR, y = cnt)) +
  geom_point() +
  facet_wrap(~STATE) +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Number of Fires by State and Year") +
  xlab("Year") +
  ylab("Number of Fires")
16. Save the script and then run it to see the output shown in the screenshot below.
This script groups the dataset by STATE and YR and then summarizes the data by generating a count of the number of fires in each group. Finally, the facet_wrap() function is used with ggplot() to create the facet plot that depicts the number of fires by state over time. A number of the individual states show a slight upward trend over time, but many have an almost flat regression line.
17. You can check your work against the solution file CS1_Exercise1B.R.
18. Challenge 1: Repeat this process to see the results for wildfires greater than 5,000 acres, 25,000 acres, and 100,000 acres. Are these findings consistent with the results for wildfires greater than 1,000 acres?
19. Challenge 2: Repeat the process but this time group the data by year and by wildfires that are naturally occurring. The CAUSE column includes a value of Natural that can be used to group the data. You’ll need a compound grouping statement.
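The counting pattern used throughout this exercise can be tried without the full dataset. The block below is a minimal sketch on a small, invented table (the fires data frame and its values are hypothetical stand-ins for StudyArea.csv, not real records). It shows a single grouping by year and a compound grouping by cause and year.

```r
library(dplyr)

# Hypothetical toy data standing in for StudyArea.csv (values invented)
fires <- tibble(
  YR    = c(1980, 1980, 1981, 1981, 1981, 1982),
  ACRES = c(500, 1200, 3000, 800, 15000, 2500),
  CAUSE = c("Human", "Natural", "Natural", "Human", "Human", "Natural")
)

# Count fires of 1,000 acres or more per year
fires_per_year <- fires %>%
  filter(ACRES >= 1000) %>%
  group_by(YR) %>%
  summarize(count = n())

# Compound grouping: one count per CAUSE and YR combination
fires_per_cause_year <- fires %>%
  filter(ACRES >= 1000) %>%
  group_by(CAUSE, YR) %>%
  summarize(count = n(), .groups = "drop")

print(fires_per_year)
print(fires_per_cause_year)
```

Passing more than one column to group_by() is all that a compound grouping requires; summarize() then returns one row per combination that actually occurs in the data.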
Exercise 2: Has the acreage burned increased over time?
Measuring the number of fires over time only tells part of the story. The amount of acreage burned during that time may give us more insight into the patterns in wildfire activity. In this exercise we’ll create visualizations that illustrate how much acreage is being burned each year as a result of wildfires.
1. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise2.R.
2. At the top of the script, load the packages that will be used in this exercise.
library(readr)
library(dplyr)
library(ggplot2)
3. The first few lines of this script will be similar to the previous exercises, so I won’t discuss the details of each line. By now you should be able to determine what each of these lines will accomplish anyway. Add the lines shown below.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
4. Group the data by year.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(YR) %>%
5. Use the summarize() function to sum the total acreage burned by year.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(YR) %>%
  summarize(totalacres = sum(ACRES)) %>%
6. Create a scatterplot with a regression line that displays the total acreage burned by year. In this case you’ll also plot the total acres burned on a logarithmic scale by wrapping the value in log(), which in R is the natural logarithm.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(YR) %>%
  summarize(totalacres = sum(ACRES)) %>%
  ggplot(mapping = aes(x = YR, y = log(totalacres))) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Total Acres Burned") +
  xlab("Year") +
  ylab("Log of Total Acres Burned")
7. You can check your work against the solution file CS1_Exercise2.R.
8. Save the script and then run it to see the output shown in the screenshot below. It’s clear from this graph that there has been a significant increase in the acreage burned over the past few decades.
9. Now let’s see if this trend is significant for all states in the study area. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise2B.R.
10. At the top of the script, load the packages that will be used in this exercise.
library(readr)
library(dplyr)
library(ggplot2)
11. The first few lines of this script will be similar to the previous exercises, so I won’t discuss the details of each line. Add the lines shown below.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
12. Group the data by STATE and YR.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(STATE, YR) %>%
13. Use the summarize() function to calculate the total acreage burned by state and year.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(STATE, YR) %>%
  summarize(totalacres = sum(ACRES)) %>%
14. Create a facet plot that displays the total acreage burned by state and year.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(STATE, YR) %>%
  summarize(totalacres = sum(ACRES)) %>%
  ggplot(mapping = aes(x = YR, y = log(totalacres))) +
  geom_point() +
  facet_wrap(~STATE) +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Total Acres Burned") +
  xlab("Year") +
  ylab("Log of Total Acres Burned")
15. You can check your work against the solution file CS1_Exercise2B.R.
16. Save the script and then run it to see the output shown in the screenshot below. It’s clear from this graph that there has been an increase in the acreage burned over the past few decades for all the states in the study area.
17. You may have wondered if there is a difference in the size of wildfires that were caused naturally as opposed to those that were human induced. In the next few steps we’ll write a script to examine just that. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise2C.R.
18. At the top of the script, load the packages that will be used in this exercise.
library(readr)
library(dplyr)
library(ggplot2)
19. The first few lines of this script will be similar to the previous exercises, so I won’t discuss the details of each line. Add the lines shown below.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
20. For this script we’ll filter so that only Natural and Human values are selected from the CAUSE column in addition to requiring that only fires greater than 1,000 acres be included. There are additional values in the CAUSE column, including UNKNOWN and a few other random values, so that’s why we’re taking this extra step. The dataset does not include prescribed fires, so we don’t have to worry about that in this case. The %in% operator can be used with a vector in R to define multiple values, as is the case here.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000 & CAUSE %in% c('Human', 'Natural')) %>%
21. Group the data by CAUSE and YR.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000 & CAUSE %in% c('Human', 'Natural')) %>%
  group_by(CAUSE, YR) %>%
22. Sum the total acreage burned.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000 & CAUSE %in% c('Human', 'Natural')) %>%
  group_by(CAUSE, YR) %>%
  summarize(totalacres = sum(ACRES)) %>%
23. Plot the dataset. Use the colour property from the aes() function to color code the values by CAUSE.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000 & CAUSE %in% c('Human', 'Natural')) %>%
  group_by(CAUSE, YR) %>%
  summarize(totalacres = sum(ACRES)) %>%
  ggplot(mapping = aes(x = YR, y = log(totalacres), colour = CAUSE)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Total Acres Burned") +
  xlab("Year") +
  ylab("Log of Total Acres Burned")
24. You can check your work against the solution file CS1_Exercise2C.R.
25. Save the script and then run it to see the output shown in the screenshot below. Both human and naturally caused wildfires have seen a significant increase in the amount of acreage burned over the past few decades, but the amount of acreage burned by naturally occurring fires appears to be increasing at a more rapid pace.
26. Finally, let’s create a violin plot to see the distribution of acres burned by state. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise2D.R.
27. At the top of the script, load the packages that will be used in this exercise.
library(readr)
library(dplyr)
library(ggplot2)
28. The first few lines of this script will be similar to the previous exercises, so I won’t discuss the details of each line. Add the lines shown below.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(STATE) %>%
29. Create a violin plot with an embedded box plot.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE) %>%
  filter(ACRES >= 1000) %>%
  group_by(STATE) %>%
  ggplot(mapping = aes(x = STATE, y = log(ACRES))) +
  geom_violin() +
  geom_boxplot(width = 0.1) +
  ggtitle("Wildfires by State Greater than 1,000 Acres") +
  xlab("State") +
  ylab("Acres Burned (Log)")
30. You can check your work against the solution file CS1_Exercise2D.R.
31. Save the script and then run it to see the output shown in the screenshot below.
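The %in% filter and the grouped sum from this exercise can be checked on a few rows before running them against the full file. What follows is a minimal sketch with invented values; the fires table and the logacres column name are assumptions made for illustration only.

```r
library(dplyr)

# Hypothetical toy data (values invented); note the UNKNOWN cause
fires <- tibble(
  YR    = c(2000, 2000, 2000, 2001, 2001, 2001),
  ACRES = c(1500, 2500, 4000, 10000, 1200, 900),
  CAUSE = c("Human", "Natural", "UNKNOWN", "Natural", "Human", "Human")
)

acres_by_cause <- fires %>%
  # Keep large fires whose cause is one of the two listed values
  filter(ACRES >= 1000 & CAUSE %in% c("Human", "Natural")) %>%
  group_by(CAUSE, YR) %>%
  summarize(totalacres = sum(ACRES), .groups = "drop") %>%
  # log() in R is the natural logarithm
  mutate(logacres = log(totalacres))

print(acres_by_cause)
```

The UNKNOWN row and the 900-acre fire are dropped by the filter, so the result has one row for each of the four remaining cause and year combinations.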
Exercise 3: Is the size of individual wildfires increasing over time?
In the first exercise we found that the number of wildfires appears to be increasing over the past few decades. In this exercise we’ll determine whether the size of those fires has increased as well. The StudyArea.csv file contains a TOTALACRES column that defines the number of acres burned by each fire. We’ll group the fires by year and then by decade and determine the mean fire size for each.
1. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise3.R.
2. At the top of the script, load the packages that will be used in this exercise.
library(readr)
library(dplyr)
library(ggplot2)
3. The first few lines of this script will be the same as the previous exercises, so I won’t discuss the details of each line. By now you should be able to determine what each of these lines will accomplish anyway. Add the lines shown below.
dfWildfires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df <- select(dfWildfires, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE)
df <- filter(df, ACRES >= 1000)
grp <- group_by(df, CAUSE, YR)
4. Summarize the data by determining the mean acreage burned for each group.
sm <- summarize(grp, mean(ACRES))
5. The summarize() function will create a new column called mean(ACRES) and add it to the output data frame. This isn’t exactly a user-friendly name, so we’ll change the name of this column in the next step. You can see the output of the summarize() function in the screenshot below.
6. Change the column name.
colnames(sm)[3] <- 'MEAN'
7. Create a scatterplot of the results.
ggplot(data = sm, mapping = aes(x = YR, y = MEAN)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Average Size of Wildfires Has Increased for both Human and Natural Causes") +
  xlab("Year") +
  ylab("Average Wildfire Size")
8. The entire script should appear as seen below.
library(readr)
library(dplyr)
library(ggplot2)
dfWildfires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df <- select(dfWildfires, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE)
df <- filter(df, ACRES >= 1000)
grp <- group_by(df, CAUSE, YR)
sm <- summarize(grp, mean(ACRES))
colnames(sm)[3] <- 'MEAN'
ggplot(data = sm, mapping = aes(x = YR, y = MEAN)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Average Size of Wildfires Has Increased for both Human and Natural Causes") +
  xlab("Year") +
  ylab("Average Wildfire Size")
9. You can check your work against the solution file CS1_Exercise3.R.
10. Save and run the script. If everything has been coded correctly you should see the following output. This graph indicates a clear trend toward larger wildfires over time.
11. Now let’s group the wildfires by decade, calculate the mean acreage burned per fire during that time, and create a bar chart to display the results.
12. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise3B.R.
13. At the top of the script, load the packages that will be used in this exercise.
library(readr)
library(dplyr)
library(ggplot2)
14. Load, select, and filter the data in the same way we’ve done with the other exercises in this chapter.
dfWildfires <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df <- select(dfWildfires, ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE)
df <- filter(df, ACRES >= 1000)
15. In this step we’ll use the mutate() function along with an ifelse() function to create a new column called DECADE and then populate the contents of this column based on the value of the YR column for each row. Add the code you see below.
df <- mutate(df, DECADE = ifelse(YR %in% 1980:1989, "1980-1989",
  ifelse(YR %in% 1990:1999, "1990-1999",
  ifelse(YR %in% 2000:2009, "2000-2009",
  ifelse(YR %in% 2010:2016, "2010-2016", "-99")))))
16. Group the dataset by DECADE.
grp <- group_by(df, DECADE)
17. Summarize the data by calculating the mean value of acres burned.
sm <- summarize(grp, mean(ACRES))
18. Rename the columns of the data frame created by the summarize() function.
names(sm) <- c("DECADE", "MEAN_ACRES_BURNED")
19. Use the geom_col() function along with ggplot() to create a bar chart that displays the mean wildfire size by decade.
ggplot(data = sm) +
  geom_col(mapping = aes(x = DECADE, y = MEAN_ACRES_BURNED), fill = "red")
20. You can check your work against the solution file CS1_Exercise3B.R.
21. Save and run the script. If everything has been coded correctly you should see the following output. This bar chart indicates a clear trend toward larger wildfires with each passing decade, although it should be noted that the dataset only extends through 2016, so the results for the current decade may be different in a few years.
Exercise 4: Has the length of the fire season increased over time?
Wildfire season is generally defined as the time period between the year’s first and last large wildfires. The infographic below, from the Union of Concerned Scientists (https://www.ucsusa.org/global-warming/science-and-impacts/impacts/infographic-wildfiresclimate-change.html#.W1cji9hKj_Q), highlights the length of the wildfire season for the Western U.S. as a region. Local wildfire seasons vary by location but have almost universally become longer over the past 40 years.
In this exercise we’ll measure the length of the wildfire season over the past few decades for the region as a whole, as well as individual states.
1. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise4.R.
2. At the top of the script, load the packages that will be used in this exercise. Note that you will need to load the lubridate library for this exercise since we’ll be dealing with dates.
library(readr)
library(dplyr)
library(lubridate)
library(ggplot2)
3. The first few lines of this script will be similar to the previous exercises, so I won’t discuss the details of each line. By now you should be able to determine what each of these lines will accomplish anyway. Add the lines shown below to load the data, select the columns, and filter the data.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000) %>%
4. To measure the length of the wildfire season we’re going to convert the start date of each fire into the day of the year. For example, if a fire occurred on February 1st, it would be the 32nd day of the year. Use the mutate() function as seen below to accomplish this. The mutate() function uses the yday() lubridate function to convert the value for the STARTDATED column into the day of the year.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000) %>%
  mutate(DOY = yday(as.Date(STARTDATED, format = '%m/%d/%y %H:%M'))) %>%
5. Group the data by year.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000) %>%
  mutate(DOY = yday(as.Date(STARTDATED, format = '%m/%d/%y %H:%M'))) %>%
  group_by(YR) %>%
6. Get the earliest and latest start dates of the wildfires using the summarize() function.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000) %>%
  mutate(DOY = yday(as.Date(STARTDATED, format = '%m/%d/%y %H:%M'))) %>%
  group_by(YR) %>%
  summarize(dtEarly = min(DOY, na.rm = TRUE), dtLate = max(DOY, na.rm = TRUE)) %>%
7. Finally, use ggplot with two calls to geom_line() to create two line graphs that display the earliest and latest fire start dates by year. You’ll also add a smoothed regression line to both line graphs.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000) %>%
  mutate(DOY = yday(as.Date(STARTDATED, format = '%m/%d/%y %H:%M'))) %>%
  group_by(YR) %>%
  summarize(dtEarly = min(DOY, na.rm = TRUE), dtLate = max(DOY, na.rm = TRUE)) %>%
  ggplot() +
  geom_line(mapping = aes(x = YR, y = dtEarly, color = 'B')) +
  geom_line(mapping = aes(x = YR, y = dtLate, color = 'R')) +
  geom_smooth(method = lm, se = TRUE, aes(x = YR, y = dtEarly, color = "B")) +
  geom_smooth(method = lm, se = TRUE, aes(x = YR, y = dtLate, color = "R")) +
  xlab("Year") +
  ylab("Day of Year") +
  scale_colour_manual(name = "Legend", values = c("R" = "#FF0000", "B" = "#000000"), labels = c("First Fire", "Last Fire"))
8. You can check your work against the solution file CS1_Exercise4.R.
9. Save and run the script. If everything has been coded correctly you should see the following output. This chart shows a clear lengthening of the wildfire season, with the first fire date coming significantly earlier in recent years and the start date of the last fire coming later as well.
10. The last script examined the trends in wildfire season length for the entire study area, but you might want to examine these trends at a state level instead. This can be easily accomplished by adding a second condition to the filter. Update the filter as seen below and re-run the script to see the result.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000 & STATE == 'Arizona') %>%
  mutate(DOY = yday(as.Date(STARTDATED, format = '%m/%d/%y %H:%M'))) %>%
  group_by(YR) %>%
  summarize(dtEarly = min(DOY, na.rm = TRUE), dtLate = max(DOY, na.rm = TRUE)) %>%
  ggplot() +
  geom_line(mapping = aes(x = YR, y = dtEarly, color = 'B')) +
  geom_line(mapping = aes(x = YR, y = dtLate, color = 'R')) +
  geom_smooth(method = lm, se = TRUE, aes(x = YR, y = dtEarly, color = "B")) +
  geom_smooth(method = lm, se = TRUE, aes(x = YR, y = dtLate, color = "R")) +
  xlab("Year") +
  ylab("Day of Year") +
  scale_colour_manual(name = "Legend", values = c("R" = "#FF0000", "B" = "#000000"), labels = c("First Fire", "Last Fire"))
The State of Arizona shows an even bigger trend toward longer wildfire seasons. Try a few other states as well.
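The day-of-year conversion is the step in this exercise most likely to fail silently, because a format string that does not match the data yields NA values. The sketch below tries the conversion on four invented start dates. It assumes the STARTDATED values look like "2/1/15 0:00" (month/day/two-digit year, a space, then hours and minutes); check the actual column before relying on that. The seasonLength column is a hypothetical addition for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data; the date layout is an assumption about STARTDATED
fires <- tibble(
  YR = c(2015, 2015, 2016, 2016),
  STARTDATED = c("2/1/15 0:00", "9/30/15 14:30", "1/15/16 8:00", "12/1/16 0:00")
)

season <- fires %>%
  # Parse the text into a Date, then convert it to the day of the year
  mutate(DOY = yday(as.Date(STARTDATED, format = "%m/%d/%y %H:%M"))) %>%
  group_by(YR) %>%
  summarize(dtEarly = min(DOY, na.rm = TRUE),
            dtLate  = max(DOY, na.rm = TRUE)) %>%
  # Days between the first and last fire start in each year
  mutate(seasonLength = dtLate - dtEarly)

print(season)
```

February 1st comes out as day 32, matching the example in step 4. Subtracting dtEarly from dtLate gives a single season-length number per year, which can be plotted directly if one line is preferred over two.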
Exercise 5: Does the average wildfire size differ by federal organization?
To wrap up this chapter we’ll examine whether the average wildfire size differs by federal organization. The StudyArea.csv file includes a column (ORGANIZATI) that indicates the jurisdiction where the fire started. This column can be used to group the wildfires.
1. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS1_Exercise5.R.
2. At the top of the script, load the packages that will be used in this exercise.
library(readr)
library(dplyr)
library(ggplot2)
3. The first few lines of this script will be similar to the previous exercises, so I won’t discuss the details of each line. By now you should be able to determine what each of these lines will accomplish anyway. Add the lines shown below to load the data, select the columns, and filter the data.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORG = ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000) %>%
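4. Group the dataset by ORG and YR. Note that the filter is also extended in this step so that only the five organizations listed are included.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORG = ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000 & ORG %in% c('BIA', 'BLM', 'FS', 'FWS', 'NPS')) %>%
  group_by(ORG, YR) %>%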
5. Summarize the data by calculating the mean acreage burned by organization and year.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORG = ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000 & ORG %in% c('BIA', 'BLM', 'FS', 'FWS', 'NPS')) %>%
  group_by(ORG, YR) %>%
  summarize(meanacres = mean(ACRES)) %>%
6. Create a facet plot for the mean acreage burned by year for each organization.
df <- read_csv("StudyArea.csv", col_types = list(UNIT = col_character()), col_names = TRUE)
df %>%
  select(ORG = ORGANIZATI, STATE, YR = YEAR_, ACRES = TOTALACRES, CAUSE, STARTDATED) %>%
  filter(ACRES >= 1000 & ORG %in% c('BIA', 'BLM', 'FS', 'FWS', 'NPS')) %>%
  group_by(ORG, YR) %>%
  summarize(meanacres = mean(ACRES)) %>%
  ggplot(mapping = aes(x = YR, y = log(meanacres))) +
  geom_point() +
  facet_wrap(~ORG) +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Acres Burned by Federal Organization") +
  xlab("Year") +
  ylab("Log of Total Acres Burned")
7. You can check your work against the solution file CS1_Exercise5.R.
8. Save and run the script. If everything has been coded correctly you should see the following output. It appears as though all the federal agencies have experienced similar increases in the size of wildfires since 1980.
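Judging whether the facets show "similar increases" by eye can be difficult. One way to put a number on each panel is to compute the slope of the straight line that geom_smooth(method = lm) draws. The sketch below does this with the ordinary least-squares slope formula, cov(x, y) / var(x), on invented values; the toy table and the slope column name are assumptions for illustration, and this step is not part of the solution file.

```r
library(dplyr)

# Hypothetical toy data: BLM doubles every year, FS stays flat
toy <- tibble(
  ORG       = rep(c("BLM", "FS"), each = 4),
  YR        = rep(2000:2003, times = 2),
  meanacres = c(1000, 2000, 4000, 8000, 5000, 5000, 5000, 5000)
)

slopes <- toy %>%
  group_by(ORG) %>%
  # Least-squares slope of log(meanacres) against YR within each group
  summarize(slope = cov(YR, log(meanacres)) / var(YR))

print(slopes)
```

Because the y value is a natural log, a slope of about 0.69 corresponds to a doubling each year, while a slope of 0 means no change. Comparing these slopes across organizations gives a more concrete basis for the comparison than the plots alone.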
Chapter 11
Case Study – Single Family Residential Home and Rental Values
The Zillow Research group publishes several different measures of home values on a monthly basis, including median list prices, median sale prices, and the Zillow Home Value Index (ZHVI). The ZHVI is based on Zillow’s internal methodology for measuring home values over time. In addition, Zillow publishes a similar measure of rental values (ZRI) as well as a number of other real estate related datasets.
The methodology for ZHVI can be read in detail at https://www.zillow.com/research/zhvi-methodology-6032/, but the simple explanation is that Zillow takes all estimated home values for a given region and month (Zestimate), takes a median of these values, applies some adjustments to account for seasonality or errors in individual home estimates, and then does the same across all months over the past 20 years and for many different geography levels (ZIP, neighborhood, city, county, metro, state, and country). For example, if ZHVI was $400,000 in Seattle one month, that indicates that 50 percent of homes in the area are worth more than $400,000 and 50 percent are worth less (adjusting for seasonal fluctuations – e.g. prices tend to be low in December).
Zillow recommends using ZHVI to track home values over time for the very simple reason that ZHVI represents the whole housing stock and not just the homes that list or sell in a given month. Imagine a month where no homes outside of California sold. A national median price series or median list series would both spike. ZHVI, however, would remain a median of all homes across the country and wouldn’t skew toward California any more than in the previous month. ZHVI will always reflect the value of all homes and not just the ones that list or sell in a given month. In this chapter we’ll use some basic R visualization techniques to better understand residential real estate values and rental prices in the Austin, TX metropolitan area.
In this chapter we’ll cover the following topics:
• What is the trend for home values in the Austin metropolitan area?
• What is the trend for rental values in the Austin metropolitan area?
• Determining the price-rent ratio for the Austin metropolitan area.
• Comparing residential home values in Austin to other Texas metropolitan areas
Exercise 1: What is the trend for home values in the Austin metro area?
The County_Zhvi_SingleFamilyResidence.csv file in your IntroR\Data folder contains home value data from Zillow. The Zillow Home Value Index (ZHVI) is a smoothed, seasonally adjusted measure of the median estimated home value across a given region and housing type. It is a dollar-denominated alternative to repeat-sales indices. Zillow also publishes home value and other housing data for local markets, as well as a more detailed methodology and a comparison of ZHVI to the S&P CoreLogic Case-Shiller Home Price Indices. We’ll use this file for this particular exercise.
In this first exercise we’ll examine home values over the past couple of decades from the Austin metropolitan area.
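1. In your IntroR folder create a new folder called CaseStudy2. You can do this inside RStudio by going to the Files pane and selecting New Folder inside your working directory.
2. In RStudio select File | New File | R Script and then save the file to the CaseStudy2 folder with a name of CS2_Exercise1.R.
3. At the top of the script, load the packages that will be used in this exercise. The tidyr package is included because the gather() function used later in this exercise comes from it.
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)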
4. Use the read_csv() function from the readr package to load the data into a data frame.
df <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
5. Start a piping expression and define the columns that should be included in the data frame.
df %>%
  select(RegionName, State, Metro, `1996` = `1996-05`, `1997` = `1997-05`, `1998` = `1998-05`, `1999` = `1999-05`,
    `2000` = `2000-05`, `2001` = `2001-05`, `2002` = `2002-05`, `2003` = `2003-05`, `2004` = `2004-05`, `2005` = `2005-05`,
    `2006` = `2006-05`, `2007` = `2007-05`, `2008` = `2008-05`, `2009` = `2009-05`, `2010` = `2010-05`, `2011` = `2011-05`,
    `2012` = `2012-05`, `2013` = `2013-05`, `2014` = `2014-05`, `2015` = `2015-05`, `2016` = `2016-05`, `2017` = `2017-05`,
    `2018` = `2018-05`) %>%
6. Filter the data frame to include the Austin metropolitan area from the state of Texas.
df %>%
  select(RegionName, State, Metro, `1996` = `1996-05`, `1997` = `1997-05`, `1998` = `1998-05`, `1999` = `1999-05`,
    `2000` = `2000-05`, `2001` = `2001-05`, `2002` = `2002-05`, `2003` = `2003-05`, `2004` = `2004-05`, `2005` = `2005-05`,
    `2006` = `2006-05`, `2007` = `2007-05`, `2008` = `2008-05`, `2009` = `2009-05`, `2010` = `2010-05`, `2011` = `2011-05`,
    `2012` = `2012-05`, `2013` = `2013-05`, `2014` = `2014-05`, `2015` = `2015-05`, `2016` = `2016-05`, `2017` = `2017-05`,
    `2018` = `2018-05`) %>%
  filter(State == 'TX' & Metro == 'Austin') %>%
7. If you were to view the structure of the data frame at this point it would look like the screenshot below. A common problem in many datasets is that the column names are not variables but rather values of a variable. In the figure provided below, the columns that represent each year in the study are actually values of the variable YEAR. Each row in the existing table actually represents many annual observations. The tidyr package can be used to gather these existing columns into a new variable. In this case, we need to create a new column called YR and then gather the existing values in the annual columns into the new YR column. In the next step we’ll use the gather() function to accomplish this.
8. Use the gather() function to tidy up the data so that a new YR column is created, and rows for each county (RegionName) and year value are added.
df <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
df %>%
  select(RegionName, State, Metro, `1996` = `1996-05`, `1997` = `1997-05`, `1998` = `1998-05`, `1999` = `1999-05`,
    `2000` = `2000-05`, `2001` = `2001-05`, `2002` = `2002-05`, `2003` = `2003-05`, `2004` = `2004-05`, `2005` = `2005-05`,
    `2006` = `2006-05`, `2007` = `2007-05`, `2008` = `2008-05`, `2009` = `2009-05`, `2010` = `2010-05`, `2011` = `2011-05`,
    `2012` = `2012-05`, `2013` = `2013-05`, `2014` = `2014-05`, `2015` = `2015-05`, `2016` = `2016-05`, `2017` = `2017-05`,
    `2018` = `2018-05`) %>%
  filter(State == 'TX' & Metro == 'Austin') %>%
  gather(`1996`, `1997`, `1998`, `1999`, `2000`, `2001`, `2002`, `2003`, `2004`, `2005`, `2006`, `2007`, `2008`, `2009`,
    `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, `2018`, key = 'YR', value = 'ZHVI') %>%
9. If you were to view the result, the data frame would now appear as seen in the figure below.
10. Now we’re ready to plot the data. Add the code you see below to create a point plot that is grouped by RegionName (County).
df <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
df %>%
  select(RegionName, State, Metro, `1996` = `1996-05`, `1997` = `1997-05`, `1998` = `1998-05`, `1999` = `1999-05`,
    `2000` = `2000-05`, `2001` = `2001-05`, `2002` = `2002-05`, `2003` = `2003-05`, `2004` = `2004-05`, `2005` = `2005-05`,
    `2006` = `2006-05`, `2007` = `2007-05`, `2008` = `2008-05`, `2009` = `2009-05`, `2010` = `2010-05`, `2011` = `2011-05`,
    `2012` = `2012-05`, `2013` = `2013-05`, `2014` = `2014-05`, `2015` = `2015-05`, `2016` = `2016-05`, `2017` = `2017-05`,
    `2018` = `2018-05`) %>%
  filter(State == 'TX' & Metro == 'Austin') %>%
  gather(`1996`, `1997`, `1998`, `1999`, `2000`, `2001`, `2002`, `2003`, `2004`, `2005`, `2006`, `2007`, `2008`, `2009`,
    `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, `2018`, key = 'YR', value = 'ZHVI') %>%
  ggplot(mapping = aes(x = YR, y = ZHVI, colour = RegionName)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("Single Family Homes Values Have Increased in the Austin Metro Area") +
  xlab("Year") +
  ylab("Home Values")
11. You can check your work against the solution file CS2_Exercise1.R.
12. Save the script and then run it to see the output shown in the screenshot below. All counties in the Austin metropolitan area have experienced significantly increased values in the past couple of decades. The increase has been particularly noticeable since 2012.
13. Instead of a simple dot plot you might want to create a bar chart. Comment out the line of code that calls the existing ggplot() function and add a new line as seen below.
  ggplot(mapping = aes(x = YR, y = ZHVI, colour = RegionName)) +
  geom_col() +
  ggtitle("Single Family Homes Values Have Increased in the Austin Metro Area") +
  xlab("Year") +
  ylab("Home Values")
14. Save and run the script and the output should now appear as seen in the screenshot below. The upward trend in values seems even more obvious when viewed in this manner.
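The wide-to-long reshaping used in this exercise is easier to follow on a table small enough to read at a glance. The sketch below applies gather() to an invented two-county table and, for comparison, performs the same reshaping with pivot_longer(), the newer tidyr function that supersedes gather(). The wide table and its values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per year (values invented)
wide <- tibble(
  RegionName = c("Travis", "Hays"),
  `2016` = c(300000, 220000),
  `2017` = c(320000, 235000)
)

# gather(): the year columns become values of a new YR column
long_gather <- wide %>%
  gather(`2016`, `2017`, key = "YR", value = "ZHVI")

# pivot_longer(): same reshaping with the newer tidyr interface
long_pivot <- wide %>%
  pivot_longer(cols = c(`2016`, `2017`), names_to = "YR", values_to = "ZHVI")

print(long_gather)
print(long_pivot)
```

Both results contain one row per county and year; only the row order differs. In either case YR is a character column because it was built from column names, so converting it with as.integer() may be worthwhile before fitting a trend line against it.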
Exercise 2: What is the trend for rental rates in the Austin metro area?
The County_Zri_SingleFamilyResidenceRental.csv file in your IntroR\Data folder contains single family residential rental values from Zillow. The Zillow Rent Index (ZRI) is a smoothed, seasonally adjusted measure of the median estimated market rate rent across a given region and housing type. ZRI is a dollar-denominated alternative to repeat-rent indices.
In this exercise we’ll examine rent values over the past few years from the Austin metropolitan area.
1. In RStudio select File | New File | R Script and then save the file to the CaseStudy2 folder with a name of CS2_Exercise2.R.
2. At the top of the script, load the packages that will be used in this exercise. The tidyr package is included because it provides the gather() function.
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
3. Use the read_csv() function from the readr package to load the data into a data frame.
df <- read_csv("County_Zri_SingleFamilyResidenceRental.csv", col_names = TRUE)
4. Select the columns and filter the data. This dataset contains data from 2010 going forward. We’ll use data from December of the years 2010 to 2017 for the Austin, TX metropolitan area.
df <- read_csv("County_Zri_SingleFamilyResidenceRental.csv", col_names = TRUE)
df %>%
  select(RegionName, State, Metro, `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
    `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`) %>%
  filter(State == 'TX' & Metro == 'Austin') %>%
5. Gather the data.
df <- read_csv("County_Zri_SingleFamilyResidenceRental.csv", col_names = TRUE)
df %>%
  select(RegionName, State, Metro, `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
    `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`) %>%
  filter(State == 'TX' & Metro == 'Austin') %>%
  gather(`2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, key = 'YR', value = 'ZRI') %>%
6. Call the ggplot() function to plot the data. In this plot we’ll also add labels to each point.
df <- read_csv("County_Zri_SingleFamilyResidenceRental.csv", col_names = TRUE)
df %>%
  select(RegionName, State, Metro, `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
    `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`) %>%
  filter(State == 'TX' & Metro == 'Austin') %>%
  gather(`2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, key = 'YR', value = 'ZRI') %>%
  ggplot(mapping = aes(x = YR, y = ZRI, colour = RegionName)) +
  geom_point() +
  geom_text(aes(label = ZRI, vjust = -0.5), size = 3) +
  ggtitle("Single Family Rental Values Have Increased in the Austin Metro Area") +
  xlab("Year") +
  ylab("Rental Values")
7. You can check your work against the solution file CS2_Exercise2.R.
8. Save the script and then run it to see the output shown in the screenshot below.
Exercise 3: Determining the Price-Rent Ratio for the Austin metropolitan area
The price-to-rent ratio is a measure of the relative affordability of renting and buying in a given housing market. It is calculated as the ratio of home prices to annual rental rates. So, for example, in a real estate market where, on average, a home worth $200,000 could rent for $1,000 a month, the price-rent ratio is 16.67. That’s determined using the formula: $200,000 ÷ (12 x $1,000). In general, the lower the ratio, the more favorable the market is to real estate investors looking for residential property.
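The formula above translates directly into R. The sketch below is a minimal illustration using two invented tables (home and rent, with made-up values) that are matched on a shared YR column with dplyr’s inner_join() and then passed to mutate() to compute the ratio. The first row reproduces the $200,000 and $1,000 example from the text.

```r
library(dplyr)

# Hypothetical values: home prices (ZHVI) and monthly rents (ZRI) by year
home <- tibble(YR = c("2016", "2017"), ZHVI = c(200000, 240000))
rent <- tibble(YR = c("2016", "2017"), ZRI  = c(1000, 1250))

ratio <- inner_join(home, rent, by = "YR") %>%
  # Price-rent ratio = home value / annual rent (12 monthly payments)
  mutate(PriceRentRatio = ZHVI / (12 * ZRI))

print(ratio)
```

Multiplying the monthly rent by 12 is what puts both quantities on an annual footing; leaving that step out would inflate the ratio twelvefold.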
In this exercise you’ll join the Zillow home value data to the rental data, create a new column to hold the price-rent ratio, calculate the ratio, and plot the data as a bar chart.
1. In RStudio select File | New File | R Script and then save the file to the CaseStudy2 folder with a name of CS2_Exercise3.R.
2. At the top of the script, load the packages that will be used in this exercise. The tidyr package is included because it provides the gather() function.
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
3. In this step you’ll read the residential valuation information from the Zillow file, define the columns that should be used, filter the data, and gather the data. In this case we’re going to filter the data so that only Travis County is included. Add the following lines of code to your script to accomplish this task.
dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals <- select(dfHomeVals, RegionName, State, Metro, `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`,
  `2013` = `2013-12`, `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`)
dfHomeVals <- filter(dfHomeVals, State == 'TX' & Metro == 'Austin' & RegionName == 'Travis')
dfHomeVals <- gather(dfHomeVals, `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, key = 'YR', value = 'ZHVI')
4.Nowdothesamefortherentaldata.
dfRentVals<-read_csv(“County_Zri_SingleFamilyResidenceRental.csv”,col_names=TRUE)dfRentVals<-select(dfRentVals,RegionName,State,Metro,`2010`=`2010-12`,`2011`=`2011-12`,`2012`=`2012-12`,`2013`==`2010-12`,`2011`=`2011-12`,`2012`=`2012-12`,`2013`=12`,`2017`=`2017-12`)dfRentVals<-filter(dfRentVals,State==‘TX’&Metro==‘Austin’&RegionName==‘Travis’)dfRentVals<-gather(dfRentVals,`2010`,`2011`,`2012`,`2013`,`2014`,`2015`,`2016`,`2017`,key=’YR’,value=’ZRI’)
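If the effect of gather() is hard to picture, the sketch below applies it to a tiny made-up table. The county names and values are hypothetical and are not taken from the Zillow files. It also shows pivot_longer(), the newer tidyr function (tidyr 1.0.0 or later) that performs the same reshaping, in case you prefer it.

```r
library(tidyr)

# Hypothetical wide table: one row per county, one column per year.
wide <- data.frame(
  RegionName = c("CountyA", "CountyB"),
  `2016` = c(1500, 1300),
  `2017` = c(1550, 1350),
  check.names = FALSE  # keep the year column names exactly as written
)

# gather() as used in this exercise: the year columns become rows,
# with the old column name stored in YR and the cell value in ZRI.
long <- gather(wide, `2016`, `2017`, key = 'YR', value = 'ZRI')
long

# The same reshaping written with pivot_longer().
long2 <- pivot_longer(wide, cols = c(`2016`, `2017`), names_to = "YR", values_to = "ZRI")
```

Each county now has one row per year, and the YR column holds the year as text rather than as a number.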
5. The two previous steps created data frames for the residential home value and rental data. In this step we'll join those two data frames together using the dplyr package. Add the line of code you see below to your script. This uses the inner_join() function, which is the simplest type of join. An inner join matches pairs of observations whenever their keys are equal.
df <- inner_join(dfHomeVals, dfRentVals, by = 'YR')
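To see what "matches pairs of observations whenever their keys are equal" means in practice, here is a minimal sketch on two made-up data frames (the values are hypothetical). Only the years present in both inputs survive the join, and columns that share a name but are not part of the key receive .x and .y suffixes.

```r
library(dplyr)

# Hypothetical home values for 2015-2017 and rents for 2016-2018.
home <- tibble(RegionName = "CountyA", YR = c("2015", "2016", "2017"),
               ZHVI = c(250000, 270000, 290000))
rent <- tibble(RegionName = "CountyA", YR = c("2016", "2017", "2018"),
               ZRI = c(1500, 1550, 1600))

# Keep only the rows whose YR value appears in both data frames.
joined <- inner_join(home, rent, by = 'YR')
joined
```

In this sketch the result contains only the rows for 2016 and 2017, and RegionName appears twice as RegionName.x and RegionName.y.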
6. If you were to view the resulting data frame at this point it would look like the screenshot below. Notice that the ZHVI (residential home value) and ZRI (rental value) columns are attached.
7. Next, use the mutate() function to create a column called PriceRentRatio, and populate the rows using the calculation seen below.
df <- mutate(df, PriceRentRatio = ZHVI / (12 * ZRI))
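The mutate() call works row by row, so every year gets its own ratio. A small sketch on hypothetical values (not the Travis County figures) makes that visible:

```r
library(dplyr)

# Hypothetical joined data: one row per year with a home value and a rent.
df_example <- tibble(
  YR   = c("2016", "2017"),
  ZHVI = c(270000, 297600),
  ZRI  = c(1500, 1600)
)

# Add the ratio column; the formula is evaluated once for each row.
df_example <- mutate(df_example, PriceRentRatio = ZHVI / (12 * ZRI))
df_example$PriceRentRatio  # 15.0 15.5
```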
8. If you were to view the results of the mutate() function it would appear as seen in the screenshot below. Notice that each year includes a PriceRentRatio value that has been calculated.
9. Finally, create a bar chart using geom_col() with PriceRentRatio as the y axis, and YR as the x axis.
ggplot(data = df) + geom_col(mapping = aes(x = YR, y = PriceRentRatio), fill = "red")
10. Your entire script should appear as seen below.
library(readr)
library(dplyr)
library(tidyr) # provides gather()
library(ggplot2)

# home values for Travis County, one row per year
dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals <- select(dfHomeVals, RegionName, State, Metro,
                     `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
                     `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`)
dfHomeVals <- filter(dfHomeVals, State == 'TX' & Metro == 'Austin' & RegionName == 'Travis')
dfHomeVals <- gather(dfHomeVals, `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, key = 'YR', value = 'ZHVI')

# rental values for Travis County, one row per year
dfRentVals <- read_csv("County_Zri_SingleFamilyResidenceRental.csv", col_names = TRUE)
dfRentVals <- select(dfRentVals, RegionName, State, Metro,
                     `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
                     `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`)
dfRentVals <- filter(dfRentVals, State == 'TX' & Metro == 'Austin' & RegionName == 'Travis')
dfRentVals <- gather(dfRentVals, `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, key = 'YR', value = 'ZRI')

# join the two data frames, calculate the ratio, and plot it
df <- inner_join(dfHomeVals, dfRentVals, by = 'YR')
df <- mutate(df, PriceRentRatio = ZHVI / (12 * ZRI))
ggplot(data = df) + geom_col(mapping = aes(x = YR, y = PriceRentRatio), fill = "red")
11. You can also check your work against the solution file CS2_Exercise3.R.
12. Save the script and then run it to see the output shown in the screenshot below. Price-rent ratios have been steadily increasing during the current decade.
Exercise 4: Comparing residential home values in Austin to other Texas and U.S. metropolitan areas
In this exercise we'll compare residential home values from the Austin metropolitan area to other large metropolitan areas in Texas including San Antonio, Dallas, and Houston. For this exercise we'll create a box plot contained within a violin plot.
1. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS2_Exercise4.R.
2. At the top of the script, load the packages that will be used in this exercise.
library(readr)
library(dplyr)
library(tidyr) # provides gather(), which is used below
library(ggplot2)
3. Use the read_csv() function from the readr package to load the data into a data frame.
dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
4. Select the columns and filter the data. This dataset contains data from 2010 going forward. We'll use data from December of the years 2010 to 2017 for the Austin, TX metropolitan area.
dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
         `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`) %>%
5. Filter the data frame to include only Austin, San Antonio, Dallas-Fort Worth, and Houston. These are the four major metropolitan areas in the state.
dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
         `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`) %>%
  filter(State == 'TX' & Metro %in% c("Austin", "San Antonio", "Dallas-Fort Worth", "Houston")) %>%
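The %in% operator in the filter above tests whether each Metro value appears anywhere in the supplied vector, which is shorter than chaining several == comparisons with |. A minimal sketch on a made-up table (the rows are hypothetical, not Zillow records):

```r
library(dplyr)

# Hypothetical table of metropolitan areas.
metros <- tibble(
  State = c("TX", "TX", "TX", "CO"),
  Metro = c("Austin", "Houston", "El Paso", "Denver")
)

# Keep Texas rows whose Metro is one of the four listed names.
kept <- filter(metros, State == 'TX' &
                 Metro %in% c("Austin", "San Antonio", "Dallas-Fort Worth", "Houston"))
kept$Metro  # "Austin" "Houston"
```

Names in the vector that have no match in the data, such as San Antonio here, are simply ignored.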
6. Gather the data frame.
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
         `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`) %>%
  filter(State == 'TX' & Metro %in% c("Austin", "San Antonio", "Dallas-Fort Worth", "Houston")) %>%
  gather(`2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, key = 'YR', value = 'ZHVI') %>%
7. Group the data by metropolitan area.
dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
         `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`) %>%
  filter(State == 'TX' & Metro %in% c("Austin", "San Antonio", "Dallas-Fort Worth", "Houston")) %>%
  gather(`2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, key = 'YR', value = 'ZHVI') %>%
  group_by(Metro) %>%
8. Use ggplot() with geom_violin() and geom_boxplot() to create the plot.
dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2010` = `2010-12`, `2011` = `2011-12`, `2012` = `2012-12`, `2013` = `2013-12`,
         `2014` = `2014-12`, `2015` = `2015-12`, `2016` = `2016-12`, `2017` = `2017-12`) %>%
  filter(State == 'TX' & Metro %in% c("Austin", "San Antonio", "Dallas-Fort Worth", "Houston")) %>%
  gather(`2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, key = 'YR', value = 'ZHVI') %>%
  group_by(Metro) %>%
  ggplot(mapping = aes(x = Metro, y = ZHVI)) +
  geom_violin() +
  geom_boxplot(width = 0.1) +
  ggtitle("ZHVI for Metro Texas") +
  xlab("Metro") +
  ylab("ZHVI")
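If you would like to experiment with the violin and box plot layers before running them against the Zillow file, the sketch below builds the same kind of plot from randomly generated values. The metro names and numbers are invented purely for illustration.

```r
library(ggplot2)

# Hypothetical home values for two made-up metropolitan areas.
set.seed(1)
vals <- data.frame(
  Metro = rep(c("MetroA", "MetroB"), each = 50),
  ZHVI  = c(rnorm(50, mean = 300000, sd = 40000),
            rnorm(50, mean = 200000, sd = 25000))
)

# The violin shows the shape of each distribution; the narrow box plot
# drawn on top of it marks the median and quartiles.
p <- ggplot(vals, mapping = aes(x = Metro, y = ZHVI)) +
  geom_violin() +
  geom_boxplot(width = 0.1)
p
```

The layers are drawn in the order they are added, so geom_boxplot() comes second and sits inside the violin. The width = 0.1 argument keeps the box narrow enough that it does not hide the violin outline.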
9. You can also check your work against the solution file CS2_Exercise4.R.
10. Save the script and then run it to see the output shown in the screenshot below.
11. Challenge: Update the script to include the following metropolitan areas: Austin, Denver, Phoenix, Salt Lake City, Boise, Portland. You can check your code against the solution file CS2_Exercise4.R. The output plot should appear as seen in the screenshot below.
12. Finally, we'll create a script that displays the ZHVI values for each metropolitan area in a facet plot. In RStudio select File | New File | R Script and then save the file to the CaseStudy1 folder with a name of CS2_Exercise4B.R.
13. At the top of the script, load the packages that will be used in this exercise.
library(readr)
library(dplyr)
library(tidyr) # provides gather(), which is used below
library(ggplot2)
14. Use the read_csv() function from the readr package to load the data into a data frame.
dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
15. Define the columns to use. In this case we'll use the years 2000-2017.
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2000` = `2000-05`, `2001` = `2001-05`, `2002` = `2002-05`, `2003` = `2003-05`,
         `2004` = `2004-05`, `2005` = `2005-05`, `2006` = `2006-05`, `2007` = `2007-05`,
         `2008` = `2008-05`, `2009` = `2009-05`, `2010` = `2010-05`, `2011` = `2011-05`,
         `2012` = `2012-05`, `2013` = `2013-05`, `2014` = `2014-05`, `2015` = `2015-05`,
         `2016` = `2016-05`, `2017` = `2017-05`, `2018` = `2018-05`) %>%
16. Filter the data frame to include only specific metropolitan areas.
dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2000` = `2000-05`, `2001` = `2001-05`, `2002` = `2002-05`, `2003` = `2003-05`,
         `2004` = `2004-05`, `2005` = `2005-05`, `2006` = `2006-05`, `2007` = `2007-05`,
         `2008` = `2008-05`, `2009` = `2009-05`, `2010` = `2010-05`, `2011` = `2011-05`,
         `2012` = `2012-05`, `2013` = `2013-05`, `2014` = `2014-05`, `2015` = `2015-05`,
         `2016` = `2016-05`, `2017` = `2017-05`, `2018` = `2018-05`) %>%
  filter(Metro %in% c("Austin", "Denver", "Phoenix", "Portland", "Salt Lake City")) %>%
17. Gather the data.
dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2000` = `2000-05`, `2001` = `2001-05`, `2002` = `2002-05`, `2003` = `2003-05`,
         `2004` = `2004-05`, `2005` = `2005-05`, `2006` = `2006-05`, `2007` = `2007-05`,
         `2008` = `2008-05`, `2009` = `2009-05`, `2010` = `2010-05`, `2011` = `2011-05`,
         `2012` = `2012-05`, `2013` = `2013-05`, `2014` = `2014-05`, `2015` = `2015-05`,
         `2016` = `2016-05`, `2017` = `2017-05`, `2018` = `2018-05`) %>%
  filter(Metro %in% c("Austin", "Denver", "Phoenix", "Portland", "Salt Lake City")) %>%
  gather(`2000`, `2001`, `2002`, `2003`, `2004`, `2005`, `2006`, `2007`, `2008`, `2009`, `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, key = 'YR', value = 'ZHVI') %>%
18. Group the data by metropolitan area.
dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2000` = `2000-05`, `2001` = `2001-05`, `2002` = `2002-05`, `2003` = `2003-05`,
         `2004` = `2004-05`, `2005` = `2005-05`, `2006` = `2006-05`, `2007` = `2007-05`,
         `2008` = `2008-05`, `2009` = `2009-05`, `2010` = `2010-05`, `2011` = `2011-05`,
         `2012` = `2012-05`, `2013` = `2013-05`, `2014` = `2014-05`, `2015` = `2015-05`,
         `2016` = `2016-05`, `2017` = `2017-05`, `2018` = `2018-05`) %>%
  filter(Metro %in% c("Austin", "Denver", "Phoenix", "Portland", "Salt Lake City")) %>%
  gather(`2000`, `2001`, `2002`, `2003`, `2004`, `2005`, `2006`, `2007`, `2008`, `2009`, `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, key = 'YR', value = 'ZHVI') %>%
  group_by(Metro) %>%
19. Plot the data as a facet plot.
dfHomeVals <- read_csv("County_Zhvi_SingleFamilyResidence.csv", col_names = TRUE)
dfHomeVals %>%
  select(RegionName, State, Metro,
         `2000` = `2000-05`, `2001` = `2001-05`, `2002` = `2002-05`, `2003` = `2003-05`,
         `2004` = `2004-05`, `2005` = `2005-05`, `2006` = `2006-05`, `2007` = `2007-05`,
         `2008` = `2008-05`, `2009` = `2009-05`, `2010` = `2010-05`, `2011` = `2011-05`,
         `2012` = `2012-05`, `2013` = `2013-05`, `2014` = `2014-05`, `2015` = `2015-05`,
         `2016` = `2016-05`, `2017` = `2017-05`, `2018` = `2018-05`) %>%
  filter(Metro %in% c("Austin", "Denver", "Phoenix", "Portland", "Salt Lake City")) %>%
  gather(`2000`, `2001`, `2002`, `2003`, `2004`, `2005`, `2006`, `2007`, `2008`, `2009`, `2010`, `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, key = 'YR', value = 'ZHVI') %>%
  group_by(Metro) %>%
  ggplot(mapping = aes(x = YR, y = ZHVI)) +
  geom_point() +
  facet_wrap(~Metro) +
  geom_smooth(method = lm, se = TRUE) +
  ggtitle("ZHVI by Metro Area") +
  xlab("Year") +
  ylab("ZHVI")
20. You can also check your work against the solution file CS2_Exercise4B.R.
21. Save the script and then run it to see the output shown in the screenshot below.
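One thing that may be worth checking if the trend line does not appear in your facet plot: gather() stores the key column (YR here) as text unless convert = TRUE is passed, and with a text x axis geom_smooth() may not be able to fit a line across the years. The sketch below, built on made-up values rather than the Zillow data, converts the year to an integer before plotting. Treat it as one possible adjustment, not as the approach used in the solution file.

```r
library(ggplot2)

# Hypothetical long-format data with YR stored as text,
# which is how gather() returns the key column by default.
long <- data.frame(
  Metro = rep(c("MetroA", "MetroB"), each = 6),
  YR    = rep(as.character(2012:2017), times = 2),
  ZHVI  = c(200, 210, 225, 240, 250, 265,
            300, 305, 320, 330, 350, 360) * 1000
)

# Convert the year labels to integers so the linear model
# treats the year as a continuous predictor.
long$YR <- as.integer(long$YR)

p <- ggplot(long, mapping = aes(x = YR, y = ZHVI)) +
  geom_point() +
  facet_wrap(~Metro) +
  geom_smooth(method = "lm", se = TRUE)
p
```

The same conversion could be done inside a pipeline with mutate(YR = as.integer(YR)) placed just before the ggplot() call.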
Data Visualization and Exploration with R
Today, data science is an indispensable tool for any organization, allowing for the analysis and optimization of decisions and strategy. R has become the preferred software for data science, thanks to its open source nature, simplicity, applicability to data analysis, and the abundance of libraries for any type of algorithm.
This book will allow the student to learn, in detail, the fundamentals of the R language and additionally master some of the most efficient libraries for data visualization in chart, graph, and map formats. The reader will learn the language and applications through examples and practice. No prior programming skills are required.
We begin with the installation and configuration of the R environment through RStudio. As you progress through the exercises in this hands-on book you'll become thoroughly acquainted with R's features and the popular tidyverse package. With this book, you will learn about the basic concepts of R programming, work efficiently with graphs, charts, and maps, and create publication-ready documents using real world data. The detailed step-by-step instructions will enable you to get a clean set of data, produce engaging visualizations, and create reports for the results.
What you will learn to do in this book:
Introduction to the R programming language and RStudio
Using the tidyverse package for data loading, transformation, and visualization
Get a tour of the most important data structures in R
Learn techniques for importing data, manipulating data, performing analysis, and producing useful data visualization
Data visualization techniques with ggplot2
Geographic visualization and maps with ggmap
Turning your analyses into high quality documents, reports, and presentations with R Markdown
Hands-on case studies designed to replicate real world projects and reinforce the knowledge you learn in the book
For more information visit geospatialtraining.com!