Data Science Boot Camp Survival Manual
Table of Contents
1. Prologue
2. Chapter 0 - Data Scientist's Toolbox
3. Chapter 1 - R Programming
4. Chapter 2 - Getting and Cleaning Data
5. Chapter 3 - Exploratory Data Analysis
6. Chapter 4 - Reproducible Research
7. Chapter 5 - Statistical Inference
8. Chapter 6 - Regression Models
9. Chapter 7 - Practical Machine Learning
10. Chapter 8 - Developing Data Products
11. Capstone
12. Epilogue
Prologue
Welcome recruits!
During the next year you will learn the fundamentals of data science. The Data Science Specialization, offered by Johns Hopkins University, is challenging. Success requires a strategy. This book aims to equip each of you with the knowledge and skills to complete boot-camp. The "Data Science Boot-Camp Survival Manual" alone cannot guarantee success. Listen to the instructors' lectures and apply yourself to the evaluations throughout your training.
According to Jeff Leek and the Data Science Specialization Team, the key word in data science is "science". To this end, the focus of the ten-course series, including a capstone project, is to provide the learner with:
1. an introduction to the key ideas behind reproducible research,
2. an introduction to the tools and techniques to transform raw data into a presentable report,
3. an opportunity to gain hands-on practice so you can learn the techniques for yourself, and
4. an appreciation of the mathematics & statistics involved in data science.
Core Courses
The courses comprising the Data Science Specialization are:
* Data Scientist's Toolbox
* R Programming
* Getting and Cleaning Data
* Exploratory Data Analysis
* Reproducible Research
* Statistical Inference
* Regression Models
* Practical Machine Learning
* Developing Data Products
These courses, taught by Brian Caffo, Jeff Leek, and Roger D. Peng, enable the learner to acquire the foundational skills. While the lectures and assignments build these foundational skills, learners often require further explanation. The course forums allow learners to discuss the lecture topics and assignments. Yet each session of a course begins without the shared knowledge of previous participants. Serving as a Community Teaching Assistant (CTA) made it clear that a companion guide would be beneficial.
Are you up to the challenge of Johns Hopkins University's Data Science Specialization?
Structure of the Boot-Camp Survival Manual
Each chapter covers one of the core courses. A tutorial style balancing theory and practical application makes surviving data science boot-camp possible. You learn the workflow typically involved in all phases of a data analysis project.
Chapter 0: The Data Scientist's Toolbox
URL: https://www.coursera.org/course/datascitoolbox
Synopsis: "Get an overview of the data, questions, and tools that data analysts and data scientists work with. This is the first course in the Johns Hopkins Data Science Specialization."
Chapter 1: R Programming
URL: https://www.coursera.org/course/rprog
Synopsis: "Learn how to program in R and how to use R for effective data analysis. This is the second course in the Johns Hopkins Data Science Specialization."
Chapter 2: Getting and Cleaning Data
URL: https://www.coursera.org/course/getdata
Synopsis: "Learn how to gather, clean, and manage data from a variety of sources. This is the third course in the Johns Hopkins Data Science Specialization."
Chapter 3: Exploratory Data Analysis
URL: https://www.coursera.org/course/exdata
Synopsis: "Learn the essential exploratory techniques for summarizing data. This is the fourth course in the Johns Hopkins Data Science Specialization."
Chapter 4: Reproducible Research
URL: https://www.coursera.org/course/repdata
Synopsis: "Learn the concepts and tools behind reporting modern data analyses in a reproducible manner. This is the fifth course in the Johns Hopkins Data Science Specialization."
Chapter 5: Statistical Inference
URL: https://www.coursera.org/course/statinference
Synopsis: "Learn how to draw conclusions about populations or scientific truths from data. This is the sixth course in the Johns Hopkins Data Science Course Track."
Chapter 6: Regression Models
URL: https://www.coursera.org/course/regmods
Synopsis: "Learn how to use regression models, the most important statistical analysis tool in the data scientist's toolkit. This is the seventh course in the Johns Hopkins Data Science Specialization."
Chapter 7: Practical Machine Learning
URL: https://www.coursera.org/course/predmachlearn
Synopsis: "Learn the basic components of building and applying prediction functions with an emphasis on practical applications. This is the eighth course in the Johns Hopkins Data Science Specialization."
Chapter 8: Developing Data Products
URL: https://www.coursera.org/course/devdataprod
Synopsis: "Learn the basics of creating data products using Shiny, R packages, and interactive graphics. This is the ninth course in the Johns Hopkins Data Science Specialization."
Data Science Capstone
URL: https://www.coursera.org/course/dsscapstone
Synopsis: "The capstone project class will allow students to create a usable/public data product that can be used to show your skills to potential employers. Projects will be drawn from real-world problems and will be conducted with industry, government, and academic partners."
Course synopses quoted from the course information pages at Coursera as at 1 April 2015.
Course Dependency and Recommended Sequence
Although the courses are standalone, the knowledge is cumulative. The pedagogical course dependencies are available from Johns Hopkins University.
Figure 1 Course dependency diagram provided by Daniel M. Bontje (created 17 November 2014)
You need a language or system to perform the tasks (R Programming) and data to analyse (Getting and Cleaning Data). You then get a sense of the data (Exploratory Data Analysis) before building models and drawing inferences (Statistical Inference, Regression Models) or making predictions (Practical Machine Learning) from the data. Finally, you present your conclusions and supporting evidence (Developing Data Products, Reproducible Research).
The recommended mathematics background is linear algebra and introductory statistics (descriptive and inferential). Statistical Inference and Regression Models, courses in this specialisation, cover all the basic statistical concepts, forming a solid foundation for subsequent courses in the Data Science Specialization. These courses, along with Practical Machine Learning, are the theoretical underpinnings, while the other six courses are applied in nature: obtaining data, scrubbing data, exploring data, modeling data, and interpreting data, collectively known as the OSEMN (pronounced "awesome") model.
Again, welcome to the Data Science Boot-Camp. Review the "Data Science Boot-Camp Survival Manual" on a regular basis throughout your training.
Chapter 0 - The Data Scientist's Toolbox
We shall neither fail nor falter; we shall not weaken or tire... give us the tools and we will finish the job. - Winston Churchill, Prime Minister of Great Britain
Primary Instructor: Jeff Leek, MS, PhD (Biostatistics)
The foundational course Data Scientist's Toolbox is a high-level overview of the specialisation. This course lays the groundwork for the nine-course series plus capstone project. It takes a comprehensive approach, teaching fundamental skills for data science regardless of data set.
The key word in "data science" is science, not data. The method is not dependent upon the data set size; it scales from small data to big data. The data science method equates to the scientific method used in the natural sciences. The Financial Times article, "Big data: are we making a big mistake?", argues for a rigorous methodology. An article, "The Data Science Methodology", published on Data Science 101, argues for adoption of the scientific method familiar to scientists in the natural sciences.
Data Science Methodology
1. problem formulation (hypothesis)
2. obtain data (experiment)
3. analysis (validate or refute hypothesis)
4. data product (report)
The courses in the Johns Hopkins University Data Science Specialization "emphasise a data science methodology rather than focusing primarily on data science technique. [T]he instructors have taken care throughout to demonstrate a responsible, scientifically-based approach to collecting, curating and analyzing data sources," says specialisation participant John Frederick Thiels.
Learning Objectives
By the end of this chapter you will have learned the basic skills needed to use the various tools required throughout the book and the data science specialisation courses.
Tools of the Trade
To successfully complete the hands-on exercises in the book and course assignments (quizzes, programming, and course projects) some software must be installed on your computer or in a hosted environment: Git, R and RStudio. A GitHub account is mandatory because peer-assessed submissions must be accessible. Internet access is necessary to fully participate in the courses: watching or downloading lectures, taking quizzes, submitting programming assignments, and participating in the peer assessment process. Due to the variety of operating system platforms on which the software can be deployed, this book focuses solely on Ubuntu Linux running locally or remotely in a virtualised environment.
Before delving into how to use the various tools in our toolbox it is important to consider the types of skills we need as data-scientists-in-training. Firstly, linear algebra, probability and calculus at the introductory level are sufficient mathematics. Secondly, introductory descriptive and inferential statistics, including hypothesis testing, is the recommended statistics background. Thirdly, basic programming skills are recommended. None of the aforementioned skills are mandatory for the Data Science Specialization. For those readers seeking to learn any of these skills there are courses available, including:
* Pre-Calculus - Instructors: Sarah Eichhorn and Rachel Cohen Lehman, University of California, Irvine
* Probability - Instructor: Santosh S. Venkatesh
* Calculus: Single Variable - Instructor: Robert Ghrist, University of Pennsylvania
* Descriptive Statistics - Instructor: Matthijs Rooduijn, University of Amsterdam
* Inferential Statistics - Instructor: Annemarie Zand Scholten, University of Amsterdam
* Data Analysis and Statistical Inference - Instructor: Mine Çetinkaya-Rundel, Duke University
* Programming for Everybody (Python) - Instructor: Charles Severance, University of Michigan
Programming for Everybody (Python) deserves special mention because it is consistently highly rated by course participants for the teaching style of "Dr. Chuck." You do not have to be a geek to enjoy this course.
Read the information page of each course, especially if you prefer a self-teaching approach to learning. There are freely available textbooks for some of these courses.
Virtualisation Software
While the various applications required for these courses can be installed on the host operating system of your computer, we recommend using virtualisation software such as Oracle VirtualBox, VMware Workstation, Fusion or Player, or Parallels Desktop, depending upon the operating system running on the computer. Another virtualisation option is the RStudio Server Amazon Machine Image (AMI), or rolling your own local or hosted virtual machine instance.
This section describes two scenarios:
* importing a ready-made disk image (AMI) of Ubuntu Linux 14.04 LTS (64-bit) on the Amazon Web Services Elastic Compute Cloud (AWS EC2) hosting platform, and
* importing a ready-made disk image of Ubuntu Linux 14.04 LTS (32-bit or 64-bit) into Oracle VirtualBox on your computer.
An advantage of virtualisation software, whether running on your computer or hosted remotely by a service provider, is that all the required applications are kept separate from your computer's operating system and are by default isolated from the host file system.
Option A: Amazon Web Services Elastic Compute Cloud (EC2) with Amazon Machine Image
If you prefer installing Oracle VirtualBox and creating a virtual machine on your computer, you can skip this section.
Instructions are forthcoming.
Option B: Local Computer with Oracle VirtualBox
Please consult the instructions about downloading and installing Oracle VirtualBox onto your computer before proceeding.
Download the ready-made disk image of Ubuntu Linux (32-bit or 64-bit) based on the version supported by Oracle VirtualBox and the architecture of the computer.
Note: Some computers are 64-bit but only allow 32-bit operating systems to run within virtualisation software.
Extract the compressed archive containing the disk image using p7zip.
    $ 7za e Ubuntu_14.04.2-32bit.7z
    7-Zip (A) [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
    p7zip Version 9.20 (locale=en_CA.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)
    Processing archive: Ubuntu_14.04.2-32bit.7z
    Extracting  32bit/Ubuntu 14.04.2 (32bit).vdi
    Extracting  32bit
    Everything is Ok
    Folders: 1
    Files: 1
    Size:       3807379456
    Compressed: 776252068
    $
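The note about 64-bit computers that only run 32-bit guests usually comes down to whether hardware virtualisation is available. The following is a minimal sketch, not part of the original instructions, that suggests which image to download. It assumes a Linux host that exposes /proc/cpuinfo, and the 64-bit archive name is a hypothetical counterpart of the 32-bit name used above. A CPU flag can be present yet disabled in the firmware settings, so treat the result as a hint only.

```shell
# Sketch: suggest a 32-bit or 64-bit guest image for this host.
# Assumptions: Linux host with /proc/cpuinfo; the 64-bit archive name is hypothetical.
machine="$(uname -m)"      # for example x86_64 or i686
image_bits=32              # conservative default

case "$machine" in
  x86_64|amd64)
    # vmx (Intel VT-x) or svm (AMD-V) indicates hardware virtualisation support
    if [ -r /proc/cpuinfo ] && grep -Eq '(vmx|svm)' /proc/cpuinfo; then
      image_bits=64
    fi
    ;;
esac

echo "Host architecture: $machine"
echo "Suggested archive: Ubuntu_14.04.2-${image_bits}bit.7z"
```

If the script suggests 32-bit on a 64-bit machine, the 32-bit image shown in the extraction example is the safer choice.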
After installing Oracle VirtualBox, launch it so we can import the virtual machine disk image (.vdi).
Figure 0.1 Creating a new virtual machine instance
Click 'New' on the main menu. A dialogue box appears where you enter the name to assign to the virtual machine and select the operating system and version. Click 'Next' to continue.
Figure 0.2 Allocating system memory to the new virtual machine instance
Select the amount of system memory (RAM) to allocate to the virtual machine. Allocate 2048 MB of system memory to this virtual machine instance. This parameter can be modified later if necessary. Click 'Next' to continue.
Figure 0.3 Associating an existing virtual hard drive to the new virtual machine instance
Select 'Use an existing virtual hard drive file' and click on the file folder icon to navigate to the virtual hard drive file previously downloaded and uncompressed. Click 'Create' to associate this disk image with the current virtual machine.
Figure 0.4 Mount the VirtualBox Guest Additions ISO
Make the VirtualBox Guest Additions ISO accessible to the virtual machine instance. At the main screen of Oracle VirtualBox select the Data Scientists Toolbox virtual machine. Click 'Settings', then 'Storage', followed by 'Empty'.
Figure 0.5 Mount the VirtualBox Guest Additions ISO
Click the CD/DVD icon and select VBoxGuestAdditions.iso from the drop-down list. Click 'OK' to return to the main screen.
Figure 0.6 Starting the new virtual machine instance
At the main screen of Oracle VirtualBox select the newly created virtual machine instance. Click 'Start' to launch the virtual machine. At the login prompt type the password from the download web page.
The final preparatory step is enabling the VirtualBox Guest Additions and updating any out-of-date packages installed on the virtual machine. Open a terminal window (CTRL+ALT+T).
Activate the VirtualBox Guest Additions so the virtual machine instance integrates with the host system.
    $ cd /media/osboxes/VBOXADDITIONS*
    $ sudo sh VBoxLinuxAdditions.run
Upon successful installation, shut down the virtual machine instance by clicking the Gear icon in the upper right corner of the virtual machine, then unmount the VirtualBox Guest Additions ISO by reversing the steps shown in Figures 0.4 and 0.5. Alternatively, you may choose to leave the VirtualBox Guest Additions ISO attached.
Note: Whenever an updated Linux kernel is installed as part of the normal update process, the VirtualBox Guest Additions will have to be reapplied to ensure the shared clipboard, for example, continues to work. Do NOT forget to restart the virtual machine instance so the VirtualBox Guest Additions are activated.
Figure 0.7 Enable/Disable shared clipboard and drag-and-drop
A shared clipboard between your computer and the virtual machine instance is configurable via the 'Settings' menu.
Figure 0.8 Pointing device and device boot-order configuration
The mouse device type should be configured as 'PS/2 Mouse' whether using a wired or wireless mouse. The device boot order should be configured to ensure the virtual disk image is the default boot device.
Restart the virtual machine instance.
Switching between standard mode and full-screen mode is as easy as Host_Key+F (RIGHT_CTRL+F by default).
For convenience, launch a terminal session (CTRL+ALT+T) and, when its icon appears in the application bar, right-click it and select 'Lock to Launcher'. From this point forward, any time a terminal session is wanted simply click the 'Terminal' icon.
Figure 0.9 System settings configuration
Before proceeding with updating the currently installed system software and applications, select an Ubuntu Linux package repository in geographic proximity to your location. This can be accomplished by clicking the 'System Settings' icon in the application bar along the left edge of the screen. Click 'Software & Updates'.
Next, open a terminal session (CTRL+ALT+T). When the terminal displays the shell prompt, type the following commands to update and upgrade the currently installed system software and applications. If you see the 'Software Updater' icon in the application bar, you can apply software updates by clicking the icon instead.
    $ sudo apt-get -y update
    $ sudo apt-get -y upgrade
Figure 0.10 Editing the user name, password, language preference and enabling automatic login
Automatic login can be enabled, and the display name for the user account and the password can be changed, if desired, via 'System Settings' by clicking 'User Accounts'.
Figure 0.11 Automatic login enabled
Click 'Unlock' to enable editing of the user account configuration. Type the current password when prompted. If you want to change the account name, click 'osboxes.org' and type the desired account name. If you want to change the password, click on the asterisks and type the desired password. If you want to enable automatic login, click 'OFF' so that 'ON' is visible. Finally, click 'Lock' to relock the user account configuration.
Getting Familiar with the Command-Line Interface (CLI)
After a short detour to familiarise ourselves with the command-line interface (CLI) we will install Git, R, and RStudio. Rest assured that interacting with the command line is not required beyond this chapter. RStudio provides seamless integration with the file system to navigate and manipulate files, version control and repository synchronisation between your computer and repository hosting services, and the statistical computation and software development environment.
15-minute Introduction to Navigating and Manipulating the File System from the Terminal
Let's start exploring the basic features of the environment from the comfort of a terminal session and the command line. Open a terminal window (CTRL+ALT+T) if you are running a graphical desktop environment. By learning a few basic commands to navigate and manipulate the file system you will feel at ease and understand what is going on behind the scenes within the Files panel of RStudio.
| Command | Description | Common Flags | Arguments |
|---|---|---|---|
| pwd | print working directory name | | |
| ls | list file and/or directory names | -l (long form), -a (hidden), -R (recursive) | [directory_path/][pattern] (optional) |
| mkdir | make directory | | [directory_path/]directory_name or [directory_path/]directory_name_list (mandatory) |
| cd | change directory | | [directory_path/][directory_name] (optional) |
| touch | create an empty file | | [directory_path/]file_name (mandatory) |
| echo | write a string (to stdout by default; redirect with > to create a file) | -e (interpret backslash escapes), -n (no trailing newline) | "a string of characters" (mandatory) |
| cp | copy file or directory | -r (recursive) | (source) [directory_path/][file_name] (target) [directory_path/][file_name] (mandatory) |
| mv | move file or directory | | (source) [directory_path/][file_name] (target) [directory_path/][file_name] (mandatory) |
| rm | remove/delete file or directory | -f (force), -r (recursive) | [directory_path/][file_name] (mandatory) |
Arguments in brackets are optional, but if the 'mandatory' designation is present, at least one of the arguments must be supplied. Directory names and paths, as well as file names, may contain wildcard characters (* and ?) when used with some of these commands.
Table 0.1 Basic File and Directory Commands
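The wildcard characters mentioned with Table 0.1 are expanded by the shell before the command runs. The sketch below illustrates the difference between '*' and '?' in a throwaway directory. The file names are hypothetical and chosen only for the illustration, and the scratch directory is created with mktemp so that nothing in your home directory is touched.

```shell
set -eu
workdir="$(mktemp -d)"      # scratch directory (hypothetical location chosen by mktemp)
cd "$workdir"
touch file01.txt file02.txt file10.txt notes.md

# '*' matches any run of characters, including none
all_txt="$(echo *.txt)"
# '?' matches exactly one character
single="$(echo file0?.txt)"

echo "*.txt      -> $all_txt"
echo "file0?.txt -> $single"

cd /
rm -rf "$workdir"           # tidy up the scratch directory
```

Because the shell performs the expansion, the same patterns work with ls, cp, mv and rm.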
For each example, type the commands to the right of the command prompt ($) to follow along interactively. Take your time working through the commands until you fully understand why each command produces the observed results.
Example 1: Determine the current working directory
    $ pwd
    /home/osboxes
Example 2: List the file and subdirectory names in the current working directory
    $ ls
    Desktop    Downloads         Music     Public     Videos
    Documents  examples.desktop  Pictures  Templates
Example 3: Create a subdirectory named 'test' in the current directory
    $ mkdir test
    $ cd test
    $ pwd
    /home/osboxes/test
Example 4: Create subdirectories named '1', '2', '3', and '4' in the current directory
    $ mkdir {1,2,3,4}
Example 5: List the files and subdirectory names in the current directory
    $ ls
    1  2  3  4
Example 6: Create some empty files and some files with content
    $ touch 1/file01.txt 2/file02.txt
    $ echo "Bonjour tout le monde"
    Bonjour tout le monde
    $ echo "Hello World!" > ./1/file0101.txt
    $ echo "To be or not to be" > ./3/file03.txt
Example 7: Change to the directory immediately above the current directory and list the files and subdirectory names in the subdirectory named '1'
    $ cd ..
    $ ls -l test/1
    total 4
    -rw-rw-r-- 1 osboxes osboxes 13 Apr  3 09:28 file0101.txt
    -rw-rw-r-- 1 osboxes osboxes  0 Apr  3 09:27 file01.txt
Example 8: List the files ending with '.txt' in the subdirectory named '3'
    $ ls -l test/3/*.txt
    -rw-rw-r-- 1 osboxes osboxes 19 Apr  3 09:29 test/3/file03.txt
Example 9: (a) Copy the file 'file02.txt' from the directory named '${HOME}/test/2' to the directory '${HOME}/test/4' and name the file 'file04.txt'
    $ cp ./test/2/file02.txt ./test/4/file04.txt
(b) Copy the file 'file02.txt' from the directory named '${HOME}/test/2' to the directory '${HOME}/test/4' and name the file 'file02.txt'
    $ cp ~/test/2/file02.txt ./test/4/file02.txt
Example 10: Make subdirectory '${HOME}/test/3' the current working directory and create a hidden file and a hidden subdirectory
    $ cd test/3
    $ touch .hidden01.txt
    $ mkdir .hidden
Example 11: List the names of the non-hidden files and subdirectories in the current directory, then list all names including the hidden ones
    $ ls
    file03.txt
    $ ls -a
    .  ..  file03.txt  .hidden  .hidden01.txt
Example 12: Create a subdirectory named 'another' in the home directory of the user and recursively copy the files and subdirectories from '${HOME}/test' to '${HOME}/another'
    $ mkdir ~/another
    $ cp -r ../* ~/another
Example 13: List the files and subdirectories in the home directory of the user
    $ ls ~
    another  Documents  examples.desktop  Pictures  Templates  Videos
    Desktop  Downloads  Music             Public    test
Example 14: List the file and subdirectory names in '${HOME}/another'
    $ ls ~/another
    1  2  3  4
Example 15: List the file names and, recursively, the subdirectories in '${HOME}/another'
    $ ls -R ~/another
    /home/osboxes/another:
    1  2  3  4
    /home/osboxes/another/1:
    file0101.txt  file01.txt
    /home/osboxes/another/2:
    file02.txt
    /home/osboxes/another/3:
    file03.txt
    /home/osboxes/another/4:
    file02.txt  file04.txt
Example 16: Create a subdirectory named 'test/5' in the home directory of the user and move (copy and delete) the files and/or subdirectories from '${HOME}/another'
    $ mkdir ~/test/5
    $ mv ~/another/* ../5
Example 17: List the file and subdirectory names in '${HOME}/another'
    $ ls -a /home/osboxes/another
    .  ..
Example 18: List the file and subdirectory names in '${HOME}/test/5'
    $ ls ../5
    1  2  3  4
Example 19: List the file names and, recursively, the subdirectories in '${HOME}/test/5'
    $ ls -R ~/test/5
    /home/osboxes/test/5:
    1  2  3  4
    /home/osboxes/test/5/1:
    file0101.txt  file01.txt
    /home/osboxes/test/5/2:
    file02.txt
    /home/osboxes/test/5/3:
    file03.txt
    /home/osboxes/test/5/4:
    file02.txt  file04.txt
Example 20: Make directory '/home/osboxes' the current working directory
    $ cd
    $ pwd
    /home/osboxes
Example 21: Delete the subdirectories 'test' and 'another' from '${HOME}', and then list the file and subdirectory names in the current directory
    $ rm -rf test another
    $ ls
    Desktop    Downloads         Music     Public     Videos
    Documents  examples.desktop  Pictures  Templates
Example 22: Close the terminal session
    $ exit
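Several of the examples above name the same directory in different ways: relative to the current directory ('../5'), via the tilde ('~/test/5'), and via the HOME variable ('${HOME}/test/5'). The following sketch, offered as an illustration only, checks that all three forms resolve to the same place. It points HOME at a temporary sandbox directory for the duration of the script (an assumption made so that your real home directory is left alone) and restores it afterwards.

```shell
set -eu
sandbox="$(mktemp -d)"            # temporary stand-in for /home/osboxes
old_home="${HOME:-}"
HOME="$sandbox"; export HOME

mkdir "$HOME/test" "$HOME/test/3" "$HOME/test/5"
cd "$HOME/test/3"                 # same starting point as Examples 10 to 19

relative="$(cd ../5 && pwd)"              # relative to the current directory
tilde="$(cd ~/test/5 && pwd)"             # '~' expands to the value of HOME
variable="$(cd "${HOME}/test/5" && pwd)"  # explicit variable

echo "relative: $relative"
echo "tilde:    $tilde"
echo "variable: $variable"

cd /
HOME="$old_home"; export HOME     # restore the original setting
rm -rf "$sandbox"
```

All three lines should print the same path, which is why Example 16 can create '~/test/5' and then refer to it as '../5'.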
A cheat sheet for the Bourne Again SHell (BASH) has been prepared by the folks at Learn Code the Hard Way (LCodeTHW). A complete manual for BASH is available from the GNU Project if you want to further explore the CLI and its capabilities.
Markdown - Writing Documentation the Easy Way
The markdown language, created by John Gruber, is relatively small and easy to learn, unlike markup languages such as HTML and XML. Taking a portion of this book as an example, with some minor changes to demonstrate particular features, we explore some of the more common markdown elements.
    Prologue
    ===

    # Introduction

    During the next year you will learn the fundamentals of data science. Surviving the nine courses which make up the [Data Science Specialization][0001] offered by [Johns Hopkins University][jhu] requires a **strategy**.

    To this end, the focus of the ten-course series including a capstone project is to provide the learner with:

    1. an introduction to the key ideas behind reproducible research,
    2. an introduction to the tools and techniques to transform raw data into a presentable report,
    4. an opportunity to gain hands-on practice so you can learn the techniques for yourself, and
    3. an appreciation of the mathematics & statistics involved in data science.

    ## Core Courses

    The courses comprising the Data Science Specialization are:

    * Data Scientist's Toolbox
    * R Programming
    * Exploratory Data Analysis
    * Getting and Cleaning Data
    * Reproducible Research
    * Statistical Inference
    * Regression Models
    * Practical Machine Learning
    * Developing Data Products

    ![Course Dependency](dst_courses.png)
    *Figure 1 Course dependency diagram*

    [0001]: https://www.coursera.org/specialization/jhudatascience/1?utm_medium=courseDescripTop
    [jhu]: http://www.jhu.edu
Listing 0.1 Sample markdown document
So that you can immediately practise each of the markdown elements used in the sample document, a concise description is supplied with references to the sample document.
Font Modifiers
There are two styles of font modifier supported by standard markdown:
* bold (text surrounded by **)
* italics (text surrounded by *)
From the sample document we see that 'strategy' is rendered in bold during conversion, whilst 'Figure 1 Course dependency diagram' is rendered in italics.
Headings
There are two styles of headers supported by standard markdown:
* setext
  * First-level (text underlined by one or more equal signs)
  * Second-level (text underlined by one or more dashes)
* atx
  * First-level (# preceding text)
  * Second-level (## preceding text)
  * Third-level (### preceding text)
  * Fourth-level (#### preceding text)
From the sample document we see that 'Prologue' and 'Introduction' are first-level headers, and 'Core Courses' is a second-level header.
Images
There are two styles of image links supported by standard markdown:
* inline file name: ![alternate text](directory_path/image "optional title")
* reference id: ![alternate text][string of digits | string of terms]
Links
There are two styles of links supported by standard markdown:
* inline URL: [random website](website)
* reference id: [random website][string of digits | string of terms]
From the sample document we see that 'Data Science Specialization' is referenced by the id label (0001) whereas 'Johns Hopkins University' is referenced by the id label (jhu). The actual URLs are collected at the end of the sample document, although the label definitions could appear anywhere in the document.
Lists
There are two styles of lists supported by standard markdown:
* ordered list
  * number (followed by a period and at least one space; physical ordering overrides the numeric label during conversion)
* unordered list
  * * (asterisk)
  * - (dash)
  * + (plus)
From the sample document we see an ordered list containing the learner outcomes and an unordered list containing the names of each of the nine core courses.
Install the markdown (MD) to hyper-text markup language (HTML) converter to practise modifying the sample markdown document.
    $ sudo apt-get install markdown
A text editor combined with the markdown-to-HTML converter is all that is needed.
    $ nano sample.md
    $ markdown sample.md                 # sends HTML output to the screen
    $ markdown sample.md > sample.html   # sends HTML output to a file named 'sample.html'
    $ firefox sample.html                # view the rendered HTML in a web browser
Take your time working through the sample markdown document until you fully understand why each element produces the observed results. This book is written in a markdown language. In another course you will learn how to produce a markdown document combining text and executable R code using R markdown, and convert it to HTML and PDF using RStudio.
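For quick practice without a text editor, a small markdown file can be written straight from the shell. The sketch below is an illustration under stated assumptions: the file name practice.md and its contents are hypothetical, and the conversion step runs only if the markdown converter installed earlier is actually present. The ordered list is deliberately numbered 1, 3, 2 to show that physical order, not the numeric label, decides the rendered numbering.

```shell
set -eu
workdir="$(mktemp -d)"                 # scratch directory for the practice file
cat > "$workdir/practice.md" <<'EOF'
Practice
===

## Lists

1. first item
3. second item
2. third item

* **bold** and *italic* text
* a [reference link][jhu]

[jhu]: http://www.jhu.edu
EOF

# Count the atx second-level headings in the file
heading_count="$(grep -c '^##' "$workdir/practice.md")"
echo "Wrote $workdir/practice.md with $heading_count atx heading(s)"

# Convert only when the converter is installed
if command -v markdown >/dev/null 2>&1; then
  markdown "$workdir/practice.md" > "$workdir/practice.html"
  echo "HTML written to $workdir/practice.html"
fi
```

Opening the resulting HTML file in a browser makes it easy to compare each element with the descriptions in this section.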
Git - Version Control
Git is a distributed version control system allowing any number of people to collaboratively contribute to software development or other projects. Some of the courses require learners to submit their programming assignments to GitHub as part of a peer assessment grading process.
Installing Git
By installing the Git command-line client you can choose whether to manage your local and remote repositories from a terminal session or within RStudio. Assuming you are running the Ubuntu Linux virtual machine or another Debian GNU/Linux derived distribution, type the command shown to install the Git client.
    $ sudo apt-get install -y git git-doc
If you have installed a different distribution, refer to the system documentation to determine the package manager needed to install software from the software repository.
15-minute Introduction to Version Control with Git from the Terminal
Let's start exploring the basic features of version control from the comfort of a terminal session. Open a terminal window (CTRL+ALT+T) if you are running a graphical desktop environment. Once RStudio is installed you will also be able to manage your repositories from within RStudio.
| Command | Description | Common Flags | Arguments |
|---|---|---|---|
| git init | initialise a local repository; default is current working directory | | [directory_path/][directory_name] (optional) |
| git branch | determine the current branch | | |
| git checkout | switch branches; with -b, create a new branch in the current repository | -b (new branch) | branch_name (mandatory) |
| git status | reports the status of the local repository | | |
| git show | reports the most recent commit and the differences it introduced | | |
| git add | add files to the local repository | -A (all changes), -u (track file name changes and deletions) | [directory_path/][file_name] (mandatory) |
| git commit | commit any changes to the local repository | -a (all modified tracked files), -m (message) | [directory_path/][file_name] "a string of characters" (optional, mandatory) |
| git pull | fetch changes from another repository and merge with current repository | | source target (mandatory) |
| git push | update remote repository with changes from the current repository | -u (add upstream (tracking) reference) | target source (mandatory unless -u flag present) |
| git merge | merge source branch with target branch; --squash flattens the commit history first | --squash | branch_name (mandatory) |
| git revert | undo changes to the local repository | | reference_point (mandatory) |
Arguments in brackets are optional, but if the 'mandatory' designation is present, at least one of the arguments must be supplied.
Table 0.2 Basic Git Commands
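Table 0.2 lists git revert, but none of the worked examples exercise it. The following is a minimal sketch of how it might be used, run in a throwaway repository. The identity, file name and commit messages are hypothetical, and '>>' (append to a file) is a standard shell redirection operator not covered in Table 0.1. The --no-edit flag accepts the default revert message without opening an editor.

```shell
set -eu
repo="$(mktemp -d)"                  # throwaway repository
cd "$repo"
git init -q
git config --local user.name "Example Learner"        # hypothetical identity
git config --local user.email "learner@example.com"   # hypothetical address

echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "add notes"

echo "a change we regret" >> notes.txt                # '>>' appends to the file
git commit -q -a -m "regrettable change"

# Undo the most recent commit by recording a new commit that reverses it
git revert --no-edit HEAD >/dev/null

content="$(cat notes.txt)"
commit_count="$(git log --oneline | wc -l | tr -d ' ')"
echo "notes.txt now contains: $content"
echo "commits in history:     $commit_count"

cd /
rm -rf "$repo"
```

Note that the history grows rather than shrinks: the unwanted commit stays in the log and a third commit cancels its effect.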
For each of the examples in this section, type the commands to the right of the command prompt ($) to follow along interactively. Take your time working through the commands until you fully understand why each command produces the observed results.
Preliminaries: Configure your email address and user name to be used by Git. The flag --global means apply the configuration to all of your Git repositories on the computer. The flag --local means apply the configuration to only the current Git repository.
    $ git config [--local|--global] user.email "[email protected]"
    $ git config [--local|--global] user.name "username"
Note: The output of some Git commands in these examples has been reformatted for presentation within this book.
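When the same setting exists at both levels, the --local value takes precedence inside that repository. The sketch below demonstrates this under a few assumptions: it points HOME at a temporary sandbox so that your real ~/.gitconfig is never modified, and the names "Global Name" and "Local Name" are placeholders.

```shell
set -eu
sandbox="$(mktemp -d)"
old_home="${HOME:-}"
HOME="$sandbox/home"; export HOME        # keep the demo away from your real settings
mkdir -p "$HOME" "$sandbox/repo"
touch "$HOME/.gitconfig"                 # ensure --global writes land in the sandbox

git config --global user.name "Global Name"

cd "$sandbox/repo"
git init -q
git config --local user.name "Local Name"

effective="$(git config user.name)"            # value Git will actually use here
global_value="$(git config --global user.name)"
echo "global setting:    $global_value"
echo "effective in repo: $effective"

cd /
HOME="$old_home"; export HOME            # restore the original setting
rm -rf "$sandbox"
```

A common arrangement is to set your usual identity with --global once and reserve --local for the occasional repository that needs a different one.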
Example 1: Create a local repository.
    $ mkdir Projects
    $ mkdir Projects/DataScientistsToolbox
    $ mkdir Projects/DataScientistsToolbox/sample
    $ cd Projects/DataScientistsToolbox/sample
    $ git init
    Initialised empty Git repository in /home/osboxes/Projects/DataScientistsToolbox/sample/.git/
    $ ls -la
    drwxrwxr-x 3 osboxes osboxes 4096 Apr  5 19:15 .
    drwxrwxr-x 3 osboxes osboxes 4096 Apr  5 19:07 ..
    drwxrwxr-x 7 osboxes osboxes 4096 Apr  5 19:15 .git
Example 2: Create an empty README.md file in the local repository.
    $ touch README.md
    $ git add .
    $ git commit -m "initial commit"
    [master (root-commit) b7c48f3] initial commit
     1 file changed, 0 insertions(+), 0 deletions(-)
     create mode 100644 README.md
    $ git status
    On branch master
    nothing to commit, working directory clean
    $ git show
    commit b7c48f3e5cdc772e6a198c3633acd853a69a5778
    Author: jhudss
    Date:   Sun Apr 5 19:21:21 2015 -0300
        initial commit
    diff --git a/README.md b/README.md
    new file mode 100644
    index 0000000..e69de29
Example 3: Edit the README.md file and paste the sample markdown document into the file.
    $ nano README.md
    $ git add -A .
    $ git commit -m "added content"
    [master 8fd8eb8] added content
     1 file changed, 41 insertions(+)
Example 4: Edit the README.md file, swapping "Getting and Cleaning Data" and "Exploratory Data Analysis". The -a flag stages the modified file so that the commit succeeds without a separate 'git add'.
    $ nano README.md
    $ git commit -a -m "swapped order of two courses"
    [master 87d0125] swapped order of two courses
     1 file changed, 1 insertion(+), 1 deletion(-)
Example 5: Determine whether there are any changes.
    $ git status
    On branch master
    nothing to commit, working directory clean
    $ git show
    commit 87d012594aa5a8a39e99d4728dc8c853779587ab
    Author: jhudss
    Date:   Sun Apr 5 19:34:34 2015 -0300
        swapped order of two courses
    diff --git a/README.md b/README.md
    index 756292a..48587e6 100644
    --- a/README.md
    +++ b/README.md
    @@ -25,8 +25,8 @@ The courses comprising the Data Science Specialization are:
     * Data Scientist's Toolbox
     * R Programming
    -* Exploratory Data Analysis
     * Getting and Cleaning Data
    +* Exploratory Data Analysis
     * Reproducible Research
     * Statistical Inference
     * Regression Models
Example6:Createabranchnamed'draft'.
$gitcheckout-bdraft
Switchedtoanewbranch'draft'
$gitstatus
Onbranchdraftnothingtocommit,workingdirectoryclean
Example7:EdittheREADME.mdfiletoadd"Gitiseasy.Gitisfun.ThanksLinus!"anywhereinthefile.
$nanoREADME.md$gitstatus
OnbranchdraftChangesnotstagedforcommit:(use"gitadd..."toupdatewhatwillbecommitted)(use"gitcheckout--..."todiscardchangesinworkingdirectory)
modified:README.md
nochangesaddedtocommit(use"gitadd"and/or"gitcommit-a")
$gitcommit-a-m"thankedthecreatorofGit"
[draft34af00f]thankedthecreatorofGit1filechanged,2insertions(+)

Example 8: Switch to the 'master' branch and check the repository status.
$ git checkout master
Switched to branch 'master'
$ git status
On branch master
nothing to commit, working directory clean

Example 9: Merge the 'draft' branch with the 'master' branch and check the repository status.
$ git merge draft
Updating 87d0125..34af00f
Fast-forward
 README.md | 2 ++
 1 file changed, 2 insertions(+)
$ git status
On branch master
nothing to commit, working directory clean
$ git show
commit 34af00fc564fd28e485503715dd5a9a9a461329a
Author: jhudss
Date:   Sun Apr 5 19:49:08 2015 -0300

    thanked the creator of Git

diff --git a/README.md b/README.md
index 48587e6..aa53fee 100644
--- a/README.md
+++ b/README.md
@@ -19,6 +19,8 @@ is to provide the learner with:
 3. an appreciation of the mathematics & statistics involved in data science.
+Git is easy. Git is fun. Thanks Linus!
+
 ## Core Courses
 The courses comprising the Data Science Specialization are:
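
Examples 6 to 9 can be replayed end to end in a scratch repository. The sketch below is an illustration under assumptions: the temporary directory, the 'Recruit' identity, and the one-line README are placeholders, and the --ff-only option is an addition not used in Example 9. It makes Git refuse the merge unless it is a fast-forward like the one reported above.

```shell
#!/usr/bin/env bash
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master     # use the branch name seen in this chapter
git config user.name "Recruit"              # placeholder identity
git config user.email "recruit@example.com" # placeholder address

echo "# Sample" > README.md
git add README.md
git commit -q -m "initial commit"

# Examples 6 and 7: create 'draft', edit the file, and commit on that branch.
git checkout -q -b draft
echo "Git is easy. Git is fun. Thanks Linus!" >> README.md
git commit -q -a -m "thanked the creator of Git"

# Examples 8 and 9: return to 'master' and merge 'draft' into it.
git checkout -q master
git merge -q --ff-only draft

master_head=$(git rev-parse master)
draft_head=$(git rev-parse draft)
commit_count=$(git rev-list --count master)
echo "master and draft both point at $master_head"
```

After a fast-forward merge both branch names refer to the same commit, which is why 'git show' on master displays the commit made on 'draft'.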

A cheat sheet for Git and GitHub has been prepared by the folks at GitHub.

GitHub - Repository Hosting Service Supporting the Git Version Control System

GitHub is a repository hosting service allowing any number of people to collaboratively contribute to software development or other projects. Some of the courses require learners to submit their programming assignments to GitHub as part of a peer assessment grading process.

15-minute Introduction to Version Control with GitHub from the Terminal and Web Browser

Figure 0.12 Create an account with GitHub
Before creating a repository on GitHub you must create an account, preferably with the same user name and email address used when configuring Git. If you use an alternate email address and user name for your GitHub account, you can associate Git's user name and email address with this account.
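
To see which identity Git will attach to commits, and therefore whether it matches the GitHub account, the configured values can be read back. This is a minimal sketch, assuming a scratch repository; 'user_name' and the example.com address are placeholders for your own details.

```shell
#!/usr/bin/env bash
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q

# Placeholders: substitute the name and email registered with your GitHub account.
git config user.name "user_name"
git config user.email "user_name@example.com"

# Reading a key back shows the value Git will use for commits in this repository.
configured_name=$(git config user.name)
configured_email=$(git config user.email)
echo "commits here will be attributed to $configured_name ($configured_email)"
```

Without an option, 'git config' writes to the current repository only. Adding --global applies the same setting to every repository of the current user.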

Figure 0.13 Choose a Personal Plan
Select the repository hosting plan for your account. The default free plan is sufficient for peer assessments during the Johns Hopkins University Data Science Specialization.

Figure 0.14 New Account Orientation Dashboard
After your GitHub account is set up you are ready to explore the service. You should update the profile information at the very least before proceeding.

For each of the examples in this section, type the commands to the right of the command prompt ($) to follow along interactively. Take your time working through the commands until you fully understand why each command produces the observed results.

Example 1: Synchronise a local repository with an empty repository of the same name on GitHub. The commands below create the empty repository on GitHub and push the content of the local repository to your GitHub account. Substitute your GitHub account name for 'user_name' and type your account password when prompted.
$ curl -u user_name https://api.github.com/user/repos \
    -d "{\"name\":\"sample\",\"description\":\"learning about Git and GitHub\"}"
$ git remote add origin https://github.com/user_name/sample.git
$ git push origin master
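
The remote-and-push step can be rehearsed without a network connection. In the sketch below a bare repository on disk stands in for the empty GitHub repository. That substitution, the temporary paths, and the 'Recruit' identity are assumptions made for illustration. With GitHub, the remote URL would be https://github.com/user_name/sample.git instead.

```shell
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)

# A bare repository on disk plays the role of the empty repository on GitHub.
git init -q --bare "$work/sample.git"

# The local repository with one commit to publish.
git init -q "$work/sample"
cd "$work/sample"
git symbolic-ref HEAD refs/heads/master     # use the branch name seen in this chapter
git config user.name "Recruit"              # placeholder identity
git config user.email "recruit@example.com" # placeholder address
echo "# Sample" > README.md
git add README.md
git commit -q -m "initial commit"

# Register the remote under the name 'origin', then push the master branch to it.
git remote add origin "$work/sample.git"
git push -q origin master

local_head=$(git rev-parse master)
remote_head=$(git --git-dir="$work/sample.git" rev-parse master)
remote_url=$(git remote get-url origin)
echo "origin is $remote_url"
```

When the two commit identifiers match, the remote holds the same history as the local repository.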

A cheat sheet for Git and GitHub has been prepared by the folks at GitHub.

R - Statistical Analysis and Computing Environment

R is a statistical analysis and computing environment providing "an integrated suite of software facilities for data manipulation, calculation and graphical display."

Installing R

Add the line "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" to the end of the sources.list file.
$ sudo nano /etc/apt/sources.list

Fetch the signing key for the CRAN repository.
$ sudo apt-key adv --keyserver keys.gnupg.net --recv-key 51716619E084DAB9

Install the latest version of R, which might be newer than the one shown in the figures.
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install -y r-base r-doc-info r-mathlib libcurl4-gnutls-dev

15-minute Introduction to the R Statistical and Computational Environment

Let's start exploring the basic features of the R environment from the comfort of the R Console command-line interface. Open a terminal window (CTRL+ALT+T) if you are running a graphical desktop environment and type 'R' followed by the [ENTER] key. Once RStudio is installed you won't have to work at the command-line unless you choose to do so.

Command            Description                                                    Arguments
-----------------  -------------------------------------------------------------  ------------------------------------------
install.packages   install a package from CRAN                                    package_name (mandatory)
install_github     install a package from GitHub                                  package_name (mandatory)
library            load a package                                                 package_name (mandatory)
?                  access the help system                                         [package_name] [function_name] (mandatory)
q()                exit R; prompts to save the environment before shutting down

Table 0.3 Essential R Commands

Arguments in brackets are optional but if the 'mandatory' designation is present, at least one of the arguments must be supplied.
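
The installation steps for R add the CRAN line to sources.list by hand in an editor. If the step is ever repeated, the line should not be added twice. The sketch below shows one way to append it only when it is missing. It is an illustration under assumptions: a temporary file stands in for /etc/apt/sources.list so that no root access is needed, and the helper name add_cran_line is hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# A temporary file stands in for /etc/apt/sources.list.
sources=$(mktemp)
cran_line="deb http://cran.rstudio.com/bin/linux/ubuntu trusty/"

# Hypothetical helper: append the CRAN line to the file named in $1 unless it is already there.
add_cran_line() {
  # -q: no output, -x: match the whole line, -F: treat the pattern as a fixed string.
  if ! grep -qxF "$cran_line" "$1"; then
    echo "$cran_line" >> "$1"
  fi
}

add_cran_line "$sources"
add_cran_line "$sources"   # the second call changes nothing

occurrences=$(grep -cxF "$cran_line" "$sources")
echo "CRAN line appears $occurrences time(s)"
```

On a real system the target would be /etc/apt/sources.list and the commands would need sudo.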

For each example, type the commands to the right of the command prompt (>) to follow along interactively. Take your time working through the commands until you fully understand why each command produces the observed results.

Example: For simplicity we do not show the output of the commands used within R Console. We will install the devtools package, as an exemplar, which will be needed to successfully compile other packages throughout these chapters and the nine data science courses.
$ R

R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: i686-pc-linux-gnu (32-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> install.packages("devtools")
> library(devtools)
> ?devtools
> q()
$

RStudio - Integrated Development Environment

RStudio is an integrated development environment providing a platform "to tackle the toughest and most interesting problems with R."

Installing RStudio
$ wget http://download1.rstudio.org/rstudio-0.98.1103-i386.deb -O ${HOME}/Downloads/rstudio.deb
$ sudo apt-get install libjpeg62
$ sudo dpkg -i ${HOME}/Downloads/rstudio.deb

15-minute Introduction to the RStudio Integrated Development Environment

Figure 0.15 RStudio Integrated Development Environment
Launch RStudio by clicking on the 'Button' icon near the upper left of the application bar, typing 'rstudio' into the search field, and clicking on the RStudio icon. Once the application is visible as shown in Figure 0.15, right-click on the RStudio icon in the application bar and select 'Lock to Launcher'.

Figure 0.16 Configure Global Options
Click 'Tools' on the main menu followed by 'Global Options' to configure RStudio.

Figure 0.17 Select the CRAN repository mirror to fetch packages
Select a geographically nearby CRAN repository after clicking 'Packages'.

Figure 0.18 Configure code editing preferences
Click 'Code Editing' to configure the appearance and behaviour of the code editing pane.

Figure 0.19 Configure version control options
Click 'Git/SVN' to configure which version control system will be used. If Git has already been installed, the defaults can be accepted. Click 'Apply'. Click 'OK'.

Figure 0.20 Create a directory
Click the 'Files' tab in the lower right pane and navigate to the Projects directory. If a directory named Projects does not exist, create it first. Click 'New Folder' and type the name of the course, DataScientistsToolbox.

Figure 0.21 Create a new project - Step 1
In the upper right click 'Project (None)' and select 'New Project'.

Figure 0.22 Create a new project - Step 2
Click 'New Directory' to create a new repository.

Figure 0.23 Create a new project - Step 3
Click 'Empty Project' as the project type.

Figure 0.24 Create a new project - Step 4

Figure 0.25 Create a new project - Step 5
Navigate to ${HOME}/Projects/DataScientistsToolbox. Click 'Choose'.

Figure 0.26 Create a new project - Step 6
To create the project, type a directory name for it, select 'Create a git repository', and click 'Create Project'.

Figure 0.27 Create a new text file
Select 'File' on the main menu followed by 'New File' and select 'Text File' as the file type.

Figure 0.28 Save the student_grades.csv data file
Type the contents shown in the code editing pane. Click on the diskette icon or select 'File, Save' from the menu. Type the file name and click 'Save'.

Figure 0.29 Set the working directory
Click 'Session' on the main menu and select 'Set Working Directory' followed by 'To Files Pane Location'.

Figure 0.30 Save the student_grades.R script
Click 'File' on the main menu followed by 'New File' and select 'R Script'. Type the R code shown in Figure 0.30. Then click 'File' followed by 'Save', type the file name, and click 'Save'.

Figure 0.31 Read student grades file and output the contents
Highlight the code in the 'student_grades.R' tab. Click 'Run'.

Figure 0.32 Commit changes to the local repository
Click the 'Git' tab in the upper right pane. Click 'Commit'.

Figure 0.33 Select changes to be committed to the local repository
Select each of the four files by marking them as staged. Type a commit message. Click 'Commit' to commit these changes to the local repository.

Figure 0.34 Summary of changes to the local repository
Review the messages before clicking 'Close'. Afterwards close the 'Review Changes' pop-up window.

Figure 0.35 Tracking changes in an open project
Modify the R code as shown in the 'student_grades.R' tab. Did you notice the new entry under the 'Git' tab? Highlight the last line of code and run it. Commit this change using the same procedure.

Figure 0.36 Push the contents of the local repository to GitHub
Log in to GitHub using a web browser and create an empty repository named 'demo'. In RStudio click the gear icon under the 'Git' tab and select 'Shell'. For convenience we put the git commands in the code pane. Type these commands in the shell, substituting your GitHub account. Type 'exit' to close the shell. Verify the repository on GitHub has been updated. Log out of GitHub.

Figure 0.37 Close the currently active project
Click on 'demo' in the upper right corner of RStudio and click 'Close Project'.

Figure 0.38 GitHub repository named demo after the push from local repository

Congratulations! You successfully configured a virtual machine for use during the data science boot-camp.

Practise. Practise. Practise your newly acquired knowledge and skills in preparation for the course project.

Final Thoughts

Data Scientist's Toolbox introduced the statistical computing and graphing suite, the integrated development environment, and the version/revision control system selected by the Data Science Specialization Lab Team in the Biostatistics Department of Johns Hopkins University. The features and capabilities of these tools extend beyond the basics presented in this chapter. While the graphical user interface is convenient, we highly recommend and encourage you to become comfortable with the command-line as well.

As a data science recruit you are now outfitted with your kit (Git, R, RStudio, Ubuntu Linux, and a GitHub account), and the instructor for R Programming awaits. Boot-camp has been easy up to this point. Read the "Data Science Boot-Camp Survival Manual" regularly to avoid washing out of boot-camp.

Recruits, dismissed.

Chapter 1 - R Programming

Chapter 2 - Getting and Cleaning Data

Chapter 3 - Exploratory Data Analysis

Chapter 4 - Reproducible Research

Chapter 5 - Statistical Inference

Chapter 6 - Regression Models

Chapter 7 - Practical Machine Learning

Chapter 8 - Developing Data Products

Capstone

Epilogue