Data Science Boot Camp Survival Manual


Table of Contents

1. Prologue
2. Chapter 0 - Data Scientist's Toolbox
3. Chapter 1 - R Programming
4. Chapter 2 - Getting and Cleaning Data
5. Chapter 3 - Exploratory Data Analysis
6. Chapter 4 - Reproducible Research
7. Chapter 5 - Statistical Inference
8. Chapter 6 - Regression Models
9. Chapter 7 - Practical Machine Learning
10. Chapter 8 - Developing Data Products
11. Capstone
12. Epilogue

Prologue

Welcome recruits!

During the next year you will learn the fundamentals of data science. The Data Science Specialization, offered by Johns Hopkins University, is challenging. Success requires a strategy. This book aims to equip each of you with the knowledge and skills to complete boot-camp. The "Data Science Boot-Camp Survival Manual" alone cannot guarantee success. Listen to the instructors' lectures and apply yourself to the evaluations throughout your training.

According to Jeff Leek and the Data Science Specialization Team, the key word in data science is "science". To this end, the focus of the ten-course series, including a capstone project, is to provide the learner with:

1. an introduction to the key ideas behind reproducible research,
2. an introduction to the tools and techniques to transform raw data into a presentable report,
3. an opportunity to gain hands-on practice so you can learn the techniques for yourself, and
4. an appreciation of the mathematics & statistics involved in data science.

Core Courses

The courses comprising the Data Science Specialization are:

* Data Scientist's Toolbox
* R Programming
* Getting and Cleaning Data
* Exploratory Data Analysis
* Reproducible Research
* Statistical Inference
* Regression Models
* Practical Machine Learning
* Developing Data Products

These courses, taught by Brian Caffo, Jeff Leek, and Roger D. Peng, enable the learner to acquire the foundational skills. While the lectures and assignments build these skills, learners often require further explanation. The course forums allow learners to discuss the lecture topics and assignments, yet each session of a course begins without the shared knowledge of previous participants. Serving as a Community Teaching Assistant (CTA) made it clear that a companion guide would be beneficial.

Are you up to the challenge of Johns Hopkins University's Data Science Specialization?

Structure of the Boot-Camp Survival Manual

Each chapter covers one of the core courses. A tutorial style balancing theory and practical application makes surviving data science boot-camp possible. You learn the workflow typically involved in all phases of a data analysis project.

Chapter 0: The Data Scientist's Toolbox

URL: https://www.coursera.org/course/datascitoolbox

Synopsis: "Get an overview of the data, questions, and tools that data analysts and data scientists work with. This is the first course in the Johns Hopkins Data Science Specialization."

Chapter 1: R Programming

URL: https://www.coursera.org/course/rprog

Synopsis: "Learn how to program in R and how to use R for effective data analysis. This is the second course in the Johns Hopkins Data Science Specialization."

Chapter 2: Getting and Cleaning Data

URL: https://www.coursera.org/course/getdata

Synopsis: "Learn how to gather, clean, and manage data from a variety of sources. This is the third course in the Johns Hopkins Data Science Specialization."

Chapter 3: Exploratory Data Analysis

URL: https://www.coursera.org/course/exdata

Synopsis: "Learn the essential exploratory techniques for summarizing data. This is the fourth course in the Johns Hopkins Data Science Specialization."

Chapter 4: Reproducible Research

URL: https://www.coursera.org/course/repdata

Synopsis: "Learn the concepts and tools behind reporting modern data analyses in a reproducible manner. This is the fifth course in the Johns Hopkins Data Science Specialization."

Chapter 5: Statistical Inference

URL: https://www.coursera.org/course/statinference

Synopsis: "Learn how to draw conclusions about populations or scientific truths from data. This is the sixth course in the Johns Hopkins Data Science Course Track."

Chapter 6: Regression Models

URL: https://www.coursera.org/course/regmods

Synopsis: "Learn how to use regression models, the most important statistical analysis tool in the data scientist's toolkit. This is the seventh course in the Johns Hopkins Data Science Specialization."

Chapter 7: Practical Machine Learning

URL: https://www.coursera.org/course/predmachlearn

Synopsis: "Learn the basic components of building and applying prediction functions with an emphasis on practical applications. This is the eighth course in the Johns Hopkins Data Science Specialization."

Chapter 8: Developing Data Products

URL: https://www.coursera.org/course/devdataprod

Synopsis: "Learn the basics of creating data products using Shiny, R packages, and interactive graphics. This is the ninth course in the Johns Hopkins Data Science Specialization."

Data Science Capstone

URL: https://www.coursera.org/course/dsscapstone

Synopsis: "The capstone project class will allow students to create a usable/public data product that can be used to show your skills to potential employers. Projects will be drawn from real-world problems and will be conducted with industry, government, and academic partners."

Course synopses are quoted from the course information pages at Coursera as at 1 April 2015.

Course Dependency and Recommended Sequence

Although the courses are standalone, the knowledge is cumulative. The pedagogical course dependencies are available from Johns Hopkins University.

Figure 1 Course dependency diagram provided by Daniel M. Bontje (created 17 November 2014)

You need a language or system to perform the tasks (R Programming) and data to analyse (Getting and Cleaning Data). You then get a sense of the data (Exploratory Data Analysis) before building models and drawing inferences (Statistical Inference, Regression Models) or making predictions (Practical Machine Learning). Finally, you present your conclusions and supporting evidence (Developing Data Products, Reproducible Research).

The recommended mathematics background is linear algebra and introductory statistics (descriptive and inferential). Statistical Inference and Regression Models, courses in this specialisation, cover all the basic statistical concepts, forming a solid foundation for subsequent courses in the Data Science Specialization. These courses, along with Practical Machine Learning, are the theoretical underpinnings, while the other six courses are applied in nature: obtaining data, scrubbing data, exploring data, modeling data, and interpreting data, collectively known as the OSEMN (pronounced "awesome") model.

Again, welcome to the Data Science Boot-Camp. Review the "Data Science Boot-Camp Survival Manual" on a regular basis throughout your training.

Chapter 0 - The Data Scientist's Toolbox

We shall neither fail nor falter; we shall not weaken or tire... give us the tools and we will finish the job. - Winston Churchill, Prime Minister of Great Britain

Primary Instructor: Jeff Leek, MS, PhD (Biostatistics)

The foundational course Data Scientist's Toolbox is a high-level overview of the specialisation. This course lays the groundwork for the nine-course series plus capstone project. It takes a comprehensive approach, teaching fundamental skills for data science regardless of data set.

The key word in "data science" is science, not data. The method is not dependent upon the data set size; it scales from small data to big data. The data science method equates to the scientific method used in the natural sciences. The Financial Times article, "Big data: are we making a big mistake?", argues for a rigorous methodology. An article, "The Data Science Methodology", published on Data Science 101, argues for adoption of the scientific method familiar to scientists in the natural sciences.

Data Science Methodology

1. problem formulation (hypothesis)
2. obtain data (experiment)
3. analysis (validate or refute hypothesis)
4. data product (report)

The courses in the Johns Hopkins University Data Science Specialization "emphasise a data science methodology rather than focusing primarily on data science technique. [T]he instructors have taken care throughout to demonstrate a responsible, scientifically-based approach to collecting, curating and analyzing data sources," says specialisation participant John Frederick Thiels.

Learning Objectives

After completing this chapter you will have learned the basic skills to successfully use the various tools required throughout the book and the data science specialisation courses.

Tools of the Trade

To successfully complete the hands-on exercises in the book and the course assignments (quizzes, programming, and course projects) some software must be installed on your computer or in a hosted environment: Git, R and RStudio. A GitHub account is mandatory because peer-assessed submissions must be accessible. Internet access is necessary to fully participate in the courses: watching or downloading lectures, taking quizzes, submitting programming assignments, and participating in the peer assessment process. Because the software can be deployed on a variety of operating system platforms, this book focuses solely on Ubuntu Linux running locally or remotely in a virtualised environment.

Before delving into how to use the various tools in our toolbox it is important to consider the types of skills we need as data-scientists-in-training. Firstly, linear algebra, probability and calculus at the introductory level are sufficient mathematics. Secondly, introductory descriptive and inferential statistics, including hypothesis testing, is the recommended statistics background. Thirdly, basic programming skills are recommended. None of the aforementioned skills are mandatory for the Data Science Specialization. For those readers seeking to learn any of these skills there are courses available, including:

* Pre-Calculus - Instructors: Sarah Eichhorn and Rachel Cohen Lehman, University of California, Irvine
* Probability - Instructor: Santosh S. Venkatesh
* Calculus: Single Variable - Instructor: Robert Ghrist, University of Pennsylvania
* Descriptive Statistics - Instructor: Matthijs Rooduijn, University of Amsterdam
* Inferential Statistics - Instructor: Annemarie Zand Scholten, University of Amsterdam
* Data Analysis and Statistical Inference - Instructor: Mine Çetinkaya-Rundel, Duke University
* Programming for Everybody (Python) - Instructor: Charles Severance, University of Michigan

Programming for Everybody (Python) deserves special mention because it is consistently highly rated by course participants for the teaching style of "Dr. Chuck." You do not have to be a geek to enjoy this course.

Read the information page of each course, especially if you prefer a self-teaching approach to learning. There are freely available textbooks for some of these courses.

Virtualisation Software

While the various applications required for these courses can be installed on the host operating system of your computer, we recommend using virtualisation software such as Oracle VirtualBox, VMware Workstation, Fusion or Player, or Parallels Desktop, depending upon the operating system running on the computer. Another virtualisation option is the RStudio Server Amazon Machine Image (AMI), or rolling your own local or hosted virtual machine instance.

This section describes two scenarios:

* importing a ready-made disk image (AMI) of Ubuntu Linux 14.04 LTS (64-bit) on the Amazon Web Services Elastic Compute Cloud (AWS EC2) hosting platform, and
* importing a ready-made disk image of Ubuntu Linux 14.04 LTS (32-bit or 64-bit) into Oracle VirtualBox on your computer.

An advantage of virtualisation software, whether running on your computer or remotely hosted by a service provider, is that all the required applications are kept separate from your computer's operating system and are by default isolated from the host file system.

Option A: Amazon Web Services Elastic Compute Cloud with Amazon Machine Image

If you prefer installing Oracle VirtualBox and creating a virtual machine on your computer, you can skip this section.

Instructions are forthcoming.

Option B: Local Computer with Oracle VirtualBox

Please consult the instructions about downloading and installing Oracle VirtualBox onto your computer before proceeding.

Download the ready-made disk image of Ubuntu Linux (32-bit or 64-bit) based on the version supported by Oracle VirtualBox and the architecture of the computer.

Note: Some computers are 64-bit but only allow 32-bit operating systems to run within virtualisation software.

Extract the compressed archive containing the disk image using p7zip.

    $ 7za e Ubuntu_14.04.2-32bit.7z

    7-Zip (A) [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
    p7zip Version 9.20 (locale=en_CA.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)

    Processing archive: Ubuntu_14.04.2-32bit.7z

    Extracting  32bit/Ubuntu 14.04.2 (32bit).vdi
    Extracting  32bit

    Everything is Ok

    Folders: 1
    Files: 1
    Size:       3807379456
    Compressed: 776252068
    $

After installing Oracle VirtualBox, launch it so we can import the virtual machine disk image (.vdi).

Figure 0.1 Creating a new virtual machine instance

Click 'New' on the main menu. A dialogue box appears where you enter the name to assign to the virtual machine and select the operating system and version. Click 'Next' to continue.

Figure 0.2 Allocating system memory to the new virtual machine instance

Select the amount of system memory (RAM) to allocate to the virtual machine. Allocate 2048 MB of system memory to this virtual machine instance. This parameter can be modified later if necessary. Click 'Next' to continue.

Figure 0.3 Associating an existing virtual hard drive to the new virtual machine instance

Select 'Use an existing virtual hard drive file' and click on the file folder icon to navigate to the virtual hard drive file previously downloaded and uncompressed. Click 'Create' to associate this disk image with the current virtual machine.

Figure 0.4 Mount the VirtualBox Guest Additions ISO

Make the VirtualBox Guest Additions ISO accessible to the virtual machine instance. At the main screen of Oracle VirtualBox select the Data Scientists Toolbox virtual machine. Click 'Settings', then 'Storage', followed by 'Empty'.

Figure 0.5 Mount the VirtualBox Guest Additions ISO

Click the CD/DVD icon and select VBoxGuestAdditions.iso from the drop-down list. Click 'OK' to return to the main screen.

Figure 0.6 Starting the new virtual machine instance

At the main screen of Oracle VirtualBox select the newly created virtual machine instance. Click 'Start' to launch the virtual machine. At the login prompt type the password from the download web page.

The final preparatory step is enabling the VirtualBox Guest Additions and updating any out-of-date packages installed on the virtual machine. Open a terminal window (CTRL+ALT+T).

Activate the VirtualBox Guest Additions so the virtual machine instance integrates with the host system.

    $ cd /media/osboxes/VBOXADDITIONS*
    $ sudo sh VBoxLinuxAdditions.run

Upon successful installation, shut down the virtual machine instance by clicking the Gear icon in the upper right corner of the virtual machine. Then unmount the VirtualBox Guest Additions ISO by reversing the steps shown in Figures 0.4 and 0.5. Alternatively, you may choose to leave the VirtualBox Guest Additions ISO attached.

Note: Whenever an updated Linux kernel is installed as part of the normal update process, the VirtualBox Guest Additions will have to be reapplied to ensure that features such as the shared clipboard continue to work. Do NOT forget to restart the virtual machine instance so the VirtualBox Guest Additions are activated.

Figure 0.7 Enable/Disable shared clipboard and drag-and-drop

A shared clipboard between your computer and the virtual machine instance can be enabled via the 'Settings' menu.

Figure 0.8 Pointing device and device boot-order configuration

The mouse device type should be configured as 'PS/2 Mouse' whether using a wired or wireless mouse. The device boot order should be configured to ensure the virtual disk image is the default boot device.

Restart the virtual machine instance.

Switching between standard mode and full-screen mode is as easy as Host_Key+F (RIGHT_CTRL+F by default).

For convenience, launch a terminal session (CTRL+ALT+T) and, when its icon appears in the application bar, right-click it and select 'Lock to Launcher'. From this point forward, any time a terminal session is wanted, simply click the 'Terminal' icon.

Figure 0.9 System settings configuration

Before updating the currently installed system software and applications, select an Ubuntu Linux package repository in geographic proximity to your location. This can be accomplished by clicking the 'System Settings' icon in the application bar along the left edge of the screen. Click 'Software & Updates'.

Next, open a terminal session (CTRL+ALT+T). When the terminal displays the shell prompt, type the following commands to update and upgrade the currently installed system software and applications. If you see the 'Software Updater' icon in the application bar, you can apply software updates by clicking the icon instead.

    $ sudo apt-get -y update
    $ sudo apt-get -y upgrade

Figure 0.10 Editing the user name, password, language preference and enabling automatic login

Automatic login can be enabled, and the display name and password for the user account can be changed, if desired, via 'System Settings' by clicking 'User Accounts'.

Figure 0.11 Automatic login enabled

Click 'Unlock' to enable editing of the user account configuration. Type the current password when prompted. If you want to change the account name, click 'osboxes.org' and type the desired account name. If you want to change the password, click on the asterisks and type the desired password. If you want to enable automatic login, click 'OFF' so that 'ON' is visible. Finally, click 'Lock' to relock the user account configuration.

Getting Familiar with the Command-Line Interface (CLI)

After a short detour to familiarise ourselves with the command-line interface (CLI) we will install Git, R, and RStudio. Rest assured that interacting with the command line is not required beyond this chapter. RStudio provides seamless integration with the file system to navigate and manipulate files, with version control and repository synchronisation between your computer and repository hosting services, and with the statistical computation and software development environment.

15-minute Introduction to Navigating and Manipulating the File System from the Terminal

Let's start exploring the basic features of the environment from the comfort of a terminal session and the command line. Open a terminal window (CTRL+ALT+T) if you are running a graphical desktop environment. By learning a few basic commands to navigate and manipulate the file system you will feel at ease and understand what is going on behind the scenes within the Files pane of RStudio.

| Command | Description | Common Flags | Arguments |
|---|---|---|---|
| pwd | print working directory name | | |
| ls | list file and/or directory names | -l (long form), -a (hidden), -R (recursive) | [directory_path/][pattern] (optional) |
| mkdir | make directory | | [directory_path/]directory_name or [directory_path/]directory_name_list (mandatory) |
| cd | change directory | | [directory_path/][directory_name] (optional) |
| touch | create an empty file | | [directory_path/]file_name (mandatory) |
| echo | write a string to stdout (redirect the output to create a file) | -e (interpret backslash escapes), -n (no trailing newline) | "a string of characters" (mandatory) |
| cp | copy file or directory | -r (recursive) | (source) [directory_path/][file_name] (target) [directory_path/][file_name] (mandatory) |
| mv | move file or directory | | (source) [directory_path/][file_name] (target) [directory_path/][file_name] (mandatory) |
| rm | remove/delete file or directory | -f (force), -r (recursive) | [directory_path/][file_name] (mandatory) |

Arguments in brackets are optional, but if the 'mandatory' designation is present, at least one of the arguments must be supplied. Directory names and paths as well as file names may contain wildcard characters (* and ?) when used with some of these commands.

Table 0.1 Basic File and Directory Commands
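The wildcard note above can be tried safely in a scratch directory. The following is a minimal sketch rather than part of the course material: the directory and file names (demo, file01.txt and so on) are hypothetical, and it assumes a POSIX shell with mktemp available.

```shell
# Scratch directory so nothing in the home directory is touched.
workdir="$(mktemp -d)"
cd "$workdir"

# Hypothetical files used only for this illustration.
mkdir -p demo/1 demo/2
touch demo/1/file01.txt demo/1/file0101.txt demo/2/notes.md

# '*' matches any run of characters: both .txt files in demo/1.
set -- demo/1/*.txt
star_count=$#

# '?' matches exactly one character: only file01.txt has exactly two
# characters between 'file' and '.txt'.
set -- demo/1/file??.txt
question_count=$#
question_match=$1

echo "matched by *.txt: $star_count"
echo "matched by file??.txt: $question_count ($question_match)"
```

The shell expands the pattern before the command runs, so ls, cp, mv and rm all receive the already-matched names.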

For each example, type the commands to the right of the command prompt ($) to follow along interactively. Take your time working through the commands until you fully understand why each command produces the observed results.

Example 1: Determine the current working directory

    $ pwd
    /home/osboxes

Example 2: List the file and subdirectory names in the current working directory

    $ ls
    Desktop    Downloads         Music     Public     Videos
    Documents  examples.desktop  Pictures  Templates

Example 3: Create a subdirectory named 'test' in the current directory and make it the current working directory

    $ mkdir test
    $ cd test
    $ pwd
    /home/osboxes/test

Example 4: Create subdirectories named '1', '2', '3', and '4' in the current directory

    $ mkdir {1,2,3,4}

Example 5: List the files and subdirectory names in the current directory

    $ ls
    1  2  3  4

Example 6: Create some empty files and some files with content

    $ touch 1/file01.txt 2/file02.txt
    $ echo "Bonjour tout le monde"
    Bonjour tout le monde
    $ echo "Hello World!" > ./1/file0101.txt
    $ echo "To be or not to be" > ./3/file03.txt

Example 7: Change to the directory immediately above the current directory and list the files and subdirectory names in the subdirectory named '1'

    $ cd ..
    $ ls -l test/1
    total 4
    -rw-rw-r-- 1 osboxes osboxes 13 Apr  3 09:28 file0101.txt
    -rw-rw-r-- 1 osboxes osboxes  0 Apr  3 09:27 file01.txt

Example 8: List the files ending with '.txt' in the subdirectory named '3'

    $ ls -l test/3/*.txt
    -rw-rw-r-- 1 osboxes osboxes 19 Apr  3 09:29 test/3/file03.txt
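The file sizes reported by ls -l in Examples 7 and 8 follow from how touch and echo create files. The sketch below reproduces the arithmetic in a scratch directory; the file names are hypothetical, and the append operator (>>) is an addition that the examples above do not use.

```shell
# Scratch directory so the walkthrough's 'test' tree is left alone.
workdir="$(mktemp -d)"
cd "$workdir"

touch empty.txt                     # touch creates a zero-byte file
echo "Hello World!" > hello.txt     # 12 characters plus the newline echo adds

# $(( ... )) strips any padding that wc may print around the number.
empty_bytes=$(( $(wc -c < empty.txt) ))
hello_bytes=$(( $(wc -c < hello.txt) ))

echo "To be" > quote.txt            # '>' creates the file or overwrites it
echo "or not to be" >> quote.txt    # '>>' appends to the existing file
quote_lines=$(( $(wc -l < quote.txt) ))

echo "empty.txt: $empty_bytes bytes"
echo "hello.txt: $hello_bytes bytes"
echo "quote.txt: $quote_lines lines"
```

This is why file01.txt shows 0 bytes and file0101.txt shows 13 bytes in the listing for Example 7.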

Example 9: (a) Copy the file 'file02.txt' from the directory named '${HOME}/test/2' to the directory '${HOME}/test/4' and name the file 'file04.txt'

    $ cp ./test/2/file02.txt ./test/4/file04.txt

(b) Copy the file 'file02.txt' from the directory named '${HOME}/test/2' to the directory '${HOME}/test/4' and name the file 'file02.txt'

    $ cp ~/test/2/file02.txt ./test/4/file02.txt

Example 10: Make subdirectory '${HOME}/test/3' the current working directory and create a hidden file and a hidden subdirectory

    $ cd test/3
    $ touch .hidden01.txt
    $ mkdir .hidden

Example 11: List the names of non-hidden files and subdirectories in the current directory, then list all names including the hidden ones

    $ ls
    file03.txt
    $ ls -a
    .  ..  file03.txt  .hidden  .hidden01.txt

Example 12: Create a subdirectory named 'another' in the home directory of the user and recursively copy the files and subdirectories from '${HOME}/test' to '${HOME}/another'

    $ mkdir ~/another
    $ cp -r ../* ~/another

Example 13: List the files and subdirectories in the home directory of the user

    $ ls ~
    another  Documents  examples.desktop  Pictures  Templates  Videos
    Desktop  Downloads  Music             Public    test

Example 14: List the file and subdirectory names in '${HOME}/another'

    $ ls ~/another
    1  2  3  4

Example 15: List the file names and, recursively, the subdirectories in '${HOME}/another'

    $ ls -R ~/another
    /home/osboxes/another:
    1  2  3  4

    /home/osboxes/another/1:
    file0101.txt  file01.txt

    /home/osboxes/another/2:
    file02.txt

    /home/osboxes/another/3:
    file03.txt

    /home/osboxes/another/4:
    file02.txt  file04.txt

Example 16: Create a subdirectory named 'test/5' in the home directory of the user and move (copy and delete) the files and subdirectories from '${HOME}/another' into it

    $ mkdir ~/test/5
    $ mv ~/another/* ../5

Example 17: List the file and subdirectory names in '${HOME}/another'

    $ ls -a /home/osboxes/another
    .  ..

Example 18: List the file and subdirectory names in '${HOME}/test/5'

    $ ls ../5
    1  2  3  4

Example 19: List the file names and, recursively, the subdirectories in '${HOME}/test/5'

    $ ls -R ~/test/5
    /home/osboxes/test/5:
    1  2  3  4

    /home/osboxes/test/5/1:
    file0101.txt  file01.txt

    /home/osboxes/test/5/2:
    file02.txt

    /home/osboxes/test/5/3:
    file03.txt

    /home/osboxes/test/5/4:
    file02.txt  file04.txt

Example 20: Make directory '/home/osboxes' the current working directory

    $ cd
    $ pwd
    /home/osboxes

Example 21: Delete the subdirectories 'test' and 'another' from '${HOME}', and then list the file and subdirectory names in the current directory

    $ rm -rf test another
    $ ls
    Desktop    Downloads         Music     Public     Videos
    Documents  examples.desktop  Pictures  Templates

Example 22: Close the terminal session

    $ exit
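In Examples 16 and 17 the directory 'another' ended up empty because it held no hidden entries at its top level. The sketch below illustrates the general behaviour behind that result: the * wildcard does not match names that begin with a dot. All names here (source, target, copy) are hypothetical, and the commands run in a scratch directory.

```shell
# Scratch directory so no real files are moved or deleted.
workdir="$(mktemp -d)"
cd "$workdir"

mkdir -p source/sub target
echo "data" > source/sub/file.txt
touch source/.hidden

# cp -r copies the whole directory tree, hidden entries included.
cp -r source copy

# mv with '*' moves only the visible entries, because '*' skips
# names that begin with a dot.
mv source/* target/

# -A lists hidden entries but omits '.' and '..'.
remaining="$(ls -A source)"
echo "left behind in source: $remaining"
```

Running ls -a on a directory before deleting it with rm -rf is a cheap way to confirm nothing hidden is about to be lost.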

A cheat sheet for the Bourne Again SHell (BASH) has been prepared by the folks at Learn Code the Hard Way (LCodeTHW). A complete manual for BASH is available from the GNU Project if you want to further explore the CLI and its capabilities.

Markdown - Writing Documentation the Easy Way

The markdown language, created by John Gruber, is relatively small and easy to learn, unlike markup languages such as HTML and XML. Taking a portion of this book as an example, with some minor changes to demonstrate particular features, we explore some of the more common markdown elements.

    Prologue
    ===

    # Introduction

    During the next year you will learn the fundamentals of data science. Surviving the nine courses which make up the [Data Science Specialization][0001] offered by [Johns Hopkins University][jhu] requires a **strategy**.

    To this end, the focus of the ten-course series including a capstone project is to provide the learner with:

    1. an introduction to the key ideas behind reproducible research,
    2. an introduction to the tools and techniques to transform raw data into a presentable report,
    4. an opportunity to gain hands-on practice so you can learn the techniques for yourself, and
    3. an appreciation of the mathematics & statistics involved in data science.

    ## Core Courses

    The courses comprising the Data Science Specialization are:

    * Data Scientist's Toolbox
    * R Programming
    * Exploratory Data Analysis
    * Getting and Cleaning Data
    * Reproducible Research
    * Statistical Inference
    * Regression Models
    * Practical Machine Learning
    * Developing Data Products

    ![Course Dependency](dst_courses.png)
    *Figure 1 Course dependency diagram*

    [0001]: https://www.coursera.org/specialization/jhudatascience/1?utm_medium=courseDescripTop
    [jhu]: http://www.jhu.edu

Listing 0.1 Sample markdown document

So that you can immediately practise each of the markdown elements used in the sample document, a concise description of each is supplied with references to the sample document.

Font Modifiers

There are two styles of font modifier supported by standard markdown:

* bold (text surrounded by **)
* italics (text surrounded by *)

From the sample document we see that 'strategy' is rendered in bold after conversion, whilst 'Figure 1 Course dependency diagram' is rendered in italics.

Headings

There are two styles of headers supported by standard markdown:

* setext
    * First-level (text underlined by one or more equal signs)
    * Secondary-level (text underlined by one or more dashes)
* atx
    * First-level (# preceding text)
    * Secondary-level (## preceding text)
    * Third-level (### preceding text)
    * Fourth-level (#### preceding text)

From the sample document we see that 'Prologue' and 'Introduction' are first-level headers, and 'Core Courses' is a second-level header.

Images

There are two styles of image links supported by standard markdown:

* inline file name: ![alternate text](directory_path/image "optional title")
* reference id: ![alternate text][string of digits | string of terms]

From the sample document we see that the course dependency diagram is linked inline by its file name (dst_courses.png).

Links

There are two styles of links supported by standard markdown:

* inline URL: [random website](URL)
* reference id: [random website][string of digits | string of terms]

From the sample document we see that 'Data Science Specialization' is referenced by the id label (0001) whereas 'Johns Hopkins University' is referenced by the id label (jhu). The actual URLs are collected at the end of the sample document, although the label definitions could appear anywhere in the document.

Lists

There are two styles of lists supported by standard markdown:

* ordered list: a number followed by a period and at least one space; physical ordering overrides the numeric label during conversion
* unordered list: * (asterisk), - (dash), or + (plus), followed by at least one space

From the sample document we see an ordered list containing the learner outcomes and an unordered list containing the names of each of the nine core courses.

Install the markdown (MD) to hyper-text markup language (HTML) converter to practise modifying the sample markdown document.

    $ sudo apt-get install markdown

A text editor combined with the markdown-to-HTML converter is all that is needed.

    $ nano sample.md
    $ markdown sample.md                # sends HTML output to the screen
    $ markdown sample.md > sample.html  # sends HTML output to a file named 'sample.html'
    $ firefox sample.html               # view the rendered HTML in a web browser
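As an alternative to typing the document in an editor, a shell here-document can create a small test file in one step. This is only a sketch: the contents are a cut-down, hypothetical stand-in for Listing 0.1, and the conversion step runs only if the markdown command installed above is available.

```shell
# Scratch directory so an existing sample.md is not overwritten.
workdir="$(mktemp -d)"
cd "$workdir"

# The quoted 'EOF' stops the shell from expanding anything in the text.
cat > sample.md <<'EOF'
Prologue
===

# Introduction

Success requires a **strategy**.

1. first outcome
2. second outcome
4. third outcome
3. fourth outcome
EOF

# Count the atx-style headings and the ordered list items.
atx_headings=$(( $(grep -c '^#' sample.md) ))
list_items=$(( $(grep -c '^[0-9][0-9]*\. ' sample.md) ))
echo "atx headings: $atx_headings, ordered list items: $list_items"

# Convert only when the converter is installed.
if command -v markdown >/dev/null 2>&1; then
    markdown sample.md > sample.html
fi
```

Opening the resulting sample.html in a browser shows how the deliberately misnumbered list items are rendered.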

Take your time working through the sample markdown document until you fully understand why each element produces the observed results. This book is written in a markdown language. In another course you will learn how to produce a markdown document combining text and executable R code using R markdown, and convert it to HTML and PDF using RStudio.

Git - Version Control

Git is a distributed version control system allowing any number of people to collaboratively contribute to software development or other projects. Some of the courses require learners to submit their programming assignments to GitHub as part of a peer assessment grading process.

Installing Git

By installing the Git command-line client you can choose whether to manage your local and remote repositories from a terminal session or within RStudio. Assuming you are running the Ubuntu Linux virtual machine or another Debian GNU/Linux derived distribution, type the command shown to install the Git client.

    $ sudo apt-get install -y git git-doc

If you have installed a different distribution, refer to the system documentation to determine the package manager needed to install software from the software repository.

15-minute Introduction to Version Control with Git from the Terminal

Let's start exploring the basic features of version control from the comfort of a terminal session. Open a terminal window (CTRL+ALT+T) if you are running a graphical desktop environment. Once RStudio is installed you will also have integrated access to Git.

| Command | Description | Common Flags | Arguments |
|---|---|---|---|
| git init | initialise a local repository; default is current working directory | | [directory_path/][directory_name] (optional) |
| git branch | determine the current branch | | |
| git checkout | switch to a branch; with -b, create a new branch in the current repository | -b (new branch) | branch_name (mandatory) |
| git status | report the status of the local repository | | |
| git show | report the most recent commit and the file differences it introduced | | |
| git add | add files to the local repository | -A (add all changes), -u (update tracked files: modifications and deletions) | [directory_path/][file_name] (mandatory) |
| git commit | commit any changes to the local repository | -a (add modified tracked files), -m (message) | [directory_path/][file_name] (optional), "a string of characters" (mandatory with -m) |
| git pull | fetch changes from another repository and merge with current repository | | source target (mandatory) |
| git push | update remote repository with changes from the current repository | -u (add upstream (tracking) reference) | target source (mandatory until an upstream reference has been set with -u) |
| git merge | merge source branch with target branch; --squash flattens the commit history first | --squash | branch_name (mandatory) |
| git revert | undo changes to the local repository | | reference_point (mandatory) |

Arguments in brackets are optional, but if the 'mandatory' designation is present, at least one of the arguments must be supplied.

Table 0.2 Basic Git Commands

For each of the examples in this section, type the commands to the right of the command prompt ($) to follow along interactively. Take your time working through the commands until you fully understand why each command produces the observed results.

Preliminaries: Configure your email address and user name to be used by Git. The flag --global means apply the configuration to all of your Git repositories on the computer. The flag --local means apply the configuration to only the current Git repository.

    $ git config [--local|--global] user.email "[email protected]"
    $ git config [--local|--global] user.name "username"
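When the same setting exists at both levels, the repository-level (--local) value takes precedence over the --global one. The sketch below demonstrates this under stated assumptions: it points HOME at a scratch directory so your real global configuration is never touched, and the names global-name and local-name are placeholders.

```shell
# Scratch area; HOME is redirected so --global writes to a throwaway file.
workdir="$(mktemp -d)"
mkdir -p "$workdir/home" "$workdir/repo"
export HOME="$workdir/home"      # affects only this shell session
unset XDG_CONFIG_HOME            # make sure Git uses $HOME/.gitconfig
touch "$HOME/.gitconfig"

cd "$workdir/repo"
git init -q .

# Placeholder identities, one per configuration level.
git config --global user.name "global-name"
git config --local user.name "local-name"

# Without a flag, git config reports the value Git will actually use.
effective="$(git config user.name)"
echo "effective user.name: $effective"
```

A common arrangement is a --global identity for everyday work plus a --local override in the few repositories that need a different name or address.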

Note: The output of some Git commands in these examples has been reformatted for presentation within this book.

Example 1: Create a local repository.

    $ mkdir Projects
    $ mkdir Projects/DataScientistsToolbox
    $ mkdir Projects/DataScientistsToolbox/sample
    $ cd Projects/DataScientistsToolbox/sample
    $ git init
    Initialised empty Git repository in /home/osboxes/Projects/DataScientistsToolbox/sample/.git/
    $ ls -la
    drwxrwxr-x 3 osboxes osboxes 4096 Apr  5 19:15 .
    drwxrwxr-x 3 osboxes osboxes 4096 Apr  5 19:07 ..
    drwxrwxr-x 7 osboxes osboxes 4096 Apr  5 19:15 .git

Example 2: Create an empty README.md file in the local repository.

    $ touch README.md
    $ git add .
    $ git commit -m "initial commit"

    [master (root-commit) b7c48f3] initial commit
     1 file changed, 0 insertions(+), 0 deletions(-)
     create mode 100644 README.md

    $ git status

    On branch master
    nothing to commit, working directory clean

    $ git show

    commit b7c48f3e5cdc772e6a198c3633acd853a69a5778
    Author: jhudss
    Date:   Sun Apr 5 19:21:21 2015 -0300

        initial commit

    diff --git a/README.md b/README.md
    new file mode 100644
    index 0000000..e69de29

    Example 3: Edit the README.md file and paste the sample markdown document into the file.

    $ nano README.md
    $ git add -A .
    $ git commit -m "added content"
    [master 8fd8eb8] added content
     1 file changed, 41 insertions(+)

    Example 4: Edit the README.md file, swapping "Getting and Cleaning Data" and "Exploratory Data Analysis."

    $ nano README.md
    $ git commit -a -m "swapped order of two courses"
    [master 87d0125] swapped order of two courses
     1 file changed, 1 insertion(+), 1 deletion(-)
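Git commits only what has been staged, so an edited file must be staged with git add or picked up with the -a flag before the commit records it. The sketch below shows the difference in a scratch repository; the file contents, commit messages, and placeholder identity are assumptions made for the illustration.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "recruit"                 # placeholder identity
git config user.email "recruit@example.com"    # placeholder identity

echo "first line" > README.md
git add README.md
git commit --quiet -m "initial commit"

# Modify the tracked file but do not stage the change.
echo "second line" >> README.md

# With nothing staged, 'git commit -m' has nothing to commit and exits non-zero.
if git commit --quiet -m "not staged" >/dev/null 2>&1; then
  plain_commit="committed"
else
  plain_commit="refused"
fi

# The -a flag stages every modified tracked file before committing.
git commit --quiet -a -m "staged with -a"
commit_count="$(git rev-list --count HEAD)"
echo "plain commit: $plain_commit, commits on branch: $commit_count"
```

Note that -a covers only files Git already tracks; a brand-new file still needs git add.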

    Example 5: Determine whether there are any changes.

    $ git status
    On branch master
    nothing to commit, working directory clean
    $ git show
    commit 87d012594aa5a8a39e99d4728dc8c853779587ab
    Author: jhudss
    Date:   Sun Apr 5 19:34:34 2015 -0300

        swapped order of two courses

    diff --git a/README.md b/README.md
    index 756292a..48587e6 100644
    --- a/README.md
    +++ b/README.md
    @@ -25,8 +25,8 @@ The courses comprising the Data Science Specialization are:

     * Data Scientist's Toolbox
     * R Programming
    -* Exploratory Data Analysis
     * Getting and Cleaning Data
    +* Exploratory Data Analysis
     * Reproducible Research
     * Statistical Inference
     * Regression Models

    Example 6: Create a branch named 'draft'.

    $ git checkout -b draft
    Switched to a new branch 'draft'
    $ git status
    On branch draft
    nothing to commit, working directory clean

    Example 7: Edit the README.md file to add "Git is easy. Git is fun. Thanks Linus!" anywhere in the file.

    $ nano README.md
    $ git status
    On branch draft
    Changes not staged for commit:
      (use "git add <file>..." to update what will be committed)
      (use "git checkout -- <file>..." to discard changes in working directory)

            modified:   README.md

    no changes added to commit (use "git add" and/or "git commit -a")
    $ git commit -a -m "thanked the creator of Git"
    [draft 34af00f] thanked the creator of Git
     1 file changed, 2 insertions(+)

    Example 8: Switch to the 'master' branch and check the repository status.

    $ git checkout master
    Switched to branch 'master'
    $ git status
    On branch master
    nothing to commit, working directory clean

    Example 9: Merge the 'draft' branch into the 'master' branch and check the repository status.

    $ git merge draft
    Updating 87d0125..34af00f
    Fast-forward
     README.md | 2 ++
     1 file changed, 2 insertions(+)
    $ git status
    On branch master
    nothing to commit, working directory clean
    $ git show
    commit 34af00fc564fd28e485503715dd5a9a9a461329a
    Author: jhudss
    Date:   Sun Apr 5 19:49:08 2015 -0300

        thanked the creator of Git

    diff --git a/README.md b/README.md
    index 48587e6..aa53fee 100644
    --- a/README.md
    +++ b/README.md
    @@ -19,6 +19,8 @@ is to provide the learner with:
     3. an appreciation of the mathematics & statistics involved in data science.

    +Git is easy. Git is fun. Thanks Linus!
    +
     ## Core Courses

     The courses comprising the Data Science Specialization are:

    A cheat sheet for Git and GitHub has been prepared by the folks at GitHub.

    GitHub - Repository Hosting Service Supporting the Git Version Control System

    GitHub is a repository hosting service that allows any number of people to contribute collaboratively to software development or other projects. Some of the courses require learners to submit their programming assignments to GitHub as part of a peer assessment grading process.

    15-minute Introduction to Version Control with GitHub from the Terminal and Web Browser

    Figure 0.12 Create an account with GitHub

    Before creating a repository on GitHub you must create an account, preferably with the same user name and email address used when configuring Git. If you use a different email address and user name for your GitHub account, you can associate Git's user name and email address with this account.

    Figure 0.13 Choose a Personal Plan

    Select the repository hosting plan for your account. The default free plan is sufficient for peer assessments during the Johns Hopkins University Data Science Specialization.

    Figure 0.14 New Account Orientation Dashboard

    After your GitHub account is set up you are ready to explore the service. At the very least, you should update the profile information before proceeding.

    For each of the examples in this section, type the commands shown to the right of the command prompt ($) to follow along interactively. Take your time working through the commands until you fully understand why each command produces the observed results.

    Example 1: Synchronise a local repository with an empty repository of the same name on GitHub. The commands below create the empty repository on GitHub and push the content of the local repository to your GitHub account. Substitute your GitHub account name for 'user_name' and type your account password when prompted.

    $ curl -u user_name https://api.github.com/user/repos \
        -d "{\"name\":\"sample\",\"description\":\"learning about Git and GitHub\"}"
    $ git remote add origin https://github.com/user_name/sample.git
    $ git push origin master
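The remote and push steps can be rehearsed offline before touching a real account. In the sketch below, a bare repository on the local disk stands in for the empty GitHub repository; that substitution, the directory names, and the placeholder identity are assumptions made for the illustration. It also uses the -u flag from Table 0.2 so that the branch remembers its upstream.

```shell
set -e
work="$(mktemp -d)"

# A bare repository stands in for the empty repository hosted on GitHub.
git init --quiet --bare "$work/remote.git"

# A local repository with one commit, similar to the 'sample' repository.
git init --quiet "$work/sample"
cd "$work/sample"
git config user.name "recruit"                 # placeholder identity
git config user.email "recruit@example.com"    # placeholder identity
touch README.md
git add README.md
git commit --quiet -m "initial commit"
# The initial branch is 'master' or 'main', depending on Git's defaults.
branch="$(git rev-parse --abbrev-ref HEAD)"

# Register the remote under the conventional name 'origin', then push.
# -u records origin/<branch> as the upstream, so later pushes need no arguments.
git remote add origin "$work/remote.git"
git push --quiet -u origin "$branch"

remote_url="$(git remote get-url origin)"
upstream="$(git rev-parse --abbrev-ref "$branch@{upstream}")"
echo "origin -> $remote_url"
echo "upstream of $branch -> $upstream"
```

With a real GitHub repository, only the address passed to git remote add changes; the push itself works the same way.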

    A cheat sheet for Git and GitHub has been prepared by the folks at GitHub.

    R - Statistical Analysis and Computing Environment

    R is a statistical analysis and computing environment providing "an integrated suite of software facilities for data manipulation, calculation and graphical display."

    Installing R

    Add the line "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" to the end of the sources.list file.

    $ sudo nano /etc/apt/sources.list

    Fetch the signing key for the CRAN repository.

    $ sudo apt-key adv --keyserver keys.gnupg.net --recv-key 51716619E084DAB9

    Install the latest version of R, which might be newer than the version shown in the figures.

    $ sudo apt-get update
    $ sudo apt-get upgrade
    $ sudo apt-get install -y r-base r-doc-info r-mathlib libcurl4-gnutls-dev

    15-minute Introduction to the R Statistical and Computational Environment

    Let's start exploring the basic features of the R environment from the comfort of the R Console command-line interface. Open a terminal window (CTRL+ALT+T) if you are running a graphical desktop environment and type 'R' followed by the [ENTER] key. Once RStudio is installed you won't have to work at the command line unless you choose to do so.

    Command             Description                      Arguments
    install.packages    install a package from CRAN      package_name (mandatory)
    install_github      install a package from GitHub    package_name (mandatory)
    library             load a package                   package_name (mandatory)
    ?                   access the help system           [package_name] [function_name] (mandatory)
    q()                 exit R                           Prompt to save the environment before shutting down the R Statistical Analysis and Computing Environment.

    Arguments in brackets are optional, but if the 'mandatory' designation is present, at least one of the arguments must be supplied.

    Table 0.3 Essential R Commands

    For each example, type the commands shown to the right of the command prompt (>) to follow along interactively. Take your time working through the commands until you fully understand why each command produces the observed results.

    Example: For simplicity we do not show the output of the commands used within the R Console. We will install the devtools package, as an exemplar, which will be needed to compile other packages successfully throughout these chapters and the nine data science courses.

    $ R
    R version 3.1.1 (2014-07-10) -- "Sock it to Me"
    Copyright (C) 2014 The R Foundation for Statistical Computing
    Platform: i686-pc-linux-gnu (32-bit)

    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.

      Natural language support but running in an English locale

    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.

    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.

    > install.packages("devtools")
    > library(devtools)
    > ?devtools
    > q()
    $

    RStudio - Integrated Development Environment

    RStudio is an integrated development environment providing a platform "to tackle the toughest and most interesting problems with R."

    Installing RStudio

    $ wget http://download1.rstudio.org/rstudio-0.98.1103-i386.deb -O ${HOME}/Downloads/rstudio.deb
    $ sudo apt-get install libjpeg62
    $ sudo dpkg -i ${HOME}/Downloads/rstudio.deb

    15-minute Introduction to the RStudio Integrated Development Environment

    Figure 0.15 RStudio Integrated Development Environment

    Launch RStudio by clicking on the 'Button' icon near the upper left of the application bar, typing 'rstudio' into the search field, and clicking on the RStudio icon. Once the application is visible as shown in Figure 0.15, right-click on the RStudio icon in the application bar and select 'Lock to Launcher'.

    Figure 0.16 Configure Global Options

    Click 'Tools' on the main menu followed by 'Global Options' to configure RStudio.

    Figure 0.17 Select the CRAN repository mirror to fetch packages

    Select a geographically nearby CRAN repository after clicking 'Packages'.

    Figure 0.18 Configure code editing preferences

    Click 'Code Editing' to configure the appearance and behaviour of the code editing pane.

    Figure 0.19 Configure version control options

    Click 'Git/SVN' to configure which version control system will be used. If Git has already been installed, the defaults can be accepted. Click 'Apply'. Click 'OK'.

    Figure 0.20 Create a directory

    Click the 'Files' tab in the lower right pane, navigate to the Projects directory, and click 'New Folder'. Type the name of the course, DataScientistsToolbox. If a directory named Projects does not exist, create it.

    Figure 0.21 Create a new project - Step 1

    In the upper right click 'Project (None)' and select 'New Project'.

    Figure 0.22 Create a new project - Step 2

    Click 'New Directory' to create a new repository.

    Figure 0.23 Create a new project - Step 3

    Click 'Empty Project' as the project type.

    Figure 0.24 Create a new project - Step 4

    Figure 0.25 Create a new project - Step 5

    Navigate to ${HOME}/Projects/DataScientistsToolbox. Click 'Choose'.

    Figure 0.26 Create a new project - Step 6

    Type a name for the project. To create the project, type a directory name, select 'Create a git repository', and click 'Create Project'.

    Figure 0.27 Create a new text file

    Select 'File' on the main menu followed by 'New File' and select 'Text File' as the file type.

    Figure 0.28 Save the student_grades.csv data file

    Type the contents shown in the code editing pane. Click on the diskette icon or select 'File, Save' from the menu. Type the file name and click 'Save'.

    Figure 0.29 Set the working directory

    Click 'Session' on the main menu and select 'Set Working Directory' followed by 'To Files Pane Location'.

    Figure 0.30 Save the student_grades.R script

    Click 'File' on the main menu followed by 'New File' and select 'R Script'.

    Type the R code shown below. Then click 'File' followed by 'Save' before typing the file name and clicking 'Save'.

    Figure 0.31 Read student grades file and output the contents

    Highlight the code in the 'student_grades.R' tab. Click 'Run'.

    Figure 0.32 Commit changes to the local repository

    Click the 'Git' tab in the upper right pane. Click 'Commit'.

    Figure 0.33 Select changes to be committed to the local repository

    Select each of the four files by marking them as staged. Type a commit message. Click 'Commit' to commit these changes to the local repository.
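For readers who want the command-line counterpart of the staging dialog, the sketch below stages and commits four files in a scratch repository. The four file names are assumptions (the figure's file list is not reproduced in the text), the files are empty placeholders, and the commit message and identity are invented for the illustration.

```shell
set -e
proj="$(mktemp -d)/demo"
mkdir -p "$proj"
cd "$proj"
git init --quiet
git config user.name "recruit"                 # placeholder identity
git config user.email "recruit@example.com"    # placeholder identity

# Empty placeholders; the file names are assumed, not taken from the figure.
touch .gitignore demo.Rproj student_grades.csv student_grades.R

# 'git status --short' marks untracked files with '??'.
git status --short

# Staging here corresponds to ticking the 'Staged' boxes in the dialog.
git add .gitignore demo.Rproj student_grades.csv student_grades.R
git commit --quiet -m "add student grades data and script"

# Count the files recorded in the commit and check that nothing is pending.
files_committed="$(git ls-tree -r --name-only HEAD | wc -l | tr -d ' ')"
pending="$(git status --porcelain)"
echo "files in commit: $files_committed"
```

An empty result from git status --porcelain means the working directory matches the last commit, which is the same state the 'Git' tab shows when its list is empty.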

    Figure 0.34 Summary of changes to the local repository

    Review the messages before clicking 'Close'. Afterwards close the 'Review Changes' pop-up window.

    Figure 0.35 Tracking changes in an open project

    Modify the R code as shown in the 'student_grades.R' tab. Did you notice the new entry under the 'Git' tab? Highlight the last line of code and run it. Commit this change using the same procedure.

    Figure 0.36 Push the contents of the local repository to GitHub

    Log in to GitHub using a web browser and create an empty repository named 'demo'. In RStudio click the gear icon under the 'Git' tab and select 'Shell'. For convenience we put the git commands in the code pane. Type these commands in the shell, substituting your GitHub account. Type 'exit' to close the shell. Verify the repository on GitHub has been updated. Log out of GitHub.

    Figure 0.37 Close the currently active project

    Click on 'demo' in the upper right corner of RStudio and click 'Close Project'.

    Figure 0.38 GitHub repository named demo after the push from local repository

    Congratulations! You successfully configured a virtual machine for use during the data science boot-camp.

    Practise. Practise. Practise your newly acquired knowledge and skills in preparation for the course project.

    Final Thoughts

    Data Scientist's Toolbox introduced the statistical computing and graphing suite, the integrated development environment, and the version/revision control system selected by the Data Science Specialization Lab Team in the Biostatistics Department of Johns Hopkins University. The features and capabilities of these tools extend beyond the basics presented in this chapter. While the graphical user interface is convenient, we strongly encourage you to become comfortable with the command line as well.

    You are now a data science recruit outfitted with your kit (Git, R, RStudio, Ubuntu Linux, and a GitHub account), and the instructor for R Programming awaits. Boot-camp has been easy up to this point. Read the "Data Science Boot-Camp Survival Manual" regularly to avoid washing out of boot-camp.

    Recruits, dismissed.

    Chapter 1 - R Programming

    Chapter 2 - Getting and Cleaning Data

    Chapter 3 - Exploratory Data Analysis

    Chapter 4 - Reproducible Research

    Chapter 5 - Statistical Inference

    Chapter 6 - Regression Models

    Chapter 7 - Practical Machine Learning

    Chapter 8 - Developing Data Products

    Capstone

    Epilogue
