Using R for data analysis

Daniel Mullensiefen
Goldsmiths, University of London

August 18, 2009

Daniel Mullensiefen Goldsmiths, University of London Using R for data analysis

Overview

Introduction
- What’s it good for?
- R and its competitors
- Core characteristics
- History

Analysing data: The iris data example
- Getting data in
- Summarising data

R is good for

- Flexible data analysis (programmable)
- Using different analysis techniques
- Data visualisation
- Numeric accuracy
- Rapid prototyping of analysis / process models
- Pre-processing data from different sources
  - text files (.txt) and binary files (e.g. SPSS .sav, Excel)
  - audio files
  - databases
  - texts (linguistic data)

R is considered less good for

- Graphical user interfaces
- Internet programming
- Low-level programming
- ...

R compares to

- Matlab (R is open source, community driven, not commercial)
- SPSS, SAS, Stata (R is a programming language, not a program)
- Weka (R is driven by a community, not individuals)
- SciPy and other software libraries (R is an entire language specialised for data analysis)

Pros and Cons

- Huge community support
- Cross-platform and command-line based
- Interactive: interpreted, not compiled
- Mainly functional
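
To illustrate "interactive" and "mainly functional": each line below can be typed at the R prompt and is evaluated immediately. This is only a small sketch; it uses R's built-in iris data, and the names stats and col_means are made up for this example.

```r
# Functions are ordinary values: store them in a list and pass them around.
stats <- list(mean = mean, median = median, sd = sd)

# sapply() calls a function on each list element; here an anonymous
# function applies every statistic to one column of the iris data.
sapply(stats, function(f) f(iris$Sepal.Length))

# The same idea over columns: one mean per numeric column, no explicit loop.
col_means <- sapply(iris[, 1:4], mean)
col_means
```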

How R came about

- 1976: John Chambers releases 1st version of S: a language for statistics, stochastic simulation and data visualisation
- 1995: Ross Ihaka and Robert Gentleman release R as GPL
- 1998: Comprehensive R Archive Network (CRAN) founded
- 2001: R News published for 1st time
- 2004: 1st useR! conference
- 2009: More than 1000 packages available on CRAN

Basic data in and out

- Start R
- Save the file from http://www.doc.gold.ac.uk/~mas03dm/teaching/r/iris.data.txt to R's working directory (find it using getwd())
- Get the data into R using the read.table command (useful: help(read.table) and the assignment operator "<-")
- Change the species label of the 3rd observation to your own first name (using the indexing operator [ , ]), save this dataset (using write.table())
- Remove the altered dataset (using rm()) and read the original dataset in again
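
One possible way through these steps is sketched below. Assumptions: R's built-in iris data, written to a temporary folder, stands in for the downloaded file (the course file's exact layout is not shown here); the names data_file, altered_file and iris_data and the label "Daniel" are placeholders. stringsAsFactors = FALSE keeps Species as plain text, so a new label can be assigned directly.

```r
# Stand-in for the downloaded file (assumption: same layout as built-in iris).
data_file <- file.path(tempdir(), "iris.data.txt")
write.table(iris, file = data_file, sep = "\t", row.names = FALSE)

getwd()                  # shows the current working directory
# help(read.table)       # opens the documentation for read.table

# Read the data; header = TRUE because the first line holds column names.
iris_data <- read.table(data_file, header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE)

# Change the species label of the 3rd observation via [row, column] indexing.
iris_data[3, "Species"] <- "Daniel"   # replace with your own first name

# Save the altered dataset under a different (hypothetical) file name.
altered_file <- file.path(tempdir(), "iris.altered.txt")
write.table(iris_data, file = altered_file, sep = "\t", row.names = FALSE)

# Remove the altered dataset from the workspace and reload the original.
rm(iris_data)
iris_data <- read.table(data_file, header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE)
```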

Data summary and plots

- Summarise the dataset (summary(), str())
- Plot 1st column vs 2nd column (plot())
- Attach the dataset to the search path (attach())
- Plot Species vs Petal.Width and give the graph a title and axis names
- Plot a histogram of Petal.Length (hist())
- Plot a scattergram of the full dataset (plot(dataset, col=Species))
- Add a non-parametric smoother (plot(dataset, col=Species, panel=panel.smooth))
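
The steps above might look roughly like this. Assumptions: dataset is a copy of R's built-in iris data, where Species is a factor (a file read with read.table may need Species converted with factor() first); the title and axis labels are just examples.

```r
if (!interactive()) pdf(NULL)   # no on-screen device when run as a script

dataset <- iris                 # assumption: built-in iris stands in for the course file

summary(dataset)                # per-column summaries
str(dataset)                    # structure: column types and first values

plot(dataset[, 1], dataset[, 2])   # 1st column vs 2nd column

attach(dataset)                 # columns become visible by name on the search path

# Species is a factor, so plot() draws one boxplot per species.
plot(Species, Petal.Width,
     main = "Petal width by species",
     xlab = "Species", ylab = "Petal width (cm)")

hist(Petal.Length)

plot(dataset, col = Species)                        # scatterplot matrix, coloured by species
plot(dataset, col = Species, panel = panel.smooth)  # same, with a smoothed line in each panel

detach(dataset)                 # undo attach() to keep the search path clean
```

attach() saves typing in a short session, but detach() afterwards avoids name clashes later on.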

More plots and a function

- Do boxplot(Petal.Length ~ Species, notch=TRUE). What are the notches?
- Set the graphical device to be split into a 2x2 panel: op <- par(mfrow = c(2,2))
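
A minimal sketch of both steps, again assuming the built-in iris data (data = iris replaces the earlier attach()). As a hint for the question: according to R's boxplot documentation, the notches mark an interval around each median, and non-overlapping notches are strong evidence that two medians differ. Filling the four panels with histograms is only an illustration.

```r
if (!interactive()) pdf(NULL)   # no on-screen device when run as a script

# Notched boxplot of petal length per species (formula interface: y ~ group).
boxplot(Petal.Length ~ Species, data = iris, notch = TRUE)

# Split the device into a 2x2 grid; par() returns the previous settings.
op <- par(mfrow = c(2, 2))

# Fill the four panels, here with one histogram per numeric column.
for (column in names(iris)[1:4]) {
  hist(iris[[column]], main = column, xlab = column)
}

par(op)   # restore the previous layout
```

Storing the old settings in op and restoring them with par(op) means later plots are not affected by the 2x2 split.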
