Introduction to R for Big Data Analysis


Transcript of Introduction to R for Big Data Analysis

Page 1: Introduction to R for Big Data Analysis

Introduction to R For Big Data Analysis

Wednesday, October 13, 2015 6:00pm – 6:45 pm

Raastech, Inc., 2201 Cooperative Way, Suite 600, Herndon, VA 20171, +1-703-884-2223


© Raastech, Inc. 2015 | All rights reserved. Slide 2 of 51 @Raastech

About Me

Harold Dost III @hdost

7+ years of Oracle Middleware experience

OCE (SOA Foundation Practitioner)

Oracle ACE Associate

From Michigan

blog.raastech.com


About Raastech

Small systems integrator founded in 2009

Headquartered in the Washington DC area

Specializes in Oracle Fusion Middleware

Oracle Platinum Partner – 1 in 3,000 worldwide

Oracle SOA Specialized – 1 in 1,500 worldwide

Oracle ACE – 2 of 500 worldwide

100% of consultants are Oracle certified

100% of consultants present at major Oracle conferences

100% of consultants have published books, whitepapers, or articles


Outline

1. Getting Started

Installing R

Installing Tools

Getting Data

2. Understanding R

Data Types

Functions

Data Import Mechanisms


Outline (Cont.)

3. Manipulating Data (Large Data Sets)

Deriving Simple Statistics

Graphing

4. Demo

5. Incorporating into an Enterprise

Using Enterprise Data Sources

Running R in your environment

Familiarizing yourself with Oracle's R offerings


Know CRAN

Comprehensive

R

Archive

Network


Installing R

Windows https://cran.r-project.org/bin/windows/

Mac https://cran.r-project.org/bin/macosx/

Linux https://cran.r-project.org/bin/linux/


Development Tools

RStudio - http://www.rstudio.com/products/rstudio/

Open Source Edition

Commercial License - $995

Eclipse

Sublime, TextPad, Other Simple Text Editors,…


Installing Packages

Anything from CRAN

install.packages(c("first", "second"))

From anywhere else (a downloaded source package)

> sudo R CMD INSTALL package-version.tar.gz


Data Types

Vectors

Matrices

Arrays

Data Frames

Lists

Factors
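As a rough illustration of the six types listed above, the sketch below builds one small object of each kind. All values and variable names are invented for the example.

```r
# Vector: ordered values that all share one type
scores <- c(1.5, 2.5, 4.0)

# Matrix: two-dimensional, single type
m <- matrix(1:6, nrow = 2, ncol = 3)

# Array: like a matrix, but with any number of dimensions
a <- array(1:24, dim = c(2, 3, 4))

# Data frame: a table whose columns may have different types
df <- data.frame(id = c("a", "b", "c"), score = scores,
                 stringsAsFactors = FALSE)

# List: ordered collection of arbitrary objects
l <- list(name = "example", values = scores, table = df)

# Factor: categorical data stored as integer codes plus labels
f <- factor(c("low", "high", "low", "medium"),
            levels = c("low", "medium", "high"))

str(l)      # show the structure of the list
levels(f)   # "low" "medium" "high"
```

str() is a convenient way to inspect any of these objects when you are unsure what a function returned.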


Special Values

Infinity, Positive and Negative: Inf and -Inf

Not A Number: NaN

Not Available: NA

Complex Numbers, 1+9i


Use Case for Infinities

Finding Maximums and Minimums

Placeholder values when others won’t work
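One way to read the two use cases above: seed a running minimum with Inf and a running maximum with -Inf, so the first real value always replaces the placeholder. A minimal sketch with made-up readings:

```r
# Made-up sample values for illustration
readings <- c(12.4, 7.9, 15.2, 9.8)

lowest  <- Inf    # any real number is smaller than Inf
highest <- -Inf   # any real number is larger than -Inf

for (x in readings) {
  if (x < lowest)  lowest  <- x
  if (x > highest) highest <- x
}

lowest    # 7.9
highest   # 15.2
```

In practice min() and max() do this for you; the loop only shows why the infinities make safe starting values.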


Not a Number (NaN)

It means something went wrong somewhere

A missing argument

Invalid number

Check for it with is.nan(x) to keep it from leaking into later results

Don't use == to find NaN; the comparison yields NA, not TRUE or FALSE


Assigning NaN

> a = NaN

> a

[1] NaN


Adding NaN


> b = 1

> c = a + b

> c

[1] NaN

Adding a number to NaN (Not a Number) gives NaN.


Comparing NaN to Regular Number

> d = b == c

> d

[1] NA

Comparing a number to NaN (Not a Number) gives NA.


Comparing NaN to NaN

> e = c == a

> e

[1] NA

Comparing NaN (Not a Number) to NaN also gives NA.


Detecting NaN

> a

[1] NaN

> is.nan(a)

[1] TRUE

> is.na(a)

[1] TRUE

Since NaN values aren't proper numbers, special functions must be used to detect them. They are the result of math gone wrong. Note that is.na() also returns TRUE for NaN.


Detecting NA

> e = c == a

> e

[1] NA

> is.nan(e)

[1] FALSE

> is.na(e)

[1] TRUE

Just as with NaN, special functions must be used, but NA generally indicates that there is missing information. Note that is.nan() returns FALSE for NA.
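The same checks work element by element on a vector, which is how they are usually applied to real data. The sketch below uses invented values and shows one common way to keep NA and NaN out of a summary statistic.

```r
x <- c(4, NA, 0 / 0, 9)   # 0 / 0 evaluates to NaN

is.nan(x)   # FALSE FALSE  TRUE FALSE
is.na(x)    # FALSE  TRUE  TRUE FALSE

# Without cleanup the missing values propagate into the result
raw_mean <- mean(x)

# na.rm = TRUE drops both NA and NaN before averaging
clean_mean <- mean(x, na.rm = TRUE)   # 6.5

# Or filter explicitly and keep the cleaned vector
clean <- x[!is.na(x)]
```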


Operators

Assignment ( ->, <-)

Addition (+)

Subtraction (-)

Division (/)

Multiplication (*)

Exponent (^)

Parentheses ( (, ) )
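A short sketch that exercises each operator in the list; the variable names are made up for the example.

```r
radius <- 3                 # leftward assignment
2 * radius -> diameter      # rightward assignment, multiplication

area <- pi * radius^2       # exponent binds tighter than multiplication

# Parentheses force the addition and subtraction to happen first:
# (6 + 1) / (3 - 1) = 3.5
ratio <- (diameter + 1) / (radius - 1)
```

The earlier NaN slides use = for assignment, which also works at the top level; <- is the more common convention in R code.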


Math Functions

max()

min()

log()

sqrt()


Deriving Simple Statistics

Minimum

Maximum

Median

Arithmetic Mean

Function estimation

Linear

Log

Exponential

R-Values

Standard Deviation
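The statistics listed above map onto base R functions fairly directly. The sketch below uses invented data, and it assumes "R-Values" refers to the correlation coefficient and that function estimation means least-squares fitting with lm().

```r
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2)  # made up

min(y)      # 2.1
max(y)      # 20.2
median(y)   # 11.15
mean(y)     # 11.04
sd(y)       # sample standard deviation

# Function estimation via least squares
fit_linear <- lm(y ~ x)         # y = a + b * x
fit_log    <- lm(y ~ log(x))    # y = a + b * log(x)
fit_exp    <- lm(log(y) ~ x)    # y = exp(a + b * x)

coef(fit_linear)                 # intercept and slope
r <- cor(x, y)                   # correlation coefficient
summary(fit_linear)$r.squared    # proportion of variance explained
```

Comparing r.squared across the three fits is a quick, if rough, way to see which shape describes the data best.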


How to define your own functions

firstfunction <- function(arg1, arg2, ... ){

statements

return(someoutput)

}
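A concrete instance of the template above might look like this. The function name rescale01 and its na.rm argument are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: scale a numeric vector to the range [0, 1]
rescale01 <- function(x, na.rm = TRUE) {
  lo <- min(x, na.rm = na.rm)
  hi <- max(x, na.rm = na.rm)
  return((x - lo) / (hi - lo))
}

rescale01(c(2, 4, 6))   # 0.0 0.5 1.0
```

The explicit return() is optional; R returns the value of the last evaluated expression.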


Twitter Example

First Install the Package

install.packages("twitteR")


Twitter Example

Authenticate

library(twitteR)

consumer = "CONSUMER KEY"

secret = "SECRET KEY"

setup_twitter_oauth(consumer, secret)


Twitter Example

Get Trend Locations

Choose a WOEID (Where On Earth ID) from the resulting list of locations

availableTrendLocations()


Twitter Example

Get Trends

trends = getTrends(SOMEWOEID)


Twitter Example

Retrieve Tweets

tweets <- searchTwitter(trends[XX,XX],n=1500)

tweetdf <- do.call("rbind",lapply(tweets,as.data.frame))


Twitter Example

Filter

complete.cases is used to check for NA and NaN values

tweetdf <- tweetdf[complete.cases(tweetdf[,15]),]

tweetdf <- tweetdf[tweetdf[,15] != 0,]
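The same two filtering steps can be tried without a Twitter account on a toy data frame. The sketch below assumes column 15 of the tweet data holds a coordinate such as longitude; the rows here are invented and the column is selected by name instead of position.

```r
# Toy stand-in for the tweet data frame (values are made up)
toy <- data.frame(
  screenName = c("a", "b", "c", "d"),
  longitude  = c(-77.4, NA, 0, -83.0),
  latitude   = c(38.9, NA, 0, 42.3),
  stringsAsFactors = FALSE
)

# Keep only rows whose longitude is neither NA nor NaN
toy <- toy[complete.cases(toy[, "longitude"]), ]

# Drop rows where the coordinate is exactly 0
toy <- toy[toy[, "longitude"] != 0, ]

nrow(toy)   # 2
```

Running the NA filter first matters: comparing NA with 0 yields NA, and indexing rows by NA produces rows of missing values.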


Twitter Example

Simplify the dataframe

simpledf <- tweetdf[c("screenName","longitude","latitude")]


Twitter Example

Create Matrix from Dataframe

tweetMatrix <- data.matrix(simpledf[2:3],rownames.force = FALSE)


Twitter Example

Plot the Latitude and Longitude

plot(tweetMatrix)
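The matrix conversion and plot can be sketched on invented coordinates as follows. One caveat worth knowing: data.matrix() turns character columns into factor codes, so if the coordinates arrive as text they should be converted with as.numeric() first. Whether that applies to your tweet data is an assumption to check with str().

```r
# Made-up coordinates standing in for the simplified tweet data
simple <- data.frame(
  screenName = c("a", "b", "c"),
  longitude  = c(-77.4, -83.0, -122.3),
  latitude   = c(38.9, 42.3, 47.6),
  stringsAsFactors = FALSE
)

# Columns 2 and 3 become a two-column numeric matrix
coords <- data.matrix(simple[2:3], rownames.force = FALSE)

pdf(NULL)   # discard graphics output; remove this line to draw on screen
plot(coords, xlab = "Longitude", ylab = "Latitude")
invisible(dev.off())
```

plot() on a two-column matrix puts the first column on the x axis and the second on the y axis.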


Graphing

Image

Contour

Box Chart
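Each of the three graph types has a base R function. The sketch below uses the volcano and InsectSprays datasets that ship with R, and it assumes "Box Chart" means a box plot as drawn by boxplot().

```r
pdf(NULL)   # discard graphics output; remove this line to draw on screen

# Image: a matrix rendered as a grid of colored cells
image(volcano)

# Contour: the same matrix as lines of equal height
contour(volcano)

# Box plot: distribution of counts for each spray type
boxplot(count ~ spray, data = InsectSprays)

invisible(dev.off())
```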


K-Means

Essentially an iterative search for cluster centers

Divides a dataset into k clusters
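Base R ships a kmeans() function in the stats package. The sketch below runs it on two synthetic, well-separated groups of points; the data and the choice of k = 2 are made up for the example.

```r
set.seed(42)   # make the random data and starting centers reproducible

# Two synthetic groups of 20 points each, centered near (0, 0) and (3, 3)
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)

# k = 2 clusters; nstart tries several random starts and keeps the best
fit <- kmeans(pts, centers = 2, nstart = 10)

fit$centers          # estimated cluster centers
table(fit$cluster)   # number of points assigned to each cluster
```

On real data k is usually not known in advance, so it is common to fit several values and compare fit$tot.withinss.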


Time Series

Stock Quotes

Infection Incidents

Gas Prices

Audio

Etc.

Source: http://www.loc.gov/pictures/resource/hec.23488/


Time Series Analysis

Regression

Forecasting

Time-Frequency (FFTs)

Source: http://groups.csail.mit.edu/netmit/sFFT/algorithm.html
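The three techniques can be sketched together on a synthetic series: a linear regression for the trend, a naive forecast from that trend, and an FFT of the detrended signal to find its dominant frequency. The signal (8 cycles over 64 samples plus a slight upward drift) is invented for the example.

```r
step   <- 0:63
signal <- sin(2 * pi * 8 * step / 64) + 0.01 * step   # made-up series

# Regression: fit a straight-line trend
trend <- lm(signal ~ step)

# Forecasting (naive): extend the fitted trend three steps ahead
forecast <- predict(trend, newdata = data.frame(step = 64:66))

# Time-frequency: magnitude spectrum of the detrended signal
detrended <- signal - fitted(trend)
spectrum  <- Mod(fft(detrended))

# Element 1 is the zero-frequency term, so skip it; the index within
# elements 2..32 equals the number of cycles in the 64-sample window
peak <- which.max(spectrum[2:32])   # 8
```

A real forecast would model the periodic component too; this only shows where each function fits.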


Using Enterprise Data Sources

Database

Streams

Files

Etc.
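For files and streams, base R's read.csv() and connection functions cover the simple cases. The sketch below writes a small CSV to a temporary file, loads it as a data frame, and then reads the same file line by line through a connection. The file contents are invented. Databases are typically reached through a driver package such as DBI, which is not shown here.

```r
# Create a small CSV file to stand in for an exported data source
path <- tempfile(fileext = ".csv")
writeLines(c("id,amount", "1,10.5", "2,NA", "3,7.25"), path)

# Files: load the whole file into a data frame
orders <- read.csv(path)

# Streams: open a connection and read one line at a time
con <- file(path, open = "r")
header <- readLines(con, n = 1)
close(con)

unlink(path)   # remove the temporary file

mean(orders$amount, na.rm = TRUE)   # 8.875
```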


Oracle R Advanced Analytics for Hadoop

Component of the Oracle Big Data Software Connectors Suite, an option for the BDA (Big Data Appliance)

Provides abstraction from HiveQL through R, just as Oracle R Enterprise does for SQL

http://www.oracle.com/technetwork/database/database-technologies/bdc/r-advanalytics-for-hadoop/overview/index.html


Contact Information

Harold Dost III

Principal Consultant

@hdost



Resources

https://en.wikibooks.org/wiki/Statistical_Analysis:_an_Introduction_using_R/R_basics

http://www.r-project.org/

https://docs.oracle.com/cd/E57012_01/doc.141/e56973/toc.htm

http://cran.r-project.org/web/packages/akmeans/index.html

http://cran.r-project.org/web/packages/twitteR/index.html

http://en.wikipedia.org/wiki/K-means_clustering

http://www.rdatamining.com/examples/kmeans-clustering

http://blog.revolutionanalytics.com/2009/02/how-to-choose-a-random-number-in-r.html

https://www.packtpub.com/books/content/text-mining-r-part-2

http://www.eia.gov/totalenergy/data/monthly/index.cfm#consumption