Introduction to R For Big Data Analysis
Wednesday, October 13, 2015 6:00 pm – 6:45 pm
Raastech, Inc., 2201 Cooperative Way, Suite 600, Herndon, VA 20171, +1-703-884-2223
© Raastech, Inc. 2015 | All rights reserved. Slide 2 of 51 @Raastech
About Me
Harold Dost III @hdost
7+ years of Oracle Middleware experience
OCE (SOA Foundation Practitioner)
Oracle ACE Associate
From Michigan
blog.raastech.com
About Raastech
Small systems integrator founded in 2009
Headquartered in the Washington DC area
Specializes in Oracle Fusion Middleware
Oracle Platinum Partner – 1 in 3,000 worldwide
Oracle SOA Specialized – 1 in 1,500 worldwide
Oracle ACE – 2 of 500 worldwide
100% of consultants are Oracle certified
100% of consultants present at major Oracle conferences
100% of consultants have published books, whitepapers, or articles
Outline
1. Getting Started
Installing R
Installing Tools
Getting Data
2. Understanding R
Data Types
Functions
Data Import Mechanisms
Outline (Cont.)
3. Manipulating Data (Large Data Sets)
Deriving Simple Statistics
Graphing
4. Demo
5. Incorporating into an Enterprise
Using Enterprise Data Sources
Running R in Your Environment
Getting Familiar with Oracle's R Offerings
Know CRAN
Comprehensive
R
Archive
Network
Installing R
Windows https://cran.r-project.org/bin/windows/
Mac https://cran.r-project.org/bin/macosx/
Linux https://cran.r-project.org/bin/linux/
Development Tools
RStudio - http://www.rstudio.com/products/rstudio/
Open Source Edition
Commercial License - $995
Eclipse
Sublime, TextPad, Other Simple Text Editors,…
Installing Packages
Anything from CRAN (run inside R)
install.packages(c("first", "second"))
From anywhere, using a downloaded source archive (run from the shell)
$ sudo R CMD INSTALL package-version.tar.gz
Data Types
Vectors
Matrices
Arrays
Data Frames
Lists
Factors
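As a rough illustration of the six types listed above, the sketch below builds one small object of each kind. All names and values are made up for the example.

```r
# Vector: an ordered collection of values of a single type
v <- c(1.5, 2.5, 3.5)

# Matrix: a two-dimensional vector (2 rows x 3 columns here)
m <- matrix(1:6, nrow = 2, ncol = 3)

# Array: like a matrix, but with any number of dimensions
arr <- array(1:24, dim = c(2, 3, 4))

# Data frame: a table whose columns may have different types
df <- data.frame(name = c("ann", "bob"), score = c(90, 85),
                 stringsAsFactors = FALSE)

# List: an ordered collection whose elements may be of any type
l <- list(numbers = v, table = df)

# Factor: a categorical variable with a fixed set of levels
f <- factor(c("low", "high", "low"), levels = c("low", "high"))

# Print a compact summary of the list's structure
str(l)
```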
Special Values
Infinity, Positive and Negative: Inf and -Inf
Not A Number: NaN
Not Available: NA
Complex Numbers, e.g. 1+9i
Use Case for Infinities
Finding Maximums and Minimums
Placeholder values when others won’t work
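One way to read the two use cases above: Inf and -Inf make safe starting values when scanning for a minimum or maximum, because every real number compares below Inf and above -Inf. A minimal sketch, with made-up sample values:

```r
readings <- c(12.4, 7.9, 15.2, 9.1)  # made-up sample values

smallest <- Inf    # any real number is smaller than Inf
largest  <- -Inf   # any real number is larger than -Inf

# Scan once, tightening both bounds as we go
for (x in readings) {
  if (x < smallest) smallest <- x
  if (x > largest)  largest  <- x
}

print(c(smallest = smallest, largest = largest))
```

In everyday code min() and max() do the same job; the loop is only there to show why the infinities work as placeholders.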
Not a Number (NaN)
It means something went wrong somewhere
A missing argument
Invalid number
Check for it with is.nan(x) to keep NaN from propagating into later results
Don't use == to find NaN; the comparison returns NA rather than TRUE or FALSE
Assigning NaN
> a = NaN
> a
[1] NaN
Adding NaN
> b = 1
> c = a + b
> c
[1] NaN
Adding a number to NaN ("Not a Number") yields NaN.
Comparing NaN to Regular Number
> d = b == c
> d
[1] NA
Comparing a number to NaN ("Not a Number") yields NA.
Comparing NaN to NaN
> e = c == a
> e
[1] NA
Comparing NaN ("Not a Number") to NaN also yields NA.
Detecting NaN
> a
[1] NaN
> is.nan(a)
[1] TRUE
> is.na(a)
[1] TRUE
Since NaN values aren't proper numbers, special functions must be used to detect them. They are the result of math gone wrong. Note that is.na() also returns TRUE for NaN.
Detecting NA
> e = c == a
> e
[1] NA
> is.nan(e)
[1] FALSE
> is.na(e)
[1] TRUE
Just as with NaN, special functions must be used, but NA generally indicates that information is missing.
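The same checks work element by element on a whole vector, which is how they are typically used to clean data. A small sketch with made-up values:

```r
x <- c(4, 0/0, NA, 10)   # 0/0 evaluates to NaN

nan_flags <- is.nan(x)   # TRUE only for the NaN element
na_flags  <- is.na(x)    # TRUE for both the NaN and the NA element

clean <- x[!is.na(x)]            # drop NaN and NA together
avg   <- mean(x, na.rm = TRUE)   # or ask mean() to skip them

print(clean)
print(avg)
```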
Operators
Assignment ( ->, <-)
Addition (+)
Subtraction (-)
Division (/)
Multiplication (*)
Exponent (^)
Parentheses ( (, ) )
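A short sketch of the operators listed above, using made-up values, including how parentheses change the order of evaluation:

```r
x <- 3          # leftward assignment
4 -> y          # rightward assignment

total      <- x + y      # addition
difference <- x - y      # subtraction
ratio      <- x / y      # division
product    <- x * y      # multiplication
power      <- x ^ 2      # exponent

# Parentheses override the default precedence
without_parens <- x + y * 2      # multiplication happens first
with_parens    <- (x + y) * 2    # addition happens first

print(c(without_parens, with_parens))
```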
Math Functions
max()
min()
log()
sqrt()
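These functions can be tried on a small made-up vector. Note that max() and min() reduce a vector to one value, while log() and sqrt() work element by element:

```r
values <- c(1, 4, 9, 16)   # made-up sample values

largest  <- max(values)    # single largest element
smallest <- min(values)    # single smallest element
roots    <- sqrt(values)   # element-wise square root
nat_logs <- log(values)    # natural logarithm by default
log2s    <- log(values, base = 2)   # another base can be given explicitly

print(roots)
print(log2s)
```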
Deriving Simple Statistics
Minimum
Maximum
Median
Arithmetic Mean
Function estimation
Linear
Log
Exponential
R-Values
Standard Deviation
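A minimal sketch of how the statistics above might be computed in base R, using the cars data set that ships with R. It assumes "R-Values" refers to the Pearson correlation coefficient, and the three model formulas are just one common way to express linear, log, and exponential fits.

```r
speed <- cars$speed   # cars is a data set bundled with R
dist  <- cars$dist

# Basic descriptive statistics
stats <- c(min = min(dist), max = max(dist), median = median(dist),
           mean = mean(dist), sd = sd(dist))
print(stats)

# Function estimation with least squares
linear_fit <- lm(dist ~ speed)        # dist = a + b * speed
log_fit    <- lm(dist ~ log(speed))   # dist = a + b * log(speed)
exp_fit    <- lm(log(dist) ~ speed)   # log(dist) = a + b * speed

# Correlation coefficient between the two variables
r <- cor(speed, dist)
print(coef(linear_fit))
print(r)
```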
How to define your own functions
firstfunction <- function(arg1, arg2, ... ){
  statements
  return(someoutput)
}
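A concrete instance of the template above. The function name rescale and its arguments are hypothetical, chosen only for illustration:

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1]
rescale <- function(x, na.rm = TRUE) {
  lo <- min(x, na.rm = na.rm)   # smallest value, optionally ignoring NA
  hi <- max(x, na.rm = na.rm)   # largest value, optionally ignoring NA
  return((x - lo) / (hi - lo))
}

scaled <- rescale(c(2, 4, 6))
print(scaled)
```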
Twitter Example
First Install the Package
install.packages("twitteR")
Twitter Example
Authenticate
library(twitteR)
consumer = "CONSUMER KEY"
secret = "SECRET KEY"
setup_twitter_oauth(consumer, secret)
Twitter Example
Get Trend Locations
Choose a WOEID (Where On Earth ID) from the returned list of locations
availableTrendLocations()
Twitter Example
Get Trends
trends = getTrends(SOMEWOEID)
Twitter Example
Retrieve Tweets
tweets <- searchTwitter(trends[XX,XX],n=1500)
tweetdf <- do.call("rbind",lapply(tweets,as.data.frame))
Twitter Example
Filter
complete.cases is used to drop rows with NA or NaN values
tweetdf <- tweetdf[complete.cases(tweetdf[,15]),]
tweetdf <- tweetdf[tweetdf[,15] != 0,]
Twitter Example
Simplify the dataframe
simpledf <- tweetdf[c("screenName","longitude","latitude")]
Twitter Example
Create Matrix from Dataframe
tweetMatrix <- data.matrix(simpledf[2:3],rownames.force = FALSE)
Twitter Example
Plot the Latitude and Longitude
plot(tweetMatrix)
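The filter, simplify, matrix, and plot steps above can be tried without Twitter credentials. The sketch below substitutes a small made-up data frame for the real tweets and selects the longitude column by name rather than by position; the rows and coordinates are invented for the example.

```r
# Made-up stand-in for the data frame built from tweets
tweetdf <- data.frame(
  screenName = c("a", "b", "c", "d"),
  longitude  = c(-77.4, NA, 0, -83.0),
  latitude   = c(38.9, NA, 0, 42.3),
  stringsAsFactors = FALSE
)

# Filter: keep rows whose longitude is present, then drop zero values
tweetdf <- tweetdf[complete.cases(tweetdf[, "longitude"]), ]
tweetdf <- tweetdf[tweetdf[, "longitude"] != 0, ]

# Simplify: keep only the columns needed for plotting
simpledf <- tweetdf[c("screenName", "longitude", "latitude")]

# Convert the two coordinate columns to a numeric matrix
tweetMatrix <- data.matrix(simpledf[2:3], rownames.force = FALSE)

# Scatter plot: first column on the x axis, second on the y axis
plot(tweetMatrix)
```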
Graphing
Image
Contour
Box Chart
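Assuming the three chart types above correspond to the base graphics functions image(), contour(), and boxplot(), a quick sketch using data sets bundled with R:

```r
# volcano is an 87 x 61 matrix of elevations that ships with R
image(volcano, main = "image()")       # color-coded grid of the values
contour(volcano, main = "contour()")   # lines of equal elevation

# Box plot of stopping distance grouped by speed, from the bundled cars data
bp <- boxplot(dist ~ speed, data = cars, main = "boxplot()")

# boxplot() also returns the numbers behind the drawing
print(bp$n)   # observations per group
```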
K-Means
Essentially an iterative search for cluster centers
Divides a dataset into k clusters
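A minimal sketch using kmeans() from R's built-in stats package on made-up data. The two groups are deliberately well separated so the result is easy to check; real data is rarely this clean.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Made-up data: two well-separated groups of 50 two-dimensional points
pts <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2)
)

# Ask for k = 2 clusters; nstart repeats the search from several random starts
fit <- kmeans(pts, centers = 2, nstart = 10)

print(fit$centers)          # one row per cluster center
print(table(fit$cluster))   # number of points assigned to each cluster
```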
Time Series
Stock Quotes
Infection Incidents
Gas Prices
Audio
Etc.
Source: http://www.loc.gov/pictures/resource/hec.23488/
Time Series Analysis
Regression
Forecasting
Time Frequency (FFTs)
Source: http://groups.csail.mit.edu/netmit/sFFT/algorithm.html
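The three techniques above can be sketched in base R on a made-up monthly series. The straight-line trend and the naive extrapolation are simplifying assumptions for the example, not a recommended forecasting method.

```r
# Made-up monthly series: linear trend plus a 12-month cycle
month <- 1:48
y <- 0.5 * month + 3 * sin(2 * pi * month / 12)
series <- ts(y, start = c(2012, 1), frequency = 12)

# Regression: fit a straight line through the series
trend <- lm(y ~ month)

# Forecasting: extend the fitted line six months ahead
future <- predict(trend, newdata = data.frame(month = 49:54))

# Frequency analysis: FFT magnitudes of the detrended series
magnitudes <- Mod(fft(residuals(trend)))

# Strongest component, in cycles per 48 months (element k + 1 holds k cycles)
peak <- which.max(magnitudes[2:24])

print(future)
print(peak)
```

With 48 monthly samples, a 12-month cycle repeats 4 times, which is what the peak index reports.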
Using Enterprise Data Sources
Database
Streams
Files
Etc.
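For files and stream-like sources, base R's read.csv() accepts either a path or a connection. The sketch below uses a temporary file and an in-memory connection as stand-ins for real enterprise sources; the column names and values are made up. Database access needs an add-on package and is not shown here.

```r
# Files: write a small made-up CSV to a temporary file, then read it back
path <- tempfile(fileext = ".csv")
writeLines(c("id,amount", "1,10.5", "2,20.0", "3,"), path)
orders <- read.csv(path)   # the empty amount field is read as NA

# Streams: read.csv also accepts a connection, here an in-memory one
con <- textConnection("id,amount\n4,7.25\n5,3.50")
more_orders <- read.csv(con)
close(con)                 # the connection was opened by us, so we close it

# Combine both sources into one data frame
all_orders <- rbind(orders, more_orders)
print(all_orders)

unlink(path)   # remove the temporary file
```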
Oracle R Distribution
Available on Oracle Public Yum
Enhanced dynamic Library loading
Enterprise Support Available
Oracle Advanced Analytics
Oracle Linux
Oracle Big Data Appliance
http://www.oracle.com/technetwork/database/database-technologies/r/r-distribution/overview/index.html
Oracle R Enterprise
Component of the Oracle Advanced Analytics Option on Oracle Database EE
Allows use of R in the database without SQL
Save R objects in the database
Easily integrate with OBIEE
http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html
Oracle R Advanced Analytics for Hadoop
Component of the Oracle Big Data Software Connectors Suite, an option for the BDA
Provides abstraction from HiveQL through R, just as Oracle R Enterprise does for SQL
http://www.oracle.com/technetwork/database/database-technologies/bdc/r-advanalytics-for-hadoop/overview/index.html
ROracle
Open Source Package
Maintained by Oracle
Uses the Oracle Call Interface (OCI) to interact with databases
http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html
Contact Information
Harold Dost III
Principal Consultant
@hdost
Resources
https://en.wikibooks.org/wiki/Statistical_Analysis:_an_Introduction_using_R/R_basics
http://www.r-project.org/
https://docs.oracle.com/cd/E57012_01/doc.141/e56973/toc.htm
http://cran.r-project.org/web/packages/akmeans/index.html
http://cran.r-project.org/web/packages/twitteR/index.html
http://en.wikipedia.org/wiki/K-means_clustering
http://www.rdatamining.com/examples/kmeans-clustering
http://blog.revolutionanalytics.com/2009/02/how-to-choose-a-random-number-in-r.html
https://www.packtpub.com/books/content/text-mining-r-part-2
http://www.eia.gov/totalenergy/data/monthly/index.cfm#consumption