Rand Ha Do Op
8/3/2019 Rand Ha Do Op
http://slidepdf.com/reader/full/rand-ha-do-op 1/22
Revolution Analytics
September 21, 2011
Leveraging R in Hadoop Environments
In Today’s Webinar:
About Revolution Analytics
Why R and Hadoop?
The Packages (rhdfs, rhbase, rmr)
Examples
Resources and Further Reading
Co-sponsored by Revolution and Cloudera
The professor who invented analytic software for the experts now wants to take it to the masses
Most advanced statistical analysis software available
Half the cost of commercial alternatives
2M+ Users
3,000+ Applications
Statistics
Predictive Analytics
Data Mining
Visualization
Finance
Life Sciences
Manufacturing
Retail
Telecom
Social Media
Government
Power
Productivity
Enterprise Readiness
What’s the Difference Between R and Revolution R Enterprise?
Revolution R is 100% R and More®
R Engine
Language Libraries
4,000+ Community Packages
Technical Support
Web-Based GUI
Web Services API
Big Data Analysis
IDE / Developer GUI
Build Assurance
Parallel Tools
Multi-Threaded Math Libraries
For more information contact: [email protected]
Let’s Talk about R and Hadoop
Why R and Hadoop?
Hadoop offers a scalable infrastructure for processing massive amounts of data
Storage – HDFS, HBASE
Distributed Computing - MapReduce
R is a statistical programming language for developing advanced analytic applications
There is a need for more than counts and averages on these big data sets
Analyzing all of the data can lead to insights that sampling or subsets can’t reveal.
Motivation for this project
Make it easy for the R programmer to interact with the Hadoop data stores and write MapReduce programs
Ability to run R on a massively distributed system without having to understand the underlying infrastructure
Keep statisticians focused on the analysis and not the implementation details
Open source to drive innovation and collaboration.
R and Hadoop – The R Packages
[Diagram labels: R Client, R, Map or Reduce, JobTracker, TaskNode, HDFS, HBASE, Thrift, rmr, rhdfs, rhbase]
Capabilities delivered as individual R packages
rhdfs - R and HDFS
rhbase - R and HBASE
rmr - R and MapReduce
Downloads available from GitHub
rhdfs
Manipulate HDFS directly from R
Mimic as much of the HDFS Java API as possible
Examples:
Read an HDFS text file into a data frame.
Serialize/Deserialize a model to HDFS
Write an HDFS file to local storage
rhdfs/pkg/inst/unitTests
rhdfs/pkg/inst/examples
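As a rough illustration of the serialize/deserialize example above, the sketch below uses only base R, with a local temporary file standing in for HDFS. It is an assumption that, with rhdfs, the same raw bytes would be passed through the package's file read/write functions; their exact signatures are not shown on these slides, so none are called here.

```r
# Fit a small model on a built-in dataset.
model <- lm(mpg ~ wt, data = mtcars)

# Serialize the model to a raw vector (connection = NULL returns the bytes).
model.bytes <- serialize(model, connection = NULL)

# Stand-in for HDFS (assumption): write the bytes to a local temporary file.
path <- tempfile(fileext = ".bin")
writeBin(model.bytes, path)

# Read the bytes back and rebuild the model object.
restored.bytes <- readBin(path, what = "raw", n = file.info(path)$size)
restored <- unserialize(restored.bytes)

# The restored model carries the same coefficients as the original.
stopifnot(isTRUE(all.equal(coef(model), coef(restored))))
```

The point of going through a raw vector is that any byte-oriented store, local or distributed, can hold the model without knowing anything about R objects.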
rhdfs Functions
File Manipulations - hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get
File Read/Write - hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file
Directory - hdfs.dircreate, hdfs.mkdir
Utility - hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists
Initialization – hdfs.init, hdfs.defaults
rhbase
Manipulate HBASE tables and their content
Uses Thrift C++ API as the mechanism to communicate to HBASE
Examples
Create a data frame from a collection of rows and columns in an HBASE table
Update an HBASE table with values from a data frame
rhbase/pkg/inst/unitTests
rhbase Functions
Table Manipulation – hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table
Row Read/Write - hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan
Utility - hb.list.tables
Initialization - hb.defaults, hb.init
rmr
Designed to be the simplest and most elegant way to write MapReduce programs
Gives the R programmer the tools necessary to perform data analysis in a way that is “R” like
Provides an abstraction layer to hide the implementation details
Examples
Simulations - Monte Carlo and other Stochastic analysis
R ‘apply’ family of operations (tapply, lapply…)
Binning, quantiles, summaries, crosstabs and inputs to visualization (ggplot, lattice).
Data Mining and Machine Learning
rmr/pkg/inst/tests
rmr mapreduce Function
mapreduce (input, output, map, reduce, …)
input – input folder
output – output folder
map – R function used as map
reduce – R function used as reduce
… - other advanced parameters
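To make the roles of map and reduce concrete, here is a minimal in-memory sketch of the map, group-by-key, reduce flow in base R. It does not use rmr or Hadoop: `kv` and `local.mapreduce` are hypothetical helpers invented for this illustration, and they only approximate what the real mapreduce function does on a cluster.

```r
# Hypothetical local stand-in for a key-value pair (not rmr's keyval).
kv <- function(key, value) list(key = key, value = value)

# Hypothetical local emulation of map -> group by key -> reduce.
# `input` is a list of key-value pairs; `map` and `reduce` each return one pair.
local.mapreduce <- function(input,
                            map = function(k, v) kv(k, v),
                            reduce = NULL) {
  # Map phase: apply `map` to every input pair.
  mapped <- lapply(input, function(p) map(p$key, p$value))
  if (is.null(reduce)) return(mapped)

  # Shuffle phase: group the mapped values by key.
  keys <- vapply(mapped, function(p) as.character(p$key), character(1))
  groups <- split(lapply(mapped, function(p) p$value), keys)

  # Reduce phase: call `reduce` once per distinct key with all of its values.
  unname(Map(function(k, vv) reduce(k, vv), names(groups), groups))
}

# Count words by length: key each word by its length, then count per key.
words <- c("map", "reduce", "r", "hadoop", "hdfs", "rmr")
input <- lapply(words, function(w) kv(NULL, w))
out <- local.mapreduce(
  input,
  map = function(k, v) kv(nchar(v), v),
  reduce = function(k, vv) kv(k, length(vv)))

# Collect the results into a named vector: word length -> number of words.
counts <- setNames(vapply(out, function(p) p$value, integer(1)),
                   vapply(out, function(p) p$key, character(1)))
```

In this emulation the keys become character strings during grouping, which is a simplification of the sketch rather than a statement about how rmr handles keys.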
The Basics
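library(rmr)  # provides to.dfs, mapreduce and keyval

# Plain R: square each integer with lapply
small.ints = 1:10
out = lapply(small.ints, function(x) x^2)

# Same computation as a map-only MapReduce job
small.ints = to.dfs(1:10)
out = mapreduce(input = small.ints,
                map = function(k,v) keyval(k, k^2))

# Plain R: count occurrences of each value with tapply
groups = rbinom(32, n = 50, prob = 0.4)
out = tapply(groups, groups, length)

# Same count as a reduce-only MapReduce job
groups = to.dfs(groups)
out = mapreduce(input = groups,
                reduce = function(k,vv) keyval(k, length(vv)))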
K-means
kmeans =
  function(points, ncenters, iterations = 10,
           distfun =
             function(a,b) norm(as.matrix(a-b), type='F')) {
    # First pass: no centers yet, so kmeans.iter assigns points at random
    newCenters = kmeans.iter(points, distfun = distfun, ncenters = ncenters)
    for(i in 1:iterations) {
      newCenters = lapply(values(newCenters), unlist)
      newCenters = kmeans.iter(points, distfun,
                               centers = newCenters)}
    newCenters}
kmeans.iter =
  function(points, distfun, ncenters = length(centers),
           centers = NULL) {
    from.dfs(mapreduce(input = points,
      # Map: key each point by a random cluster (first pass) or its nearest center
      map = if (is.null(centers)) {
              function(k,v) keyval(sample(1:ncenters,1),v)
            } else {
              function(k,v) {
                distances = lapply(centers, function(c) distfun(c,v))
                keyval(centers[[which.min(distances)]],v)}},
      # Reduce: the new center is the column-wise mean of the points in the group
      reduce = function(k,vv) keyval(NULL, apply(do.call(rbind,vv),2,mean))))}
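To see what one iteration computes without a cluster, the following base R sketch runs the same two steps locally: assign each point to its nearest center, then average each group. `local.kmeans.iter` is a hypothetical helper written for this illustration, the sample points and starting centers are made up, and the code assumes every center keeps at least one point.

```r
# Euclidean distance, matching the default distfun on the slides.
distfun <- function(a, b) norm(as.matrix(a - b), type = "F")

# One local k-means step (hypothetical helper, no Hadoop involved):
# "map" assigns each point to its nearest center,
# "reduce" averages the points assigned to each center.
local.kmeans.iter <- function(points, centers) {
  nearest <- vapply(points, function(p) {
    which.min(vapply(centers, function(c) distfun(c, p), numeric(1)))
  }, integer(1))
  lapply(split(points, nearest),
         function(vv) apply(do.call(rbind, vv), 2, mean))
}

# Two well-separated groups of 2-D points (made-up sample data).
points <- list(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11), c(11, 10))
centers <- list(c(1, 1), c(9, 9))

# Repeat the step a few times; the centers settle on the two group means.
for (i in 1:5) centers <- unname(local.kmeans.iter(points, centers))
```

The grouping done here by `split` corresponds to what the MapReduce shuffle does between the map and reduce functions in `kmeans.iter`.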
Final thoughts
R and Hadoop together offer the innovation and flexibility needed to meet the analytics challenges of big data
We need contributors to this project!
Developers
Documentation
Use cases
General Feedback
Resources
Slides / Replay: bit.ly/r-and-hadoop
Open source project: https://github.com/RevolutionAnalytics/RHadoop/wiki
Participate in our survey: http://www.surveymonkey.com/s/JM3N6RP
Revolution R Enterprise: bit.ly/Enterprise-R
Cloudera CDH: http://www.cloudera.com/hadoop/
Email: [email protected]
www.revolutionanalytics.com 650.330.0553 Twitter: @RevolutionR
The leading commercial provider of software and support for the popular open source R statistics language.
Thank you.