Big Data Analytics with R and Hadoop
Chapter 3: Integrating R and Hadoop
Sang-Min Song, 2015.04.09
Three ways to link R and Hadoop
RHIPE
RHadoop
Hadoop streaming
Introducing RHIPE
RHIPE stands for R and Hadoop Integrated Programming Environment. It means "in a moment" in Greek and is a merger of R and Hadoop.
The RHIPE package uses the Divide and Recombine technique to perform data analytics over Big Data.
RHIPE has mainly been designed to accomplish two goals:
Allowing you to perform in-depth analysis of large as well as small data.
Allowing users to perform the analytics operations within R using a lower-level language.
RHIPE is a lower-level interface over HDFS and MapReduce operations.
Install Sequence
1. Installing Hadoop.
2. Installing R.
3. Installing protocol buffers.
4. Setting up environment variables.
5. Installing rJava.
6. Installing RHIPE.
Installing RHIPE
3. Installing protocol buffers
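The protocol buffers install is a build from source. A minimal sketch follows; the version number and download URL are assumptions (use the release matching your RHIPE build), not taken from the slides:

```bash
# Hedged sketch: build and install Protocol Buffers from source
# (version and URL are illustrative only)
wget http://protobuf.googlecode.com/files/protobuf-2.4.1.tar.gz
tar -xzf protobuf-2.4.1.tar.gz
cd protobuf-2.4.1
./configure
make
sudo make install
sudo ldconfig   # refresh the shared-library cache so R can find libprotobuf
```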
Installing RHIPE
4. Environment variables
~/.bashrc file of hduser (Hadoop user)
R console
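As a sketch, the same Hadoop-related variables exported in hduser's ~/.bashrc can also be set per session from the R console; all paths below are assumptions for a typical /usr/local/hadoop install:

```r
# Hedged sketch: set Hadoop environment variables from the R console
# (paths are assumptions, not from the slides)
Sys.setenv(HADOOP_HOME = "/usr/local/hadoop")
Sys.setenv(HADOOP_BIN  = "/usr/local/hadoop/bin")
Sys.setenv(HADOOP_CONF_DIR = "/usr/local/hadoop/conf")
```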
Installing RHIPE
5. The rJava package installation
6. Installing RHIPE
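Steps 5 and 6 might look like the following from R; the RHIPE tarball name and version are assumptions for illustration:

```r
# Hedged sketch: install rJava from CRAN, then RHIPE from a source tarball
install.packages("rJava")
# the tarball name/version below is illustrative, not from the slides
install.packages("Rhipe_0.73.1.tar.gz", repos = NULL, type = "source")
library(Rhipe)
rhinit()   # initialize RHIPE against the running Hadoop cluster
```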
Understanding the architecture of RHIPE
Word count
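The word-count slides can be sketched in RHIPE roughly as follows, assuming a running Hadoop cluster with input text already in HDFS; the paths and expression style follow common RHIPE examples and are assumptions, not the slides' exact code:

```r
# Hedged sketch of word count with RHIPE (requires a live Hadoop cluster)
library(Rhipe)
rhinit()

# Map: emit (word, 1) for each word in each input line
map <- expression({
  lapply(seq_along(map.values), function(i) {
    words <- strsplit(map.values[[i]], "\\s+")[[1]]
    for (w in words) rhcollect(w, 1)
  })
})

# Reduce: sum the counts arriving for each word
reduce <- expression(
  pre    = { total <- 0 },
  reduce = { total <- total + sum(unlist(reduce.values)) },
  post   = { rhcollect(reduce.key, total) }
)

# Define and run the job; input/output paths are assumptions
job <- rhwatch(map = map, reduce = reduce,
               input = rhfmt("/RHIPE/input/", type = "text"),
               output = "/RHIPE/output/",
               jobname = "word_count")
```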
Understanding the RHIPE function reference
All these methods fall into three categories: Initialization, HDFS, and MapReduce operations.
Initialization
rhinit(TRUE, TRUE)
Understanding the RHIPE function reference
HDFS
rhls(path)
hdfs.getwd()
hdfs.setwd("/RHIPE")
rhput(src, dest), e.g., rhput("/usr/local/hadoop/NOTICE.txt", "/RHIPE/")
rhcp('/RHIPE/1/change.txt', '/RHIPE/2/change.txt')
rhdel("/RHIPE/1")
rhget("/RHIPE/1/part-r-00000", "/usr/local/")
rhwrite(list(1,2,3), "/tmp/x")
Understanding the RHIPE function reference
MapReduce
rhwatch(map, reduce, combiner, input, output, mapred, partitioner, jobname)
rhex(job)
rhjoin(job)
rhkill(job)
rhoptions()
rhstatus(job)
Introducing RHadoop
RHadoop is available as three main R packages: rhdfs, rmr, and rhbase.
rhdfs is an R interface providing HDFS usability from the R console.
rmr is an R interface providing the Hadoop MapReduce facility inside the R environment.
rhbase is an R interface for operating the Hadoop HBase data source stored on the distributed network via a Thrift server.
Understanding the architecture of RHadoop
Since Hadoop is highly popular because of HDFS and MapReduce, Revolution Analytics has developed separate R packages for each, namely rhdfs, rmr, and rhbase.
Installing RHadoop
We need several R packages to be installed that help connect R with Hadoop:
rJava, RJSONIO, itertools, digest, Rcpp, httr, functional, devtools, plyr, reshape2
Setting environment variables
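The environment variables RHadoop expects might be set like this before loading the packages; the paths and the streaming jar version are assumptions for a typical single-node install:

```r
# Hedged sketch: environment variables rhdfs/rmr look for
# (paths and jar version are assumptions, not from the slides)
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING =
  "/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar")
```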
Installing RHadoop
Installing the RHadoop packages [rhdfs, rmr, rhbase]
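Installation from the downloaded source tarballs might look like this; the tarball names and versions are assumptions for illustration:

```r
# Hedged sketch: install the RHadoop packages from source tarballs
# (file names/versions are illustrative, not from the slides)
install.packages("rhdfs_1.0.8.tar.gz",  repos = NULL, type = "source")
install.packages("rmr2_3.3.1.tar.gz",   repos = NULL, type = "source")
install.packages("rhbase_1.2.1.tar.gz", repos = NULL, type = "source")

library(rhdfs)
hdfs.init()   # connect the R session to HDFS
library(rmr2)
```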
Word count
Map phase
Reduce phase
Defining the MapReduce job
Executing the MapReduce job
Exploring the wordcount output
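The phases listed above can be sketched with rmr2, assuming rhdfs/rmr2 are initialized against a running cluster; the function names wc.map/wc.reduce and the HDFS paths are assumptions:

```r
# Hedged sketch of word count with rmr2 (requires a live Hadoop cluster)
library(rmr2)

# Map phase: split each line into words and emit (word, 1)
wc.map <- function(., lines) {
  keyval(unlist(strsplit(lines, "\\s+")), 1)
}

# Reduce phase: sum the counts collected for each word
wc.reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

# Defining the MapReduce job
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output,
            input.format = "text",
            map = wc.map, reduce = wc.reduce)
}

# Executing the job and exploring the output
out <- from.dfs(wordcount("/RHadoop/input/"))
head(data.frame(word = keys(out), count = values(out)))
```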
Understanding the RHadoop function reference
The rhdfs package
Initialization
hdfs.init()
hdfs.defaults()
File manipulation
hdfs.put('/usr/local/hadoop/README.txt', '/RHadoop/1/')
hdfs.copy('/RHadoop/1/', '/RHadoop/2/')
hdfs.move('/RHadoop/1/README.txt', '/RHadoop/2/')
hdfs.rename('/RHadoop/README.txt', '/RHadoop/README1.txt')
hdfs.delete("/RHadoop")
hdfs.rm("/RHadoop")
hdfs.chmod('/RHadoop', permissions = '777')
File read/write
f = hdfs.file("/RHadoop/2/README.txt", "r", buffersize = 104857600)
hdfs.write(object, con, hsync = FALSE)
m = hdfs.read(f)
hdfs.close(f)
Directory operations
hdfs.mkdir("/RHadoop/2/")
hdfs.rm("/RHadoop/2/")
Utility
hdfs.ls('/')
hdfs.file.info("/RHadoop")
The rmr package
For storing and retrieving data
small.ints = to.dfs(1:10)
from.dfs('/tmp/RtmpRMIXzb/file2bda3fa07850')
For MapReduce
mapreduce(input, output, map, reduce, combine, input.format, output.format, verbose)
keyval(key, val)
Thank you