Driving New Value from Big Data Investments
An Introduction to Using R with Hadoop
Jeffrey Breen
Principal, Think Big
[email protected]
http://www.thinkbigacademy.com/
February 2013
Building Modern Analytics Solutions to Monetize Big Data Investments
Leading Provider of Innovative Big Analytics Services
• IMAGINE: Strategy and Roadmap
• ILLUMINATE: Training and Education
• IMPLEMENT: Hands-On Data Science and Data Engineering
We Accelerate Your Time to Value
THINK BIG Analytics Methodology
Experiment-Driven Short Projects with Nimble Test Solution Cycles
• Breaking Down Business and IT Barriers
• Discrete Projects with Beginning and End
• Early Releases to Validate ROI and Ensure Long Term Success
IMAGINE
ILLUMINATE
IMPLEMENT
Innovation and Value
Enable Your IT Staff with New Skills
ILLUMINATE: Training and Education
• Build Capabilities to Manage Rapid Innovation Needed with Big Data
• Invest in and Scale Skills to Create Data-Driven Organization
• Expert Training/Courses
  - e.g. Hadoop Developer, HBase, Pig and Hive for Modelers
• Joint Application Development
• Side-by-Side Mentoring
[Diagram labels: Data Architect; Data Architect Big Data; Monitoring; Database Administrator; Big Data Administrator; Business Analyst; Data Science Math Modeler; Developers; Big Data Engineering]
THINK BIG Analytics
Agenda
• Why R?
• What is Hadoop?
• Counting words with MapReduce
• Writing MapReduce jobs with RHadoop
• Data Warehousing with Hive
• Big Data ≠ Hadoop
• Want to learn more?
• Q&A
Revolution Confidential
Image from http://thebalancedguy.blogspot.com/2010/09/with-3-boys-and-having-been-cub-scout.html
Image from http://www.wengerna.com/giant-knife-16999
Number of R Packages Available
How many R Packages are there now?
At the command line enter:
> dim(available.packages())
Slide courtesy of John Versotek, organizer of the Boston Predictive Analytics Meetup
Google File System is the Storage.
2003
MapReduce is the framework.
2004
Enter Hadoop
About this time, Doug Cutting, the creator of Lucene, was working on Nutch.
Nutch Timeline
Year | Topics
2003 | Google’s GFS paper.
2004 | Nutch Distributed File System (NDFS).
2004 | Google’s MapReduce paper.
2004-2005 | Nutch MapReduce Implementation.
Hadoop Timeline
Year | Topics
2006 | NDFS and Nutch MapReduce extracted to separate Hadoop Apache project.
2008 | Hadoop is a top-level Apache project. Yahoo! announces 10K core cluster.
Hadoop Design Goals
• Optimize disk I/O performance.
  - Minimize disk head seeks!
• Redundant data storage and processing to eliminate many kinds of data loss.
• Horizontal scalability.
• Run on commodity, server-class hardware.
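A rough back-of-the-envelope calculation suggests why minimizing seeks matters. The 10 ms seek time, the 100 MB/s transfer rate, and the chunk sizes below are illustrative assumptions, not figures taken from the slides.

```r
# Illustrative assumptions (not from the slides): typical spinning-disk figures.
seek_time_s       <- 0.010   # one disk head seek, in seconds
transfer_mb_per_s <- 100     # sustained sequential read rate, in MB/s

# Fraction of total read time spent seeking when every seek is followed
# by one sequential read of `chunk_mb` megabytes.
seek_overhead <- function(chunk_mb) {
  transfer_time_s <- chunk_mb / transfer_mb_per_s
  seek_time_s / (seek_time_s + transfer_time_s)
}

chunk_sizes_mb <- c(0.004, 1, 64)   # roughly a 4 KB page, 1 MB, and 64 MB
overhead <- sapply(chunk_sizes_mb, seek_overhead)
print(data.frame(chunk_mb = chunk_sizes_mb,
                 seek_overhead_pct = round(100 * overhead, 1)))
```

Under these assumptions, small random reads spend almost all of their time seeking, while large sequential reads make the seek cost nearly negligible.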
from Jeff Dean, based on Peter Norvig’s http://norvig.com/21-days.html
What is Hadoop?
• An open source project designed to support large scale data processing
• Inspired by Google’s MapReduce-based computational infrastructure
• Comprised of several components
  - Hadoop Distributed File System (HDFS)
  - MapReduce processing framework, job scheduler, etc.
  - Ingest/outgest services (Sqoop, Flume, etc.)
  - Higher level languages and libraries (Hive, Pig, Cascading, Mahout)
• Written in Java, first opened up to alternatives through its Streaming API
  → If your language of choice can handle stdin and stdout, you can use it to write MapReduce jobs
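To make the stdin/stdout idea concrete, here is a minimal sketch of a streaming-style word-count mapper in R. The function name `stream_mapper`, the tokenizing rule, and the tab-separated "word, 1" output are assumptions chosen for illustration; the connections are passed as arguments so the logic can be tried locally without a cluster.

```r
# Sketch of a streaming-style mapper: read text lines from `input`,
# write one "word<TAB>1" line per word to `output`.
stream_mapper <- function(input, output) {
  lines <- readLines(input, warn = FALSE)
  for (line in lines) {
    words <- strsplit(tolower(line), "[^a-z0-9]+")[[1]]
    words <- words[nzchar(words)]                 # drop empty tokens
    for (w in words) {
      writeLines(paste(w, 1, sep = "\t"), output) # emit key<TAB>value
    }
  }
  invisible(NULL)
}

# Local demonstration with in-memory connections instead of stdin/stdout.
input  <- textConnection(c("Hadoop uses MapReduce", "There is a Map phase"))
output <- textConnection(NULL, open = "w")
stream_mapper(input, output)
emitted <- textConnectionValue(output)
close(input)
close(output)
print(emitted)
```

In an actual streaming script the same function would presumably be called as `stream_mapper(file("stdin"), stdout())`.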
Hadoop cluster components
[Diagram: Hadoop cluster architecture. Key: italics = process; ✲ = MR jobs]
• Client Servers: ✲ Hive, Pig, ...; ✲ cron+bash, Azkaban, …; Sqoop, Scribe, …; Monitoring, Management
• Outside the cluster: Ingest Service, Outgest Service, SQL Stores, Logs
• Primary Master Server: ✲ Job Tracker, Name Node
• Secondary Master Server: Secondary Name Node
• Slave Servers: each runs ✲ Task Tracker and Data Node, with multiple disks
from Think Big Academy’s Hadoop Developer Course
Hadoop’s distributed file system
• Services
  - Name Node
  - Data Nodes
• 64MB blocks
• 3x replication
[Diagram: cluster layout repeated from the “Hadoop cluster components” slide]
from Think Big Academy’s Hadoop Developer Course
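A small worked example shows what 64MB blocks and 3x replication mean for a single file. The 1,000 MB file size is a hypothetical value, and the helper `hdfs_footprint` is introduced here only for illustration.

```r
# Block size and replication factor as stated on the slide.
block_size_mb <- 64
replication   <- 3

# Number of blocks, block replicas, and raw cluster storage for a file of
# `file_mb` megabytes. The final block only occupies as much space as the
# data it actually holds, so raw storage is simply size x replication.
hdfs_footprint <- function(file_mb) {
  blocks <- ceiling(file_mb / block_size_mb)
  list(blocks         = blocks,
       block_replicas = blocks * replication,
       raw_storage_mb = file_mb * replication)
}

fp <- hdfs_footprint(1000)   # a hypothetical 1,000 MB file
str(fp)
```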
True confession: I was wrong about MapReduce
• When the Google paper was published in 2004, I was running a typical enterprise IT department
• Big hardware (Sun, EMC) + big applications (Siebel, PeopleSoft) + big databases (Oracle, SQL Server) = big licensing & support costs
• Loved the scalability, COTS components, and price, but missed the fact that keys (and values) could be compound & complex
• ... and examples like Wordcount didn’t help!
Source: Hadoop: The Definitive Guide, Second Edition, p. 20
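To illustrate what a compound key can look like, the sketch below groups records by a two-part key and averages a value per group, all in plain R with no Hadoop involved. The flight records, column names, and the choice of mean delay are made up for illustration.

```r
# Hypothetical flight records; names and values are illustrative only.
flights <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "AA"),
  month   = c(1, 1, 1, 2, 2),
  delay   = c(10, 20, 5, 15, 30),
  stringsAsFactors = FALSE
)

# Map: emit one (key, value) pair per record, where the key is the
# compound (carrier, month) and the value is the delay.
pairs <- lapply(seq_len(nrow(flights)), function(i) {
  list(key = list(carrier = flights$carrier[i], month = flights$month[i]),
       val = flights$delay[i])
})

# Shuffle: group values by key. The compound key is pasted into a string
# only so that it can serve as a grouping label.
labels  <- sapply(pairs, function(p) paste(p$key$carrier, p$key$month, sep = "|"))
grouped <- split(sapply(pairs, function(p) p$val), labels)

# Reduce: one mean delay per compound key.
mean_delay <- sapply(grouped, mean)
print(mean_delay)
```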
[Diagram: word count data flow through Input, Mappers, Sort/Shuffle, Reducers, Output]
Input:
  Hadoop uses MapReduce
  There is a Map phase
  There is a Reduce phase
Output:
  a 2, hadoop 1, is 2
  map 1, mapreduce 1, phase 2
  reduce 1, there 2, uses 1
We need to convert the Input into the Output.
from Think Big Academy’s Hadoop Developer Course
Copyright © 2011-2013, Think Big Analytics, All Rights Reserved
[Diagram, step 1: each input line reaches a mapper as a key-value pair (N, "…"); an empty line arrives as (N, "")]
from Think Big Academy’s Hadoop Developer Course
[Diagram, step 2: mappers emit one (word, 1) pair per word]
  Hadoop uses MapReduce → (hadoop, 1) (uses, 1) (mapreduce, 1)
  There is a Map phase → (there, 1) (is, 1) (a, 1) (map, 1) (phase, 1)
  There is a Reduce phase → (there, 1) (is, 1) (a, 1) (reduce, 1) (phase, 1)
from Think Big Academy’s Hadoop Developer Course
Image from http://blog.stackoverflow.com/wp-content/uploads/then-a-miracle-occurs-cartoon.png
[Diagram, step 3: sort and shuffle route each mapper output pair to a reducer by key range]
  Keys 0-9, a-l: (hadoop, 1), (is, 1), (a, 1), (is, 1), (a, 1)
  Keys m-q: (mapreduce, 1), (phase, 1), (map, 1), (phase, 1)
  Keys r-z: (uses, 1), (there, 1), (there, 1), (reduce, 1)
from Think Big Academy’s Hadoop Developer Course
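The routing by key range shown in the diagram can be sketched as a small partition function. The name `partition_for` and its vectorized form are assumptions for illustration; note that Hadoop's default partitioner hashes keys, and ranges are used here only because the slide illustrates them.

```r
# Assign each key to one of three reducers by its first character, using
# the ranges from the diagram: 0-9 and a-l, m-q, r-z.
partition_for <- function(keys) {
  first  <- substr(tolower(keys), 1, 1)
  ranges <- list(c(as.character(0:9), letters[1:12]),  # 0-9, a-l
                 letters[13:17],                        # m-q
                 letters[18:26])                        # r-z
  part <- rep(NA_integer_, length(keys))
  for (i in seq_along(ranges)) {
    part[first %in% ranges[[i]]] <- i
  }
  part
}

keys <- c("hadoop", "uses", "mapreduce", "there", "is", "a",
          "map", "phase", "reduce")
by_reducer <- split(keys, partition_for(keys))
print(by_reducer)
```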
[Diagram, step 4: each reducer receives its keys in sorted order, with the values for each key collected into a list]
  Keys 0-9, a-l: (a, [1,1]), (hadoop, [1]), (is, [1,1])
  Keys m-q: (map, [1]), (mapreduce, [1]), (phase, [1,1])
  Keys r-z: (reduce, [1]), (there, [1,1]), (uses, [1])
from Think Big Academy’s Hadoop Developer Course
[Diagram, step 5: each reducer turns the list of values for a key into one count and writes one output line per word]
  (a, [1,1]), (hadoop, [1]), (is, [1,1]) → a 2, hadoop 1, is 2
  (map, [1]), (mapreduce, [1]), (phase, [1,1]) → map 1, mapreduce 1, phase 2
  (reduce, [1]), (there, [1,1]), (uses, [1]) → reduce 1, there 2, uses 1
from Think Big Academy’s Hadoop Developer Course
[Diagram: the complete word count flow from input to output, with a summary]
Map:
• Transform one input to 0-N outputs.
Reduce:
• Collect multiple inputs into one output.
from Think Big Academy’s Hadoop Developer Course
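The whole flow can be imitated in plain R, without Hadoop, to see the three stages side by side. This is only a single-process sketch of the idea; the helper `map_line` and the whitespace tokenizing rule are assumptions.

```r
input <- c("Hadoop uses MapReduce", "There is a Map phase", "",
           "There is a Reduce phase")

# Map: one input line -> zero or more (word, 1) pairs.
map_line <- function(line) {
  words <- strsplit(tolower(line), "\\s+")[[1]]
  words <- words[nzchar(words)]
  lapply(words, function(w) list(key = w, val = 1))
}
pairs <- unlist(lapply(input, map_line), recursive = FALSE)

# Sort and shuffle: collect the values for each key, keys in sorted order.
keys    <- sapply(pairs, function(p) p$key)
vals    <- sapply(pairs, function(p) p$val)
grouped <- split(vals, keys)

# Reduce: many values for one key -> one output value.
counts <- sapply(grouped, sum)
print(counts)
```

The empty input line produces no pairs at all, which is the "0" case of "0-N outputs".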
Enter RHadoop
• RHadoop is an open source project sponsored by Revolution Analytics
• Package Overview
  - rmr2 - all MapReduce-related functions
  - rhdfs - interaction with Hadoop’s HDFS file system
  - rhbase - access to the NoSQL HBase database
• rmr2 uses Hadoop’s Streaming API to allow R users to write MapReduce jobs in R
  - handles all of the I/O and job submission for you (no while(<stdin>)-like loops!)
RHadoop Advantages
• Modular
  - Packages group similar functions
  - Only load (and learn!) what you need
  - Minimizes prerequisites and dependencies
• Open Source
  - Cost: Low (no) barrier to start using
  - Transparency: Development, issue tracker, Wiki, etc. hosted on github
    • https://github.com/RevolutionAnalytics/RHadoop/
• Supported
  - Sponsored by Revolution Analytics
  - Training & professional services available
  - Support available with Revolution R Enterprise subscriptions
wordcount: code
library(rmr2)

map = function(k, lines) {
  words.list = strsplit(lines, '\\s')
  words = unlist(words.list)
  return( keyval(words, 1) )
}

reduce = function(word, counts) {
  keyval(word, sum(counts))
}

wordcount = function (input, output = NULL) {
  mapreduce(input = input,
            output = output,
            input.format = "text",
            map = map,
            reduce = reduce)
}
from Revolution Analytics’ Getting Started with RHadoop course
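The tokenizing step of the mapper can be tried in plain R without rmr2. The sample lines are made up; the variant that splits on runs of whitespace is an assumption offered for comparison, not part of the course code.

```r
lines <- c("to be  or not", "to be")   # note the double space

# Splitting on single whitespace characters, as in the mapper above:
# consecutive separators leave an empty string among the words.
words <- unlist(strsplit(lines, '\\s'))
print(words)

# Variant (an assumption, not the course code): split on runs of
# whitespace and drop any remaining empty strings before counting.
words.clean <- unlist(strsplit(lines, '\\s+'))
words.clean <- words.clean[nzchar(words.clean)]
print(table(words.clean))
```

With the original split, the empty string would be counted as a "word" of its own, which may or may not matter for a given analysis.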
wordcount: submit job and fetch results
Submit job
> hdfs.root = 'wordcount'
> hdfs.data = file.path(hdfs.root, 'data')
> hdfs.out = file.path(hdfs.root, 'out')
> out = wordcount(hdfs.data, hdfs.out)
Fetch results from HDFS
> results = from.dfs( out )
> results.df = as.data.frame(results, stringsAsFactors=F )
> colnames(results.df) = c('word', 'count')
> head(results.df)
       word count
1 greatness     2
2    damned     3
3       tis     5
4      jade     1
5  magician     1
from Revolution Analytics’ Getting Started with RHadoop course
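Once the results are in a data frame, ordinary R takes over. The sketch below uses a mock in place of the fetched object, assuming it behaves like a list with a vector of keys and a parallel vector of values; the component names `key` and `val` are part of that assumption, and the values are copied from the slide output.

```r
# Mock of a fetched result (assumed structure: parallel key/value vectors).
results <- list(key = c("greatness", "damned", "tis", "jade", "magician"),
                val = c(2, 3, 5, 1, 1))

results.df <- as.data.frame(results, stringsAsFactors = FALSE)
colnames(results.df) <- c("word", "count")

# Most frequent words first.
top <- results.df[order(-results.df$count), ]
print(head(top, 3))
```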
Code notes
• Scalable
  - Hadoop and MapReduce abstract away system details
  - Code runs on 1 node or 1,000 nodes without modification
• Portable
  - You write normal R code, interacting with normal R objects
  - RHadoop’s rmr2 library abstracts away Hadoop details
  - All the functionality you expect is there, including Enterprise R’s
• Flexible
  - Only the mapper deals with the data directly
  - All components communicate via key-value pairs
  - Key-value “schema” chosen for each analysis rather than as a prerequisite to loading data into the system
rmr2 Function Overview
• Convenience
  - keyval() - creates a key-value pair from any two R objects. Used to generate output from input formatters, mappers, reducers, etc.
• Input/output
  - from.dfs(), to.dfs() - read/write data from/to the HDFS
  - make.input.format() - provides common file parsing (text, CSV) or will wrap a user-supplied function
• Job execution
  - mapreduce() - submit job and return an HDFS path to the results if successful
rhdfs function overview
• File & directory manipulation
  - hdfs.ls(), hdfs.list.files()
  - hdfs.delete(), hdfs.del(), hdfs.rm()
  - hdfs.dircreate(), hdfs.mkdir()
  - hdfs.chmod(), hdfs.chown(), hdfs.file.info()
  - hdfs.exists()
• Copying, moving & renaming files to/from/within HDFS
  - hdfs.copy(), hdfs.move(), hdfs.rename()
  - hdfs.put(), hdfs.get()
• Reading files directly from HDFS
  - hdfs.file(), hdfs.read(), hdfs.write(), hdfs.flush()
  - hdfs.seek(), hdfs.tell(con), hdfs.close()
  - hdfs.line.reader(), hdfs.read.text.file()
• Misc.
  - hdfs.init(), hdfs.defaults()
rhbase function overview
• Initialization
  - hb.init()
• Create and manage tables
  - hb.list.tables(), hb.describe.table()
  - hb.new.table(), hb.delete.table()
• Read and write data
  - hb.insert(), hb.insert.data.frame()
  - hb.get(), hb.get.data.frame(), hb.scan()
  - hb.delete()
• Administrative, etc.
  - hb.defaults(), hb.set.table.mode()
  - hb.regions.table(), hb.compact.table()
Big Data Warehousing with Hive
• Hive supplies a SQL-like query language
  - very familiar for those with relational database experience
• But Hive compiles, optimizes, and executes these queries as MapReduce jobs on the Hadoop cluster
• Can be used in conjunction with other Hadoop jobs, such as those written with rmr2
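As a rough illustration of how a SQL-style aggregation lines up with MapReduce stages, the sketch below evaluates the equivalent of a GROUP BY query in plain R. The table, its columns, and the stage-by-stage mapping are illustrative assumptions; the query plans Hive actually generates are more involved.

```r
# Hypothetical table; names and values are made up for illustration.
employees <- data.frame(
  name   = c("ann", "bob", "cat", "dan"),
  dept   = c("eng", "eng", "ops", "ops"),
  salary = c(100, 120, 80, 90),
  stringsAsFactors = FALSE
)

# Query being mimicked:
#   SELECT dept, COUNT(*) AS n, AVG(salary) AS avg_salary
#   FROM employees GROUP BY dept

# Map: emit (dept, salary) for every row; the GROUP BY column is the key.
keys <- employees$dept
vals <- employees$salary

# Shuffle: gather all values that share a key.
grouped <- split(vals, keys)

# Reduce: compute the aggregates for each key.
result <- data.frame(
  dept       = names(grouped),
  n          = sapply(grouped, length),
  avg_salary = sapply(grouped, mean),
  row.names  = NULL,
  stringsAsFactors = FALSE
)
print(result)
```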
Hive architecture & access
[Diagram: Hive architecture]
• Access: CLI (terminal), HWI (browser), Thrift Server (JDBC, ODBC; from R via RODBC, RJDBC, etc.)
• Hive: Driver (compiles, optimizes, executes), Metastore
• Hadoop: Master (✲ Job Tracker, Name Node), DFS
Accessing Hive via ODBC/JDBC
library(RJDBC)

# set the classpath to include the JDBC driver location, plus commons-logging
[...]
class.path = c(hive.class.path, commons.class.path)
drv = JDBC("org.apache.hadoop.hive.jdbc.HiveDriver", classPath=class.path, "`")

# make a connection to the running Hive Server:
conn = dbConnect(drv, "jdbc:hive://localhost:10000/default")

# setting the database name in the URL doesn't help,
# so issue 'use databasename' command:
res = dbSendQuery(conn, 'use mydatabase')

# submit the query and fetch the results as a data.frame:
df = dbGetQuery(conn, 'SELECT name, sub FROM employees LATERAL VIEW explode(subordinates) subView AS sub')
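For readers unfamiliar with LATERAL VIEW explode, the following plain-R sketch mimics what the query above returns: one output row per element of the array column. The sample data and the use of an R list column to stand in for a Hive array are assumptions for illustration.

```r
# Hypothetical employees table with an array-valued column, represented
# in R as a list column. Names are made up for illustration.
employees <- data.frame(name = c("alice", "bob", "carol"),
                        stringsAsFactors = FALSE)
employees$subordinates <- list(c("dave", "erin"), character(0), "frank")

# Mimic LATERAL VIEW explode(subordinates): one output row per array
# element; rows whose array is empty produce no output.
n_subs   <- lengths(employees$subordinates)
exploded <- data.frame(
  name = rep(employees$name, times = n_subs),
  sub  = unlist(employees$subordinates),
  stringsAsFactors = FALSE
)
print(exploded)
```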
Other ways to use R and Hadoop
• HDFS
  - Revolution Enterprise R can read and write files directly on the distributed file system
  - Files can include ScaleR’s XDF-formatted data sets
• MapReduce
  - Many other R packages have been written to use R and Hadoop together, including RHIPE, segue, Oracle’s R Connector for Hadoop, etc.
• Hive
  - Hadoop Streaming is also available for Hive to leverage functionality external to Hadoop and Java
  - RHive leverages RServe to connect the two
    • http://cran.r-project.org/web/packages/RHive/
Big Data ≠ Hadoop
• NoSQL databases offer low-latency, random-access to key-values
  - HBase
  - Cassandra
  - CouchDB
  - MongoDB
  - Accumulo
• Next week, Think Big’s Douglas Moore will be presenting at the Boston Storm Meetup:
  - “Predictive Analytics with Storm, Hadoop, R and AWS”
  - http://www.meetup.com/Boston-Storm-Users/events/103506142/
Want to learn more?
• Upcoming public Getting Started with RHadoop 1-day classes
  - Hands-on examples and exercises covering rhdfs, rhbase, and rmr2
  - Algorithms and data include wordcount, analysis of airline flight data, and collaborative filtering using structured and unstructured data from text, CSV files and Twitter
    • February 25, 2013 - Palo Alto, CA
    • March 13, 2013 - Boston, MA
    • 25% off with “useR” discount code @ http://bit.ly/rhadoop0313
• Revolution Analytics Quick Start Program for Hadoop
  - Private Getting Started with RHadoop training
  - Onsite consulting assistance for initial use case
  - Revolution R for Hadoop licenses and support
  - More info @ http://bit.ly/rhadoopqs