OTN Developer Day: Oracle Big Data
Hands-On Lab Manual
Oracle Big Data Connectors: Introduction to Oracle R Connector for Hadoop


©2012 Oracle – All Rights Reserved.

ORACLE R CONNECTOR FOR HADOOP 2.0 HANDS-ON LAB
Introduction to Oracle R Connector for Hadoop


Contents

Introduction to Oracle R Connector for Hadoop
    Exercise 1 – Work with data in HDFS and Oracle Database
    Exercise 2 – Execute a simple MapReduce job using ORCH
    Exercise 3 – Count words in movie plot summaries


Introduction to Oracle R Connector for Hadoop

Oracle R Connector for Hadoop (ORCH), a component of the Big Data Connectors option, provides transparent access to Hadoop and HDFS-resident data. Hadoop is a high-performance distributed computation system, and the Hadoop Distributed File System (HDFS) is a distributed, high-availability file storage mechanism. With ORCH, R users are not forced to learn a new language to work with Hadoop/HDFS: they continue to work in R. In addition, they can leverage open source R packages in their mapper and reducer functions when working on HDFS-resident data. ORCH allows Hadoop jobs to be executed locally at the client for testing purposes; then, by changing one setting, the exact same code can be executed on the Hadoop cluster, without involving administrators or requiring knowledge of Hadoop internals, the Hadoop call-level interface, or IT infrastructure.

ORCH and Oracle R Enterprise (ORE) can interact in a variety of ways. If ORE is installed on the R client alongside ORCH, ORCH can copy ore.frames (data tables) to HDFS, ORE can preprocess data that is fed to MapReduce jobs, and ORE can post-process the results of MapReduce jobs once data is moved from HDFS to Oracle Database. If ORE is installed on the Big Data Appliance task nodes, mapper and reducer functions can include function calls to ORE. If ORCH is installed on the Oracle Database server, R scripts run through embedded R execution can invoke ORCH functionality, so ORCH scripts can be operationalized via SQL-based applications or via DBMS_SCHEDULER.

To run the commands in this document on the virtual machine (VM), point Firefox to http://localhost:8787 and log in to RStudio using the oracle user's credentials. From the RStudio menu, select File > Open File and navigate to /home/oracle/movie/moviework/advancedanalytics. Select the R script file "20130206_ORCH_Hands-on_Lab.R"; the lab's commands will open and be available to run in RStudio.
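Before starting the exercises, note that the local-then-cluster workflow described above comes down to a single setting. As a minimal illustrative sketch (using only functions that appear later in this lab):

library(ORCH)
orch.dryrun(TRUE)    # jobs passed to hadoop.run() execute locally, serially
# ... develop and debug a hadoop.run() job here ...
orch.dryrun(FALSE)   # the identical job is now submitted to the Hadoop cluster

Exercise 2 walks through exactly this pattern.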

Exercise 1 – Work with data in HDFS and Oracle Database

Loading the ORCH library provides access to basic functions for manipulating HDFS. After navigating to a specified directory, we'll again access database data in the form of the MOVIE_FACT and MOVIE_GENRE tables, and connect to Oracle Database from ORCH. Although you're connected to the database through ORE, transferring data between Oracle Database and HDFS requires a separate ORCH connection. Then, you'll copy data from the database to HDFS for later use with a MapReduce job. Run these commands from the /home/oracle/movie/moviework/advancedanalytics Linux directory.

1. If you are in R, first exit by pressing CTRL-D twice. This in effect invokes q() without saving the workspace. Change directory and start R:

cd /home/oracle/movie/moviework/advancedanalytics

R

2. If you are not already connected by default, load the Oracle R Enterprise (ORE) library, connect to the Oracle database, and list the contents of the database schema to test the connection. Notice that if a table contains columns with unsupported data types, a warning message is returned. If you are already connected, you can simply invoke ore.ls().

library(ORE)

ore.connect("moviedemo", "orcl", "localhost", "welcome1", all=TRUE)

ore.ls()


3. Load the Oracle R Connector for Hadoop (ORCH) library, get the current working directory, and list the directory contents in Hadoop Distributed File System (HDFS). Change directory in HDFS and view the contents there:

library(ORCH)

hdfs.pwd()

hdfs.ls()

hdfs.cd("/user/oracle/moviework/advancedanalytics/data")

hdfs.ls()

4. Using ORE, view the names of the database tables MOVIE_FACT and MOVIE_GENRE, look at the first few rows of each table, and get the table dimensions:

ore.sync("MOVIEDEMO","MOVIE_FACT")

MF <- MOVIE_FACT

names(MF)

head(MF,3)

dim(MF)

names(MOVIE_GENRE)

head(MOVIE_GENRE,3)

dim(MOVIE_GENRE)

5. Since we will use the table MOVIE_GENRE later in our Hadoop recommendation jobs, copy a subset of MOVIE_GENRE from the database to HDFS and validate that it exists. This requires using orch.connect to establish the connection to the database from ORCH.

MG_SUBSET <- MOVIE_GENRE[1:10000,]

hdfs.rm('movie_genre_subset')

orch.connect(host="localhost", user="moviedemo", sid="orcl", passwd="welcome1", secure=F)

mg.dfs <- hdfs.push(MG_SUBSET, dfs.name='movie_genre_subset', split.by="GENRE_ID")

hdfs.exists('movie_genre_subset')

hdfs.describe('movie_genre_subset')

hdfs.size('movie_genre_subset')
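As an optional sanity check, you can pull the pushed data back from HDFS and compare it against the source. A minimal sketch, assuming the 10,000-row subset fits comfortably in client memory:

mg.check <- hdfs.get(mg.dfs)   # copy the HDFS data back into a local data.frame
dim(mg.check)                  # expect the same dimensions as MG_SUBSET
head(mg.check, 3)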

Exercise 2 – Execute a simple MapReduce job using ORCH

In this exercise, you will execute a Hadoop job that counts the number of movies in each genre. You will first run the script in "dry run" mode, executing serially on the local machine. Then, you will run it on the cluster in the VM. Finally, you will compare the results using ORE.

1. Use hdfs.attach() to attach the "movie_genre_subset" HDFS file to the working session:


mg.dfs <- hdfs.attach("/user/oracle/moviework/advancedanalytics/data/movie_genre_subset")

mg.dfs

hdfs.describe(mg.dfs)

2. Specify dry run mode and then execute the MapReduce job, which partitions the data by GENRE_ID and counts the number of movies in each genre. Note that you will receive debug output while in dry run mode.

orch.dryrun(T)
res.dryrun <- NULL
res.dryrun <- hadoop.run(
    mg.dfs,
    mapper = function(key, val) {
        orch.keyval(val$GENRE_ID, 1)
    },
    reducer = function(key, vals) {
        count <- length(vals)
        orch.keyval(key, count)
    },
    config = new("mapred.config",
        map.output = data.frame(key=0, val=0),
        reduce.output = data.frame(GENRE_ID=0, COUNT=0))
)

3. Retrieve the result of the Hadoop job, which is stored as an HDFS file. Note that since this is dry run mode, not all data may be used, so only a subset of results may be returned.

hdfs.get(res.dryrun)

4. Specify cluster execution by setting orch.dryrun to FALSE, rerun the same MapReduce job, and view the result. Note that this will take longer to execute since it starts actual Hadoop jobs on the cluster.

orch.dryrun(F)
res.cluster <- NULL
res.cluster <- hadoop.run(
    mg.dfs,
    mapper = function(key, val) {
        orch.keyval(val$GENRE_ID, 1)
    },
    reducer = function(key, vals) {
        count <- length(vals)
        orch.keyval(key, count)
    },
    config = new("mapred.config",
        map.output = data.frame(key=0, val=0),
        reduce.output = data.frame(GENRE_ID=0, COUNT=0))
)

hdfs.get(res.cluster)

5. Perform the same analysis using ORE:


res.table <- table(MG_SUBSET$GENRE_ID)

res.table
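To compare the two results directly, pull the cluster output into a local data frame. A quick check, assuming the GENRE_ID and COUNT columns declared in reduce.output above:

res.df <- hdfs.get(res.cluster)    # MapReduce result as a local data.frame
res.df[order(res.df$GENRE_ID), ]   # order by genre for easy comparison
res.table                          # should show the same per-genre counts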

Exercise 3 – Count words in movie plot summaries

In this exercise, you will execute a Hadoop job that counts how many times each word occurs in the MOVIE plot summaries. You will first create the HDFS file containing the data extracted from Oracle Database using ORE. Then, you will run the MapReduce job on the cluster in the VM. Finally, you will view the results; since we want the results sorted by most frequent words, a second MapReduce job will be needed.

1. If you are starting a fresh R session, execute the first block below; otherwise, continue from here. Find all the movies with plot summaries and convert the summaries from ore.factor to ore.character. Remove various unneeded punctuation from the text, create a database table from the result, and create the input corpus for the MapReduce job:

library(ORCH)
orch.connect(host="localhost", user="moviedemo", sid="orcl", passwd="welcome1", secure=F)
hdfs.cd("/user/oracle/moviework/advancedanalytics/data")

ore.drop(table="corpus_table")
corpus <- as.character(MOVIE[!is.na(MOVIE$PLOT_SUMMARY),"PLOT_SUMMARY"])
class(corpus)
corpus <- gsub("([/\\\":,#.@-])", " ", corpus)
head(corpus,2)
corpus <- data.frame(text=corpus)
ore.create(corpus, table="corpus_table")
hdfs.rm("plot_summary_corpus")
input <- hdfs.put(corpus_table, dfs.name="plot_summary_corpus")

2. Try the following example to see how R parses text using strsplit. Notice the extra space between "my" and "text": it produces an empty string in the split result, which would otherwise be counted as a word. You will account for that in the next step.

txt <- "This is my  text"

strsplit(txt," ")

myList <- list(A = 5, B = 10, C = 25)

sum(unlist(myList))
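Dropping those empty strings is a one-line filter in base R, the same idiom the mapper in the next step uses:

words <- strsplit(txt, " ")[[1]]   # the double space yields an empty-string token
words[words != ""]                 # keep only the non-empty tokens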

3. Execute the MapReduce job that performs the word count:


res <- hadoop.exec(dfs.id = input,
    mapper = function(k,v) {
        x <- strsplit(v[[1]], " ")[[1]]
        x <- x[x!='']
        out <- NULL
        for(i in 1:length(x))
            out <- c(out, orch.keyval(x[i],1))
        out
    },
    reducer = function(k,vv) {
        orch.keyval(k, sum(unlist(vv)))
    },
    config = new("mapred.config",
        job.name = "wordcount",
        map.output = data.frame(key='', val=0),
        reduce.output = data.frame(key='', val=0))
)

4. View the path of the result HDFS file, then get the contents of the result. Notice that the results are unordered.

res

hdfs.get(res)
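For a result this small, you could also sort on the client after hdfs.get(). A quick look, assuming the key and val columns declared in reduce.output above:

wc <- hdfs.get(res)               # word counts as a local data.frame
head(wc[order(-wc$val), ], 10)    # ten most frequent words, sorted locally

At scale, though, the sort belongs in Hadoop itself, which is what the next job does.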

5. To sort the results, we can use the following MapReduce job. Notice that we specify explicit stopwords, i.e., words to be excluded from the result, and that we also eliminate words of 3 letters or fewer. The mapper emits each word's zero-padded count as the key, so Hadoop's sort-by-key phase orders the output by frequency, and a single reducer preserves that order. Then view the sorted results, as well as a sample of 10 rows from the HDFS file. Which words are the most popular in the plot summaries?


stopwords <- c("from","they","that","with","their","when","into","what")
sorted.res <- hadoop.exec(dfs.id = res,
    mapper = function(k,v) {
        if(!(k %in% stopwords) & nchar(k) > 3) {
            cnt <- sprintf("%05d", as.numeric(v[[1]]))
            orch.keyval(cnt,k)
        }
    },
    reducer = function(k,vv) {
        orch.keyvals(k, vv)
    },
    export = orch.export(stopwords),
    config = new("mapred.config",
        job.name = "sort.words",
        reduce.tasks = 1,
        map.output = data.frame(key='', val=''),
        reduce.output = data.frame(key='', val=''))
)

hdfs.get(sorted.res)

hdfs.sample(sorted.res,10)
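Because the keys are zero-padded counts and Hadoop sorts keys in ascending order, the most frequent words appear at the end of the output. Assuming the result fits in memory, a quick way to see them:

top <- hdfs.get(sorted.res)
tail(top, 10)    # the last rows hold the highest counts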


©2012 Oracle – All Rights Reserved.