paced lesson entitled “Introducing Oracle R Enterprise ... · Oracle R Enterprise (ORE), which is...

Hello, and welcome to this online, self-paced lesson entitled “Introducing Oracle R Enterprise.” This session is part of an eight-lesson tutorial series on Oracle R Enterprise. My name is Brian Pottle. I will be your guide for the next 45 minutes of interactive lectures and review sessions on this lesson.

ORE Introduction - 1

So, you know the title of the course, but you may be asking yourself, “Is this the right course for me?” Click the bars to learn about the course objectives, target audience, and prerequisites.

“Introducing Oracle R Enterprise” is the first lesson of eight self-study sessions on Oracle R Enterprise.

In this lesson, you’ll learn:

• What R is, who uses it, and why they use it.

• Next, we’ll examine several common user interfaces for R.

• Finally, you’ll learn about Oracle’s strategy for supporting the R community.

So, let’s start with the first topic: Using R: What, Who, and Why?

R is a language and environment for statistical computing and graphics. This GNU Project is similar to the S language and environment, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R is an open-source language and environment that supports:

• Statistical computing and data visualization

• Data manipulations and transformations

• And sophisticated graphical displays

With over 2 million R users worldwide, R is increasingly being used as the statistical tool in the academic world. Many colleges and universities worldwide are using R today in their statistics classes. In addition, more and more corporate analysts are using R.

R benefits from over 3,700 open-source packages, each of which can be thought of as a collection of related functions. This number grows continuously with new package submissions from the R user community.

Each package provides specialized functionality in such areas as bioinformatics and financial market analysis.

In the slide, the list on the right shows “CRAN Task Views.” CRAN stands for the Comprehensive R Archive Network, which is a network of FTP and web servers that store identical, up-to-date versions of R code and documentation.

The CRAN Task Views list areas of concentration for a set of packages. Each link leads to an overview of the packages that are relevant to that topic area.

So, why do statisticians and data analysts use R?

• As mentioned previously, R is a statistics language that is similar to SAS or SPSS.

• R is a powerful and extensible environment, with a wide range of statistics and data visualization capabilities.

- Powerful: Users can perform data analysis and visualization with a minimal amount of R code.

- Extensible: Users can write their own R functions and packages that can be used locally, shared within their organizations, or shared with the broader R community through CRAN.

• It’s easy to install and use.

• And it’s free and downloadable from the R Project website.

In particular, statisticians like R because it enables them to be productive. They don’t have to be familiar with database administrator (DBA) tasks or SQL, and they don’t have to switch programming paradigms between R and SQL to work with database-resident data.

Although it’s a powerful and effective statistical environment, R has limitations.

First, R was conceived as a single-user tool that was not multithreaded. The client and server components are bundled together as a single executable, much like Excel.

• R is limited by the memory and processing power of the machine on which it runs.

• Also, R can’t automatically leverage the CPU capacity on a user’s multiprocessor laptop without special packages and programming.

Second, R suffers from another scalability limitation that is associated with RAM.

• R requires data that it operates on to be first loaded into memory.

• In addition, R’s approach to passing data between function invocations results in data duplication. This “call by value” approach to parameter passing can use up memory quickly.

So inherently, R is really not designed for use with big data.

Some users have provided packages to overcome some of the memory limitations, but the users must explicitly program with these packages.
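To make the memory point concrete, here is a minimal base R sketch. It is an illustration rather than part of the lesson slides: the function name scale_first_column and the one-million-row data frame are assumptions chosen for the example. In practice, R makes the copy when the function modifies its argument, which is how the “call by value” behavior shows up.

```r
# Hypothetical helper: doubles the first column of a data frame.
# Modifying the argument inside the function makes R copy the affected
# column, so the caller's data frame is left untouched.
scale_first_column <- function(df) {
  df[[1]] <- df[[1]] * 2
  df
}

# Two numeric columns of one million rows each, held entirely in memory.
original <- data.frame(x = as.numeric(1:1e6), y = as.numeric(1:1e6))
scaled <- scale_first_column(original)

# The caller's object is unchanged; the modified version is a second object.
identical(original$x[1:3], c(1, 2, 3))
identical(scaled$x[1:3], c(2, 4, 6))

# object.size() reports each object as if it stood alone, so columns that
# the two data frames still share are counted in both figures.
format(object.size(original), units = "MB")
format(object.size(scaled), units = "MB")
```

With data this small the extra copy is harmless, but the same pattern applied to a table that barely fits in RAM is where the limitation starts to matter.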

R provides a wealth of resources to help users, including:

• Many R-related books that are available on the R project website

• Many user groups and user conferences that are available to the R community

• Online libraries of reusable code from the CRAN website

• Documented R packages with sample data and code

Next, let’s examine several common user interfaces for R.

First, let’s take a quick look at the interface that comes with open-source R by default, called the R Console.

• This default open-source R graphical user interface (GUI) includes a command-line interface for running scripts or individual functions, as shown in the slide.

• In addition, open-source R supports many third-party graphics packages.

- In this example, we load a popular third-party graphics package named “ggplot2.”

- Then, the graphics package is called from the R Console command line. The second qplot function call displays the graphic on the right.

- Here, the qplot function is invoked on the mtcars data set, which comes with R. In the graph, we plot miles per gallon against weight, with the size of each dot indicating the number of cylinders.
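The transcript does not reproduce the exact commands from the slide, so the following is only a plausible reconstruction of the plot described above. It assumes that the ggplot2 package is installed and that the axes are weight (wt) on the x-axis and miles per gallon (mpg) on the y-axis.

```r
# Load the third-party graphics package (assumed to be installed).
library(ggplot2)

# mtcars ships with R: wt is weight in units of 1000 lbs, mpg is miles per
# gallon, and cyl is the number of cylinders.
# Plot mpg against wt, with the size of each dot driven by cyl.
p <- qplot(wt, mpg, data = mtcars, size = cyl)
print(p)
```

Recent ggplot2 releases mark qplot() as deprecated; ggplot(mtcars, aes(x = wt, y = mpg, size = cyl)) + geom_point() should produce an equivalent graphic.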

In addition to the default open-source R GUI, you can use a third-party integrated development environment (IDE), such as RStudio, which is shown in the slide.

With RStudio:

• You can use the upper-left pane to view R scripts and select portions of an R script for execution.

• In the Console pane, you can execute R scripts or functions at the command line, much as in the default R GUI.

• You can execute selected portions of R scripts in the top window by clicking the Run button. With this method, selected lines are pasted into the Console pane and executed.

• You can view graph results in the right pane. In this case, the Plots tab is selected.

In this next view, the R script that we saw previously is displayed in the viewer window.

• Here, the first portion of the script is selected. This code requests help on the qplot() function.

• When the Run button is clicked, the selected code is pasted into the Console pane and then executed.

• In the display pane, you can select (and switch between) different tabbed output views on the Files, Plots, Packages, and Help tabs. In this case, the Help tab is selected to display results from the R help command.

In this final view, the last function in the R script is selected.

• This same qplot() function was shown previously in the default R GUI.

• The Run button is clicked and the code is executed. The Plots tab shows the current output. In fact, RStudio also lets you view previously generated plots.

RStudio is only one of many third-party R IDEs.

As shown in the table of this 2011 poll, RStudio is the second most commonly used interface, behind the built-in R console we looked at earlier. However, it’s often user preference that decides which IDE will be used.

Data visualization in graphics helps convey information faster than most other means. The link shown in the slide is for the R Graph Gallery, where you can find a variety of graphic types for R.

Here are a few examples of graphs in R. Of course, there are many others. Moving from left to right, and top to bottom, we show:

• A box plot

• Perspective graphs of mathematical surfaces

• 3-D scatter plots with points

• A regression plane

• Multivariate facet graphs

• Smooth scatter plots

• Venn diagrams

• And even chromosome mappings from the Bioconductor project

In this final section of the lesson, you’ll learn about Oracle’s strategy for supporting the R community. This section includes the following topics:

• Goals

• Software term definitions

• High- and mid-level architectural overviews

• Software component features

• R user-community definitions

Oracle’s goal for supporting open-source R is to deliver enterprise-level advanced analytics based on the R environment. The strategy is implemented through the release of the following Oracle technologies:

• Oracle R Distribution, which supports configurations of open-source R on various platforms. In addition, Oracle contributes bug fixes and enhancements to open-source R.

• ROracle, the open-source Oracle database interface for R.

• Oracle R Enterprise (ORE), which is part of the Oracle Advanced Analytics option for Oracle Database 11g, release 2. ORE contains a statistics engine, and provides transparent access to database-resident data from R, as you will learn in this tutorial series.

• Oracle R Connector for Hadoop (ORCH), which is part of the Oracle Big Data Connectors offering. ORCH provides an R interface to an Oracle Hadoop cluster on the Oracle Big Data Appliance (BDA), and also to non-Oracle Hadoop clusters. Using ORCH, you can access and manipulate data in the Hadoop Distributed File System, in the Oracle Database, and on the file system.

Now, let’s examine an architectural view of ORE.

• The R workspace console may be the default R GUI or any of the third-party R GUIs. Users execute R scripts here.

• Then, the ORE transparency layer intercepts functions that operate on database tables or views. It translates the request into SQL for execution in Oracle Database for transformations and statistical computations. In Oracle Database, the statistics engine consists of native database functionality that leverages SQL and the various database management system (DBMS) packages, as well as enhancements that are specific to ORE.

• Finally, the results can be leveraged by enterprise systems, such as Oracle Business Intelligence Enterprise Edition (OBIEE), or web services-based applications.

This design results in:

• No changes to the R user experience

• The ability to scale to large data sets

• And, the ability to embed results in operational or production systems

In another architectural view, you can see how ORE can work with Oracle R Connector for Hadoop, which will be discussed in a later section of this series.

ORCH enables native R access to the Hadoop cluster for both:

• MapReduce programming in R

• Access to Hadoop Distributed File System (HDFS) data, in either BDA or non-Oracle Hadoop clusters

BDA has been mentioned a couple of times so far in this lesson. So, what is it?

Oracle Big Data Appliance:

• Is an optimized solution for storing and integrating low-density data into Exadata.

• Is a preintegrated configuration with 18 of Oracle's Sun servers that include InfiniBand and Ethernet connectivity to simplify implementation and management.

• Has the Cloudera distribution, including Apache Hadoop to acquire and organize data, along with Oracle NoSQL Database Community Edition to acquire data.

• Includes additional system software: Oracle Linux, Oracle Java Hotspot Virtual Machine, and an open-source distribution of R.

Oracle Big Data Connectors is an option for BDA. It consists of:

• Oracle Loader for Hadoop

• Oracle Data Integrator Application Adapter for Hadoop

• Oracle Direct Connector for HDFS

• Oracle R Connector for Hadoop

You can use Oracle R Connector for Hadoop to access data in Exadata, and perform R calculations on HDFS data by using scalable map-reduce methods.

Now, let’s take a brief look at the components of Oracle R Enterprise. From a software perspective, ORE consists of R packages, database libraries, and SQL extensions.

We’ll divide the features into three main groups: the Transparency Layer, the Statistics Engine, and SQL extensions.

The Transparency Layer is a set of packages that map R data types to Oracle Database objects.

• This feature automatically generates SQL for R expressions on mapped data types, enabling direct interaction with data in Oracle Database while using R language constructs.

• Functionally, this mapping provides access to database tables from R as a type of data.frame: a base R data representation with rows and columns. ORE calls this an “ore.frame.”

• Therefore, when you invoke an R function on an ore.frame, the R operation is sent to the database for execution as SQL.
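To give a feel for the idea, here is a toy sketch in base R. It is not ORE code and does not reflect how ORE is implemented: the toy_frame class, the to_sql helper, and the MTCARS table name are all hypothetical. The sketch only shows how an R object can stand in for a database table and turn ordinary R syntax into SQL text instead of pulling rows into memory.

```r
# Hypothetical stand-in for a database-backed frame: it records a table
# name and the selected columns, never the rows themselves.
toy_frame <- function(table, columns = "*") {
  structure(list(table = table, columns = columns), class = "toy_frame")
}

# Column selection with x[, cols] narrows the SELECT list; no data moves.
`[.toy_frame` <- function(x, i, j, ...) {
  if (!missing(j)) x$columns <- j
  x
}

# Render the request collected so far as a SQL string.
to_sql <- function(x) {
  paste("SELECT", paste(x$columns, collapse = ", "), "FROM", x$table)
}

# An aggregate such as mean() maps to a single SQL aggregate query.
mean.toy_frame <- function(x, ...) {
  stopifnot(length(x$columns) == 1, x$columns != "*")
  paste0("SELECT AVG(", x$columns, ") FROM ", x$table)
}

cars <- toy_frame("MTCARS")
to_sql(cars[, c("MPG", "WT")])
mean(cars[, "MPG"])
```

The user keeps writing familiar R expressions such as mean(cars[, "MPG"]), while the object decides what to send to the database. A real transparency layer would also execute the generated SQL and return the result.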

The Statistics Engine is a database library that supports a variety of statistical computations. This engine includes existing in-database advanced analytics and new features added specifically in ORE.

SQL extensions enable in-database embedded R execution, which is particularly valuable for third-party R packages, or custom functions, that do not have equivalent in-database functionality.

Taking a deeper look at the ORE architecture, notice that it consists of three compute engines.

The first one is the client (or user) R engine, which resides on the desktop.

• This R engine consists of the base R packages, the ORE packages, and any other R packages that the user may have installed.

• At this level, the Transparency Layer intercepts R functions for in-database execution.

• It also enables interactive display of graphical results, while flow control remains with the R environment.

• From the client, users can submit entire R scripts for execution by Oracle Database, using embedded R execution.

• And, although not explicitly depicted here, users can connect to a Hadoop Cluster by using Oracle R Connector for Hadoop.

The second compute engine is Oracle Database.

• This database allows scaling to large data sets.

• R users are able to access tables, views, and external tables, as well as data that is accessible through database links.

• The SQL generator through the Transparency Layer can automatically leverage database parallelism.

• It can also leverage both new and existing in-database statistical and data mining capabilities.

The third compute engine (or engines) consists of the R engines that Oracle Database spawns as external processes in response to embedded R script execution.

• These R engines enable more efficient data transfer between the database and R.

• The engines are still constrained by R memory limitations. However, because they run on the database server, rather than on the client, they are likely to have greater memory capacity and compute power. Exadata is an example.

• The engines also enable R users to write and test map-reduce scripts before rolling them out to a Hadoop cluster.

• Finally, the engines enable “lights-out” execution of R scripts; that is, scheduling or triggering R scripts packaged inside a SQL or PL/SQL query.

• The R packages installed with ORE are also part of the embedded R engines.
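As a rough sketch of what submitting work from the client to a database-side R engine might look like, consider the function below. It assumes that the ORE client packages are installed and that an ORE-enabled database is reachable. The helper name run_in_database and every connection value passed to it are hypothetical, and the calls should be checked against the ORE documentation for your release. The function is only defined here, not called.

```r
# Hypothetical helper: all connection values are placeholders supplied by
# the caller. Nothing runs until the function is invoked.
run_in_database <- function(user, password, sid, host, port = 1521) {
  if (!requireNamespace("ORE", quietly = TRUE)) {
    stop("The ORE client packages are not installed")
  }
  library(ORE)

  # Connect the client R engine to Oracle Database.
  ore.connect(user = user, password = password, sid = sid,
              host = host, port = port, all = TRUE)
  on.exit(ore.disconnect())

  # Embedded R execution: the function body runs in an R engine that the
  # database spawns on the server, not in the client session.
  res <- ore.doEval(function() {
    fit <- lm(mpg ~ wt, data = mtcars)
    coef(fit)
  })

  # Bring the (small) result back to the client R session.
  ore.pull(res)
}
```

The point of the design is that only the function and its small result cross the network; the computation itself stays on the database server.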

R and ORE can receive data from many sources. In this figure, we depict the R engine running on the user’s laptop, as shown in the previous slide.

Through a series of R packages, R itself is able to access data stored both in files and in databases.

In addition, ORE provides transparent access to data stored in the local Oracle Database, as we previously discussed.

ORE also has access to:

• Data in other databases, which are accessible through database links

• Data in external tables

• And, of course, data in HDFS. In addition to bulk import, ORE makes it possible to access Hadoop directly, in a similar fashion to external tables, by using HDFS connect. This means that you can join Hadoop data with database data.

So, in this lesson, we covered three primary topics.

First, you learned what R is, who uses it, and why they use it.

Then, we looked at some common user interfaces for R.

Finally, we discussed Oracle’s strategy for supporting the R community, including an overview of goals, definitions of software terms, high- and mid-level architecture, software component features, and R user-community definitions.

You’ve just completed “Introducing Oracle R Enterprise.” Please move on to the next lesson in the series: “Getting Started with ORE.”
