Oracle Big Data SQL - Amazon S3 - AWS · Oracle Big Data SQL – A New Architecture •Powerful,...

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Oracle Big Data SQL Bringing Data Realms together Transparently!

Rick (Rahul) Pandya Oracle Database Product Management



Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

3


Agenda

1

2

3

The Data Analytics Challenge

Why Unified Query Matters

SQL on Hadoop and More: Unifying Metadata

Query Franchising: Smart Scan for Hadoop

SQL, Everywhere

Oracle Confidential – Internal/Restricted/Highly

4

4

5


Big Data Customers

5

Big Data Analytic Services

• R&D, Supply Chain, Customer & Consumer

• Centralized Data Science Organization

Business Transformation

• Leading Spanish Bank > 13M customers

• Collect & unify all relevant information

Innovative Network Defense

• Hadoop and NoSQL DB for data of different speeds

• Detect 0-days, uncover intrusions

BDA Exadata

BDA Exadata

BDA Exadata


Driving Business Value from Technology Innovation

Use the Right Tool for the Job and benefit from the Power of “AND”

6

Run the Business Integrate existing systems

Support mission-critical tasks

Protect existing expenditures

Ensure skills relevance

Relational Hadoop

Change the Business

Disrupt competitors

Disintermediate supply chains

Leverage new paradigms

Exploit new analyses

NoSQL

Scale the Business

Serve data faster

Meet mobile challenges

Scale-out economically


Data Analytics Challenge Separate silos of information to analyze

7


Data Analytics Challenge Separate data access interfaces

8


SQL on Hadoop is Obvious

9

Stinger


Data Analytics Challenge No comprehensive SQL interface

10


Oracle Big Data Management Preserving investment in SQL for Big Data analytics

11

NoSQL


Before After

What does unified query mean for you?

Data Science

PhD

???

Anyone


Before After

What does unified query mean for you?

Application Development


Use Rich Oracle SQL Dialect Over All Data Snapshot of Oracle SQL Analytic Functions

• Ranking functions

– rank, dense_rank, cume_dist, percent_rank, ntile

• Window Aggregate functions (moving and cumulative)

– Avg, sum, min, max, count, variance, stddev, first_value, last_value

• LAG/LEAD functions

– Direct inter-row reference using offsets

• Reporting Aggregate functions

– Sum, avg, min, max, variance, stddev, count, ratio_to_report

• Statistical Aggregates

– Correlation, linear regression family, covariance

• Linear regression

– Fitting of an ordinary-least-squares regression line to a set of number pairs.

– Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions

• Descriptive Statistics

– DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, top/bottom 5 values

• Correlations

– Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric).

• Cross Tabs

– Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa

• Hypothesis Testing

– Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA

• Distribution Fitting

– Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential


next = lineNext.getQuantity();

}

if (!q.isEmpty() && (prev.isEmpty() || (eq(q, prev) && gt(q, next)))) {

state = "S";

return state;

}

if (gt(q, prev) && gt(q, next)) {

state = "T";

return state;

}

if (lt(q, prev) && lt(q, next)) {

state = "B";

return state;

}

if (!q.isEmpty() && (next.isEmpty() || (gt(q, prev) && eq(q, next)))) {

state = "E";

return state;

}

if (q.isEmpty() || eq(q, prev)) {

state = "F";

return state;

}

return state;

}

private boolean eq(String a, String b) {

if (a.isEmpty() || b.isEmpty()) {

return false;

}

return a.equals(b);

}

private boolean gt(String a, String b) {


return false;

}

return Double.parseDouble(a) > Double.parseDouble(b);

}

private boolean lt(String a, String b) {


return false;

}

return Double.parseDouble(a) < Double.parseDouble(b);

}

public String getState() {

return this.state;

}

}

BagFactory bagFactory = BagFactory.getInstance();

@Override

public Tuple exec(Tuple input) throws IOException {

long c = 0;

String line = "";

String pbkey = "";

V0Line nextLine;

V0Line thisLine;

V0Line processLine;

V0Line evalLine = null;

V0Line prevLine;

boolean noMoreValues = false;

String matchList = "";

ArrayList<V0Line> lineFifo = new ArrayList<V0Line>();

boolean finished = false;

DataBag output = bagFactory.newDefaultBag();

if (input == null) {

return null;

}

if (input.size() == 0) {

return null;

}

Object o = input.get(0);

if (o == null) {

return null;

}

//Object o = input.get(0);

if (!(o instanceof DataBag)) {

int errCode = 2114;

String msg = "Expected input to be DataBag, but"

Pattern Matching With Oracle SQL Snapshot of Oracle SQL Analytic Functions

Simplified, sophisticated, standards based syntax

SELECT first_x, last_z

FROM ticker MATCH_RECOGNIZE (

PARTITION BY name ORDER BY time

MEASURES FIRST(x.time) AS first_x,

LAST(z.time) AS last_z

ONE ROW PER MATCH

PATTERN (X+ Y+ W+ Z+)

DEFINE X AS (price < PREV(price)),

Y AS (price > PREV(price)),

W AS (price < PREV(price)),

Z AS (price > PREV(price) AND

z.time - FIRST(x.time) <= 7 ))

250+ Lines of Java UDF 12 Lines of SQL

20x less code

Finding Patterns in Stock Market Data - Double Bottom (W)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. 15

10:00 10:05 10:10 10:15 10:20 10:25

Ticker


Oracle Big Data SQL – A New Architecture

• Powerful, high-performance SQL on Hadoop

– Full Oracle SQL capabilities on Hadoop

– SQL query processing local to Hadoop nodes

• Simple data integration of Hadoop and Oracle Database – Single SQL point-of-entry to access all data

– Scalable joins between Hadoop and RDBMS data

• Optimized hardware

– Balanced Configurations

– No bottlenecks


16


SQL on Hadoop and More: Unifying Metadata


Why Unify Metadata?

Catalog

CUSTOMERS SALES

CREATE TABLE customers…

CREATE TABLE sales…

SELECT customers.name, sales.amount

SELECT name FROM customers

customers sales

Unified metadata =

No changes for users and applications

+ Seamlessly handle schema-on-read

+ Exploit remote data distribution

+ Holistically optimize queries


How Data is Stored in Hadoop


19

{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8} {"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7} {"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9} {"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7} {"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6} {"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8} {"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9} {"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7} {"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9} {"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7} {"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7} {"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7} {"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}} {"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}

Example: 1TB File

Block B1

Block B2

Block B3

• 1 block = 256 MB • Example File = 4096 blocks • InputSplits = 4096 Potential scan parallelism


How MapReduce and Hive Read Data

20

Data Node

disk

Consumer

SCAN

Create ROWS

& COLUMNS

• Scan and row creation needs to be able to work on “any” data format

• Data definitions and column deserializations are needed to provide a table

RecordReader => Scans data (keys and values) InputFormat => Defines parallelism SerDe => Makes columns Metastore => Maps DDL to Java access classes


Hive Metastore SQL-on-Hadoop Engines Share Metadata, not MapReduce

21

Hive Metastore

Hive Impala SparkSQL Oracle Big Data SQL …

Table Definitions: movieapp_log_json Tweets avro_log

Metastore maps DDL to Java access classes

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. | 22

Extend Oracle External Tables

CREATE TABLE movielog (

click VARCHAR2(4000))

ORGANIZATION EXTERNAL (

TYPE ORACLE_HIVE

DEFAULT DIRECTORY DEFAULT_DIR

ACCESS PARAMETERS

(

com.oracle.bigdata.tablename logs

com.oracle.bigdata.cluster mycluster

))

REJECT LIMIT UNLIMITED;

• New types of external tables

– ORACLE_HIVE (inherit metadata)

– ORACLE_HDFS (specify metadata)

• Access parameters for Big Data – Hadoop cluster

– Remote Hive database/table • DBMS_HADOOP Package for automatic import

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. | 23

Enhance Oracle External Tables

• Transparent schema-for-read

– Use fast C-based readers when possible

– Use native Hadoop classes otherwise

• Engineered to understand parallelism – Map external units of parallelism to Oracle

• Architected for extensibility

– StorageHandler capability enables future support for other data sources

– Examples: MongoDB, HBase, Oracle NoSQL DB

CREATE TABLE ORDER (

cust_num VARCHAR2(10),

order_num VARCHAR2(20),

order_total NUMBER(8,2))

ORGANIZATION EXTERNAL (

TYPE ORACLE_HIVE

DEFAULT DIRECTORY DEFAULT_DIR

)

PARALLEL 20

REJECT LIMIT UNLIMITED;


Query Franchising: Smart Scan for Hadoop


CUSTOMERS

SELECT name, SUM(purchase) FROM customers GROUP BY name;

What Can Big Data Learn from Exadata? Intelligent Storage Maximizes Performance

Oracle Exadata Storage Server

Oracle Exadata Storage Server

Oracle SQL query issued • Plan constructed • Query executed

1

Smart Scan Works on Storage • Filter out unneeded rows • Project only queried columns • Score data models • Bloom filters to speed up joins

2


Query Franchising – dispatch of query processing to disparate systems without loss of operational fidelity through translation


Storage Layer

27

Big Data SQL Server: A New Hadoop Processing Engine

Filesystem (HDFS) NoSQL Databases

(Oracle NoSQL DB, Hbase)

Resource Management (YARN, cgroups)

Processing Layer

MapReduce and Hive

Spark Impala Search Big Data

SQL


Smart Scan for Hadoop: Optimizing Performance

28

Data Node

Disk

Big Data SQL Server

External Table Services

Smart Scan

“Oracle on top”

– Apply filter predicates

– Project columns

– Parse semi-structured data

“Hadoop on the bottom”

– Work close to the data

– Schema-on-read with Hadoop classes

– Transformation into Oracle data stream


B B B

How do we query Hadoop?

Big Data SQL Query Execution

HDFS Data Node BDS Server

HDFS Data Node BDS Server

Query compilation determines: • Data locations • Data structure • Parallelism

1

Fast reads using Big Data SQL Server • Schema-for-read using Hadoop classes • Smart Scan selects only relevant data

2

Process filtered result • Move relevant data to database • Join with database tables • Apply database security policies

3 Hive Metastore

HDFS NameNode 1

2 3


Parallel Query and Hadoop

Mapping Hadoop to Oracle

B B B

Hive Metastore

HDFS NameNode

Determine Hadoop Parallelism • Determine schema-for-read • Determine InputSplits • Arrange splits for best performance

1

Map to Oracle Parallelism • Map splits to granules • Assign granules to PX Servers

2

PX Servers Route Work • Offload work to Big Data SQL Servers • Aggregate • Join • Apply PL/SQL

3

1

2

PX

InputSplits


Big Data SQL Server Dataflow

Disks

Data Node

Big Data SQL Server

External Table Services

Smart Scan

Read data from HDFS Data Node • Direct-path reads • C-based readers when possible • Use native Hadoop classes otherwise

1

Translate bytes to Oracle 2

Apply Smart Scan to Oracle bytes • Apply filters • Project Columns • Parse JSON/XML • Score models

3 RecordReader

SerDe 1

01

10

01

0

10

11

00

10

1

01

10

01

0 1

2

3


But How Does Security Work?

B B B

Database security for query access • Virtual Private Databases • Redaction • Audit Vault and Database Firewall

1

Hadoop security for Hadoop jobs • Kerberos Authentication • Apache Sentry (RBAC) • Audit Vault

2

System-specific encryption • Database tablespace encryption • BDA On-disk Encryption

3

SELECT * FROM my_bigdata_table

WHERE SALES_REP_ID =

SYS_CONTEXT('USERENV','SESSION_USER');

Filter on SESSION_USER

DBMS_REDACT.ADD_POLICY(

object_schema => 'MCLICK',

object_name => 'TWEET_V',

column_name => 'USERNAME',

policy_name => 'tweet_redaction',

function_type => DBMS_REDACT.PARTIAL,

function_parameters =>

'VVVVVVVVVVVVVVVVVVVVVVVVV,*,3,25',

expression => '1=1'

);

***


SQL, Everywhere Futures


Move less data Go faster

More Lessons from Exadata

Storage Indexes • Skip reads on irrelevant data • Big Hadoop Blocks ~ Big Speed Up

1

Caching • Cache frequently accessed columns • HDFS Caching

2


Oracle Big Data Management Preserving investment in SQL for Big Data analytics

35

NoSQL


Oracle Big Data Management Unite Information Lifecycles

36

NoSQL

Shared REST APIs • App-embedded schema NoSQL • Shared schema Oracle

Automatic ILM • App-embedded schema NoSQL • Shared schema Oracle


Oracle Big Data Management Unify All Query

37

NoSQL

Oracle Big Data SQL - Amazon S3 - AWS · Oracle Big Data SQL – A New Architecture •Powerful,...

Documents

Transcript of Oracle Big Data SQL - Amazon S3 - AWS · Oracle Big Data SQL – A New Architecture •Powerful,...