Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Oracle Big Data SQL
Bringing Data Realms Together Transparently!
Rick (Rahul) Pandya
Oracle Database Product Management
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
Agenda
1. The Data Analytics Challenge
2. Why Unified Query Matters
3. SQL on Hadoop and More: Unifying Metadata
4. Query Franchising: Smart Scan for Hadoop
5. SQL, Everywhere
Oracle Confidential – Internal/Restricted/Highly
Big Data Customers
Big Data Analytic Services
• R&D, Supply Chain, Customer & Consumer
• Centralized Data Science Organization
Business Transformation
• Leading Spanish Bank > 13M customers
• Collect & unify all relevant information
Innovative Network Defense
• Hadoop and NoSQL DB for data of different speeds
• Detect 0-days, uncover intrusions
[Diagram: BDA and Exadata, repeated for each of the three customers]
Driving Business Value from Technology Innovation
Use the Right Tool for the Job and benefit from the Power of “AND”
Run the Business (Relational)
• Integrate existing systems
• Support mission-critical tasks
• Protect existing expenditures
• Ensure skills relevance
Change the Business (Hadoop)
• Disrupt competitors
• Disintermediate supply chains
• Leverage new paradigms
• Exploit new analyses
Scale the Business (NoSQL)
• Serve data faster
• Meet mobile challenges
• Scale-out economically
Data Analytics Challenge: Separate silos of information to analyze
Data Analytics Challenge: Separate data access interfaces
SQL on Hadoop is Obvious
Stinger
Data Analytics Challenge: No comprehensive SQL interface
Oracle Big Data Management: Preserving investment in SQL for Big Data analytics
NoSQL
What does unified query mean for you?
Data Science
[Figure: before-and-after comparison with the labels "PhD", "???", and "Anyone"]
What does unified query mean for you?
Application Development
[Figure: before-and-after comparison]
Use Rich Oracle SQL Dialect Over All Data: Snapshot of Oracle SQL Analytic Functions
• Ranking functions
– rank, dense_rank, cume_dist, percent_rank, ntile
• Window Aggregate functions (moving and cumulative)
– Avg, sum, min, max, count, variance, stddev, first_value, last_value
• LAG/LEAD functions
– Direct inter-row reference using offsets
• Reporting Aggregate functions
– Sum, avg, min, max, variance, stddev, count, ratio_to_report
• Statistical Aggregates
– Correlation, linear regression family, covariance
• Linear regression
– Fitting of an ordinary-least-squares regression line to a set of number pairs.
– Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions
• Descriptive Statistics
– DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, top/bottom 5 values
• Correlations
– Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric).
• Cross Tabs
– Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa
• Hypothesis Testing
– Student t-test, F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA
• Distribution Fitting
– Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential
                next = lineNext.getQuantity();
            }
            if (!q.isEmpty() && (prev.isEmpty() || (eq(q, prev) && gt(q, next)))) {
                state = "S";
                return state;
            }
            if (gt(q, prev) && gt(q, next)) {
                state = "T";
                return state;
            }
            if (lt(q, prev) && lt(q, next)) {
                state = "B";
                return state;
            }
            if (!q.isEmpty() && (next.isEmpty() || (gt(q, prev) && eq(q, next)))) {
                state = "E";
                return state;
            }
            if (q.isEmpty() || eq(q, prev)) {
                state = "F";
                return state;
            }
            return state;
        }
        private boolean eq(String a, String b) {
            if (a.isEmpty() || b.isEmpty()) {
                return false;
            }
            return a.equals(b);
        }
        private boolean gt(String a, String b) {
            if (a.isEmpty() || b.isEmpty()) {
                return false;
            }
            return Double.parseDouble(a) > Double.parseDouble(b);
        }
        private boolean lt(String a, String b) {
            if (a.isEmpty() || b.isEmpty()) {
                return false;
            }
            return Double.parseDouble(a) < Double.parseDouble(b);
        }
        public String getState() {
            return this.state;
        }
    }
    BagFactory bagFactory = BagFactory.getInstance();
    @Override
    public Tuple exec(Tuple input) throws IOException {
        long c = 0;
        String line = "";
        String pbkey = "";
        V0Line nextLine;
        V0Line thisLine;
        V0Line processLine;
        V0Line evalLine = null;
        V0Line prevLine;
        boolean noMoreValues = false;
        String matchList = "";
        ArrayList<V0Line> lineFifo = new ArrayList<V0Line>();
        boolean finished = false;
        DataBag output = bagFactory.newDefaultBag();
        if (input == null) {
            return null;
        }
        if (input.size() == 0) {
            return null;
        }
        Object o = input.get(0);
        if (o == null) {
            return null;
        }
        //Object o = input.get(0);
        if (!(o instanceof DataBag)) {
            int errCode = 2114;
            String msg = "Expected input to be DataBag, but"
Pattern Matching With Oracle SQL: Snapshot of Oracle SQL Analytic Functions
Simplified, sophisticated, standards-based syntax
SELECT first_x, last_z
FROM ticker MATCH_RECOGNIZE (
  PARTITION BY name ORDER BY time
  MEASURES FIRST(x.time) AS first_x,
           LAST(z.time) AS last_z
  ONE ROW PER MATCH
  PATTERN (X+ Y+ W+ Z+)
  DEFINE X AS (price < PREV(price)),
         Y AS (price > PREV(price)),
         W AS (price < PREV(price)),
         Z AS (price > PREV(price) AND
               z.time - FIRST(x.time) <= 7 ))
250+ Lines of Java UDF vs. 12 Lines of SQL
20x less code
Finding Patterns in Stock Market Data - Double Bottom (W)
[Chart: Ticker price over time, 10:00 to 10:25 in 5-minute steps]
Oracle Big Data SQL – A New Architecture
• Powerful, high-performance SQL on Hadoop
– Full Oracle SQL capabilities on Hadoop
– SQL query processing local to Hadoop nodes
• Simple data integration of Hadoop and Oracle Database
– Single SQL point-of-entry to access all data
– Scalable joins between Hadoop and RDBMS data
• Optimized hardware
– Balanced Configurations
– No bottlenecks
SQL on Hadoop and More: Unifying Metadata
Why Unify Metadata?
Catalog
CUSTOMERS SALES
CREATE TABLE customers…
CREATE TABLE sales…
SELECT customers.name, sales.amount
SELECT name FROM customers
customers sales
Unified metadata =
No changes for users and applications
+ Seamlessly handle schema-on-read
+ Exploit remote data distribution
+ Holistically optimize queries
How Data is Stored in Hadoop
{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}
{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}
{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}
{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}
{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7}
{"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7}
{"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}
{"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}
Example: 1TB File
Block B1
Block B2
Block B3
• 1 block = 256 MB
• Example file = 4096 blocks
• InputSplits = 4096 (potential scan parallelism)
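The block arithmetic above can be checked with a short sketch. It assumes binary units (1 TB = 1024^4 bytes, 1 MB = 1024^2 bytes) and one InputSplit per block, which is consistent with the slide's numbers; the function name `block_count` is illustrative only.

```python
TB = 1024 ** 4  # assumption: binary terabyte
MB = 1024 ** 2  # assumption: binary megabyte


def block_count(file_bytes: int, block_bytes: int) -> int:
    """Number of fixed-size blocks needed to hold a file (the last block may be partial)."""
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (file_bytes + block_bytes - 1) // block_bytes


blocks = block_count(1 * TB, 256 * MB)
# Assumption: one InputSplit per block, so the split count bounds scan parallelism.
input_splits = blocks
print(blocks, input_splits)
```

A file that is one byte larger than a block needs two blocks, which is why the sketch rounds up rather than down.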
How MapReduce and Hive Read Data
[Diagram: Data Node disk, SCAN, Create ROWS & COLUMNS, Consumer]
• Scan and row creation need to work on “any” data format
• Data definitions and column deserialization are needed to present the data as a table
RecordReader => Scans data (keys and values)
InputFormat => Defines parallelism
SerDe => Makes columns
Metastore => Maps DDL to Java access classes
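The division of labor among these four roles can be sketched in plain Python. This is an analogy only: the function names `input_format`, `record_reader`, and `serde` are hypothetical stand-ins, not the Hadoop or Hive APIs, and the records are abbreviated from the movie log sample.

```python
import json
from typing import Iterator, List, Tuple


def input_format(data: str, lines_per_split: int) -> List[List[str]]:
    """InputFormat role: carve the data into splits, the units of parallelism."""
    lines = data.splitlines()
    return [lines[i:i + lines_per_split] for i in range(0, len(lines), lines_per_split)]


def record_reader(split: List[str]) -> Iterator[Tuple[int, str]]:
    """RecordReader role: scan one split and emit (key, value) pairs."""
    for offset, line in enumerate(split):
        yield offset, line


def serde(value: str, columns: List[str]) -> tuple:
    """SerDe role: deserialize one raw value into a row of columns."""
    doc = json.loads(value)
    return tuple(doc.get(col) for col in columns)


# Metastore role: the table definition names the columns the SerDe should produce.
table_columns = ["custId", "movieId", "activity"]

raw = "\n".join([
    '{"custId":1185972,"movieId":null,"genreId":null,"activity":8}',
    '{"custId":1354924,"movieId":1948,"genreId":9,"activity":7}',
    '{"custId":1083711,"movieId":null,"genreId":null,"activity":9}',
])

splits = input_format(raw, lines_per_split=2)
rows = [
    serde(value, table_columns)
    for split in splits
    for _key, value in record_reader(split)
]
print(rows)
```

Each split could be handed to a separate worker, which is the sense in which the InputFormat defines parallelism.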
Hive Metastore: SQL-on-Hadoop Engines Share Metadata, not MapReduce
Engines: Hive, Impala, SparkSQL, Oracle Big Data SQL, …
Table definitions: movieapp_log_json, Tweets, avro_log
Metastore maps DDL to Java access classes
Extend Oracle External Tables
CREATE TABLE movielog (
  click VARCHAR2(4000))
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS
  (
    com.oracle.bigdata.tablename logs
    com.oracle.bigdata.cluster mycluster
  ))
REJECT LIMIT UNLIMITED;
• New types of external tables
– ORACLE_HIVE (inherit metadata)
– ORACLE_HDFS (specify metadata)
• Access parameters for Big Data
– Hadoop cluster
– Remote Hive database/table
• DBMS_HADOOP Package for automatic import
Enhance Oracle External Tables
• Transparent schema-for-read
– Use fast C-based readers when possible
– Use native Hadoop classes otherwise
• Engineered to understand parallelism
– Map external units of parallelism to Oracle
• Architected for extensibility
– StorageHandler capability enables future support for other data sources
– Examples: MongoDB, HBase, Oracle NoSQL DB
CREATE TABLE ORDER (
  cust_num VARCHAR2(10),
  order_num VARCHAR2(20),
  order_total NUMBER(8,2))
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
)
PARALLEL 20
REJECT LIMIT UNLIMITED;
Query Franchising: Smart Scan for Hadoop
What Can Big Data Learn from Exadata?
Intelligent Storage Maximizes Performance
SELECT name, SUM(purchase) FROM customers GROUP BY name;
1. Oracle SQL query issued
• Plan constructed
• Query executed
2. Smart Scan Works on Storage
• Filter out unneeded rows
• Project only queried columns
• Score data models
• Bloom filters to speed up joins
[Diagram: CUSTOMERS table stored on Oracle Exadata Storage Servers]
Query Franchising – dispatch of query processing to disparate systems without loss of operational fidelity through translation
Big Data SQL Server: A New Hadoop Processing Engine
Processing Layer: MapReduce and Hive, Spark, Impala, Search, Big Data SQL
Resource Management (YARN, cgroups)
Storage Layer: Filesystem (HDFS), NoSQL Databases (Oracle NoSQL DB, HBase)
Smart Scan for Hadoop: Optimizing Performance
[Diagram labels: Data Node, Disk, Big Data SQL Server, External Table Services, Smart Scan]
“Oracle on top”
– Apply filter predicates
– Project columns
– Parse semi-structured data
“Hadoop on the bottom”
– Work close to the data
– Schema-on-read with Hadoop classes
– Transformation into Oracle data stream
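A minimal sketch of the three "Oracle on top" operations, assuming JSON-lines input like the movie log sample (abbreviated here). The function `smart_scan` is a hypothetical illustration of filtering and projecting next to the data; it is not Oracle's implementation.

```python
import json
from typing import Callable, Iterable, Iterator, List


def smart_scan(raw_lines: Iterable[str],
               predicate: Callable[[dict], bool],
               columns: List[str]) -> Iterator[tuple]:
    """Parse each record, drop rows that fail the predicate, keep only requested columns."""
    for line in raw_lines:
        record = json.loads(line)                    # parse semi-structured data
        if predicate(record):                        # apply filter predicate
            yield tuple(record[c] for c in columns)  # project columns


block = [
    '{"custId":1185972,"movieId":null,"activity":8}',
    '{"custId":1354924,"movieId":1948,"activity":7}',
    '{"custId":1083711,"movieId":null,"activity":9}',
    '{"custId":1234182,"movieId":11547,"activity":7}',
]

# Only rows with activity = 7 and only two columns leave the data node.
result = list(smart_scan(block, lambda r: r["activity"] == 7, ["custId", "movieId"]))
print(result)
```

Of the four records scanned, two rows of two columns each are returned, which is the data-movement saving the slide describes.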
Big Data SQL Query Execution
How do we query Hadoop?
1. Query compilation determines:
• Data locations
• Data structure
• Parallelism
2. Fast reads using Big Data SQL Server
• Schema-for-read using Hadoop classes
• Smart Scan selects only relevant data
3. Process filtered result
• Move relevant data to database
• Join with database tables
• Apply database security policies
[Diagram labels: Hive Metastore, HDFS NameNode, HDFS Data Nodes each with a BDS Server, blocks marked B]
Parallel Query and Hadoop
Mapping Hadoop to Oracle
1. Determine Hadoop Parallelism
• Determine schema-for-read
• Determine InputSplits
• Arrange splits for best performance
2. Map to Oracle Parallelism
• Map splits to granules
• Assign granules to PX Servers
3. PX Servers Route Work
• Offload work to Big Data SQL Servers
• Aggregate
• Join
• Apply PL/SQL
[Diagram labels: Hive Metastore, HDFS NameNode, PX, InputSplits, blocks marked B]
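The mapping of splits to granules and granules to PX Servers can be sketched as follows. This assumes one granule per InputSplit and a simple round-robin deal, both chosen for illustration; the slides do not say how Oracle actually arranges or assigns granules, and `assign_granules` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List


def assign_granules(num_splits: int, num_px_servers: int) -> Dict[int, List[int]]:
    """Treat each InputSplit as one granule and deal granules round-robin to PX Servers."""
    assignment: Dict[int, List[int]] = defaultdict(list)
    for granule in range(num_splits):
        assignment[granule % num_px_servers].append(granule)
    return dict(assignment)


# Ten splits spread over four PX Servers (numbers chosen for illustration).
plan = assign_granules(num_splits=10, num_px_servers=4)
print(plan)
```

Every granule is assigned exactly once, and no PX Server holds more than one granule beyond any other.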
Big Data SQL Server Dataflow
1. Read data from HDFS Data Node
• Direct-path reads
• C-based readers when possible
• Use native Hadoop classes otherwise
2. Translate bytes to Oracle
3. Apply Smart Scan to Oracle bytes
• Apply filters
• Project Columns
• Parse JSON/XML
• Score models
[Diagram labels: Disks, Data Node, Big Data SQL Server, External Table Services, Smart Scan, RecordReader, SerDe]
But How Does Security Work?
1. Database security for query access
• Virtual Private Databases
• Redaction
• Audit Vault and Database Firewall
2. Hadoop security for Hadoop jobs
• Kerberos Authentication
• Apache Sentry (RBAC)
• Audit Vault
3. System-specific encryption
• Database tablespace encryption
• BDA On-disk Encryption
Filter on SESSION_USER
SELECT * FROM my_bigdata_table
WHERE SALES_REP_ID =
      SYS_CONTEXT('USERENV','SESSION_USER');
DBMS_REDACT.ADD_POLICY(
  object_schema => 'MCLICK',
  object_name => 'TWEET_V',
  column_name => 'USERNAME',
  policy_name => 'tweet_redaction',
  function_type => DBMS_REDACT.PARTIAL,
  function_parameters =>
    'VVVVVVVVVVVVVVVVVVVVVVVVV,*,3,25',
  expression => '1=1'
);
***
SQL, Everywhere: Futures
More Lessons from Exadata
Move less data. Go faster.
1. Storage Indexes
• Skip reads on irrelevant data
• Big Hadoop Blocks ~ Big Speed Up
2. Caching
• Cache frequently accessed columns
• HDFS Caching
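The storage index idea of skipping reads on irrelevant data can be sketched as a per-block min/max summary. This is a generic illustration, assuming one numeric column and an inclusive range predicate; `BlockSummary`, `build_index`, and `blocks_to_read` are hypothetical names, and the slides do not describe the actual index layout.

```python
from typing import Dict, List, NamedTuple


class BlockSummary(NamedTuple):
    block_id: str
    min_value: int
    max_value: int


def build_index(blocks: Dict[str, List[int]]) -> List[BlockSummary]:
    """Record the minimum and maximum column value seen in each block."""
    return [BlockSummary(bid, min(vals), max(vals)) for bid, vals in blocks.items()]


def blocks_to_read(index: List[BlockSummary], lo: int, hi: int) -> List[str]:
    """Keep only blocks whose [min, max] range overlaps the predicate range [lo, hi]."""
    return [s.block_id for s in index if s.max_value >= lo and s.min_value <= hi]


# Illustrative column values for three blocks.
blocks = {"B1": [3, 9, 12], "B2": [40, 41, 55], "B3": [100, 140, 180]}
index = build_index(blocks)

# WHERE value BETWEEN 45 AND 60: only B2 can contain matches, so B1 and B3 are skipped.
to_read = blocks_to_read(index, 45, 60)
print(to_read)
```

With 256 MB blocks, each skipped block avoids a large read, which is the sense of "Big Hadoop Blocks ~ Big Speed Up".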
Oracle Big Data Management: Preserving investment in SQL for Big Data analytics
NoSQL
Oracle Big Data Management: Unite Information Lifecycles
NoSQL
Shared REST APIs
• App-embedded schema NoSQL
• Shared schema Oracle
Automatic ILM
• App-embedded schema NoSQL
• Shared schema Oracle
Oracle Big Data Management: Unify All Query
NoSQL