Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
“Big Data”
Edgars Ruņģis
What is Big Data?
VELOCITY – VOLUME – VARIETY
• Enormous volumes of real-time data streams from internal and external sources and historic data; hyper-volumes of structured, semi-structured and unstructured data
• Combine historic data with data streams and feeds
• Detect significant events from real-time data streams
• Respond automatically to detected events by raising alerts
• Examples: Call Data Records, social media traffic, videos, audio, financial transactions, sensor-based data, border crossings, airline passenger records
Big Data ≈ Hadoop
Hadoop Can Be Confusing
What is Hadoop?
Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
What to Pay Attention To
Distributed Storage
– HDFS
Parallel Processing Framework
– MapReduce
Higher-Level Languages
– Hive
– Pig
– Etc.
HDFS: The Distributed Filesystem
What is it? The petabyte-scale distributed file system at the core of Hadoop.
Benefits
– Linearly scalable on commodity hardware
– An order of magnitude cheaper per TB
– Designed around schema-on-read
Limitations
– Low security
– Write-once, read-many model
Interacting with HDFS
NameNodes and DataNodes
– NameNodes hold the filesystem metadata (the namespace and edit log)
– DataNodes store the actual data blocks
Command-line access resembles UNIX filesystems
– ls (list)
– cat, tail (concatenate or tail file)
– cp, mv (copy or move within HDFS)
– get, put (copy between local file system and HDFS)
HDFS Mechanics
Suppose we have a large file and a set of DataNodes.
• The file will be broken up into blocks
• Blocks are stored in multiple locations
• Allows for parallelism and fault-tolerance
• Nodes operate on their local data
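The splitting-and-replication idea above can be sketched in a few lines of plain Python. This is an illustrative toy, not Hadoop code: the block size, replication factor and round-robin placement are simplifications of what the NameNode actually does.

```python
# Illustrative sketch (not Hadoop code): split a file into fixed-size
# blocks and replicate each block across several DataNodes.
from itertools import cycle

BLOCK_SIZE = 4    # toy value; real HDFS defaults to 128 MB
REPLICATION = 3   # HDFS default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Break the file into fixed-size blocks (the last may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    placement = {}
    starts = cycle(range(len(datanodes)))
    for block_id, _ in enumerate(blocks):
        start = next(starts)
        placement[block_id] = [datanodes[(start + r) % len(datanodes)]
                               for r in range(replication)]
    return placement

blocks = split_into_blocks(b"a large file's bytes...")
placement = place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4", "dn5", "dn6"])
print(len(blocks), placement[0])
```

Because each block lives on several nodes, the loss of any single DataNode loses no data, and map tasks can be scheduled on whichever replica is local.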
MapReduce: The Parallel Processing Framework
What is it? The parallel processing framework that dominates the Big Data landscape.
Benefits
– Provides data-local computation
– Fault-tolerant
– Scales just like HDFS
Limitations
– You are the optimizer
– Batch-oriented
MapReduce Mechanics
Suppose 3 face cards are removed. How do we find which suits are short using MapReduce?
MapReduce Mechanics
Map Phase: Each TaskTracker has some data local to it. Map tasks operate on this local data.
If face_card: emit(suit, card)
[Diagram: four TaskTracker/DataNode machines, each running a map task over its local cards]
MapReduce Mechanics
Shuffle/Sort: Intermediate data is shuffled and sorted for delivery to the reduce tasks.
MapReduce Mechanics
Reduce Phase: Reducers operate on local data to produce the final result.
emit(key, count(key))
Result: Spades: 3, Hearts: 2, Diamonds: 2, Clubs: 2
Flow of key/value pairs
Map: Input (K0,V0) → Output (K1,V1)
Reduce: Input (K1,list(V1)) → Output (K2,V2)
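The card example from the preceding slides can be run end to end in plain Python, with ordinary functions standing in for the MapReduce framework: map emits (suit, card) for face cards, the shuffle groups values by key, and reduce counts per suit. This is a minimal in-process sketch, not Hadoop code.

```python
# Card example as a toy MapReduce pipeline: map -> shuffle -> reduce.
from collections import defaultdict

SUITS = ["Spades", "Hearts", "Diamonds", "Clubs"]
FACE_VALUES = [11, 12, 13]  # Jack, Queen, King

# Build a deck with 3 face cards removed, as in the slides.
deck = [(suit, value) for suit in SUITS for value in range(2, 14)]
for missing in [("Hearts", 11), ("Diamonds", 12), ("Clubs", 13)]:
    deck.remove(missing)

def map_phase(records):
    """Map: emit (suit, card) for every face card."""
    for suit, value in records:
        if value in FACE_VALUES:
            yield suit, value

def shuffle(pairs):
    """Shuffle/sort: group all values by key for delivery to reducers."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: emit (suit, count(suit))."""
    return {key: len(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle(map_phase(deck)))
print(counts)  # Spades keeps all 3 face cards; the other suits have 2
```

Note how each phase only sees key/value pairs, matching the (K0,V0) → (K1,V1) → (K1,list(V1)) → (K2,V2) flow above; that narrow contract is what lets the framework distribute the work.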
The default way to process data in Hadoop
SQL style!
HDFS + MapReduce = Hadoop
What should I do if I don’t know Java but still want to process data in HDFS?
Use Hive!
Hive: A move toward declarative language
What is it? A SQL-like language for Hadoop.
Benefits
– Abstracts MapReduce code
– Schema-on-read via InputFormat and SerDe
– Provides and preserves metadata
Limitations
– Not ideal for ad hoc work (slow)
– Subset of SQL-92
– Immature optimizer
Storing a Clickstream
• Storing large amounts of clickstream data is a common use for HDFS
• Individual clicks aren’t valuable by themselves
• We’d like to write queries over all clicks
Defining Tables Over HDFS
• Hive allows us to define tables over HDFS directories
• The syntax is simple SQL
• SerDes allow Hive to deserialize data
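The schema-on-read idea behind SerDes can be sketched without Hive at all: raw lines sit in storage untouched, and a deserializer applies the table schema only when a query reads them. The file contents and field layout below are made-up examples, and the `deserialize` function merely plays the role a SerDe would.

```python
# Schema-on-read sketch: raw clickstream lines get typed columns only
# at query time, not when they are written.
from datetime import datetime

# Raw clickstream lines as they might land in an HDFS directory.
raw_lines = [
    "2013-06-01T10:00:00\t/index.html\tuser42",
    "2013-06-01T10:00:05\t/products.html\tuser42",
    "2013-06-01T10:01:00\t/index.html\tuser7",
]

def deserialize(line):
    """Play the role of a SerDe: turn one raw line into typed columns."""
    ts, page, user = line.split("\t")
    return {"ts": datetime.fromisoformat(ts), "page": page, "user": user}

# "Query time": the schema is applied as rows are read.
rows = [deserialize(line) for line in raw_lines]
index_hits = sum(1 for r in rows if r["page"] == "/index.html")
print(index_hits)
```

The payoff is flexibility: the same raw files can be read under a different schema tomorrow simply by swapping the deserializer, which is exactly what Hive's pluggable InputFormat/SerDe pair enables.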
How Does It Work? Anatomy of a Hive Query
SELECT suit, COUNT(*)
FROM cards
WHERE face_value > 10
GROUP BY suit;
How does Hive execute this query?
Anatomy of a Hive Query
SELECT suit, COUNT(*)
FROM cards
WHERE face_value > 10
GROUP BY suit;
1. Hive optimizer builds a MapReduce job
2. Projections and predicates become Map code
3. Aggregations become Reduce code
4. Job is submitted to the MapReduce JobTracker
Map task: If face_card: emit(suit, card)
Shuffle
Reduce task: emit(suit, count(suit))
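The translation can be made concrete with a small Python simulation of what Hive generates for the query above: the WHERE predicate and the SELECT projection run inside the map task, and the COUNT(*) per GROUP BY key runs in the reduce task. This is a sketch of the compilation idea, not Hive's actual generated code.

```python
# Simulate Hive's plan for:
#   SELECT suit, COUNT(*) FROM cards WHERE face_value > 10 GROUP BY suit;
from collections import defaultdict

# A full 52-card deck as (suit, face_value) rows.
cards = [(suit, value)
         for suit in ("Spades", "Hearts", "Diamonds", "Clubs")
         for value in range(2, 14)]

def map_task(table_rows):
    """Map side: apply the WHERE predicate, project the GROUP BY key."""
    for suit, face_value in table_rows:
        if face_value > 10:        # WHERE face_value > 10
            yield suit, 1          # SELECT suit, plus a count of 1

def reduce_task(pairs):
    """Reduce side: GROUP BY suit with COUNT(*)."""
    counts = defaultdict(int)
    for suit, one in pairs:
        counts[suit] += one
    return dict(counts)

result = reduce_task(map_task(cards))
print(result)
```

On a full deck every suit has three cards with face_value > 10 (Jack, Queen, King), so every group counts 3. Filtering in the map task matters: it shrinks the data before the expensive shuffle.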
Hadoop Programming - Summary
• HDFS - Hadoop Distributed File System
– Designed to achieve SCALE and aggregate THROUGHPUT on commodity hardware
– Not a database; data remains in its original format on disk (the query engine does NOT own the data format)
• MapReduce
– Simple programming model for large-scale data processing (simplicity, NOT performance!)
– Checkpoints to disk for fault tolerance
– Leverages InputFormat, RecordReader and SerDe
– Leverages multiple copies of data (speculative execution)
• Various higher-level languages
– Hive (SQL implemented as MapReduce)
– Pig (a scripting language, similar in spirit to Python)
Impala
What is Impala?
• Massively parallel processing (MPP) database engine, developed by Cloudera
• Integrated into the Hadoop stack at the same level as MapReduce, not above it (as Hive and Pig are)
• Impala processes data in the Hadoop cluster without using MapReduce
[Diagram: Hive and Pig run on top of MapReduce over HDFS; Impala sits beside MapReduce, directly on HDFS]
Impala Architecture
My query needs to go faster
• Impala, Spark, Stinger, Shark, Tajo, Presto, and a whole bunch more
– Remove MapReduce and checkpointing to disk
– Sacrifice resilience and scale for performance
– Limited SQL capability and access paths
– Create a (new) SQL database on top of HDFS
– Create and load data into optimized storage formats to improve performance
How to load data into Hadoop?
Hadoop command-line
• The source server needs to have the Hadoop client program installed.
• A file can then be loaded with the following command: hadoop fs -put source.file /tmp/hadoop_dir
Load data into Hadoop from an RDBMS (Sqoop)
Acquire stream data: Flume
Flume for collecting Twitter feeds
http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html
Hadoop distributions
Cloudera
Hortonworks
Big Data and Oracle
Oracle Big Data Solution
Stream – Acquire – Organize – Analyze – Decide
[Diagram of the product stack across these stages: Oracle Event Processing, Apache Flume, Oracle GoldenGate; Oracle NoSQL Database, Cloudera Hadoop; Oracle Big Data Connectors, Oracle Data Integrator; Oracle Database, Oracle Advanced Analytics, Oracle Spatial & Graph, Oracle R Distribution; Oracle BI Foundation Suite, Oracle Real-Time Decisions, Endeca Information Discovery]
Oracle Big Data Connectors
• Oracle Loader for Hadoop
• Oracle SQL Connector for HDFS
• Oracle R Advanced Analytics for Hadoop
• Oracle XQuery for Hadoop
• Oracle Data Integrator Application Adapters for Hadoop
Licensed Together
Oracle Loader for Hadoop
Load data from Hadoop into Oracle Database.
[Diagram: a MapReduce job (map, shuffle/sort, reduce phases) reads unstructured input data in Hadoop and loads it into a local Oracle table in Oracle Database]
Oracle SQL Connector for HDFS: Use Oracle SQL to Access Data on HDFS
• Generate an external table in the database pointing to HDFS data
• Load into the database or query data in place on HDFS
• Fine-grained control over data type mapping
• Parallel load with automatic load balancing
• Kerberos authentication
[Diagram: a SQL query hits an external table in Oracle Database; OSCH instances act as HDFS clients that access or load data from Hadoop in parallel via the external table mechanism]
New Data Sources for Oracle External Tables
CREATE TABLE movielog
(click VARCHAR2(4000))
ORGANIZATION EXTERNAL
( TYPE ORACLE_HIVE
DEFAULT DIRECTORY Dir1
ACCESS PARAMETERS
(
com.oracle.bigdata.tablename logs
com.oracle.bigdata.cluster mycluster)
)
REJECT LIMIT UNLIMITED
• New set of properties
– ORACLE_HIVE and ORACLE_HDFS access drivers
– Identify a Hadoop cluster, data source, column mapping, error
handling, overflow handling, logging
• New table metadata passed from Oracle DDL to Hadoop
readers at query execution
• Architected for extensibility
– StorageHandler capability enables future support for other
data sources
– Examples: MongoDB, HBase, Oracle NoSQL DB
Oracle SQL Connector for HDFS
• Load data from the external table with Oracle SQL
– INSERT INTO <tablename> SELECT * FROM <external tablename>
• Access data in place on HDFS with Oracle SQL
– Note: no indexes, no partitioning, so queries are full table scans
• Read data in parallel
– Ex: if there are 96 data files and the database can support 96 PQ slaves, all 96 files can be read in parallel
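The one-reader-per-file parallelism can be illustrated with a toy Python sketch: spawn up to as many workers as there are slaves, one per data file. The file names and the `read_file` stand-in are made up for illustration; this is the scheduling idea, not the connector's implementation.

```python
# Toy degree-of-parallelism sketch: one reader per data file, capped
# at the number of available parallel query (PQ) slaves.
from concurrent.futures import ThreadPoolExecutor

data_files = [f"part-{i:05d}" for i in range(96)]  # hypothetical names
pq_slaves = 96

def read_file(name):
    # Stand-in for scanning one HDFS file; returns a row count.
    return 1

with ThreadPoolExecutor(max_workers=min(pq_slaves, len(data_files))) as pool:
    total_rows = sum(pool.map(read_file, data_files))
print(total_rows)
```

With fewer slaves than files, the pool simply queues the extra files, which is why the number of data files caps the achievable parallelism.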
Oracle SQL Connector for HDFS: Input Data Formats
• Text files
• Hive tables (text data)
• Oracle Data Pump files generated by Oracle Loader for Hadoop
Oracle SQL Connector for HDFS: Data Pump Files
• Oracle Data Pump: a binary-format data file
• Loading Oracle Data Pump files is more efficient – uses about 50% less database CPU
– Hadoop does more of the work, transforming text data into binary data optimized for Oracle
• Note: only Oracle Data Pump files generated by Oracle Loader for Hadoop
Using Hadoop To Optimize IT
Big Data and Optimized Operations
• Big Data can handle a lot of heavy lifting
– It’s a complement to the database
• Big Data allows access to more detailed data at lower cost
• We can use Big Data to make the database do more
Optimizing ETL
The Big Data problem: long-running batch transformations compete with mission-critical reporting and ad hoc analysis on the base table.
The approach: copy/move the base table to Hadoop, run the long-running batch transformation there, and load the results into Oracle.
Q&A