© Copyright 2011 EMC Corporation. All rights reserved.
Research on Big Data - FlexDB: A cloud-scale database engine based on Hadoop
Jidong Chen ([email protected])
Manager, Research Scientist, Big Data Lab
EMC Labs China
Sept. 2011
Grand Opening Announcement
EMC Labs China is formed from EMC Research China and the Advanced Technology Venture group, which were established in 2007 by the Office of the CTO.
EMC Labs China - Vision and Mission
Advanced Technology Research and Development
Big Data Lab
Cloud Infrastructure and System Lab
Cloud Platform and Applications Lab
University Collaboration
Industry Standards Office
IP Portfolio Development
Vision
Become an elite research and advanced technology institute in China
Become the model for future EMC Labs worldwide
Outline
• Big Data projects overview at EMC Labs China
• Introduction to Cloud Databases
• Data analytics in the cloud
– Parallel DBMS
– MapReduce
• FlexDB - A cloud-scale database engine based on Hadoop
• Summary
The Digital Universe 2009-2020
2009: 0.8 ZB
2020: 35.2 ZB (zettabytes)
Growing by a factor of 44
Source: IDC Digital Universe Study, sponsored by EMC, May 2010
Big Data is Changing the World
Expanding Data Sources
• Science and research
– Gene sequences
– LHC accelerator
– Earth and space exploration
• Enterprise applications
– Email, documents, files
– Application logs
– Transaction records
• Web 2.0 data
– Search log / click stream
– Twitter / Blog / SNS
– Wiki
• Other unstructured data
– Video/Movie
– Graphics
– Digital widgets
Bigger Challenges
• Scale out automatically
– Vs. scale up manually
• More capacity and bigger pool
– E.g., 10 PB in a single file system
• New process capability
– Loading, analyzing, moving data
– Intelligence
• Better performance
– Linear vs. exponential
– Faster
• Autonomous
– Less human intervention
– Lower cost
Research Scopes and Topics in Big Data
• Search and Analytics
– Search: Entity Search, Faceted Search, Associative Search
– Analytics: Text Analysis, Activity Modeling and Sequence Analysis, Real-time Data Analysis for Streaming, Parallel Data Mining Algorithms
• MPP Databases and Data Services
– Parallel Database: Parallel Query Optimization, Data Partitioning and Replication, Distributed Transaction
– In-memory Database: Cache, Recovery, Consistency
– Database as a Service: Multi-tenant Data Management, Auto-Administration
• Hadoop/NoSQL
– Hadoop: Single-node Failure, Performance, Real-time MapReduce Scheduler and Fault Tolerance
– NoSQL: Key-Value Store, Document Store, Graph Data Store
Project Overview
• Hadoop/NoSQL
– vHadoop - joint project with VMware
  • Parallel SAN file system for DISC on virtualized platform
– Online MapReduce for Real-time Data Analytics
  • Pipelined task execution, Group task scheduling, Enhanced fault tolerance
  • Parallel Data Mining
– FlexDB: Cloud-scale Parallel Database for OLAP
  • MapReduce integration into DBMS, Parallel query execution, Cost-based query optimization
– Cloud-scale Parallel Database for OLTP
  • Intelligent database sharding and resharding
  • Active-active (eager) replication with group communication service
  • Multiple masters with elastic distributed coordination
Cloud Databases
• Two largest components of data management market
– Transactional Data Management
  • Banks, airline reservation, online e-commerce
  • ACID, write-intensive
– Analytical Data Management
  • Business planning, decision support
  • Query-intensive
• Challenges of data management in the Cloud
– Scalability
– Fault Tolerance
– Availability & Consistency
– Transaction Management
– Flexible Schemas
Cloud Databases
• Data analytics in the cloud
– Parallel DBMS
– MapReduce
• Transactional data management in the cloud
– NoSQL Store
– SQL Database
• Cloud data services (Database as a Service)
– Multi-tenant data management
– Auto-administration
Commercial Landscape: Major Players
• Amazon EC2
– IaaS abstraction
– Data management using S3 and SimpleDB
• Microsoft Azure
– PaaS abstraction
– Relational engine (SQL Azure)
• Google AppEngine
– PaaS abstraction
– Data management using Google Megastore
Data Analytics in the Cloud
• Scalability to large data volumes:
– Scan 100 TB on 1 node @ 50 MB/sec = 23 days
– Scan on 1000-node cluster = 33 minutes
Divide-and-conquer (i.e., data partitioning)
• Cost-efficiency:
– Commodity nodes (cheap, but unreliable)
– Commodity network
– Automatic fault-tolerance (fewer admins)
– Easy to use (fewer programmers)
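The scan-time figures on this slide follow from simple arithmetic. The sketch below reproduces them under stated assumptions: decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), perfectly even partitioning, and no coordination overhead. The function name `scan_seconds` is illustrative, not from the slides.

```python
def scan_seconds(data_bytes: float, mb_per_sec_per_node: float, nodes: int = 1) -> float:
    """Time to scan data_bytes when the data is split evenly across `nodes` nodes."""
    bytes_per_sec = mb_per_sec_per_node * 1e6 * nodes  # aggregate scan bandwidth
    return data_bytes / bytes_per_sec

single = scan_seconds(100e12, 50)          # 100 TB on one node at 50 MB/sec
cluster = scan_seconds(100e12, 50, 1000)   # same data on a 1000-node cluster

print(f"1 node: {single / 86400:.1f} days")
print(f"1000 nodes: {cluster / 60:.1f} minutes")
```

In practice, skewed partitions and network overhead would keep a real cluster from reaching this ideal 1000x speedup.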
Solutions for Large-scale Data Analysis
• Parallel DBMS technologies
– Proposed in the late 1980s
– Matured over the last two decades
– Multi-billion dollar industry: proprietary DBMS engines intended as data warehousing solutions for very large enterprises
• MapReduce
– Pioneered by Google
– Popularized by Yahoo! (Hadoop)
Parallel DBMS technologies
• Popularly used for more than two decades
– Research Projects: Gamma, Grace, …
– Commercial: Teradata, Greenplum (acquired by EMC), Netezza (acquired by IBM), DATAllegro (acquired by Microsoft), Vertica (acquired by HP), Aster Data (acquired by Teradata)
• Shared-nothing node clusters
• Relational Data Model
• Indexing
• Familiar SQL interface
• Parallel query execution
– Horizontal partitioning of relational tables with partitioned execution of SQL queries
• Advanced query optimization
• Well understood and studied
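Horizontal partitioning with partitioned query execution can be illustrated with a toy sketch. It assumes integer keys (so modulo stands in for a hash function) and a small hand-made `orders` list; the helper names `hash_partition`, `local_sum` and `merge` are hypothetical and do not correspond to any particular product's internals.

```python
from collections import defaultdict

def hash_partition(rows, key, num_nodes):
    """Assign each row to a node by hashing its partitioning key."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        # Integer keys: modulo acts as the hash function in this sketch.
        partitions[row[key] % num_nodes].append(row)
    return partitions

def local_sum(partition, group_key, value_key):
    """Per-node partial aggregate: SUM(value_key) GROUP BY group_key."""
    totals = defaultdict(float)
    for row in partition:
        totals[row[group_key]] += row[value_key]
    return totals

def merge(partials):
    """Coordinator step: combine the partial aggregates from all nodes."""
    result = defaultdict(float)
    for partial in partials:
        for group, value in partial.items():
            result[group] += value
    return dict(result)

# Toy data, invented for illustration only.
orders = [
    {"o_orderkey": 1, "o_custkey": 10, "o_totalprice": 100.0},
    {"o_orderkey": 2, "o_custkey": 11, "o_totalprice": 50.0},
    {"o_orderkey": 3, "o_custkey": 10, "o_totalprice": 25.0},
    {"o_orderkey": 4, "o_custkey": 12, "o_totalprice": 75.0},
]
partitions = hash_partition(orders, "o_orderkey", num_nodes=2)
revenue = merge(local_sum(p, "o_custkey", "o_totalprice") for p in partitions)
print(revenue)
```

Each node only touches its own partition, so the per-node work shrinks as nodes are added; only the small partial aggregates travel to the coordinator.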
Greenplum: A Shared-nothing Parallel DBMS
Greenplum’s MPP Database has extreme scalability
– Optimized for BI and analytics
– Fault-tolerant reliability and optimized performance using commodity CPUs, disks and networking
Provides automatic parallelization
– No need for manual partitioning or tuning
– Just load and query like any database
– Tables are automatically distributed across nodes
Extremely scalable and I/O optimized
– All nodes can scan and process in parallel
– No I/O contention between segments
Linear scalability by adding nodes
– Each adds storage, query performance and loading performance
[Figure labels: Interconnect, Loading]
Greenplum Database Architecture: MPP (Massively Parallel Processing), Shared-Nothing Architecture
[Figure: architecture diagram. Labels: SQL and MapReduce (client interfaces); Master Servers: query planning & dispatch; Network Interconnect; Segment Servers: query processing & data storage; External Sources: loading, streaming, etc.]
Example of Parallel Query Optimization
select
c_custkey, c_name,
sum(l_extendedprice * (1 - l_discount)) as revenue,
c_acctbal, n_name, c_address, c_phone, c_comment
from
customer, orders, lineitem, nation
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate >= date '1994-08-01'
and o_orderdate < date '1994-08-01'
+ interval '3 month'
and l_returnflag = 'R'
and c_nationkey = n_nationkey
group by
c_custkey, c_name, c_acctbal,
c_phone, n_name, c_address, c_comment
order by
revenue desc
Parallel query plan:
Gather Motion 4:1 (slice 3)
  Sort
    HashAggregate
      HashJoin
        Redistribute Motion 4:4 (slice 1)
          HashJoin
            Seq Scan on lineitem
            Hash
              Seq Scan on orders
        Hash
          HashJoin
            Seq Scan on customer
            Hash
              Broadcast Motion 4:4 (slice 2)
                Seq Scan on nation
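The motion operators in this plan (Redistribute, Broadcast, Gather) move rows between segments so that joins can run locally. The sketch below is a conceptual illustration only, not Greenplum's implementation: segments are modelled as plain Python lists, integer keys are hashed with modulo, and the sample rows and nation names are invented.

```python
def redistribute(segments, key, n):
    """Redistribute Motion: re-hash every row on `key` so equal keys meet on one segment."""
    out = [[] for _ in range(n)]
    for seg in segments:
        for row in seg:
            out[row[key] % n].append(row)
    return out

def broadcast(segments, n):
    """Broadcast Motion: every segment receives a full copy of the (small) table."""
    all_rows = [row for seg in segments for row in seg]
    return [list(all_rows) for _ in range(n)]

def gather(segments):
    """Gather Motion: collect the per-segment results in one place."""
    return [row for seg in segments for row in seg]

def hash_join(left, right, left_key, right_key):
    """Local hash join: build a hash table on `right`, probe it with `left`."""
    table = {}
    for rrow in right:
        table.setdefault(rrow[right_key], []).append(rrow)
    return [{**lrow, **rrow} for lrow in left for rrow in table.get(lrow[left_key], [])]

N = 2  # number of segments in this toy example
customer = [[{"c_custkey": 1, "c_nationkey": 7}], [{"c_custkey": 2, "c_nationkey": 8}]]
nation = [[{"n_nationkey": 7, "n_name": "A"}], [{"n_nationkey": 8, "n_name": "B"}]]

# Broadcast the small nation table, then join locally on every segment.
nation_everywhere = broadcast(nation, N)
joined = [
    hash_join(customer[i], nation_everywhere[i], "c_nationkey", "n_nationkey")
    for i in range(N)
]
result = gather(joined)

# Redistribution instead re-hashes a table on the join key.
by_custkey = redistribute(customer, "c_custkey", N)
print(result)
```

Broadcasting suits a small table such as nation, while redistribution avoids copying a large table to every segment.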
MapReduce
• Overview
– Large-scale, massively parallel data access platform
– Simple data-parallel programming model to express relatively sophisticated distributed programs
– An associated parallel and distributed implementation for commodity clusters
• Pioneered by Google
– Processes 20 PB of data per day
• Popularized by open-source Hadoop project
– Used by Yahoo!, Facebook, Amazon, and the list is growing …
Programming Framework
[Figure: Raw Input <key, value> → MAP → <K1, V1> <K2, V2> <K3, V3> → REDUCE]
MapReduce Example: WordCount
Map(K, V) {
    For each word w in V
        Collect(w, 1);
}
Combine(K, V[ ]) {
    Int count = 0;
    For each v in V
        count += v;
    Collect(K, count);
}
Reduce(K, V[ ]) {
    Int count = 0;
    For each v in V
        count += v;
    Collect(K, count);
}
[Figure: input words (Cat, Bat, Dog, other words; size: TByte) → split → map → combine → reduce → output files part0, part1, part2. Example output: Cat 3, Bat 4, Dog 3, …]
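The WordCount pseudocode can be made runnable as a single-process sketch. This is not Hadoop code: the splits are short in-memory strings chosen so the totals match the example output on the slide, and the shuffle is a plain dictionary grouping. The function names are illustrative.

```python
from collections import defaultdict
from itertools import chain

def map_fn(_key, text):
    """Map: emit (word, 1) for every word in the input value."""
    return [(word, 1) for word in text.split()]

def combine_fn(pairs):
    """Combine: pre-aggregate counts locally, on the map side."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return list(counts.items())

def reduce_fn(word, counts):
    """Reduce: sum all partial counts for one word."""
    return word, sum(counts)

def word_count(splits):
    # Map and combine run independently on each split.
    combined = [combine_fn(map_fn(i, split)) for i, split in enumerate(splits)]
    # Shuffle: group the partial counts by word.
    grouped = defaultdict(list)
    for word, count in chain.from_iterable(combined):
        grouped[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())

splits = ["Cat Bat Dog", "Bat Cat Bat", "Dog Cat Bat Dog"]
counts = word_count(splits)
print(counts)
```

The combiner has the same body as the reducer because addition is associative and commutative, so partial sums computed on the map side can be safely summed again.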
MapReduce Implementation in Hadoop
[Figure: the client submits a job to the master, which assigns map and reduce tasks. Map phase: mappers read the input files (split0 to split4) and write intermediate files to local disk. Reduce phase: reducers read the intermediate files remotely and write the output files (file0, file1).]
MapReduce Advantages
• Automatic Parallelization:
– Depending on the size of RAW INPUT DATA, instantiate multiple MAP tasks
– Similarly, depending upon the number of intermediate <key, value> partitions, instantiate multiple REDUCE tasks
• Run-time:
– Data partitioning
– Task scheduling
– Handling machine failures
– Managing inter-machine communication
• Completely transparent to the programmer/analyst/user
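How the task counts depend on the data can be sketched in a few lines. The 64 MiB split size and the CRC32-based partitioning below are assumptions chosen for illustration (a stable stand-in for hash partitioning), and both function names are hypothetical.

```python
import math
import zlib

def num_map_tasks(input_bytes: int, split_bytes: int) -> int:
    """One map task per input split; at least one task even for tiny inputs."""
    return max(1, math.ceil(input_bytes / split_bytes))

def reducer_for(key: str, num_reducers: int) -> int:
    """Hash-partition an intermediate key to one of the reduce tasks."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Assumed sizes: 10 GiB of input read in 64 MiB splits.
maps = num_map_tasks(input_bytes=10 * 1024**3, split_bytes=64 * 1024**2)
print(maps)

# Every occurrence of the same key lands on the same reducer.
print(reducer_for("Cat", 3) == reducer_for("Cat", 3))
```

Because partitioning depends only on the key, all values for one key reach the same reduce task without any coordination between mappers.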
Possible Applications
• Special-purpose programs to process large amounts of data: crawled documents, Web query logs, etc.
– ETL and “read once” data sets
– Complex analytics
– Semi-structured data, key-value pairs
• At Google and others (Yahoo!, Facebook):
– Inverted index
– Graph structure of the Web documents
– Summaries of #pages/host, set of frequent queries, etc.
– Ad optimization
– Spam filtering
MapReduce vs Parallel DBMS
Feature | Parallel DBMS | MapReduce
Schema Support | Yes | Not out of the box
Indexing | Yes | Not out of the box
Programming Model | Declarative (SQL) | Imperative (C/C++, Java, …); extensions through Pig and Hive
Optimizations (Compression, Query Optimization) | Yes | Not out of the box
Flexibility | Not out of the box | Yes
Fault Tolerance | Coarse-grained techniques | Yes
Further Analysis and Comparison
• Limitations of some current parallel databases / data warehouses
– Often use expensive/specialized hardware
– Difficult to scale to more than 100 nodes
– Difficult to parallelize data mining applications
  • MPI …
– Difficult to deal with unstructured data
– Fault tolerance
  • One node fails, restart whole query
– Expensive
• Disadvantages of some MapReduce-based solutions (Hive)
– A sub-optimal brute-force implementation: no indexing, no JOINs
  • Example: finding everyone whose salary is $10,000 requires scanning all the data
– Row-based storage, updates?
– Not SQL/BI tool compatible
– No support for schema
– Non-declarative programming model
MapReduce Integration in DBMS Context
• FlexDB - A Cloud-scale Parallel Database Engine based on Hadoop MapReduce (A Research Project)
– An architectural hybrid of MapReduce and DBMS technologies
– Use the fault tolerance and scalability of the MapReduce framework
– Leverage advanced data processing techniques (e.g., query optimization) of an RDBMS for high performance
– Expose a declarative interface to the user
• Goal: Leverage the best of both worlds
FlexDB Architecture
[Figure: FlexDB architecture. Labels: FlexDB Master with Query Parser, Query Optimizer, Job Generator, Job Executor and Catalog manager; an input query (SELECT * FROM Account WHERE balance > 30) issued as subqueries inside Jobs; Mappers and Reducers in the MapReduce Framework; multiple Database nodes holding partitions of the Account table.]
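The idea of pushing the same subquery to every database node and merging the results in a reduce step can be sketched with the standard-library `sqlite3` module. This is only an illustration of the pattern suggested by the diagram, not FlexDB's actual code: SQLite stands in for the per-node Postgres/Greenplum databases, the `id` column and sample rows are invented, and `make_node`, `map_task` and `reduce_task` are hypothetical names.

```python
import sqlite3

# Explicit columns instead of SELECT *; the `id` column is an assumption.
QUERY = "SELECT id, balance FROM Account WHERE balance > 30"

def make_node(rows):
    """Create one in-memory database holding a partition of the Account table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE Account (id INTEGER PRIMARY KEY, balance REAL)")
    db.executemany("INSERT INTO Account VALUES (?, ?)", rows)
    return db

def map_task(db, subquery):
    """Mapper: push the subquery down to the local database and emit its rows."""
    return db.execute(subquery).fetchall()

def reduce_task(partial_results):
    """Reducer: merge the per-node results into one sorted answer."""
    return sorted(row for part in partial_results for row in part)

nodes = [
    make_node([(1, 10.0), (2, 45.0)]),
    make_node([(3, 31.0), (4, 30.0)]),
    make_node([(5, 99.5)]),
]
result = reduce_task(map_task(db, QUERY) for db in nodes)
print(result)
```

The filter runs inside each database, so only qualifying rows reach the reducer; a real system would also have to decide which parts of a query plan run in the database and which run as MapReduce jobs.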
Comparison with other systems
Feature | FlexDB | Hive | HadoopDB | Traditional parallel database
Query Language | SQL | HQL | SQL (joins not currently supported) | SQL
Storage | Postgres/Greenplum | HDFS | JDBC compatible | Native OS files
Optimizer | Cost based (DB/MR paths) | Simple rule based | Simple rule based | Cost based
Physical storage organization | Column/Row based | Row based | Currently row based | Column/Row based
Implementation | FlexDB Master + Hadoop + DB | Hive + Hadoop | Hive (rev) + Hadoop + DB | Native
Efficiency | High | Low | Middle | Very High
Scale | Large | Large | Large | Middle
Cost | Low | Low | Low | High
Summary
• New in cloud computing
– Elasticity/Scalability
– Resource sharing (multi-tenancy)
– Focus on failure
• Data analytics in the cloud: Different solutions suitable for different workloads
– Parallel DBMSs excel at efficient querying of large data sets
– MR-style systems excel at complex analytics and ETL tasks
• Combine MapReduce with a shared-nothing DBMS to produce a system that better fits the cloud computing market
Acknowledgements
• Some slides are adapted from the following references:
– Divy Agrawal, Sudipto Das, and Amr El Abbadi, “Big Data and Cloud Computing: New Wine or just New Bottles?”, VLDB 2010 Tutorial
– Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin, “MapReduce and Parallel DBMSs: Friends or Foes?”, Communications of the ACM 2010
EMC Labs China
Dr. Tao Bo (陶波)
Head of EMC Labs China
Blog: http://blog.sina.com.cn/emclabschina
Weibo: http://weibo.com/emclabschina
THANK YOU