MPP vs Hadoop
-
Upload
alexey-grishchenko -
Category
Data & Analytics
-
view
17.881 -
download
0
Transcript of MPP vs Hadoop
![Page 1: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/1.jpg)
1 Pivotal Confidential–Internal Use Only 1 Pivotal Confidential–Internal Use Only
MPP vs Hadoop Alexey Grishchenko
HUG Meetup 28.11.2015
![Page 2: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/2.jpg)
2 Pivotal Confidential–Internal Use Only
Agenda
� Distributed Systems
� MPP
� Hadoop
� MPP vs Hadoop
� Summary
![Page 3: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/3.jpg)
3 Pivotal Confidential–Internal Use Only
Agenda
� Distributed Systems � MPP
� Hadoop
� MPP vs Hadoop
� Summary
![Page 4: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/4.jpg)
4 Pivotal Confidential–Internal Use Only
Distributed Systems
Avoid distributed systems in all the problems that potentially could be solved using non-distributed systems
![Page 5: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/5.jpg)
5 Pivotal Confidential–Internal Use Only
Distributed Systems
� Consensus problem – Paxos – RAFT – ZAB – etc.
� Transaction consistency – 2PC – 3PC
![Page 6: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/6.jpg)
6 Pivotal Confidential–Internal Use Only
Distributed Systems
� CAP Theorem
![Page 7: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/7.jpg)
7 Pivotal Confidential–Internal Use Only
Distributed Systems
http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
![Page 8: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/8.jpg)
8 Pivotal Confidential–Internal Use Only
Distributed Systems
Reasons to use � Performance issues – More than 100’000 TPS – More than 4 GB/sec scan rate – More than 100’000 IOPS
� Capacity issues – More than 50TB of data
� DR and Geo-Distribution
![Page 9: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/9.jpg)
9 Pivotal Confidential–Internal Use Only
Agenda
� Distributed Systems
� MPP � Hadoop
� MPP vs Hadoop
� Summary
![Page 10: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/10.jpg)
10 Pivotal Confidential–Internal Use Only
MPP
Main principles � Shared Nothing
� Data Sharding
� Data Replication
� Distributed Transactions
� Parallel Processing
![Page 11: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/11.jpg)
11 Pivotal Confidential–Internal Use Only
MPP
![Page 12: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/12.jpg)
12 Pivotal Confidential–Internal Use Only
MPP
Works well for � Relational data
� Batch processing
� Ad hoc analytical SQL
� Low concurrency
� Applications requiring ANSI SQL
![Page 13: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/13.jpg)
13 Pivotal Confidential–Internal Use Only
MPP
Not the best choice for � Non-relational data
� OLTP and event stream processing
� High concurrency
� 100+ server clusters
� Non-analytical use cases
� Geo-Distributed use cases
![Page 14: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/14.jpg)
14 Pivotal Confidential–Internal Use Only
Agenda
� Distributed Systems
� MPP
� Hadoop � MPP vs Hadoop
� Summary
![Page 15: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/15.jpg)
15 Pivotal Confidential–Internal Use Only
Hadoop
Main Components � HDFS
� YARN
� MapReduce
� HBase
� Hive / Hive+Tez
![Page 16: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/16.jpg)
16 Pivotal Confidential–Internal Use Only
Hadoop
HDFS � Distributed filesystem
� Block-level storage with big blocks
� Non-updatable
� Synchronous block replication
� No built-in Geo-Distribution support
� No built-in DR solution
![Page 17: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/17.jpg)
17 Pivotal Confidential–Internal Use Only
Hadoop
HDFS
![Page 18: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/18.jpg)
18 Pivotal Confidential–Internal Use Only
Hadoop
YARN � Cluster resource manager
� Manages CPU and RAM allocation
� Schedulers are pluggable
� Can handle different resource pools
� Supports both MR and non-MR workload
![Page 19: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/19.jpg)
19 Pivotal Confidential–Internal Use Only
Hadoop
YARN
![Page 20: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/20.jpg)
20 Pivotal Confidential–Internal Use Only
Hadoop
MapReduce � Framework for distributed data processing
� Two main operations: map and reduce
� Data hits disk after “map” and before “reduce”
� Scales to thousands of servers
� Can process petabytes of data
� Extremely reliable
![Page 21: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/21.jpg)
21 Pivotal Confidential–Internal Use Only
Hadoop
MapReduce
![Page 22: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/22.jpg)
22 Pivotal Confidential–Internal Use Only
Hadoop
HBase � Distributed key-value store
� Data is sharded by key
� Data is stored in sorted order
� Stores multiple versions of the row
� Easily scales
![Page 23: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/23.jpg)
23 Pivotal Confidential–Internal Use Only
Hadoop
HBase
![Page 24: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/24.jpg)
24 Pivotal Confidential–Internal Use Only
Hadoop
Hive � Query engine with SQL-like syntax
� Translates HiveQL query to MR / Tez / Spark job
� Processes HDFS data
� Supports UDFs and UDAFs
![Page 25: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/25.jpg)
25 Pivotal Confidential–Internal Use Only
Hadoop
Hive
![Page 26: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/26.jpg)
26 Pivotal Confidential–Internal Use Only
Hadoop
Works well for � Write Once Read Many
� 100+ server clusters
� Both relational and non-relational data
� High concurrency
� Batch processing and analytical workload
� Elastic scalability
![Page 27: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/27.jpg)
27 Pivotal Confidential–Internal Use Only
Hadoop
Not the best choice for � Write-heavy workloads
� Small clusters
� Analytical DWH cases
� OLTP and event stream processing
� Cost savings
![Page 28: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/28.jpg)
28 Pivotal Confidential–Internal Use Only
Agenda
� Distributed Systems
� MPP
� Hadoop
� MPP vs Hadoop � Summary
![Page 29: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/29.jpg)
29 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business MPP Hadoop
Platform Openness Mostly Closed Open
![Page 30: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/30.jpg)
30 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business MPP Hadoop
Platform Openness Mostly Closed Open Hardware Options Mostly Appliances Commodity
![Page 31: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/31.jpg)
31 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business MPP Hadoop
Platform Openness Mostly Closed Open Hardware Options Mostly Appliances Commodity Vendor Lock-in Typical Not Common
![Page 32: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/32.jpg)
32 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business MPP Hadoop
Platform Openness Mostly Closed Open Hardware Options Mostly Appliances Commodity Vendor Lock-in Typical Not Common Technology Price $200K – $10M $50K – $500K
![Page 33: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/33.jpg)
33 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business MPP Hadoop
Platform Openness Mostly Closed Open Hardware Options Mostly Appliances Commodity Vendor Lock-in Typical Not Common Technology Price $200K – $10M $50K – $500K Implementation Cost Moderate High
![Page 34: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/34.jpg)
34 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business MPP Hadoop
Platform Openness Mostly Closed Open Hardware Options Mostly Appliances Commodity Vendor Lock-in Typical Not Common Technology Price $200K – $10M $50K – $500K Implementation Cost Moderate High Extensibility Vendor-provided APIs Open Source
![Page 35: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/35.jpg)
35 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business MPP Hadoop
Platform Openness Mostly Closed Open Hardware Options Mostly Appliances Commodity Vendor Lock-in Typical Not Common Technology Price $200K – $10M $50K – $500K Implementation Cost Moderate High Extensibility Vendor-provided APIs Open Source Supportability Easy Complex
![Page 36: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/36.jpg)
36 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business MPP Hadoop
Platform Openness Mostly Closed Open Hardware Options Mostly Appliances Commodity Vendor Lock-in Typical Not Common Technology Price $200K – $10M $50K – $500K Implementation Cost Moderate High Extensibility Vendor-provided APIs Open Source Supportability Easy Complex Scalability Up to 100 servers Up to 5000 servers
![Page 37: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/37.jpg)
37 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business MPP Hadoop
Platform Openness Mostly Closed Open Hardware Options Mostly Appliances Commodity Vendor Lock-in Typical Not Common Technology Price $200K – $10M $50K – $500K Implementation Cost Moderate High Extensibility Vendor-provided APIs Open Source Supportability Easy Complex Scalability Up to 100 servers Up to 5000 servers Scalability Up to 100-300 TB Up to 100 PB
![Page 38: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/38.jpg)
38 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business MPP Hadoop
Platform Openness Mostly Closed Open Hardware Options Mostly Appliances Commodity Vendor Lock-in Typical Not Common Technology Price $200K – $10M $50K – $500K Implementation Cost Moderate High Extensibility Vendor-provided APIs Open Source Supportability Easy Complex Scalability Up to 100 servers Up to 5000 servers Scalability Up to 100-300 TB Up to 100 PB Target Systems DWH Purpose-Built Batch
![Page 39: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/39.jpg)
39 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Business MPP Hadoop
Platform Openness Mostly Closed Open Hardware Options Mostly Appliances Commodity Vendor Lock-in Typical Not Common Technology Price $200K – $10M $50K – $500K Implementation Cost Moderate High Extensibility Vendor-provided APIs Open Source Supportability Easy Complex Scalability Up to 100 servers Up to 5000 servers Scalability Up to 100-300 TB Up to 100 PB Target Systems DWH Purpose-Built Batch Target End Users Business Analysts Developers
![Page 40: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/40.jpg)
40 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect MPP Hadoop
Query Optimization Good Poor to None
![Page 41: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/41.jpg)
41 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect MPP Hadoop
Query Optimization Good Poor to None Debugging Easy Very Hard
![Page 42: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/42.jpg)
42 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect MPP Hadoop
Query Optimization Good Poor to None Debugging Easy Very Hard Accessibility SQL Mainly Java
![Page 43: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/43.jpg)
43 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect MPP Hadoop
Query Optimization Good Poor to None Debugging Easy Very Hard Accessibility SQL Mainly Java DBA Skill Level Low High
![Page 44: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/44.jpg)
44 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect MPP Hadoop
Query Optimization Good Poor to None Debugging Easy Very Hard Accessibility SQL Mainly Java DBA Skill Level Low High Single Job Redundancy Low High
![Page 45: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/45.jpg)
45 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect MPP Hadoop
Query Optimization Good Poor to None Debugging Easy Very Hard Accessibility SQL Mainly Java DBA Skill Level Low High Single Job Redundancy Low High Query Latency 10-20 ms 10-20 sec
![Page 46: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/46.jpg)
46 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect MPP Hadoop
Query Optimization Good Poor to None Debugging Easy Very Hard Accessibility SQL Mainly Java DBA Skill Level Low High Single Job Redundancy Low High Query Latency 10-20 ms 10-20 sec Query Runtime 5-7 sec 10-15 mins
![Page 47: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/47.jpg)
47 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect MPP Hadoop
Query Optimization Good Poor to None Debugging Easy Very Hard Accessibility SQL Mainly Java DBA Skill Level Low High Single Job Redundancy Low High Query Latency 10-20 ms 10-20 sec Query Runtime 5-7 sec 10-15 mins Query Max Runtime 1-2 hours 1-2 weeks
![Page 48: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/48.jpg)
48 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect MPP Hadoop
Query Optimization Good Poor to None Debugging Easy Very Hard Accessibility SQL Mainly Java DBA Skill Level Low High Single Job Redundancy Low High Query Latency 10-20 ms 10-20 sec Query Runtime 5-7 sec 10-15 mins Query Max Runtime 1-2 hours 1-2 weeks Min Collection Size Megabytes Gigabytes
![Page 49: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/49.jpg)
49 Pivotal Confidential–Internal Use Only
MPP vs Hadoop for Architect MPP Hadoop
Query Optimization Good Poor to None Debugging Easy Very Hard Accessibility SQL Mainly Java DBA Skill Level Low High Single Job Redundancy Low High Query Latency 10-20 ms 10-20 sec Query Runtime 5-7 sec 10-15 mins Query Max Runtime 1-2 hours 1-2 weeks Min Collection Size Megabytes Gigabytes Max Concurrency 10-15 queries 70-100 jobs
![Page 50: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/50.jpg)
50 Pivotal Confidential–Internal Use Only
Agenda
� Distributed Systems
� MPP
� Hadoop
� MPP vs Hadoop
� Examples
� Summary
![Page 51: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/51.jpg)
51 Pivotal Confidential–Internal Use Only
Summary
Use MPP for � Analytical DWH
� Ad hoc analyst SQL queries and BI
� Keep under 100TB of data
Use Hadoop for
� Specialized data processing systems
� Over 100TB of data
![Page 52: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/52.jpg)
52 Pivotal Confidential–Internal Use Only 52 Pivotal Confidential–Internal Use Only
Questions?
![Page 53: MPP vs Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052405/586e84dc1a28aba0038b61b7/html5/thumbnails/53.jpg)
BUILT FOR THE SPEED OF BUSINESS