Post on 26-Dec-2014
description
© 2014 EXASOL AG
1
SQL and high performance compu3ng on Hadoop
Graham Mossman, Senior Solu;on Engineer, EXASOL
© 2014 EXASOL AG
2
I Love My Lawnmower ...
... because it cuts my grass well
© 2014 EXASOL AG
3
But ...
... it‘s quite a struggle cuBng my hedge
© 2014 EXASOL AG
4
And ...
... it isn‘t good at making apple sauce
© 2014 EXASOL AG
5
And don‘t even thinking about...
... using it to cut hair
© 2014 EXASOL AG
6
Hadoop today is …
§ S;ll Open Source ! § Began with HDFS and Map/Reduce § Now comprises a number of addi;onal technologies
§ File systems § (e.g. Tachyon)
§ Cluster Managers § (e.g. YARN + Mesos)
§ Execu;on Engines § (e.g. Tez, Spark etc.)
§ Analy;cal Layer and Applica;ons § (e.g. Hive, Pig, various SQL on Hadoop)
© 2014 EXASOL AG
7
Hadoop With Everything
§ Hadoop was invented to more easily distribute the Nutch and Lucene applica;ons across a cluster of machines. § Map/Reduce – distributed processing § HDFS – distributed file system
§ Began to be used for …. just about everything. § But not all processing tasks are like indexing the Internet § Hadoop started to acract cri;cism
§ But usually when it was being used for something it wasn’t designed for
© 2014 EXASOL AG
8
Definitely NOT jobs for Hadoop
§ Word processing
§ Payroll system
§ Anything on a single computer
§ Anything with “small” data
© 2014 EXASOL AG
9
Analy3cal Queries
§ “GROUP BY“ logic
§ i.e. not concerned with individual data items
§ Analy;cal Func;ons § MAX, MEDIAN, MIN, SUM, COUNT, STANDARD DEVIATION …
§ Table joins, nested sub-‐queries Usually short-‐running, ad-‐hoc and submiced many at a ;me.
© 2014 EXASOL AG
10
Map/Reduce and HDFS : the wrong tools for Analy3cs ?
§ Queries tend to be short : fault tolerance is less important § If chance of failure in a 5 hour batch is 1 in 300 § Chance of failure in a 5 second query is 1 in 1,000,000
§ Queries tend to be short : start-‐up ;me is significant § a 20 second start-‐up ;me is NOT OK on a 5 second query
§ A number of projects started to address these issues § e.g. “Hot containers” in Hive on Tez to reduce start-‐up ;me
© 2014 EXASOL AG
11
Example taken from Reynold Xin’s 2012 “Shark: Hive (SQL) on Spark” presentation
Map/Reduce: the wrong language for Analy3cs ?
Stage 0: Map-Shuffle-ReduceMapper(row) { fields = row.split("\t") emit(fields[0], fields[1]);}Reducer(key, values) { sum = 0; for (value in values) { sum += value; } emit(key, sum);}
Stage 1: Map-ShuffleMapper(row) { ... emit(page_views, page_name);}... shuffleStage 2: Localdata = open("stage1.out")for (i in 0 to 10) { print(data.getNext())}
© 2014 EXASOL AG
12
Equivalent in SQL
SELECT page_name, SUM(page_views) views
FROM wikistats GROUP BY page_nameORDER BY views DESC LIMIT 10;
© 2014 EXASOL AG
13
The SQL language
§ Portable § Well-‐defined standards exist § No detailed knowledge of the plaporm required
§ e.g. you don’t need to manage memory § SQL is assumed by a lot of repor;ng tools § Widely used and understood even by non-‐technical people
© 2014 EXASOL AG
14
I‘m not saying that SQL is perfect
• Try writing the simple Hadoop “Word Count” example in pure SQL
• Or try to “sessionise” weblog data
• Or anything with data that is not structured• “Which part of STRUCTURED Query Language
don’t you understand …?!”
• All I’m saying is that is an excellent language for analytical queries.
© 2014 EXASOL AG
15
Hadoop could handle SQL (via Hive), but historically …
§ High Latency § Restricted SQL op;ons § All but simple table joins were difficult § Licle support for compression & indexing § Merv Adrian (Gartner Research -‐ 2014)
§ “What is remarkable is that Hadoop does SQL. Just don’t expect it to do it well”
§ Result : EVERYTHING looked good compared to Hive
© 2014 EXASOL AG
16
Everyone s3ll likes to compare themselves to Hive
© 2014 EXASOL AG
17
EXASOL being no excep3on !
© 2014 EXASOL AG
18
Hive con3nues to be improved … § Completed § Views (HIVE-‐1143)
§ Par;;oned Views (HIVE-‐1941)
§ Storage Handlers (HIVE-‐705)
§ HBase Integra;on § HBase Bulk Load
§ Locking (HIVE-‐1293)
§ Indexes (HIVE-‐417)
§ Bitmap Indexes (HIVE-‐1803)
§ Filter Pushdown (HIVE-‐279) § Table-‐level Sta;s;cs (HIVE-‐1361)
§ Dynamic Par;;ons
§ Binary Data Type (HIVE-‐2380)
§ Decimal Precision and Scale Support § HCatalog
§ HiveServer2 (HIVE-‐2935)
§ Column Sta;s;cs in Hive (HIVE-‐1362)
§ List Bucke;ng (HIVE-‐3026)
§ Group By With Rollup (HIVE-‐2397) § Enhanced Aggrega;on, Cube, Grouping
and Rollup (HIVE-‐3433)
§ Op;mizing Skewed Joins (HIVE-‐3086)
§ Correla;on Op;mizer (HIVE-‐2206)
§ Hive on Tez (HIVE-‐4660) § Vectorized Query
Execu;on (HIVE-‐4160)
§ In Progress § Atomic Insert/Update/Delete
(HIVE-‐5317)
§ Transac;on Manager (HIVE-‐5843)
§ Cost Based Op;mizer in Hive (HIVE-‐5775)
§ Proposed § Spa;al Queries
§ Theta Join (HIVE-‐556)
§ JDBC Storage Handler
§ MapJoin Op;miza;on § Proposal to standardize and expand
Authoriza;on in Hive
§ Dependent Tables (HIVE-‐3466)
§ AccessServer
§ Type Qualifiers in Hive § MapJoin & Par;;on Pruning
(HIVE-‐5119)
§ SQL Standard based secure authoriza;on (HIVE-‐5837)
§ Updatable Views (HIVE-‐1143) § Hive on Spark (HIVE-‐7292)
© 2014 EXASOL AG
19
The dream data architecture for analy3cs …
§ Based on the SQL language § but leverages Hadoop’s extreme scalability § and Hadoop’s fault tolerance § while not compromising on speed.
Could it please also have some maturity ? And be easy to use ?
© 2014 EXASOL AG
20
The current reality
§ SQL on SQL, which is arguably § Less scalable § Less fault tolerant § Less good with unstructured data
§ SQL on Hadoop, which is arguably § Less mature § Less easy to use § Slower
© 2014 EXASOL AG
21
Choices for SQL and Hadoop
§ SQL AND HADOOP § A Connector
§ HADOOP ON SQL § User Defined Func;ons
§ SQL ON HADOOP § Something like Hive, but becer
© 2014 EXASOL AG
22
Op3on 1 – SQL AND HADOOP
Run SQL-‐on-‐SQL and Hadoop-‐on-‐Hadoop and use a connector to join the two systems
Pros § Minimal impact (SQL and Hadoop worlds can func;on as before) § Easier to implement
Cons § Network ! § Challenge of op;mising across two technologies
© 2014 EXASOL AG
23
Op3on 2 – HADOOP ON SQL
§ Bring Map/Reduce into the Parallel database § For example using Java User Defined Func;ons
select my_java_map_func1on(words) a_word, count(*) word_count from DOCUMENTS group by 1
§ Doesn’t benefit from Hadoop’s storage advantages
© 2014 EXASOL AG
24
Op3on 3 -‐ SQL ON HADOOP
Build a rela;onal database on Hadoop storage § Impala (Cloudera) § S;nger (Hortonworks) § Presto (Facebook) § SparkSQL (UC Berkeley) § HAWQ (Pivotal) § BigSQL (IBM) § Apache Phoenix (for HBase) § Apache Tajo § Apache Drill § etc etc etc …. AND DON‘T FORGET HIVE !
© 2014 EXASOL AG
25
Four possible market outcomes…
§ Hadoop and SQL databases are on a collision course – only one will survive § No sign of that so far
§ They are complementary – both will survive § Probably -‐ the challenge is how to make them work together
§ They will merge and become one § Some indica;ons this is already star;ng to happen
§ Something even more amazing will come along and replace them both § Some;mes this happens – Spark ?
© 2014 EXASOL AG
26
My Personal Opinionated Opinion
Becer to use a tool that has been made for the job A purpose-‐built tool will always beat one made originally for another purpose.
© 2014 EXASOL AG
27
Ques3ons ?
My contact details : Email : graham.mossman@exasol.com Twicer : @EXADude