Page 1: Graham Mossman - SQL and high performance computing on Hadoop

© 2014 EXASOL AG

SQL and high performance computing on Hadoop

Graham Mossman, Senior Solution Engineer, EXASOL

Page 2: Graham Mossman - SQL and high performance computing on Hadoop

I Love My Lawnmower ...

... because it cuts my grass well

Page 3: Graham Mossman - SQL and high performance computing on Hadoop

But ...

... it’s quite a struggle cutting my hedge

Page 4: Graham Mossman - SQL and high performance computing on Hadoop

And ...

... it isn’t good at making apple sauce

Page 5: Graham Mossman - SQL and high performance computing on Hadoop

And don’t even think about ...

... using it to cut hair

Page 6: Graham Mossman - SQL and high performance computing on Hadoop

Hadoop today is …

§ Still Open Source!
§ Began with HDFS and Map/Reduce
§ Now comprises a number of additional technologies
  § File systems (e.g. Tachyon)
  § Cluster Managers (e.g. YARN + Mesos)
  § Execution Engines (e.g. Tez, Spark etc.)
  § Analytical Layer and Applications (e.g. Hive, Pig, various SQL on Hadoop)

Page 7: Graham Mossman - SQL and high performance computing on Hadoop

Hadoop With Everything

§ Hadoop was invented to more easily distribute the Nutch and Lucene applications across a cluster of machines.
  § Map/Reduce – distributed processing
  § HDFS – distributed file system
§ Began to be used for … just about everything.
  § But not all processing tasks are like indexing the Internet
§ Hadoop started to attract criticism
  § But usually when it was being used for something it wasn’t designed for

Page 8: Graham Mossman - SQL and high performance computing on Hadoop

Definitely NOT jobs for Hadoop

§ Word processing
§ Payroll system
§ Anything on a single computer
§ Anything with “small” data

Page 9: Graham Mossman - SQL and high performance computing on Hadoop

Analytical Queries

§ “GROUP BY” logic
  § i.e. not concerned with individual data items
§ Analytical Functions
  § MAX, MEDIAN, MIN, SUM, COUNT, STANDARD DEVIATION …
§ Table joins, nested sub-queries

Usually short-running, ad-hoc and submitted many at a time.

Page 10: Graham Mossman - SQL and high performance computing on Hadoop

Map/Reduce and HDFS: the wrong tools for Analytics?

§ Queries tend to be short: fault tolerance is less important
  § If chance of failure in a 5 hour batch is 1 in 300
  § Chance of failure in a 5 second query is 1 in 1,000,000
§ Queries tend to be short: start-up time is significant
  § a 20 second start-up time is NOT OK on a 5 second query
§ A number of projects started to address these issues
  § e.g. “Hot containers” in Hive on Tez to reduce start-up time
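The 1 in 300 versus 1 in 1,000,000 comparison can be checked with a few lines of arithmetic. This is a minimal sketch that assumes failures arrive at a constant rate over time (an assumption made here for illustration, not something the slide states); the function name `failure_probability` is hypothetical.

```python
import math

def failure_probability(p_ref, t_ref_seconds, t_seconds):
    """Scale a failure probability from one job duration to another.

    Assumes a constant failure rate (exponential model). This is an
    illustrative assumption, not a measurement of any real cluster.
    """
    # Failure rate per second implied by the reference job.
    rate = -math.log1p(-p_ref) / t_ref_seconds
    # Probability of at least one failure during the shorter job.
    return -math.expm1(-rate * t_seconds)

batch_seconds = 5 * 60 * 60  # the 5 hour batch from the slide
p_query = failure_probability(1 / 300, batch_seconds, 5)
print(f"about 1 in {round(1 / p_query):,}")
```

Under that assumption the 5 second query comes out at a little over 1 in 1,000,000, which matches the rounded figure on the slide.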

Page 11: Graham Mossman - SQL and high performance computing on Hadoop

Map/Reduce: the wrong language for Analytics?

Example taken from Reynold Xin’s 2012 “Shark: Hive (SQL) on Spark” presentation

Stage 0: Map-Shuffle-Reduce
Mapper(row) {
  fields = row.split("\t")
  emit(fields[0], fields[1]);
}
Reducer(key, values) {
  sum = 0;
  for (value in values) {
    sum += value;
  }
  emit(key, sum);
}

Stage 1: Map-Shuffle
Mapper(row) {
  ...
  emit(page_views, page_name);
}
... shuffle

Stage 2: Local
data = open("stage1.out")
for (i in 0 to 10) {
  print(data.getNext())
}
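For readers who want to run the three stages, here is a single-process Python sketch of the same pipeline. It only imitates the map, shuffle and reduce steps in memory; the function names, the `limit` parameter and the sample rows are invented for illustration and are not taken from the original presentation.

```python
from collections import defaultdict

def stage0_map(row):
    # Split a tab-separated line into (page_name, page_views).
    fields = row.split("\t")
    yield fields[0], int(fields[1])

def stage0_reduce(key, values):
    # Sum all view counts seen for one page.
    yield key, sum(values)

def shuffle(pairs):
    # Group values by key, standing in for the framework's shuffle phase.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def top_pages(rows, limit=10):
    # Stage 0: map, shuffle, reduce.
    mapped = (pair for row in rows for pair in stage0_map(row))
    totals = [out for key, values in shuffle(mapped).items()
              for out in stage0_reduce(key, values)]
    # Stage 1: re-key by view count and sort on it (descending).
    by_views = sorted(((views, name) for name, views in totals), reverse=True)
    # Stage 2: read the first `limit` records locally.
    return [(name, views) for views, name in by_views[:limit]]

rows = ["Hadoop\t3", "SQL\t5", "Hadoop\t4", "Hive\t1"]  # invented sample data
print(top_pages(rows, limit=2))  # [('Hadoop', 7), ('SQL', 5)]
```

Even in this toy form the user has to spell out the grouping, the re-keying and the final read by hand.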

Page 12: Graham Mossman - SQL and high performance computing on Hadoop

Equivalent in SQL

SELECT page_name, SUM(page_views) views
FROM wikistats
GROUP BY page_name
ORDER BY views DESC
LIMIT 10;
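The query can be tried without a cluster. The sketch below runs it against an in-memory SQLite database from Python's standard library; SQLite merely stands in for whichever engine is actually used, and the `page_name` / `page_views` column types and the sample rows are assumptions made for the example.

```python
import sqlite3

def top_pages_sql(rows, limit=10):
    # In-memory database; the schema is assumed for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE wikistats (page_name TEXT, page_views INTEGER)")
    conn.executemany("INSERT INTO wikistats VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT page_name, SUM(page_views) AS views "
        "FROM wikistats GROUP BY page_name "
        "ORDER BY views DESC LIMIT ?",
        (limit,),
    ).fetchall()
    conn.close()
    return result

sample = [("Hadoop", 3), ("SQL", 5), ("Hadoop", 4), ("Hive", 1)]  # invented data
print(top_pages_sql(sample, limit=2))  # [('Hadoop', 7), ('SQL', 5)]
```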

Page 13: Graham Mossman - SQL and high performance computing on Hadoop

The SQL language

§ Portable
§ Well-defined standards exist
§ No detailed knowledge of the platform required
  § e.g. you don’t need to manage memory
§ SQL is assumed by a lot of reporting tools
§ Widely used and understood even by non-technical people

Page 14: Graham Mossman - SQL and high performance computing on Hadoop

I’m not saying that SQL is perfect

• Try writing the simple Hadoop “Word Count” example in pure SQL
• Or try to “sessionise” weblog data
• Or anything with data that is not structured
  • “Which part of STRUCTURED Query Language don’t you understand …?!”
• All I’m saying is that it is an excellent language for analytical queries.
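To give a feel for the “Word Count” challenge, here is one possible pure-SQL attempt using a recursive common table expression, run through Python's built-in `sqlite3` module. It is a sketch only: the `documents` table, its `body` column and the sample texts are invented, words are split on single spaces, and other databases may need different string functions than SQLite's `instr` and `substr`.

```python
import sqlite3

# Peel one word at a time off the front of each document.
WORD_COUNT_SQL = """
WITH RECURSIVE split(word, rest) AS (
    SELECT '', body || ' ' FROM documents
    UNION ALL
    SELECT substr(rest, 1, instr(rest, ' ') - 1),
           substr(rest, instr(rest, ' ') + 1)
    FROM split
    WHERE rest <> ''
)
SELECT word, COUNT(*) AS word_count
FROM split
WHERE word <> ''
GROUP BY word
ORDER BY word_count DESC, word
"""

def word_count(texts):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (body TEXT)")
    conn.executemany("INSERT INTO documents VALUES (?)", [(t,) for t in texts])
    rows = conn.execute(WORD_COUNT_SQL).fetchall()
    conn.close()
    return rows

texts = ["the cat sat", "the dog sat", "the end"]  # invented sample data
print(word_count(texts))
```

It works, but the splitting logic is far less natural than the few lines a mapper needs.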

Page 15: Graham Mossman - SQL and high performance computing on Hadoop

Hadoop could handle SQL (via Hive), but historically …

§ High Latency
§ Restricted SQL options
§ All but simple table joins were difficult
§ Little support for compression & indexing
§ Merv Adrian (Gartner Research - 2014)
  § “What is remarkable is that Hadoop does SQL. Just don’t expect it to do it well”
§ Result: EVERYTHING looked good compared to Hive

Page 16: Graham Mossman - SQL and high performance computing on Hadoop

Everyone still likes to compare themselves to Hive

Page 17: Graham Mossman - SQL and high performance computing on Hadoop

EXASOL being no exception!

Page 18: Graham Mossman - SQL and high performance computing on Hadoop

Hive continues to be improved …

§ Completed
  § Views (HIVE-1143)
  § Partitioned Views (HIVE-1941)
  § Storage Handlers (HIVE-705)
  § HBase Integration
  § HBase Bulk Load
  § Locking (HIVE-1293)
  § Indexes (HIVE-417)
  § Bitmap Indexes (HIVE-1803)
  § Filter Pushdown (HIVE-279)
  § Table-level Statistics (HIVE-1361)
  § Dynamic Partitions
  § Binary Data Type (HIVE-2380)
  § Decimal Precision and Scale Support
  § HCatalog
  § HiveServer2 (HIVE-2935)
  § Column Statistics in Hive (HIVE-1362)
  § List Bucketing (HIVE-3026)
  § Group By With Rollup (HIVE-2397)
  § Enhanced Aggregation, Cube, Grouping and Rollup (HIVE-3433)
  § Optimizing Skewed Joins (HIVE-3086)
  § Correlation Optimizer (HIVE-2206)
  § Hive on Tez (HIVE-4660)
  § Vectorized Query Execution (HIVE-4160)
§ In Progress
  § Atomic Insert/Update/Delete (HIVE-5317)
  § Transaction Manager (HIVE-5843)
  § Cost Based Optimizer in Hive (HIVE-5775)
§ Proposed
  § Spatial Queries
  § Theta Join (HIVE-556)
  § JDBC Storage Handler
  § MapJoin Optimization
  § Proposal to standardize and expand Authorization in Hive
  § Dependent Tables (HIVE-3466)
  § AccessServer
  § Type Qualifiers in Hive
  § MapJoin & Partition Pruning (HIVE-5119)
  § SQL Standard based secure authorization (HIVE-5837)
  § Updatable Views (HIVE-1143)
  § Hive on Spark (HIVE-7292)

Page 19: Graham Mossman - SQL and high performance computing on Hadoop

The dream data architecture for analytics …

§ Based on the SQL language
§ but leverages Hadoop’s extreme scalability
§ and Hadoop’s fault tolerance
§ while not compromising on speed.

Could it please also have some maturity? And be easy to use?

Page 20: Graham Mossman - SQL and high performance computing on Hadoop

The current reality

§ SQL on SQL, which is arguably
  § Less scalable
  § Less fault tolerant
  § Less good with unstructured data
§ SQL on Hadoop, which is arguably
  § Less mature
  § Less easy to use
  § Slower

Page 21: Graham Mossman - SQL and high performance computing on Hadoop

Choices for SQL and Hadoop

§ SQL AND HADOOP
  § A Connector
§ HADOOP ON SQL
  § User Defined Functions
§ SQL ON HADOOP
  § Something like Hive, but better

Page 22: Graham Mossman - SQL and high performance computing on Hadoop

Option 1 – SQL AND HADOOP

Run SQL-on-SQL and Hadoop-on-Hadoop and use a connector to join the two systems

Pros
§ Minimal impact (SQL and Hadoop worlds can function as before)
§ Easier to implement

Cons
§ Network!
§ Challenge of optimising across two technologies

Page 23: Graham Mossman - SQL and high performance computing on Hadoop

Option 2 – HADOOP ON SQL

§ Bring Map/Reduce into the parallel database
§ For example using Java User Defined Functions

select my_java_map_function(words) a_word, count(*) word_count
from DOCUMENTS
group by 1

§ Doesn’t benefit from Hadoop’s storage advantages
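The idea of pushing a user-written map step into the SQL engine can be tried on a small scale. The sketch below is only an analogy: it uses Python and SQLite's `create_function` instead of Java UDFs in a parallel database, the UDF is scalar (one output per input row, so the table holds one token per row), and the names `normalise_word` and `my_map_function` as well as the sample tokens are invented.

```python
import sqlite3

def normalise_word(raw):
    # Hypothetical map step: lower-case a token and strip punctuation.
    return raw.lower().strip(".,;:!?\"'")

def word_counts(tokens):
    conn = sqlite3.connect(":memory:")
    # Register the Python function so SQL can call it by name.
    conn.create_function("my_map_function", 1, normalise_word)
    conn.execute("CREATE TABLE documents (words TEXT)")
    conn.executemany("INSERT INTO documents VALUES (?)", [(t,) for t in tokens])
    rows = conn.execute(
        "SELECT my_map_function(words) AS a_word, COUNT(*) AS word_count "
        "FROM documents GROUP BY 1 ORDER BY word_count DESC, a_word"
    ).fetchall()
    conn.close()
    return rows

tokens = ["Hadoop", "hadoop,", "SQL", "sql", "Hive"]  # invented sample data
print(word_counts(tokens))  # [('hadoop', 2), ('sql', 2), ('hive', 1)]
```

The user code supplies only the per-row logic; grouping and counting stay with the database.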

Page 24: Graham Mossman - SQL and high performance computing on Hadoop

Option 3 – SQL ON HADOOP

Build a relational database on Hadoop storage
§ Impala (Cloudera)
§ Stinger (Hortonworks)
§ Presto (Facebook)
§ SparkSQL (UC Berkeley)
§ HAWQ (Pivotal)
§ BigSQL (IBM)
§ Apache Phoenix (for HBase)
§ Apache Tajo
§ Apache Drill
§ etc etc etc …. AND DON’T FORGET HIVE!

Page 25: Graham Mossman - SQL and high performance computing on Hadoop

Four possible market outcomes…

§ Hadoop and SQL databases are on a collision course – only one will survive
  § No sign of that so far
§ They are complementary – both will survive
  § Probably - the challenge is how to make them work together
§ They will merge and become one
  § Some indications this is already starting to happen
§ Something even more amazing will come along and replace them both
  § Sometimes this happens – Spark?

Page 26: Graham Mossman - SQL and high performance computing on Hadoop

My Personal Opinionated Opinion

Better to use a tool that has been made for the job

A purpose-built tool will always beat one made originally for another purpose.

Page 27: Graham Mossman - SQL and high performance computing on Hadoop

Questions?

My contact details:
Email: [email protected]
Twitter: @EXADude