2014 feb 24_big_datacongress_hadoopsession1_hadoop101

HADOOP 101: AN INTRODUCTION TO HADOOP WITH THE HORTONWORKS SANDBOX Adam Muise – Solution Architect, Hortonworks

description

A hands-on introduction to Hadoop using the Hortonworks Sandbox

Transcript of 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

Page 1: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

HADOOP 101: AN INTRODUCTION TO HADOOP WITH THE HORTONWORKS SANDBOX

Adam Muise – Solution Architect, Hortonworks

Page 2: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

Who are we?

Page 3: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

Who is Hortonworks?

Page 4: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

We do Hadoop

The leaders of Hadoop’s development

Community driven, Enterprise Focused

Drive Innovation in the platform – We lead the roadmap

100% Open Source – Democratized Access to Data

Page 5: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

We do Hadoop successfully.

Support

Professional Services

Training

Page 6: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

Enter the Hadoop.

http://www.fabulouslybroke.com/2011/05/ninja-elephants-and-other-awesome-stories/

Page 7: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

Hadoop was created because traditional technologies never cut it for Internet properties like Google, Yahoo, Facebook, Twitter, and LinkedIn

Page 8: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

Traditional architecture didn’t scale enough…

[Diagram: three separate stacks, each made up of App servers, DB servers, and a SAN]

Page 9: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

Databases can become bloated and useless

Page 10: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

Traditional architectures cost too much at that volume…

$/TB

$pecial Hardware

$upercomputing

Page 11: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

So what is the answer?

Page 12: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

If you could design a system that would handle this, what would it look like?

Page 13: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

It would probably need a highly resilient, self-healing, cost-efficient, distributed file system…

[Diagram: a 3 × 3 grid of Storage nodes]
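To make "resilient" and "self-healing" concrete, here is a minimal Python sketch of the idea: split a file into blocks, keep several copies of each block on different nodes, and re-copy blocks when a node dies. This is a toy round-robin scheme, not HDFS's actual placement policy; the node names, the block size, and the function names are all assumptions made for illustration.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes); a real system would also consider
    racks, free space, and load.
    """
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def re_replicate(placement, failed_node, nodes, replication=3):
    """'Self-heal': after a node fails, copy its blocks onto surviving nodes."""
    live = [n for n in nodes if n != failed_node]
    healed = {}
    for block_id, holders in placement.items():
        survivors = [n for n in holders if n != failed_node]
        candidates = [n for n in live if n not in survivors]
        while len(survivors) < replication and candidates:
            survivors.append(candidates.pop(0))
        healed[block_id] = survivors
    return healed


# Hypothetical nine-node cluster, matching the 3 x 3 grid on the slide.
nodes = [f"node{i}" for i in range(9)]
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), nodes)
healed = re_replicate(placement, "node1", nodes)
```

Losing one node never loses data here because every block still has two other copies, and the healing step brings the count back up to three.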

Page 14: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

It would probably need a completely parallel processing framework that took tasks to the data…

[Diagram: the same grid, now with a Processing unit for each Storage node]
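"Taking tasks to the data" can be sketched in a few lines of Python. Each node counts words in the lines it already holds (the map step), and only the small partial counts are combined (the reduce step). The `cluster` dictionary, the node names, and the sample text are hypothetical, and the map tasks run one after another here rather than in parallel on separate machines.

```python
from collections import Counter

# Hypothetical cluster: each node name maps to the lines of text stored on it.
cluster = {
    "node1": ["hadoop stores data", "hadoop processes data"],
    "node2": ["data lives on many nodes"],
    "node3": ["tasks move to the data"],
}


def map_task(lines):
    """Runs where the lines live; returns that node's partial word counts."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def reduce_task(partials):
    """Merges the partial counts from every node into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Only the small Counter objects travel between nodes, never the raw lines.
partials = [map_task(lines) for lines in cluster.values()]
word_counts = reduce_task(partials)
```

The design point is that the raw data stays put: adding nodes adds both storage and processing capacity at the same time.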

Page 15: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

It would probably run on commodity hardware, virtualized machines, and common OS platforms

[Diagram: the same grid of Storage and Processing nodes]

Page 16: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

It would probably be open source so innovation could happen as quickly as possible

Page 17: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

It would need a critical mass of users

Page 18: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

Apache Hadoop

[Diagram of components: Flume, Ambari, HBase, Falcon, MapReduce, HDFS, Sqoop, HCatalog, Pig, Hive, Storm, YARN, Knox, Tez]

Page 19: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

Hortonworks Data Platform

[Diagram of components: Flume, Ambari, HBase, Falcon, MapReduce, HDFS, Sqoop, HCatalog, Pig, Hive, Storm, YARN, Knox, Tez]

Page 20: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

We are going to learn how to work with Hadoop in less than an hour.

Page 21: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

To do this, we need to install Hadoop, right?

Page 22: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

Nope.  

Page 23: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

Enter the Sandbox.

Page 24: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

The Sandbox is ‘Hadoop in a Can’. It contains one copy of each of the Master and Worker node processes used in a cluster, only in a single virtual node.

[Diagram: the grid of Storage and Processing nodes reduced to a single Processing and Storage pair inside one Linux VM]

Page 25: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

Getting started with the Sandbox VM:
- Pick your flavor of VM at http://www.hortonworks.com/sandbox
- Start the Sandbox VM
- Find the IP address it displays
- Go to that address in a browser, e.g. http://172.16.130.131
- Register
- Click on ‘Start Tutorials’
- On the left-hand nav, click on ‘HCatalog, Basic Pig & Hive Commands’

Page 26: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

In this tutorial we will:
- Land files in HDFS
- Assign metadata with HCatalog
- Use SQL with Hive
- Learn to process data with Pig
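The flow of those four steps can be previewed without a cluster. The sketch below is a stand-in only: it uses a local temp directory in place of HDFS, a plain dictionary in place of HCatalog, Python's built-in `sqlite3` in place of Hive, and a set comprehension in place of a Pig script. None of it is the Hadoop API, and the file name, column names, and sample rows are made up for illustration.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Step 1 (stand-in for landing a file in HDFS): drop a raw CSV file in a directory.
landing_dir = Path(tempfile.mkdtemp())
raw_file = landing_dir / "events.csv"
raw_file.write_text("alice,login\nbob,login\nalice,purchase\n")

# Step 2 (stand-in for HCatalog): describe the columns once, apart from the data.
schema = {"table": "events", "columns": ["user_name", "action"]}

# Step 3 (stand-in for Hive): load the file under that schema and query it with SQL.
conn = sqlite3.connect(":memory:")
column_defs = ", ".join(f"{name} TEXT" for name in schema["columns"])
conn.execute(f"CREATE TABLE {schema['table']} ({column_defs})")
with raw_file.open(newline="") as f:
    conn.executemany(f"INSERT INTO {schema['table']} VALUES (?, ?)", csv.reader(f))
rows = conn.execute(
    "SELECT user_name, COUNT(*) FROM events GROUP BY user_name ORDER BY user_name"
).fetchall()

# Step 4 (stand-in for Pig): a step-by-step data flow over the same raw file:
# load the records, filter them, then keep the distinct users.
with raw_file.open(newline="") as f:
    purchasers = sorted({user for user, action in csv.reader(f) if action == "purchase"})
```

The point to notice is that the raw file is written once and never changed; the schema and the queries are layered on top of it afterwards.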

Page 27: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

Try the other tutorials.

Page 28: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

Hadoop is the new Modern Data Architecture for the Enterprise

Page 29: 2014 feb 24_big_datacongress_hadoopsession1_hadoop101

© Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION

There is NO second place

Hortonworks …the Bull Elephant of Hadoop Innovation