HDInsight Essentials: Quick View

HDInsight Essentials ISBN : 1849695369 / ISBN 13 : 9781849695367 Rajesh Nadipalli 05/01/2014



Goals of this Book
• Focus on Microsoft's new Hadoop distribution
• Serve as a Quick Reference
• Provide an Overview of Hadoop
• Address both cloud and on-premise setup for HDInsight
• Highlight HDInsight differentiators
• Provide Practical & Real-world examples

Book Table of Contents
• Chapter 1: HDInsight in a Heartbeat
• Chapter 2: Deploying HDInsight on premise
• Chapter 3: HDInsight Azure cloud service
• Chapter 4: Administer your cluster
• Chapter 5: Ingest data to your cluster
• Chapter 6: Transform data in your cluster
• Chapter 7: Analyze & Report data from cluster
• Chapter 8: Project Planning & Architectural Considerations

CHAPTER  1  HIGHLIGHTS:    HDINSIGHT  IN  A  HEARTBEAT  

Big  Data  Problem  Characteristics    

Hadoop Overview
• Self-healing distributed storage
• Fault-tolerant distributed computing
• Abstraction for parallel processing

CORE HADOOP COMPONENTS
• HDFS: Distributed storage that is replicated, self-healing and scalable
• MapReduce: Parallel processing that works on local data for efficiency
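As a rough illustration of what "replicated, self-healing" means in practice, the sketch below models block placement in plain Python. It is a toy model and not HDFS code: the `ToyCluster` class, the node names, and the least-loaded placement policy are all hypothetical. Only the replication factor of 3 mirrors the HDFS default.

```python
class ToyCluster:
    """Toy model of replicated block storage (hypothetical, not the HDFS API)."""

    def __init__(self, nodes, replication=3):
        self.replication = replication
        self.blocks = {node: set() for node in nodes}  # node -> block ids held

    def _holders(self, block):
        return [n for n, held in self.blocks.items() if block in held]

    def _replicate(self, block):
        # Copy the block to the least-loaded nodes until the target is met.
        missing = self.replication - len(self._holders(block))
        candidates = sorted(
            (n for n in self.blocks if block not in self.blocks[n]),
            key=lambda n: (len(self.blocks[n]), n),
        )
        for node in candidates[:max(missing, 0)]:
            self.blocks[node].add(block)

    def store(self, block):
        self._replicate(block)

    def fail(self, node):
        # Losing a node triggers re-replication of every block it held.
        lost = self.blocks.pop(node)
        for block in lost:
            self._replicate(block)

    def replica_count(self, block):
        return len(self._holders(block))


cluster = ToyCluster(["dn1", "dn2", "dn3", "dn4"])
cluster.store("blk_1")
cluster.fail("dn1")  # dn1 held a copy, so a new copy is made elsewhere
print(cluster.replica_count("blk_1"))  # prints: 3
```

After the simulated node failure the block is back at three copies, which is the "self-healing" behavior the slide refers to, reduced to its simplest form.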

 

Hadoop Nodes Layout (figure)
• Master Node: NameNode, Secondary NameNode, JobTracker
• Slave Nodes: one DataNode and one TaskTracker each
• MapReduce Layer: JobTracker, TaskTrackers
• Distributed File System Layer: NameNode, Secondary NameNode, DataNodes

Hadoop Eco System (figure)
• Data Sources: RDBMS databases; audio, images; log files; sensors, RFID; social media, feeds
• Hadoop Data Store: HDFS; HBase (NoSQL DB)
• Data Processing: MapReduce
• Data Access: Hive; Pig; Mahout (machine learning)
• Flume, Sqoop
• Excel
• Business Data Feeds
• ZooKeeper (distributed process management)
• HCatalog (metadata on Pig, Hive, MapReduce)
• Oozie (workflow, scheduler)
• Infrastructure, Operations (monitoring, configuration)

End to End Solution on Hadoop
• Collect & Import to HDFS
• Process (MapReduce)
• Analyze (BI Tools)
• Report & Publish

Popular Hadoop Distributions
• Amazon Elastic MapReduce (cloud, http://aws.amazon.com/elasticmapreduce/)
• Cloudera (http://www.cloudera.com/content/cloudera/en/home.html)
• EMC Pivotal HD (http://gopivotal.com/)
• Hortonworks HDP (http://hortonworks.com/)
• MapR (http://mapr.com/)
• Microsoft HDInsight (cloud, http://www.windowsazure.com/)

HDInsight Differentiators
• Enterprise-ready Hadoop backed by Microsoft
• Analytics using Excel
• Integration with Active Directory
• Integration with .NET and JavaScript
• Connectors to RDBMS
• Scale using cloud offering: the Azure HDInsight service enables customers to scale quickly and offers a seamless interface between HDFS and Azure Storage Vault
• JavaScript Console

WordCount  in  HDInsight  
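The transcript keeps only the title of this slide, so the book's own HDInsight WordCount example is not reproduced here. As a stand-in, the sketch below simulates the map, shuffle, and reduce phases of a word count locally in plain Python. The function names and the sample lines are illustrative assumptions, and nothing here calls a Hadoop or HDInsight API.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())


# Hypothetical input lines, for illustration only.
result = word_count(["big data big cluster", "data in the cluster"])
print(result["big"], result["data"], result["cluster"])  # prints: 2 2 2
```

On a real cluster the map and reduce functions would run on different nodes and the shuffle would be handled by the framework; the local version only shows how the three phases fit together.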

CHAPTER  2  HIGHLIGHTS:    HDINSIGHT  INSTALL  ON  PREMISE  

HDInsight Distribution
• Apache Hadoop: Open Source Software; Community Development
• Hortonworks Data Platform (HDP): Enterprise Hadoop Platform; Leaders in Hadoop; Code committers to Hadoop
• Microsoft HDInsight: Built on top of HDP; Integration with ASV, Excel, Power View, SQL Server, Active Directory

Physical Install Options
• Single node for development/test
• Multi node for production
Figure labels: NN (NameNode), SNN (Secondary NameNode), JT (JobTracker); DN (DataNode) / TT (TaskTracker)

Multi Node Install Steps
• Pre-requisites
• Networking Setup
• Remote Scripting
• Firewall Setup
• Software Install (each node)
• Hadoop Configuration
• Verification

CHAPTER  3  HIGHLIGHTS:    HDINSIGHT  AZURE  SERVICE  

Azure  Cloud  Service  

Create  Storage  

Create  HDInsight  cluster  

CHAPTER  4  HIGHLIGHTS:    ADMINISTER  YOUR  CLUSTER  

HDInsight  Cluster  Management  

HDInsight  Dashboard  


NameNode  Status  

JobTracker Status

CHAPTER  5  HIGHLIGHTS:    INGEST  DATA  INTO  YOUR  CLUSTER  

Loading Data into your Cluster
You have the following options:
• Loading data using Hadoop commands
• Loading data using Azure Storage Vault
• Loading data using Interactive JavaScript
• Shipping data to your Cluster
• Loading data from RDBMS via Sqoop
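Two of the options above, Hadoop commands and Sqoop, are command-line driven. The sketch below builds one command for each in Python and prints it for review. It relies on the standard `hadoop fs -put` command and the `sqoop import` flags `--connect`, `--table`, and `--target-dir`; the file name, HDFS paths, table name, and JDBC connection string are hypothetical placeholders, not values from the book.

```python
import shlex


def hdfs_put_command(local_path, hdfs_dir):
    """Build a `hadoop fs -put` command that copies a local file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]


def sqoop_import_command(jdbc_url, table, target_dir):
    """Build a `sqoop import` command that pulls one RDBMS table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
    ]


# Hypothetical paths, table, and connection string, for illustration only.
put_cmd = hdfs_put_command("sales_2013.csv", "/user/demo/sales")
sqoop_cmd = sqoop_import_command(
    "jdbc:sqlserver://dbhost:1433;databaseName=retail",
    "orders",
    "/user/demo/orders",
)

# Print the commands for review instead of running them here.
print(shlex.join(put_cmd))
print(shlex.join(sqoop_cmd))
```

On a cluster node with the tools installed, each list could be passed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a single string avoids shell quoting problems with the semicolon in the JDBC URL.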

Loading  via  Azure  Storage  Explorer  

CHAPTER  6  HIGHLIGHTS:    TRANSFORM  YOUR  DATA  

Transforming Data
You have the following options:
• MapReduce
• Hive
• Pig
• Others

Processing Data in Cluster (figure)
• Map tasks: Map for Jan2012, Map for Feb2012, Map for Apr2013, …
• One Reducer
• HDFS
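The figure's shape, several map tasks split by month feeding a single reducer, can be sketched locally in Python. This is an assumption-laden illustration: the slide gives no data or job logic, so the records, the amounts, and the choice to total them per month are all hypothetical. Each map task here pre-aggregates its own month, which is comparable to using a combiner.

```python
def map_month(month, records):
    """Map task for one month: reduce its records to a (month, total) pair."""
    return month, sum(amount for _, amount in records)


def reduce_all(partials):
    """Single reducer: merge every map task's partial result."""
    totals = dict(partials)
    return totals, sum(totals.values())


# Hypothetical monthly inputs of (record id, amount); the slide gives no data.
inputs = {
    "Jan2012": [("a", 10), ("b", 5)],
    "Feb2012": [("c", 7)],
    "Apr2013": [("d", 3), ("e", 4)],
}

partials = [map_month(month, records) for month, records in inputs.items()]
per_month, grand_total = reduce_all(partials)
print(per_month["Jan2012"], grand_total)  # prints: 15 29
```

Because every partial result passes through one reducer, that reducer sees the complete picture and can produce a single global total, at the cost of being a serial step.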

Hive (figure)
• JDBC/ODBC
• Thrift Server
• Command Line
• Web GUI
• Driver (Parser, Planner, Executor)
• Metastore
• MapReduce

Hive or Pig?
• Raw Data in HDFS: Distributed Storage; Reliable
• Data Processing via Pig: Pipelines; Iterative Processing; Research
• Data Warehouse via Hive: BI Tools; Analysis
Figure labels: Data Warehouse, HDFS

CHAPTER  7  HIGHLIGHTS:    ANALYZE  &  REPORT  

Analyze  using  Excel  


CHAPTER  8:    PROJECT  PLANNING  &  ARCHITECTURAL  CONSIDERATIONS  

• Executive & Stakeholder Buy-in
• Discovery & Analysis
• Design
• Implementation
• User Acceptance
• Production Operations
• Feedback, New Requirements