Introduction to Microsoft Azure HDInsight by Dattatrey Sindhol


Introduction to Microsoft Azure HDInsight

Dattatrey Sindhol

2

Agenda

Introduction

Hadoop Distributions

Microsoft Azure HDInsight

Microsoft BI and Data Platform

HDInsight - Use Cases

HDInsight - Typical Implementation

Further Learning

3

Introduction

What is Big Data?

“Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”

4

Introduction

What is Hadoop?

Hadoop is an open source framework from the Apache Software Foundation, capable of processing very large volumes of heterogeneous data sets in a distributed fashion across clusters of commodity computers and hardware using a simplified programming model.
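The "simplified programming model" mentioned here is MapReduce. As a rough illustration only, the following single-process Python sketch mimics the map, shuffle, and reduce phases for a word count. It does not use Hadoop itself, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is the challenge", "hadoop is the solution"]
    print(word_count(sample))
```

On a real cluster the mapper and reducer run on many machines in parallel and the framework performs the shuffle; only the two small functions need to be written by the developer.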

5

Introduction

Conclusion

In simple terms, Big Data is the Challenge and Hadoop is the Solution.

6

Hadoop Distributions

• Amazon Elastic MapReduce (EMR)
• Cloudera
• Hortonworks
• IBM InfoSphere BigInsights
• MapR
• Pivotal
• Teradata
• Intel
• Azure HDInsight

Reference: How the 9 Leading Commercial Hadoop Distributions Stack Up

7

Which Distribution Should I Use?

Cost

Scalability

Availability

Existing Technology Stack

Existing Infrastructure

Existing Skillset

8

HDInsight - Overview

• Microsoft’s Hadoop Distribution in the Cloud
• Offers Hadoop on Windows Platform
• Based on Hortonworks Data Platform (HDP)
• Tightly integrated with Microsoft Technology Stack

9

HDInsight - Architecture

10

Microsoft Data Platform and Enterprise BI Ecosystem

11

Why HDInsight?

• Microsoft Stack
• Runs on Windows
• Create & Destroy On-Demand
• DFS Implementation in Blob Storage
• Store data on Blob Storage for Later Use
• Automation using PowerShell
• Orchestration/Workflow using SSIS
• Scheduling using SQL Agent
• BI & Analytics with Power BI
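To make the "DFS Implementation in Blob Storage" point concrete: HDInsight jobs address Blob Storage through the wasb:// URI scheme, in the form wasb://<container>@<account>.blob.core.windows.net/<path>. The small Python helper below is only a sketch that builds such a URI; the function name wasb_uri and the container and account names are hypothetical.

```python
def wasb_uri(container, account, path="", secure=False):
    """Build a wasb:// (or wasbs:// for SSL) URI for a blob path.

    The container and account values are placeholders supplied by the
    caller; nothing here contacts Azure.
    """
    scheme = "wasbs" if secure else "wasb"
    clean_path = path.lstrip("/")
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{clean_path}"


if __name__ == "__main__":
    # Hypothetical container "logs" in hypothetical account "mystorage".
    print(wasb_uri("logs", "mystorage", "/raw/2014/"))
```

Because the data is addressed by storage account rather than by cluster, the same URI stays valid after a cluster is dropped and a new one is created against the same storage account.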

12

Considerations

• Requires dropping and re-creating the cluster to scale up/down
• Storage and Cluster should be in the same Data Center

13

HDInsight Versions

COMPONENT                        VERSION 1.6  VERSION 2.1       VERSION 3.0       VERSION 3.1 (Current/Default)
Hortonworks Data Platform (HDP)  1.1          1.3               2.0               2.1.7
Apache Hadoop & YARN             1.0.3        1.2.0             2.2.0             2.4.0
Tez                              -            -                 -                 0.4.0
Apache Pig                       0.9.3        0.11.0            0.12.0            0.12.1
Apache Hive & HCatalog           0.9.0        0.11.0            0.12.0            0.13.1
HBase                            -            -                 -                 0.98.0
Apache Sqoop                     1.4.2        1.4.3             1.4.4             1.4.4
Apache Oozie                     3.2.0        3.3.2             4.0.0             4.0.0
Apache HCatalog                  0.4.1        Merged with Hive  Merged with Hive  Merged with Hive
Apache Templeton                 0.1.4        Merged with Hive  Merged with Hive  Merged with Hive
Ambari                           -            API v1.0          1.4.1             >=1.5.1
Zookeeper                        -            -                 3.4.5             3.4.5
Storm                            -            -                 -                 0.9.1
Mahout                           -            -                 -                 0.9.0
Phoenix                          -            -                 -                 4.0.0.2.1.7.0-2162

14

HDInsight Use Case - Iterative Exploration

15

HDInsight Use Case - Data Warehouse on Demand

16

HDInsight Use Case - ETL Automation

17

HDInsight Use Case - BI Integration

18

Typical Implementation

Diagram labels:
• Transactional, Social, Warehouse, Web Logs, Clickstream, Files (TXT, XML, JSON, ..)
• Azure Blob Storage
• Multi-Node HDInsight Cluster: MapReduce • Hive • Java
• Reporting and Analytics: SSRS • Excel • Power BI
• Collaboration: Office 365 / SharePoint

19

Typical Implementation (Contd…)

Diagram labels:
• E-Commerce, Internal Systems, OLTP, Transactional, Warehouse, Web Logs, Social
• Customers, Internal Systems, Team
• Sqoop or AzCopy
• Azure Blob Storage
• Hive Metastore
• MapReduce, Hive
• Multi-Node HDInsight Cluster: MapReduce • Hive • Pig • Java • Python
• Collaboration, Reporting, and Analytics: SSRS • Excel • Power BI
• PowerShell / SSIS / SQL Agent: Subscription & Cluster Management | Data Movement | Job Execution
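The flow in this diagram (move data into Blob Storage, create a cluster, run a job, drop the cluster, keep the results in storage) can be sketched as a small simulation. Everything in the Python below is a stand-in: InMemoryBlobStore, on_demand_cluster, and run_pipeline are hypothetical names, no Azure API is called, and in the deck the real steps are driven by PowerShell, SSIS, or SQL Agent.

```python
from contextlib import contextmanager


class InMemoryBlobStore:
    """Stand-in for Azure Blob Storage: it outlives any cluster."""

    def __init__(self):
        self.blobs = {}

    def put(self, name, data):
        self.blobs[name] = data

    def get(self, name):
        return self.blobs[name]


@contextmanager
def on_demand_cluster(name, log):
    """Create a simulated cluster and always drop it afterwards."""
    log.append(f"create cluster {name}")
    try:
        yield name
    finally:
        # Runs even if the job fails, so no cluster is left behind.
        log.append(f"delete cluster {name}")


def run_pipeline(store, raw_lines):
    """Data movement, cluster management, and job execution in order."""
    log = []
    store.put("input/weblogs.txt", raw_lines)  # data movement
    log.append("upload input")
    with on_demand_cluster("demo-cluster", log) as cluster:
        lines = store.get("input/weblogs.txt")
        error_count = sum(1 for line in lines if "ERROR" in line)  # the "job"
        store.put("output/error_count", error_count)
        log.append(f"run job on {cluster}")
    return log


if __name__ == "__main__":
    store = InMemoryBlobStore()
    for step in run_pipeline(store, ["ok", "ERROR disk", "ERROR net"]):
        print(step)
    print(store.get("output/error_count"))
```

The design point is that the output is written to storage before the cluster is deleted, so it remains available for reporting tools after the compute resources are gone.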

20

Further Reading and Learning Resources

• HDInsight Emulator

• http://azure.microsoft.com

• Learning map for HDInsight: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-learn-map

21

References

• http://msdn.microsoft.com/en-us/library/dn749804.aspx

• http://azure.microsoft.com/en-us/documentation/articles/hdinsight-component-versioning/

• http://msdn.microsoft.com/en-us/library/dn749848.aspx

• http://msdn.microsoft.com/en-us/library/dn749787.aspx

• http://msdn.microsoft.com/en-us/library/dn749805.aspx

• http://msdn.microsoft.com/en-us/library/dn749876.aspx

22

Related Apache Projects

Term                            Description
Ambari / HUE                    Deployment, Configuration, and Monitoring
Avro / Parquet / RC / Sequence  Data serialization system
Flume / S4 / Storm              Collection and import of log and event data
HBase / Cassandra               Column-oriented database scaling to billions of rows
HCatalog                        Schema and Data Type Sharing over Pig, Hive, and MapReduce
Hive / Drill / Impala           Data Warehouse with SQL-Like Access
Hive-QL/HQL                     SQL-Like Language to Query Hive
Mahout                          Library of machine learning and data mining algorithms
Pig                             High-level programming for Hadoop computations
Oozie                           Orchestration and workflow management
Sqoop                           Imports data from relational databases
Tez                             Application framework for graph
Whirr                           Cloud-agnostic deployment of clusters
MapReduce / YARN                MapReduce is a programming model for distributed data processing. MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have Map-Reduce 2.0 (MRv2) or YARN.
Zookeeper                       Configuration management and coordination

THANK YOU
