Stéphane Fréchette - Samedi SQL - Introduction to HDInsight

Introduction to HDInsight

Stéphane Fréchette

Saturday, February 7, 2015

Who am I?

My name is Stéphane Fréchette

I have a passion for architecting, designing and building solutions that matter.

Twitter: @sfrechette
Blog: stephanefrechette.com
Email: stephanefrechette@ukubu.com

Topics

• What is Big Data?
• Apache Hadoop
• Hadoop Ecosystem
• Microsoft Azure HDInsight
• Demos
• Summary
• Resources
• Q&A

“Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time…” - Wikipedia

What is Big Data?

[Figure: "Many Options". Data sources grouped by tier and plotted against volume, velocity, variety and variability]

ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking

WEB 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations

Internet of things: Audio / Video, Log Files, Text / Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates

Volume scale: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)

Storage cost per GB: 1980: $190,000 | 1990: $9,000 | 2000: $15 | 2010: $0.07
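
To put the storage prices on this slide in perspective (1980: $190,000; 1990: $9,000; 2000: $15; 2010: $0.07 per GB), the short sketch below scales each price up to one terabyte and computes how far the per-GB price fell. It assumes decimal units (1 TB = 1,000 GB); the variable and function names are illustrative only.

```javascript
// Price of 1 GB of storage per year, as quoted on the slide (US dollars).
var pricePerGB = { 1980: 190000, 1990: 9000, 2000: 15, 2010: 0.07 };

// Assumption: 1 TB = 1,000 GB (decimal units).
var GB_PER_TB = 1000;

// Returns the estimated cost of storing one terabyte in the given year.
function costPerTB(year) {
    return pricePerGB[year] * GB_PER_TB;
}

// How many times cheaper a gigabyte was in 2010 than in 1980.
var dropFactor = pricePerGB[1980] / pricePerGB[2010];

Object.keys(pricePerGB).forEach(function (year) {
    console.log(year + ": $" + costPerTB(year).toFixed(2) + " per TB");
});
console.log("1980 -> 2010: about " + Math.round(dropFactor) + "x cheaper per GB");
```

At these prices a terabyte went from roughly $190 million to about $70, which is why keeping raw, unfiltered data became practical.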

What is Big Data?

Common Scenarios

• Clickstream Analysis
• Sensor/Machine
• Time and Place
• Server Logs
• Sentiment

What is Big Data?

Hadoop

• Apache Hadoop is for big data
• Open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
• Designed to scale up from single servers to thousands of machines, each offering local computation and storage

             TRADITIONAL RDBMS          HADOOP
Data Size    Gigabytes (Terabytes)      Petabytes (Exabytes)
Access       Interactive and Batch      Batch
Updates      Read / Write many times    Write once, Read many times
Structure    Static Schema              Dynamic Schema
Integrity    High (ACID)                Low
Scaling      Nonlinear                  Linear
DBA Ratio    1:40                       1:3000

Reference: Tom White’s Hadoop: The Definitive Guide

Hadoop

• Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers.

HDFS ≠ Database
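
As a rough mental model of how a distributed file system such as HDFS spreads one file over a cluster, the sketch below splits a file into fixed-size blocks and assigns each block to several servers. This is a toy illustration, not HDFS code: the round-robin placement, the function name `placeBlocks`, and the server names are assumptions made for the example. The 128 MB block size and replication factor of 3 match the usual Hadoop 2 defaults, and real HDFS placement is rack-aware rather than round-robin.

```javascript
// Toy model: split a file into blocks and place replicas on servers.
// Round-robin placement is an assumption for illustration; real HDFS
// uses a rack-aware placement policy.
// Replicas are distinct only if replication <= servers.length.
function placeBlocks(fileSizeMB, blockSizeMB, replication, servers) {
    var blockCount = Math.ceil(fileSizeMB / blockSizeMB);
    var placement = [];
    for (var b = 0; b < blockCount; b++) {
        var replicas = [];
        for (var r = 0; r < replication; r++) {
            // Consecutive servers, wrapping around the cluster.
            replicas.push(servers[(b + r) % servers.length]);
        }
        placement.push({ block: b, servers: replicas });
    }
    return placement;
}

// Hypothetical four-server cluster and a 300 MB file.
var cluster = ["server1", "server2", "server3", "server4"];
var layout = placeBlocks(300, 128, 3, cluster);
layout.forEach(function (entry) {
    console.log("block " + entry.block + " -> " + entry.servers.join(", "));
});
```

Because every block lives on more than one server, losing a single machine does not lose data, and a job can be scheduled on any server that already holds a copy.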

MapReduce

• MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Processing functions:
- Mapping
- Reducing

First, store the data

[Diagram: data stored across multiple servers]

How does it work?

Second, take the processing to the data…

// MapReduce functions in JavaScript (word count)

// map: split each input line into words and emit (word, 1) for each one
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// reduce: add up the counts emitted for one word and write the total
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
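
To see how the map and reduce functions cooperate, the sketch below runs the word count locally in plain JavaScript. It is a simplified stand-in for the Hadoop runtime, not the HDInsight API: `runLocally`, `makeIterator`, and the in-memory `context` objects are hypothetical helpers that only mimic the `context.write`, `hasNext` and `next` calls used on the slide. The two functions are repeated so the sketch runs on its own.

```javascript
// Word-count map and reduce functions in the style shown on the slide.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical helper: wraps an array in a hasNext/next iterator.
function makeIterator(items) {
    var index = 0;
    return {
        hasNext: function () { return index < items.length; },
        next: function () { return items[index++]; }
    };
}

// Hypothetical local driver: map every line, group by key, then reduce.
function runLocally(lines) {
    var grouped = Object.create(null);  // key -> values emitted by map
    var results = Object.create(null);  // key -> value written by reduce

    lines.forEach(function (line, lineNumber) {
        map(lineNumber, line, {
            write: function (k, v) {
                if (!grouped[k]) grouped[k] = [];
                grouped[k].push(v);
            }
        });
    });

    Object.keys(grouped).forEach(function (k) {
        reduce(k, makeIterator(grouped[k]), {
            write: function (key, total) { results[key] = total; }
        });
    });
    return results;
}

var counts = runLocally(["Big data is big", "Hadoop is for big data"]);
console.log(counts);
```

The grouping step in the middle is the part the Hadoop runtime performs between the two phases: every value emitted under the same key is collected and handed to a single reduce call.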

[Diagram: the runtime running on the servers]

How does it work?

[Diagram: components of the Hadoop ecosystem]

• Distributed Storage (HDFS)
• Distributed Processing (MapReduce)
• Query (Hive)
• Scripting (Pig)
• NoSQL Database (HBase)
• Metadata (HCatalog)
• Data Integration (ODBC / SQOOP / REST)
• Relational (SQL Server)
• Machine Learning (Mahout)
• Graph (Pegasus)
• Stats processing (RHadoop)
• Event Pipeline (Flume)
• Pipeline / workflow (Oozie)
• Active Directory (Security)
• Monitoring & Deployment (System Center)
• C#, F#, .NET
• PowerShell
• Azure Storage Vault (ASV)
• APS | Polybase
• Business Intelligence (Excel, Power View, SSAS)
• World's Data (Azure Data Marketplace)
• Event Driven Processing

Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages

Hadoop Ecosystem

HDInsight

• HDInsight is a Hadoop-based service that brings a 100% Apache Hadoop solution to the Microsoft Azure platform
• Based on the Hortonworks Data Platform (HDP)
• Scalable, on-demand service

Storage

• Azure Storage (Blob)
• File System

Two choices

Demo [Spinning up an HDInsight cluster ;-)]

Now what?

Working with your HDInsight cluster - running jobs, import/export data, viewing and consuming data…

• .NET
• Java
• Pig
• Hive
• Sqoop
• Excel
• Others

What is Hive?

• A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
• Provides an SQL-like language called HiveQL to query data
• Integration between Hadoop and BI and visualization tools

http://hive.apache.org
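
HiveQL will look familiar to anyone who writes SQL. As a sketch of the idea, the snippet below holds a HiveQL-style aggregation in a string and computes the same answer over a small in-memory array, so the meaning of the query is visible without a cluster. The table name `weblogs`, its columns, the sample rows, and the helper `countByPage` are all made up for the example; nothing here connects to Hive.

```javascript
// A HiveQL-style query (hypothetical table and columns).
var hiveQuery =
    "SELECT page, COUNT(*) AS hits " +
    "FROM weblogs " +
    "GROUP BY page";

// Sample rows standing in for the hypothetical weblogs table.
var weblogs = [
    { page: "/home", user: "a" },
    { page: "/cart", user: "b" },
    { page: "/home", user: "c" }
];

// Plain JavaScript equivalent of the GROUP BY / COUNT(*) above.
function countByPage(rows) {
    var hits = Object.create(null);
    rows.forEach(function (row) {
        hits[row.page] = (hits[row.page] || 0) + 1;
    });
    return hits;
}

console.log(hiveQuery);
console.log(countByPage(weblogs));
```

On a cluster, Hive turns a query like this into distributed jobs, so the same statement applies whether the table holds three rows or billions.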

What is Pig?

• Write complex MapReduce jobs using a simple script language (Pig Latin)
• A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
• Pig translates and compiles complex MapReduce jobs on the fly

http://pig.apache.org
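
A Pig Latin script reads as a sequence of data-flow steps, each producing a named relation. The sketch below mirrors a four-step script (shown in the comments) with ordinary JavaScript array operations so each step can be inspected. The file name, field names, and sample records are hypothetical, and the JavaScript only imitates what Pig would compile into MapReduce jobs.

```javascript
// Pig Latin being imitated (hypothetical file and fields):
//   logs    = LOAD 'logs.txt' AS (level:chararray, msg:chararray);
//   errors  = FILTER logs BY level == 'ERROR';
//   grouped = GROUP errors BY msg;
//   counts  = FOREACH grouped GENERATE group, COUNT(errors);

// LOAD: sample records standing in for the input file.
var logs = [
    { level: "ERROR", msg: "disk full" },
    { level: "INFO",  msg: "started" },
    { level: "ERROR", msg: "disk full" },
    { level: "ERROR", msg: "timeout" }
];

// FILTER: keep only the error records.
var errors = logs.filter(function (rec) { return rec.level === "ERROR"; });

// GROUP: collect records under their message.
var grouped = Object.create(null);
errors.forEach(function (rec) {
    if (!grouped[rec.msg]) grouped[rec.msg] = [];
    grouped[rec.msg].push(rec);
});

// FOREACH ... GENERATE: emit one (group, count) pair per message.
var pigCounts = Object.keys(grouped).map(function (msg) {
    return { group: msg, count: grouped[msg].length };
});

console.log(pigCounts);
```

Each intermediate variable corresponds to one relation in the script, which is what makes Pig convenient for multi-step transformations.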

What is Sqoop?

• Command-line interface application to transfer bulk data between Hadoop and relational datastores

http://sqoop.apache.org

Demo [Query, Analyze, Transfer + Visual Studio Tools for HDInsight]

Hadoop Data Analytics

Data Flow

Demo [Self-Service BI with Hive and Excel…]

Machine Learning

Graph Processing

Distributed Compute

Extract Load Transform

Predictive Analysis

Capabilities

Data Knowledge Action

Summary

Resources

• Apache Projects (list with links) http://bit.ly/MfpLtE
• Microsoft Azure HDInsight http://bit.ly/1dnlAX1
• HDInsight Documentation & Tutorials http://bit.ly/LWRYol
• Hortonworks Sandbox 2.2 & Tutorials http://bit.ly/1gkkCte
• Cloudera VMs CDH 5.3.x http://bit.ly/1ENWgHH
• Microsoft JDBC Driver 4.1 | 4.0 for SQL Server http://bit.ly/1kEgJ7O
• Microsoft Hive ODBC Driver http://bit.ly/NFkhcH
• Getting Started with Big Data (MVA) http://bit.ly/1wU90Xd
• Big Data and Business Analytics Immersion v3.1 (MVA) http://bit.ly/1unvvX1
• Introducing Microsoft Azure HDInsight (free e-book) http://bit.ly/1JOPe5F

What Questions Do You Have?

Thank You

For attending this session
