NDC Sydney - Analyzing StackExchange with Azure Data Lake

Analyzing StackExchange with Azure Data Lake

Tom Kerkhove

NDC Sydney 2017

Nice to meet you

Tom Kerkhove

Azure Consultant @ Codit

Microsoft Azure MVP & Advisor

Member of Azug.be

“Integration of Things” whitepaper (https://bit.ly/azure-iot)

blog.tomkerkhove.be

@TomKerkhove

tomkerkhove

Agenda

• Introduction to Azure Data Lake

• What is Azure Data Lake Store?

• What is Azure Data Lake Analytics?

• Analyzing StackExchange with Azure Data Lake

Let’s go open-source, right?!

➔ Comes with a few challenges for C#/SQL professionals
  ➔ New languages to learn & maintain
  ➔ Rapidly evolving ecosystem
  ➔ Cluster management
  ➔ Typically Linux machines

Analyzing Big Data in Azure

Azure Data Lake Store

Characteristics

➔ WebHDFS compatible
➔ Any size
➔ Any format as-is
➔ Write-once-read-many
➔ Enterprise-grade security
➔ The big data store in Azure

Data Warehousing vs Data Lakes

➔ Data Warehousing
  ➔ Structured data
  ➔ Defined set of schemas
  ➔ Requires Extract-Transform-Load (ETL) before storing
  ➔ Familiar to some of us
  ➔ Exploratory analysis is hard because the data has already been transformed
➔ Data Lakes
  ➔ Raw data (unstructured/semi-structured/structured)
  ➔ “Dump” all your data in the lake
  ➔ Data scientists will interpret data from the lake
  ➔ Without metadata, turns into a data swamp pretty fast

Martin Fowler on Data Lake & Data Warehouses: https://bit.ly/martin-fowler-data-lake

Security

➔ Role-based Access Control (RBAC)
➔ Grant users/groups access to folders/files (https://bit.ly/data-lake-store-acls)
➔ Firewall (off by default)
➔ Encryption at rest
  ➔ Keys managed by Microsoft
  ➔ Bring-your-own-key with Azure Key Vault

Pricing

➔ ~$0.032/GB stored per month
➔ Transaction costs
  ➔ ~$0.043 per 1M write transactions
  ➔ ~$0.0034 per 1M read transactions
  ➔ 1 transaction is a block of up to 128 kB
➔ Regular egress fees
➔ Monthly commitment packages
  ➔ Save up to 33%
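
To get a feel for these numbers, here is a rough sketch in Python of a monthly estimate. It uses only the per-GB, per-million-write and per-million-read figures from this slide. The function names are made up for illustration. It assumes 1 GB = 1024 × 1024 kB and that every transfer is split into full 128 kB blocks. Egress fees and commitment discounts are left out.

```python
import math

# Approximate prices as quoted on the slide.
PRICE_PER_GB_MONTH = 0.032
PRICE_PER_MILLION_WRITES = 0.043
PRICE_PER_MILLION_READS = 0.0034
TRANSACTION_BLOCK_KB = 128  # one transaction covers a block of up to 128 kB


def transactions_for(size_kb: float) -> int:
    """Number of transactions needed to move size_kb (assumption: full blocks)."""
    return math.ceil(size_kb / TRANSACTION_BLOCK_KB)


def estimate_monthly_cost(stored_gb: float, written_gb: float, read_gb: float) -> dict:
    """Rough monthly estimate: storage plus write and read transactions."""
    kb_per_gb = 1024 * 1024  # assumption: binary gigabytes
    writes = transactions_for(written_gb * kb_per_gb)
    reads = transactions_for(read_gb * kb_per_gb)
    storage = stored_gb * PRICE_PER_GB_MONTH
    write_cost = writes / 1_000_000 * PRICE_PER_MILLION_WRITES
    read_cost = reads / 1_000_000 * PRICE_PER_MILLION_READS
    return {
        "storage": storage,
        "writes": write_cost,
        "reads": read_cost,
        "total": storage + write_cost + read_cost,
    }


if __name__ == "__main__":
    # Example: upload a 150 GB data set once and keep it for a month.
    print(estimate_monthly_cost(stored_gb=150, written_gb=150, read_gb=0))
```

Under these assumptions, storage dominates the bill and the transaction costs for a one-off upload are small by comparison.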

Azure Data Lake Store vs Blob Storage

➔ No Limitations: Store whatever you want in any format
➔ Security: Built-in Azure Active Directory support
➔ Pricing: More expensive than Storage GRS
➔ Redundancy: It’s there but no control over it
➔ Built for Scale: Optimized for high-scale reads
➔ Integration: With Data Factory, Data Catalog & HDInsight

Full comparison on https://bit.ly/adls-vs-storage

Azure Data Lake Analytics

➔ Run analytics jobs on managed clusters

➔ No maintenance ~ Serverless

➔ Written in U-SQL

➔ SQL Syntax

➔ Extensibility in C#

➔ Easily scaled with Analytics Units

➔ Pay for processing time only

Data Sources

➔ Built-in partitioned tables
➔ Query data where it lives
➔ No need to prepare data
➔ One query that runs on multiple data stores
➔ Use the correct data store for the job

Writing U-SQL scripts

Extract from a data source by using built-in or custom extractors.

Transform / Analyse the data using SQL syntax, in-line C# or C# method calls.

Output the result to a data source by using built-in or custom outputters.
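
The three steps can be illustrated with a small stand-in. This is a plain-Python analogy of the extract / transform / output flow, not U-SQL and not the code from the talk. The sample columns (`site`, `score`) and the function names are invented for the example.

```python
import csv
import io
from collections import defaultdict


def extract(text: str) -> list:
    """Extract: parse raw CSV text into rows (one dict per line)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list) -> list:
    """Transform: count posts and sum scores per site, like a GROUP BY."""
    totals = defaultdict(lambda: {"posts": 0, "score": 0})
    for row in rows:
        entry = totals[row["site"]]
        entry["posts"] += 1
        entry["score"] += int(row["score"])
    return [
        {"site": site, "posts": entry["posts"], "score": entry["score"]}
        for site, entry in sorted(totals.items())
    ]


def output(rows: list) -> str:
    """Output: serialize the result rows back to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["site", "posts", "score"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    raw = "site,score\nstackoverflow,3\nserverfault,1\nstackoverflow,2\n"
    print(output(transform(extract(raw))))
```

In a real U-SQL script each step would be a statement in the script itself, and the job service would distribute the work; the sketch only shows the shape of the pipeline.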

Extensibility

➔ C# Expressions
➔ User-Defined Functions (UDF)
➔ User-Defined Operators (UDO)
➔ User-Defined Aggregators (UDAGG)

User-Defined Operators (UDO)

➔ User-Defined Extractors
➔ User-Defined Processors
  ➔ Take one row and produce one row
  ➔ Pass-through versus transforming
➔ User-Defined Reducers
  ➔ Take n rows and produce 1 row
➔ User-Defined Outputters
➔ User-Defined Appliers
  ➔ Take one row and produce 0 to n rows
  ➔ Used with OUTER/CROSS APPLY
➔ User-Defined Combiners
  ➔ Combines rowsets (like a user-defined join)
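
The row cardinalities listed for processors, reducers, appliers and combiners can be sketched with Python generators. This is only an illustration of the shapes, not the C# interfaces that U-SQL exposes. All function names are hypothetical, and grouping the reducer input by a key column is an assumption made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def process_rows(rows: Iterable[Row], fn: Callable[[Row], Row]) -> Iterator[Row]:
    """Processor shape: one row in, one row out."""
    for row in rows:
        yield fn(row)


def apply_rows(rows: Iterable[Row], fn: Callable[[Row], Iterable[Row]]) -> Iterator[Row]:
    """Applier shape: one row in, zero to n rows out."""
    for row in rows:
        yield from fn(row)


def reduce_rows(rows: Iterable[Row], key: str, fn: Callable[[List[Row]], Row]) -> Iterator[Row]:
    """Reducer shape: n rows in, one row out (here: one per key group)."""
    groups: Dict[object, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    for group in groups.values():
        yield fn(group)


def combine_rows(left: Iterable[Row], right: Iterable[Row], key: str,
                 fn: Callable[[Row, Row], Row]) -> Iterator[Row]:
    """Combiner shape: merge two rowsets, like a user-defined join."""
    index: Dict[object, List[Row]] = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    for row in left:
        for match in index.get(row[key], []):
            yield fn(row, match)


if __name__ == "__main__":
    posts = [
        {"id": 1, "site": "so", "tags": "azure;u-sql"},
        {"id": 2, "site": "so", "tags": ""},
    ]
    split_tags = lambda r: [{"id": r["id"], "tag": t} for t in r["tags"].split(";") if t]
    print(list(apply_rows(posts, split_tags)))
```

The applier example splits a delimited tag column into one row per tag, which is the typical reason to reach for CROSS APPLY.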

Metadata Model

U-SQL Batch Job Execution Lifetime

Michael Rys on “Tuning & Optimizing U-SQL” https://bit.ly/tuning-optimizing-u-sql

Job States

Security

➔ Role-based Access Control (RBAC)
➔ Firewall (off by default)
➔ Access control on service catalog
➔ Access control on a per-database level

Resource Management

➔ Account-level limitations
  ➔ Maximum number of AUs
  ➔ Maximum number of concurrent jobs
  ➔ Days to retain queries
➔ Job-level limitations
  ➔ Maximum number of AUs
  ➔ Maximum priority
  ➔ Granted per user and/or group

Pricing

➔ Billed for processing time, not per job
➔ Billed per second
➔ $1.687 per hour per Analytics Unit
  ➔ ~$0.028 per minute
➔ Monthly commitment packages
  ➔ Save up to 74%
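
As a quick sanity check on these figures, a job's cost can be sketched as Analytics Units × duration × hourly price. The snippet below is a back-of-the-envelope sketch in Python. It assumes a job is billed for all of its allocated AUs for its full duration. The `discount` parameter is a hypothetical way to model a commitment package.

```python
PRICE_PER_AU_HOUR = 1.687  # per hour per Analytics Unit, as quoted on the slide


def job_cost(analytics_units: int, duration_seconds: float, discount: float = 0.0) -> float:
    """Estimated cost of one job, billed per second of AU time.

    Assumption: all allocated AUs are billed for the whole duration.
    discount is a fraction, e.g. 0.74 for the largest commitment saving.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    au_hours = analytics_units * duration_seconds / 3600
    return au_hours * PRICE_PER_AU_HOUR * (1.0 - discount)


if __name__ == "__main__":
    # One AU for one minute matches the per-minute figure on the slide.
    print(round(job_cost(1, 60), 3))
    # Ten AUs for five minutes.
    print(round(job_cost(10, 300), 2))
```

Because billing follows AU time rather than job count, doubling the AUs only pays off when the job actually finishes in about half the time.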

Demo

Meet StackExchange

➔ Over 280 websites
➔ 150+ GB of open-source data
➔ Different kinds of data
  ➔ Posts
  ➔ Users
  ➔ Votes
  ➔ ...
➔ A big data sample data set

What Are We Going To Do?

Acquiring The Data
• Download the original data set

Moving The Data
• Upload data set to Azure
• Determine what service to use

Aggregating The Data
• Merging data from each site into one file
• Conversion from XML to CSV

Analyzing The Data
• Run business logic on it
• Attempt to gain knowledge from it

Visualizing The Data
• Visualize what we’ve learned
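
The aggregation step (merge every site into one file, convert XML to CSV) could look roughly like this in Python. This is a stand-in sketch, not the approach used in the demo. It assumes each dump file is a list of `<row ... />` elements with the values stored as attributes, which is how the public StackExchange dumps are laid out. The extra `Site` column and the function names are invented for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_from_dump(xml_source, site, columns):
    """Stream <row .../> elements and yield one CSV-ready list per row."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row":
            # Missing attributes become empty cells.
            yield [site] + [elem.get(col, "") for col in columns]
            elem.clear()  # free memory; dumps can be many GB


def merge_to_csv(dumps, columns, out):
    """Write rows from every (site, xml_source) pair into a single CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Site"] + columns)
    for site, xml_source in dumps:
        writer.writerows(rows_from_dump(xml_source, site, columns))


if __name__ == "__main__":
    so = io.BytesIO(b'<posts><row Id="1" Score="5" /><row Id="2" /></posts>')
    sf = io.BytesIO(b'<posts><row Id="7" Score="0" /></posts>')
    merged = io.StringIO()
    merge_to_csv([("stackoverflow", so), ("serverfault", sf)], ["Id", "Score"], merged)
    print(merged.getvalue())
```

Streaming with `iterparse` keeps memory flat, which matters once the input is a 150+ GB data set rather than a toy sample.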

How is it set up?

Operations

➔ Azure Data Lake Store
  ➔ Graphs
    • Storage Utilization
    • Read/Write
    • Ingress/Egress
  ➔ Audit & Request logs
  ➔ No Metrics
  ➔ No Alerts
➔ Azure Data Lake Analytics
  ➔ Graphs
    • Job status
    • Used # of AU time
  ➔ Metrics
    • Job status
    • Used # of AU time
  ➔ Audit & Request logs
  ➔ No alerts

Azure Data Lake tools for Visual Studio

➔ Store Explorer
  ➔ Browse store
  ➔ Download complete / subset of file
  ➔ Preview
  ➔ Only in Visual Studio
➔ Job Visualizer
  ➔ Determine bottlenecks by using heatmaps
  ➔ Playback jobs based on telemetry
➔ Query optimization
➔ Job Profiler

Azure Data Lake tools for Visual Studio (Code)

➔ Integration with source control
➔ Unit testing extensibility
➔ Local execution
➔ Simulate Data Lake Store
➔ Run & debug jobs

Integration with Azure Services

➔ Integrate with your data pipelines in Azure Data Factory
  ➔ Move data from Azure Data Lake Store to other stores
  ➔ Move data to Azure Data Lake Store
  ➔ Run U-SQL jobs within a pipeline
➔ Integration with Azure Data Catalog
  ➔ Register your Azure Data Lake Store assets

Learn more!

➔ Azure Data Lake Best Practices by Microsoft (Contact me)
➔ “Mastering Azure Analytics” by Zoiner Tejada (https://bit.ly/mastering-azure-analytics)
➔ MVA “Introducing Azure Data Lake” (https://bit.ly/intro-to-azure-data-lake)
➔ Azure Data Lake GitHub Repo (https://azure.github.io/AzureDataLake/)
➔ U-SQL Documentation (https://usql.io)

Summary

➔ Big Data is not just hype, so get ready
➔ Azure Data Lake Store
  ➔ Analyse today & explore tomorrow
  ➔ Data Swamps
➔ Data Lake Analytics
  ➔ No cluster management ~ “Serverless”
  ➔ Re-use existing skills
  ➔ Pay for what we use
➔ Big Data in Azure? Use Azure Data Lake!