NDC Sydney - Analyzing StackExchange with Azure Data Lake
Tom Kerkhove
Azure Consultant @ Codit
Microsoft Azure MVP & Advisor
Member of Azug.be
“Integration of Things” whitepaper (https://bit.ly/azure-iot)
Nice to meet you
blog.tomkerkhove.be
@TomKerkhove
tomkerkhove
Agenda
• Introduction to Azure Data Lake
• What is Azure Data Lake Store?
• What is Azure Data Lake Analytics?
• Analyzing StackExchange with Azure Data Lake
Let’s go open-source, right?!
➔ Comes with a few challenges for C#/SQL professionals
➔ New languages to learn & maintain
➔ Rapidly evolving ecosystem
➔ Cluster management
➔ Typically Linux machines
Azure Data Lake Store
Characteristics
➔ WebHDFS compatible
➔ Any size
➔ Any format as-is
➔ Write-once-read-many
➔ Enterprise-grade security
➔ The big data store in Azure
Data Warehousing vs Data Lakes
➔ Data Warehousing
➔ Structured data
➔ Defined set of schemas
➔ Requires Extract-Transform-Load (ETL) before storing
➔ Known to some of us
➔ Exploratory analysis is hard because the data has to be transformed first
➔ Data Lakes
➔ Raw data (unstructured/semi-structured/structured)
➔ “Dump” all your data in the lake
➔ Data scientists will interpret data from the lake
➔ Without metadata, it turns into a data swamp pretty fast
Security
➔ Role-based Access Control (RBAC)
➔ Grant users/groups access to a folder/file (https://bit.ly/data-lake-store-acls)
➔ Firewall (off by default)
➔ Encryption at rest
➔ Keys managed by Microsoft
➔ Bring-your-own-key with Azure Key Vault
Pricing
➔ ~$0.032/GB stored per month
➔ Transaction costs
➔ ~$0.043 per 1M write transactions
➔ ~$0.0034 per 1M read transactions
➔ 1 transaction is a block of up to 128 kB
➔ Regular egress fees
➔ Monthly commitment packages
➔ Save up to 33%
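To get a feel for these numbers, here is a minimal Python sketch that estimates a monthly Data Lake Store bill from the prices on this slide. It is an illustration, not an official calculator: the function names are hypothetical, it assumes every read or write is billed as one transaction per 128 kB block (with 1 kB taken as 1000 bytes), and it leaves out egress fees and commitment discounts.

```python
import math

# Approximate prices quoted on the slide; check the current price list before relying on them.
STORAGE_PER_GB_MONTH = 0.032
WRITE_PER_MILLION_TX = 0.043
READ_PER_MILLION_TX = 0.0034
TX_BLOCK_BYTES = 128 * 1000  # "up to 128 kB"; 1 kB = 1000 bytes is an assumption


def transactions_for(num_bytes: int) -> int:
    """Billable transactions for num_bytes, assuming one per block of up to 128 kB."""
    return math.ceil(num_bytes / TX_BLOCK_BYTES)


def monthly_store_cost(stored_gb: float, written_bytes: int, read_bytes: int) -> float:
    """Storage plus transaction costs for one month (egress and discounts excluded)."""
    storage = stored_gb * STORAGE_PER_GB_MONTH
    writes = transactions_for(written_bytes) / 1_000_000 * WRITE_PER_MILLION_TX
    reads = transactions_for(read_bytes) / 1_000_000 * READ_PER_MILLION_TX
    return storage + writes + reads


# Example: a 150 GB data set written once and read ten times in a month.
dataset_bytes = 150 * 10**9
cost = monthly_store_cost(150, dataset_bytes, 10 * dataset_bytes)
print(f"Estimated monthly cost: ${cost:.2f}")
```

Under these assumptions storage dominates the bill: the transaction costs for this example stay below ten cents.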
Azure Data Lake Store vs Blob Storage
➔ No Limitations: Store whatever you want in any format
➔ Security: Built-in Azure Active Directory support
➔ Pricing: More expensive than Storage GRS
➔ Redundancy: It’s there but no control over it
➔ Built for Scale: Optimized for high-scale reads
➔ Integration: With Data Factory, Data Catalog & HDInsight
Full comparison on https://bit.ly/adls-vs-storage
Azure Data Lake Analytics
➔ Run analytics jobs on managed clusters
➔ No maintenance ~ Serverless
➔ Written in U-SQL
➔ SQL Syntax
➔ Extensibility in C#
➔ Easily scaled with Analytics Units
➔ Pay for processing time only
Data Sources
➔ Built-in partitioned tables
➔ Query data where it lives
➔ No need to prepare data
➔ One query that runs on multiple data stores
➔ Use the correct data store for the job
Writing U-SQL scripts
Extract from a data source by using built-in or custom extractors.
Transform / analyse the data using SQL syntax, in-line C# or C# method calls.
Output the result to a data source by using built-in or custom outputters.
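The three steps can be sketched outside U-SQL as well. The following Python example is only an analogy for the extract / transform / output shape of a script, not U-SQL and not code from the talk; the function names, column names and sample data are made up for illustration.

```python
import csv
import io


def extract(source, columns):
    """Extract: parse raw CSV text into rows, applying a declared column list."""
    for values in csv.reader(source):
        yield dict(zip(columns, values))


def transform(rows):
    """Transform: count posts per site, comparable to a GROUP BY in SQL."""
    counts = {}
    for row in rows:
        counts[row["site"]] = counts.get(row["site"], 0) + 1
    return sorted(counts.items())


def output(results, target):
    """Output: write the result set to a target as CSV with a header row."""
    writer = csv.writer(target, lineterminator="\n")
    writer.writerow(["site", "post_count"])
    writer.writerows(results)


# Hypothetical input: post id and site name, one post per line.
raw = io.StringIO("1,stackoverflow\n2,superuser\n3,stackoverflow\n")
out = io.StringIO()
output(transform(extract(raw, ["post_id", "site"])), out)
print(out.getvalue())
```

As in a U-SQL script, the schema is applied when the data is read rather than when it is stored, so the raw file can stay in the lake as-is.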
Extensibility
➔ C# Expressions
➔ User-Defined Functions (UDF)
➔ User-Defined Operators (UDO)
➔ User-Defined Aggregators (UDAGG)
User-Defined Operators (UDO)
➔ User-Defined Extractors
➔ User-Defined Processors
➔ Take one row and produce one row
➔ Pass-through versus transforming
➔ User-Defined Reducers
➔ Take n rows and produce 1 row
➔ User-Defined Outputters
➔ User-Defined Appliers
➔ Take one row and produce 0 to n rows
➔ Used with OUTER/CROSS APPLY
➔ User-Defined Combiners
➔ Combines rowsets (like a user-defined join)
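The row-count contracts are easier to see in code. Real UDOs are written in C# against the U-SQL extensibility interfaces; the Python sketch below only mimics the shapes described above (1 to 1, 1 to 0..n, n to 1, and a join-like combine). All function names, column names and sample rows are hypothetical.

```python
from itertools import groupby


def processor(row):
    """Processor: one row in, one row out (here: add a derived column)."""
    return {**row, "title_length": len(row["title"])}


def applier(row):
    """Applier: one row in, zero to n rows out (here: one row per tag)."""
    for tag in row["tags"].split(";"):
        if tag:
            yield {"id": row["id"], "tag": tag}


def reducer(rows):
    """Reducer: n rows in, one row out (here: number of rows for one tag)."""
    rows = list(rows)
    return {"tag": rows[0]["tag"], "count": len(rows)}


def combiner(left, right, key):
    """Combiner: merge two rowsets on a key, like a user-defined join."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


posts = [
    {"id": 1, "title": "U-SQL basics", "tags": "u-sql;azure"},
    {"id": 2, "title": "Untagged", "tags": ""},
    {"id": 3, "title": "Data Lake", "tags": "azure"},
]

processed = [processor(post) for post in posts]
exploded = [row for post in posts for row in applier(post)]
by_tag = sorted(exploded, key=lambda row: row["tag"])
reduced = [reducer(group) for _, group in groupby(by_tag, key=lambda row: row["tag"])]
combined = combiner(exploded, processed, "id")
print(reduced)
```

The untagged post shows the "0 to n rows" case: the applier yields nothing for it, so it drops out of the tag counts.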
U-SQL Batch Job Execution Lifetime
Michael Rys on “Tuning & Optimizing U-SQL” https://bit.ly/tuning-optimizing-u-sql
Security
➔ Role-based Access Control (RBAC)
➔ Firewall (off by default)
➔ Access control on service catalog
➔ Access control on a per-database level
Resource Management
➔ Account-level limitations
➔ Maximum number of AUs
➔ Maximum number of concurrent jobs
➔ Days to retain queries
➔ Job-level limitations
➔ Maximum number of AUs
➔ Maximum priority
➔ Granted per user and/or group
Pricing
➔ Billed for processing time, not per job
➔ Billed per second
➔ $1.687 per hour per Analytics Unit
➔ ~$0.028 per minute
➔ Monthly commitment packages
➔ Save up to 74%
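The per-minute figure follows directly from the hourly rate. As a rough sketch in Python, assuming a job is billed for every Analytics Unit assigned to it for its full duration (the function name is hypothetical and commitment discounts are ignored):

```python
AU_PRICE_PER_HOUR = 1.687  # per Analytics Unit, as quoted on the slide


def job_cost(analytics_units: int, duration_seconds: float) -> float:
    """Cost of one job: Analytics Units x processing time, billed per second."""
    au_seconds = analytics_units * duration_seconds
    return au_seconds * AU_PRICE_PER_HOUR / 3600


# One AU for one minute matches the ~$0.028 per minute on the slide.
print(f"1 AU for 60 s:   ${job_cost(1, 60):.4f}")
# Ten AUs for five minutes cost the same as five AUs for ten minutes.
print(f"10 AU for 300 s: ${job_cost(10, 300):.4f}")
```

Because the price depends only on AU-seconds, adding AUs pays off only when it shortens the job by a comparable factor.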
Meet StackExchange
➔ Over 280 websites
➔ 150+ GB of open-source data
➔ Different kinds of data
➔ Posts
➔ Users
➔ Votes
➔ ...
➔ A big data sample data set
What Are We Going To Do?
Acquiring The Data
• Download the original data set
Moving The Data
• Upload data set to Azure
• Determine what service to use
Aggregating The Data
• Merging data from each site into one file
• Conversion from XML to CSV
Analyzing The Data
• Run business logic on it
• Attempt to gain knowledge from it
Visualizing The Data
• Visualize what we’ve learned
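The aggregation step (merging the per-site files and converting XML to CSV) can be illustrated with a short Python sketch. The transcript does not show how the talk implemented it, so this is only one possible approach. It assumes the dump layout of one <row> element per record with the fields stored as attributes; the sample sites, attribute values and function names are made up.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_from_dump(xml_source, site):
    """Yield one dict per <row> element, tagged with the site it came from."""
    for _, element in ET.iterparse(xml_source, events=("end",)):
        if element.tag == "row":
            yield {"Site": site, **element.attrib}
            element.clear()  # release parsed content to keep memory use low


def merge_to_csv(dumps, columns, target):
    """Merge the dumps of several sites into one CSV with a fixed column list."""
    writer = csv.DictWriter(
        target,
        fieldnames=["Site", *columns],
        extrasaction="ignore",  # skip attributes that are not listed in columns
        restval="",             # leave missing attributes empty
        lineterminator="\n",
    )
    writer.writeheader()
    for site, xml_source in dumps:
        writer.writerows(rows_from_dump(xml_source, site))


# Hypothetical miniature user dumps for two sites.
superuser = io.BytesIO(b'<users><row Id="1" DisplayName="Ann" Reputation="101" /></users>')
serverfault = io.BytesIO(b'<users><row Id="1" DisplayName="Bob" /></users>')

out = io.StringIO()
merge_to_csv(
    [("superuser", superuser), ("serverfault", serverfault)],
    ["Id", "DisplayName", "Reputation"],
    out,
)
print(out.getvalue())
```

Adding a Site column during the merge keeps records from different sites apart, since identifiers such as Id are only unique within a single site.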
Operations
➔ Azure Data Lake Store
➔ Graphs
• Storage Utilization
• Read/Write
• Ingress/Egress
➔ Audit & Request logs
➔ No Metrics
➔ No Alerts
➔ Azure Data Lake Analytics
➔ Graphs
• Job status
• Used # of AU time
➔ Metrics
• Job status
• Used # of AU time
➔ Audit & Request logs
➔ No alerts
Azure Data Lake tools for Visual Studio
➔ Store Explorer
➔ Browse store
➔ Download a complete file or a subset of it
➔ Preview
➔ Only in Visual Studio
➔ Job Visualizer
➔ Determine bottlenecks by using heatmaps
➔ Play back jobs based on telemetry
➔ Query optimization
➔ Job Profiler
Azure Data Lake tools for Visual Studio (Code)
➔ Integration with source control
➔ Unit Testing extensibility
➔ Local execution
➔ Simulate Data Lake Store
➔ Run & debug jobs
Integration with Azure Services
➔ Integrate with your data pipelines in Azure Data Factory
➔ Move data from Azure Data Lake Store to other stores
➔ Move data to Azure Data Lake Store
➔ Run U-SQL jobs within a pipeline
➔ Integration with Azure Data Catalog
➔ Register your Azure Data Lake Store assets
Learn more!
➔ Azure Data Lake Best Practices by Microsoft (Contact me)
➔ “Mastering Azure Analytics” by Zoiner Tejada (https://bit.ly/mastering-azure-analytics)
➔ MVA “Introducing Azure Data Lake” (https://bit.ly/intro-to-azure-data-lake)
➔ Azure Data Lake GitHub Repo (https://azure.github.io/AzureDataLake/)
➔ U-SQL Documentation (https://usql.io)
Summary
➔ Big Data is not just hype, so get ready
➔ Azure Data Lake Store
➔ Analyse today & explore tomorrow
➔ Data Swamps
➔ Data Lake Analytics
➔ No cluster management ~ “Serverless”
➔ Re-use existing skills
➔ Pay for what we use
➔ Big Data in Azure? Use Azure Data Lake!