Killer Scenarios with Data Lake in Azure with U-SQL

Microsoft Data Science Summit Sept 26 – 27 | Atlanta, GA

Transcript of Killer Scenarios with Data Lake in Azure with U-SQL

Page 1: Killer Scenarios with Data Lake in Azure with U-SQL

Microsoft Data Science Summit | Sept 26 – 27 | Atlanta, GA

Page 2

Killer Scenarios with Data Lake in Azure with U-SQL
Michael Rys, Principal Program Manager Big Data
@mike_rys | [email protected] | http://aka.ms/azuredatalake

Page 3

Agenda

Today (BR013): Killer extensibility in Azure Data Lake with U-SQL
• Custom rowset aggregation
• How to do JSON processing
• Image processing
• How to call R from U-SQL

Yesterday (BR014): Introduction to Azure Data Lake and U-SQL
• What is Azure Data Lake?
• Why U-SQL?
• Core concepts
• Schema on read on files and file sets
• C# extensibility
• SQL with U-SQL
• Script-level execution and optimization
• Tool usage

Page 4

U-SQL extensibility: Extend U-SQL with C#/.NET

• Built-in operators, functions, aggregates
• C# expressions (in SELECT expressions)
• User-defined aggregates (UDAGGs)
• User-defined functions (UDFs)
• User-defined operators (UDOs)

Page 5

What are UDOs? Custom operator extensions, scaled out by U-SQL:

• User-Defined Extractors
• User-Defined Outputters
• User-Defined Processors: take one row and produce one row; pass-through versus transforming
• User-Defined Appliers: take one row and produce 0 to n rows; used with OUTER/CROSS APPLY
• User-Defined Combiners: combine rowsets (like a user-defined join)
• User-Defined Reducers: take n rows and produce m rows (normally m < n)

Scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution): EXTRACT, OUTPUT, PROCESS, COMBINE, REDUCE
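The row cardinalities above (processor 1 → 1, applier 1 → 0..n, reducer n → m) can be sketched independently of the .NET interfaces; a minimal Python sketch, illustrative only — the function names are hypothetical, not the U-SQL API:

```python
def process(row):
    # Processor: one row in, one row out (e.g., add a derived column)
    return {**row, "total": row["a"] + row["b"]}

def apply_(row):
    # Applier: one row in, 0..n rows out (e.g., explode a list-valued column)
    for item in row["items"]:
        yield {"id": row["id"], "item": item}

def reduce_(rows):
    # Reducer: n rows in, m rows out (here: one aggregate row per group)
    total = sum(r["a"] for r in rows)
    yield {"count": len(rows), "sum": total}
```

U-SQL scales these out by partitioning the rowset and running an operator instance per partition.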

Page 6

UDO model:
• Marking UDOs
• Parameterizing UDOs
• UDO signature
• UDO-specific processing pattern
• Rowsets and their schemas in UDOs
• Setting results: by position, by name

[SqlUserDefinedExtractor]
public class DriverExtractor : IExtractor
{
    private byte[] _row_delim;
    private string _col_delim;
    private Encoding _encoding;

    // Define a non-default constructor since I want to pass in my own parameters
    public DriverExtractor(string row_delim = "\r\n", string col_delim = ",", Encoding encoding = null)
    {
        _encoding = encoding == null ? Encoding.UTF8 : encoding;
        _row_delim = _encoding.GetBytes(row_delim);
        _col_delim = col_delim;
    } // DriverExtractor

    // Converting text to target schema
    private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
    {
        var schema = outputrow.Schema;
        if (schema[i].Type == typeof(int))
        {
            var tmp = Convert.ToInt32(c);
            outputrow.Set(i, tmp);
        }
        ...
    } // OutputValueAtCol_I

    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
    {
        foreach (var row in input.Split(_row_delim))
        {
            using (var s = new StreamReader(row, _encoding))
            {
                int i = 0;
                foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
                {
                    OutputValueAtCol_I(c, i++, outputrow);
                } // foreach
            } // using
            yield return outputrow.AsReadOnly();
        } // foreach
    } // Extract
} // class DriverExtractor
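The extractor's split-and-convert loop can be sketched outside the U-SQL API; a minimal Python sketch (the `extract` helper is hypothetical, assuming the schema is given as a list of Python types):

```python
def extract(text, schema, row_delim="\r\n", col_delim=","):
    """Split text into rows and columns, converting each cell to its schema type."""
    rows = []
    for raw_row in text.split(row_delim):
        if not raw_row:
            continue  # skip empty trailing rows
        cells = raw_row.split(col_delim)
        # Convert column i with the i-th schema type (mirrors OutputValueAtCol_I)
        rows.append([typ(cell) for typ, cell in zip(schema, cells)])
    return rows
```

For example, `extract("1,a\r\n2,b", [int, str])` yields one typed row per input line.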

Page 7

How to specify UDOs? Code behind

Page 8

How to specify UDOs? C# class project for U-SQL

Page 9

How to specify UDOs? Any .NET language is usable, however not first-class in the tooling. Use the U-SQL specific .NET DLLs, compile your DLL, upload it to ADLS, and register it with a script.

Page 10

Managing Assemblies

• CREATE ASSEMBLY db.assembly FROM @path;
• CREATE ASSEMBLY db.assembly FROM byte[];
  • Can also include additional resource files
• REFERENCE ASSEMBLY db.assembly;
• Referencing .NET Framework assemblies
  • Always accessible system namespaces:
    • U-SQL specific (e.g., for SQL.MAP)
    • All provided by system.dll, system.core.dll, system.data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
  • Add all other .NET Framework assemblies with: REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerating assemblies
  • PowerShell command
  • U-SQL Studio Server Explorer
• DROP ASSEMBLY db.assembly;

Create assemblies, reference assemblies, enumerate assemblies, drop assemblies.

Visual Studio makes registration easy!

Page 11

USING clause: 'USING' csharp_namespace | Alias '=' csharp_namespace_or_class

Allows shortening and disambiguating C# namespace and class names.

Examples:

DECLARE @input string = "somejsonfile.json";

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

@data0 = EXTRACT IPAddresses string FROM @input USING new JsonExtractor("Devices[*]");

USING json = [Microsoft.Analytics.Samples.Formats.Json.JsonExtractor];

@data1 = EXTRACT IPAddresses string FROM @input USING new json("Devices[*]");

Page 12

Overlapping Range Aggregation

Input:

Start Time | End Time | User Name
5:00 AM    | 6:00 AM  | ABC
5:00 AM    | 6:00 AM  | XYZ
8:00 AM    | 9:00 AM  | ABC
8:00 AM    | 10:00 AM | ABC
10:00 AM   | 2:00 PM  | ABC
7:00 AM    | 11:00 AM | ABC
9:00 AM    | 11:00 AM | ABC
11:00 AM   | 11:30 AM | ABC
11:40 PM   | 11:59 PM | FOO
11:50 PM   | 0:40 AM  | FOO

https://blogs.msdn.microsoft.com/azuredatalake/2016/06/27/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos

Output:

Start Time | End Time | User Name
5:00 AM    | 6:00 AM  | ABC
5:00 AM    | 6:00 AM  | XYZ
7:00 AM    | 2:00 PM  | ABC
11:40 PM   | 0:40 AM  | FOO
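The merge the reducer performs can be sketched independently of U-SQL; a minimal Python sketch over (begin, end) pairs in minutes since midnight — illustrative only, the actual C# reducer works on DateTime values and handles cases like the midnight wrap-around for user FOO:

```python
def merge_ranges(ranges):
    """Merge overlapping (begin, end) ranges; input must be presorted by begin."""
    merged = []
    for begin, end in ranges:
        if merged and begin <= merged[-1][1]:
            # Overlaps the current range: extend its end if needed
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([begin, end])
    return [tuple(r) for r in merged]
```

Applied per user after presorting by begin, user ABC's seven input ranges above collapse to two: 5:00–6:00 AM and 7:00 AM–2:00 PM.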

Page 13

Overlapping Range Aggregation — U-SQL:

@r = REDUCE @in PRESORT begin
     ON user
     PRODUCE begin DateTime,
             end DateTime,
             user string
     READONLY user
     USING new ReduceSample.RangeReducer();

• PRESORT begin: presort the input rowset
• ON user: partition and scale out
• READONLY user: declare pass-through columns
• USING new ReduceSample.RangeReducer(): user-defined reducer

Page 14

Code Behind:

namespace ReduceSample
{
    [SqlUserDefinedReducer(IsRecursive = true)]
    public class RangeReducer : IReducer
    {
        public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
        {
            // Init aggregation values
            int i = 0;
            var begin = DateTime.MaxValue;
            var end = DateTime.MinValue;

            foreach (var row in input.Rows)
            {
                ...
                begin = row.Get<DateTime>("begin");
                end = row.Get<DateTime>("end");
                ...
                output.Set<DateTime>("begin", begin);
                output.Set<DateTime>("end", end);
                yield return output.AsReadOnly();
                ...
            } // foreach
        } // Reduce
    } // class RangeReducer
} // namespace ReduceSample

Overlapping Range Aggregation

• IsRecursive = true: provides better scale, but requires an associative operation
• Implement IReducer
• Iterate over the input rowset partition and accumulate rows
• Get input columns with row.Get
• Set output columns with output.Set

Page 15

JSON Processing

How do I extract data from JSON documents?

https://github.com/Azure/usql/tree/master/Examples/DataFormats

Page 16

Architecture of Sample Format Assembly

Single JSON document per file: use JsonExtractor.

Multiple JSON documents per file:
• Do not allow CR/LF (the row delimiter) inside the JSON
• Use the built-in text extractor to extract
• Use JsonTuple to schematize (with CROSS APPLY)

Currently loads the full JSON document into memory; better to use JsonReader-based processing if the documents are large.

Dependencies: Microsoft.Analytics.Samples.Formats builds on NewtonSoft.Json and System.Xml.

Page 17

JSON Processing

@json =
    EXTRACT personid int, name string, addresses string
    FROM @input
    USING new Json.JsonExtractor("[*].person");

@person =
    SELECT personid, name,
           Json.JsonFunctions.JsonTuple(addresses)["address"] AS address_array
    FROM @json;

@addresses =
    SELECT personid, name,
           Json.JsonFunctions.JsonTuple(address) AS address
    FROM @person
         CROSS APPLY
         EXPLODE (Json.JsonFunctions.JsonTuple(address_array).Values) AS A(address);

@result =
    SELECT personid, name,
           address["addressid"] AS addressid,
           address["street"] AS street,
           address["postcode"] AS postcode,
           address["city"] AS city
    FROM @addresses;

• JsonExtractor("[*].person"): JPath expression mapping objects to rows; extracted fields are keyed relative to the matched objects
• JsonTuple(addresses): generates 1-level key-value pairs as SqlMap; ["address"] gets a value from the map as string
• CROSS APPLY EXPLODE: converts the string array into a map and pivots all values into rows
• JsonTuple(address): gets the object map for an array item
• Final SELECT: gets the desired keys from the object map
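The same flattening can be sketched with plain Python and the standard json module; the document below is a hypothetical sample mirroring the persons/addresses shape the script assumes:

```python
import json

# Hypothetical sample document matching the "[*].person" shape
doc = """[{"person": {"personid": 1, "name": "Ada",
           "addresses": {"address": [
               {"addressid": "1", "street": "Main St", "city": "Atlanta"},
               {"addressid": "2", "street": "Oak Ave", "city": "Decatur"}]}}}]"""

rows = []
for obj in json.loads(doc):                      # like JsonExtractor("[*].person")
    person = obj["person"]
    for addr in person["addresses"]["address"]:  # like CROSS APPLY EXPLODE
        rows.append({"personid": person["personid"], "name": person["name"],
                     "street": addr["street"], "city": addr["city"]})
```

Each person row is fanned out into one output row per address, which is exactly what the CROSS APPLY stage does at scale.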

Page 18

Image Processing

Extracted metadata (example):

Copyright | Camera Make | Camera Model | Thumbnail
Michael   | Canon       | 70D          | (image)
Michael   | Samsung     | S7           | (image)

https://github.com/Azure/usql/tree/master/Examples/ImageApp

Page 19

Image Processing

The image processing assembly uses System.Drawing and exposes extractors, an outputter, a processor, and user-defined functions.

Trade-offs:
• Column memory limits: image extractor vs. feature extractor
• Main memory pressure in the vertex: UDFs vs. processor vs. extractor

Page 20

R Processing

KMeans Centroids

Page 21

Architecture: U-SQL Processing with R

The R programmer provides an assembly containing the KMeansRReducer, which drives the R engine (runtime) through:
• R-to-.NET interop (RDotNet.dll & RDotNet.NativeLib.dll)
• R runtime (R-bin.zip)
• R engine manager utility (RUtilities.dll)

Similar approaches can be used for deploying other runtimes: Python, JavaScript, JVM. Note that there is no external access from UDOs.

Future work: more generic samples; more automatic experiences (no user wrappers/deploys).

Page 22

Summary of U-SQL UDOs

Page 23

What are UDOs?

Custom operator extensions written in .NET (C#), scaled out by U-SQL.

Page 24

UDO Tips and Warnings

• Tips when using UDOs:
  • Use the READONLY clause to allow pushing predicates through UDOs
  • Use the REQUIRED clause to allow column pruning through UDOs
  • Use PRESORT on REDUCE if you need a global order
  • Hint the cardinality if the optimizer chooses the wrong plan

• Warnings and better alternatives:
  • Use SELECT with UDFs instead of PROCESS
  • Use user-defined aggregators instead of REDUCE
  • Learn to use windowing functions (the OVER expression)

• Good use cases for PROCESS/REDUCE/COMBINE:
  • The logic needs to dynamically access the input and/or output schema, e.g., create a JSON document for the data in the row where the columns are not known a priori
  • Your UDF-based solution creates too much memory pressure and you can write your code more memory-efficiently in a UDO
  • You need an ordered aggregator or to produce more than one row per group

Page 25

Additional Resources

Blogs and community page:
http://usql.io (U-SQL GitHub)
http://blogs.msdn.microsoft.com/azuredatalake/
http://blogs.msdn.microsoft.com/mrys/
https://channel9.msdn.com/Search?term=U-SQL#ch9Search

Documentation and articles:
http://aka.ms/usql_reference
https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
https://msdn.microsoft.com/en-us/magazine/mt614251

ADL forums and feedback:
http://aka.ms/adlfeedback
https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
http://stackoverflow.com/questions/tagged/u-sql

Page 26

© 2016 Microsoft Corporation. All rights reserved.