Killer Scenarios with Data Lake in Azure with U-SQL

Post on 21-Apr-2017



Microsoft Data Science Summit
Sept 26 – 27 | Atlanta, GA

Killer Scenarios with Data Lake in Azure with U-SQL
Michael Rys
Principal Program Manager, Big Data
@MikeDoesBigData
usql@microsoft.com
http://aka.ms/azuredatalake

Agenda

Today (BR013): Killer extensibility in Azure Data Lake with U-SQL
• Custom rowset aggregation
• How to do JSON processing
• Image processing
• How to call R from U-SQL

Yesterday (BR014): Introduction to Azure Data Lake and U-SQL
• What is Azure Data Lake?
• Why U-SQL?
• Core concepts
  • Schema on read on file and file sets
  • C# extensibility
  • SQL with U-SQL
  • Script level execution and optimization
• Tool usage

U-SQL extensibility: Extend U-SQL with C#/.NET

• Built-in operators, functions, aggregates
• C# expressions (in SELECT expressions)
• User-defined aggregates (UDAGGs)
• User-defined functions (UDFs)
• User-defined operators (UDOs)

What are UDOs?

Custom operator extensions, scaled out by U-SQL

• User-Defined Extractors
• User-Defined Outputters
• User-Defined Processors
  • Take one row and produce one row
  • Pass-through versus transforming
• User-Defined Appliers
  • Take one row and produce 0 to n rows
  • Used with OUTER/CROSS APPLY
• User-Defined Combiners
  • Combine rowsets (like a user-defined join)
• User-Defined Reducers
  • Take n rows and produce m rows (normally m<n)

Scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution):
EXTRACT, OUTPUT, PROCESS, COMBINE, REDUCE

UDO model
• Marking UDOs
• Parameterizing UDOs
• UDO signature
• UDO-specific processing pattern
• Rowsets and their schemas in UDOs
• Setting results
  • By position
  • By name

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor]
public class DriverExtractor : IExtractor
{
    private byte[] _row_delim;
    private string _col_delim;
    private Encoding _encoding;

    // Define a non-default constructor since I want to pass in my own parameters
    public DriverExtractor(string row_delim = "\r\n", string col_delim = ",", Encoding encoding = null)
    {
        _encoding = encoding == null ? Encoding.UTF8 : encoding;
        _row_delim = _encoding.GetBytes(row_delim);
        _col_delim = col_delim;
    } // DriverExtractor

    // Converting text to target schema
    private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
    {
        var schema = outputrow.Schema;

        if (schema[i].Type == typeof(int))
        {
            var tmp = Convert.ToInt32(c);
            outputrow.Set(i, tmp);
        }
        ...
    } // OutputValueAtCol_I

    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
    {
        foreach (var row in input.Split(_row_delim))
        {
            using (var s = new StreamReader(row, _encoding))
            {
                int i = 0;
                foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
                {
                    OutputValueAtCol_I(c, i++, outputrow);
                } // foreach
            } // using
            yield return outputrow.AsReadOnly();
        } // foreach
    } // Extract
} // class DriverExtractor
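The extractor converts each text column according to the type the target schema declares at that position; the slide elides every case except int. As a rough sketch of that idea in plain C# without the U-SQL interfaces, one could write something like the following. The names RowParser and ParseRow and the Type[] schema parameter are hypothetical, and Convert.ChangeType merely stands in for the per-type branches the slide leaves out.

```csharp
using System;
using System.Globalization;

public static class RowParser
{
    // Split one text row on colDelim and convert each column to the type
    // that the (hypothetical) schema array requests at that position.
    public static object[] ParseRow(string row, string colDelim, Type[] schema)
    {
        var cols = row.Split(new[] { colDelim }, StringSplitOptions.None);
        if (cols.Length != schema.Length)
        {
            throw new FormatException(
                $"Expected {schema.Length} columns but found {cols.Length}.");
        }

        var result = new object[schema.Length];
        for (int i = 0; i < cols.Length; i++)
        {
            // Convert.ChangeType handles the common primitives (int, double,
            // DateTime, string); the invariant culture keeps parsing stable.
            result[i] = Convert.ChangeType(cols[i], schema[i], CultureInfo.InvariantCulture);
        }
        return result;
    }
}
```

A real extractor would call outputrow.Set for each converted value instead of filling an array, and would need its own policy for empty or malformed columns.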

How to specify UDOs?

• Code behind
• C# Class Project for U-SQL
• Any .NET language
  • Usable, however not first-class in tooling
  • Use U-SQL specific .NET DLLs
  • Compile DLL, upload to ADLS, register with script

Managing Assemblies

• Create assemblies
  • CREATE ASSEMBLY db.assembly FROM @path;
  • CREATE ASSEMBLY db.assembly FROM byte[];
  • Can also include additional resource files
• Reference assemblies
  • REFERENCE ASSEMBLY db.assembly;
  • Referencing .NET Framework assemblies
    • Always accessible system namespaces:
      • U-SQL specific (e.g., for SQL.MAP)
      • All provided by system.dll, system.core.dll, system.data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
    • Add all other .NET Framework assemblies with: REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerate assemblies
  • PowerShell command
  • U-SQL Studio Server Explorer
• Drop assemblies
  • DROP ASSEMBLY db.assembly;

Visual Studio makes registration easy!

USING clause

'USING' csharp_namespace | Alias '=' csharp_namespace_or_class.

Examples:

DECLARE @input string = "somejsonfile.json";

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

@data0 =
    EXTRACT IPAddresses string
    FROM @input
    USING new JsonExtractor("Devices[*]");

USING json = [Microsoft.Analytics.Samples.Formats.Json.JsonExtractor];

@data1 =
    EXTRACT IPAddresses string
    FROM @input
    USING new json("Devices[*]");

Allows shortening and disambiguating C# namespace and class names.

Overlapping Range Aggregation

Input:

Start Time - End Time - User Name
5:00 AM - 6:00 AM - ABC
5:00 AM - 6:00 AM - XYZ
8:00 AM - 9:00 AM - ABC
8:00 AM - 10:00 AM - ABC
10:00 AM - 2:00 PM - ABC
7:00 AM - 11:00 AM - ABC
9:00 AM - 11:00 AM - ABC
11:00 AM - 11:30 AM - ABC
11:40 PM - 11:59 PM - FOO
11:50 PM - 0:40 AM - FOO

Result:

Start Time - End Time - User Name
5:00 AM - 6:00 AM - ABC
5:00 AM - 6:00 AM - XYZ
7:00 AM - 2:00 PM - ABC
11:40 PM - 0:40 AM - FOO

https://blogs.msdn.microsoft.com/azuredatalake/2016/06/27/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos

U-SQL:

@r = REDUCE @in
     PRESORT begin                               // Presort input rowset
     ON user                                     // Partition and scale out
     PRODUCE begin DateTime, end DateTime, user string
     READONLY user                               // Declare passthrough
     USING new ReduceSample.RangeReducer();      // User-defined Reducer

Code behind:

namespace ReduceSample
{
    [SqlUserDefinedReducer(IsRecursive = true)]
    public class RangeReducer : IReducer
    {
        public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
        {
            // Init aggregation values
            int i = 0;
            var begin = DateTime.MaxValue;
            var end = DateTime.MinValue;

            foreach (var row in input.Rows)
            {
                ...
                begin = row.Get<DateTime>("begin");
                end = row.Get<DateTime>("end");
                ...
                output.Set<DateTime>("begin", begin);
                output.Set<DateTime>("end", end);
                yield return output.AsReadOnly();
                ...
            } // foreach
        } // Reduce
    } // class RangeReducer
} // namespace ReduceSample

Notes on the reducer code:

• IsRecursive = true: provides better scale, requires an associative operation
• Implement IReducer
• Input rowset partition
• Get input column
• Set output column
• Accumulate rows
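The slide elides the body of the reducer loop. The following is a minimal sketch of the merge logic in plain C#, outside of U-SQL, and is not the code from the talk or the linked blog post. The names RangeMerger and Merge are hypothetical; GroupBy stands in for the ON user partitioning and OrderBy for PRESORT begin. It assumes that ranges which merely touch (one ends exactly when the next begins) should be merged, which matches the 7:00 AM - 2:00 PM result shown earlier.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RangeMerger
{
    // Merge overlapping (or touching) time ranges separately for each user.
    public static List<(DateTime Begin, DateTime End, string User)> Merge(
        IEnumerable<(DateTime Begin, DateTime End, string User)> rows)
    {
        var result = new List<(DateTime Begin, DateTime End, string User)>();

        // One group per user, processed in order of range start.
        foreach (var group in rows.GroupBy(r => r.User))
        {
            bool first = true;
            DateTime begin = default;
            DateTime end = default;

            foreach (var row in group.OrderBy(r => r.Begin))
            {
                if (first)
                {
                    begin = row.Begin;
                    end = row.End;
                    first = false;
                }
                else if (row.Begin <= end)
                {
                    // Row starts inside the current range: extend it if needed.
                    if (row.End > end) end = row.End;
                }
                else
                {
                    // Gap found: emit the finished range and start a new one.
                    result.Add((begin, end, group.Key));
                    begin = row.Begin;
                    end = row.End;
                }
            }

            // Emit the last open range of the group.
            if (!first) result.Add((begin, end, group.Key));
        }
        return result;
    }
}
```

Because merging ranges is associative, the same logic can be applied again to partial results, which is the property the IsRecursive = true setting relies on.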

JSON Processing

How do I extract data from JSON documents?

https://github.com/Azure/usql/tree/master/Examples/DataFormats

Architecture of Sample Format Assembly

• Single JSON document per file: Use JsonExtractor
• Multiple JSON documents per file:
  • Do not allow CR/LF (row delimiter) in JSON
  • Use built-in Text Extractor to extract
  • Use JsonTuple to schematize (with CROSS APPLY)
• Currently loads full JSON document into memory
  • Better to use JSONReader processing if docs are large

Assemblies: Microsoft.Analytics.Samples.Formats, Newtonsoft.Json, System.Xml

// JsonExtractor argument: JPath expression mapping objects to rows.
// Column names: keys to fields relative to the objects in JsonExtractor.
@json =
    EXTRACT personid int,
            name string,
            addresses string
    FROM @input
    USING new Json.JsonExtractor("[*].person");

// JsonTuple generates 1-level key/value pairs as SqlMap;
// ["address"] gets the value from the map as string.
@person =
    SELECT personid,
           name,
           Json.JsonFunctions.JsonTuple(addresses)["address"] AS address_array
    FROM @json;

// Convert string array into map and pivot all Values into rows (CROSS APPLY EXPLODE);
// JsonTuple(address) gets the object map for each array item.
@addresses =
    SELECT personid,
           name,
           Json.JsonFunctions.JsonTuple(address) AS address
    FROM @person
         CROSS APPLY
             EXPLODE (Json.JsonFunctions.JsonTuple(address_array).Values) AS A(address);

// Get desired keys from object map.
@result =
    SELECT personid,
           name,
           address["addressid"] AS addressid,
           address["street"] AS street,
           address["postcode"] AS postcode,
           address["city"] AS city
    FROM @addresses;
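To make the flattening steps concrete, here is a rough plain C# sketch that produces one row per person and address. It uses System.Text.Json rather than the Newtonsoft.Json based sample assembly, so it is an analogy to the U-SQL script, not the sample's implementation. The input shape (a top-level array of objects with a "person" property whose "addresses" object holds an "address" array) is an assumption inferred from the JPath expression and keys in the script; PersonFlattener and AddressRow are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public sealed class AddressRow
{
    public int PersonId { get; set; }
    public string Name { get; set; }
    public string AddressId { get; set; }
    public string Street { get; set; }
    public string Postcode { get; set; }
    public string City { get; set; }
}

public static class PersonFlattener
{
    // Produce one row per (person, address) pair from the assumed document shape.
    public static List<AddressRow> Flatten(string json)
    {
        var rows = new List<AddressRow>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            // Corresponds to the JPath "[*].person": each array item's person object.
            foreach (JsonElement item in doc.RootElement.EnumerateArray())
            {
                JsonElement person = item.GetProperty("person");
                int personId = person.GetProperty("personid").GetInt32();
                string name = person.GetProperty("name").GetString();

                // Corresponds to CROSS APPLY EXPLODE over the address array.
                JsonElement addresses = person.GetProperty("addresses").GetProperty("address");
                foreach (JsonElement a in addresses.EnumerateArray())
                {
                    // ToString() yields the text of string and number values alike,
                    // similar to the string-valued map in the U-SQL script.
                    rows.Add(new AddressRow
                    {
                        PersonId = personId,
                        Name = name,
                        AddressId = a.GetProperty("addressid").ToString(),
                        Street = a.GetProperty("street").ToString(),
                        Postcode = a.GetProperty("postcode").ToString(),
                        City = a.GetProperty("city").ToString()
                    });
                }
            }
        }
        return rows;
    }
}
```

Like the sample JsonExtractor, this loads the whole document into memory, so the same caveat about large documents applies.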

Image Processing

Copyright | Camera Make | Camera Model | Thumbnail
Michael | Canon | 70D | (image)
Michael | Samsung | S7 | (image)

https://github.com/Azure/usql/tree/master/Examples/ImageApp

Image processing assembly
• Uses System.Drawing
• Exposes
  • Extractors
  • Outputter
  • Processor
  • User-defined Functions

Trade-offs
• Column memory limits: Image Extractor vs Feature Extractor
• Main memory pressures in vertex: UDFs vs Processor vs Extractor

R Processing

KMeans Centroids

Architecture: U-SQL Processing with R

• R Programmer Assembly: KMeansRReducer
• R Engine (Runtime)
  • R to .NET interop (RDotNet.dll & RDotNet.NativeLib.dll)
  • R Runtime (R-bin.zip)
  • R Engine Manager Utility (RUtilities.dll)

• Similar approaches can be done for deploying other runtimes: Python, JavaScript, JVM
• No external access from UDOs
• Future work:
  • More generic samples
  • More automatic experiences (no user wrappers/deploys)

Summary of U-SQL UDOs

What are UDOs?

Custom operator extensions written in .NET (C#), scaled out by U-SQL

UDO Tips and Warnings

• Tips when using UDOs:
  • READONLY clause to allow pushing predicates through UDOs
  • REQUIRED clause to allow column pruning through UDOs
  • PRESORT on REDUCE if you need global order
  • Hint cardinality if the optimizer chooses the wrong plan

• Warnings and better alternatives:
  • Use SELECT with UDFs instead of PROCESS
  • Use User-defined Aggregators instead of REDUCE
  • Learn to use Windowing Functions (OVER expression)

• Good use-cases for PROCESS/REDUCE/COMBINE:
  • The logic needs to dynamically access the input and/or output schema, e.g., to create a JSON doc for the data in the row where the columns are not known a priori.
  • Your UDF-based solution creates too much memory pressure and you can write your code more memory-efficiently in a UDO.
  • You need an ordered Aggregator or need to produce more than 1 row per group.

Additional Resources

• Blogs and community page:
  • http://usql.io (U-SQL Github)
  • http://blogs.msdn.microsoft.com/azuredatalake/
  • http://blogs.msdn.microsoft.com/mrys/
  • https://channel9.msdn.com/Search?term=U-SQL#ch9Search

• Documentation and articles:
  • http://aka.ms/usql_reference
  • https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
  • https://msdn.microsoft.com/en-us/magazine/mt614251

• ADL forums and feedback:
  • http://aka.ms/adlfeedback
  • https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
  • http://stackoverflow.com/questions/tagged/u-sql

© 2016 Microsoft Corporation. All rights reserved.