Data and AI LATAM 2018 - microsoftsessions.com Data_Science_wi… · T-SQL. Production Apps....
-
Upload
doankhuong -
Category
Documents
-
view
218 -
download
0
Transcript of Data and AI LATAM 2018 - microsoftsessions.com Data_Science_wi… · T-SQL. Production Apps....
Streamline Productivity and Simplify DeploymentLa parte de imagen con el identificador de relación rId3 no se encontró en el archivo.
Applications
Applications
La parte de imagen con el identificador de relación rId5 no se encontró en el archivo.
MODEL
Scoring
Separate service or embedded logic
La parte de imagen con el identificador de relación rId5 no se encontró en el archivo.
MODELData Transformations
Model Training
Model
Model Training
Data Transformations
Scoring
SQL Server
Analytics server
IEEE Spectrum Top Programming Languages
IEEE Spectrum, July 2017 KDnuggets Top Data Science Tools, 2017
http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster
Eliminate data movement
Operationalize scripts and models
Enterprise grade performance and scaleExtensibility
EXEC sp_execute_external_script@language =N'R',@script =N'print(paste("Hello World from:", Revo.version$version.string));'
EXEC sp_execute_external_script@language =N'Python',@script =N'import sysprint ("Hellow World from:", sys.version)'
Augments R & Python with parallelized, distributed algorithmsProvides in-database execution of scripts and algorithms usedParallel algorithms overcome Python and R memory limitations
Reduces security risks by keeping data in-databaseCreates and consumes portable models
One call, one answerArbitrarily large data setsArbitrarily large worker task setMathematically the same as single-threadedPlatform independentMost are written in C++ for speed
1. Algorithm begins initiator process2. Initiator distributes work to nodes3. Finalizer collects results4. Finalizer iterates or continues5. Finalizer evaluates final model6. Returns single model to calling script
Load a large dataset
Run a RevoScaleRor RevoScalePy
algorithm
RevoScaleR & RevoScalePy Algorithms and Functions
Data Larger than RAM
One callOne or many models returnedArbitrarily large data setsArbitrarily large worker task set
Load a large dataset
Run a RevoScaleRor RevoScalePy
algorithm
RevoScaleR & RevoScalePy Algorithms and Functions
Data Larger than RAM
Augments RevoScaleRFast learnersDeep learning algorithmsEnsemble results using rxEnsemble
Executes RevoScaleR algos on remote data & CPUs“rxSetComputeContext” redirects to remoteAlgorithms in RevoScaleR library redirect as setResults are returned to script as though local
1. Algorithm on local checks compute context2. If set remote, packages and ships request3. Local script blocks (by default) awaiting response4. Remote unpacks and executes in parallel5. Remote returns results to local interface6. Local interface returns results to script
Load a large dataset
Run a RevoScaleRor RevoScalePy
algorithm
SQL Server, Teradata1, Hadoop1, Hadoop MapReduce1, HDInsight1
1R only
Supports custom, multi-layer network topology with filtered, convolutional, and pooling bundles
Binary classificationMulti-class classificationRegression
Bing Ads Click Prediction ($50M per year revenue gain);Image Classification
L1, L2 regularization Binary classificationMulti-class classification
Classifying user feedback
Easy to train learner for anomaly detection
Anomaly Detection Fraud detection
Boosted decision tree. Similar to XGBoost. Supports up to ~100K features
Binary classificationRegression
One of the most popular and best performing learners inside Microsoft
state-of-the-art tree ensembles (Random Forest) Supports up to ~100K features.
Binary classificationRegression
Churn Prediction
Speed, scalability and supports L1,L2 regularization. Supports up to 1B features!
Binary classification, Regression
Outlook used for email spam filtering
Battle tested, large language support, performant (Bing, Office)
Performs natural language processing of free text into numerical representation
Support ticket classification,Sentiment analysis
Ease of use; 1 line of code to set Converts categories into numerical data
Ad Click Prediction
Ease of use Selects a subset of features to speed up training time
Sentiment analysis,Ad Click Prediction
SQL Server 2017
Key Points:• Classical pattern of pulling data out
of database to a separate modeling environment
• Data scientists will already be familiar with this approach, so it's something to build on
RemoteExecutionContext
Results
ParallelWorker Tasks
RevoScaleR& RevoScalePy
Parallel Algorithms
Iterate/ Sequence
SQL Server 2016/17
RemoteExecutionContext
Results
ParallelWorker Tasks
RevoScaleR& RevoScalePy
Parallel Algorithms
Iterate/ Sequence
SQL Server 2016/17
Key Points:• Fast and friendly for existing
R/Python users• Limited to what can be done through
a SQL Compute Context• Requires external script execute
permissions
SQL Server 2016/17 Run Python & R
From within the Query
Processor
Run R and Python from SQL environments
T-SQL Apps
T-SQL Script
SQL Server 2016/17 Run Python & R
From within the Query
Processor
Run R and Python from SQL environments
T-SQL Apps
T-SQL Script
Key Points:• Works with all of R/Python functions
for maximum flexibility• Most natural for users with some SQL
familiarity, but doable for all
T-SQL Stored ProcedureSQL
Server 2016/17
Enable smart non-R apps
BI & Reporting; Web apps
T-SQL Script
T-SQL Stored ProcedureSQL
Server 2016/17
Enable smart non-R apps
BI & Reporting; Web apps
T-SQL Script
Key Points:• Helper functions are available so you
don't have to manually recode and R/Python script
• Allows firing via traditional triggers
Events
T-SQLProduction
AppsModels
SQL Server 2017
Real time scoring engine
Stored Proc’sand Triggers
Events
Events
T-SQLProduction
AppsModels
SQL Server 2017
Real time scoring engine
Stored Proc’sand Triggers
Events
Key Points:• R/Python need not be installed• Works with many of the RevoScale*
and MicrosoftML models• Works on SQL 2017 and Azure SQL
DB• single millisecond response times
Integration with SQL query execution• Parallel query pushing data to multiple external processes / threads• Use in-memory technology and Columnstore Indexes alongside your ML
scripts
Streaming mode execution• Stream data in batches to the R/Python process to scale beyond available memory
Train and Predict using parallelism• Leverage RevoScaleR/revoscalepy and scale your R and Python scripts using
multi-threading and parallel processing
Native scoring for faster real-time predictions (New in 2017)
No dependency between rows (ex: scoring) – Trivial Parallelism
sp_execute_external_script@script =
N’Predict…’, @parallel = 1
(MAXDOP = 2)
exec sp_execute_external_script@language = N'R’, @script = N'
# unserialize modellogitObj <- unserialize(modelbin);
# build classification model to predict tipped or notsystem.time(OutputDataSet <- data.frame(predict(logitObj, newdata = InputDataSet,
type = "response")))[3];‘
, @input_data_1 = N’SELECT tipped, passenger_count, trip_time_in_secs, trip_distance, d.direct_distanceFROM dbo.nyctaxi_sample TABLESAMPLE (50 PERCENT) REPEATABLE (98074)CROSS APPLY [CalculateDistance](pickup_latitude, pickup_longitude,
dropoff_latitude, dropoff_longitude) as dOPTION(MAXDOP 2) -- Needed only to control DOP’
, @parallel = 1, @params = N'@modelbin varbinary(max), @r_rowsPerRead int’, @modelbin = @model, @r_rowsPerRead = 5000;
Requirements:No dependency between rows (ex: scoring)
Key Benefits:Execute script over chunks of dataProcess data that doesn’t fit in memoryCan be used from client (rx* function) or server
5000
5000
5000
Dataset = 15000 RowsSp_execute_external_script
@r_rowsPerRead = 5000
exec sp_execute_external_script@language = N'R’, @script = N'
# unserialize modellogitObj <- unserialize(modelbin);
# build classification model to predict tipped or notsystem.time(OutputDataSet <- data.frame(predict(logitObj, newdata = InputDataSet,
type = "response")))[3];‘
, @input_data_1 = N’SELECT tipped, passenger_count, trip_time_in_secs, trip_distance, d.direct_distanceFROM dbo.nyctaxi_sample TABLESAMPLE (50 PERCENT) REPEATABLE (98074)CROSS APPLY [CalculateDistance](pickup_latitude, pickup_longitude,
dropoff_latitude, dropoff_longitude) as d’
, @params = N'@modelbin varbinary(max), @r_rowsPerRead int’, @modelbin = @model, @r_rowsPerRead = 5000;
exec sp_execute_external_script@language = N'R’, @script = N'
# Define the connection stringconnStr <- paste("Driver=SQL Server;Server=", instance_name, ";Database=", database_name, ";Trusted_Connection=true;", sep="");
# Set ComputeContextcc <- RxInSqlServer(connectionString = connStr, numTasks = 4);
# Pull data from queryfeatureDataSource = RxSqlServerData(sqlQuery = input_query, connectionString = connStr, computeContext = cc);
# Table to write data to, using compute contexttipPredictions = RxSqlServerData(table = "nyc_taxi_tip_predictions", connectionString = connStr);
# Unserialize modellogitObj <- unserialize(modelbin);
# Predict tipped or not based on modelPredictions -> rxPredict(logitObj, data = featureDataSource, outData = tipPredictions, overwrite = TRUE);’
, @params = N'@input_query nvarchar(max)’, @input_query = N'SELECT * FROM nyctaxi_training_sample'
rxCall…
rxCall…sp_execute_ext
ernal_script@script =
N’rxLogit…’, @input_data_1
= N’SELECT….’
(MAXDOP = 2)
+BxlServer
+BxlServer
m1+ m2
<Model Object>
@modelSELECT native_modelFROM models
WHERE model_name = 'Fraud Detection Model’
PREDICT MODEL = @model DATA = new_transaction
-- Check/set External Resource Pool configSELECT * FROM sys.resource_governor_resource_pools WHERE name = 'default'SELECT * FROM sys.resource_governor_external_resource_pools WHERE name = 'default'ALTER RESOURCE POOL "default" WITH (max_memory_percent = 60);ALTER EXTERNAL RESOURCE POOL "default" WITH (max_memory_percent = 80);ALTER RESOURCE GOVERNOR RECONFIGURE; -- enforce changes
DMV Description
sys.dm_exec_requests New column: external_script_request_id
sys.dm_external_script_requests Returns running external scripts, DOP & assigned user account
sys.dm_external_script_execution_stats Number of executions for rx* functions in RevoScaleR package
sys.dm_os_performance_counters New “External Scripts” performance counters
https://docs.microsoft.com/en-us/sql/advanced-analytics/r/new-components-in-sql-server-to-support-r
Reduced surface area and isolation
‘external scripts enabled’ required
R/Python script execution outside of SQL Server process
space
Script execution requires explicit
permission
sp_execute_external_script requires EXECUTE ANY
EXTERNAL SCRIPT for non-admins
SQL Server login/user required and db/table access
R/Python processes have limited privileges
R/Python processes run under local user accounts in the
SQLRUserGroup
Each execution is isolated. Different users with different
accounts
Windows firewall rules to block outbound traffic
''
Examples using R:
sqlPackages <- rxInstalledPackages(fields = c("Package","Version", "Built"), computeContext = sqlServer)
pkgs <- c("ggplot2")rxInstallPackages(pkgs = pkgs, verbose = TRUE, scope = "private", computeContext = sqlServer)
pkgs <- c("ggplot2")rxRemovePackages(pkgs = pkgs, verbose = TRUE, scope = "private", computeContext = sqlServer)
Example using T-SQL:
EXEC sp_execute_external_script @language=N'R', @script=N'myPackages <- rxInstalledPackages();OutputDataSet <- as.data.frame(myPackages);'
• Azure SQL Database• R support• Python support
• Machine Learning Services in SQL Server on Linux• Additional algorithms and pre-trained models• Native Scoring for more models
R Services ML ServicesAKA.MS/MLSQLDEV
SSMS Reports for ML ServicesML cheat sheetHospital length of Stay demo scripts
SQL Server Machine Learning Services
Learning and Scoring Process
ImagesFeaturization (using pre-trained
ResNet18 neural network model)
FeaturesClassification
Algorithm(Boosted Tree)
Classifier Model
LearningLabels
Images Features
Scoring
PredictionsFeaturization (using pre-trained
ResNet18 neural network model)
Classification
Distributed Featurization and Training On HD-InsightSQL Server
Edge
Distributed Featurization
CT Scan Images
Azure Blob StorageClassifier Training
Featurization
ModelsTable
HDInsight-MRS
FeaturizationScoring
with the classifier model
Web App
Diagnosis: 35% certainty
Stored Procedures with R Code
Scoring with Deep Learning Model in SQL
Stored Procedure
call
Model table,Features table,
New Images table
SQL Server