Data Science at Scale on MPP databases - Use Cases & Open Source Tools
![Page 1: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/1.jpg)
© Copyright 2016 Pivotal. All rights reserved.
Esther Vasiete, Pivotal Data Scientist
Structure Data 2016
Data Science at Scale on MPP Databases – Use Cases & Open Source Tools
Joint work with Pivotal Data Science
![Page 2: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/2.jpg)
Agenda
• Introduction
• Open Source Data Science Toolkit
• Real world applications
  - Predictive maintenance of automobiles
  - Predicting insurance claims
  - Predicting customer churn
• Data science deep-dive with Jupyter notebooks
  - Text analytics on MPP (github.com/vatsan)
  - Image processing on MPP (github.com/gautamsm)
![Page 3: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/3.jpg)
Pivotal Data Science
Our Charter: Pivotal Data Science is Pivotal’s differentiated and highly opinionated data-centric service delivery organization (part of Pivotal Labs).
Our Goals:
• Expedite customer time-to-value and ROI by driving business-aligned innovation and solutions assurance within Pivotal’s Data Fabric technologies.
• Drive customer adoption and autonomy across the full spectrum of Pivotal Data technologies through best-in-class data science and data engineering services, with a deep emphasis on knowledge transfer.
[Figure: Data Science, Data Engineering, App Dev]
![Page 4: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/4.jpg)
Pivotal Data Science Knowledge Development
![Page 5: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/5.jpg)
Use Case: Preventive Maintenance for Connected Vehicles
• Customer vehicles transmit Diagnostic Trouble Codes (DTCs) and vehicle status data to the Pivotal analytics environment
• Can the DTC data be leveraged to predict the presence of potential problems in vehicles?
• Set up a data science framework on the Pivotal analytics environment that would enable the customer data science team to continuously monitor problems in their vehicles using DTC data
![Page 6: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/6.jpg)
Problem Setup – Predicting Job Type from Diagnostic Trouble Codes (DTCs)
[Figure: timeline (axis: Time) of DTC observations (codes B, P, C, U) and repair jobs (Job Types: Transmission, Engine, Body). Question posed for each job: can the DTCs observed here predict this Job Type?]
![Page 7: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/7.jpg)
Data Parallelism
• One or more jobs on the same day: a multi-labeling problem
• One-vs-rest classifiers built in parallel
[Figure: One-vs-Rest Classification with 0/1 label vectors for Class 1, Class 2 and Class 3: Red vs. Non Red on Segment 1, Green vs. Non Green on Segment 2, Blue vs. Non Blue on Segment N]
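A minimal sketch of how a multi-label problem can be turned into one-vs-rest binary targets, one per job type. The records and the helper name `one_vs_rest_targets` are hypothetical and only illustrate the idea; on the MPP platform each binary model would be trained on its own segment.

```python
# Hypothetical toy data: (DTC categories seen, job types performed on that day).
records = [
    ({"B"}, {"Body"}),
    ({"B", "P", "C"}, {"Transmission", "Engine"}),
    ({"U"}, {"Transmission"}),
    ({"P", "B", "U"}, {"Engine"}),
]


def one_vs_rest_targets(records):
    """Return one binary target vector per job type (1 = this type, 0 = rest)."""
    classes = sorted({job for _, jobs in records for job in jobs})
    return {
        cls: [1 if cls in jobs else 0 for _, jobs in records]
        for cls in classes
    }


targets = one_vs_rest_targets(records)
for cls, y in targets.items():
    # Each (cls, y) pair defines an independent binary classification task.
    print(cls, y)
```

Because the binary tasks do not depend on each other, they can be fitted at the same time, which is what makes the approach a natural fit for data parallelism.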
![Page 8: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/8.jpg)
Model Scoring Pipeline
[Figure: incoming DTCs (e.g. B; B, P, C; U) move from Ingest to Sink; models cached from GPDB/HAWQ (Model Caching) score each job type (Body, Axle, Engine) in real time and flag it when Prob >= Threshold; results feed a web or mobile app dashboard]
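The thresholding step in the pipeline can be sketched as follows. The probabilities, the per-class thresholds and the function name `predict_job_types` are assumptions made for illustration; the slide only states that a job type is flagged when Prob >= Threshold.

```python
def predict_job_types(probabilities, thresholds):
    """Flag every job type whose predicted probability meets its threshold."""
    return sorted(
        job for job, prob in probabilities.items() if prob >= thresholds[job]
    )


# Hypothetical scores from three one-vs-rest models for a single vehicle.
scores = {"Body": 0.82, "Axle": 0.10, "Engine": 0.55}
# Hypothetical thresholds; each class can carry its own cut-off.
thresholds = {"Body": 0.5, "Axle": 0.5, "Engine": 0.6}

flagged = predict_job_types(scores, thresholds)
print(flagged)
```

Keeping one threshold per class lets each model trade precision against recall independently.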
![Page 9: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/9.jpg)
MPP Architectural Overview
Think of it as multiple PostgreSQL servers
[Figure: a Master node and Segments/Workers]
Rows are distributed across segments by a particular field (or randomly)
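A rough sketch of distributing rows by a field. The CRC32 hash, the `vehicle_id` column and the segment count are stand-ins chosen for illustration, not the hash function or schema the database actually uses.

```python
import zlib
from collections import Counter


def segment_for(key, num_segments):
    """Map a distribution key to a segment index (CRC32 is only a stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


# Hypothetical rows keyed by vehicle_id.
rows = [{"vehicle_id": f"V{i:04d}", "dtc": "B"} for i in range(1000)]

# Count how many rows land on each of 4 hypothetical segments.
placement = Counter(segment_for(row["vehicle_id"], 4) for row in rows)
print(dict(placement))
```

Rows that share a key always land on the same segment, so per-key work such as grouping by vehicle can run locally on each segment.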
![Page 10: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/10.jpg)
IT TAKES MORE THAN ONE TOOL
![Page 11: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/11.jpg)
Open Source Data Science Toolkit
[Figure: tool logos grouped under KEY LANGUAGES, KEY TOOLS (Modeling Tools, Visualization Tools) and PLATFORM; legible labels: MLlib, PL/X, Pivotal Big Data Suite, GemFire]
![Page 12: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/12.jpg)
Scalable, In-Database Machine Learning
Apache MADlib (incubating)
• Open source: https://github.com/madlib/madlib
• Works on Greenplum DB, Apache HAWQ and PostgreSQL
• In active development by Pivotal
• MADlib is now an Apache Software Foundation incubator project!
![Page 13: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/13.jpg)
Functions
Predictive Analytics Library (Aug 2015), @MADlib_analytic

Supervised Learning
  Regression Models
  • Cox Proportional Hazards Regression
  • Elastic Net Regularization
  • Generalized Linear Models
  • Linear Regression
  • Logistic Regression
  • Marginal Effects
  • Multinomial Regression
  • Ordinal Regression
  • Robust Variance, Clustered Variance
  • Support Vector Machines
  Tree Methods
  • Decision Tree
  • Random Forest
  Other Methods
  • Conditional Random Field
  • Naïve Bayes

Unsupervised Learning
  • Association Rules (Apriori)
  • Clustering (K-means)
  • Topic Modeling (LDA)

Statistics
  Descriptive
  • Cardinality Estimators
  • Correlation
  • Summary
  Inferential
  • Hypothesis Tests
  Other Statistics
  • Probability Functions

Time Series
  • ARIMA

Data Types and Transformations
  • Array Operations
  • Dimensionality Reduction (PCA)
  • Encoding Categorical Variables
  • Matrix Operations
  • Matrix Factorization (SVD, Low Rank)
  • Norms and Distance Functions
  • Sparse Vectors

Model Evaluation
  • Cross Validation

Other Modules
  • Conjugate Gradient
  • Linear Solvers
  • PMML Export
  • Random Sampling
  • Term Frequency for Text
![Page 14: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/14.jpg)
Use Case: Predicting insurance claim amounts using structured and unstructured data
• Using features from structured and unstructured data sources associated with claims, build the capability to predict claim amounts
![Page 15: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/15.jpg)
Text analytics on MPP
• Unstructured data in the form of claim comments and claim descriptions (text)
• Use a bag-of-words approach (unigrams, bigrams)
• tf-idf for more meaningful insights
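A small sketch of the bag-of-words plus tf-idf idea on unigrams and bigrams. The claim comments are invented, and the weighting shown (relative term frequency times log(N / document frequency)) is one common tf-idf variant, not necessarily the one used in the walkthrough notebook.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Return the unigrams and bigrams of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tf_idf(docs):
    """Compute tf-idf weights per document: (count / total) * log(N / df)."""
    counts = [Counter(ngrams(doc)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append(
            {t: (f / total) * math.log(n_docs / df[t]) for t, f in c.items()}
        )
    return weights


# Hypothetical claim comments.
docs = [
    "water damage in kitchen",
    "water damage in basement",
    "rear bumper damage",
]
weights = tf_idf(docs)
# Show the three highest-weighted terms of the first comment.
print(sorted(weights[0].items(), key=lambda kv: -kv[1])[:3])
```

A term that occurs in every comment, such as "damage" here, receives zero weight, while terms specific to one claim are weighted most heavily.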
![Page 16: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/16.jpg)
Code walkthrough: Text analytics on MPP
github.com/vatsan/text_analytics_on_mpp/tree/master/vector_space_models
We’ll walk through this Jupyter notebook
![Page 17: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/17.jpg)
Use Case: Churn prediction
• Build a churn model to predict which customers are most likely to churn
• Provide insights into key factors responsible for churn to potentially intervene prior to churn
![Page 18: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/18.jpg)
Usage Time Series Data
• Aggregate weekly usage by user
• Compute descriptive statistics
• Extract features based on business expertise
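The first two steps above can be sketched as follows. The event tuples, the use of ISO weeks and the chosen statistics are assumptions for illustration; in the database the same aggregation would typically be a GROUP BY over user and week.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev

# Hypothetical usage events: (user, day, minutes of usage).
events = [
    ("u1", date(2016, 1, 4), 30.0),
    ("u1", date(2016, 1, 6), 10.0),
    ("u1", date(2016, 1, 12), 20.0),
    ("u2", date(2016, 1, 5), 5.0),
]


def weekly_usage(events):
    """Sum usage per (user, ISO year, ISO week)."""
    totals = defaultdict(float)
    for user, day, minutes in events:
        iso_year, iso_week, _ = day.isocalendar()
        totals[(user, iso_year, iso_week)] += minutes
    return dict(totals)


def user_features(weekly):
    """Compute simple descriptive statistics over each user's weekly totals."""
    per_user = defaultdict(list)
    for (user, _, _), total in sorted(weekly.items()):
        per_user[user].append(total)
    return {
        user: {
            "weeks": len(values),
            "mean": mean(values),
            "std": pstdev(values),
            "max": max(values),
        }
        for user, values in per_user.items()
    }


weekly = weekly_usage(events)
features = user_features(weekly)
print(features)
```

Features based on business expertise, the third step, would be added as further columns alongside these statistics.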
![Page 19: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/19.jpg)
Open Source Analytics Ecosystem
Companies benefit from algorithmic breadth and scalability for building and socializing data science models
[Figure: Algorithms and Visualization tool logos, including MLlib and PL/X]
Best-of-breed in-memory and in-database tools for an MPP platform
![Page 20: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/20.jpg)
Data Parallelism through PL/X: X in Python, R, Java, C/C++ and pgSQL
• For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R, pgSQL or C/C++
• The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment
• plpython and python are loaded as dynamic libraries on the master and segment nodes (libpython.so and plpython.so are under $GPHOME/ext/python)
[Figure: SQL enters at the Master Host (with a Standby Master), which connects through the Interconnect to Segment Hosts, each running multiple Segments]
![Page 21: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/21.jpg)
User Defined Functions (UDFs) in PL/Python
• Procedural languages need to be installed on each database used.
• The syntax is like a normal Python function with the function definition line replaced by a SQL wrapper. Alternatively, it is like a SQL User Defined Function with Python inside.

    CREATE FUNCTION seasonality (x float[])
    RETURNS float[]
    AS $$
        import statsmodels.api as sm
        s = sm.tsa.seasonal_decompose(x).seasonal
        return s
    $$ LANGUAGE plpythonu;

The CREATE FUNCTION ... AS $$ header and the closing $$ LANGUAGE plpythonu; line are the SQL wrapper; the body between them is normal Python.
![Page 22: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/22.jpg)
Usage Time Series Data with PL/X
• Easily harness your UDF with open source libraries (for machine learning, signal processing...)
• Runs at scale through data parallelism
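A plain-Python sketch of what data parallelism through a UDF amounts to: the rows are grouped by key and the same function is applied to each group independently. The `centered` function is a hypothetical stand-in for a real UDF such as a seasonality computation, and `apply_per_group` only imitates, on one machine, what the segments would do in parallel.

```python
from collections import defaultdict


def centered(series):
    """Stand-in for a per-series UDF: subtract the series mean from each value."""
    m = sum(series) / len(series)
    return [x - m for x in series]


def apply_per_group(rows, udf):
    """Group (key, value) rows by key and apply the UDF to each group."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    # Groups are independent of each other, so on an MPP database each
    # segment could process its own groups without coordination.
    return {key: udf(values) for key, values in groups.items()}


# Hypothetical (user, weekly usage) rows.
rows = [("u1", 1.0), ("u1", 3.0), ("u2", 10.0)]
result = apply_per_group(rows, centered)
print(result)
```

Since no group needs data from another group, adding segments increases throughput without changing the function itself.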
![Page 23: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/23.jpg)
Code walkthrough: Image processing on MPP
github.com/gautamsm/data-science-on-mpp/tree/master/image_processing
In-database Canny edge detection with OpenCV inside a PL/C function
![Page 24: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/24.jpg)
Pivotal Data Science Blogs
1. Scaling native (C++) apps on Pivotal MPP
2. Predicting commodity futures through Tweets
3. A pipeline for distributed topic & sentiment analysis of tweets on Greenplum
4. Using data science to predict TV viewer behavior
5. Twitter NLP: Scaling part-of-speech tagging
6. Distributed deep learning on MPP and Hadoop
7. Multi-variate time series forecasting
8. Pivotal for good – Crisis Textline
http://blog.pivotal.io/data-science-pivotal
![Page 25: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/25.jpg)
Thank You!
![Page 26: Data Science at Scale on MPP databases - Use Cases & Open Source Tools](https://reader031.fdocuments.us/reader031/viewer/2022022203/5873b7ee1a28abbc788b4ce5/html5/thumbnails/26.jpg)
A NEW PLATFORM FOR A NEW ERA