Machine Learning Streams with Spark 1.0

Post on 11-Aug-2014

Transcript of Machine Learning Streams with Spark 1.0

Seattle Spark Meetup
Machine Learning Streams with Spark 1.0
Drew Minkin, Principal Program Manager, Ubix Labs

A Frost Venture Partners Company 01.14 | Revision 10.0 | Confidential and Proprietary Information

AGENDA

•  Machine Learning and Business Analytics
•  Streams and Real Time Analytics
•  Deep Dive into MLlib

Machine Learning and Business Analytics

Machine Learning is Not A Spectator Sport

Machine Learning and Data Science

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

The Analytics Spectrum

http://halobi.com/wp-content/uploads/Blog-1-1024x600.png

[Figure: the Analytics Spectrum. Axes: Reactive to Proactive; Production to Research. Stages: Descriptive, Predictive, Prescriptive.]

Figure labels: Graph, Data Management, Simulation, Process Improvement, Content Delivery, Knowledge Management, Data Modeling, Visualization, Data Quality, Monitoring, Analysis, Optimization, Algorithms, Trialing, Statistics, Domain Expertise, Integration, Big Data, Collaboration

Five Families of Algorithms

http://en.wikipedia.org/wiki/Wu_Xing

Association

Classification

Estimation

Forecasting

Clustering

Classification

http://akorra.com/2012/06/06/top-10-creatures-that-influenced-martial-arts/

§  Target a Discrete Answer – Yes/No
§  Find All Columns Driving its Value
§  Use model to score new records

§  Many Different Measures of Accuracy
§  Quick and Improving Iterations
§  Most Actionable Types of Models

Examples
§  Hospital Readmission
§  Equipment Failure
§  Likelihood to purchase

Credit Scoring Banding
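The "many different measures of accuracy" point can be made concrete with a small sketch. This is plain Python, not MLlib code, and the labels and predictions are made-up illustration data. It computes three common measures for a yes/no model: accuracy, precision and recall.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy_measures(actual, predicted):
    """Return accuracy, precision and recall for a binary classifier."""
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical outcomes (1 = yes, 0 = no) and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
measures = accuracy_measures(actual, predicted)
print(measures)
```

Which measure matters depends on the cost of each kind of error; for something like equipment failure, a missed failure (low recall) is often more expensive than a false alarm.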

Association and Sequencing

http://38.media.tumblr.com/tumblr_m81wcfIO3V1qmzwx0o1_1280.jpg

Examples
§  Collaborative Filtering
§  Identify cross-sell
§  Identify sequential, next-sale
§  Make purchase recommendations
§  Complex event associations

§  Transactions and items in
§  Rules, Sequences and Itemsets out

Recommender Systems
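"Transactions and items in, rules and itemsets out" can be sketched in a few lines. The example below is plain Python with made-up baskets and hypothetical thresholds; it is a brute-force illustration of support and confidence for item pairs, not a scalable algorithm and not an MLlib API.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


def pair_rules(transactions, min_support, min_confidence):
    """Return (lhs, rhs, confidence) rules over frequent item pairs."""
    items = sorted({item for t in transactions for item in t})
    rules = []
    for a, b in combinations(items, 2):
        if support(transactions, {a, b}) < min_support:
            continue  # the pair itself is not frequent enough
        for lhs, rhs in ((a, b), (b, a)):
            conf = confidence(transactions, {lhs}, {rhs})
            if conf >= min_confidence:
                rules.append((lhs, rhs, conf))
    return rules


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rules = pair_rules(baskets, min_support=0.5, min_confidence=0.6)
for lhs, rhs, conf in rules:
    print(f"{lhs} -> {rhs} (confidence {conf:.2f})")
```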

Forecasting and Time Series

http://akorra.com/2012/06/06/top-10-creatures-that-influenced-martial-arts/

•  Input of measure over time and related series
•  Predictions generated for short term trends
•  Based on cycles and events

Examples
§  Workforce Optimization
§  Timing Purchasing Decisions
§  Optimizing Maintenance Windows
§  Material Cost Planning
§  Equipment Usage Planning

Demand Sensing

Estimation and Regression

http://akorra.com/2012/06/06/top-10-creatures-that-influenced-martial-arts/

§  Predicting a Continuous Distribution
§  Many Different Measures of Accuracy
§  Quick and Improving Iterations
§  Most Actionable Types of Models

Examples
§  Length Of Stay Estimation
§  Customer Lifetime Value

Pricing Optimization

Clustering

http://akorra.com/2012/06/06/top-10-creatures-that-influenced-martial-arts/

§  Hard and Soft Groupings
§  Profiles of Subgroups
§  Likenesses and Differences

Examples
•  Marketing Campaigns
•  Reward Programs
•  Equipment Utilization
•  Process Improvement Analysis

Market Segmentation

Combining Algorithms in Harmony

http://en.wikipedia.org/wiki/Wu_Xing

Streams and Real Time Analytics

AGENDA

•  The Challenges of Scaling Analytics
•  Classes of Analytics Complexity
•  Spark vs. Storm, etc.
•  Stream Paradigms and Spark

Streams and Real Time Analytics

Will Business Run out of Modeling Opportunities?

The Approaching Crisis for Machine Learning

Hype vs. Reality in Scaling Data Science

http://www.kdnuggets.com/2013/04/poll-results-largest-dataset-analyzed-data-mined.html

2009 vs. 2014 Scaling Data Science

http://www.kdnuggets.com

Spectrum of Stream Based Analytics

[Figure: latency versus event rate. Latency axis: Months, Days, Hours, Minutes, Seconds, 100 ms, < 1 ms. Events/Sec axis: 0, 10, 10^2, 10^3, 10^4, 10^5, 10^6.]

Figure labels: Big Data, NoSQL, RDBMS, Business Monitoring, Machine Monitoring, Real Time Monitoring, Web Analytics, EDW Analytics, Operational Analytics

http://www.cs.ucr.edu/~mueen/ppt/StreamInsigh%205%20SLIDE%20DEMO.pptx

Challenges of Stream Based Applications

http://www.cs.ucr.edu/~mueen/ppt/StreamInsigh%205%20SLIDE%20DEMO.pptx

Figure labels: Devices, Sensors, Web servers, Feeds, Complex Analytics & Mining

Challenges of Stream Based Applications

http://www.cs.ucr.edu/~mueen/ppt/StreamInsigh%205%20SLIDE%20DEMO.pptx

Hopping Windows

Tumbling Windows

§  Event Synchronization
§  Latency
§  Time Window Management
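The difference between the two window types is easiest to see on a handful of events. The sketch below is plain Python and assumes integer, non-negative timestamps and half-open windows `[start, start + size)`; the function names and the sample events are hypothetical. A tumbling window is treated as the special case of a hopping window whose hop equals its size, so windows never overlap.

```python
def hopping_windows(events, size, hop):
    """Group (timestamp, value) events into windows [start, start + size).

    Window starts advance by `hop`; when hop < size the windows overlap,
    so one event can appear in several windows.
    """
    if not events:
        return []
    last = max(ts for ts, _ in events)
    windows = []
    start = 0
    while start <= last:
        values = [v for ts, v in events if start <= ts < start + size]
        windows.append((start, start + size, values))
        start += hop
    return windows


def tumbling_windows(events, size):
    """Non-overlapping windows: a hopping window with hop == size."""
    return hopping_windows(events, size, hop=size)


# Hypothetical event stream: (timestamp, payload).
events = [(0, "a"), (1, "b"), (4, "c"), (5, "d"), (9, "e")]

tumbling = tumbling_windows(events, size=5)
hopping = hopping_windows(events, size=5, hop=2)

for window in tumbling:
    print("tumbling", window)
for window in hopping:
    print("hopping ", window)
```

In Spark Streaming the comparable settings are the window length and the slide interval of the windowed DStream operations; the code above only illustrates the semantics and does not use Spark.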

Deep Dive into MLlib

AGENDA

•  Architecture
•  Descriptive Analytics
•  Predictive Analytics
•  Prescriptive Analytics

Deep Dive into MLlib

MLlib Descriptive Analytics

http://halobi.com/wp-content/uploads/Blog-1-1024x600.png

[Figure: the Analytics Spectrum diagram, repeated from the earlier slide]

MLlib Descriptive Analytics - Data Types

http://halobi.com/wp-content/uploads/Blog-1-1024x600.png

Vectors
•  Dense

MLlib Descriptive Analytics - Data Types

http://halobi.com/wp-content/uploads/Blog-1-1024x600.png

Vectors
•  Sparse
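The dense and sparse forms hold the same numbers in different layouts. The sketch below is plain Python, not the MLlib classes; it mirrors the (size, indices, values) layout that MLlib's sparse vectors use, and the helper names are hypothetical.

```python
def to_sparse(dense):
    """Convert a dense list into a (size, indices, values) triple."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Expand a (size, indices, values) triple back into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def sparse_dot(sparse, dense):
    """Dot product that only visits the stored (nonzero) entries."""
    _, indices, values = sparse
    return sum(v * dense[i] for i, v in zip(indices, values))


dense = [1.0, 0.0, 3.0]
sparse = to_sparse(dense)
print(sparse)                                # (3, [0, 2], [1.0, 3.0])
print(to_dense(*sparse))                     # [1.0, 0.0, 3.0]
print(sparse_dot(sparse, [2.0, 5.0, 1.0]))   # 5.0
```

The sparse layout pays off when most entries are zero, since both storage and operations such as the dot product scale with the number of nonzero entries rather than the vector length.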

MLlib Descriptive Analytics - Data Types

http://halobi.com/wp-content/uploads/Blog-1-1024x600.png

Linear Algebra
•  CoordinateMatrix
•  DistributedMatrix
•  IndexedRow
•  IndexedRowMatrix
•  MatrixEntry
•  RowMatrix

MLlib Descriptive Analytics – Summary Statistics

http://halobi.com/wp-content/uploads/Blog-1-1024x600.png

•  Sample size
•  Maximum value of each column
•  Sample mean vector
•  Minimum value of each column
•  Number of nonzero elements
•  Sample variance vector
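The six statistics listed above are all column-wise. A minimal plain-Python sketch follows; it is not the MLlib implementation, the key names are hypothetical, and it assumes the sample variance uses the n - 1 denominator.

```python
def column_summary(rows):
    """Column-wise summary statistics for a list of equal-length rows."""
    n = len(rows)
    columns = list(zip(*rows))  # transpose rows into columns
    means = [sum(col) / n for col in columns]
    # Assumption: unbiased sample variance (divide by n - 1).
    variances = [
        sum((x - m) ** 2 for x in col) / (n - 1) if n > 1 else 0.0
        for col, m in zip(columns, means)
    ]
    return {
        "count": n,
        "max": [max(col) for col in columns],
        "min": [min(col) for col in columns],
        "mean": means,
        "num_nonzeros": [sum(1 for x in col if x != 0) for col in columns],
        "variance": variances,
    }


# Hypothetical data: three observations, two columns.
rows = [
    [1.0, 0.0],
    [3.0, 4.0],
    [5.0, 0.0],
]
summary = column_summary(rows)
for name, value in summary.items():
    print(name, value)
```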

MLlib Descriptive Analytics - SVD

http://public.lanl.gov/mewall/kluwer2002.html

Singular Value Decomposition Can Collapse Sparse Matrices to Denser Forms

MLlib Descriptive Analytics – PCA

http://halobi.com/wp-content/uploads/Blog-1-1024x600.png

Principal Component Analysis Reduces Dimensionality with Feature Extraction

MLlib Predictive Analytics

http://halobi.com/wp-content/uploads/Blog-1-1024x600.png

[Figure: the Analytics Spectrum diagram, repeated from the earlier slide]

MLlib Predictive Analytics – Bayesian Classifier

http://xkcd.com/1132/

MLlib Predictive Analytics – Logistic Regression

http://halobi.com/wp-content/uploads/Blog-1-1024x600.png

Granddaddy of Algorithms

Coefficients from states or exact values Small scores can make big changes
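One way to read "small scores can make big changes": a logistic model turns a weighted sum of features into a probability through the sigmoid function, which is steepest around a score of zero. The sketch below is plain Python with made-up coefficients, not a trained model and not MLlib code.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, intercept, features):
    """Logistic regression scoring: sigmoid of the weighted feature sum."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical coefficients for two features.
weights = [0.8, -1.2]
intercept = 0.0

p_mid = predict_probability(weights, intercept, [0.0, 0.0])
p_shift = predict_probability(weights, intercept, [1.0, 0.0])
print(p_mid)    # score 0.0 gives probability 0.5
print(p_shift)  # score 0.8 moves the probability to about 0.69
```

A feature can be a state (a 0/1 indicator, so its coefficient is either applied or not) or an exact numeric value (so the coefficient scales with it).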

MLlib Predictive Analytics - SVM

http://www.youtube.com/watch?v=3liCbRZPrZA http://www.projectrho.com/public_html/rocket/fasterlight.php

Linear Support Vector Machine for classifiers

Behold the “kernel trick”

MLlib Predictive Analytics – Regression

http://halobi.com/wp-content/uploads/Blog-1-1024x600.png

§  Linear
§  Ridge
§  Lasso (Least Absolute Shrinkage & Selection Operator)
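The three regression variants differ only in the penalty added to the squared error. The sketch below is plain Python for the simplest possible case, a single feature with no intercept, where each variant has a closed form; the function names and data are hypothetical and this is not how MLlib fits these models.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_slope(xs, ys, penalty="none", lam=0.0):
    """Fit y ~ w * x for one feature, no intercept.

    none : minimize 1/2 * sum (y - w x)^2
    ridge: adds lam / 2 * w^2   (shrinks w, never exactly to zero)
    lasso: adds lam * |w|       (shrinks w, can set it exactly to zero)
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    if penalty == "ridge":
        return sxy / (sxx + lam)
    if penalty == "lasso":
        return soft_threshold(sxy, lam) / sxx
    return sxy / sxx


# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(fit_slope(xs, ys))                   # 2.0
print(fit_slope(xs, ys, "ridge", 14.0))    # 1.0
print(fit_slope(xs, ys, "lasso", 14.0))    # 1.0
print(fit_slope(xs, ys, "lasso", 28.0))    # 0.0
```

The last call shows the "selection" in the lasso name: with a large enough penalty the coefficient is driven exactly to zero, which drops the feature from the model.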

MLlib Predictive Analytics – K-means

http://halobi.com/wp-content/uploads/Blog-1-1024x600.png
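For readers new to K-means, the core loop is short: assign each point to its nearest center, then move each center to the mean of its points. The sketch below is plain Python (the classic Lloyd iteration) with made-up points and hand-picked starting centers; it is not the MLlib implementation, which runs distributed and chooses initial centers itself.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centers, iterations=10):
    """Run a fixed number of assign/update rounds.

    Returns the final centers and the cluster assignment computed in the
    last round (consistent with the centers once they stop moving).
    """
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: squared_distance(p, centers[i]),
            )
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            mean_point(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical 2-D points and hand-picked initial centers.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
print(clusters)
```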

MLlib Predictive Analytics – Matrix Factorization

http://halobi.com/wp-content/uploads/Blog-1-1024x600.png

Collaborative Filtering Alternating Least Squares (ALS)

Prescriptive Analytics

http://halobi.com/wp-content/uploads/Blog-1-1024x600.png

[Figure: the Analytics Spectrum diagram, repeated from the earlier slide]

MLlib Prescriptive Analytics – Gradient Descent

http://bleedingedgemachine.blogspot.com/2012/12/gradient-descent.html http://kungfupanda.wikia.com/wiki/Monkey

Linear and Nonlinear Optimization

Minimize smooth functions without constraints
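Minimizing a smooth, unconstrained function by gradient descent means repeatedly stepping against the gradient. The sketch below is plain Python for a one-variable function; the step size, iteration count and test function are hypothetical choices, and it is not MLlib's optimizer.

```python
def gradient_descent(gradient, start, step_size=0.1, iterations=100):
    """Take `iterations` fixed-size steps against the gradient."""
    x = start
    for _ in range(iterations):
        x = x - step_size * gradient(x)
    return x


# Example: f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2.0 * (x - 3.0), start=0.0)
print(minimum)
```

With this step size the distance to the minimum shrinks by a constant factor of 0.8 per step; a step size that is too large would make the iterates overshoot and diverge instead.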

MLlib Prescriptive Analytics – L-BFGS

http://graphics.utdallas.edu/sites/default/files/gpucvt.png

Limited-Memory BFGS

§  Nonlinear
§  Minimize Smooth Functions
§  Constraint is Memory

Notes from the MLlib Streams Field

MLlib Predictive Analytics – K Nearest Neighbor

http://www.youtube.com/watch?v=3liCbRZPrZA http://www.projectrho.com/public_html/rocket/fasterlight.php

Variation for classifiers
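As a reference point for the classifier variation, a nearest-neighbor classifier labels a new record by majority vote among the k closest training records. The sketch below is plain Python with made-up training data; it is a brute-force, single-machine illustration and not an MLlib API.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_classify(training, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    neighbors = sorted(
        training, key=lambda item: squared_distance(item[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled points: (features, label).
training = [
    ((1.0, 1.0), "low"),
    ((1.0, 2.0), "low"),
    ((2.0, 1.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 8.0), "high"),
    ((8.0, 9.0), "high"),
]

print(knn_classify(training, (2.0, 2.0)))
print(knn_classify(training, (7.5, 8.5)))
```

The brute-force scan compares the query against every training record, which is the part that becomes expensive at scale and on streams.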

MLlib – A Call to Action

http://www.fanpop.com/clubs/voltron/images/2172709/title/original-fanart http://adventuretime.wikia.com/wiki/Princess_Monster_Wife

Coming Soon
•  Decision Trees
•  Model Performance Tools

It Takes A Village
•  Time Series
•  Ensemble MLI