Gl conference2014 toolkits_alice

Post on 14-Jun-2015


GraphLab's Alice Zheng presents on using the toolkits within GraphLab Create to build data products.

Transcript of Gl conference2014 toolkits_alice

Machine Learning Toolkits in GraphLab Create
Alice Zheng, GraphLab, Inc.

Going Beyond Data Engineering

GraphLab Create enables Data Intelligence
•  Recommender systems for retailers
•  Fraud detection for financial institutions
•  Market segmentation and ad targeting
•  Churn prediction for telecom
•  Community detection and friend recommendation for social networks

© 2014 GraphLab, Inc.

The Data Pipeline

Raw Data → Features → Models → Predictions
Stages: Data Engineering, Data Intelligence

GraphLab Create Design Principles
•  Easy to use
•  Powerful
•  Fast
•  Composable

Example: Movie Recommender

City of God

Wild Strawberries

The Celebration

Women on the Verge of a Nervous Breakdown

What do I recommend???

Example: Movie Recommender

City of God

Wild Strawberries

The Celebration

La Dolce Vita

Women on the Verge of a Nervous Breakdown

User-Movie Interaction Matrix
Movies: Women on the Verge …, The Celebration, City of God, Wild Strawberries, La Dolce Vita
Users: Bob, Anna, David, Ethan

Matrix Factorization
User-item interactions ≈ User latent factors × Item latent factors + Information about users + Information about items
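The matrix factorization idea on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not GraphLab Create's implementation: the rating values are invented toy data (only the user names and movie titles come from the slides), the function names `train_mf`, `predict`, and `recommend` are hypothetical, and simple per-user and per-item bias terms stand in for the "information about users" and "information about items" terms, which in a real system would typically be side features.

```python
import random

# Hypothetical toy ratings on a 1-5 scale. The users and movie titles come from
# the slides; the rating values are invented purely for illustration.
RATINGS = [
    ("Bob", "City of God", 5.0),
    ("Bob", "Wild Strawberries", 4.0),
    ("Anna", "The Celebration", 4.0),
    ("Anna", "La Dolce Vita", 5.0),
    ("David", "City of God", 4.0),
    ("David", "La Dolce Vita", 2.0),
    ("Ethan", "Wild Strawberries", 5.0),
    ("Ethan", "The Celebration", 3.0),
]


def predict(model, user, item):
    """Predicted rating: global mean + user bias + item bias + factor dot product."""
    dot = sum(p * q for p, q in zip(model["user_factors"][user],
                                    model["item_factors"][item]))
    return model["mean"] + model["user_bias"][user] + model["item_bias"][item] + dot


def train_mf(ratings, num_factors=2, epochs=200, lr=0.05, reg=0.02, seed=0):
    """Fit the model with stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    model = {
        "mean": sum(r for _, _, r in ratings) / len(ratings),
        "user_bias": {u: 0.0 for u in users},
        "item_bias": {i: 0.0 for i in items},
        "user_factors": {u: [rng.gauss(0, 0.1) for _ in range(num_factors)]
                         for u in users},
        "item_factors": {i: [rng.gauss(0, 0.1) for _ in range(num_factors)]
                         for i in items},
    }
    history = []  # mean squared training error per epoch
    for _ in range(epochs):
        squared_error = 0.0
        for u, i, r in ratings:
            err = r - predict(model, u, i)
            squared_error += err * err
            # Gradient steps with L2 regularization on every parameter.
            model["user_bias"][u] += lr * (err - reg * model["user_bias"][u])
            model["item_bias"][i] += lr * (err - reg * model["item_bias"][i])
            pu, qi = model["user_factors"][u], model["item_factors"][i]
            for k in range(num_factors):
                pu_k, qi_k = pu[k], qi[k]
                pu[k] += lr * (err * qi_k - reg * pu_k)
                qi[k] += lr * (err * pu_k - reg * qi_k)
        history.append(squared_error / len(ratings))
    return model, history


def recommend(model, ratings, user):
    """Return the unseen item with the highest predicted rating for `user`."""
    seen = {i for u, i, _ in ratings if u == user}
    unseen = [i for i in model["item_factors"] if i not in seen]
    return max(unseen, key=lambda i: predict(model, user, i))


model, history = train_mf(RATINGS)
print("training MSE: first epoch %.3f, last epoch %.3f" % (history[0], history[-1]))
print("Suggested for Bob:", recommend(model, RATINGS, "Bob"))
```

The learned factors are what let the model score user-movie pairs that never appear in the data, which is the "What do I recommend???" question from the earlier slide.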

Demo

The Moral of the Story
•  Data scientists need the right tools for the right job
•  There is always a more clever model
•  There is probably some bug in your data
•  GraphLab Create
   •  Versatile, composable, automated
   •  Play, learn, build better models

GraphLab Create Toolkits
•  Recommenders
   •  Item similarity, factorization machine, matrix factorization, non-negative matrix factorization, matrix factorization for ranking
•  Graph analytics
   •  PageRank, triangle counting, degree distribution, graph coloring, connected components, shortest path, k-core decomposition
   •  User-defined graph computation
•  Nearest Neighbors
   •  Brute-force and ball trees
•  Topic modeling
   •  LDA
•  Regression/Classification
   •  Linear regression, logistic regression, SVM, gradient boosted trees, neural networks/deep learning
•  Clustering
   •  K-Means
•  Other popular ML libraries
   •  Vowpal Wabbit
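To make the first recommender in the list concrete, here is a minimal sketch of item similarity in plain Python. It assumes Jaccard similarity over the sets of users who interacted with each item, which is one common choice; the slides do not say which measure GraphLab Create uses. The interaction data and the function name `item_similarity` are hypothetical.

```python
from itertools import combinations

# Hypothetical (user, movie) interactions, invented for illustration.
INTERACTIONS = [
    ("Bob", "City of God"),
    ("Bob", "Wild Strawberries"),
    ("Anna", "City of God"),
    ("Anna", "Wild Strawberries"),
    ("Anna", "La Dolce Vita"),
    ("David", "La Dolce Vita"),
]


def item_similarity(interactions):
    """Jaccard similarity between every pair of items, based on shared users."""
    users_by_item = {}
    for user, item in interactions:
        users_by_item.setdefault(item, set()).add(user)
    sims = {}
    for a, b in combinations(sorted(users_by_item), 2):
        shared = len(users_by_item[a] & users_by_item[b])
        total = len(users_by_item[a] | users_by_item[b])
        # Store both orderings so lookups do not depend on argument order.
        sims[(a, b)] = sims[(b, a)] = shared / total
    return sims


sims = item_similarity(INTERACTIONS)
for (a, b), score in sorted(sims.items()):
    if a < b:
        print("%s / %s: %.2f" % (a, b, score))
```

Two movies watched by exactly the same users score 1.0, and movies with no users in common score 0.0. A recommender built on this would suggest the items most similar to those a user has already interacted with.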

Come to Training Day!
•  GraphLab data science training day tomorrow!
•  A full day of lectures and exercises
•  Data engineering, model building, deployment, all on GraphLab Create

Speed + Scale
•  How much do you need?
•  How much data do you really have?

Data Funnel
Raw Data (PB) → Features (GB to TB) → Models (MB)

Data Analytics Life Cycle
•  Extract, Transform, Load (ETL)
•  Model Learning

Benchmarks

Run Time of Item Similarity on Netflix Dataset
•  GraphLab Create (1 Node), 3.6 minutes
•  Mahout (5 Nodes), 29 minutes
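As a quick worked calculation, the two run times above can be turned into a speedup factor. The node-minutes comparison is only a rough normalization: it assumes the nodes in both setups are comparable, which the slide does not state. Variable names are illustrative.

```python
# Figures taken from the benchmark slide.
glc_minutes, glc_nodes = 3.6, 1
mahout_minutes, mahout_nodes = 29.0, 5

# Wall-clock speedup: how many times faster the single-node run finished.
speedup = mahout_minutes / glc_minutes

# Rough resource comparison in node-minutes (assumes comparable nodes).
node_minutes_ratio = (mahout_minutes * mahout_nodes) / (glc_minutes * glc_nodes)

print("wall-clock speedup: %.1fx" % speedup)
print("node-minutes ratio: %.1fx" % node_minutes_ratio)
```

On these numbers the single-node run finishes roughly 8 times faster in wall-clock time, and uses roughly 40 times fewer node-minutes.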

Become a GLC User!
•  We push the frontier of the industry
•  ... and our customers guide us
•  Our features are customer driven
•  Tell us what you think!