Dataiku hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours


description

Presentation made at Hadoop Summit Amsterdam 2014. It's ab

Transcript of Dataiku hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Page 1: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

½ S L using to turn

into

Page 2: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Semi-Supervised Learning on Hadoop to understand user behaviors

Hadoop Summit Amsterdam, 2-3 April 2014

Page 3: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Florian [email protected]

Data Science Studio


Page 4: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Motivation

• CxO
  – Page Views, Unique Visitors, Dollars, Subscriptions

• Editor / Product Manager
  – Time Spent, Comments

• Users
  – Content

What matters on a web site?

Page 5: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Key Usage Metrics

• Publisher
  – Time Spent on Page
  – Number of pages seen
  – Number of comments
  – Move to Subscription

• Search Engine
  – Click on first hits / re-click
  – Rephrasing ratio
  – Will come back tomorrow
  – Click on Advertising

• Online Game
  – Time spent in the game
  – Level Progress
  – In-App Purchase

Page 6: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

The Quest for the Missing Proxy

• Publisher
  – Time Spent on Page
  – Number of pages seen
  – Number of comments
  – User Satisfaction
  – Move to Subscription

• Search Engine
  – Click on first hits / re-click
  – Rephrasing ratio
  – User Satisfaction
  – Will come back tomorrow
  – Click on Advertising

• Online Game
  – Time spent in the game
  – Level Progress
  – User Satisfaction
  – In-App Purchase

USER SATISFACTION !

Page 7: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Question

How can we measure and drive user satisfaction on a large web site with very diverse usage patterns?

Page 8: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

The Problem

• Newcomers from Google News
• People coming from Twitter and Facebook posts
• People coming to the website almost every day
• People who love to comment
• Foreigners
• Robots
• People fond of the sport section only
• …

BEHAVIOUR DIVERSITY

THE AVERAGED METRICS WOULD HIDE IMPORTANT VARIATION ON SPECIFIC SEGMENTS

Page 9: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

SubProblem 1: Hard Segments

• Segment users by number of visits per month
  – > 20 days per month -> Engaged Users

• Segment by transformed (converted) or not

• Segment by country

Page 10: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Subproblem 2: Hard Metrics

• Newspaper
  – Time spent on the website
  – log(Number of page views) + Number of actions

• Search engine
  – Click ratio

• E-Commerce
  – Transformation ratio

Page 11: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Limits

Hard Segments: MISSING PART OF THE REALITY

Hard Metrics: ARGUING BETWEEN TEAMS

Page 12: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Semi-Supervised Learning

• Supervised Learning: all labeled data -> training data -> model

• Unsupervised Learning: all unlabeled data -> training data -> model

• Semi-Supervised Learning: some labeled data + lots of unlabeled data -> training data -> model

Page 13: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

½ SL – Natural Language Processing

I hope I’ll enjoy Amsterdam, and not only because of Hadoop

Je pense bien passer du bon temps à Amsterdam, et pas seulement grâce à Hadoop

Statistical Knowledge, Text Structure (Unsupervised)

Aligned Corpus (Supervised)

Page 14: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

½ SL Applied to Web Sessions

Lots of customer sessions

Not so many concrete customer feedbacks

Subscription

Page 15: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Semi-Supervised Learning: 3 Approaches

• Generative models, e.g. Gaussian fits
  – Assume all data follows a Gaussian distribution with parameter X
  – Find the X that best fits the distribution of both labeled and unlabeled data

• Fits with costs
  – Supervised learning with a cost function that captures a distance between points, derived from the structure of the unlabeled data

• Ad-hoc: combine unsupervised, then supervised

Page 16: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Clustering + Supervised in practice

Unlabeled training data points in grey

Labeled training data points in color

Page 17: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Supervised Learning Only

Page 18: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

½ SL : Fit to the underlying structure

Page 19: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Our Approach

1. (Lots of) data preparation to build meaningful user sessions

2. Cluster sessions, then have end users validate/tag those clusters

3. Create Predictive User Satisfaction Metrics

4. Follow those metrics !

Page 20: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Data Prep: Overview

Step 1: Build Sessions (Pig)

Step 2: Parse IP/Time/.. (custom Python)

Step 3: Parse Sequences (Hive or custom Python)

Step 4: Build user-level stats (Hive)

RAW DATA -> READY FOR ML

Page 21: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Step 1. Build Session

• Use Hive (or Pig)
• Group into “Session”
• Depending on the variable
  – IP, Device -> select only one per log
  – URL, Event -> create an ordered array that represents the sequence of events in the session
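The grouping step above can be sketched in plain Python to make the logic concrete. This is only an illustration, not the Hive or Pig job used in the talk: the row field names (`cookie`, `ts`, `ip`, `device`, `url`) and the 30-minute inactivity timeout are assumptions, since the slides do not say how a session boundary is defined.

```python
from collections import defaultdict

# Assumption: 30 minutes of inactivity closes a session (not stated in the slides).
SESSION_GAP_SECONDS = 30 * 60


def build_sessions(log_rows, gap=SESSION_GAP_SECONDS):
    """Group log rows (dicts with cookie, ts, ip, device, url) into sessions."""
    by_cookie = defaultdict(list)
    for row in log_rows:
        by_cookie[row["cookie"]].append(row)

    sessions = []
    for cookie, rows in by_cookie.items():
        rows.sort(key=lambda r: r["ts"])
        current = None
        for row in rows:
            # Start a new session on the first event or after a long pause.
            if current is None or row["ts"] - current["end"] > gap:
                current = {
                    "cookie": cookie,
                    "start": row["ts"],
                    "end": row["ts"],
                    "ip": row["ip"],          # keep only one IP per session
                    "device": row["device"],  # keep only one device per session
                    "urls": [],               # ordered sequence of events
                }
                sessions.append(current)
            current["end"] = row["ts"]
            current["urls"].append(row["url"])
    return sessions


rows = [
    {"cookie": "a", "ts": 0, "ip": "1.1.1.1", "device": "desktop", "url": "/home"},
    {"cookie": "a", "ts": 60, "ip": "1.1.1.1", "device": "desktop", "url": "/products"},
    {"cookie": "a", "ts": 4000, "ip": "1.1.1.1", "device": "desktop", "url": "/blog"},
    {"cookie": "b", "ts": 10, "ip": "2.2.2.2", "device": "mobile", "url": "/home"},
]
sessions = build_sessions(rows)
```

With these sample rows, cookie `a` yields two sessions (the third event arrives more than 30 minutes after the second) and cookie `b` yields one.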

Page 22: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Step 2: Basic Features

• IP Address -> Location, City
• User-Agent -> Device
• Timestamp -> User Time -> Day or night?

Python + Hadoop Streaming

Option 1 Option 2
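A rough sketch of what the Python option could look like for the user-agent and timestamp features. Everything here is illustrative: the substring rules, the day/night thresholds, and the tab-separated input layout are assumptions, and the UTC offset is assumed to come from an upstream IP-to-country lookup that is not shown.

```python
from datetime import datetime, timedelta, timezone


def device_from_user_agent(user_agent):
    """Very rough device guess from substrings (illustrative, not exhaustive)."""
    ua = user_agent.lower()
    if "bot" in ua or "spider" in ua or "crawl" in ua:
        return "bot"
    if "ipad" in ua or "tablet" in ua:
        return "tablet"
    if "mobi" in ua or "iphone" in ua or "android" in ua:
        return "mobile"
    return "desktop"


def local_hour(epoch_seconds, utc_offset_hours):
    """Hour of day in the user's time zone, given a UTC offset in hours."""
    utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return (utc + timedelta(hours=utc_offset_hours)).hour


def day_or_night(hour):
    # Assumption: "day" means 07:00 to 20:59 local time.
    return "day" if 7 <= hour < 21 else "night"


def map_line(line):
    """Transform one 'timestamp<TAB>utc_offset<TAB>user_agent' record."""
    ts, utc_offset, user_agent = line.rstrip("\n").split("\t")
    hour = local_hour(int(ts), int(utc_offset))
    return "\t".join([device_from_user_agent(user_agent), str(hour), day_or_night(hour)])
```

Under Hadoop Streaming, a mapper script would typically call `map_line` on each line of standard input and print the result.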

Page 23: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Extracted Data

ORIGINAL columns, plus three NEW columns:

• Country (from IP)
• Device (from User-Agent)
• Hour (from Country & Time)

Page 24: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Step 3: Session Signals

• Simple Signals
  – Number of Page Views
  – Time Spent
  – Etc.

• Limitation: simple signals might not help much to differentiate behaviours

Page 25: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

More Elaborate: N-Grams Model

Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
Protein | Amino Acid | Cys-Gly-Leu | Cys, Gly, Leu | Cys-Gly, Gly-Leu | Cys-Gly-Leu
DNA | Base Pair | …ATTAGCAT.. | A,T,T,A | AT,TT,TA,AG, | ATT,TTA,TAG,..
NLP (character) | Character | ..some like it hot… | s,o,m,e,l,i,t.. | so,om,el,li,it | som,ome,me_,_li,lik,..
NLP (word) | Word | ..some like it hot… | some,like,it | some-like,like-it | some-like-it,like-it-hot

Page 26: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

N-Grams Model For Sessions

Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
Protein | Amino Acid | Cys-Gly-Leu | Cys, Gly, Leu | Cys-Gly, Gly-Leu | Cys-Gly-Leu
DNA | Base Pair | …ATTAGCAT.. | A,T,T,A | AT,TT,TA,AG, | ATT,TTA,TAG,..
NLP (character) | Character | ..some like it hot… | s,o,m,e,l,i,t.. | so,om,el,li,it | som,ome,me_,_li,lik,..
NLP (word) | Word | ..some like it hot… | some,like,it | some-like,like-it | some-like-it,like-it-hot
Web Sessions | Page View | [/home, /products, /trynow, /blog] | /home, /products, /trynow, /blog | /home-/products, /products-/trynow, /trynow-/blog | /home-/products-/trynow, /products-/trynow-/blog
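The Web Sessions row can be reproduced with a few lines of Python. This is a minimal sketch: the function name `ngrams` and the `-` separator are choices made here for illustration, not something the slides prescribe.

```python
def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, joined with '-'."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


session = ["/home", "/products", "/trynow", "/blog"]
unigrams = ngrams(session, 1)
bigrams = ngrams(session, 2)
trigrams = ngrams(session, 3)
```

A session shorter than `n` page views simply produces no n-grams of that order.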

Page 27: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Session N-Grams Analytics

Campaign / URL / Event | Detailed Token | Simple Token
utm=google_search | google-search-my-site | google-search
/home | home | home
/search?q=baseball | search-baseball | search
click=www.nfl.com | click-nfl | click
/sport/new-player-com.. | sport/new-player-comming | sport
/search?q=Mick+JONES | search-mick+jones | search
click=www.nfl.com | click-nfl | click
/sport/new-player-com.. | sport/new-player/comming | sport
/politics/home | politics-home | politics

Detailed tokens -> fine-grain n-grams. Simple tokens -> coarse-grain n-grams.

Important tricks:
• Incorporate the first referrer / marketing campaign as the FIRST TOKEN
• Build two levels of tokens: detailed, and category only
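One possible way to derive the two token levels from a URL path, sketched with the standard library. The rules are assumptions chosen to match a few rows of the table (`/home`, `/search?q=…`, `/politics/home`); the campaign and click rows follow conventions the slides do not spell out, so the referrer is passed in as an already-built token.

```python
from urllib.parse import parse_qs, urlsplit


def tokenize(url):
    """Return (detailed_token, simple_token) for a path such as '/search?q=baseball'."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    simple = segments[0] if segments else ""
    detail_parts = list(segments)
    # Append query values, lower-cased, with spaces written back as '+'.
    for values in parse_qs(parts.query).values():
        detail_parts.extend(v.lower().replace(" ", "+") for v in values)
    return "-".join(detail_parts), simple


def session_tokens(referrer_token, urls):
    """Put the referrer / campaign token first, then one token per page view."""
    detailed, simple = [referrer_token], [referrer_token]
    for url in urls:
        d, s = tokenize(url)
        detailed.append(d)
        simple.append(s)
    return detailed, simple


detailed, simple = session_tokens("google-search", ["/home", "/search?q=Mick+JONES", "/politics/home"])
```

The detailed sequence feeds the fine-grain n-grams and the simple sequence feeds the coarse-grain ones.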

Page 28: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

How To In Practice

• Hive query using the n-grams UDF

• Compute the LLR (Log-Likelihood Ratio) metrics

• Keep the most frequent n-grams of each type (detailed / non-detailed) as features for the session

• Hint: set the frequency limit so that > 90% of sessions can be described by a non-detailed n-gram
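For reference, a common form of the log-likelihood ratio is Dunning's G² statistic on a 2x2 contingency table. The sketch below assumes that form; the slides do not say which two groups are compared, so the meaning given to the four counts in the comments is hypothetical.

```python
import math


def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.

    Hypothetical reading of the counts:
      k11: sessions in group A containing the n-gram
      k12: sessions in group A without it
      k21: sessions in group B containing the n-gram
      k22: sessions in group B without it
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for k, i, j in cells:
        if k > 0:  # empty cells contribute nothing
            g2 += k * math.log(k * total / (rows[i] * cols[j]))
    return 2.0 * g2
```

A score near zero means the n-gram is equally common in both groups; larger scores indicate n-grams that separate them.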

Page 29: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Step 4. Cohort-like data

• Per cookie, compute metrics
  – Nb. days since first visit
  – Nb. visits in the last 30 days
  – Average session time
  – …

• Reintegrate this information

• Easily achieved with a HiveQL query
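The slides do this step in HiveQL; the same per-cookie aggregation is sketched below in Python to show the logic. The input layout `(cookie, start_ts, duration_seconds)` and the output field names are assumptions.

```python
from collections import defaultdict

DAY = 86400  # seconds


def cohort_metrics(sessions, now):
    """Compute per-cookie stats from (cookie, start_ts, duration_seconds) tuples."""
    per_cookie = defaultdict(list)
    for cookie, start, duration in sessions:
        per_cookie[cookie].append((start, duration))

    metrics = {}
    for cookie, visits in per_cookie.items():
        first_visit = min(start for start, _ in visits)
        metrics[cookie] = {
            "days_since_first_visit": (now - first_visit) // DAY,
            "visits_last_30_days": sum(1 for start, _ in visits if now - start <= 30 * DAY),
            "avg_session_seconds": sum(d for _, d in visits) / len(visits),
        }
    return metrics


stats = cohort_metrics(
    [("a", 0, 100), ("a", 40 * DAY, 300), ("b", 45 * DAY, 60)],
    now=50 * DAY,
)
```

Joining `stats` back onto each session by cookie gives the "reintegrated" user-level features.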

Page 30: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Machine Learning for HDFS Data

Tool | Kind | Algorithms for clustering | Simplicity / TRAIN set size
Apache Mahout | MapReduce | ~ 10 available | Expert; TERABYTES
Python (Scikit+Pandas+…) | Out for training / In for apply | ~ 20 available (including bi-clustering) | Medium (10GB), 1 SERVER RAM
H2O | Separate Cluster | 1 (kMeans) | Medium (100GB – 1TB), CLUSTER RAM
Open Source R + Hadoop | Varies | Varies | Varies; Varies
Open Source R + Pattern (Cascading) | Out for training / In for apply | > 3 | Medium (1GB), 1 Server RAM in R
Spark + MLLib | Separate Cluster | 1 | Medium (100GB – 1TB), CLUSTER RAM

Page 31: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

How big is our data here?

Step 1: Build Sessions

Step 2: Parse IP/Time/..

Step 3: Parse Sequences

Step 4: Build user-level stats

RAW DATA -> READY FOR ML

Uncompressed data size, for 1 year's worth of logs on a website with 10 million unique visitors per month:

10 GB and 5 TB

Page 32: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Clustering With Scikit on HDFS

1. Use Pydoop to get the data onto the training server
2. Use pandas to read the data and transform it to numerical features
3. KMeans().fit()
4. IPython to draw some graphs
5. Enjoy
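To show what the clustering step computes, here is a dependency-free sketch of Lloyd's k-means algorithm. It is a stand-in for illustration only: the talk uses scikit-learn's KMeans, and the deterministic initialisation (first k points) and the toy data are assumptions made to keep the example reproducible.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical numeric session features (two columns per session).
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
centroids, labels = kmeans(points, k=2)
```

In practice the library implementation is preferable, since it handles initialisation and convergence checks far more carefully than this sketch.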

or

Page 33: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Session Data

Page 34: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Clustering

Page 35: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Clustering & Cluster Sampling

Take a balanced number of samples in each cluster, close to the centroid
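A minimal sketch of this sampling rule, assuming sessions are numeric feature vectors that already carry a cluster label and that "close" means Euclidean distance. The function and argument names are illustrative.

```python
import math
from collections import defaultdict


def sample_near_centroids(points, labels, centroids, per_cluster):
    """Return {cluster: indices of the per_cluster points nearest its centroid}."""
    by_cluster = defaultdict(list)
    for idx, (point, label) in enumerate(zip(points, labels)):
        by_cluster[label].append((math.dist(point, centroids[label]), idx))
    return {
        label: [idx for _, idx in sorted(members)[:per_cluster]]
        for label, members in by_cluster.items()
    }


points = [(0, 0), (0, 3), (0, 1), (10, 10), (10, 12)]
labels = [0, 0, 0, 1, 1]
centroids = [(0, 1), (10, 10)]
to_label = sample_near_centroids(points, labels, centroids, per_cluster=2)
```

Taking the same number of sessions from every cluster keeps small segments from being drowned out when the samples are handed over for labelling.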

Page 36: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Labelling

Visualizing Sessions

Session timeline: 0’00, 0’12, 1’04, 1’45, 3’02

Labelling: Search for a specific Topic

I can guess what this guy was doing !!!

Page 37: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Labelling

Search for a specific Topic

Newcomer from Google News

Foreigner Discovering The Site

Fan that loves to comment

Home Page Wanderer

Dark Bot (Competitor?)

Page 38: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

What if ?

Search for a specific Topic

Newcomer from Google News

Foreigner Discovering The Site

Fan that loves to comment

Home Page Wanderer

Dark Bot (Competitor?)

Page 39: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Supervised Learning

Search for a specific Topic

Newcomer from Google News

Foreigner Discovering The Site

Fan that loves to comment

Home Page Wanderer

Dark Bot (Competitor?)

Independently of the clusters, use the labeled examples to classify each session into the predefined segments

Page 40: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Supervised Learning : e.g. in python

• Load the data and the label in python (Pandas)

• Fit the labeled sessions against a model

• Save the model in HDFS (python pickle)

• Run the model against all the data (Hadoop Streaming)
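The four steps above can be sketched end to end with the standard library. A nearest-centroid classifier stands in for whatever model is actually fitted, and the feature values and segment names are hypothetical; the pickle round trip mirrors saving the model and reloading it on the cluster.

```python
import math
import pickle
from collections import defaultdict


def fit_centroids(features, labels):
    """'Train' by averaging the feature vectors of each labeled segment."""
    grouped = defaultdict(list)
    for x, y in zip(features, labels):
        grouped[y].append(x)
    return {y: [sum(col) / len(xs) for col in zip(*xs)] for y, xs in grouped.items()}


def predict(model, features):
    """Assign each session to the segment with the nearest centroid."""
    return [min(model, key=lambda y: math.dist(x, model[y])) for x in features]


# Hypothetical labeled sessions: (page views, seconds on site).
labeled_x = [(1, 30), (2, 45), (40, 600), (35, 500)]
labeled_y = ["newcomer", "newcomer", "fan", "fan"]

model = fit_centroids(labeled_x, labeled_y)
blob = pickle.dumps(model)      # what would be written to HDFS
restored = pickle.loads(blob)   # what each scoring task would load
predictions = predict(restored, [(3, 40), (38, 550)])
```

With real features, scaling the columns before computing distances would matter, since seconds on site would otherwise dominate page views.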

We’ve got a tool to help you do that in Data Science Studio. He’s called the Doctor and he’s fun to use!

Page 41: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Compute Metrics Per Segments

Segments:
• Search for a specific Topic
• Newcomer from Google News
• Foreigner Discovering The Site
• Fan that loves to comment
• Home Page Wanderer
• Dark Bot (Competitor?)

Figures shown per segment:
• 0.3€ per session, 0.23€ acquisition costs
• 13k sessions, 1.3€ per session, 0.23€ acquisition costs
• 938k sessions
• 938k sessions, 0.3€ per session, 0.23€ acquisition costs
• 738k sessions, 0.83€ per session, 0.73€ acquisition costs
• 68k sessions, 0.3€ per session, 1.23€ acquisition costs
• 1k sessions, 0€ per session, 0€ acquisition costs

Page 42: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

User Satisfaction Metrics

• Future-Based Metrics
  – Will the user most likely subscribe/pay in the future?

• Expressed-Opinion
  – Does the user look satisfied, judging from their behaviour?

Page 43: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Opinion-Based Training For User Satisfaction

User feedback as “labels” to build a model of satisfaction

“Predict” a satisfaction score for sessions without feedback

Session Data (100 million sessions) + Feedbacks (10,000 feedbacks) -> Scored Sessions

HYPOTHESIS: IF TWO USERS HAVE SIMILAR NAVIGATION PATTERNS, THEY HAVE SIMILAR USER SATISFACTION LEVELS

Page 44: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Compute Metrics Per Segments

Segments:
• Search for a specific Topic
• Newcomer from Google News
• Foreigner Discovering The Site
• Fan that loves to comment
• Home Page Wanderer
• Dark Bot (Competitor?)

Figures shown per segment:
• 0.3€ per session, 0.23€ acquisition costs
• 13k sessions, 1.3€ per session, 0.23€ acquisition costs
• 938k sessions
• 938k sessions, 0.3€ per session, 0.23€ acquisition costs
• 738k sessions, 0.83€ per session, 0.73€ acquisition costs
• 68k sessions, 0.3€ per session, 1.23€ acquisition costs
• 1k sessions, 0€ per session, 0€ acquisition costs

Satisfaction scores shown per segment:
• SATISFACTION SCORE 0.87
• SATISFACTION SCORE 0.37
• SATISFACTION SCORE 0.28
• SATISFACTION SCORE 0.12
• SATISFACTION SCORE 0.28
• SATISFACTION SCORE 0.12
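The opinion-based scoring idea (similar navigation implies similar satisfaction) can be illustrated with a k-nearest-neighbour average, followed by a per-segment mean. This is a sketch under assumptions: the talk does not name the model used, and the feature vectors, feedback scores and segment names below are made up.

```python
import math


def knn_score(session, labeled, k=3):
    """Average the feedback of the k labeled sessions closest to `session`."""
    nearest = sorted(labeled, key=lambda item: math.dist(session, item[0]))[:k]
    return sum(score for _, score in nearest) / len(nearest)


def segment_averages(sessions, segments, labeled, k=3):
    """Mean predicted satisfaction per segment."""
    totals = {}
    for features, segment in zip(sessions, segments):
        running_sum, count = totals.get(segment, (0.0, 0))
        totals[segment] = (running_sum + knn_score(features, labeled, k), count + 1)
    return {segment: s / n for segment, (s, n) in totals.items()}


# Hypothetical sessions with user feedback: (features, satisfaction in [0, 1]).
labeled = [
    ((0, 0), 1.0), ((0, 1), 0.8), ((1, 0), 0.9),
    ((10, 10), 0.1), ((10, 11), 0.2), ((11, 10), 0.0),
]
scores = segment_averages([(0.2, 0.2), (10.2, 10.2)], ["fan", "bot"], labeled)
```

Only the small labeled set carries real feedback; every other session inherits a score from its nearest labeled neighbours.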

Page 45: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Data in Time: Smoothing

In red: the base metric. In blue: the smoothed metric.

RAW DATA MAY VARY A LOT FROM DAY TO DAY

IT WILL SCARE PEOPLE

Page 46: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Exponential Smoothing In Hive

SELECT segment,
       moving_avg(day, satisfaction, 15, 1.52, 15, DATEDIFF('2014-01-15', '2014-01-01'))

FROM stats

GROUP BY segment

These factors determine whether you smooth a lot or not, and over how many days
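For intuition, simple exponential smoothing can be written in a few lines of Python. This is not the `moving_avg` UDF from the query, whose parameters are not documented in the slides; the `alpha` parameter and the sample values are assumptions.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    smoothed = []
    for x in values:
        previous = smoothed[-1] if smoothed else x  # seed with the first value
        smoothed.append(alpha * x + (1 - alpha) * previous)
    return smoothed


daily_satisfaction = [1.0, 0.0, 1.0]
smoothed = exponential_smoothing(daily_satisfaction, alpha=0.5)
```

A smaller `alpha` smooths more heavily, because each new day moves the curve less.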

Page 47: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Final : Follow Smoothed Satisfaction

Search for a specific Topic

Newcomer from Google News

Foreigner Discovering The Site

Fan that loves to comment

Home Page Wanderer

Dark Bot (Competitor?)

Follow Satisfaction Metric Per Segment

Damn, our latest release has diverging effects on segments

Page 48: Dataiku   hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Thank You !

Florian Douetteau, @fdouetteau

Questions now or later: [email protected]

dataiku.com