Dataiku hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Semi-Supervised Learning on Hadoop to understand user behaviors

Hadoop Summit Amsterdam, 2-3 April 2014

Florian Douetteau

Data Science Studio

Motivation

• CxO: page views, unique visitors, dollars, subscriptions

• Editor / Product Manager: time spent, comments

• Users: content

What matters on a website?

Key Usage Metrics

• Publisher
  – Time spent on page
  – Number of pages seen
  – Number of comments
  – Move to subscription

• Search Engine
  – Click on first hits / re-click
  – Rephrasing ratio
  – Will come back tomorrow
  – Click on advertising

• Online Game
  – Time spent in the game
  – Level progress
  – In-app purchase

The Quest for the Missing Proxy

• Publisher
  – Time spent on page
  – Number of pages seen
  – Number of comments
  – User satisfaction
  – Move to subscription

• Search Engine
  – Click on first hits / re-click
  – Rephrasing ratio
  – User satisfaction
  – Will come back tomorrow
  – Click on advertising

• Online Game
  – Time spent in the game
  – Level progress
  – User satisfaction
  – In-app purchase

USER SATISFACTION !

Question

How do we measure and drive user satisfaction on a large website with very diverse usage patterns?

The Problem

• Newcomers from Google News
• People coming from Twitter and Facebook posts
• People coming to the website almost every day
• People who love to comment
• Foreigners
• Robots
• People fond of the sport section only
• ...

BEHAVIOUR DIVERSITY

AVERAGED METRICS WOULD HIDE IMPORTANT VARIATIONS ON SPECIFIC SEGMENTS

SubProblem 1: Hard Segments

• Segment users by number of visits per month: > 20 days per month → engaged users

• Segment by converted or not

• Segment by country

Subproblem 2: Hard Metrics

• Newspaper: time spent on the website; log(number of page views) + number of actions

• Search engine: click ratio

• E-commerce: conversion ratio

Limits

• Hard segments: MISSING PART OF THE REALITY

• Hard metrics: ARGUING BETWEEN TEAMS

Semi-Supervised Learning

Training data → learning type → output:

• All labeled data → supervised learning → model
• All unlabeled data → unsupervised learning → model
• Some labeled data + lots of unlabeled data → semi-supervised learning → model

½ SL – Natural Language Processing

I hope I’ll enjoy Amsterdam, and not only because of Hadoop

Je pense bien passer du bon temps à Amsterdam, et pas seulement grâce à Hadoop

• Statistical knowledge of text structure (unsupervised)

• Aligned corpus (supervised)

½ SL Applied to Web Sessions

Lots of customer sessions

Not so much concrete customer feedback

Subscription

Semi-Supervised Learning 3 Approaches

• Generative models, e.g. Gaussian fits
  – Assume all data fits a Gaussian distribution with parameter X
  – Find the X that best fits the distribution of both labeled and unlabeled data

• Fits with costs
  – Supervised learning with a cost function that captures a distance between points, derived from the structure of the unlabeled data

• Ad hoc: combine unsupervised, then supervised

Clustering + Supervised in practice

• Unlabeled training data points in grey, labeled training data points in color
• Supervised learning only
• ½ SL: fit to the underlying structure
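The "cluster first, then supervise" idea can be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn's KMeans on synthetic 2-D points, and the blob data, the number of clusters, and the majority-vote rule are all invented for the example rather than taken from the talk.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for session features: two well-separated blobs.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2))
X = np.vstack([blob_a, blob_b])
true_labels = np.array([0] * 200 + [1] * 200)

# Only a handful of points carry a label (-1 means "unlabeled").
y = np.full(len(X), -1)
labeled_idx = [0, 1, 2, 200, 201, 202]
y[labeled_idx] = true_labels[labeled_idx]

# Step 1 (unsupervised): find structure using ALL points.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
clusters = kmeans.labels_

# Step 2 (supervised part): each cluster takes the majority label of
# the few labeled points that fall inside it.
cluster_to_label = {}
for c in np.unique(clusters):
    votes = y[(clusters == c) & (y != -1)]
    if len(votes) > 0:
        cluster_to_label[int(c)] = int(np.bincount(votes).argmax())

# Every point, labeled or not, inherits the label of its cluster.
predicted = np.array([cluster_to_label.get(int(c), -1) for c in clusters])
accuracy = float((predicted == true_labels).mean())
```

Six labels are enough here only because the unlabeled points reveal two clean clusters; on real sessions the clusters would first need validation by people who know the site.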

Our Approach

1. (Lots of) data preparation to build meaningful user sessions

2. Cluster sessions, then have end users validate and tag those clusters

3. Create Predictive User Satisfaction Metrics

4. Follow those metrics !

Data Prep: Overview

Step 1: Build sessions (Pig)
Step 2: Parse IP / time / ... (custom Python)
Step 3: Parse sequences (Hive or custom Python)
Step 4: Build user-level stats (Hive)

RAW DATA → READY FOR ML

Step 1. Build Session

• Use Hive (or Pig)
• Group into “sessions”
• Depending on the variable:
  – IP, device: select only one per log
  – URL, event: create an ordered array that represents the sequence of events in the session
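The talk does this step in Hive or Pig; the same grouping logic can be sketched in plain Python for illustration. The log layout, the 30-minute inactivity timeout, and the choice to keep the first IP and device seen are assumptions made for this example.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout

# Hypothetical log rows: (visitor_id, unix_timestamp, ip, device, url)
logs = [
    ("v1", 1000, "1.2.3.4", "mobile", "/home"),
    ("v1", 1060, "1.2.3.4", "mobile", "/products"),
    ("v1", 9000, "1.2.3.4", "mobile", "/blog"),
    ("v2", 1010, "5.6.7.8", "desktop", "/home"),
]

def build_sessions(rows, gap=SESSION_GAP_SECONDS):
    """Group hits per visitor and cut a new session after `gap` idle seconds."""
    sessions = []
    rows = sorted(rows, key=itemgetter(0, 1))  # by visitor, then by time
    for visitor, hits in groupby(rows, key=itemgetter(0)):
        current = None
        last_ts = None
        for _, ts, ip, device, url in hits:
            if current is None or ts - last_ts > gap:
                # Single-valued fields: keep the first value seen.
                current = {"visitor": visitor, "start": ts,
                           "ip": ip, "device": device, "urls": []}
                sessions.append(current)
            # Sequence fields: ordered array of events in the session.
            current["urls"].append(url)
            last_ts = ts
    return sessions

sessions = build_sessions(logs)
```

With the sample rows, visitor v1 ends up with two sessions because the third hit arrives well after the timeout.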

Step 2: Basic Features

• IP address → location, city
• User-Agent → device
• Timestamp → user time: day or night?

Python + Hadoop Streaming

Extracted data: the ORIGINAL columns plus NEW ones: country (from IP), device (from User-Agent), hour (from country and time)
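A per-line enrichment function of the kind a Hadoop Streaming mapper would apply might look like the sketch below. The tab-separated input layout, the two lookup tables, the fixed UTC offsets (which ignore daylight saving), and the 7h-22h definition of "day" are all assumptions for illustration; a real job would rely on a GeoIP database and a proper user-agent parser.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup tables standing in for a GeoIP database.
IP_PREFIX_TO_COUNTRY = {"81.": "FR", "17.": "US"}
COUNTRY_UTC_OFFSET_HOURS = {"FR": 1, "US": -5}  # fixed offsets, no DST

def country_from_ip(ip):
    for prefix, country in IP_PREFIX_TO_COUNTRY.items():
        if ip.startswith(prefix):
            return country
    return "unknown"

def device_from_user_agent(user_agent):
    ua = user_agent.lower()
    if "ipad" in ua or "tablet" in ua:
        return "tablet"
    if "mobile" in ua or "iphone" in ua or "android" in ua:
        return "mobile"
    return "desktop"

def local_hour(utc_epoch_seconds, country):
    offset = COUNTRY_UTC_OFFSET_HOURS.get(country, 0)
    utc = datetime.fromtimestamp(utc_epoch_seconds, tz=timezone.utc)
    return (utc + timedelta(hours=offset)).hour

def enrich(line):
    """Append country, device, local hour and day/night to one log line."""
    ip, user_agent, epoch = line.rstrip("\n").split("\t")
    country = country_from_ip(ip)
    hour = local_hour(int(epoch), country)
    day_part = "day" if 7 <= hour < 22 else "night"  # assumed threshold
    return "\t".join([ip, user_agent, epoch, country,
                      device_from_user_agent(user_agent), str(hour), day_part])

# 1396432800 is 2014-04-02 10:00:00 UTC.
example = enrich("81.2.3.4\tMozilla/5.0 (iPhone; CPU iPhone OS 7_0)\t1396432800")
```

In a streaming job, `enrich` would be called on each line read from standard input and its result written to standard output.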

Step 3: Session Signals

• Simple signals
  – Number of page views
  – Time spent
  – Etc.

• Limitation: simple signals might not help much to differentiate behaviours

More Elaborate: N-Grams Model

Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
Protein | Amino acid | Cys-Gly-Leu | Cys, Gly, Leu | Cys-Gly, Gly-Leu | Cys-Gly-Leu
DNA | Base pair | …ATTAGCAT.. | A, T, T, A, .. | AT, TT, TA, AG, .. | ATT, TTA, TAG, ..
NLP (character) | Character | ..some like it hot… | s, o, m, e, _, l, i, .. | so, om, me, e_, _l, li, .. | som, ome, me_, e_l, _li, lik, ..
NLP (word) | Word | ..some like it hot… | some, like, it, hot | some-like, like-it, it-hot | some-like-it, like-it-hot

N-Grams Model For Sessions

Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
Web sessions | Page view | [/home, /products, /trynow, /blog] | /home, /products, /trynow, /blog | /home-/products, /products-/trynow, /trynow-/blog | /home-/products-/trynow, /products-/trynow-/blog

http://www.dataiku.com/home
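Extracting n-grams from a session is a sliding window over its ordered page views. A minimal sketch, where the function name `ngrams` and the "-" separator are choices made for this example:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with '-'."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

session = ["/home", "/products", "/trynow", "/blog"]

one_grams = ngrams(session, 1)
two_grams = ngrams(session, 2)
three_grams = ngrams(session, 3)
```

A session shorter than n simply yields no n-grams, so short visits are described by their 1-grams and 2-grams only.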

Session N-Grams Analytics

Campaign / URL / Event | Detailed token | Simple token
utm=google_search | google-search-my-site | google-search
/home | home | home
/search?q=baseball | search-baseball | search
click=www.nfl.com | click-nfl | click
/sport/new-player-com.. | sport/new-player-comming | sport
/search?q=Mick+JONES | search-mick+jones | search
click=www.nfl.com | click-nfl | click
/sport/new-player-com.. | sport/new-player/comming | sport
/politics/home | politics-home | politics

Important tricks:
• Incorporate the first referrer / marketing campaign as the FIRST TOKEN
• Build two levels of tokens: detailed, and category only

Detailed tokens → fine-grain n-grams; simple tokens → coarse-grain n-grams
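The two token levels can be produced by one function that returns a (detailed, simple) pair per event. The rules below are a hypothetical sketch that reproduces most rows of the table; they do not reproduce the campaign row exactly, since the extra "my-site" part of that detailed token comes from information not visible in the event string.

```python
from urllib.parse import urlsplit, parse_qs

def tokenize(event):
    """Return (detailed_token, simple_token) for one raw event string."""
    if event.startswith("click="):
        host = event[len("click="):]
        parts = [p for p in host.split(".") if p not in ("www", "com")]
        return "click-" + "-".join(parts), "click"
    if event.startswith("utm="):
        campaign = event[len("utm="):].replace("_", "-")
        return campaign, campaign
    url = urlsplit(event)
    segments = [s for s in url.path.split("/") if s]
    category = segments[0] if segments else "home"
    query = parse_qs(url.query)
    if "q" in query:
        # parse_qs decodes '+' to a space; put it back to keep one token.
        terms = query["q"][0].lower().replace(" ", "+")
        return category + "-" + terms, category
    return ("-".join(segments) if segments else "home"), category

# The referrer / campaign comes first, then the events in order.
events = ["utm=google_search", "/home", "/search?q=Mick+JONES",
          "click=www.nfl.com", "/politics/home"]
detailed = [tokenize(e)[0] for e in events]
simple = [tokenize(e)[1] for e in events]
```

Keeping both lists per session lets the n-gram step run twice, once on fine-grain tokens and once on coarse-grain ones.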

How To In Practice

• Hive query using the n-grams UDF

• Compute the LLR (Log-Likelihood Ratio) metrics

• Keep the most frequent n-grams of each type (detailed / non-detailed) as features for the session

• Hint: set the frequency limit so that more than 90% of sessions can be described by a non-detailed n-gram
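The log-likelihood ratio for a 2x2 table of counts can be computed as below, followed by a simple "keep the most frequent n-grams" selection. The slides do not say which counts are compared; this sketch assumes the usual setup where k11 counts sessions containing both tokens of a 2-gram, k12 and k21 count sessions containing only one of them, and k22 counts sessions containing neither. The sample sessions are made up.

```python
import math
from collections import Counter

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G-test statistic) of a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

# Keeping the most frequent 2-grams as features (toy sessions).
sessions = [
    ["home", "search", "click"],
    ["home", "search", "click"],
    ["home", "sport"],
    ["politics", "home"],
]
bigram_counts = Counter(
    f"{a}-{b}" for s in sessions for a, b in zip(s, s[1:])
)
top_features = [gram for gram, _ in bigram_counts.most_common(2)]
```

A score near zero means the two tokens co-occur about as often as chance predicts; large scores flag n-grams worth keeping even when their raw frequency is moderate.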

Step 4. Cohort-like data

• Per cookie, compute metrics
  – Number of days since first visit
  – Number of visits in the last 30 days
  – Average session time
  – …

• Reintegrate this information into the session data

• Easily achieved with a HiveQL query
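The per-cookie metrics amount to a group-by over sessions. The talk does this in HiveQL; the Python sketch below shows the same aggregation on a hypothetical session table whose layout and sample rows are invented for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical session table: (cookie, session_date, session_seconds)
session_rows = [
    ("c1", date(2014, 1, 1), 60),
    ("c1", date(2014, 3, 20), 120),
    ("c1", date(2014, 3, 30), 180),
    ("c2", date(2014, 3, 31), 30),
]

def cohort_stats(rows, today):
    """Compute user-level stats for each cookie as of `today`."""
    by_cookie = defaultdict(list)
    for cookie, day, seconds in rows:
        by_cookie[cookie].append((day, seconds))
    stats = {}
    for cookie, visits in by_cookie.items():
        first = min(d for d, _ in visits)
        recent = [d for d, _ in visits if 0 <= (today - d).days < 30]
        stats[cookie] = {
            "days_since_first_visit": (today - first).days,
            "visits_last_30_days": len(recent),
            "avg_session_seconds": sum(s for _, s in visits) / len(visits),
        }
    return stats

stats = cohort_stats(session_rows, today=date(2014, 4, 1))
```

Joining `stats` back onto each session by cookie gives every session the history of the visitor behind it.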

Machine Learning for HDFS Data

Tool | Kind | Algorithms for clustering | Simplicity | Train set size
Apache Mahout | MapReduce | ~10 available | Expert | Terabytes
Python (Scikit + Pandas + …) | Out for training / in for apply | ~20 available (including bi-clustering) | Medium | 10 GB (1 server RAM)
H2O | Separate cluster | 1 (k-means) | Medium | 100 GB – 1 TB (cluster RAM)
Open Source R + Hadoop | Varies | Varies | Varies | Varies
Open Source R + Pattern (Cascading) | Out for training / in for apply | > 3 | Medium | 1 GB (1 server RAM in R)
Spark + MLlib | Separate cluster | 1 | Medium | 100 GB – 1 TB (cluster RAM)

How big is our data here?

Step 1: Build sessions
Step 2: Parse IP / time / ...
Step 3: Parse sequences
Step 4: Build user-level stats

RAW DATA (5 TB) → READY FOR ML (10 GB)

Uncompressed data size, for one year's worth of logs on a website with 10 million unique visitors per month

Clustering With Scikit on HDFS

1. Use Pydoop to get the data onto the training server
2. Use pandas to read the data and transform it to numerical features
3. KMeans().fit()
4. Use IPython to draw some graphs
5. Enjoy
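Steps 2 and 3 might look like the sketch below. The HDFS transfer is left out: an in-memory CSV stands in for the file fetched onto the training server, and the column names, the sample rows, and the choice of two clusters are assumptions for illustration.

```python
import io

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical session extract standing in for the file pulled from HDFS.
raw = io.StringIO(
    "session_id,device,page_views,seconds\n"
    "s1,mobile,1,10\n"
    "s2,mobile,2,20\n"
    "s3,desktop,30,900\n"
    "s4,desktop,28,950\n"
)
df = pd.read_csv(raw)

# Transform to numerical: one-hot encode the categorical column,
# then scale so that no single feature dominates the distance.
features = pd.get_dummies(df.drop(columns=["session_id"]), columns=["device"])
X = StandardScaler().fit_transform(features.astype(float))

# Cluster the sessions and attach the cluster id to each row.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
df["cluster"] = model.labels_
```

Scaling before k-means matters here because seconds spent and page views live on very different ranges.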

Session Data

Clustering

Clustering & Cluster Sampling

Take a balanced number of samples in each cluster, close to the centroid
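Balanced sampling near the centroids can be sketched with scikit-learn as follows. The helper name `sample_near_centroids`, the synthetic points, and the sample size per cluster are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_near_centroids(X, n_clusters, per_cluster, random_state=0):
    """Fit k-means, then pick the rows closest to each centroid."""
    model = KMeans(n_clusters=n_clusters, n_init=10,
                   random_state=random_state).fit(X)
    distances = model.transform(X)  # distance of each row to each centroid
    picked = {}
    for c in range(n_clusters):
        members = np.where(model.labels_ == c)[0]
        order = np.argsort(distances[members, c])
        # Same number of samples per cluster, whatever the cluster size.
        picked[c] = members[order[:per_cluster]].tolist()
    return model, picked

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(50, 2)),
    rng.normal([4.0, 4.0], 0.3, size=(50, 2)),
])
model, picked = sample_near_centroids(X, n_clusters=2, per_cluster=5)
```

The selected row indices are the sessions that would be shown to people for labelling, since they are the most typical of each cluster.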

Labelling

[Figure: timeline of one session's events, from 0’00 to 3’02]

Visualizing Sessions

Search for a specific Topic

Labelling

I can guess what this guy was doing !!!

Labelling


Newcomer from Google News

Foreigner Discovering The Site

Fan that loves to comment

Home Page Wanderer

Dark Bot (Competitor?)

What if ?






Supervised Learning

Independently of the clusters, use the labeled examples to classify each session into the predefined segments

Supervised Learning : e.g. in python

• Load the data and the label in python (Pandas)

• Fit the labeled sessions against a model

• Save the model in HDFS (python pickle)

• Run the model against all the data (Hadoop Streaming)
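The four steps can be sketched as below. The features, the two segment names used as labels, and the random forest are assumptions for illustration (the slides do not name a model); writing the pickled bytes to HDFS and wiring `score_line` into a Hadoop Streaming job are left out.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical labeled sessions: [page_views, comments] and their segment.
labeled_X = np.array(
    [[1, 0], [2, 0], [1, 1], [20, 15], [25, 12], [30, 20]], dtype=float
)
labeled_y = np.array(
    ["newcomer", "newcomer", "newcomer", "commenter", "commenter", "commenter"]
)

# Fit the labeled sessions against a model.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(labeled_X, labeled_y)

# Serialize the model; these bytes are what would be stored in HDFS.
blob = pickle.dumps(clf)

# In each streaming task: load the model once, then score line by line.
restored = pickle.loads(blob)

def score_line(line):
    """Classify one tab-separated session line: id, page_views, comments."""
    session_id, views, comments = line.rstrip("\n").split("\t")
    label = restored.predict([[float(views), float(comments)]])[0]
    return f"{session_id}\t{label}"

scored = [score_line("s9\t2\t0"), score_line("s10\t28\t18")]
```

Pickled models should only be loaded from trusted storage, and with the same scikit-learn version that produced them.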

We’ve got a tool to help you do that in Data Science Studio. He’s called the Doctor and he’s fun to use!

Compute Metrics Per Segment

Segments: Search for a specific topic, Newcomer from Google News, Foreigner discovering the site, Home page wanderer, Dark bot (competitor?)

• 0.3€ per session, 0.23€ acquisition costs
• 13k sessions, 1.3€ per session, 0.23€ acquisition costs
• 938k sessions
• 68k sessions, 0.3€ per session, 1.23€ acquisition costs
• 1k sessions, 0€ per session, 0€ acquisition costs

User Satisfaction Metrics

• Future-based metrics
  – Will the user most likely subscribe/pay in the future?

• Expressed opinion
  – Does the user look satisfied, judging from their behaviour?

Opinion-Based Training For User Satisfaction

• User feedback as “labels” to build a model of satisfaction
• “Predict” a satisfaction score for sessions without feedback

Session data (100 million sessions) + Feedback (10,000 feedbacks) → Scored sessions

HYPOTHESIS: IF TWO USERS HAVE SIMILAR NAVIGATION PATTERNS, THEY HAVE SIMILAR USER SATISFACTION LEVELS

Compute Metrics Per Segment

Segments: Search for a specific topic, Newcomer from Google News, Foreigner discovering the site, Home page wanderer, Dark bot (competitor?)

• 0.3€ per session, 0.23€ acquisition costs
• 938k sessions
• 68k sessions, 0.3€ per session, 1.23€ acquisition costs
• 1k sessions, 0€ per session, 0€ acquisition costs

Satisfaction scores: 0.87, 0.37, 0.28, 0.12, 0.28, 0.12
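Under the hypothesis that similar navigation means similar satisfaction, a nearest-neighbour average is one simple way to score sessions without feedback and then aggregate per segment. This sketch is not the model used in the talk: the feature vectors, the feedback values, the segment names, and k = 3 are all invented for illustration.

```python
import math
from collections import defaultdict

def predict_satisfaction(features, labeled, k=3):
    """Average the feedback of the k labeled sessions closest to `features`."""
    nearest = sorted(labeled, key=lambda item: math.dist(features, item[0]))[:k]
    return sum(score for _, score in nearest) / len(nearest)

# (feature_vector, satisfaction) for the few sessions that left feedback.
labeled = [
    ((1.0, 0.0), 0.2), ((2.0, 0.0), 0.3), ((1.0, 1.0), 0.1),
    ((20.0, 5.0), 0.9), ((22.0, 6.0), 0.8), ((25.0, 4.0), 1.0),
]

# (segment, feature_vector) for sessions without feedback.
unlabeled = [
    ("wanderer", (1.5, 0.5)),
    ("wanderer", (2.0, 1.0)),
    ("fan", (21.0, 5.0)),
]

# Score every session, then average the scores inside each segment.
by_segment = defaultdict(list)
for segment, features in unlabeled:
    by_segment[segment].append(predict_satisfaction(features, labeled))
segment_score = {s: sum(v) / len(v) for s, v in by_segment.items()}
```

At the scale quoted on the slide, the scoring step would run as a distributed job over all sessions, with only the small feedback set held in memory.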

Data in Time: Smoothing

• In red: the base metric
• In blue: the smoothed metric

RAW DATA MAY VARY A LOT FROM DAY TO DAY

IT WILL SCARE PEOPLE

Exponential Smoothing In Hive

SELECT segment, moving_avg(day, satisfaction, 15, 1.52, 15, DATEDIFF('2014-01-15', '2014-01-01'))

FROM stats

GROUP BY segment

The numeric factors (15, 1.52, 15) determine how strongly you smooth, and over how many days
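For intuition, simple exponential smoothing fits in a few lines of Python. This is the textbook formula, not a reimplementation of the `moving_avg` function in the query, whose exact parameters are not explained on the slide; the smoothing factor `alpha` and the sample values are assumptions.

```python
def exponential_smoothing(values, alpha):
    """Each output is alpha * new value + (1 - alpha) * previous output."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for value in values:
        if not smoothed:
            smoothed.append(value)  # the first point is kept as-is
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

daily_satisfaction = [1.0, 0.0, 1.0]
smoothed = exponential_smoothing(daily_satisfaction, alpha=0.5)
```

A small `alpha` smooths heavily and reacts slowly to real changes; an `alpha` of 1 returns the raw series unchanged.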

Final : Follow Smoothed Satisfaction




Follow Satisfaction Metric Per Segment

Damn, our latest release has diverging effects on segments

Thank You !

Florian Douetteau, @fdouetteau

Questions now or later: [email protected]

dataiku.com
