Dataiku hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours
Semi-Supervised Learning on Hadoop to understand user behaviors
Hadoop Summit Amsterdam, 2-3 April 2014
Motivation
• CxO
– Page Views, Unique Visitors, Dollars, Subscription
• Editor / Product Manager
– Time Spent, Comments
• Users
– Content
What matters on a website?
Key Usage Metrics
• Publisher
– Time Spent on Page
– Number of pages seen
– Number of comments
– Move to Subscription
• Search Engine
– Click on first hits / re-click
– Rephrasing ratio
– Will come back tomorrow
– Click on Advertising
• Online Game
– Time spent in the game
– Level Progress
– In-App Purchase
The Quest for the Missing Proxy
• Publisher
– Time Spent on Page
– Number of pages seen
– Number of comments
– User Satisfaction
– Move to Subscription
• Search Engine
– Click on first hits / re-click
– Rephrasing ratio
– User Satisfaction
– Will come back tomorrow
– Click on Advertising
• Online Game
– Time spent in the game
– Level Progress
– User Satisfaction
– In-App Purchase
USER SATISFACTION !
Question
How to measure and drive user satisfaction on a large website with very diverse usage patterns?
The Problem
• Newcomers from Google News
• People coming from Twitter and Facebook posts
• People coming to the website almost every day
• People who love to comment
• Foreigners
• Robots
• People fond of the sport section only
• …
BEHAVIOUR DIVERSITY
THE AVERAGED METRICS WOULD HIDE IMPORTANT VARIATION ON SPECIFIC SEGMENTS
SubProblem 1: Hard Segments
• Segment users by number of visits per month
– > 20 days per month -> Engaged Users
• Segment by transformed or not
• Segment by country
Subproblem 2: Hard Metrics
• Newspaper: Time Spent on the website log(Number of page views) + Number of actions
• Search engine: Click Ratio
• E-Commerce: Transformation Ratio
Limits
• Hard Segments: MISSING PART OF THE REALITY
• Hard Metrics: ARGUING BETWEEN TEAMS
Semi-Supervised Learning
• Supervised Learning: training data is all labeled data -> Model
• Unsupervised Learning: training data is all unlabeled data -> Model
• Semi-Supervised Learning: training data is some labeled data and lots of unlabeled data -> Model
½ SL – Natural Language Processing
I hope I’ll enjoy Amsterdam, and not only because of Hadoop
Je pense bien passer du bon temps à Amsterdam, et pas seulement grâce à Hadoop
Statistical Knowledge
• Text Structure (Unsupervised)
• Aligned Corpus (Supervised)
½ SL Applied to Web Sessions
Lots of customer sessions
Not so much concrete customer feedback
Subscription
Semi-Supervised Learning: 3 Approaches
• Generative Models, e.g. Gaussian fits
– All data fits a Gaussian distribution with parameter X
– Find the X that best fits the distribution of both labeled and unlabeled data
• Fits with costs
– Supervised learning with a cost function that captures a distance between points related to the unlabeled data structure
• Ad-hoc: combine unsupervised, then supervised
Clustering + Supervised in practice
Unlabeled training data points in grey
Labeled training data points in color
Supervised Learning Only
½ SL : Fit to the underlying structure
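The ad-hoc approach (cluster first, then let the few labeled points name their cluster) can be sketched roughly as follows. This is only an illustration, assuming cluster assignments already come from some clustering step; the function name `propagate_labels` and the toy data are hypothetical, not from the talk.

```python
from collections import Counter, defaultdict

def propagate_labels(cluster_ids, labels):
    """Give every point the majority label of the labeled points in its cluster.

    cluster_ids: one cluster id per point (output of any clustering step).
    labels: one label per point, or None when the point is unlabeled.
    Points in a cluster with no labeled member stay None.
    """
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        if label is not None:
            votes[cluster][label] += 1
    majority = {c: counter.most_common(1)[0][0] for c, counter in votes.items()}
    return [majority.get(c) for c in cluster_ids]

# Toy data (assumption): 7 points in 3 clusters, only 3 of them labeled.
cluster_ids = [0, 0, 0, 1, 1, 1, 2]
labels = ["fan", None, None, None, "bot", "bot", None]
propagated = propagate_labels(cluster_ids, labels)
```

The last point stays unlabeled because nobody in its cluster was tagged, which is one reason to sample points for labelling from every cluster.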
Our Approach
1. (Lots of) data preparation to build meaningful user sessions
2. Cluster sessions and have end users validate/tag those clusters
3. Create predictive user satisfaction metrics
4. Follow those metrics!
Data Prep: Overview
Step 1: Build Sessions (Pig)
Step 2: Parse IP/Time/.. (custom Python)
Step 3: Parse Sequences (Hive or custom Python)
Step 4: Build user-level stats (Hive)
RAW DATA READY FOR ML
Step 1. Build Session
• Use Hive (or Pig)
• Group into “Session”
• Depending on the variable
– IP, Device: select only one per log
– URL, Event: create an ordered array that represents the sequence of events in the session
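The talk builds sessions in Hive or Pig; the grouping logic itself can be sketched in plain Python as below. The 30-minute inactivity gap, the `(visitor_id, unix_timestamp, url)` record layout and the name `build_sessions` are assumptions made for illustration.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not from the talk

def build_sessions(events):
    """events: iterable of (visitor_id, unix_timestamp, url).

    Returns a list of dicts, one per session, each with its ordered URL sequence.
    """
    by_visitor = defaultdict(list)
    for visitor, ts, url in events:
        by_visitor[visitor].append((ts, url))

    sessions = []
    for visitor, hits in by_visitor.items():
        hits.sort()  # order events by time within each visitor
        current = None
        last_ts = None
        for ts, url in hits:
            # Start a new session on the first hit or after a long pause.
            if current is None or ts - last_ts > SESSION_GAP_SECONDS:
                current = {"visitor": visitor, "start": ts, "urls": []}
                sessions.append(current)
            current["urls"].append(url)
            last_ts = ts
    return sessions

events = [
    ("a", 0, "/home"),
    ("a", 60, "/products"),
    ("b", 30, "/blog"),
    ("a", 4000, "/home"),
]
sessions = build_sessions(events)
```

Visitor "a" ends up with two sessions here because the last hit comes more than 30 minutes after the previous one.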
Step 2: Basic Features
• IP Address -> Location, City
• User-Agent -> Device
• Timestamp -> User Time (day or night?)
Python + Hadoop Streaming
Extracted Data: the ORIGINAL columns plus three NEW ones
• Country from IP
• Device from User-Agent
• Hour from Country & Time
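A Hadoop Streaming mapper for this step might look roughly like the sketch below. The tab-separated `(ip, unix_ts, user_agent)` layout, the substring rules and the 07:00-22:00 "day" window are all assumptions; the country lookup is left out because it needs a GeoIP database.

```python
from datetime import datetime, timezone

def device_from_user_agent(user_agent):
    """Very rough device class from a User-Agent string (illustrative rules)."""
    ua = user_agent.lower()
    if "bot" in ua or "spider" in ua or "crawler" in ua:
        return "bot"
    if "ipad" in ua or "tablet" in ua:
        return "tablet"
    if "mobile" in ua or "iphone" in ua or "android" in ua:
        return "mobile"
    return "desktop"

def day_or_night(unix_ts, utc_offset_hours=0):
    """Label a timestamp 'day' (07:00-21:59 local time) or 'night'."""
    hour = datetime.fromtimestamp(unix_ts, tz=timezone.utc).hour
    local_hour = (hour + utc_offset_hours) % 24
    return "day" if 7 <= local_hour < 22 else "night"

def map_line(line):
    """One tab-separated log line (ip, unix_ts, user_agent) -> enriched line."""
    ip, ts, user_agent = line.rstrip("\n").split("\t")
    return "\t".join([ip, ts, user_agent,
                      device_from_user_agent(user_agent),
                      day_or_night(int(ts))])

def run(stream_in, stream_out):
    """Streaming entry point: read records from stream_in, write to stream_out."""
    for line in stream_in:
        stream_out.write(map_line(line) + "\n")
```

In a streaming job the script would end with `run(sys.stdin, sys.stdout)`, since Hadoop Streaming feeds records on standard input and collects standard output.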
Step 3: Session Signals
• Simple Signals
– Number of Page Views
– Time Spent …
– Etc…
• Limitation: it might not help that much to differentiate behaviours
More Elaborate: N-Grams Model
Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
Protein | Amino Acid | Cys-Gly-Leu | Cys, Gly, Leu | Cys-Gly, Gly-Leu | Cys-Gly-Leu
DNA | Base Pair | …ATTAGCAT.. | A,T,T,A | AT,TT,TA,AG,.. | ATT,TTA,TAG,..
NLP (character) | Character | ..some like it hot… | s,o,m,e,l,i,t.. | so,om,el,li,it | som,ome,me_,_li,lik,..
NLP (word) | Word | ..some like it hot… | some,like,it | some-like,like-it | some-like-it,like-it-hot
N-Grams Model For Sessions
Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
Web Sessions | Page View | [/home, /products, /trynow, /blog] | /home, /products, /trynow, /blog | /home-/products, /products-/trynow, /trynow-/blog | /home-/products-/trynow, /products-/trynow-/blog
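The web-session row of the table can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `ngrams` and the hyphen separator are choices made here for illustration.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined with a hyphen."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

session = ["/home", "/products", "/trynow", "/blog"]
unigrams = ngrams(session, 1)
bigrams = ngrams(session, 2)
trigrams = ngrams(session, 3)
```

A session shorter than `n` simply yields an empty list, so very short visits contribute no higher-order n-grams.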
Session N-Grams Analytics
Campaign / URL / Event | Detailed Token | Simple Token
utm=google_search | google-search-my-site | google-search
/home | home | home
/search?q=baseball | search-baseball | search
click=www.nfl.com | click-nfl | click
/sport/new-player-com.. | sport/new-player-comming | sport
/search?q=Mick+JONES | search-mick+jones | search
click=www.nfl.com | click-nfl | click
/sport/new-player-com.. | sport/new-player-comming | sport
/politics/home | politics-home | politics
Important Tricks:
• Incorporate the first referrer / marketing campaign as FIRST TOKEN
• Build two levels of tokens: detailed, and category only
• Detailed tokens -> fine-grain n-grams; simple tokens -> coarse-grain n-grams
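The exact tokenization rules are not given on the slide, so the sketch below is only one plausible reading that reproduces the search, click, /home and /politics/home rows. Campaign (utm) tokens and the truncated sport URL are deliberately not covered, and the function name `tokenize` is hypothetical.

```python
from urllib.parse import urlsplit

def tokenize(event):
    """Return (detailed_token, simple_token) for one URL or click event."""
    if event.startswith("click="):
        host = event[len("click="):].lower()
        parts = host.split(".")
        # "www.nfl.com" -> "nfl": drop a leading "www", keep the next label.
        if len(parts) > 1 and parts[0] == "www":
            parts = parts[1:]
        return "click-" + parts[0], "click"

    url = urlsplit(event)
    segments = [s.lower() for s in url.path.split("/") if s]
    if not segments:
        return "home", "home"  # assumption: treat "/" as the home page
    # Keep raw query values (e.g. "mick+jones") as part of the detailed token.
    values = [pair.split("=", 1)[1].lower()
              for pair in url.query.split("&") if "=" in pair]
    detailed = "-".join(segments + values)
    simple = segments[0]  # first path segment acts as the category
    return detailed, simple

tokens = [tokenize(e) for e in
          ["/home", "/search?q=baseball", "click=www.nfl.com", "/politics/home"]]
```

Keeping both tokens per event lets the same session be described at two granularities without re-parsing the logs.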
How To In Practice
• Hive query using the n-grams UDF
• Compute the LLR (Log-Likelihood Ratio) metrics
• Keep the most frequent n-grams of each type (detailed / non-detailed) as features for the session
• Hint: set the frequency limit so that > 90% of sessions can be described by a non-detailed n-gram
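The slide does not say which events the LLR compares. One common use, assumed here, is Dunning's log-likelihood ratio on a 2x2 contingency table to test whether two consecutive tokens co-occur more often than chance. The function names are illustrative.

```python
import math
from collections import Counter

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) of a 2x2 contingency table of counts."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    total = 0.0
    for k, i, j in ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1)):
        if k > 0:  # empty cells contribute nothing to the sum
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total

def bigram_llr(sessions):
    """Score every bigram (a, b) seen in the token sequences."""
    pairs = Counter()
    for tokens in sessions:
        pairs.update(zip(tokens, tokens[1:]))
    n = sum(pairs.values())
    first, second = Counter(), Counter()
    for (a, b), count in pairs.items():
        first[a] += count
        second[b] += count
    scores = {}
    for (a, b), k11 in pairs.items():
        k12 = first[a] - k11    # a followed by something other than b
        k21 = second[b] - k11   # b preceded by something other than a
        k22 = n - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores

scores = bigram_llr([["home", "search", "click"],
                     ["home", "search", "click"],
                     ["home", "sport"]])
```

A table whose cells are exactly proportional scores 0, and the score grows as the two tokens become more tightly associated.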
Step 4. Cohort-like data
• Per cookie, compute metrics
– Nb. days since first visit
– Nb. visits in the last 30 days
– Average session time
– …
• Reintegrate this information
• Easily achieved with a HiveQL query
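The talk computes these per-cookie statistics in HiveQL; the same aggregation can be sketched in plain Python. The `(cookie, visit_date, duration_seconds)` record layout and the metric names are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date

def cohort_metrics(sessions, today):
    """sessions: iterable of (cookie, visit_date, duration_seconds).

    Returns {cookie: metrics dict}, computed as of `today`.
    """
    by_cookie = defaultdict(list)
    for cookie, day, duration in sessions:
        by_cookie[cookie].append((day, duration))

    metrics = {}
    for cookie, visits in by_cookie.items():
        first_visit = min(day for day, _ in visits)
        recent = [day for day, _ in visits if 0 <= (today - day).days < 30]
        metrics[cookie] = {
            "days_since_first_visit": (today - first_visit).days,
            "visits_last_30_days": len(recent),
            "avg_session_seconds": sum(d for _, d in visits) / len(visits),
        }
    return metrics

sessions = [
    ("c1", date(2014, 1, 1), 120),
    ("c1", date(2014, 3, 20), 60),
    ("c2", date(2014, 3, 30), 300),
]
stats = cohort_metrics(sessions, today=date(2014, 4, 2))
```

Joining `stats` back onto each session by cookie gives every session row its user-level context.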
Machine Learning for HDFS Data
Tool | Kind | Algorithms for clustering | Simplicity, TRAIN set size
Apache Mahout | MapReduce | ~ 10 available | Expert, TERABYTES
Python (Scikit+Pandas+…) | Out for training / In for apply | ~ 20 available (including bi-clustering) | Medium (10 GB), 1 SERVER RAM
H2O | Separate Cluster | 1 (kMeans) | Medium (100 GB – 1 TB), CLUSTER RAM
Open Source R + Hadoop | Varies | Varies | Varies, Varies
Open Source R + Pattern (Cascading) | Out for training / In for apply | > 3 | Medium (1 GB), 1 server RAM in R
Spark + MLlib | Separate Cluster | 1 | Medium (100 GB – 1 TB), CLUSTER RAM
How big is our data here?
Uncompressed data size, for 1 year's worth of logs on a website with 10 million unique visitors per month:
RAW DATA: 5 TB
Step 1: Build Sessions
Step 2: Parse IP/Time/..
Step 3: Parse Sequences
Step 4: Build user-level stats
READY FOR ML: 10 GB
Clustering With Scikit on HDFS
1. Use Pydoop to get the data onto the training server
2. Use pandas to read the data and transform it to numerical features
3. KMeans().fit()
4. IPython to draw some graphs
5. Enjoy
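To show what the k-means step does, here is a dependency-free sketch of Lloyd's algorithm. It is a stand-in for scikit-learn's `KMeans`, not the code used in the talk; the toy features (page views, minutes spent) are made up.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples.

    Returns (centroids, assignments).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments

# Toy session features (assumption): (page views, minutes spent).
points = [(1, 0.5), (2, 1.0), (1, 1.5), (20, 30.0), (22, 28.0), (21, 35.0)]
centroids, assignments = kmeans(points, k=2)
```

On real data the features would usually be scaled first, since raw page counts and durations live on very different ranges.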
Clustering & Cluster Sampling
Session Data -> Clustering
Take a balanced number of samples in each cluster, close to the centroid
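Picking the sessions to hand to human labellers can be sketched as follows, assuming points, cluster assignments and centroids are already available. The helper name `sample_near_centroids` and the toy numbers are illustrative.

```python
def sample_near_centroids(points, assignments, centroids, per_cluster):
    """Return {cluster: indices of the `per_cluster` points closest to its centroid}."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    picked = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assignments) if a == c]
        members.sort(key=lambda i: sq_dist(points[i], centroid))
        picked[c] = members[:per_cluster]  # same quota for every cluster
    return picked

points = [(0, 0), (1, 0), (5, 0), (10, 10), (11, 10), (30, 30)]
assignments = [0, 0, 0, 1, 1, 1]
centroids = [(2, 0), (17, 50 / 3)]
to_label = sample_near_centroids(points, assignments, centroids, per_cluster=2)
```

Using the same quota per cluster keeps small segments from being drowned out by the large ones in the labelled set.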
Labelling
Visualizing Sessions
[Session timeline: 0'00, 0'12, 1'04, 1'45, 3'02]
Search for a specific Topic
I can guess what this guy was doing!!!
Labelling
• Search for a specific Topic
• Newcomer from Google News
• Foreigner Discovering The Site
• Fan that loves to comment
• Home Page Wanderer
• Dark Bot (Competitor?)
What if ?
Supervised Learning
Independently of the clusters, use the labeled examples to classify each session into the predefined segments
Supervised Learning : e.g. in python
• Load the data and the label in python (Pandas)
• Fit a model on the labeled sessions
• Save the model in HDFS (python pickle)
• Run the model against all the data (Hadoop Streaming)
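The four steps above can be sketched without any library. A nearest-centroid rule stands in here for whatever model would really be trained (for example with scikit-learn), and the pickled bytes are kept in memory instead of being written to HDFS. All names and data are hypothetical.

```python
import pickle

def fit_centroids(rows, labels):
    """Return {label: mean feature vector} computed from the labeled rows."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, row):
    """Return the label whose centroid is closest to `row`."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(row, centroids[label])),
    )

# 1-2. Load labeled sessions (page views, minutes spent) and fit the model.
rows = [(1, 0.5), (2, 1.0), (40, 2.0), (45, 1.0)]
labels = ["newcomer", "newcomer", "bot", "bot"]
model = fit_centroids(rows, labels)

# 3. Serialize the fitted model; these bytes could be stored as a file on HDFS.
blob = pickle.dumps(model)

# 4. What a streaming mapper would do for each tab-separated input line.
restored = pickle.loads(blob)

def classify_line(line):
    features = [float(x) for x in line.rstrip("\n").split("\t")]
    return predict(restored, features)

predictions = [classify_line("3\t1.2\n"), classify_line("50\t0.5\n")]
```

Training happens once on a single machine with the small labeled set, while scoring is a stateless per-line function that parallelizes across the cluster.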
We’ve got a tool to help you do that in Data Science Studio. He’s called the Doctor and he’s fun to use!
Compute Metrics Per Segments
Segments: Search for a specific Topic, Newcomer from Google News, Foreigner Discovering The Site, Fan that loves to comment, Home Page Wanderer, Dark Bot (Competitor?)
Per-segment figures:
• 0.3€ per session, 0.23€ acquisition costs
• 13k sessions, 1.3€ per session, 0.23€ acquisition costs
• 938k sessions
• 938k sessions, 0.3€ per session, 0.23€ acquisition costs
• 738k sessions, 0.83€ per session, 0.73€ acquisition costs
• 68k sessions, 0.3€ per session, 1.23€ acquisition costs
• 1k sessions, 0€ per session, 0€ acquisition costs
User Satisfaction Metrics
• Future-Based Metrics
– Will the user most likely subscribe/pay in the future?
• Expressed-Opinion
– Does the user look satisfied, judging from their behaviour?
Opinion-Based Training For User Satisfaction
• Session Data (100 Million Sessions) + Feedbacks (10,000 feedbacks) -> Scored Sessions
• User feedback serves as “labels” to build a model of satisfaction
• “Predict” a satisfaction score for sessions without feedback
HYPOTHESIS: IF TWO USERS HAVE SIMILAR NAVIGATION PATTERNS, THEY HAVE SIMILAR USER SATISFACTION LEVELS
Compute Metrics Per Segments
SATISFACTION SCORE (one per segment): 0.87, 0.37, 0.28, 0.12, 0.28, 0.12
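The hypothesis that similar navigation implies similar satisfaction maps naturally onto a nearest-neighbour rule. The sketch below scores a session with the average feedback of its k most similar sessions; the choice of k-NN, the feature vectors and the scores are assumptions, not the model used in the talk.

```python
def knn_satisfaction(session, feedback_sessions, k=3):
    """Average the feedback scores of the k most similar labeled sessions.

    feedback_sessions: list of (feature_vector, satisfaction_score in [0, 1]).
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    nearest = sorted(feedback_sessions, key=lambda fs: sq_dist(session, fs[0]))[:k]
    return sum(score for _, score in nearest) / len(nearest)

# Toy feedback (assumption): (page views, minutes spent) -> stated satisfaction.
feedback = [
    ((10, 8.0), 0.9),
    ((12, 9.5), 0.8),
    ((11, 7.0), 1.0),
    ((1, 0.2), 0.1),
    ((2, 0.1), 0.2),
    ((1, 0.4), 0.0),
]
engaged_score = knn_satisfaction((11, 8.5), feedback)
bounce_score = knn_satisfaction((1, 0.3), feedback)
```

Averaging the predicted scores of all sessions in a segment would then give a per-segment satisfaction figure like the ones listed above.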
Data in Time: Smoothing
In red: the base metric. In blue: the smoothed metric.
RAW DATA MAY VARY A LOT FROM DAY TO DAY
IT WILL SCARE PEOPLE
Exponential Smoothing In Hive
SELECT segment, moving_avg(day, satisfaction, 15, 1.52, 15, DATEDIFF('2014-15-01', '2014-01-01'))
FROM stats
GROUP BY segment
These factors determine whether you smooth a lot or not, and over how many days
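The idea behind the smoothing can be shown with simple exponential smoothing in Python. This sketch does not replicate the arguments of the Hive `moving_avg` call above, whose meaning is not spelled out on the slide; the single `alpha` factor and the sample values are assumptions.

```python
def exponential_smoothing(values, alpha):
    """Return the smoothed series s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(x)  # seed with the first observation
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

daily_satisfaction = [0.50, 0.90, 0.10, 0.50]
smoothed = exponential_smoothing(daily_satisfaction, alpha=0.5)
```

A smaller `alpha` smooths more aggressively and reacts more slowly to a real change, which is the trade-off the slide's factors control.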
Final : Follow Smoothed Satisfaction
Follow Satisfaction Metric Per Segment
Damn, our latest release has diverging effects on segments