LA HUG 2012 02-07
Beating up on Bayesian Bandits
Mahout
• Scalable Data Mining for Everybody
What is Mahout?
• Recommendations (people who x this also x that)
• Clustering (segment data into groups of similar items)
• Classification (learn decision making from examples)
• Stuff (LDA, SVD, frequent item-set, math)
Classification in Detail
• Naive Bayes Family
  – Hadoop-based training
• Decision Forests
  – Hadoop-based training
• Logistic Regression (aka SGD)
  – fast on-line (sequential) training
  – now with MORE topping!
An Example
And Another
From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 EUR (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor....
Date: Thu, May 20, 2010 at 10:51 AM
From: George <[email protected]>
Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
Feature Encoding
Hashed Encoding
Feature Collisions
How it Works
• We are given “features”
  – Often binary values in a vector
• Algorithm learns weights
  – Weighted sum of feature * weight is the key
• Each weight is a single real value
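The weighted-sum idea on this slide, together with the hashed encoding on the next, can be sketched in a few lines. This is an illustrative Python sketch, not Mahout's actual encoder; the bucket count, feature names, and toy weights are all assumptions:

```python
import hashlib

NUM_BUCKETS = 16  # tiny for illustration; real encoders use far more buckets

def hash_feature(name, num_buckets=NUM_BUCKETS):
    # Hashed encoding: map a feature name to a bucket via a stable hash.
    # Distinct features can land in the same bucket -- a feature collision.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % num_buckets

def score(features, weights):
    # The key quantity: a weighted sum of (binary) feature values,
    # where each weight is a single real value.
    return sum(weights[hash_feature(f)] for f in features)

# Toy learned weights (values made up for the example).
weights = [0.0] * NUM_BUCKETS
weights[hash_feature("word=invoice")] = -2.0
weights[hash_feature("word=lunch")] = 1.5

print(score(["word=lunch"], weights))  # prints 1.5
```

With only 16 buckets, collisions between distinct features are quite likely, which is exactly what the Feature Collisions slide is about; real systems use many more buckets to keep collisions rare.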
A Quick Diversion
• You see a coin
  – What is the probability of heads?
  – Could it be larger or smaller than that?
• I flip the coin and while it is in the air ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
  – And did it ever have a single value?
A First Conclusion
• Probability as expressed by humans is subjective and depends on information and experience
A Second Conclusion
• A single number is a bad way to express uncertain knowledge
• A distribution of values might be better
I Dunno
5 and 5
2 and 10
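The three answers above (“I dunno”, 5 heads and 5 tails, 2 heads and 10 tails) map naturally onto Beta-distribution beliefs, which is presumably what these slides plotted. A sketch of each state of knowledge as a sampled distribution; the uniform Beta(1, 1) prior is my assumption:

```python
import random
from statistics import mean, stdev

random.seed(0)

def belief_samples(heads, tails, n=10_000):
    # Posterior over the heads-probability after seeing `heads` and `tails`,
    # starting from a uniform Beta(1, 1) prior ("I dunno").
    return [random.betavariate(1 + heads, 1 + tails) for _ in range(n)]

# (0, 0): spread over all of [0, 1]; (5, 5): near 0.5 but tighter;
# (2, 10): belief shifts toward low heads-probability.
for heads, tails in [(0, 0), (5, 5), (2, 10)]:
    s = belief_samples(heads, tails)
    print(heads, tails, round(mean(s), 2), round(stdev(s), 2))
```

The point of the slide sequence: more data doesn't just move the single “best” number, it narrows the whole distribution.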
The Cynic Among Us
A Second Diversion
Two-armed Bandit
Which One to Play?
• One may be better than the other
• The better machine pays off at some rate
• Playing the other will pay off at a lesser rate
  – Playing the lesser machine has “opportunity cost”
• But how do we know which is which?
  – Explore versus Exploit!
Algorithmic Costs
• Option 1
  – Explicitly code the explore/exploit trade-off
• Option 2
  – Bayesian Bandit
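Option 1 might look like epsilon-greedy: a hand-tuned rule that explores some fixed fraction of the time. The function below is an illustrative sketch; the name and the 10% exploration rate are my assumptions, not from the talk:

```python
import random

def epsilon_greedy(counts, rewards, epsilon=0.1):
    # counts[i]  = number of plays of arm i
    # rewards[i] = total payoff observed from arm i
    # Explore: with probability epsilon, play a random arm.
    if random.random() < epsilon:
        return random.randrange(len(counts))
    # Exploit: otherwise play the arm with the best observed mean payoff.
    means = [r / c if c > 0 else 0.0 for r, c in zip(rewards, counts)]
    return means.index(max(means))

# With exploration turned off, it simply picks the best-looking arm.
print(epsilon_greedy([10, 10], [9, 1], epsilon=0.0))  # prints 0
```

The awkward part is choosing epsilon by hand, which is exactly the explicit trade-off the Bayesian Bandit avoids.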
Bayesian Bandit
• Compute distributions based on data
• Sample p1 and p2 from these distributions
• Put a coin in bandit 1 if p1 > p2
• Else, put the coin in bandit 2
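The four bullets above are Thompson sampling, and they translate almost line for line into a simulation. A minimal Python sketch; the Beta(1, 1) priors, the payoff rates 0.4 and 0.6, and the 1000 plays are illustrative assumptions:

```python
import random

random.seed(1)

true_rates = [0.4, 0.6]   # unknown to the player
wins = [0, 0]             # successes per bandit
losses = [0, 0]           # failures per bandit

for _ in range(1000):
    # Compute distributions based on data, then sample p1 and p2 from them.
    p = [random.betavariate(1 + wins[i], 1 + losses[i]) for i in (0, 1)]
    # Put a coin in bandit 1 if p1 > p2; else, put the coin in bandit 2.
    arm = 0 if p[0] > p[1] else 1
    reward = 1 if random.random() < true_rates[arm] else 0
    wins[arm] += reward
    losses[arm] += 1 - reward

plays = [wins[i] + losses[i] for i in (0, 1)]
print(plays)  # the better bandit typically ends up with most of the plays
```

Early on the posteriors overlap, so both arms get sampled (exploration); as evidence accumulates, the better arm's samples win almost every draw (exploitation), with no explicit trade-off parameter anywhere.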
The Basic Idea
• We can encode a distribution by sampling
• Sampling allows unification of exploration and exploitation
• Can be extended to more general response models
Deployment with Storm/MapR
All state managed transactionally in MapR file system
Service Architecture
MapR Lockless Storage Services
MapR Pluggable Service Management
Storm
Hadoop
Find Out More
• Me: [email protected] [email protected] [email protected]
• MapR: http://www.mapr.com
• Mahout: http://mahout.apache.org
• Code: https://github.com/tdunning