
© MapR Technologies - Confidential

News From Mahout


whoami – Ted Dunning

Chief Application Architect, MapR Technologies

Committer, member, Apache Software Foundation

– particularly Mahout, ZooKeeper and Drill

(we’re hiring)

Contact me [email protected]

[email protected]

[email protected]

@ted_dunning





Slides and such (available late tonight):

– http://www.mapr.com/company/events/nyhug-03-05-2013

Hash tags: #mapr #nyhug #mahout









New in Mahout

0.8 is coming soon (1-2 months)

gobs of fixes

QR decomposition is 10x faster

– makes ALS 2-3 times faster

May include Bayesian Bandits

Super fast k-means

– fast

– online (!?!)


Possible new edition of Mahout in Action (MiA) coming

– Japanese and Korean editions released, Chinese coming


Real-time Learning


We have a product to sell …

from a web-site


Bogus Dog Food is the Best!

Now available in handy 1 ton bags!

Buy 5!

What picture?

What tag-line?

What call to action?


The Challenge

Design decisions affect probability of success

– Cheesy web-sites don’t even sell cheese

The best designers do better when allowed to fail

– Exploration juices creativity

But failing is expensive

– If only because we could have succeeded

– But also because offending or disappointing customers is bad


More Challenges

Too many designs

– 5 pictures

– 10 tag-lines

– 4 calls to action

– 3 back-ground colors

=> 5 x 10 x 4 x 3 = 600 designs

It gets worse quickly

– What about changes on the back-end?

– Search engine variants?

– Checkout process variants?


Example – AB testing in real-time

I have 15 versions of my landing page

Each visitor is assigned to a version

– Which version?

A conversion or sale or whatever can happen

– How long to wait?

Some versions of the landing page are horrible

– Don’t want to give them traffic


A Quick Diversion

You see a coin

– What is the probability of heads?

– Could it be larger or smaller than that?

I flip the coin and while it is in the air ask again

I catch the coin and ask again

I look at the coin (and you don’t) and ask again

Why does the answer change?

– And did it ever have a single value?


A Philosophical Conclusion

Probability as expressed by humans is subjective and depends on information and experience


I Dunno


5 heads out of 10 throws


2 heads out of 12 throws


So now you understand Bayesian probability
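The three preceding slides can be read as a belief about the coin before any throws, after 5 heads in 10 throws, and after 2 heads in 12. A minimal Python sketch of that update, assuming a uniform Beta(1, 1) prior (the prior and the function names are illustrative assumptions, not taken from the slides):

```python
import math

def beta_posterior(heads, throws):
    """Parameters (a, b) of the Beta posterior for P(heads), assuming a uniform Beta(1, 1) prior."""
    tails = throws - heads
    return heads + 1, tails + 1

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

def beta_sd(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

for label, heads, throws in [("I dunno", 0, 0), ("5 of 10", 5, 10), ("2 of 12", 2, 12)]:
    a, b = beta_posterior(heads, throws)
    print(f"{label}: Beta({a}, {b}), mean {beta_mean(a, b):.3f}, sd {beta_sd(a, b):.3f}")
```

More throws narrow the distribution; the data also move its centre away from 0.5 when heads are rare.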


Another Quick Diversion

Let’s play a shell game

This is a special shell game

It costs you nothing to play

The pea has constant probability of being under each shell (trust me)

How do you find the best shell?

How do you find it while maximizing the number of wins?


Pause for short con-game


Interim Thoughts

Can you identify winners or losers without trying them out?

Can you ever completely eliminate a shell with a bad streak?

Should you keep trying apparent losers?


So now you understand multi-armed bandits


Conclusions

Can you identify winners or losers without trying them out? No

Can you ever completely eliminate a shell with a bad streak? No

Should you keep trying apparent losers? Yes, but at a decreasing rate


Is there an optimum strategy?


Bayesian Bandit

Compute distributions based on data so far

Sample $p_1$, $p_2$ and $p_3$ from these distributions

Pick shell i where $i = \arg\max_i p_i$

Lemma 1: The probability of picking shell i will match the probability it is the best shell

Lemma 2: This is as good as it gets
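The selection step and Lemma 1 can be sketched in Python. This is an illustration under stated assumptions: win/loss payoffs, a uniform Beta(1, 1) prior, and made-up counts for three shells. Because each call draws one joint sample from the posterior, the long-run frequency with which a shell wins the argmax estimates the posterior probability that it is the best shell.

```python
import random

def select(counts, rng):
    """Sample one plausible win rate per shell and return the index of the largest.

    counts is a list of (wins, losses) pairs; a uniform Beta(1, 1) prior is assumed.
    """
    samples = [rng.betavariate(wins + 1, losses + 1) for wins, losses in counts]
    return max(range(len(samples)), key=samples.__getitem__)

def selection_frequencies(counts, trials, rng):
    """Fraction of trials in which each shell is selected."""
    picks = [0] * len(counts)
    for _ in range(trials):
        picks[select(counts, rng)] += 1
    return [p / trials for p in picks]

rng = random.Random(42)
counts = [(2, 8), (5, 5), (7, 3)]  # hypothetical (wins, losses) for three shells
freqs = selection_frequencies(counts, 20000, rng)
print(freqs)
```

The apparent loser still gets picked now and then, but less often as evidence against it accumulates.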


And it works!

[Figure: regret versus n, for n from 0 to 1000, comparing ε-greedy (ε = 0.05) with the Bayesian Bandit with Gamma-Normal]
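A comparison of this kind can be reproduced in spirit with a short simulation. The sketch below is not the experiment behind the figure: it assumes win/loss payoffs with a Beta posterior rather than the Gamma-Normal model, and the three payoff probabilities, the run counts and the function names are all made up for illustration.

```python
import random

def thompson_pick(wins, losses, rng):
    """Bayesian Bandit: sample a win rate per arm from its Beta posterior, take the largest."""
    samples = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
    return max(range(len(samples)), key=samples.__getitem__)

def epsilon_greedy_pick(wins, losses, rng, epsilon=0.05):
    """With probability epsilon explore uniformly, otherwise exploit the best observed rate."""
    if rng.random() < epsilon:
        return rng.randrange(len(wins))
    # Untried arms are scored 0.5 so that they are not ignored forever
    rates = [(w / (w + l)) if (w + l) else 0.5 for w, l in zip(wins, losses)]
    return max(range(len(rates)), key=rates.__getitem__)

def average_regret(pick, probs, steps, rng):
    """Mean per-step gap between the best arm's win rate and the chosen arm's win rate."""
    wins = [0] * len(probs)
    losses = [0] * len(probs)
    best = max(probs)
    regret = 0.0
    for _ in range(steps):
        i = pick(wins, losses, rng)
        regret += best - probs[i]
        if rng.random() < probs[i]:
            wins[i] += 1
        else:
            losses[i] += 1
    return regret / steps

def compare(probs, steps=1000, runs=50, seed=1):
    """Average regret of both strategies over several independent runs."""
    rng = random.Random(seed)
    strategies = [("epsilon-greedy", epsilon_greedy_pick), ("bayesian bandit", thompson_pick)]
    return {name: sum(average_regret(pick, probs, steps, rng) for _ in range(runs)) / runs
            for name, pick in strategies}

results = compare([0.1, 0.2, 0.3])
print(results)
```

Lower is better; printing `results` for a few different `probs` and `steps` values shows how the two strategies trade exploration against exploitation.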


Video Demo


The Code

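Select an alternative

select = function(k) {
  # k[i,1] counts losses and k[i,2] counts wins for alternative i
  n = dim(k)[1]
  p0 = rep(0, length.out=n)
  for (i in 1:n) {
    # sample a plausible win rate from the Beta posterior
    p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
  }
  return (which.max(p0))
}

Select and learn

# the wrapper name "learn" is assumed; test(i) is defined elsewhere and
# returns the outcome column for alternative i (1 = loss, 2 = win)
learn = function(k, steps) {
  for (z in 1:steps) {
    i = select(k)
    j = test(i)
    k[i,j] = k[i,j]+1
  }
  return (k)
}

But we already know how to count!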


The Basic Idea

We can encode a distribution by sampling

Sampling allows unification of exploration and exploitation

Can be extended to more general response models


The Original Problem



[Figure: the dog food ad from earlier, with its design elements labeled x1, x2 and x3]


Response Function

$$p(\text{win}) = w\left(\sum_i \theta_i x_i\right)$$

[Figure: plot of the response function w, with x from -6 to 6 and y from 0 to 1]


Generalized Banditry

Suppose we have an infinite number of bandits

– suppose they are each labeled by two real numbers x and y in [0,1]

– also that expected payoff is a parameterized function of x and y

$$E[z] = f(x, y \mid \theta)$$

– now assume a distribution for θ that we can learn online

Selection works by sampling θ, then computing f

Learning works by propagating updates back to θ

– If f is linear, this is very easy

– For certain other kinds of f it isn’t too hard

Not limited to two labels; could have labels and context
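For the linear case, one way to make this concrete is Bayesian linear regression with sampling. The sketch below is only an illustration under assumptions not stated on the slide: f(x, y | θ) = θ · (1, x, y), a standard normal prior on θ, Gaussian payoff noise of known variance, and a 5 x 5 grid standing in for the infinite set of bandits. All names and the "true" θ are hypothetical.

```python
import math
import random

def features(x, y):
    """Feature vector for the assumed linear model f(x, y | theta) = theta . (1, x, y)."""
    return [1.0, x, y]

def f(x, y, theta):
    return sum(t * p for t, p in zip(theta, features(x, y)))

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a small symmetric positive definite matrix."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    return l

def forward_solve(l, b):
    """Solve L y = b for lower-triangular L."""
    y = [0.0] * len(b)
    for i in range(len(b)):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    return y

def backward_solve(l, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x

class LinearBandit:
    def __init__(self, dim=3, noise_var=0.01):
        self.noise_var = noise_var
        # Posterior precision matrix and precision-weighted mean, starting from a N(0, I) prior
        self.precision = [[float(i == j) for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def mean(self):
        l = cholesky(self.precision)
        return backward_solve(l, forward_solve(l, self.b))

    def sample_theta(self, rng):
        """Draw theta from the Gaussian posterior N(mean, precision^-1)."""
        l = cholesky(self.precision)
        mean = backward_solve(l, forward_solve(l, self.b))
        z = [rng.gauss(0.0, 1.0) for _ in self.b]
        return [m + e for m, e in zip(mean, backward_solve(l, z))]

    def select(self, candidates, rng):
        """Selection: sample theta, then pick the candidate that maximizes f."""
        theta = self.sample_theta(rng)
        return max(candidates, key=lambda c: f(c[0], c[1], theta))

    def learn(self, x, y, payoff):
        """Learning: propagate one observed payoff back into the posterior over theta."""
        phi = features(x, y)
        for i, pi in enumerate(phi):
            self.b[i] += pi * payoff / self.noise_var
            for j, pj in enumerate(phi):
                self.precision[i][j] += pi * pj / self.noise_var

rng = random.Random(7)
true_theta = [0.2, 0.5, -0.3]  # hypothetical ground truth used only to simulate payoffs
grid = [(i / 4, j / 4) for i in range(5) for j in range(5)]
bandit = LinearBandit()
for _ in range(200):
    x, y = bandit.select(grid, rng)
    bandit.learn(x, y, f(x, y, true_theta) + rng.gauss(0.0, 0.1))
print(bandit.mean())
```

Sampling θ rather than using its mean is what keeps exploration alive: while the posterior is wide, different draws favour different labels.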


Context Variables



[Figure: the dog food ad from earlier, with its design elements labeled x1, x2 and x3]

user.geo env.time env.day_of_week env.weekend


Caveats

Original Bayesian Bandit only requires real-time

Generalized Bandit may require access to long history for learning

– Pseudo online learning may be easier than true online

Bandit variables can include content, time of day, day of week

Context variables can include user id, user features

Bandit × context variables provide the real power
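One plausible reading of "bandit × context" is a set of interaction features, one per pair of bandit variable and context variable, so that a model can learn that a design works for some visitors and not others. A minimal Python sketch; the feature names and the `cross_features` helper are hypothetical, apart from `env.weekend`, which appears on the Context Variables slide:

```python
def cross_features(bandit_vars, context_vars):
    """Return the original variables plus one product feature per (bandit, context) pair."""
    features = {}
    features.update(bandit_vars)
    features.update(context_vars)
    for b_name, b_value in bandit_vars.items():
        for c_name, c_value in context_vars.items():
            features[f"{b_name}*{c_name}"] = b_value * c_value
    return features

# Hypothetical indicator features for one page view
bandit_vars = {"tagline=best": 1.0, "cta=buy5": 1.0}
context_vars = {"env.weekend": 1.0, "user.geo=US": 0.0}
crossed = cross_features(bandit_vars, context_vars)
print(sorted(crossed))
```

The resulting vector can serve as the x in a response function such as the one on the Response Function slide.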


You can do this yourself!


Super-fast k-means Clustering


Rationale


What is Quality?

Robust clustering not a goal

– we don’t care if the same clustering is replicated

Generalization is critical

Agreement with “gold standard” is a non-issue


An Example


Diagonalized Cluster Proximity


Clusters as Distribution Surrogate


Theory


For Example

Grouping these two clusters seriously hurts squared distance

$$\Delta_4^2(X) > \frac{1}{\sigma^2} \Delta_5^2(X)$$


Algorithms


Typical k-means Failure

Selecting two seeds here cannot be fixed with Lloyd's algorithm

Result is that these two clusters get glued together


Ball k-means

Provably better for highly clusterable data

Tries to find initial centroids in the “core” of each real cluster

Avoids outliers in centroid computation

initialize centroids randomly with distance maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer than closest cluster
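The pseudocode above can be turned into a small runnable Python sketch. Several choices here are assumptions rather than the Mahout implementation: seeding is k-means++ style (one way to get a "distance maximizing tendency"), a point counts as "much closer" when it lies within `trim` times the distance from its centroid to the nearest other centroid, and k is at least 2 with more distinct points than clusters.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid_of(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def seed_centroids(points, k, rng):
    """Pick seeds one at a time, favouring points far from the seeds chosen so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(dist(p, c) for c in centroids) ** 2 for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def ball_kmeans(points, k, iterations=3, trim=0.5, seed=0):
    rng = random.Random(seed)
    centroids = seed_centroids(points, k, rng)
    for _ in range(iterations):
        # Ball radius: a fraction of the distance to the nearest other centroid
        radius = [trim * min(dist(c, o) for j, o in enumerate(centroids) if j != i)
                  for i, c in enumerate(centroids)]
        members = [[] for _ in centroids]
        for p in points:
            i = min(range(k), key=lambda j: dist(p, centroids[j]))
            # Points outside the ball are ignored, which keeps outliers out of the mean
            if dist(p, centroids[i]) <= radius[i]:
                members[i].append(p)
        centroids = [centroid_of(m) if m else c for m, c in zip(members, centroids)]
    return centroids

data_rng = random.Random(3)
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (100, 100)] for _ in range(50)]
centroids = ball_kmeans(points, k=2)
print(centroids)
```

Only a few iterations are run because, when seeding lands one seed in each cluster core, the trimmed means settle almost immediately.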


Still Not a Win

Ball k-means is nearly guaranteed to succeed with k = 2

Probability of successful seeding drops exponentially with k

Alternative strategy has high probability of success, but takes O(nkd + k^3 d) time


But for big data, k gets large


Surrogate Method

Start with sloppy clustering into lots of clusters

κ = k log n clusters

Use this sketch as a weighted surrogate for the data

Results are provably good for highly clusterable data


Algorithm Costs

Surrogate methods

– fast, sloppy single pass clustering with κ = k log n

– fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point

– fast, in-memory, high-quality clustering of κ weighted centroids

O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality

O(κ d log k) or O(d log κ log k) for larger k, looser quality

– result is k high-quality centroids

• Even the sloppy surrogate may suffice



Algorithm Costs

How much faster for the sketch phase?

– take k = 2000, d = 10, n = 100,000,000

– k d log n = 2000 x 10 x 26 ≈ 500,000

– d (log k + log log n) = 10 (11 + 5) = 160

– 3,000 times faster is a bona fide big deal
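The arithmetic can be checked with a few lines of Python. The sketch assumes base-2 logarithms and n = 10^8, which is the value at which log n ≈ 26 and log log n ≈ 5 as used in the figures above; the function name is illustrative.

```python
import math

def sketch_costs(k, d, n):
    """Per-point cost of brute-force search over k log n centroids versus fast search.

    Returns (brute_force, fast_search, speedup); base-2 logarithms are assumed.
    """
    log_n = math.log2(n)
    brute_force = k * d * log_n
    fast_search = d * (math.log2(k) + math.log2(log_n))
    return brute_force, fast_search, brute_force / fast_search

brute, fast, speedup = sketch_costs(k=2000, d=10, n=100_000_000)
print(f"{brute:,.0f} vs {fast:,.0f}: about {speedup:,.0f}x")
```

The ratio comes out in the low thousands, the same order of magnitude as the rounded figure on the slide.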


How It Works

For each point

– Find approximately nearest centroid (distance = d)

– If (d > threshold) new centroid

– Else if (u < d/threshold) new centroid, where u is uniform random in [0,1]

– Else add to nearest centroid

If centroids > κ ≈ C log N

– Recursively cluster centroids with higher threshold


Implementation


But Wait, …

Finding nearest centroid is inner loop

This could take O( d κ ) per point and κ can be big

Happily, approximate nearest centroid works fine


Projection Search

total ordering!
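The idea can be sketched as follows: projecting every vector onto a random direction gives scalars, and scalars have a total ordering, so the neighbourhood of a query's projection can be found with a binary search. The Python below is an illustration under assumptions (several random directions, a fixed search window, exact distance only on the candidates); the class and parameter names are hypothetical and this is not the Mahout implementation.

```python
import bisect
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class ProjectionSearch:
    def __init__(self, points, dim, num_projections=4, seed=0):
        rng = random.Random(seed)
        self.points = points
        self.directions = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_projections)]
        # One sorted table of (projection, point index) per random direction
        self.tables = [sorted((dot(direction, p), i) for i, p in enumerate(points))
                       for direction in self.directions]

    def nearest(self, query, window=8):
        """Index of the approximately nearest point to query."""
        candidates = set()
        for direction, table in zip(self.directions, self.tables):
            # Binary search for the query's position in the ordering of projections
            pos = bisect.bisect_left(table, (dot(direction, query),))
            for _, i in table[max(0, pos - window):pos + window]:
                candidates.add(i)
        # Exact distances are computed only for the few candidates
        return min(candidates, key=lambda i: dist(query, self.points[i]))

rng = random.Random(2)
points = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
index = ProjectionSearch(points, dim=5)
print(index.nearest(points[17]))
```

Points that are close in space are close in every projection, though the converse does not hold, which is why several directions and an exact check on the candidates are used.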


LSH Bit-match Versus Cosine

[Figure: LSH bit-match count (x axis, 0 to 64) versus cosine (y axis, -1 to 1)]
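The relationship in the figure can be explored with random-hyperplane hashing. For that scheme, two vectors at angle θ agree on any one bit with probability 1 - θ/π, so the number of matching bits out of 64 tracks the cosine. The Python sketch below assumes that scheme (the slide does not say which LSH variant was plotted) and uses hypothetical names and random test vectors.

```python
import math
import random

def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return [sum(h * v for h, v in zip(plane, vector)) >= 0 for plane in hyperplanes]

def bit_matches(a, b):
    return sum(x == y for x, y in zip(a, b))

def cosine(u, v):
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm

def estimated_cosine(matches, bits):
    """Invert E[matches] = bits * (1 - angle / pi) to estimate the cosine."""
    return math.cos(math.pi * (1 - matches / bits))

rng = random.Random(11)
bits = 64
planes = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(bits)]
u = [rng.gauss(0, 1) for _ in range(10)]
v = [a + 0.5 * rng.gauss(0, 1) for a in u]
matches = bit_matches(signature(u, planes), signature(v, planes))
print(matches, cosine(u, v), estimated_cosine(matches, bits))
```

With only 64 bits the estimate is noisy, which is acceptable when the bit count is used as a cheap filter before an exact distance computation.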


Results


Parallel Speedup?

[Figure: time per point (μs) versus number of threads; series: threaded version, non-threaded, perfect scaling]


Quality

Ball k-means implementation appears significantly better than simple k-means

Streaming k-means + ball k-means appears to be about as good as ball k-means alone

All evaluations on 20 newsgroups with held-out data

Figure of merit is mean and median squared distance to nearest cluster
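The figure of merit is straightforward to compute. A minimal Python sketch, with illustrative names and toy data in place of the 20 newsgroups vectors:

```python
import statistics

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def figure_of_merit(held_out, centroids):
    """Mean and median squared distance from each held-out point to its nearest centroid."""
    nearest = [min(squared_distance(p, c) for c in centroids) for p in held_out]
    return statistics.mean(nearest), statistics.median(nearest)

# Toy held-out points and centroids, for illustration only
held_out = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
centroids = [(0.0, 0.0), (5.0, 0.0)]
mean_sq, median_sq = figure_of_merit(held_out, centroids)
print(mean_sq, median_sq)
```

Scoring held-out points rather than the training points is what makes this a test of generalization.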


Contact Me!

We’re hiring at MapR in US and Europe

MapR software available for research use

Get the code as part of Mahout trunk (or 0.8 very soon)

Contact me at [email protected] or @ted_dunning

Share news with @apachemahout