1 ©MapR Technologies 2013 Which Algorithms Really Matter?
![Page 1: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/1.jpg)
1©MapR Technologies 2013
Which Algorithms Really Matter?
![Page 2: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/2.jpg)
Me, Us
Ted Dunning, Chief Application Architect, MapR
– Committer, PMC member: Mahout, ZooKeeper, Drill
– Bought the beer at the first HUG
MapR
– Distributes more open source components for Hadoop
– Adds major technology for performance, HA, industry standard APIs
Info
– Hash tag: #mapr
– See also: @ApacheMahout @ApacheDrill
– @ted_dunning and @mapR
![Page 3: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/3.jpg)
Topic For Today
What is important?
What is not?
Why?
What is the difference from academic research?
Some examples
![Page 4: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/4.jpg)
What is Important?
Deployable
Robust
Transparent
Skillset and mindset matched?
Proportionate
![Page 5: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/5.jpg)
What is Important?
Deployable
– Clever prototypes don’t count if they can’t be standardized
Robust
Transparent
Skillset and mindset matched?
Proportionate
![Page 6: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/6.jpg)
What is Important?
Deployable
– Clever prototypes don’t count
Robust
– Mishandling is common
Transparent
– Will degradation be obvious?
Skillset and mindset matched?
Proportionate
![Page 7: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/7.jpg)
What is Important?
Deployable
– Clever prototypes don’t count
Robust
– Mishandling is common
Transparent
– Will degradation be obvious?
Skillset and mindset matched?
– How long will your fancy data scientist enjoy doing standard ops tasks?
Proportionate
– Where is the highest value per minute of effort?
![Page 8: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/8.jpg)
Academic Goals vs Pragmatics
Academic goals
– Reproducible
– Isolate theoretically important aspects
– Work on novel problems
Pragmatics
– Highest net value
– Available data is constantly changing
– Diligence and consistency have larger impact than cleverness
– Many systems feed themselves; exploration and exploitation are both important
– Engineering constraints on budget and schedule
![Page 9: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/9.jpg)
Example 1: Making Recommendations Better
![Page 10: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/10.jpg)
Recommendation Advances
What are the most important algorithmic advances in recommendations over the last 10 years?
Cooccurrence analysis?
Matrix completion via factorization?
Latent factor log-linear models?
Temporal dynamics?
![Page 11: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/11.jpg)
The Winner – None of the Above
What are the most important algorithmic advances in recommendations over the last 10 years?
1. Result dithering
2. Anti-flood
![Page 12: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/12.jpg)
The Real Issues
Exploration
Diversity
Speed
Not the last fraction of a percent
![Page 13: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/13.jpg)
Result Dithering
Dithering is used to re-order recommendation results
– Re-ordering is done randomly
Dithering is guaranteed to make off-line performance worse
Dithering also has a near perfect record of making actual performance much better
![Page 14: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/14.jpg)
“Made more difference than any other change”
![Page 15: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/15.jpg)
Simple Dithering Algorithm
Generate synthetic score from log rank plus Gaussian
Pick noise scale to provide desired level of mixing
Typically
Oh… use floor(t/T) as seed
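A minimal Python sketch of the steps on this slide, assuming the noise is Gaussian with standard deviation ε and that lower synthetic scores rank first. The function name `dither` and the parameters `epsilon`, `t`, and `period` (standing in for the slide's T) are illustrative and do not come from the talk. Seeding with floor(t/T) keeps the ordering stable inside each time window, so a list only reshuffles when the window changes.

```python
import math
import random


def dither(items, epsilon, t, period):
    """Re-order a ranked list by sorting on log(rank) + Gaussian noise.

    items   -- results in their original order, best first
    epsilon -- noise scale (assumed here to be the standard deviation)
    t       -- current time, in the same units as period
    period  -- length of the window during which the order stays fixed
    """
    # floor(t / T) as the seed: same window, same shuffle.
    rng = random.Random(int(t // period))
    scored = [
        (math.log(rank) + rng.gauss(0.0, epsilon), item)
        for rank, item in enumerate(items, start=1)
    ]
    # Smaller synthetic score means closer to the top of the list.
    scored.sort(key=lambda pair: pair[0])
    return [item for _, item in scored]


results = ["a", "b", "c", "d", "e", "f", "g", "h"]
page = dither(results, epsilon=0.5, t=1_000, period=600)
```

With `epsilon = 0` the original order comes back unchanged; larger values pull deeper results toward the top. Because the noise is added to the log of the rank, the top few positions move less than the long tail.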
![Page 16: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/16.jpg)
Example … ε = 0.5
![Page 17: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/17.jpg)
Example … ε = log 2 = 0.69
![Page 18: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/18.jpg)
Exploring The Second Page
![Page 19: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/19.jpg)
Lesson 1: Exploration is good
![Page 20: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/20.jpg)
Example 2: Bayesian Bandits
![Page 21: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/21.jpg)
Bayesian Bandits
Based on Thompson sampling
Very general sequential test
Near optimal regret
Trade-off exploration and exploitation
Possibly best known solution for exploration/exploitation
Incredibly simple
![Page 22: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/22.jpg)
Thompson Sampling
Select each shell according to the probability that it is the best
Probability that it is the best can be computed using posterior
But I promised a simple answer
![Page 23: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/23.jpg)
Thompson Sampling – Take 2
Sample θ
Pick i to maximize reward
Record result from using i
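The three steps above can be sketched for the simplest case: a Bernoulli bandit where each option either pays off or does not. The slide does not specify a model, so the uniform Beta(1, 1) prior, the class name `BetaBernoulliBandit`, and the `simulate` helper are all assumptions made for illustration.

```python
import random


class BetaBernoulliBandit:
    """Thompson sampling with a Beta(1, 1) prior on each arm's payoff rate."""

    def __init__(self, n_arms, rng=None):
        self.wins = [0] * n_arms
        self.losses = [0] * n_arms
        self.rng = rng or random.Random()

    def choose(self):
        # Step 1: sample theta, one plausible payoff rate per arm,
        # from each arm's Beta posterior.
        theta = [
            self.rng.betavariate(w + 1, l + 1)
            for w, l in zip(self.wins, self.losses)
        ]
        # Step 2: pick the arm i that maximizes reward under that sample.
        return max(range(len(theta)), key=theta.__getitem__)

    def record(self, arm, reward):
        # Step 3: record the result so the posterior sharpens.
        if reward:
            self.wins[arm] += 1
        else:
            self.losses[arm] += 1


def simulate(true_rates, trials, seed=0):
    """Run the bandit against fixed payoff rates; return pulls per arm."""
    rng = random.Random(seed)
    bandit = BetaBernoulliBandit(len(true_rates), rng)
    pulls = [0] * len(true_rates)
    for _ in range(trials):
        arm = bandit.choose()
        bandit.record(arm, rng.random() < true_rates[arm])
        pulls[arm] += 1
    return pulls


pulls = simulate([0.1, 0.9], trials=1000, seed=1)
```

Arms with little data have wide posteriors and still get sampled occasionally, which is the exploration; arms with strong evidence of high payoff win most draws, which is the exploitation.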
![Page 24: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/24.jpg)
Fast Convergence
![Page 25: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/25.jpg)
Thompson Sampling on Ads
An Empirical Evaluation of Thompson Sampling - Chapelle and Li, 2011
![Page 26: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/26.jpg)
Bayesian Bandits versus Result Dithering
Many useful systems are difficult to frame in fully Bayesian form
Thompson sampling cannot be applied without posterior sampling
Can still do useful exploration with dithering
But better to use Thompson sampling if possible
![Page 27: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/27.jpg)
Lesson 2: Exploration is pretty easy to do and pays big benefits.
![Page 28: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/28.jpg)
Example 3: On-line Clustering
![Page 29: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/29.jpg)
The Problem
K-means clustering is useful for feature extraction or compression
At scale and at high dimension, the desirable number of clusters increases
Very large number of clusters may require more passes through the data
Super-linear scaling is generally infeasible
![Page 30: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/30.jpg)
The Solution
Sketch-based algorithms produce a sketch of the data
Streaming k-means uses adaptive dp-means to produce this sketch in the form of many weighted centroids which approximate the original distribution
The size of the sketch grows very slowly with increasing data size
Many operations such as clustering are well behaved on sketches
Fast and Accurate k-means For Large Datasets. Michael Shindler, Alex Wong, Adam Meyerson.
Revisiting k-means: New Algorithms via Bayesian Nonparametrics. Brian Kulis, Michael Jordan.
![Page 31: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/31.jpg)
An Example
![Page 32: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/32.jpg)
An Example
![Page 33: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/33.jpg)
The Cluster Proximity Features
Every point can be described by the nearest cluster
– 4.3 bits per point in this case
– Significant error that can be decreased (to a point) by increasing number of clusters
Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
– Error is negligible
– Unwinds the data into a simple representation
Or we can increase the number of clusters (n-fold increase adds log n bits per point, decreases error by sqrt(n))
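A small Python sketch of the "two nearest clusters" description, under a few assumptions: distances are Euclidean, the proximities are the raw distances, and the sign bit mentioned on the slide is not modeled. The names `nearest_two` and `index_bits` are hypothetical. The 4.3 bits figure is consistent with roughly 20 clusters, since log2(20) ≈ 4.32, though the slide does not state the cluster count.

```python
import math


def nearest_two(point, centroids):
    """Return ((index, distance), (index, distance)) for the two closest centroids."""
    if len(centroids) < 2:
        raise ValueError("need at least two centroids")
    ranked = sorted((math.dist(point, c), i) for i, c in enumerate(centroids))
    (d1, i1), (d2, i2) = ranked[0], ranked[1]
    return (i1, d1), (i2, d2)


def index_bits(n_clusters):
    """Bits needed to name one cluster out of n_clusters."""
    return math.log2(n_clusters)


centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
first, second = nearest_two((1.0, 0.0), centroids)
```

Doubling the number of clusters adds one bit to each index, which is the "log n bits per point" cost on the slide.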
![Page 34: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/34.jpg)
Diagonalized Cluster Proximity
![Page 35: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/35.jpg)
Lots of Clusters Are Fine
![Page 36: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/36.jpg)
Typical k-means Failure
Selecting two seeds here cannot be fixed with Lloyd's
Result is that these two clusters get glued together
![Page 37: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/37.jpg)
Streaming k-means Ideas
By using a sketch with lots (k log N) of centroids, we avoid pathological cases
We still get a very good result if the sketch is created
– in one pass
– with approximate search
In fact, adaptive dp-means works just fine
In the end, the sketch can be used for clustering or …
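To make the one-pass idea concrete, here is a deliberately simplified Python sketch. It is not the algorithm from the cited papers: it uses a hard distance threshold instead of a probabilistic rule, exact nearest-centroid search instead of approximate search, and a fixed growth factor. The names `streaming_sketch`, `max_centroids`, `threshold`, and `growth` are assumptions for illustration.

```python
import math


def _absorb(centroids, point, weight, threshold):
    """Merge a weighted point into its nearest centroid, or start a new one."""
    if centroids:
        nearest = min(centroids, key=lambda c: math.dist(c[0], point))
        if math.dist(nearest[0], point) <= threshold:
            total = nearest[1] + weight
            # Weighted running mean keeps the centroid at the center of mass.
            nearest[0] = [m + (x - m) * weight / total
                          for m, x in zip(nearest[0], point)]
            nearest[1] = total
            return
    centroids.append([list(point), weight])


def streaming_sketch(points, max_centroids, threshold=1.0, growth=1.5):
    """Single pass over points; returns a list of [mean, weight] centroids."""
    if max_centroids < 1 or threshold <= 0 or growth <= 1:
        raise ValueError("need max_centroids >= 1, threshold > 0, growth > 1")
    centroids = []
    for point in points:
        _absorb(centroids, point, 1.0, threshold)
        # Adaptive part: when the sketch gets too big, loosen the threshold
        # and re-cluster the centroids themselves (not the raw data).
        while len(centroids) > max_centroids:
            threshold *= growth
            previous, centroids = centroids, []
            for mean, weight in previous:
                _absorb(centroids, mean, weight, threshold)
    return centroids


data = [((i % 10) * 0.1, 0.0) for i in range(100)]
data += [(100.0 + (i % 10) * 0.1, 0.0) for i in range(100)]
sketch = streaming_sketch(data, max_centroids=5, threshold=0.05)
```

The weighted centroids can then be handed to an ordinary in-memory clustering step, since the sketch is far smaller than the data it summarizes.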
![Page 38: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/38.jpg)
Lesson 3: Sketches make big data small.
![Page 39: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/39.jpg)
Example 4: Search Abuse
![Page 40: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/40.jpg)
Recommendations
Alice got an apple and a puppy
Charles got a bicycle
Alice
Charles
![Page 41: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/41.jpg)
Recommendations
Alice got an apple and a puppy
Charles got a bicycle
Bob got an apple
Alice
Bob
Charles
![Page 42: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/42.jpg)
Recommendations
What else would Bob like?
Alice
Bob
Charles
![Page 43: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/43.jpg)
Log Files
[Figure: log file entries for Alice, Bob, and Charles]
![Page 44: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/44.jpg)
History Matrix: Users by Items
[Figure: history matrix with one row per user (Alice, Bob, Charles) and one column per item; a ✔ marks each user-item interaction]
![Page 45: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/45.jpg)
Co-occurrence Matrix: Items by Items
[Figure: item-by-item matrix of co-occurrence counts]
How do you tell which co-occurrences are useful?
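As a rough illustration of how an item-by-item co-occurrence matrix is derived from user histories, the Python sketch below counts how many users have each pair of items. It uses the purchases named on the earlier "Recommendations" slides (Alice, Bob, Charles) rather than the figure's own cells, which did not survive extraction. The function name `cooccurrence_counts` is hypothetical, and the diagonal (an item with itself) is left out.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(histories):
    """Count, for each ordered item pair, how many users have both items."""
    counts = Counter()
    for items in histories.values():
        # set() so a repeated item in one history is counted once per user.
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # keep the matrix symmetric
    return counts


histories = {
    "Alice": ["apple", "puppy"],
    "Bob": ["apple"],
    "Charles": ["bicycle"],
}
counts = cooccurrence_counts(histories)
```

Raw counts like these favor popular items, which is why the next step asks which co-occurrences are actually informative.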
![Page 46: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/46.jpg)
Co-occurrence Binary Matrix
[Figure: binary co-occurrence matrix]
![Page 47: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/47.jpg)
Indicator Matrix: Anomalous Co-Occurrence
[Figure: indicator matrix with anomalous co-occurrences marked ✔]
Result: The marked row will be added to the indicator field in the item document…
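The slides do not name the test used to decide that a co-occurrence is anomalous. One widely used choice in co-occurrence recommenders, including Apache Mahout, is the log-likelihood ratio (G²) computed on a 2×2 contingency table; the Python sketch below assumes that test. The function name `llr` and the cell names `k11` to `k22` are illustrative.

```python
import math


def _xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)


def _entropy(*counts):
    """Unnormalized entropy: N log N minus the sum of k log k."""
    return _xlogx(sum(counts)) - sum(_xlogx(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 table of counts.

    k11 -- users with both item A and item B
    k12 -- users with A but not B
    k21 -- users with B but not A
    k22 -- users with neither
    """
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    # Guard against tiny negative values from floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


independent = llr(10, 10, 10, 10)  # items unrelated
associated = llr(10, 0, 0, 10)     # items always appear together
```

A score near zero means the pair co-occurs about as often as chance predicts; only pairs with large scores would be kept as indicators. The cutoff is a tuning choice the slides do not give.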
![Page 48: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/48.jpg)
Indicator Matrix
id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
indicators: (t1)
That one row from indicator matrix becomes the indicator field in the Solr document used to deploy the recommendation engine.
Note: data for the indicator field is added directly to meta-data for a document in Solr index. You don’t need to create a separate index for the indicators.
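To show why storing indicators in the item document is enough, here is a toy in-memory stand-in for the search step, written in Python. It is not Solr's API: a real deployment would issue a search query against the indicators field and let the engine score the matches. The `recommend` function, the overlap-count scoring, and every document except `t4` are hypothetical.

```python
def recommend(history, documents, limit=10):
    """Rank items whose indicators overlap the user's recent item ids."""
    seen = set(history)
    scored = []
    for doc in documents:
        if doc["id"] in seen:
            continue  # do not recommend what the user already has
        overlap = len(seen.intersection(doc.get("indicators", ())))
        if overlap:
            scored.append((overlap, doc["id"]))
    # Most matching indicators first; ties broken by id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:limit]]


documents = [
    # t4 is the document from the slide; t1 and t2 are made-up neighbors.
    {"id": "t4", "title": "puppy", "indicators": ["t1"]},
    {"id": "t1", "title": "hypothetical item 1", "indicators": []},
    {"id": "t2", "title": "hypothetical item 2", "indicators": ["t4"]},
]
suggestions = recommend(["t1"], documents)
```

The user's history plays the role of the query text, and the indicator field plays the role of the indexed text, so an ordinary search engine can serve recommendations without a separate index.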
![Page 49: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/49.jpg)
Internals of the Recommender Engine
![Page 50: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/50.jpg)
Internals of the Recommender Engine
![Page 51: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/51.jpg)
Looking Inside LucidWorks
What to recommend if new user listened to 2122: Fats Domino & 303: Beatles?
Recommendation is “1710 : Chuck Berry”
Real-time recommendation query and results: Evaluation
![Page 52: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/52.jpg)
Real-life example
![Page 53: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/53.jpg)
Lesson 4: Recursive search abuse pays
Search can implement recs
Which can implement search
![Page 54: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/54.jpg)
Summary
![Page 55: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/55.jpg)
![Page 56: 1 ©MapR Technologies 2013 Which Algorithms Really Matter?](https://reader034.fdocuments.us/reader034/viewer/2022051517/56649cb55503460f94979868/html5/thumbnails/56.jpg)