Yael Elmatad, Senior Data Scientist, Tapad at MLconf NYC - 4/15/16
Beyond the Classifier, Inspiration from Engineering Algorithms
Yael Elmatad, Data Scientist at Tapad (@y_s_e)
ML Conf NYC, April 15, 2016
2
Introduction to Tapad
Tapad is a marketing technology company that seeks to bridge the gap between users’ various screens.
3
Tapad’s Solution: The Device Graph™
4
Modeling Identity is Hard
1. Identifier persistence and accuracy
2. Conflicting data
3. Grouping keys / Transitive properties
4. User Privacy and Data Governance
5. Use case flexibility
6
Focus: Identifier Persistence & Groupings
Grouping keys
How can we effectively, at scale, determine groups of identifiers?
Identifier Persistence
How can we make sure that these identifiers are persistent in time?
Spoiler Alert
No classifiers, recommender systems, or community detection in sight.
7
Grouping: Connected Components
● Over 1.4 billion devices in each weekly Device Graph
● There are 6.6 billion connections between these devices
Question:
How do we determine connected components at scale?
Previous attempts:
Various graph-based databases and solutions (Giraph, GraphX, Cassovary); we were not able to identify clusters at scale.
Current solution:
Runs in a logarithmic number of rounds
8
Connected Component Basics: Label Propagation
Initialization: assign self as cluster label.
Node:                 A B C D
Cluster Label (Temp): A B C D
Iterations: ask each neighbor for its current label, then take the min of the neighbors' labels and your own.
Node:                 A B C D
Cluster Label (Temp): A A B C
Stop iterating when no label changes from the previous iteration.
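The label propagation steps above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not Tapad's implementation. The chain A - B - C - D is an assumption inferred from the labels shown on the slide, and the function name `label_propagation` is hypothetical.

```python
def label_propagation(edges, nodes):
    """Min-label propagation; returns the label assignment after every round."""
    neighbors = {v: set() for v in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Initialization: every node uses itself as its cluster label.
    history = [{v: v for v in nodes}]
    while True:
        labels = history[-1]
        # Each node takes the min of its own label and its neighbors' labels.
        new_labels = {
            v: min([labels[v]] + [labels[u] for u in neighbors[v]])
            for v in nodes
        }
        if new_labels == labels:  # stop when no label changed
            return history
        history.append(new_labels)


# Assumed example graph: the chain A - B - C - D.
history = label_propagation([("A", "B"), ("B", "C"), ("C", "D")], "ABCD")
for round_number, labels in enumerate(history):
    print(round_number, labels)
```

On this chain the lowest label moves one hop per round, so the number of rounds grows with the diameter of the cluster.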
9
Need A More Efficient Solution: Hash-to-Min
Standard message passing takes O(d) rounds, where d = cluster diameter.
Reference: arXiv:1203.5387v2 (arXiv.org > cs)
10
Hash-to-Min: Initialization
For node v, assign the minimum of v and its neighbors as its cluster label, and a cluster C(v), which is the set of v plus v’s neighbors.
Graph: A - B - C - D, with E also connected to C.
v   Label   C(v)
A   A       (A,B)
B   A       (A,B,C)
C   B       (B,C,D,E)
D   C       (C,D)
E   C       (C,E)
11
Hash-to-Min: Round 1
For each C(v), vmin = minimal member of C(v).
Broadcast C(v) to vmin and broadcast vmin to all other members of C(v).
Each node, v, then merges all the C(v) + vmin it receives.
State before Round 1:
v   Label   C(v)
A   A       (A,B)
B   A       (A,B,C)
C   B       (B,C,D,E)
D   C       (C,D)
E   C       (C,E)
12
Hash-to-Min: Round 1
State after Round 1:
v   Label   C(v)
A   A       (A,B,C)
B   A       (A,B,C,D,E)
C   A       (A,C,D,E)
D   B       (B)
E   B       (B)
13
Hash-to-Min: Round 2 + Completion
State after Round 2:
v   Label   C(v)
A   A       (A,B,C,D,E)
B   A       (A,B)
C   A       (A)
D   A       (A)
E   A       (A)
14
Hash-to-Min: Round 2 + Completion
Iterations cease when no updates are made to the C(v)’s.
Completes in O(log(d)) rounds, where d = cluster diameter.
Final state:
v   Label   C(v)
A   A       (A,B,C,D,E)
B   A       (A)
C   A       (A)
D   A       (A)
E   A       (A)
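The rounds above can be reproduced with a small single-machine sketch in Python. It is an illustration of the message-passing rule as stated on the slides, not the MapReduce job run at Tapad. One assumption is inferred from the slide tables rather than stated in the talk: a node does not send vmin to itself. The function names are hypothetical.

```python
def hash_to_min_round(clusters):
    """One Hash-to-Min round over a dict mapping each node v to its set C(v)."""
    inbox = {v: set() for v in clusters}
    for v, c_v in clusters.items():
        v_min = min(c_v)
        inbox[v_min] |= c_v            # broadcast C(v) to vmin
        for u in c_v - {v_min, v}:     # broadcast vmin to the other members
            inbox[u].add(v_min)        # (assumption: v skips itself)
    return inbox                       # each node merges what it received


def hash_to_min(edges, nodes):
    """Run rounds until no C(v) changes; returns the state after every round."""
    clusters = {v: {v} for v in nodes}  # C(v) = v plus v's neighbors
    for u, v in edges:
        clusters[u].add(v)
        clusters[v].add(u)
    history = [clusters]
    while True:
        updated = hash_to_min_round(history[-1])
        if updated == history[-1]:
            return history
        history.append(updated)


# The example graph from the slides: A - B - C - D, with E connected to C.
history = hash_to_min([("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")], "ABCDE")
for round_number, state in enumerate(history):
    labels = {v: min(c_v) for v, c_v in state.items()}
    print(round_number, labels)
```

At the end, the minimal node holds the whole component in its C(v) and every other node holds only the minimal node, which serves as the cluster label.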
15
Once we have connected components (CC), how do we label them?
First labeling scheme:
Label each cluster by the lowest device ID participating in it.
Example:
A cluster of devices A, B, C, D, E is labeled A.
Only 78% of devices maintain their label after 1 week.
16
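Why 22% Change? ID Expiration & Creation
Label Device Expires:
[Figure: a cluster containing devices B, C, D, shown before and after the device that supplies its label expires.]
New Lowest ID Created:
[Figure: a cluster containing devices B, C, D, shown before and after a new, lower device ID appears in it.]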
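Both failure modes follow directly from the min-ID rule, as the short Python sketch below shows. The device IDs and the helper name `min_id_labels` are made up for illustration.

```python
def min_id_labels(clusters):
    """Label every device with the lowest device ID in its cluster."""
    return {device: min(cluster) for cluster in clusters for device in cluster}


def changed_devices(before, after):
    """Devices present in both weeks whose label differs."""
    return sorted(d for d in before.keys() & after.keys() if before[d] != after[d])


# Label device expires: A disappears, so B, C, D are relabeled from A to B.
expiry_before = min_id_labels([{"A", "B", "C", "D"}])
expiry_after = min_id_labels([{"B", "C", "D"}])
print(changed_devices(expiry_before, expiry_after))

# New lowest ID created: A appears, so B, C, D are relabeled from B to A.
creation_before = min_id_labels([{"B", "C", "D"}])
creation_after = min_id_labels([{"A", "B", "C", "D"}])
print(changed_devices(creation_before, creation_after))
```

In both cases the membership of the cluster barely changes, yet every surviving device loses its label.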
17
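Why 22% Change? Splits and Merges
Cluster Splits:
[Figure: one cluster containing devices A, B, C, D shown before and after it splits into separate clusters.]
Clusters Merge:
[Figure: separate clusters containing devices A, B, C, D shown before and after they merge into one cluster.]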
18
Only a small fraction are of the Merge/Split variety
Type of change                 Percent
Device Expiration & Creation   > 75%
Cluster Merges & Splits        < 25%
19
Solution? Map onto Stable-Marriage Problem
Definition of “Stable Marriage”
Given n men and n women, where each person has ranked all members of the opposite sex in order of preference, marry the men and women together such that there are no two people of opposite sex who would both rather have each other than their current partners. When there are no such pairs of people, the set of marriages is deemed stable.
(Wikipedia definition)
20
Stable-Marriage (By Negation)
Want to pair triangles to circles.
Unstable Match:
[Figure: a triangle and a circle that are matched to other partners but prefer each other.]
A stable solution is defined as the lack of these instabilities.
The Gale-Shapley algorithm is a method for finding stable solutions.
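The "by negation" definition translates directly into a check for blocking pairs: a matching is stable exactly when the check finds none. The Python sketch below is a minimal illustration. The function name and the two-by-two example (circles `c1`, `c2`, triangles `t1`, `t2`) are hypothetical and do not come from the slides.

```python
def blocking_pairs(matching, circle_prefs, triangle_prefs):
    """Return every (circle, triangle) pair that prefers each other to their partners.

    matching maps each circle to its triangle; the preference dicts map each
    node to a list of nodes from the other side, most preferred first.
    """
    partner_of_triangle = {t: c for c, t in matching.items()}
    pairs = []
    for circle, prefs in circle_prefs.items():
        for triangle in prefs:
            if triangle == matching[circle]:
                break  # everything after this is worse than the current partner
            rival = partner_of_triangle[triangle]
            ranking = triangle_prefs[triangle]
            if ranking.index(circle) < ranking.index(rival):
                pairs.append((circle, triangle))
    return pairs


# Hypothetical preferences: everyone agrees on the same order.
circle_prefs = {"c1": ["t1", "t2"], "c2": ["t1", "t2"]}
triangle_prefs = {"t1": ["c1", "c2"], "t2": ["c1", "c2"]}

unstable = blocking_pairs({"c1": "t2", "c2": "t1"}, circle_prefs, triangle_prefs)
stable = blocking_pairs({"c1": "t1", "c2": "t2"}, circle_prefs, triangle_prefs)
print(unstable, stable)
```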
21
Gale-Shapley Algorithm
[Figure: six nodes to be paired: a, b, c and β, ɣ, δ]
(Psst… its co-inventor Lloyd Shapley shared the 2012 Nobel Prize in Economics for this line of work)
22
Gale-Shapley Pre-Iteration (GS0): Rankings
[Figure: nodes a, b, c and β, ɣ, δ, each annotated with its ranking of the other side]
Rankings held by a, b, c: (β,ɣ,δ), (β,δ,ɣ), (δ,ɣ,β)
Rankings held by β, ɣ, δ: (c,b,a), (b,a,c), (c,a,b)
23
GS1: Circles “Propose” to Triangles
24
GS1: Triangles tentatively accept best proposal
25
GS2: Unengaged circles try again
26
GS2: Triangles again tentatively accept best offer
27
GS3: iterations terminate when all triangles/circles are paired
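The propose, tentatively accept, and retry steps of the walkthrough can be sketched as the textbook Gale-Shapley loop in Python. Several things here are assumptions, because the transcript does not preserve which node holds which ranking or which side is drawn as circles: a, b, c are taken to be the proposers with rankings in the order listed, and the rankings assigned to beta, gamma, delta (spelled out for β, ɣ, δ) are one possible assignment.

```python
from collections import deque


def gale_shapley(proposer_prefs, acceptor_prefs):
    """Return a stable matching as a dict mapping proposer to acceptor.

    Assumes both sides have the same size and complete preference lists.
    """
    rank = {
        acceptor: {p: i for i, p in enumerate(prefs)}
        for acceptor, prefs in acceptor_prefs.items()
    }
    next_choice = {p: 0 for p in proposer_prefs}  # index of next acceptor to try
    engaged_to = {}                               # acceptor -> current proposer
    free = deque(proposer_prefs)
    while free:
        proposer = free.popleft()
        acceptor = proposer_prefs[proposer][next_choice[proposer]]
        next_choice[proposer] += 1
        current = engaged_to.get(acceptor)
        if current is None:
            engaged_to[acceptor] = proposer       # tentatively accept
        elif rank[acceptor][proposer] < rank[acceptor][current]:
            engaged_to[acceptor] = proposer       # trade up, release the old partner
            free.append(current)
        else:
            free.append(proposer)                 # rejected, will try again
    return {p: a for a, p in engaged_to.items()}


# Assumed assignment of the slide's rankings to individual nodes.
proposer_prefs = {
    "a": ["beta", "gamma", "delta"],
    "b": ["beta", "delta", "gamma"],
    "c": ["delta", "gamma", "beta"],
}
acceptor_prefs = {
    "beta": ["c", "b", "a"],
    "gamma": ["b", "a", "c"],
    "delta": ["c", "a", "b"],
}
matching = gale_shapley(proposer_prefs, acceptor_prefs)
print(matching)
```

Under this assumed assignment, a and b both propose to beta first, beta keeps b, and a is accepted by gamma on its second try, which matches the two proposal rounds in the walkthrough.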
28
How do we use it at Tapad?
Considerations:
● How do you rank the best labels for your cluster?
● Needs to run at scale, for 100 million label pairs.
● Needs to run in a distributed fashion (MapReduce).
● Needs to be able to handle ties.
● Needs to handle label expiry and new label creation.
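The talk does not say how Tapad ranks labels. As a purely illustrative assumption, one could rank the previous week's labels for each new cluster by how many devices they share, breaking ties by label so the result is deterministic. The Python sketch below builds such preference lists. All names and data are hypothetical.

```python
from collections import Counter


def rank_previous_labels(new_clusters, previous_label_of):
    """Rank last week's labels for each new cluster by device overlap.

    Assumption (not from the talk): more shared devices means more preferred,
    and ties are broken by label order.
    """
    prefs = {}
    for cluster_id, devices in new_clusters.items():
        overlap = Counter(
            previous_label_of[d] for d in devices if d in previous_label_of
        )
        prefs[cluster_id] = sorted(overlap, key=lambda label: (-overlap[label], label))
    return prefs


# Hypothetical data: d5 and d6 are new devices with no previous label.
previous_label_of = {"d1": "L1", "d2": "L1", "d3": "L2", "d4": "L2"}
new_clusters = {
    "c1": {"d1", "d2", "d3"},
    "c2": {"d4", "d5"},
    "c3": {"d6"},
}
prefs = rank_previous_labels(new_clusters, previous_label_of)
print(prefs)
```

A cluster with an empty preference list, such as `c3`, has no candidate label to inherit and would need a newly created one. A previous label that appears in no list has effectively expired.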
29
Results & Cluster Stability
Metric:
The % of devices that maintain their cluster label after x weeks.
Elapsed time   Min ID Based   Gale-Shapley Based
1 week         78%            98%
8 weeks        33%            87%
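The metric can be computed as in the Python sketch below. The talk does not specify the denominator, so counting only devices present in both snapshots is an assumption. The function name and sample data are hypothetical.

```python
def label_retention(labels_before, labels_after):
    """Percent of devices whose cluster label is unchanged between two snapshots.

    Assumption: only devices present in both snapshots are counted.
    """
    common = labels_before.keys() & labels_after.keys()
    if not common:
        return 0.0
    kept = sum(1 for d in common if labels_before[d] == labels_after[d])
    return 100.0 * kept / len(common)


# Hypothetical snapshots: d3 changes label, d5 is new and is ignored.
week_0 = {"d1": "A", "d2": "A", "d3": "B", "d4": "B"}
week_1 = {"d1": "A", "d2": "A", "d3": "A", "d4": "B", "d5": "C"}
retention = label_retention(week_0, week_1)
print(retention)
```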
30
Conclusion
Many challenges that get thrown at data scientists can potentially be solved by deterministic engineering algorithms.
Being familiar with these algorithms keeps data scientists from reinventing the wheel.
Once you start using these algorithms, you start seeing use cases for them everywhere (we use connected components in no fewer than three parts of our graph building process).
31
Thank you!
Thanks to the Data Science/Engineering teams at Tapad
Read our blog: http://engineering.tapad.com
Careers: http://www.tapad.com/about-us/careers/openings
(Data Science & Engineering!)
Follow us on Twitter: @tapad, @tapadeng
Contact me: @y_s_e