Yael Elmatad, Senior Data Scientist, Tapad at MLconf NYC - 4/15/16

Post on 26-Jan-2017


Beyond the Classifier, Inspiration from Engineering Algorithms

Yael Elmatad, Data Scientist at Tapad (@y_s_e)

ML Conf NYC, April 15, 2016

Introduction to Tapad

Tapad is a marketing technology company that seeks to bridge the gap between users’ various screens.

Tapad’s Solution: The Device Graph™

Modeling Identity is Hard

1. Identifier persistence and accuracy

2. Conflicting data

3. Grouping keys / Transitive properties

4. User Privacy and Data Governance

5. Use case flexibility


Focus: Identifier Persistence & Groupings

Grouping keys

How can we effectively, at scale, determine groups of identifiers?

Identifier Persistence

How can we make sure that these identifiers are persistent in time?

Spoiler Alert

No classifiers, recommender systems, or community detection in sight.

Grouping: Connected Components

● Over 1.4 billion devices in each weekly Device Graph

● There are 6.6 billion connections between these devices

Question:

How do we determine connected components at scale?

Previous attempts:

Various graph-based databases and solutions (Giraph, GraphX, Cassovary): with these we were not able to identify clusters at scale.

Current solution:

Runs in a logarithmic number of rounds

Connected Component Basics: Label Prop

Initializing: assign self as cluster label.

Node:                 A B C D
Cluster Label (Temp): A B C D

Iterations: ask neighbors for their current label, take min of neighbors and self.

Node:                 A B C D
Cluster Label (Temp): A A B C

Stop iterations when no labels change over previous iteration.
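The label-propagation loop described above can be sketched in a few lines of Python. This is a minimal single-machine sketch, not Tapad's implementation: the function name `label_propagation` is hypothetical, and the example assumes the four nodes form a chain (A-B, B-C, C-D), which is consistent with the labels shown after the first iteration.

```python
def label_propagation(edges):
    """Return the label assignment after each iteration, as a list of dicts."""
    # Build an undirected adjacency map from the edge list.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Initialization: every node uses itself as its cluster label.
    labels = {v: v for v in neighbors}
    history = [labels]

    while True:
        # Each node takes the min of its own label and its neighbors' labels.
        updated = {
            v: min([labels[v]] + [labels[n] for n in neighbors[v]])
            for v in neighbors
        }
        if updated == labels:  # stop when no label changed
            return history
        labels = updated
        history.append(labels)


# Assumed chain A - B - C - D.
history = label_propagation([("A", "B"), ("B", "C"), ("C", "D")])
for step, labels in enumerate(history):
    print(step, [labels[v] for v in "ABCD"])
```

On this chain the labels need 3 iterations to converge, which equals the diameter of the chain; long, thin clusters are what make this approach slow.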

Need A More Efficient Solution: Hash-to-Min

Standard message passing takes O(d) rounds, where d = cluster diameter.

Reference: arXiv:1203.5387v2 (arXiv.org > cs)

Hash-to-Min: Initialization

For node v, assign the minimum of v and its neighbors as cluster label, and a cluster C(v), which is the set of v + v’s neighbors.

Graph: A - B - C - D, with E attached to C.

v    Label   C(v)
A    A       (A,B)
B    A       (A,B,C)
C    B       (B,C,D,E)
D    C       (C,D)
E    C       (C,E)

Hash-to-Min: Round 1

For each C(v), vmin = minimal member of C(v). Broadcast C(v) to vmin, and broadcast vmin to all other members of C(v). Each node v then merges all the C(v) + vmin it receives.

After round 1:

v    Label   C(v)
A    A       (A,B,C)
B    A       (A,B,C,D,E)
C    A       (A,C,D,E)
D    B       (B)
E    B       (B)

Hash-to-Min: Round 2 + Completion

After round 2:

v    Label   C(v)
A    A       (A,B,C,D,E)
B    A       (A,B)
C    A       (A)
D    A       (A)
E    A       (A)

Final state:

v    Label   C(v)
A    A       (A,B,C,D,E)
B    A       (A)
C    A       (A)
D    A       (A)
E    A       (A)

Iterations cease when no updates are made to the C(v)’s.

Completes in O(log(d)) rounds, where d = cluster diameter.
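The Hash-to-Min rounds can be reproduced with a small in-memory simulation. This is a sketch under stated assumptions, not Tapad's MapReduce job: `hash_to_min` is a hypothetical name, the edges are read off the initial C(v) sets (A-B, B-C, C-D, C-E), and a node is assumed not to send vmin to itself, which is the reading of "all other members" that reproduces the C(v) values on these slides.

```python
def hash_to_min(edges):
    """Simulate Hash-to-Min in memory; return C(v) after every round."""
    nodes = {n for edge in edges for n in edge}

    # Initialization: C(v) = v plus v's neighbors.
    clusters = {v: {v} for v in nodes}
    for u, v in edges:
        clusters[u].add(v)
        clusters[v].add(u)

    history = [clusters]
    while True:
        inbox = {v: set() for v in nodes}
        for v, c in clusters.items():
            v_min = min(c)
            inbox[v_min] |= c  # send C(v) to vmin
            for u in c:
                if u != v_min and u != v:  # assumed: no message to self
                    inbox[u].add(v_min)  # send vmin to the other members
        if inbox == clusters:  # no C(v) changed, so we are done
            return history
        clusters = inbox  # each node's new C(v) is the merge of its inbox
        history.append(clusters)


edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]
history = hash_to_min(edges)
final = history[-1]
labels = {v: min(c) for v, c in final.items()}  # cluster label = min of C(v)
print(sorted(final["A"]), labels)
```

At the end only the minimal node holds the full cluster and every other node holds just the label, so memory per node stays small outside the cluster "owner".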

Once we have CC, how do we label them?

First labeling scheme:

Labeled by lowest device id participating in cluster.

Example: [Figure: a cluster of devices A, B, C, D, E, labeled A.]

Only 78% of devices maintain their label after 1 week.
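A minimal sketch shows why a min-ID label is fragile: the label depends entirely on one device. The helper name `min_id_label` and the device ids are made up for illustration.

```python
def min_id_label(cluster):
    """Label a cluster by the lowest device id it contains."""
    return min(cluster)


# Scenario 1: the device that provides the label expires.
week_1 = {"A", "B", "C", "D", "E"}
week_2 = week_1 - {"A"}
print(min_id_label(week_1), "->", min_id_label(week_2))

# Scenario 2: a new device with a lower id joins the cluster.
before = {"B", "C", "D"}
after = before | {"A"}
print(min_id_label(before), "->", min_id_label(after))
```

In both scenarios every remaining device in the cluster gets a new label even though the cluster itself is essentially unchanged.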

Why 22% Change? ID Expiration & Creation

[Figure: two panels, "Label Device Expires" and "New Lowest ID Created".]

Why 22% Change? Splits and Merges

[Figure: two panels, "Cluster Splits" and "Clusters Merge".]

Only a small fraction of changes are of the Merge/Split variety

Type of change                  Percent
Device Expiration & Creation    > 75%
Cluster Merges & Splits         < 25%

Solution? Map onto Stable-Marriage Problem

Definition of “Stable Marriage”

Given n men and n women, where each person has ranked all members of the opposite sex in order of preference, marry the men and women together such that there are no two people of opposite sex who would both rather have each other than their current partners. When there are no such pairs of people, the set of marriages is deemed stable.

(Wikipedia definition)

Stable-Marriage - (By Negation)

Want to pair triangles to circles.

[Figure: an unstable match, in which a triangle and a circle that are not paired with each other prefer each other to their current partners.]

A stable solution is defined as the lack of these instabilities. The Gale-Shapley algorithm is a method for finding stable solutions.

Gale-Shapley Algorithm

[Figure: a, b, c to be paired with β, ɣ, δ.]

(Psst… work on it earned Lloyd Shapley and Alvin Roth the Nobel Prize in Economics in 2012)

Gale-Shapley Pre-Iteration (GS0): Rankings

Rankings held by a, b, c: (β,ɣ,δ), (β,δ,ɣ), (δ,ɣ,β)

Rankings held by β, ɣ, δ: (c,b,a), (b,a,c), (c,a,b)

GS1: Circles “Propose” to Triangles


GS1: Triangles tentatively accept best proposal


GS2: Unengaged circles try again


GS2: Triangles again tentatively accept best offer


GS3: iterations terminate when all triangles/circles are paired
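The steps GS0 to GS3 can be sketched as follows. This is a textbook Gale-Shapley sketch, not Tapad's distributed implementation. Which ranking belongs to which node is not recoverable from the slides, so the assignment below is an assumption, as are the function names and the choice of a, b, c as the proposing side; the Greek names are spelled out in ASCII. A second helper checks the "no pair prefers each other" definition of stability.

```python
def gale_shapley(proposer_prefs, receiver_prefs):
    """Return a stable matching as {proposer: receiver}."""
    # rank[r][p] = position of proposer p in receiver r's list (lower is better).
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # index of next receiver to try
    engaged = {}                                  # receiver -> proposer
    free = list(proposer_prefs)

    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged.get(r)
        if current is None:
            engaged[r] = p                 # tentatively accept
        elif rank[r][p] < rank[r][current]:
            engaged[r] = p                 # trade up, release the old partner
            free.append(current)
        else:
            free.append(p)                 # rejected, will try the next choice
    return {p: r for r, p in engaged.items()}


def blocking_pairs(matching, proposer_prefs, receiver_prefs):
    """List (proposer, receiver) pairs who prefer each other to their partners."""
    partner_of = {r: p for p, r in matching.items()}
    pairs = []
    for p, prefs in proposer_prefs.items():
        for r in prefs[:prefs.index(matching[p])]:  # receivers p likes more
            r_prefs = receiver_prefs[r]
            if r_prefs.index(p) < r_prefs.index(partner_of[r]):
                pairs.append((p, r))
    return pairs


# Assumed assignment of the rankings shown on the slides.
proposers = {"a": ["beta", "gamma", "delta"],
             "b": ["beta", "delta", "gamma"],
             "c": ["delta", "gamma", "beta"]}
receivers = {"beta": ["c", "b", "a"],
             "gamma": ["b", "a", "c"],
             "delta": ["c", "a", "b"]}

matching = gale_shapley(proposers, receivers)
print(matching)
print(blocking_pairs(matching, proposers, receivers))  # empty list means stable
```

Under this assumed assignment, a and b both propose to beta first, beta keeps b, and a is accepted by gamma on its second try, so two proposal rounds suffice.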


How do we use it at Tapad?

Considerations:

● How do you rank the best labels for your cluster?

● Needs to run at scale for 100 million label pairs.

● Needs to run in a distributed fashion (MapReduce).

● Needs to handle ties.

● Needs to handle label expiry and new label creation.
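The slides do not say how Tapad ranks labels, so the following is purely an illustrative assumption: one plausible way to build the two preference lists is to rank last week's labels for each new cluster by how many devices they share, and to rank new clusters for each old label the same way. All names and data below are hypothetical.

```python
from collections import Counter


def overlap_rankings(old_label_of, new_cluster_of):
    """Rank old labels per new cluster (and the reverse) by shared devices."""
    overlap = Counter()
    for device, new_cluster in new_cluster_of.items():
        old_label = old_label_of.get(device)  # None for newly created devices
        if old_label is not None:
            overlap[(new_cluster, old_label)] += 1

    cluster_prefs, label_prefs = {}, {}
    # Descending overlap; ties are broken by name so the result is deterministic.
    for (cluster, label), _ in sorted(overlap.items(),
                                      key=lambda kv: (-kv[1], kv[0])):
        cluster_prefs.setdefault(cluster, []).append(label)
        label_prefs.setdefault(label, []).append(cluster)
    return cluster_prefs, label_prefs


# Hypothetical data: device -> last week's label, device -> this week's cluster.
old_label_of = {"d1": "L1", "d2": "L1", "d3": "L1", "d4": "L2"}
new_cluster_of = {"d1": "X", "d2": "X", "d3": "Y", "d4": "Y", "d5": "Y"}

cluster_prefs, label_prefs = overlap_rankings(old_label_of, new_cluster_of)
print(cluster_prefs)
print(label_prefs)
```

Lists built this way are partial and can contain ties, and some clusters may have no candidate label at all, which is where the tie-handling and label-creation considerations come in.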


Results & Cluster Stability

Metric:

The % of devices that maintain their cluster label after x weeks.

After      Min ID Based   Gale-Shapley Based
1 week     78%            98%
8 weeks    33%            87%
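The metric can be computed directly from two label snapshots. A minimal sketch, assuming each snapshot is a dict from device id to cluster label and that only devices present in both snapshots are counted (the slides do not specify how expired devices are treated); `label_retention` is a hypothetical helper and the snapshots are made up.

```python
def label_retention(labels_then, labels_now):
    """Fraction of devices seen in both snapshots whose label is unchanged."""
    shared = labels_then.keys() & labels_now.keys()
    if not shared:
        return 0.0
    kept = sum(1 for d in shared if labels_then[d] == labels_now[d])
    return kept / len(shared)


# Hypothetical snapshots: device A expires, so min-ID labeling relabels its cluster.
week_0 = {"A": "A", "B": "A", "C": "A", "X": "X", "Y": "X"}
week_1 = {"B": "B", "C": "B", "X": "X", "Y": "X"}

print(f"{label_retention(week_0, week_1):.0%}")  # 50%
```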


Conclusion

Many challenges which get thrown at data scientists can potentially be solved by deterministic engineering algorithms.

Being familiar with these algorithms prevents data scientists from reinventing the wheel.

Once you start using these algorithms, you start seeing use cases for them everywhere (we use connected components in no fewer than 3 parts of our graph building process).

Thank you!

Thanks to the Data Science/Engineering teams at Tapad

Read our blog: http://engineering.tapad.com

Careers: http://www.tapad.com/about-us/careers/openings

(Data Science & Engineering!)

Follow us on twitter: @tapad, @tapadeng

Contact me: yael@tapad.com, @y_s_e