Automatic Detection of Web Trackers by Vasia Kalavri

Post on 08-Jan-2017

Automatic Detection of Web Trackers

Vasia Kalavri
Apache Flink PMC, PhD student @KTH

kalavri@kth.se, @vkalavri

Telefonica Research, Barcelona

Computer Networks, Multimedia, Online Social Networks, Security & Privacy, Recommender Systems, HCI & Mobile Computing, Distributed Systems…

Browsing the Web

[Figure labels: Ads, Recommendations]

Tracking

[Figure labels: Tracker, Tracker, Ad Server; cookie exchange, profiling, display relevant ads]

The study's authors defined "creepiness" by the feeling consumers get when they sense an ad is too personal because it uses data the consumer did not agree to provide, such as online-search and browsing history. Consumers are even more creeped out by this because they don't know how and where that information will be used.

Linking Tracker Information

[Figure: amazon.com, imdb.com, and facebook.com embed trackers X and Y. Identifiers observed per client IP:]

IP 1.1.1.1: ID-A = “aaa”, ID-X = “xxx”
IP 2.2.2.2: ID-B = “bbb”, ID-Y = “yyy”
IP 3.3.3.3: ID-C = “ccc”, ID-X = “xxx”, ID-Y = “yyy”

Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, Edward W. Felten: Cookies That Give You Away: The Surveillance Implications of Web Tracking. WWW 2015: 289-299
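The linking idea on this slide can be sketched in a few lines. This is a simplified illustration, not the method of the cited paper: it assumes that all identifiers seen from one client IP belong to the same browser, and it merges identifiers with a small union-find. The function name `link_identifiers` and the observation list are hypothetical; the IPs and IDs are the ones from the slide.

```python
from collections import defaultdict

# Observations copied from the slide: (client IP, identifier seen in a request).
OBSERVATIONS = [
    ("1.1.1.1", "aaa"), ("1.1.1.1", "xxx"),
    ("2.2.2.2", "bbb"), ("2.2.2.2", "yyy"),
    ("3.3.3.3", "ccc"), ("3.3.3.3", "xxx"), ("3.3.3.3", "yyy"),
]


def link_identifiers(observations):
    """Group identifiers that can be tied together through shared client IPs.

    Assumption (for illustration only): one IP corresponds to one browser.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_id_per_ip = {}
    for ip, identifier in observations:
        find(identifier)  # register the identifier
        if ip in first_id_per_ip:
            union(first_id_per_ip[ip], identifier)
        else:
            first_id_per_ip[ip] = identifier

    clusters = defaultdict(set)
    for identifier in parent:
        clusters[find(identifier)].add(identifier)
    return sorted(sorted(cluster) for cluster in clusters.values())


clusters = link_identifiers(OBSERVATIONS)
print(clusters)
```

With only the first two IPs, "aaa"/"xxx" and "bbb"/"yyy" stay in separate groups. The third IP carries both "xxx" and "yyy", which is enough to merge everything into one profile.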

Can’t we block them?

[Figure labels: proxy, Tracker, Tracker, Ad Server, Legitimate site]

Available Solutions

AdBlock, DoNotTrack, EasyPrivacy: crowd-sourced “black lists” of tracker URLs

● not frequently updated
● not sure who blacklists URLs, or based on what criteria
● miss “hidden” trackers or dual-role nodes
● blocking requires manual matching against the list
● can you buy your way into the whitelist?

Our Goal

Exploit fundamental properties necessary for tracker operation

Use existing data to build a trackers classifier

● structural attributes: connections, network positions

● operational aspects: data volume exchange, communication patterns

Can we detect Trackers automatically?

● Are Trackers similar? How?
  ○ network structure
  ○ data received/sent
  ○ response times
  ○ latency

● Are Trackers different from normal sites? How?

● Are Trackers mainly connected to other Trackers?

The Road to our Goal

● algorithms
● tuning
● features
● combinations of algorithms and features and parameters...

The Dataset

172.134.23.3 http://www.buzzfeed.com/sheridanwatson/happy-birthday-eva-you -lucky- gal#.gnJbE8EDDK 3 45 20150203:17080345 34 200 GET www.buzzfeed.com/ HTTP/1.1 Host: www.google-analytics.com User-Agent: Mozilla/5.0 (Windows; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729) Accept: text/html,application/xhtml+xml, application/xml;q=0.9,*/*;q=0.8 Keep-Alive: 300 Connection: keep-alive 234561 34 0 0 ES

#records: ~80m
#users: ~3k
#URLs: ~2m
#Trackers: ~4k

Basic Dataset Analysis

● How many requests to Trackers?

DataSet API

● Do Tracker requests have larger latency than other requests?

● How many Trackers
  ○ per user?
  ○ per request?
  ○ per website?

● Do popular websites embed more Trackers than others?

● Do same-topic websites share Trackers?

● Do different users visiting the same website end up on different Trackers?

● Do Trackers send / receive more / less bytes?

● Do they have more / less connections on average?
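A few of these questions (share of requests that go to Trackers, latency of Tracker requests versus the rest, Trackers per user) reduce to simple aggregations. The talk runs them with the Flink DataSet API; the sketch below uses plain Python only to show the shape of the computation. The record layout, the sample values, the host `img.buzzfeed.com`, and the name `basic_stats` are all assumptions made for illustration, not data from the talk.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical, already-parsed log records: (user, referer, host, latency_ms).
RECORDS = [
    ("u1", "buzzfeed.com", "google-analytics.com", 40),
    ("u1", "buzzfeed.com", "img.buzzfeed.com", 20),
    ("u2", "buzzfeed.com", "b.scorecardresearch.com", 60),
    ("u2", "imdb.com", "google-analytics.com", 50),
    ("u2", "imdb.com", "imdb.com", 10),
]
# Hypothetical list of already-known Tracker hosts.
KNOWN_TRACKERS = {"google-analytics.com", "b.scorecardresearch.com"}


def basic_stats(records, trackers):
    """Aggregate simple per-class statistics (assumes both classes are non-empty)."""
    tracker_latency, other_latency = [], []
    trackers_per_user = defaultdict(set)
    for user, _referer, host, latency_ms in records:
        if host in trackers:
            tracker_latency.append(latency_ms)
            trackers_per_user[user].add(host)
        else:
            other_latency.append(latency_ms)
    return {
        "tracker_request_share": len(tracker_latency) / len(records),
        "mean_tracker_latency": mean(tracker_latency),
        "mean_other_latency": mean(other_latency),
        "trackers_per_user": {u: len(h) for u, h in trackers_per_user.items()},
    }


stats = basic_stats(RECORDS, KNOWN_TRACKERS)
print(stats)
```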

Main Idea

Model the data as a referer → host bipartite graph and exploit the graph structure to identify Trackers

[Figure: referers (URLs explicitly visited by the user), e.g. facebook.com, youtube.com → hosts (embedded URLs), e.g. google-analytics.com, b.scorecardresearch.com]
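As a rough sketch of the graph construction, each request contributes one referer → host edge. Weighting edges by request count is an assumption here (how to weigh the graph is one of the open tuning questions in the talk), and the request list and the name `build_bipartite_graph` are hypothetical.

```python
from collections import Counter

# Hypothetical (referer, host) pairs extracted from the request logs.
REQUESTS = [
    ("facebook.com", "google-analytics.com"),
    ("facebook.com", "b.scorecardresearch.com"),
    ("youtube.com", "google-analytics.com"),
    ("youtube.com", "google-analytics.com"),
]


def build_bipartite_graph(requests):
    """Return (referers, hosts, edge_weights) for a referer -> host graph.

    Edge weight = number of requests observed for that pair (an assumption).
    """
    edge_weights = Counter(requests)
    referers = {referer for referer, _host in edge_weights}
    hosts = {host for _referer, host in edge_weights}
    return referers, hosts, dict(edge_weights)


referers, hosts, edges = build_bipartite_graph(REQUESTS)
print(edges)
```

The two sides are kept as separate sets, so a domain that is both visited directly and embedded elsewhere can appear once as a referer and once as a host.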

Attempt #1: Relevance Search

Iterative, random walk-like algorithm for bipartite graphs

Given an input source node, assign a “relevance score” to other nodes, based on how similar their network position is

Sun, J., Qu, H., Chakrabarti, D., & Faloutsos, C. (2005). Relevance search and anomaly detection in bipartite graphs. ACM SIGKDD Explorations Newsletter, 7(2), 48-55.

Relevance Search Algorithm

[Figure labels: source, google-analytics, b.scorecardresearch, xzy/logo_small.jpg; relevance scores 0.9 and 0.1]

In each iteration, a vertex:

- sends a score to out-neighbors
- sums up received scores and updates value
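The iteration described on this slide can be sketched as a random walk with restart from the source node. The restart probability, the fixed iteration count, the unweighted edges, and the `r:`/`h:` node prefixes are assumptions made for this illustration; they are not taken from the talk or from the cited paper.

```python
from collections import defaultdict

# Hypothetical referer ("r:") and host ("h:") nodes of a small bipartite graph.
EDGES = [
    ("r:site1", "h:google-analytics"), ("r:site1", "h:b.scorecardresearch"),
    ("r:site2", "h:google-analytics"), ("r:site2", "h:b.scorecardresearch"),
    ("r:site3", "h:google-analytics"), ("r:site3", "h:xzy/logo_small.jpg"),
]


def relevance_scores(edges, source, restart=0.15, iterations=50):
    """Score every node by its relevance to `source` (random walk with restart)."""
    neighbors = defaultdict(set)
    for a, b in edges:  # treat edges as undirected so scores flow both ways
        neighbors[a].add(b)
        neighbors[b].add(a)
    neighbors = dict(neighbors)

    scores = {v: 0.0 for v in neighbors}
    scores[source] = 1.0
    for _ in range(iterations):
        received = {v: 0.0 for v in neighbors}
        for v, score in scores.items():
            share = score / len(neighbors[v])  # send an equal share to each neighbor
            for n in neighbors[v]:
                received[n] += share
        # Sum up received scores; the source also gets the restart mass.
        scores = {
            v: (1 - restart) * received[v] + (restart if v == source else 0.0)
            for v in neighbors
        }
    return scores


scores = relevance_scores(EDGES, source="h:google-analytics")
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.3f}")
```

In this toy graph, b.scorecardresearch shares two referers with the source and the logo image shares one, so the former ends up with the higher score.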

Relevance Search Implementation

● single-source relevance search
  ○ similar to PageRank
  ○ easily mapped to vertex-centric iterations

● multi-source relevance search
  ○ each vertex keeps a vector of scores
  ○ compute top-k relevant nodes per source
  ○ merge the top-k lists
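The top-k step of the multi-source variant might look like the sketch below. The slides do not say how the lists are merged, so keeping the maximum score per node is an assumption, as are the score values and the name `merge_top_k`.

```python
import heapq

# Hypothetical per-source relevance scores (source -> {node: score}).
SCORES_BY_SOURCE = {
    "google-analytics.com": {
        "google-analytics.com": 0.50, "b.scorecardresearch.com": 0.30,
        "xzy/logo_small.jpg": 0.10, "facebook.com": 0.05,
    },
    "b.scorecardresearch.com": {
        "b.scorecardresearch.com": 0.50, "google-analytics.com": 0.35,
        "cdn.cxense.com": 0.10, "xzy/logo_small.jpg": 0.02,
    },
}


def merge_top_k(scores_by_source, k):
    """Take the k most relevant nodes per source and merge them into one list.

    Merge rule (an assumption): a node keeps its best score over all sources.
    """
    merged = {}
    for source, scores in scores_by_source.items():
        candidates = ((node, s) for node, s in scores.items() if node != source)
        for node, s in heapq.nlargest(k, candidates, key=lambda item: item[1]):
            merged[node] = max(s, merged.get(node, 0.0))
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


top_nodes = merge_top_k(SCORES_BY_SOURCE, k=2)
print(top_nodes)
```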

Data Pipeline

Gelly API

Bipartite graph creation → Multi-source Relevance Search → top-k relevant nodes → Classification

www.google-analytics.com: T
www.bscored-research.com: T
www.facebook.com: NT
www.github.com: NT
cdn.cxense.com: NT
...

Relevance Search Tuning

● How many and which sources to give as input?
● How to define convergence?
● Does initialization matter?
● How to weigh the input graph?
● How to define the relevance score threshold?

Relevance Search Problems

● Easy to find the few very similar and the few very different pages

● Popular trackers are similar to other popular trackers, but not to not-so-popular ones

● We might keep re-discovering what we already know

Relevance Search doesn’t seem to completely solve the problem… Where do we go now?

Attempt #2...N-1: Combining Relevance Search with other algorithms

Several Clustering algorithms

k-nn Classification

Random Forest

Data Pipeline(s)

Bipartite graph creation → Multi-source Relevance Search → top-k relevant nodes → [feature extraction] → Classification: [your clustering, classification, etc. algorithm here] → [evaluation]

www.google-analytics.com: T
www.bscored-research.com: T
www.facebook.com: NT
www.github.com: NT
cdn.cxense.com: NT
...

The Projection Graph

[Figure: a referer-hosts graph (referers r1-r7, hosts h1-h8) next to the hosts-projection graph derived from it, whose edges are annotated with referers. Legend: referer, non-tracker host (NT), tracker host (T), unlabeled host (?).]

Attempt #N: Community Detection on the Projection Graph

The Projection Graph captures implicit connections between trackers, through other sites
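A minimal sketch of the projection step: two hosts become neighbors when at least one referer embeds both. Annotating each projected edge with its shared referers is an illustrative choice, and the edge list and the name `project_on_hosts` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical referer -> host edges of the bipartite graph.
REFERER_HOST_EDGES = [
    ("r1", "h1"), ("r1", "h2"),
    ("r2", "h2"), ("r2", "h3"),
    ("r3", "h2"), ("r3", "h3"),
]


def project_on_hosts(referer_host_edges):
    """Return {(host_a, host_b): set of referers that embed both hosts}."""
    hosts_by_referer = defaultdict(set)
    for referer, host in referer_host_edges:
        hosts_by_referer[referer].add(host)

    shared_referers = defaultdict(set)
    for referer, hosts in hosts_by_referer.items():
        # Every pair of hosts embedded by the same referer gets an edge.
        for host_a, host_b in combinations(sorted(hosts), 2):
            shared_referers[(host_a, host_b)].add(referer)
    return dict(shared_referers)


projection = project_on_hosts(REFERER_HOST_EDGES)
print(projection)
```

The number of shared referers per edge could serve as an edge weight, although the talk does not say whether the projection graph is weighted.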

Do Trackers form communities in the Projection Graph?

Basic Analysis of the Projection Graph

DataSet & Gelly APIs

● Do Trackers form connected components?

● Do Trackers have unusually high degrees?

● Are they mainly connected to other Trackers?
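The degree and neighborhood questions can be answered per host with one pass over the projection graph. The sketch below uses plain Python rather than the DataSet and Gelly APIs mentioned on the slide; the toy edges, the labels, and the name `tracker_neighbor_stats` are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical projection-graph edges and known labels (T = Tracker, NT = non-Tracker).
PROJECTION_EDGES = [
    ("t1", "t2"), ("t1", "t3"), ("t2", "t3"),
    ("t3", "n1"), ("n1", "n2"),
    ("u", "t1"), ("u", "t2"),
]
LABELS = {"t1": "T", "t2": "T", "t3": "T", "n1": "NT", "n2": "NT"}  # "u" is unlabeled


def tracker_neighbor_stats(edges, labels):
    """Return {host: (degree, fraction of labeled neighbors that are Trackers)}."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    stats = {}
    for host, adjacent in neighbors.items():
        labeled = [n for n in adjacent if n in labels]
        tracker_count = sum(1 for n in labeled if labels[n] == "T")
        # None when the host has no labeled neighbor at all.
        fraction = tracker_count / len(labeled) if labeled else None
        stats[host] = (len(adjacent), fraction)
    return stats


stats = tracker_neighbor_stats(PROJECTION_EDGES, LABELS)
for host in sorted(stats):
    print(host, stats[host])
```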

Visualization

Final Data Pipeline

1: logs pre-processing (raw logs → cleaned logs)
2: bipartite graph creation
3: largest connected component extraction
4: hosts-projection graph creation
5: community detection
6: results

google-analytics.com: T
bscored-research.com: T
facebook.com: NT
github.com: NT
cdn.cxense.com: NT
...

APIs used along the pipeline: DataSet API, Gelly, DataSet API
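Steps 3 and 5 can be sketched as follows. The slides do not name the community detection algorithm, so the neighbor-majority labeling below is a stand-in chosen for illustration, not the method used in the talk; the toy graph and the function names are hypothetical as well.

```python
from collections import Counter, defaultdict, deque

# Hypothetical host graph: one large component and one small one (x1-x2).
EDGES = [
    ("t1", "t2"), ("t1", "u1"), ("t2", "u1"), ("t2", "n1"),
    ("n1", "n2"), ("n1", "u2"), ("n2", "u2"),
    ("x1", "x2"),
]
KNOWN = {"t1": "T", "t2": "T", "n1": "NT", "n2": "NT"}


def adjacency(edges):
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return dict(neighbors)


def largest_component(neighbors):
    """Step 3: breadth-first search from every unseen node; keep the biggest set."""
    seen, best = set(), set()
    for start in neighbors:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for n in neighbors[v]:
                if n not in component:
                    component.add(n)
                    queue.append(n)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def label_by_neighbors(neighbors, known, rounds=5):
    """Stand-in for step 5: unlabeled hosts adopt their neighbors' majority label.

    Ties are broken arbitrarily.
    """
    labels = dict(known)
    for _ in range(rounds):
        updates = {}
        for v in neighbors:
            if v in labels:
                continue
            votes = Counter(labels[n] for n in neighbors[v] if n in labels)
            if votes:
                updates[v] = votes.most_common(1)[0][0]
        if not updates:
            break
        labels.update(updates)
    return labels


graph = adjacency(EDGES)
component = largest_component(graph)
subgraph = {v: graph[v] & component for v in component}
labels = label_by_neighbors(subgraph, KNOWN)
print(sorted(labels.items()))
```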

Very high accuracy and very low FPR :-)
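For reference, accuracy and false positive rate (FPR) can be computed from labeled hosts as below. The slides report no figures, so the labels here are made up purely to exercise the definitions; `accuracy_and_fpr` is a hypothetical helper.

```python
def accuracy_and_fpr(true_labels, predicted_labels):
    """Accuracy = (TP + TN) / all; FPR = FP / (FP + TN), with "T" as the positive class."""
    tp = fp = tn = fn = 0
    for host, truth in true_labels.items():
        predicted = predicted_labels[host]
        if truth == "T" and predicted == "T":
            tp += 1
        elif truth == "T":
            fn += 1
        elif predicted == "T":
            fp += 1
        else:
            tn += 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


# Made-up labels, only to show the calculation.
TRUTH = {"a": "T", "b": "T", "c": "NT", "d": "NT", "e": "NT"}
PREDICTED = {"a": "T", "b": "NT", "c": "NT", "d": "T", "e": "NT"}

accuracy, fpr = accuracy_and_fpr(TRUTH, PREDICTED)
print(accuracy, fpr)
```

A low FPR matters here because a false positive means a legitimate site gets flagged as a Tracker.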

Lessons Learned

Start simple
Choose features incrementally
Visualize your data
Re-evaluate your models
Try different data representations
Use a flexible system

Optimizing the Pipeline

Flink Optimizer to the rescue :-)