Automatic Detection of Web Trackers by Vasia Kalavri

Automatic Detection of Web Trackers

Vasia Kalavri

Apache Flink PMC, PhD student @KTH

kalavri@kth.se, @vkalavri

Telefonica Research, Barcelona

Computer Networks, Multimedia, Online Social Networks, Security & Privacy, Recommender Systems, HCI & Mobile Computing, Distributed Systems…

Recommendations

Browsing the Web

Tracker

Ad Server

display relevant ads

cookie exchange

profiling

Tracking

The study's authors defined "creepiness" by the feeling consumers get when they sense an ad is too personal because it uses data the consumer did not agree to provide, such as online-search and browsing history. Consumers are even more creeped out by this because they don't know how and where that information will be used.

amazon.com imdb.com facebook.com

IP 1.1.1.1

ID-A = “aaa”

IP 1.1.1.1

ID-X = “xxx”

IP 2.2.2.2

ID-B = “bbb”

IP 2.2.2.2

ID-Y = “yyy”

IP 3.3.3.3

ID-C = “ccc”

IP 3.3.3.3

ID-X = “xxx”

IP 3.3.3.3

ID-Y = “yyy”

Linking Tracker Information

Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, Edward W. Felten: Cookies That Give You Away: The Surveillance Implications of Web Tracking. WWW 2015: 289-299
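A minimal sketch of the linking idea in the example above, assuming each observation is an (IP, cookie ID) pair and that a shared IP or a shared ID is enough to link two observations. The union-find helper and the name `link_observations` are illustrative and not taken from the cited paper.

```python
from collections import defaultdict

# (ip, cookie_id) observations copied from the example above.
observations = [
    ("1.1.1.1", "ID-A=aaa"),
    ("1.1.1.1", "ID-X=xxx"),
    ("2.2.2.2", "ID-B=bbb"),
    ("2.2.2.2", "ID-Y=yyy"),
    ("3.3.3.3", "ID-C=ccc"),
    ("3.3.3.3", "ID-X=xxx"),
    ("3.3.3.3", "ID-Y=yyy"),
]


def link_observations(observations):
    """Group IPs and cookie IDs that are transitively linked (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Seeing an ID at an IP links the two.
    for ip, cookie in observations:
        union(("ip", ip), ("id", cookie))

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())


clusters = link_observations(observations)
```

With these observations everything collapses into one cluster: ID-X ties 1.1.1.1 to 3.3.3.3, and ID-Y ties 2.2.2.2 to 3.3.3.3.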

Can’t we block them?

Tracker

Ad Server

Legitimate site

Available Solutions

AdBlock, DoNotTrack, EasyPrivacy: crowd-sourced “black lists” of tracker URLs

● not frequently updated
● not sure who blacklists URLs, or based on what criteria
● miss “hidden” trackers or dual-role nodes
● blocking requires manual matching against the list
● can you buy your way into the whitelist?

Our Goal

Exploit fundamental properties necessary for tracker operation

Use existing data to build a trackers classifier

● structural attributes: connections, network positions

● operational aspects: data volume exchange, communication patterns

Can we detect Trackers automatically?

● Are Trackers similar? How?
  ○ network structure
  ○ data received/sent
  ○ response times
  ○ latency
● Are Trackers different from normal sites? How?
● Are Trackers mainly connected to other Trackers?

The Road to our Goal

● algorithms
● tuning
● features
● combinations of algorithms and features and parameters...

The Dataset

172.134.23.3 http://www.buzzfeed.com/sheridanwatson/happy-birthday-eva-you-lucky-gal#.gnJbE8EDDK 3 45 20150203:17080345 34 200 GET www.buzzfeed.com/ HTTP/1.1 Host: www.google-analytics.com User-Agent: Mozilla/5.0 (Windows; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729) Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 Keep-Alive: 300 Connection: keep-alive 234561 34 0 0 ES

#records: ~80m
#users: ~3k
#URLs: ~2m
#Trackers: ~4k

Basic Dataset Analysis

● How many requests to Trackers?

DataSet API

● Do Tracker requests have larger latency than other requests?

● How many Trackers
  ○ per user?
  ○ per request?
  ○ per website?

● Do popular websites embed more Trackers than others?

● Do same-topic websites share Trackers?

● Do different users visiting the same website end up on different Trackers?

● Do Trackers send / receive more / less bytes?

● Do they have more / less connections on average?
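A few of these questions can be sketched on a toy sample. This assumes each log record has already been reduced to a (user, referer host, requested host) triple and that a set of known Trackers is available. The records below are hypothetical, and the talk itself uses the Flink DataSet API rather than plain Python.

```python
from collections import defaultdict

# Hypothetical, simplified records: (user_ip, referer_host, requested_host).
records = [
    ("10.0.0.1", "buzzfeed.com", "google-analytics.com"),
    ("10.0.0.1", "buzzfeed.com", "b.scorecardresearch.com"),
    ("10.0.0.1", "buzzfeed.com", "buzzfeed.com"),
    ("10.0.0.2", "youtube.com", "google-analytics.com"),
    ("10.0.0.2", "youtube.com", "youtube.com"),
]
known_trackers = {"google-analytics.com", "b.scorecardresearch.com"}

# How many requests go to Trackers?
tracker_requests = sum(1 for _, _, host in records if host in known_trackers)

# How many distinct Trackers per user and per website?
trackers_per_user = defaultdict(set)
trackers_per_site = defaultdict(set)
for user, referer, host in records:
    if host in known_trackers:
        trackers_per_user[user].add(host)
        trackers_per_site[referer].add(host)

for user, trackers in sorted(trackers_per_user.items()):
    print(user, len(trackers))
```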

Main Idea

Model the data as a referer → host bipartite graph and exploit the graph structure to identify Trackers

Referers (URLs explicitly visited by the user): facebook.com, youtube.com

Hosts (embedded URLs): google-analytics.com, b.scorecardresearch.com
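A minimal sketch of the bipartite graph creation, assuming the logs have been reduced to (referer, host) pairs. Using request counts as edge weights is an assumption of this sketch, and `build_bipartite` is an illustrative name.

```python
from collections import defaultdict


def build_bipartite(pairs):
    """Return referer -> {host: number of requests} for (referer, host) pairs."""
    graph = defaultdict(lambda: defaultdict(int))
    for referer, host in pairs:
        graph[referer][host] += 1
    return graph


# Hypothetical request pairs using the hosts named above.
pairs = [
    ("facebook.com", "google-analytics.com"),
    ("youtube.com", "google-analytics.com"),
    ("youtube.com", "b.scorecardresearch.com"),
    ("youtube.com", "google-analytics.com"),
]
graph = build_bipartite(pairs)
```

Referers sit on one side of the graph and hosts on the other, so a domain that appears in both roles would be two separate vertices.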

Attempt #1: Relevance Search

Iterative, random walk-like algorithm for bipartite graphs

Given an input source node, assign a “relevance score” to other nodes, based on how similar their network position is

Sun, J., Qu, H., Chakrabarti, D., & Faloutsos, C. (2005). Relevance search and anomaly detection in bipartite graphs. ACM SIGKDD Explorations Newsletter, 7(2), 48-55.

Relevance Search Algorithm

google-analytics

b.scorecardresearch

xzy/logo_small.jpg

In each iteration, a vertex:

- sends a score to out-neighbors
- sums up received scores and updates value
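The iteration above can be sketched as a random walk with restart from a single source. The restart probability, the fixed iteration count, and the `r:` / `h:` prefixes that keep referers and hosts apart are assumptions of this sketch, not parameters given in the talk.

```python
def relevance_scores(edges, source, restart=0.15, iterations=50):
    """Random walk with restart on an undirected bipartite graph.

    edges: iterable of (referer, host) pairs; source must appear in them.
    Returns a dict node -> relevance score with respect to source.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    scores = {node: 0.0 for node in neighbors}
    scores[source] = 1.0
    for _ in range(iterations):
        received = {node: 0.0 for node in neighbors}
        # Each vertex sends an equal share of its score to every neighbor.
        for node, score in scores.items():
            share = score / len(neighbors[node])
            for nb in neighbors[node]:
                received[nb] += share
        # Each vertex sums what it received; the source also gets the restart mass.
        scores = {
            node: (1 - restart) * received[node] + (restart if node == source else 0.0)
            for node in neighbors
        }
    return scores


edges = [
    ("r:facebook.com", "h:google-analytics.com"),
    ("r:facebook.com", "h:b.scorecardresearch.com"),
    ("r:youtube.com", "h:google-analytics.com"),
    ("r:youtube.com", "h:b.scorecardresearch.com"),
    ("r:youtube.com", "h:xzy/logo_small.jpg"),
    ("r:blog.example", "h:xzy/logo_small.jpg"),
]
scores = relevance_scores(edges, source="h:google-analytics.com")
```

In this toy graph b.scorecardresearch shares both referers with the source while the logo image shares only one, so the former ends up with the higher score.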

Relevance Search Implementation

● single-source relevance search
  ○ similar to pagerank
  ○ easily mapped to vertex-centric iterations

● multi-source relevance search
  ○ each vertex keeps a vector of scores
  ○ compute top-k relevant nodes per source
  ○ merge the top-k lists
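A sketch of the last two multi-source steps, assuming the per-source scores are already computed. The scores below are hypothetical, and merging by keeping each node's best score across sources is one possible choice rather than the rule used in the talk.

```python
import heapq


def top_k(scores, k, exclude=()):
    """Return the k highest-scoring (node, score) pairs, skipping excluded nodes."""
    candidates = ((node, s) for node, s in scores.items() if node not in exclude)
    return heapq.nlargest(k, candidates, key=lambda item: item[1])


def merge_top_k(per_source_scores, k):
    """Merge the per-source top-k lists, keeping each node's best score."""
    merged = {}
    for source, scores in per_source_scores.items():
        for node, score in top_k(scores, k, exclude={source}):
            merged[node] = max(score, merged.get(node, 0.0))
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical relevance scores for two source Trackers.
per_source_scores = {
    "google-analytics.com": {
        "google-analytics.com": 0.50,
        "b.scorecardresearch.com": 0.30,
        "cdn.cxense.com": 0.10,
        "github.com": 0.05,
    },
    "b.scorecardresearch.com": {
        "b.scorecardresearch.com": 0.60,
        "google-analytics.com": 0.25,
        "cdn.cxense.com": 0.12,
        "github.com": 0.01,
    },
}
merged = merge_top_k(per_source_scores, k=2)
```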

Gelly API

Data Pipeline

top-k relevant

www.google-analytics.com: T
www.bscored-research.com: T
www.facebook.com: NT
www.github.com: NT
cdn.cxense.com: NT
...

Bipartite graph creation

Multi-source Relevance Search

Classification

Relevance Search Tuning

● How many and which sources to give as input?
● How to define convergence?
● Does initialization matter?
● How to weigh the input graph?
● How to define the relevance score threshold?

Relevance Search Problems

● Easy to find the few very similar and the few very different pages

● Popular trackers are similar to other popular trackers, but not to not-so-popular ones

● We might keep re-discovering what we already know

Relevance Search doesn’t seem to completely solve the problem… Where do we go now?

Attempt #2...N-1: Combining Relevance Search with other algorithms

Several Clustering algorithms

k-nn Classification

Random Forest

Data Pipeline(s)

top-k relevant

www.google-analytics.com: T
www.bscored-research.com: T
www.facebook.com: NT
www.github.com: NT
cdn.cxense.com: NT
...

Bipartite graph creation

Multi-source Relevance Search

[feature extraction]

Classification

[your clustering, classification, etc. algorithm here]

[evaluation]

[Figure: a referer-hosts graph next to its hosts-projection graph. Legend: referer, non-tracker host, tracker host, unlabeled host.]

The Projection Graph

Attempt #N: Community Detection on the Projection Graph

The Projection Graph captures implicit connections between trackers, through other sites
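A minimal sketch of the projection, assuming two hosts are connected whenever at least one referer embeds both. The input mapping is hypothetical and `project_hosts` is an illustrative name.

```python
from collections import defaultdict
from itertools import combinations


def project_hosts(referer_to_hosts):
    """Connect every pair of hosts that share at least one referer."""
    projection = defaultdict(set)
    for hosts in referer_to_hosts.values():
        for a, b in combinations(sorted(set(hosts)), 2):
            projection[a].add(b)
            projection[b].add(a)
    return projection


# Hypothetical referer -> embedded hosts mapping.
referer_to_hosts = {
    "facebook.com": ["google-analytics.com", "b.scorecardresearch.com"],
    "youtube.com": ["google-analytics.com", "cdn.cxense.com"],
}
projection = project_hosts(referer_to_hosts)
```

The referers disappear from the result; they survive only as the reason two hosts are neighbors.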

Do Trackers form communities in the Projection Graph?

● Do they form connected components?

Basic Analysis of the Projection Graph

● Do Trackers have unusually high degrees?

DataSet & Gelly APIs

● Are they mainly connected to other Trackers?
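The degree and neighborhood questions can be sketched on a small projection graph. This assumes the graph is a dict from host to its set of neighbors and that some hosts carry a known T / NT label. The edges below are hypothetical.

```python
def tracker_neighbor_fraction(projection, labels, host):
    """Fraction of a host's labeled neighbors that are Trackers ("T")."""
    labeled = [nb for nb in projection[host] if nb in labels]
    if not labeled:
        return None  # nothing to say without labeled neighbors
    return sum(labels[nb] == "T" for nb in labeled) / len(labeled)


# Hypothetical projection graph (host -> neighbors) and known labels.
projection = {
    "google-analytics.com": {"b.scorecardresearch.com", "facebook.com", "cdn.cxense.com"},
    "b.scorecardresearch.com": {"google-analytics.com"},
    "facebook.com": {"google-analytics.com"},
    "cdn.cxense.com": {"google-analytics.com"},
}
labels = {
    "google-analytics.com": "T",
    "b.scorecardresearch.com": "T",
    "facebook.com": "NT",
    "cdn.cxense.com": "NT",
}

degrees = {host: len(nbs) for host, nbs in projection.items()}
fraction = tracker_neighbor_fraction(projection, labels, "b.scorecardresearch.com")
```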

Visualization

Final Data Pipeline

raw logs → cleaned logs

1: logs pre-processing
2: bipartite graph creation
3: largest connected component extraction
4: hosts-projection graph creation
5: community detection
6: results

google-analytics.com: T
bscored-research.com: T
facebook.com: NT
github.com: NT
cdn.cxense.com: NT
...
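A sketch of steps 3 and 5 on a toy hosts-projection graph. The slides do not name the community detection algorithm, so a majority vote over already-labeled neighbors stands in for it here. The graph, the `.example` hosts, and the function names are hypothetical. The graph is assumed to be undirected, with every node present as a key.

```python
from collections import Counter, deque


def largest_component(graph):
    """Return the node set of the largest connected component (BFS)."""
    seen, best = set(), set()
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nb in graph[node]:
                if nb not in component:
                    component.add(nb)
                    queue.append(nb)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def classify_by_neighbors(graph, labels, nodes):
    """Give each unlabeled node the most common label among its labeled neighbors."""
    predicted = {}
    for node in nodes:
        if node in labels:
            continue
        votes = Counter(labels[nb] for nb in graph[node] if nb in labels)
        if votes:
            predicted[node] = votes.most_common(1)[0][0]
    return predicted


graph = {
    "google-analytics.com": {"b.scorecardresearch.com", "unknown-a.example", "facebook.com"},
    "b.scorecardresearch.com": {"google-analytics.com", "unknown-a.example"},
    "unknown-a.example": {"google-analytics.com", "b.scorecardresearch.com"},
    "facebook.com": {"google-analytics.com", "github.com", "unknown-b.example"},
    "github.com": {"facebook.com", "unknown-b.example"},
    "unknown-b.example": {"facebook.com", "github.com"},
    "island-1.example": {"island-2.example"},
    "island-2.example": {"island-1.example"},
}
labels = {
    "google-analytics.com": "T",
    "b.scorecardresearch.com": "T",
    "facebook.com": "NT",
    "github.com": "NT",
}

component = largest_component(graph)
predicted = classify_by_neighbors(graph, labels, component)
```

The two island hosts fall outside the largest component, so this sketch leaves them unclassified.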

DataSet API

Very high accuracy and very low FPR :-)
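For reference, accuracy and false positive rate (FPR) can be computed as below, treating T as the positive class. The labels are hypothetical and are not results from the talk.

```python
def accuracy_and_fpr(truth, predicted):
    """Return (accuracy, false positive rate) with "T" as the positive class."""
    tp = fp = tn = fn = 0
    for host, actual in truth.items():
        guess = predicted[host]
        if actual == "T" and guess == "T":
            tp += 1
        elif actual == "NT" and guess == "T":
            fp += 1
        elif actual == "NT" and guess == "NT":
            tn += 1
        else:
            fn += 1
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


# Hypothetical ground truth and classifier output.
truth = {"a.example": "T", "b.example": "T", "c.example": "NT",
         "d.example": "NT", "e.example": "NT"}
predicted = {"a.example": "T", "b.example": "T", "c.example": "T",
             "d.example": "NT", "e.example": "NT"}
accuracy, fpr = accuracy_and_fpr(truth, predicted)
```

A low FPR matters here because a false positive means a legitimate site gets flagged, and possibly blocked, as a Tracker.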

Lessons Learned

● Start simple
● Choose features incrementally
● Visualize your data
● Re-evaluate your models
● Try different data representations
● Use a flexible system

Optimizing the Pipeline

Flink Optimizer to the rescue :-)
