Post on 08-Jan-2017
Automatic Detection of Web Trackers
Vasia Kalavri
Apache Flink PMC, PhD student @KTH
kalavri@kth.se, @vkalavri
Telefonica Research, Barcelona
Computer Networks, Multimedia, Online Social Networks, Security & Privacy, Recommender Systems, HCI & Mobile Computing, Distributed Systems…
Ads
Recommendations
Browsing the Web
Tracker
Tracker
Ad Server
display relevant ads
cookie exchange
profiling
Tracking
The study's authors defined "creepiness" by the feeling consumers get when they sense an ad is too personal because it uses data the consumer did not agree to provide, such as online-search and browsing history. Consumers are even more creeped out by this because they don't know how and where that information will be used.
Linking Tracker Information
site          trackers  IP       IDs
amazon.com    X         1.1.1.1  ID-A = “aaa”, ID-X = “xxx”
imdb.com      Y         2.2.2.2  ID-B = “bbb”, ID-Y = “yyy”
facebook.com  X, Y      3.3.3.3  ID-C = “ccc”, ID-X = “xxx”, ID-Y = “yyy”
Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, Edward W. Felten: Cookies That Give You Away: The Surveillance Implications of Web Tracking. WWW 2015: 289-299
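The linking idea on the slide above can be sketched as a union-find over observations: two page loads are attributed to the same user when they share any tracker ID, directly or transitively. This is a minimal Python sketch; the `link_observations` helper and the `(ip, ids)` tuple layout are assumptions made for illustration and are not taken from the cited paper.

```python
from collections import defaultdict

def link_observations(observations):
    """Group observations that share at least one ID, transitively.

    observations: list of (ip, set_of_ids); this layout is hypothetical.
    Returns a list of sets of observation indices, one set per linked group.
    """
    parent = list(range(len(observations)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    first_seen = {}  # ID -> index of the first observation carrying it
    for idx, (_ip, ids) in enumerate(observations):
        for cookie_id in ids:
            if cookie_id in first_seen:
                union(first_seen[cookie_id], idx)
            else:
                first_seen[cookie_id] = idx

    groups = defaultdict(set)
    for idx in range(len(observations)):
        groups[find(idx)].add(idx)
    return list(groups.values())

# Values mirror the slide: three IPs, linked through tracker IDs xxx and yyy.
observations = [
    ("1.1.1.1", {"aaa", "xxx"}),
    ("2.2.2.2", {"bbb", "yyy"}),
    ("3.3.3.3", {"ccc", "xxx", "yyy"}),
]
linked = link_observations(observations)
```

With the slide's values, the third observation carries both tracker IDs, so all three IPs end up in a single group.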
Can’t we block them?
proxy
Tracker
Tracker
Ad Server
Legitimate site
Available Solutions
AdBlock, DoNotTrack, EasyPrivacy: crowd-sourced “black lists” of tracker URLs
● not frequently updated
● unclear who blacklists URLs, and based on what criteria
● miss “hidden” trackers or dual-role nodes
● blocking requires manual matching against the list
● can you buy your way into the whitelist?
Our Goal
Exploit fundamental properties necessary for tracker operation
Use existing data to build a tracker classifier
● structural attributes: connections, network positions
● operational aspects: data volume exchange, communication patterns
Can we detect Trackers automatically?
● Are Trackers similar? How?
  ○ network structure
  ○ data received/sent
  ○ response times
  ○ latency
● Are Trackers different from normal sites? How?
● Are Trackers mainly connected to other Trackers?
The Road to our Goal
● algorithms
● tuning
● features
● combinations of algorithms and features and parameters...
The Dataset
172.134.23.3 http://www.buzzfeed.com/sheridanwatson/happy-birthday-eva-you -lucky- gal#.gnJbE8EDDK 3 45 20150203:17080345 34 200 GET www.buzzfeed.com/ HTTP/1.1 Host: www.google-analytics.com User-Agent: Mozilla/5.0 (Windows; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729) Accept: text/html,application/xhtml+xml, application/xml;q=0.9,*/*;q=0.8 Keep-Alive: 300 Connection: keep-alive 234561 34 0 0 ES
#records: ~80m
#users: ~3k
#URLs: ~2m
#Trackers: ~4k
Basic Dataset Analysis
DataSet API
● How many requests to Trackers?
● Do Tracker requests have larger latency than other requests?
● How many Trackers
  ○ per user?
  ○ per request?
  ○ per website?
● Do popular websites embed more Trackers than others?
● Do same-topic websites share Trackers?
● Do different users visiting the same website end up on different Trackers?
● Do Trackers send / receive more / less bytes?
● Do they have more / less connections on average?
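A few of the questions above reduce to simple group-and-count aggregations. The sketch below answers three of them in plain Python rather than the Flink DataSet API used in the talk; the `(user, referer, host)` record layout, the `tracker_stats` helper, and the sample values are assumptions for illustration only.

```python
from collections import defaultdict

def tracker_stats(records, trackers):
    """records: iterable of (user, referer, host); trackers: set of known tracker hosts.

    Returns (share of requests that go to trackers,
             distinct trackers per user,
             distinct trackers per website).
    """
    total = 0
    to_trackers = 0
    per_user = defaultdict(set)
    per_site = defaultdict(set)
    for user, referer, host in records:
        total += 1
        if host in trackers:
            to_trackers += 1
            per_user[user].add(host)
            per_site[referer].add(host)
    share = to_trackers / total if total else 0.0
    return (
        share,
        {u: len(t) for u, t in per_user.items()},
        {s: len(t) for s, t in per_site.items()},
    )

# Made-up sample records, not taken from the dataset.
records = [
    ("u1", "buzzfeed.com", "google-analytics.com"),
    ("u1", "buzzfeed.com", "b.scorecardresearch.com"),
    ("u1", "buzzfeed.com", "cdn.example.org"),
    ("u2", "youtube.com", "google-analytics.com"),
]
trackers = {"google-analytics.com", "b.scorecardresearch.com"}
share, per_user, per_site = tracker_stats(records, trackers)
```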
Main Idea
Model the data as a referer → host bipartite graph and exploit the graph structure to identify Trackers
URLs explicitly visited by the user: facebook.com, youtube.com
embedded URLs: google-analytics.com, b.scorecardresearch.com
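Building the referer → host bipartite graph amounts to collecting one edge per (referer, host) pair seen in the logs. A minimal Python sketch follows; the `build_bipartite` helper, the edge weights (request counts), and the example edges are assumptions, since the slides do not say which site embeds which tracker.

```python
from collections import Counter, defaultdict

def build_bipartite(pairs):
    """pairs: iterable of (referer, host) taken from request logs.

    Returns (edge_weight, hosts_of, referers_of):
      edge_weight[(referer, host)] counts requests on that edge,
      hosts_of[referer] is the set of hosts embedded by that referer,
      referers_of[host] is the set of referers that embed that host.
    """
    edge_weight = Counter()
    hosts_of = defaultdict(set)
    referers_of = defaultdict(set)
    for referer, host in pairs:
        edge_weight[(referer, host)] += 1
        hosts_of[referer].add(host)
        referers_of[host].add(referer)
    return edge_weight, dict(hosts_of), dict(referers_of)

# Hypothetical edges for illustration.
pairs = [
    ("facebook.com", "google-analytics.com"),
    ("youtube.com", "google-analytics.com"),
    ("youtube.com", "b.scorecardresearch.com"),
    ("facebook.com", "google-analytics.com"),
]
edge_weight, hosts_of, referers_of = build_bipartite(pairs)
```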
Attempt #1: Relevance Search
Iterative, random walk-like algorithm for bipartite graphs
Given an input source node, assign a “relevance score” to other nodes, based on how similar their network position is
Sun, J., Qu, H., Chakrabarti, D., & Faloutsos, C. (2005). Relevance search and anomaly detection in bipartite graphs. ACM SIGKDD Explorations Newsletter, 7(2), 48-55.
Relevance Search Algorithm
google-analytics
b.scorecardresearch
xzy/logo_small.jpg
0.9
0.1
source
In each iteration, a vertex:
- sends a score to out-neighbors
- sums up received scores and updates value
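The iteration described above can be sketched as a random walk with restart: each vertex splits its score evenly among its neighbours, sums what it receives, and a fixed share of the score is returned to the source. This is a plain Python sketch, not the Gelly implementation from the talk; the restart probability, the iteration count, and the `site-*` referer names are assumptions.

```python
def relevance_search(adj, source, restart=0.15, iterations=50):
    """adj: dict node -> set of neighbours (undirected, no isolated nodes).

    Returns dict node -> relevance score with respect to `source`.
    """
    scores = {v: 0.0 for v in adj}
    scores[source] = 1.0
    for _ in range(iterations):
        received = {v: 0.0 for v in adj}
        for u, neighbours in adj.items():
            if not neighbours:
                continue
            share = scores[u] / len(neighbours)  # send score to neighbours
            for v in neighbours:
                received[v] += share             # sum up received scores
        scores = {v: (1.0 - restart) * received[v] for v in adj}
        scores[source] += restart                # restart at the source
    return scores

# Host names come from the slide; referers and edges are hypothetical.
adj = {
    "site-a": {"google-analytics", "b.scorecardresearch"},
    "site-b": {"google-analytics", "b.scorecardresearch", "xzy/logo_small.jpg"},
    "site-c": {"xzy/logo_small.jpg"},
    "google-analytics": {"site-a", "site-b"},
    "b.scorecardresearch": {"site-a", "site-b"},
    "xzy/logo_small.jpg": {"site-b", "site-c"},
}
scores = relevance_search(adj, "google-analytics")
```

In this toy graph b.scorecardresearch shares both referers with the source and therefore ends up with a higher score than the image URL, which shares only one.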
Relevance Search Implementation
● single-source relevance search
  ○ similar to PageRank
  ○ easily mapped to vertex-centric iterations
● multi-source relevance search
  ○ each vertex keeps a vector of scores
  ○ compute top-k relevant nodes per source
  ○ merge the top-k lists
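The multi-source post-processing can be sketched as follows, assuming the per-vertex score vectors have already been computed. The slides do not say how the top-k lists are merged; this Python sketch keeps the maximum score per vertex, which is an assumption, as are the helper names and the sample scores.

```python
import heapq

def top_k_per_source(vertex_scores, sources, k):
    """vertex_scores: dict vertex -> {source: score} (one score vector per vertex).

    Returns {source: [(vertex, score), ...]} with the k most relevant
    vertices per source, excluding the source itself.
    """
    result = {}
    for s in sources:
        candidates = [
            (v, vec.get(s, 0.0)) for v, vec in vertex_scores.items() if v != s
        ]
        result[s] = heapq.nlargest(k, candidates, key=lambda item: item[1])
    return result

def merge_top_k(top_lists):
    """Union of all top-k lists; a vertex keeps its best score (assumed rule)."""
    merged = {}
    for ranked in top_lists.values():
        for vertex, score in ranked:
            merged[vertex] = max(score, merged.get(vertex, 0.0))
    return merged

# Made-up score vectors for two sources, t1 and t2.
vertex_scores = {
    "t1": {"t1": 0.5, "t2": 0.2},
    "t2": {"t1": 0.3, "t2": 0.6},
    "a": {"t1": 0.15, "t2": 0.05},
    "b": {"t1": 0.05, "t2": 0.15},
}
top = top_k_per_source(vertex_scores, ["t1", "t2"], k=2)
merged = merge_top_k(top)
```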
Gelly API
Data Pipeline
top-k relevant nodes
www.google-analytics.com: T
www.bscored-research.com: T
www.facebook.com: NT
www.github.com: NT
cdn.cxense.com: NT
...
Bipartite graph creation
Multi-source Relevance Search
Classification
Relevance Search Tuning
● How many and which sources to give as input?
● How to define convergence?
● Does initialization matter?
● How to weigh the input graph?
● How to define the relevance score threshold?
Relevance Search Problems
● Easy to find the few very similar and the few very different pages
● Popular trackers are similar to other popular trackers, but not to not-so-popular ones
● We might keep re-discovering what we already know
Relevance Search doesn’t seem to completely solve the problem… Where do we go now?
Attempt #2...N-1: Combining Relevance Search with other algorithms
Several Clustering algorithms
k-nn Classification
Random Forest
Data Pipeline(s)
top-k relevant nodes
www.google-analytics.com: T
www.bscored-research.com: T
www.facebook.com: NT
www.github.com: NT
cdn.cxense.com: NT
...
Bipartite graph creation
Multi-source Relevance Search
[feature extraction]
Classification
[your clustering, classification, etc. algorithm here]
[evaluation]
[Figure: a referer-hosts graph with referers r1 to r7 and hosts h1 to h8, next to its hosts-projection graph, whose edges are annotated with the referers that two hosts share. Legend: referer, non-tracker host (NT), tracker host (T), unlabeled host (?).]
The Projection Graph
Attempt #N: Community Detection on the Projection Graph
The Projection Graph captures implicit connections between trackers, through other sites
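A hosts-projection graph can be derived from the bipartite graph by connecting every pair of hosts that share at least one referer. The Python sketch below weights each edge by the number of shared referers; the weighting, the `project_hosts` helper, and the example adjacency are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def project_hosts(hosts_of):
    """hosts_of: dict referer -> set of hosts embedded by that referer.

    Returns dict (host_a, host_b) -> number of shared referers,
    with each pair stored once in sorted order.
    """
    weights = defaultdict(int)
    for hosts in hosts_of.values():
        for a, b in combinations(sorted(hosts), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Hypothetical bipartite adjacency, not the one from the figure.
hosts_of = {
    "r1": {"h1", "h2"},
    "r2": {"h2", "h3"},
    "r3": {"h1", "h2"},
}
projection = project_hosts(hosts_of)
```

Here h1 and h3 never appear under the same referer, so they stay unconnected, while h1 and h2 are linked through two referers.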
Do Trackers form communities in the Projection Graph?
Basic Analysis of the Projection Graph
● Do they form connected components?
● Do Trackers have unusually high degrees?
● Are they mainly connected to other Trackers?
DataSet & Gelly APIs
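The three questions above map to connected components, vertex degrees, and the share of tracker neighbours per tracker. The sketch below computes all three in plain Python instead of the DataSet and Gelly APIs named on the slide; the helper names, edges, and labels are made up for illustration.

```python
from collections import defaultdict, deque

def build_adjacency(edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)

def connected_components(adj):
    """Breadth-first search over every unvisited node."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components

def tracker_neighbour_share(adj, labels):
    """For each tracker (label "T"), the fraction of its neighbours that are trackers."""
    shares = {}
    for node, neighbours in adj.items():
        if labels.get(node) == "T" and neighbours:
            hits = sum(labels.get(n) == "T" for n in neighbours)
            shares[node] = hits / len(neighbours)
    return shares

# Hypothetical projection-graph edges and labels.
edges = [("h1", "h2"), ("h2", "h3"), ("h4", "h5")]
labels = {"h1": "T", "h2": "T", "h3": "NT", "h4": "NT", "h5": "NT"}
adj = build_adjacency(edges)
components = connected_components(adj)
degrees = {node: len(neighbours) for node, neighbours in adj.items()}
shares = tracker_neighbour_share(adj, labels)
```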
Visualization
Final Data Pipeline
raw logs cleaned logs
1: logs pre-processing
2: bipartite graph creation
3: largest connected component extraction
4: hosts-projection graph creation
5: community detection
google-analytics.com: T
bscored-research.com: T
facebook.com: NT
github.com: NT
cdn.cxense.com: NT
...
6: results
DataSet API
Gelly
DataSet API
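The slides do not name the community detection algorithm used in step 5. As a stand-in, the sketch below labels unlabeled hosts of the projection graph by a simple label propagation: each unlabeled host adopts the majority label of its already labeled neighbours, and ties are left undecided. The `propagate_labels` helper, the graph, and the seed labels are assumptions for illustration.

```python
from collections import Counter

def propagate_labels(adj, seed_labels, rounds=10):
    """adj: dict host -> set of neighbouring hosts; seed_labels: dict host -> "T"/"NT".

    Returns a dict with the seed labels plus the labels inferred for other hosts.
    """
    labels = dict(seed_labels)
    for _ in range(rounds):
        updates = {}
        for node, neighbours in adj.items():
            if node in labels:
                continue
            votes = Counter(labels[n] for n in neighbours if n in labels)
            if not votes:
                continue
            ranked = votes.most_common()
            # Only adopt a label on a strict majority; ties stay unlabeled.
            if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
                updates[node] = ranked[0][0]
        if not updates:
            break
        labels.update(updates)
    return labels

# Hypothetical projection graph: x and y are unlabeled.
adj = {
    "a": {"x"},
    "b": {"x"},
    "c": {"x", "y"},
    "x": {"a", "b", "c"},
    "y": {"c"},
}
seed_labels = {"a": "T", "b": "T", "c": "NT"}
result = propagate_labels(adj, seed_labels)
```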
Very high accuracy and very low FPR :-)
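For reference, accuracy and false positive rate (FPR) follow directly from the confusion counts: accuracy = (TP + TN) / total and FPR = FP / (FP + TN), with "T" treated as the positive class. The short Python sketch below uses made-up labels and does not reproduce the reported results.

```python
def accuracy_and_fpr(y_true, y_pred, positive="T"):
    """Returns (accuracy, false positive rate) for two equally long label lists."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr

# Made-up ground truth and predictions.
y_true = ["T", "T", "NT", "NT", "NT"]
y_pred = ["T", "NT", "NT", "NT", "T"]
accuracy, fpr = accuracy_and_fpr(y_true, y_pred)
```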
Lessons Learned
Start simple
Choose features incrementally
Visualize your data
Re-evaluate your models
Try different data representations
Use a flexible system
Optimizing the Pipeline
Flink Optimizer to the rescue :-)