Tracking The Trackers WWW 2016
Tracking the Trackers
Zhonghao Yu, Sam Macbeth, Konark Modi, Josep M. Pujol
Twitter pages accessible only to the authenticated user also load 3rd parties such as Google Analytics (GA).
This browsing session on 5 different sites involved more than 60 different 3rd parties.
GET /css?family=Open+Sans+Condensed:300,700
Host: fonts.googleapis.com
User-Agent: Mozilla/5.0 ... Firefox/45.0
Referer: http://www.meetic.com/home/index.php
IP: 79.227.235.241
fonts.googleapis.com is a potential tracker:
<meetic.com/home/index.php, UID>
<www20016.ca/, UID>
<wired.com/, UID>
However, in THIS request there is no data element that can be used as a UID. Since there is no unsafe data element, the request is safe.
GET /impression.php/f3ae074XXX/api_key=597038480XXX&lid=115…
Host: www.facebook.com
User-Agent: Mozilla/5.0 … Firefox/45.0
Referer: http://www.meetic.com/home/index.php
Cookie: datr=0IPhVj5YHEJ20XXX; c_user=10973XXXX; … csm=2;
IP: 79.227.235.241
facebook.com is a potential tracker too:
<meetic.com/home/index.php, 10973XXXX>
<www20016.ca/, 10973XXXX>
<wired.com/, 10973XXXX>
<ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200, 10973XXXX>
Unlike the fonts.googleapis.com request, the request above is not safe with regard to privacy: it contains two values that we consider unsafe because they could be used as UIDs, c_user=10973XXXX and datr=0IPhVj5YHEJ20XXX. Because it contains at least one unsafe value, the request is considered unsafe.
GET /collect?v=1&_v=j41&a=321948996&t=event&ni=0&_s=1&...&vp=1291x524&..._u=QCCAAAABI~&jid=&cid=6531474...
Host: www.google-analytics.com
Referer: http://www.meetic.com/home/index.php
IP: 79.227.235.241
google-analytics.com is a potential tracker too:
<meetic.com/home/index.php, 1291x522:79.227.235.241>
<www20016.ca/, 1291x522:79.227.235.241>
<wired.com/, 1291x522:79.227.235.241>
<ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200, 1291x522:79.227.235.241>
<analytics.twitter.com/user/solso/home, 1291x522:79.227.235.241>
The UID is not as evident as for Facebook, but the combination vp+IP is an unsafe data element because it can be used as a UID. Therefore this request is also unsafe.
vp+IP = 1291x522:79.227.235.241
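The vp+IP combination above can be illustrated with a minimal Python sketch. The function name `vp_ip_fingerprint` and the shortened request URL are hypothetical, introduced only for illustration; the sketch simply shows how two individually harmless fields can be joined into one stable identifier on the receiving side.

```python
from typing import Optional
from urllib.parse import urlsplit, parse_qs


def vp_ip_fingerprint(url: str, client_ip: str) -> Optional[str]:
    """Join the `vp` (viewport) query parameter with the client IP.

    Returns None when the request carries no `vp` parameter.
    """
    params = parse_qs(urlsplit(url).query)
    vp = params.get("vp")
    if not vp:
        return None
    return f"{vp[0]}:{client_ip}"


# Hypothetical, shortened request URL; the IP is the one shown on the slide.
fp = vp_ip_fingerprint(
    "http://www.google-analytics.com/collect?v=1&t=event&vp=1291x522",
    "79.227.235.241",
)
print(fp)
```

Any receiver that sees the same viewport size from the same IP across different referring pages could link those page visits, which is why the combination is treated as unsafe.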
Tracking in the Wild
Largest field study with real traffic to date: 200,000 users in Germany for a week (*).
21M page loads, 5M unique pages (URLs) from 350K domains.
(*) Between 09/09/2015 and 16/09/2015
Tracking in the Wild: Prevalence
Potential trackers are 3rd parties that are present in many different domains. Unsafe data elements are data elements for which we cannot rule out the possibility that they are UIDs.
[Chart, 21M page loads: 5% without potential trackers, 95% with potential trackers. 1 to 9: 24%; >=10: 76%.]
[Chart, 21M page loads: 22% without unsafe values, 78% with unsafe values. 1 to 9: 21%; >=10: 79%.]
78% of all page loads can be tracked
Tracking in the Wild: Reach
(Seen = % of page loads seen; Unsafe = % of page loads seen with unsafe data elements, i.e. tracking)

Organization   Seen     Unsafe   Rank
Google         62.4%    42.4%    1st
Facebook       21.1%    18.5%    2nd
AppNexus       10.15%   9.9%     3rd
ADITION        8.7%     8.4%     4th
Criteo         8.7%     8.2%     5th
…
Comscore       6.1%     5.9%     --
DoublePimp     0.5%     0.5%     --
NewRelic       2%       0.03%    --
…
58 organizations with a reach larger than 1%
CLIQZ Tracking Protection
Maximize coverage, minimize false positives
Aggressiveness is counter-productive…
• increases site breakage, which forces users to add exceptions, thus reducing protection coverage.
• affects legitimate services and data collection
Block only the Ability to Track
GET /collect?v=1&_v=j41&a=321948996&t=event&ni=0&_s=1&...&vp=1291x524&..._u=QCCAAAABI~&jid=&cid=6531474...
Host: www.google-analytics.com
Referer: http://www.meetic.com/home/index.php
IP: 79.227.235.241
Intervention only on unsafe data elements, i.e. those elements that can be used as UIDs, should protect the user while minimizing side effects:
a) site breakage for users
b) disruption of legitimate data collection for 3rd parties
Blocklists are coarse-grained
CDF of the number of requests with observed unsafe data elements by 3rd party domains contained both in Disconnect Blocklist and CLIQZ list of potential trackers (~2000 domains each). Intersection is 477 domains.
Only 2% of tracker domains in Disconnect always send unsafe data elements.
98% of tracker domains show MIXED behavior, so a domain-level blocklist lacks resolution…
Blocklists by domain (reverse suffix) are too coarse-grained.
BLOCKLIST by Domain
Blocklists are too coarse-grained
EasyPrivacy (from Adblock Plus) has hundreds of regular expressions to cover for the mixed behavior of trackers.
[Figure labels: BLOCKLIST by Domain + RegExp Exceptions; BLOCKLIST by Domain + More RegExp Exceptions]
We propose a more fine-grained approach to algorithmically determine the safeness level of individual data elements within a request to a 3rd party
Determining Safeness
Each 3rd party request to a potential tracker is parsed to obtain a list of tuples T = [<s, d, k, v>] whose safeness level is evaluated in real time:
T = [
<s=wired.com/, d=3rdparty.com, k=z, v=1501498154>,
<s=wired.com/, d=3rdparty.com, k=fl, v=21.0>,
<s=wired.com/, d=3rdparty.com, k=u, v=CCAAAABI>,
<s=wired.com/, d=3rdparty.com, k=vr, v=1440x900>,
<s=wired.com/, d=3rdparty.com, k=ua, v=3FeFF2301E>,
<s=wired.com/, d=3rdparty.com, k=vp, v=1322x781>,
<s=wired.com/, d=3rdparty.com, k=c7, v=e9d4a7e4d2185cec>,
]
The aim is to identify which data elements (including combinations) are unsafe and therefore candidates to be used as UIDs.
We cannot do this effectively. But we can do the opposite: identify data elements that cannot be used effectively as UIDs, and consider them safe.
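The parsing of a request into <s, d, k, v> tuples can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the CLIQZ implementation: the names `DataElement` and `parse_request` and the request path `/pixel` are hypothetical, only the query string is parsed (cookies and other headers are ignored), and `d` is taken to be the full host name rather than a registered domain.

```python
from typing import List, NamedTuple
from urllib.parse import urlsplit, parse_qsl


class DataElement(NamedTuple):
    s: str  # source page (1st party), without the scheme
    d: str  # 3rd party host the request goes to
    k: str  # key of the data element
    v: str  # value of the data element


def parse_request(source_url: str, request_url: str) -> List[DataElement]:
    """Split the query string of a 3rd party request into <s, d, k, v> tuples."""
    src = urlsplit(source_url)
    req = urlsplit(request_url)
    s = src.netloc + src.path
    pairs = parse_qsl(req.query, keep_blank_values=True)
    return [DataElement(s, req.netloc, k, v) for k, v in pairs]


# Hypothetical request built from three of the slide's example values.
tuples = parse_request(
    "http://wired.com/",
    "http://3rdparty.com/pixel?z=1501498154&fl=21.0&vr=1440x900",
)
for t in tuples:
    print(t)
```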
All tuples are UNSAFE by default unless we can determine that the given data-element is not a good UID, hence safe.
The value 1501498154 has never been seen before for <d, k>. Thus, it cannot be used as a UID => SAFE
The value 21.0 is too short to encode any UID => SAFE
More than 3 different values in less than 2 days by the same tuple <d,k>. Not persistent, bad UID => SAFE
Always the same value for <d,k>. We cannot rule out that the data elements are UIDs => keep as UNSAFE
Using only local information is not enough: vr=1440x900 is not a UID… We need something extra.
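The local rules above (never seen before, too short, too many different values in a short period) can be sketched as a small Python class. This is an illustrative sketch only: the class name `LocalSafeness` and the order in which the rules are checked are assumptions, and the length cutoff `MIN_UID_LENGTH = 8` is a made-up value, since the slides do not state one. The "more than 3 values in less than 2 days" thresholds are taken from the slides.

```python
from collections import defaultdict
from typing import Dict, Tuple

MIN_UID_LENGTH = 8               # assumption: the slides give no length cutoff
MAX_DISTINCT_VALUES = 3          # from the slides: more than 3 different values...
WINDOW_SECONDS = 2 * 24 * 3600   # ...in less than 2 days


class LocalSafeness:
    """Tracks values seen per <d, k> and applies the local safeness rules."""

    def __init__(self) -> None:
        # (d, k) -> {value: timestamp of the most recent observation}
        self._seen: Dict[Tuple[str, str], Dict[str, float]] = defaultdict(dict)

    def check(self, d: str, k: str, v: str, now: float) -> str:
        values = self._seen[(d, k)]
        first_time = v not in values
        values[v] = now
        if len(v) < MIN_UID_LENGTH:
            return "SAFE"      # too short to encode a UID
        if first_time:
            return "SAFE"      # never seen before for <d, k>
        recent = sum(1 for t in values.values() if now - t < WINDOW_SECONDS)
        if recent > MAX_DISTINCT_VALUES:
            return "SAFE"      # not persistent, bad UID
        return "UNSAFE"        # same value keeps coming back


local = LocalSafeness()
print(local.check("3rdparty.com", "c7", "e9d4a7e4d2185cec", now=0))    # first sighting
print(local.check("3rdparty.com", "c7", "e9d4a7e4d2185cec", now=100))  # repeated value
```

A value such as a common screen resolution would also come out as UNSAFE here once it repeats, which is exactly the limitation of local information noted above.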
Locally UNSAFE, i.e. always the same value for <d, k>.
Globally SAFE, since more than 20 other users have observed the same value 1440x900 for the tuple <d,k> = <3rdparty.com,vr> in the last 2 days.
Locally UNSAFE, i.e. always the same value for tuple <d,k>.
Globally SAFE, since it has reached the safeness quorum based on k-anonymity.
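The quorum idea can be sketched as a simple counting step. This is a hedged illustration, not the actual backend: the function name `build_gwl` is hypothetical, each report is assumed to come from a distinct user, and the 2-day window is left out for brevity. The threshold of more than 20 observers is the one quoted on the slides.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

QUORUM = 20  # from the slides: more than 20 other users saw the same value

Triple = Tuple[str, str, str]  # <d, k, v>


def build_gwl(reports: Iterable[Triple]) -> Set[Triple]:
    """Return the <d, k, v> tuples observed by more than QUORUM users.

    Assumes one report per user per tuple; time-window handling is omitted.
    """
    counts = Counter(reports)
    return {triple for triple, n in counts.items() if n > QUORUM}


reports = [("3rdparty.com", "vr", "1440x900")] * 21
reports += [("3rdparty.com", "c7", "e9d4a7e4d2185cec")]
gwl = build_gwl(reports)
print(("3rdparty.com", "vr", "1440x900") in gwl)
print(("3rdparty.com", "c7", "e9d4a7e4d2185cec") in gwl)
```

A value shared by many users, such as a common screen resolution, reaches the quorum and becomes globally safe; a value seen by a single user does not.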
Locally UNSAFE, i.e. always the same value for <d, k>.
Globally UNSAFE: not enough people have seen the value for <d, k>, and it is always the same for <d, k, u>. Not safe to send. Two options:
a) it is a UID, or an element that could be used as such
b) it is a false positive due to the transient state (0.07%)
Locally UNSAFE and Globally UNSAFE
At this point the request analysis is complete:
1) ALLOW Request removing unsafe data-elements
2) ALLOW Request obfuscating unsafe data-elements
3) BLOCK Request or ALLOW Request without alteration
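Options 1 and 2 above can be sketched as a rewrite of the request URL. This is a minimal sketch under assumptions: the function name `rewrite_request`, the `mode` argument and the placeholder character are hypothetical, and the slides do not say how obfuscation is actually performed; here each unsafe value is simply replaced by a same-length run of a fixed character.

```python
from typing import Set
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def rewrite_request(url: str, unsafe_keys: Set[str],
                    mode: str = "remove", placeholder: str = "x") -> str:
    """Remove or obfuscate the query-string elements whose keys are unsafe."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if mode == "remove":
        kept = [(k, v) for k, v in pairs if k not in unsafe_keys]
    elif mode == "obfuscate":
        kept = [(k, placeholder * len(v) if k in unsafe_keys else v)
                for k, v in pairs]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical request carrying one safe and one unsafe element.
url = "http://3rdparty.com/collect?fl=21.0&c7=e9d4a7e4d2185cec"
print(rewrite_request(url, {"c7"}, mode="remove"))
print(rewrite_request(url, {"c7"}, mode="obfuscate"))
```

In both modes the request still reaches the 3rd party with its safe elements intact, which is the point of intervening on data elements rather than on whole requests.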
Safeness Quorum without Tracking
To determine that a data element is globally safe, we need to count the number of unique users that have observed a tuple <d,k,v>, e.g.
<d=3rdparty.com, k=c7, v=e9d4a7e4d2185cec>
Users could share tuples with CLIQZ together with a field that identifies them (u):
<u=usrXXX, d=3rdparty.com, k=c7, v=e9d4a7e4d2185cec>
But that would make CLIQZ a tracker! Instead, each user sends the tuple, if observed, once and only once per hour:
<d=3rdparty.com, k=c7, v=e9d4a7e4d2185cec>
Actual values are not needed; counting and membership test on GWL:
<d=ed5c0cf7b05572eb, k=4d3a21d8c684c09c19b93be911827fd5, v=e60f936dc719ca649a80a97490a09940>
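A client-side sketch of this reporting scheme might look as follows. Everything specific here is an assumption made for illustration: the slides do not name the hash function (SHA-256 truncated to 32 hex characters is used as a stand-in), the class name `HourlyReporter` is hypothetical, and "once per hour" is interpreted as at most once per clock hour.

```python
import hashlib
from typing import Dict, Optional, Tuple

HashedTriple = Tuple[str, str, str]


def h(text: str) -> str:
    """Stand-in hash; the actual function used is not stated on the slides."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]


class HourlyReporter:
    """Emits a hashed <d, k, v> tuple at most once per clock hour, with no user field."""

    def __init__(self) -> None:
        self._last_hour: Dict[HashedTriple, int] = {}

    def report(self, d: str, k: str, v: str, now: float) -> Optional[HashedTriple]:
        hashed = (h(d), h(k), h(v))
        hour = int(now // 3600)
        if self._last_hour.get(hashed) == hour:
            return None  # already sent during this hour
        self._last_hour[hashed] = hour
        return hashed


reporter = HourlyReporter()
print(reporter.report("3rdparty.com", "c7", "e9d4a7e4d2185cec", now=0))
print(reporter.report("3rdparty.com", "c7", "e9d4a7e4d2185cec", now=1800))
```

Because the message carries neither a user identifier nor plaintext values, the receiving side can only count observations and test membership.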
Evaluation: Protection Coverage
(False positives = requests blocked without unsafe data-elements; Protection misses = requests allowed with unsafe data-elements)

System                                                 Requests blocked   False positives ratio   Protection misses
CLIQZ                                                  51.7%              --                      --
Disconnect                                             66.1%              38.8%                   12.3%
Kontaxis & Chew (Firefox Tracking Protection) [est.]   36.6%              29.4%                   25.4%
Evaluation: Site Breakage

System                                              Reload rate   % Increase over baseline   % Increase over CLIQZ
BASELINE (without tracking protection)              0.00101       --                         --
CLIQZ                                               0.00104       4%                         --
Adblock Plus (counting exceptions added by users)   0.00110       10%                        150%
CLIQZ as Blocklist                                  0.00125       25%                        525%
Conclusions
Tracking is a BIG problem
– Privacy is seriously at risk
Tracking Protection is not an easy task
– Trade-off between site breakage and protection coverage
Blocklist-based approaches have limitations
– Maintainability
– Coarse-grained resolution
– Too many false positives
CLIQZ tracking protection addresses them to a large extent
Future Work
CLIQZ tracking protection might be better than the state of the art, but it is far from perfect:
• it still produces site breakage
• protection coverage is not 100%
• it can be attacked in multiple ways
[Picture from http://mtthwhgn.com/tag/flooding/]
we provide a bigger hammer for the whack-a-tracker
Implementation Details
Realtime Component
1) Parsing request
2) Local safeness: membership test on LWL
3) Global safeness: membership test on GWL
LWL and GWL are Bloom filters with a combined size of less than 512KB and an FP ratio of 0.1%. Takes about 1-12 ms.
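For readers unfamiliar with Bloom filters, the membership test can be sketched in a few lines of Python. This is a generic textbook construction, not the filter used by CLIQZ: the class name, the SHA-256 based double hashing and the `capacity` parameter are assumptions; only the 0.1% target false-positive ratio comes from the slides.

```python
import hashlib
import math
from typing import List


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, capacity: int, fp_rate: float = 0.001) -> None:
        # Standard sizing formulas for the bit count m and hash count k.
        self.size = max(1, math.ceil(-capacity * math.log(fp_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str) -> List[int]:
        # Double hashing: derive all bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


gwl = BloomFilter(capacity=1000)
gwl.add("3rdparty.com|vr|1440x900")  # hypothetical key encoding of <d, k, v>
print("3rdparty.com|vr|1440x900" in gwl)
```

The filter stores only bit positions, never the values themselves, which keeps it compact and fits the requirement that actual values are not needed for the membership test.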
Offline Component
Data from users needs to be sent to CLIQZ to build the GWL for the safeness quorum. The GWL needs to be sent back to the users' browsers. We use an eventual consistency model with incremental updates over daily snapshots.
Bandwidth costs per user per day: 90KB upload, 566KB download.
Worst-case propagation lag: 10 minutes.
False-positive unsafe data elements due to the transient state: 0.07%.
Cookies from potential trackers are always blocked.
POST requests are also analyzed, and blocked only if they:
• match Cookie values
• match QS values declared unsafe
• match values from browser-fingerprinting
User-initiated actions are always ALLOWED (even if tracking)