Tracking The Trackers WWW 2016

Tracking the Trackers

Zhonghao Yu, Sam Macbeth, Konark Modi, Josep M. Pujol

A page load triggers requests to multiple 3rd parties

Even on pages of sites that you probably want to keep private, like this dating site.

Of course, general news domains also load many 3rd parties

as well as electronic commerce sites like eBay

Twitter pages only accessible to the authenticated user also load 3rd parties like Google Analytics (GA)

This browsing session on 5 different sites involved more than 60 different 3rd parties.

GET /css?family=Open+Sans+Condensed:300,700
Host: fonts.googleapis.com
User-Agent: Mozilla/5.0 ... Firefox/45.0
Referer: http://www.meetic.com/home/index.php
IP: 79.227.235.241

fonts.googleapis.com is a potential tracker:
<meetic.com/home/index.php, UID>
<www20016.ca/, UID>
<wired.com/, UID>
However, in THIS request, there is no data element that can be used as a UID. Since there is no unsafe data element, the request is safe.

GET /impression.php/f3ae074XXX/api_key=597038480XXX&lid=115…
Host: www.facebook.com
User-Agent: Mozilla/5.0 … Firefox/45.0
Referer: http://www.meetic.com/home/index.php
Cookie: datr=0IPhVj5YHEJ20XXX; c_user=10973XXXX; … csm=2;
IP: 79.227.235.241

facebook.com is a potential tracker too:
<meetic.com/home/index.php, 10973XXXX>
<www20016.ca/, 10973XXXX>
<wired.com/, 10973XXXX>
<ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200, 10973XXXX>
Unlike fonts.googleapis.com, the request above is not safe with regard to privacy because it contains two values that we consider unsafe and that could thus be used as UIDs: c_user=10973XXXX and datr=0IPhVj5YHEJ20XXX. Because it contains at least one unsafe value, the request is considered unsafe.

GET /collect?v=1&_v=j41&a=321948996&t=event&ni=0&_s=1&...&vp=1291x524&..._u=QCCAAAABI~&jid=&cid=6531474...
Host: www.google-analytics.com
Referer: http://www.meetic.com/home/index.php
IP: 79.227.235.241

google-analytics.com is a potential tracker too:
<meetic.com/home/index.php, 1291x522:79.227.235.241>
<www20016.ca/, 1291x522:79.227.235.241>
<wired.com/, 1291x522:79.227.235.241>
<ebay-kleinanzeigen.de/s-muenchen/cyclocross/k0l6411r200, 1291x522:79.227.235.241>
<analytics.twitter.com/user/solso/home, 1291x522:79.227.235.241>
The UID is not as evident as for Facebook, but the combination vp+IP is an unsafe data element that can be used as a UID. Therefore this request is also unsafe.
vp+IP = 1291x522:79.227.235.241

Not a conveniently chosen example… tracking is a pervasive problem.

Tracking in the Wild

Largest field study with real traffic to date: 200,000 users in Germany for a week(*)
21M page loads, 5M unique pages (URLs) from 350K domains
(*) Between 09/09/2015 and 16/09/2015

Tracking in the Wild: Prevalence

Potential trackers are 3rd parties that are present in many different domains. Unsafe data elements are data elements for which we cannot rule out the possibility that they are UIDs.

[Chart: 21M page loads. Without potential trackers: 5%. With potential trackers: 95%. Categories 1 to 9 and >=10: 24% and 76%.]

Tracking in the Wild: Prevalence

[Chart: 21M page loads. Without unsafe values: 22%. With unsafe values: 78%. Categories 1 to 9 and >=10: 21% and 79%.]


78% of all page loads can be tracked

Tracking in the Wild: Reach

Columns: (A) % of page loads seen; (B) % of page loads seen with unsafe data elements (tracking).

Organization   (A)       (B)      Rank
Google         62.4%     42.4%    1st
Facebook       21.1%     18.5%    2nd
AppNexus       10.15%    9.9%     3rd
ADITION        8.7%      8.4%     4th
Criteo         8.7%      8.2%     5th
…
Comscore       6.1%      5.9%     --
DoublePimp     0.5%      0.5%     --
NewRelic       2%        0.03%    --
…


58 organizations with a reach larger than 1%

CLIQZ Tracking Protection

Maximize coverage, minimize false positives


Aggressiveness is counter-productive…

•  increases site breakage, which forces users to add exceptions, thus reducing protection coverage.

•  affects legitimate services and data collection

Block only the Ability to Track

GET /collect?v=1&_v=j41&a=321948996&t=event&ni=0&_s=1&...&vp=1291x524&..._u=QCCAAAABI~&jid=&cid=6531474...
Host: www.google-analytics.com
Referer: http://www.meetic.com/home/index.php
IP: 79.227.235.241

Intervening only on unsafe data elements (those elements that can be used as UIDs) should protect the user while minimizing side effects:
a)  site breakage for users
b)  disruption of legitimate data collection by 3rd parties

Blocklists are coarse-grained

CDF of the number of requests with observed unsafe data elements by 3rd party domains contained both in Disconnect Blocklist and CLIQZ list of potential trackers (~2000 domains each). Intersection is 477 domains.



Only 2% of tracker domains in Disconnect always send unsafe data elements.



98% of tracker domains have MIXED behavior. Lack of resolution…

Blocklists by domain (reverse suffix) are too coarse-grained.

BLOCKLIST by Domain

Blocklists are too coarse-grained

EasyPrivacy (from Adblock Plus) has hundreds of regular expressions to cover for the mixed behavior of trackers.

BLOCKLIST by Domain + RegExp Exceptions

BLOCKLIST by Domain + More RegExp Exceptions

We propose a more fine-grained approach to algorithmically determine the safeness level of individual data elements within a request to a 3rd party

Determining Safeness

Each 3rd party request to a potential tracker is parsed to obtain a list of tuples T = [<s, d, k, v>] whose safeness level is evaluated in real-time,

T = [
<s=wired.com/, d=3rdparty.com, k=z, v=1501498154>,
<s=wired.com/, d=3rdparty.com, k=fl,v=21.0>,
<s=wired.com/, d=3rdparty.com, k=u, v=CCAAAABI>,
<s=wired.com/, d=3rdparty.com, k=vr,v=1440x900>,
<s=wired.com/, d=3rdparty.com, k=ua,v=3FeFF2301E>,
<s=wired.com/, d=3rdparty.com, k=vp,v=1322x781>,
<s=wired.com/, d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>,
]

The aim is to identify which data elements (including combinations) are unsafe and are therefore candidates to be used as UIDs.



We cannot do this effectively. But we can do the opposite, identify data elements that cannot be used effectively as UIDs, and consider them safe.
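The parsing step described above can be sketched roughly as follows. This is only an illustration: the function name `request_to_tuples` and the `/collect` path are made up for the example, and it covers query-string parameters only, while the slides do not spell out the actual parser.

```python
from urllib.parse import urlsplit, parse_qsl

def request_to_tuples(source_page, request_url):
    """Return a list of (s, d, k, v) tuples, one per query-string parameter.

    s = page that triggered the request, d = 3rd party host,
    k = parameter name, v = parameter value.
    """
    parts = urlsplit(request_url)
    d = parts.hostname or ""
    return [
        (source_page, d, k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
    ]

# Hypothetical request carrying three of the data elements from the slide.
tuples = request_to_tuples(
    "wired.com/",
    "http://3rdparty.com/collect?z=1501498154&fl=21.0&vr=1440x900",
)
for t in tuples:
    print(t)
```

Each tuple can then be evaluated on its own, which is what makes the approach finer-grained than a per-domain decision.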


T = [
<s=w..., d=3rdparty.com, k=z, v=1501498154>,
<s=w..., d=3rdparty.com, k=fl,v=21.0>,
<s=w..., d=3rdparty.com, k=u, v=CCAAAABI>,
<s=w..., d=3rdparty.com, k=vr,v=1440x900>,
<s=w..., d=3rdparty.com, k=ua,v=3FeFF2301E>,
<s=w..., d=3rdparty.com, k=vp,v=1322x781>,
<s=w..., d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>,
]

All tuples are UNSAFE by default unless we can determine that the given data-element is not a good UID, hence safe.


The value 1501498154 has never been seen before for <d, k>. Thus, it cannot be used as a UID => SAFE

The value 21.0 is too short to encode any UID => SAFE

More than 3 different values in less than 2 days for the same tuple <d,k>. Not persistent, bad UID => SAFE

Always the same value for <d,k>. We cannot rule out that the data element is a UID => keep as UNSAFE
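A minimal sketch of these local rules, assuming an in-memory history per <d, k>. The class name `LocalSafeness` and the minimum length of 8 characters are assumptions made for the example (the slides only say "too short"); the "more than 3 values" and "2 days" thresholds come from the slides.

```python
import time
from collections import defaultdict

MIN_UID_LENGTH = 8          # assumption: the slides give no exact length
MAX_DISTINCT_VALUES = 3     # slides: "more than 3 different values"
WINDOW_SECONDS = 2 * 86400  # slides: "in less than 2 days"

class LocalSafeness:
    def __init__(self):
        # (d, k) -> {value: timestamp of the last time it was seen}
        self.seen = defaultdict(dict)

    def check(self, d, k, v, now=None):
        """Return 'SAFE' or 'UNSAFE' for one data element, then record it."""
        now = time.time() if now is None else now
        history = self.seen[(d, k)]
        recent = {val for val, ts in history.items()
                  if now - ts < WINDOW_SECONDS}
        if len(v) < MIN_UID_LENGTH:
            verdict = "SAFE"      # too short to encode a UID
        elif v not in history:
            verdict = "SAFE"      # never seen before for <d, k>
        elif len(recent) > MAX_DISTINCT_VALUES:
            verdict = "SAFE"      # changes too often to be a persistent UID
        else:
            verdict = "UNSAFE"    # stable value: cannot rule out a UID
        history[v] = now
        return verdict

local = LocalSafeness()
print(local.check("3rdparty.com", "c7", "e9d4a7e4d2185cec", now=0))     # first sighting
print(local.check("3rdparty.com", "c7", "e9d4a7e4d2185cec", now=3600))  # repeated value
```

Everything defaults to UNSAFE: a value is only marked SAFE when one of the three rules positively excludes it as a useful UID.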

Only using local information is not enough; vr=1440x1024 is not a UID… We need something extra.


Locally UNSAFE, i.e. always the same value for <d, k>. Globally SAFE since more than 20 other users have observed the same value 1440x900 for tuple <d,k> = <3rdparty.com,vr> in the last 2 days.


Locally UNSAFE, i.e. always the same value for tuple <d,k>. Globally SAFE since it has reached the safeness quorum based on k-Anonymity.
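The quorum test can be sketched as a simple counter over recent reports. This is a simplified illustration: the class name `QuorumCounter` is hypothetical, and it assumes that every report in the window comes from a different user, which the real system would have to ensure by other means. The thresholds (more than 20 users, 2 days) are the ones quoted on the slides.

```python
from collections import defaultdict

QUORUM = 20                 # slides: "more than 20 other users"
WINDOW_SECONDS = 2 * 86400  # slides: "in the last 2 days"

class QuorumCounter:
    def __init__(self):
        # (d, k, v) -> list of report timestamps
        self.reports = defaultdict(list)

    def report(self, d, k, v, ts):
        """Record one anonymous observation of <d, k, v>."""
        self.reports[(d, k, v)].append(ts)

    def is_globally_safe(self, d, k, v, now):
        """True when more than QUORUM reports fall inside the time window."""
        recent = [t for t in self.reports.get((d, k, v), [])
                  if now - t < WINDOW_SECONDS]
        return len(recent) > QUORUM

counter = QuorumCounter()
for user in range(21):  # 21 distinct (simulated) users report the same value
    counter.report("3rdparty.com", "vr", "1440x900", ts=user)
print(counter.is_globally_safe("3rdparty.com", "vr", "1440x900", now=100))
```

A screen resolution shared by many users passes the quorum, while a per-user identifier never gathers enough independent observations.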


Locally UNSAFE, i.e. always the same value for <d, k>.

Globally UNSAFE: not enough people have seen the value for <d, k>; it is always the same <d, k, u>. Not safe to send. Two options:
a)  it is a UID, or an element that could be used as such.
b)  a false positive due to the Transient State (0.07%)


Locally UNSAFE and Globally UNSAFE

At this point the request analysis is complete:
1)  ALLOW Request removing unsafe data-elements
2)  ALLOW Request obfuscating unsafe data-elements
3)  BLOCK Request or ALLOW Request without alteration
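Options 1 and 2 above can be illustrated with a small URL rewriter. This is a sketch under assumptions: the function name `strip_unsafe`, the `placeholder` parameter and the `/collect` path are invented for the example, and only query-string parameters are handled.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_unsafe(url, unsafe_keys, placeholder=None):
    """Rewrite a request URL so that unsafe data elements are not sent.

    With placeholder=None the unsafe parameters are removed (option 1);
    otherwise their values are replaced by the placeholder (option 2).
    """
    parts = urlsplit(url)
    kept = []
    for k, v in parse_qsl(parts.query, keep_blank_values=True):
        if k in unsafe_keys:
            if placeholder is None:
                continue          # drop the parameter entirely
            v = placeholder       # keep the key, hide the value
        kept.append((k, v))
    return urlunsplit(parts._replace(query=urlencode(kept)))

url = "http://3rdparty.com/collect?fl=21.0&c7=e9d4a7e4d2185cec"
print(strip_unsafe(url, {"c7"}))                   # removal
print(strip_unsafe(url, {"c7"}, placeholder="x"))  # obfuscation
```

Keeping the request but altering only the unsafe elements is what lets the rest of the 3rd party service keep working.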

Safeness Quorum without Tracking

To determine that a data element is globally safe we need to count the number of unique users that have observed a tuple <d,k,v>, e.g.
<d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>

Users could share tuples with CLIQZ together with a field that identifies them (u):
<u=usrXXX, d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>
But that would make CLIQZ a tracker!

Instead, each user sends the tuple, if observed, once and only once per hour:
<d=3rdparty.com, k=c7,v=e9d4a7e4d2185cec>

Actual values are not needed; only counting and membership tests on the GWL are required:
<d=ed5c0cf7b05572eb, k=4d3a21d8c684c09c19b93be911827fd5, v=e60f936dc719ca649a80a97490a09940>
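The opaque tuple above looks like a set of hex digests, so one way to picture the reporting step is the sketch below. The hash function (MD5), the truncation of d to 16 characters, and the names `digest` and `HourlyReporter` are all assumptions chosen to match the shape of the example tuple; the slides do not name the function actually used.

```python
import hashlib

def digest(text, length=32):
    """Hex digest of a string, optionally truncated (MD5 is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()[:length]

class HourlyReporter:
    """Emits each observed tuple at most once per hour, with no user field."""

    def __init__(self):
        self.sent = set()  # (hour bucket, hashed tuple) already reported

    def message_for(self, d, k, v, now):
        hashed = (digest(d, 16), digest(k), digest(v))
        key = (int(now // 3600), hashed)
        if key in self.sent:
            return None                 # already reported during this hour
        self.sent.add(key)
        # Note: there is deliberately no "u" field in the message.
        return {"d": hashed[0], "k": hashed[1], "v": hashed[2]}

reporter = HourlyReporter()
print(reporter.message_for("3rdparty.com", "c7", "e9d4a7e4d2185cec", now=0))
print(reporter.message_for("3rdparty.com", "c7", "e9d4a7e4d2185cec", now=60))
```

Because only digests are sent, the collector can count and test membership without learning the raw values or who sent them.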

Evaluation: Protection Coverage

Columns: (A) Requests Blocked; (B) False positives ratio (requests blocked without unsafe data-elements); (C) Protection Misses (requests allowed with unsafe data-elements).

                                                        (A)      (B)      (C)
CLIQZ                                                   51.7%    --       --
Disconnect                                              66.1%    38.8%    12.3%
Kontaxis & Chew (Firefox Tracking Protection) [est.]    36.6%    29.4%    25.4%

Evaluation: Site Breakage

Columns: (A) Reload Rate; (B) % Increase over baseline; (C) % Increase over CLIQZ.

                                                     (A)        (B)    (C)
BASELINE (without tracking protection)               0.00101    --     --
CLIQZ                                                0.00104    4%     --
Adblock Plus (counting exceptions added by users)    0.00110    10%    150%
CLIQZ as Blocklist                                   0.00125    25%    525%
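The "% Increase over CLIQZ" figures appear to be the relative growth of the "% Increase over baseline" figures with respect to CLIQZ's 4%. This reading is an inference from the numbers, not something the slides state, and the helper name `relative_increase` is just for the example:

```python
def relative_increase(value, reference):
    """Percentage by which value exceeds reference."""
    return (value - reference) / reference * 100

cliqz = 4              # % increase over baseline for CLIQZ
adblock_plus = 10      # % increase over baseline for Adblock Plus
cliqz_blocklist = 25   # % increase over baseline for CLIQZ as Blocklist

print(relative_increase(adblock_plus, cliqz))     # 150.0
print(relative_increase(cliqz_blocklist, cliqz))  # 525.0
```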

Conclusions

Tracking is a BIG problem
–  Privacy is seriously at risk

Tracking Protection is not an easy task
–  Trade-off between site breakage and protection coverage

Blocklist-based approaches have limitations
–  Maintainability
–  Coarse-grained resolution
–  Too many false positives

CLIQZ tracking protection addresses them to a large extent

Future Work

CLIQZ tracking protection might be better than the state-of-the-art. But it is far from perfect:
•  still produces site breakages
•  protection coverage is not 100%
•  it can be attacked in multiple ways

[Picture from http://mtthwhgn.com/tag/flooding/]

we provide a bigger hammer for the whack-a-tracker

Thanks a lot! Q&A

Zhonghao Yu Sam Macbeth Konark Modi

Appendix

Implementation Details

Realtime Component
1) Parsing request
2) Local safeness: membership test on LWL
3) Global safeness: membership test on GWL
LWL and GWL are Bloom filters, combined less than 512KB, with an FP ratio of 0.1%. Takes about 1-12 ms.
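For readers unfamiliar with Bloom filters, here is a generic textbook sketch of the membership test. It is not the CLIQZ implementation: the class, the SHA-256 double-hashing scheme and the sizing by expected item count are illustrative choices; only the 0.1% false-positive target is taken from the slide.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, expected_items, fp_rate=0.001):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes
        self.m = max(8, math.ceil(-expected_items * math.log(fp_rate)
                                  / math.log(2) ** 2))
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        h = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

gwl = BloomFilter(expected_items=1000)
gwl.add("3rdparty.com|vr|1440x900")
print("3rdparty.com|vr|1440x900" in gwl)  # True: added items are always found
```

With this sizing each entry costs about 14.4 bits, so as a back-of-the-envelope estimate a 512KB budget would hold on the order of 290,000 entries at a 0.1% false-positive rate. A Bloom filter never produces false negatives, so a tuple that reached the quorum is always recognised as safe.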

Offline Component
Data from users needs to be sent to CLIQZ to build the GWL for the safeness quorum. The GWL needs to be sent back to the users' browsers. We use an eventual consistency model with incremental updates over daily snapshots.
Bandwidth costs per user per day: 90KB upload, 566KB download, for a worst-case propagation lag of 10 minutes.
False-positive unsafe data elements due to transient state: 0.07%


Cookies from potential trackers are always blocked.
POST requests are also analyzed, and blocked only if they:
•  match Cookie values
•  match QS values declared unsafe
•  match values from browser fingerprinting
User-initiated actions are always ALLOWED (even if tracking)

Protection Coverage

Unsafe Data Origins
