Measuring the impact of Google Analytics

Measuring the impact: Stephen Merity / smerity.com @smerity

description

Have you ever been curious as to how widely Google Analytics is used across the web? Stop pondering, start coding! In this presentation, Stephen discusses how he used the Common Crawl dataset to perform wide scale analysis over billions of web pages and what this means for privacy on the web at large.

Transcript of Measuring the impact of Google Analytics

Page 1: Measuring the impact of Google Analytics

Measuring the impact:

Stephen Merity / smerity.com @smerity

Page 2: Measuring the impact of Google Analytics

Smerity @ Common Crawl

Continuing the crawl

Documenting best practices

Guides for newcomers to Common Crawl + big data

Reference for seasoned veterans

Spending many hours blessing and/or cursing Hadoop

Before: University of Sydney '11, Harvard '14

Google Sydney, Freelancer.com, Grok Learning

Page 3: Measuring the impact of Google Analytics

[email protected]

I was hoping on creating a tool that will automatically extract some of the most common memes ("But does it run Linux?" and "In Soviet Russia..." style jokes etc) and I needed a corpus -

. I do intensely apologise.

I wrote a primitive (threaded :S) web crawler and started it before I considered robots.txt

-- Past Smerity (16/12/2007)

Page 4: Measuring the impact of Google Analytics

Where did all the HTTP referrers go?

Page 5: Measuring the impact of Google Analytics

Referrers: leaking browsing history

If you click from

http://www.reddit.com/r/sanfrancisco

to

http://www.sfbike.org/news/protected-bikeways-planned-for-the-embarcadero/

then SFBike knows you came from Reddit
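
To make the leak concrete, here is a minimal Python sketch of what the destination site can read from an incoming request. The function name `referring_host` is hypothetical; the `Referer` header (spelled with a single "r" in HTTP) is the standard mechanism. No network traffic is sent, the request object is only inspected.

```python
from typing import Optional
from urllib.parse import urlsplit
from urllib.request import Request


def referring_host(request: Request) -> Optional[str]:
    """Return the host named in the request's Referer header, if any."""
    referer = request.get_header("Referer")
    if referer is None:
        return None
    return urlsplit(referer).hostname


# The request a browser would build after clicking the link on Reddit.
click = Request(
    "http://www.sfbike.org/news/protected-bikeways-planned-for-the-embarcadero/",
    headers={"Referer": "http://www.reddit.com/r/sanfrancisco"},
)
print(referring_host(click))
```

Any third-party script embedded in the destination page can read the same value through the browser, which is why an analytics provider sees the referrer too.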

Page 6: Measuring the impact of Google Analytics

1) How many websites is Google Analytics (GA) on?

2) How much of a user's browsing history does GA capture?

Page 7: Measuring the impact of Google Analytics

Top 10k domains: 65.7%

Top 100k domains: 64.2%

Top million domains: 50.8%

It keeps dropping off, but by how much..?

Page 8: Measuring the impact of Google Analytics

Estimate of captured browsing history...

?

Page 9: Measuring the impact of Google Analytics

Referrers allow easy web tracking when done at Google's scale!

No information: !GA → !GA

Full information: !GA → GA

GA → !GA → GA

GA → !GA → GA → !GA → GA → !GA → GA → !GA → GA

Page 10: Measuring the impact of Google Analytics

Key insight: leaked browsing history

Google only needs one in every two links to have GA in order to have your full browsing path*

*possibly less if link graph + click timing + machine learning used
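
A small Python sketch of why alternating GA and non-GA pages is enough. It assumes a simplified model: GA learns about every page it runs on, plus the page immediately before it via the referrer. The function name and the `*.example` hosts are hypothetical.

```python
def pages_known_to_ga(path):
    """path is a list of (host, has_ga) tuples in click order.

    Returns the hosts GA learns about: every page running GA, plus the
    page visited immediately before it (revealed by the referrer).
    """
    known = [False] * len(path)
    for i, (_, has_ga) in enumerate(path):
        if has_ga:
            known[i] = True          # the GA script runs on this page
            if i > 0:
                known[i - 1] = True  # the referrer reveals the previous page
    return [host for (host, _), seen in zip(path, known) if seen]


alternating = [
    ("a.example", True),
    ("b.example", False),
    ("c.example", True),
    ("d.example", False),
    ("e.example", True),
]
print(pages_known_to_ga(alternating))
print(pages_known_to_ga([("x.example", False), ("y.example", False)]))
```

In this model the alternating path is recovered in full even though only three of the five pages run GA, while a path with no GA pages reveals nothing.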

Page 11: Measuring the impact of Google Analytics

Estimating leaked browser history

for each link = {page A} → {page B}:
    total_links += 1
    if {page A} or {page B} has GA:
        total_leaked += 1

Estimate of leaked browser history is simply: total_leaked / total_links
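
The pseudocode above translates almost directly into Python. This is a sketch over a toy in-memory link list; the names `estimate_leaked`, `links`, and `has_ga`, and the `*.example` hosts, are illustrative assumptions rather than anything from the original project.

```python
def estimate_leaked(links, has_ga):
    """Fraction of links where either endpoint runs GA.

    links:  iterable of (source, target) pairs
    has_ga: dict mapping a page or domain to True/False
    """
    total_links = 0
    total_leaked = 0
    for source, target in links:
        total_links += 1
        if has_ga.get(source, False) or has_ga.get(target, False):
            total_leaked += 1
    return total_leaked / total_links if total_links else 0.0


toy_links = [
    ("a.example", "b.example"),
    ("b.example", "c.example"),
    ("c.example", "d.example"),
    ("d.example", "c.example"),
]
toy_ga = {"a.example": True, "b.example": False, "c.example": False, "d.example": True}
print(estimate_leaked(toy_links, toy_ga))
```

Only the b → c link has no GA at either end, so three of the four toy links count as leaked.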

Page 12: Measuring the impact of Google Analytics

Joint project with Chad Hornbaker* at Harvard IACS

*Best full name ever: Captain Charles Lafforest Hornbaker II

Page 13: Measuring the impact of Google Analytics

The task

Google Analytics count: ".google-analytics.com/ga.js"

www.winradio.net.au    NoGA    1
www.winrar.com.cn      GA      6
www.winratzart.com     GA      1
www.winrenner.ch       GA      244

Generate link graph

domainA.com -> domainB.com <total times>

cnet-cnec-driver.softutopia.com -> www.softutopia.com 24

Merge link graph & GA count
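
The three steps can be sketched in plain Python on toy data. This is an illustration, not the project's actual Hadoop code: the function names are hypothetical, and treating a domain as "has GA" when at least one of its pages contains the marker string is an assumption.

```python
from collections import Counter

GA_MARKER = ".google-analytics.com/ga.js"


def count_ga(pages):
    """pages: iterable of (domain, html). Counts pages per (domain, GA/NoGA)."""
    counts = Counter()
    for domain, html in pages:
        label = "GA" if GA_MARKER in html else "NoGA"
        counts[(domain, label)] += 1
    return counts


def merge(link_counts, ga_counts):
    """link_counts: {(source, target): total times}. Returns (leaked, total)."""
    # Assumption: one GA page is enough to mark the whole domain as GA.
    ga_domains = {d for (d, label), n in ga_counts.items() if label == "GA" and n > 0}
    leaked = total = 0
    for (source, target), times in link_counts.items():
        total += times
        if source in ga_domains or target in ga_domains:
            leaked += times
    return leaked, total


toy_pages = [
    ("a.example", "<script src='http://www.google-analytics.com/ga.js'></script>"),
    ("a.example", "<p>no tracking here</p>"),
    ("b.example", "<p>plain page</p>"),
]
toy_link_counts = {("b.example", "a.example"): 24, ("b.example", "c.example"): 6}

ga_counts = count_ga(toy_pages)
print(ga_counts)
print(merge(toy_link_counts, ga_counts))
```

Weighting each edge by its total times means heavily linked domain pairs contribute proportionally more to the leaked fraction.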

Page 14: Measuring the impact of Google Analytics

Exciting age of open data

Open data +

Open tools +

Cloud computing

Page 15: Measuring the impact of Google Analytics

WARC: raw web data

WAT: metadata (links, title, ...) for each page

WET: extracted text

Page 16: Measuring the impact of Google Analytics

WARC = GA usage: raw web data

WAT = hyperlink graph: metadata (links, title, ...) for each page
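
As a rough illustration of pulling hyperlinks out of WAT metadata, the sketch below parses one JSON record. The key names (`Envelope`, `WARC-Header-Metadata`, `WARC-Target-URI`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`, and the `A@/href` path value) reflect my understanding of the WAT layout and should be checked against a real file; the sample record is hand-made and heavily trimmed.

```python
import json
from urllib.parse import urljoin


def anchor_links(wat_json):
    """Return (source_url, absolute_target_url) pairs for <a href> links."""
    envelope = json.loads(wat_json)["Envelope"]
    source_url = envelope["WARC-Header-Metadata"]["WARC-Target-URI"]
    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    pairs = []
    for link in links:
        # Assumed convention: anchor links carry a path starting with "A@".
        if link.get("path", "").startswith("A@") and "url" in link:
            pairs.append((source_url, urljoin(source_url, link["url"])))
    return pairs


# Hand-made sample record (an assumption, not real crawl data).
sample = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {
            "WARC-Target-URI": "http://cnet-cnec-driver.softutopia.com/index.html"
        },
        "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {"Links": [
            {"path": "A@/href", "url": "http://www.softutopia.com/"},
            {"path": "IMG@/src", "url": "/logo.png"},
            {"path": "A@/href", "url": "/about"},
        ]}}},
    }
})
print(anchor_links(sample))
```

Relative links are resolved against the page URL with `urljoin`, so every edge has an absolute target before any grouping by host.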

Page 17: Measuring the impact of Google Analytics

Estimating the task's size

Page level (http://en.wikipedia.org/): 3.5 billion nodes, 128 billion edges, 331 GB compressed

Subdomain level ( ): 101 million nodes, 2 billion edges, 9.2 GB compressed

Decided on using subdomains instead of page level

http:// /
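
Collapsing a page-level graph to subdomain level amounts to keeping only the host of each URL and summing the edges that become identical. A minimal sketch with the standard library; the function name and the example URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def to_subdomain_edges(page_edges):
    """Collapse (source_url, target_url) pairs into counted host-to-host edges."""
    edges = Counter()
    for source_url, target_url in page_edges:
        source = urlsplit(source_url).hostname
        target = urlsplit(target_url).hostname
        if source and target:
            edges[(source, target)] += 1
    return edges


page_edges = [
    ("http://en.wikipedia.org/wiki/Web_crawler", "http://commoncrawl.org/"),
    ("http://en.wikipedia.org/wiki/Hadoop", "http://commoncrawl.org/about"),
    ("http://en.wikipedia.org/wiki/Hadoop", "http://hadoop.apache.org/"),
]
print(to_subdomain_edges(page_edges))
```

Three page-level edges between five distinct pages shrink to two subdomain-level edges between three hosts, the same kind of reduction that takes the graph from 331 GB to 9.2 GB.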

Page 18: Measuring the impact of Google Analytics

Engineering for scale

✓ Use the framework that matches best

✓ Debug locally

✓ Standard Hadoop optimizations (combiner, compression, re-use JVMs...)

✓ Many small jobs ≫ one big job

✓ Ganglia for metrics & monitoring
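
To show what the combiner optimization buys, here is a plain-Python imitation of the idea. It is not the Hadoop API: `map_links` and `combine` are hypothetical names, and the point is only that summing on the map side shrinks what has to be shuffled to the reducers.

```python
from collections import Counter


def map_links(links):
    """Mapper: emit one ((source, target), 1) record per link seen."""
    for source, target in links:
        yield (source, target), 1


def combine(records):
    """Combiner: sum counts locally before anything crosses the network."""
    local = Counter()
    for key, value in records:
        local[key] += value
    return list(local.items())


links_on_one_node = [
    ("a.example", "b.example"),
    ("a.example", "b.example"),
    ("a.example", "b.example"),
    ("a.example", "c.example"),
]
mapped = list(map_links(links_on_one_node))
combined = combine(mapped)
print(len(mapped), "records before combining,", len(combined), "after")
```

Because addition is associative and commutative, the reducer gets the same totals either way; it just receives fewer records.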

Page 19: Measuring the impact of Google Analytics

Hadoop :'(

Page 20: Measuring the impact of Google Analytics

Hadoop :'(

Page 21: Measuring the impact of Google Analytics

Monitoring & metrics with Ganglia

Page 22: Measuring the impact of Google Analytics

Engineering for cost

✓ Avoid Hadoop if it's simple enough

✓ Use spot instances everywhere*

✖ Use EMR if highly cost sensitive

(Elastic MapReduce = hosted Hadoop)

*Everywhere but the master node!

Page 23: Measuring the impact of Google Analytics

Juggling spot instances

c1.xlarge goes from $0.58 p/h to $0.064 p/h

Page 24: Measuring the impact of Google Analytics

EMR: The good, the bad, the ugly

significantly easier, one click setup

price is insane when using spot instances (spot = $0.075, with EMR = $0.12)

Guess how many log files for a 100 node cluster?

Page 25: Measuring the impact of Google Analytics

584,764+ log files.

Ouch.

Page 26: Measuring the impact of Google Analytics

Cost projection

Best optimized small Hadoop job: 1/177th the dataset in 23 minutes (12 c1.xlarge machines + Hadoop master)

Estimated full dataset job:

~210 TB for web data + ~90 TB for link data

~$60 in EC2 costs (177 hours of spot instances)

~$100 in EMR costs (avoid EMR for cost!)
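
The slides do not show how the two dollar figures were derived. The sketch below is one reconstruction that lands close to them, under assumptions that are mine: the full job is 177 runs of the 23-minute job on the same 12 workers, the master node is left out, and the per-instance-hour prices are the $0.075 (spot) and $0.12 (with EMR) quoted on the EMR slide.

```python
CHUNKS = 177                 # the small job covered 1/177th of the dataset
MINUTES_PER_CHUNK = 23
WORKERS = 12                 # c1.xlarge workers; master node excluded (assumption)
SPOT_PRICE = 0.075           # dollars per instance-hour, from the EMR slide
SPOT_PRICE_WITH_EMR = 0.12   # dollars per instance-hour, from the EMR slide


def projected_cost(price_per_instance_hour):
    """Scale the small job linearly to the full dataset."""
    cluster_hours = CHUNKS * MINUTES_PER_CHUNK / 60
    instance_hours = cluster_hours * WORKERS
    return instance_hours * price_per_instance_hour


print(f"EC2 only: ~${projected_cost(SPOT_PRICE):.0f}")
print(f"With EMR: ~${projected_cost(SPOT_PRICE_WITH_EMR):.0f}")
```

Under these assumptions the projection comes to roughly $61 without EMR and $98 with it, in line with the ~$60 and ~$100 on the slide.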

Page 27: Measuring the impact of Google Analytics

Final results

29.96% of 48 million domains have GA (top million domains was 50.8%)

That means that

one in every two hyperlinks will leak information to Google
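
A back-of-the-envelope check shows how a 29.96% share of domains can translate into roughly half of all links. It assumes the two ends of a link carry GA independently and with the same probability, which is a simplification of mine and not the link-graph measurement described in the talk.

```python
def leak_probability(ga_share):
    """Chance that at least one end of a link has GA, assuming independence."""
    return 1 - (1 - ga_share) ** 2


print(f"{leak_probability(0.2996):.1%}")
```

With a GA share of 0.2996 this gives about 51%, consistent with the "one in every two" headline.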

Page 28: Measuring the impact of Google Analytics

The wider impact

Page 29: Measuring the impact of Google Analytics

Want Big Open Data?

Web Data

Covers everything at scale!

Languages... Topics... Demographics...

Page 30: Measuring the impact of Google Analytics

Processing the web is feasible

Downloading it is a pain! Common Crawl does that for you

Processing it is scary! Big data frameworks exist and are (relatively) painless

These experiments are too expensive! Cloud computing means experiments can be just a few dollars

Page 31: Measuring the impact of Google Analytics

Get started now..!

Want raw web data? CommonCrawl.org

Want hyperlink graph / web tables / RDFa? WebDataCommons.org

Want example code to get you started? https://github.com/Smerity/cc-warc-examples

Page 32: Measuring the impact of Google Analytics

Measuring the impact:

Full write-up: http://smerity.com/cs205_ga/

Stephen Merity / smerity.com @smerity