Measuring the impact of Google Analytics

Post on 23-Aug-2014

description

Have you ever been curious about how widely Google Analytics is used across the web? Stop pondering, start coding! In this presentation, Stephen discusses how he used the Common Crawl dataset to perform large-scale analysis over billions of web pages, and what this means for privacy on the web at large.

Transcript of Measuring the impact of Google Analytics

Measuring the impact:

Stephen Merity / smerity.com @smerity

Smerity @ Common Crawl

Continuing the crawl

Documenting best practices

Guides for newcomers to Common Crawl + big data

Reference for seasoned veterans

Spending many hours blessing and/or cursing Hadoop

Before: University of Sydney '11, Harvard '14

Google Sydney, Freelancer.com, Grok Learning

banned@slashdot.org

I was hoping on creating a tool that will automatically extract some of the most common memes ("But does it run Linux?" and "In Soviet Russia..." style jokes etc) and I needed a corpus-

. I do intensely apologise.

I wrote a primitive (threaded :S) web crawler and started it before I considered robots.txt

-- Past Smerity (16/12/2007)

Where did all the HTTP referrers go?

Referrers: leaking browsing history

If you click from

http://www.reddit.com/r/sanfrancisco

to

http://www.sfbike.org/news/protected-bikeways-planned-for-the-embarcadero/

then SF Bike knows you came from Reddit
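As a rough illustration of the mechanism on this slide, the sketch below builds the request a browser would send for that click. The header is spelled "Referer" in HTTP. The helper names build_click_request and referring_host are made up for this example, and nothing is sent over the network.

```python
from urllib.parse import urlsplit
from urllib.request import Request


def build_click_request(from_url, to_url):
    """Build (but do not send) the request for a click from from_url to to_url."""
    # Browsers attach the page you came from in the Referer header.
    return Request(to_url, headers={"Referer": from_url})


def referring_host(request):
    """Return the host the visitor came from, as the destination site would see it."""
    referer = request.get_header("Referer")
    return urlsplit(referer).hostname if referer else None


click = build_click_request(
    "http://www.reddit.com/r/sanfrancisco",
    "http://www.sfbike.org/news/protected-bikeways-planned-for-the-embarcadero/",
)
print(referring_host(click))
```

Any third-party script loaded on the destination page can read the same referrer, which is what makes it useful for tracking at scale.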

1) How many websites is Google Analytics (GA) on?

2) How much of a user's browsing history does GA capture?

Top 10k domains: 65.7%

Top 100k domains: 64.2%

Top million domains: 50.8%

It keeps dropping off, but by how much..?

Estimate of captured browsing history...

?

Referrers allow easy web tracking when done at Google's scale!

No information: !GA → !GA

Full information: !GA → GA

GA → !GA → GA

GA → !GA → GA → !GA → GA → !GA → GA → !GA → GA

Key insight: leaked browsing history

Google only needs one in every two links to have GA in order to have your full browsing path*

* possibly less if link graph + click timing + machine learning used
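The alternating chains above can be checked with a small sketch. It uses the talk's simplification that a hop is visible to Google whenever either end of it has GA. The function names and the example.com style hosts are hypothetical.

```python
def leaked_hops(path):
    """path is a list of (site, has_ga) pairs in visiting order.

    Returns one boolean per hop: True when the page at either end of the
    hop carries GA, so the hop is visible under the talk's simplified model.
    """
    return [a_ga or b_ga for (_, a_ga), (_, b_ga) in zip(path, path[1:])]


def full_path_leaked(path):
    """True when every hop in a non-trivial path is visible."""
    hops = leaked_hops(path)
    return bool(hops) and all(hops)


alternating = [("a.example.com", True), ("b.example.com", False),
               ("c.example.com", True), ("d.example.com", False),
               ("e.example.com", True)]
print(full_path_leaked(alternating))
print(leaked_hops([("x.example.com", False), ("y.example.com", False)]))
```

Only half the sites in the alternating path carry GA, yet every hop is covered, which is the point of the "one in every two" claim.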

Estimating leaked browser history

for each link = {page A} → {page B}:
    total_links += 1
    if {page A} or {page B} has GA:
        total_leaked += 1

Estimate of leaked browser history is simply: total_leaked / total_links
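The pseudocode on this slide translates almost line for line into runnable Python. This is a minimal sketch, assuming the links fit in memory as (page_a, page_b) pairs and that the set of pages with GA is already known. The name estimate_leak is not from the talk.

```python
def estimate_leak(links, pages_with_ga):
    """Fraction of links where the source or the target page has GA."""
    total_links = 0
    total_leaked = 0
    for page_a, page_b in links:
        total_links += 1
        if page_a in pages_with_ga or page_b in pages_with_ga:
            total_leaked += 1
    # Guard against an empty link list rather than dividing by zero.
    return total_leaked / total_links if total_links else 0.0


links = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
print(estimate_leak(links, pages_with_ga={"a"}))
```

At the scale of the real crawl the same counting has to be distributed, but the two counters stay the same.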

Joint project with Chad Hornbaker* at Harvard IACS

* Best full name ever: Captain Charles Lafforest Hornbaker II

The task

Google Analytics count: ".google-analytics.com/ga.js"

Generate link graph

Merge link graph & GA count

www.winradio.net.au NoGA 1

www.winrar.com.cn GA 6

www.winratzart.com GA 1

www.winrenner.ch GA 244

domainA.com -> domainB.com <total times>

cnet-cnec-driver.softutopia.com -> www.softutopia.com 24
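The three steps of the task can be sketched on the sample lines shown on this slide. Several things here are assumptions rather than the talk's actual code: the fields are taken to be whitespace-separated, a domain is treated as having GA if any of its pages matched, and all function names are invented for illustration.

```python
from collections import defaultdict

# String searched for in the raw page body (from the slide).
GA_MARKER = ".google-analytics.com/ga.js"


def page_has_ga(html):
    """Step 1: a page counts as using GA if the marker appears anywhere in it."""
    return GA_MARKER in html


def parse_ga_counts(lines):
    """Lines like 'www.winrar.com.cn GA 6' -> set of domains with a GA page."""
    ga_domains = set()
    for line in lines:
        domain, label, count = line.split()
        if label == "GA" and int(count) > 0:
            ga_domains.add(domain)
    return ga_domains


def parse_link_graph(lines):
    """Step 2 output: lines like 'a.com -> b.com 24' -> {(a, b): total times}."""
    edges = defaultdict(int)
    for line in lines:
        source, _arrow, target, count = line.split()
        edges[(source, target)] += int(count)
    return edges


def weighted_leak(edges, ga_domains):
    """Step 3: merge both outputs, weighting each edge by its link count."""
    total = sum(edges.values())
    leaked = sum(n for (a, b), n in edges.items()
                 if a in ga_domains or b in ga_domains)
    return leaked / total if total else 0.0


ga = parse_ga_counts(["www.winradio.net.au NoGA 1", "www.winrar.com.cn GA 6"])
graph = parse_link_graph(["cnet-cnec-driver.softutopia.com -> www.softutopia.com 24"])
print(weighted_leak(graph, ga))
```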

Exciting age of open data

Open data +

Open tools +

Cloud computing

WARC: raw web data

WAT: metadata (links, title, ...) for each page

WET: extracted text

WARC = GA usage (raw web data)

WAT = hyperlink graph (metadata: links, title, ... for each page)
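To make the WAT half concrete, here is a hedged sketch of pulling outgoing links from one WAT metadata record. The nested key names (Envelope, Payload-Metadata, HTTP-Response-Metadata, HTML-Metadata, Links) and the "A@/href" path convention reflect my understanding of the WAT JSON layout and should be checked against a real file. The sample record is hand-written, not taken from the crawl.

```python
import json


def links_from_wat_record(wat_json):
    """Return (source_url, [target_url, ...]) for one WAT JSON record."""
    envelope = json.loads(wat_json).get("Envelope", {})
    source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    html_meta = (envelope.get("Payload-Metadata", {})
                 .get("HTTP-Response-Metadata", {})
                 .get("HTML-Metadata", {}))
    # Keep only anchor links; images, scripts and stylesheets are skipped.
    targets = [link["url"] for link in html_meta.get("Links", [])
               if "url" in link and link.get("path", "").startswith("A@")]
    return source, targets


SAMPLE = json.dumps({"Envelope": {
    "WARC-Header-Metadata": {"WARC-Target-URI": "http://www.reddit.com/r/sanfrancisco"},
    "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {"Links": [
        {"path": "A@/href", "url": "http://www.sfbike.org/news/"},
        {"path": "IMG@/src", "url": "http://www.reddit.com/logo.png"},
    ]}}},
}})
print(links_from_wat_record(SAMPLE))
```

Because the links are already extracted in the WAT files, the link graph can be built without parsing any HTML.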

Estimating the task's size

Page level: 3.5 billion nodes, 128 billion edges, 331 GB compressed

Subdomain level: 101 million nodes, 2 billion edges, 9.2 GB compressed

Example URL: http://en.wikipedia.org/

Decided on using subdomains instead of page level
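Collapsing page-level links to subdomain level is what shrinks the graph. The sketch below shows one way to do it with the standard library. The /wiki/... paths are hypothetical, and whether the original job kept links inside a single subdomain is not stated, so this version keeps them.

```python
from collections import Counter
from urllib.parse import urlsplit


def subdomain_of(url):
    """Reduce a page URL to its host, e.g. 'en.wikipedia.org'."""
    # urlsplit lowercases the hostname and strips any port or credentials.
    return urlsplit(url).hostname


def collapse_to_subdomains(page_links):
    """Turn (page_a, page_b) pairs into weighted (host_a, host_b) edges."""
    edges = Counter()
    for page_a, page_b in page_links:
        host_a, host_b = subdomain_of(page_a), subdomain_of(page_b)
        if host_a and host_b:
            edges[(host_a, host_b)] += 1
    return edges


pages = [
    ("http://en.wikipedia.org/wiki/A", "http://www.sfbike.org/news/"),
    ("http://en.wikipedia.org/wiki/B", "http://www.sfbike.org/about/"),
]
print(collapse_to_subdomains(pages))
```

Two distinct page-level edges become a single subdomain-level edge with a count of 2, matching the "domainA.com -> domainB.com <total times>" format.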

Engineering for scale

✓ Use the framework that matches best

✓ Debug locally

✓ Standard Hadoop optimizations (combiner, compression, re-use JVMs...)

✓ Many small jobs ≫ one big job

✓ Ganglia for metrics & monitoring
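Of the optimizations listed, the combiner is the easiest to show in a few lines. The sketch below is plain Python rather than Hadoop API code: it imitates two mappers counting GA and NoGA pages per domain, with a combiner summing each mapper's output locally before it would be shuffled to the reducer. All names and the sample input are illustrative.

```python
from collections import Counter


def map_pages(pages):
    """Map step: emit ((domain, label), 1) for each (domain, has_ga) page."""
    for domain, has_ga in pages:
        yield (domain, "GA" if has_ga else "NoGA"), 1


def combine(pairs):
    """Combiner: pre-sum counts per key on the mapper's own machine."""
    local = Counter()
    for key, count in pairs:
        local[key] += count
    return list(local.items())


def reduce_counts(pairs):
    """Reduce step: sum whatever counts arrive for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


mapper_1 = [("www.winrenner.ch", True)] * 3
mapper_2 = [("www.winrenner.ch", True), ("www.winradio.net.au", False)]
shuffled = combine(map_pages(mapper_1)) + combine(map_pages(mapper_2))
print(len(shuffled), reduce_counts(shuffled))
```

The reducer produces the same totals either way; the combiner only reduces how many pairs cross the network.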

Hadoop :'(

Monitoring & metrics with Ganglia

Engineering for cost

✓ Avoid Hadoop if it's simple enough

✓ Use spot instances everywhere*

✖ Use EMR if highly cost sensitive

(Elastic MapReduce = hosted Hadoop)

* Everywhere but the master node!

Juggling spot instances

c1.xlarge goes from $0.58 p/h to $0.064 p/h

EMR: The good, the bad, the ugly

significantly easier, one click setup

price is insane when using spot instances (spot = $0.075 with EMR = $0.12)

Guess how many log files for a 100 node cluster?

584,764+ log files.

Ouch.

Cost projection

Best optimized small Hadoop job: 1/177th the dataset in 23 minutes (12 c1.xlarge machines + Hadoop master)

Estimated full dataset job:

~210TB for web data + ~90TB for link data

~$60 in EC2 costs (177 hours of spot instances)

~$100 in EMR costs (avoid EMR for cost!)
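The projection can be roughly reproduced from the numbers on the slides. This is a back-of-the-envelope sketch under assumptions that are mine, not the talk's: run time scales linearly from the 1/177th sample, all 13 machines (including the master) are billed at the $0.064 spot price, and the $0.12 figure from the EMR slide is a per-machine-hour EMR fee.

```python
def project_cost(num_chunks, minutes_per_chunk, machines, price_per_machine_hour):
    """Scale one sample job up to the full dataset and price it."""
    cluster_hours = num_chunks * minutes_per_chunk / 60
    machine_hours = cluster_hours * machines
    return machine_hours * price_per_machine_hour


# 177 chunks of 23 minutes each on 12 workers + 1 master.
ec2_cost = project_cost(177, 23, 13, 0.064)   # spot price for c1.xlarge
emr_cost = project_cost(177, 23, 13, 0.12)    # assumed EMR surcharge
print(f"EC2: ${ec2_cost:.2f}, EMR: ${emr_cost:.2f}")
```

Under these assumptions the totals come out at roughly $56 and $106, in line with the ~$60 and ~$100 on the slide.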

Final results

29.96% of 48 million domains have GA (top million domains was 50.8%)

That means that

one in every two hyperlinks will leak information to Google
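As a sanity check on how a 29.96% share of domains can turn into one in every two hyperlinks, consider a deliberately naive model. It is not the method used in the talk, which counts over the real link graph. Assume each end of a link independently has GA with probability equal to the domain share; a link leaks if at least one end has GA.

```python
def naive_leak_probability(ga_share):
    """P(at least one end of a link has GA), assuming independent ends."""
    return 1 - (1 - ga_share) ** 2


print(round(naive_leak_probability(0.2996), 4))
```

With a share of 0.2996 this gives about 0.51, close to the reported result. Real links are not uniformly distributed over domains, so the agreement is suggestive rather than a derivation.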

The wider impact

Want Big Open Data?

Web Data

Covers everything at scale!

Languages... Topics... Demographics...

Processing the web is feasible

Downloading it is a pain! Common Crawl does that for you

Processing it is scary! Big data frameworks exist and are (relatively) painless

These experiments are too expensive! Cloud computing means experiments can be just a few dollars

Get started now..!

Want raw web data? CommonCrawl.org

Want hyperlink graph / web tables / RDFa? WebDataCommons.org

Want example code to get you started? https://github.com/Smerity/cc-warc-examples

Measuring the impact:

Full write-up: http://smerity.com/cs205_ga/

Stephen Merity / smerity.com @smerity