Measuring the impact of Google Analytics
Uploaded by: open-data-bay-area-obda
Measuring the impact:
Stephen Merity / smerity.com @smerity
Smerity @ Common Crawl
Continuing the crawl
Documenting best practices
Guides for newcomers to Common Crawl + big data
Reference for seasoned veterans
Spending many hours blessing and/or cursing Hadoop
Before: University of Sydney '11, Harvard '14
Google Sydney, Freelancer.com, Grok Learning
I was hoping on creating a tool that will automatically extract some of the most common memes ("But does it run Linux?" and "In Soviet Russia..." style jokes etc) and I needed a corpus -
I do intensely apologise.
I wrote a primitive (threaded :S) web crawler and started it before I considered robots.txt
-- Past Smerity (16/12/2007)
Where did all the HTTP referrers go?
Referrers: leaking browsing history
If you click from
http://www.reddit.com/r/sanfrancisco
to
http://www.sfbike.org/news/protected-bikeways-planned-for-the-embarcadero/
then SF Bike knows you came from Reddit
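A minimal sketch of what the destination server can read from that click. It assumes the browser sends the standard `Referer` request header (the HTTP spelling has one "r"); the `referring_host` helper and the header dictionary are hypothetical and only illustrate the idea.

```python
from urllib.parse import urlsplit

def referring_host(headers):
    """Return the host named in the Referer header, or None if it is absent."""
    referer = headers.get("Referer")
    if not referer:
        return None
    return urlsplit(referer).hostname

# Illustrative headers, as the destination site might receive them after the click.
request_headers = {
    "Host": "www.sfbike.org",
    "Referer": "http://www.reddit.com/r/sanfrancisco",
}

print(referring_host(request_headers))
```

Any third-party script embedded in the destination page can learn the same value, which is why the presence of GA on a page matters for the rest of the talk.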
1) How many websites is Google Analytics (GA) on?
2) How much of a user's browsing history does GA capture?
Top 10k domains: 65.7%
Top 100k domains: 64.2%
Top million domains: 50.8%
It keeps dropping off, but by how much..?
Estimate of captured browsing history... ?
Referrers allow easy web tracking when done at Google's scale!
No information: !GA → !GA
Full information: !GA → GA
GA → !GA → GA
GA → !GA → GA → !GA → GA → !GA → GA → !GA → GA
Key insight: leaked browsing history
Google only needs one in every two links to have GA in order to have your full browsing path*
*possibly less if link graph + click timing + machine learning used
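The alternating chains above can be checked with a small sketch. It treats a hop as leaked when either end runs GA, which is the counting rule this talk uses for its estimate; the domain names and the `visible_hops` helper are hypothetical.

```python
def visible_hops(path, has_ga):
    """For each hop A -> B in a browsing path, report whether it is leaked.

    A hop counts as leaked when A or B is in the set of GA sites.
    """
    return [(a in has_ga) or (b in has_ga) for a, b in zip(path, path[1:])]

# A path that alternates GA / !GA sites, like the chains on the slide.
path = ["a.test", "b.test", "c.test", "d.test", "e.test"]

alternating = visible_hops(path, has_ga={"a.test", "c.test", "e.test"})
sparse = visible_hops(path, has_ga={"a.test"})

print(all(alternating))  # every hop is leaked
print(sparse)            # only the first hop is leaked
```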
Estimating leaked browser history
for each link = {page A} → {page B}:
    total_links += 1
    if {page A} or {page B} has GA:
        total_leaked += 1
Estimate of leaked browser history is simply: total_leaked / total_links
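The loop above translates almost directly into runnable code. This is a sketch on toy data, not the project's actual implementation: the link list, the GA set and the name `leaked_fraction` are made up for illustration.

```python
def leaked_fraction(links, has_ga):
    """Fraction of links where the source or the target runs GA."""
    total_links = 0
    total_leaked = 0
    for page_a, page_b in links:
        total_links += 1
        if page_a in has_ga or page_b in has_ga:
            total_leaked += 1
    # Guard against an empty link list.
    return total_leaked / total_links if total_links else 0.0

# Toy link graph: only the first link touches a GA site.
links = [
    ("a.test", "b.test"),
    ("b.test", "c.test"),
    ("c.test", "d.test"),
    ("d.test", "b.test"),
]
has_ga = {"a.test"}

print(leaked_fraction(links, has_ga))  # 0.25
```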
Joint project with Chad Hornbaker* at Harvard IACS
*Best full name ever: Captain Charles Lafforest Hornbaker II
The task
Google Analytics count: ".google-analytics.com/ga.js"
Generate link graph
Merge link graph & GA count
www.winradio.net.au NoGA 1
www.winrar.com.cn GA 6
www.winratzart.com GA 1
www.winrenner.ch GA 244
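A rough sketch of the GA count step, assuming detection is a plain substring search for the marker string shown above and that each output row is host, label, page count. The sample pages and the `count_ga` helper are hypothetical; the real job ran over crawl data rather than an in-memory list.

```python
from collections import Counter
from urllib.parse import urlsplit

GA_MARKER = ".google-analytics.com/ga.js"

def count_ga(pages):
    """pages: iterable of (url, html). Counts pages per (host, 'GA' or 'NoGA')."""
    counts = Counter()
    for url, html in pages:
        host = urlsplit(url).hostname
        label = "GA" if GA_MARKER in html else "NoGA"
        counts[(host, label)] += 1
    return counts

# Made-up pages for illustration only.
pages = [
    ("http://shop.test/", "<script src='http://www.google-analytics.com/ga.js'></script>"),
    ("http://shop.test/cart", "<script src='https://ssl.google-analytics.com/ga.js'></script>"),
    ("http://plain.test/", "<html><body>No tracking here</body></html>"),
]

counts = count_ga(pages)
for (host, label), n in sorted(counts.items()):
    print(host, label, n)
```

A substring match is cheap enough to run per page inside a map task, at the price of missing sites that load analytics in other ways.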
domainA.com -> domainB.com <total times>
cnet-cnec-driver.softutopia.com -> www.softutopia.com 24
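One way the merge step could look, under several assumptions that are mine rather than the talk's: fields are whitespace-separated, a domain counts as a GA domain if it has at least one GA row, and each link is weighted by its total count. The function names and the first link row are hypothetical.

```python
def parse_ga_counts(lines):
    """Lines look like 'host GA 6' or 'host NoGA 1'. Returns the set of GA hosts."""
    has_ga = set()
    for line in lines:
        host, label, _count = line.split()
        if label == "GA":
            has_ga.add(host)
    return has_ga

def parse_links(lines):
    """Lines look like 'domainA.com -> domainB.com 24'."""
    for line in lines:
        src, _arrow, dst, times = line.split()
        yield src, dst, int(times)

def weighted_leak(link_lines, ga_lines):
    """Leaked fraction where each link counts as many times as it was seen."""
    has_ga = parse_ga_counts(ga_lines)
    total = leaked = 0
    for src, dst, times in parse_links(link_lines):
        total += times
        if src in has_ga or dst in has_ga:
            leaked += times
    return leaked / total if total else 0.0

ga_lines = ["www.winradio.net.au NoGA 1", "www.winrar.com.cn GA 6"]
link_lines = [
    "example-a.test -> www.winrar.com.cn 8",  # hypothetical row
    "cnet-cnec-driver.softutopia.com -> www.softutopia.com 24",
]

print(weighted_leak(link_lines, ga_lines))  # 0.25
```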
Exciting age of open data
Open data +
Open tools +
Cloud computing
WARC: raw web data
WAT: metadata (links, title, ...) for each page
WET: extracted text
WARC = GA usage (raw web data)
WAT = hyperlink graph (metadata (links, title, ...) for each page)
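A sketch of pulling link targets out of one WAT metadata record. The nesting used here (`Envelope`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`, with `path` and `url` on each link) is my understanding of the WAT JSON layout and should be checked against a real file; the sample record and the `outgoing_hosts` helper are invented for illustration.

```python
import json
from urllib.parse import urljoin, urlsplit

def outgoing_hosts(wat_record_json, page_url):
    """Return the set of hosts that a page's anchor links point at."""
    record = json.loads(wat_record_json)
    links = (record.get("Envelope", {})
                   .get("Payload-Metadata", {})
                   .get("HTTP-Response-Metadata", {})
                   .get("HTML-Metadata", {})
                   .get("Links", []))
    hosts = set()
    for link in links:
        # Keep only <a href=...> links; skip images, scripts and so on.
        if link.get("path") != "A@/href" or not link.get("url"):
            continue
        # Resolve relative URLs against the page they appeared on.
        host = urlsplit(urljoin(page_url, link["url"])).hostname
        if host:
            hosts.add(host)
    return hosts

page_url = "http://www.reddit.com/r/sanfrancisco"
sample_json = json.dumps({"Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": {
    "HTML-Metadata": {"Links": [
        {"path": "A@/href", "url": "http://www.sfbike.org/news/"},
        {"path": "A@/href", "url": "/r/bayarea"},
        {"path": "IMG@/src", "url": "http://img.test/logo.png"},
    ]}}}}})

print(sorted(outgoing_hosts(sample_json, page_url)))
```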
Estimating the task's size
Page level (http://en.wikipedia.org/): 3.5 billion nodes, 128 billion edges, 331GB compressed
Subdomain level: 101 million nodes, 2 billion edges, 9.2GB compressed
Decided on using subdomains instead of page level
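Collapsing a page-level graph to subdomain level amounts to keeping only the host part of each URL and summing duplicate edges. A small sketch, with illustrative URLs and hypothetical helper names:

```python
from collections import Counter
from urllib.parse import urlsplit

def to_subdomain(url):
    """Keep only the host of a URL, e.g. 'en.wikipedia.org'."""
    return urlsplit(url).hostname

def collapse(page_links):
    """Turn page -> page links into counted subdomain -> subdomain edges."""
    edges = Counter()
    for src, dst in page_links:
        a, b = to_subdomain(src), to_subdomain(dst)
        if a and b:
            edges[(a, b)] += 1
    return edges

# Two distinct page-level edges that become one subdomain-level edge.
page_links = [
    ("http://en.wikipedia.org/wiki/Hadoop", "http://www.sfbike.org/news/"),
    ("http://en.wikipedia.org/wiki/MapReduce", "http://www.sfbike.org/about/"),
]

edges = collapse(page_links)
print(edges)
```

This is the reason the subdomain graph is so much smaller: many page-level edges fold into a single counted edge.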
Engineering for scale
✓ Use the framework that matches best
✓ Debug locally
✓ Standard Hadoop optimizations (combiner, compression, re-use JVMs...)
✓ Many small jobs ≫ one big job
✓ Ganglia for metrics & monitoring
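To illustrate why a combiner is on the list of standard optimizations, here is a plain-Python simulation (not Hadoop code) of counting links with and without pre-aggregation on each map task. All names and the toy input splits are hypothetical.

```python
from collections import Counter

def map_links(split):
    """Map step: emit ((src, dst), 1) for every link in one input split."""
    for src, dst in split:
        yield (src, dst), 1

def combine(pairs):
    """Combiner: sum counts locally, before anything crosses the network."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def reduce_counts(all_pairs):
    """Reduce step: sum the (possibly pre-combined) counts per key."""
    totals = Counter()
    for key, value in all_pairs:
        totals[key] += value
    return totals

splits = [
    [("a.test", "b.test")] * 3 + [("a.test", "c.test")],
    [("a.test", "b.test")] * 2,
]

raw = [list(map_links(s)) for s in splits]
combined = [combine(p) for p in raw]

records_without_combiner = sum(len(p) for p in raw)
records_with_combiner = sum(len(p) for p in combined)
totals = reduce_counts(pair for part in combined for pair in part)

print(records_without_combiner, records_with_combiner)  # 6 3
```

The final totals are identical either way; only the number of records shuffled between map and reduce changes.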
Hadoop :'(
Monitoring & metrics with Ganglia
Engineering for cost
✓ Avoid Hadoop if it's simple enough
✓ Use spot instances everywhere*
✖ Use EMR if highly cost sensitive
(Elastic MapReduce = hosted Hadoop)
*Everywhere but the master node!
Juggling spot instances
c1.xlarge goes from $0.58 p/h to $0.064 p/h
EMR: The good, the bad, the ugly
significantly easier, one click setup
price is insane when using spot instances (spot = $0.075 with EMR = $0.12)
Guess how many log files for a 100 node cluster?
584,764+ log files.
Ouch.
Cost projection
Best optimized small Hadoop job: 1/177th the dataset in 23 minutes (12 c1.xlarge machines + Hadoop master)
Estimated full dataset job:
~210TB for web data + ~90TB for link data
~$60 in EC2 costs (177 hours of spot instances)
~$100 in EMR costs (avoid EMR for cost!)
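A back-of-the-envelope check of the projection. This is only one possible reading of the figures: it assumes the work scales linearly from the 1/177th sample, counts the 12 worker machines only (ignoring the master node and any per-hour billing rounding), and reuses the $0.075 and $0.12 hourly prices from the EMR slide. None of these assumptions is stated in the talk.

```python
SAMPLE_JOBS = 177        # the test job covered 1/177th of the dataset
MINUTES_PER_JOB = 23
WORKERS = 12             # c1.xlarge workers, master node not counted

cluster_hours = SAMPLE_JOBS * MINUTES_PER_JOB / 60   # wall-clock hours
worker_hours = cluster_hours * WORKERS               # billable machine-hours

def projected_cost(price_per_hour):
    """Projected cost in dollars at a flat hourly price per worker."""
    return worker_hours * price_per_hour

print(f"{cluster_hours:.2f} cluster hours, {worker_hours:.1f} worker hours")
print(f"spot only: ${projected_cost(0.075):.0f}")
print(f"with EMR:  ${projected_cost(0.12):.0f}")
```

Under these assumptions the totals come out close to the ~$60 and ~$100 quoted above, which suggests the gap between the two is mostly the hosted-Hadoop price difference per machine-hour.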
Final results
29.96% of 48 million domains have GA (top million domains was 50.8%)
That means that
one in every two hyperlinks will leak information to Google
The wider impact
Want Big Open Data?
Web Data
Covers everything at scale!
Languages... Topics... Demographics...
Processing the web is feasible
Downloading it is a pain! Common Crawl does that for you
Processing it is scary! Big data frameworks exist and are (relatively) painless
These experiments are too expensive! Cloud computing means experiments can be just a few dollars
Get started now..!
Want raw web data? CommonCrawl.org
Want hyperlink graph / web tables / RDFa? WebDataCommons.org
Want example code to get you started? https://github.com/Smerity/cc-warc-examples
Full write-up: http://smerity.com/cs205_ga/