How to evolve your analytics stack with your business using Snowplow
How we use Hive at SnowPlow, and how the role of Hive is changing
SnowPlow
How Apache Hive and other big data technologies are transforming web analytics
How Hive is used at SnowPlow
Strengths and weaknesses of Hive vs alternatives
http://snowplowanalytics.com
@snowplowdata
@yalisassoon
Some history

Web analytics:
• 1990: Web is born
• 1993: Log file based web analytics
• 1996–1997: Javascript tagging

Big data:
• 2004: Google publishes MapReduce paper
• 2006: Hadoop project split out of Nutch
• 2008: Facebook develops Hive
• 2010: Google publishes Dremel paper
Implications
• Web analytics solutions were developed on the assumption that granular, event-level and customer-level data was too expensive to store and query
• Data is aggregated from the start. Data collection and analysis are tightly coupled
• Web analytics is limited:
– Hits
– Clicks, unique visitors, conversions
– Traffic sources
• Web analytics is silo’ed (separate tools to use vs other data sets):
– Hard to link to customer data (e.g. CRM)
– Hard to link to marketing data (e.g. DoubleClick)
– Hard to link to financial data (e.g. unit profit)
Let’s reinvent web analytics
Web analytics is one (very rich) data set that is at the heart of:
Customer analytics
Platform / application analytics
Catalogue analytics
• How do my users segment by behaviour?
• What is the customer lifetime value of my users? How can I forecast it based on their behaviour?
• What are the ‘sliding doors’ moments in a customer’s journey that impact their lifetime value?
• Which channels should I spend marketing budget on to acquire high-value customers?
• How do improvements to my application drive improved user engagement and lifetime value?
• Which parts of my application should I focus development on to drive return?
• How are my different products (items in a shop / articles on a newspaper / videos on a media site) performing? What is driving the most engagement? Revenue? Profit?
• How should I organise my catalogue online to drive the best user experience? How can I personalise it to different users?
SnowPlow: an open source platform that delivers the granular web analytics data, so you can perform the analyses above
SnowPlow leverages big data and cloud technology across its architecture. Hive on EMR is used A LOT
Javascript tag
Pixel served from Amazon CloudFront
Request to pixel (incl. query string) logged
Hive
Read logs using custom SerDe
Write a single table of clean, partitioned event data back to S3 for ease of querying
S3
Single, “fat” Hive table
Query in Hive
Output to other analytics programmes e.g. Excel, Tableau, R…
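The read-transform-write step above might look roughly like the following HiveQL; this is an illustrative sketch only, and every name in it (tables, columns, S3 buckets, and especially the SerDe class) is a hypothetical placeholder rather than SnowPlow's actual code:

```sql
-- Illustrative sketch: all table, column, bucket and SerDe names
-- below are hypothetical placeholders, not SnowPlow's actual ones.

-- 1. Expose the raw CloudFront request logs to Hive via a custom SerDe
CREATE EXTERNAL TABLE raw_logs (
  dt STRING,
  user_id STRING,
  page_url STRING
)
ROW FORMAT SERDE 'com.example.hive.CloudfrontLogDeserializer'
LOCATION 's3://example-cloudfront-logs/';

-- 2. Define a clean events table, partitioned by date, stored back on S3
CREATE EXTERNAL TABLE events (
  user_id STRING,
  page_url STRING
)
PARTITIONED BY (dt STRING)
LOCATION 's3://example-snowplow-events/';

-- 3. Run the ETL as a dynamic-partition insert
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE events PARTITION (dt)
SELECT user_id, page_url, dt
FROM raw_logs;
```

In practice the custom SerDe is what parses each CloudFront log line (including the querystring on the pixel request) into event fields; the partition-by-date layout is what keeps subsequent querying cheap.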
Hive!
How SnowPlow data looks in Hive:

One line per event (e.g. page view, add-to-basket), with fields grouped as follows:
• User: user_id, visit_id, ip_address
• Page: url, title, referrer
• Marketing: source, medium, term, content, campaign
• Event: category, action, label, property, value
• Browser: name, family, version, type, lang, …
• OS: name, family, manufacturer
• Device: type, is_mobile?, width, height
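With everything in one fat table, common queries stay simple. A minimal sketch, assuming the table is named `events` and is partitioned by a `dt` date column (both names are assumptions for illustration):

```sql
-- Daily unique visitors and event counts from the single fat event table
-- (the `events` table name and `dt` partition column are assumed)
SELECT
  dt,
  COUNT(DISTINCT user_id) AS unique_visitors,
  COUNT(*)                AS events
FROM events
WHERE dt >= '2012-09-01'
GROUP BY dt;
```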
We love Hive… but…
• Easy to use and query (especially compared with NoSQL competitors, e.g. MongoDB):
– E.g. http://snowplowanalytics.com/analytics/basic-recipes.html
– http://snowplowanalytics.com/analytics/customer-analytics/cohort-analysis.html
• Rapidly develop ETL and analytics queries
• Easy to run on Amazon EMR
• Tight integration with Amazon S3

• Hard to debug
• Slow
• Limited power
• Batch based (Hadoop’s fault…)
For storage and analytics, columnar databases provide an attractive alternative
Columnar databases:
• Scales to terabytes (not petabytes)
• Fixed cost (dedicated analytics server with LOTs of RAM)
• Significantly faster – seconds not minutes
• Plug in to many analytics front ends e.g. Tableau, Qlikview, R

Hive:
• Scales horizontally – to petabytes at least
• Pay-as-you-go (on EMR) – each query costs $
• An increasing number of front-ends can be ‘plugged in’ e.g. Toad for Cloud Databases
For segmentation, personalisation and recommendation on web analytics data, you can’t beat Mahout
You can do computations that do not fit the SQL processing model incl. machine learning in Hive via transformation scripts…
… but why would you?
CREATE TABLE docs (contents STRING);

FROM (
  MAP docs.contents USING 'tokenizer_script' AS word, cnt
  FROM docs
  CLUSTER BY word
) map_output
REDUCE map_output.word, map_output.cnt USING 'count_script' AS word, cnt;
• Large number of recommendation, clustering and categorisation algorithms
• Plays well with Hadoop
• Large, active developer community
For ETL in production, you really need something more robust than Hive
• ETL: need to define sophisticated data pipelines so that:
– Clear audit path: which lines of data have been processed, which have not
– Where they have not, error handling flows to deal with the lines (including potential reprocessing)
– Should fail gracefully (not shut down whole job)
– Should be easy to debug when things go wrong, diagnose the problem, and start again where left off…
• An alternative to Hive we are exploring: Cascading
– Java framework for developing Hadoop-powered data processing applications
– Scala (Scalding) and Clojure (Cascalog) wrappers available
Where we’re going with Hive @ SnowPlow
Javascript tag
Pixel served from Amazon CloudFront
Request to pixel (incl. query string) logged
Scalding (Cascading)
Ruby wrapper
Infobright
BI tools e.g. Tableau, Qlikview, Pentaho
Data exploration tools e.g. R, Excel
MI tools e.g. Mahout
Hive for ad hoc analytics on the atomic data
Hive for SnowPlow users with petabytes of data
But… for most users, Hive is NOT part of the core flow
Any questions?
http://github.com/snowplow
http://snowplowanalytics.com
@snowplowdata
@yalisassoon