Hive at Last.fm
Hive at Last.fm
Omar Ali - Data Developer
March 2012
Overview
• Hadoop at Last.fm
• Hive
• Examples

What I want to show you:
• How it fits with a Hadoop infrastructure
• Typical workflow with Hive
• Ease of use for experiments and prototypes
Hadoop
• Brief overview of our infrastructure
• How we use it
Hadoop
64-node cluster
Charts
Hive
• What is Hive?
• How does it fit in with the rest of our system?
• Using existing data in Hive
• Example query
What is Hive?
• Data warehouse
• You see your data in the form of tables
• Query language very similar to SQL
hive> show tables like 'omar_charts_*';
OK
omar_charts_globaltags_album
omar_charts_globaltags_artist
omar_charts_globaltags_track
omar_charts_tagcloud_album
omar_charts_tagcloud_artist
omar_charts_tagcloud_track
hive> describe omar_charts_tagcloud_album;
OK
albumid    int
tagid      int
weight     double
What is a table?

Standard
• Metadata stored by Hive
• Table data stored by Hive
• Deleting the table deletes the data and the metadata

External
• Metadata stored by Hive
• Table data referenced by Hive
• Deleting the table only deletes the metadata
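The distinction can be sketched in HiveQL. The table names, columns, and HDFS path below are illustrative, not taken from the deck:

```sql
-- Standard (managed) table: Hive stores both the metadata and the data;
-- DROP TABLE removes the files as well as the schema.
CREATE TABLE charts_example (
  albumid INT,
  tagid   INT,
  weight  DOUBLE
);

-- External table: Hive stores only metadata and references data already
-- sitting on HDFS; DROP TABLE leaves the underlying files untouched.
CREATE EXTERNAL TABLE submissions_example (
  userid  INT,
  trackid INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/submissions_example';
```

External tables are what make the "using existing data in Hive" workflow possible: logs already on HDFS can be queried without copying them.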
[Diagram: Log Files, Database, Tables]
Example: scrobbles

Scrobble log:
13364451 30886670 217803052 358001787 0 0 0 1 0 0 1319068581
42875138 1717 3776668 4641276 0 0 0 1 0 0 1319068445
43108664 1003811 2237730 1019632 0 0 0 1 0 0 1319068783
36107186 1033304 2393940 13409429 0 0 0 0 0 1 1319068524
23842745 1261965 2349564 14091069 0 0 0 0 0 1 1319068594

Directory structure:
/data/submissions/2002/01/01
...
/data/submissions/2012/03/20
/data/submissions/2012/03/21
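One common way to expose dated log directories like these to Hive is a partitioned external table. The sketch below is hypothetical (the column names and types are guesses, not the actual Last.fm table definition):

```sql
-- Hypothetical table over the scrobble logs; one partition per day.
CREATE EXTERNAL TABLE submissions_sketch (
  userid   INT,
  trackid  INT,
  unixtime BIGINT
)
PARTITIONED BY (insertdate STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Each day's directory is registered as a partition, so a query that
-- filters on insertdate only reads the matching directories.
ALTER TABLE submissions_sketch
ADD PARTITION (insertdate = '2012-03-21')
LOCATION '/data/submissions/2012/03/21';
```

This is why the queries that follow can filter on `insertdate` cheaply: the date predicate prunes whole directories rather than scanning every log file.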
A Hive Query

select track.title, size(collect_set(s.userid)) as reach
from meta_track track
join data_submissions s on (s.trackid = track.id)
where s.insertdate = '2012-03-01'
  and (s.scrobble + s.listen > 0)
  and s.artistid = 57976724 -- Lana Del Rey
group by track.title
order by reach desc
limit 5;
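The `size(collect_set(...))` pattern in this query is Hive's idiom for a per-group distinct count: `collect_set` gathers the distinct userids seen for each track, and `size` counts them. Stripped to its essentials (keeping the column names from the query above):

```sql
-- "Reach" = number of distinct listeners per track.
select s.trackid, size(collect_set(s.userid)) as reach
from data_submissions s
group by s.trackid;
```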
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 52
2012-03-19 23:28:58,613 Stage-1 map = 0%, reduce = 0%
2012-03-19 23:29:08,765 Stage-1 map = 3%, reduce = 0%
2012-03-19 23:29:10,794 Stage-1 map = 9%, reduce = 0%
Born to Die         10765
Video Games          9382
Off to the Races     6569
Blue Jeans           6266
National Anthem      5795

~300 seconds
Examples
• Trends in UK Listening
• Hadoop User Group Charts
Trends in UK Listening
select artistid, hourOfDay,
  meanPlays, stdPlays, meanReach, stdReach, hoursInExistence,
  meanPlays / sqrt(hoursInExistence) as stdErrPlays,
  meanReach / sqrt(hoursInExistence) as stdErrReach
from (
  select artistCounts.artistid as artistid,
    artistCounts.hourOfDay,
    avg(artistCounts.plays) as meanPlays,
    stddev_samp(artistCounts.plays) as stdPlays,
    avg(artistCounts.reach) as meanReach,
    stddev_samp(artistCounts.reach) as stdReach,
    size(collect_set(concat(artistCounts.insertdate, hourOfDay))) as hoursInExistence
  from (
    select artistid, insertdate,
      hour(from_unixtime(unixtime)) as hourOfDay,
      count(*) as plays,
      size(collect_set(s.userid)) as reach
    from lookups_userid_geo g
    join data_submissions s on (g.userid = s.userid)
    where insertdate >= '2011-01-01'
      and insertdate < '2012-01-01'
      and (listen + scrobble) > 0
      and lower(g.countrycode) = 'gb'
    group by artistid, insertdate, hour(from_unixtime(unixtime))
  ) artistCounts
  group by artistCounts.artistid, artistCounts.hourOfDay
) artistStats
where meanReach > 25;
So far
• Test data: listening statistics for each artist, in each hour of the day
• Base data: averaged hourly statistics for each artist
• Next step: compare them
Comparison

select test.artistid,
  test.meanReach, base.meanReach,
  test.stdReach, base.stdReach,
  test.stdErrReach, base.stdErrReach,
  (test.meanReach - base.meanReach) / (base.stdReach) as zScore,
  (test.meanReach - base.meanReach) / (base.stdErrReach * test.stdErrReach) as deviation
from omar_uk_artist_base base
join omar_uk_artist_hours test on (base.artistid = test.artistid)
where test.hourOfDay = 15
order by deviation desc
limit 5;
Trends in UK Listening
Summary
• Hive is easy to use
• It sits comfortably on top of a Hadoop infrastructure
• Familiar if you know SQL
• Can ask big questions
• Can ask wide-ranging questions
• Allows analyses that would otherwise need a lot of preliminary work
HUG Charts
Any Questions?