Hive at Last.fm
Hive at Last.fm
Omar Ali - Data Developer
March 2012
Overview
• Hadoop at Last.fm
• Hive
• Examples

What I want to show you:
• How it fits with a Hadoop infrastructure
• Typical workflow with Hive
• Ease of use for experiments and prototypes
Hadoop
• Brief overview of our infrastructure
• How we use it
Hadoop
64-node cluster
Charts
Hive
• What is Hive?
• How does it fit in with the rest of our system?
• Using existing data in Hive
• Example query
What is Hive?
• Data warehouse
• You see your data in the form of tables
• Query language very similar to SQL
hive> show tables like 'omar_charts_*';
OK
omar_charts_globaltags_album
omar_charts_globaltags_artist
omar_charts_globaltags_track
omar_charts_tagcloud_album
omar_charts_tagcloud_artist
omar_charts_tagcloud_track
hive> describe omar_charts_tagcloud_album;
OK
albumid    int
tagid      int
weight     double
What is a table?

Standard
• Metadata stored by Hive
• Table data stored by Hive
• Deleting the table deletes the data and the metadata

External
• Metadata stored by Hive
• Table data referenced by Hive
• Deleting the table only deletes the metadata
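The distinction can be sketched in HiveQL. The table names, columns, and HDFS path below are illustrative, not taken from the deck:

```sql
-- Standard (managed) table: Hive stores both the metadata and the data;
-- DROP TABLE removes the files as well as the schema.
CREATE TABLE charts_example (
  albumid INT,
  tagid   INT,
  weight  DOUBLE
);

-- External table: Hive stores only metadata and references data already
-- sitting on HDFS; DROP TABLE leaves the underlying files untouched.
CREATE EXTERNAL TABLE submissions_example (
  userid  INT,
  trackid INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/submissions_example';
```

External tables are what make the "using existing data in Hive" workflow possible: logs already on HDFS can be queried without copying them.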
[Diagram: Log Files, Database, Tables]
Example: scrobbles

Scrobble log:
13364451 30886670 217803052 358001787 0 0 0 1 0 0 1319068581
42875138 1717 3776668 4641276 0 0 0 1 0 0 1319068445
43108664 1003811 2237730 1019632 0 0 0 1 0 0 1319068783
36107186 1033304 2393940 13409429 0 0 0 0 0 1 1319068524
23842745 1261965 2349564 14091069 0 0 0 0 0 1 1319068594

Directory structure:
/data/submissions/2002/01/01
...
/data/submissions/2012/03/20
/data/submissions/2012/03/21
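One common way to expose dated log directories like these to Hive is a partitioned external table. The sketch below is hypothetical (the column names and types are guesses, not the actual Last.fm table definition):

```sql
-- Hypothetical table over the scrobble logs; one partition per day.
CREATE EXTERNAL TABLE submissions_sketch (
  userid   INT,
  trackid  INT,
  unixtime BIGINT
)
PARTITIONED BY (insertdate STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Each day's directory is registered as a partition, so a query that
-- filters on insertdate only reads the matching directories.
ALTER TABLE submissions_sketch
ADD PARTITION (insertdate = '2012-03-21')
LOCATION '/data/submissions/2012/03/21';
```

This is why the queries that follow can filter on `insertdate` cheaply: the date predicate prunes whole directories rather than scanning every log file.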
A Hive Query

select track.title, size(collect_set(s.userid)) as reach
from meta_track track
join data_submissions s on (s.trackid = track.id)
where s.insertdate = '2012-03-01'
  and (s.scrobble + s.listen > 0)
  and s.artistid = 57976724 -- Lana Del Rey
group by track.title
order by reach desc
limit 5;
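The `size(collect_set(...))` pattern in this query is Hive's idiom for a per-group distinct count: `collect_set` gathers the distinct userids seen for each track, and `size` counts them. Stripped to its essentials (keeping the column names from the query above):

```sql
-- "Reach" = number of distinct listeners per track.
select s.trackid, size(collect_set(s.userid)) as reach
from data_submissions s
group by s.trackid;
```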
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 52
2012-03-19 23:28:58,613 Stage-1 map = 0%, reduce = 0%
2012-03-19 23:29:08,765 Stage-1 map = 3%, reduce = 0%
2012-03-19 23:29:10,794 Stage-1 map = 9%, reduce = 0%
Born to Die         10765
Video Games          9382
Off to the Races     6569
Blue Jeans           6266
National Anthem      5795

~300 seconds
Examples
• Trends in UK Listening
• Hadoop User Group Charts
Trends in UK Listening
select artistid, hourOfDay,
  meanPlays, stdPlays, meanReach, stdReach, hoursInExistence,
  meanPlays / sqrt(hoursInExistence) as stdErrPlays,
  meanReach / sqrt(hoursInExistence) as stdErrReach
from (
  select artistCounts.artistid as artistid,
    artistCounts.hourOfDay,
    avg(artistCounts.plays) as meanPlays,
    stddev_samp(artistCounts.plays) as stdPlays,
    avg(artistCounts.reach) as meanReach,
    stddev_samp(artistCounts.reach) as stdReach,
    size(collect_set(concat(artistCounts.insertdate, hourOfDay))) as hoursInExistence
  from (
    select artistid, insertdate,
      hour(from_unixtime(unixtime)) as hourOfDay,
      count(*) as plays,
      size(collect_set(s.userid)) as reach
    from lookups_userid_geo g
    join data_submissions s on (g.userid = s.userid)
    where insertdate >= '2011-01-01'
      and insertdate < '2012-01-01'
      and (listen + scrobble) > 0
      and lower(g.countrycode) = 'gb'
    group by artistid, insertdate, hour(from_unixtime(unixtime))
  ) artistCounts
  group by artistCounts.artistid, artistCounts.hourOfDay
) artistStats
where meanReach > 25;
So far
• Test data: listening statistics for each artist, in each hour of the day
• Base data: averaged hourly statistics for each artist
• Next step: compare them
Comparison

select test.artistid,
  test.meanReach, base.meanReach,
  test.stdReach, base.stdReach,
  test.stdErrReach, base.stdErrReach,
  (test.meanReach - base.meanReach) / (base.stdReach) as zScore,
  (test.meanReach - base.meanReach) / (base.stdErrReach * test.stdErrReach) as deviation
from omar_uk_artist_base base
join omar_uk_artist_hours test on (base.artistid = test.artistid)
where test.hourOfDay = 15
order by deviation desc
limit 5;
Trends in UK Listening
Summary
• Hive is easy to use
• It sits comfortably on top of a Hadoop infrastructure
• Familiar if you know SQL
• Can ask big questions
• Can ask wide-ranging questions
• Allows analyses that would otherwise need a lot of preliminary work
HUG Charts
Any Questions?