Hive at Last.fm

Hive at Last.fm Omar Ali - Data Developer March 2012


Transcript of Hive at Last.fm

Page 1: Hive at Last.fm

Hive at Last.fm

Omar Ali - Data Developer
March 2012

Page 2: Hive at Last.fm

Overview

•  Hadoop at Last.fm
•  Hive
•  Examples

What I want to show you:
•  How it fits with a Hadoop infrastructure
•  Typical workflow with Hive
•  Ease of use for experiments and prototypes

Page 3: Hive at Last.fm

Hadoop

•  Brief overview of our infrastructure
•  How we use it

Page 4: Hive at Last.fm

Hadoop

64-node cluster

Page 5: Hive at Last.fm

Charts

Page 6: Hive at Last.fm
Page 7: Hive at Last.fm

Hive

•  What is Hive?
•  How does it fit in with the rest of our system?
•  Using existing data in Hive
•  Example query

Page 8: Hive at Last.fm

What is Hive?

•  Data Warehouse
•  You see your data in the form of tables
•  Query language very similar to SQL

hive> show tables like 'omar_charts_*';
OK
omar_charts_globaltags_album
omar_charts_globaltags_artist
omar_charts_globaltags_track
omar_charts_tagcloud_album
omar_charts_tagcloud_artist
omar_charts_tagcloud_track

hive> describe omar_charts_tagcloud_album;
OK
albumid  int
tagid    int
weight   double

Page 9: Hive at Last.fm

What is a table?

Standard

•  Metadata stored by Hive
•  Table data stored by Hive
•  Deleting the table deletes the data and the metadata

External

•  Metadata stored by Hive
•  Table data referenced by Hive
•  Deleting the table only deletes the metadata
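The difference between the two table types shows up in the DDL. The following is a minimal HiveQL sketch, not taken from the talk: the table names `demo_managed` and `demo_external`, the tab delimiter and the HDFS path are all hypothetical, and the columns simply mirror the `describe` output shown earlier.

```sql
-- Standard (managed) table: Hive stores the metadata and also owns the
-- data files, which live under its warehouse directory.
CREATE TABLE demo_managed (
    albumid INT,
    tagid   INT,
    weight  DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- External table: Hive stores only the metadata and reads the files in place.
-- The LOCATION path is hypothetical.
CREATE EXTERNAL TABLE demo_external (
    albumid INT,
    tagid   INT,
    weight  DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/demo/tagcloud_album';

-- Dropping the managed table deletes its data files along with the metadata.
DROP TABLE demo_managed;

-- Dropping the external table removes only the metadata;
-- the files under LOCATION are left untouched.
DROP TABLE demo_external;
```

External tables suit data that other jobs also read or write, since dropping the Hive table cannot delete the underlying files.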

Page 10: Hive at Last.fm

Log  Files  Database  Tables  

Page 11: Hive at Last.fm

Example: scrobbles

Scrobble Log:
13364451  30886670  217803052  358001787  0  0  0  1  0  0  1319068581
42875138  1717      3776668    4641276    0  0  0  1  0  0  1319068445
43108664  1003811   2237730    1019632    0  0  0  1  0  0  1319068783
36107186  1033304   2393940    13409429   0  0  0  0  0  1  1319068524
23842745  1261965   2349564    14091069   0  0  0  0  0  1  1319068594

Directory Structure:
/data/submissions/2002/01/01
...
/data/submissions/2012/03/20
/data/submissions/2012/03/21
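A date-based directory layout like this maps naturally onto a partitioned external table, with one partition per day. The sketch below is an assumption, not the schema used at Last.fm: the table name `demo_raw_submissions` is hypothetical, and because the talk does not give the column layout of the log, each record is exposed as a single `line` string. This works with Hive's default field delimiter (Ctrl-A) as long as that character does not occur in the log lines. The partition column name `insertdate` is borrowed from the queries in the talk.

```sql
-- Hypothetical external table over the existing log directories.
-- Each log line is read whole into the single STRING column.
CREATE EXTERNAL TABLE demo_raw_submissions (
    line STRING
)
PARTITIONED BY (insertdate STRING)
LOCATION '/data/submissions';

-- Register one day's directory as a partition. Nothing is copied or moved;
-- Hive only records where the files for that date live.
ALTER TABLE demo_raw_submissions
    ADD PARTITION (insertdate = '2012-03-21')
    LOCATION '/data/submissions/2012/03/21';

-- Filtering on the partition column restricts the scan to that directory.
SELECT count(*)
FROM demo_raw_submissions
WHERE insertdate = '2012-03-21';
```

With this layout, each new day of logs becomes queryable by adding one partition, without rewriting any data.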

Page 12: Hive at Last.fm

A Hive Query

select
    track.title, size(collect_set(s.userid)) as reach
from
    meta_track track
    join data_submissions s on (s.trackid = track.id)
where
    s.insertdate = "2012-03-01" and (s.scrobble + s.listen > 0)
    and s.artistid = 57976724 -- Lana Del Rey
group by
    track.title
order by
    reach desc
limit 5;

Page 13: Hive at Last.fm

A Hive Query

Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 52
2012-03-19 23:28:58,613 Stage-1 map = 0%,  reduce = 0%
2012-03-19 23:29:08,765 Stage-1 map = 3%,  reduce = 0%
2012-03-19 23:29:10,794 Stage-1 map = 9%,  reduce = 0%

Page 14: Hive at Last.fm

A Hive Query

Born to Die       10765
Video Games       9382
Off to the Races  6569
Blue Jeans        6266
National Anthem   5795

~300 seconds
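Reach here is the number of distinct users who played each track: `collect_set` gathers the distinct user IDs in each group and `size` counts them. The sketch below expresses the same count with `count(distinct ...)`. It assumes the same tables and columns as the query from the talk, and it is an alternative phrasing rather than the query that was actually run.

```sql
select
    track.title,
    count(distinct s.userid) as reach  -- distinct users per track title
from
    meta_track track
    join data_submissions s on (s.trackid = track.id)
where
    s.insertdate = '2012-03-01' and (s.scrobble + s.listen > 0)
    and s.artistid = 57976724 -- Lana Del Rey
group by
    track.title
order by
    reach desc
limit 5;
```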

Page 15: Hive at Last.fm

Examples

•  Trends in UK Listening
•  Hadoop User Group Charts

Page 16: Hive at Last.fm

Trends in UK Listening

Page 17: Hive at Last.fm

Trends in UK Listening

Page 18: Hive at Last.fm

Trends in UK Listening

Page 19: Hive at Last.fm

select
    artistid, hourOfDay,
    meanPlays, stdPlays, meanReach, stdReach, hoursInExistence,
    meanPlays / sqrt(hoursInExistence) as stdErrPlays,
    meanReach / sqrt(hoursInExistence) as stdErrReach
from
    (select
        artistCounts.artistid as artistid, artistCounts.hourOfDay,
        avg(artistCounts.plays) as meanPlays, stddev_samp(artistCounts.plays) as stdPlays,
        avg(artistCounts.reach) as meanReach, stddev_samp(artistCounts.reach) as stdReach,
        size(collect_set(concat(artistCounts.insertdate, hourOfDay))) as hoursInExistence
    from
        (select
            artistid, insertdate, hour(from_unixtime(unixtime)) as hourOfDay,
            count(*) as plays, size(collect_set(s.userid)) as reach
        from
            lookups_userid_geo g
            join data_submissions s on (g.userid = s.userid)
        where
            insertdate >= '2011-01-01' and insertdate < '2012-01-01'
            and (listen + scrobble) > 0
            and lower(g.countrycode) = 'gb'
        group by
            artistid, insertdate, hour(from_unixtime(unixtime))
        ) artistCounts
    group by
        artistCounts.artistid, artistCounts.hourOfDay
    ) artistStats
where
    meanReach > 25;

Page 20: Hive at Last.fm

Page 21: Hive at Last.fm

Page 22: Hive at Last.fm

So far

•  Test data: listening statistics for each artist, in each hour of the day
•  Base data: averaged hourly statistics for each artist
•  Next step: compare them

Page 23: Hive at Last.fm

Comparison

select
    test.artistid,
    test.meanReach, base.meanReach,
    test.stdReach, base.stdReach,
    test.stdErrReach, base.stdErrReach,
    (test.meanReach - base.meanReach) / (base.stdReach) as zScore,
    (test.meanReach - base.meanReach) / (base.stdErrReach * test.stdErrReach) as deviation
from
    omar_uk_artist_base base
    join omar_uk_artist_hours test on (base.artistid = test.artistid)
where
    test.hourOfDay = 15
order by
    deviation desc
limit 5;
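For reference, the `zScore` column corresponds to the usual z-score of the test mean against the base distribution. The symbols below are introduced here for illustration and do not appear in the talk: $\bar{x}$ is a mean reach, $s$ a sample standard deviation and $n$ the number of hourly observations (`hoursInExistence`). The second formula is the textbook standard error of the mean. The queries as transcribed divide the mean, rather than the standard deviation, by $\sqrt{n}$, so their `stdErr` columns may not match this definition.

```latex
\[
z = \frac{\bar{x}_{\text{test}} - \bar{x}_{\text{base}}}{s_{\text{base}}},
\qquad
\mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}}
\]
```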

Page 24: Hive at Last.fm

Trends in UK Listening

Page 25: Hive at Last.fm

Summary

•  Hive is easy to use
•  It sits comfortably on top of a Hadoop infrastructure
•  Familiar if you know SQL
•  Can ask big questions
•  Can ask wide-ranging questions
•  Allows analyses that would otherwise need a lot of preliminary work

Page 26: Hive at Last.fm

HUG Charts

Page 27: Hive at Last.fm

Any Questions?