Hypercubes In Hbase

23
Hypercubes in HBase Fredrik Möllerstrand <[email protected] > Hadoop User Group UK, April 14 2009

Transcript of Hypercubes In Hbase

Page 1: Hypercubes In Hbase

Hypercubes in HBaseFredrik Möllerstrand <[email protected]>Hadoop User Group UK, April 14 2009

Page 2: Hypercubes In Hbase

Hello.

• Per Andersson, Fredrik Möllerstrand

• Chalmers University of Technology, Sweden

• Master thesis at last.fm

Page 3: Hypercubes In Hbase

stats.last.fm• Web statistics for in-house use.

• Served out of mysql.

Page 4: Hypercubes In Hbase

stats.last.fm

• y-axis: pageviews.

• x-axis: time.

• also: countries.

Page 5: Hypercubes In Hbase

stats.last.fm

SELECT pageviews, countryFROM webstatisticsGROUP BY country;

Page 6: Hypercubes In Hbase

SQL: Star schema• Facts table & dimension tables.

• Joins!

Page 7: Hypercubes In Hbase

Data Cube Foundations

• n-dimensional cube

• attribute => dimension

• attribute value => measurement

Page 8: Hypercubes In Hbase

Data Cube Foundations

• dimensionality reductions

• projections

Page 9: Hypercubes In Hbase

Data Cube Foundations

• Aggregation: sum, count, average, &c.

• Data cubes modeled as: in RDBMSs, modeled as star schemas. in HBase, modeled with column-families.

Page 10: Hypercubes In Hbase

Data Cubes in HBase

• Store projections,not distinct dimensions.

• Pre-compute *everything*.

Page 11: Hypercubes In Hbase

Data Cubes in HBase

• Rowkey: unit + timei.e. ‘pageviews-20090414’

• One column-family for every projection.i.e. ‘country-useragent’

• One qualifier per point in n-space.i.e. ‘US-safari’, ‘NO-opera’, &c.

Page 12: Hypercubes In Hbase

The SQL-DB Problem

• Too much data to keep in memory.

• Plenty of joins makes queries complex.

• Can’t serve at mouse click rate.

Page 13: Hypercubes In Hbase

The Solution;A data store that is:

• Distributed

• Multi-dimensional

• Magnetic(!)

• Just general enough

Page 14: Hypercubes In Hbase

Enter: Zohmg.

Page 15: Hypercubes In Hbase

Zohmg;A data store that is:

• Distributed

• Multi-dimensional

• Time-series-based

• Magnetic(!)

• Just general enough

Page 16: Hypercubes In Hbase

Tech

• Rides on the back of Dumbo.

• Stores aggregates in HBase.

• Serves JSON.

Page 17: Hypercubes In Hbase

Zohmg

• $> setup.py# create hbase database.

• $> import.py --mapper weblogs.py# run dumbo job.

• $> serve.py# start web server.

Page 18: Hypercubes In Hbase

Developers, developers.

• Configuration - yaml.

• Mapper - python.

Page 19: Hypercubes In Hbase

User’s configuration.project_name: webmetrics

dimensions: - country - domain - useragent - usertype

units: - pageviews

projections: country: - country domain-usertype: - domain - usertype country-domain-useragent-usertype: - country - domain - useragent - usertype

Page 20: Hypercubes In Hbase

User’s mapper.

def map(key, value): from lfm.data.parse import web

log = web.parse(value)

dimensions = {'country' : geoip(log.host), 'domain' : log.domain, 'useragent' : classify(log.useragent), 'usertype' : ("user", "anon")[log.userid == None] } values = {'pageviews' : 1} yield log.timestamp, dimensions, values

Page 21: Hypercubes In Hbase

Example.

Page 22: Hypercubes In Hbase

Dimensions in HBase• Column-family:

country-useragent-domain

• Qualifier:US-firefox-last.fm

Page 23: Hypercubes In Hbase

Questions?