HBase at Mendeley

Dan Harvey, Data Mining Engineer ([email protected])

Description

The details behind how and why we use HBase in the data mining team at Mendeley.

Transcript of HBase at Mendeley

Page 1: HBase at Mendeley

HBase at Mendeley

Dan Harvey, Data Mining Engineer

[email protected]

Page 2: HBase at Mendeley

Overview

➔ What is Mendeley
➔ Why we chose HBase
➔ How we're using HBase
➔ Challenges

Page 3: HBase at Mendeley

Mendeley helps researchers work smarter

Page 4: HBase at Mendeley

Mendeley extracts research data..

Install Mendeley Desktop


Page 5: HBase at Mendeley

..and aggregates research data in the cloud


Pages 6-11: HBase at Mendeley (image-only slides)

Mendeley in numbers

➔ 600,000+ users
➔ 50+ million user documents

➔ Since January 2009

➔ 30 million unique documents
➔ De-duplicated from user and other imports

➔ 5TB of papers

Page 12: HBase at Mendeley

Data Mining Team

➔ Catalogue
➔ Importing
➔ Web Crawling
➔ De-duplication

➔ Statistics
➔ Related and recommended research
➔ Search

Page 13: HBase at Mendeley

Starting off

➔ User data in MySQL
➔ Normalised document tables

➔ Quite a few joins..

➔ Stuck with MySQL for data mining
➔ Clustering and de-duplication
➔ Got us to launch the article pages

Page 14: HBase at Mendeley

But..

➔ Re-process everything often
➔ Algorithms with global counts
➔ Modifying algorithms affects everything

➔ Iterating over tables was slow
➔ Could not easily scale processing
➔ Needed to shard for more documents
➔ Daily stats took > 24h to process...

Page 15: HBase at Mendeley

What we needed

➔ Scale to 100s of millions of documents
➔ ~80 million papers
➔ ~120 million books
➔ ~2-3 billion references

➔ More projects using data and processing
➔ Update the data more often
➔ Rapidly prototype and develop
➔ Cost effective

Page 16: HBase at Mendeley

So much choice..

But most of them miss out on good, scalable processing.

And many more...

Page 17: HBase at Mendeley

HBase and Hadoop

➔ Scalable storage
➔ Scalable processing
➔ Designed to work with map reduce

➔ Fast scans
➔ Incremental updates
➔ Flexible schema
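The "fast scans" point comes from HBase keeping rows sorted by key, so a scan between a start and stop row only touches that key range. A minimal sketch of the idea in Python (not the HBase API; table contents are made up):

```python
import bisect

# Rows kept sorted by key, as HBase does; values are {column: cell} maps.
rows = {
    "doc:aaa1": {"metadata:document": b"..."},
    "doc:bbb2": {"metadata:document": b"..."},
    "doc:ccc3": {"metadata:document": b"..."},
    "user:42":  {"profile:name": b"..."},
}

def scan(table, start_row, stop_row):
    """Yield (key, cells) for keys in [start_row, stop_row), like an HBase scan."""
    keys = sorted(table)
    lo = bisect.bisect_left(keys, start_row)
    hi = bisect.bisect_left(keys, stop_row)
    for key in keys[lo:hi]:
        yield key, table[key]

print([k for k, _ in scan(rows, "doc:", "doc:~")])
# ['doc:aaa1', 'doc:bbb2', 'doc:ccc3']
```

Because keys sharing a prefix are adjacent, a whole class of rows (all documents, say) can be read without touching the rest of the table.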

Page 18: HBase at Mendeley

Where HBase fits in

Page 19: HBase at Mendeley

How we store data

➔ Mostly documents
➔ Column Families for different data

➔ Metadata / raw pdf files
➔ More efficient scans

➔ Protocol Buffers for metadata
➔ Easy to manage 100+ fields
➔ Faster serialisation

Page 20: HBase at Mendeley

Example Schema

Row        Column family   Qualifiers
sha1_hash  metadata        document, date_added, date_modified, source, canonical_id, version_live
           content         pdf, full_text, entity_extraction

● All data for documents in one table
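The schema above can be sketched as a nested map: a SHA-1 of the file as the row key, and the two column families keeping small metadata cells apart from large content cells. The cell values here are hypothetical; the deck only names the families and qualifiers:

```python
import hashlib

# Hypothetical document bytes; the real row key is the sha1 of the paper.
pdf_bytes = b"%PDF-1.4 example"
row_key = hashlib.sha1(pdf_bytes).hexdigest()  # sha1_hash row key

# One table, one row per document, two column families.
row = {
    "metadata": {                      # small cells, scanned often
        "document": b"<protobuf bytes>",
        "date_added": b"2011-03-01",
        "date_modified": b"2011-03-02",
        "source": b"user_import",
        "canonical_id": b"12345",
        "version_live": b"1",
    },
    "content": {                       # large cells, kept out of metadata scans
        "pdf": pdf_bytes,
        "full_text": b"...",
        "entity_extraction": b"...",
    },
}
```

Separating the families means a scan over `metadata` never has to read the multi-megabyte PDFs stored under `content`.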

Page 21: HBase at Mendeley

How we process data

➔ Java Map Reduce
➔ More control over data flows
➔ Allows us to do more complex work

➔ Pig
➔ Don't have to think in map reduce
➔ Twitter's Elephant Bird decodes protocol buffers
➔ Enables rapid prototyping
➔ Less efficient than using Java map reduce

➔ Quick example...

Page 22: HBase at Mendeley

Example

➔ Trending keywords over time
➔ For a given keyword, how many documents per year?
➔ Multiple map/reduce tasks
➔ 100s of lines of Java...

Page 23: HBase at Mendeley

Pig Example

-- Load the document bag
rawDocs = LOAD 'hbase://canonical_documents'
    USING HbaseLoader('metadata:document')
    AS (protodoc);

-- De-serialise protocol buffer
docs = FOREACH rawDocs GENERATE
    DocumentProtobufBytesToTuple(protodoc) AS doc;

-- Get (keyword, year) tuples
tagYear = FOREACH docs GENERATE
    FLATTEN(doc.(year, keywords_bag)) AS keyword,
    doc::year AS year;

Page 24: HBase at Mendeley

-- Group unique (keyword, year) tuples
yearTag = GROUP tagYear BY (keyword, year);

-- Create (keyword, year, count) tuples
yearTagCount = FOREACH yearTag GENERATE
    FLATTEN(group) AS (keyword, year),
    COUNT(tagYear) AS count;

-- Group the counts by keyword
tagYearCounts = GROUP yearTagCount BY keyword;

-- Project to (keyword, {(year, count)}) tuples
tagYearCounts = FOREACH tagYearCounts GENERATE
    group AS keyword,
    yearTagCount.(year, count) AS years;

STORE tagYearCounts INTO 'tag_year_counts';
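The data flow of the Pig script above can be sketched in plain Python, which makes the FLATTEN and the two GROUPs easy to follow. The input documents here are made up; in the real pipeline they come out of the protocol buffer de-serialisation:

```python
from collections import defaultdict

# Hypothetical input: each document de-serialised to (year, keywords).
docs = [
    (2009, ["hbase", "hadoop"]),
    (2010, ["hbase"]),
    (2010, ["hbase", "pig"]),
]

# FLATTEN: one (keyword, year) tuple per keyword per document.
tag_year = [(kw, year) for year, kws in docs for kw in kws]

# GROUP ... BY (keyword, year), then COUNT each group.
year_tag_count = defaultdict(int)
for kw, year in tag_year:
    year_tag_count[(kw, year)] += 1

# GROUP the counts BY keyword -> {keyword: [(year, count), ...]}.
tag_year_counts = defaultdict(list)
for (kw, year), count in sorted(year_tag_count.items()):
    tag_year_counts[kw].append((year, count))

print(dict(tag_year_counts))
# {'hadoop': [(2009, 1)], 'hbase': [(2009, 1), (2010, 2)], 'pig': [(2010, 1)]}
```

Ten lines of scripting versus the "100s of lines of Java" the slide mentions is exactly the prototyping win Pig gives, at some cost in efficiency.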

Page 25: HBase at Mendeley

Challenges

➔ MySQL hard to export from
➔ Many joins slow things down
➔ Don't normalise if you don't have to!

➔ HBase needs memory
➔ Stability issues if you give it too little
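"Don't normalise if you don't have to" can be illustrated with a toy example (tables and values are hypothetical): a normalised layout forces a join on every read, while a denormalised document row answers the same question with a single lookup.

```python
# Normalised layout: reading a document needs lookups across tables.
authors = {1: "Smith", 2: "Jones"}
doc_authors = {"doc1": [1, 2]}
doc_titles = {"doc1": "HBase at scale"}

def read_normalised(doc_id):
    return {
        "title": doc_titles[doc_id],
        "authors": [authors[a] for a in doc_authors[doc_id]],  # the "join"
    }

# Denormalised layout: everything needed lives in the document row itself.
documents = {"doc1": {"title": "HBase at scale", "authors": ["Smith", "Jones"]}}

def read_denormalised(doc_id):
    return documents[doc_id]  # one row fetch, no joins

assert read_normalised("doc1") == read_denormalised("doc1")
```

The trade-off is duplicated data on write, which fits HBase's model of wide, self-contained rows far better than it fits MySQL.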

Page 26: HBase at Mendeley

Challenges: Hardware

➔ Knowing where to start is hard...
➔ 2x quad core Intel CPUs
➔ 4x 1TB disks
➔ Memory

➔ Started with 8GB, then 16GB
➔ Upgrading to 24GB soon

➔ Currently 15 nodes

Page 27: HBase at Mendeley

www.mendeley.com