Near Real Time Processing of Social Media Data with HBase

25
CC 2.0 by William Brawley | http://flic.kr/p/7PdUP3

description

Monitoring social media and news media in general is kind of building a search engine - a specific search engine called news agent. Where classical search engines are about to have all kind of content, newsagent focus on news and social media only. This is where content freshness matters - it has to be near real time. In order to process several million news with hundreds of Megabytes per day one has to choose a system architecture that is reliable on one hand and massive scalable at the other hand. Those requirements leads to a distributed architecture, a NoSQL approach. This talk will give you a brief overview about the Media Monitoring use case, the system architecture based on Hadoop and HBase, challenges and lessons learned.

Transcript of Near Real Time Processing of Social Media Data with HBase

Page 1: Near Real Time Processing of Social Media Data with HBase

CC 2.0 by William Brawley | http://flic.kr/p/7PdUP3

Page 2: Near Real Time Processing of Social Media Data with HBase

2

Agenda

•  Why Hadoop and HBase? •  Social Media Monitoring •  Prospective Search and Coprocessors

•  Challenges & Lessons Learned •  Resources to get started

August 31,

2012

Page 3: Near Real Time Processing of Social Media Data with HBase

3

About Sentric

•  Spin-off of MeMo News AG, the leading provider for Social Media Monitoring & Analytics in Switzerland

•  Big Data expert, focused on Hadoop, HBase and Solr

•  Objective: Transforming data into insights

August 31,

2012

Page 4: Near Real Time Processing of Social Media Data with HBase

CC 2.0 by Editor B| h"p://flic.kr/p/bcU5aD  

Page 5: Near Real Time Processing of Social Media Data with HBase

5

Social Media Monitoring Process

Why Hadoop and HBase?

August 31,

2012

Information Gathering

Information Processing

Analysis & Interpretation

Insight Presentation

Page 6: Near Real Time Processing of Social Media Data with HBase

6

Requirements

Why Hadoop and HBase?

August 31,

2012

SMM

Cost effective

High scalable

RT Alerting

Analytical capabilities

Reliable

Freshness

Page 7: Near Real Time Processing of Social Media Data with HBase

7

Hadoop

•  HDFS + MapReduce •  Based on Google Papers •  Distributed Storage and Computation

Framework •  Affordable Hardware, Free Software

•  Significant Adoption

Why Hadoop and HBase?

August 31,

2012

Page 8: Near Real Time Processing of Social Media Data with HBase

8

HBase

•  Non-Relational, Distributed Database •  Column-Oriented •  Multi-Dimensional •  High Availability •  High Performance •  Build on top of HDFS as storage layer

Why Hadoop and HBase?

August 31,

2012

Page 9: Near Real Time Processing of Social Media Data with HBase

9

Technology Stack

Why Hadoop and HBase?

August 31,

2012

HBase /HDFS Storage

Hadoop Mahout Analytics

Solr Search

HBase RowLog Event mechanism (MQ)

Prospective search Real-time alerting

Page 10: Near Real Time Processing of Social Media Data with HBase

CC 2.0 by nolifebeforecoffee | http://flic.kr/p/c1UTf

Page 11: Near Real Time Processing of Social Media Data with HBase

11

Overview

Social Media Monitoring

August 31,

2012

Search Agents

Downloaded Articles

Output

match?

RT Alerts Reports Web-UI

Icons by http://dryicons.com

Page 12: Near Real Time Processing of Social Media Data with HBase

12

Solution Architecture

Social Media Monitoring

August 31,

2012

REST

n News Agents

MySQL Solr

Web-UI

RT Alerts

Coprocessor

HBase

Icons by http://dryicons.com

Page 13: Near Real Time Processing of Social Media Data with HBase

13

Prospective Search with Coprocessors

Social Media Monitoring

August 31,

2012

Processing

HRegionServer

HRegion

Put operations

Prospective Search

RT Alerts

Icons by http://dryicons.com

Page 14: Near Real Time Processing of Social Media Data with HBase

14

Key Figures

•  Monthly growth •  Index: 200GB •  50 Mio. docs/month

•  HBase: 600 GB •  Raw data, meta data and extracted

data

•  A few 1000 map-reduce jobs/month

Social Media Monitoring

August 31,

2012

Page 15: Near Real Time Processing of Social Media Data with HBase

CC 2.0 by saebaryo | h"p://flic.kr/p/5T4t5L

Page 16: Near Real Time Processing of Social Media Data with HBase

16 1  Benchmarks - workloads 2  Supervision 3  Keys and shards – Schema design /LG 4  Timestamps, the 4th dimension 5  Short ColumnFamily names-> 6  File handles. OS 7  JVM Tuning, GC !!! 8  Scaling region servers, data locality! 9  Automatic vs manual splits, compaction 10  Do not use HBase as rock solid in prod 11  Forget feuerwehr aktionen, it takes some time 12  Use Hbase for a apropriate use case 13  Tune and tweak – it‘s not a project – it‘s a process 14  You need devops in production 15  Huge know-how curve, you need to know the hole ecosystem 16  Use a distribution, ist packed, tested and supports migration, enterprise grade 17  Virtualisierung, Hardware 18  Dont struggle to much, there is a good community 19  Share your knowledge 20  It‘s early state, many tools around, a few still missing

Challenges & Lessons Learned

August 31, 2012

Page 17: Near Real Time Processing of Social Media Data with HBase

17

Challenges

•  Everyone is still learning •  Some issues only appear at scale •  At scale, nothing works as advertised

•  Production cluster configuration •  Hardware issues •  Tuning cluster configuration to our work

loads

•  HBase stability •  Monitoring health of HBase

Challenges & Lessons Learned

August 31,

2012

Page 18: Near Real Time Processing of Social Media Data with HBase

18

Lessons - General

•  Do not rely on HBase as frontend storage layer. It’s not going to be rock solid

•  Don’t struggle to much, there is a good community

•  Share your knowledge •  It‘s early stage, many tools around, a

few still missing

Challenges & Lessons Learned

August 31,

2012

Page 19: Near Real Time Processing of Social Media Data with HBase

19

Lessons - Planning

•  Use HBase for an appropriate use case •  Use a distribution, its packed, tested and

supports migration, enterprise grade •  Benchmarks – know your workloads &

query patterns •  YCSB

•  Schema & Key Design •  What’s queried together should be stored

together •  Scaling region servers, data locality! •  Virtualization vs. Real Hardware

Challenges & Lessons Learned

August 31,

2012

Page 20: Near Real Time Processing of Social Media Data with HBase

20

Lessons - Performance Tuning

•  Number of CF < 10 •  Compaction + Flushing I/O intensive

•  Short ColumnFamily names •  HFile index size occupying aloc RAM (storefileindexSize)

•  OS file handles •  ulimit –n 32768

•  JVM Tuning, GC !!! •  HMaster 1024 MB •  RegionServer 8192 MB •  -XX:+UseConcMarkSweepGC •  -XX:+CMSIncrementalMode

•  Automatic vs. manual splits •  Be careful with expensive operations in coprocessors •  Play with all the configurations and benchmark for tuning

Challenges & Lessons Learned

August 31,

2012

Page 21: Near Real Time Processing of Social Media Data with HBase

21

Lessons - Operation

•  Monitoring/Operational tooling is most important

•  Forget “emergency actions”, it takes some time

•  Tune and tweak – it‘s not a project – it‘s a process

•  You need DevOps in production •  Huge know-how curve, you need to

know the whole ecosystem •  Hadoop, HDFS, MapRed

Challenges & Lessons Learned

August 31,

2012

Page 22: Near Real Time Processing of Social Media Data with HBase

22

Resources to get started

•  http://hbase.apache.org/book.html •  http://www.sentric.ch/blog/best-

practice-why-monitoring-hbase-is-important

•  http://www.sentric.ch/blog/hadoop-overview-of-top-3-distributions

•  http://www.sentric.ch/blog/hadoop-best-practice-cluster-checklist

•  http://outerthought.org/blog/465-ot.html

August 31,

2012

Page 23: Near Real Time Processing of Social Media Data with HBase

23

Thank you!

Questions? Christian Gügi, [email protected]

Jean-Pierre König, [email protected]

NoSQL Roadshow Basel

August 31,

2012

Page 24: Near Real Time Processing of Social Media Data with HBase

24

Cluster

Masters

August 31, 2012

Page 25: Near Real Time Processing of Social Media Data with HBase

25

Cluster

Worker

August 31, 2012