HPCC Systems vs Hadoop

Big Data Architectural Overview, by Fujio Turner (@FujioTurner)


A detailed comparison of the best-known Big Data solutions: HPCC Systems and Hadoop.

Transcript of HPCC Systems vs Hadoop

Page 1: HPCC Systems vs Hadoop

Big Data Architectural Overview

By Fujio Turner (@FujioTurner)

Page 2: HPCC Systems vs Hadoop

LexisNexis is a provider of legal, tax, regulatory, news, business information, and analysis to legal, corporate, government, accounting and academic markets.

LexisNexis has been in business since 1977 with over 30,000 employees worldwide. 

Who is LexisNexis? What is HPCC Systems?

LexisNexis Risk is the division of LexisNexis that focuses on data, Big Data processing, linking and vertical expertise. It supports HPCC Systems as an open source project under the Apache 2.0 License.

http://hpccsystems.com/

Page 3: HPCC Systems vs Hadoop

Comparison

JAVA C++

Petabytes

1-80,000 Jobs/day

Since 2005

Exabytes

Since 2000

Indexed: 2K-3K Jobs/sec*


Thor Roxie

Block Based File Based

In-Memory: 30 - 40 Jobs/min*

Non-Indexed: 4-1,040,000 Jobs/day

* Based on the job (size / result set / complexity)

Page 4: HPCC Systems vs Hadoop

[Chart: Business, Development, Customers; scale 1 to 20]

Non-Indexed Full Data Set

http://hpccsystems.com/why-hpcc/benchmarks

Page 5: HPCC Systems vs Hadoop

Thor: “I can query all or part of your data.”

Single-Threaded, Hard Disk, Index (optional)

Roxie: “I’m sub-second fast.”

Multi-Threaded, Hard Disk, Index (optional), In-memory, SSD

Either/Both

Cluster Architecture

Page 6: HPCC Systems vs Hadoop

Example

How do the platforms handle the same data?

Customer Data May 2010 (300GB File)

Name State Age

Kevin CA 45

Mark MI 27

Sara FL 64

Page 7: HPCC Systems vs Hadoop

Name Node

Data Nodes

Blocks a, b, c (big blocks)

Kevin CA 45 Mark MI 27 Sara FL 64

Store Data

Data is stored in random blocks.

Page 8: HPCC Systems vs Hadoop

Name Node

Data Nodes

block a = server 1, block b = server 2, block c = server 3

Blocks a, b, c (big blocks)

Kevin CA 45 Mark MI 27 Sara FL 64

Store Data

Block locations are stored in memory.
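The two storage slides above can be pictured with a small sketch: a file is cut into fixed-size blocks without regard to record boundaries, and a lookup table held in memory records which server holds each block. The function names, the tiny block size, and the round-robin placement are illustrative assumptions, not how HDFS actually places blocks.

```python
# Hypothetical sketch of block-based storage with an in-memory block map.
# Round-robin placement and the 12-byte block size are assumptions made
# only to keep the example small.

def split_into_blocks(data, block_size):
    """Cut raw bytes into fixed-size blocks, ignoring record boundaries."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_blocks(blocks, servers):
    """Build the in-memory map: block id -> server that stores it."""
    return {block_id: servers[block_id % len(servers)]
            for block_id in range(len(blocks))}

data = b"Kevin CA 45\nMark MI 27\nSara FL 64\n"
blocks = split_into_blocks(data, block_size=12)
block_map = assign_blocks(blocks, ["server 1", "server 2", "server 3"])

for block_id, server in block_map.items():
    print(block_id, server, blocks[block_id])
```

Note that the second and third blocks start in the middle of a record, which is why a later query has to scan blocks rather than jump to a known row.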

Page 9: HPCC Systems vs Hadoop

K.. CA 45 M.. MI 27 S.. FL 64

Thor Master

Thor Slaves

Kevin CA 45 Mark MI 27 Sara FL 64

Store Data

File Name ~/customers_2010-05

Data is distributed evenly across the cluster with replica copies and is seen as a single file (example below).
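A minimal sketch of the idea on this slide, assuming round-robin distribution of records and a replica of each part kept on the neighbouring node. Both choices are assumptions for illustration; the slides do not state how Thor actually partitions data or places replicas.

```python
# Hypothetical sketch: spread records evenly over slave nodes and keep a
# replica of each part on the next node. The partitioning and replica
# placement rules here are assumptions, not HPCC Systems behaviour.

def distribute(records, num_slaves):
    """Round-robin records into one part per slave node."""
    parts = [[] for _ in range(num_slaves)]
    for i, record in enumerate(records):
        parts[i % num_slaves].append(record)
    return parts

def with_replicas(parts):
    """Each node holds its own part plus a copy of the previous node's part."""
    n = len(parts)
    return {node: {"primary": parts[node], "replica": parts[(node - 1) % n]}
            for node in range(n)}

records = ["Kevin CA 45", "Mark MI 27", "Sara FL 64"]
layout = with_replicas(distribute(records, num_slaves=3))

# The logical file is still one file: concatenating the primary parts
# in node order gives back every record.
logical_file = [r for node in sorted(layout) for r in layout[node]["primary"]]
```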

Page 10: HPCC Systems vs Hadoop

K.. CA 45 M.. MI 27 S.. FL 64

Thor Master

Thor Slaves

Kevin CA 45 Mark MI 27 Sara FL 64

Store Data

Dali

File Location & Job Scheduler

File locations are stored on disk.

File Name ~/customers_2010-05

Page 11: HPCC Systems vs Hadoop

Blocks a, b, c

Name Node

Data Nodes

What state do most people live in?

Blocks are scanned for the wanted data.

Page 12: HPCC Systems vs Hadoop

Blocks a, b, c

Name Node

Data Nodes

Mapper

What state do most people live in?

CA 1, FL 1, MI 1, FL 1, CA 1, MI 1, MI ..

Found data is sent to the Mapper(s) as key/value pairs and stored.

Page 13: HPCC Systems vs Hadoop

Blocks a, b, c

Name Node

Data Nodes

Mapper

Reducer

CA 120, MI 500, FL 7

What state do most people live in?

CA 1, FL 1, MI 1, FL 1, CA 1, MI 1, MI ..

Stored data is sent to the Reducer(s) to be aggregated.
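The map and reduce steps described on the last three slides can be sketched as follows. This is a single-process illustration only: the `mapper` and `reducer` functions, the block contents, and the extra record "Ann MI 31" are hypothetical and stand in for what a real Hadoop job would run across many nodes.

```python
# Hypothetical single-process sketch of the map and reduce steps.
from collections import defaultdict

def mapper(block):
    """Scan every line of a block and emit a (state, 1) key/value pair."""
    for line in block:
        _name, state, _age = line.split()
        yield state, 1

def reducer(pairs):
    """Aggregate the stored key/value pairs into one count per state."""
    totals = defaultdict(int)
    for state, count in pairs:
        totals[state] += count
    return dict(totals)

# Two small blocks; "Ann MI 31" is an invented record so one state wins.
blocks = [["Kevin CA 45", "Mark MI 27"], ["Sara FL 64", "Ann MI 31"]]

pairs = [pair for block in blocks for pair in mapper(block)]
counts = reducer(pairs)
top_state = max(counts, key=counts.get)
print(counts, top_state)
```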

Page 14: HPCC Systems vs Hadoop

Blocks a, b, c

Name Node

Data Nodes

Mapper

Reducer

CA 120, MI 500, FL 7

What state do most people live in?

CA 1, FL 1, MI 1, FL 1, CA 1, MI 1, MI ..

SSDs cannot be used in the Mapper or Reducer because of too many writes.

Page 15: HPCC Systems vs Hadoop

K CA 45 M MI 27 S FL 64

Thor Master

Thor Slaves

Dali

What state do most people live in?

ESP

File Location & Job Scheduler

1a. A pre-compiled query is triggered (mostly used in Roxie).

1b. Ad-hoc query.

2. The query is sent to Dali to get file locations.

Page 16: HPCC Systems vs Hadoop

K CA 45 M MI 27 S FL 64

Thor Master

Thor Slaves

Dali

What state do most people live in?

ESP

File Location & Job Scheduler

3. The job is placed in a queue to be sent to the Thor Master. The Thor Master coordinates job execution on the Thor Slave nodes.

Page 17: HPCC Systems vs Hadoop

K CA 45 M MI 27 S FL 64

Thor Master

Thor Slaves

Dali

What state do most people live in?

ESP

File Location & Job Scheduler

Jobs are run locally on the slaves and/or coordinated globally by the master.

Page 18: HPCC Systems vs Hadoop

K CA 45 M MI 27 S FL 64

Thor Master

Thor Slaves

Dali

What state do most people live in?

ESP

MI 500, CA 120, FL 7

File Location & Job Scheduler

4. The job result is returned, optionally grouped and sorted at run time.
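The Thor flow on the preceding slides (work done locally on each slave, coordinated globally by the master, result returned grouped and sorted) might be pictured like this. The function names and sample parts are hypothetical, and the sketch runs in one process; it is not ECL and not how Thor is implemented.

```python
# Hypothetical sketch: each slave counts its own part locally, then a
# master-side step merges the partial counts and sorts the result.
from collections import Counter

def local_count(part):
    """Work done on one slave: count rows per state in the local part."""
    return Counter(state for _name, state, _age in part)

def global_merge(local_counts):
    """Work coordinated by the master: merge, then sort by count descending."""
    total = Counter()
    for counts in local_counts:
        total.update(counts)
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))

# "Ann" is an invented record so that one state has more than one row.
parts = [
    [("Kevin", "CA", 45), ("Mark", "MI", 27)],
    [("Sara", "FL", 64), ("Ann", "MI", 31)],
]
result = global_merge(local_count(part) for part in parts)
print(result)
```

Because each part's location is already known, no scan-and-ship step is needed before the local work starts; only the small partial counts travel to the master.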

Page 19: HPCC Systems vs Hadoop

K CA 45 M MI 27 S FL 64

Thor Master

Thor Slaves

Dali

What state do most people live in?

ESP

MI 500 CA 120 FL 7

File Location & Job Scheduler

SORT, GROUP, DEDUP, JOIN, MERGE, BETWEEN, LENGTH, REGEX, ROUND, SUM, COUNT, TRIM, WHEN, AVE, CASE, NORMALIZE, DENORMALIZE, K-MEANS, more ...

Multiple other actions can be done on the data in a single job.

Page 20: HPCC Systems vs Hadoop

Block a

K CA 45

a b c

K CA 45

Closer Look at Finding Data

Full block is scanned to find your data. Blocks can be many terabytes in size.

Page 21: HPCC Systems vs Hadoop

Block a

K CA 45

a b c

K CA 45

Closer Look at Finding Data

CA , 1

When data is found, it is sent to the mapper.

Page 22: HPCC Systems vs Hadoop

Block a

K CA 45

a b c

K CA 45

Closer Look at Finding Data

K CA 45

Data location is known.

“Apply Schema on Read” at query time.

Data is processed locally.

Name State Age
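"Apply Schema on Read" can be illustrated with a short sketch: the stored data stays as raw text, and field names and types are applied only when a query runs. The `SCHEMA` list and `apply_schema` helper are hypothetical names for this illustration; in HPCC Systems the schema would be declared in ECL rather than Python.

```python
# Hypothetical sketch of schema-on-read: raw lines are stored untouched and
# a schema (field names and types) is applied only at query time.

SCHEMA = [("name", str), ("state", str), ("age", int)]

def apply_schema(raw_line, schema=SCHEMA):
    """Turn one raw, whitespace-separated line into a typed record."""
    values = raw_line.split()
    return {field: cast(value) for (field, cast), value in zip(schema, values)}

raw_lines = ["Kevin CA 45", "Mark MI 27", "Sara FL 64"]

# The schema is applied while the query runs, not when the data is stored.
rows = [apply_schema(line) for line in raw_lines]
over_40 = [row["name"] for row in rows if row["age"] > 40]
print(over_40)
```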

Page 23: HPCC Systems vs Hadoop

Block a

K CA 45

a b c

File size can be a few bytes to 4 exabytes with no limits on the total number of files that can be stored.

K CA 45

Closer Look at Finding Data

K CA 45

Page 24: HPCC Systems vs Hadoop

Speed (2013)

Block a

Memory: 128GB - 1TB

Disk: 8TB - 16TB or more

1.5 - 12.5% of the data is in memory, and only recently used data is kept there.
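The percentage range on this slide appears to follow from dividing the memory sizes by the smaller disk size. The slide does not spell out the calculation, so the reading below (memory as a share of an 8TB disk, with 1TB taken as 1024GB) is an assumption.

```python
# Worked arithmetic, assuming the slide's range is memory / disk for 8TB.
GB = 1
TB = 1024 * GB

def memory_share(memory_gb, disk_gb):
    """Percentage of the on-disk data that fits in memory."""
    return 100.0 * memory_gb / disk_gb

low = memory_share(128 * GB, 8 * TB)    # smallest memory, 8TB disk
high = memory_share(1 * TB, 8 * TB)     # largest memory, 8TB disk
low_16 = memory_share(128 * GB, 16 * TB)

print(low, high, low_16)
```

With a 16TB disk the share drops below 1%, so the slide's lower bound of about 1.5% corresponds to the 8TB case.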

Page 25: HPCC Systems vs Hadoop

K CA 45 M MI 27 S FL 64

Thor Master

Thor Slaves

Kevin CA 45 Mark MI 27 Sara FL 64

CA row #3 MI row #17 MI row #4 FL row #5

Speed - Part 1: Indexing

Index Index Index

One index per file, customizable by field(s)

File Name ~/customers_2010-05

File Name ~/customers_2010-05_index
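A minimal sketch of a per-file index keyed on a chosen field, mapping each value to the row numbers where it occurs. The `build_index` helper and the in-memory dictionary are assumptions for illustration; HPCC Systems builds its index files with ECL and stores them as separate files, as the file names above suggest.

```python
# Hypothetical sketch: build an index on one field of a file, mapping each
# value to the row numbers that contain it, then use it for a lookup.
from collections import defaultdict

def build_index(rows, field):
    """Return {field value: [row numbers]} for the chosen field."""
    index = defaultdict(list)
    for row_number, row in enumerate(rows):
        index[row[field]].append(row_number)
    return dict(index)

rows = [
    {"name": "Kevin", "state": "CA", "age": 45},
    {"name": "Mark", "state": "MI", "age": 27},
    {"name": "Sara", "state": "FL", "age": 64},
]

state_index = build_index(rows, "state")

# A lookup touches only the rows the index points at, not the whole file.
mi_rows = [rows[i] for i in state_index.get("MI", [])]
print(state_index, mi_rows)
```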

Page 26: HPCC Systems vs Hadoop

Non-Indexed: 1 to 40

Indexed: 1 to 200

Page 27: HPCC Systems vs Hadoop

Non-Indexed: 1 to 40

Indexed: 1 to 200+

Example Index: male row #345, female row #4, male row #97, female row #267

Example Index: CA row #3, MI row #17, MI row #4, FL row #5

Page 28: HPCC Systems vs Hadoop

Speed - Part 2: Roxie

K CA 45 M MI 27 S FL 64

Roxie Master

Roxie Slaves

Index In-Memory

Index Index Index

Page 29: HPCC Systems vs Hadoop

Speed - Part 2: Roxie

K CA 45 M MI 27 S FL 64

Roxie Master

Roxie Slaves

Index Index Index

Index In-Memory & Part or All Data

or

Index In-Memory

Page 30: HPCC Systems vs Hadoop

Speed - Part 2: Roxie

K CA 45 M MI 27 S FL 64

Roxie Master

Roxie Slaves

Roxie is Multi-Threaded

Index Index Index

Index In-Memory & Part or All Data

or

Index In-Memory

Page 31: HPCC Systems vs Hadoop

Speed - Part 2: Roxie

K CA 45 M MI 27 S FL 64

Roxie Master

Roxie Slaves

Roxie is Multi-Threaded

Index Index Index

Index In-Memory & Part or All Data

or

Index In-Memory

SSDs are OK: few writes, many reads.

Page 32: HPCC Systems vs Hadoop

Speed - Part 2: Roxie

K CA 45 M MI 27 S FL 64

Roxie Master

Roxie Slaves

Roxie is Multi-Threaded

Index Index Index

Index In-Memory & Part or All Data

or

Index In-Memory

2004

Page 33: HPCC Systems vs Hadoop

Thor Master

Thor Slaves

Dali ESP

Roxie Master

Roxie Slaves

Common Cluster

Data is mostly unstructured. Use Thor to do ETL & create indexes. Send results to Roxie for user queries.
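The division of labour in the common cluster can be sketched as a batch step followed by a serving step. Everything here is a hypothetical stand-in: `etl` and `build_state_index` play the role the slide gives to Thor, `QueryService` plays the role it gives to Roxie, and the record pattern is an assumption about what the raw lines look like.

```python
# Hypothetical sketch of the common cluster: a batch step cleans raw text
# and builds an index, and a serving step answers queries from that index.
import re

RECORD = re.compile(r"^(\w+)\s+([A-Z]{2})\s+(\d+)$")

def etl(raw_lines):
    """Batch step: keep only lines that parse as name, state, age."""
    records = []
    for line in raw_lines:
        match = RECORD.match(line.strip())
        if match:
            records.append((match.group(1), match.group(2), int(match.group(3))))
    return records

def build_state_index(records):
    """Batch step: group the cleaned records by state."""
    index = {}
    for record in records:
        index.setdefault(record[1], []).append(record)
    return index

class QueryService:
    """Serving step: answer user lookups from the in-memory index."""

    def __init__(self, index):
        self._index = index

    def by_state(self, state):
        return self._index.get(state, [])

raw = ["Kevin CA 45", "not a record", "Mark MI 27", "Sara FL 64"]
service = QueryService(build_state_index(etl(raw)))
print(service.by_state("MI"))
```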

Page 34: HPCC Systems vs Hadoop

Dali ESP

Roxie Master

Roxie Slaves

High Speed Cluster

Data is mostly structured. Main goal is to have fast queries all the time.

Page 35: HPCC Systems vs Hadoop

Thor Master

Thor Slaves

Dali ESP

Storage Cluster

Data is structured or unstructured. Main goal is to store lots of data and query using indexes on all or part of the data in the cluster.

Page 36: HPCC Systems vs Hadoop

Blocks a, b, c

Name Node

Data Nodes / Task Tracker

Mapper

Reducer

Complex or Multi-Step Queries

Job Tracker

The Job Tracker coordinates multi-step jobs.

Page 37: HPCC Systems vs Hadoop

Job Tracker

CA 120 MI 500 FL 7

Food 31 Water 99 Candy 84

Wed 80 Fri 73 Sun 96

1 2 3 4 5 6 7 8 9

3 hours, 1 hour, 1 hour, 6 hours

Sum 80 Count 73

1 hour

Page 38: HPCC Systems vs Hadoop

How do I Query HPCC Systems?

ECL (Enterprise Control Language) is a C++ based query language for use with the HPCC Systems Big Data platform. ECL's syntax and format are simple and easy to learn.

Note: ECL is very similar to Hadoop’s Pig, but more expressive and feature-rich.

Page 39: HPCC Systems vs Hadoop

Map/Reduce

SQL w/ JOINS

GraphDB

Machine Learning

Simple to Complex Queries

ECL (Enterprise Control Language) C++ based query language

Page 40: HPCC Systems vs Hadoop

Sort

Count

Group

Classification

(ROXIE) 0.27 seconds to (THOR) few hours

Country = ‘US’

Join

Index of ~/facebook_2013

Query is Completed in a Single Job, Asynchronously

~/facebook_2013

Country = ‘US’

~/twitter_2013

SORT, GROUP, DEDUP, JOIN, MERGE, BETWEEN, LENGTH, REGEX, ROUND, SUM, COUNT, TRIM, WHEN, AVE, CASE, NORMALIZE, DENORMALIZE, K-MEANS, more ...

Page 41: HPCC Systems vs Hadoop

Machine Learning Built-in

Regression: Linear Regression

Classification: Naive Bayes, Perceptron, Decision Trees, Logistic Regression

Clustering: K-Means, KD Trees, Agglomerative/Hierarchical

Association Analysis: AprioriN, EclatN, Rules

http://hpccsystems.com/ml

Michael Payne, of Clemson University, on high-speed machine learning with PB-BLAS in HPCC Systems.

http://youtu.be/s_HWlMwi6iI

Page 42: HPCC Systems vs Hadoop

Lorem Ipsum is simply dummy text of the printing

Un-Structured Data? Example

lots of text

300GB File

Page 43: HPCC Systems vs Hadoop

Lorem Ipsum is simply dummy text of the printing

Un-Structured Data

Regular Expression in C++ or Pattern Match in ECL

Regular Expression in Java

+ Reg Ex

+ metadata stored only

Filtered Data + Indexes
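The pattern on this slide (run a regular expression over raw text and keep only the matches plus metadata about where they were found) can be sketched as below. Python's `re` module stands in for the C++, ECL, and Java options the slide names; the `extract_matches` helper and the capitalised-word pattern are hypothetical.

```python
# Hypothetical sketch: filter unstructured text with a regular expression
# and keep only the matched text plus its position as metadata.
import re

def extract_matches(text, pattern):
    """Return one small record per match: the text and where it was found."""
    return [
        {"match": m.group(0), "start": m.start(), "end": m.end()}
        for m in re.finditer(pattern, text)
    ]

text = "Lorem Ipsum is simply dummy text of the printing"

# Example pattern: words that begin with a capital letter.
hits = extract_matches(text, r"\b[A-Z]\w+")
print(hits)
```

The list of hits is far smaller than the source text, which is the point of storing filtered data and metadata rather than a second full copy.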

Page 44: HPCC Systems vs Hadoop

Lorem Ipsum is simply dummy text of the printing

Full Text Search

Pattern Match in ECL and

+ Reg Ex or

Page 45: HPCC Systems vs Hadoop

vs

More Moving Parts = More Downtime

Management & Administration

Page 46: HPCC Systems vs Hadoop

“I want sub-second speed but made investment in HDFS.”

K CA 45 M MI 27 S FL 64

Roxie Master

Index Index Index

Roxie Slaves

Blocks a, b, c

Name Node

Data Nodes / Task Tracker

http://hpccsystems.com/products-and-services/products/modules/hadoop-integration

Hadoop / HPCC Transport Plug-in

Page 47: HPCC Systems vs Hadoop

Migrating from Hadoop to HPCC Systems

K CA 45 M MI 27 S FL 64

Roxie Master

Index Index Index

Roxie Slaves

Name Node

Data Nodes / Task Tracker

Thor Master

Thor Slaves

Slowly replace Hadoop with Thor.

Page 48: HPCC Systems vs Hadoop

Alternative Query Methods

Page 49: HPCC Systems vs Hadoop

HPCC Systems Security

Encrypt Data on Disk (optional)

Third-Party Authentication (Kerberos OK)

User / Group Authentication

Page 50: HPCC Systems vs Hadoop

http://www.slideshare.net/FujioTurner/

For More HPCC “How To’s” Go to SlideShare

Page 51: HPCC Systems vs Hadoop

http://www.youtube.com/watch?v=8SV43DCUqJg

Watch how to install HPCC Systems in 5 Minutes

Download HPCC Systems Open Source Community Edition

http://hpccsystems.com/download/

or

Source Code

https://github.com/hpcc-systems