HPCC Systems vs Hadoop
Transcript of HPCC Systems vs Hadoop
Big Data Architectural Overview
HPCC Systems VS Hadoop
By Fujio Turner
@FujioTurner
LexisNexis is a provider of legal, tax, regulatory, news, business information, and analysis to legal, corporate, government, accounting and academic markets.
LexisNexis has been in business since 1977 with over 30,000 employees worldwide.
Who is LexisNexis?
What is HPCC Systems?
LexisNexis Risk is the division of LexisNexis that focuses on data, Big Data processing, linking, and vertical expertise. It supports HPCC Systems as an open source project under the Apache 2.0 License.
http://hpccsystems.com/
Comparison
              Hadoop               HPCC Systems
Language      Java                 C++
Data scale    Petabytes            Exabytes
Since         2005                 2000
Storage       Block based          File based
Engines                            Thor, Roxie
Throughput    1-80,000 jobs/day    Non-indexed: 4-1,040,000 jobs/day
                                   Indexed: 2K-3K jobs/sec*
                                   In-memory: 30-40 jobs/min*
*based on job (size / result set / complexity)
[Figure labels: Business, Development, Customers; 1 20; Non-Indexed Full Data Set]
http://hpccsystems.com/why-hpcc/benchmarks
“I’m sub-second fast.”
“I can query all or part of your data.”
Thor: single-threaded; hard disk; index (optional)
Roxie: multi-threaded; hard disk and/or SSD (either/both); index (optional), in-memory
Cluster Architecture
Example
How do the platforms handle the same data?
Customer Data May 2010 (300GB file)
Name   State  Age
Kevin  CA     45
Mark   MI     27
Sara   FL     64
Store Data
Data is stored in random blocks.
[Diagram: Name Node and Data Nodes; the rows "Kevin CA 45", "Mark MI 27", "Sara FL 64" are split into big blocks a, b, c.]
Store Data
Block locations are stored in memory.
[Diagram: the Name Node holds the block map (block a = server 1, block b = server 2, block c = server 3); the Data Nodes hold big blocks a, b, c.]
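The block map on this slide can be pictured as a small in-memory lookup table. The following is a minimal sketch, not Hadoop's actual data structure; the names `block_locations`, `locate`, and `blocks_on` are made up for illustration, and only the three block-to-server pairs come from the slide.

```python
# Hypothetical in-memory block map, using the three entries from the slide.
block_locations = {"a": "server 1", "b": "server 2", "c": "server 3"}


def locate(block_id):
    """Return the server that holds a block, or None if the block is unknown."""
    return block_locations.get(block_id)


def blocks_on(server):
    """Return the sorted list of block ids stored on one server."""
    return sorted(b for b, s in block_locations.items() if s == server)
```

Because the table lives in memory, a lookup is a single dictionary access; the trade-off is that the whole map has to fit in the memory of the node that holds it.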
Store Data
Data is distributed evenly in the cluster with replica copies and is seen as a file (example below).
File Name: ~/customers_2010-05
[Diagram: Thor Master and Thor Slaves; the rows "Kevin CA 45", "Mark MI 27", "Sara FL 64" are spread across the slaves.]
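One way to picture "distributed evenly with replica copies" is a split into near-equal parts, one per slave node. This is a rough sketch under stated assumptions: the function `distribute` and the layout dictionary are hypothetical, and placing each replica on the next node is an assumption made for the example, since the slides do not say where replicas are kept.

```python
def distribute(records, num_slaves, replicas=1):
    """Split records into near-equal contiguous parts, one per slave node.

    Assumption for illustration only: each part's replica is placed on the
    next node(s) in the ring. The slides do not specify replica placement.
    """
    if num_slaves < 1:
        raise ValueError("num_slaves must be at least 1")
    base, extra = divmod(len(records), num_slaves)
    layout = {node: {"primary": [], "replicas": {}} for node in range(num_slaves)}
    start = 0
    for node in range(num_slaves):
        # The first `extra` nodes take one additional record each.
        size = base + (1 if node < extra else 0)
        layout[node]["primary"] = records[start:start + size]
        start += size
    for node in range(num_slaves):
        for step in range(1, replicas + 1):
            holder = (node + step) % num_slaves
            if holder != node:
                layout[holder]["replicas"][node] = layout[node]["primary"]
    return layout


# The three sample rows from the slides, spread over three slave nodes.
customers = [("Kevin", "CA", 45), ("Mark", "MI", 27), ("Sara", "FL", 64)]
layout = distribute(customers, num_slaves=3)
```

Each node ends up with one primary part plus a copy of a neighbour's part, while the user still addresses the whole thing by one file name.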
Store Data
File locations are stored on disk.
File Name: ~/customers_2010-05
[Diagram: Thor Master, Thor Slaves, and Dali (File Location & Job Scheduler).]
What state do most people live in?
Blocks are scanned for wanted data.
[Diagram: Name Node and Data Nodes holding blocks a, b, c.]
What state do most people live in?
Found data is sent to Mapper(s) in key/value pairs and stored.
[Diagram: Name Node, Data Nodes (blocks a, b, c), Mapper. Mapper output: CA 1, FL 1, MI 1, FL 1, CA 1, MI 1, MI ..]
What state do most people live in?
Stored data is sent to Reducer(s) to be aggregated.
[Diagram: Name Node, Data Nodes (blocks a, b, c), Mapper, Reducer. Mapper output: CA 1, FL 1, MI 1, FL 1, CA 1, MI 1, MI .. Reducer output: CA 120, MI 500, FL 7]
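The map and reduce steps for this question can be sketched in a few lines. This is an illustration in plain Python, not Hadoop's Java API; the function names `mapper` and `reducer` and the in-process list of pairs are assumptions, and the rows are the three sample customers from the earlier slide.

```python
from collections import defaultdict

# Sample rows from the slides: (name, state, age).
rows = [("Kevin", "CA", 45), ("Mark", "MI", 27), ("Sara", "FL", 64)]


def mapper(row):
    """Emit one (state, 1) key/value pair per customer row."""
    _name, state, _age = row
    yield (state, 1)


def reducer(pairs):
    """Sum the values for each key to get a count per state."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Map phase: every row becomes a key/value pair that is stored.
pairs = [kv for row in rows for kv in mapper(row)]
# Reduce phase: the stored pairs are aggregated per state.
counts = reducer(pairs)
```

In a real cluster the pairs are written out between the two phases, which is the intermediate storage step the slide points at.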
Cannot use SSD in Mapper or Reducer due to too many writes.
What state do most people live in?
1a. A pre-compiled query is triggered (mostly used in Roxie).
1b. Ad-hoc query.
2. Query is sent to Dali to get file locations.
[Diagram: ESP, Dali (File Location & Job Scheduler), Thor Master, and Thor Slaves holding K CA 45, M MI 27, S FL 64.]
3. The job is placed in a queue to be sent to the Thor Master. The Thor Master coordinates job execution on the Thor Slave nodes.
Jobs are done locally on the slaves and/or coordinated globally by the master.
4. The job result is returned, optionally grouped and sorted at run time.
Result: MI 500, CA 120, FL 7
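The "local on slaves, global on master" flow with a sorted result can be sketched as below. This is plain Python, not ECL and not HPCC's internals; `local_count`, `global_merge`, and the fourth sample row ("Mia") are made up for the example.

```python
from collections import Counter


def local_count(rows):
    """Work done on one slave: count the states in its own part of the file."""
    return Counter(state for _name, state, _age in rows)


def global_merge(partials):
    """Work coordinated by the master: merge partial counts, largest first."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    # Sort by count descending, then by state name for a stable order.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical file parts, one list per slave ("Mia" is an invented row).
slave_parts = [
    [("Kevin", "CA", 45), ("Mark", "MI", 27)],
    [("Sara", "FL", 64)],
    [("Mia", "MI", 31)],
]
result = global_merge(local_count(part) for part in slave_parts)
```

Only the small per-slave counts travel to the master, and the grouping and sorting happen as part of the same job.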
Multiple other actions can be done on the data in a single job:
SORT, GROUP, DEDUP, JOIN, MERGE, BETWEEN, LENGTH, REGEX, ROUND, SUM, COUNT, TRIM, WHEN, AVE, CASE, NORMALIZE, DENORMALIZE, K-MEANS, more ...
Closer Look at Finding Data
The full block is scanned to find your data. Blocks can be many terabytes in size.
[Diagram: blocks a, b, c; the row "K CA 45" sits inside block a.]
Closer Look at Finding Data
When data is found, it is sent to the mapper (for example: CA, 1).
Closer Look at Finding Data
Data location is known.
“Apply Schema on Read” at the time of the query.
Data is processed locally.
[Diagram: file parts a, b, c; the row "K CA 45" is read with the schema Name, State, Age.]
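"Apply Schema on Read" means the stored bytes carry no structure until a query supplies one. The following is a minimal sketch of that idea in Python; the `SCHEMA` list and `read_with_schema` function are hypothetical, and whitespace-separated fields are an assumption based on how the sample rows appear on the slides.

```python
# Hypothetical schema supplied at query time: field name and type converter.
SCHEMA = [("name", str), ("state", str), ("age", int)]


def read_with_schema(line, schema=SCHEMA):
    """Interpret one raw, whitespace-separated line using the given schema."""
    values = line.split()
    if len(values) != len(schema):
        raise ValueError(f"expected {len(schema)} fields, got {len(values)}")
    return {field: cast(value) for (field, cast), value in zip(schema, values)}


record = read_with_schema("Kevin CA 45")
```

The same stored line could be read with a different schema by another query, since nothing about the layout is fixed when the data is written.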
Closer Look at Finding Data
File size can be a few bytes to 4 exabytes, with no limit on the total number of files that can be stored.
Memory: 128GB - 1TB
Disk: 8TB - 16TB or more
Only 1.5 - 12.5% of the data is in memory, and only recently used data is kept there.
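The percentage range can be reproduced with simple arithmetic. This is one possible reading, not something the slides state: it assumes the first pair of figures is memory, the second is disk, and the range comes from dividing both memory sizes by the 8TB disk figure (with 1TB = 1024GB). The names below are made up for the example.

```python
GB = 1
TB = 1024 * GB  # assumption: binary units


def memory_fraction_percent(ram, disk):
    """Share of the on-disk data that could fit in memory, as a percentage."""
    return 100.0 * ram / disk


low = memory_fraction_percent(128 * GB, 8 * TB)   # smallest memory vs 8TB disk
high = memory_fraction_percent(1 * TB, 8 * TB)    # largest memory vs 8TB disk
```

Under these assumptions `low` is about 1.56 and `high` is 12.5, which lines up with the 1.5 - 12.5% on the slide; with 16TB of disk the share would be smaller still.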
Speed
2013
Speed - Part 1: Indexing
• index per file
• customize by field(s)
File Name: ~/customers_2010-05
File Name: ~/customers_2010-05_index
Index entries: CA row #3, MI row #17, MI row #4, FL row #5
[Diagram: Thor Master and Thor Slaves; each slave holds part of the data (Kevin CA 45, Mark MI 27, Sara FL 64) plus an index.]
Non-Indexed: 1 40
To
Indexed: 1 200+
Example Index: male row #345, female row #4, male row #97, female row #267
Example Index: CA row #3, MI row #17, MI row #4, FL row #5
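The example indexes above map a field value to the row numbers where it occurs. A small sketch of a per-file index on one chosen field follows; `build_index` and `lookup` are hypothetical names, row numbers here start at 0, and this is an in-memory illustration rather than how HPCC Systems stores its index files.

```python
from collections import defaultdict

# Sample rows from the slides: (name, state, age).
rows = [("Kevin", "CA", 45), ("Mark", "MI", 27), ("Sara", "FL", 64)]


def build_index(rows, field_pos):
    """Map each value of the chosen field to the row numbers that contain it."""
    index = defaultdict(list)
    for row_number, row in enumerate(rows):
        index[row[field_pos]].append(row_number)
    return dict(index)


def lookup(rows, index, key):
    """Fetch matching rows through the index instead of scanning every row."""
    return [rows[i] for i in index.get(key, [])]


state_index = build_index(rows, field_pos=1)  # index customized by the State field
michigan = lookup(rows, state_index, "MI")
```

Choosing a different `field_pos` gives a different index over the same file, which is the "customize by field(s)" point from the indexing slide.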
Speed - Part 2: Roxie
Index In-Memory, or Index In-Memory & Part or All Data
Roxie is Multi-Threaded
SSDs are OK - write few / read many
[Diagram: Roxie Master and Roxie Slaves; each slave holds part of the data (K CA 45, M MI 27, S FL 64) plus an index.]
2004
Common Cluster
Data is mostly unstructured. Use Thor to do ETL & create indexes. Send results to Roxie for user queries.
[Diagram: Thor Master, Thor Slaves, Dali, ESP, Roxie Master, Roxie Slaves.]
High Speed Cluster
Data is mostly structured. Main goal is to have fast queries all the time.
[Diagram: Dali, ESP, Roxie Master, Roxie Slaves.]
Storage Cluster
Data is structured or unstructured. Main goal is to store lots of data and query it using indexes on all or part of the data in the cluster.
[Diagram: Thor Master, Thor Slaves, Dali, ESP.]
Complex or Multi-Step Queries
Job Tracker coordinates multi-step jobs.
[Diagram: Name Node, Job Tracker, Data Nodes / Task Trackers (blocks a, b, c), Mapper, Reducer. Steps 1 to 9 produce intermediate results (CA 120, MI 500, FL 7; Food 31, Water 99, Candy 84; Wed 80, Fri 73, Sun 96; Sum 80, Count 73), with stage times of 3 hours, 1 hour, 1 hour, 6 hours, and 1 hour.]
How do I Query HPCC Systems?
ECL (Enterprise Control Language) is a C++ based query language for use with the HPCC Systems Big Data platform. ECL's syntax and format are simple and easy to learn.
Note - ECL is very similar to Hadoop’s Pig, but more expressive and feature rich.
Map/Reduce
SQL w/ JOINS
GraphDB
Machine Learning
Simple to Complex Queries
ECL (Enterprise Control Language) C++ based query language
Sort
Count
Group
Classification
From 0.27 seconds (Roxie) to a few hours (Thor)
Query is Completed in a Single Job, Asynchronously
[Diagram: ~/facebook_2013 (through the Index of ~/facebook_2013) and ~/twitter_2013 are each filtered by Country = ‘US’ and combined with a Join.]
SORT, GROUP, DEDUP, JOIN, MERGE, BETWEEN, LENGTH, REGEX, ROUND, SUM, COUNT, TRIM, WHEN, AVE, CASE, NORMALIZE, DENORMALIZE, K-MEANS, more ...
+
Machine Learning Built-in
Regression: Linear Regression
Classification: Naive Bayes, Perceptron, Decision Trees, Logistic Regression
Clustering: K-Means, KD Trees, Agglomerative/Hierarchical
Association Analysis: AprioriN, EclatN, Rules
http://hpccsystems.com/ml
Michael Payne, of Clemson University, on high speed machine learning with PB-BLAS in HPCC Systems.
http://youtu.be/s_HWlMwi6iI
Un-Structured Data? Example
300GB File: lots of text
Lorem Ipsum is simply dummy text of the printing
Un-Structured Data
Lorem Ipsum is simply dummy text of the printing
HPCC Systems: Regular Expression in C++ or Pattern Match in ECL
Hadoop: Regular Expression in Java
+ Reg Ex: + meta data stored only
Filtered Data + Indexes
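The "meta data stored only" idea can be illustrated with a regular expression that records where matches occur instead of copying the text. This is a sketch in Python rather than C++, ECL, or Java; the function `match_metadata`, the pattern, and the choice of stored fields (match text plus start and end offsets) are assumptions for the example.

```python
import re


def match_metadata(text, pattern):
    """Return only metadata for each regex match: matched text and offsets."""
    return [
        {"match": m.group(0), "start": m.start(), "end": m.end()}
        for m in re.finditer(pattern, text)
    ]


# The sample unstructured text from the slide.
text = "Lorem Ipsum is simply dummy text of the printing"
hits = match_metadata(text, r"\b[Dd]ummy\b")
```

The list of hits is far smaller than the source text, so it is the kind of filtered result that could then be indexed for fast lookups.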
Lorem Ipsum is simply dummy text of the printing
Full Text Search
Pattern Match in ECL and
+Reg Ex or
vs
More Moving Parts = More Downtime
Management & Administration
“I want sub-second speed but made an investment in HDFS.”
Hadoop / HPCC Transport Plug-in
http://hpccsystems.com/products-and-services/products/modules/hadoop-integration
[Diagram: Roxie Master and Roxie Slaves (with indexes) reading from the Hadoop Name Node and Data Nodes / Task Trackers.]
Migrating from Hadoop to HPCC Systems
Slowly replace Hadoop with Thor.
[Diagram: Roxie Master and Roxie Slaves (with indexes), the Hadoop Name Node and Data Nodes / Task Trackers, and Thor Master with Thor Slaves.]
Alternative Query Methods
HPCC Systems Security
Encrypt Data on Disk: optional
Third Party Authentication: Kerberos OK
User / Group Authentication
For More HPCC “How To’s”
Go to SlideShare: http://www.slideshare.net/FujioTurner/
Watch how to install HPCC Systems in 5 Minutes: http://www.youtube.com/watch?v=8SV43DCUqJg
Download HPCC Systems Open Source Community Edition: http://hpccsystems.com/download/
or Source Code: https://github.com/hpcc-systems