Post on 24-Mar-2018
Accelerative Technology Lab 2014
Big Data Analytics Using Accelerator for HPC
KK Yong (kk.yong@mimos.my)
R&D Activities carried out at NVIDIA-MIMOS Joint Lab (First in South East Asia)
Outline
• About MIMOS – ATL
• MIMOS Platform & Application
• Challenges of GPU Libraries adaption in Big Data Analytics
• MiAccLib Architecture and Framework
• MIMOS R&D GPU Cluster
About MIMOS
• Malaysia’s National R&D Center
• 10 core research areas:
– Advanced Analysis & Modelling
– Advanced Computing (*)
• Accelerative Technology Lab
– Information Security
– Intelligent Informatics
– Knowledge Technology
– Microenergy
– Microelectronics
– Nanoelectronics
– Psychometrics
– Wireless Communications
(*) Advanced Computing
Spearheads R&D activities in acceleration on large-scale computing, chiefly Cloud Computing; from SaaS and IaaS to Services Delivery Platform.
About Accelerative Technology Lab
• To facilitate adaptation of many-core/parallel/GPU techniques in scientific, financial, big data processing areas
• To enhance GPU related R&D activities in Malaysia
• To serve as a one-stop center to promote, share & teach GPU technologies/solutions to customers and those interested in GPGPU, and to do joint collaborations on GPU topics
Accelerative Technology Lab
Finance
Text / string analytics
Crypto
Video Analytics
Database Acceleration (Galactica)
Oil and Gas
DB
Open Platform
Traditional Multicores (Main Processor)
General Purpose GPU 512-2880 GPU Cores (Co-Processor)
Many Integrated CPU Core (60 Cores) (Co-Processor)
On-board Memory
Additional/External Memory (SSD/HSM)
Parallel programming
CUDA platform
40Gb/s Infiniband ConnectX
GPU/Multi Core Driven Applications for MIMOS
Parallel Data Processing
En/Decryption (scrambling+)
+ SSL Accelerator
Large DataSets (SOCSO)
Streamed Data (ISP & AVMedic)
Fraud Management System (SOCSO)
PDRM (HRMIS)
Business/ Enterprise Data MiAccLib 2.0
Patient Data Analytics
Challenges of GPU Libraries adaption in Big Data Analytics
• Most of our library users are non-scientific in nature
• GPGPU is seen as an “Acceleration Co-Processor”
• Hide the algorithmic complexity with simple API parameters
• Structured & Unstructured Data in xxx Gigabytes
Facebook status updates: 700 per second
Twitter tweets: 600 per second
Buzz posts: 55 per second
Google: 34,000 searches per second
Yahoo: 3,200 searches per second
Bing: 927 searches per second
Future growth is staggering…
Overall Architecture
Generic web service
SOAP interface
Specific web-service 1
Specific web-service 2
Specific web-service 3
Specific Application
Generic Application
Specific Application
Specific Application
Application specific
Specific API exposed through web service interface (contains specific data preparation stage)
Application specific Algorithms
Functional Algorithms
Various Hardware
Mi-AccLib 2.x APIs (DLL)
Application Programming Interface
VAR
- Historical
- Generic
GPU Multi-Core CPU
- Text/String Processing
- Text/String Analytics
Queries Acceleration & Query Parser/Optimiser
String Library
- Searching
- Sorting
- Matching
- Scrambler
- …
Numeric Library
- Financial
- Matrix
- Scientific
- Statistics
- …
IMDB Library
- Retrieval
- Transfer
- Indexing
- Analytics
- …
SQL/ SPARQL Library
- Unified Indexer
- Query Operator
- Multi-Format Data Manager
- Resource Manager
- …
Library API (dll & .so)
MIMOS Middle Framework
Many-Core Compute Engine
MIDDLEWARE LAYER
Analytics Component
Processing Component
Orchestration Component
PRESENTATION LAYER
Storage
Analytics Component
Processing Component
Mi-AccLib Libraries (Specific)
Finance Libraries
Video Analytics (Specific)
Text/String Libraries
Statistical/ Predictive Analytics
ETL Tool (Mi-Morphe)
Parallel Queries Processing (OLAP Accelerator)
Mi-AccLib Libraries (Algorithms & Generic)
Batch / Real Time
Hadoop & Storm
Libraries with Nodes
Traditional SQL Engine
Machine Learning
GIS Based Analytics
Crypto Libraries
Orchestration & Scheduling Engine
Use Case: SOCSO’s Data Cleansing
• Data Detection and Rectification
• Consists of 7711 cleansing rules
• 8 key steps:
– Extraction
– Loading
– Transformation
– Exception detection
– Potential bad data extraction
– Rectification discussion
– Correction
– Correction verification
Using
Data Cleansing Challenges
• 31 systems from heterogeneous environments:
– Environment: UNISYS, AS400, Windows, Linux.
– Data source: DMS1100, DB2, Informix, MS SQL, MySQL, Excel, MS Access, Foxpro and flat files.
• Big Data with Big Computation:
– 319 source data
– Involves ~1 billion records, e.g.:
• 15 million employees with 15 million monthly contributions
• 880,000 employers with 65 million monthly contributions
• Match against reference JPN data with 15 million records
[Diagram: Accelerated Duplicate Detection within the Source Database; Accelerated Data Validation with Reference Data, searching the Source Database (~15 million) against the Reference Database (~15 million)]
Accelerated Data Detection | Exact match | Edit distance | Numeric Distance | Date Distance |
Snapshot
Data Duplication in One Column
Identification number: 370120105041, 370120105041
Data Duplication in Multi Column
Identification number: 460721025197, 460731025197
Full name: Othman Md Amin, Othman B Md Amin
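The multi-column case in the snapshot can be illustrated with a small CPU-only sketch. It assumes that two records count as duplicates when every column is within a small edit distance; the function names, the threshold of 2, and the third sample record are hypothetical and are not taken from Mi-AccLib.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def find_duplicates(records, max_distance=2):
    """Return index pairs whose every column is within max_distance edits.

    Quadratic in the number of records: a plain CPU stand-in for the
    pairwise comparison that an accelerator would evaluate in parallel.
    """
    pairs = []
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        if all(edit_distance(x, y) <= max_distance for x, y in zip(r1, r2)):
            pairs.append((i, j))
    return pairs


# The first two records use the snapshot values; the third pairing of
# identification number and name is made up for illustration.
records = [
    ("460721025197", "Othman Md Amin"),
    ("460731025197", "Othman B Md Amin"),
    ("370120105041", "Lee Ang Moh"),
]
duplicates = find_duplicates(records)
print(duplicates)
```

Every pair is independent of every other pair, which is what makes this kind of comparison a natural fit for many-core hardware.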
Data Cleansing Performance
[Chart: processing time in minutes (axis 0 to 1200) against number of records (10k, 65k, 1 million, 14 million) for CPU (1-core), CPU (8-core) and GPU optimised (448 cores). Annotations: More than 24 hours (twice); GPU < 0.1 min; GPU < 0.5 min; GPU < 3 min; GPU = 45 min]
High-Speed Name Search Performance
[Chart: search time in seconds (axis 0 to 0.2) for Exact Match, Edit Distance and Wild Card searches over 10 million records and 40 million records. Annotation: All search < 0.2 s]
Search type | Query | Match
Exact Match | Mohamad | Mohamad
Edit Distance | Mohamad | Muhammad
Wild Card | Moh | Mohamad, Lee Ang Moh, El-Mohan
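The three search modes can be sketched as follows. This is a minimal CPU-only illustration, assuming case-insensitive matching, an edit-distance threshold of 2, and a wild card that behaves as a substring match (which is consistent with the "Moh" examples). The `search` function and its parameters are hypothetical names, not the library's API.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def search(names, query, mode, max_distance=2):
    """Return the names matching query under the given mode."""
    q = query.lower()
    if mode == "exact":
        return [n for n in names if n.lower() == q]
    if mode == "edit":
        return [n for n in names
                if edit_distance(n.lower(), q) <= max_distance]
    if mode == "wildcard":
        # Assumption: the wild card matches the query anywhere in the name.
        return [n for n in names if q in n.lower()]
    raise ValueError(f"unknown mode: {mode}")


# Names from the slide plus one unrelated entry.
names = ["Mohamad", "Muhammad", "Lee Ang Moh", "El-Mohan", "Othman Md Amin"]

exact_hits = search(names, "Mohamad", "exact")
edit_hits = search(names, "Mohamad", "edit")
wildcard_hits = search(names, "Moh", "wildcard")
```

Each name is tested independently of the others, so the scan over millions of records can be split across many cores without coordination.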
Accelerated & Parallelized Algorithms
10+ Million Records of transaction data
Mi-AccLib
Perkeso Data × JPN Data: 350 trillion combinations/rule × 7711 rules = 2.7 quintillion operations
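As a check on the operation count, the arithmetic can be written out. The 350 trillion combinations per rule is taken from the slide as given; only the multiplication is shown here.

```latex
350 \times 10^{12}\ \text{combinations/rule} \times 7711\ \text{rules}
  = 2.69885 \times 10^{18}
  \approx 2.7 \times 10^{18}\ \text{operations (2.7 quintillion)}
```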
Business Intelligence Use Case
MIDDLEWARE LAYER
Analytics Component
Processing Component
Orchestration Engine
PRESENTATION LAYER
Data Warehouse
Data Mart Data Mart Data Mart
Dashboard
Parallel Query Accelerator
Galactica
Parallel Queries Accelerator
• Orchestration of Heterogeneous Hardware Components
– Multi-Core CPUs with Many-Core GPGPU
• Emerging GPU-accelerated query processing engine for massively data-parallel computation
– Analytical algorithms
• Easy-to-access parallel engine
– SQL-style access
– Standard Database Connector
Data Warehouse
Business Intelligence Tool
Presentation Layer
NVIDIA Tesla GPU Technology
Galactica - Accelerated Queries Processing
Presented at NVIDIA GTC 2014, March 24-27, 2014, San Jose
Galactica Performance
TPC-H Dataset (1GB, 10GB & 50GB) with three sets of queries and PostgreSQL
32GB Dataset, Distributed Processing with 7 Nodes (Hadoop) & PostgreSQL
N/A – Failed in query execution
MiBIS – Data Visualization
MIMOS Mi-BIS is a platform that creates a convenient environment for customised report creation and business analytics. With Mi-BIS, organisations can easily create and manage reports, perform in-depth analysis which includes data exploration, ad hoc query analysis and visualisation of multi-dimensional data, to assist their decision-making process.
Features:
• Dashboard Management
• KPI Management
• Location Intelligence
• Parallel query processing accelerators
• Big Data Processing Engine
Scrambling (Database Encrypt & Decrypt)
* ECB = electronic code book
[Chart: Encryption. Time in milliseconds (axis 0 to 1800) against message size (32MB, 64MB, 128MB, 256MB) for CPU, AES-128 (Quadro 4000), AES-128 (K20) and AES-256 (K20)]
[Chart: Decryption. Same axes and series as the Encryption chart]
Annotations: > 7x, > 6x
Video Analytics Implementation in GPU
Intelligent Surveillance Platform
Video Analytics
• Intrusion Detection
• Loitering Detection
• Slip & Fall Detection
• Unattended Object Detection
• Object Removal Detection
Video Analytics Implementation in GPU
*40++ cameras implementation
~25% utilization*
Region of Interest during intrusion
* Differs based on server configuration & video complexity
Camera 2 MJPEG Decoding
IP Network
CPU
Surveillance Server
Video Analytics Processing
Client PC
ALERT !!!
Camera 1 MJPEG Decoding
…
Camera N MJPEG Decoding
…
GPU VA Library
• Background Subtraction: AccBackgroundSubtractionFrameDiff, AccCompMotion, AccUpdateBackground, AccCompShadow, AccRGB2HSV
• Morphing Process: AccMorphFilterVariable
• CCL: AccConnectComponentLabel
• Region Analyzer: AccExtractPropertiesCentroid, AccExtractPropertiesSize, AccExtractPropertiesBB, AccExtractPropertiesHWRatio, AccExtractPropertiesOrientation, AccExtractPropertiesHProject, AccExtractPropertiesSkew, AccRegionLabelUpdate, AccCompOverlap, AccPropUpdate, AccCombineBlob
• Filters: AccFlickerFilter, AccRegionFilter
• Detection: AccVAParallelIntrusionDetection
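The stages listed for the GPU VA Library can be illustrated with a small CPU-only sketch on greyscale frames stored as nested lists. It covers frame-difference background subtraction, connected component labelling, region properties, a size filter and a region-of-interest check. The morphological filter stage is omitted for brevity. All function names, the threshold, and the rule "alert when a blob's centroid lies inside the region of interest" are assumptions for illustration, not the library's behaviour.

```python
from collections import deque


def frame_diff_mask(background, frame, threshold=30):
    """Background subtraction by absolute frame difference."""
    return [[1 if abs(f - b) > threshold else 0 for f, b in zip(fr, br)]
            for fr, br in zip(frame, background)]


def label_components(mask):
    """4-connected component labelling; returns one pixel list per blob."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(pixels)
    return blobs


def region_properties(pixels):
    """Size, bounding box (top, left, bottom, right) and centroid of a blob."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return {"size": len(pixels),
            "bbox": (min(ys), min(xs), max(ys), max(xs)),
            "centroid": (sum(ys) / len(ys), sum(xs) / len(xs))}


def detect_intrusion(background, frame, roi, min_size=3):
    """Return properties of blobs whose centroid lies inside roi.

    roi is (top, left, bottom, right), inclusive. Blobs smaller than
    min_size pixels are discarded as noise (the region filter).
    """
    alerts = []
    for pixels in label_components(frame_diff_mask(background, frame)):
        props = region_properties(pixels)
        if props["size"] < min_size:
            continue
        cy, cx = props["centroid"]
        if roi[0] <= cy <= roi[2] and roi[1] <= cx <= roi[3]:
            alerts.append(props)
    return alerts


background = [[10] * 8 for _ in range(6)]
frame = [row[:] for row in background]
frame[0][0] = 200                  # isolated noise pixel
for y in (2, 3):
    for x in (4, 5):
        frame[y][x] = 200          # 2x2 moving object
alerts = detect_intrusion(background, frame, roi=(1, 3, 4, 6))
```

The per-pixel stages (differencing, thresholding) are independent across pixels and map directly onto GPU threads; labelling has data dependencies between neighbouring pixels and is the harder stage to parallelise.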
Video Analytics Processing
Parallelization of the VA algorithms
• Previous data dependency
• Efficient memory management
• Algorithm decomposition
CPU + GPU
CPU
Tasks | CPU + GPU | CPU Utilization (*)
Network Stream In | CPU | 10%
Decompression | CPU | 5%
Video Analytics | GPU | 35%
Streaming Out & Display | CPU | 50%
* Data taken on system server CPU - Dual 8 cores
GPU VA System Results
* Reference to 10fps
[Chart: VA Processing Time CPU vs GPU. Time in ms (axis 0 to 6) for CPU Dual 6 Core and GPU Kepler K20C. Annotation: 3.6x]
[Chart: CPU with SINGLE GPU K20C. No. of Cameras (axis 0 to 100) and Utilization (axis 0% to 100%) against fps (20, 15, 10); series: No. of Cameras, CPU, GPU, GPU MEMORY]
[Chart: CPU with DUAL GPU K20C. Same axes and series as the single-GPU chart]
Data labels: 41, 41, 35; 70, 80, 90
MIMOS ATL: R&D GPU Cluster
Enabling High Performance Computing for MiAccLib
29.3 Teraflops (SP) / 13.9 Teraflops (DP)
MIMOS GPU Cluster Features
• Altair PBS system (v12.0)
– PBS Scheduling
– PBS Display Manager
– PBS Compute Manager
– PBS Analytic
• Point To Point Mellanox Infiniband
• NVIDIA GPU Direct
• MVAPICH2-GDR v2.0: MPI Over Infiniband
• CUDA 6.5
• Operating System: CentOS 6.4
• NVIDIA Tesla & Intel Xeon Phi
Experiment on Infiniband Bandwidth
Site | CPU Model | Socket/Cores/GHz | Memory | IB Card | IB Switch | OS | CUDA | GPU
Ohio | Intel E5-2680 | 2x10 @ 2.8GHz | 64GB | Mellanox Connect-IB | Mellanox FDR IB Switch | RHEL 6.5 | CUDA 6.5 | NVIDIA Tesla K40c
MIMOS | Intel E5-2640 | 2x6 @ 2.50GHz | 32GB | Mellanox ConnectX-3 | Point to Point | CentOS 6.5 | CUDA 6.5 | NVIDIA Tesla K20c
Updated on: 25/9/2014, MIMOS
MPI Benchmark Test: MVAPICH2-GDR 2.0
[Charts: Ohio Result and MIMOS’s Result. Value axis 0 to 7000; message sizes 1 to 4M]
GPU Direct RDMA Analysis - ATL
[Diagram: Server 1 and Server 2, each with a CPU and a GPU K20c with GDDR5 memory]
[Chart: MVAPICH2-GDR 2.0 ATL Benchmark. Bandwidth in MB/sec (axis 0 to 7000) against message size in bytes (1 to 4M); series: Device to Device, Host to Host, Host to Device, Device to Host]
Work on 2D Wave Equation Demo
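Algo 1: Wave Equation 2D
$$\frac{\partial^2 u}{\partial t^2} = c^2 \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right)$$
Algo 2: Finite Difference with Wave Equation 2D
$$u_{j,i}^{n+1} = c^2 \frac{\Delta t^2}{\Delta x^2} \left( u_{j+1,i}^{n} + u_{j-1,i}^{n} + u_{j,i+1}^{n} + u_{j,i-1}^{n} \right) + 2 \left( 1 - 2 c^2 \frac{\Delta t^2}{\Delta x^2} \right) u_{j,i}^{n} - u_{j,i}^{n-1}$$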
Ported the two-dimensional wave equation to the ATL GPU cluster, with a demonstration application
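The finite-difference update can be sketched on the CPU as below. This is a minimal illustration and not the ported cluster code. It assumes a square grid spacing, boundary values held at zero, and an initial impulse at the grid centre. The function names and the ratio r = c²Δt²/Δx² passed as a single parameter are choices made for this example. For this explicit scheme in two dimensions, r should not exceed 0.5 for stability.

```python
def wave_step(u_prev, u_curr, r):
    """Advance the 2D wave field by one time step.

    r is c^2 * dt^2 / dx^2. Boundary cells are left at 0.0, which is an
    assumed fixed-edge boundary condition.
    """
    rows, cols = len(u_curr), len(u_curr[0])
    u_next = [[0.0] * cols for _ in range(rows)]
    for j in range(1, rows - 1):
        for i in range(1, cols - 1):
            neighbours = (u_curr[j + 1][i] + u_curr[j - 1][i]
                          + u_curr[j][i + 1] + u_curr[j][i - 1])
            u_next[j][i] = (r * neighbours
                            + 2.0 * (1.0 - 2.0 * r) * u_curr[j][i]
                            - u_prev[j][i])
    return u_next


def simulate(size=9, steps=3, r=0.25):
    """Propagate a unit impulse placed at the centre of a size x size grid."""
    centre = size // 2
    u_prev = [[0.0] * size for _ in range(size)]
    u_curr = [[0.0] * size for _ in range(size)]
    u_prev[centre][centre] = 1.0
    u_curr[centre][centre] = 1.0
    for _ in range(steps):
        u_prev, u_curr = u_curr, wave_step(u_prev, u_curr, r)
    return u_curr


field = simulate()
```

Each grid point at step n+1 depends only on values from steps n and n-1, so all interior points can be updated in parallel, one GPU thread per point.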
FDTD with RDMA (Work In Progress)
Objective: Use multiple GPUs to simulate wave propagation over a larger area
Challenge: The halo regions must be communicated between GPUs
Advantage: RDMA reduces memory-transfer latency by transferring the halo directly into the GPUs on other nodes
GPU RDMA
The halo must be exchanged at every step as the wave propagates into a different GPU. More partitions mean more sharing is needed.
2048 pixels
First 1024 pixels computed on GPU1; second 1024 pixels computed on GPU2
Waves crossing into a different GPU's frame. The red line marks the boundary between the image regions held by different GPUs.
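The halo exchange can be illustrated with a CPU-only sketch in which two Python lists stand in for the two GPUs. Each half keeps one extra halo column holding a copy of its neighbour's edge column, and the halos are refreshed after every step. In the real system that copy is the part RDMA would carry between nodes. The grid size, function names and zero-valued boundary are assumptions made for the example.

```python
def step(u_prev, u_curr, r):
    """One finite-difference wave update on the interior of a local grid."""
    rows, cols = len(u_curr), len(u_curr[0])
    u_next = [[0.0] * cols for _ in range(rows)]
    for j in range(1, rows - 1):
        for i in range(1, cols - 1):
            s = (u_curr[j + 1][i] + u_curr[j - 1][i]
                 + u_curr[j][i + 1] + u_curr[j][i - 1])
            u_next[j][i] = (r * s + 2.0 * (1.0 - 2.0 * r) * u_curr[j][i]
                            - u_prev[j][i])
    return u_next


def exchange_halo(left, right):
    """Copy each side's edge column into the neighbour's halo column."""
    for row_l, row_r in zip(left, right):
        row_l[-1] = row_r[1]   # right's first owned column -> left's halo
        row_r[0] = row_l[-2]   # left's last owned column -> right's halo


def run_split(u0, steps, r):
    """Simulate on two halves, each with a one-column halo."""
    half = len(u0[0]) // 2
    l_prev = [row[:half + 1] for row in u0]
    l_curr = [row[:half + 1] for row in u0]
    r_prev = [row[half - 1:] for row in u0]
    r_curr = [row[half - 1:] for row in u0]
    for _ in range(steps):
        l_next, r_next = step(l_prev, l_curr, r), step(r_prev, r_curr, r)
        exchange_halo(l_next, r_next)
        l_prev, l_curr, r_prev, r_curr = l_curr, l_next, r_curr, r_next
    # Drop the halo columns and stitch the halves back together.
    return [row_l[:-1] + row_r[1:] for row_l, row_r in zip(l_curr, r_curr)]


def run_single(u0, steps, r):
    """Reference simulation on the undivided grid."""
    prev = [row[:] for row in u0]
    curr = [row[:] for row in u0]
    for _ in range(steps):
        prev, curr = curr, step(prev, curr, r)
    return curr


# Impulse just left of the partition boundary so the wave crosses it.
u0 = [[0.0] * 12 for _ in range(8)]
u0[4][5] = 1.0
split_result = run_split(u0, steps=4, r=0.25)
single_result = run_single(u0, steps=4, r=0.25)
```

Only one column per neighbour has to move per step, so the communication volume grows with the length of the partition boundaries while the computation grows with the area; adding partitions adds boundaries, which is why more partitioning means more sharing.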
In-progress works:
High Performance Parallel Data Warehouse
Big Data Support
Query performance: Expect 10x or better query performance compared to RDBMS
Scale Out: Incremental, by adding hardware
Shared Nothing
Support RDBMS SQL query
Support familiar BI tools through RDBMS SQL query
In-Memory Modified HDFS with RDMA
Data Warehouse and Data Model Plugin to BI tools
Data Loading Data Warehouse Analyze & Visualize
High Speed Network Connection (Infiniband)
Visit us http://gpu.mimos.my
Upcoming Events:
• CUDA Programming Challenge 2014 (30th September, 2014)
• GPU Annual Workshop (10th October 2014, Technology Park Malaysia, MIMOS)