Scalable Performance of System S for Extract-Transform-Load Processing
© International Business Machines Corporation 2008
IBM Research - Tokyo
Scalable Performance of System S for Extract-Transform-Load Processing
Toyotaro Suzumura, Toshihiro Yasue, and Tamiya Onodera
IBM Research - Tokyo
Outline
Background and Motivation
System S and its suitability for ETL
Performance Evaluation of System S as a Distributed ETL Platform
Performance Optimization
Related Work and Conclusions
What is ETL?
Extraction: extract data from different distributed data sources
Transformation: cleanse and customize the data for the business needs and rules while transforming it to match the data warehouse schema
Loading: load the data into the data warehouse
ETL = Extraction + Transformation + Loading
[Diagram: Data Sources → Extract → Transform → Load → Data Warehouse]
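As a minimal illustration of the three stages (our sketch, not part of the talk; the file name, column layout, and table name are hypothetical):

import csv
import sqlite3

def extract(path):
    # Extraction: pull raw records from a source (here, one CSV file)
    with open(path) as f:
        yield from csv.reader(f)

def transform(rows):
    # Transformation: cleanse and reshape each record to the warehouse schema
    for item, onhand in rows:
        yield (item.strip().upper(), int(onhand))

def load(records, db="warehouse.db"):
    # Loading: append the transformed records into the warehouse table
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS inventory (item TEXT, onhand INTEGER)")
    con.executemany("INSERT INTO inventory VALUES (?, ?)", records)
    con.commit()
    con.close()

load(transform(extract("warehouse_20090901.csv")))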
Data Explosion in ETL
Data Explosion
– The amount of data stored in a typical contemporary data warehouse may double every 12 to 18 months
Data Source Examples:
– Logs for regulatory compliance (e.g., SOX)
– POS (point-of-sale) transactions of retail stores (e.g., Wal-Mart)
– Web data (e.g., internet auction sites such as eBay)
– CDRs (Call Detail Records), which telecom companies analyze to understand customer behavior
– Trading data
Near-Real Time ETL
Given the data explosion problem, there is a strong need for ETL processing to be as fast as possible so that business analysts can quickly grasp trends in customer activity
Our Motivation:
Assess the applicability of System S, a data stream processing system, to ETL processing, considering both qualitative and quantitative ETL constraints
Thoroughly evaluate the performance of System S as a scalable, distributed ETL platform to achieve "near-real-time ETL" and address the data explosion problem in the ETL domain
Stream Computing and System S
System S: Stream Computing Middleware developed by IBM Research
System S has since been productized as "InfoSphere Streams".
Traditional computing: fact finding with data at rest. Stream computing: insights from data in motion.
InfoSphere Streams Programming Model
[Diagram: SPADE application programs are built from source adapters, sink adapters, and an operator repository, and go through platform-optimized compilation]
SPADE: Advantages of Stream Processing as a Parallelization Model
A stream-centric programming language dedicated to data stream processing
Streams as first-class entities
– Explicit task and data parallelism
– Intuitive way to exploit multi-core and multi-node systems
Operator and data source profiling for better resource management
Reuse of operators across stored and live data
Support for user-customized operators (UDOP)
A simple SPADE example
[Application]
SourceSink trace

[Nodepool]
nodepool np := ("host1", "host2", "host3")

[Program]
// virtual schema declaration
vstream Sensor (id : id_t, location : Double, light : Float, temperature : Float, timestamp : timestamp_t)

// a source stream is generated by a Source operator – in this case tuples come from an input file
stream SenSource ( schemaof(Sensor) )
  := Source( ) [ "file:///SenSource.dat" ] {}
  -> node(np, 0)

// this intermediate stream is produced by an Aggregate operator, using the SenSource stream as input
stream SenAggregator ( schemaof(Sensor) )
  := Aggregate( SenSource <count(100),count(1)> ) [ id . location ]
     { Any(id), Any(location), Max(light), Min(temperature), Avg(timestamp) }
  -> node(np, 1)

// this intermediate stream is produced by a Functor operator
stream SenFunctor ( id: Integer, location: Double, message: String )
  := Functor( SenAggregator ) [ log(temperature,2.0)>6.0 ]
     { id, location, "Node "+toString(id)+" at location "+toString(location) }
  -> node(np, 2)

// result management is done by a Sink operator – in this case produced tuples are sent to a socket
Null := Sink( SenFunctor ) [ "udp://192.168.0.144:5500/" ] {}
  -> node(np, 0)

[Dataflow: Source → Aggregate → Functor → Sink]
InfoSphere Streams Runtime
[Diagram: processing element containers running on heterogeneous hardware (x86 boxes and blades, Cell blades, FPGA blades), connected by the Streams Data Fabric transport]
Optimizing scheduler assigns operators to processing nodes, and continually manages resource allocation
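To give a feel for this kind of placement decision (a simplified sketch of ours, not the actual System S scheduler), a greedy policy could assign each operator to the currently least-loaded node:

def assign(operators, nodes):
    # operators: {name: estimated load}; nodes: candidate host names.
    # Greedy placement: heaviest operators first, each onto the
    # least-loaded node so far.
    load = {n: 0.0 for n in nodes}
    placement = {}
    for op, cost in sorted(operators.items(), key=lambda kv: -kv[1]):
        target = min(load, key=load.get)
        placement[op] = target
        load[target] += cost
    return placement

print(assign({"Source": 0.2, "Sort": 0.5, "Join": 0.4, "Sink": 0.1},
             ["host1", "host2"]))

The real scheduler also reacts to runtime statistics; this sketch only captures an initial assignment.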
System S as a Distributed ETL Platform?
Can we use System S as a distributed ETL processing platform?
Target Application for Evaluation
Inventory processing for multiple warehouses that includes most of the representative ETL primitives (Sort, Join, and Aggregate)
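To make the shape of this workload concrete (our simplified sketch, not the benchmark code; the record layout is hypothetical), the three primitives applied to inventory records look like this:

from itertools import groupby

warehouse = [("item2", 5), ("item1", 3), ("item1", 7)]   # (item, onhand)
catalog = {"item1": "bolt", "item2": "nut"}              # item -> description

# Sort: order inventory records by the item key
sorted_items = sorted(warehouse, key=lambda r: r[0])

# Join: attach the catalog description to each record (left outer join)
joined = [(item, onhand, catalog.get(item)) for item, onhand in sorted_items]

# Aggregate: total the on-hand quantity per item
aggregated = [(item, sum(r[1] for r in grp))
              for item, grp in groupby(joined, key=lambda r: r[0])]

print(aggregated)  # [('item1', 10), ('item2', 5)]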
SPADE Program for Distributed Processing
[Diagram: three Source operators read the warehouse item files (Warehouse_20090901_1.txt through Warehouse_20090901_3.txt, about 6 million records in total) into a source bundle; a Functor derives the split key (item) and a Split operator on the data distribution host fans the tuples out to compute hosts (1)–(N). Each compute host runs the chain Sort → Join → Sort → Join → Aggregate → Sort → UDOP (SplitDuplicatedTuples) → Functor → Sink / ODBCAppend. A separate Source reads the item catalog (around 60 items) for the join.]
SPADE Program (1/2)
[Nodepools]
nodepool np[] := ("s72x336-00", "s72x336-02", "s72x336-03", "s72x336-04")

[Program]
vstream Warehouse1Schema(id: Integer, item : String, Onhand : String, allocated : String, hardAllocated : String, fileNameColumn : String)
vstream Warehouse2OutputSchema(id: Integer, item : String, Onhand : String, allocated : String, hardAllocated : String, fileNameColumn : String, description: StringList)
vstream ItemSchema(item: String, description: StringList)

##===================================================
## warehouse 1
##===================================================
bundle warehouse1Bundle := ()

for_begin @i 1 to 3
stream Warehouse1Stream@i(schemaFor(Warehouse1Schema))
  := Source()["file:///SOURCEFILE", nodelays, csvformat]{}
  -> node(np, 0), partition["Sources"]
warehouse1Bundle += Warehouse1Stream@i
for_end

## stream for computing subindex
stream StreamWithSubindex(schemaFor(Warehouse1Schema), subIndex: Integer)
  := Functor(warehouse1Bundle[:])[]
     { subIndex := (toInteger(strSubstring(item, 6,2)) / (60 / COMPUTE_NODE_NUM))-2 }
  -> node(np, 0), partition["Sources"]

for_begin @i 1 to COMPUTE_NODE_NUM
stream ItemStream@i(schemaFor(Warehouse1Schema), subIndex:Integer)
for_end
  := Split(StreamWithSubindex) [ subIndex ]{}
  -> node(np, 0), partition["Sources"]

for_begin @i 1 to COMPUTE_NODE_NUM
stream Warehouse1Sort@i(schemaFor(Warehouse1Schema))
  := Sort(ItemStream@i <count(SOURCE_COUNT@i)>)[item, asc]{}
  -> node(np, @i-1), partition["CMP%@i"]
stream Warehouse1Filter@i(schemaFor(Warehouse1Schema))
  := Functor(Warehouse1Sort@i)[ Onhand="0001.000000" ] {}
  -> node(np, @i-1), partition["CMP%@i"]
Nil := Sink(Warehouse1Filter@i)["file:///WAREHOUSE1_OUTPUTFILE@i", csvFormat, noDelays]{}
  -> node(np, @i-1), partition["CMP%@i"]
for_end
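The Split above routes each tuple by a subIndex derived from two digits of the item key. A rough Python analogue of that partitioning (our sketch; the sample keys are made up, and we drop the SPADE code's -2 offset):

COMPUTE_NODE_NUM = 4

def subindex(item: str) -> int:
    # Mirror the SPADE expression: take two digits of the item key and
    # scale them so ~60 distinct item codes spread over the compute nodes.
    return int(item[6:8]) // (60 // COMPUTE_NODE_NUM)

buckets = {i: [] for i in range(COMPUTE_NODE_NUM)}
for item in ["item-0012", "item-0034", "item-0055", "item-0007"]:
    buckets[min(subindex(item), COMPUTE_NODE_NUM - 1)].append(item)
print(buckets)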
SPADE Program (2/2)

##====================================================
## warehouse 2
##====================================================
stream ItemsSource(schemaFor(ItemSchema))
  := Source()["file:///ITEMS_FILE", nodelays, csvformat]{}
  -> node(np, 1), partition["ITEMCATALOG"]
stream SortedItems(schemaFor(ItemSchema))
  := Sort(ItemsSource <count(ITEM_COUNT)>)[item, asc]{}
  -> node(np, 1), partition["ITEMCATALOG"]

for_begin @i 1 to COMPUTE_NODE_NUM
stream JoinedItem@i(schemaFor(Warehouse2OutputSchema))
  := Join(Warehouse1Sort@i <count(SOURCE_COUNT@i)>; SortedItems <count(ITEM_COUNT)>)
     [ LeftOuterJoin, {item} = {item} ]{}
  -> node(np, @i-1), partition["CMP%@i"]
for_end

##=================================================
## warehouse 3
##=================================================
for_begin @i 1 to COMPUTE_NODE_NUM
stream SortedItems@i(schemaFor(Warehouse2OutputSchema))
  := Sort(JoinedItem@i <count(JOIN_COUNT@i)>)[id, asc]{}
  -> node(np, @i-1), partition["CMP%@i"]
stream AggregatedItems@i(schemaFor(Warehouse2OutputSchema), count: Integer)
  := Aggregate(SortedItems@i <count(JOIN_COUNT@i)>) [item . id]
     { Any(id), Any(item), Any(Onhand), Any(allocated), Any(hardAllocated), Any(fileNameColumn), Any(description), Cnt() }
  -> node(np, @i-1), partition["CMP%@i"]
stream JoinedItem2@i(schemaFor(Warehouse2OutputSchema), count: Integer)
  := Join(SortedItems@i <count(JOIN_COUNT@i)>; AggregatedItems@i <count(AGGREGATED_ITEM@i)>)
     [ LeftOuterJoin, {id, item} = {id, item} ] {}
  -> node(np, @i-1), partition["CMP%@i"]
stream SortJoinedItem@i(schemaFor(Warehouse2OutputSchema), count: Integer)
  := Sort(JoinedItem2@i <count(JOIN_COUNT@i)>)[id(asc).fileNameColumn(asc)]{}
  -> node(np, @i-1), partition["CMP%@i"]
stream DuplicatedItems@i(schemaFor(Warehouse2OutputSchema), count: Integer)
stream UniqueItems@i(schemaFor(Warehouse2OutputSchema), count: Integer)
  := Udop(SortJoinedItem@i)["FilterDuplicatedItems"]{}
  -> node(np, @i-1), partition["CMP%@i"]
Nil := Sink(DuplicatedItems@i)["file:///DUPLICATED_FILE@i", csvFormat, noDelays]{}
  -> node(np, @i-1), partition["CMP%@i"]
stream FilterStream@i(item: String, recorded_indicator: Integer)
  := Functor(UniqueItems@i)[] { item, 1 }
  -> node(np, @i-1), partition["CMP@i"]
stream AggregatedItems2@i(LoadNum: Integer, Item_Load_Count: Integer)
  := Aggregate(FilterStream@i <count(UNIQUE_ITEM@i)>) [ recorded_indicator ]
     { Any(recorded_indicator), Cnt() }
  -> node(np, @i-1), partition["CMP@i"]
stream AddTimeStamp@i(LoadNum: Integer, Item_Load_Count: Integer, LoadTimeStamp: Long)
  := Functor(AggregatedItems2@i)[] { LoadNum, Item_Load_Count, timeStampMicroseconds() }
  -> node(np, @i-1), partition["CMP@i"]
Nil := Sink(AddTimeStamp@i)["file:///final_result.out", csvFormat, noDelays]{}
  -> node(np, @i-1), partition["CMP@i"]
for_end
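The UDOP "FilterDuplicatedItems" splits the sorted stream into duplicated and unique tuples. One plausible Python reading of that logic (our sketch; the real operator is user-defined code inside System S):

from itertools import groupby

def split_duplicates(sorted_records, key):
    # Input must already be sorted by key (the upstream Sort guarantees this);
    # records whose key occurs more than once go to the duplicate port.
    dups, uniqs = [], []
    for _, grp in groupby(sorted_records, key=key):
        grp = list(grp)
        (uniqs if len(grp) == 1 else dups).extend(grp)
    return dups, uniqs

dups, uniqs = split_duplicates([("a", 1), ("a", 2), ("b", 3)],
                               key=lambda r: r[0])
print(dups)   # [('a', 1), ('a', 2)]
print(uniqs)  # [('b', 3)]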
Qualitative Evaluation of SPADE
Implementation
– Lines of SPADE: 76 lines
– # of Operators: 19 (1 UDOP Operator)
Evaluation
– With the built-in operators of SPADE, we could develop the given ETL scenario in a highly productive manner
– The functionality of System S for running a SPADE program on distributed nodes was a great help
Performance Evaluation
[Diagram: node assignment A – 14 nodes total (4 cores each): data distribution hosts and an item-sorting host, plus 10 compute hosts (40 cores)]
Total nodes: 14 nodes, 56 CPU cores. Spec for each node: Intel Xeon X5365 3.0 GHz (4 physical cores with HT), 16 GB memory, RHEL 5.3 64-bit (Linux kernel 2.6.18-164.el5). Network: InfiniBand (DDR, 20 Gbps) or 1 Gbps Ethernet.
Software: InfoSphere Streams (beta version). Data: 9 million records (each record is around 100 bytes).
Node Assignment
[Diagram: node assignment A – data distribution hosts and an item-sorting host, plus 10 compute hosts (40 cores); 14 nodes total (4 cores each), with some cores not used]
Throughput for Processing 9 Million Records
[Chart: throughput (records/s) and speedup against 4 cores, for 4–40 cores. Hardware: 14 nodes, Intel Xeon 3.0 GHz, 4 cores each, 16 GB RAM, RHEL 5.3 64-bit, InfiniBand; node assignment A]
Maximum throughput: around 180,000 records per second (144 Mbps)
Analysis (I-a): Breakdown of the Total Time
[Chart: elapsed time (s) for processing 9 million records, broken into time for computation and time for data distribution, for 4–40 cores (1 node has 4 cores). Hardware: as above; node assignment A]
Data distribution is dominant; computation accounts for the smaller share.
Analysis (I-b): Speedup against 4 cores for the computation part only
[Chart: speedup ratio against 4 cores of the computation time, for 4–40 cores, compared with linear speedup. Hardware: as above; node assignment A]
The computation part scales beyond linear.
CPU Utilization at Compute Hosts
[Chart: CPU utilization at the compute hosts, alternating between idle and computation phases]
Performance Optimization
The previous experiment shows that most of the time is spent in data distribution and I/O processing.
For performance optimization, we implemented a SPADE program in which all the nodes participate in the data distribution, each Source operator being responsible only for its own chunk of the data records, the total divided by the number of Source operators.
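In Python terms (our schematic sketch, not the SPADE code), each of N source workers reads and distributes only its own slice of the records:

def source_worker(worker_id, num_workers, records):
    # Every node runs one worker; each worker handles an equal,
    # contiguous slice of the input, so distribution work is shared.
    chunk = len(records) // num_workers
    lo = worker_id * chunk
    hi = lo + chunk if worker_id < num_workers - 1 else len(records)
    yield from records[lo:hi]

records = list(range(9))  # stand-in for the 9M records
for w in range(3):
    print(w, list(source_worker(w, 3, records)))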
Performance Optimization
[Diagram: the original SPADE program uses a single data distribution host running Source → Functor → Split (key = item) over the warehouse files, feeding compute hosts (1)–(N); the optimized SPADE program runs a Source → Split chain on each of several data distribution hosts, all feeding the same compute-host chains]
1. We modified the SPADE data-flow program so that multiple Source operators participate in the data distribution.
2. Each data distribution node reads a chunk of the whole data.
Node Assignment
Data Distribution
– All 14 nodes participate in the data distribution.
– Each Source operator reads its share of the records: the total (9M records) divided by the number of Source operators.
– The node assignment for the compute nodes is the same as in Experiment I.
[Diagram: 14 nodes (4 cores each), each with a local disk, serving as combined data distribution / compute hosts]
Elapsed time with varying numbers of compute nodes and source operators
[Chart: elapsed time (s) for processing 9M records, over 20–32 compute nodes and 3–45 source operators. Hardware: as above; node assignment C]
Throughput: over 800,000 records/s
[Chart: throughput (records/s) with varying numbers of source nodes (3–45), for 20, 24, 28, and 32 compute cores. Hardware: as above; node assignment C]
Scalability: Super-Linear Speedup Achieved with the Data Distribution Optimization
[Chart: speedup ratio against 4 cores for 4–32 compute nodes, comparing 1 source operator, 3 source operators, and the optimization (9 source operators for 20, 24, 28, 32) with the linear baseline; the optimized configuration exceeds linear speedup. Hardware: as above; node assignment C]
Related Work
Near Real-Time ETL
– Panos et al. reviewed the state of the art of both conventional and near-real-time ETL [2008, Springer]
ETL Benchmarking
– Wyatt et al. identify common characteristics of ETL workflows in an effort to propose a unified evaluation method for ETL [2009, Springer Lecture Notes]
– TPC-ETL: formed in 2008 and still under development by the TPC subcommittee
Conclusions and Future Work
Conclusions
– Demonstrated the software productivity and scalable performance of System S in the ETL domain
– After the data distribution optimization, we achieved super-linear scalability, processing around 800,000 records per second on 14 nodes
Future Work
– Comparison with the existing ETL tools / systems and various application scenarios (TPC-ETL?)
– Automatic Data Distribution Optimization
Future Direction: Automatic Data Distribution Optimization
We were able to identify the appropriate number of source operators through a series of long-running experiments.
However, it is not wise for a distributed system such as System S to force users and developers to find the appropriate number of source nodes experimentally.
We will need an automatic optimization mechanism that maximizes throughput by finding the best number of source nodes, seamlessly from the user's point of view (see the sketch below).
[Diagram: n Source operators S1..Sn read data chunks d1..dn and feed compute operators C1..Cm; operators are mapped onto a node pool of P nodes, e.g., n(S1, C1), n(Sn, C3)]
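One way to picture such a mechanism (a speculative sketch of ours, not an existing System S feature) is a hill-climbing search that adds source operators while the measured throughput keeps improving:

def tune_sources(measure, low=1, high=45):
    # measure(n) -> observed throughput with n source operators.
    best_n, best_t = low, measure(low)
    n = low
    while n < high:
        n += 1
        t = measure(n)
        if t <= best_t:
            break  # throughput saturated or degraded; stop adding sources
        best_n, best_t = n, t
    return best_n, best_t

# Toy throughput model that saturates at 9 sources, echoing Experiment II.
print(tune_sources(lambda n: min(n, 9) * 90_000))  # -> (9, 810000)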
Towards Adaptive Optimization
[Diagram: a single-Source SPADE program (left) is converted by the data distribution optimizer into a multi-Source program (right), with Source operators S1..Sn reading chunks d1..dn and feeding compute operators C1..Cm]
§ The current SPADE compiler has a compile-time optimizer that collects statistics such as tuple/byte rates and the CPU ratio of each operator.
§ We would like to let users and developers write a SPADE program in the left-hand manner, without considering data partitioning and data distribution.
§ By extending the current optimizer, the system could automatically convert the left-hand program into the right-hand program, which achieves maximal data distribution.
Executive Summary
Motivation:
§ Evaluate System S as an ETL platform in a large experimental environment, the Watson cluster
§ Understand the performance characteristics of such a large testbed, such as scalability and performance bottlenecks
Findings:
§ A series of experiments has shown that the data distribution cost is dominant in ETL processing
§ The optimized version shows that changing the number of data feed (source) operators dramatically increases throughput and obtains higher speedups than the alternatives
§ Using the InfiniBand network is critical for an ETL workload of this kind, which includes a barrier before aggregating all the data for the sorting operation; we achieved almost double the performance compared with the 1 Gbps network
[Charts: speedup comparison among the optimizations (1 source operator, 3 source operators, 9 source operators for 20–32 cores) against linear; elapsed time (computation vs. data distribution) for processing 9 million records, baseline vs. optimized; throughput with varying numbers of source nodes; throughput with the 1 Gbps network vs. the InfiniBand network]
Node Assignment (B) for Experiment II
The experimental environment comprises 3 source nodes for data distribution, 1 node for item sorting, and 10 nodes for computation. Each compute node has 4 cores, and we manually allocate each operator with the following scheduling policy: the diagram shows the case in which 32 operators are used for computation, each operator being allocated to the adjacent node in order.
[Diagram: 14 nodes (4 cores each) – data distribution hosts, an item-sorting host, and 10 compute hosts (40 cores)]
SPADE Program with Data Distribution Optimization
[Diagram: three data distribution hosts (c0101b01, c0101b02, c0101b03), each running Source → Split → Functor → Sink over a warehouse item file (Warehouse_20090901_2.txt), feed the compute hosts (c0101b05, c0101b06, c0101b07, ..., s72x336-14); each compute host runs per-core copies of the chain Join → Sort → Join → Aggregate → Sort → Functor → UDOP (SplitDuplicatedTuples) → Sink / ODBCAppend]
Since 3 nodes participate in the data distribution, the number of communication channels is at most 120 (3 × 40).
New SPADE Program
After (Experiment II, with data distribution optimization):

for_begin @j 1 to COMPUTE_NODE_NUM
bundle warehouse1Bundle@j := ()
for_end

#define SOURCE_NODE_NUM 3

for_begin @i 0 to SOURCE_NODE_NUM-1
stream Warehouse1Stream@i(schemaFor(Warehouse1Schema))
  := Source()["file:///SOURCEFILE", nodelays, csvformat]{}
  -> node(SourcePool, @i), partition["Sources@i"]
stream StreamWithSubindex@i(schemaFor(Warehouse1Schema), subIndex: Integer)
  := Functor(Warehouse1Stream@i)[] {
       subIndex := (toInteger(strSubstring(item, 6,2)) / (60 / COMPUTE_NODE_NUM)) }
  -> node(SourcePool, @i), partition["Sources@i"]
for_begin @j 1 to COMPUTE_NODE_NUM
stream ItemStream@i@j(schemaFor(Warehouse1Schema), subIndex:Integer)
for_end
  := Split(StreamWithSubindex@i) [ subIndex ]{}
  -> node(SourcePool, @i), partition["Sources@i"]
for_begin @j 1 to COMPUTE_NODE_NUM
warehouse1Bundle@j += ItemStream@i@j
for_end
for_end

for_begin @j 1 to COMPUTE_NODE_NUM
stream StreamForWarehouse1Sort@j(schemaFor(Warehouse1Schema))
  := Functor(warehouse1Bundle@j[:])[]{}
  -> node(np, @j-1), partition["CMP%@j"]
stream Warehouse1Sort@j(schemaFor(Warehouse1Schema))
  := Sort(StreamForWarehouse1Sort@j <count(SOURCE_COUNT@j)>)[item, asc]{}
  -> node(np, @j-1), partition["CMP%@j"]
stream Warehouse1Filter@j(schemaFor(Warehouse1Schema))
  := Functor(Warehouse1Sort@j)[ Onhand="0001.000000" ] {}
  -> node(np, @j-1), partition["CMP%@j"]
for_end

Before (Experiment I):

bundle warehouse1Bundle := ()
for_begin @i 1 to 3
stream Warehouse1Stream@i(schemaFor(Warehouse1Schema))
  := Source()["file:///SOURCEFILE", nodelays, csvformat]{}
  -> node(np, 0), partition["Sources"]
warehouse1Bundle += Warehouse1Stream@i
for_end

(Warehouses 2, 3, and 4 are omitted in this chart, but we executed them in the experiment.)
Node Assignment (C) for Experiment III
Data Distribution
§ All 14 nodes participate in the data distribution, and each Source operator is assigned as shown in the diagram: for instance, 24 Source operators are allocated to the nodes in order, and once 14 source operators have been placed on the 14 nodes, the next source operator wraps around to the first node (see the sketch below).
§ Each operator reads its share of the records: the total (9M records) divided by the number of source operators. This data division is done beforehand using the Linux tool "split".
§ The node assignment for the compute nodes is the same as in Experiment I.
[Diagram: 14 nodes (4 cores each), each with a local disk, serving as combined data distribution / compute hosts]
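In Python terms (an illustrative sketch of ours, not the deployment scripts), the wrap-around placement and the per-operator record counts reduce to simple arithmetic:

NUM_NODES = 14
TOTAL_RECORDS = 9_000_000

def node_of(op_index):
    # Source operators are laid out round-robin over the 14 nodes.
    return op_index % NUM_NODES

def records_per_operator(num_sources):
    # Each Source operator reads an equal share of the 9M records.
    return TOTAL_RECORDS // num_sources

print(node_of(14))                # 0 -> the 15th operator wraps to the first node
print(records_per_operator(24))   # 375000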
Performance Result for Experiment II and Comparison with Experiment I
[Chart: throughput (data records/s) for 4–32 cores, non-optimized vs. I/O-optimized, together with the speedup ratio (%) of the optimized version against the non-optimized one. Hardware: as above; node assignment B]
When 3 nodes participate in the data distribution, the throughput almost doubles compared with the result of Experiment I.
Analysis (II-a): Optimization by Changing the Number of Source Operators
Motivation for this experiment: in the previous result, the throughput saturates around 16 cores because the data feeding rate lags behind the computation.
Experimental environment:
§ We changed the number of source operators while keeping the total data volume fixed (9M data records), and measured throughput.
§ We tested only 9MDATA-32 (32 operators for computation).
Experimental results: 9 source nodes obtain the best throughput.
[Chart: throughput (records/s) for 3, 5, 9, and 15 source nodes with 32 cores used for computation; 9 source nodes perform best. Hardware: as above; node assignment B]
[Diagram: node assignment for 9 data distribution nodes]
Analysis (II-b): Increased Throughput by the Data Distribution Optimization
The following graph shows the overall results of taking the same optimization approach as in the previous experiment, increasing the number of source operators: 3 source operators are used for 4, 8, 12, and 16 cores, and 9 source operators for 20, 24, 28, and 32 cores.
We achieved a 5.84x speedup against 4 cores at 32 cores.
[Chart: throughput (data records/s) for 4–32 cores, comparing 1 source operator, 3 source operators, and the optimization (9 source operators for 20, 24, 28, 32). Hardware: as above; node assignment B]
Analysis (II-c): Increased Throughput by the Data Distribution Optimization
[Chart: speedup against 4 cores for 4–32 cores, comparing 1 source operator, 3 source operators, the optimization (9 source operators for 20, 24, 28, 32), and ideal scalability. Hardware: as above; node assignment B]
The optimized line shows the best performance, since 9 nodes participate in the data distribution for 20, 24, 28, and 32 cores.
Experiment (III): Increasing the Number of Source Operators Further
Motivation
– In this experiment, we examine the performance characteristics when increasing the number of source operators beyond the previous experiment (Experiment II).
– We also compare performance between the InfiniBand network and a commodity 1 Gbps network.
Experimental Setting
– We increase the number of source operators from 3 up to 45, and test this configuration against a relatively large number of compute nodes: 20, 24, 28, and 32.
– The node assignment for data distribution and computation is the same as in the previous experiment (Experiment II).
Analysis (III-a): Throughput and Elapsed Time
[Chart: throughput (records/s) with varying numbers of source nodes (3–45), for 20, 24, 28, and 32 cores. Hardware: as above; node assignment C]
800,000 tuples/sec × 100 bytes/tuple × 8 bits/byte = 640 Mbps.
The maximum total throughput, around 640 Mbps, is below the network bandwidth of both InfiniBand and the 1 Gbps LAN.
Analysis (III-c): Performance Without InfiniBand
In this experiment, we measured the throughput without InfiniBand against a varying number of source operators.
Unlike the performance obtained with InfiniBand, the throughput saturates at around 12–15 source operators.
The throughput is around 400,000 data records per second at maximum, which accounts for around 360 Mbps. Although the network used in this experiment is 1 Gbps, this appears to be the upper limit for consuming the network bandwidth once the System S overhead is taken into account.
A drastic performance degradation from 15 to 18 source operators can be observed; we assume this is because, once 14 source operators have been allocated to the 14 nodes, two or more operators (processes) access the same 1 Gbps network card simultaneously and resource contention occurs.
[Charts: throughput (data/s) and elapsed time (s) for processing 9M records, with 2–45 source operators, for 20, 24, 28, and 32 cores. Hardware: as above but with the 1 Gbps network; node assignment C]
Analysis (III-d): Comparison With and Without InfiniBand
[Charts: throughput (records/s) with varying numbers of source nodes, with InfiniBand (left) and without InfiniBand (right), for 20, 24, 28, and 32 cores. Hardware: as above; node assignment C]
This comparison enables and disables the InfiniBand network. The absolute throughput with InfiniBand is double that without it. This result indicates that using InfiniBand is essential for obtaining high throughput in ETL-type workloads.
Analysis (I-c): Elapsed Time for Distributing 9 Million Records to Multiple Cores
[Chart: elapsed time (s) for distributing 9 million records, for 4–40 cores (1 node has 4 cores). Hardware: as above; node assignment A]
The graph demonstrates that the elapsed time for distributing all the data to a varying number of compute cores is nearly constant.