ETL Queues for Active Data Warehousing
Alexis Karakasidis
Panos Vassiliadis
Evaggelia Pitoura
Dept. of Computer Science, University of Ioannina
IQIS'05 17 June 2005, Baltimore MD, USA
Forecast
• We demonstrate that we can employ queue theory to predict the behavior of an Active ETL process
• We discuss the implementation issues involved in achieving minimal overhead on the source system and high data freshness
Contents
• Problem description
• System Architecture & Theoretical Analysis
• Experiments
• Conclusions and Future Work
Active Data Warehousing
• Traditionally, data warehouse refreshment has been performed off-line, through Extraction-Transformation-Loading (ETL) software.
• Active Data Warehousing refers to a new trend where data warehouses are updated as frequently as possible, to accommodate the high demands of users for fresh data.
• Issues that come up:
– How to design an Active DW?
– How can we implement an Active DW?
Issues and Goals of this paper
• Smooth upgrade of the software at the source
– The modification of the software configuration at the source side is minimal.
• Minimal overhead of the source system
• No data losses are allowed
• Maximum freshness of data
– The response time for the transport, cleaning, transformation and loading of a new source record to the DW should be small and predictable
• Stable interface at the warehouse side
– The architecture should scale up with respect to the number of sources and data consumers at the DW
Contributions
• We set up the architectural framework and the issues that arise for the case of active data warehousing.
• We develop the theoretical framework for the problem, by employing queue theory for the prediction of the performance of the system.
– We provide a taxonomy for ETL tasks that allows treating them as black-box tasks.
– Then, standard queue theory techniques can be applied for the design of an ETL workflow.
• We provide technical solutions for the implementation of our reference architecture, achieving the aforementioned goals
• We validate our results through extensive experimentation.
Related work
• Work in the field of ETL is obviously related – but must be customized for active DW
• Streams, due to the nature of the data – still, all related work is on continuous queries, not updates
• Huge amount of work in materialized view refreshment – orthogonal to our problem
• Web services – because, in our architecture, the DW exports web services to the sources
ETL workflows
[Figure: an example ETL workflow running from the Sources through the DSA to the DW. Components shown: source tables S1_PARTSUPP and S2_PARTSUPP; transfers FTP1 and FTP2; staging tables DS.PS_NEW1, DS.PS_OLD1, DS.PS_NEW2 and DS.PS_OLD2; DIFF1 and DIFF2 on PKEY; activities Add_SPK1 (SUPPKEY=1), Add_SPK2 (SUPPKEY=2), SK1, SK2, $2€, A2EDate, AddDate (DATE=SYSDATE), CheckQTY (QTY>0) and NotNULL, each logging rejected rows; a union (U) into DW.PARTSUPP; Aggregate1 (PKEY, DAY, MIN(COST)) and Aggregate2 (PKEY, MONTH, AVG(COST)) feeding views V1 and V2.]
Queue Theory for ETL
• We can model various kinds of ETL transformations as queues, which we call ETL queues
• Each queue has an incoming arrival rate λ and a mean service time 1/μ
• Little’s Law: N = λ·T
• M/M/1 queue (Poisson arrivals)
– Mean response time W = 1/(μ − λ)
– Mean queue length L = ρ/(1 − ρ), ρ = λ/μ
[Figure: a queue feeding a single server]
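A minimal numeric sketch of the M/M/1 formulas above, in Python. The function name and the example rates (20 arrivals and 25 services per second) are illustrative assumptions and are not taken from the experiments.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Return (rho, W, L) for an M/M/1 queue.

    rho = lambda / mu        utilization
    W   = 1 / (mu - lambda)  mean response time
    L   = rho / (1 - rho)    mean queue length
    """
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        # The formulas only hold when the server keeps up with arrivals.
        raise ValueError("unstable queue: arrival rate must be below service rate")
    rho = arrival_rate / service_rate
    w = 1.0 / (service_rate - arrival_rate)
    l = rho / (1.0 - rho)
    return rho, w, l


# Hypothetical rates, chosen only for illustration.
rho, w, l = mm1_metrics(20.0, 25.0)
print(f"rho={rho:.3f}  W={w:.3f} sec  L={l:.3f}")
```

The two formulas agree with Little's Law: the mean queue length returned here equals the arrival rate multiplied by the mean response time.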
Queue Theory for ETL
• Queues can be combined to form queue networks
• Jackson networks: networks where each queue can be solved independently (under reasonable constraints)
• We can use queue theory to predict the behavior of the Active Data Warehouse
How to predict the behavior of the Active Data Warehouse
1. Compose ETL queues in a Jackson network to simulate the implementation of the Active Data Staging Area (ADSA)
2. Then, solve the Jackson network and relate the parameters of ADSA, specifically:
– Source arrival rate (i.e., rate of record production at the source)
– Overall service time (i.e., time that a record spends in the ADSA)
– Mean queue length (i.e., no. of records in the network)
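The two steps above can be sketched for the simplest case, a chain of ETL queues in series. This is only an illustration under stated assumptions: every stage is treated as an independent M/M/1 queue, each stage forwards a fixed fraction of its input (the rest is rejected), and the stage names, service rates and pass fractions are hypothetical.

```python
def solve_tandem(source_rate, stages):
    """Solve a chain of M/M/1 ETL queues.

    stages is a list of (name, service_rate, pass_probability) tuples.
    Returns (per-stage results, mean records in the network,
    mean time a record spends in the network).
    """
    results = []
    rate = source_rate
    for name, mu, p_pass in stages:
        if rate >= mu:
            raise ValueError(f"stage {name!r} is unstable: {rate} >= {mu}")
        rho = rate / mu
        results.append({
            "name": name,
            "arrival_rate": rate,
            "mean_length": rho / (1.0 - rho),
            "mean_delay": 1.0 / (mu - rate),
        })
        # Only the rows that pass this stage reach the next one.
        rate *= p_pass
    total_length = sum(r["mean_length"] for r in results)
    # Little's Law on the whole network: T = N / lambda,
    # averaged over all records entering it (rejected ones included).
    overall_time = total_length / source_rate
    return results, total_length, overall_time


# Hypothetical scenario: a filter, a surrogate-key step and a loader.
stages = [("filter", 40.0, 0.9), ("surrogate_key", 36.0, 1.0), ("loader", 30.0, 1.0)]
results, total_length, overall_time = solve_tandem(20.0, stages)
for r in results:
    print(r)
print("mean records in network:", total_length)
print("mean time in network (sec):", overall_time)
```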
Taxonomy of ETL transformations
• Filters
• Transformers
• Binary Operators
• Generic model
[Figure: generic model of an ETL activity i, with outputs labeled P_ai and P_ri (rejected)]
System Architecture
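[Figure: system architecture. Each Source sends records through a Source Flow Regulator (S FlowR) to the ADSA, where ETL activities process them; WS Clients in the ADSA call the web services (WS) exported by the DW]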
Experimentation environment
• Source: an application in C that uses an ISAM library
• ADSA implemented in Sun JDK 1.4
• Web Services platform:
– Apache Axis 1.1 [AXIS04]
– Xerces XML parser
– Apache Tomcat 1.3.29
• DW implemented over MySQL 4.1
• Configuration:
– Source: PIII 700MHz with 256MB memory, SuSE Linux 8.1
– DW: Pentium 4 2.8GHz with 1GB memory, Mandrake Linux, ADSA included
– Department’s LAN for the network
• Source operates at full capacity
First set of experiments
• A first set of experiments over a simple configuration, to determine fundamental architectural choices
• Issues
– Smooth upgrade of the source software
– UDP vs TCP
– Source Overhead
– Data delay
– Topology
[Figure: the simple configuration. Source, S FlowR, ADSA with a WS Client, and the WS at the DW]
Experimentation results
• Smooth upgrade: not more than 100 lines of code modified
• UDP resulted in 35% data loss, due to ADSA overflow => TCP a clear choice
• Source overhead is highly dependent on row blocking:
– Source overhead is 1.7% with a source flow regulator, vs 34% without
– WS mode (blocking vs non-blocking) has no effect
– Medium size packets seem to work better
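Row blocking, as referred to in the results above, means that the source groups several rows into one packet before handing them to the network. A minimal Python sketch of the idea follows. The class name, its methods and the `send_packet` callable are hypothetical and do not reproduce the actual C implementation at the source; in a real deployment `send_packet` would write to a TCP connection.

```python
class RowBlockingRegulator:
    """Buffers rows and ships them in packets of a fixed number of rows."""

    def __init__(self, send_packet, rows_per_packet):
        if rows_per_packet < 1:
            raise ValueError("rows_per_packet must be at least 1")
        self._send_packet = send_packet      # hypothetical transport callable
        self._rows_per_packet = rows_per_packet
        self._buffer = []

    def write(self, row):
        """Queue one row; ship a packet once the block is full."""
        self._buffer.append(row)
        if len(self._buffer) >= self._rows_per_packet:
            self.flush()

    def flush(self):
        """Ship whatever is buffered, even if the block is not full."""
        if self._buffer:
            self._send_packet(list(self._buffer))
            self._buffer.clear()

    def close(self):
        # Flush the last partial block so that no rows are lost.
        self.flush()


# Usage: collect packets in a list instead of sending them over a socket.
sent = []
regulator = RowBlockingRegulator(sent.append, rows_per_packet=25)
for row_id in range(60):
    regulator.write(row_id)
regulator.close()
print([len(packet) for packet in sent])
```

With one row per packet the source pays the per-call cost for every record; larger blocks amortize that cost, at the price of rows waiting in the buffer until the block fills.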
Data Freshness
• We count the time to carry all records from source to DW
• We empty the ADSA with 3 policies:
– Immediate transport
– We simulate a slower ADSA by removing 50, 100, 150, 200, 250 and 300 records from the queue every 0.1 sec
– We remove 500, 1000, 1500, 2000, 2500 and 3000 records every 1 sec
• Source max rate is about 1250 records / sec
• Findings:
– Small package sizes result in small delays
– There is a threshold (the source rate): when the emptying rate falls below it, the queue explodes
– We can achieve data freshness time equal to data insertion time when we continuously empty a small size queue
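The threshold effect in the findings can be illustrated with an idealized simulation. This is a sketch only: it assumes the source produces exactly 125 records in every 0.1 sec tick (1250 records / sec, with no burstiness) and runs for 600 ticks, so its numbers are not expected to match the measured queue sizes. The function and constant names are hypothetical.

```python
def simulate_emptying(arrivals_per_tick, drain_per_tick, ticks):
    """Return (peak queue size, final queue size) after the given ticks.

    In every tick the source adds arrivals_per_tick records, then the
    emptying policy removes at most drain_per_tick records.
    """
    queue = 0
    peak = 0
    for _ in range(ticks):
        queue += arrivals_per_tick
        peak = max(peak, queue)
        queue -= min(queue, drain_per_tick)
    return peak, queue


ARRIVALS_PER_TICK = 125   # assumed: 1250 records / sec over a 0.1 sec tick
TICKS = 600               # assumed run length: 60 sec of source activity

for drain in (50, 100, 150, 200, 250, 300):
    peak, final = simulate_emptying(ARRIVALS_PER_TICK, drain, TICKS)
    print(f"drain {drain:3d} per 0.1 sec: peak={peak:6d}  final={final:6d}")
```

In this model the queue grows without bound whenever fewer than 125 records are removed per tick, and stays at a single tick's worth of arrivals otherwise.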
Data Freshness
[Figure: Queue size over time. Emptying rate 150 records per 0.1 sec. x-axis: Time (secs), 0 to about 75; y-axis: Size of queue (#elements), 0 to 2000]
Data Freshness
[Figure: six panels of queue size over time, one per emptying rate: 50, 100, 150, 200, 250 and 300 records per 0.1 sec. x-axis: Time (secs); y-axis: Size of queue (#elements). The y-axis runs to 70000 elements for 50 records per 0.1 sec and to 35000 for 100 records per 0.1 sec, but only to 2000 or less for 150 records per 0.1 sec and above. The x-axis runs to about 215 secs, 111 secs and roughly 75 secs respectively.]
Data Freshness
[Figure: Time to complete transfer from ADSA to DW. x-axis: Queue emptying rate (500, 1000, 1500, 2000, 2500, 3000); y-axis: Time (secs), 0 to 250]
Experiments including transformation scenarios
• We enrich the previous configuration with several ETL activities in the ADSA
• Based on the previous results, we have fixed:
– 2-tier architecture, ADSA at the DW
– Source Flow Regulation with medium size packages
– TCP for network connection
– Non-blocking calling of DW WS’s
Scenarios to measure data freshness
[Figure: four scenarios, (a) to (d). In each, the Source sends records through the S FlowR to the ADSA, whose WS Clients call the WS at the DW.
(a) No ETL activities in the ADSA; one WS Client.
(b) Filter 10%, SK, GB Sum; one WS Client.
(c) Filter 10%, Filter 2%, SK, GB Sum; two WS Clients.
(d) Filter 10%, Filter 6%, Filter 2%, two Replace activities, SK, GB Sum; two WS Clients.]
Goals of the experiments
• Steadiness of the system
– System is steady whenever service rate is higher than arrival rate; transient effects disappear
• Source overhead
– Medium size blocking is still a winner
• Throughput for ADSA
– The ADSA is only one packet behind the source
– Avg. delay per row ~0.9 msec for all scenarios
• Success of theoretical prediction
– Half a packet underestimation
Conclusions
• We can employ queue theory to predict the behavior of an Active ETL process
• We have proposed an architectural configuration with
– Minimal source overhead
– No effect on the source due to the operation of an ADSA
– No packet losses, due to the usage of TCP
– Small delay in the ADSA, especially if row blocking in medium size blocks is used
Future Work
• Combine our configuration with results in the optimization of ETL processes (ICDE’05)
• Fault tolerance
• Experiment with higher client loads at the warehouse side
• Scale up the number of sources involved
Grand View
[Figure: grand view. Source Applications at several Sources send plain data to the ADSA, where ETL activities (σ, γ, SK, GROUP) produce clean, reconciled, possibly aggregated data to be loaded in the DW]
Jackson’s Theorem and ETL queues
Jackson’s Theorem. If in an open network the condition λ_i < µ_i · m_i holds for every i ∈ {1, ..., N} (with m_i standing for the number of servers at node i), then the steady-state probability of the network can be expressed as the product of the state probabilities of the individual nodes:
π(k_1, …, k_N) = π_1(k_1) π_2(k_2) ... π_N(k_N)
Therefore, we can solve this class of networks in four steps:
• Solve the traffic equations to find λ_i for each queuing node i
• Determine separately for each queuing system i its steady-state probabilities π_i(k_i)
• Determine the global steady-state probabilities π(k_1, …, k_N). Derive the desired global performance measures.
• From step 1, we can derive the mean delay and queue length for each node.
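The steps above can be sketched in Python for the single-server case (m_i = 1 at every node). This is an illustration under assumptions, not the analysis used in the talk: the traffic equations are written as λ_i = γ_i + Σ_j λ_j · r_ji, where the external arrival rates γ_i and the routing probabilities r_ji are symbols introduced here, and they are solved by simple fixed-point iteration. The three-node network and its rates are hypothetical.

```python
def solve_traffic_equations(external, routing, max_iterations=1000, tol=1e-12):
    """Solve lambda_i = external_i + sum_j lambda_j * routing[j][i] by iteration."""
    n = len(external)
    lam = list(external)
    for _ in range(max_iterations):
        new = [external[i] + sum(lam[j] * routing[j][i] for j in range(n))
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, lam)) < tol:
            return new
        lam = new
    raise RuntimeError("traffic equations did not converge")


def solve_jackson(external, routing, service_rates):
    """Return per-node (lambda, rho, mean queue length, mean delay)."""
    nodes = []
    for lam, mu in zip(solve_traffic_equations(external, routing), service_rates):
        if lam >= mu:
            raise ValueError("condition lambda_i < mu_i does not hold")
        rho = lam / mu
        nodes.append((lam, rho, rho / (1.0 - rho), 1.0 / (mu - lam)))
    return nodes


def joint_probability(nodes, state):
    """Product form: pi(k_1..k_N) = prod_i (1 - rho_i) * rho_i ** k_i."""
    p = 1.0
    for (_, rho, _, _), k in zip(nodes, state):
        p *= (1.0 - rho) * rho ** k
    return p


# Hypothetical network: node 0 splits its output between nodes 1 and 2,
# and 10% of its rows leave the network (rejected).
external = [20.0, 0.0, 0.0]
routing = [[0.0, 0.5, 0.4],
           [0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0]]
nodes = solve_jackson(external, routing, service_rates=[40.0, 20.0, 16.0])
for node in nodes:
    print(node)
print("P(all queues empty) =", joint_probability(nodes, (0, 0, 0)))
```

For a feed-forward workflow such as an ETL chain, the iteration settles after at most as many passes as there are nodes; for networks with feedback, a direct linear solve may be preferable.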
Source Code Alterations

Original Routine:
Open_isam_File() {
    … opening_isam_file_commands …
}
Altered Routine:
Open_isam_File() {
    … opening_isam_file_commands …
    if (open == success)
        DWFlowR_socket_open()
}

Original Routine:
Write_record_to_File() {
    … insert_record_commands …
}
Altered Routine:
Write_record_to_File() {
    … insert_record_commands …
    if (write == success)
        write_to_SFlowR()
}

Original Routine:
Close_isam_File() {
    … closing_isam_file_commands …
}
Altered Routine:
Close_isam_File() {
    … closing_isam_file_commands …
    if (close == success)
        DWFlowR_socket_close()
}
Source overhead
[Figure: Time to insert 1 000 000 records. x-axis: Number of records sent simultaneously (1, 100, 1000); y-axis: Completion time (secs), 0 to 1200; series: plain, non blocking invocation, blocking invocation]
Topology and source overhead
[Figure: Time to insert 1 000 000 records to the Source in relation to topology used. x-axis: Configuration; y-axis: Time (secs), 780 to 920; series: plain, 1-tier, 2-tier (Mediator at Source Host), 2-tier (Mediator at DW Host), 3-tier]
Source overhead
[Figure: source overhead by packet size. y-axis: time (secs), 0 to 140; series: Plain Operation, and packet size at source of 1, 10, 25, 50 and 75 rows/packet]
Throughput for ETL operations
[Figure: Throughput Capability of ETL Operations. x-axis: ETL Operations; y-axis: packets / sec, 0 to 500; series: Filter - 2%, Filter - 6%, Filter - 10%, Aggregate - group by sum, Transform - Surrogate Key, Transform - Replace]
Scenarios to measure data freshness
[Figure: four panels of the average number of packets in queue over time. x-axis: Time (seconds), 0 to about 90; y-axis: # Packets.
Scenario a - Average Number of Packets in Queue with Various Service Rates: ~20, ~23, ~27 and ~33 packets / sec.
Scenario b - Average Number of Packets in Queue @ ~23 packets / sec: FILTER_10_01, GBSUM_01, SK_01, WS_01.
Scenario c - Average Number of Packets in Queue @ ~23 packets / sec: FILTER_10_01, FILTER_2_01, GBSUM_01, SK_01, WS_GB_01, WS_GB_01.
Scenario d - Average Number of Packets in Queue @ ~23 packets / sec: FILTER_10_01, FILTER_2_01, FILTER_6_01, GBSUM_01, REP_01, REP_02, SK_01, WS_GB_01, WS_UPD2_01.]
Data Delay
[Figure: data delay per scenario. y-axis: Time (secs), 88.5 to 93; bars: scenario (a), scenario (b), scenario (c) STORE, scenario (c) GROUPBY, scenario (d) STORE, scenario (d) GROUPBY]
Theoretical prediction vs. actual measurements of average queue length for scenario (c), in packets

Queue          Measured   Theoretical Prediction   Difference
FILTER_10_01   0.160      0.056                    0.104
FILTER_02_01   0.134      0.047                    0.087
SK_01          0.154      0.054                    0.100
GB_SUM_01      0.137      0.048                    0.089
WS_GB          0.091      0.031                    0.059
WS_GB_UPD      0.100      0.035                    0.066
Theoretical Predictions and Actual Measurements
• In most cases, we underestimate the actual queue size by half a packet (i.e., 25 records)
• We overestimate the actual queue size when we simulate slow servers, esp. in the combination of large timeouts and large packets
• Reasons for the discrepancies:
– Simulation of slower rates through timeouts
– Due to the row-blocking approach, the granule of transport is a single packet