ETLBenchmarks Manapps 090127 - Marc Russel's Blog
MANAPPS
V 1.1 2008/10/20 ETL Benchmarks
Pg 1
ETL Benchmarks V 1.1
Comparing
DATASTAGE SERVER 7.5
DATASTAGE PX 7.5
TALEND OPEN STUDIO 2.4.1
INFORMATICA 8.1.1
PENTAHO DATA INTEGRATOR 3.0.0
This document is published under the Creative Commons license: http://creativecommons.org/licenses/by/3.0/us/
You are free:
to Share — to copy, distribute, display, and perform the work
to Remix — to make derivative works
Under the following conditions:
Attribution. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to this web page.
Any of the above conditions can be waived if you get permission from the copyright holder.
Apart from the remix rights granted under this license, nothing in this license impairs or restricts the author's moral rights.
Table of Contents
You are free:
Under the following conditions:
Table of Contents
General comments
Hardware Configuration
Test 1: File Input Delimited > File Output Delimited
Scenario:
Test results:
Test 2: File Input Delimited > Table MySQL Output
Scenario:
Test results:
Test 3: Table Oracle Input > File Output Delimited
Scenario:
Test results:
Test 4: File Input Delimited > Table Output Oracle BULK
Scenario:
Test results:
Test 5: File Input Delimited > Transform > File Output Delimited
Scenario:
Test results:
Test 6: Table Input Oracle > Aggregation > Table Output Oracle (ELT)
Scenario:
Test results:
Test 7: Tables Input Oracle > Transformation > Tables Output Oracle (ELT)
Scenario:
Test results:
Test 8: File Input Delimited > Sort > File Output Delimited
Scenario:
Test results:
Test 9: File Input Delimited > Aggregate > File Output Delimited
Scenario:
Test results:
Test 10: File Input Delimited > Lookup > File Output Delimited
Scenario:
Test results:
Test 11: File Input Delimited > Lookup > File Output Delimited && rejects
Scenario:
Test results:
General comments
This document constitutes Version 1.1 of the ETL Benchmark. Version 1.0 showed inaccurate test results for Informatica's PowerCenter solution, because our tests were carried out with inadequate settings for this product.
An expert from Informatica suggested adapted settings, and the same tests were run again in the same environment in order to preserve a common benchmarking basis across all compared ETL tools.
Using these settings greatly improves the results obtained by Informatica PowerCenter on the same ETL benchmark tests, as detailed in this corrected version of our benchmark.
Version 1.1 of the benchmark thus includes the updated results and the comparison between all tested tools, and Annex 1 details the changes in the use of the Informatica software.
We are open to comments from all tested vendors, as well as from other publishers, and are ready to give access to our testing conditions so that they can verify the results obtained by their products and suggest applicable best practices.
For the tests with DataStage PX, we used 2 nodes to take advantage of the dual cores and of the parallelization feature of the tool.
Results: Overall results are difficult to give for this kind of benchmark, and we think that each test is different, but some people asked us for a global synthesis of those tests.
Global performance: As requested by some people after the release of version 1.0 of this ETL Benchmark, we have assigned, for each test, a specific number of points to the tested solutions (5 points to the best, 4 to the second, down to 1 to the fifth). According to this scenario, results are as follows:
o First: Informatica 8.1.1 (353 points)
o Second: Talend Open Studio 2.4.1 (333 points)
o Third: IBM DataStage PX 7.5 (239 points)
o Fourth: IBM DataStage Server 7.5 (199 points)
o Fifth: Pentaho Data Integration 3.0.0 (148 points)
Below are the detailed results:
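Test     | TOS 2.4.1 | PDI 3.0.0 | IBM DS 7.5 | IBM DS PX 7.5 | INFA PWC 8.1.1
Test1    | 13  | 7   | 19  | 8   | 16
Test2    | 0   | 0   | 0   | 0   | 0
Test3    | 13  | 3   | 7   | 9   | 11
Test4    | 8   | 7   | 12  | 5   | 13
Test5    | 15  | 4   | 13  | 12  | 18
Test6    | 15  | 4   | 10  | 5   | 12
Test7    | 11  | 3   | 7   | 8   | 15
Test8    | 13  | 12  | 5   | 14  | 16
Test8.2  | 12  | 13  | 4   | 15  | 18
Test8.3  | 12  | 12  | 4   | 15  | 17
Test9    | 12  | 6   | 15  | 12  | 17
Test9.2  | 16  | 5   | 12  | 9   | 19
Test9.3  | 12  | 8   | 13  | 11  | 16
Test10   | 20  | 7   | 12  | 10  | 13
Test10.2 | 20  | 6   | 6   | 13  | 16
Test10.3 | 16  | 6   | 6   | 14  | 18
Test10.4 | 12  | 4   | 8   | 17  | 19
Test11   | 20  | 7   | 10  | 8   | 16
Test11.2 | 20  | 6   | 6   | 12  | 16
Test11.3 | 16  | 6   | 6   | 13  | 19
Test12   | 20  | 8   | 13  | 6   | 13
Test12.2 | 20  | 7   | 6   | 11  | 16
Test12.3 | 17  | 7   | 5   | 12  | 19
Total    | 333 | 148 | 199 | 239 | 353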
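The points scheme can be sketched in a few lines of Python. This is only an illustration under two assumptions that the document does not state explicitly: points are awarded once per data volume and then summed per test, and tied tools share the better rank. With those assumptions, the Test 1 timings reported later in this document reproduce the Test1 row of the table (13, 7, 19, 8, 16). The function and variable names are ours.

```python
def award_points(timings):
    """Give 5 points to the fastest tool, down to 1 for the slowest.

    Tied tools share the better rank (an assumption, not stated by the authors).
    """
    points = {}
    for tool, elapsed in timings.items():
        strictly_faster = sum(1 for other in timings.values() if other < elapsed)
        points[tool] = 5 - strictly_faster
    return points


# Test 1 timings from this document, one row per data volume.
TOOLS = ["TOS", "PDI", "DS", "DS PX", "INFA"]
TEST1_RUNS = [
    [1.00, 2.00, 2.00, 3.40, 2.00],          # 100 000 lines
    [7.80, 15.50, 4.00, 12.00, 7.00],        # 1 000 000 lines
    [39.10, 83.80, 12.50, 40.00, 18.00],     # 5 000 000 lines
    [162.09, 417.80, 66.00, 150.00, 74.00],  # 20 000 000 lines
]

totals = dict.fromkeys(TOOLS, 0)
for run in TEST1_RUNS:
    for tool, pts in award_points(dict(zip(TOOLS, run))).items():
        totals[tool] += pts

print(totals)
```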
In terms of intuitiveness and ease of use, Talend Open Studio and DataStage Server are ahead of the pack. DataStage PX comes in third position, Informatica in fourth, and the least intuitive is Pentaho Data Integration. Our assessment of Pentaho is mostly linked to the many parameters that need to be learnt. However, we think that if you invest a lot of time in it, it could become a powerful tool.
Open Source ETL & Parallelization: Pentaho Data Integration claims the first position here, as it is easier to parallelize with PDI. We did, however, find some issues: the tool lets you parallelize all the components, but some results are inconsistent.
Hardware Configuration
OS: Windows XP Pro SP2
CPU: Intel Core2 Duo 2 GHz
JVM: 1.6.0_87
RAM: 4 GB
Test 1: File Input Delimited > File Output Delimited
Scenario:
Reading X lines from a delimited input file and writing them to a delimited output file.
File input delimited extract:
TALEND OPEN STUDIO
Job name: file_input_delimited__file_output_delimited
Job
Schema of file_input_delimited
PENTAHO DATA INTEGRATION
Job name: file_input_delimited__file_output_delimited
Job
Schema of file_input_delimited
DATASTAGE SERVER
Job name: file_input_delimited__file_output_delimited
Job
Schema of file_input_delimited
DATASTAGE PX
Job name: PX_file_input_delimited__file_output_delimited
Job
Schema of file_input_delimited
INFORMATICA
Job name: file_input_delimited__file_output_delimited
Job
Schema of file_input_delimited
Test results:
Test 1: File Input Delimited > File Output Delimited
Lines          | 100 000 | 1 000 000 | 5 000 000 | 20 000 000
TOS 2.4.1      | 1,00 | 7,80  | 39,10 | 162,09
PDI 3.0.0      | 2,00 | 15,50 | 83,80 | 417,80
IBM DS 7.5     | 2,00 | 4,00  | 12,50 | 66,00
IBM DS PX 7.5  | 3,40 | 12,00 | 40,00 | 150,00
INFA PWC 8.1.1 | 2,00 | 7,00  | 18,00 | 74,00
Statistics (ratio compared with TOS 2.4.1):
Number of lines | TOS 2.4.1 | PDI 3.0.0 | DataStage 7.5 | DataStage PX 7.5 | Informatica 8.1.1
100 000    | - | 2    | 2    | 3,4  | 2
1 000 000  | - | 1,99 | 0,51 | 1,54 | 0,9
5 000 000  | - | 2,14 | 0,32 | 1,02 | 0,46
20 000 000 | - | 2,58 | 0,41 | 0,93 | 0,47
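The "ratio compared with TOS 2.4.1" figures appear to be each tool's time divided by the Talend Open Studio time for the same volume, so a value below 1 means the tool was faster than TOS. A minimal sketch, assuming that definition and rounding to two decimals (the function name is ours), reproduces the 1 000 000-line row of Test 1:

```python
def ratios_vs_reference(timings, reference="TOS 2.4.1"):
    """Divide every tool's time by the reference tool's time."""
    ref = timings[reference]
    return {
        tool: round(elapsed / ref, 2)
        for tool, elapsed in timings.items()
        if tool != reference
    }


# Test 1 timings for 1 000 000 lines, taken from the results table.
one_million = {
    "TOS 2.4.1": 7.80,
    "PDI 3.0.0": 15.50,
    "DataStage 7.5": 4.00,
    "DataStage PX 7.5": 12.00,
    "Informatica 8.1.1": 7.00,
}

print(ratios_vs_reference(one_million))
```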
Test 2: File Input Delimited > Table MySQL Output
Scenario:
Reading X lines from a delimited input file and writing them into a MySQL output table.
Comments:
DataStage 7.5, DataStage PX 7.5 and Informatica 8.1.1 were not tested for this use case. The test was first run with default parameters. To optimize performance, the commit parameter was then tuned. Finally, the job was parallelized. To parallelize with TOS 2.4.1, we just have to split the delimited input file (with the header and limit parameters) and run two sub-jobs in parallel. With PDI 3.0.0, we just have to increase the number of copies.
TOS 2.4.1 supports the extended insert, which is a MySQL feature. This feature limits the number of database accesses and increases performance. With this feature, TOS 2.4.1 is about 6 times faster.
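To illustrate why an extended insert reduces database round trips, here is a minimal sketch that groups rows into multi-row `INSERT ... VALUES (...), (...)` statements. It is not how Talend implements the feature. The table and column names are hypothetical, and SQLite is used only as a stand-in so the example runs without a MySQL server; a MySQL driver would typically use `%s` instead of `?` as the placeholder.

```python
import sqlite3


def extended_inserts(table, columns, rows, batch_size=1000, placeholder="?"):
    """Yield (sql, params) pairs, each inserting up to batch_size rows at once."""
    row_group = "(" + ", ".join([placeholder] * len(columns)) + ")"
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        sql = "INSERT INTO {} ({}) VALUES {}".format(
            table, ", ".join(columns), ", ".join([row_group] * len(batch))
        )
        # Flatten the batch so parameters line up with the placeholders.
        params = [value for row in batch for value in row]
        yield sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id INTEGER, name TEXT)")  # hypothetical table
rows = [(i, "name{}".format(i)) for i in range(250)]

statements = 0
for sql, params in extended_inserts("clients", ["id", "name"], rows, batch_size=100):
    conn.execute(sql, params)
    statements += 1
conn.commit()

inserted = conn.execute("SELECT COUNT(*) FROM clients").fetchone()[0]
print(statements, inserted)
```

Here 250 rows are written with 3 statements instead of 250, which is the effect the comment above attributes to the extended insert.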
TALEND OPEN STUDIO
Job name: file_input_delimited__table_output_mysql
Job (Multi‐Thread Execution checked on Job Settings)
Schema of file_input_delimited
PENTAHO DATA INTEGRATION
Job name: file_input_delimited__table_output_mysql
Job
Schema of file_input_delimited
Test results:
Test 2: File Input Delimited > Table MySQL Output
Lines                          | 100 000 | 1 000 000 | 5 000 000
TOS 2.4.1                      | 15,26 | 144,50 | 731,78
PDI 3.0.0                      | 14,90 | 151,80 | 843,90
TOS 2.4.1 with Extended Insert | 2,60  | 25,00  | 129,00
Statistics (ratio compared with TOS 2.4.1):
Number of lines | TOS 2.4.1 | PDI 3.0.0 | TOS 2.4.1 Extended Insert
100 000   | - | 0,98 | 0,18
1 000 000 | - | 1,05 | 0,17
5 000 000 | - | 1,15 | 0,18
Test 3: Table Oracle Input > File Output Delimited
Scenario:
Reading X lines from an Oracle input table and writing them to a delimited output file.
TALEND OPEN STUDIO
Job name: table_input_oracle__file_output_delimited
Job
Schema of table_input_oracle
PENTAHO DATA INTEGRATION
Job name: table_input_oracle__file_output_delimited
Job
SCHEMA VIEWER NOT POSSIBLE
Schema of table_input_oracle
DATASTAGE SERVER
Job name: table_input_oracle__file_output_delimited
Job
Schema of table_input_oracle
DATASTAGE PX
Job name: PX_table_input_oracle__file_output_delimited
Job
Schema of table_input_oracle
INFORMATICA
Job name: table_input_oracle__file_output_delimited
Job
Schema of table_input_oracle
Test results:
Test 3: Table Oracle Input > File Output Delimited
Lines          | 100 000 | 500 000 | 1 000 000
TOS 2.4.1      | 2,25 | 6,26  | 14,25
PDI 3.0.0      | 4,78 | 21,20 | 37,40
IBM DS 7.5     | 4,00 | 11,00 | 19,00
IBM DS PX 7.5  | 4,00 | 8,00  | 15,00
INFA PWC 8.1.1 | 5    | 6     | 9
Statistics (ratio compared with TOS 2.4.1):
Number of lines | TOS 2.4.1 | PDI 3.0.0 | DataStage 7.5 | DataStage PX 7.5 | Informatica 8.1.1
100 000   | - | 2,12 | 1,78 | 1,78 | 2
500 000   | - | 3,39 | 1,76 | 1,28 | 0,95
1 000 000 | - | 2,62 | 1,33 | 1,05 | 0,63
Test 4: File Input Delimited > Table Output Oracle BULK
Scenario:
Reading X lines from a delimited input file and bulk-loading them into an Oracle output table.
TALEND OPEN STUDIO
Job name: file_input_delimited__table_output_oracle_bulk
Job
PENTAHO DATA INTEGRATION
Job name: file_input_delimited__table_output_oracle_bulk
Job
Schema of file_input_delimited
DATASTAGE SERVER
Job name: file_input_delimited__table_output_oracle_bulk
Job
Schema of file_input_delimited
DATASTAGE PX
Job name: PX_file_input_delimited__table_output_oracle_bulk
Job
Schema of file_input_delimited
INFORMATICA
Job name: file_input_delimited__table_output_oracle_bulk
Job
Schema of file_input_delimited
Test results:
Test 4: File Input Delimited > Table Output Oracle BULK
Lines          | 100 000 | 1 000 000 | 2 000 000
TOS 2.4.1      | 4,36 | 22,12 | 49,66
PDI 3.0.0      | 2,60 | 30,60 | 72,70
IBM DS 7.5     | 3,00 | 18,00 | 40,00
IBM DS PX 7.5  | 6,00 | 27,00 | 55,00
INFA PWC 8.1.1 | 4    | 7     | 11
Statistics (ratio compared with TOS 2.4.1):
Number of lines | TOS 2.4.1 | PDI 3.0.0 | DataStage 7.5 | DataStage PX 7.5 | Informatica 8.1.1
100 000   | - | 0,6  | 0,69 | 1,38 | 0,92
1 000 000 | - | 1,38 | 0,81 | 1,22 | 0,31
2 000 000 | - | 1,46 | 0,8  | 1,11 | 0,22
Test 5: File Input Delimited > Transform > File Output Delimited
Scenario:
Reading X lines from a delimited input file and writing them to a delimited output file after some changes.
Changes list:
• The content of the `rate` field is multiplied by 100.
• The new field `name` is a concatenation (`firstname` + " " + `lastname`).
• The content of the `address` field is converted to uppercase.
Comments: Pentaho Data Integration has no graphical component for transforming data, so we had to use a custom code component; the language used is JavaScript. The four other ETL tools provide a transformer component for this. Talend Open Studio also offers custom code components, named tJavaRow and tPerlRow.
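For readers who want the three changes spelled out, here is a small sketch of the per-row transformation. It is not the code used in any of the tested tools. Rows are represented as dictionaries, `rate` is assumed to arrive as text, and keeping the original `firstname` and `lastname` fields in the output is our assumption.

```python
def transform(row):
    """Apply the three Test 5 changes to one input row."""
    out = dict(row)
    # The field arrives as text from the delimited file (assumed), so convert first.
    out["rate"] = float(row["rate"]) * 100
    # New field: first name and last name separated by a single space.
    out["name"] = row["firstname"] + " " + row["lastname"]
    out["address"] = row["address"].upper()
    return out


sample = {
    "rate": "0.25",
    "firstname": "Ada",
    "lastname": "Lovelace",
    "address": "12 main st",
}
result = transform(sample)
print(result["rate"], result["name"], result["address"])
```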
TALEND OPEN STUDIO
Job name: file_input_delimited__transformation__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited
PENTAHO DATA INTEGRATION
Job name: file_input_delimited__transformation__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited
DATASTAGE SERVER
Job name: file_input_delimited__transformation__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited
DATASTAGE PX
Job name: PX_file_input_delimited__transformation__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited
INFORMATICA
Job name: file_input_delimited__transformation__file_output_delimited
Job
Schema of file_input_delimited
Test results:
Test 5: File Input Delimited > Transform > File Output Delimited
Lines          | 100 000 | 1 000 000 | 5 000 000 | 20 000 000
TOS 2.4.1      | 1,30 | 8,50  | 43,10  | 183,13
PDI 3.0.0      | 5,30 | 51,00 | 259,40 | 1126,10
IBM DS 7.5     | 2,00 | 10,00 | 56,00  | 178,00
IBM DS PX 7.5  | 4,75 | 11,33 | 41,00  | 155,00
INFA PWC 8.1.1 | 3,00 | 6,00  | 17,00  | 74,00
Statistics (ratio compared with TOS 2.4.1):
Number of lines | TOS 2.4.1 | PDI 3.0.0 | DataStage 7.5 | DataStage PX 7.5 | Informatica 8.1.1
100 000    | - | 4,07 | 1,54 | 3,65 | 2,3
1 000 000  | - | 6    | 1,18 | 1,33 | 0,7
5 000 000  | - | 6,02 | 1,3  | 0,95 | 0,39
20 000 000 | - | 6,16 | 0,97 | 0,84 | 0,4
Test 6: Table Input Oracle > Aggregation > Table Output Oracle (ELT)
Scenario:
Reading X lines from an Oracle input table, aggregating them, and writing the result into another Oracle output table (ELT mode).
Comments: Only Talend Open Studio supports an ELT mode. Informatica offers Pushdown Optimization, but I could not find this feature in the tool.
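In ELT mode the aggregation is expressed as SQL and executed inside the database, so the rows never travel through the ETL engine. The sketch below shows the idea with a single `INSERT ... SELECT ... GROUP BY` statement, matching the "group by age, count" job name used in this test. SQLite stands in for Oracle so the example is runnable, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical source and target tables.
conn.execute("CREATE TABLE clients_in (id INTEGER, age INTEGER)")
conn.execute("CREATE TABLE age_counts (age INTEGER, nb INTEGER)")
conn.executemany(
    "INSERT INTO clients_in (id, age) VALUES (?, ?)",
    [(1, 30), (2, 30), (3, 45)],
)

# The whole job is one statement: the database reads, aggregates and writes.
conn.execute(
    "INSERT INTO age_counts (age, nb) "
    "SELECT age, COUNT(*) FROM clients_in GROUP BY age"
)
conn.commit()

counts = conn.execute("SELECT age, nb FROM age_counts ORDER BY age").fetchall()
print(counts)
```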
TALEND OPEN STUDIO
Job name: ELT__table_input_oracle__aggregate_group_by_age_count__table_output_oracle
Job (ELT)
Schema of table_input_oracle
PENTAHO DATA INTEGRATION
Job name: table_input_oracle__aggregate_group_by_age_count__table_output_oracle
Job
SCHEMA VIEWER NOT POSSIBLE
Schema of table_input_oracle
DATASTAGE SERVER
Job name: table_input_oracle__aggregate_group_by_age_count__table_output_oracle
Job
Schema of table_input_oracle
DATASTAGE PX
Job name: PX_table_input_oracle__aggregate_group_by_age_count__table_output_oracle
Job
Schema of table_input_oracle
INFORMATICA
Job name: table_input_oracle__aggregate_group_by_age_count__table_output_oracle
Job
Schema of table_input_oracle
Test results:
Test 6: Table Input Oracle > Aggregation > Table Output Oracle (ELT)
Lines          | 100 000 | 500 000 | 1 000 000
TOS 2.4.1      | 1,24 | 1,4   | 1,69
PDI 3.0.0      | 4,26 | 22,26 | 47,80
IBM DS 7.5     | 2,40 | 8,00  | 13,67
IBM DS PX 7.5  | 8,00 | 12,00 | 17,50
INFA PWC 8.1.1 | 4    | 3     | 4
Statistics (ratio compared with TOS 2.4.1):
Number of lines | TOS 2.4.1 | PDI 3.0.0 | DataStage 7.5 | DataStage PX 7.5 | Informatica 8.1.1
100 000   | - | 3,44  | 1,94 | 6,45  | 3,22
500 000   | - | 15,9  | 5,71 | 8,57  | 2,14
1 000 000 | - | 28,28 | 8,09 | 10,36 | 2,36
Test 7: Tables Input Oracle > Transformation > Tables Output Oracle (ELT)
Scenario:
Reading X lines from Oracle input tables and, after some changes, writing them into other Oracle output tables (ELT mode).
TALEND OPEN STUDIO
Job name: table_input_oracle__elt__table_output_oracle
Job (ELT)
Schema of table_lookup_oracle
Schema of table_input_oracle
PENTAHO DATA INTEGRATION
Job name: table_input_oracle__elt__table_output_oracle
Job
SCHEMA VIEWER NOT POSSIBLE
Schema of table_lookup_oracle
SCHEMA VIEWER NOT POSSIBLE
Schema of table_input_oracle
DATASTAGE SERVER
Job name: table_input_oracle__elt__table_output_oracle
Job
Schema of table_lookup_oracle
Schema of table_input_oracle
DATASTAGE PX
Job name: PX_table_input_oracle__elt__table_output_oracle
Job
Schema of table_lookup_oracle
Schema of table_input_oracle
INFORMATICA
Job name: table_input_oracle__elt__table_output_oracle
Job
Schema of table_lookup_oracle
Schema of table_input_oracle
Test results:
Test 7: Tables Input Oracle > Transformation > Tables Output Oracle (ELT)
Lines          | 100 000 | 500 000 | 1 000 000
TOS 2.4.1      | 5,99  | 23,26  | 52,72
PDI 3.0.0      | 38,35 | 201,60 | 382,60
IBM DS 7.5     | 12,70 | 65,00  | 116,00
IBM DS PX 7.5  | 15,00 | 30,50  | 47,50
INFA PWC 8.1.1 | 5     | 9      | 14
Statistics (ratio compared with TOS 2.4.1):
Number of lines | TOS 2.4.1 | PDI 3.0.0 | DataStage 7.5 | DataStage PX 7.5 | Informatica 8.1.1
100 000   | - | 6,4  | 2,12 | 2,5  | 0,83
500 000   | - | 8,67 | 2,79 | 1,31 | 0,39
1 000 000 | - | 7,26 | 2,2  | 0,9  | 0,27
Test 8: File Input Delimited > Sort > File Output Delimited
Scenario:
Reading X lines from a delimited input file and writing them, sorted, to a delimited output file.
Sorts list:
• Order by the integer field `age` ASC.
• Order by the string field `firstname` ASC.
• Order by the fields `age` and `firstname` ASC.
Comments: With the version used, I could not sort in memory with Pentaho Data Integration, but the feature is present in the latest version. On Talend Open Studio, with large volumes (5 000 000 and 20 000 000 lines), we have to use the tExternalSortRow component, which uses GNU sort, an external sorting program.
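When a file is too large to sort in memory, an external sort works on disk: sort fixed-size chunks, write each to a temporary file, then merge the sorted chunks. The sketch below illustrates that general technique only. It is not the implementation used by GNU sort or by any of the tested tools, and the `;` delimiter and field order (`age` first, `firstname` second) are assumptions.

```python
import heapq
import os
import tempfile


def external_sort(lines, key, chunk_size=100000):
    """Yield lines in sorted order, sorting chunk_size lines at a time.

    Every input line must end with a newline character.
    """
    chunk_paths = []
    chunk = []

    def flush():
        # Sort the current chunk in memory and spill it to a temporary file.
        chunk.sort(key=key)
        fd, path = tempfile.mkstemp(suffix=".chunk", text=True)
        with os.fdopen(fd, "w") as handle:
            handle.writelines(chunk)
        chunk_paths.append(path)
        del chunk[:]

    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    handles = [open(path) for path in chunk_paths]
    try:
        # heapq.merge reads the sorted chunk files lazily, one line at a time.
        for line in heapq.merge(*handles, key=key):
            yield line
    finally:
        for handle in handles:
            handle.close()
        for path in chunk_paths:
            os.remove(path)


def age_then_firstname(line):
    """Sort key for the third scenario: integer age, then firstname."""
    age, firstname = line.rstrip("\n").split(";")[:2]
    return int(age), firstname


rows = ["30;bob\n", "25;zoe\n", "30;al\n"]
sorted_rows = list(external_sort(rows, age_then_firstname, chunk_size=2))
print(sorted_rows)
```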
TALEND OPEN STUDIO
Job names:
• file_input_delimited__sort_on_age__file_output_delimited
• file_input_delimited__sort_on_firstname__file_output_delimited
• file_input_delimited__sort_on_firstname_and_age__file_output_delimited
Job
Schema of file_input_delimited
PENTAHO DATA INTEGRATION
Job names:
• file_input_delimited__sort_on_age__file_output_delimited
• file_input_delimited__sort_on_firstname__file_output_delimited
• file_input_delimited__sort_on_firstname_and_age__file_output_delimited
Job
Schema of file_input_delimited
DATASTAGE SERVER
Job names:
• file_input_delimited__sort_on_age__file_output_delimited
• file_input_delimited__sort_on_firstname__file_output_delimited
• file_input_delimited__sort_on_firstname_and_age__file_output_delimited
Job
Schema of file_input_delimited
DATASTAGE PX
Job names:
• PX_file_input_delimited__sort_on_age__file_output_delimited
• PX_file_input_delimited__sort_on_firstname__file_output_delimited
• PX_file_input_delimited__sort_on_firstname_and_age__file_output_delimited
Job
Schema of file_input_delimited
INFORMATICA
Job names:
• file_input_delimited__sort_on_age__file_output_delimited
• file_input_delimited__sort_on_firstname__file_output_delimited
• file_input_delimited__sort_on_firstname_and_age__file_output_delimited
Job
Schema of file_input_delimited
Test results:
Test 8: File Input Delimited > Sort > File Output Delimited
Sorted by age
Lines          | 100 000 | 1 000 000 | 5 000 000 | 20 000 000
TOS 2.4.1      | 1,44 | 15,73 | 188,21 | 1016,03
PDI 3.0.0      | 3,63 | 32,85 | 155,95 | 668,20
IBM DS 7.5     | 4,20 | 60,70 | 267,70 |
IBM DS PX 7.5  | 4,00 | 16,25 | 64,50  | 492,67
INFA PWC 8.1.1 | 5,00 | 13,00 | 50,00  | 201,00
Statistics (ratio compared with TOS 2.4.1):
Number of lines | TOS 2.4.1 | PDI 3.0.0 | DataStage 7.5 | DataStage PX 7.5 | Informatica 8.1.1
100 000    | - | 2,51 | 2,92 | 2,78 | 3,47
1 000 000  | - | 2,09 | 3,86 | 1,03 | 0,82
5 000 000  | - | 0,83 | 1,42 | 0,34 | 0,26
20 000 000 | - | 0,66 | +++  | 0,48 | 0,2
Test 8: File Input Delimited > Sort > File Output Delimited
Sorted by firstname
Lines          | 100 000 | 1 000 000 | 5 000 000 | 20 000 000
TOS 2.4.1      | 1,69 | 18,05 | 168,46 | 1071,20
PDI 3.0.0      | 3,40 | 31,20 | 157,15 | 739,20
IBM DS 7.5     | 6,00 | 58,00 | 426,00 |
IBM DS PX 7.5  | 4,00 | 16,00 | 57,00  | 624,00
INFA PWC 8.1.1 | 4,00 | 13,00 | 51,00  | 223,00
Statistics (ratio compared with TOS 2.4.1):
Number of lines | TOS 2.4.1 | PDI 3.0.0 | DataStage 7.5 | DataStage PX 7.5 | Informatica 8.1.1
100 000    | - | 2,01 | 3,55 | 2,37 | 2,36
1 000 000  | - | 1,73 | 3,21 | 0,89 | 0,72
5 000 000  | - | 0,93 | 2,53 | 0,34 | 0,3
20 000 000 | - | 0,69 | +++  | 0,58 | 0,21
Test 8: File Input Delimited > Sort > File Output Delimited
Sorted by age & firstname
Lines          | 100 000 | 1 000 000 | 5 000 000 | 20 000 000
TOS 2.4.1      | 1,33 | 17,40 | 225,03 | 1007,00
PDI 3.0.0      | 3,22 | 29,27 | 159,10 | 842,20
IBM DS 7.5     | 7,33 | 60,00 | 360,00 |
IBM DS PX 7.5  | 4,50 | 16,33 | 59,00  | 582,50
INFA PWC 8.1.1 | 5,00 | 13,00 | 49,00  | 211,00
Statistics (ratio compared with TOS 2.4.1):
Number of lines | TOS 2.4.1 | PDI 3.0.0 | DataStage 7.5 | DataStage PX 7.5 | Informatica 8.1.1
100 000   | - | 2,42 | 5,51 | 3,38 | 3,75
1 000 000 | - | 1,68 | 3,45 | 0,94 | 0,74
5 000 000 | - | 0,71 | 1,6  | 0,26 | 0,22
Test 9: File Input Delimited > Aggregate > File Output Delimited
Scenario:
Reading X lines from a delimited input file, performing an aggregation and writing the result to a delimited output file.
1 – Group by the field `age`; Operation: COUNT.
2 – Group by the field `age`; Operations: COUNT, SUM(rate), AVG(rate), MIN(rate), MAX(rate).
3 – Group by the field `firstname`; Operation: COUNT.
Comments: When the output flow is too big (here, aggregating by firstname with a large volume), we have to use the tSortedAggregateRow component on Talend Open Studio. This component sorts rows before the aggregation. In this case, Pentaho Data Integration failed.
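One reason sorting before aggregating helps with large outputs: once rows arrive ordered by the group key, only one group's running totals need to be held in memory at a time. The sketch below illustrates that idea for the second scenario (COUNT, SUM, AVG, MIN, MAX of `rate` per `age`). It is a generic illustration, not the Talend component's implementation, and rows are represented as dictionaries for simplicity.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted(rows, key_field, value_field):
    """Aggregate rows that are already sorted by key_field, one group at a time."""
    for key, group in groupby(rows, key=itemgetter(key_field)):
        count = 0
        total = 0.0
        lowest = highest = None
        for row in group:
            value = row[value_field]
            count += 1
            total += value
            lowest = value if lowest is None else min(lowest, value)
            highest = value if highest is None else max(highest, value)
        yield {
            key_field: key,
            "count": count,
            "sum": total,
            "avg": total / count,
            "min": lowest,
            "max": highest,
        }


# Input must be sorted by the group key before calling aggregate_sorted.
rows = sorted(
    [{"age": 45, "rate": 2.0}, {"age": 30, "rate": 1.0}, {"age": 30, "rate": 3.0}],
    key=itemgetter("age"),
)
summary = list(aggregate_sorted(rows, "age", "rate"))
print(summary)
```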
TALEND OPEN STUDIO
Job names:
• file_input_delimited__aggregate_group_by_age_count__file_output_delimited
• file_input_delimited__aggregate_group_by_age_count_sum_avg_min_max__file_output_delimited
• file_input_delimited__aggregate_group_by_firstname_count__file_output_delimited
Job
Job using the tExternalSortRow component
Schema of file_input_delimited
Schema of file_output_delimited file_input_delimited__aggregate_group_by_age_count__file_output_delimited
PENTAHO DATA INTEGRATION
Job names:
• file_input_delimited__aggregate_group_by_age_count__file_output_delimited
• file_input_delimited__aggregate_group_by_age_count_sum_avg_min_max__file_output_delimited
• file_input_delimited__aggregate_group_by_firstname_count__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited file_input_delimited__aggregate_group_by_age_count__file_output_delimited
DATASTAGE SERVER
Job names:
• file_input_delimited__aggregate_group_by_age_count__file_output_delimited
• file_input_delimited__aggregate_group_by_age_count_sum_avg_min_max__file_output_delimited
• file_input_delimited__aggregate_group_by_firstname_count__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited file_input_delimited__aggregate_group_by_age_count__file_output_delimited
DATASTAGE PX
Job names:
• PX_file_input_delimited__aggregate_group_by_age_count__file_output_delimited
• PX_file_input_delimited__aggregate_group_by_age_count_sum_avg_min_max__file_output_delimited
• PX_file_input_delimited__aggregate_group_by_firstname_count__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited file_input_delimited__aggregate_group_by_age_count__file_output_delimited
INFORMATICA
Job names:
• file_input_delimited__aggregate_group_by_age_count__file_output_delimited
• file_input_delimited__aggregate_group_by_age_count_sum_avg_min_max__file_output_delimited
• file_input_delimited__aggregate_group_by_firstname_count__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited file_input_delimited__aggregate_group_by_age_count__file_output_delimited
Test results:
Test 9: File Input Delimited > Aggregate > File Output Delimited
Group by age (count)
Lines          | 100 000 | 1 000 000 | 5 000 000 | 20 000 000
TOS 2.4.1      | 0,62 | 6,99  | 30,05  | 124,16
PDI 3.0.0      | 2,70 | 26,53 | 134,30 | 466,50
IBM DS 7.5     | 2,00 | 6,00  | 21,00  | 128,00
IBM DS PX 7.5  | 4,00 | 6,50  | 21,33  | 78,00
INFA PWC 8.1.1 | 3,00 | 5,00  | 8,00   | 27,00
Statistics (ratio compared with TOS 2.4.1):
Number of lines | TOS 2.4.1 | PDI 3.0.0 | DataStage 7.5 | DataStage PX 7.5 | Informatica 8.1.1
100 000    | - | 4,35 | 3,23 | 6,45 | 4,84
1 000 000  | - | 3,8  | 0,86 | 0,93 | 0,72
5 000 000  | - | 4,47 | 0,7  | 0,71 | 0,27
20 000 000 | - | 3,76 | 1,03 | 0,63 | 0,22
Test 9: File Input Delimited > Aggregate > File Output Delimited
Group by Age (Count, Sum(Rate), Avg(Rate), Min(Rate), Max(Rate))
Lines          | 100 000 | 1 000 000 | 5 000 000 | 20 000 000
TOS 2.4.1      | 0,84  | 7,44  | 37,61  | 139,12
PDI 3.0.0      | 2,60  | 25,20 | 138,30 | 426,00
IBM DS 7.5     | 2,00  | 11,00 | 50,00  | 184,00
IBM DS PX 7.5  | 11,25 | 15,33 | 33,50  | 254,33
INFA PWC 8.1.1 | 2,00  | 6,00  | 12,00  | 38,00
Statistics (ratio compared with TOS 2.4.1):
Number of lines | TOS 2.4.1 | PDI 3.0.0 | DataStage 7.5 | DataStage PX 7.5 | Informatica 8.1.1
100 000    | - | 3,1  | 2,38 | 13,39 | 2,38
1 000 000  | - | 3,39 | 1,48 | 2,06  | 0,8
5 000 000  | - | 3,68 | 1,33 | 0,89  | 0,31
20 000 000 | - | 3,06 | 1,32 | 1,91  | 0,27
Test 9: File Input Delimited > Aggregate > File Output Delimited
Group by FirstName (Count)
Lines          | 100 000 | 1 000 000 | 5 000 000 | 20 000 000
TOS 2.4.1      | 0,86 | 7,89  | 198,79 | 928,08
PDI 3.0.0      | 2,70 | 29,70 | 162,30 | 544,00
IBM DS 7.5     | 2,00 | 14,00 | 68,00  | 424,00
IBM DS PX 7.5  | 4,50 | 11,00 | 40,00  | 505,00
INFA PWC 8.1.1 | 4    | 9     | 23     | 85
Statistics (ratio compared with TOS 2.4.1):
Number of lines | TOS 2.4.1 | PDI 3.0.0 | DataStage 7.5 | DataStage PX 7.5 | Informatica 8.1.1
100 000    | - | 3,14 | 2,33 | 5,23 | 4,65
1 000 000  | - | 3,76 | 1,77 | 1,39 | 1,14
5 000 000  | - | 0,82 | 0,34 | 0,2  | 0,12
20 000 000 | - | 0,59 | 0,46 | 0,54 | 0,092
Test 10: File Input Delimited > Lookup > File Output Delimited
Scenario:
Reading X lines from a delimited input file and looking up 4 fields in another delimited input file using the `id_client` column, then writing the join result to a delimited output file.
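A common way to implement such a lookup is to load the smaller lookup file into an in-memory index keyed by `id_client` and then stream the main file through it. The sketch below assumes that approach; it is not taken from any of the tested tools. Rows are dictionaries, the field names other than `id_client` are hypothetical, and dropping rows without a match is our assumption for this test.

```python
def lookup_join(main_rows, lookup_rows, key="id_client"):
    """Enrich each main row with the fields of the matching lookup row."""
    # Build the index once; each main row then costs one dictionary access.
    index = {row[key]: row for row in lookup_rows}
    for row in main_rows:
        match = index.get(row[key])
        if match is not None:
            merged = dict(row)
            merged.update(match)
            yield merged


main_rows = [
    {"id_client": 1, "amount": 10},
    {"id_client": 2, "amount": 20},
    {"id_client": 9, "amount": 30},  # no matching lookup row
]
lookup_rows = [
    {"id_client": 1, "city": "Paris"},
    {"id_client": 2, "city": "Lyon"},
]
joined = list(lookup_join(main_rows, lookup_rows))
print(joined)
```

The memory cost grows with the size of the lookup file, which is consistent with the benchmark varying the lookup size across runs.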
TALEND OPEN STUDIO
Job name: file_input_delimited__file_lookup_delimited__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
PENTAHO DATA INTEGRATION
Job name: file_input_delimited__file_lookup_delimited__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
DATASTAGE SERVER
Job name: file_input_delimited__file_lookup_delimited__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema file_output_delimited
DATASTAGE PX
Job name: PX_file_input_delimited__file_lookup_delimited__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema file_output_delimited
Transformer Component
INFORMATICA
Job name: file_input_delimited__file_lookup_delimited__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Tests result:
Test 10: File Input Delimited > Lookup > File Output Delimited
Lookup 100 000 rows ~7MB
Lines                100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1               1,45        6,39       28,72      108,37
PDI 3.0.0               4,14       21,40       87,60      288,90
IBM DS 7.5              5,00       10,60       33,00      139,00
IBM DS PX 7.5           5,00       12,20       40,00      122,00
INFA PWC 8.1.1          5,00       11,00       32,00      116,00
Statistics: ratio compared with TOS 2.4.1
Number of lines    PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000                 2,86            3,45               3,45                3,44
1 000 000               3,35            1,66               1,91                1,72
5 000 000               3,05            1,15               1,39                1,11
20 000 000              2,67            1,28               1,13                1,07
Test 10: File Input Delimited > Lookup > File Output Delimited
Lookup 500 000 rows ~34MB
Lines                100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1                3,9        8,89       32,36      115,67
PDI 3.0.0               7,90       24,50       97,40      291,10
IBM DS 7.5             28,00       33,00       56,00      195,00
IBM DS PX 7.5           7,00       13,00       40,00      122,00
INFA PWC 8.1.1          4,00       11,00       33,00      122,00
Statistics: ratio compared with TOS 2.4.1
Number of lines    PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000                 2,03            7,18               1,79                1,03
1 000 000               2,76            3,71               1,46                1,24
5 000 000               3,01            1,73               1,24                1,02
20 000 000              2,52            1,69               1,05                1,05
Test 10: File Input Delimited > Lookup > File Output Delimited
Lookup 1 000 000 rows ~68MB
Lines                100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1               9,86       14,26        38,6      121,44
PDI 3.0.0              14,50       32,20      116,60      487,25
IBM DS 7.5             68,30       80,00      102,00      203,00
IBM DS PX 7.5           9,25       15,00       40,00      123,00
INFA PWC 8.1.1          5,00       12,00       35,00      142,00
Statistics: ratio compared with TOS 2.4.1
Number of lines    PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000                 1,47            6,93               0,94                0,51
1 000 000               2,26            5,61               1,05                0,84
5 000 000               3,02            2,64               1,04                0,91
20 000 000              4,01            1,67               1,01                1,16
Test 10: File Input Delimited > Lookup > File Output Delimited
Lookup 5 000 000 rows ~365MB
Lines                100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1              56,51        69,1      199,26       557,1
PDI 3.0.0             Failed      Failed      Failed      Failed
IBM DS 7.5            369,00      407,00      496,00      973,00
IBM DS PX 7.5          24,00       30,00       55,00      134,00
INFA PWC 8.1.1         11,00       14,00       42,00      141,00
Statistics: ratio compared with TOS 2.4.1
Number of lines    PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000               Failed            6,53               0,42                0,19
1 000 000             Failed            5,89               0,43                 0,2
5 000 000             Failed            2,49               0,28                0,21
20 000 000            Failed            1,75               0,24                0,25
Test 11: File Input Delimited > Lookup > File Output Delimited && rejects
Scenario:
Read X lines from a delimited input file, look up 4 fields in another delimited input file using the id_client column, write the join result to a delimited output file, and write the rejected rows to other delimited output files.
1 – Filter rejects: `age` content < 18
2 – Filter rejects: `age` content < 18 and inner join reject
Comments: Talend Open Studio and DataStage Server are the most ergonomic tools for managing expression filter rejects and inner join rejects, using the Transformer component (tMap in Talend Open Studio). For DataStage PX, Pentaho Data Integrator and Informatica, we have to use filter components.
Talend Open Studio, Informatica and DataStage Server are the most ergonomic tools for managing expression filter rejects and inner join rejects. For DataStage PX and Pentaho Data Integrator, we have to use filter components.
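To make the reject handling concrete, here is a minimal Python sketch of the second variant (age filter rejects plus inner join rejects). It is not the implementation used by any of the tested tools. The age and id_client columns come from the scenario; the other column names, the sample rows, and the order of the two checks (join first, then age) are assumptions.

```python
def route_rows(main_rows, lookup_rows, key="id_client", min_age=18):
    """Split input rows into accepted rows, age rejects and join rejects.

    Assumption: a row with no lookup match goes to the inner join
    rejects before its age is tested. The source does not state the order.
    """
    index = {row[key]: row for row in lookup_rows}
    accepted, age_rejects, join_rejects = [], [], []
    for row in main_rows:
        match = index.get(row[key])
        if match is None:
            join_rejects.append(row)          # no lookup match
        elif int(row["age"]) < min_age:
            age_rejects.append(row)           # filter reject: age < 18
        else:
            accepted.append({**row, **match})  # joined main output
    return accepted, age_rejects, join_rejects


# Hypothetical sample rows for illustration only.
main_rows = [
    {"id_client": "1", "age": "34"},
    {"id_client": "2", "age": "17"},
    {"id_client": "3", "age": "52"},
    {"id_client": "4", "age": "12"},
]
lookup_rows = [
    {"id_client": "1", "city": "Paris"},
    {"id_client": "2", "city": "Lyon"},
    {"id_client": "4", "city": "Nice"},
]

accepted, age_rejects, join_rejects = route_rows(main_rows, lookup_rows)
print(len(accepted), len(age_rejects), len(join_rejects))
```

Each of the three lists would then be written to its own delimited output file. The first variant corresponds to keeping only the accepted and age-reject outputs.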
TALEND OPEN STUDIO
Job name: file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema of file_output_delimited (age>=18)
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
tMap Component
PENTAHO DATA INTEGRATION
Job name: file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema of file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Mapping Component
DATASTAGE SERVER
Job name: file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema file_lookup_delimited
Schema of file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
DATASTAGE PX
Job name: PX_file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema file_lookup_delimited
Schema of file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
INFORMATICA
Job name: file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema file_lookup_delimited
Schema of file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Transformer Component
Tests result:
Test 11: File Input Delimited > Lookup > File Output Delimited && rejects
Lookup 100 000 rows ~7MB + Filter 18 years
Lines                100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1               1,51        6,74       29,55      101,65
PDI 3.0.0               3,30       17,10       78,40      305,00
IBM DS 7.5              6,00       10,50       36,00      144,00
IBM DS PX 7.5           7,00       14,00       41,00      137,00
INFA PWC 8.1.1          5,00       10,00       33,00      120,00
Statistics: ratio compared with TOS 2.4.1
Number of lines    PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000                 2,19            3,97               4,64                3,31
1 000 000               2,54            1,56               2,08                1,48
5 000 000               2,65            1,22               1,39                1,12
20 000 000                 3            1,42               1,35                1,18
Test 11: File Input Delimited > Lookup > File Output Delimited && rejects
Lookup 500 000 rows ~34MB + Filter 18 years
Lines                100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1               4,26        9,28       32,44      111,98
PDI 3.0.0               7,80       20,50       81,50      310,00
IBM DS 7.5             28,60       34,00       57,00      173,00
IBM DS PX 7.5           7,50       14,25       44,67      155,20
INFA PWC 8.1.1          5,00       10,00       34,00      126,00
Statistics: ratio compared with TOS 2.4.1
Number of lines    PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000                 1,83            6,71               1,76                1,17
1 000 000               2,21            3,66               1,54                1,08
5 000 000               2,51            1,76               1,38                1,05
20 000 000              2,77            1,54               1,39                1,13
Test 11: File Input Delimited > Lookup > File Output Delimited && rejects
Lookup 1 000 000 rows ~68MB + Filter 18 years
Lines                100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1               10,2       15,22       38,31      126,63
PDI 3.0.0              14,10       32,35      111,35      319,05
IBM DS 7.5             66,00       68,00       95,00      220,00
IBM DS PX 7.5           9,00       18,00       51,00      153,33
INFA PWC 8.1.1          6,00       14,00       34,00      130,00
Statistics: ratio compared with TOS 2.4.1
Number of lines    PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000                 1,38            6,47               0,88                0,59
1 000 000               2,13            4,47               1,18                0,92
5 000 000               2,91             1,7               1,33                0,89
20 000 000              2,52            1,74               1,21                1,03
TALEND OPEN STUDIO
Job name: file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_and_innerjoin_rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema of file_output_delimited (age>=18)
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Schema of file_output_delimited (inner join rejects) = Schema of file_output_delimited
PENTAHO DATA INTEGRATION
Job name: file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_and_innerjoin_rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema of file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Schema of file_output_delimited (inner join rejects) = Schema of file_output_delimited
Mapping Component
DATASTAGE SERVER
Job name: file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_and_innerjoin_rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Schema of file_output_delimited (inner join rejects) = Schema of file_output_delimited
DATASTAGE PX
Job name: PX_file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_and_innerjoin_rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Schema of file_output_delimited (inner join rejects) = Schema of file_output_delimited
INFORMATICA
Job name: file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_and_innerjoin_rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Schema of file_output_delimited (inner join rejects) = Schema of file_output_delimited
Transformer Component
Test 12: file_input_delimited >_file_lookup_delimited > file_output_delimited__rejects && innerjoin_rejects_file_output_delimited
Lookup 100 000 rows ~7MB
Lines                100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1               1,42        5,65       24,63      106,78
PDI 3.0.0               2,60       13,00       59,80      327,60
IBM DS 7.5              6,00       10,00       30,00      137,00
IBM DS PX 7.5           9,00       15,25       47,33      146,00
INFA PWC 8.1.1          4,00       12,00       33,00      121,00
Statistics: ratio compared with TOS 2.4.1
Number of lines    PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000                 1,83            4,22               6,34                2,82
1 000 000                2,3            1,77                2,7                2,12
5 000 000               2,43            1,22               1,92                1,64
20 000 000              3,07            1,28               1,37                1,13
Test 12: file_input_delimited >_file_lookup_delimited > file_output_delimited__rejects && innerjoin_rejects_file_output_delimited
Lookup 500 000 rows ~34MB
Lines                100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1               4,16        8,74       30,34      120,53
PDI 3.0.0               7,26       19,30       72,25      319,60
IBM DS 7.5             28,00       35,50       63,00      189,50
IBM DS PX 7.5          11,00       16,00       44,00      150,00
INFA PWC 8.1.1             5          11          33         127
Statistics: ratio compared with TOS 2.4.1
Number of lines    PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000                 1,75            6,73               6,73                 1,2
1 000 000               2,21            4,06               1,83                1,26
5 000 000               2,38            2,08               1,45                1,09
20 000 000              2,65            1,57               1,24                1,05
Test 12: file_input_delimited >_file_lookup_delimited > file_output_delimited__rejects && innerjoin_rejects_file_output_delimited
Lookup 1 000 000 rows ~68MB
Lines                100 000   1 000 000   5 000 000  20 000 000
TOS 2.4.1              10,98       15,18       38,49      126,57
PDI 3.0.0              13,30       27,35       79,00      413,45
IBM DS 7.5             38,49       90,40      108,00      231,00
IBM DS PX 7.5          13,00       19,00       49,00      134,00
INFA PWC 8.1.1             6          13          37         131
Statistics: ratio compared with TOS 2.4.1
Number of lines    PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000                 1,21            3,51               1,18                0,55
1 000 000                1,8            5,96               1,25                0,86
5 000 000               2,05            2,81               1,27                0,96
20 000 000              3,27            1,83               1,06                1,04
Annex 1: Informatica settings and results
This annex presents the settings changes made by Informatica and the limitations they found.
Comments and amendments made to the basic PowerCenter 8.1.1 installation:
*** Since the 'benchmark' machine is a tiny laptop with limited resources (XP 32-bit, Core2 Duo CPU and 3,43 GB of RAM), we made the following changes:
‐ Auto‐Memory deactivation:
MaxMem at 0 in the Default Session Config
‐ High Availability storage deactivation:
EnableHAStorage at No for the Integration Service
‐ Metadata Manager and Reporting Service deactivation
*** Configuration amendments:
‐ Unix environment variable INFA_DEFAULT_DOMAIN added
‐ Custom variable FileRdrTreatNullCharAs added on the Integration Service (NULL characters are encountered in the source data files)
*** Standard Oracle 10g (10.1.0.2.0) Database installation with:
sga_max_size=164MB
pga_aggregate_target=115MB
Comments and "best practices" for the tests:
Test 1: File Input Delimited > File Output Delimited
- dynamic partitioning at 2 with more than 5 million rows
This is a disk-bound test.
Test 2: File Input Delimited > Table MySQL Output
Not applicable.
Test 3: Table Oracle Input > File Output Delimited
- no partitioning, as the volume is too small and the run time too short
Test 4: File Input Delimited > Table Output Oracle BULK
- commit size at 100000
- dynamic partitioning at 2 with 2 million rows
This is a disk-bound test.
Test 5: File Input Delimited > Transform > File Output Delimited
- function "CONCAT(CONCAT(firstname,' '),lastname)" is replaced by "firstname || ' ' || lastname"
- dynamic partitioning at 2 with more than 5 million rows
This is a disk-bound test.
Test 6: Table Input Oracle > Aggregation > Table Output Oracle (ELT)
- no partitioning, as the volume is too small and the run time too short
The Oracle database is not 'tuned' for ELT mode.
Test 7: Tables Input Oracle > Transformation > Tables Output Oracle (ELT)
- commit size at 50000
- no partitioning, as the volume is too small and the run time too short
The Oracle database is not 'tuned' for ELT mode.
Test 8: File Input Delimited > Sort > File Output Delimited
- sorter memory adjustment
This test is memory-limited at 20 million rows (a two-pass sort is required) and sometimes also disk-limited.
Test 9: File Input Delimited > Aggregate > File Output Delimited
- dynamic partitioning at 2 with more than 5 million rows in source
- aggregator memory adjustment
This is a CPU-bound test.
Test 10: File Input Delimited > Lookup > File Output Delimited
- dynamic partitioning at 2 with more than 5 million rows in source or lookup
- lookup memory adjustment
- lookup in the flow with hash partitioning point
This is a CPU-bound test.
Test 11: File Input Delimited > Lookup > File Output Delimited && rejects
- use of router in place of filters
- dynamic partitioning at 2 with more than 5 million rows in source
- lookup memory adjustment
- lookup in the flow with hash partitioning point
This is a CPU-bound test.
Test 12: file_input_delimited >_file_lookup_delimited > file_output_delimited__rejects && innerjoin_rejects_file_output_delimited
- use of router in place of filters
- dynamic partitioning at 2 with more than 5 million rows in source
- lookup memory adjustment
- lookup in the flow with hash partitioning point
This is a CPU-bound test.
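The notes for Tests 10 to 12 mention partitioning at 2 together with a hash partitioning point on the lookup. The general idea can be sketched in Python as follows. This only illustrates hash partitioning on a key and says nothing about how PowerCenter implements it; the CRC32 hash, the function names and the sample rows are all assumptions.

```python
import zlib


def partition_of(key, partitions=2):
    """Map a key to a partition number using a stable hash (CRC32)."""
    return zlib.crc32(str(key).encode("utf-8")) % partitions


def hash_partition(rows, key="id_client", partitions=2):
    """Split rows into buckets so that equal keys land in the same bucket."""
    buckets = [[] for _ in range(partitions)]
    for row in rows:
        buckets[partition_of(row[key], partitions)].append(row)
    return buckets


# Hypothetical rows: 20 source lines spread over 5 client ids.
source_rows = [{"id_client": str(i % 5), "line": i} for i in range(20)]
lookup_rows = [{"id_client": str(i)} for i in range(5)]

source_parts = hash_partition(source_rows)
lookup_parts = hash_partition(lookup_rows)

for number, (src, lkp) in enumerate(zip(source_parts, lookup_parts)):
    print(number, len(src), len(lkp))
```

Because both sides are partitioned with the same function on the same key, each partition can perform its lookup independently of the other, which is what allows the work to be spread over two CPU cores.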