
Parallel Processing of JOIN Queries in OGSA-DAI

Fan Zhu

Aug 21, 2009

MSc in High Performance Computing

The University of Edinburgh

Year of Presentation: 2009

Abstract

The join is the most important and often the most expensive of all relational operations, especially when its inputs are large tables held in distributed heterogeneous databases. Parallel join processing is a well-understood technique for obtaining results quickly, so one way to speed up query execution is to exploit parallelism.

Since most real queries involve joins of several tables, efficient join execution becomes very important. This thesis focuses on query processing across distributed heterogeneous databases rather than inside a single DBMS. The aims of the project are: a) to investigate methods for the parallel execution of join queries, which are usually used to optimize a single join operation; b) to analyze the difference in performance caused by different query plans, the choice of which can speed up complex queries that contain multiple join operations.

The main steps and achievements of this project are the following:

a) The first step of the project is to study relational algebra and the OGSA-DAI (Open Grid Services Architecture - Data Access and Integration) software. Because the OGSA-DAI middleware allows queries to be processed and data to be transformed across distributed resources, the mechanisms and interfaces defined in OGSA and the primary components of the OGSA-DAI middleware are used in our experiments.

b) The second step is to design efficient parallel approaches to optimize the join execution strategies currently used by OGSA-DAI. The most important work is to analyze the parallel mechanisms available when executing complex join queries on large tables: Independent Parallelism, Pipelined Parallelism, Partitioned Parallelism and Mixed Parallelism. Based on this analysis, two parallel join algorithms, the Hash Split Join algorithm and the Sorted Merge Join algorithm, are adopted in the project. All functional modules are divided into OGSA-DAI activities, and the function of each implemented activity is described in detail.

c) The third step is to implement the parallel algorithms and to evaluate the performance of parallel queries. The thesis discusses and analyzes the performance of each activity, such as the SQL Query Activity, Tuple Sort Activity and Sorted Merge Activity. It analyzes in turn the performance of two-table joins, multi-table joins and joins on distributed heterogeneous databases. Based on our experiments, it points out the effect of different query plans.

Keywords: SQL Query, Join Query, OGSA-DAI, Parallelization.

Contents

Chapter 1 Introduction
1.1 Project Aims
1.2 Research Methods
1.3 Thesis Structure

Chapter 2 Background Knowledge
2.1 SQL and Relational Theory
2.2 Query Graphs and Query Plans
2.3 OGSA-DAI

Chapter 3 Analysis and Design of Parallel Algorithms
3.1 Requirements Capture
3.2 Mechanisms of Parallel Query Execution
3.3 Partitioning Algorithms
3.4 Parallel Join Implementations
3.5 User Side Workflow

Chapter 4 Performance Analysis
4.1 Experimental Setup
4.2 Performance Analysis for Single Activity
4.3 Single Join
4.4 Multiple Join
4.5 Join on Distributed Heterogeneous Database

Chapter 5 Conclusions

Appendix A Source Code
Appendix B Submission Script

References

List of Tables

Table 1 Bandwidth of SQL Query Activity
Table 2 Bandwidth of Hash Split Activity
Table 2 Sorted Merge Activity and Union All Activity
Table 3 Overall Activity Performance
Table 4 Query Plan 1 vs. Query Plan 2
Table 5 Performance on Different Database
Table 6 Performance of Heterogeneous Database

List of Figures

Figure 1 Logical Query Plan
Figure 2 Inner Join
Figure 3 Query Graph Example
Figure 4 Query Plan Example
Figure 5 OGSA Services Framework
Figure 6 The Architecture of OGSA-DAI
Figure 7 OGSA-DAI Runtime Overview
Figure 8 Independent Parallelism
Figure 9 Pipelined Parallelism
Figure 10 Partitioned Parallelism
Figure 11 Independent and Pipelined Mixed Parallelism
Figure 12 Serial Join Workflow
Figure 13 Hash Split Join
Figure 14 Sorted Merge Join
Figure 15 Query Graph
Figure 16 Query Tree
Figure 18 Running Time of Reproduced Test
Figure 19 Workflow without Swallow Activity
Figure 20 Workflow with Swallow Activity
Figure 21 Performance of Hash Split Activity
Figure 22 Array List vs. Linked List
Figure 23 Performance of Tuple Sort Activity
Figure 24 Query Plan 1
Figure 25 Query Plan 2
Figure 26 Re-use in Hash Split Join
Figure 27 DB2 vs. MySQL

Acknowledgements

First of all, I would like to express my deepest thanks to my supervisor, Mr. Bartosz Dobrzelecki, who provided valuable suggestions and guidance throughout this dissertation.

I also want to extend my thanks to all my friends for their encouragement and support.

Chapter 1 Introduction

1.1 Project Aims

With the wide application of digital technology, the amount of data to be processed is increasing faster than the speed of processing units. As a result, traditional database query algorithms may no longer be well suited to massive distributed data sets on the internet. If a query takes a long time to produce its final result, the information it generates may already be obsolete in many application domains. Can we reduce the running time of a query?

On the other hand, given a query on multiple tables, there are many schemes that a database management system can follow to process it and produce its results. Although all schemes produce equivalent final output, their running costs vary. For example, two schemes may need different amounts of time to run, and sometimes the difference is enormous. Which scheme needs the least time?

The significance of the research follows from the problem statement: the join is the most expensive operation executed by databases, and it has been shown that it can be optimized by parallelization. Parallelizing join queries is therefore an important part of this project. In addition, different query plans may affect performance considerably, so matching a query to the most suitable plan is also very helpful.

The fundamental goal of this project is to investigate and solve the join processing problem using parallelization techniques. This project will develop OGSA-DAI implementations of several parallel join algorithms. We also want to gather experimental data that helps us understand which approaches to parallel join execution are most beneficial.

1.2 Research Methods

The research methods of this project are:

- To research and analyze efficient parallelization approaches for optimizing existing implementations of join operators, which take significant time to execute.

- To design and implement parallelization algorithms useful for querying distributed data based on OGSA-DAI.

1.3 Thesis Structure

The thesis is organised into five chapters. This chapter describes the project's purposes, roadmap and research methods. The rest of the thesis is structured as follows:

Chapter 2 introduces the basics of SQL (Structured Query Language) and OGSA-DAI (Open Grid Services Architecture - Data Access and Integration) so that our work and techniques are easier to understand. Section 2.1 describes the four subsets of the declarative database language SQL, introduces the SELECT query on multiple tables, and then describes a basic set of relational operators. Section 2.2 introduces the query graph, a graphical tool for analysing query operations. Section 2.3 describes the mechanisms and interfaces defined in OGSA and the primary components of the OGSA-DAI architecture. We consider the query requirements that arise from database integration by OGSA applications and describe in detail some of the ways in which consumers make requests to an OGSA-DAI product.

Chapter 3 discusses the design and implementation of the algorithms. As this project is implemented on the OGSA-DAI framework, the functional modules are divided into OGSA-DAI activities. The most important work is to analyze the parallel mechanisms available when executing complex join queries on large tables: Independent Parallelism, Pipelined Parallelism, Partitioned Parallelism and Mixed Parallelism. Based on this analysis, two parallel join algorithms, the Hash Split Join algorithm and the Sorted Merge Join algorithm, are used in the project. Section 3.4 describes the functionality of the implemented activities in detail. Finally, Section 3.5 discusses how the activities are assembled into OGSA-DAI workflows and gives the implementation details of the Hash Split Join and Sorted Merge Join workflows.

Chapter 4 contains the performance analysis of our parallel implementations. First, it describes the software and hardware test environment, the test data set and the test join queries, which are based on the TPC Benchmark™H (TPC-H)[4]. Then it discusses and analyzes the performance of each activity, such as the SQL Query Activity, Tuple Sort Activity and Sorted Merge Activity. Sections 4.3 to 4.5 analyze the performance of two-table joins, multi-table joins and joins on distributed heterogeneous databases, and explain the performance differences by analysing the implementations. The chapter also shows the overall workflow performance and how it varies across different databases. Based on our experiments, it points out the effect of different query plans and draws some conclusions about how to match a query to a plan.

In Chapter 5, the final part of the thesis, the conclusions are presented based on our analysis and experiments. Our discussion in this thesis focuses on optimizing sequentially processed join queries through parallelization and on choosing query plans for complex requests. It touches upon issues and techniques related to optimizing join queries in distributed heterogeneous database environments.

Chapter 2 Background Knowledge

2.1 SQL and Relational Theory

SQL (Structured Query Language) is a declarative database language designed for the management and retrieval of data in an RDBMS (Relational Database Management System).

There are four important parts of the SQL language: Data Manipulation Language (DML), Data Definition Language (DDL), Data Control Language (DCL) and Transaction Control Language (TCL). This project is concerned with the DML part of SQL, which is used to retrieve, store, modify, delete, update and manage data in a database. DML allows users to describe the desired properties of the result without specifying how to obtain it, which is why SQL is called a declarative language.

The most common operation in SQL is data retrieval, which is performed with the keyword SELECT. A SELECT query can retrieve data from one or more tables. Join operations are needed in order to combine multiple tables.

This project focuses on SELECT queries joining multiple tables.

2.1.1 Relational Algebra

In addition to defining the database structure and constraints, a data model must include a set of operations to manipulate the data. A basic set of relational model operations constitutes the relational algebra. Relational algebra is used to represent declarative SQL queries in a procedural form which can be executed. A sequence of relational algebra operations forms a relational algebra expression. [5]

A SQL query can be expressed as a relational algebra expression and evaluated with relational algebra operations such as SELECT, PROJECT, JOIN, UNION, INTERSECTION and CARTESIAN PRODUCT.

Select and Join will be used in this project and will be explained in the next sections.

2.1.2 Select Statement and Logical Query Plan

The SELECT statement, which retrieves data from the specified table(s), is the most commonly used SQL statement. For example, here is a simple Select – From – Where query:

SELECT id, name, job FROM employee WHERE salary > 100

To be able to execute this declarative query, a logical query plan needs to be compiled. The SELECT query is translated into a relational expression using Projection, Selection and Table Scan. The simple query above is translated into the following logical query plan (Figure 1):

Figure 1 Logical Query Plan

On execution, the system fetches all records stored in the employee table (TABLE SCAN), then filters them, discarding all those for which salary is <= 100 (SELECT). Finally, the PROJECT operation keeps only the three requested attributes of each record.
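The three operators of this plan can be sketched as a small pipeline. This is only an illustration in Python, not the way an RDBMS or OGSA-DAI implements them; the sample rows and the function names (table_scan, select, project, run_plan) are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]

# Hypothetical sample data standing in for the employee table.
EMPLOYEE: List[Row] = [
    {"id": 1, "name": "Ann", "job": "clerk", "salary": 90},
    {"id": 2, "name": "Bob", "job": "engineer", "salary": 150},
    {"id": 3, "name": "Cy", "job": "manager", "salary": 100},
]


def table_scan(table: Iterable[Row]) -> Iterator[Row]:
    """TABLE SCAN: produce every record of the table."""
    for row in table:
        yield row


def select(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """SELECT: keep only the records that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def project(rows: Iterable[Row], columns: List[str]) -> Iterator[Row]:
    """PROJECT: keep only the requested attributes of each record."""
    for row in rows:
        yield {column: row[column] for column in columns}


def run_plan(table: Iterable[Row]) -> List[Row]:
    """SELECT id, name, job FROM employee WHERE salary > 100"""
    scanned = table_scan(table)
    filtered = select(scanned, lambda row: row["salary"] > 100)
    return list(project(filtered, ["id", "name", "job"]))


result = run_plan(EMPLOYEE)
print(result)
```

Each operator consumes the output of the one below it in the plan, which is the same bottom-up reading used for Figure 1.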

However, as more tables are joined, the number of possible query plans grows explosively. All these plans generate identical results but have different costs. Because of this combinatorial explosion, an exhaustive search for the best query plan is not feasible. In this project, we try to devise some heuristic rules that help us choose the most promising plan in a limited time.
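To give a feeling for this growth, the sketch below counts join orders for n tables under two simplifying assumptions that are not taken from the thesis: every pair of tables may be joined (cross products allowed), and A JOIN B is counted separately from B JOIN A. Under these assumptions there are n! left-deep orders and (2(n-1))!/(n-1)! join trees of arbitrary shape. The function names are illustrative.

```python
from math import factorial


def left_deep_orders(n: int) -> int:
    """Number of left-deep join orders for n tables: n!"""
    return factorial(n)


def all_join_trees(n: int) -> int:
    """Number of ordered binary join trees over n distinct tables:
    (2(n-1))! / (n-1)!"""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (2, 4, 7, 10):
    print(n, left_deep_orders(n), all_join_trees(n))
```

A real optimizer sees fewer candidates because it usually discards plans containing cross products, but the count still grows too quickly for exhaustive search on large queries.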

2.1.3 Join Query

The join is the most important of all relational operations. A join clause combines tuples from two source tables. The SQL language supports four types of joins: INNER, FULL OUTER, LEFT OUTER and RIGHT OUTER JOIN. This project focuses on the INNER JOIN, which is the most commonly used in applications and is also the default join type.

An INNER JOIN essentially combines the records from two tables (A and B) based on a given join predicate. The result of the join can be defined as the outcome of first taking the Cartesian product (or cross join) of all records in the tables (combining every record in table A with every record in table B) and then returning all records which satisfy the join predicate [1].

People

Name    ID
Betty   100
Jones   101
Jack    102

Nationality

ID    Country
100   United Kingdom
101   Australia

SELECT * FROM People INNER JOIN Nationality ON People.ID = Nationality.ID;

People.Name   People.ID   Nationality.Country   Nationality.ID
Betty         100         United Kingdom        100
Jones         101         Australia             101

Figure 2 Inner Join

An EQUIJOIN is a specific type of comparator-based join, which uses only equality comparisons (= only) in the join-predicate. Figure 2 is an example of EQUIJOIN. Tuples with ID equal to 100 or 101 are accepted because these values appear in both tables. Tuples with People.ID = 102 will be discarded as there is no related value in table Nationality.
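The definition of the inner join as a filtered Cartesian product can be written down directly. The following minimal sketch uses the People and Nationality data from Figure 2; representing rows as tuples and the name inner_join are choices made for the illustration only.

```python
from itertools import product

# Rows of Figure 2 as tuples: People(Name, ID) and Nationality(ID, Country).
PEOPLE = [("Betty", 100), ("Jones", 101), ("Jack", 102)]
NATIONALITY = [(100, "United Kingdom"), (101, "Australia")]


def inner_join(left, right, predicate):
    """Take the Cartesian product of both inputs, then keep only the
    combined rows that satisfy the join predicate."""
    return [l + r for l, r in product(left, right) if predicate(l, r)]


# Equijoin predicate: People.ID = Nationality.ID
joined = inner_join(PEOPLE, NATIONALITY, lambda p, n: p[1] == n[0])
print(joined)
```

This version inspects every pair of rows, so its cost is proportional to the product of the table sizes. It serves as a specification of the result rather than as an efficient algorithm.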

SQL queries often include multiple joins. The SQL language allows joins to be defined explicitly using the JOIN keyword (see the example above). However, user queries usually contain implicit joins, with the join predicates given in the WHERE clause.

2.2 Query Graphs and Query Plans

A query graph is a single graph that represents a query. It does not specify the order in which operations are performed. For example, the join query in the previous section can be translated into Figure 3.

Figure 3 Query Graph Example

A query plan (Figure 4) specifies an order of operations for executing a query. It is a set of steps used to access or modify data in a SQL RDBMS. Since SQL is declarative, there are typically a large number of alternative ways to execute a given query, with widely varying performance. When a query is submitted to the database, the query optimizer evaluates some of the different, correct possible plans for executing the query and returns what it considers the best alternative [2].

Figure 4 Query Plan Example

In this project, a SQL query is first analysed and parsed into a query graph. After examining this query graph, a query plan is chosen based on our heuristic rules. More details are given in Section 4.1.4 and Section 4.1.5.

2.3 OGSA-DAI

OGSA-DAI stands for Open Grid Services Architecture Data Access and Integration. The aim of OGSA-DAI is to develop a standard interface for distributed data resources on the Grid. Nowadays a great deal of data is available, but it is not held in the same database and is often not linked together at all, forming isolated islands of data. We need a way to integrate these isolated and distributed data sources.

An OGSA-DAI web service allows data to be queried, updated, transformed and delivered. OGSA-DAI web services can be used to provide web services that offer data integration functionality to clients. OGSA-DAI web services can be deployed within a Grid environment. OGSA-DAI thereby provides a means for users to Grid-enable their data resources [3].

2.3.1 OGSA Grid Environment

The Grid is defined as an infrastructure consisting of multiple computers connected via network technologies providing the impression of one computer system. In 2001, researchers led by Globus and IBM began developing new Grid standards and technology. The aim was to merge the understanding developed through the design of early Grid applications with the Web Services middleware. Their goal was to allow Grid developers to exploit the huge commercial investment in Web Services infrastructure. The result was the Open Grid Services Architecture (OGSA) -- a high-level framework designed to support dynamic virtual organizations that share independently administered data and resources seamlessly across a network of heterogeneous computers. The OGSA is used to identify the components needed in a grid system. OGSA defines a service-based structure for creating a grid computing environment. Still under development, this architecture defines the major functional components required to meet those requirements. Prof. Ian Foster gave a description of the mechanisms and interfaces defined in OGSA[10][11]. The OGSA services framework is shown in Figure 5. The services are built on Web service standards, with semantics, additions, extensions and modifications that are relevant to Grids[11].

Figure 5 OGSA Services Framework. Cylinders represent individual services

The important points are the following:

• An important motivation for OGSA is the composition paradigm or building block approach, where a set of functions is built or adapted as required. This provides the adaptability, flexibility and robustness to change that is required in the architecture.

• The entire set of OGSA capabilities does not have to be present in a system. A system may choose to utilize or provide only a subset of services from any capability.

• OGSA represents the services, their interfaces, and the semantics/behavior and interaction of these services.

• The architecture is not layered, where the implementation of one service is built upon.

2.3.2 OGSA-DAI Software

With the increase of data produced in research and business environments, data management is increasingly challenging. Since 2002, the Open Grid Services Architecture - Data Access and Integration (OGSA-DAI) project, funded by the UK e-Science Programme, has been working to develop an effective solution to the data management challenge, and in particular to data access and integration problems. OGSA-DAI facilitates access to and integration of data resources, such as relational databases, within a Grid. Paper [12] presents a status report on OGSA-DAI activities and announces future directions. Paper [13] describes a new architecture for future OGSA-DAI releases and its rationale. OGSA-DAI 3.0 is a complete top-to-bottom redesign and implementation of the OGSA-DAI product. Paper [14] describes the motivation behind this redesign and provides an overview of OGSA-DAI 3.0, comparing and contrasting it with earlier OGSA-DAI releases (see footnote 1).

2.3.3 OGSA-DAI Framework

OGSA-DAI is a framework that enables existing data resources to be integrated into a grid environment. It is middleware that interfaces with databases and allows data resources, such as file systems and relational or XML databases, to be accessed, federated and integrated across the network [15]. As well as accessing and updating data in a database, OGSA-DAI offers an extensibility mechanism that makes it possible to add user-defined activities, which can be executed alongside the activities already offered by OGSA-DAI, such as SQL query and update.

The primary components of the new architecture for OGSA-DAI are shown in Figure 6 [13] (see footnote 2). The architecture looks forward to multiple data services administered through a consistent regime. There are three data services: one serves OGSA-DAI, one serves the WS-DAI standard, perhaps as a configuration of OGSA-DAI, and one serves Mobius.

1 Papers [12] and [13] discuss the old OGSA-DAI product, while paper [14] relates to the current one.

2 This figure is for OGSA-DAI 2.x; the architecture used by OGSA-DAI 3.x is slightly different.

Figure 6 The Architecture of OGSA-DAI

2.3.4 OGSA-DAI Activity

An activity is a workflow unit that implements a specific function and is identified by a name. Arbitrary data-related functions can be encapsulated as activities, and activities can be combined to provide complex functionality.

OGSA-DAI comes with a default set of activities, such as the SQL query activity, the data format transfer activity and the data set union activity. These activities fall into several categories, such as delivery activities and relational activities.

As shown in Figure 7, every activity has client side code and server side code, which are matched by their unique ID. An activity workflow involves three parts:

1. User code: The client toolkit API allows the user to assemble workflows by connecting activities. It also provides methods for submitting workflows to OGSA-DAI services. The user code calls the client side code to fill in the required inputs and declare the outputs. Note that one piece of user code can call more than one activity, and every activity can have multiple instances. It is executed on the user side.

2. Client side: It manages the inputs and outputs of an activity. Inputs are sent to the server side code and outputs are forwarded to the user. It is also executed on the user side.

3. Server side: It performs the actual work of the activity. It may connect to a database (for SQL related activities). It is executed on the server side.

Figure 7 OGSA-DAI Runtime Overview

Chapter 3 Analysis and Design of Parallel Algorithms

This chapter covers requirements capture and system design. Some implementation details are also introduced in order to give a low-level overview of the main functions and of solutions to common problems.

3.1 Requirements Capture

The main aim of this project is to use parallelization to optimize the processing of SQL join queries with very large input tables on a distributed database system. The other aim of this project is to analyse how different query plans affect performance.

OGSA-DAI is designed to enable remote access to data. It is a well-designed framework that is well suited to managing distributed database systems and that simplifies building distributed data processing systems. By using OGSA-DAI in our project we can focus on parallel algorithms and not worry about the details of distributed processing.

The following sections present the basic and additional functional goals in this project.

3.1.1 Basic Functional Goals

This project has four main goals:

• Implementation of the Hash Join algorithm and the Sorted Merge algorithm for query execution on OGSA-DAI.

• Parallelisation of the above algorithms.

• Performance analysis of these two join algorithms.

• Performance analysis of different query plans.

3.1.2 Performance Goals

Because this project is about optimization, it focuses on performance. As a client/server framework, OGSA-DAI may introduce some overhead during execution. In this project we investigate whether this overhead is damaging and how large it is, and we try to understand exactly where time is spent.

We also try to find the bottlenecks and to determine whether parallelization reduces execution times.

3.2 Mechanisms of Parallel Query Execution

When executing query operations on large tables, poor performance may occur, especially for complex join operations. There are two limiting factors: the amount of available main memory and computational complexity. Consequently, we try to use parallel mechanisms, which address both limitations, to improve runtime efficiency. In the distributed context, where queries are executed by middleware sitting on top of the RDBMS, we cannot use the foreign-key-based indexes that are available to a local RDBMS. Besides, we have fewer plans to choose from because the tables reside in different locations.

In this project, we investigate three basic mechanisms for bringing parallelism into join execution. Taking the query R1 JOIN R2 JOIN R3 JOIN R4 (where Ri stands for an input table) as an example, the three mechanisms are:

3.2.1 Independent Parallelism

The above query could be executed in the following three steps:

Step 1: R1 JOIN R2 => R12

Step 2: R3 JOIN R4 => R34

Step 3: R12 JOIN R34 => R1234

Independent parallelism is illustrated in Figure 8. With this mechanism, independent steps (Step 1 and Step 2 in this case) can be executed fully in parallel, which can yield a large speedup. However, scalability is limited. To execute 4 joins independently, we need at least 8 relations, with each pair joined independently, which is not a frequent scenario. We get some parallelism, but it will rarely allow us to use, say, 8 processors.

Figure 8 Independent Parallelism
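As a rough illustration of independent parallelism, the sketch below submits R1 JOIN R2 and R3 JOIN R4 to a thread pool at the same time and joins their results once both have finished. It is plain Python rather than an OGSA-DAI workflow; the helper equi_join, the column name k and the sample rows are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def equi_join(left, right, key):
    """Hash-based equijoin of two lists of dict rows on a shared column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical input tables that all share the join column "k".
R1 = [{"k": 1, "a": "a1"}, {"k": 2, "a": "a2"}]
R2 = [{"k": 1, "b": "b1"}, {"k": 2, "b": "b2"}]
R3 = [{"k": 1, "c": "c1"}, {"k": 3, "c": "c3"}]
R4 = [{"k": 1, "d": "d1"}, {"k": 3, "d": "d3"}]

with ThreadPoolExecutor(max_workers=2) as pool:
    # The first two joins have no data dependency, so both are submitted at once.
    future_r12 = pool.submit(equi_join, R1, R2, "k")
    future_r34 = pool.submit(equi_join, R3, R4, "k")
    r12, r34 = future_r12.result(), future_r34.result()

# The final join needs both intermediate results.
r1234 = equi_join(r12, r34, "k")
print(r1234)
```

In CPython, threads mainly help when the joins wait on I/O such as remote database access; for CPU-bound joins a process pool would be the closer analogue.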

3.2.2 Pipelined Parallelism

Another approach is to build a data processing pipeline, as in these steps:

Step 1: R1 JOIN R2 => R12

Step 2: R12 JOIN R3 => R123

Step 3: R123 JOIN R4 => R1234

Pipelined parallelism is illustrated in Figure 9. Here there are data dependencies between the steps: the output of one operation is used as the input of the next, so in principle the steps have to execute one after another. However, if the first operation can produce partial results that are immediately channelled to the second operation, then the first operation can produce its next partial result while the second operation processes the earlier ones.

Figure 9 Pipelined Parallelism
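The way partial results flow through such a pipeline can be sketched with Python generators. Generators interleave the stages rather than running them on separate processors, so this shows only the data flow, not real concurrency; the function pipelined_join, the trace list and the sample tables are assumptions made for the illustration.

```python
trace = []  # records (stage, key) each time a stage emits a partial result


def pipelined_join(name, left_stream, right, key):
    """Stream left tuples through a hash index built on the right input,
    emitting each joined tuple as soon as it is available."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    for l in left_stream:
        for r in index.get(l[key], []):
            trace.append((name, l[key]))
            yield {**l, **r}


# Hypothetical input tables sharing the join column "k".
R1 = [{"k": 1, "a": "a1"}, {"k": 2, "a": "a2"}]
R2 = [{"k": 1, "b": "b1"}, {"k": 2, "b": "b2"}]
R3 = [{"k": 1, "c": "c1"}]
R4 = [{"k": 1, "d": "d1"}]

r12 = pipelined_join("R12", iter(R1), R2, "k")
r123 = pipelined_join("R123", r12, R3, "k")
r1234 = list(pipelined_join("R1234", r123, R4, "k"))

print(r1234)
print(trace)
```

The trace shows that the first tuple of R12 travels through all three joins before the first join produces its second tuple, which is the behaviour that lets later stages start before earlier ones have finished.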

3.2.3 Partitioned Parallelism

Partitioned parallelism is applied to a single join operation (Ri JOIN Rj => Rij). Joining two tables using partitioned parallelism takes three steps:

Step 1: Split the input data into small sets.

Step 2: Join related sets together.

Step 3: Union previous results together.

Figure 10 shows how Partitioned Parallelism works.

Figure 10 Partitioned Parallelism
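The split, join and union steps can be sketched as follows. This is a simplified Python illustration and not the OGSA-DAI activities developed in this project; the names hash_split, equi_join and partitioned_join, the number of partitions and the sample data are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def hash_split(rows, key, n):
    """Step 1: split the input into n sets by hashing the join attribute."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def equi_join(left, right, key):
    """Step 2 (per partition): hash-based equijoin on a shared column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def partitioned_join(left, right, key, n=3):
    """Split both inputs, join matching partitions concurrently, then
    union the partial results (Step 3)."""
    left_parts = hash_split(left, key, n)
    right_parts = hash_split(right, key, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        partial = pool.map(equi_join, left_parts, right_parts, [key] * n)
        return [row for part in partial for row in part]


# Hypothetical inputs: RI has keys 0..9, RJ has keys 0..4, each twice.
RI = [{"k": i, "a": i * 10} for i in range(10)]
RJ = [{"k": i % 5, "b": i} for i in range(10)]

parallel_result = partitioned_join(RI, RJ, "k", n=3)
serial_result = equi_join(RI, RJ, "k")
print(len(parallel_result), len(serial_result))
```

Because both inputs are split with the same function, partition i of one table only ever needs to be joined with partition i of the other. The partitioned result contains the same rows as the serial join, possibly in a different order.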

3.2.4 Mixed Parallelism

Different parallelization mechanisms can be applied to different parts of a query plan, so a plan may be divided into parts that use different mechanisms. For example, Figure 11 presents one of the possible query plans for the following query:

R1 JOIN R2 JOIN R3 JOIN R4 JOIN R5 JOIN R6 JOIN R7

The following equivalence holds for EQUIJOIN.

(R1 JOIN R2) JOIN R3 ≡ R1 JOIN (R2 JOIN R3)

Therefore we can rewrite our query as:

((R1 JOIN R2) JOIN (R3 JOIN R4)) JOIN ((R5 JOIN R6) JOIN R7)

In the rewritten query, independent parallelism applies to the first sub-expression, (R1 JOIN R2) JOIN (R3 JOIN R4), and pipelined parallelism to the second, (R5 JOIN R6) JOIN R7.

Figure 11 Independent and Pipelined Mixed Parallelism


3.3 Partitioning Algorithms

Data partitioning is used in the partitioned parallelism approach to distribute data over a number of processing elements. Each processing element then works simultaneously with the others, thereby creating parallelism. It is the basic step of parallel query processing. When partitioning the workload, four partitioning algorithms are taken into consideration [3]:

1. Round-robin data partitioning

In the round-robin algorithm, data is partitioned by record number. To illustrate, if data is partitioned into n parts, the (xn+i)th record is put in the ith block. The biggest advantage of this algorithm is its perfect load balance: every part holds the same amount of data (±1).

2. Hash data partitioning

Data is partitioned by applying a hash function, so every new work set has its specific set of attribute values. However, load balance will be poor if the distribution of values is skewed. For example, suppose we partition a work set into five parts, where the work set is {1, 2, 3, 4, 6, 7, 8, 11, 12, 16} and the hash function is (x mod 5). The result of partitioning will be:

Work set 1: {1, 6, 11, 16}

Work set 2: {2, 7, 12}

Work set 3: {3, 8}

Work set 4: {4}

Work set 5: {}

This shows the potentially poor load balance of hash data partitioning.

Furthermore, hash data partitioning is the best way to handle an EQUIJOIN operation, while range data partitioning is used for JOINs with greater than / less than conditions. If we use the same hash function to partition both join inputs, then related tuples will end up in the same bucket. So when processing an EQUIJOIN operation, we can easily match related hash split work sets together.

3. Range data partitioning

A simple example makes this algorithm easy to understand: partition a set of discrete numbers into three subsets. In this case, all the numbers up to 100 can be grouped into set 1; numbers in the range [101, 1000] go into set 2; numbers larger than 1000 go into the last set.

Range data partitioning has pros and cons similar to those of hash data partitioning.

4. Random-unequal data partitioning

The partitioning function of this algorithm may be a hash or range partitioning function, or simply an unknown function. Data is grouped randomly.

All these partitioning algorithms have their advantages and disadvantages, so the partitioning algorithm should be chosen based on the type of JOIN algorithm.

The project uses hash data partitioning and round-robin partitioning. The former is used for the Hash Split Join algorithm because it splits data based on the values of the input tuples; the latter is used for the Sorted Merge Join algorithm because, when splitting input data for this JOIN algorithm, we do not care about the values of tuples and only need an even split that ensures a good load balance. Further information is available in Chapter 4.
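The first three partitioning schemes described in this section can be sketched on plain integers as below. This is an illustrative sketch only; the class and method names are hypothetical and the project's activities operate on OGSA-DAI tuples rather than integers. Note that the parts are indexed from 0, so part 0 of the hash example corresponds to "Work set 5" in the text.

```java
import java.util.ArrayList;
import java.util.List;

class PartitioningSketch {
    private static List<List<Integer>> emptyParts(int n) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) parts.add(new ArrayList<>());
        return parts;
    }

    // Round-robin: the record at position i goes to part (i mod n).
    static List<List<Integer>> roundRobin(List<Integer> data, int n) {
        List<List<Integer>> parts = emptyParts(n);
        for (int i = 0; i < data.size(); i++) parts.get(i % n).add(data.get(i));
        return parts;
    }

    // Hash: the value x goes to part (x mod n), as in the example in the text.
    static List<List<Integer>> hash(List<Integer> data, int n) {
        List<List<Integer>> parts = emptyParts(n);
        for (int x : data) parts.get(Math.floorMod(x, n)).add(x);
        return parts;
    }

    // Range: upperBounds are inclusive upper limits; the last part takes the rest.
    static List<List<Integer>> range(List<Integer> data, int... upperBounds) {
        List<List<Integer>> parts = emptyParts(upperBounds.length + 1);
        for (int x : data) {
            int p = 0;
            while (p < upperBounds.length && x > upperBounds[p]) p++;
            parts.get(p).add(x);
        }
        return parts;
    }
}
```

Applied to the work set {1, 2, 3, 4, 6, 7, 8, 11, 12, 16} with five parts, hash produces parts of sizes 0, 4, 3, 2 and 1, while roundRobin produces five parts of two values each.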

3.4 Parallel Join Implementations

There are many parallel join algorithms, but this project focuses on just two of them: the Hash Split Join algorithm and the Sorted Merge Join algorithm. This section focuses on the structure of the OGSA-DAI server side and client side code.

3.4.1 Hash Split Join

In this algorithm, both input tuple sets are first split on their join key using the default hash function. Given a value K, the hash value produced by the default hash function is:

hash ( K ) = K.hashCode () mod NUM

NUM stands for the number of output subsets. hashCode() is the Java library method of the Object class, which returns an integer.
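One detail worth noting when turning this formula into Java: hashCode() may return a negative integer, and Java's % operator keeps the sign of the dividend, so a naive K.hashCode() % NUM can yield a negative index. The sketch below uses Math.floorMod to stay in the range 0 to NUM-1. Whether the project's Hash Split Activity handles negative values this way is not stated in the text; the class and method names here are hypothetical.

```java
class HashBucket {
    // Maps a key to a subset index in [0, num), even for negative hash codes.
    static int bucket(Object key, int num) {
        return Math.floorMod(key.hashCode(), num);
    }
}
```

For example, Integer.valueOf(-7).hashCode() is -7, so -7 % 4 would give -3, whereas bucket(-7, 4) gives 1.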

After that, related sets can be joined in parallel. Every joined pair of sets yields one part of the final result. The last step is to union all join results into the final result.

The following activities were implemented to support the hash split join algorithm in OGSA-DAI. For each activity, the inputs, outputs and behaviour are described.

Hash Split Activity

Split the input data on the given column name into a given number of sets using a hash function.

Activity inputs:

• Data. Type: OGSA-DAI list of Tuples. A stream of tuples to be split.
• Name. Type: String. The name of column to split on.
• Number. Type: Integer. The number of output sets.

Activity outputs:

• Result. Type: Array of OGSA-DAI list of Tuples.

Hash Join Activity

Join two sets together using an inner join operation. Note that this is a generic activity: it can also be used to join un-split input sets.

Activity inputs:

• Data1. Type: OGSA-DAI list of Tuples. The first dataset to be joined.
• Data2. Type: OGSA-DAI list of Tuples. The second dataset to be joined.
• Name1. Type: String. The name of column to use for the join from the first dataset.
• Name2. Type: String. The name of column to use for the join from the second dataset.

Activity outputs:

• Result. Type: OGSA-DAI list of Tuples.

Union All Activity

Union the given array of lists of tuples into one list. This activity is used to generate the final result.

Activity inputs:

• Data. Type: Array of OGSA-DAI list of Tuples. The datasets to be unioned together.
• Number. Type: Integer. The number of datasets to be unioned together.

Activity outputs:



Hash Split Join User Side Code

This is a user side function. It manages all hash join related activities, connects each activity's output to the appropriate input, and builds the entire workflow from the single activities.

Activity inputs:

• Query. Type: SQL Query. The request to be executed.
• Number. Type: Integer. The number of processors.

Activity outputs:


3.4.2 Sorted Merge Join

The sorted merge join algorithm is different from the hash split join algorithm. It needs four steps: split, sort the split sets, merge the sorted sets, and sorted join. It uses parallelism to sort the original input sets and performs the low-complexity sorted join as the last step.

First of all, input tuples will be split into subsets. After that, these balanced sets can be sorted in parallel and merged together. The last step is to join two ordered sets into the final result.
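The sort and merge steps described above can be sketched as follows, assuming the input has already been split into subsets of integer keys. This is an illustrative sketch rather than the project's Tuple Sort and Sorted Merge activities; the class and method names are hypothetical. Subsets are sorted in parallel and then merged with a priority queue, so each sorted subset is scanned only once.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Collectors;

class SortMergeSketch {
    // Sort every subset in parallel, then merge the sorted subsets.
    static List<Integer> sortAndMerge(List<List<Integer>> subsets) {
        List<List<Integer>> sorted = subsets.parallelStream()
                .map(s -> {
                    List<Integer> copy = new ArrayList<>(s);
                    Collections.sort(copy);
                    return copy;
                })
                .collect(Collectors.toList());
        return merge(sorted);
    }

    // K-way merge: queue entries are {value, listIndex, positionInList}.
    static List<Integer> merge(List<List<Integer>> sorted) {
        PriorityQueue<int[]> heads =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < sorted.size(); i++) {
            if (!sorted.get(i).isEmpty()) {
                heads.add(new int[] {sorted.get(i).get(0), i, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            out.add(head[0]);
            List<Integer> source = sorted.get(head[1]);
            int next = head[2] + 1;
            if (next < source.size()) {
                heads.add(new int[] {source.get(next), head[1], next});
            }
        }
        return out;
    }
}
```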

The following activities were used to form the sorted merge join algorithm in OGSA-DAI.

Random Split Activity

Split the input data into subsets of equal size.

Activity inputs:

• Data. Type: OGSA-DAI list of Tuples. A stream of tuples to be split.
• Number. Type: Integer. The number of output sets.

Activity outputs:

• Result. Type: Array of OGSA-DAI list of Tuples.

Tuple Sort Activity

Sort the input data by the given column.

Activity inputs:

• Data. Type: OGSA-DAI list of Tuples. A stream of tuples to be sorted.
• Name. Type: String. The name of column to sort.

Activity outputs:


Sorted Merge Activity

Merge sorted sets into one. This function only needs to scan every input set once, which leads to good performance.

Activity inputs:

• Data. Type: Array of OGSA-DAI list of Tuples. The ordered datasets to be merged together.
• Number. Type: Integer. The number of sets.
• Name. Type: String. The name of column used for merge.

Activity outputs:


Sorted Join Activity

Join two ordered sets together. This function also only needs to scan every input set once.

Activity inputs:

• Data1. Type: OGSA-DAI list of Tuples. The first dataset to be joined.
• Data2. Type: OGSA-DAI list of Tuples. The second dataset to be joined.
• Name1. Type: String. The name of column to use for the join from the first dataset.
• Name2. Type: String. The name of column to use for the join from the second dataset.

Activity outputs:


Sorted Merge Join User Side Code

This is user side code. Similar to the Hash Split Join User Side Code, this function manages all sorted merge join related activities and builds the entire workflow from the single activities.

Activity inputs:

• Query. Type: SQL Query. The request to be executed.
• Number. Type: Integer. The number of processors.

Activity outputs:


3.5 User Side Workflow

3.5.1 OGSA-DAI workflow

The OGSA-DAI class PipelineWorkflow is used to assemble activities into a workflow. By using this class, activities can be organized in their logical order. Independent activities can run in parallel. In detail, this is how OGSA-DAI workflows are assembled and executed programmatically:

Step 1: Initialize all the activities in the workflow.

Step 2: Connect the activities' inputs and outputs.

Step 3: Add the activities to the pipeline.

Step 4: Get a handle to an OGSA-DAI DataRequestExecutionResource object, and then execute the entire pipeline on this object.

Note that this class is called a pipeline only because it organizes related activities as a pipeline. It still allows parallelization. For example, several sort activities can run in parallel.

3.5.2 Serial Join

Serial Join reads both tables and joins them without any parallel optimization. It is used to generate a reference result for testing and a baseline execution time for performance comparison. The main steps in the workflow are the following:

Step 1: Read tables from database using SQLQuery activity.

Step 2: Sort both left side and right side table using TupleSort activity.

Step 3: Join tables by using OGSA-DAI default TupleMergeJoin activity.

Figure 12 shows the workflow of the serial join. It uses the default join activity provided by OGSA-DAI. Because the default join activity requires ordered inputs, the input tuples need to be sorted first.

Figure 12 Serial Join Workflow

3.5.3 Hash Split Join

As mentioned in Section 3.4.1, Figure 13 illustrates the Hash Split Join user side workflow:

Step 1: Read tables from database using SQLQuery activity.

Step 2: Split both the left side and right side tables into hash sets using the HashSplit activity.

Step 3: Apply a sort-merge join to each smallest join unit.

Step 3.1: Sort every hash set using TupleSort activity.

Step 3.2: Join ordered sets using SortedMergeJoin activity.

Step 4: Union all the results using UnionAll activity.

Figure 13 Hash Split Join

3.5.4 Sorted Merge Join

As mentioned in Section 3.4.2, Figure 14 illustrates the Sorted Merge Join user side workflow, which is quite similar to the Hash Split Join workflow:

Step 1: Read tables from database using SQLQuery activity.

Step 2: Split both the left side and right side tables into sets randomly using the RandomSplit activity.

Step 3: Sort every set using TupleSort activity.

Step 4: Combine the sorted sets from Step 3 into one large sorted set per table using the SortedMerge activity.

Step 5: Join two sorted sets using SortedJoin activity.


Figure 14 Sorted Merge Join
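The final step of this workflow, joining two sorted sets, can be sketched as a classic merge join. The version below is a simplified illustration on sorted integer key arrays and is not the SortedJoin activity itself; the class and method names are hypothetical. It returns the index pairs of matching tuples. When keys are unique on at least one side, each input is scanned once; with duplicate keys on both sides, the matching run on the right is revisited for every duplicate on the left.

```java
import java.util.ArrayList;
import java.util.List;

class SortedJoinSketch {
    // Inner equi-join of two key arrays that are already sorted ascending.
    // Each result entry is {leftIndex, rightIndex}.
    static List<int[]> join(int[] left, int[] right) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] < right[j]) {
                i++;
            } else if (left[i] > right[j]) {
                j++;
            } else {
                int key = left[i];
                int runStart = j;
                // Emit the cross product of the two runs of equal keys.
                for (; i < left.length && left[i] == key; i++) {
                    for (j = runStart; j < right.length && right[j] == key; j++) {
                        out.add(new int[] {i, j});
                    }
                }
            }
        }
        return out;
    }
}
```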


Chapter 4 Performance Analysis

4.1 Experimental Setup

4.1.1 Test Environment

Here is a list of the key software used in our tests:

• Linux: 2.6.18-128.1.14.el5 x86_64 GNU/Linux
• Tomcat: 5.0.30
• OGSA-DAI-3.1-axis-1.2.1

The test machine is Ness, a parallel machine based on AMD Opteron processors running Linux. It has a shared memory architecture. The system has two back-end X4600 SMP nodes; each node contains 16 processor cores with 2 GB of memory per core. This project uses only one of the back-end nodes, with a maximum of 16 cores.

Furthermore, all queries request data from the IBM DB2 server unless stated otherwise. More details about the databases and the machines they run on are given in Section 4.5.

4.1.2 Test Data Set

In order to evaluate correctness and performance, the TPC-H benchmark is used. TPC stands for the Transaction Processing Performance Council. TPC benchmarks are widely used to evaluate the performance and verify the correctness of database systems.

The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions [4].

The default TPC-H data set has a uniform distribution of values. In order to analyse performance under various distributions, TPC-H Skew is used to test the performance on unbalanced input data. It is a modified version of this benchmark provided by Surajit Chaudhuri and Vivek Narasayya from Microsoft.

The TPC-H generator allows the size of the data set to be chosen; this project uses the default setting of 100MB as the database size.

4.1.3 Test Query

This project chooses the TPC-H query that contains the largest number of tables as the test query. This query is:

SELECT *
FROM customer, orders, lineitem, supplier, nation, region
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND l_suppkey = s_suppkey
  AND c_nationkey = s_nationkey
  AND s_nationkey = n_nationkey
  AND n_regionkey = r_regionkey

This query contains the two largest tables in the database: orders (150,000 tuples) and lineitem (600,000 tuples). It is complex enough for our needs as it joins six tables together.

The bad news is that it is hard to reuse previously sorted tuple streams in consecutive joins. Analysis of this query shows that all the tables are joined on different keys, which means tuples must be split by different keys and sorted in different orders. In this situation, there is no way to reuse previous results. Moreover, analysis of the other TPC-H queries shows that this is the case for most of them.

4.1.4 Query Graph

The large query in the previous section can be represented graphically as a query graph (Figure 15). Nodes (like P, PS, and L) represent source tables, and JOIN operations are represented by the edges of the graph. A query graph is a compact and convenient representation of join queries and is used in join ordering algorithms.

Figure 15 Query Graph

4.1.5 Query Plan

The query graph presented in Figure 15 can be mapped to a logical query plan, shown in Figure 16. Note that there may be several mappings that result in semantically equivalent query plans.

Figure 16 Query Tree

The last step is to translate the logical query plan into executable code.

Alternatively, the previous query graph can also be translated into other query plans. Other examples of equivalent query plans are presented in Section 4.4.

4.1.6 Measurement of Time

An OGSA-DAI workflow is only defined on the client side, and OGSA-DAI activities are only initialized in user side code. However, the workflow executes on the server side after its initialization. As a result, only the overall running time is available in the user side test code; we cannot determine the cost of each activity in the test code.

An alternative solution is to add timers before and after each OGSA-DAI execution unit as debug information, which gives the approximate running time of each single activity. As the initialization cost of activities is excluded from the measurement, the time measured this way must be smaller than the real time, because it does not account for the pre-processing and post-processing time.

However, the OGSA-DAI mechanism is not well suited to measuring the individual running time of every activity. When the output stream of one activity is connected to the input stream of another, the sender produces its data in small chunks and inserts them block by block into the pipe connecting the activities. Once the first data chunk is sent, the receiver activity is started and then blocks while waiting for further input. The timelines of the two activities may therefore overlap. That is also why the sum of the individual activities' running times may be larger than the overall running time measured on the user side.

4.1.7 Script for Submission

As this project measures performance on 16 processors, OGSA-DAI must be run on the back end of Ness. Because OGSA-DAI runs on the Tomcat server, the submission script is organized in five steps:

1) Set environment parameters.
2) Start up the Tomcat server.
3) Sleep for a while until its service has booted successfully.
4) Run our test case.
5) Shut down the Tomcat server.

The script is available in Appendix B.

4.1.8 Reproducibility of Measurements

One thing should be noted: Java needs a pre-run to warm up. A warm-up lets OGSA-DAI initialize its context on the first request and perform just-in-time compilation. Without the warm-up phase, the performance of initial test runs may differ significantly from subsequent runs. In our case the initial run is about four times slower than the second run.

In order to solve this problem, every test contains an inner loop that executes the same query ten times. The running time of each query is measured as the average time of the last nine iterations (the result of the first one is discarded).
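The measurement loop described above can be sketched as follows. This is a generic timing harness written for illustration, not the project's test code; the class and method names are hypothetical and the task stands in for submitting the query.

```java
class WarmupTiming {
    // Runs the task the given number of times and records each wall-clock time in seconds.
    static double[] measureSeconds(Runnable task, int runs) {
        double[] times = new double[runs];
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            task.run();
            times[r] = (System.nanoTime() - start) / 1e9;
        }
        return times;
    }

    // Averages all runs except the first one, which is treated as the warm-up.
    static double averageDiscardingFirst(double[] times) {
        if (times.length < 2) {
            throw new IllegalArgumentException("need at least two runs");
        }
        double total = 0;
        for (int r = 1; r < times.length; r++) {
            total += times[r];
        }
        return total / (times.length - 1);
    }
}
```

With ten runs, averageDiscardingFirst(measureSeconds(task, 10)) gives the average of the last nine.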

Besides, the just-in-time (JIT) compilation technique may provide some advantage in this test. This technique is used to improve the runtime performance of a program. It is an automatic optimisation based on runtime analysis and dynamic translation. It improves on interpretation by speeding up the hot spots of the code, and it can also re-compile code if this is found to be advantageous.

Figure 17 illustrates that there is a more than fourfold speedup when the query is executed more than once. This can be attributed to the fact that OGSA-DAI only initializes its context on the first request; repeated query requests save this time in the remaining executions. In the real world, the OGSA-DAI service would always be initialized already. As a result, the running time of the first query, which is also the slowest one, is discarded.

Furthermore, it is hard to find any benefit brought by the JIT technique. If it had a visible effect, the running time of the second round should be slightly slower than later ones, because the extra time spent re-compiling the code would be counted in round two, while the remaining rounds would gain some speedup from the optimization. It may also be the case that all JIT optimisations are applied during the initial run.

Figure 17 Running Time of Reproduced Test. The result is based on large parallel test (150,000 tuples).

4.2 Performance Analysis for Single Activity

The performance of every activity is presented and analysed in this section. This information is used to spot the bottlenecks in the join workflow.

Bandwidth is used to measure the performance of these activities, which is calculated by dividing the number of joined tuples by the processing time.
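As a worked instance of this definition, the 150,000-tuple SQL Query measurement reported in Table 1 gives the following (the symbols B, N and t are introduced here only for illustration):

```latex
B = \frac{N}{t}, \qquad
B = \frac{150{,}000\ \text{tuples}}{1.59\ \text{s}} \approx 94{,}340\ \text{tuples/s}
```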

Every activity has pre-process, process and post-process steps. When an activity is evaluated, the timer starts at its pre-process step and ends at its post-process step.

4.2.1 Swallow Activity

Swallow activity is used to empty a given tuple list. It goes through its input (an OGSA-DAI requirement) and returns an empty list or a count as output. This activity is very fast because its body is empty: it simply swallows the input tuples.

It has two purposes:

a) It removes noise from time measurements. As OGSA-DAI requires all the activities' inputs and outputs in the workflow to be connected, we need to add some activities that are not essential to the JOIN operation. For example, the TupleToWebRowSetCharArrays activity converts the tuple list format to a web readable format and is the last part of the workflow. When its input set is large, this activity is really slow and contributes more than 80% of the overall running time. However, it only converts data and is not part of the actual join processing. The solution is to add a swallow activity before the TupleToWebRowSetCharArrays activity. The TupleToWebRowSetCharArrays activity then takes only negligible time to process its empty input set and no longer distorts the measured performance.

b) The measured performance of an individual activity may be misleading because of the mechanism of the OGSA-DAI workflow. As shown in Figure 18, Activity 3 needs the output of both Activity 1 and Activity 2. However, Activity 1 is slower than Activity 2, so Activity 2 will block and wait for Activity 1 to finish. In this case, the measured time of Activity 2 will be larger than its actual processing time.

Figure 18 Workflow without Swallow Activity

A swallow activity handles this problem easily. As shown in Figure 19, when the running time of an individual activity is tested, a swallow activity is added after the target activity. In the above case, Activity 2 will end without waiting. The waiting time is transferred to the swallow activity, so it no longer distorts the measurement.

Figure 19 Workflow with Swallow Activity

To conclude, a swallow activity is inserted after every activity whose execution time is to be measured as part of the workflow; it is a very helpful activity that simplifies the test code.

4.2.2 Performance of SQL Query Activities

Table 1 shows the performance of SQL Query Activity. This is a serial activity. The workflow for this test is:

SQL Query -> Swallow -> TupleToWebRowSetCharArrays

The running time in this section is the overall running time of this workflow. As the workflow contains two additional, inexpensive activities, the measured time is slightly larger than the actual running time of the SQL Query Activity.

According to this table, this activity has a bandwidth of 45,000 to 95,000 tuples per second, depending on the data size. The bandwidth increases with the number of tuples. As the SQL query activity contains steps with a fixed, non-negligible cost (such as setup) as well as steps whose cost is closely related to the data size, it performs better when handling large data sets.

Number of Tuples    Time (s)    Bandwidth (tuples/second)
150,000             1.59        94340
125,000             1.4         89286
100,000             1.18        84746
75,000              0.98        76531
50,000              0.77        64935
25,000              0.55        45455

Table 1 Bandwidth of SQL Query Activity

4.2.3 Performance of Split Activities

Both Hash Split Activity and Random Split Activity are O(N) tasks. As the number of tuples decreases, the running time of both activities decreases in the same pattern. As Hash Split Activity applies an extra hash function, it is slightly slower than Random Split Activity. However, this extra hash function contributes only a little to the overall running time, so the performance of the two split activities is almost the same. In this section, Hash Split Activity is used to illustrate the performance.

The execution time in this section measures only the split activity. This is a serial activity. Here is the workflow used to evaluate performance:

SQL Query -> Split -> Swallow -> TupleToWebRowSetCharArrays

For the reason pointed out in Section 4.1.6, the SQL query activity introduces some noise. Table 2 shows the bandwidth of this activity.

Number of Tuples    Time (s)    Bandwidth (tuples/second)
150,000             2.48        60484
125,000             2.18        57339
100,000             1.73        57803
75,000              1.38        54348
50,000              1.02        49020
25,000              0.48        52083

Table 2 Bandwidth of Hash Split Activity

Figure 20 Performance of Hash Split Activity 3

Figure 20 shows the performance of Hash Split Activity. As the size of the data set decreases, the time consumed by the split activity is reduced. The bandwidth of this activity is about:

55,000 tuples / second

Furthermore, the number of output sets, which should equal the number of processors, does not affect the performance of the activity. That is because this is a serial activity; parallelization does not help it.

4.2.4 Performance of Tuple Sort Activity

This task should be the main bottleneck and should be parallelised. However, the running time of this activity is not reduced by using parallelism.

Tuple sort activity contains two parts: adding tuples to a list and sorting the list. Timers are placed before and after both parts in order to find out exactly where the time goes. This activity is discussed in three parts:

Adding tuples to a list

If a Java ArrayList is used to add tuples, the bottleneck is the bandwidth of the SQL query data flow, which is not increased by adding extra processors. This means that even if the number of processors is increased (so every work set becomes smaller), the tuple sort activity will not execute faster, because of the serial SQL query activity.

If a Java LinkedList is chosen instead, the bottleneck is adding tuples to the LinkedList, so some speedup can be gained because this step can be parallelised.

This behaviour can be seen in Figure 21. Tuple Sort Activity performs very differently with Array List and Linked List. With the former, the workflow is blocked by the preceding SQL query activity, so parallelising Tuple Sort Activity gains hardly any speedup. With the latter, Tuple Sort Activity becomes the bottleneck because of the slow Linked List operations, and parallelization works well.

3 Test code about this graph can be found in /test_related/hash_split_activty/ in the code package.

34

Figure 21 Array List vs. Linked List

Sorting a list

This part uses a Java library function, Collections.sort(), to sort the given list. This library function uses the merge sort algorithm, which has a complexity of O(N log N). Besides, if a list is already sorted, the merge sort algorithm takes less time to generate the result.

More details are in the coming section.

Overall Sort Activity

Here is the workflow to evaluate performance:

SQL Query -> (Sort-part 1 -> Sort-part 2) -> Swallow -> TupleToWebRowSetCharArrays

As Linked List is much slower than Array List (although it benefits from parallelisation), this project uses Array List in its final version. Figure 22 shows the performance of Tuple Sort Activity based on Array List. As the number of input tuples decreases, the overall running time is also reduced, with a bandwidth of about:

70,000 tuples / second

Furthermore, the execution time of part 1 is always larger than that of part 2. This can be attributed to a feature of the workload: most real-world data (including our test workloads TPC-H and TPC-H Skew) is not totally disordered. Most of it is already ordered or partly ordered, which saves a lot of time on search and swap operations when sorting.

Figure 22 Performance of Tuple Sort Activity 4. Part 1 stands for adding tuples to the list. Part 2 stands for sorting the given list. Overall stands for the overall running time of this activity.

4.2.5 Sorted Merge Activity and Union All Activity

Performance measurement for these two activities is hard, because the tests contain noise that cannot be removed, such as waiting for all inputs to be ready. Here is the workflow used to evaluate performance:

SQL Query -> Split -> Union All -> Swallow -> TupleToWebRowSetCharArrays

Besides, as both activities only need to scan every table once, their bandwidths are also close. The bandwidth of these two activities is shown in Table 3:

Number of Tuples    Time (s)    Bandwidth (tuples/second)
150,000             1.86        80645
125,000             1.59        78616
100,000             1.4         71429
75,000              1.2         62500
50,000              0.77        64935
25,000              0.4         62500

Table 3 Sorted Merge Activity and Union All Activity

4 Test code about this graph can be found in / test_related/tuple_sort_activty/ in the code package.


4.3 Single Join

4.3.1 SQL Query

In order to observe the results clearly, all tests in this section use a join operation on the two biggest tables, where orders contains 150,000 tuples and lineitem contains 600,000 tuples. The query is presented below:

SELECT *
FROM orders, lineitem
WHERE l_orderkey = o_orderkey

4.3.2 Performance Analysis Summary

Using parallelization to optimise a single join unit is the main aim of this project. This section analyses the overall function and every single activity.

The overall running time is approximately 6.28 seconds on 2 processors. However, when the test runs on more than two processors, there is no obvious speedup.

Table 4 shows where the time is spent during workflow execution. From the table we can see:

1) SQL Query Activity is always slow. Its bandwidth is closely related to the data size, which also means SQL Query Activity is better suited to large tables.

2) Split Activities have the worst performance among all activities according to this table. However, this is because a Split Activity comes immediately after SQL Query, which introduces some noise (Section 4.1.6).

3) Tuple Sort Activity should be the most expensive of all the activities. However, in this case the bottleneck is the bandwidth of SQL Query, so the performance weakness of this activity cannot be seen clearly.

4) The overall performance is always more than 33,000 tuples per second, which is a little lower than that of the slowest activity (SQL Query Activity in this case).

Activity                     Bandwidth (tuples/second)
SQL Query                    40,000 - 80,000
Hash Split / Random Split    55,000
Tuple Sort                   70,000
Union All / Sorted Merge     60,000 - 80,000
Overall                      33,000+

Table 4 Overall Activity Performance. Note that the measured running times of these activities are larger than the actual ones. The reason is addressed in Section 4.1.6.

4.4 Multiple Join

This section presents the performance analysis of the complex SQL query over more than two tables introduced in Section 4.1.3. The execution of two different but equivalent query plans is discussed.

4.4.1 Query Plan 1: Independent Parallelism

Query Plan 1 tries to make good use of independent parallelism. As shown in Figure 23, the six input tables are first divided into three groups, and the result of each group takes part in the following joins. The joins in the groups can run in parallel. In this example, region JOIN nation, supplier JOIN customer and lineitem JOIN orders can be executed at the same time.

Figure 23 Query Plan 1

A rectangle stands for a table, with its name and the number of tuples in it. A dashed rectangle represents a join operator, and the number inside it is the size of the temporary result at that step.

4.4.2 Query Plan 2: Pipelined Parallelism

Query Plan 2 in Figure 24 is a good example of the pipelined parallelism described in Section 3.2.2. All the joins must be run one by one. However, parallelization can still be exploited in this case. To illustrate, while the first join, lineitem JOIN orders, is executed, activities such as SQL Query Activity and Split Activity for the other tables can run in parallel.

Figure 24 Query Plan 2 5

As in Figure 23 above, a rectangle stands for a table, with its name and the number of tuples in it. A dashed rectangle represents a join operator, and the number inside it is the size of the temporary result at that step.

4.4.3 Query Plan 1 vs. Query Plan 2

Table 5 shows the different performance of these two plans. The first plan is about 1.5 times faster than the second one (68.355 s / 43.79 s ≈ 1.56). This can be explained as follows:

                Time (s)    Standard Deviation (s)
Query Plan 1    43.79       0.41
Query Plan 2    68.355      0.7875

Table 5 Query Plan 1 vs. Query Plan 2

5 Test code about this graph can be found in / test_related/mutiple_joins / in the code package.

1) Query Plan 1 benefits from executing joins in parallel. However, it can only gain a limited speedup, which depends on the number of tables. For example, if there are only three tables in a query, this optimization will not affect the performance. In our test case, there are six tables and five joins. Three out of the five joins can be executed concurrently, which leads to good performance.

2) Query Plan 2 suits cases where tables are joined on the same column, so that it can shorten its critical path by re-using previous temporary results. For Hash Split Join, this means re-using the split subsets (Figure 25). With Sorted Merge Join, the previous result set is already sorted and can be used directly in the next join. However, in our test case all the tables are joined on different columns, so Query Plan 2 can hardly re-use anything, which is why it is slower than Query Plan 1.

Figure 25 Re-use in Hash Split Join

4.5 Join on Distributed Heterogeneous Database

4.5.1 Database Information

This project uses two database systems:

1) IBM DB2 Server, running on rat-epcc.epcc.ed.ac.uk, which has 500MB of memory and a single processor.

2) MySQL 5.0.27, running on coal.epcc.ed.ac.uk, which has 2GB of memory and two processors.


Both databases contain the same 100MB TPC-H Skew data set. However, the IBM DB2 server is expected to be slower than the MySQL server, as its host has less memory and fewer processors.

4.5.2 IBM DB2 vs. MYSQL

Table 6 shows the average running time of queries on both the IBM DB2 server and the MySQL server. The tests run on two processors using the Hash Split Join algorithm. The size of the inputs is given by the largest table in the query. All the tests use the same parameters except the input size and the database.

Size of Inputs            Time (s)    Standard Deviation (s)
DB2 (150,000 tuples)      6.26        0.13
DB2 (100,000 tuples)      4.38        0.11
DB2 (50,000 tuples)       2.43        0.09
MySQL (150,000 tuples)    5.89        0.21
MySQL (100,000 tuples)    4.12        0.18
MySQL (50,000 tuples)     2.28        0.19

Table 6 Performance on Different Databases⁶

Figure 26 visualises the data from the above table. It gives a clear picture of the performance of both databases: requests on the IBM DB2 server are consistently slightly slower than requests on the MySQL server, but overall the two perform nearly the same.

Figure 26 DB2 vs. MySQL
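How close the two servers are can be quantified from the running times in Table 6. The short Python sketch below only restates that arithmetic; the helper name `relative_gap` is hypothetical.

```python
# Average running times in seconds, copied from Table 6,
# keyed by the size of the largest input table (tuples).
db2 = {150_000: 6.26, 100_000: 4.38, 50_000: 2.43}
mysql = {150_000: 5.89, 100_000: 4.12, 50_000: 2.28}


def relative_gap(slow, fast):
    """Extra time of `slow` expressed as a fraction of `fast`."""
    return (slow - fast) / fast


gaps = {size: relative_gap(db2[size], mysql[size]) for size in db2}
for size, gap in sorted(gaps.items()):
    print(f"{size:>7} tuples: DB2 takes {gap:.1%} longer than MySQL")
```

For all three input sizes the DB2 requests take between 6% and 7% longer than the MySQL requests.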

⁶ Test results were collected by executing the query described in Section 4.3.1 on two processors.


4.5.3 Heterogeneous Database

This section discusses the performance of requests on heterogeneous databases. Table 7 compares them with requests on a single database. The heterogeneous requests are slightly faster than the request that reads both tables from DB2 and roughly on a par with the request that reads both tables from MySQL. The gain over DB2 arises because using two databases reduces the number of tuples retrieved from any single database. As the bandwidth of the database is the bottleneck in this case, a request on mixed databases should increase the overall bandwidth and improve the performance. However, in our test this improvement is not clearly visible because of the difference in size between the left and right input tables.

Size of       Size of        Heterogeneous:    Heterogeneous:    Both from    Both from
Left Table    Right Table    Left MySQL,       Left DB2,         DB2          MySQL
                             Right DB2         Right MySQL
15,000        150,000        5.91              6.08              6.26         5.89
15,000        100,000        4.21              4.2               4.38         4.12
15,000        50,000         2.36              2.22              2.43         2.28

Table 7 Performance of Heterogeneous Databases (times in seconds)⁷

⁷ Test results were also collected by executing the query described in Section 4.3.1 on two processors.
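A rough model shows why the size difference between the two inputs limits the gain. The Python sketch below is a simplification and not a measurement: it assumes that retrieval time is proportional to the number of tuples, that both sources are read concurrently, and that both deliver tuples at the same hypothetical rate.

```python
def single_source_time(left, right, rate):
    """Both tables are fetched from one database, one after the other."""
    return (left + right) / rate


def two_source_time(left, right, left_rate, right_rate):
    """Each table comes from its own database; the slower fetch dominates."""
    return max(left / left_rate, right / right_rate)


RATE = 30_000.0  # tuples per second, a made-up value for illustration only

for left, right in ((15_000, 150_000), (150_000, 150_000)):
    one = single_source_time(left, right, RATE)
    two = two_source_time(left, right, RATE, RATE)
    print(f"left={left:>7} right={right:>7} saving={(1 - two / one):.1%}")
```

Under these assumptions, a 15,000-tuple left table and a 150,000-tuple right table allow a saving of about 9% at most, whereas two inputs of equal size would allow up to 50%.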


Chapter 5 Conclusions

The main aims of the project described in this thesis were the following:

a) To investigate methods for parallel execution of join queries, which are usually used to optimize a single join operation.

b) To analyze the difference in performance caused by different query plans, which are used to speed up complex queries that contain multiple join operations.

To achieve these goals, the project proceeded in the following stages:

a) The first stage of this project was to get familiar with OGSA-DAI and relational algebra.

b) The second stage was to design efficient parallel approaches to optimize the existing join operations. The most important part of this work was to analyze and investigate the parallel mechanisms for executing complex join queries on large tables.

c) The third stage was to implement the parallel algorithms in the distributed environment and to evaluate their performance when executing queries in parallel.

Based on the experiments in this project, we draw the following conclusions:

a) The bandwidth of the database is one of the limitations. When it (the SQL Query Activity in our case) becomes the bottleneck, the only way to optimise it is to use a faster and more powerful database system. Performance will increase as the bandwidth increases. The speedup gained from parallelization in this case is not significant.

b) The sort operation is another limitation. When it is the operation that limits the performance, parallelization is a good way to gain a speedup. All the activities can be parallelised except the SQL Query Activity and the Union All Activity. Compared with the O(N log N) Sort Activity and Join Activity, these two have only O(N) complexity. Because these two serial activities are not costly in this case, a large speedup can be expected. As mentioned in Section 4.2.4, when a Linked List is used to sort a set and the Sort Activity becomes the bottleneck, the speedup is close to linear. Partitioned parallelism works well in this situation.

c) A suitable query plan helps considerably, especially when a request is complex. If a request contains several Joins on the same key, pipelined parallelism is the best plan because it can take full advantage of re-use. Otherwise, an independent parallelism plan provides a considerable and stable speedup.
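Conclusion b) can be read through Amdahl's law: the smaller the share of the run time spent in serial activities, the closer the speedup gets to the number of processors. The Python sketch below illustrates this; the serial fractions are hypothetical values chosen for illustration, not measurements from this project.

```python
def amdahl_speedup(serial_fraction, workers):
    """Upper bound on speedup when `serial_fraction` of the work cannot be parallelised."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


# Hypothetical serial shares: a cheap serial part versus a dominant one.
for serial_fraction in (0.05, 0.5):
    bounds = [round(amdahl_speedup(serial_fraction, p), 2) for p in (2, 4, 8)]
    print(f"serial share {serial_fraction:.0%}: speedup bounds for 2, 4, 8 workers = {bounds}")
```

A small serial share corresponds to the case where the Sort Activity dominates, and a large one to the case where the SQL Query Activity is the bottleneck.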


Appendix A Source Code

The source code is organized in two parts: code implementing activities and code implementing test cases. As OGSA-DAI is a client/server framework, the activity-related code is grouped into two directories. The source code for all the tests used in this project can be found in the test_related directory.

Code
    Activities
        Client
            HashSplit.java
            SortedMerge.java
            TupleSort.java
            RandomSplit.java
            Swallow.java
            SortedJoin.java
            TupleMergeJoin.java
        Server
            HashSplitActivity.java
            RandomSplitActivity.java
            SortedJoinActivity.java
            SortedMergeActivity.java
            SwallowActivity.java
            TupleMergeJoinActivity.java
            TupleSortActivity.java
    Test_related
        hash_split_activity
        large_parallel_hash_split_join_150k
        large_parallel_hash_split_join_150k+swallow
        large_parallel_hash_split_join_20k
        large_parallel_hash_split_join_20k+swallow
        large_parallel_hash_split_join_60k
        large_serial_hash_split_join
        mutiple_joins
        script
        small_serial_hash_split_join
        small_serial_hash_split_join+swallow
        sql_query_activity
        test_different_database
        tuple_sort_activity
        union_all_activity


Appendix B Submission Script

Tests were run on Ness, which uses Sun Grid Engine as its batch scheduler. The test submission script is presented below:

The submit script above contains five parts:

[1] Set environment parameters.
[2] Start up the Tomcat server.
[3] Sleep for a while until its services have booted successfully.
[4] Run our test case.
[5] Shut down the Tomcat server.


References

[1] Wikipedia, JOIN webpage. http://en.wikipedia.org/wiki/JOIN

[2] Wikipedia, Query plan webpage. http://en.wikipedia.org/wiki/Query_plan

[3] OGSA-DAI webpage. www.ogsadai.org.uk

[4] TPC-H webpage. http://www.tpc.org/tpch/

[5] Elmasri, R. A. and Navathe, S. B. 1999 Fundamentals of Database Systems. 3rd. Addison-Wesley Longman Publishing Co., Inc.

[6] Yu, C. T. and Meng, W. 1998 Principles of Database Query Processing for Advanced Applications. Morgan Kaufmann Publishers Inc.

[7] Taniar, D., Leung, C., Rahayu, W., and Goel, S. 2008 High Performance Parallel Database Processing and Grid Databases. Wiley Publishing.

[8] Özsu, M. T. and Valduriez, P. 1999 Principles of Distributed Database Systems (2nd Ed.). Prentice-Hall, Inc.

[9] Chaudhuri, S., Gupta, A. K., and Narasayya, V. 2002. Compressing SQL workloads. In Proceedings of the 2002 ACM SIGMOD international Conference on Management of Data (Madison, Wisconsin, June 03 - 06, 2002). SIGMOD '02. ACM, New York, NY, 488-499. DOI= http://doi.acm.org/10.1145/564691.564747

[10] Foster I., Kesselman C., Nick J., Tuecke S. The Physiology of the Grid. Jun 2002. http://www.globus.org/alliance/publications/papers/ogsa.pdf

[11] Foster I., Kishimoto H., Savva A. The Open Grid Services Architecture, Version 1.5 (GFD-I.080), Jul 2006

[12] Antonioletti M., Atkinson M., Baxter R. OGSA-DAI Status Report and Future Direction. Proceedings of the UK e-Science All Hands Meeting 2004

[13] Atkinson M., Karasavvas K., Antonioletti M. A New Architecture for OGSA-DAI. Proceedings of the UK e-Science All Hands Meeting 2005

[14] Antonioletti M., Hong N. P. C., Hume A. C. OGSA-DAI 3.0 – The Whats and the Whys. Proceedings of the UK e-Science All Hands Meeting 2007


[15] Antonioletti, M., Atkinson, M., Baxter, R., Borley, A., Chue Hong, N. P., Collins, B., Hardman, N., Hume, A. C., Knox, A., Jackson, M., Krause, A., Laws, S., Magowan, J., Paton, N. W., Pearson, D., Sugden, T., Watson, P., and Westhead, M. 2005. The design and implementation of Grid database services in OGSA-DAI: Research Articles. Concurr. Comput. : Pract. Exper. 17, 2-4 (Feb. 2005), 357-376. DOI= http://dx.doi.org/10.1002/cpe.v17:2/4

[16] Jackson M. and Theocharopoulos E. OGSA-DAI WS-DAIX 1.0. University of Edinburgh. May 2008. http://www.ogsadai.org.uk/documentation

[17] Wang K., Xie Y. J., Li S. L., Wang X. Y. Performance Analysis of the OGSA-DAI 3.0 Software. 5th International Conference on Information Technology: New Generations. Apr 2008: 15-20

[18] Oevers M., Collins B. M., Knox A., Williams J. The Use of OGSA-DAI with IBM DB2 Content Manager for Multiplatforms in the eDiaMoND Project. Jan 2004. http://www.cs.indiana.edu/~plale/GGFDataWorkshop04/08-Oevers-OGSA-DAI-eDiamond.pdf
