Download - DATA BECDCTION ALGORITHBS FOR DISIRIBUTSD A THESIS IN ...

DATA BECDCTION ALGORITHBS FOR DISIRIBUTSD

QDIBY PROCISSING

by

JIA-SHINN WANG, E.E.

A THESIS IN

COHPOTEH SCIENCE

Submitted to the Graduate Faculty of Texas Tech University in

Partial Fulfillicent of the Fequirements for

the Degree of

HASIER OF SCIENCE

Approved

Accepted

May, 1984

ACKNCWLED3EM£N'IS

I would like to express my deepest appreciation to the

committee chairman. Dr. Gopal Lakhani, for his guidance and

suggestions in preparing this thesis. I wish to thank Dr.

Leonard H. Weiner for his valuable assistance. I am also

grateful to the members of my ccmicittee for their support

and assistance. Finally, to my faiily and Yu-Hua, ay

everlasting thanks for their understanding and encouragement

during my graduate study.

11

CONTENTS

ACKNOWLEDGEMENTS 11

CHAPTEF

I. INTRODOCTION 1

II. QUERY PROCESSING IN DISTRIBUTED DATAEASE 7

Query Processing 7 Semijcin 10

Distributed Query Processing Strategies 13

III. I!1PR0VEKENT IN HEVNER AND YAC'S METHCC 17

System Model 18 Simple Query 23

General Query 30

IV. MINI-MAX ALGORITHM 39

Introduction 39 Mini-Max Algorithm 42 Correctness of the Algorithm 51 Complexity of the Algorithm 55

V. CONCLUSION 58 Merits of Algorithm Mini-Max 58 Simulation for Study of Total and Response Time 61 Future Research and Suggestions 63

BIBLIOGRAPHY 64

111

LIST OF TABIES

1. A Distributed Eata Base System 21

2. Example of Siitple Query 27

3. Example of General Queries 31

4. Candidate Schedule for RJ 33

5. Candidate Schedule for R2 33

6. Data Base State for Algorithm Mini-Max 48

7. Comparison of Total Time and Response Time 62

IV

LIST OF FIGURES

1. Graph Cost for lES 25

2. Cost Graph for Reducing A Relation 25

3. Cost Graph for An Example of A Simple Query 28

4. Candidate Schedules for General Query Example 32

5. Cost Graph for RJ 35

6. Cost Graph for R2 36

7. Final Cost Graph for General Query 36

8. Cost Graph Including Processing Cost 41

9. Alternate Cost Graph Schedule 43

10. Cost Graph for IFS in Algorithm Mini-Max 49

11. Cost Graph After First Phase 50

12. Final Cost Graph for Algorithm Mini-Max 51

CHAPTER I

INTRODUCTICN

Recent advances in computer and communciation

technologies, coupled with explosion in size and complexity

of application areas, have led to design of large computer

communication complexes. For instance, the ARPANET

currently supports communication among more than one hundred

computer systems. The network is under continual development

and is used by thousands of users daily. Computer networks

have capability to bring computing power to the people who

need it, and provide access to a wider variety of resources

dispersed among several computers which are linked by a

communication facility to provide the basis for

"distributed" computing services.

In general, the notion of "distributed systems" varies

in character and scope with different people [KAR80]. So

far, there is no accepted definition and basis for

classifying these systems. Basically, at least four

physical components of a system might be distributed:

hardware or processing logic, data, the processing itself,

and control. Some speak of a system that has any one of

these components distributed as being a "distributed

system." However, Enslow [ENS78] pointed out that a

^

distributed system should include: a multiplicity of general

purpose resource components, a physical distribution of

these physical and logical components, a high level oper

ating system that unifies and integrates the control of the

distributed components, system transparency, and cooperative

autonomy.

Evolution of modern computer network technology and

rise of common carrier packet switched networks have provid

ed motivation to develop distributed information networks.

In addition, due to increasing geographic dispersion of end

users within an organization, pressures are generated or

data processing and corporate management to distribute data

processing and storage capabilities at the location of data

origin and/or end use of the data. The Distributed Data

Base Management System (DDBMS) provides faster, easier access

and more reliability to data than is feasible in tradition

al, centralized Data Base Management System (EBMS). Accord

ing to the CODASYL systems committee [NCC78], the major be

nefits derived of distributions focus on increased data

availablity and reduced exposure of total system failure due

to hardware/software failure to end users.

There are currently three favored approches to the data

base model: the hierarchical model, the network model, and

the relational model £DAT31]. The hierarchical model and

the network model present to the user a navigational

interface with which the user must determine the data access

path. The relational model was later proposed to achieve a

high level of data independence. Users perceieve the data

base as a collection of relations (or tables), regardless of

actual physical data structures used for stroage. Each rela

tion is composed of a set of homogeneous records (called tu

ples) , which in turn are subdivided into an ordered set of

fields call attributes. This simple data representation al

lows access of data by specifying the properties of data to

be retrieved, rather than specifying how data are to he ac

cessed. A component of the data base called the query opti

mizer determines ah efficient access plan.

Distributed data base systems are considerably more

complex than centralized ones [SWA81, SMA79, SMI81, RAM79].

These additional complexities are due to some problems in

herent to distribution, such as synchronization, heterogene

ity, and geographical dispersion. Major areas of current

research are query optimization, distributed concurrency

control, failure recovery, distributed data base design, and

distributed architectures for data base machines [SAC82].

The present thesis focuses on the problem of guery

optimization in distributed data base systems. The query

optimizer is a difficult component of a data base system

^

[ROT80], because the cost of almost all interactions with

the data base depends on the quality of plans which are de

termined by the query optimizer. This component is invoked

not only for retrival operations but also for replace and

remove operations. In [SAC82], the authors have presented

an excellent review of this area.

The principal bottleneck in distributed data base sys

tems is data communication. All economically feasible long

distance communication media incur lengthy delays. Moreover,

the cost of moving data through a network is high enough

that it can be compared with the cost of storing the data

locally for many days. In addition, parallel processing is

also an inherent aspect of distributed systems and mitigates

to some extent the communication factor. Distributed guery

evalution methods not only focus on efficient utilization of

network links but also explore mutltiplicity of assignable

resources to provide services within the system-

Most of the works in this area have concentrated on

static decomposition strategies which develop entire sche

dules before data processing for query evaluation starts.

Various simulation studies [SAC82] have shown that the

errors in this kind of scheduling would have a significant

influence on the quality of the performance. A more adaptive

control for the query optimizer should be enforced. It is

generally true that by exploring more assignable resources

to provide services will lower the response time of the

query. But at the same time, the total system load will be

relatively increased. If the work load is taken into con

sideration, then the gain in the response time of the guery

may decrease a lot due to system overhead.

Our research was motivated by research work of

S. Bing Yao et al. [SAC82] over the past few years. He

have noticed the necessity of various assumptions of their

data base system model. Although the assumptions of their

data base system model are almost necessary for deriving any

mathematically optimal solutions, the degree of correctness,

however, should vary froB an environment to another. Since

these assumptions fall into three mutually independent

aspects of computations, namely, data model, computer archi

tecture, and network topology and transmission characteris

tics, it is very difficult to quantify and relate them and

further to consider them in any optimization process. Con

sequently, their algorithms ignore any type of overhead,

whatsoever,in distributed processing, except the trans

mission cost. We propose to take a different view of the

problem of overhead cost. Another purpose of this thesis is

to investigate feasibility of development of an algorithm

which emphasizes parallelism but to a limited extent. He

believe that an extreme parallelisir in transmission and

operation execution eventually creates substantial overhead

which would affect the response of the system adversely.

In Chapter II of this thesis, an overview of query pro

cessing in the distributed data base system is provided.

This chapter presents a theoretical model of distributed

data base, implication of constraints on data base model,

guery representation and some general methods of query real

izations. It also includes discussion on seme important re

search in the area of query optimization. In Chapter III,

the approach of Hevner and Yao [HEV79b, APE83 ] on query op

timization is discussed in detail. Further, by means of ex

amples, we shew that their algorithms do net necessarily

produce schedules which are optimal within the limitations

of the data base model. Estimation of file size during

guery processing is a complex problem [SAC82, YU82 ]. Esti

mation based en static assumptions do net provide close to

actual results. The heuristic algorithm presented in Chapter

IV allows to use the actual size of intermediate files in

the development of query processing strategy. Proof of cor

rectness of the algorithm is presented later. He conclude

this thesis in Chapter V and provide some results of

simulation of cur algorithm.

CHAPTER II

QUERY PROCESSING IN CISTRIBUTEI DATABASE

Query Processing

A guery is an access reguest made by a user or a

program in which one or more files need to be accessed

[RAM79 ]. When mutiple files are accessed by the same query

in a distributed bata base, these files usually have to

reside at a common location before the query can be

processed, at least partially. Substantial communication

overhead is involved if these files are geographically

distributed and a copy of each file needs to be transferred

to a common location. It is therefore necessary to

decompose the guery into subgueries so that each subquery

accesses a single remote file. These subgueries may then be

processed in parallel at different locations. Finding an

efficient schedule of subgueries is important. If the query

is processed inefficiently, it not only takes a long time

before the end user gets his answer, but it might also

decrease the performance of the whole system because of

network congestion [SCH78].

Several nonprocedural languages are known for guery

representation, but we choose relation algebra for our

8

study. Relational algebra is based on three principal

operations: selection, projection, and join. A guery can

be fully described in terms of these operations and their

attribute domains [HEV79b].

Distributed queries are processed by moving data to de

sired locations and then by performing the reguired opera

tions there. The distributed query problem is one of trans

forming a distributed query into a set of local queries each

of which either transfers data from a sinqle location to

other locations or executes a relational operation on local

ly available data. A major objective in any distributed

query processing strategy is to decrease transmission cost.

It is important that the size of data base files is reduced

at their source locations. Therefore, every strategy tries

to execute selection and projection as early as possible be

cause each such operation reduces the size of its operand.

The size reduction decreases the subsequent movement cost of

intermediate relations. The final data lovement of all par

tial solutions must move data to the result site for final

processing. All algorithms assume that an initial phase is

performed. In this phase:

1. The guery is transformed into disjunctive normal

form. The conjuncts can be treated as subgueries and

can be executed independently. The result of the

original query is then the union of the results of

its subgueries.

2. All local processing at different locations should

possibly be performed in parallel.

A general processing cost function for a static system

of transmission (from site i to site j) ^^^ processing (at

site j) of s units of data, can be stated as :

TiJ (s) = lij + Cij(s) + Proj(s*sj)

where lij is the transmission overhead cost, CiJ is the cost

of using the link per unit of data, Proj is the cost of

processing a unit of data by the computer at site J, and sj

is the size of the relation already at site j which have to

be joined with the incoming data. The cost is expressed in

units of time. If no direct link exists between i and j ,

the cost is assumed to be taken over the cheapest path

connecting the two sites.

Distributed query evaluation strategies try to reach

one of two objectives: minimal total time or minimal

response time [HEV78]. Minimal total time objective attempts

to minimize resource consumption and therefore to maximize

system throuhput. The objective behind the ainimal response

time is to minimize system response time and therefore,

probably, to maximize resource consumption. These two

objectives are ,in general, conflicting. A decrease in

10

response time, for instance, may be obtained by having a

large number of parallel transmissions to different sites.

This requires a higher resource consumption, and consequent

ly, the system throughput is reduced.

The general problem of guery optimization in distribut

ed data bases has been proven to be NP-hard [HEV79a]. Gener

al strategies are, therefore, based on heuristics. The term

"optimal strategy" often means a strategy which produces

plans of a lover cost than a plan produced by naive stra

tegies and which also seem reasonably suitable to norrral

queries- Even if a strategy can be proven to produce the

optimum query processing, its rasults could be seriously bi

ased by using various methods of estimating size of interme

diate results.

Semijoin

Semijoin is a relational operator used in a number of

guery processing algorithms [ROT80, BERBIa, APE83 ] and data

base machines [PAE79, 02K77]. In a distributed data base,

we may need to compute joins on relations which are located

at different sites. In order to process these operations,

we need to transfer whole relation from one site to another

site. Because the transmission speed is much slower than

the processing speed, these join operations appear to be

11

most costly operations in the distributed data base.

Therefore, instead of computing joins directly, we should

first reduce the size of the relations, wherever possible,

by using selections and projections on appropriate attri

butes. The semijoin operation has the effect of reducing

the size of relation first before performing the join opera

tion.

A semijoin is "half of a join;" the semijoin of rela

tion Ri by relation Rj on a common attribute D is the jcin

of Ri with the relation projected from RJ on that common at

tribute D. He denote the semijoin operation by the symbol

0. The semijoin of relation Ri on domain A with relation Rj

on domain B is defined as (ti f Si I3tj G J such that

ti.A = tj.B) and is denoted by Ri[A«CB]RJ- The relational

algebra notation t.X means "the value of domain X of tuple

t."

The semijcin operation in this thesis will be limited

to single domain. The justification for this assumption is

its practicality. To perform a semijoin of relation Bi on

domains A and E with RJ on domains C and D, we need to com

pute projection of RJ on the pair of domains C and D. The

size of such a projection will generally be quite large, and

therefore the benefit of the semijoin is likely to exceed

its cost. Still, we do allow multidomain seaijoins in a

12

limited context. If a pair of domains in a relation are

always treated as a single composite domain, then the com

posite domain can be considered as atomic.

The characteristics of semijoin operation can be found

in [BER81a], [00082]. The principal issue discussed there

is to characterize those gueries that could be fully reduced

by semijoins. Applications of semijoin reduce the size of

data and therefore, in general, decrease transimission cost.

However, it is not difficult to see that these do not neces

sarily provide optimal solutions.

There are three basic advantages of semijoin operation

in distributed query processing. First, (Ri-RJ)^^!* nd so

semijoin monotonically reduce the size of the data base. By

contrast, joins can increase the size of the data base; in

the worst case |Ri join RJI = |Ri| • IRjI. Second, semi-

joins can be computed with less intersite data transfer than

needed for joins- To compute Ri^Rj, we need only transmit a

projection of a relation, whereas to compute the join we

must transmit the entire relation. Of course, semijoin may

also have less effect than a join, since BiiiRj only reduces

Ri whereas join simultaneously reduces both Ri and Rj. The

third advantage of semijcin is that the "reductive effect"

of a single join can be obtained by two semijoins, usually

at lower cost.

13

However, there are cases in which j o i n s outperform

s e m i j o i n s [BERSIa], and therefore a p p l i c a t i o n s of semijo ins

are based mostly on h e u r i s t i c arguments. An optimal query

process ing algorithm would almost c e r t a i n l y include both

j o i n s and semi jo ins . The graceful in tegrat ion of these tac

t i c s i s an open problem.

JOi.§llituted ^uery Frccessinq S t r a t e g i e s

Research on d i s tr ibuted query processing was i n i t i a l l y

done by Wong [W0N77]. He proposed an opt imizat ion method

based on a greedy h e u r i s t i c that produced general query

process ing s t r a t g i e s but not neces sar i l y optimal s t r a t e g i e s .

An enhanced vers ion of t h i s method i s implemented in the

SDD-1 system [1^0180] [BERSIb] which i s operat ional at

Computer Corporation of America. Epstein et a l . [EPS78]

developed an algorithm which i s partly based on c e n t r a l i z e d

guery s t r a t e g i e s . This algorithm i s implemented in the

d i s t r i b u t e d INGRES data base system. The algorithm

opt imizes each guery operation l o c a l l y . Chu and Hurley

[CHD80] incorporated the idea of permuting r e l a t i o n a l

operat ions in the guery trees [SMI75] and defined the

s o l u t i o n space of f e a s i b l e processing s t r a t e g i e s for

d i s t r i b u t e d g u e r i e s . They presented an optimal algorithm

14

using 0-1 integer programming which is known to be an

exponential time problem. Hevner and Yao [HEV78] proposed to

use volume of data transferred as the optimization cri

terion. They proposed solutions for a special class of

query — simple guery. In [HEV79t] [APE83], they extended

these algorithns to process general distributed gueries.

Their algorithm GENERAL has a serious problem in that it may

allow simultaneous transmission of different data on the

sane link. He discuss this problem in detail later [LAK83].

Bernstein et al. [BERSIa] investigated the reducing proper

ties of semijoins and used the semijoin operation to develop

full reduction methods for distributed guery processing

[G0082]. These methods are applicable for a special class of

queries known as the tree queries. Their concern was to de

termine full reducer, net to consider their cost effective

ness. Therefore, this strategy does not minimize trans

mission costs, but rather reduces the amount of irrelevant

data. Kerschberg et al. [KER«2] derived an efficient algo

rithm for one variable queries given in disjunctive normal

form for a star shaped network.

The objective of most of distributed guery processing

strategies is to minimize use of network resources.

Transformation of the guery into disjunctive normal form and

initial parallel local processing are considered to be

15

beneficial, and are always performed. The basic

optimization problem then, is to find distributed joins,

which require least transmissions between sites. Three

different actions can he applied:

1. Redistribution of a given relation to a certain site

i, in order to perform an on site join at i.

2. Redistribution of several relations to different

sites, in order to perform on site fragments of joins

at those sites.

3. Redistribution of the attributes of certain rela

tions, in order to perform semijoins.

Most methods avoid an expensive search of the solution

space by (1) assuming a nonredundant static materialization

of all the relations referenced in the guery. In this way,

the number of network sites to be considered is generally

less than or equal to the number of relations referenced in

the guery, and by <2) exploring only a heuristically defined

subset of the solution space.

Studies [SAC82] show that evaluation of strategies

produced by heuristic algorithms perform significantly

better than those are produced by naive methods, but often

are poorer than optimal strategies.

16

Several problems remain to be solved. The principal

ones are :

1- Network topologies.

2. Initial materializations.

3. Query evaluation cost functions.

4. System load.

5. Ojective functions.

CHAPTER III

IMPROVEMENT IN H3VNEH AND YAC'S METHOD

Hevner and Yao have proposed several algorithms [HEV78,

HEV79b, SAC82, APE83 ] for distributed gueries- These

algorithms are based on an algorithm which they developed

for a class of queries, called "simple gueries." Their

algorithms can be classified into two different sets. One

is for total time minimization, and the ether is for

response time minimization. In this chapter, we show that

their algorithm for the response time version does not

necessarily produce optimal schedule, and later we present

modifications to improve the schedules produced by their

PARALLEL algorithir.

The first subdivision of this chapter describes their

distributed data base model. In the second subdivision, an

alternate algorithm is proposed for simple gueries. The

error and modifications of their GENERAL algorithm for

general gueries are described in the third subdivision.

17

18

System Model

In a distributed system environment, we can view the

network as a connected, undirected graph. Each node of the

graph represents an independent computer which has

processing and data storage capabilities, and each edge

represents a point to point [MAR81] communication line which

can carry data to the adjoining node. Each computer

contains a version of a distributed data base management

system.

The data base is viewed logically in the relational

data model. The data base is distributed over whole system

nodes in units of relations, and it can be accessed locally

or globally from any system node.

We assume that the data transmission cost between any

two nodes is a linear function of the size of the data. This

function will be denoted by T(x) <-- I • C (x) , where x is

the amount of data transmitted and I is the transmission

start up cost which is assumed to be constant. The local

processing time compared to the transmission time is assumed

negligible. Domains of various attributes are assumed to be

independent and that values are evenly distributed in every

attribute of a relation. The response time of a query

schedule is the time elapsed between the start of the first

transmission and the time when the relation arrives at the

reguired computer. The minimum response time of a relation

19

is the minimum response time among all possible schedules

for the relation.

For a distributed data base system described above, we

assume that the following information is given:

For distributed system:

N Number of nodes

H Number cf unique relations in the data base

For each relation Ri, i = 1, ,M:

ni Number cf records

ai Number of domains

si size (in bytes)

For each domain dij, j = 1,..--.,ai, of relation Ri:

Vij Number of possible domain values

Uij Number of values domain d_ij currently holds

Pij Selectivity; PiJ = Dij/Vij (0 < Pij < 1)

WiJ Size of data item in the domain

Bij Projected size of domain; Eij = Uij*Hij

Query processing involving data stored at a single node is

termed local processing. An advantage of local processing is

to reduce the amount of data that needs further processing.

To estimate effect of processing, we defined following

parameters:

n Number of relations in the remainder of the query

Fi Number cf domains in relation Hi

Gi Number cf internal join domains in relation Ri

20

It is assumed that attributes of a relation are independent

and so, if a semijoin over an attribute Dij of a relation Ri

is taken with another attribute Dkl of another relation Rk,

the effect of reduction on relation parameters is :

Si < Si • Pkl

Pij < Pij • Pkl

BiJ < Eij • Pja

In a distributed data base, it is, in general, better to do

initial local processing first beacuse it reduces the amount

of data to be transmitted. After the initial processing,

each node that has guery data will be considered to contain

only one "integrated" relation. The relations at each node

remain distinct. However, by reformatting the guery so that

each node becomes a variable, the distribution aspects of

the guery can be emphasized [APE83].

The relations to be referenced by the queries can be

stored in arbitrary computers of the network. To process a

query that requires non-local relations, we have to transmit

data from one computer to another.

One way of processing a query is to use Initial Feasi

ble Solution (IFS) strategy. The idea of IFS is to send all

involved data directly to the computer where the result

(target computer) is required. Current research [ APE83,

BER81b, KER82] in the area of distributed query processinq

21

show that computation of intermidiate results by computers

other than the target computer may be mere efficient (in

terms of total tiire and response time) .

To show the relavence of constraints of above model for

transmission strategies, we take an example. Consider a

distributed relational data base where each of the following

relations is located at a separate node and there is a link

between every two nodes. For simplicity, we assume, without

loss of generality, that Tij = Cij (s),

Suppose that the query is : Find the department number

(D*) of all departments that are located in Texas (LOC) ,

have a budget over $20,0C0 (BUD), have at least two managers

(CLASS), and have at least one computer in their equipment

inventory (TYPE). The result should be sent to node 3. The

data base state is given in Table 1.

TABLE 1

A ristributed Data Base SysteE

NODE EEIATICN VARIABLE SIZE SELECTIVITY

1 Cepartffent (D*, LCC) D 1000 1 2 Employee (E*, D*, CLASS) E 800 4/5 3 Eguipment(EQ*, D*, TYPE) EQ 100 1/10 4 Budget (BUD, D*) B 200 1/5

22

The IFS strategy will produce the following schedules :

1. send department relation to node 3.

2. send employee relation to node 3.

3. send budget relation to node 3.

4. perform query operations at node 3.

It can he easily seen that the response time of IFS

strategy for this example is 1000. Now, consider following

schedules:

step1:

send equipment relation to node 4,

perform jcin and projection.

step2:

send the result from node 4 to node 2,

perform join and projection.

s te p 3:

send the result from node 2 te node 4,

perform join and projection.

step4:

send the result from node 1 to node 3.

The response time for these schedules are:

step 1 : 100

step 2 : (1/10) • 200 = 20

step 3 : (1/10) * (1/5) * 300 = 16

step 4 : (1/10) • (1/5) • (4/5) * 1000 = 16

23

The response time of the query is 100 + 20 + 16 + 16 = 152.

This schedule, obviously, is far superior than the schedule

obtained by IFS.

From this example, we can see that before transferrinq

the larger relation directly to the result node, it may be

better to reduce its size by doing some intermediate opera

tions with some smaller relations first. It is generally

true that by bringing in more relations would avoid trans

mission of redundant data, but at the same time it is going

to increase the overhead. Arrangement of these schedules

determines the efficiency of a guery processor.

JijPl. Query

Hevner and Yao [HEV78, HSV79] presented an optimal

algorithm (PARALLEL) for simple gueries on a completely

connected network. This algorithm minimizes response time. A

simple query is defined such that after initial local

processing, each relation referenced in the query contains

only one domain - the common joining domain. The objective

of the algorithm PARALLEL is to minimize the response time

by reducing the transmission cost.

Before we describe this algorithm, let us describe a

graphing method [HEV78] that will be used to analyze the

costs of different distribution strategies. This graphing

24

method represents the timing of data movements in a

distribution strategy. Data novement is represented by a

horizontal line connecting processing symbols. The length of

the line corresponds to novement time. A query cost graph

is composed of parallel lines of movement, one for each re

quired relation. These parallel lines are viewed on a hori

zontal time axis. This allow us to recognize synchronization

between different lines of movement. For example, the cost

graph for Initial Feasible Solution (IFS) is shown in Fig

ure 1. The relations 2i,...,Rj are assumed to be at result

node originally. Therefore their line of meveraent costs are

zero.

The lines of movement to relation R_i represents the

seguential and parallel data movements that are made in

order to reduce the size of the relation Ri. before it is

moved. Sometines it is beneficial to move seme relations to

a relation sc that the relation size is reduced and the

subsequent data movements are less costly. As in Figure 2,

relation Rj and Rk are moved parallel to Ri. Relation Ri is

reduced in size and then is moved to the result node.

We define the response time for relation Ri as the time

from the start of data movement until relation Ri is

receive! at the result node. For example in Figure 2, the

response time will be t2*t3. The time t1 is not included

25

RJ: I

R2: I

Ri: I

Bj: I

Rk: 1

C(SJ)

C(S2)

C(Sk)

- 1

- I

Figure 1: Graph Cost for IFS

because it occurs in parallel with time t2. The delay time

for relation Ri is defined as the time from the start of

data movement until it reaches the relation Ri. In Fig

ure 2, the delay time is t2.

x.

26

Now, let us describe the algorithm PARALLEL. There are

two assumptions for the algorithm- First, local processing

costs are insignificant in comparison with data movement

cost. Second, after initial local processing each relation

contains only one common joining domain. Since all rela

tions are interconnected on a common joining domain, each

node is assumed to contain only one reguired relation for

the query. Also once a relation is moved to another rela

tion, then it need not be moved to the destination node-

The algorithm PARALIEL starts with the IFS and searches

for cost beneficial data moves in the current system state.

The state of the system is given by the size Si, selectivity

Pi, and line of movement cost Li, for each relation Ri. A

cost beneficial move to reduce response time is defined as a

data move to a relation Bi so that movement cost for Ri is

reduced. Relations are ordered so that after initial local

processing step, SJ < S2 <....< Sra. Algorithm PARALLEL then

uses this ordering to implement the moving cf relations of

smaller size to relations of larger size if such a move is

cost beneficial.

Algorithm PARALLEL

1. After initial local processing, order relations so

that SJ < S2 <...< Sm. Initial system state

parameter Si, Pi and 0i = C (Si) for all Ri.

27

2. For i = 1 to m repeat steps 3 through 4.

3. Find the response time Oij for all j < i;

Oij = Cj + C(Si * IfJ^^^Pk) .

4. Choose the most cost beneficial movements;

Oi = min (Ci, Oij).

He point cut below that the algorithm PARALLEL may not

produce optimal schedules [LAK83]. Cur observation is stat

ed in the following example- Let RJ and R2 be two relations

having only one attribute over a common join domain, and let

the join of Rj and R^ be reguired for a query at some loca

tion other than that of Rj and RJ. After initial process

ing, the size and selectivity values are listed in the Ta

ble 2.

TABLE 2

Example of Simple Query

Relation Size Selectivity Ej 100 0-1 B2 200 0-2

28

According to the algorithm PARALLEL, the optimal

schedule for transmission of attribute to the destination

should consist of transmission of Ej to the location R2 and

the transmission of the output of the semijcin of Rj and R2

(to be computed at location of R2) to the destination. The

cost graph is shown in Figure 3- The expected volume of the

output of the semijoin operation is 200 * 0.1 = 20. The to

tal cost for this schedule is I • C(100) • I • C(20). Now

consider the following alternate schedule.

Rj C(100) R2 C(20) Result J , ,

Response time = C(100) -i- C (2 0)

(a) Cost graph for algorithm PARALLEL

pj C(100) R2 C(10) Result , , ,

R2 C(100) Result , ,

Response time = C(100) + C(10)

(b) Cost graph for improved schedule

Figure 3: Cost Graph for An Example of A Simple Query

29

Let Rj be transmitted simultaneously, to the location

of R2 and to the destination, and let R2 be transmitted to

the destination at the same time until the transmission of

Rj ends. For this example, first half of R2 would be trans

mitted before the transmission of Rj ends- The join of Rj

and the remaining half of R2 is computed at the location of

R2, and it is then transmitted to the destination. A join of

Rj with the first half of R2 is computed at the destination

computer.

The response time of this schedule is I -»• c(100) • I *

C(100*0-1) = 21 + C(110). This schedule, therefore, outper

forms the schedule produced by PARALLEL. Complete response

time (the transmission and processing time together) may now

be computed. The cost function of computing join of two at

tributes of size X and y may be denoted by K * Pro(x*y).

The function is justified because the two attributes may not

be sorted. The response time of the schedule obtained by

their algorithm is I • C(100) + K • Pro(200*100) + I +

C(200*0.1) = 21 + C(120) • K + Pro(2000). The response time

of the alternate schedule is I • C(100) + K • Pro (100* 100) -•

I + C(100*0.1) • K + Pro(100*100) = 21 + C(110) • 2K •

Pro(2000). If C(10) > K, the response time of the alternate

schedule is better. The reason is that the algorithm

PARALLEL does not use the hardware fully for parallel

transmission.

\ :

30

general 2uer^

In [APE83], Apers et al. extended simple guery

algorithms for general gueries. The algorithms (GENERAL,

RESPONSE) assume the same network properties and

transmission cost function as given before for the algorithm

PARALLEL. A query is called GENERAL, if there are some

relations which may have any number of common joining and

output domains. Furthermore, a node in the network may

contain any number of required relations. The joining

domains within each relation are assumed to be independent-

Thus a selectivity reduction on one domain does not affect

the selectivity of the other joining domains.

The central idea in these algorithms is to try to

reduce the relation size as much as possible by using

semijoin operations and then transfer the reduced relations

to the destination node. Using the indenpendence of

domains, they consider that a general guery involving

several domains can be partitioned into several simple

queries, each one with an undefined result node. The

algorithm PARALLEL can be applied to each subquery

separately. The schedules for each simple guery are then

integrated into a complete guery processing strategy. They

showed that under the hypothesis of attribute independence

within each referenced relation, the algorithm GENERAI and

RESPONSE will produce schedules for optimal response time.

31

First, let us demonstrate these algcrithras by an

example. The steps of the algorithms GENERAL (step 1, step

2, setp 5) and RESPONSE (step 3, step 4) are shown below.

Step 1: Do all initial local processing: All the opera

tions that can be done viithout transferring data

should be first executed locally-

Let the data base state of the example, after initial local

processing, be given in Table 3. There are two relations Ej

and R2, each of them has two common joining attributes, bij

and bi2 with selectivity pij and pi2, respectively-

TABLE 3

Example of General Queries

1

Realtion Ri

RI R2

Size Si

1000 2000

Domain 1 biJ Pil

400 400

0.4 0.4

Domain 2 bi_2 pi2

100 0.2 450 0-9

Step 2: Generate candidate schedules: For each of the

joining attributes, consider the simple query

described by it. Apply the alqorithra PARALLEL

to each simple query. Save all candidates for

integration in step (3).

32

By applying the algorithm PARALLEL to both bjj and b2j

separately, produce candidate schedules which are shown in

Figure 4(a). The candidate schedules for bj2 and b22 are

shown in Figure 4 (b) .

tjl 42 0 bjj: I-- 1

b21 420 blJ: I-- 1

(a) Candidate schedules for bjj and b2j,

tj2 120 bj2: I I

bj2 120 b2 2 110 b22: I I I

(b) Candidate schedules for b 12 and bJJ

Figure 4: Candidate Schedules for General Query Example

Step 3: Candidate schedule ordering: For each relation

Ei, order the candidate schedules on joining

attribute bij, j = 1,2,.--,g in ascending order

of arrival time- Let ARTk denote the arrival

time of schedule CSCHk-

33

For relation Ej, the schedules of attributes that can be

applied are ordered on their arrival time in the node where

Rj is located, and this result is given in Table 4. The

schedules for relation E2 are shown in Table 5.

TABLE 4

Candidate Schedule for El

Attribute hk

b22 b21

Arrival Time ARTk

330 42 0

TABLE 5

Candidate Schedule for R2

Attribute

b12 til

Arrival Time ARTk

120 420

34

Step 4: Schedule integration: For each candidate sche

dule CSCHk, in ascending order, construct an

integrated schedule for Ri that consists of

parallel transmission of CSCHk and all CSCH£,

£ < k. Select the integrated schedule of minimum

response time.

In step (4) , for each of these attributes bk, an integrated

schedule for relation Ri is constructed which consists of

parallel transirission of all attributes having arrival time

less than or equal to ARTk. This construction for relation

Rj is shown in Figure 5. Figure 6 shows the schedule ccn-

struction for relation R2. In Figure 5, there are two sche

dules for Rj. Along with the Initial Feasible Solution for

Rj, the schedule with the minimun response is chosen. This

is the second schedule, and its response time is 800. The

first schedule in Figure 6 is chosen for relation R2, and

its response time is 540.

Step 5: Remove schedule redundancies: Eliminate schedules

of relations which have already been transmitted

in schedules cf other relations.

The final guery processing strategy for this example is

shown in Figure 7.

The algorithm RESPONSE has two steps (step 3 and step 4

in our example). Step 1 computes the optimal schedules for

35

bj2 120 b22 110 Rj b22: I I I

920 , , I

Response time = C(100) + C (0.2*450) + C (0.9*1000) = 120 + 11C + 920 = 1150

b12 I —

120 b22 110 Rj I

380 b21 I 1

b2j 1 —

420

Response time = C (400) -•• C (0. 9*0.4*1000) = 420 + 380 = 800

Figure 5: Cost Graph for Rj

each joining attributes individually and integrates these

schedules. Step 2 determines the attributes that may te

transmitted using these schedules to locations of data base

to reduce volume of data. A problem arises when some rela

tions share two or more joining attributes. In this case,

these common attributes can not be transmitted concurrently

on any single communication link which is shared by the two

Therefore, the schedules computed individually locations. inerfci-ui- #

for joining attributes can not be executed in parallel. The

step 2 does not consider this problem. As a result, the

b12: bj2 I —

120 R2 • I -

420

Response time = C(100) • C (0.2*2000) = 120 -»• 420 = 540

b12 120 R2 ,11

b11: bjj 420

180

Response time = C(400) + C (0. 2*0. 4*2000) = 420 + 180 = 600

Figure 6: Cost Graph for R2

36

Rl:

bj2 I —

120 b22 j —

110 R1

b2j I -

420

380

R2:

tj2 120 R2 ^ 4 20

Figure 7: Fi^^l Cost Graph for General Query

37

schedules which RESPONSE produces may try to transfer more

than one common attributes over a single communication link

concurrently. Figure 7 shows this problem graphically. The

schedules of two attributes shared between Rj and R2 have

concurrent transmission of b22 and b2j (both from location

E2) to the location of relation Rj. It concludes that the

algorithm RESPCNSE will not produce optimal response time

schedules for general gueries.

Now we will synthesize guality of schedules which are

produced by the algorithm RESPONSE. If an attribute can not

be fragmented, then the schedules produced by the algorithm

PARALLEL for the joining attributes have minimum response

time [HEV79a]. If each relation schedule is of the minimum

response time, then the total guery processing strategy is

also of minimum response time. These two statements follow

from the attribute independence of relations. Because of

the optimality of algorithm PARALLEL, the candidate sche

dules used in RESPONSE are also minimum response time sche

dules for each joining attribute. Algorithm RESPONSE puts

these candidate schedules in ascending order of arrival

time and it only considers integrated schedules for

relation Ri that consist of the parallel transmission of

joining attributes with arrival time less than or egual to

the arrival time of a certain CSCBk. If we consider all

38

possible attributes which can reduce the relation Ri, then

the schedules which give the minimal response time is opti

mal. Algorithm RESPONSE considers all possible attributes

by bringing them in parallel if their arrival time is less

than or equal to the current attribute been considered. By

bringing attributes in parallel to a relation, it implies

that these attributes can be transmitted concurrently. In

order to make sure that these attributes can be transfered

concurrently, the schedules which PARALLEL produced can not

have any conflicting transmission. If the relation that has

more than one common joining attributes (as shown in Fig

ure 4 and Table 4), it may have a conflicting transmission.

It concludes that the algorithm RESPONSE is applicable only

for relations such that no two relations have more than one

joining attributes which join together.

CHAPTES IV

MINI-MAX ALGCRITHM

Introduction

Every query optimization strategy is defined over a

simplified model of a real distributed data base. Even with

very simple cost functions, as described in previous

chapters, many sutproblems in distributed guery processing

are known to he NP-hard [HEV79b, YLC82, YU83]. As a result,

most optimization algorithms try to find an efficient

solution, and therefore, are either heuristics, or are

generalized algorithms tailored for special types of queries

suitable for specific network configurations.

To simplify the problem further, in [APE83, HEV79b,

YA079] the authors assume that processing time of guery

operations can be ignored. They concentrate on transmission

cost only. correctness of this assumption essentially

depends upon the environment of the network which connects

hiqh speed processors. But this is not the case in some

types of distributed architecture for which there are high

bandwidth channels which connect relatively slow processing

unit, such as local area network.

39

40

If the processing cost of guery operations is also

included in the cost model (this assumption seems to be more

realistic), the problem becomes much more complicated and

difficult. To emphasize on this point, consider cost qraph

of IFS in Figure 8, (the transmission time is represented by

dash line and the processing time is represented by *) . It

is clear that the transmission time of the query is limited

by transmission time of the largest size of relation (R4 in

this case), and the processing time of the guery is the time

of computation of some binary operation on relations (Rj,

R2, Rj[, R£ in this example) which are sent to the the desti

nation. Now, if we can reduce the processing time at the

destination, we can reduce the response time of the query.

In order to reduce the processing time at the destination,

we can try to combine some relations (as shown in Figure 3)

provided that the cost of these combined relations do not

exceed the maximum transmission time. The principal idea is

to combine relations as many as possible, process subgueries

at locations other than the destination to achieve parallel

ism.

Now the problem becomes how to determine the relations

h'ch can be combined profitably. This problem is more

plex than the traditional Bin-Packing problem (an NP-Hard

blem) which states that given a list of real numbers in

w

CO

41

RI : I E2 : I

Rj : 1 R4 : 1

:ec:(c:tc*i)c | 4 > « * 4 e « | *4***1^^

Figure 8: Cost Graph Including Processing Cost

the range (0,1), place these numbers in a mininum number of

"bins" so that no bin holds numbers summing to more than 1.

In our study, if the bin size is defined by the maximum

transmission time and then tho number of bins are the nuiber

of relations which can be combined.

Another severe problem is to estimate the size of rela

tions [YU82]. Any static guery processing strategy needs to

estimate size cf intermediate relations which are results of

the execution of the subgueries. Existing distributed query

processing algorithms estimate the size of an intermediate

relation by assuming that the values of attribute in differ-

T, + innc: are among the same range- This assumption may ent reiaxj-"** "^

be true in practice. As a result, without knowing the

1 r^i-r^ of the intermediate relations, the errors in the actual si^c ^

estimation result are accumulated.

42

Experiments on the effects of estimating the size of

relations have been conducted by Epstein et al. fSAC82].

They used Distributed INGRES [EPS78] as a tool to run over

two sets of relations and a large number of gueries. They

showed that typical difference in performance is around ten

to forty percent. However, a later experiment en different

relations showed a difference of performance of almost seven

hundred percent.

It is veil recognized that deriving an algorithm to

find efficient distribution strategies for combining rela

tions is very difficult problem- In this chapter. We first

describe a somewhat simplified query environment. An algo

rithm (Mini-Max) is then presented to find a distribution

strategy for this environment-

Mini- Jax Algorithm

Basically^ the system assumptions are the same as

described in Chapter III. For simplicity, we restrict the

query to simple query, though extension of our ideas to

general gueries is not very difficult. We define

transmission time Ti of relation Ri as the time from start

of the schedule until the relation Ri is received at the

destination. It is clear that several distribution

<?trategies may produce the same transmission time, with

different response time. For example, both of the cost

43

graphs in Figure 9 show feasible data movements for a query

requiring data from four remote relations. Eoth cost graphs

have egual transmission time although the distribution stra

tegies and response time differ- The transmission time is

t4+t»1 for both graphs.

First cost graph:

RI: R4 t4 Rj tM , , I

R2: E2 t2 1 I

R3: R3 t3 , ,

Second cost graph:

R1"

,r 1 1

^-'' E2 t2 B3 f 3 ,r 1 1

Figure 9: Alternate Cost Graph Schedule

x

44

We now propose algorithm Mini-Max to find a

distribution strategy for response time reduction- Since it

is difficult to find an optimal strategy which can combine

several relations, we consider schedules which combine a re

lation with just one another relation. Our aim is to find

as many combined schedules as possible such that the pro

cessing time at the destination is reduced and at the same

time these schedules do not increase minimum transmission

time schedule. It should be noted that the minimum trans

mission time schedule defined here is actually the maximum

transmission time schedule among all the schedules we pro

duced. The reason we called it "the minimum transmission

time schedule" is that this schedule provides the minimum

transmission time among all the possible schedules-

We start with IFS and search for cost beneficial data

movement in the current system state. The state of the sys

tem is given by the size Si, selectivity Pi, and trans

mission time Ti for each relation Ei. A cost beneficial move

to reduce the response time is defined as one that moves

some relations to another relation Ri. so that the trans

mission time Ti of Ri is either reduced or effect of this

move does not increase Ti beyond transmission time of

current system state. Considering the example in Figure 9,

though, both cost graphs have t4+t»1 transmission time, the

45

second one suggests that relation R2 can be combined with

relation R3 without increasing its transmission time

(t2+t»3) beyond the transmission time Ti (t4<'t»1) of the

complete schedule. The second schedule has tetter response

time because that there are two combined relations (Rj and

E2) instead of three (Rj, R2, and Ej) for processing at the

destination.

In order to combined as many relations as we can, the

algorithm is executed in several phases (recall that we al

low a relation to be combined with just one another relation

at any time). Relations are ordered by their sizes for each

phase- Algoritnm Mini-Max, then, uses this relation ordering

to implement a tactic of pairing relation of the smallest

size with the relation of larger size if such a move is cost

beneficial. The algorithm searches for cost beneficial data

transmission by trying to join a small relation to a large

relation. Relations are checked in the order of decresing

size. Each time a schedule (for the largest relation) is

produced which either is a combined schedule of two rela

tions or it is a schedule for which joining of another rela

tion is not profitable. After schedule is produced, the

algorithm checks the rest of the relations which have not

been tested for inclusion in the schedules yet using the

me method. It is shown (in next section) that algorithm

46

Mini-Max produces the maximum number of pa ir s of r e l a t i o n s

in e v e r y phase and a t the same time these s c h e d u l e s have the

minimum t r a n s m i s s i o n t ime .

Algori thm Mini-Max i s presented i n the f o l l o w i n g :

1 . ( I n i t i a l i z a t i o n . )

Index r e l a t i o n s so t h a t S j < S2 < < Sn

Ti <— C (Si) (* i n i t i a l i z e t ransmiss ion c o s t *)

Max_Trans <— 0 (* minimum t r a n s m i s s i o n c o s t *)

Sch <— empty (* schedule pool *)

Next_Phase <— t r u e

Buffer,T€mp_Buffer <— s e t of ordered r e l a t i o n s

2 . (Repeat u n t i l no fur ther reduct ion i s p r o f i t a b l e . )

While Next_Phase do s t ep 2 to s t e p 6

Buffer <— Temp_buffer

Temp_Buffer <— empty

Next_Pha£e <— f a l s e

3 . (Are a l l the r e l a t i o n s checked?)

While (Buffer # empty) do s t e p 3 t o s t e p 5

Pick up a pair of r e l a t i o n Ri and Rj from Q such t h a t

Bi i s t h e s m a l l e s t s i z e of r e l a t i o n and Rj i s the

l a r g e s t s i z e of r e l a t i o n in Buf fer .

4 (Construct a s c h e d u l e and check the t ransmis s ion

c o s t . )

T ' j = Ti + Pro (Si • Sj) + C(Pi • Sj)

Case

47

4. 1 ( i ) T»j < Tj

Append s c h e d u l e (Rj, Rj) to Sch

Buffer <— Buffer - (B i , Rj}

Sj <— Pi * Sj

l i <— C(Sj)

Temp^Buffer <— Temp_Buffer + (Rj)

Next_Phase <— true

( i i ) T ' j > Tj

Case

4 . 2 (a) T»j < Max_Trans

Append schedule (F i , Rj) to Sch

Buffer < - - Buffer - {Ri, Ej]

S j <— Pi * S j

Tj <— C(Sj)

Temp_3uffer <— Temp_Buffer ••• (Rj)

4^3 (b) T ' j > Max_Trans

Append schedule (Rj) t o Sch

Buffer < - - Buffer - {Rj}

Temp_Buffer <— Temp_Buffer + {Rj}

5 . (Determine the maximum transmis s ion t ime . )

If Max_Trans < T ' j then Max_Trans = T' j

6 . ( R e i n i t i a l i z e for the next phase.)

Arrange the r e l a t i o n s in Buffer by the i r s i z e

Euffer = TeiEp_Buffer

48

A simple query example is presented to illustrate use

of alqorithra Mini-Max strategy. Let us assunie a guery such

that four required relations are located at four different

nodes. After ini t ial processing, the size and selectivity

value are shown in Table 6.

TABLE 6

Data Base State for Algorithm Mini-Max

Relation RI: R2: R3: E4:

Size 100 300 800 1000

Selectivity 0.1 0.3 0-8 1.0

Assume that the result node is separate from the nodes

which store the given relations. Let the transmission cost

function be C (x) = 10 • x, and let the processing cost

compared to the transmission cost be of ratio one to four.

The schedule for initial feasible solution (IFS) is shown in

Figure 10.

Algorithm Mini-Max attempt to reduce the transmission time

by finding the pairs in the order of R4, R3, R2, Rj. B4

transmission time reduction:

Transmit Rj to R4

49

110 Rj: , ,

310 R2: I ,

810 E3: I J

1010 B4: I ,

response time = C (1000)+Pro (100*300)+Pro (0. 1*300*800) + Pro (0.1*0.3*800*1000)

= 1010 • 3 + 2.4 + 1.6 = 1017

Figure 10: Cost Graph for IFS in Algorithm Mini-Max

T ' 4 = C(10C) + Pro (10C+1000) + C (0 .1 *1000 )

= 110 + 10 + 110

= 230

Since 230 < 10 10=T4, the transmission of Rj to R4 is inte

grated into the strategy.

Transmit R2 to R3:

T'3 = C(300) + Pro(300*800) + C(0.3*800)

= 310 + 24 • 250

= 584

Since 584 < €10 = T3, the transttission of R2 to E3 i s

in tegrated i n t o the s t r a t e g y . After the f i r s t phase, the

c o s t graph i s shown in Figure 11.

50

Rj 110 R4 10 110 I j * * * i i i i * I

R2 310 R3 24 250

Figure 11: Cost Graph After First Phase

Rj and R2 are not needed in the second phase because these

relations have been combined. The size of R4 and R3 after

the first phase is 100 and 240, respectively. Since the size

of R4 is smaller than R3, in the second phase, the algorithm

will try to ccnibined R4 to R3-

Transmit R4 to H3:

T'3 = C(10C) • Pro(10C*240) + 0 (0 .1*240)

= 110 + 2 . 4 + 34

= 146 .4

S i n c e 146.4 < 250 , the t r a n s m i s s i o n of R4 to R3 i s i n t e g r a t

ed i n t o t h e s t r a t e g y . The f i n a l c o s t graph i s shown i n F ig

ure 12 .

^

51

RI R4 J I * * * * l 1

R2 R3 I

response time = C(300) • Pro (300*800) •»• C(0-1*1000) +Pro( 100*0. 3*800) -t-C (0.1*0.3*800)

= 310 • 24 + 110 + 2-4 • 34 = 480.4

Figure 12: Final Cost Graph for Algorithm Mini-Max

Correctness of the Algorithm

In this section, we prove that the algorithm aini-Max,

for each phase, finds a distribution strategy which obtains

the maximum nunter of pairs of relations without increasing

the minimum transmission time of the query. Recall that the

query being simple performs only one join on any relation.

Hence, after a relation is combined with another, the

relation is not referenced again. The basis of the proof is

given in the following lemma. «e denote by Tij the cost of

combining (a pair) Rj into relation Rj (i.e., Tij = C(Si) .

pro (Si • Sj) * C(Pi * Sj)). ^e denote by Tk the schedule

for relation Ek if it is not combined with any other

relation.

> i l 111 I I I .

52

LEMMA^- For s i m p l e guery . If Si < Sm < Sk < S j , then

max ( T i j , Tmk) < max (Tile, Tmj) < max (Tim, Tkj) .

Because Si < Sm t h e r e f o r e T i j < Tmj (which f o l l o w s from

t h e d e f i n i t i o n of t r a n s m i s s i o n time) - S i m i l a r l y , Sk < Sj

i m p l i e s t h a t Tmk < Tmj. Hence, max ( T i j , Trajs) < Tmj <

max(Tik, Tmj) . i t means t h a t Tik < Tjcj and Tmj < Tkj . I t

p r o v e s t h a t max (Tik, Tmj) < Tkj < max(Tim, Tkj ) .

The lemma given above s t a t e s tha t the minimum t r a n s

m i s s i o n time s c h e d u l e for any two pa ir s cf combined r e l a

t i o n s i s e i t h e r the schedule of pa ir ing the s m a l l e s t s i z e of

r e l a t i o n with the l a r g e s t s i z e r e l a t i o n or the schedule by

p a i r i n g the second s m a l l e s t s i z e of r e l a t i o n with the second

l a r g e s t s i z e of r e l a t i o n . Further, i t s a y s that i f T i j i s

t h e minimum t r a n s m i s s i o n time s c h e d u l e , then a l l r e l a t i o n s

wi th s i z e g r e a t e r than S j should have been paired with r e l a

t i o n s with s i z e s i a l l e r than S i - Al l r e l a t i o n s of s i z e be

tween Si and Sj can not pair with the r e l a t i o n s which are of

s i z e s m a l l e r than S i .

THEOREM_j. For a s imple guery, the algorithm Mini-Max

d e r i v e s the minimuB t r a n s m i s s i o n time s c h e d u l e s and maximum

number o f p a i r s of combined s c h e d u l e s for each phase.

Proof: We index the r e l a t i o n s so t h a t S j < S2 < . . . . <

s n . F i r s t , we prove that the a lgor i thm Mini-Max w i l l d e r i v e

t h e minimum t r a n s m i s s i o n time s c h e d u l e . Two c a s e s are

^

53

s t u d i e d h e r e , t h a t the minimum transmiss ion time schedule

produced by a l g o r i t h m Mini-Max i s given by e i t h e r T i j or Tj-«

Let Q be the set of schedules produced by algorithm Mini-

Max. Suppose that Tij is the minimum transmission time

schedule in Q, and let it were not a minimum transmission

time schedule- Let Q» be the set of schedules which con

tains the minimum transmission time schedule- It is easy to

see that all the schedules in Q' are less than Tij. Since

Tij < Tj, there must exist a schedule Tmj in Q» where Rm #

Ri. The size of Rm can not exceed cr egual to the size of

Ri. If it is not the case, then by definition, Tij < Tmj.

It contradicts the assumption that C» contains the minimum

transmission time schedule. So the size of Rjp should be

smaller than Si. Since the algorithm Mini-Max always choos

es the smallest relation to pair it with the largest rela

tion first, the number of relations Rk where k > j are egual

to the number of relations Rn where n < i. If Tmj exist

then there exist at least one relation R^ in Q', y > j,

which can not te paired with Rn, for n < i. Since Tij < Tj

< TV Bl should be paired with one of the relations Rx, for

X > n. From Lemma 1, max (Tij, Tmy) < max (Txj, Tmjf) <

max(Tmj, Tx^). Again, it leads to a contradiction of the

assumption that all the schedules in Q' are less than Tij.

Therefore Tij must be the minimum transmission time

54

s c h e d u l e . S i m i l a r l y , we prove for the oth^r case that Tj

g i v e s minimum r e s p o n s e t i m e . The proof i s i d e n t i c a l and

t h e r e f o r e i s c t t i t t e d .

We now prove t h a t the a lgor i thm Mini-Max w i l l d e r i v e

t h e maximum number o f p a i r s of combined s c h e d u l e s . Again,

l e t Q t e t h e s e t of s c h e d u l e s genertaed by algorithm Mini-

Max. Suppose t h a t Q were not the s e t of the maximum pairs

of combined s c h e d u l e . Let Q' be another s e t of s c h e d u l e s .

F u r t h e r , l e t the number of pa ired r e l a t i o n s in Q' be g r e a t e r

than t h e number of pa i red r e l a t i o n s in Q. Let Kk be the

s m a l l e s t s i z e of r e l a t i o n which does not have a combined

s c h e d u l e produced by the a lgor i thm Mini-Max, but Q» p a i r s i t

wi th some o t h e r r e l a t i o n . If i t i s not the c a s e , then the

number of pa ired r e l a t i o n s in Q' can not be mere than Q.

S i n c e t h e a lgor i thm Mini-Max s e l e c t s the s m a l l e s t r e l a t i o n

f i r s t t h e r e , the s c h e d u l e s Tjm, T2m-1 , . . - - . Tk-1m-k'«-2 e x i s t ,

and Tkj i s g r e a t e r than the minimum transmis s ion t ime , where

k < j < m-k4 2 . In order t o f ind a r e l a t i o n to pair with Bk,

t h e on ly c h o i c e i s t o pair Rk with Ri, where j < i < k.

Otherwise Q* does not conta in the minimum t r a n s m i s s i o n time

s c h e d u l e . Suppose Ri i s the l a r g e s t p o s s i b l e s i z e of

r e l a t i o n which can be paired with Rk. Then, there e x i s t s a

r e l a t i o n Rw in Q» , where m-k-f2 < w < m, which o r i g i n a l l y

p a i r s with Rx in Q i s l e f t a l o n e , where j ^ x < k- S ince Sk

x:

55

^ Sw, Rw can not be paired with Rj, where k < j< nz^jj,

hence the number of combined schedules is not more in Q*.

It contradicts the assumption that the numbers of combined

schedule are mere in Q'. Therefore algorithm Mini-Max must

produce the maximum number of paired relations-

Cora lejtjt of the Algorithm

He assume that the algorithm is computed at a location

which is a 'master* of all locations in the system. The

master has current information of the data base sysyera

(e.g., relations, their sizes, locations, network

connections and load on each line, etc-) . The master, after

computing an efficient scheduling, dispatches instructions

(e.g., subguery operations, transmission, etc.) to

appropriate locations.

Assume that the data structure Buffer in the algorithm

Mini-Max is a array of records. Each record stores various

informations about relation Ri, namely, the size Si, its

current schedule and transmission time Ti-

In step 4.1 cr step 4.3 of the algorithm Mini-Max, the

relation Ri is paired with Rj. How Ri, after have teen

chosen for transmission to Rj, is no longer exists for

consideration in the rest of guery processing strategy.

However, for the sake of efficiency, the record of updated

56

Rj is stored in the record of n , and the record of Hj is

marked 'dead'. Thus, at the beginning of a phase, first few

of records in Euffer are those which were obtained by pair

ing in the previous phase. We call these records of class

M. The relation Pi in step 3 belongs to M. Some other re

cords in Buffer are those which were marked dead in previous

phases, step 4 determines best pairs for relations of class

M with relations which are not dead. A relation of class M

may be paired with another in M. Now, we estimate com

plexity of various steps of the algotithm.

The array Puffer is sorted according to the size of re

lations. Determination of the largest active relation Sj

for Si which satisfies step 4.1 or step 4.3 is very much

like binary search in a sorted array which has multiple en

tries. Hence, for each phase, step 4 regaires 0(JM| Lg (n))

time. For step 6, it is sufficient to sort and arrange only

the relations of class H. Hence, each phase reguires

0(1M| lg(l?5l)) time for step 6.

Let Mi denote the class M for the Ith phase. He now es

timate mi = IMjI- In the Ith phase, mi relations are com

bined with another mi relations at most. Hence, number of

relations that remain in the (1 + 1) th phase is n - .E.mi.

Further, a relation of class Mi+I is combined with an active

elation which belongs to either Mi - Mi+j or belongs to the

57

r e s t o f a c t i v e r e l a t i o n s which number i s n - mj - -

i a i ' l - 2mi. Hence, mi»1 < ( 1 / 2 ) * (n - i mk) < k=l

(1 /2 ) • ( n - i*mi) . Fur ther , roi+1 < mi, which i m p l i e s mi-t-1 <

min(mi, (1 /2 ) *(n - i*mi) ) . i t can be shown, by i n d u c t i o n ,

t h a t mi < n / ( i + 1 ) , o t h e r w i s e , the above mentioned i n e g u a l i t y

i s n o t t r u e . l e t Lg denote the binary logor i thm.

The c o m p l e t e complex i ty of the algorithm i s n n

OCiS l^ i Lg (n) • i S i O i l g ("i) ) n n

= C{Lg(n) ^S^(D/i) • i i i ( n / i ) L g ( n / i ) )

= 0 ( n Lg(n) ^ l i ( V i J )

= 0 (n Lg2n) .

THEOREM.. 2 . The a lgor i thm Kini-Max d e r i v e s minimum

t r a n s m i s s i o n s c h e d u l e s in time 0 (n Lg^n) .

I t i s a p p r o p r i a t e here t o r e c a l l that the t ime com

p l e x i t y of the a lgor i thm PARAILEL i s 0 ( n 2 ) .

CHAPTER V

CONCLUSION

In previous chapters of this thesis, we have described

the general basis for processing distributed gueries and we

have also presented an algorithm for processing queries in

distributed data base system. in this chapter, we discuss

the relationships between our algorithm and other works in

this area. We also suggest some topics for future research.

Merits of Algprithj Mini-Max

The first conprehensive algorithm for distributed guery

processing was developed by Wong [WCN77J, and was later

implemented in SEC-1. Wong's algorithm translates a guery

into a sequence of two kind of operations: (1) move a

subrelation from one site to another, and (2) process a

query operation on locally available data. Each "move"

command is improved recursively by a lower cost sequences of

"move" and "process" commands. The algorithm terminates when

no "move" command can be replaced by a lower cost sequence.

The final SDE-1 strategy (algorithm OPT) [BERSIb] is

similiar to the previous one. They introduced the concept of

semijoin to abstract the main optimization problem.

Basically, algorithm OPT is an example of serial greedy

58

X

59

o p t i m i z a t i o n t e c h n i g u e ; i t always seeks to maximize

immediate g a i n , i g n o r i n g the f a c t that the e x e c u t i o n of a

s e m i j o i n o f t e n d e c r e a s e s the c o s t and i n c r e a s e s the b e n e f i t

of o t h e r s e m i j o i n s . As a r e s u l t , the o p t i m i z a t i o n proce

dures generated by OPT are sequences of e f f i c i e n t s e g u e n t i a l

programs ( i . e . , only one move and s i n g l e j c i n operat ion w i l l

be e x e c u t e d a t any t ime in the network).

Our a l g o r i t h m not only c o n s i d e r s a l l s e m i j o i n s that

c o u l d maximize immediate g a i n s tu t a l s o c o n s i d e r s maximizing

t h e number of o p e r a t i o n s which could be executed in p a r a l l e l

a t s e v e r a l s i t e s . This concept improves the e f f e c t i v e n e s s of

t h e a lgor i thm (1) by reducing the response time of the

query , and (2) by reducing the process ing overhead a t the

d e s t i n a t i o n .

Most a l g o r i t h m s for d i s t r i b u t e d query proces s ing to

d a t e a r e "open loop" and can not respond to e r r o r s in c o s t

e s t i m a t i o n . To c l o s e the l o o p , we must be able t o suspend

e x e c u t i o n of the s u b g u e r i e s and c o n s t r u c t a new s t r a t e g y

t h a t u t i l i z e s (1) p a r t i a l r e s u l t s computed so far and (2)

t h e c o s t i n f o r m a t i o n obta ined by the p a r t i a l computat ion.

T h i s s u g g e s t s t h a t an a d a p t i v e guery process ing s t r a t e g y

s h o u l d be e n f o r c e d .

Our a l g o r i t h m i s presented towards t h i s d i r e c t i o n . We

d i v i d e t h e guery o p t i m i z a t i o n procedures i n t o s t a g e s - In

60

each stage, the query optimizer chooses some subplans

involving join of two nonlocal relations Ri and Rj, trans

mission of Fi froir site i to site j, according to the pres

ent stage's ccroputation result. Thus, the errors in the es

timation will not accumulate. Therefore we can keep the

estimitation errors relatively low.

In our algorithm, a relation is transmitted to only one

destination, but several relations may be transmitted to

different locations concurrently. This involves much less

overhead as opposed to if several relations are to be trans

mitted, each to several other locations, simultaneously.

Yao's model assumes that a relation sent to several loca

tions reaches there at the same tine. Most networks do not

connect every two locations by a direct link. As a result,

whichever method, virtual or datagram, is chosen for trans

mission, the overhead on different paths would vary, and

thus this assumption remains of theoretical interest.

Another major advantage of our algorithm is that we do

not ignore the total cost of the query processing while de-

* 1 a strategy that emphasizes parallelism and thus re

duces response time cost of the guery. As there are so many

^^^<^ involved in the calculation of total time cost, parameters xiiwo.

difficult to establish any kind of relationship

en the total time cost and the response time cost. The

61

RESPONSE algorithm of [APE83] minimizes the response time

only and thus increases total ti.nie significantly- We have

carried out certain simulation [LAK84] te compare these two

costs for the algorithm Mini-Max and the RESPONSE.

iiliJiation for Study cf Total and I^sponse Time

Table 7 contains costs for the two algorithms

(algorithm 5ESE0NSE and algorithm Mini-Max) for various sets

of data bases. He assume that the data base contains ten

different relations of arbitary sizes but no relation may

contain more than one thousand distinct attribute values-

These files are generated using the VAX-11/780 uniform

distribution random number generator.

The processing time of guery operation is ignored in

our simulation so that the optimality cf response time

derived by the algorithm RESPONSE can be compared. As

indicated in the table that the ratio RT is always greater

than one. The ratios for the total time comparision are

rvinq. It should be noted that the total time cost is

. hiqher than the response time cost. So, the ratio TT is

. widely from RT. If we look at the distribution of

file size in Table 7, there is a tendency for the

iqorithm RESPCNSE to perform much more poorly than the

62

algorithm Mini-Max if the sizes of the different relations

vary greatly.

1

233 147 117 220 92

173 138 234 99 150 132

TT:

RT:

TABLE 7

Comparison of Total Time and Response Time

3 67 200 143 310 150 245 145 295 193 199 140

385 309 175 358 163 322 230 293 278 205 168

3S4 463 191 394 205 326 260 309 326 242 231

Size 5

400 466 253 448 318 335 304 490 394 311 251

477 525 275 449 331 376 393 513 430 363 393

500 529 315 486 332 403 481 595 474 464 4 40

8

509 550 405 563 560 469 489 601 489 611 444

564 591 427 601 569 478 489 621 507 625 477

10

578 621 596 629 603 560 490 6 33 634 652 608

Ratio TT RT

12.31 2.23 1.99 6.93 2.27 7.91 5.44 3.01 2.50 4. 11 3.55

1.59 2.72 1.95 1.88 2.81 2.00 2.43 1.93 3.45 2.12 2.37

total time of RESPONSE divided by total time of Mini-Max response time of Mini-Max divided by response tine of RESPONSE

63

l iJ ture Research and Su^gest icj i s

When v iewed from a r e a l i s t i c point of v iew, the query

environment a s sumpt ions in t h i s t h e s i s appear g u i t e

r e s t r i c t i v e . The s i m p l i c a t i o n i s required in order that the

a l g o r i t h m Mini-Max may t e s t a t e d and understood i n a

c o n c e p t u a l l y s i m p l e manner. The algorithm Mini-Max can be

e a s i l y a p p l i e d i n g e n e r a l guery environment with minor

m o d i f i c a t i o n s . For i n s t a n c e , we can apply algori thm

Mini-Max on each common jo in domain as descr ibed in [APE83]

by assuming a t t r i b u t e i n d e p e n i e n t , or rep lace the greedy

s t r a t e g y by a l g o r i t h m Mini-Max in [BERSIb]-

We have s u g g e s t e d t h a t our algorithm can be appl ied on

a d a p t i v e c o n t r o l p r o c e s s . In order to measure the g u a l i t y

of t h e d i s t r i b u t i o n s t r a t e g i e s derived by the a lgor i thm

Mini-Max more p r e c i s e l y , i t should be t e s t e d for dynamic

c o s t e s t i m a t i o n . The dynamic environment should be modeled

t h a t e x t e n d s and adapts the a l g o r i t h i Mini-Max. The modeling

approach should a l s o i n c l u d e deve loping methods to guant i fy

t h e overhead in a adapt ive environment as opposed to s t a t i c

env ironment .

X

BIBLIOGRAPHY

APE83- — Peter M. G. Apers, Alan R- Hevner and S. Einq Yao, Q£timjzatjon Algorithms for Distributed Queries, IEEE Trans, on Software Engineering, Vol. SE-9, No. 1, January 1983, pp. 57-68.

BAB79 . -- E. Babb, Implementing a Relational Database by 15§.lis of J^ecialized Hardware, ACM Trans. on Database System, Vol. 47 No. l7 March 1979, pp. 1-29.

BERSIa. — Philip A. Bernstein and Dah-Ming W. Chiu, JUsing Semi-Joins to Solve Relational Queries, Journal of ACM, Vol. 28, No. 1, January 1981, pp. 25-40.

BERSIb. — P. Eernstein, N. Goodman, E. Wong, C. Reeve and J. Eothnie, Querry Processing in a System for Distributed Dat abases(SED-j), ACM Trans, on Database System, Vol. 6, No. 4, Deceiber 1981, pp. 602-625.

CHD75. — Wesley H. Chu and E. E. Nahouraii, File Directory Design Considerations for Distributed lata Bases, Proceedings, International Conference on Very Large Data Base, October 1975.

CHU80. — Wesley W. Chu and Paul Hurley, Optimal Query Processing for Distributed Database Systems, IEEE Trans. oirComputeri,~Vol. C-31,"NO. S, September 1982, pp. 835-850.

DAT81. — C. J. Date, An Introduction to Database Systems, Addison-Wesley Publishing Company, Inc., 1981, pp. 181-234.

T?NS78 — Philip H. Enslow, What is a "Distributed" Data " Processing System?, Computer , January 1973, pp. 13-21.

64

65

EPS78. - - R. E p s t e i n , M. Stonebraker and E. Wong, d i s t r i b u t e d ^uery P r o c e s s i n g in a R e l a t i c n a l Data Base System, i n P r o c . SIGMCD, May 1978, pp. 169-180 .

G0082. Sim

- - Nathan Goodman and Oded Shmueli, Tree Q u e r i e s : A P i^ £ i a s s of R e l a t i o n a l Q u e r i e s , ACM Trans- on Data

base Sys tem, Vol . 7 , No. 4 , December 1982, pp. 653-667-

HEV78. — Alan R. Hevner and S. Bing Yao, £i|ery Process ing on a D i s t r i b u t e d Database , Proceedings Berkeley Workshop, 1978 .

HEV79a. - - Alan R. Hevner, The Optimization qg Query ^ r o -c e s s i n q in E i s t r i b u t e d Database Systems, Ph.D. T h e s i s , Purdue U n i v e r s i t y , L a f a y e t t e , Indiana , 1979.

HEV79b. — Alan R. Hevner and S. Bing Yao, Query Process inq / i B D i s t r i b u t e d Database Sys tems , IEEE Trans, on Software "^ Engineer ingT Vol. SE-5 , No. 3 , May 1979, pp. 177-187.

KERS2. — Larry Kerschberg, Peter D. Ting and S. Bing Yao, Query O e t i m i z a t i o n in Star Computer Networks, ACM Trans. on Database Sys t ems , Vol. 7 , No. 4 , December 1982, pp. 6 7 8 - 7 1 1 .

LAK83. — G. C. Lakhani and J. S. i ang , A Cgminent on O^ti-miza^-ion Algorithms for D i s t r i b u t e d Q u e r i e s , To appear i n lEEETransT o n ' s o f t w a r e Engineer ing , July 1984.

LAK84 — G- ^' Lakhani and J. S. Wang, P a r a l l e l i z i n g Query O o e r a t i o n s in D i s ^ i b u t e d Database, in preparat ion for i ^ m i i i l ^ to ACM SIGMOD, 1984.

po^ jaires Martin, Computer Networks and D i s t r i b u t e d P r o c e s s i n g : Sof tware , Technigue, A r c h i t e c t u r e . , p J 5 n t I c e - H a l l , " l n c . , Englewood C l i f f , N . J . , 1981, pp. 1 -64 .

Mrr78 " COCASYL Systems Committee, Distributed Data Jase Technology/ Proceedings, National Computer Conference, ^9787

66

0ZK77. — E.A. Ozkarahan, S.A. Schuster and K-C. S e v c i k , Pe iJo i f iance E v a l u a t i o n of a R e l a t i o n a l A s s o c i a t i v e Proc e s s o r , ACM T r a n s , on Database System, Vol. 2 , No. 2 , June 1977, pp. 1 7 5 - 1 9 5 .

RAM79. — C. V. Ramamoorthy and Benjamin R- Wah, Data Man-/ agei ient i n E i s t r i b u t e d Database , AFACP, 1979," pp.

667 -6 80 . "

ROT80. — J . E o t h n i e , P. E e r n s t e i n , S- Fox, N. Goodman, M. Hammer, T. Landers , C. Reeve, D. Shipman and E. Wong, I n -t r o d u c t i o n to a System f o r D i s t r i b u t e d Databases (S^D-j) , ACM Trans . on Database Sys tems , Vol. 5 , No, 1, March 1980 , pp . 1 -17 .

SAC82. — Giovanni Maria Sacco and S. Bing Yao, Query Cpt i -^ m i z a t i o n in D j s t r i b u t e d Data Base Systems, Advances in

ComputersT VolT"21,"'T982,''pp. 225-273 .

SCH7S. — H. E. Schwetman, Hybrjd Simulat ion Models of Ccm-£] i i €r S y s t g g , Cemmun. ACM, Vol- 21, No- 9, September 1978 , pp . 7 l S - 7 2 3 .

SMA79- Dana L. Small and Wesley W. Chu, A D i s t r i b u t e d Data Base A r c h i t e c t u r e for Data Process ing in a Dynamic i n v i r o n r a i n t , in Proc- COMPCON, March 1979, pp. 123-127.

e„-r7c j . M , Smith and P.Y. Chang, Opt iBiz inq the Per-i^^m^n^P of a R e l a t i o n a l Algebra Database I n t e r f a c e , Com-i S f i f l , V o l . - l S r i i i r c h 1975, pp. 568 -579 .

^ J , Smith, P. B e r n s t e i n , U. Dayal , N. Goodman, T. T nders K. Lin and E. Wong, Mult ibase - I n t e g r a t i n g Heterogeneous D i s t r i b u t e d Database Systems, AFACP, 1981, p i 7 ' 4 8 7 - 4 9 9 .

, o i - - James R. Swager, A r c h i t e c t u r e of a D i s t r i b u t e d ^^ n ^abase Information Resource , Nat ional Computer

f f f f ^ n c e r i S e l T ' p p T 501-5057

T 67

W0N77. - - Eugene Wong and Computer Corporation of America, l ^ t r i e y j n g Ci spersed Data form SDD-1: A System for D i s -t r i b u t e d D a t a b a s e s , Berkeley Workshop D i s t r i b u t e d Data Management Comput. Networks, 1977, pp. 217-235.

YA079. — S. Bing Yao, Optimizat ion of Query Evaluat ion A l g o r i t h m s , ACM Trans , on Database Systems, Vol. 4 , lie, 27 1979, pp . 133-155 .

YLC82. — C.T. Yu, K. Lam, C.C. Chang and S.K. Chang, A Promis ing Approach t o D i s t r i b u t e d Query P r o c e s s i n q , Berke ley Conference on D i s t r i b u t e d Data Ease, February 1982, pp. 3 n - 3 9 0 .

YU82. — C.T. Yu and Y.C. Lin, Some Est imation Problems i n D i s t r i b u t e d Query P r o c e s s j n j . Proceedings , The 3rd I n t e r n a t i o n a l Conference on D i s t r i b u t e d Computing Systems, '^ t o b e r 1982 , pp. 1 3 - 1 9 .

Oc-

YU83. — C.T. Yu and C.C. Chang, On the Uesi^S of A Query Process ing . S t r a t e g y in A Dis tr ibuted Database Enyiron-i e n t 7 ACM SIGMOC 8 3 , Vol. 13, No. 4, May 1983, pp. 3 0 - 3 9 .