Sl02 2x2 (1)


Distributed Database Systems, Fall 2012

Distributed Database Design (SL02)

- Design problem
- Design strategies (top-down, bottom-up)
- Fragmentation (horizontal, vertical)
- Allocation and replication of fragments, optimality, heuristics

DDBS12, SL02 1/60 M. Böhlen

Design Problem

- Design problem of distributed systems: making decisions about the placement of data and programs across the sites of a computer network, as well as possibly designing the network itself.
- In a DDBMS, the distribution of applications involves
  - distribution of the DDBMS software
  - distribution of the applications that run on the database
- The distribution of applications is not considered in the following; instead, the distribution of data is studied.

Framework of Distribution

- Dimensions for the analysis of distributed systems:
  - Level of sharing: no sharing, data sharing, data + program sharing
  - Behavior of access patterns: static, dynamic
  - Level of knowledge on access pattern behavior: no information, partial information, complete information
- Distributed database design should be considered within this general framework.

Design Strategies/1

- Top-down approach
  - Designing systems from scratch
  - Homogeneous systems
- Bottom-up approach
  - The databases already exist at a number of sites
  - The databases should be connected to solve common tasks

Design Strategies/2

- Top-down design strategy [figure not reproduced]

Design Strategies/3

- Distribution design is the central part of the design in DDBMSs (the other tasks are similar to those in traditional databases).
- Objective: design the LCSs (local conceptual schemas) by distributing the data over the sites.
- Two main aspects have to be designed carefully:
  - Fragmentation: a relation may be divided into a number of sub-relations, which are then distributed
  - Allocation and replication: each fragment is stored at one or more sites with an "optimal" distribution
- Distribution design issues
  - Why fragment at all?
  - How to fragment?
  - How much to fragment?
  - How to test correctness?
  - How to allocate?

Design Strategies/4

- Bottom-up design strategy [figure not reproduced]

Fragmentation/1

- What is a reasonable unit of distribution: a relation or a fragment of a relation?
- Relations as unit of distribution:
  - If the relation is not replicated, we get a high volume of remote data accesses.
  - If the relation is replicated, we get unnecessary replicas, which cause problems when executing updates and waste disk space.
  - Might be an acceptable solution if queries need all the data in the relation and the data stays only at the sites that use it.
- Fragments of relations as unit of distribution:
  - Application views are usually subsets of relations.
  - Thus, the locality of accesses of applications is defined on subsets of relations.
  - Permits a number of transactions to execute concurrently, since they access different portions of a relation.
  - Permits parallel execution of a single query (intra-query concurrency).
  - However, semantic data control (especially integrity enforcement) is more difficult.
- ⇒ Fragments of relations are (usually) the appropriate unit of distribution.

Fragmentation/2

- Fragmentation aims to improve:
  - Reliability
  - Performance
  - Balanced storage capacity and costs
  - Communication costs
  - Security
- The following information is used to decide on a fragmentation:
  - Quantitative information: cardinality of relations, frequency of queries, site where a query is run, selectivity of the queries, etc.
  - Qualitative information: predicates in queries, types of access to data (read/write), etc.

Fragmentation/3

- Types of fragmentation
  - Horizontal: partitions a relation along its tuples
  - Vertical: partitions a relation along its attributes
  - Mixed/hybrid: a combination of horizontal and vertical fragmentation

[Figure: (a) horizontal fragmentation, (b) vertical fragmentation, (c) mixed fragmentation]

Fragmentation/4

- Example of a database instance [figure not reproduced]

Fragmentation/5

- Example (contd.): Horizontal fragmentation of the PROJ relation
  - PROJ1: projects with budgets less than 200'000
  - PROJ2: projects with budgets greater than or equal to 200'000

Fragmentation/6

- Example (contd.): Vertical fragmentation of the PROJ relation
  - PROJ1: information about project budgets
  - PROJ2: information about project names and locations

Correctness Rules of Fragmentation

- Completeness
  - A decomposition of relation R into fragments R1, R2, ..., Rn is complete iff each data item in R can also be found in some Ri.
- Reconstruction
  - If relation R is decomposed into fragments R1, R2, ..., Rn, then there should exist some relational operator ∇ that reconstructs R from its fragments, i.e., R = R1 ∇ ... ∇ Rn
  - Union to combine horizontal fragments
  - Join to combine vertical fragments
- Disjointness
  - If relation R is decomposed into fragments R1, R2, ..., Rn and data item di appears in fragment Rj, then di should not appear in any other fragment Rk, k ≠ j (exception: the primary key attribute for vertical fragmentation)
  - For horizontal fragmentation, a data item is a tuple
  - For vertical fragmentation, a data item is an attribute

Idea of Horizontal Fragmentation

- Intuition behind horizontal fragmentation
  - Every site should hold all the information that is used by the queries issued at that site
  - The information at the site should be fragmented so that the queries of the site run faster
- Horizontal fragmentation is defined as a selection operation, σ_p(R)
- Example:
  σ_{BUDGET<200K}(PROJ)
  σ_{BUDGET≥200K}(PROJ)
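The selection-based definition and the three correctness rules can be sketched in a few lines of Python. This is only an illustrative sketch: the PROJ tuples are the ones shown in the example tables of these slides, while the helper names `select` and `check_horizontal` are made up for this illustration.

```python
# PROJ instance as it appears in the example slides: (PNO, PNAME, BUDGET, LOC).
PROJ = [
    ("P1", "Instrument.", 150000, "Montreal"),
    ("P2", "DB Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),
    ("P4", "Maintenance", 310000, "Paris"),
]

BUDGET = 2  # position of the BUDGET attribute inside a tuple


def select(relation, predicate):
    """Selection sigma_p(R): keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def check_horizontal(relation, fragments):
    """Return (complete, reconstructible, disjoint) for horizontal fragments."""
    union = [t for fragment in fragments for t in fragment]
    complete = all(t in union for t in relation)   # every tuple is in some fragment
    reconstructible = set(union) == set(relation)  # the union gives back R
    disjoint = len(union) == len(set(union))       # no tuple occurs in two fragments
    return complete, reconstructible, disjoint


proj1 = select(PROJ, lambda t: t[BUDGET] < 200_000)
proj2 = select(PROJ, lambda t: t[BUDGET] >= 200_000)
print([t[0] for t in proj1], [t[0] for t in proj2])
print(check_horizontal(PROJ, [proj1, proj2]))
```

Because the two predicates are complementary, every tuple lands in exactly one fragment, which is why all three checks succeed here.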

Information Requirements/1

Database information:

- Links between relations (a link models a 1:N relationship between relations that are related to each other by an equality join) [figure "Links (one to many)" not reproduced]
- Cardinality of relations: card(R)

Information Requirements/2

Application information:

- Simple predicates used in queries
  - Relation R[A1, A2, ..., An]
  - Ai θ vi is a simple predicate (θ ∈ {=, <, ≤, >, ≥, ≠}, vi ∈ Di, where Di is the domain of Ai)
  - For relation R we define Pr = {p1, p2, ..., pm}
  - Example: PNAME = "Maintenance", BUDGET < 200000
- Minterm predicates
  - Minterm predicates are conjunctions of simple predicates, each taken in its natural or negated form
  - Relation R, Pr = {p1, p2, ..., pm}
  - M = {m1, m2, ..., mz} such that

    $$M = \{\, m_i \mid m_i = \bigwedge_{p_j \in Pr} p_j^{*} \,\}, \quad 1 \le j \le m,\ 1 \le i \le z$$

    where $p_j^{*} = p_j$ or $p_j^{*} = \neg p_j$
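To make the definition concrete, the following sketch enumerates all minterms over a set of simple predicates by choosing, for each predicate, either its natural or its negated form. The representation of predicates as (label, function) pairs and the helper names `minterms`, `satisfies` and `show` are assumptions of this illustration, not part of the slides.

```python
from itertools import product


def minterms(simple_predicates):
    """Enumerate all 2^m minterms over m simple predicates.

    Each simple predicate is a (label, function) pair. A minterm is a list of
    (label, function, positive) triples; positive=False means the negated form.
    """
    result = []
    for signs in product([True, False], repeat=len(simple_predicates)):
        result.append([(label, fn, positive)
                       for (label, fn), positive in zip(simple_predicates, signs)])
    return result


def satisfies(row, minterm):
    """A tuple satisfies a minterm iff every literal evaluates as required."""
    return all(fn(row) == positive for _, fn, positive in minterm)


def show(minterm):
    """Render a minterm as readable text."""
    return " AND ".join(label if positive else f"NOT({label})"
                        for label, _, positive in minterm)


# The two simple predicates from the example above, over dict-shaped tuples.
Pr = [
    ('PNAME = "Maintenance"', lambda t: t["PNAME"] == "Maintenance"),
    ("BUDGET < 200000", lambda t: t["BUDGET"] < 200000),
]
M = minterms(Pr)
for m in M:
    print(show(m))
```

Every tuple satisfies exactly one minterm, which is what makes minterms suitable as fragment-defining predicates.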

Information Requirements/3

Application information (contd.):

- Minterm selectivities
  - The number of tuples of the relation that would be accessed by a user query that is specified according to a given minterm predicate mi.
- Access frequencies
  - The frequency with which a user application qi accesses data.

Horizontal Fragmentation/1

We write Pr′ to denote the complete and minimal set of simple predicates generated from Pr:

- The set of predicates is complete if and only if any two tuples in the same fragment are referenced with the same probability by any application
- The set of predicates is minimal if and only if each simple predicate is relevant for determining the fragmentation and for each pair of fragments there is at least one query that accesses the fragments differently

Horizontal Fragmentation/2

A horizontal fragment Rj of relation R consists of all the tuples of R that satisfy a predicate Fj:

  Rj = σ_{Fj}(R), 1 ≤ j ≤ w

where Fj is a selection formula, which is (preferably) a minterm predicate.

Computing the horizontal fragmentation:

1. Determine Pr (80/20 rule: concentrate on the most active 20% of the queries, which typically account for about 80% of the data accesses)
2. Compute Pr′
3. Determine M
4. Minimize M (eliminating impossible minterms)

Horizontal Fragmentation/3

- Example: Fragmentation of the PROJ relation
  - Consider the following query: find the name and budget of projects given their location.
  - The query is issued at all three locations
  - Fragmentation based on LOC, using the set of predicates {LOC = 'Montreal', LOC = 'New York', LOC = 'Paris'}

PROJ1 = σ_{LOC='Montreal'}(PROJ)

  PNO  PNAME        BUDGET  LOC
  P1   Instrument.  150000  Montreal

PROJ2 = σ_{LOC='New York'}(PROJ)

  PNO  PNAME        BUDGET  LOC
  P2   DB Develop.  135000  New York
  P3   CAD/CAM      250000  New York

PROJ3 = σ_{LOC='Paris'}(PROJ)

  PNO  PNAME        BUDGET  LOC
  P4   Maintenance  310000  Paris

Horizontal Fragmentation/4

- If access is only according to the location, the above set of predicates is complete
  - i.e., in each fragment PROJi each tuple has the same probability of being accessed
- If there is a second query/application that accesses only those project tuples where the budget is less than $200K, the set of predicates is not complete.
  - P2 in PROJ2 then has a higher probability of being accessed than P3, because only P2 has a budget below $200K

Horizontal Fragmentation/5

- Example (contd.):
  - Add BUDGET < 200K and BUDGET ≥ 200K to the set of predicates to make it complete.
    ⇒ {LOC = 'Montreal', LOC = 'New York', LOC = 'Paris', BUDGET ≥ 200K, BUDGET < 200K} is a complete set
  - Minterms to fragment the relation are given as follows:

    (LOC = 'Montreal') ∧ (BUDGET < 200K)
    (LOC = 'Montreal') ∧ (BUDGET ≥ 200K)
    (LOC = 'New York') ∧ (BUDGET < 200K)
    (LOC = 'New York') ∧ (BUDGET ≥ 200K)
    (LOC = 'Paris') ∧ (BUDGET < 200K)
    (LOC = 'Paris') ∧ (BUDGET ≥ 200K)

Horizontal Fragmentation/6

- Example (contd.): Now, each PROJi, i = 1, 2, 3, is split into two fragments

PROJ1 = σ_{LOC='Montreal' ∧ BUDGET<200K}(PROJ)

  PNO  PNAME        BUDGET  LOC
  P1   Instrument.  150000  Montreal

PROJ2 = σ_{LOC='New York' ∧ BUDGET<200K}(PROJ)

  PNO  PNAME        BUDGET  LOC
  P2   DB Develop.  135000  New York

PROJ3 = σ_{LOC='Paris' ∧ BUDGET≥200K}(PROJ)

  PNO  PNAME        BUDGET  LOC
  P4   Maintenance  310000  Paris

PROJ2′ = σ_{LOC='New York' ∧ BUDGET≥200K}(PROJ)

  PNO  PNAME        BUDGET  LOC
  P3   CAD/CAM      250000  New York

- Note that the following fragments are empty:
  - σ_{LOC='Paris' ∧ BUDGET<200K}(PROJ)
  - σ_{LOC='Montreal' ∧ BUDGET≥200K}(PROJ)
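The example can be replayed with a short script that evaluates the six location/budget minterms on the PROJ tuples and reports which fragments stay empty. It is a sketch under the assumption that only minterms choosing exactly one location and one budget range need to be enumerated (all other combinations are contradictory); the dictionary keys are labels invented for this illustration.

```python
from itertools import product

# PROJ instance from the example: (PNO, PNAME, BUDGET, LOC).
PROJ = [
    ("P1", "Instrument.", 150000, "Montreal"),
    ("P2", "DB Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),
    ("P4", "Maintenance", 310000, "Paris"),
]

LOCATIONS = ["Montreal", "New York", "Paris"]
BUDGET_RANGES = {
    "BUDGET<200K": lambda b: b < 200_000,
    "BUDGET>=200K": lambda b: b >= 200_000,
}

# One candidate fragment per minterm (location AND budget range).
fragments = {}
for loc, (label, in_range) in product(LOCATIONS, BUDGET_RANGES.items()):
    fragments[(loc, label)] = [t for t in PROJ if t[3] == loc and in_range(t[2])]

non_empty = {key: rows for key, rows in fragments.items() if rows}
empty = sorted(key for key, rows in fragments.items() if not rows)

for key, rows in non_empty.items():
    print(key, [t[0] for t in rows])
print("empty:", empty)
```

Whether a minterm is dropped because it is logically impossible or merely because the current instance has no matching tuples is a design decision; this sketch only detects the latter.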

Derived Horizontal Fragmentation/1

- Defined on a member (target) relation of a link according to a selection operation specified on its owner (source).
- Each link is an equijoin.
- Fragments of the member relation can be defined with a semijoin on fragments of the owner relation.

Derived Horizontal Fragmentation/2

- Given a link L where owner(L) = S and member(L) = R, the derived horizontal fragments of R are defined as

  Ri = R ⋉ Si, 1 ≤ i ≤ w

  where w is the maximum number of fragments that will be defined on R and

  Si = σ_{Fi}(S)

  where Fi is the formula according to which the primary horizontal fragment Si is defined.

Derived Horizontal Fragmentation/3

- Given link L1 where
  - owner(L1) = PAY and
  - member(L1) = EMP
- and
  - PAY1 = σ_{SAL≤30000}(PAY)
  - PAY2 = σ_{SAL>30000}(PAY)
- we get
  - EMP1 = EMP ⋉ PAY1
  - EMP2 = EMP ⋉ PAY2
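A semijoin-based derivation of EMP1 and EMP2 might look as follows. The slides do not show the PAY and EMP instances or name the join attribute, so the rows below and the use of a shared TITLE attribute are hypothetical, chosen only to make the sketch runnable.

```python
# Hypothetical instances (not from the slides):
# PAY(TITLE, SAL) is the owner, EMP(ENO, ENAME, TITLE) the member of link L1.
PAY = [
    ("Engineer", 40000),
    ("Analyst", 34000),
    ("Technician", 27000),
    ("Programmer", 24000),
]
EMP = [
    ("E1", "Ann", "Engineer"),
    ("E2", "Bob", "Analyst"),
    ("E3", "Cem", "Technician"),
    ("E4", "Dee", "Programmer"),
    ("E5", "Eli", "Analyst"),
]


def semijoin(member, owner, member_key, owner_key):
    """Member tuples that join with at least one owner tuple on the key."""
    owner_values = {o[owner_key] for o in owner}
    return [m for m in member if m[member_key] in owner_values]


# Primary horizontal fragments of the owner relation.
PAY1 = [p for p in PAY if p[1] <= 30000]
PAY2 = [p for p in PAY if p[1] > 30000]

# Derived horizontal fragments of the member relation.
EMP1 = semijoin(EMP, PAY1, member_key=2, owner_key=0)
EMP2 = semijoin(EMP, PAY2, member_key=2, owner_key=0)
print([e[0] for e in EMP1], [e[0] for e in EMP2])
```

Since every employee title occurs in exactly one PAY fragment in this toy data, the derived fragments are complete and disjoint; with dangling or multiply matching titles that would not be guaranteed.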

Vertical Fragmentation/1

- The objective of vertical fragmentation is to partition a relation into a set of smaller relations so that many of the applications will run on only one fragment.
- Vertical fragmentation of a relation R produces fragments R1, R2, ..., each of which contains a subset of R's attributes.
- Vertical fragmentation is defined using the projection operation of the relational algebra:

  π_{A1,A2,...,An}(R)

- Example:

  PROJ1 = π_{PNO,BUDGET}(PROJ)
  PROJ2 = π_{PNO,PNAME,LOC}(PROJ)

- Vertical fragmentation has also been studied for (centralized) DBMSs
  - Smaller relations, and hence fewer page accesses
  - e.g., MONET system

Vertical Fragmentation/2

- Vertical fragmentation is more complicated than horizontal fragmentation
- In horizontal partitioning: for n simple predicates, the number of possible minterms is 2^n; some of them can be ruled out by existing implications/constraints.
- In vertical partitioning: for m non-primary-key attributes, the number of possible fragmentations is equal to B(m) (= the m-th Bell number), i.e., the number of partitions of a set with m members.
- B(m) grows very quickly with m and is bounded above by m^m (e.g., B(15) ≈ 10^9)
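The growth of the search space can be checked numerically. The sketch below computes Bell numbers with the Bell triangle, a standard recurrence that is not mentioned on the slides; the function name `bell` is chosen for this illustration.

```python
def bell(m):
    """Return the m-th Bell number B(m) using the Bell triangle."""
    row = [1]                       # first row of the triangle, B(0) = 1
    for _ in range(m):
        nxt = [row[-1]]             # a row starts with the last entry of the previous row
        for value in row:
            nxt.append(nxt[-1] + value)
        row = nxt
    return row[0]


for m in (3, 4, 10, 15):
    print(m, bell(m), 2 ** m)       # partitions of m attributes vs. minterms of m predicates
```

For the three non-key attributes of PROJ this gives only B(3) = 5 candidate fragmentations, but the count exceeds one billion at m = 15, which motivates the heuristics on the following slides.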

Vertical Fragmentation/3

- Two types of heuristics for vertical fragmentation exist:
  - Grouping: assign each attribute to one fragment and, at each step, join some of the fragments until some criterion is satisfied.
    - Bottom-up approach
  - Splitting: start with a relation and decide on beneficial partitionings based on the access behaviour of applications to the attributes.
    - Top-down approach
    - Results in non-overlapping fragments
    - The optimal solution is probably closer to the full relation than to a set of small relations with only one attribute

Vertical Fragmentation/4

- Application information: the major information required as input for vertical fragmentation is related to applications (queries)
- Since vertical fragmentation places in one fragment those attributes that are usually accessed together, there is a need for some measure that defines more precisely the notion of "togetherness", i.e., how closely related the attributes are.
- This information is obtained from queries and collected in the Attribute Usage Matrix and the Attribute Affinity Matrix.

Vertical Fragmentation/5

- Given are the user queries/applications Q = (q1, ..., qq) that will run on relation R(A1, ..., An)
- Attribute Usage Matrix: denotes which query uses which attribute:

  $$\mathrm{use}(q_i, A_j) = \begin{cases} 1 & \text{iff } q_i \text{ uses } A_j \\ 0 & \text{otherwise} \end{cases}$$

- The use(qi, •) vectors for each application are easy to define if the designer knows the applications that will run on the DB (consider also the 80-20 rule)

Vertical Fragmentation/6

- Example: Consider relation PROJ(PNO, PNAME, BUDGET, LOC) and queries:

  q1 = SELECT BUDGET FROM PROJ WHERE PNO = Value
  q2 = SELECT PNAME, BUDGET FROM PROJ
  q3 = SELECT PNAME FROM PROJ WHERE LOC = Value
  q4 = SELECT SUM(BUDGET) FROM PROJ WHERE LOC = Value

- Abbreviations: A1 = PNO, A2 = PNAME, A3 = BUDGET, A4 = LOC
- Attribute Usage Matrix [figure not reproduced]
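Since the matrix itself did not survive extraction, it can be derived from the four queries by recording which attributes each query mentions in its SELECT or WHERE clause. The attribute sets below were read off the query texts by hand; the names `QUERY_ATTRIBUTES` and `usage_matrix` are assumptions of this sketch.

```python
ATTRIBUTES = ["PNO", "PNAME", "BUDGET", "LOC"]      # A1, A2, A3, A4

# Attributes referenced by each query (SELECT list plus WHERE clause).
QUERY_ATTRIBUTES = {
    "q1": {"BUDGET", "PNO"},
    "q2": {"PNAME", "BUDGET"},
    "q3": {"PNAME", "LOC"},
    "q4": {"BUDGET", "LOC"},
}


def usage_matrix(query_attributes, attributes):
    """use(q_i, A_j) = 1 iff query q_i uses attribute A_j, else 0."""
    return [[1 if a in used else 0 for a in attributes]
            for used in query_attributes.values()]


USE = usage_matrix(QUERY_ATTRIBUTES, ATTRIBUTES)
for name, row in zip(QUERY_ATTRIBUTES, USE):
    print(name, row)
```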

Vertical Fragmentation/7

- Attribute Affinity Matrix: denotes how frequently two attributes Ai and Aj are accessed together with respect to a set of queries Q = (q1, ..., qq):

  $$\mathrm{aff}(A_i, A_j) = \sum_{k:\ \mathrm{use}(q_k, A_i) = 1 \,\wedge\, \mathrm{use}(q_k, A_j) = 1} \ \sum_{\text{sites } l} \mathrm{ref}_l(q_k) \cdot \mathrm{acc}_l(q_k)$$

  where
  - ref_l(qk) is the cost (= number of accesses to (Ai, Aj)) of query qk at site l
  - acc_l(qk) is the frequency of query qk at site l

Vertical Fragmentation/8

- Example (contd.): Let the cost of each query be ref_l(qk) = 1, and let the frequencies acc_l(qk) of the queries be as follows:

        Site 1          Site 2          Site 3
  q1    acc1(q1) = 15   acc2(q1) = 20   acc3(q1) = 10
  q2    acc1(q2) = 5    acc2(q2) = 0    acc3(q2) = 0
  q3    acc1(q3) = 25   acc2(q3) = 25   acc3(q3) = 25
  q4    acc1(q4) = 3    acc2(q4) = 0    acc3(q4) = 0

- Attribute affinity matrix aff(Ai, Aj) (the AA matrix also used in the BEA example):

        A1   A2   A3   A4
  A1    45    0   45    0
  A2     0   80    5   75
  A3    45    5   53    3
  A4     0   75    3   78

- e.g.,

  $$\mathrm{aff}(A_1, A_3) = \sum_{k=1}^{1} \sum_{l=1}^{3} \mathrm{acc}_l(q_k) = \mathrm{acc}_1(q_1) + \mathrm{acc}_2(q_1) + \mathrm{acc}_3(q_1) = 45$$

  (q1 is the only query to access both A1 and A3)
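The whole affinity matrix follows mechanically from the usage matrix and the access frequencies. The sketch below assumes the usage matrix implied by the four example queries and ref = 1 for every query and site, as stated in the example; `affinity_matrix` is a name chosen for this illustration.

```python
# Usage matrix implied by q1..q4 over (A1=PNO, A2=PNAME, A3=BUDGET, A4=LOC).
USE = [
    [1, 0, 1, 0],   # q1
    [0, 1, 1, 0],   # q2
    [0, 1, 0, 1],   # q3
    [0, 0, 1, 1],   # q4
]

# acc_l(q_k): access frequency of query q_k at sites 1..3 (from the example).
ACC = [
    [15, 20, 10],   # q1
    [5, 0, 0],      # q2
    [25, 25, 25],   # q3
    [3, 0, 0],      # q4
]


def affinity_matrix(use, acc, ref=1):
    """aff(Ai, Aj): summed cost of all queries that use both Ai and Aj."""
    n = len(use[0])
    aa = [[0] * n for _ in range(n)]
    for k, row in enumerate(use):
        cost = sum(ref * frequency for frequency in acc[k])   # sum over all sites
        for i in range(n):
            for j in range(n):
                if row[i] and row[j]:
                    aa[i][j] += cost
    return aa


AA = affinity_matrix(USE, ACC)
for row in AA:
    print(row)
```

The diagonal entry aff(Ai, Ai) is simply the total cost of all queries that touch Ai.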

Vertical Fragmentation/9

- Idea: Take the attribute affinity matrix (AA) and reorganize the attribute order to form clusters in which the attributes demonstrate high affinity to one another.
- The bond energy algorithm (BEA) has been suggested for that purpose for several reasons:
  - It is designed specifically to determine groups of similar items as opposed to a linear ordering of the items.
  - The final groupings are insensitive to the order in which items are presented.
  - The computation time is reasonable (O(n^2), where n is the number of attributes)

Vertical Fragmentation/10

- BEA:
  - Input: AA matrix
  - Output: Clustered AA matrix (CA)
  - The permutation is done in such a way as to maximize the following global affinity measure (affinity of Ai and Aj with their neighbors):

$$AM = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{aff}(A_i, A_j)\,\bigl[\mathrm{aff}(A_i, A_{j-1}) + \mathrm{aff}(A_i, A_{j+1}) + \mathrm{aff}(A_{i-1}, A_j) + \mathrm{aff}(A_{i+1}, A_j)\bigr]$$
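The measure can be evaluated for any attribute ordering to see how much a permutation helps. The sketch below uses the AA matrix of the running example and assumes that neighbors outside the matrix (index 0 or n+1) contribute an affinity of 0, which the slide does not state explicitly; `global_affinity` is a name made up here.

```python
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def global_affinity(aa, order):
    """AM for the matrix obtained by permuting rows and columns by `order`."""
    n = len(order)

    def at(r, c):
        # Entries outside the matrix are treated as 0 (boundary assumption).
        if 0 <= r < n and 0 <= c < n:
            return aa[order[r]][order[c]]
        return 0

    return sum(
        at(i, j) * (at(i, j - 1) + at(i, j + 1) + at(i - 1, j) + at(i + 1, j))
        for i in range(n)
        for j in range(n)
    )


print(global_affinity(AA, [0, 1, 2, 3]))   # original order A1 A2 A3 A4
print(global_affinity(AA, [0, 2, 1, 3]))   # clustered order A1 A3 A2 A4
```

The clustered order scores roughly nine times higher than the original order, which is the effect the permutation is meant to achieve.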

Vertical Fragmentation/11

- Example (contd.): Cluster Affinity Matrix CA after running BEA [matrix not reproduced here; the same CA appears under BEA Algorithm/5]
  - Elements with similar values are grouped together, and two clusters can be identified
  - An additional partitioning algorithm is needed to identify the clusters in CA
  - Usually there are more clusters and more than one candidate partitioning, thus additional steps are needed to select the best clustering.
- The resulting fragmentation after partitioning (PNO is added to PROJ2 explicitly as key):

  PROJ1 = {PNO, BUDGET}
  PROJ2 = {PNO, PNAME, LOC}

BEA Algorithm/1

- Input: The AA matrix
- Output: Clustered affinity matrix CA, which is a permutation of AA
- Initialization: Place and fix one of the columns of AA in CA (we choose column 1 in our example)
- Iteration: Place the remaining n-i columns in the remaining i+1 positions in the CA matrix. For each column choose the placement that makes the largest contribution to the global affinity measure.
- Row order: Order the rows according to the column ordering.

BEA Algorithm/2

In order to determine the best placement we define the contribution of a placement.

- Contribution of placing Ak between Ai and Aj:

  $$\mathrm{cont}(A_i, A_k, A_j) = 2\,\mathrm{bond}(A_i, A_k) + 2\,\mathrm{bond}(A_k, A_j) - 2\,\mathrm{bond}(A_i, A_j)$$

- The bond between a pair of attributes is defined as:

  $$\mathrm{bond}(A_x, A_y) = \sum_{z=1}^{n} \mathrm{aff}(A_z, A_x) \cdot \mathrm{aff}(A_z, A_y)$$

BEA Algorithm/3

Consider the following AA matrix and the corresponding CA matrix where A1 and A2 have been placed.

AA =
        A1   A2   A3   A4
  A1    45    0   45    0
  A2     0   80    5   75
  A3    45    5   53    3
  A4     0   75    3   78

CA =
        A1   A2
  A1    45    0
  A2     0   80
  A3    45    5
  A4     0   75

bond(A3, A1) = 45 ∗ 45 + 5 ∗ 0 + 53 ∗ 45 + 3 ∗ 0 = 4410
bond(A3, A2) = 45 ∗ 0 + 5 ∗ 80 + 53 ∗ 5 + 3 ∗ 75 = 890

The next step is to place A3.

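BEA Algorithm/4

- Ordering (0-3-1), where A0 denotes the empty position to the left of A1, so all bonds with A0 are 0:

  cont(A0, A3, A1)
  = 2 ∗ bond(A0, A3) + 2 ∗ bond(A3, A1) − 2 ∗ bond(A0, A1)
  = 2 ∗ 0 + 2 ∗ 4410 − 2 ∗ 0 = 8820

- Ordering (1-3-2):

  cont(A1, A3, A2)
  = 2 ∗ bond(A1, A3) + 2 ∗ bond(A3, A2) − 2 ∗ bond(A1, A2)
  = 2 ∗ 4410 + 2 ∗ 890 − 2 ∗ 225 = 10150

- Ordering (2-3-4), where A4 has not been placed yet, so its bonds count as 0:

  cont(A2, A3, A4)
  = 2 ∗ bond(A2, A3) + 2 ∗ bond(A3, A4) − 2 ∗ bond(A2, A4)
  = 2 ∗ 890 + 2 ∗ 0 − 2 ∗ 0 = 1780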
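The three contributions can be recomputed directly from the AA matrix. In this sketch attributes are 0-based column indices (A1 = 0, ..., A4 = 3) and `None` stands for an empty or not yet placed neighbor whose bonds are 0; both conventions are choices of this illustration.

```python
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def bond(aa, x, y):
    """bond(Ax, Ay) = sum over z of aff(Az, Ax) * aff(Az, Ay); 0 at a boundary."""
    if x is None or y is None:
        return 0
    return sum(row[x] * row[y] for row in aa)


def cont(aa, left, new, right):
    """Contribution of placing column `new` between `left` and `right`."""
    return (2 * bond(aa, left, new)
            + 2 * bond(aa, new, right)
            - 2 * bond(aa, left, right))


A1, A2, A3 = 0, 1, 2
print(cont(AA, None, A3, A1))   # ordering (0-3-1)
print(cont(AA, A1, A3, A2))     # ordering (1-3-2)
print(cont(AA, A2, A3, None))   # ordering (2-3-4), A4 not yet placed
```

The middle placement has the largest contribution, so A3 is inserted between A1 and A2.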

BEA Algorithm/5

After adding A3 the CA matrix has the form

CA =
        A1   A3   A2
        45   45    0
         0    5   80
        45   53    5
         0    3   75

After adding A4 and ordering the rows the CA matrix has the form

CA =
        A1   A3   A2   A4
  A1    45   45    0    0
  A3    45   53    5    3
  A2     0    5   80   75
  A4     0    3   75   78
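Putting initialization, iteration and row ordering together gives a compact version of the algorithm. This is a minimal sketch, not a reference implementation: it assumes the first two columns are placed as in the example, treats positions outside the current ordering as having bond 0, and breaks ties in favor of the leftmost position.

```python
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def bond(aa, x, y):
    """Bond between two columns; None marks an empty boundary position."""
    if x is None or y is None:
        return 0
    return sum(row[x] * row[y] for row in aa)


def cont(aa, left, new, right):
    """Contribution of placing column `new` between `left` and `right`."""
    return (2 * bond(aa, left, new)
            + 2 * bond(aa, new, right)
            - 2 * bond(aa, left, right))


def bea(aa):
    """Return (column order, clustered matrix CA) for a symmetric AA matrix."""
    n = len(aa)
    order = list(range(min(n, 2)))            # initialization: fix the first columns
    for new in range(2, n):                   # iteration: place each remaining column
        best_pos, best_gain = None, None
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            gain = cont(aa, left, new, right)
            if best_gain is None or gain > best_gain:
                best_pos, best_gain = pos, gain
        order.insert(best_pos, new)
    # Row order: permute the rows in the same way as the columns.
    ca = [[aa[r][c] for c in order] for r in order]
    return order, ca


order, CA = bea(AA)
print(["A%d" % (i + 1) for i in order])
for row in CA:
    print(row)
```

On the example matrix this reproduces the ordering A1, A3, A2, A4 and the CA matrix shown above.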

BEA Algorithm/6

- The final step is to divide a set of clustered attributes {A1, ..., An} into two sets {A1, ..., Ai} and {Ai+1, ..., An} such that the costs of queries that access both sets are minimized.
- The appropriate split point on the diagonal must be determined:

        A1   A3   A2   A4
  A1    45   45    0    0
  A3    45   53    5    3
  A2     0    5   80   75
  A4     0    3   75   78

- This is an optimization problem: the optimal point on the diagonal must be determined.
- Note: Generalizations are needed to deal with
  - partitions located in the middle of the matrix
  - multiple split points

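BEA Algorithm/7

TA and BA denote the two attribute sets produced by a split point: the top set {A1, ..., Ai} and the bottom set {Ai+1, ..., An}.

- AQ(qi) = {Aj | use(qi, Aj) = 1}: attributes accessed by qi
- TQ = {qi | AQ(qi) ⊆ TA}: queries that access TA only
- BQ = {qi | AQ(qi) ⊆ BA}: queries that access BA only
- OQ = Q − (TQ ∪ BQ): queries that access both TA and BA
- Cost of all queries:

  $$CQ = \sum_{q_i \in Q} \sum_{\forall S_j} \mathrm{ref}_j(q_i) \cdot \mathrm{acc}_j(q_i)$$

- Cost of TQ queries:

  $$CTQ = \sum_{q_i \in TQ} \sum_{\forall S_j} \mathrm{ref}_j(q_i) \cdot \mathrm{acc}_j(q_i)$$

- Cost of BQ queries:

  $$CBQ = \sum_{q_i \in BQ} \sum_{\forall S_j} \mathrm{ref}_j(q_i) \cdot \mathrm{acc}_j(q_i)$$

- Cost of the other queries:

  $$COQ = \sum_{q_i \in OQ} \sum_{\forall S_j} \mathrm{ref}_j(q_i) \cdot \mathrm{acc}_j(q_i)$$

- Choose the split point that maximizes $CTQ \cdot CBQ - COQ^2$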
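For the running example the objective can be evaluated at each of the three possible split points of the clustered order A1, A3, A2, A4. The per-query costs below are the site-summed frequencies from the example (with ref = 1); the attribute sets and the function name `split_quality` are assumptions of this sketch.

```python
ORDER = ["A1", "A3", "A2", "A4"]          # clustered attribute order from BEA

# AQ(q): attributes accessed by each example query.
AQ = {
    "q1": {"A1", "A3"},
    "q2": {"A2", "A3"},
    "q3": {"A2", "A4"},
    "q4": {"A3", "A4"},
}

# Total cost of each query over all sites (ref = 1, frequencies summed).
COST = {"q1": 45, "q2": 5, "q3": 75, "q4": 3}


def split_quality(order, aq, cost, point):
    """CTQ * CBQ - COQ^2 for the split order[:point] | order[point:]."""
    ta, ba = set(order[:point]), set(order[point:])
    ctq = sum(cost[q] for q, attrs in aq.items() if attrs <= ta)
    cbq = sum(cost[q] for q, attrs in aq.items() if attrs <= ba)
    coq = sum(cost.values()) - ctq - cbq   # queries touching both sets
    return ctq * cbq - coq ** 2


scores = {p: split_quality(ORDER, AQ, COST, p) for p in range(1, len(ORDER))}
best = max(scores, key=scores.get)
print(scores)
print(ORDER[:best], ORDER[best:])
```

The best split separates {A1, A3} from {A2, A4}, i.e., {PNO, BUDGET} from {PNAME, LOC}, which matches the fragmentation given earlier once the key PNO is added to the second fragment.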

Correctness of Vertical Fragmentation

- Relation R is decomposed into fragments R1, R2, ..., Rn
  - e.g., PROJ = {PNO, BUDGET, PNAME, LOC} into PROJ1 = {PNO, BUDGET} and PROJ2 = {PNO, PNAME, LOC}
- Completeness
  - Guaranteed by the partitioning algorithm, which assigns each attribute in A to one partition
- Reconstruction
  - Join to reconstruct vertical fragments
  - R = R1 ⋈ ··· ⋈ Rn = PROJ1 ⋈ PROJ2
- Disjointness
  - Attributes have to be disjoint in VF. Two cases are distinguished:
    - If tuple IDs are used, the fragments are really disjoint
    - Otherwise, key attributes are replicated automatically by the system
    - e.g., PNO in the above example

Mixed Fragmentation

- In most cases a simple horizontal or vertical fragmentation of a DB schema will not be sufficient to satisfy the requirements of the applications.
- Mixed fragmentation (hybrid fragmentation): consists of a horizontal fragmentation followed by a vertical fragmentation, or a vertical fragmentation followed by a horizontal fragmentation
- Fragmentation is defined using the selection and projection operations of relational algebra:

  σ_p(π_{A1,...,An}(R))

  π_{A1,...,An}(σ_p(R))
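Both compositions can be tried on the PROJ tuples from the earlier examples. The sketch below represents tuples as dictionaries and uses helper names `select` and `project` invented for this illustration; the two orders give the same fragment only because the selection predicate refers solely to attributes that the projection keeps.

```python
PROJ = [
    {"PNO": "P1", "PNAME": "Instrument.", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "DB Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310000, "LOC": "Paris"},
]


def select(relation, predicate):
    """Horizontal step: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def project(relation, attributes):
    """Vertical step: keep only the listed attributes of every tuple."""
    return [{a: t[a] for a in attributes} for t in relation]


def low_budget(t):
    return t["BUDGET"] < 200_000


# Horizontal fragmentation followed by vertical fragmentation ...
hv = project(select(PROJ, low_budget), ["PNO", "BUDGET"])
# ... and vertical fragmentation followed by horizontal fragmentation.
vh = select(project(PROJ, ["PNO", "BUDGET"]), low_budget)
print(hv)
print(hv == vh)
```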

Replication and Allocation

- Replication: Which fragments shall be stored as multiple copies?
  - Complete replication
    - A complete copy of the database is maintained at each site
  - Selective replication
    - Selected fragments are replicated at some sites
- Allocation: On which sites should the various fragments be stored?
  - Centralized
    - Consists of a single DB and DBMS stored at one site, with users distributed across the network
  - Partitioned
    - The database is partitioned into disjoint fragments, each fragment assigned to one site

Replication/1

- Replicated DB
  - fully replicated: each fragment at each site
  - partially replicated: each fragment at some of the sites
- Non-replicated DB (= partitioned DB)
  - partitioned: each fragment resides at only one site
- Rule of thumb:
  - If (number of read-only queries) / (number of update queries) ≥ 1, then replication is advantageous; otherwise replication may cause problems

Replication/2

- Comparison of replication alternatives [table not reproduced]

Fragment Allocation/1

- Fragment allocation problem
  - Given are:
    - fragments F = {F1, F2, ..., Fn}
    - network sites S = {S1, S2, ..., Sm}
    - and applications Q = {q1, q2, ..., ql}
  - Find: the "optimal" distribution of F to S
- Optimality
  - Minimal cost
    - Communication + storage + processing (read and update)
    - Cost in terms of time (usually)
  - Performance
    - Response time and/or throughput
  - Constraints
    - Per-site constraints (storage and processing)

Fragment Allocation/2

- Required information
  - Database information
    - selectivity of fragments
    - size of a fragment
  - Application information
    - RRij: number of read accesses of a query qi to a fragment Fj
    - URij: number of update accesses of a query qi to a fragment Fj
    - uij: a matrix indicating which queries update which fragments
    - rij: a similar matrix for retrievals
    - originating site of each query
  - Site information
    - USCk: unit cost of storing data at a site Sk
    - LPCk: cost of processing one unit of data at a site Sk
  - Network information
    - communication cost/frame between two sites
    - frame size

Fragment Allocation/3

- We discuss an allocation model which attempts to
  - minimize the total cost of processing and storage
  - meet certain response time restrictions
- General form: min(Total Cost)
  - subject to
    - response time constraint
    - storage constraint
    - processing constraint
- Decision variable xij

  $$x_{ij} = \begin{cases} 1 & \text{if fragment } F_i \text{ is stored at site } S_j \\ 0 & \text{otherwise} \end{cases}$$

Fragment Allocation/4

- The total cost function has two components: storage and query processing.

  $$TOC = \sum_{S_k \in S} \sum_{F_j \in F} STC_{jk} + \sum_{q_i \in Q} QPC_i$$

- Storage cost of fragment Fj at site Sk:

  $$STC_{jk} = USC_k \cdot \mathrm{size}(F_j) \cdot x_{jk}$$

  where USCk is the unit storage cost at site k and xjk = 1 iff fragment Fj is stored at site Sk

- The query processing cost for a query qi is composed of two components, processing cost (PC) and transmission cost (TC):

  $$QPC_i = PC_i + TC_i$$

Fragment Allocation/5

- Processing cost is a sum of three components:
  - access cost (AC), integrity constraint cost (IE), concurrency control cost (CC)

  $$PC_i = AC_i + IE_i + CC_i$$

- Access cost:

  $$AC_i = \sum_{S_k \in S} \sum_{F_j \in F} (UR_{ij} + RR_{ij}) \cdot x_{jk} \cdot LPC_k$$

  where LPCk is the unit processing cost at site k
- Integrity and concurrency costs:
  - Can be computed similarly, though they depend on the specific constraints
- Note: ACi assumes that processing a query involves decomposing it into a set of subqueries, each of which works on a fragment, ...
  - This is a very simplistic model
  - It does not take into consideration different query costs depending on the operator or the different algorithms that are applied

Fragment Allocation/6

- The transmission cost is composed of two components:
  - Cost of processing updates (TCU) and cost of processing retrievals (TCR)

  $$TC_i = TCU_i + TCR_i$$

- Cost of updates:
  - Inform all the sites that have replicas + a short confirmation message back

  $$TCU_i = \sum_{S_k \in S} \sum_{F_j \in F} u_{ij} \cdot (\text{update message cost} + \text{acknowledgment cost})$$

- Retrieval cost:
  - Send the retrieval request to all sites that have a copy of the fragments that are needed + send back the results from these sites to the originating site.

  $$TCR_i = \sum_{F_j \in F} \min_{S_k \in S} \bigl( x_{jk} \cdot (\text{cost of retrieval request} + \text{cost of sending back the result}) \bigr)$$

Fragment Allocation/7

- Modeling the constraints
  - Response time constraint for a query qi:

    execution time of qi ≤ max. allowable response time for qi

  - Storage constraint for a site Sk:

    $$\sum_{F_j \in F} \text{storage requirement of } F_j \text{ at } S_k \le \text{storage capacity of } S_k$$

  - Processing constraint for a site Sk:

    $$\sum_{q_i \in Q} \text{processing load of } q_i \text{ at site } S_k \le \text{processing capacity of } S_k$$
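A toy instance shows how the cost function and the storage constraint interact. Everything in this sketch is hypothetical: the sizes, unit costs, capacities and access counts are invented, replication is excluded, integrity and concurrency costs are ignored, and transmission is simplified to a flat cost per access to a fragment stored away from the query's originating site. The exhaustive search is only feasible for tiny inputs.

```python
from itertools import product

# Hypothetical input data (not from the slides).
SIZE = {"F1": 100, "F2": 300}                  # size(F_j)
SITES = ["S1", "S2"]
USC = {"S1": 1.0, "S2": 1.0}                   # unit storage cost per site
LPC = {"S1": 1.0, "S2": 1.0}                   # unit processing cost per site
CAPACITY = {"S1": 350, "S2": 400}              # storage capacity per site
ACCESSES = {("q1", "F1"): 10, ("q2", "F2"): 4, ("q2", "F1"): 1}  # RR_ij + UR_ij
ORIGIN = {"q1": "S1", "q2": "S2"}              # originating site of each query
REMOTE = 5.0                                   # assumed cost per remote access


def total_cost(placement):
    """Storage + access + simplified transmission cost of a placement."""
    storage = sum(USC[site] * SIZE[f] for f, site in placement.items())
    processing = sum(n * LPC[placement[f]] for (_, f), n in ACCESSES.items())
    transmission = sum(n * REMOTE for (q, f), n in ACCESSES.items()
                       if placement[f] != ORIGIN[q])
    return storage + processing + transmission


def feasible(placement):
    """Storage constraint: fragments at a site must fit into its capacity."""
    load = {site: 0 for site in SITES}
    for f, site in placement.items():
        load[site] += SIZE[f]
    return all(load[site] <= CAPACITY[site] for site in SITES)


def best_allocation():
    """Try every non-replicated placement and keep the cheapest feasible one."""
    fragments = list(SIZE)
    candidates = (dict(zip(fragments, sites))
                  for sites in product(SITES, repeat=len(fragments)))
    return min((p for p in candidates if feasible(p)), key=total_cost)


best = best_allocation()
print(best, total_cost(best))
```

With m sites and n fragments the search visits m^n placements, so realistic instances need heuristics instead of enumeration.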

Fragment Allocation/8

- Solution methods
  - The allocation problem is NP-complete
  - There is a correspondence between the allocation problem and similar problems in other areas
    - Plant location problem in operations research
    - Knapsack problem
    - Network flow problem
  - Hence, solutions from these areas can be re-used
  - Use different heuristics to reduce the search space
    - Assume that all candidate partitionings have been determined together with their associated costs and benefits in terms of query processing.
      - The problem is then reduced to finding the optimal partitioning and placement for each relation
    - Ignore replication in the first step and find an optimal non-replicated solution
      - Replication is then handled in a second step on top of the previous non-replicated solution.

Conclusion/1

- Distributed design decides on the placement of (parts of the) data and programs across the sites of a computer network
- There are two key technical questions: fragmentation and allocation/replication of data
  - Horizontal fragmentation is defined via the selection operation σ_p(R)
    - Rewrites the queries of each site in conjunctive normal form and finds a minimal and complete set of conjunctions to determine the fragmentation
  - Vertical fragmentation is defined via the projection operation π_A(R)
    - Computes the attribute affinity matrix and groups "similar" attributes together
  - Mixed fragmentation is a combination of both approaches

Conclusion/2

- Allocation/replication of data
  - Type of replication: no replication, partial replication, full replication
  - Optimal allocation/replication is modelled as a cost function under a set of constraints
  - The allocation problem is NP-complete
  - Different heuristics are used to reduce the complexity