CS 345: Topics in Data Warehousing

CS 345:Topics in Data Warehousing

Tuesday, November 9, 2004

Review of Thursday’s Class

• Index Selection

• Selecting Views and Indexes Together

• Storage Systems– Mirroring, Striping, and Parity– RAID Levels

Outline of Today’s Class

• Eliminating Bottlenecks– Multi-channel RAID– Clusters of Processors

• Shared-Memory (SMP)• Shared-Disk• Shared-Nothing

• Horizontal Partitioning• Partitioning and Clustered Indexes• Multi-Dimensional Clustering• Clustered File Organization• Integer Mapped Tables

Eliminating Storage Bottlenecks

• Storage systems have various components• Improving performance of one component through

parallelism can make another component the bottleneck– Initially, disk transfer rate is limiting factor– Solution: Add more disks– Eventually, I/O bus becomes saturated– Solution: Add new I/O buses (and controllers)– Multi-channel RAID

CPU System Bus

DiskController

Disk

I/O Bus

Disk

Multi-channel RAID

CPU System Bus

DiskController

Disk

I/O Bus

Disk

DiskController

Disk

I/O Bus

Disk

Eliminating Processing Bottlenecks• Three architectures to add processing power• Shared-memory clusters

– Also called Symmetric Multi-Processing (SMP)– One computer, many processors– Each processor has its own cache– All processors share same memory, disks

• Shared-disk clusters– Computers sharing the same storage system– Each processor has its own cache, memory– All processors share same disks

• Shared-nothing clusters– Independent computers connected on a LAN– Each processor has its own cache, memory, disks

Parallel Architectures

CPU CPU CPU

CPU CPU CPU

Disk

Disk

Disk

Disk

CPU

Disk

Disk

Disk

Disk

CPU

CPU

CPU Disk

Disk

Disk

CPU

CPU

Shared Disk Cluster Shared Nothing Cluster

SymmetricMultiprocessor(SMP)

Pros and Cons• Shared-memory (SMP) machines

– Advantages:• Easy to administer (one big server)• Avoids transmitting data over network

– Disadvantages:• Single point of failure (not fault tolerant)• Limited scalability

– Fixed set of options (2, 4, 8, 16, 32, or 64 processors)– Cost increases dramatically at higher end

• Shared-nothing clusters– Advantages:

• Most cost-efficient choice for large systems• Continuous scalability• Easy to make fault tolerant through redundancy

– Disadvantages:• Communication costs can be high

• Shared-disk clusters– Intermediate between SMPs and shared-nothing clusters– Not very popular

Horizontal Partitioning

• Divide rows of a table into partitions based on value of some attribute(s)– Rows from same partition are stored together– Rows from different partitions are separated

• Two kinds of partitioning– Hash partitioning

• Rows divided into partitions using a hash function

– Range partitioning• Each partition holds a range of attribute values

Horizontal Partitioning Example

Name Age Income

Joe 25 50000

Mary 28 67000

Fred 21 34000

Name Age Income

Steve 34 78000

Ann 32 24000

Maude 37 55000

Name Age Income

Len 41 92000

Mark 49 59000

Paula 43 45000

Partition 1 (Age 20-29)

Partition 2 (Age 30-39) Partition 3 (Age 40-49)

Partitioning and Shared-Nothing• Store each partition on a different node• Partitioned join

– Join two tables that are partitioned on the join key– Requires local operations only– No data needs to be shipped around

Fact TablePartition 1

Dimension TablePartition 1





Partitioning and Shared-Nothing

• Joins with repartitioning– Necessary when one or both inputs are not

partitioned on the join key• Repartitioned join

– Step 1: Partition each input on join key– Step 2: Assign each partition to a node– Step 3: Send rows to the proper nodes– Step 4: Perform local join

Partitioning and Data Warehouses• Shared-nothing data warehouse

– Partition fact tables across cluster nodes– Replicate dimension tables across cluster nodes

• Replicated dimension tables– Dimension tables are small– Storing multiple copies of them is cheap– No communication needed for parallel joins

• Partitioned fact tables– Fact tables are big– Process queries in parallel for each partition– Divide the work among the nodes in the cluster

• One big dimension– Sometimes one dimension table is quite big (e.g. customer)– Can partition the big dimension table– Partition fact table on key of big dimension

Partitioning and Shared-Disk / SMP

• Shared-nothing cluster– Each disk can be accessed by only one node– Typically: Each partition is stored on a different node– Avoid unnecessary network communication

• Shared-disk / Shared-memory cluster– Any node can access any disk– Typically: All partitions are striped across all disks– Achieve maximum I/O parallelism

CPU CPU CPUCPU CPU CPU

Partition Pruning

• Often some partitions can be excluded entirely based on query filters– Table is partitioned on Age (20-29, 30-39, 40-49)– Query says “WHERE Age < 35”– No need to scan Age 40-49 partition– This is called partition pruning

• Effect of partition pruning on I/O parallelism– If each disk stores a single partition:

• Partition pruning reduces I/O parallelism• Disks for pruned partitions are idle

– If all partitions are striped across all disks:• Maximum I/O parallelism regardless of pruning

Manageability and Partitioning• Indexing partitioned tables

– Local index • Indexes the data from a single partition• Stored with that partition

– Global index• Indexes the data from all partitions• Could be stored anywhere, or replicated

• Reducing load time via partitioning– Often fact tables are partitioned on Date– Also indexes, aggregate tables, etc.– Newly loaded records go into the last partition– Only indexes & aggregates for that partition need to be updated– All other partitions remain unchanged

• Expiring old data– Often older data is less useful / relevant for data analysts– To reduce data warehouse size, old data is often deleted– If data is partitioned on date, simply delete the oldest partition

Partitioning and Clustered Indexes

• Recall: only one clustered index per table– Rows physically ordered based on clustered index search key– Queries filtering on that search key have very good I/O

performance• Sequential access• Relevant records packed together into few pages

– Queries filtering on other attributes have poor I/O performance• Unless only a few rows need to be retrieved

• Idea: Combine partitioning and clustering– Partition on attribute A– Each partition is clustered on attribute B– Reasonably good I/O performance for queries that filter on either

A or B

Partitioning & Clustered Indexes

• Clustered index on A– “A=10” tuples are

contiguous on disk– “B=5” tuples are spread out– Filtering on A = Excellent– Filtering on B = Poor

• Partitioned on A with clustered index on B– “A=10” tuples are confined

to one partition– “B=5” tuples are in a few

contiguous blocks (one per partition)

– Filtering on A = Very Good– Filtering on B = Very Good

Clustered on APartitioned on A,Clustered on B

A=10

B=5

Multi-Dimensional Clustering

• Goal: structured I/O access for multiple attributes– Similar to the last idea of combining partitioning and clustering

• Divide data into “chunks” by grouping on multiple dimensions– For example, partition on Age, State, Gender– One chunk is Age=24, State=CA, Gender=F

• Records from each chunk are stored together on disk in one or more fixed-size blocks– No block can contain data from more than one chunk– Some wasted space due to partially-filled blocks

• Construct “block indexes” with one index entry per block– Instead of one index entry per RID

• Block index can be built on single or multiple attributes

Multi-Dimensional Clustering

Age 1State 1

Age 1State 2

Age 1State 3

Age 1State 4

Age 2State 1

Age 2State 2

Age 2State 3

Age 2State 4

Age 3State 1

Age 3State 2

Age 3State 3

Age 3State 4

Age 4State 1

Age 4State 2

Age 4State 3

Age 4State 4

Chunk #

1

2

3

4

5

6

…

15

16

•4 different Age values•4 different State values•16 chunks in all

Some space is wasted due to partially filled blocks

MDC Pros and Cons

• Advantages of Multi-Dimensional Clustering– Block index is smaller than RID index

• One row per block not one row per record– Records with same value are packed together in a relatively

small number of blocks• Decent I/O performance• Easy to prefetch• Some of the benefits of clustered indexes, but for multiple

dimensions

• Disadvantages of Multi-Dimensional Clustering– Sparsity leads to space blow-up

• If each block holds only a few records, lots of wasted space

• MDC is similar to MOLAP storage format– Similar pros and cons

Avoiding Sparsity in MDC

• Trade-off in choosing block size– Big block size → Fewer blocks are needed → Block

indexes are smaller– Small block size → Less wasted space due to

partially-filled blocks– Block size should be at least as large as page size– Choose block size such that most chunks fill at least

one block• Avoiding sparsity

– Chunks are less than one page → MDC works poorly– Don’t choose too many clustering attributes– Avoid clustering attributes with high cardinality

Clustered File Organization

• Clustered file organization– Another physical design option offered by some DBMSs– Used primarily for “master-detail” data– Not commonly used with star schemas

• How it works– R and S are two relations sorted on a common join key– Tuples for R and S are interleaved on the same pages on disk– Each R tuple is followed by all joining S tuples

• Conceptually, similar to materialized view– Computing the join of R and S is very fast– Scanning R or S alone is slower because irrelevant data (tuples

form the other relation) is mixed in with relevant data


Relation R Relation S

Standard File Organization


Integer Mapped Tables

• Dimension tables are often wide– Large number of attributes– Attributes often consist of descriptive text

• When dimensions also have many rows, query performance can be affected– Joining to big dimension tables can be expensive

• How to reduce the width of the dimension table?– Break into several mini-dimensions – Move some attributes to an outrigger table– “Integer mapping” (not standard terminology)

Integer Mapped Tables

• Construct auxiliary mapping table for each descriptive text column– Mapping table has schema (int_value, real_value)– Assign unique integer value to each text value

• Replace long text columns with short integer values in dim. table– Integers serve as “pointers” to actual data values in

mapping table• Use a view to provide user-friendly dimensions

– Join physical dimension table to mapping table– Replace integer values with actual descriptive values– Result looks like ordinary dimension with no integer

mapping

Integer Mapped TablesOrdinary dimension table Integer mapped

dimension tableMapping tables

Non-mappeddimensioncolumns

Non-mappeddimensioncolumns

Long columnsto be mapped

Long valuesreplaced byintegers

Long values inmapping tables

Pros and Cons of Integer Mapping

• Pros and Cons– Thinner dimension table → Better query performance– Requires extra join to mapping table → Worse query

performance– Extra complexity (but can be hidden behind view)

• Whether query performance is improved depends on the data characteristics– Some queries may be faster while others are slower

• Mostly useful for:– Very large dimension tables– Rarely-queried columns

• Also useful for especially large data types– E.g. Image, embedded XML file

CS 345: Topics in Data Warehousing

Documents

Transcript of CS 345: Topics in Data Warehousing