Data Warehouse and OLAP (twang/595DM/Slides/Week5.pdf)

Page 1: Data Warehouse and OLAP

Week 5

Page 2: Midterm I

• Friday, March 4
• Scope
  – Homework assignments 1 – 4
  – Open book

Page 3: Team Homework Assignment #7

• Read pp. 121 – 139, 146 – 150 of the textbook.
• Do Examples 3.8, 3.10 and Exercise 3.4 (b) and (c). Prepare for the results of the homework assignment.
• Due date
  – beginning of the lecture on Friday, March 11th.

Page 4: Topics

• Definition of data warehouse
• Multidimensional data model
• Data warehouse architecture
• From data warehousing to data mining

Page 5: What Is a Data Warehouse? (1)

• A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site
• A data warehouse is a semantically consistent data store that serves as a physical implementation of a decision support data model and stores the information that an enterprise needs to make strategic decisions

Page 6: What Is a Data Warehouse? (2)

• Data warehouses provide on-line analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitate effective data generalization and data mining
• Many other data mining functions, such as association, classification, prediction, and clustering, can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction

Page 7: What Is a Data Warehouse? (3)

• A decision support database that is maintained separately from the organization’s operational database
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process” [Inm96] (W. H. Inmon)

Page 8: Data Warehouse Framework

Figure 1.7 Typical framework of a data warehouse for AllElectronics

Page 9: Data Warehouse is Subject-Oriented

• Organized around major subjects, such as customer, product, and sales
• Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Provides a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process

Page 10: Data Warehouse is Integrated

• Constructed by integrating multiple, heterogeneous data sources
  – relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
  – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    • E.g., hotel price: currency, tax, breakfast covered, etc.
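The hotel-price example can be made concrete with a small sketch. Everything in it is an assumption for illustration: the two source layouts, the field names, the exchange rate, the tax rate, and the breakfast price are hypothetical, and a real integration step would take such values from maintained reference data.

```python
from dataclasses import dataclass

# Hypothetical reference data; real values would come from maintained lookup tables.
USD_PER_UNIT = {"USD": 1.0, "EUR": 1.25}  # assumed exchange rates
TAX_RATE = 0.10                           # assumed uniform room tax
BREAKFAST_USD = 12.0                      # assumed breakfast price per night


@dataclass
class HotelPrice:
    """Unified warehouse representation: USD, tax included, breakfast excluded."""
    hotel: str
    usd_per_night: float


def from_source_a(row: dict) -> HotelPrice:
    # Source A (hypothetical): price in EUR, tax included, breakfast included.
    usd = row["price"] * USD_PER_UNIT[row["currency"]]
    return HotelPrice(row["name"], round(usd - BREAKFAST_USD, 2))


def from_source_b(row: dict) -> HotelPrice:
    # Source B (hypothetical): price in USD, before tax, no breakfast.
    usd = row["rate"] * USD_PER_UNIT[row["cur"]]
    return HotelPrice(row["hotel_name"], round(usd * (1 + TAX_RATE), 2))


unified = [
    from_source_a({"name": "Hotel A", "price": 100.0, "currency": "EUR"}),
    from_source_b({"hotel_name": "Hotel B", "rate": 100.0, "cur": "USD"}),
]
for price in unified:
    print(price)
```

After this step both records use the same naming, currency, and tax convention, so their prices can be compared directly.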

Page 11: Data Warehouse is Time-Variant

• The time horizon for the data warehouse is significantly longer than that of operational systems
  – Operational database: current value data
  – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse contains an element of time, explicitly or implicitly

Page 12: Data Warehouse is Nonvolatile

• A physically separate store of data transformed from the operational environment
• Operational update of data does not occur in the data warehouse environment
  – Does not require transaction processing, recovery, and concurrency control mechanisms
  – Requires only two operations in data accessing: initial loading of data and access of data

Page 13: OLTP vs. OLAP

Table 3.1 Comparison between OLTP and OLAP

Page 14: Why Is a Separate Data Warehouse Needed? (1)

• Why not perform on-line analytical processing directly on operational databases instead of spending additional time and resources to construct a separate data warehouse?

Page 15: Why Is a Separate Data Warehouse Needed? (2)

• High performance for both systems
  – DBMS, tuned for OLTP: searching for particular records, indexing, hashing, concurrency control, recovery
  – Warehouse, tuned for OLAP: complex OLAP queries, multidimensional view, consolidation (summarization and aggregation)
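The difference in workload can be sketched with Python's standard `sqlite3` module. The `sales` table, its columns, and its rows are made up for illustration. Both queries run on one small in-memory database here only to contrast their shapes, not to suggest that a single system should serve both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "order_id INTEGER PRIMARY KEY, item TEXT, quarter TEXT, dollars REAL)"
)
# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "computer", "Q1", 800.0),
        (2, "phone", "Q1", 20.0),
        (3, "computer", "Q2", 900.0),
        (4, "phone", "Q2", 30.0),
    ],
)

# OLTP-style access: fetch one particular record through the primary key.
oltp_row = conn.execute(
    "SELECT item, dollars FROM sales WHERE order_id = ?", (3,)
).fetchone()

# OLAP-style access: scan many records and consolidate them into a summary.
olap_rows = conn.execute(
    "SELECT item, SUM(dollars) FROM sales GROUP BY item ORDER BY item"
).fetchall()

print(oltp_row)
print(olap_rows)
```

The first query touches a single row and benefits from indexing. The second reads every row, which is the kind of access a warehouse is tuned for.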

Page 16: Topics

• Definition of data warehouse
• Multidimensional data model
• Data warehouse architecture
• From data warehousing to data mining

Page 17: From Tables and Spreadsheets to Data Cubes

• A data warehouse is based on a multidimensional data model
• This model views data in the form of a data cube
• A data cube allows data to be modeled and viewed in multiple dimensions

Page 18: From Tables and Spreadsheets to Data Cubes (1)

• A data cube is defined by facts and dimensions
  – Facts are the data that the data warehouse focuses on
    • Fact tables contain numeric measures (such as dollars_sold) and keys to each of the related dimension tables
  – Dimensions are the perspectives with respect to which the facts are recorded
    • Dimension tables describe the dimension with attributes. For example, item (item_name, brand, type), or time (day, week, month, quarter, year)
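A minimal sketch of how a fact table and its dimension tables relate, using plain Python dictionaries. The attribute names follow the slide (item_name, brand, type, quarter, dollars_sold). The keys, the rows, and the helper `total_by` are hypothetical.

```python
from collections import defaultdict

# Dimension tables: key -> descriptive attributes (values are made up).
item_dim = {
    1: {"item_name": "laptop", "brand": "BrandX", "type": "computer"},
    2: {"item_name": "handset", "brand": "BrandY", "type": "phone"},
}
time_dim = {
    10: {"day": 15, "month": 1, "quarter": "Q1", "year": 2010},
    11: {"day": 20, "month": 4, "quarter": "Q2", "year": 2010},
}

# Fact table: a key into each dimension table plus a numeric measure.
sales_fact = [
    {"item_key": 1, "time_key": 10, "dollars_sold": 500.0},
    {"item_key": 2, "time_key": 10, "dollars_sold": 40.0},
    {"item_key": 1, "time_key": 11, "dollars_sold": 650.0},
]


def total_by(fact_rows, dim, key_name, attribute):
    """Sum dollars_sold grouped by one attribute of one dimension table."""
    totals = defaultdict(float)
    for row in fact_rows:
        totals[dim[row[key_name]][attribute]] += row["dollars_sold"]
    return dict(totals)


by_type = total_by(sales_fact, item_dim, "item_key", "type")
by_quarter = total_by(sales_fact, time_dim, "time_key", "quarter")
print(by_type)
print(by_quarter)
```

The same facts are summarized from two perspectives, item type and quarter, simply by following a different key into a different dimension table.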

Page 19: Figure 1.6. Fragments of relations from a relational database for AllElectronics

Page 20: From Tables and Spreadsheets to Data Cubes (2)

Table 3.2 A 2-D view of sales data for AllElectronics according to the dimensions time and item, where the sales are from branches located in the city of Vancouver. The measure displayed is dollars_sold (in thousands).

Slide annotations on the table: dimensions; facts (numerical measures)
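A 2-D view like the one described in Table 3.2 can be derived from individual fact records by fixing the location and arranging the measure by time and item. The sketch below uses made-up records (not the figures from the textbook table) and a hypothetical helper `two_d_view`.

```python
# Made-up fact records: (quarter, item type, city, dollars_sold in thousands).
records = [
    ("Q1", "computer", "Vancouver", 100),
    ("Q1", "phone", "Vancouver", 10),
    ("Q2", "computer", "Vancouver", 120),
    ("Q2", "phone", "Vancouver", 15),
    ("Q1", "computer", "Toronto", 90),  # excluded by the city filter below
]


def two_d_view(rows, city):
    """Return {quarter: {item: dollars_sold}} for one fixed city."""
    view = {}
    for quarter, item, row_city, dollars in rows:
        if row_city != city:
            continue
        cells = view.setdefault(quarter, {})
        cells[item] = cells.get(item, 0) + dollars
    return view


vancouver = two_d_view(records, "Vancouver")

# Print the view as a small time-by-item table.
items = sorted({item for cells in vancouver.values() for item in cells})
print("time".ljust(6) + "".join(name.rjust(10) for name in items))
for quarter in sorted(vancouver):
    cells = vancouver[quarter]
    print(quarter.ljust(6) + "".join(str(cells.get(name, 0)).rjust(10) for name in items))
```

Dropping the city filter and keeping city as a third key would give the 3-D view.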

Page 21: From Tables and Spreadsheets to Data Cubes (3)

Table 3.3 A 3-D view of sales data for AllElectronics according to the dimensions time, item, and location. The measure displayed is dollars_sold (in thousands).

Page 22: From Tables and Spreadsheets to Data Cubes (4)

Figure 3.1 A 3-D data cube representation of the data in Table 3.3, according to the dimensions time, item, and location. The measure displayed is dollars_sold (in thousands).

Page 23: From Tables and Spreadsheets to Data Cubes (5)

Figure 3.2 A 4-D data cube representation, according to the dimensions time, item, location, and supplier. The measure displayed is dollars_sold (in thousands).

Page 24: Cuboid

• A data cube is a lattice of cuboids
• The total number of cuboids
• The apex cuboid
• The base cuboid
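The lattice can be enumerated directly: each cuboid corresponds to one group-by, that is, one subset of the dimensions. The sketch below assumes three dimensions named city, item, and year with no concept hierarchies. The function name `cuboids` is hypothetical.

```python
from itertools import combinations

dimensions = ("city", "item", "year")  # assumed dimension names


def cuboids(dims):
    """Yield every group-by (subset of dims), from 0-D up to n-D."""
    for k in range(len(dims) + 1):
        yield from combinations(dims, k)


lattice = list(cuboids(dimensions))
apex = lattice[0]    # 0-D cuboid: groups by nothing, i.e. one grand total
base = lattice[-1]   # n-D cuboid: groups by every dimension

print(len(lattice))
for cuboid in lattice:
    print(cuboid)
```

Without hierarchies, n dimensions give 2^n cuboids, so three dimensions give 8. The apex cuboid is the most summarized and the base cuboid is the least summarized.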

Page 25: Figure 3.14 Lattice of cuboids, making up a 3-D data cube. Each cuboid represents a different group-by. The base cuboid contains the three dimensions city, item, and year.

Page 26: The Curse of Dimensionality

• How many cuboids are there in an n-dimensional data cube?
• How many cuboids are there in an n-dimensional data cube in which each dimension i has L_i levels?
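The standard answer to the second question is that the total number of cuboids is the product of (L_i + 1) over all n dimensions, where the extra 1 accounts for the virtual top level "all" of each dimension. With a single level per dimension this reduces to 2^n. A small sketch, with a hypothetical function name:

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """Product of (L_i + 1), where L_i counts the levels of dimension i
    and the +1 stands for the virtual top level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)


no_hierarchy = total_cuboids([1, 1, 1])  # three dimensions, one level each
ten_by_four = total_cuboids([4] * 10)    # ten dimensions, four levels each

print(no_hierarchy)
print(ten_by_four)
```

Ten dimensions with four levels each already give 5^10 = 9,765,625 cuboids, which is why materializing every cuboid quickly becomes impractical.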

Page 27: Conceptual Modeling of Data Warehouses

• Modeling data warehouses: dimensions & measures
  – Star schema: A fact table in the middle connected to a set of dimension tables
  – Snowflake schema: A refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  – Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called a galaxy schema or fact constellation

Page 28: Star Schema

Figure 3.4 Star schema of a data warehouse for sales.

• Sales Fact Table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, province_or_state, country
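The star schema can be written down as table definitions. The sketch below uses Python's standard `sqlite3` module. The column types, the fact table name `sales`, the attribute spelling province_or_state, and all inserted rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_the_week TEXT,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_state TEXT, country TEXT);
-- Fact table: one key per dimension plus the numeric measures.
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")

# Made-up rows, for illustration only.
conn.execute("INSERT INTO time VALUES (1, 15, 'Friday', 1, 'Q1', 2010)")
conn.execute("INSERT INTO item VALUES (1, 'laptop', 'BrandX', 'computer', 'wholesale')")
conn.execute("INSERT INTO branch VALUES (1, 'Main', 'retail')")
conn.execute(
    "INSERT INTO location VALUES (1, '1 Main St', 'Vancouver', 'British Columbia', 'Canada')"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 1, 1, 1, 500.0, 2), (1, 1, 1, 1, 250.0, 1)],
)

# A typical star join: the fact table joined to the dimensions it is grouped by.
rows = conn.execute("""
    SELECT t.quarter, l.city, SUM(s.dollars_sold), SUM(s.units_sold)
    FROM sales AS s
    JOIN time AS t     ON s.time_key = t.time_key
    JOIN location AS l ON s.location_key = l.location_key
    GROUP BY t.quarter, l.city
""").fetchall()
print(rows)
```

Every dimension is one join away from the fact table, which keeps analytical queries simple at the cost of some redundancy inside the dimension tables.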

Page 29: Snowflake Schema

Figure 3.4 Snowflake schema of a data warehouse for sales.

• Sales Fact Table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_key
• supplier: supplier_key, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city_key
• city: city_key, city, province_or_state, country
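The snowflake refinement can be sketched for the location dimension alone: the city attributes move into their own table and location keeps only a city_key. The fact table is reduced to the columns needed here, and all types and rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- In the star schema, location holds city, province_or_state and country directly.
-- The snowflake schema normalizes them into a separate city table.
CREATE TABLE city     (city_key INTEGER PRIMARY KEY, city TEXT,
                       province_or_state TEXT, country TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city_key INTEGER REFERENCES city(city_key));
CREATE TABLE sales    (location_key INTEGER REFERENCES location(location_key),
                       dollars_sold REAL, units_sold INTEGER);

INSERT INTO city VALUES (1, 'Vancouver', 'British Columbia', 'Canada');
-- Two locations share one city row, so the city attributes are stored once.
INSERT INTO location VALUES (1, '1 Main St', 1), (2, '9 Oak Ave', 1);
INSERT INTO sales VALUES (1, 500.0, 2), (2, 300.0, 1);
""")

# Reaching a city attribute now takes two joins instead of one.
rows = conn.execute("""
    SELECT c.city, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN location AS l ON s.location_key = l.location_key
    JOIN city AS c     ON l.city_key = c.city_key
    GROUP BY c.city
""").fetchall()
city_rows = conn.execute("SELECT COUNT(*) FROM city").fetchone()[0]
print(rows, city_rows)
```

The trade-off is visible in the query: less redundancy in the dimension data, but one more join for every query that needs city-level attributes.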

Page 30: Fact Constellation

Figure 3.5 Fact constellation schema of a data warehouse for sales and shipping.

• Sales Fact Table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
• Shipping Fact Table: item_key, time_key, shipper_key, from_location, to_location, dollars_sold, units_shipped
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, province_or_state, country
• shipper: shipper_key, shipper_name, location_key, shipper_type
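A fact constellation can be sketched as two fact tables that reference the same dimension tables. To stay short, the dimensions below are cut down to two columns each and the fact tables to one measure each. Types and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimension tables (reduced to two columns each).
CREATE TABLE time (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT);
-- Two fact tables reference the same dimensions.
CREATE TABLE sales    (time_key INTEGER REFERENCES time(time_key),
                       item_key INTEGER REFERENCES item(item_key),
                       units_sold INTEGER);
CREATE TABLE shipping (time_key INTEGER REFERENCES time(time_key),
                       item_key INTEGER REFERENCES item(item_key),
                       units_shipped INTEGER);

INSERT INTO time VALUES (1, 'Q1');
INSERT INTO item VALUES (1, 'laptop');
INSERT INTO sales VALUES (1, 1, 5), (1, 1, 3);
INSERT INTO shipping VALUES (1, 1, 6);
""")

# Aggregate each fact table on its own, then line the results up
# through the shared dimension keys.
rows = conn.execute("""
    SELECT i.item_name, t.quarter, s.sold, h.shipped
    FROM (SELECT time_key, item_key, SUM(units_sold) AS sold
          FROM sales GROUP BY time_key, item_key) AS s
    JOIN (SELECT time_key, item_key, SUM(units_shipped) AS shipped
          FROM shipping GROUP BY time_key, item_key) AS h
      ON s.time_key = h.time_key AND s.item_key = h.item_key
    JOIN item AS i ON i.item_key = s.item_key
    JOIN time AS t ON t.time_key = s.time_key
""").fetchall()
print(rows)
```

Each fact table is aggregated before the join so that rows from one are not multiplied by rows from the other. The shared time and item dimensions are what make the two measures comparable.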

Page 31: Exercise

• Exercise 3.5 (a) – page 153