Dr. M. Sulaiman Khan ([email protected]) Dept. of Computer Science University of Liverpool 2010
Dr. M. Sulaiman Khan
Dept. of Computer Science
University of Liverpool
2010
COMP207: Data Mining
Data Warehousing
Today's Topics
  Data Warehouses
  Data Cubes
  Warehouse Schemas
  OLAP
  Materialisation
What is a Data Warehouse?
  Most common definition: “A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision-making process.” - W. H. Inmon
  Corporate focused, assumes a lot of data, and typically sales related
  Data for a “Decision Support System” or “Management Support System”
  1996 survey: Return on Investment of 400+%
  Data Warehousing: the process of constructing (and using) a data warehouse
Data Warehouse
  Subject-oriented:
    Focused on important subjects, not transactions
    Concise view with only the data useful for decision making
  Integrated:
    Constructed from multiple, heterogeneous data sources
    Normally distributed relational databases, not necessarily with the same schema
    Cleaning and pre-processing techniques applied for missing, noisy and inconsistent data (sounds familiar, I hope)
Data Warehouse
  Time-variant:
    Has different values for the same fields over time
    An operational database only has the current value; a data warehouse offers historical values
  Nonvolatile:
    Physically separate store
    Updates not online, but in offline batch mode only
    Read-only access required, so no concurrency issues
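The time-variant and nonvolatile properties can be illustrated with a minimal sketch. Everything here (the `operational_price` store, the `batch_load` and `price_as_of` helpers, the prices and dates) is hypothetical and only meant to show an overwritten current value next to an append-only history.

```python
from datetime import date

# Operational store: one current value per product, overwritten in place.
operational_price = {"laptop": 499}

# Warehouse: append-only history of (load_date, product, price) rows.
warehouse_price_history = []

def batch_load(load_date, snapshot):
    """Append a dated copy of the operational snapshot; never update in place."""
    for product, price in snapshot.items():
        warehouse_price_history.append((load_date, product, price))

batch_load(date(2010, 1, 31), operational_price)
operational_price["laptop"] = 449   # online update loses the old value
batch_load(date(2010, 2, 28), operational_price)

def price_as_of(product, when):
    """Latest warehoused price for `product` loaded on or before `when`."""
    rows = [(d, p) for d, prod, p in warehouse_price_history
            if prod == product and d <= when]
    return max(rows)[1] if rows else None
```

After the second load the operational store only knows the new price, while `price_as_of("laptop", date(2010, 2, 1))` can still answer from the January snapshot.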
Data Warehouse
  Data warehouses are distinct from:
    Distributed DB: Integrated via wrappers/mediators. Far too slow, and semantic integration is much more complicated. In a warehouse, integration is done before loading, not at run time.
    Operational DB: Only records the current value, along with lots of extra information that is not useful for analysis. Different schemas/models, access patterns, users and functions, even though the warehouse data is derived from an operational DB.
OLAP vs OLTP
  OLAP: Online Analytical Processing (Data Warehouse)
  OLTP: Online Transaction Processing (Traditional DBMS)
  OLAP data is typically historical, consolidated and multi-dimensional (e.g. product, time, location).
  Involves lots of full database scans, across terabytes or more of data.
  Typically uses aggregation and summarisation functions.
  Distinctly different uses to OLTP on the operational database.
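The difference in access patterns can be sketched with Python's built-in `sqlite3` module. The `sales` table, its columns and its rows are invented for illustration; a real warehouse would not share one table between both workloads.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, "
    "quarter TEXT, city TEXT, units INTEGER)"
)
conn.executemany(
    "INSERT INTO sales (product, quarter, city, units) VALUES (?, ?, ?, ?)",
    [
        ("laptop", "Q1", "Liverpool", 5),
        ("laptop", "Q2", "Liverpool", 7),
        ("phone", "Q1", "Liverpool", 3),
        ("laptop", "Q1", "Chester", 2),
    ],
)

# OLTP-style access: read and write a single row, located by its key.
conn.execute("UPDATE sales SET units = units + 1 WHERE id = ?", (1,))
oltp_row = conn.execute("SELECT units FROM sales WHERE id = ?", (1,)).fetchone()

# OLAP-style access: scan the whole table and summarise it.
olap_summary = conn.execute(
    "SELECT product, quarter, SUM(units) FROM sales "
    "GROUP BY product, quarter ORDER BY product, quarter"
).fetchall()
```

The OLTP statements touch one row each, whereas the OLAP query has to read every row before it can return its three summary rows.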
Data Cubes
  Data is normally multi-dimensional, and can be thought of as a cube.
  Often: 3 dimensions of time, location and product.
  No need to have just 3 dimensions: a cube for cars could have make, colour, price, location and time, for example.
  (Image courtesy of IBM OLAP Miner documentation)
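One simple way to picture a cube in code is a mapping from one coordinate per dimension to a measure. This is only a sketch: the dimension names follow the slide, while the facts and the `cell` helper are made up.

```python
from collections import defaultdict

DIMENSIONS = ("time", "location", "product")

# Each cell is addressed by one value per dimension and holds units sold.
cube = defaultdict(int)
facts = [
    ("Q1", "Liverpool", "laptop", 5),
    ("Q1", "Liverpool", "phone", 3),
    ("Q2", "Liverpool", "laptop", 7),
    ("Q1", "Chester", "laptop", 2),
    ("Q1", "Liverpool", "laptop", 1),  # same cell as the first fact
]
for time, location, product, units in facts:
    cube[(time, location, product)] += units

def cell(time, location, product):
    """Return the measure for one cell, or 0 if the cell is empty."""
    return cube.get((time, location, product), 0)
```

Adding a fourth or fifth dimension only lengthens the key tuple, which is why nothing restricts a cube to three dimensions.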
Data Cubes
  Can construct many 'cuboids' from the full cube by excluding dimensions.
  In an N-dimensional data cube, the cuboid with all N dimensions is the 'base cuboid'. The 0-dimensional cuboid, which aggregates over every dimension into a single summary value, is called the 'apex cuboid'.
  Can think of this as a lattice of cuboids...
  (Following lattice courtesy of Han & Kamber)
Lattice of Cuboids
  0-D (apex) cuboid: all
  1-D cuboids: time; item; location; supplier
  2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
  3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
  4-D (base) cuboid: time,item,location,supplier
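The lattice for the four dimensions time, item, location and supplier can be enumerated with `itertools.combinations`. This is a small sketch; the `cuboid_lattice` helper is a name chosen here, and it ignores hierarchies within a dimension.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Return {k: [all cuboids that keep exactly k dimensions]} for k = 0..N."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

lattice = cuboid_lattice(("time", "item", "location", "supplier"))
total_cuboids = sum(len(level) for level in lattice.values())
```

Level 0 holds the single apex cuboid `()` and level 4 the single base cuboid. Without hierarchies, N dimensions always give 2^N cuboids in total.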
Multi-dimensional Units
  Each dimension can also be thought of in terms of different units.
    Time: decade, year, quarter, month, day, hour (and week, which isn't strictly hierarchical with the others!)
    Location: continent, country, state, city, store
    Product: electronics, computer, laptop, dell, inspiron
  This is called a “Star-Net” model in data warehousing, and allows for various operations on the dimensions and the resulting cuboids.
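The remark about weeks can be checked with a short sketch. It assumes weeks run Monday to Sunday; the helper names `week_days` and `quarter` are chosen here for illustration.

```python
from datetime import date, timedelta

def week_days(d):
    """Return the seven dates of the Monday-to-Sunday week containing d."""
    monday = d - timedelta(days=d.weekday())
    return [monday + timedelta(days=i) for i in range(7)]

def quarter(d):
    """Map a date to its calendar quarter, 1 to 4."""
    return (d.month - 1) // 3 + 1

days = week_days(date(2010, 3, 31))
months_in_week = {d.month for d in days}
quarters_in_week = {quarter(d) for d in days}
```

The week containing 31 March 2010 covers days in both March and April, and therefore in both Q1 and Q2. A day always rolls up to exactly one month and one quarter, but a week does not, so it sits outside the day, month, quarter, year chain.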
Star-Net Model
  (Figure: one radial line per dimension, marked with its levels)
  Shipping Method: AIR-EXPRESS, TRUCK
  Customer Orders: ORDER, CONTRACTS
  Customer
  Product: PRODUCT GROUP, PRODUCT LINE, PRODUCT ITEM
  Organization: SALES PERSON, DISTRICT, DIVISION
  Promotion
  Geography: DISTRICT, REGION, COUNTRY
  Time: DAILY, QTRLY, ANNUALLY
Data Cube Operations
  Roll Up: Summarise data by climbing up a hierarchy.
    E.g. from monthly to quarterly, from Liverpool to England
  Drill Down: Opposite of Roll Up.
    E.g. from computer to laptop, from £100-999 to £100-199
  Slice: Remove a dimension by setting a value for it.
    E.g. location/product where time is Q1 2007
  Dice: Restrict the cube by setting values for multiple dimensions.
    E.g. a sub-cube of Q1,Q2 / North American cities / 3 products
  Pivot: Rotate the cube (mostly for visualisation)
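The operations can be sketched on a tiny cube held as a dictionary. All of the data, the `MONTH_TO_QUARTER` mapping and the function names are made up for illustration; real OLAP engines implement these far more efficiently.

```python
from collections import defaultdict

# Cells keyed by (month, city, product) -> units sold.
cube = {
    ("Jan", "Liverpool", "laptop"): 4,
    ("Feb", "Liverpool", "laptop"): 2,
    ("Apr", "Liverpool", "laptop"): 7,
    ("Jan", "Chester", "phone"): 3,
    ("May", "Chester", "laptop"): 1,
}
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
                    "Apr": "Q2", "May": "Q2", "Jun": "Q2"}

def roll_up_time(cells):
    """Roll up the time dimension from month to quarter by summing."""
    out = defaultdict(int)
    for (month, city, product), units in cells.items():
        out[(MONTH_TO_QUARTER[month], city, product)] += units
    return dict(out)

def slice_(cells, axis, value):
    """Fix one dimension to a single value and drop it from the key."""
    return {k[:axis] + k[axis + 1:]: v
            for k, v in cells.items() if k[axis] == value}

def dice(cells, allowed):
    """Keep cells whose coordinates are in the allowed set of each listed axis."""
    return {k: v for k, v in cells.items()
            if all(k[axis] in values for axis, values in allowed.items())}

def pivot(cells, order):
    """Reorder the dimensions of every key; the values are unchanged."""
    return {tuple(k[i] for i in order): v for k, v in cells.items()}

by_quarter = roll_up_time(cube)
q1_slice = slice_(by_quarter, 0, "Q1")
sub_cube = dice(by_quarter, {0: {"Q1", "Q2"}, 1: {"Liverpool"}})
rotated = pivot(by_quarter, (2, 1, 0))
```

Drill down has no function of its own here: going from `by_quarter` back to monthly figures means returning to the finer-grained `cube`, since the detail cannot be recovered from the summary.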
Data Cube Schemas
  Star Schema: Single fact table in the middle, with a connected set of dimension tables (hence a star)
  Snowflake Schema: Some of the dimension tables are further refined into smaller dimension tables (hence it looks like a snowflake)
  Fact Constellation: Multiple fact tables can share dimension tables (hence it looks like a collection of star schemas; also called a Galaxy Schema)
Star Schema
  Sales Fact Table: time_key, item_key, location_key, units_sold (measure)
  Time Dimension: time_key, day, day_of_week, month, quarter, year
  Item Dimension: item_key, name, brand, type, supplier_type
  Location Dimension: location_key, street, city, state, country, continent
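As a rough sketch, the star schema can be declared and queried with Python's `sqlite3` module. The table names (`time_dim`, `item_dim`, `location_dim`, `sales_fact`) and all rows are assumptions made for this example; only the column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
    month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, name TEXT, brand TEXT,
    type TEXT, supplier_type TEXT);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
    state TEXT, country TEXT, continent TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold INTEGER);  -- the measure
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 15, "Fri", "Jan", "Q1", 2010),
    (2, 12, "Mon", "Apr", "Q2", 2010),
])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?, ?)", [
    (1, "inspiron", "dell", "laptop", "wholesale"),
    (2, "example phone", "example brand", "phone", "retail"),
])
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "1 Example Street", "Liverpool", "England", "UK", "Europe"),
    (2, "2 Example Road", "Chester", "England", "UK", "Europe"),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 5), (1, 2, 1, 3), (2, 1, 1, 7), (1, 1, 2, 2),
])

# Join the fact table to two dimensions and aggregate the measure.
units_by_quarter_city = conn.execute("""
    SELECT t.quarter, l.city, SUM(f.units_sold)
    FROM sales_fact f
    JOIN time_dim t ON f.time_key = t.time_key
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY t.quarter, l.city
    ORDER BY t.quarter, l.city
""").fetchall()
```

Every dimension is one join away from the fact table, so a typical query joins only the dimensions it groups by.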
Snowflake Schema
  Sales Fact Table: time_key, item_key, location_key, units_sold (measure)
  Time Dimension: time_key, day, day_of_week, month, quarter, year
  Item Dimension: item_key, name, brand, type, supplier_key
  Location Dimension: location_key, street, city_key
  City Dimension: city_key, city, state, country
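Moving from a star to a snowflake layout amounts to normalising a dimension. The sketch below splits a denormalised location dimension into location and city tables; the `snowflake_locations` helper and the sample rows are hypothetical.

```python
def snowflake_locations(star_rows):
    """Split denormalised location rows into location and city tables.

    star_rows: (location_key, street, city, state, country) tuples.
    Returns (location_rows, city_rows), linked by a generated city_key.
    """
    city_keys = {}  # (city, state, country) -> city_key
    location_rows = []
    for location_key, street, city, state, country in star_rows:
        # Reuse the key if this city was seen before, else assign the next one.
        city_key = city_keys.setdefault((city, state, country), len(city_keys) + 1)
        location_rows.append((location_key, street, city_key))
    city_rows = [(key, *attrs) for attrs, key in city_keys.items()]
    return location_rows, city_rows

star_location_dim = [
    (1, "1 Example Street", "Liverpool", "England", "UK"),
    (2, "9 Sample Lane", "Liverpool", "England", "UK"),
    (3, "2 Example Road", "Chester", "England", "UK"),
]
location_rows, city_rows = snowflake_locations(star_location_dim)
```

The two Liverpool stores now share one city row, which removes the repeated city, state and country values at the cost of an extra join at query time.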
Fact Constellation
  Sales Fact Table: time_key, item_key, location_key, units_sold (measure)
  Shipping Table: time_key, item_key, from_key, units_shipped
  Time Dimension: time_key, day, day_of_week, month, quarter, year
  Item Dimension: item_key, name, brand, type, supplier_key
  Location Dimension: location_key, street, city_key
  City Dimension: city_key, city, state, country
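A minimal sketch of two fact tables sharing a dimension, again using `sqlite3`. The table names and rows are invented, and only the item dimension is modelled to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER, item_key INTEGER, location_key INTEGER, units_sold INTEGER);
CREATE TABLE shipping_fact (
    time_key INTEGER, item_key INTEGER, from_key INTEGER, units_shipped INTEGER);
INSERT INTO item_dim VALUES (1, 'laptop'), (2, 'phone');
INSERT INTO sales_fact VALUES (1, 1, 1, 5), (2, 1, 1, 7), (1, 2, 1, 3);
INSERT INTO shipping_fact VALUES (1, 1, 9, 10), (1, 2, 9, 4), (2, 2, 9, 1);
""")

# Aggregate each fact table separately, then line the totals up per item.
sold_vs_shipped = conn.execute("""
    SELECT i.name,
           (SELECT COALESCE(SUM(s.units_sold), 0)
              FROM sales_fact s WHERE s.item_key = i.item_key),
           (SELECT COALESCE(SUM(sh.units_shipped), 0)
              FROM shipping_fact sh WHERE sh.item_key = i.item_key)
    FROM item_dim i
    ORDER BY i.name
""").fetchall()
```

Each fact table is summed on its own before the results are combined. Joining both fact tables to the dimension in one step would pair every sales row with every shipping row for the same item and inflate the totals.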
OLAP Server Types
  ROLAP: Relational OLAP
    Uses a relational DBMS to store and manage the warehouse data
    Optimised for non-traditional access patterns
    Lots of research into RDBMS to make use of!
  MOLAP: Multidimensional OLAP
    Sparse array-based storage engine
    Fast access to precomputed data
  HOLAP: Hybrid OLAP
    Mixture of both MOLAP and ROLAP
Data Warehouse Architecture
  (Figure also courtesy of Han & Kamber; only its labels survive)
  Data Sources: Operational DBs, other sources
  Extract, Transform, Load, Refresh (with Monitor & Integrator and Metadata)
  Data Storage: Data Warehouse, Data Marts
  Serve
  OLAP Server: OLAP Engine
  Front-End Tools: Analysis, Query, Reports, Data mining
Materialisation
  In order to compute OLAP queries efficiently, need to materialise some of the cuboids from the data.
    None: Very slow, as need to compute the entire cube at run time
    Full: Very fast, but requires a LOT of storage space and time to compute all possible cuboids
    Partial: But which ones to materialise?
  A partially materialised cube is called an 'iceberg cube', as the rest is "below water". Many cells in a cuboid will be empty, so only materialise the cells that contain more values than a minimum threshold.
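The iceberg idea can be sketched naively as follows. This is not an efficient iceberg-cube algorithm: it counts every cell of every cuboid and then discards the small ones. It also assumes the threshold applies to the number of fact rows in a cell and that a cell is kept when its count is at least `min_count`; the data and function name are made up.

```python
from collections import Counter
from itertools import combinations

DIMENSIONS = ("time", "location", "product")
facts = [
    ("Q1", "Liverpool", "laptop"),
    ("Q1", "Liverpool", "laptop"),
    ("Q1", "Liverpool", "phone"),
    ("Q2", "Liverpool", "laptop"),
    ("Q2", "Chester", "laptop"),
]

def iceberg_cube(rows, min_count):
    """For every cuboid, keep only the cells with at least min_count rows."""
    result = {}
    n = len(DIMENSIONS)
    for k in range(n + 1):
        for axes in combinations(range(n), k):
            # Project each row onto the chosen dimensions and count the cells.
            counts = Counter(tuple(row[a] for a in axes) for row in rows)
            kept = {cell: c for cell, c in counts.items() if c >= min_count}
            if kept:
                result[tuple(DIMENSIONS[a] for a in axes)] = kept
    return result

iceberg = iceberg_cube(facts, min_count=2)
materialised_cells = sum(len(cells) for cells in iceberg.values())
```

With `min_count=2` only 10 of the 20 non-empty cells across the eight cuboids are kept, and the base cuboid shrinks from four cells to one.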
Further Reading
  http://en.wikipedia.org/wiki/Data_warehouse and subsequent links