Post on 14-Jan-2016
On-Line Analytical Processing
Warehousing
Data Cubes
(Data Mining)
(slides borrowed from Stanford)
Overview
Traditional database systems are tuned to many small, simple queries.
Some new applications use fewer, more time-consuming, complex queries.
New architectures have been developed to handle complex “analytic” queries efficiently.
The Data Warehouse
The most common form of data integration.
  Copy sources into a single DB (warehouse) and try to keep it up-to-date.
  Usual method: periodic reconstruction of the warehouse, perhaps overnight.
Frequently essential for analytic queries.
OLTP
Most database operations involve On-Line Transaction Processing (OLTP).
  Short, simple, frequent queries and/or modifications, each involving a small number of tuples.
  Examples: answering queries from a Web interface, sales at cash registers, selling airline tickets.
OLAP
Of increasing importance are On-Line Analytical Processing (OLAP) queries.
  Few, but complex queries; they may run for hours.
  Queries do not depend on having an absolutely up-to-date database.
OLAP Examples
1. Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer.
2. Analysts at Wal-Mart look for items with increasing sales in some region.
Common Architecture
Databases at store branches handle OLTP.
Local store databases copied to a central warehouse overnight.
Analysts use the warehouse for OLAP.
Loading the Data Warehouse
[Figure: Source Systems (OLTP) → Data Staging Area → Data Warehouse. Data is periodically extracted from the source systems, cleansed and transformed in the staging area, and users query the data warehouse.]
Terminology: ETL
ETL = Extraction, Transformation, & Load
  Extraction: Get the data out of the source systems
  Transformation: Convert the data into a useful format for analysis
  Load: Get the data into the data warehouse (…and build indexes, materialized views, etc.)
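The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the `sales` table layout, and the policy of dropping rows with missing quantities are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source rows, standing in for data pulled from an OLTP system.
SOURCE_ROWS = [
    {"state": "ca", "month": "Jul", "quantity": "45"},
    {"state": "OR ", "month": "Jul", "quantity": "33"},
    {"state": "wa", "month": "Jul", "quantity": None},  # missing value
]

def extract():
    """Extraction: get the data out of the source system (here, a list)."""
    return list(SOURCE_ROWS)

def transform(rows):
    """Transformation: normalize state codes, convert types, drop unusable rows."""
    clean = []
    for row in rows:
        if row["quantity"] is None:
            continue  # assumed policy: skip rows with a missing quantity
        clean.append((row["state"].strip().upper(), row["month"], int(row["quantity"])))
    return clean

def load(conn, rows):
    """Load: insert into the warehouse table and build an index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (state TEXT, month TEXT, quantity INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS sales_state ON sales (state)")
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))
loaded = warehouse.execute(
    "SELECT state, month, quantity FROM sales ORDER BY state"
).fetchall()
print(loaded)
```

In a real project each step is usually a separate, scheduled job; the sketch only shows how the responsibilities divide.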
Data Integration is Hard
Data warehouses combine data from multiple sources
  Data must be translated into a consistent format
Data integration represents ~80% of effort for a typical data warehouse project!
Some reasons why it’s hard:
  Metadata is often poor or non-existent
  Data quality is often bad
    Missing or default values
    Multiple spellings of the same thing (Cal vs. UC Berkeley vs. University of California)
  Inconsistent semantics
    What is an airline passenger?
Federated Databases
An alternative to data warehouses
Data warehouse
  Create a copy of all the data
  Execute queries against the copy
Federated database
  Pull data from source systems as needed to answer queries
“Lazy” vs. “eager” data integration
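[Figure: Data Warehouse vs. Federated Database. Data warehouse: data is extracted from the source systems into the warehouse; a query goes to the warehouse, which returns the answer. Federated database: a query goes to a mediator, which sends rewritten queries to the source systems and returns the answer.]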
Star Schemas
A star schema is a common organization for data at a warehouse. It consists of:
1. Fact table: a very large accumulation of facts such as sales.
   Often “insert-only.”
2. Dimension tables: smaller, generally static information about the entities involved in the facts.
Example: Star Schema
Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged.
The fact table is a relation:
Sales(bar, beer, drinker, day, time, price)
Example, Continued
The dimension tables include information about the bar, beer, and drinker “dimensions”:
Bars(bar, addr, license)
Beers(beer, manf)
Drinkers(drinker, addr, phone)
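A minimal sketch of this star schema, using Python's built-in sqlite3 module. The relation and attribute names come from the example; the column types, the sample rows, and the manufacturer names are assumptions made for illustration. The query joins the fact table to one dimension table and aggregates the dependent attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Bars(bar TEXT PRIMARY KEY, addr TEXT, license TEXT);
CREATE TABLE Beers(beer TEXT PRIMARY KEY, manf TEXT);
CREATE TABLE Drinkers(drinker TEXT PRIMARY KEY, addr TEXT, phone TEXT);
-- Fact table: one row per sale; its dimension attributes reference the
-- keys of the dimension tables.
CREATE TABLE Sales(
    bar TEXT REFERENCES Bars(bar),
    beer TEXT REFERENCES Beers(beer),
    drinker TEXT REFERENCES Drinkers(drinker),
    day TEXT, time TEXT, price REAL
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO Bars VALUES (?, ?, ?)",
                 [("Joe's Bar", "1 Main St", "L-001")])
conn.executemany("INSERT INTO Beers VALUES (?, ?)",
                 [("Bud", "Anheuser-Busch"), ("Miller Lite", "Miller")])
conn.executemany("INSERT INTO Drinkers VALUES (?, ?, ?)",
                 [("Ann", "2 Oak St", "555-0100"), ("Bob", "3 Elm St", "555-0101")])
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?)", [
    ("Joe's Bar", "Bud", "Ann", "2016-01-14", "20:00", 3.0),
    ("Joe's Bar", "Bud", "Bob", "2016-01-14", "21:00", 3.5),
    ("Joe's Bar", "Miller Lite", "Ann", "2016-01-15", "19:30", 4.0),
])

# Typical star-schema query: join facts to a dimension, then aggregate.
revenue_by_manf = conn.execute("""
    SELECT Beers.manf, SUM(Sales.price)
    FROM Sales JOIN Beers ON Sales.beer = Beers.beer
    GROUP BY Beers.manf
    ORDER BY Beers.manf
""").fetchall()
print(revenue_by_manf)
```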
Visualization – Star Schema
[Figure: the fact table Sales (dimension attributes and dependent attributes) at the center, linked to the dimension tables Bars, Beers, Drinkers, etc.]
Dimensions and Dependent Attributes
Two classes of fact-table attributes:
1. Dimension attributes: the key of a dimension table.
2. Dependent attributes: a value determined by the dimension attributes of the tuple.
Example: Dependent Attribute
price is the dependent attribute of our example Sales relation.
It is determined by the combination of dimension attributes: bar, beer, drinker, and the time (combination of day and time-of-day attributes).
Comparing Facts and Dimensions
Facts               Dimensions
Narrow              Wide
Big (many rows)     Small (few rows)
Numeric             Descriptive
Growing over time   Static
Facts contain numbers, dimensions contain labels
Cross Tabulation of sales by item-name and color
The table above is an example of a cross-tabulation (cross-tab), also referred to as a pivot-table.
A cross-tab is a table where
  values for one of the dimension attributes form the row headers,
  values for another dimension attribute form the column headers
  Values in individual cells are (aggregates of) the values of the dimension attributes that specify the cell.
Marginals
The data cube also includes aggregation (typically SUM) along the margins of the cube.
The marginals include aggregations over one dimension, two dimensions,…
Visualization - Data Cube w/ Aggregation
[Figure: a data cube with axes bar, beer, and drinker whose cells hold price; one face of the cube holds the SUM over all drinkers.]
Example: Marginals
Our 4-dimensional Sales cube includes the sum of price over each bar, each beer, each drinker, and each time unit (perhaps days).
It would also have the sum of price over all bar-beer pairs, all bar-drinker-day triples,…
Structure of the Cube
Think of each dimension as having an additional value *.
A point with one or more *’s in its coordinates aggregates over the dimensions with the *’s.
Example: Sales(“Joe’s Bar”, “Bud”, *, *) holds the sum over all drinkers and all time of the Bud consumed at Joe’s.
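The `*` convention can be sketched directly: every fact contributes to each cube point obtained by keeping or starring each of its coordinates. The fact tuples below are hypothetical, the time dimension is reduced to a day name, and `build_cube` is an illustrative helper, not part of any library.

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact tuples: (bar, beer, drinker, day, price).
SALES = [
    ("Joe's Bar", "Bud", "Ann", "Mon", 3.0),
    ("Joe's Bar", "Bud", "Bob", "Tue", 3.5),
    ("Joe's Bar", "Miller", "Ann", "Mon", 4.0),
    ("Sue's Bar", "Bud", "Ann", "Tue", 2.5),
]

def build_cube(facts):
    """Sum price at every point of the cube, including points with '*'."""
    cube = defaultdict(float)
    for *dims, price in facts:
        # Each fact contributes to 2^n points: every dimension is either
        # kept as-is or replaced by '*'.
        for mask in product((False, True), repeat=len(dims)):
            point = tuple("*" if starred else value
                          for value, starred in zip(dims, mask))
            cube[point] += price
    return dict(cube)

cube = build_cube(SALES)
# Sum over all drinkers and all days of the Bud sold at Joe's Bar.
print(cube[("Joe's Bar", "Bud", "*", "*")])
```

Materializing every point like this is only practical for small cubes; it is meant to show what the starred coordinates denote.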
Relational Representation
Crosstabs can be represented as relations
  The value all is used to represent aggregates
  The SQL:1999 standard actually uses null values in place of all
Three-Dimensional Data Cube
A data cube is a multidimensional generalization of a crosstab
Cannot view a three-dimensional object in its entirety, but crosstabs can be used as views on a data cube
Data Cube
Axes of the cube represent attributes of the data records
  e.g. color, month, state
  Called dimensions
Cells hold aggregated measurements
  e.g. total $ sales, number of autos sold
  Called facts
Real data cubes have >> 3 dimensions
[Figure: Auto Sales cube with axes month (Jul, Aug, Sep), state (CA, OR, WA), and color (Red, Blue, Gray).]
Slicing and Dicing
[Figure: the Auto Sales cube (months Jul, Aug, Sep; states CA, OR, WA; colors Red, Blue, Gray) is cut into color slices; the Blue slice is shown as a month-by-state grid and then as Blue totals by month.]
Querying the Data Cube
Cross-tabulation
  “Cross-tab” for short
  Report data grouped by 2 dimensions
  Aggregate across other dimensions
  Include subtotals
Operations on a cross-tab
  Roll up (further aggregation)
  Drill down (less aggregation)

Number of Autos Sold
        CA    OR    WA    Total
Jul     45    33    30    108
Aug     50    36    42    128
Sep     38    31    40    109
Total   133   100   112   345
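As a rough sketch, a cross-tab with subtotals can be built from flat (month, state, count) rows by adding each value to its own cell, its row total, its column total, and the grand total. The numbers are the ones in the Number of Autos Sold table; `crosstab` is an illustrative helper name.

```python
# (month, state, autos sold), taken from the Number of Autos Sold table.
ROWS = [
    ("Jul", "CA", 45), ("Jul", "OR", 33), ("Jul", "WA", 30),
    ("Aug", "CA", 50), ("Aug", "OR", 36), ("Aug", "WA", 42),
    ("Sep", "CA", 38), ("Sep", "OR", 31), ("Sep", "WA", 40),
]

def crosstab(rows):
    """Pivot (row_key, col_key, value) triples into a dict of dicts with totals."""
    table = {}
    for row_key, col_key, value in rows:
        # Each value lands in four cells: the cell itself, the row total,
        # the column total, and the grand total.
        for r in (row_key, "Total"):
            for c in (col_key, "Total"):
                cells = table.setdefault(r, {})
                cells[c] = cells.get(c, 0) + value
    return table

table = crosstab(ROWS)
for month in ("Jul", "Aug", "Sep", "Total"):
    print(month, [table[month][c] for c in ("CA", "OR", "WA", "Total")])
```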
Roll Up and Drill Down

Number of Autos Sold
        CA    OR    WA    Total
Jul     45    33    30    108
Aug     50    36    42    128
Sep     38    31    40    109
Total   133   100   112   345

Roll up by Month:

Number of Autos Sold
        CA    OR    WA    Total
        133   100   112   345

Drill down by Color:

Number of Autos Sold
        CA    OR    WA    Total
Red     40    29    40    109
Blue    45    31    37    113
Gray    48    40    35    123
Total   133   100   112   345
Full Data Cube with Subtotals
Pre-computation of aggregates → fast answers to OLAP queries
Ideally, pre-compute all 2^n types of subtotals (n = number of dimensions)
Otherwise, perform aggregation as needed
Coarser-grained totals can be computed from finer-grained totals
  But not the other way around
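A small sketch of computing coarser totals from finer ones. It works because SUM can be re-aggregated from partial sums; the per-(state, month) counts are those of the auto sales example, and `roll_up` is a hypothetical helper.

```python
from collections import defaultdict

# Finer-grained totals: autos sold per (state, month).
BY_STATE_MONTH = {
    ("CA", "Jul"): 45, ("CA", "Aug"): 50, ("CA", "Sep"): 38,
    ("OR", "Jul"): 33, ("OR", "Aug"): 36, ("OR", "Sep"): 31,
    ("WA", "Jul"): 30, ("WA", "Aug"): 42, ("WA", "Sep"): 40,
}

def roll_up(totals, keep):
    """Aggregate away every key position that is not listed in `keep`."""
    coarser = defaultdict(int)
    for key, value in totals.items():
        coarser[tuple(key[i] for i in keep)] += value
    return dict(coarser)

by_state = roll_up(BY_STATE_MONTH, keep=(0,))  # aggregate away the month
by_month = roll_up(BY_STATE_MONTH, keep=(1,))  # aggregate away the state
grand_total = roll_up(by_state, keep=())       # computed from the state totals
print(by_state, by_month, grand_total)
```

The reverse direction fails: knowing only that CA sold 133 autos does not say how they split across Jul, Aug, and Sep.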
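Data Cube Lattice
[Figure: lattice of groupings. Top: Total. Next level: State; Month; Color. Next level: (State, Month); (State, Color); (Month, Color). Bottom: (State, Month, Color). Drill down moves toward the bottom; roll up moves toward Total.]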
MOLAP vs. ROLAP
MOLAP = Multidimensional OLAP
  Store data cube as multidimensional array
  (Usually) pre-compute all aggregates
Advantages:
  Very efficient data access → fast answers
Disadvantages:
  Doesn’t scale to large numbers of dimensions
  Requires special-purpose data store
Sparsity
Imagine a data warehouse for Safeway.
  Suppose dimensions are: Customer, Product, Store, Day
  If there are 100,000 customers, 10,000 products, 1,000 stores, and 1,000 days…
  …data cube has 1,000,000,000,000,000 cells!
Fortunately, most cells are empty.
  A given store doesn’t sell every product on every day.
  A given customer has never visited most of the stores.
  A given customer has never purchased most products.
Multi-dimensional arrays are not an efficient way to store sparse data.
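The cell count is just the product of the dimension sizes. The sketch below checks that arithmetic and contrasts a dense array with one possible sparse alternative, a dictionary keyed by coordinates that stores only non-empty cells. The ids and amounts passed to `record_sale` are made up for illustration.

```python
customers, products, stores, days = 100_000, 10_000, 1_000, 1_000

# A dense multidimensional array needs one cell per coordinate combination.
dense_cells = customers * products * stores * days
print(f"{dense_cells:,}")  # 1,000,000,000,000,000

# A sparse representation stores only the cells that actually hold data.
sparse_cube = {}

def record_sale(customer, product, store, day, amount):
    """Add `amount` to the cell at the given coordinates, creating it if needed."""
    key = (customer, product, store, day)
    sparse_cube[key] = sparse_cube.get(key, 0.0) + amount

# Hypothetical sales: two purchases hit the same cell, one hits another.
record_sale(17, 4021, 3, 250, 2.99)
record_sale(17, 4021, 3, 250, 2.99)
record_sale(42, 77, 9, 251, 10.00)
print(len(sparse_cube))  # 2
```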
MOLAP vs. ROLAP
ROLAP = Relational OLAP
  Store data cube in relational database
  Express queries in SQL
Advantages:
  Scales well to high dimensionality
  Scales well to large data sets
  Sparsity is not a problem
  Uses well-known, mature technology
Disadvantages:
  Query performance is slower than MOLAP
  Need to construct explicit indexes
Creating a Cross-tab with SQL
SELECT state, month, SUM(quantity)
FROM sales
WHERE color = 'Red'
GROUP BY state, month
Grouping attributes: state, month
Measurements: SUM(quantity)
Filters: color = 'Red'
What about the totals?
SQL aggregation query with GROUP BY does not produce subtotals, totals
Our cross-tab report is incomplete.

Query result:
State  Month  SUM
CA     Jul    45
CA     Aug    50
CA     Sep    38
OR     Jul    33
OR     Aug    36
OR     Sep    31
WA     Jul    30
WA     Aug    42
WA     Sep    40

Number of Autos Sold
        CA    OR    WA    Total
Jul     45    33    30    ?
Aug     50    36    42    ?
Sep     38    31    40    ?
Total   ?     ?     ?     ?
One solution: a big UNION ALL
-- Original query
SELECT state, month, SUM(quantity)
FROM sales
WHERE color = 'Red'
GROUP BY state, month
UNION ALL
-- State subtotals
SELECT state, 'ALL', SUM(quantity)
FROM sales
WHERE color = 'Red'
GROUP BY state
UNION ALL
-- Month subtotals
SELECT 'ALL', month, SUM(quantity)
FROM sales
WHERE color = 'Red'
GROUP BY month
UNION ALL
-- Overall total
SELECT 'ALL', 'ALL', SUM(quantity)
FROM sales
WHERE color = 'Red'
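The UNION ALL approach can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the `sales` table layout (state, month, color, quantity) is inferred from the queries, the Red rows reuse the per-state, per-month counts from the cross-tab, and one Blue row is added only to show that the filter excludes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (state TEXT, month TEXT, color TEXT, quantity INTEGER)")

# Assumed data: the Red rows reproduce the cross-tab counts.
red_rows = [
    ("CA", "Jul", 45), ("CA", "Aug", 50), ("CA", "Sep", 38),
    ("OR", "Jul", 33), ("OR", "Aug", 36), ("OR", "Sep", 31),
    ("WA", "Jul", 30), ("WA", "Aug", 42), ("WA", "Sep", 40),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, 'Red', ?)", red_rows)
conn.execute("INSERT INTO sales VALUES ('CA', 'Jul', 'Blue', 7)")  # filtered out

query = """
SELECT state, month, SUM(quantity) FROM sales WHERE color = 'Red' GROUP BY state, month
UNION ALL
SELECT state, 'ALL', SUM(quantity) FROM sales WHERE color = 'Red' GROUP BY state
UNION ALL
SELECT 'ALL', month, SUM(quantity) FROM sales WHERE color = 'Red' GROUP BY month
UNION ALL
SELECT 'ALL', 'ALL', SUM(quantity) FROM sales WHERE color = 'Red'
"""
# Index the result by (state, month) so individual cells are easy to look up.
result = {(state, month): total for state, month, total in conn.execute(query)}
print(len(result), result[("ALL", "ALL")])
```

The 9 detail rows, 3 state subtotals, 3 month subtotals, and 1 overall total give 16 rows, which fills in every cell of the cross-tab.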
A better solution
“UNION ALL” solution gets cumbersome with more than 2 grouping attributes
  n grouping attributes → 2^n parts in the union
OLAP extensions added to SQL 99 are more convenient
  CUBE, ROLLUP
SELECT state, month, SUM(quantity)
FROM sales
WHERE color = 'Red'
GROUP BY CUBE(state, month)
Results of the CUBE query
State  Month  SUM(quantity)
CA     Jul    45
CA     Aug    50
CA     Sep    38
CA     NULL   133
OR     Jul    33
OR     Aug    36
OR     Sep    31
OR     NULL   100
WA     Jul    30
WA     Aug    42
WA     Sep    40
WA     NULL   112
NULL   Jul    108
NULL   Aug    128
NULL   Sep    109
NULL   NULL   345
Notice the use of NULL for totals
Subtotals at all levels
ROLLUP vs. CUBE
CUBE computes entire lattice
ROLLUP computes one path through lattice
  Order of GROUP BY list matters
  Groups by all prefixes of the GROUP BY list
GROUP BY ROLLUP(A,B,C)
• A,B,C
• (A,B) subtotals
• (A) subtotals
• Total
GROUP BY CUBE(A,B,C)
• A,B,C
• Subtotals for the following: (A,B), (A,C), (B,C), (A), (B), (C)
• Total
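The difference between the two operators comes down to which groupings they enumerate: prefixes for ROLLUP, all subsets for CUBE. The helper names below are illustrative and do not belong to any SQL engine or library.

```python
from itertools import combinations

def rollup_groupings(attrs):
    """ROLLUP groups by every prefix of the list, longest first."""
    return [tuple(attrs[:k]) for k in range(len(attrs), -1, -1)]

def cube_groupings(attrs):
    """CUBE groups by every subset of the list, largest first."""
    return [combo
            for k in range(len(attrs), -1, -1)
            for combo in combinations(attrs, k)]

print(rollup_groupings(["A", "B", "C"]))
# [('A', 'B', 'C'), ('A', 'B'), ('A',), ()]
print(cube_groupings(["A", "B", "C"]))
# [('A', 'B', 'C'), ('A', 'B'), ('A', 'C'), ('B', 'C'), ('A',), ('B',), ('C',), ()]
```

For n attributes ROLLUP yields n + 1 groupings and CUBE yields 2^n; the empty tuple stands for the overall total.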
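ROLLUP example
[Figure: the data cube lattice. Top: Total. Next level: State; Month; Color. Next level: (State, Month); (State, Color); (Month, Color). Bottom: (State, Month, Color).]
SELECT color, month, state, SUM(quantity)
FROM sales
GROUP BY ROLLUP(color, month, state)
This computes one path through the lattice: (State, Month, Color) → (Month, Color) → (Color) → Total.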