Post on 24-Dec-2015
Introduction to Dimensional Analysis
Session 2
5/20/2005
Metadata Solutions
Dan McCreary
President
Dan McCreary & Associates
dan@danmccreary.com
(952) 931-9198
2
Agenda
• General introduction to Data Dictionaries that drive Business Intelligence (BI) concepts and terminology
• Understand why Data Dictionaries are so critical to accurate BI
• Understand how BI looks at the world in different ways
• Understand how data warehouse concepts and data dictionaries affect analysis and research
3
What is a Data Warehouse?
• Fast Retrieval
• Internally Consistent
• Slice and Dice Capability
• Easy to “Browse”
• Complete and Reliable
• Data Quality Controls
– GI-GO (Garbage-In, Garbage-Out)
Source: Ralph Kimball
4
Factors Driving Business Intelligence
• Computers process and store twice as much data per dollar every 18 months (Moore’s Law)
• People can make better decisions if they have tools to quickly see only the data they are interested in seeing
• People frequently want to analyze data in new ways that were unanticipated by the people creating "canned reports"
• Tools can be designed to allow non-technical users (non-SQL programmers) to generate their own reports
• People have an incredible ability to categorize things based on their properties and attributes, but if they don't have consistent definitions of these properties they will not generate consistent results
5
The BI Iterative Process
• The BI process is an ongoing iterative process where the structure of the data warehouse changes based on what data is critical to an organization's business objectives.
Diagram (cycle): Access Data Warehouse, Analysis; Insights, Conclusions and Findings; Publishing, Change, Data Gap Analysis, New Data Gathered; BI Project Management
6
BI Evolution
• Shorten the time-to-report interval
• Allow users to "browse" data sets interactively
• Remove programmers with "backlogs" of reports
• Users frequently waited days, weeks or months to get a custom report created
Figure: Increasing Responsiveness, from Monthly Green Bar Reports to a Browseable Graphical Interface
7
Dimensions of BI
Figure labels:
Degree of End User Control
Technical Sophistication Required: Low (analysts) to High (programmers)
Highly Responsive to "What If" Scenarios
Few Dimensions (few parameters, few filters) to Many Dimensions (many variables)
8
Overlapping Terminology
Business Intelligence
Data Mining
Transaction Processing (OLTP)
Dimensional Analysis
Indexing
Aggregates
Statistical Analysis
Pattern Discovery
Data Storage (RDBMS)
Data Warehousing
Data Dictionaries
Data Modeling
Semantics
9
Key Terms Covered in This Class
• Properties
• Dimension
• Aggregation and Levels
• Enumerations of Categorical Data
• Labeling Categories
• Giving precise definitions to Labels
• Dimension Hierarchies and Levels
• Cubes
• Measures
• Filters
• Data Warehouse Presentation
10
Things Have Many "Properties"
People are very good at recognizing and sorting things by their properties.
14
Dimensional Analysis
• The science of figuring out intuitive ways that people want to categorize information using independent variables to graphically filter and browse their data
15
Dimension
• List of categories used to partition the information based on a property of the objects
• Dimension Names: Color, Shape
16
Labels
• A name given to a non-overlapping category within a dimension
Labels: "red", "blue", "green"
17
Enumeration
• Whenever we break the continuous observable world into a predefined list of categories, where each category has a label, we call the result an "enumerated value domain". These domains then become the "dimensions" of our cube.
"red" "green" "blue"
Statisticians call this type of data "categorical data", and it requires the categories to be non-overlapping.
Note: NO OVERLAP!
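As a rough illustration of an enumerated value domain, the sketch below maps a continuous property onto the three color labels. The hue cut points and the `classify_hue` helper are hypothetical and are not taken from the slides; the only point is that the ranges are half-open, so every input lands in exactly one category.

```python
from enum import Enum


class Color(Enum):
    """Enumerated value domain for the Color dimension (labels from the slide)."""
    RED = "red"
    GREEN = "green"
    BLUE = "blue"


def classify_hue(hue_degrees: float) -> Color:
    """Map a continuous hue angle to exactly one Color label.

    The cut points (60, 180, 300) are assumptions made for this sketch.
    The ranges are half-open, so no hue belongs to two categories.
    """
    hue = hue_degrees % 360
    if hue < 60 or hue >= 300:
        return Color.RED
    if hue < 180:
        return Color.GREEN
    return Color.BLUE


# A borderline "blue-green" hue is still forced into a single category.
print(classify_hue(180).value)
```

A borderline value such as a blue-green hue still receives one label only, which is why the definition of each cut point needs to be agreed on and written down.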
18
The Challenge of Semantic Classification
• People are good at sorting based on a property they see
• People are good at assigning names to a property type
• People usually come up with different names for properties
• Some dimensions people easily agree on
• Some are very difficult to classify, and it is even more difficult to get people to agree on a non-overlapping classification system
Figure labels: "Polygon", "Square", "Red Circle", "Green", "Blue", "Blue-Green"
What happens when a small percentage of the data does not quite fit into a discrete category?
19
Level
• A layer of "aggregation" within a single dimension – categorization of properties
Figure (levels, top to bottom): All Shapes; Shapes With Curves and Shapes Without Curves; Circle, Heart, Square, Trapezoid, Moon, Star, Diamond
20
Measures (example weight)
Figure: sample shapes labeled with their weights (9.1, 3.5, 1.1, 9.3, 5.5, 6.6, 8.4, 7.4, 5.7, 6.1, 8.2, 2.6, 3.8, 10)
A measure is any property that you can perform math on (sums, averages).
22
Sample Object "Fact Table"
id  Color  Shape      DashStyle   Weight
1   blue   heart      solid       5.7
2   blue   star       small-dash  6.6
3   green  moon       large-dash  1.1
4   red    trapezoid  small-dash  3.8
5   green  square     solid       9.3
6   red    diamond    small-dash  8.2
7   green  circle     small-dash  2.6
8   red    circle     large-dash  3.5
9   blue   trapezoid  solid       5.5
10  blue   square     large-dash  10
11  blue   diamond    large-dash  8.4
12  red    heart      large-dash  9.1
13  red    moon       solid       7.4
14  green  star       solid       6.1
Measures tend to have data types of integers and floating point numbers.
Note that categorical data cannot be added together. But we can count the frequencies of items within a category!
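The difference between a measure and a categorical column can be sketched in a few lines of Python. The rows are copied from the sample fact table; the variable names are illustrative only.

```python
from collections import Counter
from statistics import mean

# Rows copied from the sample fact table: (id, color, shape, dash_style, weight)
FACTS = [
    (1, "blue", "heart", "solid", 5.7),
    (2, "blue", "star", "small-dash", 6.6),
    (3, "green", "moon", "large-dash", 1.1),
    (4, "red", "trapezoid", "small-dash", 3.8),
    (5, "green", "square", "solid", 9.3),
    (6, "red", "diamond", "small-dash", 8.2),
    (7, "green", "circle", "small-dash", 2.6),
    (8, "red", "circle", "large-dash", 3.5),
    (9, "blue", "trapezoid", "solid", 5.5),
    (10, "blue", "square", "large-dash", 10.0),
    (11, "blue", "diamond", "large-dash", 8.4),
    (12, "red", "heart", "large-dash", 9.1),
    (13, "red", "moon", "solid", 7.4),
    (14, "green", "star", "solid", 6.1),
]

# Weight is a measure, so arithmetic on it is meaningful.
weights = [weight for (_id, _color, _shape, _dash, weight) in FACTS]
total_weight = round(sum(weights), 1)
average_weight = round(mean(weights), 2)

# Color is categorical: it cannot be summed, but its labels can be counted.
color_counts = Counter(color for (_id, color, _shape, _dash, _weight) in FACTS)

print(total_weight, average_weight, dict(color_counts))
```

Summing the Color column would be meaningless, while counting its labels gives the frequency of each category.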
23
Shape Dimension
Shape Code  Has Curves  Definition
circle      Yes         A round shape with no corners.
diamond     No          A shape with four corners and parallel edges but not horizontal and vertical edges.
square      No          A shape with four corners with horizontal and vertical edges that are parallel.
heart       Yes         A shape that is round on the top and pointy on the bottom.
moon        Yes         A crescent shape.
trapezoid   No          Four corners, four sides but the sides are not all parallel.
Note that there is no reference to "Has Curves" in the prior table. "Has Curves" is a property of the shape value domain because it can be "inferred" from the shape of the object.
Some categorical definitions use "exclusionary" language.
Note that "Has Curves" also must have a precise definition in the data dictionary.
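One way to picture "Has Curves" as a property of the shape value domain is a lookup against the dimension table, as in the sketch below. The dictionary holds only the six rows of the shape dimension table; "star" appears in the sample fact table but has no row there, so the hypothetical `curve_level` helper reports it as undefined instead of guessing.

```python
from collections import Counter

# Shape dimension rows from the table above: shape code -> Has Curves
SHAPE_DIM = {
    "circle": True,
    "diamond": False,
    "square": False,
    "heart": True,
    "moon": True,
    "trapezoid": False,
}

# Shape codes of the 14 sample facts, in id order
FACT_SHAPES = [
    "heart", "star", "moon", "trapezoid", "square", "diamond", "circle",
    "circle", "trapezoid", "square", "diamond", "heart", "moon", "star",
]


def curve_level(shape):
    """Roll a shape code up to its "Has Curves" level via the dimension table."""
    if shape not in SHAPE_DIM:
        return "undefined"  # no dictionary entry, so nothing can be inferred
    return "with curves" if SHAPE_DIM[shape] else "without curves"


level_counts = Counter(curve_level(shape) for shape in FACT_SHAPES)
missing_definitions = sorted(set(FACT_SHAPES) - set(SHAPE_DIM))

print(dict(level_counts), missing_definitions)
```

The fact rows never store "Has Curves"; it is derived from the shape code, and any code without a definition shows up immediately as a gap in the data dictionary.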
24
Facts and Dimension
Shape Facts: Color_FK, Shape_FK, Weight
Shape Dim: Shape Name, Has Curves
Color Dim: Color Name
Note that "Has Curves" does not need to be in the central fact table. It is a property of the shape!
25
Adding Dimensions
Figure: the sample shapes labeled with their weights.
We have now added a 3rd dimension – "Dash Style"
26
Each New Property is Another Dimension
Shape Facts: Color_FK, Shape_FK, DashStyle_FK, Weight
Shape Dim: Shape Code, Has Curves
Color Dim: Color Code
DashStyle Dim: Shape
27
Filters
Figure: the sample shapes labeled with their weights.
A filter will exclude all objects with a specified property. For example, we can exclude all shapes with a property of "Circle".
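A filter can be sketched as a simple predicate over the fact rows. The `exclude` and `keep_only` helpers below are hypothetical names; the data is the sample fact table with the id column dropped.

```python
# (color, shape, dash_style, weight) for the 14 sample facts
FACTS = [
    ("blue", "heart", "solid", 5.7),
    ("blue", "star", "small-dash", 6.6),
    ("green", "moon", "large-dash", 1.1),
    ("red", "trapezoid", "small-dash", 3.8),
    ("green", "square", "solid", 9.3),
    ("red", "diamond", "small-dash", 8.2),
    ("green", "circle", "small-dash", 2.6),
    ("red", "circle", "large-dash", 3.5),
    ("blue", "trapezoid", "solid", 5.5),
    ("blue", "square", "large-dash", 10.0),
    ("blue", "diamond", "large-dash", 8.4),
    ("red", "heart", "large-dash", 9.1),
    ("red", "moon", "solid", 7.4),
    ("green", "star", "solid", 6.1),
]

# Position of each dimension inside a fact tuple
DIMENSIONS = {"color": 0, "shape": 1, "dash_style": 2}


def exclude(facts, dimension, label):
    """Drop every fact whose value in `dimension` equals `label`."""
    index = DIMENSIONS[dimension]
    return [fact for fact in facts if fact[index] != label]


def keep_only(facts, dimension, labels):
    """Drop every fact EXCEPT those whose value in `dimension` is in `labels`."""
    index = DIMENSIONS[dimension]
    return [fact for fact in facts if fact[index] in labels]


no_circles = exclude(FACTS, "shape", "circle")
solid_only = keep_only(FACTS, "dash_style", {"solid"})

print(len(no_circles), len(solid_only))
```

`exclude` removes one category, while `keep_only` removes everything except the listed categories; both leave the measures of the surviving rows untouched.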
28
Example: Discarding Invalid Scores
This example filter removes all scores EXCEPT the valid scores.
29
Selecting Only Scale Scores
This filter removes all scores EXCEPT the assessment's Scale Score, using the Test Score Type dimension.
30
The Star Schema
Figure: a central Facts table (Primary Key, five Foreign Keys, two measure columns) surrounded by five dimension tables, Dim1 through Dim5, each with a PK and category columns Cat1, Cat2, Cat3.
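A star schema can be sketched with Python's built-in `sqlite3` module. The table and column names below are assumptions loosely modeled on the Shape Facts example, only two of the dimensions are shown, and only six of the sample facts are loaded to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE color_dim (color_pk INTEGER PRIMARY KEY, color_name TEXT UNIQUE);
CREATE TABLE shape_dim (
    shape_pk INTEGER PRIMARY KEY,
    shape_name TEXT UNIQUE,
    has_curves TEXT
);
CREATE TABLE shape_facts (
    fact_pk INTEGER PRIMARY KEY,
    color_fk INTEGER REFERENCES color_dim(color_pk),
    shape_fk INTEGER REFERENCES shape_dim(shape_pk),
    weight REAL
);
""")
conn.executemany(
    "INSERT INTO color_dim VALUES (?, ?)",
    [(1, "red"), (2, "green"), (3, "blue")],
)
conn.executemany(
    "INSERT INTO shape_dim VALUES (?, ?, ?)",
    [(1, "circle", "Yes"), (2, "square", "No"), (3, "heart", "Yes")],
)
# A subset of the sample facts, rewritten with foreign keys
conn.executemany(
    "INSERT INTO shape_facts VALUES (?, ?, ?, ?)",
    [
        (1, 3, 3, 5.7),    # blue heart
        (5, 2, 2, 9.3),    # green square
        (7, 2, 1, 2.6),    # green circle
        (8, 1, 1, 3.5),    # red circle
        (10, 3, 2, 10.0),  # blue square
        (12, 1, 3, 9.1),   # red heart
    ],
)

# Aggregate a measure by an attribute that lives only in the dimension table.
rows = conn.execute("""
    SELECT s.has_curves, COUNT(*), ROUND(SUM(f.weight), 1)
    FROM shape_facts AS f
    JOIN shape_dim AS s ON f.shape_fk = s.shape_pk
    GROUP BY s.has_curves
    ORDER BY s.has_curves
""").fetchall()

print(rows)
```

The fact table carries only keys and measures; descriptive attributes such as `has_curves` are reached through a join to the dimension table.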
31
Adding Measures
Shape Facts: Color_FK, Shape_FK, DashStyle_FK, WeightValue, HeightValue, PriceAmount, DensityValue
Shape Dim: ShapeCode
Color Dim: ColorCode
DashStyle Dim: ShapeCode
Measures can easily be added to the fact table without changing any of the dimensions.
Measures are integers or floats that you can perform math on.
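The claim that a new measure leaves the dimensions alone can be checked with a small `sqlite3` sketch. The schema is a cut-down, hypothetical version of the Shape Facts example with a single dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shape_dim (shape_code TEXT PRIMARY KEY, has_curves TEXT);
CREATE TABLE shape_facts (
    shape_fk TEXT REFERENCES shape_dim(shape_code),
    weight_value REAL
);
INSERT INTO shape_dim VALUES ('circle', 'Yes'), ('square', 'No');
INSERT INTO shape_facts VALUES ('circle', 2.6), ('square', 9.3);
""")


def column_names(table):
    """Return the column names of `table` as reported by SQLite."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


dim_columns_before = column_names("shape_dim")

# Adding a measure touches only the fact table.
conn.execute("ALTER TABLE shape_facts ADD COLUMN height_value REAL")

dim_columns_after = column_names("shape_dim")
fact_columns_after = column_names("shape_facts")

# Existing facts simply have no value yet for the new measure.
new_measure_values = conn.execute(
    "SELECT height_value FROM shape_facts"
).fetchall()

print(dim_columns_after, fact_columns_after)
```

Existing fact rows carry a NULL for the new measure until values are loaded, and the dimension table definition is unchanged.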
32
Cube
• A Cube is a pre-built structure that has facts and many dimensions (not necessarily just three)
• Designed to have averages and sums for most levels "pre-calculated" to make analysis fast
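The idea of "pre-calculated" aggregates can be sketched by storing a count and a weight sum for every combination of dimension values, including an "ALL" roll-up per dimension. The `build_cube` function and the "ALL" marker are assumptions made for this illustration, not a description of any particular cube product.

```python
from collections import defaultdict
from itertools import product

# (color, shape, dash_style, weight) for the 14 sample facts
FACTS = [
    ("blue", "heart", "solid", 5.7),
    ("blue", "star", "small-dash", 6.6),
    ("green", "moon", "large-dash", 1.1),
    ("red", "trapezoid", "small-dash", 3.8),
    ("green", "square", "solid", 9.3),
    ("red", "diamond", "small-dash", 8.2),
    ("green", "circle", "small-dash", 2.6),
    ("red", "circle", "large-dash", 3.5),
    ("blue", "trapezoid", "solid", 5.5),
    ("blue", "square", "large-dash", 10.0),
    ("blue", "diamond", "large-dash", 8.4),
    ("red", "heart", "large-dash", 9.1),
    ("red", "moon", "solid", 7.4),
    ("green", "star", "solid", 6.1),
]


def build_cube(facts):
    """Pre-compute [count, weight sum] for every (color, shape, dash) cell.

    Each fact contributes to 8 cells: every dimension is either its own
    label or the roll-up marker "ALL".
    """
    cube = defaultdict(lambda: [0, 0.0])
    for color, shape, dash, weight in facts:
        for key in product((color, "ALL"), (shape, "ALL"), (dash, "ALL")):
            cell = cube[key]
            cell[0] += 1
            cell[1] += weight
    return cube


cube = build_cube(FACTS)

# Answering a question is now a dictionary lookup, not a scan of the facts.
blue_count, blue_weight = cube[("blue", "ALL", "ALL")]
print(blue_count, round(blue_weight, 1))
```

The trade-off is storage: the number of cells grows with the product of the dimension sizes, which is why real cubes usually pre-calculate only the most useful levels.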
Figure: the sample shapes arranged in a cube with axes Shape Dimension, Color Dimension and Dash-Style Dimension.
33
Build a Mental Model
Figure labels: Filter Funnel (aka "Page Fields"); Horizontal Dimension (columns); Vertical Dimension (rows); Measure = count; Presentation.
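The mental model of a filter funnel feeding a grid of rows and columns can be sketched as a small pivot function. `pivot_count` and its parameters are hypothetical names; the measure is a plain count, and the data is the sample fact table.

```python
from collections import Counter

FIELDS = ("color", "shape", "dash_style", "weight")
ROWS = [
    ("blue", "heart", "solid", 5.7),
    ("blue", "star", "small-dash", 6.6),
    ("green", "moon", "large-dash", 1.1),
    ("red", "trapezoid", "small-dash", 3.8),
    ("green", "square", "solid", 9.3),
    ("red", "diamond", "small-dash", 8.2),
    ("green", "circle", "small-dash", 2.6),
    ("red", "circle", "large-dash", 3.5),
    ("blue", "trapezoid", "solid", 5.5),
    ("blue", "square", "large-dash", 10.0),
    ("blue", "diamond", "large-dash", 8.4),
    ("red", "heart", "large-dash", 9.1),
    ("red", "moon", "solid", 7.4),
    ("green", "star", "solid", 6.1),
]
FACTS = [dict(zip(FIELDS, row)) for row in ROWS]


def pivot_count(facts, row_dim, col_dim, page_filter=None):
    """Count facts per (row label, column label) cell.

    `page_filter` maps a dimension name to the single label to keep;
    facts that fail any filter never reach the grid.
    """
    counts = Counter()
    for fact in facts:
        if page_filter and any(fact[d] != v for d, v in page_filter.items()):
            continue
        counts[(fact[row_dim], fact[col_dim])] += 1
    return counts


table = pivot_count(FACTS, row_dim="shape", col_dim="color")
solid = pivot_count(FACTS, "shape", "color", page_filter={"dash_style": "solid"})

print(table[("circle", "red")], sum(solid.values()))
```

Swapping `row_dim` and `col_dim`, or changing the page filter, gives a different view of the same facts without touching the underlying data.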
35
Count of Year vs. Assessment Name
The Row Dimension is the "Test Name".
The Column Dimension is the "Fiscal Year".
The measure is the count of records in the cube.
There are around 25 million test results.
36
Conformed Dimensions
• When building many cubes, there is a large benefit to "reusing" dimensions
• Commonly reused dimensions
– Time (Fiscal Year, Quarter)
– Organization (School, District)
– Expense Category
– Student
37
Each bar represents the sum of all the expenditures in the category (expenditures on girls athletics for the fiscal year 1991)
38
Sample of National Conformed Dimensions
Process
Student Assessment
Student Attendance
School and District Status
Teacher Licensing
Date
School Food and Nutrition
Student Disciplinary Reporting
Student Safety Reporting
District Technology Planning
District Financial Reporting
Dimensions (column headers): Organization, Assessment, Financial, Teacher, Claims, School Incident Data, School Technology, Student
39
Role of Data Architecture
• Facilitate how business users want to identify and categorize data
• Assist in the creation and documentation of categorical value domains and measures
• Creation of machine-readable data dictionaries for use in building data warehouse structures
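A machine-readable data dictionary might look something like the sketch below. The JSON layout, the "kind" and "labels" keys, and the `validate` helper are all assumptions made for illustration; only the labels themselves come from the sample fact table.

```python
import json

# Hypothetical data dictionary format: one entry per dimension or measure.
DATA_DICTIONARY_JSON = """
{
  "Color": {"kind": "dimension", "labels": ["red", "green", "blue"]},
  "DashStyle": {"kind": "dimension", "labels": ["solid", "small-dash", "large-dash"]},
  "Weight": {"kind": "measure", "type": "number"}
}
"""
DATA_DICTIONARY = json.loads(DATA_DICTIONARY_JSON)


def validate(record):
    """Return a list of problems found in `record`; an empty list means valid."""
    problems = []
    for name, entry in DATA_DICTIONARY.items():
        if name not in record:
            problems.append(f"{name}: missing")
        elif entry["kind"] == "dimension" and record[name] not in entry["labels"]:
            problems.append(f"{name}: unknown label {record[name]!r}")
        elif entry["kind"] == "measure" and (
            isinstance(record[name], bool)
            or not isinstance(record[name], (int, float))
        ):
            problems.append(f"{name}: not a number")
    return problems


ok = validate({"Color": "blue", "DashStyle": "solid", "Weight": 5.7})
bad = validate({"Color": "blue-green", "DashStyle": "solid", "Weight": "heavy"})

print(ok, bad)
```

Because the dictionary is data, not documentation, the same file could drive both load-time validation and the creation of dimension tables.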