25th SQL Night
Building a Data Warehouse in SQL Server
Antonios Chatzipavlis, SQLschool.gr Founder, Principal Consultant
SQL Server Evangelist, MVP on SQL Server
May 30, 2015
1982: I got started with computers.
1988: I started my professional career in the computer industry.
1996: I started working with SQL Server, version 6.0.
1998: I earned my first Microsoft certification, Microsoft Certified Solution Developer (3rd in Greece), and began my career as a Microsoft Certified Trainer (MCT), with more than 20,000 hours of training to date!
2010: I became a Microsoft MVP on SQL Server for the first time.
2012: I created SQL School Greece (www.sqlschool.gr).
2013: I became an MCT Regional Lead in the Microsoft Learning program, and was certified as MCSE: Data Platform and MCSE: Business Intelligence.

Antonios Chatzipavlis
Database Architect, SQL Server Evangelist
MCT, MCSE, MCITP, MCPD, MCSD, MCDBA, MCSA, MCTS, MCAD, MCP, OCA, ITIL-F
Follow us on social media
Twitter @antoniosch / @sqlschool
Facebook fb/sqlschoolgr
YouTube yt/user/achatzipavlis
LinkedIn SQL School Greece group
Pinterest pi/SQLschool/
help@sqlschool.gr
Stay Involved!
• Sign up for a free membership today at sqlpass.org
• Linked In: http://www.sqlpass.org/linkedin
• Facebook: http://www.sqlpass.org/facebook
• Twitter: @SQLPASS
• PASS: http://www.sqlpass.org
Whatever your data passion – there’s a Virtual Chapter for you!
www.sqlpass.org/vc
Planning on attending PASS Summit 2015? Start
saving today!
• The world’s largest gathering of SQL Server & BI professionals
• Take your SQL Server skills to the next level by learning from the world’s
top SQL Server experts, in over 190 technical sessions
• Over 5000 registrations, representing 2000 companies, from 52
countries, ready to network & learn
Save $150 right now using discount code LC15CPJ8:
$1,795 until July 12th, 2015
Don't miss your chance to vote in the 2015 PASS
elections: update your myPASS profile by June 1!
In order to vote for the 2015 PASS Nomination Committee & the Board of Directors, you need
to complete all mandatory fields in your myPASS profile by 11:59 PM PDT June 1, 2015.
• PASS members will be reminded to review & complete their profiles
• Members will receive instructions for updating profiles and deleting duplicate profiles
• Eligible voters will receive information about key election dates and the voting process after June 1
Head to sqlpass.org/myPASS today!
For more info on elections,
visit sqlpass.org/elections
• Overview of Data Warehousing
• Data Warehouse Solution
• Data Warehouse Infrastructure
• Data Warehouse Hardware
• Data Warehouse Design Overview
• Designing Dimension Tables
• Designing Fact Tables
• Data Warehouse Physical Design
Agenda
Overview of Data Warehousing
• There are many definitions for the term “data warehouse,”
and disagreements over specific implementation details.
• It is generally agreed that a data warehouse is a centralized
store of business data that can be used for reporting and
analysis to inform key decisions.
• A data warehouse provides a solution to the problem of
distributed data that prevents effective business decision-
making.
What is a Data Warehouse?
The single organizational repository of enterprise wide
data across many or all lines of business and subject
areas.
Contains massive and integrated data.
Represents the complete organizational view of
information needed to run and understand the
business.
Definition of Data Warehouse
• Contains a large volume of data that relates to historical
business transactions.
• Is optimized for read operations that support querying the
data.
• Is loaded with new or updated data at regular intervals.
• Provides the basis for enterprise BI applications.
Data Warehouse characteristics
• Finding the information required for business decisions
  • This is time-consuming and error-prone.
• Key business data is distributed across multiple systems.
  • This makes it hard to collate all the information necessary for a particular business decision.
• Fundamental business questions are hard to answer.
  • Most business decisions require knowledge of fundamental facts.
  • The distribution of data throughout multiple systems in a typical organization can make them difficult, or even impossible, to answer.
What makes a Data Warehouse useful?
• A specific, subject-oriented, or departmental view of
information from the organization.
• Generally, data marts are built to satisfy user requirements
for information.
What is a Data Mart?
Data Warehouse vs Data Mart

          Data Warehouse                        Data Mart
Scope     • Application independent             • Specific application
          • Centralized or enterprise           • Decentralized by group
          • Planned                             • Organic, but may be planned
Data      • Historical, detailed, summary       • Some history, detailed, summary
          • Some denormalization                • High denormalization
Subjects  • Multiple subjects                   • Single central subject area
Sources   • Many internal and external sources  • Few internal and external sources
Other     • Flexible                            • Restrictive
          • Data oriented                       • Project oriented
          • Long life                           • Short life
          • Single complex structure            • Multiple simple structures
• Centralized Data Warehouse
• Departmental Data Mart
• Hub and Spoke
Data Warehouse Architectures
Components of a Data Warehousing Solution
• Data Sources
• ETL
• Data Warehouse
• Data Cleansing
• Master Data Management
• Data Models
• Reporting and Analysis
1. Start by identifying the business questions that the data warehousing
solution must answer
2. Determine the data that is required to answer these questions
3. Identify data sources for the required data
4. Assess the value of each question to key business objectives versus
the feasibility of answering it from the available data
For large enterprise-level projects, an incremental approach can be effective:
• Break the project down into multiple sub-projects
• Each sub-project deals with a particular subject area in the data warehouse
Starting a Data Warehouse Project
Core Data Warehousing
• SQL Server Database Engine
• SQL Server Integration Services
• SQL Server Master Data Services
• SQL Server Data Quality Services

Enterprise BI
• SQL Server Analysis Services
• SQL Server Reporting Services
• Microsoft SharePoint Server
• Microsoft Office

Self-Service BI
• Excel Add-ins (PowerPivot, Power Query, Power View, Power Map)
• Microsoft Office 365 Power BI

Big Data Analysis
• Windows Azure HDInsight

SQL Server As a Data Warehousing Platform
Data Warehouse Solution
A data warehouse
is a relational database that
is optimized for reading data
for analysis and reporting.
Keep in mind
• Logical:
  • Typically designed to denormalize data into a structure that minimizes the
    number of join operations required in the queries used to retrieve and
    aggregate data.
  • A common approach is to design a star schema.
• Physical:
  • Affects the performance and manageability of the data warehouse.
Logical and Physical Database Schema
• Query processing requirements, including anticipated peak
memory and CPU utilization.
• Storage volume and disk input/output requirements.
• Network connectivity and bandwidth.
• Component redundancy for high availability.
Hardware selection
• Failover time requirements.
• Configuration and management complexity.
• The volume of data in the data warehouse.
• The frequency of changes to data in the data warehouse.
• The effect of the backup process on data warehouse
performance.
• The time to recover the database in the event of a failure.
High availability and Disaster Recovery
• The authentication mechanisms that you must support to
provide access to the data warehouse.
• The permissions that the various users who access the data
warehouse will require.
• The connections over which data is accessed.
• The physical security of the database and backup media.
Security
• Data Source Connection Types
• Credentials and Permissions
• Data Formats
• Data Acquisition Windows
Data sources
• Staging:
  • What data must be staged?
  • Staging data format
• Required transformations:
  • Transformations during extraction versus data flow transformations
• Incremental ETL:
  • Identifying data changes for extraction
  • Inserting or updating when loading
ETL Processes
• Data quality:
• Cleansing data:
• Validating data values
• Ensuring data consistency
• Identifying missing values
• Deduplicating data
• Master data management:
• Ensuring consistent business entity definitions across multiple systems
• Applying business rules to ensure data validity
Data Quality and Master Data Management
Data Warehouse Infrastructure
Data volume
• The amount of data that the data warehouse must store
• The size and frequency of incremental loads of new data.
• The primary consideration is the number of rows in fact tables
• But don’t forget dimension data, indexes, and data models
that are stored on disk.
System Sizing Factors
Analysis and Reporting Complexity
• This includes the number, complexity, and predictability of the
queries that will be used to analyze the data or produce reports.
• Typically, BI solutions must support a mix of the following query
types:
• Simple. Relatively straightforward SELECT statements.
• Medium. Repeatedly executed queries that include aggregations or many joins.
• Complex. Unpredictable queries with complex aggregations, joins, and
calculations.
System Sizing Factors
Number of Users
• This is the total number of information workers who will
access the system, and how many of them will do so
concurrently.
Availability Requirements
• These include when the system will need to be used, and
what planned or unplanned downtime the business can
tolerate.
System Sizing Factors
Typical System Categorization

                Small                   Medium                  Large
Data Volume     100s of GBs to 1 TB     1 to 10 TB              10 TB to 100s of TBs
Analysis and    Over 50% simple         50% simple              30-35% simple
Reporting       30% medium              30-35% medium           40% medium
Complexity      Less than 10% complex   10-15% complex          20-25% complex
Number of       100 total               1,000 total             1,000s of concurrent
Users           10 to 20 concurrent     100 to 200 concurrent   users
Availability    Business hours          1 hour of downtime      24/7 operations
Requirements                            per night
Data Warehouse Workloads

ETL
• Control flow tasks
• Data query and insert
• Network data transfer
• In-memory data pipeline
• SSIS Catalog or MSDB I/O

Reporting
• Client requests
• Data source queries
• Report rendering
• Caching
• Snapshot execution
• Subscription processing
• Report Server Catalog I/O

Operations and Maintenance
• OS activity
• Logging
• SQL Server Agent jobs
• SSIS packages
• Indexes
• Backups

Cubes
• Processing
• Aggregation storage
  • Multidimensional on disk
  • Tabular in memory
• Query execution
Typical Server Topologies for a BI Solution

Single Server Architecture vs. Distributed Architecture, compared on:
• Servers (few vs. many)
• Hardware costs
• Software license costs
• Configuration complexity
• Scalability & performance
• Flexibility
Scaling-out a BI Solution

Data Warehouse
• Partition the data across multiple database servers
• SQL Server Parallel Data Warehouse edition

Integration Services
• Use multiple SSIS servers to perform a subset of the ETL processes in parallel

Analysis Services
• Create a read-only copy of a multidimensional database and connect to it from multiple Analysis Services query servers

Reporting Services
• Install the Reporting Services database on a single database server, then install the Reporting Services report server service on multiple servers that all connect to the same Reporting Services database
Planning for High Availability

Data Warehouse
• AlwaysOn Failover Cluster
• AlwaysOn Availability Group
• RAID storage

Analysis Services
• AlwaysOn Failover Cluster

Integration Services
• AlwaysOn Failover Cluster

Reporting Services
• NLB report servers
• AlwaysOn Availability Group (for the report server databases)
Data Warehouse Hardware
• A DW usually has longer-running queries
• A DW has higher read activity than write activity
• The data in DW is usually more static
• In a DW it is much more important to be able to process a
large amount of data quickly than it is to support a high
number of I/O operations per second
Keep in mind
• Determine initial data volume
  • Number of fact table rows x row size
  • Use 100 bytes per row as an estimate if unknown
  • Add 30-40% for dimensions and indexes
• Project data growth
  • Number of new fact rows per month
• Factor in compression
  • Typically 3:1
Determining Storage Requirements
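As an illustration of the steps above, here is a minimal sizing sketch; every figure (row count, growth rate, overhead factor) is an assumption, not taken from the deck.

```sql
-- Hypothetical estimate: 500 million existing fact rows at ~100 bytes/row,
-- 10 million new rows/month over 3 years, +35% for dimensions and indexes,
-- then 3:1 compression.
DECLARE @FactRows       bigint = 500000000,
        @BytesPerRow    int    = 100,
        @MonthlyNewRows bigint = 10000000,
        @Months         int    = 36;

DECLARE @RawBytes bigint =
    (@FactRows + @MonthlyNewRows * @Months) * @BytesPerRow;

SELECT
    CAST(@RawBytes * 1.35 / POWER(1024.0, 3) AS decimal(10,1))
        AS UncompressedGB,                 -- raw data + 35% overhead
    CAST(@RawBytes * 1.35 / 3 / POWER(1024.0, 3) AS decimal(10,1))
        AS CompressedGB;                   -- assuming 3:1 compression
```

The same arithmetic works on a whiteboard; the script just keeps the assumptions explicit and easy to change.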
Other storage requirements
• Configuration databases
• Log files
• TempDB
• Staging tables
• Backups
• Analysis Services models
• Use more, smaller disks instead of fewer, larger disks
• Use the fastest disks you can afford
  • Consider solid state disks, especially for random I/O
• Use RAID 10, or minimally RAID 5
• Consider a dedicated storage area network for manageability and extensibility
  • Balance I/O across enclosures, storage processors, and disk groups
Considerations for Storage Hardware
Considerations for Storage Hardware
Server size Minimum memory Maximum memory
1 socket 64 GB 128 GB
2 sockets 128 GB 256 GB
4 sockets 256 GB 512 GB
8 sockets 512 GB 768 GB
Server Memory
• Determine core MCR
• Apply formula to estimate required number of cores:
  CPUs = ( (Average query size in MB / MCR) x Concurrent users ) / Target response time
Estimating CPU Requirements
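A worked example of the formula, with hypothetical inputs (the MCR, query size, user count, and response target below are illustrative assumptions):

```sql
-- Queries scan ~6,000 MB; per-core MCR is 200 MB/s; 10 concurrent
-- users; target response time of 60 seconds.
DECLARE @QueryMB   decimal(10,2) = 6000,
        @MCR       decimal(10,2) = 200,   -- MB/s per core
        @Users     int           = 10,
        @TargetSec int           = 60;

SELECT CEILING((@QueryMB / @MCR) * @Users / @TargetSec) AS CoresRequired;
-- (6000 / 200) x 10 / 60 = 5 cores
```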
• This metric measures the maximum SQL Server data processing rate for a
standard query and data set for a specific server and CPU combination.
• This is provided as a per-core rate, and it is measured as a query-based scan
from memory cache.
• MCR is the initial starting point for Fast Track system design.
• It represents an estimated maximum required I/O bandwidth for the server, CPU,
and workload.
• MCR is useful as an initial design guide because it requires only minimal local
storage and database schema to estimate potential throughput for a given CPU.
• It is not a measure of system performance.
Maximum Consumption Rate (MCR)
• Create a reference dataset based on the TPC-H lineitem table or a similar data set.
  • The table should be of a size that it can be entirely cached in the SQL Server buffer pool, yet still maintain a minimum one-second execution time for the query provided here.
• For FTDW the following query is used:
  SELECT SUM([integer field]) FROM [table]
  WHERE [restrict to appropriate data volume]
  GROUP BY [col]
• Ensure that Resource Governor settings are at default values.
• Ensure that the query is executing from the buffer cache.
  • Executing the query once should put the pages into the buffer, and subsequent executions should read fully from buffer. Validate that there are no physical reads in the query statistics output.
Calculate MCR
• Set STATISTICS IO and STATISTICS TIME to ON to output results.
• Run the query multiple times, at MAXDOP = 1.
• Record the number of logical reads and CPU time from the statistics output for each query execution.
• Calculate the MCR in MB/s using the formula:
  ( [Logical reads] / [CPU time in seconds] ) * 8 KB / 1024
• A consistent range of values (+/- 5%) should appear over a minimum of five query executions.
  • Significant outliers (+/- 20% or more) may indicate configuration issues.
• The average of at least five calculated results is the FTDW MCR.
Calculate MCR
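A minimal sketch of the measurement steps above, run against a hypothetical cached reference table (the table, columns, and statistics figures are assumptions):

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Warm the buffer cache with one execution, then repeat at least five
-- times and record logical reads and CPU time from the Messages output.
SELECT SUM(Quantity)
FROM dbo.LineItemRef
WHERE OrderDateKey BETWEEN 20130101 AND 20131231
GROUP BY ProductKey
OPTION (MAXDOP 1);

-- MCR in MB/s from one execution's recorded statistics, e.g.
-- 1,500,000 logical reads against 30 seconds of CPU time:
SELECT (1500000 / 30.0) * 8 / 1024 AS MCR_MBps;  -- 390.625 MB/s per core
```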
• Special SQL Server edition, available only in hardware appliances
• Massively parallel processing
• Shared-nothing architecture
• Dedicated control nodes, compute nodes, and storage nodes
SQL Server Parallel Data Warehouse
The appliance comprises a control node with cluster management servers, database servers (compute nodes) connected over InfiniBand, storage arrays attached via dual Fibre Channel, a landing zone (ETL interface), and backup nodes.
Data Warehouse Design Overview
The Dimensional Model

[Diagram: a central fact table containing measures, joined to surrounding dimension tables containing attributes. In a star schema, dimension tables join directly to the fact table; in a snowflake schema, dimension attributes are normalized into additional related tables.]
• Identify the grain
• Select the required dimensions
• Identify the facts
Dimensional Modeling
• The grain of a dimensional model is the lowest level of detail
at which you can aggregate the measures.
• It is important to choose a grain that will support the most
granular reporting and analytical requirements.
• Typically the lowest level possible from the source data is the
best option.
Identify the grain
• Determine which of the dimensions related to the business
process should be included in the model
• The selection of dimensions depends on the reporting and
analytical requirements, specifically on the business entities by
which users need to aggregate the measures
• Almost all dimensional models include a time-based
dimension
Select the required dimensions
• Identify the facts that you want to include as measures.
• Measures are numeric values that can be expressed at the
level of the grain chosen earlier and aggregated across the
selected dimensions.
• Depending on the grain you choose for the dimensional
model and the grain of the source data, you might need to
allocate measures from a higher level of grain across multiple
fact rows.
Identify the facts
Documenting Dimensional Models

Example: a Sales Order fact with measures Item Quantity, Unit Cost, Total Cost, Unit Price, Sales Amount, and Shipping Cost, analyzed by:
• Time (Order Date and Ship Date): Calendar Year > Month > Date; Fiscal Year > Fiscal Quarter > Month > Date
• Salesperson: Region > Country > Territory; Manager; Name
• Customer: Name; Country > State or Province > City; Age; Marital Status; Gender
• Product: Category > Subcategory > Product Name; Color; Size
Designing Dimension Tables
Each row
in a dimension table represents
an instance of a business entity
by which the measures
in the fact table
can be aggregated
Keep in mind
• A key column uniquely identifies each row in the dimension table.
• Usually the dimension data is obtained from a source system in which a key is already assigned; this is the "business key".
• It is standard practice to define a new "surrogate key" that uses an integer value to identify each row.
• A surrogate key is recommended for the following reasons:
  • The data warehouse might use dimension data from multiple source systems, so it is possible that business keys are not unique.
  • Some source systems use non-numeric keys, such as a globally unique identifier (GUID), or natural keys, such as an email address, to uniquely identify data entities. Integer keys are smaller and more efficient to use in joins from fact tables.
  • They are required if the dimension table supports "Type 2" slowly changing dimensions.
Dimension keys
Dimension keys
ProductKey ProductAltKey ProductName Color Size
1 MB1-B-32 MB1 Mountain Bike Blue 32
2 MB1-R-32 MB1 Mountain Bike Red 32
CustomerKey CustomerAltKey Name
1 1002 Amy Alberts
2 1005 Neil Black
(ProductKey and CustomerKey are surrogate keys; ProductAltKey and CustomerAltKey are business, or alternate, keys.)
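A minimal sketch of a dimension table following this pattern; the table and column names are illustrative, not from the deck:

```sql
CREATE TABLE dbo.DimProduct
(
    ProductKey    int IDENTITY(1,1) NOT NULL,  -- surrogate key (integer, DW-assigned)
    ProductAltKey nvarchar(20) NOT NULL,       -- business (alternate) key from the source
    ProductName   nvarchar(100) NOT NULL,
    Color         nvarchar(20) NULL,
    Size          int NULL,
    CONSTRAINT PK_DimProduct PRIMARY KEY (ProductKey)
);
```

Fact tables join on the narrow integer ProductKey, while ProductAltKey is kept for lookups against the source system during ETL.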
• Hierarchies
  • Multiple attributes can be combined to form hierarchies that enable users to drill down into deeper levels of detail.
  • Business users can view aggregated fact data at each level.
• Slicers
  • Attributes do not need to form hierarchies to be useful in analysis and reporting.
  • Business users can group or filter data based on single-level hierarchies to create analytical sub-groupings of data.
• Drill-through detail
  • Some attributes have little value as slicers or members of a hierarchy.
  • It can be useful to include entity-specific attributes to facilitate drill-through functionality in reports or analytical applications.
Dimension Attributes and Hierarchies
Dimension Attributes and Hierarchies
CustKey CustAltKey Name Country State City Phone Gender
1 1002 Amy Alberts Canada BC Vancouver 555 123 F
2 1005 Neil Black USA CA Irvine 555 321 M
3 1006 Ye Xu USA NY New York 555 222 M
(Country > State > City form a hierarchy; Gender is a slicer; Phone is drill-through detail.)
• Identify the semantic meaning of NULL
  • Unknown or None?
• Do not assume NULL equality
  • Use ISNULL()
Unknown and None
OrderNo Discount DiscountType
1000 1.20 Bulk Discount
1001 0.00 N/A
1002 2.00
1003 0.50 Promotion
1004 2.50 Other
1005 0.00 N/A
1006 1.50
The source rows above map to the following dimension table:
DiscKey DiscAltKey DiscountType
-1 Unknown Unknown
0 N/A None
1 Bulk Discount Bulk Discount
2 Promotion Promotion
3 Other Other
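A sketch of resolving the source rows to the dimension keys above during a load; the staging and dimension table names are hypothetical, and the convention assumed is that a missing DiscountType means "Unknown" (-1) while "N/A" means "None" (0):

```sql
SELECT s.OrderNo,
       s.Discount,
       ISNULL(d.DiscKey, -1) AS DiscKey   -- anything unmatched falls back to Unknown
FROM stg.Orders AS s
LEFT JOIN dbo.DimDiscount AS d
    ON d.DiscAltKey = ISNULL(s.DiscountType, 'Unknown');  -- never compare NULL = NULL
```

Note the ISNULL() on the join column: a plain equality join would silently drop the rows whose DiscountType is NULL, because NULL never equals NULL.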
• The simplest type of SCD to implement.
• Attribute values are updated directly in the existing dimension table row and no history is maintained.
• Suitable for attributes that are used to provide drill-through details
• Unsuitable for analytical slicers or hierarchy members where historic comparisons must reflect the attribute values as they were at the time of the fact event.
Slowly Changing Dimensions – Type 1
Before:
CustKey CustAltKey Name        Phone
1       1002       Amy Alberts 555 123

After (Type 1 update in place):
CustKey CustAltKey Name        Phone
1       1002       Amy Alberts 555 222
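A minimal Type 1 sketch: the attribute is simply overwritten in place and no history is kept. Table and column names are hypothetical.

```sql
UPDATE d
SET d.Phone = s.Phone
FROM dbo.DimCustomer AS d
JOIN stg.Customer AS s
    ON s.CustomerAltKey = d.CustAltKey
WHERE d.Phone <> s.Phone;   -- touch only rows whose value actually changed
                            -- (NULL-safe comparison omitted for brevity)
```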
• These changes involve the creation of a fresh version of the dimension entity in the form of a new row.
• Typically, a bit column in the dimension table is used as a flag to indicate which version of the dimension row is the current one.
• Additionally, datetime columns are often used to indicate the start and end of the period for which a version of the row was (or is) current.
  • Maintaining start and end dates makes it easier to assign the appropriate foreign key value to fact rows as they are loaded, so they are related to the version of the dimension entity that was current at the time the fact occurred.
Slowly Changing Dimensions – Type 2
Before:
CustKey CustAltKey Name        City      Current Start    End
1       1002       Amy Alberts Vancouver Yes     1/1/2000

After (Type 2: old row expired, new row added with a new surrogate key):
CustKey CustAltKey Name        City      Current Start    End
1       1002       Amy Alberts Vancouver No      1/1/2000 1/1/2012
4       1002       Amy Alberts Toronto   Yes     1/1/2012
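A minimal Type 2 sketch for the changed City value above, expiring the current row and inserting a fresh version (hypothetical table and column names; the flag column is a bit, as the deck describes):

```sql
DECLARE @Today date = '2012-01-01';

-- Expire the current version of the row.
UPDATE dbo.DimCustomer
SET [Current] = 0, [End] = @Today
WHERE CustAltKey = 1002 AND [Current] = 1;

-- Insert the new version; IDENTITY assigns the new surrogate key.
INSERT INTO dbo.DimCustomer (CustAltKey, Name, City, [Current], [Start])
VALUES (1002, N'Amy Alberts', N'Toronto', 1, @Today);
```

In a production load this pair of statements is typically driven by a change-detection comparison (or a MERGE) over the whole staging set rather than a single key.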
• Rarely used
• The previous value (or a complete history of previous values) is maintained in the dimension table row.
• This requires modifying the dimension table schema to accommodate new values for each tracked attribute, and can result in a complex dimension table that is difficult to manage.
Slowly Changing Dimensions – Type 3
CustKey CustAltKey Name Cars
1 1002 Amy Alberts 0
CustKey CustAltKey Name Prior Cars Current Cars
1 1002 Amy Alberts 0 1
• Surrogate key
• Granularity
• Range
Time Dimension

DateKey  DateAltKey MonthDay WeekDay Day  MonthNo Month Year
00000000 01-01-1753 NULL     NULL    NULL NULL    NULL  NULL
20130101 01-01-2013 1        3       Tue  01      Jan   2013
20130102 01-02-2013 2        4       Wed  01      Jan   2013
20130103 01-03-2013 3        5       Thu  01      Jan   2013
20130104 01-04-2013 4        6       Fri  01      Jan   2013
• Attributes and hierarchies
• Multiple calendars
• Unknown values
• Create a Transact-SQL script
• Use Microsoft Excel
• Use a BI tool to autogenerate a time dimension table
Populating a Time Dimension Table
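A sketch of the first option, a Transact-SQL script that fills a day-grain time dimension matching the sample rows shown earlier (the table name is hypothetical, and the weekday number depends on the session's DATEFIRST setting):

```sql
DECLARE @d date = '2013-01-01', @end date = '2013-12-31';

WHILE @d <= @end
BEGIN
    INSERT INTO dbo.DimDate
        (DateKey, DateAltKey, MonthDay, [WeekDay], [Day], MonthNo, [Month], [Year])
    VALUES
        (CONVERT(int, CONVERT(char(8), @d, 112)),  -- yyyymmdd key, e.g. 20130101
         @d,
         DAY(@d),
         DATEPART(weekday, @d),                    -- depends on SET DATEFIRST
         DATENAME(weekday, @d),
         MONTH(@d),
         DATENAME(month, @d),
         YEAR(@d));
    SET @d = DATEADD(day, 1, @d);
END;
```

A set-based version (a numbers table joined to DATEADD) is faster for long ranges, but the loop keeps the logic obvious for a one-off load.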
• A common requirement in a data warehouse is to support
dimensions with parent-child hierarchies
• Typically, parent-child hierarchies are implemented as self-referencing tables, in which a column in each row is used as a foreign-key reference to a primary-key value in the same table
Self-Referencing Dimension
EmployeeKey EmployeeAltKey EmployeeName ManagerKey
1 1000 Kim Abercrombie NULL
2 1001 Kamil Amireh 1
3 1002 Cesar Garcia 1
4 1003 Jeff Hay 2
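The parent-child table above can be traversed with a recursive common table expression; this sketch assumes a hypothetical dbo.DimEmployee matching the sample rows:

```sql
WITH OrgChart AS
(
    -- Anchor: the top of the hierarchy (no manager).
    SELECT EmployeeKey, EmployeeName, ManagerKey, 0 AS Lvl
    FROM dbo.DimEmployee
    WHERE ManagerKey IS NULL

    UNION ALL

    -- Recursive step: each employee joined to their manager's row.
    SELECT e.EmployeeKey, e.EmployeeName, e.ManagerKey, o.Lvl + 1
    FROM dbo.DimEmployee AS e
    JOIN OrgChart AS o ON e.ManagerKey = o.EmployeeKey
)
SELECT EmployeeKey, EmployeeName, Lvl
FROM OrgChart
ORDER BY Lvl, EmployeeKey;
```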
• Combine low-cardinality attributes that don’t belong in
existing dimensions into a junk dimension
• Avoids creating many small dimension tables
Junk Dimensions
JunkKey OutOfStockFlag FreeShippingFlag CreditOrDebit
1 1 1 Credit
2 1 1 Debit
3 1 0 Credit
4 1 0 Debit
5 0 1 Credit
6 0 1 Debit
7 0 0 Credit
8 0 0 Debit
Designing Fact Tables
Fact Table Columns

OrderDateKey ProductKey CustomerKey OrderNo Qty SalesAmount
20120101     25         120         1000    1   350.99
20120101     99         120         1000    2   6.98
20120101     25         178         1001    2   701.98

• Dimension keys: OrderDateKey, ProductKey, CustomerKey
• Measures: Qty, SalesAmount
• Degenerate dimensions: OrderNo
Types of Measure

Additive measures (e.g. SalesAmount):
OrderDateKey ProductKey CustomerKey SalesAmount
20120101     25         120         350.99
20120101     99         120         6.98
20120102     25         178         701.98

Semi-additive measures (e.g. StockCount):
DateKey  ProductKey StockCount
20120101 25         23
20120101 99         118
20120102 25         22

Non-additive measures (e.g. ProfitMargin):
OrderDateKey ProductKey CustomerKey ProfitMargin
20120101     25         120         25
20120101     99         120         22
20120102     25         178         27
Types of Fact Table

Transaction fact tables:
OrderDateKey ProductKey CustomerKey OrderNo Qty Cost   SalesAmount
20120101     25         120         1000    1   125.00 350.99
20120101     99         120         1000    2   2.50   6.98
20120101     25         178         1001    2   250.00 701.98

Periodic snapshot fact tables:
DateKey  ProductKey OpeningStock UnitsIn UnitsOut ClosingStock
20120101 25         25           1       3        23
20120101 99         120          0       2        118

Accumulating snapshot fact tables:
OrderNo OrderDateKey ShipDateKey DeliveryDateKey
1000    20120101     20120102    20120105
1001    20120101     20120102    00000000
1002    20120102     00000000    00000000
Data Warehouse Physical Design
Understanding DW Components Activity

ETL loads
• Bulk inserts
• Some lookups and updates

Data warehouse
• Large fact tables
• Star joins to dimension tables

Data model processing
• Mostly table/index scans

Report processing
• Predictable queries
• Many rows with range-based query filters

Self-service BI
• Potentially unpredictable queries
• Create files with an initial size
  • Based on the eventual size of the objects that will be stored on them.
  • This pre-allocates sequential disk blocks and helps avoid fragmentation.
• Disable autogrowth
  • If you begin to run out of space in a data file, it is more efficient to explicitly increase the file size by a large amount than to rely on incremental autogrowth.
Data files guidelines
• Create at least one filegroup in addition to the primary one, and set it as the default filegroup so you can separate data tables from system tables.
• Create dedicated filegroups for extremely large fact tables, and use them to place those fact tables on their own logical disks.
• If some tables in the data warehouse are loaded on a different schedule from others, consider using filegroups to separate the tables into groups that can be backed up independently.
• If you intend to partition a large fact table, create a filegroup for each partition so that older, stable partitions can be backed up and then set as read-only.
Filegroups guidelines
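A sketch of the first guideline; the database name and file path are hypothetical:

```sql
-- Add a user-data filegroup, give it a pre-sized file, and make it
-- the default so new tables land outside PRIMARY.
ALTER DATABASE SalesDW ADD FILEGROUP DataFG;

ALTER DATABASE SalesDW ADD FILE
(
    NAME = SalesDW_Data1,
    FILENAME = 'D:\Data\SalesDW_Data1.ndf',
    SIZE = 50GB,          -- pre-allocated initial size
    FILEGROWTH = 10GB     -- large increments if growth is ever needed
)
TO FILEGROUP DataFG;

ALTER DATABASE SalesDW MODIFY FILEGROUP DataFG DEFAULT;
```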
• Separate staging database
  • Create it on a logical disk distinct from the data warehouse files.
• Staging tables in the data warehouse database
  • Create a file and filegroup for them on a logical disk separate from the fact and dimension tables.
  • An exception to the previous guideline is made for staging tables that will be switched with partitions to perform fast loads.
    • These must be created on the same filegroup as the partition with which they will be switched.
Staging tables
• To avoid fragmentation of data files:
  • Place it on a dedicated logical disk.
  • Set its initial size based on how much it is likely to be used.
• Set the growth increment to be quite large to ensure that performance is not interrupted by frequent growth of TempDB.
• Create multiple files for TempDB to help minimize contention during page free space (PFS) scans as temporary objects are created and dropped.
TempDB
• Set the recovery model of the data warehouse, staging
database, and TempDB to Simple
  • This helps avoid having to back up and truncate transaction logs
• Additionally, most of the inserts in a data warehouse are
typically performed as bulk load operations, which are
minimally logged.
• To avoid disk resource conflicts between data warehouse I/O
and logging, place the transaction log files for all databases
on a dedicated logical disk.
Transaction logs
• SQL Server Enterprise edition supports data compression at both page and row level.
• Data compression benefits in a data warehouse:
  • Reduced storage requirements.
  • Improved query performance.
• Best practices for data compression in a data warehouse:
  • Use page compression on all dimension tables and fact table partitions.
  • If performance is CPU-bound, revert to row compression on frequently-accessed partitions.
Data Compression
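A sketch of applying the best practice above; the table names and partition number are hypothetical:

```sql
-- Page compression on a dimension table.
ALTER TABLE dbo.DimCustomer REBUILD
    WITH (DATA_COMPRESSION = PAGE);

-- Page compression on a single fact table partition.
ALTER TABLE dbo.FactSales REBUILD PARTITION = 3
    WITH (DATA_COMPRESSION = PAGE);

-- Fallback if CPU-bound: row compression on a hot partition.
ALTER TABLE dbo.FactSales REBUILD PARTITION = 4
    WITH (DATA_COMPRESSION = ROW);
```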
• Improved query performance
• More granular manageability
• Improved data load performance
• Best practices for partitioning in a DW:
  • Partition large fact tables.
  • Partition on an incrementing date key.
  • Design the partition scheme for ETL and manageability.
  • Maintain an empty partition at the start and end of the table.
Table Partitioning
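A minimal sketch of partitioning a fact table on an incrementing date key; all object names are hypothetical, and everything is mapped to PRIMARY here where a real design would spread partitions across filegroups:

```sql
-- RANGE RIGHT boundaries: rows before the first boundary and after the
-- last one land in the (ideally empty) end partitions.
CREATE PARTITION FUNCTION pfOrderDate (int)
AS RANGE RIGHT FOR VALUES (20120101, 20130101, 20140101, 20150101);

CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.FactSales
(
    OrderDateKey int   NOT NULL,
    ProductKey   int   NOT NULL,
    CustomerKey  int   NOT NULL,
    Qty          int   NOT NULL,
    SalesAmount  money NOT NULL
)
ON psOrderDate (OrderDateKey);
```

New date ranges are then added with SPLIT on the trailing empty partition, and old ones retired with SWITCH and MERGE, which keeps loads and archiving as metadata operations.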
• Indexes maximize query performance.
• Planning indexes is one of the most important parts of the database design process.
• Some inexperienced BI professionals are tempted to create many indexes on all tables to support queries; unnecessary indexes slow down ETL loads and consume storage, so index selectively.
Indexes in DW
• Create a clustered index on the surrogate key column.
  • This column is used to join the dimension table to fact tables, and a clustered index will help the query optimizer minimize the number of reads required to filter fact rows.
• Create a non-clustered index on the alternate key column and include the SCD current flag, start date, and end date columns.
  • This index will improve the performance of lookup operations during ETL data loads that need to handle slowly-changing dimensions.
• Create non-clustered indexes on frequently searched attributes, and consider including all members of a hierarchy in a single index.
Dimension table indexes
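The three guidelines above, sketched against the hypothetical customer dimension used in earlier examples:

```sql
-- 1. Clustered index on the surrogate key used in fact table joins.
CREATE CLUSTERED INDEX CIX_DimCustomer
    ON dbo.DimCustomer (CustKey);

-- 2. Non-clustered index on the alternate key, covering the SCD columns
--    so ETL lookups avoid key lookups into the base table.
CREATE NONCLUSTERED INDEX IX_DimCustomer_AltKey
    ON dbo.DimCustomer (CustAltKey)
    INCLUDE ([Current], [Start], [End]);

-- 3. One index covering all members of the geography hierarchy.
CREATE NONCLUSTERED INDEX IX_DimCustomer_Geography
    ON dbo.DimCustomer (Country, [State], City);
```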
• Create a clustered index on the most commonly-searched date key.
  • Date ranges are the most common filtering criteria in most data warehouse workloads, so a clustered index on this key should be particularly effective in improving overall query performance.
• Create non-clustered indexes on other, frequently-searched dimension keys.
• Consider a columnstore index on all columns.
Fact table indexes
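A sketch of that indexing pattern on the hypothetical fact table from the partitioning example, using the SQL Server 2012-era non-clustered columnstore alongside rowstore indexes:

```sql
-- Clustered index on the most commonly searched date key.
CREATE CLUSTERED INDEX CIX_FactSales
    ON dbo.FactSales (OrderDateKey);

-- Non-clustered indexes on other frequently searched dimension keys.
CREATE NONCLUSTERED INDEX IX_FactSales_Product
    ON dbo.FactSales (ProductKey);
CREATE NONCLUSTERED INDEX IX_FactSales_Customer
    ON dbo.FactSales (CustomerKey);

-- Columnstore index over all columns for scan-heavy analytic queries.
CREATE NONCLUSTERED COLUMNSTORE INDEX CSIX_FactSales
    ON dbo.FactSales (OrderDateKey, ProductKey, CustomerKey, Qty, SalesAmount);
```

Note that in SQL Server 2012 a non-clustered columnstore index makes the table read-only, which fits a partition-switch load pattern.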
• Create a view for each dimension and fact table, with the NOLOCK query hint in the view definition.
• Create views with user-friendly view and column names.
• Do not include metadata columns in views.
• Create views to combine snowflake dimension tables.
• Partition-align indexed views.
• Use the SCHEMABINDING option.
• Use views as a security layer: grant access to the views rather than the base tables.
Using Views in a DW
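A sketch combining several of these guidelines in one view; the rpt schema, object names, and column renames are hypothetical:

```sql
-- User-friendly, schema-bound view over the customer dimension;
-- metadata/SCD housekeeping columns are deliberately excluded.
CREATE VIEW rpt.Customer
WITH SCHEMABINDING
AS
SELECT CustKey AS CustomerKey,
       Name    AS CustomerName,
       Country,
       City
FROM dbo.DimCustomer WITH (NOLOCK);
```

Report users are then granted SELECT on rpt.Customer only, leaving dbo.DimCustomer inaccessible to them, and the friendly names insulate reports from base-table changes.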
• Overview of Data Warehousing
• Data Warehouse Solution
• Data Warehouse Infrastructure
• Data Warehouse Hardware
• Data Warehouse Design Overview
• Designing Dimension Tables
• Designing Fact Tables
• Data Warehouse Physical Design
Summary
Thank you
SELECT
KNOWLEDGE
FROM
SQL SERVER
http://www.sqlschool.gr
Copyright © 2015 SQL School Greece