SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ dsarka@solidq
description
Transcript of SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ dsarka@solidq
Agenda
• DW Problems• Bitmap Filtered Hash Joins• Table Partitioning• Filtered Indexes• Indexed Views• Data Compression• Window Functions• Columnstore Indexes
2
Algorithms Complexity
• Forever* = about 40 billion billion years!
3
SSAS Dimensional Addressing
Nelson White
USA Japan
USA Japan USA_North USA_
South USA_North USA_
South Seattle Boston Seattle Boston
1991, Qtr1
Jan 00 10 20 30 40 50 60 70
Feb 01 11 21 31 41 51 61 71
Mar 02 12 22 32 42 52 62 72
1991, Qtr2 03 13 23 33 43 53 63 73
1991, Qtr 3 04 14 24 34 44 54 64 74
1991, Qtr4
Oct 05 15 25 35 45 55 65 75
Nov 06 16 26 36 46 56 66 76
Dec 07 17 27 37 47 57 67 77
Axis(1).Position(3)
Axis(1).Position(1).Memb
ers(2)
Axis(1)
Every cell has an address
4
SSAS Tabular Problems
• SSAS address space: mn cells−Maximum number of possible combinations
200 * 5000 * 1095 = 109,500,000
−SSAS address space grows exponentially!−Can run out of address space – limited scalability
5
6
RDBMS Joins• Merge: complexity ~ O(n)
– Needs sorted inputs, equijoin
• Hash: complexity ~ O(n) / ~O(n2)– Needs equijoin
• Nested Loops: complexity ~ O(n) (indexed), ~ O(n2) (not
indexed)– Works always, can become quadratic
• Non-equijoins are frequently quadratic– E.g., running totals
Linearize Joins
x y1 = x y2 = x2
y3 = x2
per partes
0 0 0 00,2 0,2 0,04 0,040,4 0,4 0,16 0,160,6 0,6 0,36 0,360,8 0,8 0,64 0,64
1 1 1 11,2 1,2 1,44 1,041,4 1,4 1,96 1,161,6 1,6 2,56 1,361,8 1,8 3,24 1,64
2 2 4 22,2 2,2 4,84 2,042,4 2,4 5,76 2,162,6 2,6 6,76 2,362,8 2,8 7,84 2,64
3 3 9 3
7
8
Bitmap Filtered Star Joins
• Optimized bitmap filtering for star schema joins– Bitmap representation of a set of values from
a dim table to pre-filter rows to join from a fact table
– Enables filtering rows early in the plan, allowing subsequent operators to operate on fewer rows
Bloom Filter (1)*
• Bloom filter is a bit array of m bits– Start with all bits set to 0
• k different hash functions defined– Each of which maps some set element to one of the m positions with a uniform random distribution
• To add an element, feed it to each of the k hash functions to get k array positions– Set the bits at all these positions to 1
9
Source: Wikipedia
Bloom Filter (2)• To test whether and element it is in the set, feed
it to each of the k hash functions to get k array positions– If any of the bits at these positions are 0, the element
is not in the set – If all are 1, then either the element is in the set, or the
bits have been set to 1 during the insertion of other elements
10
Table Partitioning
• Partition function• Partition scheme• Aligned indexes• Partition elimination• Partition switching
11
Filtered Indexes
• Where clause in the Create Index statement
• Small B-trees on subset of data only• Useful when some values are selective,
while others dense– Index on selective values only
12
Indexed Views
• Useful for queries that aggregate data– Can also reduce number of joins
• Depending on edition of SQL Server can be used automatically– No need to change reporting queries
• Many limitations
13
Data Compression
• Pre-SQL 2005: variable-length data types• SQL 2005: vardecimal• SQL 2008
−Row compression−Page compression
• SQL 2008 R2−Unicode compression
SQL 2008 Compression
• Row compression– Fixed-width data type values stored in variable format
•Page compression
• Prefix compression
• Dictionary compression
Unicode Compression
• Works on nchar(n) and nvarchar(n)• Automatically with row or page compression• Savings depends on language
– Up to 50% in English, German– Only 15% in Japanese
• Very low performance penalty
Window Functions
• Functions operating on a window (set) of rows defined by an OVER clause
• Types of functions:• Ranking• Aggregate• Distribution
SELECT empid, ordermonth, qty, SUM(qty) OVER(PARTITION BY empid ORDER BY ordermonth ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS runqtyFROM Sales.EmpOrders;
17
Window Functions in SQL Server
• SQL Server 2005:– Ranking calculations– Aggregates with only window partitioning
• SQL Server 2012:– Aggregates with also window ordering and framing– Offset functions: LAG, LEAD, FIRST_VALUE,
LAST_VALUE– Distribution functions: PERCENT_RANK,
CUME_DIST, PERCENTILE_CONT, PERCENTILE_DISC
18
SQL Server DW / OLAP Offerings
• Personal and team level– PowerPivot for Excel (client)– PowerPivot for SharePoint (server)
• Corporate level– SQL Server – SSAS Tabular– SSAS Dimensional– Fast Track Data Warehouse– Parallel Data Warehouse
Vert iPaq
19
Trans-Relational Model• Not “beyond” relational
– Transformation between logical and physical layer• Steve Tarin, Required Technologies Inc. (1999)• All columns stored in sorted order
– All joins become merge joins– Can condense storage– Of course, updates suffer
• Logically, this is a pure relational model• SQL Server uses own variant
– Order of columns not preserved – optimized for compression
– Leverages parallel hash joins rather than merge joins20
Columnar Storage (1)Row / Col 1 2 3
Name Color City
1 Nut Red London
2 Bolt Green Paris
3 Screw Blue Oslo
4 Screw Red London
5 Cam Blue Paris
6 Cog Red London
Row / Col 1 2 3
Name Color City
1 Bolt Blue London
2 Cam Blue London
3 Cog Green London
4 Nut Red Oslo
5 Screw Red Paris
6 Screw Red Paris
21
Columnar Storage (2)
Row / Col 1 2 3
Name Color City
1 Bolt [1:1] Blue [1:2] London [1:3]
2 Cam [2:2] Green [3:3] Oslo [4:4]
3 Cog [3:3] Red [4:6] Paris [5:6]
4 Nut [4:4]
5 Screw [5:6]
6
22
Row / Col 1 2 3
Name Color City
1 Bolt Blue London
2 Cam Blue London
3 Cog Green London
4 Nut Red Oslo
5 Screw Red Paris
6 Screw Red Paris
Row Reconstruction Table
Row / Col 1 2 3
Name Color City
1 Bolt [1:1] Blue [1:2] London [1:3]
2 Cam [2:2] Green [3:3] Oslo [4:4]
3 Cog [3:3] Red [4:6] Paris [5:6]
4 Nut [4:4]
5 Screw [5:6]
6
Row / Col 1 2 3
Name Color City
1 3 6 4
2 1 4 6
3 6 5 3
4 4 1 5
5 2 2 1
6 5 3 2
23
SQL Server Solution (1)*
• Converting rows to column segments
24
Source: SQL Server Column Store Indexes by Per-Åke Larson, et al., MicrosoftSIGMOD’10, June 12–16, 2011
SQL Server Solution (2)
• Storing column segments as BLOBs– Leverages existing
BLOB storage– Additional segment
metadata– Multiple
compression algorithms
25
Columnstore Compression
• Encoding values to 32-bit or 64-bit integer– Dictionary-based encoding– Value-based (prefix) encoding
• Optimal row ordering with VertiPaq™ algorithm to rearrange rows– Optimal ordering for Run-Length Encoding
(RLE) for best overall compression• Compression
– RLE - data stored as <value, count> pairs– Bit-Pack– use min number of bits for a value
26
Result: Reduced I/O
• Fetches only needed columns from disk
• Columns are compressed
• Less IO• Better buffer hit
rates
C1
C2
C4 C5 C6
C3
SELECT region, sum (sales) …
Result: Reading Segments
• Column segment contains values from one column for a set of about 1M rows
• Column segment is unit of transfer from disk
• Storage engine can eliminate segments early in the process• Because of additional
column segment metadata
C1 C2 C3 C5 C6C4
Set of about 1M rows
Column Segment
Reducing CPU Usage
• Columnstore indexes reduce disk IO• Bitmap-filtered hash joins can be executed
in parallel• Problem: CPU becomes a bottleneck• Solution: reduce CPU usage by
processing large numbers of rows– Iterators that do not process row-at-a-time– Process batch-at-a-time
Batch Processing
• Orthogonal to columnstore indices– Can support other storage
• However, best results with columnstore indices– Sometimes can perform batch operations
directly on compressed data• Can mix batch and row operators
– Can dynamically switch from batch to row mode
30
Batch Operators
• The following operators support batch mode processing:– Filter– Project– Scan– Local hash (partial) aggregation– Hash inner join– Batch hash table build
31
Source: http://social.technet.microsoft.com/wiki/contents/articles/sql-server-columnstore-index-faq.aspx#Batch_mode_processing
Columnstore Indexes Constraints
• Base table must be clustered B-tree or heap
• Columnstore index:– Nonclustered – One per table– Must be partition-aligned– Not allowed on indexed view– Can’t be a filtered index
32
Data Type Restrictions
• Unsupported types– Decimal > 18 digits– Binary– BLOB– (n)varchar(max)– Uniqueidentifier– Date/time types > 8 bytes– CLR
33
Query Performance Restrictions
• Outer joins• Unions• Consider modifying queries to hit “sweet
spot”– Inner joins– Star joins– Aggregation
34
Loading New Data
• Columnstore index makes table read-only• Partition switching allowed• INSERT, UPDATE, DELETE, and MERGE
not allowed• Two recommended methods for loading
data• Disable, update, rebuild• Partition switching
35
36
Columnstore Indexes Usage
• Use when:– Read-mostly workload– Most updates are appending new data– Workflow permits partitioning or index drop/rebuild – Queries often scan & aggregate lots of data
• Use on fact (and large dimensions) tables• Do not use when:
– Frequent updates– Partition switching or rebuilding index doesn’t fit workflow– Frequent small look up queries– VertiPaq cannot handle your data model
Review
• DW Problems• Bitmap Filtered Hash Joins• Table Partitioning• Filtered Indexes• Indexed Views• Data Compression• Windows Functions• Columnstore Indexes
37
38
Q & A
• Questions?
• Thank you for coming to this conference…• …and this presentation!
39
References
• Books:– SQL Server Books OnLine – Dejan Sarka, Grega Jerkič and Matija Lah: MCTS
Self-Paced Training Kit (Exam 70-463): Building Data Warehouses with Microsoft SQL Server 2012
• Courses and Seminars– SQL Server 2012 and SharePoint BI Immersion– Advanced Transact-SQL