SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ dsarka@solidq

Agenda

• DW Problems
• Bitmap Filtered Hash Joins
• Table Partitioning
• Filtered Indexes
• Indexed Views
• Data Compression
• Window Functions
• Columnstore Indexes

Algorithms Complexity

• Forever* = about 40 billion billion years!

SSAS Dimensional Addressing

[Figure: cellset grid. Column headers are nested members: Nelson, White; USA, Japan; USA_North, USA_South; Seattle, Boston. Each cell shows its two-digit address.]

1991, Qtr1
  Jan         00 10 20 30 40 50 60 70
  Feb         01 11 21 31 41 51 61 71
  Mar         02 12 22 32 42 52 62 72
1991, Qtr2    03 13 23 33 43 53 63 73
1991, Qtr3    04 14 24 34 44 54 64 74
1991, Qtr4
  Oct         05 15 25 35 45 55 65 75
  Nov         06 16 26 36 46 56 66 76
  Dec         07 17 27 37 47 57 67 77

Callouts: Axis(1); Axis(1).Position(3); Axis(1).Position(1).Members(2)

Every cell has an address

SSAS Tabular Problems

• SSAS address space: m^n cells
– Maximum number of possible combinations
    200 * 5000 * 1095 = 1,095,000,000
– SSAS address space grows exponentially!
– Can run out of address space – limited scalability
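The cell count on this slide is a plain product of member counts. A minimal Python sketch of that arithmetic, assuming the three factors on the slide are the sizes of three dimensions (the function name address_space is made up for illustration):

```python
import math

def address_space(dimension_sizes):
    """Number of addressable cells: the product of the member counts."""
    return math.prod(dimension_sizes)

# The three factors from the slide
cells = address_space([200, 5000, 1095])

# Adding one more dimension multiplies the whole space again;
# the extra size of 50 is a hypothetical value, not from the slide
cells_with_extra_dimension = address_space([200, 5000, 1095, 50])
```

Every added dimension multiplies the space rather than adding to it, which is the growth the slide warns about.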

RDBMS Joins

• Merge: complexity ~ O(n)
– Needs sorted inputs, equijoin
• Hash: complexity ~ O(n) / ~ O(n^2)
– Needs equijoin
• Nested Loops: complexity ~ O(n) (indexed), ~ O(n^2) (not indexed)
– Works always, can become quadratic
• Non-equijoins are frequently quadratic
– E.g., running totals
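To make the complexity figures concrete, here is a rough Python sketch that counts the work done by an unindexed nested loops join and by a hash join on the same equijoin. It is an illustration only, not how the SQL Server operators are implemented; the function names and the (key, value) row shape are assumptions.

```python
def nested_loops_join(left, right):
    """Unindexed nested loops: every left row is compared with every right row."""
    comparisons = 0
    result = []
    for l_key, l_val in left:
        for r_key, r_val in right:
            comparisons += 1
            if l_key == r_key:
                result.append((l_key, l_val, r_val))
    return result, comparisons

def hash_join(build, probe):
    """Hash join: build a hash table on one input, probe it once per row of the other."""
    table = {}
    for key, val in build:
        table.setdefault(key, []).append(val)
    probes = 0
    result = []
    for key, val in probe:
        probes += 1
        for build_val in table.get(key, []):
            result.append((key, build_val, val))
    return result, probes

# Hypothetical data: 100 dimension rows, 1000 fact rows
dims = [(i, f"d{i}") for i in range(100)]
facts = [(i % 100, i) for i in range(1000)]

nl_rows, nl_work = nested_loops_join(dims, facts)
hj_rows, hj_work = hash_join(dims, facts)
```

Both functions return the same rows, but the nested loops version does one comparison per pair of rows, while the hash version does one probe per fact row.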

Linearize Joins

x     y1 = x   y2 = x^2   y3 = x^2 piecewise (per partes)
0     0        0          0
0.2   0.2      0.04       0.04
0.4   0.4      0.16       0.16
0.6   0.6      0.36       0.36
0.8   0.8      0.64       0.64
1     1        1          1
1.2   1.2      1.44       1.04
1.4   1.4      1.96       1.16
1.6   1.6      2.56       1.36
1.8   1.8      3.24       1.64
2     2        4          2
2.2   2.2      4.84       2.04
2.4   2.4      5.76       2.16
2.6   2.6      6.76       2.36
2.8   2.8      7.84       2.64
3     3        9          3
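The y3 values are consistent with restarting the quadratic at every whole number, that is floor(x) + (x - floor(x))^2. That reading is inferred from the numbers in the table rather than stated on the slide. A small Python sketch under that assumption (function names are illustrative):

```python
import math

def quadratic(x):
    """y2: cost grows with the square of the input size."""
    return x * x

def piecewise_quadratic(x):
    """y3: the quadratic restarts at each whole number (assumed partition width of 1)."""
    whole = math.floor(x)
    rest = x - whole
    return whole + rest * rest

# Reproduce a few rows of the table: (x, y2, y3)
table = [(x / 5, quadratic(x / 5), piecewise_quadratic(x / 5)) for x in range(0, 16)]
```

At every whole number y3 is back on the line y1 = x, so splitting a quadratic job into fixed-size pieces keeps the total close to linear.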

Bitmap Filtered Star Joins

• Optimized bitmap filtering for star schema joins
– Bitmap representation of a set of values from a dimension table to pre-filter rows to join from a fact table
– Enables filtering rows early in the plan, allowing subsequent operators to operate on fewer rows

Bloom Filter (1)*

• Bloom filter is a bit array of m bits
– Start with all bits set to 0
• k different hash functions defined
– Each of which maps some set element to one of the m positions with a uniform random distribution
• To add an element, feed it to each of the k hash functions to get k array positions
– Set the bits at all these positions to 1

* Source: Wikipedia, http://en.wikipedia.org/wiki/Bloom_filter

Bloom Filter (2)

• To test whether an element is in the set, feed it to each of the k hash functions to get k array positions
– If any of the bits at these positions are 0, the element is not in the set
– If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements
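The add and test steps can be sketched in a few lines of Python. This is a minimal illustration, not the bitmap operator SQL Server uses; the class name, the choice of salted SHA-256 as the k hash functions, and the sample keys are all assumptions.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m              # number of bits in the array
        self.k = k              # number of hash functions
        self.bits = [0] * m     # start with all bits set to 0

    def _positions(self, item):
        # Simulate k hash functions by salting SHA-256 with the function index
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True means present or a false positive
        return all(self.bits[pos] for pos in self._positions(item))

# Hypothetical star-join use: keys kept from a filtered dimension table
dim_keys = [3, 17, 42]
fact_rows = [(3, 100.0), (5, 20.0), (42, 7.5), (99, 1.0)]

bf = BloomFilter(m=1024, k=3)
for key in dim_keys:
    bf.add(key)

# Pre-filter the fact rows before the join; false positives are removed by the join itself
candidates = [row for row in fact_rows if bf.might_contain(row[0])]
```

The filter never drops a row that would join, so it is safe to apply early in the plan; the only cost of a false positive is a row that the join later discards.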

Table Partitioning

• Partition function
• Partition scheme
• Aligned indexes
• Partition elimination
• Partition switching

Filtered Indexes

• WHERE clause in the CREATE INDEX statement
• Small B-trees on subset of data only
• Useful when some values are selective, while others dense
– Index on selective values only

Indexed Views

• Useful for queries that aggregate data
– Can also reduce number of joins
• Depending on edition of SQL Server can be used automatically
– No need to change reporting queries
• Many limitations

Data Compression

• Pre-SQL 2005: variable-length data types
• SQL 2005: vardecimal
• SQL 2008
– Row compression
– Page compression
• SQL 2008 R2
– Unicode compression

SQL 2008 Compression

• Row compression
– Fixed-width data type values stored in variable format
• Page compression
– Prefix compression
– Dictionary compression

Unicode Compression

• Works on nchar(n) and nvarchar(n)
• Automatically with row or page compression
• Savings depend on language
– Up to 50% in English, German
– Only 15% in Japanese

• Very low performance penalty

Window Functions

• Functions operating on a window (set) of rows defined by an OVER clause

• Types of functions:
– Ranking
– Aggregate
– Distribution

SELECT empid, ordermonth, qty,
  SUM(qty) OVER(PARTITION BY empid
                ORDER BY ordermonth
                ROWS BETWEEN UNBOUNDED PRECEDING
                         AND CURRENT ROW) AS runqty
FROM Sales.EmpOrders;
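As a rough illustration of what that frame computes, the following Python sketch produces the same running total per employee in a single ordered pass. It only mimics the semantics of the query; the function name and the sample rows are made up.

```python
from itertools import groupby
from operator import itemgetter

def running_qty(rows):
    """rows: iterable of (empid, ordermonth, qty) tuples.

    Mirrors PARTITION BY empid ORDER BY ordermonth with a frame of
    UNBOUNDED PRECEDING to CURRENT ROW.
    """
    ordered = sorted(rows, key=itemgetter(0, 1))
    out = []
    for empid, group in groupby(ordered, key=itemgetter(0)):
        total = 0  # the running total restarts for each partition
        for _, ordermonth, qty in group:
            total += qty
            out.append((empid, ordermonth, qty, total))
    return out

# Hypothetical sample data
sample = [(1, "2012-02", 5), (1, "2012-01", 10), (2, "2012-01", 7)]
result = running_qty(sample)
```

After the sort, each row is visited once, in contrast with the quadratic non-equijoin formulation of running totals.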

Window Functions in SQL Server

• SQL Server 2005:
– Ranking calculations
– Aggregates with only window partitioning
• SQL Server 2012:
– Aggregates with also window ordering and framing
– Offset functions: LAG, LEAD, FIRST_VALUE, LAST_VALUE
– Distribution functions: PERCENT_RANK, CUME_DIST, PERCENTILE_CONT, PERCENTILE_DISC

SQL Server DW / OLAP Offerings

• Personal and team level
– PowerPivot for Excel (client)
– PowerPivot for SharePoint (server)
• Corporate level
– SQL Server
– SSAS Tabular
– SSAS Dimensional
– Fast Track Data Warehouse
– Parallel Data Warehouse

[Figure callout: VertiPaq]

Trans-Relational Model

• Not “beyond” relational
– Transformation between logical and physical layer
• Steve Tarin, Required Technologies Inc. (1999)
• All columns stored in sorted order
– All joins become merge joins
– Can condense storage
– Of course, updates suffer
• Logically, this is a pure relational model
• SQL Server uses own variant
– Order of columns not preserved – optimized for compression
– Leverages parallel hash joins rather than merge joins

Columnar Storage (1)

Original rows:

Row / Col   1       2       3
            Name    Color   City
1           Nut     Red     London
2           Bolt    Green   Paris
3           Screw   Blue    Oslo
4           Screw   Red     London
5           Cam     Blue    Paris
6           Cog     Red     London

Each column stored in sorted order:

Row / Col   1       2       3
            Name    Color   City
1           Bolt    Blue    London
2           Cam     Blue    London
3           Cog     Green   London
4           Nut     Red     Oslo
5           Screw   Red     Paris
6           Screw   Red     Paris

Columnar Storage (2)

Row / Col   1             2             3
            Name          Color         City
1           Bolt [1:1]    Blue [1:2]    London [1:3]
2           Cam [2:2]     Green [3:3]   Oslo [4:4]
3           Cog [3:3]     Red [4:6]     Paris [5:6]
4           Nut [4:4]
5           Screw [5:6]
6

Row Reconstruction Table

Sorted columns:

Row / Col   1       2       3
            Name    Color   City
1           Bolt    Blue    London
2           Cam     Blue    London
3           Cog     Green   London
4           Nut     Red     Oslo
5           Screw   Red     Paris
6           Screw   Red     Paris

Condensed columns:

Row / Col   1             2             3
            Name          Color         City
1           Bolt [1:1]    Blue [1:2]    London [1:3]
2           Cam [2:2]     Green [3:3]   Oslo [4:4]
3           Cog [3:3]     Red [4:6]     Paris [5:6]
4           Nut [4:4]
5           Screw [5:6]
6

Row reconstruction table:

Row / Col   1       2       3
            Name    Color   City
1           3       6       4
2           1       4       6
3           6       5       3
4           4       1       5
5           2       2       1
6           5       3       2
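One way to read the row reconstruction table is as a ring of pointers: the Name entry gives the row to read in the sorted Color column, the Color entry at that row gives the row in the sorted City column, and the City entry points back to the starting Name row. That reading is an assumption, but it does reproduce the original six rows from the data on the slides, as this Python sketch shows (variable and function names are illustrative):

```python
# Sorted columns, copied from the slide
NAME = ["Bolt", "Cam", "Cog", "Nut", "Screw", "Screw"]
COLOR = ["Blue", "Blue", "Green", "Red", "Red", "Red"]
CITY = ["London", "London", "London", "Oslo", "Paris", "Paris"]

# Row reconstruction table, one list per column, 1-based pointers as on the slide
RRT_NAME = [3, 1, 6, 4, 2, 5]
RRT_COLOR = [6, 4, 5, 1, 2, 3]
RRT_CITY = [4, 6, 3, 5, 1, 2]

def reconstruct(name_row):
    """Follow the pointer ring Name -> Color -> City -> Name for one row."""
    color_row = RRT_NAME[name_row - 1]
    city_row = RRT_COLOR[color_row - 1]
    back_to_name = RRT_CITY[city_row - 1]
    if back_to_name != name_row:
        raise ValueError("pointer ring does not close")
    return (NAME[name_row - 1], COLOR[color_row - 1], CITY[city_row - 1])

rows = [reconstruct(i) for i in range(1, 7)]
```

Starting from Name row 1 (Bolt), the pointers lead to Color row 3 (Green) and City row 5 (Paris), which is the original row 2.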

SQL Server Solution (1)*

• Converting rows to column segments

* Source: SQL Server Column Store Indexes by Per-Åke Larson, et al., Microsoft, SIGMOD’11, June 12–16, 2011

SQL Server Solution (2)

• Storing column segments as BLOBs
– Leverages existing BLOB storage
– Additional segment metadata
– Multiple compression algorithms

Columnstore Compression

• Encoding values to 32-bit or 64-bit integer
– Dictionary-based encoding
– Value-based (prefix) encoding
• Optimal row ordering with VertiPaq™ algorithm to rearrange rows
– Optimal ordering for Run-Length Encoding (RLE) for best overall compression
• Compression
– RLE - data stored as <value, count> pairs
– Bit-Pack – use min number of bits for a value
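A toy Python sketch of the three ideas: dictionary encoding, run-length encoding, and bit packing. A plain sort stands in for the VertiPaq row-ordering algorithm, whose details are not given here, and all function names are made up for illustration.

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer id."""
    dictionary = {v: i for i, v in enumerate(sorted(set(values)))}
    return dictionary, [dictionary[v] for v in values]

def rle_encode(ids):
    """Store runs of equal values as (value, count) pairs."""
    runs = []
    for v in ids:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(run) for run in runs]

def bits_needed(max_value):
    """Minimum number of bits that can hold values 0..max_value."""
    return max(1, max_value.bit_length())

# Color column from the Columnar Storage slides, in original row order
colors = ["Red", "Green", "Blue", "Red", "Blue", "Red"]
dictionary, ids = dictionary_encode(colors)

unordered_runs = rle_encode(ids)        # no two neighbours are equal here
ordered_runs = rle_encode(sorted(ids))  # sorting is a stand-in for row reordering
width = bits_needed(max(ids))
```

Reordering the rows turns six runs of length one into three runs, and each id fits in two bits instead of a 32-bit integer.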

Result: Reduced I/O

• Fetches only needed columns from disk
• Columns are compressed
• Less IO
• Better buffer hit rates

[Figure: columns C1 to C6 stored separately; a query such as SELECT region, sum (sales) … fetches only the columns it needs]

Result: Reading Segments

• Column segment contains values from one column for a set of about 1M rows
• Column segment is unit of transfer from disk
• Storage engine can eliminate segments early in the process
– Because of additional column segment metadata

[Figure: columns C1 to C6, each divided into column segments; one segment covers a set of about 1M rows]
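Segment elimination can be sketched as follows, assuming the segment metadata includes the minimum and maximum value in each segment. The slide does not say which metadata is kept, so treat the Segment class and both function names as hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    min_value: int   # metadata: smallest value in the segment
    max_value: int   # metadata: largest value in the segment
    values: list     # the segment's column values

def build_segments(column, rows_per_segment):
    """Split one column into fixed-size segments and record min/max for each."""
    segments = []
    for start in range(0, len(column), rows_per_segment):
        chunk = column[start:start + rows_per_segment]
        segments.append(Segment(min(chunk), max(chunk), chunk))
    return segments

def scan_between(segments, low, high):
    """Return matching values and the number of segments actually read."""
    matches, segments_read = [], 0
    for seg in segments:
        if seg.max_value < low or seg.min_value > high:
            continue  # eliminated from metadata alone, values never touched
        segments_read += 1
        matches.extend(v for v in seg.values if low <= v <= high)
    return matches, segments_read

# Hypothetical column of 1000 ordered values in segments of 100 rows
segments = build_segments(list(range(1000)), rows_per_segment=100)
matches, segments_read = scan_between(segments, 250, 260)
```

With data loaded in order, a narrow range predicate touches one segment out of ten; with randomly ordered data the min/max ranges overlap and fewer segments can be skipped.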

Reducing CPU Usage

• Columnstore indexes reduce disk IO
• Bitmap-filtered hash joins can be executed in parallel
• Problem: CPU becomes a bottleneck
• Solution: reduce CPU usage by processing large numbers of rows
– Iterators that do not process row-at-a-time
– Process batch-at-a-time
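The difference between the two iterator styles can be illustrated with a small Python sketch that counts how many times a consumer has to call its child operator. It is an analogy only, not the SQL Server execution engine, and the batch size used is an arbitrary value chosen for the example.

```python
def scan_rows(table):
    """Row mode: hand over one row per call."""
    for row in table:
        yield row

def scan_batches(table, batch_size):
    """Batch mode: hand over a slice of rows per call."""
    for start in range(0, len(table), batch_size):
        yield table[start:start + batch_size]

def sum_row_mode(table):
    total, calls = 0, 0
    for row in scan_rows(table):
        calls += 1
        total += row
    return total, calls

def sum_batch_mode(table, batch_size):
    total, calls = 0, 0
    for batch in scan_batches(table, batch_size):
        calls += 1
        total += sum(batch)  # one tight loop over the whole batch
    return total, calls

table = list(range(10000))
row_total, row_calls = sum_row_mode(table)
batch_total, batch_calls = sum_batch_mode(table, batch_size=1000)
```

Both modes return the same sum, but the batch version pays the per-call overhead 10 times instead of 10,000 times.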

Batch Processing

• Orthogonal to columnstore indexes
– Can support other storage
• However, best results with columnstore indexes
– Sometimes can perform batch operations directly on compressed data
• Can mix batch and row operators
– Can dynamically switch from batch to row mode

Batch Operators

• The following operators support batch mode processing:
– Filter
– Project
– Scan
– Local hash (partial) aggregation
– Hash inner join
– Batch hash table build

Source: http://social.technet.microsoft.com/wiki/contents/articles/sql-server-columnstore-index-faq.aspx#Batch_mode_processing

Columnstore Indexes Constraints

• Base table must be clustered B-tree or heap
• Columnstore index:
– Nonclustered
– One per table
– Must be partition-aligned
– Not allowed on indexed view
– Can’t be a filtered index

Data Type Restrictions

• Unsupported types
– Decimal > 18 digits
– Binary
– BLOB
– (n)varchar(max)
– Uniqueidentifier
– Date/time types > 8 bytes
– CLR

Query Performance Restrictions

• Outer joins
• Unions
• Consider modifying queries to hit “sweet spot”
– Inner joins
– Star joins
– Aggregation

Loading New Data

• Columnstore index makes table read-only
• Partition switching allowed
• INSERT, UPDATE, DELETE, and MERGE not allowed
• Two recommended methods for loading data
– Disable, update, rebuild
– Partition switching

Columnstore Indexes Usage

• Use when:
– Read-mostly workload
– Most updates are appending new data
– Workflow permits partitioning or index drop/rebuild
– Queries often scan & aggregate lots of data
• Use on fact (and large dimension) tables
• Do not use when:
– Frequent updates
– Partition switching or rebuilding index doesn’t fit workflow
– Frequent small look up queries
– VertiPaq cannot handle your data model

Review

• DW Problems
• Bitmap Filtered Hash Joins
• Table Partitioning
• Filtered Indexes
• Indexed Views
• Data Compression
• Window Functions
• Columnstore Indexes

Q & A

• Questions?
• Thank you for coming to this conference…
• …and this presentation!

References

• Books:
– SQL Server Books OnLine
– Dejan Sarka, Grega Jerkič and Matija Lah: MCTS Self-Paced Training Kit (Exam 70-463): Building Data Warehouses with Microsoft SQL Server 2012
• Courses and Seminars
– SQL Server 2012 and SharePoint BI Immersion
– Advanced Transact-SQL
