Turbocharge your Data Warehouse Queries with Columnstore Indexes
Len Wyatt, Program Manager, Microsoft Corporation


DBI313

Agenda

Motivation
How columnstores speed up queries
Loading columnstores
Optimizing database and index design
Optimizing queries

demo

Columnstores speed up queries

Overview of Columnstore Index

How ColumnStore Indexes Speed Up Queries

[Diagram: pages, each holding data from a single column, C1–C6]

ColumnStore indexes store data column-wise

Each page stores data from a single column

Highly compressed
- About 2x better than PAGE compression
- More data fits in memory

Each column can be accessed independently

- Fetch only the columns needed
- Can dramatically decrease IO

Heaps, B-trees store data row-wise

Columnstore Index Structure

Column segment
- A segment contains values from one column for a set of rows
- Segments for the same set of rows comprise a row group
- Segments are compressed
- Each segment is stored in a separate LOB
- The segment is the unit of transfer between disk and memory
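The segment and row-group metadata described above can be inspected through catalog views. A sketch (the table name dbo.MyFact is a placeholder; in SQL Server 2012, sys.column_store_segments joins to sys.partitions on hobt_id):

```sql
-- List every segment of a table's columnstore index, with its row count,
-- min/max data ids (the basis for segment elimination), and on-disk size.
select s.column_id,
       s.segment_id,
       s.row_count,
       s.min_data_id,
       s.max_data_id,
       s.on_disk_size
from sys.column_store_segments as s
join sys.partitions as p
    on s.hobt_id = p.hobt_id
where p.object_id = object_id('dbo.MyFact')
order by s.column_id, s.segment_id;
```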


[Diagram: columns C1–C6 split into segments; the segments covering the same set of rows form a row group]

Columnstore Index Example

OrderDateKey  ProductKey  StoreKey  RegionKey  Quantity  SalesAmount
20101107      106         01        1          6         30.00
20101107      103         04        2          1         17.00
20101107      109         04        2          2         20.00
20101107      103         03        2          1         17.00
20101107      106         05        3          4         20.00
20101108      106         02        1          5         25.00
20101108      102         02        1          1         14.00
20101108      106         03        2          5         25.00
20101108      109         01        1          1         10.00
20101109      106         04        2          4         20.00
20101109      106         04        2          5         25.00
20101109      103         01        1          1         17.00

Horizontally Partition (Row Groups)

Row group 1:

OrderDateKey  ProductKey  StoreKey  RegionKey  Quantity  SalesAmount
20101107      106         01        1          6         30.00
20101107      103         04        2          1         17.00
20101107      109         04        2          2         20.00
20101107      103         03        2          1         17.00
20101107      106         05        3          4         20.00
20101108      106         02        1          5         25.00

Row group 2:

OrderDateKey  ProductKey  StoreKey  RegionKey  Quantity  SalesAmount
20101108      102         02        1          1         14.00
20101108      106         03        2          5         25.00
20101108      109         01        1          1         10.00
20101109      106         04        2          4         20.00
20101109      106         04        2          5         25.00
20101109      103         01        1          1         17.00

Vertically Partition (Segments)

Row group 1 segments:
  OrderDateKey: 20101107, 20101107, 20101107, 20101107, 20101107, 20101108
  ProductKey:   106, 103, 109, 103, 106, 106
  StoreKey:     01, 04, 04, 03, 05, 02
  RegionKey:    1, 2, 2, 2, 3, 1
  Quantity:     6, 1, 2, 1, 4, 5
  SalesAmount:  30.00, 17.00, 20.00, 17.00, 20.00, 25.00

Row group 2 segments:
  OrderDateKey: 20101108, 20101108, 20101108, 20101109, 20101109, 20101109
  ProductKey:   102, 106, 109, 106, 106, 103
  StoreKey:     02, 03, 01, 04, 04, 01
  RegionKey:    1, 2, 1, 2, 2, 1
  Quantity:     1, 5, 1, 4, 5, 1
  SalesAmount:  14.00, 25.00, 10.00, 20.00, 25.00, 17.00

Compress Each Segment*

[Diagram: the two row groups' segments shown again, now individually compressed]

Some segments will compress more than others

*Encoding and reordering not shown

Fetch Only Needed Columns

SELECT ProductKey, SUM(SalesAmount)
FROM SalesTable
WHERE OrderDateKey < 20101108

[Diagram: only the OrderDateKey, ProductKey, and SalesAmount segments are fetched; the StoreKey, RegionKey, and Quantity segments are never read]

Fetch Only Needed Segments

SELECT ProductKey, SUM(SalesAmount)
FROM SalesTable
WHERE OrderDateKey < 20101108

[Diagram: segment metadata shows that every OrderDateKey value in the second row group is >= 20101108, so its segments are eliminated; only the first row group is read]

Batch Mode Speeds Up Queries

Biggest advancement in SQL Server query processing in years…
- Data moves as a batch through query plan operators
- Minimizes instructions per row
- Takes advantage of cache structures
- Highly efficient algorithms
- Better parallelism

Batch mode processing

- Process ~1000 rows at a time
- Batch stored in vector form, optimized to fit in the L1 cache
- Vector operators implemented: filter, hash join, hash aggregation
- Greatly reduced CPU time (7x to 40x)


[Diagram: a batch object holds column vectors plus a bitmap of qualifying rows]

#1 Takeaway!

Make sure most of the work of the query happens in batch mode

Loading Columnstores Effectively

Loading new data into a columnstore index

Tables with columnstore indexes can be read, but not updated
- Partition switching is allowed
- INSERT, UPDATE, DELETE, and MERGE are not allowed

Recommended methods for loading data:
- Disable, update, rebuild
- Partition switching
- UNION ALL

Adding Data Using Disable, Update, Rebuild

Disable (or drop) the columnstore index:
ALTER INDEX my_index ON MyTable DISABLE

Update the table

Rebuild the columnstore index:
ALTER INDEX my_index ON MyTable REBUILD
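Put together, the cycle looks like this (the index and table names are hypothetical, and the INSERT stands in for whatever bulk-load method you use):

```sql
-- 1. Disable the columnstore; the table becomes writable again
ALTER INDEX csi_MyFact ON dbo.MyFact DISABLE;

-- 2. Load the new rows
INSERT INTO dbo.MyFact
SELECT OrderDateKey, ProductKey, StoreKey, RegionKey, Quantity, SalesAmount
FROM dbo.MyFact_Staging;

-- 3. Rebuild; the table is read-only once more and queries get the columnstore back
ALTER INDEX csi_MyFact ON dbo.MyFact REBUILD;
```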

Adding Data Using Partition Switching

Columnstores must be partition-aligned
Partition switching is fully supported

To add data daily:
- Partition by day
- Every day:
  - Split the last partition
  - Load data into a staging table and build a columnstore index on it
  - Switch it in

Avoids a costly drop/rebuild
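The daily cycle above might be sketched as follows (the partition function, scheme, and table names are hypothetical, assuming a RANGE RIGHT function on OrderDateKey):

```sql
-- 1. Split the last partition to make room for the new day
ALTER PARTITION SCHEME ps_OrderDate NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_OrderDate() SPLIT RANGE (20101110);

-- 2. Bulk load dbo.MyFact_Staging, then build a matching columnstore on it
CREATE NONCLUSTERED COLUMNSTORE INDEX csi_Staging
    ON dbo.MyFact_Staging (OrderDateKey, ProductKey, StoreKey,
                           RegionKey, Quantity, SalesAmount);

-- 3. Switch the staging table into the partition for the loaded day
ALTER TABLE dbo.MyFact_Staging
    SWITCH TO dbo.MyFact PARTITION $PARTITION.pf_OrderDate(20101109);
```

The staging table must match the fact table's schema, indexes, and filegroup for the switch to succeed.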

Adding Data Using UNION ALL (trickle load)

- Master table (columnstore) plus delta table (rowstore)
- Query using the UNION ALL local-global aggregation workaround
- Add the delta rows to the master table nightly

Achieving Fast Columnstore Index Builds

Index builds are memory intensive
- Memory requirement is related to the number of columns, the data, and DOP
- The index build is parallel only if the table has > 1 million rows
- One thread per segment
- Low memory throttles parallelism

Consider:
- A high min server memory setting
- Setting REQUEST_MAX_MEMORY_GRANT_PERCENT to 50
- Adding memory
- Omitting columns
- Reducing parallelism

create columnstore index <name> on <table>(<columns>) with (maxdop = 1);
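The memory settings above can be applied like this (the values are illustrative; REQUEST_MAX_MEMORY_GRANT_PERCENT is a Resource Governor workload-group setting):

```sql
-- Reserve memory for SQL Server
EXEC sp_configure 'min server memory (MB)', 16384;
RECONFIGURE;

-- Allow a single request (e.g. the index build) up to 50% of the group's memory
ALTER WORKLOAD GROUP [default] WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 50);
ALTER RESOURCE GOVERNOR RECONFIGURE;
```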

Optimizing database and index design

Eliminating Unsupported Data Types

Currently unsupported types for columnstores:
- decimal > 18 digits
- binary
- BLOB, (n)varchar(max)
- uniqueidentifier
- date/time types > 8 bytes
- CLR types

Either omit the column from the columnstore, or modify the column to a supported type:
- Reduce the precision of numerics to 18 digits or less
- Convert GUIDs to ints
- Reduce the precision of datetimeoffset to 2 or less
- Convert hierarchyid to int or string
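A couple of the conversions above, sketched against a hypothetical fact table:

```sql
-- Reduce a wide decimal to 18 digits of precision
ALTER TABLE dbo.MyFact ALTER COLUMN Amount decimal(18, 4);

-- Reduce datetimeoffset precision to 2
ALTER TABLE dbo.MyFact ALTER COLUMN EventTime datetimeoffset(2);
```

GUID and hierarchyid columns are usually better replaced with integer surrogate keys during ETL than altered in place.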

Reduce Nonclustered B-trees

- Covering B-trees are no longer needed on the source table
- Extra B-trees can cause the optimizer to choose a poor plan
- Save space
- Reduce ETL time

Ensuring segment elimination by date

Use clustered B-tree on date in source table

- The columnstore inherits the order
- Or, partition by date
- Ordering by load date, ship date, order date, etc. can all work

Dates are naturally correlated
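A sketch of the pattern, with hypothetical names:

```sql
-- Cluster the base table on the date key so rows are stored in date order
CREATE CLUSTERED INDEX ci_MyFact_Date ON dbo.MyFact (OrderDateKey);

-- The columnstore built afterwards inherits that order, so each OrderDateKey
-- segment covers a narrow [min, max] range and date filters can skip segments
CREATE NONCLUSTERED COLUMNSTORE INDEX csi_MyFact
    ON dbo.MyFact (OrderDateKey, ProductKey, StoreKey,
                   RegionKey, Quantity, SalesAmount);
```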

Design out strings from columnstores

String filters don’t get pushed to the storage engine
- More batches to process
- Defeats segment elimination

Joining on string columns is slow
Factor strings out to dimensions

Before (string in the fact table):

Date      LicenseNum  Measure
20120301  XYZ123      100
20120302  ABC777      200

After (integer key in the fact table, string in a dimension):

Date      LicenseId  Measure
20120301  1          100
20120302  2          200

LicenseId  LicenseNum
1          XYZ123
2          ABC777
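The dimension table for the example above might be declared like this (names hypothetical):

```sql
CREATE TABLE dbo.License (
    LicenseId  int         NOT NULL PRIMARY KEY,  -- integer surrogate key used in the fact table
    LicenseNum varchar(20) NOT NULL               -- the original string, kept out of the columnstore
);
```

Filters on LicenseNum are then applied to the small dimension table, and the fact table is joined on the integer LicenseId.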

Optimizing queries

Best Practices

- Use a star schema
- Put columnstores on large tables only
- Include every column of the table in the columnstore index
- Use integer surrogate keys for joins

Forcing use or non-use of Columnstores

Query hint:
OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX)

Index hint:
… FROM F WITH (INDEX = MyColumnStore) …
… FROM F WITH (INDEX = MyClusteredBtree) …

Things to Avoid

- Joining or filtering on string columns
- Joining pairs of very large tables if you don’t have to
- NOT IN <subquery> on a columnstore table
- OUTER JOIN on a columnstore table
- UNION ALL to combine columnstore tables with other tables

Common workarounds

demo

Example need for a workaround

The common theme

Since there are some queries that the optimizer won’t be able to run in batch mode…
- Check the execution plan to verify batch mode
- Find the subset that can run in batch mode
- Rewrite the query to run mostly in batch mode
- Join to the rest of the data

#1 Takeaway!

Make sure most of the work of the query happens in batch mode

Outer Join Example & Workaround

Outer join prevents batch processing

Rewrite the query:
- Inner join in batch mode
- Left join to complete the data set

Original (outer join, row mode):

select m.Title, COUNT(p.IP) PurchaseCount
from Media m
left outer join Purchase p on p.MediaId = m.MediaId
group by m.Title
order by COUNT(p.IP) desc;

6.4 sec elapsed, 55 CPU-seconds

Rewritten:

with T (Title, PurchaseCount) as (
    select m.Title, COUNT(p.IP) PurchaseCount
    from Media m
    join Purchase p on p.MediaId = m.MediaId
    group by m.Title
)
select distinct m.Title, ISNULL(T.PurchaseCount, 0) as PurchaseCount
from Media m
left outer join T on m.Title = T.Title
order by ISNULL(T.PurchaseCount, 0) desc;

0.2 sec elapsed, 1.9 CPU-seconds

IN and EXISTs Example & Workaround

Using IN and EXISTS with subqueries can prevent batch mode execution

IN ( <constants list> ) typically works fine

Example:MediaId IN (23263, 29637, 27208)

Original (IN / EXISTS):

select p.Date, count(*)
from Purchase p
where p.MediaId in (select MediaId from MediaStudyGroup)
group by p.Date
order by p.Date;

-- or --

select p.Date, count(*)
from Purchase p
where exists (select m.MediaId from MediaStudyGroup m where m.MediaId = p.MediaId)
group by p.Date
order by p.Date;

3.0 sec elapsed, 32 CPU-seconds

Rewritten as a join:

select p.Date, count(*)
from Purchase p
join MediaStudyGroup m on p.MediaId = m.MediaId
group by p.Date
order by p.Date;

0.05 sec elapsed, 0.3 CPU-seconds

Union All Example

UNION ALL can prevent batch mode execution

create view vPurchase as
select * from Purchase
union all
select * from DeltaPurchase;

Batch mode, 0.1 sec elapsed:

select p.date, d.DayNumOfMonth, count(*)
from vPurchase as p, Date d
where p.Date = d.DateId
group by p.date, d.DayNumOfMonth;

Row mode, 19 sec elapsed:

select p.date, d.DayNumOfMonth, m.Genre, count(*)
from vPurchase p, Date d, Media m
where p.Date = d.DateId and m.MediaId = p.MediaId
group by p.date, d.DayNumOfMonth, m.Genre;

Union All Workaround

- Push GROUP BY and aggregation down over the UNION ALL
- Do a final GROUP BY and aggregation of the results
- Called “local-global aggregation”

with MainSummary (date, DayNumOfMonth, Genre, c) as (
    select p.date, d.DayNumOfMonth, m.Genre, count(*) c
    from Purchase p, Date d, Media m
    where p.Date = d.DateId and m.MediaId = p.MediaId
    group by p.date, d.DayNumOfMonth, m.Genre
),
DeltaSummary (date, DayNumOfMonth, Genre, c) as (
    select p.date, d.DayNumOfMonth, m.Genre, count(*) c
    from DeltaPurchase p, Date d, Media m
    where p.Date = d.DateId and m.MediaId = p.MediaId
    group by p.date, d.DayNumOfMonth, m.Genre
),
CombinedSummary (date, DayNumOfMonth, Genre, c) as (
    -- union all across the output of the two queries
    select * from MainSummary
    union all
    select * from DeltaSummary
)
-- group by to aggregate the data
select t.date, t.DayNumOfMonth, t.Genre, sum(c) as c
from CombinedSummary as t
group by t.date, t.DayNumOfMonth, t.Genre;

Batch mode, 0.3 sec elapsed

Scalar Aggregates Example & Workaround

An aggregate without GROUP BY doesn’t get batch processing

Workaround: add a GROUP BY!

Original:

select count(*) from Purchase;

1.0 sec elapsed, 15 CPU-seconds

Rewritten:

with CountByDate (Date, c) as (
    select Date, count(*)
    from Purchase
    group by Date
)
select sum(c) from CountByDate;

0.06 sec elapsed, 0.3 CPU-seconds

Multiple DISTINCT aggregates example

Generates a table spool
Spool write/read is single-threaded

SQL Server 2012 runs queries with one DISTINCT aggregate and one or more non-distinct aggregates in batch mode without any spool!

select p.Date,
       count(distinct p.UserId) as UserIdCount,
       count(distinct p.MediaId) as MediaIdCount
from Purchase p, Media m
where p.MediaId = m.MediaId and m.Category in ('Horror')
group by p.Date;

26 sec elapsed, 31 CPU-seconds

Multiple DISTINCT aggregates workaround

- Form each DISTINCT aggregate in a separate subquery
- Join the results on the grouping keys

with DistinctMediaIds (Date, MediaIdCount) as (
    select p.Date, count(distinct p.MediaId) as MediaIdCount
    from Purchase p, Media m
    where p.MediaId = m.MediaId and m.Category in ('Horror')
    group by p.Date
),
DistinctUserIds (Date, UserIdCount) as (
    select p.Date, count(distinct p.UserId) as UserIdCount
    from Purchase p, Media m
    where p.MediaId = m.MediaId and m.Category in ('Horror')
    group by p.Date
)
select m.Date, m.MediaIdCount, u.UserIdCount
from DistinctMediaIds m
join DistinctUserIds u on m.Date = u.Date;

0.5 sec elapsed, 6 CPU-seconds

Summary


Keys to fast query processing:
- Columnstore index + batch mode = amazing performance
- Column and segment elimination greatly reduce data demand

Working with the read-only property of columnstores:
- Disable (or drop), update, rebuild
- Partition switching
- UNION ALL method

Future work will reduce the need for query tuning
For now, make sure most of the work happens in batch mode

© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.