SQL Server Scalability Fundamentals
Level 300
Scalability
From Wikipedia:
“Scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged in order to accommodate that growth.”
What Are We Aiming For ? Scalability ?
High throughput ?
Low response times ?
A combination of the above ?
Some goals can be mutually exclusive; for example, Hadoop is scalable but gives poor response times
What We Know
What We Know: DBA Tasks
Installation of OS and SQL
Basic memory configuration
Perfmon style monitoring
Monitoring via SQL Profiler
Backup/restore and HA setup
You can read an execution plan
You know what the basic SQL objects are
You know how to Google things, especially Books Online
A Good Way Of Thinking About Latency
[Diagram: four cores, each with private L1 and L2 caches, sharing an L3 cache; latencies step up through roughly 1ns, 10ns, 100ns, 10us, 100us and 10ms as you move from the caches out through main memory to storage]
Cache Out Curves
[Chart: throughput per thread vs. data size; throughput drops off once the data size exceeds the cache size]
There Are Several Of These Curves
[Chart: throughput vs. touched data size, with successive drop-offs at the CPU cache, TLB, remote NUMA and storage boundaries]
Single Threaded Performance
Insert Throughput By Doing The Math
Response time = service time + wait time = 0.053 + 0.871 = 0.924 ms
Insert throughput = 1000 / 0.924 = 1082 inserts / s
“Big O” Notation
How the elapsed time or space complexity of an algorithm changes in response to the size of the input data set
“Big O” and The Database Engine: Examples
Sort (average and best case scenarios): O(n log(n))
Insert into a memory optimised table with a hash index where bucket count >= distinct values inserted: O(1)
Insert into a memory optimised table with a hash index where bucket count < distinct values inserted: O(n)
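As a concrete illustration, a minimal sketch with a hypothetical table (the database needs a memory optimised filegroup, and BUCKET_COUNT should come from your own distinct key count):

-- Memory optimised table with a hash index. For O(1) point lookups the
-- BUCKET_COUNT should be >= the number of distinct key values; undersize
-- it and rows chain within buckets, degrading lookups towards O(n).
CREATE TABLE dbo.HotOrders
(
    OrderId   INT NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1048576),
    OrderDate DATETIME2 NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);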
What Gives Us The Biggest Bang For Our Buck ?
80 %: schema design, indexing strategy, T-SQL code design
20 %: synchronization primitives, leveraging the architecture of the modern CPU
Join Types
Loop: low memory, slow
Hash: high memory, fast
Merge: rare
Loop Join
[Diagram: a modern CPU package; each core has an L0 uop cache, 32KB L1 instruction and data caches and a 256KB unified L2 cache, all cores share the L3 cache via a bi-directional ring bus, with the memory bus leading out to main memory]
Problem With Loop Joins – The Modern CPU !!!
Approximate access latency (CPU cycles, per the original chart):
L1 cache sequential access: 4
L1 cache in-page random access: 4
L1 cache full random access: 4
L2 cache sequential access: 11
L2 cache in-page random access: 11
L2 cache full random access: 11
L3 cache sequential access: 14
L3 cache in-page random access: 18
L3 cache full random access: 38
Main memory: 167
Main Memory Is Not As Fast As We Might Think !!!
Crawling A Tree In Memory
Hash Join
Row Mode Hash Join Scalability
[Chart: elapsed time (ms, 5,000–40,000) vs. degree of parallelism (2–16), plotted against the scalability best case scenario; scaling degrades once the NUMA node 0 boundary is crossed]
If we keep going do we hit ‘Negative’ scale ?
Batch Mode To The Rescue ?
[Chart: elapsed time (ms, 500–4,000) vs. degree of parallelism (2–16) for the batch mode equivalent, again plotted against the scalability best case scenario and the NUMA node 0 boundary]
Two CPU sockets <> twice the throughput, WHY ?
[Chart: singleton insert rate (inserts/s, up to ~350,000) vs. threads (1–20), single socket vs. two sockets]
What About OLTP Workloads ?
We will come back to this later . . .
[Chart: LOGCACHE_ACCESS spins vs. threads (1–20), single socket vs. two sockets; on two sockets spins climb into the billions]
Multiple Sockets and Spin-locking
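To see whether this spinlock is hot on your own system, the (undocumented but well known) spinlock DMV can be sampled; a sketch:

-- Sample LOGCACHE_ACCESS spinlock activity; take two snapshots a known
-- interval apart and diff them to get spins per second.
SELECT name, collisions, spins, spins_per_collision, backoffs
FROM sys.dm_os_spinlock_stats
WHERE name = 'LOGCACHE_ACCESS';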
The In Memory OLTP Engine
[Chart: singleton insert rate (inserts/s, up to ~16,000,000) vs. thread count (1–20) for a memory optimised table with a hash index, bucket count = 524288, tracking the scalability best case scenario]
This looks good, but get the bucket count wrong and your singleton inserts scale horribly !!!
What Happens When Memory Is Scarce ?
[Chart: hash join performance vs. available hash memory (MB)]
Merge Join
Merge, Which Order ?
Cost relative to the batch = 87 %:
SELECT MAX(h.OrderDate), MAX(d.ProductID)
FROM [Sales].[SalesOrderDetail] d
INNER MERGE JOIN [Sales].[SalesOrderHeader] h
ON d.SalesOrderID = h.SalesOrderID
OPTION (MAXDOP 1)

Cost relative to the batch = 13 %:
SELECT MAX(h.OrderDate), MAX(d.ProductID)
FROM [Sales].[SalesOrderHeader] h
INNER MERGE JOIN [Sales].[SalesOrderDetail] d
ON d.SalesOrderID = h.SalesOrderID
OPTION (MAXDOP 1)
Digging Deeper
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderHeader'. Scan count 1, logical reads 689, physical reads 2, read-ahead reads 685, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderDetail'. Scan count 1, logical reads 1246, physical reads 3, read-ahead reads 1277, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'SalesOrderDetail'. Scan count 1, logical reads 1246, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderHeader'. Scan count 1, logical reads 689, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Turning M:N Into 1:N
Cost relative to the batch = 48 %:
SELECT MAX(h.OrderDate), MAX(b.MAX_P)
FROM (SELECT SalesOrderId, MAX(ProductId) AS MAX_P
      FROM [Sales].[SalesOrderDetail]
      GROUP BY SalesOrderId) b
INNER MERGE JOIN [Sales].[SalesOrderHeader] h
ON b.SalesOrderID = h.SalesOrderID
OPTION (MAXDOP 1)

Cost relative to the batch = 52 %:
SELECT MAX(h.OrderDate), MAX(d.ProductID)
FROM [Sales].[SalesOrderHeader] h
INNER MERGE JOIN [Sales].[SalesOrderDetail] d
ON d.SalesOrderID = h.SalesOrderID
OPTION (MAXDOP 1)
Every Table Should Always Have A Clustered Index
Cluster (O_ORDERKEY):
CREATE UNIQUE CLUSTERED INDEX CIX_Key
ON ORDERS_Cluster (O_ORDERKEY)
WITH (FILLFACTOR = 100)

SELECT *
FROM ORDERS_Cluster
WHERE O_ORDERKEY = 300000

Index (O_ORDERKEY):
CREATE UNIQUE NONCLUSTERED INDEX CIX_Key
ON ORDERS_Heap (O_ORDERKEY)
WITH (FILLFACTOR = 100)

SELECT *
FROM ORDERS_Heap
WHERE O_ORDERKEY = 300000
Or Perhaps Not ? Cluster (O_ORDERKEY) vs. Index (O_ORDERKEY)
Clustered indexes work against you as the number of non-covering secondary index seeks and scans increases.
CREATE UNIQUE CLUSTERED INDEX CIX_Key
ON ORDERS_Cluster (O_ORDERKEY)
WITH (FILLFACTOR = 100)

CREATE INDEX IX_Customer
ON ORDERS_Cluster (O_CUSTKEY)

CREATE UNIQUE NONCLUSTERED INDEX CIX_Key
ON ORDERS_Heap (O_ORDERKEY)
WITH (FILLFACTOR = 100)

CREATE INDEX IX_Customer
ON ORDERS_Heap (O_CUSTKEY)
Tuning By Workload
Log write latency is critical, up to a point
OLTP and OLAP do not play nicely together
Fine grained locking => no lock escalation:
Trace flag 1211 – prevent all lock escalation
Trace flag 1224 – only escalate locks under severe memory pressure
Avoid compression for OLTP
Certain plan shapes suit OLTP applications best, next slide . . .
OLTP Checklist
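As a sketch, enabling one of these trace flags instance-wide (the -1 applies the flag globally; evaluate carefully before using in production):

-- Only escalate locks when the lock manager is under memory pressure.
DBCC TRACEON (1224, -1);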
OLTP Tuning For Dummies
1. Nested loop joins are the dominant join type
2. Seeks not scans
3. Serial iterators
4. There should be no sorts or spools, except for performance spools and/or sorts associated with optimized nested loop joins
Seeks Not Scans
1. Do not normalise beyond 3rd normal form
2. Find the most CPU/IO intensive queries with sys.dm_exec_query_stats
3. Add OPTION (LOOP JOIN) to the offending query
4. Check the estimated plan of the bad query
5. If a spool is found, add indexes until remedied
6. Goto 2 until no more queries have non-index paths
7. Buy more hardware . . . wisely
OLTP Tuning For Dummies
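A sketch of step 2; total_worker_time is cumulative CPU in microseconds, and the SUBSTRING arithmetic extracts the offending statement from its batch text:

-- Top 20 statements by accumulated CPU time.
SELECT TOP (20)
    qs.total_worker_time AS total_cpu_us,
    qs.execution_count,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;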
Not All Spools Are Bad: Execution Time

Rows (constant)             Optimized      Non-optimized   Factor
10,000,000 (1% of rows)     6.5 minutes    26 minutes      4x
100,000,000 (10% of rows)   10.4 minutes   4.3 hours       25x
250,000,000 (25% of rows)   11.3 minutes   10.6 hours      56x
With the “optimized nested loop join”, data is pre-sorted on the outer side of the join; refer to Craig Freedman's blog post
Pursue high sequential scan IO throughput
Avoid CPU core starvation by understanding core consumption rates
Leverage column stores and batch mode
Kimball dimensional model ? => pursue star join optimisations
Certain plan shapes suit OLAP style applications best, next slide . . .
OLAP Checklist
Multi Threaded Performance
[Chart: speedup factor vs. number of cores (N, 0–64) for parallel fractions P = 100%, 95%, 90% and 80%]
Amdahl’s law: The increase in speed is proportional to the percentage of the workload that can be performed in parallel
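In formula form, with P the parallelisable fraction of the workload and N the number of cores:

Speedup(N) = 1 / ((1 - P) + P / N)

A worked example: with P = 95% and N = 64 cores, Speedup = 1 / (0.05 + 0.95 / 64) ≈ 15.4, nowhere near 64.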
Constructs That Force Serial Plans
All T-SQL user defined functions
CLR UDFs with data access
System functions such as OBJECT_ID(), ERROR_NUMBER(), @@TRANCOUNT . . .
Dynamic cursors
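As an illustration of the first item (a sketch; dbo.AddOne is a hypothetical function, and from SQL Server 2019 scalar UDF inlining can lift this restriction):

-- A trivial scalar UDF; its mere presence forces a serial plan.
CREATE FUNCTION dbo.AddOne (@i INT)
RETURNS INT
AS
BEGIN
    RETURN @i + 1;
END;
GO

SELECT dbo.AddOne(a.object_id)
FROM sys.all_objects a
CROSS JOIN sys.all_objects b; -- big enough to want parallelism, but won't get it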
Constructs That Force Serial Regions
System table scans
Backwards scans
Global scalar aggregates
Sequence functions
Recursive queries
Multi-consumer spools
TOP
TVFs
The Query Parallelism “Mothership of knowledge”
https://blogs.msdn.microsoft.com/craigfr/2007/04/17/parallel-query-execution-presentation/
What A ‘Good’ OLAP Execution Plan Should Look Like
1. Parallelism
2. Hash joins are the dominant join type
3. Bitmap filters pushed as deep into the plan as possible
4. Repartition streams iterators are not so good, we will cover these later
The Optimizer Cost Model
“To understand the relevance of cost you would have to jump in a time machine, go back in time and find a machine under a certain developer's desk; cost is based on the amount of time it took this machine to perform certain operations” – Adam Machanic
The machine under Lubor Kollar's desk
The Optimizer Has Limitations !!!
It contains hard coding: certain assumptions are explicitly hard coded into the optimizer
Not all scenarios it encounters are costed for: there are constructs and situations which the optimizer does not cost for, referred to by the optimizer team as “out of model scenarios”
The optimizer has blind spots
Hard Coding In The Optimizer
Hash distribution is always assumed to be uniform
The memory grant for a varchar column always assumes it is half its declared size
The ratio between random and sequential IO is hard coded into the IO cost model
“Out of model” Scenarios
It always assumes that the buffer pool is cold
No IO costing for parallel plans
Data in different columns is never correlated
Cardinality estimates for table variables are not costed, unless the statement is recompiled
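A sketch of the table variable point; without a recompile the optimizer sees one row, with OPTION (RECOMPILE) it sees the true cardinality:

DECLARE @ids TABLE (object_id INT PRIMARY KEY);

INSERT INTO @ids
SELECT object_id FROM sys.all_objects;

-- Estimated rows for @ids = 1 without the hint; the hint recompiles the
-- statement so the actual row count is visible to the optimizer.
SELECT o.name
FROM @ids i
JOIN sys.objects o ON o.object_id = i.object_id
OPTION (RECOMPILE);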
Which Query Runs The Fastest ?
1468ms ( cost relative to the batch = 44 % )
561ms ( cost relative to the batch = 56 % )
VARCHAR Columns And Memory Grants
CPU Core Consumption Rates

SELECT a.*
INTO MyBigTableSource
FROM sys.all_objects a
CROSS JOIN sys.all_objects b
CROSS JOIN (SELECT TOP 500 * FROM sys.all_objects c) dt

SELECT COUNT(*)
FROM MyBigTableSource
OPTION (MAXDOP 1)
Few-Outer-Row Optimization
Few-Outer-Row is a specific optimization for nested loop joins. In some data warehousing queries, the outer side of a nested loop join is a parallel scan with a filter.
Hash Join Exception To The Rule
. . . and this is what a few outer rows plan looks like
Few Outer Rows Optimisation
DBCC SETCPUWEIGHT(100000)

SELECT [UnitPrice]
FROM FactInternetSales fis
JOIN TaxYear ty
ON fis.DueDateKey > ty.YearStart
AND fis.DueDateKey < ty.YearEnd

CREATE TABLE TaxYear
( YearStart datetime
 ,YearEnd   datetime )

INSERT INTO TaxYear VALUES
 ('20050406', '20060406'),
 ('20060406', '20070406'),
 ('20070406', '20080406'),
 ('20080406', '20090406'),
 ('20090406', '20100406'),
 ('20100406', '20110406'),
 ('20110406', '20120406')
1. Add a column store to the fact table
2. Add OPTION (HASH JOIN) to the query
3. Do you get a hash of the dimension and a probe by the fact ? If not, check your statistics on the facts . . . and indexes on dimensions
4. Optimise the living daylights out of the fact scan
OLAP Tuning For Dummies
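A sketch of step 1 (index name assumed; the table must not already have a clustered index unless DROP_EXISTING is used):

-- Turn the fact table into a clustered column store.
CREATE CLUSTERED COLUMNSTORE INDEX ccsi_FactInternetSales
ON dbo.FactInternetSales;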
Working Out Where Columns Begin and End Is Expensive !!!
46% of total sampled CPU time !!!
Avoid segment trimming
Align segments in order to get the best segment elimination possible
Make sure as little of your data as possible is in delta stores
Deleted rows are only ever removed after a column store reorg/rebuild, keep an eye on these !!!
No predicate pushdown on strings prior to SQL Server 2016
Optimising Column Store Scans
A row group can have a maximum of 1,048,576 rows, but it can be closed before it gets to this for a number of reasons (exposed in SQL Server 2016 in sys.dm_db_column_store_row_group_physical_stats)
trim_reason_desc                  Trim reason
BULKLOAD                          BATCHSIZE specified for bulk insert, or end of bulk insert
REORG_FORCED                      REORG with COMPRESS_ALL_ROW_GROUPS = ON, which closes every open row group and compresses it into columnar format
DICTIONARY_SIZE                   The dictionary is full (16MB dictionary), so the row group is trimmed
MEMORY_LIMITATION                 Memory pressure during index build caused the row group to be trimmed
RESIDUAL_ROW_GROUP (index build)  The last row group(s) have fewer than 1 million rows when the index is rebuilt
Segment Trimming
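A sketch of how to keep an eye on trim reasons (SQL Server 2016 onwards):

-- Row group sizes and why each one was trimmed.
SELECT OBJECT_NAME(object_id) AS table_name,
       row_group_id,
       state_desc,
       total_rows,
       deleted_rows,
       trim_reason_desc
FROM sys.dm_db_column_store_row_group_physical_stats
ORDER BY table_name, row_group_id;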
Aligned Segments
DimBeverage:
ID  Value
1   Beer
2   Gin
3   Vodka
4   Whisky
5   Coca Cola
6   Wine
7   Brandy
FactBeverageConsumed – aligned, local dictionary (BeverageId) per segment:
Segment 001: min value 1, max value 2
Segment 002: min value 3, max value 4
Segment 003: min value 5, max value 6
‘Alignment’ = min and max column values are in order with no overlap
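A common way to achieve alignment (a sketch using the example fact table): build a clustered b-tree on the predicate column to order the data, then convert it to a clustered column store with DROP_EXISTING; MAXDOP 1 stops parallel workers interleaving the sort order:

-- First physically order the table on the predicate column . . .
CREATE CLUSTERED INDEX cix_FactBeverageConsumed
ON dbo.FactBeverageConsumed (BeverageId);

-- . . . then convert it to a clustered column store, preserving the order.
CREATE CLUSTERED COLUMNSTORE INDEX cix_FactBeverageConsumed
ON dbo.FactBeverageConsumed
WITH (DROP_EXISTING = ON, MAXDOP = 1);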
Non Aligned Segments
DimBeverage: same dictionary as above (1 Beer . . . 7 Brandy)
FactBeverageConsumed – non aligned, local dictionary (BeverageId) per segment:
Segment 001: min value 1, max value 4
Segment 002: min value 2, max value 3
Segment 003: min value 1, max value 5
Min and max values across segments overlap !!!
[Diagram: looking up Whisky (BeverageId = 4) in DimBeverage; with non-aligned segments (min/max 1–4, 2–3, 1–5) every segment must be scanned, with aligned segments (min/max 1–2, 3–4, 5–7) only the one matching segment is scanned]
What Segment Alignment Gives Us
More Performance ? . . . Sorted Hash
[Chart: elapsed time (ms, up to 80,000) vs. degree of parallelism (2–24), non-sorted column store vs. sorted column store]
SELECT p.ProductId
      ,p.ProductNumber
      ,p.ReorderPoint
      ,d.*
FROM (SELECT TOP (2147483647)
             p0.ProductId
            ,p0.ProductNumber
            ,p0.ReorderPoint
      FROM bigProduct AS p0
      WHERE p0.ProductId BETWEEN 1001 AND 20001) AS p
CROSS APPLY (SELECT th.TransactionId
                   ,RANK() OVER (PARTITION BY p.ProductId ORDER BY th.ActualCost DESC) AS LineTotalRank
                   ,RANK() OVER (PARTITION BY p.ProductId ORDER BY th.Quantity DESC) AS OrderQtyRank
             FROM bigTransactionHistory AS th
             WHERE th.ProductId = p.ProductId) AS d
OPTION (MAXDOP 10)
Column Store Scans Are Good, Seeks Are Bad !
With A Clustered Column Store, Elapsed Time = 262 seconds
Table 'bigProduct'. Scan count 1, logical reads 228, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 9577, logical reads 109258564, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'bigTransactionHistory'. Scan count 1, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 85220, lob physical reads 0, lob read-ahead reads 0.
Table 'bigTransactionHistory'. Segment reads 30, segment skipped 0.
SQL Server Execution Times: CPU time = 226750 ms, elapsed time = 262456 ms.
With A Conventional Index , Elapsed Time = 71 seconds
Table 'bigProduct'. Scan count 1, logical reads 228, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'bigTransactionHistory'. Scan count 9577, logical reads 98005, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 36264 ms, elapsed time = 71046 ms.
We want as much of our data as possible to be in compressed row groups for OLAP style applications
Delta stores prevent segment elimination
Scanning delta stores has the same performance characteristic as scanning page compressed b-trees ( SQL 2012 – 14 ) and non-compressed b-trees ( SQL 2016 )
. . . Leads on to column store load strategies
Delta Stores
Row Group Compression: 102,400 Is The Magic Number
Row group closure and compression is triggered at 102,400 rows
Best compression is achieved at 1,048,576 rows (~1 million)
SSIS data flow task buffer sizes and buffer max rows can be tuned
Can your data sources pump data into the column store in 1 million row batches consistently ?
Can zero data flow buffer spills be guaranteed with large batch sizes ?
Trickle data into staging tables and transfer the rows to the column store when there are approximately 1 million of them
Turn the tuple mover off – TF 634
Force row group closure and compression via:
ALTER INDEX <index name> ON <table name> REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON)
Bulk loads are explained in depth in this MSDN article
SQL Server 2016 supports parallel inserts into column stores
Loading Strategy
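A sketch of the trickle-then-transfer pattern (table names hypothetical); on SQL Server 2016, TABLOCK also enables the parallel insert mentioned above:

-- Move ~1 million trickled rows from the staging row store into the
-- column store in one batch, then clear the staging table.
INSERT INTO dbo.FactSales WITH (TABLOCK)
SELECT *
FROM dbo.FactSales_Staging;

TRUNCATE TABLE dbo.FactSales_Staging;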
Data Type Elimination And Predicate Pushdown Support
Data type          Min / Max   Predicate pushdown   Segment elimination
Numeric            yes         no                   no
DateTimeOffset     yes         no                   no
Char               yes         no                   no
Varchar            yes         no                   no
Nchar              yes         no                   no
Nvarchar           yes         no                   no
Binary             no          no                   no
Varbinary          no          no                   no
Uniqueidentifier   no          no                   no
Factor these data types out into dimension tables where possible !!!
New Column Store Features In SQL Server 2016
Filtered non clustered column store indexes
Ability to create a column store index on a memory optimised table
Support for row versioning
Column store can be declared in line when creating a table
Ability to specify a compression delay on the column store
New Column Store Features In SQL Server 2016
Support for updateable column stores on an Always On readable replica
Support for string predicate pushdown
Parallel insert now supported for clustered column stores
Simple aggregate pushdown
Support for multiple distinct counts
New Column Store Features In SQL Server 2016
Updateable non clustered column store indexes
Ability to create conventional b-tree indexes on clustered column store indexes
Ability to create foreign keys on clustered column stores
Batch mode windowing functions
Batch mode sort iterator support
Optimising Windowing Function Performance With Row Stores
SELECT a.object_id
      ,SUBSTRING(CONVERT(VARCHAR(40), NEWID()), 1, 3) AS code
INTO TestData
FROM sys.all_objects a
CROSS JOIN sys.all_objects b
CROSS JOIN (SELECT TOP 200 * FROM sys.all_objects c) dt
WHERE a.type = 'P'
AND b.type = 'P'
UNION ALL
SELECT a.object_id
      ,SUBSTRING(CONVERT(VARCHAR(40), NEWID()), 1, 3) AS code
FROM sys.all_objects a
CROSS JOIN sys.all_objects b
WHERE a.type = 'P'
AND b.type = 'P'
UNION ALL
SELECT a.object_id
      ,SUBSTRING(CONVERT(VARCHAR(40), NEWID()), 1, 3) AS code
FROM sys.all_objects a
CROSS JOIN sys.all_objects b
WHERE a.type = 'U'
AND b.type = 'U'
UNION ALL
SELECT a.object_id
      ,SUBSTRING(CONVERT(VARCHAR(40), NEWID()), 1, 3) AS code
FROM sys.all_objects a
CROSS JOIN sys.all_objects b
WHERE a.type = 'V'
AND b.type = 'V'
CREATE CLUSTERED INDEX csx ON TestData (object_id, code)
The Plan We Get For A Simple Query Using RANK
SELECT t.object_id
      ,code
      ,RANK() OVER (PARTITION BY t.object_id ORDER BY code ASC) AS rk
FROM TestData t
Anti Scale !!!
[Chart: elapsed time (ms, 1,050,000–1,450,000) vs. degree of parallelism (2–8); elapsed time rises as DOP increases]
SQL Server 2016 Has A Batch Windowing Function
CREATE TABLE [dbo].[DummyTable] ( [object_id] [int] NULL )

CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON [dbo].[DummyTable]

SELECT t.object_id
      ,code
      ,RANK() OVER (PARTITION BY t.object_id ORDER BY code ASC) AS rk
FROM TestData t
LEFT JOIN DummyTable d
ON d.object_id = t.object_id
OPTION (HASH JOIN)
How does this help us with row stores ?
The Magic Is In What Now Appears In The Execution Plan
From “Anti Scale” To Scalability !!!
[Chart: elapsed time (ms, 0–2,000,000) vs. degree of parallelism (2–8), row store BATCH mode vs. row store ROW mode; batch mode elapsed time falls as DOP rises]
What About Creating A Column Store On The TestData Table ?
[Chart: elapsed time (ms, 0–2,500,000) vs. degree of parallelism (2–8): row store BATCH mode vs. row store ROW mode vs. column store BATCH mode]
Natural Born Performance Killers
Physical reads for OLTP
Poor sequential scan rates and CPU core starvation for OLAP
High WRITELOG latency for OLTP
Etc etc . . .
XML Processing ?
The Overhead Of Rendering XML
SELECT a.*
FROM sys.all_objects a
CROSS JOIN sys.all_objects b

SELECT a.*
FROM sys.all_objects a
CROSS JOIN sys.all_objects b
FOR XML RAW

Where are our CPU cycles going ?
Elapsed times: 4 s vs. 34 s
Where The Database Engine Is Spending Its CPU Time
Non-Xml version of the query Xml version of the query
The XML query is more than 5x as CPU intensive
Digging Into The Stack: The FOR XML RAW Version Of The Query
Challenges
Does whoever designed your infrastructure understand the resource requirements and usage patterns of the database engine ?
“Big boxes” being purchased with little understanding of how SQL Server scales on such machines
CPU unfriendly SQL engine behaviour: sorting, hashing and index seeks ( pointer chasing )
The end of the CPU GHz free lunch
What Have We Learned ?
1. There are plan shapes which suit OLTP and OLAP applications; pursue these
2. Parallel query scalability suffers once the NUMA boundary is crossed
3. Clustered indexes are not a silver bullet; consider non-covering secondary indexes
4. Avoid column store segment trimming in order to leverage 8MB read-aheads for OLAP style applications; operational analytics is more nuanced
5. Align segments by pre-sorting the data on columns frequently used in predicates
6. Not all data types support segment elimination and predicate pushdown; this can be designed around
7. Be cognizant of the overheads of processing XML !