Estimation, Statistics and “Oh My!”
Transcript of Estimation, Statistics and “Oh My!”
Dave Ballantyne – Clear Sky SQL
Estimation, Statistics and “Oh My!”
› Freelance Database Developer/Designer
– Specializing in SQL Server for 15+ years
› SQLLunch
– Lunchtime usergroup
– London & Cardiff, 2nd & 4th Tuesdays
› TSQL Smells script author
– http://tsqlsmells.codeplex.com/
› Email : [email protected]
› Twitter : @DaveBally
The ‘Me’ slide
› This is also me
– Far too often…
› Estimates are central
› Statistics provide estimates
“Oh my!”
› Every journey starts with a plan
› Is this the ‘best’ way to Lyon?
– Fastest
– Most efficient
– Shortest
› SQL Server makes similar choices
The plan
› SQL Server uses a cost-based optimizer
– Cost-based = compares predicted costs
› Therefore estimations are needed
› Every choice is based on these estimations
› It has to form a plan
– And the plan cannot change during execution if it turns out to be ‘wrong’
Estimation
Estimation – Per execution
› Costs are not actual costs
– Little or no relevance to the real execution costs
› Cost is not a metric
– A cost of 1 does not equal 1 of anything (seconds, I/Os, …)
› Their purpose:
– To pick between different candidate plans for a query
Estimation
› Included within an index
› Auto updated and created
– The optimizer decides “It would be useful if I knew…”
– Only on single columns
– Not in read-only databases
› Can be manually updated
› Auto-creation can operate asynchronously
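For example, a minimal sketch of these settings and a manual update (the database and table names are hypothetical):

-- Enable auto-creation and asynchronous auto-update of statistics
ALTER DATABASE MyDb SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE MyDb SET AUTO_UPDATE_STATISTICS_ASYNC ON;
-- Manually update all statistics on one table
UPDATE STATISTICS dbo.Customers;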
How is estimation calculated? – Statistics
› DBCC SHOW_STATISTICS(tbl, stat)
– WITH STAT_HEADER
› Displays statistics header information
– WITH HISTOGRAM
› Displays detailed step information
– WITH DENSITY_VECTOR
› Displays only the density of the columns
› Density = 1 / (count of distinct values)
– WITH STATS_STREAM
› Binary stats blob (not supported)
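As a sketch, each option can be queried like this (the table and statistic names are assumptions):

DBCC SHOW_STATISTICS ('Person.Person', 'IX_Person_LastName') WITH STAT_HEADER;
DBCC SHOW_STATISTICS ('Person.Person', 'IX_Person_LastName') WITH HISTOGRAM;
DBCC SHOW_STATISTICS ('Person.Person', 'IX_Person_LastName') WITH DENSITY_VECTOR;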
Statistics Data
Statistics – WITH STAT_HEADER
› Rows – total rows in the table
› Rows Sampled – rows read and sampled
› Steps – number of steps in the histogram (max 200)
› Density – 1 / (distinct values), excluding boundaries; not used by the optimizer
› Average Key Length – average byte length of all columns
Statistics – WITH HISTOGRAM
› Each step contains data on a range of values
› The range is defined by:
– <= RANGE_HI_KEY
– > the previous step’s RANGE_HI_KEY
› Example: row 3 covers > 2 and <= 407; row 4 covers > 407 and <= 470
– EQ_ROWS: 6 rows of data = 470
– RANGE_ROWS: 17 rows of data > 407 and < 470
– DISTINCT_RANGE_ROWS: 9 distinct values > 407 and < 470
– AVG_RANGE_ROWS: 17 / 9 ≈ 1.89
Statistics in practice
› Find the step where: Predicate <= RANGE_HI_KEY AND Predicate > previous RANGE_HI_KEY
› If Predicate == RANGE_HI_KEY: Estimate = EQ_ROWS
› If Predicate < RANGE_HI_KEY: Estimate = AVG_RANGE_ROWS
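A worked sketch of the two cases, assuming a hypothetical dbo.Orders table whose histogram step has RANGE_HI_KEY = 470 (previous key 407), EQ_ROWS = 6, RANGE_ROWS = 17 and DISTINCT_RANGE_ROWS = 9:

-- Predicate equals the step boundary: estimate = EQ_ROWS = 6
SELECT * FROM dbo.Orders WHERE OrderID = 470;
-- Predicate falls inside the step: estimate = AVG_RANGE_ROWS = 17.0 / 9 ≈ 1.89
SELECT * FROM dbo.Orders WHERE OrderID = 450;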
› Greater accuracy on range boundary values
– Based upon the ‘maxdiff’ algorithm
› Relevant for the leading column only
– Estimate for Smiths
– But not for Smiths called John
› Additional columns contribute cumulative density only
– 1 / (count of distinct values)
Statistics
WITH DENSITY_VECTOR
› Density vector = 1 / (count of distinct values)
› Here there are 19,517 distinct values
› 1 / 19,517 ≈ 5.123738E-05
DENSITY_VECTOR in practice
19,972 (total rows) * 5.123738E-05 ≈ 1.02331
› All Diazs will estimate to the same value:
– As will all Smiths, Joneses & Ballantynes
– The statistics contain no detail on FirstName
– Only how many distinct values there are
– And they assume these are evenly distributed
› Not only across a single surname
› But across ALL surnames
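To illustrate the point, a hypothetical pair of queries (table and names assumed) that both receive the same density-based estimate, regardless of how common each name pair really is:

-- Both estimates come from total rows * density of (LastName, FirstName):
-- 19,972 * 5.123738E-05 ≈ 1.02
SELECT * FROM Person.Person WHERE LastName = N'Diaz' AND FirstName = N'Maria';
SELECT * FROM Person.Person WHERE LastName = N'Smith' AND FirstName = N'John';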
DENSITY_VECTOR in practice
› So far we have only used a single statistic for estimations
› For this query:
› To provide the best estimate the optimizer ideally needs to know about LastName and FirstName
Multiple Statistics
› Correlating multiple stats
› AND conditions
› Find the intersection
Multiple Statistics - Usage
LastName = ‘Sanchez’
FirstName = ‘Ken’
› AND logic
– Intersection estimate = (Density 1 * Density 2)
› OR logic
– Row Est 1 + Row Est 2 – (intersection estimate)
– Avoids double counting
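A sketch of that arithmetic with made-up selectivities (1% of rows are Sanchez, 0.5% are Ken):

-- AND: intersection = 0.01 * 0.005 = 0.00005 of the table
SELECT * FROM Person.Person WHERE LastName = N'Sanchez' AND FirstName = N'Ken';
-- OR: 0.01 + 0.005 - (0.01 * 0.005) = 0.01495 of the table
-- (the intersection is subtracted once to avoid double counting)
SELECT * FROM Person.Person WHERE LastName = N'Sanchez' OR FirstName = N'Ken';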
Multiple Statistics
Multiple Stats – In action
10% * 10% = 1%
10% * 20% = 2%
No correlation in the data is assumed
• To keep statistics fresh they get ‘aged’; the modification thresholds are:
– 0 rows → any modification (0 to > 0)
– <= 6 rows (for temp tables) → 6 modifications
– <= 500 rows → 500 modifications
– >= 501 rows → 500 + 20% of the table (e.g. a 100,000-row table needs 500 + 20,000 = 20,500 modifications)
• Aging will cause statistics to be updated on next use
• And statements to be recompiled on next execution
• Temp tables in stored procedures are more complex
Aged Statistics
› Large tables: trace flag 2371
– Dynamically lowers the statistics update threshold for large tables (> 25,000 rows)
– Available in 2008 R2 (SP1) & 2012
› When the density vector is not accurate enough
› Manually created statistics only
› An additional WHERE clause can be utilised
Filtered Statistics
Filter Expression = the filter predicate
Unfiltered Rows = total rows in the table before the filter
Rows Sampled = number of filtered rows sampled
Filtered Statistics
Density of London * Density of Ramos
The filter is matched and the histogram is used
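A minimal sketch of creating such a statistic (object names are assumptions):

-- Distribution of FirstName measured only over London rows
CREATE STATISTICS st_FirstName_London
ON dbo.Customers (FirstName)
WHERE City = N'London';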
Sampled Data
› For ‘large’ data sets a smaller sample can be used
› Here 100% of the rows have been sampled
› Here ~52% of the rows have been sampled
› Statistics will assume the same distribution of values through the entire dataset
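The sample rate can also be controlled manually; a sketch, with an assumed table name:

UPDATE STATISTICS dbo.Customers WITH FULLSCAN;          -- 100% of rows sampled
UPDATE STATISTICS dbo.Customers WITH SAMPLE 52 PERCENT; -- ~52% of rows sampled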
Non-Literal Values
› Also auto/forced parameterization
› Remember the density vector?
› 19,972 (total rows) * 0.0008285004 ≈ 16.5468
Non Literal Values
On equality, the average density is assumed
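A sketch of the difference (names assumed): with a variable the optimizer cannot see the value at compile time, so the density vector is used; with a literal the histogram is used.

DECLARE @LastName nvarchar(50) = N'Smith';
SELECT * FROM Person.Person WHERE LastName = @LastName; -- density: 19,972 * 0.0008285004 ≈ 16.5
SELECT * FROM Person.Person WHERE LastName = N'Smith';  -- histogram: value-specific estimate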
› Stored Procedures
› Enables a better plan to be built
– (most of the time)
– Uses specific values rather than average values
› Values can be seen in the properties pane
› Erratic execution costs are often parameter sniffing problems
Parameter Sniffing
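A minimal sniffing sketch (procedure and table names are hypothetical):

CREATE PROCEDURE dbo.GetCustomersByCity @City nvarchar(50)
AS
SELECT * FROM dbo.Customers WHERE City = @City;
GO
EXEC dbo.GetCustomersByCity N'London'; -- plan compiled for the sniffed value 'London'
EXEC dbo.GetCustomersByCity N'Leeds';  -- reuses the 'London' plan, good or bad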
› Forces a value to be used in optimization
› A literal value
› Or UNKNOWN
– Falls back to density information
Optimize For
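A sketch of both forms (variable and table names assumed):

DECLARE @City nvarchar(50) = N'Leeds';
SELECT * FROM dbo.Customers WHERE City = @City
OPTION (OPTIMIZE FOR (@City = N'London')); -- optimize as if the literal 'London' were supplied
SELECT * FROM dbo.Customers WHERE City = @City
OPTION (OPTIMIZE FOR (@City UNKNOWN));     -- ignore any sniffed value; use density information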
› OPTION(RECOMPILE)
– Recompiles on every execution
› The plans aren’t cached
– No point, as by definition the plan won’t be reused
› Uses variables as if they were literals
– More accurate estimates
Forcing Recompiling
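A sketch (names assumed):

DECLARE @City nvarchar(50) = N'London';
SELECT * FROM dbo.Customers WHERE City = @City
OPTION (RECOMPILE); -- compiled per execution; the current value of @City is used like a literal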
› The Achilles heel
– A plan is fixed
› But… the facts are:
– More Smiths than Ballantynes
– More customers in London than in Leeds
Unbalanced data
             London   Leeds
Smith           500      50
Ballantyne       10       1
› This is known as the ‘Plan Skyline’ problem
Unbalanced Data
[Chart: Summarised (20 step) Surname stats distribution]
Unbalanced Data
[Chart: Full statistics distribution – rows per surname, Abbas to Zhou]
› But wait… it gets worse
– That was only EQ_ROWS
– What about RANGE_ROWS?
Unbalanced Data
[Chart: Range Rows statistics distribution – per surname, Abbas to Zhou]
› Variations in plans
– Shape
› Which is the ‘primary’ table?
– Physical joins
– Index (non)usage
– Bookmark lookups
› Memory grants
– 500 rows need more memory than 50
› Parallel plans
Unbalanced Data
› Can the engine resolve this?
– No!
› We can help though
– And without recompiling
– The aim is to prevent ‘car-crash’ queries
– Not necessarily to provide a ‘perfect’ plan
Unbalanced data
Demo
› “There is always a trace flag”
– Paul White (@SQL_Kiwi)
› TF 9292
– Shows when a statistics header is read
› TF 9204
– Shows when statistics have been fully loaded
› TF 8666
– Displays internal debugging info in the QP
Which stats are used ?
TF 9292 and 9204
TF 8666
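A hedged sketch of using the statistics trace flags for a single statement via QUERYTRACEON (these flags are undocumented and unsupported):

DBCC TRACEON (3604); -- route trace output to the client
SELECT * FROM Person.Person WHERE LastName = N'Smith'
OPTION (QUERYTRACEON 9292, QUERYTRACEON 9204);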
› Statistics Used by the Query Optimizer in Microsoft SQL Server 2008 (Microsoft white paper)
› Plan Caching in SQL Server 2008 (Microsoft white paper)
› Microsoft SQL Server 2008 Internals (Microsoft Press)
References