Estimation, Statistics and “Oh My!”
Transcript of Estimation, Statistics and “Oh My!”
Dave Ballantyne – Clear Sky SQL
Estimation, Statistics and “Oh My!”
› Freelance Database Developer/Designer
– Specializing in SQL Server for 15+ years
› SQLLunch
– Lunchtime usergroup
– London & Cardiff, 2nd & 4th Tuesdays
› TSQL Smells script author
– http://tsqlsmells.codeplex.com/
› Email : [email protected]
› Twitter : @DaveBally
The ‘Me’ slide
› This is also me
– Far too often…
› Estimates are central
› Statistics provide estimates
“Oh my!”
› Every journey starts with a plan
› Is this the ‘best’ way to Lyon?
– Fastest
– Most efficient
– Shortest
› SQL Server makes similar choices
The plan
› SQL Server uses a cost-based optimizer
– Cost-based = compares predicted costs
› Therefore estimations are needed
› Every choice is based on these estimations
› It has to form a plan
– And the plan cannot change during execution if it turns out to be ‘wrong’
Estimation
Estimation – Per execution
› Costs are not actual costs
– Little or no relevance to the real execution costs
› Cost is not a metric
– A cost of 1 does not equal 1 of anything (seconds, I/Os, …)
› Their purpose:
– To pick between different candidate plans for a query
Estimation
› Included within an index
› Auto updated and created
– The optimizer decides “It would be useful if I knew…”
– Only on single columns
– Not in read-only databases
› Can be manually updated
› Auto-creation can operate asynchronously
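For example, a minimal sketch of these settings and a manual update (the database and table names are hypothetical):

-- Enable auto-creation and asynchronous auto-update of statistics
ALTER DATABASE MyDb SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE MyDb SET AUTO_UPDATE_STATISTICS_ASYNC ON;
-- Manually update all statistics on one table
UPDATE STATISTICS dbo.Customers;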
How is estimation calculated? – Statistics
› DBCC SHOW_STATISTICS(tbl, stat)
– WITH STAT_HEADER
› Displays statistics header information
– WITH HISTOGRAM
› Displays detailed step information
– WITH DENSITY_VECTOR
› Displays only the density of the columns
› Density = 1 / (count of distinct values)
– WITH STATS_STREAM
› Binary stats blob (not supported)
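As a sketch, each option can be queried like this (the table and statistic names are assumptions):

DBCC SHOW_STATISTICS ('Person.Person', 'IX_Person_LastName') WITH STAT_HEADER;
DBCC SHOW_STATISTICS ('Person.Person', 'IX_Person_LastName') WITH HISTOGRAM;
DBCC SHOW_STATISTICS ('Person.Person', 'IX_Person_LastName') WITH DENSITY_VECTOR;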
Statistics Data
Statistics – WITH STAT_HEADER
› Rows – total rows in the table
› Rows Sampled – rows read and sampled
› Steps – number of steps in the histogram (max 200)
› Density – 1 / (distinct values), excluding boundaries; not used by the optimizer
› Average Key Length – average byte length of all columns
Statistics – WITH HISTOGRAM
› Each step contains data on a range of values
› The range is defined by:
– <= RANGE_HI_KEY
– > the previous step’s RANGE_HI_KEY
› Example: row 3 covers > 2 and <= 407; row 4 covers > 407 and <= 470
– EQ_ROWS: 6 rows of data = 470
– RANGE_ROWS: 17 rows of data > 407 and < 470
– DISTINCT_RANGE_ROWS: 9 distinct values > 407 and < 470
– AVG_RANGE_ROWS: 17 / 9 ≈ 1.89
Statistics in practice
› Find the step where: Predicate <= RANGE_HI_KEY AND Predicate > previous RANGE_HI_KEY
› If Predicate == RANGE_HI_KEY: Estimate = EQ_ROWS
› If Predicate < RANGE_HI_KEY: Estimate = AVG_RANGE_ROWS
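A worked sketch of the two cases, assuming a hypothetical dbo.Orders table whose histogram step has RANGE_HI_KEY = 470 (previous key 407), EQ_ROWS = 6, RANGE_ROWS = 17 and DISTINCT_RANGE_ROWS = 9:

-- Predicate equals the step boundary: estimate = EQ_ROWS = 6
SELECT * FROM dbo.Orders WHERE OrderID = 470;
-- Predicate falls inside the step: estimate = AVG_RANGE_ROWS = 17.0 / 9 ≈ 1.89
SELECT * FROM dbo.Orders WHERE OrderID = 450;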
› Greater accuracy on range boundary values
– Based upon the ‘maxdiff’ algorithm
› Relevant for the leading column only
– Estimate for Smiths
– But not for Smiths called John
› Additional columns contribute cumulative density only
– 1 / (count of distinct values)
Statistics
WITH DENSITY_VECTOR
› Density vector = 1 / (count of distinct values)
› Here there are 19,517 distinct values
› 1 / 19,517 ≈ 5.123738E-05
DENSITY_VECTOR in practice
19,972 (total rows) * 5.123738E-05 ≈ 1.02331
› All Diazs will estimate to the same value:
– As will all Smiths, Joneses & Ballantynes
– The statistics contain no detail on FirstName
– Only how many distinct values there are
– And they assume these are evenly distributed
› Not only across a single surname
› But across ALL surnames
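To illustrate the point, a hypothetical pair of queries (table and names assumed) that both receive the same density-based estimate, regardless of how common each name pair really is:

-- Both estimates come from total rows * density of (LastName, FirstName):
-- 19,972 * 5.123738E-05 ≈ 1.02
SELECT * FROM Person.Person WHERE LastName = N'Diaz' AND FirstName = N'Maria';
SELECT * FROM Person.Person WHERE LastName = N'Smith' AND FirstName = N'John';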
DENSITY_VECTOR in practice
› So far we have only used a single statistic for estimations
› For this query:
› To provide the best estimate the optimizer ideally needs to know about LastName and FirstName
Multiple Statistics
› Correlating multiple stats
› AND conditions
› Find the intersection
Multiple Statistics - Usage
LastName = ‘Sanchez’
FirstName = ‘Ken’
› AND logic
– Intersection estimate = (Density 1 * Density 2)
› OR logic
– Row Est 1 + Row Est 2 – (intersection estimate)
– Avoids double counting
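A sketch of that arithmetic with made-up selectivities (1% of rows are Sanchez, 0.5% are Ken):

-- AND: intersection = 0.01 * 0.005 = 0.00005 of the table
SELECT * FROM Person.Person WHERE LastName = N'Sanchez' AND FirstName = N'Ken';
-- OR: 0.01 + 0.005 - (0.01 * 0.005) = 0.01495 of the table
-- (the intersection is subtracted once to avoid double counting)
SELECT * FROM Person.Person WHERE LastName = N'Sanchez' OR FirstName = N'Ken';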
Multiple Statistics
Multiple Stats – In action
10% * 10% = 1%
10% * 20% = 2%
No correlation in the data is assumed
• To keep statistics fresh they get ‘aged’; the modification thresholds are:
– 0 rows → any modification (0 to > 0)
– <= 6 rows (for temp tables) → 6 modifications
– <= 500 rows → 500 modifications
– >= 501 rows → 500 + 20% of the table (e.g. a 100,000-row table needs 500 + 20,000 = 20,500 modifications)
• Aging will cause statistics to be updated on next use
• And statements to be recompiled on next execution
• Temp tables in stored procedures are more complex
Aged Statistics
› Large tables: trace flag 2371
– Dynamically lowers the statistics update threshold for large tables (> 25,000 rows)
– Available in 2008 R2 (SP1) & 2012
› When the density vector is not accurate enough
› Manually created statistics only
› An additional WHERE clause can be utilised
Filtered Statistics
Filter Expression = the filter predicate
Unfiltered Rows = total rows in the table before the filter
Rows Sampled = number of filtered rows sampled
Filtered Statistics
Density of London * Density of Ramos
The filter is matched and the histogram is used
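A minimal sketch of creating such a statistic (object names are assumptions):

-- Distribution of FirstName measured only over London rows
CREATE STATISTICS st_FirstName_London
ON dbo.Customers (FirstName)
WHERE City = N'London';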
Sampled Data
› For ‘large’ data sets a smaller sample can be used
› Here 100% of the rows have been sampled
› Here ~52% of the rows have been sampled
› Statistics will assume the same distribution of values through the entire dataset
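The sample rate can also be controlled manually; a sketch, with an assumed table name:

UPDATE STATISTICS dbo.Customers WITH FULLSCAN;          -- 100% of rows sampled
UPDATE STATISTICS dbo.Customers WITH SAMPLE 52 PERCENT; -- ~52% of rows sampled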
Non-Literal Values
› Also auto/forced parameterization
› Remember the density vector?
› 19,972 (total rows) * 0.0008285004 ≈ 16.5468
Non Literal Values
On equality, the average density is assumed
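A sketch of the difference (names assumed): with a variable the optimizer cannot see the value at compile time, so the density vector is used; with a literal the histogram is used.

DECLARE @LastName nvarchar(50) = N'Smith';
SELECT * FROM Person.Person WHERE LastName = @LastName; -- density: 19,972 * 0.0008285004 ≈ 16.5
SELECT * FROM Person.Person WHERE LastName = N'Smith';  -- histogram: value-specific estimate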
› Stored Procedures
› Enables a better plan to be built
– (most of the time)
– Uses specific values rather than average values
› Values can be seen in the properties pane
› Erratic execution costs are often parameter sniffing problems
Parameter Sniffing
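A minimal sniffing sketch (procedure and table names are hypothetical):

CREATE PROCEDURE dbo.GetCustomersByCity @City nvarchar(50)
AS
SELECT * FROM dbo.Customers WHERE City = @City;
GO
EXEC dbo.GetCustomersByCity N'London'; -- plan compiled for the sniffed value 'London'
EXEC dbo.GetCustomersByCity N'Leeds';  -- reuses the 'London' plan, good or bad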
› Forces a value to be used in optimization
› A literal value
› Or UNKNOWN
– Falls back to density information
Optimize For
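A sketch of both forms (variable and table names assumed):

DECLARE @City nvarchar(50) = N'Leeds';
SELECT * FROM dbo.Customers WHERE City = @City
OPTION (OPTIMIZE FOR (@City = N'London')); -- optimize as if the literal 'London' were supplied
SELECT * FROM dbo.Customers WHERE City = @City
OPTION (OPTIMIZE FOR (@City UNKNOWN));     -- ignore any sniffed value; use density information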
› OPTION(RECOMPILE)
– Recompiles on every execution
› The plans aren’t cached
– No point, as by definition the plan won’t be reused
› Uses variables as if they were literals
– More accurate estimates
Forcing Recompiling
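A sketch (names assumed):

DECLARE @City nvarchar(50) = N'London';
SELECT * FROM dbo.Customers WHERE City = @City
OPTION (RECOMPILE); -- compiled per execution; the current value of @City is used like a literal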
› The Achilles heel
– A plan is fixed
› But… the facts are:
– More Smiths than Ballantynes
– More customers in London than in Leeds
Unbalanced data
             London   Leeds
Smith           500      50
Ballantyne       10       1
› This is known as the ‘Plan Skyline’ problem
Unbalanced Data
[Chart: Summarised (20 step) Surname stats distribution]
Unbalanced Data
[Chart: Full statistics distribution – rows per surname, Abbas to Zhou]
› But wait… it gets worse
– That was only EQ_ROWS
– What about RANGE_ROWS?
Unbalanced Data
[Chart: Range Rows statistics distribution – per surname, Abbas to Zhou]
› Variations in plans
– Shape
› Which is the ‘primary’ table?
– Physical joins
– Index (non)usage
– Bookmark lookups
› Memory grants
– 500 rows need more memory than 50
› Parallel plans
Unbalanced Data
› Can the engine resolve this?
– No!
› We can help though
– And without recompiling
– The aim is to prevent ‘car-crash’ queries
– Not necessarily to provide a ‘perfect’ plan
Unbalanced data
Demo
› “There is always a trace flag”
– Paul White (@SQL_Kiwi)
› TF 9292
– Shows when a statistics header is read
› TF 9204
– Shows when statistics have been fully loaded
› TF 8666
– Displays internal debugging info in the QP
Which stats are used ?
TF 9292 and 9204
TF 8666
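A hedged sketch of using the statistics trace flags for a single statement via QUERYTRACEON (these flags are undocumented and unsupported):

DBCC TRACEON (3604); -- route trace output to the client
SELECT * FROM Person.Person WHERE LastName = N'Smith'
OPTION (QUERYTRACEON 9292, QUERYTRACEON 9204);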
› Statistics Used by the Query Optimizer in Microsoft SQL Server 2008 (Microsoft white paper)
› Plan Caching in SQL Server 2008 (Microsoft white paper)
› Microsoft SQL Server 2008 Internals (Microsoft Press)
References