Estimation, Statistics and “Oh My!”

Dave Ballantyne, Clear Sky SQL


Page 1: Estimation, Statistics and “Oh My!”

Dave Ballantyne, Clear Sky SQL

Estimation, Statistics and “Oh My!”

Page 2: Estimation, Statistics and “Oh My!”

› Freelance Database Developer/Designer
– Specializing in SQL Server for 15+ years

› SQLLunch
– Lunchtime usergroup
– London & Cardiff, 2nd & 4th Tuesdays

› TSQL Smells script author
– http://tsqlsmells.codeplex.com/

› Email : [email protected]
› Twitter : @DaveBally

The ‘Me’ slide

Page 3: Estimation, Statistics and “Oh My!”

› This is also me
– Far too often ….

› Estimates are central
› Statistics provide estimates

“Oh my!”

Page 4: Estimation, Statistics and “Oh My!”

› Every journey starts with a plan

› Is this the ‘best’ way to Lyon?
– Fastest
– Most efficient
– Shortest

› SQL Server makes similar choices

The plan

Page 5: Estimation, Statistics and “Oh My!”

› SQL Server is a cost based optimizer
– Cost based = compares predicted costs

› Therefore estimations are needed
› Every choice is based on these estimations
› Has to form a plan
– And the plan cannot change during execution if ‘wrong’

Estimation

Page 6: Estimation, Statistics and “Oh My!”

Estimation – Per execution

Page 7: Estimation, Statistics and “Oh My!”

› Costs are not actual costs
– Little or no relevance to the execution costs

› Cost is not a metric
– 1 <> 1 of anything

› Their purpose:
– Pick between different candidate plans for a query

Estimation

Page 8: Estimation, Statistics and “Oh My!”

› Included within an index
› Auto updated and created
– Optimizer decides “It would be useful if I knew..”
– Only on single column
– Not in read-only databases

› Can be manually updated
› Auto-update can operate async

How is estimation calculated ? - Statistics

Page 9: Estimation, Statistics and “Oh My!”

› DBCC SHOW_STATISTICS(tbl,stat)
– WITH STAT_HEADER
  › Display statistics header information
– WITH HISTOGRAM
  › Display detailed step information
– WITH DENSITY_VECTOR
  › Display only density of columns
  › Density = 1 / count of distinct values
– WITH STATS_STREAM
  › Binary stats blob (not supported)

Statistics Data

Page 10: Estimation, Statistics and “Oh My!”

Statistics – WITH STAT_HEADER

Rows: total rows in table

Rows Sampled: rows read and sampled

Steps: number of steps in histogram (max 200)

Density: 1 / distinct values (excluding boundaries); not used by the optimizer

Average key length: average byte length for all columns

Page 11: Estimation, Statistics and “Oh My!”

Statistics – WITH HISTOGRAM

Each step contains data on a range of values

Range is defined by
◦ <= RANGE_HI_KEY
◦ > Previous range RANGE_HI_KEY

Row 3: > 2 and <= 407
Row 4: > 407 and <= 470

EQ_ROWS: 6 rows of data = 470

RANGE_ROWS: 17 rows of data > 407 and < 470

DISTINCT_RANGE_ROWS: 9 distinct values > 407 and < 470

AVG_RANGE_ROWS = 17 / 9

Page 12: Estimation, Statistics and “Oh My!”

Statistics in practice

Find the step where Predicate <= RANGE_HI_KEY AND Predicate > previous RANGE_HI_KEY

As Predicate == RANGE_HI_KEY
Estimate = EQ_ROWS

Page 13: Estimation, Statistics and “Oh My!”

Statistics in practice

Find the step where Predicate <= RANGE_HI_KEY AND Predicate > previous RANGE_HI_KEY

As Predicate < RANGE_HI_KEY
Estimate = AVG_RANGE_ROWS
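The step lookup described on the histogram slides can be sketched in a few lines of Python. This is a minimal model, not the optimizer's actual code: the `Step` class and `estimate_equality` function are hypothetical names, the 407 step uses made-up row counts, and the fallback for values beyond the last step is a placeholder assumption. Only the 470 step (6, 17, 9) comes from the slide.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class Step:
    """One histogram step, mirroring the DBCC SHOW_STATISTICS columns."""
    range_hi_key: int
    eq_rows: float
    range_rows: float
    distinct_range_rows: int

    @property
    def avg_range_rows(self) -> float:
        # Rows per distinct value strictly inside the step.
        # DBCC reports 1 when the step has no distinct range values.
        if self.distinct_range_rows == 0:
            return 1.0
        return self.range_rows / self.distinct_range_rows


def estimate_equality(steps, value):
    """Estimate rows for `column = value` from steps sorted by range_hi_key."""
    keys = [s.range_hi_key for s in steps]
    # First step whose RANGE_HI_KEY is >= value, i.e. the step where
    # value <= RANGE_HI_KEY and value > previous RANGE_HI_KEY.
    i = bisect_left(keys, value)
    if i == len(steps):
        return 1.0  # assumption: placeholder for values past the histogram
    step = steps[i]
    if value == step.range_hi_key:
        return step.eq_rows       # predicate hits the boundary exactly
    return step.avg_range_rows    # predicate falls inside the range


steps = [
    Step(407, eq_rows=1, range_rows=0, distinct_range_rows=0),   # hypothetical
    Step(470, eq_rows=6, range_rows=17, distinct_range_rows=9),  # from the slide
]

print(estimate_equality(steps, 470))  # boundary value: EQ_ROWS
print(estimate_equality(steps, 450))  # inside the range: 17 / 9
```

A boundary value gets its own exact count, while every value inside the range shares one average. That is why estimates are more accurate on boundaries.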

Page 14: Estimation, Statistics and “Oh My!”

› Greater accuracy on range boundary values
– Based upon the ‘maxdiff’ algorithm

› Relevant for leading column
– Estimate for Smiths
– But not Smiths called John

› Additional columns: cumulative density only
– 1/(Count of Distinct Values)

Statistics

Page 15: Estimation, Statistics and “Oh My!”

WITH DENSITY_VECTOR

Density vector = 1/(Count of Distinct Values)

Count of distinct values = 19,517

1 / 19,517 = ~5.123738E-05

Page 16: Estimation, Statistics and “Oh My!”

DENSITY_VECTORS in practice

Page 17: Estimation, Statistics and “Oh My!”

DENSITY_VECTORS in practice

19,972 (total rows) * 5.123738E-05 = ~1.02331
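The density arithmetic can be checked with a short Python sketch. It assumes the estimate is the table's total row count multiplied by the density. The 19,972 total is the row count quoted on the later "Non Literal Values" slide, and the function names are hypothetical.

```python
def density(distinct_values: int) -> float:
    """Density as shown by DENSITY_VECTOR: 1 / count of distinct values."""
    return 1.0 / distinct_values


def estimate_from_density(total_rows: int, density_value: float) -> float:
    """Estimated rows for one 'average' value, assuming an even distribution."""
    return total_rows * density_value


# 19,517 distinct values, as quoted on the slide.
d = density(19_517)
# Total row count taken from the later slide (assumption that it is the same table).
est = estimate_from_density(19_972, d)

print(f"{d:.6E}")     # matches the slide's ~5.123738E-05
print(round(est, 5))  # matches the slide's ~1.02331
```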

Page 18: Estimation, Statistics and “Oh My!”

› All Diazs will estimate to the same:
– As will all Smiths, Jones & Ballantynes
– The statistics do not contain detail on FirstName
– Only how many distinct values there are
– And assumes these are evenly distributed

› Not only across a single Surname
› But ALL Surnames

DENSITY_VECTORS in practice

Page 19: Estimation, Statistics and “Oh My!”

› So far we have only used a single statistic for estimations

› For this query:

› To provide the best estimate the optimizer ideally needs to know about LastName and FirstName

Multiple Statistics

Page 20: Estimation, Statistics and “Oh My!”

› Correlating multiple stats
› AND conditions
› Find the intersection

Multiple Statistics - Usage

LastName = ‘Sanchez’

FirstName = ‘Ken’

Page 21: Estimation, Statistics and “Oh My!”

› AND logic
– Intersection Est = (Density 1 * Density 2)

› OR logic
– Row Est 1 + Row Est 2 – (Intersection Estimate)
– Avoid double counting

Multiple Statistics

Page 22: Estimation, Statistics and “Oh My!”

Multiple Stats – In action

10% * 10% = 1%
10% * 20% = 2%

No correlation in the data is assumed
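The AND and OR rules from the preceding slides can be written as two small Python functions. This is a sketch under the slides' stated assumption of no correlation between columns. The function names are hypothetical, and selectivities are fractions of the table (0.10 = 10%).

```python
def and_selectivity(s1: float, s2: float) -> float:
    """Intersection of two predicates, assuming the columns are independent."""
    return s1 * s2


def or_selectivity(s1: float, s2: float) -> float:
    """Union of two predicates: add both, then subtract the intersection
    so rows matching both predicates are not counted twice."""
    return s1 + s2 - and_selectivity(s1, s2)


print(and_selectivity(0.10, 0.10))  # 10% AND 10%
print(and_selectivity(0.10, 0.20))  # 10% AND 20%
print(or_selectivity(0.10, 0.20))   # 10% OR 20%
```

If the columns are in fact correlated (city and postcode, for example), the multiplied estimate will be too low, because the independence assumption no longer holds.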

Page 23: Estimation, Statistics and “Oh My!”

• To keep statistics fresh they get ‘aged’
  • 0 to > 0 rows
  • <= 6 rows (for temp tables)
    • 6 modifications
  • <= 500 rows
    • 500 modifications
  • >= 501 rows
    • 500 + 20% of table

• Will cause statistics to be updated on next use
• Will cause statements to be recompiled on next execution
• Temp tables in stored procedures more complex

Aged Statistics
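The ageing thresholds listed on this slide can be expressed as a single Python function. It is a simplified sketch of the rules as the slide states them: the function name is hypothetical, and returning 1 for an empty table is my reading of "0 to > 0".

```python
def modification_threshold(rows: int, is_temp_table: bool = False) -> int:
    """Number of modifications after which statistics are considered stale."""
    if rows == 0:
        return 1               # table goes from 0 rows to more than 0 rows
    if is_temp_table and rows <= 6:
        return 6               # small temp tables: 6 modifications
    if rows <= 500:
        return 500             # small tables: 500 modifications
    return 500 + rows // 5     # larger tables: 500 + 20% of the table


for n in (0, 5, 500, 501, 1_000_000):
    print(n, modification_threshold(n))
```

At one million rows the threshold is 200,500 modifications, which is why large tables can run on stale statistics for a long time.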

Page 24: Estimation, Statistics and “Oh My!”

Large Tables

Trace flag 2371: dynamically lower statistics update threshold for large tables

>25,000 rows

2008 R2 (SP1) & 2012
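To give a feel for how much a dynamic threshold can lower the bar, here is a rough Python comparison. The slide gives no formula, so the square-root form below is an assumption: it is the threshold Microsoft documents for database compatibility level 130 and later, used here only as an approximation of trace flag 2371's behaviour on 2008 R2 and 2012. The function names are hypothetical.

```python
import math


def classic_threshold(rows: int) -> float:
    """The fixed rule for tables over 500 rows: 500 + 20% of the table."""
    return 500 + 0.20 * rows


def dynamic_threshold(rows: int) -> float:
    """Assumed dynamic rule: the smaller of the classic threshold
    and the square root of (1000 * rows)."""
    return min(classic_threshold(rows), math.sqrt(1000 * rows))


for n in (1_000, 25_000, 1_000_000, 100_000_000):
    print(n, round(classic_threshold(n)), round(dynamic_threshold(n)))
```

Under this assumption the two rules agree for small tables and diverge quickly as the table grows.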

Page 25: Estimation, Statistics and “Oh My!”

› When density vector is not accurate enough
› Manually created statistics only
› Additional where clause can be utilised

Filtered Statistics

Filter Expression = Filter
Unfiltered Rows = Total rows in table before filter
Rows Sampled = Number of filtered rows sampled

Page 26: Estimation, Statistics and “Oh My!”

Filtered Statistics

Density of London * Density of Ramos

Filter is matched and histogram is used

Page 27: Estimation, Statistics and “Oh My!”

Sampled Data

› For ‘large’ data sets a smaller sample can be used
› Here 100% of the rows have been sampled

› Here ~52% of the rows have been sampled

› Statistics will assume the same distribution of values through the entire dataset

Page 28: Estimation, Statistics and “Oh My!”

Non Literal values

› Also Auto/Forced Parameterization

Page 29: Estimation, Statistics and “Oh My!”

› Remember the Density Vector ?

› 19972 (Total Rows) * 0.0008285004 = 16.5468

Non Literal Values

On equality the average density is assumed
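The non-literal estimate is the same density arithmetic as before, applied because the optimizer cannot see the value at compile time. A small Python check using the slide's figures follows. The function name is hypothetical, and reading the inverse of the density as the distinct-value count assumes density = 1 / distinct values.

```python
def estimate_unknown_value(total_rows: int, density_value: float) -> float:
    """Row estimate for `column = @variable` when the value is not known:
    every value is treated as the 'average' value."""
    return total_rows * density_value


total_rows = 19_972           # from the slide
column_density = 0.0008285004  # from the slide

estimate = estimate_unknown_value(total_rows, column_density)
distinct_values = round(1 / column_density)

print(round(estimate, 4))  # matches the slide's 16.5468
print(distinct_values)     # distinct values implied by the density
```

Whatever value the variable holds at run time, the estimate stays at roughly 16.5 rows.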

Page 30: Estimation, Statistics and “Oh My!”

› Stored Procedures

Non Literal Values

Page 31: Estimation, Statistics and “Oh My!”

› Enables a better plan to be built
– (most of the time)
– Uses specific values rather than average values

› Values can be seen in properties pane

› Erratic execution costs are often Parameter Sniffing problems

Parameter Sniffing

Page 32: Estimation, Statistics and “Oh My!”

› Force a value to be used in optimization
› A literal value
› Or UNKNOWN
– Falls back to density information

Optimize For

Page 33: Estimation, Statistics and “Oh My!”

› OPTION(RECOMPILE)
– Recompile on every execution

› The plans aren’t cached
– No point, as by definition the plan won’t be reused

› Uses variables as if literals
– More accurate estimates

Forcing Recompiling

Page 34: Estimation, Statistics and “Oh My!”

› The Achilles heel
– A plan is fixed

› But… The facts are:
– More Smiths than Ballantynes
– More Customers in London than Leeds

Unbalanced data

             London   Leeds
Smith           500      50
Ballantyne       10       1
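A quick Python illustration of why one fixed plan struggles with these numbers. The row counts are the four from the slide's table. Using the plain average as the single "compiled" estimate is a simplifying assumption for illustration, not how the optimizer derives its figure.

```python
# Actual row counts per (surname, city), from the slide's table.
actual = {
    ("Smith", "London"): 500,
    ("Smith", "Leeds"): 50,
    ("Ballantyne", "London"): 10,
    ("Ballantyne", "Leeds"): 1,
}

# One estimate shared by every combination (assumption: the plain average).
average_estimate = sum(actual.values()) / len(actual)


def misestimate_factor(estimate: float, rows: int) -> float:
    """How many times too high or too low the estimate is."""
    return max(estimate / rows, rows / estimate)


for key, rows in actual.items():
    print(key, rows, round(misestimate_factor(average_estimate, rows), 2))

worst = max(actual, key=lambda k: misestimate_factor(average_estimate, actual[k]))
print("worst case:", worst)
```

No single number sits close to all four cases, so a plan sized for one combination is mis-sized for at least one of the others.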

Page 35: Estimation, Statistics and “Oh My!”

› This is known as the ‘Plan Skyline’ problem

Unbalanced Data

[Chart: Summarised (20 Step) Surname Stats Distribution; y-axis from 400 to 1200]

Page 36: Estimation, Statistics and “Oh My!”

Unbalanced Data

[Chart: Full Statistics distribution; surnames from Abbas to Zhou on the x-axis, y-axis from 0 to 250]

Page 37: Estimation, Statistics and “Oh My!”

› But wait…
› It gets worse
– That was only EQ_ROWS
– RANGE_ROWS ??

Unbalanced Data

[Chart: Range Rows statistics distribution; surnames from Abbas to Zhou on the x-axis, y-axis from 0 to 80]

Page 38: Estimation, Statistics and “Oh My!”

› Variations in plans
– Shape
  › Which is the ‘primary’ table?
– Physical joins
– Index (non)usage
  › Bookmark lookups
– Memory grants
  › 500 rows need more memory than 50
– Parallel plans

Unbalanced Data

Page 39: Estimation, Statistics and “Oh My!”

› Can the engine resolve this?
– No!

› We can help though
– And without recompiling
– Aim is to prevent ‘car-crash’ queries
– Not necessarily provide a ‘perfect’ plan

Unbalanced data

Demo

Page 40: Estimation, Statistics and “Oh My!”

› “There is always a trace flag”
– Paul White (@SQL_Kiwi)

› TF 9292
– Show when a statistic header is read

› TF 9204
– Show when statistics have been fully loaded

› TF 8666
– Display internal debugging info in QP

Which stats are used ?

Page 41: Estimation, Statistics and “Oh My!”

TF 9292 and 9204

Page 42: Estimation, Statistics and “Oh My!”

TF8666

Page 44: Estimation, Statistics and “Oh My!”

› My Email : [email protected]
› Twitter : @DaveBally

Q&A