Post on 18-Feb-2018
7/23/2019 Distribution Fitting Fix
http://slidepdf.com/reader/full/distribution-fitting-fix 1/26
FINDING A PROPER STATISTICAL DISTRIBUTION FOR DATASET
from
http://www.mathwave.com/articles/distribution_fitting_faq.html
What is Distribution Fitting?
• Distribution fitting is the procedure of selecting a statistical distribution that best fits a data set generated by some random process. In other words, if you have some random data available, and would like to know what particular distribution can be used to describe your data, then distribution fitting is what you are looking for.
Who and Why Should Use Distributions?
• Random factors affect all areas of our life, and businesses motivated to succeed in today's highly competitive environment need a tool to deal with the risk and uncertainty involved. Using probability distributions is a scientific way of dealing with uncertainty and making informed business decisions.
• In practice, probability distributions are applied in such diverse fields as actuarial science and insurance, risk analysis, investment, market research, business and economic research, customer support, mining, reliability engineering, chemical engineering, hydrology, image processing, physics, medicine, sociology, demography, etc.
Why is it Important to Select The Best Fitting Distribution?
Probability distributions can be viewed as a tool for dealing with uncertainty: you use distributions to perform specific calculations, and apply the results to make well-grounded business decisions. However, if you use a wrong tool, you will get wrong results. If you select and apply an inappropriate distribution (one that doesn't fit your data well), your subsequent calculations will be incorrect, and that will certainly result in wrong decisions.
In many industries, the use of incorrect models can have serious consequences such as inability to complete tasks or projects in time leading to substantial time and money loss, wrong engineering design resulting in damage of expensive equipment, etc. In some specific areas such as hydrology, using appropriate distributions can be even more critical.
Distribution fitting allows you to develop valid models of the random processes you deal with, protecting you from potential time and money loss which can arise due to invalid model selection, and enabling you to make better business decisions.
Can't I Just Assume The Normal Distribution?
• The Normal distribution was developed more than 250 years ago, and is probably one of the oldest and most frequently used distributions out there. So why not just use it?
It Is Symmetric
• The probability density function of the Normal distribution is symmetric about its mean value, so this distribution cannot be used to model right-skewed or left-skewed data.
It Is Unbounded
• The Normal distribution is defined on the entire real axis (-Infinity, +Infinity), and if the nature of your data is such that it is bounded or non-negative (can only take on positive values), then this distribution is almost certainly not a good fit.
Its Shape Is Constant
• The shape of the Normal distribution does not depend on the distribution parameters. Even if your data is symmetric by nature, it is possible that it is best described by one of the heavy-tailed models such as the Cauchy distribution.
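The three objections above can be checked numerically before anything is fitted. The sketch below is an illustration, not part of the original FAQ: it uses only the Python standard library, the sample values are hypothetical, and the helper name `normality_red_flags` and its thresholds are assumptions chosen for the example.

```python
def moments(data):
    """Return the mean and the 2nd, 3rd and 4th central moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return mean, m2, m3, m4

def skewness(data):
    """Sample skewness: 0 for symmetric data, > 0 for a long right tail."""
    _, m2, m3, _ = moments(data)
    return m3 / m2 ** 1.5

def excess_kurtosis(data):
    """Excess kurtosis: about 0 for Normal data, > 0 for heavier tails."""
    _, m2, _, m4 = moments(data)
    return m4 / m2 ** 2 - 3.0

def normality_red_flags(data, skew_limit=0.5, kurtosis_limit=1.0):
    """List reasons to look twice at a Normal model (limits are assumptions)."""
    flags = []
    if abs(skewness(data)) > skew_limit:
        flags.append("skewed")
    if min(data) >= 0:
        # Not proof of a bad fit, only a hint that the data may be bounded.
        flags.append("non-negative")
    if excess_kurtosis(data) > kurtosis_limit:
        flags.append("heavy tails")
    return flags

# Hypothetical right-skewed, non-negative sample (e.g. waiting times).
waits = [1, 1, 2, 2, 3, 4, 6, 9, 15, 30]
print(round(skewness(waits), 2), normality_red_flags(waits))
```

These flags only suggest which candidate distributions to try; they do not replace a goodness-of-fit check.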
Which Distribution Should I Choose?
You cannot "just guess" and use any other particular distribution without testing several alternative models, as this can result in analysis errors.
In most cases, you need to fit two or more distributions, compare the results, and select the most valid model. The "candidate" distributions you fit should be chosen depending on the nature of your probability data. For example, if you need to analyze the time between failures of technical devices, you should fit non-negative distributions such as Exponential or Weibull, since the failure time cannot be negative.
You can also apply some other identification methods based on properties of your data. For example, you can build a histogram and determine whether the data are symmetric, left-skewed, or right-skewed, and use the distributions which have the same shape.
Which Distribution Should I Choose?
• To actually fit the "candidate" distributions you selected, you need to employ statistical methods that estimate distribution parameters based on your sample data. The solution of this problem involves the use of certain algorithms implemented in specialized software.
• After the distributions are fitted, it is necessary to determine how well the distributions you selected fit your data. This can be done using specific goodness-of-fit tests or visually by comparing the empirical (based on sample data) and theoretical (fitted) distribution graphs. As a result, you will select the most valid model describing your data.
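The workflow described on these two slides (fit several candidates, then compare how well each one fits) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular package: the data are simulated, maximum likelihood is assumed as the estimation method, the Kolmogorov-Smirnov distance is used as the only goodness-of-fit measure, and all function names are hypothetical.

```python
import math
import random

def fit_exponential(data):
    """Maximum-likelihood rate of an Exponential: 1 / sample mean."""
    return len(data) / sum(data)

def fit_normal(data):
    """Maximum-likelihood mean and standard deviation of a Normal."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return mu, sigma

def ks_statistic(data, cdf):
    """Largest vertical gap between the empirical CDF and a fitted CDF."""
    ordered = sorted(data)
    n = len(ordered)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(ordered)
    )

# Simulated times between failures (assumed true rate 0.5, fixed seed).
rng = random.Random(42)
sample = [rng.expovariate(0.5) for _ in range(500)]

rate = fit_exponential(sample)
mu, sigma = fit_normal(sample)

def exp_cdf(x):
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

d_exp = ks_statistic(sample, exp_cdf)
d_norm = ks_statistic(sample, norm_cdf)
best = "Exponential" if d_exp < d_norm else "Normal"
print(f"D(Exponential)={d_exp:.3f}  D(Normal)={d_norm:.3f}  best={best}")
```

A smaller distance means the fitted curve stays closer to the empirical distribution. In practice the comparison would be backed by the graphical checks as well.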
Exploratory Data Analysis (EDA)
EDA includes:
Descriptive statistics (numerical summaries): mean, median, range, variance, standard deviation, etc. In SPSS choose Analyze > Descriptive Statistics > Descriptives.
Kolmogorov-Smirnov & Shapiro-Wilk tests: These methods test whether one distribution (e.g. your dataset) is significantly different from another (e.g. a normal distribution) and produce a numerical answer, yes or no. Use the Shapiro-Wilk test if the sample size is between 3 and 2000 and the Kolmogorov-Smirnov test if the sample size is greater than 2000. Unfortunately, in some circumstances, both of these tests can produce misleading results, so "real" statisticians prefer graphical plots to tests such as these.
Graphical methods:
• frequency distribution histograms
• stem & leaf plots
• scatter plots
• box & whisker plots
• Normal probability plots: PP and QQ plots
• Graphs with error bars (Graphs: Error Bar)
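For readers working outside SPSS, the descriptive statistics listed above can be computed with Python's standard `statistics` module. This is a small sketch with hypothetical scores; the function name `describe` is an assumption, and the variance and standard deviation use the sample (n - 1) form.

```python
import statistics

def describe(data):
    """Numerical summaries comparable to a basic descriptives table."""
    return {
        "n": len(data),
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "range": max(data) - min(data),
        "variance": statistics.variance(data),  # sample variance (n - 1)
        "std_dev": statistics.stdev(data),      # sample standard deviation
    }

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores
for name, value in describe(scores).items():
    print(f"{name:>8}: {value:.3f}")
```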
Probability Plots
• The assumption of a normal model for a population of responses will be required in order to perform certain inference procedures. A histogram can be used to get an idea of the shape of a distribution. However, there are more sensitive tools for checking if the shape is close to a normal model: a Q-Q Plot.
• A Q-Q Plot is a plot of the percentiles (or quantiles) of a standard normal distribution (or any other specific distribution) against the corresponding percentiles of the observed data. If the observations follow approximately a normal distribution, the resulting plot should be roughly a straight line with a positive slope.
Probability plots
• Q-Q plot: Quantile-quantile plot
• Graph of the q_i-quantile of a fitted (model) distribution versus the q_i-quantile of the sample distribution.
• If F̂(x) is the correct distribution that is fitted, for a large sample size, then F̂(x) and F_n(x) will be close together and the Q-Q plot will be approximately linear with intercept 0 and slope 1.
• For a small sample, even if F̂(x) is the correct distribution, there will be some departure from the straight line.
$$x^{M}_{q_i} = \hat{F}^{-1}(q_i), \qquad x^{S}_{q_i} = F_n^{-1}(q_i) = X_{(i)}, \qquad i = 1, 2, \ldots, n$$
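The Q-Q construction can be sketched in a few lines: pair the model quantile at each probability q_i with the i-th ordered observation X_(i). The slide does not say how q_i is chosen, so the plotting position q_i = (i - 0.5)/n used here is an assumption, as are the function name `qq_points` and the sample values. `NormalDist` comes from the Python standard library.

```python
from statistics import NormalDist

def qq_points(data, model):
    """Pair each model quantile with the matching sample quantile X_(i)."""
    ordered = sorted(data)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered, start=1):
        q = (i - 0.5) / n  # assumed plotting position q_i
        points.append((model.inv_cdf(q), x))
    return points

# Hypothetical measurements; the Normal model is estimated from the sample.
sample = [9.1, 10.4, 11.8, 9.9, 10.0, 12.3, 8.7, 10.9, 11.2, 9.5]
fitted = NormalDist.from_samples(sample)

for model_q, sample_q in qq_points(sample, fitted):
    print(f"{model_q:6.2f}  {sample_q:6.2f}")
```

If the fitted model is adequate, the printed pairs should lie close to the line with intercept 0 and slope 1.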
Probability plots
• P-P plot: Probability-Probability plot.
• It is valid for both continuous as well as discrete data sets.
• If F̂(x) is the correct distribution that is fitted, for a large sample size, then F̂(x) and F_n(x) will be close together and the P-P plot will be approximately linear with intercept 0 and slope 1.
Graph of the model probability against the sample probability:
$$\hat{F}\left(X_{(i)}\right) \quad \text{against} \quad F_n\left(X_{(i)}\right) = q_i, \qquad i = 1, 2, \ldots, n$$
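A P-P plot can be sketched the same way, this time comparing probabilities instead of quantiles. As before, the plotting position q_i = (i - 0.5)/n, the function name `pp_points` and the data are assumptions made for illustration.

```python
from statistics import NormalDist

def pp_points(data, model):
    """Pair the model probability of X_(i) with the sample probability q_i."""
    ordered = sorted(data)
    n = len(ordered)
    return [
        (model.cdf(x), (i - 0.5) / n)  # (model probability, assumed q_i)
        for i, x in enumerate(ordered, start=1)
    ]

# Hypothetical measurements; the Normal model is estimated from the sample.
sample = [9.1, 10.4, 11.8, 9.9, 10.0, 12.3, 8.7, 10.9, 11.2, 9.5]
fitted = NormalDist.from_samples(sample)

for model_p, sample_p in pp_points(sample, fitted):
    print(f"{model_p:5.3f}  {sample_p:5.3f}")
```

Both coordinates are probabilities, so every point falls inside the unit square regardless of the scale of the data.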
Probability plots
• The Q-Q plot will amplify the differences between the tails of the model distribution and the sample distribution.
• The P-P plot, by contrast, will amplify the differences in the middle portion of the model and sample distributions.
Probability plots: Dataset
The Detrended Normal Q-Q plot shows the differences between the observed and expected values of a normal distribution. If the distribution is normal, the points should cluster in a horizontal band around zero with no pattern.
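The idea behind a detrended plot can be sketched by subtracting the expected Normal quantile from each ordered observation. This is only an approximation of the concept: the exact plotting positions and scaling used by SPSS are not reproduced, and the function name `detrended_qq`, the q_i = (i - 0.5)/n positions and the data are assumptions.

```python
from statistics import NormalDist

def detrended_qq(data, model):
    """Return (observed value, observed minus expected) pairs."""
    ordered = sorted(data)
    n = len(ordered)
    return [
        (x, x - model.inv_cdf((i - 0.5) / n))
        for i, x in enumerate(ordered, start=1)
    ]

# Hypothetical right-skewed sample compared with a fitted Normal model.
skewed = [1, 1, 2, 2, 3, 4, 6, 9, 15, 30]
model = NormalDist.from_samples(skewed)

for observed, deviation in detrended_qq(skewed, model):
    print(f"{observed:5.1f}  {deviation:+6.2f}")
```

For data that follow the model, the deviations scatter around zero; a systematic curve in the deviations, as with this skewed sample, indicates a departure from normality.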
Q-Q Plot
The graphs below are examples for which a normal model for the response is not reasonable.
1. The Q-Q plot above left indicates the existence of two clusters of observations.
2. The Q-Q plot above right shows an example where the shape of the distribution appears to be skewed right.
3. The Q-Q plot below left shows evidence of an underlying distribution that has heavier tails compared to those of a normal distribution.
Q-Q Plot
• The Q-Q plot below right shows evidence of an underlying distribution which is approximately normal except for one large outlier that should be further investigated.
Q-Q Plot
• It is most important that you can see the departures in the above graphs, and not as important to know if the departure implies skewed left versus skewed right and so on. A histogram would allow you to see the shape and type of departure from normality.
Data Transformations and Their Uses

Data Transformation                      Can Correct For
Log Transformation (log(X))              Positive Skew, Unequal Variances
Square Root Transformation (sqrt(X))     Positive Skew, Unequal Variances
Reciprocal Transformation (1/X)          Positive Skew, Unequal Variances
Reverse Score Transformation             Negative Skew

Reverse Score Transformation: all of the above can correct for negative skew, but you must first reverse the scores. Just subtract each score from (the highest score in the data set + 1).
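The effect of these transformations on skewness can be checked directly. The sketch below uses hypothetical scores and the standard library only; `skewness` and `reverse_scores` are illustrative helper names, and the moment-based skewness formula is an assumption about how skew is measured.

```python
import math

def skewness(data):
    """Moment-based sample skewness (0 for symmetric data)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

def reverse_scores(data):
    """Reflect scores: subtract each one from (highest score + 1)."""
    top = max(data) + 1
    return [top - x for x in data]

TRANSFORMS = {
    "log": math.log,
    "sqrt": math.sqrt,
    "reciprocal": lambda x: 1.0 / x,
}

# Hypothetical positively skewed, strictly positive scores.
scores = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]
print(f"raw skew: {skewness(scores):.2f}")
for name, func in TRANSFORMS.items():
    print(f"{name} skew: {skewness([func(x) for x in scores]):.2f}")

# Negatively skewed scores: reverse first, then transform.
negative = [40, 39, 39, 38, 38, 37, 36, 33, 26, 1]
fixed = [math.log(x) for x in reverse_scores(negative)]
print(f"negative skew: {skewness(negative):.2f} -> {skewness(fixed):.2f}")
```

Note that reversing the scores also reverses their order, which has to be kept in mind when interpreting results on the transformed scale.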
Outliers and Extreme Scores
SPSS - Explore BoxPlot
• The top of the box is the upper fourth or 75th percentile.
• The bottom of the box is the lower fourth or 25th percentile.
• 50% of the scores fall within the box or interquartile range.
• The horizontal line is the median.
• The ends of the whiskers represent the largest and smallest values that are not outliers.
• An outlier, O, is defined as a value that lies more than 1.5 box-lengths below the bottom or above the top of the box.
• An extreme value, E, is defined as a value that lies more than 3 box-lengths below the bottom or above the top of the box.
• Normally distributed scores typically have whiskers that are about the same length, and the box is typically smaller than the whiskers.
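The boxplot rules above translate into a short classification routine. This is a sketch under assumptions: quartiles are computed with the standard library's `statistics.quantiles` using the "inclusive" method, which may differ slightly from the hinges SPSS uses, and the function name `boxplot_flags` and the data are hypothetical.

```python
import statistics

def boxplot_flags(data):
    """Classify values as outliers (> 1.5 box-lengths) or extremes (> 3)."""
    ordered = sorted(data)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    box = q3 - q1  # box length (interquartile range)
    outliers, extremes, inside = [], [], []
    for x in ordered:
        distance = max(q1 - x, x - q3)  # positive only outside the box
        if distance > 3 * box:
            extremes.append(x)
        elif distance > 1.5 * box:
            outliers.append(x)
        else:
            inside.append(x)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
        "extremes": extremes,
    }

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40]  # hypothetical scores
print(boxplot_flags(scores))
```

The whiskers end at the smallest and largest values that are neither outliers nor extremes, matching the description in the list above.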
Choosing a Z Score to Define Outliers

Z Score    Proportion above +Z    Proportion beyond ±Z
3.0        0.0013                 0.0026
3.1        0.0010                 0.0020
3.2        0.0007                 0.0014
3.3        0.0005                 0.0010
3.4        0.0003                 0.0006
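The tabulated values correspond to tail areas of the standard normal distribution, which can be recomputed with the standard library. The sketch below assumes that reading of the table; the function name `tail_proportions` is hypothetical. The two-tailed values computed this way can differ from the table in the last digit, because the table appears to double the already rounded one-tailed value.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1

def tail_proportions(z):
    """Return (proportion above +z, proportion beyond +z or below -z)."""
    upper = 1.0 - standard.cdf(z)
    return upper, 2.0 * upper

for z in (3.0, 3.1, 3.2, 3.3, 3.4):
    upper, both = tail_proportions(z)
    print(f"{z:.1f}  {upper:.4f}  {both:.4f}")
```

A cutoff can then be chosen by deciding what share of genuinely normal scores one is willing to label as outliers.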
Decisions for Extremes and Outliers
1. Check your data to verify all numbers are entered correctly.
2. Verify your devices (data testing machines) are working within manufacturer specifications.
3. Use non-parametric statistics; they don't require a normal distribution.
4. Develop criteria to use to label outliers and remove them from the data set. You must report these in your methods section.
   1. If you remove outliers, consider including a statistical analysis of the results with and without the outlier(s). In other words, report both; see Stevens (1990), Detecting outliers.
5. Do a log transformation.
   1. If your data have negative numbers, you must shift the numbers to the positive scale (e.g. add a constant to each score).
   2. Try a natural log transformation first; in SPSS use LN().
   3. Try a log base 10 transformation; in SPSS use LG10().
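The shift-then-log step in the last item can be sketched as follows. The size of the shift is a choice the analyst has to make; here the smallest value is moved to 1, which is an assumption for illustration, as are the function names and data. `math.log` and `math.log10` play the role of the SPSS functions LN() and LG10().

```python
import math

def shift_to_positive(data, margin=1.0):
    """Add a constant so the smallest value equals `margin` (if needed)."""
    lowest = min(data)
    shift = margin - lowest if lowest <= 0 else 0.0
    return [x + shift for x in data], shift

def log_transform(data, base10=False):
    """Shift to the positive scale, then take natural or base-10 logs."""
    shifted, shift = shift_to_positive(data)
    log = math.log10 if base10 else math.log
    return [log(x) for x in shifted], shift

scores = [-4, 0, 5, 20, 95]  # hypothetical scores with negative values
natural, shift = log_transform(scores)
base_ten, _ = log_transform(scores, base10=True)
print(f"shift applied: {shift}")
print([round(v, 3) for v in natural])
print([round(v, 3) for v in base_ten])
```

The shift should be reported together with the transformation, since it changes the scale on which results are interpreted.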