Sampling and Variability (Chapter 5.1 - 5.4)

Chengyuan Peng

92777A

Purpose of Sampling

• What is a Data Population

• Problems with using all of the data
– The whole data set may not be available
– There may be too much data
– It is necessary to sample the data when building models

• Capture a Sample:
– A sample represents the population using only some part of it

Variability of Variables

• Main Feature of a Variable

– Takes on a variety of values

– Its values follow a pattern (a distribution)

• Numerical variables

• Categorical variables

• Graphical Display of a Pattern Distribution
– Histogram, Curve

• Problems
– Convergence: the true population distribution pattern is unknown
– Measuring Variability: which distribution curve is the right one to use?

Converging

• To Create a Distribution Curve for the Sample
– Select instance values one at a time, at random
– Recalculate the curve each time a new instance value is added

• Converge
– At first: each new value causes a large change
– After a while: the curve settles down and converges to its final shape

• Summary
– What is measured is not the shape of the curve, but the variability of the sample
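The convergence behaviour described on this slide can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population is a hypothetical set of normally distributed values, the function names `variability_trace` and `max_step` are made up for this example, and the sample standard deviation stands in for "variability". It adds one randomly chosen instance at a time, recalculates the variability after each addition, and compares how much the measure moves early on versus late.

```python
import random
import statistics

def variability_trace(population, max_n, seed=0):
    """Draw instances one at a time at random and record the sample
    standard deviation after each addition (from the second instance on)."""
    rng = random.Random(seed)
    sample = []
    trace = []
    for _ in range(max_n):
        sample.append(rng.choice(population))  # random draw, with replacement
        if len(sample) >= 2:                   # stdev needs at least two values
            trace.append(statistics.stdev(sample))
    return trace

def max_step(values):
    """Largest absolute change between consecutive entries."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))

# Hypothetical population (an assumption for this sketch only).
pop_rng = random.Random(42)
population = [pop_rng.gauss(50.0, 10.0) for _ in range(10_000)]

trace = variability_trace(population, max_n=2000)

early = max_step(trace[:50])    # changes while the sample is still small
late = max_step(trace[-50:])    # changes once the sample is large
print(f"largest early change: {early:.4f}")
print(f"largest late change:  {late:.4f}")
```

With a setup like this, the early changes should be much larger than the late ones, which is the "settling down" the slide refers to.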

Measuring Variability

• Some Method of Measuring Variability is Required
– It should not be sensitive to histogram column width or smoothing method

• What is Variability
– How far the individual instances are from the mean of the sample

• Standard Deviation: One Popular Measure

– One formula:

$$ s = \sqrt{\frac{\sum (x - m)^2}{n - 1}} $$

– Another formula, important for the data preparation process:

$$ s = \sqrt{\frac{\sum x^2 - n m^2}{n - 1}} $$

where $x$ is an instance value, $m$ is the mean of the sample, and $n$ is the number of instances.

• Why Confidence
– An alternative to examining the whole population
– The aim is to establish some acceptable degree of confidence
  • For example, 95% as a satisfactory level of confidence
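The two standard deviation formulas on this slide are algebraically equivalent, and a short sketch can make that concrete. The code below is an illustration, not the author's implementation: `stdev_two_pass` follows the first formula, and the hypothetical `RunningStdev` class follows the second, keeping only the count, the sum of values, and the sum of squared values. One plausible reason the second form matters for data preparation is that it can be updated as each new instance arrives without revisiting earlier ones; that reading is an assumption, since the slide does not say why.

```python
import math

def stdev_two_pass(values):
    """First formula: sqrt( sum((x - m)^2) / (n - 1) )."""
    n = len(values)
    m = sum(values) / n
    return math.sqrt(sum((x - m) ** 2 for x in values) / (n - 1))

class RunningStdev:
    """Second formula: sqrt( (sum(x^2) - n*m^2) / (n - 1) ),
    maintained from running totals only."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_x2 = 0.0

    def add(self, x):
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x

    def value(self):
        if self.n < 2:
            raise ValueError("need at least two instances")
        m = self.sum_x / self.n
        variance = (self.sum_x2 - self.n * m * m) / (self.n - 1)
        # Guard against tiny negative values caused by rounding.
        return math.sqrt(max(variance, 0.0))

# Hypothetical sample values for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

running = RunningStdev()
for x in data:
    running.add(x)

print(stdev_two_pass(data), running.value())
```

The running-sums form can lose precision when the values are very large relative to their spread, so the two-pass form is the safer choice when all of the data is available at once.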

Variability of Numeric and Alpha Variables

• Distinction

– Alpha: for nominal / categorical; measured in nonnumeric scales

– Numeric: measured in numeric scales

– The two are treated differently when measuring variability

• Measuring Variability of Numeric Variables
– Covered above
– Requires random sampling that does not introduce bias

• Measuring Variability of Alpha Variables
– Instead of standard deviation
– Rate of Discovery (ROD):
  • Measures the rate of change of the relative proportion of values discovered
  • As sample size increases, the ROD of new alpha values falls
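The falling rate of discovery can be sketched with a small simulation. This is one simplified reading of ROD, not a definition taken from the chapter: it counts how many distinct alpha values have been seen as instances are added in random order, then reports new distinct values per added instance over a range of sample sizes. The category labels, their weights, and the function names `discovery_trace` and `rate_of_discovery` are all assumptions made for illustration.

```python
import random

def discovery_trace(values, seed=0):
    """Shuffle the values, then record the number of distinct values
    seen after each instance is added to the sample."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    seen = set()
    trace = []
    for v in shuffled:
        seen.add(v)
        trace.append(len(seen))
    return trace

def rate_of_discovery(trace, start, end):
    """New distinct values per added instance between sample sizes
    start and end (both 1-based, start < end)."""
    return (trace[end - 1] - trace[start - 1]) / (end - start)

# Hypothetical alpha variable: 20 labels with unequal frequencies.
labels = [f"cat_{i}" for i in range(20)]
weights = [1.0 / (i + 1) for i in range(20)]
data_rng = random.Random(7)
values = data_rng.choices(labels, weights=weights, k=5000)

trace = discovery_trace(values)

early_rod = rate_of_discovery(trace, 1, 50)       # small sample
late_rod = rate_of_discovery(trace, 4000, 5000)   # large sample
print(f"early ROD: {early_rod:.4f}")
print(f"late ROD:  {late_rod:.4f}")
```

In a run like this the early rate should be well above the late rate, since most labels are found quickly and few or none remain to be discovered once the sample is large.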
