Sampling and Variability (Chapter 5.1 - 5.4)
Purpose of Sampling
• What is a data population?
• Problems with using all of the data
– The whole data set is not available
– There is too much data
– It is necessary to sample the data when building models
• Capture a sample
– A sample represents only some part of the population
Variability of Variables
• Main feature of a variable
– Takes on a variety of values
– Its values follow a pattern (a distribution)
• Numerical variables
• Categorical variables
• Graphical display of a distribution pattern
– Histogram, curve
• Problems
– Convergence: the true population distribution pattern is unknown
– Measuring variability: which distribution curve is the right one to use?
Converging
• To create a distribution curve for the sample
– Select instance values one at a time, at random
– Recalculate the curve each time a new instance value is added
• Convergence
– At first: each new value causes a large change
– After a while: the curve settles down and converges to its final shape
• Summary
– What is measured is not the shape of the curve, but the variability of the sample
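The convergence idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `variability_trace`, the `step` and `seed` parameters, and the synthetic normally distributed population are all hypothetical and not from the slides. The sketch draws instances one at a time at random and records the sample standard deviation at intervals, so the early large changes and the later settling can be seen in the numbers.

```python
import random
import statistics


def variability_trace(population, step=50, seed=0):
    """Add instances one at a time in random order and record the
    sample standard deviation after every `step` instances.

    Hypothetical helper for illustration; returns a list of
    (sample_size, standard_deviation) pairs.
    """
    rng = random.Random(seed)
    order = list(population)
    rng.shuffle(order)  # random selection without replacement

    sample = []
    trace = []
    for i, value in enumerate(order, start=1):
        sample.append(value)
        if i % step == 0:
            # Recalculate the measure each time the sample has grown.
            trace.append((i, statistics.stdev(sample)))
    return trace


if __name__ == "__main__":
    # Assumed synthetic population; any numeric column would do.
    rng = random.Random(1)
    population = [rng.gauss(100.0, 15.0) for _ in range(5000)]

    previous = None
    for n, sd in variability_trace(population, step=500):
        change = "" if previous is None else f"  change={abs(sd - previous):.3f}"
        print(f"n={n:5d}  stdev={sd:.3f}{change}")
        previous = sd
```

Watching the `change` column shrink as `n` grows is one simple way to see the sample's variability settle down.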
Measuring Variability
• Requires some method of measuring variability
– Without being sensitive to column width or smoothing method
• What is variability?
– How far the individual instances lie from the mean of the sample
• Standard deviation: one popular measure
– One formula:
s = \sqrt{\frac{\sum (x - m)^2}{n - 1}}
– Another formula (important for the data preparation process):
s = \sqrt{\frac{\sum x^2 - n m^2}{n - 1}}
• Why confidence?
– An alternative to sampling the whole population
– To establish some acceptable degree of confidence
• 95% as a satisfactory level of confidence
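The two standard deviation formulas in this section are algebraically equivalent, and a short Python sketch can show this. Here x is an instance value, m is the sample mean, and n is the number of instances. One plausible reading of why the second form matters for data preparation is that it needs only the running totals n, the sum of x, and the sum of x squared, so it can be updated as each instance arrives without a second pass over the data. That reading is an assumption, not something stated on the slides. The names `stdev_two_pass` and `RunningStdev` are hypothetical.

```python
import math


def stdev_two_pass(values):
    """First formula: sqrt(sum((x - m)^2) / (n - 1)).

    Needs the mean m first, so the data is visited twice.
    """
    n = len(values)
    m = sum(values) / n
    return math.sqrt(sum((x - m) ** 2 for x in values) / (n - 1))


class RunningStdev:
    """Second formula: sqrt((sum(x^2) - n * m^2) / (n - 1)).

    Keeps only three running totals, so each new instance is
    folded in with constant work and no stored sample.
    """

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_x2 = 0.0

    def add(self, x):
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x

    def value(self):
        if self.n < 2:
            raise ValueError("need at least two instances")
        m = self.sum_x / self.n
        variance = (self.sum_x2 - self.n * m * m) / (self.n - 1)
        # Subtracting two large, nearly equal totals can lose precision
        # in floating point; clamp tiny negative results to zero.
        return math.sqrt(max(variance, 0.0))


if __name__ == "__main__":
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    running = RunningStdev()
    for x in data:
        running.add(x)
    print(stdev_two_pass(data), running.value())
```

The running-totals form trades some numerical robustness for the ability to update incrementally, which fits the one-instance-at-a-time sampling described earlier.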
Variability of Numeric and Alpha Variables
• Distinction
– Alpha: nominal / categorical; measured on nonnumeric scales
– Numeric: measured on numeric scales
– The two differ in how variability is measured
• Measuring variability of numeric variables
– Covered above
– Random sampling without introducing bias
• Measuring variability of alpha variables
– Instead of standard deviation
– Rate of Discovery (ROD):
• Measures the rate of change of the relative proportion of values discovered
• As the sample size increases, the ROD of new alpha values falls
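The falling rate of discovery can be illustrated with a rough Python sketch. The slides do not give an exact formula for ROD, so this sketch makes an assumption: it approximates the rate as the fraction of instances in each batch that carry a value not seen before. The function name `rate_of_discovery` and the `batch` and `seed` parameters are hypothetical.

```python
import random


def rate_of_discovery(values, batch=100, seed=0):
    """Sample alpha values one at a time in random order and, after each
    batch, report how quickly new distinct values are still turning up.

    Returns a list of (sample_size, distinct_seen, new_per_instance).
    The per-batch rate is an assumed stand-in for ROD.
    """
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)  # random selection without replacement

    seen = set()
    trace = []
    new_in_batch = 0
    for i, value in enumerate(order, start=1):
        if value not in seen:
            seen.add(value)
            new_in_batch += 1
        if i % batch == 0:
            trace.append((i, len(seen), new_in_batch / batch))
            new_in_batch = 0
    return trace


if __name__ == "__main__":
    # Assumed synthetic categorical column with a few common values
    # and many rare ones.
    rng = random.Random(2)
    labels = [f"v{int(rng.expovariate(0.05))}" for _ in range(5000)]
    for n, distinct, rate in rate_of_discovery(labels, batch=500):
        print(f"n={n:5d}  distinct={distinct:4d}  new per instance={rate:.4f}")
```

When the per-batch rate stays near zero, the sample has probably discovered most of the values that occur with any frequency in the population.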