Ch 2 BB Basic Statistics


  • 7/29/2019 Ch 2 BB Basic Statistics


    Chapter 2:

    Basic Statistics


    Part A

    Types of Data, Data Quality and Data Collection


    Data

    Data are facts or figures related to any characteristic (also called a variable) of an individual (e.g. a machine, a year, a casting, a dimension, a person).

    Power station outages (since commissioning, up to 31/03/01)

    Columns are the VARIABLES; rows (stations) are the INDIVIDUALS.

    Station  Date of        Availability  No. of   Avg. duration of non-stop  Avg. loss per outage (hours)  Main cause  Capacity
             commissioning  (%)           outages  operation (days)           Forced    Planned             of outage   utilization
    C:15     12/11/98       92.59         30       27                         64        52                  Leakage     High
    C:16     10/05/97       93.04         47       28                         52        52                  Leakage     Mod.
    D        12/10/78       88.32         124      58                         261       164                 Gen*        V. Low
    E        31/12/84       82.77         116      42                         440       158                 Gen*        Low
    F        29/09/88       89.23         82       50                         379       79                  Gen*        High

    * Generator stator / rotor problem


    Types of Data/Variable

    Data/Variable

    - Numerical/Quantitative: Continuous, Discrete

    - Categorical/Qualitative: Ordinal, Nominal


    Types of Data - Examples

    Continuous: An infinite number of values (positive or negative) are possible, e.g. measurements of weight, length, chemical composition.

    Discrete: The variable can take the values 0, 1, 2, 3, ..., e.g. a count or frequency (# of defects, breakdowns etc.)

    Ordinal: Data classified in ordered categories, e.g. quality of service classified as poor, moderate or good, or yearly rainfall classified as very low, low, moderate, good or very good.

    Nominal: Data classified in categories having no inherent or explicit order, e.g. location classified as east, west, north or south, or names of departments.


    Types of Data - Outage Data Example

    Variable Name                                      Variable Type

    1. Date of commissioning

    2. Availability (%)

    3. Number of outages since commissioning

    4. Average duration of non-stop operation (days)

    5. Average loss per outage (hours)

    6. Main cause of outage

    7. Capacity utilization


    Types of Data - Further Considerations

    Continuous data may appear as discrete either due to rounding (see the outage data example) or due to measurement limitations. We should treat such data as continuous unless the number of levels in the data set is very small (say 2-4).

    However, hourly records of steam pressure at the turbine inlet (station F) show that the values are either 126, 127 or 128. Great care must be exercised while analyzing such data.

    Discrete data having seven or more levels may be treated as continuous data.

    Dichotomous data (O.K/Not O.K, Pass/Fail etc.) may be treated as discrete data after coding the two categories as 1 (O.K) and 0 (Not O.K).


    Variable and Attribute Data

    In the field of Quality Control, the various types of data are classified as

    - VARIABLE DATA: Continuous data

    - ATTRIBUTE DATA: All others, i.e. discrete data and counts of items falling in various categories (dichotomous, ordinal and nominal)

    Henceforth we shall use this latter classification.


    Data Gateway

    Problem/Hypothesis -> [DATA COLLECTION] -> Data -> [DATA ANALYSIS] -> Solution/Fact

    Quality problems cannot be solved merely on the basis of experience.

    Any claim not backed by data is only a hypothesis.

    Data Gates: The quality of the data gates and their placement at appropriate locations of a process are extremely important for process control.

    Data Quality: The data collection step is vital: garbage in, garbage out.


    Information Content in Data for Process Control

    Source of Data                                   Attribute Data   Variable Data
    General literature                               Very low         Low
    Past data: In-house routine Q.C. records         Low              Moderate
    Past data: Statistically designed experiments    Moderate         High
    Live data: Passive observation of the process    Moderate         High
    Live data: Statistically designed experiments    High             Very high

    Do not transform variable data to attribute data.

    That will be like burning diamond for heat.


    Data Collection Process

    Data are laid out with the INDIVIDUALS as rows and the VARIABLES as columns:

              Var. 1   Var. 2   Var. 3   . . .   Var. p
    Ind. 1    Data     Data     Data             Data
    Ind. 2    Data     Data     Data             Data
    Ind. 3    Data     Data     Data             Data
    . . .
    Ind. n    Data     Data     Data             Data

    Steps of the process: Population -> Sample -> Measurement -> Recording -> Editing, Storage, Retrieval


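    Linking Data Quality to Data Collection Process

    [Matrix: each element of the data collection process (rows) is set against each type of data quality problem (columns).]

    Columns (data quality problems): Wrong, Noisy, Irrelevant, Inadequate, Hard, Redundant

    Rows (process elements):

    - Population: Individual, Variables

    - Sample: Procedure, Size

    - Measurement: Gauge, Appraiser, Others

    - Recording: Format, Recorder

    - Editing, Storage, Retrieval: Issues related to database mgmt.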


    Measurement Related Causes for Poor Data Quality

    [Cause-and-effect diagram. Effect: poor data quality (measurement). Causes as labelled on the diagram:]

    - Gauges
      - Calibration: status (not done, done long back); results (not used, not traceable)
      - Number: many (variable least count, different makes)
      - Capability: operating range (beyond limit); type of data (unwanted); precision (low repeatability, low least count)
      - Operation: malfunctioning, breakdown

    - Appraisers: bias, inadvertent error, number, reproducibility

    - Measurand: unstable, inhomogeneous

    - Method: standard procedure (not available, not followed); communication


    Data Collection Planning - Principle of Inverse Loading

    The Planning Questions (the slide marks the list with "Plan" and "Execute" arrows), each with its illustration:

    1) What do you want to know?
       Illustration: Has X any effect on Y?

    2) How do you want to see what it is that you need to know?
       Illustration: [sketches of the distribution of Y at X1, X2, X3 and of Y plotted against X]

    3) What type of tool will generate what it is that you need to see?
       Illustration: Histogram; Scatter diagram

    4) What type of data is required of the selected tool?
       Illustration: for the histogram, values of Y listed under X1, X2, X3 (Y11 ... Y1n, Y21 ... Y2p, Y31 ... Y3q); for the scatter diagram, pairs (X1, Y1), ..., (Xn, Yn)

    5) Where can you get the required type of data?
       Illustration: Final inspection and production log book; nowhere - to be collected


    Check Sheet and Data Sheet

    Check Sheet: Checks (/, x etc.) are made against a category of a variable or a combination of categories of several variables. Used primarily for collecting attribute data.

    Data Sheet: Measurement results are recorded against an individual and its characteristics. Used for collecting both attribute and variable data.

    Many consider all check sheets to be data sheets and vice versa. However, we shall distinguish between the two as above.


    Process Distribution Check Sheet: Power Generation Process (Moving Target)

    Month: September

    Process average (Y1 bar): 420 MW

    Characteristic: Y1= Total generation (MW), Y2= System demand

    Sampling interval: Every 3.5 hours

    Target: Min(420, Y1) Data: Target - Y1 bar

    Class Interval Check Frq

    55.01 4

    Total No. of observations: 206

    Import limit = +20

    Export limit = -10

    Wasteful import

    due to lack of control

    Wasteful export

    due to lack of control

    Defect rate = 27 %


    Causes for Wasteful Import of Power

    [Figure: Run chart of half-hourly readings of generation at station C15 in September 2001. Vertical axis from 0.0 to 35.0; horizontal axis: reading number, 1 to about 1340. Regions marked A, B, C and D.]

    A: Process failure; B: Process deficiency; C: Early slow down; D: Late pick up


    Defect Cause Check Sheet

    Month: September, 2001. Data: # of hours of generation affected

    [Check marks in the individual cells are not reproduced here.]

    Defect \ Station      C15   C16   D     E     F     Total
    Process failure                                     52
    Process deficiency                                  81
    Early slow down                                     15
    Late pick up                                        34
    Total                 54    22    65    21    20    182

    Note: Criticality of the defects is not the same over all stations.


    Identifying Critical Causes for Wasteful Import

    (Hours of low generation) x (Average generation loss at each instant) = Total generation loss (MWH)

    Hours of low generation

          C15    C16    D      E      F
    PF    30     0      15     7      0
    PD    11     9      36     14     11
    ES    2      2      9      0      2
    LP    11     11     5      0      7

    Average generation loss at each instant

          C15    C16    D      E      F
    PF    29.0   29.5   107.0  103.5  110.0
    PD    5      2      10     5      5
    ES    10     4      30     -      15
    LP    10     4      30     -      15

    Total generation loss (MWH)

           C15    C16    D      E      F      Total
    PF     870    0      1605   725    0      3200
    PD     55     18     360    70     55     558
    ES     20     8      270    0      30     328
    LP     110    44     150    0      105    409
    Total  1055   70     2385   795    190    4495

    PF = Process failure

    PD = Process deficiency

    ES = Early slow down

    LP = Late pick up
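
    The product of the two tables can be checked with a short script. This is only a sketch: the variable and function names are illustrative, and the "-" entries for station E are entered as 0 because the corresponding hours are 0.

```python
# Tables copied from the slide: rows are causes, columns are stations.
STATIONS = ["C15", "C16", "D", "E", "F"]

HOURS = {  # hours of low generation
    "PF": [30, 0, 15, 7, 0],
    "PD": [11, 9, 36, 14, 11],
    "ES": [2, 2, 9, 0, 2],
    "LP": [11, 11, 5, 0, 7],
}

AVG_LOSS = {  # average generation loss at each instant
    "PF": [29.0, 29.5, 107.0, 103.5, 110.0],
    "PD": [5, 2, 10, 5, 5],
    # "-" on the slide (station E) is entered as 0; the hours there are 0 anyway.
    "ES": [10, 4, 30, 0, 15],
    "LP": [10, 4, 30, 0, 15],
}


def total_loss(hours, avg_loss):
    """Cell-by-cell product: hours x average loss = total loss (MWH)."""
    return {c: [h * a for h, a in zip(hours[c], avg_loss[c])] for c in hours}


loss = total_loss(HOURS, AVG_LOSS)
by_cause = {c: sum(row) for c, row in loss.items()}
by_station = {s: sum(loss[c][j] for c in loss) for j, s in enumerate(STATIONS)}

# Rank the causes by total loss, largest first.
for cause, mwh in sorted(by_cause.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cause}: {mwh:.1f} MWH")
```

    The PF cell for station E works out to 7 x 103.5 = 724.5, which the slide shows rounded to 725; the ranking of causes is unaffected.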


    Other Types of Check Sheets

    Defective item check sheet

    - Checks are made against various causes of rejection/rework of an item.

    Defect location check sheet

    - Instead of a table, a diagram is made of the defect space.

    - Checks are made at the location where the defect occurs.

    - Locational segregation of defects, if any, provides a valuable clue.

    - Examples: leakage in a cooling system, cracks in castings, wear out of moving parts.


    Other Types of Check Sheets (..Contd.)

    Check-up confirmation check sheet

    - Used to make a comprehensive check-up of product/process quality (usually at the final stage).

    - Preprinted items of checks avoid duplication and missing of the tests to be performed.

    - It is a variation of the check list, which is used for checking whether all the tasks have been performed or not.

    C-E diagram check sheet

    - Checks are made against the cause of a problem in the C-E diagram.


    Data Sheet - General Format

    Title

    Common relevant information

    Individual   Var. 1   Var. 2   . . .   Var. p   Remark
    Ind. 1
    Ind. 2
    . . .
    Ind. n

    Important summary of data

    Notes:


    Data Sheet - Example

    Up-load detention report for the month of July, 2001

    Rake  Date  Arrival  Quality   # of    Form  Form   Depart.  Depart.  Deten.  Demur.  Reason      Actual unloading
    No.         time               wagons  date  time   date     time     hours   hours               time - Hr.
    01    01    19.45    Enviro    58      02    05.35  02       15.30    09.55   -       -           09.00
    . . .
    20    14    07.50    Du. hill  58      15    16.45  16       00.20    07.35   23      S(19)+I(4)  14.30
    . . .
    42    31    20.20    . . .                                                                        14.45

    Purpose?

    - Estimation of demurrage hours

    - Control of demurrage hours

    Important reasons cited are receipt in quick succession, successive detentions and wet coal. These are beyond the control of the coal handling section.

    Inadequate Data!


    Part B

    Summarization of Data


    Data Analysis - Getting Started

    Half-hourly record of generation by station E from 19/9/01 (10 hrs.) to 21/9/01 (1.30 hrs.) under normal operating conditions

    Hours            Generation (MW)
    10.00 - 13.30    102.8 105.2 103.2 104.0 105.2 104.8 105.6 105.0
    14.00 - 17.30    105.0 104.0 104.0 105.2 106.0 106.4 103.2 104.2
    18.00 - 21.30    102.0 103.6 103.8 105.0 105.2 105.2 106.0 105.0
    22.00 - 01.30    103.0 103.2 103.0 103.0 104.2 105.8 105.4 104.8
    02.00 - 05.30    104.8 105.2 105.2 106.0 104.0 104.2 103.8 104.4
    06.00 - 09.30    104.0 102.2 103.4 104.4 104.4 104.2 104.8 106.2
    10.00 - 13.30    106.4 104.8 102.8 103.6 104.8 104.4 104.8 104.0
    14.00 - 17.30    104.0 104.0 104.0 104.0 104.4 104.0 102.6 103.0
    18.00 - 21.30    104.8 102.8 104.0 103.4 103.6 104.0 104.0 103.4
    22.00 - 01.30    106.0 104.4 104.4 102.4 102.8 105.0 105.2 105.2

    What are your conclusions?


    Frequency Distribution - Analyzing a large data set on the same variable

    Class Interval    Frequency
    101.7 - 102.3     02
    102.3 - 102.9     06
    102.9 - 103.5     10
    103.5 - 104.1     19
    104.1 - 104.7     11
    104.7 - 105.3     22
    105.3 - 105.9     03
    105.9 - 106.5     07
    Total             80

    Generation data set (previous slide). The eighty observations are grouped in eight classes of equal width. (The tally marks are not reproduced here.)

    Does the frequency distribution provide better insight into the process?

    DATA + ANALYSIS = INFORMATION

    Data are not information
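
    The grouping of the eighty readings into the eight classes can be reproduced with a short script. This is a sketch only: the function name and the trick of working in integer tenths (to keep floating-point rounding from misplacing a reading) are illustrative choices, not part of the slides.

```python
# Half-hourly generation readings (MW) of station E, copied from the data set.
GENERATION = [
    102.8, 105.2, 103.2, 104.0, 105.2, 104.8, 105.6, 105.0,
    105.0, 104.0, 104.0, 105.2, 106.0, 106.4, 103.2, 104.2,
    102.0, 103.6, 103.8, 105.0, 105.2, 105.2, 106.0, 105.0,
    103.0, 103.2, 103.0, 103.0, 104.2, 105.8, 105.4, 104.8,
    104.8, 105.2, 105.2, 106.0, 104.0, 104.2, 103.8, 104.4,
    104.0, 102.2, 103.4, 104.4, 104.4, 104.2, 104.8, 106.2,
    106.4, 104.8, 102.8, 103.6, 104.8, 104.4, 104.8, 104.0,
    104.0, 104.0, 104.0, 104.0, 104.4, 104.0, 102.6, 103.0,
    104.8, 102.8, 104.0, 103.4, 103.6, 104.0, 104.0, 103.4,
    106.0, 104.4, 104.4, 102.4, 102.8, 105.0, 105.2, 105.2,
]


def frequency_distribution(values, lower, width, n_classes, scale=10):
    """Classify every value in a single pass over the data.

    Limits are converted to integer units of 1/scale so that binary
    floating-point error cannot push a reading into the wrong class.
    """
    lo = round(lower * scale)
    h = round(width * scale)
    freq = [0] * n_classes
    for v in values:
        idx = (round(v * scale) - lo) // h
        if not 0 <= idx < n_classes:
            raise ValueError(f"{v} lies outside the classes")
        freq[idx] += 1
    return freq


freq = frequency_distribution(GENERATION, lower=101.7, width=0.6, n_classes=8)
for i, f in enumerate(freq):
    low = 101.7 + 0.6 * i
    print(f"{low:.1f} - {low + 0.6:.1f}  {'|' * f}  {f:02d}")
```

    Each printed row shows the class interval, a crude tally and the frequency, which makes the alternating peaks easy to see.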


    Constructing Frequency Distributions - Variable Data

    Data set

    Number of observations (N): About 100 on the same variable.

    Formation of the classes (first column)

    Number of classes (k)

    Too many classes obscure the pattern of the distribution because of sampling fluctuations. Details are lost with too few classes. The optimum number of classes is given by $k = 1 + 3.3 \log_{10}(N)$.

    The simpler formula $k = \sqrt{N}$ also works well in practice.

    For better visual impact, it is preferable to have $5 \le k \le 12$.

    For the generation data set we have N = 80. Therefore, k =

    1+3.3*log(80) = 7.3. This means the number of classes should be

    either 7 or 8. We have chosen 7 classes.
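
    A minimal sketch of the two rules for the number of classes, evaluated for N = 80. The function names are illustrative; the first rule is the one commonly known as Sturges' rule.

```python
import math


def classes_by_log_rule(n):
    """k = 1 + 3.3 * log10(N)."""
    return 1 + 3.3 * math.log10(n)


def classes_by_sqrt_rule(n):
    """k = sqrt(N), the simpler alternative."""
    return math.sqrt(n)


N = 80
k_log = classes_by_log_rule(N)
k_sqrt = classes_by_sqrt_rule(N)
print(f"log rule:  k = {k_log:.1f}")
print(f"sqrt rule: k = {k_sqrt:.1f}")
```

    Both values fall inside the preferred range of 5 to 12 classes, so either 7 or 8 (log rule) or about 9 (square-root rule) would be a reasonable starting point.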


    Constructing Frequency Distributions (..Contd.)

    Class width (h): $h = (R + w) / k$

    where R = Range of the observations = Maximum - Minimum, and w = Least count of measurement.

    Next, h is rounded to the nearest integer multiple of w. This means that if the least unit of measurement (w) is 0.1, then h = 2.312 should be rounded to 2.3. However, if w = 0.2, then the same h should be rounded to 2.4.

    In our generation data example, R = 106.4 - 102.0 = 4.4 and w = 0.2. Thus, h = (4.4 + 0.2) / 7 = 0.657, which is rounded to 0.6. We shall explain later why taking h = 0.7 would be erroneous.

    Note that if h is rounded down then we shall need (k+1) classes to cover the whole range of the observations. How many classes shall we need if h is rounded up?
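
    The class-width rule, h = (R + w) / k rounded to the nearest integer multiple of the least count w, can be sketched as follows. The function names are illustrative, and exact ties follow Python's round-half-to-even behaviour, which the slides do not specify.

```python
def round_to_multiple(value, w):
    """Round value to the nearest integer multiple of w (at least one multiple)."""
    multiples = max(1, round(value / w))
    # The final round() only strips floating-point noise such as 0.6000000000000001.
    return round(multiples * w, 10)


def class_width(minimum, maximum, w, k):
    """Return the raw width (R + w) / k and its value rounded to a multiple of w."""
    raw = (maximum - minimum + w) / k
    return raw, round_to_multiple(raw, w)


raw_h, h = class_width(minimum=102.0, maximum=106.4, w=0.2, k=7)
print(f"raw h = {raw_h:.3f}, rounded h = {h}")

# The two rounding illustrations: the same raw width under different least counts.
print(round_to_multiple(2.312, 0.1))
print(round_to_multiple(2.312, 0.2))
```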


    Constructing Frequency Distributions (..Contd.)

    Class limits

    The minimum value of the generation data is 102.0 and the class width has been determined as 0.6. So we can form the classes as

    102.0 - 102.6, 102.7 - 103.3, 103.4 - 104.0, . . .

    The problem with the above classification is that there is a gap between two successive class intervals. This is not desirable since we are dealing with continuous data.

    The discontinuity can be removed by forming the classes as

    102.0 - 102.6, 102.6 - 103.2, 103.2 - 103.8, . . .

    However, this classification has another problem. Suppose we have an observation 102.6. In which class shall we place it, the first or the second?

    In order to avoid such confusion we take

    Lower limit of the first class = Minimum - w/2

    and then successively add the class width to this lower limit to obtain the other class limits.
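
    A short sketch of this construction: the first lower limit is the minimum minus half the least count, and each further limit is obtained by adding the class width. `Decimal` is used here only to keep the limits exact; the function name and the choice of eight classes are assumptions for illustration.

```python
from decimal import Decimal


def class_limits(minimum, w, h, n_classes):
    """Return (lower, upper) pairs starting at minimum - w/2 with width h.

    Arguments are passed as strings so that Decimal keeps them exact.
    """
    lower = Decimal(minimum) - Decimal(w) / 2
    width = Decimal(h)
    return [(lower + i * width, lower + (i + 1) * width) for i in range(n_classes)]


limits = class_limits("102.0", "0.2", "0.6", n_classes=8)
for low, high in limits:
    print(f"{low} - {high}")
```

    Because every limit ends in an odd tenth while the readings are recorded in steps of 0.2, no reading can coincide with a class limit.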


    Constructing Frequency Distributions (..Contd.)

    Class limits (..Contd.)

    Thus, for the generation data we have the classes as

    101.9 - 102.5, 102.5 - 103.1, 103.1 - 103.7, 103.7 - 104.3,
    104.3 - 104.9, 104.9 - 105.5, 105.5 - 106.1, 106.1 - 106.7

    Note that now we have

    - 8 classes (since h has been rounded down from 0.657 to 0.6)

    - no confusion in classification (since there are no observations which fall on the class limits) and

    - an extended last class (ideally the upper limit of the last class should have been 106.5).

    In the example, we have extended the first class instead of the last one, since this has brought out the process abnormalities better. Thus the eight classes used are

    101.7 - 102.3, 102.3 - 102.9, . . ., 105.9 - 106.5


    Constructing Frequency Distributions (..Contd.)

    Tally marking (second column)

    - Start with the first observation. Find the class to which the observation belongs. Put a tally against the class.

    - Classify all the remaining observations as above.

    - Tally marks are grouped in fives, with the fifth tally crossed through the previous four. This provides a better visual display and helps in counting the frequency of each class.

    - Note that all the observations get classified as we go through them only once. However, if we concentrate on one class at a time and try to find the number of observations in it, then we have to go through the observations k times. This not only consumes more time but also increases the chance of committing an error.

    Counting frequency (third column)

    - The frequency (f) of each class is obtained simply by counting the tallies.

    Other columns

    - Columns giving cumulative frequency (f1, f1+f2, ..) and relative frequency (f1/N, f2/N, ..) may also be added, if required.


    Constructing Frequency Distributions - Getting the class intervals right

    Why the class width (h) is rounded to the nearest integer multiple of w

    Consider the same generation data example. Here w = 0.2. Assume that h = 0.657 is rounded to 0.7 (which is not an integer multiple of 0.2) instead of 0.6. Thus the classes will be 101.9 - 102.6, 102.6 - 103.3, ..

    Now, in order to overcome the problem of classifying observations like 102.6, we are forced to consider w = 0.1 and have the classes as

    101.95 - 102.65, 102.65 - 103.35, 103.35 - 104.05, 104.05 - 104.75, 104.75 - 105.45, 105.45 - 106.15, 106.15 - 106.85

    Note that the number of observation units covered by each class is not the same. For example, the second class covers three units (102.8, 103.0 and 103.2) but the third class covers four units (103.4, 103.6, 103.8 and 104.0). As a result the frequency distribution is likely to show many peaks.

    Balancing end points

    Assuming w = 0.1, the seven classes shown above should be appropriate. However, note that the last class is extended by four units beyond the maximum observed value of 106.4. It is desirable to distribute this imbalance over the two end classes by starting the first class from 101.75 and ending the last class at 106.65.


    Frequency Distribution of the Generation Data - Further analysis

    The frequency distribution shows an abnormal pattern (nearly alternating peaks). Does this mean the process mean is jumping randomly by about 1.2 units?

    The following two frequency distributions, constructed from the same data, provide some additional clues.

    Fractional part   Frequency
    .0                27
    .2                18
    .4                15
    .6                5
    .8                15
    Total             80

    .0s occur more frequently at the cost of .6s. Does this indicate measurement bias?

    Class interval    Frequency
    101.7 - 102.7     04
    102.7 - 103.7     17
    103.7 - 104.7     26
    104.7 - 105.7     25
    105.7 - 106.7     08
    Total             80

    Smooth pattern (left skewed). Smoothness has been achieved not only by reducing the number of classes but also by including the adjacent .0s and .6s in the same interval.


    Histogram

    A histogram is a graphical representation of a frequency distribution of variable data.

    The histogram of the generation data having five classes is shown below.

    [Figure: histogram. Horizontal axis: Generation in E station (MW), with class limits marked at 101.7, 103.7 and 105.7; vertical axis: Frequency, from 0 to 30.]

    - Bars of equal width (= class width)

    - Heights of the bars are proportional to the frequencies of the classes

    - Bar width of about 1 cm (7-10 classes)

    - Horizontal axis is about 1.6 times longer than the vertical axis

    Central tendency: About 104.2.

    Pattern of variation: Slightly left skewed.

    Specification limits: Should be shown wherever applicable.

    Class mid-point: Marking the class mid-points may be helpful in certain cases.

    Open ended classes: Avoid adding too many classes at the ends having zero or very low frequencies. These are shown as open ended bars with arbitrarily reduced heights.


    Construction of Histogram - An exercise

    Half-hourly record of power (MW) generated by station E from 29.9.2001 (10.00 hours) to 30.9.2001 (24.00 hours) gives us the following data.

    6.4 6.4 6.8 6.0 5.2 4.8 6.4 4.4 5.2 6.0

    7.6 8.0 7.4 6.6 8.0 5.6 7.2 7.2 7.0 4.0

    6.4 8.0 8.0 6.0 6.0 6.4 7.8 7.6 7.6 7.4

    7.6 7.6 7.4 4.6 4.2 4.8 6.0 5.6 5.4 5.0

    6.2 7.8 7.4 7.2 7.4 7.8 6.6 6.4 6.8 6.8

    6.8 6.8 6.6 6.8 6.6 6.8 6.8 6.8 7.0 7.0

    6.0 5.6 4.4 4.6 4.6 4.8 6.2 7.0 6.6 6.4

    5.2 5.2 7.2 7.4 6.0 5.0 7.0 7.6 7.6 7.4

    5.2 7.2 7.2 7.0 7.2 6.8 6.0 6.0 6.0 5.2

    (The readings run row by row from 29/9, 10 hrs., to 30/9, 24 hrs.)

    Construct a histogram of the above data set. Compare it with the histogram for the period 19.9.01 to 21.9.01 (previous slide) and offer your comments.


    Commonly Observed Histogram Patterns

    - Single peak, symmetric, bell shaped: the commonly observed pattern of a stable process

    - Single peak, positively skewed (long tail on the right)

    - Single peak, negatively skewed (long tail on the left)

    Many characteristics follow such patterns. We have already seen that generation data is negatively skewed while breakdown data is positively skewed. However, such shapes may also indicate process instability.

    - Single peak, thick tail

    - Two peaks (bi-modal)

    (The specification limits LSL and USL are marked on the slide's sketches.)


    Frequency Distribution of Discrete Data

    Number of plant outages in each year since commissioning

    Station  Period               Type of outage  # of outages in a year
    D        1978-79 to 2000-01   Forced          2, 3, 1, 0, 3, 2, 1, 0, 2, 2, 0, 2, 3, 0, 2, 1, 2, 1, 1, 0, 1, 0, 2
                                  Planned         3, 5, 1, 4, 2, 5, 2, 1, 6, 3, 7, 7, 4, 7, 6, 5, 6, 4, 2, 2, 2, 6, 2
    E        1985-86 to 2000-01   Forced          2, 2, 5, 3, 0, 0, 1, 0, 1, 0, 2, 1, 1, 0, 1, 4
                                  Planned         15, 7, 8, 3, 7, 5, 2, 6, 3, 8, 7, 4, 5, 4, 3, 4
    F        1988-89 to 2000-01   Forced          4, 1, 1, 0, 0, 1, 1, 2, 0, 1, 0, 1, 6
                                  Planned         3, 11, 6, 12, 4, 0, 1, 2, 8, 2, 4, 4, 6

    Ideally we should construct six frequency distributions (one for each type of outage in each station). However, due to shortage of data we shall construct only two: one for forced outages and the other for planned outages.

    What can you say about the occurrence of the two types of outages from the above data set?
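
    For discrete data each distinct count is its own class, so the two pooled frequency distributions can be built by simple counting. The sketch below pools the three stations as described; the variable names are illustrative.

```python
from collections import Counter

# Yearly outage counts copied from the table, pooled over stations D, E and F.
FORCED = (
    [2, 3, 1, 0, 3, 2, 1, 0, 2, 2, 0, 2, 3, 0, 2, 1, 2, 1, 1, 0, 1, 0, 2]
    + [2, 2, 5, 3, 0, 0, 1, 0, 1, 0, 2, 1, 1, 0, 1, 4]
    + [4, 1, 1, 0, 0, 1, 1, 2, 0, 1, 0, 1, 6]
)
PLANNED = (
    [3, 5, 1, 4, 2, 5, 2, 1, 6, 3, 7, 7, 4, 7, 6, 5, 6, 4, 2, 2, 2, 6, 2]
    + [15, 7, 8, 3, 7, 5, 2, 6, 3, 8, 7, 4, 5, 4, 3, 4]
    + [3, 11, 6, 12, 4, 0, 1, 2, 8, 2, 4, 4, 6]
)

forced_freq = Counter(FORCED)
planned_freq = Counter(PLANNED)

for label, freq in (("Forced", forced_freq), ("Planned", planned_freq)):
    print(label)
    for outages in range(max(freq) + 1):
        # Counter returns 0 for values that never occur.
        print(f"  {outages:2d} outages in a year: {freq[outages]:2d} years")
```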


    Summary Measures


    Summary Measures of a Univariate Data Set

    Type                                 Commonly Used Measures*
    Measures of Location or Centre       Mean, Median, Mode, Trimmed Mean, Geometric Mean
    Measures of Spread or Variability    Range, Standard Deviation, Entropy (for nominal data)
    Measures of Shape                    Skewness, Kurtosis
    General Measure                      Quartiles

    * There are a host of other measures developed for specific applications


    Arithmetic Mean

    Usually referred to as MEAN or AVERAGE

    May be used for ordinal data but not for nominal data

    Sensitive to extreme values

    $\text{MEAN} = \dfrac{\text{Sum of all the observations}}{\text{Number of observations}} = \dfrac{X_1 + X_2 + X_3 + \dots + X_{N-1} + X_N}{N} = \dfrac{1}{N} \sum_{i=1}^{N} X_i$

    Notation: $\bar{X}$

    Example: In a rising voltage test, the alternating breakdown voltages (kV) of 24 samples of an insulation arrangement were found to be as follows:

    210; 208; 208; 175; 182; 206; 190; 194; 198; 205; 212; 200; 205; 202; 207; 210; 202; 201; 188; 205; 209; 201; 216; 196

    MEAN = [210 + 208 + ... + 216 + 196] / 24 = 201.25 kV
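
    The same calculation as a brief sketch, using only the standard library (the variable names are illustrative):

```python
# Breakdown voltages (kV) of the 24 samples, copied from the example.
VOLTAGE_KV = [
    210, 208, 208, 175, 182, 206, 190, 194, 198, 205, 212, 200,
    205, 202, 207, 210, 202, 201, 188, 205, 209, 201, 216, 196,
]

n = len(VOLTAGE_KV)
mean_kv = sum(VOLTAGE_KV) / n  # sum of all observations / number of observations
print(f"N = {n}, mean = {mean_kv} kV")
```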


    Mean of Grouped Data

    Notations

    Class: i (= 1, 2, ..., k)

    Frequency of the ith class: fi

    Value of the ith class: Mi (class mid-point if class width > least count)

    Formula

    $\bar{X} = \dfrac{\sum_{i=1}^{k} f_i M_i}{\sum_{i=1}^{k} f_i}$

    Example: The observations { 1.3, 1.3, 1.5, 3.3, 3.5, 3.5, 3.5, 3.6, 5.4, 5.4, 5.8, 7.3, 7.4, 9.1} are grouped as follows:

    i       Class Interval   Mi     fi   fi * Mi
    1       1.25 - 3.25      2.25   3    06.75
    2       3.25 - 5.25      4.25   5    21.25
    3       5.25 - 7.25      6.25   3    18.75
    4       7.25 - 9.25      8.25   4    33.00
    Total                           15   79.75

    Mean = 79.75/15 = 5.32

    Mean of ungrouped data = 4.62. Thus the error due to grouping is 5.32 - 4.62 = 0.7, which is close to the maximum value possible, i.e. (class width/2) = 1.0. WHY?

    In general, the error will not be so large. Nevertheless, it is recommended to use the individual observations for computing the mean whenever possible.
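
    A small sketch of the grouped-mean formula, using the class mid-points and frequencies of the table above (names are illustrative):

```python
# (class mid-point Mi, frequency fi) for the four classes in the table.
CLASSES = [(2.25, 3), (4.25, 5), (6.25, 3), (8.25, 4)]


def grouped_mean(classes):
    """Sum of fi * Mi divided by the sum of fi."""
    total_f = sum(f for _, f in classes)
    total_fm = sum(m * f for m, f in classes)
    return total_fm / total_f


mean_grouped = grouped_mean(CLASSES)
print(f"grouped mean = {mean_grouped:.2f}")
```

    Every observation in a class is treated as if it sat at the class mid-point, which is the source of the grouping error discussed above.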


    Interpretation of Mean

    [Figure: Dot plot of the breakdown voltage data (previous slide). Axis from 170 to 220 kV, with the mean = 201.25 marked.]

    - The mean is the balance point (or fulcrum) for the distribution of the values.

    - The mean is analogous to the centre of gravity.

    - In the case of a unimodal and symmetric distribution, the mean also indicates the central tendency of the distribution and may be interpreted as a TYPICAL VALUE.

    - In the above example, the observations are not symmetrically distributed around the mean. The distribution is skewed to the left. Consequently the mean should be interpreted here as a measure of centre or location and not as one of central tendency or typical value.


    Misuse of Mean

    [Figure: a landfill site releasing dioxin. Presently, WHO has classified dioxin as a known human carcinogen.]

    Question: Are the people in the neighborhood of the landfill site safe with respect to exposure to dioxin?

    Data: Dioxin content in the soil samples taken from a large residential area in the neighborhood of the site.

    Answer: Yes, since the average dioxin content in the samples is found to be less than the permissible limit.

    Critique: Individuals are not exposed to average soil levels; they are exposed to the dioxins/furans present in the air they breathe, the food they eat and the water they drink. The higher exposure of residents living in the vicinity of the site is not "averaged out" by the lower exposure of residents ten miles away.


    Properties of Mean

    P1: The sum of the deviations of all the observations from the mean is always zero. In notation, we have

    $\sum_{i=1}^{n} (X_i - \bar{X}) = 0$

    That is, the sum of the negative deviations equals the sum of the positive deviations in magnitude.

    P2: Data Transformation:

    (i) Let $Y_i = X_i \pm k$. Then $\bar{Y} = \bar{X} \pm k$

    (ii) Let $Y_i = k X_i$. Then $\bar{Y} = k \bar{X}$

    (iii) Let $Y_i = X_i / k$. Then $\bar{Y} = \bar{X} / k$

    These three properties are frequently used to reduce the size of the data values, which in turn reduces both the computational load and the chance of error.

    An example follows.


    Properties of Mean (Contd.)

    Example of P2: Data Transformation

    Outer diameter (X) of tubular glass shell (Specification: 37.5 ± 0.8 mm)

    i       Xi      Yi = 37.5 - Xi   Zi = Yi*100
    1       37.46   0.04             4
    2       36.66   0.84             84
    3       37.44   0.06             6
    4       37.85   -0.35            -35
    5       37.36   0.14             14
    6       36.95   0.55             55
    7       37.62   -0.12            -12
    8       36.96   0.54             54
    9       37.12   0.38             38
    10      37.36   0.14             14
    TOTAL                            269 - 47 = 222

    Thus $\bar{Z}$ = 222/10 = 22.2

    Since Yi = Zi/100, using property (iii) of P2 we have $\bar{Y}$ = 22.2/100 = 0.222

    Further, since Xi = 37.5 - Yi, using property (i) of P2 we have $\bar{X}$ = 37.5 - 0.222 = 37.278

    In this case the Zi values are very large because the least count of measurement used is too small. Using a gauge having a least count of 0.1 mm and recording the deviations from the TARGET would have been better.
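
    The transformation route can be checked against the direct mean with a few lines. This is a sketch under the example's own coding (deviation from the 37.5 mm target, scaled by 100); the variable names are illustrative.

```python
TARGET = 37.5  # mm, centre of the specification

# Outer diameters Xi (mm), copied from the table.
X = [37.46, 36.66, 37.44, 37.85, 37.36, 36.95, 37.62, 36.96, 37.12, 37.36]

# Zi = (37.5 - Xi) * 100, rounded to the nearest integer to remove float noise.
Z = [round((TARGET - x) * 100) for x in X]

z_bar = sum(Z) / len(Z)    # mean of the coded values
y_bar = z_bar / 100        # undo the scaling (property iii)
x_bar = TARGET - y_bar     # undo the shift (property i)

direct = sum(X) / len(X)   # mean computed straight from the raw data
print(f"coded route: {x_bar:.3f} mm, direct: {direct:.3f} mm")
```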


    Properties of Mean (Contd.)

    P3: The sum of the squared deviations of a set of observations is minimum when the deviations are taken from the mean of the observations.

    In notation, we have $\sum (X_i - \bar{X})^2 < \sum (X_i - M)^2$ for all $M \ne \bar{X}$

    Implication: Suppose we have production figures for the last twenty days and want to predict the production of the 21st day, assuming the production conditions remain the same. Then the best prediction is the average of the past twenty days' data, provided the loss due to prediction error is proportional to the square of the error.

    P4: The sample mean is more stable than other possible measures of center. We shall see this later.

    P5: The mean is strongly affected by extreme values.

    This is a disadvantage of the mean compared with other measures of center. However, routine trimming of extreme values is not recommended unless the measurements are subjective in nature. Genuine outliers must, of course, be eliminated from the data set.
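
    Properties P1 and P3 can be illustrated numerically with the breakdown voltage data from the arithmetic mean example. The sketch below is a check on one data set, not a proof; the candidate centres M are arbitrary choices.

```python
# Breakdown voltages (kV) from the arithmetic mean example.
VOLTAGE_KV = [
    210, 208, 208, 175, 182, 206, 190, 194, 198, 205, 212, 200,
    205, 202, 207, 210, 202, 201, 188, 205, 209, 201, 216, 196,
]

n = len(VOLTAGE_KV)
mean_kv = sum(VOLTAGE_KV) / n


def sum_sq_dev(values, centre):
    """Sum of squared deviations of the values from a chosen centre."""
    return sum((v - centre) ** 2 for v in values)


# P1: deviations from the mean add up to zero.
sum_dev = sum(v - mean_kv for v in VOLTAGE_KV)
print(f"sum of deviations from the mean = {sum_dev}")

# P3: any other centre M gives a larger sum of squares,
# larger by exactly n * (M - mean)^2.
best = sum_sq_dev(VOLTAGE_KV, mean_kv)
for m in (195.0, 200.0, 203.5, 205.0):
    excess = sum_sq_dev(VOLTAGE_KV, m) - best
    print(f"M = {m}: excess over the minimum = {excess:.2f}")
```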


    Pooled Mean

    Data set        No. of observations   Average                                Total
    Data Set 1      $n_1$                 $\bar{X}_1$                            $n_1 \bar{X}_1$
    Data Set 2      $n_2$                 $\bar{X}_2$                            $n_2 \bar{X}_2$
    . . .
    Data Set k      $n_k$                 $\bar{X}_k$                            $n_k \bar{X}_k$
    All (Pooled)    $\sum n_i$            $\sum n_i \bar{X}_i / \sum n_i$        $\sum n_i \bar{X}_i$

    Example: Process averages in three shifts are found to be 15, 12 and 13 based on 30, 40 and 20 observations respectively. Then the process average for the day is

    (15*30 + 12*40 + 13*20)/(30 + 40 + 20) = 1190/90 = 13.22 [whereas the simple average of the shift averages is (15 + 12 + 13)/3 = 13.33]

    Pooled mean $\sum n_i \bar{X}_i / \sum n_i = \sum \bar{X}_i / k$ (WHEN?)

    Note that the formula for the mean of grouped data is similar to the above.

    A related concept is that of the weighted mean. An application example follows.
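
    A minimal sketch of the pooled mean for the three-shift example, compared with the unweighted average of the shift averages (names are illustrative):

```python
# (number of observations, shift average) for the three shifts.
SHIFTS = [(30, 15), (40, 12), (20, 13)]


def pooled_mean(groups):
    """Sum of ni * mean_i divided by the sum of ni."""
    total_n = sum(n for n, _ in groups)
    total = sum(n * mean for n, mean in groups)
    return total / total_n


pooled = pooled_mean(SHIFTS)
simple = sum(mean for _, mean in SHIFTS) / len(SHIFTS)
print(f"pooled mean = {pooled:.2f}, simple average of averages = {simple:.2f}")
```

    The two agree only when every data set has the same number of observations.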


    Weighted Mean - An Application Example

    [Figure: process flow for glass shells. Labels: Molten glass, Drawing & Cutting, Tube, Blowing, Glass shells, OD Sensor, Sorting (Reject / Accept), Glazing]

    Assume the total number of shells produced in a shift is 24000. In a particularshift 8% of the shells produced are found to be rejected. We want to estimatethe average outer diameter of all the shells produced in the shift.

    Samples can not be taken before sorting. So 50 shells are randomly selectedfrom each of the two streams ( reject and accept). The average diameter of the50 shells in the reject and accept groups are found to be 37.7 mm and 37.6mm respectively.

    Shift average = Weighted mean of the average of the two streams = ( 0.08 *37.7 + 0.92 * 37.6) / (0.08 + 0.92) = 37.61.
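
    The shift average above is a weighted mean of the two stream averages, with the stream proportions as weights. A short sketch (the function name is illustrative):

```python
def weighted_mean(values, weights):
    """Sum of wi * Xi divided by the sum of wi."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


stream_averages = [37.7, 37.6]   # mm: reject stream, accept stream
stream_shares = [0.08, 0.92]     # proportion of shells in each stream

shift_average = weighted_mean(stream_averages, stream_shares)
print(f"shift average = {shift_average:.2f} mm")
```

    An unweighted average of the two stream means would give 37.65 mm, overstating the influence of the small reject stream.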

    Weighted mean: $\bar{X}_w = \sum w_i X_i / \sum w_i$

    Weighted Mean- An Application Example (Contd.)

    However, it would have been better to take more samples from the reject stream. (WHY?)

    Because of the higher variation expected in this stream.

    Assume 100 shells (instead of 50) were selected from the reject stream and gave the same average (37.7 mm).

    Now the weights are given by wi = pi*N/fi. (WHY?)

    pi = Proportion of the ith category, N = Total sample size, fi = Sample size of the ith category.

    If the samples are selected randomly from the total population, then the number of samples expected in the ith category is pi*N. Since we have selected fi samples, we must compensate for this by a factor of pi*N/fi.

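    In our example, p1 = 0.08, p2 = 0.92, f1 = 100, f2 = 50, N = 150. Thus w1 = (0.08 * 150) / 100 = 0.12 and w2 = (0.92 * 150) / 50 = 2.76. Each weight applies to every observation in its stream, so the 100 reject shells carry a total weight of 100 * 0.12 = 12 and the 50 accept shells 50 * 2.76 = 138. This gives the shift average as (37.7 * 12 + 37.6 * 138) / (12 + 138) = 37.61 mm, the same as the estimate from equal sample sizes; the larger reject sample improves precision, not the weighting.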
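The weighting can be sketched in Python. The sketch assumes that wi = pi*N/fi is a weight attached to each individual observation, so a stream with fi observations carries a total weight of wi*fi; the function and variable names are illustrative.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * X_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


p = [0.08, 0.92]               # proportions of reject and accept streams
stream_means = [37.7, 37.6]    # average OD (mm) found in each stream

# Equal sample sizes: the stream proportions serve directly as weights.
estimate_equal = weighted_mean(stream_means, p)

# Unequal sample sizes: per-observation weights w_i = p_i * N / f_i.
f = [100, 50]
N = sum(f)
w = [pi * N / fi for pi, fi in zip(p, f)]

# Each stream mean stands for f_i observations, each carrying weight w_i.
stream_weights = [wi * fi for wi, fi in zip(w, f)]
estimate_unequal = weighted_mean(stream_means, stream_weights)

print(w, round(estimate_equal, 3), round(estimate_unequal, 3))
```

Under this reading both estimates agree, because the total weight of each stream is proportional to its share pi of the population.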


    Median and Mode

    Median

    Ordinal data: Category containing the (N+1)/2 th case.

    Numerical data: (N+1)/2 th ordered observation when N is odd, and the average of the N/2 th and (N/2)+1 th ordered observations when N is even.

    Can be computed even for open-ended classes at the extremes, provided each of the end classes contains less than 50% of the observations.

    Insensitive to outliers.

    Mode

    Category or value occurring with the greatest frequency.

    Only measure of center for nominal data.

    May not be unique, and is highly sensitive to how the classes or categories are formed.
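A minimal sketch of these properties using Python's standard `statistics` module. The data values are illustrative only.

```python
import statistics

data = [4, 7, 2, 5, 11, 2, 10, 7, 15, 9, 5]    # N = 11 (odd)

median_odd = statistics.median(data)            # the (N+1)/2 = 6th ordered value
median_even = statistics.median([3, 1, 4, 2])   # average of the 2nd and 3rd ordered values

# Replacing the largest value by an extreme one moves the mean but not the median.
with_outlier = [150 if x == 15 else x for x in data]
median_outlier = statistics.median(with_outlier)
mean_shift = statistics.mean(with_outlier) - statistics.mean(data)

# The mode need not be unique: multimode returns every most-frequent value.
modes = statistics.multimode(data)

print(median_odd, median_even, median_outlier, mean_shift, modes)
```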


    Caveat: Don't Trust Centre Alone

    [Figure: a person of height H wading into a lake whose mean depth D is less than H.]

    Statisticians tell the story of people who got themselves drowned by wading into a lake with an average depth of 3 feet.

    [Figure: distributions of marks obtained by students of two schools, School A and School B, with the median of each marked.] Which school is better?

    Mean may not tell you all you need to know. Pay attention to variation as well.


    Standard Deviation

    Standard Deviation is the most important measure of variability in a data set.

    Let {X1, X2, ..., Xn} be a sample data set and X̄ the mean of the observations.

    Variability is measured in terms of the deviations of the observations from the mean. For our data set, the deviations are (Xi - X̄), i = 1, 2, ..., n.

    Next, these deviations are summarized to obtain a single value for reporting variability.

    Recall from property P1 of the mean that the sum of the deviations is always zero. So we cannot summarize by simply taking the average of the deviations.

    The mathematical trick used to get around this difficulty (negative deviations) is to square the deviations and then average the squares. So we compute Σ(Xi - X̄)² / (n - 1). The reason for using (n - 1) instead of n as the divisor will be explained later.

    Finally, the effect of squaring is neutralized by taking the square root of the above average to obtain the quantity called Standard Deviation. So we have

    Sample Standard Deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$


    Computing Standard Deviation

    Steps: Read Xi → Compute deviation (Xi - X̄) → Square (Xi - X̄)² → Mean Σ(Xi - X̄)² / (n - 1) → Root √[Σ(Xi - X̄)² / (n - 1)]

    A Numerical Example

    i        Xi     Xi - X̄    (Xi - X̄)²
    1         4       -3          9
    2         7        0          0
    3         2       -5         25
    4         5       -2          4
    5        11        4         16
    6         2       -5         25
    7        10        3          9
    8         7        0          0
    9        15        8         64
    10        9        2          4
    11        5       -2          4
    Total    77        0        160

    Mean Square Deviation = 160 / (11 - 1) = 160 / 10 = 16

    Root Mean Square Deviation or Standard Deviation = √16 = 4.

    Shorter Method

    Σ(Xi - X̄)² = ΣXi² - (ΣXi)² / n = 699 - 77²/11 = 699 - 539 = 160
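The numerical example can be reproduced with a short sketch that follows the same steps (deviation, square, mean, root) and then checks the shorter method. Variable names are illustrative; `statistics.stdev` is the standard-library function that uses the (n - 1) divisor.

```python
import math
import statistics

x = [4, 7, 2, 5, 11, 2, 10, 7, 15, 9, 5]
n = len(x)
mean = sum(x) / n

# Direct method: sum of squared deviations from the mean.
ss_direct = sum((xi - mean) ** 2 for xi in x)

# Shorter method: sum(Xi^2) - (sum(Xi))^2 / n.
ss_short = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n

s = math.sqrt(ss_direct / (n - 1))    # sample standard deviation
s_library = statistics.stdev(x)       # same quantity from the standard library

print(mean, ss_direct, ss_short, s, s_library)
```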


    Interpretation of Standard Deviation

    Let us be honest. It is not easy to interpret standard deviation.

    Literally speaking, standard deviation is a measure of the closeness of the data values to their mean. However, the difficulty in interpretation arises because the closeness depends on two things: the range of the data values and the distribution of the values within the range.

    [Figure: six dot plots on a common axis from 2 to 9.]

    The six cases have identical mean (= 5.5) and range (= 7). But the maximum s.d. is about twice the minimum. Compare the distributions having s.d. of 2.07 and 2.14.

    Application of Standard Deviation - Rake weight data

    [Figure: dot plots of weight of 24 rakes of coal received during January - June 2002, one for Indigenous and one for Imported coal, on a common axis from 3100 to 3500.]

    Four rakes have been selected randomly from each of the six months. All the rakes in the sample consist of 58 wagons.

    Source        Mean (Ton)    Range (Ton)    Standard deviation, n-1 divisor (Ton)
    Indigenous    3211.5        219.5          68.8
    Imported      3401.8        189.6          40.7

    The ranges of the two distributions do not differ as much as the standard deviations do. We shall see later that the higher variation of indigenous coal implies higher inventory cost.


    Part C

    Population, Sample and Probability Distribution


    Population

    A statistical population is a set of values or attributes of the characteristic(s) of a set of well-defined objects belonging to a specified group and/or period.

                      Example 1         Example 2
    Characteristic    Height            Ash content
    Object            of adult males    in lots of coal
    Group             of India          received in Oct 2010


    Types of Population

    Finite and real

    Infinite and hypothetical

    Ash content in a particular lot can be thought of as an observation from an infinite and hypothetical population of all possible values of ash content.

    Continuous

    Power generated by a power station, a tank of liquid chemical. Such populations need to be suitably discretized for the purpose of measurement.


    Population and Sample

    A (random) sample is a subset of the population obtained in such a manner that each object (unit) of the population (or of a subpopulation) has an equal probability of being included in the subset.

    Samples must be distinguished from specimens. A specimen is merely a convenient subset of the population.

    The purpose of sampling is to draw conclusions about a target population economically and within acceptable limits of error.

                          Population    Sample
    Mean                  μ             X̄
    Standard Deviation    σ             s


    Variation in sample and population

    [Figure: histogram of plate thickness (sample values) with its frequency polygon, alongside the probability distribution of thickness (for the population).]

    Frequency polygon: an estimate of the population distribution.

    Larger sample → more classes → smaller class interval → smoother frequency polygon, closer to the population distribution.

    In the case of a hypothetical population, the distribution of a characteristic in the population will never be known. A Normal distribution is frequently assumed for the population distribution.


    Discrete Probability Distribution

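    [Figure: sample space with six outcomes, 1 to 6; bar chart of p(x) against x with every bar at height 1/6.]

    Sample space: {1, 2, 3, 4, 5, 6}

    Random variable X takes values x = 1, 2, 3, 4, 5, 6

    P(X = x) = p(x), e.g. P(X = 1) = p(1) = 1/6

    p(x): Probability mass function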


    Continuous Probability Distribution

    [Figure: measurement of diameter; the sample space is the interval from x1 to x2, with the curve f(x) plotted against x between x1 and x2.]

    Random variable X takes values x1 ≤ x ≤ x2

    F(x) = P(X ≤ x): Probability distribution function

    f(x) = F'(x): Probability density function

    f(x) does not give the probability of X = x


    Bernoulli and Hypergeometric

    [Figure: sample space with two outcomes, coded x = 0 and x = 1.]

    P(x=0) = p(0) = 0.8

    P(x=1) = p(1) = 0.2

    X follows Bernoulli Distribution having parameter p = 0.2

    [Figure: a sample of n = 3 drawn from a population with N = 10 and d = 2; possible outcomes x = 0, x = 1, x = 2.]

    X follows Hypergeometric Distribution with parameters N=10, n=3 and d=2

    p(0) = ?, p(1) = ?, p(2) = ?


    Hypergeometric Distribution

    $P(x = r) = \dbinom{N-d}{n-r} \dbinom{d}{r} \Big/ \dbinom{N}{n}$

    N=10, n=3, d=2, where aCb denotes the number of combinations of b items out of a

    P(x=0) = (8C3) * (2C0) / (10C3) = (56 * 1) / 120 = 0.467

    P(x=1) = (8C2) * (2C1) / (10C3) = (28 * 2) / 120 = 0.467

    P(x=2) = (8C1) * (2C2) / (10C3) = (8 * 1) / 120 = 0.067

    p(0) + p(1) + p(2) = 0.467 + 0.467 + 0.067 = 1
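The three probabilities can be verified with `math.comb` from the Python standard library. The helper name `hypergeom_pmf` is an illustrative assumption.

```python
from math import comb


def hypergeom_pmf(r, N, n, d):
    """P(x = r) when n items are drawn without replacement from N items, d of which are of interest."""
    return comb(N - d, n - r) * comb(d, r) / comb(N, n)


probs = [hypergeom_pmf(r, N=10, n=3, d=2) for r in range(3)]
total = sum(probs)

for r, pr in enumerate(probs):
    print(f"p({r}) = {pr:.3f}")
```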


    Binomial Distribution

    [Figure: samples of n = 3 drawn from a population with p = 0.2; possible outcomes x = 0, x = 1, x = 2, x = 3.]

    X follows Binomial Distribution with parameters n=3 and p=0.2

    Hypergeometric: finite population, sampling without replacement

    Binomial: infinite population OR sampling with replacement

    Binomial Distribution: distribution of the no. of defectives in samples drawn from a process under control (p = constant)


    Computing Binomial Probability

    n = 3, p = 0.2

    p(0) = 3C0 * (0.2)^0 * (0.8)^(3-0) = 1 * 1 * 0.512 = 0.512

    p(1) = 3C1 * (0.2)^1 * (0.8)^(3-1) = 3 * 0.2 * 0.64 = 0.384

    p(2) = 3C2 * (0.2)^2 * (0.8)^(3-2) = 3 * 0.04 * 0.8 = 0.096

    p(3) = 3C3 * (0.2)^3 * (0.8)^(3-3) = 1 * 0.008 * 1 = 0.008

    p(0) + p(1) + p(2) + p(3) = 0.512 + 0.384 + 0.096 + 0.008 = 1.000
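The same four values follow from a direct implementation of the binomial formula. This is a sketch; the helper name `binom_pmf` is illustrative.

```python
from math import comb


def binom_pmf(r, n, p):
    """P(x = r) = nCr * p^r * (1 - p)^(n - r)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


n, p = 3, 0.2
probs = [binom_pmf(r, n, p) for r in range(n + 1)]

for r, pr in enumerate(probs):
    print(f"p({r}) = {pr:.3f}")
print(f"sum = {sum(probs):.3f}")
```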


    Poisson Distribution

    As an approximation to Binomial probability
        Small p (say < 0.1)
        Large n

    As a distribution in its own right
        Count of defects/unit
        Infinite opportunities of occurrence
        Rare events: accidents, flaws in cloth, instances of power outages, absenteeism in large organizations, no. of production stoppages

    [Figure: defects scattered over a unit.] Many opportunities and a maximum of 1 defect per opportunity

    Defects are randomly distributed: defect rate constant and proportional to area, no location preference
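How close the Poisson approximation is to the binomial for small p and large n can be seen in a short sketch. The values n = 100 and p = 0.02 are hypothetical, chosen only for illustration, and the Poisson mean is taken as n*p.

```python
from math import comb, exp, factorial


def binom_pmf(r, n, p):
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


def poisson_pmf(r, lam):
    return exp(-lam) * lam ** r / factorial(r)


n, p = 100, 0.02    # hypothetical: large n, small p
lam = n * p         # Poisson mean matched to the binomial mean

# Largest absolute difference between the two probabilities for r = 0..10.
max_gap = max(abs(binom_pmf(r, n, p) - poisson_pmf(r, lam)) for r in range(11))

for r in range(5):
    print(r, round(binom_pmf(r, n, p), 4), round(poisson_pmf(r, lam), 4))
```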


    Poisson Probability

    $P(x = r) = p(r) = \dfrac{e^{-\lambda} \lambda^{r}}{r!}, \quad r = 0, 1, 2, \ldots$

    Example: The no. of errors in bills raised by the billing department follows a Poisson distribution. The mean error rate per bill is 0.5. A bill is selected at random. What is the probability that the bill will contain (i) exactly two errors, (ii) at most two errors and (iii) at least two errors?

    (i) λ = 0.5, p(2) = exp(-0.5) * (0.5)^2 / 2! = 0.6065 * 0.25 / 2 = 0.076

    (ii) p(≤ 2) = p(0) + p(1) + p(2) = 0.6065 + 0.3033 + 0.076 = 0.986, where

    p(0) = exp(-0.5) * (0.5)^0 / 0! = 0.6065 * 1 / 1 = 0.6065

    p(1) = exp(-0.5) * (0.5)^1 / 1! = 0.6065 * 0.5 / 1 = 0.3033

    (iii) p(≥ 2) = 1 - p(≤ 1) = 1 - p(0) - p(1) = 1 - 0.6065 - 0.3033 = 0.09
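The billing example can be checked numerically. This sketch writes the mean error rate as `lam`; the helper name `poisson_pmf` is illustrative.

```python
from math import exp, factorial


def poisson_pmf(r, lam):
    """P(x = r) = exp(-lam) * lam^r / r!"""
    return exp(-lam) * lam ** r / factorial(r)


lam = 0.5    # mean number of errors per bill

exactly_two = poisson_pmf(2, lam)
at_most_two = sum(poisson_pmf(r, lam) for r in range(3))
at_least_two = 1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)

print(round(exactly_two, 3), round(at_most_two, 3), round(at_least_two, 2))
```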


    Normal Distribution

    [Figure: bell-shaped curve of f(x) against x, with the inflection point marked.]

    Symmetric, unimodal, bell shaped

    Range: -∞ to +∞

    Area under curve = 1

    Also known as Gaussian distribution

    Arises naturally in many physical, biological and social measurements

    Non-normal ≠ Abnormal

    All cases are approximations only: most measurements are non-negative


    Normal Characteristics - Examples

    "THE NORMAL LAW OF ERROR STANDS OUT IN THE EXPERIENCE OF MANKIND AS ONE OF THE BROADEST GENERALIZATIONS OF NATURAL PHILOSOPHY. IT SERVES AS THE GUIDING INSTRUMENT IN RESEARCHES IN THE PHYSICAL AND SOCIAL SCIENCES AND IN MEDICINE, AGRICULTURE AND ENGINEERING. IT IS AN INDISPENSABLE TOOL FOR THE ANALYSIS AND INTERPRETATION OF THE BASIC DATA OBTAINED BY OBSERVATION AND EXPERIMENT" - W. J. Youden

    Machined dimensions

    Fill volume/weight

    Colour density

    Wear-out failure time

    Germination at a given ageing

    Height of Indian tribals

    No. of single girls in a bar (1 - 2 P.M)

    Return from a diversified portfolio


    Central Limit Theorem

    The distribution of an average (X-bar) or a sum (ΣX) of many independent observations tends to be normal, irrespective of the distributional form of X.

    Many statistical procedures are based on the assumption of Normality. CLT acts as a safeguard for the validity of such applications.

    [Figure: aggregation of numerous small but independent random events. In this case eight events, each producing a small random displacement either to the left or to the right.]
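The eight-displacement picture can be worked out exactly rather than simulated. The sketch below assumes each event moves one unit left or right with equal probability (an assumption made here for illustration), so the number of right moves follows a binomial distribution with n = 8 and p = 0.5.

```python
from math import comb

steps = 8    # number of independent left/right displacements

# k right moves and (steps - k) left moves give a final position of 2k - steps.
distribution = {2 * k - steps: comb(steps, k) / 2 ** steps for k in range(steps + 1)}

# Crude text histogram: one '#' per 1/256 of probability, scaled down by 4.
for position, prob in sorted(distribution.items()):
    print(f"{position:+3d}  {prob:.4f}  {'#' * round(prob * 64)}")
```

The probabilities peak at the centre and fall away symmetrically, which is the bell shape that the aggregation of small independent effects tends towards.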


    Normal Density Function

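    $f(x) = \dfrac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}, \quad -\infty < x < \infty$

    Mean = μ, Variance = σ²

    [Figure: Normal curve with shaded areas to the left of x, between x1 and x2, and to the right of x.]

    P(X < x) = F(x)

    P(x1 < X < x2) = F(x2) - F(x1)

    P(X > x) = 1 - F(x)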


    Normal Probability

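    [Figure: Normal curve f(x) against x with the areas within 1, 2 and 3 standard deviations of the mean marked.]

    μ ± σ: 68.27%

    μ ± 2σ: 95.45%

    μ ± 3σ: 99.73%

    Popularly known as the 68-95-99.73 rule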
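The three percentages can be recovered from the error function in Python's `math` module, using the standard identity P(|Z| < k) = erf(k / √2) for a standard Normal variable Z. The function name is illustrative.

```python
from math import erf, sqrt


def within_k_sigma(k):
    """Probability that a Normal variable lies within k standard deviations of its mean."""
    return erf(k / sqrt(2))


coverage = {k: within_k_sigma(k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print(f"within {k} sigma: {100 * prob:.2f}%")
```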

    Standard Normal Distribution

    [Figure: Normal curves on an X axis mapped to a Z axis marked -3, -2, -1, 0, 1, 2, 3; the standard Normal variable Z has μ = 0 and σ = 1.]

    $f(z) = \dfrac{1}{\sqrt{2\pi}} \, e^{-z^2/2}$

    Z - Transform
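The Z-transform is conventionally defined as Z = (X - μ) / σ, which converts a Normal variable with mean μ and standard deviation σ into the standard Normal variable. A minimal sketch, with hypothetical values of μ and σ chosen only for illustration:

```python
def z_transform(x, mu, sigma):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


def inverse_z(z, mu, sigma):
    """Convert a z value back to the original scale."""
    return mu + z * sigma


mu, sigma = 50.0, 5.0    # hypothetical mean and standard deviation

z = z_transform(60.0, mu, sigma)
x_back = inverse_z(z, mu, sigma)

print(z, x_back)
```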

    Standard Normal Table

    z      0.00       0.01       . .    0.09
    0.0    0.50000    0.50399    . .    0.53586
    0.1    0.53983    0.54379    . .    0.57534
    . .    . .        . .        . .    . .
    1.0    0.84134    0.84375    . .    0.86214
    . .    . .        . .        . .    . .
    2.0    0.97725    0.97778    . .    0.98169
    . .    . .        . .        . .    . .
    3.0    0.99865    0.99869    . .    0.99900
    . .    . .        . .        . .    . .
    3.9    0.99995    0.99995    . .    0.99997

    [Figure: standard Normal curve with the tabulated area, to the left of z, shaded.]

    Other tables may give probabilities between 0 and z > 0: be careful.

    Tables giving probabilities for negative values of z are convenient but are not essential.

    P (z > -2.01) = ?

    P (-1.09 < z < 2) = ?

    P (-0.19 < z < -0.01) = ?
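The three probabilities can be computed without a table using `statistics.NormalDist` from the Python standard library, whose `cdf` method returns the area to the left of z. Variable names are illustrative.

```python
from statistics import NormalDist

phi = NormalDist().cdf    # cumulative probability of the standard Normal (mean 0, sd 1)

p_greater = 1 - phi(-2.01)                # P(z > -2.01), equal to phi(2.01) by symmetry
p_between = phi(2) - phi(-1.09)           # P(-1.09 < z < 2)
p_negative = phi(-0.01) - phi(-0.19)      # P(-0.19 < z < -0.01)

print(round(p_greater, 5), round(p_between, 5), round(p_negative, 5))
```

By symmetry, phi(-z) = 1 - phi(z), which is why a table for positive z alone is sufficient.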

    Normal Distribution - Exercise

    The specification on viscosity of a chemical produced by a batch process is given as 16.5 ± 2.5. Viscosities of 10 consecutive batches produced in the immediate past are given below:

    14.8, 15.6, 16.9, 17.0, 14.9, 15.6, 14.5, 15.2, 15.7, 14.2

    (a) Assuming viscosity follows a Normal distribution, find the expected rejection percent of batches. [Ans: 6.2%]

    (b) Note that none of the 10 sample batches is rejected. Still, is there any cause for concern?
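Part (a) can be sketched as follows, assuming the population mean and standard deviation are estimated by the sample mean and the sample standard deviation (n - 1 divisor) of the ten batches. Variable names are illustrative.

```python
from statistics import NormalDist, mean, stdev

viscosity = [14.8, 15.6, 16.9, 17.0, 14.9, 15.6, 14.5, 15.2, 15.7, 14.2]

lsl, usl = 16.5 - 2.5, 16.5 + 2.5    # lower and upper specification limits

x_bar = mean(viscosity)
s = stdev(viscosity)                 # sample standard deviation, n - 1 divisor
fitted = NormalDist(x_bar, s)

below = fitted.cdf(lsl)              # expected fraction below the lower limit
above = 1 - fitted.cdf(usl)          # expected fraction above the upper limit
reject_percent = 100 * (below + above)

print(round(x_bar, 2), round(s, 4), round(reject_percent, 1))
```

Almost all of the expected rejection comes from the lower limit, because the sample mean sits well below the target of 16.5.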

    Normal Probability Plotting

    Purpose: To examine (based on sample data) whether the population distribution is Normal or not.

    Method:

    Rank the sample observations from smallest to largest (R = 1, 2, ..., n). Try to have n > 25.

    Compute the observed relative cumulative frequency F(x) = (R - 0.5)/n [or F(x) = R/(n + 1)] for each x, where R is the rank of observation x.

    Plot [x, F(x)] on Normal probability paper.

    If the points fall approximately along a straight line, then the underlying distribution can be considered as Normal.
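The method can be sketched without probability paper by converting each F(x) into a standard Normal quantile and checking how close the (x, quantile) pairs are to a straight line. This swaps the paper for `NormalDist().inv_cdf`, and the use of a correlation coefficient as a measure of straightness is an assumption made here for illustration, not part of the method as stated.

```python
from math import sqrt
from statistics import NormalDist, mean

viscosity = [14.8, 15.6, 16.9, 17.0, 14.9, 15.6, 14.5, 15.2, 15.7, 14.2]
n = len(viscosity)

ordered = sorted(viscosity)                                  # rank R = 1 is the smallest
positions = [(rank - 0.5) / n for rank in range(1, n + 1)]   # F(x) = (R - 0.5) / n
z_scores = [NormalDist().inv_cdf(F) for F in positions]      # Normal quantile of each F(x)


def pearson_r(a, b):
    """Correlation coefficient between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / sqrt(saa * sbb)


straightness = pearson_r(ordered, z_scores)

for x, F, z in zip(ordered, positions, z_scores):
    print(f"{x:5.1f}  {F:4.2f}  {z:6.3f}")
```

A value of `straightness` close to 1 corresponds to points lying near a straight line on the probability plot.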

    NPP - Example

    Viscosity data: Specification 16.5 ± 2.5

    14.8, 15.6, 16.9, 17.0, 14.9, 15.6, 14.5, 15.2, 15.7, 14.2

    Observation (x)    Rank (R)    (R - 0.5)/10
    14.2               1           0.05
    14.5               2           0.15
    14.8               3           0.25
    14.9               4           0.35
    15.2               5           0.45
    15.6               6           0.55
    15.6               7           0.65
    15.7               8           0.75
    16.9               9           0.85
    17.0               10          0.95

    [Figure: Probability Plot of Viscosity, Normal - 95% CI. X axis: Viscosity (12 to 19); Y axis: Percent (1 to 99). Legend: Mean 15.44, StDev 0.9348, N 10, AD 0.359, P-Value 0.374.]