Histograms and SPC


Quality Control

The histogram is one of the seven basic tools of quality control, used to summarize, display and analyze process data. Karl Pearson (1857–1936) introduced it as a way of showing the probability distribution of a continuous variable.

The derivation of the word “histogram” is uncertain. Sometimes it is said to be derived from the Greek “histos,” meaning “anything set upright” (as the masts of a ship, the bar of a loom, or the vertical bars of a histogram), and “gramma,” i.e., “drawing, record, writing.” It is also said that Karl Pearson derived the name from “historical diagram.”

A histogram consists of tabular frequencies, shown as adjacent rectangles, erected over discrete intervals, with an area equal to the frequency of the observations in the interval. The height of a rectangle is equal to the frequency density of the interval, i.e., the frequency divided by the width of the interval. The total area of the histogram is equal to the number of data points. A histogram may also be normalized to display relative frequencies. It then shows the proportion of cases that fall into each of several categories, with the total area equaling 1. The categories are usually specified as consecutive, non-overlapping intervals of a variable. The categories (intervals) must be adjacent, and are often chosen to be of the same size. The rectangles of a histogram are drawn so that they touch each other to indicate that the original variable is continuous.
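The relationship between frequency, frequency density and area can be checked with a few lines of code. The sketch below is only an illustration: the measurements, the interval limits and the helper name `histogram` are made up for the example, and the last interval is deliberately wider than the others to show why height and frequency are not the same thing.

```python
import math

def histogram(data, edges):
    """Count observations in adjacent intervals [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            # The last interval also includes its upper limit.
            if edges[i] <= x < edges[i + 1] or (is_last and x == edges[-1]):
                counts[i] += 1
                break
    return counts

# Hypothetical measurements and interval limits, for illustration only.
data = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.6, 5.0, 7.5]
edges = [1.0, 2.0, 3.0, 4.0, 8.0]

counts = histogram(data, edges)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

# Height of each rectangle = frequency density = frequency / width.
heights = [c / w for c, w in zip(counts, widths)]

# Normalized version: divide every height by the number of data points.
rel_heights = [h / len(data) for h in heights]

total_area = sum(h * w for h, w in zip(heights, widths))
rel_area = sum(h * w for h, w in zip(rel_heights, widths))

print("frequencies:", counts)
print("heights:    ", heights)
print("total area: ", total_area, " normalized area:", rel_area)
```

The wide last interval holds three readings but is drawn only 0.75 units high, because its area, not its height, carries the frequency.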

The ordinary histogram shows the number of data points in each interval, so that, when the intervals are of equal width, the height of each bar is proportional to the share of the total data that falls into that category. The total area of the bars represents the total number of data points. The ordinary histogram in Figure 1 shows absolute numbers, with frequencies in the thousands.

In Figure 1, the histogram on the right differs from the one on the left in that it shows the data cumulatively, so that the final bar reaches 100% of the data. The curve displayed is a simple density estimate.

In other words, a histogram represents a frequency distribution by means of rectangles whose widths represent class intervals and whose areas are proportional to the corresponding frequencies. The intervals are placed together in order to show that the data represented by the histogram, while exclusive, is also continuous. (For example, in a histogram it is possible to have two connecting intervals of 10.5–20.5 and 20.5–33.5, but not two connecting intervals of 10.5–20.5 and 22.5–32.5. Empty intervals are represented as empty and not skipped.)

Histograms are used to plot the density of data, and often for density estimation, that is, estimating the probability density function of the underlying variable. The total area of a histogram used for probability density is always normalized to 1. If the length of each interval on the x-axis is 1, such a histogram is identical to a relative frequency plot.

Figure 1 gives examples of ordinary and cumulative histograms of the same data. The data shown is a random sample of 10,000 points from a normal distribution with a mean of 0 and a standard deviation of 1.

The Histogram as a Measurement of Process Consistency

36 | metalfinishing | September 2012 | www.metalfinishing.com

Figure 1. Both histograms use the same data; the difference is in how the data is presented. [Panels: ordinary histogram (left) and cumulative histogram (right). x-axis: "rnorm (1000)", from -4 to 4. y-axis: frequency.]
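A sample like the one described can be generated and tabulated both ways with the Python standard library. This is a sketch, not a reconstruction of how the published figure was produced: the seed, the one-unit cells from -5 to 5 and the variable names are all assumptions made for the example.

```python
import math
import random
from itertools import accumulate

random.seed(2012)  # arbitrary seed so that the run is repeatable
sample = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Cells one unit wide from -5 to 5 (an illustrative choice of limits).
edges = list(range(-5, 6))
counts = [0] * (len(edges) - 1)
for x in sample:
    if edges[0] <= x < edges[-1]:
        # floor(x) is the lower limit of the cell that contains x
        counts[math.floor(x) - edges[0]] += 1
    # Readings beyond +/-5 (very rare) are simply left out of this tally.

# Cumulative histogram: each bar is the running total of the ordinary bars.
cumulative = list(accumulate(counts))

for lo, c, cum in zip(edges, counts, cumulative):
    print(f"{lo:>3} to {lo + 1:>3}: {c:>5} {cum:>6}")
```

The last cumulative bar equals the number of readings tallied, which is why the cumulative panel climbs to the full sample size while the ordinary panel peaks near the mean.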

SHAPE OR FORM OF A DISTRIBUTION

The shape of a histogram provides important information about the data distribution. The histogram may be highly or moderately skewed to the left or right. A symmetrical shape is also possible, although a histogram is never perfectly symmetrical. If the histogram is skewed to the left, or negatively skewed, the tail extends further to the left.

The mode of a distribution is the value that occurs most frequently or has the largest probability of occurrence. The sample mode occurs at the peak of the histogram.

For many phenomena, it is quite common for the distribution of the response values to cluster around a single mode (unimodal) and then distribute themselves with lesser frequency out into the tails. The normal distribution is the classic example of a unimodal distribution.

The histogram shown in Figure 2 illustrates data from a bimodal (two-peak) distribution. The histogram serves as a tool for diagnosing problems such as bimodality. Questioning the underlying reason for distributional non-unimodality frequently leads to greater insight and improved deterministic modeling of the phenomenon under study. For example, for the data presented in Figure 2, the bimodal histogram is caused by a lack of uniformity in the data, that is, the sample does not come from a single homogeneous population.
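Once the cell frequencies are tabulated, the modal cell and any additional peaks can be located mechanically. The sketch below uses made-up frequencies loosely resembling a two-peak histogram; the function name `local_peaks` is hypothetical, and the rule it applies (a cell taller than both neighbors) is a crude check that would flag spurious peaks in noisy data with narrow cells.

```python
def local_peaks(counts):
    """Return the indices of cells that are taller than both neighbors."""
    peaks = []
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else 0
        right = counts[i + 1] if i < len(counts) - 1 else 0
        if c > left and c > right:
            peaks.append(i)
    return peaks

# Hypothetical cell frequencies, for illustration only.
counts = [20, 150, 600, 1400, 700, 300, 650, 1200, 500, 60]

modal_cell = counts.index(max(counts))  # cell holding the sample mode
peaks = local_peaks(counts)

print("modal cell:", modal_cell)
print("peaks at cells:", peaks)
print("unimodal" if len(peaks) == 1 else f"{len(peaks)} peaks: check for mixed data")
```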

An example of a distribution skewed to the left might be the relative frequency of exam scores. Most of the scores are above 70 percent and only a few low scores occur. An example of a distribution skewed to the right, or positively skewed, is a histogram showing the relative frequency of housing values. A relatively small number of expensive homes create the skewness to the right, and the tail extends further to the right. In a symmetrical distribution, the left and right tails mirror each other; the histogram of IQ scores is an example. Histograms can be unimodal, bimodal or multimodal, depending on the dataset.
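The direction of skew can also be read from a single number. One common choice, assumed here because the column does not name a formula, is the moment coefficient of skewness: the third central moment divided by the 1.5 power of the second. The exam scores and home values below are invented to mimic the two examples in the text.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical data, for illustration only.
exam_scores = [95, 92, 90, 88, 88, 85, 84, 82, 80, 78, 75, 55, 40]
home_values = [120, 135, 140, 150, 155, 160, 170, 180, 200, 450, 900]  # $1000s
symmetric = [1, 2, 3, 4, 5]

print("exam scores:", round(skewness(exam_scores), 2))  # negative: tail to the left
print("home values:", round(skewness(home_values), 2))  # positive: tail to the right
print("symmetric:  ", round(skewness(symmetric), 2))
```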

A truncated histogram ends abruptly at one end, which indicates possible sorting or inspection of non-conforming parts. This may also mean that part of the distribution has been removed by screening, 100% inspection or review. Such practices are usually costly and are good candidates for improvement efforts.

Plateau Histograms. A nearly flat or plateau-like histogram often means that the process is not well defined or understood by those doing the work or inspection. Since individuals run the process in different ways, there are a great many different measurements and none that stand out. The solution is to more clearly define the process and/or piece part parameters.

The plateau might be called a “multimodal distribution.” Several processes with normal distributions are combined. Because there are many peaks close together, the top of the distribution resembles a plateau.

Number of cells and width. There is no “best” number of cells, and different cell sizes can reveal different features of the data. Some theoreticians have attempted to determine an optimal number of cells, but these methods generally make strong assumptions about the shape of the distribution. Depending on the actual data distribution and the goals of the analysis, different cell widths may be appropriate, so experimentation is usually needed to determine an appropriate width. There are, however, various useful guidelines and rules of thumb.

Figure 2. Bimodal Histogram. [x-axis: -6 to 4. y-axis: frequency, 0 to 1500.]

Figure 3. Skewed Histograms. [Panels: positive skewed, negative skewed.]

Most engineers favor setting the number of cells somewhere between 11 and 17, but always an odd number. The latter point is important so that the mid-point of the distribution is not split between two cells. It is also a good rule, when using measurement data, to set the cell limits halfway between possible readings, that is, one decimal place beyond the most precise data. Consider what happens where one cell is 4 to 8 and the next cell is 8 to 12. A reading of 8 could fall in either cell, hence the rule.
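The two rules of thumb can be turned into a small helper. This is a sketch under stated assumptions: the function name `cell_limits`, its parameters and the sample range of readings (whole numbers from 4 to 47) are invented for the example, and the cell width is rounded up to a whole multiple of the measurement resolution so that every limit stays halfway between possible readings.

```python
import math

def cell_limits(lowest, highest, n_cells, resolution):
    """Return n_cells + 1 limits offset by half the measurement resolution."""
    if n_cells % 2 == 0:
        raise ValueError("use an odd number of cells")
    start = lowest - resolution / 2
    span = (highest + resolution / 2) - start
    # Width in whole resolution steps, rounded up so all readings are covered.
    steps = math.ceil(round(span / resolution, 9) / n_cells)
    width = steps * resolution
    return [start + i * width for i in range(n_cells + 1)]

# Hypothetical readings recorded to the nearest whole unit, from 4 to 47.
limits = cell_limits(lowest=4, highest=47, n_cells=11, resolution=1)
print(limits)

# A reading of 8 now belongs to exactly one cell (7.5 to 11.5).
cell_of_8 = [(lo, hi) for lo, hi in zip(limits, limits[1:]) if lo < 8 < hi]
print(cell_of_8)
```

With limits at 3.5, 7.5, 11.5 and so on, no whole-number reading can land on a boundary, which removes the ambiguity in the "4 to 8, 8 to 12" example.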

Kurtosis. In probability theory and statistics, kurtosis (derived from the Greek word meaning “bulging”) is any measure of the “peakedness” of the probability distribution of a real-valued random variable. In a similar way to the concept of skewness, kurtosis is a descriptor of the shape of a probability distribution and, just as for skewness, there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population.

One common measure of kurtosis, originating with Karl Pearson, is based on a scaled version of the fourth moment of the data or population, but it has been argued that this measure really measures heavy tails, and not peakedness. For this measure, higher kurtosis means more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations. It is common practice to use an adjusted version of Pearson’s kurtosis, the excess kurtosis, to provide a comparison of the shape of a given distribution to that of the normal distribution. Distributions with negative or positive excess kurtosis are called platykurtic or leptokurtic distributions, respectively. When a curve, or histogram, is compared to a normal distribution, a platykurtic data set has a flatter peak around its mean and thinner tails.
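Pearson's measure and its excess form are short enough to compute directly. The sketch below uses the population (uncorrected) moments: the fourth central moment divided by the square of the second, minus 3 so that a normal distribution scores zero. The two data sets are invented to give one flat and one sharply peaked, heavy-tailed example.

```python
import math

def excess_kurtosis(values):
    """Pearson's kurtosis m4 / m2**2, minus 3 (the value for a normal curve)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

# Hypothetical data, for illustration only.
flat = list(range(1, 11))          # evenly spread readings: plateau-like
peaked = [0] * 20 + [-10, 10]      # most readings at the mean, two extremes

for name, data in (("flat", flat), ("peaked", peaked)):
    k = excess_kurtosis(data)
    label = "platykurtic" if k < 0 else "leptokurtic" if k > 0 else "mesokurtic"
    print(f"{name}: excess kurtosis {k:.2f} ({label})")
```

In the peaked set nearly all of the variance comes from the two extreme readings, which is the "infrequent extreme deviations" reading of high kurtosis described above.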

Leptokurtic describes a distribution in which the excess kurtosis is positive. Leptokurtic distributions have higher peaks around the mean compared to normal distributions.

The Japanese scientist Genichi Taguchi argued that the goal of manufacturing should not be simply to produce product within the specification; rather, the goal should be to produce product as close to nominal as possible. He argued that any deviation from nominal has a cost.

There isn’t space in this column to fully explain this idea; suffice it to say that a leptokurtic distribution will produce superior product. There is a greater difference between a part produced near the statistical design limit in a process producing a platykurtic distribution and one with a leptokurtic distribution.
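The idea that any deviation from nominal has a cost is usually expressed through Taguchi's quadratic loss function, in which the loss grows with the square of the distance from nominal. The column does not give the formula, so the sketch below is an assumption-laden illustration: the nominal value, the cost constant `k`, the specification limits and both sets of parts are hypothetical.

```python
def quadratic_loss(y, nominal, k):
    """Taguchi-style loss: k times the squared deviation from nominal."""
    return k * (y - nominal) ** 2

# Hypothetical values, for illustration only.
nominal, k = 10.0, 4.0            # target dimension and cost constant
lower, upper = 9.5, 10.5          # specification limits

clustered = [9.9, 10.0, 10.0, 10.1, 10.0, 9.95, 10.05]   # peaked around nominal
spread = [9.5, 9.7, 10.0, 10.3, 10.5, 9.6, 10.4]         # flat across the limits

def average_loss(parts):
    return sum(quadratic_loss(y, nominal, k) for y in parts) / len(parts)

for name, parts in (("clustered", clustered), ("spread", spread)):
    in_spec = all(lower <= y <= upper for y in parts)
    print(f"{name}: all in spec = {in_spec}, average loss = {average_loss(parts):.4f}")
```

Both sets pass inspection against the limits, yet the parts clustered near nominal carry a much smaller average loss.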

The Taguchi Principle is the basis upon which Six Sigma theory and practice are built.

BIO

Leslie W. Flott, Ph.B., CQE, ASQ Fellow, is certified as an IDEM Wastewater Treatment Operator and Indiana Wastewater Treatment Operator. He received his Bachelor of Science degree in chemistry from Northwestern University and his Master's degree in materials engineering from Notre Dame University. Most recently, Flott served as the environmental program director and instructor at Ivy Tech Community College. Prior to that, he was the health, environment, and safety manager at Wayne Metal Protection Company.

Figure 4. Truncated, or cliff-like, Histogram.

Figure 5. Plateau-like Histogram.

Figure 6. Illustration of Kurtosis. [Panels: platykurtic, leptokurtic.]