
Journal of Nondestructive Evaluation, Vol. 2, Nos. 3/4, 1981

Treatment of False Calls in Evaluating Nondestructive Inspection Proficiency

Henry Sharp, Jr.¹ and William H. Sproat²

Received July 27, 1981; revised January 18, 1982

A test designed to evaluate nondestructive flaw detection proficiency by ultrasonic and eddy current methods has been conducted. Test data for each participant are reduced to contingency table format, and analysis of these tables via traditional measures of association is used to rank individual performances. The presence of false calls in an inspection record introduces a random inflation in flaw detection results. Measures are required which compensate for that fact. Proficiency rankings based on two such measures, Somers's d coefficient and the mean square contingency coefficient, show reasonable consistency with intuition-based rankings by experienced observers. Finally, a weighted generalization of Somers's d is suggested, although optimal selection of weights remains an open problem.

KEY WORDS: nondestructive inspection; flaw detection proficiency; performance ranking; false calls.

1. INTRODUCTION

A number of efforts to determine the reliability of nondestructive methods designed to detect flaws in materials, components, and assemblies of aerospace engineering systems have been conducted over the past decade. The impetus for these efforts has been the emergence of fracture mechanics criteria which specify the acceptable flaw sizes and detection probabilities for fracture critical parts. Programs to assess detection reliabilities were first conducted under laboratory conditions, then in a production situation, and finally in flight maintenance environments. The laboratory programs were aimed at answering the question, "How reliable are the nondestructive methods such as eddy current, ultrasonic, radiographic, etc.?" As evaluations moved into the production sphere, the variability in detection attributed to human factors began to surface. Concern shifted from specific capability of the method per se to the relative capabilities of the individuals who apply particular nondestructive methods.

¹Department of Mathematics, Emory University, Atlanta, Georgia 30322.
²Lockheed-Georgia Company, Marietta, Georgia.

Questions bearing on the effectiveness of technicians working in the arena of nondestructive inspection (NDI) were raised in earlier contractual work performed by the Lockheed-Georgia Company for the Air Force Logistics Command. This extensive program (1974-1978) to measure flaw detection reliability in aircraft maintenance environments revealed a broad spectrum of individual proficiencies which had no apparent relationship with degree of training, level of inspection activity, or years of experience.(1) Consequently, in 1979, the Air Force initiated a program of individual proficiency measurements through the administration of practical examinations. The contract covering the work reported here entailed the development of both hardware and methodology in support of that program. Participants were assigned inspection tasks which involved fatigue crack detection in simulated structures composed of fastened splice elements. Eddy current bolt-hole and ultrasonic shear wave scans were performed to detect cracks of varying sizes at the fastener sites. The proficiencies of the individuals were then assessed on the basis of their ability both to detect the flaws and to avoid incorrect flaw identifications (false calls).

2. STATEMENT OF THE PROBLEM

The decision-making process is subject to errors of commission (these are the false calls) and errors of omission (these are the missed flaws). The proficiency measure itself is subject to error in that some flaws may be found by chance instead of by genuine skill. To obtain a reasonable approximation of true proficiency, the ranking instrument must compensate for this chance contribution. The incorporation of false call data in the evaluation of flaw detection capability, therefore, is essential, yet it has received little attention in the literature. It is possible, of course, to attain a 100% detection score on a practical examination by marking every site as flawed. Conversely, a perfect performance would involve 100% finds and no false calls. Real-world performance lies on a broad spectrum between these two extremes, varying from many false calls (due perhaps to ineptitude or guesswork) to few false calls (arising perhaps from conservative judgment, which consequently is prone to miss marginally detectable flaws). How to strike a rational balance in weighing these inspection characteristics is a difficult problem, which depends primarily upon considerations of safety and economics. Specifically, what measure should be used in assessing inspectional competence, and what impact should false calls have on that measure? It is our main purpose here to stimulate the active consideration of such questions in the hope of generating, eventually, a significant pool of data and experience from which an acceptable, comprehensive statistical treatment might evolve.

3. DATA BASE AND FORMAT

The data upon which this study is based were collected by Lockheed-Georgia with the cooperation of the NDI team at one Air Force base. Standard test items (simulated structure with fatigue cracks) were prepared, each consisting of about 180 inspection sites of which roughly one in six were flawed. (This ratio was unknown to the inspection team.) In various configurations, these were inspected by each of 30 participants using, separately, eddy current and ultrasonic equipment. Raw data obtained from each inspection included: (a) marked sites, (b) inspection time, (c) type of equipment used, and (d) configuration used.

Data from each inspection were tabulated giving the number of flaws detected within several crack ranges, the number of missed flaws, and the number of false calls. We are not concerned here with a full-scale presentation of these data, nor with their analysis. Details appear in the Air Force documentation of the study.(2) We will use the data here in simplest (hit/miss) form to illustrate the application of certain classical techniques to the problem.

Assume a test configuration of n inspection sites (fastener holes) of which c₁ are flawed. Denote the set of flawed sites by F and the complementary set of nonflawed sites by F̄. The inspection consists of n decisions, one at each site, and the inspection record designates a set of r₁ sites suspected of being flawed. Let these marked sites be denoted by M and the set of unmarked sites by M̄. Of course, the number of elements in F̄ is n − c₁ = c₂ and in M̄ is n − r₁ = r₂. The data are condensed into a typical fourfold contingency table as follows:

$$
\begin{array}{c|cc|c}
 & F & \bar{F} & \\
\hline
M & a_{11} & a_{12} & r_1 \\
\bar{M} & a_{21} & a_{22} & r_2 \\
\hline
 & c_1 & c_2 & n
\end{array}
\qquad (1)
$$

In this table, aᵢⱼ represents a count of the sites possessing the indicated characteristics. Note that a₁₁ counts the number of finds, a₁₂ counts the number of false calls, a₂₁ counts the number of no finds (missed flaws), while a₁₁ + a₂₂ counts the number of correct calls. We may use a standard chi-square calculation to test for independence of effects; the test statistic is

$$
\chi^2 = \frac{n \Delta^2}{r_1 r_2 c_1 c_2} \qquad (2)
$$

where Δ = a₁₁a₂₂ − a₁₂a₂₁ is the determinant of the given 2 × 2 matrix. Under the assumption of independence, this statistic has an approximate chi-square distribution with one degree of freedom. The chi-square value at the 90th percentage point (0.90 level) is 2.71, hence we can be reasonably confident that a test statistic greater than 2.71 corresponds to an inspection performance in which "marks" and "flaws" are associated. Two tables in our data base, representative of high and low performance, are:

$$
\underset{(a)}{\begin{array}{c|cc|c}
 & F & \bar{F} & \\
\hline
M & 29 & 14 & 43 \\
\bar{M} & 3 & 126 & 129 \\
\hline
 & 32 & 140 & 172
\end{array}}
\qquad
\underset{(b)}{\begin{array}{c|cc|c}
 & F & \bar{F} & \\
\hline
M & 6 & 20 & 26 \\
\bar{M} & 26 & 120 & 146 \\
\hline
 & 32 & 140 & 172
\end{array}}
\qquad (3)
$$

The computed statistics for the above examples are 90.3 and 0.40, yielding almost complete (a) and almost negligible (b) confidence in asserting a positive association between "marks" and "flaws." A more careful examination of the inspection record in (b) is instructive. If the 26 marked sites were truly selected at random, then the expected number of marked flaws would be 4.8 with standard deviation 1.8 (using the hypergeometric distribution). Furthermore, although this is close to an extreme among our cases, it is by no means an outlier. Among the 60 performance records comprising our data, an appreciable number of others also are at a level barely (if at all) distinguishable from random marking. Should this observation be representative of on-line inspection capability, it seems clear that the flaw detection reliability assumed in Military Standards is far too optimistic. Much additional data are needed, but in our opinion, those which already are in hand point inescapably to the desirability of establishing standardized tests, certification, and periodic reexamination.
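As a concrete illustration (our own sketch, in Python; the function names are ours and no code appears in the original study), the statistic (2) can be evaluated for the two records in (3), along with the hypergeometric mean and standard deviation quoted for record (b):

```python
import math

def chi_square_2x2(a11, a12, a21, a22):
    """Test statistic (2): n * Delta^2 / (r1 * r2 * c1 * c2)."""
    n = a11 + a12 + a21 + a22
    r1, r2 = a11 + a12, a21 + a22        # marked / unmarked totals
    c1, c2 = a11 + a21, a12 + a22        # flawed / nonflawed totals
    delta = a11 * a22 - a12 * a21        # determinant of the 2 x 2 table
    return n * delta ** 2 / (r1 * r2 * c1 * c2)

print(round(chi_square_2x2(29, 14, 3, 126), 1))   # record (a): 90.3
print(round(chi_square_2x2(6, 20, 26, 120), 2))   # record (b): ~0.40

# Random-marking baseline for record (b): 26 marks drawn without
# replacement from 172 sites, 32 of which are flawed (hypergeometric).
marks, sites, flawed = 26, 172, 32
mean = marks * flawed / sites
var = mean * (1 - flawed / sites) * (sites - marks) / (sites - 1)
print(round(mean, 1), round(math.sqrt(var), 1))   # 4.8 and 1.8
```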

4. MEAN SQUARE CONTINGENCY

The chi-square test described in the preceding section identifies those inspection records supporting a decision to reject the hypothesis of independence of effects. Larger calculated values of the chi-square statistic yield greater confidence in this decision; thus it seems reasonable to base a ranking of inspection records on this statistic. As pointed out in Mosteller,(3) this value is strongly influenced by the size of the data base. Hence, within a given controlled experiment (fixed n), chi-square values are acceptable for ranking purposes. But for broader application, a more satisfactory measure is provided by the mean square contingency*

$$
\phi^2 = \chi^2 / n
$$

which may be calculated by

$$
\phi = \frac{\Delta}{\sqrt{r_1 r_2 c_1 c_2}} \qquad (4)
$$

Table I lists a typical sample of inspection records and coefficients representing a spectrum of ability: two each from the high, mid, and low ranges. The lowest record of each type identifies fewer than the expected number of flaw detections under random marking. Both eddy current and ultrasonic data are included, but not necessarily belonging to the same inspectors. The number of eddy current inspection sites is different from the number of ultrasonic inspection sites, and not all test configurations contain the same number of flaws. Note that a coefficient of less than 0.1 corresponds roughly to the acceptance of the hypothesis of independence at less than the 80th percentile. The ranking suggested by these mean square contingency coefficients matches an independent intuition-based ranking with reasonable accuracy; however, φ has several drawbacks. Under a finer partitioning of inspection sites and responses, e.g., by crack size and signal intensity, a contingency table larger than 2 × 2 will be required. In such cases (not considered here), Cramér's modification

$$
C = \sqrt{\frac{\chi^2 / n}{\min(r - 1,\ c - 1)}}, \qquad
\begin{array}{l}
r = \text{number of rows} \\
c = \text{number of columns}
\end{array}
\qquad (5)
$$

may be the most appropriate. The more serious objection, however, to this type of ranking instrument arises from an inability to assign it a probabilistic interpretation. We seek, therefore, an alternative measure of association which utilizes the inspection data as a predictor of probabilities. A number of such measures have been proposed, many of which are discussed in Goodman and Kruskal(4) and Everitt.(5) Ideal, perhaps, would be the construction of a new measure explicitly tailored to fit NDI parameters.

*Either positive or negative associations are possible, depending on the sign of the determinant Δ.
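For completeness, a sketch of both coefficients (ours, in Python, with our own function names): φ from (4), and Cramér's modification (5) for a general r × c table; in the 2 × 2 case C reduces to |φ|.

```python
def phi_2x2(a11, a12, a21, a22):
    """Mean square contingency (4); carries the sign of Delta."""
    r1, r2 = a11 + a12, a21 + a22
    c1, c2 = a11 + a21, a12 + a22
    return (a11 * a22 - a12 * a21) / (r1 * r2 * c1 * c2) ** 0.5

def cramer_c(table):
    """Cramer's modification (5) for an r x c contingency table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = sum(
        (obs - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )
    return ((chi2 / n) / min(len(table) - 1, len(table[0]) - 1)) ** 0.5

print(round(phi_2x2(29, 14, 3, 126), 2))         # 0.72, second EC row of Table I
print(round(cramer_c([[29, 14], [3, 126]]), 2))  # 0.72 as well (= |phi| in 2 x 2 case)
```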


Table I. Inspection Records for Eddy Current and Ultrasonic Methods

Method    n    M∩F   M∩F̄   M̄∩F   M̄∩F̄   (χ²/n)^(1/2)    d
EC       172    26     2     6    138       0.84        0.80
EC       172    29    14     3    126       0.72        0.81
EC       172    18    13    10    131       0.53        0.55
EC       172    22    51     6     93       0.32        0.44
EC       172     6    20    26    120       0.05        0.05
EC       172    10    58    18     86       neg         neg
US       188    22     3     8    155       0.77        0.71
US       188    19     2    11    156       0.72        0.62
US       188    21    16    13    138       0.50        0.52
US       188    12    16    22    138       0.27        0.25
US       188    11    48    17    112       0.07        0.09
US       188     1    12    29    146       neg         neg

5. OPERATIONALLY DEFINED MEASURES

We illustrate a possible approach to measures of this kind by considering first a very simple model. The quality of inspection is determined, of course, by success in distinguishing flawed from nonflawed sites. Thus, referring to (1), we may define a measure by

$$
R = \frac{(a_{11} + a_{22}) - (a_{12} + a_{21})}{n} \qquad (6)
$$

which can be interpreted as the "net" correct call probability. In effect, this formula weights calls equally so that bad decisions cancel out a like number of good decisions. Observe, however, that for purposes of ranking, we need only consider

$$
R' = \frac{a_{11} + a_{22}}{n} \qquad (7)
$$

for it follows immediately that if R₁ < R₂ then R′₁ < R′₂, and this same relationship persists even in the case that R₁ and R₂ are based upon different sized populations.

Despite (and, perhaps, because of) its simplicity, the model has an undesirable characteristic. Consider the following tables:

$$
\begin{array}{cc|c}
a_{11} & a_{12} & \\
a_{21} & a_{22} & \\
\hline
c_1 & c_2 & n_1
\end{array}
\qquad
\begin{array}{cc|c}
k a_{11} & a_{12} & \\
k a_{21} & a_{22} & \\
\hline
k c_1 & c_2 & n_2
\end{array}
\qquad (8a)
$$

In each case, the same proportions of both flaws and nonflaws have been detected, yet R′₁ ≠ R′₂ (unless a₂₂ = c₂a₁₁/c₁). For example, consider

$$
\begin{array}{cc}
1 & 4 \\
3 & 2
\end{array}
\quad R'_1 = 0.3
\qquad \text{and} \qquad
\begin{array}{cc}
10 & 4 \\
30 & 2
\end{array}
\quad R'_2 = 0.26
\qquad (8b)
$$

The following example shows that even restricting attention to a given sized population fails to alleviate the problem:

$$
\begin{array}{cc}
15 & 30 \\
25 & 30
\end{array}
\quad R'_1 = 0.45
\qquad \text{and} \qquad
\begin{array}{cc}
30 & 10 \\
50 & 10
\end{array}
\quad R'_2 = 0.4
\qquad (8c)
$$
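The defect is easy to exhibit numerically. A brief sketch (ours, in Python) computes R′ from (7) together with the per-column detection proportions for the two records of (8b):

```python
def r_prime(a11, a12, a21, a22):
    """Correct-call rate (7): (a11 + a22) / n."""
    return (a11 + a22) / (a11 + a12 + a21 + a22)

def column_proportions(a11, a12, a21, a22):
    """Fraction of flaws found and fraction of nonflaws correctly passed."""
    return a11 / (a11 + a21), a22 / (a12 + a22)

# The two records of (8b): same column proportions, different R'.
for rec in [(1, 4, 3, 2), (10, 4, 30, 2)]:
    print(column_proportions(*rec), round(r_prime(*rec), 2))
# (0.25, 0.333...) 0.3   versus   (0.25, 0.333...) 0.26
```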

The previous, unsatisfactory model relied on the partition F ∪ F̄ of the set of all inspection sites. A possible way to overcome the difficulty so encountered is through a model based on the product space F × F̄, and there is in fact an intuitive attractiveness about this idea. Since we are attempting to assess the ability to distinguish flawed from nonflawed sites, we examine the inspection response to each opportunity of making such a distinction. That is, we quantify the total information provided for all pairs in

$$
\pi = F \times \bar{F} = \{(x, y) : x \in F,\ y \in \bar{F}\} \qquad (9)
$$

Although many of the traditional measures of association are based upon this type of analysis, we shall be concerned only with one, which seems to us a more acceptable ranking instrument than the mean square contingency coefficient.

This measure of association was introduced in 1962 by R. H. Somers(6) (it is usually referred to as Somers's d). Although designed for use with ordinal variables, it will suit our purpose if we agree to assign corresponding orders to our variables; e.g., F̄ < F and M̄ < M. We think of the relationship between flaws and marks as directional: our expectation is that the existence of a flaw influences the placement of a mark. For each point (x, y) in π, the inspection response is one of four types: (M, M̄), (M, M), (M̄, M), or (M̄, M̄). It seems reasonable to assume that high ability is suggested by a preponderance of responses in the same order as point coordinates, namely (M, M̄). Similarly, we would expect relatively few responses of type (M̄, M). The set of points in π corresponding to (M, M̄) is called concordant and the set corresponding to (M̄, M) is called discordant. The remaining types are called tied on the dependent variable (i.e., both coordinates are marked or both are unmarked). These sets are illustrated below:

$$
\begin{array}{c|c|c}
 & x \in M & x \in \bar{M} \\
\hline
y \in M & \text{tied: } A_3 = a_{11}a_{12} & \text{discordant: } A_2 = a_{12}a_{21} \\
\hline
y \in \bar{M} & \text{concordant: } A_1 = a_{11}a_{22} & \text{tied: } A_4 = a_{21}a_{22}
\end{array}
$$

$$
|\pi| = c_1 c_2 = (a_{11} + a_{21})(a_{12} + a_{22}) = a_{11}a_{12} + a_{12}a_{21} + a_{11}a_{22} + a_{21}a_{22}
$$

The number of points in C (the concordant set) is a₁₁a₂₂, the number of points in D (the discordant set) is a₁₂a₂₁, the number of sites in F is c₁, and the number of sites in F̄ is c₂. The determinant of the contingency table counts the net excess of concordant points over discordant points. The version of Somers's d coefficient specific to the 2 × 2 case is obtained by averaging this excess over π:

$$
d = \frac{a_{11}a_{22} - a_{12}a_{21}}{c_1 c_2} = \frac{\Delta}{c_1 c_2} \qquad (10)
$$

We note that the similarity between this expression and (6) is even more striking when (10) is expressed in the alternate form,

$$
d = \frac{a_{11}}{c_1} - \frac{a_{12}}{c_2} \qquad (11)
$$

The ratio aᵢⱼ/cⱼ may be interpreted as an estimator of the conditional probability of inspection response i, given site characteristic j. Thus d is an estimator of the difference in (conditional) probabilities of a detected flaw versus a false call. In like language, R is an estimator of the difference in probabilities of a correct call versus an incorrect call. But the objection raised against R is not valid in the present case, for it follows easily from (10) that proportionate column entries leave d invariant.
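As a quick numerical check (ours, in Python), the two forms (10) and (11) agree:

```python
a11, a12, a21, a22 = 29, 14, 3, 126          # record (a) of (3)
c1, c2 = a11 + a21, a12 + a22
print((a11 * a22 - a12 * a21) / (c1 * c2))   # form (10): ~0.806
print(a11 / c1 - a12 / c2)                   # form (11): same value
```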

There is a very simple relationship between d and φ:

$$
d = \sqrt{\frac{r_1 r_2}{c_1 c_2}}\ \phi \qquad (12)
$$

Again, refer to Table I for the list of d coefficients and the representative sample of inspection records. As with φ, Somers's d seems to be reasonably consistent with an intuitive ranking of the inspection records. In comparison, (a) d seems to be more forgiving of false calls than φ, and (b) φ seems to be more forgiving of missed flaws than d. It is because of this, along with its probabilistic interpretation, that we prefer the coefficient d to φ (among traditional measures).
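Translating (10) and (12) into code (our sketch, in Python), using the first eddy current record of Table I:

```python
def somers_d(a11, a12, a21, a22):
    """Somers's d in the 2 x 2 case, formula (10): Delta / (c1 * c2)."""
    c1, c2 = a11 + a21, a12 + a22
    return (a11 * a22 - a12 * a21) / (c1 * c2)

a11, a12, a21, a22 = 26, 2, 6, 138               # first EC record of Table I
print(round(somers_d(a11, a12, a21, a22), 2))    # 0.80

# Check relation (12): d = sqrt(r1*r2 / (c1*c2)) * phi
r1, r2 = a11 + a12, a21 + a22
c1, c2 = a11 + a21, a12 + a22
phi = (a11 * a22 - a12 * a21) / (r1 * r2 * c1 * c2) ** 0.5
print(round((r1 * r2 / (c1 * c2)) ** 0.5 * phi, 2))  # 0.80 again
```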

6. GENERALIZATIONS AND ALTERNATIVES

As evident from the derivation of Somers's d coefficient (in the 2 × 2 case), the set π = F × F̄ is partitioned into three subsets, one of which does not appear explicitly in the formulation. In effect, we are assigning weights as follows:

Subset        Weight
Concordant      1
Discordant     −1
Tied            0

(13)

To generalize this procedure, partition the tied subset into the sets corresponding to inspection responses (M, M) and (M̄, M̄), and assign weights as follows:

Subset   Weight
A₁         w₁
A₂         w₂
A₃         w₃
A₄         w₄

(14)

where A₁ is the concordant set, A₂ is the discordant set, and

$$
A_3 = \{(x, y) : x \in M,\ y \in M\}, \qquad A_4 = \{(x, y) : x \in \bar{M},\ y \in \bar{M}\}
$$

Now we define a generalized coefficient, corresponding to (10), by

$$
\delta = \frac{1}{|\pi|} \sum_{i=1}^{4} |A_i| w_i \qquad (15)
$$
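A sketch of (15) (ours, in Python). With the weights of (13) it reduces to Somers's d, and, anticipating the rescaling shown in (16) below, proportional rescaling of the columns leaves it unchanged:

```python
def weighted_coefficient(a11, a12, a21, a22, w=(1, -1, 0, 0)):
    """Generalized coefficient (15): sum of |A_i| * w_i over |pi|."""
    sizes = (a11 * a22,   # |A1|, concordant
             a12 * a21,   # |A2|, discordant
             a11 * a12,   # |A3|, tied (both marked)
             a21 * a22)   # |A4|, tied (both unmarked)
    pi = (a11 + a21) * (a12 + a22)   # |pi| = c1 * c2
    return sum(s * wi for s, wi in zip(sizes, w)) / pi

print(round(weighted_coefficient(29, 14, 3, 126), 2))  # 0.81 = Somers's d
k, m = 3, 5   # rescale the columns proportionately, as in (16)
print(round(weighted_coefficient(k * 29, m * 14, k * 3, m * 126), 2))  # still 0.81
```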

Two inspection records in which detection ratios are the same, for example,

$$
\begin{array}{cc}
a_{11} & a_{12} \\
a_{21} & a_{22}
\end{array}
\qquad \text{and} \qquad
\begin{array}{cc}
k a_{11} & m a_{12} \\
k a_{21} & m a_{22}
\end{array}
\qquad (16)
$$

yield identical coefficients δ. In general, this is a condition we would wish to impose upon any acceptable ranking instrument.

The weight assignment indicated in (13), which places equal importance on the correct identification both of flaws and nonflaws, is not necessarily optimal, but in the absence of further information, it offers a reasonable indicator of inspection competence. A detailed risk/benefit analysis based upon considerations of safety and economics may be the best guide to an appropriate selection of weights or ranges and weights.

ACKNOWLEDGMENT

The authors are grateful to the referee for a number of very helpful comments on the initial version of this paper. This work was supported in part by National Science Foundation Industrial Research Participation Grant No. SP17907400 to Lockheed-Georgia Company.

REFERENCES

1. W. H. Lewis, W. H. Sproat, B. D. Dodd, and J. M. Hamilton, Reliability of nondestructive inspections, U.S. Air Force Logistics Command Report No. SA-ALC/MME 76-6-38-1 (1978).

2. W. H. Sproat, Measurement of NDI technician proficiency on C-5 wing hot spot inspections, Lockheed-Georgia Report No. LG79ER0161 (1979).

3. F. Mosteller, Association and estimation in contingency tables, J. Amer. Statist. Assoc. 63:1-28 (1968).

4. L. A. Goodman and W. H. Kruskal, Measures of Association for Cross Classifications (Springer-Verlag, New York, 1979).

5. B. S. Everitt, The Analysis of Contingency Tables (John Wiley & Sons, New York, 1977).

6. R. H. Somers, A new asymmetric measure of association for ordinal variables, Amer. Sociol. Rev. 27:799-811 (1962).