
Earwitness Testimony 2: Voices, Faces and Context

SUSAN COOK* and JOHN WILDING
Department of Psychology, Royal Holloway, University of London

SUMMARY

Two factors relevant to voice recognition were investigated in the study reported here: the effect on memory for a voice of the presence of a face or personal information about the speaker, and the effects of the re-presentation of this information as contextual cues at test. Recognition memory for the briefly heard voice of a stranger was superior in conditions where the face of the speaker was absent. Presence of the additional contextual cues at test had no effects on recognition performance. Theoretical and forensic applications of the findings are discussed in terms of face-recognition models and witness line-up design. The sensitivity of voice memory measures to the different types of experimental design is also considered. © 1997 John Wiley & Sons, Ltd.

Appl. Cognit. Psychol. 11: 527–541 (1997)
No. of Figures: 0 No. of Tables: 1 No. of References: 29

INTRODUCTION

Detailed study of voice memory has been carried out during the last 20 years (see Clifford, 1983, and Yarmey, 1995, for reviews), and researchers are starting to ask more general questions as to how voice recognition relates to person recognition. Just how would a voice memory be affected by the presence of the speaker's face? Given that performance on memory tasks for known and unknown voices is not identical (Cook and Wilding, in press), how does knowing something about a speaker affect memory for their voice? The work reported here addresses the effects of the addition of face and personal information on memory for voices in a psycho-legal setting.

Initial studies of contextual effects on voice memory by the authors (Cook and Wilding, unpublished work) had not found an effect of context on memory for a once-heard voice when the context took the form of a framing sentence spoken by another unknown speaker, but had found a marginally significant facilitation with re-presentation of the same words spoken as those originally heard, compared with a novel sentence. Faces and personal information seemed more likely to form context that would be integrated with the voice, and there was earlier work that had found facilitation of voice memory by reinstatement of face context with large stimulus sets. What would happen in a voice lineup, therefore, when the context was in the form of a face or personal information?

CCC 0888–4080/97/060527–15 $17.50 Received 14 August 1996; Accepted 12 February 1997
© 1997 John Wiley & Sons, Ltd.

The authors thank the schools and colleges who provided participants in this experiment. Most importantly, thanks also go to John Valentine for his statistical advice.
*Correspondence to: Susan Cook, Psychology Dept., Royal Holloway, University of London, Egham Hill, Egham, Surrey TW20 0EX.
Contract grant number: R00429434069

Face processing models

What does consideration of the theoretical models of person recognition suggest about person recognition by voice? Bruce and Young (1986) suggested a model whereby there are three sequential steps to identifying an individual from his or her face.

1. Face Recognition Units store descriptions of an individual's features. There is a separate FRU for every familiar face, which is activated when a recognizable view of the person is presented.

2. Activation triggers a Person Identity Node that allows access to personal information, e.g. 'is a politician'. The same PIN can be activated by a person's face, voice, and spoken or written name (or by distinctive characteristics such as Churchill's cigar and hat).

3. Name generation.

Burton, Bruce and Johnston (1990) produced a broadly similar but more interactive model, the interactive activation and competition (IAC) model. PINs became nodes that allow access to personal information rather than holding the information themselves; the decision about whether a person is familiar is made at the PIN. As the same PIN can be activated by a person's face, voice or name, there must be some theoretical Voice Recognition Units that accomplish person recognition from the auditory perceptual level and allow access to the PIN level. The models make no explicit prediction that Face Recognition Units compete with Voice Recognition Units, but rather seem to imply a mutual facilitation. It seems from the models, therefore, that voice memory should not suffer impaired encoding from the presence of a face at the time of learning. Such models are explicitly aimed at recognizing known faces, but it is still by no means clear what distinguishes familiar from unfamiliar people in cognitive terms. An initial attempt to include voice recognition in the theory of person recognition, in a way similar to that in which person recognition mediated by name has been investigated (Valentine, Brennen and Bredart, 1996), therefore seemed appropriate.
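The shared-PIN architecture that these models describe can be sketched in a few lines of code. The following toy is purely illustrative: the class names, activation strengths and the familiarity threshold are invented stand-ins, not parameters taken from Bruce and Young or Burton et al.

```python
class PersonIdentityNode:
    """PIN: the point at which the familiarity decision is made (IAC)."""
    def __init__(self, name, facts):
        self.name = name
        self.facts = facts          # accessed via the PIN, not stored in it
        self.activation = 0.0

    def receive(self, amount):
        self.activation += amount

    def is_familiar(self, threshold=1.0):
        return self.activation >= threshold


class RecognitionUnit:
    """An FRU, or a hypothesized VRU, feeding its person's PIN."""
    def __init__(self, modality, pin):
        self.modality = modality
        self.pin = pin

    def perceive(self, strength):
        self.pin.receive(strength)  # excitation of the PIN, not competition


churchill = PersonIdentityNode("Churchill", ["is a politician"])
fru = RecognitionUnit("face", churchill)
vru = RecognitionUnit("voice", churchill)
fru.perceive(0.6)                   # face and voice feed the same PIN,
vru.perceive(0.6)                   # so the models imply mutual facilitation
```

On this sketch, a face plus a voice can only push the shared PIN further towards familiarity, which is why the models as they stand do not predict interference between face and voice information.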

Studies using face and voice

There are few studies that look at memory for face and voice together, even though a very common occurrence for hearing sighted people must be both to hear and to see the perpetrator of a crime. Indeed, our contacts with the UK Dorset Police confirmed in personal correspondence that this is the most common way in which voice-recognition testimony arises. The predicted effect on memory of having a dual code depends on which theoretical literature one chooses to draw on. Legge, Grossmann and Pieper (1984) found that black and white photographs present at time of learning led to similar levels of voice identification to those obtained when the photographs were absent. Their task, involving two-alternative forced-choice recognition of twenty voices, was very different from the ones being used in the current work.

More akin to the current study, in his 1986 work manipulating levels of illumination, Yarmey predicted that subjects would selectively allocate more attention to the voice of an assailant as their ability to see him declined. The study used slides, at four different levels of filter (to simulate daylight through to night vision), of a simulated attack and rape, accompanied by tape-recorded speech of the 'assailant' and the 'victim' conversing. The task was to select the voice of the male 'assailant' from a lineup of five after a lag of 15 minutes. No effect of lighting condition on voice recognition in the predicted direction (i.e. accuracy increasing with decreasing daylight) was found for either correct recognitions or false alarms. This study could be criticized on the grounds of the use of static slides as stimuli. The design also implied that selective attention of this type was under voluntary, or at least logical, control; this implication is made explicit by Armstrong and McKelvie (1996), although, as with Yarmey (1986), such logical control is not always apparent in the data. Yarmey's study was, however, far more forensically plausible, and more in keeping with the applied aims of the current study, than was the work of Legge et al.

In a later study, McAllister, Dale, Bregman, McCabe and Cotton (1993a) suggested a rather different theoretical prediction for the interplay of voice and face information. These authors predicted that visual information may interfere with voice recognition, and also that voice information may interfere with face recognition. They used photographs and tape-recorded voices as their stimuli, with a 1-minute inspection time. They did not find an effect of having heard a voice, as well as seen a photograph, on the accuracy of selecting the target from a photo lineup (presented sequentially). Repeating the experiment with a voice lineup produced a decrease in accuracy in subjects who had seen the photo as well as hearing the voice, compared with those who had just heard the voice. The time lag was 5 minutes in this experiment. These authors did acknowledge, however, that they had selected their foils on the grounds of visual rather than vocal similarity to the suspect. This could be problematic, as the construction of a fair lineup requires the use of broadly similar voices (Hollien, 1990). They also used only a single suspect, which may be problematic given the unanswered question of what constitutes a distinctive voice (Yarmey, 1991). There is also a problem with directly comparing performance where all of the information available at encoding is available at retrieval (voice only) with performance when only one of the two modalities is available at test (voice and face at encoding, voice only at test). Armstrong and McKelvie (1996) criticize McAllister et al. on just these grounds.

As a result of the conflicting evidence between Yarmey (1986) and McAllister et al. (1993a), and in view of the importance of the result of adding face to voice for the person-recognition line of enquiry, the experiment reported here manipulated face and voice information. The prediction was that having the face present at exposure would affect voice memory at test.

Voices and contextual reinstatement

Earlier unpublished work (Cook and Wilding) had failed to find an effect of voice as context for voice, and had found only a marginally significant effect of the words spoken as context for voice. It seemed at least possible that the wrong sort of information was being provided to obtain forensically useful levels of contextual facilitation. Some relevant research has been carried out on what features of a context are important in determining its efficacy.

Interactive versus independent context effects

Somewhat similar to McGeoch's (1932) idea of internal and external contexts, Baddeley (1982) formed the idea of independent and interactive contexts. An independent context was one that the subject did not associate with the stimulus, and an interactive context was one where he or she did make that link. An independent context may make the memory trace more accessible, but would not be integrated with that trace in such a way as to affect its meaning. An interactive context would affect the meaning of the trace itself to the subject. Baddeley gives the example that the word COLD with GROUND as its context could affect the way in which COLD is stored; conversely, hearing the word COLD when a person is drunk may affect the accessibility of the trace but should not affect its interpretation. This is a very subject-centred (subjective, even) understanding of different types of context. Defining context is problematic. A recent paper on the subject provided the following definition, which really is very wide indeed:

Context information is usually thought of as information that is present in the processing environment, either at encoding or at retrieval, and that is peripheral or incidental to the cognitive task being performed. (Murnane and Phelps, 1995, p. 158)

Defining what will form an independent context and what will form an interactive one is even less satisfactory. There is a risk of falling into a circularity of argument, whereby 'x' must be an independent context because it has produced the results that are hypothesized as being produced by an independent context. Even so, it seemed possible that another voice or the words spoken were not sufficiently integrated with the target voice in the minds of the subjects to form an effective context for retrieval.

Context and faces

Early attempts to use faces as contextual cues for faces were not uniformly successful (e.g. Bower and Karlin, 1974), although Watkins, Ho and Tulving (1976) did obtain a positive effect of context, as measured by correct recognitions of 36 pairs of faces. Later work with faces showed that the more integrated the context was with the faces in the minds of the subjects, the more recognition was enhanced by re-presenting that context at test. In accordance with this, Baddeley and Woodhead (1982) paired faces randomly with jobs and found no beneficial effect on facial recognition of presenting the job again at test time. Conversely, it was argued that posing the person against a church would lead the subject to make suppositions about them at encoding that would then make the church a helpful context, enhancing performance at recognition. Winograd and Rivers-Bulkeley (1977) found that asking subjects to rate pairs of faces for marital compatibility produced quite a large advantage at recognition test of re-presenting the same pairs over pairing faces with new faces. Keeping the pairs the same at presentation and at test produced more facilitation than if the original task had been to assess the individual friendliness of faces.


This was argued to be an independent-context task, whereas the marital-compatibility rating was supposed to promote interactive context. From the faces literature, therefore, it seemed worth pursuing the idea of a more integrated context, such as the face of the person, to facilitate memory for the voice of the person.

Verbal labels as context for faces

Watkins et al. (1976) found that rather larger context effects than those obtained simply by presenting faces in pairs could be obtained by presenting some sort of verbal history with the picture and then using that as context at test time (e.g. 'keeps tropical fish'; 'is a civil rights activist'). This produced 84% correct recognition at test when the labels were kept consistent and 71% when they were swapped. Kerr and Winograd (1982) replicated this and showed that the effect was the same whether one or two phrases accompanied the picture, suggesting that it was the presence rather than the quantity of biographical information that was critical. These findings suggested that verbal labels might form an effective integrated context when presented with voices.

Faces as context for voices

There is a little earlier work that has manipulated contextual reinstatement of faces when testing voice memory. Legge et al. (1984) used black and white photographs of faces as context as subjects listened to voices. Their design was very different from those in this paper, as they used a two-alternative forced-choice recognition test with multiple voice targets. Their data seem to show facilitation of voice recognition when the face was present at both presentation and test, compared with conditions where the face was either never present or else present only at test; unfortunately no statistic is reported, so it is not clear what significance level attaches to the data. Armstrong and McKelvie (1996) also report higher correct recognition levels for voices identified when the face present at learning was re-presented as contextual reinstatement, and their data are clearly statistically significant. Their experimental design was similar to Legge et al.'s. Would contextual reinstatement facilitate voice recognition in a lineup in a similar way to reinstatement with forced-choice recognition pairs?

It was felt that, if such a thing as an interactive context for a once-heard voice existed, face information or other personal information ought to form such an interactive context. It was hypothesized that personal information and face should both enhance voice-memory performance if re-presented at test, as compared to a condition where they were present at presentation but not at test.

In view of the unresolved questions about the interaction of face and voice, this study manipulated face, voice, and personal-history information in such a way as to be of potential use forensically. The specific questions addressed were:

(1) Was voice recognition enhanced or hindered by presentation of a face simultaneously with the voice?

(2) Was voice recognition enhanced or hindered by presentation of personal information simultaneously with the voice?

(3) Was voice recognition enhanced by presentation of personal information or a face at learning and again as contextual reinstatement at test?


(4) Did presenting the maximum number of cues as contextual reinstatement at test (face and personal information) produce different levels of voice-recognition accuracy than presenting just the face cue?

METHOD

Design

A between-subjects method was used. Subjects each served in one of the seven conditions for this study. Each subject heard two voices, one male and one female, and was tested on two auditory lineups, one for each sex of speaker. No target-absent lineups were used, and consequently only correct recognitions and misses could be recorded.

1. Voice only. These subjects heard the voice materials presented after an orienting task of reading two statements of general knowledge. After 1 week these subjects were tested on two voice-only lineups.

2. Voice and face only. These subjects heard the voice materials presented after an orienting task of reading two statements of general knowledge, and whilst looking at a video of the faces of the speakers talking. After 1 week these subjects were tested on two voice-only lineups.

3. Voice and face with context. These subjects heard the voice materials presented after an orienting task of reading two statements of general knowledge, and whilst looking at a video of the faces of the speakers talking. After 1 week these subjects were tested on two voice-only lineups, with still photographs of the speakers to reinstate the context.

4. Voice and personal information. These subjects heard the voice materials presented after an orienting task of reading one statement about each speaker. After 1 week these subjects were tested on two voice-only lineups.

5. Voice and personal information with context. These subjects heard the voice materials presented after an orienting task of reading one statement about each speaker. After 1 week these subjects were tested on two voice-only lineups, with the written statements about each speaker present as contextual cues.

6. Voice and face and personal information. These subjects heard the voice materials presented after an orienting task of reading one statement about each speaker, and whilst looking at a video of the faces of the speakers talking. After 1 week these subjects were tested on two voice-only lineups.

7. Voice and face and personal information with context. These subjects heard the voice materials presented after an orienting task of reading one statement about each speaker, and whilst looking at a video of the speakers' faces talking. After 1 week these subjects were tested on two voice-only lineups, with still photographs of the speakers and the written information about each speaker both present as contextual cues.


Subjects

Two hundred and ten subjects participated in this study, of whom 127 were male and 83 were female. All were students in local schools and sixth-form colleges, recruited through contacts with various teachers. The age range was 15;2–20;0 (years;months), with a mean age of 17;2. Subjects were tested in small groups of about five by one researcher. None of the subjects knew any of the speakers on the tape personally.

Materials

Audio tapes
Voices of volunteers from within and around the Psychology Department were used. Speakers were all aged between 18 and 25, and none had a marked regional accent. Voices were recorded in a soundproof room. Each speaker was recorded saying all six of the test sentences. Voices were chosen from the available pool by screening with the Handkins and Cross checklist (1985, as reported by Yarmey, 1991). Voices rated as extreme along the key dimensions of pitch or pace of speech were excluded. Four targets were used, two male and two female. Lineups were constructed using the editing facilities on the Pro Audio computer setup and were then transferred to high-quality audio tape. Lineups were of six voices, all of the same gender as the target voice and with mild Southeast English accents, saying the same sentence as had been uttered by the target voice. Fifteen different male voices were used as foils to construct the male lineups, and 15 different female ones to construct the female lineups. The target voice was positioned second, third, fourth or fifth in each lineup (i.e. not first or last) in a pseudo-random manner. The exposure tapes consisted of one male and one female speaker saying one sentence each. For example, one female target said: 'What do you look like? How will we be able to recognize you?' To which the male target responded: 'I am medium height with brown hair and I will wear a long black coat.' Mean sentence length was 15 syllables. The same sentences were used at presentation as at test. No target-absent lineups were prepared, as the effect of target absence on a lineup is well documented (e.g. Bull and Clifford, 1984) and was not of key concern here, and also due to the experimental complexity already inherent in the design.
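The lineup-ordering constraint described above (target in slot 2 to 5 of six) can be sketched as follows. This is a generic reconstruction with placeholder voice labels, not the authors' actual procedure or stimuli:

```python
import random

def build_lineup(target, foils, rng=None):
    """Order a six-voice lineup with the target pseudo-randomly placed
    in position 2-5 (never first or last). Voice identifiers here are
    placeholders, not the study's recordings."""
    if len(foils) != 5:
        raise ValueError("a six-voice lineup needs exactly five foils")
    rng = rng or random.Random()
    order = list(foils)
    rng.shuffle(order)               # randomize the foil order
    slot = rng.choice([2, 3, 4, 5])  # 1-indexed slot for the target
    order.insert(slot - 1, target)
    return order
```

Excluding the first and last positions guards against simple primacy/recency guessing strategies on the part of the listener.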

Video tapes
Colour video recordings were made of four targets (two male and two female) speaking the same sentences as were used for the audio recordings. Two female and two male targets were used, saying six different sentences each, and were fully randomized over the conditions. Still colour photographic reproductions of the speakers were extracted from the video recordings using thermal video imaging technology on a Sony Mavigraph machine.

Personal information
Single-sentence personal information statements were given to ten people to rate for memorability on a five-point scale. A subset of the sentences rated as most memorable was used, presented in a large black-and-white paper format (these included: 'S/he is having an affair' and 'S/he has multiple sclerosis'). Control general knowledge statements of a similar length were generated by the experimenter in the same black-and-white paper format.


Procedure

Testing
Subjects were allocated to one of the seven conditions in a pseudo-random manner, but with roughly equivalent sex ratios in each condition. Subjects were informed that the experiment was a hearing experiment, but were not explicitly told to learn the voices of the speakers. A single experimenter introduced herself to each group of no more than five subjects, and then issued standard test instructions, as follows:

'I would like to carry out a hearing experiment with you today, if that is all right with everyone? Is there anyone who knows that they have a hearing problem? Is there anyone who knows that they will not be here next week? There are two statements (about the people who you are about to hear) on the paper at the front; please can you all read them? Can everyone see them? Now you are going to hear (see) two people speaking. The tape is very brief, so please listen very carefully indeed. Are you ready?'

The video or audio tape was then played. The subjects were then allowed to go, and were reminded to attend in the same group at the same time the following week. After a time lag of 1 week, subjects were tested in the same groups. Subjects in the context conditions were each given a copy of the context sentences and/or the photographs of the target speakers to hold while they listened to the lineups. All subjects heard auditory lineups of six voices, and indicated which speaker they thought was the target from the week before. The words spoken in the lineup were the same as had been spoken the previous week. The design was fully randomized between those who heard each of the two pairs of speakers, and between those who heard the male speaker first and those who heard the female speaker first.

Test marking
The standard score sheets were marked as correct if a positive correct identification of the target had been made. The maximum score was therefore two (one for the male target and one for the female target). No reduction was made for false alarms. No subject refused to select a voice in the lineup. Signal detection theory could not be used in marking the test, as there was a large number of zero scores (for which beta and d′ could not be calculated) and there were no target-absent lineups.
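To see why the zero scores matter, recall that d′ is the difference of two z-transformed rates. A minimal sketch (the rates below are invented for illustration, not the study's data):

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Signal-detection sensitivity: d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Well defined for intermediate rates...
d_ok = d_prime(0.8, 0.2)
# ...but a rate of exactly 0 (or 1) has no finite z-transform:
# inv_cdf raises StatisticsError for p outside (0, 1), which is why
# the many zero scores above rule out a signal detection analysis
# unless a correction (e.g. replacing 0 with 1/(2N)) is applied.
```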

RESULTS

In order to determine those conditions in which performance was better than chance, the numbers of subjects scoring 0, 1 or 2 correct recognitions in the two tests were compared with chance expectation, using chi-squared. Only one condition yielded results that did not differ significantly from chance expectation, namely condition 6, voice and face and personal information with no contextual reinstatement (χ² = 0.69, df = 2, p > 0.05).
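The chance expectation here follows from two independent 1-in-6 guesses across the two six-voice, target-present lineups. A sketch of the comparison (the observed counts below are invented for illustration, not the paper's data):

```python
from math import comb

def chance_probs(p_correct=1/6, n_lineups=2):
    """Chance distribution of 0..n correct picks: each six-voice,
    target-present lineup gives a 1-in-6 guessing rate, so the total
    number correct is Binomial(n_lineups, 1/6)."""
    return [comb(n_lineups, k) * p_correct**k * (1 - p_correct)**(n_lineups - k)
            for k in range(n_lineups + 1)]

def chi_square_gof(observed_counts, probs):
    """Goodness-of-fit chi-square of observed 0/1/2 counts vs chance."""
    n = sum(observed_counts)
    return sum((o - n * q) ** 2 / (n * q)
               for o, q in zip(observed_counts, probs))

# Hypothetical counts for one 30-subject condition (NOT the paper's data):
stat = chi_square_gof([20, 8, 2], chance_probs())
```

Under chance, roughly 69% of subjects should score 0, 28% should score 1 and under 3% should score 2, so the observed frequencies in most conditions departed clearly from this distribution.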

Sex effects

In order to check for gender effects in these data, the subjects who had scored differently for the male and female voices were taken, and those listeners of each gender scoring better with one gender of speaker than with the other were counted. These frequencies were compared with chance expectation and were not found to differ significantly from it (χ² = 0.031, df = 2, p > 0.05). No further analysis of sex effects in these data is therefore reported.

Experimental effects

Performance was generally better in the conditions where no face materials were available (see Table 1). A Kruskal–Wallis one-way analysis of variance showed a main effect of presentation condition (H = 29.0, n = 210, p < 0.0001).
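For reference, the Kruskal–Wallis H statistic used for this omnibus test can be computed as follows. This is a generic textbook implementation (it uses mid-ranks for ties but omits the tie-correction divisor), not the authors' analysis code:

```python
def kruskal_wallis_h(groups):
    """H statistic for k independent groups, using mid-ranks for ties.
    The tie-correction divisor is omitted in this sketch, so with
    heavily tied scores such as the 0/1/2 totals here it will
    understate the corrected H slightly."""
    pooled = sorted((x, gi) for gi, g in enumerate(groups) for x in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2          # average rank of the tied run
        for k in range(i, j):
            rank_sums[pooled[k][1]] += mid_rank
        i = j
    h = sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    return 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
```

Identical groups give H = 0, and H grows as the rank distributions of the conditions pull apart.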

Face effects

Planned orthogonal contrasts, conducted according to the method described by Marascuilo and McSweeney (1977, p. 308), showed a significant effect of face present versus face absent (χ² = 22.83, df = 1, p < 0.005), with voice-recognition performance being superior in the conditions where the face was absent at presentation (i.e. groups 1, 4 and 5 against groups 2, 3, 6 and 7).

Personal information effects

Although the mean scores were marginally higher when the personal information statements were present than when they were not, there was no overall evidence of an effect of personal information on voice memory, and the contrast designed to show such an effect (i.e. group 1 against groups 4 and 5) failed to reach significance (χ² = 0.64, df = 1, p > 0.05).

Face versus face with personal information effects

There was no overall effect of having personal information present as well as face, as compared with the conditions where only face was present (i.e. groups 2 and 3 against groups 6 and 7) (χ² = 0.40, df = 1, p > 0.05).


Table 1. Mean scores by condition

Condition                                  Mean score (N=30)   Range

1. Voice only                              1.00                0–2
2. Voice and face                          0.73                0–2
3. Voice and face and context              0.57                0–2
4. Voice and PIN                           1.10                0–2
5. Voice and PIN and context               1.17                0–2
6. Voice and face and PIN                  0.40                0–2
7. Voice and face and PIN and context      0.77                0–2

Overall (N=210)                            0.82                0–2

Note: PIN here denotes the personal information statements.


Context effects

Although the mean scores were higher when personal information context was reinstated, or when both contexts were reinstated, than when there was no reinstatement, scores were actually lower when the facial context was reinstated than when it was not. The contrast designed to show an overall effect of having the contextual information in the form of face or personal information reinstated at test (groups 2, 4 and 6 against groups 3, 5 and 7) failed to reach statistical significance (χ² = 0.73, df = 1, p > 0.05).

Effects of face versus face and PIN as context reinstatement

The mean scores were higher in the voice and face without context condition and in the voice, face and personal information with context condition, and this interaction (groups 2 and 7 against 3 and 6) was significant (χ² = 4.37, df = 1, p < 0.05). This may suggest that the contextual enhancement was greater for the double context than for facial context alone. Indeed, from an examination of the mean scores, facial context alone did seem to be detrimental to voice-memory performance, whereas face and personal information context together appeared to facilitate it. There is, however, something rather odd about the absolute level of score for group 6 (i.e. face and personal information without contextual reinstatement), which did not differ significantly from chance. The interaction may be of psychological interest if replicated in other data, but the present interaction could be unduly affected by the very low score in group 6 and should be regarded with caution.

Bonferroni correction

Applying the Bonferroni correction, as five comparisons were carried out, the acceptable probability for significance becomes p < 0.01 (i.e. 0.05/5). By this more stringent criterion, only the face contrast would achieve statistical significance.
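
The arithmetic of the correction can be illustrated with a brief sketch (Python; the helper function name is our own, and the chi-square value is the interaction contrast reported above):

```python
import math

def chi2_sf_df1(x):
    # P(X >= x) for chi-square with 1 df: erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(x / 2.0))

familywise_alpha = 0.05
n_comparisons = 5
# Bonferroni: divide the familywise alpha by the number of comparisons
per_test_alpha = familywise_alpha / n_comparisons

# The interaction contrast (chi-square = 4.37) was significant at 0.05
# but does not survive the corrected criterion of 0.01:
p_interaction = chi2_sf_df1(4.37)
print(p_interaction < per_test_alpha)  # False: no longer significant
```

This matches the text: at the corrected criterion the interaction contrast, unlike the face contrast, fails to reach significance.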

DISCUSSION

As predicted, having a face present along with the voice at presentation did influence performance on a subsequent long-term voice-recognition task: in fact, the presence of the face was strongly detrimental to voice-recognition memory. This raises a number of theoretical and practical issues. The effects of having personal information present, and of having the information from presentation reinstated at test, were not significant in these data.

Face and voice: the face overshadowing effect

The most striking result is the strength of the negative effect of having a face present on voice recognition. Earlier work had suggested that the presence of a visible face may not affect voice memory (Yarmey, 1986), or that it did in fact impair voice memory (McAllister et al., 1993a). The current study followed McAllister et al. in finding a detrimental effect of face presence on memory for a once-heard voice. In this experiment the voice and face were presented in an integrated and moderately naturalistic way, on video. This was felt to be preferable to slides and tape (Yarmey, 1986) or photograph and tape (McAllister et al., 1993a), as certain cross-modal effects are only apparent with a moving stimulus (see below for discussion of the McGurk effect). Our work was also of more direct interest to most legal situations in that the time lag between presentation and test was 1 week rather than 15 minutes (Yarmey, 1986) or 5 minutes (McAllister et al., 1993a). The present study also seems preferable in that more than a single target voice was used, unlike the earlier studies. Although a more naturalistic simulation involving higher levels of arousal in the earwitnesses would still be desirable, our study seems to have fewer drawbacks for generalization to a legal setting than do the earlier laboratory studies.

The results of this work also contradict the null finding of Legge et al. (1984).

Legge et al. found no difference in recognition memory for voices between subjects who saw faces in the inspection periods and those who did not. Their task was a very different one, however, involving a two-alternative forced-choice recognition test with 20 voices. It has already been suggested in earlier work (Cook and Wilding, in press) that the results of voice-recognition tests in single or dual line-ups can be markedly different from results in tasks with a large number of competing target items. It may be that the face overshadowing effect is only apparent when a small number of voices are targets. It seems quite plausible that if multiple voices are interfering with each other, the additional interference of a face would cease to be easily detected by explicit memory measures. The forensic parallel being drawn was more plausible in our work, which, coupled with the overwhelming size of the effect, suggests that this study was decisive.

Yarmey (1986) and McAllister et al. (1993a) discuss their results in terms of interference of the visual modality with the auditory modality. Another way to approach the theoretical implications is to look to the face-processing literature, and see whether any predictions concerning person identification by voice are possible from that standpoint. In the original conception of the Bruce and Young (1986) face-recognition model there was a voice-recognition route. It appears from the present results that this route does not operate in tandem with the face-recognition route to achieve overall person recognition. It would seem, therefore, that the voice-recognition mechanism does not really work for a once-heard voice if a face has been presented at first voice exposure. It could be that, even with the greatest amount of information available in our conditions, learning was not good enough for the Bruce and Young family of models of familiar person recognition to be appropriate. Further work is still needed to establish how and when voice recognition can be reconciled with such models conceptually.

This result could indicate the strength of the visual modality in sighted humans, or could indicate something more specific about person recognition by voice or face. This idea has some support from McAllister et al. (1993a, experiment 1), who found no effect of hearing a voice as well as seeing the face on a photograph line-up, i.e. no detrimental effect of voice presence on memory for a once-seen face. While McAllister et al. could not place too much reliance on this null result in isolation, the strong finding of face effects on memory for a once-heard voice in the current work does suggest an interesting separation of functions. McAllister, Dale and Keay (1993b) also found no effect of an auditory-visual line-up compared to a visual line-up in another similar set of experiments on line-up type. This would also fit in with the idea that voice information is simply not available to conscious memory after one brief presentation in which the face has also been available.

Facial overshadowing or visual overshadowing?

In their 1984 paper Legge et al. raise the question of whether their result (a positive one of contextual facilitation by faces) would be replicated with other visual stimuli:

It is intriguing to ask whether human faces are unique visual stimuli in their ability to facilitate voice memory. Would pictures of flowers, or alphanumeric characters, or abstract designs or chimpanzee faces do as well? (Legge, Grosmann and Pieper, 1984, p. 303)

We would echo their question with regard to the detrimental effect which face had on voice learning in this and earlier studies. There may be some relevant evidence from the 'McGurk effect' (MacDonald and McGurk, 1978; McGurk and MacDonald, 1976). In this robust effect the subjects see a moving face with the lips forming one consonant (e.g. 'ga') and simultaneously hear another consonant (e.g. 'bi'). Subjects typically report that they heard the consonant sound that they actually saw being mouthed, and form an illusory percept (e.g. 'gi'). That is to say, the visual consonant information overrides the auditory consonant information. This effect can be obtained in a number of cases, including real speech (Dekle, Fowler and Funnell, 1992) and pluck and bow movements with a cello (Saldana and Rosenblum, 1993). Interestingly, the magnitude of the effect is much smaller with the non-speech cello stimulus used by Saldana and Rosenblum than in their speech- and face-based data. It may be that the overriding of the learning of voice quality by the presence of facial information is, like the McGurk effect, an indication of a processing preference.

As a follow-up to this result, it may be desirable to look at what happens to voice memory when the subjects are unable to see anything, in order to establish whether the visual modality per se interferes with auditory memory. This question has been examined to some extent by earlier workers. Yarmey (1986) looked at the effects of levels of illumination in slide materials on memory for both faces and voices, but the subjects were still able to see the periphery beyond the slide projection screen. Blocking off all forms of visual input would show whether the absence of any visual input enhances the voice-recognition route, and indeed makes voice recognition more likely in the pragmatic sense. Bull, Rathborn and Clifford (1983) found a weak positive effect of blindness on voice-recognition memory, but did not attempt to replicate this by denying visual information to sighted subjects. The question of whether auditory memory for voices could be further enhanced by the denial of all visual information thus remains open to further study, with both legal and theoretical implications.

Personal information effects

Although the mean scores in the personal-information-present conditions were marginally higher than for the voice-alone condition, there was no significant effect of personal information. It did not appear to interfere with voice learning, but neither was there a clear facilitation from the presence of personal information at learning. Unfortunately an explicit memory test of the items of personal information was not conducted, which would have been an improvement on the experimental design. If knowing personal information has no effect, then the sole difference between known and unknown voices should be the amount of exposure to the voice that a listener has had. Further examination of the effects of personal information on voice memory is clearly called for before a satisfactory theoretical model can be drawn up.

Contextual reinstatement

The current work failed to parallel, in voice-memory performance, the facilitation of face memory by items of personal information. This contradicts the findings of Watkins et al. (1976) and Kerr and Winograd (1982), who did find that beneficial context effects with face memory could be obtained by presenting some sort of verbal history with the picture and then using that as context at test time (e.g. 'keeps tropical fish', 'is a civil rights activist'). Both of these studies used a very dissimilar design to the present work, and either that or the change from face to voice memory may account for our null result.

There was no significant effect in the current work on memory for voices of maintaining context by showing the face of the speaker again at line-up: in fact the mean voice-memory scores were rather worse in the face context group than in the face non-context group. This seems to run contrary to the encoding specificity principle. It may be that face does not form an integrated context with voice, although it is hard to see what could be more integrated with voice than face. Also, intuitively, if any pairing of stimuli should achieve automatic processing, one would think it would be voice/face pairings, given the social nature of the human animal. Other studies have found faces to form a beneficial context that facilitates memory. Legge et al. seem to have found a higher percentage of correct recognition of voices when facial context was reinstated at test, although no statistical significance is quoted for their data. In a replication, Armstrong and McKelvie (1996) report a statistically significant effect of context in the form of a face on memory for voices. As with Legge et al., the experimental paradigm was very different from those used in our work or elsewhere in the earwitnessing literature, relying as it did on a two-alternative forced-choice recognition memory test. Although Armstrong and McKelvie overtly seek to address their work to a legal setting, it is hard to conceive of a case in which a witness could credibly testify to recognizing ten different once-heard voices. Nor is it normal practice in a forensic setting to conduct two-alternative old/new recognition tests 5 minutes after witnessing the voice. It seems, therefore, that there might be some contextual facilitation by face, as by personal information, which is apparent only if the task is a very different one from those that form the accepted body of the earwitness testimony literature (Clifford, 1983; Yarmey, 1995 for reviews). Armstrong and McKelvie seek to explain their results in terms of intentional and incidental testing. It is a robust finding that voice memory is poor when the testing of it is unexpected (e.g. Saslove and Yarmey, 1980), and it may therefore be desirable for future research to address intentional as well as incidental learning of voices.

It is apparent with both single contexts, therefore, that our results have failed to replicate studies elsewhere in the literature that find contextual facilitation either of face with reinstatement of personal information or of voice with face. It has been argued elsewhere (Cook and Wilding, in press) that test results can be very different in a multiple-testing paradigm with high levels of interfering items. It seems plausible that contextual reinstatement would be more apparent in such paradigms, where the interference effects leave room for facilitation of score by context.

There was some suggestion from the data that the contextual enhancement was greater for double context than for facial context alone, being negative in the case of facial context alone but positive in the case of facial and personal information context. However, the absolute level of voice recognition for the double context condition, even with reinstatement, was still not compelling (38.5% correct recognitions), and it was felt that this result was at least in part explained by the poor performance in the double cue without reinstatement condition (group 6), which did not differ significantly from chance. It is by no means apparent from these data that contextual reinstatement would have a practical role to play in an applied legal setting, given the low absolute levels of performance. It may be that there is some sort of preferential learning, such that it is easier to learn PIN/face links than voice/face links, so that when all three items are present the voice learning is neglected. It could be that the contextual reinstatement condition (group 7) shows some release from that blocking, by allowing use of the relatively well-learned face/PIN association to locate the stored information about surface voice characteristics. This is clearly not an idea that could be asserted on the strength of a single interaction effect alone, but further research is desirable to establish more about the face, PIN and voice interaction effects, with a view to reconciling voice information with models of person recognition.

Conclusion

The highly detrimental effect of the presence of face information on memory for a once-heard voice raises the question of how long this effect continues. Further work is needed to see whether the preferential treatment that is apparently accorded to face information over voice information is reduced after a longer exposure to both stimuli, or is affected by the temporal order of encountering face and voice. The findings in the current study certainly support the current practice of treating speech uttered in the conduct of a visual line-up with caution. Current UK law (Police and Criminal Evidence Act 1984, s.66, D 2.9) requires that if a witness asks to hear a member of a visual identity parade speak:

The witness shall be reminded that the participants in the parade have been chosen on the basis of physical appearance only. Members of the parade may then be asked to comply with the witness's request to hear them speak, . . . (Annex A, p. 88)

The current findings would endorse this as a minimum precaution that should be taken when adding voice information to face information.

REFERENCES

Armstrong, H. A. and McKelvie, S. J. (1996). Effect of face context on recognition memory for voices. The Journal of General Psychology, 123, 259–270.

Baddeley, A. (1982). Domains of recollection. Psychological Review, 89, 708–729.

Baddeley, A. and Woodhead, M. (1982). Depth of processing and face recognition. Canadian Journal of Psychology, 36, 148–164.

Bower, G. H. and Karlin, M. B. (1974). Depth of processing pictures of faces and recognition memory. Journal of Experimental Psychology, 103, 751–757.

Bruce, V. and Young, A. (1986). Understanding face recognition. British Journal of Psychology, 77, 305–327.

Bull, R. and Clifford, B. R. (1984). Earwitness voice recognition accuracy. In G. L. Wells and E. F. Loftus (Eds.), Eyewitness testimony: psychological perspectives (pp. 92–123). Cambridge: Cambridge University Press.

Bull, R., Rathborn, H. and Clifford, B. R. (1983). The voice-recognition accuracy of blind listeners. Perception, 12, 223–226.

Burton, A. M., Bruce, V. and Johnston, R. A. (1990). Understanding face recognition with an interactive activation model. British Journal of Psychology, 81, 361–380.

Clifford, B. R. (1983). Memory for voices: the feasibility and quality of earwitness evidence. In S. M. A. Lloyd-Bostock and B. R. Clifford (Eds.), Evaluating witness evidence (pp. 189–218). Chichester: Wiley.

Cook, S. A. and Wilding, J. M. (in press). Earwitness testimony: never mind the variety, hear the length. Applied Cognitive Psychology.

Dekle, D. J., Fowler, C. A. and Funnell, M. G. (1992). Audiovisual integration in perception of real words. Perception and Psychophysics, 51, 355–362.

Hollien, H. (1990). The acoustics of crime: the new science of forensic phonetics. New York: Plenum.

Kerr, N. H. and Winograd, E. (1982). Effects of contextual elaboration on face recognition. Memory and Cognition, 10, 603–609.

Legge, G. E., Grosmann, C. and Pieper, C. M. (1984). Learning unfamiliar voices. Journal of Experimental Psychology: Learning, Memory and Cognition, 10, 198–303.

MacDonald, J. and McGurk, H. (1978). Visual influences on speech perception processes. Perception and Psychophysics, 24, 253–257.

Marascuilo, L. A. and McSweeney, M. (1977). Nonparametric and distribution-free methods for the social sciences. Monterey: Brooks/Cole.

McAllister, H. A., Dale, R. H. I., Bregman, N. J., McCabe, A. and Cotton, C. R. (1993a). When eyewitnesses are also earwitnesses: effects on visual and voice identifications. Basic and Applied Social Psychology, 14, 161–170.

McAllister, H. A., Dale, R. H. I. and Keay, C. E. (1993b). Effects of lineup modality on witness credibility. Journal of Social Psychology, 133, 365–376.

McGeoch, J. A. (1932). Forgetting and the law of disuse. Psychological Review, 39, 352–370.

McGurk, H. and MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.

Murnane, K. and Phelps, M. P. (1995). Effects of changes in relative cue strength on context-dependent recognition. Journal of Experimental Psychology: Learning, Memory and Cognition, 21, 158–172.

Saldana, H. M. and Rosenblum, L. D. (1993). Visual influences on auditory pluck and bow judgements. Perception and Psychophysics, 54, 406–416.

Saslove, H. and Yarmey, D. A. (1980). Long-term auditory memory: speaker identification. Journal of Applied Psychology, 65, 111–116.

Valentine, T., Brennen, T. and Bredart, S. (1996). The cognitive psychology of proper names. London: Routledge.

Watkins, M. J., Ho, E. and Tulving, E. (1976). Context effects in recognition memory for faces. Journal of Verbal Learning and Verbal Behaviour, 15, 505–517.

Winograd, E. and Rivers-Bulkeley, N. T. (1977). Effects of changing context on remembering faces. Journal of Experimental Psychology: Human Learning and Memory, 3, 397–405.

Yarmey, A. D. (1986). Verbal, visual, and voice identification of a rape suspect under different levels of illumination. Journal of Applied Psychology, 71, 363–370.

Yarmey, A. D. (1991). Descriptions of distinctive and non-distinctive voices over time. Journal of the Forensic Science Society, 31, 421–428.

Yarmey, A. D. (1995). Earwitness speaker identification. Psychology, Public Policy, and Law, 1, 792–816.
