
VOL. 74, No. 6 NOVEMBER 1967

PSYCHOLOGICAL REVIEW

PERCEPTION OF THE SPEECH CODE 1

A. M. LIBERMAN,2 F. S. COOPER, D. P. SHANKWEILER, AND M. STUDDERT-KENNEDY 3

Haskins Laboratories, New York, New York

Man could not perceive speech well if each phoneme were cued by a unit sound. In fact, many phonemes are encoded so that a single acoustic cue carries information in parallel about successive phonemic segments. This reduces the rate at which discrete sounds must be perceived, but at the price of a complex relation between cue and phoneme: cues vary greatly with context, and there are, in these cases, no commutable acoustic segments of phonemic size. Phoneme perception therefore requires a special decoder. A possible model supposes that the encoding occurs below the level of the (invariant) neuromotor commands to the articulatory muscles. The decoder may then identify phonemes by referring the incoming speech sounds to those commands.

Our aim is to identify some of the conditions that underlie the perception of speech. We will not consider the whole process, but only the part that lies between the acoustic stream and a level of perception corresponding roughly to the phoneme.4 Even this, as we will try to show, presents an interesting challenge to the psychologist.

The point we want most to make is that the sounds of speech are a special and especially efficient code on the phonemic structure of language, not a cipher or alphabet. We use the term code,5 in contrast to cipher or alphabet, to indicate that speech sounds represent a very considerable restructuring of the phonemic "message." The acoustic cues for successive phonemes are intermixed in the sound stream to such an extent that definable segments of sound do not correspond to segments at the phoneme level. Moreover, the same phoneme is most commonly represented in different phonemic environments by sounds that are vastly different. There is, in short, a marked lack of correspondence between sound and perceived phoneme. This is a central fact of speech perception. It is, we think, the result of a complex encoding that makes the sounds of speech especially efficient as vehicles for the transmission of phonemic information. But it also poses an important question: by what mechanism does the listener decode the sounds and recover the phonemes?

In this paper we will (a) ask whether speech could be well perceived if it were an alphabet or acoustic cipher, (b) say how we know that speech is, in fact, a complex code, (c) describe some properties of the perceptual mode that results when speech is decoded, (d) consider how the encoding and decoding might occur, and (e) show that the speech code is so well matched to man as to provide, despite its complexity, a uniquely effective basis for communication.

1 The research reported here was aided at its beginning, and for a considerable period afterward, by the Carnegie Corporation of New York. Funds have also come from the National Science Foundation and the Department of Defense. The work is currently supported by the National Institute of Child Health and Human Development, the National Institute of Dental Research, and the Office of Naval Research.

2 Also at the University of Connecticut.

3 Also at the University of Pennsylvania.

4 For our purposes the phoneme is the shortest segment that makes a significant difference between utterances. It lies in the lowest layer of language, has no meaning in itself, and is, within limits, commutable. There are, for example, three such segments in the word "bad"—/b/, /æ/, and /d/—established by the contrasts with "dad," "bed," and "bat." As commonly defined, a phoneme is an abstract and general type of segment, represented in any specific utterance by concrete and specific tokens, called phones, that may vary noticeably as a function of context. The distinguishable variants so produced are referred to as allophones. We do not mean to imply that every phoneme is necessarily perceived as one listens to speech. Linguistic constraints of various kinds make it possible to correct or insert segments that are heard wrongly or not at all. Phonemes can be perceived, however, and some number of them must be perceived if the listener is to discover which constraints apply.

5 We borrow the terms cipher and code from cryptography. A cipher substitutes a symbol for each of the units (letters, usually) of the original message. In a code, on the other hand, the units of the original and encoded forms do not correspond in structure or number, the encoded message typically containing fewer units. Since these distinctions are relevant to our purpose here, we have adopted the terms cipher and code as a convenient way to refer to them. We should add, however, that the arbitrary relation between the original and encoded forms of a message, so usual in cryptography, is not a feature of the encoding of phonemes into syllables.

COULD SPEECH BE ALPHABETIC?

There are reasons for supposing that phonemes could not be efficiently communicated by a sound alphabet—that is, by sounds that stand in one-to-one correspondence with the phonemes. Such reasons provide only indirect support for the conclusion that speech is a code rather than an alphabet. They are important, however, because they indicate that the encoded nature of speech may be a condition of its effectiveness in communication. More specifically, they tell us which aspects of the code are likely to be relevant to that effectiveness.

Phoneme Communication and the Properties of the Ear

Of the difficulties we might expect to have with a sound alphabet, the most obvious concerns rate. Speech can be followed, though with difficulty, at rates as high as 400 words per minute (Orr, Friedman, & Williams, 1965). If we assume an average of four to five phonemes for each English word, this rate yields about 30 phonemes per second. But we know from auditory psychophysics (Miller & Taylor, 1948) that 30 sounds per second would overreach the temporal resolving power of the ear: discrete acoustic events at that rate would merge into an unanalyzable buzz; a listener might be able to tell from the pitch of the buzz how fast the speaker was talking, but he could hardly perceive what had been said. Even 15 phonemes per second, which is not unusual in conversation, would seem more than the ear could cope with if phonemes were a string of discrete acoustic events.
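The rate argument can be checked with simple arithmetic. A minimal sketch in Python (the 400 words-per-minute figure and the four-to-five phonemes-per-word average come from the passage above; the function name is ours):

```python
# Back-of-envelope check of the phoneme rates cited in the text.

def phonemes_per_second(words_per_minute, phonemes_per_word):
    """Convert a speaking rate into a rate of discrete phonemic segments."""
    return words_per_minute / 60.0 * phonemes_per_word

# 400 words/min at ~4.5 phonemes/word gives the ~30 segments/sec
# that the text argues would overreach the ear's temporal resolution.
fast = phonemes_per_second(400, 4.5)
segment_ms = 1000.0 / fast  # average duration available per segment

print(f"{fast:.0f} phonemes/sec, ~{segment_ms:.0f} ms per segment")
```

At the conversational figure of 15 phonemes per second the same arithmetic still leaves only about 67 milliseconds per segment.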

There is at least one other requirement of a sound alphabet that would be hard to satisfy: a sufficient number of identifiable sounds. The number of phonemes, and hence the number of acoustic shapes required, is in the dozens. In English there are about 40. We should, of course, be able to find 40 identifiable sounds if we could pattern the stimuli in time, as in the case of melodies. But if we are to communicate as rapidly as we do, the phoneme segments could last no longer than about 50 milliseconds on the average. Though it is not clear from research on auditory perception how many stimuli of such brief duration can be accurately identified, the available data suggest that the number is considerably less than 40 (Miller, 1956a, 1956b; Nye, 1962; Pollack, 1952; Pollack & Ficks, 1954). We will be interested, therefore, to see whether any features of the encoding and decoding mechanisms are calculated to enhance the identifiability of the signals.
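The identifiability requirement can be given a rough information-theoretic reading. The calculation below is ours, not the authors', and is only illustrative: it assumes the inventory of about 40 English phonemes mentioned above and treats each brief segment as one choice among 40.

```python
import math

# If each ~50-ms segment were one sound chosen from an inventory of ~40,
# each segment would carry log2(40) ~ 5.3 bits (our illustrative figure).
bits_per_phoneme = math.log2(40)

# Implied information rates at the speech rates cited in the text:
conversational = 15 * bits_per_phoneme   # ~80 bits/sec
fast = 30 * bits_per_phoneme             # ~160 bits/sec

print(f"{bits_per_phoneme:.1f} bits/segment; "
      f"{conversational:.0f} to {fast:.0f} bits/sec")
```

Whether listeners could actually extract that much from a string of brief unit sounds is exactly what the cited identification experiments call into doubt.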

Results of Attempts to Communicate Phonemes by an Acoustic Alphabet

That these difficulties of rate and sound identification are real, we may see from the fact that it has not been possible to develop an efficient sound alphabet despite repeated and thoroughgoing attempts to do so. Thus, international Morse code (a cipher as we use the term here) works poorly in comparison with speech, even after years of practice. But Morse is surely not the best example; the signals are one-dimensional and therefore not ideal from a psychological standpoint. More interesting, if less well known, are the many sound alphabets that have been tested in the attempt to develop reading machines for the blind (Coffey, 1963; Cooper, 1950a; Freiberger & Murphy, 1961; Nye, 1964, 1965; Studdert-Kennedy & Cooper, 1966; Studdert-Kennedy & Liberman, 1963). These devices convert print into sound, the conversion being made typically by encipherment of the optical alphabet into one that is acoustic. The worth of these devices has been limited, not by the difficulty of converting print to sound, but by the perceptual limitations of their human users. In the 50-year history of this endeavor, a wide variety of sounds has been tried, including many that are multidimensional and otherwise appropriately designed, it would seem, to carry information efficiently. Subjects have practiced with these sounds for long periods of time, yet there has nowhere emerged any evidence that performance with these acoustic alphabets can be made to exceed performance with Morse, and that is little more than a tenth of what can be achieved with speech.

Reading and Listening

We hardly need say that language can be written and read by means of an alphabet, but we would emphasize how different are the problems of communicating phonemes by eye and by ear, and how different their relevance to a psychology of language. In contrast to the ear, the eye should have no great difficulty in rapidly perceiving ordered strings of signals. Given the eye's ability to perceive in space, we should suppose that alphabetic segments set side by side could be perceived in clusters. Nor is there reason to expect that it might be difficult to find identifiable optical signals in sufficient number. Many shapes are available, and a number of different alphabets are, indeed, in use. Thus, written language has no apparent need for the special code that will be seen to characterize language in its acoustic form. In writing and reading it is possible to communicate phonemes by means of a cipher or alphabet; indeed, there appears to be no better way.

Spoken and written language differ, then, in that the former must be a complex code while the latter can be a simple cipher. Yet perception of speech is universal, though reading is not. In the history of the race, as in the development of the individual, speaking and listening come first; writing and reading come later, if at all. Moreover, the most efficient way of writing and reading—namely, by an alphabet—is nevertheless so unnatural that it has apparently been invented only once in all history (Gelb, 1963). Perceiving the complex speech code is thus basic to language, and to man, in a way that reading an alphabet is not. Being concerned about language, we are therefore the more interested in the speech code. Why are speech sounds, alone among acoustic signals and in spite of the limitations of the ear, perceived so well?

ACOUSTIC CUES: A RESTRUCTURING OF PHONEMES AT THE LEVEL OF SOUND

To know in what sense speech sounds are a code on the phonemes, we must first discover which aspects of the complex acoustic signal underlie the perception of particular phonemes. For the most part, the relevant data are at hand. We can now identify acoustic features that are sufficient and important cues for the perception of almost all the segmental phonemes.6 Much remains to be learned, but we know enough to see that the phonemic message is restructured in the sound stream and, from that knowledge, to make certain inferences about perception.

6 For the discussion that follows we shall rely most heavily on the results of experiments with synthetic speech carried out at Haskins Laboratories. The reader will understand that a review of the relevant literature would refer to many other studies, including, in particular, those that rest on analysis of real speech. For recent reviews and discussions, see Stevens and House, in press; Kozhevnikov and Chistovich, 1965, Chapter 6.

An Example of Restructuring

To illustrate the nature of the code, we will describe an important acoustic cue for the perception of the voiced stop /d/. This example is important in its own right and also broadly representative. The phoneme /d/, or something very much like it, occurs in all the languages of the world.7 In English, and perhaps in other languages, too, it carries a heavy load of information, probably more than any other single phoneme (Denes, 1963); and it is among the first of the phonemelike segments to appear in the vocalizations of the child (Whetnall & Fry, 1964, p. 84). The acoustic cue we have chosen to examine—the second-formant transition 8—is a major cue for all the consonants except, perhaps, the fricatives /s/ and /ʃ/, and is probably the single most important carrier of linguistic information in the speech signal (Delattre, 1958, 1962; Delattre, Liberman, & Cooper, 1955, 1964; Harris, 1958; Liberman, 1957; Liberman, Delattre, Cooper, & Gerstman, 1954; Liberman, Ingemann, Lisker, Delattre, & Cooper, 1959; Lisker, 1957; O'Connor, Gerstman, Liberman, Delattre, & Cooper, 1957).

7 Some form of a voiceless unaspirated stop having a place of production in the alveolar-dental region is universal (Hockett, 1955; Joseph Greenberg, personal communication, November, 1966). The particular example, /d/, used here is a member of that class in almost all important respects. Even in regard to voicing, it is a better fit than might at first appear, since it shares with the voiceless, unaspirated stop of, say, French or Hungarian, the same position on the dimension of voice-onset-time, which is a most important variable for phonemic differences in voicing (Lisker & Abramson, 1964a, 1964b).

8 A formant is a concentration of acoustic energy within a restricted frequency region. Three or four formants are usually seen in spectrograms of speech. In the synthetic, hand-painted spectrograms of Figures 1 and 2, only the lowest two are represented. Formants are referred to by number, the first being the lowest in frequency, the second, the next higher, and so on. A formant transition is a relatively rapid change in the position of the formant on the frequency scale.

Context-conditioned variations in the acoustic cue. Figure 1 displays two highly simplified spectrographic patterns that will, when converted into sound, be heard as the syllables /di/ and /du/ (Liberman, Delattre, Cooper, & Gerstman, 1954). They exemplify the results of a search for the acoustic cues in which hand-drawn or "synthetic" spectrograms were used as a basis for experimenting with the complex acoustic signal (Cooper, 1950, 1953; Cooper, Delattre, Liberman, Borst, & Gerstman, 1952; Cooper, Liberman, & Borst, 1951). The steady-state formants, comprising approximately the right-hand two-thirds of each pattern, are sufficient to produce the vowels /i/ and /u/ (Delattre, Liberman, & Cooper, 1951; Delattre, Liberman, Cooper, & Gerstman, 1952). At the left of each pattern are the relatively rapid changes in frequency of the formants—the formant transitions—that are, as we indicated, important acoustic cues for the perception of the consonants. The transition of the first, or lower, formant, rising from a very low frequency to the level appropriate for the vowel, is a cue for the class of voiced stops /b, d, g/ (Delattre, Liberman, & Cooper, 1955; Liberman, Delattre, & Cooper, 1958). It would be exactly the same for /bi, bu/ and /gi, gu/ as for /di, du/. Most generally, this transition is a cue for the perception of manner and voicing. The acoustic feature in which we are here interested is the transition of the second formant, which is, in the patterns of Figure 1, a cue for distinguishing among the voiced stops /b, d, g/; that is to say, the second-formant transition for /gi/ or /bi/, as well as /gu/ or /bu/, would be different from those for /di/ and /du/ (Liberman, Delattre, Cooper, & Gerstman, 1954). In general, transitions of the second formant carry important information about the place of production of most consonants (Delattre, 1958; Liberman, 1957; Liberman, Delattre, Cooper, & Gerstman, 1954).

FIG. 1. Spectrographic patterns sufficient for the synthesis of /d/ before /i/ and /u/.

It is, then, the second-formant transitions that are, in the patterns of Figure 1, the acoustic cues for the perception of the /d/ segment of the syllables /di/ and /du/. We would first note that /d/ is the same perceptually in the two cases, and then see how different are the acoustic cues. In the case of /di/ the transition rises from approximately 2200 cps to 2600 cps; in /du/ it falls from about 1200 to 700 cps. In other words, what is perceived as the same phoneme is cued, in different contexts, by features that are vastly different in acoustic terms. How different these acoustic features are in nonspeech perception can be determined by removing them from the patterns of Figure 1 and sounding them in isolation. When we do that, the transition isolated from the /di/ pattern sounds like a rapidly rising whistle or glissando on high pitches, the one from /du/ like a rapidly falling whistle on low pitches.9 These signals could hardly sound more different from each other. Furthermore, neither of them sounds like /d/ nor like speech of any sort.

9 This is true of the patterns shown in Figure 1 as converted to sound by the Pattern Playback. When the formants correspond more closely in their various constant features to those produced by the human vocal apparatus, the musical qualities described above may be harder to hear or may disappear altogether. So long as the second-formant transitions of /di/ and /du/ are not heard as speech, however, they do not sound alike.

FIG. 2. Spectrographic patterns sufficient for the synthesis of /d/ before vowels. (Dashed line at 1800 cps shows the "locus" for /d/.)

The disappearance of phoneme boundaries: Parallel transmission. We turn now to another, related aspect of the code: the speech signal typically does not contain segments corresponding to the discrete and commutable phonemes. There is no way to cut the patterns of Figure 1 so as to recover /d/ segments that can be substituted one for the other. Nor can we make the commutation simply by introducing a physical continuity between the cut ends, as we might, for example, if we were segmenting and recombining the alphabetic elements of cursive writing.

Indeed, if we could somehow separate commutability from segmentability, we should have to say that there is no /d/ segment at all, whether commutable or not. We cannot cut either the /di/ or the /du/ pattern in such a way as to obtain some piece that will produce /d/ alone. If we cut progressively into the syllable from the right-hand end, we hear /d/ plus a vowel, or a nonspeech sound; at no point will we hear only /d/. This is so because the formant transition is, at every instant, providing information about two phonemes, the consonant and the vowel—that is, the phonemes are being transmitted in parallel.
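The isolated transition cues described above (rising from roughly 2200 to 2600 cps for /di/, falling from 1200 to 700 cps for /du/) can be crudely imitated with a single sinusoid whose frequency follows the transition. This sketch is ours and is only a nonspeech analogue of a formant transition, the "whistle" heard when the cue is sounded in isolation; the sample rate and the 50-ms duration are assumptions:

```python
import math

SAMPLE_RATE = 8000  # Hz; our choice for this sketch

def glide(f_start, f_end, dur_s, sr=SAMPLE_RATE):
    """A sine tone whose frequency moves linearly from f_start to f_end,
    i.e., the kind of rising or falling 'whistle' the text describes."""
    n = int(dur_s * sr)
    samples, phase = [], 0.0
    for i in range(n):
        f = f_start + (f_end - f_start) * i / n   # linear transition
        phase += 2.0 * math.pi * f / sr           # integrate instantaneous freq
        samples.append(math.sin(phase))
    return samples

# Endpoint frequencies are the ones given in the text:
di_cue = glide(2200, 2600, 0.05)  # rising glide isolated from /di/
du_cue = glide(1200, 700, 0.05)   # falling glide isolated from /du/
print(len(di_cue), len(du_cue))
```

Played back, the two glides share nothing perceptually, even though each would cue the same /d/ in its own syllable.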

The Locus: An Acoustic Invariant?

The patterns of Figure 2 produce /d/ in initial position with each of a variety of vowels, thus completing a series of which the patterns shown in Figure 1 are the extremes. If one extrapolates the various second-formant transitions backward in time, he sees that they seem to have diverged from a single frequency. To find that frequency more exactly, and to determine whether it might in some sense be said to characterize /d/, we can, as in Figure 3, pair each of a number of straight second formants with first formants that contain a rising transition sufficient to signal a voiced stop of some kind. On listening to such patterns, one hears an initial /d/ most strongly when the straight second formant is at 1800 cps. This has been called the /d/ "locus" (Delattre, Liberman, & Cooper, 1955). There are, correspondingly, second-formant loci for other consonants, the frequency position of the locus being correlated with the place of production. In general, the locus moves somewhat as a function of the associated vowel, but, except for a discontinuity in the case of /k, g, ŋ/, the locus is more nearly invariant than the formant transition (Delattre, Liberman, & Cooper, 1955; Öhman, 1966; Stevens & House, 1956).
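The backward extrapolation described here can be made concrete. In the sketch below the transition values are hypothetical numbers constructed so that each straight-line transition, observed only after voicing begins, points back to the 1800-cps /d/ locus reported in the text; nothing else is taken from the paper.

```python
def extrapolate_to_onset(t1, f1, t2, f2, t_onset=0.0):
    """Extend the line through two observed (time, freq) points on a
    second-formant transition back to the (unradiated) onset time."""
    slope = (f2 - f1) / (t2 - t1)
    return f1 + slope * (t_onset - t1)

# Hypothetical transitions toward three vowels, each observed only from
# 10 ms onward; the portion before 10 ms is never part of the signal.
observed = [
    ((0.010, 1840), (0.050, 2000)),  # rising, toward a front vowel
    ((0.010, 1770), (0.050, 1650)),  # falling slightly, toward a mid vowel
    ((0.010, 1725), (0.050, 1425)),  # falling steeply, toward a back vowel
]

origins = [round(extrapolate_to_onset(t1, f1, t2, f2))
           for (t1, f1), (t2, f2) in observed]
print(origins)  # each line extrapolates back to the 1800-cps locus
```

The point of the construction is the one the section goes on to make: the common origin is recovered only by extrapolation, not by inspection of any radiated portion of the signal.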


FIG. 3. Schematic display of the stimuli used in finding the second-formant loci of /b/, /d/, and /g/. (A—Frequency positions of the straight second formants and the various first formants with which each was paired. B—A typical test pattern, made up of the first and second formants circled in A. The best pattern for /d/ was the one with the straight second formant at 1800 cps. Figure taken from Delattre, Liberman, and Cooper, 1955.)

Is there, then, an invariant acoustic cue for /d/? Consider the various second formants that all begin at the 1800-cycle locus and proceed from there to a number of different vowel positions, such as are shown in Figure 4.

FIG. 4. A—Second-formant transitions that start at the /d/ locus and B—comparable transitions that merely "point" at it, as indicated by the dotted lines. (Those of A produce syllables beginning with /b/, /d/, or /g/, depending on the frequency level of the formant; those of B produce only syllables beginning with /d/. Figure taken from Delattre, Liberman, and Cooper, 1955.)

We should note first that these transitions are not superimposable, so that if they are to be regarded as invariant acoustic cues, it could only be in the very special and limited sense that they start at the same point on the frequency scale. But even this very limited invariance is not to be had. If we convert the patterns of Figure 4 to sound, having paired each of the second formants in turn with the single first formant shown at the bottom of the figure, we do not hear /d/ in every case. Taking the second formants in order, from the top down, we hear first /b/, then /d/, then /g/, then once again /d/, as shown in the figure. In order to hear /d/ in every case, we must erase the first part of the transition, as shown in Figure 4B, so that it "points" at the locus but does not actually begin there. Thus, the 1800-cps locus for /d/ is not a part of the acoustic signal, nor can it be made part of that signal without grossly changing the perception (Delattre, Liberman, & Cooper, 1955).

Though the locus can be defined in acoustic terms—that is, as a particular frequency—the concept is more articulatory than acoustic, as can be seen in the rationalization of the locus by Stevens and House (1956). What is common to /d/ before all the vowels is that the articulatory tract is closed at very much the same point. According to the calculations of Stevens and House, the resonant frequency of the cavity at the instant of closure is approximately 1800 cps, but since no sound emerges until some degree of opening has occurred, the locus frequency is not radiated as part of the acoustic signal. At all events it seems clear that, though the locus is more nearly invariant with the phoneme than is the transition itself, the invariance is a derived one, related more to articulation than to sound. As we will see later, however, the locus is only a step toward the invariance with phoneme perception that we must seek; better approximations to that invariance are probably to be had by going farther back in the chain of articulatory events, beyond the cavity shapes that underlie the locus, to the commands that produce the shapes.

How General Is the Restructuring?

Having dealt with the restructuring of only one phoneme and one acoustic cue, we should say now that, with several interesting exceptions yet to be noted, it is generally true of the segmental phonemes that they are drastically restructured at the level of sound.10 We will briefly summarize what is known in this regard about the various types of cues and linguistic classes or dimensions.

Transitions of the second formant. For the second-formant transition—the cue with which we have been concerned and the one that is, perhaps, the most important for the perception of consonants according to place of production—the kind of invariance lack we found with /d/ characterizes all the voiced stops, voiceless stops, and nasal consonants (Liberman, Delattre, Cooper, & Gerstman, 1954; Malécot, 1956). Indeed, the invariance problem is, if anything, further complicated in these other phonemes. In the case of /g, k, ŋ/, for example, there is a sudden and considerable shift in the locus as between the unrounded and rounded vowels, creating a severe lack of correspondence between acoustic signal and linguistic perception that we mentioned earlier in this paper and that we have dealt with in some detail elsewhere (Liberman, 1957).

10 Somewhat similar complications arise in the suprasegmental domain. For data concerning the relation between the acoustic signal and the perception of intonation see Hadding-Koch and Studdert-Kennedy, 1964a, 1964b. Lieberman (1967) has measured some of the relevant physiological variables and, on the basis of his findings, has devised a hypothesis to account for the perception of intonation, in particular the observations of Hadding-Koch and Studdert-Kennedy. Lieberman's account of perception in the suprasegmental domain fits well with our own views about the decoding of segmental phonemes, discussed later in this paper.

With the liquids and semivowels /r, l, w, j/ the second-formant transition originates at the locus—as it cannot in the case of the stop and nasal consonants—so the lack of correspondence between acoustic signal and phoneme is less striking, but even with these phonemes the transition cues are not superimposable for occurrences of the same consonant in different contexts (Lisker, 1957; O'Connor, Gerstman, Liberman, Delattre, & Cooper, 1957).

Transitions of the third formant. What of third-formant transitions, which also contribute to consonant perception in terms of place of production? Though we know less about the third-formant transitions than about those of the second, such evidence as we have does not suggest that an invariant acoustic cue can be found here (Harris, Hoffman, Liberman, Delattre, & Cooper, 1958; Lisker, 1957b; O'Connor, Gerstman, Liberman, Delattre, & Cooper, 1957). In fact, the invariance problem is further complicated in that the third-formant transition seems to be more or less important (relative to the second-formant transition, for example) depending on the phonemic context. In the case of our /di, du/ example (Figure 1), we find that an appropriate transition of the third formant contributes considerably to the perception of /d/ in /di/ but not at all to /d/ in /du/. To produce equal—that is, equally convincing—/d/'s before both /i/ and /u/, we must use acoustic cues that are, if anything, even more different than the two second-formant transitions we described earlier in this section.

Constriction noises. The noises produced at the point of constriction—as in the fricatives and stops—are another set of cues for consonant perception according to place of production, the relevant physical variable being the frequency position of the band-limited noise. When these noises have considerable duration—as in the fricatives—the cue changes but little with context. Since the cue provided by these noises is of overriding importance in the perception of /s/ and /š/, we should say that these consonants show little or no restructuring in the sound stream (Harris, 1958; Hughes & Halle, 1956). (This may not be true when the speech is rapid.) They therefore constitute an exception to the usual strong dependence of the acoustic signal on its context—that is, to the effects of a syllabic coding operation.

On the other hand, the brief noise cues—the bursts, so called—of the stop consonants display as much restructuring as do the transitions of the second formant (Liberman, Delattre, & Cooper, 1952). Bursts of noise that produce the best /k/ or /g/ vary over a considerable frequency range depending on the following vowel. The range is so great that it extends over the domain of the /p, b/ burst, creating the curiosity of a single burst of noise at 1440 cps that is heard as /p/ before /i/ but as /k/ before /a/.11 We should also note that the relative importance of the bursts, as of the transitions, varies greatly for the same stop in different contexts. For example, /g/ or /k/, which are powerfully cued by a second-formant transition before the vowel /æ/, will likely require a burst at an appropriate frequency if they are to be well perceived in front of /o/.12

11 This result was obtained in the experiment (Liberman, Delattre, & Cooper, 1952) with synthetic speech referenced above and verified for real speech in a tape-cutting experiment by Carol Schatz (1954).

12 This can be seen in the results of experiments on the bursts and transitions as


Thus, the same consonant is primarily cued in two different contexts by signals as physically different as a formant transition and a burst of noise.

Manner, voicing, and position. In describing the complexities of the relation between acoustic cue and perceived phoneme, we have so far dealt only with cues for place of production and only with consonants in initial position in the syllable. A comparable lack of regularity is also found in the distinctions of manner and voicing and in the cues for consonants in different positions (Abramson & Lisker, 1965; Delattre, 1958; Delattre, Liberman, & Cooper, 1955; Liberman, 1957; Liberman, Delattre, & Cooper, 1958; Liberman, Delattre, Cooper, & Gerstman, 1954; Liberman, Delattre, Gerstman, & Cooper, 1956; Lisker, 1957a, 1957b; Lisker & Abramson, 1964b; Öhman, 1966). We will not consider these cues in detail since the problems they present are similar to those already encountered. Indeed, the cues we have discussed as examples of encoding are merely a subset of those that show extensive restructuring as a function of context: it is the usual case that the acoustic cues for a consonant are different when the consonant is paired with different vowels, when it is in different positions (initial, medial, or final) with respect to the same vowels, and for all types of cues (manner or voicing, as well as place). Thus, for example, the cues for manner, place, and voicing of /b/ in /ba/ are acoustically different from those of /b/ in /ab/, the transitional cues for place being almost mirror images; further, the preceding set of cues differs from

cues for the stops (Liberman, Delattre, & Cooper, 1952; Liberman, Delattre, Cooper, & Gerstman, 1954). It is confirmed whenever one attempts, by using all possible cues, to synthesize the best stops.

corresponding sets for /b/ with each of the other vowels.

The vowels. We should remark, finally, on the acoustic cues for the vowels. For the steady-state vowels of Figures 1 and 2, perception depends primarily on the frequency position of the formants (Delattre, Liberman, & Cooper, 1951; Delattre, Liberman, Cooper, & Gerstman, 1952). There is, for these vowels, no restructuring of the kind found to be so common among the consonant cues and, accordingly, no problem of invariance between acoustic signal and perception.13

However, vowels are rarely steady state in normal speech; most commonly these phonemes are articulated between consonants and at rather rapid rates. Under these conditions vowels also show substantial restructuring—that is, the acoustic signal at no point corresponds to the vowel alone, but rather shows, at any instant, the merged influences of the preceding or following consonant (Lindblom & Studdert-Kennedy, in press; Lisker, 1958; Shearme & Holmes, 1962; Stevens & House, 1963).

In slow articulation, then, the acoustic cues for the vowels—and, as we saw earlier, the noise cues for fricatives—tend to be invariant. In this respect they differ from the cues for the other phonemes, which vary as a function of context at all rates of speaking. However, articulation slow enough to permit the vowels and fricatives to avoid being encoded is probably artificial and rare.

13 The absolute formant frequencies of the same vowel are different for men, women, and children, and for different individuals, partly as a consequence of differences in the sizes of vocal tracts. This creates an invariance problem very different from the kind we have been discussing and more similar, perhaps, to the problems encountered in the perception of nonspeech sounds, such as the constant ratios of musical intervals.


Phoneme segmentation. We return now to the related problem of segmentation, briefly discussed above for the /di, du/ example. We saw there that the acoustic signal is not segmented into phonemes. If one examines the acoustic cues more generally, he finds that successive phonemes are most commonly merged in the sound stream. This is, as we will see, a correlate of the parallel processing that characterizes the speech code and is an essential condition of its efficiency. One consequence is that the acoustic cues cannot be divided on the time axis into segments of phonemic size.14

The same general conclusion may be reached by more direct procedures. Working with recordings of real speech, Harris (1953) tried to arrive at "building blocks" by cutting tape recordings into segments of phoneme length and then recombining the segments to form new words. "Experiments indicated that speech based upon one building block for each vowel and consonant not only sounds unnatural but is mostly unintelligible . . . [p. 962]." In a somewhat similar attempt to produce intelligible speech by the recombination of parts taken from previously recorded utterances, Peterson, Wang, and Sivertsen (1958) concluded that the smallest segments one can use are of roughly half-syllable length. Thus, it has not been possible, in general, to synthesize speech from prerecorded segments of phonemic dimensions. Nor can we cut the sound stream along the time dimension so as to recover segments that will be perceived as separate phonemes. Of course, there are exceptions. As we might expect, these are the steady-state vowels and the long-duration noises of certain fricatives in which, as we have seen, the sounds show minimal restructuring. Apart from these exceptions, however, segments corresponding to the phonemes are not found at the acoustic level.

14 This is not to say that the sound spectrogram fails to show discontinuities along the time axis. Fant (1962a, 1962b) discusses the interpretation to be given these abrupt changes and the temporal segments bounded by them. He warns: "Sound segment boundaries should not be confused with phoneme boundaries. Several adjacent sounds of connected speech may carry information on one and the same phoneme, and there is overlapping in so far as one and the same sound segment carries information on several adjacent phonemes [1962a, p. 9]."

We shall see later that the articulatory gestures corresponding to successive phonemes—or, more precisely, their subphonemic features—are overlapped, or shingled, one onto another. This parallel delivery of information produces at the acoustic level the merging of influences we have already referred to and yields irreducible acoustic segments of approximately syllabic dimensions. Thus, segmentation also exhibits a complex relation between linguistic structure or perception, on the one hand, and the sound stream on the other.

PERCEPTION OF THE RESTRUCTURED PHONEMES: THE SPEECH MODE

If phonemes are encoded syllabically in the sound stream, they must be recovered in perception by an appropriate decoder. Perception of phonemes that have been so encoded might be expected to differ from the perception of those that have not and also, of course, from nonspeech. In this section we will suggest that such differences do, in fact, exist.

We have already seen one example of such a difference in the transition cues for /d/ in /di/ and /du/. Taken out of speech context, these transitions sound like whistles, the one rising through a range of high pitches and the other falling through low pitches; they do not sound like each other, nor


even like speech. This example could be multiplied to include the transition cues for many other phonemes. With simplified speech of the kind already shown, the listener's perception is very different depending on whether he is, for whatever reason, in the speech mode or out of it (Brady, House, & Stevens, 1961).

Even on the basis of what can be heard in real speech, one might have suspected that the perception of encoded and unencoded phonemes15 is somehow different. One has only to listen carefully to some of the latter in order to make reasonably accurate guesses about the auditory and acoustic dimensions that are relevant to their perception. The fricatives /s/ and /š/, for example, obviously differ in manner from other phonemes in that there is noise of fairly long duration; moreover, one can judge by listening that they differ from each other in the "pitch" of the noise—that is, in the way the energy is distributed along the frequency scale. Consider, on the other hand, the encoded phonemes /b, d, g/. No amount of listening, no matter how careful, is likely to reveal that an important manner cue is a rapidly rising frequency at the low end of the frequency scale (first formant), or that these stops are distinguished from each other primarily by the direction and extent of a rapid frequency glide in the upper frequency range (second and third formants).

15 There is a need, in much that follows, for a convenient way to refer to classes of phonemes that show much—or little—restructuring of their acoustic cues as a function of context. The former are, indeed, encoded. We shall refer to the latter as "unencoded phonemes," implying only that they are found at the other end of a continuum on degree of restructuring; we do not wish to imply differences in the processes affecting these phonemes, whether or not such differences can be inferred from their perceptual characteristics.

These observations of perceptual differences between speech and nonspeech sounds, and even among classes of phonemes, do not stand alone. Controlled experiments can show more accurately, if sometimes less directly, the differences in perception. We will next consider some of these experiments.

Tendencies toward Categorical and Continuous Perception

Research with some of the encoded phonemes has shown that they are categorical, not only in the abstract linguistic sense, but as immediately given in perception. Consider, first, that in listening to continuous variations in acoustic signals, one ordinarily discriminates many more stimuli than he can absolutely identify. Thus, we discriminate about 1200 different pitches, for example, though we can absolutely identify only about seven. Perception of the restructured phonemes is different in that listeners discriminate very little better than they identify absolutely; that is to say, they hear the phonemes but not the intraphonemic variations.

The effect becomes clear impressionistically if one listens to simplified, synthetic speech signals in which the second-formant transition is varied in relatively small, acoustically equal steps through a range sufficient to produce the three stops, /b/, /d/, and /g/. One does not hear steplike changes corresponding to the changes in the acoustic signal, but essentially quantal jumps from one perceptual category to another.
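The contrast can be sketched quantitatively. In the toy model below, the labeling probabilities and the prediction formula are illustrative assumptions, not the authors' data or exact method: a step-like identification function, as for the encoded stops, predicts discrimination that peaks at the category boundary and sits near chance within a category, while a gradual identification function, as for steady-state vowels, predicts flat, modest discrimination everywhere.

```python
def predicted_abx(p):
    """Predicted proportion correct on ABX trials for adjacent stimulus
    pairs under a pure labeling model: chance (0.5) plus a bonus that
    grows with the difference in labeling probability. (An illustrative
    idealization, not the formula used in the cited experiments.)"""
    return [0.5 + 0.5 * (p[i + 1] - p[i]) ** 2 for i in range(len(p) - 1)]

# Hypothetical probability of labeling each of 7 synthetic stimuli "d"
# along a /b/-/d/ continuum: step-like, as reported for encoded stops.
stop_labels = [0.0, 0.02, 0.05, 0.50, 0.95, 0.98, 1.0]

# Hypothetical labeling along a vowel continuum: nearly linear.
vowel_labels = [0.0, 0.17, 0.33, 0.50, 0.67, 0.83, 1.0]

stop_disc = predicted_abx(stop_labels)
vowel_disc = predicted_abx(vowel_labels)

# Stops: a sharp discrimination peak at the category boundary and
# near-chance discrimination within a category.
assert max(stop_disc) > 0.60 and min(stop_disc) < 0.51

# Vowels: uniformly modest discrimination, no pronounced peak.
assert max(vowel_disc) - min(vowel_disc) < 0.01
```

The point of the sketch is only that a listener who hears the phonemes but not the intraphonemic variations should discriminate little better than he identifies, exactly as the experiments described next report.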

To evaluate this effect more exactly, various investigators have made quantitative comparisons of the subjects' ability to identify the stimuli absolutely and to discriminate them on any basis whatsoever. For certain consonant distinctions it has been found that the


mode of perception is, in fact, nearly categorical: listeners can discriminate only slightly better than they can identify absolutely. In greater or lesser degree, this has been found for /b, d, g/ (Eimas, 1963; Griffith, 1958; Liberman, Harris, Hoffman, & Griffith, 1957; Studdert-Kennedy, Liberman, & Stevens, 1963, 1964); /d, t/ (Liberman, Harris, Kinney, & Lane, 1961)16; /b, p/ in intervocalic position (Liberman, Harris, Eimas, Lisker, & Bastian, 1961), and presence or absence of /p/ in slit vs. split (Bastian, Delattre, & Liberman, 1959; Bastian, Eimas, & Liberman, 1961; Harris, Bastian, & Liberman, 1961).

The perception of unencoded steady-state vowels is quite different from the perception of stops.17 To appreciate this difference one need only listen to synthetic vowels that vary, as in the example of the stops, in relatively small and acoustically equal steps through a range sufficient to produce three adjacent phonemes—say /i/, /I/, and /ɛ/. As heard, these vowels change step-by-step, much as the physical stimulus changes: the vowel /i/ shades into /I/, and /I/ into /ɛ/. Immediate perception is more nearly continuous than categorical and the listener hears many intraphonemic variations. More precise measures of vowel perception indicate that, in contrast to the stops, listeners can discriminate many more

16 Studies of this distinction (and of the corresponding ones for the other stops) in 11 diverse languages indicate that it tends to be categorical also in production (Abramson & Lisker, 1965; Lisker & Abramson, 1964a, 1964b).

17 In experiments with mimicry, Ludmilla Chistovich and her colleagues have obtained differences between vowels and consonants that are consistent with the differing tendencies toward categorical and continuous perception described here (Chistovich, 1960; Chistovich, Klaas, & Kuz'min, 1962; Galunov & Chistovich, 1965; Kozhevnikov & Chistovich, 1965).

stimuli than they can identify absolutely (Fry, Abramson, Eimas, & Liberman, 1962; Stevens, Öhman, & Liberman, 1963; Stevens, Öhman, Studdert-Kennedy, & Liberman, 1964). Similar studies of the perception of vowel duration (Bastian & Abramson, 1962) and tones in Thai (Abramson, 1961), both of which are phonemic in that language, have produced similar results. We should suppose that steady-state vowels, vowel duration, and the tones can be perceived in essentially the same manner as continuous variations in nonspeech signals. The results of a direct experimental comparison by Eimas (1963) suggest that this is so.

We emphasize that in speaking of vowels we have so far been concerned only with those that are isolated and steady state. These are, as we have said, unencoded and hence not necessarily perceived in the speech mode. But what of the more usual situation we described earlier, that of vowels between consonants and in rapid articulation? Stevens (1966) has supposed that the rapid changes in formant position characteristic of such vowels would tend to be referred in perception to the speech mode, and he has some evidence that this is so, having found that perception of certain vowels in proper dynamic context is more nearly categorical than that of steady-state vowels. Inasmuch as these rapidly articulated vowels are substantially restructured in the sound stream, Stevens' results may be assumed to reflect the operation of the speech decoder.

Lateral Differences in the Perception of Speech and Nonspeech

The conclusion that there is a speech mode, and that it is characterized by processes different from those underlying the perception of other sounds,


is strengthened by recent indications that speech and nonspeech sounds are processed primarily in different hemispheres of the brain. Using Broadbent's (1954) method of delivering competing stimuli simultaneously to the two ears, investigators have found that speech stimuli presented to the right ear (hence, mainly to the left cerebral hemisphere) are better identified than those presented to the left ear (hence, mainly to the right cerebral hemisphere), and that the reverse is true for melodies and sonar signals (Broadbent & Gregory, 1964; Bryden, 1963; Chaney & Webster, 1965; Kimura, 1961, 1964, 1967). In the terminology of this paper, the encoded speech signals are more readily decoded in the left hemisphere than in the right. This suggests the existence of a special left-hemisphere mechanism different from the right-hemisphere mechanism for the perception of sounds not similarly encoded. It is of interest, then, to ask whether the encoded stops and the unencoded steady-state vowels are, perhaps, processed unequally by the two hemispheres. An experiment was carried out (Shankweiler & Studdert-Kennedy, 1967b) designed to answer this question, using synthetic speech syllables that contrasted in just one phoneme. A significantly greater right-ear advantage was found for the encoded stops than for the unencoded steady-state vowels. The fact that the steady-state vowels are less strongly lateralized in the dominant (speech) hemisphere may be taken to mean that these sounds, being unencoded, can be, and presumably sometimes are, processed as if they were nonspeech. In another experiment (Shankweiler & Studdert-Kennedy, 1967a), the consonant and vowel comparisons were made with real speech. Different combinations of the same set of consonant-vowel-consonant syllables were used

for both tests. As before, a decisive right-ear advantage was found for contrasting stop consonants, and again there was no difference for vowels, even though these were here articulated in dynamic context between consonants. We will be interested to determine what happens to lateral differences in vowel perception when the vowels are very rapidly articulated. Such vowels are, as has been said, necessarily restructured to some extent and may be correspondingly dependent on the speech decoder for their perception.

Perception of Speech by Machine and by Eye

If speech is, as we have suggested, a special code, its perception should be difficult in the absence of an appropriate decoder such as we presumably use in listening to speech sounds. It is relevant, therefore, to note how very difficult it is to read visual transforms of speech or to construct automatic speech recognizers.

Consider, first, a visual transform—for example, the spectrogram. As we have already seen, the spectrographic pattern for a particular phoneme typically looks very different in different contexts. Furthermore, visual inspection of a spectrogram does not reveal how a stretch of speech might be divided into segments corresponding to phonemes, or even how many phonemes it might contain: the eye sees the transformed acoustic signal in its undecoded form. We should not be surprised, then, to discover that spectrograms are, in fact, extremely difficult to read.

Some part of the difficulty may be attributed to inadequacies of the transform or to lack of training. But an improved transform, if one should be found, would not by itself suffice to make spectrograms readable, since it would not obviate the need to decode.


SCHEMA FOR PRODUCTION

[Figure 5, a flow diagram not reproduced legibly here, shows levels, conversions, and descriptions of the signals: Sentence → (Syntactic Rules) → phonemes, or sets of distinctive features, represented by patterns in the central nervous system → (Neuromotor Rules) → motor commands, i.e., neural signals to the muscles → (Myomotor Rules) → muscle contractions → (Articulatory Rules) → vocal tract shapes, etc. → (Acoustic Rules) → sounds of speech.]

FIG. 5. Schematic representations of assumed stages in speech production.

Nor is training likely to overcome the difficulty. Many persons have had considerable experience with spectrograms, yet none has found it possible to read them well.18 Ideally, training in "reading" spectrograms should cause the transitions for /d/ in /di/ and /du/ to look alike. The speech decoder, after all, makes them sound alike when speech is perceived by ear; moreover, it seems obvious that if the decoder did not do this, speech would be much more difficult to perceive. If, as we suspect, training alone cannot make the acoustic cues for the same phoneme look alike, then we should, perhaps, conclude that the speech decoder, which makes them sound alike, is biologically tied to an auditory input.

18 "As a matter of fact I have not met one single speech researcher who has claimed he could read speech spectrograms fluently, and I am no exception myself [Fant, 1962a, p. 4]." Spectrograms are, even so, less difficult to read than oscillograms—whence their popular name, "visible speech," and much of the early enthusiasm for them (Potter, Kopp, & Green, 1947).

How, then, do machines fare in recognizing the encoded sounds of speech? If speech were a cipher, like print, it would be no more difficult to build a speech recognizer than a print reader. In fact, the speech recognizer has proved to be more difficult, and by a very wide margin, largely because it needs an appropriate decoder, as a print reader does not, and because the design of that decoder is not easily accomplished—a conclusion confirmed by two decades of intensive effort. We might repeat, in passing, that for human beings the difficulty is the other way around: perceiving speech is far easier than reading.

ENCODING AND DECODING: FROM PHONEME TO SOUND AND BACK

Conversions from Phoneme to Sound

Having considered the evidence that speech is a code on the phonemes, we must ask how the encoding might have come about. The schema in Figure 5 represents the various conversions that presumably occur as speech proceeds from sentence, through the empty segments at the phoneme level, to the final acoustic signal. The topmost box, labeled Syntactic Rules, would, if properly developed, be further broken down into phrase-structure rules, transformation rules, morphophonemic rules, and the like. (See, for example, Chomsky, 1964.) These processes would be of the greatest interest if we were dealing with speech perception in the broadest sense. Here, however, we may start with the message in the form of an ordered sequence of phonemes (or, as we will see, their constituent features) and follow it through the successive converters that yield, at the end, the acoustic waveform.

Subphonemic features: Their role in production. First, we must take account of the fact that the phonemes


are compounded of a smaller number of elements, and, indeed, shift the emphasis from the phoneme to these subphonemic features.19 The total gesture in the articulation of /b/, for example, can be broken down into several distinctive elements: (a) closing and opening of the upper vocal tract in such a way as to produce the manner feature characteristic of the stop consonants; (b) closing and opening of the vocal tract specifically at the lips, thus producing the place feature termed bilabiality; (c) closing the velum to provide the feature of orality; and (d) starting vocal fold vibration simultaneously with the opening of the lips, appropriate to the feature of voicing. The phoneme /p/ presumably shares with /b/ features 1, 2, and 3, but differs as to feature 4, in that vocal fold vibration

19 The term "feature" has varied uses, even in this paper. Thus, in this paragraph, we describe a particular speech gesture in conventional phonetic terms, referring to how the articulators move and where the constrictions occur as the manner and place features that, taken together, characterize the gesture. Two paragraphs later, the description of a possible model for speech production identifies subphonemic features as implicit instructions to separate and independent parts of the motor machinery. Viewed in this way, the distinctive features of a phoneme are closely linked to specific muscles and the neural commands that actuate them.

Distinctive features of this kind are clearly different from those so well known from the work of Jakobson and his colleagues. See Jakobson, Fant, and Halle (1952) and the various revisions proposed by the several authors (Fant, 1967; Halle, 1964; Jakobson & Halle, 1956; Stevens & Halle, in press). We are, nevertheless, deeply indebted to them for essential points—in particular, that the phonemes can be characterized by sets of features which are few in number, limited to a very few states each, and maintained throughout several phonemes. Other important characteristics for an ideal system of distinctive features include mutual independence, clear correspondences with physiological—or acoustic—observables, and as much universality across languages as parsimony for a single language will allow.

begins some 50 or 60 milliseconds after opening of the lips; /m/ has features 1, 2, and 4 in common with /b/, but differs in feature 3, since the velum hangs open to produce the feature of nasality; /d/ has features 1, 3, and 4 in common with /b/, but has a different place of articulation; and so on.
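The feature decomposition just described can be restated as data in a few lines. The sketch below encodes the paper's own /b/, /p/, /m/, /d/ example; the feature names are simplified labels for features 1 through 4 (manner, place, velic state, voicing), not a standard feature system.

```python
# Phonemes as bundles of subphonemic features (simplified labels for the
# paper's features 1-4: manner, place, velic, and voicing gestures).
PHONEMES = {
    "b": {"manner": "stop", "place": "bilabial", "velum": "oral",  "voicing": "voiced"},
    "p": {"manner": "stop", "place": "bilabial", "velum": "oral",  "voicing": "voiceless"},
    "m": {"manner": "stop", "place": "bilabial", "velum": "nasal", "voicing": "voiced"},
    "d": {"manner": "stop", "place": "alveolar", "velum": "oral",  "voicing": "voiced"},
}

def differing_features(x, y):
    """The feature dimensions on which two phonemes differ."""
    return {f for f in PHONEMES[x] if PHONEMES[x][f] != PHONEMES[y][f]}

# Each of /p/, /m/, /d/ is one feature change away from /b/: a change of
# state in a single articulatory dimension yields another code element.
assert differing_features("b", "p") == {"voicing"}
assert differing_features("b", "m") == {"velum"}
assert differing_features("b", "d") == {"place"}
```

The point of representing phonemes this way is exactly the one the next paragraph draws: the code's elements differ by states of a few independent dimensions, each tied to a separately controllable part of the motor machinery.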

That subphonemic features are present both in production and perception has by now been quite clearly established.20 Later in this paper we will discuss some relevant data on production. Here we merely note that we must deal with the phonemes in terms of their constituent features because the existence of such features is essential to the speech code and to the efficient production and perception of language. We have earlier remarked that high rates of speech would overtax the temporal resolving power of the ear if the acoustic signal were merely a cipher on the phonemic structure of the language. Now we should note that speaking at rates of 10 to 15 phonemes per second would as surely be impossible if each phoneme required a separate and distinct oral gesture. Such rates can be achieved only if separate parts of the articulatory machinery—muscles of the lips, tongue, velum, etc.—can be separately controlled, and if the linguistic code is such that a change of state for any one of these articulatory entities, taken together with the current state of the others, is a change to another element of the code—that is, to another phoneme. Thus, dividing the load among the articulators allows each to operate at a reasonable pace, and tightening the code keeps the information rate high. It is this kind of parallel processing that makes it possible to get high-speed performance

20 For discussions of the psychological reality of phoneme segments and subphonemic features, see Kozhevnikov and Chistovich, 1965; Stevens and House, in press; Wickelgren, 1966.


with low-speed machinery. As we will see, it also accounts for the overlapping and intermixing of the acoustic signals for the phonemes that is characteristic of the speech code.
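The arithmetic behind the load-dividing argument is simple enough to spell out. In the sketch below, only the 10-to-15-phonemes-per-second figure comes from the text; the number of articulators and the states per articulator are illustrative assumptions.

```python
# Back-of-envelope version of the parallel-processing argument.
phoneme_rate = 12   # phonemes per second (the paper cites 10 to 15)
articulators = 4    # independently controlled parts, e.g. lips, tongue, velum, larynx

# Serial coding: one complete, distinct oral gesture per phoneme would
# demand the full phoneme rate from a single articulatory system.
serial_rate_per_system = phoneme_rate

# Parallel coding: if successive phonemes typically differ by a change of
# state in one articulator, the gesture load is spread across the machinery.
parallel_rate_per_articulator = phoneme_rate / articulators
assert parallel_rate_per_articulator == 3.0  # a far more comfortable pace

# Tightening the code keeps the information rate high: with only a few
# states per articulator, the combinations span a phoneme-sized inventory.
states_per_articulator = [2, 4, 2, 2]  # illustrative state counts
inventory = 1
for s in states_per_articulator:
    inventory *= s
assert inventory == 32  # 2 * 4 * 2 * 2 distinguishable code elements
```

Each articulator runs at a few state changes per second, yet because any single change yields a new code element, the combined stream still delivers phonemes at the full rate: high-speed performance from low-speed machinery.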

A model for production. The simplest possible model of the production process would, then, have the phonemes of an utterance represented by sets of subphonemic features, and these in turn would be assumed to exist in the central nervous system as implicit instructions21 to separate and independent parts of the motor machinery. The sequence of neural signals corresponding to this multidimensional string of control instructions may well require, for its actualization, some amplitude adjustments and temporal coordination, in the box labeled "Neuromotor Rules," in order to yield the neural impulses that go directly to the selected muscles of articulation and cause them to contract. If we are to continue to make the simplest possible assumption, however, we must suppose that reorganization of the instructions at this stage would be limited to such coordination and to providing supplementary neural signals to insure cooperative activity of the remainder of the articulatory apparatus; there would be no reorganization of the commands to the "primary" actuators for the selected features. In that case, the neural signals that emerge would bear still an essentially one-to-one correspondence with the several dimensions of the subphonemic structure. Indeed, this is a necessary condition for our model, since total reorganization at this stage would destroy the parallel processing, referred to earlier, on which high-speed reception depends, or else yield a syllabic language from which the phoneme strings could not be recovered.

21 These instructions might be of two types, "on-off" or "go to," even in a maximally simple model. In the one case, the affected muscle would contract or not with little regard for its current state (or the position of the articulator it moves); in the other, the instruction would operate via the γ-efferent system to determine the degree of contraction (hence, the final position of the articulator, whatever its initial position). Both types of instruction—appropriate, perhaps, to fast and slow gestures, respectively—may reasonably be included in the model.

The next conversion, from neural command (in the final common paths) to muscle contraction, takes place in the box labeled "Myomotor Rules." If muscles contract in accordance with the signals sent to them, then this conversion should be essentially trivial, and we should be able not only to observe the muscle contractions by looking at their EMG signals, but also to infer the neural signals at the preceding level.

It is at the next stage—the conversion from muscle contraction to vocal tract shapes by way of Articulatory Rules—that a very considerable amount of encoding or restructuring must surely occur. If we take into account the structure and function of the articulatory system, in particular the intricate linkages and the spatial overlap of the component parts, we must suppose that the relation between contraction and resulting shape is complex, though predictable. True encoding occurs as a consequence of two further aspects of this conversion: the fact that the subphonemic features can be, and are, put through in parallel means that each new set of contractions (a) starts from whatever configuration then exists (as the result of the preceding contractions) and (b) typically occurs before the last set has ended, with the result that the shape of the tract at any instant represents the merged effects of past and present instructions. Such merging is, in effect, an encoding operation according to our use of that term, since it involves an extensive restructuring of the output—in this case, the

Page 18: VOL. NOVEMBER 1967 PSYCHOLOGICAL REVIEW...PSYCHOLOGICAL REVIEW PERCEPTION OF THE SPEECH CODE1 A. M. LIBERMAN,2 F. S. COOPER, D. P. SHANKWEILER, AND M. STUDDERT-KENNEDY» Haskins Laboratories,

448 LIBERMAN, COOPER, SHANKWEILER, AND STUDDERT-KENNEDY

shape of the tract. The relation ofmessage units to code becomes espe-cially complex when temporal and spa-tial overlaps occur together. Thus, theconversion from muscle contraction toshape is, by itself, sufficient to producethe kinds of complex relation betweenphoneme and sound that we have foundto exist in the overall process of speechproduction, that is, a loss of segmenta-bility together with very large changesin the essential acoustic signals as afunction of context. Given the struc-ture of the articulatory apparatus, thesecomplexities appear to be a necessaryconcomitant of the parallel processingthat makes the speech code so efficient.

The final conversion, from shape to sound, proceeds by what we have in Figure 5 called Acoustic Rules. These rules, which are now well understood,22

are complex from a computational standpoint, but they operate on an instant-by-instant basis and yield (for the most part) one-to-one relations of shape to sound.

Does the Encoding Truly Occur in the Conversion from Command to Shape?

We have supposed that, of the four conversions between phoneme and sound, one at least—the conversion from contractions to tract shape—is calculated to produce an encoding of the kind we found when earlier we examined the acoustic cues for phoneme perception. But does it, in fact, produce that encoding, and if so, does it account for all of it? Downstream from this level there is only the conversion from shape to sound, and that, though complex, does not appear to involve encoding. But what of the upstream conversions, particularly the one that lies between the neural representations of the phonemes and the commands to the articulatory muscles? We cannot at the present time observe those processes, nor can we directly measure their output—that is, the commands to the muscles. We can, however, observe some aspects of the contractions—for example, the electromyographic correlates—and if we assume, as seems reasonable, that the conversion from command to contractions is straightforward, then we can quite safely infer the structure of the commands.23 By determining to what extent those inferred commands (if not the electromyographic signals themselves) are invariant with the phoneme we can, then, discover how much of the encoding occurs in the conversion from contraction to shape (Articulatory Rules) and how much at higher levels.

22 The dependence of speech sound on vocal tract shape (and movement) has, of course, been studied by many workers and for many years. A landmark in this field, and justification for our statement, is the book by Fant (1960).

23 For commands of the on-off type (see Footnote 21), we would expect muscle contractions—and EMG potentials—to be roughly proportional to the commands; hence, the commands will be mirrored directly by the EMG potentials when these can be measured unambiguously for the muscles of interest. All these conditions are realizable, at least approximately, for a number of phoneme combinations—for example, the bilabial stops and nasal consonants with unrounded vowels, as described in later paragraphs.

Commands of the "go to" type, which may well be operative in gestures that are relatively slow and precise, would presumably operate via the γ-efferent system to produce only so much contraction of the muscle as is needed to achieve a target length. The contraction—and the resulting EMG signal—would then be different for different starting positions, that is, for the same phoneme in different contexts. Even so, the significant aspects of the command can be inferred, since presence versus absence and sequential position (if not the relative timing) of the EMG signal persist despite even large changes in its magnitude.

For general discussions of the role of electromyography in research on speech, see the reviews by Cooper (1965) and Fromkin and Ladefoged (1966).


Motor commands as presumed invariants. Before discussing the electromyographic evidence, we should say what kind of invariance with the subphonemic features (hence, with the phonemes) we might, at best, expect to find. It should be clear from many of the data presented in this paper that language is no more a left-to-right process at the acoustic level than it is syntactically. As we have seen, the acoustic representations of the successive phonemes are interlaced. Some control center must therefore "know" what the syllable is to be in order that its component subphonemic features may be appropriately combined and overlapped. There are other grounds than the data we have cited for inferring such syllabic organization. Temporal relations between syllables may be adjusted for slow or rapid speech and for changes in syllable duration such as Lindblom has reported for a given syllable placed in different polysyllabic contexts. But these changes in syllable duration do not affect all portions of the syllable equally: timing relations within the syllable are also adjusted according to context (Lindblom, 1964), and this must entail variation in the relative timing of the component subphonemic features, as is suggested by Kozhevnikov and Chistovich's (1965) analysis of the articulatory structure of consonant clusters. Ohman (1964, 1966) has proposed a model for syllable articulation consistent with this analysis. Such contextual variations preclude the invariance of all the features that comprise a phoneme in a particular context. The most that we can expect is that some subset of these features, and so of the neural signals to the muscles (after operation of the Neuromotor Rules of Figure 5), will be invariant with the phoneme; there will then be for each subphonemic feature characteristic neuromotor "markers," implicating only one or a few component parts of the system, perhaps only the contraction of a single specific muscle. These characteristic components of the total neural activity we will refer to as "motor commands." 24 We should also emphasize that we refer here to the phoneme as perceived, not as an abstract linguistic entity serving primarily a classificatory function.

Indications from electromyography. What, then, do we find when we look at the electromyographic (EMG) correlates of articulatory activity? Recent research in our laboratory and several others has been directed specifically to the questions raised in the preceding paragraphs, but the data are, as yet, quite limited. As we see these data, they do, however, permit some tentative conclusions.

When two adjacent phonemes are produced by spatially separate groups of muscles, there are essentially invariant EMG tracings from the characteristic gestures for each phoneme, regardless of the identity of the other.25

24 Thus, the motor commands are, in one sense, abstract "-eme" type entities, with invariance assumed and observation directed to their discovery and enumeration; in another sense, and to the degree that observation justifies the assumption about invariance, motor commands constitute the essential subset of real neural signals with which a general model of production and perception should be principally concerned.

25 It is not, of course, necessary that the EMG signals be precisely the same since invariance is expected only of the motor commands inferred from them. In the minimally complicated cases cited in this paragraph, the commands do transform into EMG potentials that are essentially the same (in different contexts), aside from some differences in magnitude.

Magnitude differences do occur, however, variously for the individual and context, and may have linguistic significance attributed to them. Thus, Fromkin (1966) concludes, mainly from EMG data on the lip-closing gestures of her principal speaker:


This has been found for /f/, for example, followed by /s/, /t/, /ts/, /θ/, or /θs/ and preceded by /i/, /il/, or /im/ (MacNeilage, 1963). Similarly, invariance has been found for /b/, /p/, and /m/ regardless of the following vowel (Fromkin, 1966; Harris, Lysaught, & Schvey, 1965). Here it is easy to associate the place and manner features with the contractions of specific muscles, and to equate the EMG signals for a specific feature wherever it occurs, at least within this limited set of phonemes. We should emphasize that corresponding invariance is not to be found in the acoustic signal: for /f/ the duration of the noise varies over a range of two to one in the contexts listed above; the vast differences in acoustic cue for /b/ or /p/ before various vowels were described in an earlier section.

When the temporally overlapping gestures for successive phonemes involve more or less adjacent muscles that control the same structures, it is of course more difficult to discover whether there is invariance or not.26

The results of the present investigation do not support the hypothesis that a simple one-to-one correspondence exists between a phoneme and its motor commands. For the bilabial stops /b/ and /p/ different motor commands produce different muscular gestures for these consonants occurring in utterance initial and final positions [p. 195].

Harris, Lysaught, and Schvey (1965) report, on the contrary, that differences in EMG signal for the lip-closing gesture did not reach statistical significance for two groups of five subjects each in two experiments, one of which overlapped Fromkin's study. They noted, though, that each of the subjects showed an individual pattern of small but consistent variations of EMG signal with context. (For a discussion of these differences in the interpretation of basically similar data, see Cooper, 1966.)

26 A practical difficulty exists in allocating observed EMG potentials to specific muscles when more than one muscle is close to a surface electrode. Needle electrodes will often resolve such ambiguities, but they pose other problems of a practical nature.

This is the situation for the /di, du/ examples we used earlier. In our own studies of such cases we find essentially identical EMG signals from tongue-tip electrodes for the initial consonant; however, the signal for a following phoneme that also involves the tongue tip may show substantial changes from its characteristic form (Harris, 1963; Harris, Huntington, & Sholes, 1966). Such changes presumably reflect the execution of position-type commands (see Footnotes 21 and 23) rather than reorganization at a higher level. True reorganization of the neural commands is not excluded by such data—indeed, we have some data that might be so interpreted (MacNeilage, DeClerk, & Silverman, 1966)—but the evidence so far is predominantly on the side of invariance.

If the commands—or, indeed, signals of any kind—are to be invariant with the phonemes, they must reflect the segmentation that is so conspicuously absent at the acoustic level. Do we, then, find such segmentation in the EMG records? Before answering that question, we should remind ourselves that the activity of a single muscle or muscle group does not, in any case, reflect a phoneme but only a subphonemic feature, and that a change from one phoneme to another may require a change in command for only one of several features. We should not expect, therefore, that all the subphonemic features will start and stop at each phoneme boundary, and, in fact, they do not. We do find, however, in the onsets and offsets of EMG activity in various muscles a segmentation like that of the several dimensions that constitute the phoneme. One can see, then, where the phoneme boundaries must be.27 This is in striking contrast to what is found at the acoustic level. There, as we have earlier pointed out, almost any segment we can isolate contains information about—and is a cue for—more than one phoneme. Thus in the matter of segmentation, too, the EMG potentials—and even more the motor commands inferred from them—bear a simpler relation to the perceived phonemes than does the acoustic signal.

In summary we should say that we do not yet know precisely to what extent motor commands are invariant with the phonemes. It seems reasonably clear, however, that they are more nearly invariant than are the acoustic signals, and we may conclude, therefore, that a substantial part of the restructuring occurs below the level of the commands. Whether some significant restructuring occurs also at higher levels can only be determined by further investigation.

Decoding: From Sound to Phoneme

If speech were a simple cipher one might suppose that phoneme perception could be explained by the ordinary principles of psychophysics and discrimination learning. On that view, which is probably not uncommon among psychologists, speech consists of sounds, much like any others, that happen to signal the phoneme units of the language. It is required of the listener only that he learn to connect each sound with the name of the appropriate phoneme. To be sure, the sounds must be discriminably different, but this is a standard problem of auditory psychophysics and poses no special difficulty for the psychologist. It would follow that perception of speech is not different in any fundamental way from perception of other sounds (assuming sufficient practice), and that, as Lane (1965) and Cross, Lane, and Sheppard (1965) suppose, no special perceptual mechanism is required.

27 In some cases the boundaries are sharply marked by a sudden onset of EMG signal for a particular muscle. Usually, though, the onsets are less abrupt and activity persists for a few hundred milliseconds. This might seem inadequate to provide segmentation markers for the rapid-fire phoneme sequences; indeed, precise time relationships may be blurred in normal articulation, but without obscuring the sequential order of events and the separateness of the channels carrying information about the several features.

The point of this paper has been, to the contrary, that speech is, for the most part, a special code that must often make use of a special perceptual mechanism to serve as its decoder.28 On the evidence presented here we should conclude, at the least, that most phonemes cannot be perceived by a straightforward comparison of the incoming signal with a set of stored phonemic patterns or templates, even if one assumes mechanisms that can extract simple stimulus invariances such as, for example, constant ratios. To find acoustic segments that are in any reasonably simple sense invariant with linguistic (and perceptual) segments—that is, to perceive without decoding—one must go to the syllable level or higher. Now the number of syllables, reckoned simply as the number of permissible combinations and permutations of phonemes, is several thousand in English. But the number of acoustic patterns is far greater than that, since the acoustic signal varies considerably with the speaker and with the stress, intonation, and tempo of his speech (Cooper, Liberman, Lisker, & Gaitenby, 1963; Liberman, Ingemann, Lisker, Delattre, & Cooper, 1959; Lindblom, 1963; Peterson & Sivertsen, 1960; Peterson, Wang, & Sivertsen, 1958; Sivertsen, 1961). To perceive speech by matching the acoustic signal to so many patterns would appear, at best, uneconomical and inelegant. More important, it would surely be inadequate, since it must fail to account for the fact that we do perceive phonemes. No theory can safely ignore the fact that phonemes are psychologically real.

28 We referred earlier to perception in the speech mode and indicated that it was not operative—or, at least, not controlling—for some speech sounds as well as all nonspeech sounds. It need hardly be said, then, that the speech decoder provides only one pathway for perception; nonspeech sounds obviously require another pathway, which may serve adequately for the unencoded aspects of speech as well.

How, then, is the phoneme to be recovered? We can imagine, at least in general outline, two very different possibilities: one, a mechanism that operates on a purely auditory basis; the other, by reference to the processes of speech production.

The case for an auditory decoder. The relevant point about an auditory decoder is that it would process the signal in auditory terms—that is, without reference to the way in which it was produced—and successfully extract the phoneme string. On the basis of what we know of speech, and of the attempts to build an automatic phoneme recognizer, we must suppose such a device would have to be rather complex. There is, to be sure, some evidence for the existence of processing mechanisms in the auditory system that illustrate at a very simple level one kind of thing that an auditory decoder might have to do. Thus, Whitfield and Evans (1965) have found single cells in the auditory cortex of the cat that respond to frequency changes but not to steady tones. Some of these cells responded more to tones that are frequency modulated upward and others to tones modulated downward. Such specificity of response also exists at lower levels in the auditory nervous system (Nelson, Erulkar, & Bryant, 1966). These findings are examples of a kind of auditory processing that might, perhaps, be part of a speech decoder, but one that would have to go beyond these processes—to more complex ones—if it were to deal successfully with the fact that very different transitions will, in different contexts, cue the same phoneme. But if this could be accomplished, and if it were done independently of motor parameters, we should have a purely auditory decoder.

The case for mediation by production. Though we cannot exclude the possibility that a purely auditory decoder exists, we find it more plausible to assume that speech is perceived by processes that are also involved in its production. The most general and obvious motivation for such a view is that the perceiver is also a speaker and must be supposed, therefore, to possess all the mechanisms for putting language through the successive coding operations that result eventually in the acoustic signal.29 It seems unparsimonious to assume that the speaker-listener employs two entirely separate processes of equal status, one for encoding language and the other for decoding it. A simpler assumption is that there is only one process, with appropriate linkages between sensory and motor components.30

29 We have noted that training in reading speech spectrograms has so far not succeeded in developing in the trainee a visual decoder for speech comparable to the one that works from an auditory input. This may well reflect the existence of special mechanisms for the speech decoder that are lacking for its visual counterpart. In general, a theory about the nature of the speech decoder—and, in particular, a motor theory—must be concerned with the nature of the mechanism, though not necessarily with the question of how it was acquired in the history of the individual or the race. This is an interesting, but separate, question and is not considered here.

30 We should suppose that the links between perception and articulation exist at relatively high levels of the nervous system. For information about, or reference to, motor activity, the experienced organism need not rely—at least not very heavily and certainly not exclusively—on proprioceptive returns from the periphery, for example, from muscular contractions. (See von Holst & Mittelstaedt, 1950.)

Apart from parsimony, there are strong reasons for considering this latter view. Recall, for example, the case of /di, du/. There we saw that the acoustic patterns for /d/ were very different, though the perception of the consonantal segment was essentially the same. Since it appears that the /d/ gesture—or, at least, some important, "diagnostic" part of it—may also be essentially the same in the two cases, we are tempted to suppose that one hears the same /d/ because perception is mediated by the neuromotor correlates of gestures that are the same. This is basically the argument we used in one of our earliest papers (Liberman, Delattre, & Cooper, 1952) to account for the fact that bursts of sound at very different frequencies are required to produce the perception of /k/ before different vowels. Extending the argument, we tried in that same paper to account also for the fact that the same burst of sound is heard as /p/ before /i/ but as /k/ before /a/, the point being that, because of temporal overlap (and consequent acoustic encoding), very different gestures happen to be necessary in these different vowel environments in order to produce the same acoustic effect (the consonant burst). The argument was applied yet again to account for a finding about second-formant transitions as cues: that the acoustic cues for /g/ can be radically different even when the consonant is paired with closely related vowels (i.e., in the syllables /ga/ and /go/), yet the perception and the articulation are essentially the same (Liberman, 1957). But these are merely striking examples of what must be seen now as a general rule: there is typically a lack of correspondence between acoustic cue and perceived phoneme, and in all these cases it appears that perception mirrors articulation more closely than sound.31 If this were not so, then for a listener to hear the same phoneme in various environments, speakers would have to hold the acoustic signal constant, which would, in turn, require that they make drastic changes in articulation. Speakers need not do that—and in all probability they cannot—yet listeners nevertheless hear the same phoneme. This supports the assumption that the listener uses the inconstant sound as a basis for finding his way back to the articulatory gestures that produced it and thence, as it were, to the speaker's intent.

The categorical perception of stop consonants also supports this assumption.32 As described earlier in this paper, perception of these sounds is categorical, or discontinuous, even when the acoustic signal is varied continuously. Quite obviously, the required articulations would also be discontinuous. With /b, d, g/, we can vary the acoustic cue along a continuum, which corresponds, in effect, to closing the vocal tract at various points along its length. But in actual speech the closing is accomplished by discontinuous or categorically different gestures: by the lips for /b/, the tip of the tongue for /d/, and the back of the tongue for /g/. Here, too, perception appears to be tied to articulation.

31 For further discussion of this point, see Liberman, 1957; Cooper, Liberman, Harris, and Grubb, 1958; Lisker, Cooper, and Liberman, 1962; Liberman, Cooper, Harris, MacNeilage, and Studdert-Kennedy, in press.

32 For further discussion of this point, see Liberman, Cooper, Harris, and MacNeilage, 1962.


The view that speech is perceived by reference to production is receiving increased attention. Researchers at the Pavlov Institute in Leningrad have adduced various kinds of evidence to support a similar hypothesis (Chistovich, 1960; Chistovich, Klaas, & Kuz'min, 1962). Ladefoged (1959) has presented a motor-theoretical interpretation of some aspects of speech perception. More recently, Ladefoged and McKinney (1963) have found in studies of stress that perception is related more closely to certain low-level aspects of stress production than it is to the acoustic signal. And in a very recent study Lieberman (1967) has found interesting evidence for a somewhat similar conclusion in regard to the perception of some aspects of intonation.

The role of productive processes in models for speech perception. In its most general form, the hypothesis being described here is not necessarily different in principle from a model for speech perception called "analysis-by-synthesis" that has been advanced by Stevens (1960) and by Halle and Stevens (1962; Stevens & Halle, in press) following a generalized model proposed by MacKay (1951). In contrast, perhaps, to the computer-based assumptions of analysis-by-synthesis, we would rather think in terms of overlapping activity of several neural networks—those that supply control signals to the articulators and those that process incoming neural patterns from the ear—and to suppose that information can be correlated by these networks and passed through them in either direction. Such a formulation, in rather general terms, has been presented elsewhere (Liberman, Cooper, Studdert-Kennedy, Harris, & Shankweiler, in press).

The most general form of the view that speech is perceived by reference to production does not specify the level at which the message units are recovered. The assumption is that at some level or levels of the production process there exist neural signals standing in one-to-one correspondence with the various segments of the language—phoneme, word, phrase, etc. Perception consists in somehow running the process backward, the neural signals corresponding to the various segments being found at their respective levels. In phoneme perception—our primary concern in this paper—the invariant is found far down in the neuromotor system, at the level of the commands to the muscles. Perception by morphophonemic, morphemic, and syntactic rules of the language would engage the encoding processes at higher levels. The level at which the encoding process is entered for the purposes of perceptual decoding may, furthermore, determine which shapes can and cannot be detected in raw perception. The invariant signal for the different acoustic shapes that are all heard as /d/, for example, may be found at the level of motor commands. In consequence, the listener is unaware, even in the most careful listening, that the acoustic signals are, in fact, quite radically different. On the other hand, a listener can readily hear the difference between tokens of the morphophoneme {S} when it is realized as /s/ in cats and as /z/ in dogs, though he also "knows" that these two acoustic and phonetic shapes are in some sense the same. If the perception of that commonality is also by reference to production, the invariant is surely at a level considerably higher than the motor commands.

THE EFFICIENCY OF THE SPEECH CODE

Speech can be produced rapidly because the phonemes are processed in parallel. They are taken apart into their constituent features, and the features belonging to successive phonemes are overlapped in time. Thus the load is divided among the many largely independent components of the articulatory system. Within the phonological constraints on combinations and sequences that can occur, a speaker produces phonemes at rates much higher than those at which any single articulatory component must change its state.33

In the conversion of these multidimensional and overlapping articulatory events into sound, a complex encoding occurs. The number of dimensions is necessarily reduced, but the parallel transmission of phonemic information is retained in that the cues for successive phonemes are imprinted on a single aspect of the acoustic signal. Thus, the movements of articulators as independent as the lips and the tongue both affect the second formant and its transitions: given an initial labial consonant overlapped with a tongue shape appropriate for the following vowel, one finds, as we have already shown, that the second-formant transition simultaneously carries information about both phonemic segments.34

33 The phonological constraints may, in fact, play an essential part in making the decoding operation fast and relatively free of error. Since the set of phonemes and phoneme sequences that is used in any particular language is far smaller than the possible set of feature combinations—smaller even than the combinations that are physiologically realizable—information about the allowable combinations of features could be used in the decoding process to reestablish temporal relationships that may have been blurred in articulation (see Footnote 27). Such a "recutting" operation would ease the requirements on precision of articulation and so allow faster communication; also, it would make unambiguous the segmentation into phonemes, thereby qualifying them as units of immediate perception. One may speculate, further, that the serial string of reconstituted phonemes is useful—perhaps essential—in the next operation of speech reception, namely, gaining immediate access to an inventory of thousands of syllable- or word-size units.

If the listener possesses some device for recovering the articulatory events from their encoded traces in the sound stream, then he should perceive the phonemes well and, indeed, evade several limitations that would otherwise apply in auditory perception. As we pointed out early in this paper, the temporal resolving power of the ear sets a relatively low limit on the rate at which discrete acoustic segments can be perceived. To the extent that the code provides for parallel processing of successive phonemes, the listener can perceive strings of phonemes more rapidly than he could if the acoustic signals for them were arranged serially, as in an alphabet. Thus, the parallel processing of the phonemes is as important for efficient perception as for production.

We also referred earlier to the difficulty of finding a reasonable number of acoustic signals of short duration that can be readily and rapidly identified. This limitation is avoided if the decoding of the acoustic signal enables the listener to recover the articulatory events that produced it, since perception then becomes linked to a system of physiological coordinates more richly multidimensional—hence more distinctive—than the acoustic (and auditory) signal.

Having said of the speech code that it seems particularly well designed to circumvent the shortcomings of the ear, we should consider whether its accomplishments stop there. As we pointed out earlier, these shortcomings do not apply to the eye, for example, and we do indeed find that the best way to communicate language in the visual mode is by means of an alphabet or cipher, not a code. But we also noted that the acoustic code is more easily and naturally perceived than is the optical alphabet. Perhaps this is due primarily to the special speech decoder, whose existence we assumed for the conversion of sound to phoneme. We would suggest an additional possibility: the operations that occur in the speech decoder—including, in particular, the interdependence of perceptual and productive processes—may be in some sense similar to those that take place at other levels of grammar. If so, there would be a special compatibility between the perception of speech sounds and the comprehension of language at higher stages. This might help to explain why, so far from being merely one way of conveying language, the sounds of speech are, instead, its common and privileged carriers.

34 Fant (1962a) makes essentially the same point:

The rules relating speech waves to speech production are in general complex since one articulation parameter, e.g., tongue height, affects several of the parameters of the spectrogram. Conversely, each of the parameters of the spectrogram is generally influenced by several articulatory variables [p. 5].

REFERENCES

ABRAMSON, A. S. Identification and discrimination of phonemic tones. Journal of the Acoustical Society of America, 1961, 33, 842. (Abstract)

ABRAMSON, A. S., & LISKER, L. Voice onset time in stop consonants: Acoustic analysis and synthesis. Reports of the Fifth International Congress on Acoustics, 1965, 1a, Paper AS1.

BASTIAN, J., & ABRAMSON, A. S. Identification and discrimination of phonemic vowel duration. Journal of the Acoustical Society of America, 1962, 34, 743. (Abstract)

BASTIAN, J., DELATTRE, P. C., & LIBERMAN, A. M. Silent interval as a cue for the distinction between stops and semivowels in medial position. Journal of the Acoustical Society of America, 1959, 31, 1568. (Abstract)

BASTIAN, J., EIMAS, P. D., & LIBERMAN, A. M. Identification and discrimination of a phonemic contrast induced by silent interval. Journal of the Acoustical Society of America, 1961, 33, 842. (Abstract)

BRADY, P. T., HOUSE, A. S., & STEVENS, K. N. Perception of sounds characterized by a rapidly changing resonant frequency. Journal of the Acoustical Society of America, 1961, 33, 1337-1362.

BROADBENT, D. E. The role of auditory localization in attention and memory span. Journal of Experimental Psychology, 1954, 47, 191-196.

BROADBENT, D. E., & GREGORY, M. Accuracy of recognition for speech presented to the right and left ears. Quarterly Journal of Experimental Psychology, 1964, 16, 359-360.

BRYDEN, M. P. Ear preference in auditory perception. Journal of Experimental Psychology, 1963, 65, 103-105.

CHANEY, R. B., & WEBSTER, J. C. Information in certain multidimensional acoustic signals. Report No. 1339, 1965. United States Navy Electronics Laboratory Reports, San Diego, Calif.

CHISTOVICH, L. A. Classification of rapidly repeated speech sounds. Akusticheskii Zhurnal, 1960, 6, 392-398. (Trans. in Soviet Physics-Acoustics, New York, 1961, 6, 393-398.)

CHISTOVICH, L. A., KLAAS, Y. A., & KUZ'MIN, Y. I. The process of speech sound discrimination. Voprosy Psikhologii, 1962, 8, 26-39. (Research Library, Air Force Cambridge Research Laboratories, Bedford, Mass. TT-64-13064 35-P)

CHOMSKY, N. Current issues in linguistic theory. In J. A. Fodor & J. J. Katz (Eds.), The structure of language. Englewood Cliffs, N. J.: Prentice-Hall, 1964. Pp. 50-118.

COFFEY, J. L. The development and evaluation of the Battelle Aural Reading Device. Proceedings of the International Congress on Technology and Blindness I. New York: American Foundation for the Blind, 1963. Pp. 343-360.

COOPER, F. S. Research on reading machines for the blind. In P. A. Zahl (Ed.), Blindness: Modern approaches to the unseen environment. Princeton, N. J.: Princeton University Press, 1950. Pp. 512-543. (a)

COOPER, F. S. Spectrum analysis. Journal of the Acoustical Society of America, 1950, 22, 761-762. (b)

COOPER, F. S. Some instrumental aids to research on speech. Report on the Fourth Annual Round Table Meeting on Linguistics and Language Teaching. Monograph Series No. 3. Washington: Georgetown University Press, 1953. Pp. 46-53.

COOPER, F. S. Research techniques and instrumentation: EMG. American Speech and Hearing Association Reports, 1965, 1, 153-158.

COOPER, F. S. Describing the speech process in motor command terms. Journal of the Acoustical Society of America, 1966, 39, 1221. (Abstract) (Status Report of Speech Research, Haskins Laboratories, SR-5/6, 1966, 2.1-2.27—text.)

COOPER, F. S., DELATTRE, P. C., LIBERMAN, A. M., BORST, J. M., & GERSTMAN, L. J. Some experiments on the perception of synthetic speech sounds. Journal of the Acoustical Society of America, 1952, 24, 597-606.

COOPER, F. S., LIBERMAN, A. M., & BORST, J. M. The interconversion of audible and visible patterns as a basis for research in the perception of speech. Proceedings of the National Academy of Sciences, 1951, 37, 318-325.

COOPER, F. S., LIBERMAN, A. M., HARRIS, K. S., & GRUBB, P. M. Some input-output relations observed in experiments on the perception of speech. Proceedings of the Second International Congress of Cybernetics, 1958. Namur, Belgium: Association Internationale de Cybernétique. Pp. 930-941.

COOPER, F. S., LIBERMAN, A. M., LISKER, L., & GAITENBY, J. Speech synthesis by rules. Proceedings of the Speech Communication Seminar, Stockholm, 1962. Stockholm: Royal Institute of Technology, 1963. F2.

CROSS, D. V., LANE, H. L., & SHEPPARD, W. C. Identification and discrimination functions for a visual continuum and their relation to the motor theory of speech perception. Journal of Experimental Psychology, 1965, 70, 63-74.

DELATTRE, P. C. Les indices acoustiques de la parole: Premier rapport. Phonetica, 1958, 2, 108-118, 226-251.

DELATTRE, P. C. Le jeu des transitions des formants et la perception des consonnes. Proceedings of the Fourth International Congress of Phonetic Sciences, Helsinki, 1961. 's-Gravenhage: Mouton, 1962. Pp. 407-417.

DELATTRE, P. C., LIBERMAN, A. M., & COOPER, F. S. Voyelles synthétiques à deux formants et voyelles cardinales. Le Maître Phonétique, 1951, 96, 30-37.

DELATTRE, P. C., LIBERMAN, A. M., & COOPER, F. S. Acoustic loci and transitional cues for consonants. Journal of the Acoustical Society of America, 1955, 27, 769-773.

DELATTRE, P. C., LIBERMAN, A. M., & COOPER, F. S. Formant transitions and loci as acoustic correlates of place of articulation in American fricatives. Studia Linguistica, 1964, 18, 104-121.

DELATTRE, P. C., LIBERMAN, A. M., COOPER, F. S., & GERSTMAN, L. J. An experimental study of the acoustic determinants of vowel color: Observations on one- and two-formant vowels synthesized from spectrographic patterns. Word, 1952, 8, 195-210.

DENES, P. B. On the statistics of spoken English. Journal of the Acoustical Society of America, 1963, 35, 892-904.

EIMAS, P. D. The relation between identification and discrimination along speech and non-speech continua. Language and Speech, 1963, 6, 206-217.

FANT, C. G. M. Acoustic theory of speech production. 's-Gravenhage: Mouton, 1960.

FANT, C. G. M. Descriptive analysis of the acoustic aspects of speech. Logos, 1962, 5, 3-17. (a)

FANT, C. G. M. Sound spectrography. Proceedings of the Fourth International Congress of Phonetic Sciences, Helsinki, 1961. 's-Gravenhage: Mouton, 1962. Pp. 14-33. (b)

FANT, C. G. M. Theory of distinctive features. Speech Transmission Laboratory Quarterly Progress and Status Report, Royal Institute of Technology (KTH), Stockholm, January 15, 1967. Pp. 1-14.

FREIBERGER, J., & MURPHY, E. F. Reading machines for the blind. IRE Professional Group on Human Factors in Electronics, 1961, HFE-2, 8-19.

FROMKIN, V. A. Neuro-muscular specification of linguistic units. Language and Speech, 1966, 9, 170-199.

FROMKIN, V. A., & LADEFOGED, P. Electromyography in speech research. Phonetica, 1966, 15, 219-242.

FRY, D. B., ABRAMSON, A. S., EIMAS, P. D., & LIBERMAN, A. M. The identification and discrimination of synthetic vowels. Language and Speech, 1962, 5, 171-189.

GALUNOV, V. I., & CHISTOVICH, L. A. Relationship of motor theory to the general problem of speech recognition (review). Akusticheskii Zhurnal, 1965, 11, 417-426. (Trans. in Soviet Physics-Acoustics, New York, 1966, 11, 357-365.)

GELB, I. J. A study of writing. Chicago: University of Chicago Press, 1963.

GRIFFITH, B. C. A study of the relation between phoneme labeling and discriminability in the perception of synthetic stop consonants. Unpublished doctoral dissertation, University of Connecticut, 1958.

HADDING-KOCH, K., & STUDDERT-KENNEDY, M. An experimental study of some intonation contours. Phonetica, 1964, 11, 175-185. (a)

HADDING-KOCH, K., & STUDDERT-KENNEDY, M. Intonation contours evaluated by American and Swedish test subjects. Proceedings of the Fifth International Congress of Phonetic Sciences, Münster, August 1964. (b)

HALLE, M. On the bases of phonology. In J. A. Fodor & J. J. Katz (Eds.), The structure of language. Englewood Cliffs, N. J.: Prentice-Hall, 1964. Pp. 324-333.

HALLE, M., & STEVENS, K. N. Speech recognition: A model and a program for research. IRE Transactions on Information Theory, 1962, IT-8, 155-159.

HARRIS, C. M. A study of the building blocks in speech. Journal of the Acoustical Society of America, 1953, 25, 962-969.

HARRIS, K. S. Cues for the discrimination of American English fricatives in spoken syllables. Language and Speech, 1958, 1, 1-7.

HARRIS, K. S. Behavior of the tongue in the production of some alveolar consonants. Journal of the Acoustical Society of America, 1963, 35, 784. (Abstract)

HARRIS, K. S., BASTIAN, J., & LIBERMAN, A. M. Mimicry and the perception of a phonemic contrast induced by silent interval: Electromyographic and acoustic measures. Journal of the Acoustical Society of America, 1961, 33, 842. (Abstract)

HARRIS, K. S., HOFFMAN, H. S., LIBERMAN, A. M., DELATTRE, P. C., & COOPER, F. S. Effect of third-formant transitions on the perception of the voiced stop consonants. Journal of the Acoustical Society of America, 1958, 30, 122-126.

HARRIS, K. S., HUNTINGTON, D. A., & SHOLES, G. N. Coarticulation of some disyllabic utterances measured by electromyographic techniques. Journal of the Acoustical Society of America, 1966, 39, 1219. (Abstract)

HARRIS, K. S., LYSAUGHT, G., & SCHVEY, M. M. Some aspects of the production of oral and nasal labial stops. Language and Speech, 1965, 8, 135-147.

HOCKETT, C. F. A manual of phonology. Baltimore: Waverly Press, 1955.

HUGHES, G. W., & HALLE, M. Spectral properties of fricative consonants. Journal of the Acoustical Society of America, 1956, 28, 303-310.

JAKOBSON, R., FANT, G., & HALLE, M. Preliminaries to speech analysis: The distinctive features and their correlates. Technical Report No. 13, 1952, Acoustics Laboratory, M.I.T. (Republished, Cambridge, Mass.: M.I.T. Press, 1963.)

JAKOBSON, R., & HALLE, M. Fundamentals of language. 's-Gravenhage: Mouton, 1956.

KIMURA, D. Cerebral dominance and perception of verbal stimuli. Canadian Journal of Psychology, 1961, 15, 166-171.

KIMURA, D. Left-right differences in the perception of melodies. Quarterly Journal of Experimental Psychology, 1964, 16, 355-358.

KIMURA, D. Functional asymmetry of the brain in dichotic listening. Cortex, 1967, 3, in press.

KOZHEVNIKOV, V. A., & CHISTOVICH, L. A. Rech': Artikuliatsiia i vospriiatie. Moscow-Leningrad, 1965. (Trans. as Speech: Articulation and perception. Washington: Joint Publications Research Service, 1966, 30, 543.)

LADEFOGED, P. The perception of speech. In Mechanisation of thought processes, 1959. London: H. M. Stationery Office. Pp. 397-409.

LADEFOGED, P., & MCKINNEY, N. P. Loudness, sound pressure, and sub-glottal pressure in speech. Journal of the Acoustical Society of America, 1963, 35, 454-460.

LANE, H. Motor theory of speech perception: A critical review. Psychological Review, 1965, 72, 275-309.

LIBERMAN, A. M. Some results of research on speech perception. Journal of the Acoustical Society of America, 1957, 29, 117-123.

LIBERMAN, A. M., COOPER, F. S., HARRIS, K. S., & MACNEILAGE, P. F. A motor theory of speech perception. Proceedings of the Speech Communication Seminar, Stockholm, 1962. Stockholm: Royal Institute of Technology, 1963. D3.

LIBERMAN, A. M., COOPER, F. S., HARRIS, K. S., MACNEILAGE, P. F., & STUDDERT-KENNEDY, M. Some observations on a model for speech perception. Proceedings of the AFCRL Symposium on Models for the Perception of Speech and Visual Form, Boston, November 1964. Cambridge: Massachusetts Institute of Technology Press, in press.

LIBERMAN, A. M., COOPER, F. S., STUDDERT-KENNEDY, M., HARRIS, K. S., & SHANKWEILER, D. P. Some observations on the efficiency of speech sounds. Paper presented at the XVIII International Congress of Psychology, Moscow, August 1966. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, in press.

LIBERMAN, A. M., DELATTRE, P. C., & COOPER, F. S. The role of selected stimulus variables in the perception of the unvoiced stop consonants. American Journal of Psychology, 1952, 65, 497-516.

LIBERMAN, A. M., DELATTRE, P. C., & COOPER, F. S. Some cues for the distinction between voiced and voiceless stops in initial position. Language and Speech, 1958, 1, 153-167.

LIBERMAN, A. M., DELATTRE, P. C., COOPER, F. S., & GERSTMAN, L. J. The role of consonant-vowel transitions in the perception of the stop and nasal consonants. Psychological Monographs, 1954, 68(8, Whole No. 379).

LIBERMAN, A. M., DELATTRE, P. C., GERSTMAN, L. J., & COOPER, F. S. Tempo of frequency change as a cue for distinguishing classes of speech sounds. Journal of Experimental Psychology, 1956, 52, 127-137.

LIBERMAN, A. M., HARRIS, K. S., EIMAS, P. D., LISKER, L., & BASTIAN, J. An effect of learning on speech perception: The discrimination of durations of silence with and without phonemic significance. Language and Speech, 1961, 4, 175-195.

LIBERMAN, A. M., HARRIS, K. S., HOFFMAN, H. S., & GRIFFITH, B. C. The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 1957, 54, 358-368.

LIBERMAN, A. M., HARRIS, K. S., KINNEY, J. A., & LANE, H. The discrimination of relative onset time of the components of certain speech and nonspeech patterns. Journal of Experimental Psychology, 1961, 61, 379-388.

LIBERMAN, A. M., INGEMANN, F., LISKER, L., DELATTRE, P. C., & COOPER, F. S. Minimal rules for synthesizing speech. Journal of the Acoustical Society of America, 1959, 31, 1490-1499.

LIEBERMAN, P. Intonation, perception and language. Cambridge: Massachusetts Institute of Technology Press, 1967.

LINDBLOM, B. Spectrographic study of vowel reduction. Journal of the Acoustical Society of America, 1963, 35, 1773-1781.

LINDBLOM, B. Articulatory activity in vowels. Journal of the Acoustical Society of America, 1964, 36, 1038. (Abstract)

LINDBLOM, B., & STUDDERT-KENNEDY, M. On the role of formant transitions in vowel recognition. Speech Transmission Laboratory Quarterly Progress and Status Report, Royal Institute of Technology (KTH), Stockholm, in press. (Also Status Report of Speech Research, Haskins Laboratories, in press.)

LISKER, L. Closure duration and the voiced-voiceless distinction in English. Language, 1957, 33, 42-49. (a)

LISKER, L. Minimal cues for separating /w,r,l,j/ in intervocalic position. Word, 1957, 13, 257-267. (b)

LISKER, L. Anatomy of unstressed syllables. Journal of the Acoustical Society of America, 1958, 30, 682. (Abstract)

LISKER, L., & ABRAMSON, A. S. A cross-language study of voicing in initial stops: Acoustical measurements. Word, 1964, 20, 384-422. (a)

LISKER, L., & ABRAMSON, A. S. Stop categories and voice onset time. Proceedings of the Fifth International Congress of Phonetic Sciences, Münster, August 1964. (b)

LISKER, L., COOPER, F. S., & LIBERMAN, A. M. The uses of experiment in language description. Word, 1962, 18, 82-106.

MACKAY, D. M. Mindlike behavior in artefacts. British Journal for the Philosophy of Science, 1951, 2, 105-121.

MACNEILAGE, P. F. Electromyographic and acoustic study of the production of certain final clusters. Journal of the Acoustical Society of America, 1963, 35, 461-463.

MACNEILAGE, P. F., DECLERK, J. L., & SILVERMAN, S. I. Some relations between articulator movement and motor control in consonant-vowel-consonant monosyllables. Journal of the Acoustical Society of America, 1966, 40, 1272. (Abstract)

MALECOT, A. Acoustic cues for nasal consonants. Language, 1956, 32, 274-284.

MILLER, G. A. The magical number seven, plus or minus two, or, some limits on our capacity for processing information. Psychological Review, 1956, 63, 81-96. (a)

MILLER, G. A. The perception of speech. In M. Halle (Ed.), For Roman Jakobson. 's-Gravenhage: Mouton, 1956. Pp. 353-359. (b)

MILLER, G. A., & TAYLOR, W. G. The perception of repeated bursts of noise. Journal of the Acoustical Society of America, 1948, 20, 171-182.

NELSON, P. G., ERULKAR, S. D., & BRYAN, S. S. Responses of units of the inferior colliculus to time-varying acoustic stimuli. Journal of Neurophysiology, 1966, 29, 834-860.


NYE, P. W. Aural recognition time for multidimensional signals. Nature, 1962, 196, 1282-1283.

NYE, P. W. Reading aids for blind people—a survey of progress with the technological and human problems. Medical Electronics and Biological Engineering, 1964, 2, 247-264.

NYE, P. W. An investigation of audio outputs for a reading machine. February 1965. Autonomics Division, National Physical Laboratory, Teddington, England.

O'CONNOR, J. D., GERSTMAN, L. J., LIBERMAN, A. M., DELATTRE, P. C., & COOPER, F. S. Acoustic cues for the perception of initial /w,j,r,l/ in English. Word, 1957, 13, 25-43.

OHMAN, S. E. G. Numerical model for coarticulation, using a computer-simulated vocal tract. Journal of the Acoustical Society of America, 1964, 36, 1038. (Abstract)

OHMAN, S. E. G. Coarticulation in VCV utterances: Spectrographic measurements. Journal of the Acoustical Society of America, 1966, 39, 151-168.

ORR, D. B., FRIEDMAN, H. L., & WILLIAMS, J. C. C. Trainability of listening comprehension of speeded discourse. Journal of Educational Psychology, 1965, 56, 148-156.

PETERSON, G. E., & SIVERTSEN, E. Objectives and techniques of speech synthesis. Language and Speech, 1960, 3, 84-95.

PETERSON, G. E., WANG, W. S.-Y., & SIVERTSEN, E. Segmentation techniques in speech synthesis. Journal of the Acoustical Society of America, 1958, 30, 739-742.

POLLACK, I. The information of elementary auditory displays. Journal of the Acoustical Society of America, 1952, 24, 745-749.

POLLACK, I., & FICKS, L. Information of elementary multidimensional auditory displays. Journal of the Acoustical Society of America, 1954, 26, 155-158.

POTTER, R. K., KOPP, G. A., & GREEN, H. C. Visible speech. New York: Van Nostrand, 1947.

SCHATZ, C. The role of context in the perception of stops. Language, 1954, 30, 47-56.

SHANKWEILER, D., & STUDDERT-KENNEDY, M. An analysis of perceptual confusions in identification of dichotically presented CVC syllables. Journal of the Acoustical Society of America, 1967, in press. (Abstract) (a)

SHANKWEILER, D., & STUDDERT-KENNEDY, M. Identification of consonants and vowels presented to left and right ears. Quarterly Journal of Experimental Psychology, 1967, 19, 59-63. (b)

SHEARME, J. N., & HOLMES, J. N. An experimental study of the classification of sounds in continuous speech according to their distribution in the formant 1-formant 2 plane. In Proceedings of the Fourth International Congress of Phonetic Sciences, Helsinki, 1961. 's-Gravenhage: Mouton, 1962. Pp. 234-240.

SIVERTSEN, E. Segment inventories for speech synthesis. Language and Speech, 1961, 4, 27-61.

STEVENS, K. N. Toward a model for speech recognition. Journal of the Acoustical Society of America, 1960, 32, 47-55.

STEVENS, K. N. On the relations between speech movements and speech perception. Paper presented at the meeting of the XVIII International Congress of Psychology, Moscow, August 1966. (Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, in press.)

STEVENS, K. N., & HALLE, M. Remarks on analysis by synthesis and distinctive features. Proceedings of the AFCRL Symposium on Models for the Perception of Speech and Visual Form, Boston, November 1964. Cambridge: M.I.T. Press, in press.

STEVENS, K. N., & HOUSE, A. S. Studies of formant transitions using a vocal tract analog. Journal of the Acoustical Society of America, 1956, 28, 578-585.

STEVENS, K. N., & HOUSE, A. S. Perturbation of vowel articulations by consonantal context: An acoustical study. Journal of Speech and Hearing Research, 1963, 6, 111-128.

STEVENS, K. N., & HOUSE, A. S. Speech perception. In J. Tobias & E. Schubert (Eds.), Foundations of modern auditory theory. New York: Academic Press, in press.

STEVENS, K. N., OHMAN, S. E. G., & LIBERMAN, A. M. Identification and discrimination of rounded and unrounded vowels. Journal of the Acoustical Society of America, 1963, 35, 1900. (Abstract)

STEVENS, K. N., OHMAN, S. E. G., STUDDERT-KENNEDY, M., & LIBERMAN, A. M. Cross-linguistic study of vowel discrimination. Journal of the Acoustical Society of America, 1964, 36, 1989. (Abstract)

STUDDERT-KENNEDY, M., & COOPER, F. S. High-performance reading machines for the blind: Psychological problems, technological problems, and status. Paper presented at the meeting of St. Dunstan's International Conference on Sensory Devices for the Blind, London, June 1966.

STUDDERT-KENNEDY, M., & LIBERMAN, A. M. Psychological considerations in the design of auditory displays for reading machines. Proceedings of the International Congress on Technology and Blindness, 1963. Pp. 289-304.

STUDDERT-KENNEDY, M., LIBERMAN, A. M., & STEVENS, K. N. Reaction time to synthetic stop consonants and vowels at phoneme centers and at phoneme boundaries. Journal of the Acoustical Society of America, 1963, 35, 1900. (Abstract)

STUDDERT-KENNEDY, M., LIBERMAN, A. M., & STEVENS, K. N. Reaction time during the discrimination of synthetic stop consonants. Journal of the Acoustical Society of America, 1964, 36, 1989. (Abstract)

VON HOLST, E., & MITTELSTAEDT, H. Das Reafferenzprinzip. Naturwissenschaften, 1950, 37, 464-476.

WHETNALL, E., & FRY, D. B. The deaf child. London: Heinemann, 1964.

WHITFIELD, I. C., & EVANS, E. F. Responses of auditory cortical neurons to stimuli of changing frequency. Journal of Neurophysiology, 1965, 28, 655-672.

WICKELGREN, W. A. Distinctive features and errors in short-term memory for English consonants. Journal of the Acoustical Society of America, 1966, 39, 388-398.

(Received June 19, 1967)