50 years of progress in speech recognition technology (source file: marvaini/intonation/furui-icassp2007.pdf)

Page 1:

50 years of progress in speech recognition technology

- Where we are, and where we should go -

Sadaoki Furui

Department of Computer Science

Tokyo Institute of Technology

From a poor dog to a super cat

Page 2:

Outline

• Progress of automatic speech recognition (ASR) technology
  – 5 generations
  – Synchronization with computer (IT) technology
• Summary of the technological progress
• How to narrow the gap between machine and human speech recognition
• Statistical knowledge processing
• Conclusion

Page 3:

Generations of ASR technology

~1920 to ~1952: ASR prehistory
~1952 to ~1968: 1G
~1968 to ~1980: 2G
~1980 to ~1990: 3G
~1990 to ~2007: 3.5G
~2007 onward: 4G (What is 4G?)

Page 4:

Radio Rex – 1920’s ASR

A sound-activated toy dog named “Rex” (from Elmwood Button Co.) could be called by name from his doghouse. (Thanks to Nelson Morgan.)

Page 5:

1st generation technology (1952 – 1968)

• General
  – The earliest attempts to devise digit/syllable/vowel/phoneme recognition systems
  – Spectral resonances extracted by an analogue filter bank and logic circuits
  – Statistical syntax at the phoneme level
• Early systems
  – Bell Labs, RCA Labs, MIT Lincoln Labs, University College London, Radio Research Lab (Japan), Kyoto Univ. (Japan), NEC Labs (Japan)

Page 6:

Block schematic of isolated digit recognizer circuits (K. H. Davis, R. Biddulph and S. Balashek, Bell Labs, 1952)

Page 7:

Schematic diagram of a phonetic typewriter (syllable recognizer) (H. Olson and H. Belar, RCA Labs, 1956)

Page 8:

Phonetic typewriter with parts exposed (H. Olson and H. Belar, RCA Labs, 1956)

Page 9:

Words used in a recognition study (J. W. Forgie and C. D. Forgie, MIT Lincoln Lab, 1959)

Page 10:

Structure of a program which classifies each LOOK as belonging to one of 10 possible vowels

(J. W. Forgie and C. D. Forgie, MIT Lincoln Lab, 1959)

Page 11:

Early schematic diagram of an automatic speech recognizer (P. Denes, University College London, 1959)

Page 12:

Histogram illustrating the prediction of phonemes in a test sentence (D. B. Fry, University College London, 1959)

Page 13:

Schematic diagram showing the arrangement by which statistical linguistic information is combined with acoustic information in a mechanical speech recognizer

(P. Denes, University College London, 1959)

Page 14:

P. Denes, 1959

The automatic speech typewriter (P. Denes, University College London, 1959)

Page 15:

Examples of manually analyzed spectra of the 5 Japanese vowels (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)

Page 16:

Block diagram of a vowel recognizer based on the majority decision principle (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)

Page 17:

Front view of the Japanese vowel recognizer (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)

Page 18:

Front view of the spoken digit recognizer (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)

Page 19:

Block diagram of continuous speech recognition experiments (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)

Page 20:

Block diagram of the phonetic typewriter (T. Sakai and S. Doshita, Kyoto Univ., Japan, 1962)

Page 21:

Block diagram of the vowel segmentation mechanism (T. Sakai and S. Doshita, Kyoto Univ., Japan, 1962)

Page 22:

Block diagram of the recognition mechanism (T. Sakai and S. Doshita, Kyoto Univ., Japan, 1962)

Page 23:

Block diagram of the Japanese spoken digit recognizer (K. Nagata, Y. Kato and S. Chiba, NEC Labs, Japan, 1963)

Page 24:

Photograph of the Japanese spoken digit recognizer (K. Nagata, Y. Kato and S. Chiba, NEC Labs, Japan, 1963)

Page 25:

2nd generation technology (1968 – 1980) (1)

• DTW
  – (Elementary time-normalization methods by Martin et al.)
  – Vintsyuk in the Soviet Union proposed the use of DP
  – Sakoe & Chiba at NEC Labs started to use DP
• Isolated word recognition
  – The area of isolated word or discrete utterance recognition became a viable and usable technology based on fundamental studies in Russia and Japan (Velichko & Zagoruyko, Sakoe & Chiba, and Itakura)
• IBM Labs: large-vocabulary ASR
• Bell Labs: speaker-independent ASR

Page 26:

Block diagram of speech processing stages (T. B. Martin et al., RCA Labs, 1964)

Page 27:

Neural recognition network for /g/ followed by vowels /i/, /I/, /ε/+/`d/ (T. B. Martin et al., RCA Labs, 1964)

Page 28:

Speech processing equipment (T. B. Martin et al., RCA Labs, 1964)

Page 29:

Dynamic time warping algorithm (T. K. Vintsyuk, Lab of Pattern Recognition, USSR, 1968)

Page 30:

1: Curve of maximum similarity; 2, 3: linear normalization (V. M. Velichko and N. G. Zagoruyko, Lab of Pattern Recognition, USSR, 1970)

Page 31:

Matrix of similarity (V. M. Velichko and N. G. Zagoruyko, Lab of Pattern Recognition, USSR, 1970)

Page 32:

Warping function and adjustment window definition (H. Sakoe and S. Chiba, NEC Labs, 1978)
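The idea of a warping function restricted to an adjustment window can be illustrated with a small dynamic-programming sketch. This is not the formulation from the cited work: the function name `dtw_distance`, the absolute-difference local cost, the unweighted symmetric step pattern, and the scalar feature sequences are all assumptions made for illustration.

```python
import math


def dtw_distance(a, b, window=None):
    """Accumulated DTW cost between two 1-D sequences.

    `window` is a hypothetical adjustment-window half-width: cell (i, j)
    is only visited when |i - j| <= window.
    """
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)            # no restriction
    window = max(window, abs(n - m))  # the band must reach the end point (n, m)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            local = abs(a[i - 1] - b[j - 1])  # local distance between frames
            # Best predecessor: vertical, horizontal, or diagonal step.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]
```

With `window=0` on equal-length inputs the path is forced onto the diagonal, which reduces to a frame-by-frame comparison with no time normalization; widening the window lets the path absorb differences in speaking rate.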

Page 33:

Flanagan writes, “… the computer in the background is a Honeywell DDP516. This was the first integrated circuits machine that we had in the laboratory. … with memory of 8K words (one-half of which was occupied by the Fortran II compiler) …”

(November 1970)

Page 34:

An example of the time-warping function. The parallelogram shows the possible domain of (n, m) coordinates (F. Itakura, Bell/NTT Labs, 1975)

Page 35:

Flow chart of the isolated word recognition system (F. Itakura, Bell/NTT Labs, 1975)

Page 36:

Block diagram of the real-time spoken word recognition system (L. C. W. Pols, TNO, the Netherlands, 1971)

Page 37:

Flow diagram for an operational isolated utterance speech recognition system (G. M. White, Xerox PARC, 1976)

Page 38:

2nd generation technology (1968 – 1980) (2)

• Continuous speech recognition
  – Reddy at CMU conducted pioneering research based on dynamic tracking of phonemes
• DARPA program
  – Focus on speech understanding
  – Goal: 1000-word ASR, a few speakers, continuous speech, constrained grammar, less than 10% semantic error
  – Hearsay I & II systems at CMU
  – Harpy system at CMU
  – HWIM system at BBN

Page 39:

A block diagram of the CMU Hearsay-II system organization (V. R. Lesser, R. D. Fennel, L. D. Erman and D. R. Reddy, CMU, 1975)

Page 40:

A block diagram of the CMU Harpy system organization. Shown is a small (hypothetical) fragment of the Harpy state transition network, including paths accepted for sentences beginning with “Give me.” (B. T. Lowerre, CMU, 1976)

Page 41:

3rd generation technology (1980 – 1990)

• Connected word recognition
  – Two-level DP, one-pass method, level-building (LB) approach, frame-synchronous LB approach
• Statistical framework
• HMM
• Δcepstrum
• N-gram
• Neural net
• DARPA program (Resource management task)
  – SPHINX system at CMU
  – BYBLOS system at BBN
  – DECIPHER system at SRI
  – Lincoln Labs, MIT, AT&T Bell Labs

Page 42:

Statistical models for speech

Acoustic models
• HMM math worked out in the 1960s
• Applied to speech at CMU and IBM in the early 1970s
  – Harpy system: early applications of HMMs to ASR (Baker)
• Extended by others in the mid-1980s with collection of large standard corpora

Language models
• N-gram (bigram, trigram): first applied by IBM
• Smoothing

Page 43:

MFCC-based front-end processor

[Figure: block diagram. Speech → FFT → FFT-based spectrum → Mel-scale triangular filters → Log → DCT → acoustic vector, with Δ and Δ² features appended.]
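The processing chain of this front end (FFT, mel-scale triangular filters, log, DCT, then Δ and Δ² features) can be sketched in NumPy as follows. Every concrete value here is an assumption for illustration and does not come from the slide: the 16 kHz sample rate, 25 ms frames with a 10 ms hop, 24 filters, 13 cepstra, the common 2595·log10(1 + f/700) mel formula, and the use of `np.gradient` as a simple stand-in for a regression-based delta.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for k in range(1, n_filters + 1):
        left, center, right = bins[k - 1], bins[k], bins[k + 1]
        for b in range(left, center):    # rising edge
            fbank[k - 1, b] = (b - left) / (center - left)
        for b in range(center, right):   # falling edge
            fbank[k - 1, b] = (right - b) / (right - center)
    return fbank


def dct_ii(x, n_out):
    """Unnormalized DCT-II along the last axis, keeping the first n_out terms."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return x @ basis.T


def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=24, n_ceps=13):
    """Return one row per frame: cepstra, then delta, then delta-delta."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    ceps = dct_ii(np.log(mel_energy + 1e-10), n_ceps)  # small floor avoids log(0)
    delta = np.gradient(ceps, axis=0)                  # simple time difference
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([ceps, delta, delta2])
```

Calling `mfcc` on a NumPy array holding one second of 16 kHz audio yields one 39-dimensional acoustic vector per 10 ms frame under these assumed settings.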

Page 44:

Structure of speech production and recognition system based on information transmission theory

[Figure: two aligned block chains. Transmission theory: information source → channel → decoder. Speech recognition process: text generation → speech production → acoustic processing → linguistic decoding, with W, S, X and Ŵ marking the signals between stages; the diagram also marks an "acoustic channel" and a "speech recognition system".]

$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)}$$
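As a toy illustration of this decision rule: because P(X) does not depend on W, the decoder only needs to maximize P(X|W)·P(W), which is usually done in the log domain. The function name `map_decode`, the two candidate sentences, and all probability values below are invented for the example.

```python
import math


def map_decode(acoustic_loglik, lm_logprob):
    """Pick the W maximizing log P(X|W) + log P(W); P(X) is constant over W."""
    return max(acoustic_loglik, key=lambda w: acoustic_loglik[w] + lm_logprob[w])


# Hypothetical scores for two competing word sequences.
acoustic = {"recognize speech": math.log(0.30), "wreck a nice beach": math.log(0.35)}
prior = {"recognize speech": math.log(0.020), "wreck a nice beach": math.log(0.001)}
best = map_decode(acoustic, prior)
```

In this made-up case the acoustic score slightly favors the second candidate, but the language model prior outweighs it, so the first candidate is selected.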

Page 45:

Structure of phoneme HMMs
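The figure for this slide did not survive extraction. As a rough illustration only, the sketch below scores an observation sequence with the forward algorithm on a left-to-right HMM. The three-state topology, the discrete two-symbol observations, and every probability value are assumptions, not details from the slide; the function also sums over all end states, whereas a phoneme model would normally require finishing in the last state.

```python
def forward_likelihood(obs, trans, emit, initial):
    """P(obs) under a discrete-observation HMM, via the forward recursion."""
    states = range(len(initial))
    alpha = [initial[s] * emit[s][obs[0]] for s in states]
    for o in obs[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in states) * emit[s][o]
                 for s in states]
    return sum(alpha)


# Hypothetical left-to-right model: each state may loop or move one step right.
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.7, 0.3],
         [0.0, 0.0, 1.0]]
emit = [[0.9, 0.1],   # state 0 mostly emits symbol 0
        [0.2, 0.8],   # state 1 mostly emits symbol 1
        [0.6, 0.4]]
initial = [1.0, 0.0, 0.0]
```

The zero entries below the diagonal of `trans` are what make the model left-to-right: once a state has been left, it cannot be revisited.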

Page 46:

Statistical language modeling

Probability of the word sequence $w_1^k = w_1 w_2 \ldots w_k$:

$$P(w_1^k) = \prod_{i=1}^{k} P(w_i \mid w_1 w_2 \ldots w_{i-1}) = \prod_{i=1}^{k} P(w_i \mid w_1^{i-1})$$

$$P(w_i \mid w_1^{i-1}) = N(w_1^i) / N(w_1^{i-1})$$

where $N(w_1^i)$ is the number of occurrences of the string $w_1^i$ in the given training corpus.

Approximation by Markov processes:

Bigram model: $P(w_i \mid w_1^{i-1}) = P(w_i \mid w_{i-1})$

Trigram model: $P(w_i \mid w_1^{i-1}) = P(w_i \mid w_{i-2} w_{i-1})$

Smoothing of trigram by unigram and bigram:

$$\hat{P}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i)$$
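The count-based estimates and the interpolated trigram can be sketched directly from these formulas. The function names, the fixed weights (0.6, 0.3, 0.1), and the toy corpus used in the note below are assumptions for illustration; in practice the λ values would be tuned on held-out data.

```python
from collections import Counter


def train_counts(tokens):
    """N(.) for unigrams, bigrams and trigrams in a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri


def interpolated_trigram(w, u, v, counts, lambdas=(0.6, 0.3, 0.1)):
    """Smoothed P(w | u v) = l1*P(w | u v) + l2*P(w | v) + l3*P(w)."""
    uni, bi, tri = counts
    l1, l2, l3 = lambdas
    # Relative-frequency estimates; an unseen context contributes 0.
    p_tri = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    p_bi = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p_uni = uni[w] / sum(uni.values())
    return l1 * p_tri + l2 * p_bi + l3 * p_uni
```

For the toy corpus "the cat sat on the mat the cat ran", the trigram "the cat mat" never occurs, so its unsmoothed estimate is zero; the interpolated estimate stays positive because the unigram term for "mat" still contributes.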

Page 47:

“Julie” doll with speech synthesis and recognition technology, produced by Worlds of Wonder in conjunction with Texas Instruments (1987)

Page 48:

3.5th generation technology (1990 - 2000)

• Error minimization (discriminative) approach
  – MCE (Minimum Classification Error) approach
  – MMI (Maximum Mutual Information) criterion
• Robust ASR
  – Background noises, voice individuality, microphones, transmission channels, room reverberation, etc.
  – VTLN, MLLR, HLDA, fMPE, PMC, etc.
• DARPA program
  – ATIS task
  – Broadcast news (BN) transcription integrated with information extraction and retrieval technology
  – Switchboard task
• Applications
  – Automate and enhance operator services

Page 49:

History of DARPA speech recognition benchmark tests

[Figure: word error rate (logarithmic axis, 1% to 100%) plotted by year, 1988 to 2003. Curve groups: read speech, spontaneous speech, broadcast speech, conversational speech. Task labels: Resource Management, ATIS, WSJ, NAB, Switchboard. Other labels: vocabulary sizes 1k, 5k, 20k; varied microphone; noisy; foreign.]

Courtesy NIST 1999 DARPA HUB-4 Report, Pallett et al.
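The vertical axis of this benchmark chart is word error rate. A common way to compute it, sketched here under assumptions (whitespace tokenization, unit cost for every substitution, insertion and deletion, and a hypothetical function name), is the word-level edit distance divided by the number of reference words. The official scoring tools used for the benchmarks apply additional text normalization that is not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words.

    Assumes a non-empty reference string.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))       # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(prev[j] + 1,              # deletion
                         cur[j - 1] + 1,           # insertion
                         prev[j - 1] + (r != h))   # match or substitution
        prev = cur
    return prev[-1] / len(ref)
```

Because insertions are counted, the rate can exceed 100% when the hypothesis is much longer than the reference.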

Page 50:

3.5th generation technology (2000 -)

• DARPA program
  – EARS (Effective Affordable Reusable Speech-to-Text) program for rich transcription, GALE
  – Detecting, extracting, summarizing, and translating important information
• Spontaneous speech recognition
  – CSJ project in Japan
  – Meeting projects in US and Europe
• Robust ASR
  – Utterance verification and confidence measures
  – Combining systems or subsystems
  – Graphical models (DBN)
• Multimodal speech recognition
  – Audio-visual ASR

Page 51:

Now

Page 52:

Major speech recognition applications

• Conversational systems for accessing information services (e.g. automatic flight status or stock quote information systems)

• Systems for transcribing, understanding and summarizing ubiquitous speech documents (e.g. broadcast news, meetings, lectures, presentations, congressional records, court records, and voicemails)

Page 53:

Multimedia contents technology

[Figure: diagram with the labels Ubiquitous computing, Contents, WWW (Internet), Mobile computing, Wearable computing, Image/motion processing, Information extraction (mining), Text processing (Translation), Human-computer interaction (Dialog), Information retrieval (access), Multimedia multimodal communication, Speech/audio processing.]

Page 54:

Audio indexing system

[Figure: an audio source enters an audio feeder and leaves a metadata composer as XML metadata. A component manager coordinates the components: speech detection, speaker segmentation, speaker tracking, speaker identification, speech recognition, name extraction, sentence detection, machine translation.]

Dense component integration creates state-of-the-art rich transcriptions

(J. Makhoul, BBN, 2006)

Page 55:

50 years of progress in speech recognition (1)

(1) Template matching → Corpus-based statistical modeling, e.g. HMM and n-grams

(2) Filter bank/spectral resonances → Cepstral features (Cepstrum + ΔCepstrum + ΔΔCepstrum)

(3) Heuristic time-normalization → DTW/DP matching

(4) “Distance”-based methods → Likelihood-based methods

(5) MAP (ML) approach → Discriminative approach, e.g. MCE/GPD and MMI

Page 56:

50 years of progress in speech recognition (2)

(6) Isolated word recognition → Continuous speech recognition

(7) Small vocabulary recognition → Large vocabulary recognition

(8) Context-independent units → Context-dependent units

(9) Clean speech recognition → Noisy/telephone speech recognition

(10) Single speaker recognition → Speaker-independent/adaptive recognition

(11) Monologue recognition → Dialogue/conversation recognition

Page 57:

50 years of progress in speech recognition (3)

(12) Read speech recognition → Spontaneous speech recognition

(13) Recognition → Understanding

(14) Single-modality (audio signal only) recognition → Multi-modal (audio/visual speech) recognition

(15) Hardware recognizer → Software recognizer

(16) No commercial application → Many practical commercial applications

(17) Few languages → Many languages

Page 58:

IT technology progress

[Figure: number of users (million; axis ticks 10, 100, 1,000, 3,000) against year (1970 to 2030), showing successive waves labeled System centered, PC centered, Network centered and Knowledge resource centered, with the milestones Microprocessor, PC and Laptop PC. The ASR generations 2G, 3G, 3.5G and 4G are marked along the time axis.]

(David C. Moschella: “Waves of Power”)

Page 59:

ICASSP and ASR

• ICASSP has made great contributions to speech technology progress since its beginning in 1976.

• How many speech papers have been presented at ICASSP conferences?

• How many ASR papers have been presented at ICASSP conferences?

Page 60:

The number of ICASSP ASR papers (1978-2006)

[Figure: bar chart of ASR papers per conference (axis ticks 50 to 200; yearly counts range from 21 to 207), with the 3G and 3.5G ASR periods marked. Conferences on the horizontal axis: ’78 USA, ’79 USA, ’80 USA, ’81 USA, ’82 France, ’83 USA, ’84 USA, ’85 USA, ’86 Japan, ’87 USA, ’88 USA, ’89 Scotland, ’90 USA, ’91 Canada, ’92 USA, ’93 USA, ’94 Australia, ’95 USA, ’96 USA, ’97 Germany, ’98 USA, ’99 USA, ’00 Turkey, ’01 USA, ’02 USA, ’03 Hong Kong, ’04 Canada, ’05 USA, ’06 France.]

(3487 ASR papers have been presented since 1978)

Page 61:

The number of ICASSP speech papers (1978-2006)

[Figure: stacked bar chart of speech papers per conference, from ’78 USA to ’06 France, split into ASR and other areas (axis ticks 100 to 350).]

(5998 speech papers have been presented at ICASSP since 1978)

Page 62:

Distribution of the ICASSP ASR papers over major countries

[Figure: stacked bar chart of papers per year, 1978 to 2006 (axis ticks 50 to 200), split into USA, Japan, Germany, France, UK and other countries.]

(3487 ASR papers from 49 countries have been presented since 1978)

Page 63:

Distribution of the papers by countries “other” than UK, France, Germany, Japan and USA

[Figure: stacked bar chart of papers per year, 1978 to 2006 (axis ticks 10 to 80). Legend: Others, Taiwan, Spain, Australia, Switzerland, Korea, Hong Kong, Belgium, China, India, Finland, Canada, Italy.]

Page 64:

R&D expenditures

R&D expenditures in China are rapidly increasing, at twice the rate of overall Chinese economic growth.

[Figure: R&D expenditures in units of 100 billion dollars (axis 0 to 3.5) by year, 1990 to 2006, for USA, EU (15 countries), Japan, Germany and China.]

(Source: White paper on science, technology and industry by the OECD)

Page 65:

“State-of-the-art of automatic speech recognition technology” (B. Beek, E. P. Neuberg and D. C. Hodge, 1977)

A = useful now; B = shows promise; C = a long way to go

Processing techniques: State-of-the-art

1) Signal conditioning: A, except speech enhancement (C)

2) Digital signal transformation: A

3) Analog signal transformation and feature extraction: A, except feature extraction (C)

4) Digital parameter and feature extraction: B

5a) Re-synthesis: A

5b) Orthographic synthesis: C

6) Speaker normalization, speaker adaptation, situation adaptation: C

7) Time normalization: B

8) Segmentation and labeling: B

Page 66:

9a) Language statistics: C

9b) Syntax: B

9c) Semantics: C

9d) Speaker and situation pragmatics: C

10) Lexical matching: C

11) Speech understanding: B-C

12) Speaker recognition: A for speaker verification; C for all others

13) Time normalization: A-C

14) Performance evaluation and realization: C

Page 67:

“Unsolved problems” (B. Beek, E. P. Neuberg and D. C. Hodge, 1977)

1) Detect speech in noise; speech/non-speech

2) Extract relevant acoustic parameters (pole, zero, formant (transitions), slopes, dimensional representation, zero-crossing distributions).

3) Nonlinear time normalization (dynamic programming).

4) Detect smaller units in continuous speech (word/phoneme boundaries; acoustic segments).

5) Establish anchor point; scan utterance from left to right; start from stressed vowel, etc.

6) Stressed/unstressed.

7) Phonological rules.

8) Missing or extra added (“uh”) speech sound.

9) Limited vocabulary and restricted language structure necessary; possibility of adding new words.

10) Semantics of (limited) tasks.

11) Limits of acoustic information only.

12) Recognition algorithm (shortest distance, (pairwise) discriminant, Bayes, probabilities).

13) Hypothesize-and-test, backtrack, feed forward.

Page 68:

14) Effect of nasalization, cold, emotion, loudness, pitch, whispering, distortions due to talker’s acoustical environment, distortions by communication systems (telephone, transmitter-receiver, intercom, public address, face masks), nonstandard environments.

15) Adaptive and interactive quick learning.

16) Mimicking; uncooperative speaker(s).

17) Necessity of visual feedback, error control, level for rejections.

18) Consistency of references.

19) Real-time processing.

20) Human engineering problem of incorporating speech understanding system into actual situations.

21) Cost-effectiveness.

22) Detect speech in presence of competing speech.

23) Economical ways for adding new speakers to system.

24) Use of prosodic information.

25) Co-articulation rules.

26) Morphology rules.

27) Syntax rules.

28) Vocal-tract modeling.

Page 69:

Progress of speech recognition technology since 1980

[Figure: application areas plotted by vocabulary size (number of words: 2, 20, 200, 2000, 20000, unrestricted) against speaking style (isolated words, connected speech, fluent speech, read speech, spontaneous speech), with frontiers marked for 1980, 1990 and 2000. Labeled applications: voice commands, digit strings, word spotting, directory assistance, form fill by voice, name dialing, office dictation, car navigation, system driven dialogue, network agent & intelligent messaging, transcription, 2-way dialogue, natural conversation.]

Page 70:

20 things we still don’t know about speech (R. Moore, 1994) (1)

(1) How important is the communicative nature of speech?

(2) Is human-human speech communication relevant to human-machine speech communication?

(3) Speech technology or speech science? (How can we integrate speech science and technology?)

(4) Whither a unified theory?

(5) Is speech special?

(6) Why is speech contrastive?

(7) Is there random variability in speech?

(8) How important is individuality?

(9) Is disfluency normal?

(10) How much effort does speech require?

20 things we still don’t know about speech (R. Moore, 1994) (2)

(11) What is a good architecture (for speech processes)?

(12) What are suitable levels of representation?

(13) What are the units?

(14) What is the formalism?

(15) How important are the physiological mechanisms?

(16) Is time-frame based speech analysis sufficient?

(17) How important is adaptation?

(18) What are the mechanisms for learning?

(19) What is speech good for?

(20) How good is speech?

Major ASR problems (N. Morgan, 2006)

• Unexpected rate of speech can still hurt

• Unexpected accent can hurt

• Performance in noise, reverberation still bad

• Don’t know when we know

• Few advances in basic understanding

• It takes a long time to build a system for a new language; requires a large amount of resources

How can we make a human-like speech recognizer?

(Doraemon)

What’s likely to help (N. Morgan, 2006) (1)

• The obvious: faster computers, more memory and disk, more data

• Improved techniques for learning from unlabeled data

• Serious efforts to handle:
– noise and reverberation
– speaking style variation
– out-of-vocabulary words (and sounds)

• Learning how to select features

• Learning how to select models

• Feedback from downstream processing

What’s likely to help (N. Morgan, 2006) (2)

• New (multiple) features and models

• New statistical dependencies (e.g., graphical models)

• Multiple time scales

• Multiple (larger) sound units

• Dynamic/robust pronunciation models

• Language models including structure (still!)

• Incorporating prosody

• Incorporating meaning

• Non-speech modalities

• Understanding confidence

A communication-theoretic view of speech generation & recognition

[Figure: Message source, P(M) → M → Linguistic channel, P(W|M) → W → Acoustic channel, P(X|W) → X → Speech recognizer]

Knowledge sources: language, vocabulary, grammar, semantics, context, habits

Sources of variations: speaker, reverberation, noise, transmission characteristics, microphone
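
The chain on this slide (message M, word sequence W, acoustic observation X) suggests reading recognition as inverting the two channels: pick the W that maximizes P(X|W) P(W), where P(W) is obtained by summing P(W|M) P(M) over messages. The following is a minimal sketch of that idea, not the method of any particular system. The messages, words, observation labels, and all probability values are hypothetical toy numbers chosen only for illustration.

```python
# Toy noisy-channel decoder. Every name and number below is an
# illustrative assumption, not data from the slides.

# Message source P(M)
P_M = {"greet": 0.6, "leave": 0.4}

# Linguistic channel P(W | M)
P_W_GIVEN_M = {
    "greet": {"hello": 0.7, "hi": 0.3},
    "leave": {"bye": 0.8, "hi": 0.2},
}

# Acoustic channel P(X | W), with X reduced to three discrete symbols
P_X_GIVEN_W = {
    "hello": {"x1": 0.8, "x2": 0.1, "x3": 0.1},
    "hi":    {"x1": 0.1, "x2": 0.6, "x3": 0.3},
    "bye":   {"x1": 0.1, "x2": 0.3, "x3": 0.6},
}


def word_prior(word):
    """P(W) = sum over M of P(W | M) * P(M)."""
    return sum(P_M[m] * P_W_GIVEN_M[m].get(word, 0.0) for m in P_M)


def decode(x):
    """Return the word maximizing P(X | W) * P(W), plus all scores."""
    scores = {
        w: P_X_GIVEN_W[w].get(x, 0.0) * word_prior(w)
        for w in P_X_GIVEN_W
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    for obs in ("x1", "x2", "x3"):
        best, _ = decode(obs)
        print(obs, "->", best)
```

In this sketch the knowledge sources listed on the slide would shape P(M) and P(W|M), while the sources of variation would shape P(X|W); a real recognizer replaces the lookup tables with trained language and acoustic models.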

Knowledge sources for speech recognition

• Human speech recognition is a matching process whereby an audio signal is matched to existing knowledge

• Knowledge (Meta-data)
– Domain and topics of utterances
– Context
– Semantics
– Who is speaking
– etc.

• Systematization of various related knowledge

• How to incorporate knowledge sources into the statistical ASR framework

[Figure: diagram relating Speech (data), Recognition, Transcription (information), and Knowledge (Meta-data), with links labeled Generalization and Abstraction.]

Generations of ASR technology

[Figure: timeline. ASR prehistory (~1920), 1G (~1952), 2G (~1968), 3G (~1980), 3.5G (~1990), 4G (~2007): extended knowledge processing]

Conclusion

• Speech recognition technology has made very significant progress in the past 50+ years with the help of computer technology.

• The majority of technological changes have been directed toward the purpose of increasing robustness of recognition.

• However, about 57% (16/28) of the “unsolved problems” listed by Beek et al. in 1977 have not yet been solved.

• A much greater understanding of the human speech process is required before automatic speech recognition systems can approach human performance.

• Significant advances will come from extended knowledge processing in the framework of statistical pattern recognition.

Thank you for your kind attention.