Contact: Chiori Hori
E-mail: [email protected]
National Institute of Information and Communications Technology (NICT), Japan
ASIA-PACIFIC TELECOMMUNITY
2nd APT/ITU Conformance and Interoperability Workshop (C&I-2)
Document: C&I-2/INP-12
26 August 2014, Bangkok, Thailand
"Acceleration of R&D Towards Speech Translation Technologies in the Asia-Pacific Region by U-STAR"
Chiori Hori
Spoken Language Communication Laboratory
Universal Communication Research Center
National Institute of Information and Communications Technology
Kyoto, Japan
Email: [email protected]
NICT audio indexing system
[Diagram labels: Public Servers (video data); Network; Speech data; Audio indexing system; Closed captions and indexes]
Real-time audio indexing:
- Speech transcription
- Query-based retrieval: audio including queries, event categories, speaker diarization (who spoke what and when)
- Video categorization by topics
Speech Interface: Human and Human, Human and Machine for natural communication
- Speech translation for people speaking different languages
- Speech-to-text communication system
- Spoken dialog system with machine
[Diagram labels: Smart Phone; Speech; Speech Recognition; Machine Translation; Dialog Management; Spoken Dialog; Text-to-Speech; Synthesized Speech; Speech-to-Speech Translation; Modality Conversion; Transcribed speech; Japanese; English; Japanese text; "How can I get to NICT?"; "From Kyoto station"]
Research Target
Many different languages in the world (http://en.wikipedia.org/wiki/List_of_language_families)
Overcoming the language barriers is a long-held dream of mankind.
Speech translation technology: breaking the language barriers
How to overcome language barriers?
Speech-to-Speech Translation (S2ST): a means of communication between speakers of different languages
Example: Japanese 「ホテルの予約をお願いします.」 to English "Please make a reservation for a hotel"
1) Speech Recognition (ASR)
- Convert to Japanese phoneme sequence: "h", "o", "t"… (h o t e r u n o y o y a k u o o n e g a i ...)
- Convert to word sequence using lexicon and grammar: ホテルの予約をお願いします.
2) Machine Translation (MT)
- Convert to English word sequence: 「ホテル」⇒ "a hotel", 「予約」⇒ "make a reservation for", 「お願いします」⇒ "please"
- Reorder word sequences according to English grammar: "a hotel" "make a reservation for" "please" ⇒ "please" "make a reservation for" "a hotel"
3) Speech Synthesis (TTS)
- Select appropriate waveform for English text: Please make a reservation for a hotel
[Other diagram labels: Corpora; English "I go to school"]
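As a toy illustration of the two MT steps on this slide (word conversion followed by reordering), the following Python sketch hard-codes the slide's three-entry lexicon. The names `LEXICON`, `convert_words`, `reorder_for_english`, and `translate` are hypothetical, and reversing the phrase order is an assumption that happens to fit this one example; an actual MT system would learn both steps from a parallel corpus.

```python
# Toy sketch of the slide's MT example; not the U-STAR implementation.

# Lexicon copied from the slide (Japanese word -> English phrase).
LEXICON = {
    "ホテル": "a hotel",
    "予約": "make a reservation for",
    "お願いします": "please",
}


def convert_words(japanese_words):
    """Step 1: convert each Japanese word to its English phrase."""
    return [LEXICON[word] for word in japanese_words]


def reorder_for_english(phrases):
    """Step 2: reorder phrases for English.

    Assumption: a plain reversal, which is enough for this example only.
    """
    return list(reversed(phrases))


def translate(japanese_words):
    """Run both steps and capitalize the first letter of the sentence."""
    sentence = " ".join(reorder_for_english(convert_words(japanese_words)))
    return sentence[0].upper() + sentence[1:]


if __name__ == "__main__":
    # Particles (の, を) are omitted, as on the slide.
    print(translate(["ホテル", "予約", "お願いします"]))
```

The input is the ASR word sequence with particles dropped, matching the three lexicon entries shown on the slide.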
History of the International Consortium (1)
Network-based S2ST research by consortiums of C-STAR and A-STAR
Timeline (years on the axis): 1991, 1992, 1993, 1999, 2000, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013
C-STAR (network-based S2ST):
- Japan, US, Germany (3 countries)
- + Korea, Italy, France, China, U.S., U.K., Switzerland, Sweden, India (9 countries)
A-STAR (network-based S2ST):
- Japan, China, Korea, Indonesia, Thailand, India (6 countries)
- + Vietnam, Singapore (2 countries)
Preparation for the U-STAR Research Activity
Members: Japan NICT, Korea ETRI, Thailand NECTEC, Indonesia BPPT, China CASIA, India CDAC, Vietnam IOIT, Singapore I2R, Bhutan DITT, Pakistan KICS-UET, Nepal LTK, Mongolia MUST, Mongolia NUM, Sri Lanka UCSC, Philippines UPD, France CNRS-LIMSI, Portugal INESC-ID, Turkey TUBITAK, UK University of Sheffield, Germany TUM, Germany UUlm, Belgium ESAT, Hungary BME-TMIT, Hungary PPKE
Collected at NICT (JP):
- Speech data for training acoustic models: Japanese, Korean, Thai, Indonesian, Chinese, Hindi, Vietnamese, Malay, French, Portuguese, Turkish, English, German, Dutch, Hungarian, and Polish speech
- Parallel corpus and dictionary for training translation models from English to the target language
Speech-to-Speech translation
MCML-based communication libraries (CMLIB)
CMLIB is implemented for the U-STAR S2ST servers
[Diagram labels: S2ST Client (S2ST application on smartphone); U-STAR S2ST servers (ASR/MT/TTS servers); CMLIB on the client and on each server]
Network-based Speech-to-Speech Translation (S2ST)
Communication between Different Language Speakers
Japanese speaker (S2ST client) to Thai speaker (S2ST client):
ASR Module (Japanese) → MT Module (Japanese → Thai) → TTS Module (Thai)
Thai speaker (S2ST client) to Japanese speaker (S2ST client):
ASR Module (Thai) → MT Module (Thai → Japanese) → TTS Module (Japanese)
Each module is hosted on an S2ST server, and the clients and servers are connected through the network.
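The module chain in this diagram can be sketched as a lookup over a server registry. The Python below is a minimal illustration under stated assumptions: the `Server` class, the `REGISTRY` list, the language codes, and `plan_route` are all hypothetical names, and the sketch does not model MCML messages or any actual U-STAR server interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """One S2ST server hosting a single module (hypothetical model)."""
    module: str        # "ASR", "MT", or "TTS"
    languages: tuple   # ("ja",) for ASR/TTS, ("ja", "th") for MT


# Hypothetical registry mirroring the six modules in the diagram.
REGISTRY = [
    Server("ASR", ("ja",)),
    Server("MT", ("ja", "th")),
    Server("TTS", ("th",)),
    Server("ASR", ("th",)),
    Server("MT", ("th", "ja")),
    Server("TTS", ("ja",)),
]


def find_server(module, languages):
    """Return the first registered server for the module and language(s)."""
    for server in REGISTRY:
        if server.module == module and server.languages == languages:
            return server
    raise LookupError(f"no {module} server for {languages}")


def plan_route(source, target):
    """Chain ASR (source) -> MT (source to target) -> TTS (target)."""
    return [
        find_server("ASR", (source,)),
        find_server("MT", (source, target)),
        find_server("TTS", (target,)),
    ]


if __name__ == "__main__":
    for server in plan_route("ja", "th"):
        print(server.module, server.languages)
```

Keeping each module as a separate registry entry reflects the diagram's point that the ASR, MT, and TTS servers for a language pair may be operated at different sites.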
Initiation of Standardization from Asia
- A-STAR Speech-to-speech Translation Demo in 8 Countries (July 2009)
- APT ASTAP Meeting (August 2009), ASTAP 16 Plenary Session: discussion to develop the standardization activity more internationally, not limited to the Asia-Pacific region. -> Approved to raise the standardization draft from APT to ITU-T
- U-STAR MOU (July 2010)
From Asia to the World: A-STAR to U-STAR
The Universal Speech Translation Advanced Research Consortium is an international research collaboration entity aiming to break language barriers around the world through network-based speech-to-speech translation (S2ST) technologies.
History of the International Consortium (2)
Network-based S2ST research by U-STAR
Timeline (years on the axis): 1991, 1992, 1993, 1999, 2000, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013
C-STAR (network-based S2ST):
- Japan, US, Germany (3 countries)
- + Korea, Italy, France, China, U.S., U.K., Switzerland, Sweden, India (9 countries)
A-STAR (network-based S2ST):
- Japan, China, Korea, Indonesia, Thailand, India (6 countries)
- + Vietnam, Singapore (2 countries)
U-STAR (network-based S2ST):
- + Bhutan, Mongolia, Nepal, Pakistan, Philippines, Sri Lanka (6 countries)
- + France, Portugal, Turkey, U.K., Germany, Hungary, Poland, Belgium, Ireland (9 countries)
From ASTAP to ITU-T
Recommendations for network-based speech-to-speech translation were published by HP2/SG16.
Recommendation (2010) | Title | URL
F.745 | Functional Requirements for Network-based S2ST | http://www.itu.int/rec/T-REC-F.745-201010-I
H.625 | Architectural Requirements for Network-based S2ST | http://www.itu.int/rec/T-REC-H.625-201010-I
U-STAR Network-based Speech Translation
The orange-colored areas indicate the countries whose official languages are supported by U-STAR’s apps.
S2ST servers located all over the world are connected through the network.
U-STAR members: coverage of the official languages (example of Hindi)
27 MT servers, 17 ASR servers, 14 TTS servers
Client App: chat system using speech translation on a smartphone
ASR using the Collected Speech
Accuracy improvements using the collected speech
[Figure: two bar charts of WER (%). Japanese (JP), axis 20 to 34: Baseline, AM (USV), AM+LM (USV), AM+LM (SV). Thai (TH), axis 30 to 70: Baseline, AM (USV), AM (USV)+Web, AM (SV)+Web.]
Fig. Evaluation of Model Adaptation: Japanese (left) and Thai (right)
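The charts report word error rate (WER). For readers unfamiliar with the metric, a minimal Python sketch of the standard definition follows: the word-level edit distance (substitutions, deletions, and insertions) between the recognizer output and the reference transcript, divided by the number of reference words. The function name and the example sentences are illustrative assumptions and are not taken from the evaluation in the slides.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent using word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "please make a reservation for a hotel"
    hyp = "please make reservation for the hotel"
    print(f"WER = {word_error_rate(ref, hyp):.1f}%")
```

In the example, one deletion and one substitution against a seven-word reference give two errors out of seven words.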
VoiceTra4U on Android
Data collection through iPhone and Android phone applications for speech translation
© NICT
Interoperable Speech Communication Platform for 1) human-to-human and 2) human-to-machine
Language and Domain Portability for Speech Communication Tool using ITU-T standardized S2ST protocol
[Diagram labels: Client; MCML (ITU-T Standardized Protocols); ASR Servers; MT Servers; TTS Servers; DM Server; Back-End Server]
Back-end systems:
- Online Shopping / Booking Systems, e.g. Hotels, Stores, etc.
- Emergency Systems, e.g. Hospitals, Police Departments, etc.
- Educational Systems, e.g. VoIP Lessons, Schools
Coverage:
- 17 languages for ASR, 27 for MT, and 14 for TTS
- Chat for up to 5 people
2020 Olympics in Tokyo
- Speech-to-Speech Translation
- Real-Time Indexing: video data is processed by the NICT audio indexing system (speech, video, audio event). Example: searching scenes with the sound of "explosion"; scene of "Riots" (Video A: 20 sec, Video B: 35 sec)
- Spoken Dialog System: "How can I get to the stadium?" "Which game will you see?" "From Tokyo station?"
Multilingual Communication Project