The ALERT System: Audiovisual Broadcast Speech Transcription for Selective Dissemination of Multimedia Information

Gerhard Rigoll
Munich University of Technology
Institute for Human-Machine Communication
Munich, Germany
[email protected]


General project dates

• ALERT: system for selective dissemination of multimedia information
• Official start: 01/2000; start of work: 03/2000; duration: 30 months
• Manpower effort: ~30 MY, corresponding to a budget of ~1.6 million euro EC funding
• Web site: http://alert.uni-duisburg.de


Media information flooding

[Figure: Internet and news sources flooding users with media information; supervision by information brokers]


Media monitoring in the ALERT project

[Figure: news information (sound, video, text) from the Internet and broadcast news ("today's headlines ...") passes through transcription and topic detection; a detected topic, here TAXES, triggers an ALERT MESSAGE]


General project objectives

• To develop a demo system capable of identifying specific information in multimedia data consisting of text, audio and video streams, using
  - advanced speech recognition
  - video processing techniques
  - automatic topic detection algorithms
• The demonstrator shall
  - alert a user about the existence of requested information
  - send detailed information (on the client's further request): extracted text, annotated audio/video data and video clips
• Provide functionality in French, German and Portuguese
• The demo system will be evaluated mainly by industrial partners


The ALERT consortium

[Figure: consortium partners grouped by role: integration, technologies, users]


WP structure (WP0-WP4)

[Gantt chart over months 1 to 16 with a "today" marker; symbols mark deliverables and milestones]

• WP0 Management: Management Committee meetings; Management Reports
• WP1 User needs, market study, specs: User needs and market study; Demonstrator specification
• WP2 Multilingual common structure: Pilot data available; Definition of the common structure; Common resources for all languages
• WP3 Information indexing and struct.: Automatic transcription; Audio- and video-based segmentation
• WP4 Automatic topic detection


WP structure (WP5-WP7)

[Gantt chart over months 1 to 16 with a "today" marker; symbols mark deliverables and milestones]

• WP5 System development and integ.: Infrastructure and Media Integration; Multimedia Document Structuration; Access and Interaction; System Integration
• WP6 Evaluation: Evaluation plans; Test and evaluation of the Port. system; Test and evaluation of the French system; Test and evaluation of the German system
• WP7 Exploitation and Dissemination: Scientific; Commercial


Collection of pilot corpus

• First step to set up similar resources
• Purpose: testbed for assessing methods for data collection, annotation and distribution
• Collection guidelines:
  - Minimum amount: 5 hours
  - Type of data: video, audio and annotation
  - Video format: MPEG1
  - Audio format: PCM linear, 16 kHz sampling rate, 16 bits/sample, mono, collected from antenna
  - Annotation based on LDC guidelines
  - Thematic orientation: news and interview shows
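As a rough sizing aid for the audio part of the pilot corpus, the sketch below computes the storage needed for uncompressed linear PCM from the figures on this slide (16 kHz, 16 bits/sample, mono, 5 hours). The function name `pcm_bytes` is hypothetical, and container overhead (for example a WAV header) is ignored.

```python
def pcm_bytes(hours, sample_rate_hz=16_000, bits_per_sample=16, channels=1):
    """Size in bytes of uncompressed linear PCM audio of the given duration."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return int(hours * 3600 * bytes_per_second)


# Minimum pilot corpus: 5 hours at 16 kHz, 16 bit, mono
pilot = pcm_bytes(5)
print(f"{pilot} bytes = {pilot / 1e6:.0f} MB")  # 576000000 bytes = 576 MB
```

At these settings one hour of audio takes 115.2 MB, so the 5-hour minimum stays well below 1 GB before video is added.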


Collection of final databases

• Experimental results led to recommendations for the final corpus: quality mp3, 32 kbps, 16 kHz, mono
• Minimum amount:
  - speech recognition: 50 hours (training), 3 hours (development), 3 hours (evaluation), word-labelled
  - topic detection: 300 hours, topic annotated
  - text corpus: 100 million words
• Full data set:
  - 1300 hours, word or topic annotated
  - > 10k topic annotated summaries in German
  - text corpus: > 1 billion words
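To put the recommended coding quality in perspective, the following sketch compares the 32 kbps mp3 rate with 16 kHz, 16-bit mono PCM and estimates the audio storage for the 1300-hour full data set. It assumes 1 kbps = 1000 bits per second and a constant bit rate; the function name is hypothetical.

```python
def compressed_bytes(hours, kbps=32):
    """Size in bytes of constant-bit-rate audio (1 kbps taken as 1000 bit/s)."""
    return int(hours * 3600 * kbps * 1000 / 8)


pcm_kbps = 16_000 * 16 / 1000        # 16 kHz * 16 bit, mono -> 256 kbps
ratio = pcm_kbps / 32                # compression factor relative to PCM
full_set = compressed_bytes(1300)    # full data set of 1300 hours

print(f"compression factor: {ratio:.0f}x")          # compression factor: 8x
print(f"1300 h at 32 kbps: {full_set / 1e9:.2f} GB")  # 1300 h at 32 kbps: 18.72 GB
```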


Comparison of coding schemes for broadcast speech databases


Multimedia data labeling and alert generation

[Figure: block diagram. A multimedia document is routed according to its content: if video is contained, to video/image processing (segmentation, video-based transcription); if audio is contained, to speech processing (segmentation, best hypothesis or word graph); if text is contained, directly onward. The results feed automatic topic detection (topic keywords). Labels are stored in a label database next to the multimedia document database. Topics found are matched against user profiles in order to alert specific users.]
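The last stage of this flow, matching the topics found against user profiles, can be sketched as a simple set intersection. This is a minimal illustration, not the project's implementation: the `Segment` class, the `generate_alerts` function, the profile format and the example users are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A labelled piece of a multimedia document (hypothetical structure)."""
    source: str
    start_s: float
    end_s: float
    topics: set = field(default_factory=set)


def generate_alerts(segments, profiles):
    """Return (user, segment, matched_topics) for each segment a user asked about."""
    alerts = []
    for seg in segments:
        for user, wanted in profiles.items():
            matched = seg.topics & wanted
            if matched:
                alerts.append((user, seg, sorted(matched)))
    return alerts


# Hypothetical profiles and detected segments
profiles = {"broker_a": {"taxes"}, "broker_b": {"sports"}}
segments = [
    Segment("tv_news", 0.0, 95.0, {"taxes", "politics"}),
    Segment("tv_news", 95.0, 140.0, {"weather"}),
]
for user, seg, topics in generate_alerts(segments, profiles):
    print(user, seg.source, seg.start_s, seg.end_s, topics)
```

Keeping the match as a pure function over labels means the delivery channel can be chosen separately from the matching logic.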


Basic principle of video segmentation

Stochastic video model (based on HMMs):

[Figure: state diagram running from Begin to End. Content classes: Newscaster, Interview, Report, Weather Forecast. Editing effects connecting them: Cut, Dissolve, Wipe, Window Change. The Interview class is itself a sub-model made of Newscaster and Interviewed Person, connected by Cuts.]
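How an HMM-based video model assigns content classes to a frame sequence can be illustrated with Viterbi decoding. The sketch below is a toy example under loud assumptions: three states only, a discrete observation per frame ("static", "motion", "map"), and invented probabilities. The slides do not specify the actual features, topology or parameters.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Most likely state sequence of a discrete-observation HMM (log domain)."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Invented toy parameters (all probabilities are assumptions for illustration)
states = ["newscaster", "report", "weather"]
start = {"newscaster": 0.8, "report": 0.1, "weather": 0.1}
trans = {
    "newscaster": {"newscaster": 0.80, "report": 0.15, "weather": 0.05},
    "report": {"newscaster": 0.15, "report": 0.80, "weather": 0.05},
    "weather": {"newscaster": 0.05, "report": 0.05, "weather": 0.90},
}
emit = {
    "newscaster": {"static": 0.80, "motion": 0.15, "map": 0.05},
    "report": {"static": 0.15, "motion": 0.80, "map": 0.05},
    "weather": {"static": 0.10, "motion": 0.10, "map": 0.80},
}

frames = ["static"] * 3 + ["motion"] * 3 + ["map"] * 3
print(viterbi(frames, states, start, trans, emit))
```

Segment boundaries then fall wherever the decoded state changes from one frame to the next.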


Result of video-based segmentation


Combined video-audio segmentation

[Figure: reference video segmentation compared with automatic video segmentation over frames 0 to 20000. Segment classes: weather forecast, intro, speaker, report, speaker with interview partner.]


Topic segmentation

Results: video-based detection of topic boundaries is feasible
• precision rate = 1 - insertion rate = 88.2 %
• recall rate = 1 - deletion rate = 82.2 %
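The two rates can be computed from a reference and a detected boundary list as sketched below. The matching rule is an assumption: a detected boundary counts as correct if it lies within `tolerance` frames of a reference boundary that has not been used yet. The slides do not state the tolerance or matching rule used in the evaluation, and the boundary positions here are invented.

```python
def boundary_scores(reference, detected, tolerance=0):
    """Precision and recall of detected boundaries against reference boundaries."""
    unmatched = sorted(reference)
    hits = 0
    for b in sorted(detected):
        # Match each detected boundary to the closest unused reference boundary
        candidates = [r for r in unmatched if abs(r - b) <= tolerance]
        if candidates:
            unmatched.remove(min(candidates, key=lambda r: abs(r - b)))
            hits += 1
    precision = hits / len(detected) if detected else 0.0  # 1 - insertion rate
    recall = hits / len(reference) if reference else 0.0   # 1 - deletion rate
    return precision, recall


# Invented frame positions for illustration
reference = [100, 500, 900, 1300, 1700]
detected = [105, 490, 700, 1310]
precision, recall = boundary_scores(reference, detected, tolerance=25)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # 0.75, 0.60
```

Here one of four detected boundaries is an insertion (precision 0.75) and two of five reference boundaries are deletions (recall 0.60).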


French BN speech recognizer

• continuous density HMM system
• 33 phones + 3 non-speech (silence, filler words, breath)
• ~20 % WER (on news)
• 65k dictionary
• automatic pronunciation with manual verification
• 58 hours acoustic training data, 350 Mio words text
• RT decoding: 5700 states, 92k Gaussians
• 10xRT decoding: 11000 states, 350k Gaussians
• 4-gram language model: 15M bi-, 15M tri-, 13M four-grams


Portuguese BN speech recognizer

• Based on the AUDIMUS LVCSR system
• Hybrid system based on MLP/HMM techniques
• Combination of different acoustic models (product of posterior probabilities)
• 38 phones + silence, 57k dictionary
• 4-gram LM: 5M bi-, 12M tri-, 13M four-grams
• Trained on 13 h of BN data
• Results: 15xRT: F0: ~20 %, All F: ~40 %
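Combining acoustic models by a product of posterior probabilities can be sketched for a single frame as follows. This is only an illustration of the general idea: the phone set, the posterior values and the renormalization step are assumptions, and the slides do not say whether the models are weighted.

```python
import math


def combine_posteriors(posterior_sets):
    """Combine per-model phone posteriors for one frame by a normalized product."""
    phones = posterior_sets[0].keys()
    # Sum of logs equals the log of the product; the floor avoids log(0)
    log_scores = {
        ph: sum(math.log(max(p[ph], 1e-12)) for p in posterior_sets)
        for ph in phones
    }
    peak = max(log_scores.values())
    unnorm = {ph: math.exp(s - peak) for ph, s in log_scores.items()}
    total = sum(unnorm.values())
    return {ph: v / total for ph, v in unnorm.items()}


# Invented posteriors from two hypothetical MLPs for the same frame
model_a = {"a": 0.6, "e": 0.3, "sil": 0.1}
model_b = {"a": 0.5, "e": 0.4, "sil": 0.1}
combined = combine_posteriors([model_a, model_b])
print(max(combined, key=combined.get))  # a
```

Working in the log domain keeps the product numerically stable when many models or very small posteriors are involved.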


German Baseline Speech Recognition System


German BN speech recognizer

• continuous density HMM system
• 50 phones + 17 non-speech (silence, filler words, breath, rustle, ...)
• ~20 % WER (initial DuDeutsch: >70 % WER)
• 100k dictionary
• initial pronunciation from CELEX, compound word construction
• 10xRT: 30-90k Gaussians
• 3-gram (cached) language model, 8M bi-, 16M trigrams


Evolution of the German system

system                                            phone models  #mixtures  WER
baseline German system, 100k, spontaneous speech  triphones     31 780     ~30 %
baseline, not trained on broadcast data           triphones     31 780     79.7 %
baseline with broadcast language model            triphones     31 780     72.3 %
acoustic models trained on broadcast data         monophones    1 722      54.3 %
acoustic models optimized on broadcast data       triphones     96 417     22.8 %


Examples for German transcription results

Each pair shows the reference text (ref) followed by the recognizer output (hyp).

ref: viele menschen auch heute noch in provisorischen notunterkünften .
hyp: viele menschen auch heute noch einen froh wie so daß nur unterkünften

ref: zweitausend beben ganze ortschaften zerstört .
hyp: zweite außen beben also ortschaften so stück

ref: muß der anreiz zur zusätzlichen privaten vorsorge erhöht werden .
hyp: mußte ein reiz zum zusätzlichen privater vorsorge erhöht werden

ref: zuschuß bekommen . dafür will sich die csu arbeitnehmerunion
hyp: zuschuß bekommen dafür wie sich die csu arbeitnehmer im juni

ref: mit rund zusätzlichen zwo komma fünf millionen mark muß der landkreis
hyp: mitte und zusätzlichen zwo komma fünf millionen mal muß der lahn kreis
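The word error rate (WER) quoted for the recognizers can be computed for pairs like these with a word-level edit distance. The sketch below is a generic implementation, not the project's scoring tool; punctuation tokens are removed by hand from the example, which is an assumption about the scoring convention.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion of a reference word
                curr[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),   # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


ref = "zuschuß bekommen dafür will sich die csu arbeitnehmerunion"
hyp = "zuschuß bekommen dafür wie sich die csu arbeitnehmer im juni"
print(f"WER = {word_error_rate(ref, hyp):.2f}")  # WER = 0.50
```

For this pair the alignment needs two substitutions and two insertions against eight reference words. The example also shows how a single split compound can cost several errors.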


Automatic topic detection

Objectives:
• to divide audio/video streams automatically into topic-specific homogeneous segments
• automatic assignment of requested topics to distinct segments

Test set:
• 22 topics in 2956 training and 1284 test texts
• deletion of 150 stop words
• no stemming performed


New approach to topic detection

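[Figure: an input text ("This is a text containing important topics.") is mapped to word probabilities p(w1), p(w2), p(w3), ..., which are fed into an MMI neural net; the net outputs a VQ label in the form of a one-hot vector [00.....0100....0]]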
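The step from word probabilities to a one-hot VQ label can be illustrated with a plain nearest-neighbour quantizer. Note the substitution: this sketch does not implement the MMI neural net; it uses a fixed codebook with squared Euclidean distance, in the spirit of the k-means quantization the new approach is compared with. The vocabulary, the codebook and the function names are invented for illustration.

```python
def word_probabilities(text, vocabulary):
    """Relative frequency of each vocabulary word among the in-vocabulary tokens."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t in vocabulary]
    total = len(tokens)
    return [tokens.count(w) / total if total else 0.0 for w in vocabulary]


def vq_label(vector, codebook):
    """One-hot label marking the codebook entry closest to the vector."""
    dists = [sum((v - c) ** 2 for v, c in zip(vector, entry)) for entry in codebook]
    winner = dists.index(min(dists))
    return [1 if i == winner else 0 for i in range(len(codebook))]


# Hypothetical four-word vocabulary and two-entry codebook
vocabulary = ["taxes", "budget", "rain", "storm"]
codebook = [
    [0.5, 0.5, 0.0, 0.0],  # entry 0: finance-like texts
    [0.0, 0.0, 0.5, 0.5],  # entry 1: weather-like texts
]
probs = word_probabilities("Taxes and budget: taxes rise.", vocabulary)
print(vq_label(probs, codebook))  # [1, 0]
```

Words outside the vocabulary are simply dropped, which plays the role of stop-word deletion in this toy setting.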


Results for clean text

[Figure: two bar charts, vertical axis 0 to 100. One compares feature quantization with k-means clustering and with the MMI neural net; the other compares the new approach with a standard system ("compared system").]


Partially corrupted text

Results with partially corrupted texts:
• some words are fragmented, similar to speech recognition output
• 22 topics in 3037 training and 1319 test texts
• no stop words
• no stemming


Results for corrupted text

[Figure: two bar charts, vertical axis 0 to 100, comparing the new approach with the compared system, one for 173 topics and one for 22 topics; bars are given for 1 best, 2 best, 3 best and 4 best]
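Reading "1 best" to "4 best" as n-best accuracy, that is, the share of test texts whose correct topic appears among the n top-ranked hypotheses, the measure can be sketched as below. This reading is an assumption, since the slide does not define the axis, and the ranked lists are invented.

```python
def n_best_accuracy(ranked_topics, true_topics, n):
    """Fraction of texts whose true topic is among the n highest-ranked hypotheses."""
    hits = sum(truth in ranked[:n] for ranked, truth in zip(ranked_topics, true_topics))
    return hits / len(true_topics)


# Invented classifier output: topics ordered from most to least likely
ranked = [
    ["taxes", "economy", "politics"],
    ["sports", "weather", "taxes"],
    ["politics", "taxes", "economy"],
    ["weather", "sports", "economy"],
]
truth = ["taxes", "weather", "economy", "weather"]

for n in (1, 2, 3):
    print(n, n_best_accuracy(ranked, truth, n))
```

In this toy data the accuracy rises from 0.5 (1-best) to 0.75 (2-best) and 1.0 (3-best), which is why n-best curves are non-decreasing in n.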


Demonstrator specification (details)

[Figure: block diagram of the demonstrator. Blocks and components shown:]

• Inputs: TV/Internet Broadcast News; RADIO Broadcast News; INTERNET News Texts
• DATA CAPTURE: Data Acquisition; Format Conversion & Compression; Program Identification; Program Descriptions Database
• VIDEO TOOLS: Video Extraction; Video Segmentation; Video Labelling (Content Classes, Editing Effects)
• AUDIO TOOLS: Audio Extraction; Audio Segmentation; Audio Labelling (Speaker, Acoustic Conditions, Channel, Language); Speech Transcription
• TOPIC DETECTION: Topic Detection
• DATA STORAGE: Database Interface; Profiles Database
• USER RETRIEVAL: Retrieval Interface
• ALERT GENERATION: Alert Generator (Web | Email | WAP Info | ...)
• USER


Publications

ICASSP 2001 (7/2001)
• LIMSI: Automatic transcription of compressed broadcast audio
• GMUD: New approaches to audio-visual segmentation of TV news for automatic topic retrieval

TREC-9 (11/2000)
• LIMSI: The LIMSI SDR system for TREC-9

argus press (11/2000)
• Observer: Observer Argus Media beteiligt sich am EU-Forschungsprojekt ALERT

ICSLP 2000 (10/2000)
• GMUD: Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches
• INESC: The Use of Syllable Segmentation Information in Continuous Speech Recognition Hybrid Systems Applied to the Portuguese Language
• INESC: Combination of Acoustic Models in Continuous Speech Recognition Hybrid Systems


Publications (II)

ICSLP 2000 (10/2000)
• LIMSI: Fast decoding for indexation of broadcast data
• LIMSI: Investigating text normalization and pronunciation variants for German broadcast transcription

ECDL 2000, 4th European Conference on Research and Advanced Technology for Digital Libraries (9/2000)
• INESC: Topic Detection in Read Documents

ASR 2000 (9/2000)
• INESC: A Decoder for Finite-State Structured Search Spaces

ICASSP 2000 (6/2000)
• GMUD: A Novel Error Measure for the Evaluation of Video Indexing Systems


Presentations

Schaufenster der Wissenschaft (3/2001)
• GMUD: Informationen aus Radio, Fernsehen und Internet: Automatische Themenerkennung in Multimedia-Daten

Euromap Informationstag (12/2000)
• GMUD: Das Projekt ALERT - Alert system for selective dissemination of multimedia information

IV Jornadas de Arquivo e Documentação (10/2000)
• INESC: Speech recognition and topic detection applied to alert systems for broadcast news

ASR 2000 (9/2000)
• GMUD: ALERT System for Selective Dissemination of Multimedia Information

Homme Technologie et Systèmes Complexes (6/2000)
• VECSYS: Parlez Naturellement, la Machine Vous Comprend

RIAO'2000 Content-based Multimedia Information Access (4/2000)
• VECSYS, LIMSI: An Audio Transcriber for Broadcast Document Indexation


Outlook

• use of additional data
• cross-talker situations
• enlarged number of topics
• improving rejection mechanisms of unknown topics (confidence for topics)
• detection of new topics
• summarization
  - scalable summarization
  - topic-dependent summarization