1 The ALERT System: Audiovisual Broadcast Speech Transcription for Selective Dissemination of...

Post on 18-Dec-2015


1

The ALERT System: Audiovisual Broadcast Speech Transcription for Selective Dissemination of Multimedia Information

Gerhard Rigoll
Munich University of Technology
Institute for Human-Machine Communication
Munich, Germany
rigoll@ei.tum.de

2

General project dates

ALERT system for selective dissemination of multimedia information
• Official start: 01/2000; start of work: 03/2000; duration: 30 months
• Manpower effort: ~30 man-years (MY); budget: ~1.6 million euro of EC funding
• Web site: http://alert.uni-duisburg.de

3

Media information flooding

[Figure: Internet and NEWS sources; supervision by information brokers]

4

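Media monitoring in the ALERT project

[Figure: NEWS information (sound, video, text) from sources such as the Internet, e.g. "today's headlines ...", passes through transcription and topic detection; a detected topic (here: TAXES) triggers an ALERT MESSAGE]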

5

General project objectives

• To develop a demo system capable of identifying specific information in multimedia data consisting of text, audio and video streams, using
  • advanced speech recognition
  • video processing techniques
  • automatic topic detection algorithms
• The demonstrator shall
  • alert a user about the existence of requested information
  • send detailed information (on the client's further request): extracted text, annotated audio/video data and video clips
  • provide functionality in French, German and Portuguese
• The demo system will be evaluated mainly by industrial partners

6

The ALERT consortium

[Figure: consortium partners grouped into integration, technologies and users]

7

WP structure (WP0-WP4)

[Gantt chart: months 1-16 with a "today" marker; legend: deliverable, milestone]
• WP0 Management: Management Committee meetings; Management Reports
• WP1 User needs, market study, specs: User needs and market study; Demonstrator specification
• WP2 Multilingual common structure: Pilot data available; Definition of the common structure; Common resources for all languages
• WP3 Information indexing and struct.: Automatic transcription; Audio- and video-based segmentation
• WP4 Automatic topic detection

8

WP structure (WP5-WP7)

[Gantt chart: months 1-16 with a "today" marker; legend: deliverable, milestone]
• WP5 System development and integ.: Infrastructure and Media Integration; Multimedia Document Structuration; Access and Interaction; System Integration
• WP6 Evaluation: Evaluation plans; Test and evaluation of the Portuguese system; Test and evaluation of the French system; Test and evaluation of the German system
• WP7 Exploitation and Dissemination: Scientific; Commercial

9

Collection of pilot corpus

• First step to set up similar resources
• Purpose: testbed for assessing methods for data collection, annotation and distribution
• Collection guidelines:
  • Minimum amount: 5 hours
  • Type of data: video, audio and annotation
  • Video format: MPEG-1
  • Audio format: linear PCM, 16 kHz sampling rate, 16 bits/sample, mono, collected from antenna
  • Annotation based on LDC guidelines
  • Thematic orientation: news and interview shows

10

Collection of final databases

• Experimental results lead to recommendations for the final corpus: mp3 quality, 32 kbps, 16 kHz, mono
• Minimum amount:
  • speech recognition: 50 hours (training), 3 hours (development), 3 hours (evaluation), word-labelled
  • topic detection: 300 hours, topic-annotated
  • text corpus: 100 million words
• Full data set:
  • 1300 hours, word- or topic-annotated
  • > 10k topic-annotated summaries in German
  • text corpus: > 1 billion words
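To put the two audio formats side by side, the following minimal sketch computes the bit rate of the pilot-corpus PCM format and the storage needed for a given number of hours in either format. It assumes constant bit rates, counts audio only (video is not included), and the helper names are illustrative rather than part of the project.

```python
def bitrate_pcm(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bit rate of uncompressed linear PCM in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def storage_bytes(bitrate_bps: int, hours: float) -> float:
    """Storage needed for `hours` of audio at a constant bit rate."""
    return bitrate_bps / 8 * hours * 3600


# Pilot corpus format from the slides: 16 kHz, 16 bits/sample, mono.
PCM_BPS = bitrate_pcm(16_000, 16, 1)
# Recommended final corpus format from the slides: mp3 at 32 kbps.
MP3_BPS = 32_000

for label, hours in [("pilot minimum", 5), ("full data set", 1300)]:
    pcm_gb = storage_bytes(PCM_BPS, hours) / 1e9
    mp3_gb = storage_bytes(MP3_BPS, hours) / 1e9
    print(f"{label}: {hours} h -> PCM {pcm_gb:.2f} GB, mp3 {mp3_gb:.2f} GB")

print(f"PCM / mp3 bit-rate ratio: {PCM_BPS / MP3_BPS:.0f}x")
```

Under these assumptions the PCM format runs at 256 kbps, so the recommended mp3 setting reduces the audio storage by a factor of eight.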

11

comparison of coding schemes for broadcast speech databases

12

Multimedia data labeling and alert generation

[Figure: processing flow for a multimedia document]
• Input: multimedia document
• Branches: if video contained, video/image processing; if audio contained, speech processing; if text contained, automatic topic detection
• Intermediate results: segmentation (video and audio), video-based transcription, best hypothesis / word graph, topic keywords
• Storage: multimedia document database, label database
• Output: match topics found against user profiles, then alert specific users
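The last stage of this flow, matching the topics found against user profiles, can be sketched as a set intersection. This is only an illustration: the `Segment` record, the profile format and the user names are hypothetical and are not taken from the ALERT demonstrator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One labeled segment, as it might be stored in a label database (hypothetical layout)."""
    document_id: str
    start_s: float
    end_s: float
    topics: frozenset


def match_profiles(segment, profiles):
    """Return {user: sorted matched topics} for users whose profile overlaps the segment."""
    alerts = {}
    for user, wanted in profiles.items():
        hits = segment.topics & wanted
        if hits:
            alerts[user] = sorted(hits)
    return alerts


# Hypothetical user profiles: each user subscribes to a set of topics.
profiles = {"broker_a": {"taxes", "elections"}, "broker_b": {"sports"}}
segment = Segment("news_item_1", 120.0, 185.5, frozenset({"taxes", "budget"}))

alerts = match_profiles(segment, profiles)
print(alerts)
```

Only users with at least one matching topic receive an alert; how the alert is delivered is outside the scope of this sketch.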

13

Basic principle of video segmentation

Stochastic video model (based on HMMs):

[Figure: state diagram running from Begin to End. Content classes: Newscaster, Interview, Report, Weather Forecast. Editing effects between them: Cut, Dissolve, Wipe, Window Change. Interview is itself a sub-model built from Newscaster and Interviewed Person states joined by Cuts.]
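How an HMM turns a frame sequence into segments can be illustrated with a small Viterbi decoder. This is a simplified sketch, not the project's model: it uses only three content classes, omits the editing-effect states, replaces the real video features with made-up discrete symbols, and all probabilities are invented for illustration.

```python
import math
from itertools import groupby

# Hypothetical model: three content classes, discrete per-frame observations.
STATES = ["newscaster", "report", "weather"]
START = {"newscaster": 0.8, "report": 0.1, "weather": 0.1}
TRANS = {
    "newscaster": {"newscaster": 0.8, "report": 0.15, "weather": 0.05},
    "report": {"newscaster": 0.2, "report": 0.8, "weather": 0.0},
    "weather": {"newscaster": 0.1, "report": 0.0, "weather": 0.9},
}
EMIT = {
    "newscaster": {"studio": 0.8, "outdoor": 0.1, "map": 0.1},
    "report": {"studio": 0.1, "outdoor": 0.8, "map": 0.1},
    "weather": {"studio": 0.1, "outdoor": 0.1, "map": 0.8},
}


def _log(p):
    """Logarithm that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations):
    """Return the most likely state sequence for a list of observation symbols."""
    # delta[s]: best log-probability of any path ending in state s.
    delta = {s: _log(START[s]) + _log(EMIT[s][observations[0]]) for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_delta, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: delta[p] + _log(TRANS[p][s]))
            new_delta[s] = delta[best_prev] + _log(TRANS[best_prev][s]) + _log(EMIT[s][obs])
            pointers[s] = best_prev
        delta = new_delta
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: delta[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def segments(path):
    """Collapse a per-frame state path into (label, start_frame, end_frame) runs."""
    out, start = [], 0
    for label, run in groupby(path):
        length = len(list(run))
        out.append((label, start, start + length))
        start += length
    return out


frames = ["studio", "studio", "outdoor", "outdoor", "outdoor", "studio", "map", "map"]
print(segments(viterbi(frames)))
```

The segment boundaries fall where the decoded state changes, which is the sense in which the model performs segmentation and classification in one pass.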

14

Result of video-based segmentation

15

Combined video-audio segmentation

[Figure: two plots of segment class over frame number (0 to 20000), one for the reference video segmentation and one for the automatic video segmentation. Classes: weather forecast, intro, speaker, report, speaker with interview partner.]

16

Topic segmentation

Results: video-based detection of topic boundaries is feasible
• precision rate = 1 - insertion rate = 88.2 %
• recall rate = 1 - deletion rate = 82.2 %
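The two definitions above can be made concrete with a short sketch that scores hypothesized boundaries against reference boundaries. The greedy matching rule and the tolerance window are assumptions made for this example; the slides do not say how boundaries were matched, and the boundary positions below are invented.

```python
def boundary_scores(reference, hypothesis, tolerance=0):
    """Precision and recall of hypothesized boundaries, as 1 - insertion/deletion rate.

    Each hypothesized boundary is matched greedily to at most one unused
    reference boundary that lies within `tolerance` frames.
    """
    if not reference or not hypothesis:
        raise ValueError("need at least one reference and one hypothesized boundary")
    unmatched = sorted(reference)
    hits = 0
    for h in sorted(hypothesis):
        match = next((r for r in unmatched if abs(r - h) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    insertion_rate = (len(hypothesis) - hits) / len(hypothesis)  # spurious boundaries
    deletion_rate = (len(reference) - hits) / len(reference)     # missed boundaries
    return {"precision": 1 - insertion_rate, "recall": 1 - deletion_rate}


# Invented frame positions, for illustration only.
reference = [100, 450, 900, 1500]
hypothesis = [98, 460, 700, 1490, 1800]
scores = boundary_scores(reference, hypothesis, tolerance=25)
print(scores)
```

Here three of five hypothesized boundaries match a reference boundary, so two count as insertions and one reference boundary counts as a deletion.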

17

French BN speech recognizer

• continuous density HMM system
• 33 phones + 3 non-speech (silence, filler words, breath)
• ~20% WER (on news)
• 65k dictionary
• automatic pronunciation with manual verification
• 58 hours acoustic training data, 350 million words of text
• RT decoding: 5700 states, 92k Gaussians
• 10xRT decoding: 11000 states, 350k Gaussians
• 4-gram language model: 15M bi-, 15M tri-, 13M four-grams

18

Portuguese BN speech recognizer

• Based on the AUDIMUS LVCSR system
• Hybrid system based on MLP/HMM techniques
• Combination of different acoustic models (product of posterior probabilities)
• 38 phones + silence, 57k dictionary
• 4-gram LM: 5M bi-, 12M tri-, 13M four-grams
• Trained on 13 h of BN data
• Results at 15xRT: F0: ~20 %, all F conditions: ~40 %
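Combining acoustic models by a product of posterior probabilities can be sketched for a single frame as follows. The renormalization step, the flooring constant and the example values are assumptions for illustration; the slides do not specify whether the AUDIMUS combination is weighted or renormalized.

```python
import math


def combine_posteriors(streams, floor=1e-12):
    """Combine per-model phone posteriors for one frame by a normalized product.

    `streams` is a list of posterior vectors, one per acoustic model, all over
    the same phone classes. The product is computed in the log domain.
    """
    n_classes = len(streams[0])
    log_scores = [
        sum(math.log(max(stream[k], floor)) for stream in streams)
        for k in range(n_classes)
    ]
    top = max(log_scores)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(score - top) for score in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Invented posteriors from two models over three phone classes.
model_a = [0.7, 0.2, 0.1]
model_b = [0.5, 0.4, 0.1]
combined = combine_posteriors([model_a, model_b])
print([round(p, 4) for p in combined])
```

A class that both models rate highly is reinforced, while a class that either model rates near zero is strongly suppressed, which is the characteristic behaviour of a product rule.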

19

German Baseline Speech Recognition System

20

German BN speech recognizer

• continuous density HMM system
• 50 phones + 17 non-speech (silence, filler words, breath, rustle, ...)
• ~20 % WER (initial DuDeutsch: >70 % WER)
• 100k dictionary
• initial pronunciation from CELEX, compound word construction
• 10xRT: 30-90k Gaussians
• 3-gram (cached) language model: 8M bi-, 16M trigrams

21

Evolution of the German system

system                                            phone models  #mixtures  WER
baseline German system, 100k, spontaneous speech  triphones     31 780     ~30%
baseline, not trained on broadcast data           triphones     31 780     79.7%
baseline with broadcast language model            triphones     31 780     72.3%
acoustic models trained on broadcast data         monophones    1 722      54.3%
acoustic models optimized on broadcast data       triphones     96 417     22.8%

22

Examples of German transcription results

ref: viele menschen auch heute noch in provisorischen notunterkünften .
hyp: viele menschen auch heute noch einen froh wie so daß nur unterkünften

ref: zweitausend beben ganze ortschaften zerstört .
hyp: zweite außen beben also ortschaften so stück

ref: muß der anreiz zur zusätzlichen privaten vorsorge erhöht werden .
hyp: mußte ein reiz zum zusätzlichen privater vorsorge erhöht werden

ref: zuschuß bekommen . dafür will sich die csu arbeitnehmerunion
hyp: zuschuß bekommen dafür wie sich die csu arbeitnehmer im juni

ref: mit rund zusätzlichen zwo komma fünf millionen mark muß der landkreis
hyp: mitte und zusätzlichen zwo komma fünf millionen mal muß der lahn kreis
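The word error rate (WER) quoted for the recognizers can be computed for such sentence pairs with a word-level edit distance. The sketch below uses the standard Levenshtein definition and treats the first line of a pair as the reference and the second as the recognizer output; dropping the "." token before scoring is an assumption made here, not a documented scoring rule of the project.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER = edit distance / number of reference words, ignoring '.' tokens."""
    ref_words = [w for w in reference.split() if w != "."]
    hyp_words = [w for w in hypothesis.split() if w != "."]
    return edit_distance(ref_words, hyp_words) / len(ref_words)


REF = "zweitausend beben ganze ortschaften zerstört ."
HYP = "zweite außen beben also ortschaften so stück"
print(f"WER for this fragment: {wer(REF, HYP):.0%}")
```

A single short fragment like this one is far noisier than a corpus-level figure, so the result should not be compared directly with the ~20 % WER reported for the full system.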

23

Automatic topic detection

Objectives:
• to divide audio/video streams automatically into topic-specific homogeneous segments
• automatic assignment of requested topics to distinct segments

Test set:
• 22 topics in 2956 training and 1284 test texts
• deletion of 150 stop words
• no stemming performed
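The preprocessing named in the test-set description, stop-word deletion without stemming, followed by per-word probabilities of the kind written as p(w1), p(w2), ... on the next slide, might look like the sketch below. The tiny stop list is a stand-in for the 150-word list, which the slides do not reproduce, and estimating p(w) as a relative frequency within one text is an assumption.

```python
from collections import Counter

# Hypothetical stand-in for the 150-entry stop list used in the experiments.
STOP_WORDS = {"this", "is", "a", "the", "of", "and"}


def word_probabilities(text, stop_words=STOP_WORDS):
    """Relative word frequencies after stop-word deletion; no stemming is applied."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t and t not in stop_words]
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


probs = word_probabilities("This is a text containing important topics.")
print(probs)
```

Because no stemming is performed, inflected forms such as "topic" and "topics" are counted as different words.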

24

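New approach to topic detection

[Figure: an input text ("This is a text containing important topics.") is mapped to word probabilities p(w1), p(w2), p(w3), ...; an MMI neural net maps these to a VQ label, shown as a binary vector with a single 1: [0 0 ... 0 1 0 0 ... 0]]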

25

Results for clean text

[Figure: two bar charts with a vertical axis from 0 to 100.
Chart 1: comparison of feature quantization with k-means clustering and MMI neural net.
Chart 2: comparison of the new approach and the standard system ("new approach" vs. "compared system").]

26

Partially corrupted text

Results with partially corrupted texts:
• some words are fragmented, similar to speech recognition output
• 22 topics in 3037 training and 1319 test texts
• no stop words
• no stemming

27

Results for corrupted text

[Figure: two bar charts with a vertical axis from 0 to 100, each comparing "new approach" and "compared system".
Chart 1: 173 topics, with bars for 1-best to 4-best.
Chart 2: 22 topics.]

28

Demonstrator specification (details)

[Figure: block diagram of the demonstrator; labels grouped by module as far as the layout can be recovered]
• Sources: TV/Internet broadcast news; radio broadcast news; Internet news texts
• DATA CAPTURE: Data Acquisition; Format Conversion & Compression; Program Identification; Program Descriptions Database
• VIDEO TOOLS: Video Extraction; Video Segmentation; Video Labelling (Content Classes, Editing Effects)
• AUDIO TOOLS: Audio Extraction; Audio Segmentation; Audio Labelling (Speaker, Acoustic Conditions, Channel, Language); Speech Transcription
• DATA STORAGE: Database; Database Interface
• TOPIC DETECTION: Topic Detection; User Profiles
• ALERT GENERATION: Alert Generator (Web | Email | WAP Info | ...); User
• USER RETRIEVAL: Retrieval Interface

29

Publications

ICASSP 2001 (7/2001)
• LIMSI: Automatic transcription of compressed broadcast audio
• GMUD: New approaches to audio-visual segmentation of TV news for automatic topic retrieval

TREC-9 (11/2000)
• LIMSI: The LIMSI SDR system for TREC-9

argus press (11/2000)
• Observer: Observer Argus Media beteiligt sich am EU-Forschungsprojekt ALERT (Observer Argus Media takes part in the EU research project ALERT)

ICSLP 2000 (10/2000)
• GMUD: Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches
• INESC: The Use of Syllable Segmentation Information in Continuous Speech Recognition Hybrid Systems Applied to the Portuguese Language
• INESC: Combination of Acoustic Models in Continuous Speech Recognition Hybrid Systems

30

Publications (II)

ICSLP 2000 (10/2000)
• LIMSI: Fast decoding for indexation of broadcast data
• LIMSI: Investigating text normalization and pronunciation variants for German broadcast transcription

ECDL 2000, 4th European Conference on Research and Advanced Technology for Digital Libraries (9/2000)
• INESC: Topic Detection in Read Documents

ASR 2000 (9/2000)
• INESC: A Decoder for Finite-State Structured Search Spaces

ICASSP 2000 (6/2000)
• GMUD: A Novel Error Measure for the Evaluation of Video Indexing Systems

31

Presentations

Schaufenster der Wissenschaft (3/2001)
• GMUD: Informationen aus Radio, Fernsehen und Internet: Automatische Themenerkennung in Multimedia-Daten (Information from radio, television and the Internet: automatic topic detection in multimedia data)

Euromap Informationstag (12/2000)
• GMUD: Das Projekt ALERT - Alert system for selective dissemination of multimedia information (The ALERT project)

IV Jornadas de Arquivo e Documentação (10/2000)
• INESC: Speech recognition and topic detection applied to alert systems for broadcast news

ASR 2000 (9/2000)
• GMUD: ALERT System for Selective Dissemination of Multimedia Information

Homme Technologie et Systèmes Complexes (6/2000)
• VECSYS: Parlez Naturellement, la Machine Vous Comprend (Speak naturally, the machine understands you)

RIAO'2000 Content-based Multimedia Information Access (4/2000)
• VECSYS, LIMSI: An Audio Transcriber for Broadcast Document Indexation

32

Outlook

• use of additional data
• cross-talker situations
• enlarged number of topics
• improving rejection mechanisms for unknown topics (confidence for topics)
• detection of new topics
• summarization
  • scalable summarization
  • topic-dependent summarization