The ALERT System: Audiovisual Broadcast Speech Transcription for Selective Dissemination of Multimedia Information

Gerhard Rigoll
Munich University of Technology
Institute for Human-Machine Communication
Munich, Germany
rigoll@ei.tum.de

General project dates

ALERT: Alert system for selective dissemination of multimedia information
• Official start: 01/2000, start of work: 03/2000, duration: 30 months
• Manpower effort: ~30 MY ---> Budget: ~1.6 million euro EC funding
• Web site: http://alert.uni-duisburg.de

Media monitoring in the ALERT project

[Figure. Surviving labels: media information flooding; Internet; NEWS; supervision by information brokers; news information (sound, video, text); "today's headlines .."; transcription; topic detection; ALERT MESSAGE]

General project objectives

To develop a demo system capable of identifying specific information in multimedia data, consisting of text, audio and video streams, using
• advanced speech recognition
• video processing techniques
• automatic topic detection algorithms

The demonstrator shall
• alert a user about the existence of requested information
• send detailed information (on the client's further request): extracted text, annotated audio/video data and video clips
• provide functionality in French, German and Portuguese

The demo system will be evaluated mainly by industrial partners.

The ALERT consortium

[Figure: consortium diagram. Surviving labels: integration; technologies]

WP structure (WP0-WP4)

[Gantt chart over months 1 to 16 with deliverable and milestone markers; bar positions not recoverable]

WP0 Management
• Management Committee meetings
• Management Reports

WP1 User needs, market study, specs
• User needs and market study
• Demonstrator specification

WP2 Multilingual common structure
• Pilot data available
• Definition of the common structure
• Common resources for all languages

WP3 Information indexing and struct.
• Automatic transcription
• Audio- and video-based segmentation

WP4 Automatic topic detection

WP structure (WP5-WP7)

[Gantt chart over months 1 to 16 with deliverable and milestone markers; bar positions not recoverable]

WP5 System development and integ.
• Infrastructure and Media Integration
• Multimedia Document Structuration
• Access and Interaction
• System Integration

WP6 Evaluation
• Evaluation plans
• Test and evaluation of the Portuguese system
• Test and evaluation of the French system
• Test and evaluation of the German system

WP7 Exploitation and Dissemination
• Scientific
• Commercial

Collection of pilot corpus

• First step to set up similar resources
• Purpose: testbed for assessing methods for data collection, annotation and distribution
• Collection guidelines:
  - Minimum amount: 5 hours
  - Type of data: video, audio and annotation
  - Video format: MPEG1
  - Audio format: PCM linear, 16 kHz sampling rate, 16 bits/sample, mono, collected from antenna
  - Annotation based on LDC guidelines
  - Thematic orientation: news and interview

Collection of final databases

• Recommendations for the final corpus from experimental results: quality mp3, 32 kbps, 16 kHz, mono
• Minimum amount:
  - speech recognition: 50 hours (training), 3 hours (development), 3 hours (evaluation), word-labelled
  - topic detection: 300 hours, topic annotated
  - text corpus: 100 million words
• Full data set:
  - 1300 hours word or topic annotated
  - > 10k topic annotated summaries in German
  - text corpus: > 1 billion words

comparison of coding schemes for broadcast speech databases
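As a rough check on what the coding choice means for storage, the arithmetic below compares uncompressed PCM (16 kHz, 16 bits/sample, mono, as used for the pilot corpus) with 32 kbps mp3 for the 1300-hour full data set. The helper name `corpus_size_gb` is hypothetical, and container and annotation overhead are ignored.

```python
def corpus_size_gb(bitrate_bps: float, hours: float) -> float:
    """Return the audio payload size in gigabytes (10**9 bytes)."""
    seconds = hours * 3600
    return bitrate_bps * seconds / 8 / 1e9

PCM_BPS = 16_000 * 16 * 1   # 16 kHz * 16 bits/sample * 1 channel = 256 kbps
MP3_BPS = 32_000            # 32 kbps mp3

pcm_gb = corpus_size_gb(PCM_BPS, 1300)
mp3_gb = corpus_size_gb(MP3_BPS, 1300)
ratio = PCM_BPS / MP3_BPS
print(f"PCM: {pcm_gb:.2f} GB, mp3: {mp3_gb:.2f} GB, ratio {ratio:.0f}:1")
```

Under these assumptions the mp3 coding needs one eighth of the space of the PCM original, which is the kind of trade-off a comparison of coding schemes has to weigh against recognition accuracy.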

Multimedia data labeling and alert generation

[Figure: flow diagram. Surviving labels: multimedia document; segmentation; if video / if audio / if text contained; video/image processing; speech processing; automatic topic detection; video-based transcription; best hypothesis / word graph; topic keywords; multimedia document database; label database; match topics found against user profiles; alert specific users]
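The final step of the diagram, matching the topics found against user profiles, can be sketched as follows. This is a minimal illustration, not the demonstrator's implementation: the `Segment` type, the `generate_alerts` function, the confidence threshold and all sample data are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    doc_id: str
    start_s: float
    end_s: float
    topics: dict   # topic label -> detection confidence (assumed to be in [0, 1])

def generate_alerts(segments, profiles, min_conf=0.5):
    """Return (user, doc_id, topic) triples for requested topics found in segments."""
    alerts = []
    for seg in segments:
        for user, wanted in profiles.items():
            for topic, conf in seg.topics.items():
                if topic in wanted and conf >= min_conf:
                    alerts.append((user, seg.doc_id, topic))
    return alerts

segments = [
    Segment("news_0301", 0.0, 95.0, {"politics": 0.9, "economy": 0.3}),
    Segment("news_0301", 95.0, 140.0, {"weather": 0.8}),
]
profiles = {"alice": {"politics", "weather"}, "bob": {"economy"}}
alerts = generate_alerts(segments, profiles)
print(alerts)
```

Here the low-confidence "economy" detection is suppressed, so only the first user is alerted. A threshold of this kind is one simple way to reject uncertain topics.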

Basic principle of video segmentation

Stochastic video model (based on HMMs):

[Figure: HMM state diagram running from Begin to End. Surviving state labels: Newscaster; Interview; Report; Weather Forecast; Cut; Dissolve; Window Change. The Interview state is expanded as a sub-model: Interview = Newscaster, Interviewed Person, Cut]
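To illustrate how an HMM of this kind turns a sequence of frame observations into a sequence of content classes and editing effects, here is a minimal Viterbi decoding sketch. The three states, the discrete observation symbols ("static", "moving", "jump") and all probabilities are hypothetical toy values, not the project's trained model or its actual video features.

```python
import math

STATES = ["Newscaster", "Cut", "Report"]
# Hypothetical parameters chosen for illustration only.
START = {"Newscaster": 1.0, "Cut": 0.0, "Report": 0.0}
TRANS = {
    "Newscaster": {"Newscaster": 0.8, "Cut": 0.2, "Report": 0.0},
    "Cut":        {"Newscaster": 0.5, "Cut": 0.0, "Report": 0.5},
    "Report":     {"Newscaster": 0.0, "Cut": 0.2, "Report": 0.8},
}
EMIT = {  # P(observed frame feature | state)
    "Newscaster": {"static": 0.80, "moving": 0.15, "jump": 0.05},
    "Cut":        {"static": 0.05, "moving": 0.05, "jump": 0.90},
    "Report":     {"static": 0.15, "moving": 0.80, "jump": 0.05},
}

def log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations):
    """Return the most likely state sequence for the observations."""
    score = {s: log(START[s]) + log(EMIT[s][observations[0]]) for s in STATES}
    backptrs = []
    for obs in observations[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log(TRANS[p][s]))
            new_score[s] = score[best_prev] + log(TRANS[best_prev][s]) + log(EMIT[s][obs])
            ptr[s] = best_prev
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return path[::-1]

frames = ["static", "static", "jump", "moving", "moving"]
path = viterbi(frames)
print(path)
```

The decoded path labels every frame, so segment boundaries fall wherever the state changes; in this toy run the "jump" frame is decoded as a Cut between a Newscaster shot and a Report.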

Result of video-based segmentation

[Figure: segmentation tracks on a common axis from 0 to 20000. Surviving labels: reference video segmentation; automatic video segmentation; combined video-audio segmentation; topic segmentation]

Results: video-based detection of topic boundaries is feasible
• precision rate = 1 - insertion rate = 88.2 %
• recall rate = 1 - deletion rate = 82.2 %
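The two rates can be computed from a reference and an automatic segmentation roughly as sketched below. The slides do not define the normalisation or the matching rule, so this sketch assumes that the insertion rate is taken relative to the number of detected boundaries, the deletion rate relative to the number of reference boundaries, and that a detection is correct if it lies within a fixed tolerance of an unmatched reference boundary. The function name and the sample positions are hypothetical.

```python
def boundary_scores(reference, detected, tolerance):
    """Score detected boundaries against reference boundaries.

    A detection counts as correct if it lies within `tolerance` of a
    reference boundary that has not been matched yet (greedy matching).
    """
    unmatched = sorted(reference)
    hits = 0
    for b in sorted(detected):
        match = next((r for r in unmatched if abs(r - b) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    insertions = len(detected) - hits   # false boundaries
    deletions = len(reference) - hits   # missed boundaries
    precision = 1 - insertions / len(detected)
    recall = 1 - deletions / len(reference)
    return precision, recall

precision, recall = boundary_scores(
    reference=[1200, 4800, 9100, 15000],
    detected=[1210, 4700, 9105, 12000, 15020],
    tolerance=50,
)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this made-up example three of five detections are correct and one of four reference boundaries is missed, giving a precision of 0.60 and a recall of 0.75.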

French BN speech recognizer

• continuous density HMM system
• 33 phones + 3 non-speech (silence, filler words, breath)
• ~20% WER (on news)
• 65k dictionary
• automatic pronunciation with manual verification
• 58 hours acoustic training data, 350 million words text
• RT decoding: 5700 states, 92k Gaussians
• 10xRT decoding: 11000 states, 350k Gaussians
• 4-gram language model: 15M bi-, 15M tri-, 13M four-grams

Portuguese BN speech recognizer

• Based on the AUDIMUS LVCSR system
• Hybrid system based on MLP/HMM techniques
• Combination of different acoustic models (product of posterior probabilities)
• 38 phones + silence, 57k dictionary
• 4-gram LM: 5M bi-, 12M tri-, 13M four-grams
• Trained on 13 h of BN data
• Results: 15xRT: F0: ~20%, All F: ~40 %
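Combining acoustic models by the product of their posterior probabilities can be sketched as below: for each frame, the phone posteriors of the individual models are multiplied element-wise. The renormalisation step, the function name and the toy posteriors are assumptions made for this illustration; the slides do not say how AUDIMUS scales the product.

```python
def combine_posteriors(streams):
    """Combine per-frame phone posteriors from several models.

    `streams` is a list of models, each a list of frames, each frame a list
    of posterior probabilities over the same phone set.
    """
    combined = []
    for frames in zip(*streams):
        product = [1.0] * len(frames[0])
        for posteriors in frames:
            product = [p * q for p, q in zip(product, posteriors)]
        total = sum(product)   # assumed nonzero; renormalise to sum to 1
        combined.append([p / total for p in product])
    return combined

# Two hypothetical models, two frames, three phones.
model_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
model_b = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
combined = combine_posteriors([model_a, model_b])
print(combined)
```

The product rewards phones on which the models agree: in the second frame model A prefers the second phone and model B the third, and the combination follows the more confident model.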

German Baseline Speech Recognition System

German BN speech recognizer

• continuous density HMM system
• 50 phones + 17 non-speech (silence, filler words, breath, rustle, ...)
• ~20 % WER (initial DuDeutsch: >70 % WER)
• 100k dictionary
• initial pronunciation from CELEX, compound word construction
• 10xRT: 30-90k Gaussians
• 3-gram (cached) language model: 8M bi-, 16M trigrams

Evolution of the German system

system                                              phone models   #mixtures   WER
baseline German system, 100k, spontaneous speech    triphones      31 780      ~30%
baseline, not trained on broadcast data             triphones      31 780      79.7%
baseline with broadcast language model              triphones      31 780      72.3%
acoustic models trained on broadcast data           monophones      1 722      54.3%
acoustic models optimized on broadcast data         triphones      96 417      22.8%

Examples for German transcription results

REF: viele menschen auch heute noch in provisorischen notunterkünften .
HYP: viele menschen auch heute noch einen froh wie so daß nur unterkünften

REF: zweitausend beben ganze ortschaften zerstört .
HYP: zweite außen beben also ortschaften so stück

REF: muß der anreiz zur zusätzlichen privaten vorsorge erhöht werden .
HYP: mußte ein reiz zum zusätzlichen privater vorsorge erhöht werden

REF: zuschuß bekommen . dafür will sich die csu arbeitnehmerunion
HYP: zuschuß bekommen dafür wie sich die csu arbeitnehmer im juni

REF: mit rund zusätzlichen zwo komma fünf millionen mark muß der landkreis
HYP: mitte und zusätzlichen zwo komma fünf millionen mal muß der lahn kreis
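The word error rates quoted for the recognizers can be reproduced on such pairs with the usual word-level edit distance (substitutions, insertions and deletions divided by the number of reference words). The sketch below applies it to the first example pair, with the sentence-final punctuation token dropped; the function name is hypothetical and the scoring conventions of the project (for example text normalisation) are not known from the slides.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[j] holds the edit distance between the ref words seen so far and hyp[:j].
    dist = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, dist[0] = dist[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(
                dist[j] + 1,            # deletion of a reference word
                dist[j - 1] + 1,        # insertion of a hypothesis word
                prev_diag + (r != h),   # substitution, or match at no cost
            )
            prev_diag, dist[j] = dist[j], cur
    return dist[-1] / len(ref)

ref = "viele menschen auch heute noch in provisorischen notunterkünften"
hyp = "viele menschen auch heute noch einen froh wie so daß nur unterkünften"
wer = word_error_rate(ref, hyp)
print(f"WER = {wer:.1%}")
```

The first five words match and the remaining three reference words are replaced by seven wrong ones, so the pair scores 7 errors on 8 reference words. Single utterances like this are far noisier than the corpus-level figures on the slides.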

Automatic topic detection

Objectives:
• to divide audio/video streams automatically into topic-specific homogeneous segments
• automatic assignment of requested topics to distinct segments

Test set:
• 22 topics in 2956 training and 1284 test texts
• deletion of 150 stop words
• no stemming performed
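For orientation, a text-based topic classifier with stop-word deletion and no stemming can be as simple as the unigram naive Bayes sketch below. This is a generic baseline chosen for illustration, not the project's approach and not necessarily its comparison system; the tiny stop-word set stands in for the 150-word list, and the training texts are invented.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}   # stand-in for the real list

def tokens(text):
    # Lowercase, split on whitespace, delete stop words; no stemming.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def train(labelled_texts):
    """Count words per topic; returns (word counts, document counts, vocabulary)."""
    word_counts, doc_counts, vocab = defaultdict(Counter), Counter(), set()
    for topic, text in labelled_texts:
        words = tokens(text)
        word_counts[topic].update(words)
        doc_counts[topic] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(text, model):
    word_counts, doc_counts, vocab = model
    n_docs = sum(doc_counts.values())

    def score(topic):
        total = sum(word_counts[topic].values())
        s = math.log(doc_counts[topic] / n_docs)
        for w in tokens(text):
            # Add-one smoothing so unseen words do not zero out a topic.
            s += math.log((word_counts[topic][w] + 1) / (total + len(vocab)))
        return s

    return max(doc_counts, key=score)

model = train([
    ("weather", "rain and wind in the north"),
    ("weather", "sunny weather in the south"),
    ("economy", "the bank raised interest rates"),
    ("economy", "shares of the bank fell"),
])
topic = classify("wind and rain tomorrow", model)
print(topic)
```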

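New approach to topic detection

[Figure: the example text "This is a text containing important topics." is mapped to word probabilities p(w1) p(w2) p(w3) ..., which an MMI neural net converts into a VQ label, drawn as a binary vector with a single 1: [0 0 ... 0 1 0 0 ... 0]]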
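The idea of a VQ label as a binary vector with a single 1 can be illustrated with plain nearest-centroid quantization, which is how a codebook trained by k-means would be applied. This sketch does not reproduce the MMI neural net; the function name, the codebook and the feature vector are hypothetical.

```python
def vq_label(features, codebook):
    """Return (index, one-hot vector) of the codebook entry nearest to `features`."""
    def sq_dist(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))

    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    one_hot = [1 if i == index else 0 for i in range(len(codebook))]
    return index, one_hot

codebook = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]   # hypothetical centroids
index, one_hot = vq_label([0.2, 0.7, 0.1], codebook)
print(index, one_hot)
```

An MMI-trained quantizer would produce the same kind of label but choose its partition to preserve information about the topic rather than to minimise distortion.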

Results for clean text

[Figure: two bar charts, vertical axis from 0 to 90; bar values not recoverable.
• Comparison of new approach and standard system (bars: new approach, compared system)
• Comparison of feature quantization with k-means clustering and MMI neural net (bars: k-means, MMI)]

Results with partially corrupted texts:

• some words are fragmented, similar to speech recognition output
• 22 topics in 3037 training and 1319 test texts
• no stop words
• no stemming
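The slides do not say how the corrupted texts were produced. One assumed way to imitate fragmented recognizer output on clean text is to split a random fraction of the words in two, as in this sketch; the function name, the rate and the minimum piece length are all hypothetical choices.

```python
import random

def fragment_words(text, rate, seed=0):
    """Split a fraction of the words in two, imitating recognizer-style fragments."""
    rng = random.Random(seed)   # fixed seed keeps the corruption reproducible
    out = []
    for word in text.split():
        if len(word) > 3 and rng.random() < rate:
            cut = rng.randint(2, len(word) - 2)   # both pieces keep at least 2 letters
            out.extend([word[:cut], word[cut:]])
        else:
            out.append(word)
    return " ".join(out)

clean = "zusätzlichen millionen mark muß der landkreis"
corrupted = fragment_words(clean, rate=0.5)
print(corrupted)
```

Because fragments no longer match the vocabulary seen in training, corruption of this kind removes evidence from a word-based topic classifier without changing the underlying topic.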

Results for corrupted text

[Figure: two bar charts for partially corrupted text, vertical axis from 0 to 90, one for 173 topics and one for 22 topics; bar values not recoverable]

Demonstrator specification (details)

[Figure: architecture diagram. Blocks and components as far as recoverable; block assignment inferred from label order]

• Inputs: TV/Internet broadcast news; radio broadcast news; Internet news texts
• DATA CAPTURE: data acquisition; format conversion & compression; program identification; program descriptions
• VIDEO TOOLS: video extraction; video segmentation; video labelling (content classes, editing effects)
• AUDIO TOOLS: audio extraction; audio segmentation; audio labelling (speaker, acoustic conditions, channel, language); speech transcription
• TOPIC DETECTION: topic detection
• DATA STORAGE: database; database interface; profiles database
• ALERT GENERATION: alert generator (Web | Email | WAP Info | ...)
• USER RETRIEVAL: retrieval interface

Publications

ICASSP 2001 (7/2001)
• LIMSI: Automatic transcription of compressed broadcast audio
• GMUD: New approaches to audiovisual segmentation of TV news for automatic topic retrieval

TREC-9 (11/2000)
• LIMSI: The LIMSI SDR system for TREC-9

argus press (11/2000)
• Observer: Observer Argus Media beteiligt sich am EU-Forschungsprojekt ALERT [Observer Argus Media participates in the EU research project ALERT]

ICSLP 2000 (10/2000)
• GMUD: Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches
• INESC: The Use of Syllable Segmentation Information in Continuous Speech Recognition Hybrid Systems Applied to the Portuguese Language
• INESC: Combination of Acoustic Models in Continuous Speech Recognition Hybrid Systems

Publications (II)

ICSLP 2000 (10/2000)
• LIMSI: Fast decoding for indexation of broadcast data
• LIMSI: Investigating text normalization and pronunciation variants for German broadcast transcription

ECDL 2000, 4th European Conference on Research and Advanced Technology for Digital Libraries (9/2000)
• INESC: Topic Detection in Read Documents

ASR 2000 (9/2000)
• INESC: A Decoder for Finite-State Structured Search Spaces

ICASSP 2000 (6/2000)
• GMUD: A Novel Error Measure for the Evaluation of Video Indexing Systems

Presentations

Schaufenster der Wissenschaft (3/2001)
• GMUD: Informationen aus Radio, Fernsehen und Internet: Automatische Themenerkennung in Multimedia-Daten [Information from radio, television and the Internet: automatic topic detection in multimedia data]

Euromap Informationstag (12/2000)
• GMUD: Das Projekt ALERT - Alert system for selective dissemination of multimedia information [The ALERT project]

IV Jornadas de Arquivo e Documentação (10/2000)
• INESC: Speech recognition and topic detection applied to alert systems for broadcast news

ASR 2000 (9/2000)
• GMUD: ALERT System for Selective Dissemination of Multimedia Information

Homme Technologie et Systèmes Complexes (6/2000)
• VECSYS: Parlez Naturellement, la Machine Vous Comprend [Speak naturally, the machine understands you]

RIAO'2000 Content-based Multimedia Information Access (4/2000)
• VECSYS, LIMSI: An Audio Transcriber for Broadcast Document Indexation

Outlook

• use of additional data
• cross-talker situations
• enlarged number of topics
• improving rejection mechanisms of unknown topics (confidence for topics)
• detection of new topics
• summarization
  - scalable summarization
  - topic-dependent summarization
