
DCU at the NTCIR-11 SpokenQuery&Doc Task

[Figure: MAP by spoken query type (Manual, Match, Unmatch, AMLM) for slide-group segments with prosody; runs LI-LPr-0.2, LI-LPr-0.5, LI-P-0.9, and TF_IDF; MAP axis 0 to 0.14]

This research is supported by the Science Foundation Ireland (Grant 12/CE/I2267) as part of CNGL (www.cngl.ie)

Results

Introduction
- Speech is more than a simple sequence of words.
- Prosodic variation encodes rich information about emotions, discourse structure, dialogue acts, focus, emphasis, contrast, topic shifting, etc.
- We examined the potential of prosodic prominence in the NTCIR-11 SpokenQuery&Doc Task.

Indexing
Stores normalised prosodic features for each term.

David N. Racca, Gareth J.F. Jones
CNGL Centre for Global Intelligent Content, School of Computing, Dublin City University, Dublin, Ireland
{dracca, gjones}@computing.dcu.ie

Background and Previous Work

Prosody may be useful in speech search:
- Relationship between stress and TF-IDF scores [1].
- Spoken document retrieval (SDR) exploiting amplitude and duration [2].
- Topic tracking exploiting energy and pitch [3].
- Spoken content retrieval (SCR) exploiting pitch, loudness, and duration [4].

Conclusions
- No significant differences between prosodic and text-based runs.
- Transcript quality affects retrieval effectiveness.
- Prosodic-based models may be useful for certain queries.

Retrieval
Increases the weights of prominent terms.

References
[1] F. Crestani. Towards the use of prosodic information for spoken document retrieval. SIGIR'01, 2001.
[2] B. Chen et al. Improved spoken document retrieval by exploring extra acoustic and linguistic cues. INTERSPEECH'01, 2001.
[3] C. Guinaudeau et al. Accounting for prosodic information to improve ASR-based topic tracking for TV broadcast news. INTERSPEECH'11, 2011.
[4] D.N. Racca et al. DCU search runs at MediaEval 2014 Search and Hyperlinking. MediaEval 2014 Multimedia Benchmark Workshop, 2014.

[Diagram: lecture WAVs, IPU grouping, IPUs with prosody, segment index]

[Figure: MAP by spoken query type (Manual, Match, Unmatch, AMLM) on Match transcripts; runs LI-LPr-0.5, LI-Pr-0.7, LI-Dur-0.3, and TF_IDF; MAP axis 0 to 0.14]
[Figure: per-query AveP for Query 1, prosodic-based vs TF_IDF, by spoken query type; AveP axis 0 to 0.8]
[Figure: MAP by spoken query type (Manual, Match, Unmatch, AMLM) on Manual transcripts; runs LI-Pr-0.7, LI-LPr-0.7, and TF_IDF; MAP axis 0 to 0.14]


Data Pre-processing
[Pipeline: lecture and query WAVs processed with VAD, OpenSMILE, Julius LVCSR, ChaSen, annotation removal, forced alignment, and lecture normalisation]

Normalisation example for two terms, "tf-idf" and "単語" ("word"):
- max(F0) for "tf-idf": 280.44 Hz raw; 0.58 normalised
- max(F0) for "単語": 236.46 Hz raw; 0.49 normalised
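The raw-to-normalised values above are consistent with a simple lecture-level min-max scaling of F0. The sketch below illustrates that idea; the function name and the assumed lecture-wide F0 range of 0 to 480 Hz are illustrative assumptions, not details stated on the poster.

```python
def minmax_normalise(values, lo, hi):
    """Map raw acoustic values (e.g. F0 in Hz) to [0, 1] using a
    lecture-level minimum `lo` and maximum `hi` (hypothetical scheme)."""
    span = hi - lo
    return [(v - lo) / span for v in values]

# Illustrative only: with an assumed lecture-wide F0 range of 0-480 Hz,
# 280.44 Hz maps to roughly 0.58 and 236.46 Hz to roughly 0.49,
# matching the poster's example values.
normalised = minmax_normalise([280.44, 236.46], lo=0.0, hi=480.0)
```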

Terrier Matching

$$\mathrm{tf}(i,j) = \frac{k_1 \, \mathrm{tf}_{i,j}}{\mathrm{tf}_{i,j} + k_1 \left(1 - b + b \frac{dl_j}{avdl}\right)}
\qquad
\mathrm{idf}(i,C) = \log\!\left(\frac{N}{n_i + 1}\right)$$

Acoustic component options:

$$\mathrm{ac}(i,j) = \begin{cases}
f0(i,j) & \text{Pitch [P]} \\
l(i,j) & \text{Loudness [L]} \\
d(i,j) & \text{Duration [Dur]} \\
f0_{\mathrm{range}}(i,j) & \text{Pitch Range [Pr]} \\
l(i,j) \cdot f0(i,j) & \text{[LP]} \\
l(i,j) \cdot f0_{\mathrm{range}}(i,j) & \text{[LPr]}
\end{cases}$$

Segment scoring and term weighting:

$$\mathrm{rel}(q, s_j) = \sum_{i}^{M} w(i,j)$$

$$w(i,j) = \begin{cases}
\mathrm{idf}(i,C)\,\left[\alpha \cdot \mathrm{tf}(i,j) + (1-\alpha)\,\mathrm{ac}(i,j)\right] & \text{LI} \\[4pt]
\dfrac{\theta_{ir} \cdot \mathrm{tf}(i,j) \cdot \mathrm{idf}(i,C) + \theta_{ac} \cdot \mathrm{ac}(i,j)}{\theta_{ir} + \theta_{ac}} & \text{G} \\[4pt]
\mathrm{tf}(i,j) \cdot \mathrm{idf}(i,C) & \text{TF\_IDF}
\end{cases}$$

Acoustic features aggregated over the $k$ occurrences of term $i$ in segment $j$:

$$f0(i,j) = \max_k \{ \max(f0_{i,j}^{k}) \}
\qquad
l(i,j) = \max_k \{ \max(l_{i,j}^{k}) \}$$

$$f0_{\mathrm{range}}(i,j) = \max_k \{ \max(f0_{i,j}^{k}) \} - \min_k \{ \min(f0_{i,j}^{k}) \}
\qquad
d(i,j) = \max_k \{ d_{i,j}^{k} \}$$
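The LI weighting scheme above (BM25-style term frequency linearly interpolated with a normalised acoustic score, then scaled by IDF) can be sketched as follows. The helper names, the default parameter values (k1, b, alpha), and the argument layout are illustrative assumptions.

```python
import math

def bm25_tf(raw_tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # Saturated, length-normalised term frequency, as in tf(i, j) above.
    return (k1 * raw_tf) / (raw_tf + k1 * (1 - b + b * doc_len / avg_doc_len))

def idf(n_docs, doc_freq):
    # idf(i, C) = log(N / (n_i + 1))
    return math.log(n_docs / (doc_freq + 1))

def li_weight(raw_tf, doc_len, avg_doc_len, n_docs, doc_freq, ac, alpha=0.5):
    # LI scheme: idf(i, C) * [alpha * tf(i, j) + (1 - alpha) * ac(i, j)],
    # where ac is a normalised acoustic score in [0, 1].
    t = bm25_tf(raw_tf, doc_len, avg_doc_len)
    return idf(n_docs, doc_freq) * (alpha * t + (1 - alpha) * ac)

def rel(query_term_stats, alpha=0.5):
    # rel(q, s_j): sum of per-term weights over the M matching query terms.
    return sum(li_weight(*stats, alpha=alpha) for stats in query_term_stats)
```

With alpha = 1 the LI weight reduces to the plain TF_IDF baseline, and with alpha = 0 the text evidence is ignored entirely, which matches the role of alpha as the interpolation parameter in the run names (e.g. LI-LPr-0.5).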

[Diagram: IPU WAVs and manually annotated transcripts; F0 and loudness extracted every 10 ms; normalised F0 and loudness combined with ASR and manual transcripts to produce enriched ASR and enriched manual transcripts; enriched transcripts passed to Terrier indexing]
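The enrichment step attaches per-term acoustic values computed by the max/min aggregations shown in the Terrier Matching equations (a term's segment-level F0 is the maximum over its k occurrences of each occurrence's frame-level maximum). A minimal sketch, assuming each occurrence is represented as a list of frame-level F0 values (one per 10 ms frame):

```python
def aggregate_acoustics(occurrences):
    """`occurrences` holds, for each occurrence k of a term in a segment,
    its frame-level F0 values (hypothetical layout, one value per 10 ms
    frame). Returns (f0, f0_range) following the poster's aggregation."""
    occ_max = [max(frames) for frames in occurrences]
    occ_min = [min(frames) for frames in occurrences]
    f0 = max(occ_max)                        # f0(i, j)
    f0_range = max(occ_max) - min(occ_min)   # f0_range(i, j)
    return f0, f0_range

# Two occurrences of a term, with normalised F0 frames:
f0, f0_range = aggregate_acoustics([[0.40, 0.49], [0.30, 0.45]])
# f0 is about 0.49 and f0_range about 0.19
```

Loudness l(i, j) follows the same max-of-max pattern, and duration d(i, j) is simply the maximum per-occurrence duration.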