Towards Spoken Knowledge Structuring and...

Speaker: Hung-yi Lee

Towards Spoken Knowledge

Structuring and Organization

When Speech Processing Technology

meets MOOCs

Introduction

2012 is the year of the massive open online

course (MOOC)

Instructors post recorded video/audio of their lectures on

online lecture platforms.

Learners worldwide can easily access the curricula.

More learning materials

Too much materials ……

Learner

I want to learn “SVM”.

A course is too much

Learner

I want to learn “SVM”.

Machine Learning(by Andrew Ng)

XII. Support Vector

Machines (Week 7)

Learning From Data

(Yaser S. Abu-Mostafa)

Lecture 14: Support

Vector Machines

Lecture 15: Kernel

Methods

機器學習基石(by 林軒田)

https://class.coursera.org/ml-003/class/index

https://class.coursera.org/ml-003/class/index

A course is not enough

Inter-discipline

Learner

I want to learn

“amino acid”.

Introduction to Biology

Amino acid

Amino acid

Organic Chemistry

Vision: Personalized Courses

I want to learn “XXX”.

I am a graduate student of

computer science.

I can spend 10 hours.Learner

I open a course for you.

on-line learning

material

Spoken Language Processing techniques can be very

helpful.

The spoken content in courses plays the most

important role in conveying the knowledge.

Outline

Part I: Overview each block in spoken knowledge

structuring and organization

Speech Recognition

Temporal Structure

Spoken content retrieval

Linking related lectures

Speech summarization

Knowledge graph construction

Inferring prerequisite and advanced concepts

Part II: Spoken Content Retrieval

Part III: Speech Summarization

Part IV: Demo

Part I: Overview

(Multilingual)

Speech

Recognition

Speech Recognition

Recognition

Output

Multimedia

(Audio/Video)

Speech Recognition

Lectures on Coursera and edX has manual

transcriptions

Most lectures on the Internet do not have

transcriptions

Speech recognition!

Speech Recognition

Speech

Recognition

Speech Recognition is the foundation of the following

speech techniques

Spoken Lectures

DNA is the material

of ……

Hidden Markov Model

N-gram Language Models

Speech Recognition

Speech

Recognition


speech techniques

Spoken Lectures

DNA is the material

of ……

N-gram Language Models

Deep Neural Network

Speech Recognition

Speech

Recognition


speech techniques

Spoken Lectures

DNA is the material

of ……

Deep Neural Network Continues Language Model

(Multilingual)

Speech

Recognition

Multi-layer Temporal Structure

Recognition

Output

Multimedia

(Audio/Video)

Temporal Structure

Multi-layer temporal structure

lectures

course

ch 1 ch 2 ……

Learner

Too long ……

Multi-layer temporal structure

lectures

sections

paragraphs

course

ch 1 ch 2

Paragraph

1

Paragraph

2

Paragraph

3 ……

……

……

audio

Not Directly

Available

several

utterances

several

utterances

several

utterances

Find automatically

(Multilingual)

Speech

Recognition

Spoken Content Retrieval

Spoken

Content

Retrieval

Recognition

Output

Multimedia

(Audio/Video)

Temporal Structure

Basic goal: return paragraphs or sections containing

keywords

This is called “Spoken Term Detection” (口語詞彙偵測)

Spoken Content Retrieval – Goal

“DNA”

user

sections

Paragraph

1

Paragraph

2

Paragraph

3 ……

……

several

utterances

several

utterances

several

utterances

paragraphs

audio

Basic goal: return paragraphs or sections containing

keywords

This is called “Spoken Term Detection” (口語詞彙偵測)

Advanced goal: Semantic retrieval of spoken content

Spoken Content Retrieval – Goal

user

“Inheritance

Material”

I know that the user is

looking for “DNA”.

Retrieval system

(Multilingual)

Speech

Recognition

Visualizing Search Results

Spoken

Content

Retrieval

Recognition

Output

Multimedia

(Audio/Video)

Temporal Structure

Easy to browseLinking Similar

Lectures

Learner

Search

With spoken content retrieval, we can use keywords

to search related lectures

Electronic

Textbooks

Course AInheritance

Lecture 8-2 Lecture 8-3 Lecture 8-4


Course B

“DNA”

DNA

ReplicationDNA

Structure

Learner

Linking

Linking lectures with similar content

Compute similarity between lectures in courses and sections in

textbooks

Merge the materials with high cosine similarity

Course AInheritance


Lecture 5-5 Lecture 5-6

Course BLecture 5-4

“DNA”

(Multilingual)

Speech

Recognition

Speech Summarization

Syntactic and Semantic

Analysis

Speech

Summarization

Spoken

Content

Retrieval

Recognition

Output

Multimedia

(Audio/Video)

Temporal Structure


Lectures


Audio is hard to browse

Lecture

SummarySelect the most informative segments to

form a compact version

40 minutes

1 minute

(Multilingual)

Speech

Recognition

Knowledge Graph Construction


Analysis

Speech

Summarization

Spoken

Content

Retrieval

Key Term

Extraction

Recognition

Output

Knowledge

Graph

Multimedia

(Audio/Video)

Temporal Structure


Lectures

Audio transcriptions of video clips


- Keyterm Extraction


Keyterm extraction

Keyterm

ExtractionDNA

Inheritance

RNA

Adenosine triphosphate

……


- Relation Extraction


Keyterm extraction

Find relation between keyterms

Co-reference

Resolution

Transcriptions As DNA encodes RNA,

it is the material of inheritance.

Syntactic

Parsing




Keyterm extraction


Co-reference

Resolution

Transcriptions As DNA encodes RNA,

it is the material of inheritance.DNA

Syntactic

parsing tree

Relation

Extraction

Syntactic

Parsing




Keyterm extraction


Co-reference

Resolution

Transcriptions

relation

Syntactic

parsing tree

[Mausam, EMNLP’12].




Keyterm extraction


Knowledge

Graph

DNA

RNA

Protein

GenomeGene

Nucleotide

Inheritance

Knowledge Graph

(Multilingual)

Speech

Recognition

Inferring Learning Path


Analysis

Speech

Summarization

Spoken

Content

Retrieval

Key Term

Extraction

Recognition

Output

Knowledge

Graph

Multimedia

(Audio/Video)

Temporal Structure

Easy to browseInferring

Learning Path

Linking Similar

Lectures



Construct a knowledge graph

DNA

RNA

Protein

GenomeGene

Nucleotide

Inheritance

First mentioned

in Lecture 1

First mentioned

in Lecture 3

Analyze the positions where the concepts are mentioned

the first time in a course

First mentioned

in Lecture 10



Construct a knowledge graph

DNA

RNA

Protein

GenomeGene

Nucleotide

Inheritance

First mentioned

in Lecture 1

First mentioned

in Lecture 3

“Inheritance” is the prerequisite concept of “DNA”

“RNA” is the advanced concept of “DNA”

First mentioned

in Lecture 10

Part II:


People think ……


Speech Recognition

+

Text Retrieval=

Speech Recognition + Text Retrieval

Speech

Recognition

Text

Retrieval

ResultText

Retrieval Query learner

Spoken Lectures

Models


= Speech Recognition + Text Retrieval

Speech recognition has uncertainty

Speech Recognition + Text Retrieval

……….. DNA ...

x1

R(x1)=0.9

x2

R(x2)=0.3

x1 0.9

x2 0.3

…user

“DNA”

DNA ………….

Is the problem solved?

Speech

Recognition

Text

Retrieval

ResultText


Spoken Lectures

Models

The retrieval performance seriously degrades with

inevitable recognition errors.

In real application, speech recognition accuracy can be

low.

Is the problem solved?

Speech

Recognition

Text

Retrieval

ResultText


Spoken Lectures

Models

To make retrieval performance less limited by recognition

errors

We need new ideas beyond cascading speech recognition

and text retrieval.

My point


Speech Recognition

+

Text Retrieval=

Beyond Cascading Speech

Recognition and Text Retrieval

Incorporating Information Lost in Standard Speech

Recognition

Improving Recognition Models by User Relevance

feedback

Query Expansion with Speech Signals

Spoken Content Retrieval without Speech

Recognition

Interactive Retrieval

Similarity

Is it truly “DNA”?

Recognition

ModelsAcoustic Models

Language Model

Lots of inaccurate assumption

Speech

Recognition

Similarity

“DNA”

“DNA”

“DNA”

similarity

It is not realistic to find examples for all queries.

Use Pseudo-relevance Feedback (PRF)

Is it truly “DNA”?

Retrieval

System

Pseudo Relevance Feedback (PRF)

Query Q

First-pass Retrieval Resultx1

x2 x3

Recognition

Output

Retrieval

System


Query Q

Confidence scores

R(x1)


x2 x3

R(x2) R(x3)

Not shown to the user

Recognition

Output

Retrieval

System


Query Q

R(x1)


x2 x3

R(x2) R(x3)

Assume the result with high confidence scores as correct

Examples of Q

Considered as examples of Q

Recognition

Output

Retrieval

System


Query Q

R(x1)


x2 x3

R(x2) R(x3)

similar dissimilar

Examples of Q

Recognition

Output

Retrieval

System


Query Q

R(x1)


x2 x3

R(x2) R(x3)

time 1:01

time 2:05

time 1:45

…

time 2:16

time 7:22

time 9:01

Rank according to new scores

Examples of Q

Recognition

Output

How to compute the similarity of two audio

segments?

Similarity between Audio Segments

Use a feature vector to present a short time span.

similarity

… …

……

How to compute the similarity of two audio

segments?


A audio segment is a sequence of feature vectors.

similarity

… … … … … … … … … … …


Dynamic Time Warping (DTW)

… … … … … … …

…

…

…

…

…

…

(A) (B)

Digital Speech Processing (DSP) of NTU based on lattices


- Experiments

Evaluation Measure: MAP (Mean Average Precision)

(A) (B)


- Experiments

(B): speaker independent (50% recognition accuracy)

(A): speaker dependent (84% recognition accuracy)

(A) and (B) use different speech recognition systems

(A) (B)

PRF (red bars) improved the first-pass retrieval

results with lattices (blue bars)


- Experiments

Graph-based Approach

In PRF, each result considers the similarity to some

examples

Consider the similarity between all results

Formulated as a problem on graph

Graph Construction

The first-pass results is considered as a graph.

Each retrieval result is a node

First-pass Retrieval

Result from lattices

x1

x2

x3

x2

x3

x1

x4

x5

Graph Construction

The first-pass results is considered as a graph.

Nodes are connected if their retrieval results are similar.

x2

x3

x1

x4

x5Dynamic Time Warping

(DTW) Similarity

similar

Changing Confidence Scores by Graph

The score of each node depends on its neighbors.

x2

x3

x1

x4

x5

R(x1)

R(x2)

R(x3)

R(x5)R(x4)

high

high

近朱者赤

近墨者黑


The score of each node depends on its neighbors.

x2

x3

x1

x4

x5

R(x1)

R(x2)

R(x3)

R(x5)R(x4)

low

low

近朱者赤

近墨者黑

The score of each node depends on its connected

nodes.

x2

x3

x1

x4

x5


Score of x1 depends on

the scores of x2 and x3


nodes.

x2

x3

x1

x4

x5







nodes.

x2

x3

x1

x4

x5

……






The scores are found by random walk algorithm.

Digital Speech Processing (DSP) of NTU based on lattices

(A) (B)

Graph-based Approach -

Experiments

(B): speaker independent (low recognition accuracy)

(A): speaker dependent (high recognition accuracy)

(A) (B)

Graph-based re-ranking (green bars) outperformed PRF (red

bars)

Graph-based Approach -

Experiments

Graph-based Approach –

Experiments on Babel Program

Join Babel program (巴別塔計畫) at MIT

Evaluation program of spoken term detection

More than 30 research groups divided into 4

teams

Spoken content to be retrieved are in special

languages

Graph-based Approach –

Experiments on Babel Program

3 out of 4 teams used this approach

Speech recognition system is based on deep neutral networks

New Directions for



Recognition


feedback



Recognition


Online search engine optimizes performance by user

relevance feedback

E.g. click-through data [T Joachims, SIGKDD 02]

User Relevance Feedback

time 1:10 T

time 2:01 F

time 3:04 T

time 5:31 T

time 1:10 T

time 2:01 F

time 3:04 T

time 5:31 F

time 1:10 F

time 2:01 F

time 3:04 T

time 5:31 T

Query Q1 Query Q2 Query Qn

……

Update Recognition Models

Speech

Recognition Models

TextText


Spoken

Content

Retrieval

Result

update

optimize

time 1:10

time 2:01

time 3:04

time 5:31 T

time 1:10 T

time 2:01 F

time 3:04

time 5:31

time 1:10 F

time 2:01

time 3:04

time 5:31

Query Q1 Query Q2 Query Qn

……

New Directions for



Recognition


feedback



Recognition


Query Expansion

for Text Retrieval

learner

“Inheritance material”

Retrieval

system

Search “Inheritance material” and “DNA”

To handle the problem of semantic retrieval,

retrieval system will expand the user query.

Query Expansion

for Spoken Content Retrieval

learner



Expand the queries by speech signals

Speech signals

of “DNA”

Retrieval

system

Query Expansion

for Spoken Content Retrieval

learner



Retrieval

system

Recognition

output (Text)

Speech signals

of “DNA”

Match on speech signal level

New Directions for



Recognition


feedback



Recognition


Spoken Content Retrieval without

Speech Recognition

Spoken Queries

Spoken

Lectures

user

“DNA”

Match on speech

signal level

New Directions for



Recognition


feedback



Recognition



Model the interactive retrieval process as

Markov Decision Process (MDP)

user

Deep Neural

Network

Find something

related to “speech”?

Part III:


MMR approach

Maximum marginal relevance (MMR)

approach

Unsupervised approach: Use heuristic rules to

select utterances

Select utterances whose content are similar to

the whole lectures

Minimize redundancy in summary at the same

time

Supervised Approach

Training dataLecture 1

Lecture 2

Lecture 3

2nd and 4th utterances

form the summary

3rd utterances form the

summary

1st and 2nd utterances

form the summary

Use the training data to learn model

for summarization

Supervised Approach

– Binary Classification

Summarization problem can be formulated as a

binary classification program

Included in the summary or not

utterance 1

utterance 2

utterance 3

utterance 4

Binary

Classifier-1

+1

+1

-1

utterance 2

utterance 3

classification

result

summary

Binary

ClassifierBinary

ClassifierBinary

Classifier

Lecture

Supervised Approach


Training data

Lecture 1

Lecture 2

Lecture 3

The utterances in the

summary are positive

examples.

Otherwise, negative

examples

Train a binary classifier

Supervised Approach


Binary classifier individually considers each utterance

To generate a good summary, “global information” should be

considered

Example: summary should be concise

大家好 ……

LSA就是 Latent semantic analysis

LSA用來強化 summarization

LSA可以用來強化 summarization

我再說一次

……

…….

LSA可以用來強化 summarization

Spoken DocumentSummary

More advanced machine learning techniques

Globally considering the whole

spoken lectures

Learn a special model by structured learning

techniques

Input: whole lecture

Output: summary

Special

ModelLecture

Summary

Consider the

whole lecture

3 utterances

selected in

summary

Evaluation Function

Evaluation function of utterance set F(s)

s: utterance set in a lecture

F(s) 10

評分utterance

set s

how suitable it is to

consider utterance set

s as the summary

Properties:

• Concise?

• Include

keyword?

• Short enough?

……….

How good it is to take this

utterance set as summary?

Lecture

Evaluation Function

– How to summary

With F(s), we can do summarization on new

lectures now

Lecture

s1

s2

s3

s4

s5

s6

s7

Compute F(s) for

all utterance sets

If s6

maximizes

F(s)

summary

Enumerate all

the possible

utterance set s

Evaluation Function

Evaluation function of utterance set F(s)

s: utterance set in a lecture

F(s) 10

評分utterance

set s

how suitable it is to

consider utterance set

s as the summary

Properties:

• Concise?

• Include

keyword?

• Short enough?

……….

How good it is to take this

utterance set as summary?

Lecture

What properties

should F(s) check?

Learn F(s) from training data

Reference

summary

Reference

summary

Evaluation Function - Training

…

9

7

-4

highFind F(s) such that

lecture

Training data

F(s)

F(s)

F(s)

Structured SVM: I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support

Vector Learning for Interdependent and Structured Output Spaces, ICML, 2004.

Speech Summarization - Structure

Temporal structure helps summarization

lectures

sections

paragraphs

ch 1

Paragraph

1

Paragraph

2

Paragraph

3 ……

……

audio

several

utterances

several

utterances

several

utterances


Temporal structure helps summarization

Long summary: consecutive utterances in a

paragraph are more likely to be

Short summary: one utterance is selected on behalf

of a paragraph.

…𝑥𝑖+3𝑥𝑖−2 𝑥𝑖−1 𝑥𝑖+6𝑥𝑖+4 𝑥𝑖+5

…𝑥𝑖+3𝑥𝑖−2 𝑥𝑖+6𝑥𝑖+4

Important paragraph

Representative of the paragraph

𝑥𝑖+5𝑥𝑖 𝑥𝑖+1 𝑥𝑖+2𝑥𝑖−1

𝑥𝑖 𝑥𝑖+1 𝑥𝑖+2

Paragraph 1 Paragraph 2 Paragraph 3

Paragraph 1 Paragraph 2

Evaluation Function - Structure

Add structure information into evaluation function

of utterance set F(s)

F(s) 100評分

Properties:

• Concise?

• Include

keyword?

• Short enough?

……….

utterances

Given the

information of

structure

Paragraph 1 Paragraph 2


Structure in text are clear

Paragraph boundaries are directly known

For spoken content, there is no obvious

structure

Here the structure are considered as “hidden

variables”

Structured learning with hidden variables

Speech Summarization -

Experiments

Evaluation Measure: ROUGE-1 and ROUGE-2

Larger scores means the machine-generated summaries

is more similar to human-generated summaries.

Part IV:

Demo

On-line lecture platforms (MIT)

“Cang-Jie (倉頡)”:

Search lecture recording and textbook

Linking video clips or textbook sections with similar

content


http://people.csail.mit.edu/tlkagk/Cangjie/

Concluding Remarks

(Multilingual)

Speech

Recognition

Towards Spoken Knowledge

Structuring and Organization


Analysis

Speech

Summarization

Spoken

Content

Retrieval

Key Term

Extraction

Recognition

Output

Knowledge

Graph

Multimedia

(Audio/Video)

Temporal Structure

Visualizing

Search ResultEasy to browse

Inferring

Learning Path

Ultimate Goal

I want to learn “XXX”.

I am a graduate student of

computer science.

I can spend 10 hours.Learner

I open a course for you.

on-line learning

material

Personalized course for each learner

Thank You for Your Attention

Appendix

Video Demonstration

Paragraph Boundaries

With speech recognition, we know the content of

each utterances

Compute their similarities

Find the boundary of paragraph such that

The content of the utterances in a paragraph is

similar

Paragraph

1

Paragraph

2

Paragraph

3 ……paragraphs

audio

Paragraph

4

Slide Boundaries

The slides are modeled as HMMs

Align the slides with paragraphs

Paragraph

1

Paragraph

2

Paragraph

3 ……paragraphs

audio

Paragraph

4

s1 s2

Evaluation function F(s)

A good summary should

1. include the most important utterance

2. but minimize the redundancy at the same time

3. not too long

Utterance set s fulfill the above

requirement should have large F(s)





3. not too long

sx

i

i

xIsF

I(xi): importance of utterance xi

ii xfwxI f(xi): feature of sentence xi

Evaluation function F(s,D)




3. not too long

Di sx

ixIsF

I(xi): importance of utterance xi

ii xfwxI f(xi): feature of sentence xi

Lexical feature: use speech recognition

to transcribe each utterance into text

similarity to the transcriptions of whole

lectures

latent topic distribution

how many keywords

……

Prosodic feature:

Energy, pitch, syllable duration,

pause duration

weights for each feature





3. not too long

λ is a parameter to be determined.

sx sxx

jii

i ji

xxSimxIsF,

,

Sim(xi, xj): similarity between utterances xi and xj





3. not too long

sx sxx

jii

i ji

xxSimxIsF,

,

sx

i

i

KxL(constraint)

L(xi): length of utterance xi

K: length constraint of summary





3. not too long

sx sxx

jii

i ji

xxSimxIsF,

,

sx

i

i

KxL(constraint)

ii xfwxI Jointly learn from

training data

Idea of Training

Training data

D1

D2

Reference

summaryR1

Reference

summaryR2

……

Find w and λ in F(s) such that

F(R1) > F(sD1)

F(R2) > F(sD2)

……

sD1 is all utterance set in D1, except R1

sD2 is all utterance set in D2, except R2

I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for

Interdependent and Structured Output Spaces, ICML, 2004.

Towards Spoken Knowledge Structuring and...

Documents

Transcript of Towards Spoken Knowledge Structuring and...