Towards Spoken Knowledge Structuring and...
Transcript of Towards Spoken Knowledge Structuring and...
Speaker: Hung-yi Lee
Towards Spoken Knowledge
Structuring and Organization
When Speech Processing Technology
meets MOOCs
Introduction
2012 is the year of the massive open online
course (MOOC)
Instructors post recorded video/audio of their lectures on
online lecture platforms.
Learners worldwide can easily access the curricula.
More learning materials
Too much materials ……
Learner
I want to learn “SVM”.
A course is too much
Learner
I want to learn “SVM”.
Machine Learning(by Andrew Ng)
XII. Support Vector
Machines (Week 7)
Learning From Data
(Yaser S. Abu-Mostafa)
Lecture 14: Support
Vector Machines
Lecture 15: Kernel
Methods
機器學習基石(by 林軒田)
A course is not enough
Inter-discipline
Learner
I want to learn
“amino acid”.
Introduction to Biology
Amino acid
Amino acid
Organic Chemistry
Vision: Personalized Courses
I want to learn “XXX”.
I am a graduate student of
computer science.
I can spend 10 hours.Learner
I open a course for you.
on-line learning
material
Spoken Language Processing techniques can be very
helpful.
The spoken content in courses plays the most
important role in conveying the knowledge.
Outline
Part I: Overview each block in spoken knowledge
structuring and organization
Speech Recognition
Temporal Structure
Spoken content retrieval
Linking related lectures
Speech summarization
Knowledge graph construction
Inferring prerequisite and advanced concepts
Part II: Spoken Content Retrieval
Part III: Speech Summarization
Part IV: Demo
Part I: Overview
(Multilingual)
Speech
Recognition
Speech Recognition
Recognition
Output
Multimedia
(Audio/Video)
Speech Recognition
Lectures on Coursera and edX has manual
transcriptions
Most lectures on the Internet do not have
transcriptions
Speech recognition!
Speech Recognition
Speech
Recognition
Speech Recognition is the foundation of the following
speech techniques
Spoken Lectures
DNA is the material
of ……
Hidden Markov Model
N-gram Language Models
Speech Recognition
Speech
Recognition
Speech Recognition is the foundation of the following
speech techniques
Spoken Lectures
DNA is the material
of ……
N-gram Language Models
Deep Neural Network
Speech Recognition
Speech
Recognition
Speech Recognition is the foundation of the following
speech techniques
Spoken Lectures
DNA is the material
of ……
Deep Neural Network Continues Language Model
(Multilingual)
Speech
Recognition
Multi-layer Temporal Structure
Recognition
Output
Multimedia
(Audio/Video)
Temporal Structure
Multi-layer temporal structure
lectures
course
ch 1 ch 2 ……
Learner
Too long ……
Multi-layer temporal structure
lectures
sections
paragraphs
course
ch 1 ch 2
Paragraph
1
Paragraph
2
Paragraph
3 ……
……
……
audio
Not Directly
Available
several
utterances
several
utterances
several
utterances
Find automatically
(Multilingual)
Speech
Recognition
Spoken Content Retrieval
Spoken
Content
Retrieval
Recognition
Output
Multimedia
(Audio/Video)
Temporal Structure
Basic goal: return paragraphs or sections containing
keywords
This is called “Spoken Term Detection” (口語詞彙偵測)
Spoken Content Retrieval – Goal
“DNA”
user
sections
Paragraph
1
Paragraph
2
Paragraph
3 ……
……
several
utterances
several
utterances
several
utterances
paragraphs
audio
Basic goal: return paragraphs or sections containing
keywords
This is called “Spoken Term Detection” (口語詞彙偵測)
Advanced goal: Semantic retrieval of spoken content
Spoken Content Retrieval – Goal
user
“Inheritance
Material”
I know that the user is
looking for “DNA”.
Retrieval system
(Multilingual)
Speech
Recognition
Visualizing Search Results
Spoken
Content
Retrieval
Recognition
Output
Multimedia
(Audio/Video)
Temporal Structure
Easy to browseLinking Similar
Lectures
Learner
Search
With spoken content retrieval, we can use keywords
to search related lectures
Electronic
Textbooks
Course AInheritance
Lecture 8-2 Lecture 8-3 Lecture 8-4
Lecture 5-4 Lecture 5-5 Lecture 5-6
Course B
“DNA”
DNA
ReplicationDNA
Structure
Learner
Linking
Linking lectures with similar content
Compute similarity between lectures in courses and sections in
textbooks
Merge the materials with high cosine similarity
Course AInheritance
Lecture 8-2 Lecture 8-3 Lecture 8-4
Lecture 5-5 Lecture 5-6
Course BLecture 5-4
“DNA”
(Multilingual)
Speech
Recognition
Speech Summarization
Syntactic and Semantic
Analysis
Speech
Summarization
Spoken
Content
Retrieval
Recognition
Output
Multimedia
(Audio/Video)
Temporal Structure
Easy to browseLinking Similar
Lectures
Speech Summarization
Audio is hard to browse
Lecture
SummarySelect the most informative segments to
form a compact version
40 minutes
1 minute
(Multilingual)
Speech
Recognition
Knowledge Graph Construction
Syntactic and Semantic
Analysis
Speech
Summarization
Spoken
Content
Retrieval
Key Term
Extraction
Recognition
Output
Knowledge
Graph
Multimedia
(Audio/Video)
Temporal Structure
Easy to browseLinking Similar
Lectures
Audio transcriptions of video clips
Knowledge Graph Construction
- Keyterm Extraction
Knowledge graph construction
Keyterm extraction
Keyterm
ExtractionDNA
Inheritance
RNA
Adenosine triphosphate
……
Knowledge Graph Construction
- Relation Extraction
Knowledge graph construction
Keyterm extraction
Find relation between keyterms
Co-reference
Resolution
Transcriptions As DNA encodes RNA,
it is the material of inheritance.
Syntactic
Parsing
Knowledge Graph Construction
- Relation Extraction
Knowledge graph construction
Keyterm extraction
Find relation between keyterms
Co-reference
Resolution
Transcriptions As DNA encodes RNA,
it is the material of inheritance.DNA
Syntactic
parsing tree
Relation
Extraction
Syntactic
Parsing
Knowledge Graph Construction
- Relation Extraction
Knowledge graph construction
Keyterm extraction
Find relation between keyterms
Co-reference
Resolution
Transcriptions
relation
Syntactic
parsing tree
[Mausam, EMNLP’12].
Knowledge Graph Construction
- Relation Extraction
Knowledge graph construction
Keyterm extraction
Find relation between keyterms
Knowledge
Graph
DNA
RNA
Protein
GenomeGene
Nucleotide
Inheritance
Knowledge Graph
(Multilingual)
Speech
Recognition
Inferring Learning Path
Syntactic and Semantic
Analysis
Speech
Summarization
Spoken
Content
Retrieval
Key Term
Extraction
Recognition
Output
Knowledge
Graph
Multimedia
(Audio/Video)
Temporal Structure
Easy to browseInferring
Learning Path
Linking Similar
Lectures
Inferring Learning Path
Inferring prerequisite and advanced concepts
Construct a knowledge graph
DNA
RNA
Protein
GenomeGene
Nucleotide
Inheritance
First mentioned
in Lecture 1
First mentioned
in Lecture 3
Analyze the positions where the concepts are mentioned
the first time in a course
First mentioned
in Lecture 10
Inferring Learning Path
Inferring prerequisite and advanced concepts
Construct a knowledge graph
DNA
RNA
Protein
GenomeGene
Nucleotide
Inheritance
First mentioned
in Lecture 1
First mentioned
in Lecture 3
“Inheritance” is the prerequisite concept of “DNA”
“RNA” is the advanced concept of “DNA”
First mentioned
in Lecture 10
Part II:
Spoken Content Retrieval
People think ……
Spoken Content Retrieval
Speech Recognition
+
Text Retrieval=
Speech Recognition + Text Retrieval
Speech
Recognition
Text
Retrieval
ResultText
Retrieval Query learner
Spoken Lectures
Models
Spoken Content Retrieval
= Speech Recognition + Text Retrieval
Speech recognition has uncertainty
Speech Recognition + Text Retrieval
……….. DNA ...
x1
R(x1)=0.9
x2
R(x2)=0.3
x1 0.9
x2 0.3
…user
“DNA”
DNA ………….
Is the problem solved?
Speech
Recognition
Text
Retrieval
ResultText
Retrieval Query learner
Spoken Lectures
Models
The retrieval performance seriously degrades with
inevitable recognition errors.
In real application, speech recognition accuracy can be
low.
Is the problem solved?
Speech
Recognition
Text
Retrieval
ResultText
Retrieval Query learner
Spoken Lectures
Models
To make retrieval performance less limited by recognition
errors
We need new ideas beyond cascading speech recognition
and text retrieval.
My point
Spoken Content Retrieval
Speech Recognition
+
Text Retrieval=
Beyond Cascading Speech
Recognition and Text Retrieval
Incorporating Information Lost in Standard Speech
Recognition
Improving Recognition Models by User Relevance
feedback
Query Expansion with Speech Signals
Spoken Content Retrieval without Speech
Recognition
Interactive Retrieval
Beyond Cascading Speech
Recognition and Text Retrieval
Incorporating Information Lost in Standard Speech
Recognition
Improving Recognition Models by User Relevance
feedback
Query Expansion with Speech Signals
Spoken Content Retrieval without Speech
Recognition
Interactive Retrieval
Similarity
Is it truly “DNA”?
Recognition
ModelsAcoustic Models
Language Model
Lots of inaccurate assumption
Speech
Recognition
Similarity
“DNA”
“DNA”
“DNA”
similarity
It is not realistic to find examples for all queries.
Use Pseudo-relevance Feedback (PRF)
Is it truly “DNA”?
Retrieval
System
Pseudo Relevance Feedback (PRF)
Query Q
First-pass Retrieval Resultx1
x2 x3
Recognition
Output
Retrieval
System
Pseudo Relevance Feedback (PRF)
Query Q
Confidence scores
R(x1)
First-pass Retrieval Resultx1
x2 x3
R(x2) R(x3)
Not shown to the user
Recognition
Output
Retrieval
System
Pseudo Relevance Feedback (PRF)
Query Q
R(x1)
First-pass Retrieval Resultx1
x2 x3
R(x2) R(x3)
Assume the result with high confidence scores as correct
Examples of Q
Considered as examples of Q
Recognition
Output
Retrieval
System
Pseudo Relevance Feedback (PRF)
Query Q
R(x1)
First-pass Retrieval Resultx1
x2 x3
R(x2) R(x3)
similar dissimilar
Examples of Q
Recognition
Output
Retrieval
System
Pseudo Relevance Feedback (PRF)
Query Q
R(x1)
First-pass Retrieval Resultx1
x2 x3
R(x2) R(x3)
time 1:01
time 2:05
time 1:45
…
time 2:16
time 7:22
time 9:01
Rank according to new scores
Examples of Q
Recognition
Output
How to compute the similarity of two audio
segments?
Similarity between Audio Segments
Use a feature vector to present a short time span.
similarity
… …
……
How to compute the similarity of two audio
segments?
Similarity between Audio Segments
A audio segment is a sequence of feature vectors.
similarity
… … … … … … … … … … …
Similarity between Audio Segments
Dynamic Time Warping (DTW)
… … … … … … …
…
…
…
…
…
…
(A) (B)
Digital Speech Processing (DSP) of NTU based on lattices
Pseudo Relevance Feedback (PRF)
- Experiments
Evaluation Measure: MAP (Mean Average Precision)
(A) (B)
Pseudo Relevance Feedback (PRF)
- Experiments
(B): speaker independent (50% recognition accuracy)
(A): speaker dependent (84% recognition accuracy)
(A) and (B) use different speech recognition systems
(A) (B)
PRF (red bars) improved the first-pass retrieval
results with lattices (blue bars)
Pseudo Relevance Feedback (PRF)
- Experiments
Graph-based Approach
In PRF, each result considers the similarity to some
examples
Consider the similarity between all results
Formulated as a problem on graph
Graph Construction
The first-pass results is considered as a graph.
Each retrieval result is a node
First-pass Retrieval
Result from lattices
x1
x2
x3
x2
x3
x1
x4
x5
Graph Construction
The first-pass results is considered as a graph.
Nodes are connected if their retrieval results are similar.
x2
x3
x1
x4
x5Dynamic Time Warping
(DTW) Similarity
similar
Changing Confidence Scores by Graph
The score of each node depends on its neighbors.
x2
x3
x1
x4
x5
R(x1)
R(x2)
R(x3)
R(x5)R(x4)
high
high
近朱者赤
近墨者黑
Changing Confidence Scores by Graph
The score of each node depends on its neighbors.
x2
x3
x1
x4
x5
R(x1)
R(x2)
R(x3)
R(x5)R(x4)
low
low
近朱者赤
近墨者黑
The score of each node depends on its connected
nodes.
x2
x3
x1
x4
x5
Changing Confidence Scores by Graph
Score of x1 depends on
the scores of x2 and x3
The score of each node depends on its connected
nodes.
x2
x3
x1
x4
x5
Changing Confidence Scores by Graph
Score of x1 depends on
the scores of x2 and x3
Score of x2 depends on
the scores of x1 and x3
The score of each node depends on its connected
nodes.
x2
x3
x1
x4
x5
……
Changing Confidence Scores by Graph
Score of x1 depends on
the scores of x2 and x3
Score of x2 depends on
the scores of x1 and x3
The scores are found by random walk algorithm.
Digital Speech Processing (DSP) of NTU based on lattices
(A) (B)
Graph-based Approach -
Experiments
(B): speaker independent (low recognition accuracy)
(A): speaker dependent (high recognition accuracy)
(A) (B)
Graph-based re-ranking (green bars) outperformed PRF (red
bars)
Graph-based Approach -
Experiments
Graph-based Approach –
Experiments on Babel Program
Join Babel program (巴別塔計畫) at MIT
Evaluation program of spoken term detection
More than 30 research groups divided into 4
teams
Spoken content to be retrieved are in special
languages
Graph-based Approach –
Experiments on Babel Program
3 out of 4 teams used this approach
Speech recognition system is based on deep neutral networks
New Directions for
Spoken Content Retrieval
Incorporating Information Lost in Standard Speech
Recognition
Improving Recognition Models by User Relevance
feedback
Query Expansion with Speech Signals
Spoken Content Retrieval without Speech
Recognition
Interactive Retrieval
Online search engine optimizes performance by user
relevance feedback
E.g. click-through data [T Joachims, SIGKDD 02]
User Relevance Feedback
time 1:10 T
time 2:01 F
time 3:04 T
time 5:31 T
time 1:10 T
time 2:01 F
time 3:04 T
time 5:31 F
time 1:10 F
time 2:01 F
time 3:04 T
time 5:31 T
Query Q1 Query Q2 Query Qn
……
Update Recognition Models
Speech
Recognition Models
TextText
Retrieval Query learner
Spoken
Content
Retrieval
Result
update
optimize
time 1:10
time 2:01
time 3:04
time 5:31 T
time 1:10 T
time 2:01 F
time 3:04
time 5:31
time 1:10 F
time 2:01
time 3:04
time 5:31
Query Q1 Query Q2 Query Qn
……
New Directions for
Spoken Content Retrieval
Incorporating Information Lost in Standard Speech
Recognition
Improving Recognition Models by User Relevance
feedback
Query Expansion with Speech Signals
Spoken Content Retrieval without Speech
Recognition
Interactive Retrieval
Query Expansion
for Text Retrieval
learner
“Inheritance material”
Retrieval
system
Search “Inheritance material” and “DNA”
To handle the problem of semantic retrieval,
retrieval system will expand the user query.
Query Expansion
for Spoken Content Retrieval
learner
“Inheritance material”
Search “Inheritance material” and “DNA”
Expand the queries by speech signals
Speech signals
of “DNA”
Retrieval
system
Query Expansion
for Spoken Content Retrieval
learner
“Inheritance material”
Search “Inheritance material” and “DNA”
Retrieval
system
Recognition
output (Text)
Speech signals
of “DNA”
Match on speech signal level
New Directions for
Spoken Content Retrieval
Incorporating Information Lost in Standard Speech
Recognition
Improving Recognition Models by User Relevance
feedback
Query Expansion with Speech Signals
Spoken Content Retrieval without Speech
Recognition
Interactive Retrieval
Spoken Content Retrieval without
Speech Recognition
Spoken Queries
Spoken
Lectures
user
“DNA”
Match on speech
signal level
New Directions for
Spoken Content Retrieval
Incorporating Information Lost in Standard Speech
Recognition
Improving Recognition Models by User Relevance
feedback
Query Expansion with Speech Signals
Spoken Content Retrieval without Speech
Recognition
Interactive Retrieval
Interactive Retrieval
Model the interactive retrieval process as
Markov Decision Process (MDP)
user
Deep Neural
Network
Find something
related to “speech”?
Part III:
Speech Summarization
MMR approach
Maximum marginal relevance (MMR)
approach
Unsupervised approach: Use heuristic rules to
select utterances
Select utterances whose content are similar to
the whole lectures
Minimize redundancy in summary at the same
time
Supervised Approach
Training dataLecture 1
Lecture 2
Lecture 3
2nd and 4th utterances
form the summary
3rd utterances form the
summary
1st and 2nd utterances
form the summary
Use the training data to learn model
for summarization
Supervised Approach
– Binary Classification
Summarization problem can be formulated as a
binary classification program
Included in the summary or not
utterance 1
utterance 2
utterance 3
utterance 4
Binary
Classifier-1
+1
+1
-1
utterance 2
utterance 3
classification
result
summary
Binary
ClassifierBinary
ClassifierBinary
Classifier
Lecture
Supervised Approach
– Binary Classification
Training data
Lecture 1
Lecture 2
Lecture 3
The utterances in the
summary are positive
examples.
Otherwise, negative
examples
Train a binary classifier
Supervised Approach
– Binary Classification
Binary classifier individually considers each utterance
To generate a good summary, “global information” should be
considered
Example: summary should be concise
大家好 ……
LSA就是 Latent semantic analysis
LSA用來強化 summarization
LSA可以用來強化 summarization
我再說一次
……
…….
LSA可以用來強化 summarization
Spoken DocumentSummary
More advanced machine learning techniques
Globally considering the whole
spoken lectures
Learn a special model by structured learning
techniques
Input: whole lecture
Output: summary
Special
ModelLecture
Summary
Consider the
whole lecture
3 utterances
selected in
summary
Evaluation Function
Evaluation function of utterance set F(s)
s: utterance set in a lecture
F(s) 10
評分utterance
set s
how suitable it is to
consider utterance set
s as the summary
Properties:
• Concise?
• Include
keyword?
• Short enough?
……….
How good it is to take this
utterance set as summary?
Lecture
Evaluation Function
– How to summary
With F(s), we can do summarization on new
lectures now
Lecture
s1
s2
s3
s4
s5
s6
s7
Compute F(s) for
all utterance sets
If s6
maximizes
F(s)
summary
Enumerate all
the possible
utterance set s
Evaluation Function
Evaluation function of utterance set F(s)
s: utterance set in a lecture
F(s) 10
評分utterance
set s
how suitable it is to
consider utterance set
s as the summary
Properties:
• Concise?
• Include
keyword?
• Short enough?
……….
How good it is to take this
utterance set as summary?
Lecture
What properties
should F(s) check?
Learn F(s) from training data
Reference
summary
Reference
summary
Evaluation Function - Training
…
9
7
-4
highFind F(s) such that
lecture
Training data
F(s)
F(s)
F(s)
Structured SVM: I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support
Vector Learning for Interdependent and Structured Output Spaces, ICML, 2004.
Speech Summarization - Structure
Temporal structure helps summarization
lectures
sections
paragraphs
ch 1
Paragraph
1
Paragraph
2
Paragraph
3 ……
……
audio
several
utterances
several
utterances
several
utterances
Speech Summarization - Structure
Temporal structure helps summarization
Long summary: consecutive utterances in a
paragraph are more likely to be
Short summary: one utterance is selected on behalf
of a paragraph.
…𝑥𝑖+3𝑥𝑖−2 𝑥𝑖−1 𝑥𝑖+6𝑥𝑖+4 𝑥𝑖+5
…𝑥𝑖+3𝑥𝑖−2 𝑥𝑖+6𝑥𝑖+4
Important paragraph
Representative of the paragraph
𝑥𝑖+5𝑥𝑖 𝑥𝑖+1 𝑥𝑖+2𝑥𝑖−1
𝑥𝑖 𝑥𝑖+1 𝑥𝑖+2
Paragraph 1 Paragraph 2 Paragraph 3
Paragraph 1 Paragraph 2
Evaluation Function - Structure
Add structure information into evaluation function
of utterance set F(s)
F(s) 100評分
Properties:
• Concise?
• Include
keyword?
• Short enough?
……….
utterances
Given the
information of
structure
Paragraph 1 Paragraph 2
Speech Summarization - Structure
Structure in text are clear
Paragraph boundaries are directly known
For spoken content, there is no obvious
structure
Here the structure are considered as “hidden
variables”
Structured learning with hidden variables
Speech Summarization -
Experiments
Evaluation Measure: ROUGE-1 and ROUGE-2
Larger scores means the machine-generated summaries
is more similar to human-generated summaries.
Part IV:
Demo
On-line lecture platforms (MIT)
“Cang-Jie (倉頡)”:
Search lecture recording and textbook
Linking video clips or textbook sections with similar
content
Inferring prerequisite and advanced concepts
http://people.csail.mit.edu/tlkagk/Cangjie/
Concluding Remarks
(Multilingual)
Speech
Recognition
Towards Spoken Knowledge
Structuring and Organization
Syntactic and Semantic
Analysis
Speech
Summarization
Spoken
Content
Retrieval
Key Term
Extraction
Recognition
Output
Knowledge
Graph
Multimedia
(Audio/Video)
Temporal Structure
Visualizing
Search ResultEasy to browse
Inferring
Learning Path
Ultimate Goal
I want to learn “XXX”.
I am a graduate student of
computer science.
I can spend 10 hours.Learner
I open a course for you.
on-line learning
material
Personalized course for each learner
Thank You for Your Attention
Appendix
Video Demonstration
Paragraph Boundaries
With speech recognition, we know the content of
each utterances
Compute their similarities
Find the boundary of paragraph such that
The content of the utterances in a paragraph is
similar
Paragraph
1
Paragraph
2
Paragraph
3 ……paragraphs
audio
Paragraph
4
Slide Boundaries
The slides are modeled as HMMs
Align the slides with paragraphs
Paragraph
1
Paragraph
2
Paragraph
3 ……paragraphs
audio
Paragraph
4
s1 s2
Evaluation function F(s)
A good summary should
1. include the most important utterance
2. but minimize the redundancy at the same time
3. not too long
Utterance set s fulfill the above
requirement should have large F(s)
Evaluation function F(s)
A good summary should
1. include the most important utterance
2. but minimize the redundancy at the same time
3. not too long
sx
i
i
xIsF
I(xi): importance of utterance xi
ii xfwxI f(xi): feature of sentence xi
Evaluation function F(s,D)
A good summary should
1. include the most important utterance
2. but minimize the redundancy at the same time
3. not too long
Di sx
ixIsF
I(xi): importance of utterance xi
ii xfwxI f(xi): feature of sentence xi
Lexical feature: use speech recognition
to transcribe each utterance into text
similarity to the transcriptions of whole
lectures
latent topic distribution
how many keywords
……
Prosodic feature:
Energy, pitch, syllable duration,
pause duration
weights for each feature
Evaluation function F(s)
A good summary should
1. include the most important utterance
2. but minimize the redundancy at the same time
3. not too long
λ is a parameter to be determined.
sx sxx
jii
i ji
xxSimxIsF,
,
Sim(xi, xj): similarity between utterances xi and xj
Evaluation function F(s)
A good summary should
1. include the most important utterance
2. but minimize the redundancy at the same time
3. not too long
sx sxx
jii
i ji
xxSimxIsF,
,
sx
i
i
KxL(constraint)
L(xi): length of utterance xi
K: length constraint of summary
Evaluation function F(s)
A good summary should
1. include the most important utterance
2. but minimize the redundancy at the same time
3. not too long
sx sxx
jii
i ji
xxSimxIsF,
,
sx
i
i
KxL(constraint)
ii xfwxI Jointly learn from
training data
Idea of Training
Training data
D1
D2
Reference
summaryR1
Reference
summaryR2
……
Find w and λ in F(s) such that
F(R1) > F(sD1)
F(R2) > F(sD2)
……
sD1 is all utterance set in D1, except R1
sD2 is all utterance set in D2, except R2
I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for
Interdependent and Structured Output Spaces, ICML, 2004.