Tv parser an automatic tv video parsing method_liang_20100309
Introduction | Our solution | TVParser model | Experimental Results | Conclusion
TVParser: An Automatic TV Video Parsing Method
Chao Liang
National Laboratory of Pattern Recognition (NLPR)
Chinese Academy of Sciences, Institute of Automation (CASIA)
March 9, 2011
Chao Liang TVParser: An Automatic TV Video Parsing Method
Outline
1 Introduction
    Motivation
    Related work
2 Our solution
    Basic ideas
    Role histogram
3 TVParser model
    Model formulation
    Parameter estimation
    State inference
4 Experimental Results
    Data sets
    Face naming
    Scene segmentation
5 Conclusion
Introduction
Motivation
Voluminous TV videos vs. efficient management
Introduction
TV video
Story plot (scene structure)
[Scene: Monica and Rachel's, Carol and Susan are showing off Ben to the gang.]
Phoebe: Oh my God, oh, ok, was that too much pressure for him?
Susan: Oh, is he hungry already?
Carol: I guess so. (Carol starts to breast feed Ben.)
… …
[Scene: Central Perk, the gang is all there.]
Julie: Rachel, do you have any muffins left?
Rachel: Yeah, I forget which ones.
Julie: Oh, you're busy, that's ok, I'll get it. Anybody else want one?
… …
Characters (named faces)
RACH MNCA PHBE JOY CHAN ROSS
Related work
Movie/Script alignment
Script-subtitle alignment
[Figure: script -> subtitle -> movie alignment]
Script:
  [Scene: Rachel is entering the living room.]
  Monica: Julie.
  Rachel: What?!
Subtitle:
  00:10:44,210 --> 00:10:45,177  Monica: Julie.
  00:10:45,444 --> 00:10:46,775  Rachel: What?!
Disadvantages
Syntax and word discrepancies between the script and the subtitle
Availability of the subtitle
Related work (cont.)
Face naming
Fully supervised
Weakly supervised
[Figure: (a) weakly supervised: faces are paired with script text ("[Scene: Rachel is entering the living room.] Monica: Julie. Rachel: What?!"); (b) fully supervised: faces carry manual labels ("Monica", "Rachel").]
Disadvantages
Expensive manual labels
Poorly suited to large-scale applications
Related work (cont.)
Scene segmentation
Content-based method
Script-guided method
[Figure: HMM for script-guided scene segmentation. Hidden state sequence: scenes q1 to q4, linked by transition probabilities a_{qi,qj}; observation sequence: shots 1 to 4 at t = 1, ..., 4, emitted with probabilities b_{qi,shotj}.]
HMM: $\lambda = \{A, B, \pi\} = \{A(a_{q_i,q_j}), B(b_{q_i,\mathrm{shot}_j}), \pi\}$
Viterbi alignment: $Q = \{q_1, q_2, q_3, q_4, q_5, \ldots\}$
Disadvantages
Matching units are asymmetric
Latent geometric distribution of scene durations
Our solution
Basic ideas
A generative TVParser model to align video and script by mining face-name correspondence.
[Figure: name histogram of the script scenes C1 to C3 (rows JOEY, MNCA, RACH, CHAN) matched against the face histogram of the video shots S1 to S12, with C1: {S1, ..., S4}, C2: {S6, ..., S8}, C3: {S10, ..., S12}.]
Advantages
Face names can be identified in an unsupervised way (learning)
Globally optimal scene segmentation can be inferred (inference)
Fast algorithms for both parameter learning and state inference
Role histogram
Basic idea
Bag-of-Words (BoW) representation
Role composition is a generic and semantic feature for both video (as face histogram) and script (as name histogram)
Name clustering
Face clustering
Difficulty: varying environmental conditions, e.g. pose
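As a minimal sketch of the bag-of-words idea on the script side, the snippet below counts speaker names per scene to form name histograms. The script layout (`[Scene: ...]` headers followed by `Name: utterance` lines), the function name `name_histograms`, and the use of speaker-turn counts as histogram values are assumptions for illustration; the slides do not say how names are extracted.

```python
import re
from collections import Counter

def name_histograms(script_text, roles):
    """Return one name histogram (counts ordered like `roles`) per scene.

    Assumes scenes start with a '[Scene: ...]' line and dialogue lines look
    like 'Name: utterance'. Both conventions are illustrative assumptions.
    """
    scenes = []
    current = None
    for line in script_text.splitlines():
        line = line.strip()
        if line.startswith("[Scene:"):
            current = Counter()          # start a new scene histogram
            scenes.append(current)
            continue
        match = re.match(r"^([A-Za-z]+):", line)
        if match and current is not None:
            current[match.group(1)] += 1  # one count per speaker turn
    return [[scene[role] for role in roles] for scene in scenes]

script = """[Scene: Monica and Rachel's]
Phoebe: Oh my God.
Susan: Oh, is he hungry already?
Carol: I guess so.
[Scene: Central Perk]
Julie: Rachel, do you have any muffins left?
Rachel: Yeah, I forget which ones.
Julie: Oh, you're busy, that's ok.
"""
roles = ["Phoebe", "Rachel", "Julie"]
print(name_histograms(script, roles))  # prints [[1, 0, 0], [0, 1, 2]]
```

The face histogram would be built the same way on the video side, counting face-cluster labels per shot instead of speaker names.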
Role histogram
Face clustering
Solution I: Semi-supervised kernel k-means clustering
Key points
Incorporate pairwise constraints (must-link and cannot-link)
Adopt manifold-manifold distance
[Figure: (left) must-link and cannot-link constraints; (right) manifold-manifold distance]
Role histogram
Face clustering
Solution II: Loose clustering number
Key points
Allow more clusters than characters, so that a character may split into several purified substructures
Model formulation
Graphical TVParser model
[Figure: graphical TVParser model. Each observed script scene $s_i$ is linked to a hidden partition $p_i = (t_i, d_i)$, which emits the video shot subsequence $v_{(i)}$ spanning $t_i$ to $t_i + d_i$; shown for $i-1$, $i$, $i+1$.]
$\mathcal{S} = \{s_i \mid i = 1, \ldots, r\}$ is the observed script scene sequence;
$\mathcal{V} = \{v_j \mid j = 1, \ldots, u\}$ is the observed video shot sequence;
$\mathcal{P} = \{p_i = (t_i, d_i) \mid i = 1, \ldots, r\}$ is the hidden video scene partition sequence, where $t_1 = 1$, $\sum_i d_i = u$ and $t_i = t_{i-1} + d_{i-1}$ ($i > 1$).
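The partition constraints can be made concrete with a few lines of code. This is a minimal sketch, assuming durations are given as a plain list; the function name `partitions_from_durations` is hypothetical.

```python
def partitions_from_durations(durations, num_shots):
    """Build p_i = (t_i, d_i) with t_1 = 1 and t_i = t_{i-1} + d_{i-1}."""
    if sum(durations) != num_shots:
        raise ValueError("durations must sum to the number of shots u")
    partitions = []
    start = 1  # t_1 = 1 (1-based shot index, as on the slide)
    for d in durations:
        partitions.append((start, d))
        start += d  # next scene starts where this one ends
    return partitions

print(partitions_from_durations([4, 3, 5], 12))  # prints [(1, 4), (5, 3), (8, 5)]
```

Because every start index follows from the preceding durations, the hidden sequence is fully determined by the durations alone.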
Model formulation
Complete TVParser model
\[ P(\mathcal{V}, \mathcal{S}, \mathcal{P}) = P(s_1)\, P(p_1 \mid s_1)\, P(v_{(1)} \mid p_1, s_1) \times \prod_{i=2}^{r} P(s_i \mid s_{i-1})\, P(p_i \mid s_i)\, P(v_{(i)} \mid p_i, s_i) \]
The generative process
(1) Enter the $i$-th script scene $s_i$ from its predecessor $s_{i-1}$;
(2) Decide $s_i$'s related partition $p_i = (t_i, d_i)$;
(3) Generate the corresponding video shot subsequence $v_{(i)} = v_{[t_i : t_i + d_i]}$, indexing from $t_i$ to $t_i + d_i$.
Model formulation
Additional constraint
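$P(s_1) = 1 \Leftrightarrow s_1 = 1$
$P(s_i \mid s_{i-1}) = 1 \Leftrightarrow s_i = i,\ s_{i-1} = i - 1$
Simplified TVParser model
\[ P(\mathcal{V}, \mathcal{S}, \mathcal{P}) = \prod_{i=1}^{r} \underbrace{P(p_i \mid s_i)}_{\text{duration}}\ \underbrace{P(v_{(i)} \mid p_i, s_i)}_{\text{observation}} \]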
Model formulation
Scene duration probability
Poisson distribution
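\[ P(p_i \mid s_i; \lambda_i) = \frac{\lambda_i^{d_i} e^{-\lambda_i}}{d_i!} = e^{-\lambda_i} \cdot \frac{\lambda_i^{d_i}}{d_i!} \]
Reasons
Poisson is a plausible model of state duration;
The model parameter, $\lambda = \{\lambda_i\}$, is the expected duration of scenes;
The parameter can be estimated by the maximum likelihood method.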
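A short sketch of the Poisson duration term, written in log form for numerical stability. The function names are hypothetical. `mle_lambda` covers only the textbook case of fully observed durations, where the maximum likelihood estimate is the sample mean; in the TVParser model the durations are hidden, so this is an illustration rather than the slides' estimator.

```python
import math

def log_poisson_duration(d, lam):
    """log P(p_i | s_i; lambda_i) = d*log(lam) - lam - log(d!)."""
    return d * math.log(lam) - lam - math.lgamma(d + 1)

def mle_lambda(durations):
    """MLE of lambda when durations are observed: the sample mean."""
    return sum(durations) / len(durations)

# Probability that a scene with expected length 20 shots lasts exactly 18 shots.
print(math.exp(log_poisson_duration(18, 20.0)))
print(mle_lambda([10, 20, 30]))  # prints 20.0
```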
Model formulation
Observation probability
Gaussian distribution
\[ P(v_{(i)} \mid p_i, s_i; A, \sigma_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left\{ -\frac{(s_i - A v_{(i)})^\top (s_i - A v_{(i)})}{2\sigma_i^2} \right\} \]
Meaning of parameter A
$A = [A_{ij}] \in \mathbb{R}^{M \times N}$ is the face-name relation matrix that associates $M$ names with $N$ face clusters. By constraining the entries of $A$ so that $A_{ij} \ge 0$ and $\sum_i A_{ij} = 1$, we can treat each column as an identity distribution of the face cluster.
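The observation term can be sketched as follows, using plain lists so the snippet has no dependencies. The helper names and the toy matrix are assumptions; the normalizer follows the slide's formula, which uses $1/\sqrt{2\pi\sigma_i^2}$ regardless of the histogram dimension.

```python
import math

def matvec(A, v):
    """Multiply an M x N matrix (list of rows) by a length-N vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def is_column_stochastic(A, tol=1e-9):
    """Check A_ij >= 0 and that every column sums to 1."""
    nonneg = all(a >= 0 for row in A for a in row)
    return nonneg and all(abs(sum(col) - 1.0) < tol for col in zip(*A))

def log_observation(s, v, A, sigma2):
    """Log of the observation density for one scene.

    s: name histogram (length M), v: face histogram (length N).
    """
    residual = [si - pi for si, pi in zip(s, matvec(A, v))]
    squared = sum(r * r for r in residual)
    return -0.5 * math.log(2 * math.pi * sigma2) - squared / (2 * sigma2)

# Toy example: 2 names, 3 face clusters; the third cluster is an even mix.
A = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]
print(is_column_stochastic(A))
print(log_observation([3.0, 2.0], [2.0, 1.0, 2.0], A, 1.0))
```

In the toy example `A v` equals the name histogram exactly, so the residual is zero and only the normalizing constant remains.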
Parameter estimation
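Model parameters $\Psi = \{\{\lambda_i\}, \{\sigma_i^2\}, A\}$
Maximum likelihood estimation (MLE)
\[ \max_{\hat{\Psi}} \sum_{\mathcal{P}} P(\mathcal{P} \mid \mathcal{V}, \mathcal{S}; \Psi) \cdot \log P(\mathcal{V}, \mathcal{S}, \mathcal{P}; \hat{\Psi}) \quad \text{s.t.}\quad \mathbf{1}_M^\top A = \mathbf{1}_N^\top,\ A \ge 0 \]
Optimization problem
For $\{\lambda_i\}$ and $\{\sigma_i\}$, unconstrained optimization
For $A$, constrained optimization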
Parameter estimation
Re-estimation for {λi}
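\[ \lambda_i = \frac{\sum_{p_i} P(p_i \mid \mathcal{V}, \mathcal{S}; \Psi) \cdot d_i}{\sum_{p_i} P(p_i \mid \mathcal{V}, \mathcal{S}; \Psi)} \]
Re-estimation for {σi}
\[ \sigma_i^2 = \frac{\sum_{p_i} P(p_i \mid \mathcal{V}, \mathcal{S}; \Psi) \cdot (s_i - A v_{(i)})^\top (s_i - A v_{(i)})}{\sum_{p_i} P(p_i \mid \mathcal{V}, \mathcal{S}; \Psi)} \]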
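Both re-estimation formulas are posterior-weighted averages over the candidate partitions of one scene, which the sketch below spells out. The function names, the dictionary layout for the posterior, and the toy data are assumptions; the sketch also assumes that a partition $(t, d)$ covers the $d$ shots starting at 1-based index $t$.

```python
def reestimate_scene(posterior, s, A, face_histogram):
    """Re-estimate (lambda_i, sigma_i^2) for one scene.

    posterior: dict mapping a candidate partition (t, d) to P(p_i | V, S; Psi).
    s: name histogram of the scene; A: face-name matrix (list of rows).
    face_histogram(t, d): face histogram v_(i) for that candidate partition.
    """
    total = sum(posterior.values())
    lam = sum(w * d for (t, d), w in posterior.items()) / total
    weighted_sq = 0.0
    for (t, d), w in posterior.items():
        v = face_histogram(t, d)
        predicted = [sum(a * x for a, x in zip(row, v)) for row in A]
        weighted_sq += w * sum((si - pi) ** 2 for si, pi in zip(s, predicted))
    return lam, weighted_sq / total

# Toy data: per-shot face-cluster counts for 4 shots and 2 clusters.
shot_faces = [[1, 0], [1, 0], [0, 1], [0, 1]]

def face_histogram(t, d):
    """Sum per-shot counts over the d shots starting at 1-based index t."""
    window = shot_faces[t - 1:t - 1 + d]
    return [sum(col) for col in zip(*window)]

posterior = {(1, 1): 0.25, (1, 2): 0.5, (1, 3): 0.25}
A = [[1.0, 0.0], [0.0, 1.0]]
lam, sigma2 = reestimate_scene(posterior, [2, 0], A, face_histogram)
print(lam, sigma2)  # prints 2.0 0.5
```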
Parameter estimation
Re-estimation for A
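\[ A_{ij} \leftarrow A_{ij} \sqrt{\frac{(W - \mathbf{1}_M \boldsymbol{\eta}^\top)^{+}_{ij}}{2 (A U)_{ij} + (W - \mathbf{1}_M \boldsymbol{\eta}^\top)^{-}_{ij}}} \]
where
\[ W = \sum_{\mathcal{P}} P(\mathcal{P} \mid \mathcal{V}, \mathcal{S}; \Psi) \sum_{i=1}^{r} \frac{1}{\sigma_i^2}\, s_i v_{(i)}^\top \]
\[ U = \sum_{\mathcal{P}} P(\mathcal{P} \mid \mathcal{V}, \mathcal{S}; \Psi) \sum_{i=1}^{r} \frac{1}{2\sigma_i^2}\, v_{(i)} v_{(i)}^\top \]
\[ \boldsymbol{\eta}^\top = \frac{1}{M} \cdot \left( \mathbf{1}_M^\top W - 2 \cdot \mathbf{1}_N^\top U \right) \]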
Parameter estimation
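Summation in both W and U: $\sum_{\mathcal{P}} P(\mathcal{P} \mid \mathcal{V}, \mathcal{S}; \Psi)$
Sum over the whole possible partition sequence space
Typical example: r = 15 (scenes) and u = 300 (shots), then the number of possible segmentations is $C_{299}^{15} \approx O(10^{24})$ (Intractable!)
Solution: Sequence ⇒ segments
\[ \sum_{\mathcal{P}} P(\mathcal{P} \mid \mathcal{V}, \mathcal{S}; \Psi) \sum_{i=1}^{r} [\,\cdot\,] = \sum_{i=1}^{r} \sum_{p_i} P(p_i \mid \mathcal{V}, \mathcal{S}; \Psi)\, [\,\cdot\,] \]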
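The size of the partition space is easy to check numerically. The sketch below assumes every scene contains at least one shot, in which case cutting 300 ordered shots into 15 contiguous scenes means choosing 14 cut points among 299 gaps. The slide quotes the binomial coefficient with 15 instead of 14, which is larger by a factor of 19; either way the count is far too large to enumerate. The function name is hypothetical.

```python
import math

def count_partitions(num_shots, num_scenes):
    """Ways to cut num_shots ordered shots into num_scenes non-empty
    contiguous scenes: choose num_scenes - 1 cuts among num_shots - 1 gaps."""
    return math.comb(num_shots - 1, num_scenes - 1)

print(count_partitions(300, 15))  # non-empty scenes
print(math.comb(299, 15))         # figure quoted on the slide
```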
Parameter estimation
Posterior probability P(pi |V,S; Ψ)
Forward-backward algorithm
Forward-backward variables
\[ \alpha_{p_i}(s_i) \triangleq P(s_i, p_i, v_{[1 : t_i + d_i]}; \Psi) \]
\[ \beta_{p_i}(s_i) \triangleq P(v_{[t_i + d_i + 1 : u]} \mid s_i, p_i; \Psi) \]
Forward-backward recursion
Initial conditions
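The slides do not show the recursion itself, so the following is only a sketch of one way to obtain the segment posteriors for the simplified left-to-right model. It indexes the forward and backward tables by scene boundary rather than by $(s_i, p_i)$, uses 0-based shot indices, assumes each scene spans at least one shot, and works with plain probabilities (adequate for small inputs only). The `score` callback and all names are hypothetical.

```python
import math

def segment_posteriors(num_scenes, num_shots, score):
    """Posterior over partitions (start, d) for every scene.

    score(i, start, d): duration-times-observation probability of scene i
    covering the d shots start, ..., start + d - 1 (0-based).
    """
    r, u = num_scenes, num_shots
    # fwd[i][e]: probability that the first i scenes cover exactly shots 0..e-1.
    fwd = [[0.0] * (u + 1) for _ in range(r + 1)]
    fwd[0][0] = 1.0
    for i in range(1, r + 1):
        for e in range(1, u + 1):
            fwd[i][e] = sum(fwd[i - 1][e - d] * score(i - 1, e - d, d)
                            for d in range(1, e + 1))
    # bwd[i][e]: probability that scenes i..r-1 cover exactly shots e..u-1.
    bwd = [[0.0] * (u + 1) for _ in range(r + 1)]
    bwd[r][u] = 1.0
    for i in range(r - 1, -1, -1):
        for e in range(u):
            bwd[i][e] = sum(score(i, e, d) * bwd[i + 1][e + d]
                            for d in range(1, u - e + 1))
    evidence = fwd[r][u]
    posteriors = []
    for i in range(r):
        post = {}
        for start in range(u):
            for d in range(1, u - start + 1):
                p = fwd[i][start] * score(i, start, d) * bwd[i + 1][start + d]
                if p > 0.0:
                    post[(start, d)] = p / evidence
        posteriors.append(post)
    return posteriors

def poisson_score(i, start, d, lams=(2.0, 2.0)):
    """Duration-only score (uniform observation term), for illustration."""
    lam = lams[i]
    return math.exp(-lam) * lam ** d / math.factorial(d)

posteriors = segment_posteriors(2, 4, poisson_score)
print(posteriors[0])
```

The tables cost on the order of r * u^2 score evaluations, instead of one evaluation per full partition sequence.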
State inference
Hidden partition sequence $\mathcal{P}^*$
Viterbi algorithm
Local optimal
\[ \delta_\tau(s_i; \theta) \triangleq \max_{p_{[1:i-1]}} P(p_{[1:i-1]}, s_{[1:i-1]}, \tau \in q_i, o_{[1:\tau]}; \theta) \]
Forward recursion
Backtracking
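The forward recursion and backtracking steps are not detailed on the slide. As a hedged illustration, the dynamic program below finds the highest-scoring partition for the simplified left-to-right model in log space. It is a boundary-indexed variant, not necessarily the slide's exact formulation; it assumes each scene spans at least one shot, and `log_score` and all other names are hypothetical.

```python
import math

def best_partition(num_scenes, num_shots, log_score):
    """Return (durations, log-probability) of the best partition sequence.

    log_score(i, start, d): log of duration-times-observation probability of
    scene i covering the d shots start, ..., start + d - 1 (0-based).
    """
    r, u = num_scenes, num_shots
    neg_inf = float("-inf")
    best = [[neg_inf] * (u + 1) for _ in range(r + 1)]
    back = [[0] * (u + 1) for _ in range(r + 1)]
    best[0][0] = 0.0
    # Forward recursion: best[i][e] is the best score of the first i scenes
    # covering exactly shots 0..e-1.
    for i in range(1, r + 1):
        for e in range(i, u + 1):
            for d in range(1, e + 1):
                prev = best[i - 1][e - d]
                if prev == neg_inf:
                    continue
                cand = prev + log_score(i - 1, e - d, d)
                if cand > best[i][e]:
                    best[i][e] = cand
                    back[i][e] = d
    if best[r][u] == neg_inf:
        raise ValueError("no valid partition (more scenes than shots?)")
    # Backtracking from the final boundary.
    durations = []
    e = u
    for i in range(r, 0, -1):
        d = back[i][e]
        durations.append(d)
        e -= d
    durations.reverse()
    return durations, best[r][u]

lams = [2.5, 4.5, 3.0]

def poisson_log_score(i, start, d):
    """Duration-only log score (uniform observation term), for illustration."""
    return d * math.log(lams[i]) - lams[i] - math.lgamma(d + 1)

print(best_partition(3, 10, poisson_log_score))
```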
Data sets
Two TV series
6 episodes from the American TV series “Friends”
5 episodes from the Chinese TV series “I Love My Family” (Family)
Data details (average per episode)
Length: 30 min
Role number: 10
Face number: $2 \times 10^5$
Shot number: 300
Face naming
Baselines
Face clustering
Unconstrained kernel K-means (KK)
Constrained K-means (CK)
Completely positive factorization (CP)
Constrained spectral learning (SL)
Face recognition
K-nearest neighbor (KNN)
Support vector machine (SVM)
Face naming
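Criteria
Face clustering
\[ \mathrm{NMI} = \frac{\sum_l \sum_h n_{l,h} \log\left( \frac{n \cdot n_{l,h}}{n_l n_h} \right)}{\sqrt{\left( \sum_l n_l \log \frac{n_l}{n} \right) \left( \sum_h n_h \log \frac{n_h}{n} \right)}} \]
where $n$ is the number of objects, $n_l$ is the size of the $l$-th class in the ground truth, $n_h$ is the size of the $h$-th cluster in the result and $n_{l,h}$ is the size of their intersection.
Face recognition
\[ F_w = \sum_i w_i \cdot \frac{2 \times \mathrm{precision}_i \times \mathrm{recall}_i}{\mathrm{precision}_i + \mathrm{recall}_i} \]
where $w_i$ denotes the weight of the $i$-th role according to his/her spoken lines in the script.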
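Both criteria are straightforward to compute from label lists. The sketch below follows the two formulas term by term; the function names and the input layout (parallel label lists, and per-role `(precision, recall)` pairs) are assumptions for illustration.

```python
import math
from collections import Counter

def nmi(truth, pred):
    """Normalized mutual information between ground-truth classes and clusters."""
    n = len(truth)
    n_l = Counter(truth)            # class sizes
    n_h = Counter(pred)             # cluster sizes
    n_lh = Counter(zip(truth, pred))  # intersection sizes
    numerator = sum(c * math.log(n * c / (n_l[l] * n_h[h]))
                    for (l, h), c in n_lh.items())
    # Both sums are non-positive, so their product is non-negative.
    denom_l = sum(c * math.log(c / n) for c in n_l.values())
    denom_h = sum(c * math.log(c / n) for c in n_h.values())
    denominator = math.sqrt(denom_l * denom_h)
    return numerator / denominator if denominator > 0 else 0.0

def weighted_f(per_role, weights):
    """F_w: per-role F-measures combined with the given role weights."""
    total = 0.0
    for role, (precision, recall) in per_role.items():
        if precision + recall > 0:
            total += weights[role] * 2 * precision * recall / (precision + recall)
    return total

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # identical grouping, relabeled
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # independent grouping
print(weighted_f({"a": (1.0, 1.0), "b": (0.5, 0.5)}, {"a": 0.75, "b": 0.25}))
```

NMI ignores cluster label names, so a relabeled but otherwise identical grouping scores as a perfect match.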
Face naming
Face clustering
Constrained vs. unconstrained
Clustering number variance
[Figure: NMI score vs. cluster number (as a multiple of the character number, ×0.0 to ×5.0) for CK, KK, SSKK, SL and CP. Panels: Friends, Family.]
Face naming
Face recognition (naming)
Optimal recognition is achieved when the clustering number is approximately 2 times the character number
[Figure: A purifying rate, precision, recall and Fw-measure vs. cluster number (×0.0 to ×5.0). Panels: Friends, Family.]
Face naming
Main character naming result
Accuracy
Robustness
[Figure: weighted F-measure vs. cluster number (×0.0 to ×5.0) for the 1st to 4th main characters. Panels: Friends, Family.]
Face naming
Comparison with supervised methods
Comparable to supervised methods
Even better when the training set is limited
[Figure: weighted F-measure vs. training-test ratio (0.1 to 0.9) for KNN, SVM and TVParser (1st, 2nd and 3rd best). Panels: Friends, Family.]
Scene segmentation
Baselines
Scene segmentation methods (algorithms)
Shot similarity graph (SSG)
Dynamic time warping (DTW)
Hidden Markov model (HMM)
Scene segmentation
Criteria
Scene segmentation
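\[ \rho = \left( \sum_{i=1}^{r} \frac{d_i}{u} \sum_{j=1}^{r} \frac{d_{ij}^2}{d_i^2} \right) \cdot \left( \sum_{j=1}^{r} \frac{d_j^*}{u} \sum_{i=1}^{r} \frac{d_{ij}^2}{d_j^{*2}} \right) \]
where $d_{ij}$ is the length of overlap between the scene segments $p_i$ and $p_j^*$, $d_i$ is the length of the scene $p_i$, $d_j^*$ is the length of the ground-truth scene $p_j^*$ and $u$ is the total length of all scenes. This purity value ranges from 0 to 1, and the larger the value, the closer the segmentation is to the ground truth.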
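The purity score can be computed directly from the two lists of scene durations. This is a minimal sketch with hypothetical names; it assumes all durations are positive, measured in shots, and that both segmentations cover the same total length. Unlike the formula, it does not require the two segmentations to have the same number of scenes.

```python
def purity(pred_durations, true_durations):
    """Purity between a predicted and a ground-truth scene segmentation."""
    total = sum(pred_durations)
    if total != sum(true_durations):
        raise ValueError("both segmentations must cover the same shots")

    def intervals(durations):
        out, start = [], 0
        for d in durations:
            out.append((start, start + d))  # half-open [start, end)
            start += d
        return out

    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))

    def one_sided(first, second):
        score = 0.0
        for seg in first:
            length = seg[1] - seg[0]
            score += (length / total) * sum(
                overlap(seg, other) ** 2 for other in second) / length ** 2
        return score

    pred = intervals(pred_durations)
    true = intervals(true_durations)
    return one_sided(pred, true) * one_sided(true, pred)

print(purity([2, 2], [2, 2]))  # prints 1.0
print(purity([4], [2, 2]))     # prints 0.5
```

The second call shows the penalty for under-segmentation: merging two ground-truth scenes into one halves the score.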
Scene segmentation
Scene segmentation result
Segmentation   Sources     Purity scores
method         (video+)    Friends        Family
SSG            -           0.55 ± 0.11    0.53 ± 0.07
DTW            sub.+scr.   0.60 ± 0.13    -
HMM            scr.        0.59 ± 0.08    0.53 ± 0.05
TVParser       scr.        0.67 ± 0.07    0.58 ± 0.03
Scene segmentation
Scene segmentation result under various role histograms
Name histogram: the first four characters are dominant
Face histogram: more clusters are generally better
[Figure: purity score as a function of name histogram dimension (2 to 11) and face histogram size (×0.00 to ×2.50), with side panels of average purity along each axis; annotated gains: ↑0.12 (≈71%) and ↑0.05 (≈29%).]
Conclusion
We propose a generative model to formulate story plot development in TV videos, which solves face naming and scene segmentation in a unified framework.
Key novelties
Unsupervised face naming through model parameter learning
Globally optimal scene segmentation by hidden state inference
Fast algorithms for both parameter learning and state inference
Future work
Personalized applications, e.g. TV video synthesis;
Generic cross-media analysis and association methods.
Q & A
Thanks!