Multimodal Person Discovery in Broadcast TV Taskat MediaEval 2015
Johann Poignant, Hervé Bredin, Claude BarrasLIMSI – CNRS, Orsay, [email protected]
Motivation➢Indexing TV archives → find people appearing and speaking➢Biometric models are not always available → find people identities in the video (pronounced names, names written on screen)
Definition of the task➢From a collection of TV broadcast pre-segmented into shots, participants are asked to provide:
➢The names of people both speaking and appearing at the same time during the shot, with a confidence score➢An evidence (a unique shot) proving that the person holds the right name
➢List of persons was not provided➢Person biometric models trained on externel data can not be used➢Participants should find their names in the audio (using ASR) or visual (using OCR) streams
A BA
Hello Mrs B
Mr A
blah blah
shot #1 shot #2 shot #3
A B B
blah blah
shot #4 speaking face
evidence
A B
blah blah
A
text overlay
speech transcript
INPUT
OUTPUT
LEGEND
Shot#1 is an evidence for Mr A
Shot#3 is an evidence for Mrs B
Datasets➢Dev set: REPERE (2 French channel, 8 shows, 137 hours, 50 hours manually annotated)➢Test set: INA corpus (1 French channel, 172 editions of the evening broadcast news, 106 hours) ➢Annotated a posteriori based on participants' submissions
Baseline and metadata➢Available as open-source software at
http://github.com/MediaEvalPersonDiscoveryTask
Metrics➢For each queries q ϵ Q, rank all shots (according to their confidence) of the best hypothesis p (if Levenshtein ratio(p, q) > 0.95)
➢Average Precision
➢Mean Average Precision
➢Correctness of evidences
➢Evidence-weighted Mean Average
k is the rankP(k) is the precision at cut-off krel(k) =1 if the shot at rank k is relevant, 0 otherwise
∑k=1
(P(k)) · rel(k))
# relevant shots
n
AP =
QMAP = ∑
q=1 AP(q)1
Q
C(q) = 1 if the Levenshtein ratio(p, q) > 0.95 and if the evidence proves q identity0 otherwise
QEwMAP = ∑
q=1 C(q) · AP(q)1
Q
Top Related