page i T A UNIVERSIT POLITÈCNICA DE ALÈNCIA V MASTER'S
Transcript of page i T A UNIVERSIT POLITÈCNICA DE ALÈNCIA V MASTER'S
�memoria� � 2014/9/11 � 12:16 � page i � #1
✐ ✐
UNIVERSITAT POLITÈCNICA DE VALÈNCIA
DEPARTMENT OF COMPUTER SYSTEMS AND COMPUTATION
MASTER'S THESIS
Integration of Ma hine Learning and Language Pro essing
Te hnologies into Video Le ture Platforms.
Master in Arti� ial Intelligen e, Pattern Re ognition and Digital Imaging.
Alejandro Pérez González de Martos
Dire ted by:
Dr. Alfons Juan Cis ar
Dr. Jorge Civera Saiz
September 11, 2014
�memoria� � 2014/9/11 � 12:16 � page ii � #2
✐ ✐
�memoria� � 2014/9/11 � 12:16 � page iii � #3
✐ ✐
I would like to thank the UPV's Ma hine Learning and Language Pro essing
group and the Joºef Stefan Institute for making this thesis possible.
�memoria� � 2014/9/11 � 12:16 � page iv � #4
✐ ✐
�memoria� � 2014/9/11 � 12:16 � page v � #5
✐ ✐
Contents
1 Introdu tion 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 S ienti� and te hni al goals . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Do ument stru ture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Preliminaries 3
2.1 Ma hine Learning and Language Pro essing . . . . . . . . . . . . . . . 3
2.1.1 Pattern re ognition . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Automati spee h re ognition . . . . . . . . . . . . . . . . . . . 5
2.1.3 Statisti al ma hine translation . . . . . . . . . . . . . . . . . . 6
2.2 Video le ture repositories . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 poliMedia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 VideoLe tures.NET . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 The Open ast Matterhorn platform . . . . . . . . . . . . . . . . . . . . 11
3 The transLe tures Platform 13
3.1 Introdu tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 transLe tures Database . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 transLe tures Web Servi e . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.4 transLe tures Ingest Servi e . . . . . . . . . . . . . . . . . . . . . . . . 15
3.5 transLe tures Player . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5.1 Standard post-editing . . . . . . . . . . . . . . . . . . . . . . . 17
3.5.2 Intelligent intera tion . . . . . . . . . . . . . . . . . . . . . . . 18
3.5.3 Two-step supervision . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5.4 User evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5.5 Con lusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4 Integration into Open ast Matterhorn 25
4.1 Introdu tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 System ar hite ture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Platform analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3.1 Media pa kage . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3.2 Work�ow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.4 Integration pro ess . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4.1 Generation of automati trans riptions and translations . . . . 29
4.4.2 Subtitle visualisation and supervision . . . . . . . . . . . . . . 30
v
�memoria� � 2014/9/11 � 12:16 � page vi � #6
✐ ✐
Contents
5 Using spee h trans riptions in le ture re ommender systems 33
5.1 Introdu tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 System overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.3 System updates and optimisation . . . . . . . . . . . . . . . . . . . . . 36
5.4 Integration into VideoLe tures.NET . . . . . . . . . . . . . . . . . . . 37
5.4.1 Topi and user modeling . . . . . . . . . . . . . . . . . . . . . . 37
5.4.2 Learning re ommendation feature weights . . . . . . . . . . . . 38
5.4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.5 Con lusions and future work . . . . . . . . . . . . . . . . . . . . . . . 39
6 Con lusions and Future Work 41
6.1 General Con lusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
vi MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 1 � #7
✐ ✐
Chapter 1
Introdu tion
1.1 Motivation
Online multimedia repositories are rapidly growing and be oming evermore onsoli-
dated as key knowledge assets. This is parti ularly true in the area of edu ation, where
large repositories of video le tures are being established on the ba k of in reasingly
available and standardised infrastru tures. Well-known examples of this in lude mas-
sive open online ourses (MOOCs) aggregators su h as Coursera [1℄, the Universitat
Politèni a de Valèn ia (UPV) poliMedia platform system for the ost-e�e tive ellab-
oration and publi ation of quality edu ational videos [33℄, and VideoLe tures.NET,
an award-winning free and open a ess edu ational video le tures repository [47℄.
As in other repositories, trans ription and translation of video le tures in platforms
su h as poliMedia and VideoLe tures.NET is needed to make them a essible to
speakers of di�erent languages and to people with disabilities [14, 48℄. Moreover,
trans riptions and translations of video le tures an also be helpful in the development
of multiple digital ontent management appli ations su h as le ture ategorisation,
summarisation, re ommendation, automated topi �nding or plagiarism dete tion.
However, most le tures in these platforms are neither trans ribed nor translated
be ause of the la k of e� ient solutions to obtain them at a reasonable level of a -
ura y. This was the motivation behind the transLe tures European proje t [38, 45℄,
whi h aimed at developing innovative, ost-e�e tive solutions for produ ing a urate
trans riptions and translations for large video repositories.
1.2 S ienti� and te hni al goals
This work aims to e�e tively integrate the tools developed in transLe tures [44, 41℄
into di�erent video le ture platforms. In parti ular, these tools will be integrated into
the aforementioned poliMedia and VideoLe tures.NET repositories. In the ase of
the latter, video le ture trans riptions will be used to develop a le ture re ommender
appli ation as part of the PASCAL Harvest Proje t La Vie [32℄. In addition, transLe -
1
�memoria� � 2014/9/11 � 12:16 � page 2 � #8
✐ ✐
Chapter 1. Introdu tion
tures solutions will also be integrated into Open ast Matterhorn [22℄, an open-sour e
platform to support the management of edu ational multimedia ontent adopted by
more than 40 di�erent organisations all around the world.
It should be noted that this thesis is framed within the resear h proje t transLe -
tures, and it is therefore part of a ollaborative work. That said, spe ial attention
will be paid to the author's individual ontributions.
1.3 Do ument stru ture
This thesis is organised as follows: Chapter 2 introdu es the ma hine learning and lan-
guage pro essing omputer s ien e �elds, fo using on the automati spee h re ognition
and statisti al ma hine translation tasks. In this hapter, the poliMedia and Vide-
oLe tures.NET video le ture repositories are also des ribed, as well as the Open ast
Matterhorn platform. Next, the set of tools for the integration of transLe tures te h-
nologies into video le ture repositories is presented in Chapter 3. Chapter 4 addresses
the integration of these tools into the Open ast Matterhorn platform. Chapter 5 de-
s ribes a le ture re ommender system that uses automati spee h trans riptions to
better represent le ture ontents at a semanti level. In parti ular, this system was
implemented for the VideoLe tures.NET repository. Finally, on luding remarks are
given in Chapter 6.
2 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 3 � #9
✐ ✐
Chapter 2
Preliminaries
2.1 Ma hine Learning and Language Pro essing
The Ma hine Learning �eld, evolved from the broad �eld of Arti� ial Intelligen e,
deals with building omputer systems that optimise performan e riteria using pre-
vious data or experien e. It involves the development of mathemati al algorithms
that dis over knowledge from spe i� data sets, and then �learn� from the data in
an iterative fashion that allows predi tions to be made. Today, Ma hine Learning
in ludes a variety of appli ations su h as natural language pro essing, sear h engine
fun tion, medi al diagnosis, omputer vision, and sto k market analysis.
As the reader might guess from the aforementioned appli ations, the range of
learning problems is learly large. To avoid reinventing the wheel for every new
Ma hine Learning appli ation, the resear h ommunity has tended to formalise the
problems in a set of fairly narrow prototypes. Ma hine Learning systems an be
lassi�ed along three parti ularly meaningful dimensions [10℄:
• Classi� ation on the basis of the underlying learning strategies used.
• Classi� ation on the basis of the knowledge representation.
• Classi� ation in terms of the appli ation domain of the performan e system for
whi h knowledge is a quired.
Ea h point in the spa e de�ned by the above dimensions orresponds to a parti -
ular learning strategy. In this thesis, we on entrate on the lassi� ation task, also
referred to as pattern re ognition, where one attempts to build algorithms apable of
automati ally onstru ting methods for distinguishing between di�erent exemplars,
based on their di�erentiating patterns. More spe i� ally, we will fo us on the par-
ti ular tasks of spee h re ognition and ma hine translation, whi h an be in luded
in the Natural Language Pro essing (NLP) �eld. The NLP handles the resear h for
e� ient methods to enhan e human-ma hine (and human-human) ommuni ation.
3
�memoria� � 2014/9/11 � 12:16 � page 4 � #10
✐ ✐
Chapter 2. Preliminaries
In the following se tions we give a short overview of the statisti al pattern re ogni-
tion approa h, and brie�y introdu e the automati spee h re ognition and ma hine
translation problems.
2.1.1 Pattern re ognition
The term pattern re ognition refers to the task of pla ing some obje t to a orre t lass
based on the measurements about the obje t [43℄. Con retely, pattern re ognition sys-
tems aim to re ognise their environment from data a quired by appropriate sensors
(opti al, a ousti , thermal, hemi al, et .). Some of the spe i� pattern re ognition
obje tives in lude spee h and handwriting re ognition, ma hine translation, �nger-
print/fa ial verti� ation, and disease identi� ation. In order to on isely des ribe the
di�erent pattern re ognition problems, probability theory has proven to be the most
adequate language. We will assume the reader has some prior knowledge on probabil-
ity theory, and therefore some basi de�nitions will be skipped. For more details and
a very gentle and detailed dis ussion, see the book of Introdu tion to probability [16℄.
A ording to the lassi� ation paradigm, to re ognise means to lassify into one
out of C given lasses or ategories with minimum probability of error. Many pattern
re ognition systems an be thought to onsist of �ve stages:
1. Sensing, whi h refers to the measurement or observation of the obje t to be
lassi�ed.
2. Pre-pro essing, whi h is the pro ess of �ltering the raw data for noise supres-
sion and other operations to improve the quality of the data.
3. Feature extra tion, whi h refers to the pro ess of extra ting relevant data
for the lassi� ation task. The result of the feature extra tion stage is alled a
feature ve tor.
4. Classi� ation, in whi h the feature ve tor extra ted from the obje t is lassi-
�ed into the most appropriate lass.
5. Post-pro essing. As di�erent a tions might also have di�erent osts asso iated
with them, the �nal task of a pattern re ognition system is to de ide upon an
a tion based on the lassi� ation results.
Figure 2.1 illustrates the di�erent stages for an opti al hara ter re ogniser.
While the sensing, pre-pro essing and feature extra tion tasks are very spe i� to
the parti ular problem, the lassi� ation task an be somehow generalised for typi al
pattern re ognition systems. Formally, a lassi�er an be de�ned as a fun tion:
c(x) = arg max
c∈Cgc(x) (2.1)
where, for ea h lass c, a dis riminant fun tion gc is used to measure the degree
to whi h the obje t x belongs to lass c. Obviously, x is lassi�ed into a lass to
whi h it belongs with maximum degree. When random variables are used for x and
4 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 5 � #11
✐ ✐
2.1. Ma hine Learning and Language Pro essing
Figure 2.1: Opti al hara ter re ogniser for digits 6 and 9, based on the
average gray levels of the upper and the bottom half.
c, the optimal lassi� ation te hnique is the so- alled Bayes lassi�er or de ision rule
(named after Thomas Bayes):
c∗(x) = arg max
c∈Cp(c|x) (2.2)
If the posterior p(c|x) probability is modeled dire tly, the lassi�er is said to follow
a dis riminative approa h. However, the lassi al approa h to pattern re ognition is
to write the Bayes lassi�er in terms of lass priors p(c) and lass- onditional densities
p(x|c) (generative approa h). By applying the Bayes' rule:
c∗(x) = arg max
c∈Cp(c|x) = arg max
c∈Cp(c)p(x|c) (2.3)
Labelled samples (x1, c1), ..., (xN , cN ) randomly drawn from p(x, c) are used to
estimate p(c) and p(x|c). Usually, for ea h lass c, its prior is estimated as:
p(c) ≈Nc
N[Nc =
∑
n:cn=c
1] (2.4)
and its onditional density p(x|c) is estimated from samples labelled with c.
2.1.2 Automati spee h re ognition
Automati spee h re ognition (ASR) an be de�ned as the independent, omputer-
driven trans ription of spoken language into readable text. In a nutshell, ASR is
te hnology that allows a omputer to identify the words that a person speaks into a
mi rophone and onvert it to written text. Having a ma hine to understand �uently
spoken spee h has driven spee h resear h for more than 50 years. Although ASR
te hnology is not yet at the point where ma hines understand all spee h, in any
a ousti environment, or by any person, it is used on a daily basis in a number of
appli ations and servi es.
Formally, the spee h re ognition problem an be des ribed as a fun tion that
de�nes a mapping from the a ousti eviden e to a single or a sequen e of words. Let
x = (x1, x2, ..., xt) represent the a ousti eviden e that is generated in time from a
given spee h signal. Let w = (w1, w2, ..., wn) denote a sequen e of n words, ea h
belonging to a �xed set of possible words, W . Let p(w|x) denote the probability
MLLP-DSIC-UPV 5
�memoria� � 2014/9/11 � 12:16 � page 6 � #12
✐ ✐
Chapter 2. Preliminaries
that the words w were spoken given that the a ousti eviden e x was observed. The
spee h re ogniser should sele t the sequen e of words w satisfying:
w = arg max
w∈Wp(w|x) (2.5)
However, sin e it is di� ult to dire tly model p(w|x), we an apply the Bayes' rule
as in Equation 2.3:
w = arg max
w∈Wp(w|x) = arg max
w∈Wp(w)p(x|w) (2.6)
where p(w) is known as language model and p(x|w) as a ousti model. Typi ally, n-gram models are used to estimate p(w) [21℄; while p(x|w) is estimated using Hidden
Markov Models (HMMs) and Gaussian Mixture Models (GMMs) [25℄. Figure 2.2
shows a general overview of an automati spee h re ognition system.
Figure 2.2: General overview of an automati spee h re ognition system.
2.1.3 Statisti al ma hine translation
Ma hine translation (MT) is the use of omputers to automate the translation of texts
or utteran es from one language into another, while the underlying meaning remains
the same. The history of MT goes ba k to the late 40s with the famous publi ation
of Weaver [49℄, widely re ognised as one of the pioneers of ma hine translation. The
70s and 80s saw the proliferation of rule-based ma hine translation systems su h
as Meteo [42℄, Systran [8℄ and METAL [6℄. The ontributions in statisti al ma hine
translation were minor until the early 90s, when the IBM group presented the Candide
system [7℄. Sin e then, the development of statisti al MT has experien ed a major
boost on a ount of the many resear h groups emerged in this area.
The goal of MT is the automati translation of a sour e senten e s into a target
senten e t, being
6 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 7 � #13
✐ ✐
2.2. Video le ture repositories
s = s1, ..., sj , ..., s|s| sj ∈ St = t1, ..., ti, ..., t|t| ti ∈ T
where sj and ti denote sour e and target words, and S and T , the sour e and target
vo abularies, respe tively.
In statisti al MT, this translation pro ess is usually presented as a de ision pro ess,
where given a sour e senten e s, a target senten e t is sele ted a ording to
t = arg max
t
p(t|s) (2.7)
where p(t|s) is the probability for t to be the a tual translation of s. Applying the
Bayes' rule we an reformulate Equation 2.7 as
t = arg max
t
p(t)p(s|t) (2.8)
where p(t) and p(s|t) orrespond to the language and translation models, respe tively.
Intuitively, the language model represents the well-formedness of the andidate trans-
lation t; while the translation model an be understood as a mapping fun tion from
target to sour e words.
Most of the state-of-the-art statisti al MT systems are based on bilingual pharses [9,
23℄. Another approa h whi h has be ome popular in re ent years is grounded on the
integration of synta ti knowledge into statisti al MT systems [51, 52, 15℄. The third
main approa h is the modelling of the translation pro ess as a �nite-state trans-
du er [12, 4℄.
2.2 Video le ture repositories
In this se tion we present the poliMedia and VideoLe tures.NET video le ture reposi-
tories. Both of them have been trans ribed and translated as part of the transLe tures
proje t. transLe tures' ASR and MT systems have been spe ially developed to ex-
ploit the parti ular hara teristi s that large video le ture repositories usually present.
More spe i� ally, the a ousti , language and translation models are adapted to the
parti ular speaker and topi of the le tures, thereby improving the trans riptions and
translations a ura y [27, 2, 3℄. This adaptation pro ess is referred to as massive
adaptation.
At this point, it is worth stressing that the integration of transLe tures' te hnolo-
gies into these repositories is one of the major ontributions of this thesis and will be
properly addressed in Chapter 3.
2.2.1 poliMedia
The poliMedia platform is a re ent, innovative servi e for the reation and distribution
of edu ational multimedia ontent at the UPV [33℄. It was initially designed to allow
UPV professors to generate and publish edu ational video le tures. It urrently serves
to more than 30 000 students and 2800 professors. The poliMedia platform has also
MLLP-DSIC-UPV 7
�memoria� � 2014/9/11 � 12:16 � page 8 � #14
✐ ✐
Chapter 2. Preliminaries
Tables 2.1: Basi statisti s of the UPV's poliMedia repository (May 2014)
Number of le tures 11 662
Total duration (in hours) 2422
Avg. le ture length (in minutes) 12.5
Total number of speakers 1443
Avg. number of le tures re orded per speaker 7
been exported to several both national and international universities. Table 2.1 shows
basi statisti s of the urrent UPV's poliMedia repository.
The produ tion pro ess of the poliMedia platform has been arefully designed to
generate high quality re ordings with a high produ tion rate. The poliMedia studio is
a 4 meter room with a white ba kground in whi h all the ne essary equipment for the
re ording is available to professors. Figure 2.3 shows the studio during a re ording
session.
Figure 2.3: The poliMedia studio during a re ording session.
Re ordings on poliMedia follow a parti ular standard format. As it an be seen
in Figure 2.4, the speaker appears on the bottom right orner of the s reen while
the omputer s reen (usually showing the time-aligned presentation slides) is shown
entered as the main element of the re ording. The videos are stored in AVC/H.264
format with di�erent settings. Video size is 1280x720 pixels and average bit rates
usually os illate between 500 and 900 kbps. Audio format is AAC/LC stereo (85%)
and mono (15%) with sampling frequen ies of 44 100 and 48 000 Hz. A ording to the
Nyquist�Shannon sampling theorem, this is more than su� ient to a urately over
the human spee h frequen y range (∼200-3500 Hz).
8 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 9 � #15
✐ ✐
2.2. Video le ture repositories
Figure 2.4: A poliMedia re ording example.
In order to automati ally trans ribe the poliMedia Spanish repository, automati
spee h re ognition systems need relevant sample data. To this end, 114 hours of
poliMedia Spanish video le tures were manually trans ribed. This data was prop-
erly partitioned into training, development and test sets in order to train, tune and
evaluate the ASR systems. Details regarding ea h set are shown in Table 2.2.
Tables 2.2: Statisti s of the Spanish poliMedia training, development and
test partitions
Training Development Test
Videos 655 26 23
Speakers 73 5 5
Hours 107h 3.8h 3.4h
Senten es 39.2K 1.3K 1.1K
Words 936K 35K 31K
Vo abulary 26.9K 4.7K 4.3K
2.2.2 VideoLe tures.NET
VideoLe tures.NET is a free and open a ess repository of video le tures mostly �lmed
by people from the Joºef Stefan Institute (IJS) at major onferen es, summer s hools,
workshops and s ien e promotional events from many �elds of S ien e. VideoLe -
tures.NET is being used as an edu ational platform for several EU funded resear h
proje ts, di�erent open edu ational resour es organizations su h as the OpenCourse-
Ware Consortium, MIT OpenCourseWare and Open Yale Courses as well as other
MLLP-DSIC-UPV 9
�memoria� � 2014/9/11 � 12:16 � page 10 � #16
✐ ✐
Chapter 2. Preliminaries
s ienti� institutions like CERN. In this way, VideoLe tures.NET olle ts high qual-
ity edu ational ontent whi h is re orded with high-quality, homogeneous standards.
All le tures, a ompanying do uments, information and links are systemati ally
sele ted and lassi�ed through the editorial pro ess taking into a ount the author's
omments. The video editing is done in-house and is never ensored, that is, le -
tures are never edited in a way whi h would allow ontent or viewer manipulation.
Most le tures are a ompanied with time-aligned presentation slides (as shown in
Figure 2.5).
Tables 2.3: Basi statisti s of the IJS' VideoLe tures.NET repository (May
2014)
Number of le tures 20 358
Total number of authors 12 167
Total duration (in hours) 12 681
Average le ture duration (in minutes) 37
Figure 2.5: The VideoLe tures.NET web player.
In order to train proper ASR systems to automati ally trans ribe the repository,
high-quality trans riptions were manually generated for an English subset of Vide-
oLe tures.NET. The resulting training, development and test sets are summarised in
Table 2.4.
10 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 11 � #17
✐ ✐
2.3. The Open ast Matterhorn platform
Tables 2.4: Statisti s of the English VideoLe tures.NET training, develop-
ment and test partitions
Training Development Test
Videos 20 4 4
Speakers 68 11 25
Hours 20h 3.2h 3.4h
Senten es 5K 1K 1.3K
Words 130K 28K 34K
Vo abulary 7K 3K 3K
2.3 The Open ast Matterhorn platform
Matterhorn is a free, open-sour e software to support the management of edu ational
audio and video ontent. Higher edu ation institutions use Matterhorn to produ e
le ture re ordings, manage existing video, serve designated distribution hannels, and
provide user interfa es to engage students with edu ational video. The proje t om-
bines individual solutions and fo uses the e�orts and experien e of di�erent univer-
sities in one shared open produ t. The reation of a uni�ed system with an open
development pro ess is proje ted to foster the ex hange of edu ational ontent also.
To this end, it in ludes the following features:
• Administrative tools for s heduling automated re ordings, manually uploading
�les, and managing videos, metadata, work�ows and pro essing fun tions.
• Integration with re ording devi es in the lassroom for managing automated
apture of audio, VGA, and multiple video sour es.
• Pro essing and en oding servi es that prepare and pa kage the media �les a -
ording to on�gurable spe i� ations, in luding ri h media features (slide seg-
mentation/indexation) for in-video sear h.
• Distribution to lo al streaming and download servers and on�guration apa-
bility for distribution to hannels su h as YouTube, iTunes or a ampus ourse
or ontent management system
• Ri h media user interfa e for learners to engage with ontent, in luding slide
preview, ontent-based sear h, heatmaps and additional features.
Multiple edu ational institutions have adopted Open ast Matterhorn as their
video Content Management System (CMS). The University of California Berkeley,
the University of Vigo or the University of Osnabrue k are some examples urrently
involved into the Open ast Community. The international su ess of Open ast Mat-
terhorn makes the platform a perfe t s enario for the integration of state-of-the-art
automati spee h re ognition and ma hine translation te hnologies. By integrating
transLe tures tools into the Matterhorn platform, edu ational institutions will be
MLLP-DSIC-UPV 11
�memoria� � 2014/9/11 � 12:16 � page 12 � #18
✐ ✐
Chapter 2. Preliminaries
able to break the language barrier and support people with hearing disabilities by
subtitling their videos in multiple languages.
12 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 13 � #19
✐ ✐
Chapter 3
The transLe tures Platform
3.1 Introdu tion
The transLe tures Platform [41℄ omprises a set of tools for integrating automati
trans ription and translation te hnologies into large edu ational video repositories.
Its main omponents are the transLe tures Database, Web Servi e, Ingest Servi e and
Player. These tools have been used to integrate transLe tures' ASR and MT systems
into the previously des ribed poliMedia and VideoLe tures.NET repositories [39℄.
Figure 3.1 gives a general overview of the transLe tures Platform integration into
poliMedia, VideoLe tures.NET or any other video le ture repository. In this �gure,
the prin ipal onne tions between the di�erent tools in a lient repository are illus-
trated. The transLe tures Database, Web Servi e and Ingest Servi e will be des ribed
in detail in Se tions 3.2, 3.3 and 3.4. The transLe tures Player, one of the author's
main ontributions, is given spe ial attention in Se tion 3.5.
3.2 transLe tures Database
The transLe tures Database is a PostgreSQL relational database whi h stores all the
data required for the Web Servi e and the Ingest Servi e. The main ategories of
data stored in the Database are as follows:
• Video le tures: All the information related to a spe i� le ture is stored in
the database, in luding language, duration, title, keywords and ategory. In
addition, an external ID, provided by the lient repository, is stored and used
for le ture identi� ation purposes in all transa tions performed between the
lient and the Player and Web Servi e.
• Speakers: Information about the le ture speaker an be used by the ASR
systems to adapt the underlying models to the unique hara teristi s of a given
speaker and, therefore, improve the quality of the resulting subtitles.
13
�memoria� � 2014/9/11 � 12:16 � page 14 � #20
✐ ✐
Chapter 3. The transLe tures Platform
Figure 3.1: General overview of the transLe tures Platform integration into
a lient repository.
• Subtitles: All subtitles generated via the Ingest Servi e are stored in the
database and retrieved by the Web Servi e.
• Uploads: Every time an /ingest operation is performed, a new upload entry is
stored in the database.
Media and subtitle �les are stored on the hard drive separately from the relational
database. We an distinguish between three di�erent kind of �les:
• Media: Video/audio �les of the le tures that already exist in the database,
plus related �les su h as slides, external do uments and thumbnails.
14 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 15 � #21
✐ ✐
3.3. transLe tures Web Servi e
• Trans riptions: Subtitle �les in DFXP format.
• Uploads: Uploaded Media Pa kage Files (MPF) and the �les ontained within
them.
3.3 transLe tures Web Servi e
The transLe tures Web Servi e is the interfa e for transfering information and data
between the poliMedia, VideoLe tures.NET or any other repository and the transLe -
tures Platform. It also enables the subtitle visualisation and editing apabilities of
the transLe tures Player. This web servi e is implemented as a Python Web Server
Gateway Interfa e (WSGI), and de�nes a set of HTTP interfa es related to subtitle
delivery and media upload:
• /ingest: POST request whi h allows the lient to upload audio/video �les and
other related material, su h as slides and other text resour es that an be used
to adapt the ASR system, together with other metadata in a Media Pa kage
File (MPF). This MPF will be later pro essed by the Ingest Servi e in order to
generate automati trans riptions and translations for the uploaded media.
• /status: GET request to he k the status of a video le ture uploaded via the
/ingest interfa e.
• /le turedata: GET request that returns basi metadata and �le lo ations for a
given video le ture.
• /langs: GET request that provides the lient with a list of subtitles and lan-
guages available for a given video le ture.
• /dfxp: GET request that returns the subtitles in DFXP format for a given
le ture and language.
• /mod: POST request that sends and ommits hanges made by a user when
editing a trans ription or translation. User orre tions are later used by ASR
and SMT systems in order to improve the underlying models.
3.4 transLe tures Ingest Servi e
The Ingest Servi e is the tool devoted to handling and properly pro essing the Media
Pa kage Files uploaded via the /ingest interfa e of the Web Servi e. It is implemented
as a Python module that should be exe uted periodi ally (typi ally every minute) to
he k for and pro ess new le ture uploads, and also to assess whether existing uploads
are being pro essed orre tly. The uploads table of the Database is used to keep the
status of every upload up-to-date. This information is also a essed by the Web
Servi e's /status interfa e.
MLLP-DSIC-UPV 15
�memoria� � 2014/9/11 � 12:16 � page 16 � #22
✐ ✐
Chapter 3. The transLe tures Platform
As it was previously mentioned, MPFs are uploaded to the transLe tures Platform
via the Web Servi e's /ingest interfa e and stored in the Database. Then the Ingest
Servi e reads the uploads table of the Database and starts pro essing ea h MPF. An
upload will typi ally follow the following sequential steps (see Figure 3.2):
Figure 3.2: Overview of the transLe tures Ingest Servi e work�ow.
1. Media Pa kage Pro essing: In this step the MPF is unzipped and a series of
se urity, data integrity and data format he ks are performed on the unpa ked
data. If all he ks ome up lean, then the MPF data is redire ted to the next
step, whi h might be the 2nd, 3rd, 4th or 5th step listed here, depending on the
input data.
2. Trans ription Generation: In this step a trans ription �le in DFXP format is
generated from the main media �le (video, audio) using a suitable ASR Module.
3. Translation(s) Generation: In this step one or more translation �les in
DFXP format will be generated from the trans ription �le (whether automati-
ally generated in the previous step or provided in the MPF) using suitable MT
Modules.
16 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 17 � #23
✐ ✐
3.5. transLe tures Player
4. Media Conversion: In this step the main media �le is onverted into the
video formats required by the transLe tures Player in order to maximise browser
ompatibility.
5. Store Data: In this last step, the data ontained in the MPF and the data
automati ally generated by the Ingest Servi e are stored in the Database.
3.5 transLe tures Player
Massive adaptation an deliver substantial ontributions to the improvement of tran-
s riptions overall quality, but it is our belief that su� iently a urate results are
unlikely to be obtained through fully-automated approa hes alone. Instead, in or-
der to rea h the desired levels of a ura y, we must onsider user intera tion. For
that purpose, an HTML5 post-editing appli ation has been arefully designed to ex-
pedite the error supervision task [46℄, and thereby obtain subtitles of an a eptable
quality in ex hange for a minimum amount of user e�ort. Figure 3.3 illustrates the
ommuni ation between the Player and the other tools in the platform.
Figure 3.3: transLe tures Player ommuni ations diagram.
In this se tion we des ribe the three di�erent versions of the tool that were devel-
oped and evaluated by users of both poliMedia and VideoLe tures.NET platforms.
3.5.1 Standard post-editing
In the standard post-editing approa h, users are shown automati trans riptions and
translations while viewing the le ture. The standard layout is illustrated in Figure 3.4.
MLLP-DSIC-UPV 17
�memoria� � 2014/9/11 � 12:16 � page 18 � #24
✐ ✐
Chapter 3. The transLe tures Platform
However, three alternative editing layouts are available for users to hoose from a -
ording to their personal preferen es. In any of them, users an freely supervise any
trans ription and translation segment. Additionally, a omplete set of key short uts
has been implemented to enhan e expert user apabilities.
Figure 3.4: transLe tures Player standard post-editing mode (default side-
by-side layout).
The user intera tion an be summarised as follows: the video le ture and the
orresponding trans ription or translation are played in syn hrony, allowing users to
read the trans ription while wat hing the video. When a user spots an error, they
an li k on the parti ular segment to pause the video and enter their hanges.
3.5.2 Intelligent intera tion
Standard user models for the trans ription of audiovisual obje ts, like the one pre-
sented above, are bat h-oriented. These models yield satisfying results when highly
ollaborative users are working on near-perfe t system output. To �nd out whether
we ould further improve supervision times, an alternative intera tion strategy, based
on on�den e measures [35℄, was introdu ed. These on�den e measures provide an
indi ator as to the probable orre tness of ea h word appearing in the automati
trans ription. Words with low on�den e values are likely to have been in orre tly
re ognised at the point of ASR. The idea is that by fo using supervision a tions only
on in orre tly-trans ribed words, we an optimise user intera tion to get the best
possible trans ription in ex hange for the least amount of e�ort [36, 40℄.
Figure 3.5 shows the intelligent intera tion interfa e. Here, users are asked to
supervise a subset of words presele ted by the CAT ( omputer-assisted translation)
system as low on�den e. This subset typi ally onstitutes between 10-20% of all
words trans ribed using the ASR system, though users are able to modify this range
at will to as low as 5% and as high as 40%, depending on the per eived a ura y of
the trans ription. Ea h word was played in the ontext of one word before and one
word after, in order to fa ilitate its omprehension and resulting orre tion. In the
�gure, low- on�den e words are shown in red and orre ted low- on�den e words in
18 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 19 � #25
✐ ✐
3.5. transLe tures Player
Figure 3.5: transLe tures Player intelligent intera tion interfa e (only editing
box is shown).
green. The text box that opens for ea h low- on�den e word an be expanded in
either dire tion in order to modify the surrounding text as required.
3.5.3 Two-step supervision
Intelligent intera tion an qui kly improve the trans ription a ura y with limited
user e�ort. Nevertheless, as the CAT system will unlikely �nd all possible trans rip-
tion errors, a small amount of in orre t words will remain. The system ould use
the intelligent intera tion inputs as onstraints in the ASR sear h spa e in order to
�nd the best trans ription hypothesis. Figure 3.6 gives an overview of the two-step
supervision strategy.
Basi ally, in the �rst step, low- on�den e words are presented to the user for
supervision in in reasing order of on�den e. These words keep being presented until
one of the three following onditions are met:
• The total supervision time rea hes double the duration of the video itself.
• No orre tions are entered for �ve onse utive segments.
• 20% of all words are supervised.
Then, supervision a tions are fed into the ASR system to generate a new and
MLLP-DSIC-UPV 19
�memoria� � 2014/9/11 � 12:16 � page 20 � #26
✐ ✐
Chapter 3. The transLe tures Platform
Figure 3.6: transLe tures Player two-step supervision strategy overview.
improved trans ription. In the se ond step, users supervise the improved trans ription
from start to �nish, following the standard post-editing method.
3.5.4 User evaluations
User evaluations were arried out within the transLe tures proje t in order to evaluate
the models and tools in a real-life setting. UPV le turers, having already �lmed
material for poliMedia, were invited under 2012-2013 and 2013-2014 Do èn ia en
Xarxa alls to evaluate the omputer-assisted trans ription tools being developed in
transLe tures. Le turers signing up for this programme ommitted to supervising the
automati trans riptions of �ve of their poliMedia videos using the tools des ribed in
this hapter.
In order to evaluate the di�erent intera tion strategies, UPV le turers were asked
to rate various aspe ts on a Likert s ale from 1-10 (see Table 3.2). In addition, usage
statisti s were olle ted to obje tively evaluate the real user e�ort (see Figure 3.7).
The metri used to that end was the Real Time Fa tor (RTF), whi h is the ratio
(R) between the time spent by the user ellaborating the trans ription (P ) and the
duration of the video (T ).
R =P
T(3.1)
Figure 3.7 shows the RTF as fun tion of WER
1
for the di�erent intera tion strate-
gies. Table 3.1 shows detailed information regarding this �gure, whi h is dis ussed
below.
1
WER stands for word error rate, and it is based on the Levenshtein distan e string metri .
Informally, the WER between two senten es is the minimum number of word edits (i.e. insertions,
deletions or substitutions) required to hange one senten e into the other, divided by the original
senten e length.
20 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 21 � #27
✐ ✐
3.5. transLe tures Player
1
3
5
10
5 10 20 30 40
RTF
WERPhase-11
3
5
10
5 10 20 30 40
RTF
WERPhase-21
3
5
10
5 10 20 30 40
RTF
WERPhase-3
Figure 3.7: RTF as a fun tion of WER for the di�erent intera tion strategies.
Phases 1, 2 and 3 orrespond to standard post-editing, intelligent intera tion
and two-step supervision, respe tively.
1. Standard post-editing : The average WER of the initial automati trans riptions
was 16.9, and the average RTF for the post-editing pro ess was 5.4. When om-
pared to that re orded for trans ribing from s rat h (∼10 RTF for non-expert
users [29℄), we got a signi� ant de rease of about 50%. This way, le turers'
performan e be ame omparable to that of professional trans riptionists [17℄,
rather than that expe ted from non-expert trans riptionists [28℄.
2. Intelligent intera tion: The mean RTF was lowered to 2.2, but it must be noted
that the resulting trans riptions were not error free. The WER was redu ed
(in average) from 19.5 to 8.0 after the intelligent intera tion supervision. These
results underline the e�e tiveness of on�den e measures.
3. Two-step supervision: The mean RTF for the �rst step was as low as 1.4.
Although it only redu ed the average WER from 28.4 to 25.0, the ASR massive
adaptation and onstrained sear h post-pro ess was able to lower the WER of
the trans riptions to 18.7. Then, the improved trans riptions were supervised
following the standard post-editing strategy (RTF 3.9). The umulative RTF
was 5.3. Although this RTF is omparable to that obtained for the standard
post-editing strategy, it should be stressed that the initial WER was 28.4 (w.r.t.
16.9) in this ase.
3.5.5 Con lusions
A ording to Table 3.2, we an see that le turer preferen es do not rely on pure ef-
� ien y metri s. Although standard post-editing WER redu tion per RTF unit was
the lowest, UPV le turers rated it as the best hoi e. When they were questioned
MLLP-DSIC-UPV 21
�memoria� � 2014/9/11 � 12:16 � page 22 � #28
✐ ✐
Chapter 3. The transLe tures Platform
Tables 3.1: Summary of results obtained for ea h intera tion strategy.
SP-E (1) II (2) T-SS (3)
Initial WER 16.9 19.5 28.4
Final WER 0.0 8.0 0.0
RTF 5.4 2.2 5.3
∇WER/RTF 3.1 5.2 5.4
Tables 3.2: Satisfa tion survey results for ea h intera tion strategy.
Question (summarised) Mean (SP-E) Mean (II) Mean (TS-S)
Intuitiveness
1- Easy to use. 9.4 7.8 7.5
2- Easy to learn. 9.4 8.1 8.6
3- Help information lear. 9.2 8.1 8.5
4- Organisation on s reen lear. 9.0 8.4 8.7
Grand Mean 9.3 8.1 8.3
Likeability
5- Comfortable. 8.7 6.5 7.3
6- Like the interfa e. 8.7 6.9 7.4
7- I am satis�ed. 9.0 6.9 7.4
Grand Mean 8.8 6.8 7.4
Usability
8- E�e tively omplete work. 9.0 6.7 7.7
9- Qui ker than from s rat h. 8.6 6.6 7.4
10- Has everything I expe t. 9.0 5.6 7.1
Grand Mean 8.9 6.3 7.4
Overall Mean 9.1 7.2 7.8
about the intelligent intera tion and the two-step supervision strategies, they an-
swered they did not want to leave any error in the trans riptions, and they did want
to avoid supervising the same le ture twi e.
However, their ratings might be in�uen ed by many fa tors. For instan e, it is
understandable that users rate better a tool for editing an automati trans ription
when the initial WER is 16.9 than when it is 28.4. Another fa tor that might stress
their obsession for leaving no errors in the trans riptions is the fa t that they were
editing their own le ture trans riptions.
What we an on lude is that the standard post-editing intera tion is more in-
tuitive and user-friendly than the intelligent intera tion and two-step supervision
alternatives, whi h might require of some expertise. Nevertheless, if a relatively small
22 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 23 � #29
✐ ✐
3.5. transLe tures Player
amount of time is going to be spent on supervising a trans ription, and the a ura y
stri tness is not tight, intelligent intera tion and two-step supervision have proven to
be signi� antly more e� ient hoi es.
MLLP-DSIC-UPV 23
�memoria� � 2014/9/11 � 12:16 � page 24 � #30
✐ ✐
Chapter 3. The transLe tures Platform
24 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 25 � #31
✐ ✐
Chapter 4
Integration into Open ast
Matterhorn
4.1 Introdu tion
In this hapter we address the integration of transLe tures' ASR and MT solutions
into the Open ast Matterhorn platform, whi h was introdu ed in Se tion 2.3. The
goal is to integrate the tools des ribed in Chapter 3 into the di�erent Matterhorn
work�ow phases. This is dis ussed below in three se tions: Se tion 4.2 des ribes the
system ar hite ture of the Matterhorn platform. Next, a more detailed analysis of
the platform, fo using on relevant details for integration purposes, is presented in
Se tion 4.3. Finally, the strategies followed for the su essfull integration of the tools
are des ribed in Se tion 4.4.
4.2 System ar hite ture
The Matterhorn platform omprises four modules: le ture apture and administra-
tion, ingest and pro essing, distribution and engage tools. Figure 4.1 shows a diagram
of the Matterhorn ar hite ture whi h in ludes its main omponents and dependen ies
among them.
The members of the Open ast Community have sele ted Java as programming lan-
guage to reate the ne essary appli ations and a Servi e-Oriented Ar hite ture (SOA)
infrastru ture. The overall appli ation design is highly modularised and relies on the
OSGi (dynami module system for Java) te hnology. The OSGi servi e platform
provides a standardised, omponent-oriented omputing environment for ooperating
network servi es.
The di�erent phases of the Matterhorn work�ow are applied as follows:
1. S hedule/prepare and apture: The re ording pro ess begins with deter-
mining what is to be re orded, where and in what form. Campus data will be
25
�memoria� � 2014/9/11 � 12:16 � page 26 � #32
✐ ✐
Chapter 4. Integration into Open ast Matterhorn
Figure 4.1: Matterhorn ar hite tural draft.
integrated by the universities` IT departments. For this purpose, Matterhorn is
open to both the learning management systems and administrative data bases.
Syllabi, le ture and room timetables do not only provide the basi information
to answer the question raised above, but in an ideal ase, most of the meta-
data related to the re ording (le turer, title, summary, language, et .) as well.
Re ording devi es are then s heduled to automati ally re ord, e.g. in le ture
hall 0S03, every Tuesday from 10:00 to 12:00, the le ture on �XYZ� by Prof.
ABC.
2. Ingest and pro essing: At the end of the re ording the tra ks are sent to
an �inbox� to be pro essed. The inbox also serves as �ingest� for other video
obje ts to be integrated in the subsequent work�ows of Matterhorn. The dif-
ferent re ording tra ks (audio, ontent, video) are bundled to a media pa kage,
ontent-indexed (at �rst through opti al hara ter re ognition of the slide, later
ertainly through audio re ognition also) and if ne essary ar hived in the most
native formats. They are en oded a ording to the spe i�ed distribution pa-
rameters.
3. Distribution: The distribution demands of the universities are extremely het-
erogeneous: they go from simple integration of the videos in lo al WCMS or
blogs, to posting in password-prote ted LMS, to distribution via iTunes U or
26 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 27 � #33
✐ ✐
4.3. Platform analysis
YouTube. The distribution module uploads ingested ontent into di�erent dis-
tribution hannels a ording to parti ular institutions needs.
4. Engage tools: This module is losely linked to the distribute module sin e it
must also manage presentation and use of the obje ts. However, appli ations in
the Engage module make it possible to use omprehensive information (meta-
data, video and audio analysis, annotations, and use analysis) for intelligent
user interfa es. Likewise, support of learning management systems (LMS) or
virtual learning environments (VLE) is an important issue. To make sure that
the produ ed material will be used, Matterhorn video and audio player ompo-
nents are easily integrated in existing ourse websites, wikis, and blog systems.
So ial annotations, whi h an be used to improve sear h or navigation and feed-
ba k possibilities are �own ba k to the system like the user statisti s already
mentioned.
4.3 Platform analysis
In order to su essfully integrate the transLe tures' tools into the Open ast Matter-
horn platform, further analysis of the platform must be done. In this se tion, we
des ribe in detail di�erent aspe ts of the Matterhorn platform whi h are relevant for
this purpose.
4.3.1 Media pa kage
After audio/video material has been sent to the inbox, the media is bundled into
a media pa kage. A media pa kage is onsidered the business do ument within the
Matterhorn system. Besides the media obje ts, it in ludes further information from
media analysis as well as metadata. Every media pa kage therefore onsists of a
manifest and a list of pa kage elements that are referred to in the manifest. Pa kage
elements are:
• Media tra ks (audiovisual material)
• Metadata atalogs
• Atta hments (slides, pdf, text, annotations, et .)
The media pa kage generation pro ess is arried out automati ally when new
material is uploaded to the system (see Figure 4.2).
4.3.2 Work�ow
On e a media pa kage has been su essfully generated, it goes through a set of opera-
tions, like the en oding of the media �les in di�erent formats or the publi ation of the
media in distribution hannels. These operations are de�ned in a work�ow. A Mat-
terhorn work�ow is an ordered list of operations de�ned in a XML �le. There is no
MLLP-DSIC-UPV 27
�memoria� � 2014/9/11 � 12:16 � page 28 � #34
✐ ✐
Chapter 4. Integration into Open ast Matterhorn
Figure 4.2: Media pa kage ingest pro ess.
limit to the number of operations or their repetition in a given work�ow. A work�ow
operation an run autonomously or pause itself to allow for external, usually user,
intera tion. Although there are some prede�ned work�ows in the default Open ast
Matterhorn installation, ustom work�ows an be de�ned in order to perform the
desired operations to all media pa kages being ingested into the system.
Work�ow operations are usually alls to di�erent platform servi es. These opera-
tions are de�ned in the Matterhorn's Condu tor Servi e and are known as work�ow
operation handlers. Some basi work�ow operations in luded in Matterhorn are, for
instan e, the inspe t operation whi h extra ts te hni al information about the media
�les in luded in the media pa kage, the ompose operation for en oding video and au-
dio �les in di�erent formats, or the ar hive operation to store the �les in the system.
Custom work�ow operation handlers an be implemented in order to extend work�ow
fun tionalities.
28 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 29 � #35
✐ ✐
4.4. Integration pro ess
4.4 Integration pro ess
The integration of transLe tures' tools into the Matterhorn platform an be divided
in two parts. The �rst part (Se tion 4.4.1) will over the generation of automati
trans riptions and translations for the ingested media. The se ond part (Se tion 4.4.2)
will involve the visualisation and supervision of the automati ally generated aptions.
The overall integration pro ess is illustrated in Figure 4.3.
Figure 4.3: Matterhorn integration overview.
4.4.1 Generation of automati trans riptions and translations
Figure 4.2 showed how ingested media is sent to the Work�ow Servi e in order to be
pro essed by a parti ular work�ow. As mentioned in Se tion 4.3.2, ustom work�ows
an be de�ned to perform spe i� operations during the ingest pro ess. In order
to send the media �les to the transLe tures Platform, a ustom work�ow in luding a
ustom work�ow operation handler will be de�ned. In this way, audio tra ks extra ted
from the ingested media �les will be wrapped up in a transLe tures media pa kage
and sent to the transLe tures Web Servi e ingest interfa e. To summarise:
1. Media tra ks are ingested into the Matterhorn platform.
2. The transLe tures work�ow pro esses the media tra ks:
MLLP-DSIC-UPV 29
�memoria� � 2014/9/11 � 12:16 � page 30 � #36
✐ ✐
Chapter 4. Integration into Open ast Matterhorn
(a) Initial standard Matterhorn operations are performed (inspe t, prepare-av,
ompose, trim, et .)
(b) Audio tra ks are extra ted from media �les (FLAC format)
( ) A transLe tures media pa kage is generated, in luding:
• Extra ted audio �les
• Media language
• Matterhorn mediaPa kage ID
• Title
• Author
• Duration
• Atta hments (slides, do uments, et .)
(d) The media pa kage is sent to the transLe tures Web Servi e /ingest inter-
fa e.
(e) Final standard Matterhorn operations are performed (segmentpreviews,
publish, et .)
The des ribed pro ess is illustrated in Figure 4.4.
Figure 4.4: transLe tures ustom work�ow overview.
4.4.2 Subtitle visualisation and supervision
The generated subtitles must be made available to the users of the platform when
a essing the media. Furthermore, as it was meant in transLe tures, users should be
30 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 31 � #37
✐ ✐
4.4. Integration pro ess
able to supervise the automati trans riptions and translations in order to improve
its a ura y. To this purpose, an alternative to the Matterhorn Engage Player is
presented.
The Paella Player
1
, developed by the Área de Sistemas de Informa ión y Comu-
ni a iones (ASIC) at the UPV, is an HTML5 multistream video player apable of
playing multiple audio and video streams syn hronously and supporting a number of
user plugins. It is spe ially designed for le ture re ordings and fully ompatible with
the Matterhorn platform.
Paella Player also provides support for the transLe tures Platform tools. It is
apable of displaying transLe tures subtitles by dire tly a essing the transLe tures
Web Servi e. In addition, it also provides a link to supervise the trans riptions and
translations by using the transLe tures Player. Figure 4.5 illustrates the subtitle
visualisation support of Paella Player.
Figure 4.5: Paella Player showing available subtitles for a Matterhorn le ture.
1
Visit http://paellaplayer.upv.es for more information.
MLLP-DSIC-UPV 31
�memoria� � 2014/9/11 � 12:16 � page 32 � #38
✐ ✐
Chapter 4. Integration into Open ast Matterhorn
32 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 33 � #39
✐ ✐
Chapter 5
Using spee h trans riptions
in le ture re ommender
systems
5.1 Introdu tion
One problem reated by the su ess of video le ture repositories is the di� ulty fa ed
by individual users when hoosing the most suitable video for their learning needs
from among the vast numbers available on a given site. Users are often overwhelmed
by the amount of le tures available and may not have the time or knowledge to �nd the
most suitable videos for their learning requirements. Up until re ently, re ommender
systems have mainly been applied in areas su h as musi [24, 30℄, movies [11, 50℄,
books [31℄ and e- ommer e [13℄, leaving video le tures largely to one side. Only a
few ontributions to this parti ular area an be found in the literature, most of them
fo used on VideoLe tures.NET [5℄. However, none of them has explored the possibility
of using le ture trans riptions to better represent le ture ontents at a semanti level.
In this hapter we des ribe a ontent-based le ture re ommender system that uses
automati spee h trans riptions, alongside le ture slides and other relevant external
do uments, to generate semanti le ture and user models. In Se tion 5.2 we give
an overview of this system, fo using on the text extra tion and information retrieval
pro ess, topi and user modeling and the re ommendation pro ess. In Se tion 5.3 we
address the dynami update of the re ommender system and the required optimisa-
tions needed to maximise the s alability of the system. The integration of the system
presented in Se tions 5.2 and 5.3 into VideoLe tures.NET, arried out as part of the
PASCAL Harvest Proje t La Vie, is des ribed in detail in Se tion 5.4. Finally, we
lose with some on luding remarks, in Se tion 5.5.
33
�memoria� � 2014/9/11 � 12:16 � page 34 � #40
✐ ✐
Chapter 5. Using spee h trans riptions in le ture re ommender systems
Figure 5.1: System overview.
5.2 System overview
Fig. 5.1 gives an overview of the re ommender system. The left-hand side of the
�gure show the topi and user modeling pro edure, whi h an be seen as the training
pro ess of the re ommender system. To the right we see the re ommendation pro ess.
The aim of topi and user modeling is to obtain a simpli�ed representation of ea h
video le ture and user. The resulting representations are stored in a re ommender
database. This database will be exploited later in the re ommendation pro ess in
order to re ommend le tures to users.
As shown in Fig. 5.1, every le ture in the repository goes through the topi and
user modeling pro ess, whi h involves three steps. The �rst step is arried out by the
text extra tion module. This module omprises three submodules: ASR (Automati
Spee h Re ognition), WS (Web Sear h) and OCR (Opti al Chara ter Re ognition).
34 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 35 � #41
✐ ✐
5.2. System overview
As its name suggests, the ASR submodule generates an automati spee h trans rip-
tion of the video le ture. The WS submodule uses the le ture title to sear h for
related do uments and publi ations on the web. The OCR submodule extra ts text
from the le ture slides, where available. The se ond step takes the text retrieved by
the text extra tion module and omputes a bag-of-words representation. This bag-of-
words representation onsists of a simpli�ed text des ription ommonly used in nat-
ural language pro essing and information retrieval. More pre isely, the bag-of-words
representation of a given text is its ve tor of word ounts over a �xed vo abulary.
Finally, in the third step, le ture bags-of-words are used to represent the users of the
system. That is, ea h user is represented as the bag-of-words omputed over all the
le tures the user has ever seen.
When the topi and user modeling pro ess ends, the re ommender database is
ready for exploitation by the re ommender engine (see the right-hand side of Fig. 5.1).
This engine uses re ommendation features to al ulate a measure s of the suitability
of the re ommendation for every (u, v, r) triplet, where u refers to a parti ular user, v
is the le ture they are urrently viewing and r is a hypotheti al le ture re ommenda-
tion. In re ommender systems, this is usually referred to as the utility fun tion [34℄.
Spe i� ally, it indi ates how likely it is that a user u would want to wat h le ture r
after viewing le ture v. For instan e, this utility fun tion an be omputed as a linear
ombination of re ommendation features:
s(u, v, r) = w · x =
N∑
n=1
wn · xn (5.1)
where x is a feature ve tor omputed for the triplet (u, v, r), w is a feature weight
ve tor and N is the number of re ommendation features. In this work, the following
re ommendation features were onsidered:
1. Le ture popularity: number of visits to le ture r.
2. Content similarity: weighted dot produ t between the le ture bags-of-words v
and r [19℄.
3. Category similarity: number of ategories (from a prede�ned set) that v and r
have in ommon.
4. User ontent similarity: weighted dot produ t between the bags-of-words u and
r.
5. User ategory similarity: number of ategories in ommon between le ture r
and all the ategories of le tures the user u has wat hed in the past.
6. Co-visits: number of times le tures v and r have been seen in the same browsing
session.
7. User similarity: number of di�erent users that have seen both v and r.
MLLP-DSIC-UPV 35
�memoria� � 2014/9/11 � 12:16 � page 36 � #42
✐ ✐
Chapter 5. Using spee h trans riptions in le ture re ommender systems
Feature weights w an be learned by training di�erent statisti al lassi� ation
models, su h as support ve tor ma hines (SVMs), using positive and negative (u, v,
r) re ommendation samples.
The most suitable re ommendation r for a given u and v is omputed as follows:
r = arg max
r
s(u, v, r) (5.2)
However, in re ommender systems the most ommon pra ti e is to provide the
user the M re ommendations r that a hieve the highest utility values s, for instan e,
the �rst 10 le tures.
5.3 System updates and optimisation
Le ture repositories are rarely stati . They may grow to in lude new le tures, or
have outdated videos removed. Also, users' learning progress or intera tions with the
repository in�uen e the user models. The re ommender database must therefore be
onstantly updated in order to in lude the new le tures added to the repository and
update the user models. Furthermore, the addition of new le tures to the system
might lead to hanges to the bag-of-words (�xed) vo abulary. Any variation to this
vo abulary involves a omplete regeneration of the re ommender database. That said,
hanges to the vo abulary may not be signi� ant until a substantial per entage of new
le tures has been added to the repository.
Two di�erent update s enarios an be de�ned: the in orporation of new le tures
and updating the user models, on the one hand, and the rede�niton of the bag-of-words
vo abulary, in luding the regeneration of both the le ture and user bags-of-words, on
the other. We will refer to these s enarios as regular update and o asional update,
respe tively, after the di�erent periodi ities with whi h they are meant to be run.
• Regular update: The regular update is responsible for in luding the new le tures
added to the repository and updating the user models with the last user a tivity,
both in the re ommender database. As its name suggests, this pro ess is meant
to be run on a daily basis, depending on the frequen y with whi h new le tures
are added to the repository, sin e new le tures annot be re ommended until
they have been pro essed and in luded in the re ommender database.
• O asional update: As mentioned in Se tion 5.2, le ture bags-of-words are al-
ulated under a �xed vo abulary. Sin e there is no vo abulary restri tion on the
text extra tion pro ess, we need to modify the bag-of-words vo abulary as new
le tures are added to the system. The o asional update arries out the pro ess
of updating this vo abulary, whi h involves re al ulating both the le ture and
user bags-of-words.
In order to maximise the s alability of the system, while also redu ing the response
time of the re ommender, the features Content similarity, Category similarity, Co-
visits and User similarity des ribed in Se tion 5.2 are pre omputed for every possible
36 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 37 � #43
✐ ✐
5.4. Integration into VideoLe tures.NET
le ture pair and stored in the re ommender database. Then, during the re ommen-
dation pro ess, the re ommender engine loads the values of these features, leaving
the omputation of features User ontent similarity and User ategory similarity un-
til runtime. The de ision to al ulate the features User ontent similarity and User
ategory similarity at runtime was driven by the highly dynami nature of the user
models, in ontrast to the le ture models, whi h remain onstant until the bag-of-
words vo abulary is hanged.
5.4 Integration into VideoLe tures.NET
The proposed re ommendation system was implemented and integrated into the Vide-
oLe tures.NET repository during the PASCAL2 Harvest Proje t La Vie (Learning
Adapted Video Information Enhan er) [32℄. Said integration is dis ussed here a ross
three subse tions. First, in Se tion 5.4.1 we address the generation of le ture and
user models from video le ture trans riptions and other text resour es. Next, in
Se tion 5.4.2 we des ribe how re ommender feature weights were learned from data
olle ted from the existing VideoLe tures.NET re ommender system. Finally, we
present our evaluation of the system in Se tion 5.4.3.
5.4.1 Topi and user modeling
The �rst step in generating le ture and user models involved olle ting textual in-
formation from di�erent sour es. In parti ular, for VideoLe tures.NET, the text
extra tion module gathered textual information from the following sour es:
• transLe tures spee h trans riptions.
• Web sear h-based textual information from Wikipedia, DBLP and Google (ab-
stra ts and/or arti les).
• Text extra ted from le ture presentation slides (PPT, PDF or PNG using Op-
ti al Chara ter Re ognition (OCR)).
• VideoLe tures.NET internal database metadata.
Next, the text extra tion module output was used to generate le ture bags-of-
words for every le ture in the repository. These bags-of-words, as mentioned in Se -
tion 5.2, were al ulated under a �xed vo abulary that was obtained by applying a
threshold to the number of di�erent le tures in whi h a word must appear in order to
be in luded. By means of this threshold, vo abulary size is signi� antly redu ed, sin e
un ommon and/or very spe i� words are disregarded. On e de�ned, term weights
were al ulated using term frequen y-inverse do ument frequen y (td-idf), a statisti-
al weighting s heme ommonly used in information retrieval and text mining [26℄.
Spe i� ally, tf-idf weights are used to al ulate the features Content similarity and
User ontent similarity. Finally, the VideoLe tures.NET user a tivity log was parsed
in order to obtain values for the feature Co-visits for all possible le ture pairs, as
MLLP-DSIC-UPV 37
�memoria� � 2014/9/11 � 12:16 � page 38 � #44
✐ ✐
Chapter 5. Using spee h trans riptions in le ture re ommender systems
well as a list of le tures viewed per user. This list was used together with the le tures
bags-of-words to generate the users bags-of-words and ategories. These, in turn, were
used to al ulate User ontent similarity and User ategory similarity, respe tively,
as well as User similarity for all possible le ture pairs. In a �nal step, all this data
was stored in the re ommender database in order to be exploited by the re ommender
engine in the re ommendation pro ess.
5.4.2 Learning re ommendation feature weights
On e the data needed to ompute re ommendation feature values for every possible
(u, v, r) triplet in the repository was made available, the next step was to learn
the optimum feature weights w for the al ulation of the utility fun tion shown in
Equation 5.1. To this end, an SVM lassi�er was trained using data olle ted from
the existing VideoLe tures.NET naïve re ommender system (based only on keywords
extra ted from the le ture titles). Spe i� ally, every time a user li ked on any of
the 10 re ommendation links provided by this re ommender system, 1 positive and 9
negative samples were registered. SVM training was performed using the SVM
light
open-sour e software [20℄. The optimum feature weights were those that obtained the
minimum lassi� ation error over the re ommendation data.
5.4.3 Evaluation
Although there are many di�erent approa hes to the evaluation of re ommender sys-
tems [37, 18℄, it is di� ult to state any �rm on lusions regarding the quality of the
re ommendations made until they are deployed in a real-life setting. The La Vie
proje t therefore provided an ideal evaluation framework, being deployed a ross the
o� ial VideoLe tures.NET site. The strategy followed for the obje tive evaluation of
the La Vie re ommender was to ompare it against the existing VideoLe tures.NET
re ommender by means of a oin-�ipping approa h. Spe i� ally, this approa h on-
sisted of logging user li ks on re ommendation links provided by both systems on a
50/50 basis and omparing the total number of li ks re orded for ea h system.
The results did not show any signi� ant di�eren es between the two re ommenders
in terms of user behaviour. This an be explained by the fa t that user- li k ount
alone is not a legitimate point of omparison for re ommendation quality. For in-
stan e, random variables not taken into a ount might in�uen e how users respond
to the re ommendation links provided. As an alternative, we an ompare the rank
of the re ommendations li ked by users within ea h system. Spe i� ally, for ea h
re ommendation li ked by a user in either system, we an ompare how the same
re ommendation ranked in the other system. This might be a more appropriate mea-
sure for omparing the re ommendations in terms of suitability. However, additional
data need to be olle ted in order to arry out this alternative evaluation. This data
is urrently being olle ted and future evaluation results will be obtained following
this rank omparison approa h.
Despite the la k of obje tive eviden e for assessing the omparative performan e
of the La Vie system, subje tive evaluations indi ate that the proposed re ommender
38 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 39 � #45
✐ ✐
5.5. Con lusions and future work
system provides better re ommendations than the existing VideoLe tures.NET re -
ommender. Fig. 5.2 shows re ommendation examples from both systems for a new
user viewing a random VideoLe tures.NET le ture. Although re ommendation suit-
ability is a subje tive measure, La Vie re ommendations seem to be more appropriate
in terms of ontent similarity.
Figure 5.2: On the left, La Vie system re ommendations for the �Basi s
of probability and statisti s� VideoLe tures.NET le ture. On the right, re -
ommendations made by VideoLe tures.NET's existing system for the same
le ture.
5.5 Con lusions and future work
In this hapter we have shown how automati spee h trans riptions of video le -
tures an be exploited to develop a le ture re ommender system that an zoom in on
user interests at a semanti level. In addition, we have des ribed how the proposed
re ommender system has been parti ularly implemented for the VideoLe tures.NET
repository. This implementation was later deployed in the o� ial VideoLe tures.NET
site.
The proposed system ould also be extended for deployment a ross more gen-
eral video repositories, provided that video ontents are well represented in the data
obtained by the text extra tion module.
By way of future work we intend to evaluate the re ommender system using other
evaluation approa hes that measure the suitability of the re ommendations more a -
urately, su h as the aforementioned re ommendation rank omparison.
MLLP-DSIC-UPV 39
�memoria� � 2014/9/11 � 12:16 � page 40 � #46
✐ ✐
Chapter 5. Using spee h trans riptions in le ture re ommender systems
40 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 41 � #47
✐ ✐
Chapter 6
Con lusions and Future
Work
6.1 General Con lusions
In this thesis we have presented the integration of state-of-the-art ASR and MT sys-
tems, developed within the transLe tures proje t, into di�erent video le ture repos-
itories. In parti ular, we have shown how transLe tures' solutions to produ e ost-
e�e tive a urate trans riptions and translations have been integrated in poliMedia
and VideoLe tures.NET (Chapter 3), and also into the Open ast Matterhorn open-
sour e platform (Chapter 4). In addition, in Chapter 5 we have shown how the pro-
du ed automati trans riptions an be used to develop a ontent-based video le ture
re ommender system.
6.2 Contributions
The s ienti� publi ations related to this work are listed below.
• transLe tures. Silvestre-Cerdà, Joan Albert; Del Agua, Miguel; Gar és, Gonçal;
Gas ó, Guillem; Giménez-Pastor, Adrià; Martínez, Adrià; Pérez González de
Martos, Alejandro; Sán hez, Isaías; Martínez-Santos, Ni olás Serrano; Spen er,
Ra hel; Valor Miró, Juan Daniel; Andrés-Ferrer, Jesús; Civera, Jorge; San hís,
Alberto; Juan, Alfons. Pro eedings of IberSPEECH 2012, pp. 345-351, 2012.
• Integrating a state-of-the-art ASR system into the Open ast Matter-
horn platform. Juan Daniel Valor Miró, Alejandro Pérez González de Martos,
Jorge Civera and Alfons Juan. IberSPEECH 2012, vol. CCIS 328, Springer, p.
237�246, November 2012, Madrid (Spain).
• A System Ar hite ture to Support Cost-E�e tive Trans ription and
Translation of Large Video Le ture Repositories. Silvestre-Cerdà, Joan
41
�memoria� � 2014/9/11 � 12:16 � page 42 � #48
✐ ✐
Chapter 6. Con lusions and Future Work
Albert; Pérez, Alejandro; Jiménez, Manuel; Turró, Carlos; Juan, Alfons; Civera,
Jorge. IEEE International Conferen e on Systems, Man, and Cyberneti s (SMC)
2013 , pp. 3994-3999, 2013.
• Evaluating intelligent interfa es for post-editing automati trans rip-
tions of online video le tures. J.D. Valor Miró, R.N. Spen er, A. Pérez
González de Martos, G. Gar és Díaz-Munío, C. Turró, J. Civera and A. Juan.
Open Learning: The Journal of Open, Distan e and e-Learning. Vol. 29, Iss.
1, 2014.
• Evalua ión del pro eso de revisión de trans rip iones automáti as
para vídeos poliMedia. Valor Miró, Juan Daniel; Spen er, R N; de Martos,
Pérez González A; Díaz-Munío, Gar és G; Turró, C; Civera, J; Juan, A. Pro . of
I Jornadas de Innova ión Edu ativa y Do en ia en Red (IN-RED 2014), Valen ia
(Spain), 2014.
• Using Automati Spee h Trans riptions in Le ture Re ommendation
Systems. Pérez-González-de-Martos, Alejandro; Silvestre-Cerdà, Joan Albert;
Rihtar, Matjaº; Juan, Alfons; Civera, Jorge. IberSPEECH 2014. Submitted.
6.3 Future work
It is our intention to keep improving the di�erent CAT intera tion strategies presented
in Se tion 3.5 in order to redu e the required user e�ort up to a minimum. This might
involve to arry out additional user evaluations for gathering more user feedba k.
In addition, we will keep updating the transLe tures Platform tools presented in
Chapter 3 to meet the future requirements of poliMedia, VideoLe tures.NET and
other video le ture repositories.
42 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 43 � #49
✐ ✐
Bibliography
[1℄ Coursera. https://www. oursera.org.
[2℄ transLe tures: First report on massive adaptation. https://www.
transle tures.eu/wp- ontent/uploads/2013/05/transLe tures-D3.1.
1-18Nov2012.pdf.
[3℄ transLe tures: Se ond report on massive adaptation. http://www.
transle tures.eu/wp- ontent/uploads/2014/01/transLe tures-D3.1.
2-15Nov2013.pdf.
[4℄ Hiyan Alshawi, Shona Douglas, and Srinivas Bangalore. Learning dependen y
translation models as olle tions of �nite-state head transdu ers. Computational
Linguisti s, 26(1):45�60, 2000.
[5℄ Nino Antulov-Fantulin, Matko Bo²njak, Martin Znidar²i , and Miha et al. Gr ar.
E ml-pkdd 2011 dis overy hallenge overview. Dis overy Challenge, 2011.
[6℄ Win�eld S. Bennett and Jonathan Slo um. The lr ma hine translation system.
Comput. Linguist., 11(2-3):111�121, 1985.
[7℄ Adam L Berger, Peter F Brown, Stephen A Della Pietra, Vin ent J Della Pietra,
John R Gillett, John D La�erty, Robert L Mer er, Harry Printz, and Lubo² Ure².
The andide system for ma hine translation. In Pro eedings of the workshop
on Human Language Te hnology, pages 157�162. Asso iation for Computational
Linguisti s, 1994.
[8℄ Reinhard Billmeier. Zu den linguistis hen grundlagen von systran. Multilingua-
Journal of Cross-Cultural and Interlanguage Communi ation, 1(2):83�96, 1982.
[9℄ Chris Callison-Bur h, Cameron Fordy e, Philipp Koehn, Christof Monz, and Josh
S hroeder. Further meta-evaluation of ma hine translation. In Pro eedings of the
Third Workshop on Statisti al Ma hine Translation, pages 70�106. Asso iation
for Computational Linguisti s, 2008.
[10℄ Jaime G Carbonell, Ryszard S Mi halski, and Tom M Mit hell. An overview of
ma hine learning. In Ma hine learning, pages 3�23. Springer, 1983.
[11℄ Walter Carrer-Neto, María Luisa Hernández-Al araz, Rafael Valen ia-Gar ía,
and Fran is o Gar ía-Sán hez. So ial knowledge-based re ommender system. ap-
pli ation to the movies domain. Expert Systems with Appli ations, 39(12):10990�
11000, 2012.
43
�memoria� � 2014/9/11 � 12:16 � page 44 � #50
✐ ✐
Bibliography
[12℄ Fran is o Casa uberta and Enrique Vidal. Ma hine translation with inferred
sto hasti �nite-state transdu ers. Computational Linguisti s, 30(2):205�225,
2004.
[13℄ Jose Jesus Castro-S hez, Raul Miguel, David Vallejo, and Lorenzo Manuel López-
López. A highly adaptive re ommender system based on fuzzy logi for b2
e- ommer e portals. Expert Systems with Appli ations, 38(3):2441�2454, 2011.
[14℄ Atsushi Fujii, Katunobu Itou, and Tetsuya Ishikawa. Lodem: A system for on-
demand video le tures. Spee h Communi ation, 48(5):516 � 531, 2006.
[15℄ Jonathan Graehl and Kevin Knight. Training tree transdu ers. Te hni al report,
DTIC Do ument, 2004.
[16℄ Charles Miller Grinstead and James Laurie Snell. Introdu tion to probability.
Ameri an Mathemati al So ., 1998.
[17℄ Timothy J. Hazen. Automati alignment and error orre tion of human generated
trans ripts for long spee h re ordings. In Pro eedings of the Ninth International
Conferen e on Spoken Language Pro essing (Interspee h 2006 � ICSLP), 2006.
[18℄ Jonathan L Herlo ker, Joseph A Konstan, Loren G Terveen, and John T Riedl.
Evaluating ollaborative �ltering re ommender systems. ACM Transa tions on
Information Systems (TOIS), 22(1):5�53, 2004.
[19℄ Thorsten Joa hims. A probabilisti analysis of the ro hio algorithm with t�df
for text ategorization. Te hni al report, DTIC Do ument, 1996.
[20℄ Thorsten Joa hims. Svmlight: Support ve tor ma hine. SVM-Light Support
Ve tor Ma hine http://svmlight. joa hims. org/, University of Dortmund, 19(4),
1999.
[21℄ Slava Katz. Estimation of probabilities from sparse data for the language model
omponent of a spee h re ognizer. A ousti s, Spee h and Signal Pro essing, IEEE
Transa tions on, 35(3):400�401, 1987.
[22℄ Markus Ketterl, Olaf A S hulte, and Adam Ho hman. Open ast matterhorn:
A ommunity-driven open sour e software proje t for produ ing, managing,
and distributing a ademi video. Intera tive Te hnology and Smart Edu ation,
7(3):168�180, 2010.
[23℄ Philipp Koehn, Franz Josef O h, and Daniel Mar u. Statisti al phrase-based
translation. In Pro eedings of the 2003 Conferen e of the North Ameri an
Chapter of the Asso iation for Computational Linguisti s on Human Language
Te hnology-Volume 1, pages 48�54. Asso iation for Computational Linguisti s,
2003.
[24℄ Seok Kee Lee, Yoon Ho Cho, and Soung Hie Kim. Collaborative �ltering with
ordinal s ale-based impli it ratings for mobile musi re ommendations. Informa-
tion S ien es, 180(11):2142�2155, 2010.
44 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 45 � #51
✐ ✐
Bibliography
[25℄ Stephen E Levinson, Lawren e R Rabiner, and Man Mohan Sondhi. An intro-
du tion to the appli ation of the theory of probabilisti fun tions of a markov
pro ess to automati spee h re ognition. Bell System Te hni al Journal, The,
62(4):1035�1074, 1983.
[26℄ Christopher D Manning, Prabhakar Raghavan, and Hinri h S hütze. Introdu tion
to information retrieval, volume 1. Cambridge university press Cambridge, 2008.
[27℄ Adrià Martínez-Villaronga, Miguel A. del Agua, Jesús Andrés-Ferrer, and Alfons
Juan. Language model adaptation for video le tures trans ription. In Interna-
tional Conferen e on A ousti s, Spee h and Signal Pro essing (ICASSP), pages
8450�8454. IEEE, 2013.
[28℄ Cosmin Munteanu, Ron Bae ker, and Gerald Penn. Collaborative editing for im-
proved usefulness and usability of trans ript-enhan ed web asts. In Pro eedings
of the SIGCHI Conferen e on Human Fa tors in Computing Systems, CHI '08,
pages 373�382, New York, NY, USA, 2008. ACM.
[29℄ Cosmin Munteanu, Gerald Penn, and Xiaodan Zhu. Improving automati spee h
re ognition for le tures through transformation-based rules learned from mini-
mal data. In Pro eedings of the Joint Conferen e of the 47th Annual Meeting
of the ACL and the 4th International Joint Conferen e on Natural Language
Pro essing of the AFNLP: Volume 2-Volume 2, pages 764�772. Asso iation for
Computational Linguisti s, 2009.
[30℄ Alexandros Nanopoulos, Dimitrios Rafailidis, Panagiotis Symeonidis, and Yannis
Manolopoulos. Musi box: Personalized musi re ommendation based on ubi
analysis of so ial tags. Audio, Spee h, and Language Pro essing, IEEE Transa -
tions on, 18(2):407�412, 2010.
[31℄ Edward Rolando Núñez-Valdéz, Juan Manuel Cueva Lovelle, Os ar San-
juán Martínez, Vi ente Gar ía-Díaz, Patri ia Ordoñez de Pablos, and Carlos En-
rique Montenegro Marín. Impli it feedba k te hniques on re ommender systems
applied to ele troni books. Computers in Human Behavior, 28(4):1186�1193,
2012.
[32℄ PASCAL Harvest Programme. http://www.pas al-network.org/?q=node/19.
[33℄ poliMedia. The polimedia repository, 2007.
[34℄ Paul Resni k and Hal R Varian. Re ommender systems. Communi ations of the
ACM, 40(3):56�58, 1997.
[35℄ Alberto San his, Alfons Juan, and Enrique Vidal. A word-based naïve bayes
lassi�er for on�den e estimation in spee h re ognition. IEEE Transa tions on
Audio, Spee h, and Language Pro essing, 20(2):565�574, 2012.
[36℄ Ni olás Serrano, Adrià Giménez, Jorge Civera, Alberto San his, and Alfons Juan.
Intera tive handwriting re ognition with limited user e�ort. International Jour-
nal on Do ument Analysis and Re ognition, pages 1�13, 2013.
MLLP-DSIC-UPV 45
�memoria� � 2014/9/11 � 12:16 � page 46 � #52
✐ ✐
Bibliography
[37℄ Guy Shani and Asela Gunawardana. Evaluating re ommendation systems. In
Re ommender systems handbook, pages 257�297. Springer, 2011.
[38℄ Joan Albert Silvestre, Miguel del Agua, Gonçal Gar és, Guillem Gas ó, Adrià
Giménez-Pastor, Adrià Martínez, Alejandro Pérez González de Martos, Isaías
Sán hez, Ni olás Serrano Martínez-Santos, Ra hel Spen er, Juan Daniel Valor
Miró, Jesús Andrés-Ferrer, Jorge Civera, Alberto San hís, and Alfons Juan.
transle tures. In Pro eedings of IberSPEECH 2012, 2012.
[39℄ Joan Albert Silvestre-Cerdà, Alejandro Pérez, Manuel Jiménez, Carlos Turró,
Alfons Juan, and Jorge Civera. A system ar hite ture to support ost-e�e tive
trans ription and translation of large video le ture repositories. In IEEE In-
ternational Conferen e on Systems, Man, and Cyberneti s (SMC) 2013, pages
3994�3999, 2013.
[40℄ Isaías Sán hez-Cortina, Ni olás Serrano, Alberto San his, and Alfons Juan. A
prototype for intera tive spee h trans ription balan ing error and supervision
e�ort. In Pro . of the 2012 ACM Intl. Conf. on Intelligent User Interfa es,
pages 325�326, 2012.
[41℄ The transLe tures-UPV Team. The transLe tures Platform (TLP). http://
transle tures.eu/tlp.
[42℄ Benoît Thouin. The meteo system. Pra ti al experien e of ma hine translation,
39:44, 1982.
[43℄ Jussi Tohka. Sgn-2506: Introdu tion to pattern re ognition. 2006.
[44℄ The transLe tures UPV Team. The transle tures-upv toolkit (tlk), 2013.
[45℄ UPVLC and XEROX and JSI-K4A and RWTH and EML and DDS. transle -
tures. https://transle tures.eu/, 2012.
[46℄ Juan Daniel Valor Miró, Alejandro Pérez González de Martos, Jorge Civera,
and Alfons Juan. Integrating a state-of-the-art asr system into the open ast
matterhorn platform. In Advan es in Spee h and Language Te hnologies for
Iberian Languages, pages 237�246. Springer, 2012.
[47℄ Videole tures.NET: Ex hange ideas and share knowledge. http://www.
videole tures.net/.
[48℄ Mike Wald. Creating a essible edu ational multimedia through editing auto-
mati spee h re ognition aptioning in real time. Intera tive Te hnology and
Smart Edu ation, 3(2):131�141, 2006.
[49℄ Warren Weaver. Translation. Ma hine translation of languages, 14:15�23, 1955.
[50℄ Pinata Winoto and Ti�any Y Tang. The role of user mood in movie re ommen-
dations. Expert Systems with Appli ations, 37(8):6086�6092, 2010.
46 MLLP-DSIC-UPV
�memoria� � 2014/9/11 � 12:16 � page 47 � #53
✐ ✐
Bibliography
[51℄ Dekai Wu. A polynomial-time algorithm for statisti al ma hine translation. In
Pro eedings of the 34th annual meeting on Asso iation for Computational Lin-
guisti s, pages 152�158. Asso iation for Computational Linguisti s, 1996.
[52℄ Kenji Yamada and Kevin Knight. A syntax-based statisti al translation model.
In Pro eedings of the 39th Annual Meeting on Asso iation for Computational
Linguisti s, pages 523�530. Asso iation for Computational Linguisti s, 2001.
MLLP-DSIC-UPV 47
�memoria� � 2014/9/11 � 12:16 � page 48 � #54
✐ ✐
�memoria� � 2014/9/11 � 12:16 � page 49 � #55
✐ ✐
List of Figures
2.1 Opti al hara ter re ogniser for digits 6 and 9, based on the average
gray levels of the upper and the bottom half. . . . . . . . . . . . . . . 5
2.2 General overview of an automati spee h re ognition system. . . . . . 6
2.3 The poliMedia studio during a re ording session. . . . . . . . . . . . . 8
2.4 A poliMedia re ording example. . . . . . . . . . . . . . . . . . . . . . . 9
2.5 The VideoLe tures.NET web player. . . . . . . . . . . . . . . . . . . . 10
3.1 General overview of the transLe tures Platform integration into a lient
repository. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Overview of the transLe tures Ingest Servi e work�ow. . . . . . . . . . 16
3.3 transLe tures Player ommuni ations diagram. . . . . . . . . . . . . . 17
3.4 transLe tures Player standard post-editing mode (default side-by-side
layout). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.5 transLe tures Player intelligent intera tion interfa e (only editing box
is shown). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.6 transLe tures Player two-step supervision strategy overview. . . . . . . 20
3.7 RTF as a fun tion of WER for the di�erent intera tion strategies.
Phases 1, 2 and 3 orrespond to standard post-editing, intelligent in-
tera tion and two-step supervision, respe tively. . . . . . . . . . . . . . 21
4.1 Matterhorn ar hite tural draft. . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Media pa kage ingest pro ess. . . . . . . . . . . . . . . . . . . . . . . . 28
4.3 Matterhorn integration overview. . . . . . . . . . . . . . . . . . . . . . 29
4.4 transLe tures ustom work�ow overview. . . . . . . . . . . . . . . . . . 30
4.5 Paella Player showing available subtitles for a Matterhorn le ture. . . 31
5.1 System overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2 On the left, La Vie system re ommendations for the �Basi s of prob-
ability and statisti s� VideoLe tures.NET le ture. On the right, re -
ommendations made by VideoLe tures.NET's existing system for the
same le ture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
49
�memoria� � 2014/9/11 � 12:16 � page 50 � #56
✐ ✐
�memoria� � 2014/9/11 � 12:16 � page 51 � #57
✐ ✐
Index of Tables
2.1 Basi statisti s of the UPV's poliMedia repository (May 2014) . . . . 8
2.2 Statisti s of the Spanish poliMedia training, development and test par-
titions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Basi statisti s of the IJS' VideoLe tures.NET repository (May 2014) 10
2.4 Statisti s of the English VideoLe tures.NET training, development and
test partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1 Summary of results obtained for ea h intera tion strategy. . . . . . . 22
3.2 Satisfa tion survey results for ea h intera tion strategy. . . . . . . . . 22
51
�memoria� � 2014/9/11 � 12:16 � page 52 � #58
✐ ✐