page i T A UNIVERSIT POLITÈCNICA DE ALÈNCIA V MASTER'S

�memoria� � 2014/9/11 � 12:16 � page i � #1

✐ ✐

UNIVERSITAT POLITÈCNICA DE VALÈNCIA

DEPARTMENT OF COMPUTER SYSTEMS AND COMPUTATION

MASTER'S THESIS

Integration of Ma hine Learning and Language Pro essing

Te hnologies into Video Le ture Platforms.

Master in Arti� ial Intelligen e, Pattern Re ognition and Digital Imaging.

Alejandro Pérez González de Martos

Dire ted by:

Dr. Alfons Juan Cis ar

Dr. Jorge Civera Saiz

September 11, 2014

�memoria� � 2014/9/11 � 12:16 � page ii � #2

✐ ✐

�memoria� � 2014/9/11 � 12:16 � page iii � #3

✐ ✐

I would like to thank the UPV's Ma hine Learning and Language Pro essing

group and the Joºef Stefan Institute for making this thesis possible.

�memoria� � 2014/9/11 � 12:16 � page iv � #4

✐ ✐

�memoria� � 2014/9/11 � 12:16 � page v � #5

✐ ✐

Contents

1 Introdu tion 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 S ienti� and te hni al goals . . . . . . . . . . . . . . . . . . . . . . . 1

1.3 Do ument stru ture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Preliminaries 3

2.1 Ma hine Learning and Language Pro essing . . . . . . . . . . . . . . . 3

2.1.1 Pattern re ognition . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1.2 Automati spee h re ognition . . . . . . . . . . . . . . . . . . . 5

2.1.3 Statisti al ma hine translation . . . . . . . . . . . . . . . . . . 6

2.2 Video le ture repositories . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2.1 poliMedia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2.2 VideoLe tures.NET . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 The Open ast Matterhorn platform . . . . . . . . . . . . . . . . . . . . 11

3 The transLe tures Platform 13

3.1 Introdu tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.2 transLe tures Database . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.3 transLe tures Web Servi e . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.4 transLe tures Ingest Servi e . . . . . . . . . . . . . . . . . . . . . . . . 15

3.5 transLe tures Player . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.5.1 Standard post-editing . . . . . . . . . . . . . . . . . . . . . . . 17

3.5.2 Intelligent intera tion . . . . . . . . . . . . . . . . . . . . . . . 18

3.5.3 Two-step supervision . . . . . . . . . . . . . . . . . . . . . . . . 19

3.5.4 User evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.5.5 Con lusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4 Integration into Open ast Matterhorn 25

4.1 Introdu tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.2 System ar hite ture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.3 Platform analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.3.1 Media pa kage . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.3.2 Work�ow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.4 Integration pro ess . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.4.1 Generation of automati trans riptions and translations . . . . 29

4.4.2 Subtitle visualisation and supervision . . . . . . . . . . . . . . 30

v

�memoria� � 2014/9/11 � 12:16 � page vi � #6

✐ ✐

Contents

5 Using spee h trans riptions in le ture re ommender systems 33

5.1 Introdu tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.2 System overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.3 System updates and optimisation . . . . . . . . . . . . . . . . . . . . . 36

5.4 Integration into VideoLe tures.NET . . . . . . . . . . . . . . . . . . . 37

5.4.1 Topi and user modeling . . . . . . . . . . . . . . . . . . . . . . 37

5.4.2 Learning re ommendation feature weights . . . . . . . . . . . . 38

5.4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.5 Con lusions and future work . . . . . . . . . . . . . . . . . . . . . . . 39

6 Con lusions and Future Work 41

6.1 General Con lusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

6.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

6.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

vi MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 1 � #7

✐ ✐

Chapter 1

Introdu tion

1.1 Motivation

Online multimedia repositories are rapidly growing and be oming evermore onsoli-

dated as key knowledge assets. This is parti ularly true in the area of edu ation, where

large repositories of video le tures are being established on the ba k of in reasingly

available and standardised infrastru tures. Well-known examples of this in lude mas-

sive open online ourses (MOOCs) aggregators su h as Coursera [1℄, the Universitat

Politèni a de Valèn ia (UPV) poliMedia platform system for the ost-e�e tive ellab-

oration and publi ation of quality edu ational videos [33℄, and VideoLe tures.NET,

an award-winning free and open a ess edu ational video le tures repository [47℄.

As in other repositories, trans ription and translation of video le tures in platforms

su h as poliMedia and VideoLe tures.NET is needed to make them a essible to

speakers of di�erent languages and to people with disabilities [14, 48℄. Moreover,

trans riptions and translations of video le tures an also be helpful in the development

of multiple digital ontent management appli ations su h as le ture ategorisation,

summarisation, re ommendation, automated topi �nding or plagiarism dete tion.

However, most le tures in these platforms are neither trans ribed nor translated

be ause of the la k of e� ient solutions to obtain them at a reasonable level of a -

ura y. This was the motivation behind the transLe tures European proje t [38, 45℄,

whi h aimed at developing innovative, ost-e�e tive solutions for produ ing a urate

trans riptions and translations for large video repositories.

1.2 S ienti� and te hni al goals

This work aims to e�e tively integrate the tools developed in transLe tures [44, 41℄

into di�erent video le ture platforms. In parti ular, these tools will be integrated into

the aforementioned poliMedia and VideoLe tures.NET repositories. In the ase of

the latter, video le ture trans riptions will be used to develop a le ture re ommender

appli ation as part of the PASCAL Harvest Proje t La Vie [32℄. In addition, transLe -

1

�memoria� � 2014/9/11 � 12:16 � page 2 � #8

✐ ✐

Chapter 1. Introdu tion

tures solutions will also be integrated into Open ast Matterhorn [22℄, an open-sour e

platform to support the management of edu ational multimedia ontent adopted by

more than 40 di�erent organisations all around the world.

It should be noted that this thesis is framed within the resear h proje t transLe -

tures, and it is therefore part of a ollaborative work. That said, spe ial attention

will be paid to the author's individual ontributions.

1.3 Do ument stru ture

This thesis is organised as follows: Chapter 2 introdu es the ma hine learning and lan-

guage pro essing omputer s ien e �elds, fo using on the automati spee h re ognition

and statisti al ma hine translation tasks. In this hapter, the poliMedia and Vide-

oLe tures.NET video le ture repositories are also des ribed, as well as the Open ast

Matterhorn platform. Next, the set of tools for the integration of transLe tures te h-

nologies into video le ture repositories is presented in Chapter 3. Chapter 4 addresses

the integration of these tools into the Open ast Matterhorn platform. Chapter 5 de-

s ribes a le ture re ommender system that uses automati spee h trans riptions to

better represent le ture ontents at a semanti level. In parti ular, this system was

implemented for the VideoLe tures.NET repository. Finally, on luding remarks are

given in Chapter 6.

2 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 3 � #9

✐ ✐

Chapter 2

Preliminaries

2.1 Ma hine Learning and Language Pro essing

The Ma hine Learning �eld, evolved from the broad �eld of Arti� ial Intelligen e,

deals with building omputer systems that optimise performan e riteria using pre-

vious data or experien e. It involves the development of mathemati al algorithms

that dis over knowledge from spe i� data sets, and then �learn� from the data in

an iterative fashion that allows predi tions to be made. Today, Ma hine Learning

in ludes a variety of appli ations su h as natural language pro essing, sear h engine

fun tion, medi al diagnosis, omputer vision, and sto k market analysis.

As the reader might guess from the aforementioned appli ations, the range of

learning problems is learly large. To avoid reinventing the wheel for every new

Ma hine Learning appli ation, the resear h ommunity has tended to formalise the

problems in a set of fairly narrow prototypes. Ma hine Learning systems an be

lassi�ed along three parti ularly meaningful dimensions [10℄:

• Classi� ation on the basis of the underlying learning strategies used.

• Classi� ation on the basis of the knowledge representation.

• Classi� ation in terms of the appli ation domain of the performan e system for

whi h knowledge is a quired.

Ea h point in the spa e de�ned by the above dimensions orresponds to a parti -

ular learning strategy. In this thesis, we on entrate on the lassi� ation task, also

referred to as pattern re ognition, where one attempts to build algorithms apable of

automati ally onstru ting methods for distinguishing between di�erent exemplars,

based on their di�erentiating patterns. More spe i� ally, we will fo us on the par-

ti ular tasks of spee h re ognition and ma hine translation, whi h an be in luded

in the Natural Language Pro essing (NLP) �eld. The NLP handles the resear h for

e� ient methods to enhan e human-ma hine (and human-human) ommuni ation.

3

�memoria� � 2014/9/11 � 12:16 � page 4 � #10

✐ ✐

Chapter 2. Preliminaries

In the following se tions we give a short overview of the statisti al pattern re ogni-

tion approa h, and brie�y introdu e the automati spee h re ognition and ma hine

translation problems.

2.1.1 Pattern re ognition

The term pattern re ognition refers to the task of pla ing some obje t to a orre t lass

based on the measurements about the obje t [43℄. Con retely, pattern re ognition sys-

tems aim to re ognise their environment from data a quired by appropriate sensors

(opti al, a ousti , thermal, hemi al, et .). Some of the spe i� pattern re ognition

obje tives in lude spee h and handwriting re ognition, ma hine translation, �nger-

print/fa ial verti� ation, and disease identi� ation. In order to on isely des ribe the

di�erent pattern re ognition problems, probability theory has proven to be the most

adequate language. We will assume the reader has some prior knowledge on probabil-

ity theory, and therefore some basi de�nitions will be skipped. For more details and

a very gentle and detailed dis ussion, see the book of Introdu tion to probability [16℄.

A ording to the lassi� ation paradigm, to re ognise means to lassify into one

out of C given lasses or ategories with minimum probability of error. Many pattern

re ognition systems an be thought to onsist of �ve stages:

1. Sensing, whi h refers to the measurement or observation of the obje t to be

lassi�ed.

2. Pre-pro essing, whi h is the pro ess of �ltering the raw data for noise supres-

sion and other operations to improve the quality of the data.

3. Feature extra tion, whi h refers to the pro ess of extra ting relevant data

for the lassi� ation task. The result of the feature extra tion stage is alled a

feature ve tor.

4. Classi� ation, in whi h the feature ve tor extra ted from the obje t is lassi-

�ed into the most appropriate lass.

5. Post-pro essing. As di�erent a tions might also have di�erent osts asso iated

with them, the �nal task of a pattern re ognition system is to de ide upon an

a tion based on the lassi� ation results.

Figure 2.1 illustrates the di�erent stages for an opti al hara ter re ogniser.

While the sensing, pre-pro essing and feature extra tion tasks are very spe i� to

the parti ular problem, the lassi� ation task an be somehow generalised for typi al

pattern re ognition systems. Formally, a lassi�er an be de�ned as a fun tion:

c(x) = arg max

c∈Cgc(x) (2.1)

where, for ea h lass c, a dis riminant fun tion gc is used to measure the degree

to whi h the obje t x belongs to lass c. Obviously, x is lassi�ed into a lass to

whi h it belongs with maximum degree. When random variables are used for x and

4 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 5 � #11

✐ ✐

2.1. Ma hine Learning and Language Pro essing

Figure 2.1: Opti al hara ter re ogniser for digits 6 and 9, based on the

average gray levels of the upper and the bottom half.

c, the optimal lassi� ation te hnique is the so- alled Bayes lassi�er or de ision rule

(named after Thomas Bayes):

c∗(x) = arg max

c∈Cp(c|x) (2.2)

If the posterior p(c|x) probability is modeled dire tly, the lassi�er is said to follow

a dis riminative approa h. However, the lassi al approa h to pattern re ognition is

to write the Bayes lassi�er in terms of lass priors p(c) and lass- onditional densities

p(x|c) (generative approa h). By applying the Bayes' rule:

c∗(x) = arg max

c∈Cp(c|x) = arg max

c∈Cp(c)p(x|c) (2.3)

Labelled samples (x1, c1), ..., (xN , cN ) randomly drawn from p(x, c) are used to

estimate p(c) and p(x|c). Usually, for ea h lass c, its prior is estimated as:

p(c) ≈Nc

N[Nc =

∑

n:cn=c

1] (2.4)

and its onditional density p(x|c) is estimated from samples labelled with c.

2.1.2 Automati spee h re ognition

Automati spee h re ognition (ASR) an be de�ned as the independent, omputer-

driven trans ription of spoken language into readable text. In a nutshell, ASR is

te hnology that allows a omputer to identify the words that a person speaks into a

mi rophone and onvert it to written text. Having a ma hine to understand �uently

spoken spee h has driven spee h resear h for more than 50 years. Although ASR

te hnology is not yet at the point where ma hines understand all spee h, in any

a ousti environment, or by any person, it is used on a daily basis in a number of

appli ations and servi es.

Formally, the spee h re ognition problem an be des ribed as a fun tion that

de�nes a mapping from the a ousti eviden e to a single or a sequen e of words. Let

x = (x1, x2, ..., xt) represent the a ousti eviden e that is generated in time from a

given spee h signal. Let w = (w1, w2, ..., wn) denote a sequen e of n words, ea h

belonging to a �xed set of possible words, W . Let p(w|x) denote the probability

MLLP-DSIC-UPV 5

�memoria� � 2014/9/11 � 12:16 � page 6 � #12

✐ ✐


that the words w were spoken given that the a ousti eviden e x was observed. The

spee h re ogniser should sele t the sequen e of words w satisfying:

w = arg max

w∈Wp(w|x) (2.5)

However, sin e it is di� ult to dire tly model p(w|x), we an apply the Bayes' rule

as in Equation 2.3:

w = arg max

w∈Wp(w|x) = arg max

w∈Wp(w)p(x|w) (2.6)

where p(w) is known as language model and p(x|w) as a ousti model. Typi ally, n-gram models are used to estimate p(w) [21℄; while p(x|w) is estimated using Hidden

Markov Models (HMMs) and Gaussian Mixture Models (GMMs) [25℄. Figure 2.2

shows a general overview of an automati spee h re ognition system.

Figure 2.2: General overview of an automati spee h re ognition system.

2.1.3 Statisti al ma hine translation

Ma hine translation (MT) is the use of omputers to automate the translation of texts

or utteran es from one language into another, while the underlying meaning remains

the same. The history of MT goes ba k to the late 40s with the famous publi ation

of Weaver [49℄, widely re ognised as one of the pioneers of ma hine translation. The

70s and 80s saw the proliferation of rule-based ma hine translation systems su h

as Meteo [42℄, Systran [8℄ and METAL [6℄. The ontributions in statisti al ma hine

translation were minor until the early 90s, when the IBM group presented the Candide

system [7℄. Sin e then, the development of statisti al MT has experien ed a major

boost on a ount of the many resear h groups emerged in this area.

The goal of MT is the automati translation of a sour e senten e s into a target

senten e t, being

6 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 7 � #13

✐ ✐

2.2. Video le ture repositories

s = s1, ..., sj , ..., s|s| sj ∈ St = t1, ..., ti, ..., t|t| ti ∈ T

where sj and ti denote sour e and target words, and S and T , the sour e and target

vo abularies, respe tively.

In statisti al MT, this translation pro ess is usually presented as a de ision pro ess,

where given a sour e senten e s, a target senten e t is sele ted a ording to

t = arg max

t

p(t|s) (2.7)

where p(t|s) is the probability for t to be the a tual translation of s. Applying the

Bayes' rule we an reformulate Equation 2.7 as

t = arg max

t

p(t)p(s|t) (2.8)

where p(t) and p(s|t) orrespond to the language and translation models, respe tively.

Intuitively, the language model represents the well-formedness of the andidate trans-

lation t; while the translation model an be understood as a mapping fun tion from

target to sour e words.

Most of the state-of-the-art statisti al MT systems are based on bilingual pharses [9,

23℄. Another approa h whi h has be ome popular in re ent years is grounded on the

integration of synta ti knowledge into statisti al MT systems [51, 52, 15℄. The third

main approa h is the modelling of the translation pro ess as a �nite-state trans-

du er [12, 4℄.

2.2 Video le ture repositories

In this se tion we present the poliMedia and VideoLe tures.NET video le ture reposi-

tories. Both of them have been trans ribed and translated as part of the transLe tures

proje t. transLe tures' ASR and MT systems have been spe ially developed to ex-

ploit the parti ular hara teristi s that large video le ture repositories usually present.

More spe i� ally, the a ousti , language and translation models are adapted to the

parti ular speaker and topi of the le tures, thereby improving the trans riptions and

translations a ura y [27, 2, 3℄. This adaptation pro ess is referred to as massive

adaptation.

At this point, it is worth stressing that the integration of transLe tures' te hnolo-

gies into these repositories is one of the major ontributions of this thesis and will be

properly addressed in Chapter 3.

2.2.1 poliMedia

The poliMedia platform is a re ent, innovative servi e for the reation and distribution

of edu ational multimedia ontent at the UPV [33℄. It was initially designed to allow

UPV professors to generate and publish edu ational video le tures. It urrently serves

to more than 30 000 students and 2800 professors. The poliMedia platform has also

MLLP-DSIC-UPV 7

�memoria� � 2014/9/11 � 12:16 � page 8 � #14

✐ ✐


Tables 2.1: Basi statisti s of the UPV's poliMedia repository (May 2014)

Number of le tures 11 662

Total duration (in hours) 2422

Avg. le ture length (in minutes) 12.5

Total number of speakers 1443

Avg. number of le tures re orded per speaker 7

been exported to several both national and international universities. Table 2.1 shows

basi statisti s of the urrent UPV's poliMedia repository.

The produ tion pro ess of the poliMedia platform has been arefully designed to

generate high quality re ordings with a high produ tion rate. The poliMedia studio is

a 4 meter room with a white ba kground in whi h all the ne essary equipment for the

re ording is available to professors. Figure 2.3 shows the studio during a re ording

session.

Figure 2.3: The poliMedia studio during a re ording session.

Re ordings on poliMedia follow a parti ular standard format. As it an be seen

in Figure 2.4, the speaker appears on the bottom right orner of the s reen while

the omputer s reen (usually showing the time-aligned presentation slides) is shown

entered as the main element of the re ording. The videos are stored in AVC/H.264

format with di�erent settings. Video size is 1280x720 pixels and average bit rates

usually os illate between 500 and 900 kbps. Audio format is AAC/LC stereo (85%)

and mono (15%) with sampling frequen ies of 44 100 and 48 000 Hz. A ording to the

Nyquist�Shannon sampling theorem, this is more than su� ient to a urately over

the human spee h frequen y range (∼200-3500 Hz).

8 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 9 � #15

✐ ✐

2.2. Video le ture repositories

Figure 2.4: A poliMedia re ording example.

In order to automati ally trans ribe the poliMedia Spanish repository, automati

spee h re ognition systems need relevant sample data. To this end, 114 hours of

poliMedia Spanish video le tures were manually trans ribed. This data was prop-

erly partitioned into training, development and test sets in order to train, tune and

evaluate the ASR systems. Details regarding ea h set are shown in Table 2.2.

Tables 2.2: Statisti s of the Spanish poliMedia training, development and

test partitions

Training Development Test

Videos 655 26 23

Speakers 73 5 5

Hours 107h 3.8h 3.4h

Senten es 39.2K 1.3K 1.1K

Words 936K 35K 31K

Vo abulary 26.9K 4.7K 4.3K

2.2.2 VideoLe tures.NET

VideoLe tures.NET is a free and open a ess repository of video le tures mostly �lmed

by people from the Joºef Stefan Institute (IJS) at major onferen es, summer s hools,

workshops and s ien e promotional events from many �elds of S ien e. VideoLe -

tures.NET is being used as an edu ational platform for several EU funded resear h

proje ts, di�erent open edu ational resour es organizations su h as the OpenCourse-

Ware Consortium, MIT OpenCourseWare and Open Yale Courses as well as other

MLLP-DSIC-UPV 9

�memoria� � 2014/9/11 � 12:16 � page 10 � #16

✐ ✐


s ienti� institutions like CERN. In this way, VideoLe tures.NET olle ts high qual-

ity edu ational ontent whi h is re orded with high-quality, homogeneous standards.

All le tures, a ompanying do uments, information and links are systemati ally

sele ted and lassi�ed through the editorial pro ess taking into a ount the author's

omments. The video editing is done in-house and is never ensored, that is, le -

tures are never edited in a way whi h would allow ontent or viewer manipulation.

Most le tures are a ompanied with time-aligned presentation slides (as shown in

Figure 2.5).

Tables 2.3: Basi statisti s of the IJS' VideoLe tures.NET repository (May

2014)

Number of le tures 20 358

Total number of authors 12 167

Total duration (in hours) 12 681

Average le ture duration (in minutes) 37

Figure 2.5: The VideoLe tures.NET web player.

In order to train proper ASR systems to automati ally trans ribe the repository,

high-quality trans riptions were manually generated for an English subset of Vide-

oLe tures.NET. The resulting training, development and test sets are summarised in

Table 2.4.

10 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 11 � #17

✐ ✐

2.3. The Open ast Matterhorn platform

Tables 2.4: Statisti s of the English VideoLe tures.NET training, develop-

ment and test partitions

Training Development Test

Videos 20 4 4

Speakers 68 11 25

Hours 20h 3.2h 3.4h

Senten es 5K 1K 1.3K

Words 130K 28K 34K

Vo abulary 7K 3K 3K

2.3 The Open ast Matterhorn platform

Matterhorn is a free, open-sour e software to support the management of edu ational

audio and video ontent. Higher edu ation institutions use Matterhorn to produ e

le ture re ordings, manage existing video, serve designated distribution hannels, and

provide user interfa es to engage students with edu ational video. The proje t om-

bines individual solutions and fo uses the e�orts and experien e of di�erent univer-

sities in one shared open produ t. The reation of a uni�ed system with an open

development pro ess is proje ted to foster the ex hange of edu ational ontent also.

To this end, it in ludes the following features:

• Administrative tools for s heduling automated re ordings, manually uploading

�les, and managing videos, metadata, work�ows and pro essing fun tions.

• Integration with re ording devi es in the lassroom for managing automated

apture of audio, VGA, and multiple video sour es.

• Pro essing and en oding servi es that prepare and pa kage the media �les a -

ording to on�gurable spe i� ations, in luding ri h media features (slide seg-

mentation/indexation) for in-video sear h.

• Distribution to lo al streaming and download servers and on�guration apa-

bility for distribution to hannels su h as YouTube, iTunes or a ampus ourse

or ontent management system

• Ri h media user interfa e for learners to engage with ontent, in luding slide

preview, ontent-based sear h, heatmaps and additional features.

Multiple edu ational institutions have adopted Open ast Matterhorn as their

video Content Management System (CMS). The University of California Berkeley,

the University of Vigo or the University of Osnabrue k are some examples urrently

involved into the Open ast Community. The international su ess of Open ast Mat-

terhorn makes the platform a perfe t s enario for the integration of state-of-the-art

automati spee h re ognition and ma hine translation te hnologies. By integrating

transLe tures tools into the Matterhorn platform, edu ational institutions will be

MLLP-DSIC-UPV 11

�memoria� � 2014/9/11 � 12:16 � page 12 � #18

✐ ✐


able to break the language barrier and support people with hearing disabilities by

subtitling their videos in multiple languages.

12 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 13 � #19

✐ ✐

Chapter 3

The transLe tures Platform

3.1 Introdu tion

The transLe tures Platform [41℄ omprises a set of tools for integrating automati

trans ription and translation te hnologies into large edu ational video repositories.

Its main omponents are the transLe tures Database, Web Servi e, Ingest Servi e and

Player. These tools have been used to integrate transLe tures' ASR and MT systems

into the previously des ribed poliMedia and VideoLe tures.NET repositories [39℄.

Figure 3.1 gives a general overview of the transLe tures Platform integration into

poliMedia, VideoLe tures.NET or any other video le ture repository. In this �gure,

the prin ipal onne tions between the di�erent tools in a lient repository are illus-

trated. The transLe tures Database, Web Servi e and Ingest Servi e will be des ribed

in detail in Se tions 3.2, 3.3 and 3.4. The transLe tures Player, one of the author's

main ontributions, is given spe ial attention in Se tion 3.5.

3.2 transLe tures Database

The transLe tures Database is a PostgreSQL relational database whi h stores all the

data required for the Web Servi e and the Ingest Servi e. The main ategories of

data stored in the Database are as follows:

• Video le tures: All the information related to a spe i� le ture is stored in

the database, in luding language, duration, title, keywords and ategory. In

addition, an external ID, provided by the lient repository, is stored and used

for le ture identi� ation purposes in all transa tions performed between the

lient and the Player and Web Servi e.

• Speakers: Information about the le ture speaker an be used by the ASR

systems to adapt the underlying models to the unique hara teristi s of a given

speaker and, therefore, improve the quality of the resulting subtitles.

13

�memoria� � 2014/9/11 � 12:16 � page 14 � #20

✐ ✐

Chapter 3. The transLe tures Platform

Figure 3.1: General overview of the transLe tures Platform integration into

a lient repository.

• Subtitles: All subtitles generated via the Ingest Servi e are stored in the

database and retrieved by the Web Servi e.

• Uploads: Every time an /ingest operation is performed, a new upload entry is

stored in the database.

Media and subtitle �les are stored on the hard drive separately from the relational

database. We an distinguish between three di�erent kind of �les:

• Media: Video/audio �les of the le tures that already exist in the database,

plus related �les su h as slides, external do uments and thumbnails.

14 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 15 � #21

✐ ✐

3.3. transLe tures Web Servi e

• Trans riptions: Subtitle �les in DFXP format.

• Uploads: Uploaded Media Pa kage Files (MPF) and the �les ontained within

them.

3.3 transLe tures Web Servi e

The transLe tures Web Servi e is the interfa e for transfering information and data

between the poliMedia, VideoLe tures.NET or any other repository and the transLe -

tures Platform. It also enables the subtitle visualisation and editing apabilities of

the transLe tures Player. This web servi e is implemented as a Python Web Server

Gateway Interfa e (WSGI), and de�nes a set of HTTP interfa es related to subtitle

delivery and media upload:

• /ingest: POST request whi h allows the lient to upload audio/video �les and

other related material, su h as slides and other text resour es that an be used

to adapt the ASR system, together with other metadata in a Media Pa kage

File (MPF). This MPF will be later pro essed by the Ingest Servi e in order to

generate automati trans riptions and translations for the uploaded media.

• /status: GET request to he k the status of a video le ture uploaded via the

/ingest interfa e.

• /le turedata: GET request that returns basi metadata and �le lo ations for a

given video le ture.

• /langs: GET request that provides the lient with a list of subtitles and lan-

guages available for a given video le ture.

• /dfxp: GET request that returns the subtitles in DFXP format for a given

le ture and language.

• /mod: POST request that sends and ommits hanges made by a user when

editing a trans ription or translation. User orre tions are later used by ASR

and SMT systems in order to improve the underlying models.

3.4 transLe tures Ingest Servi e

The Ingest Servi e is the tool devoted to handling and properly pro essing the Media

Pa kage Files uploaded via the /ingest interfa e of the Web Servi e. It is implemented

as a Python module that should be exe uted periodi ally (typi ally every minute) to

he k for and pro ess new le ture uploads, and also to assess whether existing uploads

are being pro essed orre tly. The uploads table of the Database is used to keep the

status of every upload up-to-date. This information is also a essed by the Web

Servi e's /status interfa e.

MLLP-DSIC-UPV 15

�memoria� � 2014/9/11 � 12:16 � page 16 � #22

✐ ✐


As it was previously mentioned, MPFs are uploaded to the transLe tures Platform

via the Web Servi e's /ingest interfa e and stored in the Database. Then the Ingest

Servi e reads the uploads table of the Database and starts pro essing ea h MPF. An

upload will typi ally follow the following sequential steps (see Figure 3.2):

Figure 3.2: Overview of the transLe tures Ingest Servi e work�ow.

1. Media Pa kage Pro essing: In this step the MPF is unzipped and a series of

se urity, data integrity and data format he ks are performed on the unpa ked

data. If all he ks ome up lean, then the MPF data is redire ted to the next

step, whi h might be the 2nd, 3rd, 4th or 5th step listed here, depending on the

input data.

2. Trans ription Generation: In this step a trans ription �le in DFXP format is

generated from the main media �le (video, audio) using a suitable ASR Module.

3. Translation(s) Generation: In this step one or more translation �les in

DFXP format will be generated from the trans ription �le (whether automati-

ally generated in the previous step or provided in the MPF) using suitable MT

Modules.

16 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 17 � #23

✐ ✐

3.5. transLe tures Player

4. Media Conversion: In this step the main media �le is onverted into the

video formats required by the transLe tures Player in order to maximise browser

ompatibility.

5. Store Data: In this last step, the data ontained in the MPF and the data

automati ally generated by the Ingest Servi e are stored in the Database.

3.5 transLe tures Player

Massive adaptation an deliver substantial ontributions to the improvement of tran-

s riptions overall quality, but it is our belief that su� iently a urate results are

unlikely to be obtained through fully-automated approa hes alone. Instead, in or-

der to rea h the desired levels of a ura y, we must onsider user intera tion. For

that purpose, an HTML5 post-editing appli ation has been arefully designed to ex-

pedite the error supervision task [46℄, and thereby obtain subtitles of an a eptable

quality in ex hange for a minimum amount of user e�ort. Figure 3.3 illustrates the

ommuni ation between the Player and the other tools in the platform.

Figure 3.3: transLe tures Player ommuni ations diagram.

In this se tion we des ribe the three di�erent versions of the tool that were devel-

oped and evaluated by users of both poliMedia and VideoLe tures.NET platforms.

3.5.1 Standard post-editing

In the standard post-editing approa h, users are shown automati trans riptions and

translations while viewing the le ture. The standard layout is illustrated in Figure 3.4.

MLLP-DSIC-UPV 17

�memoria� � 2014/9/11 � 12:16 � page 18 � #24

✐ ✐


However, three alternative editing layouts are available for users to hoose from a -

ording to their personal preferen es. In any of them, users an freely supervise any

trans ription and translation segment. Additionally, a omplete set of key short uts

has been implemented to enhan e expert user apabilities.

Figure 3.4: transLe tures Player standard post-editing mode (default side-

by-side layout).

The user intera tion an be summarised as follows: the video le ture and the

orresponding trans ription or translation are played in syn hrony, allowing users to

read the trans ription while wat hing the video. When a user spots an error, they

an li k on the parti ular segment to pause the video and enter their hanges.

3.5.2 Intelligent intera tion

Standard user models for the trans ription of audiovisual obje ts, like the one pre-

sented above, are bat h-oriented. These models yield satisfying results when highly

ollaborative users are working on near-perfe t system output. To �nd out whether

we ould further improve supervision times, an alternative intera tion strategy, based

on on�den e measures [35℄, was introdu ed. These on�den e measures provide an

indi ator as to the probable orre tness of ea h word appearing in the automati

trans ription. Words with low on�den e values are likely to have been in orre tly

re ognised at the point of ASR. The idea is that by fo using supervision a tions only

on in orre tly-trans ribed words, we an optimise user intera tion to get the best

possible trans ription in ex hange for the least amount of e�ort [36, 40℄.

Figure 3.5 shows the intelligent intera tion interfa e. Here, users are asked to

supervise a subset of words presele ted by the CAT ( omputer-assisted translation)

system as low on�den e. This subset typi ally onstitutes between 10-20% of all

words trans ribed using the ASR system, though users are able to modify this range

at will to as low as 5% and as high as 40%, depending on the per eived a ura y of

the trans ription. Ea h word was played in the ontext of one word before and one

word after, in order to fa ilitate its omprehension and resulting orre tion. In the

�gure, low- on�den e words are shown in red and orre ted low- on�den e words in

18 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 19 � #25

✐ ✐


Figure 3.5: transLe tures Player intelligent intera tion interfa e (only editing

box is shown).

green. The text box that opens for ea h low- on�den e word an be expanded in

either dire tion in order to modify the surrounding text as required.

3.5.3 Two-step supervision

Intelligent intera tion an qui kly improve the trans ription a ura y with limited

user e�ort. Nevertheless, as the CAT system will unlikely �nd all possible trans rip-

tion errors, a small amount of in orre t words will remain. The system ould use

the intelligent intera tion inputs as onstraints in the ASR sear h spa e in order to

�nd the best trans ription hypothesis. Figure 3.6 gives an overview of the two-step

supervision strategy.

Basi ally, in the �rst step, low- on�den e words are presented to the user for

supervision in in reasing order of on�den e. These words keep being presented until

one of the three following onditions are met:

• The total supervision time rea hes double the duration of the video itself.

• No orre tions are entered for �ve onse utive segments.

• 20% of all words are supervised.

Then, supervision a tions are fed into the ASR system to generate a new and

MLLP-DSIC-UPV 19

�memoria� � 2014/9/11 � 12:16 � page 20 � #26

✐ ✐


Figure 3.6: transLe tures Player two-step supervision strategy overview.

improved trans ription. In the se ond step, users supervise the improved trans ription

from start to �nish, following the standard post-editing method.

3.5.4 User evaluations

User evaluations were arried out within the transLe tures proje t in order to evaluate

the models and tools in a real-life setting. UPV le turers, having already �lmed

material for poliMedia, were invited under 2012-2013 and 2013-2014 Do èn ia en

Xarxa alls to evaluate the omputer-assisted trans ription tools being developed in

transLe tures. Le turers signing up for this programme ommitted to supervising the

automati trans riptions of �ve of their poliMedia videos using the tools des ribed in

this hapter.

In order to evaluate the di�erent intera tion strategies, UPV le turers were asked

to rate various aspe ts on a Likert s ale from 1-10 (see Table 3.2). In addition, usage

statisti s were olle ted to obje tively evaluate the real user e�ort (see Figure 3.7).

The metri used to that end was the Real Time Fa tor (RTF), whi h is the ratio

(R) between the time spent by the user ellaborating the trans ription (P ) and the

duration of the video (T ).

R =P

T(3.1)

Figure 3.7 shows the RTF as fun tion of WER

1

for the di�erent intera tion strate-

gies. Table 3.1 shows detailed information regarding this �gure, whi h is dis ussed

below.

1

WER stands for word error rate, and it is based on the Levenshtein distan e string metri .

Informally, the WER between two senten es is the minimum number of word edits (i.e. insertions,

deletions or substitutions) required to hange one senten e into the other, divided by the original

senten e length.

20 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 21 � #27

✐ ✐


1

3

5

10

5 10 20 30 40

RTF

WERPhase-11

3

5

10

5 10 20 30 40

RTF

WERPhase-21

3

5

10

5 10 20 30 40

RTF

WERPhase-3

Figure 3.7: RTF as a fun tion of WER for the di�erent intera tion strategies.

Phases 1, 2 and 3 orrespond to standard post-editing, intelligent intera tion

and two-step supervision, respe tively.

1. Standard post-editing : The average WER of the initial automati trans riptions

was 16.9, and the average RTF for the post-editing pro ess was 5.4. When om-

pared to that re orded for trans ribing from s rat h (∼10 RTF for non-expert

users [29℄), we got a signi� ant de rease of about 50%. This way, le turers'

performan e be ame omparable to that of professional trans riptionists [17℄,

rather than that expe ted from non-expert trans riptionists [28℄.

2. Intelligent intera tion: The mean RTF was lowered to 2.2, but it must be noted

that the resulting trans riptions were not error free. The WER was redu ed

(in average) from 19.5 to 8.0 after the intelligent intera tion supervision. These

results underline the e�e tiveness of on�den e measures.

3. Two-step supervision: The mean RTF for the �rst step was as low as 1.4.

Although it only redu ed the average WER from 28.4 to 25.0, the ASR massive

adaptation and onstrained sear h post-pro ess was able to lower the WER of

the trans riptions to 18.7. Then, the improved trans riptions were supervised

following the standard post-editing strategy (RTF 3.9). The umulative RTF

was 5.3. Although this RTF is omparable to that obtained for the standard

post-editing strategy, it should be stressed that the initial WER was 28.4 (w.r.t.

16.9) in this ase.

3.5.5 Con lusions

A ording to Table 3.2, we an see that le turer preferen es do not rely on pure ef-

� ien y metri s. Although standard post-editing WER redu tion per RTF unit was

the lowest, UPV le turers rated it as the best hoi e. When they were questioned

MLLP-DSIC-UPV 21

�memoria� � 2014/9/11 � 12:16 � page 22 � #28

✐ ✐


Tables 3.1: Summary of results obtained for ea h intera tion strategy.

SP-E (1) II (2) T-SS (3)

Initial WER 16.9 19.5 28.4

Final WER 0.0 8.0 0.0

RTF 5.4 2.2 5.3

∇WER/RTF 3.1 5.2 5.4

Tables 3.2: Satisfa tion survey results for ea h intera tion strategy.

Question (summarised) Mean (SP-E) Mean (II) Mean (TS-S)

Intuitiveness

1- Easy to use. 9.4 7.8 7.5

2- Easy to learn. 9.4 8.1 8.6

3- Help information lear. 9.2 8.1 8.5

4- Organisation on s reen lear. 9.0 8.4 8.7

Grand Mean 9.3 8.1 8.3

Likeability

5- Comfortable. 8.7 6.5 7.3

6- Like the interfa e. 8.7 6.9 7.4

7- I am satis�ed. 9.0 6.9 7.4


Usability

8- E�e tively omplete work. 9.0 6.7 7.7

9- Qui ker than from s rat h. 8.6 6.6 7.4

10- Has everything I expe t. 9.0 5.6 7.1


Overall Mean 9.1 7.2 7.8

about the intelligent intera tion and the two-step supervision strategies, they an-

swered they did not want to leave any error in the trans riptions, and they did want

to avoid supervising the same le ture twi e.

However, their ratings might be in�uen ed by many fa tors. For instan e, it is

understandable that users rate better a tool for editing an automati trans ription

when the initial WER is 16.9 than when it is 28.4. Another fa tor that might stress

their obsession for leaving no errors in the trans riptions is the fa t that they were

editing their own le ture trans riptions.

What we an on lude is that the standard post-editing intera tion is more in-

tuitive and user-friendly than the intelligent intera tion and two-step supervision

alternatives, whi h might require of some expertise. Nevertheless, if a relatively small

22 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 23 � #29

✐ ✐


amount of time is going to be spent on supervising a trans ription, and the a ura y

stri tness is not tight, intelligent intera tion and two-step supervision have proven to

be signi� antly more e� ient hoi es.

MLLP-DSIC-UPV 23

�memoria� � 2014/9/11 � 12:16 � page 24 � #30

✐ ✐


24 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 25 � #31

✐ ✐

Chapter 4

Integration into Open ast

Matterhorn

4.1 Introdu tion

In this hapter we address the integration of transLe tures' ASR and MT solutions

into the Open ast Matterhorn platform, whi h was introdu ed in Se tion 2.3. The

goal is to integrate the tools des ribed in Chapter 3 into the di�erent Matterhorn

work�ow phases. This is dis ussed below in three se tions: Se tion 4.2 des ribes the

system ar hite ture of the Matterhorn platform. Next, a more detailed analysis of

the platform, fo using on relevant details for integration purposes, is presented in

Se tion 4.3. Finally, the strategies followed for the su essfull integration of the tools

are des ribed in Se tion 4.4.

4.2 System ar hite ture

The Matterhorn platform omprises four modules: le ture apture and administra-

tion, ingest and pro essing, distribution and engage tools. Figure 4.1 shows a diagram

of the Matterhorn ar hite ture whi h in ludes its main omponents and dependen ies

among them.

The members of the Open ast Community have sele ted Java as programming lan-

guage to reate the ne essary appli ations and a Servi e-Oriented Ar hite ture (SOA)

infrastru ture. The overall appli ation design is highly modularised and relies on the

OSGi (dynami module system for Java) te hnology. The OSGi servi e platform

provides a standardised, omponent-oriented omputing environment for ooperating

network servi es.

The di�erent phases of the Matterhorn work�ow are applied as follows:

1. S hedule/prepare and apture: The re ording pro ess begins with deter-

mining what is to be re orded, where and in what form. Campus data will be

25

�memoria� � 2014/9/11 � 12:16 � page 26 � #32

✐ ✐

Chapter 4. Integration into Open ast Matterhorn

Figure 4.1: Matterhorn ar hite tural draft.

integrated by the universities` IT departments. For this purpose, Matterhorn is

open to both the learning management systems and administrative data bases.

Syllabi, le ture and room timetables do not only provide the basi information

to answer the question raised above, but in an ideal ase, most of the meta-

data related to the re ording (le turer, title, summary, language, et .) as well.

Re ording devi es are then s heduled to automati ally re ord, e.g. in le ture

hall 0S03, every Tuesday from 10:00 to 12:00, the le ture on �XYZ� by Prof.

ABC.

2. Ingest and pro essing: At the end of the re ording the tra ks are sent to

an �inbox� to be pro essed. The inbox also serves as �ingest� for other video

obje ts to be integrated in the subsequent work�ows of Matterhorn. The dif-

ferent re ording tra ks (audio, ontent, video) are bundled to a media pa kage,

ontent-indexed (at �rst through opti al hara ter re ognition of the slide, later

ertainly through audio re ognition also) and if ne essary ar hived in the most

native formats. They are en oded a ording to the spe i�ed distribution pa-

rameters.

3. Distribution: The distribution demands of the universities are extremely het-

erogeneous: they go from simple integration of the videos in lo al WCMS or

blogs, to posting in password-prote ted LMS, to distribution via iTunes U or

26 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 27 � #33

✐ ✐

4.3. Platform analysis

YouTube. The distribution module uploads ingested ontent into di�erent dis-

tribution hannels a ording to parti ular institutions needs.

4. Engage tools: This module is losely linked to the distribute module sin e it

must also manage presentation and use of the obje ts. However, appli ations in

the Engage module make it possible to use omprehensive information (meta-

data, video and audio analysis, annotations, and use analysis) for intelligent

user interfa es. Likewise, support of learning management systems (LMS) or

virtual learning environments (VLE) is an important issue. To make sure that

the produ ed material will be used, Matterhorn video and audio player ompo-

nents are easily integrated in existing ourse websites, wikis, and blog systems.

So ial annotations, whi h an be used to improve sear h or navigation and feed-

ba k possibilities are �own ba k to the system like the user statisti s already

mentioned.

4.3 Platform analysis

In order to su essfully integrate the transLe tures' tools into the Open ast Matter-

horn platform, further analysis of the platform must be done. In this se tion, we

des ribe in detail di�erent aspe ts of the Matterhorn platform whi h are relevant for

this purpose.

4.3.1 Media pa kage

After audio/video material has been sent to the inbox, the media is bundled into

a media pa kage. A media pa kage is onsidered the business do ument within the

Matterhorn system. Besides the media obje ts, it in ludes further information from

media analysis as well as metadata. Every media pa kage therefore onsists of a

manifest and a list of pa kage elements that are referred to in the manifest. Pa kage

elements are:

• Media tra ks (audiovisual material)

• Metadata atalogs

• Atta hments (slides, pdf, text, annotations, et .)

The media pa kage generation pro ess is arried out automati ally when new

material is uploaded to the system (see Figure 4.2).

4.3.2 Work�ow

On e a media pa kage has been su essfully generated, it goes through a set of opera-

tions, like the en oding of the media �les in di�erent formats or the publi ation of the

media in distribution hannels. These operations are de�ned in a work�ow. A Mat-

terhorn work�ow is an ordered list of operations de�ned in a XML �le. There is no

MLLP-DSIC-UPV 27

�memoria� � 2014/9/11 � 12:16 � page 28 � #34

✐ ✐


Figure 4.2: Media pa kage ingest pro ess.

limit to the number of operations or their repetition in a given work�ow. A work�ow

operation an run autonomously or pause itself to allow for external, usually user,

intera tion. Although there are some prede�ned work�ows in the default Open ast

Matterhorn installation, ustom work�ows an be de�ned in order to perform the

desired operations to all media pa kages being ingested into the system.

Work�ow operations are usually alls to di�erent platform servi es. These opera-

tions are de�ned in the Matterhorn's Condu tor Servi e and are known as work�ow

operation handlers. Some basi work�ow operations in luded in Matterhorn are, for

instan e, the inspe t operation whi h extra ts te hni al information about the media

�les in luded in the media pa kage, the ompose operation for en oding video and au-

dio �les in di�erent formats, or the ar hive operation to store the �les in the system.

Custom work�ow operation handlers an be implemented in order to extend work�ow

fun tionalities.

28 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 29 � #35

✐ ✐

4.4. Integration pro ess

4.4 Integration pro ess

The integration of transLe tures' tools into the Matterhorn platform an be divided

in two parts. The �rst part (Se tion 4.4.1) will over the generation of automati

trans riptions and translations for the ingested media. The se ond part (Se tion 4.4.2)

will involve the visualisation and supervision of the automati ally generated aptions.

The overall integration pro ess is illustrated in Figure 4.3.

Figure 4.3: Matterhorn integration overview.

4.4.1 Generation of automati trans riptions and translations

Figure 4.2 showed how ingested media is sent to the Work�ow Servi e in order to be

pro essed by a parti ular work�ow. As mentioned in Se tion 4.3.2, ustom work�ows

an be de�ned to perform spe i� operations during the ingest pro ess. In order

to send the media �les to the transLe tures Platform, a ustom work�ow in luding a

ustom work�ow operation handler will be de�ned. In this way, audio tra ks extra ted

from the ingested media �les will be wrapped up in a transLe tures media pa kage

and sent to the transLe tures Web Servi e ingest interfa e. To summarise:

1. Media tra ks are ingested into the Matterhorn platform.

2. The transLe tures work�ow pro esses the media tra ks:

MLLP-DSIC-UPV 29

�memoria� � 2014/9/11 � 12:16 � page 30 � #36

✐ ✐


(a) Initial standard Matterhorn operations are performed (inspe t, prepare-av,

ompose, trim, et .)

(b) Audio tra ks are extra ted from media �les (FLAC format)

( ) A transLe tures media pa kage is generated, in luding:

• Extra ted audio �les

• Media language

• Matterhorn mediaPa kage ID

• Title

• Author

• Duration

• Atta hments (slides, do uments, et .)

(d) The media pa kage is sent to the transLe tures Web Servi e /ingest inter-

fa e.

(e) Final standard Matterhorn operations are performed (segmentpreviews,

publish, et .)

The des ribed pro ess is illustrated in Figure 4.4.

Figure 4.4: transLe tures ustom work�ow overview.

4.4.2 Subtitle visualisation and supervision

The generated subtitles must be made available to the users of the platform when

a essing the media. Furthermore, as it was meant in transLe tures, users should be

30 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 31 � #37

✐ ✐

4.4. Integration pro ess

able to supervise the automati trans riptions and translations in order to improve

its a ura y. To this purpose, an alternative to the Matterhorn Engage Player is

presented.

The Paella Player

1

, developed by the Área de Sistemas de Informa ión y Comu-

ni a iones (ASIC) at the UPV, is an HTML5 multistream video player apable of

playing multiple audio and video streams syn hronously and supporting a number of

user plugins. It is spe ially designed for le ture re ordings and fully ompatible with

the Matterhorn platform.

Paella Player also provides support for the transLe tures Platform tools. It is

apable of displaying transLe tures subtitles by dire tly a essing the transLe tures

Web Servi e. In addition, it also provides a link to supervise the trans riptions and

translations by using the transLe tures Player. Figure 4.5 illustrates the subtitle

visualisation support of Paella Player.

Figure 4.5: Paella Player showing available subtitles for a Matterhorn le ture.

1

Visit http://paellaplayer.upv.es for more information.

MLLP-DSIC-UPV 31

�memoria� � 2014/9/11 � 12:16 � page 32 � #38

✐ ✐


32 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 33 � #39

✐ ✐

Chapter 5

Using spee h trans riptions

in le ture re ommender

systems

5.1 Introdu tion

One problem reated by the su ess of video le ture repositories is the di� ulty fa ed

by individual users when hoosing the most suitable video for their learning needs

from among the vast numbers available on a given site. Users are often overwhelmed

by the amount of le tures available and may not have the time or knowledge to �nd the

most suitable videos for their learning requirements. Up until re ently, re ommender

systems have mainly been applied in areas su h as musi [24, 30℄, movies [11, 50℄,

books [31℄ and e- ommer e [13℄, leaving video le tures largely to one side. Only a

few ontributions to this parti ular area an be found in the literature, most of them

fo used on VideoLe tures.NET [5℄. However, none of them has explored the possibility

of using le ture trans riptions to better represent le ture ontents at a semanti level.

In this hapter we des ribe a ontent-based le ture re ommender system that uses

automati spee h trans riptions, alongside le ture slides and other relevant external

do uments, to generate semanti le ture and user models. In Se tion 5.2 we give

an overview of this system, fo using on the text extra tion and information retrieval

pro ess, topi and user modeling and the re ommendation pro ess. In Se tion 5.3 we

address the dynami update of the re ommender system and the required optimisa-

tions needed to maximise the s alability of the system. The integration of the system

presented in Se tions 5.2 and 5.3 into VideoLe tures.NET, arried out as part of the

PASCAL Harvest Proje t La Vie, is des ribed in detail in Se tion 5.4. Finally, we

lose with some on luding remarks, in Se tion 5.5.

33

�memoria� � 2014/9/11 � 12:16 � page 34 � #40

✐ ✐

Chapter 5. Using spee h trans riptions in le ture re ommender systems

Figure 5.1: System overview.

5.2 System overview

Fig. 5.1 gives an overview of the re ommender system. The left-hand side of the

�gure show the topi and user modeling pro edure, whi h an be seen as the training

pro ess of the re ommender system. To the right we see the re ommendation pro ess.

The aim of topi and user modeling is to obtain a simpli�ed representation of ea h

video le ture and user. The resulting representations are stored in a re ommender

database. This database will be exploited later in the re ommendation pro ess in

order to re ommend le tures to users.

As shown in Fig. 5.1, every le ture in the repository goes through the topi and

user modeling pro ess, whi h involves three steps. The �rst step is arried out by the

text extra tion module. This module omprises three submodules: ASR (Automati

Spee h Re ognition), WS (Web Sear h) and OCR (Opti al Chara ter Re ognition).

34 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 35 � #41

✐ ✐

5.2. System overview

As its name suggests, the ASR submodule generates an automati spee h trans rip-

tion of the video le ture. The WS submodule uses the le ture title to sear h for

related do uments and publi ations on the web. The OCR submodule extra ts text

from the le ture slides, where available. The se ond step takes the text retrieved by

the text extra tion module and omputes a bag-of-words representation. This bag-of-

words representation onsists of a simpli�ed text des ription ommonly used in nat-

ural language pro essing and information retrieval. More pre isely, the bag-of-words

representation of a given text is its ve tor of word ounts over a �xed vo abulary.

Finally, in the third step, le ture bags-of-words are used to represent the users of the

system. That is, ea h user is represented as the bag-of-words omputed over all the

le tures the user has ever seen.

When the topi and user modeling pro ess ends, the re ommender database is

ready for exploitation by the re ommender engine (see the right-hand side of Fig. 5.1).

This engine uses re ommendation features to al ulate a measure s of the suitability

of the re ommendation for every (u, v, r) triplet, where u refers to a parti ular user, v

is the le ture they are urrently viewing and r is a hypotheti al le ture re ommenda-

tion. In re ommender systems, this is usually referred to as the utility fun tion [34℄.

Spe i� ally, it indi ates how likely it is that a user u would want to wat h le ture r

after viewing le ture v. For instan e, this utility fun tion an be omputed as a linear

ombination of re ommendation features:

s(u, v, r) = w · x =

N∑

n=1

wn · xn (5.1)

where x is a feature ve tor omputed for the triplet (u, v, r), w is a feature weight

ve tor and N is the number of re ommendation features. In this work, the following

re ommendation features were onsidered:

1. Le ture popularity: number of visits to le ture r.

2. Content similarity: weighted dot produ t between the le ture bags-of-words v

and r [19℄.

3. Category similarity: number of ategories (from a prede�ned set) that v and r

have in ommon.

4. User ontent similarity: weighted dot produ t between the bags-of-words u and

r.

5. User ategory similarity: number of ategories in ommon between le ture r

and all the ategories of le tures the user u has wat hed in the past.

6. Co-visits: number of times le tures v and r have been seen in the same browsing

session.

7. User similarity: number of di�erent users that have seen both v and r.

MLLP-DSIC-UPV 35

�memoria� � 2014/9/11 � 12:16 � page 36 � #42

✐ ✐


Feature weights w an be learned by training di�erent statisti al lassi� ation

models, su h as support ve tor ma hines (SVMs), using positive and negative (u, v,

r) re ommendation samples.

The most suitable re ommendation r for a given u and v is omputed as follows:

r = arg max

r

s(u, v, r) (5.2)

However, in re ommender systems the most ommon pra ti e is to provide the

user the M re ommendations r that a hieve the highest utility values s, for instan e,

the �rst 10 le tures.

5.3 System updates and optimisation

Le ture repositories are rarely stati . They may grow to in lude new le tures, or

have outdated videos removed. Also, users' learning progress or intera tions with the

repository in�uen e the user models. The re ommender database must therefore be

onstantly updated in order to in lude the new le tures added to the repository and

update the user models. Furthermore, the addition of new le tures to the system

might lead to hanges to the bag-of-words (�xed) vo abulary. Any variation to this

vo abulary involves a omplete regeneration of the re ommender database. That said,

hanges to the vo abulary may not be signi� ant until a substantial per entage of new

le tures has been added to the repository.

Two di�erent update s enarios an be de�ned: the in orporation of new le tures

and updating the user models, on the one hand, and the rede�niton of the bag-of-words

vo abulary, in luding the regeneration of both the le ture and user bags-of-words, on

the other. We will refer to these s enarios as regular update and o asional update,

respe tively, after the di�erent periodi ities with whi h they are meant to be run.

• Regular update: The regular update is responsible for in luding the new le tures

added to the repository and updating the user models with the last user a tivity,

both in the re ommender database. As its name suggests, this pro ess is meant

to be run on a daily basis, depending on the frequen y with whi h new le tures

are added to the repository, sin e new le tures annot be re ommended until

they have been pro essed and in luded in the re ommender database.

• O asional update: As mentioned in Se tion 5.2, le ture bags-of-words are al-

ulated under a �xed vo abulary. Sin e there is no vo abulary restri tion on the

text extra tion pro ess, we need to modify the bag-of-words vo abulary as new

le tures are added to the system. The o asional update arries out the pro ess

of updating this vo abulary, whi h involves re al ulating both the le ture and

user bags-of-words.

In order to maximise the s alability of the system, while also redu ing the response

time of the re ommender, the features Content similarity, Category similarity, Co-

visits and User similarity des ribed in Se tion 5.2 are pre omputed for every possible

36 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 37 � #43

✐ ✐

5.4. Integration into VideoLe tures.NET

le ture pair and stored in the re ommender database. Then, during the re ommen-

dation pro ess, the re ommender engine loads the values of these features, leaving

the omputation of features User ontent similarity and User ategory similarity un-

til runtime. The de ision to al ulate the features User ontent similarity and User

ategory similarity at runtime was driven by the highly dynami nature of the user

models, in ontrast to the le ture models, whi h remain onstant until the bag-of-

words vo abulary is hanged.

5.4 Integration into VideoLe tures.NET

The proposed re ommendation system was implemented and integrated into the Vide-

oLe tures.NET repository during the PASCAL2 Harvest Proje t La Vie (Learning

Adapted Video Information Enhan er) [32℄. Said integration is dis ussed here a ross

three subse tions. First, in Se tion 5.4.1 we address the generation of le ture and

user models from video le ture trans riptions and other text resour es. Next, in

Se tion 5.4.2 we des ribe how re ommender feature weights were learned from data

olle ted from the existing VideoLe tures.NET re ommender system. Finally, we

present our evaluation of the system in Se tion 5.4.3.

5.4.1 Topi and user modeling

The �rst step in generating le ture and user models involved olle ting textual in-

formation from di�erent sour es. In parti ular, for VideoLe tures.NET, the text

extra tion module gathered textual information from the following sour es:

• transLe tures spee h trans riptions.

• Web sear h-based textual information from Wikipedia, DBLP and Google (ab-

stra ts and/or arti les).

• Text extra ted from le ture presentation slides (PPT, PDF or PNG using Op-

ti al Chara ter Re ognition (OCR)).

• VideoLe tures.NET internal database metadata.

Next, the text extra tion module output was used to generate le ture bags-of-

words for every le ture in the repository. These bags-of-words, as mentioned in Se -

tion 5.2, were al ulated under a �xed vo abulary that was obtained by applying a

threshold to the number of di�erent le tures in whi h a word must appear in order to

be in luded. By means of this threshold, vo abulary size is signi� antly redu ed, sin e

un ommon and/or very spe i� words are disregarded. On e de�ned, term weights

were al ulated using term frequen y-inverse do ument frequen y (td-idf), a statisti-

al weighting s heme ommonly used in information retrieval and text mining [26℄.

Spe i� ally, tf-idf weights are used to al ulate the features Content similarity and

User ontent similarity. Finally, the VideoLe tures.NET user a tivity log was parsed

in order to obtain values for the feature Co-visits for all possible le ture pairs, as

MLLP-DSIC-UPV 37

�memoria� � 2014/9/11 � 12:16 � page 38 � #44

✐ ✐


well as a list of le tures viewed per user. This list was used together with the le tures

bags-of-words to generate the users bags-of-words and ategories. These, in turn, were

used to al ulate User ontent similarity and User ategory similarity, respe tively,

as well as User similarity for all possible le ture pairs. In a �nal step, all this data

was stored in the re ommender database in order to be exploited by the re ommender

engine in the re ommendation pro ess.

5.4.2 Learning re ommendation feature weights

On e the data needed to ompute re ommendation feature values for every possible

(u, v, r) triplet in the repository was made available, the next step was to learn

the optimum feature weights w for the al ulation of the utility fun tion shown in

Equation 5.1. To this end, an SVM lassi�er was trained using data olle ted from

the existing VideoLe tures.NET naïve re ommender system (based only on keywords

extra ted from the le ture titles). Spe i� ally, every time a user li ked on any of

the 10 re ommendation links provided by this re ommender system, 1 positive and 9

negative samples were registered. SVM training was performed using the SVM

light

open-sour e software [20℄. The optimum feature weights were those that obtained the

minimum lassi� ation error over the re ommendation data.

5.4.3 Evaluation

Although there are many di�erent approa hes to the evaluation of re ommender sys-

tems [37, 18℄, it is di� ult to state any �rm on lusions regarding the quality of the

re ommendations made until they are deployed in a real-life setting. The La Vie

proje t therefore provided an ideal evaluation framework, being deployed a ross the

o� ial VideoLe tures.NET site. The strategy followed for the obje tive evaluation of

the La Vie re ommender was to ompare it against the existing VideoLe tures.NET

re ommender by means of a oin-�ipping approa h. Spe i� ally, this approa h on-

sisted of logging user li ks on re ommendation links provided by both systems on a

50/50 basis and omparing the total number of li ks re orded for ea h system.

The results did not show any signi� ant di�eren es between the two re ommenders

in terms of user behaviour. This an be explained by the fa t that user- li k ount

alone is not a legitimate point of omparison for re ommendation quality. For in-

stan e, random variables not taken into a ount might in�uen e how users respond

to the re ommendation links provided. As an alternative, we an ompare the rank

of the re ommendations li ked by users within ea h system. Spe i� ally, for ea h

re ommendation li ked by a user in either system, we an ompare how the same

re ommendation ranked in the other system. This might be a more appropriate mea-

sure for omparing the re ommendations in terms of suitability. However, additional

data need to be olle ted in order to arry out this alternative evaluation. This data

is urrently being olle ted and future evaluation results will be obtained following

this rank omparison approa h.

Despite the la k of obje tive eviden e for assessing the omparative performan e

of the La Vie system, subje tive evaluations indi ate that the proposed re ommender

38 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 39 � #45

✐ ✐

5.5. Con lusions and future work

system provides better re ommendations than the existing VideoLe tures.NET re -

ommender. Fig. 5.2 shows re ommendation examples from both systems for a new

user viewing a random VideoLe tures.NET le ture. Although re ommendation suit-

ability is a subje tive measure, La Vie re ommendations seem to be more appropriate

in terms of ontent similarity.

Figure 5.2: On the left, La Vie system re ommendations for the �Basi s

of probability and statisti s� VideoLe tures.NET le ture. On the right, re -

ommendations made by VideoLe tures.NET's existing system for the same

le ture.

5.5 Con lusions and future work

In this hapter we have shown how automati spee h trans riptions of video le -

tures an be exploited to develop a le ture re ommender system that an zoom in on

user interests at a semanti level. In addition, we have des ribed how the proposed

re ommender system has been parti ularly implemented for the VideoLe tures.NET

repository. This implementation was later deployed in the o� ial VideoLe tures.NET

site.

The proposed system ould also be extended for deployment a ross more gen-

eral video repositories, provided that video ontents are well represented in the data

obtained by the text extra tion module.

By way of future work we intend to evaluate the re ommender system using other

evaluation approa hes that measure the suitability of the re ommendations more a -

urately, su h as the aforementioned re ommendation rank omparison.

MLLP-DSIC-UPV 39

�memoria� � 2014/9/11 � 12:16 � page 40 � #46

✐ ✐


40 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 41 � #47

✐ ✐

Chapter 6

Con lusions and Future

Work

6.1 General Con lusions

In this thesis we have presented the integration of state-of-the-art ASR and MT sys-

tems, developed within the transLe tures proje t, into di�erent video le ture repos-

itories. In parti ular, we have shown how transLe tures' solutions to produ e ost-

e�e tive a urate trans riptions and translations have been integrated in poliMedia

and VideoLe tures.NET (Chapter 3), and also into the Open ast Matterhorn open-

sour e platform (Chapter 4). In addition, in Chapter 5 we have shown how the pro-

du ed automati trans riptions an be used to develop a ontent-based video le ture

re ommender system.

6.2 Contributions

The s ienti� publi ations related to this work are listed below.

• transLe tures. Silvestre-Cerdà, Joan Albert; Del Agua, Miguel; Gar és, Gonçal;

Gas ó, Guillem; Giménez-Pastor, Adrià; Martínez, Adrià; Pérez González de

Martos, Alejandro; Sán hez, Isaías; Martínez-Santos, Ni olás Serrano; Spen er,

Ra hel; Valor Miró, Juan Daniel; Andrés-Ferrer, Jesús; Civera, Jorge; San hís,

Alberto; Juan, Alfons. Pro eedings of IberSPEECH 2012, pp. 345-351, 2012.

• Integrating a state-of-the-art ASR system into the Open ast Matter-

horn platform. Juan Daniel Valor Miró, Alejandro Pérez González de Martos,

Jorge Civera and Alfons Juan. IberSPEECH 2012, vol. CCIS 328, Springer, p.

237�246, November 2012, Madrid (Spain).

• A System Ar hite ture to Support Cost-E�e tive Trans ription and

Translation of Large Video Le ture Repositories. Silvestre-Cerdà, Joan

41

�memoria� � 2014/9/11 � 12:16 � page 42 � #48

✐ ✐

Chapter 6. Con lusions and Future Work

Albert; Pérez, Alejandro; Jiménez, Manuel; Turró, Carlos; Juan, Alfons; Civera,

Jorge. IEEE International Conferen e on Systems, Man, and Cyberneti s (SMC)

2013 , pp. 3994-3999, 2013.

• Evaluating intelligent interfa es for post-editing automati trans rip-

tions of online video le tures. J.D. Valor Miró, R.N. Spen er, A. Pérez

González de Martos, G. Gar és Díaz-Munío, C. Turró, J. Civera and A. Juan.

Open Learning: The Journal of Open, Distan e and e-Learning. Vol. 29, Iss.

1, 2014.

• Evalua ión del pro eso de revisión de trans rip iones automáti as

para vídeos poliMedia. Valor Miró, Juan Daniel; Spen er, R N; de Martos,

Pérez González A; Díaz-Munío, Gar és G; Turró, C; Civera, J; Juan, A. Pro . of

I Jornadas de Innova ión Edu ativa y Do en ia en Red (IN-RED 2014), Valen ia

(Spain), 2014.

• Using Automati Spee h Trans riptions in Le ture Re ommendation

Systems. Pérez-González-de-Martos, Alejandro; Silvestre-Cerdà, Joan Albert;

Rihtar, Matjaº; Juan, Alfons; Civera, Jorge. IberSPEECH 2014. Submitted.

6.3 Future work

It is our intention to keep improving the di�erent CAT intera tion strategies presented

in Se tion 3.5 in order to redu e the required user e�ort up to a minimum. This might

involve to arry out additional user evaluations for gathering more user feedba k.

In addition, we will keep updating the transLe tures Platform tools presented in

Chapter 3 to meet the future requirements of poliMedia, VideoLe tures.NET and

other video le ture repositories.

42 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 43 � #49

✐ ✐

Bibliography

[1℄ Coursera. https://www. oursera.org.

[2℄ transLe tures: First report on massive adaptation. https://www.

transle tures.eu/wp- ontent/uploads/2013/05/transLe tures-D3.1.

1-18Nov2012.pdf.

[3℄ transLe tures: Se ond report on massive adaptation. http://www.

transle tures.eu/wp- ontent/uploads/2014/01/transLe tures-D3.1.

2-15Nov2013.pdf.

[4℄ Hiyan Alshawi, Shona Douglas, and Srinivas Bangalore. Learning dependen y

translation models as olle tions of �nite-state head transdu ers. Computational

Linguisti s, 26(1):45�60, 2000.

[5℄ Nino Antulov-Fantulin, Matko Bo²njak, Martin Znidar²i , and Miha et al. Gr ar.

E ml-pkdd 2011 dis overy hallenge overview. Dis overy Challenge, 2011.

[6℄ Win�eld S. Bennett and Jonathan Slo um. The lr ma hine translation system.

Comput. Linguist., 11(2-3):111�121, 1985.

[7℄ Adam L Berger, Peter F Brown, Stephen A Della Pietra, Vin ent J Della Pietra,

John R Gillett, John D La�erty, Robert L Mer er, Harry Printz, and Lubo² Ure².

The andide system for ma hine translation. In Pro eedings of the workshop

on Human Language Te hnology, pages 157�162. Asso iation for Computational

Linguisti s, 1994.

[8℄ Reinhard Billmeier. Zu den linguistis hen grundlagen von systran. Multilingua-

Journal of Cross-Cultural and Interlanguage Communi ation, 1(2):83�96, 1982.

[9℄ Chris Callison-Bur h, Cameron Fordy e, Philipp Koehn, Christof Monz, and Josh

S hroeder. Further meta-evaluation of ma hine translation. In Pro eedings of the

Third Workshop on Statisti al Ma hine Translation, pages 70�106. Asso iation

for Computational Linguisti s, 2008.

[10℄ Jaime G Carbonell, Ryszard S Mi halski, and Tom M Mit hell. An overview of

ma hine learning. In Ma hine learning, pages 3�23. Springer, 1983.

[11℄ Walter Carrer-Neto, María Luisa Hernández-Al araz, Rafael Valen ia-Gar ía,

and Fran is o Gar ía-Sán hez. So ial knowledge-based re ommender system. ap-

pli ation to the movies domain. Expert Systems with Appli ations, 39(12):10990�

11000, 2012.

43

�memoria� � 2014/9/11 � 12:16 � page 44 � #50

✐ ✐

Bibliography

[12℄ Fran is o Casa uberta and Enrique Vidal. Ma hine translation with inferred

sto hasti �nite-state transdu ers. Computational Linguisti s, 30(2):205�225,

2004.

[13℄ Jose Jesus Castro-S hez, Raul Miguel, David Vallejo, and Lorenzo Manuel López-

López. A highly adaptive re ommender system based on fuzzy logi for b2

e- ommer e portals. Expert Systems with Appli ations, 38(3):2441�2454, 2011.

[14℄ Atsushi Fujii, Katunobu Itou, and Tetsuya Ishikawa. Lodem: A system for on-

demand video le tures. Spee h Communi ation, 48(5):516 � 531, 2006.

[15℄ Jonathan Graehl and Kevin Knight. Training tree transdu ers. Te hni al report,

DTIC Do ument, 2004.

[16℄ Charles Miller Grinstead and James Laurie Snell. Introdu tion to probability.

Ameri an Mathemati al So ., 1998.

[17℄ Timothy J. Hazen. Automati alignment and error orre tion of human generated

trans ripts for long spee h re ordings. In Pro eedings of the Ninth International

Conferen e on Spoken Language Pro essing (Interspee h 2006 � ICSLP), 2006.

[18℄ Jonathan L Herlo ker, Joseph A Konstan, Loren G Terveen, and John T Riedl.

Evaluating ollaborative �ltering re ommender systems. ACM Transa tions on

Information Systems (TOIS), 22(1):5�53, 2004.

[19℄ Thorsten Joa hims. A probabilisti analysis of the ro hio algorithm with t�df

for text ategorization. Te hni al report, DTIC Do ument, 1996.

[20℄ Thorsten Joa hims. Svmlight: Support ve tor ma hine. SVM-Light Support

Ve tor Ma hine http://svmlight. joa hims. org/, University of Dortmund, 19(4),

1999.

[21℄ Slava Katz. Estimation of probabilities from sparse data for the language model

omponent of a spee h re ognizer. A ousti s, Spee h and Signal Pro essing, IEEE

Transa tions on, 35(3):400�401, 1987.

[22℄ Markus Ketterl, Olaf A S hulte, and Adam Ho hman. Open ast matterhorn:

A ommunity-driven open sour e software proje t for produ ing, managing,

and distributing a ademi video. Intera tive Te hnology and Smart Edu ation,

7(3):168�180, 2010.

[23℄ Philipp Koehn, Franz Josef O h, and Daniel Mar u. Statisti al phrase-based

translation. In Pro eedings of the 2003 Conferen e of the North Ameri an

Chapter of the Asso iation for Computational Linguisti s on Human Language

Te hnology-Volume 1, pages 48�54. Asso iation for Computational Linguisti s,

2003.

[24℄ Seok Kee Lee, Yoon Ho Cho, and Soung Hie Kim. Collaborative �ltering with

ordinal s ale-based impli it ratings for mobile musi re ommendations. Informa-

tion S ien es, 180(11):2142�2155, 2010.

44 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 45 � #51

✐ ✐

Bibliography

[25℄ Stephen E Levinson, Lawren e R Rabiner, and Man Mohan Sondhi. An intro-

du tion to the appli ation of the theory of probabilisti fun tions of a markov

pro ess to automati spee h re ognition. Bell System Te hni al Journal, The,

62(4):1035�1074, 1983.

[26℄ Christopher D Manning, Prabhakar Raghavan, and Hinri h S hütze. Introdu tion

to information retrieval, volume 1. Cambridge university press Cambridge, 2008.

[27℄ Adrià Martínez-Villaronga, Miguel A. del Agua, Jesús Andrés-Ferrer, and Alfons

Juan. Language model adaptation for video le tures trans ription. In Interna-

tional Conferen e on A ousti s, Spee h and Signal Pro essing (ICASSP), pages

8450�8454. IEEE, 2013.

[28℄ Cosmin Munteanu, Ron Bae ker, and Gerald Penn. Collaborative editing for im-

proved usefulness and usability of trans ript-enhan ed web asts. In Pro eedings

of the SIGCHI Conferen e on Human Fa tors in Computing Systems, CHI '08,

pages 373�382, New York, NY, USA, 2008. ACM.

[29℄ Cosmin Munteanu, Gerald Penn, and Xiaodan Zhu. Improving automati spee h

re ognition for le tures through transformation-based rules learned from mini-

mal data. In Pro eedings of the Joint Conferen e of the 47th Annual Meeting

of the ACL and the 4th International Joint Conferen e on Natural Language

Pro essing of the AFNLP: Volume 2-Volume 2, pages 764�772. Asso iation for

Computational Linguisti s, 2009.

[30℄ Alexandros Nanopoulos, Dimitrios Rafailidis, Panagiotis Symeonidis, and Yannis

Manolopoulos. Musi box: Personalized musi re ommendation based on ubi

analysis of so ial tags. Audio, Spee h, and Language Pro essing, IEEE Transa -

tions on, 18(2):407�412, 2010.

[31℄ Edward Rolando Núñez-Valdéz, Juan Manuel Cueva Lovelle, Os ar San-

juán Martínez, Vi ente Gar ía-Díaz, Patri ia Ordoñez de Pablos, and Carlos En-

rique Montenegro Marín. Impli it feedba k te hniques on re ommender systems

applied to ele troni books. Computers in Human Behavior, 28(4):1186�1193,

2012.

[32℄ PASCAL Harvest Programme. http://www.pas al-network.org/?q=node/19.

[33℄ poliMedia. The polimedia repository, 2007.

[34℄ Paul Resni k and Hal R Varian. Re ommender systems. Communi ations of the

ACM, 40(3):56�58, 1997.

[35℄ Alberto San his, Alfons Juan, and Enrique Vidal. A word-based naïve bayes

lassi�er for on�den e estimation in spee h re ognition. IEEE Transa tions on

Audio, Spee h, and Language Pro essing, 20(2):565�574, 2012.

[36℄ Ni olás Serrano, Adrià Giménez, Jorge Civera, Alberto San his, and Alfons Juan.

Intera tive handwriting re ognition with limited user e�ort. International Jour-

nal on Do ument Analysis and Re ognition, pages 1�13, 2013.

MLLP-DSIC-UPV 45

�memoria� � 2014/9/11 � 12:16 � page 46 � #52

✐ ✐

Bibliography

[37℄ Guy Shani and Asela Gunawardana. Evaluating re ommendation systems. In

Re ommender systems handbook, pages 257�297. Springer, 2011.

[38℄ Joan Albert Silvestre, Miguel del Agua, Gonçal Gar és, Guillem Gas ó, Adrià

Giménez-Pastor, Adrià Martínez, Alejandro Pérez González de Martos, Isaías

Sán hez, Ni olás Serrano Martínez-Santos, Ra hel Spen er, Juan Daniel Valor

Miró, Jesús Andrés-Ferrer, Jorge Civera, Alberto San hís, and Alfons Juan.

transle tures. In Pro eedings of IberSPEECH 2012, 2012.

[39℄ Joan Albert Silvestre-Cerdà, Alejandro Pérez, Manuel Jiménez, Carlos Turró,

Alfons Juan, and Jorge Civera. A system ar hite ture to support ost-e�e tive

trans ription and translation of large video le ture repositories. In IEEE In-

ternational Conferen e on Systems, Man, and Cyberneti s (SMC) 2013, pages

3994�3999, 2013.

[40℄ Isaías Sán hez-Cortina, Ni olás Serrano, Alberto San his, and Alfons Juan. A

prototype for intera tive spee h trans ription balan ing error and supervision

e�ort. In Pro . of the 2012 ACM Intl. Conf. on Intelligent User Interfa es,

pages 325�326, 2012.

[41℄ The transLe tures-UPV Team. The transLe tures Platform (TLP). http://

transle tures.eu/tlp.

[42℄ Benoît Thouin. The meteo system. Pra ti al experien e of ma hine translation,

39:44, 1982.

[43℄ Jussi Tohka. Sgn-2506: Introdu tion to pattern re ognition. 2006.

[44℄ The transLe tures UPV Team. The transle tures-upv toolkit (tlk), 2013.

[45℄ UPVLC and XEROX and JSI-K4A and RWTH and EML and DDS. transle -

tures. https://transle tures.eu/, 2012.

[46℄ Juan Daniel Valor Miró, Alejandro Pérez González de Martos, Jorge Civera,

and Alfons Juan. Integrating a state-of-the-art asr system into the open ast

matterhorn platform. In Advan es in Spee h and Language Te hnologies for

Iberian Languages, pages 237�246. Springer, 2012.

[47℄ Videole tures.NET: Ex hange ideas and share knowledge. http://www.

videole tures.net/.

[48℄ Mike Wald. Creating a essible edu ational multimedia through editing auto-

mati spee h re ognition aptioning in real time. Intera tive Te hnology and

Smart Edu ation, 3(2):131�141, 2006.

[49℄ Warren Weaver. Translation. Ma hine translation of languages, 14:15�23, 1955.

[50℄ Pinata Winoto and Ti�any Y Tang. The role of user mood in movie re ommen-

dations. Expert Systems with Appli ations, 37(8):6086�6092, 2010.

46 MLLP-DSIC-UPV

�memoria� � 2014/9/11 � 12:16 � page 47 � #53

✐ ✐

Bibliography

[51℄ Dekai Wu. A polynomial-time algorithm for statisti al ma hine translation. In

Pro eedings of the 34th annual meeting on Asso iation for Computational Lin-

guisti s, pages 152�158. Asso iation for Computational Linguisti s, 1996.

[52℄ Kenji Yamada and Kevin Knight. A syntax-based statisti al translation model.

In Pro eedings of the 39th Annual Meeting on Asso iation for Computational

Linguisti s, pages 523�530. Asso iation for Computational Linguisti s, 2001.

MLLP-DSIC-UPV 47

�memoria� � 2014/9/11 � 12:16 � page 48 � #54

✐ ✐

�memoria� � 2014/9/11 � 12:16 � page 49 � #55

✐ ✐

List of Figures

2.1 Opti al hara ter re ogniser for digits 6 and 9, based on the average

gray levels of the upper and the bottom half. . . . . . . . . . . . . . . 5

2.2 General overview of an automati spee h re ognition system. . . . . . 6

2.3 The poliMedia studio during a re ording session. . . . . . . . . . . . . 8

2.4 A poliMedia re ording example. . . . . . . . . . . . . . . . . . . . . . . 9

2.5 The VideoLe tures.NET web player. . . . . . . . . . . . . . . . . . . . 10

3.1 General overview of the transLe tures Platform integration into a lient

repository. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.2 Overview of the transLe tures Ingest Servi e work�ow. . . . . . . . . . 16

3.3 transLe tures Player ommuni ations diagram. . . . . . . . . . . . . . 17

3.4 transLe tures Player standard post-editing mode (default side-by-side

layout). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.5 transLe tures Player intelligent intera tion interfa e (only editing box

is shown). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.6 transLe tures Player two-step supervision strategy overview. . . . . . . 20

3.7 RTF as a fun tion of WER for the di�erent intera tion strategies.

Phases 1, 2 and 3 orrespond to standard post-editing, intelligent in-

tera tion and two-step supervision, respe tively. . . . . . . . . . . . . . 21

4.1 Matterhorn ar hite tural draft. . . . . . . . . . . . . . . . . . . . . . . 26

4.2 Media pa kage ingest pro ess. . . . . . . . . . . . . . . . . . . . . . . . 28

4.3 Matterhorn integration overview. . . . . . . . . . . . . . . . . . . . . . 29

4.4 transLe tures ustom work�ow overview. . . . . . . . . . . . . . . . . . 30

4.5 Paella Player showing available subtitles for a Matterhorn le ture. . . 31

5.1 System overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.2 On the left, La Vie system re ommendations for the �Basi s of prob-

ability and statisti s� VideoLe tures.NET le ture. On the right, re -

ommendations made by VideoLe tures.NET's existing system for the

same le ture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

49

�memoria� � 2014/9/11 � 12:16 � page 50 � #56

✐ ✐

�memoria� � 2014/9/11 � 12:16 � page 51 � #57

✐ ✐

Index of Tables

2.1 Basi statisti s of the UPV's poliMedia repository (May 2014) . . . . 8

2.2 Statisti s of the Spanish poliMedia training, development and test par-

titions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 Basi statisti s of the IJS' VideoLe tures.NET repository (May 2014) 10

2.4 Statisti s of the English VideoLe tures.NET training, development and

test partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.1 Summary of results obtained for ea h intera tion strategy. . . . . . . 22

3.2 Satisfa tion survey results for ea h intera tion strategy. . . . . . . . . 22

51

�memoria� � 2014/9/11 � 12:16 � page 52 � #58

✐ ✐

page i T A UNIVERSIT POLITÈCNICA DE ALÈNCIA V MASTER'S

Documents

Transcript of page i T A UNIVERSIT POLITÈCNICA DE ALÈNCIA V MASTER'S