Associating video frames with texthdw/ACM-SIGIR-2003.pdftext and video frames, a query based on text...
Transcript of Associating video frames with texthdw/ACM-SIGIR-2003.pdftext and video frames, a query based on text...
-
Associating video frames with text
Pınar Duygulu and Howard D. WactlarInformedia Project
School of Computer ScienceCarnegie Mellon University
Pittsburgh, PA, USA
[email protected], [email protected]
ABSTRACT
In this study, integration of visual and textual data is pro-posed to solve the correspondence problem between videoframes and associated text in order to annotate video frameswith more reliable labels and descriptions. Visual featuresextracted from video frames are linked to text that is ob-tained from the audio transcripts using joint statistics. Theresults show that using this approach it is possible to havebetter annotations for video frames that can be later usedto improve the performance of text based queries. The pro-posed approach will be integrated into the Informedia Digi-tal Video Library Project at Carnegie Mellon University.
1. INTRODUCTION
Video is a rich source of data where visual and textual in-formation occur together. A system that combines thesetwo types of data is more powerful than a system address-ing only one of them, since text and visual features maybe ambiguous if treated in isolation. In order to make bet-ter use of the available data, visual features (such as color,texture, shape, and motion features) and semantics derivedfrom text can be integrated. The text provides high-level se-mantics for a video sequence that cannot be obtained usingcurrent computer vision techniques, and visual propertiessupply rich information that can be difficult to express in atextual format.
When text and visual properties are integrated, several ap-plications become possible including better search and brows-ing. There are many collections where images/videos areannotated with some descriptive text (e.g., Corel data set,some museum collections, news photographs on the web withcaptions, etc.). Manual annotation of these collections issubjective and requires a huge amount of effort [14]. Someapproaches are proposed to make this process easier andautomatic [2, 4, 5, 12, 13].
Figure 1: An example query result from the Infor-media system [1], where the query word is president.A story is segmented into shots where each shot isrepresented with a keyframe. Transcripts that areextracted from the audio narrative are aligned withthe shots. The arrows represent the shots where thequery word occurs in the transcript aligned withthe shot. There are correspondence problems be-tween the frames and text. The word president ismentioned when the anchor and/or reporter weretalking; but the president appears when his name isnot mentioned. If a single frame is required as theresult of the query, the system will produce the an-chorperson/reporter frames which are not desirablein many cases. It is better to produce a frame wherethe president appears.
In these manually annotated image collections, although itis known that the annotation words are associated with theimage, the correspondence between the words and the imageregions are unknown. Some methods are proposed to findthe correspondences by modeling the joint statistics of wordsand image regions [2, 10, 11, 15, 17].
Similar correspondence problems occur in video data. Thereare sets of video frames and transcripts extracted from theaudio speech narrative, but the semantic correspondencesbetween them are not fixed because they may not be co-occurring in time. If there is no direct association between
caeText Box26th Annual International ACM SIGIR Conference, Toronto, Canada, July 28-August 1, 2003.
-
text and video frames, a query based on text may produceincorrect visual results.
For example, in most news videos (see Figure 1) the an-chorperson talks about an event, place or person, but theimages relating to the event, place, or person appear laterin the video. Therefore, a query based only on text relatedto a person, place, or event, and showing the frames at thematching narrative, will yield incorrect frames of the an-chorperson as the result.
The goal of this study is to determine the correspondencesbetween the video frames and associated text in order toannotate the video frames with more reliable labels and de-scriptions. This enables a textual query to return more ac-curate semantically corresponding images, and enables animage-based query or response to provide more meaningfuldescriptors.
In Section 2, our approach to solve the correspondence prob-lem between image regions and text will be explained. Thestrategy to apply a similar approach on video data will bedescribed in Section 3. In Sections 4 and 5, the procedureand the experimental results will be presented for two differ-ent data sets : TREC 2001 and Chinese cultural data setsrespectively. Section 6 will discuss the results and presentfuture directions.
2. MULTIMEDIA TRANSLATION
The problem of finding the correspondences can be consid-ered as the translation of visual features to words, similar tothe translation of text from one language to another. In thatsense, there is an analogy between learning a lexicon for ma-chine translation and learning a correspondence model forassociating words with image regions.
Learning a lexicon from data is a standard problem in ma-chine translation literature [7, 16]. Typically, lexicons arelearned from a type of data set known as an aligned corpora.Assuming an unknown one-to-one correspondence betweenwords, coming up with a joint probability distribution link-ing words in two languages is a missing data problem [7]and can be dealt by application of the EM (ExpectationMaximization) algorithm [9].
Data sets consisting of annotated images are similar to alignedcorpora. There is a set of images, each consisting of a num-ber of regions and a set of associated words. Each imageis segmented into regions and from each region a set of fea-tures, including color, texture, shape, position and size, areextracted. In order to exploit the analogy with machinetranslation, we vector-quantize the set of features represent-ing an image region using k-means. Each region then gets asingle label (blob token).
The problem is then to construct a probability table thatlinks the blob tokens with word tokens. The probability ta-ble is initialized to the co-occurrences of blobs and words.The final translation probability table is constructed usingthe EM algorithm which iterates between the two steps:
(i) use an estimate of the probability table to predict corre-spondences;(ii) then use the correspondences to refine the estimate ofthe probability table.
Once learned, the probability table is used to predict wordscorresponding to particular image regions (region naming),or words associated with whole images (auto-annotation)(see [10, 11] for the details).
3. CORRESPONDENCES ON VIDEO
The Informedia Digital Video Library Project at CarnegieMellon University [1] combines speech, image and naturallanguage understanding to automatically transcribe, seg-ment and index video for intelligent search and image re-trieval. The current library consists of terabytes of data(broadcast news captured over the last years, documen-taries produced for public television and government agen-cies, classroom lectures, and other video genres) with auto-matically extracted metadata and indices.
Broadcast news is very challenging data, but due to its na-ture it is mostly based on people and requires a focus onperson detection/recognition. In this study, we use subsetsof Chinese culture and TREC 2001 data sets which are rel-atively simpler.
The data consists of video frames and the associated tran-script extracted from the audio. The frames and transcriptsare associated on the shot-basis. Each shot is representedwith a single keyframe. Transcripts are aligned with theshots by determining when each word was spoken throughan automatic alignment process using Sphinx-III speech rec-ognizer.
Each keyframe is segmented into regions, and features areextracted from each region. In this study, fixed sized gridsare used as the regions because of their simplicity to applyon a large volume of data. A feature vector of size 46 isformed to represent each region. Position is represented bythe coordinates (x, y) of the region’s center of gravity inrelation to the image dimensions. Color is redundantly rep-resented using the mean and variance of the HSV and RGBcolor spaces. Texture is represented by using the mean andvariance of 16 filter responses. We used four difference ofGaussian filters with different sigmas and twelve orientedfilters, aligned in 30 degree increments.
The vocabulary consists of only nouns which are extractedby applying Brill’s tagger [6] to the transcript. Due to theerrorful transcripts obtained from the speech recognition,many noisy words remain in the vocabulary.
4. TREC 2001 DATA
The TREC 2001 data is chosen, because it is a truthed andextensively analyzed data set. 2232 images are used fromthis data set. All the nouns (1938 words) are taken as thevocabulary words.
-
plane(2) space(5),telescope(10) space(1),world(6)
space(6),astronaut(7) plane(2) robot(5)
Figure 2: Example annotation results for TREC2001 data. The images are annotated with the top10 words with the highest prediction probabilitygiven the regions of the image. For each image someof the annotation words and their ranks are shown.
Different from a still image with some annotated keywords,in video, text is not usually associated with a single frame.The words for the surrounding frames can also be associatedwith the current frame. For the experiments on TREC data,the text for the surrounding frames is also considered by set-ting the window size arbitrarily to five (i.e., for each frame,text associated with the five preceding and five followingframes are taken together with the text for that frame).
Each image is divided into 7 x 7 blocks, resulting in 49 re-gions for each image. Then, features are extracted from eachof these blocks and feature space is vector quantized usingk-means where k is set to 500. An initial translation prob-ability table is constructed by taking the co-occurrences of500 blob tokens and 1938 word tokens. The final translationprobability table is obtained by applying EM. In order to an-notate the images, the word posterior probabilities for theimage blobs are summed into a single word posterior, andthe highest probability words are taken as the annotationwords.
In Figure 2, some of the annotation words predicted in thetop ten words with the highest probability for an image areshown. As can be seen, the predicted words are the cor-rect words that describe what is in the image. Figure 3shows the query results for Statue of Liberty using thecurrent Informedia system. In some cases, neither statuenor liberty matches with the frames where the Statue ofLiberty appears, and in some of the frames where the Statuedoesnt appear, the text mentions the name. With the pro-posed approach, all the frames with the Statue of Libertyare annotated with statue and liberty in the top threewords as shown in Figure 4. Also, for the frames that donot include the statue, neither statue nor liberty is pre-dicted. Therefore, when a query is performed on statue ofliberty, with the proposed approach the results will giveonly the correct matches where the Statue appears.
Figure 3: Query results for statue of liberty usingthe current Informedia system. Each shot is repre-sented with a single keyframe and the transcript ex-tracted from the audio is aligned with the shots. Thearrows indicate the shots where the words statue orliberty occur in the corresponding audio/transcript.For example, the first and the second purple arrowsindicate the word statue for the third and fourthimages on the first row respectively, the first red ar-row indicates the word liberty for the first imageon the second row, and so on. Although, the Statueof Liberty appears in the second and fourth imageson the second row, none of the words are matched.However, the third image on the first row, and thefirst image on the second row is matched, althoughthey are not the images of Statue of Liberty. Withthe proposed approach, the results are corrected bypredicting both statue and liberty in the top threewords for all the images where the Statue appears,and by not predicting any of the words for the otherswhere the statue doesnt appear.
statue(1) liberty(2) statue(1) liberty(3)
statue(1) liberty(2) statue(1) liberty(3)
Figure 4: For the shots that Statue of Liberty ap-pears, rank of the statue and liberty words in thepredictions.
-
5. CHINESE CULTURE DATA
The other data set used for the experiments consists of Chi-nese cultural documentaries [8, 18]. Figure 5 and Figure 6show some examples from this data set. Similar correspon-dence problems occur in this data set. For example, in thestory about the Great Wall, some frames show the views ofthe Wall, while the others show the reporter or the otherrelated people/objects; therefore the word wall may notmatch with the view of the Wall.
For the experiments, the movies that includes interviewswith people are removed due to their dependence on face de-tection/recognition, and the remaining 3785 shots are used.Each shot is represented with a single keyframe which is di-vided into 5 x 5 blocks. Then, features are extracted fromeach of these blocks and feature space is vector quantizedusing k-means where k is set to 1000. Therefore there are1000 blob tokens.
In order to construct the vocabulary, nouns extracted fromBrill’s tagger are used. Figure 7 shows that the frequency ofthe words lies in a large range. The number of words in thevocabulary is 2597, but only about 20% of the words are ina meaningful range. We eliminate the words that occur lessthan 5 times, or more than 250 times. This pruning processreduces the vocabulary to 626 words. Shots that are notassociated with any word after this pruning process are alsoeliminated, resulting in 2785 remaining shots.
Figure 5: Results of a query for great wall on Chi-nese culture data set using the current Informediasystem. The arrows indicate the shots where thequery words occur in the corresponding audio.
For the experiments on the TREC data, the text for thesurrounding shots is also taken by setting a fixed windowsize. In the Chinese data set, the window size is varied.First, we give the results using only the text aligned with asingle shot, then we compare the results for different windowsizes.
Figure 8 shows some frames that predict (a)panda, (b)wall,(c)emperor in the first three words with the highest proba-bility. As can be seen, the right words are predicted for theframes that are related with the word. The word panda is
Figure 6: Results of a query for panda on Chineseculture data set using the current Informedia sys-tem. The arrows indicate the shots where the queryword occurs in the corresponding audio.
0 500 1000 1500 2000 2500 30000
200
400
600
800
1000
1200
1400
words
wor
d fr
eque
ncie
s
Figure 7: Word frequencies for the Chinese culturedata set. Originally there are 2597 words in thevocabulary. However, almost 80% of the vocabularyoccurs less then five times. We eliminate the wordsthat occur less than 5 times or more than 250 timesto have a new vocabulary. The number of wordsremained in the pruned vocabulary is 626.
correctly predicted for different views of panda. Similarly,the word wall is correctly predicted when the Great Wallappears in the image, although the images are not very sim-ilar. The images for emperor are in a larger range, since it ishard to associate this word with a specific object or scene.
In order to evaluate the results on a larger scale 189 imagesare visually investigated for the word panda. The results areshown in Table 1. 127 images are associated with the wordpanda (using the aligned transcript), but only in 42 of theseimages a panda appears. The system predicts panda as thefirst word with the highest probability for 121 images, andin 51 of them a panda appears in the image. In 25 of theseimages the word panda is predicted for the images where apanda appears, even though the word was not associatedwith the frame originally. The results show that we missed11 images where the panda word was associated and a pandaappears in the image, but cannot be predicted as the firstword. For the images where the system cannot predict pandaas the first word, it was usually the second or the third word.
-
(a)
(b)
(c)
Figure 8: Sample images that predict (a)panda,(b)wall, and (c)emperor in the first three words withthe highest probability.
The predicted words can be used for two different purposes:to have better annotations for the frames on the training setand to annotate the frames on the test set that do not haveassociated text. As explained before, we eliminated 1000images since the words associated with those images werethe less frequent words. These images are used as the testset. For panda, we randomly choose two sets of frames wherethe story is about a panda. Among these, the frames thatare in the test set are analyzed. We look at the images wherepanda is predicted as the first word, and check whether thepanda appears in the image. In the first set 11 images over25 images, and in the second set 2 images over 25 images,predict panda as the first word when the panda appears inthe image.
The scene shown in Figure 6 is further used to analyze theresults for predicting the word panda both for the trainingand test sets. Figure 9 shows the rank of the word pandaas the predicted word for the corresponding frames. Red(dark gray) numbers correspond to the images that are inthe test set (the frames that are initially eliminated), andgreen (light gray) numbers correspond to the images thatare in the training set. The ranking in the training set isalso important, since most of the correct associations aremissing and our goal is finding more reliable annotations forthe frames. As the figure shows, on most of the trainingimages if a panda appears in the image, we predict the word
Table 1: Correspondence results for panda. 189frames are inspected. Associated means that theword is in the transcript that is aligned with theframe. Predicted means that the system predictpanda for the frame as the first word. Correctmeans a panda appears, and incorrect means a pandadoesn’t appear in the frame.
number of associations 127number of correct associations 42number of incorrect associations 85number of predictions 121number of correct predictions 51number of incorrect predictions 70associated but not predicted 64association is correct but not predicted 11association is incorrect not predicted 53not associated but predicted 61not associated but correctly predicted 25not associated and incorrectly predicted 36associated and predicted and correct 26associated and predicted but incorrect 31
panda in the first four words with the highest probability.The predictions for the test images are also satisfactory:some of the images where a panda appears are annotatedwith the word panda, and for the images that don’t showany panda the annotation rank is very low. However, bothfor the training and test images, we see a problem when thewoman appears: since the woman frames highly co-occurwith the word panda, the system learns that the womanframes are also associated with panda and we incorrectlypredict panda with a high score.
Figure 9: The predictions for the panda scene. Thenumbers show the rank of the word panda amongthe predicted words. Red (dark gray) numbers cor-respond to the images that are not in the trainingset, green (light gray) numbers correspond to theimages in the training set.
We investigate the effect of window size, by choosing eitheronly the words from a single shot, or also from the surround-ing shots (window size is set to 1,2 or 3). The average num-ber of words associated with a single shot is 3.7275 whenonly the words that are associated with a single shot areused; 10.1670 when window size is set to 1; 16.4470 whenwindow size is 2; and 22.6725 when window size is 3. Themaximum number of words are 25, 47, 63 and 70 respec-tively.
-
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
wallgreatsochinese
pandaoneemperor
ha
china
flag
hsi
cranenot
there
lan
merchant
mercury
round
square
first
otherwho
very
deer
chcourtyard
year
can
people
sea.
stretche
glorynative
chamber
detail
eternal
letter
bronze
she
yangtze
captive
state
what
qin
tomb
saying
tzu
blue
ding
birthcity
month
manyking
forbiddendoidea
lot
time
wartrade
feet
new
dynastyprogram
travel
elixir
harmony
jade
middle
just
now
army
boundary
evil
thing
imperial
upbamboo
today
northern
stone
shi
power
meter
arrow
roof
roman
watchtower
ancestor
hair
throne
woman
system
keeper
range
dream
animal
area
heremay
writing
why
tangland
mountain
kind
symbol
region
fire
word
individual
giant
ming
eastern
20th
river
forestgenghi
magicbarrierfaith
fighting
text
line
archeologistsuccessorresource
act
burning
carriage
weapon
command
prince
handwalk
century
respect
hawk
pitideal
fitwolong
generation
white
waytreasure
cotta
yellow
piecepalace
workbuild
form
north
life
book
masgod
fragmentinvasion
servant
shape
qing
lengthdecade
feeling
organizationtrip
world
li
dragonspirit
babyearthmilitary
mongol
mind
conflict
monument
tourist
pu
open
beginning
chariot
sword
crown
objectlymodel
legend
ground
mean
gold
shownsleepanswer
number
li−li
member
home
houralmostsoldierface
found
little
part
silk
sense
femaleperiod
zhaotian
nomadimmortality
something
run
villager
culture
tower
thought
hill
lo
court
lady
figure
longschaller
researcher
central
thousandplacesonright
national
far
remain
nightofficer
burial
incrediblesectionheaven
ruler
wonder
size
radio
let
dr.
story
attempt
empire
warrior
history
natural
statue
shang
reign
eunuch
empreswater
standard
excavationkept
official
sort
effort
man
barbarian
control
pagoda
b.c.
pasexampleten
warring
research
wildlife
hundred
million
name
body
country
age
ritual
top
construction
beliefmother
look
traditional
eyelivingdiscovery
secret
rule
question
death
sea
huang
gate
center
probablyunited
site
day
showinside
force
wild
concubinered
humanhabitatoutside
westwesternhan
specybuildingancient
recall
prec
isio
n
Figure 10: Recall vs precision graph : single frame.Recall is defined as the number of correct predictionsover the number of times that the word occurs inthe data, and precision is defined as the number ofcorrect predictions over all predictions.
Figures 10, 11, 12 and 13 show the recall and precision val-ues as a function of words when the associated words arechosen from a single frame, or from a window size 1, 2, and3 respectively. Recall is defined as the number of correctpredictions over the number of times that the word occursin the data, and precision is defined as the number of cor-rect predictions over all predictions. Correct predictions arefound by comparing the first n words that are predicted withthe highest probability (where n is the number of actualwords associated with the frame), with the actual ones. Itis hard to understand the figures clearly, but it is easy to seethat the system predicts more words as the window size in-creases. Also, the figures indicate that for most of the wordsthe recall and/or precision values increase as the number ofwords associated with a frame increases. Table 2 shows therecall and precision values for some selected words.
Table 2: Comparison of recall and precision valuesfor some selected words as a function of window size.
panda wall emperorrec - prec rec - prec rec - prec
single frame 0.718 - 0.204 0.849 - 0.228 0.689 - 0.224wsize = 1 0.818 - 0.243 0.918 - 0.292 0.874 - 0.286wsize = 2 0.886 - 0.256 0.952 - 0.315 0.945 - 0.326wsize = 3 0.915 - 0.260 0.972 - 0.319 0.969 - 0.363
Table 3: Prediction measures when different windowsizes are used to obtain the words associating witha shot.
prediction measuresingle frame 0.1851wsize = 1 0.2469wsize = 2 0.2783wsize = 3 0.2975
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
wallgreatchineseoneemperor
panda
sochinahanottherefirstwhoyear
peoplewhat
verych
zhao
tombcandeerotherscript
round
many
mongol
cranejade
sheanimal
dynastyqin
king
justding
forbiddenkind
feetming
hsi
chamber
up
ideamerchanthair
lady
city
northern
here
statearmynow
rule
imperial
stretche
porcelain
ground
huo
system
bamboo
tang
powercentury
tzu
ideal
giant
remain
waysort
silk
eastern
pig
cotta
dragon
wild
stone
do
alligator
sea.
pagoda
keeperlandlot
bronze
empire
symbolli
place
buildinghandtime
flag
communist
crownmountain
burial
heaven
material
evil
thousandtoday
faith
empres
courtyard
concubine
shi
captive
program
elixir
spirit
square
almost
officer
baby
roman
war
qing
period
marqui
world
habitat
new
edge
native
gravebirth
number
may
ancestor
treasure
individual
man
book
harmonygeneral
partspecyriver
sealook
yellow
han
shang
wildlife
title
object
long
figure
life
mind
lored
country
next
barbarianthrone
build
warrior
mercury
somethingeveryone
dr.
western
earth
sense
site
saying
yungkey
sword
yisleep
spot
schaller
wolong
historyscholargold
genghimagic
tower
tradition
weapon
military
a.d.
model
woman
shapedestruction
death
policy
man.
natural
philosophy
section
west
conflict
palace
b.c.
mile
li−li
writingtrade
archeological
official
nomad
travel
rebel
corner
hu
areahorse
george
workconstructiongate
little
lan
line
lieburning
eternal
heightmother
generation
iron
eye
gift
road
hundred
eunuch
fall
art
north
radio
princemeng
control
law
ten
running
piece
trip
expedition
forest
leave
fenghill
form
human
center
white
righteffort
afterlife
pay
father
heart
god
curse
chi
everything
servant
journey
leopard
range
ancientfound
belief
setwonder
national
archeologist
discovery
huangtouristmanchu
female
waterbody
brick
invasion
relic
practice
heavenly
row
pas
young
position
problem
reason
attempt
immortality
thought
yang
tour
kingdomday
route
tian
beginning
existence
frontlocation
million
ruler
breeding
name
ritual
enemy
monument
sacred
buddha
thing
yangtze
group
secret
far
myth
culture
pride
fire
mystery
carriage
cloud
development
warring
united
huangdinightcourt
age
attack
yin
agricultural
supreme
command
soldier
capital
purpose
master
inside
museum
facetraditional
looking
living
men
glimpsefighting
taoist
zhou
soil
successor
blackhistorical
surrounding
legend
item
letter
characterpit
standkleiman
restoration
reality
barrier
text
territory
glory
step
civil
chime
blade
collar
pound
laborer
space
probably
statue
heavy
favorite
pottery
walk
middle
mouth
decade
fit
feeling
population
bear
force
family
ratherleftsaw
desert
artisan
blue
example
visit
dead
hopedoing
organizationmao
peasant
moderncommon
change
story
amount
accountpu
top
energy
local
understand
sovereign
flow
open
play
meetwine
silver
stop
zoo20th
mean
male
member
hour
future
reserve
scaleruntakingshown
leadercivilization
word
matter
shareanswer
food
alive
big
search
incredible
region
sizeletscientist
troop
picture
standard
culturalrecord
reign
kept
excavation
boundaryresearchfield
head
project
high
why
question
government
researchershow
son
outside
recall
prec
isio
n
Figure 11: Recall vs precision : window size = 1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
onewallchineseemperorgreatsochina
panda
hanot
therefirstwhotombpeopleveryyearcanwhat
deer
othermanychzhaoshe
mongol
justupjadedynasty
chamber
hsiround
qin
ming
state
tzu
powerarmy
animal
porcelain
cityimperialnow
here
king
ideacentury
forbiddensymbol
pagoda
kindway
system
feet
native
rulealligatorwarrior
ding
do
dragon
cotta
time
spirit
scriptnorthern
empres
han
ideal
silk
treasure
bamboo
flag
lady
heaven
western
bronze
almost
mountainland
thousand
lettercountry
new
crane
pig
tang
merchant
shi
giantconcubine
wildlifeyang
empiremayphilosophy
elixir
stretche
manstone
war
building
barbarianroman
writingtoday
evilschaller
tower
crownancient
lot
history
remainplace
captive
areaworldwork
ancestor
partkeepertrade
look
longsite
li
yin
sort
officer
yellow
b.c.
wild
military
specy
qing
square
build
death
burial
birthbabynext
huo
period
supreme
eastern
conflict
thing
yung
mouth
soldierlifeprince
tenwolong
construction
george
night
red
mind
gold
wonder
ground
lan
sun
hair
line
sea.
book
harmony
habitat
throne
official
name
natural
something
communist
thought
carriage
individual
li−lifound
figure
story
human
enemyscholar
confucian
lo
edge
north
fire
breeding
palace
program
beliefhope
piece
dr.west
men
brick
a.d.
shape
yangtze
object
cloud
model
sea
20th
face
dayright
shang
house
mystery
ritual
stand
kingdomriversense
material
picture
buddhahu
little
rulerafterlife
manchu
group
hundred
set
faith
forcetravelmonumentsealrather
route
problem
confucianism
god
relic
journey
record
mile
leader
culture
bodywarring
archeological
womanorganizationrebel
eunuch
corner
answer
soninside
earthgeneral
huangdi
mercury
silver
generation
road
number
hand
marqui
destruction
archeologist
tian
probably
territorydetail
leave
art
gatereign
heavenly
row
techniquesection
burning
highimmortality
field
living
civil
grave
walk
waterfutureattempt
family
feng
meng
everything
sleep
agricultural
nature
sacred
chairman
far
legend
excavationnothing
servant
mao
center
white
person
word
huang
weapon
zhou
flow
command
wine
manmade
stop
radio
secret
understandcontrol
million
female
sign
courtyard
glimpse
demon
taoistmuseum
question
discovery
man.
someone
chime
pit
looking
researcher
young
taking
food
lie
historical
form
sharestandard
effort
moon
surrounding
item
rich
leopard
conservationnational
structure
reason
height
reality
heart
fighting
curse
chi
glory
respectcharacter
mothereat
expedition
gift
nomad
statueeverybody
pavilion
mean
middle
month
father
artisan
hill
troop
law
saying
destroybeginningplay
sword
meet
maletrip
forest
common
project
di
age
feeling
invasion
why
women
yi
tryold
tourist
tradition
eye
rainforest
marco
beijing
title
traditional
government
boundary
kept
villager
pere
zootrack
bigpopulation
roof
hawkhead
development
location
showfallexample
outside
genghi
civilization
search
size
collar
horse
capital
arrow
black
light
range
incredible
region
soil
becoming
culturalbarrierkhansuccessor
pas
purpose
ruling
favorite
fit
director
ironrestoration
boy
basic
mount
fightlaborer
research
wealthtemplecycleeternaldozen
united
republic
decadethreat
spot
modernleft
key
running
position
peasant
westerner
pay
sovereign
blue
top
desert
visit
doing
amount
account
practice
else
weight
bone
society
political
home
reserveopenexistence
shown
scale
main
farmer
lif
front
tour
matterstanding
nation
myth
runscientist
guard
second
central
alive
court
let
colormaster
foot
reach
spacepottery
saw
change
recall
prec
isio
n
Figure 12: Recall vs precision : window size = 2
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
one
wall
chineseemperorsogreat
chinanotha
panda
therefirstyearwho
peopleverytombwhatother
canmany
just
upding
forbidden
charmy
zhaoshe
hsi
deer
dynastyjadeanimalqin
ming
citynow
mongolround
imperial
porcelain
power
century
state
cottasystemhere
tzubamboo
kind
time
king
ideawarrior
lidoway
pagoda
symbol
feet
building
chamber
rulemayhan
war
northern
alligator
letter
empressilkscripttreasurebronzethousand
ancestor
today
new
carriage
specy
heaven
worldshi
princemountainempiremilitaryshangdragon
barbarian
tang
lo
western
landgoldphilosophy
history
remain
look
stonewriting
crane
almost
ancient
giantcommunist
huo
concubinefire
weapon
row
man
soldierlifeworkpartfigure
build
yung
lady
areaschaller
b.c.
longlittle
countrythronenextperiod
merchant
yangyellowevil
crown
roman
spirit
face
place
elixir
construction
flag
ideal
tower
keeper
wild
ground
birtheastern
wildlife
body
house
burial
lot
baby
stretche
sea
site
roofsort
piece
confucian
red
mind
trade
archeologistlan
hope
found
linewonder
palace
squarenatural
day
qing
officer
pig
west
tian
gatethingnumber
confucianism
relic
conflict
nomad
name
supreme
problemnight
habitat20th
death
shape
belief
techniqueobject
thoughtearthzhou
captive
officialnative
hundred
storycourtyard
marquiwarring
rightafterlife
westerner
leaderdetail
bird
sense
effortscholar
triproute
sunriver
group
hu
pas
picturewolong
destruction
hair
woman
something
discovery
stopfaithartisan
territoryrulerprobably
why
a.d.
journey
force
troop
attack
boundary
fall
edge
model
form
tourist
high
ritual
reasongeneration
individual
leave
seal
inside
stand
north
book
statue
taoist
fengmen
leopard
li−li
general
hand
human
manchu
sword
standard
female
breeding
mercury
cloud
glimpsehuang
program
huangdi
sacred
mile
everythingmarcoyin
immortality
curse
living
road
george
fieldmotherculturelegend
genghi
di
mysteryarcheologicalset
control
chiten
mouth
man.
government
modern
far
favorite
nothingenemy
buddha
rebel
water
outside
kingdom
meng
agricultural
temple
chimeradio
white
heavenly
food
forest
civil
walk
million
materialyangtzecenter
god
harmony
pavilion
grave
section
insemination
eye
future
mean
mao
family
head
silver
organizationunderstand
secret
reignnature
eunuchtitle
running
moon
lie
attempt
son
person
top
nationalhistorical
population
women
researcherword
beginning
brick
buddhist
try
rather
commonsawmanmade
someone
lengthsea.
surrounding
item
respect
eat
iron
master
record
travel
cultural
fighting
burning
charactersleep
opentaking
tour
myth
main
question
father
corner
dr.
barrier
wineexample
peasant
demon
key
tradition
project
space
horse
rangehilldesert
light
trap
feeling
showzoo
elsegreenkhansuccessor
stepincredible
shown
doing
unitedexcavation
age
gift
front
nation
minister
realityheart
monument
share
big
answer
mount
rich
ling−ling
size
restoration
spotsearch
energy
expedition
beijing
run
civilizationrainforest
alive
resource
everybody
meet
eternal
saying
account
servant
threatthirdlaw
variety
destroy
color
capitallooking
claim
play
leftyoung
old
pay
sign
invasion
flow
pere
better
reach
home
weight
political
structure
kept
meter
existence
hour
sturgeon
hawkchairman
view
watchtowerheight
development
start
magic
everyone
begun
let
revolution
artguard
blade
month
location
matingarrowpridecommand
court
pound
becoming
second
museum
region
soil
fight
policyworker
purpose
pit
attention
fate
boy
heavy
foot
plan
coming
holding
bear
child
ruling
fitglory
basic
bloodpottery
path
chariot
hall
fear
wealth
decademiddledozenposition
sovereign
pu
lydead
changeyi
male
amountblue
visit
peakreserve
battle
villager
one.
standing
traditionalfarmer
lifscale
matter
unique
hard
design
achievement
research
recall
prec
isio
n
Figure 13: Recall vs precision : window size = 3
-
The results are also compared using the prediction mea-sure, which is calculated by averaging the number of correctpredictions over the number of words in an image for all theimages. As Table 3 shows, prediction measure is better asthe window size increases.
6. DISCUSSION AND FUTURE WORK
In this study, integration of visual and textual data is pro-posed to solve the correspondence problem between videoframes and associated text. The preliminary results showthat using this approach it is possible to have better anno-tations for the video frames that can be used to improve theperformance of the text based queries.
In the current system relatively simpler and smaller datasets are used. Our goal is to apply a similar approach tobroadcast news which is a harder dataset since there areterabytes of video and it requires focusing on people.
Currently the system uses simple features extracted from thestill images. A better set of features that also include somedetectors (e.g., face detector) and some motion informationis likely to improve the performance. In the experiments,fixed size grids are used. Using a segmenter, better resultscan be obtained [3]. Also, in video the objects are mostly theones that moves. Therefore, it is better to use the temporalinformation to segment the moving objects.
Using a large number of data may allow us to perform sta-tistical analysis. It is an interesting problem to come with adistance metric for choosing the text that best describes theframe. In this study, individual words are taken as the vo-cabulary words. Taking the noun phrases or the compoundwords (e.g. “statue of liberty”) may improve the perfor-mance. Also, some lexical analysis needs to be done to un-derstand which type of words more precisely correspond tovisual features and should be taken as the vocabulary words.
7. REFERENCES[1] Informedia Project. http://informedia.cs.cmu.edu.
[2] K. Barnard, P. Duygulu, N. de Freitas, D. A. Forsyth,D. Blei, and M. Jordan. Matching words and pictures.Journal of Machine Learning Research, 3:1107–1135,2003.
[3] K. Barnard, P. Duygulu, R. Guru, P. Gabbur, andD. A. Forsyth. The effects of segmentation and featurechoice in a translation model of object recognition. InIEEE Conf. on Computer Vision and PatternRecognition, 2003.
[4] K. Barnard and D. A. Forsyth. Exploiting imagesemantics for picture libraries. In The FirstACM/IEEE-CS Joint Conference on Digital Libraries,page 469, 2001.
[5] K. Barnard and D. A. Forsyth. Learning the semanticsof words and pictures. In Int. Conf. on ComputerVision, pages 408–15, 2001.
[6] E. Brill. A simple rule-based part of speech tagger. InIn Proceedings of the Third Conference on AppliedNatural Language Processing, 1992.
[7] P. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. L.Mercer. The mathematics of statistical machinetranslation: Parameter estimation. ComputationalLinguistics, 19(2):263–311, 1993.
[8] C. Chen. First Emperor of China, Voyager CD-ROM.http://www.voyagerco.com/cdrom/, 1994.
[9] A. P. Dempster, N. M. Laird, and D. Rubin.Maximum likelihood from incomplete data via the emalgorithm. Journal of the Royal Statistical Society,Series B, 1(39):1–38, 1977.
[10] P. Duygulu, K. Barnard, N. Freitas, and D. A.Forsyth. Object recognition as machine translation:learning a lexicon for a fixed image vocabulary. InSeventh European Conference on Computer Vision(ECCV), volume 4, pages 97–112, 2002.
[11] P. Duygulu-Sahin. Translating Images to words: Anovel approach for object recognition. PhD thesis,Middle East Technical University, Turkey, 2003.
[12] C. Jelmini and S. Marchand-Maillet. Deva: anextensible ontology-based annotation model for visualdocument collections. In Proceedings of SPIEPhotonics West, Electronic Imaging 2002, InternetImaging IV, Santa Clara, CA, USA, 2003.
[13] J. Li and J. Z. Wang. Automatic linguistic indexing ofpictures by a statistical modeling approach. IEEETrans. on Pattern Analysis and Machine Intelligence,25(10):14, 2003.
[14] M. Markkula and E. Sormunen. End-user searchingchallenges indexing practices in the digital newspaperphoto archive. Information retrieval, 1:259–285, 2000.
[15] O. Maron. Learning from Ambiguity. PhD thesis,MIT, 1998.
[16] I. D. Melamed. Empirical Methods for ExploitingParallel Texts. MIT Press, 2001.
[17] Y. Mori, H. Takahashi, and R. Oka. Image-to-wordtransformation based on dividing and vectorquantizing images with words. In First InternationalWorkshop on Multimedia Intelligent Storage andRetrieval Management, 1999.
[18] H. Wactlar and C. Chen. Enhanced perspectives forhistorical and cultural documentaries using informediatechnologies. Joint Conference on Digital Libraries(JCDL’02), Portland, Oregon, July 13-17, 1994.