High Level Computer Vision: Topic Models in Vision
Bernt Schiele - [email protected]
Mario Fritz - [email protected]
http://www.d2.mpi-inf.mpg.de/cv
High Level Computer Vision - July 15, 2014
Topic Models for Object Recognition
slide credit Fei-Fei et al.
Today
• Unsupervised learning
‣ topic models
‣ object class discovery
‣ extension to spatial information
How many object categories are there?
~10,000 to 30,000
[Biederman 1987]
Datasets and Their Issues In Object Recognition
Caltech101/256 [Fei-Fei et al. 2004] [Griffin et al. 2007]
PASCAL [Everingham et al. 2009]
MSRC [Shotton et al. 2006]
ESP [Ahn et al. 2006]
LabelMe [Russell et al. 2005]
Tiny Images [Torralba et al. 2007]
Lotus Hill [Yao et al. 2007]
Size of existing datasets

Dataset      # of categories   # of images per category   # of total images   Collected by
Caltech101   101               ~100                       ~10K                Human
Lotus Hill   ~300              ~500                       ~150K               Human
LabelMe      183               ~200                       ~30K                Human
Ideal        ~30K              >>10^2                     A LOT               Machine
• An ontology of images based on WordNet
• Collected using Amazon Mechanical Turk
• ~10^5+ nodes, ~10^8+ images
[Figure: WordNet branch animal → shepherd dog, sheep dog → German shepherd, collie]
Deng, Dong, Socher, Li & Fei-Fei, CVPR 2009
image-net.org
21,841 categories, 14,197,122 images
• Animals
‣ Fish
‣ Bird
‣ Mammal
‣ Invertebrate
• Scenes
‣ Indoors
‣ Geological formations
• Sport Activities
• Fabric Materials
• Instrumentation
– Tool
– Appliances
– …
• Plants
– …
Deng, Wei, Socher, Li, Li, Fei-Fei, CVPR 2009
“Cycling”
“Drawing room, withdrawing room”
How much supervision do you need to learn models of objects?
Object label + segmentation
Agarwal & Roth ’02, Leibe & Schiele ’03, Torralba et al. ’05
LabelMe, PASCAL, TU Darmstadt, MIT scenes and objects
MIT+CMU frontal faces
Viola & Jones ’01, Rowley et al. ’98
Weakly Supervised: Object appears somewhere in the image
Fergus et al. ’03, Csurka et al. ’04, Dorko & Schmid ’05
Caltech 101, PASCAL, MSRC
Example classes: airplane, motorbike, face, car
Image + text caption
Corel, Flickr, Names+faces, ESP game
Barnard et al. ’03, Berg et al. ’04
Images only
Given a collection of unlabeled images, discover visual object categories and their segmentation
• Which images contain the same object(s)?
• Where is the object in the image?
Training Conditions
• Object labels + segmentation
• Object label + bounding box
• Object label
• Image and text (e.g. caption: “bridge, sky, water”)
• Images only
‣ example technique: LDA
Unsupervised Learning
• Applications:
‣ refining web search (outlier removal)
‣ discovering object classes
• No labels -> generative models
• Modeling p(x)
• What can the data distribution tell us about the problem?
• Simple examples:
‣ clustering (e.g. k-means)
‣ Gaussian mixture models
• Today:
‣ “multinomial mixture model” with a twist
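A minimal sketch of k-means, the simple clustering example mentioned above; the toy 2-D data and k = 2 are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated 2-D blobs as toy unlabeled data
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(5, 1, (50, 2))])

k = 2
centers = X[rng.choice(len(X), size=k, replace=False)]  # init from data points
for _ in range(20):
    # assign each point to its nearest center
    labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
    # re-estimate each center as the mean of its assigned points
    # (keep the old center if a cluster happens to go empty)
    centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])

print(np.round(centers, 1))  # the two recovered cluster centers
```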
Gaussian Mixture
• Only red data points are observed
• We can recover cluster/class structure by making a distribution assumption
• But how to do this with visual words?

p(x) = Σ_i p(x|c_i) p(c_i)

with components p(x|c_1), p(x|c_2) and mixing weights p(c_1), p(c_2).
[Graphical model: parameters θ, latent assignment z, observation x with component parameters µ, Σ; plate over N_c observations and C components.]
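A quick numerical sketch of this mixture density; the two 1-D Gaussian components and their weights are illustrative choices, not from the slides.

```python
import numpy as np

# p(x) = sum_i p(x|c_i) p(c_i) with two 1-D Gaussian components
def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = [0.3, 0.7]                 # mixing weights p(c_1), p(c_2)
params = [(-2.0, 1.0), (3.0, 0.5)]  # (mu, sigma) of p(x|c_1), p(x|c_2)

def p(x):
    return sum(pc * gauss_pdf(x, mu, s) for pc, (mu, s) in zip(priors, params))

xs = np.linspace(-8.0, 8.0, 2001)
mass = (p(xs) * (xs[1] - xs[0])).sum()  # Riemann sum; should be close to 1
print(mass)
```

Fitting such a model to unlabeled data (e.g. with EM) recovers the cluster structure; the question on the slide is how to do the analogous thing for discrete visual words, which leads to the multinomial topic models below.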
How to read graphical models
• Encoded independence assumptions (all the arrows you don’t see)
• Plate notation
• Observed vs. unobserved variables
Learning as Inference
Figure 9.1: (a) Belief network for the coin-tossing model: θ with children v_1, v_2, ..., v_N. (b) Plate notation equivalent of (a). A plate replicates the quantities inside the plate a number of times as specified on the plate.
In the above, Σ_{n=1}^N I[v_n = 1] is the number of occurrences of heads, which we more conveniently denote as N_H. Likewise, Σ_{n=1}^N I[v_n = 0] is the number of tails, N_T. Hence

p(θ|v_1, ..., v_N) ∝ p(θ) θ^{N_H} (1 − θ)^{N_T}    (9.1.6)

For an experiment with N_H = 2, N_T = 8, the posterior distribution is

p(θ = 0.1|V) = k × 0.15 × 0.1^2 × 0.9^8 = k × 6.46 × 10^-4    (9.1.7)
p(θ = 0.5|V) = k × 0.8 × 0.5^2 × 0.5^8 = k × 7.81 × 10^-4    (9.1.8)
p(θ = 0.8|V) = k × 0.05 × 0.8^2 × 0.2^8 = k × 8.19 × 10^-8    (9.1.9)

where V is shorthand for v_1, ..., v_N. From the normalisation requirement we have 1/k = 6.46 × 10^-4 + 7.81 × 10^-4 + 8.19 × 10^-8 = 0.0014, so that

p(θ = 0.1|V) = 0.4525,  p(θ = 0.5|V) = 0.5475,  p(θ = 0.8|V) = 0.0001    (9.1.10)
as shown in fig(9.2b). These are the ‘posterior’ parameter beliefs. In this case, if we were asked to choose a single a posteriori most likely value for θ, it would be θ = 0.5, although our confidence in this is low since the posterior belief that θ = 0.1 is also appreciable. This result is intuitive since, even though we observed more tails than heads, our prior belief was that it was more likely the coin is fair.
Repeating the above with N_H = 20, N_T = 80, the posterior changes to

p(θ = 0.1|V) = 1 − 1.93 × 10^-6,  p(θ = 0.5|V) = 1.93 × 10^-6,  p(θ = 0.8|V) = 2.13 × 10^-35    (9.1.11)

see fig(9.2c), so that the posterior belief in θ = 0.1 dominates. This is reasonable since in this situation there are so many more tails than heads that this is unlikely to occur from a fair coin. Even though we a priori thought that the coin was fair, a posteriori we have enough evidence to change our minds.
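The posterior values above can be checked numerically; a short sketch, using the discrete prior p(θ) = (0.15, 0.8, 0.05) over θ = (0.1, 0.5, 0.8) from the text:

```python
import numpy as np

theta = np.array([0.1, 0.5, 0.8])   # candidate coin biases
prior = np.array([0.15, 0.80, 0.05])  # prior beliefs p(theta)

def posterior(n_heads, n_tails):
    # p(theta|V) ∝ p(theta) * theta^NH * (1 - theta)^NT
    unnorm = prior * theta**n_heads * (1 - theta)**n_tails
    return unnorm / unnorm.sum()

print(posterior(2, 8))    # ≈ [0.4525, 0.5475, 0.0001]
print(posterior(20, 80))  # theta = 0.1 dominates
```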
9.1.2 Making decisions
In itself, the Bayesian posterior merely represents our beliefs and says nothing about how best to summarise these beliefs. In situations in which decisions need to be taken under uncertainty, we additionally need to specify what the utility of any decision is, as in chapter(7).
Figure 9.2: (a) Prior encoding our beliefs about the amount the coin is biased to heads. (b) Posterior having seen N_H = 2 heads and N_T = 8 tails. (c) Posterior having seen N_H = 20 heads and N_T = 80 tails. Assuming any value of 0 ≤ θ ≤ 1 is possible, the ML setting is θ = 0.2 in both N_H = 2, N_T = 8 and N_H = 20, N_T = 80.
Example: a graphical model with edges x → y and x → z encodes the factorization

p(x, y, z) = p(z|x) p(y|x) p(x)

e.g. x = object, y = position, z = orientation. [Barber, Bishop]
Topic Models
• Originally developed for text and document analysis
• Unsupervised learning without any labels
• Given
‣ documents in bag-of-words representation (histogram of word counts)
• Assumption
‣ each document is about a subset of possible topics
‣ each topic has characteristic words
• Goal
‣ infer characteristic word distributions for each topic
‣ infer which topics a document is about
Figure from [Blei]
Latent Topic Models (pLSA, LDA, ...)
2. Generative Models
A generative model for documents is based on simple probabilistic sampling rules that describe how words in documents might be generated on the basis of latent (random) variables. When fitting a generative model, the goal is to find the best set of latent variables that can explain the observed data (i.e., observed words in documents), assuming that the model actually generated the data. Figure 2 illustrates the topic modeling approach in two distinct ways: as a generative model and as a problem of statistical inference. On the left, the generative process is illustrated with two topics. Topics 1 and 2 are thematically related to money and rivers and are illustrated as bags containing different distributions over words. Different documents can be produced by picking words from a topic depending on the weight given to the topic. For example, documents 1 and 3 were generated by sampling only from topic 1 and 2 respectively while document 2 was generated by an equal mixture of the two topics. Note that the superscript numbers associated with the words in documents indicate which topic was used to sample the word. The way that the model is defined, there is no notion of mutual exclusivity that restricts words to be part of one topic only. This allows topic models to capture polysemy, where the same word has multiple meanings. For example, both the money and river topic can give high probability to the word BANK, which is sensible given the polysemous nature of the word.
[Figure 2: left, the probabilistic generative process. Topic 1 (money, loan, bank) and topic 2 (river, stream, bank) are bags of words; DOC1 is generated from topic 1 alone (weight 1.0), DOC2 from an equal mixture (.5/.5), and DOC3 from topic 2 alone (weight 1.0), with superscripts marking the generating topic of each word. Right, the problem of statistical inference: the topic assignments and distributions are unknown (?) and must be recovered from the observed words.]
Figure 2. Illustration of the generative process and the problem of statistical inference underlying topic models
The generative process described here does not make any assumptions about the order of words as they appear in documents. The only information relevant to the model is the number of times words are produced. This is known as the bag-of-words assumption, and is common to many statistical models of language including LSA. Of course, word-order information might contain important cues to the content of a document and this information is not utilized by the model. Griffiths, Steyvers, Blei, and Tenenbaum (2005) present an extension of the topic model that is sensitive to word-order and automatically learns the syntactic as well as semantic factors that guide word choice (see also Dennis, this book for a different approach to this problem).
The right panel of Figure 2 illustrates the problem of statistical inference. Given the observed words in a set of documents, we would like to know what topic model is most likely to have generated the data. This involves inferring the probability distribution over words associated with each topic, the distribution over topics for each document, and, often, the topic responsible for generating each word.
[Figure 6: the topic model factorizes the normalized words × documents co-occurrence matrix C into mixture components (a words × topics matrix) times mixture weights (a topics × documents matrix); LSA factorizes C as U D V^T over latent dimensions.]
Figure 6. The matrix factorization of the LSA model compared to the matrix factorization of the topic model
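The contrast in Figure 6 can be made concrete in a few lines; the toy count matrix, the rank k = 2, and the normalization below are illustrative stand-ins, not a fitted topic model:

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.poisson(1.0, size=(12, 8)).astype(float)  # toy 12-word x 8-doc counts

# LSA: C ≈ U_k D_k V_k^T, a truncated SVD over k latent dimensions
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
C_lsa = U[:, :k] * s[:k] @ Vt[:k, :]

# Topic-model view: Phi columns are P(word|topic), Theta columns are
# P(topic|doc); their product approximates the column-normalized C.
# (Only the shapes and normalization are illustrated here.)
Phi = np.abs(U[:, :k]); Phi /= Phi.sum(axis=0)         # words x topics
Theta = np.abs(Vt[:k, :]); Theta /= Theta.sum(axis=0)  # topics x docs
approx = Phi @ Theta                                   # each column sums to 1

print(np.linalg.norm(C - C_lsa))  # rank-2 reconstruction error
```

The key structural difference: the topic-model factors are non-negative and column-normalized (probability distributions), while the LSA factors are unconstrained orthogonal directions.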
Other applications. The statistical model underlying the topic modeling approach has been extended to include other sources of information about documents. For example, Cohn and Hofmann (2001) extended the pLSI model by integrating content and link information. In their model, the topics are associated not only with a probability distribution over terms, but also over hyperlinks or citations between documents. Recently, Steyvers, Smyth, Rosen-Zvi, and Griffiths (2004) and Rosen-Zvi, Griffiths, Steyvers, and Smyth (2004) proposed the author-topic model, an extension of the LDA model that integrates authorship information with content. Instead of associating each document with a distribution over topics, the author-topic model associates each author with a distribution over topics and assumes each multi-authored document expresses a mixture of the authors’ topic mixtures. The statistical model underlying the topic model has also been applied to data other than text. The grade-of-membership (GoM) models developed by statisticians in the 1970s are of a similar form (Manton, Woodbury, & Tolley, 1994), and Erosheva (2002) considers a GoM model equivalent to a topic model. The same model has been used for data analysis in genetics (Pritchard, Stephens, & Donnelly, 2000).
4. Algorithm for Extracting Topics
The main variables of interest in the model are the topic-word distributions φ and the topic distributions θ for each document. Hofmann (1999) used the expectation-maximization (EM) algorithm to obtain direct estimates of φ and θ. This approach suffers from problems involving local maxima of the likelihood function, which has motivated a search for better estimation algorithms (Blei et al., 2003; Buntine, 2002; Minka, 2002). Instead of directly estimating the topic-word distributions φ and the topic distributions θ for each document, another approach is to directly estimate the posterior distribution over z (the assignment of word tokens to topics), given the observed words w, while marginalizing out φ and θ. Each z_i gives an integer value in [1..T] for the topic that word token i is assigned to. Because many text collections contain millions of word tokens, the estimation of the posterior over z requires efficient estimation procedures. We will describe an algorithm that uses Gibbs sampling, a form of Markov chain Monte Carlo, which is easy to implement and provides a relatively efficient method of extracting a set of topics from a large corpus (Griffiths & Steyvers, 2004; see also Buntine, 2004, Erosheva 2002 and Pritchard et al., 2000). More information about other algorithms for extracting topics from a corpus can be obtained in the references given above.
Markov chain Monte Carlo (MCMC) refers to a set of approximate iterative techniques designed to sample values from complex (often high-dimensional) distributions (Gilks, Richardson, & Spiegelhalter, 1996). Gibbs sampling (also known as alternating conditional sampling), a specific form of MCMC, simulates a high-dimensional distribution by sampling on lower-dimensional subsets of variables where each subset is conditioned on the value of all others. The sampling is done sequentially and proceeds until the sampled values approximate the target distribution. While the Gibbs procedure we will describe does not provide direct estimates of φ and θ, we will show how φ and θ can be approximated using posterior estimates of z.
• Documents are occurrence histograms over the vocabulary
• Topics are distributions over vocabulary
• Unsupervised learning of mixture model
Figure from [Griffiths]
Latent Dirichlet Allocation (LDA)
Blei et al. 2003
wij - words
zij - topic assignments
θi - topic mixing weights (multinomials)
φk - word mixing weights (multinomials)
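A sketch of the LDA generative process these variables describe; the vocabulary size, topic count, and hyperparameters below are illustrative choices:

```python
import numpy as np

# For each document: theta_i ~ Dirichlet(alpha); for each word token:
# z_ij ~ Multinomial(theta_i), then w_ij ~ Multinomial(phi_{z_ij}).
rng = np.random.default_rng(0)
V, T, alpha, beta = 10, 3, 0.5, 0.1   # vocab size, topics, hyperparameters

phi = rng.dirichlet([beta] * V, size=T)      # T x V: word distribution per topic

def generate_document(n_words):
    theta = rng.dirichlet([alpha] * T)       # topic mixing weights for this doc
    z = rng.choice(T, size=n_words, p=theta) # topic assignment per token
    w = np.array([rng.choice(V, p=phi[t]) for t in z])
    return z, w

z, w = generate_document(20)
print(w)  # 20 word indices in [0, V)
```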
Dirichlet Distribution
• The Dirichlet distribution is the conjugate prior of the multinomial distribution
‣ a Dirichlet prior can be integrated out analytically and again yields a Dirichlet distribution
‣ Example: K = 3 - the Dirichlet distribution over the three variables µ1, µ2, µ3 is confined to a simplex

Dir(µ|α) = Γ(α_0) / (Γ(α_1) · · · Γ(α_K)) · Π_{k=1}^K µ_k^{α_k − 1},   with α_0 = Σ_k α_k

[Figure: the density on the simplex for {α_k} = 0.1, {α_k} = 1, {α_k} = 10]
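A quick empirical look at how the concentration parameter shapes Dirichlet draws on the K = 3 simplex (the sample sizes are arbitrary):

```python
import numpy as np

# Samples from a K = 3 Dirichlet lie on the simplex mu1 + mu2 + mu3 = 1;
# {alpha_k} = 0.1 pushes mass to the corners, {alpha_k} = 10 concentrates
# it near the center.
rng = np.random.default_rng(0)

stats = {}
for alpha in (0.1, 1.0, 10.0):
    mu = rng.dirichlet([alpha] * 3, size=2000)
    assert np.allclose(mu.sum(axis=1), 1.0)   # every draw is on the simplex
    stats[alpha] = mu.max(axis=1).mean()      # average largest component
    print(alpha, round(stats[alpha], 2))
```

Small α makes draws nearly one-hot (largest component close to 1), which is why sparse Dirichlet priors encourage documents to use few topics.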
Inference
wij - words
zij - topic assignments
θi - topic mixing weights
φk - word mixing weights

Use a Gibbs sampler to sample topic assignments:
• Only need to maintain counts of topic assignments
• Sampler typically converges in less than 50 iterations
• Run time is less than an hour
[Griffiths & Steyvers 2004]
Gibbs Sampling
• Random initialization of the z variables
• Iteratively sample each zi given all the other z
• After a burn-in phase -> samples from the true distribution
• A true Bayesian would keep a distribution over z
• In practice, often a single sample is used
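A compact sketch of the collapsed Gibbs sampler of Griffiths & Steyvers (2004), where θ and φ are integrated out and each token's topic is resampled from its count-based conditional; the tiny corpus and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, alpha, beta = 6, 2, 0.5, 0.1          # vocab size, topics, hyperparams
docs = [[0, 0, 1, 1, 2], [3, 3, 4, 4, 5], [0, 1, 3, 4]]  # word indices per doc

# count tables: word-topic, document-topic, topic totals
nwt = np.zeros((V, T)); ndt = np.zeros((len(docs), T)); nt = np.zeros(T)
z = [rng.integers(T, size=len(d)) for d in docs]          # random initialization
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]; nwt[w, t] += 1; ndt[d, t] += 1; nt[t] += 1

for sweep in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]                                   # remove this token
            nwt[w, t] -= 1; ndt[d, t] -= 1; nt[t] -= 1
            # p(z_i=t | z_-i, w) ∝ (n_wt + beta)/(n_t + V*beta) * (n_dt + alpha)
            p = (nwt[w] + beta) / (nt + V * beta) * (ndt[d] + alpha)
            t = rng.choice(T, p=p / p.sum())              # resample its topic
            z[d][i] = t; nwt[w, t] += 1; ndt[d, t] += 1; nt[t] += 1

phi = (nwt + beta) / (nt + V * beta)   # point estimate of P(word|topic)
print(phi.T)                           # one row per topic
```

Note that only the count tables are maintained, exactly as the slide says; θ and φ are recovered at the end from the final assignments.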
…generated by arbitrarily mixing the two topics. Each circle corresponds to a single word token and each row to a document (for example, document 1 contains 4 occurrences of the word BANK). In Figure 7, the color of the circles indicates the topic assignment (black = topic 1; white = topic 2). At the start of sampling (top panel), the assignments show no structure yet; they just reflect the random assignments to topics. The lower panel shows the state of the Gibbs sampler after 64 iterations. Based on these assignments, Equation 4 gives the following estimates for the distributions over words for topics 1 and 2:

φ^(1): MONEY = .32, LOAN = .29, BANK = .39
φ^(2): RIVER = .25, STREAM = .40, BANK = .35

Given the size of the dataset, these estimates are reasonable reconstructions of the parameters used to generate the data.

[Figure 7 shows the tokens (RIVER, STREAM, BANK, MONEY, LOAN) of 16 documents, before and after sampling.]
Figure 7. An example of the Gibbs sampling procedure.
Exchangeability of topics. There is no a priori ordering on the topics that will make the topics identifiable between or even within runs of the algorithm. Topic j in one Gibbs sample is theoretically not constrained to be similar to topic j in another sample regardless of whether the samples come from the same or different Markov chains (i.e., samples spaced apart that started with the same random assignment or samples from different random assignments). Therefore, the different samples cannot be averaged at the level of topics. However, when topics are used to calculate a statistic which is invariant to the ordering of the topics, it becomes possible and even important to average over different Gibbs samples (see Griffiths and Steyvers, 2004). Model averaging is likely to improve results because it allows sampling from multiple local modes of the posterior.
Stability of Topics. In some applications, it is desirable to focus on a single topic solution in order to interpret each individual topic. In that situation, it is important to know which topics are stable and will reappear across samples and which topics are idiosyncratic for a particular solution. In Figure 8, an analysis is shown of the degree to which two topic solutions can be aligned between samples from different Markov chains. The TASA corpus was taken as
Figure from [Griffiths]
Gibbs Sampling
[Illustration: two variables z_i, z_j, starting from a random initialization]
initialize, then iterate:
  sample from p(z_i | z_j)
  sample from p(z_j | z_i)
After the burn-in period (red) you start drawing samples from the joint distribution.
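The two-variable scheme on this slide can be sketched on a correlated Gaussian target (an illustrative choice), where both conditionals are 1-D Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n_keep, burn_in = 0.8, 5000, 500   # target correlation, sample counts

zi, zj = 3.0, -3.0                      # arbitrary (poor) initialization
cond_std = np.sqrt(1 - rho ** 2)        # std of each Gaussian conditional
samples = []
for step in range(n_keep + burn_in):
    zi = rng.normal(rho * zj, cond_std)  # sample from p(z_i | z_j)
    zj = rng.normal(rho * zi, cond_std)  # sample from p(z_j | z_i)
    if step >= burn_in:                  # discard the burn-in period
        samples.append((zi, zj))

samples = np.array(samples)
# after burn-in these are (correlated) draws from the joint distribution
print(samples.mean(axis=0), np.corrcoef(samples.T)[0, 1])
```

The retained draws recover the target's mean (0, 0) and correlation ρ despite the poor starting point, which is exactly the burn-in behavior sketched on the slide.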
Examples from Text
showed four other topics from this solution). In each of these topics, the word PLAY is given relatively high probability related to the different senses of the word (playing music, theater play, playing games).
Topic 77            Topic 82               Topic 166
word      prob.     word         prob.     word      prob.
MUSIC     .090      LITERATURE   .031      PLAY      .136
DANCE     .034      POEM         .028      BALL      .129
SONG      .033      POETRY       .027      GAME      .065
PLAY      .030      POET         .020      PLAYING   .042
SING      .026      PLAYS        .019      HIT       .032
SINGING   .026      POEMS        .019      PLAYED    .031
BAND      .026      PLAY         .015      BASEBALL  .027
PLAYED    .023      LITERARY     .013      GAMES     .025
SANG      .022      WRITERS      .013      BAT       .019
SONGS     .021      DRAMA        .012      RUN       .019
DANCING   .020      WROTE        .012      THROW     .016
PIANO     .017      POETS        .011      BALLS     .015
PLAYING   .016      WRITER       .011      TENNIS    .011
RHYTHM    .015      SHAKESPEARE  .010      HOME      .010
ALBERT    .013      WRITTEN      .009      CATCH     .010
MUSICAL   .013      STAGE        .009      FIELD     .010
Figure 9. Three topics related to the word PLAY.
Document #29795
Bix beiderbecke, at age060 fifteen207, sat174 on the slope071 of a bluff055 overlooking027 the mississippi137 river137. He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix beiderbecke had already had music077 lessons077. He showed002 promise134 on the piano077, and his parents035 hoped268 he might consider118 becoming a concert077 pianist077. But bix was interested268 in another kind050 of music077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077...
Document #1883
There is a simple050 reason106 why there are so few periods078 of really great theater082 in our whole western046 world. Too many things300 have to come right at the very same time. The dramatists must have the right actors082, the actors082 must have the right playhouses, the playhouses must have the right audiences082. We must remember288 that plays082 exist143 to be performed077, not merely050 to be read254. ( even when you read254 a play082 to yourself, try288 to perform062 it, to put174 it on a stage078, as you go along.) as soon028 as a play082 has to be performed082, then some kind126 of theatrical082...
Document #21359
Jim296 has a game166 book254. Jim296 reads254 the book254. Jim296 sees081 a game166 for one. Jim296 plays166 the game166. Jim296 likes081 the game166 for one. The game166 book254 helps081 jim296. Don180 comes040 into the house038. Don180 and jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166. The boys020 play166 the game166 for two. The boys020 like the game166. Meg282 comes040 into the house282. Meg282 and don180 and jim296 read254 the book254. They see a game166 for three. Meg282 and don180 and jim296 play166 the game166. They play166...
Figure 10. Three TASA documents with the word play.
In a new context, having only observed a single word PLAY, there would be uncertainty over which of these topics could have generated this word. This uncertainty can be reduced by observing other less ambiguous words in context. The disambiguation process can be described by the process of iterative sampling as described in the previous section (Equation 4), where the assignment of each word token to a topic depends on the assignments of the other words in the context. In Figure 10, fragments of three documents are shown from TASA that use PLAY in
showed four other topics from this solution). In each of these topics, the word PLAY is given relatively high probability related to the different senses of the word (playing music, theater play, playing games).
word prob. word prob. word prob.MUSIC .090 LITERATURE .031 PLAY .136
DANCE .034 POEM .028 BALL .129SONG .033 POETRY .027 GAME .065PLAY .030 POET .020 PLAYING .042SING .026 PLAYS .019 HIT .032
SINGING .026 POEMS .019 PLAYED .031BAND .026 PLAY .015 BASEBALL .027
PLAYED .023 LITERARY .013 GAMES .025SANG .022 WRITERS .013 BAT .019
SONGS .021 DRAMA .012 RUN .019DANCING .020 WROTE .012 THROW .016
PIANO .017 POETS .011 BALLS .015PLAYING .016 WRITER .011 TENNIS .011RHYTHM .015 SHAKESPEARE .010 HOME .010ALBERT .013 WRITTEN .009 CATCH .010
MUSICAL .013 STAGE .009 FIELD .010
Topic 166Topic 82Topic 77
Figure 9. Three topics related to the word PLAY.
Document #29795
Bix beiderbecke, at age060 fifteen207, sat174 on the slope071 of a bluff055 overlooking027 the mississippi137 river137. He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix beiderbecke had already had music077 lessons077. He showed002 promise134 on the piano077, and his parents035 hoped268 he might consider118 becoming a concert077 pianist077. But bix was interested268 in another kind050 of music077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077...
Document #1883
There is a simple050 reason106 why there are so few periods078 of really great theater082 in our whole western046 world. Too many things300 have to come right at the very same time. The dramatists must have the right actors082, the actors082 must have the right playhouses, the playhouses must have the right audiences082. We must remember288 that plays082 exist143 to be performed077, not merely050 to be read254. ( even when you read254 a play082 to yourself, try288 to perform062 it, to put174 it on a stage078, as you go along.) as soon028 as a play082 has to be performed082, then some kind126 of theatrical082...
Document #21359
Jim296 has a game166 book254. Jim296 reads254 the book254. Jim296 sees081 a game166 for one. Jim296 plays166 the game166. Jim296 likes081 the game166 for one. The game166 book254 helps081 jim296. Don180 comes040 into the house038. Don180 and jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166. The boys020 play166 the game166 for two. The boys020 like the game166. Meg282 comes040 into the house282. Meg282 and don180 and jim296 read254 the book254. They see a game166 for three. Meg282 and don180 and jim296 play166 the game166. They play166...
Figure 10. Three TASA documents with the word play.
In a new context, having only observed a single word PLAY, there would be uncertainty over which of these topics could have generated this word. This uncertainty can be reduced by observing other less ambiguous words in context. The disambiguation process can be described by the process of iterative sampling as described in the previous section (Equation 4), where the assignment of each word token to a topic depends on the assignments of the other words in the context. In Figure 10, fragments of three documents are shown from TASA that use PLAY in
11
showed four other topics from this solution). In each of these topics, the word PLAY is given relatively high probability related to the different senses of the word (playing music, theater play, playing games).
word prob. word prob. word prob.MUSIC .090 LITERATURE .031 PLAY .136
DANCE .034 POEM .028 BALL .129SONG .033 POETRY .027 GAME .065PLAY .030 POET .020 PLAYING .042SING .026 PLAYS .019 HIT .032
SINGING .026 POEMS .019 PLAYED .031BAND .026 PLAY .015 BASEBALL .027
PLAYED .023 LITERARY .013 GAMES .025SANG .022 WRITERS .013 BAT .019
SONGS .021 DRAMA .012 RUN .019DANCING .020 WROTE .012 THROW .016
PIANO .017 POETS .011 BALLS .015PLAYING .016 WRITER .011 TENNIS .011RHYTHM .015 SHAKESPEARE .010 HOME .010ALBERT .013 WRITTEN .009 CATCH .010
MUSICAL .013 STAGE .009 FIELD .010
Topic 166Topic 82Topic 77
Figure 9. Three topics related to the word PLAY.
Document #29795
Bix beiderbecke, at age060 fifteen207, sat174 on the slope071 of a bluff055 overlooking027 the mississippi137 river137. He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix beiderbecke had already had music077 lessons077. He showed002 promise134 on the piano077, and his parents035 hoped268 he might consider118 becoming a concert077 pianist077. But bix was interested268 in another kind050 of music077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077...
Document #1883
There is a simple050 reason106 why there are so few periods078 of really great theater082 in our whole western046 world. Too many things300 have to come right at the very same time. The dramatists must have the right actors082, the actors082 must have the right playhouses, the playhouses must have the right audiences082. We must remember288 that plays082 exist143 to be performed077, not merely050 to be read254. ( even when you read254 a play082 to yourself, try288 to perform062 it, to put174 it on a stage078, as you go along.) as soon028 as a play082 has to be performed082, then some kind126 of theatrical082...
Document #21359
Jim296 has a game166 book254. Jim296 reads254 the book254. Jim296 sees081 a game166 for one. Jim296 plays166 the game166. Jim296 likes081 the game166 for one. The game166 book254 helps081 jim296. Don180 comes040 into the house038. Don180 and jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166. The boys020 play166 the game166 for two. The boys020 like the game166. Meg282 comes040 into the house282. Meg282 and don180 and jim296 read254 the book254. They see a game166 for three. Meg282 and don180 and jim296 play166 the game166. They play166...
Figure 10. Three TASA documents with the word play.
In a new context, having only observed a single word PLAY, there would be uncertainty over which of these topics could have generated this word. This uncertainty can be reduced by observing other less ambiguous words in context. The disambiguation process can be described by the process of iterative sampling as described in the previous section (Equation 4), where the assignment of each word token to a topic depends on the assignments of the other words in the context. In Figure 10, fragments of three documents are shown from TASA that use PLAY in
11
showed four other topics from this solution). In each of these topics, the word PLAY is given relatively high probability related to the different senses of the word (playing music, theater play, playing games).
word prob. word prob. word prob.MUSIC .090 LITERATURE .031 PLAY .136
DANCE .034 POEM .028 BALL .129SONG .033 POETRY .027 GAME .065PLAY .030 POET .020 PLAYING .042SING .026 PLAYS .019 HIT .032
SINGING .026 POEMS .019 PLAYED .031BAND .026 PLAY .015 BASEBALL .027
PLAYED .023 LITERARY .013 GAMES .025SANG .022 WRITERS .013 BAT .019
SONGS .021 DRAMA .012 RUN .019DANCING .020 WROTE .012 THROW .016
PIANO .017 POETS .011 BALLS .015PLAYING .016 WRITER .011 TENNIS .011RHYTHM .015 SHAKESPEARE .010 HOME .010ALBERT .013 WRITTEN .009 CATCH .010
MUSICAL .013 STAGE .009 FIELD .010
Topic 166Topic 82Topic 77
Figure 9. Three topics related to the word PLAY.
Document #29795
Bix beiderbecke, at age060 fifteen207, sat174 on the slope071 of a bluff055 overlooking027 the mississippi137 river137. He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix beiderbecke had already had music077 lessons077. He showed002 promise134 on the piano077, and his parents035 hoped268 he might consider118 becoming a concert077 pianist077. But bix was interested268 in another kind050 of music077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077...
Document #1883
There is a simple050 reason106 why there are so few periods078 of really great theater082 in our whole western046 world. Too many things300 have to come right at the very same time. The dramatists must have the right actors082, the actors082 must have the right playhouses, the playhouses must have the right audiences082. We must remember288 that plays082 exist143 to be performed077, not merely050 to be read254. ( even when you read254 a play082 to yourself, try288 to perform062 it, to put174 it on a stage078, as you go along.) as soon028 as a play082 has to be performed082, then some kind126 of theatrical082...
Document #21359
Jim296 has a game166 book254. Jim296 reads254 the book254. Jim296 sees081 a game166 for one. Jim296 plays166 the game166. Jim296 likes081 the game166 for one. The game166 book254 helps081 jim296. Don180 comes040 into the house038. Don180 and jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166. The boys020 play166 the game166 for two. The boys020 like the game166. Meg282 comes040 into the house282. Meg282 and don180 and jim296 read254 the book254. They see a game166 for three. Meg282 and don180 and jim296 play166 the game166. They play166...
Figure 10. Three TASA documents with the word play.
In a new context, having only observed a single word PLAY, there would be uncertainty over which of these topics could have generated this word. This uncertainty can be reduced by observing other less ambiguous words in context. The disambiguation process can be described by the process of iterative sampling as described in the previous section (Equation 4), where the assignment of each word token to a topic depends on the assignments of the other words in the context. In Figure 10, fragments of three documents are shown from TASA that use PLAY in
Figure from [Griffiths]
High Level Computer Vision - July 15, 2014
Discovering Objects and Their Location in Images
J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, W. T. Freeman.
Visual analogy
• document ↔ image
• word ↔ visual word
• topic ↔ object
System overview
Input image → compute visual words → discover visual topics
Bag of words
• The LDA model assumes exchangeability: the order of words does not matter
• Stack the visual-word histograms as columns of a matrix; this throws away all spatial information
[Figure: interest regions → visual words → histogram → dictionary; each detected region is assigned the index of its nearest dictionary entry, and the indices are accumulated into a histogram]
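The bag-of-words construction above can be sketched as follows. The descriptors here are random stand-ins for real SIFT-like region descriptors, and all names are illustrative, not the paper's code.

```python
import numpy as np
from sklearn.cluster import KMeans

def bow_matrix(descriptor_sets, vocab_size=10, seed=0):
    """Quantize local region descriptors against a learned visual
    dictionary and stack the per-image word histograms as the columns
    of a (vocab_size x n_images) matrix; spatial layout is discarded."""
    all_desc = np.vstack(descriptor_sets)
    km = KMeans(n_clusters=vocab_size, n_init=5, random_state=seed).fit(all_desc)
    cols = []
    for desc in descriptor_sets:
        words = km.predict(desc)                        # region -> visual word id
        cols.append(np.bincount(words, minlength=vocab_size))
    return np.stack(cols, axis=1)

# toy stand-ins for SIFT-like descriptors: 4 images, 30 regions each, 128-D
rng = np.random.default_rng(0)
images = [rng.normal(size=(30, 128)) for _ in range(4)]
M = bow_matrix(images)
```

Each column of M is one image's word histogram, which is exactly the exchangeable representation the topic models below operate on.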
Low-rank matrix factorization
• Latent Semantic Analysis (Deerwester, et al. 1990) • Probabilistic Latent Semantic Analysis (Hofmann 2001)
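The rank-reduction step at the heart of LSA can be sketched with a plain truncated SVD; this is a generic illustration, not the cited systems' code.

```python
import numpy as np

def lsa_approx(M, k):
    """Rank-k approximation of a word-by-document matrix via truncated SVD,
    the core operation of Latent Semantic Analysis."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

# a rank-1 co-occurrence pattern is recovered exactly with k = 1
M = np.outer([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
M1 = lsa_approx(M, 1)
```

pLSA replaces this linear-algebraic factorization with a probabilistic one, factoring the word-document co-occurrence matrix into non-negative topic distributions.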
Apply to Caltech 4 + background images
Faces: 435
Motorbikes: 800
Airplanes: 800
Cars (rear): 1155
Background: 900
Total: 4090
Most likely words given topic
[Figure: the two most likely visual words, shown as image patches, for Topic 1 and Topic 2]
Most likely words given topic
[Figure: the two most likely visual words, shown as image patches, for Topic 3 and Topic 4]
Polysemy
Regions that map to the same visual word:
In English, "bank" refers to:
1. an institution that handles money
2. the side of a river
Image clustering
Average confusion:
Confusion matrices:
[four confusion matrices over the classes faces, motorbikes, cars (rear), airplanes, background]
Image as a mixture of topics (objects)
Discussion
• Discovered visual topics corresponding to object categories from a corpus of unlabeled images
• Used visual words representation and topic discovery models from the text understanding community
• Classification on unseen images is comparable to supervised methods on Caltech 5 dataset
• The discovered categories can be localized within an image
Learning Object Categories from Contaminated Data
Rob Fergus, Li Fei-Fei, Pietro Perona, Andrew Zisserman
Google’s variable search performance
Web Image Query
• Use Google's automatic translation tool to translate the keyword
• Languages used: German, French, Italian, Spanish, Chinese, English
• Take the first 5 images returned for each translated keyword to form a validation set
Overall learning scheme
• Keyword → collect images from Google → find regions → VQ regions → learn 8-topic pLSA model → propose bounding boxes → learn 8-topic TSI-pLSA model
• Keyword → translate keyword → collect validation set
• Pick the best topic using the validation set → final classifier
Motorbike – pLSA
Motorbike – TSI-pLSA
Car Rear – TSI-pLSA
Decomposition, Discovery, and Detection of Visual Categories Using Topic Models
Mario Fritz Bernt Schiele
Generative Decomposition of Visual Categories for Unsupervised Learning and Supervised Detection
• Learning of generative decompositions:
‣ learn an intermediate representation instead of doing object discovery
‣ learnt components range from local to global
• Unsupervised Google re-ranking task
• Supervised detection by combining generative decompositions with a discriminative classifier
• Quantitative evaluation:
‣ faster learning curve
‣ comparison to the state of the art on detection
‣ comparison to hard-coded special-purpose features
Object Representation
• Dense object representation:
‣ oriented-gradient-histogram representation ([Dalal&Triggs'05], [Bissacco'06])
‣ spatial grid of local histograms
‣ 9 possible edge orientations per local histogram
‣ e.g. a 16 x 10 grid x 9 orientations
• Goal:
‣ learn for each object a decomposition into sub-structures
‣ learn for all objects a set of shared sub-structures
Object Representation
Each oriented-gradient histogram h is decomposed into a mixture of T topic distributions \phi_j with mixing weights \theta_j:

h \;\approx\; \sum_{j=1}^{T} \theta_j\,\phi_j \;=\; \theta_1\phi_1 + \theta_2\phi_2 + \theta_3\phi_3 + \theta_5\phi_5 + \dots

sparsity prior Dirichlet(\alpha) on the mixing weights \theta
sparsity prior Dirichlet(\beta) on the topic distributions \phi
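Numerically, this decomposition with Dirichlet sparsity priors can be sketched as below. The number of topics and the hyperparameter values are assumptions for illustration; the grid layout matches the 16 x 10 x 9 example above.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 4                     # number of topics (assumed for illustration)
V = 16 * 10 * 9           # histogram bins: 16 x 10 grid cells x 9 orientations
# sparse topic distributions phi_j, one per topic (Dirichlet(beta) prior)
phi = rng.dirichlet(np.full(V, 0.05), size=T)
# sparse mixing weights theta (Dirichlet(alpha) prior)
theta = rng.dirichlet(np.full(T, 0.5))
# the object's gradient histogram as a mixture of topic distributions
h = theta @ phi
```

Because theta and every row of phi sum to one, the reconstructed histogram h is itself a valid distribution over the 1440 bins.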
Learning of Generative Decompositions
• Probabilistic topic model:
‣ a document contains a mixture of topics
‣ each topic generates a mixture of words
‣ (Latent Dirichlet Allocation model [Blei'03], estimated by Gibbs sampling [Griffiths'04])
• Dense object representation:
‣ document = image of an object
‣ word = localized edge orientation
Unsupervised Learning of Topics/Feature Representation (Pascal’06 dataset with 10 classes)
Visualizing Neighborhood in Latent (Topic) Space by k-means clustering
Unsupervised Learning Task: Results on Google-Reranking Task
• Task: improve the precision of the first 3 pages returned by Google image search
• Topics trained per class:
‣ cluster images in the latent topic space
‣ re-rank using the largest topic clusters
• Quantitative results:
‣ precision at 15% recall
‣ better than the local-features approach of [Fergus05]
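The cluster-and-rerank step can be sketched as below. Function and variable names are illustrative, and the paper's exact procedure may differ; the idea is simply that images whose topic-distribution vectors fall in the largest clusters are moved to the front.

```python
import numpy as np
from sklearn.cluster import KMeans

def rerank_by_largest_cluster(theta, n_clusters=2, seed=0):
    """Cluster the images' topic-distribution vectors with k-means and
    move members of larger clusters to the front of the ranking."""
    labels = KMeans(n_clusters=n_clusters, n_init=5,
                    random_state=seed).fit_predict(theta)
    sizes = np.bincount(labels, minlength=n_clusters)
    # stable sort: big clusters first, original order kept inside a cluster
    return np.argsort(-sizes[labels], kind="stable")

# toy topic vectors: 6 images dominated by topic 0, 3 by topic 1
theta = np.vstack([np.tile([0.9, 0.05, 0.05], (6, 1)),
                   np.tile([0.05, 0.9, 0.05], (3, 1))])
order = rerank_by_largest_cluster(theta)
```

On contaminated search results, the largest cluster tends to correspond to the dominant (hopefully on-topic) visual category, which is why this improves precision on the first pages.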
Extension to Supervised Detection
• Generative/discriminative training:
‣ generative model:
- the topic model decomposes objects into a mixture of topics
- objects are represented via their topic-distribution vectors
‣ discriminative training:
- SVM classifier (RBF kernel / chi-square kernel) on the topic-distribution vectors
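The discriminative stage can be sketched with a chi-square kernel SVM on topic-distribution vectors. The data here are toy Dirichlet samples, and sklearn's `chi2_kernel` stands in for whichever implementation the authors used.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

# toy topic-distribution vectors: positives peak on topic 0, negatives on topic 2
rng = np.random.default_rng(0)
pos = rng.dirichlet([5, 1, 1], size=20)
neg = rng.dirichlet([1, 1, 5], size=20)
X = np.vstack([pos, neg])
y = np.array([1] * 20 + [0] * 20)

# SVC accepts a callable kernel; chi-square suits non-negative histogram-like inputs
clf = SVC(kernel=chi2_kernel).fit(X, y)
```

The chi-square kernel is a common choice for histogram features because it compares distributions bin by bin rather than treating them as points in a Euclidean space.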
Improved Learning Curve Experiment on UIUC cars
• SVM training using the intermediate topic representation:
‣ maximal performance reached with 150 training samples
• SVM trained directly on oriented gradient histograms:
‣ maximal performance reached with 250 training samples
‣ overall performance lower
Pascal 2006 Visual Object Class Challenge (10 classes)
Pascal 2006 Visual Object Class Challenge (10 classes)
• Outperforms best approaches on ‣ cars (+ 5.75% AP)
‣ bus (+ 9.14% AP)
‣ bicycle (+ 5.67% AP)
• Among the best ‣ motorbikes
• About average for articulated objects ‣ cat
‣ cow
‣ dog
‣ horse
‣ sheep
‣ person
slide credits
• Josef Sivic • Trevor Darrell • Rob Fergus • Fei-Fei Li