Indexing and Searching Timed Media: the Role of Mutual Information Models
Tony Davis (StreamSage/Comcast)
IPAM, UCLA
1 Oct. 2007
A bit about what we do…
StreamSage (now a part of Comcast) focuses on indexing and retrieval of “timed” media (video and audio, aka “multimedia” or “broadcast media”)
A variety of applications, but now centered on cable TV
This is joint work with many members of the research team: Shachi Dave, Abby Elbow, David Houghton, I-Ju Lai, Hemali Majithia, Phil Rennert, Kevin Reschke, Robert Rubinoff, Pinaki Sinha, Goldee Udani
Overview
Theme: use of term association models to address challenges of timed media
Problems addressed:
Retrieving all and only the relevant portions of timed media for a query
Lexical semantics (WSD, term expansion, compositionality of multi-word units)
Ontology enhancement
Topic detection, clustering, and similarity among documents
Automatic indexing and retrieval of streaming media
Streaming media presents particular difficulties:
Its timed nature makes navigation cumbersome, so the system must extract relevant intervals of documents rather than present a list of documents to the user
Speech recognition is error-prone, so the system must compensate for noisy input
We use a mix of statistical and symbolic NLP. Various modules factor into calculating relevant intervals for each term:
1. Word sense disambiguation
2. Query expansion
3. Anaphor resolution
4. Name recognition
5. Topic segmentation and identification
Statistical and NLP Foundations: the COW and YAKS Models
Two types of models
COW (Co-Occurring Words) Based on proximity of terms, using a sliding window
YAKS (Yet Another Knowledge Source) Based on grammatical relationships between terms
The COW Model
Large mutual information model of word co-occurrence
MI(X,Y) = P(X,Y) / (P(X)P(Y))
Thus, COW values greater than 1 indicate correlation (tendency to co-occur); less than 1, anticorrelation
Values are adjusted for centrality (salience)
Two main COW models:
New York Times, based on 325 million words (about 6 years) of text
Wikipedia, more recent, roughly the same amount of text
We also have specialized COW models (for medicine, business, and others), as well as models for other languages
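The sliding-window construction behind a COW table can be sketched as follows. This is a toy illustration of the MI ratio above, not the production model: it omits lemmatization, POS tagging, and the centrality/salience adjustment, and the window size is arbitrary.

```python
from collections import Counter
from itertools import combinations

def cow_values(tokens, window=10):
    """Sliding-window co-occurrence model: MI(X,Y) = P(X,Y) / (P(X) * P(Y)).

    Values > 1 indicate correlation (tendency to co-occur), values < 1
    anticorrelation, matching the interpretation of COW values above.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    pairs = Counter()
    windows = 0
    for i in range(n - window + 1):
        win = set(tokens[i:i + window])   # distinct terms in this window
        windows += 1
        for x, y in combinations(sorted(win), 2):
            pairs[(x, y)] += 1
    mi = {}
    for (x, y), c in pairs.items():
        p_xy = c / windows                # joint probability over windows
        p_x = unigrams[x] / n
        p_y = unigrams[y] / n
        mi[(x, y)] = p_xy / (p_x * p_y)
    return mi
```

On a corpus where 'railroad' and 'freight' keep appearing together, their pair gets a value well above 1, while a pair that rarely shares a window falls below 1.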
The COW (Co-Occurring Words) Model
The COW model is at the core of what we do
Relevance interval construction
Document segmentation and topic identification
Word sense disambiguation (large-scale and unsupervised, based on clustering COWs of an ambiguous word)
Ontology construction
Determining semantic relatedness of terms
Determining specificity of a term
The COW (Co-Occurring Words) Model An example: top 10 COWs of railroad//N
post//J 165.554
shipper//N 123.568
freight//N 121.375
locomotive//N 119.602
rail//N 73.7922
railway//N 64.6594
commuter//N 63.4978
pickups//N 48.3637
train//N 44.863
burlington//N 41.4952
Multi-Word Units (MWUs)
Why do we care about MWUs? Because they act like single words in many cases, but also:
MWUs are often powerful disambiguators of words within them (see, e.g., Yarowsky (1995), Pedersen (2002) for WSD methods that exploit this):
‘fuel tank’, ‘fish tank’, ‘tank tread’; ‘indoor pool’, ‘labor pool’, ‘pool table’
Useful in query expansion: ‘Dept. of Agriculture’ / ‘Agriculture Dept.’, ‘hookworm in dogs’ / ‘canine hookworm’
Provide many terms that can be added to ontologies: ‘commuter railroad’, ‘peanut butter’
Multi-Word Units (MWUs) in our models
MWUs in our system
We extract nominal MWUs, using a simple procedure based on POS tagging. Example patterns:
1. ({N,J}) ({N,J}) N
2. N Prep (‘the’) ({N,J}) N, where Prep is ‘in’, ‘on’, ‘of’, ‘to’, ‘by’, ‘with’, ‘without’, ‘for’, or ‘against’
For the most common 100,000 or so MWUs in our corpus, we calculate COW values, as we do for words
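A rough sketch of this pattern-based extraction follows. The tiny tag lexicon is a stand-in for a real POS tagger, and parenthesized pattern elements are treated as optional, which is an assumption about the notation above.

```python
# Toy MWU extractor for the two POS patterns above.
# Tags: N (noun), J (adjective), P (listed preposition), D ('the'), O (other).

PREPS = {"in", "on", "of", "to", "by", "with", "without", "for", "against"}

def tag(word):
    # Stand-in for a real POS tagger: a toy lexicon for this example only.
    lexicon = {"commuter": "N", "railroad": "N", "peanut": "N", "butter": "N",
               "department": "N", "agriculture": "N", "the": "D"}
    if word in PREPS:
        return "P"
    return lexicon.get(word, "O")

def extract_mwus(words):
    """Return nominal MWUs matching:
       1. ({N,J}) ({N,J}) N            e.g. 'commuter railroad'
       2. N Prep ('the') ({N,J}) N     e.g. 'department of agriculture'"""
    tags = [tag(w) for w in words]
    n = len(words)
    mwus = set()
    # Pattern 1: one or two noun/adjective modifiers before a head noun.
    for i in range(n):
        if tags[i] != "N":
            continue
        for start in (i - 2, i - 1):
            if start >= 0 and all(t in ("N", "J") for t in tags[start:i]):
                mwus.add(" ".join(words[start:i + 1]))
    # Pattern 2: N Prep, optional 'the', optional {N,J}, final N.
    for i in range(n):
        if tags[i] != "N" or i + 1 >= n or tags[i + 1] != "P":
            continue
        k = i + 2
        if k < n and tags[k] == "D":
            k += 1                                 # skip optional 'the'
        if k < n and tags[k] == "N":
            mwus.add(" ".join(words[i:k + 1]))
        if k + 1 < n and tags[k] in ("N", "J") and tags[k + 1] == "N":
            mwus.add(" ".join(words[i:k + 2]))
    return mwus
```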
COWs of MWUs An example: top ten COWs of ‘commuter railroad’
post//J 1234.47
pickups//N 315.005
rail//N 200.839
sanitation//N 186.99
weekday//N 135.609
transit//N 134.329
commuter//N 119.435
subway//N 86.6837
transportation//N 86.487
railway//N 86.2851
COWs of MWUs Another example: top ten COWs of ‘underground railroad’
abolitionist//N 845.075
slave//N 401.732
gourd//N 266.538
runaway//J 226.163
douglass//N 170.459
slavery//N 157.654
harriet//N 131.803
quilt//N 109.241
quaker//N 94.6592
historic//N 86.0395
The YAKS model
Motivations:
COW values reflect simple co-occurrence or association, but no particular relationship beyond that
For some purposes, it’s useful to measure the association between two terms in a particular syntactic relationship
Construction:
Parse a lot of text (the same 325 million words of New York Times text used to build our NYT COW model); however, long sentences (>25 words) were discarded, as parsing them was slow and error-prone
The parser’s output provides information about grammatical relations between words in a clause; to measure the association of a verb (say ‘drink’) and a noun as its object (say ‘beer’), we consider the set of all verb-object pairs, and calculate mutual information over that set
Calculate MI for broader semantic classes of terms, e.g.: food, substance. Semantic classes were taken from Cambridge International Dictionary of English (CIDE); there are about 2000 of them, arranged in a shallow hierarchy
YAKS Examples Some objects of ‘eat’
OBJ head=eat
arg1=hamburger 139.249
arg1=pretzel 90.359
arg1=:Food 18.156
arg1=:Substance 7.89
arg1=:Sound 0.324
arg1=:Place 0.448
Relevance Intervals
Relevance Intervals (RIs)
Each RI is a contiguous segment of audio/video deemed relevant to a term
1. RIs are calculated for all content words (after lemmatization) and multi-word expressions
2. RI basis: the sentence containing the term
3. Each RI is expanded forward and backward to capture relevant material, using techniques including: topic boundary detection by changes in COW values across sentences; topic boundary detection via discourse markers; synonym-based query expansion; anaphor resolution
4. Nearby RIs for the same term are merged
5. Each RI is assigned a magnitude, reflecting its likely importance to a user searching on that term, based on the number of occurrences of the term in the RI and the COW values of other words in the RI with the term
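Step 4, merging nearby RIs, can be sketched as below. The two-second gap threshold is an illustrative assumption; the slides do not give the actual merging distance.

```python
def merge_intervals(intervals, max_gap=2.0):
    """Merge nearby relevance intervals for the same term.

    Intervals are (start_sec, end_sec) pairs; any interval starting within
    max_gap seconds of the previous one's end is absorbed into it. A
    simplified sketch of the merging step only."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```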
Relevance Intervals: an Example Index term: squatter
Among the sentences containing this term are these two, near each other:
Paul Bew is professor of Irish politics at Queens University in Belfast. In South Africa the government is struggling to contain a growing demand for land from its black citizens. Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg. In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days. NPR’s Kenneth Walker has a report. Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa. “Must give us a place…”
We build an RI for squatter around each of these sentences…
Relevance Intervals: an Example Index term: squatter
Among the sentences containing this term are these two, near each other:
Paul Bew is professor of Irish politics at Queens University in Belfast. In South Africa the government is struggling to contain a growing demand for land from its black citizens. [cow-expand] Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg. In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. [cow-expand] Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days. NPR’s Kenneth Walker has a report. Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa. “Must give us a place…” [topic segment boundary]
We build an RI for squatter around each of these sentences…
Relevance Intervals: an Example Index term: squatter
Among the sentences containing this term are these two, near each other:
Paul Bew is professor of Irish politics at Queens University in Belfast. [topic segment boundary] In South Africa the government is struggling to contain a growing demand for land from its black citizens. [cow-expand] Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg. In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. [cow-expand] Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days. [merge nearby intervals] NPR’s Kenneth Walker has a report. [merge nearby intervals] Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa. “Must give us a place…” [topic segment boundary]
Two occurrences of squatter produce a complete merged interval.
Relevance Intervals and Virtual Documents
The set of all the RIs for a term in a document constitutes the virtual document for that term
In effect, the VD for a term is intended to approximate a document that would have been produced had the authors focused solely on that term
A VD is assigned a magnitude equal to the highest magnitude of the RIs in it, with a bonus if more than one RI has a similarly high magnitude
Merging RIs for multiple terms
[Diagram: RIs for ‘Iran’ and ‘Russia’ in the same document, built by activation spreading from occurrences of the original terms, are merged to produce intervals relevant to the query ‘Russia and Iran’]
Note that this can only be done at query time, so it needs to be fairly quick and simple.
Evaluating RIs and VDs
Evaluation of retrieval effectiveness in timed media raises further issues:
Building a gold standard is painstaking, and potentially more subjective
It’s necessary to measure how closely the system’s RIs match the gold standard’s
What’s a reasonable baseline?
We created a gold standard of about 2300 VDs with about 200 queries on about 50 documents (NPR, CNN, ABC, and business webcasts), and rated each RI in a VD on a scale of 1 (highly relevant) to 3 (marginally relevant).
Testing of the system was performed on speech recognizer output
Evaluating RIs and VDs
Measure amounts of extraneous and missed content
[Diagram: a system RI overlaid on the ideal RI; content in the ideal RI but not the system RI is missed, content in the system RI but not the ideal RI is extraneous]
Evaluating RIs and VDs
Comparison of percentages of median extraneous and missed content over all queries, between the system using COWs and a system using only sentences with query terms present:

                most relevant    relevant       most relevant    marginally
                                                and relevant     relevant
system          extra   miss     extra   miss   extra   miss     extra   miss
With COWs       9.3     12.7     39.2    27.5   29.9    21.6     61.7    15.0
Without COWs    0.0     57.7     0.3     64.7   0.0     63.4     18.8    60.2
MWUs and compositionality
MWUs, idioms, and compositionality
Several partially independent factors are in play here (Calzolari, et al. 2002):
1. reduced syntactic and semantic transparency;
2. reduced or lack of compositionality;
3. more or less frozen or fixed status;
4. possible violation of some otherwise general syntactic patterns or rules;
5. a high degree of lexicalization (depending on pragmatic factors);
6. a high degree of conventionality.
MWUs, idioms, and compositionality
In addition, there are two kinds of “mixed” cases:
Ambiguous MWUs, with one meaning compositional and the other not: ‘end of the tunnel’, ‘underground railroad’
“Normal” use of some component words, but not others:
‘flea market’ (a kind of market)
‘peanut butter’ (a spread made from peanuts)
Automatic detection of non-compositionality
Previous work
Lin (1999): “based on the hypothesis that when a phrase is non-compositional, its mutual information differs significantly from the mutual informations [sic] of phrases obtained by substituting one of the words in the phrase with a similar word.”
For instance, the distribution of ‘peanut’ and ‘butter’ should differ from that of ‘peanut’ and ‘margarine’
Results are not very good yet, because semantically-related words often have quite different distributions, and many compositional collocations are “institutionalized”, so that substituting words within them will change distributional statistics.
Automatic detection of non-compositionality
Previous work
Baldwin, et al. (2002): “use latent semantic analysis to determine the similarity between a multiword expression and its constituent words”; “higher similarities indicate greater decomposability”
“Our expectation is that for constituent word-MWE pairs with higher LSA similarities, there is a greater likelihood of the MWE being a hyponym of the constituent word.” (for head words of MWEs)
“correlate[s] moderately with WordNet-based hyponymy values.”
Automatic detection of non-compositionality
We use the COW model for a related approach to the problem:
COWs (and COW values) of an MWU and its component words will be more alike if the MWU is compositional.
We use a measure of occurrences of a component word near an MWU as another criterion of compositionality: the more often words in the MWU appear near it, but not as a part of it, the more likely it is that the MWU is compositional.
COW pair sum measure
Get the top n COWs of an MWU, and of one of its component words. For each pair of COWs (one from each of these lists), find their COW value.
railroad commuter railroad
post//J post//J
shipper//N pickups//N
freight//N rail//N
COW pair sum measure
Get the top n COWs of an MWU, and of one of its component words. For each pair of COWs (one from each of these lists), find their COW value. Then sum up these values. This provides a measure of how similar the contexts in which the MWU and its component word appear are.
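The COW pair sum can be sketched as below. The `cow` mapping is a toy stand-in for the full COW table; pairs absent from it are scored 0, an assumption about how missing values are handled.

```python
def cow_pair_sum(cow, mwu_cows, word_cows, n=10):
    """COW pair sum measure: for the top-n COWs of an MWU and of one of
    its component words, sum the COW values of all cross pairs.

    Higher sums suggest the MWU and the word appear in similar contexts,
    hence a more compositional MWU. `cow` maps an unordered word pair
    (a frozenset) to its COW value."""
    total = 0.0
    for x in mwu_cows[:n]:
        for y in word_cows[:n]:
            total += cow.get(frozenset((x, y)), 0.0)
    return total
```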
Feature overlap measure
Get the top n COWs (and values) of an MWU, and of one of its component words.
For each COW with a value greater than some threshold, treat that COW as a feature of the term.
Then compute the overlap coefficient (Jaccard coefficient); for two sets of features A and B:
|A ∩ B| / |A ∪ B|
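A sketch of the feature overlap measure. The COW-value threshold of 10.0 is an illustrative assumption; the slides do not state the actual cutoff.

```python
def feature_overlap(mwu_cows, word_cows, threshold=10.0):
    """Feature overlap measure: treat each COW above `threshold` as a
    feature of the term, then compare the MWU's and the component word's
    feature sets with the Jaccard coefficient |A ∩ B| / |A ∪ B|."""
    a = {w for w, v in mwu_cows.items() if v > threshold}
    b = {w for w, v in word_cows.items() if v > threshold}
    if not a | b:
        return 0.0
    return len(a & b) / len(a | b)
```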
Occurrence-based measure
For each occurrence of an MWU, determine if a given component word occurs in a window around that occurrence, but not as part of that MWU.
Calculate the proportion of occurrences for which this is the case, compared to all occurrences of the MWU.
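The occurrence-based measure can be sketched as follows. The ±10-token window is an assumption for illustration; the slides do not give the window size.

```python
def occurrence_measure(text_tokens, mwu):
    """Occurrence-based compositionality measure: the fraction of
    occurrences of an MWU for which a component word also appears nearby
    (here, within 10 tokens on either side) but not as part of the MWU
    occurrence itself."""
    parts = mwu.split()
    k = len(parts)
    positions = [i for i in range(len(text_tokens) - k + 1)
                 if text_tokens[i:i + k] == parts]
    if not positions:
        return 0.0
    hits = 0
    for p in positions:
        # Tokens around the MWU occurrence, excluding the MWU itself.
        window = text_tokens[max(0, p - 10):p] + text_tokens[p + k:p + k + 10]
        if any(w in window for w in parts):
            hits += 1
    return hits / len(positions)
```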
Testing the measures
We extracted all MWUs tagged as idiomatic in the Cambridge International Dictionary of English (about 1000 expressions).
There are about 112 of these that conform to our MWU patterns and occur with sufficient frequency in our corpus that we have calculated COWs for them.
fashion victim
flea market
flip side
Testing the measures
We then searched the 100,000 MWUs for which we have COW values, choosing compositional MWUs containing the same terms.
In some cases, this is difficult or impossible, as no appropriate MWUs are present. About 144 MWUs are on the compositional list.
fashion victim fashion designer crime victim
flea market [flea collar] market share
flip side [coin flip] side of the building
Results: basic statistics
The idiomatic and compositional sets are quite different in aggregate, though there is a large variance:
               COW pair sum    Feature overlap    Occurrence measure
Non-idiomatic  mean 575.478    mean 0.297         mean 37.877
               s.d. 861.754    s.d. 0.256         s.d. 23.470
Idiomatic      mean -236.92    mean 0.109         mean 16.954
               s.d. 502.436    s.d. 0.180         s.d. 16.637
Results: discriminating the two sets
How well does each measure discriminate between idioms and non-idioms?
COW pair sum
negative positive
Non-idiomatic 75 213
Idiomatic 178 46
Results: discriminating the two sets
How well does each measure discriminate between idioms and non-idioms?
feature overlap
< 0.12 >= 0.12
Non-idiomatic 100 188
Idiomatic 175 49
Results: discriminating the two sets
How well does each measure discriminate between idioms and non-idioms?
occurrence-based measure
< 25% >= 25%
Non-idiomatic 94 194
Idiomatic 174 50
Results: discriminating the two sets
Can we do better by combining the measures? We used the decision-tree software C5.0 to check.
Rule: if COW pair sum <= -216.739 or
COW pair sum <= 199.215 and occ. measure < 27.74%
then idiomatic; otherwise non-idiomatic
yes no
Non-idiomatic 50 238
Idiomatic 184 40
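The C5.0-derived rule above, written as a function. The thresholds are the ones reported on the slide; the occurrence measure is expressed as a percentage.

```python
def classify_idiomatic(cow_pair_sum, occ_measure_pct):
    """C5.0 rule: idiomatic if COW pair sum <= -216.739, or if
    COW pair sum <= 199.215 and occurrence measure < 27.74%."""
    if cow_pair_sum <= -216.739:
        return True
    if cow_pair_sum <= 199.215 and occ_measure_pct < 27.74:
        return True
    return False
```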
Results: discriminating the two sets
Some cases are “split”: classified as idiomatic with respect to one component word but not the other:
bear hug is idiomatic w.r.t. bear but not hug
flea market is idiomatic w.r.t. flea but not market
Other methods to improve performance on this task:
MWUs often come in semantic “clusters”: ‘almond tree’, ‘peach tree’, ‘blackberry bush’, ‘pepper plant’, etc.
Corresponding components in these MWUs can be localized in a small area of WordNet (Barrett, Davis, and Dorr (2001)) or UMLS (Rosario, Hearst, and Fillmore (2002)).
“Outliers” that don’t fit the pattern are potentially idiomatic or non-compositional (‘plane tree’, unlike ‘rubber tree’, which is compositional).
Clustering and topic detection
Clustering by similarities among segments
Content of a segment is represented by its topically salient terms.
The COW model is used to calculate a similarity measure for each pair of segments.
Clustering on the resulting matrix of similarities (using the CLUTO package) yields topically distinct clusters of results.
Clustering Results
Example: crane
10 segments form 2 well-defined clusters, one relating to birds, the other to machinery (and cleanup of the WTC debris in particular)
Clustering Results Example: crane
Cluster Labeling
Topic labels for clusters improve usability.
Candidate cluster labels can be obtained from: Topic terms of segments in a cluster Multi-word units containing the query term(s) Outside sources (taxonomies, Wikipedia, …)
Topics through latent Dirichlet allocation
LDA (Blei, Ng, and Jordan 2003) models documents as probability distributions over a set of underlying topics, with each topic defined as a probability distribution over terms.
We’ve generated topic models for the SR output of about 20,000 news programs (from 400 to 1000 topics).
Nouns, verbs, and MWUs only; all other words discarded.
Impressionistically, the “best” topics seem to be those in which a third or more of the probability mass is in 10-50 terms.
Topics through latent Dirichlet allocation
Are there better ways to gauge topic coherence (and relatedness)? We’ve tried two:
COW sum measure (Wikipedia COW table) on the 20 most probable words in a topic
Average “distance” between CIDE semantic domain codes for terms (also top 20, using the similarity measure of Wu and Palmer 1994)
Excerpt of CIDE hierarchy:
43 Building and Civil Engineering
  66 Buildings
    68 Buildings: names and types of
      754 Houses and homes
      755 Public buildings
    365 Rooms
  194 Furniture and Fittings
    834 Bathroom fixtures and fittings
    811 Curtains and wallpaper
    804 Tables
    805 Chairs and seats
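The Wu and Palmer (1994) similarity used for the CIDE distance measure can be sketched over such a hierarchy. The parent map below is a toy excerpt; depths here count the root as 1, a common convention for this measure.

```python
def wu_palmer(parent, a, b):
    """Wu-Palmer similarity: sim(a, b) = 2 * depth(lcs) / (depth(a) + depth(b)),
    where lcs is the lowest common ancestor of a and b. `parent` maps each
    node to its parent, with the root mapped to None."""
    def chain(x):
        out = []
        while x is not None:
            out.append(x)
            x = parent[x]
        return out  # [x, parent(x), ..., root]
    ca, cb = chain(a), chain(b)
    depth = {node: len(ca) - i for i, node in enumerate(ca)}
    for node in cb:                 # walk up from b; first hit is the LCS
        if node in depth:
            return 2 * depth[node] / (len(ca) + len(cb))
    return 0.0
```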
Topics through latent Dirichlet allocation Results are quite different (these are with 1000 topics)
COW sum measure
Rank in WikiCow = 1, Rank in CIDE = 232
return home war talk stress month post-traumatic rack deal think stress$disorder night psychological face understand pt experience combat series mental$health
Rank in WikiCow = 2, Rank in CIDE = 773
research cell embryo stem$cell disease destroy embryonic$stem scientist parkinson researcher federal funding potential cure today human$embryo medical human$life california create
Rank in WikiCow = 3, Rank in CIDE = 924
giuliani new$hampshire run talk gingrich february former$new thompson social mormon massachusetts mayor iowa newt opening york choice rudy$giuliani jim committee
Topics through latent Dirichlet allocation Results are quite different (these are with 1000 topics)
COW sum measure
Rank in WikiCow= 4 Rank in cide= 650
palestinian israeli prime$minister israelis arafat west$bank jerusalem peace settlement gaza$strip government violence israeli$soldier hama leader move force security peace$process downtown
Rank in WikiCow= 5 Rank in cide= 283
japan japanese tokyo connecticut lieberman lucy run lose joe$lieberman support lemont disaster war twenty-first$century anti-war define peterson$cbs senator$joe future figure
Rank in WikiCow= 6 Rank in cide= 710
california schwarzenegger union break office view arnold$schwarzenegger tuesday politician budget rosie poll agenda o'donnell political battle california$governor help rating maria
Topics through latent Dirichlet allocation Results are quite different (these are with 1000 topics)
CIDE distance measure
Rank in CIDE = 1, Rank in WikiCow = 278
play show broadway actor stage star performance theatre perform audience act career tony sing character production young role review performer
Rank in CIDE = 2, Rank in WikiCow = 476
war u.n. refugee united$nation international defend support kill u.s. week flee allow chief board cost remain desperate innocent humanitarian conference
Rank in CIDE = 3, Rank in WikiCow = 151
competition win team skate compete figure stand finish performance gold watch champion hard-on score skater talk meter lose fight gold$medal
Topics through latent Dirichlet allocation Results are quite different (these are with 1000 topics)
CIDE distance measure
Rank in cide= 4 Rank in WikiCow= 858
set move show pass finish mind send-up expect outcome address defence person responsibility wish independent system woman salute term pres
Rank in cide= 5 Rank in WikiCow= 478
character play think kind film actor mean mind scene act guy script role good$morning real read nice wonderful stuff interesting
Rank in cide= 6 Rank in WikiCow= 892
warren purpose rick hunter week drive hundred influence allow poor peace night leader train sign walk training goal team rally
YAKS and Ontology Enhancement
YAKS (Yet Another Knowledge Source)
What is YAKS?
Like the COW model: measures how much more (or less) frequently words occur in certain environments than would be expected by chance.
Unlike the COW model: does not measure co-occurrence within a fixed window size; measures how often words are in one of the syntactic relations tracked by the model.
Example: Subject(eat, dog) would be an entry in the YAKS table recording how many times the verb eat occurs in the corpus with dog as its subject.
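The YAKS idea of computing MI over parser-extracted relation pairs can be sketched as below, mirroring the COW ratio but over (verb, object) pairs rather than windows. Toy input stands in for real parser output.

```python
from collections import Counter

def yaks_mi(pairs):
    """YAKS-style MI over a set of relation instances, here (verb, object)
    pairs: MI(v, o) = P(v, o) / (P(v) * P(o)), computed over all
    verb-object pairs extracted by the parser. Values > 1 mean the verb
    takes that object more often than chance."""
    n = len(pairs)
    verbs = Counter(v for v, _ in pairs)
    objs = Counter(o for _, o in pairs)
    joint = Counter(pairs)
    return {(v, o): (c / n) / ((verbs[v] / n) * (objs[o] / n))
            for (v, o), c in joint.items()}
```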
Syntax and Semantics: YAKS
YAKS infrastructure:
Corpus: New York Times corpus (six years), all sentences under 25 words (~60% of sentences, ~50% of words); about 160 million words.
Need good parsing. We used the Xerox Linguistics Environment (XLE) parser from PARC, based on lexical-functional grammar.
Syntax and Semantics: YAKS
Sample YAKS relations (YAKS tracks 13 different relations):
SUBJ(V, X): X is subject of verb V. “Dan ate soup” gives SUBJ(eat, Dan)
OBJ(V, X): X is direct object of verb V. “Dan ate soup” gives OBJ(eat, soup)
COMP(V, X): verb V and complementizer X. “Dan believes that soup is good” gives COMP(believe, that)
POBJ(P, X): preposition P and object X. “Eli sat on the ground” gives POBJ(on, the ground)
CONJ(C, X, Y): coordinating conjunction C, and two conjuncts (either nouns or verbs) X and Y. “Dan ate bread and soup” gives CONJ(and, bread, soup)
For ontology extension, we use only the OBJ relation.
YAKS in ontology enhancement
[Diagram: seed items from one neighborhood in the ontology: hat, shirt, dress, sock]
Which verbs are most associated with these nouns, in a large corpus?
[Diagram: verbs wash, wear, iron, take off linked to the seed nouns]
Which other nouns are most associated as objects of those verbs?
[Diagram: jacket, sweater, blouse, pants, clothes, pajamas appear as new candidates for that area of the ontology]
YAKS in ontology enhancement
YAKS technique:
The noun-verb-noun cycle reflects the selectional restrictions of verbs that are strongly associated with the seed nouns.
This lets us use the information in the statistical properties of the large corpus to find nouns that are semantically close to the seeds in the ontology.
This is why we use OBJ (the object relation) for this YAKS technique. (SUBJ found the same information, but with slightly more noise.)
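The noun-verb-noun cycle can be sketched as follows. The `yaks_obj` mapping is a toy stand-in for the YAKS OBJ table, and the top-k cutoffs are illustrative assumptions.

```python
def induce_candidates(yaks_obj, seeds, top_verbs=2, top_nouns=3):
    """Noun-verb-noun induction cycle: from the seed nouns, take the verbs
    most strongly associated with them as objects; from those verbs, take
    their most strongly associated objects; any such noun not already a
    seed is a new candidate for that neighborhood of the ontology.

    `yaks_obj` maps (verb, noun) pairs to association scores."""
    # Step 1: verbs most associated with the seed nouns.
    verb_scores = {}
    for (v, n), s in yaks_obj.items():
        if n in seeds:
            verb_scores[v] = verb_scores.get(v, 0.0) + s
    verbs = sorted(verb_scores, key=verb_scores.get, reverse=True)[:top_verbs]
    # Step 2: other nouns most associated as objects of those verbs.
    noun_scores = {}
    for (v, n), s in yaks_obj.items():
        if v in verbs and n not in seeds:
            noun_scores[n] = noun_scores.get(n, 0.0) + s
    return sorted(noun_scores, key=noun_scores.get, reverse=True)[:top_nouns]
```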
YAKS in ontology enhancement
The YAKS technique finds new items for a neighborhood, but where exactly do they go?
[Diagram: a new term X belongs somewhere in this area of the ontology, but its exact position among the existing nodes is unknown]
YAKS in ontology enhancement Techniques to precisely locate a candidate term in the ontology: Wikipedia check (already described)
(Does the Dog page contain “dog is a mammal”? Does the Mammal page contain “mammal is a dog”?)
Yahoo pattern technique
YAKS in ontology enhancement
Yahoo check:
1. Use known pattern information to form a Yahoo search. If we suspect that a dog is-a mammal, search Yahoo for “dog is a mammal” and “mammal is a dog”. Also search using other is-a patterns (e.g., from Hearst 1994).
2. Use relative counts (absolute counts are hard to interpret).
YAKS in ontology enhancement: evaluation Four levels of is-a: Beef, meat, food, substance
1. Seed pairs were beef-meat, meat-food, and food-substance.
2. YAKS induction and siblings added to produce candidate pool.
3. Candidate list was then pared using a version of the Wikipedia check:
YAKS in ontology enhancement: evaluation Beef, meat, food, substance — Wikipedia check: Where might candidate term T fit, relative to seed term S?
Count mentions of T on the S page (M_T), and mentions of S on the T page (M_S). Mentions in the first paragraph are weighted doubly. Subtract: M_T - M_S.
If T is an S, then M_T - M_S should be > 0.
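The scoring step can be sketched as below. Pages here are hypothetical (first_paragraph, rest_of_page) text pairs; fetching and parsing real Wikipedia pages is omitted.

```python
def weighted_mentions(term, page):
    """Count occurrences of `term` on a page, with first-paragraph
    mentions weighted doubly. `page` is (first_paragraph, rest) text."""
    first, rest = page
    return 2 * first.lower().count(term.lower()) + rest.lower().count(term.lower())

def wikipedia_check(t, s, page_of):
    """Score M_T - M_S as on the slide: mentions of candidate T on seed S's
    page, minus mentions of S on T's page; a score > 0 suggests T is-a S.
    `page_of` maps a term to its (first_paragraph, rest) page text."""
    m_t = weighted_mentions(t, page_of[s])
    m_s = weighted_mentions(s, page_of[t])
    return m_t - m_s
```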
YAKS in ontology enhancement: evaluation
Beef, meat, food, substance: results
After YAKS induction and expansion via siblings, we had >800 candidate terms to add to the ontology.
The Wikipedia check pared this to 38 terms, each with its best is-a relationship.
Of these, on hand inspection, all but two were very high quality.
YAKS in ontology enhancement: evaluation
YAKS testing: MILO transportation ontology
1. In each trial, seed terms were three items from near the bottom of the ontology, linked by is-a. For instance, sedan, car, and vehicle. Selecting terms from the bottom of the ontology decreases the lexicalization problems inherent in MILO content.
2. Induced new candidate terms using the YAKS technique, adding siblings as well.
YAKS in ontology enhancement: evaluation3. Candidate list was then pared using a Yahoo is-a pattern check:
A. For a candidate term T, compare it to each of the three seeds
– i.e., is it a sedan? Is it a car? Is it a vehicle?
B. Do this via a Yahoo search for Hearst is-a patterns involving T and each of the seeds. This trial used the patterns Y such as X and X and other Y.
– i.e., search for “. . . sedans such as T . . .”, “. . . cars such as T . . .”, “. . . vehicles such as T . . .”, “. . . T and other sedans ...”, and so on.
YAKS in ontology enhancement: evaluation
3. Yahoo pattern check, continued:
C. Does the candidate have a significant number of Yahoo hits for one of the seeds with one of the is-a patterns?
   Is there a significant difference between that highest number of hits and the number of hits with some other seed?
   That is, is there a clear indication that the candidate is-a for one particular seed?
D. If not, discard this candidate.
E. If so, then this candidate most likely is-a member of the class named by that seed.
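Steps C-E can be sketched as a relative-count decision. The ratio and minimum-hit thresholds below are illustrative assumptions, not values from the slides.

```python
def pick_seed(hit_counts, ratio=5.0, min_hits=50):
    """Relative-count decision for the Yahoo pattern check: pick the seed
    whose is-a patterns got the most hits, but only if that count is both
    non-trivial and clearly larger than the runner-up's.

    `hit_counts` maps each seed to its total pattern-hit count. Returns
    the winning seed, or None to discard the candidate."""
    ranked = sorted(hit_counts.items(), key=lambda kv: kv[1], reverse=True)
    (best, n1), (_, n2) = ranked[0], ranked[1]
    if n1 >= min_hits and n1 >= ratio * max(n2, 1):
        return best          # candidate is-a member of this seed's class
    return None              # no clear winner: discard this candidate
```

For example, the 'truck' row of the results table (0 / 532 / 152000 hits) clearly selects 'vehicle' under any reasonable thresholds.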
YAKS in ontology enhancement: evaluation Results — MILO transportation ontology:
candidate     Yahoo hits for is-a:          result
              sedan    car     vehicle
truck         0        532     152000      a truck is a vehicle
van           0        95      5590        a van is a vehicle
automobile    0        63      14700       an automobile is a vehicle
minivan       1        76      561         a minivan is a vehicle
BMW           3        360     66          a BMW is a car
Jeep          0        3600    849         a Jeep is a car
Corolla       6        66      3           a Corolla is a car
convertible   0        25      76          a convertible is a vehicle
Conclusions and future work
MI models can help compensate for the noisiness in timed media search and retrieval.
MI models can help in building knowledge sources from large text collections.
We’re still looking for better ways to combine handcrafted semantic resources with statistical ones.