CHAPTER 7
CONCEPT BASED EVENT SEARCH AND RANK
The previous chapters discussed the document processing tasks
required by the event based search engine. Those document processing tasks
help to extract event based information and group the specific events in each
cluster. This helps to index domain specific information for news event
search. Since the content on the web is growing rapidly every fraction of a
second, search engines such as Google, Yahoo and MSN have become the
most heavily-used online services, with millions of searches performed every
day. All the above search engines basically use the keyword based search
strategy. Ranking algorithms such as PageRank algorithm (Brin et al 1998)
and HITS Algorithm (Ding et al 2012) score the documents, according to the
incoming and outgoing links of the documents. However, because of the large
number of documents available on the web, keyword based search engines
return far too many results. The ultimate challenge for
search engines is to provide effective systems that retrieve the most relevant
information from the web that exactly caters to the user information need.
Concept based search attempts to improve search effectiveness by
incorporating conceptual information, that conveys meaning rather than using
the presence or absence of keywords, as the basis for the retrieval process.
A multilevel Concept based Search, and Ranking Algorithm described by
Balaji et al (2012), retrieves and ranks the documents based on the concepts
and relationships between the concepts using Universal Networking Language
(UNL) based semantic representation of the documents. The Semantic graph
based Index is used for both query expansion and concept based search. The
algorithm has been evaluated on a corpus of tourism documents, and its
performance is compared with that of keyword based search. The mean
average precision of the concept based search for tourism domain is found to
be 0.75 while the keyword based search has a Mean Average Precision
(MAP) score of 0.45. We have adopted the same approach for event based
search and ranking, modifying the ranking function with an event based
weight, because the unmodified concept based search showed low precision for
queries that require event based results. This chapter describes a concept
based approach for news event search and ranking.
7.1 INTRODUCTION
Combining heterogeneous information from different news articles
requires complex semantic interpretation, which can automatically map
different concepts, and tag the domain specific information. In order to
automatically identify the event specific sentences, and to extract event
specific features from large corpora, we apply machine learning approaches.
We present a new context based event model for News Articles. We need to
represent the event context, using the appropriate semantic models. The
Resource Description Framework (RDF) (Graves & Gutierrez 2006) graph is
generally used to express the relationship between words as
subject-predicate-object triples, using word level semantics. This
triple representation is suitable for structured languages
like English. Compared to RDF, OWL (Antoniou et al 2004) has a richer set
of semantics, but it is oriented towards representing ontologies rather than text.
For both OWL and RDF, it is difficult to express the richer semantics required
for event based processing of text. Hence, we have chosen UNL as a semantic
graph based representation, which helps to represent documents in language
and domain independent ways.
As we discussed in the previous chapters, event based features are
extracted based on machine learning approaches viz, Agglomerative
Clustering, Bootstrapping and Label propagation, for Event clustering, event
pattern identification and temporal expression identification and
normalization respectively. These machine learning approaches help to extract
event based information from the news text and use this information to
construct event specific indexes. In this chapter, we discuss a multi-field news
event search engine, which is designed with a specialized concept based multi
field index that considers event based features. The contributions of this work
include the design of multi-field index structure, and the enhancement of a
concept based search engine (CoRee) (Balaji et al 2012) with event based
features.
When a user enters a query for a news event, they may want to
track information with respect to location, time and person. A single field
indexer requires additional processing to retrieve the documents that contain
all the fields required by the user. Moreover, the semantic link it recovers
between an event and its exact time, place and person may not be correct. Hence, we
have pre-processed each document, and extracted the event concepts which
occur with time, place and person based relations for building the multi-field
indexer. This multi-field indexer helps us to get results for the exact link of an
event with respect to time, place and person. The related work is discussed in
the next section.
7.2 APPROACHES TO CONCEPTUAL SEARCH
This section discusses existing approaches to concept based search, built
on semantic representations, ontologies and other semantic
resources. Concept based search engines can be classified into those that use a
background knowledge source to provide conceptual information, and those
that use semantically analyzed components of the document. Concept
based search can also be classified based on how semantics is used to
represent the documents. Documents can be represented by considering
concepts associated with the frequently occurring keywords, or by converting
important components of the document into a semantic structure. In addition,
concept based search can also be classified based on, where the semantics are
introduced in the components of the search engine. Semantics can be
introduced in query expansion, building the index, searching and also in
ranking the search results. The related work of concept based search has been
discussed with the above aspects.
7.2.1 Semantic based Search
While some meaning based search engines use sentence level
semantics, others use ontology as the background knowledge source for
providing semantics. Hakia (Sudeepthi et al 2012) is a semantic search engine
that uses knowledge of Ontology and Fuzzy logic for semantic ranking. In
order to retrieve conceptual results, it uses Query Detection and Extraction
(QDEX) Indexing Architecture, (Loutas et al 2012) which enables the
semantic analysis of web pages and provides meaning based search results. In
Hakia besides the keywords, phrases are used for meaning based searches.
The limitation of Hakia is that, it accepts queries as questions in a specific
format. Also, the QDEX algorithm extracts all possible queries that can be
asked on the content of web pages of various lengths and forms. This is an
offline process before any user query is entered.
The major difficulty in the QDEX system is the reduction of the
huge number of generated query sequences to the few dozen that make
sense. Hakia allows only these predefined query sequences, generated from
the content, to be used as queries.
On the other hand, SenseBot (Rana & Singh 2013) is a semantic
search engine that runs over search engines like Google and Yahoo, to
generate a multi document summary based on text mining and limited
semantics. Though all the above search engines provide meaning based
results, some search engines require sophisticated query analysis techniques
to provide meaningful search results. Other search engines consider concepts
rather than relations between concepts as the basis of the match. In
the search engine described in this chapter, however, the context of the query is
retrieved by traversing the already created UNL based index. The frequently
occurring UNL relations obtained from the UNL index, in effect provide
information about the possible connections between concepts in the specific
domain under consideration. These connections provide the context of the
query concept, and the query expansion based on this context yields
meaningful search results.
7.2.2 Ontology based Search
Concept based search can also be based on the use of knowledge
structures. One such search engine is Engineering or Environmental
Knowledge Ontology-based Semantic Search (EKOSS) (Kraines et al 2006).
It is an ontology based semantic search engine which uses a fully functional
ontology for representing the knowledge base. A collaborative knowledge
sharing environment is provided which helps knowledge experts to share their
knowledge such as research papers, databases, computer simulated models, and
even curriculum vitae. The EKOSS system is used to construct computer-
interpretable semantically rich statements of the knowledge resource. When a
user request is posted, this system converts the user request into a computer
readable knowledge description based on description logic and associated
rules. Ontology-based information retrieval (Gao et al 2005) intended for e-
Government has been developed for securing the legal documents of the
government. The disadvantage of using ontology based search engines is that
they are susceptible to changes in the information resources. This will affect
the conceptualization of the domain representation. Moreover, the effort
required to build an ontology is huge. Ontology based news event search
depends on the domain, and the use of a common vocabulary ontology for
different domains remains a challenging task.
7.2.3 UNL based Search
A meaning based multilingual search engine that uses UNL
(Universal Networking Language) is AgroExplorer (Surve et al 2004). This
search engine is similar to the search engines described in this work, since
AgroExplorer also uses Universal Networking Language (UNL) expressions
for representing sentences as graphs, that capture the meaning of the
sentences. AgroExplorer has been developed for the agriculture domain and
also provides multilingual features. A simple search and rank process based
on the degree of match of the query UNL, and the frequency of occurrence of
the Concepts with other concepts in the UNL expression is used. The
algorithm for searching and ranking described in this paper, is a part of UNL
search system that differs from the existing AgroExplorer (Surve et al 2004)
in that, our approach incorporates semantics in every component of the search
engine. Previous approaches consider the conceptual relationship only at the
word level and they have considered only document based properties such as
position, term/concept level statistics for ranking. In addition to these
document based properties, we have used a three
level conceptual search and rank process. We have also attempted index based
query expansion, which considers the conceptual link of query concepts with
other concepts, in order to obtain context based results. This concept based
search has been enhanced with event based text processing, and we have
adopted the same ranking approach with event based weight. The existing
work in event based search is explained in the next section.
7.2.4 Event based Search
STORIES in time is a graph-based interface for news tracking and
discovery (Berendt & Subasic 2009). This work provides a graph based
representation of the news story, in which the user can search the specific
node of the news topic to get specific information about an event or a topic.
This method extracts facts from the document along with the time. For each
time period a set of graphs represents the news event. The term level
similarity between two temporal nodes is measured based on its term
frequency and its co-occurrence frequency. Moreover, the existing event
tracking gave importance to the temporal aspect rather than the person and
place. There is only limited work which considers all the entities for event
tracking. Sayyadi et al (2009) introduced a community detection method to
form a network of keywords pertaining to an event. Hence, for each event
there will be a set of frequently co-occurring terms. In this approach, the term
that appears in one event may appear in another event, and this redundancy
may lead to wrong interpretation. Lam et al (2001) had introduced a
contextual analysis of news events, instead of term based matching between
events; their method identifies a concept based similarity between events,
based on the statistical context identified between the sentences of two events.
NewsX (Wunderwald et al 2011) is an event extraction tool developed to
answer WH questions for news events. It uses a rule based approach to find
answers to the WH questions. Cybulska & Vossen (2011) proposed a similar
event extraction method for historical events. They used the WordNet and
ontological classes for the semantic tagging of the given corpus. Manual tagging
is required to indicate the historical data in terms of place, person and action.
However, the automatic extraction of historical information and learning the
relationship with other historical events, are still challenging issues.
Cai et al (2013) have proposed an event relationship analysis based
on temporal facts. They focused mainly on event evaluation based on time
varying features. Similarly, Jin et al (2008) proposed a time based web search
engine, which focused only on the temporal aspect of the search query. Allan
et al (2001) have also proposed temporal based news summarization, based on
the probability of the occurrence of a word on the given topic. Feng et al
(2007) proposed an approach for finding links between events. They used a
clustering approach to link events and their relationships. Kapp et al (2013)
described a person based event exploration. They computed the similarity
between events based on the participating entities. Sarma et al (2011) consider
time constraint-based relationship between event entities. They constructed a
global temporal cluster, and a local temporal cluster to identify the dynamic
relationship between events. All the above said approaches consider only the
co-occurrence context, while we have developed an event model that
considers not only the temporal facts, but also the relationship of an event to
the person and place where the event context may be at the word level,
sentence level, or document level.
There is a prototype called NewsSync (Vydiswaran et al 2011),
which explores news stories based on user preferences. Although the idea
behind this work is similar to the work described in this chapter, it is designed for
structured languages like English. Moreover, it does not consider any
semantic representation to provide meaningful results to the user. In our
approach, we have utilized the existing rule based UNL enconversion
(Balaji et al 2011) to represent a document using semantic graphs. From this
semantic graph, we have identified only event based semantic subgraphs,
which are required for event specific clustering to cluster events in terms of
time, place and person. Adopting machine learning approaches in information
retrieval can help to reduce the time to process the information. Therefore,
compared to previous approaches, our method works well to provide
meaningful results to the user within a reasonable time limit. Moreover, this
work can be adapted easily to other domains and languages, without
modifying the methodology. The machine learning approaches adopted for
our Tamil news search engine and its system architecture are discussed in the
next section.
7.3 COREE – A CONCEPT BASED SEARCH ENGINE
In the UNL based search system discussed here, UNL graphs that
represent fragments of sentences in a document are used to build the
conceptual index. The UNL enconverter of the system uses a rule based
approach to convert the sentence constituents to UNL graphs where concepts
are represented as nodes and relations as edges. The use of this approach
allows terms to be represented as concepts, extracts a standard set of
semantic relations between concepts in a sentence, and at the same time,
associates a hierarchy of the concepts linked through the UNL semantic
relations. This essentially means that semantically analyzed information
from the sentences of the documents is used for building the index. In
addition, the constraints associated with the concepts available in the UNL
Knowledge Base (KB), also incorporate information from a background
knowledge resource into the index structure. For example, the UW word
Chennai of the Tamil sentence will be translated into Chennai(icl > place).
Here, Chennai denotes the head word and icl > place denotes the
constraints associated with the concept. Figure 5 shows the UNL enconversion
process.
Figure 7.1 Semantic representation of a sentence
In Figure 7.1 the concept build(icl>action) is connected to
Rajarajachozhan(icl>person) and to Thanjai Temple(iof>temple) using
agt and obj respectively.
The set of UNL graphs obtained from the enconversion component
of the search system is represented as a multi-list structure, which is
discussed in chapter 4. This multilist structure contains three separate
indices, namely the CRC (Concept-Relation-Concept), CR (Concept-Relation)
and C (Concept) indices, in order to aid searching and ranking. In addition to
building the UNL graph represented as a multilist structure, the UNL
enconverter provides additional information to aid the retrieval process.
The CRC indices for the UNL Tamil sentence are build(icl>action) -
agt - Rajarajachozhan(icl>person) and build(icl>action) - obj - Thanjai
Temple(iof>temple); the CR indices are build(icl>action) - agt and
build(icl>action) - obj; and the C indices are build(icl>action),
Thanjai Temple(iof>temple) and Rajarajachozhan(icl>person).
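The derivation of the three indices from a UNL graph can be illustrated with a minimal sketch. It assumes that each sentence graph is available as a list of (source concept, relation, target concept) triples and that each index maps a key to the set of documents containing it. The function name build_indices and the document identifier "d1" are hypothetical and are not part of the system described here.

```python
from collections import defaultdict

def build_indices(doc_graphs):
    """Build CRC, CR and C indices from UNL triples.

    doc_graphs maps a document id to a list of
    (source_concept, relation, target_concept) triples (assumed input shape).
    """
    crc_index = defaultdict(set)  # (C, R, C) -> document ids
    cr_index = defaultdict(set)   # (C, R)    -> document ids
    c_index = defaultdict(set)    # C         -> document ids
    for doc_id, triples in doc_graphs.items():
        for source, relation, target in triples:
            crc_index[(source, relation, target)].add(doc_id)
            cr_index[(source, relation)].add(doc_id)
            c_index[source].add(doc_id)
            c_index[target].add(doc_id)
    return crc_index, cr_index, c_index

# The example sentence from Figure 7.1, stored under a hypothetical id "d1".
doc_graphs = {
    "d1": [
        ("build(icl>action)", "agt", "Rajarajachozhan(icl>person)"),
        ("build(icl>action)", "obj", "Thanjai Temple(iof>temple)"),
    ]
}
crc_index, cr_index, c_index = build_indices(doc_graphs)
```

For the example sentence this yields two CRC keys, two CR keys and three C keys, matching the listing above. The sentence level and document level features are omitted from the sketch.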
Sentence based information includes the sentence identifier, Part Of
Speech tags, Entity tags, Multiword tags, the actual terms or words associated
with the UNL concepts, and a bit pattern vector, that indicates sentence-wise
position of the concepts in the document. Document based information
includes a document identifier, term frequency, concept frequency, and the
position of the concepts in the document. These features are used in weight
determination during the searching and ranking of documents. Features such
as the frequency of the concepts present in the document in addition to term
frequency, allow ranking to be both term and concept based which becomes
important when the term frequency is not significant. The bit pattern vector
indicating the distance between concepts helps to identify relations that are
not necessarily proximity dependent. The UNL index, with all the above
sentence level and document level information, was initially stored in a Binary
Search Tree (BST). With the BST we were able to index up
to 33,000 documents; beyond that number we could no longer
insert concepts into the BST and had to choose a different data
structure. We then tried an SQL database and a Lucene index, testing each with
1 lakh (100,000) documents. Both could index the concepts, but searching
the SQL database took longer than searching the Lucene index.
Hence, we have chosen Lucene for indexing the concepts, and the relations
between concepts, from the documents represented in UNL.
7.3.1 Concept based Query Expansion
In this work, the context of a query concept is defined as the
association of this concept with other concepts in a CRC relation, across
documents in the domain of interest. By analyzing the index, the concept
associated with a query is matched with the CRCs of the index, and the most
common CRCs associated with the query concept are extracted. The
expanded concepts obtained, are ranked based on the frequency of the CRC
and on its being an entity. Query expansion is an on-line activity and the
index analysis results in efficient query expansion. The most frequently
occurring CRC in the index indicates the frequent association of concepts in
the domain across documents, and hence, gives the domain context of the
query concept. This expansion of the query concepts to CRC, allows context
dictated query sub graphs to be constructed for the query. The expanded query
graph is now associated with the actual query terms, query concepts and
expanded concepts associated with the context of the query concept. The
distinction among these three must therefore be preserved during both
searching and ranking.
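The index based expansion described above can be sketched as follows. The sketch assumes that the CRC index exposes a frequency for each CRC key, and that frequency is the primary ranking criterion with entity status used only to break ties. The text does not state how the two criteria are combined, so this ordering is an assumption. The function name, the parameter top_k and the sample concepts and counts are hypothetical.

```python
def expand_query_concept(query_concept, crc_frequency, entity_concepts=frozenset(), top_k=3):
    """Return the most frequent CRCs that involve query_concept.

    crc_frequency maps (source, relation, target) -> frequency in the index.
    Ranking (assumed): higher frequency first, then CRCs whose other concept
    is an entity, then lexicographic order for determinism.
    """
    candidates = []
    for crc, freq in crc_frequency.items():
        source, _relation, target = crc
        if query_concept == source:
            other = target
        elif query_concept == target:
            other = source
        else:
            continue  # this CRC does not involve the query concept
        candidates.append((crc, freq, other in entity_concepts))
    candidates.sort(key=lambda c: (-c[1], not c[2], c[0]))
    return [crc for crc, _freq, _is_entity in candidates[:top_k]]

# Hypothetical index statistics for illustration only.
crc_frequency = {
    ("build(icl>action)", "obj", "temple(icl>place)"): 5,
    ("visit(icl>action)", "obj", "temple(icl>place)"): 5,
    ("temple(icl>place)", "plc", "Thanjavur(iof>city)"): 8,
    ("eat(icl>action)", "obj", "rice(icl>food)"): 9,
}
expanded = expand_query_concept(
    "temple(icl>place)", crc_frequency,
    entity_concepts={"Thanjavur(iof>city)"}, top_k=2,
)
```

Each returned CRC can then serve as a query sub graph, so that a single word query is no longer matched as an isolated concept.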
The index based query expansion influences the searching and
ranking of documents in many ways. It helps to build CRC query graphs that
can be matched with the Concept based index. Without this expansion, single
word queries would have resulted in an isolated concept, which may not give
semantic based results to the user. As already explained, the association of
expanded concepts allows domain oriented, corpus based context of the query
word to play a role in semantic matching, and in addition, helps to bring in
documents, which have concepts in the context of the query, which would
have been missed by other search mechanisms.
7.3.2 Concept based Search
The basic searching procedure is based on complete CRC Match
or partial CR or C matches between the query sub graphs and the
corresponding index, as in AgroExplorer (Surve et al 2004). However, the
design of the ranking procedure depends on whether the match of the index
is with the actual query terms, actual query concepts or expanded concepts.
In addition, all the sentence and document based features associated with the
conceptual indices also affect the ranking procedure.
The overall algorithm for searching and ranking actually performs a
three level ranking. The first level ranking is obtained based on whether there
is complete match (CRC match), partial match of Concept Relation (CR) or
match of only concepts (C Only). This level of ranking is provided by the
Degree of Match Categorization tag Ta. The set of documents obtained at the
Level 1 category is further prioritized, using Concept Association
Categorization Tag Tb. Concept Association categorization depends on
whether the index match is between the query terms, query concepts or
expanded concepts. Once the documents have been ranked by Ta and Tb,
the documents at the same Ta.Tb level are ranked by weights
calculated from the index based features associated with the concept.
A Tag represented as Ta.Tb helps in determining the two level lists
of prioritized documents. Tag Ta computed in Level 1 indicates the degree of
match while Tb computed at Level 2 indicates the type of concept association.
For determining the tags, the following terminology is defined. A given query
with n terms may be represented as a set $Q = \{q_1, \ldots, q_n\}$, where $q_i$
represents a query term. Each element $i$ of the power set of $Q$ is expanded
and enconverted to a set $EQ_i$ of UNL graphs $g_{im}$, where $m$ indexes the
expanded concepts obtained from the UNL index and $m > 0$. The power set is
used because a query term may be associated not only with a single
expanded term and its concepts, but also with more than one expanded
term and concept. That is, $EQ_i = \{g_{i1}, g_{i2}, \ldots, g_{im}\}$, where each $g_{ij}$ is a tuple
$\{C_{x_{ij}}, R_{ij}, C_{y_{ij}}\}$ representing a relation $R_{ij}$ between the two associated
concepts $C_{x_{ij}}$ and $C_{y_{ij}}$. The presence of all three elements of the tuple
corresponds to a CRC graph, the presence of a C and an R corresponds to a CR
graph, and the presence of a C alone indicates a C graph. Each $g_{ij}$ is now
matched with the CRC, CR and C indices $I_{CRC}$, $I_{CR}$ and $I_C$
to obtain the sets of documents $D_{ij.CRC}$, $D_{ij.CR}$
and $D_{ij.C}$. The matching set of documents $D_{ij}$ for the expanded query graph
$g_{ij}$ is the union of these three sets, i.e.,

$$D_{ij} = D_{ij.CRC} \cup D_{ij.CR} \cup D_{ij.C}$$

Using these sets, the degree of match is determined by the tag Ta.
This algorithm aims at improving the search and ranking, by
performing matching at three levels, namely,
1. Partial or Complete match between the index and expanded
query;
2. A Concept Association level which distinguishes between the
actual query terms, query concepts and expanded concept.
3. Document based features, such as frequency of occurrence
and position of terms and concepts in the document.
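The first of these levels, matching one expanded query graph against the CRC, CR and C indices and taking the union of the results, can be sketched as below. The sketch assumes each index is a dictionary from a key to a set of document identifiers and that a query graph is a (source, relation, target) tuple with None for missing elements. The function names and the sample index contents are hypothetical, and the numeric Ta values are deliberately not assigned here.

```python
def match_query_graph(graph, i_crc, i_cr, i_c):
    """Match one query graph against the CRC, CR and C indices.

    graph is (source, relation, target); relation and target may be None
    for CR and C graphs (assumed encoding).
    """
    source, relation, target = graph
    d_crc, d_cr = set(), set()
    if relation is not None and target is not None:
        d_crc = set(i_crc.get((source, relation, target), ()))  # complete match
    if relation is not None:
        d_cr = set(i_cr.get((source, relation), ()))             # partial match
    d_c = set(i_c.get(source, ()))                               # concept only
    if target is not None:
        d_c |= set(i_c.get(target, ()))
    return {"CRC": d_crc, "CR": d_cr, "C": d_c, "all": d_crc | d_cr | d_c}

def strongest_match(doc_id, matches):
    """Return the strongest match level reached by a document, or None."""
    for level in ("CRC", "CR", "C"):
        if doc_id in matches[level]:
            return level
    return None

# Hypothetical index contents for illustration only.
i_crc = {("build(icl>action)", "agt", "Rajarajachozhan(icl>person)"): {"d1"}}
i_cr = {("build(icl>action)", "agt"): {"d1", "d2"}}
i_c = {"build(icl>action)": {"d1", "d2", "d3"},
       "Rajarajachozhan(icl>person)": {"d1", "d4"}}

matches = match_query_graph(
    ("build(icl>action)", "agt", "Rajarajachozhan(icl>person)"), i_crc, i_cr, i_c
)
```

In this toy index, d1 reaches a complete CRC match, d2 a partial CR match, and d3 and d4 only a concept match, which is the distinction the degree of match tag is meant to capture.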
7.3.2.1 Tag determination for degree of match
The tag determination for the degree of match depends on the
extent of match between the CRC representing the query sub graph and the
conceptual index. It essentially differentiates between CRC, CR and C
matches. Ta helps in differentiating between the different degrees of match.
The UNL sub graph is a directional graph, and hence, partial match also
considers whether the concept in CR (Concept Relation), matches with the
source concept, Cxi , or the destination concept, Cyi , of the UNL sub graph.
$$
T_a = \begin{cases}
1 & \ldots \\
2 & \ldots \\
3 & \ldots \\
4 & \ldots \\
5 & \ldots \\
6 & \ldots
\end{cases}
\qquad (7.1)
$$
7.3.2.2 Tag determination for concept association
The next level of tag determination is based on whether the Ci
value in the CRC, CR and C matches corresponds to the actual query term, the
concept of the query term, or the concept obtained after query expansion.
Accordingly, the concept association is said to be of three types.
1. Query Term TWi association - This means that the concept Ci is
the query term itself.
2. Concept Word CWi association - This means that the concept
Ci matches the corresponding concept of the query, but the
actual query term is different.
3. Expanded Word EWi association - This means that the
concept Ci is associated with a concept that is not actually in
the query, but has been obtained as a result of query
expansion.
Based on the above 3 values, eight different tags are obtained as
given below.
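$$
T_b = \begin{cases}
1 & \text{if } C_{xi}, C_{yi} = TW_i \\
2 & \text{if } C_{xi} = TW_i \text{ and } C_{yi} = CW_i \\
3 & \text{if } C_{xi} = CW_i \text{ and } C_{yi} = TW_i \\
4 & \text{if } C_{xi}, C_{yi} = CW_i \\
5 & \text{if } C_{xi} = TW_i \text{ and } C_{yi} = EW_i \\
6 & \text{if } C_{xi} = EW_i \text{ and } C_{yi} = TW_i \\
7 & \text{if } C_{xi} = EW_i \text{ and } C_{yi} = CW_i
\end{cases}
\qquad (7.2)
$$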
It can be seen that the Tag Tb, differentiating between the three
types explained above, also differentiates between whether the concept is the
source node Cxi or destination node Cyi of the directed UNL sub- graph. The
eight values of Tb bring out these differences. Within each DTa the documents
are ordered as per Tb.
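The assignment of Tb can be illustrated as a simple lookup on the association types of the source and destination concepts. The sketch covers only the seven combinations that can be read from Equation 7.2, under the assumption that the cases pair the source concept Cxi with the destination concept Cyi in the order TW, CW, EW as listed there. Any other combination returns None rather than a guessed value. The names used are hypothetical.

```python
# Association types: query term (TW), concept word (CW), expanded word (EW).
TW, CW, EW = "TW", "CW", "EW"

# (type of source concept Cxi, type of destination concept Cyi) -> Tb.
# Only the combinations readable from Equation 7.2 are listed (assumption).
TB_CASES = {
    (TW, TW): 1,
    (TW, CW): 2,
    (CW, TW): 3,
    (CW, CW): 4,
    (TW, EW): 5,
    (EW, TW): 6,
    (EW, CW): 7,
}

def concept_association_tag(source_type, destination_type):
    """Return Tb for a matched sub graph, or None if the case is not listed."""
    return TB_CASES.get((source_type, destination_type))
```

Because the key is an ordered pair, the lookup distinguishes a query term at the source node from the same query term at the destination node, as the directed UNL sub graph requires.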
Each document in the set Dij.Ta is now tagged with the Tb tag.
In other words, the retrieved documents are prioritized and ranked according
to their Ta.Tb value. Let Dij.TaTb represent the set of documents with the tag
Ta.Tb corresponding to the enconverted query graph gij. The next section
describes how index based features are used to further rank each set of
Dij.TaTb documents.
7.3.2.3 Use of index based features
Index based features are used to calculate a weight factor to
prioritize the documents within each set Di j .TaTb. The features used are
position, frequency count, Named Entity (NE) tag and Multi-word (MW) tag
of the term/concept. The feature weight is calculated as follows.
The index based feature weight is denoted as $W_I$.
$$W_I = \ldots \qquad (7.3)$$
Here, i represents the single document weight, and j represents the
weight across the documents. PiWeight represents the position weight of the
concept, computed from where in the document the
concept or term occurs. FiCount represents the frequency of occurrence of
concepts in the document. NEiWeight represents the Named Entity weight
associated with the concepts in Q. MWiWeight represents the Multi Word weight
associated with the concepts in Q.
7.3.2.4 Computation of Overall Concept based Ranking
The first step in computing the overall ranking (Oc) is to merge all
the Di j .TaTb documents corresponding to each gij of the Query Q. Those
documents which occur in the maximum number of sets are ranked higher.
The merged sets of documents are then ranked based on the TaTb value, and
each set DTa.Tb is, in turn, ranked using the normalized index based weight
factor. d is the normalized weight factor to differentiate between the complete
CRC match, partial CR, or C match.
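$$
d = \begin{cases}
0.5 & \text{if } T_a = 1 \\
0.3 & \text{if } T_a = 2, 3, 4 \\
0.2 & \text{if } T_a = 5, 6, 7
\end{cases}
\qquad (7.4)
$$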
Thus, in this algorithm, the first level of ranking is obtained by Ta. The set of documents corresponding to each Ta is in turn ranked at the second level according to Tb, and within each such set the documents are ranked at level three according to the normalized index based weight d×WQTa.Tb.
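The three level ordering can be sketched as a single sort with a composite key. The sketch assumes that smaller Ta and Tb values indicate a better match (Ta = 1 being the complete CRC match, which receives the largest factor d in Equation 7.4) and that a larger index based weight ranks higher. The preliminary merge step that favours documents occurring in the maximum number of sets is omitted. The function names and the sample hits are hypothetical.

```python
def match_factor(ta):
    """Normalized weight factor d from Equation 7.4."""
    if ta == 1:
        return 0.5
    if ta in (2, 3, 4):
        return 0.3
    if ta in (5, 6, 7):
        return 0.2
    raise ValueError(f"unexpected Ta value: {ta}")

def rank_documents(hits):
    """Order hits by Ta, then Tb, then descending d * weight.

    Each hit is a dict with keys 'doc_id', 'ta', 'tb' and 'weight', where
    'weight' stands for the index based feature weight (assumed given).
    """
    return sorted(
        hits,
        key=lambda h: (h["ta"], h["tb"], -match_factor(h["ta"]) * h["weight"]),
    )

# Hypothetical hits for illustration only.
hits = [
    {"doc_id": "a", "ta": 2, "tb": 1, "weight": 0.9},
    {"doc_id": "b", "ta": 1, "tb": 3, "weight": 0.2},
    {"doc_id": "c", "ta": 1, "tb": 3, "weight": 0.8},
    {"doc_id": "d", "ta": 1, "tb": 1, "weight": 0.1},
]
ranked_ids = [h["doc_id"] for h in rank_documents(hits)]
```

Document a has the largest feature weight but is ranked last because its degree of match is weaker, which illustrates that the weight only reorders documents within the same Ta.Tb set.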
Thus, the conceptual searching and ranking algorithm considers the degree of match, the context of the query, the concept association, and the index based term, concept and position factors of sentences as well as documents for effective ranking. When we attempted event based search with this
concept based search engine, we found low precision for some queries that require event based results. We therefore concluded that event based concepts and relations in the semantic graphs need to be given higher weight than ordinary concepts. Sophisticated document processing tasks
were required to learn the event based concepts and relations and to weight documents accordingly. In addition, a suitable event based index was needed, in which the user can search for an event by any of its event based properties. The design of this event based index is explained in the next section.
7.4 MULTI FIELD INDEX FOR EVENT SEARCH
This index structure helps to search for an event across different times, persons and places, for different events that share the same time, place and person, and for the same person, place or time across different events. The concept based index
(Subalalitha et al 2011) is an inverted index that maps concepts to documents, whereas a conventional inverted index maps terms to documents. The concept based index contains two separate indices, namely the Concept-Relation-Concept indices (CRC indices) and the Concept indices (C indices). The CRC indices
hold the entire relation between the concepts, while the C indices hold only the concepts. The work described in this thesis adds Event, Time,
Place and Person as new fields in the CRC index. Concepts connected by event based relations and constraints such as tim (the time of an event), agt (the person of an event), plc (the place of an event), plf (place from) and plt (place
to) are stored in the semantic link based fields <Time>, <Place> and <Person>. We have separated our indexer into <Person>, <Place> and <Time> indexes based on the event cluster results.
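The mapping from UNL relations to event fields can be illustrated with a small in-memory sketch. It assumes that the event concept is the source of the relation and the time, person or place concept is the target, and that plf and plt are both stored under the Place field. The actual index is built on Lucene and its schema is not reproduced here. The function names, document identifiers and the concepts other than those of Figure 7.1 are hypothetical.

```python
# UNL relation -> event field (plf and plt folded into Place by assumption).
RELATION_TO_FIELD = {
    "tim": "Time",
    "agt": "Person",
    "plc": "Place",
    "plf": "Place",
    "plt": "Place",
}

def event_fields(triples):
    """Collect Event, Time, Person and Place values from one document."""
    record = {"Event": set(), "Time": set(), "Person": set(), "Place": set()}
    for source, relation, target in triples:
        field = RELATION_TO_FIELD.get(relation)
        if field is None:
            continue  # not an event based relation, e.g. obj
        record["Event"].add(source)
        record[field].add(target)
    return record

def search(records, **criteria):
    """Return ids of documents whose fields contain every requested value."""
    return sorted(
        doc_id for doc_id, record in records.items()
        if all(value in record[field] for field, value in criteria.items())
    )

records = {
    "d1": event_fields([
        ("build(icl>action)", "agt", "Rajarajachozhan(icl>person)"),
        ("build(icl>action)", "obj", "Thanjai Temple(iof>temple)"),
        ("build(icl>action)", "plc", "Thanjavur(iof>city)"),
    ]),
    "d2": event_fields([
        ("visit(icl>action)", "plc", "Thanjavur(iof>city)"),
        ("visit(icl>action)", "tim", "2012"),
    ]),
}
same_place = search(records, Place="Thanjavur(iof>city)")
place_and_person = search(
    records, Place="Thanjavur(iof>city)", Person="Rajarajachozhan(icl>person)"
)
```

Querying on Place alone returns both documents, while adding the Person constraint narrows the result to d1, which is the kind of multi-field retrieval the index is designed for.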
Figure 7.2 Term based index
Figure 7.3 Concept based index
Figure 7.4 Concept based multi field event indexer
Figure 7.2 depicts a term based indexer, in which the terms are
considered as keys for this indexer. We can search only with a single field.
Similarly, the concept based indexer shown in Figure 7.3 considers
only the connected concept (To concept) as the key to retrieve results from the
indexer. The concept based multi-field indexer shown
in Figure 7.4 takes the event based fields <Event>, <Time>, <Place> and
<Person> as keys, so the user can retrieve information using multiple
fields. Table 7.1 shows the list of fields used in the concept based event
indexer.
A Common Event Hash (CEH) key is required to store the Time, Person and Place based event clusters. The existing UW dictionary (Balaji et al 2011) maintains a <Term ID> and a <Concept ID> for each term that appears in it, to identify the unique terms and concepts in a document. The Lucene index is split based on the <Concept ID>, which represents the conceptual similarity between terms. For example, in Tamil, the terms “அருவி (Water falls) - Aruvi” and “நீர்வீழ்ச்சி (Water falls) - Nīrvīḻcci” should both be mapped to the concept “Water falls (icl>natural phenomena)”. The <Concept ID> is also used as the Common Event Hash key to split the cluster results based on <Person> and <Place>, in ranges of 10,000 IDs. The concepts (either Person or Place) whose concept IDs lie between 1 and 10,000 appear in a single cluster, those between 10,000 and 20,000 appear in the next, and so on until all the concepts in the UW dictionary are covered. The number of splits depends on the number of concepts retrieved from the clusters and on their concept ID values. The <Time> based clusters are indexed with the normalized temporal expression in the particular range. The Date field contains the difference between the Document Creation Time and the actual time of an event, represented in the standard time format.
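The range based split and the Date field can be sketched as follows. This is a minimal illustration, assuming that concept IDs start at 1, that each cluster covers exactly 10,000 consecutive IDs (so that 10,001 opens the second cluster), and that the Date difference is expressed in days; the function names are hypothetical.

```python
from datetime import date

SHARD_SIZE = 10_000  # range width stated in the text


def ceh_shard(concept_id, shard_size=SHARD_SIZE):
    """Map a concept ID to a zero-based cluster number (1..10000 -> 0, 10001..20000 -> 1)."""
    if concept_id < 1:
        raise ValueError("concept IDs are assumed to start at 1")
    return (concept_id - 1) // shard_size


def date_field(document_creation_time, event_time):
    """Days between the Document Creation Time and the actual time of the event."""
    return (document_creation_time - event_time).days
```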
Table 7.1 Fields used in the concept based event index

| Field | Description |
| --- | --- |
| Concept-relation-concept | The Concept (C1) - Relation (R) - Concept (C2) relationships that belong to the document set D. |
| Concept | The list of concepts (Cs) that occur in the document set D. |
| Event | Event based connected concept |
| Time | Time based connected concept |
| Place | Place based connected concept |
| Person | Person based connected concept |
| Document Identifier | d1, d2, d3, ..., dn: the list of documents that contain both CRC and C indices. |
| Sentence Identifier | S1, S2, S3, ..., Sn: the list of sentence identifiers that contain both CRC and C indices. This is used to know the position of the term/concept in the document. |
| Event weight | Number of event based concepts in the document |
| Position weight | Concept position weight |
| Frequency count | Frequency of occurrence of the concepts in the document |
| Part of Speech tagging | Used to know the importance of the concepts with respect to the domain of interest. For example, for the tourism domain the “Named Entities” are more important than other nouns. |
| Term words | The term words present in the document |
| Date | Document Creation Time - Actual Time |
For example, if the user query is “சச்சின் உலக கோப்பை கிரிக்கெட் சாதனை - Cacciṉ ulaka kōppai kirikkeṭ cātaṉai (Sachin Tendulkar's World Cup record)”, a semantic graph is constructed for the query, and the concepts retrieved from this event query are Sachin(icl>person), World cup(icl>event), cricket(icl>sport) and record(obj>thing). The concept-relation-concept triples (CRCs) are Sachin(icl>person)-pos-World cup(icl>event) and World cup(icl>event)-mod-cricket(icl>sport). First, the query graph is analyzed for <Person>, <Place> and <Time> based concepts. From semantic constraints such as “icl>person” and “icl>event”, we can identify whether to search in the Time based cluster, the Person based cluster or the Event based cluster. Each query concept is associated with a concept ID, and with the help of this concept ID the person based cluster is retrieved.
The list of concepts used to query the multifield Lucene indexer is
given below.
Concepts: Sachin(icl>person), World cup(icl>event), cricket(icl>sport), record(obj>thing)
CRCs: Sachin(icl>person)-pos-World cup(icl>event); World cup(icl>event)-mod-cricket(icl>sport)
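The use of the semantic constraint to choose a cluster can be sketched as below. The mapping table and the function names are hypothetical, and concepts whose constraint is not an event property are simply grouped under a generic "concept" key; this is an illustration, not the routing logic of the actual search engine.

```python
import re

# Hypothetical mapping from UNL semantic constraints to the cluster searched.
CONSTRAINT_TO_CLUSTER = {
    "icl>person": "person",
    "icl>place": "place",
    "icl>time": "time",
    "icl>event": "event",
}

CONCEPT_PATTERN = re.compile(r"^\s*(?P<head>[^()]+?)\s*\((?P<constraint>[^()]+)\)\s*$")


def parse_concept(concept):
    """Split 'Sachin(icl>person)' into ('Sachin', 'icl>person')."""
    match = CONCEPT_PATTERN.match(concept)
    if match is None:
        raise ValueError(f"not a UNL-style concept: {concept!r}")
    return match.group("head"), match.group("constraint")


def route(concepts):
    """Group query concepts by the cluster their constraint points to."""
    routed = {}
    for concept in concepts:
        head, constraint = parse_concept(concept)
        cluster = CONSTRAINT_TO_CLUSTER.get(constraint, "concept")
        routed.setdefault(cluster, []).append(head)
    return routed
```

For the example query, `route` places Sachin in the person cluster and World cup in the event cluster, which is the decision described in the text.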
After locating the person based concepts, the search matches the other concepts associated with the query against the additional event based fields, namely the <Event>, <Time> and <Place> fields. If matches are found for all or only some of these fields, the corresponding documents are retrieved. If none of the query concepts matches the other event based fields, the information is fetched using only the single field.
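The matching and fallback behaviour described above can be sketched as follows. The data layout and the function name `retrieve` are hypothetical, and ordering the documents by the number of matched fields is our own assumption rather than a rule stated in the text.

```python
def retrieve(docs, primary, others):
    """docs: {doc_id: {field: set of concepts}}; primary/others: (field, concept) pairs."""

    def has(doc_fields, pair):
        field, concept = pair
        return concept in doc_fields.get(field, set())

    # Documents that match the primary (for example person based) concept.
    candidates = {d: f for d, f in docs.items() if has(f, primary)}
    # Count how many of the additional event fields each candidate matches.
    scored = {d: sum(has(f, p) for p in others) for d, f in candidates.items()}
    matched = {d: n for d, n in scored.items() if n > 0}
    if not matched:
        # Fallback: no additional field matched, so use the single field only.
        return sorted(candidates)
    # Documents matching more fields come first; ties are broken by id.
    return sorted(matched, key=lambda d: (-matched[d], d))
```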
With the help of the bootstrapping approach, the event patterns in the document are identified and counted. Event weights are assigned based on the following factors.
1. The total number of event specific subgraphs (TCRCe), which is obtained with the help of the event context score described in Chapter 1.
2. The number of event specific templates described in the document (Wte), which is learned with the help of the bootstrapping process, as discussed in Chapter 5.
3. The number of event specific concepts in the document (Wce), which is obtained with the help of the event context score described in Chapter 1.
4. The time of an event with respect to the Document Creation Time (TDCTe), which is obtained from the semi supervised label propagation described in Chapter 6.
Here, TCRCe is identified by counting the CRCs that contain event based concepts and relations. Wte is determined by checking the possible event templates filled by the document: some documents describe all the event based fields, whereas others miss a few. This weight is assigned from the number of event templates extracted from the document. Wce counts only the unique event based concepts that appear in the document. TDCTe is the difference between the actual time of an event and the Document Creation Time (DCT); it helps to determine whether the event described in the document is recent or past. From these analyses, the event weight (eW) is assigned.
\[ eW = \log(\,\ldots\,) \qquad (7.5) \]
\[ \ldots = \ldots \qquad (7.6) \]
Here, TCRCD represents the number of CRCs in the document, TNe represents the total number of events or event tags in the document (for each event tag, the numbers of event templates are assigned and summed together), and Tc denotes the total number of concepts in the document.
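A rough sketch of how the event counts and the document totals defined above could be turned into a single weight is given below. The pairing of each count with a total (TCRCe with TCRCD, Wte with TNe, Wce with Tc) and the log(1 + sum) combination are assumptions made purely for illustration; they are not taken from Equations (7.5) and (7.6), and all function and parameter names are hypothetical.

```python
import math


def normalised_event_features(tcrc_e, tcrc_d, w_te, tn_e, w_ce, t_c):
    """Divide each event count by the document total assumed to correspond to it."""

    def ratio(part, whole):
        return part / whole if whole else 0.0

    return ratio(tcrc_e, tcrc_d), ratio(w_te, tn_e), ratio(w_ce, t_c)


def event_weight(tcrc_e, tcrc_d, w_te, tn_e, w_ce, t_c):
    """ASSUMED combination: log(1 + sum of the three ratios). Illustrative only."""
    ratios = normalised_event_features(tcrc_e, tcrc_d, w_te, tn_e, w_ce, t_c)
    return math.log1p(sum(ratios))
```

With this choice, a document without any event based CRCs, templates or concepts receives a weight of 0, and the weight grows as a larger share of the document is event specific.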
7.5 EVENT BASED SEARCH AND RANK
The existing concept based UNL search engine, CoRee (Balaji et al 2012), has been modified with the event based feature weights to return event specific results to users. The features considered for ranking are given below.
Main Event – Sub Event: This is identified by analysing the Concept-relation-concept (CRC) indices of the document. The document whose CRC indices are connected with a larger number of event based properties is given first preference over the other documents.
Recent – Past: This is identified with the help of the Time and Date fields from the CRC index/C index. If the user specifies a time, the documents which contain the time required by the user are given first preference, and the rest of the documents are ordered from the recent to the past.
Amount of Event Specific Information in a Document: This is considered by using the event weight and the frequency count specified by the CRC index/C index. Initially, the documents are ordered by the frequency of occurrence of the event query concept; within each level, they are sorted again by the number of event based concepts in the document. Documents with a larger number of event based concepts are given a higher ranking than the others.
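The two-stage ordering described for this feature amounts to a sort on a composite key. The sketch below assumes a hypothetical `Hit` record that holds the two counts; breaking remaining ties by document id is our own choice.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    query_frequency: int  # occurrences of the event query concept in the document
    event_concepts: int   # number of event based concepts in the document


def order_by_event_information(hits):
    """Sort by query-concept frequency, then by event-concept count, both descending."""
    return sorted(hits, key=lambda h: (-h.query_frequency, -h.event_concepts, h.doc_id))
```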
The existing ranking parameters have been modified as follows. The basic search procedure is based on CRC, partial CR, or C matches. The ranking algorithm follows the existing Tag determination algorithm, which considers the degree of match (Level 1), concept association (Level 2) and the use of index based features (Level 3). We have replaced the index based feature weight with the event based feature weight described in the previous section. The previous approach uses three levels of ranking. This approach modifies the third level and adds one more level, in which the documents are ranked with a Time based weight. In the third level, instead of considering only the position weight, we additionally consider the event weight. The modified ranking function is therefore given in Equation (7.7). The index based feature weight (WI) is computed as in the previous approach, from the document specific properties, as shown in Equation (7.3).
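\[ \text{Level}_3 = \ldots \qquad (7.7) \]
\[ \text{Level}_4 = \begin{cases} 1, & \text{if } \langle TIME \rangle \in UserQ \\ TDCT_e, & \text{otherwise} \end{cases} \qquad (7.8) \]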
In Level 4, if the user query UserQ contains a temporal expression (<TIME>), the weight is set to 1 and the documents that contain the user-given temporal expression are given a higher weight; otherwise, the documents are sorted in ascending order of their TDCTe value.
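The Level 4 ordering can be sketched as a sort in which a match on the user-given temporal expression comes first and the time difference with respect to the DCT decides the rest. The document layout (`times`, `tdct`) and the function name are hypothetical, and treating a smaller difference as a more recent event is an assumption.

```python
def level4_order(docs, query_time=None):
    """docs: list of dicts with 'id', 'times' (set of normalised expressions) and 'tdct'."""

    def key(doc):
        # Weight 1 when the document carries the user's temporal expression, else 0.
        time_match = 1 if query_time is not None and query_time in doc["times"] else 0
        # Matches first, then ascending time difference, then id for stable output.
        return (-time_match, doc["tdct"], doc["id"])

    return [doc["id"] for doc in sorted(docs, key=key)]
```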
When the user enters an event query, it is enconverted (Balaji et al 2011, Elenchezhian et al 2011) into UNL concepts and relations, as described in Section 3.1, and its expanded concepts are retrieved from the event indexer by analyzing the concepts most frequently connected to it through event based relations and concepts. The expanded terms are then ranked, based on the event based conceptual connection between the query and the document. A query concept that is connected with a larger number of event specific concepts, relations or semantic constraints is given a higher weight than the other concepts. Among the expanded concepts, the top 10 expansions are taken as search concepts for the event search engine. The existing template based summary (Subalalitha et al 2011) has been used with event based features and event specific sentences. We evaluated this event search engine on the FIRE corpus, consisting of 1,94,000 documents, with 20 queries. The basic term based event search gives a precision of P@5 = 0.12, the concept based search (without event based properties) gives P@5 = 0.23, and the concept based event specific search gives P@5 = 0.48. The concept based event specific search therefore performs better than both the term based and the concept based searches.
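The selection of the top 10 expansions described above can be sketched as a frequency count over the CRC triples that touch the query concept. The triple layout and the function name are hypothetical, and the alphabetical tie-break is our own choice; the relation labels in any sample data are illustrative only.

```python
from collections import Counter


def top_expansions(query_concept, event_crcs, k=10):
    """event_crcs: iterable of (concept, relation, concept) triples from the event index."""
    counts = Counter()
    for left, _relation, right in event_crcs:
        if left == query_concept:
            counts[right] += 1
        elif right == query_concept:
            counts[left] += 1
    # Most frequently connected concepts first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [concept for concept, _ in ranked[:k]]
```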
7.5.1 Implementation of Event Search and Rank
The Event Search Interface is built around queries, retrieved documents, indexes and ranked results. Each document contains a <Document Id> and the list of fields required for indexing. The important event based concepts and their parameters are indexed with the help of the Lucene indexer. The offline and online software modules were implemented in Java and deployed as a separate plugin in the open source tool Nutch 0.9.
To search the event index, the query is converted into a semantic graph and expanded with the help of the event indexer. The query graph holds the actual query string, the expanded concepts and their additional parameters, such as the POS tags of the query concepts and the query tags that help to differentiate between term based, concept based and multiword based results. Since the search interface accepts more than one field as input, the query given in each field is enconverted into a UNL based semantic graph, so multiple graphs are constructed.
Therefore, the search method can search for multiple query concepts and rank the documents based on the event based feature weight. The parameters required for the event search are given below.
Search_Result = eventIndex.EventSearchUNL.search(queryString, queryMulti-List)
Here, queryString = "Eventquery" + "Time" + "Person" + "Place", or any one of the combinations.
The queryString can contain any number of fields, based on the need of the user. For example, the user may want to search for documents whose event Time field contains the years “2009 to 2010” and whose Person field contains the name “Tendulkar”. queryMulti-List contains the graph based properties such as Cs and CRCs, and additional contextual information such as the other connected concepts, POS tags and MW (Multiword) tags, which are retrieved from the event indexer. The resulting documents and their associated summaries, in both English and Tamil, are given as output to the user. The user interface design of the event search engine is shown in Figure 7.5.
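The idea that the query may carry any combination of the four fields can be sketched as below. The function `build_query`, the field names and the list-of-pairs result are hypothetical; the way the actual queryString is assembled and passed to the search method is not reproduced here.

```python
FIELD_ORDER = ("event", "time", "person", "place")


def build_query(**fields):
    """Keep only the fields the user filled in, in a fixed order."""
    unknown = set(fields) - set(FIELD_ORDER)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return [
        (name, fields[name].strip())
        for name in FIELD_ORDER
        if fields.get(name, "").strip()
    ]
```

For the example in the text, only the Time and Person fields are filled in, so the resulting query carries two of the four fields.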
Figure 7.5 Concept based Event search Interface
7.6 RESULTS AND EVALUATION
We have used the Forum for Information Retrieval Evaluation (FIRE) Tamil corpus, which consists of 1,92,200 documents. Among those documents, our concept based event specific indexer indexed 1,72,000 documents and identified 25,000 concepts and 30,000 concept-relation-concept indices. The concept coverage is lower than that of the existing concept based search for the tourism domain (Balaji et al 2011). We used the existing UW dictionary to enconvert the documents. Since that dictionary was developed for the tourism domain, it covers most places, but its coverage of event and person based concepts is lower. We tested 50 Tamil event based news queries of different event types. Before computing the Discounted Cumulative Gain (DCG), a human relevance rating (on a five-point scale) needs to be assigned, to check the accuracy of the search and rank. Since this method supports concept based event search with different event fields, the evaluation specification contains different weighting factors. The evaluation specification for each method is defined in each case.
Case 1: For a given query Q, the user may want to search for the same event occurring at different times and places and involving different persons. In that case, the rating scheme shown in Table 7.2 is followed to assign the user rating.
Table 7.2 Five scale rating for event search for case 1

| Rating | Justification |
| --- | --- |
| 5 | The whole document talks about the main event (with all the event based properties), or the title itself contains the given query |
| 4 | Main event with either Time/Persons/Place |
| 3 | Sub events of the main event / topically related events |
| 2 | Only a few lines talk about the query / links talk about a related topic of an event |
| 1 | Single occurrence of a term of an event / related event |
Case 2: If the query Q is a combination of Time, Place and Person, the search engine should return the different events that occur at the same Time and Place and involve the same Person. To weigh that factor, the rating shown in Table 7.3 has been assigned.
Case 3: If the given query Q is a Time, Person or Place, the news search engine should return the different events involving the given event property. From the user's perspective, a document is most relevant when it matches the specific kind of event tracking the user is doing. If users track events by person, the resulting documents should describe the most relevant events related to that person, and likewise for time and place. For example, if a user wants to track a sports person, the search engine should return documents that talk about events in which that sports person was involved. The rating scheme is shown in Table 7.4.
Table 7.3 Five scale rating for event search for case 2

| Rating | Justification |
| --- | --- |
| 5 | The document contains a larger number of events that occur at the same time, involve the same person and occur in the same place |
| 4 | A larger number of events for which only two of the parameters are the same, for example Place + Person or Person + Time |
| 3 | A larger number of events with only one of the parameters (Time, Place or Person) the same |
| 2 | Only a few lines/links talk about an event with any one of the parameters Time/Person/Place |
| 1 | Single occurrence of the term of an event / related event |
Table 7.4 Five scale rating for event search for case 3

| Rating | Justification |
| --- | --- |
| 5 | The document contains the most relevant event for a given Time/Person/Place |
| 4 | Related relevant event for a given Time/Person/Place |
| 3 | Talks about any events for a given Time/Person/Place (they need not be closely related events specific to a particular place/person/time) |
| 2 | Only a few lines/links talk about an event with any one of the parameters Time/Person/Place |
| 1 | Single occurrence of the term of an event / related event |
The concept based ranking function has been changed by introducing an additional event weight. The method considers the number of conceptual links for a given event query and gives importance to the cause and effect relation, which yields the main event and sub events of the given query. Temporal factors are important, since users usually tend to search for recent events rather than past ones. Hence, we rank the documents from the recent to the past. The document specific properties, such as the position of the concept in the document and its frequency of occurrence, are considered in the final stage. The Normalized Discounted Cumulative Gain (nDCG) is calculated for the different cases, depending on the query. It is compared for term based search, concept based search (without event based processing) and concept based event search, as shown in Table 7.5.
Table 7.5 Comparison of nDCG

| Approach | nDCG |
| --- | --- |
| Term based Search and Rank | 0.43 |
| Concept based Search and Rank (without event based process) | 0.61 |
| Concept based event Search and Rank | 0.86 |
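For reference, nDCG over a ranked list of five-point ratings can be computed as in the sketch below. The text does not state which DCG variant was used, so the common form with a log2(rank + 1) discount and linear gain is assumed here.

```python
import math


def dcg(ratings):
    """Discounted cumulative gain with the common log2(rank + 1) discount."""
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(ratings, start=1))


def ndcg(ratings):
    """Normalise DCG by the DCG of the same ratings in ideal (descending) order."""
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal else 0.0
```

A result list that is already ordered from rating 5 down to rating 1 scores 1.0, and any other ordering of the same ratings scores less.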
Precision is calculated after assigning a relevance tag to each retrieved document. If the content of the resulting document is closely related to the given event query Q, the relevance tag is ‘more relevant’, its parameter is ‘Y’ and its value is 1. If the content is partially related to the query, the tag is ‘partially relevant’, its parameter is ‘P’ and its value is 0.5. If the relevant result is found only through a link in the web page, it is considered ‘less relevant’; its parameter is ‘L’ and its value is 0.5. Sometimes we retrieve results with no such link; these are considered ‘not relevant’, with parameter ‘N’ and value 0. If the resulting page cannot be accessed because the page cannot be found, the page is under construction, or some other technical fault occurs, it is categorized as ‘site cannot be accessed’, with parameter ‘X’ and value 0. In the five-point rating scheme given in the previous tables, ratings 5 and 4 are assigned ‘Y’ as the relevance tag, rating 3 is assigned ‘P’, rating 2 is assigned ‘L’, and rating 1 is assigned ‘N’ or ‘X’.
Table 7.6 Comparison of Precision (P@5 and P@10)

| Approach | P@5 | P@10 |
| --- | --- | --- |
| Term based Search and Rank | 0.13 | 0.17 |
| Concept based Search and Rank (without event based process) | 0.23 | 0.22 |
| Concept based event Search and Rank | 0.48 | 0.43 |
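The relevance tags and their values can be combined into a graded P@k as sketched below. The tag values and the rating-to-tag mapping follow the text; averaging the tag values over the top k results is our assumption about how the precision is computed, and rating 1 is mapped to ‘N’ here although the text also allows ‘X’.

```python
TAG_VALUE = {"Y": 1.0, "P": 0.5, "L": 0.5, "N": 0.0, "X": 0.0}
RATING_TO_TAG = {5: "Y", 4: "Y", 3: "P", 2: "L", 1: "N"}


def precision_at_k(tags, k):
    """Average tag value over the top k results; missing results count as 0."""
    return sum(TAG_VALUE[t] for t in tags[:k]) / k
```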
From Tables 7.5 and 7.6, the term based event search and rank gives low precision, and the concept based search results are better than the term based search results. However, the precision of the concept based search is still low when compared to the concept based event search method. Compared to the existing concept based search and rank method, this work shows a 20% improvement, and the discounted cumulative gain is also higher for the concept based event search and rank. The ranking algorithm can be further improved by considering an event concept based PageRank score for the retrieved documents. We also plan to extend our previous work (Umamaheswari et al 2013) to automatically learn the ranking parameters for the event search.
7.7 CONCLUSION
To help news readers get coherent information about an event, and to facilitate event search from event based perspectives such as Time, Place and Person, this work describes an approach that tracks events conceptually with respect to these event specific properties, and ranks the documents using event specific features and relations. The news event search has been developed for Tamil news documents, and we plan to extend the work to other languages such as English, Malayalam and Telugu. The language independent semantic representation helps in adapting the approach to different languages, irrespective of the nature of the language. The next chapter discusses the extension of this event search and rank by considering the event based conceptual links between pages, so that a page which has either a physical link or a conceptual link is given a higher weight. We have also incorporated user ratings, by learning the conceptual features of the documents that have higher user ratings.