CHAPTER 7

CONCEPT BASED EVENT SEARCH AND RANK

The previous chapters discussed the document processing tasks

required by the event based search engine. Those document processing tasks

help to extract event based information and group the specific events in each

cluster. This helps to index domain specific information for news event

search. Since the content on the web is growing rapidly every fraction of a

second, search engines such as Google, Yahoo and MSN have become the

most heavily-used online services, with millions of searches performed every

day. All of these search engines essentially use a keyword based search strategy. Ranking algorithms such as the PageRank algorithm (Brin et al 1998) and the HITS algorithm (Ding et al 2012) score documents according to their incoming and outgoing links. However, due to the large number of documents available on the web, the number of results produced by keyword based search engines is too large. The ultimate challenge for search engines is to retrieve the information from the web that is most relevant to the user's information need.

Concept based search attempts to improve search effectiveness by using conceptual information that conveys meaning, rather than the presence or absence of keywords, as the basis for the retrieval process. The multilevel concept based search and ranking algorithm described by Balaji et al (2012) retrieves and ranks documents based on the concepts

and relationships between the concepts using Universal Networking Language

(UNL) based semantic representation of the documents. The Semantic graph

based Index is used for both query expansion and concept based search. The

algorithm has been evaluated on a corpus of tourism documents, and its

performance is compared with that of keyword based search. The mean

average precision of the concept based search for tourism domain is found to

be 0.75 while the keyword based search has a Mean Average Precision

(MAP) score of 0.45. We adopted the same approach for event based search and ranking, modifying the ranking function with an event based weight because the unmodified engine gave low precision for queries that require event based results. This chapter describes a concept based approach for news event search and ranking.

7.1 INTRODUCTION

Combining heterogeneous information from different news articles

requires complex semantic interpretation, which can automatically map

different concepts, and tag the domain specific information. In order to

automatically identify the event specific sentences, and to extract event

specific features from large corpora, we apply machine learning approaches.

We present a new context based event model for News Articles. We need to

represent the event context, using the appropriate semantic models. The

Resource Description Framework (RDF) (Graves & Gutierrez 2006) graph is generally used to express the relationship between words as subject-predicate-object triples, expressed using word level semantics. This subject-predicate-object representation is suitable for structured languages like English. Compared to RDF, OWL (Antoniou et al 2004) has a richer set of semantics, but it is oriented toward representing ontologies rather than text.

For both OWL and RDF, it is difficult to express the richer semantics required

for event based processing of text. Hence, we have chosen UNL as a semantic

graph based representation, which helps to represent documents in language

and domain independent ways.

As discussed in the previous chapters, event based features are extracted using machine learning approaches, viz. Agglomerative Clustering, Bootstrapping and Label Propagation, for event clustering, event pattern identification, and temporal expression identification and normalization respectively. These machine learning approaches help to extract

event based information from the news text and use this information to

construct event specific indexes. In this chapter, we discuss a multi-field news

event search engine, which is designed with a specialized concept based multi

field index that considers event based features. The contributions of this work

include the design of multi-field index structure, and the enhancement of a

concept based search engine (CoRee) (Balaji et al 2012) with event based

features.

When a user enters a query for a news event, they may want to track information with respect to location, time and person. A single field indexer requires additional processing to retrieve the documents that contain all the fields required by the user. Moreover, the semantic link between an event and the exact time, place and person may not be correct. Hence, we

have pre-processed each document, and extracted the event concepts which

occur with time, place and person based relations for building the multi-field

indexer. This multi-field indexer helps us to get results for the exact link of an

event with respect to time, place and person. The related work is discussed in

the next section.

7.2 APPROACHES TO CONCEPTUAL SEARCH

This section discusses existing concept based search approaches that rely on semantic representation, ontology and other semantics based representations. Concept based search systems can be classified into those that use a background knowledge source to provide conceptual information, and those that use semantically analyzed components of the document. Concept

based search can also be classified based on how semantics is used to

represent the documents. Documents can be represented by considering

concepts associated with the frequently occurring keywords, or by converting

important components of the document into a semantic structure. In addition, concept based search can be classified by where semantics are introduced among the components of the search engine. Semantics can be introduced in query expansion, in building the index, in searching, and in ranking the search results. The related work on concept based search is discussed below with respect to these aspects.

7.2.1 Semantic based Search

While some meaning based search engines use sentence level

semantics, others use ontology as the background knowledge source for

providing semantics. Hakia (Sudeepthi et al 2012) is a semantic search engine that uses knowledge of ontology and fuzzy logic for semantic ranking. To retrieve conceptual results, it uses the Query Detection and Extraction (QDEX) indexing architecture (Loutas et al 2012), which enables the semantic analysis of web pages and provides meaning based search results. In Hakia, phrases are used in addition to keywords for meaning based searches. A limitation of Hakia is that it accepts queries as questions in a specific format. Also, the QDEX algorithm extracts all possible queries that can be

asked on the content of web pages of various lengths and forms. This is an

offline process before any user query is entered.

The major difficulty in the QDEX system is reducing the huge number of generated query sequences to a few dozen that make sense. Hakia allows only these predefined query sequences, generated from the content, to be used as queries.

On the other hand, SenseBot (Rana & Singh 2013) is a semantic search engine that runs on top of search engines like Google and Yahoo to generate a multi-document summary based on text mining and limited semantics. Though all the above search engines provide meaning based results, some require sophisticated query analysis techniques to do so. Others consider concepts, rather than relations between concepts, as the basis of the match. In the search engine described in this work, however, the context of the query is retrieved by traversing the already created UNL based index. The frequently occurring UNL relations obtained from the UNL index in effect provide

information about the possible connections between concepts in the specific

domain under consideration. These connections provide the context of the

query concept, and the query expansion based on this context yields

meaningful search results.

7.2.2 Ontology based Search

Concept based search can also be based on the use of knowledge

structures. One such search engine is Expert Knowledge Ontology-based Semantic Search (EKOSS) (Kraines et al 2006), an ontology based semantic search engine that uses a fully functional ontology to represent the knowledge base. A collaborative knowledge

sharing environment is provided, which helps knowledge experts to share knowledge resources such as research papers, databases, computer simulation models, and even curricula vitae. The EKOSS system is used to construct computer-interpretable, semantically rich statements of the knowledge resource. When a user request is posted, the system converts it into a computer readable knowledge description based on description logic and associated rules. Ontology-based information retrieval (Gao et al 2005) intended for e-Government has been developed for securing the legal documents of the government. The disadvantage of ontology based search engines is that they are susceptible to changes in the information resources, which affect the conceptualization of the domain representation. Moreover, the effort required to build an ontology is huge. Ontology based news event search depends on the domain, and the use of a common vocabulary ontology for different domains remains a challenging task.

7.2.3 UNL based Search

A meaning based multilingual search engine that uses the Universal Networking Language (UNL) is AgroExplorer (Surve et al 2004). It is similar to the search engine described in this work, since AgroExplorer also uses UNL expressions to represent sentences as graphs that capture the meaning of the sentences. AgroExplorer has been developed for the agriculture domain and also provides multilingual features. It uses a simple search and rank process based on the degree of match with the query UNL, and on the frequency of occurrence of concepts with other concepts in the UNL expression. The searching and ranking algorithm described in this work is part of a UNL search system that differs from AgroExplorer (Surve et al 2004) in that our approach incorporates semantics in every component of the search

engine. Previous approaches consider the conceptual relationship only at the word level, and use only document based properties, such as position and term/concept level statistics, for ranking. In addition to these document based properties, we use a three level conceptual search and rank process. We have also attempted index based query expansion, which considers the conceptual link of query concepts with other concepts, in order to obtain context based results. This concept based search has been enhanced with event based text processing, and we have adopted the same ranking approach with an event based weight. The existing work in event based search is explained in the next section.

7.2.4 Event based Search

STORIES in time is a graph-based interface for news tracking and discovery (Berendt & Subasic 2009). This work provides a graph based representation of the news story, in which the user can search a specific node of the news topic to get specific information about an event or a topic. The method extracts facts from the document along with the time. For each time period, a set of graphs represents the news event. The term level similarity between two temporal nodes is measured based on their term frequency and co-occurrence frequency. Existing event tracking work has given more importance to the temporal aspect than to person and place, and only limited work considers all the entities for event tracking. Sayyadi et al (2009) introduced a community detection method to form a network of keywords pertaining to an event. Hence, for each event there is a set of frequently co-occurring terms. In this approach, a term that appears in one event may appear in another event, and this redundancy may lead to wrong interpretation. Lam et al (2001) introduced a contextual analysis of news events, instead of term based matching between

events; their method identifies a concept based similarity between events,

based on the statistical context identified between the sentences of two events.

NewsX (Wunderwald et al 2011) is an event extraction tool developed to

answer WH questions for news events. It uses a rule based approach to find

answers to the WH questions. Cybulska & Vossen (2011) proposed a similar

event extraction method for historical events. They used WordNet and

ontological classes for the semantic tagging of the given corpus. Manual tagging

is required to indicate the historical data in terms of place, person and action.

However, the automatic extraction of historical information and learning the

relationship with other historical events, are still challenging issues.

Cai et al (2013) proposed an event relationship analysis based on temporal facts, focusing mainly on event evaluation based on time varying features. Similarly, Jin et al (2008) proposed a time based web search engine, which focused only on the temporal aspect of the search query. Allan

et al (2001) have also proposed temporal based news summarization, based on

the probability of the occurrence of a word on the given topic. Feng et al

(2007) proposed an approach for finding links between events. They used a

clustering approach to link events and their relationships. Kapp et al (2013)

described a person based event exploration. They computed the similarity

between events based on the participating entities. Sarma et al (2011) consider

time constraint-based relationship between event entities. They constructed a

global temporal cluster, and a local temporal cluster to identify the dynamic

relationship between events. All the above approaches consider only the co-occurrence context, whereas we have developed an event model that considers not only the temporal facts but also the relationship of an event to the person and place, where the event context may be at the word level, sentence level, or document level.

A prototype called NewsSync (Vydiswaran et al 2011) explores news stories based on user preferences. Though the idea behind this work is similar to the work described in this chapter, it works for structured languages like English. Moreover, it does not use any semantic representation to provide meaningful results to the user. In our

approach, we have utilized the existing rule based UNL enconversion

(Balaji et al 2011) to represent a document using semantic graphs. From this

semantic graph, we have identified only event based semantic subgraphs,

which are required for event specific clustering to cluster events in terms of

time, place and person. Adopting machine learning approaches in information

retrieval can help to reduce the time to process the information. Therefore,

compared to previous approaches, our method works well to provide

meaningful results to the user within a reasonable time limit. Moreover, this

work can be adapted easily to other domains and languages, without

modifying the methodology. The machine learning approaches adopted for

our Tamil news search engine and its system architecture are discussed in the

next section.

7.3 COREE – A CONCEPT BASED SEARCH ENGINE

In the UNL based search system discussed here, UNL graphs that

represent fragments of sentences in a document are used to build the

conceptual index. The UNL enconverter of the system uses a rule based

approach to convert the sentence constituents to UNL graphs where concepts

are represented as nodes and relations as edges. The use of this approach

allows terms to be represented as concepts, extracts a standard set of

semantic relations between concepts in a sentence, and at the same time,

associates a hierarchy of the concepts linked through the UNL semantic

relations. This essentially means that semantically analyzed information

from the sentences of the documents is used for building the index. In

addition, the constraints associated with the concepts available in the UNL

Knowledge Base (KB), also incorporate information from a background

knowledge resource into the index structure. For example, the word Chennai in a Tamil sentence is translated into the Universal Word (UW) Chennai(icl>place). Here, Chennai denotes the head word and icl>place denotes the constraint associated with the concept. Figure 7.1 shows the output of the UNL enconversion process for a sentence.

Figure 7.1 Semantic representation of a sentence

In Figure 7.1, the concept build(icl>action) is connected to Rajarajachozhan(icl>person) and to Thanjai Temple(iof>temple) by the relations agt and obj respectively.

The set of UNL graphs obtained from the enconversion component of the search system is represented as a multilist structure, which is discussed in Chapter 4. This multilist structure contains three separate indices, namely the CRC (Concept-Relation-Concept), CR (Concept-Relation) and C (Concept) indices, in order to aid searching and ranking. In addition to

building the UNL graph represented as a multilist structure, the UNL

enconverter provides additional information to aid the retrieval process.

The CRC indices for the UNL representation of the Tamil sentence are build(icl>action) - agt - Rajarajachozhan(icl>person) and build(icl>action) - obj - Thanjai Temple(iof>temple); the CR indices are build(icl>action) - agt and build(icl>action) - obj; and the C indices are build(icl>action), Thanjai Temple(iof>temple) and Rajarajachozhan(icl>person).
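As an illustration of how the three indices relate to one another, the following minimal sketch derives CRC, CR and C entries from the two relations of the example sentence. The function name `build_indices`, the dictionary-of-sets layout and the document identifier "d1" are assumptions made for this example; they are not the multilist structure of Chapter 4.

```python
from collections import defaultdict

# Relations of the example sentence as (source concept, relation, target concept).
triples = [
    ("build(icl>action)", "agt", "Rajarajachozhan(icl>person)"),
    ("build(icl>action)", "obj", "Thanjai Temple(iof>temple)"),
]

def build_indices(doc_id, triples):
    """Return CRC, CR and C inverted indices mapping each key to a set of document ids."""
    crc, cr, c = defaultdict(set), defaultdict(set), defaultdict(set)
    for source, relation, target in triples:
        crc[(source, relation, target)].add(doc_id)  # complete relation
        cr[(source, relation)].add(doc_id)           # concept with its outgoing relation
        c[source].add(doc_id)                        # isolated concepts
        c[target].add(doc_id)
    return crc, cr, c

# "d1" is a hypothetical document identifier.
crc_index, cr_index, c_index = build_indices("d1", triples)
```

Each key maps to the set of documents containing it, so the same posting style serves complete and partial matches.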

Sentence based information includes the sentence identifier, Part Of

Speech tags, Entity tags, Multiword tags, the actual terms or words associated

with the UNL concepts, and a bit pattern vector, that indicates sentence-wise

position of the concepts in the document. Document based information

includes a document identifier, term frequency, concept frequency, and the

position of the concepts in the document. These features are used in weight

determination during the searching and ranking of documents. Features such

as the frequency of the concepts present in the document in addition to term

frequency, allow ranking to be both term and concept based which becomes

important when the term frequency is not significant. The bit pattern vector

indicating the distance between concepts helps to identify relations that are

not necessarily proximity dependent. The UNL index, with all the above sentence level and document level information, was initially stored in a Binary Search Tree (BST). With the BST, we were able to index up to 33,000 documents. Beyond that number, we were unable to insert further concepts into the BST and had to choose a different data structure. We tried an SQL database and a Lucene index, and tested each with 1 lakh (100,000) documents. Both could index the concepts, but searching the SQL database took more time than searching the Lucene index. Hence, we have chosen Lucene for indexing the concepts, and the relations between concepts, from the documents represented in UNL.

7.3.1 Concept based Query Expansion

In this work, the context of a query concept is defined as the

association of this concept with other concepts in a CRC relation, across

documents in the domain of interest. By analyzing the index, the concept

associated with a query is matched with the CRCs of the index, and the most

common CRCs associated with the query concept are extracted. The expanded concepts obtained are ranked based on the frequency of the CRC and on whether the concept is an entity. Query expansion is an online activity, and the index analysis makes it efficient. The most frequently occurring CRC in the index indicates the frequent association of concepts in the domain across documents, and hence gives the domain context of the query concept. This expansion of the query concepts to CRCs allows context dictated query sub graphs to be constructed for the query. The expanded query graph is now associated with the actual query terms, the query concepts, and the expanded concepts associated with the context of the query concept. This, in turn, means that these three must be distinguished during both searching and ranking.

The index based query expansion influences the searching and

ranking of documents in many ways. It helps to build CRC query graphs that

can be matched with the Concept based index. Without this expansion, single

word queries would have resulted in an isolated concept, which may not give

semantic based results to the user. As already explained, the association of

expanded concepts allows domain oriented, corpus based context of the query

word to play a role in semantic matching, and in addition, helps to bring in

documents, which have concepts in the context of the query, which would

have been missed by other search mechanisms.
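The index based expansion described in this section can be sketched as a frequency lookup over the CRC index. This is a simplified illustration only: the function `expand_query`, the toy postings and the concept labels are hypothetical, and the entity criterion mentioned above is left out.

```python
from collections import Counter

def expand_query(concept, crc_postings, top_k=2):
    """Return the top_k CRC triples that contain `concept`, ranked by document frequency."""
    counts = Counter()
    for triple, doc_ids in crc_postings.items():
        source, _relation, target = triple
        if concept in (source, target):
            counts[triple] = len(doc_ids)
    return [triple for triple, _count in counts.most_common(top_k)]

# Hypothetical CRC postings: triple -> documents containing it.
crc_postings = {
    ("temple(icl>place)", "plc", "Thanjavur(iof>city)"): {"d1", "d2", "d3"},
    ("build(icl>action)", "obj", "temple(icl>place)"): {"d1", "d4"},
    ("visit(icl>action)", "obj", "temple(icl>place)"): {"d2"},
    ("rain(icl>event)", "plc", "Chennai(iof>city)"): {"d5"},
}

expanded = expand_query("temple(icl>place)", crc_postings)
```

A single-concept query is thereby turned into CRC sub graphs that reflect its most frequent associations in the corpus.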

7.3.2 Concept based Search

The basic searching procedure is based on complete CRC Match

or partial CR or C matches between the query sub graphs and the

corresponding index, as in AgroExplorer (Surve et al 2004). However, the

design of the ranking procedure depends on whether the match of the index

is with the actual query terms, actual query concepts or expanded concepts.

In addition, all the sentence and document based features associated with the

conceptual indices also affect the ranking procedure.

The overall algorithm for searching and ranking actually performs a

three level ranking. The first level ranking is obtained based on whether there

is complete match (CRC match), partial match of Concept Relation (CR) or

match of only concepts (C Only). This level of ranking is provided by the

Degree of Match Categorization tag Ta. The set of documents obtained at the

Level 1 category is further prioritized, using Concept Association

Categorization Tag Tb. Concept Association categorization depends on

whether the index match is between the query terms, query concepts or

expanded concepts. Once the documents have been ranked by Ta and Tb, the documents at the same Ta.Tb level are ranked by weights calculated from the index based features associated with the concept.

A Tag represented as Ta.Tb helps in determining the two level lists

of prioritized documents. Tag Ta computed in Level 1 indicates the degree of

match while Tb computed at Level 2 indicates the type of concept association.

For determining the tags, the following terminology is defined. A given query with $n$ terms is represented as a set $Q = \{q_1, \ldots, q_n\}$, where $q_i$ represents a query term. Each element $i$ of the power set of $Q$ is expanded and enconverted to a set $EQ_i$ of UNL graphs $g_{im}$, where $m$ indexes the expanded concepts obtained from the UNL index and $m > 0$. The power set of $Q$ is used because a query term may be associated not just with a single expanded term and its concept, but with more than one expanded term and concept. That is, $EQ_i = \{g_{i1}, g_{i2}, \ldots, g_{im}\}$, where each $g_{ij}$ is a tuple $(C_{x_{ij}}, R_{ij}, C_{y_{ij}})$ representing a relation $R_{ij}$ between the two associated concepts $C_{x_{ij}}$ and $C_{y_{ij}}$. The presence of all three elements of the tuple corresponds to a CRC graph, the presence of a C and an R corresponds to a CR graph, and the presence of a C alone indicates a C graph. Each $g_{ij}$ is then matched against the index graphs in the indices $I_{CRC}$, $I_{CR}$ and $I_C$ to obtain the sets of documents $D_{ij.CRC}$, $D_{ij.CR}$ and $D_{ij.C}$. The matching set of documents $D_{ij}$ for the expanded query graph $g_{ij}$ is the union of these three sets, i.e.,

$$D_{ij} = D_{ij.CRC} \cup D_{ij.CR} \cup D_{ij.C}$$

Using these sets, the degree of match is determined by the tag Ta.

This algorithm aims at improving the search and ranking, by

performing matching at three levels, namely,

1. Partial or Complete match between the index and expanded

query;

2. A Concept Association level which distinguishes between the

actual query terms, query concepts and expanded concept.

3. Document based features, such as frequency of occurrence

and position of terms and concepts in the document.
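The first of these levels, matching one expanded query graph against the three indices and taking the union of the results, might look as follows. This is a sketch under simplifying assumptions: the index layout (plain dictionaries of document sets), the function name `match_query_graph` and the toy data are illustrative, and the CR lookup here keys only on the source concept.

```python
def match_query_graph(graph, i_crc, i_cr, i_c):
    """Match one query graph (Cx, R, Cy) and return (D_crc, D_cr, D_c, D)."""
    cx, r, cy = graph
    d_crc = set(i_crc.get((cx, r, cy), set()))               # complete CRC match
    d_cr = set(i_cr.get((cx, r), set()))                     # partial CR match (source side)
    d_c = set(i_c.get(cx, set())) | set(i_c.get(cy, set()))  # concept only match
    return d_crc, d_cr, d_c, d_crc | d_cr | d_c

# Hypothetical toy indices.
i_crc = {("build", "agt", "king"): {"d1"}}
i_cr = {("build", "agt"): {"d1", "d2"}}
i_c = {"build": {"d1", "d2", "d3"}, "king": {"d1", "d4"}}

d_crc, d_cr, d_c, d_all = match_query_graph(("build", "agt", "king"), i_crc, i_cr, i_c)
```

Keeping the three result sets separate, rather than returning only their union, preserves the information needed to assign the degree of match to each document.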

7.3.2.1 Tag determination for degree of match

The tag determination for the degree of match depends on the

extent of match between the CRC representing the query sub graph and the

conceptual index. It essentially differentiates between CRC, CR and C

matches. Ta helps in differentiating between the different degrees of match.

The UNL sub graph is a directional graph, and hence, partial match also

considers whether the concept in CR (Concept Relation), matches with the

source concept, Cxi , or the destination concept, Cyi , of the UNL sub graph.

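$$T_a = \begin{cases} 1 & \text{complete CRC match} \\ 2, 3, 4 & \text{partial CR match, distinguished by whether the matched concept is the source } C_{xi} \text{ or the destination } C_{yi} \\ 5, 6, 7 & \text{C only match} \end{cases} \tag{7.1}$$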

7.3.2.2 Tag determination for concept association

The next level of tag determination is based on whether the Ci value in the CRC, CR and C matches corresponds to the actual query term, the concept of the query term, or a concept obtained after query expansion. Accordingly, the concept association is said to be of three types.

1. Query Term TWi association - This means that the concept Ci is

the query term itself.

2. Concept Word CWi association - This means that the concept

Ci matches the corresponding concept of the query, but the

actual query term is different.

3. Expanded Word EWi association - This means that the

concept Ci is associated with a concept that is not actually in

the query, but has been obtained as a result of query

expansion.

Based on the above three types, seven different tags are obtained, as given below.

$$T_b = \begin{cases}
1 & \text{if } C_{xi}, C_{yi} \in TW_i \\
2 & \text{if } C_{xi} \in TW_i \text{ and } C_{yi} \in CW_i \\
3 & \text{if } C_{xi} \in CW_i \text{ and } C_{yi} \in TW_i \\
4 & \text{if } C_{xi}, C_{yi} \in CW_i \\
5 & \text{if } C_{xi} \in TW_i \text{ and } C_{yi} \in EW_i \\
6 & \text{if } C_{xi} \in EW_i \text{ and } C_{yi} \in TW_i \\
7 & \text{if } C_{xi} \in EW_i \text{ and } C_{yi} \in CW_i
\end{cases} \tag{7.2}$$

It can be seen that the tag Tb, in differentiating between the three types explained above, also differentiates between whether the concept is the source node Cxi or the destination node Cyi of the directed UNL sub graph. The seven values of Tb bring out these differences. Within each DTa, the documents are ordered as per Tb.

Each document in the set Dij.Ta is now tagged with the Tb tag. In other words, the retrieved documents are prioritized and ranked according to their Ta.Tb value. Let Dij.TaTb represent the set of documents with the tag Ta.Tb corresponding to the enconverted query graph gij. The next section describes how index based features are used to further rank each set of Dij.TaTb documents.
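The concept association tag can be read as a lookup on the association types of the source and destination concepts. The sketch below encodes the cases of Equation 7.2 in that way. Representing the association types as the strings "TW", "CW" and "EW", and returning None for a combination that the equation does not list, are assumptions of this example.

```python
# Tag values keyed by (type of source concept Cxi, type of destination concept Cyi).
TB_TABLE = {
    ("TW", "TW"): 1,
    ("TW", "CW"): 2,
    ("CW", "TW"): 3,
    ("CW", "CW"): 4,
    ("TW", "EW"): 5,
    ("EW", "TW"): 6,
    ("EW", "CW"): 7,
}

def concept_association_tag(cx_type, cy_type):
    """Return Tb for the given association types, or None if the pair is not listed."""
    return TB_TABLE.get((cx_type, cy_type))

tag_term_concept = concept_association_tag("TW", "CW")
tag_expanded_term = concept_association_tag("EW", "TW")
```

Because the key is an ordered pair, the same two association types yield different tags depending on which concept is the source and which is the destination.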

7.3.2.3 Use of index based features

Index based features are used to calculate a weight factor to

prioritize the documents within each set Di j .TaTb. The features used are

position, frequency count, Named Entity (NE) tag and Multi-word (MW) tag

of the term/concept. The feature weight is calculated as follows.

The Index based feature Weight is denoted as WI.

=,..

(7.3)

Here, i refers to the weight within a single document, and j to the weight across documents. PiWeight represents the position weight of the concept, computed from where in the document the concept or term occurs. FiCount represents the frequency of occurrence of the concept in the document. NEiWeight represents the Named Entity weight associated with the concepts in Q. MWiWeight represents the Multi Word weight associated with the concepts in Q.
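To make the role of the four features concrete, the following sketch combines them into a single weight. The way the features are combined here, a plain sum over the matched concepts, is an assumption made only for illustration and should not be read as the form of Equation 7.3; the class `ConceptFeatures` and the sample values are likewise hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ConceptFeatures:
    position_weight: float  # PiWeight
    frequency_count: int    # FiCount
    ne_weight: float        # NEiWeight
    mw_weight: float        # MWiWeight

def index_feature_weight(features):
    """Sum the four feature values over all matched concepts (assumed combination)."""
    return sum(
        f.position_weight + f.frequency_count + f.ne_weight + f.mw_weight
        for f in features
    )

# Hypothetical feature values for two matched concepts in one document.
matched = [
    ConceptFeatures(position_weight=0.5, frequency_count=3, ne_weight=1.0, mw_weight=0.0),
    ConceptFeatures(position_weight=0.25, frequency_count=1, ne_weight=0.0, mw_weight=1.0),
]
w_i = index_feature_weight(matched)
```

Whatever the exact combination, the result is a single number per document that can be used to order documents sharing the same Ta.Tb tag.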

7.3.2.4 Computation of Overall Concept based Ranking

The first step in computing the overall ranking (Oc) is to merge all

the Di j .TaTb documents corresponding to each gij of the Query Q. Those

documents which occur in the maximum number of sets are ranked higher.

The merged sets of documents are then ranked based on the TaTb value, and

each set DTa.Tb is, in turn, ranked using the normalized index based weight

factor. d is the normalized weight factor to differentiate between the complete

CRC match, partial CR, or C match.

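$$d = \begin{cases} 0.5 & \text{if } T_a = 1 \\ 0.3 & \text{if } T_a \in \{2, 3, 4\} \\ 0.2 & \text{if } T_a \in \{5, 6, 7\} \end{cases} \tag{7.4}$$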

Thus, in this algorithm, the first level of ranking is obtained by Ta. The set of documents corresponding to each Ta is, at the second level, ranked according to Tb, and then within each set, at level three, the documents are ranked according to a normalized index based weight d × WQTa.Tb.
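The three level ordering can be expressed as a single sort with a composite key. The following is a minimal sketch, assuming that smaller Ta and Tb values rank first and that each document already carries its tags and index based weight. The dictionary layout and the sample documents are hypothetical, and the preliminary step of favouring documents that occur in the most sets is omitted.

```python
def match_factor(ta):
    """Normalized weight factor d for a degree of match tag, as in Equation 7.4."""
    if ta == 1:
        return 0.5
    if ta in (2, 3, 4):
        return 0.3
    if ta in (5, 6, 7):
        return 0.2
    raise ValueError(f"unexpected Ta value: {ta}")

def rank(documents):
    """Order by Ta, then Tb, then by descending d * weight."""
    return sorted(
        documents,
        key=lambda doc: (doc["ta"], doc["tb"], -match_factor(doc["ta"]) * doc["weight"]),
    )

# Hypothetical tagged documents.
docs = [
    {"id": "d1", "ta": 2, "tb": 1, "weight": 4.0},
    {"id": "d2", "ta": 1, "tb": 3, "weight": 1.0},
    {"id": "d3", "ta": 1, "tb": 1, "weight": 2.0},
    {"id": "d4", "ta": 1, "tb": 1, "weight": 5.0},
]
ranked_ids = [doc["id"] for doc in rank(docs)]
```

Here d4 and d3 share the tag 1.1 and are separated only by their weights, while d1 ranks last despite a high weight because its degree of match is partial.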

Thus, the conceptual searching and ranking algorithm considers the degree of match, the context of the query, concept association, and index based term, concept and position factors corresponding to sentences as well as documents for effective ranking.

When we attempted event based search in the concept based search engine, we found low precision for some queries that require event based results. We therefore concluded that event based concepts and relations in the semantic graphs need to be given higher weight than normal concepts. Sophisticated document processing tasks were required to learn the event based concepts and relations and to give importance to the documents containing them. We also needed to design a suitable event based index in which the user can search for an event by any of its event based properties. The design of the event based index is explained in the next section.

7.4 MULTI FIELD INDEX FOR EVENT SEARCH

This index structure helps to search for an event with different times, persons and places, for different events with the same time, place and person, and for the same person, place or time across different events. The concept based index (Subalalitha et al 2011) is an inverted index that maps from concepts to documents, whereas a conventional inverted index maps from terms to documents. This concept based index contains two separate indices: Concept-Relation-Concept indices (CRC indices) and Concept indices (C indices). CRC indices hold the entire relation between the concepts. C indices represent only the concepts.

The work described in this thesis adds Event, Time, Place and Person as new fields in the CRC index. Concepts that are connected by event based relations and constraints, such as tim (the time of an event), agt (the person of an event), plc (the place of an event), plf (place from) and plt (place to), are stored in the semantic link based fields <Time>, <Place> and <Person>. We have separated our indexer into <Person>, <Place> and <Time> indexes based on the event cluster results.
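The routing of event based relations into separate fields can be sketched with an in-memory index. This is an illustration under stated assumptions, not the Lucene based implementation: the functions `add_document` and `search`, the nested dictionary layout, and the sample documents and concept labels are all hypothetical. Field entries are keyed by the (event, concept) pair so that a hit reflects an actual link between the event and the field value.

```python
from collections import defaultdict

# UNL relation label -> index field, following the relations named in the text.
RELATION_TO_FIELD = {"tim": "Time", "agt": "Person", "plc": "Place", "plf": "Place", "plt": "Place"}

def add_document(index, doc_id, triples):
    """Store each (event, relation, concept) link under its field, keyed by the linked pair."""
    for event, relation, concept in triples:
        field = RELATION_TO_FIELD.get(relation)
        if field is None:
            continue  # not an event based relation
        index["Event"][event].add(doc_id)
        index[field][(event, concept)].add(doc_id)

def search(index, event, **fields):
    """Return documents in which `event` is linked to every given field value."""
    result = set(index["Event"].get(event, set()))
    for field, value in fields.items():
        result &= index[field].get((event, value), set())
    return result

index = defaultdict(lambda: defaultdict(set))
add_document(index, "d1", [
    ("election(icl>event)", "plc", "Chennai(iof>city)"),
    ("election(icl>event)", "tim", "2011-04-13"),
])
add_document(index, "d2", [
    ("election(icl>event)", "plc", "Madurai(iof>city)"),
    ("meeting(icl>event)", "plc", "Chennai(iof>city)"),
])
hits = search(index, "election(icl>event)", Place="Chennai(iof>city)")
```

Document d2 mentions both the election and Chennai but does not link them, so only d1 is returned; a single field index that intersects plain concept postings would have returned both.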

Figure 7.2 Term based index

Figure 7.3 Concept based index

Figure 7.4 Concept based multi field event indexer

Figure 7.2 depicts a term based indexer, in which terms are the keys; it can be searched only by a single field. Similarly, the concept based indexer shown in Figure 7.3 considers only the connected concept (To concept) as the key for retrieving results. The concept based multi-field indexer shown in Figure 7.4 takes the event based fields <Event>, <Time>, <Place> and <Person> as keys, so the user can retrieve information using multiple fields. Table 7.1 shows the list of fields used in the concept based event indexer.

A Common Event Hash (CEH) key is required to store the Time, Person and Place based event clusters. In the existing UW dictionary (Balaji et al 2011), we maintain a <Term ID> and a <Concept ID> for each term, which identify the unique terms and concepts in a document. We have split the Lucene index based on the <Concept ID>, which represents the conceptual similarity between terms. For example, in Tamil, the terms “அருவி (Water falls) - Aruvi” and “நீர்வீழ்ச்சி (Water falls) - Nīrvīḻcci” should both be mapped to the concept “Water falls (icl>natural phenomena)”. The <Concept ID> is also used as the Common Event Hash key to split the cluster results based on <Person> and <Place>. Each split covers a range of 10,000 concept IDs: the concepts (either Person or Place) with concept IDs from 1 to 10,000 appear in a single cluster, those from 10,001 to 20,000 appear in the next, and so on until all the concepts in the UW dictionary are covered. The number of splits depends on the number of concepts retrieved from the clusters and on their concept ID values. The <Time> based clusters are indexed with the normalized temporal expression in the particular range. The Date field contains the difference between the Document Creation Time and the actual time of an event, represented in the standard time format.
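The range based split by concept ID can be sketched as follows. This is a minimal illustration under two assumptions that the text does not fix: a boundary ID such as 10,000 is placed in the lower range, and the Date field is expressed as a number of days. The function names are hypothetical.

```python
from datetime import date

BUCKET_SIZE = 10_000  # width of each concept-ID range, as stated in the text


def ceh_bucket(concept_id, bucket_size=BUCKET_SIZE):
    """Return the 0-based split for a concept ID: 1..10,000 -> 0, 10,001..20,000 -> 1."""
    if concept_id < 1:
        raise ValueError("concept IDs start at 1")
    return (concept_id - 1) // bucket_size


def date_field(document_creation_time, event_time):
    """Days between the document creation time and the event time (assumed unit)."""
    return (document_creation_time - event_time).days


first = ceh_bucket(1)
edge = ceh_bucket(10_000)
second = ceh_bucket(10_001)
gap = date_field(date(2011, 4, 3), date(2011, 4, 2))
```

With this scheme the split that holds a Person or Place concept can be computed directly from its concept ID, without scanning the other splits.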


Table 7.1 Fields used in the concept based event index

Fields                      Description
Concept-relation-concept    The Concept (C1)-Relation (R)-Concept (C2) relationships that belong to the document set D.
Concept                     The list of concepts (Cs) that occur in the document set D.
Event                       Event based connected concept
Time                        Time based connected concept
Place                       Place based connected concept
Person                      Person based connected concept
Document Identifier         d1, d2, d3, ..., dn: the list of documents that contain both CRC and C indices.
Sentence Identifier         S1, S2, S3, ..., Sn: the list of sentence identifiers that contain both CRC and C indices. This is used to know the position of the term/concept in the document.
Event Weight                Number of event based concepts in the document
Position weight             Concept position weight
Frequency count             Frequency of occurrence of the concepts in the document
Part of Speech Tagging      Used to know the importance of the concepts with respect to the domain of interest. For example, for the tourism domain the “Named Entities” are more important than other nouns.
Term words                  The term words present in the document
Date                        Difference between the Document Creation Time and the actual time of the event


For example, if the user query is “சச்சின் உலக கோப்பை கிரிக்கெட் சாதனை - Cacciṉ ulaka kōppai kirikkeṭ cātaṉai (Sachin Tendulkar's World Cup record)”, a semantic graph is constructed for it, and the concepts retrieved from this event query are Sachin(icl>person), World cup(icl>event), cricket(icl>sport) and record(obj>thing). The concept-relation-concept entries (CRCs) are Sachin(icl>person)-pos-World cup(icl>event) and World cup(icl>event)-mod-cricket(icl>sport). First, the query graph is analyzed for <Person>, <Place> and <Time> based concepts. From semantic constraints such as “icl>person” and “icl>event”, we can identify whether to search in the Time based, Person based or Event based cluster. Each query concept is associated with a concept ID, and with the help of that concept ID the person based cluster is retrieved.

The list of concepts used to query the multifield Lucene indexer is

given below.

Concepts Sachin(icl>person), World cup(icl>event), cricket (icl>sport), record(obj>thing)

CRCs Sachin(icl>person)-pos-World cup(icl>event); World cup(icl>event)-mod-cricket(icl>sport).

After locating the person based concepts, the search matches the other concepts in the query against the additional event based fields: <Event>, <Time> and <Place>. If it finds matches for all or only a few of them, it retrieves the corresponding documents. If none of the remaining query concepts matches the other event based fields, it fetches the information by using only the single field.
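The matching procedure described above can be sketched as a small routine that routes each query concept to a field by its semantic constraint and then counts how many fields a document matches. The constraint-to-field mapping (in particular "icl>place" and "icl>time"), the function names and the sample documents are assumptions made for illustration; the actual system works on the Lucene index.

```python
import re

# Hypothetical mapping from UNL semantic constraints to index fields.
CONSTRAINT_TO_FIELD = {
    "icl>person": "person",
    "icl>place": "place",
    "icl>time": "time",
    "icl>event": "event",
}


def field_of(concept):
    """Pick the index field from the constraint in parentheses, if recognised."""
    match = re.search(r"\(([^)]*)\)", concept)
    return CONSTRAINT_TO_FIELD.get(match.group(1)) if match else None


def retrieve(query_concepts, docs):
    """docs: {doc_id: {field: set of concepts}}.

    Returns (doc_id, matched field count) pairs, best first. A document that
    matches on one field only is still returned, mirroring the single-field
    fallback.
    """
    wanted = {}
    for concept in query_concepts:
        field = field_of(concept)
        if field:
            wanted.setdefault(field, set()).add(concept)
    hits = []
    for doc_id, fields in docs.items():
        matched = sum(1 for f, cs in wanted.items() if cs & fields.get(f, set()))
        if matched:
            hits.append((doc_id, matched))
    return sorted(hits, key=lambda h: (-h[1], h[0]))


query = ["Sachin(icl>person)", "World cup(icl>event)", "cricket (icl>sport)", "record(obj>thing)"]
docs = {
    "d1": {"person": {"Sachin(icl>person)"}, "event": {"World cup(icl>event)"}},
    "d2": {"person": {"Sachin(icl>person)"}},
    "d3": {"place": {"Chennai(icl>place)"}},
}
results = retrieve(query, docs)
```

Here d1 matches both the person and the event field, d2 is kept through the single-field fallback, and d3 is dropped because it matches none of the routed concepts.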

With the help of the bootstrapping approach, the event patterns in a document are identified and counted. Event weights are assigned based on the following factors.

1. The total number of event specific subgraphs (TCRCe), obtained with the help of the event context score described in Chapter 1.

2. The number of event specific templates described in the document (Wte), learned through the bootstrapping process discussed in Chapter 5.

3. The number of event specific concepts in the document (Wce), obtained with the help of the event context score described in Chapter 1.

4. The time of an event with respect to the Document Creation Time (TDCGe), obtained from the semi supervised label propagation described in Chapter 6.

Here, TCRCe is identified by counting the CRCs that contain event based concepts and relations. Wte is determined by checking which event templates the document fills: some documents describe all the event based fields, while others miss a few, so this weight is assigned from the number of event templates extracted from the document. Wce counts only the unique event based concepts that appear in the document. TDCGe is the difference between the actual time of an event and the Document Creation Time (DCT); it indicates whether the event described in the document is recent or past. From the above analysis, the event weight (eW) is assigned.

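[Equation (7.5): ... = log( ... ); the remainder is not recoverable from the source]

[Equation (7.6): not recoverable from the source]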


Here, TCRCD represents the number of CRCs in the document. TNe represents the total number of events (event tags) in the document; the number of event templates is assigned for each event tag and these numbers are summed. Tc denotes the total number of concepts in the document.
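Since Equations (7.5) and (7.6) are not legible here, the following sketch is only an illustrative way of combining the quantities defined above and should not be read as the formula of this thesis. It assumes that each event specific count is normalised by its document level total (TCRCe by TCRCD, Wte by TNe, Wce by Tc) and that the sum is passed through a logarithm; the function name and the sample numbers are hypothetical.

```python
import math


def event_weight(tcrc_e, tcrc_d, w_te, tn_e, w_ce, t_c):
    """Illustrative event weight: log of one plus the summed normalised counts.

    tcrc_e / tcrc_d : event specific CRCs over all CRCs in the document
    w_te / tn_e     : filled event templates over event tags in the document
    w_ce / t_c      : unique event concepts over all concepts in the document
    """
    ratios = [
        tcrc_e / tcrc_d if tcrc_d else 0.0,
        w_te / tn_e if tn_e else 0.0,
        w_ce / t_c if t_c else 0.0,
    ]
    # log1p keeps the weight at 0 for a document with no event information.
    return math.log1p(sum(ratios))


rich = event_weight(tcrc_e=12, tcrc_d=40, w_te=3, tn_e=4, w_ce=10, t_c=50)
poor = event_weight(tcrc_e=2, tcrc_d=40, w_te=1, tn_e=4, w_ce=3, t_c=50)
empty = event_weight(0, 0, 0, 0, 0, 0)
```

Under these assumptions a document that fills more templates and contains more event specific CRCs and concepts receives a larger weight, which is the qualitative behaviour the text describes.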

7.5 EVENT BASED SEARCH AND RANK

The existing concept based UNL search, CoRee (Balaji et al 2012), has been extended with event based feature weights to return event specific results to the users. The features considered for ranking are given below.

Main Event – Sub Event: This is identified by analysing the Concept-relation-concept (CRC) indices of the document. A document whose CRC indices are connected with a larger number of event based properties is given preference over the other documents.

Recent – Past: This is identified with the help of the Time and Date fields of the CRC index/C index. If the user specifies a time, the documents that contain that time are given first preference, and the rest of the documents are ordered from the most recent to the oldest.

Number of Event Specific Information in a Document: This uses the event weight and the frequency count stored in the CRC index/C index. The documents are first ordered by the frequency of occurrence of the event query concept; within each level, they are sorted again by the number of event based concepts in the document. Documents with more event based concepts are ranked higher than the others.

The existing ranking parameters have been modified as follows. The basic search procedure is based on CRC, partial CR, or C matches. The ranking algorithm follows the existing Tag determination algorithm, which considers the degree of match (Level 1), concept association (Level 2) and index based features (Level 3). We have changed the index based feature weight into the event based feature weight described in the previous section. The previous approach uses three ranking levels; this approach modifies the third level and adds a fourth, which ranks the documents by a Time based weight. In the third level, instead of considering only the position weight, we additionally consider the event weight. The modified ranking function is shown in Equation (7.7). The index based feature weight (WI) is computed as in the previous approach, from the document specific properties, as shown in Equation (7.3).

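[Equation (7.7): Level 3 = ...; right-hand side not recoverable from the source]

[Equation (7.8): Level 4 = 1 if ... <TIME> ...; remainder not recoverable from the source]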

In Level 4, if the user query UserQ contains a temporal expression (<TIME>), the weight is set to 1 for the documents that contain the user given temporal expression, so they are ranked higher; otherwise, the documents are sorted by their TDCGe value in ascending order.
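The four ranking levels can be read as a lexicographic sort, which the following sketch illustrates. It is not the CoRee implementation: the class and field names are hypothetical, the numeric ranks for CRC, CR and C matches are assumed, and the Level 3 score is assumed to be the sum of the index based weight and the event weight because Equation (7.7) is not legible here.

```python
from dataclasses import dataclass

# Degree of match for Level 1: a full CRC match beats a partial CR match,
# which beats a C match (assumed numeric ranks).
MATCH_RANK = {"CRC": 3, "CR": 2, "C": 1}


@dataclass
class Hit:
    doc_id: str
    match: str            # "CRC", "CR" or "C"
    association: float    # Level 2 concept association score
    index_weight: float   # WI, index based feature weight
    event_weight: float   # eW
    has_query_time: bool  # document contains the user's temporal expression
    tdcg: int             # gap between event time and DCT; smaller is more recent


def rank(hits, query_has_time):
    """Sort hits by Level 1 to Level 4, then by recency."""
    def key(h):
        level3 = h.index_weight + h.event_weight  # assumed combination
        level4 = 1 if (query_has_time and h.has_query_time) else 0
        # Scores are negated so larger values sort first; tdcg stays ascending.
        return (-MATCH_RANK[h.match], -h.association, -level3, -level4, h.tdcg)
    return [h.doc_id for h in sorted(hits, key=key)]


hits = [
    Hit("old", "CRC", 0.5, 0.4, 0.6, False, 300),
    Hit("recent", "CRC", 0.5, 0.4, 0.6, False, 2),
    Hit("timed", "CRC", 0.5, 0.4, 0.6, True, 900),
    Hit("weak", "C", 0.9, 0.9, 0.9, True, 1),
]
with_time = rank(hits, query_has_time=True)
without_time = rank(hits, query_has_time=False)
```

When the query carries a temporal expression, the document containing it moves ahead of otherwise equal documents; without one, equal documents fall back to ascending TDCGe order, that is, from recent to past.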

When the user enters an event query, it is enconverted (Balaji et al 2011, Elenchezhian et al 2011) into UNL concepts and relations, as described in Section 3.1, and its expanded concepts are retrieved from the event indexer by finding the concepts most frequently connected to it through event based relations and concepts. The expanded terms are then ranked by the event based conceptual connection between the query and the document. A query concept that is connected with a larger number of event specific concepts, relations or semantic constraints is given a higher weight than the other concepts. Among the expanded concepts, the top 10 expansions are taken as search concepts for the event search engine.

The existing template based summary (Subalalitha et al 2011) has been used with event based features and event specific sentences. We evaluated this event search engine on the FIRE corpus of 1,94,000 documents with 20 queries. The basic term based event search gives a precision of P@5 = 0.12, concept based search (without event based properties) gives P@5 = 0.23, and concept based event specific search gives P@5 = 0.48. Our concept based event specific search therefore performs better than both the term based and the concept based searches.
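The query expansion step (ranking the concepts connected to a query concept and keeping the top 10) can be sketched as a frequency count over CRC triples. The function name, the tie-breaking rule and the sample triples are assumptions for illustration; the relation labels tim, agt, plc, plf and plt are the event based relations named earlier in this chapter.

```python
from collections import Counter

EVENT_RELATIONS = {"tim", "agt", "plc", "plf", "plt"}


def expand_query(query_concept, crc_entries, event_relations=EVENT_RELATIONS, top_n=10):
    """Rank concepts connected to query_concept through event based relations.

    crc_entries: iterable of (concept1, relation, concept2) triples from the index.
    Returns up to top_n connected concepts, most frequently connected first.
    """
    counts = Counter()
    for c1, relation, c2 in crc_entries:
        if relation not in event_relations:
            continue
        if c1 == query_concept:
            counts[c2] += 1
        elif c2 == query_concept:
            counts[c1] += 1
    # Sort by descending frequency, then by name, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [concept for concept, _ in ranked[:top_n]]


# Hypothetical CRC triples, for illustration only.
crcs = [
    ("World cup(icl>event)", "agt", "Sachin(icl>person)"),
    ("World cup(icl>event)", "agt", "Sachin(icl>person)"),
    ("World cup(icl>event)", "plc", "Mumbai(icl>place)"),
    ("World cup(icl>event)", "mod", "cricket(icl>sport)"),
]
expanded = expand_query("World cup(icl>event)", crcs)
```

The "mod" triple is ignored because it is not an event based relation, so only the person and place concepts are proposed as expansions.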

7.5.1 Implementation of Event Search and Rank

The Event Search Interface is based on queries, retrieved documents, indexes and ranked results. Each document contains a <Document Id> and the list of fields required for indexing. The important event based concepts and their parameters are indexed with the Lucene indexer. The offline and online software modules were implemented in Java and deployed as a separate plugin in the open source tool Nutch 0.9.

To search the Event index, the query is converted into a semantic graph and expanded with the help of the Event Indexer. The query graph holds the actual query string, the expanded concepts and their additional parameters, such as the POS tags of the query concepts and the query tags that differentiate between term based, concept based and multiword based results. Since this search interface has more than one input field, the query given in each field is enconverted into a UNL based semantic graph, so multiple graphs are constructed.

Therefore, the search method can search for multiple query concepts and rank the documents based on the event based feature weight. The parameters required for the event search are given below.

Search_Result = eventIndex.EventSearchUNL.search(queryString, queryMulti-List)

Here, queryString = “Eventquery” + “Time” + “Person” + “Place”, or any combination of these fields.

The queryString can contain any number of fields, depending on the need of the user. For example, the user may want to search for documents whose event Time field covers the years “2009 to 2010” and that also mention the person name “Tendulkar”. queryMulti-List contains the graph based properties, such as Cs and CRCs, and additional contextual information, such as the other connected concepts, POS tags and MW (Multiword) tags, which are retrieved from the event indexer. The resulting documents and their associated summaries, in both English and Tamil, are given as output to the user. The user interface design of the Event search engine is given in Figure 7.5.

Figure 7.5 Concept based Event search Interface
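The way a multi-field query is assembled from the interface can be sketched as below. This is a hypothetical helper, not the EventSearchUNL API of this work; it only shows that any non-empty combination of the event, time, person and place fields forms a valid query.

```python
def build_query(event=None, time=None, person=None, place=None):
    """Collect the non-empty search fields; at least one is required."""
    fields = {"event": event, "time": time, "person": person, "place": place}
    query = {name: value for name, value in fields.items() if value}
    if not query:
        raise ValueError("at least one of event, time, person, place is required")
    return query


# The example from the text: a time range combined with a person name.
q = build_query(time="2009 to 2010", person="Tendulkar")
```

Each value in the resulting dictionary would then be enconverted into its own semantic graph before the index is searched.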


7.6 RESULTS AND EVALUATION

We have used the Forum for Information Retrieval Evaluation (FIRE) Tamil corpus, which consists of 1,92,200 documents. Of these, our concept based event specific indexer indexed 1,72,000 documents and identified 25,000 concepts and 30,000 concept-relation-concept indices. The concept coverage is lower than that of the existing concept based search for the tourism domain (Balaji et al 2011), because we used the existing UW dictionary to enconvert the documents. Since that dictionary was developed for the tourism domain, it covers most places, but its coverage of event and person based concepts is lower. Nevertheless, we tested 50 Tamil event based news queries of different event types. Before computing the Discounted Cumulative Gain (DCG), a human relevance rating (on a 5 point scale) needs to be assigned to check the accuracy of the search and rank. Since this method supports concept based event search with different event fields, the evaluation specification contains different weighting factors, and it is defined separately for each case.

Case 1: For a given query Q, the user may want to search for the same event occurring at different times and places and involving different persons. In that case, the rating scheme shown in Table 7.2 is used to assign the user rating.


Table 7.2 Five scale rating for event search for case 1

Rating  Justification
5       The whole document talks about the main event (with all the event based properties), or the title itself contains the given query
4       Main event with either Time/Persons/Place
3       Sub events of the main event / topically related events
2       Only a few lines talk about the query, or links talk about a topic related to the event
1       Single occurrence of a term of an event / related event

Case 2: If the query Q is a combination of Time, Place and Person, the search engine should return the different events that occur at the same Time and Place and involve the same Person. To weigh that factor, the rating shown in Table 7.3 has been assigned.

Case 3: If the given query Q is a Time, a Person or a Place, the news search engine should return the different events associated with the given event property. From the user's perspective, a document is most relevant when it matches the kind of event tracking the user has in mind. If the user tracks events by person, the resulting documents should give the most relevant events related to that person, and likewise for time and place. For example, if a user wants to track a sports person, the search engine should return documents that talk about an event in which the sports person is involved. The rating scheme is shown in Table 7.4.


Table 7.3 Five scale rating for event search for case 2

Rating  Justification
5       The document contains a larger number of events that occur at the same time, to the same person and in the same place
4       A larger number of events for which any two parameters are the same, for example Place+Person or Person+Time
3       A larger number of events with only one of the parameters (Time/Place/Person) the same
2       Only a few lines/links talk about an event with any one of the parameters Time/Person/Place
1       Single occurrence of the term of an event / related event

Table 7.4 Five scale rating for event search for case 3

Rating  Justification
5       The document contains the most relevant event for a given Time/Person/Place
4       Related relevant event for a given Time/Person/Place
3       Talks about any events for a given Time/Person/Place (they need not be closely related to the particular place/person/time)
2       Only a few lines/links talk about an event with any one of the parameters Time/Person/Place
1       Single occurrence of the term of an event / related event

The concept based ranking function has been changed by introducing an additional event weight. The method considers the number of conceptual links for a given event query and gives importance to the cause and effect relation, which yields the main event and sub events of the given query. Temporal factors are important because users usually tend to search for recent events rather than past ones; hence, we rank the documents from the most recent to the oldest. The document specific properties, such as the position of the concept in the document and its frequency of occurrence, are considered in the final stage. The Normalized Discounted Cumulative Gain (nDCG) is calculated for the different cases, depending on the query, and compared for term based search, concept based search (without the event based process) and concept based event search. The results are shown in Table 7.5.

Table 7.5 Comparison of nDCG

Approaches                                                     nDCG
Term based Search and Rank                                     0.43
Concept based Search and Rank (Without event based process)    0.61
Concept based event search and Rank                            0.86
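For reference, the nDCG measure can be computed from the 5 scale ratings as sketched below. The text does not state which DCG variant was used, so this sketch assumes the common form in which the rating at rank i is divided by log2(i + 1), and it takes the ideal ordering from the retrieved list itself. The function names are hypothetical.

```python
import math


def dcg(ratings):
    """Discounted cumulative gain: the rating at rank i is divided by log2(i + 1)."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(ratings, start=1))


def ndcg(ratings):
    """DCG divided by the DCG of the same ratings in ideal (descending) order."""
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal else 0.0


# Human ratings of the top five results of one query, in ranked order.
score = ndcg([5, 3, 4, 1, 2])
```

A ranking that already lists the documents from the highest to the lowest rating scores 1.0, and any other ordering of the same ratings scores less; the per-query values are then averaged over the query set.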

Precision is calculated after assigning a relevance tag to each retrieved document. If the content of the resulting document is closely related to the given event query Q, the relevance tag is ‘more relevant’, its parameter is ‘Y’ and its value is 1. If the content is partially related to the query, the tag is ‘partially relevant’, its parameter is ‘P’ and its value is 0.5. If the relevant result is found only in a link of the web page, the tag is ‘less relevant’, its parameter is ‘L’ and its value is 0.5. Sometimes a result has no such link; it is then considered ‘not relevant’, with parameter ‘N’ and value 0. If the resulting page cannot be accessed because the page cannot be found, the page is under construction, or some other technical fault occurs, it is categorized as ‘site cannot be accessed’, with parameter ‘X’ and value 0. In the 5 scale rating scheme given in the previous tables, ratings 5 and 4 are assigned the relevance tag ‘Y’, rating 3 is assigned ‘P’, rating 2 is assigned ‘L’, and rating 1 is assigned ‘N’ or ‘X’.
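The mapping from ratings to relevance tags and values can be sketched as follows. The tag values are those stated in the text; the precision formula (sum of the relevance values of the top k results divided by k) is an assumption, since the text does not give it explicitly, and the function names are hypothetical.

```python
# Relevance values as stated in the text.
TAG_VALUE = {"Y": 1.0, "P": 0.5, "L": 0.5, "N": 0.0, "X": 0.0}


def rating_to_tag(rating, accessible=True):
    """Map a 5 scale rating to a tag: 5-4 -> Y, 3 -> P, 2 -> L, 1 -> N or X."""
    if rating >= 4:
        return "Y"
    if rating == 3:
        return "P"
    if rating == 2:
        return "L"
    return "N" if accessible else "X"


def precision_at_k(tags, k):
    """Sum of the relevance values of the top k results divided by k (assumed formula)."""
    return sum(TAG_VALUE[t] for t in tags[:k]) / k


ratings = [5, 3, 1, 4, 2]
tags = [rating_to_tag(r) for r in ratings]
p_at_5 = precision_at_k(tags, 5)
```

For these five ratings the tags are Y, P, N, Y and L, so the relevance values sum to 3.0 and the graded P@5 is 0.6.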

Table 7.6 Comparison of Precision (P@5 and P@10)

Approaches                                                     P@5     P@10
Term based Search and Rank                                     0.13    0.17
Concept based Search and Rank (Without event based process)    0.23    0.22
Concept based event search and Rank                            0.48    0.43

Tables 7.5 and 7.6 show that the term based event search and rank gives the lowest precision and that the concept based search results are better than the term based ones. However, the precision of concept based search is still lower than that of the concept based event search method. Compared to the existing concept based search and rank method, this work shows an improvement of more than 20 percentage points in precision (Table 7.6), and the nDCG is also higher for the concept based event search and rank (Table 7.5). The ranking algorithm can be further improved by considering an event concept based page rank score for the retrieved documents. We also plan to extend our previous work (Umamaheswari et al 2013) to automatically learn the ranking parameters for the event search.


7.7 CONCLUSION

To help news readers obtain coherent information about an event and to support event search from event based perspectives such as Time, Place and Person, this work describes an approach that tracks events conceptually with respect to these event specific properties and ranks the documents using event specific features and relations. The news event search has been developed for Tamil news documents, and we plan to extend our work to other languages such as English, Malayalam and Telugu. The language independent semantic representation helps in adapting the approach to different languages, irrespective of the nature of the language. The next chapter discusses an extension of this event search and rank that considers the event based conceptual link between pages, so that a page that has either a physical link or a conceptual link is given a higher weight. We have also incorporated user rating by learning the conceptual features of the documents that have a higher user rating.
