INEX: Evaluating content-oriented XML retrieval
Mounia Lalmas
Queen Mary University of London
http://qmir.dcs.qmul.ac.uk
Outline
Content-oriented XML retrieval
Evaluating XML retrieval: INEX
XML Retrieval
Traditional IR is about finding relevant documents to a user's information need, e.g. an entire book.
XML retrieval allows users to retrieve document components that are more focussed to their information needs, e.g. a chapter of a book instead of an entire book.
The structure of documents is exploited to identify which document components to retrieve.
Structured Documents
Linear order of words, sentences, paragraphs …
Hierarchy or logical structure of a book’s chapters, sections …
Links (hyperlink), cross-references, citations …
Temporal and spatial relationships in multimedia documents
(figure: the logical structure of a book: chapters, sections, paragraphs; a Web page as another example of a structured document)
Structured document retrieval on the Web is an important topic of today's research.
Structured Documents
Explicit structure formalised through document representation standards (mark-up languages):
Layout: LaTeX (publishing), HTML (Web publishing)
Structure: SGML, XML (Web publishing, engineering), MPEG-7 (broadcasting)
Content/Semantic: RDF, DAML+OIL, OWL (semantic web)
Layout: <b><font size=+2>SDR</font></b><img src="qmir.jpg" border=0>
Structure: <section> <subsection> <paragraph> … </paragraph> <paragraph> … </paragraph> </subsection> </section>
Content/Semantic: <Book rdf:about="book"> <rdf:author="…"/> <rdf:title="…"/> </Book>
XML: eXtensible Mark-up Language
Meta-language (user-defined tags) currently being adopted as the document format language by W3C
Used to describe content and structure (and not layout)
Grammar described in a DTD (used for validation)
<lecture> <title> Structured Document Retrieval </title> <author> <fnm> Smith </fnm> <snm> John </snm> </author> <chapter> <title> Introduction into XML retrieval </title> <paragraph> … </paragraph> … </chapter> … </lecture>
<!ELEMENT lecture (title, author+, chapter+)> <!ELEMENT author (fnm*, snm)> <!ELEMENT fnm (#PCDATA)> …
XML: eXtensible Mark-up Language
Use of XPath notation to refer to the XML structure:
chapter/title: a title that is a direct sub-component of a chapter
//title: any title
chapter//title: a title that is a direct or indirect sub-component of a chapter
chapter/paragraph[2]: the second paragraph directly contained in a chapter
chapter/*: all direct sub-components of a chapter
<lecture> <title> Structured Document Retrieval </title> <author> <fnm> Smith </fnm> <snm> John </snm> </author> <chapter> <title> Introduction into SDR </title> <paragraph> …. </paragraph> … </chapter> …</lecture>
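The XPath patterns above can be tried directly against the sample document, using Python's standard library (`xml.etree.ElementTree` supports a limited XPath subset; element names and text below are illustrative):

```python
# Demonstrates the XPath patterns from the slide on a small lecture document.
import xml.etree.ElementTree as ET

doc = """
<lecture>
  <title>Structured Document Retrieval</title>
  <author><fnm>Smith</fnm><snm>John</snm></author>
  <chapter>
    <title>Introduction into SDR</title>
    <paragraph>First paragraph</paragraph>
    <paragraph>Second paragraph</paragraph>
  </chapter>
</lecture>
"""

root = ET.fromstring(doc)

# chapter/title: a title that is a direct sub-component of a chapter
direct = [t.text for t in root.findall("chapter/title")]

# .//title: any title at any depth (ElementTree needs the leading '.')
any_title = [t.text for t in root.findall(".//title")]

# chapter/paragraph[2]: the second paragraph directly inside a chapter
second = root.find("chapter/paragraph[2]")

# chapter/*: all direct sub-components of a chapter
children = [c.tag for c in root.findall("chapter/*")]
```

Note that ElementTree's subset covers child steps, `//` descent (written `.//`), position predicates and `*`, but not the `about()` content predicates used in CAS queries later in these slides.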
Querying XML documents
Content-only (CO) queries:
'open standards for digital video in distance learning'
Content-and-structure (CAS) queries:
//article[about(., 'formal methods verify correctness aviation systems')]/body//section[about(., 'case study application model checking theorem proving')]
Structure-only (SA) queries:
/article//*section/paragraph[2]
Content-oriented XML retrieval
Return document components of varying granularity (e.g. a book, a chapter, a section, a paragraph, a table, a figure, etc.), relevant to the user's information need both with regards to content and structure.
Content-oriented XML retrieval
Retrieve the best components according to content and structure criteria:
INEX: the most specific component that satisfies the query, while being exhaustive to the query
Shakespeare study: best entry points, which are components from which many relevant components can be reached through browsing
Challenges
(figure: an Article with a Title and two Sections, annotated with term weights for "XML", "retrieval" and "authoring" at both element and article level)
no fixed retrieval unit + nested elements + element types
how to obtain document and collection statistics?
which component is a good retrieval unit?
which components contribute best to the content of the Article? how to estimate? how to aggregate?
Approaches …
vector space model, probabilistic model, Bayesian network, language model, extending DB models, Boolean model, natural language processing, cognitive model, ontology, parameter estimation, tuning, smoothing, fusion, phrase, term statistics, collection statistics, component statistics, proximity search, logistic regression, belief model, relevance feedback
Vector space model
Separate indices are built for each element type: an article index, abstract index, section index, sub-section index and paragraph index.
An RSV is computed per index and normalised; the normalised RSVs are then merged into a single ranking.
tf and idf are computed as for fixed and non-nested retrieval units.
(IBM Haifa, INEX 2003)
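A minimal sketch of this per-index idea: score each element type against its own tf-idf index, normalise within each index, then merge. The function names and the normalisation choice (dividing by the top RSV per index) are illustrative assumptions, not the exact IBM Haifa implementation:

```python
# Sketch of multi-index vector space retrieval with normalised RSV merging.
import math
from collections import Counter

def tfidf_rsv(query_terms, element_terms, df, n_elements):
    """Simple tf-idf retrieval status value for one element."""
    tf = Counter(element_terms)
    return sum(tf[t] * math.log(n_elements / df[t])
               for t in query_terms if df.get(t))

def merge_indices(query_terms, indices):
    """indices: {index_name: [(element_id, terms), ...]}, one df table per index."""
    results = []
    for name, elements in indices.items():
        n = len(elements)
        df = Counter(t for _, terms in elements for t in set(terms))
        rsvs = [(eid, tfidf_rsv(query_terms, terms, df, n))
                for eid, terms in elements]
        top = max(r for _, r in rsvs) or 1.0
        # normalise within each index so scores are comparable across indices
        results += [(eid, r / top) for eid, r in rsvs]
    return sorted(results, key=lambda x: -x[1])
```

Per-index normalisation is the key step: raw RSVs from a paragraph index and an article index are not on the same scale, so they cannot be merged directly.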
Language model
Element score: a mixture of an element language model and a collection language model, combined with a smoothing parameter λ.
Elements are ranked by a score combining element size, element score and article score.
Query expansion with blind feedback; elements with fewer than 20 terms are ignored.
A high value of λ leads to an increase in the size of retrieved elements.
Results with λ = 0.9, 0.5 and 0.2 are similar.
(University of Amsterdam, INEX 2003)
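The smoothed element language model can be sketched as a linear (Jelinek-Mercer style) mixture, P(t|e) = λ·P(t|element) + (1−λ)·P(t|collection), with the query scored as a sum of log probabilities. Parameter values and names are illustrative, not Amsterdam's exact system:

```python
# Sketch of a smoothed element language model score.
import math
from collections import Counter

def element_score(query, element_terms, collection_terms, lam=0.5):
    """Log-likelihood of the query under the smoothed element model.
    Assumes every query term occurs at least once in the collection."""
    el, col = Counter(element_terms), Counter(collection_terms)
    el_len, col_len = len(element_terms), len(collection_terms)
    score = 0.0
    for t in query:
        p_el = el[t] / el_len          # element language model (MLE)
        p_col = col[t] / col_len       # collection language model (MLE)
        score += math.log(lam * p_el + (1 - lam) * p_col)
    return score
```

The collection model keeps the score finite for query terms missing from an element, which is why a high λ (little smoothing) favours elements, often larger ones, that contain all query terms themselves.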
Evaluation of XML retrieval: INEX
Evaluating the effectiveness of content-oriented XML retrieval approaches
Collaborative effort: participants contribute to the development of the collection (queries, relevance assessments)
Similar methodology as for TREC, but adapted to XML retrieval
40+ participants worldwide
Workshop in Schloss Dagstuhl in December (20+ institutions)
INEX Test Collection
Documents (~500MB): 12,107 articles in XML format from the IEEE Computer Society; 8 million elements
INEX 2002: 30 CO and 30 CAS queries; inex2002 metric
INEX 2003: 36 CO and 30 CAS queries; CAS queries are defined according to an enhanced subset of XPath; inex2002 and inex2003 metrics
INEX 2004 is just starting
Tasks
CO: aim is to decrease user effort by pointing the user to the most specific relevant portions of documents.
SCAS: retrieve relevant nodes that match the structure specified in the query.
VCAS: retrieve relevant nodes that may not be the same as the target elements, but are structurally similar.
Relevance in XML
An element is relevant if it "has significant and demonstrable bearing on the matter at hand"
Common assumptions in IR: objectivity, topicality, binary nature, independence
(figure: an article containing sections and numbered paragraphs, illustrating nested assessment units)
Relevance in INEX
Exhaustivity: how exhaustively a document component discusses the query: 0, 1, 2, 3
Specificity: how focused the component is on the query: 0, 1, 2, 3
Relevance: (3,3), (2,3), (1,1), (0,0), …
If all sections are relevant, the article is very relevant and the article is better than the sections; if only one section is relevant, the article is less relevant and the section is better than the article; …
Relevance assessment task
Completeness: for each assessed element, its parent and children elements are also assessed
Consistency: the parent of a relevant element must also be relevant, although to a different extent; exhaustivity increases going up the tree, specificity decreases going up the tree
Use of an online interface
Assessing a query takes a week! Average of 2 topics per participant
Interface
(screenshot of the online assessment interface: current assessments, navigation, groups)
Assessments
With respect to the elements to assess:
26% of assessments were on elements in the pool (66% in INEX 2002); 68% were highly specific elements not in the pool
7% of elements were automatically assessed
INEX 2002: 23 inconsistent assessments per query for one rule
Metrics
Need to consider:
Two dimensions of relevance
Independence assumption does not hold
No predefined retrieval unit
Overlap
Linear vs. clustered ranking
INEX 2002 metric
Quantization:
strict:

$$f_{strict}(exh, spec) = \begin{cases} 1 & \text{if } exh = 3 \text{ and } spec = 3 \\ 0 & \text{otherwise} \end{cases}$$

generalized:

$$f_{gen}(exh, spec) = \begin{cases} 1.00 & \text{if } (exh, spec) = (3,3) \\ 0.75 & \text{if } (exh, spec) \in \{(2,3), (3,2), (3,1)\} \\ 0.50 & \text{if } (exh, spec) \in \{(1,3), (2,2), (2,1)\} \\ 0.25 & \text{if } (exh, spec) \in \{(1,1), (1,2)\} \\ 0.00 & \text{if } (exh, spec) = (0,0) \end{cases}$$
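The two quantisation functions are simple lookups and can be written out directly (a literal transcription of the case distinctions above):

```python
# INEX 2002 quantisation functions mapping (exhaustivity, specificity)
# pairs to a single relevance value.
def f_strict(exh, spec):
    """Strict quantisation: only fully exhaustive and fully specific counts."""
    return 1.0 if (exh, spec) == (3, 3) else 0.0

def f_gen(exh, spec):
    """Generalised quantisation: graded credit for partial relevance."""
    if (exh, spec) == (3, 3):
        return 1.00
    if (exh, spec) in {(2, 3), (3, 2), (3, 1)}:
        return 0.75
    if (exh, spec) in {(1, 3), (2, 2), (2, 1)}:
        return 0.50
    if (exh, spec) in {(1, 1), (1, 2)}:
        return 0.25
    return 0.00
```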
INEX 2002 metric
Precision as defined by Raghavan '89 (based on ESL):

$$P(rel \mid retr)(x) = \frac{x \cdot n}{x \cdot n + j + \frac{s \cdot i}{r + 1}}$$

where n is estimated as

$$n = \sum_{c \in C} f(assess(c))$$
Overlap problem
INEX 2003 metric
Ideal concept space (Wong & Yao '95):

$$spec = \frac{|t \cap c|}{|c|} \qquad exh = \frac{|t \cap c|}{|t|}$$
INEX 2003 metric
Quantization:
strict:

$$exh_{strict}(exh) = \begin{cases} 1 & \text{if } exh = 3 \\ 0 & \text{otherwise} \end{cases} \qquad spec_{strict}(spec) = \begin{cases} 1 & \text{if } spec = 3 \\ 0 & \text{otherwise} \end{cases}$$

generalised:

$$exh_{gen}(exh) = exh / 3 \qquad spec_{gen}(spec) = spec / 3$$
INEX 2003 metric
Ignoring overlap:

$$recall_s = \frac{\sum_{i=1}^{k} |t \cap c_i^U|}{\sum_{i=1}^{N} |t \cap c_i^U|} = \frac{\sum_{i=1}^{k} exh(c_i^U)}{\sum_{i=1}^{N} exh(c_i^U)}$$

$$precision_s = \frac{\sum_{i=1}^{k} \frac{|t \cap c_i^U|}{|c_i^U|} \cdot |c_i^T|}{\sum_{i=1}^{k} |c_i^T|} = \frac{\sum_{i=1}^{k} spec(c_i^U) \cdot |c_i^T|}{\sum_{i=1}^{k} |c_i^T|}$$
INEX 2003 metric
Considering overlap:

$$recall_o = \frac{\sum_{i=1}^{k} exh(c_i^U) \cdot \frac{|c_i^T \setminus \bigcup_{j=1}^{i-1} c_j^T|}{|c_i^T|}}{\sum_{i=1}^{N} exh(c_i^U)}$$

$$precision_o = \frac{\sum_{i=1}^{k} spec(c_i^U) \cdot |c_i^T \setminus \bigcup_{j=1}^{i-1} c_j^T|}{\sum_{i=1}^{k} |c_i^T \setminus \bigcup_{j=1}^{i-1} c_j^T|}$$
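The overlap-aware recall and precision can be sketched in a few lines, modelling each retrieved element's text region (its c^T) as a set of positions, paired with quantised exh and spec values; only the part of a region not already covered by earlier-ranked elements is scored. The data structures are a toy illustration, not the official inex_eval implementation:

```python
# Sketch of the overlap-aware INEX 2003 recall and precision at cutoff k.
# ranked: list of (region, exh, spec) where region is a set of text
# positions and exh/spec are quantised values in [0, 1].
def recall_o(ranked, all_exh_sum, k):
    """all_exh_sum: sum of exh over all N relevant elements (the denominator)."""
    seen, total = set(), 0.0
    for region, exh, spec in ranked[:k]:
        novel = region - seen               # part not covered by earlier ranks
        total += exh * len(novel) / len(region)
        seen |= region
    return total / all_exh_sum

def precision_o(ranked, k):
    seen, num, den = set(), 0.0, 0.0
    for region, exh, spec in ranked[:k]:
        novel = region - seen
        num += spec * len(novel)            # only novel text earns spec credit
        den += len(novel)
        seen |= region
    return num / den if den else 0.0
```

Ranking a section before the article that contains it thus credits the article only for the half of its text the section did not already cover, which is exactly the overlap penalty described below.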
INEX 2003 metric
Penalises overlap by only scoring novel information in overlapping results
Assumes a uniform distribution of relevant information
Issue of stability
Size is considered directly in precision (is it intuitive that large is good or not?)
Recall is defined using exh only; precision is defined using spec only
Alternative metrics
User-effort oriented measures:
Expected Relevant Ratio
Tolerance to Irrelevance
Discounted Cumulated Gain
Lessons learnt
Good definition of relevance
Expressing CAS queries was not easy
Relevance assessment process must be "improved"
Further development on metrics needed
User studies required
Conclusion
XML retrieval is not just about the effective retrieval of XML documents, but also about how to evaluate effectiveness
INEX 2004 tracks: relevance feedback, interactive, heterogeneous collection, natural language query
http://inex.is.informatik.uni-duisburg.de:2004/