DYNAMIC ELEMENT RETRIEVAL IN A STRUCTURED ENVIRONMENT
MAYURI UMRANIKAR
CONTENTS
Introduction
Retrieval Environment
- The Vector Space Model
- INEX Environment
- Flexible Retrieval System
Method Used for Retrieval
- Document Tree – Construction
- Ranking of Elements
- Output
Experiments
Conclusions
INTRODUCTION
Extensible Markup Language (XML) is the preferred format for representing documents; as the number of documents grows, the issue of element retrieval arises
Focus on retrieving relevant elements rather than entire documents
INEX – INitiative for the Evaluation of XML Retrieval
Flexible mechanisms
Different approaches
Term weighting
RETRIEVAL ENVIRONMENT
2 factors – the issues that arise when the focus moves from documents to components, and Salton’s Vector Space Model
Vector Space Model – a term's weight is based on the number of times it occurs in the document
Fox’s Extended Vector Space Model – incorporates objective identifiers
Document vector consists of subvectors
Subvectors contain text and are independently indexed, weighted, searched and retrieved
Term weighting – weighting within subjective vectors
Smart experimental retrieval system
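As an illustration of the extended vector idea above, the sketch below scores a document against a query by combining per-subvector similarities with a weighted sum. It is a minimal sketch and not the Smart implementation: the subvector names (`body`, `author`), the weights in `alphas`, and the use of raw term frequencies with cosine similarity are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extended_similarity(doc, query, alphas):
    """Weighted sum of subvector similarities.

    doc, query: dict mapping subvector name -> sparse term-weight dict.
    alphas: hypothetical per-subvector importance weights.
    """
    return sum(
        alpha * cosine(doc.get(name, {}), query.get(name, {}))
        for name, alpha in alphas.items()
    )


# Subjective (body) and objective (author) subvectors, indexed independently.
doc = {
    "body": Counter("xml element retrieval xml".split()),
    "author": Counter(["crouch"]),
}
query = {
    "body": Counter("xml retrieval".split()),
    "author": Counter(["crouch"]),
}
score = extended_similarity(doc, query, {"body": 0.8, "author": 0.2})
```

Because each subvector is scored separately, an objective subvector can also be used purely as a filter by checking its similarity instead of adding it to the score.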
INEX ENVIRONMENT
Content Only (CO) – ignores document structure; like typical queries, specifies only the content of the search
Content and Structure (CAS) – explicitly refers to structure; exhaustive and specific
CO query results go directly to the user; CAS needs additional filtering and a search of the body portion
CAS returns a rank-ordered list of elements
INEX-EVAL – uses measures of recall and precision
(fig.: exhaustivity and specificity mapped to a single relevance value)
Results are ranked
FLEXIBLE RETRIEVAL SYSTEM
Smart format – documents and topics are translated and indexed as extended vectors
Subjective vectors – contain content-bearing terms
Objective vectors – serve as filters on the results returned by CAS queries
Extended vector – subjective vector, with the terms of a paragraph in the body subvector
Lnu-ltu weighting
Dynamic flexible retrieval – tree representation; rank-ordered list by Lnu weights
METHOD FOR FLEXIBLE RETRIEVAL
Input – given a query Q and the paragraph index, retrieve a rank-ordered list of paragraphs (terminal nodes)
The N top-ranked paragraphs are selected as input
This set of paragraphs is used to identify documents – elements are generated and returned as output
Document tree – needs information about the document structure:
Terminal nodes
Pre-order traversal
Terminal nodes found in paragraph index
SIMPLE XML DOCUMENT AND ITS SCHEMA
CONSTRUCTION OF DOCUMENT TREE
For query Q, the n top-ranked paragraphs are used to build the trees
Leaf elements (terminal nodes) – paragraph nodes
Each leaf is represented by a term-frequency weighted vector
1st – gather all leaf nodes; the terminal nodes are then done
2nd – merge the children's vectors to form their parents
The document schema determines the merging
Parent – the unique terms of its children; term-frequency weighted parent vector (has the content of its children)
The process is carried out recursively
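The bottom-up merging described above can be sketched as follows. This is a minimal illustration under stated assumptions: the tuple-of-element-names path format and the `build_tree` helper are hypothetical, and the document schema is assumed to be implied by each leaf's path. Adding a leaf's term frequencies to every ancestor gives the same parent vectors as merging children level by level.

```python
from collections import Counter


def build_tree(leaves):
    """Build element vectors for a document tree from its paragraph leaves.

    leaves: dict mapping a path (tuple of element names, root first) to a
            Counter of term frequencies for that paragraph.
    Returns a dict mapping every element path (leaf or internal node) to the
    merged term-frequency Counter of all paragraphs beneath it.
    """
    tree = {}
    for path, term_freqs in leaves.items():
        # Every prefix of the path is an ancestor (the full path is the leaf).
        for depth in range(1, len(path) + 1):
            tree.setdefault(path[:depth], Counter()).update(term_freqs)
    return tree


# Hypothetical document with two sections and three paragraphs.
leaves = {
    ("article", "sec[1]", "p[1]"): Counter({"xml": 2, "retrieval": 1}),
    ("article", "sec[1]", "p[2]"): Counter({"xml": 1, "index": 1}),
    ("article", "sec[2]", "p[1]"): Counter({"tree": 3}),
}
tree = build_tree(leaves)
section_vector = tree[("article", "sec[1]")]
```

Here `section_vector` holds the unique terms of both paragraphs in the first section, with their frequencies summed, and the root `("article",)` holds the terms of the whole document.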
RANKING OF ELEMENTS
The set of elements of the document tree is generated
Problem – structured retrieval needs a rank-ordered list of elements
Method used – all-element index (a separate representation for each element of each document, plus weighting information)
Lnu weights – suit elements of variable length; do not require global frequency information
Normalization and length – failing to normalize results in biased values
Pivot – the document length at which probability of relevance = probability of retrieval
Slope – amount of tilting
Pivoted normalization – reduces the difference between the two probabilities
Lnu term weights:
((1+log(term_freq))/(1+log(avg_term_freq)))/((1-slope)+slope*(no_unique_terms/pivot))
ltu weighting – N is the collection size, nk the number of elements containing the term:
((1+log(term_freq))*log(N/nk))/((1-slope)+slope*(no_unique_terms/pivot))
N and nk are element dependent and should be known through indexing
As we move up the tree: N – count the elements of each type
nk – inverted file entry in the paragraph index, mapping identifiers and xpaths (given)
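The two weighting schemes can be sketched in a few lines. This is an illustration only: the function names, the natural logarithm, and the sample `slope` and `pivot` values are assumptions, and the ltu numerator uses the standard tf-times-idf form, (1+log(term_freq)) multiplied by log(N/nk).

```python
import math


def pivoted_norm(num_unique_terms, slope, pivot):
    """Pivoted unique normalization factor shared by Lnu and ltu."""
    return (1.0 - slope) + slope * (num_unique_terms / pivot)


def lnu_weights(term_freqs, slope, pivot):
    """Lnu weights for an element; needs no collection-wide statistics."""
    if not term_freqs:
        return {}
    avg_tf = sum(term_freqs.values()) / len(term_freqs)
    norm = pivoted_norm(len(term_freqs), slope, pivot)
    return {
        term: ((1 + math.log(tf)) / (1 + math.log(avg_tf))) / norm
        for term, tf in term_freqs.items()
    }


def ltu_weights(term_freqs, n_elements, elements_with_term, slope, pivot):
    """ltu weights for a query.

    n_elements: N, the number of elements of the type being ranked.
    elements_with_term: dict term -> nk; terms missing from it are skipped.
    """
    norm = pivoted_norm(len(term_freqs), slope, pivot)
    return {
        term: ((1 + math.log(tf)) * math.log(n_elements / elements_with_term[term])) / norm
        for term, tf in term_freqs.items()
        if term in elements_with_term
    }


# Hypothetical values chosen so the normalization factor is 1.
element = lnu_weights({"xml": 2, "retrieval": 2}, slope=0.2, pivot=2)
query = ltu_weights({"xml": 1}, n_elements=100, elements_with_term={"xml": 10},
                    slope=0.2, pivot=1)
```

Since Lnu uses only statistics local to the element, it can be computed on the fly for any node of a dynamically built tree, whereas the ltu query weights need N and nk for the element type being ranked.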
OUTPUT OF FLEXIBLE RETRIEVAL
Select another leaf node, gather its siblings, construct the document tree, calculate the Lnu term weights, and correlate with the ltu-weighted query to produce another rank-ordered list
After the n top-ranked paragraphs are exhausted and the last list is produced, merge the lists
Single set of elements, rank-ordered by correlation with Q
Comparison – flexible retrieval & all-element index: identical when the set of n paragraphs input to flexible retrieval contains all paragraphs and the same values are used for Lnu-ltu
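The final merge step above can be sketched as follows, assuming each tree has already produced Lnu-weighted element vectors and the query is ltu weighted. The element identifiers, the weights, and the helper names are hypothetical; correlation is taken to be the inner product of the two vectors.

```python
import heapq


def inner_product(element_vec, query_vec):
    """Correlation of an element vector with the query vector."""
    return sum(w * query_vec.get(term, 0.0) for term, w in element_vec.items())


def rank_tree(tree_weights, query_vec):
    """Rank the elements of one document tree, best first.

    tree_weights: dict element_id -> Lnu term-weight dict.
    Returns a list of (score, element_id) tuples in descending order.
    """
    scored = [(inner_product(vec, query_vec), eid)
              for eid, vec in tree_weights.items()]
    return sorted(scored, reverse=True)


def merge_ranked(ranked_lists):
    """Merge per-tree lists (each sorted descending) into one ranked list."""
    return list(heapq.merge(*ranked_lists, reverse=True))


# Hypothetical weights for two trees and one query.
query = {"xml": 2.0, "tree": 1.0}
tree_a = {
    "a/sec[1]": {"xml": 1.0},
    "a/sec[1]/p[1]": {"xml": 0.5, "tree": 0.5},
}
tree_b = {"b/sec[2]": {"tree": 3.0}}

merged = merge_ranked([rank_tree(tree_a, query), rank_tree(tree_b, query)])
ranked_ids = [eid for _, eid in merged]
```

Merging already-sorted lists avoids re-sorting every element once the last tree has been processed.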
ALGORITHM
EXPERIMENTS
Paragraph – the result is a set of extended vectors, each representing a paragraph
CO – the subvector represents the subjective portion; the body subvector is the important one (the content of the element, not its type, is contained in the body)
Tree representation
FACTORS OF INTEREST
Slope and pivot for Lnu-ltu
Effective structured retrieval
Can be determined empirically and applied from one collection to another; generic
N – number of paragraphs input; sets an upper bound on the number of trees per query
The actual number of trees depends on how many paragraphs belong to the same group or the same document
EXPERIMENTS DONE
All-element and dynamic/flexible retrieval experiments and results
- body-only retrieval
Correlation between element and query vector produced – correlation of body elements only
Table 1
RESULTS
Tables
Results are equivalent
Flexible retrieval is more efficient – file space
Time required for indexing is half
Dynamic – costs more on a per-query basis, depending on n; the exact total number of trees required is not specified
Another factor – value of nk
DISCUSSIONS AND CONCLUSIONS
Flexible retrieval dynamically produces a rank-ordered list of elements from a single indexing at the level of the basic indexing node (paragraph)
Basic functions – SMART; extended vector model
Results – flexible capabilities
Attempt to incorporate other subvectors, internal nodes, weights
INEX – exhaustivity and specificity; results are exhaustive; research on specificity is ongoing; the results are a reflection of this
It is a better way of retrieval than all-element indexing
THANK YOU!!!