An information retrieval system for parliamentary documents

Book: Bayesian Networks: A Practical Guide to Applications
Paper authors: Luis M. de Campos, Juan M. Fernandez-Luna, Juan F. Huete, Carlos Martine, Alfonso E. Romero
Chapter: 12

Presented by Quratulain
CSE 655 Probabilistic Reasoning
Faculty of Computer Science, Institute of Business Administration


Outline
- Introduction
- Overview of information retrieval systems
- Bayesian network and information retrieval
- Theoretical foundations
- Building the information retrieval system
- Conclusion

10 oct, 2009


Introduction/Motivation
- To fulfil the objectives of democracy, all activities of parliament need to be made public.
- Previously, information was sent in printed form to all official organizations and libraries.
- Currently, electronic documents are published on the web, which is faster, cheaper and easier.
- The official bulletin and the transcripts of all speeches in the different sessions are edited and then published on the website as PDF.
- The documents are accessible using database-like queries.


Problems
- To access information, the user must know:
  - Session number
  - Date of legislature
- This makes the information difficult to access.


Goal
- A website with a real search engine based on content.
- The user poses a natural language query to access the information.
- The system returns the relevant documents.
- The output is a set of document components of varying granularity (from a complete document to a single paragraph), sorted by degree of relevance.

** This will avoid manual search **


Overview of information retrieval
- Information retrieval is concerned with the representation, storage, organization, and accessing of information items.
- Information retrieval systems work as follows:
  - Start from a given set of documents.
  - Pre-processing:
    - Remove words that are not useful for search (stopwords).
    - Convert each word to its stem (to reduce the vocabulary).
    - Associate each word with a weight expressing its importance (in a document or in the collection of documents).
  - The natural language query is indexed in the same way, and the query representation is matched against the stored documents using any IR model.
  - Finally, a set of document identifiers is presented to the user, sorted by degree of relevance.
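The pipeline described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list, the crude `stem` function, the tf-idf style weight and the sample documents are all made up for the example, and the slides do not say which stemmer or weighting scheme the real system uses.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real system would use a fuller one.
STOPWORDS = {"the", "of", "a", "an", "in", "on", "and", "to", "is", "for"}

def stem(word):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

def build_index(docs):
    """Return tf-idf style weights per document: {doc_id: {term: weight}}."""
    term_counts = {d: Counter(preprocess(t)) for d, t in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n = len(docs)
    return {
        d: {t: tf * math.log(1 + n / doc_freq[t]) for t, tf in counts.items()}
        for d, counts in term_counts.items()
    }

def search(query, index):
    """Rank documents by the summed weights of the query terms they contain."""
    q_terms = set(preprocess(query))
    scores = {
        d: sum(w for t, w in weights.items() if t in q_terms)
        for d, weights in index.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(d, s) for d, s in ranked if s > 0]

# Made-up sample collection.
docs = {
    "d1": "Budget debate in the parliament session",
    "d2": "Questions on education funding",
    "d3": "The education budget and the budget law",
}
index = build_index(docs)
results = search("budgets for education", index)
print(results)
```

The query goes through the same `preprocess` step as the documents, so "budgets" and "budget" meet on the same stem before matching.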


Overview of information retrieval
- Standard IR treats documents as atomic entities.
- XML allows structured documents with semantics.
- Structured IR views documents as aggregates of interrelated structural elements that are indexed.
- Structured IR models exploit both the content and the structure of documents to estimate the relevance of document components to a query.
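To make the idea of retrievable document components concrete, the sketch below walks a small XML document with Python's standard `xml.etree.ElementTree` module and lists every element as a candidate unit. The tag names and text are invented for the example; the slides do not show the markup used for the parliamentary collection.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; tag names and content are invented for this example.
XML = """
<document>
  <title>Session record</title>
  <section>
    <title>Education</title>
    <paragraph>Debate on the education budget.</paragraph>
    <paragraph>Vote on school funding.</paragraph>
  </section>
  <section>
    <title>Health</title>
    <paragraph>Questions on hospital waiting lists.</paragraph>
  </section>
</document>
"""

def structural_units(element, path):
    """Yield (path, text) for the element and each of its descendants."""
    # itertext() gathers the text of the element and everything nested in it.
    text = " ".join("".join(element.itertext()).split())
    yield path, text
    seen = {}
    for child in element:
        # Number same-tag siblings so that every path is unique.
        seen[child.tag] = seen.get(child.tag, 0) + 1
        yield from structural_units(child, f"{path}/{child.tag}[{seen[child.tag]}]")

root = ET.fromstring(XML.strip())
units = dict(structural_units(root, f"/{root.tag}"))

for path, text in units.items():
    print(path, "->", text)
```

Each path identifies one component, from the whole document down to a single paragraph, which is the range of granularity a structured IR model can rank.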


Bayesian networks and information retrieval
- Bayesian networks were first applied to IR at the beginning of the 1990s by Croft and Turtle.
- Bayesian network models in IR compute the probability of relevance given a document and a query.
- Two important BN models within IR:
  - Belief network model
  - Bayesian network retrieval model
- Common features:
  - Each index term and each document is represented as a node in the network.
  - Links connect each document node with the term nodes of the terms that index it.
- The models differ in:
  - The direction of the arcs.
  - Additional arcs (relationships between documents and terms).


BN-based retrieval model

[Figure: two-layer network with a layer of term nodes (T1 to T7) and a layer of document nodes (D1 to D3).]


Drawbacks of Bayesian networks
1. The time and space required to assess the distributions and store them (the number of conditional probabilities per node is exponential in the number of parent nodes).
2. The efficiency of carrying out inference, because general inference in BNs is an NP-hard problem.

Therefore, the direct approach, in which the evidence contained in a query is propagated through the whole network, is infeasible.
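The first drawback is easy to quantify. For a binary node whose parents are all binary, a full conditional probability table needs one distribution per parent configuration, that is 2^k of them for k parents. The parent counts below are arbitrary illustrative values, not figures from the chapter.

```python
def cpt_entries(num_parents):
    """Parent configurations of a binary node whose parents are all binary."""
    return 2 ** num_parents

# Arbitrary parent counts, chosen only to show the growth.
for k in (5, 10, 20, 40):
    print(f"{k:>2} parents -> {cpt_entries(k):,} conditional distributions")
```

A document node linked to a few dozen index terms already falls in the last row, which is why an explicit table is not an option.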


Theoretical foundations
- Set of documents D = {D1, D2, ..., DM}.
- Set of terms used to index these documents.
- Each document Di is organized hierarchically, representing structural associations of elements in Di called structural units.
- For each document these associations form a tree; a scientific article is an example.
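A document organized as a tree of structural units maps naturally onto a small recursive data structure. The sketch below is one possible representation, not the one used by the authors; the class name `Unit`, its fields, and the index terms attached to each leaf are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    """A structural unit: a leaf holding index terms, or a container of units."""
    name: str
    terms: list = field(default_factory=list)      # index terms (leaf units)
    children: list = field(default_factory=list)   # nested structural units

    def walk(self, depth=0):
        """Yield (depth, unit) pairs in document order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)

    def all_terms(self):
        """Collect the index terms of this unit and of everything below it."""
        found = set(self.terms)
        for child in self.children:
            found |= child.all_terms()
        return found

# A cut-down scientific article; the terms are invented.
article = Unit("Document 1", children=[
    Unit("Title", terms=["retrieval", "parliament"]),
    Unit("Section 1", children=[
        Unit("Title", terms=["model"]),
        Unit("Parag 1", terms=["bayesian", "network"]),
        Unit("Parag 2", terms=["network", "inference"]),
    ]),
    Unit("Bibliography", children=[
        Unit("Ref 1", terms=["croft"]),
        Unit("Ref 2", terms=["turtle"]),
    ]),
])

for depth, unit in article.walk():
    print("  " * depth + unit.name)
```

Every unit has exactly one container above it, which is what makes the structure a tree rather than a general graph.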


The structure of scientific article

[Figure: tree for Document 1. Labels on the slide: Title, Author, Abstract, Section 1, Section 2, Bibliography; Subsec 1, Subsec 2; Title, Parag 1, Parag 2; Ref 1, Ref 2; Index Terms at the bottom level.]


BN model for document
- The BN model of a document contains three kinds of nodes:
  - Terms, T = {T1, T2, ..., Tl}
  - Basic structural units, Ub = {B1, B2, ..., Bm}
  - Complex structural units, Uc = {S1, S2, ..., Sn}
- Set of all structural units: U = Ub ∪ Uc
- Each node T, B or S has an associated binary random variable taking values in {t-, t+}, {b-, b+} or {s-, s+} respectively, where (-) means not relevant and (+) means relevant.


BN model for document

[Figure: BN with term nodes T1 to T16, basic structural units B1 to B7 (Ub) and complex structural units S1 to S4 (Uc).]

Pa(S1) ∩ Pa(S2) = ∅ for all S1 ≠ S2 ∈ Uc


BN for document
- Conditional probabilities to be specified:
  - P(t+)
  - P(b+ | pa(B))
  - P(s+ | pa(S))
- Because nodes can have a large number of parents, an efficient inference procedure is needed.
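One way to avoid exponential tables, shown here purely as an assumption since the slides do not state which form the authors chose, is a weighted-sum canonical model: P(u+ | pa(U)) is the sum of the weights of those parents that are relevant in the configuration, with non-negative weights adding up to at most 1. By linearity, the probability of a unit is then the weighted sum of its parents' probabilities, so a single top-down pass is enough. The network, the weights and the prior of 0.1 for terms outside the query are all illustrative.

```python
# Illustrative weights (assumption): unit -> {parent: weight}.
# Units are listed so that parents always appear before their children.
NETWORK = {
    "B1": {"T1": 0.6, "T2": 0.4},
    "B2": {"T3": 0.5, "T4": 0.5},
    "S1": {"B1": 0.5, "B2": 0.5},
}

def propagate(network, term_probs):
    """Compute P(u+) for each unit under the assumed weighted-sum model."""
    probs = dict(term_probs)
    # dicts keep insertion order, so B1 and B2 are evaluated before S1.
    for unit, weights in network.items():
        probs[unit] = sum(w * probs[parent] for parent, w in weights.items())
    return probs

# Query terms T1 and T3 are set to relevant; the rest keep an assumed prior.
PRIOR = 0.1
query_terms = {"T1", "T3"}
term_probs = {t: (1.0 if t in query_terms else PRIOR) for t in ("T1", "T2", "T3", "T4")}

posterior = propagate(NETWORK, term_probs)
for unit in NETWORK:
    print(unit, round(posterior[unit], 3))
```

The cost is one multiplication per arc, and each node stores one weight per parent instead of a table that doubles with every parent added.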


Influence diagram model
- Once the BN has been constructed, it is transformed into an influence diagram by adding decision and utility nodes.
  - Chance nodes: the nodes of the previous BN
  - Decision nodes:
  - Utility nodes:


Building the information retrieval system (PAIRS)
- PAIRS is a software package (it stores documents in a relational database).
- Written in C++.
- Specifically developed to store and retrieve documents generated by the Parliament of Andalusia.
- Based on a probabilistic model.

[Figure: General scheme of PAIRS. Boxes: PDF document collection, XML document collection, Indexing System, Indexed Document Collection, Query, Indexed Query, Search Engine, Retrieved Document Components.]


Conclusion
- This paper presents a retrieval system for parliamentary information based on a probabilistic model.
- The system has proven efficient in terms of indexing and retrieval time.
- Bayesian network technologies can be employed in problem domains whose dimensionality would previously have ruled out their use.
- The system is not a finished product; several improvements are still required.