Igor Mokriš, Lenka Skovajsová Institute of Informatics, SAS Bratislava, Slovakia
Information Retrieval by means of Vector Space Model of Document Representation and Cascade Neural Networks
Igor Mokriš, Lenka Skovajsová
Institute of Informatics, SAS Bratislava, Slovakia
Summary
Development of a neural network model for information retrieval from text documents in the Slovak language, based on the vector space model of document representation
Key words: Information Retrieval, Queries, Keywords, Text Documents, Neural Networks, Slovak Language
Text Document Analysis
The most common approaches:
- Statistical – analyses words in text documents, comparing them with keywords
- Linguistic – extracts linguistic units from text (phoneme, morpheme, lexeme, ...)
- Knowledge-based – uses domain models of documents described by an ontology
Porter algorithm for English
The Slovak language is more complicated:
- Inflection of Slovak – grammatical forms of nouns, adjectives, pronouns, ...
- Complicated word formation – conjugation and declension, prefixes and suffixes, ...
- Synonyms and homonyms
- Phrases containing more than one word, and so on
System for Information Retrieval in STD
(Furdík, K.: Information Retrieval in Natural Language by Hypertext Structure, 2003)
[Architecture diagram: the user enters a query and receives a document list as the answer; the administrator manages the document base (collection), which is filled from an external document space (WWW, HDD, ...) by the document base manager. Further components: ontology as the knowledge domain model, with a conceptualization tool, an ontology editor and (semi-)automatic conceptualization of objects; format tools for coding and format transition; agents for document retrieval from the web; attribute document evaluation; full-text indexing; language analysis processing (morphological, lexical, syntactic, semantic); a user interface with query processing, arrangement and result-list creation, and feedback solution; indexation into an attribute index, a full-text index and a conceptual index. Document entry is followed by semi-automatic conceptualization and domain model change on the basis of the language analysis solution.]
How to continue?
Utilization of Neural Networks
A well-trained NN is able to simplify Slovak text analysis, is invariant to the inflection of Slovak words, and performs linguistic analysis faster.
Disadvantage: problems with learning and the static structure of the NN.
The System for Information Retrieval can be simplified:
[Diagram: the user query subsystem applies text operations to the query and performs document retrieval over the document index; the indexation subsystem indexes the documents; the administrator subsystem manages the document base; the retrieved documents are returned to the user as the answer.]
It means a 3-layer information retrieval system – the most simplified structure of the system:
Queries – Keywords – Documents
Next solution – representation of the query, keyword and document layers by neural networks:
1st neural network: Query → Keywords
2nd neural network: Keywords → Documents
Development of the 1st NN for Keyword Determination
1st NN – feed-forward NN of the back-propagation type
[Diagram: query inputs x1, x2, ..., xN feed a hidden layer y1, ..., yM through weights w11, ..., wMN with activations net_y and thresholds Θy1, ..., ΘyM; the hidden layer feeds the keyword outputs k1, k2, ..., kL through weights w11, ..., wLM with activations net_k and thresholds Θk1, ..., ΘkL.]
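The feed-forward back-propagation step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the layer sizes partly follow the experiments (12 query inputs, 20 keywords), while the hidden size, the training pair, and the learning rate are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes: query inputs x1..xN, hidden units y1..yM, keyword outputs k1..kL.
# N and L follow the paper's experiments; M is an arbitrary choice.
N, M, L = 12, 8, 20

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights and thresholds of the two layers (W, Θ in the diagram).
W1 = rng.normal(0, 0.5, (M, N)); b1 = np.zeros(M)   # query  -> hidden
W2 = rng.normal(0, 0.5, (L, M)); b2 = np.zeros(L)   # hidden -> keywords

def forward(x):
    y = sigmoid(W1 @ x - b1)       # net_y = W1·x − Θy
    k = sigmoid(W2 @ y - b2)       # net_k = W2·y − Θk
    return y, k

# One toy training pair: an encoded query and its target keyword vector.
x = rng.random(N)
t = np.zeros(L); t[3] = 1.0        # the query should activate keyword k4

eta = 0.5
for _ in range(2000):              # plain gradient-descent back-propagation
    y, k = forward(x)
    delta_k = (k - t) * k * (1 - k)             # output-layer error term
    delta_y = (W2.T @ delta_k) * y * (1 - y)    # hidden-layer error term
    W2 -= eta * np.outer(delta_k, y); b2 += eta * delta_k
    W1 -= eta * np.outer(delta_y, x); b1 += eta * delta_y

_, k = forward(x)
print(k.argmax())   # the trained net activates the target keyword
```

The threshold update has the opposite sign of the weight update because the net input is defined as W·x − Θ.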
Development of the 2nd NN for Document Determination – Vector Space Model

K(m x n) = | k11  k12  ...  k1n |
           | k21  k22  ...  k2n |
           | ...  ...  ...  ... |
           | km1  km2  ...  kmn |

K(m x n) – vector space matrix
k_kd – frequency of keyword k in document d
k – number of keywords
d – number of documents
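Building such a keyword-by-document frequency matrix is straightforward; a small self-contained sketch (the keywords and documents are invented toy data, much smaller than the paper's 20 x 90 matrix):

```python
import numpy as np

# Toy collection: 3 keywords and 4 documents (invented for illustration).
keywords = ["network", "retrieval", "ontology"]
documents = [
    "neural network network model",
    "information retrieval with a neural network",
    "ontology based retrieval",
    "domain ontology editor",
]

k, d = len(keywords), len(documents)
K = np.zeros((k, d))
for j, doc in enumerate(documents):
    words = doc.split()
    for i, kw in enumerate(keywords):
        K[i, j] = words.count(kw)   # k_kd: frequency of keyword k in document d

print(K)
```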
NN with Spreading Activation Function – Determination of Documents
[Diagram: keyword layer k1, k2, ..., kL connected to the document layer d1, d2, ..., dP through weights w11, ..., wLP; document activations net_dj.]
The SAF NN does not learn – its weights are set directly by the equation
W = K
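Because W = K, the spreading-activation step reduces to a single matrix product. A minimal sketch with an invented toy matrix (the real system uses the 20 x 90 vector space matrix):

```python
import numpy as np

# Vector space matrix K (keywords x documents); weights set by W = K.
K = np.array([[2., 1., 0., 0.],    # keyword k1 frequencies in documents d1..d4
              [0., 1., 1., 0.],    # k2
              [0., 0., 1., 1.]])   # k3
W = K                              # no training: weights come directly from K

# Keyword activations produced by the 1st NN (here: only k2 is active).
k_act = np.array([0., 1., 0.])

# Spreading activation: net_dj = sum_l w_lj * k_l for every document j.
net_d = W.T @ k_act
print(net_d)            # activation of each document
print(net_d.argmax())   # index of a most relevant document
```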
Experiments
Model of the cascade NN in Matlab:
- Query layer – 12 characters
- Keyword layer – 20 keywords
- Document layer – 90 documents
- Each document – approx. 50 words
- QTrS – 164 queries of the training set
- KwTrS – 20 keywords of the training set
- The 2nd NN is not trained
Experiments
1st experiment: QTsS1 – 185 queries; the questions from QTsS1 belong to keywords from KwTrS; precision 0.996
2nd experiment: QTsS2 – 100 queries; the questions from QTsS2 belong to no keywords from KwTrS; precision 0.97
Disadvantage of the VS Model Approach: the large dimension of the VS matrix.
Next approach – dimension reduction of the VS matrix – the Latent Semantic Model.
Latent Semantic Model – Singular Value Decomposition of the Vector Space Matrix K
K = U S V^T
U – column eigenvectors of K·K^T (left singular vectors)
V – column eigenvectors of K^T·K (right singular vectors)
S – diagonal matrix of singular values of K
dim(S) < dim(K)
VS Matrix Dimension Reduction – Truncated SVD (S_r < S)

K_r(m x n) = U(m x r) · S_r(r x r) · V^T(r x n)

k – number of singular values s_i, r < k
r – number of s_i kept after dimension reduction
The number of elements of the reduced matrices is lower than the number of elements of the matrix K.
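The truncation and the element-count comparison can be checked with NumPy's SVD routine. A sketch using the paper's dimensions (20 keywords x 90 documents) but a synthetic frequency matrix and an arbitrary choice of r:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 20, 90                       # keywords x documents, as in the paper
K = rng.integers(0, 4, (m, n)).astype(float)   # synthetic frequency matrix

U, s, Vt = np.linalg.svd(K, full_matrices=False)   # K = U·S·V^T

r = 5                               # keep r < k singular values
Ur, Sr, Vtr = U[:, :r], np.diag(s[:r]), Vt[:r, :]
Kr = Ur @ Sr @ Vtr                  # K_r(m x n) = U(m x r)·S_r(r x r)·V^T(r x n)

# The reduced factors store fewer elements than K itself:
full_elems = m * n                          # 20 * 90 = 1800
reduced_elems = m * r + r * r + r * n       # 100 + 25 + 450 = 575
print(reduced_elems < full_elems)           # True
```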
Solution of the VSM Dimension Reduction
Document relevance D is defined by
D = Q x K,
where Q is the set (matrix) of queries and K is the VS matrix.
Reduced document relevance D_r is defined by
D_r = Q x K_r,
where K_r = U·S_r·V^T is the reduced VS matrix.
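The two relevance products can be compared directly; a toy sketch in which the query matrix Q (rows = keyword-weight vectors of individual queries) and the matrix K are invented:

```python
import numpy as np

# Vector space matrix K (keywords x documents) and a query matrix Q.
K = np.array([[2., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.]])
Q = np.array([[1., 0., 0.],        # query 1 asks for keyword k1
              [0., 1., 1.]])       # query 2 asks for keywords k2 and k3

D = Q @ K                          # document relevance D = Q x K

# Reduced relevance through the truncated SVD of K.
U, s, Vt = np.linalg.svd(K, full_matrices=False)
r = 2
Kr = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
Dr = Q @ Kr                        # D_r = Q x K_r approximates D

print(np.abs(D - Dr).max())        # approximation error of the reduction
```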
Experiments
- Collection of 90 documents with 20 keywords – vector space matrix
- Dimension reduction by truncated singular value decomposition
- For each chosen number of singular values, computation of precision, recall, and the absolute and relative number of elements k_il
Evaluation of Experiments – precision, recall, number of elements of the reduced VS matrix
R – recall: R = n_retrel / n_rel
  n_retrel – number of retrieved relevant documents
  n_rel – number of relevant documents
P – precision: P = n_retrel / n_ret
  n_ret – number of retrieved documents
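Both measures follow directly from the sets of retrieved and relevant documents; a small self-contained sketch with invented document IDs:

```python
def recall(retrieved, relevant):
    # R = n_retrel / n_rel
    return len(set(retrieved) & set(relevant)) / len(set(relevant))

def precision(retrieved, relevant):
    # P = n_retrel / n_ret
    return len(set(retrieved) & set(relevant)) / len(set(retrieved))

# Hypothetical run: 5 documents retrieved, 8 relevant, 4 in the overlap.
retrieved = [1, 2, 3, 4, 5]
relevant = [2, 3, 4, 5, 6, 7, 8, 9]

print(precision(retrieved, relevant), recall(retrieved, relevant))  # 0.8 0.5
```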
Results
(s_i – number of retained singular values; Absolute/Relative – number of elements of the reduced matrices)

s_i   Precision   Recall   Absolute   Relative
 1     0.7942     0.24       110       0.632
 2     0.95       0.314      121       0.695
 3     0.95       0.405      137       0.787
 5     0.975      0.512      148       0.850
 7     0.977      0.634      161       0.925
10     1.0        0.754      165       0.948
15     1.0        0.95       173       0.994
20     1.0        1.0        174       1.0
Conclusion
As the table shows, precision and recall grow with the number of retained singular values: with 10 of the 20 singular values the precision already reaches 1.0 while the reduced matrices hold only 94.8 % of the elements of the full VS matrix; full recall is reached only with all 20 singular values.