Igor Mokriš, Lenka Skovajsová Institute of Informatics, SAS Bratislava, Slovakia
Information Retrieval by means of Vector Space Model of Document Representation and Cascade Neural Networks
Igor Mokriš, Lenka Skovajsová
Institute of Informatics, SAS Bratislava, Slovakia
Summary
Development of a neural network model for information retrieval from text documents in the Slovak language, based on the vector space model of document representation
Key words: Information Retrieval, Queries, Keywords, Text Documents, Neural Networks, Slovak Language
Text Document Analysis
The most common approaches:
- Statistical – analyses words in text documents, comparing them with keywords
- Linguistic – extracts linguistic units from text (phoneme, morpheme, lexeme, ...)
- Knowledge-based – uses domain models of documents described by an ontology
Porter algorithm for English
The Slovak language is more complicated:
- Inflection of Slovak – grammatical forms of nouns, adjectives, pronouns, ...
- Complicated word formation – conjugation and declension, prefixes and suffixes, ...
- Synonyms and homonyms
- Phrases containing more than one word, and so on
System for Information Retrieval in STD
(Furdík, K.: Information Retrieval in Natural Language by Hypertext Structure, 2003)
[Architecture diagram: the user enters a query and receives a document list as the answer; the administrator manages the document base (collection), which is filled from an external document space (WWW, HDD, ...) by the document base manager. Further components: ontology as the knowledge domain model, with a conceptualization tool, an ontology editor and (semi-)automatic conceptualization of objects; format tools for coding and format transition; agents for document retrieval from the web; attribute document evaluation; full-text indexing; language analysis processing (morphological, lexical, syntactic, semantic); a user interface with query processing, arrangement and result-list creation, and feedback solution; indexation into an attribute index, a full-text index and a conceptual index. Document entry is followed by semi-automatic conceptualization and domain model change on the basis of the language analysis solution.]
How to continue?
Utilization of Neural Networks
A well-trained NN is able to simplify Slovak text analysis, is invariant to the inflection of Slovak words, and performs linguistic analysis faster.
Disadvantage: problems with learning and the static structure of the NN.
The System for Information Retrieval can be simplified:
[Diagram: the user query subsystem applies text operations to the query and performs document retrieval over the document index; the indexation subsystem indexes the documents; the administrator subsystem manages the document base; the retrieved documents are returned to the user as the answer.]
It means a 3-layer information retrieval system – the most simplified structure of the system:
Queries – Keywords – Documents
Next solution – representation of the query, keyword and document layers by neural networks:
1st neural network: Query → Keywords
2nd neural network: Keywords → Documents
Development of the 1st NN for Keyword Determination
1st NN – feed-forward NN of the back-propagation type
[Diagram: query inputs x1, x2, ..., xN feed a hidden layer y1, ..., yM through weights w11, ..., wMN with activations net_y and thresholds Θy1, ..., ΘyM; the hidden layer feeds the keyword outputs k1, k2, ..., kL through weights w11, ..., wLM with activations net_k and thresholds Θk1, ..., ΘkL.]
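The feed-forward back-propagation step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the layer sizes partly follow the experiments (12 query inputs, 20 keywords), while the hidden size, the training pair, and the learning rate are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes: query inputs x1..xN, hidden units y1..yM, keyword outputs k1..kL.
# N and L follow the paper's experiments; M is an arbitrary choice.
N, M, L = 12, 8, 20

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights and thresholds of the two layers (W, Θ in the diagram).
W1 = rng.normal(0, 0.5, (M, N)); b1 = np.zeros(M)   # query  -> hidden
W2 = rng.normal(0, 0.5, (L, M)); b2 = np.zeros(L)   # hidden -> keywords

def forward(x):
    y = sigmoid(W1 @ x - b1)       # net_y = W1·x − Θy
    k = sigmoid(W2 @ y - b2)       # net_k = W2·y − Θk
    return y, k

# One toy training pair: an encoded query and its target keyword vector.
x = rng.random(N)
t = np.zeros(L); t[3] = 1.0        # the query should activate keyword k4

eta = 0.5
for _ in range(2000):              # plain gradient-descent back-propagation
    y, k = forward(x)
    delta_k = (k - t) * k * (1 - k)             # output-layer error term
    delta_y = (W2.T @ delta_k) * y * (1 - y)    # hidden-layer error term
    W2 -= eta * np.outer(delta_k, y); b2 += eta * delta_k
    W1 -= eta * np.outer(delta_y, x); b1 += eta * delta_y

_, k = forward(x)
print(k.argmax())   # the trained net activates the target keyword
```

The threshold update has the opposite sign of the weight update because the net input is defined as W·x − Θ.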
Development of the 2nd NN for Document Determination – Vector Space Model

K(m x n) = | k11  k12  ...  k1n |
           | k21  k22  ...  k2n |
           | ...  ...  ...  ... |
           | km1  km2  ...  kmn |

K(m x n) – vector space matrix
k_kd – frequency of keyword k in document d
k – number of keywords
d – number of documents
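Building such a keyword-by-document frequency matrix is straightforward; a small self-contained sketch (the keywords and documents are invented toy data, much smaller than the paper's 20 x 90 matrix):

```python
import numpy as np

# Toy collection: 3 keywords and 4 documents (invented for illustration).
keywords = ["network", "retrieval", "ontology"]
documents = [
    "neural network network model",
    "information retrieval with a neural network",
    "ontology based retrieval",
    "domain ontology editor",
]

k, d = len(keywords), len(documents)
K = np.zeros((k, d))
for j, doc in enumerate(documents):
    words = doc.split()
    for i, kw in enumerate(keywords):
        K[i, j] = words.count(kw)   # k_kd: frequency of keyword k in document d

print(K)
```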
NN with Spreading Activation Function – Determination of Documents
[Diagram: keyword layer k1, k2, ..., kL connected to the document layer d1, d2, ..., dP through weights w11, ..., wLP; document activations net_dj.]
The SAF NN does not learn – its weights are set directly by the equation
W = K
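Because W = K, the spreading-activation step reduces to a single matrix product. A minimal sketch with an invented toy matrix (the real system uses the 20 x 90 vector space matrix):

```python
import numpy as np

# Vector space matrix K (keywords x documents); weights set by W = K.
K = np.array([[2., 1., 0., 0.],    # keyword k1 frequencies in documents d1..d4
              [0., 1., 1., 0.],    # k2
              [0., 0., 1., 1.]])   # k3
W = K                              # no training: weights come directly from K

# Keyword activations produced by the 1st NN (here: only k2 is active).
k_act = np.array([0., 1., 0.])

# Spreading activation: net_dj = sum_l w_lj * k_l for every document j.
net_d = W.T @ k_act
print(net_d)            # activation of each document
print(net_d.argmax())   # index of a most relevant document
```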
Experiments
Model of the cascade NN in Matlab:
- Query layer – 12 characters
- Keyword layer – 20 keywords
- Document layer – 90 documents
- Each document – approx. 50 words
- QTrS – 164 queries of the training set
- KwTrS – 20 keywords of the training set
- The 2nd NN is not trained
Experiments
1st experiment: QTsS1 – 185 queries; the questions from QTsS1 belong to keywords from KwTrS; precision 0.996
2nd experiment: QTsS2 – 100 queries; the questions from QTsS2 belong to no keywords from KwTrS; precision 0.97
Disadvantage of the VS Model Approach: the large dimension of the VS matrix.
Next approach – dimension reduction of the VS matrix – the Latent Semantic Model.
Latent Semantic Model – Singular Value Decomposition of the Vector Space Matrix K
K = U S V^T
U – column eigenvectors of K·K^T (left singular vectors)
V – column eigenvectors of K^T·K (right singular vectors)
S – diagonal matrix of singular values of K
dim(S) < dim(K)
VS Matrix Dimension Reduction – Truncated SVD (S_r < S)

K_r(m x n) = U(m x r) · S_r(r x r) · V^T(r x n)

k – number of singular values s_i, r < k
r – number of s_i kept after dimension reduction
The number of elements of the reduced matrices is lower than the number of elements of the matrix K.
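The truncation and the element-count comparison can be checked with NumPy's SVD routine. A sketch using the paper's dimensions (20 keywords x 90 documents) but a synthetic frequency matrix and an arbitrary choice of r:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 20, 90                       # keywords x documents, as in the paper
K = rng.integers(0, 4, (m, n)).astype(float)   # synthetic frequency matrix

U, s, Vt = np.linalg.svd(K, full_matrices=False)   # K = U·S·V^T

r = 5                               # keep r < k singular values
Ur, Sr, Vtr = U[:, :r], np.diag(s[:r]), Vt[:r, :]
Kr = Ur @ Sr @ Vtr                  # K_r(m x n) = U(m x r)·S_r(r x r)·V^T(r x n)

# The reduced factors store fewer elements than K itself:
full_elems = m * n                          # 20 * 90 = 1800
reduced_elems = m * r + r * r + r * n       # 100 + 25 + 450 = 575
print(reduced_elems < full_elems)           # True
```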
Solution of the VSM Dimension Reduction
Document relevance D is defined by
D = Q x K,
where Q is the set (matrix) of queries and K is the VS matrix.
Reduced document relevance D_r is defined by
D_r = Q x K_r,
where K_r = U·S_r·V^T is the reduced VS matrix.
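The two relevance products can be compared directly; a toy sketch in which the query matrix Q (rows = keyword-weight vectors of individual queries) and the matrix K are invented:

```python
import numpy as np

# Vector space matrix K (keywords x documents) and a query matrix Q.
K = np.array([[2., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.]])
Q = np.array([[1., 0., 0.],        # query 1 asks for keyword k1
              [0., 1., 1.]])       # query 2 asks for keywords k2 and k3

D = Q @ K                          # document relevance D = Q x K

# Reduced relevance through the truncated SVD of K.
U, s, Vt = np.linalg.svd(K, full_matrices=False)
r = 2
Kr = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
Dr = Q @ Kr                        # D_r = Q x K_r approximates D

print(np.abs(D - Dr).max())        # approximation error of the reduction
```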
Experiments
- Collection of 90 documents with 20 keywords – vector space matrix
- Dimension reduction by truncated singular value decomposition
- For each chosen number of singular values, computation of precision, recall, and the absolute and relative number of elements k_il
Evaluation of Experiments – precision, recall, number of elements of the reduced VS matrix
R – recall: R = n_retrel / n_rel
  n_retrel – number of retrieved relevant documents
  n_rel – number of relevant documents
P – precision: P = n_retrel / n_ret
  n_ret – number of retrieved documents
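Both measures follow directly from the sets of retrieved and relevant documents; a small self-contained sketch with invented document IDs:

```python
def recall(retrieved, relevant):
    # R = n_retrel / n_rel
    return len(set(retrieved) & set(relevant)) / len(set(relevant))

def precision(retrieved, relevant):
    # P = n_retrel / n_ret
    return len(set(retrieved) & set(relevant)) / len(set(retrieved))

# Hypothetical run: 5 documents retrieved, 8 relevant, 4 in the overlap.
retrieved = [1, 2, 3, 4, 5]
relevant = [2, 3, 4, 5, 6, 7, 8, 9]

print(precision(retrieved, relevant), recall(retrieved, relevant))  # 0.8 0.5
```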
Results
(s_i – number of retained singular values; Absolute/Relative – number of elements of the reduced matrices)

s_i   Precision   Recall   Absolute   Relative
 1     0.7942     0.24       110       0.632
 2     0.95       0.314      121       0.695
 3     0.95       0.405      137       0.787
 5     0.975      0.512      148       0.850
 7     0.977      0.634      161       0.925
10     1.0        0.754      165       0.948
15     1.0        0.95       173       0.994
20     1.0        1.0        174       1.0
Conclusion
As the table shows, precision and recall grow with the number of retained singular values: with 10 of the 20 singular values the precision already reaches 1.0 while the reduced matrices hold only 94.8 % of the elements of the full VS matrix; full recall is reached only with all 20 singular values.