A Chinese Information Retrieval System Using SDD
Outline: Introduction · SDD Algorithm · Compute SDD · Term Weighting · Computing the Similarity · Term Extracting · System Implementation · Experiments and Results · Web Applications · Conclusion
Introduction
People need to find information quickly and accurately today.
Search engines help people find the information they need from huge data collections.
Search engines are based on information retrieval models.
Traditional search engines cannot achieve good precision and recall in search.
The Vector Space Model (VSM) was advocated to improve precision and recall in searching.
VSM represents individual documents and queries in a collection as vectors in a multi-dimensional space.
Latent Semantic Indexing (LSI) is an improved model based on VSM.
Singular Value Decomposition (SVD) is widely used in LSI.
SVD has been used quite effectively for information retrieval.
However, SVD is much more expensive to compute for a large database collection.
We adopt a different matrix approximation called Semi-Discrete Decomposition (SDD).
SDD Algorithm
A term-document matrix $A$ is shown below. Let $m$ be the number of terms and $n$ the number of documents; the number of rows is greater than or equal to the number of columns ($m \ge n$):

$$A_{m \times n} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

The SDD of matrix $A$ of dimension $k$ is:

$$A_k = X_k D_k Y_k^T$$
We can also expand the equation as follows:

$$A_k = \sum_{i=1}^{k} d_i \, x_i \, y_i^T = \begin{pmatrix} x_1 & x_2 & \cdots & x_k \end{pmatrix} \begin{pmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & d_k \end{pmatrix} \begin{pmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_k^T \end{pmatrix}$$

where each $x_i$ is an m-vector, each $y_i$ is an n-vector, the entries of $x_i$ and $y_i$ are from the set $S = \{-1, 0, 1\}$, and $D_k$ is a diagonal matrix with entries $d_i$. This expansion is called a k-term SDD.

Since a k-term SDD needs only k floating-point numbers plus k(m+n) entries from S for storage, it is inexpensive to use quite a large number of terms.
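To make this storage saving concrete, here is a small back-of-the-envelope sketch. The dimensions m, n, k below are illustrative choices, not figures from this work, and packing each entry of S = {-1, 0, 1} into 2 bits is our assumption about the most compact encoding:

```python
# Back-of-the-envelope storage comparison for a k-term decomposition
# of an m x n term-document matrix (illustrative dimensions).
m, n, k = 10_000, 1_000, 100

# Truncated SVD: k singular values plus k dense m- and n-vectors,
# each entry an 8-byte float.
svd_bytes = 8 * k * (m + n + 1)

# SDD: k floating-point values d_i plus k(m+n) entries from
# S = {-1, 0, 1}, each storable in 2 bits (assumed packing).
sdd_bytes = 8 * k + (2 * k * (m + n)) // 8

print(f"SVD: {svd_bytes} bytes, SDD: {sdd_bytes} bytes")
```

Under these assumptions the SDD factors are more than an order of magnitude smaller than the SVD factors.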
Compute SDD
There are three steps for computing an SDD approximation:

1. Let $A_k$ be the k-term approximation and $R_k = A - A_{k-1}$ be the residual at the kth step.

2. As a subproblem, solve for the triplet $(d_k, x_k, y_k)$ that minimizes

$$F(d, x, y) = \left\| R_k - d \, x \, y^T \right\|_F^2, \qquad x \in S^m, \; y \in S^n, \; d > 0.$$

This is a mixed integer programming problem; it can be solved as below:
(a) Fix y.
(b) Solve the equation above for x and d using this y.
(c) Solve the equation above for y and d using the x from step (b).
(d) Repeat until a convergence criterion is satisfied.

3. Repeat step 2 until $i = k$.
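The alternating steps above can be sketched in Python. This is a simplified illustration, not the SDDPACK implementation: the initialization of y is a naive unit vector, and the inner solve uses the standard trick of sorting |s| descending and testing every prefix to pick the best vector over S = {-1, 0, 1}:

```python
import numpy as np

def _best_s_vector(s):
    """For fixed s, find x in {-1, 0, 1}^len(s) maximizing (x.s)^2 / (x.x):
    sort |s| descending and test every prefix length."""
    order = np.argsort(-np.abs(s))
    cum, best, best_j = 0.0, -1.0, 1
    for j, idx in enumerate(order, 1):
        cum += abs(s[idx])
        if cum * cum / j > best:
            best, best_j = cum * cum / j, j
    x = np.zeros(len(s))
    x[order[:best_j]] = np.sign(s[order[:best_j]])
    return x

def sdd(A, k, inner_its=100):
    """Greedy k-term semidiscrete decomposition A ~ X diag(d) Y^T (a sketch)."""
    m, n = A.shape
    R = A.astype(float).copy()                 # residual, starts as A
    X, Y, d = np.zeros((m, k)), np.zeros((n, k)), np.zeros(k)
    for i in range(k):
        y = np.zeros(n)
        y[i % n] = 1.0                         # naive initial y (assumption)
        for _ in range(inner_its):
            x = _best_s_vector(R @ y)          # (a)-(b): fix y, solve for x
            y_new = _best_s_vector(R.T @ x)    # (c): solve for y using this x
            if np.array_equal(y_new, y):       # (d): convergence check
                break
            y = y_new
        denom = (x @ x) * (y @ y)
        d[i] = (x @ R @ y) / denom if denom else 0.0
        X[:, i], Y[:, i] = x, y
        R -= d[i] * np.outer(x, y)             # peel off the new term
    return X, d, Y
```

Each outer iteration peels one rank-one term d_i x_i y_i^T off the residual, which is exactly the greedy loop of step 3.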
Term Weighting
In the vector space model, term weighting is very important and has a great influence on the success of a retrieval system.

For a matrix $A_{M \times N} = [a_{ij}]$, we define $a_{ij}$, the weight of term $i$ in document $j$, as:

$$a_{ij} = g_i \, t_{ij} \, d_j$$

It consists of three components: $g_i$ is the global weight of term $i$, $t_{ij}$ is the local weight of term $i$ in document $j$, and $d_j$ is a normalization factor for document $j$.
The weighting scheme is usually specified by a six-letter combination that indicates the local, global, and normalization components for the term-document matrix and the query.

We specify the weighting scheme as lxn.afx, and the weighting formulas can be calculated as follows. For the document entries:

$$a_{ij} = \begin{cases} \dfrac{1 + \log f_{ij}}{\sqrt{\sum_{k=1}^{m} \left(1 + \log f_{kj}\right)^2}} & f_{ij} > 0 \\[6pt] 0 & \text{otherwise} \end{cases}$$

For the query entries:

$$q_i = \begin{cases} \left(1 + \log f_{iq}\right) \log\!\left(\dfrac{n}{\sum_{k=1}^{n} f_{ik}}\right) & f_{iq} > 0 \\[6pt] 0 & \text{otherwise} \end{cases}$$

where $f_{ij}$ is the raw frequency of term $i$ in document $j$ and $f_{iq}$ is its frequency in the query.
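A minimal sketch of the lxn document weighting in plain Python, assuming raw term frequencies as input (an illustration, not the system's actual code; the function name is our own):

```python
import math

def lxn_document_weights(tf):
    """lxn weighting: log local weight (l), no global weight (x),
    cosine normalization (n). tf[i][j] is the raw frequency of
    term i in document j."""
    m, n = len(tf), len(tf[0])
    weights = [[0.0] * n for _ in range(m)]
    for j in range(n):
        # local weight 1 + log(f_ij) for terms present in document j
        col = [1.0 + math.log(tf[i][j]) if tf[i][j] > 0 else 0.0
               for i in range(m)]
        # cosine normalization over the document column
        norm = math.sqrt(sum(w * w for w in col))
        for i in range(m):
            weights[i][j] = col[i] / norm if norm else 0.0
    return weights
```

After this step each document column has unit Euclidean length, so document length no longer biases the similarity scores.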
Computing the Similarity
The similarity between a document and the query vector is calculated by the cosine coefficient. Below is the formula used to compute the similarity:

$$\cos(q, d) = \frac{\sum_{i=1}^{n} q_i \, d_i}{\sqrt{\sum_{i=1}^{n} q_i^2} \, \sqrt{\sum_{i=1}^{n} d_i^2}}$$

The documents can then be arranged in descending order of similarity, and the number of documents retrieved can be limited.
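The cosine coefficient and the descending-order ranking can be sketched as follows (plain Python; the function names are our own, for illustration):

```python
import math

def cosine(q, d):
    """Cosine coefficient between query vector q and document vector d."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

def rank_documents(query, doc_vectors, limit=10):
    """Arrange documents in descending order of similarity to the query,
    keeping at most `limit` results as (similarity, doc_index) pairs."""
    scored = [(cosine(query, d), j) for j, d in enumerate(doc_vectors)]
    scored.sort(reverse=True)
    return scored[:limit]
```

The `limit` parameter corresponds to restricting the number of documents retrieved, as described above.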
Term Extracting
The LSI model makes it easy to use corpora in different languages to accomplish cross-language retrieval.
We chose a morphological analysis system called ChaSen (茶筅) to extract Chinese words from the documents in our system.
We need a good dictionary to segment Chinese words correctly.
Figure 1: Chinese morphological analyzer
System Implementation

Below is an illustration of the working mechanism of our SDD information retrieval system:
- Documents Collection → Document Vectors → SDD Computation
- Dictionary Vectors
- Query string → Query Vector
- Rank relevant documents in descending order of similarity
We implement the SDD information retrieval system as follows:

1. Segment the terms from the document collection.
2. Create the term-document matrix in MatrixMarket Coordinate Format.
3. Use SDDPACK to compute the term-document matrix decomposition. The command is:

   $ decomp -k 200 -y -b 4 term-doc.mtx term-doc.sdd

4. Rank the relevant documents.
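Step 2 above can be sketched as a small writer for the MatrixMarket coordinate format (a simplified illustration matching the header style shown in Figure 2; entries are 1-indexed and zeros are skipped):

```python
def write_matrix_market(path, matrix):
    """Write a dense term-document matrix (list of rows) in
    MatrixMarket coordinate format: a header line, a size line
    'rows cols nonzeros', then one '1-indexed row col value' per entry."""
    rows, cols = len(matrix), len(matrix[0])
    entries = [(i + 1, j + 1, v)
               for i, row in enumerate(matrix)
               for j, v in enumerate(row) if v != 0.0]
    with open(path, "w") as f:
        f.write("%%MatrixMarket matrix coordinate real general\n")
        f.write(f"{rows} {cols} {len(entries)}\n")
        for i, j, v in entries:
            f.write(f"{i} {j} {v:.6e}\n")
```

The resulting file can then be handed to the `decomp` tool as `term-doc.mtx`.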
Figure 2. 8 x 6 matrix output:

%%MatrixMarket matrix coordinate real general
8 6 20
1 1 4.110885e-01
2 1 3.692579e-01
3 1 4.464557e-01
4 1 3.692579e-01
5 1 4.110885e-01
6 1 1.590307e-01
7 1 3.180615e-01
8 1 2.520578e-01
3 2 1.000000e+00
3 3 7.263057e-01
5 3 6.873719e-01
3 4 3.162278e-01
5 4 9.486833e-01
3 5 6.666667e-01
5 5 6.666667e-01
8 5 3.333333e-01
1 6 5.000000e-01
3 6 5.000000e-01

Figure 3. SDD output:

%% Semidiscrete Decomposition (SDD)
%% Matrix: sdddata/matrix Terms: 5 Accr: 0.00e+00 Tol: 1.00e-02 InnIts: 100 Init: 1
5 8 6
5.7245558500289916992187500e-01
2.7197235822677612304687500e-01
4.0811389684677124023437500e-01
2.5439447164535522460937500e-01
1.3001415133476257324218750e-01
0 0 1 0 1 0 0 0
1 1 0 1 0 0 1 1
0 0 1 0 -1 0 0 0
1 -1 0 -1 0 0 -1 1
1 1 -1 1 -1 1 0 0
1 1 1 1 1 1
1 0 0 0 0 1
0 1 0 -1 0 0
0 0 0 0 0 1
1 0 0 0 0 0
Experiments and Results
We selected a small data set of only 100 documents for a test. The data are Chinese text documents taken from the web pages of Chinese Agricultural University.
To compare the performance of SDD and SVD, we computed the matrix decomposition using both methods.
We then created the query vector, computed the similarity, and ranked the documents in descending order of similarity.
Figure 4. Top ten entries in SVD:

Rank  Doc  Similarity
1     22   0.845110
2     3    0.186189
3     49   0.164403
4     58   0.157444
5     56   0.148811
6     1    0.139891
7     9    0.105001
8     69   0.067863
9     31   0.057763
10    23   0.056919

Figure 5. Top ten entries in SDD:

Rank  Doc  Similarity
1     22   0.741998
2     1    0.401568
3     49   0.399059
4     3    0.398177
5     58   0.397571
6     23   0.396199
7     9    0.394032
8     12   0.391085
9     14   0.389590
10    31   0.380483
Web Applications
We developed a web-based application for the presentation of this Chinese information retrieval system.
Visit the site at http://pc110.narc.affrc.go.jp/Chinese/.
We also developed a Japanese system using the SDD-based VSM. Its web interface is at http://pc110.narc.affrc.go.jp/AgrInfo/.
Conclusion
We presented a Chinese information retrieval system using SDD.
SDD has a clear advantage in saving computer storage resources.
SDD is easy to implement for a big data collection.
SDD makes it easy to accomplish cross-language retrieval.
SDD has almost the same retrieval performance as SVD.
Thank you!