
Calculating Statistical Similarity between Sentences

Junsheng Zhang (1), Yunchuan Sun (2, corresponding author), Huilin Wang (3), Yanqing He (4)

(1, 3, 4) IT Support Center, Institute of Scientific and Technical Information of China, Beijing, China, 100038, [email protected], [email protected], [email protected]
(2) School of Economics and Business Administration, Beijing Normal University, Beijing, China, 100875, [email protected]

Journal of Convergence Information Technology, Volume 6, Number 2, February 2011
doi: 10.4156/jcit.vol6.issue2.3

Abstract

Sentence similarity plays an important role in text-related research and applications. It is closely related to word similarity and document similarity. Statistical similarity measures between sentences, based on symbolic characteristics and structural information, can measure the similarity between sentences using only the statistical information of the sentences, without any prior knowledge. This paper presents several approaches to calculating statistical similarity between sentences and evaluates them on a test corpus of 40 sentences. These measures can be used in short-text applications such as corpus construction and title/abstract based document recommendation. The evaluation results show the differences between these measures.

Keywords: Sentence Similarity, Statistics, Calculation

1. Introduction

Sentence similarity plays an important role in many text-related research areas and applications such as information retrieval, information recommendation, natural language processing, machine translation and translation memory. Calculating the similarity between sentences is the basis of measuring the similarity between texts, which is the key to document classification and clustering. Sentence similarity is also one of the key issues in sentence alignment, sentence clustering and question answering.

Natural languages have different granularities of meaning such as word, phrase, sentence and document. A word can be regarded as the minimum meaning unit in natural language, while a sentence can be regarded as the minimum unit that communicates a complete meaning. Separators such as the period and the comma split word sequences into meaningful chunks. A sequence of words and separators formulates a sentence.

Accordingly, there are several levels of similarity in natural languages. Words can be classified into synonyms and antonyms based on the similarity between words and phrases. Similarities between documents are the basis of text classification and clustering. Sentence similarity lies between word similarity and document similarity, and the calculation methods vary across these levels.

Word similarity. Similarity between words can be calculated from the spelling of words or from the meaning of words. Edit distance can be used to measure the similarity between words from their spelling; intuitively, if two words are similar in spelling, they are possibly similar in meaning. A lexical dictionary such as WordNet can be used to calculate the meaning similarity between words, as WordNet::Similarity does [2].
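For concreteness, a minimal Python sketch of the Levenshtein edit distance between two words is given below; normalizing the distance by the longer word length to obtain a similarity in [0, 1] is one common choice, not something prescribed by this paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of character insertions, deletions and substitutions
    needed to turn string a into string b (standard dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def word_similarity(a: str, b: str) -> float:
    """One common normalization of edit distance to a similarity in [0, 1]
    (an illustrative choice, not the paper's definition)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("That", "This"))      # 2
print(levenshtein("old", "new"))        # 3
print(word_similarity("That", "This"))  # 0.5
```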

Sentence similarity. The similarities between words in different sentences have a great influence on the similarity between two sentences. Words and their orders in the sentences are two important factors in calculating sentence similarity.

Document similarity. The similarity between sentences has a great influence on the similarity between documents. Commonly used approaches are often based on the similarity between keyword sets (e.g., Dice similarity) or between keyword vectors (e.g., cosine similarity). These methods seldom consider the semantic meaning of words.

Sentences are the intermediate level of meaning units between words and documents; they connect words and documents in the meaning space of natural language. Most existing approaches ignore sentence similarity because of its time cost. However, similarity at the sentence level is interesting and important. Besides, the study of sentence similarity can enhance word similarity research, especially statistical approaches based on large corpora.

In this paper, we focus on measures of sentence similarity based on symbols. We propose two measures of sentence similarity based on word orders and word distances. Together with sentence similarities based on word sets, word vectors and edit distance, we use six approaches to calculating statistical similarity between sentences. We then evaluate and compare the six approaches on a corpus of 40 sentences selected from the NIST05 BLEU corpus. The evaluation results show that different measures of sentence similarity capture different statistical characteristics of sentences. These calculation approaches for sentence similarity based on statistical characteristics are helpful for natural language processing applications.

2. Related Work

Similarity analysis between long texts has been widely used in document retrieval. It mainly considers the statistical information of keywords in long texts. Keywords are generally selected using weight-assignment schemes such as TF-IDF [12].

Sentence similarity has been used in text-related knowledge representation and discovery, such as machine translation [13], translation memory, text summarization [4], text categorization, question answering and even image search on the Web [3]. Similarity analysis between sentences differs from that between long texts in its statistical information; weight-assignment approaches such as TF-IDF are not suitable for similarity analysis between sentences.

A corpus is needed to calculate sentence similarity statistically. In [6, 7], a method is presented for measuring the semantic similarity of texts using a corpus-based measure of semantic word similarity and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm.

Sentence similarity partially depends on word similarity. Ref. [11] investigates several methods for computing word similarity in two categories: dictionary-based methods using WordNet, Roget's thesaurus or other resources, and corpus-based methods using frequencies of co-occurrence in corpora (the cosine method, latent semantic indexing, mutual information, etc.). These methods can be used in several fields, such as solving TOEFL-style synonym questions, detecting words that do not fit their context in order to detect speech recognition errors, and synonym choice in context for writing-aid tools. Ref. [14] presents an approach for measuring the semantic relatedness between words based on implicit semantic links. The authors introduced Omiotis, a measure of semantic relatedness between texts, which capitalizes on the word-to-word semantic relatedness measure and extends it to measure the relatedness between texts.

There are three categories of similarity calculation between texts: word co-occurrence methods, corpus-based methods, and methods based on descriptive features [10]. Word co-occurrence methods calculate similarity with vectors, and corpus-based text similarity can find the semantic associations between words. However, high-dimensional and sparse sentence vectors lead to poor performance in similarity calculation. A method for measuring text semantic similarity based on corpus and knowledge is discussed in [11], and the experiments show that semantic similarity outperforms methods based on simple lexical matching. Word-to-word similarity based on prior knowledge was discussed in [1, 9], and pointwise mutual information [15] and latent semantic analysis [8] are used to measure word-to-word similarity. Ref. [18] presents vector-based models for semantic composition. A Chinese similarity computing system model was proposed for Chinese sentence similarity computation [19], a modified method was proposed for concept similarity calculation [20], and Ref. [21] designed dissimilarity measures to increase their utility for instance-based learning methods.

In this paper, sentence similarity analysis depends on the word similarity and the structural information of sentences. Here, word similarity considers only the spellings and ignores the semantic meanings of words. Structural information considers the orders of words and the distances between words, and ignores the syntactic information of sentences.


3. Sentence Similarity Measures

In this paper, a sentence is defined as a sequence of words and separators, denoted by s. In the preprocessing phase, a sentence is segmented into words and phrases; this is especially necessary for Chinese sentences, which have no blanks between neighboring words. Separators are also important parts of the representation of the semantics of a sentence. The length of a sentence s, denoted by |s|, is defined as the number of words and separators in s. There are three types of sentence similarity measures:

Symbolic similarity. If two sentences are more similar, then the words in the two sentences are more similar, and vice versa. Here, words in two sentences being similar means that the words are similar in symbols or in semantics.

Semantic similarity. Two sentences with different symbolic and structural information can convey the same or similar meaning. The semantic similarity of sentences is based on the meanings of the words and the syntax of the sentences.

Structural similarity. If two sentences are similar, the structural relations between their words are similar, and vice versa. Structural relations include the relations between words and the distances between words. If the structures of two sentences are similar, they are more likely to convey similar meanings.

Words formulate sentences, while sentences formulate documents. Much research focuses on word similarity and document similarity.

Word similarity has two types: symbolic similarity and semantic similarity. The symbolic similarity of words can be calculated by edit distance, and the semantic similarity of words can be calculated by WordNet::similarity.

Document similarity is often calculated with the Vector Space Model (VSM). Documents are represented by bags of words, and the meanings of documents are represented by vectors. Document similarity can then be calculated using the cosine of the vectors. If the weights of words are ignored, document similarity can be calculated on the sets of keywords using Dice similarity or the Jaccard coefficient.

Sentence similarity is closely related to word similarity and document similarity. Words are the components of sentences, while sentences are the components of documents. If the words in two sentences are similar, the two sentences are more likely to be similar. If the sentences in two documents are similar, the two documents are more likely to be similar.

Sentence similarity is partially based on word similarity, but the relations between words should also be considered. However, word similarity cannot replace sentence similarity: word similarity reflects the closeness of two discrete words or concepts, while sentence similarity reflects the closeness of two sequences of words and separators, and the meaning of a sentence is closely related to the order of its words.

The similarity of two documents depends on the similarity of their sentences. Document similarity is calculated with document vectors, and the VSM is efficient for calculating document similarities in large-scale document sets. To understand the meanings of documents more deeply, it is necessary to concentrate on the similarities of sentences.

Semantic sentence similarity is difficult to calculate on a large-scale corpus, because no taxonomy covers all domains. Statistical similarity between sentences considers only the words in the two sentences, without any prior knowledge such as a lexical dictionary or syntactic parsing. The cost of calculating statistical similarity is lower than that of calculating semantic similarity. Statistical similarity combines symbolic characteristics (such as word sets and vectors) and structural characteristics (such as word orders and distances).

4. Calculating Statistical Similarity between Sentences

In this section, we present six measures of statistical similarity between sentences. Sentence similarity based on word sets is calculated from the sets of words of the two sentences.


Sentence similarity based on vectors is calculated with vectors representing the two sentences. The weights of words are assigned in two ways: one is to assign the weights of words uniformly; the other is to assign the weights of words by the TF-IDF approach.

Sentence similarity based on edit distance is calculated from the edit distance between the two sentences.

Sentence similarity based on word order is calculated with the orders between word pairs in the sentences.

Sentence similarity based on word distance is calculated from the distances between word pairs within each sentence.

The former four sentence similarity measures are symbolic similarities, while the latter two are structural similarities. Symbolic similarity between sentences considers only the spelling of words, ignoring their meanings. The symbolic similarity of words has two forms: the set of words and the bag of words. The set of words represents the meaning of a sentence as a word set, while the bag of words can be transformed into a vector that represents the meaning of the sentence. The structural information of a sentence includes word orders, word distances and the syntactic structure of the sentence. Here, we only consider word orders and word distances.

Before calculating statistical similarity between sentences, suppose sa is a sentence of length m (m ≥ 2) and sb is a sentence of length n (n ≥ 2):

$$s_a = w_{a1} w_{a2} w_{a3} \dots w_{am} \quad (m \geq 2)$$

$$s_b = w_{b1} w_{b2} w_{b3} \dots w_{bn} \quad (n \geq 2)$$

where wai (i ∈ [1, m]) and wbj (j ∈ [1, n]) are the words or separators in sa and sb. Let w(sa) be the set containing all the words wai (i ∈ [1, m]), and w(sb) the set containing all the words wbj (j ∈ [1, n]).

To distinguish the different sentence similarities, we use two simple exemplary sentences as follows.

sa = That is the old file .

sb = This is the new file .

4.1. Sentence Similarity based on Word Set

To calculate sentence similarity based on word sets, the word sets of the sentences should be constructed first. Since sentences may have different tenses and voices, there are two ways to calculate word-based sentence similarity: one is to calculate it with the original words of the sentences; the other is to calculate it with the stemmed words. Here, we choose the original words, because stemming would lose the tense and voice information of the sentence.

After the word sets of the two sentences are formulated, the Jaccard similarity between the sentences can be calculated by

$$sim_{Jaccard}(s_a, s_b) = \frac{|w(s_a) \cap w(s_b)|}{|w(s_a) \cup w(s_b)|}.$$

Another similarity based on the word set can be calculated by the Dice similarity:

$$sim_{Dice}(s_a, s_b) = \frac{2\,|w(s_a) \cap w(s_b)|}{|w(s_a)| + |w(s_b)|}.$$

The word sets of the two example sentences are

w(sa) = {"That", "is", "the", "old", "file", "."}
w(sb) = {"This", "is", "the", "new", "file", "."}

with |w(sa) ∩ w(sb)| = 4 and |w(sa) ∪ w(sb)| = 8, so the word-set-based sentence similarities are

$$sim_{Jaccard}(s_a, s_b) = \frac{4}{8} = \frac{1}{2}, \qquad sim_{Dice}(s_a, s_b) = \frac{2 \cdot 4}{6 + 6} = \frac{2}{3}.$$
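The set-based measures can be sketched in a few lines of Python; the whitespace tokenization below relies on the example sentences already being tokenized and is an assumption for illustration.

```python
def word_set(sentence: str) -> set:
    # The example sentences are already tokenized, so splitting on
    # whitespace yields the words and separators.
    return set(sentence.split())

sa = "That is the old file ."
sb = "This is the new file ."

A, B = word_set(sa), word_set(sb)
inter, union = A & B, A | B

jaccard = len(inter) / len(union)             # 4 / 8  = 0.5
dice    = 2 * len(inter) / (len(A) + len(B))  # 8 / 12 ≈ 0.667

print(inter)            # {'is', 'the', 'file', '.'}
print(jaccard, dice)
```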

4.2. Sentence Similarity based on Word Vector

Words in a sentence have different semantic roles, so the weights of words should be different. In document similarity calculation, the weights of words can be computed by TF-IDF [12]. However, it is difficult to use TF-IDF to represent the weights of words in a single sentence, because the term frequency is 1 for almost all words, so a different approach to assigning weights is needed. The semantic roles of words can be analyzed by NLP techniques, and different roles of words can receive different weights [5]. We adopt two strategies for vector-based sentence similarity: one is to assign the weights of words uniformly based on the number of words and separators; the other is to assign the weights of words by TF-IDF based on a corpus.
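A minimal sketch of the two weighting strategies follows; the toy corpus and the particular smoothed-IDF variant are assumptions for illustration, not the paper's exact formulation.

```python
import math
from collections import Counter

def uniform_weights(tokens):
    """Give every word/separator in the sentence an equal share of weight."""
    n = len(tokens)
    return {w: c / n for w, c in Counter(tokens).items()}

def tfidf_weights(tokens, corpus):
    """TF-IDF weights for one sentence, with document frequencies taken
    from a reference corpus of tokenized sentences (illustrative variant)."""
    tf = Counter(tokens)
    n_docs = len(corpus)
    weights = {}
    for w, f in tf.items():
        df = sum(1 for doc in corpus if w in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF, one common choice
        weights[w] = f * idf
    return weights

corpus = [s.split() for s in [
    "That is the old file .",
    "This is the new file .",
    "the file is old .",            # toy extra sentence, an assumption
]]
print(uniform_weights(corpus[0]))
print(tfidf_weights(corpus[0], corpus))
```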

To calculate sentence similarity based on word vectors, the word vectors of the sentences should be constructed first. If the words in w(sa) and w(sb) are assigned weights, sa and sb can be represented by bags of words:

$$s_a = \{(w_{a1}, t_{a1}), (w_{a2}, t_{a2}), \dots, (w_{am}, t_{am})\}$$

$$s_b = \{(w_{b1}, t_{b1}), (w_{b2}, t_{b2}), \dots, (w_{bn}, t_{bn})\}$$

where t denotes the weight of a word. The cosine similarity between the sentences is then calculated over the vectors built from these weighted words, aligned over the union of the words of the two sentences:

$$sim_{cos}(s_a, s_b) = \frac{\sum_{i} t_{ai}\, t_{bi}}{\sqrt{\sum_{i} t_{ai}^2}\,\sqrt{\sum_{i} t_{bi}^2}}$$

Coming to the two exemplary sentences, the vectors of the two sentences are as follows:

sa = {("That", 1), ("is", 1), ("the", 1), ("old", 1), ("file", 1), (".", 1)}
sb = {("This", 1), ("is", 1), ("the", 1), ("new", 1), ("file", 1), (".", 1)}

We set the weight of every word in the sentences to 1. If a word occurs two or more times in one sentence, the weights of its occurrences are accumulated. The sentence similarity based on these vectors is

$$sim_{cos}(s_a, s_b) = \frac{4}{\sqrt{6}\,\sqrt{6}} = \frac{2}{3}.$$
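A minimal sketch of the vector-based similarity with unit weights follows, reproducing the value 2/3 for the example sentences; weights of repeated words are accumulated as described above.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the bag-of-words vectors of two sentences.
    Each occurrence of a word adds 1 to its weight."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb)

sa = "That is the old file .".split()
sb = "This is the new file .".split()
print(cosine_similarity(sa, sb))  # 4 / 6 ≈ 0.667, i.e. 2/3
```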

Sentence similarity based on word vectors considers the words and their weights in the two sentences. Vector-based similarity is popular in information retrieval. However, it does not consider the orders of and distances between words. It is also a symbolic similarity, in which different word weights distinguish the importance of words.

Stop words are words that do not carry important meaning in a sentence and occur in many different sentences, so they are less important for distinguishing the meanings of sentences. For long texts, stop words are usually filtered out when formulating the word sets and vectors, because too many stop words become noise in the similarity calculation. However, the stop words in sentences are important, for they imply the structural information of the sentences. If all stop words are eliminated from a sentence, part of the structural information is lost in the calculation of sentence similarity.
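To illustrate the effect, the sketch below compares the word-set similarity of the example sentences with and without a small stop-word list; the list itself is a hypothetical example, not the paper's.

```python
# A tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "that", "this", "."}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

sa = set("That is the old file .".lower().split())
sb = set("This is the new file .".lower().split())

print(jaccard(sa, sb))                            # with stop words kept: 0.5
print(jaccard(sa - STOP_WORDS, sb - STOP_WORDS))  # after removal: 1/3, structure information is lost
```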

4.3. Sentence Similarity based on Edit Distance

According to the spellings of the words in two sentences, the edit distance between the two sentences can be calculated. There are several kinds of edit distance: Hamming distance, Levenshtein distance, Damerau-Levenshtein distance, Jaro-Winkler distance, Wagner-Fischer edit distance, and the Ukkonen and Hirschberg algorithms. We choose the Levenshtein distance as the edit distance between sentences.

Definition 4.1 (Levenshtein Edit Distance). The edit distance of two strings is the minimal cost of a sequence of symbol insertions, deletions, or substitutions transforming one string into the other.

Coming to the two exemplary sentences sa and sb, the edit distance between sa and sb is 5: there are 2 substitutions from "That" to "This", and 3 substitutions from "old" to "new". We calculate the sentence similarity based on the edit distance by

$$sim_{edit}(s_a, s_b) = \frac{1}{1 + dist_{edit}(s_a, s_b)}$$

so that the sentence similarity values are mapped onto the interval (0, 1]. Edit-distance-based similarity is also a symbolic similarity, and it is popular for calculating the similarity of sequences such as strings, languages and biological sequences.

4.4. Word Order based Sentence Similarity

According to the positions of words in a sentence, the order relations between word pairs, such as before and after, can be established. Figure 1 shows the sequential network of a sentence.

Figure 1. The sequential network of a sentence

The sequential relations between words formulate a sequential network of words. The sequential network can be used to discover frequent patterns. The commonly used n-gram language model uses the adjacent n words to describe the sequential characteristics of words, while the sequential network shows sequential relations beyond n, and the distances between words vary from 1 to |s| - 1.

$$L(s_a) = \{(w_{ax}, w_{ay}) \mid 1 \leq x < y \leq m\}$$

$$L(s_b) = \{(w_{bp}, w_{bq}) \mid 1 \leq p < q \leq n\}$$

where (wx, wy) ∈ L(sa) ∪ L(sb) means that wx appears before wy. The similarity between sa and sb can be calculated based on the orders of words by

$$sim_{order}(s_a, s_b) = \frac{|L(s_a) \cap L(s_b)|}{|L(s_a) \cup L(s_b)|}$$

Coming to the two exemplary sentences, the sequential relations between the words of the two sentences are listed in Table 1.

Table 1. Sequential relations between words in the two example sentences

L(sa): (That, is), (That, the), (That, old), (That, file), (That, .), (is, the), (is, old), (is, file), (is, .), (the, old), (the, file), (the, .), (old, file), (old, .), (file, .)

L(sb): (This, is), (This, the), (This, new), (This, file), (This, .), (is, the), (is, new), (is, file), (is, .), (the, new), (the, file), (the, .), (new, file), (new, .), (file, .)

And the similarity between sa and sb based on word orders is

$$sim_{order}(s_a, s_b) = \frac{6}{24} = \frac{1}{4}.$$

Sequential relations between words are instances of connections between words; they are also connection characteristics of words.
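A minimal sketch of the word-order measure for the example sentences follows; it enumerates the ordered word pairs of each sentence and reproduces the value 1/4. For sentences with repeated words, positions rather than surface forms would have to be tracked; that refinement is omitted here.

```python
from itertools import combinations

def sequential_relations(tokens):
    """All ordered word pairs (wx, wy) such that wx occurs before wy.
    Assumes no repeated words, as in the example sentences."""
    return set(combinations(tokens, 2))

def order_similarity(tokens_a, tokens_b):
    la, lb = sequential_relations(tokens_a), sequential_relations(tokens_b)
    return len(la & lb) / len(la | lb)

sa = "That is the old file .".split()
sb = "This is the new file .".split()
print(order_similarity(sa, sb))  # 6 / 24 = 0.25
```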

4.5. Word Distance based Sentence Similarity

To consider the distance between word pairs, it is necessary to calculate the distances between each pair of words in the same sentence. The distance between two words takes the form of (w1, w2, d), where w1 and w2 are two words and d is the distance between w1 and w2. Figure 2 shows the distance network between the words in sentence sa.

Figure 2. The distance network of a sentence

The list of distances between the word pairs in the same sentence is as follows:

$$D(s_a) = \{(w_{ax}, w_{ay}, d_a(w_{ax}, w_{ay})) \mid 1 \leq x < y \leq m\}$$

$$D(s_b) = \{(w_{bp}, w_{bq}, d_b(w_{bp}, w_{bq})) \mid 1 \leq p < q \leq n\}$$

Let W(a, b) = {(wax, way) | 1 ≤ x ≤ y ≤ i} ∩ {(wbp, wbq) | 1 ≤ p ≤ q ≤ j}, where wai = wbi, waj = wbj and 1 ≤ i, j ≤ |W(a, b)|, be the set of word pairs shared by the two sentences; the similarity between sa and sb is then calculated from the distances of these shared word pairs in the two sentences. For the two exemplary sentences, the similarity between sa and sb based on the distances between words is

$$sim_{dist}(s_a, s_b) = \frac{40}{141}.$$

The meaning of a sentence is relevant not only to the orders of words but also to the distances between words. The distances between words are also structural characteristics of sentences.
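As a sketch of the representation described above, the code below builds the list of (w1, w2, d) triples of a sentence, taking the distance d to be the difference of the two token positions, and collects the distances of the word pairs shared by two sentences; the final combination of these distances into a single score is not shown here.

```python
from itertools import combinations

def distance_list(tokens):
    """Word-pair distances (w1, w2, d) of a sentence, where d is the
    difference of the two token positions (1 for adjacent words)."""
    return [(tokens[i], tokens[j], j - i)
            for i, j in combinations(range(len(tokens)), 2)]

def shared_pair_distances(tokens_a, tokens_b):
    """Distances of the word pairs that occur (in the same order) in both
    sentences -- the shared pairs W(a, b) described above."""
    da = {(w1, w2): d for w1, w2, d in distance_list(tokens_a)}
    db = {(w1, w2): d for w1, w2, d in distance_list(tokens_b)}
    return {p: (da[p], db[p]) for p in da.keys() & db.keys()}

sa = "That is the old file .".split()
sb = "This is the new file .".split()
print(distance_list(sa))
print(shared_pair_distances(sa, sb))
```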

5. Evaluation and Discussion

5.1. Evaluation

During translation between different natural languages, a source sentence may be translated into different sentences in the target language, and these target sentences are semantically similar in meaning. We use the different sentence similarity methods to find the top-k statistically similar sentences.

We choose 40 sentences as a testing corpus to compare the different statistical similarity measures. The 40 sentences belong to 10 groups, and each group contains 4 sentences with similar meaning. These sentences are selected from the NIST05 corpus for the BLEU test in machine translation. The sentence groups are listed in the Appendix.

Based on the sentence similarity measures, we can compute the precision rate and recall rate, and the F-measure is used to evaluate the sentence similarity by considering both. Sentence similarity is used to find the top-4 similar sentences for each sentence in the corpus. The fetched sentences are compared with the sentences in the same group, and then the precision rate, recall rate and F-measure can be calculated.

The formulas to calculate precision rate p, recall rate r and F-measure F are as follows.

Let S_f denote the set of fetched sentences and S_g the set of sentences in the same group:

$$p = \frac{|S_f \cap S_g|}{|S_f|}$$

$$r = \frac{|S_f \cap S_g|}{|S_g|}$$

$$F = \frac{2\,p\,r}{p + r}$$

We choose the top-4 most similar sentences for each sentence according to each of the sentence similarity measures; the top-4 sentences contain the sentence itself. We compare the fetched 4-sentence set with the original 4-sentence group, and the precision rate, recall rate and F-measure are calculated for each of the 40 fetched sets. We then take the averages as the precision rate, recall rate and F-measure of the sentence similarity measure on the test corpus.
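A minimal sketch of this evaluation protocol follows; the similarity function is passed in as a parameter, and the commented usage lines (file name, group layout) are hypothetical placeholders, not the paper's artifacts.

```python
def evaluate(sentences, groups, similarity, k=4):
    """Average precision, recall and F-measure of top-k retrieval.

    sentences:  list of tokenized sentences
    groups:     groups[i] is the group id of sentences[i]
    similarity: function of two tokenized sentences returning a float
    """
    p_sum = r_sum = f_sum = 0.0
    n = len(sentences)
    for i, s in enumerate(sentences):
        # Rank all sentences (including s itself) by similarity to s.
        ranked = sorted(range(n), key=lambda j: similarity(s, sentences[j]), reverse=True)
        fetched = set(ranked[:k])
        relevant = {j for j in range(n) if groups[j] == groups[i]}
        tp = len(fetched & relevant)
        p = tp / len(fetched)
        r = tp / len(relevant)
        f = 2 * p * r / (p + r) if p + r else 0.0
        p_sum += p
        r_sum += r
        f_sum += f
    return p_sum / n, r_sum / n, f_sum / n

# Hypothetical usage with the word-set Jaccard similarity sketched earlier:
# corpus = [line.split() for line in open("nist05_sample.txt")]  # placeholder file name
# groups = [i // 4 for i in range(len(corpus))]                  # 10 groups of 4 sentences
# print(evaluate(corpus, groups, lambda a, b: len(set(a) & set(b)) / len(set(a) | set(b))))
```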

Table 2 shows the experimental results of the different measures of sentence similarity, and Figure 3 shows the precision rate, recall rate and F-measure of the different measures. The evaluation results show that sentence similarity based on word sets and sentence similarity based on word order perform better than the other sentence similarity measures. Sentence similarity based on edit distance has the lowest precision rate, recall rate and F-measure, and sentence similarity based on the TF-IDF weighted vector also has a low precision rate, recall rate and F-measure. Although the evaluation result is closely tied to the test corpus, it still shows the differences between the sentence similarity measures. The evaluation results can be explained as follows:

Sentence similarity based on word sets and sentence similarity based on word order capture more local information of the sentence pairs. The statistical similarity between two sentences depends more heavily on the local information of the two sentences than on statistical information acquired from the whole corpus.

Sentence similarity based on edit distance only considers insertions, deletions and substitutions of characters and separators. It can hardly capture the meanings of words, so it has the worst performance in the evaluation.

The word orders between word pairs are more important than the distances between word pairs in calculating sentence similarity.

Sentence similarity based on the TF-IDF weighted vector performs worse than sentence similarity based on the uniformly (averagely) weighted vector. This means that the local information of the two sentences is more important than the global information from the corpus.

Table 2. Precision rate, recall rate and F-measure of the six statistical similarity measures between sentences on the test corpus

Similarity based on        precision rate   recall rate   F-measure
word set                   1                1             1
average weighted vector    0.9797           0.9797        0.9797
edit distance              0.8786           0.8786        0.8786
word order                 0.9932           0.9932        0.9932
word distance              0.9730           0.9730        0.9730
TF-IDF weighted vector     0.8952           0.8952        0.8952


Figure 3. Precision rate, recall rate and F-measure of the different sentence similarity measures on the test corpus

The pairwise sentence similarities for the 40 sentences are also calculated. Figure 4 shows the gray color maps of the pairwise similarity between sentences calculated by the different sentence similarity calculating methods. Each sub-figure is symmetric, and there are seven parallel bright lines in these sub-figures which show the information of the similar sentence groups. The longest bright line contains 40 grids representing the similarity of the 40 sentences to themselves, while the other bright lines show the similar sentences which are in the same sentence groups. According to the gray color maps, word set based sentence similarity, word order based sentence similarity, word distance based sentence similarity and TF-IDF weighted vector based similarity have clear bright lines; however, the bright lines of average weighted vector based similarity and edit distance based sentence similarity are not clear.
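Such a gray color map can be produced, for example, with matplotlib as sketched below; the tiny placeholder corpus and the word-set similarity used here are illustrative assumptions, not the paper's 40-sentence corpus.

```python
import numpy as np
import matplotlib.pyplot as plt

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Placeholder corpus: in the paper this is the 40-sentence test corpus.
sentences = [s.split() for s in [
    "That is the old file .",
    "This is the new file .",
    "the file is old .",
    "meriam abandoned her husband a year ago .",
]]

n = len(sentences)
sim = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        sim[i, j] = jaccard(sentences[i], sentences[j])

plt.imshow(sim, cmap="gray", vmin=0.0, vmax=1.0)  # brighter = more similar
plt.colorbar()
plt.title("Pairwise sentence similarity (word set)")
plt.show()
```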

5.2. Discussion

Similarity is a cognitive concept representing the similarity between two objects in attributes (such as size, color and texture) or in semantics in the concept space. Similarity between two objects is close to the cognition of human beings, and the degree of similarity between two objects may differ from person to person. Similarities in natural languages play an important role in information access, such as organization, retrieval and recommendation.

Figure 4. Gray color maps of the pairwise similarity between sentences calculated by the different similarity measures. In each sub-figure, the color of a small grid shows the similarity between the two sentences on the x-axis and y-axis; the brighter the color, the more similar the two sentences are. (a) sentence similarity based on word set; (b) sentence similarity based on average weighted vector; (c) sentence similarity based on edit distance; (d) sentence similarity based on word order; (e) sentence similarity based on word distance; (f) sentence similarity based on TF-IDF weighted vector.

Approaches to calculating similarity between long texts cannot be used to calculate sentence similarity, for the following reasons:

Similarity between long texts is calculated based on vectors, which come from bags of words. However, in two sentences the keywords are hard to choose using TF-IDF: most words in the two sentences occur only a few times, and some words occur only once, so the weights of words are hard to assign. The words in the vectors representing the two sentences formulate a word set containing all the words of the two sentences. If the words in the two sentences are very different, the two vectors representing the two sentences are sparse, which leads to very small similarity values.

During the similarity calculation, function words such as "the", "a", "and" and "of" are ignored as stop words. However, these function words play important roles in sentences by carrying the syntactic information of the sentences, so the syntactic information of sentences should incorporate the function words. This is different from the vector-based similarity that assigns weights with TF-IDF.


Sentence semantic similarity can be calculated by considering the semantics of words. Sentences with similar meaning may not share common words, and it is difficult to detect the semantic similarity between such sentences using only lexical information and statistical information about words.

Sentence similarity can be used to cluster sentences for syntactic patterns [16] and to analyze patterns in languages [17].

6. Conclusion

Sentence similarity is important in information organization and retrieval. It is closely related to both word similarity and document similarity. According to the symbolic characteristics and structural information of sentences, the statistical similarity between sentences can be calculated. Statistical similarity measures between sentences can measure the similarity between sentences using only the statistical information of the sentences, without any prior knowledge. This paper presents six approaches to calculating statistical similarity between sentences, and the different statistical similarity measures are evaluated on a test corpus of 40 sentences. The evaluation results show the differences between these sentence similarity measures. The proposed similarity measures can be used in short-text applications such as corpus construction and title/abstract based document recommendation.

7. Acknowledgment

This research has been partially supported by ISTIC research foundation projects YY-2010018, ZD2010-3-3 and XK2010-6, and by Beijing Excellent Talent Fostering Foundation 2010D009012000001.

8. References

[1] Z. Wu and M. Palmer, "Verbs semantics and lexical selection", Association for Computational Linguistics, 1994.

[2] T. Pedersen, S. Patwardhan and J. Michelizzi, "WordNet::Similarity: measuring the relatedness of concepts", Association for Computational Linguistics, 2004.

[3] T. Coelho, P. Calado, L. Souza, B. Ribeiro-Neto and R. Muntz, "Image retrieval using multiple evidence ranking", IEEE Transactions on Knowledge and Data Engineering, 16(4):408-417, 2004.

[4] G. Erkan and D. Radev, "LexRank: Graph-based lexical centrality as salience in text summarization", Journal of Artificial Intelligence Research, 22(1):457-479, 2004.

[5] D. Gildea and D. Jurafsky, "Automatic labeling of semantic roles", Computational Linguistics, 28(3):245-288, 2002.

[6] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity", ACM Transactions on Knowledge Discovery from Data (TKDD), 2(2):1-25, 2008.

[7] A. Islam and D. Inkpen, "Semantic similarity of short texts", Recent Advances in Natural Language Processing V: Selected Papers from RANLP 2007, page 227, 2009.

[8] T. Landauer, P. Foltz and D. Laham, "An introduction to latent semantic analysis", Discourse Processes, 25(2):259-284, 1998.

[9] C. Leacock and M. Chodorow, "Combining local context and WordNet similarity for word sense identification", WordNet: An Electronic Lexical Database, 49(2):265-283, 1998.

[10] Y. Li, D. McLean, Z. Bandar, J. O'Shea and K. Crockett, "Sentence similarity based on semantic nets and corpus statistics", IEEE Transactions on Knowledge and Data Engineering, pages 1138-1150, 2006.

[11] R. Mihalcea, C. Corley and C. Strapparava, "Corpus-based and knowledge-based measures of text semantic similarity", in Proceedings of the 21st National Conference on Artificial Intelligence, Volume 1, pages 775-780, AAAI Press, 2006.

[12] G. Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.

[13] H. Somers, "Review article: Example-based machine translation", Machine Translation, 14(2):113-157, 1999.

[14] G. Tsatsaronis, I. Varlamis and M. Vazirgiannis, "Text relatedness based on a word thesaurus", Journal of Artificial Intelligence Research, 37:1-39, 2010.

[15] P. Turney, "Mining the Web for synonyms: PMI-IR versus LSA on TOEFL", Machine Learning: ECML 2001, pages 491-502, 2001.

[16] K. Fu and S. Lu, "A clustering procedure for syntactic patterns", IEEE Transactions on Systems, Man and Cybernetics, 7:734-742, 1977.

[17] S. Lu and K. Fu, "A sentence-to-sentence clustering procedure for pattern analysis", IEEE Transactions on Systems, Man and Cybernetics, 8:381-389, 1978.

[18] J. Mitchell and M. Lapata, "Vector-based models of semantic composition", in Proceedings of ACL-08: HLT, pages 236-244, 2008.

[19] Cheng Xian-yi, Sun Ping, Zhu Qian and Cai Yue-hong, "The Research of Chinese Semantic Similarity Calculation Introduced Punctuations", JCIT, Vol. 5, No. 7, pp. 17-23, 2010.

[20] Zhang Ye, Cai GuoQiang and Jia DongYue, "A Modified Method for Concepts Similarity Calculation", JCIT, Vol. 6, No. 1, pp. 34-40, 2011.

[21] Lluis A. Belanche and Jorge Orozco, "On the design of metric relations", JCIT, Vol. 3, No. 3, pp. 70-81, 2008.

Appendix: Ten groups of sentences

Ten groups of sentences selected from the NIST05 BLEU corpus are as follows.

1. (a) meriam abandoned her husband a year ago .
(b) meriam abandoned her husband a year ago .
(c) meriam abandoned her husband a year ago .
(d) meriam abandoned her husband a year ago .

2. (a) johnson says his work was to park the cars of hotel and casino guests while meriam was increasingly tempted by the nightlife of the gambling city , frequenting clubs late at night and ignoring her husband .

(b) johnson says his work was to park the cars of hotel and casino guests while meriam was increasingly tempted by the nightlife of the gambling city , frequenting clubs late at night and ignoring her husband .

(c) johnson says his work was to park the cars of hotel and casino guests while meriam was increasingly tempted by the nightlife of the gambling city , frequenting clubs late at night and ignoring her husband .

(d) johnson says his work was to park the cars of hotel and casino guests while meriam was increasingly tempted by the nightlife of the gambling city , frequenting clubs late at night and ignoring her husband .

3. (a) this week 's eurozone economic indicators to show continuous economic weakening in europe
(b) this week 's eurozone economic indicators to show continuous economic weakening in europe
(c) this week 's eurozone economic indicators to show continuous economic weakening in europe
(d) this week 's eurozone economic indicators to show continuous economic weakening in europe

4. (a) " she 's gone off the deep end , " johnson says .
(b) " she 's gone off the deep end , " johnson says .
(c) " she 's gone off the deep end , " johnson says .
(d) " she 's gone off the deep end , " johnson says .

5. (a) economists said that eurozone economic indicators to be released this week will provide further evidence of weak economic confidence and slowing economic output , while british data is expected to show a pick - up in industrial activities .

(b) economists said that eurozone economic indicators to be released this week will provide further evidence of weak economic confidence and slowing economic output , while british data is expected to show a pick - up in industrial activities .

(c) economists said that eurozone economic indicators to be released this week will provide further evidence of weak economic confidence and slowing economic output , while british data is expected to show a pick - up in industrial activities .


(d) economists said that eurozone economic indicators to be released this week will provide further evidence of weak economic confidence and slowing economic output , while british data is expected to show a pick - up in industrial activities .

6. (a) the pair met in 1999 when career military man johnson was stationed in bahrain .
(b) the pair met in 1999 when career military man johnson was stationed in bahrain .
(c) the pair met in 1999 when career military man johnson was stationed in bahrain .
(d) the pair met in 1999 when career military man johnson was stationed in bahrain .

7. (a) their union was forbidden by the royal family , so johnson disguised meriam in a flannel shirt and a baseball cap , forged her military identification papers and brought her out of her home country to america .

(b) their union was forbidden by the royal family , so johnson disguised meriam in a flannel shirt and a baseball cap , forged her military identification papers and brought her out of her home country to america .

(c) their union was forbidden by the royal family , so johnson disguised meriam in a flannel shirt and a baseball cap , forged her military identification papers and brought her out of her home country to america .

(d) their union was forbidden by the royal family , so johnson disguised meriam in a flannel shirt and a baseball cap , forged her military identification papers and brought her out of her home country to america .

8. (a) but his life changed dramatically when he met the beautiful teenage princess and the pair fell in love .

(b) but his life changed dramatically when he met the beautiful teenage princess and the pair fell in love .

(c) but his life changed dramatically when he met the beautiful teenage princess and the pair fell in love .

(d) but his life changed dramatically when he met the beautiful teenage princess and the pair fell in love .

9. (a) however , johnson says that their star - crossed union comes to an end because of the temptations of the " sin city " of las vegas , the constant tensions with meriam 's rich and powerful family and rumors of an assassination plot against him .

(b) however , johnson says that their star - crossed union comes to an end because of the temptations of the “ sin city “ of las vegas , the constant tensions with meriam ’s rich and powerful family and rumors of an assassination plot against him .

(c) however , johnson says that their star - crossed union comes to an end because of the temptations of the “ sin city “ of las vegas , the constant tensions with meriam ’s rich and powerful family and rumors of an assassination plot against him .

(d) however , johnson says that their star - crossed union comes to an end because of the temptations of the " sin city " of las vegas , the constant tensions with meriam 's rich and powerful family and rumors of an assassination plot against him .

10. (a) after a bitter immigration battle with us authorities , the couple finally married at the candlelight wedding chapel on the famed and glitzy las vegas strip when johnson was 23 and the bride only 19 .

(b) after a bitter immigration battle with us authorities , the couple finally married at the candlelight wedding chapel on the famed and glitzy las vegas strip when johnson was 23 and the bride only 19 .

(c) after a bitter immigration battle with us authorities , the couple finally married at the candlelight wedding chapel on the famed and glitzy las vegas strip when johnson was 23 and the bride only 19 .

(d) after a bitter immigration battle with us authorities , the couple finally married at the candlelight wedding chapel on the famed and glitzy las vegas strip when johnson was 23 and the bride only 19 .
