An Event Detection Algorithm Based on Improved STC

Authorized licensed use limited to: SATHYABAMA UNIVERSITY. Downloaded on November 5, 2008 at 03:37 from IEEE Xplore. Restrictions apply.


Li-Qing Qiu, Bin-Pang, Li-Ping Zhao
State Key Lab. of Software Development Environment, Beihang University, 100083
{qiuliqing, pangbin, zhaolp}@nlsde.buaa.edu.cn

Abstract

To overcome some shortcomings of traditional event detection algorithms, we propose an event detection algorithm based on improved STC (suffix tree clustering), which detects significant events from large volumes of news and presents the main content of the events to the user as summaries. The experimental results on TDT indicate that the new algorithm is an effective document clustering algorithm.

1. Introduction

A new event is defined as a specific thing that happens at a specific time and place [1], and it may be reported consecutively by many news articles over a period. Automated detection of new events from Web documents is an open challenge in text mining. Event detection is an unsupervised learning task. Document clustering algorithms attempt to group documents together based on their similarities, so that the documents relevant to a certain topic are, ideally, allocated to a single cluster [2]. More specifically, the clustering algorithm used in event detection is incremental. Current event detection systems mostly compare a new document with the clusters of past documents and threshold the similarity scores: if all the similarity scores are below a threshold, the new document is predicted to be the first story of a novel event; otherwise, it is assigned to the most similar cluster.

In this paper, we focus on how to use improved STC, an incremental, O(n)-time algorithm that produces coherent clusters, to detect new events. Our main work includes:

(1) Improving the STC algorithm.

(2) Proposing a new event detection algorithm based on improved STC.

The remainder of this paper is organized as follows. We review previous research in section 2. In section 3 we summarize STC and describe its problems. The proposed algorithm is presented in section 4. We then report on the experimental methodology and results in section 5. Finally, we conclude the paper and discuss future plans in section 6.

2. Related work

Clustering algorithms are one of the hotspots of event detection research. In addition, some researchers have proposed methods to improve performance, such as imposing a time window [2] and re-weighting named entities [3].

Yang et al. [2] proposed a clustering algorithm, GAC (Group Average Clustering), a divide-and-conquer version of a group-average clustering algorithm. GAC performs AHC (Agglomerative Hierarchical Clustering), producing hierarchically organized document clusters. AHC does not try to find the "best" clusters, but keeps merging the closest pair of objects to form clusters. With a reasonable distance measure, the best time complexity of a practical AHC algorithm is O(N^2), so AHC is typically slow when applied to large collections of Web documents.

Yang et al. [2] and Allan [4] proposed the Single-pass algorithm, which is straightforward. The algorithm processes the input documents sequentially, one at a time, and grows clusters incrementally. A new document is absorbed by the most similar existing cluster if the similarity between the document and the cluster is above a preselected clustering threshold. The Single-pass method is very simple and is the most popular clustering algorithm in event detection, but it suffers from being order dependent and from a tendency to produce large clusters [5]. Moreover, it processes the input documents only one at a time, which is a limitation for large collections of Web documents.
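The Single-pass procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of [2] or [4]: the bag-of-words cosine similarity, the summed-count centroid, the default threshold of 0.5, and the names `cosine` and `single_pass` are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(documents, threshold=0.5):
    """Cluster documents in arrival order and return the list of clusters.

    Each cluster is a dict holding member indices and a centroid that is the
    sum of its members' term counts (an assumed representation).
    """
    clusters = []
    for index, text in enumerate(documents):
        vector = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vector, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > threshold:
            # Absorb the document into the most similar existing cluster.
            best["members"].append(index)
            best["centroid"].update(vector)
        else:
            # No cluster is similar enough: first story of a new event.
            clusters.append({"members": [index], "centroid": Counter(vector)})
    return clusters
```

Because each document is compared only with the clusters that exist when it arrives, feeding the same documents in a different order can produce different clusters, which is the order dependence noted above.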

Lei et al. [6] proposed an improved incremental K-means for detecting events. To select the initial cluster centers objectively, the algorithm uses a density function to initialize them. The number of clusters is affected little by the order in which the news stories are processed. However, initializing the cluster centers consumes additional time and space, so the method is not well suited to on-line detection.

In terms of using the contents of documents, TF-IDF is still the dominant technique for document representation. In this method, each document is represented by a vector of weighted terms, which can be either words or phrases, and the sequential order of the words or phrases is ignored [5], so valuable information is lost.

3. Summarization of STC

STC was proposed by Zamir and Etzioni [5] for clustering in their meta-search engine. It first identifies sets of documents that share common phrases by constructing a suffix tree, and then creates clusters according to these phrases. STC does not treat a document as a set of words but rather as a string, making use of proximity information between words. STC relies on a suffix tree to efficiently identify sets of documents that share common phrases, and uses this information to create clusters and to succinctly summarize their contents for users.

STC has three logical steps:

(1) Document "cleaning"

Sentence boundaries are marked and non-word tokens (such as HTML tags) are stripped. The strings of each document are transformed using a stemming algorithm.

(2) Identifying base clusters using a suffix tree

The suffix tree document model considers a document to be a set of suffix substrings; the common prefixes of the suffix substrings are selected as phrases to label the edges of a suffix tree. Figure 1 is an example of the suffix tree of the following set of documents:

Document 1: Cat ate cheese.
Document 2: Mouse ate cheese too.
Document 3: Cat ate mouse too.

Figure 1. An instance of suffix tree

At the same time, the suffix tree keeps the sequential order of the words in the original documents, so that the summary can be displayed in step (3). The structure can be constructed in linear time (linear in the size of the document set), and can be constructed incrementally as the documents are being read [5].

Each node of the suffix tree represents a group of documents and a phrase that is common to all of them. The label of the node represents the common phrase; the set of documents tagging the suffix-nodes that are descendants of the node represents a base cluster.

(3) Merging these base clusters into clusters

The final step merges base clusters with a high overlap in their document sets, which allows a document to appear in more than one cluster. Clusters are scored and a label is generated for each cluster. Figure 2 is an example of a base cluster graph.

Figure 2. An example of a base cluster graph (each node pairs a phrase, such as "cat ate", with the documents that contain it)
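As a rough illustration of step (2), the sketch below finds the phrases shared by at least two of the three example documents. It is not a real suffix tree: it enumerates every word-level suffix and every prefix of that suffix, which is quadratic in document length rather than linear. The function name `base_clusters`, the `min_docs` parameter, and the rule used to imitate edge compaction are assumptions made for this example.

```python
from collections import defaultdict


def base_clusters(documents, min_docs=2):
    """Map each shared phrase (a tuple of words) to the documents containing it."""
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(documents, start=1):
        words = text.lower().rstrip(".").split()
        for start in range(len(words)):                    # each suffix
            for end in range(start + 1, len(words) + 1):   # each prefix of it
                phrase_docs[tuple(words[start:end])].add(doc_id)

    shared = {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}

    def is_redundant(phrase):
        # Assumed stand-in for suffix tree edge compaction: drop a phrase when
        # a one-word right extension covers exactly the same documents
        # (for example "cat" versus "cat ate").
        return any(
            len(other) == len(phrase) + 1
            and other[: len(phrase)] == phrase
            and docs == shared[phrase]
            for other, docs in shared.items()
        )

    return {p: d for p, d in shared.items() if not is_redundant(p)}


docs = ["Cat ate cheese.", "Mouse ate cheese too.", "Cat ate mouse too."]
for phrase, members in sorted(base_clusters(docs).items()):
    print(" ".join(phrase), sorted(members))
```

For the three example documents this yields six base clusters: "cat ate" (documents 1, 3), "ate" (1, 2, 3), "ate cheese" (1, 2), "cheese" (1, 2), "mouse" (2, 3), and "too" (2, 3).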

Based on the above model, we have identified several key features of STC that make it especially suitable for event detection:

(1) Incrementality: The strings associated with each document can be easily inserted into the suffix tree as soon as the document is received. In contrast, most clustering algorithms, including AHC, cannot process document sets incrementally.

(2) Linear time: Unlike AHC, STC is a linear-time clustering algorithm (linear in the size of the document set being clustered), based on identifying phrases that are common to groups of documents.

(3) Overlapping clusters: STC creates overlapping clusters; that is, it allows one document to belong to more than one cluster, which is more reasonable because one document may have multiple topics.

(4) Non-exhaustive: Unlike the Single-pass algorithm, which processes the input documents one at a time, STC can easily process a batch of documents by inserting the strings of each document into the suffix tree.

(5) Browsable summaries: One or more phrases are naturally selected while the clusters are being built, and they form a topic summary that labels the corresponding cluster. In contrast, most clustering algorithms cannot provide concise and accurate descriptions of their clusters.

However, STC has some shortcomings. Chim et al. [7] pointed out that STC has no effective measure for evaluating the quality of clusters. Branson et al. [8] observed that it is unnecessary to present all labels to the users, because labels that are subsets of one another are commonly merged into the same cluster.

In this paper, we improve the measure used to evaluate the quality of clusters in STC, as well as the labels that are presented to the users.

4. Improvement of STC

We implement a modified version of STC. The input to our algorithm is a collection of documents and a set of user-specified parameters. The output is a forest of trees of clusters. The main steps are those described in section 3, to which we make the following changes.

(1) For merging base clusters, STC defines a binary similarity measure between base clusters based on the overlap of their document sets. Given two base clusters $B_m$ and $B_n$, with sizes $|B_m|$ and $|B_n|$ respectively, let $|B_m \cap B_n|$ denote the number of documents common to both base clusters. The similarity of $B_m$ and $B_n$ is defined to be 1 if:

$|B_m \cap B_n| / |B_m| > 0.5$ and

$|B_m \cap B_n| / |B_n| > 0.5$

Otherwise, their similarity is defined to be 0. We found that the "and" Boolean operator is not suitable when one base cluster is a subset of the other:

$B_m \subset B_n$ and $|B_m| / |B_n| < 0.5$, or

$B_n \subset B_m$ and $|B_n| / |B_m| < 0.5$

In this case one of the two ratios equals 1 while the other falls below 0.5, so the base clusters are not merged even though one contains the other. So we change the "and" operator to an "or" operator:

$|B_m \cap B_n| / |B_m| > 0.5$ or

$|B_m \cap B_n| / |B_n| > 0.5$

This is essentially identical to Yang's [10] improvement, except that we fix the threshold at 0.5 rather than leaving it as an undefined parameter.
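The difference between the two rules can be seen in a small Python sketch. The function names `similar_and`, `similar_or`, and `merge_base_clusters` are hypothetical, and merging by connected components of the similarity graph follows the scheme of the original STC paper [5] rather than anything specified in this section. Base clusters are assumed to be non-empty sets of document ids.

```python
def similar_and(bm, bn, threshold=0.5):
    """Original STC rule: both overlap ratios must exceed the threshold."""
    common = len(bm & bn)
    return common / len(bm) > threshold and common / len(bn) > threshold


def similar_or(bm, bn, threshold=0.5):
    """Modified rule of this section: one overlap ratio above the threshold suffices."""
    common = len(bm & bn)
    return common / len(bm) > threshold or common / len(bn) > threshold


def merge_base_clusters(base_clusters, similar=similar_or):
    """Union the document sets of base clusters linked by the similarity rule."""
    parent = list(range(len(base_clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(base_clusters)):
        for j in range(i + 1, len(base_clusters)):
            if similar(base_clusters[i], base_clusters[j]):
                parent[find(i)] = find(j)

    merged = {}
    for i, docs in enumerate(base_clusters):
        merged.setdefault(find(i), set()).update(docs)
    return list(merged.values())
```

For example, with `{1, 2}` and `{1, 2, 3, 4, 5}` the overlap ratios are 1.0 and 0.4, so `similar_and` rejects the pair while `similar_or` accepts it, which is exactly the subset case discussed above.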

(2) The cluster score is very important; however, Zamir et al. [5] do not describe how to score the clusters. We score the clusters using the following function:

$\mathrm{Score}(B_x) = |B_x| \cdot \min(\mathrm{Length}(\mathrm{Label}_x)) \cdot \mathrm{Weight}(\mathrm{doc})$

where $|B_x|$ is the number of documents in cluster $B_x$, $\min(\mathrm{Length}(\mathrm{Label}_x))$ is the minimum length of the labels of cluster $B_x$, and $\mathrm{Weight}(\mathrm{doc})$ is the weight of the documents associated with $B_x$.

(3) Merged clusters retain at least half of the nodes, which contain the main idea of the documents. Therefore any label or combination of labels in the merged cluster should be a good general description of the documents in the cluster. However, some labels are subsets of one another and may be merged into the same cluster. We do not want to count them twice. More specifically, we choose the longest label first, and then choose the labels that are not subsets of any already selected label.
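The label selection rule in (3) might look like the following sketch. It assumes that label length is counted in words and that "subset" means the words of one label all appear in another label; the text does not fix either choice, and the name `select_labels` is hypothetical.

```python
def select_labels(labels):
    """Pick labels longest first, skipping any whose words are a subset of a kept label."""
    selected = []
    # sorted() is stable, so labels of equal length keep their input order.
    for label in sorted(labels, key=lambda text: len(text.split()), reverse=True):
        words = set(label.split())
        if not any(words <= set(kept.split()) for kept in selected):
            selected.append(label)
    return selected


print(select_labels(["ate", "cat ate", "ate cheese", "cheese"]))
# prints ['cat ate', 'ate cheese']
```

Here "ate" and "cheese" are dropped because each is already covered by a longer selected label, so no phrase is counted twice.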

(4) Chim et al. [7] proposed a novel method that maps all nodes $n$ of the common suffix tree to an $M$-dimensional space of the VSD (Vector Space Document) model ($n = 1, 2, \ldots, M$), so that each document $d$ can be represented as a feature vector of the weights of the $M$ nodes:

$d = \{w(1, d), w(2, d), \ldots, w(M, d)\}$

The document frequency of each node, $df(n)$, is defined as the number of different documents that have traversed node $n$. For example, in Figure 2, $df(a) = 2$.

Chim et al. [7] also pointed out that simply ignoring stopwords is impractical in STC, because the suffix tree model keeps the sequential order of the words in a document, and the same phrase or word might occur in different nodes of the suffix tree. Based on this idea, they proposed the definition of a "stopnode", which applies the idea of stopwords to the suffix tree similarity measure computation. A threshold $idf_{thd}$ on the inverse document frequency (idf) is used to decide whether a node is a stopnode. Their experiments showed that this design is very effective.
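A minimal sketch of the stopnode test is given below. It assumes the common form idf(n) = log(N / df(n)), where N is the number of documents, and treats a node as a stopnode when its idf falls below the threshold; the exact formula and threshold used in [7] are not given in this paper, and the name `stop_nodes` is hypothetical.

```python
import math


def stop_nodes(node_docs, num_docs, idf_thd):
    """Return the nodes whose idf falls below idf_thd (treated as stopnodes).

    node_docs maps a node label to the set of documents that traverse it.
    """
    stops = set()
    for node, docs in node_docs.items():
        df = len(docs)                  # df(n): distinct documents through node n
        idf = math.log(num_docs / df)   # assumed idf form: log(N / df(n))
        if idf < idf_thd:
            stops.add(node)
    return stops


nodes = {"cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2}}
print(stop_nodes(nodes, num_docs=3, idf_thd=0.2))  # prints {'ate'}
```

In this toy case the node "ate" is traversed by every document, so its idf is 0 and it is ignored, much as a stopword would be.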

In our paper, we adopt the same idea of "stopnode" instead of stopwords, which also proves to be effective.

5. Experimental results

5.1. Data preparation

We use the TDT [1] corpus, a benchmark for event detection, for our experiments, and choose the TDT2 English corpus to run them. The TDT2 English corpus contains news data collected daily from 6 news sources over a period of six months. Detection is an unsupervised classification task that does not involve training data, so the entire English corpus is used as evaluation data. More details of our data set are listed in Table 1.

Table 1. Details of data set

Time                         January-June, 1998
Number of articles           1930
Number of topics (events)    70
Average articles per event   27

5.2. Evaluation measures

The TDT project has its own evaluation plan: detection performance is characterized in terms of the probabilities of miss and false alarm errors, and these error probabilities are then combined into a single detection cost by assigning costs to miss and false alarm errors. However, the TDT tasks are not consistent with ours. Thus, we choose the same evaluation metrics as Yang et al. [2]. Table 2 shows the contingency table for a cluster-event pair, where a, b, c, d are document counts in the corresponding cells. Five evaluation measures are defined: Miss, False alarm, Recall, Precision, and F-measure. To measure global performance, two averaging methods are used: the micro-average, obtained by summing the corresponding cells and then computing the five measures, and the macro-average, obtained by averaging the five measures over all events.

Table 2. Contingency table

                in cluster    not in cluster
in event            a               c
not in event        b               d

In our paper, we use the micro-average and the macro-average of the F-measure as our evaluation measures, and compute both for each clustering result. The F-measure was originally defined by C. J. van Rijsbergen [9] as the harmonic mean of recall and precision. It is defined as follows if $(a + b + c) > 0$, and is otherwise undefined:

$F = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \dfrac{2a}{2a + b + c}$

where $\mathrm{Precision} = a / (a + b)$ if $a + b > 0$, otherwise undefined, and $\mathrm{Recall} = a / (a + c)$ if $a + c > 0$, otherwise undefined.
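The two averages can be computed directly from the contingency counts, as in the sketch below. The function names `f_measure` and `micro_macro_f` and the sample counts are illustrative assumptions, and skipping undefined pairs in the macro-average is also an assumption, since the text does not say how they are handled.

```python
def f_measure(a, b, c):
    """F = 2a / (2a + b + c); returns None when a + b + c == 0 (undefined)."""
    total = 2 * a + b + c
    return 2 * a / total if total > 0 else None


def micro_macro_f(tables):
    """tables: list of (a, b, c, d) counts, one per cluster-event pair."""
    # Micro-average: sum the cells first, then compute F once.
    micro = f_measure(
        sum(t[0] for t in tables),
        sum(t[1] for t in tables),
        sum(t[2] for t in tables),
    )
    # Macro-average: compute F per pair, then average the defined values.
    scores = [f_measure(a, b, c) for a, b, c, _ in tables]
    scores = [s for s in scores if s is not None]
    macro = sum(scores) / len(scores) if scores else None
    return micro, macro


micro, macro = micro_macro_f([(8, 2, 2, 88), (1, 0, 3, 96)])
print(round(micro, 2), round(macro, 2))  # prints 0.72 0.6
```

The micro-average is dominated by the larger event (F = 0.8), while the macro-average gives the small, poorly recovered event (F = 0.4) equal weight.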

5.3. Performance on the data set

To compare our approach with other algorithms, Yang et al.'s augmented GAC and the commonly used KNN algorithm are chosen as baselines. Since GAC is a hierarchical clustering method, we stop when k clusters are left, and run re-clustering 5 times, following the settings recommended in [2]. For KNN we report the results under the best threshold.

The original STC algorithm selects the 500 highest-scoring base clusters for further cluster merging, but only the top 10 clusters are selected from the merged clusters as the final clustering result. Thus we also allowed GAC and KNN to generate 10 clusters in our experiments, to make the comparison as fair as possible.

We use Java to implement all three algorithms, and use Excel to plot all figures. The comparison of the three methods is shown in Figure 3, which shows their performance, and Figure 4, which shows their execution time.

Figure 3. The performance of the three methods

Figure 4. The execution time of the three methods (Time/s)

STC clearly obtains the best results. We believe the main reason for STC's best performance is the STC model, which is the most suitable for event detection. Furthermore, STC is the fastest of the three algorithms, owing to its direct-insertion policy.

6. Conclusions and future work

The work presented in this paper focuses mainly on improving the effectiveness of document clustering algorithms, which are a hotspot of event detection research. Efficiency optimization of the algorithm has also been a target of our current work. STC is a suitable algorithm for clustering in event detection, with excellent features such as linear time complexity and incrementality. In this work we have presented a new event detection algorithm based on an improved STC algorithm. We improved STC in the following aspects: an improved method of merging base clusters, a new definition of cluster score, new cluster labels, and the implementation of "stopnode". We also demonstrated substantial results using the improved STC algorithm, which achieved the best performance and the shortest execution time compared to KNN and GAC.

In the course of this work, we encountered a number of interesting questions and hope to answer them in our future research. First, we are not satisfied with the threshold, although STC is not very sensitive to it. Second, we will consider improving the STC algorithm by combining it with other algorithms, which may yield better results.

The present work can be extended in a number of important directions. One is dictated by the multilingual nature of TDT: the STC algorithm should be capable of dealing with documents in multiple languages. We are very interested in applying the improved STC algorithm to Chinese.

References

[1] Topic Detection and Tracking (TDT) project. Homepage: http://www.nist.gov/speech/tests/tdt/.

[2] Y. Yang, J. G. Carbonell, et al. Learning Approaches for Detecting and Tracking News Events [J]. IEEE Intelligent Systems: Special Issue on Applications of Intelligent Information Retrieval, 1999, 14(4): 32-43.

[3] Y. Yang, J. Z., et al. Topic-conditioned novelty detection. In Proc. of SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.

[4] J. Allan. Topic Detection and Tracking: Event-based Information Organization [M]. Dordrecht: Kluwer Academic Publishers, 2002.

[5] O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proc. of SIGIR'98, University of Washington, Seattle, USA, 1998.

[6] Z. Lei, L. Wu, et al. Incremental K-means Method Based on Initialisation of Cluster Centers and Its Application in News Event Detection. Journal of the China Society for Scientific. 25(3): 289-295, 2006.

[7] H. Chim and X. Deng. A New Suffix Tree Similarity Measure for Document Clustering. In Proc. of WWW'2007, Banff, Alberta, Canada, 2007.

[8] S. Branson and A. Greenberg. Clustering Web Search Results Using Suffix Tree Methods. Homepage: http://stanford.edu/lass/archive/cs/cs276a/cs276a

[9] C. J. van Rijsbergen. Information Retrieval [M]. London: Butterworths, 1979.

[10] J. W. Yang. A Chinese Web Page Clustering Algorithm Based on the Suffix Tree. Wuhan University Journal of Natural Sciences [M]. 9(5): 817-822, 2004.
