Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.



Transcript of Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

Page 1: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

Speaker: Ping-Tsun Chang

Text Mining
Workshop of ACM SIGKDD

Page 2: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

Outline

• SIGKDD: Text Mining Workshop
• Session: Mining Time-Tagged Text
  – Mining of Concurrent Text and Time Series
  – TimeMines: Constructing Timelines with Statistical Models of Word Usage
• Session: Text Mining Applications
  – Mining E-mail Authorship

Page 3: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

Mining of Concurrent Text and Time Series
Ænalyst

• Predicting trends in stock prices based on the content of news stories that precede the trends
• Two types of data
  – Financial time series
  – Time-stamped news stories
• How to connect?
  – Learn a language model for every trend type

Page 4: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

Mining of Concurrent Text and Time Series
System Design

[Diagram boxes]
• Time-Series Data (Stock Price) → Trends
• Textual Data (News Articles) → Relevant Documents
• Align Trends With Documents
• Language Model For Trend-Type
• New Document → Likelihood That the Document Is from Each Model

Page 5: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

Mining of Concurrent Text and Time
Redescribe Time Series

• Identifying Trends
• Discretizing Trends
  – This step is a subjective one in which we assign labels to segments based on their characteristics
    • Length
    • Slope
    • Intercept
    • r²

[Equation garbled in extraction; the surviving symbols are b1, b2, j, "before", "after", and t]
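
The segment characteristics listed on this slide (length, slope, intercept, r²) can be computed with an ordinary least-squares fit. The following is a minimal sketch, not the authors' procedure: it assumes a segment is a plain list of prices sampled at equal intervals, and the function names, the labels "surge", "plunge", and "flat", and the slope threshold are all hypothetical choices made for illustration.

```python
def describe_segment(prices):
    """Fit y = intercept + slope * x by least squares, with x = 0..n-1."""
    n = len(prices)
    if n < 2:
        raise ValueError("a segment needs at least two points")
    mean_x = (n - 1) / 2
    mean_y = sum(prices) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(prices))
    syy = sum((y - mean_y) ** 2 for y in prices)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A perfectly flat segment has zero variance; treat the fit as exact.
    r2 = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return {"length": n, "slope": slope, "intercept": intercept, "r2": r2}


def label_segment(segment, slope_threshold=0.5):
    """Assign a coarse label from the slope (hypothetical labels and threshold)."""
    if segment["slope"] >= slope_threshold:
        return "surge"
    if segment["slope"] <= -slope_threshold:
        return "plunge"
    return "flat"


if __name__ == "__main__":
    seg = describe_segment([1.0, 3.0, 5.0, 7.0])
    print(seg, label_segment(seg))
```

Because the slide calls the labeling step subjective, the threshold is deliberately left as a parameter rather than fixed.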

Page 6: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

Mining of Concurrent Text and Time
Clustering

• Agglomerative clustering

$$GAD(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} D(x, y)$$
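
Agglomerative clustering with a group average distance (GAD) can be sketched as follows. This assumes GAD takes its standard form, the mean of the pairwise distances D(x, y) between members of two clusters. The function names, the stopping rule (a target number of clusters `k`), and the absolute-difference distance in the usage lines are illustrative assumptions, not details from the slides.

```python
from itertools import combinations


def gad(ci, cj, dist):
    """Group average distance: mean of dist(x, y) over x in ci, y in cj."""
    total = sum(dist(x, y) for x in ci for y in cj)
    return total / (len(ci) * len(cj))


def agglomerate(items, dist, k):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[item] for item in items]
    while len(clusters) > k:
        # combinations yields index pairs with i < j, so deleting j is safe.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: gad(clusters[p[0]], clusters[p[1]], dist),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    slopes = [0.0, 0.1, 5.0, 5.2]
    print(agglomerate(slopes, lambda x, y: abs(x - y), k=2))
```

This naive version recomputes every pairwise cluster distance at each step, which is fine for a handful of trend segments but slow for large inputs.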

Page 7: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

Mining of Concurrent Text and Time
Language Models (I)

• A Language Model represents a discrete distribution over the words in the vocabulary

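$$M_{best} = \arg\max_{t \in \text{trends}} P(M_t \mid \{D_1 \ldots D_m\})$$

$$M_{best} = \arg\max_{t \in \text{trends}} \frac{P(\{D_1 \ldots D_m\} \mid M_t)\, P(M_t)}{P(\{D_1 \ldots D_m\})}$$

$$\frac{P(\{D_1 \ldots D_m\} \mid M_t)\, P(M_t)}{P(\{D_1 \ldots D_m\})} = P(M_t) \prod_{i=1}^{m} \frac{P(D_i \mid M_t)}{P(D_i \mid GE)} = P(M_t) \prod_{i=1}^{m} \prod_{w \in D_i} \frac{P(w \mid M_t)}{P(w \mid GE)}$$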
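
The scoring rule on this slide, a prior P(M_t) times a product of per-word ratios P(w | M_t) / P(w | GE), is easier to compute as a sum of logarithms. The sketch below makes several assumptions that the slides do not state: each model is a unigram distribution with add-alpha smoothing, GE is treated as a background model trained on all documents, documents are lists of tokens, and words outside the vocabulary are skipped. All function names are hypothetical.

```python
import math
from collections import Counter


def train_unigram(docs, vocab, alpha=1.0):
    """Estimate P(w | model) over vocab with add-alpha smoothing."""
    counts = Counter(w for doc in docs for w in doc if w in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def log_score(doc, model, background, log_prior):
    """log P(M_t) + sum over words of log(P(w | M_t) / P(w | GE))."""
    score = log_prior
    for w in doc:
        if w in model:  # assumption: ignore out-of-vocabulary words
            score += math.log(model[w]) - math.log(background[w])
    return score


def best_trend(doc, models, background, priors):
    """Return the trend type whose model gives the document the highest score."""
    return max(
        models,
        key=lambda t: log_score(doc, models[t], background, math.log(priors[t])),
    )


if __name__ == "__main__":
    surge = [["profit", "up", "record"], ["profit", "beat"]]
    plunge = [["loss", "down"], ["loss", "lawsuit", "down"]]
    vocab = {w for doc in surge + plunge for w in doc}
    models = {"surge": train_unigram(surge, vocab),
              "plunge": train_unigram(plunge, vocab)}
    background = train_unigram(surge + plunge, vocab)
    priors = {"surge": 0.5, "plunge": 0.5}
    print(best_trend(["profit", "record"], models, background, priors))
```

The background term is the same for every trend type, so it does not change which model wins; it only rescales the scores.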

Page 8: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

Mining of Concurrent Text and Time
Language Models (II)

• A language model can separate stories that are followed by a surge from stories that are not

Page 9: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

Mining of Concurrent Text and Time
Current Alignment

• A document may be associated with more than one trend
• It is possible for d2 to influence both trends t1 and t2.

Page 10: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

TimeMines: Constructing Timelines with Statistical Models of Word Usage

• Automatically generates timelines from date-tagged free text corpora

• Construct overviews of text corpora suitable for browsing using timelines

• Identify time-dependent features that identify important topics in text documents

Page 11: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

TimeMines
Systems Overview

• Process steps to discover features in text

Page 12: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

TimeMines
The Model for Extracting Features

• Stationary random model
  – The occurrence of a feature depends only on its base rate, and does NOT vary with time.
  – The arrival of features is a random process with an unknown binomial distribution
• Extracting Features
  – Noun phrases and named entities
  – Label as noun phrases any groups of words of length less than 6 which matched the regular expression (NOUN|ADJECTIVE)*NOUN
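
The noun-phrase rule above can be tried on part-of-speech-tagged text by encoding each tag as one character and running the regular expression over the resulting string. This is a sketch under stated assumptions: the input is already tagged as (word, tag) pairs, the tag names "NOUN" and "ADJECTIVE" follow the slide, only maximal matches are considered, and matches of 6 or more words are simply dropped. The function and constant names are hypothetical.

```python
import re

# One character per tag so the slide's pattern becomes a plain string regex.
TAG_CODES = {"NOUN": "N", "ADJECTIVE": "A"}
NOUN_PHRASE = re.compile(r"[NA]*N")  # equivalent to (NOUN|ADJECTIVE)*NOUN


def noun_phrases(tagged, max_len=5):
    """Return word groups whose tag sequence matches (NOUN|ADJECTIVE)*NOUN."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    phrases = []
    for match in NOUN_PHRASE.finditer(codes):
        if match.end() - match.start() <= max_len:  # length less than 6
            words = [word for word, _ in tagged[match.start():match.end()]]
            phrases.append(" ".join(words))
    return phrases


if __name__ == "__main__":
    sentence = [
        ("the", "DET"), ("federal", "ADJECTIVE"), ("building", "NOUN"),
        ("was", "VERB"), ("damaged", "VERB"), ("in", "ADP"),
        ("Oklahoma", "NOUN"), ("City", "NOUN"),
    ]
    print(noun_phrases(sentence))
```

Mapping tags to single characters keeps match offsets aligned with word positions, so a match span can be used directly to slice the word list.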

Page 13: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

TimeMines
Finding Significant Features

• Many statistics can be used to characterize a 2x2 Contingency Table
  – EMIM: Expected Mutual Information Measure
  – KL: Kullback-Leibler divergence
  – χ²: Chi-Square

            f0     ~f0
  t = t0    a      b
  t ≠ t0    c      d
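
For a 2x2 contingency table with cells a, b, c, d, the chi-square statistic has the closed form N(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)), where N = a + b + c + d. The sketch below computes it; the function name and the example counts are made up for illustration, and the cutoff of 3.841 (the conventional 5% critical value for one degree of freedom) is an assumption rather than a threshold taken from the slides.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # An empty row or column carries no evidence of association.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


if __name__ == "__main__":
    # Hypothetical counts: documents with/without feature f0, inside/outside t0.
    stat = chi_square_2x2(30, 70, 10, 890)
    print(stat, stat > 3.841)
```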

Page 14: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

TimeMines
Grouping Significant Features

• The assumption that two features fj and fk have independent distributions implies that P( fk ) = P( fk | fj )

          fj     ~fj
  fk      a      b
  ~fk     c      d
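
With the table read as rows fk / ~fk and columns fj / ~fj, both sides of the independence condition can be estimated from the counts: P(fk) = (a + b) / N and P(fk | fj) = a / (a + c). The sketch below compares the two; the function name and the example counts are hypothetical, and it assumes fj occurs at least once so that a + c is nonzero.

```python
def independence_check(a, b, c, d):
    """Return (P(fk), P(fk | fj)) estimated from a 2x2 co-occurrence table."""
    n = a + b + c + d
    p_fk = (a + b) / n            # fk occurs in the first row
    p_fk_given_fj = a / (a + c)   # restrict to the fj column
    return p_fk, p_fk_given_fj


if __name__ == "__main__":
    # Counts consistent with independence: the two estimates agree.
    print(independence_check(10, 30, 20, 60))
    # Counts where fk is far more likely when fj is present.
    print(independence_check(30, 10, 5, 75))
```

A large gap between the two estimates suggests the features co-occur more (or less) often than independence predicts, which is the signal used to group them.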

Page 15: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

TimeMines
Systems Image

• The pop-up window shows significant named entities of Oklahoma, FBI, Justice Department, etc.

Page 16: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

Mining E-mail Authorship

• Authorship identification or categorisation by E-mail documents
• E-mail document features
  – Structural characteristics
  – Linguistic evidence
• Support Vector Machine

• Support Vector Machine

Page 17: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

Mining E-mail Authorship
E-mail document body attributes

• Structural features
• Pattern of vocabulary usage
• Stylistic
• Sub-stylistic features

Page 18: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.

Mining E-mail Authorship
Experimental Results

• SVMlight
• F-measure with β=1.0
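
The F-measure combines precision P and recall R as F_β = (1 + β²)PR / (β²P + R); with β = 1.0 it reduces to the harmonic mean of the two. A minimal sketch follows, assuming the evaluation is expressed as counts of true positives, false positives, and false negatives for one author class. The function name and the example counts are hypothetical and are not results from the talk.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-beta score from true positive, false positive, false negative counts."""
    if tp == 0:
        # No correct predictions: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical counts for one author class.
    print(f_measure(tp=8, fp=2, fn=2))
```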