Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.
![Page 1: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/1.jpg)
Speaker: Ping-Tsun Chang
Text Mining Workshop of ACM SIGKDD
![Page 2: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/2.jpg)
Outline
• SIGKDD: Text Mining Workshop
• Session: Mining Time-Tagged Text
  – Mining of Concurrent Text and Time Series
  – TimeMines: Constructing Timelines with Statistical Models of Word Usage
• Session: Text Mining Applications
  – Mining E-mail Authorship
![Page 3: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/3.jpg)
Mining of Concurrent Text and Time Series: Ænalyst
• Predicting trends in stock prices based on the content of news stories that precede the trends
• Two types of data
  – Financial time series
  – Time-stamped news stories
• How to connect?
  – Learn a language model for every trend type
![Page 4: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/4.jpg)
Mining of Concurrent Text and Time Series: System Design
[Diagram: Time-series data (stock prices) → Trends. Textual data (news articles) → Relevant documents. Align trends with documents → Language model for each trend type. New document → Likelihood that the document is from each model.]
![Page 5: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/5.jpg)
Mining of Concurrent Text and Time Series: Redescribe Time Series
• Identifying trends
• Discretizing trends
  – This step is a subjective one in which we assign labels to segments based on their characteristics
    • Length
    • Slope
    • Intercept
    • r²
[Equation garbled in extraction; surviving symbols: b1, b2, j, before, after, t]
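The four segment characteristics listed above can be computed with an ordinary least-squares fit. The sketch below assumes a segment is a list of prices sampled at equal time steps; the label names and the `slope_threshold` value are hypothetical, chosen only to illustrate the subjective labelling step.

```python
from statistics import mean

def describe_segment(prices):
    """Fit y = slope * x + intercept by least squares over x = 0..n-1."""
    n = len(prices)  # assumes n >= 2
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(prices)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, prices))
    syy = sum((y - y_bar) ** 2 for y in prices)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # r^2 is the squared correlation; a perfectly flat segment is treated as a perfect fit
    r2 = (sxy ** 2) / (sxx * syy) if syy > 0 else 1.0
    return {"length": n, "slope": slope, "intercept": intercept, "r2": r2}

def label_segment(features, slope_threshold=0.5):
    """Assign a discrete label from the slope (threshold is an assumption)."""
    if features["slope"] > slope_threshold:
        return "surge"
    if features["slope"] < -slope_threshold:
        return "plunge"
    return "flat"

features = describe_segment([1.0, 3.0, 5.0, 7.0])
print(features, label_segment(features))
```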
![Page 6: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/6.jpg)
Mining of Concurrent Text and Time Series: Clustering
• Agglomerative clustering

$$GAD(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} D(x, y)$$
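A minimal sketch of agglomerative clustering with a group-average distance between clusters, assuming the linkage on this slide is the mean pairwise distance between members. The function names, the `dist` callback, and the stopping rule (`num_clusters`) are illustrative choices, not taken from the slides.

```python
from itertools import combinations

def group_average_distance(ci, cj, dist):
    """Mean of dist(x, y) over all pairs with x in ci and y in cj."""
    return sum(dist(x, y) for x in ci for y in cj) / (len(ci) * len(cj))

def agglomerate(items, dist, num_clusters):
    """Start with singletons and repeatedly merge the two closest clusters."""
    clusters = [[item] for item in items]
    while len(clusters) > num_clusters:
        # combinations yields index pairs (a, b) with a < b
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: group_average_distance(clusters[p[0]], clusters[p[1]], dist),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy example: cluster segment slopes by absolute difference
slopes = [0.0, 0.1, 5.0, 5.2]
print(agglomerate(slopes, lambda x, y: abs(x - y), num_clusters=2))
```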
![Page 7: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/7.jpg)
Mining of Concurrent Text and Time Series: Language Models (I)
• A language model represents a discrete distribution over the words in the vocabulary

$$M_{best} = \arg\max_{M_t \in trends} P(M_t \mid \{D_1 \ldots D_m\}) = \arg\max_{M_t \in trends} \frac{P(\{D_1 \ldots D_m\} \mid M_t)\, P(M_t)}{P(\{D_1 \ldots D_m\})}$$

$$\frac{P(\{D_1 \ldots D_m\} \mid M_t)\, P(M_t)}{P(\{D_1 \ldots D_m\})} = P(M_t) \prod_{i=1}^{m} \frac{P(D_i \mid M_t)}{P(D_i \mid GE)} = P(M_t) \prod_{i=1}^{m} \prod_{w \in D_i} \frac{P(w \mid M_t)}{P(w \mid GE)}$$
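The likelihood-ratio scoring on this slide can be sketched with unigram models. This is a toy illustration under stated assumptions: `GE` is read here as a background model trained on all documents, add-alpha smoothing is swapped in to avoid zero probabilities (the slides do not say how the models are estimated), and all function names are hypothetical.

```python
import math
from collections import Counter

def train_unigram(docs, vocab, alpha=1.0):
    """Estimate P(w | model) with add-alpha smoothing (an assumption)."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def log_score(doc, model, background, prior):
    """log P(M_t) + sum over words of log [P(w | M_t) / P(w | GE)]."""
    score = math.log(prior)
    for w in doc:
        if w in model:  # out-of-vocabulary words are skipped
            score += math.log(model[w] / background[w])
    return score

def best_trend(doc, models, background, priors):
    """Return the trend type whose model gives the highest score."""
    return max(models, key=lambda t: log_score(doc, models[t], background, priors[t]))

training = {
    "surge": [["profit", "record", "growth"], ["record", "profit"]],
    "plunge": [["loss", "lawsuit"], ["loss", "recall"]],
}
all_docs = [doc for docs in training.values() for doc in docs]
vocab = {w for doc in all_docs for w in doc}
background = train_unigram(all_docs, vocab)
models = {t: train_unigram(docs, vocab) for t, docs in training.items()}
priors = {t: 0.5 for t in training}
print(best_trend(["record", "profit"], models, background, priors))
```

Working in log space keeps the product over many words from underflowing.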
![Page 8: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/8.jpg)
Mining of Concurrent Text and Time Series: Language Models (II)
• A language model can separate stories that are followed by a surge from stories that are not
![Page 9: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/9.jpg)
Mining of Concurrent Text and Time Series: Current Alignment
• A document may be associated with more than one trend
• It is possible for d2 to influence both trends t1 and t2.
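A many-to-many alignment like the one described above can be sketched as follows. The rule used here (a document is aligned with every trend that starts within a fixed window after the document's timestamp) and the `window_hours` value are assumptions made for illustration; the slides do not state the alignment rule.

```python
def align(doc_times, trend_starts, window_hours=5.0):
    """Map each document to all trends starting within window_hours after it."""
    alignment = {}
    for doc, t_doc in doc_times.items():
        alignment[doc] = [
            trend
            for trend, t_start in trend_starts.items()
            if 0 <= t_start - t_doc <= window_hours
        ]
    return alignment

# Times are hours from an arbitrary origin (hypothetical data)
doc_times = {"d1": 0.0, "d2": 3.0, "d3": 9.0}
trend_starts = {"t1": 4.0, "t2": 7.0}
print(align(doc_times, trend_starts))
```

With these toy timestamps, d2 falls inside the window of both t1 and t2, so it contributes to the training data of both trend types.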
![Page 10: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/10.jpg)
TimeMines: Constructing Timelines with Statistical Models of Word Usage
• Automatically generates timelines from date-tagged free-text corpora
• Constructs overviews of text corpora suitable for browsing using timelines
• Identifies time-dependent features that mark important topics in text documents
![Page 11: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/11.jpg)
TimeMines: Systems Overview
• Process steps to discover features in text
![Page 12: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/12.jpg)
TimeMines: The Model for Extracting Features
• Stationary random model
  – The occurrence of a feature depends only on its base rate, and does NOT vary with time.
  – The arrival of features is a random process with an unknown binomial distribution
• Extracting features
  – Noun phrases and named entities
  – Label as noun phrases any groups of words of length less than 6 that match the regular expression (NOUN|ADJECTIVE)*NOUN
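The noun-phrase rule above can be sketched by running the regular expression over a string of part-of-speech codes. The input is assumed to be already tagged as `(word, tag)` pairs with the tag names `NOUN` and `ADJECTIVE`; how the tagging is done, and the choice to discard (rather than split) runs of 6 or more words, are assumptions.

```python
import re

# One character per token so that regex offsets equal token offsets
TAG_CODES = {"NOUN": "N", "ADJECTIVE": "A"}

def noun_phrases(tagged, max_len=5):
    """Return word groups matching (NOUN|ADJECTIVE)*NOUN with at most max_len words."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    phrases = []
    for match in re.finditer(r"[NA]*N", codes):
        if match.end() - match.start() <= max_len:
            words = [word for word, _ in tagged[match.start():match.end()]]
            phrases.append(" ".join(words))
    return phrases

tagged = [
    ("the", "DET"), ("federal", "ADJECTIVE"), ("building", "NOUN"),
    ("was", "VERB"), ("damaged", "VERB"), ("in", "ADP"),
    ("Oklahoma", "NOUN"), ("City", "NOUN"),
]
print(noun_phrases(tagged))
```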
![Page 13: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/13.jpg)
TimeMines: Finding Significant Features
• Many statistics can be used to characterize a 2x2 contingency table
  – EMIM: Expected Mutual Information Measure
  – KL: Kullback-Leibler divergence
  – χ²: Chi-square
f0 ~f0
t = t0 a b
t ≠ t0 c d
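For a 2x2 table with cells a, b, c, d, the chi-square statistic has a closed form. The sketch below uses the standard formula without a continuity correction; reading the rows as "inside the time window t0" versus "outside it" is an assumption about the garbled row labels.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 contingency table (1 degree of freedom).

    a: documents in t0 containing f0      b: documents in t0 without f0
    c: documents outside t0 containing f0 d: documents outside t0 without f0
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so no association can be measured
    return n * (a * d - b * c) ** 2 / denom

print(chi_square_2x2(30, 70, 10, 190))
```

A value above about 3.84 rejects independence at the 5% level for one degree of freedom.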
![Page 14: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/14.jpg)
TimeMines: Grouping Significant Features
• The assumption that two features fj and fk have independent distributions implies that P( fk ) = P( fk | fj )
fj ~fj
fk a b
~fk c d
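The independence condition can be checked directly from the table counts. In this sketch, a is the number of documents containing both features, b those with fk only, c those with fj only, and d those with neither; the function name is hypothetical.

```python
def cooccurrence_probabilities(a, b, c, d):
    """Return (P(fk), P(fk | fj)) estimated from 2x2 co-occurrence counts."""
    n = a + b + c + d
    p_fk = (a + b) / n            # fk row total over all documents
    p_fk_given_fj = a / (a + c)   # fk among documents that contain fj
    return p_fk, p_fk_given_fj

# Independent features: the two values agree
print(cooccurrence_probabilities(10, 10, 10, 10))
# Features that tend to co-occur: P(fk | fj) is well above P(fk)
print(cooccurrence_probabilities(18, 2, 2, 18))
```

A large gap between the two values suggests that the features belong to the same topic and can be grouped.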
![Page 15: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/15.jpg)
TimeMines: Systems Image
• The pop-up window shows significant named entities such as Oklahoma, FBI, Justice Department, etc.
![Page 16: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/16.jpg)
Mining E-mail Authorship
• Authorship identification or categorisation of e-mail documents
• E-mail document features
  – Structural characteristics
  – Linguistic evidence
• Support Vector Machine
![Page 17: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/17.jpg)
Mining E-mail Authorship: E-mail document body attributes
• Structural features
• Pattern of vocabulary usage
• Stylistic features
• Sub-stylistic features
![Page 18: Speaker: Ping-Tsun Chang Text Mining Workshop of ACM SIGKDD.](https://reader036.fdocuments.us/reader036/viewer/2022082620/5a4d1aeb7f8b9ab05997ab3a/html5/thumbnails/18.jpg)
Mining E-mail Authorship: Experimental Results
• SVMlight
• F-measure with β=1.0
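For reference, the F-measure combines precision and recall, and β=1.0 weights them equally (the harmonic mean). A small sketch with made-up counts; the function names are illustrative.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f_measure(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom

# Hypothetical counts for one author class
p, r = precision_recall(tp=8, fp=2, fn=8)
print(p, r, f_measure(p, r, beta=1.0))
```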