
Data Mining: Concepts and Techniques

— Chapter 10 — 10.3.2 Mining Text and Web Data (II)

Jiawei Han and Micheline Kamber

Department of Computer Science

University of Illinois at Urbana-Champaign

www.cs.uiuc.edu/~hanj

Acknowledgements: Slides by students at CS512 (Spring 2009)

Outline

• Probabilistic Topic Models (Yue Lu)

• Opinion Mining (Hyun Duk Kim)

• Mining Query Logs for Personalized Search (Yuanhua Lv)

• Online Analytical Processing on Multidimensional Text Database (Duo Zhang)

Probabilistic Topic Models

Yue Lu

Department of Computer Science

University of Illinois, Urbana-Champaign

Many slides are adapted/taken from different sources, including presentations by ChengXiang Zhai, Qiaozhu Mei and Tom Griffiths

Intuition

• Documents exhibit multiple topics. Example: a single document annotated with three topics:

– topic: social network website
– topic: education
– topic: criticism

What is a Topic?

Topic: a broad, semantically coherent concept or theme that is hidden in documents.

Representation: a multinomial distribution over words, i.e., a unigram language model, e.g.:

retrieval 0.2, information 0.15, model 0.08, query 0.07, language 0.06, feedback 0.03, …

Organize Information with Topics

[Figure: a spectrum of ways to organize text, from fine to coarse resolution, with rough counts per document: Words (thousands; e.g., oil, new, put, orleans, is, …), Phrases and Entities (hundreds to many; e.g., new orleans, put together, president bush), Patterns, Topics (50~100; e.g., "oil price": price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, …; "government response"; "loss statistics, …"), and Categories (several to one; e.g., Natural hazards).]

The Usage of Topic Models

• Summarize themes/aspects
• Navigate documents
• Retrieve documents
• Segment documents
• Document classification
• Document clustering

Example: topics estimated from hurricane-related news, and a document segmented by them:

• Topic 1: government 0.3, response 0.2, …
• Topic 2: donate 0.1, relief 0.05, help 0.02, …
• Topic k: city 0.2, new 0.1, orleans 0.05, …
• Background B: is 0.05, the 0.04, a 0.03, …

"[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. …"

General Idea of Probabilistic Topic Models

• Cast the intuition into a generative probabilistic process (the generative process):
– Each document is a mixture of corpus-wide topics (each topic a multinomial distribution / unigram LM)
– Each word is drawn from one of those topics

• Since we only observe the documents, we need to figure out (estimation/inference):
– What are the topics?
– How are the documents divided among those topics?

• Two basic models: PLSA and LDA (a toy simulation of the generative process is sketched below)
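To make the generative story concrete, here is a minimal toy simulation in Python. It is an illustration only: the vocabulary size, document length, and hyperparameter values are arbitrary assumptions, and it uses the smoothed-LDA variant (Dirichlet-distributed mixtures, introduced later in these slides).

import numpy as np

rng = np.random.default_rng(0)

K, V, doc_len = 3, 6, 20              # topics, vocabulary size, words per document
alpha, beta = 0.5, 0.1                # Dirichlet hyperparameters (arbitrary here)

topics = rng.dirichlet([beta] * V, size=K)   # K unigram LMs: p(w | topic)

def generate_document():
    theta = rng.dirichlet([alpha] * K)       # per-document topic mixture
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)           # pick a topic for this word
        w = rng.choice(V, p=topics[z])       # draw the word from that topic
        words.append(w)
    return words

print(generate_document())                   # a document as a list of word ids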

Probabilistic Latent Semantic Analysis/Indexing [Hofmann 99]

PLSA: Generation Process

To generate a word w in document d:
1. With probability λ_B, draw w from the collection background model θ_B (e.g., is 0.05, the 0.04, a 0.03, …).
2. Otherwise, pick topic j with document-specific probability π_{d,j} and draw w from topic model θ_j, e.g.:
– θ_1: battery 0.3, life 0.2, …
– θ_2: design 0.1, screen 0.05, …
– θ_k: price 0.2, purchase 0.15, …

[Hofmann 99], [Zhai et al. 04]

Parameters: λ_B = noise level (manually set); the θ's and π's need to be estimated.
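In equations, the word generation probability is (a reconstruction following [Zhai et al. 04], using the notation above):

p(w \mid d) = \lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)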

PLSA: Estimation

Estimation reverses the generation process: the words of each document are observed, while all word probabilities in the topic models θ_1, …, θ_k and all document-specific mixing weights π_{d,1}, …, π_{d,k} are unknown. They are estimated with the Maximum Likelihood Estimator (MLE), maximizing the log-likelihood of the collection C,

\log p(C) = \sum_{d \in C} \sum_{w} c(w, d) \log p(w \mid d),

through an EM algorithm. [Hofmann 99], [Zhai et al. 04]
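The following is a minimal EM sketch for this model in Python (NumPy). It is an illustrative implementation under the assumptions above: a fixed background weight λ_B, dense matrices, and a fixed iteration count instead of a convergence test.

import numpy as np

def plsa_em(counts, k, lambda_b=0.9, iters=50, seed=0):
    """counts: (n_docs, V) term-count matrix; k: number of topics."""
    rng = np.random.default_rng(seed)
    n_docs, V = counts.shape
    theta_b = counts.sum(axis=0) / counts.sum()       # background LM (fixed)
    theta = rng.dirichlet(np.ones(V), size=k)         # p(w | topic j)
    pi = rng.dirichlet(np.ones(k), size=n_docs)       # p(topic j | doc d)
    for _ in range(iters):
        # E-step: posteriors of the hidden variables for every (doc, word) pair
        mix = pi @ theta                              # (n_docs, V): topic mixture part
        p_b = lambda_b * theta_b / (lambda_b * theta_b + (1 - lambda_b) * mix + 1e-12)
        p_z = pi[:, None, :] * theta.T[None, :, :]    # p(z=j | d, w), (n_docs, V, k)
        p_z /= p_z.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate pi and theta from expected counts
        w_counts = counts * (1 - p_b)                 # counts not explained by background
        expected = w_counts[:, :, None] * p_z         # (n_docs, V, k)
        pi = expected.sum(axis=1)
        pi /= pi.sum(axis=1, keepdims=True) + 1e-12
        theta = expected.sum(axis=0).T                # (k, V)
        theta /= theta.sum(axis=1, keepdims=True) + 1e-12
    return pi, theta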

Problems with PLSA

– "Documents have no generative probabilistic semantics"
• i.e., a document is just a symbol
– The model has many parameters
• linear in the number of documents
• heuristic methods are needed to prevent overfitting
– Cannot generalize to new documents

Latent Dirichlet Allocation [Blei et al. 03]

Basic Idea of LDA

• Add a Dirichlet prior α on the topic distribution of each document
• Add a Dirichlet prior β on the word distribution of each topic
• α and β can be vectors, but for convenience α = α_1 = α_2 = …; β = β_1 = β_2 = … (smoothed LDA)

[Figure: the PLSA generation diagram, now with the per-document topic weights π_{d,1}, …, π_{d,k} drawn from Dirichlet(α) and each topic θ_1, …, θ_k drawn from Dirichlet(β).]

[Blei et al. 03], [Griffiths & Steyvers 02, 03, 04]

• Generally have a smoothing effect on multinomial parameters

• Large α, β : more smoothed topic/word distribution

• Small α, β: more skewed topic/word distribution (e.g. bias towards a few words for each topic)

• Common settings: α=50/K, β=0.01

• PLSA is maximum a posteriori estimated LDA when using uniform prior: α=1, β=1

Inference

• Exact inference is intractable

• Approximation techniques:
– Mean field variational methods (Blei et al., 2001, 2003)
– Expectation propagation (Minka and Lafferty, 2002)
– Collapsed Gibbs sampling (Griffiths and Steyvers, 2002) (see the sketch after this list)
– Collapsed variational inference (Teh et al., 2006)
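As a concrete illustration of the sampling approach, here is a minimal collapsed Gibbs sampler for LDA in Python. It is a sketch, not production code: dense count tables, symmetric priors, no burn-in handling, and arbitrary default hyperparameters.

import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of word-id lists. Returns assignments and count tables."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))      # document-topic counts
    n_kw = np.zeros((K, V))              # topic-word counts
    n_k = np.zeros(K)                    # topic totals
    z = [[rng.integers(K) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):       # initialize the count tables
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]              # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # full conditional p(z_i = k | all other assignments)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k              # add the new assignment back
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return z, n_dk, n_kw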

Would you like to know more?

• “Parameter estimation for text analysis” by Gregor Heinrich

• “Probabilistic topic models” by Mark Steyvers

Opinion Mining

Hyun Duk Kim


Agenda

• Overview
• Opinion finding & sentiment classification
• Opinion summarization
• Other works
• Discussion & conclusion

Web 2.0

"Web 2.0 is the business revolution in the computer industry caused by the move to the Internet as a platform, and an attempt to understand the rules for success on that new platform." [Wikipedia]

Users participate in content creation, e.g., blogs, reviews, Q&A forums.

Opinion Mining

• Huge volume of opinions on the Web, e.g., product reviews, blog posts about political issues
• Need a good technique to summarize them
• Example of a commercial system: MS Live Search

Usefulness of Opinion Mining

• Individuals
– Purchasing a product/service
– Tracking political topics
– Other decision-making tasks
• Businesses and organizations
– Product and service benchmarking
– Surveys on a topic
• Ads placement
– Place an ad when one praises a product
– Place an ad from a competitor if one criticizes a product

[Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]

Subtasks

• Opinion finding & sentiment classification
– Opinion finding: is the target text opinion or fact?
– Sentiment classification: is the opinion positive or negative? In more detail: positive/negative/mixed
– Methods: lexicon-based method, machine learning

• Opinion summarization: how to show opinion finding/classification results effectively
– Methods:
• Showing basic statistics
• Feature-level summary [Hu & Liu, KDD'04 / Hu & Liu, AAAI'04]
• Summary paragraph generation [Kim et al., TAC'08]
• Probabilistic analysis [Mei et al., WWW'07]

• Other works

Opinion Finding

• Lexicon-based method
– Prepare an opinion word list, e.g., words: 'good', 'bad'; phrases: 'I think', 'in my opinion'
– Check special parts of speech expressing opinions, e.g., adjectives: 'excellent', 'horrible'; verbs: 'like', 'hate'
– Decide based on the occurrences of those words
– Lexicon sources: manually classified word lists; WordNet; external sources such as Wikipedia (objective) and review data (subjective)

• Machine learning
– Train with tagged examples
– Main features: opinion lexicons, part-of-speech tags, punctuation (e.g., !), modifiers (e.g., not, very), word tokens, dependency

Opinion Sentiment Classification

• Method: similar to opinion finding
– Lexicon-based method
– Machine learning

• Instead of using opinionated words/examples, use positive and negative words/examples
– If positive or negative words dominate, classify as positive or negative
– If both positive and negative words dominantly exist, classify as mixed

(A minimal lexicon-based sketch of both subtasks follows.)
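The sketch below is illustrative only: the tiny word and cue lists stand in for a real opinion lexicon, and negation/modifier handling is omitted.

# Lexicon-based opinion finding + sentiment classification (toy lexicon)
OPINION_CUES = {"i think", "in my opinion"}
POSITIVE = {"good", "excellent", "like", "love"}
NEGATIVE = {"bad", "horrible", "hate", "poor"}

def classify(sentence: str) -> str:
    text = sentence.lower()
    tokens = text.split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    opinionated = pos + neg > 0 or any(cue in text for cue in OPINION_CUES)
    if not opinionated:
        return "fact"            # opinion finding: no opinion evidence found
    if pos > 0 and neg > 0:
        return "mixed"           # both polarities dominantly present
    return "positive" if pos > neg else "negative"

print(classify("I think the battery life is excellent"))        # -> positive
print(classify("The screen is horrible and the price is bad"))  # -> negative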

Opinion Sentiment Classification (cont.)

• Query-dependent sentiment classification [Lee et al., TREC'08 / Jia et al., TREC'08]
– Motivation: sentiments are expressed differently for different queries, e.g., 'small' can be good for iPod size but bad for LCD monitor size
– Use external web sources to obtain positive and negative opinionated lexicons
– Key ideas:
• Objective words: Wikipedia, the product specification section of Amazon.com
• Subjective words: reviews from Amazon.com, Rateitall.com, and Epinions.com
• Reviews rated 4 or 5 out of 5: positive words
• Reviews rated 1 or 2 out of 5: negative words
– Top ranked in the Text REtrieval Conference (TREC)

[Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]


Opinion Summarization

• Basic statistics: show how many opinions of each polarity there are, e.g., opinions about iPod:

Positive   Negative
80%        20%

Opinion Summarization (cont.)

• Feature-based summary [Hu & Liu, KDD'04 / Hu & Liu, AAAI'04]: find lower-level features and analyze each, e.g., opinions about iPod:
– Feature extraction: usually nouns / noun phrases
– Frequent feature identification: association mining
– Feature pruning and infrequent feature identification based on heuristic rules
– Sentiment summary for each feature (an aggregation sketch follows the table):

Feature:    Battery life   Design      Price
Pos / Neg   50% / 50%      95% / 5%    30% / 70%
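Below is a minimal sketch of the feature-level aggregation step in this style. The fixed feature list is an illustrative stand-in for the features the paper mines with association mining, and the input is assumed to be sentences already classified by polarity.

from collections import defaultdict

FEATURES = {"battery", "design", "price"}      # stand-in for mined features

def summarize(classified_sentences):
    """classified_sentences: iterable of (sentence, polarity) pairs."""
    stats = defaultdict(lambda: {"positive": 0, "negative": 0})
    for sentence, polarity in classified_sentences:
        for feature in FEATURES:
            if feature in sentence.lower() and polarity in ("positive", "negative"):
                stats[feature][polarity] += 1
    for feature, c in stats.items():
        total = c["positive"] + c["negative"]
        print(f"{feature}: {c['positive']/total:.0%} pos / {c['negative']/total:.0%} neg")

summarize([("Battery life is great", "positive"),
           ("The battery died in an hour", "negative"),
           ("Love the design", "positive")])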

Opinion Summarization (cont.)

• Summary paragraph generation [Kim et al., TAC'08]: general NLP summarization techniques
– Sentence-extraction-based summary
– Opinion filtering: show opinionated sentences; show sentences having the same polarity as the goal of the summary
– Opinion ordering: divide the paragraphs by opinion polarity, e.g.:
"Following are positive opinions… [Paragraph 1] …
Following are negative opinions… [Paragraph 2] …
Following are mixed opinions… …"

Opinion Summarization (cont.)

• Probabilistic analysis: topic-sentiment mixture model [Mei et al., WWW'07]
– Topic modeling with opinion priors

[Figure: the generation process of the topic-sentiment mixture model]


Other Works

• Comparative analysis: focus on texts containing contradiction or comparison
– Finding comparative sentences [Jindal & Liu, SIGIR'06]
• Comparison indicators such as 'than' or 'as well as', e.g., "iPod is better than Zune."
• Sequential patterns indicating comparative sentences, e.g., ⟨{NN}{VBZ}{RB}{more JJR}{NN}{IN}{NN}⟩ → comparative
– Finding the preferred entity [Murthy & Liu, COLING'08]
• Rule-based approach
• Context-dependent orientation finding using Pros and Cons reviews

Other Works (cont.)

• Opinion integration [Lu & Zhai, WWW'08]
– Integrate expert reviews with an arbitrary text collection
• Expert reviews: well structured, easy to find features, not often updated
• Arbitrary text: unstructured, varied, frequently updated
– Semi-supervised topic model
• Extract structured aspects (features) from the expert review to cluster general documents
• Add supplementary opinions from general documents


Challenges in Opinion Mining

• Polarity terms are context sensitive
– E.g., 'small' can be good for iPod size but bad for LCD monitor size
– Even in the same domain, different words are used depending on the target feature, e.g., long iPod battery life vs. long iPod loading time
– Partially solved (query-dependent sentiment classification)

• Implicit and complex opinion expressions
– Rhetorical expressions, metaphor, double negation, e.g., "The food was like a stone"
– Opinion mining needs both good IR and good NLP techniques

• Opinions cannot always be divided cleanly into pos/neg
– Not all opinions can be classified into two categories
– The interpretation can change based on conditions, e.g.:
1) "The battery life is long if you do not use the LCD a lot" (pos)
2) "The battery life is short if you use the LCD a lot" (neg)
Current systems classify the first as positive and the second as negative, yet both state the same fact.

[Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]

Discussion

• A difficult task, yet essential for many blog- or review-mining techniques
• Current stage of opinion finding:
– Good performance at the sentence level, in specific domains, and on sub-problems
– Still low accuracy in the general case: MAP scores of the top-performing TREC'08 systems were 0.4569 for opinion finding and 0.2297~0.2723 for polarity finding
• A lot of margin to improve!

References

I. Ounis, C. Macdonald, and I. Soboroff. Overview of the TREC 2008 Blog Track. TREC, 2008.

Opinion Mining and Summarization: Sentiment Analysis. Tutorial given at WWW-2008, April 21, 2008, Beijing, China.

Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, and ChengXiang Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. In Proceedings of the 16th International World Wide Web Conference (WWW'07), pages 171-180, 2007.

Minqing Hu and Bing Liu. Mining and Summarizing Customer Reviews. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004), Seattle, Washington, USA, August 22-25, 2004.

Minqing Hu and Bing Liu. Mining Opinion Features in Customer Reviews. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-2004), San Jose, USA, July 2004.

Yue Lu and ChengXiang Zhai. Opinion Integration Through Semi-supervised Topic Modeling. In Proceedings of the 17th International World Wide Web Conference (WWW'08).

Kavita Ganesan and Hyun Duk Kim. Opinion Mining: A Short Tutorial. 2008.

Hyun Duk Kim, Dae Hoon Park, V.G.Vinod Vydiswaran, and ChengXiang Zhai. Opinion Summarization Using Entity Features and Probabilistic Sentence Coherence Optimization: UIUC at TAC 2008 Opinion Summarization Pilot. Text Analysis Conference (TAC), Maryland, USA.

References (cont.)

Y. Lee, S.-H. Na, J. Kim, S.-H. Nam, H.-Y. Jung, and J.-H. Lee. KLE at TREC 2008 Blog Track: Blog Post and Feed Retrieval. TREC, 2008.

L. Jia, C. Yu, and W. Zhang. UIC at TREC 2008 Blog Track. TREC, 2008.

Nitin Jindal and Bing Liu. Identifying Comparative Sentences in Text Documents. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research & Development on Information Retrieval (SIGIR-06), Seattle, 2006.

Opinion Mining and Summarization (including review spam detection). Tutorial given at WWW-2008, April 21, 2008, Beijing, China.

Murthy Ganapathibhotla and Bing Liu. Mining Opinions in Comparative Sentences. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 241-248, Manchester, August 2008.

Thank you


Mining User Query Logs for Personalized Search

Yuanhua Lv

(Some slides are taken from Xuehua Shen, Bin Tan, and ChengXiang Zhai’s presentation)


Problem of Current Search Engines

The query "Jaguar" is ambiguous: it could mean the car, the animal, the Apple software (Mac OS X "Jaguar"), or a chemistry software package. Suppose we know:

1. Short-term query logs: the previous query was "racing cars". [Shen et al. 05]

2. Long-term query logs: "car" occurs far more frequently than "Apple" in the user's query logs over the recent 2 months. [Tan et al. 06]

Either kind of context points to the car sense.

Problem Definition

A user's session provides a series of queries and clickthroughs:

Q1: user query, e.g., "Apple software"
C1 = {C1,1, C1,2, C1,3, …}: user clickthrough, e.g., "Apple - Mac OS X. The Apple Mac OS X product page. Describes features in the current version of Mac OS X, a screenshot gallery, latest software downloads, and a directory of ..."
Q2, C2 = {C2,1, C2,2, C2,3, …}, …, Qk: the current query

?: the user's information need

How do we model and mine user query logs to infer it?

Retrieval Model

Basis: unigram language model + KL divergence. The current query Q_k is represented by a query model \theta_{Q_k}, each document D by a document model \theta_D, and results are ranked by the similarity measure

score(Q_k, D) = -D(\theta_{Q_k} \,\|\, \theta_D)

Mining query logs updates the query model using the whole history:

p(w \mid \theta'_{Q_k}) = p(w \mid Q_k, Q_1, \ldots, Q_{k-1}, C_1, \ldots, C_{k-1})

and results are then ranked by -D(\theta'_{Q_k} \,\|\, \theta_D). (A toy scoring function in this spirit is sketched below.)
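A toy scoring function, assuming both models are small word-to-probability dictionaries; the epsilon floor is a crude stand-in for proper document-model smoothing:

import math

def kl_score(query_model, doc_model, epsilon=1e-12):
    """score(Q, D) = -D(theta_Q || theta_D); higher is better."""
    return -sum(p * math.log(p / doc_model.get(w, epsilon))
                for w, p in query_model.items() if p > 0)

theta_q = {"jaguar": 0.6, "car": 0.4}                          # updated query model
theta_d = {"jaguar": 0.1, "car": 0.05, "racing": 0.05, "the": 0.2}
print(kl_score(theta_q, theta_d))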

Mining Short-term User Query Logs [Shen et al. 05]

Build a history model H from the previous queries Q_1, …, Q_{k-1} and clickthroughs C_1, …, C_{k-1}:

• Average the user's previous queries:

p(w \mid H_Q) = \frac{1}{k-1} \sum_{i=1}^{k-1} p(w \mid Q_i)

• Average the user's previous clickthrough:

p(w \mid H_C) = \frac{1}{k-1} \sum_{i=1}^{k-1} p(w \mid C_i)

• Combine previous clickthrough and previous queries:

p(w \mid H) = \beta\, p(w \mid H_C) + (1-\beta)\, p(w \mid H_Q)

• Linearly interpolate the current query and the history model:

p(w \mid \theta_{Q_k}) = \alpha\, p(w \mid Q_k) + (1-\alpha)\, p(w \mid H)

Should α be fixed? Should the history really be a flat average? These questions motivate four heuristic variants.

Four Heuristic Variants

• FixInt: fixed-coefficient interpolation, exactly as above
• BayesInt: adapt the interpolation coefficient to the query length. Intuition: if the current query Q_k is longer, we should trust Q_k more.
• OnlineUp: assign more weight to more recent records
• BatchUp: the user becomes better and better at query formulation as time goes on, but we do not need to "decay" the clickthrough

(A toy implementation of FixInt is sketched below.)

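Here is a toy Python sketch of the FixInt variant (the other variants differ only in how the interpolation weights are set); the dictionary-based language models and default weights are illustrative assumptions:

def mix(models_and_weights):
    """Weighted sum of word -> probability dictionaries."""
    out = {}
    for model, weight in models_and_weights:
        for w, p in model.items():
            out[w] = out.get(w, 0.0) + weight * p
    return out

def average(models):
    return mix([(m, 1.0 / len(models)) for m in models]) if models else {}

def fixint(query_model, past_queries, past_clicks, alpha=0.5, beta=0.5):
    """FixInt: p(w|theta_Qk) = alpha*p(w|Qk) + (1-alpha)*p(w|H), where
    p(w|H) = beta*p(w|H_C) + (1-beta)*p(w|H_Q). Assumes a non-empty history."""
    h_q = average(past_queries)      # average of previous query models
    h_c = average(past_clicks)       # average of previous clickthrough models
    history = mix([(h_c, beta), (h_q, 1.0 - beta)])
    return mix([(query_model, alpha), (history, 1.0 - alpha)])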

Data Set of Evaluation

• Data collection: TREC AP88-90
• Topics: 30 hard topics from TREC topics 1-150
• System: search engine + RDBMS
• Context: query and clickthrough history of 3 participants

Overall Effect of Search Context

           FixInt (α=0.1, β=1.0)   BayesInt (μ=0.2, ν=5.0)   OnlineUp (μ=5.0, ν=15.0)   BatchUp (μ=2.0, ν=15.0)
           MAP      pr@20          MAP      pr@20            MAP      pr@20             MAP      pr@20
Q3         0.0421   0.1483         0.0421   0.1483           0.0421   0.1483            0.0421   0.1483
Q3+HQ+HC   0.0726   0.1967         0.0816   0.2067           0.0706   0.1783            0.0810   0.2067
Improve    72.4%    32.6%          93.8%    39.4%            67.7%    20.2%             92.4%    39.4%
Q4         0.0536   0.1933         0.0536   0.1933           0.0536   0.1933            0.0536   0.1933
Q4+HQ+HC   0.0891   0.2233         0.0955   0.2317           0.0792   0.2067            0.0950   0.2250
Improve    66.2%    15.5%          78.2%    19.9%            47.8%    6.9%              77.2%    16.4%

• Short-term query logs help the system improve retrieval accuracy

• BayesInt is better than FixInt; BatchUp is better than OnlineUp

Mining Long-term User Query Logs [Tan et al. 06]

• Can we mine long-term user query logs similarly?

• Challenge: long-term query logs are noisy
– How do we handle the noise?
– Can we still improve performance?

• Solution: assign weights to the query log data (via an EM algorithm)

Hierarchical History Models

Each past search unit S_k = (q_k, D_k, C_k), i.e., a query with its result documents and clickthroughs, contributes a unit history model:

• unit history model: \theta_{S_k} \leftarrow q_k, D_k, C_k
• overall history model: \theta_H = \sum_k w_k\, \theta_{S_k}, where the w_k are weights for the query log units
• contextual query model: \theta_{q,H}, combining the original query model \theta_q with \theta_H
• ranking: documents are scored by -D(\theta_{q,H} \,\|\, \theta_d) against each document model \theta_d

Discriminative Weighting with Mixture Model

How should the weights be set? Treat the documents D_t returned for the current query q_t as data generated by a mixture model \theta_{Mix} over the unit history models \theta_{S_1}, …, \theta_{S_{t-1}}, the query model \theta_q, and a background model \theta_B, with unknown mixing weights \lambda_1, …, \lambda_{t-1}, \lambda_q, \lambda_B. Select the {\lambda} to fit the data, i.e., maximize p(D_t \mid \theta_{Mix}), via an EM algorithm. For example, with result documents

d1: "jaguar car perfect for racing"
d2: "jaguar is a big cat …"
d3: "locate jaguar dealer in champaign …"

history units whose language resembles these documents would receive large weights.
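Written out (a reconstruction consistent with [Tan et al. 06]; the mixing weights sum to one):

p(w \mid \theta_{Mix}) = \lambda_B\, p(w \mid \theta_B) + \lambda_q\, p(w \mid \theta_q) + \sum_{i=1}^{t-1} \lambda_i\, p(w \mid \theta_{S_i})

\{\lambda\}^{*} = \arg\max_{\{\lambda\}} \sum_{w \in D_t} c(w, D_t) \log p(w \mid \theta_{Mix})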

Experimental Results

• Two query types: improvement is much larger for recurring queries than for fresh queries (recurring ≫ fresh)

• Among history sources: combination ≈ clickthrough > result documents > past queries > contextless baseline

Summary

• Mining user query logs can personalize search results and improve retrieval performance
– Four different models to exploit short-term query logs [Shen et al. 05]
– Assign weights to long-term query logs to reduce the effect of noise [Tan et al. 06]

References

• Xuehua Shen, Bin Tan, and ChengXiang Zhai. Context-sensitive information retrieval using implicit feedback. SIGIR 2005: 43-50.

• Bin Tan, Xuehua Shen, and ChengXiang Zhai. Mining long-term search history to improve search accuracy. KDD 2006: 718-723.

Thank you!

The End

Data Mining: Concepts and Techniques — Chapter 11 —

11.8. Online Analytical Processing on Multidimensional Text Database

Duo Zhang
Department of Computer Science
University of Illinois at Urbana-Champaign
http://sifaka.cs.uiuc.edu/~dzhang22/

Online Analytical Processing on Multidimensional Text Database

• Motivation
• Text Cube: Computing IR Measures for Multidimensional Text Database Analysis
• Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases

Motivation

• Industry and commercial applications often collect huge amounts of data containing both structured data records and unstructured text in a multidimensional text database, e.g.:
– Incident reports
– Job descriptions
– Product reviews
– Service feedback

• It is highly desirable and strategically important to support high-performance search and mining over such databases

Examples: Aviation Safety Reporting System (ASRS)

• How do we organize the data to help experts efficiently explore and digest text information? E.g., compare the reports in 1998 with the reports in 1999.

• How do we help experts analyze a specific type of anomaly in different contexts? E.g., what did pilots say about the anomaly "landing without clearance" during daylight vs. night?

Time    Location  Environment  …  Narrative
199801  TX        Daylight     …  … I TOLD HIM I WAS AT 2000 FT AND HE SAID OK …
199801  LA        Daylight     …  … WE STOPPED THE DSCNT AT CIRCLING MINIMUMS …
199801  LA        Night        …  … THE TAXI/LNDG LIGHTS VERY DIM. NO OTHER VISIBLE TFC IN SIGHT …
199902  FL        Night        …  … I FEEL WE SHOULD ALL EDUCATE OURSELVES ON CHKLISTS …

Online Analytical Processing on Multidimensional Text Database

• Motivation
• Text Cube: Computing IR Measures for Multidimensional Text Database Analysis (C. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao, ICDE'08)
• Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases

Text Cube

• A novel data cube model integrating the power of the traditional data cube and IR techniques for effective text mining
• Computes IR measures for multidimensional text database analysis
• Heterogeneous records are examined:
– Structured categorical attributes
– Unstructured free text
• IR statistics evaluated: TF-IDF, inverted index

Text Cube - Implementation

• Preprocessing: stemming, stop-word elimination, TF-IDF weighting

• Concept hierarchy construction
– A dimension hierarchy takes the form of a tree or a DAG; an attribute at a lower level reveals more details
– Four operations are supported: roll-up, drill-down, slice, and dice

• Term hierarchy construction
– A term hierarchy represents the semantic levels of terms in the text and their correlations
– Infused with expert knowledge
– Two novel operations: pull-up & push-down

Text Cube - Implementation (cont.)

• Partial materialization: if a non-materialized cell is retrieved, it is computed on the fly from the partially materialized cuboids (see the sketch below)

• A balance between time and space: given a time threshold δ, minimize the storage size subject to the bound δ on the query time for retrieving all cells of interest
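A minimal sketch of the on-the-fly aggregation idea in Python. The cell representation (a term-frequency dictionary plus a document count) is a deliberate simplification of the paper's IR measures, and avgTF here is simply total term frequency divided by document count.

def aggregate_cells(child_cells):
    """Combine the term statistics of materialized sub-cells into one cell."""
    tf, n_docs = {}, 0
    for cell in child_cells:
        n_docs += cell["n_docs"]
        for w, f in cell["tf"].items():
            tf[w] = tf.get(w, 0) + f                 # term frequencies add up
    avg_tf = {w: f / n_docs for w, f in tf.items()}  # e.g. an avgTF measure
    return {"tf": tf, "n_docs": n_docs, "avg_tf": avg_tf}

cell_tx = {"tf": {"altitude": 30, "tower": 12}, "n_docs": 10}
cell_la = {"tf": {"altitude": 18, "light": 9}, "n_docs": 8}
print(aggregate_cells([cell_tx, cell_la])["avg_tf"])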

Experiment - Efficiency and Effectiveness

[Figures: comparison of avgTF under different "Environment: Weather Elements" values, and under different "Supplementary: Problem Areas" values.]

Online Analytical Processing on Multidimensional Text Database

• Motivation
• Text Cube: Computing IR Measures for Multidimensional Text Database Analysis
• Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases (D. Zhang, C. Zhai, and J. Han, SDM'09)

Motivation: the same Aviation Safety Reporting System scenario as above, i.e., comparing reports across years and analyzing an anomaly such as "landing without clearance" during daylight vs. night.

Solution: Topic Cube

• Challenges:
– How to support operations along the topic dimension?
– How to quickly extract semantic topics?

[Figure: a topic cube with a Time dimension (98.01, 98.02, 99.01, 99.02), a Location dimension (LAX, SJC, MIA, AUS), and a Topic dimension (overshoot, undershoot, birds, turbulence). Rolling up aggregates Time to years (1998, 1999), Location to states (CA, FL, TX), and Topic to parent topics (Deviation, Encounter); drilling down reverses this.]

Constructing Topic Cube

Starting from the multidimensional text database:

Time   Loc  Env       …  Narrative
98.01  TX   Daylight  …
98.01  LA   Daylight  …
98.01  LA   Night     …
99.02  FL   Night     …

the topic dimension is a hierarchy over anomaly events, and a word distribution is estimated for every node, e.g.:

ALL
  Anomaly Altitude Deviation
    Undershoot
    Overshoot
    ……
  Anomaly Maintenance Problem
    Improper Documentation
    Improper Maintenance
    ……
  Anomaly Inflight Encounter
    Birds
    Turbulence
    ……

Roll-up and drill-down move between the levels of this hierarchy. Example word distributions attached to the nodes: (descent 0.06, cloud 0.03, ft 0.01, …), (descent 0.05, system 0.02, view 0.01, …), (altitude 0.03, ft 0.02, climb 0.01, …), (altitude 0.04, ft 0.03, instruct 0.01, …).

Materialization

Cells are indexed by the standard dimensions (e.g., Location) and the topic dimension (e.g., Anomaly Event): C_{LAX-overshoot}, C_{LAX-altitude}, …, C_{LAX-all}; C_{CA-overshoot}, C_{CA-altitude}, …, C_{CA-all}; C_{US-overshoot}, C_{US-altitude}, …, C_{US-all}. Rather than running PLSA from scratch in every cell, EM in an aggregated cell is initialized from the materialized results of its sub-cells, along either dimension:

M_{topic-agg} (aggregate along the topic dimension): initialize the word distribution of topic j at level L from its level-(L+1) child topics i' \in \{s_i^{(L+1)}, \ldots, e_i^{(L+1)}\}:

p^{(0)}(w \mid \theta^{(L)}_{c,j}) = \frac{\sum_{i'} \sum_{d} c(w,d)\, p(z_{d,w} = i')}{\sum_{w'} \sum_{i'} \sum_{d} c(w',d)\, p(z_{d,w'} = i')}

M_{std-agg} (aggregate along a standard dimension): initialize topic j in cell c from its sub-cells c_1, \ldots, c_{n_a} with document sets D_{c_i}:

p^{(0)}(w \mid \theta_{c,j}) = \frac{\sum_{i=1}^{n_a} \sum_{d \in D_{c_i}} c(w,d)\, p(z_{d,w} = j)}{\sum_{w'} \sum_{i=1}^{n_a} \sum_{d \in D_{c_i}} c(w',d)\, p(z_{d,w'} = j)}

Here c(w,d) is the count of word w in document d, and p(z_{d,w} = j) is the word-level topic posterior saved when the sub-cell was materialized. (A sketch of the second rule follows.)
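A minimal NumPy sketch of the M_{std-agg} initialization rule above; it assumes each materialized child cell saved its count matrix and the word-level posteriors p(z_{d,w} = j) for topic j:

import numpy as np

def std_agg_init(child_cells):
    """child_cells: list of (counts, p_z) pairs for one topic j, where counts
    is an (n_docs, V) count matrix of a child cell and p_z an (n_docs, V)
    array of posteriors p(z_{d,w} = j) saved at materialization time."""
    expected = sum((counts * p_z).sum(axis=0) for counts, p_z in child_cells)
    return expected / expected.sum()   # p^(0)(w | theta_{c,j}) for the parent cell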

Experimental Results

Word distributions of the topic "landing without clearance" in two contexts:

Context    Word                         p(w|θ)
daylight   Tower                        0.075
           Pattern                      0.061
           Final                        0.060
           Runway                       0.053
           Land                         0.052
           Downwind                     0.039
night      Tower                        0.035
           Runway                       0.029
           Light                        0.027
           Instrument Landing System    0.015
           Beacon                       0.014

[Figure: objective function vs. iterations and vs. time (sec.), measuring closeness to the optimum point.]

Sample narratives:

"…WINDS ALOFT AT PATTERN ALT OF 1000 FT MSL, WERE MUCH STRONGER AND A DIRECT XWIND. NEEDLESS TO SAY, THE PATTERNS AND LNDGS WERE DIFFICULT FOR MY STUDENT AND THERE WAS LIGHT TURB ON THE DOWNWIND…"

"…I LISTENED TO HWD ATIS AND FOUND THE TWR CLOSED AND AN ANNOUNCEMENT THAT THE HIGH INTENSITY LIGHTS FOR RWY 28L WERE INOP. BROADCASTING IN THE BLIND AND LOOKING FOR THE TWR BEACON AND LOW INTENSITY LIGHTS AGAINST A VERY BRIGHT BACKGROUND CLUTTER OF STREET LIGHTS, ETC…"