Data Mining: Concepts and Techniques — Chapter 10 — 10.3.2 Mining Text and Web Data (II)
04/24/23 1
Data Mining: Concepts and Techniques
— Chapter 10 —10.3.2 Mining Text and Web Data (II)
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
Acknowledgements: Slides by students at CS512 (Spring 2009)
Outline
• Probabilistic Topic Models (Yue Lu)
• Opinion Mining (Hyun Duk Kim)
• Mining Query Logs for Personalized Search (Yuanhua Lv)
• Online Analytical Processing on Multidimensional Text Database (Duo Zhang)
Probabilistic Topic Models
Yue Lu
Department of Computer Science
University of Illinois, Urbana-Champaign
Many slides are adapted/taken from different sources, including presentations by ChengXiang Zhai, Qiaozhu Mei and Tom Griffiths
Intuition
• Documents exhibit multiple topics.
topic: Social network website
topic: education
topic: criticism
What is a Topic?
Topic: A broad concept/theme, semantically coherent, which is hidden in documents
Representation: a multinomial distribution over words, i.e., a unigram language model
Example: retrieval 0.2, information 0.15, model 0.08, query 0.07, language 0.06, feedback 0.03, …
Organize Information with Topics
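[Figure: units for organizing information, compared by resolution: Words, Phrases, Entities, Topics, Categories, Patterns, …]
How many in a document? From thousands (words) through hundreds, 50~100 and several, down to 1 (category).
Examples:
• Words: oil, new, put, …, orleans, is, …
• Phrases: new orleans, put together, ..
• Entities: new orleans, president bush, ..
• Topics: oil price, government response, loss statistics, …
• Category: Natural hazards
Example topic (oil price): price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, …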
The Usage of Topic Models
• Usage of a topic model:
– Summarize themes/aspects
– Navigate documents
– Retrieve documents
– Segment documents
– Document classification
– Document clustering
Example topics (word distributions):
Topic 1: government 0.3, response 0.2, ...
Topic 2: city 0.2, new 0.1, orleans 0.05, ...
…
Topic k: donate 0.1, relief 0.05, help 0.02, ...
Background B: is 0.05, the 0.04, a 0.03, ...
Example document, with topic segments in brackets:
[ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] … [ Over seventy countries pledged monetary donations or other assistance ]. …
General Idea of Probabilistic Topic Models
• Cast intuition into a generative probabilistic process (Generative Process)
– Each document is a mixture of corpus-wide topics (multinomial distribution/unigram LM)
– Each word is drawn from one of those topics
• Since we only observe the documents, we need to figure out (Estimation/Inference)
– What are the topics?
– How are the documents divided according to those topics?
• Two basic models: PLSA and LDA
PLSA: Generation Process
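[Figure: generating a word w in a document d. With probability λ_B the word is drawn from the collection background model θ_B; otherwise a topic θ_j is chosen with document-specific weight π_{d,j} and w is drawn from that topic.]
Topics:
θ_1: battery 0.3, life 0.2, ..
θ_2: design 0.1, screen 0.05, ..
…
θ_k: price 0.2, purchase 0.15, ..
Collection background θ_B: is 0.05, the 0.04, a 0.03, ..
Document topic weights: π_{d,1}, π_{d,2}, …, π_{d,k}
[Hofmann 99], [Zhai et al. 04]
Parameters: λ_B = noise level (manually set); the θ's and π's need to be estimated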
PLSA: Estimation
[Figure: the same generation process, with every parameter value unknown.]
Topics: θ_1: battery ?, life ?; θ_2: design ?, screen ?; …; θ_k: price ?, purchase ?
Collection background θ_B: is ?, the ?, a ?
Document topic weights: π_{d,1} ?, π_{d,2} ?, …, π_{d,k} ?
[Hofmann 99], [Zhai et al. 04]
Objective: log-likelihood of the collection
Estimated with Maximum Likelihood Estimator (MLE) through an EM algorithm
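The estimation step can be sketched in a few lines. The block below is a minimal illustration, assuming the mixture p(w|d) = λ_B p(w|θ_B) + (1 − λ_B) Σ_j π_{d,j} p(w|θ_j) with a fixed background model; the function name `plsa_em`, the random initialisation, the tiny pseudo-count and the choice of collection frequencies as background are assumptions for the example, not taken from the slides.

```python
import math
import random
from collections import Counter

def plsa_em(docs, k, lambda_b=0.5, iters=50, seed=0):
    """EM for PLSA with a fixed background model (illustrative sketch)."""
    rng = random.Random(seed)
    counts = [Counter(d) for d in docs]
    vocab = sorted({w for d in docs for w in d})
    total = sum(sum(c.values()) for c in counts)
    # Assumed background model: relative word frequency in the collection.
    p_bg = {w: sum(c[w] for c in counts) / total for w in vocab}

    def normalize(weights):
        s = sum(weights.values())
        return {key: v / s for key, v in weights.items()}

    # Random initialisation of topic-word (theta) and document-topic (pi) distributions.
    theta = [normalize({w: rng.random() + 0.1 for w in vocab}) for _ in range(k)]
    pi = [normalize({j: rng.random() + 0.1 for j in range(k)}) for _ in docs]

    def log_likelihood():
        ll = 0.0
        for d, c in enumerate(counts):
            for w, n in c.items():
                mix = sum(pi[d][j] * theta[j][w] for j in range(k))
                ll += n * math.log(lambda_b * p_bg[w] + (1 - lambda_b) * mix)
        return ll

    history = [log_likelihood()]
    for _ in range(iters):
        # Tiny pseudo-counts keep every probability strictly positive.
        new_theta = [dict.fromkeys(vocab, 1e-12) for _ in range(k)]
        new_pi = [dict.fromkeys(range(k), 1e-12) for _ in docs]
        for d, c in enumerate(counts):
            for w, n in c.items():
                topic_part = [pi[d][j] * theta[j][w] for j in range(k)]
                mix = sum(topic_part)
                # E-step: probability that the word came from the background.
                p_b = lambda_b * p_bg[w] / (lambda_b * p_bg[w] + (1 - lambda_b) * mix)
                for j in range(k):
                    # E-step: expected count of (d, w) assigned to topic j.
                    frac = n * (1 - p_b) * topic_part[j] / mix
                    new_theta[j][w] += frac
                    new_pi[d][j] += frac
        # M-step: renormalise the expected counts.
        theta = [normalize(t) for t in new_theta]
        pi = [normalize(p) for p in new_pi]
        history.append(log_likelihood())
    return theta, pi, history
```

The returned `history` holds the collection log-likelihood after each iteration, which is a convenient check that EM is not decreasing the objective.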
Problems with PLSA
– “Documents have no generative probabilistic semantics”
• i.e., a document is just a symbol
– Model has many parameters
• linear in the number of documents
• need heuristic methods to prevent overfitting
– Cannot generalize to new documents
Basic Idea of LDA
• Adding a Dirichlet Prior α on topic distribution in documents
• Adding a Dirichlet Prior β on word distribution in topics
• α, β can be vectors, but for convenience, α = α1= α2=…; β = β1 = β2=… (Smoothed LDA)
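[Figure: the PLSA generation process with Dirichlet priors added: α on the document topic weights π_{d,1}, …, π_{d,k} and β on the topic word distributions θ_1, …, θ_k.]
[Blei et al. 03], [Griffiths&Steyvers 02, 03, 04]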
Dirichlet Hyperparameters α, β
• Generally have a smoothing effect on multinomial parameters
• Large α, β : more smoothed topic/word distribution
• Small α, β: more skewed topic/word distribution (e.g. bias towards a few words for each topic)
• Common settings: α=50/K, β=0.01
• PLSA is maximum a posteriori estimated LDA when using uniform prior: α=1, β=1
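The smoothing effect of the hyperparameter can be checked numerically. The sketch below draws samples from a symmetric Dirichlet and reports the average size of the largest component; the helper names, the dimension of 10 and the sample count are assumptions chosen only for illustration.

```python
import random

def sample_symmetric_dirichlet(alpha, dim, rng):
    """Draw one sample from Dirichlet(alpha, ..., alpha) via normalised Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [x / total for x in draws]

def mean_largest_component(alpha, dim=10, samples=2000, seed=0):
    """Average of the largest probability in each sampled multinomial."""
    rng = random.Random(seed)
    largest = (max(sample_symmetric_dirichlet(alpha, dim, rng)) for _ in range(samples))
    return sum(largest) / samples

if __name__ == "__main__":
    for alpha in (0.1, 1.0, 10.0):
        print(alpha, round(mean_largest_component(alpha), 3))
```

A small α concentrates most of the mass on one or two components (a skewed distribution), while a large α pushes every sample towards the uniform distribution.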
Inference
• Exact inference is intractable
• Approximation techniques:
– Mean field variational methods (Blei et al., 2001, 2003)
– Expectation propagation (Minka and Lafferty, 2002)
– Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
– Collapsed variational inference (Teh et al., 2006)
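Of the techniques listed, collapsed Gibbs sampling is the shortest to write down. The following is a minimal sketch, assuming symmetric priors α and β and documents given as lists of integer word ids; the function name and default values are hypothetical, and no burn-in or convergence check is included.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    n_k = [0] * num_topics                                # total tokens per topic
    z = []                                                # topic assignment of every token
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(num_topics)                 # random initial assignment
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # p(z = t | all other assignments) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    # Point estimates of the topic-word and document-topic distributions.
    phi = [[(n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta) for w in range(vocab_size)]
           for t in range(num_topics)]
    theta = [[(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha) for t in range(num_topics)]
             for d, doc in enumerate(docs)]
    return phi, theta
```

The estimates are read off the final sample only; averaging over several samples after burn-in would be the more careful choice.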
Would like to know more?
• “Parameter estimation for text analysis” by Gregor Heinrich
• “Probabilistic topic models” by Mark Steyvers
Agenda
• Overview
• Opinion finding & sentiment classification
• Opinion Summarization
• Other works
• Discussion & Conclusion
04/24/23 Data Mining: Principles and Algorithms 19
Web 2.0
• “Web 2.0 is the business revolution in the computer industry caused by the move to the Internet as a platform, and an attempt to understand the rules for success on that new platform.” [Wikipedia]
• Users participate in content creation
– e.g., blogs, reviews, Q&A forums
Opinion Mining
• Huge volume of opinions on the Web
– e.g., product reviews, blog posts about political issues
• Need a good technique to summarize them
• Example of a commercial system (MS Live Search)
Usefulness of opinion mining
• Individuals
– Purchasing a product/service
– Tracking political topics
– Other decision-making tasks
• Businesses and organizations
– Product and service benchmarking
– Survey on a topic
• Ads placements
– Place an ad when one praises a product
– Place an ad from a competitor if one criticizes a product
[Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]
Subtasks
• Opinion finding & sentiment classification
– Opinion finding: whether the target text is opinion or fact
– Sentiment classification: whether the opinion is positive or negative; in more detail, positive/negative/mixed
– Methods: lexicon-based method, machine learning
• Opinion Summarization
– How to show opinion finding/classification results effectively
– Methods: basic statistics; feature-level summary [Hu & Liu, KDD'04 / Hu & Liu, AAAI'04]; summary paragraph generation [Kim et al, TAC'08]; probabilistic analysis [Mei et al, WWW'07]
• Other works
Opinion Finding
• Lexicon-based method
– Prepare an opinion word list, e.g., words: ‘good’, ‘bad’ / phrases: ‘I think’, ‘In my opinion’
– Check special parts of speech expressing opinions, e.g., adjectives: ‘excellent’, ‘horrible’ / verbs: ‘like’, ‘hate’
– Decision based on the occurrences of those words
– Lexicon sources: manually classified word lists; WordNet; external sources: Wikipedia (objective), review data (subjective)
• Machine learning
– Train with tagged examples
– Main features: opinion lexicons, part-of-speech tags, punctuation (e.g., !), modifiers (e.g., not, very), word tokens, dependency
Opinion Sentiment Classification
• Method: similar to opinion finding
– Lexicon-based method
– Machine learning
• Instead of ‘opinionated’ words/examples, use ‘positive’ and ‘negative’ words/examples
– If positive or negative is dominant -> positive or negative
– If both positive and negative are dominant -> mixed
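The lexicon-based decision rule can be sketched as follows. The word lists reuse the example words given for opinion finding; the `dominance` threshold that decides when one polarity "dominates", the tokenisation and the label for text without opinion words are assumptions made for this illustration.

```python
# Example opinion words from the lexicon-based method; real lexicons are far larger.
POSITIVE = {"good", "excellent", "like"}
NEGATIVE = {"bad", "horrible", "hate"}

def classify_sentiment(text, dominance=2.0):
    """Label text as positive, negative or mixed by counting lexicon hits.

    `dominance` is a hypothetical threshold: one polarity wins when its
    count is at least `dominance` times the count of the other polarity.
    """
    tokens = [t.strip(".,!?;:'\"").lower() for t in text.split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos == 0 and neg == 0:
        return "no opinion words"
    if pos >= dominance * neg:
        return "positive"
    if neg >= dominance * pos:
        return "negative"
    return "mixed"
```

A rule of this kind ignores modifiers such as "not" and context-dependent polarity, which is exactly where the machine learning features come in.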
Opinion Sentiment Classification
• Query-dependent sentiment classification [Lee et al, TREC '08 / Jia et al, TREC '08]
– Motivation: sentiments are expressed differently in different queries, e.g., small can be good for ipod size, but can be bad for LCD monitor size
– Use external web sources to obtain positive and negative opinionated lexicons
• Key ideas
– Objective words: Wikipedia, product specification part of Amazon.com
– Subjective words: reviews from Amazon.com, Rateitall.com and Epinions.com
– Reviews rated 4 or 5 out of 5: positive words
– Reviews rated 1 or 2 out of 5: negative words
• Top ranked in the Text Retrieval Conference
[Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]
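The rating-based lexicon idea can be illustrated with a short sketch. It assumes reviews arrive as (rating, text) pairs; the function name and the `min_ratio` selection heuristic are hypothetical and are not the method of the cited TREC systems.

```python
from collections import Counter

def build_polarity_lexicon(reviews, min_ratio=2.0):
    """reviews: iterable of (rating from 1 to 5, text).

    Returns (positive_words, negative_words). A word is kept for one polarity
    when its review count there is at least `min_ratio` times its count
    (plus one) on the other side; this threshold is an assumed heuristic.
    """
    pos, neg = Counter(), Counter()
    for rating, text in reviews:
        words = set(text.lower().split())   # count each word once per review
        if rating >= 4:
            pos.update(words)               # reviews rated 4 or 5: positive evidence
        elif rating <= 2:
            neg.update(words)               # reviews rated 1 or 2: negative evidence
        # Reviews rated 3 are ignored.
    positive = {w for w in pos if pos[w] >= min_ratio * (neg[w] + 1)}
    negative = {w for w in neg if neg[w] >= min_ratio * (pos[w] + 1)}
    return positive, negative
```

Running this separately on the reviews retrieved for each query gives a query-dependent lexicon, so "small" can end up positive for one product and negative for another.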
Opinion Summarization
• Basic statistics
– Show the number of opinions of each polarity
– e.g., opinions about ipod:
Positive | Negative
80% | 20%
Opinion Summarization (cont.)
• Feature-based summary [Hu & Liu, KDD '04 / Hu & Liu, AAAI '04]
– Find lower-level features and analyze them, e.g., opinions about ipod
– Feature extraction: usually nouns / noun phrases
– Frequent feature identification: association mining
– Feature pruning and infrequent feature identification based on heuristic rules
– Sentiment summary for each feature
Feature | Pos | Neg
Battery life | 50% | 50%
Design | 95% | 5%
Price | 30% | 70%
Opinion Summarization (cont.)
• Summary paragraph generation [Kim et al, TAC '08]
– General NLP summarization techniques: sentence-extraction-based summary
– Opinion filtering: show opinionated sentences; show sentences having the same polarity as the goal of the summary
– Opinion ordering: paragraph division by opinion polarity
[Paragraph1] …
Following are positive opinions…
Following are negative opinions…
[Paragraph2] …
Following are mixed opinions… …
Opinion Summarization (cont.)
• Probabilistic analysis
– Topic sentiment mixture model [Mei et al, WWW '07]
– Topic modeling with opinion priors
Figure. The generation process of the topic-sentiment mixture model
Other works
• Comparative analysis
– Focus on texts having contradiction or comparison
– Finding comparative sentences [Jindal & Liu, SIGIR '06]
• Comparison indicators such as ‘than’ or ‘as well as’, e.g., ‘Ipod’ is better than ‘Zune’
• Sequential patterns showing comparative sentences, e.g., ⟨{NN}{VBZ}{RB}{moreJJR}{NN}{IN}{NN}⟩ → comparative
– Finding preferred entity [Murthy & Liu, COLING '08]
• Rule-based approach
• Context-dependent orientation finding using Pros and Cons reviews
Other works
• Opinion Integration [Lu & Zhai, WWW '08]
– Integrate expert reviews with an arbitrary text collection
• Expert reviews: well structured, easy to find features, not often updated
• Arbitrary text: not structured, varied and updated data
– Semi-supervised topic model
• Extract structured aspects (features) from the expert review to cluster general documents
• Add supplementary opinions from general documents
Challenges in opinion mining
• Polarity terms are context sensitive
– e.g., small can be good for ipod size, but can be bad for LCD monitor size
– Even in the same domain, different words are used depending on the target feature, e.g., long ‘ipod’ battery life vs. long ‘ipod’ loading time
– Partially solved (query-dependent sentiment classification)
• Implicit and complex opinion expressions
– Rhetorical expressions, metaphor, double negation, e.g., “The food was like a stone”
– Need both good IR and NLP techniques for opinion mining
• Cannot divide into pos/neg clearly
– Not all opinions can be classified into two categories
– Interpretation can change based on conditions, e.g.,
1) The battery life is ‘long’ if you do not use LCD a lot (pos)
2) The battery life is ‘short’ if you use LCD a lot (neg)
– Current systems classify the first one as positive and the second one as negative, although both state the same fact
[Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]
Discussion
• A difficult task
• Essential for many blog or review mining techniques
• Current stage of opinion finding
– Good performance at the sentence level, in specific domains, and on sub-problems
– Still low accuracy in the general case
– MAP scores of the top-performing systems at TREC '08: opinion finding 0.4569; polarity finding 0.2297~0.2723
• A lot of margin to improve!
References
• I. Ounis, C. Macdonald and I. Soboroff. Overview of the TREC 2008 Blog Track. TREC, 2008.
• Opinion Mining and Summarization: Sentiment Analysis. Tutorial given at WWW-2008, April 21, 2008 in Beijing, China.
• Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, ChengXiang Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. Proceedings of the 16th International World Wide Web Conference (WWW'07), pages 171-180, 2007.
• Minqing Hu and Bing Liu. Mining and summarizing customer reviews. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004, full paper), Seattle, Washington, USA, Aug 22-25, 2004.
• Minqing Hu and Bing Liu. Mining Opinion Features in Customer Reviews. Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-2004), San Jose, USA, July 2004.
• Yue Lu and ChengXiang Zhai. Opinion Integration Through Semisupervised Topic Modeling. Proceedings of the 17th International World Wide Web Conference (WWW'08).
• Kavita Ganesan, Hyun Duk Kim. Opinion Mining: A Short Tutorial, 2008.
• Hyun Duk Kim, Dae Hoon Park, V.G. Vinod Vydiswaran, and ChengXiang Zhai. Opinion Summarization Using Entity Features and Probabilistic Sentence Coherence Optimization: UIUC at TAC 2008 Opinion Summarization Pilot. Text Analysis Conference (TAC), Maryland, USA.
References
• Y. Lee, S.-H. Na, J. Kim, S.-H. Nam, H.-Y. Jung and J.-H. Lee. KLE at TREC 2008 Blog Track: Blog Post and Feed Retrieval. TREC, 2008.
• L. Jia, C. Yu and W. Zhang. UIC at TREC 2008 Blog Track. TREC, 2008.
• Nitin Jindal and Bing Liu. Identifying Comparative Sentences in Text Documents. Proceedings of the 29th Annual International ACM SIGIR Conference on Research & Development on Information Retrieval (SIGIR-06), Seattle, 2006.
• Opinion Mining and Summarization (including review spam detection). Tutorial given at WWW-2008, April 21, 2008 in Beijing, China.
• Murthy Ganapathibhotla and Bing Liu. Mining opinions in comparative sentences. Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 241–248, Manchester, August 2008.
Mining User Query Logs for Personalized Search
Yuanhua Lv
(Some slides are taken from Xuehua Shen, Bin Tan, and ChengXiang Zhai’s presentation)
Problem of Current Search Engines
Query: “Jaguar”
Possible meanings: Car, Apple Software, Animal, Chemistry Software
Suppose we know:
1. Short-term query logs: previous query = “racing cars”. [Shen et al. 05]
2. Long-term query logs: “car” occurs far more frequently than “Apple” in the user’s query logs of the recent 2 months. [Tan et al. 06]
Problem Definition
[Figure: a sequence of user queries Q_1, Q_2, …, Q_k. Each earlier query Q_i is followed by its clickthrough C_i = {C_{i,1}, C_{i,2}, C_{i,3}, …}; the user information need behind the current query Q_k is unknown (?).]
User query, e.g., Apple software
User clickthrough, e.g., Apple - Mac OS X The Apple Mac OS X product page. Describes features in the current version of Mac OS X, a screenshot gallery, latest software downloads, and a directory of ...
How to model and mine user query logs?
Retrieval Model
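Basis: Unigram language model + KL divergence
[Figure: the query Q_k yields a query model θ_{Q_k} and each document D a document model θ_D; the similarity measure D(θ_{Q_k} || θ_D) produces the results. Mining the query logs of user U updates the query model to θ'_{Q_k}, and the results are then ranked by D(θ'_{Q_k} || θ_D).]
Without query logs: $p(w \mid \theta_k) = p(w \mid Q_k)$
Mining query logs to update the query model: $p(w \mid \theta_k) = p(w \mid Q_k, Q_1, \ldots, Q_{k-1}, C_1, \ldots, C_{k-1})$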
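Ranking by KL divergence between unigram models can be sketched as follows. The document models are smoothed with Jelinek-Mercer interpolation so that no query word has zero probability; that smoothing choice, the parameter `lam` and all function names are assumptions for the example rather than details given on the slide.

```python
import math
from collections import Counter

def unigram_lm(tokens):
    """Maximum likelihood unigram model as a {word: probability} dict."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def smoothed_doc_lm(doc_tokens, collection_lm, lam=0.5):
    """Jelinek-Mercer smoothing: mix the document model with the collection model."""
    doc_lm = unigram_lm(doc_tokens)
    return {w: (1 - lam) * doc_lm.get(w, 0.0) + lam * p for w, p in collection_lm.items()}

def kl_divergence(query_lm, doc_lm):
    """D(theta_Q || theta_D), summed over words with non-zero query probability."""
    return sum(p * math.log(p / doc_lm[w]) for w, p in query_lm.items() if p > 0)

def rank(query_tokens, docs):
    """Return document indices ordered from most to least similar (smallest KL first)."""
    # Query tokens are added to the collection so every query word is covered.
    collection_lm = unigram_lm([w for d in docs for w in d] + list(query_tokens))
    query_lm = unigram_lm(query_tokens)
    scored = [(kl_divergence(query_lm, smoothed_doc_lm(d, collection_lm)), i)
              for i, d in enumerate(docs)]
    return [i for _, i in sorted(scored)]
```

Replacing `query_lm` with a model that also reflects earlier queries and clicks is all that personalization changes in this scoring scheme.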
Mining Short-term User Query Logs [Shen et al. 05]
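[Figure: the previous queries Q_1, …, Q_{k-1} form the query history H_Q and the previous clickthrough C_1, …, C_{k-1} forms the clickthrough history H_C; both are combined into the history model H, which is combined with the current query Q_k to give θ_k.]
Average user’s previous queries:
$p(w \mid H_Q) = \frac{1}{k-1} \sum_{i=1}^{k-1} p(w \mid Q_i)$
Average user’s previous clickthrough:
$p(w \mid H_C) = \frac{1}{k-1} \sum_{i=1}^{k-1} p(w \mid C_i)$
Combine previous clickthrough and previous queries:
$p(w \mid H) = \beta\, p(w \mid H_C) + (1-\beta)\, p(w \mid H_Q)$
Linearly interpolate current query and history model:
$p(w \mid \theta_k) = \alpha\, p(w \mid Q_k) + (1-\alpha)\, p(w \mid H)$
Four Heuristic Variants
• FixInt: fixed coefficient interpolation, $p(w \mid \theta_k) = \alpha\, p(w \mid Q_k) + (1-\alpha)\, p(w \mid H)$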
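The FixInt computation is a pair of averages followed by two interpolations, as in this sketch. Models are plain {word: probability} dicts; the function names are hypothetical, and the default coefficients are the FixInt setting (0.1, 1.0) reported in the evaluation.

```python
def average_models(models):
    """Uniform average of unigram models given as {word: probability} dicts."""
    vocab = {w for m in models for w in m}
    return {w: sum(m.get(w, 0.0) for m in models) / len(models) for w in vocab}

def interpolate(p, q, weight):
    """Return weight * p + (1 - weight) * q over the union vocabulary."""
    vocab = set(p) | set(q)
    return {w: weight * p.get(w, 0.0) + (1 - weight) * q.get(w, 0.0) for w in vocab}

def fixint(current_query, past_queries, past_clicks, alpha=0.1, beta=1.0):
    """FixInt query model from the current query and the search history."""
    h_q = average_models(past_queries)                  # p(w | H_Q)
    h_c = average_models(past_clicks)                   # p(w | H_C)
    history = interpolate(h_c, h_q, beta)               # p(w | H)
    return interpolate(current_query, history, alpha)   # p(w | theta_k)
```

For example, with the current query model {"jaguar": 1.0} and a history dominated by "racing" and "car", the updated model gives those words non-zero probability, which pulls car-related documents up the ranking.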
Mining Short-term User Query Logs [Shen et al. 05] (cont.)
Open question for the interpolation $p(w \mid \theta_k) = \alpha\, p(w \mid Q_k) + (1-\alpha)\, p(w \mid H)$: Fixed α?
Four Heuristic Variants
• FixInt: fixed coefficient interpolation
• BayesInt: adapt the interpolation coefficient to different query length
– Intuition: if the current query Q_k is longer, we should trust Q_k more
Mining Short-term User Query Logs [Shen et al. 05] (cont.)
Open questions: Fixed α? Should the history models H_Q and H_C be plain averages?
Four Heuristic Variants
• FixInt: fixed coefficient interpolation
• BayesInt: adapt the interpolation coefficient to different query length
– Intuition: if the current query Q_k is longer, we should trust Q_k more
• OnlineUp: assign more weight to more recent records.
• BatchUp: the user becomes better and better at query formulation as time goes on, but we do not need to “decay” the clickthrough.
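The BayesInt intuition can be made concrete with a Dirichlet-prior style update in which the history models act as pseudo-counts. The formula used below, p(w|θ_k) = (c(w, Q_k) + μ p(w|H_Q) + ν p(w|H_C)) / (|Q_k| + μ + ν), does not appear on the slides and should be read as an assumption about how BayesInt is usually stated; the function name and defaults are likewise illustrative.

```python
from collections import Counter

def bayesint(query_tokens, h_q, h_c, mu=0.2, nu=5.0):
    """Query model where history acts as pseudo-counts (assumed BayesInt form).

    h_q and h_c are {word: probability} history models for past queries and
    past clickthrough; mu and nu are their prior strengths.
    """
    counts = Counter(query_tokens)
    vocab = set(counts) | set(h_q) | set(h_c)
    denom = len(query_tokens) + mu + nu
    return {
        w: (counts[w] + mu * h_q.get(w, 0.0) + nu * h_c.get(w, 0.0)) / denom
        for w in vocab
    }
```

Because the query length sits in the denominator, the share of probability mass coming from history is (μ + ν) / (|Q_k| + μ + ν), which shrinks as the query gets longer.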
Data Set of Evaluation
• Data collection: TREC AP88-90
• Topics: 30 hard topics of TREC topics 1-150
• System: search engine + RDBMS
• Context: query and clickthrough history of 3 participants.
Overall Effect of Search Context
Parameter settings: FixInt (α=0.1, β=1.0), BayesInt (μ=0.2, ν=5.0), OnlineUp (μ=5.0, ν=15.0), BatchUp (μ=2.0, ν=15.0)
Query | FixInt MAP | FixInt pr@20 | BayesInt MAP | BayesInt pr@20 | OnlineUp MAP | OnlineUp pr@20 | BatchUp MAP | BatchUp pr@20
Q3 | 0.0421 | 0.1483 | 0.0421 | 0.1483 | 0.0421 | 0.1483 | 0.0421 | 0.1483
Q3+HQ+HC | 0.0726 | 0.1967 | 0.0816 | 0.2067 | 0.0706 | 0.1783 | 0.0810 | 0.2067
Improve | 72.4% | 32.6% | 93.8% | 39.4% | 67.7% | 20.2% | 92.4% | 39.4%
Q4 | 0.0536 | 0.1933 | 0.0536 | 0.1933 | 0.0536 | 0.1933 | 0.0536 | 0.1933
Q4+HQ+HC | 0.0891 | 0.2233 | 0.0955 | 0.2317 | 0.0792 | 0.2067 | 0.0950 | 0.2250
Improve | 66.2% | 15.5% | 78.2% | 19.9% | 47.8% | 6.9% | 77.2% | 16.4%
• Short-term query log helps system improve retrieval accuracy
• BayesInt better than FixInt; BatchUp better than OnlineUp
Mining Long-term User Query Log [Tan et al. 06]
• Can we mine long-term user query log similarly?
• Challenge: long-term query log is noisy
– How do we handle the noise?
– Can we still improve performance?
• Solution:
– Assign weights to the query log data (EM algorithm)
Hierarchical History Models
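[Figure: each past search session S_k (k = 1, …, t-1) consists of a query q_k, its result documents D_k and clickthrough C_k, and yields a unit history model θ_{S_k}; the current session S_t has query q_t and documents D_t. The unit models are combined into θ_H, which together with θ_q gives θ_{q,H}; documents are ranked by D(θ_{q,H} || θ_d).]
• Unit history model: θ_{S_k} ← q_k D_k C_k
• Overall history model: θ_H = Σ_k w_k θ_{S_k}, where the w_k are the weights for the query log units
• Original query model: θ_q
• Contextual query model: θ_{q,H}
• Document model: θ_d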
Discriminative Weighting with Mixture Model
[Figure: the unit history models θ_{S_1}, …, θ_{S_{t-1}}, the query model θ_q and a background model θ_B are mixed with unknown weights λ_1, …, λ_{t-1}, λ_q, λ_B into θ_Mix.]
Select {λ} to fit the data: maximize p(D_t | θ_Mix) with the EM algorithm
Example documents in D_t:
<d1> jaguar car perfect for racing
<d2> jaguar is a big cat...
<d3> locate jaguar dealer in champaign...
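Fitting the mixture weights with EM can be sketched as below, assuming the component models are fixed unigram distributions and only the weights λ are estimated. The function name, the uniform starting point and the decision to skip words that no component explains are assumptions for the illustration.

```python
from collections import Counter

def em_mixture_weights(tokens, components, iters=100):
    """Estimate weights of fixed unigram components to maximise p(tokens | mixture).

    components: list of {word: probability} dicts (for example unit history
    models plus a background model). Returns one weight per component.
    """
    counts = Counter(tokens)
    weights = [1.0 / len(components)] * len(components)   # start from uniform weights
    for _ in range(iters):
        expected = [0.0] * len(components)
        for word, count in counts.items():
            joint = [lam * comp.get(word, 0.0) for lam, comp in zip(weights, components)]
            total = sum(joint)
            if total == 0.0:
                continue   # word not explained by any component; skipped in this sketch
            # E-step: split the word's count across components by posterior probability.
            for i, value in enumerate(joint):
                expected[i] += count * value / total
        # M-step: the new weights are the normalised expected counts.
        norm = sum(expected)
        weights = [e / norm for e in expected]
    return weights
```

History units whose models explain the current result documents well receive large weights, while unrelated (noisy) units are driven towards zero.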
Experimental Results
Two query types: recurring ≫ fresh
Context sources: combination ≈ clickthrough > docs > query, contextless
Summary
• Mining user query logs can personalize search results and improve retrieval performance
– Four different models to exploit short-term query logs [Shen et al. 05].
– Assign weights to the long-term query logs to reduce the effect of noise [Tan et al. 06].
Reference
• Xuehua Shen, Bin Tan, ChengXiang Zhai: Context-sensitive information retrieval using implicit feedback. SIGIR 2005: 43-50
• Bin Tan, Xuehua Shen, ChengXiang Zhai: Mining long-term search history to improve search accuracy. KDD 2006: 718-723
Data Mining: Concepts and Techniques — Chapter 11 —
11.8. Online Analytical Processing on Multidimensional Text Database
Duo Zhang
Department of Computer Science
University of Illinois at Urbana-Champaign
http://sifaka.cs.uiuc.edu/~dzhang22/
Online Analytical Processing on Multidimensional Text Database
• Motivation
• Text Cube: Computing IR Measures for Multidimensional Text Database Analysis
• Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases
Motivation
• Industry and commercial applications often collect huge amounts of data containing both structured data records and unstructured text data in a multidimensional text database
– Incident reports
– Job descriptions
– Product reviews
– Service feedback
• It is highly desirable and strategically important to support high-performance search and mining over such databases
Examples
• Aviation Safety Reporting System
– How to organize the data to help experts efficiently explore and digest text information? e.g., compare the reports in 1998 and the reports in 1999
– How to help experts analyze a specific type of anomaly in different contexts? e.g., what did pilots say about the anomaly “landing without clearance” during daylight vs. night?
Time | Location | Environment | … | Narrative
199801 | TX | Daylight | … | …… I TOLD HIM I WAS AT 2000 FT AND HE SAID OK……
199801 | LA | Daylight | … | ……WE STOPPED THE DSCNT AT CIRCLING MINIMUMS……
199801 | LA | Night | … | ……THE TAXI/LNDG LIGHTS VERY DIM. NO OTHER VISIBLE TFC IN SIGHT……
199902 | FL | Night | … | ……I FEEL WE SHOULD ALL EDUCATE OURSELVES ON CHKLISTS……
Online Analytical Processing on Multidimensional Text Database
• Motivation
• Text Cube: Computing IR Measures for Multidimensional Text Database Analysis. C. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao (ICDE’08)
• Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases
Text Cube
• Text Cube
– A novel data cube model integrating the power of traditional data cube and IR techniques for effective text mining
– Computing IR measures for multidimensional text database analysis
• Heterogeneous records to be examined
– Structured categorical attributes
– Unstructured free text
• IR statistics are evaluated
– TF-IDF
– Inverted index
Text Cube - Implementation
• Preprocessing
– Stemming, stop word elimination, TF-IDF weighting
• Concept hierarchy construction
– A dimension hierarchy takes the form of a tree or a DAG. An attribute at a lower level reveals more details
– Four operations are supported: roll-up, drill-down, slice and dice
• Term hierarchy construction
– A term hierarchy represents semantic levels of terms in the text and their correlations
– Infusion with expert knowledge
– Two novel operations: pull-up and push-down
Text Cube - Implementation
• Partial materialization: if a non-materialized cell is retrieved, we compute it on-the-fly based on the partially materialized cuboids
• A balance between time and space: given a time threshold δ, we minimize the storage size subject to the query time bound δ for retrieving all cells of interest
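The kind of measure a text cube cell stores can be illustrated with a small sketch that groups records by chosen structured dimensions and computes the average term frequency per cell. It recomputes every cell directly from the records instead of using materialized cuboids; the record layout, field names and function name are assumptions for the example.

```python
from collections import defaultdict

def avg_tf_by_cell(records, term, dims):
    """Average frequency of `term` per cell of the cuboid defined by `dims`.

    records: list of dicts holding structured attributes plus a 'text' field.
    dims: tuple of attribute names; dropping a dimension corresponds to a roll-up.
    """
    totals = defaultdict(lambda: [0, 0])   # cell -> [term occurrences, documents]
    for record in records:
        cell = tuple(record[d] for d in dims)
        totals[cell][0] += record["text"].lower().split().count(term)
        totals[cell][1] += 1
    return {cell: tf / n for cell, (tf, n) in totals.items()}
```

Calling the function with `("Time", "Environment")` and then with `("Environment",)` shows the roll-up over time: the second call aggregates the cells of the first.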
Experiment – Efficiency and Effectiveness
[Figure: compare avgTF under different “Environment: Weather Elements”]
[Figure: compare avgTF under different “Supplementary: Problem Areas”]
Online Analytical Processing on Multidimensional Text Database
• Motivation
• Text Cube: Computing IR Measures for Multidimensional Text Database Analysis
• Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases. D. Zhang, C. Zhai, and J. Han (SDM’09)
Solution: Topic Cube
[Figure: a topic cube with three dimensions, each shown at two levels connected by drill-down and roll-up. Time: 98.01, 98.02, 99.01, 99.02 and 1998, 1999. Location: LAX, SJC, MIA, AUS and CA, FL, TX. Topic: overshoot, undershoot, birds, turbulence and Deviation, Encounter.]
Challenges:
• How to support operations along the topic dimension?
• How to quickly extract semantic topics?
Constructing Topic Cube
Multidimensional text database:
Time | Loc | Env | … | Narrative
98.01 | TX | Daylight | … |
98.01 | LA | Daylight | … |
98.01 | LA | Night | … |
99.02 | FL | Night | … |
Topic hierarchy:
ALL
– Anomaly Altitude Deviation: Undershoot, Overshoot, ……
– Anomaly Maintenance Problem: Improper Documentation, Improper Maintenance, ……
– Anomaly Inflight Encounter: Birds, Turbulence, ……
– ……
Example word distributions in topic cube cells (related by drill-down and roll-up):
Descent 0.06, Cloud 0.03, Ft 0.01, …
Descent 0.05, System 0.02, View 0.01, …
Altitude 0.03, Ft 0.02, Climb 0.01, …
Altitude 0.04, Ft 0.03, Instruct 0.01, …
Materialization
[Figure: a grid of cells indexed by a standard dimension (Location: LAX, CA, US) and the topic dimension (Anomaly Event: overshoot, altitude, all): C_{LAX-overshoot}, C_{LAX-altitude}, C_{LAX-all}; C_{CA-overshoot}, C_{CA-altitude}, C_{CA-all}; C_{US-overshoot}, C_{US-altitude}, C_{US-all}. M_topic-agg aggregates along the topic dimension; M_std-agg aggregates along the standard dimension.]
M_topic-agg:
$$p_c^{(0)}(w \mid \theta_i^{(L)}) = \frac{\sum_{j' \in \{s_i^{(L+1)}, \ldots, e_i^{(L+1)}\}} \sum_{d} c(w,d)\, p(z_{d,w}=j')}{\sum_{w'} \sum_{j' \in \{s_i^{(L+1)}, \ldots, e_i^{(L+1)}\}} \sum_{d} c(w',d)\, p(z_{d,w'}=j')}$$
M_std-agg:
$$p_c^{(0)}(w \mid \theta_j^{(L)}) = \frac{\sum_{c_i} \sum_{d \in D_{c_i}} c(w,d)\, p(z_{d,w}=j)}{\sum_{w'} \sum_{c_i} \sum_{d \in D_{c_i}} c(w',d)\, p(z_{d,w'}=j)}$$
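The standard-dimension aggregation can be sketched as a weighted count followed by a normalisation. The sketch assumes that c(w, d) is the count of word w in document d and that the posteriors p(z_{d,w} = j) are already stored for every child cell; the data layout and function name are assumptions for the illustration.

```python
from collections import defaultdict

def aggregate_standard_dimension(child_cells, topic):
    """Initial word distribution of `topic` in a parent cell.

    child_cells: list of cells; each cell is a list of documents; each document
    is a pair (word_counts, posterior) where word_counts[w] = c(w, d) and
    posterior[w][j] = p(z_{d,w} = j) from the child cell's fitted model.
    """
    weighted = defaultdict(float)
    for cell in child_cells:
        for word_counts, posterior in cell:
            for word, count in word_counts.items():
                # Numerator term c(w, d) * p(z_{d,w} = j), summed over cells and documents.
                weighted[word] += count * posterior[word][topic]
    total = sum(weighted.values())          # denominator: the same sum over all words
    return {word: value / total for word, value in weighted.items()}
```

The result serves as a starting point for estimating the parent cell's topic, so the parent does not have to be fitted from a random initialisation.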
Experimental Results
Word distributions for the topic “landing without clearance” in two contexts:
Context | Word | p(w|θ)
daylight | Tower | 0.075
daylight | Pattern | 0.061
daylight | Final | 0.060
daylight | Runway | 0.053
daylight | Land | 0.052
daylight | Downwind | 0.039
night | Tower | 0.035
night | Runway | 0.029
night | Light | 0.027
night | Instrument Landing System | 0.015
night | Beacon | 0.014
[Figure: plots labelled Objective Function, Iterations, Time (sec.), Closeness to the optimum point]
Example narratives:
…WINDS ALOFT AT PATTERN ALT OF 1000 FT MSL, WERE MUCH STRONGER AND A DIRECT XWIND. NEEDLESS TO SAY, THE PATTERNS AND LNDGS WERE DIFFICULT FOR MY STUDENT AND THERE WAS LIGHT TURB ON THE DOWNWIND…
…I LISTENED TO HWD ATIS AND FOUND THE TWR CLOSED AND AN ANNOUNCEMENT THAT THE HIGH INTENSITY LIGHTS FOR RWY 28L WERE INOP. BROADCASTING IN THE BLIND AND LOOKING FOR THE TWR BEACON AND LOW INTENSITY LIGHTS AGAINST A VERY BRIGHT BACKGROUND CLUTTER OF STREET LIGHTS, ETC…