Introduction to text mining and insights on bridging structured and unstructured data
Introduction to text mining
With a dive into structured and unstructured data
Sayali Kulkarni
October 23, 2010
Outline for today
- Quick refresher on data mining
- What is so special about text?
- Introduction to CSAW
- Annotation system
- Distributed indexing and retrieval system
- Future work
Data Mining I
- Data is useless if it does not make sense!
- Analyzing the data from different angles
- Important to know:
  - Data: what we get
  - Information: what we can use
  - Knowledge: how we use it
- Classes, clusters, association rules, patterns, sequences, ...
Data Mining II
- Different kinds of data:
  - Protein sequences
  - Genetic data
  - Network monitoring
  - Text data
  - Images/sound (multimedia data)
- Different challenges in each case
- Scaling, noise, generalization, overfitting, incorporating domain knowledge
Text Mining I
- Sources:
  - Textual data from the web
  - Data collected within organizations
  - Survey data and feedback
- Representation:
  - Using words as features
  - Data cleaning is a big task: spelling correction, stop word handling, stemming
  - The weight of a word depends on its importance in the document and its overall uniqueness
- Mining tasks:
  - Summarization
  - Document clustering
  - Document labelling
  - Search
  - ...
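The weighting idea above (importance of a word in the document times its overall uniqueness) is essentially TF-IDF. A minimal sketch in Python, assuming tokenized documents; the exact weighting any particular system uses may differ:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF: term frequency within a document times the log of the
    inverse document frequency (how unique the word is overall)."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["text", "mining", "data"],
        ["data", "mining"],
        ["text", "search"]]
tfidf_weights = tfidf(docs)
# "search" occurs in only one of the three documents, so it outweighs
# "data", which occurs in two
```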
Text Mining II
- Structure of the data is important
- Web data is diverse in nature:
  - Completely unstructured data like news, blogs, mails, forums
  - Partly structured data like Wikipedia, PubMed, and other domain-specific encyclopedias and dictionaries
  - Data contained in the text in the form of lists and tables is much more structured
- Adding semantics to such data
- Linking the structured and unstructured data
- One of the major applications of this is semantic search
Search today: Impedance mismatch
Search Engine
Our vision of next-gen search
Search Engine
Curating and Searching the Annotated Web
CSAW search paradigm I
Data Model
- IR indexes: limited expressiveness
- Relational databases: intricate schema knowledge
- CSAW: IR index (unstructured) + annotation and catalog index (structured)
CSAW search paradigm II
Query Capabilities
Querying text with type annotations

Response
Tables of entities, quantities (a special type of entity), and text fields
High level block diagram
Figure: CSAW - high level block diagram
Annotation System
Figure: Annotation Engine in CSAW
Terminology I
Figure: A plain page from unstructured data source
Terminology II
Spots
Figure: A spot on a page
A spot is an occurrence of text on a page that can possibly be linked to a Wikipedia article.
Related notation:
S0 — all candidate spots in a Web page
S ⊆ S0 — an arbitrary set of spots
s ∈ S — one spot, including its surrounding context
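How candidate spots are found is not detailed on the slide; a common approach, sketched here under that assumption, is a greedy longest-match scan against a dictionary of known mentions (e.g., Wikipedia anchor texts). The mention set below is hypothetical:

```python
def find_spots(tokens, mentions, max_len=3):
    """Greedy longest-match scan: a spot is any token span that
    appears in the mention dictionary."""
    spots, i = [], 0
    while i < len(tokens):
        # try the longest span starting at i first
        for L in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + L]).lower()
            if phrase in mentions:
                spots.append((i, i + L, phrase))
                i += L
                break
        else:
            i += 1
    return spots

mentions = {"michael jordan", "chicago bulls", "air jordan"}
tokens = "Michael Jordan played for the Chicago Bulls".split()
spots = find_spots(tokens, mentions)
# [(0, 2, 'michael jordan'), (5, 7, 'chicago bulls')]
```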
Terminology III
Possible attachments
Figure: Possible attachments for a spot
Attachments are Wikipedia entities that can possibly be linked to a spot.
Related notation:
Γs — candidate entity labels for spot s
Γ0 = ∪_{s∈S0} Γs — all candidate labels for the page
Γ ⊆ Γ0 — an arbitrary set of entity labels
γ ∈ Γ — an entity label value, here a Wikipedia entity
Entity Disambiguation
Figure: Disambiguation based on compatibility between spot and label
SemTag and Seeker [D+03] exploited this for entity disambiguation. It was the first Web-scale entity disambiguation system.
Collective Entity Disambiguation
Figure: Disambiguation based on local compatibility and topical coherence of spots
Example: a page with spots for Air Jordan, Michael Jordan, and the Chicago Bulls
- Cucerzan [Cuc07] was the first to recognize general interdependence between entity labels
- The work by Milne et al. [MW08] includes a limited form of collective disambiguation
Topical coherence based on entity catalog
Relatedness information from entity catalog

- How related are two entities γ, γ′ in Wikipedia?
- Embed γ in some space using g : Γ → R^c
- Define relatedness r(γ, γ′) = g(γ) · g(γ′), or a related measure
- Cucerzan's proposal: c = number of categories; g(γ)[τ] = 1 if γ belongs to category τ, 0 otherwise; the length of g(γ) is c.

  r(γ, γ′) = g(γ)ᵀg(γ′) / ( √(g(γ)ᵀg(γ)) √(g(γ′)ᵀg(γ′)) )

- Milne and Witten's proposal: c = number of Wikipedia pages; g(γ)[p] = 1 if page p links to page γ, 0 otherwise.

  r(γ, γ′) = ( log |g(γ) ∩ g(γ′)| − log max{|g(γ)|, |g(γ′)|} ) / ( log c − log min{|g(γ)|, |g(γ′)|} )
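Both measures can be computed directly from category and inlink sets. A sketch with hypothetical sets; note that, as written on the slide, the Milne–Witten formula yields values ≤ 0, with values nearer 0 meaning more related:

```python
import math

def cucerzan_relatedness(cats_a, cats_b):
    """Cosine between 0/1 category-indicator vectors g(γ):
    the dot product is the number of shared categories."""
    if not cats_a or not cats_b:
        return 0.0
    return len(cats_a & cats_b) / (math.sqrt(len(cats_a)) * math.sqrt(len(cats_b)))

def milne_witten_relatedness(inlinks_a, inlinks_b, n_pages):
    """The slide's formula: (log|∩| − log max) / (log c − log min).
    Identical inlink sets give 0; disjoint sets give −inf."""
    inter = len(inlinks_a & inlinks_b)
    if inter == 0:
        return -math.inf
    num = math.log(inter) - math.log(max(len(inlinks_a), len(inlinks_b)))
    den = math.log(n_pages) - math.log(min(len(inlinks_a), len(inlinks_b)))
    return num / den

# hypothetical category and inlink sets
r_cat = cucerzan_relatedness({"NBA players", "Sportspeople"},
                             {"NBA teams", "Sportspeople"})
# one of two categories shared on each side: r_cat == 0.5
r_links = milne_witten_relatedness({1, 2, 3, 4}, {3, 4, 5, 6}, n_pages=10_000)
```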
Dataset for evaluation I
- Documents (IITB) crawled from popular sites
- Publicly available data from Cucerzan's experiments (CZ)

|                            | IITB   | CZ   |
|----------------------------|--------|------|
| Number of documents        | 107    | 19   |
| Total number of spots      | 17,200 | 288  |
| Spots per 100 tokens       | 30     | 4.48 |
| Average ambiguity per spot | 5.3    | 18   |
Figure: Corpus statistics.
Dataset for evaluation II
More on the IITB dataset

- Collected a total of about 19,000 annotations
- Done by 6 volunteers
- About 50 man-hours spent collecting the annotations
- Exhaustive tagging by volunteers
- About 40% of spots were labeled NA

#Spots tagged by more than one person: 1390
#NA among these spots: 524
#Spots with disagreement: 278
#Spots with disagreement involving NA: 218
Figure: Inter-annotator agreement.
Human Supervision
- The system identifies spots and mentions
- It shows a pull-down list of (a subset of) Γs for each s
- The user selects γ* ∈ Γs ∪ {NA}
Our Approach
- Main contributions:
  - Refined node features (feature design)
  - Using inlink-based features for defining the coherence score (feature design)
  - Modified approach for collective inference (algorithm design)
Modeling local compatibility
- A feature vector fs(γ) ∈ R^d expresses local textual compatibility between (the context of) spot s and candidate label γ
- Components of fs(γ) are based on Wikipedia TFIDF vectors of:
  1. Snippet
  2. Full text
  3. Anchor text
  4. Anchor text with some tokens around it
  and use the similarity measures:
  1. Dot product
  2. Cosine similarity
  3. Jaccard similarity
Sense probability prior
- What entity does "Intel" refer to?
  - The chip design and manufacturing company
  - A fictional cartel in a 1961 BBC TV serial
- Pr0(γ|s) is very high for the chip maker, low for the cartel
- Append the element log Pr0(γ|s) to fs(γ)
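The sense prior Pr0(γ|s) is typically estimated from how often the anchor text s links to each candidate entity in Wikipedia; a sketch with made-up link counts:

```python
def sense_prior(anchor_counts):
    """Pr0(γ|s): the fraction of links with anchor text s that point
    to entity γ, estimated from (hypothetical) anchor statistics."""
    total = sum(anchor_counts.values())
    return {entity: count / total for entity, count in anchor_counts.items()}

# hypothetical link counts for the anchor text "Intel"
prior = sense_prior({"Intel_Corporation": 990, "Intel_(TV_serial)": 10})
# the chip maker dominates: prior["Intel_Corporation"] == 0.99
```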
Components of the objective
Node score

- Node scoring model w ∈ R^d
- Node score defined as wᵀfs(γ)
- w is trained to give suitable weights to the different compatibility measures
- At test time, the greedy choice local to s would be argmax_{γ∈Γs} wᵀfs(γ)

Clique score

- Use Milne's relatedness formulation
Two-part objective to maximize
Node potential:

NP(y) = ∏_s NPs(ys) = ∏_s exp( wᵀfs(ys) )

Clique potential:

CP(y) = ∏_{s≠s′} exp( r(ys, ys′) ) = exp( Σ_{s≠s′} r(ys, ys′) )

After taking logs and rescaling terms:

(1/|S0|) Σ_s wᵀfs(ys) + (1/C(|S0|,2)) Σ_{s≠s′} r(ys, ys′)

where C(n, 2) = n(n − 1)/2 is the binomial coefficient.
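The rescaled objective can be evaluated directly for a given labeling. A toy sketch in which a node_score callable stands in for wᵀfs(γ) and relatedness for r(·,·) — both toy functions are assumptions for illustration:

```python
from itertools import combinations

def objective(labels, node_score, relatedness):
    """Average node score plus average pairwise label relatedness:
    (1/|S0|) Σ_s node + (1/C(|S0|,2)) Σ_{s≠s'} relatedness."""
    spots = list(labels)
    n = len(spots)
    np_term = sum(node_score(s, labels[s]) for s in spots) / n
    cp_term = sum(relatedness(labels[a], labels[b])
                  for a, b in combinations(spots, 2)) / (n * (n - 1) // 2)
    return np_term + cp_term

# toy scores: "MJ" + "Bulls" is topically coherent, "AirJordan" + "Bulls" is not
node = lambda s, y: {"MJ": 0.5, "AirJordan": 0.8, "Bulls": 0.5}[y]
rel = lambda a, b: 1.0 if {a, b} == {"MJ", "Bulls"} else 0.0
coherent = objective({"s1": "MJ", "s2": "Bulls"}, node, rel)          # 0.5 + 1.0
incoherent = objective({"s1": "AirJordan", "s2": "Bulls"}, node, rel)  # 0.65 + 0
```

The coherent labeling wins despite the weaker local score for "MJ", which is exactly the point of the clique term.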
ILP formulation
- Cast as a 0/1 integer linear program
- Relax it to an LP
- Uses up to |Γ0| + |Γ0|² variables

Variables:

z_sγ = [spot s is assigned label γ ∈ Γs]
u_γγ′ = [both γ and γ′ are assigned to spots]
ILP formulation
Objective:

max_{z_sγ, u_γγ′} (NP′) + (CP1′)

Node potential:

(1/|S0|) Σ_{s∈S0} Σ_{γ∈Γs} z_sγ wᵀfs(γ)   (NP′)

Clique potential:

(1/C(|S0|,2)) Σ_{s≠s′∈S0} Σ_{γ∈Γs, γ′∈Γs′} u_γγ′ r(γ, γ′)   (CP1′)

Subject to constraints:

∀s, γ: z_sγ ∈ {0, 1};  ∀γ, γ′: u_γγ′ ∈ {0, 1}   (1)
∀s, γ, γ′: u_γγ′ ≤ z_sγ and u_γγ′ ≤ z_sγ′   (2)
∀s: Σ_γ z_sγ = 1   (3)
LP relaxation for the ILP formulation
- Relax the constraints in the formulation as:

  ∀s, γ: 0 ≤ z_sγ ≤ 1;  ∀γ, γ′: 0 ≤ u_γγ′ ≤ 1
  ∀s, γ, γ′: u_γγ′ ≤ z_sγ and u_γγ′ ≤ z_sγ′
  ∀s: Σ_γ z_sγ = 1

- The margin between the objective of the relaxed LP and the rounded LP is quite thin

Figure: Total objective versus tuning parameter for LP1-relaxed and LP1-rounded
Hill climbing algorithm
- Initialization mechanisms
- Label updates
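One way to realize these two steps — a hedged sketch, not necessarily the exact algorithm from the talk: initialize each spot with its locally best label, then greedily apply the single label update that most improves the joint (node + clique) objective:

```python
def hill_climb(candidates, node_score, relatedness, iters=10):
    """Hill climbing for collective disambiguation: start from the
    locally best label per spot, then repeatedly re-label the one spot
    whose change most improves the joint objective."""
    labels = {s: max(cands, key=lambda y: node_score(s, y))
              for s, cands in candidates.items()}

    def total(lab):
        spots = list(lab)
        t = sum(node_score(s, lab[s]) for s in spots)
        for i, a in enumerate(spots):
            for b in spots[i + 1:]:
                t += relatedness(lab[a], lab[b])
        return t

    for _ in range(iters):
        cur = total(labels)
        best, best_gain = None, 0.0
        for s, cands in candidates.items():
            for y in cands:
                if y == labels[s]:
                    continue
                trial = dict(labels)
                trial[s] = y
                gain = total(trial) - cur
                if gain > best_gain:
                    best, best_gain = (s, y), gain
        if best is None:          # no improving update: local optimum
            break
        labels[best[0]] = best[1]
    return labels

# toy instance: the prior favors the professor locally, but topical
# coherence with the Chicago Bulls flips the label to the player
candidates = {"jordan": ["Michael_Jordan", "Michael_I._Jordan"],
              "bulls": ["Chicago_Bulls"]}
node = lambda s, y: {"Michael_I._Jordan": 0.6, "Michael_Jordan": 0.5,
                     "Chicago_Bulls": 0.9}[y]
rel = lambda a, b: 1.0 if {a, b} == {"Michael_Jordan", "Chicago_Bulls"} else 0.0
result = hill_climb(candidates, node, rel)
```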
Backoff strategy I
- Allow backoff from tagging some spots
- Assign a special label "NA" to mark "no attachment"
- Reward a spot for attaching to NA: R_NA
- Spots marked NA do not contribute to the clique potential
- The smaller the value of R_NA, the more aggressive the tagging

How this affects our objective
N0 ⊆ S0: spots assigned NA
A0 = S0 \ N0: remaining spots
Final objective:

max_y (1/|S0|) ( Σ_{s∈N0} R_NA + Σ_{s∈A0} wᵀfs(ys) )   (NP)
     + (1/C(|A0|,2)) Σ_{s≠s′∈A0} r(ys, ys′)   (CP1)
Backoff strategy II
Issues
A0 depends on y, and hence the resulting optimization can no longer be written as an ILP.
Way around:

- Treat NA as a zero-topical-coherence label:

  r(NA, ·) = r(·, NA) = r(NA, NA) = 0

- The contribution to NP is still equal to R_NA

Modified objective:

max_y (1/|S0|) ( Σ_{s∈N0} R_NA + Σ_{s∈A0} wᵀfs(ys) )   (NP)
     + (1/C(|S0|,2)) Σ_{s≠s′∈A0} r(ys, ys′)   (CP1)
Multi-topic model
- The current clique potential encourages a single-cluster model
- The single-cluster hypothesis is not always true
- Partition the set of possible attachments as C = Γ1, ..., ΓK
- Refined clique potential for supporting the multi-topic model:

  (1/|C|) Σ_{Γk∈C} (1/C(|Γk|,2)) Σ_{s,s′: ys,ys′∈Γk} r(ys, ys′)   (CPK)

- Using C(|Γk|,2) instead of C(|S0|,2) rewards smaller coherent clusters
- The node score is not disturbed
System design of the annotation system
Evaluation of the annotation system
Evaluation measures:

Precision: number of spots tagged correctly out of the total number of spots tagged
Recall: number of spots tagged correctly out of the total number of spots in the ground truth
F1: 2 × Recall × Precision / (Recall + Precision)
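These measures can be computed over sets of (spot, label) pairs; a small sketch with hypothetical annotations:

```python
def evaluate(predicted, gold):
    """Precision, recall, and F1 over sets of (spot, label) pairs."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("s1", "Michael_Jordan"), ("s2", "Chicago_Bulls"), ("s3", "NBA")}
pred = {("s1", "Michael_Jordan"), ("s2", "Air_Jordan")}
p, r, f = evaluate(pred, gold)
# one of two predictions is correct and one of three gold spots is
# recovered: p = 0.5, r = 1/3, f = 0.4
```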
Results summary
- Selection of NP features is important
- Collective inference adds value

Evaluation:

|           | Our system | CZ     | Milne  |
|-----------|------------|--------|--------|
| Recall    | 70.7%      | 31.43% | 66.1%  |
| Precision | 68.7%      | 53.41% | 19.35% |
| F1        | 69.69%     | 39.57% | 29.94% |
Results summary I
Figure: Annotated page related to cricket
Results summary II
Figure: Annotated page related to finance
Query building blocks
Matcher: a word, phrase, or mention of a specific entity in the catalog, or a quantity
Target: a placeholder that the engine must instantiate, e.g., an entity of a given type or a quantity with a given unit (but possibly just an uninterpreted token sequence)
Context: any token segment that contains the specified matchers and instantiations of targets
Predicates: constraints over targets and context, e.g., text proximity, membership of an entity in a category, containment of a quantity in a range, ...
Aggregators: collect evidence in favor of candidate target instantiations from multiple contexts
Query example: Category targets
Tabulate French films with the number of Academy Awards that each won

- ?f ∈+ Category:French Films
- ?a ∈ Qtype:Number
- InContext(?c; ?f, ?a, "academy awards", won)
- Evidence aggregator Consensus(?c)
- Resulting in an output table with two columns 〈?f, ?a〉

Tabulate physicists and the musical instruments they played

- ?p ∈+ Category:Physicist
- ?m ∈+ Category:Musical Instrument
- InContext(?c; ?p, ?m, played)
- Evidence aggregator Consensus(?c)
Subqueries and joins
- ?f ∈+ Category:French Film
- ?a ∈ Qtype:Number, ?p ∈ Qtype:MoneyAmount
- InContext(?c1; ?f, ?a, "academy awards", won)
- InContext(?c2; ?f, ?p, production cost, budget)
- Consensus(?c1, ?c2)
- Output 〈?f, ?a, ?p〉
- Note that the number of Academy Awards and the production cost may come from different Web pages
Distributed indexing and storage
- Hadoop is used for distributed storage and processing
- The distributed index is stored as Lucene posting lists
- The Lucene payload carries additional data like annotation confidence and quantities
- Adapted Katta for distributed index retrieval
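In Python terms (Lucene itself is Java), a posting list whose entries carry a per-posting payload can be sketched as below; the toy document and annotation data are hypothetical, and real Lucene payloads are byte arrays attached to term positions:

```python
from collections import defaultdict

def build_index(annotated_docs):
    """Toy inverted index: one posting list per token and per entity
    annotation; each annotation posting carries a payload (here the
    annotation confidence), mimicking Lucene's per-position payloads."""
    index = defaultdict(list)
    for doc_id, (tokens, annotations) in annotated_docs.items():
        for pos, tok in enumerate(tokens):
            index[tok.lower()].append((doc_id, pos, None))
        for pos, entity, confidence in annotations:
            index["ENT:" + entity].append((doc_id, pos, confidence))
    return index

docs = {
    1: (["Jordan", "joined", "the", "Bulls"],
        [(0, "Michael_Jordan", 0.93), (3, "Chicago_Bulls", 0.88)]),
}
idx = build_index(docs)
# idx["ENT:Michael_Jordan"] == [(1, 0, 0.93)]
```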
Distributed search
- The Local Ranking Engine (LRE) scores and ranks a document with respect to a user query
- The Query Consensus Engine (QCE) aggregates evidence from different pages
Distributed query processing
... hence completing the big picture of CSAW
Figure: Detailed CSAW system
Road ahead
Annotation system

- Extending collective inference beyond page-level boundaries
- Extending inference algorithms to multi-topic models
- Associating confidence with annotations

Query system

- Enhancing the data model and query language
- Entity consensus algorithms

Others

- Ranking of entities in the dropdown on the annotation UI
- Alternative methods for storing annotations, suitable for performing interesting mining tasks
Thank you all for your interest in this topic
References I
Somnath Banerjee, Soumen Chakrabarti, and Ganesh Ramakrishnan, Learning to rank for quantity consensus queries, SIGIR Conference, 2009.
S. Cucerzan, Large-scale named entity disambiguation based on Wikipedia data, EMNLP Conference, 2007, pp. 708–716.
S. Dill et al., SemTag and Seeker: Bootstrapping the semantic Web via automated semantic annotation, WWW Conference, 2003.
Michael I. Jordan (ed.), Learning in graphical models, MIT Press, 1999.
Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti, Collective annotation of Wikipedia entities in Web text, SIGKDD Conference, 2009.
R. Mihalcea and A. Csomai, Wikify!: linking documents to encyclopedic knowledge, CIKM, 2007, pp. 233–242.
References II
Rada Mihalcea, Paul Tarau, and Elizabeth Figa, PageRank on semantic networks, with application to word sense disambiguation, COLING '04: Proceedings of the 20th International Conference on Computational Linguistics (Morristown, NJ, USA), Association for Computational Linguistics, 2004, p. 1126.
David Milne and Ian H. Witten, Learning to link with Wikipedia, CIKM, 2008.
Support slides
- Evaluation of our system in more detail
- CSAW search paradigm description
- Objective value comparison
- Scaling and performance measurement
- About Katta
- Comparison of Local, hill climbing, and LP when training R_NA
- Sample malformed dendrograms in category space
- Multi-topic model and dendrogram for the same
- Dendrogram-based algorithm
Effect of NP learning
Figure: F1 (%) for single features (Wiki full page with cosine, anchor text with cosine, anchor text with context and cosine, Wiki full page with Jaccard) versus all features with learned w; and precision (%) versus recall (%) for Local, Local+Prior, M&W, and Cucerzan

- Learning w is better than commonly used single features
- Enough to beat leave-one-out and anchor-based approaches
![Page 64: Introduction to text mining and insights on bridging structured and unstructured data](https://reader033.fdocuments.us/reader033/viewer/2022051610/54927e93ac7959042e8b4626/html5/thumbnails/64.jpg)
Benefits of collective annotation

Figure: Recall versus precision (%) on the IITB dataset for Local, Local+prior, Hill1, Hill1+prior, LP1, and LP1+prior

Figure: Recall versus precision (%) on the CZ dataset for Local, Hill1, LP1, Milne, and Cucerzan, with F1 levels of 63% and 69% marked

I Evaluated on two different data sets
I Can significantly push recall while preserving precision
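Hill1 above refers to a hill-climbing optimizer for the collective objective. A toy sketch of the idea, with hypothetical compatibility and relatedness scores (one coordinate-ascent loop; the real system also has an LP relaxation):

```python
# Toy data: two spots, with hypothetical per-spot compatibility scores
# and pairwise entity relatedness.
compat = {"jaguar": {"Jaguar_Cars": 0.4, "Jaguar_animal": 0.5},
          "xk": {"Jaguar_XK": 0.6}}
related = {frozenset(("Jaguar_Cars", "Jaguar_XK")): 0.8}

def local_score(spot, label):
    # Per-spot compatibility, e.g. a learned w . f score.
    return compat[spot][label]

def coherence(l1, l2):
    # Pairwise entity relatedness r(y_s, y_s').
    return related.get(frozenset((l1, l2)), 0.0)

def hill_climb(spots, candidates, labels, sweeps=10):
    """Reassign each spot's label to the choice that most improves the
    collective objective while the other labels are held fixed."""
    for _ in range(sweeps):
        changed = False
        for s in spots:
            def objective(lab):
                return local_score(s, lab) + sum(
                    coherence(lab, labels[t]) for t in spots if t != s)
            best = max(candidates[s], key=objective)
            if best != labels[s]:
                labels[s], changed = best, True
        if not changed:  # converged: a full sweep made no change
            break
    return labels

spots = ["jaguar", "xk"]
candidates = {"jaguar": ["Jaguar_Cars", "Jaguar_animal"],
              "xk": ["Jaguar_XK"]}
# Start from the best purely local labels; coherence then flips "jaguar".
labels = hill_climb(spots, candidates,
                    {"jaguar": "Jaguar_animal", "xk": "Jaguar_XK"})
print(labels["jaguar"])
```

This illustrates why collective annotation helps: the locally best label for "jaguar" is overruled once relatedness to the neighboring annotation is taken into account.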
Is our belief about the objective correct?

Figure: F1 (%) versus normalized objective value for six documents (doc1-doc6)

I As the objective value increases, the F1 increases
I Validates our belief about the objective
Effect of tuning ρNA I

Figure: F1 (%) for Local, Hill1, and LP1 across ρNA values 1-8

I The best ρNA for Local is smaller than the best ρNA for Hill1 and LP1
Effect of tuning ρNA II

Figure: Precision (%) for Local, Hill1, and LP1 across ρNA values 1-8

Figure: Recall (%) for Local, Hill1, and LP1 across ρNA values 1-8

I The smaller the value of ρNA, the more aggressive the tagging
I Precision increases as ρNA increases
I Recall decreases as ρNA increases
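The behavior above follows from ρNA acting as a "no attachment" threshold: a spot is left unannotated when no candidate scores above it, so raising ρNA trades recall for precision. A minimal sketch with hypothetical candidate scores:

```python
NA = None  # sentinel for "do not annotate this spot"

def annotate(candidate_scores, rho_na):
    """Pick the best-scoring candidate, or NA if none clears rho_na."""
    best, best_score = NA, rho_na
    for entity, s in candidate_scores.items():
        if s > best_score:
            best, best_score = entity, s
    return best

scores = {"Jaguar_Cars": 3.2, "Jaguar_animal": 1.1}
print(annotate(scores, rho_na=2.0))  # low threshold: tags Jaguar_Cars
print(annotate(scores, rho_na=4.0))  # high threshold: leaves the spot as NA
```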
CSAW search paradigm description

I Data model
  I Two extremes in currently available systems: IR systems and relational systems
  I Our goal: bridge the gap between the two
I Query capabilities
  I Two extremes in current systems: keyword queries and structured SQL-like queries
  I Our goal: allow composite representations that combine textual proximity with structured data (from some catalog)
I Response
  I Current search systems return URLs or highly structured data (as in SQL)
  I Our goal: return lists of entities, quantities (a special type of entity), or tables of entities and quantities
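The middle goal, combining textual proximity with catalog types, can be pictured as a query that binds a type from the catalog to keywords occurring near it. A toy sketch over a hypothetical annotated-token format (not the actual CSAW query language or index):

```python
# Each token is (word, entity_type); entity_type is None for plain text.
# A toy corpus snippet, pre-annotated with catalog types.
doc = [("einstein", "physicist"), ("was", None), ("born", None),
       ("in", None), ("1879", "year")]

def find_typed_near(doc, target_type, keyword, window=4):
    """Return entities of target_type occurring within `window` tokens
    of `keyword`: a proximity query over type-annotated text."""
    hits = []
    for i, (word, etype) in enumerate(doc):
        if etype != target_type:
            continue
        lo, hi = max(0, i - window), i + window + 1
        if any(w == keyword for w, _ in doc[lo:hi]):
            hits.append(word)
    return hits

print(find_typed_near(doc, "year", "born"))
```

The response is a list of typed entities rather than URLs, matching the third goal above.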
Objective value comparison for Local, hill climbing, LP

Figure: Total objective versus ρNA (values 1-8) for Hill1, LP1-rounded, and LP1-relaxed
Scaling and performance measurement

Figure: Scaling the annotation process with the number of spots being annotated: time (s) versus number of spots for Hill1 and LP1

I Scaling is mildly quadratic w.r.t. |S0|
I Hill climbing takes about 2-3 seconds
I LP takes around 4-6 seconds
About Katta

I Salient features:
  I Scalable
  I Failure tolerant
  I Distributed
  I Indexed
  I Data storage
I Serves very large Lucene indexes as index shards on many servers
I Replicates shards on different servers for performance and fault tolerance
I Supports pluggable network topologies
I Master fail-over
I Plays well with Hadoop clusters
Comparison of Local, hill climbing, LP - training ρNA

              Local     Hill1     LP1
  no Prior    63.45%    64.87%    67.02%
  +Prior      68.75%    67.46%    69.69%
Sample dendrograms I
Sample dendrograms II
Dendrogram with multitopic model
Multi-topical model

I Current clique potentials encourage a single-cluster model
I The single-cluster hypothesis is not always true
I Refined clique potential for supporting the multi-topic model:

\[
\frac{1}{|C|} \sum_{\Gamma_k \in C} \frac{1}{\binom{|\Gamma_k|}{2}} \sum_{s,s' :\; y_s, y_{s'} \in \Gamma_k} r(y_s, y_{s'}) \tag{CPK}
\]

I Using $\binom{|\Gamma_k|}{2}$ instead of $\binom{|S_0|}{2}$ rewards smaller coherent clusters, as desired
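The potential (CPK) can be computed directly: average, over topic clusters, the pairwise relatedness within each cluster, normalized by the cluster's own number of pairs so that small coherent clusters are not penalized. A minimal sketch with a toy relatedness function:

```python
from itertools import combinations
from math import comb

def cpk(clusters, r):
    """Multi-topic clique potential: (1/|C|) * sum over clusters Gamma_k
    of the pairwise relatedness r(y_s, y_s') averaged over the
    C(|Gamma_k|, 2) label pairs inside Gamma_k."""
    total = 0.0
    for gamma in clusters:
        pairs = comb(len(gamma), 2)
        if pairs == 0:
            continue  # singleton clusters contribute nothing
        total += sum(r(a, b) for a, b in combinations(gamma, 2)) / pairs
    return total / len(clusters)

# Toy relatedness: 1.0 within the same topic, 0.0 across topics.
topic = {"Jaguar_Cars": "auto", "Jaguar_XK": "auto", "Jaguar_animal": "wild"}
r = lambda a, b: 1.0 if topic[a] == topic[b] else 0.0

print(cpk([["Jaguar_Cars", "Jaguar_XK"], ["Jaguar_animal"]], r))
```

With the per-cluster normalization, the small two-entity cluster scores a full 1.0 on its own; normalizing by all pairs in S0 instead would dilute it, which is exactly what the refinement avoids.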