Automatic and unsupervised topic discovery in social networks

Automatic and unsupervised topic discovery in social networks Antonio Moreno [Carlos Vicient] Seminar at Poznan University of Technology, June 2014

description

Research seminar given at the Poznan University of Technology, Poland, June 2014. The topic was the automatic and unsupervised discovery of topics in social networks.

Transcript of Automatic and unsupervised topic discovery in social networks

Page 1: Automatic and unsupervised topic discovery in social networks

Automatic and unsupervised topic discovery in social networks

Antonio Moreno [Carlos Vicient]

Seminar at Poznan University of Technology, June 2014

Page 2: Automatic and unsupervised topic discovery in social networks

Introduction
Methodology of analysis
Case study
Conclusions and future work

Table of contents

Page 3: Automatic and unsupervised topic discovery in social networks

Introduction

Page 4: Automatic and unsupervised topic discovery in social networks

Web 2.0 (Social Web)
◦ Huge amount of highly heterogeneous and unstructured user-generated data on the Web (e.g. Wikipedia, blogs) and in social networks (e.g. Facebook, Twitter)

Global aim of our work
◦ Develop tools based on Artificial Intelligence techniques that can analyze all this information in an automatic and unsupervised way and build knowledge structures

Some previous works
◦ Ontology-Based Information Extraction
◦ Ontology Learning from the Web

Introduction

Page 5: Automatic and unsupervised topic discovery in social networks

Ontology Learning from Web pages

PhD thesis: D. Sánchez (2007)

Page 6: Automatic and unsupervised topic discovery in social networks

Focus on Social Networks – Twitter
◦ 500 million short messages (tweets) per day

Current Work

Hashtags

Page 7: Automatic and unsupervised topic discovery in social networks

Hashtags can be taken as indicators of the topic of a tweet

Given a large number of tweets, most approaches to automatic topic detection try to cluster tweets (or cluster hashtags) in some way

Most common solution: cluster hashtags according to their syntactic co-occurrence

Topic detection in Twitter
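As a rough illustration of the co-occurrence baseline mentioned on this slide, the sketch below counts how often two hashtags appear in the same tweet. The regular expression, the function name `cooccurrence_counts` and the sample tweets are assumptions made for illustration; none of them come from the talk.

```python
import re
from collections import Counter
from itertools import combinations

HASHTAG = re.compile(r"#(\w+)")  # assumed, simplified hashtag pattern


def cooccurrence_counts(tweets):
    """Count, for every unordered pair of hashtags, the tweets containing both."""
    counts = Counter()
    for text in tweets:
        # Lowercase and deduplicate the hashtags of one tweet
        tags = sorted({t.lower() for t in HASHTAG.findall(text)})
        for pair in combinations(tags, 2):
            counts[pair] += 1
    return counts


# Invented sample tweets
tweets = [
    "New trial results #lungcancer #oncology",
    "Screening saves lives #LungCancer #oncology #screening",
    "I hate this #CancerSucks",
]
counts = cooccurrence_counts(tweets)
```

A purely syntactic count of this kind treats two hashtags as related only when they happen to be used together, which is the limitation that a semantic treatment of hashtags tries to address.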

Page 8: Automatic and unsupervised topic discovery in social networks

Synonymy: #illness, #disease
Polysemy: #operation
Lexical similarity: #pharmaceutical, #pharmaceuticals, #pharma, #pharmacy, #pharmacology
Acronyms: #AIDS, #HIV
Named entities: #MayoClinic, #AustinCancerCentre
Concatenation: #HighBloodPressure, #lungcancer
Feelings: #CancerSucks
Invented words, nonsense

Hashtag heterogeneity

Page 9: Automatic and unsupervised topic discovery in social networks

A semantic management of hashtags will provide a more coherent classification than the usual ones based on syntactic co-occurrence.

Remainder of the talk:
◦ Unsupervised semantic clustering of hashtags
◦ Case study – Medical tweets

Work hypothesis

Page 10: Automatic and unsupervised topic discovery in social networks

Methodology of analysis

Page 11: Automatic and unsupervised topic discovery in social networks

After obtaining the hashtags from a given corpus of tweets, a three-step analytic process is applied:
◦ Semantic annotation of hashtags
◦ Hashtag clustering
◦ Selection of relevant clusters

Analysis of a set of hashtags

Page 12: Automatic and unsupervised topic discovery in social networks

Idea: give meaning to each hashtag by linking it to a WordNet concept
◦ #SagradaFamilia => Church
◦ #LFC => Football Club

Rationale: if we are able to associate each hashtag with a concept in an ontology, we will be able to apply ontology-based semantic similarity measures to determine the degree of relatedness between pairs of hashtags

1-Semantic annotation

Page 13: Automatic and unsupervised topic discovery in social networks

Step 1: The hashtag matches directly with a WordNet concept
◦ Word-breaking techniques and iterative prefix/suffix analysis are applied
◦ #Cathedral, #GothicCathedral match with the “Cathedral” concept

Easy, but most hashtags do not appear directly in WordNet

Semantic annotation process (I)
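The direct-matching step can be sketched as follows. The talk does not detail the word-breaking technique, so this example assumes CamelCase splitting and shows only the suffix direction of the prefix/suffix analysis. `LEXICON` is a tiny hypothetical stand-in for the set of WordNet entries, and `split_words` and `direct_match` are illustrative names.

```python
import re

# Tiny stand-in for the set of WordNet lemmas (assumption, for illustration only)
LEXICON = {"cathedral", "church", "cancer", "blood pressure"}


def split_words(hashtag):
    """Break a CamelCase hashtag into lowercase words."""
    body = hashtag.lstrip("#")
    pattern = r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+"
    return [w.lower() for w in re.findall(pattern, body)]


def direct_match(hashtag, lexicon=LEXICON):
    """Return the longest word suffix of the hashtag found in the lexicon, or None."""
    words = split_words(hashtag)
    # Drop leading words one at a time: "gothic cathedral", then "cathedral"
    for start in range(len(words)):
        candidate = " ".join(words[start:])
        if candidate in lexicon:
            return candidate
    return None
```

With this sketch, `direct_match("#GothicCathedral")` falls back from "gothic cathedral" to "cathedral". An all-lowercase concatenation such as #lungcancer is not split by the CamelCase rule and would need a dictionary-based word breaker.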

Page 14: Automatic and unsupervised topic discovery in social networks

Semantic annotation process (II)

WordNet

Page 17: Automatic and unsupervised topic discovery in social networks

Semantic annotation process (II)

WordNet

#SagradaFamilia => {building, church, basilica}

?

Page 18: Automatic and unsupervised topic discovery in social networks

At this point each hashtag h is associated with one (or several) WordNet concepts Lh
◦ The hashtags that have not been annotated in the previous step are discarded

In order to apply a clustering process it is necessary to define a measure of semantic similarity between pairs of hashtags (i.e. between pairs of lists of WordNet concepts)

2-Hashtag clustering

Page 19: Automatic and unsupervised topic discovery in social networks

We have considered that the similarity between two hashtags h1 and h2 is the maximum similarity between a concept in Lh1 and a concept in Lh2

◦ Any ontology-based semantic similarity measure between concepts could be applied

Comparing two tags

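[Figure: each concept of h1 (C1, C2) is compared with each concept of h2 (C3, C4, C5); the six pairwise similarities shown are 0.2, 0.1, 0.5, 0.6, 0.3 and 0.1, so the similarity between h1 and h2 is 0.6]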
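The comparison of two tags can be sketched in a few lines. The function name `hashtag_similarity` and the `PAIR_SIM` table are illustrative; in particular, the assignment of the six values shown on the slide to specific concept pairs is an assumption, since the pairing is not recoverable from the transcript.

```python
def hashtag_similarity(concepts_1, concepts_2, concept_sim):
    """Maximum concept-to-concept similarity between two concept lists."""
    return max(concept_sim(a, b) for a in concepts_1 for b in concepts_2)


# Assumed assignment of the slide's six values to concept pairs
PAIR_SIM = {
    ("C1", "C3"): 0.2, ("C1", "C4"): 0.1, ("C1", "C5"): 0.5,
    ("C2", "C3"): 0.6, ("C2", "C4"): 0.3, ("C2", "C5"): 0.1,
}


def lookup(a, b):
    """Toy concept similarity backed by the table above."""
    return PAIR_SIM[(a, b)]


similarity = hashtag_similarity(["C1", "C2"], ["C3", "C4", "C5"], lookup)
```

Whatever the actual pairing, the maximum of the six values is 0.6. Both lists are expected to be non-empty, which holds here because hashtags without any annotation have already been dropped.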

Using this similarity between hashtags we perform a hierarchical clustering of the set of hashtags
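A minimal sketch of such a hierarchical clustering is given below. The talk does not state which linkage criterion is used, so average linkage is assumed here, and the naive pure-Python loop is meant for illustration only, not for large hashtag sets. The function name `agglomerative_cuts` and the toy similarity are hypothetical.

```python
from itertools import combinations


def agglomerative_cuts(items, sim):
    """Return {k: clusters} for every cut of the tree, from len(items) down to 1.

    At each step the two clusters with the highest average pairwise
    similarity (average linkage, an assumption) are merged.
    """
    def link(a, b):
        return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

    clusters = [frozenset([x]) for x in items]
    cuts = {len(clusters): list(clusters)}
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: link(*p))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        cuts[len(clusters)] = list(clusters)
    return cuts


# Toy similarity: hashtags in the same invented group are similar
GROUP = {"#flu": "disease", "#cold": "disease", "#mri": "test", "#xray": "test"}


def toy_sim(x, y):
    return 0.9 if GROUP[x] == GROUP[y] else 0.1


cuts = agglomerative_cuts(list(GROUP), toy_sim)
```

Keeping every cut, rather than a single partition, is convenient because a later selection step can then inspect classes at several levels of the tree.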

Page 21: Automatic and unsupervised topic discovery in social networks

Due to the nature of social tags, traditional clustering methods provide solutions with a large number of irrelevant classes

It is important to analyse the clustering tree and determine which classes of hashtags are good enough to be shown to the user

3-Selection of relevant clusters

Page 22: Automatic and unsupervised topic discovery in social networks

filtering (HC, minK, maxK, t1, t2)
  finalClusts := Ø
  forall k in maxK .. minK
    forall c in 1 .. k
      b := inter-cluster-homogeneity(HCkc)
      if ((b >= t1) && (|HCkc| >= t2) && (∄ e in finalClusts | e ⊆ HCkc))
        Add HCkc to finalClusts
  return finalClusts

Filtering algorithm

Cut the tree and obtain k classes

Page 23: Automatic and unsupervised topic discovery in social networks

Filtering algorithm

Compute the homogeneity of each class

Page 24: Automatic and unsupervised topic discovery in social networks

Filtering algorithm

A class is selected if it is big enough, it is homogeneous enough, and it is not a superset of any of the previously selected classes

A semantic centroid of each selected class is calculated
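The filtering pseudocode can be turned into a runnable sketch along the following lines. Two points are assumptions, not statements from the talk: the clustering tree HC is represented as a dictionary `cuts` mapping each k to its list of classes, and the homogeneity of a class is taken to be the mean pairwise similarity of its members. The semantic centroid computation is not covered.

```python
from itertools import combinations


def homogeneity(cluster, sim):
    """Mean pairwise similarity inside a cluster (assumed definition)."""
    pairs = list(combinations(sorted(cluster), 2))
    if not pairs:
        return 1.0  # a single-element class is trivially homogeneous
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


def filter_clusters(cuts, min_k, max_k, t1, t2, sim):
    """Select classes that are homogeneous (>= t1), large (>= t2 elements)
    and not a superset of an already selected class."""
    final = []
    for k in range(max_k, min_k - 1, -1):  # from specific to general cuts
        for cluster in cuts.get(k, []):
            if (homogeneity(cluster, sim) >= t1
                    and len(cluster) >= t2
                    and not any(e <= cluster for e in final)):
                final.append(cluster)
    return final


# Toy data: hashtags sharing a first letter are considered similar
def toy_sim(x, y):
    return 0.9 if x[0] == y[0] else 0.1


fs = frozenset
cuts = {
    3: [fs({"a1", "a2"}), fs({"b1", "b2"}), fs({"c1"})],
    2: [fs({"a1", "a2"}), fs({"b1", "b2", "c1"})],
    1: [fs({"a1", "a2", "b1", "b2", "c1"})],
}
selected = filter_clusters(cuts, min_k=1, max_k=3, t1=0.7, t2=2, sim=toy_sim)
```

Because the loop starts at the finest cut, a small coherent class is selected first, and any larger class that contains it is then rejected by the superset test.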

Page 25: Automatic and unsupervised topic discovery in social networks

Case study

Page 26: Automatic and unsupervised topic discovery in social networks

5000 medical tweets related to Oncology, extracted from Symplur (www.symplur.com)

From October 31st 2012 to January 11th 2013

The set contains 1086 different hashtags

Using the WordNet + Wikipedia semantic annotation process, 930 hashtags (85.6%) were annotated
◦ Half of the annotations are made in the first step (WordNet) and the other half in the second step (Wikipedia)
◦ 156 hashtags (14.4%) were removed

Dataset

[Figure: histogram of the number of tweets (#tweets) by number of hashtags per tweet (hashtags/tweet, from 1 to 12)]

Page 27: Automatic and unsupervised topic discovery in social networks

The remaining 930 hashtags were manually examined.
◦ 536 (57.6%) were relevant medical hashtags, and they were classified in 16 manually labelled categories (organs, professions, medical tests, etc.)
◦ 394 (42.4%) were considered noisy or unrelated to Medicine

Manual analysis

Page 28: Automatic and unsupervised topic discovery in social networks

Wu-Palmer semantic similarity measure

Hashtag hierarchical clustering
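For reference, the Wu-Palmer measure scores two concepts as 2 * depth(LCS) / (depth(c1) + depth(c2)), where LCS is their deepest common ancestor in the taxonomy. The sketch below applies it to a small invented is-a tree; `PARENT` is an assumption standing in for WordNet, and depth is counted in nodes with the root at depth 1.

```python
# Toy is-a taxonomy (child -> parent); an assumption standing in for WordNet
PARENT = {
    "structure": "entity",
    "building": "structure",
    "church": "building",
    "hospital": "building",
    "cathedral": "church",
}


def path_to_root(concept):
    """Return [concept, parent, ..., root]."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(concept):
    """Number of nodes from the root down to the concept (root has depth 1)."""
    return len(path_to_root(concept))


def wu_palmer(c1, c2):
    """2 * depth(LCS) / (depth(c1) + depth(c2))."""
    ancestors_1 = set(path_to_root(c1))
    # Walking up from c2, the first shared node is the deepest common ancestor
    lcs = next(c for c in path_to_root(c2) if c in ancestors_1)
    return 2 * depth(lcs) / (depth(c1) + depth(c2))
```

In this toy tree, "church" and "hospital" share "building" as their deepest common ancestor, so concepts that meet low in the taxonomy score higher than concepts that only meet near the root.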

Page 29: Automatic and unsupervised topic discovery in social networks

maxK=200, minK=5
◦ The algorithm proceeds from the cut that divides the set into 200 classes up to the cut that divides the set into 5 classes; thus, it moves from more specific classes to more general classes

t1: minimum inter-class-homogeneity
◦ All the values between 0 and 1 (in 0.1 steps) were tested.
◦ In this talk I will consider the value 0.70.

t2: minimum number of elements
◦ All the even values between 2 and 20 were tested.
◦ In this talk I will consider the value 10.

With these parameters, 31 classes were obtained

Selection of relevant classes
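The tested threshold values can be enumerated as in the sketch below. Combining every t1 with every t2 as a full grid is an assumption; the talk only states which values were tested for each threshold. The constant names are illustrative.

```python
from itertools import product

# t1: 0.0, 0.1, ..., 1.0 (built from integers to avoid floating-point drift)
T1_VALUES = [i / 10 for i in range(11)]
# t2: even values 2, 4, ..., 20
T2_VALUES = list(range(2, 21, 2))

# Assumed full grid of (t1, t2) combinations
GRID = list(product(T1_VALUES, T2_VALUES))

# Setting discussed in the talk
CHOSEN = (0.7, 10)
```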

Page 30: Automatic and unsupervised topic discovery in social networks

A: Manual set of 16 correct classes (536 HTs) + a noisy 17th class (394 HTs)

B: Set of 31 classes (930 HTs) obtained by the system

We calculate, for each class Bi in B
◦ Its semantic centroid
◦ The class Aj in A with which it shares the most elements

Precision: the fraction of the items of Bi that belong to Aj

Recall: the fraction of the items of Aj that appear in Bi

Evaluation of the results
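The evaluation of one automatically obtained class can be sketched as follows. The function name `best_match` and the toy classes are invented for illustration; ties between manual classes are broken arbitrarily by taking the first one found.

```python
def best_match(b_class, manual_classes):
    """Return (label, precision, recall) for the manual class that shares
    the most elements with b_class."""
    label = max(manual_classes,
                key=lambda name: len(b_class & manual_classes[name]))
    overlap = len(b_class & manual_classes[label])
    precision = overlap / len(b_class)             # items of Bi that belong to Aj
    recall = overlap / len(manual_classes[label])  # items of Aj that appear in Bi
    return label, precision, recall


# Invented toy data
A = {
    "organs": {"#lung", "#liver", "#kidney", "#breast", "#colon"},
    "medical tests": {"#mri", "#biopsy"},
}
B_i = {"#lung", "#liver", "#kidney", "#mri"}
result = best_match(B_i, A)
```

Here three of the four items of the toy class fall in "organs", which has five members, so precision and recall differ even though the best match is unambiguous.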

Page 31: Automatic and unsupervised topic discovery in social networks

[Table: classes in B (semantic centroid, size) alongside their best matching classes in A (precision, recall, manual label)]

Page 32: Automatic and unsupervised topic discovery in social networks

Conclusions and Future Work

Page 33: Automatic and unsupervised topic discovery in social networks

The unsupervised analysis of the set of HTs contained in a corpus of tweets is very hard, because half of them may be noisy or unrelated to the domain, and they have a very heterogeneous nature

Our hypothesis is that semantic measures of similarity between HTs will lead to better classifications than standard co-occurrence techniques

In a test on 5000 medical tweets, 13 of the 16 manually labelled classes are found, with different degrees of precision and recall

Summary

Page 34: Automatic and unsupervised topic discovery in social networks

Evaluate the quality of the semantic annotation step

Test different ontology-based semantic similarity measures in the clustering step

Explore in depth the influence of the thresholds on the selection step

Obtain as a result a hierarchy of classes at different levels of abstraction, rather than a partition

Test the system on different sets of tweets
◦ Size: from thousands to millions of tweets
◦ Domain: single-domain or general corpus

Future work
