Topic cluster of Streaming Tweets based on GPU-Accelerated Self Organizing Map Group 15 Chen Zhutian Huang Hengguang


Page 1: Topic cluster of Streaming Tweets based on GPU-Accelerated Self Organizing Map Group 15 Chen Zhutian Huang Hengguang.

Topic cluster of Streaming Tweets based on GPU-Accelerated Self Organizing Map

Group 15: Chen Zhutian, Huang Hengguang

Page 2:

Outline

Background

Pipeline and Technique

Conclusion

Page 3:

Background

Page 4:

What happens in the tweet stream?

• Unsupervised clustering algorithm.

• Organize large document collections according to textual similarities.

• Create a visual result for searching and exploring large document collections.

Page 5:

WEBSOM system

• Based on Self Organizing Map.

• Generates a topic map for documents.

• Explore large document collections just like exploring Google Maps.

Page 6:

What does WEBSOM look like?

Page 7:

Gap

• WEBSOM: long documents, static collection, long training time.

• Twitter: short text, dynamic, streaming data.

• How to adapt SOM to streaming Twitter data?

Page 8:

What our system looks like

Page 9:

Outline

Background

Pipeline and Technique

Conclusion

Page 10:

Pipeline and Technique

Page 11:

Pipeline

1. Detect Event

2. Build Dictionary

3. Vectorize Tweets

4. Reduce Dimension

5. SOM Cluster

6. Show the SOM map

Current step: Detect Event

Page 12:

Detect Event

• Only focus on unusual events.

• How to identify abnormal events on Twitter?

[Figure: tweets stream with events marked]

Page 13:

Detect Event

1. Similar to TCP’s congestion control mechanism.

2. Count the number of tweets in a moving window.

3. Maintain a weighted moving average and variance of that count.

4. Apply a threshold to decide whether the current window is an event.
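The four steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the estimator mirrors the smoothed mean and deviation that TCP uses for its retransmission timer, and the class name `PeakDetector`, the gains `alpha` and `beta`, and the multiplier `k` are all hypothetical choices borrowed from TCP's usual constants.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeakDetector:
    """Burst detector over per-window tweet counts (illustrative sketch)."""
    alpha: float = 0.125          # gain for the moving average (assumed, as in TCP)
    beta: float = 0.25            # gain for the moving deviation (assumed, as in TCP)
    k: float = 4.0                # deviations above the average that count as an event (assumed)
    mean: Optional[float] = None  # weighted moving average of the window count
    dev: float = 0.0              # weighted moving mean deviation of the window count

    def update(self, count: float) -> bool:
        """Feed the tweet count of one window; return True if it looks like an event."""
        if self.mean is None:
            # First window: initialise the estimates, nothing to compare against yet.
            self.mean = count
            self.dev = count / 2.0
            return False
        is_event = count > self.mean + self.k * self.dev
        # Update the deviation before the mean, so it uses the previous mean.
        self.dev = (1 - self.beta) * self.dev + self.beta * abs(count - self.mean)
        self.mean = (1 - self.alpha) * self.mean + self.alpha * count
        return is_event


def detect_events(counts, **kwargs):
    """Return the indices of the windows flagged as events."""
    detector = PeakDetector(**kwargs)
    return [i for i, c in enumerate(counts) if detector.update(c)]
```

Calling `detect_events` on a list of per-window counts returns the positions of the bursts. The threshold adapts to the stream: a noisy baseline raises the deviation and therefore the bar for calling something an event.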

Page 14:

Test Data

FIFA 2014, Brazil 1:7 Germany

• Tracked 823 keywords, such as “FIFA”, “Ger”, “Brazil”, “#WorldCup”…

• Over 110 minutes.

• 100 million tweets.

• Sampled 1%.

Page 15:

Detect Event

[Figure: tweet-count peaks labeled “Goal!”, “Goal! x 3”, “Goal!”]

Time of peak    What happened?
4:11            First goal!
4:25            Goal! x 3 in 3 minutes
4:30            Goal!
5:07            Second half begins
5:25            Goal!
5:35            Goal!
5:46            Goal!
5:50            End!

Page 16:

1. Detect Event

2. Build Dictionary

3. Vectorize Tweets

4. Reduce Dimension

5. SOM Cluster

6. Show the SOM map

Current step: Build Dictionary

Page 17:

Build Dictionary

1. Remove stop words.

2. Stemming (Snowball stemmer).

3. Remove words whose occurrence is less than 10%.

4. Remove words whose occurrence is greater than 50%.
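A minimal sketch of these dictionary-building steps, under stated assumptions: "occurrence" is read here as the fraction of tweets that contain a word (the slide does not define it), the stop-word list is a tiny placeholder, and `stem` defaults to the identity function so the example has no dependencies. A Snowball stemmer such as NLTK's `SnowballStemmer("english").stem` could be passed in instead. The function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and"}


def tokenize(text):
    """Lowercase the text and split it into word and hashtag tokens."""
    return re.findall(r"[a-z0-9#]+", text.lower())


def build_dictionary(tweets, stem=lambda w: w, min_df=0.10, max_df=0.50):
    """Return the sorted vocabulary left after stop-word removal, stemming
    and document-frequency filtering."""
    doc_freq = Counter()
    for text in tweets:
        # A set, so that each term counts at most once per tweet.
        terms = {stem(t) for t in tokenize(text) if t not in STOP_WORDS}
        doc_freq.update(terms)
    n = len(tweets)
    return sorted(t for t, df in doc_freq.items() if min_df <= df / n <= max_df)
```

Dropping both tails shrinks the vocabulary: very rare words carry little shared signal, and very common words appear in most tweets about the event and so do not separate topics.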

Page 18:

Vectorize Tweets

1. Vector Space model

2. TF-IDF

3. Normalization

V_i = (0.4123, 0.12312, 0.344, …)
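The vectorization steps above can be sketched as follows. The slide does not say which TF-IDF variant or which norm is used, so this example assumes raw term counts, idf = log(N / df), and L2 normalization; the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocab):
    """docs: list of token lists; vocab: list of dictionary terms.
    Returns one L2-normalized TF-IDF vector per document."""
    n = len(docs)
    index = {term: j for j, term in enumerate(vocab)}
    # Document frequency: in how many documents each dictionary term appears.
    df = Counter(t for doc in docs for t in set(doc) if t in index)
    idf = {t: math.log(n / df[t]) for t in df}
    rows = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for term, tf in Counter(doc).items():
            if term in index:
                vec[index[term]] = tf * idf[term]
        norm = math.sqrt(sum(x * x for x in vec))
        # All-zero vectors (no dictionary terms, or only terms with idf 0) stay as they are.
        rows.append([x / norm for x in vec] if norm > 0 else vec)
    return rows
```

Normalizing each vector to unit length keeps long and short tweets comparable, so that distance between vectors reflects word choice rather than tweet length.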

Page 19:

10,000 tweets x 10,000 dimensions

More than 1 hour to converge

Page 20:

1. Detect Event

2. Build Dictionary

3. Vectorize Tweets

4. Reduce Dimension

5. SOM Cluster

6. Show the SOM map

Current step: Reduce Dimension

Page 22:

1. Detect Event

2. Build Dictionary

3. Vectorize Tweets

4. Reduce Dimension

5. SOM Cluster

6. Show the SOM map

Current step: SOM Cluster

Page 23:

SOM Cluster

What is SOM? Self-Organizing Map.

• Artificial Neural Network

• Unsupervised Learning

• Iteration Based

• Visible Result
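To make "iteration based" concrete, here is a minimal sketch of one step of the classic online SOM update: find the best matching unit (BMU) for an input, then pull every node toward the input with a strength that decays with grid distance from the BMU. The Gaussian neighborhood, the function names, and the parameters `lr` and `sigma` are assumptions for illustration; the slides do not give the exact variant used.

```python
import math


def bmu(weights, x):
    """Index of the best matching unit: the node whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))


def sequential_step(weights, grid, x, lr, sigma):
    """One online SOM update, applied in place.

    weights: list of weight vectors, one per node
    grid:    list of (row, col) map coordinates, one per node
    Returns the index of the BMU for x."""
    c = bmu(weights, x)
    for i, w in enumerate(weights):
        d2 = (grid[i][0] - grid[c][0]) ** 2 + (grid[i][1] - grid[c][1]) ** 2
        h = math.exp(-d2 / (2 * sigma ** 2))  # Gaussian neighborhood on the map
        weights[i] = [wi + lr * h * (xi - wi) for wi, xi in zip(w, x)]
    return c
```

Because neighbors on the grid are updated together, nearby nodes end up representing similar inputs, which is what makes the trained map readable as a picture.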

Page 24:

SOM Cluster

• Sequential SOM

• Batch-type SOM: faster, effective
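For comparison with the sequential variant, a batch SOM epoch can be sketched as below: all best matching units are found first against fixed weights, and then every node is recomputed at once as a neighborhood-weighted mean of the inputs. This is the textbook batch rule written in plain Python under the assumption of a Gaussian neighborhood; the function name is hypothetical, and the slides do not describe the authors' actual kernels.

```python
import math


def batch_epoch(weights, grid, data, sigma):
    """One batch SOM epoch. Returns a new list of weight vectors.

    weights: list of weight vectors, one per node
    grid:    list of (row, col) map coordinates, one per node
    data:    list of input vectors"""
    def bmu(x):
        return min(range(len(weights)),
                   key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

    # Phase 1: BMU search. Each input is independent of the others.
    winners = [bmu(x) for x in data]

    # Phase 2: each node becomes the neighborhood-weighted mean of all inputs.
    dim = len(data[0])
    new_weights = []
    for i in range(len(weights)):
        num = [0.0] * dim
        den = 0.0
        for x, c in zip(data, winners):
            d2 = (grid[i][0] - grid[c][0]) ** 2 + (grid[i][1] - grid[c][1]) ** 2
            h = math.exp(-d2 / (2 * sigma ** 2))
            den += h
            for k in range(dim):
                num[k] += h * x[k]
        new_weights.append([v / den for v in num] if den > 0 else list(weights[i]))
    return new_weights
```

Since the weights do not change during the BMU search, that phase has no dependency between inputs, which is the property that makes the batch form a natural fit for data-parallel hardware.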

Page 25:

SOM Cluster

Random Projection + Batch SOM + CUDA

Hour → 1 Second
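Random projection, named here as the dimension reduction step, can be sketched as multiplying each tweet vector by a fixed random matrix. The example below assumes a dense Gaussian matrix scaled by 1/sqrt(out_dim), which approximately preserves Euclidean lengths; the slides do not say which kind of random matrix was used, and the function names and the `seed` parameter are hypothetical.

```python
import math
import random


def random_projection_matrix(in_dim, out_dim, seed=0):
    """Build an out_dim x in_dim matrix of scaled Gaussian entries."""
    rng = random.Random(seed)  # fixed seed so every tweet uses the same matrix
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, x):
    """Map a high-dimensional vector x to the lower-dimensional space."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]
```

The matrix is generated once and reused for every vector, so the cost per tweet is a single matrix-vector product. The price is some distortion of distances, which is consistent with the accuracy drop reported in the 20 newsgroups test.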

Page 26:

Test Data

20 newsgroups

• 20,000 documents.

• 20 different newsgroups.

• Each document belongs to only 1 group.

http://web.ist.utl.pt/acardoso/datasets/.

Train 60% vs Test 40%

Page 27:

20 Newsgroup Test

Method          Random Projection   Macro Accuracy (%)   Micro Accuracy (%)
Renato’s SOM    NO                  68                   67
Our Method      YES                 60                   61

Conclusion: Random projection loses precision, so accuracy decreases after dimension reduction.
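The slides do not define the two accuracy columns. A common reading, assumed here, is that micro accuracy is the fraction of all test documents labeled correctly, while macro accuracy is the unweighted mean of the per-class accuracies. The function name is hypothetical.

```python
from collections import defaultdict


def micro_macro_accuracy(y_true, y_pred):
    """Return (micro, macro) accuracy for parallel lists of true and predicted labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    # Micro: every document counts equally.
    micro = sum(correct.values()) / sum(total.values())
    # Macro: every class counts equally, regardless of its size.
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return micro, macro
```

With eight documents of class "a" all correct and two of class "b" with one correct, micro accuracy is 0.9 and macro accuracy is 0.75. The two numbers stay close when classes are roughly balanced, as the 20 newsgroups are.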

Page 28:

20 Newsgroup Test

Method                                 Random Projection   Macro Accuracy (%)   Micro Accuracy (%)
Renato’s SOM                           NO                  68                   67
Our Method                             YES                 60                   61
MATLAB reproduction of Renato’s SOM    NO                  63                   62
MATLAB reproduction of Renato’s SOM    YES                 61                   60

We used the SOM Toolbox to fully reproduce Renato’s experiment.

Page 29:

FIFA Data

Page 30:

FIFA Data

Page 31:

FIFA Data

Page 32:

Conclusion

Page 33:

Conclusion

• 2 algorithms

• 3 sets of experiments

• 1 prototype system

• 1 case study

Page 34:

Thanks for Watching

Q & A