Building Topic/Trend Detection System based on Slow Intelligence

Posted on 04-Jan-2016




Chia-Chun Shih & Ting-Chun Peng

Institute for Information Industry

Taipei, Taiwan

Presented at DMS’10 special session on Slow Intelligence Systems

Agenda

• Introduction
• Topic/Trend Detection System
• Topic/Trend Detection System with Slow Intelligence
• Conclusion

Introduction

• Social media is prevailing
• Social media is a reflection of the real world
  – An experiment from HP Social Computing Lab shows:
    • Twitter-rate time series can accurately predict box-office movie sales, with adjusted R² = 0.973 (amazing!!)
• The emerging market for social media monitoring services
  – E.g., Nielsen BuzzMetrics, Radian6

[Figure panels: Twitter Posts, Blog Posts, Facebook Users]

Introduction (cont’d)

• Topic Detection and Tracking (TDT)
  – Initiated by DARPA in 1996
  – Discovers the topical structure in unsegmented streams of news reporting as it appears across multiple media
  – Tasks:
    • Topic Detection
    • Topic Tracking
    • First Story Detection
    • Story Segmentation
    • Link Detection

Introduction (cont’d)

• Slow Intelligence provides a software development framework that lets systems with insufficient computing resources gradually adapt to their environments in order to handle complexity

[Figure: Slow Intelligence System. Labels: Problem, (1) Enumerator, (2) Adaptor, (3) Eliminator, (4) Concentrator, Solution, Knowledge-based Controller, Environment]
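The four building blocks can be read as one iteration of a search loop that is repeated until the solution stops improving. The following is a minimal Python sketch of that reading, applied to a toy one-dimensional problem. The function names, the numeric search space, and the scoring function are all assumptions made for illustration; none of them come from the slides.

```python
import random
from typing import Callable, List

def enumerate_candidates(current: float, n: int, rng: random.Random) -> List[float]:
    """Enumerator: propose n alternative solutions near the current one."""
    return [current + rng.uniform(-1.0, 1.0) for _ in range(n)]

def adapt(candidates: List[float], low: float, high: float) -> List[float]:
    """Adaptor: fit candidates to the environment (here: clamp to the allowed range)."""
    return [min(max(c, low), high) for c in candidates]

def eliminate(candidates: List[float], score: Callable[[float], float], keep: int) -> List[float]:
    """Eliminator: drop the weakest candidates and keep the `keep` best."""
    return sorted(candidates, key=score, reverse=True)[:keep]

def concentrate(candidates: List[float], score: Callable[[float], float]) -> float:
    """Concentrator: focus resources on the single most promising candidate."""
    return max(candidates, key=score)

def sis_cycle(start: float, score: Callable[[float], float],
              cycles: int = 50, low: float = 0.0, high: float = 10.0,
              seed: int = 0) -> float:
    """Repeat enumerate -> adapt -> eliminate -> concentrate for a fixed number of cycles."""
    rng = random.Random(seed)
    best = start
    for _ in range(cycles):
        # The incumbent stays in the pool, so the score never gets worse.
        pool = enumerate_candidates(best, 8, rng) + [best]
        pool = adapt(pool, low, high)
        pool = eliminate(pool, score, keep=3)
        best = concentrate(pool, score)
    return best

# Hypothetical objective: the closer to 7.0, the better.
score = lambda x: -(x - 7.0) ** 2
result = sis_cycle(1.0, score)
print(result)
```

Because the incumbent is always kept in the candidate pool, each cycle can only keep or improve the current solution, which is one simple way to get the "gradual adaptation" the framework describes.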

Introduction (cont’d)

• In this paper, we propose a design for an online topic/trend detection system for social media that takes advantage of Slow Intelligence.
• Four complexities of designing online topic/trend detection systems are identified, along with the corresponding Slow Intelligence solutions.

[Figure: Slow Intelligence System building blocks (Enumerator, Adaptor, Eliminator, Concentrator) applied to the Topic/Trend Detection System (Crawler & Extractor, Topic Extractor, Trend Detector) through four subsystems: SIS for focused crawling, SIS for scheduling crawlers, SIS for adapting extractors, and SIS for selecting the trend estimation method]

Topic/Trend Detection System

• Objective
  – Detect current hot topics and predict future hot topics based on data collected from social media
• Three components
  – Crawler & Extractor: collects data and extracts information from social media
  – Topic Extractor: detects hot topics from a set of text documents
  – Trend Detector: detects trends (future hot topics) based on currently available data

[Figure: Social Media → Crawler & Extractor → Topic Extractor (current hot topics) → Trend Detector (future hot topics)]

Topic/Trend Detection System (cont’d)

• Crawler & Extractor
  – The Information Extractor extracts articles and metadata (title, author, content, etc.) from semi-structured web content

[Figure: Crawler & Extractor. Labels: User’s keywords of interest, Social Media, Web Crawler, HTML documents, Information Extractor, text documents, Web data DB, Topic Extractor]
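The extraction step can be illustrated with a small parser that pulls the title, author, and content out of an HTML page. This is only a sketch under stated assumptions: the page layout, the `class` attribute values used as field markers, and the names `ArticleExtractor` and `extract_article` are hypothetical, and a real site would need its own rules.

```python
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    """Collects text inside elements whose class attribute equals a field name."""
    FIELDS = ("title", "author", "content")          # assumed class names
    VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.record = {field: "" for field in self.FIELDS}
        self._stack = []  # one entry per open element: field name or None

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        cls = dict(attrs).get("class")
        self._stack.append(cls if cls in self.FIELDS else None)

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute the text to the innermost enclosing field, if any.
        for field in reversed(self._stack):
            if field:
                self.record[field] += data
                break

def extract_article(html: str) -> dict:
    """Return title, author, and content with whitespace normalized."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return {key: " ".join(value.split()) for key, value in parser.record.items()}

# Hypothetical page used only to exercise the sketch.
SAMPLE = """
<html><body>
  <h1 class="title">Slow Intelligence in practice</h1>
  <span class="author">A. Writer</span>
  <div class="content">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""
article = extract_article(SAMPLE)
print(article)
```

The sketch matches the `class` attribute exactly and assumes well-nested markup, so it is a starting point rather than a robust extractor.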

Topic/Trend Detection System (cont’d)

• Topic Extractor
  – Topic word extraction: apply the TF-IDF scheme to generate the top-N topic words for each document
  – Topic word clustering: apply a clustering algorithm to cluster topic words into topic groups; the topic groups are treated as “topics”
  – Hot topic extraction: apply aging theory to find hot topics

[Figure: Topic Extractor. Web data DB → Topic Word Extraction → Topic Word Clustering → current topics → Hot Topic Extraction → current hot topics]
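The first and last of these steps can be sketched in a few lines of Python. The TF-IDF part uses the common `tf * log(N / df)` weighting. The hot-topic part is a deliberately simplified energy model in the spirit of aging theory (new mentions add energy, a constant decay is subtracted at each time step); it is an assumption, not the authors' formulation. The function names and the parameters `alpha` and `beta` are hypothetical, and the clustering step is omitted.

```python
import math
from collections import Counter

def top_n_topic_words(docs, n=3):
    """Return the n highest-scoring TF-IDF words for each document.

    docs is a list of token lists; ties are broken alphabetically.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    total = len(docs)
    result = []
    for tokens in docs:
        term_freq = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(total / doc_freq[word])
            for word, count in term_freq.items()
        }
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:n])
    return result

def topic_energy(mentions_per_step, alpha=1.0, beta=2.0):
    """Simplified aging model: energy grows with mentions and decays by beta per step."""
    energy, history = 0.0, []
    for count in mentions_per_step:
        energy = max(0.0, energy + alpha * count - beta)
        history.append(energy)
    return history

docs = [
    "election debate poll election".split(),
    "movie premiere box office movie".split(),
    "election poll results".split(),
]
topic_words = top_n_topic_words(docs, n=2)
energy = topic_energy([1, 5, 8, 2, 0, 0])
print(topic_words)
print(energy)
```

A topic could then be called "hot" while its energy stays above a chosen threshold. The energy rises while mentions outpace the decay and falls once they stop, which gives each topic a life cycle.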

Topic/Trend Detection System (cont’d)

• Trend Detector
  – The trend estimation algorithm is a black box for now; however, it will “find its way” once Slow Intelligence is involved in the system

[Figure: Trend Detector. Current topics → Trend Estimation Algorithms → topic trends (future hot topics)]

Topic/Trend Detection System with Slow Intelligence

T/TD System with Slow Intelligence

• Four complexities of designing online topic/trend detection systems

• 1. It is not feasible to collect all web data with a limited amount of computing resources. The system needs data collection strategies that concentrate its limited resources on collecting important web data.

Affected component: Crawler & Extractor
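One common way to concentrate a limited crawl budget is a best-first (focused) crawl: links are queued by how relevant their source page was to the user's keywords, and crawling stops when the budget is used up. The sketch below assumes an in-memory dictionary in place of real HTTP fetches; the relevance measure and the names `relevance`, `focused_crawl`, and `PAGES` are hypothetical.

```python
import heapq

def relevance(text, keywords):
    """Fraction of the user's keywords that occur in the page text."""
    words = set(text.lower().split())
    return sum(keyword in words for keyword in keywords) / len(keywords)

def focused_crawl(seed, pages, keywords, budget):
    """Visit at most `budget` pages, expanding the most promising link first.

    pages maps url -> (text, [outgoing urls]) and stands in for real fetches.
    """
    frontier = [(-1.0, seed)]  # min-heap on negated priority = max-heap
    seen, visited = {seed}, []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        text, links = pages[url]
        score = relevance(text, keywords)
        visited.append((url, score))
        for link in links:
            if link not in seen and link in pages:
                seen.add(link)
                # A link inherits the relevance of the page that cites it.
                heapq.heappush(frontier, (-score, link))
    return visited

# Hypothetical five-page site; only four pages fit in the budget.
PAGES = {
    "a": ("slow intelligence trend detection", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("trend detection on twitter", ["e"]),
    "d": ("more recipes", []),
    "e": ("topic trend analysis", []),
}
visited = focused_crawl("a", PAGES, ["trend", "detection"], budget=4)
print(visited)
```

With a budget of four pages, the page reached only through the off-topic page is the one left uncrawled.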

T/TD System with Slow Intelligence (cont’d)

• 2. Many computation methods are available for estimating trends. Once parameter settings are also taken into account, there are too many combinations to choose from. Furthermore, the Internet is a changing environment, so the current best solution may not perform well in the future. The system needs to find the best solution among many alternatives automatically (or at least quasi-automatically) in a changing environment.

Affected component: Trend Detector
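A simple way to choose among methods and parameter settings is to backtest every candidate on recent data and keep the one with the lowest error, repeating the selection as new data arrives. The sketch below assumes three toy forecasting methods and a mean-absolute-error criterion; the methods, their window parameters, and the function names are illustrative assumptions, not the algorithms used by the authors.

```python
from statistics import mean

def naive(history):
    """Predict that the next value equals the last one."""
    return history[-1]

def moving_average(window):
    """Build a predictor that averages the last `window` values."""
    def predict(history):
        return mean(history[-window:])
    predict.__name__ = f"moving_average({window})"
    return predict

def linear_trend(window):
    """Build a predictor that extends the average slope of the last `window` values."""
    def predict(history):
        recent = history[-window:]
        slope = (recent[-1] - recent[0]) / (len(recent) - 1)
        return recent[-1] + slope
    predict.__name__ = f"linear_trend({window})"
    return predict

def backtest_error(method, series, warmup=3):
    """Mean absolute one-step-ahead error of `method` over the series."""
    errors = [abs(method(series[:t]) - series[t]) for t in range(warmup, len(series))]
    return mean(errors)

def select_method(candidates, series):
    """Score every candidate on recent data and keep the best one."""
    return min(candidates, key=lambda method: backtest_error(method, series))

# Each (method, parameter) combination is one candidate.
candidates = [naive, moving_average(3), linear_trend(2), linear_trend(3)]

rising = [2, 4, 6, 8, 10, 12, 14]        # steadily growing topic
oscillating = [5, 7, 5, 7, 5, 7, 5, 7]   # noisy but flat topic

best_rising = select_method(candidates, rising)
best_oscillating = select_method(candidates, oscillating)
print(best_rising.__name__, best_oscillating.__name__)
```

The two series select different winners, which illustrates why a fixed choice of method is fragile when the environment changes.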

T/TD System with Slow Intelligence (cont’d)

• 3. The crawler needs to revisit websites at hourly or daily intervals to collect up-to-date data. Each site has a different amount of data to be updated and a different policy for restricting frequent access, neither of which is known beforehand. The system needs to find a feasible data collection schedule based on past experience.

Affected component: Crawler & Extractor
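Learning a schedule from past experience can be as simple as adjusting each site's revisit interval after every visit. The sketch below assumes a multiplicative rule: back off when the site blocks the crawler, slow down when nothing new was found, and speed up when many new items appeared. The class name, the thresholds, and the factors are hypothetical choices.

```python
from dataclasses import dataclass

MIN_INTERVAL_H = 1.0   # never revisit more often than hourly
MAX_INTERVAL_H = 24.0  # never revisit less often than daily

@dataclass
class SiteSchedule:
    """Per-site revisit interval, adjusted after every crawl."""
    interval_h: float = 6.0

    def update(self, new_items: int, blocked: bool) -> float:
        if blocked:              # the site restricted access: back off
            factor = 2.0
        elif new_items == 0:     # nothing new: visit less often
            factor = 1.5
        elif new_items >= 10:    # lots of new data: visit more often
            factor = 0.5
        else:                    # moderate activity: keep the interval
            factor = 1.0
        proposed = self.interval_h * factor
        self.interval_h = min(MAX_INTERVAL_H, max(MIN_INTERVAL_H, proposed))
        return self.interval_h

busy = SiteSchedule()
busy_intervals = [busy.update(25, False), busy.update(30, False), busy.update(40, True)]

quiet = SiteSchedule()
quiet_intervals = [quiet.update(0, False) for _ in range(4)]

print(busy_intervals)
print(quiet_intervals)
```

The busy site is visited more often until it blocks the crawler, while the quiet site drifts toward the daily limit.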

T/TD System with Slow Intelligence (cont’d)

• 4. Any change in a web page may disrupt the Extractors. When many websites are being monitored, an automatic repair mechanism for Extractors is needed. The repair mechanism needs to detect Extractor errors, find alternatives, and choose the best alternative to fix the disrupted Extractors.

Affected component: Crawler & Extractor
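The three repair steps (detect errors, find alternatives, choose the best) can be sketched with a small pool of candidate extraction rules. The example assumes that an Extractor is a single regular expression for the article title and that "broken" means it fails on too many recent pages; the rule names, the patterns, and the 0.8 threshold are hypothetical, and regular expressions are used only to keep the sketch short.

```python
import re

# Pool of alternative rules for extracting an article title (assumed).
CANDIDATE_RULES = {
    "h1_title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
    "h2_title": re.compile(r"<h2[^>]*>(.*?)</h2>", re.S),
    "og_title": re.compile(r'<meta property="og:title" content="(.*?)"'),
}

def extract(rule_name, html):
    """Apply one rule; return the stripped match or None."""
    match = CANDIDATE_RULES[rule_name].search(html)
    return match.group(1).strip() if match else None

def success_rate(rule_name, pages):
    """Share of pages on which the rule yields a non-empty value."""
    return sum(bool(extract(rule_name, page)) for page in pages) / len(pages)

def is_broken(rule_name, pages, min_success=0.8):
    """Detect errors: the rule fails on too many recent pages."""
    return success_rate(rule_name, pages) < min_success

def repair(current_rule, pages):
    """Keep a working rule; otherwise pick the best-performing alternative."""
    if not is_broken(current_rule, pages):
        return current_rule
    return max(CANDIDATE_RULES, key=lambda name: success_rate(name, pages))

# Hypothetical pages after a redesign that moved titles from <h1> to <h2>.
pages_after_redesign = [
    '<h2 class="headline">Topic A</h2>',
    '<h2>Topic B</h2><meta property="og:title" content="Topic B">',
    '<h2>Topic C</h2>',
]
repaired = repair("h1_title", pages_after_redesign)
unchanged = repair("h2_title", pages_after_redesign)
print(repaired, unchanged)
```

In practice the candidate pool would be generated per site, and the "non-empty value" check would likely be replaced by a stricter validity test.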

T/TD System with Slow Intelligence (cont’d)

1. SIS to help restrict the range of data collection

[Figure labels: Knowledge of data, Knowledge of algorithm]

T/TD System with Slow Intelligence (cont’d)

2. SIS to help select and adapt trend detection algorithms

T/TD System with Slow Intelligence (cont’d)

3. SIS to help schedule the Crawler

T/TD System with Slow Intelligence (cont’d)

4. SIS to help adapt Extractors

Conclusion

• An online trend detection system requires careful resource allocation and automatic algorithm adaptation to process huge volumes of heterogeneous data.

• This research adopts Slow Intelligence, which provides a framework for systems with insufficient computing resources to gradually adapt to their environments, to respond to these challenges.

• Four Slow Intelligence subsystems are proposed, each targeting one challenge in designing online topic/trend detection systems.

If you have any questions, please e-mail us

chiachun@iii.org.tw (Chia-Chun Shih)

markpeng@iii.org.tw (Ting-Chun Peng)