Spatio-temporal analysis of opinion in social media...

1
Spatio-temporal analysis of opinion in social media: outlier detection for the business intelligence area Miguel Fernandes Ca´ ı˜ na, PhD Student Supervised by: Dr. Rebeca P. D´ ıaz Redondo and Dr. Ana Fern´ andez Vilas Department of Telematics Engineering, University of Vigo Motivation I Internet has become a communication and expression platform, rather than just a static information source. Mailing lists, forums and chats have been part of it since the beginning, but over the last years, social networks have become the primary platform of communication for the majority of its users. I The continuous flow of public information from forums and social networks makes possible to extract any kind of sentiment expressed about a product, service or brand. The aggregation of this data, including the impact of time and location, could be crucial in the success of a business decision. I Natural Language Processing (NLP), defined as the ability of a system to process human language [1], is an artificial intelligence component that can be used to mine opinion and sentiment from social networks, and classify each post as being positive, negative or neutral towards a specific subject. I This flow of opinions could be exploited by a Company, in order to verify the im- pact of a business decision or an external situation on the public’s perception over its products, services or brands. Simply ignoring it could be harmful for the company’s success. Objectives 1. The main objective of this PhD is to propose a general model applying different techniques (opinion mining, relevant topic identification, data and company con- nections, space-time scopes and real time analysis) that could be transformed into a solution that provides a high level view about products and services of in- terest, enabling decision making ”as soon as possible”, as well as post-mortem analysis of relevant events. 2. A platform will be developed following the proposed model, and it should be able to provide insight about the impact of events on the public’s perception over related brands, services or products. Research Plan I A comprehensive review of the relevant literature in the fields of opinion and sen- timent mining, topic disambiguation, space-time scopes and real time analysis, outlining the state-of-the-art techniques is being performed continually in order to gain critical insights. I The development is following an iterative and incremental approach, and is di- vided in several phases: Phase 1 (Completed) - Conceptual Test A basic approach using simple scripts written in Perl, to extract user’s tweets, preprocess them and mine their opinions using a basic algorithm. Phase 2 (Completed) - Platform Definition Design a Java enterprise application model, using MongoDB as a data warehouse, and an extensible architecture to be able to accommodate the four main modules, and their subcomponents. Phase 3 (Completed) - Basic Modules Development of basic algorithms for each components, following the same principles as in Phase 1. Phase 4 - Advanced Modules Integration of several state-of-the-art techniques for each area, specially in preprocessing and sentiment analysis. Automatic anomalies detection, external API support and live extraction will also be part of this phase. I Validation and assessment of the results will be based on a statistical approach, and the project success will be evaluated using this criteria. I The modularity of the platform is essential to provide an extensible framework for other projects on this area. Results Figure 1: Marble Architecture A platform, named Marble has been developed, and integrates all the main compo- nents that could provide the achievement of our objectives. This model is flexible enough to allow an implementation based on its guidelines. The main components are: Extractor An main extraction module capable of extracting information from Twitter related to a subject. If needed, the module will be expanded to cover other social networks. Preprocessors Modules incorporating NLP processing techniques, stemming and lemmatization capabilities, synonyms recognition and disambiguation practices, that will be in charge of converting the raw data into information for the opinion mining module. This module would be configurable, in order to use different processing techniques, depending on the nature of the data. Sentiment Analysers Modules in charge of extracting the opinion expressed in each message and define a polarity, using a combination of sentiment analysis techniques and heuristics, which will allow to identify specific characteristics of opinion on each user interaction. Presenters A presentation module, responsible for extracting relevant information from the mined opinion and correlating it to manually identified events. The module will also be able to detect ”special situations” not related to any of the known events, in order to discover unidentified incidents Advances in 2015-2016: I Marble is currently being used as a research media under the project ”INRISCO: ANALISIS DE COMUNIDADES BASADO EN MINERIA SOCIAL (2015-2017). Ministerio de Ciencia e Innovaci´ on. Proyectos de I+D+I del programa estatal de investigacion, desarrollo e innovacion orientada a los retos de la sociedad (TEC2014-54335-C4-3-R).”, specifically to extract useful data in collaboration with UPC and UC3M. I The core of the platform was redesigned in order to separate the backend and the frontend. The backend is in charge of all the processing work, and exposes a REST API that is being used by the official frontend but is available to any third party application. I The platform was adapted to use a plug-in architecture, in order to allow new processing algorithms to be integrated for both the preprocessing and sentiment extraction stages. I Two processing algorithms are available: a basic one based on SenticNet word polarity, and another one based on [3], using Stanford NLP tools for parsing and topic selection, and SentiWordNet polarity ratings for establishing the sentiment of tweets. I More techniques are being evaluated to be added in the next iterations, including Naive Bayes, Part of Speech tagging and Support Vector Machines. Next Year Planning Continue the development of new processing modules (Phase 4) including self-validation capabilities using different tagged datasets, in order to provide a fast method of evaluation in different fields of use. This modules will included supervised and unsupervised algorithms. Evaluation and Analysis of the Processing Module Each processing module will be evaluated and validated, in terms of its recall, precision and effectiveness. The results of the evaluation will be included in a paper that will be submitted to a journal (JCR indexed). References [1] Preeti and Brahmaleen Kaur Sidhu. “NATURAL LANGUAGE PROCESSING”. In: International Journal of Computer Technology & Applications 4.5 (2013), pp. 751–758. [2] M. Fernandes Ca´ ı˜ na, R. D´ ıaz Redondo, and A. Fern´ andez Vilas. “Marble Initiative - Monitoring the Impact of Events on Customers Opinion”. In: Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (IC3K 2014). 2014, pp. 403–410. ISBN : 978-989-758-048-2. DOI : 10 . 5220 / 0005151504030410. [3] Syed Akib Anwar Hridoy, M. Tahmid Ekram, Mohammad Samiul Islam, et al. “Localized twitter opinion mining using sentiment analysis”. In: Decision Analytics 2.1 (2015), p. 8. ISSN: 2193-8636. DOI : 10.1186/s40165-015-0016-4. URL : http://www.decisionanalyticsjournal.com/content/2/1/8.

Transcript of Spatio-temporal analysis of opinion in social media...

Page 1: Spatio-temporal analysis of opinion in social media ...doc_tic.uvigo.es/sites/default/files/jornada2016/poster/Fernandes Cai… · Research Plan I A comprehensive review of the relevant

Spatio-temporal analysis of opinion in social media:outlier detection for the business intelligence area

Miguel Fernandes Caına, PhD StudentSupervised by: Dr. Rebeca P. Dıaz Redondo and Dr. Ana Fernandez Vilas

Department of Telematics Engineering, University of Vigo

Motivation

I Internet has become a communication and expression platform, rather than just astatic information source. Mailing lists, forums and chats have been part of it sincethe beginning, but over the last years, social networks have become the primaryplatform of communication for the majority of its users.

I The continuous flow of public information from forums and social networks makespossible to extract any kind of sentiment expressed about a product, service orbrand. The aggregation of this data, including the impact of time and location,could be crucial in the success of a business decision.

I Natural Language Processing (NLP), defined as the ability of a system to processhuman language [1], is an artificial intelligence component that can be used tomine opinion and sentiment from social networks, and classify each post as beingpositive, negative or neutral towards a specific subject.

I This flow of opinions could be exploited by a Company, in order to verify the im-pact of a business decision or an external situation on the public’s perceptionover its products, services or brands. Simply ignoring it could be harmful for thecompany’s success.

Objectives

1. The main objective of this PhD is to propose a general model applying differenttechniques (opinion mining, relevant topic identification, data and company con-nections, space-time scopes and real time analysis) that could be transformedinto a solution that provides a high level view about products and services of in-terest, enabling decision making ”as soon as possible”, as well as post-mortemanalysis of relevant events.

2. A platform will be developed following the proposed model, and it should be able toprovide insight about the impact of events on the public’s perception over relatedbrands, services or products.

Research Plan

I A comprehensive review of the relevant literature in the fields of opinion and sen-timent mining, topic disambiguation, space-time scopes and real time analysis,outlining the state-of-the-art techniques is being performed continually in order togain critical insights.

I The development is following an iterative and incremental approach, and is di-vided in several phases:

Phase 1 (Completed) - Conceptual Test A basic approach using simplescripts written in Perl, to extract user’s tweets, preprocess them andmine their opinions using a basic algorithm.

Phase 2 (Completed) - Platform Definition Design a Java enterpriseapplication model, using MongoDB as a data warehouse, and anextensible architecture to be able to accommodate the four mainmodules, and their subcomponents.

Phase 3 (Completed) - Basic Modules Development of basic algorithmsfor each components, following the same principles as in Phase 1.

Phase 4 - Advanced Modules Integration of several state-of-the-arttechniques for each area, specially in preprocessing and sentimentanalysis. Automatic anomalies detection, external API support andlive extraction will also be part of this phase.

I Validation and assessment of the results will be based on a statistical approach,and the project success will be evaluated using this criteria.

I The modularity of the platform is essential to provide an extensible framework forother projects on this area.

Results

Figure 1: Marble Architecture

A platform, named Marble has been developed, and integrates all the main compo-nents that could provide the achievement of our objectives. This model is flexibleenough to allow an implementation based on its guidelines. The main componentsare:

Extractor An main extraction module capable of extractinginformation from Twitter related to a subject. If needed, themodule will be expanded to cover other social networks.

Preprocessors Modules incorporating NLP processing techniques,stemming and lemmatization capabilities, synonyms recognitionand disambiguation practices, that will be in charge ofconverting the raw data into information for the opinion miningmodule. This module would be configurable, in order to usedifferent processing techniques, depending on the nature of thedata.

Sentiment Analysers Modules in charge of extracting the opinionexpressed in each message and define a polarity, using acombination of sentiment analysis techniques and heuristics,which will allow to identify specific characteristics of opinion oneach user interaction.

Presenters A presentation module, responsible for extractingrelevant information from the mined opinion and correlating it tomanually identified events. The module will also be able todetect ”special situations” not related to any of the knownevents, in order to discover unidentified incidents

Advances in 2015-2016:I Marble is currently being used as a research media under the project ”INRISCO:

ANALISIS DE COMUNIDADES BASADO EN MINERIA SOCIAL (2015-2017).Ministerio de Ciencia e Innovacion. Proyectos de I+D+I del programa estatalde investigacion, desarrollo e innovacion orientada a los retos de la sociedad(TEC2014-54335-C4-3-R).”, specifically to extract useful data in collaboration withUPC and UC3M.

I The core of the platform was redesigned in order to separate the backend andthe frontend. The backend is in charge of all the processing work, and exposesa REST API that is being used by the official frontend but is available to any thirdparty application.

I The platform was adapted to use a plug-in architecture, in order to allow newprocessing algorithms to be integrated for both the preprocessing and sentimentextraction stages.

I Two processing algorithms are available: a basic one based on SenticNet wordpolarity, and another one based on [3], using Stanford NLP tools for parsing andtopic selection, and SentiWordNet polarity ratings for establishing the sentimentof tweets.

I More techniques are being evaluated to be added in the next iterations, includingNaive Bayes, Part of Speech tagging and Support Vector Machines.

Next Year Planning

Continue the development of new processing modules (Phase 4) including self-validation capabilities using different tagged datasets, in order to provide a fast method ofevaluation in different fields of use. This modules will included supervised and unsupervised algorithms.

Evaluation and Analysis of the Processing Module Each processing module will be evaluated and validated, in terms of its recall, precision and effectiveness. The results ofthe evaluation will be included in a paper that will be submitted to a journal (JCR indexed).

References

[1] Preeti and Brahmaleen Kaur Sidhu. “NATURAL LANGUAGE PROCESSING”. In: International Journal of Computer Technology & Applications 4.5 (2013), pp. 751–758.[2] M. Fernandes Caına, R. Dıaz Redondo, and A. Fernandez Vilas. “Marble Initiative - Monitoring the Impact of Events on Customers Opinion”. In: Proceedings

of the International Conference on Knowledge Discovery and Information Retrieval (IC3K 2014). 2014, pp. 403–410. ISBN: 978-989-758-048-2. DOI: 10 . 5220 /0005151504030410.

[3] Syed Akib Anwar Hridoy, M. Tahmid Ekram, Mohammad Samiul Islam, et al. “Localized twitter opinion mining using sentiment analysis”. In: Decision Analytics 2.1(2015), p. 8. ISSN: 2193-8636. DOI: 10.1186/s40165-015-0016-4. URL: http://www.decisionanalyticsjournal.com/content/2/1/8.