

Data Science and Pattern Recognition ©2017 ISSN 2520-4165

Ubiquitous International Volume 1, Number 2, December 2017

Early Detection of Controversial Urdu Speeches from Social Media

Raza Ul Mustafa

Department of Computer Science
COMSATS Institute of Information Technology, Sahiwal, Pakistan

[email protected]

M. Saqib Nawaz

Department of Informatics, School of Mathematical Sciences
Peking University, Beijing, China

[email protected]

Javed Ferzund

Department of Computer Science
COMSATS Institute of Information Technology, Sahiwal, Pakistan

[email protected]

M. Ikram Ullah Lali

Department of Software Engineering
University of Gujrat, Pakistan

[email protected]

Basit Shahzad

Faculty of Engineering and Computer Science
National University of Modern Languages, Islamabad, Pakistan

[email protected]

Philippe Fournier-Viger

School of Natural Sciences and Humanities,
Harbin Institute of Technology (Shenzhen), Shenzhen, China

[email protected]

Abstract. Controversial speech detection on social media is critical for various applications such as content recommendation and controversial event extraction. In this article, a data-driven approach is applied for the collection of Urdu tweets and their classification as controversial or noncontroversial tweets. In particular, Support Vector Machine (SVM), Naive Bayes (NB) and Logistic Regression (LR) models have been trained for this task. Furthermore, controversial Urdu tweets have been analyzed using sequential pattern mining and sequential rule mining to find the most frequent words and patterns, and relationships between words in patterns. Results show that NB outperforms SVM and LR for this classification task. Besides, some interesting patterns were discovered, which can be used to automatically identify tweets containing controversial speeches or messages. By applying sequential rule mining, interesting relationships between words in patterns were also found.

Keywords: Twitter, Urdu, Controversial Speeches, Machine Learning, Sequential Pattern Mining.


1. Introduction. With the increasing popularity of the Internet and of information and communication technologies, micro-blogging and social networking websites have become essential for world-wide interpersonal communication. They are used to continuously share information, connect with other people, and express opinions and thoughts, which are then made available to the world via different mediums. In recent years, social media have received a lot of attention from Internet users, groups, organizations and researchers from different disciplines. Moreover, micro-blogging websites are influencing and facilitating many aspects of modern life from education and technology to business and government affairs [1]. Twitter is a social networking platform that lets users write messages (tweets) of up to 140 characters. It plays a major role as a source of information during major events all around the world, and can be used to follow breaking news. On Twitter, users can write or share tweets that they are interested in. Furthermore, many Twitter users post messages about various topics such as relationships, life problems, and opinions.

Although micro-blogging websites are a popular means of sharing information, they are often misused by hate groups to promote and spread radicalism online. Controversy generally refers to discussions in which members of the audience express opposing opinions on some issue [2]. On social media, people and groups engage in discussions with each other on different matters, and this interaction is not always for social good. Sometimes this interaction leads to controversies. To disturb society, some extremists and groups attempt to generate controversy among people. In this context, it is important for national security agencies, governments, marketing groups, and companies to detect these controversies and address them appropriately and in a timely manner. Studying controversial topics, debates and speeches is thus an important research area. It aims at identifying and classifying the level of endorsement of members of a conversation to identify the nature of the topic, and in particular whether it is a controversial or noncontroversial discussion.

On Twitter, hateful or controversial tweets are those that contain abusive content that targets individuals (e.g. a politician, a celebrity or a product) or groups (e.g. a country, a religion, a gender, or an organization) [3]. Detecting hateful speech in a huge amount of online data is desirable to understand the opinions of people or groups toward other persons and groups. Moreover, detecting hateful speech can be used to discourage wrongful activities, and support many other applications such as controversial event detection, information filtering, content recommendation and developing prediction systems.

Manually checking data on micro-blogging sites to identify controversial speeches is difficult and time consuming. Furthermore, filtering hateful and controversial online data by hand is not scalable. Hence, there is a need for developing automatic detection methods. To address this issue, this paper proposes an approach that uses machine learning and data mining techniques to detect controversial speeches (tweets) written in the Urdu language. Urdu is the national language of Pakistan and is spoken by more than 100 million people in South Asia. It is an Indo-Aryan language that is written from right to left in the Arabic script, using the Nastalique writing style [4].

The proposed approach is applied to the Twitter platform. Twitter offers APIs (Application Programming Interfaces) for developers and researchers that can be used to easily collect data for later analysis. These APIs can be combined to build complex applications. Among various social platforms, Twitter was chosen for this study for the following reasons [5, 6, 7]:

• Twitter is popular, and thus research on Twitter can have a high social impact;
• Twitter provides tools to efficiently search for conversations;
• Tweets are indexed by major search engines such as Google;
• Tweets related to a given topic are easy to retrieve since each event or news story tends to be associated with specific hashtags;
• The Twitter API for collecting data is easier to use than those of other social media such as Facebook, LinkedIn, and Tumblr;
• Twitter messages often contain information about the time zone of the user;
• Twitter keeps information about retweets, tags embedded inside tweets and who follows whom, which offers insightful information for discourse analysis;
• The frequency of tweet posting can be analyzed to perform minute-by-minute analysis of an event.

A method for detecting opinions and potential controversy is sentiment analysis [8, 9]. It is used to extract subjective information from tweets and other types of online data (e.g. Web pages, reviews). In data mining, pattern mining is used to find interesting, useful, and unpredictable patterns in datasets. Pattern mining techniques can find hidden patterns that are useful to understand data and support decision making. A specific family of pattern mining techniques, sequential pattern mining algorithms, is used to discover subsequences in a set of sequences. The importance of a subsequence can be measured using various criteria such as its frequency, its length, and its profit. Sequential pattern mining is used in many real-life applications, including bioinformatics, text and market based analysis, recommendation systems, e-learning, and webpage click-stream analysis. More detail on sequential pattern mining can be found in [10].

The main focus of this work is to detect controversial speeches and patterns in tweets written in Urdu. Data mining techniques can be used to detect and extract subjective information from tweets. We believe that with proper data modeling and opinion mining techniques, controversial patterns extracted from tweets can be used for the detection of controversial speeches. Such information can be used by organizations, government and security officials to take appropriate actions. In the proposed approach, the first step is data collection. For this purpose, tweets were collected in Pakistan for a period of 6 months. These tweets are then preprocessed to obtain a format that is suitable for applying data mining techniques. Tweets in the collected dataset have then been classified into positive and negative tweets. Positive tweets are controversial speeches or messages, whereas negative tweets do not contain controversial messages. For training and evaluation purposes, three classifiers were used: Support Vector Machine (SVM), Naive Bayes (NB) and Logistic Regression (LR). Moreover, the positive tweet dataset was further analyzed using sequential pattern mining algorithms to extract common words, patterns and relationships between words in positive tweets. We believe that extracted patterns can be used for the automatic identification of tweets containing controversial speeches.

The rest of the article is organized as follows. In Section 2, related work on investigating controversial online data is discussed. The approach that we used to discover controversial Urdu speeches on Twitter is elaborated in Section 3. The evaluation of the proposed approach and the results are discussed in Section 4. Finally, a conclusion is drawn in Section 5.

2. Related Work. Some studies have investigated the presence of controversy in online data on websites such as Wikipedia, Twitter and news portals. Some early work by Kittur et al. [11] focused on detecting controversy in Wikipedia. They analyzed revision history and structured data to characterize conflicting behavior at the global, article, and user level. However, manually tagging Wikipedia pages as controversial may result in tagging inconsistencies and sparseness. To estimate the level of controversy of Wikipedia articles, Yasseri et al. [12] considered information extracted from the edit history such as revision count, number of unique editors, number of reverts, and number of editors participating in an edit war and their reputations. Similarly, Rad and Barbosa [13] surveyed five algorithms for controversy detection on Wikipedia and compared their performance. Das et al. [14] used controversy detection to study manipulation by Wikipedia administrators. The authors described a methodology to compute a score on the basis of a user's editing history. The computed score measures how focused a user is on a controversial theme.

For news websites, Choi et al. [8] described a method for automatically detecting controversial issues and their subtopics in news articles. Their approach consists of looking at frequent words appearing in two contexts containing positive and/or negative sentiment words. Chimmalgi [15] focused on the early identification of topics that are likely to generate significant controversy, and developed an algorithm for controversy trend prediction in user comments of news articles using lexicons. Taking a more data-driven approach, Awadallah et al. [16] proposed a method to identify the opinions of persons in news articles. Their method is iterative, and based on a seed set of patterns that describe expressions of support or opposition to an idea. Mejova et al. [17] also used a data-driven approach to examine how controversy interplays with emotional expression and biased language in the news. They introduced a new dataset of controversial and noncontroversial terms that was collected using crowdsourcing. Millions of articles were compared to evaluate to what extent an issue is controversial, based on how it is portrayed across different media compared to other issues.

Guerra et al. [18] used a systematic comparison that is based on a graph structure to study controversy on social media. In [19], controversy is studied in social media from a network structure perspective and content related to topics of various domains. Gupta et al. [20] studied the problem of evaluating the credibility of events on Twitter. They proposed an approach enhanced with event graph-based optimization to assess event credibility. Pennacchiotti and Popescu [9] used a content-driven approach to detect controversies around celebrities on Twitter. For detection, various features are used that include swear words, the occurrence of sentiment-bearing words, and words from a list of controversial topics derived from Wikipedia. Similarly, in [2], Popescu and Pennacchiotti proposed three methods to address the problem of controversial event identification on Twitter. An automatic system is developed in [21] to investigate the credibility of Arabic news content on Twitter. Two approaches were used that assign credibility levels to each tweet. The first approach calculates the similarity between authentic news sources and tweets, whereas the second approach calculates the similarity in authentic news sources along with proposed features. Metaxas and Mustafaraj [22] have designed a system to evaluate trails of trustworthiness that are propagated through real-time channels. The system is able to assess the credibility, provenance and independence of multiple information sources presented to users. Ratkiewicz et al. [23] proposed a Web service to track political memes on Twitter. The Web service helped in detecting astroturfing, smear campaigns, and other types of misinformation in the context of U.S. political elections. Zia et al. [24] used three supervised learning algorithms (SVM, NB and kNN) to identify hate speech on Twitter, whereas Badjatiya et al. [3] used deep neural network architectures for hate speech detection on Twitter.

Previous studies focused on detecting controversy in specific events or trends on Twitter. This work differs from previous work in that the goal is to detect controversial messages written in the Urdu language. For that, we trained and evaluated three classifiers: SVM, NB and LR. Besides, Section 4 presents a comparative analysis of the classifiers, which can help in selecting an appropriate classifier for this task in the future. Moreover, we used various sequential pattern mining algorithms to find common frequent words, patterns and relationships between words in Urdu tweets. Such results can be used for the development of automatic controversial event detection and information filtering systems.

3. Methodology. The proposed methodology, illustrated in Fig. 1, is composed of three main tasks:

• Data Collection from Twitter;
• Data Modeling, which consists of two steps: Pre-processing and Tokenization/Feature Weighting;
• Machine Learning, which is done in two steps: training a classifier and performing controversial Urdu pattern detection using the classifier.

To validate the proposed approach, three classifiers (SVM, NB and LR) were used and evaluated using 10-fold cross validation. More detail on the SVM, NB and LR classifiers can be found in [1, 25]. Sequential pattern mining was also applied to locate patterns in the dataset that tend to appear in controversial messages. Each task is explained in more detail next.

Figure 1. Flow chart of the proposed methodology

3.1. Data Collection. Urdu tweets from official popular news pages and political pages were collected from December 2016 to June 2017 using the Twitter search and stream API [26]. Some of the pages from which tweets were collected are listed in Table 1. The Twitter API can be used to collect relevant tweets that match a specific query. It collects the latest tweets by crawling Twitter. An SQL database was used to store the tweets on a data collection server. Some tweet attributes have been used to sort tweets in the database and some attributes, such as time zones, have been used for tracking geo-locations. In total, approximately 8,000 tweets were collected. As the emphasis is on Urdu tweets, the dataset only contains tweets written in the Urdu language. Tweets written in other languages were discarded.
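The storage and language-filtering step can be pictured with the following minimal sketch. It is not the authors' implementation: the database engine (SQLite), the table schema, the sample records and the use of a `lang` attribute to select Urdu tweets are all assumptions made for illustration, since the paper does not describe these details.

```python
import sqlite3

# Hypothetical tweet records; the field layout (id, text, lang, created_at)
# is an assumption and is not taken from the paper.
tweets = [
    (1, "پاکستان زندہ باد", "ur", "2017-01-05T10:00:00"),
    (2, "Breaking news from Lahore", "en", "2017-01-05T10:01:00"),
    (3, "آج موسم اچھا ہے", "ur", "2017-01-05T10:02:00"),
]

conn = sqlite3.connect(":memory:")  # in-memory database, for the sketch only
conn.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT, lang TEXT, created_at TEXT)"
)
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", tweets)

# Keep only tweets tagged as Urdu, sorted by creation time.
urdu_rows = conn.execute(
    "SELECT id, text FROM tweets WHERE lang = ? ORDER BY created_at", ("ur",)
).fetchall()
```

Here `urdu_rows` holds only the records tagged as Urdu; tweets in other languages are simply never selected, which mirrors the discarding step described above.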

3.2. Pre-processing. The main objective of pre-processing is to prepare data so that machine learning algorithms can be applied [27]. This mainly involves tokenization, feature weighting and data cleaning (removal of irrelevant features). Once the data had been collected, URLs were removed from tweets and replies. Tweets containing only images or a link without text were also removed. Stop words were removed using a stop-word list as they do not provide any information about a topic, and can thus be considered as noise. Pre-processing is of key importance for the classification of textual data since proper pre-processing not only improves the effectiveness of a classifier but can also reduce its training time. Collected tweets have further been pre-processed by performing the following steps.
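The cleaning operations described above (URL removal, discarding tweets left without text, and stop-word removal) can be sketched as follows. The regular expression, the function name `clean_tweet` and the tiny stop-word list are hypothetical; the paper does not list the stop words that were actually used.

```python
import re
from typing import List, Optional, Set

# Simple URL pattern; an assumption, not the authors' exact rule.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS: Set[str] = {"کا", "کی", "کے", "ہے", "اور"}


def clean_tweet(text: str, stop_words: Set[str] = STOP_WORDS) -> Optional[List[str]]:
    """Strip URLs and stop words; return None if no text remains."""
    text = URL_RE.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return tokens or None
```

A tweet that consists only of a link yields `None` and can then be dropped from the dataset, which corresponds to removing tweets that contain a link without text.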

Table 1. A sample of Twitter pages used to collect tweets

ExpressNewsPK     geonews_urdu
DunyaNews         Waqtnewstv
ARYNEWSOFFICIAL   BOLNETWORK
TazaTareen        SAMAATV
dawn_com          etribune
roznamadunya      PTVNewsOfficial
PTIofficial       92newschannel
24NewsHD          CapitalTV_News
JaagAlerts        jangnews
QTV_TV            aajtvonline
Dawn_News         Roznama_Express

Tokenization: Tokenization consists of breaking long text strings into substrings, which may include phrases and words, collectively known as tokens. There are two main ways of performing tokenization (phrase and word tokenization). In this paper, word-level tokenization is used as it is considered more useful in terms of statistical significance than phrase-level tokenization [25]. Using this process, a sentence such as “ ” is broken into tokens “ ”, “ ”, “ ”. The algorithm used for tokenization splits a sentence into tokens based on whitespace and a built-in dictionary.
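The difference between word-level and phrase-level tokenization can be illustrated with a short sketch. It only covers the whitespace-based part of the process: the built-in dictionary mentioned above is not modeled, phrases are approximated by contiguous word n-grams, and the sample sentence is a hypothetical one that does not come from the dataset.

```python
from typing import List


def word_tokens(text: str) -> List[str]:
    """Word-level tokenization: split on runs of whitespace."""
    return text.split()


def phrase_tokens(text: str, n: int = 2) -> List[str]:
    """Phrase-level tokenization, approximated here by contiguous word n-grams."""
    words = word_tokens(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sample = "پاکستان زندہ باد"  # hypothetical sample sentence, not from the dataset
sample_words = word_tokens(sample)      # three word tokens
sample_phrases = phrase_tokens(sample)  # two overlapping two-word phrases
```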

Feature Weighting: After tokenization, weights are assigned to tokens (features) based on their relative effectiveness. This process is known as feature weighting. A standard function to compute weights is TF-IDF [5]. The TF-IDF scheme consists of two parts: TF and IDF. TF, which stands for term frequency, is the number of times that a term/token appears in a document. IDF is the inverse document frequency of a term in a collection of documents: it decreases as the number of documents containing the term grows, so that terms occurring in many documents receive lower weights.
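As an illustration, the sketch below computes one common TF-IDF variant, tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. This exact formula is an assumption: the paper does not specify the variant, and the weighting applied by WEKA may differ (for example in the logarithm or normalization used).

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute tf(t, d) * log(N / df(t)) for every term of every document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy tokenized documents (placeholders standing in for tokenized tweets).
docs = [["a", "b", "a"], ["a", "c"], ["b", "c", "c"]]
weights = tf_idf(docs)
```

With this variant, a term that appears in every document receives a weight of zero, which is the behavior expected of uninformative tokens.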

3.3. Sentiment Classification. After pre-processing, the data is in a suitable format for applying a classification algorithm. Each tweet was then manually labelled as one of two categories: controversial (true label) or noncontroversial (false label). A sample of controversial and noncontroversial tweets is provided in Table 2. Different algorithms are available to train a classifier, and several experimental studies have analyzed the performance of classifiers for text categorization. Based on the results of these experiments, SVM, LR, and NB are observed to be very effective algorithms [25].

3.4. Evaluation Parameters. Standard 10-fold cross validation was employed to evaluate the performance of SVM, NB and LR for this classification task. Cross validation was used for model validation and to evaluate how well the statistical results provided by a model generalize. Performing 10-fold cross validation consists of randomly partitioning a dataset into 10 sub-datasets. Out of these 10 sub-datasets, one is selected as the validation set for model testing and the remaining 9 sub-datasets are used for model training. This process is repeated 10 times so that each sub-dataset is used exactly once as the validation set. The average of the 10 results is then considered [7].
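The partitioning scheme described above can be sketched in a few lines. The function names and the `evaluate` callback are hypothetical, and this is only an illustration of the procedure, not the WEKA implementation used in the experiments.

```python
import random
from typing import Callable, Iterator, List, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # random partitioning
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(n: int, evaluate: Callable[[List[int], List[int]], float], k: int = 10) -> float:
    """Average the score returned by evaluate(train, validation) over the k folds."""
    scores = [evaluate(train, val) for train, val in k_fold_indices(n, k)]
    return sum(scores) / len(scores)
```

In practice, `evaluate` would train a classifier on the training indices and return its score on the validation indices; each example is used for validation exactly once.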

Table 2. Sample of controversial and noncontroversial tweets

Three evaluation measures (precision, recall and F-measure) are used to evaluate the performance of the three classifiers. These measures were chosen because they are widely used to evaluate classifiers. With respect to the positive class, they are defined as follows:

$$\text{Recall}\,(R) = \frac{\text{no. of CPP}}{\text{no. of PE}} \tag{1}$$

$$\text{Precision}\,(P) = \frac{\text{no. of CPP}}{\text{no. of PP}} \tag{2}$$

$$\text{F-measure} = \frac{2 \times P \times R}{P + R} \tag{3}$$

In Equations 1 and 2, CPP, PE and PP stand for correct positive predictions, positive examples and positive predictions, respectively.
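The three measures translate directly into code. In the sketch below, the function name and the counts in the example are hypothetical; only the formulas come from the definitions above.

```python
from typing import Tuple


def precision_recall_f(cpp: int, pp: int, pe: int) -> Tuple[float, float, float]:
    """cpp: correct positive predictions; pp: positive predictions; pe: positive examples."""
    precision = cpp / pp if pp else 0.0
    recall = cpp / pe if pe else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical counts: 100 tweets predicted positive, 80 of them correctly,
# out of 90 tweets that are actually positive.
p, r, f = precision_recall_f(cpp=80, pp=100, pe=90)
```

As a worked instance of the F-measure formula, a precision of 0.919 and a recall of 0.837 (the SVM values reported in Table 3) give 2 × 0.919 × 0.837 / (0.919 + 0.837) ≈ 0.876.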

4. Results and Discussion. We used the SVM, LR and NB classifiers to evaluate the appropriateness of our data representation and to train models. Experiments with the three classifiers were performed using the WEKA [28] software program. We chose WEKA for classifier training and evaluation because it offers state-of-the-art feature selection, classification and evaluation methods along with text processing utilities such as tokenization, stop word removal and feature weighting (such as TF-IDF). The performance of the three classifiers is compared in Table 3.

Table 3. Performance comparison of the three classifiers

Classifier   Precision   Recall   F-measure
SVM          0.919       0.837    0.876
NB           0.857       0.927    0.890
LR           0.927       0.788    0.852

High precision is achieved for SVM and LR (91.9% and 92.7%, respectively). On the other hand, NB achieved the highest recall (92.7%) and F-measure (89.0%). The F-measure results indicate that NB performed better than SVM and LR. The high recall of NB means that it correctly classified most of the controversial tweets in the dataset. Its lower precision, however, indicates that NB classified some noisy (irrelevant and neutral) tweets into the controversial class. To compare the accuracy of the three classifiers, a paired t-test (corrected) was performed in WEKA. A statistical paired t-test compares two datasets in which observations in one dataset can be paired with observations in the other. The main objective of this test is to investigate whether there is statistical evidence that the mean difference between paired observations on a particular outcome is significantly different from zero. More detail on the paired t-test can be found in [29]. The obtained results indicate that some differences in the accuracy of SVM, LR and NB exist, as shown in Table 4. However, these differences are not statistically significant.

Table 4. Accuracy comparison of the three classifiers

SVM     NB      LR
0.840   0.845   0.820
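To make the test concrete, the sketch below computes a paired t statistic from per-fold accuracies of two classifiers. Setting `test_train_ratio` to zero gives the standard paired t-test; a positive value applies the variance correction of the corrected resampled t-test, which is assumed here to be what the "corrected" option refers to (the paper does not give the formula). The function name and parameter are hypothetical, and the per-fold accuracies of the experiments are not reported in the paper, so none are reproduced here.

```python
import math
from statistics import mean, variance
from typing import Sequence


def paired_t(a: Sequence[float], b: Sequence[float], test_train_ratio: float = 0.0) -> float:
    """Paired t statistic computed on the per-fold differences a[i] - b[i].

    test_train_ratio = 0 gives the standard paired t-test; a positive ratio
    (e.g. 1/9 for 10-fold cross validation) inflates the variance term as in
    the corrected resampled t-test (an assumption about the tool's behavior).
    """
    diffs = [x - y for x, y in zip(a, b)]
    k = len(diffs)
    var = variance(diffs)  # sample variance, with k - 1 in the denominator
    return mean(diffs) / math.sqrt((1.0 / k + test_train_ratio) * var)
```

The resulting statistic is compared with a t distribution with k - 1 degrees of freedom. Because the correction enlarges the variance term, the corrected statistic is smaller in magnitude, which makes it harder to declare a difference significant.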

We have also used the SPMF data mining library [10] to analyze Urdu tweets in order to discover common words, patterns and complex hidden relationships between such words and patterns. SPMF, developed in Java, is an open-source data mining library that is specialized in pattern mining. SPMF offers implementations of more than 133 data mining algorithms. We used the TKS, CM-SPAM and ERMiner algorithms. TKS (Top-K Sequential pattern mining) discovers the top-k sequential patterns in a dataset, where k is set by the user. The CM-SPAM algorithm adopts a strategy called CMAP (Co-occurrence MAP) to prune candidate patterns using a vertical database representation, and efficiently discovers sequential patterns in a dataset. ERMiner (Equivalence class based sequential Rule Miner) is designed for mining sequential rules in a sequence dataset. More detail on TKS, CM-SPAM and ERMiner can be found in [30, 31, 32], respectively.
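The idea of top-k sequential pattern mining can be illustrated with a brute-force sketch that enumerates all word subsequences up to a given length and keeps the k most frequent ones. This is not the TKS algorithm, which relies on far more efficient search and pruning strategies; the function name and the toy database are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple


def top_k_patterns(database: List[List[str]], k: int, max_len: int = 2) -> List[Tuple[Tuple[str, ...], int]]:
    """Brute-force top-k sequential patterns (gaps allowed) of up to max_len words."""
    support: Counter = Counter()
    for seq in database:
        seen = set()
        for length in range(1, max_len + 1):
            # combinations() keeps the words in their original order.
            seen.update(combinations(seq, length))
        support.update(seen)  # each pattern is counted once per sequence
    return support.most_common(k)


# Toy database of tokenized sentences (placeholders, not real tweets).
toy_db = [["a", "b"], ["a", "b", "c"], ["a", "c"]]
best = top_k_patterns(toy_db, k=1)
```

The support of a pattern is the number of sequences that contain it. The enumeration grows exponentially with pattern length, which is why dedicated algorithms are needed for real datasets.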

In our dataset of positive tweets, each tweet is considered as a sentence. Each line ends with a period. Furthermore, Urdu text is encoded in UTF-8. We used the TKS algorithm in SPMF to find hidden patterns in our dataset. The obtained patterns of length 1 and 2 are shown in Table 5.
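One possible way to prepare such sentences for a sequential pattern mining tool is to map each word to an integer identifier. The sketch below assumes the integer-based sequence format commonly used with SPMF, in which -1 closes an itemset and -2 closes a sequence; the paper does not state which input format was actually used, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def to_spmf(sequences: List[List[str]]) -> Tuple[List[str], Dict[str, int]]:
    """Encode tokenized sentences as lines of integer item identifiers.

    Each word becomes a one-item itemset terminated by -1, and each
    sequence is terminated by -2 (assumed input format).
    """
    vocab: Dict[str, int] = {}
    lines = []
    for seq in sequences:
        # Assign the next free identifier to words seen for the first time.
        ids = [vocab.setdefault(word, len(vocab) + 1) for word in seq]
        lines.append(" ".join(f"{i} -1" for i in ids) + " -2")
    return lines, vocab


lines, vocab = to_spmf([["a", "b", "a"], ["b", "c"]])
```

The returned `vocab` dictionary allows the integer patterns produced by the miner to be translated back into words.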

The column “#SUP#” of Table 5 indicates the occurrence count of each pattern in the dataset. The most frequent pattern of length 1 in positive tweets is “ ”, which is repeated 101 times. Words “ ” and “ ” appear 42 and 40 times, respectively. For patterns of length 2, pattern “ ” appears 14 times and pattern “ ” was found 10 times. The patterns “ ” and “ ” were also found 10 times each. It can be observed that patterns in Table 5 do not provide much information for controversy or hate speech detection. We also ran the TKS algorithm to extract patterns of length 3, 4 and 5. Results are shown in Tables 6 and 7.

Table 5. Extracted patterns of length 1 and 2

Table 6. Extracted patterns of length 3 and 4

Table 7. Extracted patterns of length 5

We found some interesting results for long patterns. For patterns of length 3, pattern “ ” (Nawab Brahamdagh Bugti in English) occurred 7 times in the database. Nawab Brahamdagh Bugti is currently the founder and leader of the Baloch Republican Party. The tweets that contain his name are either controversial or express public hatred towards him, as he is accused by the Government of Pakistan of leading the Baloch Republican Army, which is an extremist group in the Baluchistan area of Pakistan. Similarly, the pattern “ ”, which appears 3 times, shows hatred towards Afghan refugees living in Pakistan. For 4-word patterns, “ ”, “ ”, “ ” and “ ” exist in controversial patterns. A tweet is considered controversial if Brahamdagh and Bugti are praised and hatred towards Pakistan and the Pakistan army is shown in that tweet. Similarly, for patterns of length 5, “ ”, “ ”, “ ”, “ ”, “ ” and “ ” can be considered as controversial patterns. We have also noticed that most of the patterns are related to FATA, Baluchistan, terrorism, shooting and killing.

Table 8. Frequently occurring patterns of length 1

Table 8 shows some of the most frequent words in the dataset, which were extracted from positive tweets using the CM-SPAM algorithm. All the frequent words (patterns of length 1) that appear in at least 5% of the sentences in the dataset are presented. The results are almost identical to those obtained using the TKS algorithm for patterns of length 1. The most commonly used words are related to Baluchistan, Pakistan, the Pakistan Army, terrorism and killing. We also extracted relationships between patterns using sequential rules. By using the ERMiner algorithm, we found sequential rules between words in the dataset. A sequential rule of the form X ==> Y indicates a sequential relationship between two unordered sets of words (X and Y) that appear in the same sentence. To apply ERMiner, the minimum frequency (minsup) threshold was set to 1% of the tweets. Moreover, only sequential rules having a confidence of at least 70% were extracted, which means that the unordered set of words X is followed by the unordered set of words Y in at least 70% of the tweets in which X appears. The obtained results are listed in Table 9.
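The support and confidence of a sequential rule can be computed with a short sketch. It follows the usual definition in which a rule X ==> Y holds in a sequence if the sequence can be cut so that all words of X appear before the cut and all words of Y after it. It is a naive illustration rather than the ERMiner algorithm, and the function names are hypothetical.

```python
from typing import List, Set, Tuple


def contains_rule(seq: List[str], x: Set[str], y: Set[str]) -> bool:
    """True if seq can be cut so that X lies before the cut and Y after it."""
    for cut in range(1, len(seq)):
        if x <= set(seq[:cut]) and y <= set(seq[cut:]):
            return True
    return False


def rule_stats(database: List[List[str]], x: Set[str], y: Set[str]) -> Tuple[float, float]:
    """Return (support, confidence) of the rule X ==> Y as fractions."""
    n_x = sum(x <= set(seq) for seq in database)              # sequences containing X
    n_xy = sum(contains_rule(seq, x, y) for seq in database)  # X followed by Y
    support = n_xy / len(database)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Toy database of tokenized sentences (placeholders, not real tweets).
toy_db = [["a", "b", "c"], ["a", "c", "b"], ["b", "a"], ["a", "d"]]
sup, conf = rule_stats(toy_db, {"a"}, {"b"})
```

With thresholds such as those mentioned above, a rule would be retained only if its support is at least 0.01 and its confidence at least 0.70.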

The “#CONF#” column in Table 9 indicates that the confidence of each rule X ==> Y,calculated as the percentage of sequences containing X where X is followed by Y. The first

Page 13: Early Detection of Controversial Urdu Speeches from Social ... · Early Detection of Controversial Urdu Speeches from Social Media 29 revision count, number of unique editors, number

38 Raza Ul Mustafa et al.

Table 9. Sequential rules between words

column of Table 9 indicates that 80% of the time, when “ ” appears in a tweet, it is fol-

lowed by “ ”.


The pattern “ ” appeared four times in the dataset. Similarly, the pattern “ ” is followed by “ ” all the time. The main ruling tribe in Baluchistan is “ ”. So, tweets that are controversial and related to Baluchistan contain the rules:

%ہم"! بلاير هن(%) مچستا+ ==>بل( ,

بگٹ! (ہم&% بلاير هن,(+ مچستا. ==>بل, , ,

بگٹ! (ہم&% بلاير مچستا+ ==>بل0 ,

بگٹ! هن'&% مچستا( ==>بل' ,

Some of the controversial or hate speeches towards Afghanistan contain the rules:

“ <== ” and “ <== ”. The last two sequential rules indicate that hatred is shown towards Afghan refugees whenever they are mentioned in controversial tweets.

5. Conclusion. Micro-blogging and social network websites are generally used to express feelings, thoughts and opinions. Individuals of various religions, ethnicities, and nationalities exercise their right to speak freely about problems and controversial topics on such platforms. However, many users and extremist groups exploit these platforms for malicious purposes. In this work, a data-driven approach was proposed to find controversial Urdu speeches on Twitter. We first used the supervised learning algorithms SVM, LR and NB to evaluate the effectiveness of our approach, where NB showed slightly better performance than SVM and LR. Furthermore, controversial Urdu tweets were analyzed using the sequential pattern mining algorithms TKS and CM-SPAM to find the most frequent words and patterns in the dataset. Moreover, relations between words in tweets were found using sequential rule mining with ERMiner.

One of the main difficulties in this research project was converting Urdu text into a format that can be analyzed with WEKA and SPMF. The approach presented in this paper was applied to the Urdu language. However, we believe that it could be applied to English as well. The proposed method could also be used to find controversial Urdu speeches on other social network platforms such as Facebook and YouTube.
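As an illustration of the kind of conversion involved, the following Python sketch encodes tokenized tweets in the integer-based sequence format read by SPMF, where each itemset is closed by -1 and each sequence by -2. This is only a sketch under stated assumptions, not the conversion procedure used in this work: tweets are assumed to be already split into word tokens, each word is treated as a one-item itemset, and the function name and the two example Urdu tokens are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def to_spmf_sequences(
    tweets: Sequence[Sequence[str]],
) -> Tuple[List[str], Dict[str, int]]:
    """Encode tokenized tweets as SPMF sequence lines plus the word-to-id map."""
    word_ids: Dict[str, int] = {}
    lines: List[str] = []
    for tokens in tweets:
        parts: List[str] = []
        for token in tokens:
            # Assign the next unused positive integer to each new word.
            item = word_ids.setdefault(token, len(word_ids) + 1)
            parts.append(f"{item} -1")  # -1 closes a one-word itemset
        parts.append("-2")  # -2 closes the sequence
        lines.append(" ".join(parts))
    return lines, word_ids


# Hypothetical tokens for illustration only (not taken from the paper's data).
toy_tweets = [["پاکستان", "فوج"], ["فوج", "پاکستان", "فوج"]]
spmf_lines, vocabulary = to_spmf_sequences(toy_tweets)
print("\n".join(spmf_lines))
```

The returned word-to-id map allows the integer patterns and rules produced by the mining algorithms to be translated back into Urdu words.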

REFERENCES

[1] R. U. Mustafa, M. S. Nawaz, M. I. Lali, T. Zia and W. Mehmood, “Predicting the cricket match outcome using crowd opinions on social networks: A comparative study of machine learning methods,” Malaysian Journal of Computer Science, vol. 30, no. 1, pp. 63-76, 2017.

[2] A. M. Popescu and M. Pennacchiotti, “Detecting controversial events from Twitter,” In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp. 1873-1876, 2010.

[3] P. Badjatiya, S. Gupta, M. Gupta and V. Varma, “Deep learning for hate speech detection in tweets,” In Proceedings of the 26th International Conference on World Wide Web Companion, pp. 759-760, 2017.

[4] S. Hussain, “Letters to sound rules for Urdu text to speech system,” In Proceedings of the Workshop on Computational Approaches to Arabic Script based Languages, pp. 1-6, 2004.

[5] M. I. Lali, R. U. Mustafa, K. Saleem, M. S. Nawaz, T. Zia and B. Shehzad, “Finding healthcare issues with search engine and social network data,” International Journal on Semantic Web and Information Systems, vol. 13, no. 1, pp. 48-62, 2017.

[6] B. Shahzad, M. I. Lali, M. S. Nawaz, W. Aslam, R. U. Mustafa and A. Mashkoor, “Discovery and classification of user interests on social media,” Information Discovery and Delivery, vol. 45, no. 3, pp. 130-138, 2017.

[7] M. S. Nawaz, R. U. Mustafa and M. I. Lali, “Role of online data from search engine and social media in healthcare informatics,” Applying Big Data Analytics in Bioinformatics and Medicine, Chapter 11, pp. 272-293, IGI Global, 2017.


[8] Y. Choi, Y. Jung and S.-H. Myaeng, “Identifying controversial issues and their sub-topics in news articles,” In Proceedings of Pacific Asia Workshop of Intelligence and Security Informatics, pp. 140-153, 2010.

[9] M. Pennacchiotti and A. M. Popescu, “Detecting controversies in Twitter: A first study,” In Proceedings of the NAACL HLT Workshop on Computational Linguistics in a World of Social Media, pp. 31-32, 2010.

[10] P. Fournier-Viger, J. C. W. Lin, R. U. Kiran, Y. S. Koh and R. Thomas, “A survey of sequential pattern mining,” Data Science and Pattern Recognition, vol. 1, no. 1, pp. 54-77, 2017.

[11] A. Kittur, B. Suh, B. A. Pendleton and E. H. Chi, “He says, she says: Conflict and coordination in Wikipedia,” In Proceedings of CHI, pp. 453-462, 2007.

[12] T. Yasseri, R. Sumi, A. Rung, A. Kornai and J. Kertesz, “Dynamics of conflicts in Wikipedia,” PLoS ONE, vol. 7, no. 6, p. e38869, 2012.

[13] H. S. Rad and D. Barbosa, “Identifying controversial articles in Wikipedia: A comparative study,” In Proceedings of WikiSym, pp. 1-7, 2012.

[14] S. Das, A. Lavoie and M. Magdon-Ismail, “Manipulation among the arbiters of collective intelligence: How Wikipedia administrators mold public opinion,” ACM Transactions on the Web, vol. 10, no. 4, p. 24, 2016.

[15] R. V. Chimmalgi, “Controversy trend detection in social media,” Master's Thesis, Louisiana State University, 2010.

[16] R. Awadallah, M. Ramanath and G. Weikum, “Harmony and dissonance: Organizing the people's voices on political controversies,” In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pp. 523-532, 2012.

[17] Y. Mejova, A. X. Zhang, N. Diakopoulos and C. Castillo, “Controversy and sentiment in online news,” arXiv preprint arXiv:1409.8152, 2014.

[18] P. H. C. Guerra, W. Meira Jr., C. Cardie and R. Kleinberg, “A measure of polarization on social media networks based on community boundaries,” In Proceedings of the 7th International Conference on Weblogs and Social Media, 2013.

[19] K. Garimella, G. D. F. Morales, A. Gionis and M. Mathioudakis, “Quantifying controversy in social media,” arXiv preprint arXiv:1507.05224, 2015.

[20] M. Gupta, P. Zhao and J. Han, “Evaluating event credibility on Twitter,” In Proceedings of the SIAM International Conference on Data Mining, pp. 153-164, 2012.

[21] R. Al Khalifa and M. Al Eidan, “An experimental system for measuring the credibility of news content in Twitter,” International Journal of Web Information Systems, vol. 7, no. 2, pp. 130-151, 2011.

[22] P. Metaxas and E. Mustafaraj, “Trails of trustworthiness in real-time streams,” In Proceedings of Design, Influence and Social Technologies Workshop of CSCW, 2012.

[23] J. Ratkiewicz, M. Conover, M. Meiss, B. Goncalves, S. Patil, A. Flammini and F. Menczer, “Truthy: mapping the spread of astroturf in microblog streams,” In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp. 249-252, 2011.

[24] T. Zia, M. S. Akram, M. S. Nawaz, B. Shehzad, R. U. Mustafa, A. M. Abdullatif and M. I. Lali, “Identification of hatred speeches on Twitter,” In Proceedings of 52nd The IRES International Conference, pp. 27-32, 2016.

[25] F. Sebastiani, “Machine learning in automated text categorization,” ACM Computing Surveys, vol. 34, pp. 1-47, 2002.

[26] Twitter API. Available at: https://dev.Twitter.com/rest/public/search. Accessed on May 25, 2017.

[27] M. S. Nawaz, M. Bilal, M. I. Lali, R. U. Mustafa, W. Aslam and S. Jajja, “Effectiveness of social media data in healthcare communication,” Journal of Medical Imaging and Health Informatics, vol. 7, no. 7, pp. 1365-1371, 2017.

[28] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. H. Witten, “The WEKA data mining software: An update,” ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10-18, 2011.

[29] H. Hsu and P. A. Lachenbruch, “Paired t Test,” Wiley Encyclopedia of Clinical Trials, 2008.

[30] P. Fournier-Viger, A. Gomariz, T. Gueniche, E. Mwamikazi and R. Thomas, “TKS: Efficient mining of top-k sequential patterns,” In Proceedings of International Conference on Advanced Data Mining and Applications, pp. 109-120, 2013.

[31] P. Fournier-Viger, A. Gomariz, M. Campos and R. Thomas, “Fast vertical mining of sequential patterns using co-occurrence information,” In Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 40-52, 2014.


[32] P. Fournier-Viger, T. Gueniche, S. Zida and V. S. Tseng, “ERMiner: Sequential rule mining using equivalence classes,” In Proceedings of International Symposium on Intelligent Data Analysis, pp. 108-119, 2014.

Raza Ul Mustafa completed his Bachelor in Computer Science in 2012. Then he received his MS degree in Computer Science in 2014. Currently he is working as a Lecturer in the Department of Computer Science, COMSATS Institute of Information Technology, Sahiwal, Pakistan. His research areas include Machine Learning, Web Mining, SEO and Software Defined Networking.

M. Saqib Nawaz is currently pursuing his PhD in Information Science at Peking University, China. He did his MS (Computer Science) at the University of Sargodha, Pakistan in 2014 and his BS (Computer Engineering) at Engineering University, Peshawar, Pakistan in 2011. His areas of interest include Formal Methods, Social Network Analytics and Parallel Computing.

Javed Ferzund is an associate professor at the Department of Computer Science, COMSATS Institute of Information Technology, Sahiwal, where he served as Head of Department from 2013 to 2015. He received his PhD degree from Graz University of Technology, Austria in 2009. His main research interests include Big Data Analytics, Internet of Things and Machine Learning. In particular, he is interested in applications of IoT and Big Data in the Agro-Informatics and Bioinformatics fields. Currently, he is leading the Big Data Analytics Research Group at COMSATS Institute Sahiwal.

M. Ikram Ullah Lali is currently working as an Associate Professor at the Department of Software Engineering, University of Gujrat, Pakistan. He obtained his PhD degree in Computer Science from COMSATS Institute of Information Technology, Islamabad, Pakistan. His areas of interest include Machine Learning, Formal Methods, Software Testing and Distributed Computing. Several of his articles have been published in conferences and reputed journals.


Basit Shahzad earned his PhD from University Technology Petronas, Malaysia. Dr. Shahzad is currently an Assistant Professor at the Faculty of Engineering and Computer Science, NUML, Islamabad. Before this, he served at King Saud University, Riyadh, and worked as an Assistant Professor at COMSATS Institute of Information Technology, Islamabad. Dr. Shahzad has numerous publications in journals and conferences of international repute and has a very active research profile. He has held editorial roles in several conferences and journals of high repute and has edited a number of special issues in significant journals in the areas of software engineering, social networks, and mobile healthcare. His research interests include Information Systems (Enterprise Architecture, Software cost and risk modeling, Mitigating technological risks in modern banking), Advancements in Research Methodologies (Quantitative, Qualitative, Mixed method), mobile healthcare, and Social Networks.

Philippe Fournier-Viger is a full professor and Youth 1000 scholar at the Harbin Institute of Technology Shenzhen Graduate School, China. He received a Ph.D. in Computer Science at the University of Quebec in Montreal (2010). He has published more than 130 research papers in refereed international conferences and journals, which have received more than 1,100 citations. His research interests include data mining, pattern mining, sequence analysis and prediction, text mining, e-learning, and social network mining. He is the founder of the popular SPMF open-source data mining library, which has been cited in more than 400 research papers since 2010. He is also editor-in-chief of the Data Science and Pattern Recognition journal.