[IEEE 2013 ACS International Conference on Computer Systems and Applications (AICCSA) - Ifrane,...

4
Big Data for Geo-political Analysis: application on Barack Obama’s remarks and speeches Anass Bensrhir Big-data and Business Intelligence Consultant Casablanca - Morocco [email protected] AbstractThis document gives and insight about Big-data and its application in geo-political analysis. I examined nearly 1 gigabyte of Barack Obama speeches and remarks, and analysed every word and every sentence using a combination of artificial intelligence and Natural language processing algorithms. My main aim has been to showcase the extraordinary potential Big-data application on political subjects has to improve our understanding of geo-political situations, by allowing us to test and refine existing assumptions (such as president focus) and make new discoveries, such as the clear impact of an election debate over the country’s situation. Further research and work in this area will likely contribute to making Big-data driven approaches to political science. KeywordsBig-Data, NoSQL, NLP , Hadoop ,Python, nltk I. INTRODUCTION The term "big data" refers to large, complex datasets that require an enormous amount of computing power to work with. Because over the years, the cost of computing has fallen exponentially, and the advent of cloud computing essentially gives anyone access to a supercomputer at their fingertips, the big data revolution has become a reality. Big data has a great potential to change and revolutionize not just research, but also the way governments, organizations, and big corporations conduct business and find hidden patterns, this will more likely change the way companies make decisions. II. ARCHITECTURE The data for this study was obtained from the website of the ‘Whitehouse’, this website consists of a Huge source of data from briefing room transcripts to detailed data of Government programs and nation issues, through a process known as ‘scraping’. As part of this project, the Scrapper I wrote collected transcripts of Obama speeches and Remarks, for the period 2009-2013. These transcripts, which are written in html format, amount to approximately 1 gigabyte of data including images. The dataset for this study was created using several further Analysis and interpretation programs I wrote for this purpose, which analysed this mass of html and measured its various relevant features and hidden patterns. Figure 1 presents the architecture I chose to handle the whole process which is powered by several technologies. Fig. 1 The Big-data processing unit’s Architecture used in this research III. METHODOLOGY I will now move on to look at the different stages and procedures that helped me pulling out insightful results , from gathering relevant data to storing and analysing it . A. Scraping As discussed earlier, the format of the transcripts published by the Whitehouse are encoded in html, data Scraping is a technique in which a computer program extracts data from human-readable output coming from websites. For this purpose I built a scraper to gather every transcript in the Whitehouse’s website, this scrapper looks for this URL pattern (http://whitehouse.gov/the-press office/year/day/month/title) and stores its html content. Over 2600 documents were scrapped and stored accordantly B. Data preparation Almost every scraped data needs a kind of cleaning and preparation; either to remove HTML formatted data or data that is riddled with inconsistencies. Without clean data, every Big Data initiative will take longer, cost more, and deliver fewer benefits. 978-1-4799-0792-2/13/$31.00 ©2013 IEEE

Transcript of [IEEE 2013 ACS International Conference on Computer Systems and Applications (AICCSA) - Ifrane,...

Page 1: [IEEE 2013 ACS International Conference on Computer Systems and Applications (AICCSA) - Ifrane, Morocco (2013.05.27-2013.05.30)] 2013 ACS International Conference on Computer Systems

Big Data for Geo-political Analysis: application on Barack Obama’s remarks and speeches

Anass Bensrhir Big-data and Business Intelligence Consultant

Casablanca - Morocco [email protected]

Abstract— This document gives and insight about Big-data and its application in geo-political analysis. I examined nearly 1 gigabyte of Barack Obama speeches and remarks, and analysed every word and every sentence using a combination of artificial intelligence and Natural language processing algorithms. My main aim has been to showcase the extraordinary potential Big-data application on political subjects has to improve our understanding of geo-political situations, by allowing us to test and refine existing assumptions (such as president focus) and make new discoveries, such as the clear impact of an election debate over the country’s situation. Further research and work in this area will likely contribute to making Big-data driven approaches to political science. Keywords— Big-Data, NoSQL, NLP , Hadoop ,Python, nltk

I. INTRODUCTION The term "big data" refers to large, complex datasets that

require an enormous amount of computing power to work with. Because over the years, the cost of computing has fallen exponentially, and the advent of cloud computing essentially gives anyone access to a supercomputer at their fingertips, the big data revolution has become a reality.

Big data has a great potential to change and revolutionize

not just research, but also the way governments, organizations, and big corporations conduct business and find hidden patterns, this will more likely change the way companies make decisions.

II. ARCHITECTURE The data for this study was obtained from the website of the

‘Whitehouse’, this website consists of a Huge source of data from briefing room transcripts to detailed data of Government programs and nation issues, through a process known as ‘scraping’. As part of this project, the Scrapper I wrote collected transcripts of Obama speeches and Remarks, for the period 2009-2013. These transcripts, which are written in html format, amount to approximately 1 gigabyte of data including images. The dataset for this study was created using several further Analysis and interpretation programs I wrote for this purpose, which analysed this mass of html and measured its various relevant features and hidden patterns.

Figure 1 presents the architecture I chose to handle the

whole process which is powered by several technologies.

Fig. 1 The Big-data processing unit’s Architecture used in this research

III. METHODOLOGY I will now move on to look at the different stages and

procedures that helped me pulling out insightful results , from gathering relevant data to storing and analysing it .

A. Scraping As discussed earlier, the format of the transcripts published

by the Whitehouse are encoded in html, data Scraping is a technique in which a computer program extracts data from human-readable output coming from websites. For this purpose I built a scraper to gather every transcript in the Whitehouse’s website, this scrapper looks for this URL pattern (http://whitehouse.gov/the-press office/year/day/month/title) and stores its html content. Over 2600 documents were scrapped and stored accordantly

B. Data preparation Almost every scraped data needs a kind of cleaning and

preparation; either to remove HTML formatted data or data that is riddled with inconsistencies. Without clean data, every Big Data initiative will take longer, cost more, and deliver fewer benefits.

978-1-4799-0792-2/13/$31.00 ©2013 IEEE

Page 2: [IEEE 2013 ACS International Conference on Computer Systems and Applications (AICCSA) - Ifrane, Morocco (2013.05.27-2013.05.30)] 2013 ACS International Conference on Computer Systems

In order to detect which kind of errors and inconsistencies are to be removed, a detailed data analysis is required. In addition to a manual inspection of the data.

1) Extracting the important data: Since the data scrapped is formatted in HTML and

published within the WhiteHouse CMS, a data analysis helped me to find the HTML div tag responsible for showing the important data. In order to extract the important data

Fig. 2 HTML source code of the transcripts in the Whitehouse’s website

As the figure shows <div id=”content”> is the html tag that embeds all the transcripts, getting every data inside that div was done by BeautifulSoup by setting //*[@id="content"] as the XPATH, every partially cleaned transcript was stored in a separate MongoDB document.These transcripts which were extracted and stored in MongoDB database needed to be converted to RAW text to be analysed afterwards.Getting text out of HTML is a sufficiently common task that NLTK provides a helper method nltk.clean_html.

2) Scoring for Relevance: As discussed before data cleaning can skip unwanted data,

for instance some speeches were given by Michelle Obama (first lady) but they are published in the same category of the Whitehouse’s website, but they can be identified by the title , official Obama transcripts are marked with this title “Remarks of President Barack Obama”.

Bayesian classification can solve this kind of problem, The algorithm aims to predict the class membership probabilities, such as the probability that a given speech belongs to President Obama class, or to predict the most probable class that the pattern belongs to. The Bayesian algorithm was trained using a dataset of Titles which identifies the speech Author and found that 83% of the transcripts belong to Barak Obama.

IV. DATA ANALYSIS AND RESULTS The dataset for this study was computed using several further computer programs written in Python, which analysed this mass of transcripts and measured various relevant features.

Fig. 3 The data mining process used in this study

Hadoop and MapReduce were used to perform parallel computing between the 2 data processing servers, parallel processing is very important nowadays because several analysis and artificial intelligence algorithms need hundreds of Gigaflops per second to measure different parameters.9

The automated data processing period lasted 10 hours. My

aim of this research was to find a way to answer these questions:

� What were the most spoken or referred words used by

Obama (overall, each year)? � What were the most referenced countries, personalities

and in which context? � What was Obama’s presidential focus, Internal or

foreign affairs or is there a balance? � How the levels of speech’s polarity and subjectivity are

associated with the presidential period? � How the level of modality is associated with the

presidential period? � Was there any big change in speech patterns within

presidential debate period (post October 2012)? To answer those questions, I needed to find relevant

parameters to measure:

A) Words Frequency: This parameter is very important to measure Obama’s

emphasis or political focus on common problems or situations. This data might provide different perspectives depending on the time zone selected.8

B) Sentiment Analysis: Sentiment analysis concerns the use of the use of automatic

methods for predicting the orientation of subjective content on text documents aiming to determine the attitude or polarity of a text dataset with respect to a topic or the overall contextual polarity of a document.

C) Mood Analysis: Mood parameter tries to identify a collection of

sentences as indicative, or conditional: � Indicative � Conditional

D) Modality Analysis: Modality is a grammatical parameter expressing the degree

of certainty, in other words, modality is what allows speakers to attach expressions of belief, attitude and obligation to statements. Example: "I wish our troops in Iraq would return safe to our land" has a lower modality because it is tied to a probability that might or may not occur. Whereas “We will surely win this war” has a positive modality because it might be considered as factual.7

I will now move on to look at the different results that can be pulled from the analysis.

Page 3: [IEEE 2013 ACS International Conference on Computer Systems and Applications (AICCSA) - Ifrane, Morocco (2013.05.27-2013.05.30)] 2013 ACS International Conference on Computer Systems

1) Presidential focus:

The variety of different types of speaking opportunity (from state of union speeches, remarks by the president to presidential debates) mean that there is space for quite a wide range of issues to be discussed in any given year. Nevertheless, we might also expect that the president focuses on specific issues of importance at specific times, to the detriment of others, both because of wider changes in geopolitical situations and because of the agenda of the Bureau of president, which is able to control to an extent the topics exposed.

For this part of the project, the data in the dataset was

analysed to determine the frequency with which 30 different words or topics were discussed.

Figure 5 tracks changes in the extent to which different topics are discussed by Obama (from 2009 to 2013). A number of conclusions can be drawn from this graphics. Some topics, such as economy, jobs and taxes have remained relatively stable over time, year on year fluctuations taken into account. Others have shown notable trends such as Romney and security and military.

Fig. 4 Map of frequency of which countries were referenced in speeches

Figure 4 tracks frequency of which the president speaks or make a reference to different countries, we can notice 4 levels in this graph:

- The first level is a light grey which covers countries like Morocco, South Africa and Norway

- The second level is a light blue that covers mainly European and some middle-east, Arab countries

- The third level is sky blue that covers some powerful countries like (Russia , Israel , China) and controversial countries like (Iran , North Korea…)

- The Forth level is dark blue, which is represented by United States, which means President Obama gives major important to United States in his speeches.

Fig. 5 From left to right, Most spoken and emphasized words by Obama from

2009 to 2013

Three broad conclusions can be drawn from this analysis of

the topics discussed by Barak Obama. Firstly, in the space of a year, the scope for discussion is large. Only the environment can genuinely claim to have received little attention, the president focuses on his speech mainly on internal affairs like recession and economy. Secondly, however, attention does change over time, and in particular does seem to be clearly linked to external events like presidential elections and crisis. Finally, even if attention does change over time, President Obama gives more importance to his own country in his speeches, which is different from Former president Bush who was known to give more attention to war against terrorism and external affairs.

2) Polarity regarding different events : In this second section, I want to move on to look in more

detail at the President’s sentiment regarding some major events, analysing the extent to which his polarity and the subjectivity of his speeches changes according to the environment , The results of these classifications are summarised in Figure 5.

As has already been outlined previously, the parameters were computed by artificial intelligence algorithms which means that it should be treated as an estimated figure. However, there is no reason to believe that these results affect the reality in the long run; actually a sample of the results was treated manually to confirm the parameters and has shown a high level of accuracy, hence a sentiment level below 0.10 can be considered as negative.

In October, 4 2010, President Obama met with the Economic Recovery Advisory Board to discuss economic difficulties and recovery plans, president Obama gave multiple speeches in the same month after that date to present the situation and to give some plans for recovery, we can clearly notice in Figure 5, that president’s sentiment was negative but with a slightly negative subjectivity in the speech, which shows that he was less optimistic about the situation.

Page 4: [IEEE 2013 ACS International Conference on Computer Systems and Applications (AICCSA) - Ifrane, Morocco (2013.05.27-2013.05.30)] 2013 ACS International Conference on Computer Systems

Fig. 5 Variation of sentiment and subjectivity computed from Obama

Remarks and speeches from 2009-2013

In May 1 2011, Obama announces that Osama bin Laden has been killed in Pakistan, as the figure shows, Obama’s sentiment increased significantly and gave a speech with higher subjectivity, the president mentioned that his war against terrorism has won a big challenge by the death of the leader of Al-Qaeda.

Starting from March 2012, the president’s sentiment was decreasing significantly due to the upcoming elections and the first 2 presidential debates between him and Romney, this behavior was described by many political analysts as a decrease in Obama’s confidence in winning the second term. We may also notice that this behavior changed significantly after the Third debate and over November 6, when he was re-elected.

3) Modality regarding different events: In this third section, I want to move on to look in more

detail at the President’s modality regarding some major events over the period of 2010 to 2013, analysing the grammatical structure of every sentence in the speech. For instance, the presidential speeches have a constant amount of modality and inductive form around 80%, which means that every speech has the same degree of certainty, in other words, expressions of belief, attitude and obligation remains balanced even with major events and crisis. But conditional sentences’ frequencies changes over major events. We can notice a repetitive pattern over each year as Obama uses more conditional sentences at the end of each year,

As in August, 2011, President Obama signed into law the Budget Control Act of 2011 to raise the federal debt ceiling , in that period he used more conditional statements to wish or suggest resolutions depending upon other conditions.

We can also notice that he was using more and more conditional statements months before the elections to present his program to American citizens with a high optimistic way.

Fig. 6 Variation of modality inductive and conditional sentences computed

from Obama Remarks and speeches from 2009-2013

V. CONCLUSION This article has attempted to tackle two major tasks. Firstly,

it has highlighted benefits of Big-data to Analyse and measure political and geo-political events, Minor events can generate a chain of change that plots the whole identity of a country.

Secondly, and more importantly, it has highlighted both the

ability of polarity and mood of a speech to demonstrate the political view of the president and show periods of power and weaknesses during a timeline . Overall modality of Obama remained stable over the time which can show the importance his bureau gives to each speech to stay in a balance of Truth, Hope and perspectives. The types of subjects being referred have slightly changed: the environment has become much more prominent, whilst topics such as security and war on terrorism have faded with time.

Overall, the main aim has been to showcase the

extraordinary potential Big-data application on political subjects has to improve our understanding of geo-political situations, by allowing us to test and refine existing assumptions (such as president focus) and make new discoveries, such as the clear impact of an election debate over the country’s situation .Further research and work in this area will likely contribute to making Big-data driven approaches to political science.

REFERENCES

[1] Apache Hadoop website http://hadoop.apache.org [2] Ubuntu Linux http://www.ubuntu.com [3] MongoDB tutorial [Online]. Available:

http://docs.mongodb.org/manual/tutorial/getting-started [4] MapReduce tutorial [Online]. Available:

http://hadoop.apache.org/docs/r1.1.1/mapred_tutorial.html [5] Opinion mining and sentiment analysis - Cornell University . [Online].

Available: www.cs.cornell.edu/home/llee/omsa/omsa.pdf [6] NLTK tutorial. [Online]. Available:

http://www.cogsci.ucsd.edu/~rik/courses/readings/bird05-NLTK-intro.pdf

[7] Paul Portner, “Modality” Oxford linguistics [8] Harald Baayen, “Word Frequency Distributions (Text, Speech and

Language Technology)” [9] Michael G. Noll, Writing An Hadoop MapReduce Program In Python.

[Online]. Available: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python.