QUT Stage2 Document Avijit Paul
Creative Industries Faculty - Queensland University of Technology
Extracting meaningful information from Social Network streams for crisis mapping Avijit Paul (n8459941) Stage 2 Proposal, Doctor of Philosophy May 2012
“Extracting meaningful information from Social Network streams for Crisis Mapping” Avijit Paul – n8459941 – PhD -‐ Stage 2 Proposal -‐ [email protected]
Table of Contents
1. The Proposed Title
2. The Proposed Supervisors and their Credentials
  Principal Supervisor: Associate Professor Dr. Axel Bruns
  Associate Supervisor: Associate Professor Dr. Dian Tjondronegoro
  Associate Supervisor: Dr. Oksana Zelenko
3. Background and Literature Review
  Keywords
  Research Domain
3.1 Introductory Statement
3.2 Literature Review
  New Media & Communication Studies
    Crisis Communication and Social Media
  Twitter Analytics
    Contextual Analysis
    Computational Linguistic
  Information Design
    Visual Analytics
  Early Detection
3.3 Research Problem
  Central Research Problem: How to extract and present useful information from Social Media stream during crisis time?
  Sub Problem 1: How to identify what is useful information?
  Sub Problem 2: How to capture selected data from Social Media Stream?
  Sub Problem 3: How to extract and analyse captured data in real time to find useful information
  Sub Problem 4: How to present the information to stakeholders
4. Program And Design Of The Research Investigation
4.1 Objectives, Methodology and Research Plan
4.2 Resources and Funding Required
  Books and journals required
4.3 Individual Contribution to the Research Team
4.4 Timeline of Completion of the Program
5. Reference List
6. Appendix
6.1 Coursework
1. The Proposed Title
Extracting meaningful information from Social Network streams for Crisis Mapping
2. The Proposed Supervisors and their Credentials
Principal Supervisor: Associate Professor Dr. Axel Bruns
Dr. Axel Bruns is an Associate Professor in the Creative Industries Faculty at Queensland
University of Technology (QUT) in Brisbane, Australia, and a Chief Investigator in the ARC Centre of
Excellence for Creative Industries and Innovation (cci.edu.au). He is the author of Blogs, Wikipedia,
Second Life and Beyond: From Production to Produsage (2008) and Gatewatching: Collaborative
Online News Production (2005), and the editor of Uses of Blogs with Joanne Jacobs (2006; all
released by Peter Lang, New York). In addition to developing metrics to analyse and map Twitter data,
in recent years he has published widely in the area of social networks and crisis communication,
including on topics such as “Twitter and Crises” and “Twitter and Disaster Resilience”.
Associate Supervisor: Associate Professor Dr. Dian Tjondronegoro
Dr. Dian Tjondronegoro is an Associate Professor at QUT, researching and teaching in the area of
“Mobile and Multimedia Technologies”. Dr. Tjondronegoro leads the “Mobile Multimedia Research
Group” and teaches in the area of “Mobile Devices and Mobile Application Development”. Of specific
significance to this project is his expertise in extracting semantic content from video using
audiovisual features. Prior to this, Dr. Tjondronegoro examined cross-media content tagging and
clustering of text, image, and video to support extraction of semantically related web content.
Associate Supervisor: Dr. Oksana Zelenko
Dr. Oksana Zelenko is a researcher in the Creative Industries Faculty at QUT. Her research
focuses on the role of visual and interaction design in the field of mental health promotion for
children and young people. Her previous design work included researching and developing online
visual counselling tools that are currently in use by one of Australia's largest youth counselling
organisations. In addition, Dr. Zelenko has demonstrated expertise in the area of information
design for community resilience and organisational communication.
3. Background and Literature Review
During recent natural disasters (e.g., the Queensland floods of 2010-2011 and the earthquake, tsunami
and nuclear crisis in Japan in 2011), millions of status updates appeared on various social networks,
indicating that people’s reliance on social media at times of disaster has increased tremendously
in recent years. The greatest concern, however, when harvesting information from social network
users for emergency services is the uncertain credibility of the received content. At present it is
highly problematic to differentiate between information that has a high degree of crisis-relevance
and information that has a very low degree of crisis-relevance. Prior research by Bruns (2011) and
Potts et al. (2011) shows that certain methods, such as following keywords and hashtags in publicly
available Twitter data, make it possible to identify information related to a specific crisis in
progress and to extract meaningful information from these status updates or tweets.
However, as tweets are produced and disseminated extremely quickly, there remains the very
practical problem of filtering the highly useful information stream from non-relevant tweets
(Boulos et al., 2011). This is not simply an inconvenience; it poses a significant challenge that, if
resolved, can mean the difference between life-saving decisions and life-wasting decisions.
This concern is compounded by the complex task of appropriately disseminating the crisis-relevant
information harvested by filtering the social media stream to the multiple government disaster
relief agencies (DCS, 2011) and Non-Government Organisations (NGOs) whose relief capacities,
resources and decisions would benefit greatly from such information. Additionally, as the state
and the value of information change constantly during a crisis, information representation
techniques need improvement in order to present temporal data in an actionable manner. The
literature demonstrates a gap in current approaches to presenting such information to these
stakeholders.
Therefore this project will address some of the issues that surround the management and the
dynamic state of an unfolding disaster by extracting high-value, context-specific and chronologically
framed disaster-based information. Through a process of digitally harvesting and categorising social
media conversation streams, this project also seeks to deliver both a framework and a system that
will facilitate key decision-making processes during times of natural disaster.
Keywords
Natural Disaster, Flood, Earthquake, Social Network Analysis, Twitter Analytics, Big data,
Visualisation, Information Retrieval, Text Mining, Machine Learning, Natural Language Processing.
Research Domain
This research utilizes an interdisciplinary approach that combines elements from media and
communication studies, crisis communication, communication design, Twitter analytics, sentiment
analysis and computational linguistics.
Image 1: Domain areas of this research
3.1 Introductory Statement
The first 24 hours are often the most critical time during any natural disaster and are also the
period when most community harm occurs (DCS, 2011). Casualties increase when relief organisations
respond slowly because they lack verifiable information (Meier, 2012). The Department of
Community Safety (DCS) of the Queensland Government, for example, in its 2011 report entitled “All
Hazards’ Information Management Program”, has identified reducing response time during disaster
events as a priority in order to reduce community harm (DCS, 2011) (Image 2 below).
Image 2: Enhancing disaster response system from current to future (DCS, 2011).
Prior research suggests that by using crowd-sourced information from various sources, including
social networks, it may be possible to shorten the time it takes to find the information needed for
a faster response (Platt, Hood, & Citrin, 2011). In recent disasters people from all over the world
used social network sites to report their situation and seek help. This made social media streams an
extremely powerful information source during crisis events (Muralidharan, Rasmussen, Patterson, &
Shin, 2011). Two social networking sites, Facebook and Twitter, were the most popular during these
acute events. However, prior research suggests that due to its “walled-garden” approach, Facebook
is less accessible than Twitter for public communication (Bruns, 2012).
As Twitter updates are visible even to a non-registered user and Twitter allows a user to follow
another user without needing to know the person, anyone can quickly follow a crisis authority
during a disaster to receive real-time updates. This enables Twitter both to draw on information
sources and to become an information source at the same time. For this reason Twitter is the social
network of choice for this research.
However, as updates on Twitter happen extremely quickly, keeping track of all the updates to
extract useful information is a daunting task. Additionally, during a crisis different authorities require
different information to act on. Selecting the relevant set of information for each authority is a
challenge in harnessing the power of social media (DCS, 2011). According to the CCI Floods report by
Bruns, Burgess, Crawford & Shaw (2012), tweets during a crisis can be grouped into five major
categories: Information, Media Sharing, Help and Fundraising, Direct Experience, and Discussion and
Reaction. Extracting and presenting tweets in such groups can provide authorities with actionable
information. However, not all tweets can be grouped distinctly, and therefore a challenge remains in
identifying, in real time, tweets that do not clearly fall into a certain group.
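To illustrate the grouping problem, the following Python sketch assigns a tweet to one of the five categories using keyword lists. It is only an illustration under stated assumptions: the keyword lists, the function name categorise and the labels "Ambiguous" and "Uncategorised" are invented for this example and are not taken from the CCI Floods report, which does not describe a keyword-based method.

```python
import re

# Hypothetical keyword lists, invented for illustration only.
CATEGORY_KEYWORDS = {
    "Information": {"closed", "warning", "evacuate", "update"},
    "Media Sharing": {"photo", "video", "pic"},
    "Help and Fundraising": {"donate", "volunteer", "appeal"},
    "Direct Experience": {"stranded", "trapped"},
    "Discussion and Reaction": {"thoughts", "unbelievable", "hope"},
}


def categorise(text):
    """Return the single matching category, or flag the tweet for review."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    matches = [name for name, words in CATEGORY_KEYWORDS.items() if tokens & words]
    if len(matches) == 1:
        return matches[0]
    # More than one match: the tweet does not fall clearly into one group.
    return "Ambiguous" if matches else "Uncategorised"


print(categorise("Road closed at the bridge."))
print(categorise("Please donate and share this photo"))
print(categorise("Good morning"))
```

Tweets that come back as "Ambiguous" or "Uncategorised" are exactly the cases the paragraph above describes as hard to group in real time; a fixed keyword list cannot resolve them, which is why contextual methods are considered later in this proposal.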
Additionally, a large body of current Twitter research uses methods such as hashtag tracking to
identify messages related to a specific natural disaster and to extract meaningful information from
them (Bruns, Burgess, Crawford, & Shaw, 2012). However, this method of tracking via pre-defined
keywords has its limitations. As most natural disasters are unpredictable events, it is difficult to guess
in advance which keywords will become popular and noteworthy enough to be selected for tracking.
Additionally, when a crisis happens, users may introduce new keywords or hashtags, which may take
time to become noteworthy or may be abandoned again as other, similar keywords gain importance
(Bruns & Liang, 2012).
In addition, Twitter carries plenty of rumours and false information (Gayo-Avello, 2012), which
makes information credibility one of its biggest issues (Castillo, Mendoza, & Poblete,
2011), (Gupta, Zhao, & Han, 2012). Not all messages that appear in the tweet stream are authentic.
As a result, rumours and fake information during disasters often create unnecessary
complications (Mendoza, Poblete, & Castillo, 2010) and contribute significantly to the irrelevant
information, or noise, which needs to be eliminated in order to find information that is useful.
Therefore, finding pieces of information from their early ripples and grouping them together before
they become prominent is one of the key areas of this research.
Furthermore, as a crisis continues, the status and condition of the situation are updated, which may
make earlier information irrelevant. At present most crowdsourced crisis information
visualisations use some form of map to display information (Elwood, 2011). However, map data
often do not portray this temporal aspect of the information. As presenting chronological
information about a disaster is crucial for informed decision making at times of disaster, this is
another key area of this research.
Therefore the primary aim of this research is to formulate new research perspectives and
methods to extract and present relevant information from ongoing social media updates during
natural disasters. By building a theoretical framework and an online system, this project will harvest
social media conversation streams to help make life-saving decisions.
3.2 Literature Review
New Media & Communication Studies
In recent years, new media such as social media have heavily influenced the way we
communicate socially and interpersonally (Baym, Zhang, & Lin, 2004). While social media have given
ordinary citizens the power to broadcast their message to a potentially unlimited number of people,
the inability to identify who actually reads the message makes that power very limited at the same
time. Therefore, quite often when someone tweets they only have an imagined audience in mind and
hope someone will read it (Boyd & Marwick, 2011). According to prior research, this imagined
audience affects how people tweet and how they balance their authenticity and reputation in the
tweetverse (Boyd, 2011). As this research focuses on communication via social media, theories of
new media and communication studies will be extensively reviewed. Additionally, to gain a better
understanding of social media usage in crises, literature on crisis communication will be thoroughly
reviewed.
Crisis Communication and Social Media
Prior research suggests that people have been using Twitter for spontaneous volunteerism in
recent crisis situations (Starbird & Palen, 2011). When a crisis looms, ordinary citizens who are not
affected move from a passive ‘everyday user’ role to a more active one (Bruns, 2011), reaching out to
help people by using social media. Concepts such as ‘Voluntweeters’, a self-organising online
microblogging volunteer community, have emerged in recent natural disasters without any directive
or influence from governments or authorities (Starbird & Palen, 2011). Where people are unable to
directly contribute information from the ground, they tend to ‘retweet’ very quickly in an effort to
spread the news as fast as possible (Starbird, 2012).
Apart from collective behaviour phenomena, Twitter has also been used for intensified
information search, social convergence in physical space, and information contagion (Starbird, Palen,
Hughes, & Vieweg, 2010). As Twitter has repeatedly been shown to maintain connectivity (Bruns,
2011), provide ways to show empathy for the people involved (Sarcevic et al., 2012), streamline
multi-channel communication processes and remain readily accessible to the news media
during crisis situations (Large, 2012), a thorough understanding and testing of crisis communication
theories can help to create the framework needed to analyse social network data
sets in real time.
Twitter Analytics
These two-way communications, multiplied by thousands of people, create a firehose of
information (Wu, Hofman, Mason, & Watts, 2011). The Twitter firehose consists of the entire tweet
stream at any given time (Dong et al., 2010). Since updates can arrive extremely quickly and in
massive numbers (more than 5,000 tweets per second on Twitter alone during the Japan tsunami
(Empson, 2012)), microsyntax conventions such as hashtags are particularly useful for bringing a
particular topic to the forefront of an ongoing conversation (Stamberger, 2010). However, the
factors that establish a keyword as a hashtag are still not well researched (Cullum, 2010). In fact,
there is limited research on extracting useful information from the firehose. Furthermore,
identifying keywords or hashtags alone is not enough, as various other metrics such as widely shared
links, influential users and retweets can be significant and are important items to extract and
analyse (Boyd, Golder, & Lotan, 2010).
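The metrics named above (hashtags, shared links, mentioned users and retweets) can be read from the text of a tweet itself. The following Python sketch shows one way to do so with regular expressions. It is a minimal illustration, not the method of the cited studies: the function names, the sample tweets, the account name and the hashtags are hypothetical, and a retweet is assumed to be marked by the "RT @user" text convention.

```python
import re
from collections import Counter

URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


def extract_features(text):
    """Pull hashtags, mentions, links and a retweet flag out of one tweet."""
    urls = URL.findall(text)
    # Remove links first so a URL fragment such as "page#section"
    # is not mistaken for a hashtag.
    body = URL.sub(" ", text)
    return {
        "hashtags": [h.lower() for h in HASHTAG.findall(body)],
        "mentions": [m.lower() for m in MENTION.findall(body)],
        "urls": urls,
        "is_retweet": body.startswith("RT @"),  # assumed text convention
    }


def top_items(texts, feature, n=3):
    """Count a list-valued feature ("hashtags", "mentions" or "urls")."""
    counts = Counter()
    for text in texts:
        counts.update(extract_features(text)[feature])
    return counts.most_common(n)


# Hypothetical sample tweets, invented for this example.
sample = [
    "RT @example_agency: Bridge closed #examplefloods http://example.org/map",
    "Stay safe everyone #ExampleFloods #staysafe",
    "@example_agency is the bridge open? #examplefloods http://example.org/map",
]
print(top_items(sample, "hashtags"))
print(top_items(sample, "urls"))
```

Counting in this way is cheap enough to run on each incoming tweet, but it says nothing about whether a widely shared item is relevant or credible, which is the harder question this research addresses.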
At present the most common Twitter analytics are done by tracking keywords and hashtags (Bruns
& Liang, 2012). Other analytics involve locating and profiling user IDs (Twitter handles) (Yugami, Igata,
Anai, & Inakoshi, 2012), geotagging (Lee, Wakamiya, & Sumiya, 2011), URL and linkage data
(Aggarwal, 2011), etc. Twitter analytics have been used to predict academic citations
(Eysenbach, 2011), track temporal patterns of happiness (Dodds, Harris, Kloumann, Bliss, & Danforth,
2011) and find meaningful expressions of engagement (Huston, Weiss, & Benyoucef, 2011). Most,
if not all, Twitter analytics are however post-hoc: the data is archived first and analysed later. In
the early stage of this research I will use the most appropriate of the available methods
to simulate and test my hypothesis, and I will develop a new method for real-time testing in the last
phase of the research. This presents the first research gap: extracting meaningful and useful
information from ongoing social media updates during a crisis.
Contextual Analysis
Even though real-time data processing can be used to extract data (Vlachos, 2011), on its own it
cannot identify meaning from a given context. In order to understand meaning, a system needs
to learn the rules and patterns (Valero, Gómez, & Pineda, 2009). Different methods, such as
dictionary-based, rule-based and hybrid approaches, have been proposed for such pattern or named
entity recognition (Song, Tjondronegoro, & Docherty, 2012), (Döhling & Leser, 2011). However,
limited research has been conducted that combines disaster response with contextual and
sentiment analysis and named entity recognition (Park, Cha, Kim, & Jeong, 2012), (De Fortuny, De
Smedt, Martens, & Daelemans, 2012). To pick up early disaster signals,
contextual analysis can be used to find the meaning of a word in context (Maxwell, Raue, Azzopardi,
Johnson, & Oates, 2012). By mining subjective expression or opinion, it will therefore be possible to
differentiate between similar words used in different contexts and so avoid creating false alarms
while grouping extracted data from a social media stream (Liu, 2010).
Computational Linguistic
In recent years there has been a growing interest in using computational linguistics with Twitter
during a crisis (Corvey, Vieweg, Rood, & Palmer, 2010), mostly to identify trending keywords (Sakaki,
Toriumi, & Matsuo, 2011). It has also been used to identify problems with products and services
from Twitter data (N. K. Gupta, 2011). As this research requires extensive analysis of text data in
order to understand the use of words in context, methods of computational linguistics in
emergencies will be studied in order to isolate noise from useful data.
Information Design
Traditionally, maps have been used to represent crisis-related data in order to identify priority
areas (Tufte, 2001). However, as information changes rapidly in social networks, presenting crisis
information gathered from social networks on a map may not be the best approach. Furthermore,
most of the available crisis presentation systems require extensive manual entry and monitoring in
a system that projects the data onto a crisis map (Meier, 2012). Although this has proven useful, it
is often time and resource consuming. Since every minute is important when saving lives after a
natural disaster, alternative information design and presentation techniques such as fractal maps,
heat maps or other non-map-based visualisation techniques will be explored. As there has been
limited research on presenting data generated from such massive datasets during disasters, one
of the major challenges for this research is to present real-time information extracted from the
social network stream in a meaningful manner.
Visual Analytics
As the number of data-driven documents and services increases rapidly, visual analytics has been
gaining momentum in recent years (Bostock, Ogievetsky, & Heer, 2011). Collaborative and
social visualisation techniques have also gained popularity for visualising crowd-sourced data (Heer &
Agrawala, 2008), (Keim et al., 2008). These visual analytic methods and processes will be studied to
find how they can best be used to present the data quickly and effectively in a
crisis situation.
Early Detection
Prior research has suggested using traditional and social media to predict health disasters such as
H1N1 (Liu & Kim, 2011). Social media has also been used to suggest low-level prediction
of natural disasters (Li, Wang, & Liu, 2011). However, once data is gathered, the vast differences
in the information generated make it quite difficult to analyse in real time. In addition,
there is no established methodology to identify the time taken before a certain term becomes a
trending topic: there is a methodological gap when it comes to identifying weak signals surfacing
through social media streams before they become widely visible, in order to understand which
keywords are likely to be important. Limited research has been conducted to identify links between
social media updates and natural disaster prediction. Therefore the third area of interest is to
probabilistically identify the relationship between social media updates and potential natural disasters.
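One simple way to make the idea of a weak signal concrete is to compare how often a term appears in the current time window with its average frequency in earlier windows. The Python sketch below does this. It is a simplified heuristic chosen for illustration, not an established methodology and not the method this research will necessarily adopt; the function name emerging_terms, the thresholds min_count and ratio, and the sample tokens are all assumptions.

```python
from collections import Counter


def emerging_terms(windows, min_count=3, ratio=3.0):
    """Flag terms whose count in the latest window far exceeds their past average.

    windows: non-empty list of token lists, oldest first; the last is the current window.
    """
    *history, current = windows
    past = Counter()
    for tokens in history:
        past.update(tokens)
    n_history = max(len(history), 1)

    scores = {}
    for term, count in Counter(current).items():
        baseline = past[term] / n_history
        # Add-one smoothing so previously unseen terms do not divide by zero.
        score = count / (baseline + 1.0)
        if count >= min_count and score >= ratio:
            scores[term] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical token windows, e.g. one list per five-minute interval.
windows = [
    ["rain", "rain", "traffic"],
    ["rain", "traffic", "traffic"],
    ["rain", "flood", "flood", "flood", "levee", "flood"],
]
print(emerging_terms(windows))
```

In this sample only "flood" is flagged, because it appears four times in the latest window after being absent earlier, while "levee" is still below the minimum count. Choosing the window length and thresholds is itself part of the methodological gap described above.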
3.3 Research Problem
Based on the preceding literature review, a central research problem and four sub-problems are
identified. They are:
Central Research Problem: How to extract and present useful information from Social Media stream during crisis time?
As updates happen extremely quickly in social networks, especially during a crisis, one of the
most important tasks is to extract information that is useful. Even though it is possible to read
through real-time social network data, the problem remains of extracting information that is
useful and usable in close to real time. Additionally, the quality of information degrades over time
and current presentation techniques limit how quickly up-to-date information can be obtained.
Thus, the central challenge of this thesis is to extract useful information from Social Media and
present it with as little delay as possible.
Sub Problem 1: How to identify what is useful information?
As "useful" is a relative term, the first problem to address is: what is useful during a crisis situation?
As prior research shows that a Twitter conversation contains various patterns and metrics of
communication, the first challenge is to find which metrics, patterns and frameworks can identify a
conversation as useful. For example, finding out who the most active users are during a disaster, or
who posts the original messages that are retweeted most, may help significantly in finding useful
conversations, and each of these will therefore be identified as a variable. Once the variables are
determined, the task is to develop the hypothesis and test it on archival data before testing it in a
live environment at a later stage.
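The two example variables mentioned above, the most active users and the authors whose original messages are retweeted most, can be computed from archived tweets as in the following Python sketch. It is a minimal illustration under stated assumptions: each archived tweet is assumed to be a record with "user" and "text" fields, retweets are assumed to follow the "RT @user" text convention, and the function name and sample data are hypothetical.

```python
import re
from collections import Counter

RT_PATTERN = re.compile(r"^RT @(\w+)", re.IGNORECASE)


def user_variables(tweets):
    """Return (tweets posted per user, retweets received per original author)."""
    activity = Counter()
    retweets_received = Counter()
    for tweet in tweets:
        activity[tweet["user"].lower()] += 1
        match = RT_PATTERN.match(tweet["text"])
        if match:
            # Credit the author of the original message, not the retweeter.
            retweets_received[match.group(1).lower()] += 1
    return activity, retweets_received


# Hypothetical archive records, invented for this example.
archive = [
    {"user": "alice", "text": "Bridge closed near the river"},
    {"user": "bob", "text": "RT @alice: Bridge closed near the river"},
    {"user": "carol", "text": "RT @alice: Bridge closed near the river"},
    {"user": "bob", "text": "Anyone need sandbags?"},
]
activity, retweets_received = user_variables(archive)
print(activity.most_common(1))
print(retweets_received.most_common(1))
```

In this sample the most active user (bob) and the most retweeted author (alice) are different people, which is why the two measures are treated as separate variables.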
Sub Problem 2: How to capture selected data from Social Media Stream?
The second problem is to capture data from the social media stream during a crisis. At present
various capture methods are available and deployed, such as TwapperKeeper. However, most of the
available capture methods look for a pre-determined keyword or hashtag or a pre-identified user. As
this research is looking for information from the full firehose tweet stream, newer technologies
such as Hadoop, Twitter Storm and so on will be used to capture the social media stream. Since there
are various methods available, each with its own strengths and weaknesses, finding the right way to
capture will be the second issue to solve.
Sub Problem 3: How to extract and analyse captured data in real time to find useful information
Once the method for capturing information is identified, the next challenge is to analyse it and
separate noise from information. At this stage the hypothesis developed in Sub-problem 1 will be
applied to the data collected in Sub-problem 2. The challenge will be to identify how to filter
information from the data source by applying Twitter analytics, sentiment analysis, computational
linguistics or any other necessary methods in real time to a live Twitter data stream.
Sub Problem 4: How to present the information to stakeholders
Once usable information is extracted, the next challenge is to present it in a way that is relevant
for the stakeholders, authorities, communities and media to act on. As different stakeholders require
different types of information and one filter does not fit all of the information, the challenge is
to identify how to present it to them in a flexible way so that they can act on it. Various visualisation
techniques that were identified within the literature review will be tested at this stage to find out
which technique represents temporal data in a chronological manner most effectively.
4. Program And Design Of The Research Investigation
This research will be divided into four iterative phases that will allow me to continually develop
and evaluate the whole research project (Image 3). The key phases are:
Phase 0: Initial Literature Review (First 3 months)
Phase 1: Building hypotheses and theoretical algorithm from literature (2nd 3 months)
Phase 2: Capturing real-time Twitter data using capturing technologies like Hadoop and Storm (last 6
months of first year)
Phase 3: Real time analytics, hypothesis testing and sending for evaluation (2nd year)
Phase 4: Information design and creating Crisis Visualization. (Initial months in 3rd Year)
Image 3: Key phases of the research design
The phases are broken down below into actionable tasks that allow going back and forth between
the tasks as deemed necessary.
4.1 Objectives, Methodology and Research Plan
The objective of this research is to address the research problems identified in earlier sections.
To do that, mixed methods consisting of various qualitative and quantitative research methods
will be used. As there are various methodologies currently available for data analysis
and communication during disasters, some of the methods will use quantitative data and others will
use qualitative data. Below are some of the broad methods that will be studied during this project.
The first objective is to identify what is useful, and this will be done by reviewing the literature
in this area. The review will analyse reports, media and academic writing on recent research in the
areas of social networks, natural disasters, media & communication studies, crisis communication,
Twitter analytics, sentiment analysis and computational linguistics. Based on these studies, variables
will be identified to determine what is useful in the context of social media conversation during a
disaster.
This will be followed closely by the development and testing of hypotheses on how Twitter users
communicate during a crisis. To do this, I will first slice disaster-related Twitter data (QLD
flood, Japan tsunami, New Zealand earthquake), gathered at CCI from twapperkeeper, using awk
scripts developed by Axel Bruns (awk is a data extraction and reporting tool). By mapping
relationships between Twitter datasets in the areas of both disaster and social media
communication, I will be able to test the developed hypotheses on communication during crisis. I
will then formulate approaches to extract relevant information from a large dataset archived at
QUT. Using visualisation tools such as Gephi, I will also explore possibilities for presenting
information differently. This task will be done after submission of Stage 2.
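The slicing and relationship-mapping step above could look roughly like the following. This is only a minimal Python sketch, not the awk scripts themselves: the column names `from_user` and `text`, the helper names and the sample rows are all assumptions made for illustration. It keeps the tweets that carry a given hashtag and turns their @mentions into a Source/Target/Weight edge list of the kind Gephi can import.

```python
import csv
import io
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")

def slice_by_hashtag(rows, hashtag):
    """Keep only rows whose text contains the hashtag (case-insensitive)."""
    pattern = re.compile(r"(?<!\w)" + re.escape(hashtag) + r"\b", re.IGNORECASE)
    return [row for row in rows if pattern.search(row["text"])]

def mention_edges(rows):
    """Count (sender, mentioned user) pairs, lower-cased."""
    edges = Counter()
    for row in rows:
        sender = row["from_user"].lower()
        for target in MENTION.findall(row["text"]):
            edges[(sender, target.lower())] += 1
    return edges

def edges_to_csv(edges):
    """Render the edges as a Source,Target,Weight table."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return out.getvalue()

# Invented sample rows; the column layout is an assumption, not the real archive format.
SAMPLE = """from_user,text
alice,"Road closed at Ipswich #qldfloods @QPSMedia"
bob,"RT @alice: Road closed at Ipswich #qldfloods"
carol,"Lovely weather in Perth today"
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
flood_rows = slice_by_hashtag(rows, "#qldfloods")
edges = mention_edges(flood_rows)
print(edges_to_csv(edges))
```

Lower-casing the user names is a design choice in this sketch so that differently capitalised mentions of the same account collapse into one node.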
However, as natural disasters continue to happen around the world, research articles in this area
are appearing rapidly. To keep abreast of these developments, the literature review will be ongoing
throughout the PhD in case new variables are identified.
The second objective is to capture live Twitter streams so that they can be stored for future
analysis. To do this, I will set up a NoSQL database (MongoDB or Couchbase) with one Hadoop and one
Storm cluster to store incoming Twitter streams. The target at this stage is to use the Twitter
firehose as the input stream. However, as this access needs to be purchased, I will use
keyword-specific input streams if I am unable to gain access. The database and the server will
initially be hosted on two cloud instances from NeCTAR, an Australian Government project conducted
as part of the Super Science initiative and financed by the Education Investment Fund. The choice
of database and cluster
file system may vary if a new and improved version is released. This task will be done between the
Stage 2 submission and the confirmation seminar. The result will be a system that can capture
Twitter data from the Twitter firehose in real time and provide the basis for real-time analysis of
the captured data stream.
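The keyword-specific fallback for the capture step can be illustrated with a small sketch. It is written in Python under stated assumptions: the incoming stream is simply any iterable of tweet dictionaries with `id` and `text` fields, and `InMemoryStore` is a hypothetical stand-in for the NoSQL database. No real Twitter, Hadoop or Storm API is used here.

```python
import json
import re

def make_keyword_filter(keywords):
    """Return a predicate that is True when a tweet's text mentions any keyword."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return lambda tweet: bool(pattern.search(tweet.get("text", "")))

class InMemoryStore:
    """Hypothetical stand-in for the database: one JSON document per tweet id."""

    def __init__(self):
        self.documents = {}

    def save(self, tweet):
        # Keying by id means a tweet delivered twice is stored only once.
        self.documents[tweet["id"]] = json.dumps(tweet, sort_keys=True)

def capture(stream, matches, store):
    """Store every matching tweet; return (tweets seen, tweets matched)."""
    seen = matched = 0
    for tweet in stream:
        seen += 1
        if matches(tweet):
            store.save(tweet)
            matched += 1
    return seen, matched

# Invented sample stream, including one duplicate delivery.
incoming = [
    {"id": 1, "text": "Flood waters rising near the river #qldfloods"},
    {"id": 2, "text": "Great coffee this morning"},
    {"id": 3, "text": "Earthquake felt in Christchurch #eqnz"},
    {"id": 1, "text": "Flood waters rising near the river #qldfloods"},
]

store = InMemoryStore()
seen, matched = capture(incoming, make_keyword_filter(["flood", "earthquake", "tsunami"]), store)
print(seen, matched, len(store.documents))
```

Keeping the filter, the store and the capture loop separate means the same loop could later be pointed at a different input stream or a real database without changing the filtering logic.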
The third objective is to extract useful information from this live Twitter stream. This will be
done using suitable Twitter analytics methods available at that point in time. Additionally, to
identify weak signals, I will apply contextual analysis and other computational linguistic methods
to understand the meaning of words based on their context. At this stage the whole system will go
through an iterative process of testing, evaluation and improvement to make it more effective. This
step will use the hypotheses developed in the first phase (first objective) and the data collected
in the second phase (second objective), testing initially on archival data. Based on the results,
the system will be sent to the Queensland Government’s Department of Community Safety (DCS) for
evaluation. Improvements will be carried out based on the feedback gathered. This whole process
will be done during the 2nd year of candidature.
The fourth objective is to present the information in a way that is useful for the stakeholders.
Various presentation techniques will be tested with the extracted information to see which offers
the most benefit. Since maps such as Google Maps are the most traditional way of presenting such
information, the data will first be placed on a map. However, as maps have limitations in dealing
with temporal data in chronological order, other information design techniques will be tested at
this stage based on the extracted information. This will be an iterative process that seeks
feedback from DCS, as there are a number of ways the data can be presented and sampled.
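One possible way of handling the temporal limitation of maps noted above is to split geotagged reports into fixed time windows and show one map frame per window. The Python sketch below only illustrates that idea and is not a chosen design: the one-hour window, the function names and the sample coordinates are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def window_start(timestamp, width):
    """Round a timestamp down to the start of its time window."""
    steps = (timestamp - EPOCH) // width
    return EPOCH + steps * width

def frames(reports, width=timedelta(hours=1)):
    """Group (timestamp, lat, lon) reports into chronologically ordered map frames."""
    grouped = defaultdict(list)
    for timestamp, lat, lon in reports:
        grouped[window_start(timestamp, width)].append((lat, lon))
    return sorted(grouped.items())

# Invented sample reports; the coordinates are illustrative only.
reports = [
    (datetime(2011, 1, 11, 9, 5), -27.47, 153.02),
    (datetime(2011, 1, 11, 11, 15), -27.56, 151.95),
    (datetime(2011, 1, 11, 9, 50), -27.61, 152.76),
]

for start, points in frames(reports):
    print(start.isoformat(), points)
```

Each frame can then be drawn on the same base map in sequence, so the order of events is preserved even though a single map view has no time axis.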
4.2 Resources and Funding Required
In the first stage I will use my own personal computer and QUT lab computers to slice data with
awk scripts and visualise it with Gephi. After that, for real-time data extraction, I will first
use the free Australian Research Cloud (NeCTAR) instances that are already available to QUT
students. At the same time I will also submit the NeCTAR RFP stage 2 in order to secure longer-term
use of their cloud instances. If I need access to even larger cloud instances I will use AWS
(Amazon Web Services) and will apply for further research funding, such as the auDA (.au Domain
Administration Ltd) grant, to support usage and storage at AWS or another appropriate cloud.
Books and journals required
As this research taps into various emerging fields, some of the relevant books and journals are
still in early access editions and therefore not available through the QUT library. If they are not
available, I will ask the library to purchase them.
4.3 Individual Contribution to the Research Team
Although this is an individual project, it is linked to the ARC Linkage-funded project “Social
Media in Times of Crisis: Learning from Recent Natural Disasters to Improve Future Strategies”,
conducted in collaboration with the Queensland Department of Community Safety and the Eidos
Institute. This project combines large-scale quantitative and close qualitative analysis to
investigate the public use of social media during disasters, working with key emergency management
organisations to improve their communication strategies. My contribution will be building a theory
and framework for what to extract, as well as developing improved extraction and presentation
methods for social media data streams.
4.4 Timeline of Completion of the Program
Please refer to the attached timeline.
PHD TIMELINE - AVIJIT PAUL
Time elapsed (in months, for 3 yr study): 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36
Additional columns: Key Dates, Resource Implications, Constraints

PhD Milestones
Stage 2: 5th June 2012
Confirmation: 5th March 2013
Annual Progress: 30th Sept 2013
Final Seminar: 4th Dec 2014
Lodgement: 4th Jan 2015

Generic Capabilities
Advanced theoretical knowledge and analytical skills, as well as methodological, research design and problem-solving skills in a particular research area: Develop method; ATN More Critical and Creative Thinking
Advanced information processing skills and knowledge of advanced information technologies and other research technologies: AIRS
Independence in research planning and execution, consistent with the level of the research degree: Apply for research grant (three times)
Competence in the execution of protocols for research health and safety, ethical conduct and intellectual property: Confirm IP Arrangements; Submit Ethics Application; Complete H&S training
Skills in project management, teamwork, academic writing and oral communication: ATN Leap Communication and Leadership; ATN Leap Project Management; Grad Cert in Research Commercialisation
Awareness of the mechanisms for research results transfer to end-users, scholarly dissemination through publications and presentations, research policy, and research career planning: ATN More Critical Writing; Journal; Conference; Publication Workshop; Presentation Workshop; Conference; Journal; Commercialization exploration

Coursework
Advanced Information Retrieval Skills (IFN001, mandatory for PhD candidates): 15th June 2012
Enquiry to Creative Industries (KKP 6601): 15th June 2012

Thesis Writing
Title & Abstract; Introduction; Literature Review; Methodology; Data Analysis - Archival Data; Data Analysis - Live Data; Data Analysis - Visual Analytics; Discussion; Conclusion

Research Process (methodology in sections)
Accessing Literature
Consider Methodologies
Hypothesis development
Real Time Capture
Implementation of Real time Analytics
Live testing with Twitter Stream (Resource implication: funding for large scale access to Twitter data. Constraint: if unable to gain access, will work with keywords)
Information design
Gather Results

Approvals/Agreements/Applications
Intellectual Property; Ethics; Industry; Health & safety

Scholarships
Grants in Aid; Write Up Scholarship

Outputs
Conference Papers; Journals; System Commercialization

Other annotations on the timeline
Meeting Final Seminar timeline
Confirmation Seminar
Develop tools
Develop skills in statistics, use of key software e.g. EndNote, SPSS, AWK, STORM, Python
Data analysis
5. Reference List
Aggarwal, C. C. (2011). An Introduction To Social Network Data Analytics.
Axel Bruns, J. B. (2011). New methodologies for researching news discussion on Twitter. Paper presented at the
The Future of Journalism, Cardiff, UK.
Baym, N. K., Zhang, Y. B., & Lin, M. C. (2004). Social interactions across media. New Media & Society, 6(3), 299.
Bostock, M., Ogievetsky, V., & Heer, J. (2011). D3: Data-‐Driven Documents. Visualization and Computer
Graphics, IEEE Transactions on, 17(12), 2301-‐2309.
Boulos, M. N. K., Resch, B., Crowley, D. N., Breslin, J. G., Sohn, G., Burtner, R., Pike, W., Jezierski, E., Chuang, K.-‐
Y. S. (2011). Crowdsourcing, citizen sensing and sensor web technologies for public and environmental
health surveillance and crisis management: trends, OGC standards and application examples.
International Journal of Health Geographics, 10.
Boyd, D. (2011). Research on Social Network Sites.
Boyd, D., Golder, S., & Lotan, G. (2010). Tweet, tweet, retweet: Conversational aspects of retweeting on
twitter. 1-‐10.
Boyd, D., & Marwick, A. E. (2011). I tweet honestly, I tweet passionately: Twitter users, context collapse, and
the imagined audience. New Media & Society, 13(1), 114.
Bruns, A. (2011). Towards Distributed Citizen Participation: Lessons from WikiLeaks and the Queensland Floods.
Paper presented at the Conference for E-‐Democracy and Open Government, Krems, Austria
Bruns, A. (2012). Ad Hoc Innovation by Users of Social Networks: The Case of Twitter ZSI Discussion Paper
Bruns, A., & Liang, Y. E. (2012). Tools and methods for capturing Twitter data during natural disasters. First
Monday, 17(4-‐2).
Bruns., A., Burgess, J., Crawford, K., & Shaw, F. (2012). CCI Floodsreport: Media Ecologies Project, ARC Centre
of Excellence for Creative Industries & Innovation.
Castillo, C., Mendoza, M., & Poblete, B. (2011). Information credibility on twitter.
Corvey, W. J., Vieweg, S., Rood, T., & Palmer, M. (2010). Twitter in mass emergency: what NLP techniques can
contribute. Paper presented at the Proceedings of the NAACL HLT 2010 Workshop on Computational
Linguistics in a World of Social Media, Los Angeles, California.
Cullum, B. (2010). What makes a hashtag successful. Retrieved April 8th, 2012, from
http://www.movements.org/blog/entry/what-‐makes-‐a-‐twitter-‐hashtag-‐successful/
DCS, Q. G. (2011). ‘All Hazards’ Information Management Program
http://www.btrc.qld.gov.au/c/document_library/get_file?uuid=a4491bd2-‐cfe5-‐466b-‐a003-‐
45f86878bc85&groupId=12276. Brisbane: QLD Government.
De Fortuny, E. J., De Smedt, T., Martens, D., & Daelemans, W. (2012). Media coverage in times of political crisis:
a text mining approach: University of Antwerp, Faculty of Applied Economics.
Dodds, P. S., Harris, K. D., Kloumann, I. M., Bliss, C. A., & Danforth, C. M. (2011). Temporal patterns of
happiness and information in a global social network: hedonometrics and Twitter. [; Research
Support, U.S. Gov't, Non-‐P.H.S.]. PloS one, 6(12), e26752.
“Extracting meaningful information from Social Network streams for Crisis Mapping” Avijit Paul – n8459941 – PhD -‐ Stage 2 Proposal -‐ [email protected]
19
Döhling, L., & Leser, U. (2011). EquatorNLP: Pattern-‐based Information Extraction for Disaster Response.
Dong, A., Zhang, R., Kolari, P., Bai, J., Diaz, F., Chang, Y., Zhaohui, Z. (2010). Time is of the essence: improving
recency ranking using twitter data.
Elwood, S. (2011). Geographic Information Science: Visualization, visual methods, and the geoweb. Progress in
Human Geography, 35(3), 401-‐408.
Empson, R. (2012, February 5). Twitter: In The Final 3 Minutes Of The Super Bowl, There Were 10,000 Tweets
Per Second. Retrieved April 9th, 2012, from http://techcrunch.com/2012/02/05/twitter-‐in-‐the-‐final-‐3-‐
minutes-‐of-‐the-‐super-‐bowl-‐there-‐were-‐10000-‐tweets-‐per-‐second/
Eysenbach, G. (2011). Can Tweets Predict Citations? Metrics of Social Impact Based on Twitter and Correlation
with Traditional Metrics of Scientific Impact. Journal of Medical Internet Research, 13(4). doi: e123
10.2196/jmir.2012
Gayo-‐Avello, D. (2012). "I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper" : A
Balanced Survey on Election Prediction using Twitter Data. Arxiv preprint arXiv:1204.6441.
Gupta, M., Zhao, P., & Han, J. (2012). Evaluating Event Credibility on Twitter.
Gupta, N. K. (2011). Extracting descriptions of problems with product and services from twitter data.
Heer, J., & Agrawala, M. (2008). Design considerations for collaborative visual analytics. Information
Visualization, 7(1), 49-‐62.
Huston, C., Weiss, M., & Benyoucef, M. (2011). Following the Conversation: A More Meaningful Expression of
Engagement. In G. Babin, K. StanoevskaSlabeva & P. Kropf (Eds.), E-‐Technologies: Transformation in a
Connected World (Vol. 78, pp. 199-‐210). Berlin: Springer-‐Verlag Berlin.
Keim, D., Andrienko, G., Fekete, J. D., Görg, C., Kohlhammer, J., & Melançon, G. (2008). Visual analytics:
Definition, process, and challenges. Information Visualization, 154-‐175.
Large, T. (2012). TechnoTalk -‐ Will Twitter put the U.N. out of the disaster business? Retrieved 28 March, 2012,
from http://www.trust.org/alertnet/blogs/technotalk/will-‐twitter-‐put-‐the-‐un-‐out-‐of-‐the-‐disaster-‐
business/#.T3Gkd2LX3Yk.twitter
Lee, R., Wakamiya, S., & Sumiya, K. (2011). Discovery of unusual regional social activities using geo-‐tagged
microblogs. World Wide Web-‐Internet and Web Information Systems, 14(4), 321-‐349.
Li, C., Wang, Y., & Liu, X. (2011). Research on natural disaster forecasting data processing and visualization
technology.
Liu, B. (2010). Sentiment analysis and subjectivity. Handbook of Natural Language Processing, 627-‐666.
Liu, B. F., & Kim, S. (2011). How organizations framed the 2009 H1N1 pandemic via social and traditional
media: Implications for US health communicators. [Article]. Public Relations Review, 37(3), 233-‐244.
doi: 10.1016/j.pubrev.2011.03.005
Maxwell, D., Raue, S., Azzopardi, L., Johnson, C., & Oates, S. (2012). Crisees: Real-‐Time Monitoring of Social
Media Streams to Support Crisis Management. Advances in Information Retrieval, 573-‐575.
Meier, P. (Producer). (2012, April 4th). Collaborative Mapping Platforms: Crowdsourced Crisis Response.
[Keynote] Retrieved from http://www.trendhunter.com/keynote/patrick-‐meier
Mendoza, M., Poblete, B., & Castillo, C. (2010). Twitter Under Crisis: Can we trust what we RT?
“Extracting meaningful information from Social Network streams for Crisis Mapping” Avijit Paul – n8459941 – PhD -‐ Stage 2 Proposal -‐ [email protected]
20
Muralidharan, S., Rasmussen, L., Patterson, D., & Shin, J. H. (2011). Hope for Haiti: An analysis of Facebook and
Twitter usage during the earthquake relief efforts. [Article]. Public Relations Review, 37(2), 175-‐177.
doi: 10.1016/j.pubrev.2011.01.010
Park, J., Cha, M., Kim, H., & Jeong, J. (2012). Managing Bad News in Social Media: A Case Study on Domino’s
Pizza Crisis.
Platt., A., Hood., C., & Citrin., L. (2011). Organization of Social Network Messages to Improve Understanding of
an Evolving Crisis Paper presented at the Intelligence and Security Informatics (ISI), 2011 IEEE
International Conference, Beijing.
Potts, L., Seitzinger, J., Jones, D., & Harrison, A. (2011). Tweeting disaster: hashtag constructions and collisions.
Sakaki, T., Toriumi, F., & Matsuo, Y. (2011). Tweet trend analysis in an emergency situation.
Sarcevic, A., Palen, L., White, J., Starbird, K., Bagdouri, M., & Anderson, K. (2012). Beacons of hope in
decentralized coordination: learning from on-‐the-‐ground medical twitterers during the 2010 Haiti
earthquake.
Song, W., Tjondronegoro, D. W., & Docherty, M. (2012). Understanding user experience of mobile video:
framework, measurement, and optimization. Mobile Multimedia: User and Technology Perspectives,
3-‐30.
Stamberger, K. S. a. J. (2010). Tweak the Tweet: Leveraging microblogging proliferation with a prescriptive
syntax to support citizen reporting. Paper presented at the Information Systems for Crisis Response
and Management (ISCRAM), Seatle, USA.
Starbird, K. (2012). Digital Volunteerism: Examining Connected Crowd Work During Mass Disruption Events.
Starbird, K., & Palen, L. (2011). "Voluntweeters": self-‐organizing by digital volunteers in times of crisis. Paper
presented at the Proceedings of the 2011 annual conference on Human factors in computing systems,
Vancouver, BC, Canada.
Starbird, K., Palen, L., Hughes, A. L., & Vieweg, S. (2010). Chatter on the red: what hazards threat reveals about
the social life of microblogged information. Paper presented at the Proceedings of the 2010 ACM
conference on Computer supported cooperative work, Savannah, Georgia, USA.
Tufte, E. R. (2001). The visual display of quantitative information: Graphics Press.
Valero, A. T. l., Gómez, M. M. y., & Pineda, L. V. o. (2009). Using Machine Learning for Extracting Information
from Natural Disaster News Reports. Computación y Sistemas (Computers and Systems), 13(1), 33-‐44.
Vlachos, A. (2011). Evaluating unsupervised learning for natural language processing tasks. Paper presented at
the Proceedings of the First Workshop on Unsupervised Learning in NLP, Edinburgh, Scotland.
Wu, S., Hofman, J. M., Mason, W. A., & Watts, D. J. (2011). Who says what to whom on twitter. Paper
presented at the Proceedings of the 20th international conference on World Wide Web, Hyderabad,
India.
Yugami, N., Igata, N., Anai, H., & Inakoshi, H. (2012). Advanced Analytics for Intelligent Society. Fujitsu Scientific
& Technical Journal, 48(2), 110-‐116.
6. Appendix
6.1 Coursework
AIRS Unit – IFN001
I have taken the course Advanced Information Retrieval Skills (IFN001), submitted the assignment,
and am awaiting the result.
Approaches to Enquiry in the Creative Industries - KKP601
I have taken this course, Approaches to Enquiry in the Creative Industries, completed the
presentation, submitted the final assignment, and am awaiting the result.