Overview of the 2014 ALTA Shared Task: Identifying Expressions of Locations in Tweets


Diego Mollá (Macquarie University), Sarvnaz Karimi (CSIRO)

ALTA 2014, Melbourne, Australia


Contents

The 2014 ALTA Shared Task

The Tweet Data

Kaggle in Class

Evaluation Results


The 2014 Shared Task

Task: Identify Expressions of Locations in Tweets

Categories: student, open

Prize: $500 (IBM Research Shared Task Student Prize)

Framework: Kaggle in Class

Student Category

- All members are university students.
- No members are employed full-time.
- No members have a PhD.

Open Category

- Any other teams.


Identify Expressions of Locations in Tweets

Tweet: France and Germany join the US and UK in advising their nationals in Libya to leave immediately http://bbc.in/1rVmrDJ
Location: France, Germany, US, UK, Libya

Tweet: Dutch investigators not going to MH17 crash site in eastern Ukraine due to security concerns, OSCE monitors say
Location: MH17 crash site, eastern Ukraine

Tweet: Seeing early signs of potential flash flooding with stationary storms near St. Marys, Tavistock, Cambridge #onstorm pic.twitter.com/BtogIxgQ5G
Location: St. Marys, Tavistock, Cambridge


Motivation

1. When people discuss events, they often mention the location.
2. In emergencies, such locations are very useful.
3. Recommender systems can use location information to improve their recommendations.


http://rt.com/usa/new-jersey-flooded-sandy-575/


http://static.echonest.com/DukeListens/event_mapping_at_last_fm.html


Location Expressions in Tweets

What is a location?

Any specific mention of a country, city, suburb, or point of interest (POI), for example:

- Macquarie Centre.
- Ryde Hospital.

Where can we find location mentions?

- In the text.
- In hashtags: #Australia.
- In URLs: http://abc.net.au/melbourne/.
- In mentions: @Australia.
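The four places where a location mention can occur could be separated with a few regular expressions before any location tagging is attempted. The following is only a rough sketch: the function name split_tweet and the patterns are illustrative assumptions, not part of the shared task materials, and real tweets would likely need more careful tokenisation.

```python
import re

# Hypothetical patterns; real tweets may need more robust handling.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def split_tweet(tweet):
    """Separate a tweet into plain text, hashtags, URLs and user mentions."""
    urls = URL_RE.findall(tweet)
    # Remove URLs first so that '#' or '@' inside a URL is not picked up.
    rest = URL_RE.sub(" ", tweet)
    hashtags = HASHTAG_RE.findall(rest)
    mentions = MENTION_RE.findall(rest)
    text = MENTION_RE.sub(" ", HASHTAG_RE.sub(" ", rest))
    return {
        "text": " ".join(text.split()),
        "hashtags": hashtags,
        "urls": urls,
        "mentions": mentions,
    }


parts = split_tweet("Floods in #Australia http://abc.net.au/melbourne/ via @Australia")
print(parts["hashtags"], parts["urls"], parts["mentions"])
```

Each of the four fields can then be searched for location names separately, since a name inside a hashtag or URL is usually not delimited by spaces.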


Related Work

Named entity recognition in Twitter

- LabelledLDA for NER and PoS tagging on tweets (Ritter et al. 2011).
- TwiNER: unsupervised NER on tweets using external sources such as Wikipedia (Li et al. 2012).

Location extraction

- Twitcident: using NER to identify location information in tweets (Abel et al. 2012).
- Ensemble classifiers to predict the home locations of Twitter users (Mahmud et al. 2012).
- NER tools used out of the box vs. re-trained on tweets (Lingad et al. 2013).


Tweet Collection

Source

- From Lingad et al. (2013).
- Tweets from late 2010 to late 2012.
- Augmented with additional tweets.
- The tweets carry several annotations; only the location mentions were used for the ALTA shared task.

Size

- Originally, 3,220 tweets.
- Available for the ALTA shared task: 3,047.
- After removing duplicates: 3,003.


Data Contents

Data for training and development

- Tweet IDs.
- Location mentions.
- Tweet download script.

Copyright restrictions

- Twitter does not allow the distribution of tweets.
- The shared task participants were asked to download the tweets themselves.
- Depending on the network status and changes by Twitter and Twitter users, specific tweets might not be available for download.


Data Format

Format of location mentions

- Multi-word terms are split into single words.
- Repeated words are numbered.
- All punctuation marks are removed, including #.
- Words are lowercased.
- Data in a CSV file.

Examples

- Tweet ID1, france germany us uk libya
- Tweet ID2, australia australia2 australia3
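The formatting rules above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the organisers' script: the function name format_location_mentions is hypothetical, punctuation is assumed to be deleted rather than replaced by a space, and the first occurrence of a word is assumed to stay unnumbered, as the australia example suggests.

```python
import string
from collections import Counter

# Assumption: punctuation (including '#') is deleted, not replaced by a space.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def format_location_mentions(mentions):
    """Turn a list of location mentions into the space-separated word format."""
    seen = Counter()
    words = []
    for mention in mentions:
        cleaned = mention.lower().translate(_STRIP_PUNCT)
        for word in cleaned.split():  # split multi-word terms into words
            seen[word] += 1
            # The first occurrence stays as is; later ones get a running number.
            words.append(word if seen[word] == 1 else f"{word}{seen[word]}")
    return " ".join(words)


print(format_location_mentions(["France", "Germany", "US", "UK", "Libya"]))
print(format_location_mentions(["Australia", "#Australia", "@Australia"]))
```

Under these assumptions the two calls reproduce the example rows above: france germany us uk libya, and australia australia2 australia3.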


Kaggle in Class

Kaggle

- Kaggle offers a Web-based framework for data-driven competitions.
- A large base of potential participants.
- Potentially large prizes for the participants.
- Fee-based for the organisers; free for the participants.

Kaggle in Class

- Free for organisers and participants.
- Limited user support by Kaggle.
- Used by course-based competitions.


[Figure: ALTA Shared Task in Kaggle in Class]


Features of Kaggle in Class

- Public leaderboard: all participants can submit and compare with other participants.
- Automated evaluation: organisers can choose among several evaluation metrics.
- Public and private partitions: a private partition of the test data is withheld for the final ranking.
  - Public: 501 tweets.
  - Private: 502 tweets.
- Discussion forum: for communication among participants.


Evaluation Metric

Mean F1-Score

- Compute recall and precision of each individual word.
- This allows evaluation of partially correct location mentions.

$F_1 = \frac{2pr}{p + r}$

Example

- Target: senegal senegal2
- System output: senegal christchurch brighton
- $p = 1/3$
- $r = 1/2$
- $F_1 = 0.4$
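The word-level computation in the example can be checked with a short Python sketch. The function names word_f1 and mean_f1 are illustrative, and two points are assumptions rather than statements from the slides: the "mean" is taken to be the average of the per-tweet scores, and a tweet with no correctly predicted word is scored 0.

```python
def word_f1(target, predicted):
    """F1 over the individual words of one tweet's location field."""
    target_words = set(target.split())
    predicted_words = set(predicted.split())
    true_positives = len(target_words & predicted_words)
    if true_positives == 0:
        # Assumption: no overlap (or an empty field) scores 0.
        return 0.0
    precision = true_positives / len(predicted_words)
    recall = true_positives / len(target_words)
    return 2 * precision * recall / (precision + recall)


def mean_f1(targets, predictions):
    """Average of the per-tweet F1 scores (assumed meaning of 'mean')."""
    scores = [word_f1(t, p) for t, p in zip(targets, predictions)]
    return sum(scores) / len(scores) if scores else 0.0


score = word_f1("senegal senegal2", "senegal christchurch brighton")
print(round(score, 3))  # 0.4
```

Only senegal is shared between target and output, so precision is 1/3 and recall is 1/2, which gives the 0.4 of the example. Because duplicates are numbered (senegal2), a system is only credited for the second mention if it also outputs the numbered form.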


Conclusions

- Kaggle in Class was a useful means to run the shared task.
- Few participants, but very active.
  - 168 runs across the 4 teams combined.
- Participants (read the Proceedings!) used a combination of:
  1. sequence labellers,
  2. feature engineering, and
  3. combined classifiers.


Results

Team        Category  Public  Private
MQ          Student   0.781   0.792
AUT NLP     Open      0.748   0.747
Yarra       Student   0.768   0.732
JK Rowling  Open      0.751   0.726
