Phd Colloquium Spatial Analysis

Data Mining to Understand International Dimensions to Online Identity - a classification of 2+ billion names and their linkage to virtual identities and social network traffic.
Alistair Leak, UCL SECReT


Presentation given as part of a PhD Colloquium on Spatial Analysis, delivered on Wed 11th January 2013


Page 1: Phd Colloquium Spatial Analysis

Data Mining to Understand International Dimensions to Online Identity - a classification of 2+ billion names and their linkage to virtual identities and social network traffic.

• Alistair Leak
• UCL SECReT

Page 2: Phd Colloquium Spatial Analysis

Who am I?

Education:
• Kingston University (BSc) - GIS
• UCL (M.Res) - Advanced Spatial Analysis and Visualisation
• UCL 3+1 - PhD Security and Crime Science

Supervisors:
• 1st Supervisor: Professor Paul Longley
• 2nd Supervisor: Dr James Cheshire

Page 3: Phd Colloquium Spatial Analysis

Definitions:

• Netnography
– “A qualitative, interpretive research methodology that uses internet-optimized ethnographic research techniques to study the social context in online communities” (Kozinets, 2009)

• Cybergeodemographics
– “The analysis of people by where they live and by whom they interact with, in real and virtual space” (Longley, 2012)

Page 4: Phd Colloquium Spatial Analysis

Uncertainty of Identity: Work Package 4: Cybergeodemographics

• Use of primary and secondary data to relate virtual Internet traffic to the probable physical locations from which it emanated; and the development of typologies of social networks that are robust, generalized and related to physical locations.

Diagram labels: Data Collection Tools (WP1); Text Analytics (WP2); Cybergeodemographics (WP4); Secondary Data

Page 5: Phd Colloquium Spatial Analysis

Working Title:

• “Data Mining to Understand International Dimensions to Online Identity - a classification of 2+ billion names and their linkage to virtual identities and social network traffic”

Objectives:

• Develop spatial context of name network classification
• Develop typologies of social networks
• Measure how representative social media is of the underlying population.
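One simple way to think about the third objective is a ratio of each group's share among social media users to its share in the reference population. The sketch below is illustrative only: the function name `representation_index`, the group labels, and all the numbers are hypothetical, and the slides do not state which measure the study will actually use.

```python
from collections import Counter

def representation_index(sample_groups, population_shares):
    """Return (sample share / population share) for each group.

    Values near 1.0 suggest the group appears in the sample roughly in
    proportion to the population; values below 1.0 suggest
    under-representation. This is an assumed measure, not the study's.
    """
    counts = Counter(sample_groups)
    total = sum(counts.values())
    index = {}
    for group, pop_share in population_shares.items():
        sample_share = counts.get(group, 0) / total
        index[group] = sample_share / pop_share
    return index

# Hypothetical illustration: invented groups and shares, not study data.
sample = ["A"] * 60 + ["B"] * 30 + ["C"] * 10
population = {"A": 0.5, "B": 0.3, "C": 0.2}
result = representation_index(sample, population)
```

In this invented example group C makes up 10% of the sample but 20% of the population, so its index comes out at about half of parity.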

Page 6: Phd Colloquium Spatial Analysis

Work Plan

• M.Res (Present – 2013)
– Foundation work
    • Assess representative capability of tweet data
– Skills Development
    • Spatio-Temporal Data Mining
    • Database Management

• Ph.D (2013 – 2016)
– Objectives
    • Develop spatial component of names networks
    • Develop typologies of social networks
    • Develop a measure of uncertainty
– Completion in August 2016

Page 7: Phd Colloquium Spatial Analysis

Data Sources:

*Sina Weibo

Page 8: Phd Colloquium Spatial Analysis

Case Study: Tweets in London

• 1.4 million tweets collected over 3 months (Sep - Dec 2012)

Page 9: Phd Colloquium Spatial Analysis

What’s in a Tweet?

• First Name
• Surname
• Unique ID
• Popularity
• Interactions
• # Themes
• Time/Date
• Location

Possibilities:
• Political Affiliation
• Gender
• Age
• Location

Page 10: Phd Colloquium Spatial Analysis

Data Classification

• Gender
– Database of 62,000 names + genders
– Determined by Forename

• Demographic
– OAC – Output Area Classification

• ONOMAP
– Ethnicity, Religion, Geographical Origin.
– Determined by Forename Surname combination
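The forename-based gender step can be pictured as a dictionary lookup on the first token of a user's display name. This is a minimal sketch under stated assumptions: the four-entry `FORENAME_GENDER` table is a hypothetical stand-in for the 62,000-name database, the helper names are invented for illustration, and display names are assumed to follow "Forename Surname" order, which will not hold for every Twitter user.

```python
# Hypothetical stand-in for the forename-gender database; entries are illustrative.
FORENAME_GENDER = {
    "james": "male",
    "paul": "male",
    "sarah": "female",
    "emma": "female",
}

def split_name(display_name):
    """Split a display name into (forename, surname).

    Assumes 'Forename Surname' order; returns (None, None) for empty input
    and a surname of None when only one token is present.
    """
    parts = display_name.strip().split()
    if not parts:
        return None, None
    forename = parts[0].lower()
    surname = parts[-1].lower() if len(parts) > 1 else None
    return forename, surname

def classify_gender(display_name, lookup=FORENAME_GENDER):
    """Return the gender recorded for the forename, or 'unknown' if absent."""
    forename, _ = split_name(display_name)
    if forename is None:
        return "unknown"
    return lookup.get(forename, "unknown")
```

Names that are missing from the table, or display names that are not personal names at all, fall through to "unknown" rather than being guessed.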

Page 11: Phd Colloquium Spatial Analysis

Data Classification

Page 12: Phd Colloquium Spatial Analysis

Tweets by ONOMAP Religion

Page 13: Phd Colloquium Spatial Analysis

Tweets by ONOMAP Religion

Page 14: Phd Colloquium Spatial Analysis

Tweets by ONOMAP Group

Page 15: Phd Colloquium Spatial Analysis

Challenges of Study

• Signal from Noise
– Tweets are not all sent from individuals' homes
    • Day and night demographics
– Not all geolocated tweets come from real people

• Data Quality/Sample Size
– Twitter users are self-selecting
    • Only a small proportion have enabled location services
    • Dataset currently has 92,000 unique users

Page 16: Phd Colloquium Spatial Analysis

Target Areas of Study

• Spatio-temporal differentiation of tweets
– Night
– Day
– Travel

• Expansion of the Methodology for World Names
– Initially into Europe.

• Application of new name datasets.
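As a rough illustration of the night/day split listed above, tweets could be bucketed by the hour of their timestamp before any spatial analysis. The sketch below is an assumption rather than the study's method: the 07:00 and 19:00 thresholds, the function names, and the sample coordinates are all hypothetical, and it does not attempt the "Travel" category, which would need movement between consecutive tweets.

```python
from datetime import datetime

def time_band(timestamp, day_start=7, day_end=19):
    """Label a timestamp 'day' or 'night' by its hour.

    The default thresholds (07:00 to 19:00 counts as day) are assumptions.
    """
    return "day" if day_start <= timestamp.hour < day_end else "night"

def split_by_band(tweets):
    """Group (timestamp, lat, lon) tuples into day and night coordinate lists."""
    bands = {"day": [], "night": []}
    for ts, lat, lon in tweets:
        bands[time_band(ts)].append((lat, lon))
    return bands

# Hypothetical tweets with invented coordinates, for illustration only.
tweets = [
    (datetime(2012, 10, 1, 9, 30), 51.5246, -0.1340),
    (datetime(2012, 10, 1, 23, 15), 51.4123, -0.3007),
]
bands = split_by_band(tweets)
```

Each band could then be mapped or classified separately, so that daytime and night-time locations are not mixed.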

Page 17: Phd Colloquium Spatial Analysis

References:

• Dale, M. R. T., and M-J. Fortin. "From graphs to spatial graphs." Annual Review of Ecology, Evolution, and Systematics 41.1 (2010): 21.

• Fischer, E. (July, 2011). World Map of Flickr and Twitter Locations. In See Something or Say Something. Available at http://www.flickr.com/photos/walkingsf/5912169471/in/set-72157627140310742

• http://urbantick.blogspot.co.uk/2010/12/ncl-social-networks.html

• Kozinets, Robert V. Netnography: Doing ethnographic research online. Sage Publications Limited, 2009.

• R Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

• Rao, D., Yarowsky, D., Shreevats, A., & Gupta, M. (2010, October). Classifying latent user attributes in twitter. In Proceedings of the 2nd international workshop on Search and mining user-generated contents (pp. 37-44). ACM.

Page 18: Phd Colloquium Spatial Analysis

Thank you

X Factor Graph: produced with R and Gephi