Searching peoples' digital footprints A new avenue in sociology and what are the problems with it...

Searching peoples' digital footprints A new avenue in sociology and what are the problems with it János Kertész Budapest University of Technology and Economics


Page 1:

Searching peoples' digital footprints

A new avenue in sociology and what are the problems with it

János Kertész
Budapest University of Technology and Economics

Page 2:

Outline

• Problems with classical primary data collection in sociology: an example

• Abundance of data: Digital footprints – new era in social sciences?

• Examples
• Data availability
• Ethical issues
• Summary

Page 3:

Primary Methods of Data Collection

Interviewing people
Designing a questionnaire
Observing people
Content analysis
Designing an experiment to carry out
Case study
Focus group

Page 4:

Primary Methods of Data Collection

Interviewing people
Designing a questionnaire
  This method is best for discovering factual information about people …
Observing people
Content analysis
Designing an experiment to carry out
Case study
Focus group

Statistics about primary data collection: papers over 10 years in American Sociological Review:
Interpretative: 17%
Survey: 80%
Experiment: 3%

Page 5:

An example: The Add Health database

“The (US) National Longitudinal Study of Adolescent Health (Add Health) is a nationally representative study that explores the causes of health-related behaviors of adolescents in grades 7 through 12 and their outcomes in young adulthood. Add Health seeks to examine how social contexts (families, friends, peers, schools, neighborhoods, and communities) influence adolescents' health and risk behaviors.”

Designed by J. R. Udry, P. S. Bearman, and K. M. Harris; started in 1994 and still ongoing; funded by the National Institute of Child Health and Human Development (P01-HD31921).
Contact: http://www.cpc.unc.edu/addhealth

Page 6:

DATA (cont.)

Data based on questionnaires and medical tests
~1700 publications (incl. dissertations)
We used the data from Wave I (1994-95):
75871 students were asked in 84 high schools
68 questions, including 10 friendship-related ones:
>> Name 5 best male and 5 best female friends.
>> For each friend, select from the list those that apply. During the last 7 days you
   1. visited each other
   2. met after school
   3. spent time together during last weekend
   4. talked with him/her about a problem
   5. talked with him/her on the phone

Page 7:

Threshold analysis

Strength of ties characterized by discrete weights

Links are a priori directed, corresponding to the nominations

Strong asymmetry may occur: A → B with weight 5 but B → A with weight 1

G/N: order parameter of percolation
$\sum_s s^2 n_s$: “susceptibility”
Black line: $w = (w_{AB} + w_{BA})/2$, mutuality required
Red line: no mutuality required; a missing nomination is taken as 0

Gonzales, et al 2007
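
The threshold analysis on this slide can be sketched in a few lines. This is only an illustration on made-up nominations, not the code of the cited study: the function name `percolation_stats` and the toy data are hypothetical, $n_s$ is taken as the raw number of clusters of size s, and the largest cluster is left out of the susceptibility sum (the usual percolation convention, assumed here).

```python
from collections import Counter


def percolation_stats(directed_weights, threshold, require_mutual=True):
    """Return (G/N, susceptibility) after keeping ties with weight >= threshold.

    directed_weights maps (nominator, nominee) -> discrete weight.
    The tie weight is w = (w_AB + w_BA) / 2; a missing nomination counts as 0.
    """
    nodes = {n for pair in directed_weights for n in pair}
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    seen = set()
    for (a, b) in directed_weights:
        key = frozenset((a, b))
        if key in seen or a == b:
            continue
        seen.add(key)
        w_ab = directed_weights.get((a, b), 0)
        w_ba = directed_weights.get((b, a), 0)
        if require_mutual and (w_ab == 0 or w_ba == 0):
            continue  # "black line": both nominations must exist
        if (w_ab + w_ba) / 2 >= threshold:
            parent[find(a)] = find(b)

    sizes = sorted(Counter(find(n) for n in nodes).values(), reverse=True)
    giant = sizes[0]
    # Sum of s^2 over all clusters except the largest one (assumed convention).
    susceptibility = sum(s * s for s in sizes[1:])
    return giant / len(nodes), susceptibility


# Hypothetical toy nominations: (nominator, nominee) -> weight 1..5
nominations = {
    ("A", "B"): 5, ("B", "A"): 1,
    ("B", "C"): 4, ("C", "B"): 4,
    ("C", "D"): 2,
    ("E", "F"): 3, ("F", "E"): 5,
}

mutual = percolation_stats(nominations, threshold=1, require_mutual=True)
non_mutual = percolation_stats(nominations, threshold=1, require_mutual=False)
```

Scanning `threshold` over the possible weights and plotting both return values would give curves of the kind the slide describes, one with and one without the mutuality requirement.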

Page 8:

Other ways of finding data for scientific research: huge datasets due to IT

Official data collections (open or can be made available)
• Statistical institutes (e.g. P. Hedström's Stockholm data)
• Fiscal data (income distributions etc.)
• Medical data (e.g., Finnish diabetes data, mortality data)
• …

Work related:
• Commercial data (e.g. point collections, trading data of companies): secret, property of companies
• Financial data (e.g., stock and other markets, banks): partly open (free or for purchase)
• …

Science related (open):
• Human Genome Project
• Chemical data banks
• Archives
• Bibliographies
• …

These data are either produced for analysis, or we assume that they will be used for that purpose.

Page 9:

Data generated in our everyday lives

A new avenue for social sciences: Digital footprints

Page 10:

This collection of data raises
• legal
• ethical
issues (see later)

At the same time it provides a gold mine for research!

Page 11:
Page 12:

Until now, social science has struggled to obtain tools that do more than scratch the surface of some of its questions. These range from identifying the driving forces behind violence, to the factors influencing how ideas, attitudes and prejudices spread through human populations. The available tools have largely remained in a time warp, consisting of analyses of national censuses, small-scale surveys, or lone researchers with a notebook observing interactions within small groups.

Being able to automatically and remotely obtain massive amounts of continuous data opens up unprecedented opportunities for social scientists to study organizations and entire communities or populations.

NATURE | Vol 449 | 11 October 2007

Page 13:

Communications leave detailed information about who communicated with whom, when and where…
• phone (mobile and fixed line)
• sms, mms
• MSN
• email

In a broader sense, all kinds of activities that leave electronic records can be used, including
• commercial activities (eBay, point-collecting cards, credit cards, etc.)
• open collaborative environments (Wikipedia, GNU, etc.)
• e-communities (Facebook, MySpace, etc.)
• e-games (role-playing, Where is George, etc.)

Page 14:

Enron Email Dataset (free: www.cs.cmu.edu/~enron/)
• 150 users (Enron management), 0.5M messages
• made public (including content!) by the Federal Energy Regulatory Commission
• The presently available corpus does not include attachments, and some messages have been deleted (at the request of affected employees)

Triggered much interesting work, e.g.:
• Berkeley Enron Email Analysis (testing methods)
• J. Shetty and J. Adibi: The Enron Email Dataset: Database Schema and Brief Statistical Report
• Z. Eisler, I. Bartos and J. K.: Fluctuation scaling

Huberman et al.: HP data (not publicly available)

Related: Microsoft report MSR-TR-2006-186 (2007), on $30 \times 10^9$ MSN messages

Page 15:

[Figure: J. Shetty and J. Adibi]

Page 16:

Fluctuation scaling: $\sigma \sim \langle f \rangle^{\alpha}$

[Figure: Eisler et al. 2008]
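
Fluctuation scaling relates the standard deviation of an activity signal to its mean through a power law with exponent α. A minimal sketch of how such an exponent might be estimated is a least-squares fit in log-log space. The function name and the synthetic series below are hypothetical and are not taken from the Enron analysis; the series are constructed so that the standard deviation equals the square root of the mean.

```python
import math
import statistics


def fluctuation_scaling_exponent(series_by_node):
    """Estimate alpha in sigma ~ <f>^alpha by least squares on log-log values."""
    xs, ys = [], []
    for values in series_by_node.values():
        mean = statistics.fmean(values)
        sigma = statistics.pstdev(values)
        if mean > 0 and sigma > 0:  # logarithms need positive values
            xs.append(math.log(mean))
            ys.append(math.log(sigma))
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    return covariance / variance


# Synthetic activity per user (e.g. messages per time window), built so that
# the standard deviation is exactly the square root of the mean.
activity = {
    "u1": [2, 6],      # mean 4,   sigma 2
    "u2": [12, 20],    # mean 16,  sigma 4
    "u3": [56, 72],    # mean 64,  sigma 8
    "u4": [240, 272],  # mean 256, sigma 16
}
alpha = fluctuation_scaling_exponent(activity)
```

With real data one would use many nodes and long time series; the fit is only meaningful if the points actually fall on a line in the log-log plot.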

Page 17:

Constructing a social network from mobile phone data

• Over 7 million private mobile phone subscriptions
• Focus: voice calls within the home operator
• Data aggregated from a period of 18 weeks
• Require reciprocity (X → Y AND Y → X) for a link
• Customers are anonymous (hash codes)
• Data from a European mobile operator

J.-P. Onnela, et al. PNAS 104, 7332-7336 (2007)
J.-P. Onnela, et al. New J. Phys. 9, 179 (2007)

[Figure: two calls between X and Y, of 15 min and 5 min, are aggregated into a single X-Y link of 20 min]
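
The construction rule on this slide (aggregate the calls, keep a link only if both X → Y and Y → X occur) can be sketched as follows. The record layout, the function name and the sample calls are hypothetical; using total call duration as the link weight is an assumption based on the 15 min + 5 min = 20 min labels in the figure.

```python
from collections import defaultdict


def build_reciprocal_network(call_records):
    """Aggregate directed calls into undirected links, keeping reciprocated pairs.

    call_records: iterable of (caller, callee, duration_minutes), where caller
    and callee are anonymized identifiers (e.g. hash codes).
    Returns {frozenset({x, y}): total_minutes}.
    """
    directed = defaultdict(float)
    for caller, callee, minutes in call_records:
        if caller != callee:
            directed[(caller, callee)] += minutes

    links = {}
    for (x, y), minutes in directed.items():
        if (y, x) in directed:  # reciprocity: Y must also have called X
            pair = frozenset((x, y))
            links[pair] = links.get(pair, 0.0) + minutes
    return links


# Hypothetical records: X and Y call each other, X calls Z without reply.
calls = [("X", "Y", 15), ("Y", "X", 5), ("X", "Z", 7)]
network = build_reciprocal_network(calls)
```

The one-way X → Z contact is dropped, which is the point of the reciprocity requirement: it filters out contacts that are unlikely to reflect a social tie.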

Page 18:

Huge network: proxy for network at societal level

Largest connected component dominates

3.9M / 4.6M nodes

6.5M / 7.0M links

Page 19:

Possible to ask unprecedented questions and even find the answers to them

The study revealed the structure of the network, the interplay between weights and communities, and the relations between local, mesoscopic and global structure (see J.-P. Onnela's talk)

Page 20:

New data (continuously supplied):
• records of each call, sms, mms
• information about subscribers: age, gender, ZIP code

New studies started on data from
• Belgium (+ information about location of the call)
• France, Hungary (fixed lines)
• India

With some effort, individuals could be identified!

No data sharing possible: Confidentiality agreement with the provider. Contracts regulate publication rights like in an industrial R & D project
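
Why hashed identifiers alone do not guarantee anonymity can be illustrated with a small uniqueness check on attributes such as age, gender and ZIP code. This is a sketch on invented records, and the function names are hypothetical; it only counts how many records share each attribute combination (the idea behind k-anonymity).

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values (k)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    return min(Counter(keys).values())


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination is unique."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# Invented subscriber records; "id" stands for an anonymizing hash code.
subscribers = [
    {"id": "a1f3", "age": 34, "gender": "F", "zip": "1111"},
    {"id": "9c2e", "age": 34, "gender": "F", "zip": "1111"},
    {"id": "77b0", "age": 34, "gender": "M", "zip": "1111"},
    {"id": "e5d4", "age": 61, "gender": "M", "zip": "1117"},
    {"id": "03aa", "age": 61, "gender": "M", "zip": "1117"},
]

k_gender = smallest_group_size(subscribers, ["gender"])
k_all = smallest_group_size(subscribers, ["age", "gender", "zip"])
share_unique = unique_fraction(subscribers, ["age", "gender", "zip"])
```

A smallest group of size 1 means at least one person is singled out by the combination of attributes, even though no name or phone number is stored.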

Page 21:

eBay data

I. Yang, E. Oh, B. Kahng: Phys. Rev. E 74, 016121 (2006)

[Figure, panels 1) and 2). Categories: A: collectibles; B: clothing, sport, office; C: home decoration, electronics; D: art, hobby; E: books, toys; F: valuables (jewelry, stamps, …)]

Traditional classification scheme (2) can be improved by hierarchical agglomeration algorithm (1)
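
For readers unfamiliar with hierarchical agglomeration, the general idea is to start with every item in its own cluster and repeatedly merge the two most similar clusters. The sketch below uses generic average linkage on invented similarity scores; it is not the specific algorithm or data of the cited paper, and all names in it are hypothetical.

```python
def agglomerate(similarity, n_clusters):
    """Average-linkage agglomeration: repeatedly merge the two most similar clusters.

    similarity: dict mapping frozenset({a, b}) -> similarity score;
    pairs that are missing from the dict count as similarity 0.
    Returns (final_clusters, merges_in_order).
    """
    items = sorted({x for pair in similarity for x in pair})
    clusters = [frozenset([i]) for i in items]

    def linkage(c1, c2):
        scores = [similarity.get(frozenset((a, b)), 0.0) for a in c1 for b in c2]
        return sum(scores) / len(scores)

    merges = []
    while len(clusters) > max(n_clusters, 1):
        # Find the pair of clusters with the highest average similarity.
        _, i, j = max(
            ((linkage(c1, c2), i, j)
             for i, c1 in enumerate(clusters)
             for j, c2 in enumerate(clusters) if i < j),
            key=lambda t: t[0],
        )
        merged = clusters[i] | clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


# Invented similarities between auction categories (e.g. share of common bidders).
toy_similarity = {
    frozenset(("coins", "stamps")): 0.9,
    frozenset(("laptops", "phones")): 0.8,
    frozenset(("coins", "laptops")): 0.1,
    frozenset(("stamps", "phones")): 0.2,
}
final_clusters, merge_order = agglomerate(toy_similarity, n_clusters=2)
```

The order of merges defines a hierarchy (a dendrogram), which can then be compared with a classification scheme given in advance.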

Page 22:

Where is George?

Zip code

Page 23:

(“Where is George”)
The scaling laws of human travel
D. Brockmann, L. Hufnagel and T. Geisel
Nature 439, 462-465 (26 January 2006) doi:10.1038/nature04292

Page 24:

Diapers and beer

Standard story in data mining courses:

An investigation of 1.2M shopping baskets of Osco Drug customers showed that between 5 and 7 pm a significant number of customers bought diapers and beer together (suggesting that bored young fathers were sent to the shop)

(It is an urban legend that, as a consequence, the management had diapers and beer placed closer to each other. But they could have…)
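
The kind of analysis behind the diapers-and-beer story is usually taught as market-basket analysis: count how often two items appear together and compare this with what independence would predict (the "lift"). A minimal sketch on invented baskets, with a hypothetical function name:

```python
from collections import Counter
from itertools import combinations


def pair_lift(baskets):
    """Return {frozenset({a, b}): lift} for every item pair that co-occurs.

    lift = P(a and b) / (P(a) * P(b)); values above 1 mean the items are
    bought together more often than expected if purchases were independent.
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = set(basket)  # ignore repeated items within one basket
        item_counts.update(items)
        pair_counts.update(frozenset(p) for p in combinations(sorted(items), 2))
    return {
        pair: (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        for pair, count in pair_counts.items()
        for a, b in [tuple(pair)]
    }


# Invented baskets, for illustration only.
baskets = [
    ["diapers", "beer"],
    ["diapers", "beer", "chips"],
    ["milk", "bread"],
    ["milk", "chips"],
]
lifts = pair_lift(baskets)
```

On real data one would also restrict the baskets to a time window (such as 5 to 7 pm) and require a minimum number of co-occurrences before trusting a high lift.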

One should have no illusions about the (mis)use of point-collecting cards, big prize draws, etc.…

Page 25:

LAMENTS OF A SORROWFUL MAN

They've entered me in books of every kind,
I'm registered and checked in every way.
I'm kept in musty, ink-stained offices,
in folders that are growing grizzly-grey.
Oh, gnashing of teeth, oh, humiliation,
that I am captive till my dying day,
that they dispose of me from top to toe,
that I am just a record, filed away.
I'd much prefer to live in the Sahara
or rot beneath a mound of heavy clay,
for I am kept in books of every kind,
and registered and checked in every way.

D. Kosztolányi, 1924

Page 26:

Google has all the tools to be Big Brother. It has control over your
• Clicks (interest, taste, purchases, pictures…)
• Mail
• Travel plans
etc.

These data would be of much interest for research, but they contain too much information. Google definitely uses them, e.g. for targeted advertising.

Ethical issues

“When web provider AOL’s research division published an analysis of search behaviour on the Internet last year, it had what it thought was a bright idea: it would reach out to academics by making an anonymized version of the data freely available for download from its website. But within hours, it had to pull the site, after bloggers managed to infer many identities from the data and view the associated search histories.”

NATURE|Vol 449|11 October 2007

Page 27:

Two problems related to “computational social science”:

i) Privacy issues

Data are not produced for scientific evaluation, in contrast to questionnaires, where the target person can decide whether to provide data, or to cases where data handling is expected. Moreover, in the latter cases the use of the data is strongly regulated by law and by organizations (e.g. Consortium for Political and Social Research).

ii) Controllability and reproducibility of research

Since the data are not public (sometimes even the actual source must not be named), the general criterion of controllability of scientific research is violated. As the AOL example shows, this is related to i) or to commercial interests. A good counterexample is the Enron Email Dataset, which can serve as a benchmark for related studies.

Page 28:

Measures?
So far no real scandal… caused by scientific use of data.
Institutional framework needed?

Putting People on the Map: Protecting Confidentiality with Linked Social-Spatial Data, www.nap.edu/catalog/11865.html (Natl. Acad. Sci., Washington DC, 2007) concluded:

“Institutional solutions involve establishing tiers of risk and access, and developing data-sharing protocols that match the level of access to the risks and benefits of the planned research.”

However, “Businesses seem more prone to misuse private data than scientists of any stripe.” (Marshall Van Alstyne, BU)

But “trust is of crucial importance to the contract between scientific expertise and the broader society that supports it” (NATURE editorial, October 2007)

Page 29:

If we are careless…

Page 30:

Summary:

Fantastic new possibilities for computational social science

Multidisciplinary efforts needed
More open, shared data needed. Benchmarking.
Experiments???
Artificial data?
Ethical and legal issues: privacy, commercial interest and scientific reproducibility
Institutionalization?
Surveys cannot be substituted!