'Drinking from the fire hose? The pitfalls and potential of Big Data'.

Drinking from the fire hose? The pitfalls & potential of Big Data

Josh Cowls, Oxford Internet Institute

with contributions from Eric Meyer, Ralph Schroeder and Linnet Taylor

t2i Lab, Chalmers, 27th March 2014

Overview

• Background

• Definitions

• Innovations and implications

• Learning to drink from the fire hose

The Oxford Internet Institute

• Department of the University of Oxford

• MO: ‘Understanding life online’

• Multi-disciplinary mix (social sciences plus physical and medical sciences, and humanities)

• 45 researchers (and growing)

• 50 students (MSc Social Science of Internet; PhD programme)

• Generating big data on social, political and economic behaviour from social media

www.oii.ox.ac.uk

Accessing and Using Big Data to Advance Social Science Knowledge

• Funded by the Alfred P. Sloan Foundation

• 2012 – 2014

• Data sources:

• 120 interviews, mainly with social scientists but some interviewees from business, government

• Reports, workshops, publications

• No representative sample, but some patterns of disciplinary and skills background and career trajectory

NB where unattributed, quotes used in this presentation are excerpted from interviews conducted as part of this project.

Big Data: our definition

Big data are data that are unprecedented in scale and scope in relation to a given phenomenon.

They are often streams of data (rather than fixed datasets), accumulating large volumes, often at high velocity.

Big Data: other definitions

• ‘Transactional’ (Margetts et al.)

• ‘Things that one can do at a large scale that cannot be done at a smaller one’ (Mayer-Schönberger and Cukier)

• The ‘3 Vs’: volume, velocity, variety – but also veracity, visualisability, viscosity? (Gartner)

... what Big Data isn’t

• A generalisable, quantifiable ‘amount’ of data

• A race to the top (Mutually Assured Distraction)

• The same for every discipline, field or sector

A ‘working’ definition

• The Big Data phenomenon might be less about what the dataset is and more about how we work with it

• (Even if this is indistinguishable in practice)

Shifts in mindset

From Mayer-Schönberger and Cukier:

• “The ability to analyse vast amounts of data about a topic rather than be forced to settle for smaller sets”

• “A willingness to embrace data’s real-world messiness rather than privilege exactitude”

• “A growing respect for correlations rather than a continuing quest for elusive causality”

Implications for research

Whither the sample?

“the sample survey[’s] glory years ... are in the past”

Savage and Burrows, 2007


Whither the sample?

“sampling is like an analog photographic print. It looks good from a distance, but as you stare closer, zooming in on a particular detail, it gets blurry ... Often, the really interesting things in life are found in places that samples fail to fully catch”

Mayer-Schönberger and Cukier 2012


More or mess?

“social media is really, really fascinating, and the reason is because it ... falls into this category of there’s something there but we don’t know what it is. So you can measure public opinion on Twitter and clearly that’s indicative of something, but we don’t quite know what it’s representative of”

Brandon Stewart, Harvard University Department of Government


More or mess?

“the problem with the hashtag stuff [is that] we have wonderful case studies but we don’t know what they sit in essentially, what the framework is, if that’s 1% or 10% or 100% of the current conversation in Australia or whatever”

Axel Bruns, Queensland University of Technology


More or mess?

“the big problem that we haven’t cracked is that if someone tweets a sentiment it’s not necessarily what they’re feeling, it can be for a variety of reasons, so it doesn’t really reflect what they feel necessarily”

Mike Thelwall, University of Wolverhampton


Do we care about causes?

“Big Data is all about correlation; it’s not about causation, which means that you don’t need to have a theory beforehand. You just start looking for correlation … so you don’t have any idea about the structure of the data, you just find a funny correlation.”

Sara Esposti, Open University Business School



“a central concern of social science is, we don’t just want to find statistical associations, we actually want to uncover the underlying causal processes by which social systems work ... The data themselves don’t tell you about cause and effect, there’s actually a very complex often, complex inferential process you have to go through in order to extract from the data the things that you really care about”

David Jensen, University of Massachusetts



“I’ve been talking to some computer scientists who are rising stars, they’re really doing well, and they acknowledge that the way in which the field works, novelty is the key issue. And so there’s always an incentive or a pressure to keep on doing new stuff with new data, even though they might have wanted to go into more depth into something.”

Sandra Gonzalez-Bailon, Annenberg School of Communication, University of Pennsylvania

The challenge

How can we extract meaning from Big Data – learn to drink from the fire hose?

Drinking from the fire hose

• Understanding the data• Collaborating• Mixing methods

Drinking from the fire hose: understanding the data

The rise of the information society has given us myriad new forms of data and accompanying ways of analysing it.

The challenging part is abstracting meaning about society in general from data created and harvested online.


Example: it’s hard to predict elections using Twitter

“[Of] 14 different attempts to predict elections based on Twitter data ... Only half of them were successful ... All of this looks close to mere chance”

Gayo-Avello 2012
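One rough way to read "close to mere chance" is to ask how often a coin-flipping forecaster would do at least as well. The sketch below is only an illustration, not Gayo-Avello's analysis: it assumes that "half" of 14 means 7 successes, and that each prediction is an independent two-way call with a 50% chance of being right. The function name `prob_at_least` is hypothetical.

```python
from math import comb


def prob_at_least(successes, attempts, p=0.5):
    """Binomial tail: chance of at least `successes` hits in `attempts`
    independent tries, each succeeding with probability `p`."""
    return sum(
        comb(attempts, k) * p**k * (1 - p) ** (attempts - k)
        for k in range(successes, attempts + 1)
    )


if __name__ == "__main__":
    # Assumption for illustration: 7 of 14 predictions were correct,
    # and a pure guess is right half the time.
    print(round(prob_at_least(7, 14), 3))  # 0.605
```

Under these assumptions a pure guesser would match or beat the record about 60% of the time, which is consistent with the quoted judgement. Real elections with more than two candidates or parties would need a different baseline.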


Example: Facebook isn’t going anywhere, and neither is Princeton

Cannarella and Spechler 2014; Develin 2014


But it’s much simpler, conceptually speaking, to analyse online phenomena on their own terms

Yasseri, Hale & Margetts 2013


But it’s much simpler, conceptually speaking, to analyse online phenomena on their own terms

Hale, Yasseri, Cowls, Meyer, Schroeder & Margetts (submitted)


Of course, online data can still provide insights into offline life, but these must be well-grounded.

e.g. Seth Stephens-Davidowitz, ‘The Cost of Racial Animus on a Black Candidate: Evidence Using Google Data’

• Google accounts for >50% of search engine market (less concern over representativeness)

• Google searches are private and anonymous (less concern over reliability)

• This method uncovers a social phenomenon, racism, which would be harder to detect in pre-Internet approaches e.g. interviews or surveys


Beware false prophets

XKCD


Beware false prophets: analyses using thousands of variables can generate millions or billions of possible relationships – not all (or most) will be valid or meaningful


Beware false prophets

“if you look at the data long enough you’ll find predictive signals that are in fact completely spurious ... for about, I think a 20 or 25 year period, the US stock market was perfectly correlated with the level of butter production in Bangladesh … if you look at hundreds and hundreds of these indicators, whether it’s the level of Bangladesh butter production or the number of cars in New York City or whatever it is, eventually you’ll find something that just by pure chance matches what you’re looking for.”

Mike Cafarella, University of Michigan
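The warning about spurious relationships can be made concrete with a small simulation. This is a minimal sketch on purely synthetic data, not anything from the interviews: it assumes each "indicator" is an independent random walk over 25 periods (a stand-in for trending series such as a stock index or butter production), and all function names are hypothetical. It counts the pairwise relationships among a set of variables, then searches hundreds of unrelated indicators for the one that best matches a target series.

```python
import random
from math import sqrt


def n_pairs(n_variables):
    """Number of distinct pairwise relationships among n variables."""
    return n_variables * (n_variables - 1) // 2


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def random_walk(rng, length):
    """A series whose value drifts by a random step each period."""
    level, path = 0.0, []
    for _ in range(length):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


def best_spurious_match(n_indicators=500, length=25, seed=0):
    """Correlation of the unrelated indicator that best tracks the target."""
    rng = random.Random(seed)
    target = random_walk(rng, length)  # e.g. a 25-year market index
    correlations = [
        pearson(target, random_walk(rng, length)) for _ in range(n_indicators)
    ]
    return max(correlations, key=abs)


if __name__ == "__main__":
    print(n_pairs(1000))   # 499500
    print(n_pairs(5000))   # 12497500
    print(best_spurious_match())
```

Every indicator here is generated independently of the target, so any strong correlation the search returns is spurious by construction. Only pairwise relationships are counted; allowing combinations of three or more variables makes the number of candidate relationships grow far faster.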

Drinking from the fire hose: collaborating

Big data research often necessitates a wide variety of skills and perspectives. Team sizes in academic research have been growing for decades.


This trend is likely to persist as big data research becomes more common

“the best research will often merge in collaboration between computer scientists who will have access to the tools and the background to further develop and apply those, and with social scientists who will have, sort of, good pressing social questions that we can get insight into with the data that is now available.”

Scott Hale, Oxford Internet Institute


This trend is likely to persist as big data research becomes more common

“I can find someone to optimise an algorithm, I can pay someone to build a website but what I want is someone that is going to be thinking the human side through every step of the way, and when you build an algorithm and when you write a line of code you ask, does this make sense in terms of the phenomena that I am trying to model or trying to interpret.”

Josh Introne, Michigan State University

Drinking from the fire hose: mixing methods

While Big Data is necessarily quantitative, it can be used in conjunction with other methods.

“For me, I think if I only look at the numbers I don’t get the whole picture … if we look at, for example, Twitter data, you can see some tendencies, but if you want to answer the right question then I think it’s necessary to do more qualitative studies … So I’m doing interviews with political parties, I’m also doing interviews with journalists, in order to talk about how they are using social media as journalistic tools.”

Bente Kalsnes, University of Oslo

Drinking from the fire hose: mixing methods

This means correlations can point the way for deeper causal explanatory research.

“So you start off with the patterns and then what you should be doing is saying ‘Well, here’s some possible reasons’, and then when you’ve found some relationships which really deserve more study then you would go off and do a more detailed qualitative assessment as to whether this was true or not.”

Richard Webber, King’s College London

Conclusion: learning to drink from the fire hose

The major question around Big Data is less about what the data looks like and more about what we do with it.

The Big Data approach seems to challenge basic tenets of academic research, undermining precision, validity and explanatory power.

However, with a greater understanding of the nature of data, a collaborative approach and a willingness to employ multiple methods, we’ll be better equipped to drink from the Big Data fire hose.
