Drinking from the fire hose? The pitfalls & potential of Big Data
Josh Cowls, Oxford Internet Institute
with contributions from Eric Meyer, Ralph Schroeder and Linnet Taylor
t2i Lab, Chalmers, 27th March 2014
Overview
• Background
• Definitions
• Innovations and implications
• Learning to drink from the fire hose
The Oxford Internet Institute
• Department of the University of Oxford
• MO: ‘Understanding life online’
• Multi-disciplinary mix (social sciences plus physical and medical sciences,
and humanities)
• 45 researchers (and growing)
• 50 students (MSc Social Science of Internet; PhD programme)
• Generating big data on social, political and economic behaviour from social
media
www.oii.ox.ac.uk
Accessing and Using Big Data to Advance Social Science Knowledge
• Funded by the Alfred P. Sloan Foundation
• 2012 – 2014
• Data sources:
• 120 interviews, mainly with social scientists but some interviewees from business, government
• Reports, workshops, publications
• No representative sample, but some patterns of disciplinary and skills background and career trajectory
NB where unattributed, quotes used in this presentation are excerpted from interviews conducted as part of this project.
Big Data: our definition
Big data are data that are unprecedented in scale and scope in relation to a given phenomenon.
They are often streams of data (rather than fixed datasets), accumulating large volumes, often at high velocity.
Big Data: other definitions
• ‘Transactional’ (Margetts et al)
• ‘Things that one can do at a large scale that cannot be done at a smaller one’ (Mayer-Schönberger and Cukier)
• The ‘3 Vs’: volume, velocity, variety – but also veracity, visualisability, viscosity? (Gartner)
... what Big Data isn’t
• A generalisable, quantifiable ‘amount’ of data
• A race to the top (Mutually Assured Distraction)
• The same for every discipline, field or sector
A ‘working’ definition
• The Big Data phenomenon might be less about what the dataset is and more about how we work with it
• (Even if this is indistinguishable in practice)
Shifts in mindset
From Mayer-Schönberger and Cukier:
• “The ability to analyse vast amounts of data about a topic rather than be forced to settle for smaller sets”
• “A willingness to embrace data’s real-world messiness rather than privilege exactitude”
• “A growing respect for correlations rather than a continuing quest for elusive causality”
Implications for research
Whither the sample?
“the sample survey[‘s] glory years ... are in the past”
Savage and Burrows, 2007
Implications for research
Whither the sample?
“sampling is like an analog photographic print. It looks good from a distance, but as you stare closer, zooming in on a particular detail, it gets blurry ... Often, the really interesting things in life are found in places that samples fail to fully catch”
Mayer-Schönberger and Cukier 2012
Implications for research
More or mess?
“social media is really, really fascinating, and the reason is because it ... falls into this category of there’s something there but we don’t know what it is. So you can measure public opinion on Twitter and clearly that’s indicative of something, but we don’t quite know what it’s representative of”
Brandon Stewart, Harvard University Department of Government
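The gap between "indicative" and "representative" can be illustrated with a toy simulation. This is only a sketch under invented assumptions, not an analysis from the project: the population size, the 50% true level of support, and the platform participation rates (30% for supporters, 10% for non-supporters) are all hypothetical numbers chosen to make the effect visible.

```python
import random


def simulate_samples(seed=0):
    """Compare a large self-selected 'platform' sample with a small random survey."""
    rng = random.Random(seed)

    # Hypothetical population: each person supports a policy with probability 0.5.
    population = [rng.random() < 0.5 for _ in range(100_000)]

    # Assumption for illustration only: supporters are three times as likely
    # as non-supporters to be active on the platform (30% vs 10%).
    platform_users = [
        supports
        for supports in population
        if rng.random() < (0.30 if supports else 0.10)
    ]

    # A conventional random sample of 1,000 people from the same population.
    survey = rng.sample(population, 1_000)

    return {
        "true_share": sum(population) / len(population),
        "platform_share": sum(platform_users) / len(platform_users),
        "platform_n": len(platform_users),
        "survey_share": sum(survey) / len(survey),
        "survey_n": len(survey),
    }


if __name__ == "__main__":
    for name, value in simulate_samples().items():
        print(f"{name}: {value}")
```

Under these assumptions the platform sample is roughly twenty times larger than the survey, yet its estimate of support sits near 75% rather than 50%. The small random sample lands much closer to the true share.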
Implications for research
More or mess?
“the problem with the hashtag stuff [is that] we have wonderful case studies but we don’t know what they sit in essentially, what the framework is, if that’s 1% or 10% or 100% of the current conversation in Australia or whatever”
Axel Bruns, Queensland University of Technology
Implications for research
More or mess?
“the big problem that we haven’t cracked is that if someone tweets a sentiment it’s not necessarily what they’re feeling, it can be for a variety of reasons, so it doesn’t really reflect what they feel necessarily”
Mike Thelwall, University of Wolverhampton
Implications for research
Do we care about causes?
“Big Data is all about correlation; it’s not about causation, which means that you don’t need to have a theory beforehand. You just start looking for correlation … so you don’t have any idea about the structure of the data, you just find a funny correlation.”
Sara Esposti, Open University Business School
Implications for research
Do we care about causes?
“a central concern of social science is, we don’t just want to find statistical associations, we actually want to uncover the underlying causal processes by which social systems work ... The data themselves don’t tell you about cause and effect, there’s actually a very complex often, complex inferential process you have to go through in order to extract from the data the things that you really care about”
David Jensen, University of Massachusetts
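A minimal sketch of why an association on its own says little about cause and effect. The variables are invented for illustration: a hidden factor z drives both x and y, so x and y are correlated even though neither influences the other. Controlling for z (here by correlating the residuals of simple least-squares fits, i.e. a partial correlation) removes the association. This works only because z is observed in the toy example; it is not a method described in the talk.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def residuals(ys, zs):
    """Residuals of ys after a least-squares straight-line fit on zs."""
    n = len(ys)
    my, mz = sum(ys) / n, sum(zs) / n
    slope = sum((z - mz) * (y - my) for z, y in zip(zs, ys)) / sum(
        (z - mz) ** 2 for z in zs
    )
    return [y - my - slope * (z - mz) for y, z in zip(ys, zs)]


def confounded_correlation(n=5_000, seed=1):
    """Return (raw correlation of x and y, correlation after controlling for z)."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]   # hidden common cause
    x = [zi + rng.gauss(0, 1) for zi in z]    # x depends on z only
    y = [zi + rng.gauss(0, 1) for zi in z]    # y depends on z only
    raw = pearson(x, y)
    partial = pearson(residuals(x, z), residuals(y, z))
    return raw, partial


if __name__ == "__main__":
    raw, partial = confounded_correlation()
    print(f"raw correlation: {raw:.2f}")
    print(f"after controlling for z: {partial:.2f}")
```

In this setup the raw correlation is about 0.5 and the partial correlation is close to zero. Real social data rarely offers the confounder so conveniently, which is the "complex inferential process" the quote refers to.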
Implications for research
Do we care about causes?
“I’ve been talking to some computer scientists who are rising stars, they’re really doing well, and they acknowledge that the way in which the field works, novelty is the key issue. And so there’s always an incentive or a pressure to keep on doing new stuff with new data, even though they might have wanted to go into more depth into something.”
Sandra Gonzalez-Bailon, Annenberg School of Communication, University of Pennsylvania
The challenge
How can we extract meaning from Big Data – learn to drink from the fire hose?
Drinking from the fire hose
• Understanding the data
• Collaborating
• Mixing methods
Drinking from the fire hose: understanding the data
The rise of the information society has given us myriad new forms of data and accompanying ways of analysing it.
The challenging part is abstracting meaning about society in general from data created and harvested online.
Drinking from the fire hose: understanding the data
Example: it’s hard to predict elections using Twitter
“[Of] 14 different attempts to predict elections based on Twitter data ... Only half of them were successful ... All of this looks close to mere chance”
Gayo-Avello 2012
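A rough way to see what "close to mere chance" means: if each prediction were no better than a coin flip, how likely would 7 or more correct calls out of 14 be? The sketch below makes two assumptions that are not in the source: that "half" means exactly 7 of 14, and that a 50% baseline is appropriate. Real elections are not coin flips, so this is an illustration of the argument rather than a reanalysis.

```python
from math import comb


def prob_at_least(successes, trials, p=0.5):
    """Binomial probability of at least `successes` hits in `trials` tries."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical reading of the quote: 7 successful predictions out of 14.
    print(round(prob_at_least(7, 14), 3))  # 0.605
```

Under these assumptions, pure guessing would do at least this well about 60% of the time, so the record offers no evidence of predictive skill.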
Drinking from the fire hose: understanding the data
Example: Facebook isn’t going anywhere, and neither is Princeton
Cannarella and Spechler 2014; Develin 2014
Drinking from the fire hose: understanding the data
But it’s much simpler, conceptually speaking, to analyse online phenomena on their own terms
Yasseri, Hale & Margetts 2013
Drinking from the fire hose: understanding the data
But it’s much simpler, conceptually speaking, to analyse online phenomena on their own terms
Hale, Yasseri, Cowls, Meyer, Schroeder & Margetts (submitted)
Drinking from the fire hose: understanding the data
Of course, online data can still provide insights into offline life, but these must be well-grounded.
e.g. Seth Stephens-Davidowitz, ‘The Cost of Racial Animus on a Black Candidate: Evidence Using Google Data’
• Google accounts for >50% of search engine market (less concern over representativeness)
• Google searches are private and anonymous (less concern over reliability)
• This method uncovers a social phenomenon, racism, which would be harder to detect in pre-Internet approaches e.g. interviews or surveys
Drinking from the fire hose: understanding the data
Beware false prophets
XKCD
Drinking from the fire hose: understanding the data
Beware false prophets: analyses using thousands of variables can generate millions or billions of possible relationships – not all (or most) will be valid or meaningful
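The arithmetic behind "millions or billions" is simple counting. The sketch below counts the possible pairs and triples among a set of variables, and the number of pairs expected to pass a conventional significance test even if no real relationship exists anywhere. The variable counts (5,000 and 2,000) and the 0.05 threshold are illustrative assumptions, not figures from the talk.

```python
from math import comb


def relationships(n_variables, order=2):
    """Number of distinct groups of `order` variables (pairs, triples, ...)."""
    return comb(n_variables, order)


def expected_false_positives(n_variables, alpha=0.05):
    """Expected number of 'significant' pairs if every pair is in fact unrelated."""
    return alpha * relationships(n_variables, 2)


def bonferroni_threshold(n_variables, alpha=0.05):
    """Per-test threshold that keeps the overall false-alarm rate near alpha."""
    return alpha / relationships(n_variables, 2)


if __name__ == "__main__":
    print(relationships(5_000))             # 12497500 pairs
    print(relationships(2_000, order=3))    # 1331334000 triples
    print(expected_false_positives(5_000))  # roughly 625,000 spurious 'findings'
    print(bonferroni_threshold(5_000))
```

With 5,000 variables there are about 12.5 million pairs, so a 5% false-alarm rate yields hundreds of thousands of apparent relationships by chance alone. The Bonferroni correction shown is one standard, conservative response; it is named here as an example and is not something the source prescribes.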
Drinking from the fire hose: understanding the data
Beware false prophets
“if you look at the data long enough you’ll find predictive signals that are in fact completely spurious ... for about, I think a 20 or 25 year period, the US stock market was perfectly correlated with the level of butter production in Bangladesh … if you look at hundreds and hundreds of these indicators, whether it’s the level of Bangladesh butter production or the number of cars in New York City or whatever it is, eventually you’ll find something that just by pure chance matches what you’re looking for.”
Mike Cafarella, University of Michigan
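The effect described in the quote can be reproduced with purely random data. In this sketch a "target" series of 25 annual values is compared against 500 candidate "indicators", all generated as independent random walks, and the strongest correlation is reported. The series length, the number of indicators and the random-walk model are assumptions chosen for illustration; nothing here uses real stock-market or production data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def random_walk(rng, length):
    """A series whose yearly changes are independent random shocks."""
    level, walk = 0.0, []
    for _ in range(length):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk


def best_spurious_match(n_indicators=500, length=25, seed=42):
    """Strongest correlation between a random target and unrelated indicators."""
    rng = random.Random(seed)
    target = random_walk(rng, length)
    best = 0.0
    for _ in range(n_indicators):
        r = pearson(target, random_walk(rng, length))
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    print(f"best of 1 indicator:    {best_spurious_match(n_indicators=1):+.2f}")
    print(f"best of 500 indicators: {best_spurious_match(n_indicators=500):+.2f}")
```

Every indicator is unrelated to the target by construction, yet searching across hundreds of them reliably turns up a very strong match. The more indicators examined, the stronger the best chance match becomes.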
Drinking from the fire hose: collaborating
Big data research often necessitates a wide variety of skills and perspectives. The size of teams in academic research has been increasing for decades.
Drinking from the fire hose: collaborating
This trend is likely to persist as big data research becomes more common
“the best research will often merge in collaboration between computer scientists who will have access to the tools and the background to further develop and apply those, and with social scientists who will have, sort of, good pressing social questions that we can get insight into with the data that is now available.”
Scott Hale, Oxford Internet Institute
Drinking from the fire hose: collaborating
This trend is likely to persist as big data research becomes more common
“I can find someone to optimise an algorithm, I can pay someone to build a website but what I want is someone that is going to be thinking the human side through every step of the way, and when you build an algorithm and when you write a line of code you ask, does this make sense in terms of the phenomena that I am trying to model or trying to interpret.”
Josh Introne, Michigan State University
Drinking from the fire hose: mixing methods
While Big Data is necessarily quantitative, it can be used in conjunction with other methods.
“For me, I think if I only look at the numbers I don’t get the whole picture … if we look at, for example, Twitter data, you can see some tendencies, but if you want to answer the right question then I think it’s necessary to do more qualitative studies … So I’m doing interviews with political parties, I’m also doing interviews with journalists, in order to talk about how they are using social media as journalistic tools.”
Bente Kalsnes, University of Oslo
Drinking from the fire hose: mixing methods
This means correlations can point the way for deeper causal explanatory research.
“So you start off with the patterns and then what you should be doing is saying ‘Well, here’s some possible reasons’, and then when you’ve found some relationships which really deserve more study then you would go off and do a more detailed qualitative assessment as to whether this was true or not.”
Richard Webber, King’s College London
Conclusion: learning to drink from the fire hose
The major question around Big Data is less about what the data looks like and more about what we do with it.
The Big Data approach seems to challenge basic tenets of academic research, undermining precision, validity and explanatory power.
However, with a greater understanding of the nature of data, a collaborative approach and a willingness to employ multiple methods, we’ll be better equipped to drink from the Big Data fire hose.