IS6600-10 Big Data, Intelligence & Surveillance 1.

Post on 23-Dec-2015


IS6600-10

Big Data, Intelligence & Surveillance

1

2

Hype, Reality or …?

3

Purpose

• The purpose of this class is to introduce the concept of Big Data, examine its potential and value for organisations and governments, as well as the downside effects on privacy

• I also hope to stimulate your own thinking about Big Data – and how it affects you

4

Basics

• Big Data refers to the vast quantities of data that businesses and governments gather

• This data is believed to contain useful, actionable intelligence that could lead to
  – Process efficiencies
  – Lower costs
  – Higher profits
  – Identification of terrorism threats/plans

• What is needed is the will and expertise to perform the relevant analysis.

5

How Big is Big?

• It depends on how quickly you can access and process data (with normal database management tools)

• For a small company, hundreds of gigabytes could be big. For a larger company, hundreds of terabytes
  – 1 terabyte = 1000 gigabytes
  – 1 petabyte = 1000 terabytes
  – 1 exabyte = 1000 petabytes

• Beyond that: 1 zettabyte = 1000 exabytes, 1 yottabyte = 1000 zettabytes
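The unit ladder above can be turned into a small conversion helper. This is only a sketch: it assumes the decimal convention used on this slide (each unit is 1000 times the previous one), and the function names `convert` and `humanize` are made up for illustration.

```python
# Decimal (SI) storage units, each 1000 times the previous one,
# following the slide's convention (1 terabyte = 1000 gigabytes).
UNITS = ["gigabyte", "terabyte", "petabyte", "exabyte", "zettabyte", "yottabyte"]


def convert(value, from_unit, to_unit):
    """Convert a quantity between two of the decimal units listed above."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    if steps >= 0:
        return value * 1000 ** steps
    return value / 1000 ** (-steps)


def humanize(gigabytes):
    """Express a size given in gigabytes in the largest unit that keeps it >= 1."""
    value, unit = float(gigabytes), UNITS[0]
    for next_unit in UNITS[1:]:
        if value < 1000:
            break
        value, unit = value / 1000, next_unit
    return value, unit


if __name__ == "__main__":
    print(convert(1, "petabyte", "gigabyte"))
    print(humanize(2_500_000))
```

With a helper like this, "hundreds of gigabytes" for a small company and "hundreds of terabytes" for a larger one can be compared on the same scale.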

6

Size Contexts

• Some areas of science generate huge amounts of data:
  – Meteorology (weather forecasting) & Remote Sensing
  – Genomics (genome sequencing)
  – Physics, e.g. CERN
    • 150 million sensors each deliver data 40 million times per second
    • Working with only 0.001% of the data collected, 25 petabytes a year is still collected
    • If all data was used, it would be 500 exabytes a day – 200 times more than all other global data sources combined
  – Social data, RFID data
  – Surveillance – NSA & GCHQ
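A quick back-of-the-envelope check of the CERN figures helps to get a feel for the scale. The sketch below takes the numbers on this slide at face value and assumes decimal units (1 exabyte = 1000 petabytes); the variable names are illustrative only.

```python
EXABYTE_IN_PETABYTES = 1000  # decimal units, as on the "How Big is Big?" slide

kept_pb_per_year = 25              # petabytes kept per year (slide figure)
raw_eb_per_day = 500               # exabytes per day if everything were kept (slide figure)
sensors = 150_000_000              # number of sensors (slide figure)
readings_per_second = 40_000_000   # deliveries per sensor per second (slide figure)

# Yearly raw volume if every reading were stored, in petabytes.
raw_pb_per_year = raw_eb_per_day * EXABYTE_IN_PETABYTES * 365

# Share of the raw volume that is actually kept.
kept_fraction = kept_pb_per_year / raw_pb_per_year

# Total number of sensor deliveries per second across the experiment.
readings_total_per_second = sensors * readings_per_second

print(f"Raw volume: {raw_pb_per_year:,} PB per year")
print(f"Kept share: {kept_fraction:.7%}")
print(f"Sensor deliveries: {readings_total_per_second:.1e} per second")
```

On these figures the share that is kept comes out well below 0.001%, so the percentage on the slide is probably best read as an upper bound rather than an exact value.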

7

The History

• Big Data is not a new topic
  – Data has been getting bigger continually ever since the first byte was created
  – It is related to storage capacity and processing power – which also keep growing continually

• Over the last 25 years, many governments have attempted to consolidate data holdings into single databases controlled by single parties
  – National ID Schemes
  – National Health Records Management

8

Corporate Examples

• Amazon handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers.

• Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (about 2,500 terabytes) of data

• Facebook handles 50 billion photos.

• TaoBao & Alibaba – again, billions of transactions

• Consumer profile databases, Loyalty Cards, Octopus

9

Ford

• http://www.datanami.com/datanami/2013-03-16/how_ford_is_putting_hadoop_pedal_to_the_metal.html

• Ford’s modern hybrid Fusion model generates up to 25 gigabytes of data per hour
  – Data that is a potential goldmine for Ford, as long as it can find the right analytical tools for the job.

• The data can be used to
  – understand driving behaviors and reduce accidents,
  – understand wear and tear
  – identify issues that lower maintenance costs,
  – avoid collisions

• But who should own the data? Ford? The car owner?

10

Needles & Haystacks

• The volume of data is huge, beyond imagination, and the consultants and software firms want us to believe that somewhere, if you can find them, there may be some needles – pieces of actionable intelligence

11

Who is Pushing Big Data?

• IBM!
  – Because they want to sell you their software that (they claim) will help you to analyse the data and find the needles

• Consultants stand to make millions, by panicking their clients into spending on software solutions

• Globally, this is a US$100 billion industry, growing 10% a year

12

Is Everyone Happy?

• The consultants suggest not. Accenture:
  – 22% of companies are very satisfied
  – 35% are quite satisfied
  – 34% are dissatisfied
  – 39% say that they have data that is relevant to their business strategy

• Big data can be useful – if you know what to look for and how to get that ‘intelligence’ to the people who can use it

13

Consultant Perspectives

• Companies have lots of data, but “most organisations measure too many things that don’t matter and don’t put sufficient focus onto the things that do” (Accenture).

• “Companies are buried in information” and are struggling to use it (McKinsey)

• The more data they have, the less they seem to know!

14

Then What Should the Companies Do?

• Spend more money (say the consultants)
  – “a large investment in new data capabilities”
    • McKinsey
  – “embed analytics into business processes”
    • Accenture

• Alternatively
  – Go and ask people what they think is happening!
  – Ask your lost customers why they got lost!

• A survey or big data analytics won’t tell you why.

15

Gartner’s Hype Cycle

16

Big Data and Intelligence

• One of the highest impact news stories since June 2013 has concerned the secret surveillance activities of the NSA and GCHQ agencies – as revealed by Edward Snowden

• These surveillance activities are fundamentally about big data and analytics, just as they are also about privacy and security, espionage and politics

17

Key Terms

• NSA – National Security Agency (US)

• GCHQ – Government Communications Headquarters (UK)

• Prism, Tempora, XKeyscore, Bullrun
  – Systems that store, retrieve and analyze the data

• The Guardian – UK newspaper that first published the stories

• Patriot Act – US Act for Homeland Security, passed after the attacks of 11 September 2001
  – http://en.wikipedia.org/wiki/Patriot_Act

18

Government’s Perspective

• Looking for needles in the metadata
  – Phone numbers, call duration & frequency
  – Global patterns that may involve terrorism
  – If a bombing in India can be matched to a sudden increase of calls in another country, that might be of interest
  – To be effective, they need as much data as possible – in short, everything.
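To make the idea of "a sudden increase of calls" concrete, here is a minimal sketch of how a spike in call metadata might be flagged and then matched to the date of an event. It is not a description of any agency's actual method: the call counts are invented, and the function name `find_spikes`, the 7-day window and the threshold of 3 standard deviations are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_spikes(daily_counts, window=7, threshold=3.0):
    """Return indices of days whose count is more than `threshold` standard
    deviations above the mean of the preceding `window` days."""
    spikes = []
    for day in range(window, len(daily_counts)):
        history = daily_counts[day - window:day]
        baseline, spread = mean(history), stdev(history)
        # Skip perfectly flat histories to avoid dividing by zero.
        if spread > 0 and (daily_counts[day] - baseline) / spread > threshold:
            spikes.append(day)
    return spikes


# Hypothetical daily call counts between two regions (invented numbers).
calls = [102, 98, 105, 99, 101, 97, 103, 100, 240, 104]
event_day = 8  # hypothetical day on which the event of interest happened

spikes = find_spikes(calls)
# Keep only the spikes that fall within one day of the event.
near_event = [day for day in spikes if abs(day - event_day) <= 1]

print("Spike days:", spikes)
print("Spikes near the event:", near_event)
```

Note that this only works if a baseline of "normal" traffic already exists, which is one way to read the argument that the agencies need as much data as possible.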

19

The Surveillance Picture

• Edward Snowden has leaked a LOT of information

• The stories are still coming. We have learned a LOT about what governments do – with their own citizens’ data, and with data from other countries

• You may recall stories about data being captured in Hong Kong and China from the Chinese University and Tsinghua University Internet hubs
  – http://www.reuters.com/article/2013/06/24/us-usa-security-tsinghua-idUSBRE95N0M220130624

• This is a series of events of global proportion

• We should not be surprised at anything any more
  – If they want to collect it, anything, then they can and will.

20

Selected Events

• Publication of a top-secret court order against Verizon mandating it to hand over the call records of all its customers
  • http://www.theguardian.com/world/2013/jul/19/nsa-extended-verizon-trawl-through-court-order

• Orders for all other telecoms firms also existed

• Large-scale collection of data without individual warrants
  – Prism
    • http://en.wikipedia.org/wiki/PRISM_(surveillance_program)

21

Prism

• A system that gives the NSA access to the personal information of non-US people from US Internet companies
  – Apple, Facebook, Google, Microsoft, Skype, Yahoo,…

• These companies always claimed that they protected individual privacy, but … it seems that this was not the case

• However, they were legally required to say nothing – the court orders prohibited them from saying anything about their data sharing with the NSA

• Data obtained by cable tapping
  – Metadata & content from 4 US telecoms providers’ cables

22

Facebook

• During Jan-June 2013, governments requested info on 38,000 Facebook users
  – 11,000+ from the US (79% compliance)
  – 4000+ from India (50% compliance)
  – 170 from Turkey (47% compliance)
  – 11 from Egypt (0% compliance)
  – http://www.theguardian.com/technology/2013/aug/27/facebook-government-user-requests

23

XKeyscore

• This is the data retrieval system used to collect, process and search the data

• http://en.wikipedia.org/wiki/XKeyscore

• It allows an NSA analyst to query “nearly everything a typical user does on the Internet” in near-real time, including:
  – Email content
  – Websites visited and searches
  – Metadata

• In theory these systems were designed to analyse data about foreigners, but many Americans were also included in the databases

24

GCHQ

• This is the UK’s government department that deals with Telecommunications Signals & Intelligence
  • http://www.gchq.gov.uk
  • http://en.wikipedia.org/wiki/Government_Communications_Headquarters

• Access to Prism since 2010

• Operates Tempora, similar to Prism, for collecting data from the Internet and telecoms.

25

GCHQ

• In 2009, GCHQ spied on foreign politicians visiting the UK for a G20 summit
  – Eavesdropping on phone calls and emails
  – Monitoring computers
  – Installing keyloggers and then tracking activities post-summit
  – Turkish Finance Minister (Simsek)
  – Russian leader (Medvedev)

• Purpose – Economic/Political Intelligence

26

Tempora

• Much of the data is harvested from Internet cables that enter the UK (GBs-TBs per second)
  – 300 GCHQ and 250 NSA analysts are involved

• Telephone calls, email messages, Facebook entries, personal Internet history, IM chats, passwords
  – Cooperation with private telecoms companies
  – Content held for 3 days, metadata for 30 days

• http://en.wikipedia.org/wiki/Tempora

• http://www.theguardian.com/uk/2013/jun/21/gchq-cables-secret-world-communications-nsa

27

Bullrun

• NSA and GCHQ spend millions developing programmes that can break Internet security (cryptography) protocols like HTTPS, SSL, etc.

• They also work directly with the telecom providers to ensure that they have backdoors that help them to access data that clients think is private/secret

• There are no secrets!
  – http://www.theguardian.com/world/2013/sep/05/nsa-gchq-encryption-codes-security

28

Collusion or Legal Obligation?

• One defence offered by the private companies that hold the data (whether in databases or as ISPs) is that they are required to obey the law of the countries in which they operate
  – They have no choice – they must hand over the data, or cooperate with the security agencies
  – Also, they cannot reveal that they are cooperating – they are gagged from revealing the existence of the Prism/Tempora/Bullrun systems

29

Payouts

• GCHQ and NSA are working with each other, sharing each other’s data

• NSA subsidizes GCHQ’s costs by millions of pounds (GBP) annually

• http://www.theguardian.com/uk-news/2013/aug/01/nsa-paid-gchq-spying-edward-snowden

• NSA benefits by GCHQ operating under less strict operating & oversight rules

• NSA expects returns… reports, intelligence.

30

Problems

• Big data is HUGE – there is simply too much data to collect and analyse
  – GCHQ may collect up to 20% of the actual data flow

• Big data is getting bigger
  – Cables that carry hundreds of GB/second make that task harder still

• As always, 99.999% of the data is not useful.
  – Can you find the 0.001% that might be?
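The percentages above become more tangible when applied to a single cable. The following is a rough sketch only: the 20% and 0.001% figures come from this slide, but the cable rate of 100 GB/second is a hypothetical value at the lower end of "hundreds", and "GB" is assumed here to mean gigabytes in decimal units.

```python
GB_PER_PB = 1_000_000    # decimal units: 1 petabyte = 1,000,000 gigabytes
SECONDS_PER_DAY = 86_400


def daily_volume_pb(gb_per_second):
    """Petabytes carried in one day by a link running at `gb_per_second`."""
    return gb_per_second * SECONDS_PER_DAY / GB_PER_PB


cable_rate = 100         # GB/second: hypothetical, lower end of "hundreds"
collected_share = 0.20   # slide: GCHQ may collect up to 20% of the flow
useful_share = 0.00001   # slide: 0.001% of the data might be useful

carried = daily_volume_pb(cable_rate)                 # PB carried per day
collected = carried * collected_share                 # PB collected per day
useful_gb = collected * GB_PER_PB * useful_share      # GB per day that might matter

print(f"Carried per day:   {carried:.2f} PB")
print(f"Collected per day: {collected:.3f} PB")
print(f"Possibly useful:   {useful_gb:.2f} GB")
```

Even under these modest assumptions, the possibly useful part is a tiny slice of several petabytes a day, which is the needle-and-haystack problem in numbers.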

31

Reactions

• There have been attempts to stop media organizations from reporting on the surveillance programmes

• Computers owned by the Guardian newspaper were physically destroyed in an attempt to remove the data & prevent further publication
  – Additional copies are held in Brazil and the US
  – http://www.wired.com/threatlevel/2013/08/guardian-snowden-files-destroyed/

32

Implications for Individuals

• Is your data being harvested?
  – It seems likely.

• Are your private communications, including online purchases, secure?
  – Not very.

• Are you protected by data privacy laws?
  – Not against governments.
  – Perhaps against private companies.
    • http://www.pcpd.org.hk/

33

Questions

• What kind of data is being collected?
  – Where, by whom, for what purposes?
  – Can we see/find (some of) the data anywhere?
  – Are you personally at risk?
    • That depends on who you are, what you do, who you talk to and what about.
  – Should we be concerned?

• Is there anything we can do as individuals, as decision makers, as companies?
  – http://www.theguardian.com/world/2013/sep/05/nsa-how-to-remain-secure-surveillance

• Or is it more sensible just to get on with our lives?

• Do some Internet research now and try to answer some of these questions.