Introduction to big data

BIG DATA 8/2014 [email protected]

description

A lesson about big data in an engineering school

Transcript of Introduction to big data

Page 1: Introduction to big data

BIG DATA

8/2014

[email protected]

Page 2: Introduction to big data

Table of contents

① Definition: what is big data?

② Dimensioning uses

③ Why should I be interested?

• Market study

④ How can it benefit

• Companies

• Consumers/citizens

• Society

⑤ Infrastructure

• Gathering

• Storage

• Networks

• Processing

⑥ Tools

⑦ Data models and predictive analytics

⑧ From big data to smart data

Page 3: Introduction to big data

DEFINITIONS

Page 4: Introduction to big data

Time-keeper genesis (figure; labels: Cost, Time, Users)

Page 5: Introduction to big data

Hype curve 2014 (copyright Gartner)

Page 6: Introduction to big data

Is big data the latest buzzword bingo?

① http://www.bullshitbingo.net/cards/buzzword/

Page 7: Introduction to big data

Why data science?

Page 8: Introduction to big data

http://www.personalizemedia.com/garys-social-media-count/

Page 9: Introduction to big data

3V by SAP

(Figure: the three V's, Volume, Variety and Velocity, with example data: CRM data, GPS, Demand, Speed, Transactions, Opportunities, Service calls, Customer, Sales orders, Inventory, E-mails, Tweets, Planning, Things, Mobile, Instant messages)

Page 10: Introduction to big data

What is big data? (original idea Gartner)

① Volume

② Variety

• Structured/unstructured

• Public/Private

• Text/image/sound…

③ Velocity

• Generated

• Captured

• Shared

④ Veracity

Page 11: Introduction to big data

Four « V » by IBM

Page 12: Introduction to big data

Volume

① From the first wave, 10 000 years ago, until 1950, mankind created only 5 exabytes

② 1 EB = 1 000 000 000 000 000 000 B

= 1 000 000 000 gigabytes

= 1 000 000 terabytes

= 1 000 petabytes

③ Nowadays we produce 5 exabytes every 2 days!
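
The unit chain on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes decimal (SI) units, as the slide does; the `UNITS` table and the `convert` helper are illustrative names, not part of the original lesson.

```python
# Decimal (SI) storage units expressed in bytes, as used on the slide.
UNITS = {
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
}


def convert(value, from_unit, to_unit):
    """Convert a quantity between two decimal storage units."""
    return value * UNITS[from_unit] / UNITS[to_unit]


one_eb_in_gb = convert(1, "EB", "GB")
one_eb_in_tb = convert(1, "EB", "TB")
one_eb_in_pb = convert(1, "EB", "PB")

# "5 exabytes every 2 days" extrapolated to one year (365 days), in exabytes.
yearly_eb = 5 / 2 * 365

print(one_eb_in_gb, one_eb_in_tb, one_eb_in_pb, yearly_eb)
```

Under the slide's own figure, 5 EB every 2 days would amount to roughly 900 EB, that is a little under one zettabyte, per year.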

Page 13: Introduction to big data

Volume (cf. Wikipedia)

① According to an IDC study sponsored by EMC Gartner, digital data created worldwide grew from 1.2 zettabytes/year in 2010 to 1.8 zettabytes in 2011 and 2.8 zettabytes in 2012, and should reach 40 zettabytes in 2020.
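
To see what these figures imply, the compound annual growth rate can be computed directly. The sketch below uses only the four values quoted on the slide; the `cagr` helper is an illustrative name, and the result is an implied average, not a figure from the IDC study.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values separated by `years`."""
    return (end_value / start_value) ** (1 / years) - 1


# Yearly data volumes quoted on the slide, in zettabytes.
volumes_zb = {2010: 1.2, 2011: 1.8, 2012: 2.8, 2020: 40.0}

growth_2010_2012 = cagr(volumes_zb[2010], volumes_zb[2012], 2)
growth_2012_2020 = cagr(volumes_zb[2012], volumes_zb[2020], 8)

print(f"2010-2012: {growth_2010_2012:.1%} per year")
print(f"2012-2020: {growth_2012_2020:.1%} per year")
```

With these inputs the observed growth is a little over 50% per year for 2010-2012, and the 2020 forecast corresponds to a bit under 40% per year sustained over eight years.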

Page 14: Introduction to big data

Volume : storage

Page 15: Introduction to big data

Variety : data classifications

① By structure

• Structured (SQL-like databases): about 20%

• Unstructured: about 80%

② By source

• Human originated

• Non-human originated

• In-house

• From outside

③ By movement

• Data in motion

• Data at rest

Page 16: Introduction to big data

Velocity : example of cellular data rates

① 2G

• GPRS: 140.6 kbps

• EDGE: 473.6 kbps

② 3G

• UMTS: 384 kbps

• HSPA: 14.4 Mbps

③ 4G

• HSPA+: 42.2 Mbps

• LTE: 173 Mbps

④ What’s next?
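
One way to make these rates concrete is to compute how long a fixed payload takes to transfer at each of them. This is a rough sketch: it takes the peak rates exactly as listed on the slide, assumes a 1 GB (10^9 bytes) payload chosen for illustration, and ignores protocol overhead and real-world radio conditions.

```python
# Peak rates from the slide, converted to kilobits per second.
RATES_KBPS = {
    "GPRS": 140.6,
    "EDGE": 473.6,
    "UMTS": 384.0,
    "HSPA": 14_400.0,
    "HSPA+": 42_200.0,
    "LTE": 173_000.0,
}

PAYLOAD_BYTES = 10**9  # 1 GB, an illustrative payload size


def transfer_seconds(size_bytes, rate_kbps):
    """Ideal transfer time in seconds at a constant rate, no overhead."""
    return size_bytes * 8 / (rate_kbps * 1000)


times = {name: transfer_seconds(PAYLOAD_BYTES, rate)
         for name, rate in RATES_KBPS.items()}

for name, seconds in times.items():
    print(f"{name:6s} {seconds:10.1f} s")
```

Under these assumptions the same gigabyte takes many hours over GPRS and well under a minute over LTE.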

Page 17: Introduction to big data

Big data cycle management

Capture → Organize → Integrate → Analyze → Act

Page 18: Introduction to big data

Intelligence cycle (source C.I.A.) : similarity?

① The UKUSA agreement shares facilities, tasks and product between participating governments.

② What about analysis?

2014 Hear & Know 2014 18

Page 19: Introduction to big data

Intelligence cycle applied to Sigint

Interception of messages and communications data (metadata)

Processing

• Traffic analysis of communications data (who is communicating with whom)

• Cryptanalysis

• Analysis of the content of messages

Analysis with the use of other sources, for example Open Source Intelligence (OSINT)

Dissemination

Planning & direction


Page 20: Introduction to big data

Big data « food chain »

Personal data

• Contacts/Calendar

• Audio/Voice/Music

• Mails/Notes

• Photos/Videos

• Identifiers/Metadata

• Positions

• Navigation history

• Biometric data (fingerprints, voice…)

• Games data

Storage

• Terminal

• Servers

• Data centers

• Cloud

Accessibility

• Operators radio (short or long range)

• API

• OS

• Development kit

Applications

• Millions of applications in Apple store and Google Play

Page 21: Introduction to big data

Structured data: human generated

① Input

② Click stream

③ Gaming related (moves)

④ Quantified self

Page 22: Introduction to big data

Structured data : machine/computer generated

① Sensor

② Smart meters

③ Weblog

④ Point of sale

⑤ Financial

Page 23: Introduction to big data

WHY BIG DATA?

Page 24: Introduction to big data

Unsolved problems with « classical » means

① Search engines with RDBMS -> Google's own solutions

Page 25: Introduction to big data

Moore’s law miniaturisation and its limits

① « The number of integrated transistors on a silicon chip is multiplied by 4 every 3 years » Moore 1965

② Rock’s law: chip manufacturing costs double every 4 years

③ Below 20 nm: quantum effects
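
The two growth laws on this slide can be compared numerically. The sketch below takes the slide's own parameters at face value (transistor count ×4 every 3 years, manufacturing cost ×2 every 4 years); the helper names and the 12-year horizon are illustrative choices.

```python
import math


def growth_factor(years, multiplier, period_years):
    """Total growth after `years` when a quantity is multiplied by
    `multiplier` every `period_years`."""
    return multiplier ** (years / period_years)


def doubling_time(multiplier, period_years):
    """Time needed to double, for the same exponential law."""
    return period_years * math.log(2) / math.log(multiplier)


# Slide's statement of Moore's law: x4 every 3 years.
transistor_doubling_years = doubling_time(4, 3)
transistors_after_12_years = growth_factor(12, 4, 3)

# Slide's statement of the cost law: x2 every 4 years.
cost_after_12_years = growth_factor(12, 2, 4)

print(transistor_doubling_years, transistors_after_12_years, cost_after_12_years)
```

Read this way, "×4 every 3 years" is the same law as a doubling every 18 months, and over 12 years the transistor count grows far faster than the manufacturing cost.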

Page 26: Introduction to big data

After the end of Moore’s law

① Around 2020: limits of classical « engraving » (lithography) physics => necessary evolutions

• Pessimistic scenario

o Application/architecture innovation

o Cost/price erosion

• Substitution technologies

o Biology/organic molecular electronics

o DNA

o Neuronal/analog

o Superconductors

o Optics

o Components with one or few electrons

o Quantum computers

o Nano-technologies

o …

Page 27: Introduction to big data

DIMENSIONING EXAMPLES

Page 28: Introduction to big data

The « historical » big data crunchers

① Simulation (nuclear…)

② Weather forecasting

③ Sigint

④ The National Security Agency is building the biggest building on earth: the Utah Data Center, planned to hold yottabytes of data collected from the internet.

⑤ Cryptanalysis

Page 29: Introduction to big data

Amazon

① From on-line bookstore to global IT provider

② Big data :

• User initially then

• Provider

Page 30: Introduction to big data

Cellular : potential analysis

① Cell activity for urban planning and network planning

② Policy makers

③ Urban planners

④ Traffic engineers

⑤ Weather forecast

⑥ …

Page 31: Introduction to big data

Quantified self/Lifelogging

① Position

② Sleeping hours

③ Blood pressure/heart rate

④ Pedometer

⑤ Accelerometer/speedometer/distance

⑥ Food/beverage

⑦ Temperature

⑧ Weight

⑨ Height

⑩ Photo

⑪ Voice recording

Page 32: Introduction to big data

Trading

① High frequency/Speed trading (non distributed)

Page 33: Introduction to big data

EDF and metering

① Previous situation

• 35 M houses in France

• 2 meter readings/year

② Remote metering

• Every 30 min

③ Result

• Expected savings: xxx MW
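
The jump in data volume implied by this change can be worked out from the figures on the slide. This is a back-of-the-envelope sketch: it assumes every one of the 35 million houses gets one meter reporting every 30 minutes all year, which the slide does not state explicitly.

```python
HOUSEHOLDS = 35_000_000  # houses in France, from the slide

# Previous situation: 2 manual readings per house per year.
readings_before = HOUSEHOLDS * 2

# Remote metering: one reading every 30 minutes, 365 days a year (assumed).
readings_per_meter_per_day = 24 * 60 // 30
readings_per_meter_per_year = readings_per_meter_per_day * 365
readings_after = HOUSEHOLDS * readings_per_meter_per_year

ratio = readings_after // readings_before

print(f"before: {readings_before:,} readings/year")
print(f"after:  {readings_after:,} readings/year")
print(f"ratio:  x{ratio:,}")
```

Under these assumptions the yearly number of readings goes from 70 million to about 613 billion, a factor of 8 760.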

Page 34: Introduction to big data

Cybersecurity Hypervision

① Digital Attack Map

② Norse

Page 35: Introduction to big data

Big data applications

① Social media analytics: an impressive example is LinkedIn's « People you may know »

② Voice analytics: call centers, mobile phones (Siri)

③ Text analytics

④ Video analytics

⑤ Telecom : customer churn

⑥ Behavioural analytics

Page 36: Introduction to big data

Marketing

① Knowledge

• Brand

• Competitors

• Customers

• Anticipate the market

• A/B testing

Page 37: Introduction to big data

Marketing

Page 38: Introduction to big data

Geolocation

① Skyhook (Google, Apple…) database may be used to observe people's movements

Page 39: Introduction to big data

Intelligence family

(Figure: the « Int » family: Humint, Osint, Imint, Sigint)


Page 40: Introduction to big data

Sigint and family

Techint

Sigint

Comint Elint Masint

Imint …


Page 41: Introduction to big data

Politics

① Obama re-election

② In 2013, big data was named one of France's « 7 strategic ambitions » by the Commission Innovation 2030

Page 42: Introduction to big data

Science

① The “Square Kilometre Array” radio telescope will deliver 50 terabytes of analyzed data per day, from 7 000 terabytes of raw data per second

② The Large Hadron Collider has around 150 million sensors producing data 40 million times per second.

③ About 600 million collisions/s; after filtering, 100 interesting collisions/s remain. There are 25 petabytes to store per year
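
The Large Hadron Collider figures give a feel for how aggressive the filtering is. The sketch below uses only the numbers quoted on the slide and assumes decimal petabytes and a 365-day year; the variable names are illustrative.

```python
COLLISIONS_PER_S = 600e6      # collisions per second, from the slide
KEPT_PER_S = 100              # interesting collisions kept per second
STORED_BYTES_PER_YEAR = 25e15 # 25 PB per year, decimal units assumed

# Fraction of collisions that survive the filters.
selectivity = KEPT_PER_S / COLLISIONS_PER_S
one_in = round(1 / selectivity)

# Average storage rate if the 25 PB were written evenly over a year.
SECONDS_PER_YEAR = 365 * 24 * 3600
avg_mb_per_s = STORED_BYTES_PER_YEAR / SECONDS_PER_YEAR / 1e6

print(f"kept: 1 collision in {one_in:,}")
print(f"average storage rate: {avg_mb_per_s:.0f} MB/s")
```

With these inputs, roughly one collision in six million is kept, and the stored volume still averages close to 800 MB/s over the year.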

Page 43: Introduction to big data

IoT/M2M/IoO

① According to Yole, the Internet of Things will represent 15% of processed data in 2024

② The electronic components market will jump from 9.5 G$ in 2014 to 46 G$ in 2024

Page 44: Introduction to big data
Page 45: Introduction to big data

IoO

In 2020, there will be 80 billion connected things, according to Samuel Ropert from Idate.

IoO alone will account for 85% of IoT, terminals for 11% and M2M for 4%.

Expected annual growth between 2010 and 2020:

IoO: 41%

Terminals: 22%

M2M: 16%
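
The percentages above can be turned into absolute numbers. This sketch assumes that the 80 billion figure is the total across the three categories, and it back-computes a 2010 starting point from the quoted annual growth rates; that 2010 value is an inference from the slide's numbers, not a figure from Idate.

```python
TOTAL_2020 = 80e9  # assumed to be the total for IoO + terminals + M2M

SHARES = {"IoO": 0.85, "terminals": 0.11, "M2M": 0.04}
ANNUAL_GROWTH = {"IoO": 0.41, "terminals": 0.22, "M2M": 0.16}
YEARS = 10  # 2010 to 2020

# Absolute count per category in 2020.
counts_2020 = {name: TOTAL_2020 * share for name, share in SHARES.items()}

# Implied 2010 count if the annual growth rate held for the whole decade.
implied_2010 = {
    name: counts_2020[name] / (1 + ANNUAL_GROWTH[name]) ** YEARS
    for name in SHARES
}

for name in SHARES:
    print(f"{name:10s} 2020: {counts_2020[name] / 1e9:5.1f} G"
          f"   implied 2010: {implied_2010[name] / 1e9:4.2f} G")
```

On these assumptions IoO would represent about 68 billion objects in 2020, starting from roughly 2 billion in 2010.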

Page 46: Introduction to big data

Roadmap of the Internet of things

Page 47: Introduction to big data

MARKET STUDY

Page 48: Introduction to big data
Page 49: Introduction to big data

Big data growth

① Expected annual growth of the big data market for 2011-2016 is 31.7%.

② The market should reach 23.8 G$ in 2016 (source: IDC, March 2013).

③ Big data should represent 8% of European GNP in 2020 (AFDEL, February 2013).

Page 50: Introduction to big data

Risks and opportunities

① How to create value from this flood of data?

② If you don’t do it yourself in your market, the advantage goes to the first mover.

③ Risks for democracy: end of privacy? Dictatorship based on data?

Page 51: Introduction to big data

INFRASTRUCTURE

Page 52: Introduction to big data

Software approach

Traditional

Monolithic

Centralised storage

RDBMS

Data structure/format defined beforehand

Proprietary

Big data

Distributed

Storage and execution at node level

Raw data processing

Open source

Page 53: Introduction to big data

Hardware approach

Traditional hardware

Specific hardware

Big central server

NAS

RAID

Expensive

Hard to upgrade

Big data

Basic hardware

Pizza boxes

Ethernet

JBOD

Inexpensive

Easy to upgrade

Page 54: Introduction to big data

Big data stack (copyright Big Data For Dummies)

Page 55: Introduction to big data

Infrastructure criteria

① Performance

② Availability

③ Scalability

④ Flexibility

⑤ Cost

⑥ Redundancy + resilience

Page 56: Introduction to big data

Storage characteristics

Characteristics     RDBMS         Big Data
Data size           Gigabytes     Petabytes
Access              Interactive   Near real time or batch
Schema/structure    Static        Dynamic
Language            SQL           UQL/Procedural (Java, C++…)
Job scheduling      Hard          Simple
Integrity           High          High
Scaling             Non-linear    Linear

Page 57: Introduction to big data

TOOLS

Page 58: Introduction to big data

Tools

① Hadoop

• MapReduce

② PostgreSQL (www.postgresql.org)

③ R

④ Matlab

⑤ Analyst's Notebook

⑥ Watson

Page 59: Introduction to big data

Hadoop

① Open source

② Fast (parallel processing)

③ Main components

• Distributed file system

• MapReduce engine

Page 60: Introduction to big data

MapReduce (made in Google)

① Map

② Reduce
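
The two phases can be illustrated with the classic word count. This is a minimal single-machine sketch in plain Python, not Hadoop code: the function names are illustrative, and the `shuffle` step stands in for the grouping by key that a real MapReduce framework performs between the two phases across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit one (word, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big deal", "smart data"])
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is where the parallel speed-up comes from.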

Page 61: Introduction to big data

Why R?

① Open

② www.r-project.org

③ www.rstudio.com

Page 62: Introduction to big data

Big data, and after?

① Open data

② Smart data

③ Linked data

Page 63: Introduction to big data

BACK-UP

Page 64: Introduction to big data

The cloud

① Shared resources

② Applications

③ Computing

④ Storage

⑤ Networking

⑥ Development and deployment platforms

Page 65: Introduction to big data

Cloud vocabulary for delivery models

① IaaS : infrastructure as a service

② PaaS : platform as a service

③ SaaS : software as a service

④ And, especially useful for big data, DaaS: data as a service

Page 66: Introduction to big data

Cloud players

① Worldwide

• Google

• Apple

• Microsoft

• Amazon

• Openstack

• Dropbox…

② France

• Cloudwatt

• Numergy

Page 67: Introduction to big data

Improvement

① Data modelling

② Data management

Page 68: Introduction to big data

At stake

31/08/2014 68

Page 69: Introduction to big data

① To further the cause of promoting awareness of the future impact of IoT, let’s answer these three key questions:

• What kind of data are these devices collecting?

• What are the different types of “Things” or categories that are getting connected?

• What are the different use cases that are driving the revenue predictions?