Rob anderson

Transcript of Rob anderson

Page 1: Rob anderson

© Copyright 2010 EMC Corporation. All rights reserved. 1

BIG DATA

IS CHANGING THE WORLD

Page 2: Rob anderson

IN THIS DECADE THE DIGITAL UNIVERSE

WILL GROW 44X FROM 0.9 ZETTABYTES TO 35.2 ZETTABYTES

Source : 2010 IDC Digital Universe Study

Page 3: Rob anderson

90% OF THE DIGITAL UNIVERSE IS

UNSTRUCTURED

Source: 2011 IDC Digital Universe Study

Page 4: Rob anderson

Big Data Has Arrived

Geophysical Exploration

Medical Imaging

Video Surveillance

Mobile Sensors

Video Rendering

Gene Sequencing

Smart Grids

Social Media

Electronic Payments

Page 5: Rob anderson

Deliver Better Healthcare With Big Data

Billion Dollar Specialty Care Service Provider

[Chart. Vertical axis: Quality Of Patient Care]

Legacy System & Traditional Data

New System & Big Data

Treatment Pathways On Summary Data

Treatment Pathways On All The Data

Social & Economic Factors

International Results

Individual Patient History

Page 6: Rob anderson

Increase Profit Margins With Big Data

Retail Banking Firm Aligns Offers To Customers

[Chart. Vertical axis: Customer Profit]

Legacy System & Traditional Data

New System & Big Data

Agent “Best Guess”

Profit-Based Recommendations

User Based Recommendations

Identify “At-Risk” Customers

Page 7: Rob anderson

Classifying and segmenting Big Data

• Rich content stores—original intellectual property or value-added

– Media, VOD, content creation, special effects, satellite imagery, GIS data

• Generated from workflow—must be managed/processed quickly & cheaply

– Manufacturing, simulation, electronic design

• Develop new intellectual property based on big data

– Pharmaceutical companies doing customised drug development

• Companies, public sector, utilities mining data for business advantage

• Some mine consumer data—higher-volume and potentially higher-value

Page 8: Rob anderson

Big Data is File & Unstructured Data

By 2012, 80% of all storage capacity sold will be for file-based data

[Chart. Vertical axis: exabytes, 0 to 90. Horizontal axis: 2009 to 2014. Series: File Based, Block Based]

File Based: 60.7% CAGR

Block Based: 21.8% CAGR

Source: IDC
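
The two growth rates on this slide compound year over year. As a rough illustration, the sketch below shows how much capacity would multiply at each rate. The five-year horizon is an assumption read off the chart's 2009 to 2014 axis, and the function name `growth_multiple` is made up for this example.

```python
def growth_multiple(cagr, years):
    """Total growth factor after compounding `cagr` for `years` years."""
    return (1.0 + cagr) ** years


# CAGR figures as quoted on the slide (source: IDC).
FILE_BASED_CAGR = 0.607
BLOCK_BASED_CAGR = 0.218

# Assumption: 2009 -> 2014 on the chart axis is five compounding periods.
YEARS = 5

file_multiple = growth_multiple(FILE_BASED_CAGR, YEARS)
block_multiple = growth_multiple(BLOCK_BASED_CAGR, YEARS)

print(f"File based:  about {file_multiple:.1f}x over {YEARS} years")
print(f"Block based: about {block_multiple:.1f}x over {YEARS} years")
```

Under that assumption, file-based capacity grows roughly tenfold over the period while block-based capacity grows less than threefold.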

Page 9: Rob anderson

Why is Big Data appearing now?

Source: IDC

Page 10: Rob anderson

Gartner’s 3 V’s of Big Data

Page 11: Rob anderson

“The Internet of Things”

• Massive explosion of smart devices, all sending, receiving, storing data

– Human-oriented devices: handhelds, tablets, cameras

– Non-human-oriented devices: sensors, embedded CPUs

• Social networking messages & data grow exponentially

– Twitter feeds, Facebook updates, LinkedIn messages

• Increasingly, business is conducted digitally – or digitized

• Big Data is global – any source to any target

Page 12: Rob anderson

Source: GoGlobe

Page 13: Rob anderson

Companies want to store big data—Why?

• Google – Originally thought of as “search engine”

– Now: Storing the Internet, storing every search query

• Facebook, Twitter – Just social media?

– Storing every message you send, monitoring every market trend

• Amazon – your every purchase, forever

• Carriers – Storing location-based data on everyone

Page 14: Rob anderson

Social Networking Analysis

Courtesy of NSF Workshop on Social Modeling

Page 15: Rob anderson

The race is on

• Big Data leads to the Optimised Organisation

• Takes a long time to build a functioning data warehouse, analytics tools, connect to business

• Many companies have a head start

• Every CIO needs to consider Big Data in their strategy to stay ahead

– How to manage, how to leverage

Page 16: Rob anderson

A little retailer I once knew

• Why can Amazon beat everyone on price?

• Purchase information used to adjust supply chain

• Shipping and logistics adjusted according to conditions on the ground and supply chain

• Other customers’ information used to provide recommendations, improve experience

• Not just Amazon: Tesco, Carrefour, Metro, etc. all taking advantage

Page 17: Rob anderson

How do we make decisions?

• Good data is hard to get, so often on no data at all

• Often on information from peers, colleagues, reports, or because it’s always been done that way

• Many companies fail because they fail to detect shifts in consumer demand

• Internet has made customers more segmented, and causes customer choice to change faster

Page 18: Rob anderson

Moving to a Data-Driven Model

• Managing with the facts

• Making a science out of data!

• Experimental model, different than BI

• Moving from “gut feel” to rational, scientific decisions

Page 19: Rob anderson

Big-Data-based Decisions

• Unlock value by making information transparent and usable at higher frequency

• More accurate information (e.g. inventories, trends)

• Tailor products more precisely

• Sophisticated analytics makes for better decisions

• Better products (via web feedback, sensors, etc.)

Source: McKinsey

Page 20: Rob anderson

What holds back big data?

• Not ICT: compute & storage getting bigger, cheaper, easier

• Not the quantity of data (see slide 1)

• Not the value: large-scale Big Data projects generally have great ROI

• Real problems are organisational change and talent acquisition

Page 21: Rob anderson

Page 22: Rob anderson

How are people doing it?

• Enterprises ingesting > 1PB data per day within 5 yrs

• Big data is often largely unstructured

• Hadoop is a software framework for storing and analyzing big data

– open source, Java-based

• Big data can mean billions to trillions of files

– Each file can be gigabytes to terabytes in size

• Directed graph analysis, Collaborative Filtering, A/B testing, Association Rule Learning, Classification, Natural Language Processing, Data Mining, Pattern Matching, Sentiment Analysis, Comparative Effectiveness, Clinical Decision Support are examples of big data techniques

• This means petabytes to exabytes of data
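
Of the techniques named on this slide, collaborative filtering is perhaps the easiest to sketch. The toy example below uses a very simple item co-occurrence form of it: count which items are bought together, then recommend the most frequent companions. The basket data, item names, and the `recommend` helper are invented for illustration. Production systems work on far larger data and use more refined similarity measures.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical purchase baskets, one set of items per customer.
baskets = [
    {"book", "lamp", "pen"},
    {"book", "pen"},
    {"lamp", "desk"},
    {"book", "pen", "desk"},
]

# co_counts[a][b] = number of baskets containing both a and b.
co_counts = defaultdict(Counter)
for basket in baskets:
    for a, b in permutations(sorted(basket), 2):
        co_counts[a][b] += 1


def recommend(item, k=2):
    """Return up to k items most often bought together with `item`."""
    # Sort by descending count, then alphabetically to break ties.
    ranked = sorted(co_counts[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


print(recommend("book"))
```

The same counting idea underlies the "other customers' information used to provide recommendations" point made earlier in the deck, although nothing here reflects how any particular retailer implements it.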

Page 23: Rob anderson

How do you manage and design for Big Data?

• Scale and parallelism are the keys

– Big data is far too big to process sequentially

– Too much coming in too quickly

– Example: Banks seeking to process market data more quickly, reducing decision making time from days to minutes

• Answer: Scale-out storage and scale-out processing
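
The scale-out idea can be sketched in miniature: split the input into partitions, process each partition independently, then merge the partial results. The sketch below uses Python threads on one machine purely as a stand-in for separate nodes, so it shows the structure rather than any real speed-up. The sample records and the helper names `count_partition` and `parallel_count` are assumptions made for illustration. This is the general split/process/merge pattern, not Hadoop's actual API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical market-data records in "SYMBOL,price" form.
records = [
    "AAA,10.0", "BBB,20.5", "AAA,10.2", "CCC,5.0",
    "BBB,20.7", "AAA,10.1", "CCC,5.1", "BBB,20.6",
]


def count_partition(partition):
    """Map step: count records per symbol within one partition."""
    return Counter(line.split(",")[0] for line in partition)


def parallel_count(lines, workers=4):
    """Split lines into partitions, count each concurrently, merge the results."""
    # Round-robin split: partition i gets lines i, i + workers, i + 2*workers, ...
    partitions = [lines[i::workers] for i in range(workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Reduce step: merge the partial counts as they come back.
        for partial in pool.map(count_partition, partitions):
            total.update(partial)
    return total


print(parallel_count(records))
```

Because each partition is counted without reference to the others, adding more workers (or, in a real cluster, more nodes) does not change the result, only how the work is spread.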

Page 24: Rob anderson

Cramming big data onto traditional models

Scalability

Performance

Management

Availability

Cost

Storage

Network

Server

Page 25: Rob anderson

A different idea – scale-out

Scalability

Performance

Management

Availability

Cost

Storage

Network

Server

Page 26: Rob anderson

Enterprise Hadoop: Greenplum & Isilon

• Easier and more reliable

– Packaged Hadoop distribution with Isilon storage

• Purpose-built Hadoop infrastructure

– Faster, less risk

• Sharing expertise to address the talent gap

– Architecture, data science, and roadmap services

• Proven at scale with worldwide support

– 24x7 one call Hadoop support from EMC

– Key component of Greenplum UAP

– Unstructured data processing

Page 27: Rob anderson

Increasing Demand for Advanced Analytics

• Complex

– Deep, rich analysis of big data sets

– Ad hoc, interactive analysis, not structured reports

• Timely

– On-going, frequent analysis (e.g. daily, weekly)

– Insights delivered in minutes/seconds

• Actionable

– Forward looking, predictive insight

– Create new business value

Page 28: Rob anderson

EMC Greenplum: Purpose-built for Big Data

• EMC Greenplum is a shared-nothing, massively parallel processing (MPP) data warehouse system

• Core principle of data computing is to move the processing dramatically closer to the data and to the people

Fast Data Loading

Extreme Performance & Elastic Scalability

Unified Data Access

Page 29: Rob anderson

© Copyright 2011 EMC Corporation. All rights reserved. EMC Confidential – NDA Required

MPP Shared-Nothing Architecture

• Greenplum’s Massively Parallel Processing (MPP) Database has extreme scalability on general purpose systems

• Automatic parallelization

– Load and query like any database

• Scan and process in parallel

– Extremely scalable and I/O optimized

• Linear scalability by adding nodes

– Each adds storage, query performance and loading performance

[Diagram labels]

Master Servers: query planning and dispatch

Network Interconnect

Segment Servers: storage and query processing

External Sources: MPP loading, streaming, etc.

MapReduce
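
The division of labour between master and segment servers can be illustrated with a toy model: rows are spread across segments by hashing a distribution key, each segment aggregates only the rows it stores, and the master gathers the results. This is a hedged sketch of the shared-nothing idea, not Greenplum's actual implementation. The segment count, the use of CRC32 as the hash, the sample rows, and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from zlib import crc32

NUM_SEGMENTS = 4  # Assumed number of segment servers.

# Hypothetical fact rows: (customer_id, amount).
rows = [("c1", 10), ("c2", 5), ("c1", 7), ("c3", 2), ("c2", 1), ("c4", 9)]


def segment_for(key):
    """Pick a segment by hashing the distribution key (stable across runs)."""
    return crc32(key.encode("utf-8")) % NUM_SEGMENTS


def load(input_rows):
    """Loading: each row is stored on exactly one segment."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for key, amount in input_rows:
        segments[segment_for(key)].append((key, amount))
    return segments


def segment_sum(local_rows):
    """Segment step: aggregate using only locally stored rows."""
    totals = defaultdict(int)
    for key, amount in local_rows:
        totals[key] += amount
    return dict(totals)


def master_query(segments):
    """Master step: dispatch the aggregate to every segment and gather results."""
    result = {}
    for local_rows in segments:  # Each call could run on a separate server.
        # All rows for a key live on one segment, so keys never collide here.
        result.update(segment_sum(local_rows))
    return result


print(master_query(load(rows)))
```

Because all rows for a given key land on the same segment, a per-key total needs no data exchange between segments, which is one reason adding nodes can add both storage and query capacity.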

Page 30: Rob anderson

EMC Hadoop.

Open Source.

Fully Supported By EMC.

Page 31: Rob anderson

The EMC Big Data “Stack”

Act: Documentum xCP

Analyze: Greenplum, Hadoop

Store: Isilon and Atmos

1. Petabyte Scale

2. Structured & Unstructured

3. Real Time

4. Collaborative

Page 32: Rob anderson

THANK YOU

HAVE A GREAT CONFERENCE!