AnIntroductiontoBigData.pdf

72
8/11/2019 AnIntroductiontoBigData.pdf http://slidepdf.com/reader/full/anintroductiontobigdatapdf 1/72 © 2012 IBM Corporation May 10, 2012 The Big  Deal About Big Data Paul Zikopoulos, BA, MBA Director, IBM Information Management Technical Professionals, WW Competitive Database, and Big Data IBM Certified Advanced Technical Expert (Clusters and DRDA) IBM Certified Customer Solutions Expert (DBA and BI) [email protected] @BigData_paulz

Transcript of AnIntroductiontoBigData.pdf

Page 1: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 1/72

© 2012 IBM CorporationMay 10, 2012

The Big DealAbout

Big Data

Paul Zikopoulos, BA, MBADirector, IBM Information Management Technical Professionals,

WW Competitive Database, and Big Data

IBM Certified Advanced Technical Expert (Clusters and DRDA)

IBM Certified Customer Solutions Expert (DBA and BI)

[email protected]

@BigData_paulz

Page 2: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 2/72

© 2012 IBM Corporation2

Paul C. Zikopoulos, B.A., M.B.A., is theDirector of Technical Professionals for IBM

Software Group’s Information Management

division and additionally leads the World

Wide Database Competitive and Big Data

Technical Sales Acceleration teams. Paul is anaward-winning writer and speaker with more than 18

years of experience in Information Management. Paul has written

more than 350 magazine articles and 15 books including

Understanding Big Data: Analytics for Enterprise Class Hadoop and

Streaming Data, DB2 pureScale: Risk Free Agile Scaling, Break 

Free with DB2 9.7: A Tour of Cost Saving Features; Information onDemand: Introduction to DB2 9.5 New Features; DB2

Fundamentals Certification for Dummies; DB2 for Dummies; and

more. Paul is a DB2 Certified Advanced Technical Expert (DRDA

and Clusters) and a DB2 Certified Solutions Expert (BI and DBA).In his spare time, he enjoys all sorts of sporting activities, including

running with his dog Chachi, avoiding punches in his MMA training,

and trying to figure out the world according to Chloë—his daughter.

You can reach him at: [email protected].

@BigData_paulz

Page 3: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 3/72

© 2012 IBM Corporation3

 !   b  y   t   h  e  e  n

   d  o   f   2   0   1   1 ,

   t   h   i  s  w  a  s  a   b

  o  u   t

   3   0   b   i   l   l   i  o  n  a

  n   d  g  r  o  w   i  n  g

  e  v  e  n   f  a  s   t  e  r

   I  n

   2   0   0   5   t   h  e  r  e  w  e  r  e   1 .   3   b

   i   l   l   i  o  n   R   F   I   D

    t

  a  g  s   i  n  c   i  r  c  u   l  a   t   i  o  n ! 

Page 4: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 4/72

© 2012 IBM Corporation4   I  n

   2   0   0   5   t   h  e  r  e  w  e  r  e   1 .   3   b

   i   l   l   i  o  n   R   F   I   D

    t

  a  g  s   i  n  c   i  r  c  u   l  a   t   i  o  n ! 

Page 5: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 5/72

© 2012 IBM Corporation5

An increasingly sensor-enabled and instrumentedbusiness environment generates HUGE volumes of

data with MACHINE SPEED characteristics!

1 BILLION lines of codeEACH engine generating 10 TB every 30 minutes!

L

H

R

 

J

F

K

:

 

6

4

 

T

B

s

 

Page 6: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 6/72

© 2012 IBM Corporation6

Automatic Temporal and Spatially Enriched Data

"  Star of “Myth Busters”

(Adam Savage) takes aphoto of his car at hishouse using hissmartphone and poststo his Twitter account

"  Over 650,000 people nowknows where he lives!.

Page 7: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 7/72© 2012 IBM Corporation7

Automatic Temporal and Spatially Enriched Data

"

 

In London, whiletravelling from work tohome, a person’s videois typically captures atleast 150 times

This data is used foranalyzing crime andhomeland security(and what else?)

Page 8: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 8/72© 2012 IBM Corporation8

Automatic Temporal and Spatially Enriched Data

Page 9: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 9/72© 2012 IBM Corporation9

The Social Layer in an Instrumented Interconnected World

12+ TBs of tweet data

every day

25+ TBs oflog data

every day

   ?   T   B  s  o   f

   d

  a   t  a  e  v  e  r  y   d  a  y

2+billion 

peopleon the

Web byend 2011

30 billion RFID

tags today(1.3B in 2005) 

4.6

billion cameraphones

worldwide 

100s ofmillions

of GPSenabled  

devicessold

annually 

76 million smartmeters in 2009! 

200M by 2014

Page 10: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 10/72© 2012 IBM Corporation10

Can a Social Media Persona be Monetized?

Page 11: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 11/72© 2012 IBM Corporation11

Can a Social Media Persona be Monetized?

Page 12: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 12/72© 2012 IBM Corporation12

Extract Intent, Life Events, Micro Segmentation Attributes

Monetizable IntentRelocation

Location Wishful Thinking

SPAMbots

Chloe ZHappy b-day Pauline! I can’t believe you and you’reson were born on the same day. April 1st was surely no fool’s day

in ‘75 or for little Sebastian in 2011! Luuuuv you guys…

Tony Pelaski I think @justinbieber deserves his 2 AMAZINGsongs in the top ten! He capital R O C K S …

Paul’s Wife Sooo bored … and angry

Leon K Tired of the other guys, I’m all Android now, but thebattery life. Man, I could fry an egg on this thing; had a 1st degree burn on my ear! Talk about ‘ears burning’

Name, Birthday, Family

Not Relevant - Noise

Not Relevant - NoiseMonetizable Intent

Page 13: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 13/72© 2012 IBM Corporation13

35 ZB(1 ZB = 1B TB)

4Trillion

8GBiPods

1.8 ZB

1 ZB1 ZB=1T GB

Page 14: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 14/72© 2012 IBM Corporation14

Extracting insight from an immense volume, variety and velocity of data,

in context, beyond what was previously possible.

The Big Data Opportunity

Page 15: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 15/72© 2012 IBM Corporation15

Data In MotionAnalyzes and correlates 5M+market messages/sec to execute

algorithmic option trades withaverage latency of 30 micro-secs.

500K/sec, 6B+ IPDRs analyzedper day on more than 4 PBs/yr.

sustaining 1GBps.

Consider: Data that is never stored, never has to besubjected to retention policies: COST SAVINGS

Page 16: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 16/72

Page 17: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 17/72© 2012 IBM Corporation17

Applications for Big Data AnalyticsSmarter Healthcare

Homeland Security Traffic Control Telecom

Multi-channelsales

Trading AnalyticsManufacturing

Finance

Fraud and Risk

Log Analysis

Search Quality

Retail: Churn, NBO

Page 18: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 18/72© 2012 IBM Corporation18

Imagine If You Could Generate Reports Like This! 

0%

10%

20%

30%

40%

50%

60%

70%

80%

J F M A M J J A S O N D

% Carts Complete or Abandonedby Month 2011

% Carts Complete % Carts Abandoned

60% of carts are regularly abandoned

Page 19: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 19/72© 2012 IBM Corporation19

Click Stream Analysis Use Case: Sessionization

Make IPaddress thekey and

timestampand URL the

value.

Map Step

K1 – Ignored, V1 – 1 line of log

201.67.34.67

K3

10:24:48,www.ff.com 

V3Ensuresomething isin cart and

determine lastpage visited.

Reduce Step

K2 V2 

K4

V4

201.67.34.67 | 10:24:48 | www.ff.com | 200

123.99.54.76 | 11:15:21 | www.ff.com/search? | 200

231.67.66.84 | 02:40:33 | www.ff.com/addItem.jsp | 200

101.23.78.92 | 07:09:56 | www.ff.com/shipping.jsp | 200

114.43.89.75 | 01:32:43 | www.ff.com | 200

109.55.55.81 | 09:45:31 | www.ff.com/confirmation.jsp | 200

1 line 

Web log input

Framework

does this

automatically

Shuffle/Sort/Group

201.67.34.67 10:24:48,www.ff.com 

123.99.54.76 11:15:21,www.ff.com/search?

231.67.66.84 02:40:33,www.ff.com/addItem.jsp

101.23.78.92 07:09:56,www.ff.com/shipping.jsp

114.43.89.75 01:32:43,www.ff.com 

109.55.55.81 09:45:31,www.ff.com/confirmation.jsp

10:25:15,www.ff.com/search?

10:26:23,www.ff.com/shipping

201.67.34.67

www.ff.com/shipping

Goal: Determine how many abandoned shopping carts there are and wherethey were abandoned or dropped take rates: Where is “THE LAST MILE?”

Input: Web log data (IP address, timestamp, URL, HTTP return codes)

"  Output: List of IP addresses and last page visited by user –  Sessionized, and attributed with pathing information

Load Log Files Identify Sessions

Identify

Abandoned

Carts

(1) Add

Context

X

(2) Aggregate Report

Page 20: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 20/72© 2012 IBM Corporation20

Click Stream Analysis Use Case: Sessionization

Make IPaddress thekey and

timestampand URL the

value.

Map Step

K1 – Ignored, V1 – 1 line of log

201.67.34.67

K3

10:24:48,www.ff.com 

V3Ensuresomething isin cart and

determine lastpage visited.

Reduce Step

K2 V2 

K4

V4

201.67.34.67 | 10:24:48 | www.ff.com | 200

123.99.54.76 | 11:15:21 | www.ff.com/search? | 200

231.67.66.84 | 02:40:33 | www.ff.com/addItem.jsp | 200

101.23.78.92 | 07:09:56 | www.ff.com/shipping.jsp | 200

114.43.89.75 | 01:32:43 | www.ff.com | 200

109.55.55.81 | 09:45:31 | www.ff.com/confirmation.jsp | 200

1 line 

Web log input

Framework

does this

automatically

Shuffle/Sort/Group

201.67.34.67 10:24:48,www.ff.com 

123.99.54.76 11:15:21,www.ff.com/search?

231.67.66.84 02:40:33,www.ff.com/addItem.jsp

101.23.78.92 07:09:56,www.ff.com/shipping.jsp

114.43.89.75 01:32:43,www.ff.com 

109.55.55.81 09:45:31,www.ff.com/confirmation.jsp

10:25:15,www.ff.com/search?

10:26:23,www.ff.com/shipping

201.67.34.67

www.ff.com/shipping

Goal: Determine how many abandoned shopping carts there are and wherethey were abandoned or dropped take rates: Where is “THE LAST MILE?”

Input: Web log data (IP address, timestamp, URL, HTTP return codes)

"  Output: List of IP addresses and last page visited by user –  Sessionized, and attributed with pathing information

Load Log Files Identify Sessions

Identify

Abandoned

Carts

(1) Add

Context

X

(2) Aggregate Report

Page 21: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 21/72© 2012 IBM Corporation21

Click Stream Analysis Use Case: Sessionization

Make IPaddress thekey and

timestampand URL the

value.

Map Step

K1 – Ignored, V1 – 1 line of log

201.67.34.67

K3

10:24:48,www.ff.com 

V3Ensuresomething isin cart and

determine lastpage visited.

Reduce Step

K2 V2 

K4

V4

201.67.34.67 | 10:24:48 | www.ff.com | 200

123.99.54.76 | 11:15:21 | www.ff.com/search? | 200

231.67.66.84 | 02:40:33 | www.ff.com/addItem.jsp | 200

101.23.78.92 | 07:09:56 | www.ff.com/shipping.jsp | 200

114.43.89.75 | 01:32:43 | www.ff.com | 200

109.55.55.81 | 09:45:31 | www.ff.com/confirmation.jsp | 200

1 line 

Web log input

Framework

does this

automatically

Shuffle/Sort/Group

201.67.34.67 10:24:48,www.ff.com 

123.99.54.76 11:15:21,www.ff.com/search?

231.67.66.84 02:40:33,www.ff.com/addItem.jsp

101.23.78.92 07:09:56,www.ff.com/shipping.jsp

114.43.89.75 01:32:43,www.ff.com 

109.55.55.81 09:45:31,www.ff.com/confirmation.jsp

10:25:15,www.ff.com/search?

10:26:23,www.ff.com/shipping

201.67.34.67

www.ff.com/shipping

Goal: Determine how many abandoned shopping carts there are and wherethey were abandoned or dropped take rates: Where is “THE LAST MILE?”

Input: Web log data (IP address, timestamp, URL, HTTP return codes)

"  Output: List of IP addresses and last page visited by user –  Sessionized, and attributed with pathing information

Page 22: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 22/72

© 2012 IBM Corporation22

Jun 29 04:03:37 192.168.190.24 nagios: GLOBAL SERVICE EVENT HANDLER: ast-oral- 

vip; remedy_oracle_db_latch-contention; (null); (null); (null); handle-all-

critical-event 

Jun 29 04:03:57 192.168.190.24 nagios: SERVICE ALERT: ast-oral- 

vip;remedy_oracle_db_soft-prase-ratio;CRITICAL;HARD;3;CRITICAL – Soft parse

ratio 

86.63% 

Jun 29 07:50:57 192.168.190.24 nagios: SERVICE ALERT: ast-oral 

vip;remedy_oracle_db_latch-waiting; OK;SOFT;2;OK – SGA latch xssinfo freelist

(#350) 

sleeping 0.000000% of the time 

Jun 29 07:50:57 192.168.190.24 nagios: GLOBAL SERVICE EVENT HANDLER: ast-oral- 

vip;remedy_oracle_db_latch-waiting; (null); (null); (null); handle-all-

critical-event 

A Simple Log FileDATE TIME IP ADDRESS MONITOR

TYPE

COMPONENT

Page 23: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 23/72

© 2012 IBM Corporation23

Page 24: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 24/72

© 2012 IBM Corporation24

Analytic Approaches

BusinessUsers

DetermineQuestions

Traditional AnalyticsStructured & Repeatable

IT TeamBuilds System

To AnswerKnown Questions

IT TeamDelivers DataOn Flexible

Platform

Big Data AnalyticsIterative & Exploratory

BusinessUsers

Explore andAsk Any Question

Page 25: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 25/72

© 2012 IBM Corporation25

Why Didn’t We Use All of the Big Data Before?

“ IBM  gave us an opportunity to turn our plans into something that wasvery tangible right from the beginning . IBM had experts within data mining ,Big Data, and Apache Hadoop and it was clear to use from the beginning we

wanted to improve our business, not only today, but also prepare for thechallenges we will face in three to five years, we had to go with IBM .”

 – Lars Christian Christensen VP Plant Siting & Forecasting

Page 26: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 26/72

© 2012 IBM Corporation26Watson’s advanced analytic capabilities can sort through the equivalent of200 MILLION pages of data to uncover an answer in 3 SECONDS.

Page 27: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 27/72

© 2012 IBM Corporation27

IBM Big Data Strategy: Move the Analytics Closer to the Data

New analytic applications drivethe requirements for a big dataplatform –

 

Integrate and manage the fullvariety, velocity and volume of data

 –  Apply advanced analytics to

information in its native form –  Visualize all available data for ad-

hoc analysis –

 

Development environment for

building new analytic applications –  Workload optimization and

scheduling

 – 

Security and Governance

BI /Reportin

BI /Reporting

Exploration /Visualization

FunctionalApp

IndustryApp

PredictiveAnalytics

ContentAnalytics

Analytic Applications

IBM Big Data Platform

Systems

Management

Application

Development

Visualization

& Discovery

Accelerators

Information Integration & Governance

HadoopSystem

StreamComputing

DataWarehouse

HadoopSystem

StreamComputing

Application

Development

Visualization

& Discovery

Page 28: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 28/72

© 2012 IBM Corporation28

IBM’s Big Data Platform Rounds Out Open Source

BI /Repor ting

 

BI / Reporting Exploration /Visualization

FunctionalApp

IndustryApp

PredictiveAnalytics

ContentAnalytics

Analytic Applications

IBM Big Data PlatformSystems

ManagementApplication

DevelopmentVisualization& Discovery

Accelerators

Information Integration & Governance

StreamComputing

DataWarehouse

HadoopSystem

Component BigInsights

1.3 

ClouderaCDH3 

MapReduceand HDFS

0.20.2+(923 patches)

HBase 0.90.4+

Zookeeper 3.3.4+

Oozie 2.3.1

Lucene 3.3.0

Hive 0.8.0

Pig 0.8.1

InfoSphere BigInsights

Administration & Security

Workload Optimization

Connectors

IBMcertified

Apache Hadoopor

   O  p  e  n  s  o  u  r  c  e

  -   b  a  s  e   d

  c  o  m  p  o  n  e

  n   t  s

   E  n   t  e  r

  p  r   i  s  e   C  a  p  a   b   i   l   i   t   i  e  s Advanced Engines

Indexing

or! 

Page 29: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 29/72

© 2012 IBM Corporation29

IBM Big Data Platform

InfoSphere BigInsightsHadoop-based low latency

analytics for variety and volume

Data-At-Rest

Netezza AppliancesHigh capacity for Queryable Archive

BI+Ad Hoc Analytics

Smart Analytics SystemOperational Analytics on

Structured Data

InfoSphere WarehouseLarge volume structured

data analytics

InfoSphere StreamsLow Latency Analytics for

streaming data

Velocity, Variety & Volume

Data-In-MotionMPP Data Warehouse

MPP StreamComputing

MPPInformationIntegration

MPP Hadoop

InfoSphere InformationServer

High volume data integrationand transformation

Page 30: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 30/72

© 2012 IBM Corporation30

Complementary Analytics

Traditional ApproachStructured, analytical, logical

New ApproachCreative, holistic thought, intuition

StructuredRepeatable

Linear

Monthly sales reports

Profitability analysis

Customer surveys

Internal App Data

Data

Warehouse

Traditional

Sources

Structured

Repeatable

Linear

Transaction Data

ERP data

Mainframe Data

OLTP System Data

UnstructuredExploratoryIterative

Brand sentiment

Product strategy

Maximum asset utilization

Hadoop

Streams

New

Sources

Unstructured

Exploratory

Iterative

Web Logs

Social Data

Text Data: emails

Sensor data: images

RFID

EnterpriseIntegration

Page 31: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 31/72

© 2012 IBM Corporation31

Data Governance

Separation of Duties

 – 

Users can’t delete their audit logs

"  Separation of Duties

 – 

No authorized access

Privilege Users  –  Monitoring authorized and privilege

user activities

Privilege Users  –  Monitoring authorized and privilege

user activities

Sensitive Data (in the file system) –  Protecting and blocking unauthorized

access to the system and the data

Sensitive Data (in tables) –  Protecting and blocking unauthorized

access to the system and the data

"  Workflow for audit and compliancecontrols to validate proper security

procedures (satisfies regulations!)

"  Workflow for audit and compliancecontrols to validate proper security

procedures (satisfies regulations!)Internal Use Only

 S o m e t i m e s

  t e r m e d 

 “ S c h e m a  F i

 r s t ” 

S o m e t i m e s  t e r m e d  “ S c he m a  La t e r ”  

Page 32: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 32/72

© 2012 IBM Corporation32

Data Security Design Goals

Minimal impact to the Database 

server resources"  No Database configuration changes" 

Separation of Duties: Preventingprivilege users from doing maliciousactivities

Audit trail with very granular details

of Database activity"  Real-time alerting and blocking

Minimal impact to the network

100% Database activity visibility"  Heterogeneous support 

Minimal impact to the Big Data

server resources"  No Big Data configuration changes 

Separation of Duties: Preventingprivilege users/jobs from doingmalicious activities

Audit trail with very granular details

of Big Data activity"  Real-time alerting and blocking

Minimal impact to the network

100% Big Data activity visibility"  Heterogeneous support 

Internal Use Only

 S o m e t i m e s

  t e r m e d 

 “ S c h e m a  F i

 r s t ” 

S o m e t i m e s  t e r m e d  “ S c he m a  La t e r ”  

Page 33: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 33/72

© 2012 IBM Corporation33

IBM Flattens the Time to Value Big Data Curve

+   + + +   +

HadoopHarden

File SystemETL Dev. Tooling

Text/MLAnalytics

Visualization

+

Velocity

Hadoop

Harden

File System ETL Dev. ToolingText/MLAnalytics Visualization

+   +

 OR + +

Velocity

+

100+Big DataBusiness PartnersSigned

Page 34: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 34/72

© 2012 IBM Corporation34

First Ever Forrester Wave on Big Data

“   IBM has the deepest

Hadoop platform and

application portfolio. IBM,

an established EDW vendor,has its own Hadoop

distribution; an extensive

 professional services force

working on Hadoop projects;

extensive R&D programsdeveloping Hadoop

technologies; connections to

Hadoop from its EDW.”     –The Forrester Wave™: Enterprise

Hadoop Solutions, 1Q12

Page 35: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 35/72

© 2012 IBM CorporationMay 10, 2012

Open SourceTechnology Behind

Big Data

Page 36: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 36/72

© 2012 IBM Corporation36

A Difference of Processing Models

SETI@home is a Computational Processing Model – 

Service for Extraterrestrial Intelligence (SETI) uses

unused desktop CPU processing power to perform

wide-spread analysis of radio telescope data

 – 

Pushes data to the program for processing

 – 

Data to function"  MapReduce is a Data Processing Model

 – 

Data processing primitives of Mappers and Reducers

 – 

Can be complex to write MapReduce programs, but a

very simple configuration change makes it scale to

1000s of nodes

 – Function to data

Page 37: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 37/72

© 2012 IBM Corporation37

Building Blocks of Hadoop

Page 38: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 38/72

© 2012 IBM Corporation38

Hadoop Framework

"  Hadoop Common –

 

 A utility layer that provides access to the HDFS and projects

"  HDFS –

 

Data storage platform for Hadoop framework

 – 

Can scale to massive size when distributed over

multiple computers

"  MapReduce –

 

Framework to process data across a node clusters

 – MAP process splits work by first mapping the input

across the control nodes of the cluster, then splittingthe workload into even smaller data sets and distributing

it further throughout the computing cluster•  Allows MPP – think DB2 DPF ‘like’

 – REDUCE collects and combines the nodes’ answers to deliver a result

Page 39: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 39/72

© 2012 IBM Corporation39

Workload Optimization

Optimized performance for big data analytic workloads 

Task  Map

(break task into small

parts)

Adaptive Map

(optimization —

order small units of work)

Reduce

(many results to a

single result set)

Adaptive MapReduce

"  Algorithm to optimize execution time ofmultiple small jobs

"  Performance gains of 30% reduceoverhead of task startup

Hadoop System Scheduler

"  Identifies small and large jobs fromprior experience

"  Sequences work to reduce overhead

Page 40: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 40/72

© 2012 IBM Corporation40

Growing the Hadoop Environment

"  Avro: Data serialization system that convertsdata into a fast, compact binary data format –

 

 Avro data can be stored in a file with versioned schemas

"  Chukwa: Large-scale monitoring system that providesinsights into the Hadoop distributed file system

and MapReduce

HBase is a scalable, column-oriented distributeddatabase modeled after Google's BigTable distributedstorage system

Hive is a data warehouse infrastructure that providesad hoc query and data summarization for HadoopSupported data –

 

Hive utilizes SQL-like query language call HiveQL –  HiveQL also used by programmers to run custom MapReduce jobs. 

Page 41: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 41/72

© 2012 IBM Corporation41

Growing the Hadoop Environment

Mahout is a data mining library designed to work on

the Hadoop framework –  Mahout delivers a core set of algorithms designed for clustering,

classification and batch-based filtering.

Pig is a high-level programming language and executionframework for parallel computation –

 

Pig works within the Hadoop and MapReduce frameworks

ZooKeeper provides coordination, configuration andgroup services for distributed applications working overthe Hadoop stack

Page 42: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 42/72

© 2012 IBM Corporation42

Committed to Open Source

"  Decade of lineage and contributions

to the open source community – 

 Apache Hadoop and Jaql, Apache Derby, Apache Geronimo, Apache Jakarta, +++

 –  Eclipse: founded by IBM –  Significant Lucene contributions via IBM

Lucene Extension Library (ILEL) –  DRDA, XQuery, SQL, XML4J, XERCES,

HTTP, Java, Linux, +++

IBM products built on open source –

 

WebSphere: Apache –

 

Rational: Eclipse and Apache –

 

InfoSphere: Eclipse and Apache, +++

IBM’s BigInsights (Hadoop) is 100%open source compatible withno forks

Page 43: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 43/72

© 2012 IBM CorporationMay 10, 2012

Why IBMfor Big DataThe Solution Side

Paul Zikopoulos, BA, MBADirector, IBM Information Management Technical Professionals,

WW Competitive Database, and Big Data

IBM Certified Advanced Technical Expert (Clusters and DRDA)

IBM Certified Customer Solutions Expert (DBA and BI)

[email protected]

Page 44: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 44/72

© 2012 IBM Corporation44

10100101010011100101001111001000100100010010001000100101

Living Model Management and Frontier Anlaytics

010110011000111010010010010011100010010100100101100100101001100100101001001010100010010

01100100101001001010100010010

1100010010100100101100100101001100100101001001010100010010

01100100101001001010100010010

   O  p  p  o  r   t  u  n   i   t  y   C  o  s

   t   S   t  a  r   t  s   H  e  r  e

01100100101001001010100010010

01100100101001001010100010010

11000100101001001011001001010

01100100101001001010100010010

01100100101001001010100010010

01100100101001001010100010010

01100100101001001010100010010

011001001010010010101000100101100010010100100101100100101001100100101001001010100010010

01100100101001001010100010010

01100100101001001010100010010

11000100101001001011001001010

Adaptive

AnalyticsModel

   F  o  r  e  c  a  s   t

   N

  o  w  c  a  s   t

   F  o  r  e  c  a  s   t

Bootstrap 

Enrich

Data Ingest

Page 45: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 45/72

© 2012 IBM Corporation45

Data In Motion

"  Hear what’s going on miles away to optimize

perimeter displacements

Perspective: Try to find the word “Zero” in a1000 MP3 song library in a fraction of a second –  Figure out the difference between the sound of a

human whisper and the wind

Cisco turns to IBM big

Page 46: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 46/72

© 2012 IBM Corporation4646

gdata for intelligent

infrastructuremanagement

Optimize building energyconsumption with centralizedmonitoring and control ofbuilding monitoring system

Automates preventive andcorrective maintenance ofbuilding corrective systems

Uses Streams, InfoSphere

BigInsights and Cognos-  Log Analytics

-  Energy Bill Forecasting

-  Energy consumption optimization

-  Detection of anomalous usage

-  Presence-aware energy mgt.

-  Policy enforcement

Page 47: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 47/72

© 2012 IBM Corporation47

BigInsights and Cisco: Energy Usage Dashboard

Page 48: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 48/72

© 2012 IBM Corporation48

End-user Visualization Development Environment

Gather

Crawl – gather statistically Adapter–gather dynamically

Extract

Document-level infoCleanse, normalize

I   t   er  a t   e

 Familiar coding

and tooling

environment,testing, and

optimization

Explore

 Analyze, annotate, filterVisualize results

Page 49: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 49/72

© 2012 IBM Corporation49

!"#$%&"'()%&* !,- .)/) )0)1234% /##1 5#" (6%,0&%% 6%&"% 

!,- .)/) 78)11&0-&%9 

"  Business users need a noprogramming approach foranalyzing Big Data

Extremely difficult to findactionable business insights indata from multiple sources withdifferent formats

"  Translating untapped data intoactionable business insights isa common requirement thatrequires visualization

What is BigSheets?

:#$ 4)0 !,-;8&&/% 8&1<= 

"  Spreadsheet-like discovery interface letsbusiness users easily analyze Big Data

with ZERO PROGRAMMING

BUILT-IN “readers” can work

with data in several common formats –  JSON arrays, CSV, TSV, Web, Hive, Line

Reader, Nutch Web Crawler Output, +++

"  Users can VISUALLY combine and

explore various types of data to identifyhidden

 

insights

Page 50: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 50/72

© 2012 IBM Corporation50

Page 51: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 51/72

© 2012 IBM Corporation51

Big Data Made Easy for the Little Guy

"  USC’s Film Forecaster correctly

predicted a clamor for "Hangover 2”that resulted in $100 million openingover Memorial Day weekend –  Looked at 250K-500K Tweets and broke

down positive and negative messages using

a lexicon of 1700 words

The Film Forecaster sounds like a bigundertaking for USC, but it really came

down to one communications masters

student who learned Big Sheets ina day , then pulled in the tweets and

analyzed them - Ryan Kim

Even though relatively still immature

in terms of product release cycles,

BigSheets has an impressiverange of functionality , which is

only set to expand as it is further

developed and integrated withBigInsights – Krishna Roy

Page 52: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 52/72

© 2012 IBM Corporation52

Running Applications from the Web Console

Page 53: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 53/72

© 2012 IBM Corporation53

A Rich Management Big Data Tool

Page 54: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 54/72

© 2012 IBM Corporation54

Rich Big Data Development Environment

"  Eclipse base Development tools

 – 

For JQAL, Hive, PIG, JavaMapReduce and Text Analytics

Page 55: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 55/72

© 2012 IBM Corporation55

Without a Big Data Platform You Code! 

IBM Big Data Platform

Streams provides development, deployment,runtime, and infrastructure services

 

“TerraEchos developers can deliver

applications 45% faster due to the agility

of Streams Processing Language!” – Alex Philip, CEO and President

Multithreading

Custom SQLand

Scripts

Performance

Optimization

Debug 

ApplicationManagement

EventHandling

Connectors

CheckPointing

Security

HA  Acceleratorsand

Toolkits

Over 100 sample applications and

toolkits with industry focusedtoolkits with 300+ functions and

operators! 

Page 56: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 56/72

© 2012 IBM Corporation56

Streams Runtime Illustrated

X86 Host X86 Host

Meters

CompanyFilter

UsageModel

X86 Host

Usage

Contract

X86 Host X86 Host

TextExtract

Temp Action

Season Adjust

TextExtract

DegreeHistory

CompareHistory Store

History

Meters

Daily Adjust

Page 57: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 57/72

© 2012 IBM Corporation57

How Text Analytics Works

Football World Cup 2010, one team distinguished

themselves well, losing to the eventual champions1-0 in the Final. Early in the second half,Netherlands’ striker,  Arjen Robben, had a breakaway,but the keeper for  Spain, Iker Casilas made the save. Winger  Andres Iniesta scored for Spain for the win.

Netherlands Striker  Arjen Robben 

Keeper  Spain Iker Casilas  Winger  Andres Iniesta  Spain 

World Cup 2010 Highlights

Page 58: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 58/72

© 2012 IBM Corporation58

Text Analytic Toolkit Example

Semantically search Enron’s emails to support queries such as

“Find Tom’s phone” in order to find his actual phone number –  As opposed to emails containing words Tom  and Phone 

Need to be able to accurately extract email entities of type Person and Phone and the relationship between them

Page 59: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 59/72

© 2012 IBM Corporation59

The Enron Email Example

Start with a naïve set of rules to identify a PersonPhone relationship 

built using some sort of editor –  Build out an extractor that know how to find a person’s name: Lorraine Smith  –

 

Build out an extractor that knows how to find a phone number: 607-205-4493

607.205.4493   Euro spacing 

xt. ext. …  North American spacing 

+0119054132112   Globalstar GSP-1600 

911, 411 

+++ 

Page 60: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 60/72

© 2012 IBM Corporation60

The Tricky Thing About Sentiment! 

Page 61: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 61/72

© 2012 IBM Corporation61

Accelerating Analytics – Explainability" 

Every annotation’s provenance can be visualized" 

Every annotation’s provenance can be visualized

Provenance of Morgan Stanley Fax: 205-4493 shows theFullName rule is responsible for generating incorrect annotation  –  Solutions?

• 

Remove Morgan Stanley from FullName dictionary?• 

Create new dictionary for CompanyName?

Text Analytics Toolkit

Page 62: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 62/72

© 2012 IBM Corporation62

Text Analytics Toolkit Provenance Viewer"  Major challenge for analysts is determining the lineage of changes

that have been applied to text

 – 

REALLY difficult to diskern which extractors need to be adjusted to tweak theresulting annotations

"  Provenance Viewer for interactive visualizations to display outputannotations for regression, debugging, version enhancements, +++ –  Reduce development of extractors by days to weeks

Page 63: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 63/72

© 2012 IBM Corporation63

Accelerating Analytics – discovery and Pattern Matching"  Contextual Clues discoverer cluster the context surrounding annotation in

order to detect frequently occurring patterns

"  Illustrate clustering results between incorrect PersonPhone pairs –  Visualize frequent occurrences (bubble size) of clues: i.e. FAX and TELEFAX

 –  Developer improves precision of annotator by adding a rule that filters out PersonPhone 

pars if the Phone is preceded by a FAX clue

 –  Developer finds call at text as key clue for rule extraction for PersonPhone

Text Analytics Toolkit

Page 64: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 64/72

© 2012 IBM Corporation64

Accelerating Analytics – Assisted Development

"  IDE that exposes sophisticated techniques to assist rule developers

throughout all states of the development cycle –  Promotes agile development because it unifies the business analyst and the

developer with a common toolset which fosters understanding

•  Business Analyst helps mature and validates extractor results•

 

Developer understands visually that AQL enhancements needed to refine rules

Text Analytics Toolkit

Page 65: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 65/72

© 2012 IBM Corporation65

Export Reports to Business Users

Page 66: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 66/72

© 2012 IBM Corporation66

BigInsights Text Analytics Development

Hyperlink Navigation

ProceduralHelp

Outline View

Content AssistDifference

Viewer

Design Time Validation

Context Sensitive Help

Regular Expression Builder

Page 67: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 67/72

© 2012 IBM Corporation67

Text Analytics Toolkit – Development Accelerator" 

Eclipse plug-ins enhance analyst productivity –  When writing AQL code, the editor features syntax highlighting, and automatic

detection of syntax error at design time not runtime

 –  Pre-built extractors and sample test tools, +++

Reduce coding time and debugging by 30-50%+!

Page 68: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 68/72

© 2012 IBM Corporation68

Did anyone ever ask if the answer

was

Page 69: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 69/72

© 2012 IBM Corporation69

Business Task: Find the Pictures that Have Cats

Page 70: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 70/72

© 2012 IBM Corporation70

BUT! Your Application Returns the Following! 

PRECISION(a measure of exactness)

2/4 

RECALL 

(a measure of coverage)

2/5 

Page 71: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 71/72

© 2012 IBM Corporation71

Learn Hadoop: At Your Place, At Your PaceMaking Learning Hadoop Easy and Fun

Flexible on-line delivery allowslearning @your place and@your pace

"

 

Free courses, free study materials

Cloud-based sandbox forexercises – zero setup

10000+ registered students

@BigData paulz

Page 72: AnIntroductiontoBigData.pdf

8/11/2019 AnIntroductiontoBigData.pdf

http://slidepdf.com/reader/full/anintroductiontobigdatapdf 72/72

THINK

PDF http://tinyurl.com/88auawg

Hardcopy http://tinyurl.com/7ef9ubs

@ g _p