8/11/2019 AnIntroductiontoBigData.pdf
http://slidepdf.com/reader/full/anintroductiontobigdatapdf 1/72
© 2012 IBM Corporation, May 10, 2012
The Big Deal About Big Data
Paul Zikopoulos, BA, MBA
Director, IBM Information Management Technical Professionals,
WW Competitive Database, and Big Data
IBM Certified Advanced Technical Expert (Clusters and DRDA)
IBM Certified Customer Solutions Expert (DBA and BI)
@BigData_paulz
Paul C. Zikopoulos, B.A., M.B.A., is the Director of Technical Professionals for IBM Software Group’s Information Management division and additionally leads the World Wide Database Competitive and Big Data Technical Sales Acceleration teams. Paul is an award-winning writer and speaker with more than 18 years of experience in Information Management. Paul has written more than 350 magazine articles and 15 books including Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data; DB2 pureScale: Risk Free Agile Scaling; Break Free with DB2 9.7: A Tour of Cost Saving Features; Information on Demand: Introduction to DB2 9.5 New Features; DB2 Fundamentals Certification for Dummies; DB2 for Dummies; and more. Paul is a DB2 Certified Advanced Technical Expert (DRDA and Clusters) and a DB2 Certified Solutions Expert (BI and DBA). In his spare time, he enjoys all sorts of sporting activities, including running with his dog Chachi, avoiding punches in his MMA training, and trying to figure out the world according to Chloë, his daughter.
You can reach him at: [email protected].
@BigData_paulz
In 2005 there were 1.3 billion RFID tags in circulation. By the end of 2011, this was about 30 billion and growing even faster.
An increasingly sensor-enabled and instrumented business environment generates HUGE volumes of data with MACHINE SPEED characteristics!
1 BILLION lines of code
EACH engine generating 10 TB every 30 minutes!
LHR JFK: 64 TBs
Automatic Temporal and Spatially Enriched Data
• Star of “MythBusters” (Adam Savage) takes a photo of his car at his house using his smartphone and posts it to his Twitter account
• Over 650,000 people now know where he lives!
Automatic Temporal and Spatially Enriched Data
• In London, while travelling from work to home, a person is typically captured on video at least 150 times
• This data is used for analyzing crime and homeland security (and what else?)
The Social Layer in an Instrumented Interconnected World
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 2+ billion people on the Web by end 2011
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009! 200M by 2014
Can a Social Media Persona be Monetized?
Extract Intent, Life Events, Micro Segmentation Attributes
Annotation labels: Monetizable Intent, Relocation, Location, Wishful Thinking, SPAMbots, Name, Birthday, Family, Not Relevant - Noise

Chloe Z: Happy b-day Pauline! I can’t believe you and you’re son were born on the same day. April 1st was surely no fool’s day in ‘75 or for little Sebastian in 2011! Luuuuv you guys…
Tony Pelaski: I think @justinbieber deserves his 2 AMAZING songs in the top ten! He capital R O C K S …
Paul’s Wife: Sooo bored … and angry
Leon K: Tired of the other guys, I’m all Android now, but the battery life. Man, I could fry an egg on this thing; had a 1st degree burn on my ear! Talk about ‘ears burning’
35 ZB (1 ZB = 1B TB)
4 trillion 8 GB iPods
1.8 ZB
1 ZB (1 ZB = 1T GB)
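The unit relationships on this slide can be checked with a few lines of arithmetic. This is only a sketch, assuming decimal (SI) units, which the slide does not state explicitly; the variable names are illustrative.

```python
# Decimal (SI) byte units; an assumption, the slide does not say SI vs binary.
GB = 10**9
TB = 10**12
ZB = 10**21

zb_in_tb = ZB // TB   # how many terabytes fit in one zettabyte
zb_in_gb = ZB // GB   # how many gigabytes fit in one zettabyte

# How many 8 GB devices would be needed to hold 35 ZB.
ipods_for_35_zb = 35 * ZB / (8 * GB)

print(f"1 ZB = {zb_in_tb:,} TB")
print(f"1 ZB = {zb_in_gb:,} GB")
print(f"35 ZB = {ipods_for_35_zb:,.0f} 8 GB devices")
```

Under that assumption, 1 ZB is one billion TB or one trillion GB, and 35 ZB works out to a little over four trillion 8 GB devices, which agrees with the figures quoted above.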
Extracting insight from an immense volume, variety and velocity of data,
in context, beyond what was previously possible.
The Big Data Opportunity
Data In Motion
Analyzes and correlates 5M+ market messages/sec to execute algorithmic option trades with average latency of 30 micro-secs.
500K/sec, 6B+ IPDRs analyzed per day on more than 4 PBs/yr, sustaining 1GBps.
Consider: Data that is never stored never has to be subjected to retention policies: COST SAVINGS
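To make the "never stored" idea concrete, here is a minimal sketch of analyzing values as they arrive while keeping only a running aggregate in memory. It is a generic illustration, not how InfoSphere Streams is implemented; the function name `rolling_mean` and the sample readings are hypothetical.

```python
def rolling_mean(stream):
    """Yield the running mean after each item, keeping O(1) state."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: no history of past values is retained.
        mean += (value - mean) / count
        yield mean


# Hypothetical readings arriving one at a time.
readings = iter([10, 20, 30])
for current in rolling_mean(readings):
    print(current)
```

Each reading is consumed once and then discarded, so nothing accumulates that a retention policy would have to cover.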
Applications for Big Data Analytics
• Smarter Healthcare
• Homeland Security
• Traffic Control
• Telecom
• Multi-channel sales
• Trading Analytics
• Manufacturing
• Finance
• Fraud and Risk
• Log Analysis
• Search Quality
• Retail: Churn, NBO
Imagine If You Could Generate Reports Like This!
[Chart: % Carts Complete or Abandoned by Month, 2011. Series: % Carts Complete, % Carts Abandoned. X-axis: months J through D. Y-axis: 0% to 80%.]
60% of carts are regularly abandoned
Click Stream Analysis Use Case: Sessionization
• Goal: Determine how many abandoned shopping carts there are and where they were abandoned or dropped take rates: Where is “THE LAST MILE?”
• Input: Web log data (IP address, timestamp, URL, HTTP return codes)
• Output: List of IP addresses and last page visited by user – Sessionized, and attributed with pathing information

Web log input (1 line per record)
201.67.34.67 | 10:24:48 | www.ff.com | 200
123.99.54.76 | 11:15:21 | www.ff.com/search? | 200
231.67.66.84 | 02:40:33 | www.ff.com/addItem.jsp | 200
101.23.78.92 | 07:09:56 | www.ff.com/shipping.jsp | 200
114.43.89.75 | 01:32:43 | www.ff.com | 200
109.55.55.81 | 09:45:31 | www.ff.com/confirmation.jsp | 200

Map Step (K1 – Ignored, V1 – 1 line of log)
Make IP address the key and timestamp and URL the value.
K2 V2
201.67.34.67 10:24:48,www.ff.com
123.99.54.76 11:15:21,www.ff.com/search?
231.67.66.84 02:40:33,www.ff.com/addItem.jsp
101.23.78.92 07:09:56,www.ff.com/shipping.jsp
114.43.89.75 01:32:43,www.ff.com
109.55.55.81 09:45:31,www.ff.com/confirmation.jsp

Shuffle/Sort/Group (Framework does this automatically)
K3: 201.67.34.67
V3: 10:24:48,www.ff.com
    10:25:15,www.ff.com/search?
    10:26:23,www.ff.com/shipping

Reduce Step
Ensure something is in cart and determine last page visited.
K4: 201.67.34.67
V4: www.ff.com/shipping

Pipeline: Load Log Files, Identify Sessions, Identify Abandoned Carts, (1) Add Context, (2) Aggregate Report
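The map, shuffle and reduce steps on this slide can be sketched in plain Python. This is a single-process illustration rather than Hadoop code. The sample log lines, the function names, and the rule that "something is in cart" means an `addItem` page was visited with no later `confirmation` page are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample in the slide's "IP | time | URL | status" layout.
LOG_LINES = [
    "201.67.34.67 | 10:24:48 | www.ff.com | 200",
    "201.67.34.67 | 10:25:15 | www.ff.com/addItem.jsp | 200",
    "201.67.34.67 | 10:26:23 | www.ff.com/shipping.jsp | 200",
    "109.55.55.81 | 09:40:02 | www.ff.com/addItem.jsp | 200",
    "109.55.55.81 | 09:45:31 | www.ff.com/confirmation.jsp | 200",
    "114.43.89.75 | 01:32:43 | www.ff.com | 200",
]


def map_step(line):
    """Emit (IP address, (timestamp, URL)) for one log line."""
    ip, timestamp, url, _status = [field.strip() for field in line.split("|")]
    return ip, (timestamp, url)


def shuffle(pairs):
    """Group values by key and sort each group by timestamp."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in groups.items()}


def reduce_step(ip, visits):
    """Return (IP, last page) if a cart was started but never confirmed."""
    urls = [url for _timestamp, url in visits]
    in_cart = any("addItem" in url for url in urls)        # assumed cart marker
    purchased = any("confirmation" in url for url in urls)  # assumed purchase marker
    return (ip, urls[-1]) if in_cart and not purchased else None


def abandoned_carts(lines):
    grouped = shuffle(map_step(line) for line in lines)
    results = (reduce_step(ip, visits) for ip, visits in grouped.items())
    return dict(result for result in results if result is not None)


print(abandoned_carts(LOG_LINES))
```

In a real cluster the framework performs the grouping and runs many mappers and reducers in parallel; only `map_step` and `reduce_step` would be written by the developer.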
A Simple Log File
Fields: DATE, TIME, IP ADDRESS, MONITOR, TYPE, COMPONENT

Jun 29 04:03:37 192.168.190.24 nagios: GLOBAL SERVICE EVENT HANDLER: ast-oral-vip; remedy_oracle_db_latch-contention; (null); (null); (null); handle-all-critical-event
Jun 29 04:03:57 192.168.190.24 nagios: SERVICE ALERT: ast-oral-vip;remedy_oracle_db_soft-prase-ratio;CRITICAL;HARD;3;CRITICAL – Soft parse ratio 86.63%
Jun 29 07:50:57 192.168.190.24 nagios: SERVICE ALERT: ast-oral-vip;remedy_oracle_db_latch-waiting; OK;SOFT;2;OK – SGA latch xssinfo freelist (#350) sleeping 0.000000% of the time
Jun 29 07:50:57 192.168.190.24 nagios: GLOBAL SERVICE EVENT HANDLER: ast-oral-vip;remedy_oracle_db_latch-waiting; (null); (null); (null); handle-all-critical-event
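A log line like the ones on this slide can be split into the labeled fields with a regular expression. The sketch below is illustrative only. The pattern is inferred from the four sample lines, and treating the second semicolon-separated item as the COMPONENT is an assumption about how the slide's labels map onto the text.

```python
import re

# Pattern inferred from the sample lines above; not an official Nagios grammar.
LINE_RE = re.compile(
    r"^(?P<date>\w{3} +\d{1,2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<monitor>\w+): "
    r"(?P<type>[A-Z ]+): "
    r"(?P<rest>.*)$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    parts = [part.strip() for part in record.pop("rest").split(";")]
    record["host"] = parts[0]
    # Assumption: the second item is what the slide labels COMPONENT.
    record["component"] = parts[1] if len(parts) > 1 else None
    return record


sample = (
    "Jun 29 04:03:57 192.168.190.24 nagios: SERVICE ALERT: "
    "ast-oral-vip;remedy_oracle_db_soft-prase-ratio;CRITICAL;HARD;3;"
    "CRITICAL - Soft parse ratio 86.63%"
)
print(parse_line(sample))
```

Lines that do not fit the pattern return `None` rather than raising, which is usually the safer choice when scanning large, messy log files.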
Analytic Approaches
Traditional Analytics: Structured & Repeatable
• Business Users Determine Questions
• IT Team Builds System To Answer Known Questions
Big Data Analytics: Iterative & Exploratory
• IT Team Delivers Data On Flexible Platform
• Business Users Explore and Ask Any Question
Why Didn’t We Use All of the Big Data Before?
“IBM gave us an opportunity to turn our plans into something that was very tangible right from the beginning. IBM had experts within data mining, Big Data, and Apache Hadoop and it was clear to us from the beginning that if we wanted to improve our business, not only today, but also prepare for the challenges we will face in three to five years, we had to go with IBM.”
– Lars Christian Christensen VP Plant Siting & Forecasting
Watson’s advanced analytic capabilities can sort through the equivalent of 200 MILLION pages of data to uncover an answer in 3 SECONDS.
IBM Big Data Strategy: Move the Analytics Closer to the Data
• New analytic applications drive the requirements for a big data platform
  – Integrate and manage the full variety, velocity and volume of data
  – Apply advanced analytics to information in its native form
  – Visualize all available data for ad-hoc analysis
  – Development environment for building new analytic applications
  – Workload optimization and scheduling
  – Security and Governance

Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics
IBM Big Data Platform: Visualization & Discovery, Application Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse, Information Integration & Governance
IBM’s Big Data Platform Rounds Out Open Source
Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics
IBM Big Data Platform: Visualization & Discovery, Application Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse, Information Integration & Governance

Component versions (BigInsights 1.3, Cloudera CDH3)
MapReduce and HDFS: 0.20.2+ (923 patches)
HBase: 0.90.4+
Zookeeper: 3.3.4+
Oozie: 2.3.1
Lucene: 3.3.0
Hive: 0.8.0
Pig: 0.8.1

InfoSphere BigInsights
Enterprise Capabilities: Administration & Security, Workload Optimization, Connectors, Advanced Engines, Indexing
Open source-based components: IBM certified Apache Hadoop or …
IBM Big Data Platform: Velocity, Variety & Volume
• MPP Hadoop (Data-At-Rest)
  – InfoSphere BigInsights: Hadoop-based low latency analytics for variety and volume
• MPP Data Warehouse
  – Netezza Appliances: High capacity for Queryable Archive, BI+Ad Hoc Analytics
  – Smart Analytics System: Operational Analytics on Structured Data
  – InfoSphere Warehouse: Large volume structured data analytics
• MPP Stream Computing (Data-In-Motion)
  – InfoSphere Streams: Low Latency Analytics for streaming data
• MPP Information Integration
  – InfoSphere Information Server: High volume data integration and transformation
Complementary Analytics
Traditional Approach: Structured, analytical, logical
• Data Warehouse (Structured, Repeatable, Linear): Monthly sales reports, Profitability analysis, Customer surveys
• Traditional Sources (Structured, Repeatable, Linear): Internal App Data, Transaction Data, ERP data, Mainframe Data, OLTP System Data
New Approach: Creative, holistic thought, intuition
• Hadoop, Streams (Unstructured, Exploratory, Iterative): Brand sentiment, Product strategy, Maximum asset utilization
• New Sources (Unstructured, Exploratory, Iterative): Web Logs, Social Data, Text Data: emails, Sensor data: images, RFID
Enterprise Integration
Data Governance (Internal Use Only)
• Separation of Duties
  – Users can’t delete their audit logs
  – No authorized access
• Privileged Users
  – Monitoring authorized and privileged user activities
• Sensitive Data (in tables, sometimes termed “Schema First”; in the file system, sometimes termed “Schema Later”)
  – Protecting and blocking unauthorized access to the system and the data
• Workflow for audit and compliance controls to validate proper security procedures (satisfies regulations!)
Data Security Design Goals (Internal Use Only)
Database (sometimes termed “Schema First”)
• Minimal impact to the Database server resources
• No Database configuration changes
• Separation of Duties: Preventing privileged users from doing malicious activities
• Audit trail with very granular details of Database activity
• Real-time alerting and blocking
• Minimal impact to the network
• 100% Database activity visibility
• Heterogeneous support
Big Data (sometimes termed “Schema Later”)
• Minimal impact to the Big Data server resources
• No Big Data configuration changes
• Separation of Duties: Preventing privileged users/jobs from doing malicious activities
• Audit trail with very granular details of Big Data activity
• Real-time alerting and blocking
• Minimal impact to the network
• 100% Big Data activity visibility
• Heterogeneous support
IBM Flattens the Time to Value Big Data Curve
Hadoop + Harden File System + ETL + Dev. Tooling + Text/ML Analytics + Visualization + Velocity
100+ Big Data Business Partners Signed
First Ever Forrester Wave on Big Data
“IBM has the deepest Hadoop platform and application portfolio. IBM, an established EDW vendor, has its own Hadoop distribution; an extensive professional services force working on Hadoop projects; extensive R&D programs developing Hadoop technologies; connections to Hadoop from its EDW.”
– The Forrester Wave™: Enterprise Hadoop Solutions, 1Q12
Open Source Technology Behind Big Data
A Difference of Processing Models
• SETI@home is a Computational Processing Model
  – Search for Extraterrestrial Intelligence (SETI) uses unused desktop CPU processing power to perform widespread analysis of radio telescope data
  – Pushes data to the program for processing
  – Data to function
• MapReduce is a Data Processing Model
  – Data processing primitives of Mappers and Reducers
  – Can be complex to write MapReduce programs, but a very simple configuration change makes it scale to 1000s of nodes
  – Function to data
Building Blocks of Hadoop
Hadoop Framework
• Hadoop Common
  – A utility layer that provides access to the HDFS and projects
• HDFS
  – Data storage platform for Hadoop framework
  – Can scale to massive size when distributed over multiple computers
• MapReduce
  – Framework to process data across a cluster of nodes
  – MAP process splits work by first mapping the input across the control nodes of the cluster, then splitting the workload into even smaller data sets and distributing it further throughout the computing cluster
    • Allows MPP – think DB2 DPF ‘like’
  – REDUCE collects and combines the nodes’ answers to deliver a result
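The split-then-combine idea behind MAP and REDUCE can be shown with a word count over pre-split data. This is a single-machine sketch, not the Hadoop API. The `partitions` list stands in for blocks of input spread across nodes, and all names are illustrative.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """MAP: count words within one partition, independently of the others."""
    return Counter(word for line in lines for word in line.lower().split())


def combine(left, right):
    """REDUCE: merge two partial answers into one."""
    return left + right


# Hypothetical input, already split into partitions as a cluster would hold it.
partitions = [
    ["big data big"],
    ["data in motion"],
]

partial_counts = [map_partition(part) for part in partitions]
total = reduce(combine, partial_counts, Counter())
print(total)
```

Because each partition is mapped without reference to any other, the map calls could run on separate nodes next to the data, which is the "function to data" point made earlier in the deck.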
Workload Optimization
Optimized performance for big data analytic workloads
Task Map (break task into small parts)
Adaptive Map (optimization: order small units of work)
Reduce (many results to a single result set)
Adaptive MapReduce
• Algorithm to optimize execution time of multiple small jobs
• Performance gains of 30%; reduces overhead of task startup
Hadoop System Scheduler
• Identifies small and large jobs from prior experience
• Sequences work to reduce overhead
Growing the Hadoop Environment
• Avro: Data serialization system that converts data into a fast, compact binary data format
  – Avro data can be stored in a file with versioned schemas
• Chukwa: Large-scale monitoring system that provides insights into the Hadoop distributed file system and MapReduce
• HBase is a scalable, column-oriented distributed database modeled after Google's BigTable distributed storage system
• Hive is a data warehouse infrastructure that provides ad hoc query and data summarization for Hadoop-supported data
  – Hive utilizes a SQL-like query language called HiveQL
  – HiveQL is also used by programmers to run custom MapReduce jobs.
Growing the Hadoop Environment
• Mahout is a data mining library designed to work on the Hadoop framework
  – Mahout delivers a core set of algorithms designed for clustering, classification and batch-based filtering.
• Pig is a high-level programming language and execution framework for parallel computation
  – Pig works within the Hadoop and MapReduce frameworks
• ZooKeeper provides coordination, configuration and group services for distributed applications working over the Hadoop stack
Committed to Open Source
• Decade of lineage and contributions to the open source community
  – Apache Hadoop and Jaql, Apache Derby, Apache Geronimo, Apache Jakarta, +++
  – Eclipse: founded by IBM
  – Significant Lucene contributions via IBM Lucene Extension Library (ILEL)
  – DRDA, XQuery, SQL, XML4J, XERCES, HTTP, Java, Linux, +++
• IBM products built on open source
  – WebSphere: Apache
  – Rational: Eclipse and Apache
  – InfoSphere: Eclipse and Apache, +++
• IBM’s BigInsights (Hadoop) is 100% open source compatible with no forks
Why IBM for Big Data: The Solution Side
Living Model Management and Frontier Analytics
[Diagram labels: Data Ingest, Bootstrap, Enrich, Adaptive Analytics Model, Nowcast, Forecast. Annotation: “Opportunity Cost Starts Here”]
Data In Motion
• Hear what’s going on miles away to optimize perimeter displacements
• Perspective: Try to find the word “Zero” in a 1000 MP3 song library in a fraction of a second
  – Figure out the difference between the sound of a human whisper and the wind
Cisco turns to IBM big data for intelligent infrastructure management
• Optimize building energy consumption with centralized monitoring and control of building monitoring system
• Automates preventive and corrective maintenance of building corrective systems
• Uses Streams, InfoSphere BigInsights and Cognos
  - Log Analytics
  - Energy Bill Forecasting
  - Energy consumption optimization
  - Detection of anomalous usage
  - Presence-aware energy mgt.
  - Policy enforcement
BigInsights and Cisco: Energy Usage Dashboard
End-user Visualization Development Environment
• Gather
  – Crawl: gather statically
  – Adapter: gather dynamically
• Extract
  – Document-level info
  – Cleanse, normalize
• Explore
  – Analyze, annotate, filter
  – Visualize results
• Iterate
  – Familiar coding and tooling environment, testing, and optimization
What is BigSheets?
Browser-based Big Data analytics tool for business users
Big Data Challenges
• Business users need a no programming approach for analyzing Big Data
• Extremely difficult to find actionable business insights in data from multiple sources with different formats
• Translating untapped data into actionable business insights is a common requirement that requires visualization
How can BigSheets help?
• Spreadsheet-like discovery interface lets business users easily analyze Big Data with ZERO PROGRAMMING
• BUILT-IN “readers” can work with data in several common formats
  – JSON arrays, CSV, TSV, Web, Hive, Line Reader, Nutch Web Crawler Output, +++
• Users can VISUALLY combine and explore various types of data to identify hidden insights
Big Data Made Easy for the Little Guy
• USC’s Film Forecaster correctly predicted a clamor for “Hangover 2” that resulted in a $100 million opening over Memorial Day weekend
  – Looked at 250K-500K Tweets and broke down positive and negative messages using a lexicon of 1700 words
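Lexicon-based sentiment scoring of the kind described here can be sketched in a few lines. This is not the Film Forecaster's actual method or word list. The two tiny lexicons, the tweets, and the `classify` function are hypothetical stand-ins for a 1700-word lexicon and real Twitter data.

```python
from collections import Counter

# Hypothetical mini-lexicons; the real one reportedly had about 1700 words.
POSITIVE = {"love", "great", "hilarious", "awesome"}
NEGATIVE = {"boring", "awful", "hate", "bad"}


def classify(tweet):
    """Label a tweet by counting lexicon hits among its distinct words."""
    words = {word.strip(".,!?#@\"'").lower() for word in tweet.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical tweets.
tweets = [
    "Hangover 2 looks hilarious!",
    "That trailer was boring and awful.",
    "Seeing a movie tonight",
]
breakdown = Counter(classify(tweet) for tweet in tweets)
print(breakdown)
```

A scheme this simple misses sarcasm and negation, which is one reason sentiment is called tricky later in the deck.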
The Film Forecaster sounds like a big undertaking for USC, but it really came down to one communications masters student who learned BigSheets in a day, then pulled in the tweets and analyzed them. - Ryan Kim
Even though relatively still immature in terms of product release cycles, BigSheets has an impressive range of functionality, which is only set to expand as it is further developed and integrated with BigInsights. – Krishna Roy
Running Applications from the Web Console
A Rich Management Big Data Tool
Rich Big Data Development Environment
• Eclipse-based development tools
  – For Jaql, Hive, Pig, Java MapReduce and Text Analytics
Without a Big Data Platform You Code!
• Multithreading
• Custom SQL and Scripts
• Performance Optimization
• Debug
• Application Management
• Event Handling
• Connectors
• Check Pointing
• Security
• HA
• Accelerators and Toolkits
IBM Big Data Platform
Streams provides development, deployment, runtime, and infrastructure services
“TerraEchos developers can deliver applications 45% faster due to the agility of Streams Processing Language!” – Alex Philip, CEO and President
Over 100 sample applications and toolkits with industry focused toolkits with 300+ functions and operators!
Streams Runtime Illustrated
[Diagram: operators distributed across five X86 hosts. Operator labels: Meters, Company Filter, Usage Model, Usage Contract, Text Extract, Temp Action, Season Adjust, Degree History, Compare History, Store History, Daily Adjust]
How Text Analytics Works
World Cup 2010 Highlights
Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas, made the save. Winger Andres Iniesta scored for Spain for the win.
Extracted annotations:
• Striker: Arjen Robben (Netherlands)
• Keeper: Iker Casillas (Spain)
• Winger: Andres Iniesta (Spain)
Text Analytic Toolkit Example
• Semantically search Enron’s emails to support queries such as “Find Tom’s phone” in order to find his actual phone number
  – As opposed to emails containing words Tom and Phone
• Need to be able to accurately extract email entities of type Person and Phone and the relationship between them
The Enron Email Example
• Start with a naïve set of rules to identify a PersonPhone relationship, built using some sort of editor
  – Build out an extractor that knows how to find a person’s name: Lorraine Smith
  – Build out an extractor that knows how to find a phone number:
    607-205-4493 (North American spacing)
    607.205.4493 (Euro spacing)
    xt. ext. …
    +0119054132112 (Globalstar GSP-1600)
    911, 411
    +++
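A naïve PersonPhone rule set of the kind described above might look like the following Python sketch. The regular expressions, the `max_gap` parameter, and the sample sentences are assumptions made for illustration; the slide's rules would be written in AQL, and this version covers only the dash- and dot-separated number formats.

```python
import re

# Assumed person rule: two adjacent capitalized words.
PERSON = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

# Assumed phone rule: optional 3-digit area code, then 3 + 4 digits,
# separated by '-' or '.' (extensions, satellite numbers and short
# codes such as 911 are deliberately not handled).
PHONE = re.compile(r"\b(?:\d{3}[-.])?\d{3}[-.]\d{4}\b")


def person_phone(text, max_gap=30):
    """Pair each phone number with the last person name that ends
    at most max_gap characters before the number starts."""
    people = list(PERSON.finditer(text))
    pairs = []
    for phone in PHONE.finditer(text):
        before = [p for p in people if 0 <= phone.start() - p.end() <= max_gap]
        if before:
            pairs.append((before[-1].group(), phone.group()))
    return pairs


print(person_phone("Please call Lorraine Smith at 607-205-4493."))
print(person_phone("You can reach Lorraine Smith, 607.205.4493"))
```

Rules this simple over-match easily: any two capitalized words near a number are treated as a person with a phone, which is why the extractor has to be refined against real email text.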
The Tricky Thing About Sentiment!
Accelerating Analytics – Explainability
• Every annotation’s provenance can be visualized
• Provenance of Morgan Stanley Fax: 205-4493 shows the FullName rule is responsible for generating the incorrect annotation
  – Solutions?
    • Remove Morgan Stanley from FullName dictionary?
    • Create new dictionary for CompanyName?
Text Analytics Toolkit
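The idea of provenance can be sketched in Python by having every annotation remember which rule produced it and which annotations it was built from. Everything here is hypothetical: the `Annotation` class, the two tiny dictionaries, and the `exclude` parameter are illustrative stand-ins, not the toolkit's data model.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    text: str
    rule: str                                    # rule that produced this annotation
    inputs: list = field(default_factory=list)   # annotations it was derived from


# Hypothetical dictionaries. FullName wrongly contains a company name.
FULL_NAME_DICT = {"Lorraine Smith", "Morgan Stanley"}
COMPANY_NAME_DICT = {"Morgan Stanley"}

PHONE = re.compile(r"\b(?:\d{3}[-.])?\d{3}[-.]\d{4}\b")


def full_names(text, exclude=frozenset()):
    """FullName rule: dictionary lookup, minus any excluded entries."""
    return [Annotation(n, "FullName")
            for n in sorted(FULL_NAME_DICT) if n in text and n not in exclude]


def phones(text):
    """Phone rule: regular-expression match."""
    return [Annotation(m.group(), "Phone") for m in PHONE.finditer(text)]


def person_phone(text, exclude=frozenset()):
    """PersonPhone rule: combine every FullName with every Phone."""
    return [Annotation(f"{n.text} / {p.text}", "PersonPhone", [n, p])
            for n in full_names(text, exclude) for p in phones(text)]


def provenance(annotation, depth=0):
    """Return an indented trace of the rules behind an annotation."""
    lines = ["  " * depth + f"{annotation.rule}: {annotation.text}"]
    for child in annotation.inputs:
        lines.extend(provenance(child, depth + 1))
    return lines


text = "Morgan Stanley Fax: 205-4493"
bad = person_phone(text)[0]
print("\n".join(provenance(bad)))

# One of the slide's proposed fixes: consult a CompanyName dictionary.
print(person_phone(text, exclude=COMPANY_NAME_DICT))
```

The trace points straight at the FullName rule, and filtering FullName matches against the CompanyName dictionary removes the incorrect pair.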
Text Analytics Toolkit Provenance Viewer
• Major challenge for analysts is determining the lineage of changes that have been applied to text
  – REALLY difficult to discern which extractors need to be adjusted to tweak the resulting annotations
• Provenance Viewer for interactive visualizations to display output annotations for regression, debugging, version enhancements, +++
  – Reduce development of extractors by days to weeks
Accelerating Analytics – Discovery and Pattern Matching
• Contextual Clues Discoverer clusters the context surrounding annotations in order to detect frequently occurring patterns
• Illustrate clustering results between incorrect PersonPhone pairs
  – Visualize frequent occurrences (bubble size) of clues: i.e. FAX and TELEFAX
  – Developer improves precision of annotator by adding a rule that filters out PersonPhone pairs if the Phone is preceded by a FAX clue
  – Developer finds “call at” text as a key clue for rule extraction for PersonPhone
Text Analytics Toolkit
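The clue-discovery loop above can be imitated in Python by counting the word that immediately precedes each phone number in known-bad pairs, then turning the frequent clues into a filter. This is a rough sketch: counting one preceding word stands in for the toolkit's clustering, and the sample snippets (including the names John Doe and Jane Roe) are invented for illustration.

```python
import re
from collections import Counter

PHONE = re.compile(r"\b(?:\d{3}[-.])?\d{3}[-.]\d{4}\b")


def clue_before(text, phone_match):
    """Return the upper-cased word just before the phone, or None."""
    words = re.findall(r"[A-Za-z]+", text[:phone_match.start()])
    return words[-1].upper() if words else None


def discover_clues(snippets):
    """Count the clue word preceding every phone number in the snippets."""
    counts = Counter()
    for snippet in snippets:
        for match in PHONE.finditer(snippet):
            clue = clue_before(snippet, match)
            if clue:
                counts[clue] += 1
    return counts


# Invented examples of PersonPhone pairs an analyst marked as incorrect.
INCORRECT = [
    "Morgan Stanley Fax: 205-4493",
    "John Doe fax 607-205-4493",
    "Jane Roe Telefax: 607.205.4493",
]
counts = discover_clues(INCORRECT)
print(counts.most_common())

# Refinement suggested by the counts: drop phones preceded by a fax clue.
FAX_CLUES = {"FAX", "TELEFAX"}


def filtered_phones(text):
    """Return phone numbers that are not preceded by a fax clue."""
    return [m.group() for m in PHONE.finditer(text)
            if clue_before(text, m) not in FAX_CLUES]


print(filtered_phones("Please call Lorraine Smith at 607-205-4493"))
print(filtered_phones("Morgan Stanley Fax: 205-4493"))
```

The counts play the role of the bubble sizes in the visualization: the larger the count, the stronger the case for a filtering rule.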
Accelerating Analytics – Assisted Development
• IDE that exposes sophisticated techniques to assist rule developers throughout all stages of the development cycle
  – Promotes agile development because it unifies the business analyst and the developer with a common toolset, which fosters understanding
    • Business Analyst helps mature and validate extractor results
    • Developer understands visually that AQL enhancements are needed to refine rules
Text Analytics Toolkit
Export Reports to Business Users
BigInsights Text Analytics Development
[Screenshot callouts:]
Hyperlink Navigation
Procedural Help
Outline View
Content Assist
Difference Viewer
Design Time Validation
Context Sensitive Help
Regular Expression Builder
Text Analytics Toolkit – Development Accelerator
• Eclipse plug-ins enhance analyst productivity
  – When writing AQL code, the editor features syntax highlighting and automatic detection of syntax errors at design time, not runtime
  – Pre-built extractors and sample test tools, +++
• Reduce coding time and debugging by 30-50%+!
Did anyone ever ask if the answer
was
Business Task: Find the Pictures that Have Cats
BUT! Your Application Returns the Following!
PRECISION (a measure of exactness): 2/4
RECALL (a measure of coverage): 2/5
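The two ratios follow directly from set arithmetic, as this small Python sketch shows. The picture names are hypothetical placeholders (the slide's images did not survive); the counts are chosen to match the slide: five pictures contain cats, the application returns four pictures, and two of those are cats.

```python
from fractions import Fraction


def precision_recall(returned, relevant):
    """Precision = hits / returned; recall = hits / relevant."""
    hits = len(returned & relevant)
    return Fraction(hits, len(returned)), Fraction(hits, len(relevant))


# Hypothetical picture identifiers standing in for the slide's images.
relevant = {"cat1", "cat2", "cat3", "cat4", "cat5"}   # all pictures with cats
returned = {"cat1", "cat2", "dog1", "dog2"}           # what the application returned

precision, recall = precision_recall(returned, relevant)
print(f"precision = {precision} ({float(precision):.2f})")
print(f"recall    = {recall} ({float(recall):.2f})")
```

`Fraction` reduces 2/4 to 1/2, so precision is 0.50 and recall is 0.40: half of what was returned is right, and most of the cats were missed.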
Learn Hadoop: At Your Place, At Your Pace
Making Learning Hadoop Easy and Fun
• Flexible on-line delivery allows learning @your place and @your pace
• Free courses, free study materials
• Cloud-based sandbox for exercises – zero setup
• 10000+ registered students
@BigData_paulz
THINK
PDF http://tinyurl.com/88auawg
Hardcopy http://tinyurl.com/7ef9ubs