Vladimir_Suvorov_Big_data
Non-commercial education only. All information belongs to its respective owners, including EMC, IBM, Microsoft, Oracle, Gartner, etc.
Big Data Concepts & Practice
Vladimir Suvorov [email protected]
EMC & DataScienceSquad.com
About myself
© 2012 IBM Corporation February 16, 2013
Why Big Data
How We Got Here
In 2005 there were 1.3 billion RFID tags in circulation…
…by the end of 2011, this was about 30 billion and growing even faster
An increasingly sensor-enabled and instrumented business environment generates HUGE volumes of data with MACHINE SPEED characteristics…
1 BILLION lines of code
EACH engine generating 10 TB every 30 minutes!
350B Transactions/Year
Meter Reads every 15 min.
3.65B – meter reads/day
120M – meter reads/month
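The three figures above are consistent with one another if one assumes a fleet of 10 million smart meters. That count is an assumption made for illustration and is not stated on the slide. Under it, monthly reads give 120M transactions per year, daily reads 3.65B, and 15-minute reads roughly 350B. A quick check of the arithmetic:

```python
# Annual meter-read transactions at different read frequencies.
# ASSUMPTION: 10 million meters. The slide does not state the meter count;
# this value is chosen because it reproduces the slide's figures.
METERS = 10_000_000


def reads_per_year(reads_per_day: int, meters: int = METERS) -> int:
    """Total reads per year for a fleet that is read `reads_per_day` times a day."""
    return meters * reads_per_day * 365


monthly = METERS * 12                    # one read per meter per month
daily = reads_per_year(1)                # one read per meter per day
quarter_hourly = reads_per_year(24 * 4)  # one read per meter every 15 minutes

print(f"monthly reads:   {monthly:,} per year")
print(f"daily reads:     {daily:,} per year")
print(f"15-minute reads: {quarter_hourly:,} per year")
```

Moving from monthly to 15-minute reads multiplies the yearly transaction count by about 2,900, which is the kind of jump the slide is pointing at.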
In August of 2010, Adam Savage, of “MythBusters,” took a photo of his vehicle using his smartphone. He then posted the photo to his Twitter account including the phrase “Off to work.”
Since the photo was taken by his smartphone, the image contained metadata revealing the exact geographical location where the photo was taken.
By simply taking and posting a photo, Savage revealed the exact location of his home, the vehicle he drives, and the time he leaves for work.
The Social Layer in an Instrumented Interconnected World
2+ billion people on the Web by end 2011
30 billion RFID tags today (1.3B in 2005)
4.6 billion camera phones world wide
100s of millions of GPS enabled devices sold annually
76 million smart meters in 2009… 200M by 2014
12+ TBs of tweet data every day
25+ TBs of log data every day
? TBs of data every day
Twitter Tweets per Second Record Breakers of 2011
Extract Intent, Life Events, Micro Segmentation
Attributes
Jo Jobs
Tina Mu
Tom Sit
Pauline
Name, Birthday, Family
Not Relevant - Noise
Not Relevant - Noise
Monetizable Intent
Monetizable Intent Relocation
Location Wishful Thinking
SPAMbots
Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible
Big Data Includes Any of the Following Characteristics
Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text
Velocity: Streaming data and large volume data movement
Volume: Scale from Terabytes to Petabytes (1K TBs) to Zettabytes (1B TBs)
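The Volume scale can be checked with a few lines of arithmetic. This is a small sketch that assumes decimal (SI) prefixes, where each step is a factor of 1,000; the helper name `in_terabytes` is illustrative only.

```python
# Storage units as byte counts, ASSUMING decimal (SI) prefixes.
TB = 10**12  # terabyte
PB = 10**15  # petabyte
ZB = 10**21  # zettabyte


def in_terabytes(n_bytes: int) -> int:
    """Express a byte count as a whole number of terabytes."""
    return n_bytes // TB


print(f"1 PB = {in_terabytes(PB):,} TB")
print(f"1 ZB = {in_terabytes(ZB):,} TB")
```

With these definitions a petabyte is 1K TB and a zettabyte is 1B TB, matching the figures given for Volume.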
Bigger and Bigger Volumes of Data
• Retailers collect click-stream data from Web site interactions and loyalty card data
– This traditional POS information is used by retailers for shopping basket analysis, inventory replenishment, +++
– But data is being provided to suppliers for customer buying analysis
• Healthcare has traditionally been dominated by paper-based systems, but this information is getting digitized
• Science is increasingly dominated by big science initiatives
– Large-scale experiments generate over 15 PB of data a year, which can’t be stored within the data center and is sent to laboratories
• Financial services are seeing larger and larger volumes through smaller trading sizes, increased market volatility, and technological improvements in automated and algorithmic trading
• Improved instrument and sensory technology
– Large Synoptic Survey Telescope’s GPixel camera generates 6PB+ of image data per year; or consider the Oil and Gas industry
Data AVAILABLE to
an organization
Data an organization
can PROCESS
The Big Data Conundrum
• The percentage of available data an enterprise can analyze is decreasing in proportion to the data available to it
Quite simply, this means as enterprises, we are getting “more naive” about our business over time
We don’t know what we could already know….
Why Not All of Big Data Before: Didn’t have the Tools?
Applications for Big Data Analytics
Homeland Security
Finance
Smarter Healthcare
Multi-channel sales
Telecom
Manufacturing
Traffic Control
Trading Analytics
Fraud and Risk
Log Analysis
Search Quality
Retail: Churn, NBO
Most Requested Uses of Big Data
• Log Analytics & Storage
• Smart Grid / Smarter Utilities
• RFID Tracking & Analytics
• Fraud / Risk Management & Modeling
• 360° View of the Customer
• Warehouse Extension
• Email / Call Center Transcript Analysis
• Call Detail Record Analysis
• +++
What companies & analysts think of Big Data
Gartner & McKinsey
Hype Cycle of Big Data
Priority matrix
Key vision
• Predictive modeling is gaining momentum with property and casualty (P&C) companies, who are using it to support claims analysis, CRM, risk management, pricing and actuarial workflows, quoting, and underwriting.
• Social content is the fastest growing category of new content in the enterprise and will eventually attain 20% market penetration.
• Gartner reports that 45% of sales management teams identify sales analytics as a priority to help them understand sales performance, market conditions and opportunities.
• Over 80% of Web Analytics solutions are delivered via Software-as-a-Service (SaaS).
Big Data deliverables by McKinsey
Intel
Intel Big Data Cluster
Example Application | Big Data | Algorithms | Compute Style
Scientific study (e.g. earthquake study) | Ground model | Earthquake simulation, thermal conduction, … | HPC
Internet library search | Historic web snapshots | Data mining | MapReduce
Virtual world analysis | Virtual world database | Data mining | TBD
Language translation | Text corpuses, audio archives, … | Speech recognition, machine translation, text-to-speech, … | MapReduce & HPC
Video search | Video data | Object/gesture identification, face recognition, … | MapReduce
There has been more video uploaded to YouTube in the last 2 months than if ABC, NBC, and CBS had been airing content 24/7/365 continuously since 1948. - Gartner
Example Motivating Application: Online Processing of Archival Video
• Research project: Develop a context recognition system that is 90% accurate over 90% of your day
• Leverage a combination of low- and high-rate sensing for perception
• Federate many sensors for improved perception
• Big Data: Terabytes of archived video from many egocentric cameras
• Example query 1: “Where did I leave my briefcase?”
• Sequential search through all video streams [Parallel Camera]
• Example query 2: “Now that I’ve found my briefcase, track it”
• Cross-cutting search among related video streams [Parallel Time]
Big Data Cluster
Oracle
Big Data Use Cases
Industry | Today’s Challenge | New Data | What’s Possible
Healthcare | Expensive office visits | Remote patient monitoring | Preventive care, reduced hospitalization
Manufacturing | In-person support | Product sensors | Automated diagnosis, support
Location-Based Services | Based on home zip code | Real time location data | Geo-advertising, traffic, local search
Public Sector | Standardized services | Citizen surveys | Tailored services, cost reductions
Retail | One size fits all marketing | Social media | Sentiment analysis, segmentation
What’s in Big Data for Public Sector
• Operational efficiency and productivity
• Fraud detection and prevention
• Close tax gaps
• Value for money for citizens
• Prevent crime waves
• Customize actions based on population segments
• Public utilities to reduce consumption
• Produce safety from farm to fork
Microsoft
New opportunities
Massive Volumes
• Increases ad revenue by processing 3.5 billion events per day
• Processes 464 billion rows per quarter, with average query time under 10 secs.
Cloud Connectivity
• Measures and ranks online user influence by processing 3 billion signals per day
• Connects across 15 social networks via the cloud for data and API access
Real-Time Insight
• Improving investigation time by analyzing large volume & variety of data
• Cut investigation time from 2 years to 15 days
Microsoft’s Approach to Big Data
A Holistic Big Data Solution from Microsoft
Data Scientist Job
Sexy Job of Data Scientist
Tom Davenport, who is teaching an executive
program in Big Data and analytics at Harvard
University, said some data scientists are
earning annual salaries as high as $300,000,
which is “pretty good for somebody that
doesn't have anyone else working for them.”
Davenport also said such workers are
motivated by the problems and opportunities
data provides.
What EMC Thinks of Data Scientists
Job evolution
What Forbes Thinks of Data Scientists
Data Science Courses
Course Modules and Navigation Icons
Data Science and Big Data Analytics
1. Introduction to Big Data Analytics
2. Data Analytics Lifecycle + Lab
3. Review of Basic Data Analytics Methods Using R + Labs
4. Advanced Analytics - Theory & Methods + Labs
5. Advanced Analytics - Technology & Tools + Labs
6. The Endgame, or Putting it All Together + Final Lab
Topics: Data Science and Big Data Analytics Course
Introduction to Big Data Analytics + Data Analytics Lifecycle
– Big Data Overview
– State of the Practice in Analytics
– The Data Scientist
– Big Data Analytics in Industry Verticals
– Data Analytics Lifecycle
Review of Basic Data Analytic Methods Using R
– Using R to Look at Data - Introduction to R
– Analyzing and Exploring the Data
– Statistics for Model Building and Evaluation
Advanced Analytics – Theory and Methods
– K-means Clustering
– Association Rules
– Linear Regression
– Logistic Regression
– Naive Bayesian Classifier
– Decision Trees
– Time Series Analysis
– Text Analysis
Advanced Analytics - Technology and Tools
– Analytics for Unstructured Data (MapReduce and Hadoop)
– The Hadoop Ecosystem
– In-database Analytics – SQL Essentials
– Advanced SQL and MADlib for In-database Analytics
The Endgame, or Putting it All Together + Final Lab on Big Data Analytics
– Operationalizing an Analytics Project
– Creating the Final Deliverables
– Data Visualization Techniques
– Final Lab – Application of the Data Analytics Lifecycle to a Big Data Analytics Challenge
Hadoop
Top companies need Hadoop
What is Hadoop and Where did it start?
• Created by Doug Cutting, formerly of Yahoo!, now at Cloudera
– HDFS (storage) & MapReduce (compute)
– Inspired by Google’s MapReduce and Google File System (GFS) papers
• Much of the initial work on Hadoop was done by Yahoo
• It is now a top-level Apache project backed by a large open source development community
What is Hadoop?
• Storage & Compute in 1 Framework
• Open Source Project of the Apache Software Foundation
• Written in Java
Two Core Components
– HDFS: Storage in the Hadoop Distributed File System
– MapReduce: Compute via the MapReduce distributed processing platform
Hadoop cluster architecture
MapReduce example
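A minimal sketch of the MapReduce pattern, assuming the classic word-count task. It runs in a single Python process and does not use Hadoop's API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are illustrative, not part of any library. On a real cluster the map and reduce calls would run in parallel on many nodes, with the framework doing the shuffle between them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input records."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big insight", "data at machine speed"]))
```

Each map call looks at one record and each reduce call at one key, so neither needs to see the whole data set. That independence is what lets the work be split across a cluster.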
Hadoop versions
Hadoop Wave Report
“EMC Greenplum is the first mover in Hadoop appliances. EMC Greenplum the first EDW vendor to provide a full-featured enterprise-grade Hadoop appliance and roll out an appliance family that integrates its Hadoop, EDW, and data integration in a single rack. It provides its own open source Hadoop distribution software, integrates EMC’s strong storage product portfolio in its appliances, and has an extensive professional services force of EMC technical consultants and data scientists with Hadoop expertise.”
Hadoop Players Today
Get Started With Hadoop Today
Hadoop Architecture Services
– POC planning and deployment
– Installation and best practices
– Educate the team
Greenplum Analytics Labs
– Leverage the expertise of Greenplum’s Data Scientists
– Packaged solutions that produce business value and actionable results
– Accelerate Hadoop capabilities on your data with your analysts
Establish a strategic vision
– Roadmap for Hadoop and unified analytics
Data Scientists & Hadoop Architecture teams deliver customer success
The Greenplum Unified Analytics Platform
NoSQL
Definition from nosql-database.org
• Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly. Often more characteristics apply, such as: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge data amount, and more. So the misleading term "nosql" (the community now translates it mostly with "not only sql") should be seen as an alias to something like the definition above.
NoSQL
http://nosql-database.org/
• Non relational
• Scalability
– Vertically
• Add more data
– Horizontally
• Add more storage
• Collection of structures
– Hashtables, maps, dictionaries
• No pre-defined schema
• No join operations
• CAP not ACID
– CAP: Consistency, Availability and Partition tolerance (but not all three at once!)
– ACID: Atomicity, Consistency, Isolation and Durability
Advantages of NoSQL
• Cheap, easy to implement
• Data are replicated and can be partitioned
• Easy to distribute
• Don't require a schema
• Can scale up and down
• Quickly process large amounts of data
• Relax the data consistency requirement (CAP)
• Can handle web-scale data, whereas Relational DBs cannot
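To make "replicated and can be partitioned" concrete, here is a toy sketch that assumes a simple hash-modulo placement scheme; production systems typically use more elaborate schemes. The class `PartitionedStore` and its methods are hypothetical names for illustration, and each "node" is just an in-process dict.

```python
import hashlib
from typing import Dict, List


class PartitionedStore:
    """Toy store that spreads keys over nodes and keeps replica copies."""

    def __init__(self, node_count: int, replicas: int = 2) -> None:
        if not 1 <= replicas <= node_count:
            raise ValueError("replicas must be between 1 and node_count")
        # Each dict stands in for one machine in the cluster.
        self.nodes: List[Dict[str, str]] = [{} for _ in range(node_count)]
        self.replicas = replicas

    def _owners(self, key: str) -> List[int]:
        # A stable hash picks the primary node; replicas go to the next nodes.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        primary = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return [(primary + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key: str, value: str) -> None:
        # Write the value to every node that owns a copy of this key.
        for node_id in self._owners(key):
            self.nodes[node_id][key] = value

    def get(self, key: str) -> str:
        # Any replica can answer; try the owners in order.
        for node_id in self._owners(key):
            if key in self.nodes[node_id]:
                return self.nodes[node_id][key]
        raise KeyError(key)


if __name__ == "__main__":
    store = PartitionedStore(node_count=4, replicas=2)
    store.put("user:1", "Ann")
    print(store.get("user:1"))
```

Because every key lives on two nodes here, a read still succeeds after one copy is lost. The price is that copies can drift apart if a write reaches only some of them, which is the relaxed consistency the list refers to.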
Disadvantages of NoSQL
• New and sometimes buggy
• Data is generally duplicated, potential for inconsistency
• No standardized schema
• No standard format for queries
• No standard language
• Difficult to impose complicated structures
• Depend on the application layer to enforce data integrity
• No guarantee of support
• Too many options: which one, or ones, to pick?
NoSQL Options
Key-Value Stores
• A technology you know and love and use all the time
– Hashmap for example
• Put(key,value)
• value = Get(key)
• Examples
– Redis (my favorite!!) – in memory store
– Memcached
– and 100s more
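The Put/Get interface above can be sketched in a few lines. This is a minimal in-memory illustration backed by a Python dict (a hash map); the class `KeyValueStore` is a hypothetical name and does not reflect the API of Redis, Memcached, or any other product.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """In-memory key-value store backed by a Python dict (a hash map)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        """Store `value` under `key`, overwriting any previous value."""
        self._data[key] = value

    def get(self, key: str, default: Optional[Any] = None) -> Any:
        """Return the value stored under `key`, or `default` if absent."""
        return self._data.get(key, default)


if __name__ == "__main__":
    store = KeyValueStore()
    store.put("session:42", {"user": "ann", "cart": ["book"]})
    print(store.get("session:42"))
    print(store.get("session:99"))  # missing key
```

The store treats the value as opaque: it can fetch it by key but cannot search inside it.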
Column Stores
• Not to be confused with the relational-db version of this
– Sybase-IQ etc.
• Multi-dimensional map
• Not all entries are relevant each time
– Column families
• Examples
– Cassandra
– HBase
– Amazon SimpleDB
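One way to picture the "multi-dimensional map" is as nested dictionaries: row key, then column family, then column name. The sketch below assumes that simplified model; the class `ColumnFamilyStore` is hypothetical and does not mirror the actual data model or API of Cassandra or HBase.

```python
from collections import defaultdict
from typing import Dict


class ColumnFamilyStore:
    """Nested map: row key -> column family -> column name -> value."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Dict[str, str]]] = defaultdict(
            lambda: defaultdict(dict)
        )

    def put(self, row: str, family: str, column: str, value: str) -> None:
        """Set one cell; rows need not share the same set of columns."""
        self._rows[row][family][column] = value

    def get_family(self, row: str, family: str) -> Dict[str, str]:
        """Read only one column family of a row, leaving the others untouched."""
        # .get() avoids creating empty entries for rows or families never written.
        return dict(self._rows.get(row, {}).get(family, {}))


if __name__ == "__main__":
    store = ColumnFamilyStore()
    store.put("user1", "profile", "name", "Ann")
    store.put("user1", "activity", "last_login", "2013-02-16")
    store.put("user2", "profile", "name", "Bob")
    print(store.get_family("user1", "profile"))
    print(store.get_family("user2", "activity"))  # sparse: nothing stored
```

Grouping columns into families means a query that needs only the profile data never touches the activity data, which is the point of "not all entries are relevant each time".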
Document Stores
• Key-document stores
– However, the document can be seen as a value, so you can consider this a superset of key-value
• The big difference is that in document stores one can also query on the document, i.e. the document portion is structured (not just a blob of data)
• Examples
– MongoDB
– CouchDB
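The difference from a plain key-value store can be shown with a small sketch in which documents are Python dicts and can be searched by field. The class `DocumentStore` and its `find` method are hypothetical and are not the query API of MongoDB or CouchDB.

```python
from typing import Any, Dict, List


class DocumentStore:
    """Key-document store whose documents can also be queried by field."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, document: Dict[str, Any]) -> None:
        """Store a structured document under a key."""
        self._docs[key] = document

    def get(self, key: str) -> Dict[str, Any]:
        """Key lookup, exactly as in a key-value store."""
        return self._docs[key]

    def find(self, **criteria: Any) -> List[Dict[str, Any]]:
        """Return every document whose fields match all given criteria."""
        return [
            doc
            for doc in self._docs.values()
            if all(doc.get(name) == value for name, value in criteria.items())
        ]


if __name__ == "__main__":
    store = DocumentStore()
    store.put("u1", {"name": "Ann", "city": "Moscow", "role": "analyst"})
    store.put("u2", {"name": "Bob", "city": "Moscow"})  # no fixed schema
    store.put("u3", {"name": "Eve", "city": "Kazan", "role": "analyst"})
    print(store.find(city="Moscow"))
    print(store.find(city="Moscow", role="analyst"))
```

The `find` call works only because the store understands the structure of each document. A real document database would typically use indexes rather than scanning every document as this sketch does.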
Graph Stores
• Use a graph structure
– Labeled, directed, attributed multi-graph
• Label for each edge
• Directed edges
• Multiple attributes per node
• Multiple edges between nodes
– Relational DBs can model graphs, but an edge requires a join, which is expensive
• Example: Neo4j
– http://www.infoq.com/articles/graph-nosql-neo4j
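A labeled, directed, attributed multi-graph can be sketched with an adjacency list, where following an edge is a direct lookup rather than a join. The `Edge` and `Graph` classes below are hypothetical illustrations and are unrelated to Neo4j's actual API or storage format.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Edge:
    """One directed edge with a label and its own attributes."""
    label: str
    target: str
    attributes: Dict[str, Any] = field(default_factory=dict)


class Graph:
    """Labeled, directed, attributed multi-graph stored as adjacency lists."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Dict[str, Any]] = {}
        self.out_edges: Dict[str, List[Edge]] = defaultdict(list)

    def add_node(self, node_id: str, **attributes: Any) -> None:
        """Add a node with any number of attributes."""
        self.nodes[node_id] = attributes

    def add_edge(self, source: str, label: str, target: str, **attributes: Any) -> None:
        """Add a directed edge; several edges between the same nodes are allowed."""
        self.out_edges[source].append(Edge(label, target, attributes))

    def neighbors(self, source: str, label: str) -> List[str]:
        """Follow outgoing edges with the given label from one node."""
        return [e.target for e in self.out_edges.get(source, []) if e.label == label]


if __name__ == "__main__":
    g = Graph()
    g.add_node("alice", age=34, city="Moscow")
    g.add_node("bob", age=29)
    g.add_edge("alice", "KNOWS", "bob", since=2010)
    g.add_edge("alice", "WORKS_WITH", "bob")  # second edge, same pair of nodes
    print(g.neighbors("alice", "KNOWS"))
    print(g.neighbors("bob", "KNOWS"))  # edges are directed
```

Each node keeps its own list of outgoing edges, so a traversal touches only the edges of the nodes it visits instead of scanning an edge table.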