Transcript of Bringing Big Data Modelling into the Hands of Domain Experts
© 2015 The MathWorks, Inc.
Bringing Big Data Modelling into the
Hands of Domain Experts
David Willingham
Senior Application Engineer
MathWorks
“Data is the sword of the 21st century, those who
wield it the samurai.”
Big data — how to create
it, manipulate it, and put it
to good use.
“If you want to work at
Google, make sure you
can use MATLAB.”
Jonathan Rosenberg, former Google SVP
Who is looking to do Big Data Analytics now?
Software Developers
– Programmers using languages such as Java / .NET
– They may not be domain experts
End Users (non-technical)
– Non programmers
– They may or may not be domain experts
Domain Users
– Scientists, Engineers, Analysts, etc.
– Have some programming skills
– Might use MATLAB to prototype ideas, algorithms, and models
– They want their ideas to scale easily with the size of the data
Key Industries
Aerospace and defense
Automotive
Biotech and pharmaceutical
Communications
Education
Electronics and semiconductors
Energy production
Financial services
Industrial automation and machinery
Medical devices
Tesco uses supply chain analytics to
save £100m a year.
4 years of sales data held in
a Teradata data warehouse.
MATLAB is used to forecast
the optimum stock levels.
“We can run 100 stores for 100 days in about half an hour.
We can figure out quickly whether what we are doing is
right and we can optimise that.”
Preventing lethal ship collisions with whales
Cornell University collected
terabytes of ocean acoustic
data.
Crowdsourced algorithms
to detect and classify
animal signals in the
presence of noise.
“A data set that would have taken months to process can
now be processed multiple times in just a few days using
different detection algorithms.”
Figure: Earth's topography on a Miller cylindrical projection, created with MATLAB and Mapping Toolbox.
MathWorks Today
More than 1 million users worldwide.
3500 universities
21 Australian universities have a MATLAB campus-wide license.
Challenges of Big Data
“Any collection of data sets so large and complex that it becomes
difficult to process using … traditional data processing applications.”(Wikipedia)
Getting started
Rapid data exploration
Development of scalable algorithms
Ease of deployment
Big Data Capabilities in MATLAB
Memory and Data Access
64-bit processors
Memory Mapped Variables
Disk Variables
Databases
Datastores
Platforms
Desktop (Multicore, GPU)
Clusters
Cloud Computing (MDCS on EC2)
Hadoop
Programming Constructs
Streaming
Block Processing
Parallel-for loops
GPU Arrays
SPMD and Distributed Arrays
MapReduce
MATLAB and Memory: Best Practices for Memory Usage
Use the appropriate data storage
– Use only the precision you need
– Sparse Matrices
– Categorical Arrays
– Be aware of overhead of cells and structures
Minimize Data Copies
– In place operations, if possible
– Use nested functions
– If using objects, consider
handle classes
Considerations for Choosing an Approach
Data characteristics
– Size, type and location of your data
Compute platform
– Single desktop machine or cluster
Analysis Characteristics
– Embarrassingly Parallel
– Analyze sub-segments of data and aggregate results
– Operate on entire dataset
Techniques for Big Data in MATLAB

In order of increasing complexity, from in-memory, embarrassingly parallel workloads to out-of-memory, non-partitionable ones:
– Load, Analyze, Discard: parfor, datastore
– Distributed Memory: SPMD and distributed arrays
– MapReduce
Parallel Computing with MATLAB

Diagram: the MATLAB desktop (client) distributes work to a pool of workers.
Demo: Determining Land Use Using Parallel for-loops (parfor)
Data
– Aerial images of agricultural land
– 24 TIF files
Analysis
– Find and measure
irrigation fields
– Determine which irrigation
circles are in use (by color)
– Calculate area under irrigation
When to Use parfor
Data Characteristics
– Can be of any format (e.g. text, images),
as long as it can be broken into pieces
– The data for each iteration must
fit in memory
Compute Platform
– Desktop (Parallel Computing Toolbox)
– Cluster (MATLAB Distributed Computing Server)
Analysis Characteristics
– Each iteration of your loop must be independent
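The demo's loop structure can be sketched as follows; the file set and the segmentation step are illustrative placeholders, not the actual demo code:

```matlab
% Sketch: measure irrigated area across aerial images in parallel.
% Requires Parallel Computing Toolbox; file names are placeholders.
files = dir('*.tif');
areaPixels = zeros(numel(files), 1);
parfor k = 1:numel(files)               % iterations are independent
    img = imread(files(k).name);        % each worker loads one image
    bw = im2bw(img, graythresh(img));   % crude field segmentation
    areaPixels(k) = nnz(bw);            % pixels under irrigation
end
totalArea = sum(areaPixels);
```

Because each image is processed independently and fits in memory, parfor can hand the 24 TIF files to workers in any order.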
Access Big Data: datastore
Easily specify data set
– Single text file (or collection of text files)
Preview data structure and format
Select data to import
using column names
Incrementally read
subsets of the data

airdata = datastore('*.csv');
airdata.SelectedVariableNames = {'Distance', 'ArrDelay'};
data = read(airdata);
Demo: Vehicle Registry Analysis Using a datastore
Data
– Massachusetts Vehicle Registration
Data from 2008-2011
– 16M records, 45 fields
Analysis
– Examine hybrid adoption
– Calculate % of hybrids registered
by quarter
– Fit growth to predict further
adoption
When to Use datastore
Data Characteristics
– Text data in files, databases or stored in the
Hadoop Distributed File System (HDFS)
Compute Platform
– Desktop
Analysis Characteristics
– Supports Load, Analyze, Discard workflows
– Incrementally read chunks of data,
process within a while loop
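A minimal load-analyze-discard loop for the registration data might look like this sketch (the file and variable names are assumptions based on the slides):

```matlab
% Sketch: stream a large CSV in memory-sized chunks.
ds = datastore('vehicles.csv');
ds.SelectedVariableNames = {'Veh_typ', 'Hybrid'};
nHybrid = 0;
nTotal = 0;
while hasdata(ds)
    t = read(ds);                       % next chunk as a table
    nHybrid = nHybrid + sum(t.Hybrid);  % accumulate, then discard chunk
    nTotal = nTotal + height(t);
end
pctHybrid = 100 * nHybrid / nTotal;
```

Only the running totals persist between iterations, so the 16M-record file never has to fit in memory at once.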
Reading in Part of a Dataset from Files
Text file, ASCII file
– datastore
MAT file
– Load and save part of a variable using matfile
Binary file
– Read and write directly to/from file using memmapfile
– Maps address space to file
Databases
– ODBC- and JDBC-compliant (e.g. Oracle, MySQL, Microsoft SQL Server)
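For MAT-files, partial loading and saving look like this sketch ('bigdata.mat', 'results.mat', and the variables 'X' and 'Y' are hypothetical):

```matlab
% Sketch: read a block of a large variable without loading the file.
m = matfile('bigdata.mat');            % no data is loaded yet
[nRows, ~] = size(m, 'X');             % query the size on disk
block = m.X(1:min(nRows, 1000), :);    % load only these rows
w = matfile('results.mat', 'Writable', true);
w.Y(1:size(block, 1), 1) = sum(block, 2);  % save part of a variable
```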
Analyze Big Data: mapreduce
Use the powerful MapReduce programming
technique to analyze big data
– mapreduce uses a datastore to process data
in small chunks that individually fit into memory
– Useful for processing multiple keys, or when
intermediate results do not fit in memory
mapreduce on the desktop
– Increase compute capacity (Parallel Computing Toolbox)
– Access data on HDFS to develop algorithms for use on Hadoop
mapreduce with Hadoop
– Run on Hadoop using MATLAB Distributed Computing Server
– Deploy applications and libraries for Hadoop using MATLAB Compiler
********************************
* MAPREDUCE PROGRESS *
********************************
Map 0% Reduce 0%
Map 20% Reduce 0%
Map 40% Reduce 0%
Map 60% Reduce 0%
Map 80% Reduce 0%
Map 100% Reduce 25%
Map 100% Reduce 50%
Map 100% Reduce 75%
Map 100% Reduce 100%
mapreduce

Datastore → Map → Shuffle and Sort → Reduce

Input (vehicle registration records):

Veh_typ  Q3_08  Q4_08  Q1_09  Hybrid
Car        1      1      1      0
SUV        0      1      1      1
Car        1      1      1      1
Car        0      0      1      1
Car        0      1      1      1
Car        1      1      1      1
Car        0      0      1      1
SUV        0      1      1      0
Car        1      1      1      0
SUV        1      1      1      1
Car        0      1      1      1
Car        1      0      0      0

Map emits, for each quarter key (Q3_08, Q4_08, Q1_09), the Hybrid flags of the vehicles registered in that quarter. Shuffle and Sort groups the values by key, and Reduce computes each key's fraction of hybrids:

Key     % Hybrid (Value)
Q3_08   0.4
Q4_08   0.67
Q1_09   0.75
Demo: Vehicle Registry Analysis Using MapReduce
Data
– Massachusetts Vehicle Registration
Data from 2008-2011
– 16M records, 45 fields
Analysis
– Examine hybrid adoption
– Calculate % of hybrids registered by quarter and by regional area
– Create map of results
The Big Data Platform

Diagram: a datastore reads from HDFS; the data lives on the cluster nodes, and Map and Reduce tasks run on each node next to its data.
Explore and Analyze Data on Hadoop

Diagram: MATLAB MapReduce code runs through MATLAB Distributed Computing Server against a datastore on HDFS, with Map and Reduce tasks executing on the data nodes.
Deployed Applications with Hadoop

Diagram: MATLAB MapReduce code, deployed with the MATLAB runtime, runs Map and Reduce tasks on the Hadoop nodes against a datastore on HDFS.
When to Use mapreduce
Data Characteristics
– Text data in files, databases or stored in the
Hadoop Distributed File System (HDFS)
– Dataset will not fit into memory
Compute Platform
– Desktop
– Scales to run within Hadoop MapReduce
on data in HDFS
Analysis Characteristics
– Analysis must be partitionable into two phases:
1. Map: filter or process sub-segments of data
2. Reduce: aggregate interim results and calculate the final answer
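The two phases for the hybrid-percentage example could be sketched as below; the function, variable, and key names are illustrative, based on the slide's table:

```matlab
% Sketch: map emits each quarter's hybrid flags; reduce averages them.
function hybridMapper(data, ~, intermKVStore)
    for q = {'Q3_08', 'Q4_08', 'Q1_09'}
        reg = data.(q{1}) == 1;                  % registered that quarter
        add(intermKVStore, q{1}, data.Hybrid(reg));
    end
end

function hybridReducer(key, intermValIter, outKVStore)
    flags = [];
    while hasnext(intermValIter)
        flags = [flags; getnext(intermValIter)]; % gather mapped values
    end
    add(outKVStore, key, mean(flags));           % fraction of hybrids
end
```

These would be invoked as `result = mapreduce(ds, @hybridMapper, @hybridReducer);`, with `ds` a datastore over the registration files.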
Parallel Computing – Distributed Memory
Using more computers means more RAM: each machine contributes its cores and memory, so larger data fits in the collective memory of the cluster.
Distributed Arrays: Available from Parallel Computing Toolbox

A distributed array lives on the cluster (MATLAB Distributed Computing Server), its columns spread across the workers, yet you remotely manipulate it from the desktop as if it were an ordinary variable.
spmd blocks

spmd
    % single program across workers
end

Mix parallel and serial code in the same function.
Run on a pool of MATLAB resources.
Single Program: runs simultaneously across all workers.
Multiple Data: spread across multiple workers.
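A minimal sketch combining spmd with a distributed array (assumes Parallel Computing Toolbox and that a pool can be opened; sizes are arbitrary):

```matlab
% Sketch: each worker holds one portion of a distributed array.
parpool(4);                              % open a pool of workers
spmd
    D = codistributed.rand(10000, 100);  % spread across the workers
    colMeans = mean(D, 1);               % runs cooperatively across workers
end
result = gather(colMeans);               % collect the result on the client
delete(gcp('nocreate'));
```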
Example: Airline Delay Analysis
Data
– BTS/RITA Airline
On-Time Statistics
– 123.5M records, 29 fields
Analysis
– Calculate delay patterns
– Visualize summaries
– Estimate & evaluate
predictive models
When to Use Distributed Memory
Data Characteristics
– Data must fit in the collective
memory across machines
Compute Platform
– Prototype (subset of data) on desktop
– Run on a cluster or cloud
Analysis Characteristics
– Consists of:
Parts that can be run on data in memory (spmd)
Supported functions for distributed arrays
Big Data on the Desktop

Expand workspace:
  64-bit processor support – increased in-memory data set handling
Access portions of data too big to fit into memory:
  Memory-mapped variables – huge binary files
  Datastore – huge text file or collections of text files
  Database – query a portion of a big database table
Variety of programming constructs:
  System Objects – analyze streaming data
  MapReduce – process text files that won't fit into memory
Increase analysis speed:
  Parallel for-loops on multicore/multi-process machines
  GPU Arrays
Further Scaling Big Data Capacity

MATLAB supports a number of programming constructs for use with clusters:
  General compute clusters:
    Parallel for-loops – embarrassingly parallel algorithms
    SPMD and distributed arrays – distributed memory
  Hadoop clusters:
    MapReduce – analyze data stored in the Hadoop Distributed File System

Use these constructs on the desktop to develop your algorithms, then migrate to a cluster without algorithm changes.
Learn More
MATLAB Documentation
– Strategies for Efficient Use of Memory
– Resolving "Out of Memory" Errors
Big Data with MATLAB – www.mathworks.com/discovery/big-data-matlab.html
MATLAB MapReduce and Hadoop – www.mathworks.com/discovery/matlab-mapreduce-hadoop.html
EXTRA SLIDES
Running into Big Data Issues?
Performance
– Takes too long to process all
of your data
“Out of memory”
– Running out of address space
Slow processing (swapping)
– Data too large to be efficiently
managed between RAM and
virtual memory
MATLAB and Memory: Memory and Your Operating System
32-bit operating systems
– 4GB of addressable memory per process
– Part of it is reserved by the OS,
leaving the application < 4GB
64-bit operating systems
– 8TB of addressable memory
per process
– Limited by the amount of memory
available on the computer
Use 64-bit OS, if possible
Independent Tasks or Iterations

No dependencies or communications between tasks: the ideal problem for parallel computing (parfor).
Block Processing Images
blockproc automatically divides an
image into blocks for processing
Reduces memory usage
– Read and write block
directly from image file
Processes arbitrarily large images
Available from Image Processing Toolbox
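A sketch of the blockproc pattern (the file names are placeholders, and a grayscale TIFF is assumed):

```matlab
% Sketch: process an arbitrarily large TIFF one block at a time.
% Requires Image Processing Toolbox.
fun = @(blk) imadjust(blk.data);          % operate on a single block
blockproc('huge_input.tif', [1024 1024], fun, ...
          'Destination', 'huge_output.tif');  % stream result to disk
```

Reading and writing block-by-block keeps memory usage bounded by the block size rather than the image size.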
System Objects
A class of MATLAB objects that
support streaming workflows
Simplifies data access for streaming applications
– Manages flow of data from files or network
– Handles data indexing and buffering
Contain algorithms to work with streaming data
– Manages algorithm state
– Available for Signal Processing, Communications,
Video Processing, and Phased Array Applications
Available from DSP System Toolbox
Communications System Toolbox
Computer Vision System Toolbox
Phased Array System Toolbox
Image Acquisition Toolbox
Summary: Options for Handling Large Data

Platform:   Desktop Only             | Desktop + Cluster                          | Desktop + Hadoop
Data Size:  100s of MB – 10s of GB   | 100s of MB – 100s of GB                    | 100s of GB – PBs
Techniques: parfor; datastore;       | parfor; distributed data; spmd;            | mapreduce
            mapreduce                | mapreduce                                  |

Diagram: a standalone MATLAB desktop (client); a desktop submitting to a cluster scheduler; and a desktop submitting to a Hadoop scheduler on a Hadoop cluster.
Performance Gain with More Hardware

Diagram: performance can be increased by using more CPU cores (sharing cache and RAM) or by using GPUs (many cores with their own device memory).
Scaling Up Your Computations: MATLAB Distributed Computing Server
Transition from desktop to cluster with
minimal code changes
Supports both interactive and
batch workflows
Integrates with commonly
used schedulers
MapReduce Programming Model

Diagram: input files flow through a datastore into the Map phase; after the Reduce phase, results are written to output files. mapreducer controls the execution environment.
mapreducer: Controls the execution environment of the MapReduce operation.
mapreducer environments: serial, a local pool on a multi-core CPU, or a Hadoop cluster.
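Switching environments is a one-line change, sketched here (the Hadoop install folder is a placeholder and requires MATLAB Distributed Computing Server):

```matlab
% Sketch: the same mapreduce call runs in any of these environments.
mapreducer(0);          % serial, in the client MATLAB session
mapreducer(gcp);        % local pool across multi-core CPUs
% To target a Hadoop cluster:
% hd = parallel.cluster.Hadoop('HadoopInstallFolder', '/usr/local/hadoop');
% mapreducer(hd);
```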
datastore: Stores information regarding the input and output files.
mapreduce: Executes the MapReduce algorithm.
mapreduce takes three inputs:
– Input datastore object
– Map function
– Reduce function