© 2014 The MathWorks, Inc.
Data Analytics with MATLAB
Tackling the Challenges of Big Data
Adrienne James, PhD
MathWorks
7th October 2014
Big Data in Industry
ENERGY: Asset Optimization
FINANCE: Market Risk, Regulatory
AUTO: Fleet Data Analysis
AERO: Maintenance, Reliability
MEDICAL DEVICES: Patient Outcomes
PROCESSING OPTIONS
• MATLAB RESTful interface to Cluster
• MATLAB Hadoop Streaming
• NoSQL connector (e.g. mongo)
• MATLAB / Java App accessing Cluster
• MATLAB Map-Reduce Components
Key takeaways
New functions for analysing data that does not fit in memory on your desktop
– datastore
– mapreduce
& that can scale for use with Hadoop
Additional techniques for predictive modelling with large data
– Work with large data in memory on a cluster (spmd)
Deploy predictive models
– Bring MATLAB analytics to the Web
– Share analytics with a wider community of users
How big is big? What characterises “big” data?
Wikipedia
“Any collection of data sets so large and complex that it becomes difficult to
process using … traditional data processing applications.”
Volume : amount of data
Velocity : speed at which data is generated or needs to be analysed
Variety : range of data types/data sources
Considerations: Large Data Analytics
Data Characteristics
1. Size & type of data?
2. Where is your data?
3. What hardware do you have access to?
4. Analysis Characteristics?
Example: Airline Delay Analysis
Data
– BTS/RITA Airline On-Time Statistics
– 123.5M records, 29 fields
Analysis Tasks
– Calculate delay patterns
– Visualize summaries
– Estimate & evaluate predictive models
Considerations: Large Data Analytics
Airline Data Characteristics
1. Size & type of data?
CSV Data
22 files
12GB
Considerations: Large Data Analytics
Data Characteristics
1. Size & type of data?
2. Where is my data?
• Small subset available locally
• Entire data set stored elsewhere
Big Data Analysis with MATLAB – start on the desktop
Workflow stages: Access, Explore, Prototype, Scale, Share/Deploy
Work on your desktop
Start “simple”
Basic statistics
Explore data
Demo: Exploring departure delays using datastore
Explore approaches pre- & post-datastore
Start with a small subset …
What happens as the data size grows?
… until eventually it does not fit in memory on your desktop machine
Access & explore bigger data on the desktop more easily
Easily specify data set
– Single text file (or collection of text files)
– Database (using Database Toolbox)
Preview data structure and format
Customise data to import using column names
Incrementally read subsets of the data
airdata = datastore('*.csv');
airdata.SelectedVariableNames = {'Distance', 'ArrDelay'};
data = read(airdata);
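The incremental-read idea above can be sketched as a loop that processes one block of rows at a time. This is a minimal illustration, not the demo from the talk: it assumes the sample file airlinesmall.csv that ships with MATLAB is on the path, and the 'TreatAsMissing' setting and the variable names total, count and meanArrDelay are choices made for this example.

```matlab
% Sketch: mean arrival delay computed chunk by chunk, so the whole
% file never has to fit in memory at once.
% Assumption: airlinesmall.csv (MATLAB sample data) is on the path.
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = {'Distance', 'ArrDelay'};

preview(ds)                          % look at the first rows without a full read

total = 0;
count = 0;
while hasdata(ds)
    chunk = read(ds);                % table holding the next block of rows
    valid = ~isnan(chunk.ArrDelay);  % 'NA' entries were read as NaN
    total = total + sum(chunk.ArrDelay(valid));
    count = count + nnz(valid);
end
meanArrDelay = total / count;
```

Only running totals are kept between iterations, which is why the same pattern still works when the data set grows beyond desktop memory.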
datastore extends Data Access Landscape
[Diagram: data access functions by data type, from SMALL to increasing data size, pre- and post-datastore]
Text files: Import Tool, readtable, textscan (+ programming), datastore
Databases: database, database.ODBCConnection API
.MAT files: load, matfile
Binary files: fread, …, memmapfile
Images: imread, …, ImageAdapter
Streaming data: System objects
Considerations: Large Data Analytics
Data Characteristics
1. Size & type of data?
2. Where is your data?
• Small subset available locally
• Entire data set stored elsewhere
3. What hardware do you have access to?
4. Analysis Characteristics: initially, simple statistics & data exploration
Big Data Analysis with MATLAB
Workflow stages: Access, Explore, Prototype, Scale, Share/Deploy
Scale to a cluster
Start locally and then …..
What is Hadoop?
A Big Data Platform
[Diagram: Datastore, HDFS, and three nodes, each holding data and running Map and Reduce tasks]
A bit of audience participation – mapreduce ….
Introducing the mapreduce programming framework
Example: National popularity contest
Input files: newspaper pages
Intermediate files (local disk): for each page, how many times do “Steve”, “Emily” and “David” get mentioned?
Output files: total mentions
Steve 11%
Emily 58%
David 31%
mapreduce concept – group counts
Input files → Map → Intermediate files (local disk) → Reduce → Output files
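The group-count concept can be sketched with a pair of map and reduce functions. This is an illustrative example rather than the code shown in the talk: it assumes the sample file airlinesmall.csv that ships with MATLAB, and the function names countMapper and countReducer are hypothetical. Save it as a script (local functions at the end of a script need R2016b or later) or place the two functions in their own files.

```matlab
% Sketch: count flights per carrier with mapreduce.
% Assumption: airlinesmall.csv (MATLAB sample data) is on the path.
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = {'UniqueCarrier'};

result = mapreduce(ds, @countMapper, @countReducer);
counts = readall(result);            % table with variables Key and Value

function countMapper(data, ~, intermKVStore)
    % Map: count the carriers inside one chunk and emit (carrier, count) pairs.
    [carriers, ~, idx] = unique(data.UniqueCarrier);
    chunkCounts = accumarray(idx, 1);
    addmulti(intermKVStore, carriers, num2cell(chunkCounts));
end

function countReducer(carrier, intermValIter, outKVStore)
    % Reduce: add up the per-chunk counts for a single carrier.
    total = 0;
    while hasnext(intermValIter)
        total = total + getnext(intermValIter);
    end
    add(outKVStore, carrier, total);
end
```

The map step only ever sees one chunk, and the reduce step only sees the values for one key, which is what lets the same two functions run unchanged on a larger data set.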
Demo: Exploring mapreduce
Explore and Analyze Data on Hadoop
[Diagram: MATLAB MapReduce code runs through MATLAB Distributed Computing Server on the Hadoop nodes, with Map and Reduce on each node's data; a datastore accesses the data in HDFS]
ds = datastore('hdfs://myserver:7867/data/file1.txt');
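A rough sketch of how a mapreduce call might be pointed at a Hadoop cluster follows. It cannot run without a configured cluster, and several names are assumptions for illustration only: the install folder '/usr/lib/hadoop', the output location under hdfs://myserver:7867/, and the functions rowMapper and rowReducer. It uses parallel.cluster.Hadoop and mapreducer from Parallel Computing Toolbox.

```matlab
% Sketch: run mapreduce on Hadoop instead of the local session.
% Assumptions: the Hadoop install folder and the HDFS output path are placeholders.
cluster = parallel.cluster.Hadoop('HadoopInstallFolder', '/usr/lib/hadoop');
mr = mapreducer(cluster);            % execution environment for mapreduce

ds = datastore('hdfs://myserver:7867/data/file1.txt');
result = mapreduce(ds, @rowMapper, @rowReducer, mr, ...
    'OutputFolder', 'hdfs://myserver:7867/results');

function rowMapper(data, ~, intermKVStore)
    % Map: emit the number of rows in this chunk under a single key.
    add(intermKVStore, 'rows', height(data));
end

function rowReducer(key, intermValIter, outKVStore)
    % Reduce: sum the per-chunk row counts.
    total = 0;
    while hasnext(intermValIter)
        total = total + getnext(intermValIter);
    end
    add(outKVStore, key, total);
end
```

The map and reduce functions are the same kind used on the desktop; only the datastore location and the mapreducer argument change.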
Considerations: Large Data Analytics
Data Characteristics
1. Size & type of data?
2. Where is your data?
3. What hardware do you have access to? Cluster
4. Analysis Characteristics: explore predictive modelling
Big Data Analysis with MATLAB
Workflow stages: Access, Explore, Prototype, Scale, Share/Deploy
Scale to a cluster
Options for more involved algorithms …
• may require all data in memory
• multiple iterations …
Data Analytics Landscape
[Diagram: techniques plotted by data size (SMALL to increasing data size) against algorithm complexity (SIMPLE to COMPLEX), with an arrow labelled “More programming effort required”]
Algorithm types: easily partitioned, independent tasks; iterative; all data needed in memory at once
Techniques shown: built-in numerical & statistical algorithms, vectorisation, parfor, gpuarray, mapreduce, spmd, distributed arrays
Working with more “complex” algorithms with data in memory on a cluster
[Diagram: a Client sends instructions to MDCS, where data for the years 1987 to 1992 is held, and receives reduced data back]
Demo: Predictive Modelling
Logistic Regression & Neural Networks
10 busiest airport origins & 7 largest airline carriers
Explore & compare prediction quality of two models to predict flights delayed for more than 20 minutes
– Randomly partition data into test and training sets (cvpartition)
– Model #1: Logistic Regression
– Model #2: Neural Network
Predictor Variables: DayOfWeek, Origin, Airline, DepTime, Distance
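The partition-and-fit steps for Model #1 might look roughly like the sketch below. It is not the demo code: it assumes the sample file airlinesmall.csv that ships with MATLAB, uses arrival delay as the delay measure (the slide does not say which delay is meant), leaves out the Origin and Airline predictors and the neural network to stay short, and needs Statistics and Machine Learning Toolbox for cvpartition and fitglm. The 30% hold-out fraction and the 0.5 threshold are arbitrary choices.

```matlab
% Sketch: logistic regression for "delayed more than 20 minutes".
% Assumption: airlinesmall.csv (MATLAB sample data) is on the path.
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = {'DayOfWeek', 'DepTime', 'Distance', 'ArrDelay'};
t = readall(ds);

% Drop rows with missing values in the columns used below.
ok = ~isnan(t.DepTime) & ~isnan(t.Distance) & ~isnan(t.ArrDelay);
t = t(ok, :);

t.Late = double(t.ArrDelay > 20);        % response: 1 if more than 20 minutes late
t.DayOfWeek = categorical(t.DayOfWeek);  % treat the day as a category, not a number

rng(0);                                  % reproducible random split
c = cvpartition(height(t), 'HoldOut', 0.3);
trainT = t(training(c), :);
testT  = t(test(c), :);

mdl = fitglm(trainT, 'Late ~ DayOfWeek + DepTime + Distance', ...
    'Distribution', 'binomial');

p = predict(mdl, testT);                 % predicted probability of a late flight
accuracy = mean((p > 0.5) == testT.Late);
```

Accuracy on the held-out set is only one possible quality measure; with an unbalanced response, a confusion matrix or ROC curve would usually give a fairer comparison between models.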
Single Program, Multiple Data
Lab 1
>> mycode
Lab 2
>> mycode
Lab 3
>> mycode
Lab 4
>> mycode
Single Program, Multiple Data
Client:
spmd
    a = rand;
end
Parallel Pool (Cluster): Lab 1, Lab 2, Lab 3 and Lab 4 each run
a = rand;
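A small sketch of the same idea, showing how the client sees the workers' values afterwards. It assumes Parallel Computing Toolbox and a default pool profile; labindex and numlabs are the names from the era of these slides, and newer releases recommend spmdIndex and spmdSize instead.

```matlab
% Sketch: every worker in the pool runs the same spmd body on its own data.
pool = gcp;                          % current pool, started with the default profile if needed

spmd
    a = rand;                        % each worker draws its own value
    fprintf('Lab %d of %d: a = %.3f\n', labindex, numlabs, a);
end

% Back on the client, a is a Composite with one entry per worker.
firstValue = a{1};
allValues  = [a{:}];
```

Indexing the Composite with braces copies a worker's value back to the client, which is how reduced results are typically collected after the parallel section.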
Explore Big Data
Workflow stages: Access, Explore, Prototype, Scale, Share/Deploy
Subset data by filtering or variable selection and gain insight with visualization
Highlights: Airline Delay Analysis
Start small
Scale up
Quick prototyping on large data
Interactive exploration
Interspersed visualizations
Predictive modelling with large data
Deploy
Workflow stages: Access, Explore, Prototype, Scale, Share/Deploy
Deployment targets: Desktop, Web, Enterprise, Hadoop
Web Analytics: Analysis of traffic around Paris
http://rumeur.bruitparif.fr/
Predictive Data Analytics – Load Demand Forecasting
Demo
Station:
MATLAB on Hadoop
Two modes of operation
Execute mapreduce on Hadoop from your MATLAB desktop using MATLAB Distributed Computing Server
– Extends your desktop environment for use with Hadoop
– Execute algorithms within Hadoop MapReduce on data stored in HDFS
Create standalone applications or libraries for deploying to production instances of Hadoop
– Locked down package for use in production environments
– Integration of MATLAB analytics with operational systems
Key takeaways
New functions for analysing data that does not fit in memory on your desktop
– datastore
– mapreduce
& that can scale for use with Hadoop
Additional techniques for predictive modelling with large data
– Work with large data in memory on a cluster (spmd)
Deploy predictive models
– Bring MATLAB analytics to the Web
– Share analytics with a wider community of users
New Big Data Capabilities in MATLAB
Memory and Data Access
64-bit processors
Memory Mapped Variables
Disk Variables
Databases
Datastores
Platforms
Desktop (Multicore, GPU)
Clusters
Cloud Computing (MDCS on EC2)
Hadoop
Programming Constructs
Streaming
Block Processing
Parallel-for loops
GPU Arrays
SPMD and Distributed Arrays
MapReduce
Additional Resources
MathWorks Web Site
Big Data with MATLAB: http://www.mathworks.com/discovery/big-data-matlab.html
MapReduce & Hadoop: http://www.mathworks.com/discovery/matlab-mapreduce-hadoop.html
Machine Learning with MATLAB: http://www.mathworks.com/machine-learning/index.html
A selection of user stories
LiquidNet: Lean Data Analysis: The Awesome Data Dexterity of MATLAB Desktop
Ruukki Metals: Steel Manufacturing Process Analytics
CEESAR: Data Processing Framework Supporting Large Scale Driving Data Analysis
Daimler AG: Analyzing Test Data from a Worldwide Fleet of Fuel Cell Vehicles
Thank You