Using Apache Spark with IBM SPSS Modeler
Using Apache Spark with IBM SPSS Modeler
Dr. Steve R. Poulin
© Global Knowledge Training LLC. All rights reserved. Page 2
Dr. Steve Poulin, Principal Data Scientist & Manager of Predictive Analytics
Over 20 years' experience as an SPSS trainer and consultant
Holds a Ph.D. in Social Policy, Planning, and Policy Analysis from Columbia University
IBM Master Instructor with Global Knowledge
Worked with over 250 organizations that have used SPSS
Currently more heavily involved in consulting
Agenda
Intro Concepts
Enabling Apache Spark Applications
Gradient Boosted Trees with MLlib
K-Means with MLlib
Multinomial Naive Bayes with MLlib
Q&A
Follow-Ons & Additional References
Intro Concepts
What is Apache Spark?
Apache Spark [1] is an open-source cluster-computing framework that uses in-memory processing to run analytic applications up to 100 times faster than comparable technologies on the market today.
Apache Spark works within Hadoop and is an alternative to MapReduce.
Hadoop
Hadoop is a collection of open-source modules that are part of the Apache Project.
o The Apache Project is managed by the volunteer-run Apache Software Foundation.
One of the major components of Hadoop is the Hadoop Distributed File System (HDFS™), which is a distributed file system providing high-throughput access to application data.
MapReduce
MapReduce [2] is the processing engine for Apache Hadoop:
o A parallel processing system composed of a map procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a reduce procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
It is designed for the analysis of large datasets.
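The students example above can be sketched in a few lines of Python. This is a toy illustration of the map/shuffle/reduce pattern, not Hadoop's actual API; `map_fn` and `reduce_fn` are hypothetical stand-ins for the map and reduce procedures.

```python
from collections import defaultdict

# Toy MapReduce sketch of the students example: the map step emits
# (name, 1) pairs, the shuffle groups pairs by key into "queues",
# and the reduce step counts each queue, yielding name frequencies.
students = ["Ann", "Bob", "Ann", "Cal", "Bob", "Ann"]

def map_fn(record):
    # Map: filter/sort stage emits a key-value pair per record.
    return (record, 1)

def reduce_fn(key, values):
    # Reduce: summary operation over all values that share a key.
    return (key, sum(values))

# Shuffle: group the mapped pairs by key (one queue per name).
queues = defaultdict(list)
for key, value in map(map_fn, students):
    queues[key].append(value)

frequencies = dict(reduce_fn(k, v) for k, v in queues.items())
print(frequencies)  # {'Ann': 3, 'Bob': 2, 'Cal': 1}
```

In a real Hadoop cluster the map and reduce steps run in parallel across many nodes, and the shuffle moves data between them.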
MapReduce and Apache Spark
Apache Spark performs in-memory processing, whereas MapReduce moves data in and out of disk. [3]
As a result, Apache Spark can run programs up to 100x faster than MapReduce in memory, or 10x faster on disk.
Enabling Apache Spark Applications
IBM SPSS Modeler
Apache Spark is well-suited for running complex machine learning techniques on large datasets using its machine learning library (MLlib).
Although Apache Spark applications will run with any data source, they will only achieve these efficiencies when connected to the Analytic Server node, which enables IBM SPSS Modeler to use data from a Hadoop environment.
The following applications, which can be accessed from within IBM SPSS Modeler, will be demonstrated during this seminar:
o Gradient Boosted Trees with MLlib
o K-Means with MLlib
o Multinomial Naive Bayes with MLlib
IBM SPSS Analytic Server
IBM SPSS Analytic Server enables IBM SPSS Modeler to use data from Hadoop distributions.
This feature is found as a node in the Sources palette.
Although Apache Spark applications will run with data accessed from many data sources (e.g., SQL databases and text files), they will not achieve their full potential efficiency unless they are connected to a Hadoop data environment through IBM SPSS Analytic Server. [4]
Enabling IBM SPSS Modeler to Run Apache Spark Applications
Install a copy of Python 2.7 that includes NumPy, a Python component for scientific computing.
o Anaconda is a free package manager that includes Python with the NumPy component.
o The Python 2.7 Anaconda package can be downloaded from Continuum Analytics at: www.continuum.io/downloads
The following line of text must be added to your options.cfg file:
o eas_pyspark_python_path, "[location of the python.exe file in the Python installation that includes NumPy]"
o For example: eas_pyspark_python_path, "C:/Program Files/Anaconda2/python.exe"
The options.cfg file is located in the config folder of your IBM SPSS Modeler Program Files.
o For example: C:\Program Files\IBM\SPSS\Modeler\18.0\config
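Before editing options.cfg, it can help to confirm that the interpreter you intend to point eas_pyspark_python_path at actually includes NumPy. The sketch below runs a one-line import through that interpreter; the Anaconda path in the comment is an example from this deck, so substitute your own install location.

```python
import subprocess
import sys

def has_numpy(python_path):
    """Return True if the interpreter at python_path can import NumPy."""
    proc = subprocess.run([python_path, "-c", "import numpy"],
                          capture_output=True)
    return proc.returncode == 0

# Checks the current interpreter here; in practice, pass the python.exe
# path you intend to put in options.cfg, e.g.
# r"C:/Program Files/Anaconda2/python.exe".
print(has_numpy(sys.executable))
```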
Adding Spark Applications through IBM SPSS Modeler Extension Hub
The Extension Hub automatically connects to the IBM SPSS Predictive Analytics Gallery (http://ibmpredictiveanalytics.github.io) and presents the extensions in a dialog box.
IBM SPSS Modeler Extension Hub Dialog Box
Demos of the extensions can be obtained at: https://github.com/IBMPredictiveAnalytic
Gradient Boosted Trees with MLlib
Introduction
Like the Random Trees procedure, this procedure generates ensembles of decision trees, but it also iteratively trains decision trees in order to minimize a "loss function" (a penalty for mispredictions). [5]
The algorithm uses the current ensemble to predict the label of each training instance and then compares the prediction with the true label.
The dataset is re-labeled to put more emphasis on training instances with poor predictions.
Thus, in the next iteration, the decision tree will help correct for previous mistakes.
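The iterative loop described above can be illustrated with a minimal gradient-boosting sketch for regression under squared-error loss, where re-labeling with residuals puts more emphasis on poorly predicted instances. It uses depth-1 threshold "stumps" in place of full decision trees and is illustrative only; the helper names are hypothetical and MLlib's implementation is far more sophisticated.

```python
# Minimal gradient boosting for regression (squared-error loss).
def fit_stump(x, residuals):
    """Find the 1-D threshold split that best fits the residuals."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi, t=t, l=lmean, r=rmean: l if xi <= t else r

def boost(x, y, rounds=10, lr=0.5):
    pred = [0.0] * len(y)
    trees = []
    for _ in range(rounds):
        # Re-label with residuals: instances predicted poorly get
        # larger targets, so the next tree corrects previous mistakes.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        trees.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(lr * t(xi) for t in trees)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 4.1, 3.9, 4.2]
model = boost(x, y)
```

After a few rounds the ensemble's predictions settle close to the training targets, which is exactly the residual-correction behavior the slide describes.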
Loss Functions
| Loss | Task | Description |
| --- | --- | --- |
| Log Loss | Classification | Twice the binomial negative log-likelihood |
| Squared Error | Regression | Also called L2 loss; the default loss for regression tasks |
| Absolute Error | Regression | Also called L1 loss; can be more robust to outliers than Squared Error |
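The three losses can be written out directly. This is a sketch following the formulas in the MLlib ensembles documentation; for log loss, `y` is assumed to be a {-1, +1} label and `margin` the model's raw score F(x).

```python
import math

def log_loss(y, margin):
    """Classification: twice the binomial negative log-likelihood.
    y is the true label in {-1, +1}; margin is the raw score F(x)."""
    return 2.0 * math.log(1.0 + math.exp(-2.0 * y * margin))

def squared_error(y, pred):
    """Regression: L2 loss, the default for regression tasks."""
    return (y - pred) ** 2

def absolute_error(y, pred):
    """Regression: L1 loss; more robust to outliers than L2."""
    return abs(y - pred)
```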
Gradient Boosted Trees with MLlib Dialog Boxes
Gradient Boosted Trees with MLlib Dialog Boxes
One of the three Loss Functions is selected here
Gradient Boosted Trees with MLlib Output
Confidence scores
Gradient Boosted Trees with MLlib Stream:
LIVE DEMO
K-Means with MLlib
Introduction
The K-Means clustering technique has long been part of IBM SPSS Modeler and IBM SPSS Statistics.
The user specifies the number of clusters (the “K” value) to test.
In the traditional method, K individual records are selected based on their distinctive profiles, although there is some randomness in which records are selected.
The remaining records are assigned to the K clusters based on which of the initial records they are most similar to, as determined by the squared Euclidean distance measure.
Records can be re-assigned to make the clusters more distinctive.
K-Means with MLlib
The K-Means with MLlib procedure uses a machine-learning process to build the clusters. [6]
The distance measure used to determine which cluster each record is assigned to is labeled Epsilon.
Although the user still provides the K value, the final result may be less than K clusters.
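The underlying iterate-and-reassign loop can be sketched on one-dimensional data, with a convergence threshold playing the role of the Epsilon parameter and a fixed seed ensuring the same initial records are selected each run. This is a minimal Lloyd's-algorithm illustration, not MLlib's implementation; note that an empty cluster simply keeps its old center, which is one way the final result can amount to fewer than K distinct clusters.

```python
import random

def kmeans(points, k, epsilon=1e-4, max_iterations=20, seed=1):
    rng = random.Random(seed)         # fixed seed: same initial records
    centers = rng.sample(points, k)   # K records chosen as starting centers
    for _ in range(max_iterations):
        # Assign each record to its nearest center (squared distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Recompute centers; stop once no center moves more than epsilon.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if max(abs(a - b) for a, b in zip(centers, new_centers)) < epsilon:
            break
        centers = new_centers
    return sorted(centers)

print(kmeans([0.9, 1.0, 1.1, 9.8, 10.0, 10.2], k=2))  # close to [1.0, 10.0]
```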
K-Means with MLlib Dialog Boxes
K-Means with MLlib Dialog Boxes
When an iteration of cluster building improves the Epsilon value by less than this threshold, the cluster-building process stops. Lowering this value will increase processing time.
K-Means with MLlib Dialog Boxes
This only needs to be increased if there is an indication that the convergence threshold was not met.
K-Means with MLlib Dialog Boxes
This does not need to be changed for more recent versions of Spark.
K-Means with MLlib Dialog Boxes
The Initialization Mode determines how individual records are selected for the training process. The Random option randomly selects these records.
Without the use of a random seed, varying distributions of random numbers will be generated, resulting in the selection of different records each time the procedure is run. If this box is checked, the Random Seed value will ensure that the same initial records are selected.
K-Means with MLlib Dialog Boxes
The K-Means || option (also known as K-Means++) in the Initialization Mode section of the dialog box provides an alternative way to select the first records for the cluster-building process.
This option builds clusters more quickly than the use of randomly selected records but may not scale up well for large datasets.
The Initialization Steps setting applies only to this option.
K-Means with MLlib Output
Cluster membership values
K-Means with MLlib Stream:
LIVE DEMO
Multinomial Naive Bayes with MLlib
Multinomial Naive Bayes with MLlib
Naive Bayes is a classification algorithm that assumes independence (hence the term "naive") between every pair of predictors (called "features" in this procedure). [7]
As is the case for all classification procedures, it requires one target field and any number of predictors.
Within a single pass over the training data, it computes the conditional probability distribution of each feature given the target value, and then applies Bayes' theorem (the probability of an event based on prior knowledge of conditions that might be related to the event) to compute the conditional probability distribution of the target given an observation, which it uses for prediction.
Multinomial Naive Bayes with MLlib
Multinomial Naive Bayes (in contrast to other forms of Bayesian methods) uses fields representing the number of times items, such as words, have been found in a document.
This procedure is often used for document classification.
Multinomial Naive Bayes with MLlib Dialog Box
The Smoothing parameter addresses conditions that have a conditional probability of zero and should probably be left at its default value of 1.
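The word-count approach and the smoothing parameter can be sketched together in a compact multinomial Naive Bayes trainer. This is an illustration with made-up documents, not MLlib's implementation; smoothing=1.0 mirrors the dialog box's default, keeping unseen words from forcing a conditional probability of zero.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, smoothing=1.0):
    """docs are word-count dicts, one per document."""
    vocab = {w for d in docs for w in d}
    by_class = defaultdict(list)
    for d, y in zip(docs, labels):
        by_class[y].append(d)
    priors, cond = {}, {}
    for y, ds in by_class.items():
        priors[y] = math.log(len(ds) / len(docs))
        counts = Counter()
        for d in ds:
            counts.update(d)
        total = sum(counts.values())
        # Smoothed log P(word | class): never zero thanks to smoothing.
        cond[y] = {w: math.log((counts[w] + smoothing) /
                               (total + smoothing * len(vocab)))
                   for w in vocab}
    return priors, cond

def classify(priors, cond, doc):
    # Bayes' theorem in log space: log P(y) + sum_w count(w) * log P(w|y).
    # Words outside the training vocabulary are simply ignored here.
    def score(y):
        return priors[y] + sum(n * cond[y].get(w, 0.0)
                               for w, n in doc.items())
    return max(priors, key=score)

docs = [{"spark": 3, "cluster": 1}, {"spark": 2, "memory": 1},
        {"cat": 2, "dog": 3}, {"dog": 1, "cat": 1}]
labels = ["tech", "tech", "pets", "pets"]
priors, cond = train_mnb(docs, labels)
print(classify(priors, cond, {"spark": 1, "memory": 1}))  # tech
```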
Multinomial Naive Bayes with MLlib Output
Predicted outcomes
Multinomial Naive Bayes with MLlib Stream:
LIVE DEMO
Questions?
Steve Poulin
Still have questions? [email protected]
References: Further Reading
1. www.spark.apache.org
2. https://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
3. https://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce
4. http://www-03.ibm.com/software/products/en/spss-analytic-server
5. http://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts
6. http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
7. http://spark.apache.org/docs/1.5.2/mllib-naive-bayes.html
Next Steps
For a deeper dive into the concepts and tactics presented here, take a look at our available training:
Introduction to IBM SPSS Modeler and Data Mining (v18)
Predictive Modeling for Categorical Targets Using IBM SPSS Modeler (v18)
Advanced Predictive Modeling Using IBM SPSS Modeler (v18)