2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Things Conference


Description

Apache Spark is a new cluster computing engine offering a number of advantages over its predecessor, MapReduce. Apache Spark uses an in-memory cache to scale and parallelize iterative algorithms, which makes it ideal for large-scale machine learning. It is one of the most active open source projects in big data, surpassing even Hadoop MapReduce. In this talk, DB will introduce Spark and show how to use Spark's high-level API in Java, Scala, or Python. Then, he will show how to use MLlib, a library of machine learning algorithms for big data included in Spark, to do classification, regression, clustering, and recommendation at large scale.

Transcript of 2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Things Conference

Page 1

Large-Scale Machine Learning with Apache Spark

DB Tsai

Machine Learning Engineering Lead @ AlpineDataLabs
Internet of Things Conference @ Moscone Center, SF
http://www.iotaconf.com/
October 20, 2014

Page 2

The Path to Innovation

TRADITIONAL DESKTOP → IN-DATABASE METHODS → WEB-BASED AND COLLABORATIVE → SIMPLIFIED CODE-FREE HADOOP & MPP DATABASE → ONGOING INNOVATION

Page 3

The Path to Innovation

• Iterative algorithms scan through the data each time
• With Spark, data is cached in memory after the first iteration
• Quasi-Newton methods enhance the in-memory benefits

[Chart: benchmark on 150M rows, 921s vs. 97s]

Page 4

Machine Learning in the Big Data Era

• Hadoop MapReduce solutions
• MapReduce scales well for batch processing
• Lots of machine learning algorithms are iterative by nature
• There are lots of tricks people use, like training with sub-samples of the data and then averaging the models. Why have big data if you're only approximating?

Page 5

Lightning-fast cluster computing

•  Empower users to iterate through the data by utilizing the in-memory cache.

•  Logistic regression runs up to 100x faster than Hadoop M/R in memory.

•  We’re able to train exact models without doing any approximation.

Page 6

Why MLlib?

•  MLlib is a Spark subproject providing Machine Learning primitives

•  It’s built on Apache Spark, a fast and general engine for large-scale data processing

• Shipped with Apache Spark since version 0.8
• High-quality engineering design and effort
• More than 50 contributors since July 2014

Page 7

Algorithms supported in MLlib

• Classification: SVMs, logistic regression, decision trees, naïve Bayes, and random forests
• Regression: linear regression, and random forests
• Collaborative filtering: alternating least squares (ALS)
• Clustering: k-means
• Dimensionality reduction: singular value decomposition (SVD), and principal component analysis (PCA)
• Basic statistics: summary statistics, correlations, stratified sampling, hypothesis testing, and random data generation
• Feature extraction and transformation: TF-IDF, Word2Vec, StandardScaler, and Normalizer

Page 8

MapReduce Review

• MapReduce: Simplified Data Processing on Large Clusters (Dean & Ghemawat, 2004)
• Scales linearly
• Data locality
• Fault tolerance in data storage and computation

Page 9

Hadoop MapReduce Review

• Mapper: Loads the data and emits a set of key-value pairs.
• Reducer: Collects the key-value pairs with the same key, processes them, and outputs the result.
• Combiner: Can reduce shuffle traffic by combining key-value pairs locally before they go to the reducer.
• In-Mapper Combiner: Aggregates the results on the mapper side, using an LRU cache to avoid running out of heap space. http://alpinenow.com/blog/in-mapper-combiner/
• Good: Built-in fault tolerance, scalable, and production-proven in industry.
• Bad: Optimized for disk IO without leveraging memory well; iterative algorithms go through disk IO again and again; the primitive API is not easy and clean to develop against.

Page 10

Spark MapReduce

• Spark also uses MapReduce as a programming model, but with much richer APIs in Scala, Java, and Python.
• With Scala's expressive APIs, 5-10x less code.
• Not just a distributed computation framework: Spark provides several pre-built components that help users implement applications faster and more easily.
  - Spark Streaming
  - Spark SQL
  - MLlib (Machine Learning)
  - GraphX (Graph Processing)

Page 11

Resilient Distributed Datasets (RDDs)

• An RDD is a fault-tolerant collection of elements that can be operated on in parallel.

• RDDs can be created by parallelizing an existing collection in your driver program, or by referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, Hive, or any data source offering a Hadoop InputFormat.

• RDDs can be cached in memory or on disk.
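As a minimal sketch of both creation paths (written for a standalone app; in spark-shell, sc already exists, and the path and values here are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("rdd-demo"))

  // Parallelize an existing collection in the driver program.
  val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

  // Reference a dataset in external storage (any Hadoop-supported URI).
  val lines = sc.textFile("hdfs:///data/events.txt")

  // Cache so later actions can reuse it without re-reading the file.
  lines.cache()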

Page 12

Hadoop M/R vs Spark M/R

[Code screenshots comparing the same job written in Hadoop MapReduce and in Spark]

Page 13

RDD Operations: two types of operations

• Transformations: Create a new dataset from an existing one. They are lazy, in that they do not compute their results right away.

• Actions: Return a value to the driver program after running a computation on the dataset.
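A tiny sketch of the laziness (the path is a placeholder): the map below only records lineage; nothing is computed until the action runs.

  // Transformation: lazy, returns a new RDD immediately.
  val lengths = sc.textFile("hdfs:///data/events.txt").map(line => line.length)

  // Action: triggers the computation and returns a value to the driver.
  val totalChars = lengths.reduce(_ + _)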

Page 14

Transformations

• map(func) - Return a new distributed dataset formed by passing each element of the source through a function func.
• filter(func) - Return a new dataset formed by selecting those elements of the source on which func returns true.
• flatMap(func) - Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
• mapPartitions(func) - Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
• groupByKey([numTasks]) - When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
• reduceByKey(func, [numTasks]) - When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.

http://spark.apache.org/docs/latest/programming-guide.html#transformations

Page 15

Actions

• reduce(func) - Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
• collect() - Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
• count(), first(), take(n), saveAsTextFile(path), etc.

http://spark.apache.org/docs/latest/programming-guide.html#actions

Page 16

RDD Persistence/Cache

• An RDD can be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

• A persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, or to persist it in memory but as serialized Java objects (to save space).

Page 17

RDD Storage Level

• MEMORY_ONLY - Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
• MEMORY_AND_DISK - Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
• MEMORY_ONLY_SER - Store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
• MEMORY_AND_DISK_SER - Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
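As a sketch, selecting a non-default level looks like this (the path is a placeholder):

  import org.apache.spark.storage.StorageLevel

  val events = sc.textFile("hdfs:///data/events.txt")

  // Keep what fits in memory as deserialized objects; spill the rest to disk.
  events.persist(StorageLevel.MEMORY_AND_DISK)

  // Note: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).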

Page 18

Word Count Example in Scala
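The code on this slide is a screenshot in the transcript; the classic Spark word count it presents looks roughly like this (input and output paths are placeholders):

  import org.apache.spark.SparkContext._  // pair-RDD functions (needed in compiled Spark 1.x apps)

  val textFile = sc.textFile("hdfs:///data/input.txt")

  val counts = textFile
    .flatMap(line => line.split(" "))  // one record per word
    .map(word => (word, 1))            // pair each word with a count of 1
    .reduceByKey(_ + _)                // sum the counts per word

  counts.saveAsTextFile("hdfs:///data/word-counts")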


Page 22

API design philosophy in MLlib

• Works seamlessly with Spark Core and Spark SQL; users can use the core APIs or Spark SQL for data pre-processing, and then pipe the result into the training step.
• Algorithms are implemented in Scala. Public interfaces don't use advanced Scala features, to ensure Java compatibility.
• Many of MLlib's APIs have Python bindings.
• MLlib is under active development. The APIs marked Experimental/DeveloperApi may change in future releases, and a migration guide will be provided if they change.
• APIs are well documented and designed to be expressive.
• The code is well tested, with comprehensive unit-test coverage. There are lots of comments in the code, and it's an enjoyable experience to read.

Page 23

Data Types

• MLlib local vectors and local matrices currently wrap the Breeze implementation; as a result, the underlying linear algebra operations are provided by Breeze and jblas. https://github.com/scalanlp/breeze
• However, the methods converting MLlib vectors/matrices to Breeze ones, and the other way around, are private to the org.apache.spark.mllib scope. This restriction can be worked around by putting your custom code in an org.apache.spark.mllib.something package.
• A training sample used in supervised learning is stored in a LabeledPoint, which contains a label/response and a feature vector in dense or sparse format.
• Distributed RowMatrix - basically an RDD[Vector] which doesn't have meaningful row indices.
• Distributed IndexedRowMatrix - similar to RowMatrix, but each row is represented by its index and a local vector.

Page 24

Local vector

The base class of local vectors is Vector, and we provide two implementations: DenseVector and SparseVector.
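For example:

  import org.apache.spark.mllib.linalg.{Vector, Vectors}

  // Dense vector (1.0, 0.0, 3.0).
  val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

  // The same vector in sparse form: size 3, non-zeros at indices 0 and 2.
  val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))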

Page 25

Some useful tips related to local vectors

• If you want to use native Breeze functionality, you can put your code in the org.apache.spark.mllib package.

Page 26

Real code in MLlib: MultivariateOnlineSummarizer

Page 27

LabeledPoint

• A Double is used for storing the label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0.0 or 1.0. For N-class classification, labels should be class indices starting from zero: 0.0, 1.0, 2.0, ..., N - 1.
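For example:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // A positive example with a dense feature vector.
  val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

  // A negative example with a sparse feature vector.
  val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))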

Page 28

Supervised Learning

• Binary Classification: linear SVMs (SGD), logistic regression (L-BFGS and SGD), decision trees, random forests (Spark 1.2), and naïve Bayes.
• Multiclass Classification: decision trees, naïve Bayes (coming soon: multinomial logistic regression in GLMNET).
• Regression: linear least squares (SGD), Lasso (SGD + soft-threshold), ridge regression (SGD), decision trees, and random forests (Spark 1.2).
• Currently, the regularization in the linear models penalizes all the weights, including the intercept, which is not desired in some use-cases. Alpine has a GLMNET implementation using OWLQN which can exactly reproduce the results of R's GLMNET package with scalability. We're in the process of merging it into the MLlib community.

Page 29

LinearRegressionWithSGD
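The code on this slide is a screenshot; a minimal sketch of the API (the path, the "label,f1 f2 f3" input format, and the iteration count are assumptions):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

  val data = sc.textFile("hdfs:///data/regression.txt").map { line =>
    val parts = line.split(',')
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
  }.cache()

  val model = LinearRegressionWithSGD.train(data, 100)  // 100 SGD iterations

  // Predict the label of the first training point.
  val prediction = model.predict(data.first().features)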

Page 30

SVMWithSGD
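Likewise a screenshot; a sketch with an assumed LIBSVM-format input path:

  import org.apache.spark.mllib.classification.SVMWithSGD
  import org.apache.spark.mllib.util.MLUtils

  val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/sample_libsvm_data.txt")
  val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
  val (training, test) = (splits(0).cache(), splits(1))

  val model = SVMWithSGD.train(training, 100)  // 100 SGD iterations

  // (prediction, label) pairs for evaluation.
  val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))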

Page 31

SPARK-2934: LogisticRegressionWithLBFGS

• Merged in Spark 1.1
• Contributed by Alpine Data Labs
• Uses L-BFGS to train logistic regression instead of the default gradient descent.
• Users don't have to construct the objective function for logistic regression, or implement the details themselves.
• Together with SPARK-2979, which minimizes the condition number, the convergence rate is further improved.
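A sketch of the resulting API (the input path is a placeholder):

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.util.MLUtils

  val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/sample_libsvm_data.txt")

  // L-BFGS-based training; no hand-built objective function needed.
  val model = new LogisticRegressionWithLBFGS().run(data)

  val predictionAndLabel = data.map(p => (model.predict(p.features), p.label))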

Page 32

SPARK-2979: Improve the convergence rate by standardizing the training features

• Merged in Spark 1.1
• Contributed by Alpine Data Labs
• Due to the invariance property of MLEs, the scale of your inputs is irrelevant.
• However, the optimizer will not be happy with poor condition numbers, which can often be improved by scaling.
• The model is trained in the scaled space, but the coefficients are converted back to the original space; as a result, it's transparent to users.
• Without this, some training datasets mixing columns with different scales may not be able to converge.
• The scikit-learn and glmnet packages also standardize the features before training to improve convergence.
• Only enabled in logistic regression for now.


Page 34

a9a Dataset Benchmark

[Chart: Logistic Regression with the a9a dataset (11M rows, 123 features, 11% non-zero elements); 16 executors on a 5-node Intel Xeon E3-1230v3 / 32GB-memory Hadoop 2.0.5-alpha cluster. X-axis: iterations; Y-axis: log-likelihood / number of samples; curves: L-BFGS vs. GD.]

Page 35

rcv1 Dataset Benchmark

[Chart: Logistic Regression with the rcv1 dataset (6.8M rows, 677,399 features, 0.15% non-zero elements); 16 executors on a 5-node Intel Xeon E3-1230v3 / 32GB-memory Hadoop 2.0.5-alpha cluster. X-axis: seconds; Y-axis: log-likelihood / number of samples; curves: L-BFGS (sparse vector) vs. GD (sparse vector).]

Page 36

news20 Dataset Benchmark

[Chart: Logistic Regression with the news20 dataset (0.14M rows, 1,355,191 features, 0.034% non-zero elements); 16 executors on a 5-node Intel Xeon E3-1230v3 / 32GB-memory Hadoop 2.0.5-alpha cluster. X-axis: seconds; Y-axis: log-likelihood / number of samples; curves: L-BFGS (sparse vector) vs. GD (sparse vector).]

Page 37

K-Means
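The slide's code is a screenshot; a minimal sketch (the path, k, and iteration count are assumptions):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  // Whitespace-separated numeric rows -> vectors.
  val parsedData = sc.textFile("hdfs:///data/kmeans_data.txt")
    .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
    .cache()

  val clusters = KMeans.train(parsedData, 2, 20)  // k = 2, 20 iterations

  // Within-set sum of squared errors; lower means tighter clusters.
  val wssse = clusters.computeCost(parsedData)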

Page 38

PCA + K-Means
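Also a screenshot; a sketch that reuses parsedData from the K-Means example and assumes 10 principal components:

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val mat = new RowMatrix(parsedData)          // parsedData: RDD[Vector]
  val pc = mat.computePrincipalComponents(10)  // local 10-column matrix
  val projected = mat.multiply(pc).rows        // rows projected into PCA space

  val clusters = KMeans.train(projected, 2, 20)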

Page 39

Collaborative Filtering
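Also a screenshot; a sketch of the ALS API (the "user,product,rating" format, path, and hyperparameters are assumptions):

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
    val parts = line.split(',')
    Rating(parts(0).toInt, parts(1).toInt, parts(2).toDouble)
  }

  val model = ALS.train(ratings, 10, 10, 0.01)  // rank 10, 10 iterations, lambda 0.01

  // Predict the rating user 1 would give product 42.
  val score = model.predict(1, 42)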

Page 40

Spark-1157: L-BFGS Optimizer

• No, it's not a blender!

Page 41

What is Spark-1157: L-BFGS Optimizer

• Merged in Spark 1.0
• Contributed by Alpine Data Labs
• A popular algorithm for parameter estimation in machine learning.
• It's a quasi-Newton method.
• The Hessian matrix of second derivatives doesn't need to be evaluated directly.
• Instead, the Hessian matrix is approximated using gradient evaluations.
• It converges much faster than the default optimizer in Spark, gradient descent.
• We are contributing OWLQN, a variant of L-BFGS that handles the L1 problem, to Spark. It's a building block of GLMNET.


Page 43

SPARK-2505: Weighted Regularization (ongoing work)

• Each component of the weights can be penalized differently.
• We can exclude the intercept from regularization in this framework.
• Decouples regularization from the raw gradient update, which is not used in other optimization schemes.
• Allows various update/learning-rate schemes (AdaGrad, normalized adaptive gradient, etc.) to be applied independently of the regularization.
• Smooth and L1 regularization will be handled differently in the optimizer.

Page 44

SPARK-2309: Multinomial Logistic Regression (ongoing work)

• For a K-class multinomial problem, we can generalize it via K - 1 linear models with logit link functions.
• As a result, the weights will have dimension (K - 1)(N + 1), where N is the number of features.
• The MLlib interface is designed for one set of parameters per model, so this requires some interface design changes.
• Expected to be merged in the next release of MLlib, Spark 1.2.

Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297

Page 45

SPARK-2272: Transformer

A spark, the soul of a transformer

Page 46

SPARK-2272: Transformer

• Merged in Spark 1.1
• Contributed by Alpine Data Labs
• MLlib's data preprocessing pipeline.
• StandardScaler
  - Standardizes features by removing the mean and scaling to unit variance.
  - The RBF kernel of Support Vector Machines and the L1 and L2 regularizers of linear models typically work better with zero mean and unit variance.
• Normalizer
  - Normalizes samples individually to unit L^n norm.
  - A common operation for text classification or clustering, for instance.
  - For example, the dot product of two L2-normalized TF-IDF vectors is the cosine similarity of the vectors.

Page 47

StandardScaler
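The slide's code is a screenshot; a sketch, assuming data is an RDD[Vector] of features:

  import org.apache.spark.mllib.feature.StandardScaler

  // Fit on the features, then transform to zero mean and unit variance.
  val scaler = new StandardScaler(withMean = true, withStd = true).fit(data)
  val scaled = data.map(v => scaler.transform(v))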

Page 48

Normalizer
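Also a screenshot; a sketch, again assuming data: RDD[Vector]:

  import org.apache.spark.mllib.feature.Normalizer

  // L2 norm by default; new Normalizer(p) selects the L^p norm.
  val normalizer = new Normalizer()
  val unitVectors = data.map(v => normalizer.transform(v))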

Page 49

SPARK-1969: Online summarizer

• Merged in Spark 1.1
• Contributed by Alpine Data Labs
• Online algorithms for computing the mean, variance, min, and max in a streaming fashion.
• Two online summarizers can be merged, so we can use one summarizer per block of data in the map phase, and merge all of them in the reduce phase to obtain the global summary.
• A numerically stable one-pass algorithm is implemented to avoid the catastrophic cancellation of the naive implementation. Ref: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
• Optimized for sparse vectors; the time complexity is O(non-zeros) instead of O(numCols) per sample.

[Equation images: the naive algorithm and the two-pass algorithm for variance]
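A sketch of the merge property with two hand-built summarizers (the values are arbitrary):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

  // One summarizer per block of data in the map phase...
  val s1 = new MultivariateOnlineSummarizer()
  s1.add(Vectors.dense(1.0, 2.0))
  s1.add(Vectors.dense(3.0, 4.0))

  val s2 = new MultivariateOnlineSummarizer()
  s2.add(Vectors.dense(5.0, 6.0))

  // ...merged in the reduce phase to obtain the global statistics.
  val global = s1.merge(s2)
  println(global.mean)      // element-wise mean
  println(global.variance)  // element-wise variance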


Page 52

Spark SQL

• Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark. At the core of this component is a new type of RDD, SchemaRDD.

• SchemaRDDs are composed of Row objects, along with a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table in a traditional relational database.

• A SchemaRDD can be created from an existing RDD, a Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive.

http://spark.apache.org/docs/latest/sql-programming-guide.html
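A sketch of the Spark 1.1-era API (the file, its comma-separated format, and the Person schema are hypothetical):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

  case class Person(name: String, age: Int)

  val people = sc.textFile("hdfs:///data/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))

  people.registerTempTable("people")

  val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")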

Page 53

Spark SQL + MLlib

• With Spark SQL, users can easily load Parquet/Avro datasets into Spark and perform the data pre-processing before the training steps.
• MLlib is considering using SchemaRDD as a native typed data format, like R's data frame. This would allow us to create output models with types and column names, and would also make it easier to create PMML models.


Page 55

Example: Prepare training data using Spark SQL
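The code on this slide is a screenshot; a sketch of the idea, reusing sqlContext from the previous example, with hypothetical table and column names:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // Pre-process with SQL, then build labeled points for MLlib training.
  val rows = sqlContext.sql(
    "SELECT label, f1, f2, f3 FROM training_table WHERE label IS NOT NULL")

  val trainingData = rows.map { row =>
    LabeledPoint(
      row.getDouble(0),
      Vectors.dense(row.getDouble(1), row.getDouble(2), row.getDouble(3)))
  }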


Page 57

Interested in MLlib?

• MLlib official guide - https://spark.apache.org/docs/latest/mllib-guide.html
• GitHub - https://github.com/apache/spark
• Mailing lists - user@spark.apache.org or dev@spark.apache.org

Page 58

For more information, contact us

1550 Bryant Street, Suite 1000
San Francisco, CA 94103 USA
+1 (877) 542-0062
www.alpinenow.com

Get Started Today!

http://start.alpinenow.com