Multinomial Logistic Regression with Apache Spark

Click here to load reader

  • date post

  • Category


  • view

  • download


Embed Size (px)


Logistic Regression can not only be used for modeling binary outcomes but also multinomial outcome with some extension. In this talk, DB will talk about basic idea of binary logistic regression step by step, and then extend to multinomial one. He will show how easy it's with Spark to parallelize this iterative algorithm by utilizing the in-memory RDD cache to scale horizontally (the numbers of training data.) However, there is mathematical limitation on scaling vertically (the numbers of training features) while many recent applications from document classification and computational linguistics are of this type. He will talk about how to address this problem by L-BFGS optimizer instead of Newton optimizer. Bio: DB Tsai is a machine learning engineer working at Alpine Data Labs. He is recently working with Spark MLlib team to add support of L-BFGS optimizer and multinomial logistic regression in the upstream. He also led the Apache Spark development at Alpine Data Labs. Before joining Alpine Data labs, he was working on large-scale optimization of optical quantum circuits at Stanford as a PhD student.

Transcript of Multinomial Logistic Regression with Apache Spark

  • Multinomial Logistic Regression with Apache Spark DB Tsai Machine Learning Engineer May 1st , 2014
  • Machine Learning with Big Data Hadoop MapReduce solutions MapReduce is scaling well for batch processing However, lots of machine learning algorithms are iterative by nature. There are lots of tricks people do, like training with subsamples of data, and then average the models. Why big data with approximation? + =
  • Lightning-fast cluster computing Empower users to iterate through the data by utilizing the in-memory cache. Logistic regression runs up to 100x faster than Hadoop M/R in memory. We're able to train exact model without doing any approximation.
  • Binary Logistic Regression d = ax1+bx2+cx0 a2 +b2 where x0=1 P( y=1x , w)= exp(d ) 1+exp(d ) = exp(x w) 1+exp(x w) P ( y=0x , w)= 1 1+exp(x w) log P ( y=1x , w) P( y=0x , w) =x w w0= c a2 +b2 w1= a a2 +b2 w2= b a2 +b2 wherew0 iscalled as intercept
  • Training Binary Logistic Regression Maximum Likelihood estimation From a training data and labels We want to find that maximizes the likelihood of data defined by We can take log of the equation, and minimize it instead. The Log-Likelihood becomes the loss function. X =( x1 , x2 , x3 ,...) Y =( y1, y2, y3, ...) w L( w , x1, ... , xN )=P ( y1x1 , w)P ( y2x2 , w)... P( yN xN , w) l( w ,x)=log P(y1x1 , w)+log P( y2x2 , w)...+log P( yN xN , w)
  • Optimization First Order Minimizer Require loss, gradient of loss function Gradient Decent is step size Limited-memory BFGS (L-BFGS) Orthant-Wise Limited-memory Quasi-Newton (OWLQN) Coordinate Descent (CD) Trust Region Newton Method (TRON) Second Order Minimizer Require loss, gradient and hessian of loss function Newton-Raphson, quadratic convergence. Fast! Ref: Journal of Machine Learning Research 11 (2010) 3183- 3234, Chih-Jen Lin et al. wn+1=wn G , wn+1=wnH1 G
  • Problem of Second Order Minimizer Scale horizontally (the numbers of training data) by leveraging Spark to parallelize this iterative optimization process. Don't scale vertically (the numbers of training features). Dimension of Hessian is Recent applications from document classification and computational linguistics are of this type. dim(H )=[(k1)(n+1)] 2 where k is numof class ,nis num of features
  • L-BFGS It's a quasi-Newton method. Hessian matrix of second derivatives doesn't need to be evaluated directly. Hessian matrix is approximated using gradient evaluations. It converges a way faster than the default optimizer in Spark, Gradient Decent. We love open source! Alpine Data Labs contributed our L-BFGS to Spark, and it's already merged in Spark-1157.
  • Training Binary Logistic Regression l( w ,x) = k=1 N log P( ykxk , w) = k=1 N yk log P ( yk=1xk , w)+(1yk )log P ( yk =0xk , w) = k=1 N yk log exp( xk w) 1+exp( xk w) +(1yk )log 1 1+exp( xk w) = k=1 N yk xk wlog(1+exp( xk w)) Gradient : Gi (w ,x)= l( w ,x) wi =k=1 N yk xki exp( xk w) 1+exp( xk w) xki Hessian: H ij (w ,x)= l( w ,x) wi w j =k=1 N exp( xk w) (1+exp( xk w))2 xki xkj
  • Overfitting P ( y=1x , w)= exp(zd ) 1+exp(zd ) = exp(x w) 1+exp(x w)
  • Regularization The loss function becomes The loss function of regularizer doesn't depend on data. Common regularizers are L2 Regularization: L1 Regularization: L1 norm is not differentiable at zero! ltotal (w ,x)=lmodel (w ,x)+lreg( w) lreg (w)=i=1 N wi 2 lreg (w)=i=1 N wi G (w ,x)total=G(w ,x)model+G( w)reg H ( w ,x)total= H (w ,x)model+ H (w)reg
  • Mini School of Spark APIs map(func) : Return a new distributed dataset formed by passing each element of the source through a function func. reduce(func) : Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. No key thing here compared with Hadoop M/R.
  • Example No More Word Count! Let's compute the mean of numbers. 3.0 2.0 1.0 5.0 Executor 1 Executor 2 map map (1.0, 1) (5.0, 1) (3.0, 1) (2.0, 1) reduce (5.0, 2) reduce (11.0, 4) Shuffle
  • The previous example using map(func) will create new tuple object, which may cause Garbage Collection issue. Like the combiner in Hadoop Map Reduce, for each tuple emitted from map(func), there is no guarantee that they will be combined locally in the same executor by reduce(func). It may increase the traffic in shuffle phase. In Hadoop, we address this by in-mapper combiner or aggregating the result in global variables which have scope to entire partition. The same approach can be used in Spark.
  • Mini School of Spark APIs mapPartitions(func) : Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator[T] => Iterator[U] when running on an RDD of type T. This API allows us to have global variables on entire partition to aggregate the result locally and efficiently.
  • Better Mean of Numbers Implementation 3.0 2.0 Executor 1 reduce (11.0, 4) Shuffle var counts = 0 var sum = 0.0 loop var counts = 2 var sum = 5.0 3.0 2.0 (5.0, 2) 1.0 5.0 Executor 2 var counts = 0 var sum = 0.0 loop var counts = 2 var sum = 6.0 1.0 5.0 (6.0, 2) mapPartitions
  • More Idiomatic Scala Implementation aggregate(zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U zeroValue is a neutral "zero value" with type U for initialization. This is analogous to in previous example. var counts = 0 var sum = 0.0 seqOp is function taking (U, T) => U, where U is aggregator initialized as zeroValue. T is each line of values in RDD. This is analogous to mapPartition in previous example. combOp is function taking (U, U) => U. This is essentially combing the results between different executors. The functionality is the same as reduce(func) in previous example.
  • Approach of Parallelization in Spark Spark Driver JVM Time Spark Executor 1 JVM Spark Executor 2 JVM 1) Find available resources in cluster, and launch executor JVMs. Also initialize the weights. 2) Ask executors to load the data into executors' JVMs 2) Trying to load the data into memory. If the data is bigger than memory, it will be partial cached. (The data locality from source will be taken care by Spark) 3) Ask executors to compute loss, and gradient of each training sample (each row) given the current weights. Get the aggregated results after the Reduce Phase in executors. 5) If the regularization is enabled, compute the loss and gradient of regularizer in driver since it doesn't depend on training data but only depends on weights. Add them into the results from executors. 3) Map Phase: Compute the loss and gradient of each row of training data locally given the weights obtained from the driver. Can either emit each result or sum them up in local aggregators. 4) Reduce Phase: Sum up the losses and gradients emitted from the Map Phase Taking a rest! 6) Plug the loss and gradient from model and regularizer into optimizer to get the new weights. If the differences of weights and losses are larger than criteria, GO BACK TO 3) Taking a rest! 7) Finish the model training! Taking a rest!
  • Step 3) and 4) This is the implementation of step 3) and 4) in MLlib before Spark 1.0 gradient can have implementations of Logistic Regression, Linear Regression, SVM, or any customized cost function. Each training data will create new grad object after gradient.compute.
  • Step 3) and 4) with mapPartitions
  • Step 3) and 4) with aggregate This is the implementation of step 3) and 4) in MLlib in coming Spark 1.0 No unnecessary object creation! It's helpful when we're dealing with large features training data. GC will not kick in the executor JVMs.
  • Extension to Multinomial Logistic Regression In Binary Logistic Regression For K classes multinomial problem where labels ranged from [0, K-1], we can generalize it via The model, weights becomes (K-1)(N+1) matrix, where N is number of features. log P( y=1x , w) P ( y=0x , w) =x w1 log P ( y=2x , w) P ( y=0x , w) =x w2 ... log P ( y=K 1x , w) P ( y=0x , w) =x wK 1 log P( y=1x , w) P ( y=0x , w) =x w w=(w1, w2, ... , wK 1)T P( y=0x , w)= 1 1+ i=1 K 1 exp(x wi ) P( y=1x , w)= exp(x w2) 1+ i=1 K 1 exp(x wi) ... P( y=K 1x , w)= exp(x wK 1) 1+i=1 K 1 exp(x wi )
  • Training Multinomial Logistic Regression l( w ,x) = k=1 N log P( ykxk , w) = k=1 N ( yk)log P( y=0xk , w)+(1( yk ))log P ( ykxk , w) = k=1 N ( yk)log 1 1+ i=1 K 1 exp(x wi) +(1( yk ))log exp(x w yk ) 1+ i=1 K1 exp(x wi) = k=1 N (1( yk ))x wyk log(1+ i=1 K1 exp(x wi)) Gradient : Gij ( w ,x)= l( w ,x) w