Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017
-
Upload
mlconf -
Category
Technology
-
view
755 -
download
0
Transcript of Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017
![Page 1: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/1.jpg)
Apache Mahout Distributed Matrix Math for Machine Learning
![Page 2: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/2.jpg)
About Me
• Senior Director of Data Science at Lucidworks (Apache Solr/Lucene, Fusion search tools)
• Formerly Chief Data Scientist, Technical Lead of Data Science Practice at Accenture
• Committer and PMC Member, Apache Mahout
• On Twitter @akm
• Email at [email protected], [email protected]
• Adversarial Learning podcast with @joelgrus at http://adversariallearning.com
![Page 3: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/3.jpg)
Apache Mahout Recent Trends in 0.12/0.13
• Simplify and improve performance of distributed matrix-math programming
• Provide flexible computation options for software and hardware
• Enable easier and quicker new algorithm development
• Allow polyglot programming and plotting in notebooks via Apache Zeppelin
![Page 4: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/4.jpg)
Introduction to Apache Mahout
Apache Mahout is an environment for creating scalable, performant, machine-learning applications
Apache Mahout provides:
• Mathematically expressive Scala DSL
• A collection of pre-canned math and statistics algorithms
• Interchangeable distributed engines
• Interchangeable native solvers (JVM, CPU, GPU, CUDA, or custom)
![Page 5: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/5.jpg)
Feature Highlights in Recent Releases
• v 0.13.1, Soon — CUDA Solvers, Apache Spark 2.1/Scala 2.11 support
• New web site platform, May 2017 — Moved from ASF CMS system to Markdown and Jekyll; allows documentation pull requests to be merged in and published automatically
• v 0.13.0, Apr 2017 — GPU/CPU Solvers, algorithm framework
• v 0.12.2, Nov 2016 — Apache Zeppelin integration for notebooks and visualization
• v 0.12.0, Apr 2016 — Apache Flink backend support
• New Mahout book, Feb 2016 — ‘Apache Mahout: Beyond MapReduce’ by Dmitriy Lyubimov and Andrew Palumbo
• v 0.10.0 - Apr 2015 - Mahout-Samsara vector-math DSL, MapReduce jobs soft-deprecated, Spark backend support
![Page 6: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/6.jpg)
Topic Overview
• Mahout-Samsara: Declarative, R-like, domain-specific language (DSL) for matrix math
• Backend-agnostic programming
• Apache Zeppelin notebooks
• Algorithm development framework (modeled after sk-learn)
• Solve on available CPU cores, single or multiple GPUs, or in the JVM
• Next steps, and how to get involved
![Page 7: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/7.jpg)
Mahout-Samsara
![Page 8: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/8.jpg)
Mahout-Samsara
MapReduce is dead; long live the little clip-art blue man!
![Page 9: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/9.jpg)
Mahout-Samsara
• Mahout-Samsara is an easy-to-use domain-specific language (DSL) for large-scale machine learning on distributed systems like Apache Spark and Flink
• Uses Scala as programming/scripting environment
• Algebraic expression optimizer for distributed linear algebra
• Provides a translation layer to distributed engines
• Support for Spark and Flink DataSets, RDDs
• System-agnostic, R-like DSL; actual formula from (d)spca:
val G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q)
![Page 10: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/10.jpg)
Mahout-Samsara
• Mahout-Samsara computes C = A’A via row-outer-product formulation:
• Executes in a single pass over row-partitioned A
Example of an algebraic optimization
• Logical optimization
• Optimizer rewrites plan to use logical operator for Transpose-Times-Self matrix multiplication
• Single pass: multiply partitioned rows by themselves as transposed columns
• Computation of A’A:
val C = A.t %*% A
• Naïve execution
• 1st pass: transpose A (requires repartitioning of A)
• 2nd pass: multiply result with A (expensive, potentially requires repartitioning again)
![Page 11: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/11.jpg)
Mahout-Samsara
• Mahout-Samsara computes C = A’A via row-outer-product formulation:
• Executes in a single pass over row-partitioned A
Example of an algebraic optimization
![Page 12: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/12.jpg)
Mahout-Samsara
• Mahout-Samsara computes C = A’A via row-outer-product formulation:
• Executes in a single pass over row-partitioned A
Example of an algebraic optimization
![Page 13: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/13.jpg)
Mahout-Samsara
• Mahout-Samsara computes C = A’A via row-outer-product formulation:
• Executes in a single pass over row-partitioned A
Example of an algebraic optimization
![Page 14: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/14.jpg)
Mahout-Samsara
• Mahout-Samsara computes C = A’A via row-outer-product formulation:
• Executes in a single pass over row-partitioned A
Example of an algebraic optimization
![Page 15: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/15.jpg)
Backend-Agnostic Programming
![Page 16: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/16.jpg)
Backend-Agnostic Programming
![Page 17: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/17.jpg)
Apache Zeppelin Notebooks
![Page 18: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/18.jpg)
Apache Zeppelin Notebooks
• Notebooks for polyglot programming with all types of data
• Plotting with R and Python off of computed data from other tools in the same notebook
• Share variables between interpreters
• For more: https://zeppelin.apache.org
• Mahout interpreter for Zeppelin released June 2016
• Post by Trevor Grant on how to use it at https://rawkintrevo.org/2016/05/19/visualizing-apache-mahout-in-r-via-apache-zeppelin-incubating
• https://mahout.apache.org/docs/0.13.1-SNAPSHOT/tutorials/misc/mahout-in-zeppelin/
![Page 19: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/19.jpg)
Apache Zeppelin Notebooks
Add the Mahout Interpreter
![Page 20: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/20.jpg)
Apache Zeppelin Notebooks
Add the Mahout Interpreter, click “Create”
![Page 21: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/21.jpg)
Apache Zeppelin Notebooks
Example usage
![Page 22: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/22.jpg)
Apache Zeppelin Notebooks
Example usage
![Page 23: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/23.jpg)
Apache Zeppelin Notebooks
Hand results to R for plotting
![Page 24: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/24.jpg)
Algorithm Development Framework
![Page 25: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/25.jpg)
Algorithm Development Framework
• Patterned after R and Python (sk-learn) APIs
• Fitter populates a Model, which contains the parameter estimates, fit statistics, a summary, and has a predict() method
• https://rawkintrevo.org/2017/05/02/introducing-pre-canned-algorithms-apache-mahout
• https://mahout.apache.org/docs/0.13.1-SNAPSHOT/tutorials/misc/contributing-algos
![Page 26: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/26.jpg)
Solve on CPU, GPU, or JVM
![Page 27: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/27.jpg)
Solve on CPU, GPU, or JVM
Current architecture with native CPU and GPU support and unreleased jCUDA bindings
![Page 28: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/28.jpg)
Solve on CPU, GPU, or JVM
Initial benchmarking on latest release
![Page 29: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/29.jpg)
Solve on CPU, GPU, or JVM
Initial benchmarking on latest release
• Sparse MMul at geometry of 1000 x 1000 %*% 1000 x 1000 density = 0.2, with 5 runs Mahout JVM Sparse multiplication time: 1501 msMahout jCUDA Sparse multiplication time: 49 ms
30x speedup
• Sparse MMul at geometry of 1000 x 1000 %*% 1000 x 1000 density = .02, with 5 runs Mahout JVM Sparse multiplication time: 34 ms Mahout jCUDA Sparse multiplication time: 4 ms
8.5x speedup
• Sparse MMul at geometry of 1000 x 1000 %*% 1000 x 1000 density = .002, with 5 runs Mahout JVM Sparse multiplication time: 1 ms Mahout jCUDA Sparse multiplication time: 1 ms
0x speedup
![Page 30: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/30.jpg)
Solve on CPU, GPU, or JVM
• jCUDA work is still in a branch, will be in master in the next couple months
• Currently the modes of compute are JVM, CPU (using all available cores), and single GPU
• Multi-GPU is next priority
• Currently multiplication takes place in different solvers based on matrix shape (banding, triangularity, etc.)
• Directing location for data and compute based on shape and density is another priority
• Watch this space for other speedups
Next steps
![Page 31: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/31.jpg)
How to Use Mahout and Get Involved
![Page 32: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/32.jpg)
How to Use Mahout and Get Involved
Web: https://mahout.apache.org
Source code, PRs welcome: https://github.com/apache/mahout
Mailing lists: https://mahout.apache.org/community/mailing-lists.html
Download, install, embed: https://mahout.apache.org/downloads.html
![Page 33: Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017](https://reader031.fdocuments.us/reader031/viewer/2022020314/5aacef127f8b9a003b8b4585/html5/thumbnails/33.jpg)
Thank YouQ&A
h2ps://mahout.apache.org h2ps://github.com/apache/mahout