1 ©MapR Technologies 2013- Confidential
Apache Mahout
How it's good, how it's awesome, and where it falls short
What is Mahout?
“Scalable machine learning”
– not just Hadoop-oriented machine learning
– not entirely, that is. Just mostly.
Components
– math library
– clustering
– classification
– decompositions
– recommendations
What is Right and Wrong with Mahout?
Components
– recommendations
– math library
– clustering
– classification
– decompositions
– other stuff
All the stuff that isn’t there
Mahout Math
Mahout Math
Goals are
– basic linear algebra
– statistical sampling
– good clustering
– decent speed
– extensibility, especially for sparse data
But not
– totally badass speed
– a comprehensive set of algorithms
– optimization, root finders, quadrature
Matrices and Vectors
At the core:
– DenseVector, RandomAccessSparseVector
– DenseMatrix, SparseRowMatrix
Highly composable API
Important ideas:
– view*, assign and aggregate
– iteration
m.viewDiagonal().assign(v)
Assign
Matrices
Matrix assign(double value);
Matrix assign(double[][] values);
Matrix assign(Matrix other);
Matrix assign(DoubleFunction f);
Matrix assign(Matrix other, DoubleDoubleFunction f);
Vectors
Vector assign(double value);
Vector assign(double[] values);
Vector assign(Vector other);
Vector assign(DoubleFunction f);
Vector assign(Vector other, DoubleDoubleFunction f);
Vector assign(DoubleDoubleFunction f, double y);
Views
Matrices
Matrix viewPart(int[] offset, int[] size);
Matrix viewPart(int row, int rlen, int col, int clen);
Vector viewRow(int row);
Vector viewColumn(int column);
Vector viewDiagonal();
Vectors
Vector viewPart(int offset, int length);
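The key property of these views is that they share storage with the backing matrix, so writing through a view mutates the matrix itself. A minimal plain-Java sketch of that idea, with no Mahout dependency (the class and method names here are mine, not Mahout's):

```java
// Sketch of view semantics: a "diagonal view" that reads and writes
// the backing 2-D array in place, the way viewDiagonal() does.
public class DiagonalViewSketch {
    final double[][] data;              // backing storage, row-major

    DiagonalViewSketch(int n) { data = new double[n][n]; }

    // read/write one diagonal element without copying
    double getDiag(int i) { return data[i][i]; }
    void setDiag(int i, double v) { data[i][i] = v; }

    // like m.viewDiagonal().assign(value)
    void assignDiag(double v) {
        for (int i = 0; i < data.length; i++) data[i][i] = v;
    }

    // like m.viewDiagonal().zSum(): the trace of the matrix
    double diagSum() {
        double s = 0;
        for (int i = 0; i < data.length; i++) s += data[i][i];
        return s;
    }

    public static void main(String[] args) {
        DiagonalViewSketch m = new DiagonalViewSketch(3);
        m.assignDiag(2.0);                  // diagonal becomes 2, 2, 2
        System.out.println(m.diagSum());    // prints 6.0
    }
}
```

Because the view writes through to `data`, later reads of the matrix see the assignment; Mahout's real view classes wrap a Matrix rather than a raw array, but the composability is the same.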
Examples
The trace of a matrix:
m.viewDiagonal().zSum()
Random projection onto a low rank random matrix:
m.times(new DenseMatrix(1000, 3).assign(new Normal()))
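For readers without Mahout on the classpath, here is a plain-Java sketch of the same random projection (names and dimensions are mine): multiply an n x d data matrix by a d x k matrix of standard normal samples to get a k-dimensional embedding.

```java
import java.util.Random;

// Random projection: m.times(new DenseMatrix(d, k).assign(new Normal()))
// written out with plain arrays.
public class RandomProjectionSketch {
    static double[][] project(double[][] m, int k, long seed) {
        Random rnd = new Random(seed);
        int n = m.length, d = m[0].length;
        // d x k matrix of N(0,1) samples, like assign(new Normal())
        double[][] r = new double[d][k];
        for (int i = 0; i < d; i++)
            for (int j = 0; j < k; j++)
                r[i][j] = rnd.nextGaussian();
        // naive dense product m x r -> n x k
        double[][] out = new double[n][k];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < k; j++)
                for (int l = 0; l < d; l++)
                    out[i][j] += m[i][l] * r[l][j];
        return out;
    }

    public static void main(String[] args) {
        double[][] m = new double[5][1000];   // 5 points in 1000 dimensions
        for (int i = 0; i < 5; i++) m[i][i] = 1.0;
        double[][] p = project(m, 3, 42L);
        System.out.println(p.length + " x " + p[0].length);  // prints 5 x 3
    }
}
```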
Recommenders
Examples of Recommendations
Customers buying books (Linden et al)
Web visitors rating music (Shardanand and Maes) or movies (Riedl, et al), (Netflix)
Internet radio listeners not skipping songs (Musicmatch)
Internet video watchers watching >30 s (Veoh)
Visibility in a map UI (new Google maps)
Recommendation Basics
History:
User  Thing
  1     3
  2     4
  3     4
  2     3
  3     2
  1     1
  2     1
Recommendation Basics
History as matrix:
      t1  t2  t3  t4
u1     1   0   1   0
u2     1   0   1   1
u3     0   1   0   1
(t1, t3) cooccur 2 times; (t1, t4), (t2, t4), and (t3, t4) each cooccur once
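These cooccurrence counts are exactly the off-diagonal entries of AᵀA, where A is the user-by-item history matrix above. A plain-Java sketch (class and method names are mine) that computes them:

```java
// Cooccurrence counting: c[i][j] = (A'A)[i][j] = number of users
// who interacted with both item i and item j.
public class CooccurrenceSketch {
    static int[][] cooccurrence(int[][] a) {
        int items = a[0].length;
        int[][] c = new int[items][items];
        for (int[] user : a)                     // each user's history row
            for (int i = 0; i < items; i++)
                for (int j = 0; j < items; j++)
                    c[i][j] += user[i] * user[j];
        return c;
    }

    public static void main(String[] args) {
        int[][] a = {
            {1, 0, 1, 0},   // u1
            {1, 0, 1, 1},   // u2
            {0, 1, 0, 1},   // u3
        };
        int[][] c = cooccurrence(a);
        System.out.println(c[0][2]);  // prints 2: (t1, t3) cooccur twice
        System.out.println(c[0][3]);  // prints 1: (t1, t4) cooccur once
    }
}
```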
A Quick Simplification
Users who do h also do r
User-centric recommendations: r = Aᵀ(Ah)
Item-centric recommendations: r = (AᵀA)h
By associativity these are the same vector, but they are computed very differently.
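A plain-Java sketch of the user-centric form, using the history matrix from the earlier slide (names are mine, not Mahout's). Computing Aᵀ(Ah) needs only two matrix-vector products; the item-centric (AᵀA)h form gives the same answer but lets AᵀA be precomputed once and reused.

```java
import java.util.Arrays;

// r = A'(Ah): score items by how often they cooccur with the history h.
public class RecommendationSketch {
    // naive dense matrix-vector product y = M v
    static double[] matVec(double[][] m, double[] v) {
        double[] y = new double[m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++)
                y[i] += m[i][j] * v[j];
        return y;
    }

    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[0].length; j++)
                t[j][i] = m[i][j];
        return t;
    }

    public static void main(String[] args) {
        double[][] a = {        // user x item history matrix from the earlier slide
            {1, 0, 1, 0},
            {1, 0, 1, 1},
            {0, 1, 0, 1},
        };
        double[] h = {1, 0, 0, 0};  // a history containing only t1
        double[] r = matVec(transpose(a), matVec(a, h));  // A'(Ah)
        System.out.println(Arrays.toString(r));  // prints [2.0, 0.0, 2.0, 1.0]
    }
}
```

The score r[2] = 2 says that users who did t1 also did t3 twice, matching the cooccurrence counts on the previous slide.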
Clustering
An Example
Diagonalized Cluster Proximity
Parallel Speedup?
[Figure: time per point (μs) versus number of threads, comparing the threaded and non-threaded versions against perfect scaling]
Lots of Clusters Are Fine
Decompositions
Low Rank Matrix
Or should we see it differently?
Are these all scaled-up versions of the same column?
 1   2    5
 2   4   10
10  20   50
20  40  100
Low Rank Matrix
Matrix multiplication is designed to make this easy
We can see weighted column patterns, or weighted row patterns
All the same mathematically
[1 2 10 20]ᵀ x [1 2 5]
column pattern (or weights) x weights (or row pattern)
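A plain-Java sketch of this outer product (names are mine): multiplying the column pattern by the row pattern reproduces the whole 4 x 3 matrix from a rank-1 factorization.

```java
import java.util.Arrays;

// Rank-1 matrix: every entry is col[i] * row[j], so every row of the
// result is a scaled copy of the row pattern.
public class OuterProductSketch {
    static double[][] outer(double[] col, double[] row) {
        double[][] m = new double[col.length][row.length];
        for (int i = 0; i < col.length; i++)
            for (int j = 0; j < row.length; j++)
                m[i][j] = col[i] * row[j];
        return m;
    }

    public static void main(String[] args) {
        double[][] m = outer(new double[]{1, 2, 10, 20}, new double[]{1, 2, 5});
        // reproduces the slide's matrix exactly
        System.out.println(Arrays.deepToString(m));
    }
}
```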
Low Rank Matrix
What about here?
This is like before, but there is one exceptional value
 1    2    5
 2    4   10
10  100   50
20   40  100
Low Rank Matrix
OK … add in a simple fixer-upper
[1 2 10 20]ᵀ x [1 2 5]  +  [0 0 10 0]ᵀ x [0 8 0]
The first term is the rank-1 pattern as before; in the second term, [0 0 10 0]ᵀ picks which row gets fixed and [0 8 0] is the exception pattern.
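A plain-Java sketch checking this rank-2 decomposition (names are mine): the base rank-1 term plus the one-row fixer reproduces the matrix with the exceptional 100.

```java
import java.util.Arrays;

// Sum of two rank-1 terms: base pattern plus a correction that
// touches only one row, giving the matrix with the exceptional value.
public class RankTwoSketch {
    static double[][] outer(double[] col, double[] row) {
        double[][] m = new double[col.length][row.length];
        for (int i = 0; i < col.length; i++)
            for (int j = 0; j < row.length; j++)
                m[i][j] = col[i] * row[j];
        return m;
    }

    static double[][] add(double[][] a, double[][] b) {
        double[][] s = new double[a.length][a[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++)
                s[i][j] = a[i][j] + b[i][j];
        return s;
    }

    public static void main(String[] args) {
        double[][] base = outer(new double[]{1, 2, 10, 20}, new double[]{1, 2, 5});
        double[][] fix = outer(new double[]{0, 0, 10, 0}, new double[]{0, 8, 0});
        double[][] m = add(base, fix);
        // m[2][1] is now 20 + 80 = 100, the exceptional value
        System.out.println(Arrays.deepToString(m));
    }
}
```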
Random Projection
SVD Projection
Classifiers
Mahout Classifiers
Naïve Bayes
– high quality implementation
– uses idiosyncratic input format
– … but it is naïve
SGD
– sequential, not parallel
– auto-tuning has foibles
– learning rate annealing has issues
– definitely not state of the art compared to Vowpal Wabbit
Random forest
– scaling limits due to decomposition strategy
– yet another input format
– no deployment strategy
The stuff that isn’t there
What Mahout Isn’t
Mahout isn’t R, isn’t SAS
It doesn’t aim to do everything
It aims to scale a few problems of practical interest
The stuff that isn’t there is a feature, not a defect
Contact:
– [email protected]
– @ted_dunning
– @apachemahout
Slides and such: http://www.slideshare.net/tdunning
Hash tags: #mapr #apachemahout