Apache Hadoop India Summit 2011 talk "Framework for a Suite of Co-clustering Algorithms for...
Framework for a suite of Co-clustering
algorithms for predictive modeling on Hadoop
Vaijanath N. Rao
Rohini Uppuluri
Presentation for[CLIENT]
Agenda
• Introduction
• Background
• Some Approaches
• Co-Clustering
• Introduction
• Related Work
• Why Hadoop?
• Goal
• Our Framework
• Conclusions and Future Work
Background
Modeling for Prediction
• Will user A like this movie?
• Will user B like this camera?
• Customer purchase decisions in an e-commerce setting
And tons of other things…
Some Approaches
• Collaborative filtering
• User-based, item-based, model-based, content-based, hybrid, etc. (see [1], [2])
• Latent Models
• Probabilistic Latent Semantic Indexing [3,6]
• Matrix Factorization [4,7,8]
• Predictive Discrete Latent Factor models [5]
• Co-clustering
• Clustering along multiple axes [9,10], etc.; survey in [16]
Co-clustering
[Figure: a Users × Products matrix of 0/1 entries with unknown cells marked "?", accompanied by user attributes and product attributes. Rows and columns are regrouped into clustered users and clustered products by repeating three steps (row cluster update, column cluster update, global model update), reducing error with each round.]
Some Approaches
• Bregman co-clustering - Framework [11]
• Information theoretic co-clustering [12]
• Min sum squared co-clustering [13]
• Scalable framework based on the Bregman framework [14]
• DisCo [15]
Why Hadoop
• Real-world data is huge
• Large matrices to operate on (millions of rows, millions of columns!)
• A lot of computation
Goal
• Many approaches exist, hence the need for a common framework
• Build a framework that accommodates multiple co-clustering algorithms on Hadoop
• Make it easy for users to choose an algorithm and use it
Overview
[Figure: the input, the row clusters, the column clusters, and the global model flow through three jobs: the Row Cluster Updator Job, the Column Cluster Updator Job, and the Global Model Updator Job.]
Overview : Core Interfaces
• Input vector (type, id, datavec, attributevec, cost, assignment)
• Cluster (vector, len)
• Row Cluster
• Column Cluster
• Distance/Error Function (vector1, vector2)
• Model (matrix)
• Row Model
• Column Model
• Group Model
• Objective Function (Model1, Model2)
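The interfaces listed above could be sketched roughly as follows. This is only an illustration in Python, not the framework's actual API: the class and function names, the field types, the "ROW"/"COLUMN" encoding, and the squared-error choices are all assumptions; only the field lists come from the slide.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]
Matrix = List[List[float]]   # Model (matrix)


@dataclass
class InputVector:
    """One row or column of the data matrix (fields as named on the slide)."""
    type: str                   # "ROW" or "COLUMN" (assumed encoding)
    id: int
    datavec: Vector             # observed values for this row/column
    attributevec: Vector = field(default_factory=list)
    cost: float = float("inf")  # error against the assigned cluster
    assignment: int = -1        # assigned cluster id; -1 means unassigned


@dataclass
class Cluster:
    """Running sum of member vectors plus the member count."""
    vector: Vector
    len: int = 0

    def add(self, v: Vector) -> None:
        # Accumulate one member vector into the running sum.
        self.vector = [a + b for a, b in zip(self.vector, v)]
        self.len += 1

    def centroid(self) -> Vector:
        # Mean of the members; an empty cluster returns its stored vector.
        return [a / self.len for a in self.vector] if self.len else list(self.vector)


def squared_error(vector1: Vector, vector2: Vector) -> float:
    """Distance/Error Function (vector1, vector2); squared error is an assumed choice."""
    return sum((a - b) ** 2 for a, b in zip(vector1, vector2))


def objective(model1: Matrix, model2: Matrix) -> float:
    """Objective Function (Model1, Model2): here, total squared change between two models."""
    return sum(squared_error(r1, r2) for r1, r2 in zip(model1, model2))
```

Keeping the distance and objective as plain functions is one way to let different algorithms plug in their own error measures without touching the cluster bookkeeping.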
Currently we have
• Graph Based Bi-clustering
• DisCo [15]
DisCo Algorithm
1. Initialization
1.1 Initialize row and column clusters
1.2 Compute global model
2. Repeat until the objective function converges
2.1 For each row in the data, pick the row group which minimizes error
2.2 Update row clusters
2.3 Update global model
2.4 For each column in the data, pick the column group which minimizes error
2.5 Update column clusters
2.6 Update global model
3. Return row and column clusters
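The loop above can be sketched in memory as follows. This is a minimal single-machine illustration, not the MapReduce implementation from the talk or the cost function from [15]: it assumes a dense matrix with no missing cells, takes the global model to be the mean of each (row cluster, column cluster) block, and uses squared error; all function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def global_model(data: Matrix, row_of: List[int], col_of: List[int],
                 k: int, l: int) -> Matrix:
    """Mean of the cells in each (row cluster, column cluster) block."""
    total = [[0.0] * l for _ in range(k)]
    count = [[0] * l for _ in range(k)]
    for i, row in enumerate(data):
        for j, x in enumerate(row):
            total[row_of[i]][col_of[j]] += x
            count[row_of[i]][col_of[j]] += 1
    return [[total[r][c] / count[r][c] if count[r][c] else 0.0
             for c in range(l)] for r in range(k)]


def error(data: Matrix, row_of: List[int], col_of: List[int],
          model: Matrix) -> float:
    """Total squared error of approximating each cell by its block mean."""
    return sum((x - model[row_of[i]][col_of[j]]) ** 2
               for i, row in enumerate(data) for j, x in enumerate(row))


def cocluster(data: Matrix, row_of: List[int], col_of: List[int],
              k: int, l: int, max_iter: int = 20
              ) -> Tuple[List[int], List[int], Matrix]:
    row_of, col_of = list(row_of), list(col_of)        # step 1.1 (given)
    model = global_model(data, row_of, col_of, k, l)   # step 1.2
    best = error(data, row_of, col_of, model)
    for _ in range(max_iter):                          # step 2
        for i, row in enumerate(data):                 # step 2.1
            row_of[i] = min(range(k), key=lambda r: sum(
                (x - model[r][col_of[j]]) ** 2 for j, x in enumerate(row)))
        model = global_model(data, row_of, col_of, k, l)   # steps 2.2-2.3
        for j in range(len(col_of)):                   # step 2.4
            col_of[j] = min(range(l), key=lambda c: sum(
                (data[i][j] - model[row_of[i]][c]) ** 2
                for i in range(len(data))))
        model = global_model(data, row_of, col_of, k, l)   # steps 2.5-2.6
        current = error(data, row_of, col_of, model)
        if current >= best:                            # no further improvement
            break
        best = current
    return row_of, col_of, model                       # step 3
```

The result depends on the initial assignment: an initialization that makes all block means equal gives the loop nothing to improve on, so in practice several random starts may be worth trying.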
Pick the Best Row Group/Cluster
Example
RowCluster Updator Job
Example
BiClustering
Pick the Best Row Group/Cluster
Example
Row Updator Job
• Input (lineId 0): rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError
• RowCluster Mapper: pick the best row group/cluster, i.e. the one which minimizes cost or error, and emit key-value pairs by key type
• Mapper output, key type DATA: rowId, clickVector, attributeVector, bestRowClusterId, cost
• Mapper output, key type ROWCLUSTER: best row cluster id, clickVector
• RowCluster Reducer, key type DATA: just emit
• RowCluster Reducer, key type ROWCLUSTER: aggregate the row cluster
• Output: updated row clusters; also write rowId, clickVector, attributeVector, bestRowClusterId, cost
Example
Conclusions and Future Work
• Implementing more algorithms
• Easy to use examples and more documentation
References
[1] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In SIGIR, pages 230–237, 1999.
[2] J. Basilico and T. Hofmann. Unifying collaborative and content-based filtering. In ICML ’04, pages 65–72, 2004.
[3] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proc. IJCAI ’99, 1999.
[4] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS ’07, 2007.
[5] D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale dyadic data. In Proc. KDD ’07, pages 26–35, 2007.
[6] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW ’07, pages 271–280, 2007.
[7] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD ’08, pages 426–434, 2008.
[8] H. Ma, H. Yang, M. Lyu, and I. King. Sorec: social recommendation using probabilistic matrix factorization. In CIKM ’08, pages 931–940, 2008.
[9] Y. Cheng and G. M. Church. Biclustering of expression data. In Proc. ISMB ’00, pages 93–103, 2000.
[10] T. George and S. Merugu. A scalable collaborative filtering framework based on co-clustering. In ICDM, pages 625–628, 2005.
[11] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. JMLR, 8:1919–1986, 2007.
[12] I. Dhillon, S. Mallela, and D. Modha. Information-theoretic co-clustering. In Proc. KDD ’03, pages 89–98, 2003.
[13] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra. Minimum sum squared residue co-clustering of gene expression data. In Proc. SDM ’04, 2004.
[14] M. Deodhar, G. Gupta, J. Ghosh, H. Cho, and I. Dhillon. A scalable framework for discovering coherent co-clusters in noisy data. In ICML ’08, 2008.
[15] S. Papadimitriou and J. Sun. Disco: Distributed co-clustering with mapreduce: A case study towards petabyte-scale end-to-end mining. In ICDM ’08, pages 512–521, 2008.
[16] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE Trans. on Computational Biology and Bioinformatics, 1(1):24–45, 2004.
Thank you
Row Cluster Updator Job
• Input record: rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError
• RowCluster Mapper: pick the best row group/cluster, i.e. the one which minimizes cost or error, and emit key-value pairs by value type
• Mapper output, value type DATA: rowId, clickVector, attributeVector, bestRowClusterId, cost
• Mapper output, value type ROWCLUSTER: rowId, updated rowCluster
• Mapper output, value type ROW GLOB MODEL: rowId, updated partial global model
• RowCluster Reducer, value type DATA: just emit
• RowCluster Reducer, value type ROWCLUSTER: aggregate the row cluster
• RowCluster Reducer, value type ROW GLOB MODEL: aggregate the partial global model for the given row cluster
• Output: updated row clusters and updated partial global model; also write rowId, clickVector, attributeVector, bestRowClusterId, cost
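The mapper/reducer split of the Row Cluster Updator Job can be imitated in plain Python to make the data flow concrete. This is a hedged sketch, not the Hadoop code from the talk: it assumes records are keyed by the best row cluster id, that "best" means nearest cluster centroid under squared error, and that a row's partial global model is its clickVector summed per column cluster. The names `row_mapper`, `row_reducer`, `run_job`, `centroids`, and `col_of` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Vector = List[float]
# rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError
Record = Tuple[int, Vector, Vector, int, float]


def sq_err(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def row_mapper(record: Record, centroids: List[Vector], col_of: List[int],
               n_col_clusters: int) -> Iterator[Tuple[int, Tuple[str, object]]]:
    """Pick the row cluster with minimum error and emit one pair per value type."""
    row_id, clicks, attrs, _, _ = record
    best = min(range(len(centroids)), key=lambda r: sq_err(clicks, centroids[r]))
    cost = sq_err(clicks, centroids[best])
    partial = [0.0] * n_col_clusters          # this row's share of the global model
    for j, x in enumerate(clicks):
        partial[col_of[j]] += x
    yield best, ("DATA", (row_id, clicks, attrs, best, cost))
    yield best, ("ROWCLUSTER", clicks)
    yield best, ("ROWGLOBMODEL", partial)


def row_reducer(values: List[Tuple[str, object]]):
    """DATA: just emit; ROWCLUSTER and ROWGLOBMODEL: aggregate by summing."""
    data, members, cluster_sum, model_sum = [], 0, None, None
    for vtype, payload in values:
        if vtype == "DATA":
            data.append(payload)
        elif vtype == "ROWCLUSTER":
            cluster_sum = (list(payload) if cluster_sum is None
                           else [a + b for a, b in zip(cluster_sum, payload)])
            members += 1
        else:
            model_sum = (list(payload) if model_sum is None
                         else [a + b for a, b in zip(model_sum, payload)])
    centroid = [s / members for s in cluster_sum] if members else []
    return data, centroid, model_sum


def run_job(records: List[Record], centroids: List[Vector],
            col_of: List[int], n_col_clusters: int) -> Dict[int, tuple]:
    """Simulate map, shuffle (group by key), and reduce in one process."""
    shuffled = defaultdict(list)
    for rec in records:
        for key, value in row_mapper(rec, centroids, col_of, n_col_clusters):
            shuffled[key].append(value)
    return {cid: row_reducer(vals) for cid, vals in shuffled.items()}
```

Tagging each value with its type lets a single reducer handle all three streams for a cluster, which matches the per-type branching shown on the slide.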
Column Cluster Updator Job
• Input record: colId, clickVector, attributeVector, curColClusterId, curColClusterError
• ColCluster Mapper: pick the best column group/cluster, i.e. the one which minimizes cost or error, and emit key-value pairs by value type
• Mapper output, value type DATA: colId, clickVector, attributeVector, bestColClusterId, cost
• Mapper output, value type COLCLUSTER: colId, updated colCluster
• Mapper output, value type COL GLOB MODEL: colId, updated partial global model
• ColCluster Reducer, value type DATA: just emit
• ColCluster Reducer, value type COLCLUSTER: aggregate the column cluster
• ColCluster Reducer, value type COL GLOB MODEL: aggregate the partial global model for the given column cluster
• Output: updated column clusters and updated partial global model; also write colId, clickVector, attributeVector, bestColClusterId, cost