
Page 1: Apache Hadoop India Summit 2011 talk "Framework for a Suite of Co-clustering Algorithms for Predictive Modeling on Hadoop" by Vaijanath Rao and Rohini Uppuluri

Framework for a suite of Co-clustering algorithms for predictive modeling on Hadoop

Vaijanath N. Rao

Rohini Uppuluri

Page 2

Agenda

• Introduction

• Background

• Some Approaches

• Co-Clustering

• Introduction

• Related Work

• Why Hadoop?

• Goal

• Our Framework

• Conclusions and Future Work

Page 3

Background

Modeling for Prediction

• Will user A like this movie?

• Will user B like this camera?

• Customer purchase decisions in an e-commerce setting

And tons of other things…

Page 4

Some Approaches

• Collaborative filtering

• User Based, Item Based, Model Based, Content Based, Hybrid, etc. (see [1], [2])

• Latent Models

• Probabilistic Latent Semantic Indexing [3,6]

• Matrix Factorization [4,7,8]

• Predictive Discrete Latent Factor models [5]

• Co-clustering

• Clustering along multiple axes: [9,10], etc.; survey in [16]

Page 5

Co-clustering

[Figure: example Users × Products matrices with unknown entries marked ? (rows 0?00?, 00?11, 0?1?0, 11101 and rows 0?01?, 00?11, 0?1?1, 11100), shown with User Attributes and Product Attributes and regrouped as Clustered Users × Clustered Products. The cycle Row Cluster Update, Column Cluster Update, Global Model Update repeats, reducing error.]
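A minimal sketch of what a global model over co-clusters could look like. It assumes (the slide does not say this) that the model stores the mean of the observed entries in each (user cluster, product cluster) block and that error is squared error. The matrix is the first example from the figure; the cluster assignments and all function names are hypothetical.

```python
from collections import defaultdict


def block_means(matrix, row_of, col_of):
    """Mean of the observed entries in each (row cluster, column cluster) block."""
    total = defaultdict(float)
    count = defaultdict(int)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is None:  # '?' on the slide: unobserved entry
                continue
            block = (row_of[i], col_of[j])
            total[block] += value
            count[block] += 1
    return {block: total[block] / count[block] for block in count}


def squared_error(matrix, row_of, col_of, model):
    """Sum of squared differences between observed entries and their block mean."""
    err = 0.0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                err += (value - model[(row_of[i], col_of[j])]) ** 2
    return err


def predict(i, j, row_of, col_of, model, default=0.0):
    """Predict entry (i, j) as the mean of the block it falls into."""
    return model.get((row_of[i], col_of[j]), default)


# Users x Products matrix from the figure; None stands for '?'.
M = [
    [0, None, 0, 0, None],
    [0, 0, None, 1, 1],
    [0, None, 1, None, 0],
    [1, 1, 1, 0, 1],
]
row_of = [0, 0, 0, 1]      # assumed user clusters
col_of = [0, 0, 1, 1, 1]   # assumed product clusters

model = block_means(M, row_of, col_of)
error = squared_error(M, row_of, col_of, model)
guess = predict(0, 1, row_of, col_of, model)  # user 0, product 1 was '?'
```

Each update step in the cycle changes either the assignments or the block means, and a lower value of `error` indicates a better fit under this assumed cost.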

Page 6

Some Approaches

• Bregman co-clustering - Framework [11]

• Information theoretic co-clustering [12]

• Min sum squared co-clustering [13]

• Scalable framework based on the Bregman framework [14]

• DisCo [15]

Page 7

Why Hadoop?

• Real-world data is huge

• Large matrix to operate on (millions and millions of rows, millions of columns!)

• A lot of computation

Page 8

Goal

• Many approaches exist, so a common framework is needed

• Build a framework on Hadoop that accommodates multiple algorithms

• Make it easy for users to choose an algorithm and use it

Page 9

Overview

[Diagram with the elements: Input, Row Clusters, Column Clusters, Global Model, Row Cluster Updator Job, Column Cluster Updator Job, Global Model Updator Job.]

Page 10

Overview: Core Interfaces

• Input vector (type, id, datavec, attributevec, cost, assignment)

• Cluster (vector, len)

    • Row Cluster

    • Column Cluster

• Distance/Error Function (vector1, vector2)

• Model (matrix)

    • Row Model

    • Column Model

    • Group Model

• Objective Function (Model1, Model2)
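One way these interfaces might look in code is sketched below. Only the interface names and their listed fields come from the slide; the field types, the default values, the `"ROW"` type tag, and the helper functions `squared_distance`, `model_change` and `assign` are assumptions made for illustration, not the framework's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

Vector = List[Optional[float]]  # None marks a missing entry (assumed encoding)


@dataclass
class InputVector:
    type: str                      # e.g. "ROW" or "COLUMN" (assumed values)
    id: int
    datavec: Vector
    attributevec: List[float] = field(default_factory=list)
    cost: float = float("inf")     # error against the assigned cluster
    assignment: int = -1           # -1 means "not yet assigned" (assumed)


@dataclass
class Cluster:
    vector: List[float]            # cluster representative
    len: int = 0                   # number of members


class RowCluster(Cluster):
    pass


class ColumnCluster(Cluster):
    pass


@dataclass
class Model:
    matrix: List[List[float]]


class RowModel(Model):
    pass


class ColumnModel(Model):
    pass


class GroupModel(Model):
    pass


DistanceFunction = Callable[[Sequence[Optional[float]], Sequence[float]], float]
ObjectiveFunction = Callable[[Model, Model], float]


def squared_distance(v1, v2):
    """Distance/Error Function: squared error over entries present in both vectors."""
    return sum((a - b) ** 2 for a, b in zip(v1, v2) if a is not None and b is not None)


def model_change(m1, m2):
    """Objective Function: total absolute change between two models."""
    return sum(abs(a - b) for r1, r2 in zip(m1.matrix, m2.matrix) for a, b in zip(r1, r2))


def assign(vec, clusters, distance):
    """Set vec.assignment and vec.cost to the closest cluster under `distance`."""
    costs = [distance(vec.datavec, c.vector) for c in clusters]
    vec.assignment = min(range(len(costs)), key=costs.__getitem__)
    vec.cost = costs[vec.assignment]
    return vec
```

Keeping the distance and objective functions as plain callables is one way to let different co-clustering algorithms plug into the same jobs.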

Page 11

Currently we have

• Graph Based Bi-clustering

• DisCo

Page 12

DisCo Algorithm

1. Initialization

1.1 Initialize row and column clusters

1.2 Compute the global model

2. Repeat until the objective function converges

2.1 For each row in the data, pick the row group which minimizes error

2.2 Update row clusters

2.3 Update global model

2.4 For each column in the data, pick the column group which minimizes error

2.5 Update column clusters

2.6 Update global model

3. Return row and column clusters
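The steps above can be sketched as a single-machine loop. This is an illustration under stated assumptions, not the DisCo implementation: it uses block means as the global model and squared error as the cost (the slide does not name either), random initialization, and a stop rule of "total error no longer decreases". All function names are hypothetical.

```python
import random


def global_model(M, rows, cols, k, l):
    """Block means over observed entries; empty blocks default to 0.0."""
    total = [[0.0] * l for _ in range(k)]
    count = [[0] * l for _ in range(k)]
    for i, row in enumerate(M):
        for j, x in enumerate(row):
            if x is not None:
                total[rows[i]][cols[j]] += x
                count[rows[i]][cols[j]] += 1
    return [[total[a][b] / count[a][b] if count[a][b] else 0.0
             for b in range(l)] for a in range(k)]


def vector_error(vec, other_axis, model_slice):
    """Squared error of one row (or column) against one group's model slice."""
    return sum((x - model_slice[other_axis[j]]) ** 2
               for j, x in enumerate(vec) if x is not None)


def total_error(M, rows, cols, model):
    return sum(vector_error(row, cols, model[rows[i]]) for i, row in enumerate(M))


def transpose(M):
    return [list(c) for c in zip(*M)]


def co_cluster(M, k, l, max_iter=20, seed=0):
    rng = random.Random(seed)
    rows = [rng.randrange(k) for _ in M]          # 1.1 initial row clusters
    cols = [rng.randrange(l) for _ in M[0]]       # 1.1 initial column clusters
    model = global_model(M, rows, cols, k, l)     # 1.2 global model
    Mt = transpose(M)
    errors = [total_error(M, rows, cols, model)]
    for _ in range(max_iter):                     # 2. iterate
        # 2.1-2.3: best row group per row, then refresh the global model
        rows = [min(range(k), key=lambda a: vector_error(r, cols, model[a]))
                for r in M]
        model = global_model(M, rows, cols, k, l)
        # 2.4-2.6: same for columns, working on the transposed data and model
        model_t = transpose(model)
        cols = [min(range(l), key=lambda b: vector_error(c, rows, model_t[b]))
                for c in Mt]
        model = global_model(M, rows, cols, k, l)
        errors.append(total_error(M, rows, cols, model))
        if errors[-1] >= errors[-2]:              # no further improvement
            break
    return rows, cols, model, errors              # 3. result


# Small example; None marks a missing entry.
data = [
    [1, 1, 0, 0],
    [1, None, 0, 0],
    [0, 0, 1, 1],
    [0, 0, None, 1],
]
rows, cols, model, errors = co_cluster(data, k=2, l=2)
```

With this cost, neither the reassignment steps nor the model refresh can increase the total error, so `errors` is non-increasing. The result can still be a local optimum that depends on the initial assignment.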

Page 13

Pick the Best Row Group/Cluster

Page 14

Example

Page 15

RowCluster Updator Job

Page 16

Example

Page 17

BiClustering

Page 18

Pick the Best Row Group/Cluster

Page 19

Example

Page 20

Row Updator Job

[Diagram: data flow of the Row Updator Job]

• Input record: rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError

• RowCluster Mapper: pick the best row group/cluster, the one which minimizes cost or error, and emit key/value pairs by key type:

    • DATA: rowId clickVector attributeVector bestRowClusterId cost

    • ROWCLUSTER: Best Row Cluster Id, clickVector

    • Other key labels visible in the diagram: lineId, 0

• RowCluster Reducer, by keyType:

    • DATA: Just Emit

    • ROW CLUSTER: Aggregate Row Cluster

• Output: Updated Row Clusters. Also write: rowId clickVector attributeVector bestRowClusterId cost

Page 21

Example

Page 22

Conclusions and Future Work

• Implementing more algorithms

• Easy-to-use examples and more documentation

Page 23

References

[1] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In SIGIR ’99, pages 230–237, 1999.

[2] J. Basilico and T. Hofmann. Unifying collaborative and content-based filtering. In ICML ’04, pages 65–72, 2004.

[3] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proc. IJCAI ’99, 1999.

[4] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS ’07, 2007.

[5] D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale dyadic data. In Proc. KDD ’07, pages 26–35, 2007.

[6] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW ’07, pages 271–280, 2007.

[7] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD ’08, pages 426–434, 2008.

[8] H. Ma, H. Yang, M. Lyu, and I. King. SoRec: social recommendation using probabilistic matrix factorization. In CIKM ’08, pages 931–940, 2008.

[9] Y. Cheng and G. M. Church. Biclustering of expression data. In Proc. ISMB ’00, pages 93–103, 2000.

[10] T. George and S. Merugu. A scalable collaborative filtering framework based on co-clustering. In ICDM ’05, pages 625–628, 2005.

Page 24

References (contd.)

[11] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. JMLR, 8:1919–1986, 2007.

[12] I. Dhillon, S. Mallela, and D. Modha. Information-theoretic co-clustering. In Proc. KDD ’03, pages 89–98, 2003.

[13] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra. Minimum sum squared residue co-clustering of gene expression data. In Proc. SDM ’04, 2004.

[14] M. Deodhar, G. Gupta, J. Ghosh, H. Cho, and I. Dhillon. A scalable framework for discovering coherent co-clusters in noisy data. In ICML ’08, 2008.

[15] S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining. In ICDM ’08, pages 512–521, 2008.

[16] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. on Computational Biology and Bioinformatics, 1(1):24–45, 2004.

Page 25

Thank you

Page 26

Row Cluster Updator Job

[Diagram: data flow of the Row Cluster Updator Job]

• Input record: rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError

• RowCluster Mapper: pick the best row group/cluster, the one which minimizes cost or error, and emit key/value pairs by value type:

    • DATA: rowId clickVector attributeVector bestRowClusterId cost

    • ROWCLUSTER: rowId, Updated rowCluster

    • ROW GLOB MODEL: rowId, Updated Partial GlobalModel

• RowCluster Reducer, by ValueType:

    • DATA: Just Emit

    • ROW CLUSTER: Aggregate Row Cluster

    • ROW GLOB CLUSTER: Aggregate Partial Global Model for given row cluster

• Output: Updated Row Clusters, Updated Partial Global Model. Also write: rowId clickVector attributeVector bestRowClusterId cost
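The mapper and reducer roles above can be imitated in memory as follows. This is a hedged sketch, not the framework's Hadoop code: it assumes dense click vectors, squared error as the cost, cluster and model records keyed by the best row cluster id (so that they can be aggregated per cluster), and a partial global model stored as per-column-cluster sums and counts. The function names and the `ROWGLOBMODEL` tag spelling are hypothetical.

```python
from collections import defaultdict


def row_cluster_mapper(record, row_clusters, col_of, n_col_clusters):
    """Pick the best row cluster for one record and emit three kinds of values."""
    row_id, click, attrs, _cur_id, _cur_err = record
    costs = {cid: sum((x - c) ** 2 for x, c in zip(click, centroid))
             for cid, centroid in row_clusters.items()}
    best = min(costs, key=costs.get)
    # DATA: the record with its new assignment and cost
    yield ("DATA", row_id), (row_id, click, attrs, best, costs[best])
    # ROWCLUSTER: this row's contribution to its cluster
    yield ("ROWCLUSTER", best), (click, 1)
    # ROWGLOBMODEL: partial sums and counts per column cluster
    sums, counts = [0.0] * n_col_clusters, [0] * n_col_clusters
    for j, x in enumerate(click):
        sums[col_of[j]] += x
        counts[col_of[j]] += 1
    yield ("ROWGLOBMODEL", best), (sums, counts)


def row_cluster_reducer(key, values):
    """Pass DATA through; aggregate cluster and partial-model contributions."""
    value_type = key[0]
    if value_type == "DATA":
        yield from values
    elif value_type == "ROWCLUSTER":
        n = sum(count for _, count in values)
        yield [sum(col) / n for col in zip(*(vec for vec, _ in values))]
    else:
        sums = [sum(s) for s in zip(*(s for s, _ in values))]
        counts = [sum(c) for c in zip(*(c for _, c in values))]
        yield [s / c if c else 0.0 for s, c in zip(sums, counts)]


def run_row_cluster_job(records, row_clusters, col_of, n_col_clusters):
    """Single-process stand-in for map, shuffle and reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in row_cluster_mapper(record, row_clusters, col_of, n_col_clusters):
            shuffled[key].append(value)
    return {key: list(row_cluster_reducer(key, values))
            for key, values in shuffled.items()}


# (rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError)
records = [
    (10, [1.0, 1.0, 0.0, 0.0], [], 0, 9.9),
    (11, [1.0, 1.0, 0.0, 1.0], [], 0, 9.9),
    (12, [0.0, 0.0, 0.0, 1.0], [], 1, 9.9),
]
row_clusters = {0: [0.0, 0.0, 0.0, 0.0], 1: [1.0, 1.0, 0.0, 0.0]}
col_of = [0, 0, 1, 1]  # assumed column cluster of each column

out = run_row_cluster_job(records, row_clusters, col_of, n_col_clusters=2)
```

Tagging each value with its type lets one reducer handle all three outputs, which matches the "by ValueType" branching in the diagram.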

Page 27

Column Cluster Updator Job

[Diagram: data flow of the Column Cluster Updator Job]

• Input record: colId, clickVector, attributeVector, curColClusterId, curColClusterError

• ColCluster Mapper: pick the best col group/cluster, the one which minimizes cost or error, and emit key/value pairs by value type:

    • DATA: colId clickVector attributeVector bestColClusterId cost

    • COLCLUSTER: colId, Updated colCluster

    • COL GLOB MODEL: colId, Updated Partial GlobalModel

• ColCluster Reducer, by ValueType:

    • DATA: Just Emit

    • COL CLUSTER: Aggregate Col Cluster

    • COL GLOB CLUSTER: Aggregate Partial Global Model for given col cluster

• Output: Updated Col Clusters, Updated Partial Global Model. Also write: colId clickVector attributeVector bestColClusterId cost