Apache Hadoop India Summit 2011 talk "Framework for a Suite of Co-clustering Algorithms for...

Framework for a Suite of Co-clustering Algorithms for Predictive Modeling on Hadoop

Vaijanath N. Rao


Rohini Uppuluri



Agenda

• Introduction

• Background

• Some Approaches

• Co-Clustering

• Introduction

• Related Work

• Why Hadoop?

• Goal

• Our Framework

• Conclusions and Future Work


Background

Modeling for Prediction

• Will user A like this movie?

• Will user B like this camera?

• Customer purchase decisions in an e-commerce setting

And tons of other things…


Some Approaches

• Collaborative filtering

• User-based, item-based, model-based, content-based, hybrid, etc. (see [1], [2])

• Latent Models

• Probabilistic Latent Semantic Indexing [3,6]

• Matrix Factorization [4,7,8]

• Predictive Discrete Latent Factor models [5]

• Co-clustering

• Clustering along multiple axes [9,10]; survey in [16]


Co-clustering

[Figure: a Users × Products matrix of 0/1 entries, with unknown entries marked "?", is rearranged over successive error-reducing iterations into Clustered Users × Clustered Products. User Attributes and Product Attributes accompany the matrix. Each iteration consists of a row cluster update, a column cluster update, and a global model update.]








Some Approaches

• Bregman co-clustering - Framework [11]

• Information theoretic co-clustering [12]

• Min sum squared co-clustering [13]

• Scalable framework based on the Bregman framework [14]

• DisCo [15]


Why Hadoop?

• Real-world data is huge

• Large matrices to operate on (millions of rows, millions of columns)

• Lots of computation


Goal

• There are a number of approaches, hence the need for a common framework

• Build a framework on Hadoop that the multiple algorithms fit into

• Make the framework easy for users to choose from and use


Overview

[Figure: overview diagram showing the Input, the Row Cluster Updator Job, the Column Cluster Updator Job, the Global Model Updator Job, and the Row Clusters, Column Clusters, and Global Model they operate on.]


Overview : Core Interfaces

• Input vector (type, id, datavec, attributevec, cost, assignment)

• Cluster (vector, len)

• Row Cluster

• Column Cluster

• Distance/Error Function (vector1, vector2)

• Model (matrix)

• Row Model

• Column Model

• Group Model

• Objective Function (Model1, Model2)
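The interface list above could be modeled roughly as follows. This is only a sketch: the slides give the names and fields but no types or method signatures, so the field types, the `add` and `centroid` helpers, and the two sample functions `squared_error` and `model_change` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Vector = List[float]
Matrix = List[List[float]]


@dataclass
class InputVector:
    """One row or column of the input matrix (field names from the slide)."""
    type: str                      # "ROW" or "COLUMN" (assumed encoding)
    id: int
    datavec: Vector
    attributevec: Vector = field(default_factory=list)
    cost: float = float("inf")     # error against the assigned cluster
    assignment: Optional[int] = None


@dataclass
class Cluster:
    """Cluster (vector, len): here vector is assumed to be a running sum."""
    vector: Vector
    len: int = 0

    def add(self, v: Vector) -> None:
        self.vector = [a + b for a, b in zip(self.vector, v)]
        self.len += 1

    def centroid(self) -> Vector:
        if self.len == 0:
            return list(self.vector)
        return [a / self.len for a in self.vector]


class RowCluster(Cluster):
    """Cluster of rows."""


class ColumnCluster(Cluster):
    """Cluster of columns."""


@dataclass
class Model:
    matrix: Matrix


# Distance/Error Function (vector1, vector2) and Objective Function (Model1, Model2)
DistanceFunction = Callable[[Vector, Vector], float]
ObjectiveFunction = Callable[[Model, Model], float]


def squared_error(v1: Vector, v2: Vector) -> float:
    """One possible distance/error function (an assumption, not from the slides)."""
    return sum((a - b) ** 2 for a, b in zip(v1, v2))


def model_change(m1: Model, m2: Model) -> float:
    """One possible objective: total squared change between two models."""
    return sum(squared_error(r1, r2) for r1, r2 in zip(m1.matrix, m2.matrix))
```

Keeping the distance and objective functions as plain callables is one way to let different algorithms plug in their own error measure without changing the cluster or model types.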


Currently we have

• Graph-based bi-clustering

• DisCo


DisCo Algorithm

1. Initialization

1.1 Initialize row and column clusters

1.2 Compute global model

2. While the objective function has not converged

2.1 For each row in the data, pick the row group which minimizes error

2.2 Update row clusters

2.3 Update global model

2.4 For each column in the data, pick the column group which minimizes error

2.5 Update column clusters

2.6 Update global model

3. Return row and column clusters
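The steps above can be sketched on a single machine as follows. This is a minimal in-memory illustration, not the Hadoop implementation from the talk. It assumes a dense matrix, squared error, a global model made of block means, a round-robin initialization, and a stopping rule of "stop when the total error no longer decreases"; none of these choices is stated in the slides.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def global_model(data: Matrix, row_of: List[int], col_of: List[int],
                 k: int, l: int) -> Matrix:
    """Mean of the entries in each (row group, column group) block."""
    sums = [[0.0] * l for _ in range(k)]
    counts = [[0] * l for _ in range(k)]
    for i, row in enumerate(data):
        for j, x in enumerate(row):
            sums[row_of[i]][col_of[j]] += x
            counts[row_of[i]][col_of[j]] += 1
    return [[sums[a][b] / counts[a][b] if counts[a][b] else 0.0
             for b in range(l)] for a in range(k)]


def total_error(data: Matrix, row_of: List[int], col_of: List[int],
                model: Matrix) -> float:
    """Squared error between the data and its block-mean approximation."""
    return sum((x - model[row_of[i]][col_of[j]]) ** 2
               for i, row in enumerate(data) for j, x in enumerate(row))


def co_cluster(data: Matrix, k: int, l: int,
               max_iter: int = 20) -> Tuple[List[int], List[int], float]:
    n, m = len(data), len(data[0])
    # 1.1 initialize row and column clusters (round-robin is an assumption)
    row_of = [i % k for i in range(n)]
    col_of = [j % l for j in range(m)]
    # 1.2 compute global model
    model = global_model(data, row_of, col_of, k, l)
    error = total_error(data, row_of, col_of, model)
    for _ in range(max_iter):
        # 2.1-2.2 assign each row to the row group with the smallest error
        for i in range(n):
            row_of[i] = min(range(k), key=lambda g: sum(
                (data[i][j] - model[g][col_of[j]]) ** 2 for j in range(m)))
        # 2.3 update global model
        model = global_model(data, row_of, col_of, k, l)
        # 2.4-2.5 assign each column to the column group with the smallest error
        for j in range(m):
            col_of[j] = min(range(l), key=lambda h: sum(
                (data[i][j] - model[row_of[i]][h]) ** 2 for i in range(n)))
        # 2.6 update global model
        model = global_model(data, row_of, col_of, k, l)
        new_error = total_error(data, row_of, col_of, model)
        improved = new_error < error
        error = new_error
        if not improved:
            break
    # 3. return row and column clusters (plus the final error)
    return row_of, col_of, error
```

A fixed initialization like the round-robin one can get stuck on symmetric inputs where every block starts with the same mean, so a practical version would more likely use a random or data-driven start.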


Pick the Best Row Group/Cluster


Example


RowCluster Updator Job


Example


BiClustering


Pick the Best Row Group/Cluster


Example


Row Updator Job

[Figure: data flow of the Row Updator Job, summarized below]

• Input: key lineId (0 in the figure); value rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError

• RowCluster Mapper: pick the best row group/cluster, the one which minimizes cost or error

• Record with key type DATA: rowId, clickVector, attributeVector, bestRowClusterId, cost

• Record with key type ROWCLUSTER: bestRowClusterId, clickVector

• RowCluster Reducer, key type DATA: just emit

• RowCluster Reducer, key type ROWCLUSTER: aggregate the row cluster

• Output: updated row clusters; the DATA records are also written
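The mapper/reducer split on this slide can be imitated in plain Python. The sketch below simulates the shuffle with a dictionary and does not use any Hadoop API. The function names `row_mapper`, `row_reducer`, and `run_row_job` are hypothetical, and scoring a row by squared distance to a per-cluster vector is an assumption; the slides only say that the choice minimizes cost or error.

```python
from collections import defaultdict


def row_mapper(record, cluster_vectors):
    """record: (rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError)."""
    row_id, click, attrs, _cur_id, _cur_err = record
    # Cost of assigning this row to each row cluster (squared distance, assumed)
    costs = {cid: sum((a - b) ** 2 for a, b in zip(click, vec))
             for cid, vec in cluster_vectors.items()}
    best = min(costs, key=costs.get)
    # Key type DATA: the row with its new assignment and cost
    yield ("DATA", row_id), (click, attrs, best, costs[best])
    # Key type ROWCLUSTER: the row's contribution to its best cluster
    yield ("ROWCLUSTER", best), click


def row_reducer(key, values):
    key_type, _ = key
    if key_type == "DATA":
        # DATA: just emit
        for value in values:
            yield key, value
    else:
        # ROWCLUSTER: aggregate member vectors into (sum vector, member count)
        yield key, ([sum(col) for col in zip(*values)], len(values))


def run_row_job(records, cluster_vectors):
    """Map, group by key (a stand-in for the shuffle), then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in row_mapper(record, cluster_vectors):
            grouped[key].append(value)
    output = {}
    for key in sorted(grouped):
        for out_key, out_value in row_reducer(key, grouped[key]):
            output[out_key] = out_value
    return output
```

Tagging each key with its type lets a single job carry both the per-row records and the per-cluster aggregates, which appears to be the intent of the DATA and ROWCLUSTER labels in the figure.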


Example


Conclusions and Future Work

• Implementing more algorithms

• Easy-to-use examples and more documentation


References

[1] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In SIGIR, pages 230–237, 1999.

[2] J. Basilico and T. Hofmann. Unifying collaborative and content-based filtering. In ICML ’04, pages 65–72, 2004.

[3] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proc. IJCAI ’99, 1999.

[4] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS ’07, 2007.

[5] D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale dyadic data. In Proc. KDD ’07, pages 26–35, 2007.

[6] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW ’07, pages 271–280, 2007.

[7] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD ’08, pages 426–434, 2008.

[8] H. Ma, H. Yang, M. Lyu, and I. King. SoRec: social recommendation using probabilistic matrix factorization. In CIKM ’08, pages 931–940, 2008.

[9] Y. Cheng and G. M. Church. Biclustering of expression data. In Proc. ISMB ’00, pages 93–103, 2000.

[10] T. George and S. Merugu. A scalable collaborative filtering framework based on co-clustering. In ICDM, pages 625–628, 2005.


References (contd.)

[11] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. JMLR, 8:1919–1986, 2007.

[12] I. Dhillon, S. Mallela, and D. Modha. Information-theoretic co-clustering. In Proc. KDD ’03, pages 89–98, 2003.

[13] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra. Minimum sum squared residue co-clustering of gene expression data. In Proc. SDM ’04, 2004.

[14] M. Deodhar, G. Gupta, J. Ghosh, H. Cho, and I. Dhillon. A scalable framework for discovering coherent co-clusters in noisy data. In ICML ’08, 2008.

[15] S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining. In ICDM ’08, pages 512–521, 2008.

[16] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. on Computational Biology and Bioinformatics, 1(1):24–45, 2004.


Thank you


Row Cluster Updator Job

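[Figure: data flow of the Row Cluster Updator Job, summarized below]

• Input value: rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError

• RowCluster Mapper: pick the best row group/cluster, the one which minimizes cost or error

• Record with value type DATA: rowId, clickVector, attributeVector, bestRowClusterId, cost

• Record with value type ROWCLUSTER: rowId, updated rowCluster

• Record with value type ROW GLOB MODEL: rowId, updated partial GlobalModel

• RowCluster Reducer, value type DATA: just emit

• RowCluster Reducer, value type ROWCLUSTER: aggregate the row cluster

• RowCluster Reducer, value type ROW GLOB MODEL: aggregate the partial global model for the given row cluster

• Output: updated row clusters and updated partial global model; the DATA records are also written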
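The slide does not say what a partial global model contains. One plausible reading, used in the sketch below purely as an assumption, is that each row cluster contributes the sum and count of its members' entries per column cluster, and that dividing sum by count gives that row cluster's slice of the global model. The function names are hypothetical.

```python
def partial_global_model(click_vectors, col_of, num_col_clusters):
    """Aggregate one row cluster's rows into per-column-cluster (sums, counts).

    col_of[j] is the column cluster of column j (assumed to be known to the job).
    """
    sums = [0.0] * num_col_clusters
    counts = [0] * num_col_clusters
    for click in click_vectors:
        for j, x in enumerate(click):
            sums[col_of[j]] += x
            counts[col_of[j]] += 1
    return sums, counts


def merge_global_model(partials):
    """partials: {rowClusterId: (sums, counts)} -> {rowClusterId: block means}."""
    return {
        row_cluster: [s / c if c else 0.0 for s, c in zip(sums, counts)]
        for row_cluster, (sums, counts) in partials.items()
    }
```

Because sums and counts are additive, partial results for the same row cluster coming from different reducers or input splits could be combined by element-wise addition before the division.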


Column Cluster Updator Job

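[Figure: data flow of the Column Cluster Updator Job, summarized below]

• Input value: colId, clickVector, attributeVector, curColClusterId, curColClusterError

• ColCluster Mapper: pick the best column group/cluster, the one which minimizes cost or error

• Record with value type DATA: colId, clickVector, attributeVector, bestColClusterId, cost

• Record with value type COLCLUSTER: colId, updated colCluster

• Record with value type COL GLOB MODEL: colId, updated partial GlobalModel

• ColCluster Reducer, value type DATA: just emit

• ColCluster Reducer, value type COLCLUSTER: aggregate the column cluster

• ColCluster Reducer, value type COL GLOB MODEL: aggregate the partial global model for the given column cluster

• Output: updated column clusters and updated partial global model; the DATA records (colId, clickVector, attributeVector, bestColClusterId, cost) are also written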
