GigaTensor: Scaling Tensor Analysis Up By 100 Times – Algorithms and Discoveries
U Kang, Christos Faloutsos, Evangelos Papalexakis, Abhay Harpale
School of Computer Science, Carnegie Mellon University (CMU SCS)
KDD 2012




Outline

Problem Definition

Algorithm

Discoveries

Conclusions


Background: Tensor

Tensors (= multi-dimensional arrays) are everywhere
• Hyperlinks and anchor texts in Web graphs

[Figure: 3-way tensor with modes URL 1, URL 2, and Anchor Text (Java, C++, C#); nonzero entries are 1]
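As a rough illustration of the Web-graph example, a sparse 3-way tensor can be stored as a map from (URL 1, URL 2, anchor text) coordinates to values, with every absent coordinate treated as zero. The URLs and the helper name `build_sparse_tensor` below are made up for this sketch and do not come from the talk.

```python
# Hypothetical (source URL, target URL, anchor text) observations.
links = [
    ("url_a", "url_b", "Java"),
    ("url_a", "url_c", "C++"),
    ("url_b", "url_c", "C#"),
    ("url_a", "url_b", "Java"),  # duplicate observation, still a single nonzero
]

def build_sparse_tensor(triples):
    """Return a coordinate-format tensor: {(i, j, k): 1} for every observed triple."""
    tensor = {}
    for triple in triples:
        tensor[triple] = 1  # binary indicator; missing keys are implicit zeros
    return tensor

tensor = build_sparse_tensor(links)
print(len(tensor), "nonzeros")
```

Only the nonzeros are stored, which is what makes tensors with tens of millions of indices per mode representable at all.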


Background: Tensor

Tensors (= multi-dimensional arrays) are everywhere
• Sensor stream (time, location, type)
• Predicates (subject, verb, object) in knowledge base
  “Barack Obama is the president of U.S.”
  “Eric Clapton plays guitar”

[Figure: NELL (Never Ending Language Learner) data as a 26M × 26M × 48M tensor; nonzeros = 144M]


Problem Definition

Q1: How to decompose a billion-scale tensor?
• Corresponds to SVD in the 2D case


Problem Definition

Q2: What are the important concepts and synonyms in a KB tensor?
• Q2.1: What are the dominant concepts in the knowledge base tensor?
• Q2.2: What are the synonyms to a given noun phrase?

[Figure: NELL (Never Ending Language Learner) data as a 26M × 26M × 48M tensor; nonzeros = 144M]


Outline

Problem Definition

Algorithm

Discoveries

Conclusions


Algorithm: Problem Definition

Q1: How to decompose a billion-scale tensor?
• Corresponds to SVD in the 2D case


Challenge

Alternating Least Squares (ALS) algorithm
• Operators used in the update equations: pseudo-inverse, Hadamard product, Khatri-Rao product

How to design a fast MapReduce algorithm for ALS?

[Figure: NELL tensor with I = 26M, J = 26M, K = 48M]
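The update equations did not survive in this transcript. As a hedged sketch, the textbook ALS step for a 3-way PARAFAC/CP model updates the mode-1 factor as A ← X(1) (C ⊙ B) (CᵀC ∗ BᵀB)†, which uses exactly the three operators named on the slide. The NumPy code below assumes that standard form; the function names are illustrative and this is a small dense, single-machine version, not the MapReduce algorithm from the talk.

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product of C (K x R) and B (J x R); row k*J + j = C[k] * B[j]."""
    K, R = C.shape
    J, _ = B.shape
    return (C[:, None, :] * B[None, :, :]).reshape(K * J, R)

def unfold_mode1(X):
    """Mode-1 unfolding of an I x J x K array into I x (J*K); column index is k*J + j."""
    I, J, K = X.shape
    return X.transpose(0, 2, 1).reshape(I, K * J)

def als_update_A(X, B, C):
    """One least-squares update of A with B and C held fixed (assumed standard CP-ALS step)."""
    gram = (C.T @ C) * (B.T @ B)  # Hadamard product of two R x R Gram matrices
    return unfold_mode1(X) @ khatri_rao(C, B) @ np.linalg.pinv(gram)

# Sanity check on a tiny exact rank-2 tensor: the update recovers the true A.
rng = np.random.default_rng(0)
A_true = rng.standard_normal((4, 2))
B = rng.standard_normal((5, 2))
C = rng.standard_normal((6, 2))
X = np.einsum("ir,jr,kr->ijk", A_true, B, C)
A_hat = als_update_A(X, B, C)
print(np.allclose(A_hat, A_true))
```

A full ALS loop would repeat the analogous update for B and C until the fit stops improving. At NELL scale the dense unfolding and the explicit Khatri-Rao product used here are infeasible, which is the challenge the slide raises.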


Main Idea

1. Ordering of Computation
• Our choice: 8 · 10^9 FLOPs (NELL data)
• Alternative ordering: 2.5 · 10^17 FLOPs (NELL data)
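To see why the order of multiplication matters so much, here is a back-of-the-envelope estimate for the product X(1) (C ⊙ B) (CᵀC ∗ BᵀB)†. It assumes 2 floating-point operations per multiply-add, a sparse X(1) with the NELL nonzero count, and a rank of R = 10. The rank is not stated in the slides; it is an assumption that happens to land on the same orders of magnitude as the two figures above. The function names are illustrative.

```python
def flops_tensor_product_first(nnz, I, R):
    """Compute X(1)(C ⊙ B) over the nonzeros first, then an (I x R)(R x R) product."""
    return 2 * nnz * R + 2 * I * R * R

def flops_factor_product_first(nnz, J, K, R):
    """Compute the dense (JK x R)(R x R) product first, then multiply by sparse X(1)."""
    return 2 * J * K * R * R + 2 * nnz * R

# NELL sizes from the slides; R is an assumed rank.
I, J, K, NNZ = 26_000_000, 26_000_000, 48_000_000, 144_000_000
R = 10

cheap = flops_tensor_product_first(NNZ, I, R)
costly = flops_factor_product_first(NNZ, J, K, R)
print(f"{cheap:.2e} vs {costly:.2e}")  # prints 8.08e+09 vs 2.50e+17
```

Both orderings give the same matrix because matrix multiplication is associative; only the cost differs, and the dominant term in the expensive ordering is the J·K·R² product over all J·K rows of the Khatri-Rao matrix.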


Main Idea

2. Avoiding Intermediate Data Explosion
• Size of intermediate data (NELL), naïve: 100 PB

[Figure: NELL tensor with I = 26M, J = 26M, K = 48M]


Main Idea

2. Avoiding Intermediate Data Explosion
• Size of intermediate data (NELL), naïve (before): 100 PB
• Size of intermediate data (NELL), proposed (after): 1.5 GB
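The sketch below illustrates the general principle behind avoiding the explosion: the product X(1)(C ⊙ B) can be accumulated one nonzero at a time, so the (J·K) × R Khatri-Rao matrix never has to be materialized. This is a single-machine illustration with made-up data and hypothetical function names, not the MapReduce formulation presented in the talk. The final size estimate assumes R = 10 and 8-byte values, neither of which is stated in the slides.

```python
def khatri_rao(C, B):
    """Materialize C ⊙ B: row k*J + j holds the element-wise product of C[k] and B[j]."""
    R = len(B[0])
    return [[C[k][r] * B[j][r] for r in range(R)]
            for k in range(len(C)) for j in range(len(B))]

def mttkrp_naive(nonzeros, I, J, B, C):
    """Build the full Khatri-Rao matrix, then multiply by the sparse unfolding."""
    kr = khatri_rao(C, B)  # J*K rows: this is the intermediate that explodes
    R = len(B[0])
    M = [[0.0] * R for _ in range(I)]
    for (i, j, k), x in nonzeros.items():
        row = kr[k * J + j]
        for r in range(R):
            M[i][r] += x * row[r]
    return M

def mttkrp_sparse(nonzeros, I, B, C):
    """Same result, touching only the rows of B and C that a nonzero refers to."""
    R = len(B[0])
    M = [[0.0] * R for _ in range(I)]
    for (i, j, k), x in nonzeros.items():
        for r in range(R):
            M[i][r] += x * B[j][r] * C[k][r]
    return M

# Tiny made-up example: I = 2, J = 3, K = 2, R = 2.
nonzeros = {(0, 0, 1): 1.0, (1, 2, 0): 2.0, (1, 1, 1): 3.0}
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = [[1.0, 0.0], [2.0, 1.0]]
result = mttkrp_sparse(nonzeros, 2, B, C)
print(result == mttkrp_naive(nonzeros, 2, 3, B, C))

# Rough size of the naive intermediate at NELL scale (R and 8 bytes are assumptions).
J, K, R, BYTES_PER_VALUE = 26_000_000, 48_000_000, 10, 8
naive_bytes = J * K * R * BYTES_PER_VALUE
print(f"{naive_bytes / 1e15:.0f} PB")  # prints 100 PB
```

The sparse version does work proportional to the number of nonzeros times R, independent of J·K, which is why the intermediate data can shrink from petabytes to gigabytes.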


Experiments

GigaTensor solves a 100x larger problem

[Figure: GigaTensor vs. Tensor Toolbox on an I × J × K tensor with number of nonzeros = I / 50; Tensor Toolbox runs out of memory, while GigaTensor handles sizes 100x larger]


Outline

Problem Definition

Algorithm

Discoveries

Conclusions


Discoveries: Problem Definition

Q2: What are the important concepts and synonyms in a KB tensor?
• Q2.1: What are the dominant concepts in the knowledge base tensor?
• Q2.2: What are the synonyms to a given noun phrase?

[Figure: NELL (Never Ending Language Learner) data as a 26M × 26M × 48M tensor; nonzeros = 144M]


A2.1: Concept Discovery

Concept Discovery in Knowledge Base



A2.2: Synonym Discovery

Synonym Discovery in Knowledge Base

[Figure: factor matrix with columns a1, a2, …, aR; rows are marked for the (given) noun phrase, (discovered) synonym 1, and (discovered) synonym 2]
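One plausible reading of the figure is that each noun phrase corresponds to a row of the factor matrix with columns a1 … aR, and that synonyms are phrases whose rows point in a similar direction. The sketch below ranks candidates by cosine similarity under that assumption. The similarity measure, the phrases, and the numbers are all illustrative and not taken from the slides.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_synonyms(factor_rows, query, n=2):
    """Return the n phrases whose factor rows are most similar to the query's row."""
    q = factor_rows[query]
    scored = [(cosine(q, row), phrase)
              for phrase, row in factor_rows.items() if phrase != query]
    scored.sort(reverse=True)
    return [phrase for _, phrase in scored[:n]]

# Made-up rows of a noun-phrase factor matrix with R = 3 columns (a1, a2, a3).
factor_rows = {
    "car":        [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.0],
    "vehicle":    [0.6, 0.1, 0.3],
    "guitar":     [0.0, 0.1, 0.9],
}
print(top_synonyms(factor_rows, "car"))
```

Because the rows come from a decomposition of (subject, verb, object) data, phrases end up close together when they occur in similar contexts, which is what makes this a contextual notion of synonymy.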



Outline

Problem Definition

Algorithm

Discoveries

Conclusions


Conclusion

• GigaTensor: scalable tensor decomposition algorithm for tensors with billion-length modes
• Algorithm: avoid intermediate data explosion
• Discoveries: concept discovery and contextual synonym detection on KB tensor


Thank you!
www.cs.cmu.edu/~pegasus
www.cs.cmu.edu/~ukang