CMU SCS
U Kang (CMU), KDD 2012
GigaTensor: Scaling Tensor Analysis Up By 100 Times – Algorithms and Discoveries
U Kang, Christos Faloutsos, Evangelos Papalexakis, Abhay Harpale
School of Computer Science, Carnegie Mellon University
Outline
Problem Definition
Algorithm
Discoveries
Conclusions
Background: Tensor
Tensors (= multi-dimensional arrays) are everywhere
• Hyperlinks and anchor texts in Web graphs
[Figure: 3-mode tensor indexed by URL 1, URL 2, and Anchor Text (Java, C++, C#); nonzero entries are 1]
Background: Tensor
Tensors (= multi-dimensional arrays) are everywhere
• Sensor stream (time, location, type)
• Predicates (subject, verb, object) in knowledge base
  “Barack Obama is the president of U.S.”
  “Eric Clapton plays guitar”
[Figure: NELL (Never Ending Language Learner) data as a 26M × 26M × 48M tensor; nonzeros = 144M]
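A minimal sketch of how (subject, verb, object) predicates could be stored as a sparse 3-mode tensor in coordinate form. The toy triples, the helper name build_coo, the mode order (subject, object, verb), and the choice of occurrence counts as values are all assumptions for illustration; the slides do not specify the storage format.

```python
import numpy as np

# Hypothetical toy triples (subject, verb, object); this is not NELL data.
triples = [
    ("Eric Clapton", "plays", "guitar"),
    ("Barack Obama", "is the president of", "U.S."),
    ("Eric Clapton", "plays", "guitar"),  # repeated observation
]

def build_coo(triples):
    """Map each mode's strings to integer ids and count repeated triples."""
    subj_ids, obj_ids, verb_ids = {}, {}, {}
    counts = {}
    for s, v, o in triples:
        i = subj_ids.setdefault(s, len(subj_ids))
        j = obj_ids.setdefault(o, len(obj_ids))
        k = verb_ids.setdefault(v, len(verb_ids))
        counts[(i, j, k)] = counts.get((i, j, k), 0.0) + 1.0
    coords = np.array(list(counts.keys()), dtype=np.int64)    # shape (nnz, 3)
    values = np.array(list(counts.values()), dtype=np.float64)
    shape = (len(subj_ids), len(obj_ids), len(verb_ids))
    return coords, values, shape

coords, values, shape = build_coo(triples)
```

Only the nonzeros are stored, so memory grows with the number of observed triples rather than with the product of the mode sizes.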
Problem Definition
Q1: How to decompose a billion-scale tensor?
• Corresponds to SVD in the 2D case
Problem Definition
Q2: What are the important concepts and synonyms in a KB tensor?
• Q2.1: What are the dominant concepts in the knowledge base tensor?
• Q2.2: What are the synonyms to a given noun phrase?
[Figure: NELL (Never Ending Language Learner) data as a 26M × 26M × 48M tensor; nonzeros = 144M]
Algorithm: Problem Definition
Q1: How to decompose a billion-scale tensor?
• Corresponds to SVD in the 2D case
Challenge
Alternating Least Squares (ALS) Algorithm
• [Equation: ALS update of a factor matrix; †: pseudo-inverse, ∗: Hadamard product, ⊙: Khatri-Rao product]
How to design a fast MapReduce algorithm for ALS?
[Figure: NELL tensor with modes I = 26M, J = 26M, K = 48M]
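The equation on this slide did not survive extraction. The sketch below assumes it is the standard PARAFAC-ALS update A ← X(1) (C ⊙ B) (CᵀC ∗ BᵀB)†, which uses exactly the three operators in the legend. It is a small dense, single-machine illustration, not the MapReduce algorithm; the helper names khatri_rao, unfold_mode1, and als_update_A are hypothetical.

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product: row k*J + j holds C[k, :] * B[j, :]."""
    K, R = C.shape
    J, _ = B.shape
    return (C[:, None, :] * B[None, :, :]).reshape(K * J, R)

def unfold_mode1(X):
    """Mode-1 unfolding of an I x J x K array into I x (J*K), column k*J + j."""
    I, J, K = X.shape
    return X.transpose(0, 2, 1).reshape(I, K * J)

def als_update_A(X, B, C):
    """One ALS step for A with B and C held fixed (assumed standard update)."""
    X1 = unfold_mode1(X)
    gram = (C.T @ C) * (B.T @ B)  # Hadamard product of two R x R Gram matrices
    return X1 @ khatri_rao(C, B) @ np.linalg.pinv(gram)

# Toy check: for an exactly rank-2 tensor, one update recovers A from B and C.
rng = np.random.default_rng(0)
A, B, C = rng.random((4, 2)), rng.random((5, 2)), rng.random((6, 2))
X = np.einsum("ir,jr,kr->ijk", A, B, C)
A_hat = als_update_A(X, B, C)
```

In full ALS the same step is applied to B and C in turn and repeated until the fit stops improving. Forming khatri_rao(C, B) explicitly is what becomes infeasible at NELL scale.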
Main Idea
1. Ordering of Computation
• Our choice: 8 × 10^9 FLOPS (NELL data)
• Other ordering: 2.5 × 10^17 FLOPS (NELL data)
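The two FLOP counts can be roughly reproduced with back-of-the-envelope arithmetic. This sketch assumes the update A ← X(1) (C ⊙ B) (CᵀC ∗ BᵀB)†, a rank of R = 10 (not stated on the slides, chosen because it matches the figures), and a cost model of 2 flops per multiply-add in which a sparse product touches only the nonzeros. Under those assumptions, the cheaper ordering is presumably the one marked "Our choice".

```python
# NELL sizes from the slides; R is an assumed rank, not given in the deck.
I, J, K = 26_000_000, 26_000_000, 48_000_000
nnz = 144_000_000
R = 10

# Ordering 1: [X(1) (C ⊙ B)] first, then multiply the I x R result by an R x R matrix.
flops_first = 2 * nnz * R + 2 * I * R**2

# Ordering 2: (C ⊙ B) (CᵀC ∗ BᵀB)† first, a dense (J*K) x R by R x R product,
# followed by the sparse product with X(1).
flops_second = 2 * J * K * R**2 + 2 * nnz * R

print(f"ordering 1: {flops_first:.2e} flops")
print(f"ordering 2: {flops_second:.2e} flops")
```

The gap comes from the J·K term: the second ordering pays for every row of the Khatri-Rao product, while the first pays only for the nonzeros of the tensor.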
Main Idea
2. Avoiding Intermediate Data Explosion
Size of Intermediate Data (NELL)
• Naïve (before): 100 PB
• Proposed (after): 1.5 GB
[Figure: NELL tensor with modes I = 26M, J = 26M, K = 48M]
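A minimal single-machine sketch of the idea, assuming the expensive intermediate is the Khatri-Rao product C ⊙ B inside X(1) (C ⊙ B). With an assumed rank R = 10 and 8-byte values, that matrix would have J·K·R ≈ 1.2 × 10^16 entries, about 100 PB, which is consistent with the naïve figure. The function below accumulates the same result directly from the nonzeros and never forms C ⊙ B. The name mttkrp_mode1 is hypothetical, and this is not the MapReduce formulation used by GigaTensor.

```python
import numpy as np

def mttkrp_mode1(coords, values, B, C, I):
    """Compute X(1) (C ⊙ B) from nonzeros only, without building C ⊙ B."""
    R = B.shape[1]
    M = np.zeros((I, R))
    i, j, k = coords[:, 0], coords[:, 1], coords[:, 2]
    # Each nonzero x_ijk adds x_ijk * (B[j, :] * C[k, :]) to row i of the result.
    np.add.at(M, i, values[:, None] * B[j] * C[k])
    return M

# Toy sparse tensor with four nonzeros (two of them share row i = 0).
rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3
coords = np.array([[0, 1, 2], [3, 4, 5], [0, 0, 0], [2, 1, 2]])
values = np.array([1.0, 2.0, 3.0, 4.0])
B, C = rng.random((J, R)), rng.random((K, R))
M = mttkrp_mode1(coords, values, B, C, I)
```

The working memory here is proportional to the number of nonzeros times R, instead of J·K·R.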
Experiments
GigaTensor solves 100x larger problem
[Figure: GigaTensor vs. Tensor Toolbox on tensors with modes I, J, K and number of nonzeros = I / 50; Tensor Toolbox runs out of memory, GigaTensor handles 100x larger sizes]
Discoveries: Problem Definition
Q2: What are the important concepts and synonyms in a KB tensor?
• Q2.1: What are the dominant concepts in the knowledge base tensor?
• Q2.2: What are the synonyms to a given noun phrase?
[Figure: NELL (Never Ending Language Learner) data as a 26M × 26M × 48M tensor; nonzeros = 144M]
A2.1: Concept Discovery
Concept Discovery in Knowledge Base
A2.2: Synonym Discovery
Synonym Discovery in Knowledge Base
[Figure: factor matrix with columns a1, a2, …, aR, marking the rows of the (given) noun phrase, (discovered) synonym 1, and (discovered) synonym 2]
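One way to read the figure: each noun phrase is a row of the factor matrix with columns a1 … aR, and candidate synonyms are the rows most similar to the given row. The sketch below assumes cosine similarity as the measure; the slide itself does not name one. The function name top_synonyms and the toy matrix are hypothetical.

```python
import numpy as np

def top_synonyms(A, query, top_k=2):
    """Return indices of the top_k rows of A most cosine-similar to row `query`."""
    norms = np.linalg.norm(A, axis=1)
    norms[norms == 0] = 1.0            # avoid division by zero for empty rows
    unit = A / norms[:, None]
    sims = unit @ unit[query]
    sims[query] = -np.inf              # exclude the query itself
    return np.argsort(-sims)[:top_k]

# Toy factor matrix: 4 noun phrases, R = 2 components (made-up values).
A = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [1.0, 0.05],
])
synonyms = top_synonyms(A, query=0)
```

Rows 3 and 1 point in nearly the same direction as row 0, so they are returned as the two candidates, while row 2 is orthogonal and ranks last.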
Conclusion
GigaTensor: scalable tensor decomposition algorithm for tensors with billion-length modes
• Algorithm: avoid intermediate data explosion
• Discoveries: concept discovery and contextual synonym detection on KB tensor
Thank you!
www.cs.cmu.edu/~pegasus
www.cs.cmu.edu/~ukang