© 2007 Jimeng Sun
Less is More: Compact Matrix Decomposition for Large Sparse Graphs
Jimeng Sun, Yinglian Xie, Hui Zhang, Christos Faloutsos
Speaker: Jimeng Sun
Motivation
• Sparse matrices are everywhere:
  Network forensics
  Social network analysis
  Web graph analysis
  Text mining
• # of nonzeros in A (m × n) is O(m+n)
• How to summarize sparse matrices in a concise and intuitive manner?
  Applications: compression, anomaly detection
Problem: Network forensics
• Input: Network flows <src, dst, # of packets> over time.
<128.2.175.2, 128.2.175.184, 128>
<128.2.1.2, 128.2.175.184, 128>
<128.2.17.43, 128.2.12.1, 128>
…
• Output: Useful patterns
Summarize the traffic flows
Identify abnormal traffic patterns
Challenges
• High volume
  A large ISP with 100 POPs, each POP with 10 Gbps link capacity
  [Hotnets2004] reports 450 GB/hour with compression
• Sparsity
  Distribution is skewed
[Figure: source-by-destination traffic matrices illustrating the skewed distribution]
Outline
• Motivation
• Problem definition
• Proposed mining framework
  Sparsification
  Matrix decomposition
  Error Measure
• Experiments
• Related work
• Conclusion
Framework for network forensics:
• Sparsification → load shedding
• Matrix decomposition → summarization
• Error Measure → anomaly detection
Sparsification
• Each hour's src-by-dst traffic matrix is sparsified independently (a sketch follows below):
  Random sampling of entries with probability p
  Rescale each kept entry by 1/p
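The step above can be sketched as follows; this is a minimal illustration assuming the hourly traffic is held as a SciPy sparse matrix, with toy sizes and names that are not from the paper.

```python
import numpy as np
from scipy import sparse

def sparsify(A, p, seed=0):
    """Keep each nonzero entry of A independently with probability p,
    rescaling the survivors by 1/p so the matrix is preserved in expectation."""
    A = A.tocoo()
    rng = np.random.default_rng(seed)
    keep = rng.random(A.nnz) < p
    return sparse.coo_matrix(
        (A.data[keep] / p, (A.row[keep], A.col[keep])),
        shape=A.shape).tocsr()

# toy example: one hour's src-dst matrix (sizes are illustrative)
hourly = sparse.random(1000, 1000, density=0.01, format="csr", random_state=1)
shedded = sparsify(hourly, p=0.3)
```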
Matrix decomposition
• Goal: Summarize traffic matrices
• Why? Anomaly detection
• How?
  Singular Value Decomposition (SVD) - existing
  CUR Decomposition - existing
  Compact Matrix Decomposition (CMD) - new
Background: Singular Value Decomposition (SVD)
X = U Σ V^T

  X = [x^(1) x^(2) ... x^(M)]      input data
  U = [u_1 u_2 ... u_k]            left singular vectors
  Σ = diag(σ_1, σ_2, ..., σ_k)     singular values
  V = [v_1 v_2 ... v_k]            right singular vectors
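For concreteness, here is a small self-contained illustration of the rank-k truncated SVD using NumPy; it is not part of the original slides, and the matrix and rank are placeholders.

```python
import numpy as np

def svd_rank_k(X, k):
    """Best rank-k approximation of X (optimal in the L2 and Frobenius norms)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

X = np.random.default_rng(0).random((50, 30))
X_k = svd_rank_k(X, k=5)
rel_err = np.linalg.norm(X - X_k, "fro") / np.linalg.norm(X, "fro")
```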
Background: SVD applications
• Low-rank approximation
• Pseudo-inverse: M^+ = V Σ^{-1} U^T
• Principal component analysis
• Latent semantic indexing
• Webpage ranking: Kleinberg’s HITS score
Pros and cons of SVD
+ Optimal low-rank approximation
• in L2 and Frobenius norm
- Interpretability problem:
A singular vector specifies a linear combination of all input columns or rows.
- Lack of Sparsity
Singular vectors are usually dense
Matrix decomposition
• Goal: Summarize traffic matrices
• Why? Anomaly detection
• How?
× Singular Value Decomposition (SVD) - existing
CUR Decomposition - existing
Compact Matrix Decomposition (CMD) - new
Background: CUR decomposition
• Goal: make ||A-CUR|| small
• C: sampled columns of A; R: sampled rows of A; U: pseudo-inverse of the intersection of C and R (a sketch follows below)
Drineas et al., Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition, SIAM Journal on Computing, 2006.
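The construction can be sketched roughly as below. This sketch uses uniform sampling purely for brevity, whereas the CUR work cited here samples columns and rows with carefully chosen probabilities; only the overall shape (C and R are actual columns/rows, U is the pseudo-inverse of their intersection W) is meant to match the slide.

```python
import numpy as np

def cur_basic(A, c, r, seed=0):
    rng = np.random.default_rng(seed)
    cols = rng.choice(A.shape[1], size=c, replace=False)
    rows = rng.choice(A.shape[0], size=r, replace=False)
    C = A[:, cols]                    # actual columns of A
    R = A[rows, :]                    # actual rows of A
    W = A[np.ix_(rows, cols)]         # intersection of C and R
    return C, np.linalg.pinv(W), R    # U = W^+

A = np.random.default_rng(1).random((200, 150))
C, U, R = cur_basic(A, c=30, r=30)
rel_err = np.linalg.norm(A - C @ U @ R, "fro") / np.linalg.norm(A, "fro")
```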
CUR: provably good approximation to SVD
• Assume A_k is the "best" rank-k approximation to A (through SVD).
Thm [Drineas et al.]: CUR computed in O(mn) time achieves
  ||A-CUR|| <= ||A-A_k|| + ε ||A||
with probability at least 1-δ, by picking O( k log(1/δ) / ε^2 ) columns and O( k^2 log^3(1/δ) / ε^6 ) rows.
Drineas et al., Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition, SIAM Journal on Computing, 2006.
Background: CUR applications
• DNA SNP Data analysis
• Recommendation system
• Fast kernel approximation
1. Intra- and interpopulation genotype reconstruction from tagging SNPs, P. Paschou, M. W. Mahoney, A. Javed, J. R. Kidd, A. J. Pakstis, S. Gu, K. K. Kidd, and P. Drineas, Genome Research, 17(1), 96-107 (2007)
2. Tensor-CUR Decompositions For Tensor-Based Data, M. W. Mahoney, M. Maggioni, and P. Drineas, Proc. 12-th Annual SIGKDD, 327-336 (2006)
Pros and cons of CUR
+ Easy interpretation
  Since the basis vectors are actual columns and rows
+ Sparse basis
  Since the basis vectors are actual columns and rows
- Duplicate columns and rows
  Columns of large norms will be sampled many times
[Figure: a dense singular vector vs. an actual (sparse) column]
Matrix decomposition
• Goal: Summarize traffic matrices
• Why? Anomaly detection
• How?
× Singular Value Decomposition (SVD) – existing
× CUR Decomposition - existing
Compact Matrix Decomposition (CMD) - new
Compact Matrix Decomposition (CMD)
• Given a matrix A, find three matrices C, U, R such that ||A-CUR|| is small
  No duplicates in C and R
[Figure: CUR vs. CMD. CUR keeps duplicate columns/rows Cd, Rd with U = X^+, the pseudo-inverse of their intersection X; CMD keeps deduplicated Cs, Rs, and finding U is more involved]
Column sampling: subspace construction
• Sample c columns with replacement, biased toward the columns of large norm:
  probability p_i = ||A^(i)||^2 / Σ_j ||A^(j)||^2
• Rescale each sampled column by 1/sqrt(c·p_i) (a code sketch follows below)
[Figure: A → Cd with c = 6 sampled columns]
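A rough sketch of this biased column sampling, under the assumption that the rescaling factor is the usual 1/sqrt(c·p_i) of subspace sampling; all variable names are illustrative.

```python
import numpy as np

def sample_columns(A, c, seed=0):
    """Sample c columns with replacement, biased toward large squared norm,
    and rescale each sampled copy by 1/sqrt(c * p_i)."""
    rng = np.random.default_rng(seed)
    p = (A ** 2).sum(axis=0)
    p = p / p.sum()                          # p_i = ||A^(i)||^2 / sum_j ||A^(j)||^2
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    Cd = A[:, idx] / np.sqrt(c * p[idx])     # duplicates are still present here
    return Cd, idx, p

A = np.random.default_rng(2).random((100, 80))
Cd, idx, p = sample_columns(A, c=6)
```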
Column sampling: duplicate column removal
• Remove duplicate columns
• Scale the columns by the square root of the number of duplicates
[Figure: Cd → Cs after duplicate removal; a code sketch follows below]
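Continuing in the same spirit, a sketch of the duplicate removal: a column drawn d times is kept once and scaled by sqrt(d) on top of the sampling rescale, so Cd·Cd^T is preserved exactly; again, names and sizes are illustrative.

```python
import numpy as np

def dedup_columns(A, idx, c, p):
    """Collapse duplicate sampled columns: a column drawn d times is kept once,
    scaled by sqrt(d) on top of the 1/sqrt(c * p_i) sampling rescale,
    so that Cs @ Cs.T equals Cd @ Cd.T."""
    uniq, counts = np.unique(idx, return_counts=True)
    Cs = A[:, uniq] * np.sqrt(counts / (c * p[uniq]))
    return Cs, uniq

# tiny self-contained example (illustrative sizes)
rng = np.random.default_rng(0)
A = rng.random((100, 80))
p = (A ** 2).sum(axis=0)
p = p / p.sum()
c = 6
idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
Cs, kept = dedup_columns(A, idx, c, p)
```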
Column sampling: correctness proof
Thm: The matrices Cs and Cd have the same singular values and left singular vectors (see the paper for the proof)
Implication: duplicate-column removal preserves the sampled top-k subspace
CMD construction (details)
• Low-rank approximation: project A onto the top-c column subspace spanned by C. With the SVD C = U_c Σ_C V_C^T:
  Ã = U_c U_c^T A
    = C V_C Σ_C^{-1} (C V_C Σ_C^{-1})^T A
    = C V_C Σ_C^{-2} V_C^T C^T A
    = C U_A
  i.e. Ã = C C^+ A, and eventually Ã = C U R
• C (m × c) is sparse, but C^+ (c × m) applied to the entire matrix A gives a big, dense product, so CMD approximates it by row sampling instead (next slide; a numerical check follows below)
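A small numerical check of the derivation above (illustrative only): projecting A onto the column subspace of C via the explicit V_C Σ_C^{-2} V_C^T C^T form agrees with C C^+ A computed through the pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((200, 150))
C = A[:, rng.choice(150, size=20, replace=False)]    # stand-in for the sampled columns

Uc, sC, VCt = np.linalg.svd(C, full_matrices=False)  # C = U_c Sigma_C V_C^T
UA = VCt.T @ np.diag(1.0 / sC**2) @ VCt @ C.T @ A    # = V_C Sigma_C^-2 V_C^T C^T A = C^+ A
A_tilde = C @ UA                                     # projection of A onto span(C)

# same result as using the pseudo-inverse directly
assert np.allclose(A_tilde, C @ np.linalg.pinv(C) @ A)
```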
Row sampling (details)
• Approximate the matrix multiplication C^+ A by sampling: sample and rescale columns of C^+ together with the corresponding rows of A
• Remove duplicate rows and scale the remaining rows by the number of duplicates
• Result: C C^+ A ≈ C U R, with U a small c × r matrix (a sketch of the full pipeline follows below)
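Putting the pieces together, here is a hedged sketch of the full A ≈ C U R pipeline. The placement of the rescaling factors between U and R follows one reading of these slides (the product C U R is unchanged either way), and everything here is an illustration rather than the paper's exact algorithm.

```python
import numpy as np

def cmd(A, c, r, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape

    # column sampling with duplicate removal (as in the previous slides)
    p = (A ** 2).sum(axis=0)
    p = p / p.sum()
    ci = rng.choice(n, size=c, replace=True, p=p)
    cu, cd = np.unique(ci, return_counts=True)
    C = A[:, cu] * np.sqrt(cd / (c * p[cu]))

    # row sampling: approximate C^+ A by a few rescaled terms
    q = (A ** 2).sum(axis=1)
    q = q / q.sum()
    ri = rng.choice(m, size=r, replace=True, p=q)
    ru, rd = np.unique(ri, return_counts=True)
    R = A[ru, :] * (rd / (r * q[ru]))[:, None]   # duplicate rows folded into the scaling
    U = np.linalg.pinv(C)[:, ru]                 # only the sampled columns of C^+

    return C, U, R

A = np.random.default_rng(4).random((300, 250))
C, U, R = cmd(A, c=60, r=60)
rel_err = np.linalg.norm(A - C @ U @ R, "fro") / np.linalg.norm(A, "fro")
```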
CMD summary
• Given a matrix A, find three matrices C, U, R such that ||A-CUR|| is small
• Biased sampling with replacement of columns/rows to construct Cd and Rd
• Remove duplicates with proper scaling to obtain Cs and Rs
• Construct a small U
Error Measure
• True error: the relative sum-squared error over all entries, Σ_{i,j} (A_{ij} - Ã_{ij})^2 / Σ_{i,j} A_{ij}^2
• Approximated error: the same quantity estimated only on a sampled set S of entries (sketch below)
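A sketch of such a sampled error estimate; the sampling scheme and normalization below are illustrative and not necessarily the paper's exact estimator.

```python
import numpy as np

def sampled_relative_sse(A, C, U, R, s=2000, seed=0):
    """Relative sum-squared error of A ~ C U R, estimated on a random
    sample S of s entries instead of materializing the full product."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, A.shape[0], size=s)
    j = rng.integers(0, A.shape[1], size=s)
    approx = np.einsum('sc,cr,rs->s', C[i, :], U, R[:, j])   # (C U R)_{ij}, entry by entry
    return ((A[i, j] - approx) ** 2).sum() / (A[i, j] ** 2).sum()

# self-contained check against any C, U, R factorization, e.g. a simple CUR
rng = np.random.default_rng(1)
A = rng.random((300, 250))
cols = rng.choice(250, size=40, replace=False)
rows = rng.choice(300, size=40, replace=False)
C, R = A[:, cols], A[rows, :]
U = np.linalg.pinv(A[np.ix_(rows, cols)])
print(sampled_relative_sse(A, C, U, R))
```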
Experiment datasets
• Network flow data
  22k x 22k matrices
  Every matrix corresponds to 1 hour of data
  Elements are log(packet count + 1)
  1200 hours, 500 GB raw trace
• DBLP bibliographic data
  Author-conference graphs from 1980 to 2004
  428K authors, 3659 conferences
  Elements are the numbers of papers published by the authors
Experiment design
1. CMD vs. SVD and CUR w.r.t.
  Space
  CPU time
  Accuracy = 1 - relative sum-squared error
2. Evaluation of other modules: sparsification, error measure
3. Case-study on network anomaly detection
1.a Space efficiency
• CMD uses up to 100x less space to achieve the same accuracy
• CUR limitation: duplicate columns and rows
• SVD limitation: singular vectors are dense
[Figure: space vs. accuracy for the Network and DBLP datasets]
1.b Computational efficiency
• CMD is the fastest of the three
• CMD and CUR require an SVD only on the sampled columns
• CUR is much slower than CMD due to duplicate columns
• SVD is the slowest since it operates on the entire data
[Figure: CPU time for the Network and DBLP datasets]
2.a Robustness of Sparsification
• Small accuracy penalty for all algorithms
Difference is small
2.b Accuracy Estimation
• Matrix approximation for network flow data (22k-by-22k)
• Vary the number of sampled cols and rows from 200 to 2000
3. Case study: network anomaly detection
• Identify the onset of worm-like hierarchical scanning activities
• Traditional methods based on volume monitoring cannot detect this
Related work: CUR decompositions
(Slide acknowledgement: Petros Drineas)

Deterministic approaches
• Stewart; Berry, Pulatova (Num. Math. '99, TOMS '05)
  C: variant of the QR algorithm; U: minimizes ||A-CUR||_F; R: variant of the QR algorithm
  No a priori bounds; solid experimental performance
• Goreinov, Tyrtyshnikov & Zamarashkin (LAA '97, Cont. Math. '01)
  C: columns that span max volume; U: W^+; R: rows that span max volume
  Existential result; error bounds depend on ||W^+||_2; spectral norm bounds

Monte-Carlo sampling approaches
• Williams & Seeger (NIPS '00)
  C: uniformly at random; U: W^+; R: uniformly at random
  Experimental evaluation; A is assumed PSD; connections to the Nyström method
• Drineas, Kannan & Mahoney (SODA '03, '04)
  C: w.r.t. column lengths; U: in linear/constant time; R: w.r.t. row lengths
  Randomized algorithm; provable a priori bounds; explicit dependency on A - A_k
• Drineas, Mahoney & Muthukrishnan ('05, '06)
  C: depends on singular vectors of A; U: (almost) W^+; R: depends on singular vectors of C
  (1+ε) approximation to A - A_k; computable in SVD_k(A) time

CMD can help here!
Other related work
• Low-rank approximation
  Frieze, Kannan, Vempala (1998)
  Achlioptas and McSherry (2001)
  Sarlós (2006)
  Zhang, Zha, Simon (2002)
• Other sparse approximations
  Srebro, Jaakkola (2004): max-margin matrix factorization
  Nonnegative matrix factorization
  L1 regularization
Conclusion
How to summarize sparse matrices in a concise and intuitive manner?
Proposed method: CMD
1. Provable accuracy guarantee
2. 10x to 100x improvement
3. Interpretability
4. Applied to 500 GB of network forensics data
Thank you
• Contact: Jimeng Sun
• Acknowledgement: Petros Drineas and Michael Mahoney, for insightful discussions and help on the CUR decomposition
The sparsity property
SVD: A = U Σ V^T
  A: big but sparse; U, Σ, V: big and dense
CMD: A = C U R
  A: big but sparse; C and R: sparse; U: dense but small
Summary on CMD
• CMD: A ≈ C U R
  C/R: sampled and scaled columns and rows without duplicates (sparse)
  U: a small matrix (dense)
• Properties
  Interpretability: interpret the matrix by sampled rows and columns
  Efficiency: in computation and space
• Application
  Network forensics: anomaly detection
Conclusion (details)
How to summarize sparse matrices in a concise and intuitive manner?

Application: network forensics
1. Sparsification through sampling
2. Low-rank approximation
3. Error measure

CMD: low-rank approximation via
1. Sampled and scaled columns and rows without duplicates (sparse)
2. A small matrix (dense)

Theory
1. Provable accuracy guarantee
2. 10x to 100x improvement
3. Interpretability
4. Applied to 500 GB of network forensics data