
Decomposition of Big Tensors With Low Multilinear Rank

Guoxu Zhou, Andrzej Cichocki, Fellow, IEEE, and Shengli Xie, Senior Member, IEEE

Abstract: Tensor decompositions are promising tools for big data analytics as they bring multiple modes and aspects of data to a unified framework, which allows us to discover complex internal structures and correlations of data. Unfortunately, most existing approaches are not designed to meet the major challenges posed by big data analytics. This paper attempts to improve the scalability of tensor decompositions and provides two contributions: a flexible and fast algorithm for the CP decomposition (FFCP) of tensors based on their Tucker compression, and a distributed randomized Tucker decomposition approach for arbitrarily big tensors with relatively low multilinear rank. These two algorithms can deal with huge tensors, even if they are dense. Extensive simulations provide empirical evidence of the validity and efficiency of the proposed algorithms.

Index Terms: Tucker decompositions, canonical polyadic decomposition (CPD), nonnegative tensor factorization (NTF), nonnegative Tucker decomposition (NTD), dimensionality reduction, nonnegative alternating least squares.

I. INTRODUCTION

Tensor decompositions are emerging tools for analyzing multi-dimensional data that arise in many scientific and engineering fields such as signal processing, machine learning, and chemometrics [1]-[4]. Some data are naturally presented as tensors, such as color images, video clips, and multichannel multi-trial EEG signals. In the era of big data, even more data with multiple aspects and/or modalities are collected, and many of them are naturally correlated and can be organized as tensors for improved data analysis. Methodologies that matricize the data and then apply matrix factorization approaches give only a flattened view of the data and often cause loss of internal structure information. Compared with matrix factorization techniques, tensor decomposition methodologies are versatile and provide a multi-perspective, stereoscopic view of the data rather than a flattened one. Tensor decompositions are particularly promising for big data analysis as multiple data can be analyzed in a unified framework, which allows us to explore their complex connections across multiple factor matrices simultaneously. By tensor decompositions we achieve data compression, low-rank approximation, visualization, and feature extraction of high-dimensional multi-way data.

Guoxu Zhou is with the Laboratory for Advanced Brain Signal Processing, RIKEN Brain Science Institute, Wako-shi, Saitama 3510198, Japan. E-mail: zhouguoxu@ieee.org.

Andrzej Cichocki is with the RIKEN BSI, Japan, and the Systems Research Institute, Warsaw, Poland. E-mail: cia@brain.riken.jp.

Shengli Xie is with the Faculty of Automation, Guangdong University of Technology, Guangzhou 510006, China. E-mail: eeoshlxie@scut.edu.cn.

TABLE I: Notations and Definitions

$\mathbf{A}$, $\mathbf{a}_r$, $a_{ir}$ : Matrix, the $r$th column, and the $(i, r)$th entry of matrix $\mathbf{A}$, respectively.

$\mathbb{N}$, $\mathbb{R}$ : Set of integers and set of real numbers; $\mathbb{N}_+$ and $\mathbb{R}_+$ are their positive counterparts.

$\mathbb{N}_N$ : The set of positive integers no larger than $N$ in ascending order, i.e., $\mathbb{N}_N = \{1, 2, \ldots, N\}$.

$\mathcal{Y}$, $\mathbf{Y}_{(n)}$ : A tensor and the mode-$n$ matricization of tensor $\mathcal{Y}$.

$\circledast$, $\oslash$ : Element-wise (Hadamard) product and element-wise division of matrices or tensors; moreover, we define $\frac{\mathbf{A}}{\mathbf{Y}} = \mathbf{A} \oslash \mathbf{Y}$.

$\otimes$, $\odot$ : Kronecker product and Khatri-Rao product (column-wise Kronecker product) of matrices.

Two of the most popular tensor decomposition models are the canonical polyadic (CP) model, also known as PARallel FACtor (PARAFAC) analysis, and the Tucker model [5]. The CP model, which is quite similar to the singular value decomposition (SVD) of matrices, decomposes the target tensor into a sum of rank-one tensors. While CP decompositions (CPD) are able to give the most compact, essentially unique (under mild conditions [6]) representation of multiway data, they are unfortunately not always well-defined and may achieve a poor fit to the original data in practical applications. In contrast, Tucker decompositions estimate the subspace of each mode; the resulting representation of the data is not unique, but it often achieves a more satisfactory fit. Hence these two models often serve different purposes. Note that a priori information on factor matrices can be incorporated in order to extract physically meaningful components. For example, in [7] statistical independence of components is used to improve the performance of CPD. Nonnegative CPD, historically also called nonnegative tensor factorization (NTF), is another important model that has attracted extensive study. In fact, nonnegative factorizations are able to give useful parts-based sparse representations of objects and have been successfully applied in image interpretation and representation, classification, document clustering, etc. [1]. Moreover, unlike unconstrained CPD, nonnegative CPD is always well defined and is free of the ill-posedness of CPD [8].

While many new algorithms for tensor decompositions have been developed in the last decade, only very recently has their scalability received special attention to meet the challenges posed by big tensor data analysis. For CP decompositions, the ParCube approach [9] and the partition-and-merge approach in [10] were proposed, both of which depend on reliable CP decompositions of a set of small sub-tensors.


Fig. 1: Decompositions of a big tensor based on its compact Tucker representation. Starting from the Tucker representation, which reduces noise and compresses the data, one can perform CPD, nonnegative CPD, and nonnegative Tucker decomposition, possibly with additional constraints such as sparsity, statistical independence, and smoothness.

However, existing CPD algorithms often suffer from the issue of ill-posedness, in the sense that the optimal low-rank approximation may not exist at all. Hence, in this type of method, any ill-conditioned sub-tensors formed by the sampling/partition operation will unavoidably deteriorate the final results. Another scalable approach, GigaTensor [11], requires that the target tensors are very sparse. For Tucker decompositions, the MACH approach [12] was proposed to obtain a sparse tensor by sampling the entries of a target tensor. Some other approaches that sample the columns (fibers) of target tensors can be found in [13], [14]. However, these methods are often not suitable for very noisy data. Our purpose in this paper is to develop a general framework for big tensor decompositions. The basic idea is simple: assuming the given big tensor has low multilinear rank, we first use the Tucker model to compress it. Using this Tucker representation of the data, both unconstrained/constrained CP and Tucker decompositions can be efficiently performed, as illustrated in Fig. 1. We develop fast, scalable algorithms for CPD based on a compact Tucker representation of the data and a new randomized algorithm for the Tucker decomposition of a big tensor.

The rest of the paper is organized as follows. In Section II, the CP and Tucker decomposition models and algorithms are briefly reviewed. In Section III, a family of new algorithms is proposed for (nonnegative) CP decompositions based on the Tucker representation of tensors. In Section IV, two new distributed randomized Tucker decomposition algorithms are proposed that can be used to decompose big tensors with relatively low multilinear rank. Simulation results on synthetic and real-world data are presented in Section V, and conclusions are drawn in Section VI.

In TABLE I the frequently used notations are listed; readers are referred to [1], [5] for more details on tensor conventions.

II. OVERVIEW OF TUCKER AND POLYADIC DECOMPOSITIONS

By using the Tucker model, a given high-order tensor $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is approximated as
$$\mathcal{Y} \approx \hat{\mathcal{Y}} = \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \cdots \times_N \mathbf{U}^{(N)} = [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!], \qquad (1)$$

where $\mathbf{U}^{(n)} = [\mathbf{u}^{(n)}_1, \mathbf{u}^{(n)}_2, \ldots, \mathbf{u}^{(n)}_{R_n}] \in \mathbb{R}^{I_n \times R_n}$ is the mode-$n$ component (factor) matrix consisting of the latent components $\mathbf{u}^{(n)}_{r_n}$ as its columns, $\mathcal{G} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ is the core tensor reflecting the connections between the components, and $(R_1, R_2, \ldots, R_N)$ is called the multilinear rank of $\mathcal{Y}$, with $R_n = \mathrm{rank}(\mathbf{Y}_{(n)})$.

By applying the SVD to $\mathbf{Y}_{(n)}$, $n = 1, 2, \ldots, N$, we obtain the Tucker decomposition of a given tensor, which is referred to as the higher-order SVD (HOSVD) and has been widely applied in multidimensional data analysis [15]-[17]. While the Tucker decomposition can be easily implemented, it is also criticized for its lack of uniqueness and for suffering from the curse of dimensionality. As another important tensor decomposition model, the CP model decomposes the target tensor into a sum of rank-one terms, that is,

$$\mathcal{Y} \approx \hat{\mathcal{Y}} = \sum_{r=1}^{R} \lambda_r \, \mathbf{a}^{(1)}_r \circ \mathbf{a}^{(2)}_r \circ \cdots \circ \mathbf{a}^{(N)}_r = [\![\mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \ldots, \mathbf{A}^{(N)}]\!], \qquad (2)$$

where $\circ$ denotes the outer product of vectors and, for simplicity, $\lambda_r$ is absorbed into $\mathbf{a}^{(N)}_r$, $r = 1, 2, \ldots, R$. CPD is essentially unique under mild conditions [6], which has led to many important applications such as blind identification and feature extraction. However, the CPD of a given tensor is not always well-defined, in the sense that the optimal CP decomposition for a given rank may not exist at all [18].
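As a concrete illustration of the two models (1) and (2), the following minimal NumPy sketch builds a tensor from a Tucker core with factor matrices, and from CP factor matrices; all function names and the example sizes are ours, not from the paper.

# Minimal NumPy sketch of the Tucker model (1) and the CP model (2).
import numpy as np
from functools import reduce

def mode_n_product(T, U, n):
    """Compute T x_n U, i.e., multiply the n-th mode of tensor T by matrix U."""
    Tn = np.moveaxis(T, n, 0).reshape(T.shape[n], -1)        # mode-n unfolding
    out = U @ Tn
    new_shape = (U.shape[0],) + tuple(np.delete(T.shape, n))
    return np.moveaxis(out.reshape(new_shape), 0, n)

def tucker_tensor(G, Us):
    """Reconstruct [[G; U^(1), ..., U^(N)]] as in (1)."""
    return reduce(lambda T, nU: mode_n_product(T, nU[1], nU[0]), enumerate(Us), G)

def cp_tensor(As):
    """Reconstruct [[A^(1), ..., A^(N)]] as in (2), with lambda_r absorbed into A^(N)."""
    R = As[0].shape[1]
    Y = np.zeros([A.shape[0] for A in As])
    for r in range(R):
        Y += reduce(np.multiply.outer, [A[:, r] for A in As])
    return Y

# Example: a 20 x 30 x 40 tensor with multilinear rank (3, 4, 5), and a rank-5 CP tensor.
rng = np.random.default_rng(0)
G = rng.standard_normal((3, 4, 5))
Us = [rng.standard_normal((I, R)) for I, R in zip((20, 30, 40), (3, 4, 5))]
Y_tucker = tucker_tensor(G, Us)
Y_cp = cp_tensor([rng.standard_normal((I, 5)) for I in (20, 30, 40)])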

III. EFFICIENT CP DECOMPOSITIONS OF TENSORS BASED ON THEIR TUCKER APPROXIMATIONS

A. Overview of CPD Algorithms

Alternating least squares (ALS) has been a standard approach to perform CPD; it minimizes a total of $N$ LS problems
$$J(\mathbf{A}^{(n)}) = \frac{1}{2}\left\|\mathbf{Y}_{(n)} - \mathbf{A}^{(n)}\mathbf{B}^{(n)T}\right\|_F^2, \quad n \in \mathbb{N}_N, \qquad (3)$$
in an alternating fashion in each iteration, possibly with additional constraints, e.g., nonnegativity, imposed on the factor matrices $\mathbf{A}^{(n)}$, and
$$\mathbf{B}^{(n)} = \bigodot_{p \neq n} \mathbf{A}^{(p)} \in \mathbb{R}^{\prod_{p \neq n} I_p \times R}. \qquad (4)$$

Based on the gradient of $J(\mathbf{A}^{(n)})$,
$$\frac{\partial J}{\partial \mathbf{A}^{(n)}} = \mathbf{A}^{(n)}\mathbf{B}^{(n)T}\mathbf{B}^{(n)} - \mathbf{Y}_{(n)}\mathbf{B}^{(n)}, \qquad (5)$$

Fig. 2: Two different ways to perform CPD based on a Tucker approximation: (a) Tucker+CP, where the CPD is performed on a small full tensor (i.e., the core tensor $\mathcal{G}$); (b) FFCP, where it is performed on a tensor in its approximate Tucker representation $[\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!]$.

we can obtain the standard ALS update rule for the CPD,
$$\mathbf{A}^{(n)} \leftarrow \mathbf{Y}_{(n)}\mathbf{B}^{(n)}\left(\mathbf{B}^{(n)T}\mathbf{B}^{(n)}\right)^{-1}, \quad n = 1, 2, \ldots, N, \qquad (6)$$

by letting $\partial J/\partial \mathbf{A}^{(n)} = 0$, or a family of first-order nonnegative CPD algorithms [19], for example the one using the multiplicative update (MU) rule
$$\mathbf{A}^{(n)} \leftarrow \mathbf{A}^{(n)} \circledast \left[\left(\mathbf{Y}_{(n)}\mathbf{B}^{(n)}\right) \oslash \left(\mathbf{A}^{(n)}\mathbf{B}^{(n)T}\mathbf{B}^{(n)}\right)\right], \qquad (7)$$

and the efficient hierarchical ALS (HALS) approach that updates only one column at a time,
$$\mathbf{a}^{(n)}_r \leftarrow \mathbf{a}^{(n)}_r + \frac{1}{t^{(n)}_{rr}} P_+\!\left(\mathbf{q}^{(n)}_r - \mathbf{A}^{(n)}\mathbf{t}^{(n)}_r\right), \qquad (8)$$
where $P_+(x) = \max(x, 0)$ is applied element-wise to a matrix/vector [19], [20], $t^{(n)}_{rr}$ and $\mathbf{t}^{(n)}_r$ are the $(r, r)$ entry and the $r$th column of the matrix $\mathbf{T}^{(n)} = \mathbf{B}^{(n)T}\mathbf{B}^{(n)}$, respectively, and $\mathbf{q}^{(n)}_r$ is the $r$th column of $\mathbf{Q}^{(n)} = \mathbf{Y}_{(n)}\mathbf{B}^{(n)}$. Efficient computation of (5) is the foundation of a large class of first-order methods for CPD and their variants [19]. In [21] the authors argued that first-order (gradient) methods generally provide acceptable convergence speed, are easy to parallelize, and are hence more suitable than their second-order counterparts for large-scale optimization problems. This motivates us to restrict our discussion to first-order methods in this paper.
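To make the update rules above concrete, the following minimal NumPy sketch (illustrative names only, not the paper's implementation) performs one unconstrained ALS sweep according to (6), forming $\mathbf{B}^{(n)T}\mathbf{B}^{(n)}$ as a Hadamard product of small Gram matrices and $\mathbf{Y}_{(n)}\mathbf{B}^{(n)}$ as a matricized-tensor-times-Khatri-Rao product (MTTKRP); the MU rule (7) would replace the final solve with an element-wise ratio.

# One ALS sweep for CPD following (6): A^(n) <- Y_(n) B^(n) (B^(n)T B^(n))^(-1).
import numpy as np
from functools import reduce

def unfold(Y, n):
    """Mode-n matricization Y_(n) of size I_n x prod_{p != n} I_p."""
    return np.moveaxis(Y, n, 0).reshape(Y.shape[n], -1)

def khatri_rao(mats):
    """Column-wise Kronecker product of a list of matrices sharing R columns."""
    R = mats[0].shape[1]
    return reduce(lambda A, B: (A[:, None, :] * B[None, :, :]).reshape(-1, R), mats)

def als_sweep(Y, A):
    """Update every factor matrix A[n] once using the ALS rule (6)."""
    N = Y.ndim
    for n in range(N):
        others = [A[p] for p in range(N) if p != n]
        gram = reduce(np.multiply, [Ap.T @ Ap for Ap in others])   # B^(n)T B^(n), Hadamard of Grams
        mttkrp = unfold(Y, n) @ khatri_rao(others)                  # Y_(n) B^(n)
        A[n] = np.linalg.solve(gram.T, mttkrp.T).T                  # = mttkrp @ inv(gram)
    return A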

Next, we discuss how to compute the gradient terms in (5) efficiently. While the term
$$\mathbf{B}^{(n)T}\mathbf{B}^{(n)} = \circledast_{p \neq n}\left(\mathbf{A}^{(p)T}\mathbf{A}^{(p)}\right) \in \mathbb{R}^{R \times R} \qquad (9)$$

in (5) (and in (6), (7), and (8) as well) can be computed efficiently, the term $\mathbf{Y}_{(n)}\mathbf{B}^{(n)}$ is quite computationally demanding, especially when $\mathcal{Y}$ is huge and $N$ is large. Indeed, in many cases neither $\mathcal{Y}$ nor $\mathbf{B}^{(n)}$ can be loaded into memory. To overcome this problem, we often hope to 1) avoid accessing the huge tensor $\mathcal{Y}$ frequently (e.g., in every iteration), and 2) avoid explicitly constructing the big matrix $\mathbf{B}^{(n)} \in \mathbb{R}^{\prod_{k \neq n} I_k \times R}$. To this end, two properties of $\mathcal{Y}$ can be

employed to improve the scalability of CPD algorithms:
• Sparsity of $\mathcal{Y}$. In this case the GigaTensor method [11] provides an elegant solution: only the nonzero entries of $\mathcal{Y}$ are accessed, and the columns of $\mathbf{Y}_{(n)}\mathbf{B}^{(n)}$ are computed in a parallel manner without explicit construction of $\mathbf{B}^{(n)}$.
• Low rank of $\mathcal{Y}$ (the tensor can be dense). This type of method uses a low-rank approximation instead of the original huge dense tensor $\mathcal{Y}$. A standard method is to perform Tucker compression first, typically by using the HOSVD, and then apply CPD to the core tensor [22]; this is referred to as the Tucker+CP method in this paper. The approach in [10] splits the original big tensor into a set of small sub-tensors and then applies CPD to each sub-tensor. After that, $\mathbf{Y}_{(n)}\mathbf{B}^{(n)}$ can be computed by using these sub-tensors in their CP approximation instead of $\mathcal{Y}$. Arguing that some sub-tensors could be very ill-posed, which unavoidably causes poor performance, the PARACOMP method [23] uses a set of random matrices that share a pre-specified number of common anchor rows to compress the tensor $\mathcal{Y}$ using the Tucker model, thereby obtaining a set of much smaller tensors. The CPD is then performed on these small tensors simultaneously, and the original true components are recovered by solving a linear system that incorporates the information provided by the anchor rows. This method avoids the SVD of huge matrices required by the HOSVD; however, it involves a total of $P$ Tucker compression procedures and CPDs of $P$ small tensors, where $P \geq I_n/R$.

Both the method in [10] and the PARACOMP method in [23] rely on reliable CPD of all sub- (or intermediate) tensors. However, it is widely known that existing CPD approaches often converge to local solutions, especially when bottlenecks exist in one or more factor matrices [24], or, even worse, the CPD may simply not be unique. For example, in [23] the authors argued that the method in [10] assumes each sub-tensor has a unique CPD, which, however, cannot be guaranteed from the global uniqueness conditions alone [23]. In PARACOMP, the uniqueness of the CPD of each small tensor can be guaranteed from the global uniqueness conditions; however, it does not solve the local convergence problem and is sensitive to noise. In other words, to achieve satisfactory results, PARACOMP actually requires that all the CPDs in the intermediate step converge globally, which is often unrealistic. Hence we believe it is important to develop more robust and scalable algorithms for CPD, especially when $\mathcal{Y}$ is dense. Meanwhile, we should avoid using CPDs as intermediate steps due to the local convergence issue.

Let us go into more detail of the Tucker+CP approach. In this approach, the big tensor $\mathcal{Y}$ is approximated by its HOSVD, $\mathcal{Y} \approx \hat{\mathcal{Y}} = [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!]$, where $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times R_n}$ is computed from the truncated SVD of $\mathbf{Y}_{(n)}$ with $\mathbf{U}^{(n)T}\mathbf{U}^{(n)} = \mathbf{I}$; then the CPD is applied to the much smaller tensor $\mathcal{G} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ such that
$$\mathcal{G} \approx [\![\mathbf{V}^{(1)}, \mathbf{V}^{(2)}, \ldots, \mathbf{V}^{(N)}]\!], \qquad (10)$$
thereby leading to the CPD $\mathcal{Y} \approx [\![\mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \ldots, \mathbf{A}^{(N)}]\!]$ with $\mathbf{A}^{(n)} = \mathbf{U}^{(n)}\mathbf{V}^{(n)}$. For large-scale problems, very often $I_n \gg R_n$, and the CPD in (10) can be solved efficiently even on a standard desktop computer. The Tucker+CP routine, originally proposed in [22], is very efficient for big tensors with low multilinear rank. However, this framework is not directly applicable if it is necessary to impose additional constraints, such as nonnegativity and sparsity, on the factor matrices. Very recently, Cohen, Farias, and Comon proposed a novel efficient method to deal with the nonnegativity of factor matrices [25]. Our objective in this paper is to develop a more general and flexible scheme which not only achieves similar efficiency and scalability but also allows any additional or optional constraints on the factor matrices to be imposed very easily.

B. CPD of Big Tensors Based on Their Tucker Compression

The Tucker decomposition is suitable for pre-processing and compressing data because it is significantly faster than the CPD and comes with upper error bound guarantees [26]. In this subsection we consider a slightly modified version of the Tucker+CP approach that overcomes its limitations. In the first step, Tucker compression is performed such that $\mathcal{Y} \approx \hat{\mathcal{Y}} = [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!]$, as in the Tucker+CP approach. In the second step, however, we perform the CPD on the whole tensor $\hat{\mathcal{Y}}$ instead of the core tensor $\mathcal{G}$, by minimizing the following cost function:
$$J(\mathbf{A}^{(n)}) = \frac{1}{2}\left\|\hat{\mathcal{Y}} - [\![\mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \ldots, \mathbf{A}^{(N)}]\!]\right\|_F^2, \quad \text{s.t. } \mathbf{A}^{(n)} \in \mathcal{C}_n, \qquad (11)$$

where $\mathcal{C}_n$ denotes some constraints imposed on $\mathbf{A}^{(n)}$. With (11), the term $\mathbf{Y}_{(n)}\mathbf{B}^{(n)}$ in the gradient $\partial J/\partial \mathbf{A}^{(n)}$ can be efficiently computed from
$$\mathbf{Y}_{(n)}\mathbf{B}^{(n)} \approx \mathbf{U}^{(n)}\mathbf{G}_{(n)}\left[\bigodot_{p \neq n}\mathbf{V}^{(p)}\right], \qquad (12)$$
where
$$\mathbf{V}^{(n)} = \mathbf{U}^{(n)T}\mathbf{A}^{(n)} \in \mathbb{R}^{R_n \times R}. \qquad (13)$$

Algorithm 1: The Flexible Fast Nonnegative CP (FFCP) Decomposition Algorithm
Require: $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, the output rank $R$.
1: Perform a Tucker decomposition to obtain $\mathcal{Y} \approx [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!]$. Initialize $\mathbf{A}^{(n)}$ and $\mathbf{V}^{(n)} = \mathbf{U}^{(n)T}\mathbf{A}^{(n)}$.
2: repeat
3:   for $n \in \mathbb{N}_N$ do
4:     Efficiently compute $\mathbf{H}^{(n)} = \mathbf{G}_{(n)}\left[\bigodot_{p \neq n}\mathbf{V}^{(p)}\right]$.
5:     Compute $\mathbf{Y}_{(n)}\mathbf{B}^{(n)} \approx \mathbf{U}^{(n)}\mathbf{H}^{(n)}$ and $\mathbf{B}^{(n)T}\mathbf{B}^{(n)} = \circledast_{p \neq n}\left(\mathbf{A}^{(p)T}\mathbf{A}^{(p)}\right)$.
6:     Update $\mathbf{A}^{(n)}$ by using (6), or (7)/(8) to impose nonnegativity constraints, and/or (16) to improve the sparsity of components.
7:     $\mathbf{V}^{(n)} \leftarrow \mathbf{U}^{(n)T}\mathbf{A}^{(n)}$.
8:   end for
9: until a stopping criterion is satisfied
10: return $\mathcal{G} \leftarrow \hat{\mathcal{Y}}$, $\mathbf{A}^{(n)}$, $n \in \mathbb{N}_N$.
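As a complement to the pseudo-code above, the following self-contained NumPy sketch (illustrative names; a sequential, in-memory version rather than a distributed one) shows how steps 4-6 of Algorithm 1 can be realized for the unconstrained case using (12)-(13).

# Sketch of one FFCP factor update, computing Y_(n) B^(n) cheaply via (12)-(13).
import numpy as np
from functools import reduce

def unfold(T, n):
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def khatri_rao(mats):
    R = mats[0].shape[1]
    return reduce(lambda A, B: (A[:, None, :] * B[None, :, :]).reshape(-1, R), mats)

def ffcp_update(G, U, A, n):
    """Update A[n] given the Tucker core G, orthogonal factors U, and CP factors A."""
    N = G.ndim
    V = [U[p].T @ A[p] for p in range(N)]                                  # V^(p), eq. (13)
    H = unfold(G, n) @ khatri_rao([V[p] for p in range(N) if p != n])      # step 4
    YB = U[n] @ H                                                          # step 5, eq. (12)
    gram = reduce(np.multiply, [A[p].T @ A[p] for p in range(N) if p != n])  # eq. (9)
    return np.linalg.solve(gram.T, YB.T).T                                 # ALS update (6)

For the nonnegative variant, the last line would be replaced by the multiplicative ratio in (15).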

Note that $\mathbf{V}^{(n)}$ is quite small since $R_n \leq R$, and it can be computed by using distributed systems or the matrix partitioning method when $I_n$ is really huge. As such, the ALS update rule (6) can be replaced by
$$\mathbf{A}^{(n)} \leftarrow \mathbf{U}^{(n)}\mathbf{G}_{(n)}\left[\bigodot_{p \neq n}\mathbf{V}^{(p)}\right]\left(\circledast_{p \neq n}\mathbf{A}^{(p)T}\mathbf{A}^{(p)}\right)^{-1}. \qquad (14)$$
Because the approximation $\hat{\mathcal{Y}}$ may contain negative entries, (7) can be replaced by
$$\mathbf{A}^{(n)} \leftarrow \mathbf{A}^{(n)} \circledast P_+\!\left(\mathbf{U}^{(n)}\mathbf{G}_{(n)}\left[\bigodot_{p \neq n}\mathbf{V}^{(p)}\right]\right) \oslash \left[\mathbf{A}^{(n)}\left(\circledast_{p \neq n}\mathbf{A}^{(p)T}\mathbf{A}^{(p)}\right)\right]. \qquad (15)$$

As a result, the nonnegativity of factor matrices can be easily incorporated into the proposed scheme. If we hope to extract sparse latent components, we consider the $\ell_1$-norm regularized CPD proposed in [27] and minimize $J(\mathbf{A}^{(n)}) + c\|\mathbf{A}^{(n)}\|_1$, thereby leading to
$$\mathbf{A}^{(n)} \leftarrow \mathcal{S}_c(\mathbf{A}^{(n)}), \qquad (16)$$
where the sparsity operator $\mathcal{S}_c(\cdot)$ is applied element-wise and defined as
$$\mathcal{S}_c(x) = \begin{cases} x - c & \text{if } x > c, \\ 0 & \text{if } -c \leq x \leq c, \\ x + c & \text{if } x < -c. \end{cases}$$
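For reference, $\mathcal{S}_c$ is just element-wise soft-thresholding; a one-line NumPy sketch (the function name is ours):

import numpy as np

def soft_threshold(X, c):
    """Element-wise soft-thresholding S_c(X), cf. (16)."""
    return np.sign(X) * np.maximum(np.abs(X) - c, 0.0)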

In summary, most first-order constrained/unconstrained CPD algorithms can benefit from the proposed method. Approximately, the computational complexity is reduced from $O(R_n\prod_{p \in \mathbb{N}_N} I_p)$ in (5) to only about $O(I_nR_n^2 + R_n\prod_{p \in \mathbb{N}_N} R_p)$, due to the use of a compact representation of $\mathcal{Y}$.

5

minus20 minus15 minus10 minus5 0 5 10 15 20

60

65

70

75

80

85

90

95

Noise level (dB)

Fit

Tucker+CP

FFCPDirect CP

Fig 3 The Fit values averaged over 50 Monte-Carlo runsobtained by 3 different ways (ie the Direct CP that appliesCPD on Y directly the FFCP on the Tucker format of Y asympJGU(1)U(2) middot middot middot U(N)K and the Tucker+CP on the coretensor G first and then construct the CP approximation) toperform CPD on a synthetic tensor Y isin R50times50times50times50 withR = 10 The data tensor Y was generated by using (2) anddifferent levels of Gaussian noise (SNR in dB) were added

It is worth noting that if $\hat{\mathcal{Y}} = [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!]$ with $\mathbf{U}^{(n)T}\mathbf{U}^{(n)} = \mathbf{I}$, minimizing (11) is equivalent to solving
$$\min_{\mathbf{V}^{(n)}} J(\mathbf{V}^{(n)}) = \frac{1}{2}\left\|\mathcal{G} - [\![\mathbf{V}^{(1)}, \mathbf{V}^{(2)}, \ldots, \mathbf{V}^{(N)}]\!]\right\|_F, \quad \text{s.t. } \mathbf{A}^{(n)} = \mathbf{U}^{(n)}\mathbf{V}^{(n)} \in \mathcal{C}_n, \qquad (17)$$

where $\mathbf{V}^{(n)}$ is defined in (13) and $\mathcal{C}_n$ denotes some constraints on $\mathbf{A}^{(n)}$. Indeed, (17) has been considered in [25], specifically for NTF. Although that method essentially has the same computational complexity as the FFCP method, the main advantage of (11) is that any off-the-shelf (constrained) CPD method can be directly applied with efficient computation of the gradient terms (9) and (12), without the need to design a new algorithm to solve (17).

In summary, not only does (11) enjoy the same low computational complexity as the Tucker+CP method, it is also quite flexible, because additional constraints on $\mathbf{A}^{(n)}$, such as nonnegativity and/or sparsity, can be easily incorporated. For this reason the new method is called the flexible fast CP (FFCP) decomposition algorithm, and the corresponding pseudo-code is briefly listed in Algorithm 1 (see Footnote 1). In Fig. 3 we compare the Tucker+CP, the FFCP (solving (11)), and the standard CPD (solving (3)) algorithms. The data tensor $\mathcal{Y} \in \mathbb{R}^{50 \times 50 \times 50 \times 50}$ was generated by using (2)

Footnote 1: Although in Algorithm 1 only the LS update and the multiplicative update are listed, any update rule based on the gradient (e.g., HALS; see [19]) could be incorporated into the FFCP scheme straightforwardly.

with $R = 10$ and contaminated by different levels of noise (in SNR) ranging from -20 dB to 20 dB. In the figure we show the Fit values averaged over 50 Monte-Carlo runs, where $\mathrm{Fit} = 1 - \|\mathcal{Y} - \hat{\mathcal{Y}}\|_F / \|\mathcal{Y}\|_F$. It turns out that Tucker compression generally does not degrade the accuracy; on the contrary, it may improve the stability of CPD algorithms.

Although the matrices $\mathbf{V}^{(n)}$ in (12) are generally small, construction of the tall $(\prod_{k \neq n} R_k)$-by-$R$ matrix $\bigodot_{k \neq n}\mathbf{V}^{(k)}$ and direct computation of $\mathbf{H}^{(n)} = \mathbf{G}_{(n)}\left[\bigodot_{p \neq n}\mathbf{V}^{(p)}\right]$ can be avoided by applying either one of the following methods:
• Notice that the $r$th column of $\mathbf{H}^{(n)}$ can be computed as $\mathbf{h}^{(n)}_r = \mathcal{G} \times_{p \neq n} \mathbf{v}^{(p)T}_r$, which allows $\mathbf{H}^{(n)}$ to be computed in parallel by using the shared data tensor $\mathcal{G}$ but without constructing $\bigodot_{k \neq n}\mathbf{V}^{(k)}$ explicitly (a small sketch of this column-wise computation follows below).
• The term $\mathbf{G}_{(n)}\left[\bigodot_{p \neq n}\mathbf{V}^{(p)}\right]$ can be computed in parallel by using the technique adopted in GigaTensor [11]; here $\mathbf{G}_{(n)}$ is generally not sparse, but it is relatively small.
Moreover, $\mathbf{H}^{(n)} = \mathbf{G}_{(n)}\left[\bigodot_{p \neq n}\mathbf{V}^{(p)}\right]$ can be viewed as the gradient term when we perform the CPD of $\mathcal{G} \approx [\![\mathbf{V}^{(1)}, \mathbf{V}^{(2)}, \ldots, \mathbf{V}^{(N)}]\!]$ (see (5) and (17)). For this reason, most techniques that have been proposed to speed up standard CP decompositions can be applied to compute $\mathbf{H}^{(n)}$ straightforwardly.
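The first option can be sketched as follows (NumPy; names are ours): each column of $\mathbf{H}^{(n)}$ is obtained by contracting the shared core $\mathcal{G}$ with one column of each $\mathbf{V}^{(p)}$, so the Khatri-Rao product is never formed.

# Column-wise computation of H^(n) without forming the Khatri-Rao product.
import numpy as np

def h_column(G, V, n, r):
    """Compute h_r^(n) = G x_{p != n} v_r^(p)T as a length-R_n vector."""
    T = G
    for p in range(G.ndim - 1, -1, -1):   # contract modes in reverse to keep indices valid
        if p == n:
            continue
        T = np.tensordot(T, V[p][:, r], axes=([p], [0]))
    return T

def h_matrix(G, V, n):
    """Stack the columns to obtain H^(n)."""
    R = V[0].shape[1]
    return np.stack([h_column(G, V, n, r) for r in range(R)], axis=1)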

For very large-scale data where the $I_n$ are extremely huge, the products $\mathbf{A}^{(n)T}\mathbf{A}^{(n)}$ and $\mathbf{U}^{(n)T}\mathbf{A}^{(n)}$ can be performed on distributed systems (see, e.g., [28]) or simply computed by using the matrix partitioning technique on a standard computer.

In summary, even with very limited computational resources, the FFCP algorithm can perform the (nonnegative) CPD of very large-scale tensors, provided that their low-rank Tucker representation can be obtained. Notice that the Tucker approximation of a given tensor plays a fundamental role in the proposed FFCP method, as well as in the Tucker+CP method, the method proposed in [25], and the PARACOMP [35], and the HOSVD often serves as the workhorse to perform Tucker compression. However, the HOSVD involves the eigenvalue decomposition of huge matrices $\mathbf{Y}_{(n)}$, which makes it unsuitable for large-scale problems [35]. In the next section we focus on how to overcome this limitation.

IV. DISTRIBUTED RANDOMIZED TUCKER DECOMPOSITIONS (RANDTUCKER)

A. Randomized Tucker Decomposition

In the Tucker decomposition (1), the factor matrices $\mathbf{U}^{(n)}$ are estimated by minimizing the following LS problems:
$$J(\mathbf{U}^{(n)}) = \left\|\mathbf{Y}_{(n)} - \mathbf{U}^{(n)}\mathbf{B}^{(n)T}\right\|_F, \qquad (18)$$
where $\mathbf{B}^{(n)} = \left[\bigotimes_{k \neq n}\mathbf{U}^{(k)}\right]\mathbf{G}^T_{(n)}$. In the HOSVD [26], the factor matrix $\mathbf{U}^{(n)}$ is estimated as the leading singular vectors

of $\mathbf{Y}_{(n)}$. However, performing the SVD is prohibitive for very large-scale problems. Recently, a set of probabilistic algorithms has been developed to discover the low-rank structure of huge matrices [29]. Here we consider the following randomized range finder (RRF) algorithm (Algorithm 4.1 of [29]), sketched below, to estimate the basis of $\mathbf{U}^{(n)}$:
1) Draw a random Gaussian matrix $\mathbf{\Omega}^{(n)} \in \mathbb{R}^{\prod_{p \neq n} I_p \times \bar{R}_n}$.
2) Obtain the $I_n \times \bar{R}_n$ matrix $\mathbf{Z}^{(n)} = \mathbf{Y}_{(n)}\mathbf{\Omega}^{(n)}$.
3) Compute $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times \bar{R}_n}$ as an orthonormal basis of $\mathbf{Z}^{(n)}$ by using, e.g., the QR decomposition.
In the above, $\bar{R}_n = R_n + p$, where $p$ is referred to as the oversampling parameter; $p = 5$ or $p = 10$ was suggested in [29]. Based on this procedure, we propose a randomized Tucker decomposition (RandTucker) algorithm to perform Tucker decompositions; see Algorithm 2. Compared with the HOSVD algorithm, the proposed RandTucker algorithm is highly parallelizable and suitable for distributed computation.
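The following short NumPy sketch (illustrative names; a plain in-memory version) mirrors the three RRF steps for a single mode.

# Randomized range finder (RRF) for mode n, cf. Algorithm 4.1 of [29].
import numpy as np

def rrf_mode(Y, n, Rn, p=10, rng=None):
    """Return an I_n x (Rn + p) orthonormal basis for the mode-n subspace of Y."""
    rng = np.random.default_rng() if rng is None else rng
    Yn = np.moveaxis(Y, n, 0).reshape(Y.shape[n], -1)     # mode-n unfolding Y_(n)
    Omega = rng.standard_normal((Yn.shape[1], Rn + p))    # step 1: random Gaussian matrix
    Z = Yn @ Omega                                        # step 2: sketch Z^(n) = Y_(n) Omega
    Q, _ = np.linalg.qr(Z)                                # step 3: orthonormal basis via QR
    return Q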

Several randomized algorithms have already been proposed for Tucker decompositions, such as the method using fiber sampling [14], [30]. Compared with them, the proposed RandTucker method has an important advantage: it is able to obtain a nearly optimal approximation even for very noisy data by using only one additional ALS-like iteration, which is referred to as RandTucker2i; see Algorithm 3 and Fig. 4 for the performance comparison. In this figure, a 200-by-200-by-200 tensor $\mathcal{Y}$ was generated in each run by using (1) with $R_n = 10$; then different levels of Gaussian noise were added to $\mathcal{Y}$ to generate the observation tensor. We plotted the Fit values (and the standard deviations) obtained by RandTucker (with $p = 10$), RandTucker2i ($p = 10$), and the HOSVD, averaged over 50 Monte-Carlo runs. From the figure, RandTucker achieved slightly worse Fit values, especially when the noise was heavy, while RandTucker2i achieved almost the same Fit values as the HOSVD. As RandTucker and RandTucker2i do not involve the singular value decomposition of big matrices and are highly parallelizable, they are very promising for the Tucker decomposition of big tensors.

B. Distributed Randomized Tucker Decomposition

In the above RandTucker and RandTucker2i algorithms we have assumed that all the data can be processed directly in physical memory. For big data this could be unrealistic; a common scenario is that only part of the data can be loaded in memory. In this subsection we discuss how to solve this problem by using a computer cluster (in this paper we used the MATLAB Distributed Computing Server; see Footnote 2). Here we consider a cluster which is composed of:
• a number of workers with identity labels, each of which is a computational engine that executes computational tasks simultaneously; meanwhile, they may communicate with each other and with the client as well;

Footnote 2: See www.mathworks.com/help/pdf_doc/mdce/mdce.pdf

Algorithm 2: The Randomized Tucker Decomposition (RandTucker) Algorithm
Require: $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and the multilinear rank $(R_1, R_2, \ldots, R_N)$.
1: for $n = 1, 2, \ldots, N$ do
2:   Compute $\mathbf{Z}^{(n)} = \mathbf{Y}_{(n)}\mathbf{\Omega}^{(n)}$, where $\mathbf{\Omega}^{(n)}$ is an $(\prod_{k \neq n} I_k)$-by-$R_n$ random Gaussian matrix.
3:   Compute $\mathbf{U}^{(n)}$ as an orthonormal basis of $\mathbf{Z}^{(n)}$ by using, e.g., the QR decomposition.
4:   $\mathcal{Y} \leftarrow \mathcal{Y} \times_n \mathbf{U}^{(n)T}$, $I_n \leftarrow R_n$.
5: end for
6: $\mathcal{G} \leftarrow \mathcal{Y}$.
7: return $\mathcal{Y} \approx [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!]$.

Algorithm 3: The RandTucker Algorithm with 2 Iterations Only (RandTucker2i)
Require: $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and the multilinear rank $(R_1, R_2, \ldots, R_N)$.
1: Initialize $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times R_n}$ as random Gaussian matrices.
2: Repeat lines (3)-(7) twice:
3: for $n = 1, 2, \ldots, N$ do
4:   $\mathcal{X} = \mathcal{Y} \times_{p \neq n} \mathbf{U}^{(p)T}$
5:   Compute $\mathbf{Z}^{(n)} = \mathbf{X}_{(n)}\mathbf{\Omega}^{(n)}$, where $\mathbf{\Omega}^{(n)}$ is an $(\prod_{k \neq n} R_k)$-by-$R_n$ random Gaussian matrix.
6:   Compute $\mathbf{U}^{(n)}$ as an orthonormal basis of $\mathbf{Z}^{(n)}$ by using, e.g., the QR decomposition.
7: end for
8: $\mathcal{G} \leftarrow \mathcal{Y} \times_1 \mathbf{U}^{(1)T} \times_2 \mathbf{U}^{(2)T} \cdots \times_N \mathbf{U}^{(N)T}$.
9: return $\mathcal{Y} \approx [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!]$.

Fig. 4: Performance comparison between the RandTucker, the RandTucker2i, and the HOSVD algorithms in terms of the mean value of Fit and its standard deviation over 50 Monte-Carlo runs, for different noise levels (SNR in dB).

• a client, which is used to schedule tasks to the workers, gather results from the workers, and return the final results.

Fig. 5: Demonstration of the small data tensors and the virtual computing resource tensor $\mathcal{P}$ in a cluster. (a) An example $\mathcal{Y} \in \mathbb{R}^{2I \times 4J \times 3K}$ is partitioned into smaller blocks $\mathcal{Y}^{(\mathbf{s})}$ with the same size $I \times J \times K$, thereby leading to the $2 \times 4 \times 3$ computing resource tensor $\mathcal{P}$ identifying a total of 24 workers. (b) An entry $p(\mathbf{s}) \in \{1, 2, \ldots, \prod_n S_n\}$ of the computing resource tensor $\mathcal{P}$ indicates that Worker $p(\mathbf{s})$ processes the sub-tensor $\mathcal{Y}^{(\mathbf{s})}$, where $\mathbf{s} = [s_1, s_2, \ldots, s_N]^T = \mathrm{ind2sub}(p(\mathbf{s}))$. Hence, in (a), $p(1, 3, 2) = 13$ means that Worker 13 processes the sub-tensor $\mathcal{Y}^{(1,3,2)}$, as $[1, 3, 2]^T = \mathrm{ind2sub}(13)$ and $p^{(2)}(3, 3) = 13$.

Given a big $N$th-order tensor $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, we partition each dimension $I_n$ into a total of $S_n$ disjoint subsets, each of which consists of $L_{s_n}$ indices, with $\sum_{s_n=1}^{S_n} L_{s_n} = I_n$. As such, the original big tensor is partitioned into a total of $W = S_1S_2\cdots S_N$ sub-tensors, and each sub-tensor can be denoted by $\mathcal{Y}^{(s_1, s_2, \ldots, s_N)}$ with size $L_{s_1} \times L_{s_2} \times \cdots \times L_{s_N}$, where $s_n \in \mathbb{N}_{S_n}$, $n = 1, 2, \ldots, N$. At the same time, suppose we have a computer cluster with $W$ workers (computational engines; see Footnote 3). We imagine that the $W$ workers are virtually spatially permuted to form an $S_1 \times S_2 \times \cdots \times S_N$ tensor, such that each worker is associated with exactly one sub-tensor; see Fig. 5(a). In other words, we will have two types of tensors (a small sketch of the worker-block correspondence follows this list):
• The small data tensors obtained by partitioning $\mathcal{Y}$:
$$\mathcal{Y}^{(\mathbf{s})} = \mathcal{Y}^{(s_1, s_2, \ldots, s_N)} \in \mathbb{R}^{L_{s_1} \times L_{s_2} \times \cdots \times L_{s_N}}, \qquad (19)$$
where $\mathbf{s} = [s_1, s_2, \ldots, s_N]^T$ is used to specify the $(s_1, s_2, \ldots, s_N)$th block of $\mathcal{Y}$, $s_n \in \mathbb{N}_{S_n}$, $n = 1, 2, \ldots, N$; see Fig. 5(a) for an illustration.
• The computing resource tensor $\mathcal{P} \in \mathbb{N}_+^{S_1 \times S_2 \times \cdots \times S_N}$, formed by reshaping $[1, 2, \ldots, W]$, that identifies the $W = S_1S_2\cdots S_N$ workers and the associated data tensors. The entries of $\mathcal{P}$ have a one-to-one correspondence with the sub-tensors $\mathcal{Y}^{(\mathbf{s})}$, meaning that Worker $p(\mathbf{s})$ processes the sub-tensor $\mathcal{Y}^{(\mathbf{s})}$; see Fig. 5(b).
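As a small illustration (NumPy; the 2x4x3 example sizes are taken from Fig. 5(a), the function names are ours), the correspondence between a worker identifier and the block subscripts it owns is just a column-major linear-index/subscript conversion, i.e., the analogue of MATLAB's ind2sub/sub2ind:

# Mapping between worker identifiers and block subscripts (cf. Fig. 5(b)).
import numpy as np

S = (2, 4, 3)                                              # S_1 x S_2 x S_3 resource tensor
P = np.arange(1, np.prod(S) + 1).reshape(S, order="F")     # reshape [1, ..., W] column-wise

def ind2sub(worker_id, S=S):
    """Return the (1-based) block subscripts s = (s_1, ..., s_N) handled by this worker."""
    return tuple(i + 1 for i in np.unravel_index(worker_id - 1, S, order="F"))

def sub2ind(s, S=S):
    """Return the worker id p(s) that processes block Y^(s)."""
    return int(np.ravel_multi_index(tuple(i - 1 for i in s), S, order="F")) + 1

assert ind2sub(13) == (1, 3, 2) and sub2ind((1, 3, 2)) == 13   # matches Fig. 5(a)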

It can be seen that RandTucker and RandTucker2i mainly use four basic operations: 1) mode-$n$ unfolding of a tensor; 2) a matrix-matrix product to obtain $\mathbf{Y}_{(n)}\mathbf{\Omega}^{(n)}$; 3) generating a Gaussian random matrix $\mathbf{\Omega}^{(n)}$; and 4) estimating an orthogonal basis of the range of $\mathbf{Y}_{(n)}\mathbf{\Omega}^{(n)}$. Below we discuss how to perform these four operations in a distributed computing system.

Footnote 3: We have assumed that each sub-tensor can be efficiently handled by the associated worker; otherwise, we increase the number of workers and decrease the size of each sub-tensor accordingly.

Mode-$n$ unfolding of tensor $\mathcal{Y}$. We unfold the computing resource tensor $\mathcal{P} \in \mathbb{N}_+^{S_1 \times S_2 \times \cdots \times S_N}$ along mode $n$ to obtain $\mathbf{P}^{(n)} \in \mathbb{N}_+^{S_n \times K}$, $K = \prod_{j \neq n} S_j$. Let $p^{(n)}(s_n, k)$ be the $(s_n, k)$th entry of $\mathbf{P}^{(n)}$, and define
$$\mathbf{s} = \boldsymbol{\pi}_{s_n k} = \mathrm{ind2sub}\!\left(p^{(n)}(s_n, k)\right), \quad s_n = 1, 2, \ldots, S_n, \quad k = 1, 2, \ldots, \prod_{j \neq n} S_j, \qquad (20)$$
where the function $\mathrm{ind2sub}(\cdot)$ converts an entry of $\mathbf{P}^{(n)}$, i.e., $p^{(n)}(s_n, k)$, a worker identifier, into an $N$-by-1 vector of subscript values determining the position of $p^{(n)}(s_n, k)$ in the tensor $\mathcal{P}$, which allows us to locate the sub-tensor associated with this worker. In other words, (20) means that the tensor $\mathcal{Y}^{\boldsymbol{\pi}_{s_n k}}$ is processed by Worker $p^{(n)}(s_n, k)$. Then it can be seen that

$$\mathbf{Y}_{(n)} = \begin{bmatrix} \mathbf{Y}^{\boldsymbol{\pi}_{11}}_{(n)} & \mathbf{Y}^{\boldsymbol{\pi}_{12}}_{(n)} & \cdots & \mathbf{Y}^{\boldsymbol{\pi}_{1K}}_{(n)} \\ \mathbf{Y}^{\boldsymbol{\pi}_{21}}_{(n)} & \mathbf{Y}^{\boldsymbol{\pi}_{22}}_{(n)} & \cdots & \mathbf{Y}^{\boldsymbol{\pi}_{2K}}_{(n)} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{Y}^{\boldsymbol{\pi}_{S_n 1}}_{(n)} & \mathbf{Y}^{\boldsymbol{\pi}_{S_n 2}}_{(n)} & \cdots & \mathbf{Y}^{\boldsymbol{\pi}_{S_n K}}_{(n)} \end{bmatrix} \in \mathbb{R}^{I_n \times \prod_{j \neq n} I_j}, \qquad (21)$$
where $\mathbf{Y}^{\boldsymbol{\pi}_{s_n k}}_{(n)}$ is the mode-$n$ matricization of $\mathcal{Y}^{\boldsymbol{\pi}_{s_n k}}$ processed by Worker $p^{(n)}(s_n, k)$, $s_n \in \mathbb{N}_{S_n}$ and $k \in \mathbb{N}_K$.

In summary, to obtain $\mathbf{Y}_{(n)}$ we perform the mode-$n$ unfolding of the sub-tensors $\mathcal{Y}^{(\mathbf{s})}$ simultaneously; then the $(s_n, k)$th sub-matrix of $\mathbf{Y}_{(n)}$ is given by Worker $p^{(n)}(s_n, k)$, as shown in (21).

Matrix-matrix product. Large-scale matrix-matrix products can be performed by using any existing technique. As the mode-$n$ unfolding matrix $\mathbf{Y}_{(n)}$ here has already been partitioned and distributed, we apply the matrix partitioning technique: we compute the product $\mathbf{Y}_{(n)}\mathbf{\Omega}^{(n)}$ by using
$$\begin{bmatrix} \mathbf{Z}^{(n)}_1 \\ \mathbf{Z}^{(n)}_2 \\ \vdots \\ \mathbf{Z}^{(n)}_{S_n} \end{bmatrix} = \mathbf{Y}_{(n)}\mathbf{\Omega}^{(n)} = \mathbf{Y}_{(n)} \begin{bmatrix} \mathbf{\Omega}^{(n)}_1 \\ \mathbf{\Omega}^{(n)}_2 \\ \vdots \\ \mathbf{\Omega}^{(n)}_K \end{bmatrix} \in \mathbb{R}^{I_n \times R_n}, \qquad (22)$$
where $\mathbf{Y}_{(n)}$ is given by (21) and $\mathbf{\Omega}^{(n)}_k \in \mathbb{R}^{\prod_{j \neq n} L_{s_j} \times R_n}$ are the sub-matrices obtained by partitioning the random Gaussian matrix $\mathbf{\Omega}^{(n)}$ into $K$ parts.

Fig. 6: (a) Time consumption comparison of the FFCP, the CP-ALS, and the PARACOMP, on condition that the data tensor of size $I \times I \times I$ can be loaded into memory; for the FFCP, the RandTucker2i was used to compress the data, with $p = 15$. (b) Scalability of the FFCP algorithm provided that the given big tensor was in its Tucker format; the time consumed by the FFCP increases nearly linearly with respect to $I$. (Axes: scale of the input data $I^N$ versus time in seconds.)

From (22) we have $\mathbf{Z}^{(n)}_{s_n} = \sum_{k=1}^{K} \mathbf{Y}^{\boldsymbol{\pi}_{s_n k}}_{(n)} \mathbf{\Omega}^{(n)}_k$, and all the products $\mathbf{Y}^{\boldsymbol{\pi}_{s_n k}}_{(n)} \mathbf{\Omega}^{(n)}_k$ are computed on Workers $p(\boldsymbol{\pi}_{s_n k})$ simultaneously and then broadcast to Worker $p^{(n)}(s_n, k_0)$, typically $k_0 = 1$. Finally, for each $s_n \in \mathbb{N}_{S_n}$, we sum $\mathbf{Y}^{\boldsymbol{\pi}_{s_n k}}_{(n)} \mathbf{\Omega}^{(n)}_k$ over $k$ on Worker $p^{(n)}(s_n, k_0)$ to yield $\mathbf{Z}^{(n)}_{s_n}$.

Generating the Gaussian random matrix $\mathbf{\Omega}^{(n)}$. Note that

the workers identified by the $k$th column of $\mathbf{P}^{(n)}$, i.e., the workers $p^{(n)}(s_n, k)$, $s_n = 1, 2, \ldots, S_n$, share the same random Gaussian matrix $\mathbf{\Omega}^{(n)}_k$ (corresponding to the $k$th column of blocks of $\mathbf{Y}_{(n)}$ in (21)). Hence $\mathbf{\Omega}^{(n)}_k$ can be generated on one worker, say $p^{(n)}(1, k)$, and then broadcast to all the workers $p^{(n)}(s_n, k)$ with $s_n \neq 1$. However, this way suffers from heavy communication overhead. A more economical method is to broadcast the parameters (or the random number generators) that control the random number generation; the random matrices $\mathbf{\Omega}^{(n)}_k$ are then generated on all workers by using the received random number generators. As such, for each $k$, the workers $p^{(n)}(s_n, k)$, $s_n \in \mathbb{N}_{S_n}$, will generate identical random Gaussian matrices $\mathbf{\Omega}^{(n)}_k$ simultaneously.

Estimating an orthonormal basis of the columns of $\mathbf{Y}_{(n)}\mathbf{\Omega}^{(n)}$. To this end, we perform the eigenvalue decomposition of $\sum_{s_n=1}^{S_n} \mathbf{Z}^{(n)T}_{s_n}\mathbf{Z}^{(n)}_{s_n} = \mathbf{U}\mathbf{\Sigma}\mathbf{U}^T$. Then the mode-$n$ factor matrix $\mathbf{U}^{(n)}$ with orthonormal columns can be computed from
$$\mathbf{U}^{(n)}_{s_n} = \mathbf{Z}^{(n)}_{s_n}\mathbf{U}\mathbf{\Sigma}^{-1/2} \qquad (23)$$
in parallel on the workers $p^{(n)}(s_n, 1)$, where $\mathbf{U}^{(n)}_{s_n}$ is the $s_n$th sub-matrix of $\mathbf{U}^{(n)}$, $s_n \in \mathbb{N}_{S_n}$. After $\mathbf{U}^{(n)}$ has been estimated, $\mathcal{Y}$ is updated as $\mathcal{Y} \leftarrow \mathcal{Y} \times_n \mathbf{U}^{(n)T}$, which can be realized using the above mode-$n$ tensor unfolding and matrix-matrix product. After the update, the $n$th dimension of $\mathcal{Y}^{(\mathbf{s})}$ is reduced to $R_n$ from $L_{s_n}$.

We repeat the above procedure for $n = 1, 2, \ldots, N$. As the last step, the core tensor is computed as $\mathcal{G} = \sum_{w=1}^{W} \mathcal{Y}^{(w)}$, i.e., the sum of the sub-tensors distributed on all workers.
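The Gram-matrix trick behind (23) can be checked with a small in-memory NumPy sketch (illustrative names; the row blocks Z_s stand for the pieces held by different workers):

# Orthonormalization of a tall matrix from its row blocks, cf. (23);
# only the small R x R Gram matrices need to be exchanged between workers.
import numpy as np

def distributed_orthonormal_basis(Z_blocks):
    """Return blocks U_s = Z_s U Sigma^(-1/2); stacking them gives orthonormal columns."""
    gram = sum(Z.T @ Z for Z in Z_blocks)          # sum of local Gram matrices (R x R)
    evals, U = np.linalg.eigh(gram)                # gram = U Sigma U^T
    scale = U / np.sqrt(evals)                     # U Sigma^(-1/2), applied column-wise
    return [Z @ scale for Z in Z_blocks]           # computed in parallel in the real setting

# Quick check that the stacked result has orthonormal columns.
rng = np.random.default_rng(0)
blocks = [rng.standard_normal((40, 5)) for _ in range(3)]
Q = np.vstack(distributed_orthonormal_basis(blocks))
assert np.allclose(Q.T @ Q, np.eye(5), atol=1e-8)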

C. Nonnegative Tucker Decomposition

In our recent papers [19], [31], we also showed how to efficiently perform nonnegative Tucker decomposition based on a Tucker approximation. The idea of that method is similar to the FFCP method: after the original high-dimensional raw data have been replaced by their Tucker approximation with low multilinear rank, the computational complexity of the gradients with respect to the nonnegative core tensor and factor matrices can be significantly reduced, in terms of both time and space. Readers are referred to [19], [31] for more details. In summary, Tucker compression can play a fundamental role in achieving highly efficient tensor decompositions, as illustrated in Fig. 1. Based on the proposed randomized Tucker decomposition methods, we can easily perform various types of constrained/unconstrained tensor decompositions for very large-scale tensor data.

V. SIMULATIONS

A. Synthetic Data

In this subsection we investigate the performance of the proposed FFCP algorithm by using synthetic data. All the experiments were performed in 64-bit MATLAB on a computer with an Intel Core i7 CPU (3.33 GHz) and 24 GB of memory, running 64-bit Windows 7.

TABLE II: Comparison between the algorithms when they were applied to perform CPD of synthetic data. The noise was Gaussian with SNR = 10 dB in all cases.

Algorithm   | N = 3, I = 200, R = 10 | N = 3, I = 500, R = 20 | N = 6, I = 20, R = 5
            | Fit (%)       Time (s) | Fit (%)       Time (s) | Fit (%)       Time (s)
CP-ALS      | 82.1 ± 10.2   395      | 80.5 ± 6.8    640      | 89.6 ± 11.2   188
FastCPALS   | 79.8 ± 9.0    104      | 82.9 ± 5.6    211      | 92.4 ± 10.9   66
PARAFAC     | 82.5 ± 7.9    782      | 80.5 ± 6.1    1583     | 87.6 ± 11.4   643
FFCP        | 81.9 ± 8.0    35       | 82.8 ± 5.6    40       | 93.4 ± 8.6    50

TABLE III: Comparison between the algorithms when they were applied to perform nonnegative CPD of synthetic data. The noise was Gaussian with SNR = 10 dB in all cases.

Algorithm   | N = 3, I = 200, R = 10 | N = 3, I = 500, R = 20 | N = 6, I = 20, R = 5
            | Fit (%)      Time (s)  | Fit (%)      Time (s)  | Fit (%)      Time (s)
CP-NMU      | 97.3 ± 1.2   664       | 97.6 ± 0.8   16952     | 91.7 ± 5.6   4265
CP-HALS     | 97.8 ± 1.0   121       | 97.7 ± 0.7   2281      | 95.1 ± 4.1   1014
FFCP-MU     | 99.0 ± 0.5   23        | 99.3 ± 0.2   121       | 93.4 ± 4.1   70
FFCP-HALS   | 99.1 ± 0.1   61        | 99.5 ± 0.1   299       | 94.3 ± 3.5   118

Simulations on CPD. In each run, a tensor $\mathcal{Y}^*$ was generated by using the model (2), where the entries of the latent factor matrices $\mathbf{A}^{(n)} \in \mathbb{R}^{I \times R}$ (i.e., the ground truth), $n = 1, 2, \ldots, N$, were drawn from i.i.d. standard normal distributions. Then independent Gaussian noise (with SNR = 10 dB) was added to $\mathcal{Y}^*$ to yield the noisy observation tensor $\mathcal{Y}$. We tested three different combinations of $N$, $I$, and $R$, as shown in TABLE II. The CP-ALS and PARAFAC algorithms, which are the standard algorithms included in the Tensor Toolbox 2.5 [32] and the N-way toolbox [33], respectively, were compared as the baseline. The FastCPALS algorithm [34] was also compared. For all the algorithms, the maximum number of iterations was set to 1000 and a stopping criterion $|\mathrm{Fit}(\mathcal{Y}, \hat{\mathcal{Y}}_t) - \mathrm{Fit}(\mathcal{Y}, \hat{\mathcal{Y}}_{t+1})| < 10^{-6}$ was applied, where $\hat{\mathcal{Y}}_t$ is the estimate of $\mathcal{Y}^*$ obtained after the $t$th iteration. The mean values and standard deviations of $\mathrm{Fit}(\mathcal{Y}^*, \hat{\mathcal{Y}})$ over 20 Monte-Carlo runs are detailed in TABLE II, along with the average time consumed by each algorithm. From the table we see that the FFCP algorithm is significantly faster than the other algorithms, without significant deterioration of performance, although it works on a compressed tensor instead of the original tensor. We also tested the PARACOMP method [35], which was specially proposed for big tensor decompositions. It turned out that PARACOMP was sensitive to noise: in noise-free cases it often achieved a perfect Fit of 1; otherwise, when the noise was 10 dB, it generally obtained a Fit lower than 0.3.

In Fig. 6 we demonstrate the scalability of the proposed FFCP method, where RandTucker2i was used to compress the data with $p = 15$. From Fig. 6(a) we can see that the FFCP shows very promising scalability: its time consumption increases almost linearly with respect to the scale of the input data, on condition that the rank is fixed; in contrast, the time consumed by the other algorithms increases nearly exponentially. Note that both the FFCP and the PARACOMP method [35] were designed mainly for big data analytics, and this comparison does not precisely reflect their actual time consumption difference in a distributed computational environment. However, it does show that the proposed method is more efficient than PARACOMP, which can also be inferred from two obvious facts: 1) PARACOMP compresses the observed big tensor at least $P$ times ($P \geq I/R$), whereas our RandTucker2i-based FFCP only needs to do it twice (or even once when RandTucker is applied); 2) although both of them need to perform CPD on small compressed tensors, the PARACOMP method needs to do this $P$ times, whereas the FFCP needs to do it only once, as the final step. In particular, in PARACOMP any failure of the CPDs of the $P$ sub-tensors will degrade the final results. Fig. 6(b) shows the scalability of the FFCP provided that the given big tensor is in its Tucker format. For tensors of such scale, if we use the original full tensor directly, all the algorithms run out of memory; but if we use its Tucker compression, the FFCP can perform constrained/unconstrained CPD efficiently, even on our desktop computer.

Simulations on Nonnegative CPD. In this experiment the data were generated as above, except that the latent factor matrices were generated from i.i.d. exponential distributions with mean parameter $\mu = 10$. Then we randomly selected 10% of the entries of each factor to be zeros, to generate sparse factor matrices. As the PARAFAC algorithm was very slow, we only compared the CP-NMU algorithm, a standard nonnegative CPD algorithm based on the multiplicative update rule included in the Tensor Toolbox [32], and the CP-HALS algorithm [36], which is based on the hierarchical ALS (HALS) method. For a fair comparison, the corresponding FFCP implementations were compared. The results, averaged over 20 Monte-Carlo runs, are detailed in TABLE III. From the table, the FFCP implementations were not only significantly faster than their competitors but also often achieved higher Fit values. To the best of our knowledge, the FFCP is one of the most efficient algorithms for nonnegative CPD.

To demonstrate the high efficiency of the FFCP algorithm, in this simulation the entries of $\mathbf{A}^{(n)} \in \mathbb{R}^{10000 \times 10}$, $n = 1, 2, 3, 4$, were drawn from i.i.d. exponential distributions with mean parameter $\mu = 10$. Again, we randomly selected 20% of the entries of $\mathbf{A}^{(n)}$ to be zeros to obtain sparse factor matrices. By using (2) and the $\mathbf{A}^{(n)}$ generated above, a big 4th-order tensor $\mathcal{Y} \in \mathbb{R}^{10000 \times 10000 \times 10000 \times 10000}$ with $R = 10$ was obtained, which contains a total of $10^{16}$ entries in its full format. It is certainly impossible to decompose such a huge tensor on a standard desktop computer. Suppose we have already obtained its Tucker approximation $\mathcal{Y}_T = [\![\mathcal{G}; \bar{\mathbf{A}}^{(1)}, \bar{\mathbf{A}}^{(2)}, \bar{\mathbf{A}}^{(3)}, \bar{\mathbf{A}}^{(4)}]\!] \approx \mathcal{Y}$. Due to the rotation ambiguity of Tucker decompositions, we know that theoretically $\mathbf{A}^{(n)} = \bar{\mathbf{A}}^{(n)}\mathbf{Q}^{(n)}$, where $\mathbf{Q}^{(n)}$ is not necessarily nonnegative. Generally, applying CPD to the core tensor $\mathcal{G}$ directly cannot guarantee the nonnegativity of the factor matrices. Instead, we applied the FFCP algorithm to the Tucker-compressed tensor $\mathcal{Y}_T$ to extract nonnegative factor matrices. In a typical run, the FFCP-HALS consumed only 774 seconds, and the recovered components had a signal-to-interference ratio (SIR; see Footnote 4) higher than 200 dB, which means they almost perfectly match the true components. Although this example only shows an ideal noise-free case, it illustrates how the FFCP algorithm can be applied to perform nonnegative CPD of huge tensors, as long as good low-rank approximations of them can be obtained.

B. The COIL-100 Images Clustering Analysis

In this experiment we applied the proposed algorithms to the clustering analysis of objects selected from the Columbia Object Image Library (COIL-100). The COIL-100 database consists of 7200 images of 100 objects, each of which has 72 images taken from different poses; hence these images naturally form a $128 \times 128 \times 3 \times 7200$ tensor $\mathcal{Y}$. The data tensor was decomposed by using tensor decomposition algorithms, and the two t-Distributed Stochastic Neighbor Embedding (t-SNE) components of $\mathbf{A}^{(4)}$ were then used as features to cluster and visualize the results. The K-means algorithm was used to cluster the objects, and the accuracy of clustering was defined as in [37]. To justify the proposed distributed RandTucker algorithm, we used the MATLAB Parallel Computing Toolbox, which allows us to use the full processing power of multicore desktops by executing applications on workers (MATLAB computational engines) that run locally. The tensor $\mathcal{Y}$ was partitioned into 4 sub-tensors with size $128 \times 128 \times 3 \times 1800$ and was distributed on 4 workers. As the baseline, the Tucker-ALS algorithm included in [32] was compared, and the multilinear rank was empirically set to $R = (20, 20, 3, 20)$. For the RandTucker method, $R = (30, 30, 3, 50)$ was set. Both of these algorithms used 2 iterations. Their performances are detailed in TABLE IV. It can be seen that no significant deterioration was observed between the randomized

Footnote 4: The performance index SIR is defined as $\mathrm{SIR}(\mathbf{a}^{(n)}_r, \hat{\mathbf{a}}^{(n)}_r) = -20\log\left(\|\hat{\mathbf{a}}^{(n)}_r - \mathbf{a}^{(n)}_r\|_2 / \|\mathbf{a}^{(n)}_r\|_2\right)$, where $\hat{\mathbf{a}}^{(n)}_r$ is an estimate of $\mathbf{a}^{(n)}_r$; both $\mathbf{a}^{(n)}_r$ and $\hat{\mathbf{a}}^{(n)}_r$ are normalized to be zero-mean and unit-variance.

TABLE IV: Comparison between the algorithms when they were applied to the clustering analysis of COIL-100 objects.

Algorithm             Fit (%)   Accuracy (%)
Tucker-ALS            70.3      69.2
RandTucker2i          69.1      68.4
RandTucker            68.6      69.8
FFCP (Tucker-ALS)     67.1      71.3
FFCP (RandTucker2i)   64.7      68.0

Fig. 7: Visualization of the clustering results obtained by applying the distributed RandTucker2i to the COIL-100 image database. The two t-SNE components of $\mathbf{A}^{(4)}$ were used for visualization.

methods and the deterministic methods, which suggests that the distributed RandTucker algorithm can be applied to really large-scale problems. The clustering results obtained by RandTucker2i are visualized in Fig. 7 by using the two t-SNE components of $\mathbf{A}^{(4)}$.

C. UCF Sports Action Videos Analysis

This data set consists of a set of videos collected from various sports that are typically featured on broadcast television channels [38] (see http://crcv.ucf.edu/data/UCF_Sports_Action.php). To avoid rescaling the videos and to keep all the videos in the same specification, we selected 5 actions, namely Diving, Riding-Horse, Run, Swing, and Walk-Front, each of which contains 10 video clips of size $404 \times 720 \times 3 \times 35$ (35 is the number of frames). We concatenated these videos to form a 5th-order tensor. This tensor has about $1.6 \times 10^9$ entries, and more than 99% of them are nonzero. When it is saved in double format, it consumes about 11.7 GB of space. We used tensor decomposition methods to decompose this

Fig. 8: The UCF Sports Action videos. The first row shows a snapshot of each action, and the second row shows 5 snapshots from a video of diving.

tensor to extract low-dimensional features for further clustering analysis. When we applied the CP-ALS method included in the Tensor Toolbox to decompose 30 videos, it consumed about 4 hours for only one iteration; when 40 videos were considered, the CP-ALS ran out of memory. We configured a small cluster with 3 computers (a desktop computer plus two laptop computers, each of which has an Intel i7 dual-core CPU and 8 GB of physical memory, running Windows 7 and MATLAB 2012b). In the first stage, RandTucker2i was applied with $R = [35, 35, 3, 10, 20]$. It consumed about 104 seconds to obtain the Tucker approximation of the original tensor, with a Fit of 72.3% to the observation data. Then the FFCP was applied using a desktop computer with $R = 10$. In this case only 75 seconds were required without nonnegativity constraints, and 7 seconds were consumed when nonnegativity constraints were imposed. Finally, we used $\mathbf{A}^{(5)}$ as features to cluster the videos and achieved an accuracy of 58% in a typical run. This experiment showed that the proposed algorithms can be applied to decompose very large-scale tensors by using a distributed computing cluster.

VI. CONCLUSION

Tensor decompositions are particularly promising for big multiway data analytics, as they are able to discover more complex internal structures and correlations of data gathered from multiple aspects, devices, trials, and subjects. While most existing tensor decomposition methods are not designed to meet the major challenges posed by big data, in this paper we proposed a unified framework for the decomposition of big tensors with relatively low multilinear rank. The key idea is that, based on the Tucker approximation of a big tensor, a large set of unconstrained/constrained tensor decompositions can be efficiently performed. We developed a flexible fast algorithm for the CP decomposition of tensors using their Tucker representation, and a distributed randomized Tucker decomposition approach which is particularly suitable for big tensor analysis. Simulation results confirmed that the proposed algorithms are highly efficient and scalable.

While the randomized Tucker decomposition approach showed close-to-optimal performance, the corresponding error bounds need to be investigated theoretically in the future. We also believe that many randomized (or related) approaches proposed for the low-rank approximation of matrices [29], [39] could be applied to Tucker decompositions; however, a comparative study in tensor decomposition scenarios is desired. In particular, if we have efficient and robust unconstrained CP decomposition algorithms, we can use the CP model to compress the raw data instead of the Tucker model, which suffers from the curse of dimensionality for very high-order tensors. Hence, developing more reliable and robust CP decomposition approaches is another important task. All these open issues deserve further study in the near future.


[Fig. 1: Decompositions of a big tensor based on its compact Tucker representation. The Tucker representation (which reduces noise and compresses the data) serves as the common starting point for CPD, nonnegative CPD, and nonnegative Tucker decomposition, possibly with additional constraints such as sparsity, statistical independence, and smoothness.]

CPD algorithms often suffer from the issue of ill-posedness, in the sense that the optimal low-rank approximation may not exist at all. Hence, in this type of methods any ill-conditioned sub-tensors formed by the sampling/partition operation will unavoidably deteriorate the final results. Another scalable approach, GigaTensor [11], requires the target tensors to be very sparse. For Tucker decompositions, the MACH approach [12] was proposed to obtain a sparse tensor by sampling the entries of a target tensor. Some other approaches that sample the columns (fibers) of target tensors can be found in [13], [14]. However, these methods are often not suitable for very noisy data. Our purpose in this paper is to develop a general framework for big tensor decompositions. The basic idea is simple: assuming the given big tensor has low multilinear rank, we first use the Tucker model to compress the original big tensor; using this Tucker representation of the data, both unconstrained/constrained CP and Tucker decompositions can be efficiently performed, as illustrated in Fig. 1. We developed fast scalable algorithms for CPD based on a compact Tucker representation of data and a new randomized algorithm for the Tucker decomposition of a big tensor.

The rest of the paper is organized as follows. In Section II the CP and Tucker decomposition models and algorithms are briefly overviewed. In Section III a family of new algorithms is proposed for (nonnegative) CP decompositions based on the Tucker representation of tensors. In Section IV two new distributed randomized Tucker decomposition algorithms are proposed that can be used to decompose big tensors with relatively low multilinear rank. Simulation results on synthetic and real-world data are presented in Section V, and conclusions are drawn in Section VI.

In TABLE I the frequently used notations are listed; readers are referred to [1], [5] for more details on tensor conventions.

II. OVERVIEW OF TUCKER AND POLYADIC DECOMPOSITIONS

By using the Tucker model, a given high-order tensor $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is approximated as
$$\mathcal{Y} \approx \widehat{\mathcal{Y}} = \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \cdots \times_N \mathbf{U}^{(N)} = [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!], \quad (1)$$
where $\mathbf{U}^{(n)} = [\mathbf{u}^{(n)}_1, \mathbf{u}^{(n)}_2, \ldots, \mathbf{u}^{(n)}_{R_n}] \in \mathbb{R}^{I_n \times R_n}$ is the mode-$n$ component (factor) matrix consisting of the latent components $\mathbf{u}^{(n)}_{r_n}$ as its columns, $\mathcal{G} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ is the core tensor reflecting the connections between the components, and $(R_1, R_2, \ldots, R_N)$ is called the multilinear rank of $\mathcal{Y}$, with $R_n = \operatorname{rank}(\mathbf{Y}_{(n)})$.

By applying the SVD to $\mathbf{Y}_{(n)}$, $n = 1, 2, \ldots, N$, we obtain the Tucker decomposition of a given tensor, which is referred to as the high-order SVD (HOSVD) and has been widely applied in multidimensional data analysis [15]–[17]. While the Tucker decomposition can be easily implemented, it is also criticized for its lack of uniqueness and for suffering from the curse of dimensionality. As another important tensor decomposition model, the CP model decomposes the target tensor into a set of rank-one terms, that is,
$$\mathcal{Y} \approx \widehat{\mathcal{Y}} = \sum_{r} \lambda_r \, \mathbf{a}^{(1)}_r \circ \mathbf{a}^{(2)}_r \circ \cdots \circ \mathbf{a}^{(N)}_r = [\![\mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \ldots, \mathbf{A}^{(N)}]\!], \quad (2)$$
where $\circ$ denotes the outer product of vectors and, for simplicity, $\lambda_r$ is absorbed into $\mathbf{a}^{(N)}_r$, $r = 1, 2, \ldots, R$. CPD is essentially unique under mild conditions [6], which has found many important applications such as blind identification and feature extraction. However, the CPD of a given tensor is not always well-defined, in the sense that the optimal CP decomposition for a given rank may not exist at all [18].
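To make the two models concrete, the following is a minimal MATLAB sketch (all sizes and variable names are illustrative, not taken from the paper) that builds a small 3rd-order tensor from CP factors as in (2), and from a core tensor and factor matrices as in (1), using only permute/reshape to realize the mode-n products.

```matlab
% Build a small 3rd-order tensor from its CP factors (model (2)) and
% from a Tucker core plus factor matrices (model (1)).
I = [30 40 50];  R = 5;                                  % illustrative sizes
A = {randn(I(1),R), randn(I(2),R), randn(I(3),R)};       % CP factors

% CP model (2): Y = sum_r a1_r o a2_r o a3_r
Ycp = zeros(I);
for r = 1:R
    Ycp = Ycp + reshape(kron(A{3}(:,r), kron(A{2}(:,r), A{1}(:,r))), I);
end

% Tucker model (1): Y = G x1 U1 x2 U2 x3 U3 with multilinear rank (3,4,5)
Rt = [3 4 5];
G  = randn(Rt);
U  = {randn(I(1),Rt(1)), randn(I(2),Rt(2)), randn(I(3),Rt(3))};
Ytk = G;
for n = 1:3
    sz = size(Ytk);  perm = [n, setdiff(1:3, n)];
    Gn = reshape(permute(Ytk, perm), sz(n), []);          % mode-n unfolding
    sz(n) = size(U{n}, 1);
    Ytk = ipermute(reshape(U{n} * Gn, sz(perm)), perm);   % mode-n product
end
```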

III. EFFICIENT CP DECOMPOSITIONS OF TENSORS BASED ON THEIR TUCKER APPROXIMATIONS

A. Overview of CPD Algorithms

Alternating least-squares (ALS) has been the standard approach to performing CPD; it minimizes a total of $N$ LS problems
$$J(\mathbf{A}^{(n)}) = \frac{1}{2} \left\| \mathbf{Y}_{(n)} - \mathbf{A}^{(n)} \mathbf{B}^{(n)T} \right\|_F^2, \quad n \in \mathbb{N}_N, \quad (3)$$
in an alternating fashion in each iteration, possibly with additional constraints (e.g., nonnegativity) imposed on the factor matrices $\mathbf{A}^{(n)}$, where
$$\mathbf{B}^{(n)} = \bigodot_{p \neq n} \mathbf{A}^{(p)} \in \mathbb{R}^{\prod_{p \neq n} I_p \times R}. \quad (4)$$

[Fig. 2: Two different ways to perform CPD based on a Tucker approximation. (a) Tucker+CP: the CPD is performed on a small full tensor (i.e., the core tensor $\mathcal{G}$). (b) FFCP: the CPD is performed on a tensor in its approximate Tucker representation $[\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!]$.]

Based on the gradient of $J(\mathbf{A}^{(n)})$,
$$\frac{\partial J}{\partial \mathbf{A}^{(n)}} = \mathbf{A}^{(n)} \mathbf{B}^{(n)T} \mathbf{B}^{(n)} - \mathbf{Y}_{(n)} \mathbf{B}^{(n)}, \quad (5)$$

we can obtain the standard ALS update rule for the CPD,
$$\mathbf{A}^{(n)} \leftarrow \mathbf{Y}_{(n)} \mathbf{B}^{(n)} \left( \mathbf{B}^{(n)T} \mathbf{B}^{(n)} \right)^{-1}, \quad n = 1, 2, \ldots, N, \quad (6)$$
by letting $\partial J / \partial \mathbf{A}^{(n)} = 0$, or a family of first-order nonnegative CPD algorithms [19], for example the one using the multiplicative update (MU) rule,
$$\mathbf{A}^{(n)} \leftarrow \mathbf{A}^{(n)} \circledast \left[ \left( \mathbf{Y}_{(n)} \mathbf{B}^{(n)} \right) \oslash \left( \mathbf{A}^{(n)} \mathbf{B}^{(n)T} \mathbf{B}^{(n)} \right) \right], \quad (7)$$
and the efficient hierarchical ALS (HALS) approach that updates only one column at a time,
$$\mathbf{a}^{(n)}_r \leftarrow \mathbf{a}^{(n)}_r + \frac{1}{t^{(n)}_{rr}} P_+ \left( \mathbf{q}^{(n)}_r - \mathbf{A}^{(n)} \mathbf{t}^{(n)}_r \right), \quad (8)$$
where $P_+(x) = \max(x, 0)$ is applied element-wise to a matrix/vector [19], [20], $t^{(n)}_{rr}$ and $\mathbf{t}^{(n)}_r$ are the $(r, r)$ entry and the $r$th column of the matrix $\mathbf{T}^{(n)} = \mathbf{B}^{(n)T} \mathbf{B}^{(n)}$, respectively, and $\mathbf{q}^{(n)}_r$ is the $r$th column of $\mathbf{Q}^{(n)} = \mathbf{Y}_{(n)} \mathbf{B}^{(n)}$. Efficient computation of (5) is the foundation of a large class of first-order methods for CPD and their variants [19]. In [21] the authors argued that first-order (gradient) methods generally provide acceptable convergence speed, are easy to parallelize, and hence are more suitable than their second-order counterparts for large-scale optimization problems. This motivates us to restrict our discussion to first-order methods in this paper.
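As a concrete illustration of the three update rules, the following MATLAB fragment (a sketch with placeholder matrices, not code from the paper) applies (6), (7), and (8) to one factor matrix, assuming the two small matrices $\mathbf{Q}^{(n)} = \mathbf{Y}_{(n)}\mathbf{B}^{(n)}$ and $\mathbf{T}^{(n)} = \mathbf{B}^{(n)T}\mathbf{B}^{(n)}$ have already been computed.

```matlab
% One factor update given Q = Y_(n)*B^(n) (In x R) and T = B^(n)'*B^(n) (R x R).
In = 100;  R = 10;
An = rand(In, R);                         % current estimate of A^(n)
Q  = rand(In, R);                         % placeholder for Y_(n)*B^(n)
T  = rand(R); T = T'*T + eye(R);          % placeholder, positive definite

% (6) unconstrained ALS update:  A <- Q * inv(T)
A_als = Q / T;

% (7) multiplicative update (nonnegative data assumed, so Q >= 0)
A_mu = An .* (Q ./ max(An * T, eps));

% (8) HALS: one column at a time; here the projection P+ is applied to the
% updated column, which is one common form of the HALS update.
A_hals = An;
for r = 1:R
    A_hals(:, r) = max(A_hals(:, r) + (Q(:, r) - A_hals * T(:, r)) / T(r, r), 0);
end
```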

In the following we discuss how to compute the gradient terms in (5) efficiently. While the term
$$\mathbf{B}^{(n)T} \mathbf{B}^{(n)} = \circledast_{p \neq n} \left( \mathbf{A}^{(p)T} \mathbf{A}^{(p)} \right) \in \mathbb{R}^{R \times R} \quad (9)$$
in (5) (and in (6), (7), and (8) as well) can be efficiently computed, the term $\mathbf{Y}_{(n)} \mathbf{B}^{(n)}$ is quite computationally demanding, especially when $\mathcal{Y}$ is huge and $N$ is large. Indeed, it is likely that in many cases neither $\mathcal{Y}$ nor $\mathbf{B}^{(n)}$ can be loaded into memory. To overcome this problem, we often hope to 1) avoid accessing the huge tensor $\mathcal{Y}$ frequently (e.g., in every iteration); 2) avoid explicitly constructing the big matrix $\mathbf{B}^{(n)} \in \mathbb{R}^{\prod_{k \neq n} I_k \times R}$. To this end, two properties of $\mathcal{Y}$ can be employed to improve the scalability of CPD algorithms:
• Sparsity of $\mathcal{Y}$. In this case the GigaTensor method [11] provides an elegant solution: only the nonzero entries of $\mathcal{Y}$ are accessed, and the columns of $\mathbf{Y}_{(n)} \mathbf{B}^{(n)}$ are computed in a parallel manner without explicit construction of $\mathbf{B}^{(n)}$.
• Low rank of $\mathcal{Y}$ (the tensor can be dense). This type of method uses a low-rank approximation instead of the original huge dense tensor $\mathcal{Y}$. A standard method is to perform Tucker compression first, typically by using the HOSVD, and then apply CPD to the core tensor [22]; this is referred to as the Tucker+CP method in this paper. The approach in [10] splits the original big tensor into a set of small sub-tensors and then applies a CPD to each sub-tensor. After that, $\mathbf{Y}_{(n)} \mathbf{B}^{(n)}$ can be computed by using these sub-tensors in their CP approximation instead of $\mathcal{Y}$. Arguing that some sub-tensors could be very ill-posed, which unavoidably causes poor performance, the PARACOMP method [23] uses a set of random matrices that share a pre-specified number of common anchor rows to compress the tensor $\mathcal{Y}$ using the Tucker model, so that a set of much smaller tensors is obtained. The CPD is then performed on these small tensors simultaneously, and the original true components are recovered by solving a linear system incorporating the information provided by the anchor rows. This method avoids the SVD of huge matrices required by the HOSVD; however, it involves a total of $P$ Tucker compression procedures and CPDs of $P$ small tensors, where $P \ge I_n / R$.

Both the method in [10] and the PARACOMP method in [23] rely on reliable CPDs of all sub- (or intermediate) tensors. However, it is widely known that existing CPD approaches often converge to local solutions, especially when bottlenecks exist in one or more factor matrices [24], or, even worse, the CPD is simply not unique. For example, in [23] the authors argued that the method in [10] assumed each sub-tensor had a unique CPD, which, however, cannot be guaranteed from the global uniqueness conditions alone [23]. In PARACOMP the uniqueness of the CPD of each small tensor can be guaranteed from the global uniqueness conditions; however, it does not solve the local convergence problem and is sensitive to noise. In other words, to achieve satisfactory results PARACOMP actually requires that all the CPDs in the intermediate step converge globally, which is often unrealistic. Hence, we believe it is important to develop more robust and scalable algorithms for the CPD, especially when $\mathcal{Y}$ is dense; in the meanwhile, we should avoid using CPDs as intermediate steps due to the local convergence issue.

Let us go into more detail on the Tucker+CP approach. In this approach the big tensor $\mathcal{Y}$ is approximated by its HOSVD, $\mathcal{Y} \approx \widehat{\mathcal{Y}} = [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!]$, where $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times R_n}$ is computed from the truncated SVD of $\mathbf{Y}_{(n)}$ with $\mathbf{U}^{(n)T} \mathbf{U}^{(n)} = \mathbf{I}$; then the CPD is applied to the much smaller tensor $\mathcal{G} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ such that
$$\mathcal{G} \approx [\![\mathbf{V}^{(1)}, \mathbf{V}^{(2)}, \ldots, \mathbf{V}^{(N)}]\!], \quad (10)$$
thereby leading to the CPD of $\mathcal{Y} \approx [\![\mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \ldots, \mathbf{A}^{(N)}]\!]$ with $\mathbf{A}^{(n)} = \mathbf{U}^{(n)} \mathbf{V}^{(n)}$. For large-scale problems, very often $I_n \gg R_n$, and the CPD in (10) can be solved efficiently even on a standard desktop computer. The Tucker+CP routine, which was originally proposed in [22], is very efficient for big tensors with low multilinear rank. However, this framework is not directly applicable if it is necessary to impose additional constraints, such as nonnegativity and sparsity, on the factor matrices. Very recently, Cohen, Farias, and Comon proposed a novel efficient method to deal with the nonnegativity of factor matrices [25]. Our objective in this paper is to develop a more general and flexible scheme which not only achieves similar efficiency and scalability but also allows any additional or optional constraints to be imposed on the factor matrices very easily.
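The following MATLAB sketch (illustrative sizes; a plain unconstrained ALS is used for the core CPD, so it should be read as a toy version of the Tucker+CP route rather than the authors' implementation) compresses a 3rd-order tensor by truncated SVDs of its unfoldings, runs a few ALS sweeps on the small core as in (10), and then expands the factors via $\mathbf{A}^{(n)} = \mathbf{U}^{(n)}\mathbf{V}^{(n)}$.

```matlab
I = [60 70 80];  R = 5;
Y = randn(I);                                       % stand-in for the data tensor

% Step 1: Tucker compression (truncated SVD of each mode-n unfolding)
U = cell(1, 3);
for n = 1:3
    perm = [n, setdiff(1:3, n)];
    Yn = reshape(permute(Y, perm), I(n), []);
    [Un, ~, ~] = svd(Yn, 'econ');
    U{n} = Un(:, 1:R);
end
G = Y;
for n = 1:3                                         % G = Y x1 U1' x2 U2' x3 U3'
    sz = size(G);  perm = [n, setdiff(1:3, n)];
    Gn = U{n}' * reshape(permute(G, perm), sz(n), []);
    sz(n) = R;
    G = ipermute(reshape(Gn, sz(perm)), perm);
end

% Step 2: a few ALS sweeps of the CPD of the small core G, as in (10)
V = {randn(R), randn(R), randn(R)};
for it = 1:50
    for n = 1:3
        p = setdiff(1:3, n);
        B = zeros(R*R, R);                          % Khatri-Rao product of V{p}
        for r = 1:R, B(:, r) = kron(V{p(2)}(:, r), V{p(1)}(:, r)); end
        Gn = reshape(permute(G, [n, p]), R, []);
        V{n} = (Gn * B) / (B' * B);                 % ALS update (6) on the core
    end
end

% Step 3: expand back to the original space
A = cellfun(@(Un, Vn) Un * Vn, U, V, 'UniformOutput', false);
```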

B. CPD of Big Tensors Based on Their Tucker Compression

The Tucker decomposition is suitable for pre-processing and compressing data because it is significantly faster than the CPD and it has guaranteed upper error bounds [26]. In this subsection we consider a slightly modified version of the Tucker+CP approach to overcome its limitations. In the first step the Tucker compression is performed such that $\mathcal{Y} \approx \widehat{\mathcal{Y}} = [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!]$, as in the Tucker+CP approach. In the second step, however, we perform the CPD on the whole tensor $\widehat{\mathcal{Y}}$ instead of the core tensor $\mathcal{G}$, by minimizing the following cost function:
$$J(\mathbf{A}^{(n)}) = \frac{1}{2} \left\| \widehat{\mathcal{Y}} - [\![\mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \ldots, \mathbf{A}^{(N)}]\!] \right\|_F^2, \quad \text{s.t. } \mathbf{A}^{(n)} \in \mathcal{C}_n, \quad (11)$$
where $\mathcal{C}_n$ denotes some constraints imposed on $\mathbf{A}^{(n)}$. With (11), the term $\mathbf{Y}_{(n)} \mathbf{B}^{(n)}$ in the gradient $\partial J / \partial \mathbf{A}^{(n)}$ can be efficiently computed from
$$\mathbf{Y}_{(n)} \mathbf{B}^{(n)} \approx \mathbf{U}^{(n)} \mathbf{G}_{(n)} \left[ \bigodot_{p \neq n} \mathbf{V}^{(p)} \right], \quad (12)$$
where
$$\mathbf{V}^{(n)} = \mathbf{U}^{(n)T} \mathbf{A}^{(n)} \in \mathbb{R}^{R_n \times R}. \quad (13)$$

Algorithm 1: The Flexible Fast Nonnegative CP (FFCP) Decomposition Algorithm

Require: $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, the output rank $R$.
1: Perform a Tucker decomposition to obtain $\mathcal{Y} \approx [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!]$. Initialize $\mathbf{A}^{(n)}$ and $\mathbf{V}^{(n)} = \mathbf{U}^{(n)T} \mathbf{A}^{(n)}$.
2: repeat
3:   for $n \in \mathbb{N}_N$ do
4:     Efficiently compute $\mathbf{H}^{(n)} = \mathbf{G}_{(n)} \left[ \bigodot_{p \neq n} \mathbf{V}^{(p)} \right]$.
5:     Compute $\mathbf{Y}_{(n)} \mathbf{B}^{(n)} \approx \mathbf{U}^{(n)} \mathbf{H}^{(n)}$ and $\mathbf{B}^{(n)T} \mathbf{B}^{(n)} = \circledast_{p \neq n} (\mathbf{A}^{(p)T} \mathbf{A}^{(p)})$.
6:     Update $\mathbf{A}^{(n)}$ by using (6), or (7)/(8) to impose nonnegativity constraints, and/or (16) to improve the sparsity of the components.
7:     $\mathbf{V}^{(n)} \leftarrow \mathbf{U}^{(n)T} \mathbf{A}^{(n)}$.
8:   end for
9: until a stopping criterion is satisfied
10: return $\mathcal{G} \leftarrow \mathcal{Y}$, $\mathbf{A}^{(n)}$, $n \in \mathbb{N}_N$.

Note that $\mathbf{V}^{(n)}$ is quite small since $R_n \le R$, and it can be computed by using distributed systems or the matrix partitioning method when $I_n$ is really huge. As such, the ALS update rule (6) can be replaced by
$$\mathbf{A}^{(n)} \leftarrow \mathbf{U}^{(n)} \mathbf{G}_{(n)} \left[ \bigodot_{p \neq n} \mathbf{V}^{(p)} \right] \left( \circledast_{p \neq n} \mathbf{A}^{(p)T} \mathbf{A}^{(p)} \right)^{-1}. \quad (14)$$
Because the approximation $\widehat{\mathcal{Y}}$ may contain negative entries, (7) can be replaced by
$$\mathbf{A}^{(n)} \leftarrow \mathbf{A}^{(n)} \circledast \left[ P_+ \left( \mathbf{U}^{(n)} \mathbf{G}_{(n)} \left( \bigodot_{p \neq n} \mathbf{V}^{(p)} \right) \right) \right] \oslash \left[ \mathbf{A}^{(n)} \left( \circledast_{p \neq n} \mathbf{A}^{(p)T} \mathbf{A}^{(p)} \right) \right]. \quad (15)$$
As a result, the nonnegativity of the factor matrices can be easily incorporated in the proposed scheme. If we hope to extract sparse latent components, we consider the $\ell_1$-norm regularized CPD proposed in [27] and minimize $J(\mathbf{A}^{(n)}) + c \|\mathbf{A}^{(n)}\|_1$, thereby leading to
$$\mathbf{A}^{(n)} \leftarrow \mathcal{S}_c(\mathbf{A}^{(n)}), \quad (16)$$
where the sparsity operator $\mathcal{S}_c(\cdot)$ is defined element-wise as
$$\mathcal{S}_c(x) = \begin{cases} x - c & \text{if } x > c, \\ 0 & \text{if } -c \le x \le c, \\ x + c & \text{if } x < -c. \end{cases}$$
In summary, most first-order constrained/unconstrained CPD algorithms can benefit from the proposed method. Approximately, the computational complexity is reduced from $O(R_n \prod_{p \in \mathbb{N}_N} I_p)$ in (5) to only about $O(I_n R_n^2 + R_n \prod_{p \in \mathbb{N}_N} R_p)$, due to using a compact representation of $\mathcal{Y}$.
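The essence of the FFCP update can be summarized in a few lines of MATLAB. The sketch below (hypothetical sizes; nonnegative data assumed so that the multiplicative rule (15) applies) performs one sweep over the factor matrices; note that only the small core $\mathcal{G}$ and thin matrices are touched, never the full tensor.

```matlab
I = [1000 1200 1500];  Rt = [20 20 20];  R = 10;      % illustrative sizes
G = randn(Rt);                                        % Tucker core of the data
U = {orth(randn(I(1),Rt(1))), orth(randn(I(2),Rt(2))), orth(randn(I(3),Rt(3)))};
A = {rand(I(1),R), rand(I(2),R), rand(I(3),R)};       % current CP factors

for n = 1:3
    V = cellfun(@(Up, Ap) Up' * Ap, U, A, 'UniformOutput', false);   % (13)
    p = setdiff(1:3, n);
    B = zeros(Rt(p(1)) * Rt(p(2)), R);                % Khatri-Rao of the V{p}, small
    for r = 1:R, B(:, r) = kron(V{p(2)}(:, r), V{p(1)}(:, r)); end
    Gn = reshape(permute(G, [n, p]), Rt(n), []);
    H  = Gn * B;                                      % H^(n) = G_(n)*(kr of V), (12)
    YB = U{n} * H;                                    % ~ Y_(n)*B^(n), an In x R matrix
    T  = (A{p(1)}' * A{p(1)}) .* (A{p(2)}' * A{p(2)});% B^(n)'*B^(n) via (9)
    A{n} = A{n} .* (max(YB, 0) ./ max(A{n} * T, eps));% nonnegative update (15)
end
```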

[Fig. 3: Fit values averaged over 50 Monte-Carlo runs obtained by three different ways to perform CPD on a synthetic tensor $\mathcal{Y} \in \mathbb{R}^{50 \times 50 \times 50 \times 50}$ with $R = 10$: the Direct CP, which applies CPD to $\mathcal{Y}$ directly; the FFCP, applied to the Tucker format $\mathcal{Y} \approx [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!]$; and the Tucker+CP, which decomposes the core tensor $\mathcal{G}$ first and then constructs the CP approximation. The data tensor $\mathcal{Y}$ was generated by using (2), and different levels of Gaussian noise (SNR ranging from -20 dB to 20 dB) were added.]

It is worth noticing that if $\widehat{\mathcal{Y}} = [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!]$ with $\mathbf{U}^{(n)T} \mathbf{U}^{(n)} = \mathbf{I}$, minimizing (11) is equivalent to solving
$$\min_{\mathbf{V}^{(n)}} J(\mathbf{V}^{(n)}) = \frac{1}{2} \left\| \mathcal{G} - [\![\mathbf{V}^{(1)}, \mathbf{V}^{(2)}, \ldots, \mathbf{V}^{(N)}]\!] \right\|_F^2, \quad \text{s.t. } \mathbf{A}^{(n)} = \mathbf{U}^{(n)} \mathbf{V}^{(n)} \in \mathcal{C}_n, \quad (17)$$
where $\mathbf{V}^{(n)}$ is defined in (13) and $\mathcal{C}_n$ denotes some constraints on $\mathbf{A}^{(n)}$. Indeed, (17) has been considered in [25] specially for NTF. Although that method essentially has the same computational complexity as the FFCP method, the main advantage of (11) is that any off-the-shelf (constrained) CPD method can be directly applied, with efficient computation of the gradient terms (9) and (12), without the need to design a new algorithm to solve (17).

In summary, not only does (11) benefit from the same low computational complexity as the Tucker+CP method, it is also quite flexible because additional constraints on $\mathbf{A}^{(n)}$, such as nonnegativity and/or sparsity, can be easily incorporated. For this reason the new method is called the flexible fast CP (FFCP) decomposition algorithm, and the corresponding pseudo-code is briefly listed in Algorithm 1.¹ In Fig. 3 we compare the Tucker+CP, the FFCP (solving (11)), and the standard CPD (solving (3)) algorithms. The data tensor $\mathcal{Y} \in \mathbb{R}^{50 \times 50 \times 50 \times 50}$ was generated by using (2) with $R = 10$ and contaminated by different levels of noise (in SNR) ranging from -20 dB to 20 dB. In the figure we show their Fit values averaged over 50 Monte-Carlo runs, where $\mathrm{Fit} = 1 - \|\mathcal{Y} - \widehat{\mathcal{Y}}\|_F / \|\mathcal{Y}\|_F$. It turns out that the Tucker compression generally will not degrade the accuracy; on the contrary, it may improve the stability of CPD algorithms.

¹ Although in Algorithm 1 only the LS update and the multiplicative update are listed, any update rule based on the gradient (e.g., HALS; see [19]) can be incorporated into the FFCP scheme straightforwardly.

Although the matrices $\mathbf{V}^{(n)}$ in (12) are generally small, the construction of the tall $(\prod_{p \neq n} R_p)$-by-$R$ matrix $\bigodot_{p \neq n} \mathbf{V}^{(p)}$ and the direct computation of $\mathbf{H}^{(n)} = \mathbf{G}_{(n)} \left[ \bigodot_{p \neq n} \mathbf{V}^{(p)} \right]$ can be avoided by applying either one of the following methods:
• Notice that the $r$th column of $\mathbf{H}^{(n)}$ can be computed as $\mathbf{h}^{(n)}_r = \mathcal{G} \times_{p \neq n} \mathbf{v}^{(p)T}_r$, which allows $\mathbf{H}^{(n)}$ to be computed in parallel by using the shared data tensor $\mathcal{G}$, but without constructing $\bigodot_{p \neq n} \mathbf{V}^{(p)}$ explicitly.
• The term $\mathbf{G}_{(n)} \left[ \bigodot_{p \neq n} \mathbf{V}^{(p)} \right]$ can be computed in parallel by using the technique adopted in GigaTensor [11]; here $\mathbf{G}_{(n)}$ is generally not sparse but relatively small.
Moreover, $\mathbf{H}^{(n)} = \mathbf{G}_{(n)} \left[ \bigodot_{p \neq n} \mathbf{V}^{(p)} \right]$ can be viewed as the gradient term when we perform the CPD of $\mathcal{G} \approx [\![\mathbf{V}^{(1)}, \mathbf{V}^{(2)}, \ldots, \mathbf{V}^{(N)}]\!]$ (see (5) and (17)). For this reason, most techniques which have been proposed to speed up standard CP decompositions can be applied to compute $\mathbf{H}^{(n)}$ straightforwardly.

For very large-scale data where the $I_n$ are extremely huge, the products $\mathbf{A}^{(n)T} \mathbf{A}^{(n)}$ and $\mathbf{U}^{(n)T} \mathbf{A}^{(n)}$ can be performed on distributed systems, see, e.g., [28], or simply computed by using the matrix partitioning technique on a standard computer.

In summary, even with very limited computational resources, the FFCP algorithm can perform the (nonnegative) CPD of very large-scale tensors, provided that their low-rank Tucker representation can be obtained. Notice that the Tucker approximation of a given tensor plays a fundamental role in the proposed FFCP method, as well as in the Tucker+CP approach, the method proposed in [25], and the PARACOMP [35], and the HOSVD often serves as the workhorse to perform Tucker compression. However, the HOSVD involves the eigenvalue decomposition of the huge matrices $\mathbf{Y}_{(n)}$, which makes it unsuitable for large-scale problems [35]. In the next section we focus on how to overcome this limitation.

IV. DISTRIBUTED RANDOMIZED TUCKER DECOMPOSITIONS (RANDTUCKER)

A. Randomized Tucker Decomposition

In the Tucker decomposition of (1), the factor matrices $\mathbf{U}^{(n)}$ are estimated by minimizing the following LS problems:
$$J(\mathbf{U}^{(n)}) = \left\| \mathbf{Y}_{(n)} - \mathbf{U}^{(n)} \mathbf{B}^{(n)T} \right\|_F, \quad (18)$$
where $\mathbf{B}^{(n)} = \left[ \bigotimes_{k \neq n} \mathbf{U}^{(k)} \right] \mathbf{G}_{(n)}^{T}$. In the HOSVD [26] the factor matrix $\mathbf{U}^{(n)}$ is estimated as the leading singular vectors of $\mathbf{Y}_{(n)}$. However, performing the SVD is prohibitive for very large-scale problems. Recently, a set of probabilistic algorithms have been developed to discover the low-rank structure of huge matrices [29]. Here we consider the following randomized range finder (RRF) algorithm (Algorithm 4.1 of [29]) to estimate the basis of $\mathbf{U}^{(n)}$:
1) Draw a random Gaussian matrix $\mathbf{\Omega} \in \mathbb{R}^{\prod_{p \neq n} I_p \times \bar{R}_n}$.
2) Obtain the $I_n \times \bar{R}_n$ matrix $\mathbf{Z}^{(n)} = \mathbf{Y}_{(n)} \mathbf{\Omega}^{(n)}$.
3) Compute $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times \bar{R}_n}$ as an orthonormal basis of $\mathbf{Z}^{(n)} \in \mathbb{R}^{I_n \times \bar{R}_n}$ by using, e.g., the QR decomposition.
In the above, $\bar{R}_n = R_n + p$, where $p$ is referred to as the oversampling parameter; $p = 5$ or $p = 10$ was suggested in [29]. Based on this procedure we propose a randomized Tucker decomposition (RandTucker) algorithm to perform Tucker decompositions, see Algorithm 2. Compared with the HOSVD algorithm, the proposed RandTucker algorithm is highly parallelizable and suitable for distributed computation.
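A serial MATLAB sketch of the randomized range finder for one mode is given below (illustrative sizes; the data tensor is simply generated at random here):

```matlab
I = [200 200 200];  n = 1;  Rn = 10;  p = 10;       % oversampling parameter p
Y = randn(I);                                       % stand-in data tensor

perm  = [n, setdiff(1:3, n)];
Yn    = reshape(permute(Y, perm), I(n), []);        % mode-n unfolding
Omega = randn(size(Yn, 2), Rn + p);                 % random Gaussian test matrix
Z     = Yn * Omega;                                 % In x (Rn + p) sketch
[Un, ~] = qr(Z, 0);                                 % orthonormal basis of range(Z)
```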

Several randomized algorithms have already been proposed for Tucker decompositions, such as the methods using fiber sampling [14], [30]. Compared with them, the proposed RandTucker method has an important advantage: it is able to obtain a nearly optimal approximation even for very noisy data by using only one additional ALS-like iteration, which is referred to as RandTucker2i; see Algorithm 3 and Fig. 4 for a performance comparison. In this figure, a 200-by-200-by-200 tensor $\mathcal{Y}$ was generated in each run by using (1) with $R_n = 10$, and then different levels of Gaussian noise were added to $\mathcal{Y}$ to generate the observation tensor. We plot the Fit values (and the standard deviations) obtained by the RandTucker (with $p = 10$), the RandTucker2i ($p = 10$), and the HOSVD, averaged over 50 Monte-Carlo runs. From the figure, the RandTucker achieved slightly worse Fit values, especially when the noise was heavy, while the RandTucker2i achieved almost the same Fit values as the HOSVD. As the RandTucker and the RandTucker2i do not involve the singular value decomposition of big matrices and are highly parallelizable, they are very promising for the Tucker decompositions of big tensors.

B. Distributed Randomized Tucker Decomposition

In the above RandTucker and RandTucker2i algorithms we have assumed that all the data can be processed in physical memory directly. For big data this could be unrealistic; a common scenario is that only part of the data can be loaded in memory. In this subsection we discuss how to solve this problem by using a computer cluster (in this paper we used the MATLAB Distributed Computing Server²). Here we consider a cluster which is composed of:
• a number of workers with identity labels, each of which is a computational engine that executes computational tasks simultaneously; in the meanwhile, they may communicate with each other and with the client as well;
• a client, which is used to schedule tasks to the workers, gather results from the workers, and return the final results.

² See www.mathworks.com/help/pdf_doc/mdce/mdce.pdf

Algorithm 2: The Randomized Tucker Decomposition (RandTucker) Algorithm

Require: $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and the multilinear rank $(R_1, R_2, \ldots, R_N)$.
1: for $n = 1, 2, \ldots, N$ do
2:   Compute $\mathbf{Z}^{(n)} = \mathbf{Y}_{(n)} \mathbf{\Omega}^{(n)}$, where $\mathbf{\Omega}^{(n)}$ is a $(\prod_{k \neq n} I_k)$-by-$R_n$ random Gaussian matrix.
3:   Compute $\mathbf{U}^{(n)}$ as an orthonormal basis of $\mathbf{Z}^{(n)}$ by using, e.g., the QR decomposition.
4:   $\mathcal{Y} \leftarrow \mathcal{Y} \times_n \mathbf{U}^{(n)T}$; $I_n \leftarrow R_n$.
5: end for
6: $\mathcal{G} \leftarrow \mathcal{Y}$
7: return $\mathcal{Y} \approx [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!]$

Algorithm 3: The RandTucker Algorithm with 2 Iterations Only (RandTucker2i)

Require: $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and the multilinear rank $(R_1, R_2, \ldots, R_N)$.
1: Initialize $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times R_n}$ as random Gaussian matrices.
2: Repeat lines 3-7 twice:
3: for $n = 1, 2, \ldots, N$ do
4:   $\mathcal{X} = \mathcal{Y} \times_{p \neq n} \mathbf{U}^{(p)T}$
5:   Compute $\mathbf{Z}^{(n)} = \mathbf{X}_{(n)} \mathbf{\Omega}^{(n)}$, where $\mathbf{\Omega}^{(n)}$ is a $(\prod_{k \neq n} R_k)$-by-$R_n$ random Gaussian matrix.
6:   Compute $\mathbf{U}^{(n)}$ as an orthonormal basis of $\mathbf{Z}^{(n)}$ by using, e.g., the QR decomposition.
7: end for
8: $\mathcal{G} \leftarrow \mathcal{Y} \times_1 \mathbf{U}^{(1)T} \times_2 \mathbf{U}^{(2)T} \cdots \times_N \mathbf{U}^{(N)T}$
9: return $\mathcal{Y} \approx [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}]\!]$

[Fig. 4: Performance comparison between the RandTucker, the RandTucker2i, and the HOSVD algorithms in terms of the mean value of Fit and the standard deviation over 50 Monte-Carlo runs, for noise levels (SNR) from 0 to 30 dB.]


[Fig. 5: Demonstration of the small data tensors and the virtual computing resource tensor $\mathcal{P}$ in a cluster. (a) An example $\mathcal{Y} \in \mathbb{R}^{2I \times 4J \times 3K}$ is partitioned into smaller blocks $\mathcal{Y}^{(\mathbf{s})}$ of the same size $I \times J \times K$, thereby leading to the $2 \times 4 \times 3$ computing resource tensor $\mathcal{P}$ identifying a total of 24 workers. (b) An entry $p(\mathbf{s}) \in \{1, 2, \ldots, \prod_n S_n\}$ of the computing resource tensor $\mathcal{P}$ indicates that Worker $p(\mathbf{s})$ processes the sub-tensor $\mathcal{Y}^{(\mathbf{s})}$, where $\mathbf{s} = [s_1, s_2, \ldots, s_N]^T = \mathrm{ind2sub}(p(\mathbf{s}))$. Hence in (a), $p(1, 3, 2) = 13$ means that Worker 13 processes the sub-tensor $\mathcal{Y}^{(1,3,2)}$, as $[1, 3, 2]^T = \mathrm{ind2sub}(13)$ and $13 = p_{(2)}(3, 3)$.]

Given a big $N$th-order tensor $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, we partition each dimension $I_n$ into a total of $S_n$ disjoint subsets, each of which consists of $L_{s_n}$ indices, with $\sum_{s_n=1}^{S_n} L_{s_n} = I_n$. As such, the original big tensor is partitioned into a total of $W = S_1 S_2 \cdots S_N$ sub-tensors, and each sub-tensor can be denoted by $\mathcal{Y}^{(s_1, s_2, \ldots, s_N)}$ with size $L_{s_1} \times L_{s_2} \times \cdots \times L_{s_N}$, where $s_n \in \mathbb{N}_{S_n}$, $n = 1, 2, \ldots, N$. At the same time, suppose we have a computer cluster with $W$ workers (computational engines).³ We imagine that the $W$ workers are virtually spatially permuted to form an $S_1 \times S_2 \times \cdots \times S_N$ tensor such that each worker is associated with exactly one sub-tensor; see Fig. 5(a). In other words, we have two types of tensors:

³ We have assumed that each sub-tensor can be efficiently handled by the associated worker; otherwise we increase the number of workers and decrease the size of each sub-tensor accordingly.

• The small data tensors obtained by partitioning $\mathcal{Y}$:
$$\mathcal{Y}^{(\mathbf{s})} = \mathcal{Y}^{(s_1, s_2, \ldots, s_N)} \in \mathbb{R}^{L_{s_1} \times L_{s_2} \times \cdots \times L_{s_N}}, \quad (19)$$
where $\mathbf{s} = [s_1, s_2, \ldots, s_N]^T$ is used to specify the $(s_1, s_2, \ldots, s_N)$th block of $\mathcal{Y}$, $s_n \in \mathbb{N}_{S_n}$, $n = 1, 2, \ldots, N$; see Fig. 5(a) for an illustration.
• The computing resource tensor $\mathcal{P} \in \mathbb{N}_+^{S_1 \times S_2 \times \cdots \times S_N}$, formed by reshaping $[1, 2, \ldots, W]$, which identifies the $W = S_1 S_2 \cdots S_N$ workers and the associated data tensors. The entries of $\mathcal{P}$ have a one-to-one correspondence with the sub-tensors $\mathcal{Y}^{(\mathbf{s})}$, meaning that Worker $p(\mathbf{s})$ processes the sub-tensor $\mathcal{Y}^{(\mathbf{s})}$; see Fig. 5(b).

It can be seen that the RandTucker and the RandTucker2i mainly use four basic operations: 1) mode-$n$ unfolding of a tensor; 2) a matrix-matrix product to obtain $\mathbf{Y}_{(n)} \mathbf{\Omega}^{(n)}$; 3) generating a Gaussian random matrix $\mathbf{\Omega}$; 4) estimating an orthonormal basis of the range of $\mathbf{Y}_{(n)} \mathbf{\Omega}^{(n)}$. Below we discuss how to perform these four operations in a distributed computing system.

Mode-$n$ unfolding of the tensor $\mathcal{Y}$. We unfold the computing resource tensor $\mathcal{P} \in \mathbb{N}_+^{S_1 \times S_2 \times \cdots \times S_N}$ along mode $n$ to obtain $\mathbf{P}_{(n)} \in \mathbb{N}_+^{S_n \times K}$, $K = \prod_{j \neq n} S_j$. Let $p_{(n)}(s_n, k)$ be the $(s_n, k)$th entry of $\mathbf{P}_{(n)}$, and define
$$\mathbf{s} = \boldsymbol{\pi}_{s_n k} = \mathrm{ind2sub}(p_{(n)}(s_n, k)), \quad s_n = 1, 2, \ldots, S_n, \;\; k = 1, 2, \ldots, \prod_{j \neq n} S_j, \quad (20)$$
where the function $\mathrm{ind2sub}(\cdot)$ converts an entry of $\mathbf{P}_{(n)}$, i.e., the worker identifier $p_{(n)}(s_n, k)$, into an $N$-by-1 vector of subscript values determining the position of $p_{(n)}(s_n, k)$ in the tensor $\mathcal{P}$, which allows us to locate the sub-tensor associated with this worker. In other words, (20) means that the tensor $\mathcal{Y}^{\boldsymbol{\pi}_{s_n k}}$ is processed by Worker $p_{(n)}(s_n, k)$. Then it can be seen that

$$\mathbf{Y}_{(n)} = \begin{bmatrix} \mathbf{Y}^{\boldsymbol{\pi}_{11}}_{(n)} & \mathbf{Y}^{\boldsymbol{\pi}_{12}}_{(n)} & \cdots & \mathbf{Y}^{\boldsymbol{\pi}_{1K}}_{(n)} \\ \mathbf{Y}^{\boldsymbol{\pi}_{21}}_{(n)} & \mathbf{Y}^{\boldsymbol{\pi}_{22}}_{(n)} & \cdots & \mathbf{Y}^{\boldsymbol{\pi}_{2K}}_{(n)} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{Y}^{\boldsymbol{\pi}_{S_n 1}}_{(n)} & \mathbf{Y}^{\boldsymbol{\pi}_{S_n 2}}_{(n)} & \cdots & \mathbf{Y}^{\boldsymbol{\pi}_{S_n K}}_{(n)} \end{bmatrix} \in \mathbb{R}^{I_n \times \prod_{j \neq n} I_j}, \quad (21)$$
where $\mathbf{Y}^{\boldsymbol{\pi}_{s_n k}}_{(n)}$ is the mode-$n$ matricization of $\mathcal{Y}^{\boldsymbol{\pi}_{s_n k}}$ processed by Worker $p_{(n)}(s_n, k)$, $s_n \in \mathbb{N}_{S_n}$ and $k \in \mathbb{N}_K$.

In summary, to obtain $\mathbf{Y}_{(n)}$ we perform the mode-$n$ unfolding of all the $\mathcal{Y}^{(\mathbf{s})}$ simultaneously; then the $(s_n, k)$th sub-matrix of $\mathbf{Y}_{(n)}$ is given by Worker $p_{(n)}(s_n, k)$, as shown in (21).

Matrix-matrix product. A large-scale matrix-matrix product can be performed by using any existing technique. As the mode-$n$ unfolding matrix $\mathbf{Y}_{(n)}$ has already been partitioned and distributed here, we apply the matrix partitioning technique: we compute the product $\mathbf{Y}_{(n)} \mathbf{\Omega}^{(n)}$ by using
$$\begin{bmatrix} \mathbf{Z}^{(n)}_1 \\ \mathbf{Z}^{(n)}_2 \\ \vdots \\ \mathbf{Z}^{(n)}_{S_n} \end{bmatrix} = \mathbf{Y}_{(n)} \mathbf{\Omega}^{(n)} = \mathbf{Y}_{(n)} \begin{bmatrix} \mathbf{\Omega}^{(n)}_1 \\ \mathbf{\Omega}^{(n)}_2 \\ \vdots \\ \mathbf{\Omega}^{(n)}_K \end{bmatrix} \in \mathbb{R}^{I_n \times R_n}, \quad (22)$$
where $\mathbf{Y}_{(n)}$ is given by (21), and $\mathbf{\Omega}^{(n)}_k \in \mathbb{R}^{\prod_{j \neq n} L_{S_j} \times R_n}$ are the sub-matrices obtained by partitioning the random Gaussian matrix $\mathbf{\Omega}^{(n)}$ into $K$ parts.
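The block product (22) can be checked with a small local simulation. The MATLAB sketch below (no actual cluster; the "workers" are just loop iterations, and all sizes are illustrative) never forms the full unfolding for the multiplication: each block of the tensor multiplies its own row block of Omega, and the partial results are accumulated.

```matlab
I = [120 80 60];  S = [2 4 3];  L = I ./ S;        % tensor size, partition grid
R1 = 8;
Y = randn(I);
Omega = randn(I(2)*I(3), R1);                      % test matrix for mode 1

Z = zeros(I(1), R1);                               % will equal Y_(1)*Omega
for s1 = 1:S(1)
  rows = (s1-1)*L(1) + (1:L(1));
  for s2 = 1:S(2)
    for s3 = 1:S(3)
      % sub-tensor held by "worker" (s1,s2,s3)
      Ysub = Y(rows, (s2-1)*L(2)+(1:L(2)), (s3-1)*L(3)+(1:L(3)));
      % matching rows of Omega (mode-2 index varies fastest, as in the unfolding)
      mask = false(I(2), I(3));
      mask((s2-1)*L(2)+(1:L(2)), (s3-1)*L(3)+(1:L(3))) = true;
      Z(rows, :) = Z(rows, :) + reshape(Ysub, L(1), []) * Omega(mask(:), :);
    end
  end
end
disp(norm(Z - reshape(Y, I(1), []) * Omega, 'fro'))   % ~0: matches the full product
```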

[Fig. 6: (a) Time consumption of the FFCP, the CP-ALS, and the PARACOMP versus the scale of the input data ($I^N$ from about $10^8$ to $10^9$), on condition that the data tensor of size $I \times I \times I$ can be loaded into memory; for the FFCP, the RandTucker2i was used to compress the data with $p = 15$. (b) Scalability of the FFCP algorithm provided that the given big tensor is in its Tucker format ($I^N$ from about $10^9$ to $10^{14}$); the time consumed by the FFCP increases nearly linearly with respect to I.]

From (22) we have $\mathbf{Z}^{(n)}_{s_n} = \sum_{k=1}^{K} \mathbf{Y}^{\boldsymbol{\pi}_{s_n k}}_{(n)} \mathbf{\Omega}^{(n)}_k$, and all the products $\mathbf{Y}^{\boldsymbol{\pi}_{s_n k}}_{(n)} \mathbf{\Omega}^{(n)}_k$ are computed on the Workers $p(\boldsymbol{\pi}_{s_n k})$ simultaneously and then broadcast to Worker $p_{(n)}(s_n, k_0)$, typically with $k_0 = 1$. Finally, for each $s_n \in \mathbb{N}_{S_n}$, we sum the $\mathbf{Y}^{\boldsymbol{\pi}_{s_n k}}_{(n)} \mathbf{\Omega}^{(n)}_k$ up over $k$ on Worker $p_{(n)}(s_n, k_0)$ to yield $\mathbf{Z}^{(n)}_{s_n}$.

Generating the Gaussian random matrix $\mathbf{\Omega}^{(n)}$. Note that the workers identified by the $k$th column of $\mathbf{P}_{(n)}$, i.e., the workers $p_{(n)}(s_n, k)$, $s_n = 1, 2, \ldots, S_n$, share the same random Gaussian matrix $\mathbf{\Omega}^{(n)}_k$ (corresponding to the $k$th column of blocks of $\mathbf{Y}_{(n)}$ in (21)). Hence $\mathbf{\Omega}^{(n)}_k$ can be generated on one worker, say $p_{(n)}(1, k)$, and then broadcast to all the workers $p_{(n)}(s_n, k)$ with $s_n \neq 1$. However, this way suffers from heavy communication overhead. A more economical method is to broadcast the parameters (or the random number generators) that control the random number generation, and then the random matrices $\mathbf{\Omega}^{(n)}_k$ are generated on all workers by using the received random number generators; as such, for each $k$ the workers $p_{(n)}(s_n, k)$, $s_n \in \mathbb{N}_{S_n}$, will generate identical random Gaussian matrices $\mathbf{\Omega}^{(n)}_k$ simultaneously.
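The "broadcast the generator, not the matrix" idea is easy to emulate in MATLAB: workers that receive the same seed produce identical Gaussian blocks without any matrix being communicated. (The seed value and sizes below are, of course, only illustrative.)

```matlab
seed_for_block_k = 12345;            % the only quantity that is broadcast
sz = [5000, 20];                     % size of Omega_k on every worker

rng(seed_for_block_k);  Omega_on_worker_a = randn(sz);
rng(seed_for_block_k);  Omega_on_worker_b = randn(sz);   % executed elsewhere

isequal(Omega_on_worker_a, Omega_on_worker_b)            % true: identical blocks
```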

Estimating an orthonormal basis of the columns of $\mathbf{Y}_{(n)} \mathbf{\Omega}^{(n)}$. To this end, we perform the eigenvalue decomposition $\sum_{s_n=1}^{S_n} \mathbf{Z}^{(n)T}_{s_n} \mathbf{Z}^{(n)}_{s_n} = \mathbf{U} \mathbf{\Sigma} \mathbf{U}^T$. Then the mode-$n$ factor matrix $\mathbf{U}^{(n)}$ with orthonormal columns can be computed from
$$\mathbf{U}^{(n)}_{s_n} = \mathbf{Z}^{(n)}_{s_n} \mathbf{U} \mathbf{\Sigma}^{-1/2} \quad (23)$$
in parallel on the workers $p_{(n)}(s_n, 1)$, where $\mathbf{U}^{(n)}_{s_n}$ is the $s_n$th sub-matrix of $\mathbf{U}^{(n)}$, $s_n \in \mathbb{N}_{S_n}$.

After $\mathbf{U}^{(n)}$ has been estimated, $\mathcal{Y}$ is updated as $\mathcal{Y} \leftarrow \mathcal{Y} \times_n \mathbf{U}^{(n)T}$, which can be realized using the above mode-$n$ tensor unfolding and matrix-matrix product. After the update, the $n$th dimension of $\mathcal{Y}^{(\mathbf{s})}$ is reduced to $R_n$ from $L_{S_n}$. We repeat the above procedure for $n = 1, 2, \ldots, N$. As the last step, the core tensor is computed as $\mathcal{G} = \sum_{w=1}^{W} \mathcal{Y}^{(w)}$, i.e., the sum of the sub-tensors distributed on all workers.
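The orthogonalization step (23) only requires the small $R_n \times R_n$ Gram matrix to be gathered; each row block of $\mathbf{Z}^{(n)}$ can then be orthogonalized locally. A local MATLAB sketch (the row blocks stand in for the blocks held by different workers):

```matlab
In = 6000;  Rn = 12;  Sn = 3;  Ln = In / Sn;
Z  = randn(In, Rn);                                  % Z^(n) = Y_(n)*Omega^(n)
Zb = mat2cell(Z, Ln * ones(Sn, 1), Rn);              % row blocks on the workers

Gram = zeros(Rn);
for s = 1:Sn, Gram = Gram + Zb{s}' * Zb{s}; end      % gathered small matrix
[V, D] = eig((Gram + Gram') / 2);                    % eigendecomposition, as in (23)
Ub = cellfun(@(Zs) Zs * V * diag(1 ./ sqrt(diag(D))), Zb, 'UniformOutput', false);
Un = vertcat(Ub{:});
disp(norm(Un' * Un - eye(Rn), 'fro'))                % ~0: columns are orthonormal
```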

C. Nonnegative Tucker Decomposition

In our recent papers [19], [31] we also showed how to efficiently perform nonnegative Tucker decomposition based on a Tucker approximation. The idea of that method is similar to the FFCP method: after the original high-dimensional raw data has been replaced by its Tucker approximation with low multilinear rank, the computational complexity of the gradients with respect to the nonnegative core tensor and factor matrices can be significantly reduced in terms of both time and space. Readers are referred to [19], [31] for more details. In summary, Tucker compression can play a fundamental role in achieving highly efficient tensor decompositions, as illustrated in Fig. 1. Based on the proposed randomized Tucker decomposition methods, we can easily perform various types of constrained/unconstrained tensor decompositions for very large-scale tensor data.

V. SIMULATIONS

A. Synthetic data

In this subsection we investigate the performance of the proposed FFCP algorithm by using synthetic data. All the experiments were performed in 64-bit MATLAB on a computer with an Intel Core i7 CPU (3.33 GHz) and 24 GB memory, running 64-bit Windows 7.


TABLE II: Comparison between the algorithms when they were applied to perform CPD of synthetic data. The noise was Gaussian with SNR = 10 dB in all cases.

                 N=3, I=200, R=10        N=3, I=500, R=20        N=6, I=20, R=5
Algorithm        Fit (%)      Time (s)   Fit (%)      Time (s)   Fit (%)      Time (s)
CP-ALS           82.1±10.2    395        80.5±6.8     640        89.6±11.2    188
FastCPALS        79.8±9.0     104        82.9±5.6     211        92.4±10.9     66
PARAFAC          82.5±7.9     782        80.5±6.1     1583       87.6±11.4    643
FFCP             81.9±8.0      35        82.8±5.6       40       93.4±8.6      50

TABLE III: Comparison between the algorithms when they were applied to perform nonnegative CPD of synthetic data. The noise was Gaussian with SNR = 10 dB in all cases.

                 N=3, I=200, R=10        N=3, I=500, R=20        N=6, I=20, R=5
Algorithm        Fit (%)      Time (s)   Fit (%)      Time (s)   Fit (%)      Time (s)
CP-NMU           97.3±1.2     664        97.6±0.8     16952      91.7±5.6     4265
CP-HALS          97.8±1.0     121        97.7±0.7     2281       95.1±4.1     1014
FFCP-MU          99.0±0.5      23        99.3±0.2       121      93.4±4.1       70
FFCP-HALS        99.1±0.1      61        99.5±0.1       299      94.3±3.5      118

Simulations on CPD. In each run a tensor $\mathcal{Y}^*$ was generated by using the model (2), where the entries of the latent factor matrices $\mathbf{A}^{(n)} \in \mathbb{R}^{I \times R}$ (i.e., the ground truth) were drawn from i.i.d. standard normal distributions, $n = 1, 2, \ldots, N$. Then independent Gaussian noise (with SNR = 10 dB) was added to $\mathcal{Y}^*$ to yield the noisy observation tensor $\mathcal{Y}$. We tested three different combinations of $N$, $I$, and $R$, as shown in TABLE II. The algorithms CP-ALS and PARAFAC, which are the standard algorithms included in the Tensor Toolbox 2.5 [32] and the N-way toolbox [33], respectively, were compared as the baseline. The FastCPALS algorithm [34] was also compared. For all the algorithms the maximum iteration number was set to 1000, and a stopping criterion $|\mathrm{Fit}(\mathcal{Y}, \widehat{\mathcal{Y}}_t) - \mathrm{Fit}(\mathcal{Y}, \widehat{\mathcal{Y}}_{t+1})| < 10^{-6}$ was applied, where $\widehat{\mathcal{Y}}_t$ is the estimate of $\mathcal{Y}^*$ obtained after the $t$th iteration. The mean values and standard deviations of $\mathrm{Fit}(\mathcal{Y}^*, \widehat{\mathcal{Y}})$ over 20 Monte-Carlo runs are detailed in TABLE II, along with the average time consumed by each algorithm. From the table we see that the FFCP algorithm is significantly faster than the other algorithms, without significant deterioration of performance, although it works on a compressed tensor instead of the original tensor. We also tested the PARACOMP [35] method, which was specially proposed for big tensor decompositions. It turned out that PARACOMP was sensitive to noise: in noise-free cases it often achieved a perfect Fit of 1, but when the noise was 10 dB it generally obtained a Fit lower than 0.3.
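The data-generation protocol and the Fit measure used throughout this section can be summarized by the short MATLAB sketch below (a toy reproduction with our own variable names, not the authors' experiment scripts):

```matlab
I = 200;  N = 3;  R = 10;  SNRdB = 10;
A = arrayfun(@(n) randn(I, R), 1:N, 'UniformOutput', false);   % ground-truth factors
Ytrue = zeros(I, I, I);
for r = 1:R
    Ytrue = Ytrue + reshape(kron(A{3}(:,r), kron(A{2}(:,r), A{1}(:,r))), I, I, I);
end
E = randn(size(Ytrue));
E = E * (norm(Ytrue(:)) / norm(E(:))) / 10^(SNRdB/20);          % prescribed SNR
Y = Ytrue + E;                                                  % noisy observation

Yhat = Ytrue;                                   % any approximation can be plugged in
Fit  = 1 - norm(Y(:) - Yhat(:)) / norm(Y(:));
```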

In Fig. 6 we demonstrate the scalability of the proposed FFCP method, where the RandTucker2i was used to compress the data with $p = 15$. From Fig. 6(a) we can see that the FFCP shows very promising scalability: its time consumption increases almost linearly with respect to the scale of the input data on condition that the rank is fixed; in contrast, the time consumed by the other algorithms increases nearly exponentially. Note that both the FFCP and the PARACOMP method [35] were designed mainly for big data analytics, and this comparison does not precisely reflect their actual time consumption difference in a distributed computational environment. However, it does show that the proposed method is more efficient than PARACOMP, which can also be inferred from the following facts: 1) PARACOMP compresses the observed big tensor at least $P$ times ($P \ge I/R$), whereas in our RandTucker2i-based FFCP we only need to do it twice (or even once when RandTucker is applied); 2) although both of them need to perform CPD on small compressed tensors, the PARACOMP method needs to do this $P$ times, whereas the FFCP needs to do it only once, as the final step. Particularly, in PARACOMP any failure of the CPDs of the $P$ sub-tensors will degrade the final results. Fig. 6(b) shows the scalability of the FFCP provided that the given big tensor is in its Tucker format. For tensors of such scale, if we use the original full tensor directly, all the algorithms run out of memory; but if we use its Tucker compression, the FFCP can perform constrained/unconstrained CPD efficiently, even on our desktop computer.

Simulations on Nonnegative CPD. In these experiments the data were generated as above, except that the latent factor matrices were generated from i.i.d. exponential distributions with the mean parameter $\mu = 10$. Then we randomly selected 10% of the entries of each factor to be zeros, to generate sparse factor matrices. As the PARAFAC algorithm was very slow, we only compared the CP-NMU algorithm, which is a standard nonnegative CPD algorithm based on the multiplicative update rule included in the Tensor Toolbox [32], and the CP-HALS algorithm [36], which is based on the hierarchical ALS (HALS) method. For a fair comparison, their corresponding FFCP implementations were compared. The results averaged over 20 Monte-Carlo runs are detailed in TABLE III. From the table, the FFCP implementations were not only significantly faster than their competitors but also often achieved higher Fit values. To the best of our knowledge, the FFCP is one of the most efficient algorithms for nonnegative CPD.

To demonstrate the high efficiency of the FFCP algorithm, in this simulation the entries of $\mathbf{A}^{(n)} \in \mathbb{R}^{10000 \times 10}$, $n = 1, 2, 3, 4$, were drawn from i.i.d. exponential distributions with the mean parameter $\mu = 10$. Again, we randomly selected 20% of the entries of $\mathbf{A}^{(n)}$ to be zeros to obtain sparse factor matrices. By using (2) and the $\mathbf{A}^{(n)}$ generated above, a big 4th-order tensor $\mathcal{Y} \in \mathbb{R}^{10000 \times 10000 \times 10000 \times 10000}$ with $R = 10$ was obtained, which contains a total of $10^{16}$ entries in its full format. It is certainly impossible to decompose such a huge tensor on a standard desktop computer. Suppose we have already obtained its Tucker approximation $\mathcal{Y}_T = [\![\mathcal{G}; \tilde{\mathbf{A}}^{(1)}, \tilde{\mathbf{A}}^{(2)}, \tilde{\mathbf{A}}^{(3)}, \tilde{\mathbf{A}}^{(4)}]\!] \approx \mathcal{Y}$. Due to the rotation ambiguity of Tucker decompositions, we only know that theoretically $\tilde{\mathbf{A}}^{(n)} = \mathbf{A}^{(n)} \mathbf{Q}^{(n)}$, where $\mathbf{Q}^{(n)}$ is not necessarily nonnegative. Generally, applying a CPD to the core tensor $\mathcal{G}$ directly cannot guarantee the nonnegativity of the factor matrices. Instead, we applied the FFCP algorithm to the Tucker-compressed tensor $\mathcal{Y}_T$ to extract nonnegative factor matrices. In a typical run the FFCP-HALS consumed only 774 seconds, and the recovered components had a signal-to-interference ratio⁴ (SIR) higher than 200 dB, which means they almost perfectly match the true components. Although this example only shows an ideal noise-free case, it illustrates how the FFCP algorithm can be applied to perform nonnegative CPD of huge tensors, as long as a good low-rank approximation of them can be obtained.

B. The COIL-100 Images Clustering Analysis

In this experiment we applied the proposed NTD algorithms to the clustering analysis of objects selected from the Columbia Object Image Library (COIL-100). The COIL-100 database consists of 7200 images of 100 objects, each of which has 72 images taken from different poses; hence these images naturally form a $128 \times 128 \times 3 \times 7200$ tensor $\mathcal{Y}$. The data tensor was decomposed by using the tensor decomposition algorithms, and then the two t-Distributed Stochastic Neighbor Embedding (t-SNE) components of $\mathbf{A}^{(4)}$ were used as features to cluster and visualize the results. The K-means algorithm was used to cluster the objects, and the accuracy of clustering was defined as in [37]. To justify the proposed distributed RandTucker algorithm, we used the MATLAB Parallel Computing Toolbox, which allows us to use the full processing power of multicore desktops by executing applications on workers (MATLAB computational engines) that run locally. The tensor $\mathcal{Y}$ was partitioned into 4 sub-tensors with the size of $128 \times 128 \times 3 \times 1800$ and was distributed on 4 workers. As the baseline, the Tucker-ALS algorithm included in [32] was compared, and the multilinear rank was empirically set as $R = (20, 20, 3, 20)$; for the RandTucker method, $R = (30, 30, 3, 50)$ was set. Both of these two algorithms used 2 iterations. Their performances are detailed in TABLE IV. It can be seen that no significant deterioration was observed between the randomized

⁴ The performance index SIR is defined as $\mathrm{SIR}(\mathbf{a}^{(n)}_r, \widehat{\mathbf{a}}^{(n)}_r) = 20 \log \left( \|\mathbf{a}^{(n)}_r\|_2 / \|\widehat{\mathbf{a}}^{(n)}_r - \mathbf{a}^{(n)}_r\|_2 \right)$, where $\widehat{\mathbf{a}}^{(n)}_r$ is an estimate of $\mathbf{a}^{(n)}_r$, and both $\mathbf{a}^{(n)}_r$ and $\widehat{\mathbf{a}}^{(n)}_r$ are normalized to be zero-mean and unit-variance.

TABLE IV: Comparison between the algorithms when they were applied to the clustering analysis of the COIL-100 objects.

Algorithm               Fit (%)    Accuracy (%)
Tucker-ALS              70.3       69.2
RandTucker2i            69.1       68.4
RandTucker              68.6       69.8
FFCP (Tucker-ALS)       67.1       71.3
FFCP (RandTucker2i)     64.7       68.0

[Fig. 7: Visualization of the clustering results obtained by applying the distributed RandTucker2i to the COIL-100 image database. The two t-SNE components of $\mathbf{A}^{(4)}$ were used for visualization; each of the 100 objects is shown as a labeled point.]

methods and the deterministic methods, which suggests that the distributed RandTucker algorithm can be applied to really large-scale problems. The clustering results obtained by the RandTucker2i are visualized in Fig. 7 by using the two t-SNE components of $\mathbf{A}^{(4)}$.

C. UCF Sports Action Videos Analysis

This data set consists of a set of videos collected from various sports that are typically featured on broadcast television channels [38] (see http://crcv.ucf.edu/data/UCF_Sports_Action.php). To avoid rescaling the videos and to keep all the videos in the same specification, we selected 5 actions, namely Diving, Riding-Horse, Run, Swing, and Walk-Front, each of which contains 10 video clips of size $404 \times 720 \times 3 \times 35$ (35 is the number of frames). We concatenated these videos to form a 5th-order tensor. This tensor has about $1.6 \times 10^9$ entries, and more than 99% of them are nonzero; when it is saved in double format it consumes about 11.7 GB of space. We used tensor decomposition methods to decompose this tensor to extract low-dimensional features for further clustering analysis. When we applied the CP-ALS method included in the Tensor Toolbox to decompose 30 videos, it consumed about 4 hours for only one iteration; when 40 videos were considered, the CP-ALS ran out of memory. We configured a small cluster with 3 computers (a desktop computer plus two laptop computers, each of which has an Intel i7 dual-core CPU and 8 GB physical memory, running Windows 7 and MATLAB 2012b). In the first stage the RandTucker2i was applied with $R = [35, 35, 3, 10, 20]$. It consumed about 104 seconds to obtain the Tucker approximation of the original tensor, with a Fit of 72.3% to the observed data. Then the FFCP was applied using a desktop computer with $R = 10$. In this case only 75 seconds were required without nonnegativity constraints, and 7 seconds were consumed when nonnegativity constraints were imposed. Finally, we used $\mathbf{A}^{(5)}$ as features to cluster the videos and achieved an accuracy of 58% in a typical run. This experiment showed that the proposed algorithms can be applied to decompose very large-scale tensors by using a distributed computing cluster.

[Fig. 8: The UCF Sports Action videos. The first row shows a snapshot of each action, and the second row shows 5 snapshots from a video of diving.]

VI. CONCLUSION

Tensor decompositions are particularly promising for big multiway data analytics, as they are able to discover more complex internal structures and correlations of data gathered from multiple aspects, devices, trials, and subjects. While most existing tensor decomposition methods are not designed to meet the major challenges posed by big data, in this paper we proposed a unified framework for the decomposition of big tensors with relatively low multilinear rank. The key idea is that, based on the Tucker approximation of a big tensor, a large set of unconstrained/constrained tensor decompositions can be efficiently performed. We developed a flexible fast algorithm for the CP decomposition of tensors using their Tucker representation, and a distributed randomized Tucker decomposition approach which is particularly suitable for big tensor analysis. Simulation results confirmed that the proposed algorithms are highly efficient and scalable.

While the randomized Tucker decomposition approach showed close-to-optimal performance, the corresponding error bounds need to be investigated theoretically in the future. We also believe that many randomized (or related) approaches proposed for the low-rank approximation of matrices [29], [39] could be applied to Tucker decompositions; however, a comparative study in tensor decomposition scenarios is desired. Particularly, if we have efficient and robust unconstrained CP decomposition algorithms, we can use the CP model to compress the raw data instead of the Tucker model, which suffers from the curse of dimensionality for very high-order tensors. Hence, developing more reliable and robust CP decomposition approaches is another important task. All these open issues deserve further study in the near future.

REFERENCES

[1] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Chichester, UK: John Wiley & Sons, Ltd, 2009.

[2] N. Sidiropoulos, R. Bro, and G. Giannakis, "Parallel factor analysis in sensor array processing," IEEE Transactions on Signal Processing, vol. 48, no. 8, pp. 2377-2388, Aug. 2000.

[3] L. De Lathauwer, J. Castaing, and J. Cardoso, "Fourth-order cumulant-based blind identification of underdetermined mixtures," IEEE Transactions on Signal Processing, vol. 55, no. 6, pp. 2965-2973, 2007.

[4] P. Comon, "Tensors: A brief introduction," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 44-53, May 2014.

[5] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455-500, 2009.

[6] N. D. Sidiropoulos and R. Bro, "On the uniqueness of multilinear decomposition of N-way arrays," Journal of Chemometrics, vol. 14, no. 3, pp. 229-239, 2000.

[7] G. Zhou and A. Cichocki, "Canonical polyadic decomposition based on a single mode blind source separation," IEEE Signal Processing Letters, vol. 19, no. 8, pp. 523-526, Aug. 2012.

[8] L.-H. Lim and P. Comon, "Nonnegative approximations of nonnegative tensors," Journal of Chemometrics, vol. 23, no. 7-8, pp. 432-441, 2009.

[9] E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos, "ParCube: Sparse parallelizable tensor decompositions," in Proceedings of the 2012 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I, ser. ECML PKDD'12. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 521-536.

[10] A. Phan and A. Cichocki, "Advances in PARAFAC using parallel block decomposition," in Neural Information Processing, ser. Lecture Notes in Computer Science, C. Leung, M. Lee, and J. Chan, Eds., vol. 5863. Springer Berlin Heidelberg, 2009, pp. 323-330.

[11] U. Kang, E. Papalexakis, A. Harpale, and C. Faloutsos, "GigaTensor: Scaling tensor analysis up by 100 times - algorithms and discoveries," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '12. New York, NY, USA: ACM, 2012, pp. 316-324.

[12] C. Tsourakakis, "MACH: Fast randomized tensor decompositions," in Proceedings of the 2010 SIAM International Conference on Data Mining, 2010, pp. 689-700.

[13] P. Drineas and M. W. Mahoney, "A randomized algorithm for a tensor-based generalization of the singular value decomposition," Linear Algebra and its Applications, vol. 420, no. 2-3, pp. 553-571, 2007.

[14] C. F. Caiafa and A. Cichocki, "Generalizing the column-row matrix decomposition to multi-way arrays," Linear Algebra and its Applications, vol. 433, no. 3, pp. 557-573, 2010.

[15] M. Vasilescu and D. Terzopoulos, "Multilinear analysis of image ensembles: Tensorfaces," in Computer Vision — ECCV 2002, ser. Lecture Notes in Computer Science, A. Heyden, G. Sparr, M. Nielsen, and P. Johansen, Eds. Springer Berlin Heidelberg, 2002, vol. 2350, pp. 447-460.

[16] O. Duchenne, F. Bach, I.-S. Kweon, and J. Ponce, "A tensor-based algorithm for high-order graph matching," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2383-2395, Dec. 2011.

[17] A. Rajwade, A. Rangarajan, and A. Banerjee, "Image denoising using the higher order singular value decomposition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 4, pp. 849-862, 2013.

[18] V. De Silva and L.-H. Lim, "Tensor rank and the ill-posedness of the best low-rank approximation problem," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 3, pp. 1084-1127, Sep. 2008.

[19] G. Zhou, A. Cichocki, Q. Zhao, and S. Xie, "Nonnegative matrix and tensor factorizations: An algorithmic perspective," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 54-65, May 2014.

[20] G. Zhou, A. Cichocki, and S. Xie, "Fast nonnegative matrix/tensor factorization based on low-rank approximation," IEEE Transactions on Signal Processing, vol. 60, no. 6, pp. 2928-2940, June 2012.

[21] V. Cevher, S. Becker, and M. Schmidt, "Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 32-43, Sept. 2014.

[22] R. Bro and C. A. Andersson, "Improving the speed of multiway algorithms: Part II: Compression," Chemometrics and Intelligent Laboratory Systems, vol. 42, no. 1-2, pp. 105-113, 1998.

[23] N. Sidiropoulos, E. Papalexakis, and C. Faloutsos, "Parallel randomly compressed cubes: A scalable distributed architecture for big tensor decomposition," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 57-70, Sept. 2014.

[24] P. Comon, X. Luciani, and A. L. F. de Almeida, "Tensor decompositions, alternating least squares and other tales," Journal of Chemometrics, vol. 23, no. 7-8, pp. 393-405, 2009.

[25] J. Cohen, R. Farias, and P. Comon, "Fast decomposition of large nonnegative tensors," IEEE Signal Processing Letters, vol. 22, no. 7, pp. 862-866, July 2015.

[26] L. De Lathauwer, B. De Moor, and J. Vandewalle, "A multilinear singular value decomposition," SIAM Journal on Matrix Analysis and Applications, vol. 21, pp. 1253-1278, 2000.

[27] E. Papalexakis, N. Sidiropoulos, and R. Bro, "From k-means to higher-way co-clustering: Multilinear decomposition with sparse latent factors," IEEE Transactions on Signal Processing, vol. 61, no. 2, pp. 493-506, Jan. 2013.

[28] Z. Qian, X. Chen, N. Kang, M. Chen, Y. Yu, T. Moscibroda, and Z. Zhang, "MadLINQ: Large-scale distributed matrix computation for the cloud," in Proceedings of the 7th ACM European Conference on Computer Systems, ser. EuroSys '12. New York, NY, USA: ACM, 2012, pp. 197-210.

[29] N. Halko, P. Martinsson, and J. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," SIAM Review, vol. 53, no. 2, pp. 217-288, 2011.

[30] N. Vervliet, O. Debals, L. Sorber, and L. De Lathauwer, "Breaking the curse of dimensionality using decompositions of incomplete tensors: Tensor-based scientific computing in big data analysis," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 71-79, Sept. 2014.

[31] G. Zhou, A. Cichocki, Q. Zhao, and S. Xie, "Efficient nonnegative Tucker decompositions: Algorithms and uniqueness," IEEE Transactions on Image Processing (submitted), p. 13, 2014. [Online]. Available: http://arxiv.org/abs/1404.4412

[32] B. W. Bader and T. G. Kolda, "MATLAB tensor toolbox version 2.5," Feb. 2012. [Online]. Available: http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox

[33] C. A. Andersson and R. Bro, "The N-way toolbox for MATLAB," 2000. [Online]. Available: http://www.models.life.ku.dk/source/nwaytoolbox

[34] H. Phan, P. Tichavsky, and A. Cichocki, "Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations," IEEE Transactions on Signal Processing, vol. 61, no. 19, pp. 4834-4846, 2013.

[35] N. Sidiropoulos, E. Papalexakis, and C. Faloutsos, "A parallel algorithm for big tensor decomposition using randomly compressed cubes (PARACOMP)," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1-5.

[36] A. Cichocki and A.-H. Phan, "Fast local algorithms for large scale nonnegative matrix and tensor factorizations," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences (invited paper), vol. E92-A, no. 3, pp. 708-721, 2009.

[37] D. Cai, X. He, J. Han, and T. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1548-1560, Aug. 2011.

[38] M. Rodriguez, J. Ahmed, and M. Shah, "Action MACH: a spatio-temporal maximum average correlation height filter for action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), June 2008, pp. 1-8.

[39] R. Baraniuk, V. Cevher, and M. Wakin, "Low-dimensional models for dimensionality reduction and signal recovery: A geometric perspective," Proceedings of the IEEE, vol. 98, no. 6, pp. 959-971, June 2010.

  • I Introduction
  • II Overview of Tucker and Polyadic Decompositions
  • III Efficient CP Decompositions of Tensors Based on Their Tucker Approximations
    • III-A Overview of CPD Algorithms
    • III-B CPD of Big Tensors by Based on Their Tucker Compression
      • IV Distributed Randomized Tucker Decompositions (RandTucker)
        • IV-A Randomized Tucker Decomposition
        • IV-B Distributed Randomized Tucker Decomposition
        • IV-C Nonnegative Tucker Decomposition
          • V Simulations
            • V-A Synthetic data
            • V-B The COIL-100 Images Clustering Analysis
            • V-C UCF Sports Action Videos Analysis
              • VI Conclusion
              • References


Fig. 2: Two different ways to perform CPD based on a Tucker approximation. In the Tucker+CP routine the CPD is performed on a small full tensor (i.e., the core tensor G), while in the FFCP method it is performed on a tensor in its approximate Tucker representation ⟦G; U(1), U(2), ..., U(N)⟧.

we can obtain the standard ALS update rule for the CPD
$$A^{(n)} \leftarrow Y_{(n)}B^{(n)}\bigl(B^{(n)T}B^{(n)}\bigr)^{-1}, \qquad n = 1, 2, \ldots, N, \qquad (6)$$
by letting ∂J/∂A(n) = 0, or a family of first-order nonnegative CPD algorithms [19], for example the one using the multiplicative update (MU) rule
$$A^{(n)} \leftarrow A^{(n)} \circledast \bigl[\bigl(Y_{(n)}B^{(n)}\bigr) \oslash \bigl(A^{(n)}B^{(n)T}B^{(n)}\bigr)\bigr], \qquad (7)$$
and the efficient hierarchical ALS (HALS) approach that updates only one column at a time,
$$a_r^{(n)} \leftarrow a_r^{(n)} + \frac{1}{t_{rr}^{(n)}}\, P_{+}\bigl(q_r^{(n)} - A^{(n)}t_r^{(n)}\bigr), \qquad (8)$$
where P₊(x) = max(x, 0) is applied element-wise to a matrix/vector [19], [20], t_rr^(n) and t_r^(n) are the (r, r) entry and the rth column of the matrix T(n) = B(n)TB(n), respectively, and q_r^(n) is the rth column of Q(n) = Y(n)B(n). Efficient computation of (5) is the foundation of a large class of first-order methods for CPD and their variants [19]. In [21] the authors argued that first-order (gradient) methods generally provide acceptable convergence speed and are easy to parallelize, and hence are more suitable than their second-order counterparts for large-scale optimization problems. This motivates us to restrict our discussion to first-order methods in this paper.
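To make the update rules above concrete, the following is a minimal NumPy sketch of one ALS sweep implementing (6) for an in-memory tensor; the helper names (unfold, khatri_rao, als_sweep) are ours for illustration and do not refer to any particular toolbox.

```python
import numpy as np

def unfold(T, mode):
    # Mode-n matricization: rows indexed by the chosen mode, remaining modes flattened in C order
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(mats):
    # Column-wise Kronecker product of matrices sharing the same number of columns
    R = mats[0].shape[1]
    out = mats[0]
    for M in mats[1:]:
        out = np.einsum('ir,jr->ijr', out, M).reshape(-1, R)
    return out

def als_sweep(Y, A):
    # One pass of the ALS update (6) over all modes; A is a list of factor matrices
    N, R = Y.ndim, A[0].shape[1]
    for n in range(N):
        B = khatri_rao([A[p] for p in range(N) if p != n])   # B^(n), formed explicitly here
        G = np.ones((R, R))
        for p in range(N):
            if p != n:
                G *= A[p].T @ A[p]                            # Hadamard product of Gram matrices, cf. (9)
        A[n] = unfold(Y, n) @ B @ np.linalg.pinv(G)
    return A

# Toy usage: rank-5 decomposition of a random 30 x 40 x 50 tensor
Y = np.random.rand(30, 40, 50)
A = [np.random.rand(I, 5) for I in Y.shape]
for _ in range(20):
    A = als_sweep(Y, A)
```

Note that this plain sweep forms B(n) explicitly and touches the full tensor in every iteration, which is exactly what becomes infeasible for big tensors and motivates the discussion below.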

In what follows we discuss how to compute the gradient terms in (5) efficiently. While the term
$$B^{(n)T}B^{(n)} = \circledast_{p \neq n}\bigl(A^{(p)T}A^{(p)}\bigr) \in \mathbb{R}^{R \times R} \qquad (9)$$
in (5) (and in (6), (7), and (8) as well) can be computed efficiently, the term Y(n)B(n) is computationally quite demanding, especially when Y is huge and N is large. Indeed, in many cases neither Y nor B(n) can be loaded into memory. To overcome this problem we hope to 1) avoid accessing the huge tensor Y frequently (e.g., in every iteration), and 2) avoid explicitly constructing the big matrix B(n) ∈ R^{∏_{k≠n} I_k × R}. To this end, two properties of Y can be exploited to improve the scalability of CPD algorithms:

• Sparsity of Y. In this case the GigaTensor method [11] provides an elegant solution: only the nonzero entries of Y are accessed, and the columns of Y(n)B(n) are computed in a parallel manner without explicit construction of B(n).

• Low rank of Y (the tensor can be dense). This type of method uses a low-rank approximation instead of the original huge dense tensor Y. A standard method is to perform Tucker compression first, typically by using the HOSVD, and then apply the CPD to the core tensor [22], which is referred to as the Tucker+CP method in this paper. The approach in [10] splits the original big tensor into a set of small sub-tensors and then applies a CPD to each sub-tensor; afterwards, Y(n)B(n) can be computed by using these sub-tensors in their CP approximation instead of Y. Arguing that some sub-tensors could be very ill-posed, which unavoidably causes poor performance, the PARACOMP method [23] uses a set of random matrices that share a pre-specified number of common anchor rows to compress the tensor Y using the Tucker model, so that a set of much smaller tensors is obtained. The CPD is then performed on these small tensors simultaneously, and the original true components are recovered by solving a linear system that incorporates the information provided by the anchor rows. This method avoids the SVD of huge matrices required by the HOSVD; however, it involves a total of P Tucker compression procedures and CPDs of P small tensors, where P ≥ I_n/R.

Both the method in [10] and the PARACOMP method in [23] rely on reliable CPD of all sub (or intermediate) tensors. However, it is widely known that existing CPD approaches often converge to local solutions, especially when bottlenecks exist in one or more factor matrices [24], or, even worse, the CPD is simply not unique. For example, in [23] the authors argued that the method in [10] assumed each sub-tensor had a unique CPD, which however cannot be guaranteed from the global uniqueness conditions alone [23]. In the PARACOMP, the uniqueness of the CPD of each small tensor can be guaranteed from the global uniqueness conditions; however, this does not solve the local convergence problem, and the method is sensitive to noise. In other words, to achieve satisfactory results the PARACOMP actually requires that all the CPDs in the intermediate step converge globally, which is often unrealistic. Hence we believe it is important to develop more robust and scalable algorithms for the CPD, especially when Y is dense. In the meanwhile, we should avoid using CPDs as intermediate steps, due to the local convergence issue.

Let us go into more detail of the Tucker+CP approach. In this approach the big tensor Y is first approximated by its HOSVD, Y ≈ Ŷ = ⟦G; U(1), U(2), ..., U(N)⟧, where U(n) ∈ R^{I_n×R_n} is computed from the truncated SVD of Y(n) with U(n)TU(n) = I; then the CPD is applied to the much smaller tensor G ∈ R^{R_1×R_2×···×R_N}, such that
$$G \approx [\![V^{(1)}, V^{(2)}, \ldots, V^{(N)}]\!], \qquad (10)$$
thereby leading to the CPD Y ≈ ⟦A(1), A(2), ..., A(N)⟧ with A(n) = U(n)V(n). For large-scale problems very often I_n ≫ R_n, and the CPD of (10) can be solved efficiently even on a standard desktop computer. The Tucker+CP routine, originally proposed in [22], is very efficient for big tensors with low multilinear ranks. However, this framework is not directly applicable if it is necessary to impose additional constraints, such as nonnegativity and sparsity, on the factor matrices. Very recently, Cohen, Farias, and Comon proposed a novel efficient method to deal with the nonnegativity of factor matrices [25]. Our objective in this paper is to develop a more general and flexible scheme which can not only achieve similar efficiency and scalability but also allows us to impose any additional or optional constraints on the factor matrices very easily.

B. CPD of Big Tensors Based on Their Tucker Compression

The Tucker decomposition is suitable for pre-processing and compressing data because it is significantly faster than the CPD and it has guaranteed upper error bounds [26]. In this subsection we consider a slightly modified version of the Tucker+CP approach to overcome its limitations. In the first step, the Tucker compression is performed such that Y ≈ Ŷ = ⟦G; U(1), U(2), ..., U(N)⟧, as in the Tucker+CP approach. In the second step, however, we perform the CPD on the whole tensor Ŷ instead of on the core tensor G, by minimizing the following cost function:
$$J\bigl(A^{(n)}\bigr) = \tfrac{1}{2}\bigl\|\hat{Y} - [\![A^{(1)}, A^{(2)}, \ldots, A^{(N)}]\!]\bigr\|_F^2, \qquad \text{s.t. } A^{(n)} \in \mathcal{C}_n, \qquad (11)$$
where C_n denotes some constraints imposed on A(n). With (11), the term Y(n)B(n) in the gradient ∂J/∂A(n) can be efficiently computed from
$$Y_{(n)}B^{(n)} \approx U^{(n)}G_{(n)}\Bigl[\bigodot_{p \neq n} V^{(p)}\Bigr], \qquad (12)$$
where
$$V^{(n)} = U^{(n)T}A^{(n)} \in \mathbb{R}^{R_n \times R}. \qquad (13)$$

Algorithm 1: The Flexible Fast Nonnegative CP (FFCP) Decomposition Algorithm
Require: Y ∈ R^{I_1×I_2×···×I_N}, the output rank R.
1: Perform Tucker decomposition to obtain Y ≈ ⟦G; U(1), U(2), ..., U(N)⟧. Initialize A(n) and V(n) = U(n)T A(n).
2: repeat
3:   for n ∈ N_N do
4:     Efficiently compute H(n) = G(n)[⊙_{p≠n} V(p)].
5:     Compute Y(n)B(n) ≈ U(n)H(n) and B(n)TB(n) = ⊛_{p≠n}(A(p)TA(p)).
6:     Update A(n) by using (6), or (7)/(8) to impose nonnegativity constraints, and/or (16) to improve the sparsity of components.
7:     V(n) ← U(n)T A(n).
8:   end for
9: until a stopping criterion is satisfied
10: return G ← Y, A(n), n ∈ N_N.

Note that V(n) is quite small, as R_n ≤ R, and it can be computed by using distributed systems or the matrix partitioning method when I_n is really huge. As such, the ALS update rule (6) can be replaced by
$$A^{(n)} \leftarrow U^{(n)}G_{(n)}\Bigl[\bigodot_{p \neq n} V^{(p)}\Bigr]\Bigl(\circledast_{p \neq n} A^{(p)T}A^{(p)}\Bigr)^{-1}. \qquad (14)$$
Because the approximation Ŷ may contain negative entries, (7) can be replaced by
$$A^{(n)} \leftarrow A^{(n)} \circledast \frac{P_{+}\Bigl(U^{(n)}G_{(n)}\bigl[\bigodot_{p \neq n} V^{(p)}\bigr]\Bigr)}{A^{(n)}\bigl(\circledast_{p \neq n} A^{(p)T}A^{(p)}\bigr)}. \qquad (15)$$

As a result, the nonnegativity of factor matrices can be easily incorporated into the proposed scheme. If we hope to extract sparse latent components, we consider the l1-norm regularized CPD proposed in [27] and minimize J(A(n)) + c‖A(n)‖_1, thereby leading to
$$A^{(n)} \leftarrow S_c\bigl(A^{(n)}\bigr), \qquad (16)$$
where the sparsity operator S_c(·) is applied element-wise and defined as
$$S_c(x) = \begin{cases} x - c & \text{if } x > c, \\ 0 & \text{if } -c \le x \le c, \\ x + c & \text{if } x < -c. \end{cases}$$

In summary, most first-order constrained/unconstrained CPD algorithms can benefit from the proposed method. Approximately, the computational complexity is reduced from O(R_n ∏_{p∈N_N} I_p) in (5) to only about O(I_n R_n² + R_n ∏_{p∈N_N} R_p), due to the use of a compact representation of Y.
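As a rough illustration of how a single FFCP mode update can be organized, the sketch below (reusing the unfold and khatri_rao helpers from the previous listing, which are our own illustrative names) computes the compressed gradient terms (12)–(13) and applies the ALS update (14), optionally with the multiplicative update (15) and the soft-thresholding (16).

```python
import numpy as np

def ffcp_update_mode(G, U, A, V, n, nonneg=False, sparsity_c=0.0):
    """One FFCP update of A[n]; G is the Tucker core, U the Tucker factors,
    A the CP factors being estimated, and V[p] = U[p].T @ A[p], cf. (13)."""
    N = G.ndim
    R = A[n].shape[1]
    # H^(n) = G_(n) (Khatri-Rao of V^(p), p != n), cf. step 4 of Algorithm 1
    H = unfold(G, n) @ khatri_rao([V[p] for p in range(N) if p != n])
    YB = U[n] @ H                           # Y_(n) B^(n) approximated as in (12)
    T = np.ones((R, R))
    for p in range(N):
        if p != n:
            T *= A[p].T @ A[p]              # B^(n)T B^(n), cf. (9)
    if nonneg:
        # Multiplicative update (15); the numerator is clipped at zero, and a small
        # constant keeps the denominator away from zero
        A[n] = A[n] * np.maximum(YB, 0) / np.maximum(A[n] @ T, 1e-12)
    else:
        A[n] = YB @ np.linalg.pinv(T)       # ALS update (14)
    if sparsity_c > 0:
        # Soft-thresholding operator S_c, cf. (16)
        A[n] = np.sign(A[n]) * np.maximum(np.abs(A[n]) - sparsity_c, 0.0)
    V[n] = U[n].T @ A[n]                    # refresh V^(n), step 7 of Algorithm 1
    return A, V
```

Note that only the small matrices G(n), U(n), A(p), and V(p) are touched; the full tensor never appears in the update.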

5

minus20 minus15 minus10 minus5 0 5 10 15 20

60

65

70

75

80

85

90

95

Noise level (dB)

Fit

Tucker+CP

FFCPDirect CP

Fig. 3: The Fit values, averaged over 50 Monte-Carlo runs, obtained by 3 different ways of performing CPD on a synthetic tensor Y ∈ R^{50×50×50×50} with R = 10 (i.e., the Direct CP, which applies CPD to Y directly; the FFCP, applied to the Tucker format Y ≈ ⟦G; U(1), U(2), ..., U(N)⟧; and the Tucker+CP, which applies CPD to the core tensor G first and then constructs the CP approximation). The data tensor Y was generated by using (2), and different levels of Gaussian noise (SNR in dB) were added.

It is worth noticing that if Ŷ = ⟦G; U(1), U(2), ..., U(N)⟧ with U(n)TU(n) = I, minimizing (11) is equivalent to solving
$$\min_{V^{(n)}} \; J\bigl(V^{(n)}\bigr) = \tfrac{1}{2}\bigl\|G - [\![V^{(1)}, V^{(2)}, \ldots, V^{(N)}]\!]\bigr\|_F^2, \qquad \text{s.t. } A^{(n)} = U^{(n)}V^{(n)} \in \mathcal{C}_n, \qquad (17)$$
where V(n) is defined in (13) and C_n denotes some constraints on A(n). Indeed, (17) has been considered in [25], specifically for NTF. Although that method essentially has the same computational complexity as the FFCP method, the main advantage of (11) is that any off-the-shelf (constrained) CPD method can be applied directly, with efficient computation of the gradient terms (9) and (12), without the need to design a new algorithm to solve (17).
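For completeness, the equivalence can be verified directly, under the stated assumption that each U(n) has orthonormal columns (U(n)TU(n) = I):
$$\bigl\|\hat{Y} - [\![A^{(1)}, \ldots, A^{(N)}]\!]\bigr\|_F = \bigl\|[\![\,G - [\![V^{(1)}, \ldots, V^{(N)}]\!];\; U^{(1)}, \ldots, U^{(N)}]\!]\bigr\|_F = \bigl\|G - [\![V^{(1)}, \ldots, V^{(N)}]\!]\bigr\|_F,$$
where the first step uses A(n) = U(n)V(n) and the multilinearity of the Tucker/CP representations, and the second step uses the fact that multiplying each mode by a matrix with orthonormal columns preserves the Frobenius norm. This argument requires A(n) to lie in the range of U(n), which is exactly the constraint in (17).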

In summary, not only does (11) benefit from the same low computational complexity as the Tucker+CP method, it is also quite flexible, because additional constraints on A(n), such as nonnegativity and/or sparsity, can be easily incorporated. For this reason the new method is called the flexible fast CP (FFCP) decomposition algorithm, and the corresponding pseudo code is briefly listed in Algorithm 1.¹ In Fig. 3 we compared the Tucker+CP, the FFCP (solving (11)), and the standard CPD (solving (3)) algorithms. The data tensor Y ∈ R^{50×50×50×50} was generated by using (2) with R = 10 and contaminated by different levels of noise (in SNR) ranging from -20 dB to 20 dB. In the figure we show their Fit values averaged over 50 Monte-Carlo runs, where Fit = 1 − ‖Y − Ŷ‖_F/‖Y‖_F. It turns out that the Tucker compression generally does not degrade the accuracy; on the contrary, it may improve the stability of CPD algorithms.

¹Although in Algorithm 1 only the LS update and the multiplicative update are listed, any update rule based on the gradient (e.g., HALS; see [19]) could be incorporated into the scheme of FFCP straightforwardly.
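The Fit measure used here and in the following experiments can be evaluated with a few lines of NumPy; the helper names are illustrative, and the full reconstruction is only practical for small tensors such as the one in this example.

```python
import numpy as np

def cp_reconstruct(A):
    # Full tensor from CP factors A[n], each of shape (I_n, R)
    R = A[0].shape[1]
    Y = np.zeros([M.shape[0] for M in A])
    for r in range(R):
        outer = A[0][:, r]
        for M in A[1:]:
            outer = np.multiply.outer(outer, M[:, r])
        Y += outer
    return Y

def fit(Y, Y_hat):
    # Fit = 1 - ||Y - Y_hat||_F / ||Y||_F
    return 1.0 - np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y)
```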

Although the matrices V(n) in (12) are generally small, construction of the tall (∏_{p≠n} R_p)-by-R matrix ⊙_{p≠n} V(p) and direct computation of H(n) = G(n)[⊙_{p≠n} V(p)] can be avoided by applying either one of the following methods:

• Notice that the rth column of H(n) can be computed as h_r^(n) = G ×_{p≠n} v_r^(p)T, which allows H(n) to be computed in parallel by using the shared data tensor G, but without constructing ⊙_{p≠n} V(p) explicitly (see the sketch below).

• The term G(n)[⊙_{p≠n} V(p)] can be computed in parallel by using the technique adopted in the GigaTensor [11]; here G(n) is generally not sparse but relatively small.

Moreover, H(n) = G(n)[⊙_{p≠n} V(p)] can be viewed as the gradient term arising when we perform the CPD of G ≈ ⟦V(1), V(2), ..., V(N)⟧ (see (5) and (17)). For this reason, most techniques which have been proposed to speed up standard CP decompositions can be applied to compute H(n) straightforwardly.
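The first bullet above can be realized with plain tensor contractions; a minimal sketch, assuming the core G and the projected factors V(p) are available as NumPy arrays (the function names are ours), is:

```python
import numpy as np

def h_column(G, V, n, r):
    # h_r^(n) = G x_{p != n} v_r^(p)T : contract the core with the rth column of every
    # V^(p) except mode n; the surviving axis is mode n.
    H = G
    for p in range(G.ndim - 1, -1, -1):      # contract from the last axis so indices stay valid
        if p != n:
            H = np.tensordot(H, V[p][:, r], axes=([p], [0]))
    return H                                  # vector of length R_n

def compute_H(G, V, n):
    # Assemble H^(n) column by column; each call is independent and could run in parallel
    R = V[0].shape[1]
    return np.stack([h_column(G, V, n, r) for r in range(R)], axis=1)
```

Each column only requires the (small) core tensor and one column per projected factor, which is what makes the computation easy to distribute.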

For very large-scale data where the I_n are extremely huge, the products A(n)TA(n) and U(n)TA(n) can be performed on distributed systems, see, e.g., [28], or simply computed by using the matrix partitioning technique on a standard computer.

In summary, even with very limited computational resources, the FFCP algorithm can perform the (nonnegative) CPD of very large-scale tensors, provided that their low-rank Tucker representation can be obtained. Notice that the Tucker approximation of a given tensor plays a fundamental role in the proposed FFCP method, as well as in the Tucker+CP method, the method proposed in [25], and the PARACOMP [35], and the HOSVD often serves as the workhorse to perform Tucker compression. However, the HOSVD involves the eigenvalue decomposition of the huge matrices Y(n), which makes it unsuitable for large-scale problems [35]. In the next section we focus on how to overcome this limitation.

IV. DISTRIBUTED RANDOMIZED TUCKER DECOMPOSITIONS (RANDTUCKER)

A. Randomized Tucker Decomposition

In the Tucker decomposition (1), the factor matrices U(n) are estimated by minimizing the following LS problems:
$$J\bigl(U^{(n)}\bigr) = \bigl\|Y_{(n)} - U^{(n)}B^{(n)T}\bigr\|_F, \qquad (18)$$
where $B^{(n)} = \bigl[\bigotimes_{k \neq n} U^{(k)}\bigr]G_{(n)}^{T}$. In the HOSVD [26] the factor matrix U(n) is estimated as the leading singular vectors of Y(n). However, performing the SVD is prohibitive for very large-scale problems. Recently, a set of probabilistic algorithms has been developed to discover the low-rank structure of huge matrices [29]. Here we consider the following randomized range finder (RRF) algorithm (Algorithm 4.1 of [29]) to estimate the basis of U(n):
1) Draw a random Gaussian matrix Ω(n) ∈ R^{∏_{p≠n} I_p × R̄_n};
2) Obtain the I_n × R̄_n matrix Z(n) = Y(n)Ω(n);
3) Compute U(n) ∈ R^{I_n × R̄_n} as an orthonormal basis of Z(n), by using, e.g., the QR decomposition.
In the above, R̄_n = R_n + p, where p is referred to as the oversampling parameter, and p = 5 or p = 10 was suggested in [29]. Based on this procedure we propose a randomized Tucker decomposition (RandTucker) algorithm to perform Tucker decompositions; see Algorithm 2. Compared with the HOSVD algorithm, the proposed RandTucker algorithm is highly parallelizable and suitable for distributed computation.
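A minimal NumPy sketch of this randomized range finder applied to the mode-n unfolding, under the assumption that the tensor fits in memory (the distributed case is treated in Section IV-B), could look as follows; the function name rrf_basis is illustrative.

```python
import numpy as np

def rrf_basis(Y, mode, rank, oversampling=10, rng=None):
    """Randomized range finder in the spirit of Algorithm 4.1 of [29], applied to the
    mode-n unfolding of Y; returns an I_n x (rank + oversampling) orthonormal basis."""
    rng = np.random.default_rng(rng)
    Yn = np.moveaxis(Y, mode, 0).reshape(Y.shape[mode], -1)   # mode-n unfolding
    Omega = rng.standard_normal((Yn.shape[1], rank + oversampling))
    Z = Yn @ Omega                                            # random projection, I_n x R_bar
    Q, _ = np.linalg.qr(Z)                                    # orthonormal basis of range(Z)
    return Q
```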

Several randomized algorithms have already been proposed for Tucker decompositions, such as the methods using fiber sampling [14], [30]. Compared with them, the proposed RandTucker method has an important advantage: it is able to obtain a nearly optimal approximation even for very noisy data by using only one additional ALS-like iteration, which is referred to as RandTucker2i; see Algorithm 3 and Fig. 4 for the performance comparison. In this figure, a 200-by-200-by-200 tensor Y was generated in each run by using (1) with R_n = 10; then different levels of Gaussian noise were added to Y to generate the observation tensor. We plotted the Fit values (and the standard deviations) obtained by the RandTucker (with p = 10), the RandTucker2i (p = 10), and the HOSVD, averaged over 50 Monte-Carlo runs. From the figure, the RandTucker achieved slightly worse Fit values, especially when the noise was heavy, while the RandTucker2i achieved almost the same Fit values as the HOSVD. As the RandTucker and the RandTucker2i do not involve the singular value decomposition of big matrices and they are highly parallelizable, they are very promising for the Tucker decompositions of big tensors.

B. Distributed Randomized Tucker Decomposition

In the above RandTucker and RandTucker2i algorithms we have assumed that all the data can be processed in physical memory directly. For big data this could be unrealistic; a common scenario is that only partial data can be loaded in memory. In this subsection we discuss how to solve this problem by using a computer cluster (in this paper we used the MATLAB Distributed Computing Server²). Here we consider a cluster which is composed of:

• a number of workers with identity labels, each of which is a computational engine that executes computational tasks simultaneously; in the meanwhile they may communicate with each other and with the client as well;

²See www.mathworks.com/help/pdf_doc/mdce/mdce.pdf

Algorithm 2: The Randomized Tucker Decomposition (RandTucker) Algorithm
Require: Y ∈ R^{I_1×I_2×···×I_N} and the multilinear rank (R_1, R_2, ..., R_N).
1: for n = 1, 2, ..., N do
2:   Compute Z(n) = Y(n)Ω(n), where Ω(n) is an (∏_{k≠n} I_k)-by-R̄_n random Gaussian matrix.
3:   Compute U(n) as an orthonormal basis of Z(n), by using, e.g., the QR decomposition.
4:   Y ← Y ×_n U(n)T; I_n ← R̄_n.
5: end for
6: G ← Y.
7: return Y ≈ ⟦G; U(1), U(2), ..., U(N)⟧.

Algorithm 3: The RandTucker Algorithm with 2 Iterations Only (RandTucker2i)
Require: Y ∈ R^{I_1×I_2×···×I_N} and the multilinear rank (R_1, R_2, ..., R_N).
1: Initialize U(n) ∈ R^{I_n×R_n} as random Gaussian matrices.
2: Repeat lines 3–7 twice:
3: for n = 1, 2, ..., N do
4:   X = Y ×_{p≠n} U(p)T.
5:   Compute Z(n) = X(n)Ω(n), where Ω(n) is an (∏_{k≠n} R_k)-by-R̄_n random Gaussian matrix.
6:   Compute U(n) as an orthonormal basis of Z(n), by using, e.g., the QR decomposition.
7: end for
8: G ← Y ×_1 U(1)T ×_2 U(2)T ··· ×_N U(N)T.
9: return Y ≈ ⟦G; U(1), U(2), ..., U(N)⟧.
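The following sketch puts the two algorithms together in NumPy for the in-memory case; iters=1 roughly corresponds to Algorithm 2 and iters=2 to Algorithm 3 (a simplified variant: instead of random initial factors, modes that have not been visited yet are simply left unprojected in the first sweep). All names are illustrative.

```python
import numpy as np

def mode_multiply(Y, M, mode):
    # Computes Y x_mode M^T, i.e. projects the given mode onto the columns of M
    Yn = np.moveaxis(Y, mode, 0).reshape(Y.shape[mode], -1)
    out = (M.T @ Yn).reshape((M.shape[1],) + tuple(np.delete(Y.shape, mode)))
    return np.moveaxis(out, 0, mode)

def rand_tucker(Y, ranks, oversampling=10, iters=1, rng=None):
    """Randomized Tucker sketch: returns the core G and factors U[n] with orthonormal columns."""
    rng = np.random.default_rng(rng)
    N = Y.ndim
    U = [None] * N
    for _ in range(iters):
        for n in range(N):
            X = Y
            for p in range(N):
                if p != n and U[p] is not None:
                    X = mode_multiply(X, U[p], p)      # compress the already-estimated modes
            Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
            Omega = rng.standard_normal((Xn.shape[1], ranks[n] + oversampling))
            U[n], _ = np.linalg.qr(Xn @ Omega)         # randomized range finder for mode n
    G = Y
    for n in range(N):
        G = mode_multiply(G, U[n], n)                  # core tensor G = Y x_1 U(1)T ... x_N U(N)T
    return G, U
```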


Fig. 4: Performance comparison between the RandTucker, the RandTucker2i, and the HOSVD algorithm in terms of the mean value of Fit and the standard deviation over 50 Monte-Carlo runs.

• a client, which is used to schedule tasks to the workers, gather results from the workers, and return the final results.


Fig. 5: Demonstration of the small data tensors and the virtual computing resource tensor P in a cluster. (a) An example Y ∈ R^{2I×4J×3K} is partitioned into smaller blocks Y^(s) with the same size I × J × K, thereby leading to the 2 × 4 × 3 computing resource tensor P identifying a total of 24 workers. (b) An entry p(s) ∈ {1, 2, ..., ∏_n S_n} of the computing resource tensor P indicates that Worker p(s) processes the sub-tensor Y^(s), where s = [s_1, s_2, ..., s_N]^T = ind2sub(p(s)). Hence in (a), p(1, 3, 2) = 13 means that Worker 13 processes the sub-tensor Y^(1,3,2), as [1, 3, 2]^T = ind2sub(13) = ind2sub(p_(2)(3, 3)).

Given a big Nth-order tensor Y ∈ R^{I_1×I_2×···×I_N}, we partition each dimension I_n into a total of S_n disjoint subsets, each of which consists of L_{s_n} indices, with Σ_{s_n=1}^{S_n} L_{s_n} = I_n. As such, the original big tensor is partitioned into a total of W = S_1 S_2 ··· S_N sub-tensors, and each sub-tensor can be denoted by Y^{(s_1, s_2, ..., s_N)}, with size L_{s_1} × L_{s_2} × ··· × L_{s_N}, where s_n ∈ N_{S_n}, n = 1, 2, ..., N. At the same time, suppose we have a computer cluster with W workers (computational engines).³ We imagine that the W workers are virtually spatially permuted to form an S_1 × S_2 × ··· × S_N tensor, such that each worker is associated with exactly one sub-tensor; see Fig. 5(a). In other words, we will have two types of tensors:

• The small data tensors obtained by partitioning Y:
$$Y^{(s)} = Y^{(s_1, s_2, \ldots, s_N)} \in \mathbb{R}^{L_{s_1} \times L_{s_2} \times \cdots \times L_{s_N}}, \qquad (19)$$
where s = [s_1, s_2, ..., s_N]^T is used to specify the (s_1, s_2, ..., s_N)th block of Y, s_n ∈ N_{S_n}, n = 1, 2, ..., N; see Fig. 5(a) for an illustration.

• The computing resource tensor P ∈ N_+^{S_1×S_2×···×S_N}, formed by reshaping [1, 2, ..., W], which identifies the W = S_1 S_2 ··· S_N workers and the associated data tensors. The entries of P have a one-to-one correspondence with the sub-tensors Y^(s), meaning that Worker p(s) processes the sub-tensor Y^(s); see Fig. 5(b).

³We have assumed that each sub-tensor can be handled efficiently by the associated worker; otherwise we increase the number of workers and decrease the size of each sub-tensor accordingly.

It can be seen that the RandTucker and the RandTucker2i mainly use four basic operations: 1) mode-n unfolding of a tensor; 2) a matrix-matrix product to obtain Y(n)Ω(n); 3) generating a Gaussian random matrix Ω(n); and 4) estimating an orthonormal basis of the range of Y(n)Ω(n). Below we discuss how to perform these four operations in a distributed computing system.

Mode-n unfolding of the tensor Y: We unfold the computing resource tensor P ∈ N_+^{S_1×S_2×···×S_N} along mode n to obtain P(n) ∈ N_+^{S_n×K}, K = ∏_{j≠n} S_j. Let p(n)(s_n, k) be the (s_n, k)th entry of P(n), and define
$$s = \pi_{s_n k} = \mathrm{ind2sub}\bigl(p_{(n)}(s_n, k)\bigr), \qquad s_n = 1, 2, \ldots, S_n, \quad k = 1, 2, \ldots, \prod_{j \neq n} S_j, \qquad (20)$$
where the function ind2sub(·) converts an entry of P(n), i.e., the worker identifier p(n)(s_n, k), into an N-by-1 vector of subscript values determining the position of p(n)(s_n, k) in the tensor P, which allows us to locate the sub-tensor associated with this worker. In other words, (20) means that the tensor Y^{π_{s_n k}} is processed by Worker p(n)(s_n, k). Then it can be seen that

$$Y_{(n)} = \begin{bmatrix} Y^{\pi_{11}}_{(n)} & Y^{\pi_{12}}_{(n)} & \cdots & Y^{\pi_{1K}}_{(n)} \\ Y^{\pi_{21}}_{(n)} & Y^{\pi_{22}}_{(n)} & \cdots & Y^{\pi_{2K}}_{(n)} \\ \vdots & \vdots & \ddots & \vdots \\ Y^{\pi_{S_n 1}}_{(n)} & Y^{\pi_{S_n 2}}_{(n)} & \cdots & Y^{\pi_{S_n K}}_{(n)} \end{bmatrix} \in \mathbb{R}^{I_n \times \prod_{j \neq n} I_j}, \qquad (21)$$
where Y^{π_{s_n k}}_{(n)} is the mode-n matricization of Y^{π_{s_n k}}, processed by Worker p(n)(s_n, k), s_n ∈ N_{S_n} and k ∈ N_K. In summary, to obtain Y(n) we perform the mode-n unfolding of the blocks Y^(s) simultaneously; the (s_n, k)th sub-matrix of Y(n) is then held by Worker p(n)(s_n, k), as shown in (21).
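To make the bookkeeping in (20) and (21) concrete, the toy snippet below builds the 2 × 4 × 3 computing resource tensor of Fig. 5 and mimics MATLAB's ind2sub with NumPy's unravel_index (column-major order, so the numbers match the figure); it is only meant to illustrate the index mapping, not an actual distributed implementation.

```python
import numpy as np

S = (2, 4, 3)                                     # block grid of Fig. 5(a)
W = int(np.prod(S))
P = np.arange(1, W + 1).reshape(S, order='F')     # worker ids 1..24, column-major as in the figure

def ind2sub(worker_id, shape=S):
    # 1-based worker id -> 1-based block subscripts (s1, ..., sN), MATLAB-style column-major
    return tuple(int(i) + 1 for i in np.unravel_index(worker_id - 1, shape, order='F'))

n = 1                                             # mode 2 in the paper's 1-based notation
P_n = np.moveaxis(P, n, 0).reshape(S[n], -1, order='F')   # mode-n unfolding P_(n)

print(P_n[2, 2])        # -> 13, i.e. p_(2)(3, 3) = 13
print(ind2sub(13))      # -> (1, 3, 2): worker 13 holds the sub-tensor Y^(1,3,2), matching Fig. 5
```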

Matrix-matrix product: A large-scale matrix-matrix product can be performed by using any existing technique. As the mode-n unfolding matrix Y(n) has already been partitioned and distributed here, we apply the matrix partitioning technique: we compute the product Y(n)Ω(n) by using
$$\begin{bmatrix} Z^{(n)}_{1} \\ Z^{(n)}_{2} \\ \vdots \\ Z^{(n)}_{S_n} \end{bmatrix} = Y_{(n)}\Omega^{(n)} = Y_{(n)} \begin{bmatrix} \Omega^{(n)}_{1} \\ \Omega^{(n)}_{2} \\ \vdots \\ \Omega^{(n)}_{K} \end{bmatrix} \in \mathbb{R}^{I_n \times \bar{R}_n}, \qquad (22)$$
where Y(n) is given by (21) and Ω(n)_k ∈ R^{∏_{j≠n} L_{S_j} × R̄_n} are the sub-matrices obtained by partitioning the random Gaussian matrix Ω(n)

8

50

100

150

200

250

300

350

400

1x108 2x108 3x108 5x108 7x108 1x109 1x109

Scale of the input data ( I N)

Tim

e (s)

FFCPCPminusALSPARACOMP

(a)

2

4

6

8

10

12

14

16

18

1x109 1x1012 8x1012 3x1013 6x1013 1x1014

Scale of the input data (I N)

Tim

e (s

)

(b)

Fig. 6: (a) Time consumption of the FFCP, the CP-ALS, and the PARACOMP, on condition that the data tensor of size I × I × I can be loaded into memory; for the FFCP, the RandTucker2i was used to compress the data, with p = 15. (b) Scalability of the FFCP algorithm provided that the given big tensor was in its Tucker format; the time consumed by the FFCP increases nearly linearly with respect to I.

into K parts. From (22) we have Z(n)_{s_n} = Σ_{k=1}^{K} Y^{π_{s_n k}}_{(n)} Ω(n)_k; all the products Y^{π_{s_n k}}_{(n)} Ω(n)_k are computed on the Workers p(π_{s_n k}) simultaneously and then broadcast to Worker p(n)(s_n, k_0), typically with k_0 = 1. Finally, for each s_n ∈ N_{S_n}, we sum the Y^{π_{s_n k}}_{(n)} Ω(n)_k over k on Worker p(n)(s_n, k_0) to yield Z(n)_{s_n}.

Generating the Gaussian random matrix Ω(n): Note that the workers identified by the kth column of P(n), i.e., the workers p(n)(s_n, k), s_n = 1, 2, ..., S_n, share the same random Gaussian matrix Ω(n)_k (corresponding to the kth column of blocks of Y(n) in (21)). Hence Ω(n)_k can be generated on one worker, say p(n)(1, k), and then broadcast to all the workers p(n)(s_n, k) with s_n ≠ 1. However, this way suffers from heavy communication overhead. A more economical method is to broadcast the parameters (or the random number generators) that control the random number generation; the random matrices Ω(n)_k are then generated on all workers by using the received random number generators, so that, for each k, the workers p(n)(s_n, k), s_n ∈ N_{S_n}, generate identical random Gaussian matrices Ω(n)_k simultaneously.

Estimating an orthonormal basis of the columns of Y(n)Ω(n): To this end, we perform the eigenvalue decomposition of Σ_{s_n=1}^{S_n} Z(n)_{s_n}^T Z(n)_{s_n} = UΣU^T. Then the mode-n factor matrix U(n) with orthonormal columns can be computed from
$$U^{(n)}_{s_n} = Z^{(n)}_{s_n}\, U\, \Sigma^{-1/2} \qquad (23)$$
in parallel on the workers p(n)(s_n, 1), where U(n)_{s_n} is the s_nth sub-matrix of U(n), s_n ∈ N_{S_n}.

After U(n) has been estimated, Y is updated as Y ← Y ×_n U(n)T, which can be realized using the above mode-n tensor unfolding and matrix-matrix product. After this update, the nth dimension of each Y^(s) is reduced from L_{s_n} to R_n. We repeat the above procedure for n = 1, 2, ..., N. As the last step, the core tensor is computed as G = Σ_{w=1}^{W} Y^(w), i.e., the sum of the sub-tensors distributed on all workers.
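A serial NumPy simulation of the per-block computations (22) and (23) is sketched below; in an actual deployment each block would live on its own worker, whereas here the "workers" are simply entries of nested Python lists, and all names are illustrative.

```python
import numpy as np

def block_randomized_basis(Y_blocks, Omega_blocks, rank_bar):
    """Serial simulation of (22)-(23): Y_blocks[sn][k] plays the role of the unfolded
    sub-tensor held by worker p_(n)(sn, k); Omega_blocks[k] is the Gaussian block shared
    by the workers in column k. Returns the stacked factor U^(n) with orthonormal columns."""
    Sn = len(Y_blocks)
    K = len(Omega_blocks)
    # Each block row accumulates Z_sn = sum_k Y^{pi_snk}_(n) Omega_k, cf. (22)
    Z = [sum(Y_blocks[sn][k] @ Omega_blocks[k] for k in range(K)) for sn in range(Sn)]
    # Eigen-decomposition of the small Gram matrix sum_sn Z_sn^T Z_sn
    Gram = sum(Zs.T @ Zs for Zs in Z)
    w, Ug = np.linalg.eigh(Gram)
    idx = np.argsort(w)[::-1][:rank_bar]
    w, Ug = w[idx], Ug[:, idx]
    # U_sn = Z_sn Ug diag(w)^{-1/2}, computed independently per block row, cf. (23)
    U_blocks = [Zs @ Ug / np.sqrt(w) for Zs in Z]
    return np.vstack(U_blocks)
```

The result has orthonormal columns because U^T U = Σ^{-1/2} U_g^T (Σ_{s_n} Z_{s_n}^T Z_{s_n}) U_g Σ^{-1/2} = I, which is the reason only the small R̄_n × R̄_n Gram matrix ever needs to be gathered on a single worker.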

C. Nonnegative Tucker Decomposition

In our recent papers [19], [31] we also showed how to perform nonnegative Tucker decomposition efficiently based on a Tucker approximation. The idea of that method is similar to the FFCP method: after the original high-dimensional raw data has been replaced by its Tucker approximation with low multilinear rank, the computational complexity of the gradients with respect to the nonnegative core tensor and factor matrices can be significantly reduced in terms of both time and space. Readers are referred to [19], [31] for more details. In summary, Tucker compression can play a fundamental role in achieving highly efficient tensor decompositions, as illustrated in Fig. 1. Based on the proposed randomized Tucker decomposition methods, we can easily perform various types of constrained/unconstrained tensor decompositions for very large-scale tensor data.

V. SIMULATIONS

A. Synthetic data

In this subsection we investigate the performance of the proposed FFCP algorithm by using synthetic data. All the experiments were performed in 64-bit MATLAB on a computer with an Intel Core i7 CPU (3.33 GHz) and 24 GB memory, running 64-bit Windows 7.


TABLE II: Comparison between the algorithms when they were applied to perform CPD of synthetic data. The noise was Gaussian with SNR = 10 dB in all cases.

Algorithm   | N = 3, I = 200, R = 10  | N = 3, I = 500, R = 20  | N = 6, I = 20, R = 5
            | Fit (%)       Time (s)  | Fit (%)       Time (s)  | Fit (%)       Time (s)
CP-ALS      | 82.1 ± 10.2     395     | 80.5 ± 6.8      640     | 89.6 ± 11.2     188
FastCPALS   | 79.8 ± 9.0      104     | 82.9 ± 5.6      211     | 92.4 ± 10.9      66
PARAFAC     | 82.5 ± 7.9      782     | 80.5 ± 6.1     1583     | 87.6 ± 11.4     643
FFCP        | 81.9 ± 8.0       35     | 82.8 ± 5.6       40     | 93.4 ± 8.6       50

TABLE III: Comparison between the algorithms when they were applied to perform nonnegative CPD of synthetic data. The noise was Gaussian with SNR = 10 dB in all cases.

Algorithm   | N = 3, I = 200, R = 10  | N = 3, I = 500, R = 20  | N = 6, I = 20, R = 5
            | Fit (%)       Time (s)  | Fit (%)       Time (s)  | Fit (%)       Time (s)
CP-NMU      | 97.3 ± 1.2      664     | 97.6 ± 0.8    16952     | 91.7 ± 5.6     4265
CP-HALS     | 97.8 ± 1.0      121     | 97.7 ± 0.7     2281     | 95.1 ± 4.1     1014
FFCP-MU     | 99.0 ± 0.5       23     | 99.3 ± 0.2      121     | 93.4 ± 4.1       70
FFCP-HALS   | 99.1 ± 0.1       61     | 99.5 ± 0.1      299     | 94.3 ± 3.5      118

Simulations on CPD: In each run a tensor Y* was generated by using the model (2), where the entries of the latent factor matrices A(n) ∈ R^{I×R} (i.e., the ground truth) were drawn from i.i.d. standard normal distributions, n = 1, 2, ..., N. Then independent Gaussian noise (with SNR = 10 dB) was added to Y* to yield the noisy observation tensor Y. We tested three different combinations of N, I, and R, as shown in TABLE II. The algorithms CP-ALS and PARAFAC, which are the standard algorithms included in the Tensor Toolbox 2.5 [32] and the N-way toolbox [33], respectively, were compared as the baseline. The FastCPALS algorithm [34] was also compared. For all the algorithms, the maximum number of iterations was set to 1000, and a stopping criterion |Fit(Y, Ŷ_t) − Fit(Y, Ŷ_{t+1})| < 10^{-6} was applied, where Ŷ_t is the estimate of Y* obtained after the tth iteration. The mean values and standard deviations of Fit(Y*, Ŷ) over 20 Monte-Carlo runs are detailed in TABLE II, along with the average time consumed by each algorithm. From the table we see that the FFCP algorithm is significantly faster than the other algorithms, without significant deterioration of performance, although it works on a compressed tensor instead of the original tensor. We also tested the PARACOMP [35] method, which was specially proposed for big tensor decompositions. It turned out that the PARACOMP was sensitive to noise: in noise-free cases it often achieved a perfect Fit of 1; otherwise, when the noise was 10 dB, it generally obtained a Fit lower than 0.3.

In Fig. 6 we demonstrate the scalability of the proposed FFCP method, where the RandTucker2i was used to compress the data with p = 15. From Fig. 6(a) we can see that the FFCP shows very promising scalability: its time consumption increases almost linearly with respect to the scale of the input data, on condition that the rank is fixed; in contrast, the time consumed by the other algorithms increases nearly exponentially. Note that both the FFCP and the PARACOMP method [35] were designed mainly for big data analytics, and this comparison does not precisely reflect their actual time consumption difference in a distributed computational environment. However, it does show that the proposed method is more efficient than the PARACOMP, which can also be inferred from two obvious facts: 1) the PARACOMP compresses the observed big tensor at least P times (P ≥ I/R), whereas in our RandTucker2i-based FFCP we only need to do this twice (or even once when RandTucker is applied); 2) although both of them need to perform CPD on small compressed tensors, the PARACOMP method needs to do this P times, whereas the FFCP needs to do it only once, as the final step. Particularly, in the PARACOMP any failure of the CPDs of the P sub-tensors will degenerate the final results. Fig. 6(b) shows the scalability of the FFCP provided that the given big tensor was in its Tucker format. For tensors of such scale, if we used the original full tensor directly, all the algorithms ran out of memory; but by using its Tucker compression, the FFCP can perform constrained/unconstrained CPD efficiently even on our desktop computer.

Simulations on Nonnegative CPD: In this experiment the data were generated as above, except that the latent factor matrices were generated from i.i.d. exponential distributions with the mean parameter μ = 10. Then we randomly selected 10% of the entries of each factor and set them to zero to generate sparse factor matrices. As the PARAFAC algorithm was very slow, we only compared the CP-NMU algorithm, a standard nonnegative CPD algorithm based on the multiplicative update rule included in the Tensor Toolbox [32], and the CP-HALS algorithm [36], which is based on the hierarchical ALS (HALS) method. For a fair comparison, their corresponding FFCP implementations were compared. The results averaged over 20 Monte-Carlo runs are detailed in TABLE III. From the table, the FFCP implementations were not only significantly faster than their competitors but also often achieved higher Fit values. To the best of our knowledge, the FFCP is one of the most efficient algorithms for nonnegative CPD.

To demonstrate the high efficiency of the FFCP algorithm, in this simulation the entries of A(n) ∈ R^{10000×10}, n = 1, 2, 3, 4, were drawn from i.i.d. exponential distributions with the mean parameter μ = 10. Again, we randomly selected 20% of the entries of each A(n) and set them to zero to obtain sparse factor matrices. By using (2) and the A(n) generated above, a big 4th-order tensor Y ∈ R^{10000×10000×10000×10000} with R = 10 was obtained, which contains a total of 10^16 entries in its full format. It is certainly impossible to decompose such a huge tensor on a standard desktop computer. Suppose we have already obtained its Tucker approximation Y_T = ⟦G; Ā(1), Ā(2), Ā(3), Ā(4)⟧ ≈ Y. Due to the rotation ambiguity of Tucker decompositions, we know that theoretically Ā(n) = A(n)Q(n), where Q(n) is not necessarily nonnegative. Generally, applying CPD on the core tensor G directly cannot guarantee the nonnegativity of the factor matrices. Instead, we applied the FFCP algorithm on the Tucker-compressed tensor Y_T to extract nonnegative factor matrices. In a typical run the FFCP-HALS consumed only 77.4 seconds, and the recovered components had a signal-to-interference ratio (SIR)⁴ higher than 200 dB, i.e., they almost perfectly matched the true components. Although this example only shows an ideal noise-free case, it illustrates how the FFCP algorithm can be applied to perform nonnegative CPD of huge tensors, as long as their good low-rank approximations can be obtained.

B. The COIL-100 Images Clustering Analysis

In this experiment we applied the proposed NTD algorithms to the clustering analysis of objects selected from the Columbia Object Image Library (COIL-100). The COIL-100 database consists of 7200 images of 100 objects, each of which has 72 images taken from different poses; hence these images naturally form a 128 × 128 × 3 × 7200 tensor Y. The data tensor was decomposed by using tensor decomposition algorithms, and then the two t-Distributed Stochastic Neighbor Embedding (t-SNE) components of A(4) were used as features to cluster and visualize the results. The K-means algorithm was used to cluster the objects, and the accuracy of clustering was defined as in [37]. To justify the proposed distributed RandTucker algorithm, we used the MATLAB Parallel Computing Toolbox, which allows us to use the full processing power of multicore desktops by executing applications on workers (MATLAB computational engines) that run locally. The tensor Y was partitioned into 4 sub-tensors with the size of 128 × 128 × 3 × 1800 and was distributed on 4 workers. As the baseline, the Tucker-ALS algorithm included in [32] was compared, and its multilinear rank was empirically set as R = (20, 20, 3, 20); for the RandTucker methods, R = (30, 30, 3, 50) was set. Both of these algorithms used 2 iterations. Their performances are detailed in TABLE IV. It can be seen that no significant deterioration was observed between the randomized

⁴The performance index SIR is defined as SIR(a_r^(n), â_r^(n)) = 20 log(‖a_r^(n)‖₂ / ‖â_r^(n) − a_r^(n)‖₂), where â_r^(n) is an estimate of a_r^(n); both a_r^(n) and â_r^(n) are normalized to be zero-mean and unit-variance.

TABLE IV: Comparison between the algorithms when they were applied to the clustering analysis of the COIL-100 objects.

Algorithm            | Fit (%) | Accuracy (%)
Tucker-ALS           |  70.3   |  69.2
RandTucker2i         |  69.1   |  68.4
RandTucker           |  68.6   |  69.8
FFCP (Tucker-ALS)    |  67.1   |  71.3
FFCP (RandTucker2i)  |  64.7   |  68.0


Fig. 7: Visualization of the clustering results obtained by applying the distributed RandTucker2i to the COIL-100 image database. The two t-SNE components of A(4) were used for visualization.

methods and the deterministic methods, which suggests that the distributed RandTucker algorithm can be applied to really large-scale problems. The clustering results obtained by the RandTucker2i are visualized in Fig. 7 by using the two t-SNE components of A(4).

C. UCF Sports Action Videos Analysis

This data set consists of a set of videos collected from various sports which are typically featured on broadcast television channels [38] (see http://crcv.ucf.edu/data/UCF_Sports_Action.php). To avoid rescaling the videos and to keep all the videos in the same specification, we selected 5 actions, Diving, Riding-Horse, Run, Swing, and Walk-Front, each of which contains 10 video clips with the size of 404 × 720 × 3 × 35 (35 is the number of frames). We concatenated these videos to form a 5th-order tensor. This tensor has about 1.6 × 10^9 entries, and more than 99% of them are nonzero. When it is saved in double format it consumes about 11.7 GB of space.

Fig. 8: The UCF Sports Action videos. The first row shows a snapshot of each action and the second row shows 5 snapshots from a video of Diving.

We used tensor decomposition methods to decompose this tensor to extract low-dimensional features for further clustering analysis. When we applied the CP-ALS method included in the Tensor Toolbox to decompose 30 videos, it consumed about 4 hours for only one iteration; when 40 videos were considered, the CP-ALS ran out of memory. We configured a small cluster with 3 computers (a desktop computer plus two laptop computers, each of which has an Intel i7 dual-core CPU and 8 GB physical memory, running Windows 7 and MATLAB 2012b). In the first stage, the RandTucker2i was applied with R = [35, 35, 3, 10, 20]. It consumed about 104 seconds to obtain the Tucker approximation of the original tensor, with a Fit of 72.3% to the observation data. Then the FFCP was applied using a desktop computer with R = 10. In this case only 7.5 seconds were required without nonnegativity constraints, and 7 seconds were consumed when nonnegativity constraints were imposed. Finally, we used A(5) as features to cluster the videos, and achieved an accuracy of 58% in a typical run. This experiment showed that the proposed algorithms can be applied to decompose very large-scale tensors by using a distributed computing cluster.

VI. CONCLUSION

Tensor decompositions are particularly promising for big multiway data analytics, as they are able to discover more complex internal structures and correlations of data gathered from multiple aspects, devices, trials, and subjects. While most existing tensor decomposition methods are not designed to meet the major challenges posed by big data, in this paper we proposed a unified framework for the decomposition of big tensors with relatively low multilinear rank. The key idea is that, based on the Tucker approximation of a big tensor, a large set of unconstrained/constrained tensor decompositions can be performed efficiently. We developed a flexible fast algorithm for the CP decomposition of tensors using their Tucker representation, and a distributed randomized Tucker decomposition approach which is particularly suitable for big tensor analysis. Simulation results confirmed that the proposed algorithms are highly efficient and scalable.

While the randomized Tucker decomposition approach showed close-to-optimal performance, the corresponding error bounds need to be investigated theoretically in the future. We also believe many randomized (or related) approaches proposed for the low-rank approximation of matrices [29], [39] could be applied to Tucker decompositions; however, a comparative study in tensor decomposition scenarios is desired. Particularly, if we had efficient and robust unconstrained CP decomposition algorithms, we could use the CP model to compress the raw data instead of the Tucker model, which suffers from the curse of dimensionality for very high-order tensors. Hence, developing more reliable and robust CP decomposition approaches is another important task. All these open issues deserve further study in the near future.

REFERENCES

[1] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Chichester, UK: John Wiley & Sons, Ltd, 2009.
[2] N. Sidiropoulos, R. Bro, and G. Giannakis, "Parallel factor analysis in sensor array processing," IEEE Transactions on Signal Processing, vol. 48, no. 8, pp. 2377–2388, Aug. 2000.
[3] L. De Lathauwer, J. Castaing, and J. Cardoso, "Fourth-order cumulant-based blind identification of underdetermined mixtures," IEEE Transactions on Signal Processing, vol. 55, no. 6, pp. 2965–2973, 2007.
[4] P. Comon, "Tensors: A brief introduction," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 44–53, May 2014.
[5] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
[6] N. D. Sidiropoulos and R. Bro, "On the uniqueness of multilinear decomposition of N-way arrays," Journal of Chemometrics, vol. 14, no. 3, pp. 229–239, 2000.
[7] G. Zhou and A. Cichocki, "Canonical polyadic decomposition based on a single mode blind source separation," IEEE Signal Processing Letters, vol. 19, no. 8, pp. 523–526, Aug. 2012.
[8] L.-H. Lim and P. Comon, "Nonnegative approximations of nonnegative tensors," Journal of Chemometrics, vol. 23, no. 7-8, pp. 432–441, 2009.
[9] E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos, "ParCube: Sparse parallelizable tensor decompositions," in Proceedings of the 2012 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I, ser. ECML PKDD'12. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 521–536.
[10] A. Phan and A. Cichocki, "Advances in PARAFAC using parallel block decomposition," in Neural Information Processing, ser. Lecture Notes in Computer Science, C. Leung, M. Lee, and J. Chan, Eds., vol. 5863. Springer Berlin Heidelberg, 2009, pp. 323–330.
[11] U. Kang, E. Papalexakis, A. Harpale, and C. Faloutsos, "GigaTensor: Scaling tensor analysis up by 100 times - algorithms and discoveries," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '12. New York, NY, USA: ACM, 2012, pp. 316–324.
[12] C. Tsourakakis, "MACH: Fast randomized tensor decompositions," in Proceedings of the 2010 SIAM International Conference on Data Mining, 2010, pp. 689–700.
[13] P. Drineas and M. W. Mahoney, "A randomized algorithm for a tensor-based generalization of the singular value decomposition," Linear Algebra and its Applications, vol. 420, no. 2–3, pp. 553–571, 2007.
[14] C. F. Caiafa and A. Cichocki, "Generalizing the column-row matrix decomposition to multi-way arrays," Linear Algebra and its Applications, vol. 433, no. 3, pp. 557–573, 2010.
[15] M. Vasilescu and D. Terzopoulos, "Multilinear analysis of image ensembles: TensorFaces," in Computer Vision — ECCV 2002, ser. Lecture Notes in Computer Science, A. Heyden, G. Sparr, M. Nielsen, and P. Johansen, Eds. Springer Berlin Heidelberg, 2002, vol. 2350, pp. 447–460.
[16] O. Duchenne, F. Bach, I.-S. Kweon, and J. Ponce, "A tensor-based algorithm for high-order graph matching," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2383–2395, Dec. 2011.
[17] A. Rajwade, A. Rangarajan, and A. Banerjee, "Image denoising using the higher order singular value decomposition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 4, pp. 849–862, 2013.
[18] V. De Silva and L.-H. Lim, "Tensor rank and the ill-posedness of the best low-rank approximation problem," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 3, pp. 1084–1127, Sep. 2008.
[19] G. Zhou, A. Cichocki, Q. Zhao, and S. Xie, "Nonnegative matrix and tensor factorizations: An algorithmic perspective," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 54–65, May 2014.
[20] G. Zhou, A. Cichocki, and S. Xie, "Fast nonnegative matrix/tensor factorization based on low-rank approximation," IEEE Transactions on Signal Processing, vol. 60, no. 6, pp. 2928–2940, June 2012.
[21] V. Cevher, S. Becker, and M. Schmidt, "Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 32–43, Sept. 2014.
[22] R. Bro and C. A. Andersson, "Improving the speed of multiway algorithms: Part II: Compression," Chemometrics and Intelligent Laboratory Systems, vol. 42, no. 1–2, pp. 105–113, 1998.
[23] N. Sidiropoulos, E. Papalexakis, and C. Faloutsos, "Parallel randomly compressed cubes: A scalable distributed architecture for big tensor decomposition," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 57–70, Sept. 2014.
[24] P. Comon, X. Luciani, and A. L. F. de Almeida, "Tensor decompositions, alternating least squares and other tales," Journal of Chemometrics, vol. 23, no. 7-8, pp. 393–405, 2009.
[25] J. Cohen, R. Farias, and P. Comon, "Fast decomposition of large nonnegative tensors," IEEE Signal Processing Letters, vol. 22, no. 7, pp. 862–866, July 2015.
[26] L. De Lathauwer, B. De Moor, and J. Vandewalle, "A multilinear singular value decomposition," SIAM Journal on Matrix Analysis and Applications, vol. 21, pp. 1253–1278, 2000.
[27] E. Papalexakis, N. Sidiropoulos, and R. Bro, "From k-means to higher-way co-clustering: Multilinear decomposition with sparse latent factors," IEEE Transactions on Signal Processing, vol. 61, no. 2, pp. 493–506, Jan. 2013.
[28] Z. Qian, X. Chen, N. Kang, M. Chen, Y. Yu, T. Moscibroda, and Z. Zhang, "MadLINQ: Large-scale distributed matrix computation for the cloud," in Proceedings of the 7th ACM European Conference on Computer Systems, ser. EuroSys '12. New York, NY, USA: ACM, 2012, pp. 197–210.
[29] N. Halko, P. Martinsson, and J. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," SIAM Review, vol. 53, no. 2, pp. 217–288, 2011.
[30] N. Vervliet, O. Debals, L. Sorber, and L. De Lathauwer, "Breaking the curse of dimensionality using decompositions of incomplete tensors: Tensor-based scientific computing in big data analysis," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 71–79, Sept. 2014.
[31] G. Zhou, A. Cichocki, Q. Zhao, and S. Xie, "Efficient nonnegative Tucker decompositions: Algorithms and uniqueness," IEEE Transactions on Image Processing (submitted), 2014. [Online]. Available: http://arxiv.org/abs/1404.4412
[32] B. W. Bader and T. G. Kolda, "MATLAB tensor toolbox version 2.5," Feb. 2012. [Online]. Available: http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox
[33] C. A. Andersson and R. Bro, "The N-way toolbox for MATLAB," 2000. [Online]. Available: http://www.models.life.ku.dk/source/nwaytoolbox
[34] H. Phan, P. Tichavsky, and A. Cichocki, "Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations," IEEE Transactions on Signal Processing, vol. 61, no. 19, pp. 4834–4846, 2013.
[35] N. Sidiropoulos, E. Papalexakis, and C. Faloutsos, "A parallel algorithm for big tensor decomposition using randomly compressed cubes (PARACOMP)," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1–5.
[36] A. Cichocki and A.-H. Phan, "Fast local algorithms for large scale nonnegative matrix and tensor factorizations," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences (invited paper), vol. E92-A, no. 3, pp. 708–721, 2009.
[37] D. Cai, X. He, J. Han, and T. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1548–1560, Aug. 2011.
[38] M. Rodriguez, J. Ahmed, and M. Shah, "Action MACH: a spatio-temporal maximum average correlation height filter for action recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2008 (CVPR 2008), June 2008, pp. 1–8.
[39] R. Baraniuk, V. Cevher, and M. Wakin, "Low-dimensional models for dimensionality reduction and signal recovery: A geometric perspective," Proceedings of the IEEE, vol. 98, no. 6, pp. 959–971, June 2010.

  • I Introduction
  • II Overview of Tucker and Polyadic Decompositions
  • III Efficient CP Decompositions of Tensors Based on Their Tucker Approximations
    • III-A Overview of CPD Algorithms
    • III-B CPD of Big Tensors by Based on Their Tucker Compression
      • IV Distributed Randomized Tucker Decompositions (RandTucker)
        • IV-A Randomized Tucker Decomposition
        • IV-B Distributed Randomized Tucker Decomposition
        • IV-C Nonnegative Tucker Decomposition
          • V Simulations
            • V-A Synthetic data
            • V-B The COIL-100 Images Clustering Analysis
            • V-C UCF Sports Action Videos Analysis
              • VI Conclusion
              • References

4

results the PARACOMP actually requires that all the CPDsin the intermediate step converge globally which is oftenunrealistic Hence we believe it is important to develop morerobust and scalable algorithms for the CPD especially whenY is dense In the meanwhile we should avoid using CPDsas intermediate steps due to the local convergence issue

Let us go into more detail of the Tucker+CP approachIn this approach the big tensor Y is approximated by itsHOSVD of Y asymp Y = JGU(1)U(2) middot middot middot U(N)K whereU(n) isin RIntimesRn is computed from the truncated SVD of Y(n)

with U(n)TU(n) = I then the CPD is applied to the muchsmaller tensor G isin RR1timesR2timesmiddotmiddotmiddottimesRN such that

G asymp JV(1)V(2) middot middot middot V(N)K (10)

thereby leading to the CPD of Y asymp JA(1)A(2) middot middot middot A(N)Kwith A(n) = U(n)V(n) For large-scale problems very oftenIn Rn and the CPD of (10) can be efficiently solved evenusing a standard desktop computer The Tucker+CP routinewhich is originally proposed first in [22] is very efficientfor the big tensors with low multilinear ranks Howeverthis framework is not directly applicable if it is necessaryto impose additional constraints such as nonnegativity andsparsity on factor matrices Very recently Cohen Farias andComon proposed a novel efficient method to deal with thenonnegativity of factor matrices [25] Our objective in thispaper is to develop a more general and flexible scheme whichcan not only achieve similar efficiency and scalability but alsoallow to impose any additional or optional constraints on factormatrices very easily

B CPD of Big Tensors by Based on Their Tucker Compression

The Tucker decomposition is suitable for pre-processingand compressing data because it is significantly faster thanthe CPD and it has upper error bounds guarantee [26] Inthis subsection we consider a slightly modified version ofthe Tucker+CP approach to overcome its limitations In thefirst step the Tucker compression is performed such thatY asymp Y = JGU(1)U(2) middot middot middot U(N)K as in the Tucker+CPapproach In the second step however we perform the CPD onthe whole tensor Y instead of the core tensor G by minimizingthe following cost function

J(A(n)) =1

2

∥∥∥Yminus JA(1)A(2) middot middot middot A(N)K∥∥∥2

F

st A(n) isin Cn(11)

where Cn denotes some constraints imposed on A(n)With (11) the term Y(n)B

(n) in the gradient partJpartA(n) can be

efficiently computed from

Y(n)B(n) asymp U(n)G(n)

[⊙p 6=n

V(p)] (12)

whereV(n) = U(n)TA(n) isin RRntimesR (13)

Algorithm 1 The Flexible Fast Nonnegative CP (FFCP)Decomposition Algorithm

Require Y isin RI1timesI2middotmiddotmiddottimesIN the output rank R1 Perform Tucker decomposition to obtain

Y asymp JGU(1)U(2) middot middot middot U(N)K Initialize A(n) andV(n) = U(n)TA(n)

2 repeat3 for n isin NN do4 Efficiently compute H(n) = G(n)

[⊙p 6=n V(p)

]

5 Compute Y(n)B(n) asymp U(n)H(n) and B(n)TB(n) =

~p 6=n(A(p)TA(p))

6 Update A(n) by using (6) or (7)(8) to imposenonnegativity constraints andor (17) to improve thesparsity of components

7 V(n) larr U(n)TA(n)8 end for9 until a stopping criterion is satisfied

10 return GlarrY A(n) n isin N

Note that V(n) is quite small as Rn le R and it can be com-puted by using distributed systems or the matrix partitioningmethod when In is really huge As such the ALS update rule(6) can be replaced by

A(n) larr U(n)G(n)

p 6=nV(p)

(~p 6=n

A(p)TA(p)

)minus1

(14)Because the approximation Y may contain negative entries(7) can be replaced by

A(n) larr A(n) ~P+

(U(n)G(n)

(⊙p 6=n V(p)

))

A(n)(~p 6=n A(p)TA(p)

) (15)

As a result the nonnegativity of factor matrices can be easilyincorporated in the proposed scheme If we hope to extractsparse latent component we consider the l1-norm regularizedCPD proposed in [27] and minimize J(A(n)) + c

∥∥A(n)∥∥1

thereby leading to

A(n) larr Sc(A(n)) (16)

where the sparsity operator Sc(middot) is element-wisely defined as

Sc(x) =

xminus c if x gt c

0 if minus c le x le cx+ c if x lt minusc

In summary most first-order based constrainedunconstrainedCPD algorithms can benefit from the proposed methodApproximately the computational complexity is reducedfrom O(Rn

prodpisinNN

Ip) in (5) to only about O(InR2n +

RnprodpisinNN

Rp) due to using a compact representation of Y

5

minus20 minus15 minus10 minus5 0 5 10 15 20

60

65

70

75

80

85

90

95

Noise level (dB)

Fit

Tucker+CP

FFCPDirect CP

Fig 3 The Fit values averaged over 50 Monte-Carlo runsobtained by 3 different ways (ie the Direct CP that appliesCPD on Y directly the FFCP on the Tucker format of Y asympJGU(1)U(2) middot middot middot U(N)K and the Tucker+CP on the coretensor G first and then construct the CP approximation) toperform CPD on a synthetic tensor Y isin R50times50times50times50 withR = 10 The data tensor Y was generated by using (2) anddifferent levels of Gaussian noise (SNR in dB) were added

It is worth noticing that if Y = JGU(1)U(2) middot middot middot U(N)Kwith U(n)U(n)T = I minimizing (11) is equivalent tosolving

minV(n)

J(V(n)) =1

2

∥∥∥Gminus JV(1)V(2) middot middot middot V(N)K∥∥∥F

st A(n) = U(n)V(n) isin Cn(17)

where V(n) is defined in (13) and Cn denotes some con-straints on A(n) Indeed (17) has been considered in [25]specially for NTF Although this method essentially has thesame computational complexity as the FFCP method the mainadvantage of (11) is that any off-the-shelf (constrained) CPDmethods can be directly applied with efficient computation ofthe gradient terms (9) and (12) without the need of designinga new algorithm to solve (17)

In summary not only does (11) benefit from the sameadvantage of low-computational complexity as the Tucker+CPmethod it is also quite flexible because additional constraintson A(n) can be easily incorporated such as nonnegativityandor sparsity For this reason the new method is calledthe flexible fast CP (FFCP) decomposition algorithm and thecorresponding pseudo code is briefly listed in Algorithm 11In Fig3 we compared the Tucker+CP the FFCP (solving(11)) and the standard CPD (solving (3)) algorithms Thedata tensor Y isin R50times50times50times50 was generated by using (2)

1 Although in Algorithm 1 only the LS update and multiplicative updateare listed any update rules based on the gradient (eg HALS see [19] forexample) could be incorporated into the scheme of FFCP straightforwardly

with R = 10 and contaminated by different levels of noise (inSNR) ranging from -20dB to 20dB In the figure we showedtheir Fit values averaged over 50 Monte-Carlo runs whereFit = 1 minus

∥∥∥Yminus Y∥∥∥FYF It turns out that the Tucker

compression generally will not degenerate the accuracy Onthe contrary it may improve the stableness of CPD algorithms

Although the matrices V(n) in (12) are generallysmall construction of the tall (

prodk 6=nRk)-by-Rn matrix

ie⊙

k 6=n V(p) and direct computation of H(n) =

G(n)

[⊙p 6=n V(p)

]can be avoided by applying either one of

the following methodsbull Notice that the rth column of H(n) can be computed

using h(n)r = G timesp 6=n a

(p)r

T which allows H(n) to becomputed in parallel by using the shared data tensor G

but without constructing⊙

k 6=n V(p) explicitly

bull The term G(n)

[⊙p 6=n V(p)

]can be computed in parallel

by using the technique adopted in the GigaTensor [11]Here G(n) is generally not sparse but relatively small

Moreover H(n) = G(n)

[⊙p 6=n V(p)

]can be viewed as

the gradient term when we perform the CPD of G asympJV(1)V(2) middot middot middot V(N)K (see (5) and (17)) For this reasonmost techniques which have been proposed to speed upstandard CP decompositions can be applied to compute H(n)

For very large-scale data, where the $I_n$ are extremely large, the products $A^{(n)T}A^{(n)}$ and $U^{(n)T}A^{(n)}$ can be performed on distributed systems, see, e.g., [28], or simply computed by using the matrix partitioning technique on a standard computer.
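For instance, a Gram matrix such as $A^{(n)T}A^{(n)}$ can be accumulated over row blocks so that only one block resides in memory at a time; the block size and the function name below are illustrative choices, not part of the original algorithm.

```python
import numpy as np

def gram_by_row_blocks(A, block_rows=100_000):
    """Accumulate A.T @ A over row blocks, touching one block at a time."""
    R = A.shape[1]
    G = np.zeros((R, R))
    for start in range(0, A.shape[0], block_rows):
        block = A[start:start + block_rows]   # in practice loaded from disk or a worker
        G += block.T @ block
    return G

A = np.random.randn(250_000, 10)
assert np.allclose(gram_by_row_blocks(A), A.T @ A)
```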

In summary, even with very limited computational resources, the FFCP algorithm can perform the (nonnegative) CPD of very large-scale tensors, provided that their low-rank Tucker representation can be obtained. Notice that the Tucker approximation of a given tensor plays a fundamental role in the proposed FFCP method, as well as in the Tucker+CP method, the method proposed in [25], and PARACOMP [35], and the HOSVD often serves as the workhorse to perform Tucker compression. However, the HOSVD involves the eigenvalue decomposition of the huge matrices $Y_{(n)}$, which makes it unsuitable for large-scale problems [35]. In the next section we focus on how to overcome this limitation.

IV. DISTRIBUTED RANDOMIZED TUCKER DECOMPOSITIONS (RANDTUCKER)

A. Randomized Tucker Decomposition

In the Tucker decomposition of (1), the factor matrices $U^{(n)}$ are estimated by minimizing the following LS problems:
\[
J(U^{(n)})=\left\|Y_{(n)}-U^{(n)}B^{(n)T}\right\|_F, \tag{18}
\]
where $B^{(n)}=\bigl[\bigotimes_{k\neq n}U^{(k)}\bigr]G_{(n)}^T$. In the HOSVD [26], the factor matrix $U^{(n)}$ is estimated as the leading singular vectors


of $Y_{(n)}$. However, performing the SVD is prohibitive for very large-scale problems. Recently, a set of probabilistic algorithms has been developed to discover the low-rank structure of huge matrices [29]. Here we consider the following randomized range finder (RRF) algorithm (Algorithm 4.1 of [29]) to estimate the basis of $U^{(n)}$:
1) Draw a random Gaussian matrix $\Omega\in\mathbb{R}^{\prod_{p\neq n}I_p\times\bar{R}_n}$;
2) Obtain the $I_n\times\bar{R}_n$ matrix $Z^{(n)}=Y_{(n)}\Omega^{(n)}$;
3) Compute $U^{(n)}\in\mathbb{R}^{I_n\times\bar{R}_n}$ as an orthonormal basis of $Z^{(n)}\in\mathbb{R}^{I_n\times\bar{R}_n}$ by using, e.g., the QR decomposition.
In the above, $\bar{R}_n=R_n+p$, where $p$ is referred to as the oversampling parameter, and $p=5$ or $p=10$ was suggested in [29]. Based on this procedure we propose a randomized Tucker decomposition (RandTucker) algorithm to perform Tucker decompositions, see Algorithm 2. Compared with the HOSVD algorithm, the proposed RandTucker algorithm is highly parallelizable and suitable for distributed computation.
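A minimal single-machine NumPy sketch of this randomized range finder step for mode $n$ might look as follows; the helper names and the default oversampling value are ours.

```python
import numpy as np

def unfold(Y, n):
    """Mode-n matricization: mode n becomes the rows, remaining modes are flattened."""
    return np.moveaxis(Y, n, 0).reshape(Y.shape[n], -1)

def rrf_basis(Y, n, Rn, p=10, seed=0):
    """Randomized range finder (RRF) for the mode-n subspace of tensor Y."""
    rng = np.random.default_rng(seed)
    Yn = unfold(Y, n)                                   # I_n x prod_{p != n} I_p
    Omega = rng.standard_normal((Yn.shape[1], Rn + p))  # random Gaussian test matrix
    Z = Yn @ Omega                                      # I_n x (R_n + p) sketch
    Q, _ = np.linalg.qr(Z)                              # orthonormal basis of range(Z)
    return Q
```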

Several randomized algorithms have already been proposed for Tucker decompositions, such as the method using fiber sampling [14], [30]. Compared with them, the proposed RandTucker method has an important advantage: it is able to obtain a nearly optimal approximation even for very noisy data by using only one additional ALS-like iteration, which is referred to as RandTucker2i; see Algorithm 3 and Fig. 4 for the performance comparison. In this figure, a 200-by-200-by-200 tensor $\mathcal{Y}$ was generated in each run by using (1) with $R_n=10$; then different levels of Gaussian noise were added to $\mathcal{Y}$ to generate the observation tensor. We plotted the Fit values (and the standard deviations) obtained by the RandTucker (with $p=10$), the RandTucker2i ($p=10$), and the HOSVD, averaged over 50 Monte-Carlo runs. From the figure, the RandTucker achieved slightly worse Fit values, especially when the noise was heavy, while the RandTucker2i achieved almost the same Fit values as the HOSVD. As the RandTucker and the RandTucker2i do not involve the singular value decomposition of big matrices and they are highly parallelizable, they are very promising for the Tucker decompositions of big tensors.

B. Distributed Randomized Tucker Decomposition

In the above RandTucker and RandTucker2i algorithms, we have assumed that all the data can be processed in physical memory directly. For big data this could be unrealistic; a common scenario is that only partial data can be loaded in memory. In this subsection we discuss how to solve this problem by using a computer cluster (in this paper we used the MATLAB Distributed Computing Server²). Here we consider a cluster which is composed of:
• a number of workers with identity labels, each of which is a computational engine that executes computational tasks simultaneously; in the meanwhile, they may communicate with each other and with the client as well;

² See www.mathworks.com/help/pdf_doc/mdce/mdce.pdf

Algorithm 2: The Randomized Tucker Decomposition (RandTucker) Algorithm
Require: $\mathcal{Y}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ and the multilinear rank $(R_1, R_2, \ldots, R_N)$.
1: for $n = 1, 2, \ldots, N$ do
2:   Compute $Z^{(n)}=Y_{(n)}\Omega^{(n)}$, where $\Omega^{(n)}$ is a $(\prod_{k\neq n}I_k)$-by-$R_n$ random Gaussian matrix.
3:   Compute $U^{(n)}$ as an orthonormal basis of $Z^{(n)}$ by using, e.g., the QR decomposition.
4:   $\mathcal{Y}\leftarrow\mathcal{Y}\times_n U^{(n)T}$, $I_n\leftarrow R_n$.
5: end for
6: $\mathcal{G}\leftarrow\mathcal{Y}$.
7: return $\mathcal{Y}\approx[\![\mathcal{G};\,U^{(1)},U^{(2)},\ldots,U^{(N)}]\!]$.
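For reference, the following NumPy sketch mirrors Algorithm 2 on a single machine, without oversampling or any distribution; all function names are ours.

```python
import numpy as np

def unfold(Y, n):
    return np.moveaxis(Y, n, 0).reshape(Y.shape[n], -1)

def mode_n_multiply(Y, M, n):
    """Compute Y x_n M, i.e. multiply the mode-n fibers of Y by the matrix M."""
    Z = np.tensordot(M, np.moveaxis(Y, n, 0), axes=([1], [0]))
    return np.moveaxis(Z, 0, n)

def rand_tucker(Y, ranks, seed=0):
    """Single-machine sketch of the RandTucker algorithm (Algorithm 2)."""
    rng = np.random.default_rng(seed)
    factors = []
    for n, Rn in enumerate(ranks):
        Yn = unfold(Y, n)
        Omega = rng.standard_normal((Yn.shape[1], Rn))  # (prod of other dims) x R_n
        Q, _ = np.linalg.qr(Yn @ Omega)                 # orthonormal basis U^(n)
        factors.append(Q)
        Y = mode_n_multiply(Y, Q.T, n)                  # project: the n-th dim shrinks to R_n
    return Y, factors                                   # Y is now the core tensor G

# toy usage: compress a random 30 x 30 x 30 tensor to multilinear rank (5, 5, 5)
G, U = rand_tucker(np.random.randn(30, 30, 30), (5, 5, 5))
```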

Algorithm 3: The RandTucker Algorithm with 2 Iterations only (RandTucker2i)
Require: $\mathcal{Y}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ and the multilinear rank $(R_1, R_2, \ldots, R_N)$.
1: Initialize $U^{(n)}\in\mathbb{R}^{I_n\times R_n}$ as random Gaussian matrices.
2: Repeat lines 3-7 twice:
3: for $n = 1, 2, \ldots, N$ do
4:   $\mathcal{X}=\mathcal{Y}\times_{p\neq n}U^{(p)T}$.
5:   Compute $Z^{(n)}=X_{(n)}\Omega^{(n)}$, where $\Omega^{(n)}$ is a $(\prod_{k\neq n}R_k)$-by-$R_n$ random Gaussian matrix.
6:   Compute $U^{(n)}$ as an orthonormal basis of $Z^{(n)}$ by using, e.g., the QR decomposition.
7: end for
8: $\mathcal{G}\leftarrow\mathcal{Y}\times_1 U^{(1)T}\times_2 U^{(2)T}\cdots\times_N U^{(N)T}$.
9: return $\mathcal{Y}\approx[\![\mathcal{G};\,U^{(1)},U^{(2)},\ldots,U^{(N)}]\!]$.
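Similarly, a single-machine sketch of Algorithm 3 performs two ALS-like sweeps in which every mode except the current one is first projected onto the running factor estimates; again, the names and the serial structure are our own simplifications.

```python
import numpy as np

def unfold(Y, n):
    return np.moveaxis(Y, n, 0).reshape(Y.shape[n], -1)

def mode_n_multiply(Y, M, n):
    Z = np.tensordot(M, np.moveaxis(Y, n, 0), axes=([1], [0]))
    return np.moveaxis(Z, 0, n)

def rand_tucker_2i(Y, ranks, sweeps=2, seed=0):
    """Single-machine sketch of RandTucker2i (Algorithm 3): two ALS-like sweeps."""
    rng = np.random.default_rng(seed)
    N = Y.ndim
    U = [rng.standard_normal((Y.shape[n], Rn)) for n, Rn in enumerate(ranks)]
    for _ in range(sweeps):
        for n in range(N):
            X = Y
            for p in range(N):                           # project all modes except n
                if p != n:
                    X = mode_n_multiply(X, U[p].T, p)
            Xn = unfold(X, n)
            Omega = rng.standard_normal((Xn.shape[1], ranks[n]))
            U[n], _ = np.linalg.qr(Xn @ Omega)           # orthonormal basis of range(X_(n) Omega)
    G = Y
    for n in range(N):                                   # core tensor G = Y x_1 U^(1)T ... x_N U^(N)T
        G = mode_n_multiply(G, U[n].T, n)
    return G, U
```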

Fig. 4. Performance comparison between the RandTucker, the RandTucker2i, and the HOSVD algorithms in terms of the mean value of Fit (%) and the standard deviation over 50 Monte-Carlo runs, for noise levels (SNR) ranging from 0 to 30 dB.

• a client, which is used to schedule tasks to workers, gather results from workers, and return the final results.


Fig. 5. Demonstration of the small data tensors and the virtual computing resource tensor $\mathcal{P}$ in a cluster. (a) An example $\mathcal{Y}\in\mathbb{R}^{2I\times 4J\times 3K}$ is partitioned into smaller blocks $\mathcal{Y}^{(\mathbf{s})}$ with the same size $I\times J\times K$, thereby leading to the $2\times4\times3$ computing resource tensor $\mathcal{P}$ identifying a total of 24 workers. (b) An entry $p(\mathbf{s})\in\{1,2,\ldots,\prod_n S_n\}$ of the computing resource tensor $\mathcal{P}$ indicates that Worker $p(\mathbf{s})$ processes the sub-tensor $\mathcal{Y}^{(\mathbf{s})}$, where $\mathbf{s}=[s_1,s_2,\ldots,s_N]^T=\mathrm{ind2sub}(p(\mathbf{s}))$. Hence in (a), $p(1,3,2)=13$ means that Worker 13 processes the sub-tensor $\mathcal{Y}^{(1,3,2)}$, as $[1,3,2]=\mathrm{ind2sub}(13)$ and $13=p_{(2)}(3,3)$.

Given a big $N$-th-order tensor $\mathcal{Y}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$, we partition each dimension $I_n$ into a total of $S_n$ disjoint subsets, each of which consists of $L_{s_n}$ indices, with $\sum_{s_n=1}^{S_n}L_{s_n}=I_n$. As such, the original big tensor is partitioned into a total of $W=S_1S_2\cdots S_N$ sub-tensors, and each sub-tensor can be denoted by $\mathcal{Y}^{(s_1,s_2,\ldots,s_N)}$ with the size of $L_{s_1}\times L_{s_2}\times\cdots\times L_{s_N}$, where $s_n\in\mathbb{N}_{S_n}$, $n=1,2,\ldots,N$. At the same time, suppose we have a computer cluster with $W$ workers (computational engines)³. We imagine that the $W$ workers are virtually spatially permuted to form an $S_1\times S_2\times\cdots\times S_N$ tensor such that each worker is associated with exactly one sub-tensor, see Fig. 5(a). In other words, we will have two types of tensors:
• The small data tensors obtained by partitioning $\mathcal{Y}$:
\[
\mathcal{Y}^{(\mathbf{s})}=\mathcal{Y}^{(s_1,s_2,\ldots,s_N)}\in\mathbb{R}^{L_{s_1}\times L_{s_2}\times\cdots\times L_{s_N}}, \tag{19}
\]
where $\mathbf{s}=[s_1\; s_2\;\cdots\; s_N]^T$ is used to specify the $(s_1,s_2,\ldots,s_N)$-th block of $\mathcal{Y}$, $s_n\in\mathbb{N}_{S_n}$, $n=1,2,\ldots,N$; see Fig. 5(a) for an illustration.
• The computing resource tensor $\mathcal{P}\in\mathbb{N}_+^{S_1\times S_2\times\cdots\times S_N}$, formed by reshaping $[1,2,\ldots,W]$, that identifies the $W=S_1S_2\cdots S_N$ workers and the associated data tensors. The entries of $\mathcal{P}$ have a one-to-one correspondence with the sub-tensors $\mathcal{Y}^{(\mathbf{s})}$, meaning that Worker $p(\mathbf{s})$ processes the sub-tensor $\mathcal{Y}^{(\mathbf{s})}$, see Fig. 5(b).

³ We have assumed that each sub-tensor can be efficiently handled by the associated worker; otherwise we increase the number of workers and decrease the size of each sub-tensor accordingly.
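The index bookkeeping of Fig. 5 can be emulated with NumPy's column-major index conversion; the snippet below uses the $2\times4\times3$ worker grid of Fig. 5(a) and reproduces the Worker-13 example from the caption (the name ind2sub is borrowed from the text, the rest is illustrative).

```python
import numpy as np

S = (2, 4, 3)                                             # worker grid S_1 x S_2 x S_3, as in Fig. 5(a)
P = np.arange(1, np.prod(S) + 1).reshape(S, order="F")    # computing resource tensor P (1-based worker ids)

def ind2sub(worker_id):
    """Map a worker identifier p(s) to its 1-based subscripts s (column-major, MATLAB-style)."""
    return tuple(i + 1 for i in np.unravel_index(worker_id - 1, S, order="F"))

print(ind2sub(13))   # -> (1, 3, 2): Worker 13 handles sub-tensor Y^(1,3,2), as in Fig. 5
print(P[0, 2, 1])    # -> 13: the same correspondence read directly from P
```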

It can be seen that the RandTucker and the RandTucker2i mainly use four basic operations: 1) mode-$n$ unfolding of a tensor; 2) a matrix-matrix product to obtain $Y_{(n)}\Omega^{(n)}$; 3) generating a Gaussian random matrix $\Omega$; and 4) estimating an orthogonal basis of the range of $Y_{(n)}\Omega^{(n)}$. Below we discuss how to perform these four operations in a distributed computing system.

Mode-$n$ unfolding of tensor $\mathcal{Y}$: We unfold the computing resource tensor $\mathcal{P}\in\mathbb{N}_+^{S_1\times S_2\times\cdots\times S_N}$ along mode $n$ to obtain $P_{(n)}\in\mathbb{N}_+^{S_n\times K}$, $K=\prod_{j\neq n}S_j$. Let $p_{(n)}(s_n,k)$ be the $(s_n,k)$-th entry of $P_{(n)}$, and define
\[
\mathbf{s}=\pi_{s_n,k}=\mathrm{ind2sub}(p_{(n)}(s_n,k)),\quad s_n=1,2,\ldots,S_n,\; k=1,2,\ldots,\prod_{j\neq n}S_j, \tag{20}
\]
where the function $\mathrm{ind2sub}(\cdot)$ converts an entry of $P_{(n)}$, i.e., $p_{(n)}(s_n,k)$, a worker identifier, into an $N$-by-1 vector of subscript values determining the position of $p_{(n)}(s_n,k)$ in the tensor $\mathcal{P}$, which allows us to locate the sub-tensor associated with this worker. In other words, (20) means that the tensor $\mathcal{Y}^{\pi_{s_n,k}}$ is processed by Worker $p_{(n)}(s_n,k)$. Then it can be seen that

\[
Y_{(n)}=
\begin{bmatrix}
Y^{\pi_{1,1}}_{(n)} & Y^{\pi_{1,2}}_{(n)} & \cdots & Y^{\pi_{1,K}}_{(n)}\\
Y^{\pi_{2,1}}_{(n)} & Y^{\pi_{2,2}}_{(n)} & \cdots & Y^{\pi_{2,K}}_{(n)}\\
\vdots & \vdots & \ddots & \vdots\\
Y^{\pi_{S_n,1}}_{(n)} & Y^{\pi_{S_n,2}}_{(n)} & \cdots & Y^{\pi_{S_n,K}}_{(n)}
\end{bmatrix}
\in\mathbb{R}^{I_n\times\prod_{j\neq n}I_j}, \tag{21}
\]
where $Y^{\pi_{s_n,k}}_{(n)}$ is the mode-$n$ matricization of $\mathcal{Y}^{\pi_{s_n,k}}$ processed by Worker $p_{(n)}(s_n,k)$, $s_n\in\mathbb{N}_{S_n}$ and $k\in\mathbb{N}_K$.
In summary, to obtain $Y_{(n)}$ we perform the mode-$n$ unfolding of the $\mathcal{Y}^{(\mathbf{s})}$ simultaneously; then the $(s_n,k)$-th sub-matrix of $Y_{(n)}$ will be given by Worker $p_{(n)}(s_n,k)$, as shown in (21).

Matrix-matrix product: A large-scale matrix-matrix product can be performed by using any existing technique. As here the mode-$n$ unfolding matrix $Y_{(n)}$ has already been partitioned and distributed, we apply the matrix partitioning technique: we compute the product $Y_{(n)}\Omega^{(n)}$ by using

\[
\begin{bmatrix}
Z^{(n)}_{1}\\ Z^{(n)}_{2}\\ \vdots\\ Z^{(n)}_{S_n}
\end{bmatrix}
=Y_{(n)}\Omega^{(n)}
=Y_{(n)}
\begin{bmatrix}
\Omega^{(n)}_{1}\\ \Omega^{(n)}_{2}\\ \vdots\\ \Omega^{(n)}_{K}
\end{bmatrix}
\in\mathbb{R}^{I_n\times R_n}, \tag{22}
\]
where $Y_{(n)}$ is given by (21), and $\Omega^{(n)}_k\in\mathbb{R}^{(\prod_{j\neq n}L_{s_j})\times R_n}$ are the sub-matrices obtained by partitioning the random Gaussian matrix $\Omega^{(n)}$ into $K$ parts.


Fig. 6. (a) Time consumption comparison of the FFCP, the CP-ALS, and the PARACOMP, on condition that the data tensor of $I\times I\times I$ can be loaded into memory. For the FFCP, the RandTucker2i was used to compress the data, with $p=15$. (b) Scalability of the FFCP algorithm, provided that the given big tensor was in its Tucker format. The time consumed by the FFCP increases nearly linearly with respect to $I$.

From (22), we have $Z^{(n)}_{s_n}=\sum_{k=1}^{K}Y^{\pi_{s_n,k}}_{(n)}\Omega^{(n)}_k$, and all the products $Y^{\pi_{s_n,k}}_{(n)}\Omega^{(n)}_k$ are computed on the Workers $p(\pi_{s_n,k})$ simultaneously and then broadcast to Worker $p_{(n)}(s_n,k_0)$, typically $k_0=1$. Finally, for each $s_n\in\mathbb{N}_{S_n}$, we sum $Y^{\pi_{s_n,k}}_{(n)}\Omega^{(n)}_k$ over $k$ on Worker $p_{(n)}(s_n,k_0)$ to yield $Z^{(n)}_{s_n}$.
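A serial emulation of this block-wise product (no actual workers; the block sizes are arbitrary choices of ours) shows that summing the per-block products over $k$ reproduces the monolithic product in (22):

```python
import numpy as np

rng = np.random.default_rng(0)
I_blocks, J_blocks = [3, 4], [2, 5]       # row (mode-n) and column block sizes of Y_(n)
Rn = 4
# Y_(n) stored as a grid of blocks Y^{pi_{s,k}}_(n), one per (virtual) worker
Y_blocks = [[rng.standard_normal((i, j)) for j in J_blocks] for i in I_blocks]
Omega_blocks = [rng.standard_normal((j, Rn)) for j in J_blocks]   # Omega^(n) split into K parts

# Z^(n)_{s_n} = sum_k Y^{pi_{s_n,k}}_(n) Omega^(n)_k (each product would run on its own worker)
Z_blocks = [sum(Yb @ Ob for Yb, Ob in zip(row, Omega_blocks)) for row in Y_blocks]

# sanity check against the monolithic product
Y_full = np.block(Y_blocks)
Omega_full = np.vstack(Omega_blocks)
assert np.allclose(np.vstack(Z_blocks), Y_full @ Omega_full)
```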

Generating the Gaussian random matrix $\Omega^{(n)}$: Note that the workers identified by the $k$-th column of $P_{(n)}$, i.e., the workers $p_{(n)}(s_n,k)$, $s_n=1,2,\ldots,S_n$, share the same random Gaussian matrix $\Omega^{(n)}_k$ (corresponding to the $k$-th column of blocks of $Y_{(n)}$ in (21)). Hence $\Omega^{(n)}_k$ can be generated on one worker, say $p_{(n)}(1,k)$, and then broadcast to all the workers $p_{(n)}(s_n,k)$ with $s_n\neq1$. However, this way suffers from heavy communication overhead. A more economical method is to broadcast the parameters (or the random number generators) that control the random number generation; the random matrices $\Omega^{(n)}_k$ are then generated on all workers by using the received random number generators. As such, for each $k$ the workers $p_{(n)}(s_n,k)$, $s_n\in\mathbb{N}_{S_n}$, will generate identical random Gaussian matrices $\Omega^{(n)}_k$ simultaneously.
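The seed-broadcasting idea can be imitated by sending only one small integer per column index $k$; every worker then regenerates the same $\Omega^{(n)}_k$ locally without any matrix being transferred. The seeds and shapes below are illustrative.

```python
import numpy as np

def omega_block(seed, shape):
    """Regenerate the Gaussian block Omega^(n)_k deterministically from a broadcast seed."""
    return np.random.default_rng(seed).standard_normal(shape)

seeds = {k: 1000 + k for k in range(3)}    # one seed per column index k, broadcast to the workers
shape = (50, 12)                           # (prod_{j != n} L_{s_j}) x R_n for this k, illustrative only

# two different "workers" sharing column k = 1 obtain identical matrices
assert np.array_equal(omega_block(seeds[1], shape), omega_block(seeds[1], shape))
```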

Estimating an orthonormal basis of the columns of $Y_{(n)}\Omega^{(n)}$: To this end we perform the eigenvalue decomposition of $\sum_{s_n=1}^{S_n}Z^{(n)T}_{s_n}Z^{(n)}_{s_n}=U\Sigma U^{T}$. Then the mode-$n$ factor matrix $U^{(n)}$ with orthonormal columns can be computed from
\[
U^{(n)}_{s_n}=Z^{(n)}_{s_n}U\Sigma^{-1/2} \tag{23}
\]
in parallel on the workers $p_{(n)}(s_n,1)$, where $U^{(n)}_{s_n}$ is the $s_n$-th sub-matrix of $U^{(n)}$, $s_n\in\mathbb{N}_{S_n}$.
After $U^{(n)}$ has been estimated, $\mathcal{Y}$ will be updated as $\mathcal{Y}\leftarrow\mathcal{Y}\times_n U^{(n)T}$, which can be realized using the above mode-$n$ tensor unfolding and matrix-matrix product. After the update, the $n$-th dimension of $\mathcal{Y}^{(\mathbf{s})}$ will be reduced to $R_n$ from $L_{s_n}$. We repeat the above procedure for $n=1,2,\ldots,N$. As the last step, the core tensor is computed as $\mathcal{G}=\sum_{w=1}^{W}\mathcal{Y}^{(w)}$, i.e., the sum of the sub-tensors distributed on all workers.
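A serial sketch of (23): each row block contributes $Z^{(n)T}_{s_n}Z^{(n)}_{s_n}$ to a small $R_n\times R_n$ matrix whose eigendecomposition then orthonormalizes all blocks at once. Names are ours, and near-zero eigenvalues would need extra care in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
Z_blocks = [rng.standard_normal((m, 5)) for m in (40, 60, 30)]   # Z^(n)_{s_n}, one per worker

# eigen-decomposition of sum_{s_n} Z^T Z = U Sigma U^T (a small R_n x R_n problem)
S = sum(Z.T @ Z for Z in Z_blocks)
evals, U = np.linalg.eigh(S)

# U^(n)_{s_n} = Z^(n)_{s_n} U Sigma^{-1/2}, computed block by block (in parallel on the workers)
U_blocks = [Z @ U @ np.diag(1.0 / np.sqrt(evals)) for Z in Z_blocks]

# stacking the blocks gives a matrix with orthonormal columns
U_full = np.vstack(U_blocks)
assert np.allclose(U_full.T @ U_full, np.eye(5), atol=1e-8)
```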

C. Nonnegative Tucker Decomposition

In our recent papers [19], [31], we also showed how to efficiently perform nonnegative Tucker decomposition based on a Tucker approximation. The idea of that method is similar to the FFCP method: after the original high-dimensional raw data has been replaced by its Tucker approximation with low multilinear rank, the computational complexity of the gradients with respect to the nonnegative core tensor and factor matrices can be significantly reduced in terms of both time and space. Readers are referred to [19], [31] for more details. In summary, Tucker compression can play a fundamental role in achieving highly efficient tensor decompositions, as illustrated in Fig. 1. Based on the proposed randomized Tucker decomposition methods, we can easily perform various types of constrained/unconstrained tensor decompositions for very large-scale tensor data.

V. SIMULATIONS

A. Synthetic Data

In this subsection we investigate the performance of the proposed FFCP algorithm by using synthetic data. All the experiments were performed in 64-bit MATLAB on a computer with an Intel Core i7 CPU (3.33 GHz) and 24 GB memory, running 64-bit Windows 7.


TABLE II: Comparison between the algorithms when they were applied to perform CPD of synthetic data. The noises were Gaussian with SNR = 10 dB in all cases.

              | N=3, I=200, R=10      | N=3, I=500, R=20      | N=6, I=20, R=5
Algorithm     | Fit (%)      Time (s) | Fit (%)      Time (s) | Fit (%)      Time (s)
CP-ALS        | 82.1 ± 10.2    39.5   | 80.5 ± 6.8     64.0   | 89.6 ± 11.2    18.8
FastCPALS     | 79.8 ± 9.0     10.4   | 82.9 ± 5.6     21.1   | 92.4 ± 10.9     6.6
PARAFAC       | 82.5 ± 7.9     78.2   | 80.5 ± 6.1    158.3   | 87.6 ± 11.4    64.3
FFCP          | 81.9 ± 8.0      3.5   | 82.8 ± 5.6      4.0   | 93.4 ± 8.6      5.0

TABLE III: Comparison between the algorithms when they were applied to perform nonnegative CPD of synthetic data. The noises were Gaussian with SNR = 10 dB in all cases.

              | N=3, I=200, R=10      | N=3, I=500, R=20      | N=6, I=20, R=5
Algorithm     | Fit (%)      Time (s) | Fit (%)      Time (s) | Fit (%)      Time (s)
CP-NMU        | 97.3 ± 1.2     66.4   | 97.6 ± 0.8   1695.2   | 91.7 ± 5.6    426.5
CP-HALS       | 97.8 ± 1.0     12.1   | 97.7 ± 0.7    228.1   | 95.1 ± 4.1    101.4
FFCP-MU       | 99.0 ± 0.5      2.3   | 99.3 ± 0.2     12.1   | 93.4 ± 4.1      7.0
FFCP-HALS     | 99.1 ± 0.1      6.1   | 99.5 ± 0.1     29.9   | 94.3 ± 3.5     11.8

Simulations on CPD: In each run, a tensor $\mathcal{Y}^{*}$ was generated by using the model (2), where the entries of the latent factor matrices $A^{(n)}\in\mathbb{R}^{I\times R}$ (i.e., the ground truth), $n=1,2,\ldots,N$, were drawn from i.i.d. standard normal distributions. Then independent Gaussian noise (with SNR = 10 dB) was added to $\mathcal{Y}^{*}$ to yield the noisy observation tensor $\mathcal{Y}$. We tested three different combinations of $N$, $I$, and $R$, as shown in TABLE II. The algorithms CP-ALS and PARAFAC, which are the standard algorithms included in the Tensor Toolbox 2.5 [32] and the N-way toolbox [33], respectively, were compared as the baseline. The FastCPALS algorithm [34] was also compared. For all the algorithms, the maximum iteration number was set as 1000, and a stopping criterion $|\mathrm{Fit}(\mathcal{Y},\hat{\mathcal{Y}}_t)-\mathrm{Fit}(\mathcal{Y},\hat{\mathcal{Y}}_{t+1})|<10^{-6}$ was applied, where $\hat{\mathcal{Y}}_t$ is the estimate of $\mathcal{Y}^{*}$ obtained after the $t$-th iteration. The mean values and standard deviations of $\mathrm{Fit}(\mathcal{Y}^{*},\hat{\mathcal{Y}})$ over 20 Monte-Carlo runs are detailed in TABLE II, along with the averaged time consumed by each algorithm. From the table we see that the FFCP algorithm is significantly faster than the other algorithms, but without significant deterioration of performance, although it works on a compressed tensor instead of the original tensor. We also tested the PARACOMP [35] method, which was specially proposed for big tensor decompositions. It turned out that the PARACOMP was sensitive to noise: in noise-free cases it often achieved a perfect Fit of 1; otherwise, when the noise was 10 dB, it generally obtained a Fit lower than 0.3.

In Fig. 6 we demonstrate the scalability of the proposed FFCP method, where the RandTucker2i was used to compress the data with $p=15$. From Fig. 6(a) we can see that the FFCP shows very promising scalability: its time consumption increases almost linearly with respect to the scale of the input data, on condition that the rank is fixed; in contrast, the time consumed by the other algorithms increases nearly exponentially. Note that both the FFCP and the PARACOMP method [35] were designed mainly for big data analytics, and this comparison does not precisely reflect their actual time consumption difference in a distributed computational environment. However, it does show that the proposed method is more efficient than the PARACOMP, which can also be inferred from two obvious facts: 1) the PARACOMP compresses the observation big tensor at least $P$ times ($P\ge I/R$), whereas in our RandTucker2i based FFCP we only need to do it twice (or even once when RandTucker is applied); 2) although both of them need to perform CPD on small compressed tensors, the PARACOMP method needs to do this $P$ times, whereas the FFCP needs to do it only once, as the final step. Particularly, in the PARACOMP any failure of the CPDs of the $P$ sub-tensors will degenerate the final results. Fig. 6(b) shows the scalability of the FFCP provided that the given big tensor was in its Tucker format. For tensors of such a scale, if we used the original full tensor directly, all the algorithms ran out of memory; but by using its Tucker compression, the FFCP can perform constrained/unconstrained CPD efficiently, even on our desktop computer.

Simulations on Nonnegative CPD: In these experiments the data were generated as above, except that the latent factor matrices were generated from i.i.d. exponential distributions with the mean parameter $\mu=10$. Then we randomly selected 10% of the entries of each factor to be zeros, to generate sparse factor matrices. As the PARAFAC algorithm was very slow, we only compared the CP-NMU algorithm, which is a standard nonnegative CPD algorithm based on the multiplicative update rule included in the Tensor Toolbox [32], and the CP-HALS algorithm [36] that is based on the hierarchical ALS (HALS) method. For a fair comparison, their corresponding implementations of FFCP were compared. The results averaged over 20 Monte-Carlo runs are detailed in TABLE III. From the table, the FFCP implementations were not only significantly faster than their competitors but also often achieved higher Fit values. To the best of our knowledge, the FFCP is one of the most efficient algorithms for nonnegative CPD.

To demonstrate the high efficiency of the FFCP algorithm, in this simulation the entries of $A^{(n)}\in\mathbb{R}^{10000\times10}$, $n=1,2,3,4$, were drawn from i.i.d. exponential distributions with the mean parameter $\mu=10$. Again, we randomly selected 20% of the entries of each $A^{(n)}$ to be zeros to obtain sparse factor matrices. By using (2) and the $A^{(n)}$ generated above, a big 4th-order tensor $\mathcal{Y}\in\mathbb{R}^{10000\times10000\times10000\times10000}$ with $R=10$ was obtained, which contains a total of $10^{16}$ entries in its full format. It is certainly impossible to decompose such a huge tensor on a standard desktop computer. Suppose we have already obtained its Tucker approximation $\mathcal{Y}_T=[\![\mathcal{G};\,\tilde{A}^{(1)},\tilde{A}^{(2)},\tilde{A}^{(3)},\tilde{A}^{(4)}]\!]\approx\mathcal{Y}$. Due to the rotation ambiguity of Tucker decompositions, we know theoretically that $\tilde{A}^{(n)}=A^{(n)}Q^{(n)}$, where $Q^{(n)}$ is not necessarily nonnegative. Generally, applying CPD on the core tensor $\mathcal{G}$ directly cannot guarantee the nonnegativity of the factor matrices. Instead, we applied the FFCP algorithm on the Tucker compressed tensor $\mathcal{Y}_T$ to extract nonnegative factor matrices. In a typical run, the FFCP-HALS consumed only 77.4 seconds, and the recovered components had signal-to-interference ratios⁴ (SIR) higher than 200 dB, i.e., they almost perfectly match the true components. Although this example only shows an ideal noise-free case, it illustrates how the FFCP algorithm can be applied to perform nonnegative CPD of huge tensors, as long as their good low-rank approximations can be obtained.
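For completeness, the SIR index reported here (see footnote 4) can be evaluated per component as in the sketch below; the standardization step, the sign alignment, and the function name are our own choices.

```python
import numpy as np

def sir_db(a_true, a_est):
    """SIR in dB between a true component and its estimate, both standardized first."""
    z = lambda v: (v - v.mean()) / v.std()
    a, a_hat = z(a_true), z(a_est)
    # the estimate is only defined up to sign, so take the better of +/- a_hat (our own convention)
    a_hat = a_hat if np.linalg.norm(a - a_hat) <= np.linalg.norm(a + a_hat) else -a_hat
    return 20 * np.log10(np.linalg.norm(a) / np.linalg.norm(a - a_hat))

a = np.random.randn(10000)
print(sir_db(a, a + 1e-10 * np.random.randn(a.size)))   # very high SIR for a near-perfect estimate
```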

B. The COIL-100 Images Clustering Analysis

In this experiment we applied the proposed NTD algorithms to the clustering analysis of objects selected from the Columbia Object Image Library (COIL-100). The COIL-100 database consists of 7200 images of 100 objects, each of which has 72 images taken from different poses; hence these images naturally form a $128\times128\times3\times7200$ tensor $\mathcal{Y}$. The data tensor was decomposed by using tensor decomposition algorithms, and then the two t-Distributed Stochastic Neighbor Embedding (t-SNE) components of $A^{(4)}$ were used as features to cluster and visualize the results. The K-means algorithm was used to cluster the objects, and the accuracy of clustering was defined as in [37]. To justify the proposed distributed RandTucker algorithm, we used the MATLAB Parallel Computing Toolbox, which allows us to use the full processing power of multicore desktops by executing applications on workers (MATLAB computational engines) that run locally. The tensor $\mathcal{Y}$ was partitioned into 4 sub-tensors with the size of $128\times128\times3\times1800$ and was distributed on 4 workers. As the baseline, the Tucker-ALS algorithm included in [32] was compared, and the multilinear rank was empirically set as $R=(20,20,3,20)$. For the RandTucker method, $R=(30,30,3,50)$ was set. Both of these two algorithms used 2 iterations. Their performances are detailed in TABLE IV. It can be seen that no significant deterioration was observed between the randomized

⁴ The performance index SIR is defined as $\mathrm{SIR}(a^{(n)}_r,\hat{a}^{(n)}_r)=20\log\bigl(\|a^{(n)}_r\|_2/\|a^{(n)}_r-\hat{a}^{(n)}_r\|_2\bigr)$, where $\hat{a}^{(n)}_r$ is an estimate of $a^{(n)}_r$; both $a^{(n)}_r$ and $\hat{a}^{(n)}_r$ are normalized to be zero-mean and unit-variance.

TABLE IV: Comparison between the algorithms when they were applied to the clustering analysis of COIL-100 objects.

Algorithm             Fit (%)   Accuracy (%)
Tucker-ALS             70.3        69.2
RandTucker2i           69.1        68.4
RandTucker             68.6        69.8
FFCP (Tucker-ALS)      67.1        71.3
FFCP (RandTucker2i)    64.7        68.0

Fig. 7. Visualization of the clustering results obtained by applying the distributed RandTucker2i on the COIL-100 image database. The two t-SNE components of $A^{(4)}$ were used for visualization (each point is labeled with its object index, 1-100).

methods and the deterministic methods, which suggests that the distributed RandTucker algorithm can be applied to really large-scale problems. The clustering results obtained by the RandTucker2i are visualized in Fig. 7 by using the two t-SNE components of $A^{(4)}$.

C. UCF Sports Action Videos Analysis

This data set consists of a set of videos collected from various sports which are typically featured on broadcast television channels [38] (see http://crcv.ucf.edu/data/UCF_Sports_Action.php). To avoid rescaling the videos and to keep all the videos in the same specification, we selected 5 actions of Diving, Riding-Horse, Run, Swing, and Walk-Front, each of which contains 10 video clips with the size of $404\times720\times3\times35$ (35 is the number of frames). We concatenated these videos to form a 5th-order tensor. This tensor has about $1.6\times10^{9}$ entries, and more than 99% of them are nonzero. When it is saved in double format, it consumes about 11.7 GB of space. We used tensor decomposition methods to decompose this


Fig. 8. The UCF Sports Action videos. The first row shows a snapshot of each action, and the second row shows 5 snapshots from a video of diving.

tensor to extract low-dimensional features for further clustering analysis. When we applied the CP-ALS method included in the Tensor Toolbox to decompose 30 videos, it consumed about 4 hours for only one iteration; when 40 videos were considered, the CP-ALS ran out of memory. We configured a small cluster with 3 computers (a desktop computer plus two laptop computers, each of which has an Intel i7 dual-core CPU and 8 GB physical memory, running Windows 7 and MATLAB 2012b). In the first stage, the RandTucker2i was applied with $R=[35, 35, 3, 10, 20]$. It consumed about 104 seconds to obtain the Tucker approximation of the original tensor, with a Fit of 72.3% to the observation data. Then the FFCP was applied using a desktop computer, with $R=10$. In this case only 75 seconds were required without nonnegativity constraints, and 7 seconds were consumed when nonnegativity constraints were imposed. Finally, we used $A^{(5)}$ as features to cluster the videos and achieved an accuracy of 58% in a typical run. This experiment showed that the proposed algorithms can be applied to decompose very large scale tensors by using a distributed computing cluster.

VI. CONCLUSION

Tensor decompositions are particularly promising for big multiway data analytics, as they are able to discover more complex internal structures and correlations of data gathered from multiple aspects, devices, trials, and subjects, etc. While most existing tensor decomposition methods are not designed to meet the major challenges posed by big data, in this paper we proposed a unified framework for the decomposition of big tensors with relatively low multilinear rank. The key idea is that, based on the Tucker approximation of a big tensor, a large set of unconstrained/constrained tensor decompositions can be efficiently performed. We developed a flexible fast algorithm for the CP decomposition of tensors using their Tucker representation, and a distributed randomized Tucker decomposition approach which is particularly suitable for big tensor analysis. Simulation results confirmed that the proposed algorithms are highly efficient and scalable.

While the randomized Tucker decomposition approach showed close-to-optimal performance, the corresponding error bounds need to be investigated theoretically in the future. We also believe that many randomized (or related) approaches proposed for the low-rank approximation of matrices [29], [39] could be applied to Tucker decompositions; however, a comparative study in the tensor decomposition scenario is desired. Particularly, if we have efficient and robust unconstrained CP decomposition algorithms, we can use the CP model to compress the raw data instead of the Tucker model, which suffers from the curse of dimensionality for very high-order tensors. Hence, developing more reliable and robust CP decomposition approaches is another important task. All these open issues deserve further study in the near future.

REFERENCES

[1] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Chichester, UK: John Wiley & Sons, Ltd, 2009.
[2] N. Sidiropoulos, R. Bro, and G. Giannakis, "Parallel factor analysis in sensor array processing," IEEE Transactions on Signal Processing, vol. 48, no. 8, pp. 2377-2388, Aug. 2000.
[3] L. De Lathauwer, J. Castaing, and J. Cardoso, "Fourth-order cumulant-based blind identification of underdetermined mixtures," IEEE Transactions on Signal Processing, vol. 55, no. 6, pp. 2965-2973, 2007.
[4] P. Comon, "Tensors: A brief introduction," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 44-53, May 2014.
[5] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455-500, 2009.
[6] N. D. Sidiropoulos and R. Bro, "On the uniqueness of multilinear decomposition of N-way arrays," Journal of Chemometrics, vol. 14, no. 3, pp. 229-239, 2000.
[7] G. Zhou and A. Cichocki, "Canonical polyadic decomposition based on a single mode blind source separation," IEEE Signal Processing Letters, vol. 19, no. 8, pp. 523-526, Aug. 2012.
[8] L.-H. Lim and P. Comon, "Nonnegative approximations of nonnegative tensors," Journal of Chemometrics, vol. 23, no. 7-8, pp. 432-441, 2009.
[9] E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos, "ParCube: Sparse parallelizable tensor decompositions," in Proceedings of the 2012 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I, ser. ECML PKDD'12. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 521-536.
[10] A. Phan and A. Cichocki, "Advances in PARAFAC using parallel block decomposition," in Neural Information Processing, ser. Lecture Notes in Computer Science, C. Leung, M. Lee, and J. Chan, Eds., vol. 5863. Springer Berlin Heidelberg, 2009, pp. 323-330.
[11] U. Kang, E. Papalexakis, A. Harpale, and C. Faloutsos, "GigaTensor: Scaling tensor analysis up by 100 times - algorithms and discoveries," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '12. New York, NY, USA: ACM, 2012, pp. 316-324.
[12] C. Tsourakakis, "MACH: Fast randomized tensor decompositions," in Proceedings of the 2010 SIAM International Conference on Data Mining, 2010, pp. 689-700.
[13] P. Drineas and M. W. Mahoney, "A randomized algorithm for a tensor-based generalization of the singular value decomposition," Linear Algebra and its Applications, vol. 420, no. 2-3, pp. 553-571, 2007.
[14] C. F. Caiafa and A. Cichocki, "Generalizing the column-row matrix decomposition to multi-way arrays," Linear Algebra and its Applications, vol. 433, no. 3, pp. 557-573, 2010.
[15] M. Vasilescu and D. Terzopoulos, "Multilinear analysis of image ensembles: TensorFaces," in Computer Vision - ECCV 2002, ser. Lecture Notes in Computer Science, A. Heyden, G. Sparr, M. Nielsen, and P. Johansen, Eds. Springer Berlin Heidelberg, 2002, vol. 2350, pp. 447-460.
[16] O. Duchenne, F. Bach, I.-S. Kweon, and J. Ponce, "A tensor-based algorithm for high-order graph matching," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2383-2395, Dec. 2011.


[17] A. Rajwade, A. Rangarajan, and A. Banerjee, "Image denoising using the higher order singular value decomposition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 4, pp. 849-862, 2013.
[18] V. De Silva and L.-H. Lim, "Tensor rank and the ill-posedness of the best low-rank approximation problem," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 3, pp. 1084-1127, Sept. 2008.
[19] G. Zhou, A. Cichocki, Q. Zhao, and S. Xie, "Nonnegative matrix and tensor factorizations: An algorithmic perspective," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 54-65, May 2014.
[20] G. Zhou, A. Cichocki, and S. Xie, "Fast nonnegative matrix/tensor factorization based on low-rank approximation," IEEE Transactions on Signal Processing, vol. 60, no. 6, pp. 2928-2940, June 2012.
[21] V. Cevher, S. Becker, and M. Schmidt, "Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 32-43, Sept. 2014.
[22] R. Bro and C. A. Andersson, "Improving the speed of multiway algorithms: Part II: Compression," Chemometrics and Intelligent Laboratory Systems, vol. 42, no. 1-2, pp. 105-113, 1998.
[23] N. Sidiropoulos, E. Papalexakis, and C. Faloutsos, "Parallel randomly compressed cubes: A scalable distributed architecture for big tensor decomposition," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 57-70, Sept. 2014.
[24] P. Comon, X. Luciani, and A. L. F. de Almeida, "Tensor decompositions, alternating least squares and other tales," Journal of Chemometrics, vol. 23, no. 7-8, pp. 393-405, 2009.
[25] J. Cohen, R. Farias, and P. Comon, "Fast decomposition of large nonnegative tensors," IEEE Signal Processing Letters, vol. 22, no. 7, pp. 862-866, July 2015.
[26] L. De Lathauwer, B. De Moor, and J. Vandewalle, "A multilinear singular value decomposition," SIAM Journal on Matrix Analysis and Applications, vol. 21, pp. 1253-1278, 2000.
[27] E. Papalexakis, N. Sidiropoulos, and R. Bro, "From k-means to higher-way co-clustering: Multilinear decomposition with sparse latent factors," IEEE Transactions on Signal Processing, vol. 61, no. 2, pp. 493-506, Jan. 2013.
[28] Z. Qian, X. Chen, N. Kang, M. Chen, Y. Yu, T. Moscibroda, and Z. Zhang, "MadLINQ: Large-scale distributed matrix computation for the cloud," in Proceedings of the 7th ACM European Conference on Computer Systems, ser. EuroSys '12. New York, NY, USA: ACM, 2012, pp. 197-210.
[29] N. Halko, P. Martinsson, and J. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," SIAM Review, vol. 53, no. 2, pp. 217-288, 2011.
[30] N. Vervliet, O. Debals, L. Sorber, and L. De Lathauwer, "Breaking the curse of dimensionality using decompositions of incomplete tensors: Tensor-based scientific computing in big data analysis," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 71-79, Sept. 2014.
[31] G. Zhou, A. Cichocki, Q. Zhao, and S. Xie, "Efficient nonnegative Tucker decompositions: Algorithms and uniqueness," IEEE Transactions on Image Processing (submitted), p. 13, 2014. [Online]. Available: http://arxiv.org/abs/1404.4412
[32] B. W. Bader and T. G. Kolda, "MATLAB tensor toolbox version 2.5," Feb. 2012. [Online]. Available: http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox
[33] C. A. Andersson and R. Bro, "The N-way toolbox for MATLAB," 2000. [Online]. Available: http://www.models.life.ku.dk/source/nwaytoolbox
[34] H. Phan, P. Tichavsky, and A. Cichocki, "Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations," IEEE Transactions on Signal Processing, vol. 61, no. 19, pp. 4834-4846, 2013.
[35] N. Sidiropoulos, E. Papalexakis, and C. Faloutsos, "A parallel algorithm for big tensor decomposition using randomly compressed cubes (PARACOMP)," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1-5.
[36] A. Cichocki and A.-H. Phan, "Fast local algorithms for large scale nonnegative matrix and tensor factorizations," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences (invited paper), vol. E92-A, no. 3, pp. 708-721, 2009.
[37] D. Cai, X. He, J. Han, and T. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1548-1560, Aug. 2011.
[38] M. Rodriguez, J. Ahmed, and M. Shah, "Action MACH: a spatio-temporal maximum average correlation height filter for action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), June 2008, pp. 1-8.
[39] R. Baraniuk, V. Cevher, and M. Wakin, "Low-dimensional models for dimensionality reduction and signal recovery: A geometric perspective," Proceedings of the IEEE, vol. 98, no. 6, pp. 959-971, June 2010.


5

minus20 minus15 minus10 minus5 0 5 10 15 20

60

65

70

75

80

85

90

95

Noise level (dB)

Fit

Tucker+CP

FFCPDirect CP

Fig 3 The Fit values averaged over 50 Monte-Carlo runsobtained by 3 different ways (ie the Direct CP that appliesCPD on Y directly the FFCP on the Tucker format of Y asympJGU(1)U(2) middot middot middot U(N)K and the Tucker+CP on the coretensor G first and then construct the CP approximation) toperform CPD on a synthetic tensor Y isin R50times50times50times50 withR = 10 The data tensor Y was generated by using (2) anddifferent levels of Gaussian noise (SNR in dB) were added

It is worth noticing that if Y = JGU(1)U(2) middot middot middot U(N)Kwith U(n)U(n)T = I minimizing (11) is equivalent tosolving

minV(n)

J(V(n)) =1

2

∥∥∥Gminus JV(1)V(2) middot middot middot V(N)K∥∥∥F

st A(n) = U(n)V(n) isin Cn(17)

where V(n) is defined in (13) and Cn denotes some con-straints on A(n) Indeed (17) has been considered in [25]specially for NTF Although this method essentially has thesame computational complexity as the FFCP method the mainadvantage of (11) is that any off-the-shelf (constrained) CPDmethods can be directly applied with efficient computation ofthe gradient terms (9) and (12) without the need of designinga new algorithm to solve (17)

In summary not only does (11) benefit from the sameadvantage of low-computational complexity as the Tucker+CPmethod it is also quite flexible because additional constraintson A(n) can be easily incorporated such as nonnegativityandor sparsity For this reason the new method is calledthe flexible fast CP (FFCP) decomposition algorithm and thecorresponding pseudo code is briefly listed in Algorithm 11In Fig3 we compared the Tucker+CP the FFCP (solving(11)) and the standard CPD (solving (3)) algorithms Thedata tensor Y isin R50times50times50times50 was generated by using (2)

1 Although in Algorithm 1 only the LS update and multiplicative updateare listed any update rules based on the gradient (eg HALS see [19] forexample) could be incorporated into the scheme of FFCP straightforwardly

with R = 10 and contaminated by different levels of noise (inSNR) ranging from -20dB to 20dB In the figure we showedtheir Fit values averaged over 50 Monte-Carlo runs whereFit = 1 minus

∥∥∥Yminus Y∥∥∥FYF It turns out that the Tucker

compression generally will not degenerate the accuracy Onthe contrary it may improve the stableness of CPD algorithms

Although the matrices V(n) in (12) are generallysmall construction of the tall (

prodk 6=nRk)-by-Rn matrix

ie⊙

k 6=n V(p) and direct computation of H(n) =

G(n)

[⊙p 6=n V(p)

]can be avoided by applying either one of

the following methodsbull Notice that the rth column of H(n) can be computed

using h(n)r = G timesp 6=n a

(p)r

T which allows H(n) to becomputed in parallel by using the shared data tensor G

but without constructing⊙

k 6=n V(p) explicitly

bull The term G(n)

[⊙p 6=n V(p)

]can be computed in parallel

by using the technique adopted in the GigaTensor [11]Here G(n) is generally not sparse but relatively small

Moreover H(n) = G(n)

[⊙p 6=n V(p)

]can be viewed as

the gradient term when we perform the CPD of G asympJV(1)V(2) middot middot middot V(N)K (see (5) and (17)) For this reasonmost techniques which have been proposed to speed upstandard CP decompositions can be applied to compute H(n)

straightforwardlyFor very large-scale data where In are extremely huge the

products of A(n)TA(n) and U(n)TA(n) can be performed ondistributed systems see eg [28] or simply computed byusing the matrix partitioning technique on a standard computer

In summary even with very limited computational resourcethe FFCP algorithm can perform the (nonnegative) CPD ofvery large-scale tensors provided that their low-rank Tuckerrepresentation can be obtained Notice that the Tucker ap-proximation of a given tensor plays a fundamental role in theproposed FFCP method as well as the Tucker+CP the methodproposed in [25] and the PARACOMP [35] and the HOSVDoften serves as workhorse to perform Tucker compressionHowever the HOSVD involves the eigenvalue decompositionof huge matrices Y(n) which makes it unsuitable for large-scale problems [35] In the next section we focus on how toovercome this limitation

IV DISTRIBUTED RANDOMIZED TUCKERDECOMPOSITIONS (RANDTUCKER)

A Randomized Tucker Decomposition

In the Tucker decomposition of (1) the factor matrices U(n)

are estimated by minimizing the following LS problems

J(U(n)) =∥∥∥Y(n) minusU(n)B(n)T

∥∥∥F (18)

where B(n) =[otimes

k 6=n U(k)]

GT(n) In the HOSVD [26] the

factor matrix U(n) is estimated as the leading singular vectors

6

of Y(n) However performing SVD is prohibitive for verylarge-scale problems Recently a set of probabilistic algorithmshave been developed to discover the low-rank structure of hugematrices [29] Here we consider the following randomizedrange finder (RRF) algorithm (Algorithm 41 of [29]) toestimate the basis of U(n)

1) Drawn a random Gaussian matrix Ω isin Rprod

p6=n IptimesRn 2) Obtain the In times Rn matrix Z(n) = Y(n)Ω

(n)3) Compute U(n) isin RIntimesRn as an orthonormal basis of

Z(n) isin RIntimesRn by using eg QR decompositionsIn the above Rn = Rn + p and p is referred to as theoversampling parameter and p = 5 or p = 10 was suggestedin [29] Based on this procedure we propose a randomizedTucker decomposition (RandTucker) algorithm to performTucker decompositions see Algorithm 2 Compared with theHOSVD algorithm the proposed RandTucker algorithm ishighly parallalizable and suitable for distributed computation

Several randomized algorithms have already been proposedfor Tucker decompositions such as the method using fibersampling [14] [30] Compared with them the proposedRandTucker method has an important advantage it is ableto obtain nearly optimal approximation even for very noisydata by using only one additional ALS-like iteration whichis referred to as RandTucker2i see Algorithm 3 and Fig4for the performance comparison In this figure a 200-by-200-by-200 tensor Y was generated in each run by using the (1)with Rn = 10 then different levels of Gaussian noise wasadded to Y to generate the observation tensor We plottedthe Fit values (and the standard variations) obtained by theRandTucker (with p = 10) the RandTucker2i (p = 10) andthe HOSVD averaged over 50 Monte-Carlo runs From thefigure the RandTucker achieved slightly worse Fit valuesespecially when noise was heavy while the RandTucker2iachieved almost the same Fit values as the HOSVD Asthe RandTucker and the RandTucker2i do not involve thesingular value decomposition of big matrices and they arehighly parallelizable they are very promising in the Tuckerdecompositions of big tensors

B Distributed Randomized Tucker DecompositionIn the above RandTucker and RandTucker2i algorithms we

have assumed that all the data can be processed in physicalmemory directly For big data this could be unrealistic Acommon scenario is that only partial data can be loaded inmemory In this subsection we discuss how to solve thisproblem by using a computer cluster (in this paper we used theMATLAB Distributed Computing Server2) Here we considera cluster which is composed ofbull a number of workers with identity labels each of which is

a computational engine that executes computational taskssimultaneously In the meanwhile they may communicatewith each other and the client as well

2See wwwmathworkscomhelppdf docmdcemdcepdf

Algorithm 2 The Randomized Tucker Decomposition (Rand-Tucker) Algorithm

Require Y isin RI1timesI2middotmiddotmiddottimesIN and the multilinear rank(R1 R2 RN )

1 for n = 1 2 N do2 Compute Z(n) = Y(n)Ω

(n) where Ω(n) is an(prodk 6=n Ik)-by-Rn random Gaussian matrix

3 Compute U(n) as an orthonormal basis of Z(n) byusing eg the QR decomposition

4 Ylarr Ytimesn U(n)T In larr Rn5 end for6 GlarrY7 return Y asymp JGU(1)U(2) middot middot middot U(N)K

Algorithm 3 The RandTucker Algorithm with 2 Iterationsonly (RandTucker2i)

Require Y isin RI1timesI2middotmiddotmiddottimesIN and the multilinear rank(R1 R2 RN )

1 Initialize U(n) isin RIntimesRn as random Gaussian matrices2 Repeat lines (3)-(7) twice3 for n = 1 2 N do4 X = Ytimesp 6=n U(p)5 Compute Z(n) = X(n)Ω

(n) where Ω(n) is an(prodk 6=n Rk)-by-Rn random Gaussian matrix

6 Compute U(n) as an orthonormal basis of Z(n) byusing eg the QR decomposition

7 end for8 Glarr Ytimes1 U(1) times2 U(2) middot middot middot timesN U(N)9 return Y asymp JGU(1)U(2) middot middot middot U(N)K

0 5 10 15 20 25 3020

30

40

50

60

70

80

90

100

Noise level (SNR in dB)

Fit (

)

RandTucker

RandTucker2i

HOSVD

Fig 4 Performance comparison between the RandTucker theRandTucker2i and the HOSVD algorithm in terms of meanvalue of Fit and the standard derivation over 50 Monte-Carloruns

bull a client which is used to schedule tasks to workers andgather results from workers and return final results

7

Workeramp17amp Workeramp19amp Workeramp21amp Workeramp23amp

Workeramp9amp Workeramp11amp Workeramp15amp

Workeramp1amp Workeramp3amp Workeramp5amp

Workeramp13amp

Workeramp7amp

Y(111)

Y(211)

Y(121)

Y(221) Y(231)

Y(131) Y(141)

Y(241)

(a)

P 2 NS1S2middotmiddotmiddotSN+

= snk = ind2subp(n)(sn k)

s = ind2sub(p(s1 s2 sN ))

(Using P(n))

(Using P)

Worker p(s)

p(s)Y(s)

(b)

Fig 5 Demonstration of small data tensors and the virtual computing resource tensor P in a cluster (a) An example Y isinR2Itimes4Jtimes3K is partitioned into smaller blocks Y(s) with the same size I times J timesK thereby leading to the 2times 4times 3 computingresource tensor P identifying a total of 24 workers (b) An entry p(s) isin 1 2 prodn Sn of the computing resource tensorP indicates that Worker p(s) processes the sub-tensor Y(s) where s = [s1 s2 sN ]T = ind2sub(p(s)) Hence in (a)p(1 3 2) = 13 means Worker 13 processes the sub-tensor Y(132) as [1 3 2] = ind2sub(13) = p(2)(3 3)

Given a big N th-order tensor Y isin RI1timesI2timesmiddotmiddotmiddottimesIN we parti-tion each dimension In into a total of Sn disjoint subsets eachof which consists of LSn indices with

sumSn

sn=1 Lsn = In Assuch the original big tensor is partitioned into a total of W =S1S2 middot middot middotSN sub-tensors and each sub-tensor can be denotedby Y(s1s2sN ) with the size of Ls1 times Ls2 times middot middot middot times LsN where sn isin NSn n = 1 2 N At the same time supposewe have a computer cluster with W workers (computationalengines)3 We imagine that the W workers are virtuallyspatially permuted to form a S1 times S2 times middot middot middot times SN tensor suchthat each worker is just associated with one sub-tensor seeFig5(a) In other words we will have two types of tensorsbull The small data tensors obtained by partitioning Y

Y(s) = Y(s1s2sN ) isin RLs1timesLs2timesmiddotmiddotmiddottimesLsN (19)

where s =[s1 s2 middot middot middot sN

]Tis used to specify

the (s1 s2 sN )th-block of Y sn isin NSn n =

1 2 N see Fig5(a) for an illustrationbull The computing resource tensor P isin NS1timesS2timesmiddotmiddotmiddottimesSN

+

formed by reshaping [1 2 W ] that identifies theW = S1S2 middot middot middotSN workers and the associated data ten-sors The entries of P have a one-to-one correspondencewith the sub-tensors Y(s) meaning that Worker p(s)process the sub-tensor Y(s) see Fig5(b)

It can be seen the RandTucker and the RandTucker2i majorlyuse four basic operations 1) Mode-n unfolding of a tensor2) Matrix-matrix product to obtain Y(n)Ω

(n) 3) Generating aGaussian random matrix Ω 4) Estimating an orthogonal basisof the range of Y(n)Ω

(n) Below we discuss how to performthese four operations in a distributed computing system

Mode-n unfolding of tensor Y We unfold the computingresource tensor P isin NS1timesS2timesmiddotmiddotmiddottimesSN

+ along mode-n to obtain

3We have assumed that each sub-tensor can be efficiently handled in by theassociated worker otherwise we increase the number of workers and decreasethe size of each sub-tensor accordingly

P(n) isin NSntimesK+ K =

prodj 6=n Sj Let p(n)(sn k) be the

(sn k)th entry of P(n) and define

s = πsnk = ind2sub(p(n)(sn k))

sn = 1 2 Sn k = 1 2 prod

j 6=nSj

(20)

where the function ind2sub(middot) converts an entry of P(n) iep(n)(sn k) a worker identifier into an N -by-1 vector ofsubscript values determining the position of p(n)(sn k) in thetensor P which allows us to locate the sub-tensor associatedwith this worker In other words the above (20) means thatthe tensor Yπsnk is processed by Worker p(n)(sn k) Then itcan be seen that

Y(n) =

Yπ11

(n) Yπ12

(n) Yπ1K

(n)

Yπ21

(n) Yπ22

(n) Yπ2K

(n)

Y

πSn1

(n) YπSn2

(n) YπSnK

(n)

isin RIntimes

prodj 6=n Ij

(21)where Y

πsnk

(n) is the mode-n martricization of Yπsnk pro-cessed by Worker p(n)(sn k) sn isin NSn

and k isin NK In summary to obtain Y(n) we perform mode-n unfolding

of Y(s)(n) simultaneously Then the (sn k)th sub-matrix of Y(n)

will be given by Worker p(n)(sn k) as shown in (21)Matrix-matrix product Large scale matrix-matrix product

can be performed by using any existing technique As here themode-n unfolding matrix Y(n) has already been partitionedand distributed we apply the matrix partitioning techniquewe compute the product Y(n)Ω

(n) by using

Z(n)1

Z(n)2

middot middot middotZ

(n)Sn

= Y(n)Ω

(n) = Y(n)

Ω(n)1

Ω(n)2

Ω(n)K

isin RIntimesRn (22)

where Y(n) is given by (21) Ω(n)k isin R

prodj 6=n LSj

timesRn are thesub-matrices by partitioning the random Gaussian matrix Ω(n)

8

50

100

150

200

250

300

350

400

1x108 2x108 3x108 5x108 7x108 1x109 1x109

Scale of the input data ( I N)

Tim

e (s)

FFCPCPminusALSPARACOMP

(a)

2

4

6

8

10

12

14

16

18

1x109 1x1012 8x1012 3x1013 6x1013 1x1014

Scale of the input data (I N)

Tim

e (s

)

(b)

Fig 6 (a) Time consumption comparison of the FFCP the CP-ALS and the PARACOMP on condition that the data tensorof I times I times I can be loaded into memory For the FFCP the RandTucker2i was used to compress the data with p = 15 (b)Scalability of the FFCP algorithm provided that the given big tensor was in its Tucker format The consumed time by theFFCP increases nearly linearly with respect to I

into K parts From (22) we have Z(n)sn =

sumKk=1 Y

πsnk

(n) Ω(n)k

and all the products Yπsnk

(n) Ω(n)k are computed on Workers

p(πsnk) simultaneously and then are broadcasted to Workerp(n)(sn k0) typically k0 = 1 Finally for each sn isin NSn wesum Y

πsnk

(n) Ω(n)k up over k on Worker p(n)(sn k0) to yield

Z(n)sn Generating the Gaussian random matrix Ω(n) Note that

the workers identified by the kth column of P(n) ie theworkers p(n)(sn k) sn = 1 2 Sn share a same randomGaussian matrix Ω

(n)k (corresponding to the kth column of

blocks of Y(n) in (21)) Hence Ω(n)k can be generated on

one worker say p(n)(1 k) and then is broadcasted to all theworkers p(n)(sn k) with sn 6= 1 However this way suffersfrom heavy communication overhead A more economicalmethod is to broadcast the parameters (or the random numbergenerators) that control the random number generation andthen the random matrices Ω

(n)k are generated on all workers

by using the received random number generators as suchfor each k the workers p(n)(sn k) sn isin NSn

will generateidentical random Gaussian matrices Ω

(n)k simultaneously

Estimating an orthonormal basis of the columns ofY(n)Ω

(n) To this end we perform eigenvalue decompositionofsumSn

sn=1 Z(n)sn

TZ(n)sn = UΣUT Then the mode-n factor

matrix U(n) with orthonormal columns can be computed from

U(n)sn = Z(n)

sn UΣminus12 (23)

in parallel on the workers p(n)(sn 1) where U(n)sn is the snth

sub-matrix of U(n) sn isin NSn After U(n) has been estimated Y will be updated as Ylarr

Y timesn U(n)T which can be realized using the above mode-n

tensor unfolding and matrix-matrix product After update thenth dimension of Y(s) will be reduced to be Rn from LSn

We repeat the above procedure for n = 1 2 N As the

last step the core tensor is computed as G =sumWw=1 Y

(w) iethe sum of sub-tensors distributed on all workers

C Nonnegative Tucker Decomposition

In our recent papers [19] [31] we also showed how toefficiently perform nonnegative Tucker decomposition basedon a Tucker approximation The idea of that method is similarto the FFCP method after the original high-dimensional rawdata has been replaced by its Tucker approximation with lowmultilinear rank the computational complexity of gradientswith respect to the nonnegative core tensor and factor matri-ces can be significantly reduced in terms of both time andspace Readers are referred to [19] [31] for more detailsIn summary Tucker compression can play a fundamentalrole to achieve highly efficient tensor decompositions asillustrated in Fig1 Based on the proposed randomized Tuckerdecomposition methods we can easily perform various typesof constrainedunconstrained tensor decompositions for verylarge-scale tensor data

V SIMULATIONS

A Synthetic data

In this subsection we investigate the performance of theproposed FFCP algorithm by using synthetic data All the ex-periments were performed in 64-bit MATLAB on a computerwith Intel Core i7 CPU (333GHz) and 24GB memory running64-bit Windows 7

9

TABLE II Comparison between the algorithms when they were applied to perform CPD of synthetic data The noises wereGaussian with SNR=10dB in all the cases

Algorithm N = 3 I = 200 R = 10 N = 3 I = 500 R = 20 N = 6 I = 20 R = 5Fit () Time (s) Fit () Time (s) Fit () Time (s)

CP-ALS 821plusmn 102 395 805plusmn 68 640 896plusmn 112 188FastCPALS 798plusmn 90 104 829plusmn 56 211 924plusmn 109 66PARAFAC 825plusmn 79 782 805plusmn 61 1583 876plusmn 114 643FFCP 819plusmn 80 35 828plusmn 56 40 934plusmn 86 50

TABLE III Comparison between the algorithms when they were applied to perform nonnegative CPD of synthetic data Thenoises were Gaussian with SNR=10dB in all the cases

Algorithm N = 3 I = 200 R = 10 N = 3 I = 500 R = 20 N = 6 I = 20 R = 5Fit () Time (s) Fit () Time (s) Fit () Time (s)

CP-NMU 973plusmn 12 664 976plusmn 08 16952 917plusmn 56 4265CP-HALS 978plusmn 10 121 977plusmn 07 2281 951plusmn 41 1014FFCP-MU 990plusmn 05 23 993plusmn 02 121 934plusmn 41 70FFCP-HALS 991plusmn 01 61 995plusmn 01 299 943plusmn 35 118

Simulations on CPD In each run a tensor Ylowast was generatedby using the model (2) where the entries of the latent factormatrices A(n) isin RItimesR (ie the ground truth) were drawn fromiid standard normal distributions n = 1 2 N Thenindependent Gaussian noise (with SNR=10dB) was addedto Ylowast to yield noisy observation tensor Y We tested threedifferent combinations of N I and R as shown in TABLE IIThe algorithms CP-ALS PARAFAC which are the standardalgorithms included in the Tensor Toolbox 25 [32] and the N -way toolbox [33] respectively were compared as the baselineThe FastCPALS algorithm [34] was also compared For all thealgorithm the maximum iteration number was set as 1000and a stopping criterion |Fit(Y Yt) minus Fit(Y Yt+1)| lt 10minus6

was applied where Yt is an estimation of Ylowast obtained afterthe tth iteration The mean values and standard variationsof Fit(Ylowast Y) over 20 Monte-Carlo runs are detailed in TA-BLE II along with the averaged time consumed by eachalgorithm From the table we see that the FFCP algorithmis significantly faster than the other algorithms but withoutsignificant deterioration of performance although it workson a compressed tensor instead of the original tensor Wealso tested the PARACOMP [35] method which was speciallyproposed for big tensor decompositions It turned out that thePARACOMP was sensitive to noise in noise-free cases itoften achieved perfect Fit of 1 otherwise when the noise was10dB it generally obtained the Fit lower than 03

In Fig6 we demonstrated the scalability of the proposedFFCP method where the RandTucker2i was used to compressthe data with p = 15 From Fig6(a) we can see that theFFCP shows very promising scalability its time consumptionincreases almost linearly with respect to the scale of the inputdata on condition that the rank is fixed on the contrastthe time consumed by the other algorithms increased nearlyexponentially Note that both the FFCP and the PARACOMPmethod [35] were designed mainly for big data analyticsand this comparison did not precisely reflect their actual

time consumption difference in a distributed computationalenvironment However it did show that the proposed methodwas more efficient than the PARACOMP which can also be in-ferred from the obvious facts 1) the PARACOMP compressesthe observation big tensor at least P times (P ge IR) whereasin our RandTucker2i based FFCP we only need to do it twice(or even once when RandTucker is applied) 2) although bothof them need to perform CPD on small compressed tensorshowever the PARACOMP method needs to do this P timeswhereas the FFCP needs to do it only once as the final stepParticularly in the PARACOMP any failure of the CPDs ofP sub-tensors will degenerate the final results Fig6(b) showsthe scalability of the FFCP provided that the given big tensorwas in its Tucker format To such scale of tensor if we usethe original full tensor directly all the algorithms ran out ofmemory But if we use its Tucker compression the FFCPcan perform constrainedunconstrained CPD efficiently evenon our desktop computer

Simulations on Nonnegative CPD In this experiments thedata were generated as the above except that the latent factormatrices were generated from iid exponential distributionswith the mean parameter micro = 10 Then we randomly selected10 entries of each factor to be zeros to generate sparsefactor matrices As the PARAFAC algorithm was very slowwe only compared the CP-NMU algorithm which is a standardnonnegative CPD algorithm based on the multiplicative updaterule included in the Tensor Toolbox [32] and the CP-HALSalgorithm [36] that is based on the Hierarchical ALS (HALS)method For fair comparison their corresponding implemen-tations of FFCP were compared The results averaged over20 Monte-Carlo runs were detailed in TABLE III From theTABLE FFCP implementations were not only significantlyfaster than their competitors but also often achieved higherFit values To the best of our knowledge the FFCP is one ofthe most efficient algorithms for nonnegative CPD

To demonstrate the high efficiency of the FFCP algorithm, in this simulation the entries of A^(n) ∈ R^{10000×10}, n = 1, 2, 3, 4, were drawn from i.i.d. exponential distributions with mean parameter μ = 10. Again, we randomly set 20% of the entries of each A^(n) to zero to obtain sparse factor matrices. By using (2) and the A^(n) generated above, a big 4th-order tensor Y ∈ R^{10000×10000×10000×10000} with R = 10 was obtained, which contains a total of 10^16 entries in its full format. It is certainly impossible to decompose such a huge tensor on a standard desktop computer. Suppose we have already obtained its Tucker approximation Y_T = ⟦G; Ã^(1), Ã^(2), Ã^(3), Ã^(4)⟧ ≈ Y. Due to the rotation ambiguity of Tucker decompositions, we know that theoretically Ã^(n) = A^(n)Q^(n), where Q^(n) is not necessarily nonnegative. Hence, applying CPD to the core tensor G directly generally cannot guarantee the nonnegativity of the factor matrices. Instead, we applied the FFCP algorithm to the Tucker-compressed tensor Y_T to extract nonnegative factor matrices. In a typical run the FFCP-HALS consumed only 774 seconds, and the recovered components had a signal-to-interference ratio^4 (SIR) higher than 200 dB, i.e., they almost perfectly matched the true components. Although this example only considers an ideal noise-free case, it illustrates how the FFCP algorithm can be applied to perform nonnegative CPD of huge tensors, as long as good low-rank approximations of them can be obtained.

B. The COIL-100 Images Clustering Analysis

In this experiment we applied the proposed algorithms to the clustering analysis of the objects in the Columbia Object Image Library (COIL-100). The COIL-100 database consists of 7200 images of 100 objects, each of which has 72 images taken from different poses; hence these images naturally form a 128 × 128 × 3 × 7200 tensor Y. The data tensor was decomposed by using the tensor decomposition algorithms, and then the two t-Distributed Stochastic Neighbor Embedding (t-SNE) components of A^(4) were used as features to cluster and visualize the results. The K-means algorithm was used to cluster the objects, and the accuracy of clustering was defined as in [37]. To test the proposed distributed RandTucker algorithm, we used the MATLAB Parallel Computing Toolbox, which allows us to use the full processing power of multicore desktops by executing applications on workers (MATLAB computational engines) that run locally. The tensor Y was partitioned into 4 sub-tensors with the size of 128 × 128 × 3 × 1800 and distributed over 4 workers. As the baseline, the Tucker-ALS algorithm included in [32] was compared, and the multilinear rank was empirically set as R = (20, 20, 3, 20); for the RandTucker method, R = (30, 30, 3, 50) was used. Both algorithms used 2 iterations. Their performances are detailed in TABLE IV. It can be seen that no significant deterioration was observed between the randomized methods and the deterministic methods, which suggests that the distributed RandTucker algorithm can be applied to really large-scale problems.
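A compact sketch of this feature-extraction and clustering pipeline is given below, assuming the mode-4 factor matrix A4 (size 7200 × R) has already been obtained from a decomposition and that the ground-truth object labels are available; the best-match accuracy used here is a common choice and may differ in detail from the definition in [37]:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def clustering_accuracy(labels_true, labels_pred):
    """Best-match accuracy: permute predicted cluster labels to maximize agreement."""
    labels_true, labels_pred = np.asarray(labels_true), np.asarray(labels_pred)
    K = int(max(labels_true.max(), labels_pred.max())) + 1
    cost = np.zeros((K, K), dtype=int)
    for t, p in zip(labels_true, labels_pred):
        cost[t, p] += 1
    row, col = linear_sum_assignment(-cost)          # maximize matched samples
    return cost[row, col].sum() / labels_true.size

def cluster_mode_factor(A4, n_clusters, labels_true, seed=0):
    """Embed the mode-4 factor matrix with t-SNE, then run K-means in the 2-D space."""
    emb = TSNE(n_components=2, random_state=seed).fit_transform(A4)
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(emb)
    return emb, pred, clustering_accuracy(labels_true, pred)
```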

^4 The performance index SIR is defined as SIR(a_r^(n), â_r^(n)) = 20 log10( ||a_r^(n)||_2 / ||a_r^(n) − â_r^(n)||_2 ), where â_r^(n) is an estimate of a_r^(n); both a_r^(n) and â_r^(n) are normalized to be zero-mean and unit-variance.
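A tiny helper corresponding to this index (a sketch only; it assumes the formula as reconstructed above):

```python
import numpy as np

def sir_db(a_true, a_est):
    """SIR in dB between a true component and its estimate (both standardized first)."""
    z = lambda v: (v - v.mean()) / v.std()
    a, a_hat = z(np.asarray(a_true, float)), z(np.asarray(a_est, float))
    return 20.0 * np.log10(np.linalg.norm(a) / np.linalg.norm(a - a_hat))
```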

TABLE IV: Comparison between the algorithms when they were applied to clustering analysis of COIL-100 objects.

Algorithm             Fit (%)   Accuracy (%)
Tucker-ALS            70.3      69.2
RandTucker2i          69.1      68.4
RandTucker            68.6      69.8
FFCP (Tucker-ALS)     67.1      71.3
FFCP (RandTucker2i)   64.7      68.0


Fig. 7: Visualization of the clustering results obtained by applying the distributed RandTucker2i to the COIL-100 image database. The two t-SNE components of A^(4) were used for visualization; each point is labeled with its object index (1-100).

The clustering results obtained by RandTucker2i are visualized in Fig. 7 by using the two t-SNE components of A^(4).

C. UCF Sports Action Videos Analysis

This data set consists of a set of videos collected from various sports which are typically featured on broadcast television channels [38] (see http://crcv.ucf.edu/data/UCF_Sports_Action.php). To avoid rescaling the videos and to keep all the videos with the same specification, we selected 5 actions, Diving, Riding-Horse, Run, Swing, and Walk-Front, each of which contains 10 video clips with the size of 404 × 720 × 3 × 35 (35 is the number of frames). We concatenated these videos to form a 5th-order tensor. This tensor has about 1.6 × 10^9 entries, and more than 99% of them are nonzero; when it is saved in double format it consumes about 11.7 GB of space. We used tensor decomposition methods to decompose this tensor in order to extract low-dimensional features for further clustering analysis.


Fig. 8: The UCF Sports Action videos. The first row shows a snapshot of each action, and the second row shows 5 snapshots from a video of diving.

When we applied the CP-ALS method included in the Tensor Toolbox to decompose 30 of the videos, it consumed about 4 hours for only one iteration; when 40 videos were considered, the CP-ALS ran out of memory. We then configured a small cluster with 3 computers (a desktop computer plus two laptop computers, each of which has an Intel i7 dual-core CPU and 8 GB physical memory, running Windows 7 and MATLAB 2012b). In the first stage, RandTucker2i was applied with R = (35, 35, 3, 10, 20). It consumed about 104 seconds to obtain the Tucker approximation of the original tensor, with a Fit of 72.3% to the observation data. Then the FFCP was applied using a desktop computer with R = 10. In this case only 75 seconds were required without nonnegativity constraints, and 7 seconds were consumed when nonnegativity constraints were imposed. Finally, we used A^(5) as features to cluster the videos and achieved an accuracy of 58% in a typical run. This experiment showed that the proposed algorithms can be applied to decompose very large-scale tensors by using a distributed computing cluster.

VI. CONCLUSION

Tensor decompositions are particularly promising for big multiway data analytics, as they are able to discover more complex internal structures and correlations of data gathered from multiple aspects, devices, trials, and subjects. While most existing tensor decomposition methods are not designed to meet the major challenges posed by big data, in this paper we proposed a unified framework for the decomposition of big tensors with relatively low multilinear rank. The key idea is that, based on the Tucker approximation of a big tensor, a large set of unconstrained/constrained tensor decompositions can be performed efficiently. We developed a flexible fast algorithm for the CP decomposition of tensors using their Tucker representation, and a distributed randomized Tucker decomposition approach which is particularly suitable for big tensor analysis. Simulation results confirmed that the proposed algorithms are highly efficient and scalable.

While the randomized Tucker decomposition approach showed close-to-optimal performance, the corresponding error bounds need to be investigated theoretically in the future. We also believe that many randomized (or related) approaches proposed for the low-rank approximation of matrices [29], [39] could be applied to Tucker decompositions; however, a comparative study in tensor decomposition scenarios is still needed. In particular, if we have efficient and robust unconstrained CP decomposition algorithms, we can use the CP model instead of the Tucker model to compress the raw data, since the latter suffers from the curse of dimensionality for very high-order tensors. Hence, developing more reliable and robust CP decomposition approaches is another important task. All these open issues deserve further study in the near future.

REFERENCES

[1] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation. Chichester, UK: John Wiley & Sons, Ltd, 2009.

[2] N. Sidiropoulos, R. Bro, and G. Giannakis, "Parallel factor analysis in sensor array processing," IEEE Transactions on Signal Processing, vol. 48, no. 8, pp. 2377-2388, Aug. 2000.

[3] L. De Lathauwer, J. Castaing, and J. Cardoso, "Fourth-order cumulant-based blind identification of underdetermined mixtures," IEEE Transactions on Signal Processing, vol. 55, no. 6, pp. 2965-2973, 2007.

[4] P. Comon, "Tensors: A brief introduction," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 44-53, May 2014.

[5] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455-500, 2009.

[6] N. D. Sidiropoulos and R. Bro, "On the uniqueness of multilinear decomposition of N-way arrays," Journal of Chemometrics, vol. 14, no. 3, pp. 229-239, 2000.

[7] G. Zhou and A. Cichocki, "Canonical polyadic decomposition based on a single mode blind source separation," IEEE Signal Processing Letters, vol. 19, no. 8, pp. 523-526, Aug. 2012.

[8] L.-H. Lim and P. Comon, "Nonnegative approximations of nonnegative tensors," Journal of Chemometrics, vol. 23, no. 7-8, pp. 432-441, 2009.

[9] E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos, "ParCube: Sparse parallelizable tensor decompositions," in Proceedings of the 2012 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I, ser. ECML PKDD'12. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 521-536.

[10] A. Phan and A. Cichocki, "Advances in PARAFAC using parallel block decomposition," in Neural Information Processing, ser. Lecture Notes in Computer Science, C. Leung, M. Lee, and J. Chan, Eds., vol. 5863. Springer Berlin Heidelberg, 2009, pp. 323-330.

[11] U. Kang, E. Papalexakis, A. Harpale, and C. Faloutsos, "GigaTensor: Scaling tensor analysis up by 100 times - algorithms and discoveries," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '12. New York, NY, USA: ACM, 2012, pp. 316-324.

[12] C. Tsourakakis, "MACH: Fast randomized tensor decompositions," in Proceedings of the 2010 SIAM International Conference on Data Mining, 2010, pp. 689-700.

[13] P. Drineas and M. W. Mahoney, "A randomized algorithm for a tensor-based generalization of the singular value decomposition," Linear Algebra and its Applications, vol. 420, no. 2-3, pp. 553-571, 2007.

[14] C. F. Caiafa and A. Cichocki, "Generalizing the column-row matrix decomposition to multi-way arrays," Linear Algebra and its Applications, vol. 433, no. 3, pp. 557-573, 2010.

[15] M. Vasilescu and D. Terzopoulos, "Multilinear analysis of image ensembles: Tensorfaces," in Computer Vision - ECCV 2002, ser. Lecture Notes in Computer Science, A. Heyden, G. Sparr, M. Nielsen, and P. Johansen, Eds. Springer Berlin Heidelberg, 2002, vol. 2350, pp. 447-460.

[16] O. Duchenne, F. Bach, I.-S. Kweon, and J. Ponce, "A tensor-based algorithm for high-order graph matching," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2383-2395, Dec. 2011.

[17] A. Rajwade, A. Rangarajan, and A. Banerjee, "Image denoising using the higher order singular value decomposition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 4, pp. 849-862, 2013.

[18] V. De Silva and L.-H. Lim, "Tensor rank and the ill-posedness of the best low-rank approximation problem," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 3, pp. 1084-1127, September 2008.

[19] G. Zhou, A. Cichocki, Q. Zhao, and S. Xie, "Nonnegative matrix and tensor factorizations: An algorithmic perspective," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 54-65, May 2014.

[20] G. Zhou, A. Cichocki, and S. Xie, "Fast nonnegative matrix/tensor factorization based on low-rank approximation," IEEE Transactions on Signal Processing, vol. 60, no. 6, pp. 2928-2940, June 2012.

[21] V. Cevher, S. Becker, and M. Schmidt, "Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 32-43, Sept. 2014.

[22] R. Bro and C. A. Andersson, "Improving the speed of multiway algorithms: Part II. Compression," Chemometrics and Intelligent Laboratory Systems, vol. 42, no. 1-2, pp. 105-113, 1998.

[23] N. Sidiropoulos, E. Papalexakis, and C. Faloutsos, "Parallel randomly compressed cubes: A scalable distributed architecture for big tensor decomposition," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 57-70, Sept. 2014.

[24] P. Comon, X. Luciani, and A. L. F. de Almeida, "Tensor decompositions, alternating least squares and other tales," Journal of Chemometrics, vol. 23, no. 7-8, pp. 393-405, 2009.

[25] J. Cohen, R. Farias, and P. Comon, "Fast decomposition of large nonnegative tensors," IEEE Signal Processing Letters, vol. 22, no. 7, pp. 862-866, July 2015.

[26] L. De Lathauwer, B. De Moor, and J. Vandewalle, "A multilinear singular value decomposition," SIAM Journal on Matrix Analysis and Applications, vol. 21, pp. 1253-1278, 2000.

[27] E. Papalexakis, N. Sidiropoulos, and R. Bro, "From k-means to higher-way co-clustering: Multilinear decomposition with sparse latent factors," IEEE Transactions on Signal Processing, vol. 61, no. 2, pp. 493-506, Jan. 2013.

[28] Z. Qian, X. Chen, N. Kang, M. Chen, Y. Yu, T. Moscibroda, and Z. Zhang, "MadLINQ: Large-scale distributed matrix computation for the cloud," in Proceedings of the 7th ACM European Conference on Computer Systems, ser. EuroSys '12. New York, NY, USA: ACM, 2012, pp. 197-210.

[29] N. Halko, P. Martinsson, and J. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," SIAM Review, vol. 53, no. 2, pp. 217-288, 2011.

[30] N. Vervliet, O. Debals, L. Sorber, and L. De Lathauwer, "Breaking the curse of dimensionality using decompositions of incomplete tensors: Tensor-based scientific computing in big data analysis," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 71-79, Sept. 2014.

[31] G. Zhou, A. Cichocki, Q. Zhao, and S. Xie, "Efficient nonnegative Tucker decompositions: Algorithms and uniqueness," IEEE Transactions on Image Processing (submitted), p. 13, 2014. [Online]. Available: http://arxiv.org/abs/1404.4412

[32] B. W. Bader and T. G. Kolda, "MATLAB tensor toolbox version 2.5," Feb. 2012. [Online]. Available: http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox

[33] C. A. Andersson and R. Bro, "The N-way toolbox for MATLAB," 2000. [Online]. Available: http://www.models.life.ku.dk/source/nwaytoolbox

[34] H. Phan, P. Tichavsky, and A. Cichocki, "Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations," IEEE Transactions on Signal Processing, vol. 61, no. 19, pp. 4834-4846, 2013.

[35] N. Sidiropoulos, E. Papalexakis, and C. Faloutsos, "A parallel algorithm for big tensor decomposition using randomly compressed cubes (PARACOMP)," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1-5.

[36] A. Cichocki and A.-H. Phan, "Fast local algorithms for large scale nonnegative matrix and tensor factorizations," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences (invited paper), vol. E92-A, no. 3, pp. 708-721, 2009.

[37] D. Cai, X. He, J. Han, and T. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1548-1560, Aug. 2011.

[38] M. Rodriguez, J. Ahmed, and M. Shah, "Action MACH: a spatio-temporal maximum average correlation height filter for action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), June 2008, pp. 1-8.

[39] R. Baraniuk, V. Cevher, and M. Wakin, "Low-dimensional models for dimensionality reduction and signal recovery: A geometric perspective," Proceedings of the IEEE, vol. 98, no. 6, pp. 959-971, June 2010.



of Y_(n). However, performing the SVD is prohibitive for very large-scale problems. Recently, a set of probabilistic algorithms have been developed to discover the low-rank structures of huge matrices [29]. Here we consider the following randomized range finder (RRF) algorithm (Algorithm 4.1 of [29]) to estimate the basis U^(n):

1) Draw a random Gaussian matrix Ω^(n) ∈ R^{(∏_{p≠n} I_p) × R̄_n}.
2) Obtain the I_n × R̄_n matrix Z^(n) = Y_(n) Ω^(n).
3) Compute U^(n) ∈ R^{I_n × R̄_n} as an orthonormal basis of Z^(n) by using, e.g., the QR decomposition.

In the above, R̄_n = R_n + p, where p is referred to as the oversampling parameter, and p = 5 or p = 10 was suggested in [29]. Based on this procedure we propose a randomized Tucker decomposition (RandTucker) algorithm to perform Tucker decompositions; see Algorithm 2. Compared with the HOSVD algorithm, the proposed RandTucker algorithm is highly parallelizable and suitable for distributed computation.
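A minimal NumPy sketch of this randomized range finder applied to the mode-n unfolding (an illustration only; the `unfold` helper and its C-order flattening are assumptions, and any consistent column ordering of the unfolding yields the same range):

```python
import numpy as np

def unfold(Y, n):
    """Mode-n unfolding: move mode n to the front and flatten the remaining modes."""
    return np.moveaxis(Y, n, 0).reshape(Y.shape[n], -1)

def rrf(Y, n, rank, oversampling=10, rng=None):
    """Randomized range finder for Y_(n): draw Omega, form Z = Y_(n) @ Omega, orthonormalize."""
    rng = np.random.default_rng(rng)
    Yn = unfold(Y, n)                                        # I_n x prod_{p != n} I_p
    Omega = rng.standard_normal((Yn.shape[1], rank + oversampling))
    Z = Yn @ Omega                                           # I_n x (R_n + p)
    U, _ = np.linalg.qr(Z)                                   # orthonormal basis of range(Z)
    return U
```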

Several randomized algorithms have already been proposed for Tucker decompositions, such as the methods using fiber sampling [14], [30]. Compared with them, the proposed RandTucker method has an important advantage: it is able to obtain a nearly optimal approximation even for very noisy data by using only one additional ALS-like iteration, which is referred to as RandTucker2i; see Algorithm 3 and Fig. 4 for a performance comparison. In this figure, a 200 × 200 × 200 tensor Y was generated in each run by using model (1) with R_n = 10; then different levels of Gaussian noise were added to Y to generate the observation tensor. We plotted the Fit values (and the standard deviations) obtained by RandTucker (with p = 10), RandTucker2i (p = 10), and the HOSVD, averaged over 50 Monte-Carlo runs. From the figure, RandTucker achieved slightly worse Fit values, especially when the noise was heavy, while RandTucker2i achieved almost the same Fit values as the HOSVD. As RandTucker and RandTucker2i do not involve the singular value decomposition of big matrices and are highly parallelizable, they are very promising for the Tucker decomposition of big tensors.

B. Distributed Randomized Tucker Decomposition

In the above RandTucker and RandTucker2i algorithms we have assumed that all the data can be processed in physical memory directly. For big data this could be unrealistic; a common scenario is that only part of the data can be loaded in memory. In this subsection we discuss how to solve this problem by using a computer cluster (in this paper we used the MATLAB Distributed Computing Server^2). Here we consider a cluster which is composed of:

• a number of workers with identity labels, each of which is a computational engine that executes computational tasks simultaneously; the workers may also communicate with each other and with the client;

^2 See http://www.mathworks.com/help/pdf_doc/mdce/mdce.pdf

Algorithm 2: The Randomized Tucker Decomposition (RandTucker) Algorithm

Require: Y ∈ R^{I_1×I_2×···×I_N} and the multilinear rank (R_1, R_2, ..., R_N).
1: for n = 1, 2, ..., N do
2:   Compute Z^(n) = Y_(n) Ω^(n), where Ω^(n) is an (∏_{k≠n} I_k)-by-R_n random Gaussian matrix.
3:   Compute U^(n) as an orthonormal basis of Z^(n) by using, e.g., the QR decomposition.
4:   Y ← Y ×_n U^(n)T, I_n ← R_n.
5: end for
6: G ← Y.
7: return Y ≈ ⟦G; U^(1), U^(2), ..., U^(N)⟧.
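A self-contained NumPy sketch of Algorithm 2 (a single-machine illustration only; the mode-n product helper and the oversampling parameter are assumptions added for completeness rather than part of the algorithm as printed):

```python
import numpy as np

def mode_n_product(Y, M, n):
    """Mode-n product Y x_n M, i.e. (Y x_n M)_(n) = M @ Y_(n)."""
    Yn = np.moveaxis(Y, n, 0).reshape(Y.shape[n], -1)
    new_shape = (M.shape[0],) + tuple(np.delete(Y.shape, n))
    return np.moveaxis((M @ Yn).reshape(new_shape), 0, n)

def rand_tucker(Y, ranks, oversampling=5, rng=None):
    """RandTucker: one randomized pass over the modes, shrinking Y as it goes."""
    rng = np.random.default_rng(rng)
    G, factors = Y.copy(), []
    for n, Rn in enumerate(ranks):
        Gn = np.moveaxis(G, n, 0).reshape(G.shape[n], -1)
        Omega = rng.standard_normal((Gn.shape[1], Rn + oversampling))
        U, _ = np.linalg.qr(Gn @ Omega)          # orthonormal basis, I_n x (R_n + p)
        G = mode_n_product(G, U.T, n)            # Y <- Y x_n U^(n)T
        factors.append(U)
    return G, factors                            # Y ~ [[ G; U^(1), ..., U^(N) ]]
```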

Algorithm 3: The RandTucker Algorithm with 2 Iterations Only (RandTucker2i)

Require: Y ∈ R^{I_1×I_2×···×I_N} and the multilinear rank (R_1, R_2, ..., R_N).
1: Initialize U^(n) ∈ R^{I_n×R_n} as random Gaussian matrices.
2: Repeat lines 3-7 twice:
3: for n = 1, 2, ..., N do
4:   X = Y ×_{p≠n} U^(p)T.
5:   Compute Z^(n) = X_(n) Ω^(n), where Ω^(n) is an (∏_{k≠n} R_k)-by-R_n random Gaussian matrix.
6:   Compute U^(n) as an orthonormal basis of Z^(n) by using, e.g., the QR decomposition.
7: end for
8: G ← Y ×_1 U^(1)T ×_2 U^(2)T ··· ×_N U^(N)T.
9: return Y ≈ ⟦G; U^(1), U^(2), ..., U^(N)⟧.
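A matching sketch of Algorithm 3, again only an illustrative single-machine analogue (the transposes on U^(p) in lines 4 and 8, lost in the extracted text, are restored here, and the oversampling parameter is an added assumption):

```python
import numpy as np

def mode_n_product(Y, M, n):
    """Mode-n product Y x_n M, i.e. (Y x_n M)_(n) = M @ Y_(n)."""
    Yn = np.moveaxis(Y, n, 0).reshape(Y.shape[n], -1)
    new_shape = (M.shape[0],) + tuple(np.delete(Y.shape, n))
    return np.moveaxis((M @ Yn).reshape(new_shape), 0, n)

def rand_tucker_2i(Y, ranks, oversampling=5, sweeps=2, rng=None):
    """RandTucker2i: ALS-like sweeps in which each mode is range-found on the
    tensor already projected by the current estimates of the other factors."""
    rng = np.random.default_rng(rng)
    N = Y.ndim
    U = [rng.standard_normal((Y.shape[n], Rn)) for n, Rn in enumerate(ranks)]
    for _ in range(sweeps):
        for n in range(N):
            X = Y
            for p in range(N):
                if p != n:
                    X = mode_n_product(X, U[p].T, p)      # X = Y x_{p != n} U^(p)T
            Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
            Omega = rng.standard_normal((Xn.shape[1], ranks[n] + oversampling))
            U[n], _ = np.linalg.qr(Xn @ Omega)
    G = Y
    for n in range(N):
        G = mode_n_product(G, U[n].T, n)                  # core G = Y x_1 U^(1)T ... x_N U^(N)T
    return G, U
```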

Fig. 4: Performance comparison between RandTucker, RandTucker2i, and the HOSVD algorithm in terms of the mean value of Fit (%) versus the noise level (SNR in dB), together with the standard deviation over 50 Monte-Carlo runs.

• a client, which is used to schedule tasks to the workers, gather results from the workers, and return the final results.

Fig. 5: Demonstration of the small data tensors and the virtual computing resource tensor P in a cluster. (a) An example: Y ∈ R^{2I×4J×3K} is partitioned into smaller blocks Y^(s), each of size I × J × K, thereby leading to the 2 × 4 × 3 computing resource tensor P identifying a total of 24 workers. (b) An entry p(s) ∈ {1, 2, ..., ∏_n S_n} of the computing resource tensor P indicates that Worker p(s) processes the sub-tensor Y^(s), where s = [s_1, s_2, ..., s_N]^T = ind2sub(p(s)). Hence in (a), p(1, 3, 2) = 13 means that Worker 13 processes the sub-tensor Y^(132), as [1, 3, 2]^T = ind2sub(13) = ind2sub(P_(2)(3, 3)).

Given a big Nth-order tensor Y ∈ R^{I_1×I_2×···×I_N}, we partition each dimension I_n into a total of S_n disjoint subsets, each of which consists of L_{s_n} indices, with Σ_{s_n=1}^{S_n} L_{s_n} = I_n. As such, the original big tensor is partitioned into a total of W = S_1 S_2 ··· S_N sub-tensors, and each sub-tensor can be denoted by Y^{(s_1 s_2 ··· s_N)} with the size of L_{s_1} × L_{s_2} × ··· × L_{s_N}, where s_n ∈ N_{S_n}, n = 1, 2, ..., N. At the same time, suppose we have a computer cluster with W workers (computational engines)^3. We imagine that the W workers are virtually spatially permuted to form an S_1 × S_2 × ··· × S_N tensor such that each worker is associated with exactly one sub-tensor; see Fig. 5(a). In other words, we will have two types of tensors:

• The small data tensors obtained by partitioning Y:

    Y^(s) = Y^{(s_1 s_2 ··· s_N)} ∈ R^{L_{s_1}×L_{s_2}×···×L_{s_N}},   (19)

  where s = [s_1, s_2, ..., s_N]^T is used to specify the (s_1, s_2, ..., s_N)th block of Y, s_n ∈ N_{S_n}, n = 1, 2, ..., N; see Fig. 5(a) for an illustration.

• The computing resource tensor P ∈ N_+^{S_1×S_2×···×S_N}, formed by reshaping [1, 2, ..., W], that identifies the W = S_1 S_2 ··· S_N workers and the associated data tensors. The entries of P are in one-to-one correspondence with the sub-tensors Y^(s), meaning that Worker p(s) processes the sub-tensor Y^(s); see Fig. 5(b).
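To make this bookkeeping concrete, here is a small NumPy sketch that partitions a tensor into blocks and builds the computing resource tensor P; np.unravel_index plays the role of MATLAB's ind2sub (the sketch is 0-based and only an analogue of the MATLAB setting used in the paper):

```python
import numpy as np
from itertools import product

def partition(Y, blocks):
    """Split Y into {s: sub-tensor}, where s = (s_1, ..., s_N) indexes the block;
    `blocks` gives the number of splits S_n per mode (near-equal splits assumed)."""
    splits = [np.array_split(np.arange(I), S) for I, S in zip(Y.shape, blocks)]
    return {s: Y[np.ix_(*[splits[n][s[n]] for n in range(Y.ndim)])]
            for s in product(*[range(S) for S in blocks])}

# Computing resource tensor P: worker identifiers 1..W reshaped to S_1 x ... x S_N.
blocks = (2, 4, 3)                                                 # the example of Fig. 5(a)
P = np.arange(1, np.prod(blocks) + 1).reshape(blocks, order="F")   # column-major, MATLAB-like

# Worker 13 handles the block whose subscripts are ind2sub(13) (here 0-based: (0, 2, 1)).
s = np.unravel_index(13 - 1, blocks, order="F")
assert P[s] == 13
```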

It can be seen that RandTucker and RandTucker2i mainly use four basic operations: 1) mode-n unfolding of a tensor; 2) matrix-matrix products to obtain Y_(n) Ω^(n); 3) generating a Gaussian random matrix Ω^(n); and 4) estimating an orthonormal basis of the range of Y_(n) Ω^(n). Below we discuss how to perform these four operations in a distributed computing system.

Mode-n unfolding of the tensor Y: We unfold the computing resource tensor P ∈ N_+^{S_1×S_2×···×S_N} along mode n to obtain P_(n) ∈ N_+^{S_n×K}, K = ∏_{j≠n} S_j. Let p_(n)(s_n, k) be the (s_n, k)th entry of P_(n), and define

    s = π_{s_n k} = ind2sub(p_(n)(s_n, k)),   s_n = 1, 2, ..., S_n,   k = 1, 2, ..., ∏_{j≠n} S_j,   (20)

where the function ind2sub(·) converts an entry of P_(n), i.e., the worker identifier p_(n)(s_n, k), into an N-by-1 vector of subscript values determining the position of p_(n)(s_n, k) in the tensor P, which allows us to locate the sub-tensor associated with this worker. In other words, (20) means that the sub-tensor Y^{π_{s_n k}} is processed by Worker p_(n)(s_n, k).

^3 We have assumed that each sub-tensor can be handled efficiently by the associated worker; otherwise we increase the number of workers and decrease the size of each sub-tensor accordingly.

Then it can be seen that

    Y_(n) = [ Y^{π_{s_n k}}_(n) ]_{s_n = 1, ..., S_n;  k = 1, ..., K}  ∈ R^{I_n × ∏_{j≠n} I_j},   (21)

that is, Y_(n) is the S_n × K block matrix whose (s_n, k)th block is Y^{π_{s_n k}}_(n), the mode-n matricization of Y^{π_{s_n k}} processed by Worker p_(n)(s_n, k), s_n ∈ N_{S_n} and k ∈ N_K. In summary, to obtain Y_(n) we perform the mode-n unfolding of all the sub-tensors Y^(s) simultaneously; the (s_n, k)th sub-matrix of Y_(n) is then held by Worker p_(n)(s_n, k), as shown in (21).

Matrix-matrix product: A large-scale matrix-matrix product can be performed by using any existing technique. As the mode-n unfolding matrix Y_(n) here has already been partitioned and distributed, we adopt the matrix partitioning technique: we compute the product Y_(n) Ω^(n) via

    [ Z^(n)_1^T, Z^(n)_2^T, ···, Z^(n)_{S_n}^T ]^T = Y_(n) Ω^(n) = Y_(n) [ Ω^(n)_1^T, Ω^(n)_2^T, ···, Ω^(n)_K^T ]^T ∈ R^{I_n × R_n},   (22)

where Y_(n) is given by (21) and the Ω^(n)_k ∈ R^{(∏_{j≠n} L_{s_j}) × R_n} are the sub-matrices obtained by partitioning the random Gaussian matrix Ω^(n) into K parts.

Fig. 6: (a) Time consumption (in seconds) of the FFCP, the CP-ALS, and the PARACOMP versus the scale I^N of the input data (about 10^8 to 10^9 entries), on condition that the data tensor of size I × I × I can be loaded into memory. For the FFCP, RandTucker2i was used to compress the data with p = 15. (b) Scalability of the FFCP algorithm provided that the given big tensor is in its Tucker format (time in seconds versus the scale I^N of the input data, from about 10^9 up to about 10^14 entries). The time consumed by the FFCP increases nearly linearly with respect to I.

From (22) we have Z^(n)_{s_n} = Σ_{k=1}^{K} Y^{π_{s_n k}}_(n) Ω^(n)_k; all the products Y^{π_{s_n k}}_(n) Ω^(n)_k are computed on the Workers p(π_{s_n k}) simultaneously and then broadcast to Worker p_(n)(s_n, k_0), typically with k_0 = 1. Finally, for each s_n ∈ N_{S_n}, we sum the terms Y^{π_{s_n k}}_(n) Ω^(n)_k over k on Worker p_(n)(s_n, k_0) to yield Z^(n)_{s_n}.
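A minimal sketch of this block-wise product and reduction (run here in a single process; in the actual setting each term Y^{π_{s_n k}}_(n) Ω^(n)_k would be computed on its own worker before the per-row summation):

```python
import numpy as np

def distributed_mode_n_product(blocks, omega_blocks):
    """Compute Z_{s_n} = sum_k Y^{pi_{s_n,k}}_(n) @ Omega_k for each block row s_n.
    blocks[s][k] is the (s, k)th block of the unfolded matrix Y_(n) held by one worker;
    omega_blocks[k] is the matching block of the random matrix Omega^(n)."""
    return [sum(blocks[s][k] @ omega_blocks[k] for k in range(len(omega_blocks)))
            for s in range(len(blocks))]

# Tiny sanity check against the undistributed product.
rng = np.random.default_rng(0)
Yn = rng.standard_normal((6, 12))
Omega = rng.standard_normal((12, 3))
blocks = [[Yn[3 * s:3 * s + 3, 4 * k:4 * k + 4] for k in range(3)] for s in range(2)]
omega_blocks = [Omega[4 * k:4 * k + 4, :] for k in range(3)]
Z = distributed_mode_n_product(blocks, omega_blocks)
assert np.allclose(np.vstack(Z), Yn @ Omega)
```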

Generating the Gaussian random matrix Ω^(n): Note that the workers identified by the kth column of P_(n), i.e., the workers p_(n)(s_n, k), s_n = 1, 2, ..., S_n, share the same random Gaussian matrix Ω^(n)_k (corresponding to the kth column of blocks of Y_(n) in (21)). Hence Ω^(n)_k can be generated on one worker, say p_(n)(1, k), and then broadcast to all the workers p_(n)(s_n, k) with s_n ≠ 1. However, this way suffers from heavy communication overhead. A more economical method is to broadcast only the parameters (or the random number generators) that control the random number generation; the random matrices Ω^(n)_k are then generated on all workers by using the received random number generators, so that for each k the workers p_(n)(s_n, k), s_n ∈ N_{S_n}, generate identical random Gaussian matrices Ω^(n)_k simultaneously.
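A sketch of the cheaper variant: broadcast only a seed (or generator state), and let every worker in block column k regenerate the same Ω^(n)_k locally; the seed value and matrix sizes below are placeholders:

```python
import numpy as np

def local_omega(seed_for_column_k, n_rows, rank):
    """Each worker in block column k seeds its own generator identically, so the
    column's random matrix Omega_k is reproduced locally without sending the matrix."""
    rng = np.random.default_rng(seed_for_column_k)
    return rng.standard_normal((n_rows, rank))

# Two "workers" that received the same seed build identical Omega_k matrices.
assert np.array_equal(local_omega(42, 100, 10), local_omega(42, 100, 10))
```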

Estimating an orthonormal basis of the columns of Y_(n) Ω^(n): To this end, we perform the eigenvalue decomposition of Σ_{s_n=1}^{S_n} Z^(n)_{s_n}^T Z^(n)_{s_n} = U Σ U^T. Then the mode-n factor matrix U^(n) with orthonormal columns can be computed from

    U^(n)_{s_n} = Z^(n)_{s_n} U Σ^{−1/2},   (23)

in parallel on the workers p_(n)(s_n, 1), where U^(n)_{s_n} is the s_n-th sub-matrix of U^(n), s_n ∈ N_{S_n}.
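A compact sketch of this EVD-based orthonormalization, in which only the small R × R Gram matrix has to be reduced across the row blocks (each block Z_{s_n} stays with its worker):

```python
import numpy as np

def orthonormal_basis_from_blocks(Z_blocks):
    """Orthonormalize Z = [Z_1; ...; Z_Sn] via the EVD of sum_s Z_s^T Z_s (eq. (23)):
    each worker only touches its own row block Z_s."""
    gram = sum(Z.T @ Z for Z in Z_blocks)              # small R x R matrix, reduced over workers
    evals, U = np.linalg.eigh(gram)                    # gram = U diag(evals) U^T
    scale = U / np.sqrt(evals)                         # U Sigma^{-1/2}
    return [Z @ scale for Z in Z_blocks]               # U^(n)_{s_n} = Z_{s_n} U Sigma^{-1/2}

rng = np.random.default_rng(0)
Z_blocks = [rng.standard_normal((50, 8)) for _ in range(4)]
Q = np.vstack(orthonormal_basis_from_blocks(Z_blocks))
assert np.allclose(Q.T @ Q, np.eye(8), atol=1e-8)      # stacked blocks are orthonormal
```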

After U^(n) has been estimated, Y is updated as Y ← Y ×_n U^(n)T, which can be realized by using the above mode-n tensor unfolding and matrix-matrix product operations. After this update, the nth dimension of each Y^(s) is reduced from L_{s_n} to R_n. We repeat the above procedure for n = 1, 2, ..., N. As the last step, the core tensor is computed as G = Σ_{w=1}^{W} Y^(w), i.e., the sum of the sub-tensors distributed on all the workers.

C. Nonnegative Tucker Decomposition

In our recent papers [19], [31] we also showed how to efficiently perform nonnegative Tucker decomposition based on a Tucker approximation. The idea of that method is similar to the FFCP method: after the original high-dimensional raw data has been replaced by its Tucker approximation with low multilinear rank, the computational complexity of the gradients with respect to the nonnegative core tensor and factor matrices can be significantly reduced in terms of both time and space. Readers are referred to [19], [31] for more details. In summary, Tucker compression can play a fundamental role in achieving highly efficient tensor decompositions, as illustrated in Fig. 1. Based on the proposed randomized Tucker decomposition methods, we can easily perform various types of constrained/unconstrained tensor decompositions for very large-scale tensor data.

V. SIMULATIONS

A. Synthetic data

In this subsection we investigate the performance of the proposed FFCP algorithm by using synthetic data. All the experiments were performed in 64-bit MATLAB on a computer with an Intel Core i7 CPU (3.33 GHz) and 24 GB memory, running 64-bit Windows 7.


TABLE II: Comparison between the algorithms when they were applied to perform CPD of synthetic data. The noise was Gaussian with SNR = 10 dB in all cases.

              N = 3, I = 200, R = 10     N = 3, I = 500, R = 20     N = 6, I = 20, R = 5
Algorithm     Fit (%)        Time (s)    Fit (%)        Time (s)    Fit (%)        Time (s)
CP-ALS        82.1 ± 10.2    395         80.5 ± 6.8     640         89.6 ± 11.2    188
FastCPALS     79.8 ± 9.0     104         82.9 ± 5.6     211         92.4 ± 10.9    66
PARAFAC       82.5 ± 7.9     782         80.5 ± 6.1     1583        87.6 ± 11.4    643
FFCP          81.9 ± 8.0     35          82.8 ± 5.6     40          93.4 ± 8.6     50

TABLE III: Comparison between the algorithms when they were applied to perform nonnegative CPD of synthetic data. The noise was Gaussian with SNR = 10 dB in all cases.

              N = 3, I = 200, R = 10     N = 3, I = 500, R = 20     N = 6, I = 20, R = 5
Algorithm     Fit (%)        Time (s)    Fit (%)        Time (s)    Fit (%)        Time (s)
CP-NMU        97.3 ± 1.2     664         97.6 ± 0.8     16952       91.7 ± 5.6     4265
CP-HALS       97.8 ± 1.0     121         97.7 ± 0.7     2281        95.1 ± 4.1     1014
FFCP-MU       99.0 ± 0.5     23          99.3 ± 0.2     121         93.4 ± 4.1     70
FFCP-HALS     99.1 ± 0.1     61          99.5 ± 0.1     299         94.3 ± 3.5     118

Simulations on CPD In each run a tensor Ylowast was generatedby using the model (2) where the entries of the latent factormatrices A(n) isin RItimesR (ie the ground truth) were drawn fromiid standard normal distributions n = 1 2 N Thenindependent Gaussian noise (with SNR=10dB) was addedto Ylowast to yield noisy observation tensor Y We tested threedifferent combinations of N I and R as shown in TABLE IIThe algorithms CP-ALS PARAFAC which are the standardalgorithms included in the Tensor Toolbox 25 [32] and the N -way toolbox [33] respectively were compared as the baselineThe FastCPALS algorithm [34] was also compared For all thealgorithm the maximum iteration number was set as 1000and a stopping criterion |Fit(Y Yt) minus Fit(Y Yt+1)| lt 10minus6

was applied where Yt is an estimation of Ylowast obtained afterthe tth iteration The mean values and standard variationsof Fit(Ylowast Y) over 20 Monte-Carlo runs are detailed in TA-BLE II along with the averaged time consumed by eachalgorithm From the table we see that the FFCP algorithmis significantly faster than the other algorithms but withoutsignificant deterioration of performance although it workson a compressed tensor instead of the original tensor Wealso tested the PARACOMP [35] method which was speciallyproposed for big tensor decompositions It turned out that thePARACOMP was sensitive to noise in noise-free cases itoften achieved perfect Fit of 1 otherwise when the noise was10dB it generally obtained the Fit lower than 03

In Fig6 we demonstrated the scalability of the proposedFFCP method where the RandTucker2i was used to compressthe data with p = 15 From Fig6(a) we can see that theFFCP shows very promising scalability its time consumptionincreases almost linearly with respect to the scale of the inputdata on condition that the rank is fixed on the contrastthe time consumed by the other algorithms increased nearlyexponentially Note that both the FFCP and the PARACOMPmethod [35] were designed mainly for big data analyticsand this comparison did not precisely reflect their actual

time consumption difference in a distributed computationalenvironment However it did show that the proposed methodwas more efficient than the PARACOMP which can also be in-ferred from the obvious facts 1) the PARACOMP compressesthe observation big tensor at least P times (P ge IR) whereasin our RandTucker2i based FFCP we only need to do it twice(or even once when RandTucker is applied) 2) although bothof them need to perform CPD on small compressed tensorshowever the PARACOMP method needs to do this P timeswhereas the FFCP needs to do it only once as the final stepParticularly in the PARACOMP any failure of the CPDs ofP sub-tensors will degenerate the final results Fig6(b) showsthe scalability of the FFCP provided that the given big tensorwas in its Tucker format To such scale of tensor if we usethe original full tensor directly all the algorithms ran out ofmemory But if we use its Tucker compression the FFCPcan perform constrainedunconstrained CPD efficiently evenon our desktop computer

Simulations on Nonnegative CPD In this experiments thedata were generated as the above except that the latent factormatrices were generated from iid exponential distributionswith the mean parameter micro = 10 Then we randomly selected10 entries of each factor to be zeros to generate sparsefactor matrices As the PARAFAC algorithm was very slowwe only compared the CP-NMU algorithm which is a standardnonnegative CPD algorithm based on the multiplicative updaterule included in the Tensor Toolbox [32] and the CP-HALSalgorithm [36] that is based on the Hierarchical ALS (HALS)method For fair comparison their corresponding implemen-tations of FFCP were compared The results averaged over20 Monte-Carlo runs were detailed in TABLE III From theTABLE FFCP implementations were not only significantlyfaster than their competitors but also often achieved higherFit values To the best of our knowledge the FFCP is one ofthe most efficient algorithms for nonnegative CPD

To demonstrate the high efficiency of the FFCP algorithm

10

in this simulation the entries of A(n) isin R10000times10 n =1 2 3 4 were drawn from iid exponential distributions withthe mean parameter micro = 10 Again we randomly selected 20entries of A(n) to be zeros to obtain sparse factor matricesBy using (2) and A(n) generated above a big 4th-order tensorY isin R10000times10000times10000times10000 with R = 10 was obtainedwhich contains a total of 1016 entries in its full format Itis certainly impossible to decompose such a huge tensor in astandard desktop computer Suppose we have already obtainedits Tucker approximation YT = JGA(1)A(2)A(3)A(4)K asympY Due to the rotation ambiguity of Tucker decompositionswe know theoretically A(n) = A(n)Q(n) where Q(n) wasnot necessarily to be nonnegative Generally applying CPD onthe core tensor G directly cannot guarantee the nonnegativityof factor matrices Instead we applied the FFCP algorithm onthe Tucker compressed tensor YT to extract nonnegative factormatrices In a typical run the FFCP-HALS consumed only 774seconds and the recovered components were with the signal-to-interference ratio4 (SIR) higher than 200dB which almostperfectly match the true components Although this exampleonly shows an ideal noise-free case it illustrates how the FFCPalgorithm can be applied to perform nonnegative CPD of hugetensors as long as their good low-rank approximations can beobtained

B The COIL-100 Images Clustering Analysis

In this experiment we applied the proposed NTD algorithmsto the clustering analysis of the objects selected from theColumbia Object Image Library (COIL-100) The COIL-100database consists of 7200 images of 100 objects each ofwhich has 72 images taken from different poses hence theseimages naturally form a 128times 128times 3times 7200 tensor Y Thedata tensor was decomposed by using tensor decompositionalgorithms and then used the two t-Distributed StochasticNeighbor Embedding (t-SNE) components of A(4) as featuresto cluster and visualize the results The K-means algorithm wasused to cluster the objects and the accuracy of clustering wasdefined in [37]To justify the proposed distributed RandTuckeralgorithm we used the MATLAB parallel computing toolboxwhich allows us to use the full processing power of multicoredesktops by executing applications on workers (MATLABcomputational engines) that run locally The tensor Y waspartitioned into 4 sub-tensors with the size of 128times128times3times1800 and was distributed on 4 workers As the baseline theTucker-ALS algorithm included in [32] was compared and themultilinear rank was empirically set as R = (20 20 3 20)For the RandTucker method R = (30 30 3 50) was setBoth of these two algorithms used 2 iterations Their perfor-mances were detailed in TABLE IV It can be seen that nosignificant deterioration was observed between the randomized

4The performance index SIR is defined as SIR(a(n)r a

(n)r ) =

20 log∥∥∥a(n)

r minus a(n)r

∥∥∥2∥∥∥a(n)

r

∥∥∥2

where a(n)r is an estimate of a

(n)r both

a(n)r and a

(n)r are normalized to be zero-mean and unit-variance

TABLE IV Comparison between the algorithms when theywere applied to clustering analysis of COIL-100 objects

Algorithm Fit () Accuracy ()Tucker-ALS 703 692

RandTucker2i 691 684RandTucker 686 698

FFCP(Tucker-ALS) 671 713FFCP (RandTucker2i) 647 680

12

3

4

5

6 7

8

9

10

11

12

13

1415

16

17

18

19

20

21

2223

24

25

2627

28

29

30

31

32

33

34

35

36

37

38 3940

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

6465

6667

68

69

70

71

72

73

74

75

76

77

78

7980

81

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

97

98

99

100

Fig 7 Visualization of the clustering results obtained byapplying distributed RandTucker2i on the COIL100 imagedatabase The two tSNE components of A(4) were used forvisualization

methods and the deterministic methods which suggested thatthe distributed RandTucker algorithm can be applied to reallylarge scale problems The clustering results obtained by theRandTucker2i Figare visualized in Fig7 by using the twotSNE components of A(4)

C UCF Sports Action Videos Analysis

This data set consists of a set of videos collected fromvarious sports which are typically featured on broadcast televi-sion channels [38] (see httpcrcvucfedudataUCF SportsActionphp) To avoid rescaling the videos and keep all thevideos in the same specification we selected 5 actions ofDiving Riding-Horse Run Swing and Walk-Front each ofwhich contains 10 video clips with the size of 404times720times3times35(35 is the number of frames) We concatenated these videosto form a 5th-order tensor This tensor has about 16 times 109

entries and more than 99 of them are nonzero When it issaved in double format it consumes about 117 GB spaceWe used tensor decomposition methods to decompose this

11

Fig 8 The UCF Sports Action Videos The first row shows asnapshot of each action and the second row shows 5 snapshotsfrom a video of diving

tensor to extract low-dimensional features for further cluster-ing analysis When we applied the CP-ALS method includedin the Tensor Toolbox to decompose 30 videos it consumedabout 4 hours for only one iteration when 40 videos wereconsidered the CP-ALS ran out of memory We configureda small cluster with 3 computers (a desktop computer plustwo laptop computers (each of which has a i7 Intel dualcore CPU 8 GB physical memory running Windows 7 andMATLAB 2012b)) In the first stage the RandTucker2i wasapplied with R = [35 35 3 10 20] It consumed about 104seconds to obtain the Tucker approximation of the originaltensor with the Fit of 723 to the observation data Then theFFCP was applied using a desktop computer with R = 10 Inthis case only 75 seconds were required without nonnegativityconstraints and 7 seconds were consumed when nonnegativityconstraints were imposed Finally we used A(5) as featuresto cluster the videos and achieved the accuracy of 58 ina typical run This experiments showed that the proposedalgorithms can be applied to decompose very large scaletensors by using a distributed computing cluster

VI CONCLUSION

Tensor decompositions are particularly promising for bigmultiway data analytics as it is able to discover more complexinternal structures and correlations of data gathered frommultiple aspects devices trials and subjects etc While mostexisting tensor decomposition methods are not designed tomeet the major challenges posed by big data in this paperwe proposed a unified framework for the decomposition ofbig tensors with relatively low multilinear rank The key ideais that based on its Tucker approximation of a big tensor alarge set of unconstrainedconstrained tensor decompositionscan be efficiently performed We developed a flexible fastalgorithm for the CP decomposition of tensors using theirTucker representation and a distributed randomized Tuckerdecomposition approach which is particularly suitable for bigtensor analysis Simulation results confirmed that the proposedalgorithms are highly efficient and scalable

While the randomized Tucker decomposition approachshowed its close-to-optimal performance the correspondingerror bounds need to be investigated theoretically in the futureWe also believe many randomized (or related) approachesproposed for low-rank approximation of matrices [29] [39]

could be applied for Tucker decompositions however a com-parative study in the tensor decomposition scenarios is desiredParticularly if we have efficient and robust unconstrained CPdecomposition algorithms we can use the CP model to com-press the raw data instead of the Tucker model which suffersfrom the curse of dimensionality for very high-order tensorsHence developing more reliable and robust CP decompositionapproaches is another important task All these open issuesdeserve further study in the near future

REFERENCES

[1] A Cichocki R Zdunek A H Phan and S-I Amari Nonnegative Ma-trix and Tensor Factorizations Applications to Exploratory Multi-wayData Analysis and Blind Source Separationpplications to ExploratoryMulti-Way Data Analysis and Blind Source Separation ChichesterUK John Wiley amp Sons Ltd 2009

[2] N Sidiropoulos R Bro and G Giannakis ldquoParallel factor analysisin sensor array processingrdquo IEEE Transactions on Signal Processingvol 48 no 8 pp 2377ndash2388 Aug 2000

[3] L De Lathauwer J Castaing and J Cardoso ldquoFourth-order cumulant-based blind identification of underdetermined mixturesrdquo IEEE Transac-tions on Signal Processing vol 55 no 6 pp 2965ndash2973 2007

[4] P Comon ldquoTensors A brief introductionrdquo IEEE Signal ProcessingMagazine vol 31 no 3 pp 44ndash53 May 2014

[5] T G Kolda and B W Bader ldquoTensor decompositions and applicationsrdquoSIAM Review vol 51 no 3 pp 455ndash500 2009

[6] N D Sidiropoulos and R Bro ldquoOn the uniqueness of multilineardecomposition of N -way arraysrdquo Journal of Chemometrics vol 14no 3 pp 229 ndash 239 2000

[7] G Zhou and A Cichocki ldquoCanonical polyadic decomposition based ona single mode blind source separationrdquo IEEE Signal Processing Lettersvol 19 no 8 pp 523 ndash526 Aug 2012

[8] L-H Lim and P Comon ldquoNonnegative approximations of nonnegativetensorsrdquo Journal of Chemometrics vol 23 no 7-8 pp 432ndash441 2009

[9] E E Papalexakis C Faloutsos and N D Sidiropoulos ldquoParCubeSparse parallelizable tensor decompositionsrdquo in Proceedings of the 2012European Conference on Machine Learning and Knowledge Discoveryin Databases - Volume Part I ser ECML PKDDrsquo12 Berlin HeidelbergSpringer-Verlag 2012 pp 521ndash536

[10] A Phan and A Cichocki ldquoAdvances in PARAFAC using parallel blockdecompositionrdquo in Neural Information Processing ser Lecture Notesin Computer Science C Leung M Lee and J Chan Eds vol 5863Springer Berlin Heidelberg 2009 pp 323ndash330

[11] U Kang E Papalexakis A Harpale and C Faloutsos ldquoGigaTensorScaling tensor analysis up by 100 times - algorithms and discoveriesrdquoin Proceedings of the 18th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ser KDD rsquo12 New York NYUSA ACM 2012 pp 316ndash324

[12] C Tsourakakis ldquoMACH Fast randomized tensor decompositionsrdquoin Proceedings of the 2010 SIAM International Conference on DataMining 2010 pp 689ndash700

[13] P Drineas and M W Mahoney ldquoA randomized algorithm for a tensor-based generalization of the singular value decompositionrdquo Linear Alge-bra and its Applications vol 420 no 2ndash3 pp 553 ndash 571 2007

[14] C F Caiafa and A Cichocki ldquoGeneralizing the column-row matrix de-composition to multi-way arraysrdquo Linear Algebra and its Applicationsvol 433 no 3 pp 557 ndash 573 2010

[15] M Vasilescu and D Terzopoulos ldquoMultilinear analysis of image en-sembles Tensorfacesrdquo in Computer Vision mdash ECCV 2002 ser LectureNotes in Computer Science A Heyden G Sparr M Nielsen andP Johansen Eds Springer Berlin Heidelberg 2002 vol 2350 pp447ndash460

[16] O Duchenne F Bach I-S Kweon and J Ponce ldquoA tensor-basedalgorithm for high-order graph matchingrdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 33 no 12 pp 2383ndash2395 Dec2011

12

[17] A Rajwade A Rangarajan and A Banerjee ldquoImage denoising usingthe higher order singular value decompositionrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 35 no 4 pp 849ndash8622013

[18] V De Silva and L-H Lim ldquoTensor rank and the ill-posedness of the bestlow-rank approximation problemrdquo SIAM Journal on Matrix Analysis andApplications vol 30 no 3 pp 1084ndash1127 September 2008

[19] G Zhou A Cichocki Q Zhao and S Xie ldquoNonnegative matrixand tensor factorizations An algorithmic perspectiverdquo IEEE SignalProcessing Magazine vol 31 no 3 pp 54ndash65 May 2014

[20] G Zhou A Cichocki and S Xie ldquoFast nonnegative matrixtensorfactorization based on low-rank approximationrdquo IEEE Transactions onSignal Processing vol 60 no 6 pp 2928 ndash2940 June 2012

[21] V Cevher S Becker and M Schmidt ldquoConvex optimization forbig data Scalable randomized and parallel algorithms for big dataanalyticsrdquo IEEE Signal Processing Magazine vol 31 no 5 pp 32ndash43Sept 2014

[22] R Bro and C A Andersson ldquoImproving the speed of multiway algo-rithms Part ii Compressionrdquo Chemometrics and Intelligent LaboratorySystems vol 42 no 1ndash2 pp 105 ndash 113 1998

[23] N Sidiropoulos E Papalexakis and C Faloutsos ldquoParallel randomlycompressed cubes A scalable distributed architecture for big tensordecompositionrdquo IEEE Signal Processing Magazine vol 31 no 5 pp57ndash70 Sept 2014

[24] P Comon X Luciani and A L F de Almeida ldquoTensor decompositionsalternating least squares and other talesrdquo Journal of Chemometricsvol 23 no 7-8 pp 393ndash405 2009

[25] J Cohen R Farias and P Comon ldquoFast decomposition of largenonnegative tensorsrdquo IEEE Signal Processing Letters vol 22 no 7pp 862ndash866 July 2015

[26] L De Lathauwer B De Moor and J Vandewalle ldquoA multilinearsingular value decompositionrdquo SIAM Journal on Matrix Analysis andApplications vol 21 pp 1253ndash1278 2000

[27] E Papalexakis N Sidiropoulos and R Bro ldquoFrom k-means to higher-way co-clustering Multilinear decomposition with sparse latent factorsrdquoIEEE Transactions on Signal Processing vol 61 no 2 pp 493ndash506Jan 2013

[28] Z Qian X Chen N Kang M Chen Y Yu T Moscibroda andZ Zhang ldquoMadLINQ Large-scale distributed matrix computation forthe cloudrdquo in Proceedings of the 7th ACM European Conference onComputer Systems ser EuroSys rsquo12 New York NY USA ACM2012 pp 197ndash210

[29] N Halko P Martinsson and J Tropp ldquoFinding structure with ran-domness Probabilistic algorithms for constructing approximate matrixdecompositionsrdquo SIAM Review vol 53 no 2 pp 217ndash288 2011

[30] N Vervliet O Debals L Sorber and L De Lathauwer ldquoBreaking thecurse of dimensionality using decompositions of incomplete tensorsTensor-based scientific computing in big data analysisrdquo IEEE SignalProcessing Magazine vol 31 no 5 pp 71ndash79 Sept 2014

[31] G Zhou A Cichocki Q Zhao and S Xie ldquoEfficient nonnegativeTucker decompositions Algorithms and uniquenessrdquo IEEE Transactionson Image Processing (Submitted) p 13 2014 [Online] Available[ArXive-prints]httparxivorgabs14044412

[32] B W Bader and T G Kolda ldquoMATLAB tensor toolbox version25rdquo Feb 2012 [Online] Available httpcsmrcasandiagovsimtgkoldaTensorToolbox

[33] C A Andersson and R Bro ldquoThe N -way toolbox for MATLABrdquo 2000[Online] Available httpwwwmodelslifekudksourcenwaytoolbox

[34] H Phan P Tichavsky and A Cichocki ldquoFast alternating LS algorithmsfor high order CANDECOMPPARAFAC tensor factorizationsrdquo IEEETransactions on Signal Processing vol 61 no 19 pp 4834ndash4846 2013

[35] N Sidiropoulos E Papalexakis and C Faloutsos ldquoA parallel algorithmfor big tensor decomposition using randomly compressed cubes (PARA-COMP)rdquo in 2014 IEEE International Conference on Acoustics Speechand Signal Processing (ICASSP) May 2014 pp 1ndash5

[36] A Cichocki and A-H Phan ldquoFast local algorithms for large scalenonnegative matrix and tensor factorizationsrdquo IEICE Transactions onFundamentals of Electronics Communications and Computer Sciences(Invited paper) vol E92-A no 3 pp 708ndash721 2009

[37] D Cai X He J Han and T Huang ldquoGraph regularized nonnegativematrix factorization for data representationrdquo IEEE Transactions on

Pattern Analysis and Machine Intelligence vol 33 no 8 pp 1548ndash1560 aug 2011

[38] M Rodriguez J Ahmed and M Shah ldquoAction MACH a spatio-temporal maximum average correlation height filter for action recogni-tionrdquo in IEEE Conference on Computer Vision and Pattern Recognition2008 CVPR 2008 June 2008 pp 1ndash8

[39] R Baraniuk V Cevher and M Wakin ldquoLow-dimensional models fordimensionality reduction and signal recovery A geometric perspectiverdquoProceedings of the IEEE vol 98 no 6 pp 959ndash971 June 2010

  • I Introduction
  • II Overview of Tucker and Polyadic Decompositions
  • III Efficient CP Decompositions of Tensors Based on Their Tucker Approximations
    • III-A Overview of CPD Algorithms
    • III-B CPD of Big Tensors by Based on Their Tucker Compression
      • IV Distributed Randomized Tucker Decompositions (RandTucker)
        • IV-A Randomized Tucker Decomposition
        • IV-B Distributed Randomized Tucker Decomposition
        • IV-C Nonnegative Tucker Decomposition
          • V Simulations
            • V-A Synthetic data
            • V-B The COIL-100 Images Clustering Analysis
            • V-C UCF Sports Action Videos Analysis
              • VI Conclusion
              • References

7

Workeramp17amp Workeramp19amp Workeramp21amp Workeramp23amp

Workeramp9amp Workeramp11amp Workeramp15amp

Workeramp1amp Workeramp3amp Workeramp5amp

Workeramp13amp

Workeramp7amp

Y(111)

Y(211)

Y(121)

Y(221) Y(231)

Y(131) Y(141)

Y(241)

(a)

P 2 NS1S2middotmiddotmiddotSN+

= snk = ind2subp(n)(sn k)

s = ind2sub(p(s1 s2 sN ))

(Using P(n))

(Using P)

Worker p(s)

p(s)Y(s)

(b)

Fig 5 Demonstration of small data tensors and the virtual computing resource tensor P in a cluster (a) An example Y isinR2Itimes4Jtimes3K is partitioned into smaller blocks Y(s) with the same size I times J timesK thereby leading to the 2times 4times 3 computingresource tensor P identifying a total of 24 workers (b) An entry p(s) isin 1 2 prodn Sn of the computing resource tensorP indicates that Worker p(s) processes the sub-tensor Y(s) where s = [s1 s2 sN ]T = ind2sub(p(s)) Hence in (a)p(1 3 2) = 13 means Worker 13 processes the sub-tensor Y(132) as [1 3 2] = ind2sub(13) = p(2)(3 3)

Given a big N th-order tensor Y isin RI1timesI2timesmiddotmiddotmiddottimesIN we parti-tion each dimension In into a total of Sn disjoint subsets eachof which consists of LSn indices with

sumSn

sn=1 Lsn = In Assuch the original big tensor is partitioned into a total of W =S1S2 middot middot middotSN sub-tensors and each sub-tensor can be denotedby Y(s1s2sN ) with the size of Ls1 times Ls2 times middot middot middot times LsN where sn isin NSn n = 1 2 N At the same time supposewe have a computer cluster with W workers (computationalengines)3 We imagine that the W workers are virtuallyspatially permuted to form a S1 times S2 times middot middot middot times SN tensor suchthat each worker is just associated with one sub-tensor seeFig5(a) In other words we will have two types of tensorsbull The small data tensors obtained by partitioning Y

Y(s) = Y(s1s2sN ) isin RLs1timesLs2timesmiddotmiddotmiddottimesLsN (19)

where s =[s1 s2 middot middot middot sN

]Tis used to specify

the (s1 s2 sN )th-block of Y sn isin NSn n =

1 2 N see Fig5(a) for an illustrationbull The computing resource tensor P isin NS1timesS2timesmiddotmiddotmiddottimesSN

+

formed by reshaping [1 2 W ] that identifies theW = S1S2 middot middot middotSN workers and the associated data ten-sors The entries of P have a one-to-one correspondencewith the sub-tensors Y(s) meaning that Worker p(s)process the sub-tensor Y(s) see Fig5(b)

It can be seen the RandTucker and the RandTucker2i majorlyuse four basic operations 1) Mode-n unfolding of a tensor2) Matrix-matrix product to obtain Y(n)Ω

(n) 3) Generating aGaussian random matrix Ω 4) Estimating an orthogonal basisof the range of Y(n)Ω

(n) Below we discuss how to performthese four operations in a distributed computing system

Mode-n unfolding of tensor Y We unfold the computingresource tensor P isin NS1timesS2timesmiddotmiddotmiddottimesSN

+ along mode-n to obtain

3We have assumed that each sub-tensor can be efficiently handled in by theassociated worker otherwise we increase the number of workers and decreasethe size of each sub-tensor accordingly

P(n) isin NSntimesK+ K =

prodj 6=n Sj Let p(n)(sn k) be the

(sn k)th entry of P(n) and define

s = πsnk = ind2sub(p(n)(sn k))

sn = 1 2 Sn k = 1 2 prod

j 6=nSj

(20)

where the function ind2sub(middot) converts an entry of P(n) iep(n)(sn k) a worker identifier into an N -by-1 vector ofsubscript values determining the position of p(n)(sn k) in thetensor P which allows us to locate the sub-tensor associatedwith this worker In other words the above (20) means thatthe tensor Yπsnk is processed by Worker p(n)(sn k) Then itcan be seen that

Y(n) =

Yπ11

(n) Yπ12

(n) Yπ1K

(n)

Yπ21

(n) Yπ22

(n) Yπ2K

(n)

Y

πSn1

(n) YπSn2

(n) YπSnK

(n)

isin RIntimes

prodj 6=n Ij

(21)where Y

πsnk

(n) is the mode-n martricization of Yπsnk pro-cessed by Worker p(n)(sn k) sn isin NSn

and k isin NK In summary to obtain Y(n) we perform mode-n unfolding

of Y(s)(n) simultaneously Then the (sn k)th sub-matrix of Y(n)

will be given by Worker p(n)(sn k) as shown in (21)Matrix-matrix product Large scale matrix-matrix product

can be performed by using any existing technique As here themode-n unfolding matrix Y(n) has already been partitionedand distributed we apply the matrix partitioning techniquewe compute the product Y(n)Ω

(n) by using

Z(n)1

Z(n)2

middot middot middotZ

(n)Sn

= Y(n)Ω

(n) = Y(n)

Ω(n)1

Ω(n)2

Ω(n)K

isin RIntimesRn (22)

where Y(n) is given by (21) Ω(n)k isin R

prodj 6=n LSj

timesRn are thesub-matrices by partitioning the random Gaussian matrix Ω(n)

8

50

100

150

200

250

300

350

400

1x108 2x108 3x108 5x108 7x108 1x109 1x109

Scale of the input data ( I N)

Tim

e (s)

FFCPCPminusALSPARACOMP

(a)

2

4

6

8

10

12

14

16

18

1x109 1x1012 8x1012 3x1013 6x1013 1x1014

Scale of the input data (I N)

Tim

e (s

)

(b)

Fig 6 (a) Time consumption comparison of the FFCP the CP-ALS and the PARACOMP on condition that the data tensorof I times I times I can be loaded into memory For the FFCP the RandTucker2i was used to compress the data with p = 15 (b)Scalability of the FFCP algorithm provided that the given big tensor was in its Tucker format The consumed time by theFFCP increases nearly linearly with respect to I

into K parts From (22) we have Z(n)sn =

sumKk=1 Y

πsnk

(n) Ω(n)k

and all the products Yπsnk

(n) Ω(n)k are computed on Workers

p(πsnk) simultaneously and then are broadcasted to Workerp(n)(sn k0) typically k0 = 1 Finally for each sn isin NSn wesum Y

πsnk

(n) Ω(n)k up over k on Worker p(n)(sn k0) to yield

Z(n)sn Generating the Gaussian random matrix Ω(n) Note that

Generating the Gaussian random matrix Ω^(n). Note that the workers identified by the kth column of P_(n), i.e. the workers p_(n)(s_n, k) with s_n = 1, 2, ..., S_n, share the same random Gaussian matrix Ω_k^(n) (corresponding to the kth column of blocks of Y_(n) in (21)). Hence Ω_k^(n) can be generated on one worker, say p_(n)(1, k), and then broadcast to all the workers p_(n)(s_n, k) with s_n ≠ 1. However, this approach suffers from heavy communication overhead. A more economical method is to broadcast only the parameters (or the random number generators) that control the random number generation; the random matrices Ω_k^(n) are then generated on all workers by using the received random number generators. In this way, for each k the workers p_(n)(s_n, k), s_n ∈ ℕ_{S_n}, generate identical random Gaussian matrices Ω_k^(n) simultaneously.
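The "broadcast the seed, not the matrix" idea can be sketched as below; every worker owning a block in column k of (21) seeds its generator with the same value, so all of them reproduce an identical Ω_k^(n) without any matrix being shipped. The seed scheme and shapes are hypothetical.

    # Sketch: identical Gaussian matrices from a shared seed instead of a broadcast matrix.
    import numpy as np

    def local_omega(k, shape, base_seed=1234):
        # same (base_seed, k) on every worker in column k => identical matrices
        rng = np.random.default_rng((base_seed, k))
        return rng.standard_normal(shape)

    # two different "workers" in the same column k reproduce the same Omega_k
    assert np.array_equal(local_omega(2, (6, 3)), local_omega(2, (6, 3)))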

Estimating an orthonormal basis of the columns of Y_(n)Ω^(n). To this end we perform the eigenvalue decomposition of ∑_{s_n=1}^{S_n} (Z_{s_n}^(n))^T Z_{s_n}^(n) = UΣU^T. Then the mode-n factor matrix U^(n) with orthonormal columns can be computed from

    U_{s_n}^(n) = Z_{s_n}^(n) U Σ^{-1/2}    (23)

in parallel on the workers p_(n)(s_n, 1), where U_{s_n}^(n) is the s_n-th sub-matrix of U^(n), s_n ∈ ℕ_{S_n}.
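A small sketch of (23) is given below: only the small R_n × R_n Gram matrices Z_{s_n}^T Z_{s_n} need to be exchanged, the eigendecomposition of their sum gives U and Σ, and each worker orthonormalizes its own row block locally. Block sizes here are arbitrary.

    # Sketch of (23): distributed orthonormalization via summed Gram matrices.
    import numpy as np

    rng = np.random.default_rng(0)
    Z_blocks = [rng.standard_normal((50, 4)) for _ in range(3)]   # local Z_(n),s_n

    G = sum(Z.T @ Z for Z in Z_blocks)            # reduce of small Gram matrices
    evals, U = np.linalg.eigh(G)                  # G = U diag(evals) U^T
    U_blocks = [Z @ U @ np.diag(evals ** -0.5) for Z in Z_blocks]  # eq. (23)

    # the stacked result has orthonormal columns
    Q = np.vstack(U_blocks)
    assert np.allclose(Q.T @ Q, np.eye(Q.shape[1]), atol=1e-8)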

After U^(n) has been estimated, Y is updated as Y ← Y ×_n U^(n)T, which can be realized using the above mode-n tensor unfolding and matrix-matrix product. After this update, the nth dimension of each sub-tensor Y^(s) is reduced from L_{s_n} to R_n. We repeat the above procedure for n = 1, 2, ..., N. As the last step, the core tensor is computed as G = ∑_{w=1}^{W} Y^(w), i.e. the sum of the sub-tensors distributed on all workers.
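The update Y ← Y ×_n U^(n)T reuses exactly the unfold / multiply / fold-back machinery described above. The sketch below shows a generic mode-n product realized this way; shapes are illustrative.

    # Sketch: mode-n product via unfolding, matrix product, and refolding.
    import numpy as np

    def mode_n_product(Y, M, n):
        """Multiply tensor Y by matrix M along mode n (M @ Y_(n), folded back)."""
        Yn = np.moveaxis(Y, n, 0).reshape(Y.shape[n], -1)      # mode-n unfolding
        Zn = M @ Yn                                            # matrix-matrix product
        new_shape = (M.shape[0],) + Y.shape[:n] + Y.shape[n + 1:]
        return np.moveaxis(Zn.reshape(new_shape), 0, n)        # fold back

    rng = np.random.default_rng(0)
    Y = rng.standard_normal((8, 9, 10))
    U = np.linalg.qr(rng.standard_normal((8, 3)))[0]           # orthonormal U^(n)
    Y_small = mode_n_product(Y, U.T, n=0)                      # dimension 8 -> 3
    print(Y_small.shape)                                       # (3, 9, 10)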

C Nonnegative Tucker Decomposition

In our recent papers [19], [31] we also showed how to efficiently perform nonnegative Tucker decomposition based on a Tucker approximation. The idea of that method is similar to the FFCP method: after the original high-dimensional raw data has been replaced by its Tucker approximation with low multilinear rank, the computational complexity of the gradients with respect to the nonnegative core tensor and factor matrices can be significantly reduced in terms of both time and space. Readers are referred to [19], [31] for more details. In summary, Tucker compression can play a fundamental role in achieving highly efficient tensor decompositions, as illustrated in Fig. 1. Based on the proposed randomized Tucker decomposition methods, we can easily perform various types of constrained/unconstrained tensor decompositions for very large-scale tensor data.

V SIMULATIONS

A Synthetic data

In this subsection we investigate the performance of the proposed FFCP algorithm by using synthetic data. All the experiments were performed in 64-bit MATLAB on a computer with an Intel Core i7 CPU (3.33 GHz) and 24 GB memory, running 64-bit Windows 7.


TABLE II: Comparison between the algorithms when they were applied to perform CPD of synthetic data. The noise was Gaussian with SNR = 10 dB in all cases.

    Algorithm  | N=3, I=200, R=10        | N=3, I=500, R=20        | N=6, I=20, R=5
               | Fit (%)       Time (s)  | Fit (%)       Time (s)  | Fit (%)       Time (s)
    CP-ALS     | 82.1 ± 10.2   39.5      | 80.5 ± 6.8    64.0      | 89.6 ± 11.2   18.8
    FastCPALS  | 79.8 ± 9.0    10.4      | 82.9 ± 5.6    21.1      | 92.4 ± 10.9    6.6
    PARAFAC    | 82.5 ± 7.9    78.2      | 80.5 ± 6.1   158.3      | 87.6 ± 11.4   64.3
    FFCP       | 81.9 ± 8.0     3.5      | 82.8 ± 5.6     4.0      | 93.4 ± 8.6     5.0

TABLE III: Comparison between the algorithms when they were applied to perform nonnegative CPD of synthetic data. The noise was Gaussian with SNR = 10 dB in all cases.

    Algorithm  | N=3, I=200, R=10        | N=3, I=500, R=20        | N=6, I=20, R=5
               | Fit (%)       Time (s)  | Fit (%)       Time (s)  | Fit (%)       Time (s)
    CP-NMU     | 97.3 ± 1.2    66.4      | 97.6 ± 0.8  1695.2      | 91.7 ± 5.6   426.5
    CP-HALS    | 97.8 ± 1.0    12.1      | 97.7 ± 0.7   228.1      | 95.1 ± 4.1   101.4
    FFCP-MU    | 99.0 ± 0.5     2.3      | 99.3 ± 0.2    12.1      | 93.4 ± 4.1     7.0
    FFCP-HALS  | 99.1 ± 0.1     6.1      | 99.5 ± 0.1    29.9      | 94.3 ± 3.5    11.8

Simulations on CPD. In each run a tensor Y* was generated by using the model (2), where the entries of the latent factor matrices A^(n) ∈ ℝ^{I×R} (i.e. the ground truth), n = 1, 2, ..., N, were drawn from i.i.d. standard normal distributions. Then independent Gaussian noise (with SNR = 10 dB) was added to Y* to yield the noisy observation tensor Y. We tested three different combinations of N, I, and R, as shown in TABLE II. The algorithms CP-ALS and PARAFAC, which are the standard algorithms included in the Tensor Toolbox 2.5 [32] and the N-way toolbox [33], respectively, were compared as the baseline; the FastCPALS algorithm [34] was also compared. For all the algorithms the maximum number of iterations was set to 1000, and the stopping criterion |Fit(Y, Ŷ_t) − Fit(Y, Ŷ_{t+1})| < 10⁻⁶ was applied, where Ŷ_t is the estimate of Y* obtained after the t-th iteration. The mean values and standard deviations of Fit(Y*, Ŷ) over 20 Monte-Carlo runs are detailed in TABLE II, along with the averaged time consumed by each algorithm. From the table we see that the FFCP algorithm is significantly faster than the other algorithms, without significant deterioration of performance, although it works on a compressed tensor instead of the original tensor. We also tested the PARACOMP [35] method, which was specially proposed for big tensor decompositions. It turned out that the PARACOMP was sensitive to noise: in noise-free cases it often achieved a perfect Fit of 1; otherwise, when the SNR was 10 dB, it generally obtained a Fit lower than 0.3.
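The stopping rule above can be sketched in a few lines. We assume here, as a guess consistent with the surrounding text rather than a quote of the paper's definition, the common fit measure Fit(Y, Ŷ) = 1 − ‖Y − Ŷ‖_F / ‖Y‖_F.

    # Sketch of the stopping rule used in the experiments (fit definition assumed).
    import numpy as np

    def fit(Y, Y_hat):
        return 1.0 - np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y)

    def converged(Y, Y_hat_prev, Y_hat_curr, tol=1e-6):
        return abs(fit(Y, Y_hat_prev) - fit(Y, Y_hat_curr)) < tol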

In Fig. 6 we demonstrate the scalability of the proposed FFCP method, where the RandTucker2i was used to compress the data with p = 15. From Fig. 6(a) we can see that the FFCP shows very promising scalability: its time consumption increases almost linearly with respect to the scale of the input data, on condition that the rank is fixed; in contrast, the time consumed by the other algorithms increases nearly exponentially. Note that both the FFCP and the PARACOMP method [35] were designed mainly for big data analytics, and this comparison does not precisely reflect their actual time consumption difference in a distributed computational environment. However, it does show that the proposed method is more efficient than the PARACOMP, which can also be inferred from two facts: 1) the PARACOMP compresses the observed big tensor at least P times (P ≥ I/R), whereas in our RandTucker2i-based FFCP we only need to do it twice (or even once, when RandTucker is applied); 2) although both of them need to perform CPD on small compressed tensors, the PARACOMP method needs to do this P times, whereas the FFCP needs to do it only once, as the final step. Particularly, in the PARACOMP any failure of the CPDs of the P sub-tensors will degrade the final results. Fig. 6(b) shows the scalability of the FFCP provided that the given big tensor is in its Tucker format. For a tensor of such scale, if we use the original full tensor directly, all the algorithms run out of memory; but if we use its Tucker compression, the FFCP can perform constrained/unconstrained CPD efficiently even on our desktop computer.

Simulations on Nonnegative CPD. In these experiments the data were generated as above, except that the latent factor matrices were generated from i.i.d. exponential distributions with the mean parameter μ = 10. Then we randomly selected 10% of the entries of each factor to be zeros, to generate sparse factor matrices. As the PARAFAC algorithm was very slow, we only compared the CP-NMU algorithm, a standard nonnegative CPD algorithm based on the multiplicative update rule included in the Tensor Toolbox [32], and the CP-HALS algorithm [36] that is based on the hierarchical ALS (HALS) method. For a fair comparison, the corresponding FFCP implementations were compared. The results averaged over 20 Monte-Carlo runs are detailed in TABLE III. From the table, the FFCP implementations were not only significantly faster than their competitors but also often achieved higher Fit values. To the best of our knowledge, the FFCP is one of the most efficient algorithms for nonnegative CPD.
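The synthetic nonnegative data generation just described can be sketched as below; the sizes match the first column of TABLE III, and the exponential mean and sparsity level follow the text.

    # Sketch of the sparse nonnegative factor generation used in the experiments.
    import numpy as np

    rng = np.random.default_rng(0)
    I, R, N = 200, 10, 3
    A = []
    for _ in range(N):
        a = rng.exponential(scale=10.0, size=(I, R))   # i.i.d. exponential, mean 10
        mask = rng.random((I, R)) < 0.10               # ~10% zeros => sparse factors
        a[mask] = 0.0
        A.append(a)

    # full CP tensor: sum_r a_r^(1) o a_r^(2) o a_r^(3)
    Y = np.einsum('ir,jr,kr->ijk', A[0], A[1], A[2])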

To demonstrate the high efficiency of the FFCP algorithm, in this simulation the entries of A^(n) ∈ ℝ^{10000×10}, n = 1, 2, 3, 4, were drawn from i.i.d. exponential distributions with the mean parameter μ = 10. Again, we randomly selected 20% of the entries of each A^(n) to be zeros to obtain sparse factor matrices. By using (2) and the A^(n) generated above, a big 4th-order tensor Y ∈ ℝ^{10000×10000×10000×10000} with R = 10 was obtained, which contains a total of 10^16 entries in its full format. It is certainly impossible to decompose such a huge tensor on a standard desktop computer. Suppose we have already obtained its Tucker approximation Y_T = ⟦G; Ã^(1), Ã^(2), Ã^(3), Ã^(4)⟧ ≈ Y. Due to the rotation ambiguity of Tucker decompositions, we know that theoretically A^(n) = Ã^(n)Q^(n), where Q^(n) is not necessarily nonnegative. Hence, applying CPD on the core tensor G directly generally cannot guarantee the nonnegativity of the recovered factor matrices. Instead, we applied the FFCP algorithm on the Tucker-compressed tensor Y_T to extract nonnegative factor matrices. In a typical run, the FFCP-HALS consumed only 774 seconds, and the recovered components had a signal-to-interference ratio⁴ (SIR) higher than 200 dB, i.e., they almost perfectly match the true components. Although this example only shows an ideal noise-free case, it illustrates how the FFCP algorithm can be applied to perform nonnegative CPD of huge tensors, as long as good low-rank approximations of them can be obtained.
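For reference, the SIR index of footnote 4 can be computed as in the short sketch below. The numerator/denominator ordering follows our reconstruction of the footnote (higher SIR means a better match); the standardization step mirrors the zero-mean, unit-variance normalization stated there.

    # Sketch of the SIR performance index from footnote 4 (ordering reconstructed).
    import numpy as np

    def sir_db(a_true, a_est):
        z = lambda v: (v - v.mean()) / v.std()       # zero-mean, unit-variance
        a, a_hat = z(a_true), z(a_est)
        return 20 * np.log10(np.linalg.norm(a) / np.linalg.norm(a - a_hat))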

B The COIL-100 Images Clustering Analysis

In this experiment we applied the proposed NTD algorithms to the clustering analysis of the objects in the Columbia Object Image Library (COIL-100). The COIL-100 database consists of 7200 images of 100 objects, each of which has 72 images taken from different poses; hence these images naturally form a 128 × 128 × 3 × 7200 tensor Y. The data tensor was decomposed by using the tensor decomposition algorithms, and the two t-Distributed Stochastic Neighbor Embedding (t-SNE) components of A^(4) were then used as features to cluster and visualize the results. The K-means algorithm was used to cluster the objects, and the accuracy of clustering was defined as in [37]. To justify the proposed distributed RandTucker algorithm, we used the MATLAB Parallel Computing Toolbox, which allows us to use the full processing power of multicore desktops by executing applications on workers (MATLAB computational engines) that run locally. The tensor Y was partitioned into 4 sub-tensors with the size of 128 × 128 × 3 × 1800 and distributed on 4 workers. As the baseline, the Tucker-ALS algorithm included in [32] was compared, and its multilinear rank was empirically set as R = (20, 20, 3, 20); for the RandTucker method, R = (30, 30, 3, 50) was used. Both of these algorithms used 2 iterations. Their performances are detailed in TABLE IV. It can be seen that no significant deterioration was observed between the randomized

⁴The performance index SIR is defined as SIR(a_r^(n), â_r^(n)) = 20 log(‖a_r^(n)‖₂ / ‖a_r^(n) − â_r^(n)‖₂), where â_r^(n) is an estimate of a_r^(n); both a_r^(n) and â_r^(n) are normalized to be zero-mean and unit-variance.

TABLE IV: Comparison between the algorithms when they were applied to clustering analysis of the COIL-100 objects.

    Algorithm            Fit (%)   Accuracy (%)
    Tucker-ALS           70.3      69.2
    RandTucker2i         69.1      68.4
    RandTucker           68.6      69.8
    FFCP (Tucker-ALS)    67.1      71.3
    FFCP (RandTucker2i)  64.7      68.0


Fig. 7: Visualization of the clustering results obtained by applying the distributed RandTucker2i to the COIL-100 image database. The two t-SNE components of A^(4) were used for visualization.

methods and the deterministic methods, which suggests that the distributed RandTucker algorithm can be applied to really large-scale problems. The clustering results obtained by the RandTucker2i are visualized in Fig. 7 by using the two t-SNE components of A^(4).
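The feature-extraction and clustering step described above can be sketched as follows; the scikit-learn usage and variable names are our assumptions, not the authors' code, and the random factor matrix stands in for the actual A^(4) produced by the decomposition.

    # Sketch: t-SNE embedding of A^(4) followed by K-means clustering.
    import numpy as np
    from sklearn.manifold import TSNE
    from sklearn.cluster import KMeans

    A4 = np.random.default_rng(0).random((7200, 20))   # placeholder for the factor A^(4)

    features = TSNE(n_components=2, random_state=0).fit_transform(A4)
    labels = KMeans(n_clusters=100, n_init=10, random_state=0).fit_predict(features)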

C UCF Sports Action Videos Analysis

This data set consists of a set of videos collected from various sports that are typically featured on broadcast television channels [38] (see http://crcv.ucf.edu/data/UCF_Sports_Action.php). To avoid rescaling the videos and to keep all of them in the same specification, we selected 5 actions, namely Diving, Riding-Horse, Run, Swing, and Walk-Front, each of which contains 10 video clips with the size of 404 × 720 × 3 × 35 (35 is the number of frames). We concatenated these videos to form a 5th-order tensor. This tensor has about 1.6 × 10^9 entries, and more than 99% of them are nonzero. When it is saved in double format it consumes about 11.7 GB of space. We used tensor decomposition methods to decompose this tensor to extract low-dimensional features for further clustering analysis.


Fig. 8: The UCF Sports Action videos. The first row shows a snapshot of each action, and the second row shows 5 snapshots from a video of diving.

When we applied the CP-ALS method included in the Tensor Toolbox to decompose 30 videos, it consumed about 4 hours for only one iteration; when 40 videos were considered, the CP-ALS ran out of memory. We configured a small cluster with 3 computers (a desktop computer plus two laptop computers, each of which has an Intel i7 dual-core CPU and 8 GB physical memory, running Windows 7 and MATLAB 2012b). In the first stage the RandTucker2i was applied with R = (35, 35, 3, 10, 20). It consumed about 104 seconds to obtain the Tucker approximation of the original tensor, with a Fit of 72.3% to the observation data. Then the FFCP was applied using a desktop computer with R = 10. In this case only 75 seconds were required without nonnegativity constraints, and 7 seconds were consumed when nonnegativity constraints were imposed. Finally, we used A^(5) as the features to cluster the videos and achieved an accuracy of 58% in a typical run. These experiments showed that the proposed algorithms can be applied to decompose very large-scale tensors by using a distributed computing cluster.

VI CONCLUSION

Tensor decompositions are particularly promising for big multiway data analytics, as they are able to discover more complex internal structures and correlations of data gathered from multiple aspects, devices, trials, and subjects. While most existing tensor decomposition methods are not designed to meet the major challenges posed by big data, in this paper we proposed a unified framework for the decomposition of big tensors with relatively low multilinear rank. The key idea is that, based on a Tucker approximation of a big tensor, a large set of unconstrained/constrained tensor decompositions can be efficiently performed. We developed a flexible fast algorithm for the CP decomposition of tensors using their Tucker representation, and a distributed randomized Tucker decomposition approach which is particularly suitable for big tensor analysis. Simulation results confirmed that the proposed algorithms are highly efficient and scalable.

While the randomized Tucker decomposition approach showed close-to-optimal performance, the corresponding error bounds need to be investigated theoretically in the future. We also believe that many randomized (or related) approaches proposed for the low-rank approximation of matrices [29], [39] could be applied to Tucker decompositions; however, a comparative study in tensor decomposition scenarios is desired. Particularly, if we have efficient and robust unconstrained CP decomposition algorithms, we can use the CP model to compress the raw data instead of the Tucker model, which suffers from the curse of dimensionality for very high-order tensors. Hence, developing more reliable and robust CP decomposition approaches is another important task. All these open issues deserve further study in the near future.

REFERENCES

[1] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation. Chichester, UK: John Wiley & Sons Ltd, 2009.

[2] N. Sidiropoulos, R. Bro, and G. Giannakis, "Parallel factor analysis in sensor array processing," IEEE Transactions on Signal Processing, vol. 48, no. 8, pp. 2377-2388, Aug. 2000.

[3] L. De Lathauwer, J. Castaing, and J. Cardoso, "Fourth-order cumulant-based blind identification of underdetermined mixtures," IEEE Transactions on Signal Processing, vol. 55, no. 6, pp. 2965-2973, 2007.

[4] P. Comon, "Tensors: A brief introduction," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 44-53, May 2014.

[5] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455-500, 2009.

[6] N. D. Sidiropoulos and R. Bro, "On the uniqueness of multilinear decomposition of N-way arrays," Journal of Chemometrics, vol. 14, no. 3, pp. 229-239, 2000.

[7] G. Zhou and A. Cichocki, "Canonical polyadic decomposition based on a single mode blind source separation," IEEE Signal Processing Letters, vol. 19, no. 8, pp. 523-526, Aug. 2012.

[8] L.-H. Lim and P. Comon, "Nonnegative approximations of nonnegative tensors," Journal of Chemometrics, vol. 23, no. 7-8, pp. 432-441, 2009.

[9] E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos, "ParCube: Sparse parallelizable tensor decompositions," in Proceedings of the 2012 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I, ser. ECML PKDD'12. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 521-536.

[10] A. Phan and A. Cichocki, "Advances in PARAFAC using parallel block decomposition," in Neural Information Processing, ser. Lecture Notes in Computer Science, C. Leung, M. Lee, and J. Chan, Eds., vol. 5863. Springer Berlin Heidelberg, 2009, pp. 323-330.

[11] U. Kang, E. Papalexakis, A. Harpale, and C. Faloutsos, "GigaTensor: Scaling tensor analysis up by 100 times - algorithms and discoveries," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '12. New York, NY, USA: ACM, 2012, pp. 316-324.

[12] C. Tsourakakis, "MACH: Fast randomized tensor decompositions," in Proceedings of the 2010 SIAM International Conference on Data Mining, 2010, pp. 689-700.

[13] P. Drineas and M. W. Mahoney, "A randomized algorithm for a tensor-based generalization of the singular value decomposition," Linear Algebra and its Applications, vol. 420, no. 2-3, pp. 553-571, 2007.

[14] C. F. Caiafa and A. Cichocki, "Generalizing the column-row matrix decomposition to multi-way arrays," Linear Algebra and its Applications, vol. 433, no. 3, pp. 557-573, 2010.

[15] M. Vasilescu and D. Terzopoulos, "Multilinear analysis of image ensembles: TensorFaces," in Computer Vision - ECCV 2002, ser. Lecture Notes in Computer Science, A. Heyden, G. Sparr, M. Nielsen, and P. Johansen, Eds. Springer Berlin Heidelberg, 2002, vol. 2350, pp. 447-460.

[16] O. Duchenne, F. Bach, I.-S. Kweon, and J. Ponce, "A tensor-based algorithm for high-order graph matching," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2383-2395, Dec. 2011.

[17] A. Rajwade, A. Rangarajan, and A. Banerjee, "Image denoising using the higher order singular value decomposition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 4, pp. 849-862, 2013.

[18] V. De Silva and L.-H. Lim, "Tensor rank and the ill-posedness of the best low-rank approximation problem," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 3, pp. 1084-1127, Sept. 2008.

[19] G. Zhou, A. Cichocki, Q. Zhao, and S. Xie, "Nonnegative matrix and tensor factorizations: An algorithmic perspective," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 54-65, May 2014.

[20] G. Zhou, A. Cichocki, and S. Xie, "Fast nonnegative matrix/tensor factorization based on low-rank approximation," IEEE Transactions on Signal Processing, vol. 60, no. 6, pp. 2928-2940, June 2012.

[21] V. Cevher, S. Becker, and M. Schmidt, "Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 32-43, Sept. 2014.

[22] R. Bro and C. A. Andersson, "Improving the speed of multiway algorithms: Part II: Compression," Chemometrics and Intelligent Laboratory Systems, vol. 42, no. 1-2, pp. 105-113, 1998.

[23] N. Sidiropoulos, E. Papalexakis, and C. Faloutsos, "Parallel randomly compressed cubes: A scalable distributed architecture for big tensor decomposition," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 57-70, Sept. 2014.

[24] P. Comon, X. Luciani, and A. L. F. de Almeida, "Tensor decompositions, alternating least squares and other tales," Journal of Chemometrics, vol. 23, no. 7-8, pp. 393-405, 2009.

[25] J. Cohen, R. Farias, and P. Comon, "Fast decomposition of large nonnegative tensors," IEEE Signal Processing Letters, vol. 22, no. 7, pp. 862-866, July 2015.

[26] L. De Lathauwer, B. De Moor, and J. Vandewalle, "A multilinear singular value decomposition," SIAM Journal on Matrix Analysis and Applications, vol. 21, pp. 1253-1278, 2000.

[27] E. Papalexakis, N. Sidiropoulos, and R. Bro, "From k-means to higher-way co-clustering: Multilinear decomposition with sparse latent factors," IEEE Transactions on Signal Processing, vol. 61, no. 2, pp. 493-506, Jan. 2013.

[28] Z. Qian, X. Chen, N. Kang, M. Chen, Y. Yu, T. Moscibroda, and Z. Zhang, "MadLINQ: Large-scale distributed matrix computation for the cloud," in Proceedings of the 7th ACM European Conference on Computer Systems, ser. EuroSys '12. New York, NY, USA: ACM, 2012, pp. 197-210.

[29] N. Halko, P. Martinsson, and J. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," SIAM Review, vol. 53, no. 2, pp. 217-288, 2011.

[30] N. Vervliet, O. Debals, L. Sorber, and L. De Lathauwer, "Breaking the curse of dimensionality using decompositions of incomplete tensors: Tensor-based scientific computing in big data analysis," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 71-79, Sept. 2014.

[31] G. Zhou, A. Cichocki, Q. Zhao, and S. Xie, "Efficient nonnegative Tucker decompositions: Algorithms and uniqueness," IEEE Transactions on Image Processing (submitted), p. 13, 2014. [Online]. Available: http://arxiv.org/abs/1404.4412

[32] B. W. Bader and T. G. Kolda, "MATLAB tensor toolbox version 2.5," Feb. 2012. [Online]. Available: http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox

[33] C. A. Andersson and R. Bro, "The N-way toolbox for MATLAB," 2000. [Online]. Available: http://www.models.life.ku.dk/source/nwaytoolbox

[34] H. Phan, P. Tichavsky, and A. Cichocki, "Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations," IEEE Transactions on Signal Processing, vol. 61, no. 19, pp. 4834-4846, 2013.

[35] N. Sidiropoulos, E. Papalexakis, and C. Faloutsos, "A parallel algorithm for big tensor decomposition using randomly compressed cubes (PARACOMP)," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1-5.

[36] A. Cichocki and A.-H. Phan, "Fast local algorithms for large scale nonnegative matrix and tensor factorizations," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences (invited paper), vol. E92-A, no. 3, pp. 708-721, 2009.

[37] D. Cai, X. He, J. Han, and T. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1548-1560, Aug. 2011.

[38] M. Rodriguez, J. Ahmed, and M. Shah, "Action MACH: A spatio-temporal maximum average correlation height filter for action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), June 2008, pp. 1-8.

[39] R. Baraniuk, V. Cevher, and M. Wakin, "Low-dimensional models for dimensionality reduction and signal recovery: A geometric perspective," Proceedings of the IEEE, vol. 98, no. 6, pp. 959-971, June 2010.


8

50

100

150

200

250

300

350

400

1x108 2x108 3x108 5x108 7x108 1x109 1x109

Scale of the input data ( I N)

Tim

e (s)

FFCPCPminusALSPARACOMP

(a)

2

4

6

8

10

12

14

16

18

1x109 1x1012 8x1012 3x1013 6x1013 1x1014

Scale of the input data (I N)

Tim

e (s

)

(b)

Fig 6 (a) Time consumption comparison of the FFCP the CP-ALS and the PARACOMP on condition that the data tensorof I times I times I can be loaded into memory For the FFCP the RandTucker2i was used to compress the data with p = 15 (b)Scalability of the FFCP algorithm provided that the given big tensor was in its Tucker format The consumed time by theFFCP increases nearly linearly with respect to I

into K parts From (22) we have Z(n)sn =

sumKk=1 Y

πsnk

(n) Ω(n)k

and all the products Yπsnk

(n) Ω(n)k are computed on Workers

p(πsnk) simultaneously and then are broadcasted to Workerp(n)(sn k0) typically k0 = 1 Finally for each sn isin NSn wesum Y

πsnk

(n) Ω(n)k up over k on Worker p(n)(sn k0) to yield

Z(n)sn Generating the Gaussian random matrix Ω(n) Note that

the workers identified by the kth column of P(n) ie theworkers p(n)(sn k) sn = 1 2 Sn share a same randomGaussian matrix Ω

(n)k (corresponding to the kth column of

blocks of Y(n) in (21)) Hence Ω(n)k can be generated on

one worker say p(n)(1 k) and then is broadcasted to all theworkers p(n)(sn k) with sn 6= 1 However this way suffersfrom heavy communication overhead A more economicalmethod is to broadcast the parameters (or the random numbergenerators) that control the random number generation andthen the random matrices Ω

(n)k are generated on all workers

by using the received random number generators as suchfor each k the workers p(n)(sn k) sn isin NSn

will generateidentical random Gaussian matrices Ω

(n)k simultaneously

Estimating an orthonormal basis of the columns ofY(n)Ω

(n) To this end we perform eigenvalue decompositionofsumSn

sn=1 Z(n)sn

TZ(n)sn = UΣUT Then the mode-n factor

matrix U(n) with orthonormal columns can be computed from

U(n)sn = Z(n)

sn UΣminus12 (23)

in parallel on the workers p(n)(sn 1) where U(n)sn is the snth

sub-matrix of U(n) sn isin NSn After U(n) has been estimated Y will be updated as Ylarr

Y timesn U(n)T which can be realized using the above mode-n

tensor unfolding and matrix-matrix product After update thenth dimension of Y(s) will be reduced to be Rn from LSn

We repeat the above procedure for n = 1 2 N As the

last step the core tensor is computed as G =sumWw=1 Y

(w) iethe sum of sub-tensors distributed on all workers

C Nonnegative Tucker Decomposition

In our recent papers [19] [31] we also showed how toefficiently perform nonnegative Tucker decomposition basedon a Tucker approximation The idea of that method is similarto the FFCP method after the original high-dimensional rawdata has been replaced by its Tucker approximation with lowmultilinear rank the computational complexity of gradientswith respect to the nonnegative core tensor and factor matri-ces can be significantly reduced in terms of both time andspace Readers are referred to [19] [31] for more detailsIn summary Tucker compression can play a fundamentalrole to achieve highly efficient tensor decompositions asillustrated in Fig1 Based on the proposed randomized Tuckerdecomposition methods we can easily perform various typesof constrainedunconstrained tensor decompositions for verylarge-scale tensor data

V SIMULATIONS

A Synthetic data

In this subsection we investigate the performance of theproposed FFCP algorithm by using synthetic data All the ex-periments were performed in 64-bit MATLAB on a computerwith Intel Core i7 CPU (333GHz) and 24GB memory running64-bit Windows 7

9

TABLE II Comparison between the algorithms when they were applied to perform CPD of synthetic data The noises wereGaussian with SNR=10dB in all the cases

Algorithm N = 3 I = 200 R = 10 N = 3 I = 500 R = 20 N = 6 I = 20 R = 5Fit () Time (s) Fit () Time (s) Fit () Time (s)

CP-ALS 821plusmn 102 395 805plusmn 68 640 896plusmn 112 188FastCPALS 798plusmn 90 104 829plusmn 56 211 924plusmn 109 66PARAFAC 825plusmn 79 782 805plusmn 61 1583 876plusmn 114 643FFCP 819plusmn 80 35 828plusmn 56 40 934plusmn 86 50

TABLE III Comparison between the algorithms when they were applied to perform nonnegative CPD of synthetic data Thenoises were Gaussian with SNR=10dB in all the cases

Algorithm N = 3 I = 200 R = 10 N = 3 I = 500 R = 20 N = 6 I = 20 R = 5Fit () Time (s) Fit () Time (s) Fit () Time (s)

CP-NMU 973plusmn 12 664 976plusmn 08 16952 917plusmn 56 4265CP-HALS 978plusmn 10 121 977plusmn 07 2281 951plusmn 41 1014FFCP-MU 990plusmn 05 23 993plusmn 02 121 934plusmn 41 70FFCP-HALS 991plusmn 01 61 995plusmn 01 299 943plusmn 35 118

Simulations on CPD In each run a tensor Ylowast was generatedby using the model (2) where the entries of the latent factormatrices A(n) isin RItimesR (ie the ground truth) were drawn fromiid standard normal distributions n = 1 2 N Thenindependent Gaussian noise (with SNR=10dB) was addedto Ylowast to yield noisy observation tensor Y We tested threedifferent combinations of N I and R as shown in TABLE IIThe algorithms CP-ALS PARAFAC which are the standardalgorithms included in the Tensor Toolbox 25 [32] and the N -way toolbox [33] respectively were compared as the baselineThe FastCPALS algorithm [34] was also compared For all thealgorithm the maximum iteration number was set as 1000and a stopping criterion |Fit(Y Yt) minus Fit(Y Yt+1)| lt 10minus6

was applied where Yt is an estimation of Ylowast obtained afterthe tth iteration The mean values and standard variationsof Fit(Ylowast Y) over 20 Monte-Carlo runs are detailed in TA-BLE II along with the averaged time consumed by eachalgorithm From the table we see that the FFCP algorithmis significantly faster than the other algorithms but withoutsignificant deterioration of performance although it workson a compressed tensor instead of the original tensor Wealso tested the PARACOMP [35] method which was speciallyproposed for big tensor decompositions It turned out that thePARACOMP was sensitive to noise in noise-free cases itoften achieved perfect Fit of 1 otherwise when the noise was10dB it generally obtained the Fit lower than 03

In Fig6 we demonstrated the scalability of the proposedFFCP method where the RandTucker2i was used to compressthe data with p = 15 From Fig6(a) we can see that theFFCP shows very promising scalability its time consumptionincreases almost linearly with respect to the scale of the inputdata on condition that the rank is fixed on the contrastthe time consumed by the other algorithms increased nearlyexponentially Note that both the FFCP and the PARACOMPmethod [35] were designed mainly for big data analyticsand this comparison did not precisely reflect their actual

time consumption difference in a distributed computationalenvironment However it did show that the proposed methodwas more efficient than the PARACOMP which can also be in-ferred from the obvious facts 1) the PARACOMP compressesthe observation big tensor at least P times (P ge IR) whereasin our RandTucker2i based FFCP we only need to do it twice(or even once when RandTucker is applied) 2) although bothof them need to perform CPD on small compressed tensorshowever the PARACOMP method needs to do this P timeswhereas the FFCP needs to do it only once as the final stepParticularly in the PARACOMP any failure of the CPDs ofP sub-tensors will degenerate the final results Fig6(b) showsthe scalability of the FFCP provided that the given big tensorwas in its Tucker format To such scale of tensor if we usethe original full tensor directly all the algorithms ran out ofmemory But if we use its Tucker compression the FFCPcan perform constrainedunconstrained CPD efficiently evenon our desktop computer

Simulations on Nonnegative CPD In this experiments thedata were generated as the above except that the latent factormatrices were generated from iid exponential distributionswith the mean parameter micro = 10 Then we randomly selected10 entries of each factor to be zeros to generate sparsefactor matrices As the PARAFAC algorithm was very slowwe only compared the CP-NMU algorithm which is a standardnonnegative CPD algorithm based on the multiplicative updaterule included in the Tensor Toolbox [32] and the CP-HALSalgorithm [36] that is based on the Hierarchical ALS (HALS)method For fair comparison their corresponding implemen-tations of FFCP were compared The results averaged over20 Monte-Carlo runs were detailed in TABLE III From theTABLE FFCP implementations were not only significantlyfaster than their competitors but also often achieved higherFit values To the best of our knowledge the FFCP is one ofthe most efficient algorithms for nonnegative CPD

To demonstrate the high efficiency of the FFCP algorithm

10

in this simulation the entries of A(n) isin R10000times10 n =1 2 3 4 were drawn from iid exponential distributions withthe mean parameter micro = 10 Again we randomly selected 20entries of A(n) to be zeros to obtain sparse factor matricesBy using (2) and A(n) generated above a big 4th-order tensorY isin R10000times10000times10000times10000 with R = 10 was obtainedwhich contains a total of 1016 entries in its full format Itis certainly impossible to decompose such a huge tensor in astandard desktop computer Suppose we have already obtainedits Tucker approximation YT = JGA(1)A(2)A(3)A(4)K asympY Due to the rotation ambiguity of Tucker decompositionswe know theoretically A(n) = A(n)Q(n) where Q(n) wasnot necessarily to be nonnegative Generally applying CPD onthe core tensor G directly cannot guarantee the nonnegativityof factor matrices Instead we applied the FFCP algorithm onthe Tucker compressed tensor YT to extract nonnegative factormatrices In a typical run the FFCP-HALS consumed only 774seconds and the recovered components were with the signal-to-interference ratio4 (SIR) higher than 200dB which almostperfectly match the true components Although this exampleonly shows an ideal noise-free case it illustrates how the FFCPalgorithm can be applied to perform nonnegative CPD of hugetensors as long as their good low-rank approximations can beobtained

B The COIL-100 Images Clustering Analysis

In this experiment we applied the proposed NTD algorithmsto the clustering analysis of the objects selected from theColumbia Object Image Library (COIL-100) The COIL-100database consists of 7200 images of 100 objects each ofwhich has 72 images taken from different poses hence theseimages naturally form a 128times 128times 3times 7200 tensor Y Thedata tensor was decomposed by using tensor decompositionalgorithms and then used the two t-Distributed StochasticNeighbor Embedding (t-SNE) components of A(4) as featuresto cluster and visualize the results The K-means algorithm wasused to cluster the objects and the accuracy of clustering wasdefined in [37]To justify the proposed distributed RandTuckeralgorithm we used the MATLAB parallel computing toolboxwhich allows us to use the full processing power of multicoredesktops by executing applications on workers (MATLABcomputational engines) that run locally The tensor Y waspartitioned into 4 sub-tensors with the size of 128times128times3times1800 and was distributed on 4 workers As the baseline theTucker-ALS algorithm included in [32] was compared and themultilinear rank was empirically set as R = (20 20 3 20)For the RandTucker method R = (30 30 3 50) was setBoth of these two algorithms used 2 iterations Their perfor-mances were detailed in TABLE IV It can be seen that nosignificant deterioration was observed between the randomized

4The performance index SIR is defined as SIR(a(n)r a

(n)r ) =

20 log∥∥∥a(n)

r minus a(n)r

∥∥∥2∥∥∥a(n)

r

∥∥∥2

where a(n)r is an estimate of a

(n)r both

a(n)r and a

(n)r are normalized to be zero-mean and unit-variance

TABLE IV Comparison between the algorithms when theywere applied to clustering analysis of COIL-100 objects

Algorithm Fit () Accuracy ()Tucker-ALS 703 692

RandTucker2i 691 684RandTucker 686 698

FFCP(Tucker-ALS) 671 713FFCP (RandTucker2i) 647 680

12

3

4

5

6 7

8

9

10

11

12

13

1415

16

17

18

19

20

21

2223

24

25

2627

28

29

30

31

32

33

34

35

36

37

38 3940

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

6465

6667

68

69

70

71

72

73

74

75

76

77

78

7980

81

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

97

98

99

100

Fig 7 Visualization of the clustering results obtained byapplying distributed RandTucker2i on the COIL100 imagedatabase The two tSNE components of A(4) were used forvisualization

methods and the deterministic methods which suggested thatthe distributed RandTucker algorithm can be applied to reallylarge scale problems The clustering results obtained by theRandTucker2i Figare visualized in Fig7 by using the twotSNE components of A(4)

C UCF Sports Action Videos Analysis

This data set consists of a set of videos collected fromvarious sports which are typically featured on broadcast televi-sion channels [38] (see httpcrcvucfedudataUCF SportsActionphp) To avoid rescaling the videos and keep all thevideos in the same specification we selected 5 actions ofDiving Riding-Horse Run Swing and Walk-Front each ofwhich contains 10 video clips with the size of 404times720times3times35(35 is the number of frames) We concatenated these videosto form a 5th-order tensor This tensor has about 16 times 109

entries and more than 99 of them are nonzero When it issaved in double format it consumes about 117 GB spaceWe used tensor decomposition methods to decompose this

11

Fig 8 The UCF Sports Action Videos The first row shows asnapshot of each action and the second row shows 5 snapshotsfrom a video of diving

tensor to extract low-dimensional features for further cluster-ing analysis When we applied the CP-ALS method includedin the Tensor Toolbox to decompose 30 videos it consumedabout 4 hours for only one iteration when 40 videos wereconsidered the CP-ALS ran out of memory We configureda small cluster with 3 computers (a desktop computer plustwo laptop computers (each of which has a i7 Intel dualcore CPU 8 GB physical memory running Windows 7 andMATLAB 2012b)) In the first stage the RandTucker2i wasapplied with R = [35 35 3 10 20] It consumed about 104seconds to obtain the Tucker approximation of the originaltensor with the Fit of 723 to the observation data Then theFFCP was applied using a desktop computer with R = 10 Inthis case only 75 seconds were required without nonnegativityconstraints and 7 seconds were consumed when nonnegativityconstraints were imposed Finally we used A(5) as featuresto cluster the videos and achieved the accuracy of 58 ina typical run This experiments showed that the proposedalgorithms can be applied to decompose very large scaletensors by using a distributed computing cluster

VI CONCLUSION

Tensor decompositions are particularly promising for bigmultiway data analytics as it is able to discover more complexinternal structures and correlations of data gathered frommultiple aspects devices trials and subjects etc While mostexisting tensor decomposition methods are not designed tomeet the major challenges posed by big data in this paperwe proposed a unified framework for the decomposition ofbig tensors with relatively low multilinear rank The key ideais that based on its Tucker approximation of a big tensor alarge set of unconstrainedconstrained tensor decompositionscan be efficiently performed We developed a flexible fastalgorithm for the CP decomposition of tensors using theirTucker representation and a distributed randomized Tuckerdecomposition approach which is particularly suitable for bigtensor analysis Simulation results confirmed that the proposedalgorithms are highly efficient and scalable

While the randomized Tucker decomposition approachshowed its close-to-optimal performance the correspondingerror bounds need to be investigated theoretically in the futureWe also believe many randomized (or related) approachesproposed for low-rank approximation of matrices [29] [39]

could be applied for Tucker decompositions however a com-parative study in the tensor decomposition scenarios is desiredParticularly if we have efficient and robust unconstrained CPdecomposition algorithms we can use the CP model to com-press the raw data instead of the Tucker model which suffersfrom the curse of dimensionality for very high-order tensorsHence developing more reliable and robust CP decompositionapproaches is another important task All these open issuesdeserve further study in the near future

REFERENCES

[1] A Cichocki R Zdunek A H Phan and S-I Amari Nonnegative Ma-trix and Tensor Factorizations Applications to Exploratory Multi-wayData Analysis and Blind Source Separationpplications to ExploratoryMulti-Way Data Analysis and Blind Source Separation ChichesterUK John Wiley amp Sons Ltd 2009

[2] N Sidiropoulos R Bro and G Giannakis ldquoParallel factor analysisin sensor array processingrdquo IEEE Transactions on Signal Processingvol 48 no 8 pp 2377ndash2388 Aug 2000

[3] L De Lathauwer J Castaing and J Cardoso ldquoFourth-order cumulant-based blind identification of underdetermined mixturesrdquo IEEE Transac-tions on Signal Processing vol 55 no 6 pp 2965ndash2973 2007

[4] P Comon ldquoTensors A brief introductionrdquo IEEE Signal ProcessingMagazine vol 31 no 3 pp 44ndash53 May 2014

[5] T G Kolda and B W Bader ldquoTensor decompositions and applicationsrdquoSIAM Review vol 51 no 3 pp 455ndash500 2009

[6] N D Sidiropoulos and R Bro ldquoOn the uniqueness of multilineardecomposition of N -way arraysrdquo Journal of Chemometrics vol 14no 3 pp 229 ndash 239 2000

[7] G Zhou and A Cichocki ldquoCanonical polyadic decomposition based ona single mode blind source separationrdquo IEEE Signal Processing Lettersvol 19 no 8 pp 523 ndash526 Aug 2012

[8] L-H Lim and P Comon ldquoNonnegative approximations of nonnegativetensorsrdquo Journal of Chemometrics vol 23 no 7-8 pp 432ndash441 2009

[9] E E Papalexakis C Faloutsos and N D Sidiropoulos ldquoParCubeSparse parallelizable tensor decompositionsrdquo in Proceedings of the 2012European Conference on Machine Learning and Knowledge Discoveryin Databases - Volume Part I ser ECML PKDDrsquo12 Berlin HeidelbergSpringer-Verlag 2012 pp 521ndash536

[10] A Phan and A Cichocki ldquoAdvances in PARAFAC using parallel blockdecompositionrdquo in Neural Information Processing ser Lecture Notesin Computer Science C Leung M Lee and J Chan Eds vol 5863Springer Berlin Heidelberg 2009 pp 323ndash330

[11] U Kang E Papalexakis A Harpale and C Faloutsos ldquoGigaTensorScaling tensor analysis up by 100 times - algorithms and discoveriesrdquoin Proceedings of the 18th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining ser KDD rsquo12 New York NYUSA ACM 2012 pp 316ndash324

[12] C Tsourakakis ldquoMACH Fast randomized tensor decompositionsrdquoin Proceedings of the 2010 SIAM International Conference on DataMining 2010 pp 689ndash700

[13] P Drineas and M W Mahoney ldquoA randomized algorithm for a tensor-based generalization of the singular value decompositionrdquo Linear Alge-bra and its Applications vol 420 no 2ndash3 pp 553 ndash 571 2007

[14] C F Caiafa and A Cichocki ldquoGeneralizing the column-row matrix de-composition to multi-way arraysrdquo Linear Algebra and its Applicationsvol 433 no 3 pp 557 ndash 573 2010

[15] M Vasilescu and D Terzopoulos ldquoMultilinear analysis of image en-sembles Tensorfacesrdquo in Computer Vision mdash ECCV 2002 ser LectureNotes in Computer Science A Heyden G Sparr M Nielsen andP Johansen Eds Springer Berlin Heidelberg 2002 vol 2350 pp447ndash460

[16] O Duchenne F Bach I-S Kweon and J Ponce ldquoA tensor-basedalgorithm for high-order graph matchingrdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 33 no 12 pp 2383ndash2395 Dec2011

12

[17] A Rajwade A Rangarajan and A Banerjee ldquoImage denoising usingthe higher order singular value decompositionrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 35 no 4 pp 849ndash8622013

[18] V De Silva and L-H Lim ldquoTensor rank and the ill-posedness of the bestlow-rank approximation problemrdquo SIAM Journal on Matrix Analysis andApplications vol 30 no 3 pp 1084ndash1127 September 2008

[19] G Zhou A Cichocki Q Zhao and S Xie ldquoNonnegative matrixand tensor factorizations An algorithmic perspectiverdquo IEEE SignalProcessing Magazine vol 31 no 3 pp 54ndash65 May 2014

[20] G Zhou A Cichocki and S Xie ldquoFast nonnegative matrixtensorfactorization based on low-rank approximationrdquo IEEE Transactions onSignal Processing vol 60 no 6 pp 2928 ndash2940 June 2012

[21] V Cevher S Becker and M Schmidt ldquoConvex optimization forbig data Scalable randomized and parallel algorithms for big dataanalyticsrdquo IEEE Signal Processing Magazine vol 31 no 5 pp 32ndash43Sept 2014

[22] R Bro and C A Andersson ldquoImproving the speed of multiway algo-rithms Part ii Compressionrdquo Chemometrics and Intelligent LaboratorySystems vol 42 no 1ndash2 pp 105 ndash 113 1998

[23] N Sidiropoulos E Papalexakis and C Faloutsos ldquoParallel randomlycompressed cubes A scalable distributed architecture for big tensordecompositionrdquo IEEE Signal Processing Magazine vol 31 no 5 pp57ndash70 Sept 2014

[24] P Comon X Luciani and A L F de Almeida ldquoTensor decompositionsalternating least squares and other talesrdquo Journal of Chemometricsvol 23 no 7-8 pp 393ndash405 2009

[25] J Cohen R Farias and P Comon ldquoFast decomposition of largenonnegative tensorsrdquo IEEE Signal Processing Letters vol 22 no 7pp 862ndash866 July 2015

[26] L De Lathauwer B De Moor and J Vandewalle ldquoA multilinearsingular value decompositionrdquo SIAM Journal on Matrix Analysis andApplications vol 21 pp 1253ndash1278 2000

[27] E Papalexakis N Sidiropoulos and R Bro ldquoFrom k-means to higher-way co-clustering Multilinear decomposition with sparse latent factorsrdquoIEEE Transactions on Signal Processing vol 61 no 2 pp 493ndash506Jan 2013

[28] Z Qian X Chen N Kang M Chen Y Yu T Moscibroda andZ Zhang ldquoMadLINQ Large-scale distributed matrix computation forthe cloudrdquo in Proceedings of the 7th ACM European Conference onComputer Systems ser EuroSys rsquo12 New York NY USA ACM2012 pp 197ndash210

[29] N Halko P Martinsson and J Tropp ldquoFinding structure with ran-domness Probabilistic algorithms for constructing approximate matrixdecompositionsrdquo SIAM Review vol 53 no 2 pp 217ndash288 2011

[30] N Vervliet O Debals L Sorber and L De Lathauwer ldquoBreaking thecurse of dimensionality using decompositions of incomplete tensorsTensor-based scientific computing in big data analysisrdquo IEEE SignalProcessing Magazine vol 31 no 5 pp 71ndash79 Sept 2014

[31] G Zhou A Cichocki Q Zhao and S Xie ldquoEfficient nonnegativeTucker decompositions Algorithms and uniquenessrdquo IEEE Transactionson Image Processing (Submitted) p 13 2014 [Online] Available[ArXive-prints]httparxivorgabs14044412

[32] B W Bader and T G Kolda ldquoMATLAB tensor toolbox version25rdquo Feb 2012 [Online] Available httpcsmrcasandiagovsimtgkoldaTensorToolbox

[33] C A Andersson and R Bro ldquoThe N -way toolbox for MATLABrdquo 2000[Online] Available httpwwwmodelslifekudksourcenwaytoolbox

[34] H Phan P Tichavsky and A Cichocki ldquoFast alternating LS algorithmsfor high order CANDECOMPPARAFAC tensor factorizationsrdquo IEEETransactions on Signal Processing vol 61 no 19 pp 4834ndash4846 2013

[35] N Sidiropoulos E Papalexakis and C Faloutsos ldquoA parallel algorithmfor big tensor decomposition using randomly compressed cubes (PARA-COMP)rdquo in 2014 IEEE International Conference on Acoustics Speechand Signal Processing (ICASSP) May 2014 pp 1ndash5

[36] A Cichocki and A-H Phan ldquoFast local algorithms for large scalenonnegative matrix and tensor factorizationsrdquo IEICE Transactions onFundamentals of Electronics Communications and Computer Sciences(Invited paper) vol E92-A no 3 pp 708ndash721 2009

[37] D Cai X He J Han and T Huang ldquoGraph regularized nonnegativematrix factorization for data representationrdquo IEEE Transactions on

Pattern Analysis and Machine Intelligence vol 33 no 8 pp 1548ndash1560 aug 2011

[38] M Rodriguez J Ahmed and M Shah ldquoAction MACH a spatio-temporal maximum average correlation height filter for action recogni-tionrdquo in IEEE Conference on Computer Vision and Pattern Recognition2008 CVPR 2008 June 2008 pp 1ndash8

[39] R Baraniuk V Cevher and M Wakin ldquoLow-dimensional models fordimensionality reduction and signal recovery A geometric perspectiverdquoProceedings of the IEEE vol 98 no 6 pp 959ndash971 June 2010

  • I Introduction
  • II Overview of Tucker and Polyadic Decompositions
  • III Efficient CP Decompositions of Tensors Based on Their Tucker Approximations
    • III-A Overview of CPD Algorithms
    • III-B CPD of Big Tensors by Based on Their Tucker Compression
      • IV Distributed Randomized Tucker Decompositions (RandTucker)
        • IV-A Randomized Tucker Decomposition
        • IV-B Distributed Randomized Tucker Decomposition
        • IV-C Nonnegative Tucker Decomposition
          • V Simulations
            • V-A Synthetic data
            • V-B The COIL-100 Images Clustering Analysis
            • V-C UCF Sports Action Videos Analysis
              • VI Conclusion
              • References

9

TABLE II Comparison between the algorithms when they were applied to perform CPD of synthetic data The noises wereGaussian with SNR=10dB in all the cases

Algorithm N = 3 I = 200 R = 10 N = 3 I = 500 R = 20 N = 6 I = 20 R = 5Fit () Time (s) Fit () Time (s) Fit () Time (s)

CP-ALS 821plusmn 102 395 805plusmn 68 640 896plusmn 112 188FastCPALS 798plusmn 90 104 829plusmn 56 211 924plusmn 109 66PARAFAC 825plusmn 79 782 805plusmn 61 1583 876plusmn 114 643FFCP 819plusmn 80 35 828plusmn 56 40 934plusmn 86 50

TABLE III Comparison between the algorithms when they were applied to perform nonnegative CPD of synthetic data Thenoises were Gaussian with SNR=10dB in all the cases

Algorithm N = 3 I = 200 R = 10 N = 3 I = 500 R = 20 N = 6 I = 20 R = 5Fit () Time (s) Fit () Time (s) Fit () Time (s)

CP-NMU 973plusmn 12 664 976plusmn 08 16952 917plusmn 56 4265CP-HALS 978plusmn 10 121 977plusmn 07 2281 951plusmn 41 1014FFCP-MU 990plusmn 05 23 993plusmn 02 121 934plusmn 41 70FFCP-HALS 991plusmn 01 61 995plusmn 01 299 943plusmn 35 118

Simulations on CPD In each run a tensor Ylowast was generatedby using the model (2) where the entries of the latent factormatrices A(n) isin RItimesR (ie the ground truth) were drawn fromiid standard normal distributions n = 1 2 N Thenindependent Gaussian noise (with SNR=10dB) was addedto Ylowast to yield noisy observation tensor Y We tested threedifferent combinations of N I and R as shown in TABLE IIThe algorithms CP-ALS PARAFAC which are the standardalgorithms included in the Tensor Toolbox 25 [32] and the N -way toolbox [33] respectively were compared as the baselineThe FastCPALS algorithm [34] was also compared For all thealgorithm the maximum iteration number was set as 1000and a stopping criterion |Fit(Y Yt) minus Fit(Y Yt+1)| lt 10minus6

was applied where Yt is an estimation of Ylowast obtained afterthe tth iteration The mean values and standard variationsof Fit(Ylowast Y) over 20 Monte-Carlo runs are detailed in TA-BLE II along with the averaged time consumed by eachalgorithm From the table we see that the FFCP algorithmis significantly faster than the other algorithms but withoutsignificant deterioration of performance although it workson a compressed tensor instead of the original tensor Wealso tested the PARACOMP [35] method which was speciallyproposed for big tensor decompositions It turned out that thePARACOMP was sensitive to noise in noise-free cases itoften achieved perfect Fit of 1 otherwise when the noise was10dB it generally obtained the Fit lower than 03

In Fig6 we demonstrated the scalability of the proposedFFCP method where the RandTucker2i was used to compressthe data with p = 15 From Fig6(a) we can see that theFFCP shows very promising scalability its time consumptionincreases almost linearly with respect to the scale of the inputdata on condition that the rank is fixed on the contrastthe time consumed by the other algorithms increased nearlyexponentially Note that both the FFCP and the PARACOMPmethod [35] were designed mainly for big data analyticsand this comparison did not precisely reflect their actual

time consumption difference in a distributed computationalenvironment However it did show that the proposed methodwas more efficient than the PARACOMP which can also be in-ferred from the obvious facts 1) the PARACOMP compressesthe observation big tensor at least P times (P ge IR) whereasin our RandTucker2i based FFCP we only need to do it twice(or even once when RandTucker is applied) 2) although bothof them need to perform CPD on small compressed tensorshowever the PARACOMP method needs to do this P timeswhereas the FFCP needs to do it only once as the final stepParticularly in the PARACOMP any failure of the CPDs ofP sub-tensors will degenerate the final results Fig6(b) showsthe scalability of the FFCP provided that the given big tensorwas in its Tucker format To such scale of tensor if we usethe original full tensor directly all the algorithms ran out ofmemory But if we use its Tucker compression the FFCPcan perform constrainedunconstrained CPD efficiently evenon our desktop computer

Simulations on Nonnegative CPD In this experiments thedata were generated as the above except that the latent factormatrices were generated from iid exponential distributionswith the mean parameter micro = 10 Then we randomly selected10 entries of each factor to be zeros to generate sparsefactor matrices As the PARAFAC algorithm was very slowwe only compared the CP-NMU algorithm which is a standardnonnegative CPD algorithm based on the multiplicative updaterule included in the Tensor Toolbox [32] and the CP-HALSalgorithm [36] that is based on the Hierarchical ALS (HALS)method For fair comparison their corresponding implemen-tations of FFCP were compared The results averaged over20 Monte-Carlo runs were detailed in TABLE III From theTABLE FFCP implementations were not only significantlyfaster than their competitors but also often achieved higherFit values To the best of our knowledge the FFCP is one ofthe most efficient algorithms for nonnegative CPD

To demonstrate the high efficiency of the FFCP algorithm

10

in this simulation the entries of A(n) isin R10000times10 n =1 2 3 4 were drawn from iid exponential distributions withthe mean parameter micro = 10 Again we randomly selected 20entries of A(n) to be zeros to obtain sparse factor matricesBy using (2) and A(n) generated above a big 4th-order tensorY isin R10000times10000times10000times10000 with R = 10 was obtainedwhich contains a total of 1016 entries in its full format Itis certainly impossible to decompose such a huge tensor in astandard desktop computer Suppose we have already obtainedits Tucker approximation YT = JGA(1)A(2)A(3)A(4)K asympY Due to the rotation ambiguity of Tucker decompositionswe know theoretically A(n) = A(n)Q(n) where Q(n) wasnot necessarily to be nonnegative Generally applying CPD onthe core tensor G directly cannot guarantee the nonnegativityof factor matrices Instead we applied the FFCP algorithm onthe Tucker compressed tensor YT to extract nonnegative factormatrices In a typical run the FFCP-HALS consumed only 774seconds and the recovered components were with the signal-to-interference ratio4 (SIR) higher than 200dB which almostperfectly match the true components Although this exampleonly shows an ideal noise-free case it illustrates how the FFCPalgorithm can be applied to perform nonnegative CPD of hugetensors as long as their good low-rank approximations can beobtained

B The COIL-100 Images Clustering Analysis

In this experiment we applied the proposed NTD algorithmsto the clustering analysis of the objects selected from theColumbia Object Image Library (COIL-100) The COIL-100database consists of 7200 images of 100 objects each ofwhich has 72 images taken from different poses hence theseimages naturally form a 128times 128times 3times 7200 tensor Y Thedata tensor was decomposed by using tensor decompositionalgorithms and then used the two t-Distributed StochasticNeighbor Embedding (t-SNE) components of A(4) as featuresto cluster and visualize the results The K-means algorithm wasused to cluster the objects and the accuracy of clustering wasdefined in [37]To justify the proposed distributed RandTuckeralgorithm we used the MATLAB parallel computing toolboxwhich allows us to use the full processing power of multicoredesktops by executing applications on workers (MATLABcomputational engines) that run locally The tensor Y waspartitioned into 4 sub-tensors with the size of 128times128times3times1800 and was distributed on 4 workers As the baseline theTucker-ALS algorithm included in [32] was compared and themultilinear rank was empirically set as R = (20 20 3 20)For the RandTucker method R = (30 30 3 50) was setBoth of these two algorithms used 2 iterations Their perfor-mances were detailed in TABLE IV It can be seen that nosignificant deterioration was observed between the randomized

4 The performance index SIR is defined as $\mathrm{SIR}(\mathbf{a}_r^{(n)}, \hat{\mathbf{a}}_r^{(n)}) = 20\log_{10}\big(\|\mathbf{a}_r^{(n)}\|_2 \,/\, \|\hat{\mathbf{a}}_r^{(n)} - \mathbf{a}_r^{(n)}\|_2\big)$, where $\hat{\mathbf{a}}_r^{(n)}$ is an estimate of $\mathbf{a}_r^{(n)}$; both $\hat{\mathbf{a}}_r^{(n)}$ and $\mathbf{a}_r^{(n)}$ are normalized to be zero-mean and unit-variance.
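A direct MATLAB sketch of this index is given below; the function name is illustrative, and sign/permutation matching between estimated and true components is omitted.

function sir = compute_sir(a, ahat)
% Sketch: SIR in dB between a true component a and its estimate ahat,
% after normalizing both to zero mean and unit variance (see footnote 4).
a    = (a    - mean(a))    / std(a);
ahat = (ahat - mean(ahat)) / std(ahat);
sir  = 20 * log10(norm(a) / norm(ahat - a));
end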

TABLE IV: Comparison between the algorithms when they were applied to clustering analysis of COIL-100 objects.

Algorithm              Fit (%)    Accuracy (%)
Tucker-ALS               70.3        69.2
RandTucker2i             69.1        68.4
RandTucker               68.6        69.8
FFCP (Tucker-ALS)        67.1        71.3
FFCP (RandTucker2i)      64.7        68.0
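The distributed setting described above splits Y along its last (image-index) mode. A minimal sketch of that partitioning is given below; Y is assumed to be an ordinary double array already in memory, and the dispatch of the blocks to the 4 workers (e.g., via the Parallel Computing Toolbox) is omitted.

% Sketch: split the 128 x 128 x 3 x 7200 COIL-100 tensor into 4 sub-tensors
% of size 128 x 128 x 3 x 1800 along the last mode, one block per worker.
blocks = cell(1, 4);
for k = 1:4
    idx = (k-1)*1800 + (1:1800);
    blocks{k} = Y(:, :, :, idx);   % k-th sub-tensor
end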


Fig. 7: Visualization of the clustering results obtained by applying the distributed RandTucker2i on the COIL-100 image database. The two t-SNE components of A(4) were used for visualization.

methods and the deterministic methods, which suggests that the distributed RandTucker algorithm can be applied to really large-scale problems. The clustering results obtained by RandTucker2i are visualized in Fig. 7 by using the two t-SNE components of A(4).
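The feature-extraction and clustering step can be sketched as follows. Here A4 denotes the mode-4 factor matrix (one row per image), and tsne/kmeans from the Statistics and Machine Learning Toolbox are assumed to be available; note that the built-in tsne only appeared in MATLAB releases newer than the one used for the experiments in this paper.

% Sketch: embed the mode-4 factor with t-SNE and cluster with K-means,
% as in the experiment above. A4 is 7200 x R; 100 clusters = 100 objects.
Z      = tsne(A4, 'NumDimensions', 2);       % two t-SNE components
labels = kmeans(Z, 100, 'Replicates', 10);   % K-means clustering
gscatter(Z(:,1), Z(:,2), labels);            % visualization as in Fig. 7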

C. UCF Sports Action Videos Analysis

This data set consists of a set of videos collected from various sports that are typically featured on broadcast television channels [38] (see http://crcv.ucf.edu/data/UCF_Sports_Action.php). To avoid rescaling the videos and to keep all of them in the same specification, we selected 5 actions, namely Diving, Riding-Horse, Run, Swing, and Walk-Front, each of which contains 10 video clips of size 404 × 720 × 3 × 35 (35 is the number of frames). We concatenated these videos to form a 5th-order tensor. This tensor has about 1.6 × 10^9 entries, and more than 99% of them are nonzero. When it is saved in double format, it consumes about 11.7 GB of space.


Fig. 8: The UCF Sports Action Videos. The first row shows a snapshot of each action, and the second row shows 5 snapshots from a video of diving.

We used tensor decomposition methods to decompose this tensor to extract low-dimensional features for further clustering analysis. When we applied the CP-ALS method included in the Tensor Toolbox to decompose 30 of the videos, it consumed about 4 hours for only one iteration; when 40 videos were considered, CP-ALS ran out of memory. We therefore configured a small cluster with 3 computers (a desktop computer plus two laptop computers, each with an Intel i7 dual-core CPU and 8 GB of physical memory, running Windows 7 and MATLAB 2012b). In the first stage, RandTucker2i was applied with R = [35, 35, 3, 10, 20]. It consumed about 104 seconds to obtain the Tucker approximation of the original tensor, with a Fit of 72.3% to the observation data. Then the FFCP was applied using a desktop computer with R = 10. In this case, only 75 seconds were required without nonnegativity constraints, and 7 seconds were consumed when nonnegativity constraints were imposed. Finally, we used A(5) as features to cluster the videos and achieved an accuracy of 58% in a typical run. This experiment showed that the proposed algorithms can be applied to decompose very large-scale tensors by using a distributed computing cluster.
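The two-stage procedure above can be summarized by the following sketch. Here rand_tucker2i and ffcp are hypothetical placeholders for the proposed RandTucker2i and FFCP routines (their actual interfaces are not specified in this paper), and Y5 is the concatenated 5th-order video tensor.

% Sketch of the two-stage pipeline: (1) randomized Tucker compression on the
% cluster, (2) (nonnegative) CPD of the compressed tensor on a single desktop.
Rml = [35 35 3 10 20];                      % multilinear rank for stage 1
[G, A] = rand_tucker2i(Y5, Rml);            % hypothetical RandTucker2i call
Rcp = 10;                                   % CP rank for stage 2
B = ffcp(G, A, Rcp, 'nonnegative', true);   % hypothetical FFCP call
features = B{5};                            % mode-5 factor -> clustering features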

VI. CONCLUSION

Tensor decompositions are particularly promising for big multiway data analytics, as they are able to discover complex internal structures and correlations of data gathered from multiple aspects, devices, trials, and subjects. While most existing tensor decomposition methods are not designed to meet the major challenges posed by big data, in this paper we proposed a unified framework for the decomposition of big tensors with relatively low multilinear rank. The key idea is that, based on the Tucker approximation of a big tensor, a large set of unconstrained/constrained tensor decompositions can be performed efficiently. We developed a flexible fast algorithm for the CP decomposition of tensors using their Tucker representation, and a distributed randomized Tucker decomposition approach that is particularly suitable for big tensor analysis. Simulation results confirmed that the proposed algorithms are highly efficient and scalable.

While the randomized Tucker decomposition approach showed close-to-optimal performance, the corresponding error bounds still need to be investigated theoretically in the future. We also believe that many randomized (or related) approaches proposed for the low-rank approximation of matrices [29], [39] could be applied to Tucker decompositions; however, a comparative study in tensor decomposition scenarios is desired. In particular, if we have efficient and robust unconstrained CP decomposition algorithms, we can use the CP model to compress the raw data instead of the Tucker model, which suffers from the curse of dimensionality for very high-order tensors. Hence, developing more reliable and robust CP decomposition approaches is another important task. All these open issues deserve further study in the near future.

REFERENCES

[1] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Chichester, UK: John Wiley & Sons, Ltd, 2009.
[2] N. Sidiropoulos, R. Bro, and G. Giannakis, "Parallel factor analysis in sensor array processing," IEEE Transactions on Signal Processing, vol. 48, no. 8, pp. 2377–2388, Aug. 2000.
[3] L. De Lathauwer, J. Castaing, and J. Cardoso, "Fourth-order cumulant-based blind identification of underdetermined mixtures," IEEE Transactions on Signal Processing, vol. 55, no. 6, pp. 2965–2973, 2007.
[4] P. Comon, "Tensors: A brief introduction," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 44–53, May 2014.
[5] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
[6] N. D. Sidiropoulos and R. Bro, "On the uniqueness of multilinear decomposition of N-way arrays," Journal of Chemometrics, vol. 14, no. 3, pp. 229–239, 2000.
[7] G. Zhou and A. Cichocki, "Canonical polyadic decomposition based on a single mode blind source separation," IEEE Signal Processing Letters, vol. 19, no. 8, pp. 523–526, Aug. 2012.
[8] L.-H. Lim and P. Comon, "Nonnegative approximations of nonnegative tensors," Journal of Chemometrics, vol. 23, no. 7-8, pp. 432–441, 2009.
[9] E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos, "ParCube: Sparse parallelizable tensor decompositions," in Proceedings of the 2012 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I, ser. ECML PKDD'12. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 521–536.
[10] A. Phan and A. Cichocki, "Advances in PARAFAC using parallel block decomposition," in Neural Information Processing, ser. Lecture Notes in Computer Science, C. Leung, M. Lee, and J. Chan, Eds., vol. 5863. Springer Berlin Heidelberg, 2009, pp. 323–330.
[11] U. Kang, E. Papalexakis, A. Harpale, and C. Faloutsos, "GigaTensor: Scaling tensor analysis up by 100 times - algorithms and discoveries," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '12. New York, NY, USA: ACM, 2012, pp. 316–324.
[12] C. Tsourakakis, "MACH: Fast randomized tensor decompositions," in Proceedings of the 2010 SIAM International Conference on Data Mining, 2010, pp. 689–700.
[13] P. Drineas and M. W. Mahoney, "A randomized algorithm for a tensor-based generalization of the singular value decomposition," Linear Algebra and its Applications, vol. 420, no. 2–3, pp. 553–571, 2007.
[14] C. F. Caiafa and A. Cichocki, "Generalizing the column-row matrix decomposition to multi-way arrays," Linear Algebra and its Applications, vol. 433, no. 3, pp. 557–573, 2010.
[15] M. Vasilescu and D. Terzopoulos, "Multilinear analysis of image ensembles: Tensorfaces," in Computer Vision – ECCV 2002, ser. Lecture Notes in Computer Science, A. Heyden, G. Sparr, M. Nielsen, and P. Johansen, Eds. Springer Berlin Heidelberg, 2002, vol. 2350, pp. 447–460.
[16] O. Duchenne, F. Bach, I.-S. Kweon, and J. Ponce, "A tensor-based algorithm for high-order graph matching," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2383–2395, Dec. 2011.
[17] A. Rajwade, A. Rangarajan, and A. Banerjee, "Image denoising using the higher order singular value decomposition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 4, pp. 849–862, 2013.
[18] V. De Silva and L.-H. Lim, "Tensor rank and the ill-posedness of the best low-rank approximation problem," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 3, pp. 1084–1127, Sept. 2008.
[19] G. Zhou, A. Cichocki, Q. Zhao, and S. Xie, "Nonnegative matrix and tensor factorizations: An algorithmic perspective," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 54–65, May 2014.
[20] G. Zhou, A. Cichocki, and S. Xie, "Fast nonnegative matrix/tensor factorization based on low-rank approximation," IEEE Transactions on Signal Processing, vol. 60, no. 6, pp. 2928–2940, June 2012.
[21] V. Cevher, S. Becker, and M. Schmidt, "Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 32–43, Sept. 2014.
[22] R. Bro and C. A. Andersson, "Improving the speed of multiway algorithms: Part II: Compression," Chemometrics and Intelligent Laboratory Systems, vol. 42, no. 1–2, pp. 105–113, 1998.
[23] N. Sidiropoulos, E. Papalexakis, and C. Faloutsos, "Parallel randomly compressed cubes: A scalable distributed architecture for big tensor decomposition," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 57–70, Sept. 2014.
[24] P. Comon, X. Luciani, and A. L. F. de Almeida, "Tensor decompositions, alternating least squares and other tales," Journal of Chemometrics, vol. 23, no. 7-8, pp. 393–405, 2009.
[25] J. Cohen, R. Farias, and P. Comon, "Fast decomposition of large nonnegative tensors," IEEE Signal Processing Letters, vol. 22, no. 7, pp. 862–866, July 2015.
[26] L. De Lathauwer, B. De Moor, and J. Vandewalle, "A multilinear singular value decomposition," SIAM Journal on Matrix Analysis and Applications, vol. 21, pp. 1253–1278, 2000.
[27] E. Papalexakis, N. Sidiropoulos, and R. Bro, "From k-means to higher-way co-clustering: Multilinear decomposition with sparse latent factors," IEEE Transactions on Signal Processing, vol. 61, no. 2, pp. 493–506, Jan. 2013.
[28] Z. Qian, X. Chen, N. Kang, M. Chen, Y. Yu, T. Moscibroda, and Z. Zhang, "MadLINQ: Large-scale distributed matrix computation for the cloud," in Proceedings of the 7th ACM European Conference on Computer Systems, ser. EuroSys '12. New York, NY, USA: ACM, 2012, pp. 197–210.
[29] N. Halko, P. Martinsson, and J. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," SIAM Review, vol. 53, no. 2, pp. 217–288, 2011.
[30] N. Vervliet, O. Debals, L. Sorber, and L. De Lathauwer, "Breaking the curse of dimensionality using decompositions of incomplete tensors: Tensor-based scientific computing in big data analysis," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 71–79, Sept. 2014.
[31] G. Zhou, A. Cichocki, Q. Zhao, and S. Xie, "Efficient nonnegative Tucker decompositions: Algorithms and uniqueness," IEEE Transactions on Image Processing (Submitted), p. 13, 2014. [Online]. Available: http://arxiv.org/abs/1404.4412
[32] B. W. Bader and T. G. Kolda, "MATLAB tensor toolbox version 2.5," Feb. 2012. [Online]. Available: http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox
[33] C. A. Andersson and R. Bro, "The N-way toolbox for MATLAB," 2000. [Online]. Available: http://www.models.life.ku.dk/source/nwaytoolbox
[34] H. Phan, P. Tichavsky, and A. Cichocki, "Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations," IEEE Transactions on Signal Processing, vol. 61, no. 19, pp. 4834–4846, 2013.
[35] N. Sidiropoulos, E. Papalexakis, and C. Faloutsos, "A parallel algorithm for big tensor decomposition using randomly compressed cubes (PARACOMP)," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1–5.
[36] A. Cichocki and A.-H. Phan, "Fast local algorithms for large scale nonnegative matrix and tensor factorizations," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences (Invited paper), vol. E92-A, no. 3, pp. 708–721, 2009.
[37] D. Cai, X. He, J. Han, and T. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1548–1560, Aug. 2011.
[38] M. Rodriguez, J. Ahmed, and M. Shah, "Action MACH: A spatio-temporal maximum average correlation height filter for action recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2008 (CVPR 2008), June 2008, pp. 1–8.
[39] R. Baraniuk, V. Cevher, and M. Wakin, "Low-dimensional models for dimensionality reduction and signal recovery: A geometric perspective," Proceedings of the IEEE, vol. 98, no. 6, pp. 959–971, June 2010.

