
Page 1: GigaTensor: Scaling Tensor Analysis Up By 100 Times – Algorithms and Discoveries

U Kang, Evangelos Papalexakis, Abhay Harpale, Christos Faloutsos

KDD '12: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

A review presented by Pavan Kumar Behara

Page 2: Tensor Decomposition

• A tensor is an N-dimensional array.

• Many real-world datasets are naturally stored as tensors (e.g., subject-verb-object triples in a knowledge base).

• Tensors can represent higher-order relationships.

• Multi-way data analysis requires tensor decomposition (TD):

• latent concept discovery

• trend analysis

• clustering

• anomaly detection

[Figure: Spearman’s two-factor theory of intelligence (1927)†]

† Courtesy: Exploring temporal graph data with Python, Andre Panisson (url)
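To make the "N-dimensional array" view concrete, here is a minimal sketch in Python/NumPy (the language, sizes, and index semantics are our illustration, not the slides'):

```python
import numpy as np

# A third-order tensor is simply a 3-dimensional array indexed by (i, j, k).
# Tiny illustrative sizes; the tensors in this talk are billion-scale and sparse.
I, J, K = 4, 5, 3
X = np.zeros((I, J, K))

# One observed higher-order relationship, e.g. entity i=1 relates to
# entity j=2 in context k=0.
X[1, 2, 0] = 1.0

print(X.ndim)   # 3  -> an N-dimensional array with N = 3
print(X.shape)  # (4, 5, 3)
```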

Page 3: Example – Topic Modeling

• Goal: characterize observed data in terms of a much smaller set of unobserved topics.

• LDA/Hidden Markov/Gaussian mixture models can be used.

• These models have tensor structure in their low-order observable moments (typically 2nd or 3rd)†.

• Expectation maximization and Markov chain Monte Carlo methods become impractical for large datasets.

• Tensor decomposition helps in parameter estimation for these models.

Blei, David M. "Probabilistic topic models." Communications of the ACM 55.4 (2012): 77-84.

†A. Anandkumar, et al., Journal of Machine Learning Research 15 (2014).

Page 4: Tools for Tensor Decomposition

• Tensor Toolbox for MATLAB by Tamara Kolda, et al.

• N-way Toolbox for MATLAB by Rasmus Bro, et al.

• scikit-tensor by Maximilian Nickel, et al.

• BigTensor for Hadoop by U Kang, et al. (the GigaTensor authors).

• FlexiFaCT for Hadoop by Alex Beutel, et al.

• FTensor, a C++ library by W. Landry.

• ITensor, a C++ library by Steven R. White, et al.

Page 5: Handling Billion-Scale Tensors

• Large-scale tensor decompositions are limited by memory and compute time.

• GigaTensor (2012) was the first scalable distributed algorithm for tensor decomposition.

• GigaTensor is a distributed implementation of the PARAFAC ("parallel factors") decomposition on MapReduce.

• An improved version of GigaTensor is implemented in BIGtensor (2016), a tensor-mining package for the Hadoop platform by the same authors.

Page 6: Some Preliminaries

• Kronecker product: for A of size I x J and B of size K x L, A ⊗ B is the IK x JL block matrix whose (i, j) block is a_ij B.

• Khatri-Rao product: the column-wise Kronecker product; for A (I x R) and B (J x R), A ⊙ B = [a_1 ⊗ b_1, …, a_R ⊗ b_R] is of size IJ x R.

• Tensor unfolding/matricization: reordering an N-way array into a matrix (here only n-mode matricization is considered).

• X of size I x J x K is unfolded into X(1) (I x JK), X(2) (J x IK), X(3) (K x IJ).
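A small NumPy sketch of these three preliminaries (our illustration; the exact column ordering of an unfolding is a convention, and the one below has j varying fastest, matching the PARAFAC identities used later):

```python
import numpy as np

A = np.arange(6, dtype=float).reshape(3, 2)   # I x R = 3 x 2
B = np.arange(8, dtype=float).reshape(4, 2)   # J x R = 4 x 2

# Kronecker product: (3*4) x (2*2) = 12 x 4.
kron = np.kron(A, B)

# Khatri-Rao product: column-wise Kronecker product, (3*4) x 2 = 12 x 2.
khatri_rao = np.column_stack([np.kron(A[:, r], B[:, r]) for r in range(A.shape[1])])

# Mode-1 matricization of an I x J x K tensor into I x JK
# (columns ordered with j varying fastest: column index = k*J + j).
I, J, K = 3, 4, 2
X = np.arange(I * J * K, dtype=float).reshape(I, J, K)
X1 = X.transpose(0, 2, 1).reshape(I, J * K)

print(kron.shape, khatri_rao.shape, X1.shape)  # (12, 4) (12, 2) (3, 8)
```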

Page 7: From Matrix Decomposition to Tensor Decomposition

• Bilinear decomposition: let X be an I x J matrix with rank R. Then

  X = a_1 b_1^T + a_2 b_2^T + … + a_R b_R^T = AB^T,

  where the columns of A and B are a_r and b_r, 1 ≤ r ≤ R.

• Truncating the sum X = a_1 b_1^T + … + a_r b_r^T + … + a_R b_R^T at r ≪ R gives the best low-rank approximation of X (Eckart & Young, 1936).

• PARAFAC is the higher-order generalization of the above: it factorizes a tensor into a sum of rank-one component tensors.

• Rank-one tensor: an Nth-order tensor that can be expressed as the outer product of N vectors. If Y of size I x J x K satisfies Y = a ◦ b ◦ c, i.e., Y(i, j, k) = a(i) b(j) c(k), then Y is a third-order rank-one tensor.

• For a three-way tensor X of size I x J x K, the PARAFAC formalism is

  X = Σ_{r=1}^{R} λ_r a_r ◦ b_r ◦ c_r
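As a small illustration (ours, with arbitrary sizes), the PARAFAC model above can be materialized in NumPy by summing R rank-one outer products:

```python
import numpy as np

I, J, K, R = 4, 5, 3, 2
rng = np.random.default_rng(0)

# Factor vectors a_r, b_r, c_r stacked as columns, plus weights lambda_r.
A, B, C = rng.normal(size=(I, R)), rng.normal(size=(J, R)), rng.normal(size=(K, R))
lam = np.ones(R)

# Each component is a rank-one tensor: (a ◦ b ◦ c)(i,j,k) = a(i) b(j) c(k).
Y = np.einsum('i,j,k->ijk', A[:, 0], B[:, 0], C[:, 0])

# PARAFAC model: X = sum_r lambda_r * a_r ◦ b_r ◦ c_r.
X = np.einsum('r,ir,jr,kr->ijk', lam, A, B, C)

print(Y.shape, X.shape)  # (4, 5, 3) (4, 5, 3)
```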

Page 8: PARAFAC Decomposition

X = Σ_{r=1}^{R} λ_r a_r ◦ b_r ◦ c_r

• The factor matrices A, B, C collect the vectors from the rank-one components, i.e., A = [a_1, a_2, …, a_R] and so on.

• Using these, X can be expressed in its n-mode matricizations as

  X(1) ≈ A Λ (C ⊙ B)^T,  X(2) ≈ B Λ (C ⊙ A)^T,  X(3) ≈ C Λ (B ⊙ A)^T,  where Λ = diag(λ).

• The goal is to compute a PARAFAC decomposition with R components that best approximates X, i.e., to find A, B, C that minimize the objective

  min_{A,B,C} ‖X − Σ_{r=1}^{R} λ_r a_r ◦ b_r ◦ c_r‖_F
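A quick numerical check of the mode-1 identity above (our sketch; the unfolding uses the j-fastest column ordering from page 6):

```python
import numpy as np

I, J, K, R = 4, 5, 3, 2
rng = np.random.default_rng(1)
A, B, C = rng.normal(size=(I, R)), rng.normal(size=(J, R)), rng.normal(size=(K, R))
lam = np.array([2.0, 0.5])

# X = sum_r lambda_r a_r ◦ b_r ◦ c_r.
X = np.einsum('r,ir,jr,kr->ijk', lam, A, B, C)

# Mode-1 unfolding: column index k*J + j (j varies fastest).
X1 = X.transpose(0, 2, 1).reshape(I, K * J)

# Khatri-Rao product C ⊙ B: row k*J + j holds C(k, r) * B(j, r).
CkB = np.column_stack([np.kron(C[:, r], B[:, r]) for r in range(R)])

# X(1) = A Λ (C ⊙ B)^T.
assert np.allclose(X1, A @ np.diag(lam) @ CkB.T)
print("mode-1 unfolding identity holds")
```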

Page 9: Alternating Least Squares Approach

• Start with an initial guess for A, B, C; fix two of the matrices and solve for the third.

• With all but one matrix fixed, the problem reduces to linear least squares.

• For example, with B and C fixed, the problem reduces to

  min_A ‖X(1) − A (C ⊙ B)^T‖_F

• The optimal solution is then given by

  A = X(1) [(C ⊙ B)^T]†

• The pseudoinverse of a Khatri-Rao product has a special form, which allows rewriting this as

  A = X(1) (C ⊙ B) (C^T C ∗ B^T B)†

• The same procedure then applies to B and C in turn.

Page 10: Algorithm for PARAFAC

[Algorithm listing: PARAFAC-ALS (figure in the original slides)]
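Since the algorithm listing on this slide is an image, here is a minimal dense PARAFAC-ALS sketch in NumPy following the update rule from the previous page (our reconstruction: no weight normalization, convergence test, or sparsity handling):

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product of P (K x R) and Q (J x R) -> (KJ x R)."""
    R = P.shape[1]
    return np.column_stack([np.kron(P[:, r], Q[:, r]) for r in range(R)])

def unfold(X, mode):
    """Mode-n matricization, ordered so that unfold(X, 0) = A (C ⊙ B)^T holds."""
    axes = [mode] + [ax for ax in reversed(range(X.ndim)) if ax != mode]
    return np.transpose(X, axes).reshape(X.shape[mode], -1)

def parafac_als(X, R, n_iter=100, seed=0):
    """Plain dense PARAFAC-ALS: cycle through the three least-squares updates."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(X.shape[0], R))
    B = rng.normal(size=(X.shape[1], R))
    C = rng.normal(size=(X.shape[2], R))
    for _ in range(n_iter):
        # A <- X(1)(C ⊙ B)(C^T C * B^T B)†, then cyclically for B and C.
        A = unfold(X, 0) @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
        B = unfold(X, 1) @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
        C = unfold(X, 2) @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
    return A, B, C

# Usage: fit a random rank-2 tensor; the relative error is typically near zero.
rng = np.random.default_rng(1)
X = np.einsum('ir,jr,kr->ijk', rng.normal(size=(6, 2)),
              rng.normal(size=(5, 2)), rng.normal(size=(4, 2)))
A, B, C = parafac_als(X, R=2, n_iter=200)
print(np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C)) / np.linalg.norm(X))
```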

Page 11: 1. Order of Matrix Multiplication

• A three-matrix product can be computed as either (PQ)R or P(QR).

• Assuming C ⊙ B is already available, evaluating X(1)(C ⊙ B) first requires 2mR + 2IR^2 flops, whereas evaluating (C ⊙ B)(C^T C ∗ B^T B)† first requires 2mR + 2JKR^2 flops (m = number of non-zero elements of X).

• Since I ≪ JK, the factor matrices are therefore updated left to right:

  A ← [X(1)(C ⊙ B)] (C^T C ∗ B^T B)†
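A back-of-the-envelope comparison of the two orderings (our sketch; J = K = 26 x 10^6 comes from the next slide and R = 10 from page 16, while I and m are assumed, NELL-1-flavored values):

```python
# Flop counts for evaluating X(1)(C ⊙ B)(C^T C * B^T B)† in the two orders.
I = 26_000_000           # assumed mode-1 dimension (illustrative only)
J = K = 26_000_000       # noun phrases, per the next slide
m = 144_000_000          # assumed number of non-zeros in X (illustrative only)
R = 10                   # rank used for NELL-1 on page 16

left_to_right = 2 * m * R + 2 * I * R**2       # X(1)(C ⊙ B) first
right_to_left = 2 * m * R + 2 * J * K * R**2   # (C ⊙ B)(...)† first

print(f"left-to-right : {left_to_right:.2e} flops")   # ~8.1e9
print(f"right-to-left : {right_to_left:.2e} flops")   # ~1.4e17
```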

Page 12: 2. Intermediate Data Explosion

• For the NELL-1 knowledge-base dataset with 26 million noun phrases, the intermediate matrix C ⊙ B explodes to 676 trillion rows ((26 × 10^6)^2 = 6.76 × 10^14).

• C is of size K x R, B is of size J x R, so C ⊙ B would be JK x R: far too large to materialize.

Page 13: 2. Solution for the Data Explosion

• X(1)(C ⊙ B) can be computed without explicitly forming (C ⊙ B).

• Decoupling the product as follows makes the largest dense matrix B or C rather than (C ⊙ B) as in the naïve case; column by column (reconstructed from the paper's formulation),

  M(:, r) = ((X(1) ∗ (1_I ◦ (C(:, r) ⊗ 1_J)^T)) ∗ (bin(X(1)) ∗ (1_I ◦ (1_K ⊗ B(:, r))^T))) · 1_JK

• Here M = X(1)(C ⊙ B), 1_p is the all-one vector of size p, '∗' is the element-wise (Hadamard) product, and bin(·) maps every non-zero entry to 1.

• Cost and intermediate data size for computing X(1)(C ⊙ B): [comparison table in the original slides; the point is that GigaTensor's cost scales with the number of non-zeros m rather than with JK].

Page 14: 2. Algorithm for X(1)(C ⊙ B)

[Algorithm listing from the paper (figure in the original slides)]
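Since this slide's algorithm listing is also an image, here is a single-machine NumPy sketch of the decoupled computation (ours, not the paper's MapReduce code): it produces M = X(1)(C ⊙ B) directly from the non-zeros of X, never materializing the JK x R matrix C ⊙ B.

```python
import numpy as np

def mttkrp_mode1(coords, vals, B, C, I):
    """Compute M = X(1)(C ⊙ B) from a sparse COO tensor.

    coords: (nnz, 3) integer array of (i, j, k) indices of non-zeros
    vals:   (nnz,) values X(i, j, k)
    Each non-zero contributes X(i,j,k) * B(j,r) * C(k,r) to M(i,r),
    so the cost is O(mR) with no JK x R intermediate.
    """
    R = B.shape[1]
    M = np.zeros((I, R))
    i, j, k = coords[:, 0], coords[:, 1], coords[:, 2]
    # Element-wise products per non-zero, then scatter-add into rows of M.
    np.add.at(M, i, vals[:, None] * B[j, :] * C[k, :])
    return M

# Tiny usage check against the naive dense computation.
rng = np.random.default_rng(0)
I, J, K, R, nnz = 6, 5, 4, 3, 20
coords = np.column_stack([rng.integers(0, d, nnz) for d in (I, J, K)])
vals = rng.normal(size=nnz)
B, C = rng.normal(size=(J, R)), rng.normal(size=(K, R))

M = mttkrp_mode1(coords, vals, B, C, I)

X = np.zeros((I, J, K))
np.add.at(X, (coords[:, 0], coords[:, 1], coords[:, 2]), vals)  # duplicates sum
X1 = X.transpose(0, 2, 1).reshape(I, K * J)
CkB = np.column_stack([np.kron(C[:, r], B[:, r]) for r in range(R)])
assert np.allclose(M, X1 @ CkB)
print("decoupled computation matches the naive product")
```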

Page 15: 3. MapReduce for Matrix Operations

• MapReduce is a programming model in which users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs.

• A reduce function then merges all intermediate values associated with the same intermediate key.

• The decoupled computation above maps naturally onto MapReduce.

• Given: C as < j, r, C(j, r) >, B as < j, r, B(j, r) >, X(1) as < i, j, X(1)(i, j) >.

• To calculate: the entries X(1)(i, j) C(j, r), i.e., one element-wise-product step of the decoupled computation.

• Map: X(1) and C on j, so that tuples with the same key are shuffled to the same reducer in the form < j, (C(j, r), {(i, X(1)(i, j)) ∀ i ∈ Qj}) >, where Qj is the set of row indices of the non-zero elements in column X(1)(:, j).

• Reduce: take < j, (C(j, r), {(i, X(1)(i, j)) ∀ i ∈ Qj}) > and emit < i, j, X(1)(i, j) C(j, r) > for each i ∈ Qj.
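A toy single-process simulation of this map/shuffle/reduce step (our sketch in plain Python; GigaTensor itself runs this on Hadoop):

```python
from collections import defaultdict

# Sparse X(1) as (i, j, value) tuples and one column C(:, r) as {j: value}.
X1 = [(0, 2, 3.0), (1, 2, -1.0), (1, 4, 2.0)]
C_r = {2: 0.5, 4: 10.0}

# Map phase: key every tuple by the shared column index j.
shuffled = defaultdict(lambda: {"c": None, "x": []})
for j, c_val in C_r.items():
    shuffled[j]["c"] = c_val
for i, j, x_val in X1:
    shuffled[j]["x"].append((i, x_val))

# Reduce phase: for each key j, emit <i, j, X(1)(i, j) * C(j, r)>.
emitted = [
    (i, j, x_val * group["c"])
    for j, group in shuffled.items()
    for i, x_val in group["x"]
    if group["c"] is not None
]
print(emitted)  # [(0, 2, 1.5), (1, 2, -0.5), (1, 4, 20.0)]
```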

Page 16: 4. Parallel Outer Products

• We have covered the order of computation and avoiding the intermediate data explosion. The next step is the efficient calculation of (C^T C ∗ B^T B)†.

• C^T C = Σ_k C(k, :)^T ◦ C(k, :) is a sum of outer products of rows, so it can be computed in MapReduce by summing partial results across mappers.

• Comparing the cost with the naïve implementation: [table in the original slides], where d is the number of mappers; the authors use d = 50 and R = 10 for the NELL-1 example.
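A sketch of the row-partitioned Gram computation (ours; d array blocks stand in for the d mappers):

```python
import numpy as np

rng = np.random.default_rng(0)
K, R, d = 1000, 10, 5
C = rng.normal(size=(K, R))

# Each "mapper" computes the Gram matrix of its block of rows;
# summing the d partial R x R results gives C^T C = sum_k C(k,:)^T C(k,:).
partials = [block.T @ block for block in np.array_split(C, d)]
gram = sum(partials)

assert np.allclose(gram, C.T @ C)
print(gram.shape)  # (10, 10): tiny compared to C itself
```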

Page 17: 5. Final Step – Distributed Matrix Multiplication

• The first matrix, X(1)(C ⊙ B), is of size I x R; the second, (C^T C ∗ B^T B)†, is a very small R x R matrix.

• The second matrix is broadcast to all mappers that process the first one, so the multiplication is done in a distributed way.
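A minimal sketch of the broadcast pattern (ours; row chunks stand in for mapper partitions, and the R x R matrix is "broadcast" as a shared local copy):

```python
import numpy as np

rng = np.random.default_rng(0)
I, R, d = 1200, 10, 4
tall = rng.normal(size=(I, R))    # X(1)(C ⊙ B), distributed by rows
small = rng.normal(size=(R, R))   # (C^T C * B^T B)†, broadcast to every mapper

# Each "mapper" multiplies its row block by its local copy of the small matrix.
result = np.vstack([chunk @ small for chunk in np.array_split(tall, d)])

assert np.allclose(result, tall @ small)
print(result.shape)  # (1200, 10)
```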

• In summary, GigaTensor is built upon:

  • careful choice of the order of computations,

  • avoiding the intermediate data explosion,

  • parallel outer products, and

  • distributed cache multiplication.

Page 18: Scalability

• GigaTensor can handle tensors of size at least 10^9, whereas the Tensor Toolbox fails beyond roughly 10^7. It can also handle tensors with billions of non-zero elements.

• Running time scales linearly with the number of machines.

• A few other distributed algorithms that came after it are HaTen2 and SCouT from the same authors (2015, 2016), FlexiFaCT (A. Beutel, et al., 2014), and DFacTo (J. H. Choi, et al., 2014).

Page 19: BigTensor

• GigaTensor has since been improved further, and distributed implementations of other tensor decomposition algorithms (Tucker, etc.) are packaged as BigTensor by U Kang's research group.

• Ease of use: users do not need to write the map() and reduce() functions themselves.

Page 20: References

• Kolda, T. G. and B. W. Bader (2009). "Tensor Decompositions and Applications." SIAM Review 51(3): 455-500.

• BIGtensor: Mining Billion-Scale Tensor Made Easy. Namyong Park, Byungsoo Jeon, Jungwoo Lee, and U Kang. 25th ACM International Conference on Information and Knowledge Management (CIKM) 2016, Indianapolis, United States.

• SCouT: Scalable Coupled Matrix-Tensor Factorization – Algorithms and Discoveries. ByungSoo Jeon, Inah Jeon, Lee Sael, and U Kang. 32nd IEEE International Conference on Data Engineering (ICDE) 2016, Helsinki, Finland.

• HaTen2: Billion-Scale Tensor Decompositions. Inah Jeon, Evangelos E. Papalexakis, U Kang, and Christos Faloutsos. 31st IEEE International Conference on Data Engineering (ICDE) 2015, Seoul, Korea.

• DFacTo: Distributed Factorization of Tensors. Joon Hee Choi and S. V. N. Vishwanathan. arXiv:1406.4519 [stat.ML].

• FlexiFaCT: Scalable Flexible Factorization of Coupled Tensors on Hadoop. Alex Beutel, Partha Pratim Talukdar, Abhimanu Kumar, Christos Faloutsos, Evangelos E. Papalexakis, and Eric P. Xing. Proceedings of the 2014 SIAM International Conference on Data Mining, 109-117.

Page 21: Thank you