
Using Nonnegative Matrix and Tensor Factorizations for Email Surveillance

Semi-Plenary Session IX, EURO XXII, Prague, CR

Michael W. Berry, Murray Browne, and Brett W. Bader

University of Tennessee and Sandia National Laboratories

July 10, 2007

1 / 51

Collaborators

◮ Paul Pauca, Bob Plemmons (Wake Forest)

◮ Amy Langville (College of Charleston)

◮ David Skillicorn, Parambir Keila (Queen's U.)

◮ Stratis Gallopoulos, Ioannis Antonellis (U. Patras)

2 / 51

Non-Negative Matrix Factorization (NMF)

Enron Background

Electronic Mail Surveillance

Discussion Tracking via PARAFAC/Tensor Factorization

Conclusions and References

3 / 51

NMF Origins

◮ NMF (Non-negative Matrix Factorization) can be used to approximate high-dimensional data having non-negative components.

◮ Lee and Seung (1999) demonstrated its use as a sum-by-parts representation of image data in order to both identify and classify image features.

◮ [Xu et al., 2003] demonstrated how NMF-based indexing could outperform SVD-based Latent Semantic Indexing (LSI) for some information retrieval tasks.

4 / 51

NMF for Image Processing

[Figure: sparse NMF basis images (A ≈ WH) compared with dense SVD basis images (A ≈ UΣV^T)]

Sparse NMF versus dense SVD bases; Lee and Seung (1999)

5 / 51

NMF for Text Mining (Medlars)

[Figure: bar charts of the highest weighted terms (by weight) in four NMF basis vectors]

Highest weighted terms in basis vector W*1: ventricular, aortic, septal, left, defect, regurgitation, ventricle, valve, cardiac, pressure

Highest weighted terms in basis vector W*2: oxygen, flow, pressure, blood, cerebral, hypothermia, fluid, venous, arterial, perfusion

Highest weighted terms in basis vector W*5: children, child, autistic, speech, group, early, visual, anxiety, emotional, autism

Highest weighted terms in basis vector W*6: kidney, marrow, dna, cells, nephrectomy, unilateral, lymphocytes, bone, thymidine, rats

Interpretable NMF feature vectors; Langville et al. (2006)

6 / 51

Derivation

◮ Given an m × n term-by-message (sparse) matrix X.

◮ Compute two reduced-dimension matrices W, H so that X ≃ WH; W is m × r and H is r × n, with r ≪ n.

◮ Optimization problem:

min_{W,H} ‖X − WH‖_F^2,

subject to W_ij ≥ 0 and H_ij ≥ 0, ∀ i, j.

◮ General approach: construct initial estimates for W and H and then improve them via alternating iterations.

7 / 51

Minimization Challenges

Local Minima

Non-convexity of the functional f(W, H) = ½ ‖X − WH‖_F^2 in both W and H

Non-unique Solutions

(WD)(D^{-1}H) is non-negative for any non-negative (and invertible) D

8 / 51
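The scaling ambiguity above is easy to see numerically. A minimal NumPy sketch (not from the slides; the positive diagonal D is just an illustrative choice of invertible non-negative matrix):

```python
# Non-uniqueness of NMF: (W D)(D^{-1} H) equals W H for a positive diagonal D.
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 3))           # nonnegative m x r factor
H = rng.random((3, 8))           # nonnegative r x n factor
D = np.diag([2.0, 0.5, 3.0])     # positive diagonal scaling (invertible)

W2 = W @ D                       # rescaled factors remain nonnegative
H2 = np.linalg.inv(D) @ H
# Both factor pairs give the same product, so the factorization is not unique.
assert np.allclose(W @ H, W2 @ H2)
```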

Alternative Formulations [Berry et al., 2007]

NMF Formulations

Lee and Seung (2001): Information-theoretic formulation based on the Kullback-Leibler divergence of X from WH

Guillamet, Bressan, and Vitria (2001): Diagonal weight matrix Q used (XQ ≈ WHQ) to compensate for feature redundancy (columns of W)

Wang, Jia, Hu, and Turk (2004): Constraint-based formulation using Fisher linear discriminant analysis to improve extraction of spatially localized features

Other cost function formulations: Hamza and Brady (2006), Dhillon and Sra (2005), Cichocki, Zdunek, and Amari (2006)

9 / 51

Multiplicative Method (MM)

◮ Multiplicative update rules for W and H (Lee and Seung, 1999):

1. Initialize W and H with non-negative values, and scale the columns of W to unit norm.

2. Iterate for each c, j, and i until convergence or after k iterations:

2.1 H_cj ← H_cj (W^T X)_cj / [(W^T W H)_cj + ε]

2.2 W_ic ← W_ic (X H^T)_ic / [(W H H^T)_ic + ε]

2.3 Scale the columns of W to unit norm.

◮ Setting ε = 10^{-9} will suffice [Shahnaz et al., 2006]. (A minimal NumPy sketch of these updates follows this slide.)

10 / 51
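A minimal NumPy sketch of the MM updates above (an illustration of the published update rules, not the authors' code; the function name nmf_mm and the random initialization are assumptions):

```python
import numpy as np

def nmf_mm(X, r, iters=100, eps=1e-9, seed=0):
    """Rank-r NMF of a nonnegative matrix X via Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    W /= np.linalg.norm(W, axis=0)            # step 1: unit-norm columns of W
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)  # step 2.1
        W *= (X @ H.T) / (W @ H @ H.T + eps)  # step 2.2
        W /= np.linalg.norm(W, axis=0) + eps  # step 2.3: rescale columns of W
    return W, H
```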

Normalization, Complexity, and Convergence

◮ Important to normalize X initially and the basis matrix W at each iteration.

◮ When optimizing on a unit hypersphere, the column (or feature) vectors of W, denoted by W_k, are effectively mapped to the surface of the hypersphere by repeated normalization.

◮ The MM implementation of NMF requires O(rmn) operations per iteration; Lee and Seung (1999) proved that ‖X − WH‖_F^2 is monotonically non-increasing with MM.

◮ From a nonlinear optimization perspective, MM/NMF can be considered a diagonally-scaled gradient descent method.

11 / 51

Lee and Seung MM Convergence

Definitive Convergence?

When the MM algorithm converges to a limit point in the interior of the feasible region, the point is a stationary point. The stationary point may or may not be a local minimum. If the limit point lies on the boundary of the feasible region, one cannot determine its stationarity [Berry et al., 2007].

Recent Modifications to Lee and Seung MM

Gonzalez and Zhang (2005) accelerated convergence somewhat, but the stationarity issue remains.

Lin (2005) modified the algorithm to guarantee convergence to a stationary point.

Dhillon and Sra (2005) derived update rules that incorporate weights for the importance of certain features of the approximation.

12 / 51

Alternating Least Squares Algorithms

Basic ALS Approach

ALS algorithms exploit the convexity of W or H (not both) in the underlying optimization problem. The basic iteration involves:

(LS) Solve for H in W^T W H = W^T X.
(NN) Set negative elements of H to 0.
(LS) Solve for W in H H^T W^T = H X^T.
(NN) Set negative elements of W to 0.

ALS Recovery and Constraints

Unlike the MM algorithm, an element of W (or H) that becomes 0 does not have to remain 0; the method can escape/recover from a poor path.

Paatero (1999) and Langville et al. (2006) have improved the computational complexity of the ALS approach; sparsity and non-negativity constraints are enforced. (A minimal sketch of one ALS sweep follows this slide.)

13 / 51
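A minimal NumPy sketch of the basic ALS iteration outlined above, with negative entries clipped to zero after each unconstrained least-squares solve (illustrative only; nmf_als is a hypothetical helper, not the authors' implementation):

```python
import numpy as np

def nmf_als(X, r, iters=50, seed=0):
    """Alternating least squares NMF with projection of negative entries to zero."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], r))
    for _ in range(iters):
        # (LS) solve W^T W H = W^T X for H, then (NN) clip negatives to zero
        H = np.linalg.lstsq(W, X, rcond=None)[0]
        H = np.maximum(H, 0)
        # (LS) solve H H^T W^T = H X^T for W, then (NN) clip negatives to zero
        W = np.linalg.lstsq(H.T, X.T, rcond=None)[0].T
        W = np.maximum(W, 0)
    return W, H
```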

Alternating Least Squares Algorithms, contd.

ALS Convergence

Polak (1971) showed that every limit point of a sequence of alternating variable iterates is a stationary point.

Lawson and Hanson (1995) produced the Non-Negative Least Squares (NNLS) algorithm, which was shown to converge to a local minimum.

The price for convergence of ALS algorithms is the usual high cost per iteration – Bro and de Jong (1997).

14 / 51

Hoyer’s Method

◮ From neural network applications, Hoyer (2002) enforced statistical sparsity for the weight matrix H in order to enhance the parts-based data representations in the matrix W.

◮ Mu et al. (2003) suggested a regularization approach to achieve statistical sparsity in the matrix H: point count regularization; penalize the number of nonzeros in H rather than Σ_ij H_ij.

◮ Goal of increased sparsity – better representation of parts or features spanned by the corpus (X) [Berry and Browne, 2005].

15 / 51

GD-CLS – Hybrid Approach

◮ First use MM to compute an approximation to W for each iteration – a gradient descent (GD) optimization step.

◮ Then, compute the weight matrix H using a constrained least squares (CLS) model to penalize non-smoothness (i.e., non-sparsity) in H – a common Tikhonov regularization technique used in image processing (Prasad et al., 2003).

◮ Convergence to a non-stationary point evidenced (but no formal proof as of yet).

16 / 51

GD-CLS Algorithm

1. Initialize W and H with non-negative values, and scale the columns of W to unit norm.

2. Iterate until convergence or after k iterations:

2.1 W_ic ← W_ic (X H^T)_ic / [(W H H^T)_ic + ε], for each c and i

2.2 Rescale the columns of W to unit norm.

2.3 Solve the constrained least squares problem:

min_{H_j} {‖X_j − W H_j‖_2^2 + λ‖H_j‖_2^2},

where the subscript j denotes the j-th column, for j = 1, . . . , n.

◮ Any negative values in H_j are set to zero. The parameter λ is a regularization value that is used to balance the reduction of the metric ‖X_j − W H_j‖_2^2 with enforcement of smoothness and sparsity in H. (A minimal sketch of this CLS step follows this slide.)

17 / 51
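A minimal NumPy sketch of the CLS step 2.3 above, solving all columns at once via the Tikhonov-regularized normal equations (W^T W + λI)H = W^T X and then clipping negatives (cls_step is a hypothetical helper, not the authors' code):

```python
import numpy as np

def cls_step(X, W, lam):
    """Solve min_Hj ||X_j - W H_j||^2 + lam ||H_j||^2 for every column j at once,
    then set negative entries of H to zero."""
    r = W.shape[1]
    # Normal equations of the regularized least squares problem
    H = np.linalg.solve(W.T @ W + lam * np.eye(r), W.T @ X)
    return np.maximum(H, 0)
```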

Email Collection

◮ By-product of the FERC investigation of Enron (originally contained 15 million email messages).

◮ This study used the improved corpus known as the Enron Email set, which was edited by Dr. William Cohen at CMU.

◮ This set had over 500,000 email messages. The majority were sent in the 1999 to 2001 timeframe.

18 / 51

Enron Historical 1999-2001

◮ Ongoing, problematic development of the Dabhol Power Company (DPC) in the Indian state of Maharashtra.

◮ Deregulation of the Calif. energy industry, which led to rolling electricity blackouts in the summer of 2000 (and subsequent investigations).

◮ Revelation of Enron's deceptive business and accounting practices that led to an abrupt collapse of the energy colossus in October 2001; Enron filed for bankruptcy in December 2001.

19 / 51

private Collection

◮ Parsed all mail directories (of all 150 accounts) with the exception of all documents, calendar, contacts, deleted items, discussion threads, inbox, notes inbox, sent, sent items, and sent mail; a 495-term stoplist was used and extracted terms must appear in more than 1 email and more than once globally [Berry and Browne, 2005].

◮ Distribution of messages sent in the year 2001:

Month Msgs Terms Month Msgs Terms

Jan 3,621 17,888 Jul 3,077 17,617

Feb 2,804 16,958 Aug 2,828 16,417

Mar 3,525 20,305 Sep 2,330 15,405

Apr 4,273 24,010 Oct 2,821 20,995

May 4,261 24,335 Nov 2,204 18,693

Jun 4,324 18,599 Dec 1,489 8,097

20 / 51

Term Weighting Schemes

Assessment of Term Importance

For an m × n term-by-message matrix X = [x_ij], define

x_ij = l_ij g_i d_j,

where l_ij is the local weight for term i occurring in message j, g_i is the global weight for term i in the subcollection, and d_j is a document normalization factor (set d_j = 1).

Common Term Weighting Choices

Name                        Local                    Global
txx  Term Frequency         l_ij = f_ij              g_i = 1 (none)
lex  Logarithmic Entropy    l_ij = log(1 + f_ij)     g_i = 1 + (Σ_j p_ij log p_ij) / log n
     (define p_ij = f_ij / Σ_j f_ij)

21 / 51
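A minimal NumPy sketch of the lex (log-entropy) weighting defined above, applied to a dense term-frequency matrix F = [f_ij] with d_j = 1 (illustrative only; a production system would work with sparse matrices):

```python
import numpy as np

def log_entropy_weights(F):
    """Return X = [l_ij * g_i] with l_ij = log(1 + f_ij) and
    g_i = 1 + (sum_j p_ij log p_ij) / log n, where p_ij = f_ij / sum_j f_ij."""
    m, n = F.shape
    row_sums = F.sum(axis=1, keepdims=True)
    P = np.divide(F, row_sums, out=np.zeros((m, n)), where=row_sums > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(P > 0, P * np.log(P), 0.0)
    g = 1.0 + plogp.sum(axis=1) / np.log(n)   # global weight per term (row)
    L = np.log1p(F)                           # local weights
    return L * g[:, None]                     # document factor d_j = 1
```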

Historical Benchmark on private

◮ Rank-50 NMF (X ≃ WH) for the private collection computed on a 450MHz (Dual) UltraSPARC-II processor using 100 iterations:

Mail Messages   Dictionary Terms   λ       Time (sec.)   Time (hrs.)
65,031          92,133             0.1     51,489        14.3
                                   0.01    51,393        14.2
                                   0.001   51,562        14.3

22 / 51

Visualization of private Collection Term-Msg Matrix

◮ NMF-generated reordering of the 92,133 × 65,031 term-by-message matrix (log-entropy weighting) using vismatrix [Gleich, 2006]; cluster docs in X according to arg max_i H_ij, then cluster terms according to arg max_j W_ij.

23 / 51

private with Log-Entropy Weighting

◮ Identify rows of H (i.e., H_k) from X ≃ WH with λ = 0.1; r = 50 feature vectors (W_k) generated by GD-CLS:

Feature Index (k)   Cluster Size   Topic Description                            Dominant Terms
10                  497            California                                   ca, cpuc, gov, socalgas, sempra, org, sce, gmssr, aelaw, ci
23                  43             Louise Kitchen named top woman by Fortune    evp, fortune, britain, woman, ceo, avon, fiorina, cfo, hewlett, packard
26                  231            Fantasy football                             game, wr, qb, play, rb, season, injury, updated, fantasy, image

(Cluster size ≡ number of H_k elements > rowmax/10)

24 / 51

private with Log-Entropy Weighting

◮ Additional topic clusters of significant size:

Feature Index (k)   Cluster Size   Topic Description                    Dominant Terms
33                  233            Texas Longhorn football newsletter   UT, orange, longhorn[s], texas, true, truorange, recruiting, oklahoma, defensive
34                  65             Enron collapse                       partnership[s], fastow, shares, sec, stock, shareholder, investors, equity, lay
39                  235            Emails about India                   dabhol, dpc, india, mseb, maharashtra, indian, lenders, delhi, foreign, minister

25 / 51

2001 Topics Tracked by GD-CLS

r = 50 features, lex term weighting, λ = 0.1

(New York Times, May 22, 2005)

26 / 51

Two Penalty Term Formulation

◮ Introduce smoothing on W_k (feature vectors) in addition to H_k:

min_{W,H} {‖X − WH‖_F^2 + α‖W‖_F^2 + β‖H‖_F^2},

where ‖ · ‖_F is the Frobenius norm.

◮ Constrained NMF (CNMF) iteration:

H_cj ← H_cj [(W^T X)_cj − βH_cj] / [(W^T W H)_cj + ε]

W_ic ← W_ic [(X H^T)_ic − αW_ic] / [(W H H^T)_ic + ε]

27 / 51
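A minimal NumPy sketch of one CNMF iteration as given above; clipping the numerators at zero to preserve non-negativity is an assumption not spelled out on the slide, and cnmf_update is a hypothetical helper:

```python
import numpy as np

def cnmf_update(X, W, H, alpha, beta, eps=1e-9):
    """One CNMF iteration with Frobenius-norm penalties on W and H (float arrays)."""
    H *= np.maximum(W.T @ X - beta * H, 0) / (W.T @ W @ H + eps)
    W *= np.maximum(X @ H.T - alpha * W, 0) / (W @ H @ H.T + eps)
    return W, H
```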

Term Distribution in Feature Vectors


28 / 51

British National Corpus (BNC) – Important Noun Conservation

◮ In collaboration with P. Keila and D. Skillicorn (Queen's Univ.)

◮ 289,695 email subset (all mail folders - not just private)

◮ Smoothing solely applied to the NMF W matrix (α = 0.001, 0.01, 0.1, 0.25, 0.50, 0.75, 1.00 with β = 0)

◮ Log-entropy term weighting applied to the term-by-message matrix X

◮ Monitor top ten nouns for each feature vector (ranked by descending component values) and extract those appearing in two or more features; topics assigned manually.

29 / 51

BNC Noun Distribution in Feature Vectors

Noun   GF   Entropy   α = 0.001  0.01  0.1  0.25  0.50  0.75  1.00   Topic

Waxman 680 0.424 2 2 2 2 2 Downfall

Lieberman 915 0.426 2 2 2 2 2 Downfall

Scandal 679 0.428 2 2 2 Downfall

Nominee(s) 544 0.436 4 3 2 2 2

Barone 470 0.437 2 2 2 2 Downfall

MEADE 456 0.437 2 Downfall

Fichera 558 0.438 2 2 California blackout

Prabhu 824 0.445 2 2 2 2 2 2 India-strong

Tata 778 0.448 2 India-weak

Rupee(s) 323 0.452 3 4 4 4 3 4 2 India-strong

Soybean(s) 499 0.455 2 2 2 2 2 2 2

Rushing 891 0.486 2 2 2 Football - college

Dlrs 596 0.487 2

Janus 580 0.488 2 3 2 3 India-weak

BSES 451 0.498 2 2 2 India-weak

Caracas 698 0.498 2

Escondido 326 0.504 2 2 California/Blackout

Promoters 180 0.509 2 Energy/Scottish

Aramco 188 0.550 2 India-weak

DOORMAN 231 0.598 2 Bawdy/Real Estate

30 / 51

Hoyer Sparsity Constraint

◮ sparseness(x) = (√n − ‖x‖_1/‖x‖_2) / (√n − 1), [Hoyer, 2004]

◮ Imposed as a penalty term of the form

J2(W) = (ω‖vec(W)‖_2 − ‖vec(W)‖_1)^2,

where ω = √(mk) − (√(mk) − 1)γ and vec(·) transforms a matrix into a vector by column stacking.

◮ Desired sparseness in W is specified by setting γ ∈ [0, 1]; sparseness is zero iff all vector components are equal (up to signs) and is one iff the vector has a single nonzero.

31 / 51
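A minimal NumPy sketch of Hoyer's sparseness measure and the penalty J2(W) defined above (function names are hypothetical; γ is the target sparseness in [0, 1]):

```python
import numpy as np

def sparseness(x):
    """sparseness(x) = (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1)."""
    n = x.size
    return (np.sqrt(n) - np.abs(x).sum() / np.linalg.norm(x)) / (np.sqrt(n) - 1)

def hoyer_penalty(W, gamma):
    """J2(W) = (omega * ||vec(W)||_2 - ||vec(W)||_1)^2, with
    omega = sqrt(m*k) - (sqrt(m*k) - 1) * gamma."""
    m, k = W.shape
    v = W.ravel(order="F")                             # vec(W): column stacking
    omega = np.sqrt(m * k) - (np.sqrt(m * k) - 1) * gamma
    return (omega * np.linalg.norm(v) - np.abs(v).sum()) ** 2
```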

Sample Benchmarks for Smoothing and Sparsity Constraints

◮ Elapsed CPU times for CNMF on a 3.2GHz Intel Xeon (1024KB cache, 4.1GB RAM)

◮ k = 50 feature vectors generated, log-entropy noun-weighting used on the 7,424 × 289,695 noun-by-message matrix, random W_0, H_0

W-Constraint   Iterations   Parameters                    CPU time
L2 norm        100          α = 0.1, β = 0                19.6m
L2 norm        100          α = 0.01, β = 0               20.1m
L2 norm        100          α = 0.001, β = 0              19.6m
Hoyer          30           α = 0.01, β = 0, γ = 0.8      2.8m
Hoyer          30           α = 0.001, β = 0, γ = 0.8     2.9m

32 / 51

BNC Noun Distribution in Sparsified Feature Vectors

Noun   GF   Entropy   α = 0.001  0.01  0.1  0.25  0.50  0.75  1.00   Topic

Fleischer 903 0.409 3 Downfall

Coale 836 0.414 2 Downfall

Waxman 680 0.424 2 2 2 2 2 Downfall

Businessweek 485 0.424 2

Lieberman 915 0.426 2 2 2 2 Downfall

Scandal 679 0.428 2 2 2 2 Downfall

Nominee(s) 544 0.436 4 2 2 2 2

Barone 470 0.437 2 2 2 2 Downfall

MEADE 456 0.437 2 Downfall

Fichera 558 0.438 2 2 2 California blackout

Prabhu 824 0.445 2 2 3 2 2 2 India-strong

Tata 778 0.448 2 India-weak

Rupee(s) 323 0.452 3 4 3 4 3 4 2 India-strong

Soybean(s) 499 0.455 2 2 3 2 2 2 2

Rushing 891 0.486 2 2 Football - college

Dlrs 596 0.487 2 2

Janus 580 0.488 2 3 2 2 3 India-weak

BSES 451 0.498 2 2 2 2 India-weak

Caracas 698 0.498 2

Escondido 326 0.504 2 2 California/Blackout

Promoters 180 0.509 2 Energy/Scottish

Aramco 188 0.550 2 India-weak

DOORMAN 231 0.598 2 Bawdy/Real Estate

33 / 51

Annotation Project

◮ Subset of 2001 private collection:

Month     Total   Classified   Usable
Jan,Sep   5591    1100         699
Feb       2804    900          460
Mar       3525    1200         533
Apr       4273    1500         705
May       4261    1800         894
June      4324    1025         538
Total     24778   7525         3829

◮ Approx. 40 topics identified after NMF initial clustering with k = 50 features.

34 / 51

Annotation Project, contd.

◮ Human classifiers: M. Browne (extensive background reading on the Enron collapse) and B. Singer (junior Economics major).

◮ Classify email content versus type (see the UC Berkeley Enron Email Analysis Group, http://bailando.sims.berkeley.edu/enron_email.html).

◮ As of June 2007, distributed by the U. Penn LDC (Linguistic Data Consortium); see www.ldc.upenn.edu

Citation:

Dr. Michael W. Berry, Murray Browne, and Ben Signer, 2007.
2001 Topic Annotated Enron Email Data Set.
Linguistic Data Consortium, Philadelphia.

35 / 51

Multidimensional Data Analysis via PARAFAC

[Diagram: an email graph is summarized as a term-author matrix for each month; stacking these matrices gives a term-author-month array, which multilinear algebra (nonnegative PARAFAC) factors into a sum of rank-one components]

Third dimension offers more explanatory power: uncovers new latent information and reveals subtle relationships.

Build a 3-way array such that there is a term-author matrix for each month.

36 / 51

Temporal Assessment via PARAFAC

[Diagram: a small example email graph among authors (Alice, Bob, Carl, David, Ellen, Frank, Gary, Henk, Ingrid) encoded as a term-author-month array and decomposed by PARAFAC into a sum of rank-one components]

37 / 51

Mathematical Notation

◮ Kronecker product

A ⊗ B = [ A_11 B  · · ·  A_1n B
             ⋮      ⋱       ⋮
          A_m1 B  · · ·  A_mn B ]

◮ Khatri-Rao product (columnwise Kronecker)

A ⊙ B = ( A_1 ⊗ B_1  · · ·  A_n ⊗ B_n )

◮ Outer product

A_1 ◦ B_1 = [ A_11 B_11  · · ·  A_11 B_m1
                  ⋮         ⋱        ⋮
              A_m1 B_11  · · ·  A_m1 B_m1 ]

38 / 51
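A minimal NumPy sketch of the three products above; np.kron gives the Kronecker product, and the columnwise Khatri-Rao and outer products are built from standard NumPy calls (khatri_rao is a hypothetical helper):

```python
import numpy as np

def khatri_rao(A, B):
    """Columnwise Kronecker product: column i of the result is kron(A[:, i], B[:, i])."""
    assert A.shape[1] == B.shape[1]
    return np.column_stack([np.kron(A[:, i], B[:, i]) for i in range(A.shape[1])])

A = np.arange(6.0).reshape(3, 2)
B = np.arange(8.0).reshape(4, 2)

K = np.kron(A, B)                 # Kronecker product, (3*4) x (2*2)
KR = khatri_rao(A, B)             # Khatri-Rao product, (3*4) x 2
O = np.outer(A[:, 0], B[:, 0])    # outer product of the first columns, 3 x 4
```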

PARAFAC Representations

◮ PARAllel FACtors (Harshman, 1970)

◮ Also known as CANDECOMP (Carroll & Chang, 1970)

◮ Typically solved by Alternating Least Squares (ALS)

Alternative PARAFAC formulations

X_ijk ≈ Σ_{l=1}^{r} A_il B_jl C_kl

X ≈ Σ_{l=1}^{r} A_l ◦ B_l ◦ C_l, where X is a 3-way array (tensor).

X_k ≈ A diag(C_k:) B^T, where X_k is a tensor slice.

X_{I×JK} ≈ A (C ⊙ B)^T, where X is matricized.

39 / 51
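A minimal NumPy check (illustrative, using small random factors) that the outer product and matricized forms above describe the same tensor, i.e., X = Σ_l A_l ◦ B_l ◦ C_l unfolds to X_{I×JK} = A(C ⊙ B)^T:

```python
import numpy as np

I, J, K, r = 4, 3, 2, 2
rng = np.random.default_rng(0)
A, B, C = rng.random((I, r)), rng.random((J, r)), rng.random((K, r))

# Outer product form: sum of r rank-one tensors
X = np.zeros((I, J, K))
for l in range(r):
    X += np.einsum("i,j,k->ijk", A[:, l], B[:, l], C[:, l])

# Matricized form: unfold X along the first mode and compare with A (C ⊙ B)^T
khatri_rao = lambda P, Q: np.column_stack(
    [np.kron(P[:, c], Q[:, c]) for c in range(P.shape[1])])
X_unfold = X.reshape(I, J * K, order="F")   # I x (JK) matricization
assert np.allclose(X_unfold, A @ khatri_rao(C, B).T)
```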

PARAFAC (Visual) Representations

[Figure: visual depictions of the scalar, outer product, tensor slice, and matrix forms of PARAFAC]

40 / 51

Non-negative PARAFAC Algorithm

◮ Adapted from (Mørup, 2005) and based on NMF by (Lee and Seung, 2001)

‖X_{I×JK} − A(C ⊙ B)^T‖_F = ‖X_{J×IK} − B(C ⊙ A)^T‖_F = ‖X_{K×IJ} − C(B ⊙ A)^T‖_F

◮ Minimize over A, B, C using multiplicative update rules (a minimal sketch follows this slide):

A_iρ ← A_iρ (X_{I×JK} Z)_iρ / [(A Z^T Z)_iρ + ε],  Z = (C ⊙ B)

B_jρ ← B_jρ (X_{J×IK} Z)_jρ / [(B Z^T Z)_jρ + ε],  Z = (C ⊙ A)

C_kρ ← C_kρ (X_{K×IJ} Z)_kρ / [(C Z^T Z)_kρ + ε],  Z = (B ⊙ A)

41 / 51
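A minimal NumPy sketch of the nonnegative PARAFAC multiplicative updates above for a dense 3-way array (illustrative only; nn_parafac, the random initialization, and the unfolding convention are assumptions, not the authors' code):

```python
import numpy as np

def khatri_rao(P, Q):
    return np.column_stack([np.kron(P[:, c], Q[:, c]) for c in range(P.shape[1])])

def unfold(X, mode):
    """Mode-n matricization with column-major (Fortran) ordering of the columns."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1, order="F")

def nn_parafac(X, r, iters=100, eps=1e-9, seed=0):
    """Nonnegative rank-r PARAFAC of a 3-way array X via multiplicative updates."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = rng.random((I, r)), rng.random((J, r)), rng.random((K, r))
    for _ in range(iters):
        Z = khatri_rao(C, B); A *= (unfold(X, 0) @ Z) / (A @ Z.T @ Z + eps)
        Z = khatri_rao(C, A); B *= (unfold(X, 1) @ Z) / (B @ Z.T @ Z + eps)
        Z = khatri_rao(B, A); C *= (unfold(X, 2) @ Z) / (C @ Z.T @ Z + eps)
    return A, B, C
```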

Discussion Tracking Using Year 2001 Subset

◮ 197 authors (From: user [email protected]) monitored over 12 months;

◮ Parsing the 34,427 email subset with a base dictionary of 121,393 terms (derived from 517,431 emails) produced 69,157 unique terms; the (term-author-month) array X has ∼1 million nonzeros.

◮ Term frequency weighting with constraints (global frequency ≥ 10 and email frequency ≥ 2); expert-generated stoplist of 47,154 words (M. Browne)

◮ Rank-25 tensor: A (69,157 × 25), B (197 × 25), C (12 × 25)

[Diagram: the terms × authors × time array expressed as a sum of rank-one components]

Month   Emails   Month   Emails
Jan     7,050    Jul     2,166
Feb     6,387    Aug     2,074
Mar     6,871    Sep     2,192
Apr     7,382    Oct     5,719
May     5,989    Nov     4,011
Jun     2,510    Dec     1,382

42 / 51

Tensor-Generated Group Discussions

◮ NNTF group discussions in 2001

◮ 197 authors; 8 distinguishable discussions

[Figure: normalized discussion levels per month (Jan-Dec 2001) for Groups 3, 4, 10, 12, 17, 20, 21, and 25; identified topics include CA/Energy, India, Fastow Companies, Downfall, College Football, Kaminski/Education, and Downfall newsfeeds]

43 / 51

Gantt Charts from PARAFAC Models

[Figure: two Gantt charts showing, for each of the 25 groups, the months of 2001 in which its discussion is active. NNTF/PARAFAC key: bar intervals [.01, 0.1), [0.1, 0.3), [0.3, 0.5), [0.5, 0.7), [0.7, 1.0); labeled topics include Downfall Newsfeeds, California Energy, College Football, Fastow Companies, India, Education (Kaminski), and Downfall. PARAFAC key adds the interval [-1.0, .01); labeled topics include College Football, India, Education (Kaminski), California, and Downfall. Gray = unclassified topic.]

44 / 51

Day-level Analysis for PARAFAC (Three Groups)

◮ Rank-25 tensor for 357 out of 365 days of 2001: A (69,157 × 25), B (197 × 25), C (357 × 25)

◮ Groups 3,4,5:

45 / 51

Day-level Analysis for NN-PARAFAC (Three Groups)

◮ Rank-25 tensor (best minimizer) for 357 out of 365 days of 2001: A (69,157 × 25), B (197 × 25), C (357 × 25)

◮ Groups 1,7,8:

46 / 51

NNTF Optimal Rank?

◮ No known algorithm for computing the rank of a 3-way array [Kruskal, 1989].

◮ The maximum rank is not a closed set for a given random tensor.

◮ The maximum rank of an m × n × k tensor is unknown; one weak inequality is given by

max{m, n, k} ≤ rank ≤ min{m × n, m × k, n × k}

◮ For our rank-25 NNTF, the size of the relative residual norm suggests we are still far from the maximum rank of the 3-way array.

47 / 51

Conclusions

◮ The GD-CLS/NNMF algorithm can effectively produce a parts-based approximation X ≃ WH of a sparse term-by-message matrix X.

◮ Smoothing on the features matrix (W), as opposed to the weight matrix H, forces more reuse of higher weighted (log-entropy) terms.

◮ Surveillance systems based on NNMF/NNTF algorithms show promise for monitoring discussions without the need to isolate or perhaps incriminate individuals.

◮ Potential applications include the monitoring/tracking of company morale, employee feedback, extracurricular activities, and blog discussions.

48 / 51

Future Work

◮ Further work is needed in determining the effects of alternative term weighting schemes (for X) and choices of control parameters (e.g., α, β) for CNMF and NNTF/PARAFAC.

◮ How does document (or message) clustering change with different ranks (r) in GD-CLS and NNTF/PARAFAC?

◮ How many dimensions (factors) for NNTF/PARAFAC are really needed for mining electronic mail and similar corpora? And, at what scale should each dimension be measured (e.g., time)?

◮ Efficient updating procedures are needed for both CNMF and NNTF/PARAFAC models.

49 / 51

For Further Reading

◮ M. Berry, M. Browne, A. Langville, V. Pauca, and R. Plemmons. Alg. and Applic. for Approx. Nonnegative Matrix Factorization. Comput. Stat. & Data Anal., 2007, to appear.

◮ F. Shahnaz, M.W. Berry, V.P. Pauca, and R.J. Plemmons. Document Clustering Using Non-negative Matrix Factorization. Info. Proc. & Management 42(2), 2006, pp. 373-386.

◮ M.W. Berry and M. Browne. Email Surveillance Using Non-negative Matrix Factorization. Comp. & Math. Org. Theory 11, 2005, pp. 249-264.

50 / 51

For Further Reading (contd.)

◮ P. Hoyer. Non-negative Matrix Factorization with Sparseness Constraints. J. Machine Learning Research 5, 2004, pp. 1457-1469.

◮ W. Xu, X. Liu, and Y. Gong. Document Clustering Based on Non-negative Matrix Factorization. Proceedings of SIGIR'03, Toronto, CA, 2003, pp. 267-273.

◮ J.B. Kruskal. Rank, Decomposition, and Uniqueness for 3-way and n-way Arrays. In Multiway Data Analysis, Eds. R. Coppi and S. Bolasco, Elsevier, 1989, pp. 7-18.

51 / 51