STA141C: Big Data & High Performance Statistical Computing
Lecture 9: Dimension Reduction/Word2vec
Cho-Jui Hsieh, UC Davis
May 15, 2018
Principal Component Analysis
Principal Component Analysis (PCA)
The data matrix can be big.
Example: bag-of-word model
Each document is represented by a d-dimensional vector x, where x_i is the number of occurrences of word i.
number of features = number of potential words ≈ 10,000
Feature generation for documents
Bag of n-gram features (n = 2):
10,000 words ⇒ 10,000^2 potential features
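To make this blow-up concrete, here is a small sketch (not from the slides) that builds unigram and bigram count features with scikit-learn's CountVectorizer; the toy corpus and variable names are made up.

```python
# Sketch: bag-of-word vs. bag-of-bigram count features on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["big data and high performance computing",
        "principal component analysis reduces dimension",
        "word2vec learns word representations from big data"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)

# Vocabulary sizes on this toy corpus; on a real corpus the bigram
# vocabulary grows toward the 10,000^2 worst case mentioned above.
print(len(unigrams.vocabulary_), len(bigrams.vocabulary_))
```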
Data Matrix (document)
Use the bag-of-word matrix or the normalized version (TF-IDF) for a dataset (denoted by D):

tfidf(doc, word, D) = tf(doc, word) · idf(word, D)

tf(doc, word): term frequency
(word count in the document) / (total number of terms in the document)
idf(word, D): inverse document frequency
log((number of documents) / (number of documents containing this word))
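A minimal sketch of these tf and idf formulas on a toy tokenized dataset (the documents and the helper names tf/idf/tfidf are illustrative; library implementations such as scikit-learn's TfidfVectorizer use slightly different smoothing).

```python
# Sketch of the tf-idf weighting defined above on a toy tokenized dataset D.
import math

D = [["big", "data", "big", "models"],
     ["data", "reduction"],
     ["word", "vectors", "from", "data"]]

def tf(doc, word):
    # (word count in the document) / (total number of terms in the document)
    return doc.count(word) / len(doc)

def idf(word, dataset):
    # log((number of documents) / (number of documents containing this word))
    n_with_word = sum(word in doc for doc in dataset)
    return math.log(len(dataset) / n_with_word)

def tfidf(doc, word, dataset):
    return tf(doc, word) * idf(word, dataset)

print(tfidf(D[0], "big", D))   # frequent in doc 0, rare in the corpus
print(tfidf(D[0], "data", D))  # appears in every document, so idf = 0
```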
PCA: Motivation
Data can have huge dimensionality:
Reuters text collection (rcv1): 677,399 documents, 47,236 features (words)
Pubmed abstract collection: 8,200,000 documents, 141,043 features (words)
Can we find a low-dimensional representation for each document?
Enable many learning algorithms to run efficiently
Sometimes achieve better prediction performance (de-noising)
Visualize the data
PCA: Motivation
Orthogonal projection of the data onto a lower-dimensional linear space that:
Maximizes the variance of the projected data (preserves as much information as possible)
Minimizes the reconstruction error
PCA: Formulation
Given the data x_1, \cdots, x_n \in R^d, compute the principal vector w by

w = \arg\max_{\|w\|=1} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - w^T \bar{x})^2,

where \bar{x} = \sum_i x_i / n is the mean.

First, shift the data so that \hat{x}_i = x_i - \bar{x}; then

w = \arg\max_{\|w\|=1} \frac{1}{n} \sum_{i=1}^{n} (w^T \hat{x}_i)^2 = \arg\max_{\|w\|=1} \frac{1}{n} w^T \hat{X} \hat{X}^T w,

where each column of \hat{X} is \hat{x}_i.
The first principal component w is the leading eigenvector of \hat{X} \hat{X}^T (the eigenvector corresponding to the largest eigenvalue).
PCA: Formulation
2nd principal component w2:
Perpendicular to w1
Again has the largest variance (among directions perpendicular to w1)
Eigenvector corresponding to the second largest eigenvalue
Top k principal components:
w1, · · · ,wk
Top k eigenvectors
The k-dimensional subspace with the largest variance
W = \arg\max_{W \in R^{d \times k},\ W^T W = I} \left\{ \sum_{r=1}^{k} \frac{1}{n} w_r^T \hat{X} \hat{X}^T w_r \right\}
PCA: illustration
PCA: Computation
PCA: top-k eigenvectors of X̂ X̂T
Assume \hat{X} = U \Sigma V^T; then the principal components are U_k (the top-k left singular vectors of \hat{X})
Projection of \hat{X} onto U_k:

U_k^T \hat{X} = \Sigma_k V_k^T   (a k by n matrix)

Each column is the k-dimensional feature vector for an example
PCA can be computed in two ways:
Top-k SVD of X̂
Top-k SVD of X̂ X̂^T (explicitly form this matrix only when d is small)
Need a large-scale SVD solver for dense or sparse matrices.
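A rough NumPy/SciPy sketch of both routes, assuming a toy dense data matrix with one column per example (variable names and sizes are illustrative; a sparse X̂ can be passed to svds in the same way).

```python
# Sketch: PCA via top-k SVD of the centered data matrix X_hat (d x n,
# one column per example), matching the convention on the slides.
import numpy as np
from scipy.sparse.linalg import svds

d, n, k = 200, 1000, 10
X = np.random.rand(d, n)                   # toy data
X_hat = X - X.mean(axis=1, keepdims=True)  # subtract the mean x_bar from every column

# Route 1: top-k SVD of X_hat (also works for large sparse matrices)
U_k, S_k, Vt_k = svds(X_hat, k=k)          # X_hat ~= U_k diag(S_k) Vt_k

# k-dimensional features: U_k^T X_hat = diag(S_k) Vt_k, one column per example
Z = np.diag(S_k) @ Vt_k                    # k x n

# Route 2 (only when d is small): eigenvectors of X_hat X_hat^T span the same subspace
evals, evecs = np.linalg.eigh(X_hat @ X_hat.T)
W = evecs[:, -k:]                          # top-k eigenvectors (eigh sorts ascending)
```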
Word2vec: Learning Word Representations
Word2vec: Motivation
Goal: understand the meaning of a word
Given a large text corpus, how can we learn low-dimensional features to represent a word?
Skip-gram model:
For each word w_i, define the “contexts” of the word as the words surrounding it in an L-sized window:

\cdots, w_{i-L-2}, w_{i-L-1}, \underbrace{w_{i-L}, \cdots, w_{i-1}}_{\text{contexts of } w_i}, w_i, \underbrace{w_{i+1}, \cdots, w_{i+L}}_{\text{contexts of } w_i}, w_{i+L+1}, \cdots
Get a collection of (word, context) pairs, denoted by D.
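A minimal sketch of collecting D from a tokenized corpus, assuming a toy sentence and window size L = 2 (real pipelines add sentence boundaries, subsampling of frequent words, etc.).

```python
# Sketch: collect (word, context) pairs with an L-sized window.
def skipgram_pairs(tokens, L=2):
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - L), min(len(tokens), i + L + 1)):
            if j != i:
                pairs.append((w, tokens[j]))   # (word, context)
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
D = skipgram_pairs(tokens, L=2)
print(D[:6])
```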
Skip-gram model
(Figure from http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
Use bag-of-word model
Idea 1: Use the bag-of-word model to “describe” each word
Assume we have context words c_1, \cdots, c_d in the corpus; compute

#(w, c_i) := number of times the pair (w, c_i) appears in D

For each word w, form a d-dimensional (sparse) vector describing w:

[#(w, c_1), \cdots, #(w, c_d)]
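A small sketch of turning a collection D of (word, context) pairs into the sparse count matrix with entries #(w, c) (the hard-coded pairs and variable names are made up).

```python
# Sketch: sparse word-context count matrix, one row per word, one column per context.
from collections import Counter
from scipy.sparse import csr_matrix

D = [("quick", "the"), ("quick", "brown"), ("brown", "quick"),
     ("brown", "fox"), ("fox", "brown"), ("fox", "quick")]

counts = Counter(D)                               # counts[(w, c)] = #(w, c)
words = sorted({w for w, _ in D})
contexts = sorted({c for _, c in D})
w_idx = {w: i for i, w in enumerate(words)}
c_idx = {c: j for j, c in enumerate(contexts)}

rows = [w_idx[w] for (w, c) in counts]
cols = [c_idx[c] for (w, c) in counts]
vals = [counts[pair] for pair in counts]
M_counts = csr_matrix((vals, (rows, cols)), shape=(len(words), len(contexts)))
```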
PMI/PPMI Representation
Similar to TF-IDF: need to account for the frequency of each word and each context
Instead of the raw co-occurrence count #(w, c), we can define the pointwise mutual information:

PMI(w, c) = \log \frac{\hat{P}(w, c)}{\hat{P}(w)\,\hat{P}(c)} = \log \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)},

where
#(w) = \sum_c #(w, c): number of times word w occurs in D
#(c) = \sum_w #(w, c): number of times context c occurs in D
|D|: number of pairs in D
Positive PMI (PPMI) usually achieves better performance:
PPMI(w , c) = max(PMI(w , c), 0)
M_{PPMI}: an n by d word feature matrix; each row is a word and each column is a context
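A minimal sketch of computing M_{PPMI} from a small dense count matrix, following the formulas above (zero counts, which would give PMI = −∞, are mapped to 0; at scale one would work on the sparse nonzero entries only).

```python
# Sketch: PPMI matrix from a dense word-context count matrix.
import numpy as np

def ppmi(counts):
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()                          # |D|
    w_counts = counts.sum(axis=1, keepdims=True)  # #(w)
    c_counts = counts.sum(axis=0, keepdims=True)  # #(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (w_counts * c_counts))
    pmi[~np.isfinite(pmi)] = 0.0                  # zero counts -> drop (-inf / nan)
    return np.maximum(pmi, 0.0)                   # PPMI = max(PMI, 0)

M_ppmi = ppmi([[4, 0, 1],
               [1, 2, 0],
               [0, 3, 5]])
```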
PPMI Matrix
Low-dimensional embedding (Word2vec)
Advantages of extracting low-dimensional dense representations:
Improve computational efficiency for end applications
Better visualization
Better performance (?)
Perform PCA/SVD on the sparse feature matrix:
M_{PPMI} \approx U_k \Sigma_k V_k^T

Then W^{SVD} = U_k \Sigma_k gives the representation of each word (each row is a k-dimensional feature vector for a word)
This is one of the word2vec algorithms.
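A sketch of this SVD step with a random stand-in for a real PPMI matrix (only the shape matters for the illustration); SciPy's truncated SVD gives W^{SVD} directly.

```python
# Sketch: k-dimensional word vectors from a PPMI matrix via truncated SVD.
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
M_ppmi = np.maximum(rng.standard_normal((100, 80)), 0.0)  # stand-in PPMI matrix

k = 10
U_k, S_k, Vt_k = svds(M_ppmi, k=k)   # M_ppmi ~= U_k @ np.diag(S_k) @ Vt_k
W_svd = U_k * S_k                    # row i = k-dimensional vector for word i
```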
Generalized Low-rank Embedding
The SVD basis minimizes

\min_{W, V} \| M_{PPMI} - W V^T \|_F^2

Extensions (GloVe, Google word2vec, ...):

Use a different loss function (instead of \|\cdot\|_F)
Negative sampling (put less weight on the 0s in M_{PPMI})
Add bias terms (a toy sketch is given after the references below):

M_{PPMI} \approx W V^T + b_w e^T + e b_c^T
Details and comparisons:
“Improving Distributional Similarity with Lessons Learned from Word Embeddings”, Levy et al., TACL 2015.
“GloVe: Global Vectors for Word Representation”, Pennington et al., EMNLP 2014.
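For illustration only (this is neither GloVe nor Google's word2vec), a toy sketch of fitting the bias-augmented factorization above by plain gradient descent on the squared error, with made-up sizes and learning rate.

```python
# Toy sketch: fit M ~= W V^T + b_w 1^T + 1 b_c^T by gradient descent
# on the squared Frobenius loss (dense, unweighted; real methods differ).
import numpy as np

rng = np.random.default_rng(0)
n, d, k, lr = 50, 40, 5, 1e-3
M = rng.random((n, d))

W = 0.1 * rng.standard_normal((n, k))
V = 0.1 * rng.standard_normal((d, k))
b_w = np.zeros(n)
b_c = np.zeros(d)

for _ in range(500):
    R = M - W @ V.T - b_w[:, None] - b_c[None, :]   # residual
    W += lr * (R @ V)            # descent steps (factor 2 folded into lr)
    V += lr * (R.T @ W)
    b_w += lr * R.sum(axis=1)
    b_c += lr * R.sum(axis=0)

# reconstruction error after training
print(np.linalg.norm(M - W @ V.T - b_w[:, None] - b_c[None, :]))
```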
Results
The low-dimensional embeddings are (often) meaningful:
(Figure from https://www.tensorflow.org/tutorials/word2vec)
Coming up
Tree-based algorithms
Questions?