STA141C: Big Data & High Performance Statistical Computing
Lecture 9: Dimension Reduction/Word2vec
Cho-Jui Hsieh, UC Davis
May 15, 2018
Principal Component Analysis
Principal Component Analysis (PCA)
The data matrix can be big.
Example: bag-of-word model
Each document is represented by a d-dimensional vector x, where x_i is the number of occurrences of word i.
number of features = number of potential words ≈ 10,000
Feature generation for documents
Bag of n-gram features (n = 2):
10,000 words ⇒ 10,000^2 potential features
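To make this blow-up concrete, here is a small sketch (not from the slides) that builds unigram and bigram count features with scikit-learn's CountVectorizer; the toy corpus and variable names are made up.

```python
# Sketch: bag-of-word vs. bag-of-bigram count features on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["big data and high performance computing",
        "principal component analysis reduces dimension",
        "word2vec learns word representations from big data"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)

# Vocabulary sizes on this toy corpus; on a real corpus the bigram
# vocabulary grows toward the 10,000^2 worst case mentioned above.
print(len(unigrams.vocabulary_), len(bigrams.vocabulary_))
```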
Data Matrix (document)
Use the bag-of-word matrix or the normalized version (TF-IDF) for a dataset (denoted by D):

tfidf(doc, word, D) = tf(doc, word) · idf(word, D)

tf(doc, word): term frequency
(word count in the document) / (total number of terms in the document)
idf(word, D): inverse document frequency
log((number of documents) / (number of documents containing this word))
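A minimal sketch of these tf and idf formulas on a toy tokenized dataset (the documents and the helper names tf/idf/tfidf are illustrative; library implementations such as scikit-learn's TfidfVectorizer use slightly different smoothing).

```python
# Sketch of the tf-idf weighting defined above on a toy tokenized dataset D.
import math

D = [["big", "data", "big", "models"],
     ["data", "reduction"],
     ["word", "vectors", "from", "data"]]

def tf(doc, word):
    # (word count in the document) / (total number of terms in the document)
    return doc.count(word) / len(doc)

def idf(word, dataset):
    # log((number of documents) / (number of documents containing this word))
    n_with_word = sum(word in doc for doc in dataset)
    return math.log(len(dataset) / n_with_word)

def tfidf(doc, word, dataset):
    return tf(doc, word) * idf(word, dataset)

print(tfidf(D[0], "big", D))   # frequent in doc 0, rare in the corpus
print(tfidf(D[0], "data", D))  # appears in every document, so idf = 0
```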
PCA: Motivation
Data can have huge dimensionality:
Reuters text collection (rcv1): 677,399 documents, 47,236 features (words)
Pubmed abstract collection: 8,200,000 documents, 141,043 features (words)
Can we find a low-dimensional representation for each document?
Enable many learning algorithms to run efficiently
Sometimes achieve better prediction performance (de-noising)
Visualize the data
PCA: Motivation
Orthogonal projection of the data onto a lower-dimensional linear space that:
Maximizes the variance of the projected data (preserves as much information as possible)
Minimizes the reconstruction error
PCA: Formulation
Given the data x_1, \cdots, x_n \in R^d, compute the principal vector w by

w = \arg\max_{\|w\|=1} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - w^T \bar{x})^2,

where \bar{x} = \sum_i x_i / n is the mean.

First, shift the data so that \hat{x}_i = x_i - \bar{x}; then

w = \arg\max_{\|w\|=1} \frac{1}{n} \sum_{i=1}^{n} (w^T \hat{x}_i)^2 = \arg\max_{\|w\|=1} \frac{1}{n} w^T \hat{X} \hat{X}^T w,

where each column of \hat{X} is \hat{x}_i.
The first principal component w is the leading eigenvector of \hat{X} \hat{X}^T (the eigenvector corresponding to the largest eigenvalue).
PCA: Formulation
2nd principal component w2:
Perpendicular to w1
Again has the largest variance (among directions perpendicular to w1)
Eigenvector corresponding to the second largest eigenvalue
Top k principal components:
w1, · · · ,wk
Top k eigenvectors
The k-dimensional subspace with the largest variance
W = \arg\max_{W \in R^{d \times k},\ W^T W = I} \left\{ \sum_{r=1}^{k} \frac{1}{n} w_r^T \hat{X} \hat{X}^T w_r \right\}
PCA: illustration
PCA: Computation
PCA: top-k eigenvectors of X̂ X̂T
Assume \hat{X} = U \Sigma V^T; then the principal components are U_k (the top-k left singular vectors of \hat{X})
Projection of \hat{X} onto U_k:

U_k^T \hat{X} = \Sigma_k V_k^T   (a k by n matrix)

Each column is the k-dimensional feature vector for an example
PCA can be computed in two ways:
Top-k SVD of X̂
Top-k SVD of X̂ X̂^T (explicitly form this matrix only when d is small)
Need a large-scale SVD solver for dense or sparse matrices.
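A rough NumPy/SciPy sketch of both routes, assuming a toy dense data matrix with one column per example (variable names and sizes are illustrative; a sparse X̂ can be passed to svds in the same way).

```python
# Sketch: PCA via top-k SVD of the centered data matrix X_hat (d x n,
# one column per example), matching the convention on the slides.
import numpy as np
from scipy.sparse.linalg import svds

d, n, k = 200, 1000, 10
X = np.random.rand(d, n)                   # toy data
X_hat = X - X.mean(axis=1, keepdims=True)  # subtract the mean x_bar from every column

# Route 1: top-k SVD of X_hat (also works for large sparse matrices)
U_k, S_k, Vt_k = svds(X_hat, k=k)          # X_hat ~= U_k diag(S_k) Vt_k

# k-dimensional features: U_k^T X_hat = diag(S_k) Vt_k, one column per example
Z = np.diag(S_k) @ Vt_k                    # k x n

# Route 2 (only when d is small): eigenvectors of X_hat X_hat^T span the same subspace
evals, evecs = np.linalg.eigh(X_hat @ X_hat.T)
W = evecs[:, -k:]                          # top-k eigenvectors (eigh sorts ascending)
```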
Word2vec: Learning Word Representations
Word2vec: Motivation
Goal: understand the meaning of a word
Given a large text corpus, how can we learn low-dimensional features to represent a word?
Skip-gram model:
For each word w_i, define the “contexts” of the word as the words surrounding it in an L-sized window:

\cdots, w_{i-L-2}, w_{i-L-1}, \underbrace{w_{i-L}, \cdots, w_{i-1}}_{\text{contexts of } w_i}, w_i, \underbrace{w_{i+1}, \cdots, w_{i+L}}_{\text{contexts of } w_i}, w_{i+L+1}, \cdots
Get a collection of (word, context) pairs, denoted by D.
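A minimal sketch of collecting D from a tokenized corpus, assuming a toy sentence and window size L = 2 (real pipelines add sentence boundaries, subsampling of frequent words, etc.).

```python
# Sketch: collect (word, context) pairs with an L-sized window.
def skipgram_pairs(tokens, L=2):
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - L), min(len(tokens), i + L + 1)):
            if j != i:
                pairs.append((w, tokens[j]))   # (word, context)
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
D = skipgram_pairs(tokens, L=2)
print(D[:6])
```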
Skip-gram model
(Figure from http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
Use bag-of-word model
Idea 1: Use the bag-of-word model to “describe” each word
Assume we have context words c_1, \cdots, c_d in the corpus; compute

#(w, c_i) := number of times the pair (w, c_i) appears in D

For each word w, form a d-dimensional (sparse) vector describing w:

[#(w, c_1), \cdots, #(w, c_d)]
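A small sketch of turning a collection D of (word, context) pairs into the sparse count matrix with entries #(w, c) (the hard-coded pairs and variable names are made up).

```python
# Sketch: sparse word-context count matrix, one row per word, one column per context.
from collections import Counter
from scipy.sparse import csr_matrix

D = [("quick", "the"), ("quick", "brown"), ("brown", "quick"),
     ("brown", "fox"), ("fox", "brown"), ("fox", "quick")]

counts = Counter(D)                               # counts[(w, c)] = #(w, c)
words = sorted({w for w, _ in D})
contexts = sorted({c for _, c in D})
w_idx = {w: i for i, w in enumerate(words)}
c_idx = {c: j for j, c in enumerate(contexts)}

rows = [w_idx[w] for (w, c) in counts]
cols = [c_idx[c] for (w, c) in counts]
vals = [counts[pair] for pair in counts]
M_counts = csr_matrix((vals, (rows, cols)), shape=(len(words), len(contexts)))
```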
PMI/PPMI Representation
Similar to TF-IDF: need to account for the frequency of each word and each context
Instead of the raw co-occurrence count #(w, c), we can define the pointwise mutual information:

PMI(w, c) = \log \frac{\hat{P}(w, c)}{\hat{P}(w)\,\hat{P}(c)} = \log \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)},

where
#(w) = \sum_c #(w, c): number of times word w occurs in D
#(c) = \sum_w #(w, c): number of times context c occurs in D
|D|: number of pairs in D
Positive PMI (PPMI) usually achieves better performance:
PPMI(w , c) = max(PMI(w , c), 0)
M_{PPMI}: an n by d word feature matrix; each row is a word and each column is a context
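A minimal sketch of computing M_{PPMI} from a small dense count matrix, following the formulas above (zero counts, which would give PMI = −∞, are mapped to 0; at scale one would work on the sparse nonzero entries only).

```python
# Sketch: PPMI matrix from a dense word-context count matrix.
import numpy as np

def ppmi(counts):
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()                          # |D|
    w_counts = counts.sum(axis=1, keepdims=True)  # #(w)
    c_counts = counts.sum(axis=0, keepdims=True)  # #(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (w_counts * c_counts))
    pmi[~np.isfinite(pmi)] = 0.0                  # zero counts -> drop (-inf / nan)
    return np.maximum(pmi, 0.0)                   # PPMI = max(PMI, 0)

M_ppmi = ppmi([[4, 0, 1],
               [1, 2, 0],
               [0, 3, 5]])
```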
PPMI Matrix
Low-dimensional embedding (Word2vec)
Advantages of extracting low-dimensional dense representations:
Improve computational efficiency for end applications
Better visualization
Better performance (?)
Perform PCA/SVD on the sparse feature matrix:
M_{PPMI} \approx U_k \Sigma_k V_k^T

Then W^{SVD} = U_k \Sigma_k gives the representation of each word (each row is a k-dimensional feature vector for a word)
This is one of the word2vec algorithms.
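A sketch of this SVD step with a random stand-in for a real PPMI matrix (only the shape matters for the illustration); SciPy's truncated SVD gives W^{SVD} directly.

```python
# Sketch: k-dimensional word vectors from a PPMI matrix via truncated SVD.
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
M_ppmi = np.maximum(rng.standard_normal((100, 80)), 0.0)  # stand-in PPMI matrix

k = 10
U_k, S_k, Vt_k = svds(M_ppmi, k=k)   # M_ppmi ~= U_k @ np.diag(S_k) @ Vt_k
W_svd = U_k * S_k                    # row i = k-dimensional vector for word i
```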
Generalized Low-rank Embedding
The SVD basis minimizes

\min_{W, V} \| M_{PPMI} - W V^T \|_F^2

Extensions (GloVe, Google word2vec, ...):

Use a different loss function (instead of \|\cdot\|_F)
Negative sampling (put less weight on the 0s in M_{PPMI})
Add bias terms (a toy sketch is given after the references below):

M_{PPMI} \approx W V^T + b_w e^T + e b_c^T
Details and comparisons:
“Improving Distributional Similarity with Lessons Learned from Word Embeddings”, Levy et al., TACL 2015.
“GloVe: Global Vectors for Word Representation”, Pennington et al., EMNLP 2014.
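For illustration only (this is neither GloVe nor Google's word2vec), a toy sketch of fitting the bias-augmented factorization above by plain gradient descent on the squared error, with made-up sizes and learning rate.

```python
# Toy sketch: fit M ~= W V^T + b_w 1^T + 1 b_c^T by gradient descent
# on the squared Frobenius loss (dense, unweighted; real methods differ).
import numpy as np

rng = np.random.default_rng(0)
n, d, k, lr = 50, 40, 5, 1e-3
M = rng.random((n, d))

W = 0.1 * rng.standard_normal((n, k))
V = 0.1 * rng.standard_normal((d, k))
b_w = np.zeros(n)
b_c = np.zeros(d)

for _ in range(500):
    R = M - W @ V.T - b_w[:, None] - b_c[None, :]   # residual
    W += lr * (R @ V)            # descent steps (factor 2 folded into lr)
    V += lr * (R.T @ W)
    b_w += lr * R.sum(axis=1)
    b_c += lr * R.sum(axis=0)

# reconstruction error after training
print(np.linalg.norm(M - W @ V.T - b_w[:, None] - b_c[None, :]))
```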
Results
The low-dimensional embeddings are (often) meaningful:
(Figure from https://www.tensorflow.org/tutorials/word2vec)
Coming up
Tree-based algorithms
Questions?