15-388/688 - Practical Data Science: Deep learning
J. Zico Kolter, Carnegie Mellon University, Fall 2019


Page 1:

15-388/688 - Practical Data Science: Deep learning

J. Zico Kolter
Carnegie Mellon University

Fall 2019

1

Page 2:

Outline

Recent history in machine learning

Machine learning with neural networks

Training neural networks

Specialized neural network architectures

Deep learning in data science

2

Page 3:

Outline

Recent history in machine learning

Machine learning with neural networks

Training neural networks

Specialized neural network architectures

Deep learning in data science

3

Page 4:

AlexNet

“AlexNet” (Krizhevsky et al., 2012), the winning entry of the ImageNet 2012 competition with a top-5 error rate of 15.3% (the next-best system, based on highly engineered features, got 26.1% error)

4

[Slide shows pages from Krizhevsky et al. (2012): Figure 2 (the two-GPU AlexNet architecture with a 150,528-dimensional input and roughly 60 million parameters), the data-augmentation section (random 224x224 crops and horizontal reflections, plus PCA-based shifts of RGB intensities), Figure 4 (top-5 predictions on test images and nearest-neighbor retrievals in the 4096-dimensional feature space), and the discussion noting that removing any middle convolutional layer costs about 2% top-1 accuracy, i.e. depth matters.]

Page 5:

AlphaGo

5

Page 6:

Google Translate

In November 2016, Google transitioned its translation service to a deep-learning-based system, which dramatically improved translation quality in many settings

6

Kilimanjaro is 19,710 feet of the mountain covered with snow, and it is said that the highest mountain in Africa. Top of the west, “Ngaje Ngai” in the Maasai language, has been referred to as the house of God. The top close to the west, there is a dry, frozen carcass of a leopard. Whether the leopard had what the demand at that altitude, there is no that nobody explained.

Kilimanjaro is a mountain of 19,710 feet covered with snow and is said to be the highest mountain in Africa. The summit of the west is called “Ngaje Ngai” in Masai, the house of God. Near the top of the west there is a dry and frozen dead body of leopard. No one has ever explained what leopard wanted at that altitude.

https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html

Page 7:

Outline

Recent history in machine learning

Machine learning with neural networks

Training neural networks

Specialized neural network architectures

Deep learning in data science

7

Page 8:

Neural networks for machine learning

The term “neural network” largely refers to the hypothesis class part of a machine learning algorithm:

1. Hypothesis: a non-linear hypothesis function, which involves compositions of multiple linear operators (e.g. matrix multiplications) and elementwise non-linear functions

2. Loss: “typical” loss functions for classification and regression: logistic, softmax (multiclass logistic), hinge, squared error, absolute error

3. Optimization: gradient descent, or more specifically a variant called stochastic gradient descent, which we will discuss shortly

8

Page 9:

Linear hypotheses and feature learning

Until now, we have (mostly) considered machine learning algorithms that use a linear hypothesis class

h_θ(x) = θ^T φ(x)

where φ: ℝ^n → ℝ^k denotes some set of typically non-linear features

Example: polynomials, radial basis functions, custom features like TFIDF (in many domains every 10 years or so there would be new feature types)

The performance of these algorithms depends crucially on coming up with good features

Key question: can we come up with an algorithm that will automatically learn the features themselves?

9

Page 10:

Feature learning, take one

Instead of a simple linear classifier, let's consider a two-stage hypothesis class where one linear function creates the features and another produces the final hypothesis

h_θ(x) = W_2 φ(x) + b_2 = W_2 (W_1 x + b_1) + b_2,

θ = {W_1 ∈ ℝ^{k×n}, b_1 ∈ ℝ^k, W_2 ∈ ℝ^{1×k}, b_2 ∈ ℝ}

But there is a problem:

h_θ(x) = W_2 (W_1 x + b_1) + b_2 = W̃ x + b̃

i.e., we are still just using a linear classifier (the apparent added complexity is actually not changing the underlying hypothesis function)

10
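To make this collapse concrete, here is a minimal NumPy sketch (not from the slides; the dimensions and random values are arbitrary) verifying that the two-stage hypothesis is the same as a single linear map:

```python
import numpy as np

# Compose two linear maps and compare against the equivalent single linear map.
rng = np.random.default_rng(0)
n, k = 5, 3                                   # input dimension n, "feature" dimension k
W1, b1 = rng.normal(size=(k, n)), rng.normal(size=k)
W2, b2 = rng.normal(size=(1, k)), rng.normal(size=1)
x = rng.normal(size=n)

two_stage = W2 @ (W1 @ x + b1) + b2           # h_theta(x) = W2 (W1 x + b1) + b2
Wt, bt = W2 @ W1, W2 @ b1 + b2                # collapsed weights and bias
print(np.allclose(two_stage, Wt @ x + bt))    # True: still just a linear classifier
```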

Page 11:

Neural networks

Neural networks are a simple extension of this idea, where we additionally apply a non-linear function after each linear transformation

h_θ(x) = f_2(W_2 f_1(W_1 x + b_1) + b_2)

where f_1, f_2: ℝ → ℝ are non-linear functions (applied elementwise)

Common choices of f_i:

11

Hyperbolic tangent: f(x) = tanh(x) = (e^{2x} − 1)/(e^{2x} + 1)

Sigmoid: f(x) = σ(x) = 1/(1 + e^{−x})

Rectified linear unit (ReLU): f(x) = max(x, 0)
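As a small illustration (not part of the original slides; sizes and parameter values are arbitrary), the hypothesis above can be written in a few lines of NumPy using the activation functions just listed:

```python
import numpy as np

# Activation choices from the slide:
def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def relu(v):
    return np.maximum(v, 0.0)

def two_layer_net(x, W1, b1, W2, b2, f1=np.tanh, f2=lambda v: v):
    """h_theta(x) = f2(W2 f1(W1 x + b1) + b2); f2 is often just the identity."""
    z = f1(W1 @ x + b1)              # hidden layer: the learned features
    return f2(W2 @ z + b2)

rng = np.random.default_rng(0)
n, k = 5, 3                          # input and hidden dimensions (arbitrary)
W1, b1 = rng.normal(size=(k, n)), np.zeros(k)
W2, b2 = rng.normal(size=(1, k)), np.zeros(1)
x = rng.normal(size=n)
print(two_layer_net(x, W1, b1, W2, b2, f1=relu))
print(two_layer_net(x, W1, b1, W2, b2, f1=sigmoid))
```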

Page 12:

Illustrating neural networks

We can illustrate the form of neural networks using figures like the following

The middle layer z is referred to as the hidden layer or activations

These are the learned features: nothing in the data prescribes what values they should take; it is left up to the algorithm to decide

12

[Figure: a network diagram with inputs x_1, ..., x_n mapped to hidden units z_1, ..., z_k by weights W_1, b_1, and the hidden units mapped to the output y by W_2, b_2.]

Page 13:

Deep learning

“Deep learning” refers (almost always) to machine learning using neural network models with multiple hidden layers

Hypothesis function for a k-layer network:

z_{i+1} = f_i(W_i z_i + b_i), z_1 = x, h_θ(x) = z_k

(note that z_i here refers to a vector of activations, not an entry of a vector)

13

[Figure: a deep network with layers z_1 = x, z_2, z_3, z_4, z_5 = h_θ(x), connected by weights W_1, b_1 through W_4, b_4.]
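A minimal NumPy sketch of the k-layer recursion above (illustrative only; the layer widths are arbitrary, and in practice the final f_i is often the identity):

```python
import numpy as np

def forward(x, Ws, bs, f=np.tanh):
    """z_1 = x; z_{i+1} = f(W_i z_i + b_i); returns z_k = h_theta(x)."""
    z = x
    for W, b in zip(Ws, bs):
        z = f(W @ z + b)
    return z

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4, 1]                                   # widths of z_1 ... z_5
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.normal(size=sizes[0]), Ws, bs))
```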

Page 14:

Properties of neural networks

A neural network with a single hidden layer (and enough hidden units) is a universal function approximator: it can approximate any function over the inputs

In practice, this is not that relevant (similar to how polynomials can fit any function); the more important point is that they appear to work very well in practice for many domains

The hypothesis h_θ(x) is not a convex function of the parameters θ = {W_i, b_i}, so we have the possibility of local optima

Architectural choices (how many layers, how they are connected, etc.) become important algorithmic design choices (i.e., hyperparameters)

14

Page 15:

Why use deep networks

Motivation from circuit theory: many functions can be represented more efficiently using deep networks (e.g., the parity function requires O(2^n) hidden units with a single hidden layer, but only O(n) units with O(log n) layers)
• But it is not clear whether deep learning really learns these types of networks

Motivation from biology: the brain appears to use multiple levels of interconnected neurons
• But despite the name, the connection between neural networks and biology is extremely weak

Motivation from practice: works much better for many domains
• Hard to argue with results

15

Page 16:

Why now?

16

Better models and algorithms

Lots of data

Lots of computing power

Page 17:

Poll: Benefits of deep networks

What advantages would you expect of applying a deep network to some machine learning problem versus a (pure) linear classifier?

1. Less chance of overfitting data

2. Can capture more complex prediction functions

3. Better test set performance when the number of data points is small

4. Better training set performance when the number of data points is small

5. Better test set performance when the number of data points is large

17

Page 18:

Outline

Recent history in machine learning

Machine learning with neural networks

Training neural networks

Specialized neural network architectures

Deep learning in data science

18

Page 19:

Neural networks for machine learning

Hypothesis function: neural network

Loss function: “traditional” loss, e.g. the logistic loss for binary classification:

ℓ(h_θ(x), y) = log(1 + exp(−y ⋅ h_θ(x)))

Optimization: How do we solve the optimization problem

minimize_θ ∑_{i=1}^{m} ℓ(h_θ(x_i), y_i)

Just use gradient descent as normal (or rather, a version called stochastic gradient descent)

19

Page 20:

Stochastic gradient descent

Key challenge for neural networks: we often have a very large number of samples, so computing gradients can be computationally intensive.

Traditional gradient descent computes the gradient with respect to the sum over all examples, then adjusts the parameters in this direction

θ := θ − α ∑_{i=1}^{m} ∇_θ ℓ(h_θ(x_i), y_i)

Alternative approach, stochastic gradient descent (SGD): adjust parameters based upon just one sample

θ := θ − α ∇_θ ℓ(h_θ(x_i), y_i)

and then repeat these updates for all samples

20

Page 21:

Gradient descent vs. SGD

Gradient descent, repeat:
• For i = 1, …, m: g_i ← ∇_θ ℓ(h_θ(x_i), y_i)
• Update parameters: θ ← θ − α ∑_{i=1}^{m} g_i

Stochastic gradient descent, repeat:
• For i = 1, …, m: θ ← θ − α ∇_θ ℓ(h_θ(x_i), y_i)

In practice, stochastic gradient descent uses a small collection of samples, not just one, called a minibatch

21
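A minimal NumPy sketch of minibatch SGD (not from the slides); to keep it self-contained, the example gradient below uses the logistic loss with a linear hypothesis h_θ(x) = θ^T x, but any grad_fn could be plugged in:

```python
import numpy as np

def sgd(theta, grad_fn, X, y, alpha=0.1, batch_size=32, epochs=10, seed=0):
    """Minibatch SGD: update theta using the gradient on a small random batch
    of examples rather than the full sum over all m samples."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    for _ in range(epochs):
        idx = rng.permutation(m)
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            theta = theta - alpha * grad_fn(theta, X[batch], y[batch])
    return theta

def logistic_grad(theta, Xb, yb):
    # gradient of the mean of log(1 + exp(-y * theta^T x)) over the batch
    margins = yb * (Xb @ theta)
    return -(Xb * (yb / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
y = np.sign(X @ rng.normal(size=10))          # toy labels in {-1, +1}
theta = sgd(np.zeros(10), logistic_grad, X, y)
```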

Page 22:

Computing gradients: backpropagation

So, how do we compute the gradient ∇_θ ℓ(h_θ(x_i), y_i)?

Remember 𝜃 here denotes a set of parameters, so we’re really computing gradients with respect to all elements of that set

This is accomplished via the backpropagation algorithm

We won’t cover the algorithm in detail, but backpropagation is just an application of the (multivariate) chain rule from calculus, plus “caching” intermediate terms that, for instance, occur in the gradient of both W_1 and W_2

22

Page 23:

Training neural networks in practice

The other good news is that you will rarely need to implement backpropagation yourself

Many libraries provide methods for you to just specify the neural network “forward” pass, and they automatically compute the necessary gradients

Examples: TensorFlow, PyTorch

You’ll use one of these a bit on the homework

23
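For example, here is a minimal PyTorch sketch (not from the slides; the toy data and layer sizes are made up) in which only the forward pass is specified and loss.backward() runs backpropagation automatically:

```python
import torch
import torch.nn as nn

# A small fully connected network; we only specify the forward computation,
# and autograd computes the gradients when loss.backward() is called.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.BCEWithLogitsLoss()                 # logistic loss for labels in {0, 1}
opt = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(128, 20)                         # toy inputs
y = (X[:, 0] > 0).float().unsqueeze(1)           # toy binary labels

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()                              # backpropagation, done for us
    opt.step()                                   # one (stochastic) gradient step
```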

Page 24:

Outline

Recent history in machine learning

Machine learning with neural networks

Training neural networks

Specialized neural network architectures

Deep learning in data science

24

Page 25:

Specialized architectures

Very little of the current wave of enthusiasm for deep learning has actually come from the simple “fully connected” neural network model we have seen so far

Instead, most of the excitement has come from two more specialized architectures: convolutional neural networks, and recurrent neural networks

25

Page 26:

The problem with fully-connected networks

A 256×256 (RGB) image means a ~200,000-dimensional input

A fully connected deep network would require a huge number of parameters, and would be very likely to overfit to the data

A generic deep network also doesn't capture the “natural” invariances we expect in images (location, scale)

26

[Figure: fully connected layers, where each unit of z_{i+1} has its own dense weight vector (W_i)_1, (W_i)_2, ... over all of z_i.]

Page 27:

Convolutional neural networks

Constrain the weights: require that activations in the following layer be a “local” function of the previous layer, and share the weights across all locations

It is also common to use max-pooling layers that take the maximum over a region

27

[Figure: a convolutional layer applies the same local weights W_i at every position of z_i to produce z_{i+1}; a max-pooling layer takes the maximum over each local region.]

Page 28:

Convolutional networks in practice

Actually common to use “3D” convolutions to combine multiple channels, and use multiple convolutions at each layer to create different features

Convolutions are still linear operations, and we can take gradients using backpropagation in much the same manner

28

[Figure: a “3D” convolution combines all input channels of z_i, and multiple filters (W_i)_1, (W_i)_2, ... produce multiple output feature maps in z_{i+1}.]
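A small PyTorch sketch of such a layer (illustrative only; the channel counts and kernel size are arbitrary choices): a 2D convolution over all input channels, followed by a ReLU and 2×2 max pooling.

```python
import torch
import torch.nn as nn

# One convolutional block: a convolution that maps 3 input channels to 16
# feature maps, sharing its 3x3 weights across all spatial locations,
# followed by a ReLU and 2x2 max pooling.
conv_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)

x = torch.randn(8, 3, 256, 256)       # a batch of 8 RGB 256x256 images
z = conv_block(x)                     # shape: (8, 16, 128, 128)
print(z.shape)
```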

Page 29:

Predicting sequential data

In practice, we often want to predict a sequence of outputs given a sequence of inputs

Just predicting each output independently would miss crucial information

Many examples: time series forecasting, sentence labeling, part of speech tagging, etc

29

Page 30:

Recurrent neural networks

Maintain state over time: activations are a function of the current input and the previous activations

30

[Figure: an RNN unrolled over three time steps; at each step the input feeds the hidden activations through W_1, the previous step's activations feed in through the recurrent weights W_1^h, and the output is produced through W_2.]

z_{i+1}^{(t)} = f_i(W_i x^{(t)} + W_i^h z^{(t−1)} + b_i)

h_θ(x^{(t)}) = z_k^{(t)}
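A minimal NumPy sketch of this recurrence (illustrative only; dimensions and parameters are arbitrary), where each step mixes the current input with the previous activations through a recurrent weight matrix:

```python
import numpy as np

def rnn_step(x_t, z_prev, W, Wh, b, f=np.tanh):
    """One step of a simple recurrent layer: the new activations depend on the
    current input x_t and the previous activations z_prev."""
    return f(W @ x_t + Wh @ z_prev + b)

rng = np.random.default_rng(0)
n, k = 4, 8                                     # input and hidden dimensions
W, Wh, b = rng.normal(size=(k, n)), rng.normal(size=(k, k)), np.zeros(k)

z = np.zeros(k)                                 # initial state
for x_t in rng.normal(size=(10, n)):            # a length-10 input sequence
    z = rnn_step(x_t, z, W, Wh, b)              # carry state across time steps
```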

Page 31:

Recurrent neural networks in practice

Traditional RNNs have trouble capturing long-term dependencies

It is more typical to use a more complex hidden unit and activations, called a long short-term memory (LSTM) network

31

[Slide shows Figure 1 and surrounding text from “Evolving Recurrent Neural Network Architectures” (Jozefowicz et al., 2015): standard RNNs suffer from exploding and vanishing gradients, and the LSTM addresses the vanishing-gradient problem by accumulating additive updates to a memory cell. The LSTM variant used there (no peephole connections) is:

i_t = tanh(W_xi x_t + W_hi h_{t−1} + b_i)
j_t = sigm(W_xj x_t + W_hj h_{t−1} + b_j)
f_t = sigm(W_xf x_t + W_hf h_{t−1} + b_f)
o_t = tanh(W_xo x_t + W_ho h_{t−1} + b_o)
c_t = c_{t−1} ⊙ f_t + i_t ⊙ j_t
h_t = tanh(c_t) ⊙ o_t

where the W_* are weight matrices, the b_* are biases, and ⊙ is the elementwise product; h_t is the LSTM's output and c_t is the cell.]

Figure from Jozefowicz et al., 2015
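The LSTM update quoted above translates almost directly into code. The NumPy sketch below is illustrative only (random parameters, arbitrary sizes); in practice you would use a library implementation such as torch.nn.LSTM.

```python
import numpy as np

def sigm(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the equations above; p holds the W_x*, W_h*, b_* arrays."""
    i = np.tanh(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["bi"])   # input
    j = sigm(p["Wxj"] @ x_t + p["Whj"] @ h_prev + p["bj"])      # input gate
    f = sigm(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["bf"])      # forget gate
    o = np.tanh(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["bo"])   # output gate
    c = c_prev * f + i * j                                      # additive cell update
    h = np.tanh(c) * o                                          # exposed output
    return h, c

rng = np.random.default_rng(0)
n, k = 4, 8                                                     # input and hidden sizes
p = {name: rng.normal(size=(k, n)) for name in ("Wxi", "Wxj", "Wxf", "Wxo")}
p.update({name: rng.normal(size=(k, k)) for name in ("Whi", "Whj", "Whf", "Who")})
p.update({name: np.zeros(k) for name in ("bi", "bj", "bf", "bo")})

h, c = np.zeros(k), np.zeros(k)
for x_t in rng.normal(size=(10, n)):                            # a length-10 sequence
    h, c = lstm_step(x_t, h, c, p)
```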

Page 32:

Outline

Recent history in machine learning

Machine learning with neural networks

Training neural networks

Specialized neural network architectures

Deep learning in data science

32

Page 33:

Deep learning in data science

What role does deep learning have to play in data science?

33

[Figure: of the data problems we would like to solve, about 50% are unsolvable and 50% are solvable; of the solvable half, ~45% can use “simple” machine learning and ~5% need, e.g., deep learning.]

Page 34:

Deep learning in data science

What role does deep learning have to play in data science?

34

[Figure: the same breakdown as the previous slide, but the ~5% slice is now labeled “problems that need, e.g., new deep learning”.]

Page 35:

Solving data science problems with deep learning

When you come up against some machine learning problem with “traditional” features (i.e., human-interpretable characteristics of the data), do not try to solve it by applying deep learning methods first

Use linear regression/classification, linear regression/classification with non-linear features, or gradient boosting methods instead

If these still don’t solve your problem and you can visualize the data in a way that lets you solve it “manually”, or if you really want to squeeze out a 1-2% improvement in performance, then you can apply deep learning

35

Page 36:

The exceptions

However, it's also undeniable that deep learning has made remarkable progress for structured data like images, audio, or text

For these types of data, you can use an already-trained network as a feature extractor (i.e., a way of mapping the data to some alternative, probably lower-dimensional representation)

36

Page 37:

Example: Image processing with VGG

The VGG network (Simonyan and Zisserman, 2015), trained on the ImageNet 1000-way image classification task

Given a new image classification problem, take the pre-trained VGG network and use the activations at its last hidden layer as the features

Can also “fine-tune” the last few layers of a network to specialize it to a new task

37

[Slide shows Table 1 (the ConvNet configurations A through E, from 11 to 19 weight layers, built from stacked 3×3 convolutions with max-pooling and the fully connected layers FC-4096, FC-4096, FC-1000), Table 2 (parameter counts of 133 to 144 million), and accompanying text from Simonyan and Zisserman (2015) arguing that stacks of 3×3 convolutions are more discriminative and use fewer parameters than single larger filters.]

Figure from Simonyan and Zisserman, 2015
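A sketch of this workflow using torchvision's pre-trained VGG-16 is below. The library choice and the weights="DEFAULT" argument are assumptions on my part (the slides don't prescribe a library, and the exact argument depends on the torchvision version); the idea is simply to drop the final 1000-way layer and keep the 4096-dimensional penultimate activations as features.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load VGG-16 pre-trained on ImageNet and remove the final classification layer,
# keeping the 4096-dimensional penultimate activations as a feature extractor.
vgg = models.vgg16(weights="DEFAULT")
vgg.eval()
feature_extractor = nn.Sequential(
    vgg.features, vgg.avgpool, nn.Flatten(),
    *list(vgg.classifier.children())[:-1],      # drop the 1000-way output layer
)

# Stand-in for a batch of preprocessed 224x224 images (normalized as for ImageNet).
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    feats = feature_extractor(x)                # shape (1, 4096)
print(feats.shape)
```

These features can then be fed to a simple classifier (e.g. logistic regression) for the new task.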

Page 38:

Example: text processing with word2vec

word2vec (Mikolov et al., 2013) is a method developed for predicting surrounding words from a given word

To do so, it creates an “embedding” for every word that acts as a good surrogate for the things the word can mean; pre-trained versions are available

Bottom line: instead of using a bag of words, use word2vec to get a vector representation of each word in a corpus

38

!"#$

%&'(#)))))))))))'*+,-.#/+&))))))+(#'(#

!"#01$

!"#02$

!"#32$

!"#31$

Figure 1: The Skip-gram model architecture. The training objective is to learn word vector representationsthat are good at predicting the nearby words.

In this paper we present several extensions of the original Skip-gram model. We show that sub-sampling of frequent words during training results in a significant speedup (around 2x - 10x), andimproves accuracy of the representations of less frequent words. In addition, we present a simpli-fied variant of Noise Contrastive Estimation (NCE) [4] for training the Skip-grammodel that resultsin faster training and better vector representations for frequent words, compared to more complexhierarchical softmax that was used in the prior work [8].

Word representations are limited by their inability to represent idiomatic phrases that are not com-positions of the individual words. For example, “Boston Globe” is a newspaper, and so it is not anatural combination of the meanings of “Boston” and “Globe”. Therefore, using vectors to repre-sent the whole phrases makes the Skip-gram model considerably more expressive. Other techniquesthat aim to represent meaning of sentences by composing the word vectors, such as the recursiveautoencoders [15], would also benefit from using phrase vectors instead of the word vectors.

The extension from word based to phrase based models is relatively simple. First we identify a largenumber of phrases using a data-driven approach, and then we treat the phrases as individual tokensduring the training. To evaluate the quality of the phrase vectors, we developed a test set of analogi-cal reasoning tasks that contains both words and phrases. A typical analogy pair from our test set is“Montreal”:“Montreal Canadiens”::“Toronto”:“TorontoMaple Leafs”. It is considered to have beenanswered correctly if the nearest representation to vec(“Montreal Canadiens”) - vec(“Montreal”) +vec(“Toronto”) is vec(“Toronto Maple Leafs”).

Finally, we describe another interesting property of the Skip-gram model. We found that simplevector addition can often produce meaningful results. For example, vec(“Russia”) + vec(“river”) isclose to vec(“Volga River”), and vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”). Thiscompositionality suggests that a non-obvious degree of language understanding can be obtained byusing basic mathematical operations on the word vector representations.

2 The Skip-gram Model

The training objective of the Skip-gram model is to find word representations that are useful forpredicting the surrounding words in a sentence or a document. More formally, given a sequence oftraining wordsw1, w2, w3, . . . , wT , the objective of the Skip-grammodel is to maximize the averagelog probability

1

T

T!

t=1

!

−c≤j≤c,j=0

log p(wt+j |wt) (1)

where c is the size of the training context (which can be a function of the center word wt). Largerc results in more training examples and thus can lead to a higher accuracy, at the expense of the

2

Figure from Mikolov et al., 2013
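One way to do this in practice (not from the slides) is with gensim's downloader and the pre-trained Google News vectors; the model name, and averaging word vectors as a simple stand-in for a bag-of-words representation, are illustrative choices:

```python
import numpy as np
import gensim.downloader as api

# Download pre-trained word2vec vectors (a large download the first time).
wv = api.load("word2vec-google-news-300")       # 300-dimensional word vectors

def doc_vector(text):
    """Average the word2vec vectors of in-vocabulary tokens: a simple
    alternative to a bag-of-words representation of the document."""
    tokens = [t for t in text.lower().split() if t in wv]
    if not tokens:
        return np.zeros(wv.vector_size)
    return np.mean([wv[t] for t in tokens], axis=0)

print(wv.most_similar("king", topn=3))          # nearest words in embedding space
print(doc_vector("practical data science").shape)   # (300,)
```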

Page 39:

Example: text processing with BERT

BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) trains a language model to predict missing elements of a sentence and, given a pair of sentences, to predict whether one follows the other

At application time, you can fine-tune this generic model for many other tasks, such as question answering, sentence classification, etc.

39

Figure from Devlin et al., 2018
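A sketch of using a pre-trained BERT model as a feature extractor via the Hugging Face transformers library (an assumption on my part; the slides don't name a library, and taking the [CLS] hidden state is just one common choice of sentence feature):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["Deep learning works well on text.", "Use BERT as a feature extractor."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**batch)

# One common choice of sentence feature: the hidden state of the [CLS] token.
features = out.last_hidden_state[:, 0, :]       # shape (2, 768)
print(features.shape)
```

Fine-tuning for a specific task (e.g. sentence classification) would instead attach a small output layer and continue training the whole model on labeled data.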