
Self-Attention For Generative Models

Ashish Vaswani and Anna Huang

Joint work with: Noam Shazeer, Niki Parmar, Lukasz Kaiser, Illia Polosukhin, Llion Jones, Justin Gilmer, David Bieber, Jonathan Frankle, Jakob Uszkoreit, and others.


Learning Representations of Variable Length Data

Basic building block of sequence-to-sequence learning

Neural machine translation, summarization, QA, …


Recurrent Neural Networks

Model of choice for learning variable-length representations.

Natural fit for sentences and sequences of pixels.

LSTMs, GRUs and variants dominate recurrent models.



But…

Sequential computation inhibits parallelization.

No explicit modeling of long and short range dependencies.

We want to model hierarchy.

RNNs (w/ sequence-aligned states) seem wasteful!


Convolutional Neural Networks?

Trivial to parallelize (per layer).

Exploits local dependencies.

‘Interaction distance’ between positions is linear or logarithmic.

Long-distance dependencies require many layers.


Attention

Attention between encoder and decoder is crucial in NMT.

Why not use attention for representations?


Self-Attention


Text generation


Self-Attention

Constant ‘path length’ between any two positions.

Gating/multiplicative interactions.

Trivial to parallelize (per layer).

Can replace sequential computation entirely?


Previous work

Classification & regression with self-attention:

Parikh et al. (2016), Lin et al. (2016)

Self-attention with RNNs:

Long et al. (2016), Shao, Gouws et al. (2017)

Recurrent attention:

Sukhbaatar et al. (2015)


The Transformer


Encoder Self-Attention


Decoder Self-Attention


FLOPs

Self-Attention: O(length² · dim)

RNN (LSTM): O(length · dim²)

Convolution: O(length · dim² · kernel_width)

Attention is Cheap!


FLOPs

Self-Attention: O(length² · dim) = 4·10⁹

RNN (LSTM): O(length · dim²) = 16·10⁹

Convolution: O(length · dim² · kernel_width) = 6·10⁹

Attention is Cheap!

length=1000 dim=1000 kernel_width=3
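As a quick sanity check, the slide's numbers can be reproduced with a small back-of-the-envelope calculation. This is one plausible accounting of the constant factors (multiply-adds counted as 2 FLOPs, an LSTM with 4 gates each using an input and a recurrent dim×dim matrix, attention projections ignored), offered as an illustration rather than the talk's own arithmetic:

# Rough per-layer FLOP counts with the slide's setting: length=1000, dim=1000, kernel_width=3.
length, dim, k = 1000, 1000, 3

self_attention = 2 * length**2 * dim + 2 * length**2 * dim  # Q @ K^T, then weights @ V
rnn_lstm = 4 * 2 * (2 * dim**2) * length                    # 4 gates x (input + recurrent matmul) per step
convolution = 2 * length * dim**2 * k                       # one dim -> dim convolution of width k

print(f"self-attention: {self_attention / 1e9:.0f}e9 FLOPs")  # 4e9
print(f"RNN (LSTM):     {rnn_lstm / 1e9:.0f}e9 FLOPs")        # 16e9
print(f"convolution:    {convolution / 1e9:.0f}e9 FLOPs")     # 6e9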

Attention: a weighted average

The cat stuck out its tongue and licked its owner

Convolutions

I kicked the ball — Who? Did what? To whom?

Self-Attention

I kicked the ball — Who? Did what? To whom?

Parallel attention heads

I kicked the ball — Who? Did what? To whom?

Self-Attention: Averaging

(a single attention distribution for "kicked" averages over the whole sentence)

Attention head: Who

Attention head: Did What?

Attention head: To Whom?

Multihead Attention

(separate attention heads for "kicked" pick out who, did what, and to whom)

Convolution: different linear transformations by relative position.

Attention: a weighted average.

Multi-head Attention: parallel attention layers with different linear transformations on input and output.

(all illustrated on "The cat stuck out its tongue and licked its owner")
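A minimal numpy sketch of multi-head scaled dot-product self-attention, to make "parallel attention layers with different linear transformations" concrete. Shapes, names, and the random weights are illustrative, not the talk's implementation:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """x: (length, d_model). Each W*: (d_model, d_model). Returns (length, d_model)."""
    length, d_model = x.shape
    d_head = d_model // num_heads
    # Different linear transformations per head on the input.
    q = (x @ Wq).reshape(length, num_heads, d_head).transpose(1, 0, 2)  # (heads, L, d_head)
    k = (x @ Wk).reshape(length, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(length, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention: every position attends to every position.
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)                 # (heads, L, L)
    weights = softmax(logits, axis=-1)
    heads = weights @ v                                                  # (heads, L, d_head)
    # Concatenate heads and apply the output transformation.
    out = heads.transpose(1, 0, 2).reshape(length, d_model)
    return out @ Wo

# Example usage with random weights.
rng = np.random.default_rng(0)
L, d_model, h = 5, 16, 4
x = rng.normal(size=(L, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_self_attention(x, Wq, Wk, Wv, Wo, num_heads=h).shape)  # (5, 16)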


Results


Machine Translation: WMT-2014 BLEU

              EN-DE   EN-FR
GNMT (orig)   24.6    39.9
ConvSeq2Seq   25.2    40.5
Transformer*  28.4    41.8

Attention is All You Need (NeurIPS 2017) Vaswani*, Shazeer*, Parmar*, Uszkoreit*, Jones*, Kaiser*, Gomez*, Polosukhin*

*Transformer models trained >3x faster than the others.


Importance of residuals


Importance of Residuals

Residuals carry positional information to higher layers, among other information.

(Figures: with residuals; without residuals; without residuals, with timing signals.)


Training Details

ADAM optimizer with a learning rate warmup (warmup + exponential decay); a sketch of such a schedule follows this list.

Dropout during training at every layer just before adding residual

Layer-norm

Attention dropout (for some experiments)

Checkpoint-averaging

Label smoothing

Auto-regressive decoding with beam search and length biasing…
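For the first item above, the schedule published in "Attention is All You Need" is lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)): linear warmup followed by decay. A small sketch of that schedule (the slide also mentions exponential decay; the exact hyperparameters used in the talk's experiments are not given here, so the defaults below are the paper's base-model values):

def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Warmup-then-decay learning rate from "Attention is All You Need":
    increases linearly for warmup_steps, then decays proportionally to 1/sqrt(step)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (100, 1000, 4000, 40000, 100000):
    print(s, f"{transformer_lr(s):.2e}")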


Results


Generating Wikipedia by Summarizing Long Sequences

                             ROUGE
seq2seq-attention            12.7
Transformer-ED (L=500)       34.2
Transformer-DMCA (L=11000)   36.2

Liu, Saleh, et al. (ICLR 2018).


Self-Similarity, Image and Music Generation


Self-similarity in images

https://en.wikipedia.org/wiki/Self-similarity


Self-Similarity in Images

Starry Night (Van Gogh, June 1889)


Self-similarity in music

Motifs repeat, immediately and also at a distance.


Probabilistic Image Generation

Model the joint distribution of pixels

Turning it into a sequence modeling problem

Assigning probabilities allows measuring generalization
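Concretely, "turning it into a sequence modeling problem" means factorizing the joint distribution autoregressively, p(x) = ∏ᵢ p(xᵢ | x₁..xᵢ₋₁) over pixels in a fixed (e.g. raster-scan) order, so generalization can be measured as held-out negative log-likelihood. A toy sketch, where predict_logits is a hypothetical stand-in for any autoregressive model (PixelRNN/PixelCNN, or the Image Transformer discussed below):

import numpy as np

def bits_per_dim(pixels, predict_logits):
    """pixels: 1-D array of ints in [0, 255], in raster-scan order.
    predict_logits(prefix) -> length-256 array of unnormalized logits for the
    next pixel value given the prefix (a stand-in for a trained model)."""
    total_nll = 0.0
    for i, value in enumerate(pixels):
        logits = predict_logits(pixels[:i])
        # log-softmax, numerically stable
        logp = logits - logits.max() - np.log(np.sum(np.exp(logits - logits.max())))
        total_nll += -logp[value]
    return total_nll / (len(pixels) * np.log(2.0))  # nats -> bits, per dimension

# Example with a trivial uniform "model": every pixel costs log2(256) = 8 bits.
uniform = lambda prefix: np.zeros(256)
print(bits_per_dim(np.array([0, 17, 255, 3]), uniform))  # ~8.0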


Probabilistic Image Generation

RNNs and CNNs are state-of-the-art (PixelRNN, PixelCNN)

CNNs incorporating gating now match RNNs in quality

CNNs are much faster due to parallelization

van den Oord et al. (2016), Salimans et al. (2017), Kalchbrenner et al. (2016)


Probabilistic Image Generation

Long-range dependencies matter for images (e.g. symmetry)

Likely increasingly important with increasing image size

Modeling long-range dependencies with CNNs requires either:

many layers, likely making training harder, or

large kernels, at large parameter/computational cost.


Texture Synthesis with Self-Similarity

Texture Synthesis by Non-parametric Sampling (Efros and Leung, 1999)


Non-local Means

(Buades, Coll, and Morel, 2005)


Non-local Means

A Non-local Algorithm for Image Denoising (Buades, Coll, and Morel. CVPR 2005)

Non-local Neural Networks (Wang et al., 2018)


Previous work

Self-attention:

Parikh et al. (2016), Lin et al. (2016), Vaswani et al. (2017)

Autoregressive Image Generation:

van den Oord et al. (2016), Salimans et al. (2017)


Self-Attention


The Image Transformer


Decoder Self-Attention



FLOPs

Self-Attention: O(length² · dim) (length = 3072 for images)

RNN (LSTM): O(length · dim²)

Convolution: O(length · dim² · kernel_width)

Attention is Cheap if length << dim!


Combining Locality with Self-Attention

Restrict the attention windows to be local neighborhoods

Good assumption for images because of spatial locality
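A minimal sketch of that restriction: windowed 1-D causal attention, in which each query position attends only to a fixed window of preceding positions, so the cost drops from O(length² · dim) to O(length · window · dim). This is only an illustration; the Image Transformer's actual query-block/memory-block layout differs:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_causal_attention(q, k, v, window):
    """q, k, v: (length, d). Position i attends to positions
    max(0, i - window + 1) .. i only."""
    length, d = q.shape
    out = np.zeros_like(v)
    for i in range(length):
        lo = max(0, i - window + 1)
        logits = q[i] @ k[lo:i + 1].T / np.sqrt(d)
        out[i] = softmax(logits) @ v[lo:i + 1]
    return out

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(3072, 64))   # length = 3072 pixels, as on the earlier slide
print(local_causal_attention(q, k, v, window=256).shape)  # (3072, 64)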


Image Transformer Layer



Tasks

Super-resolution

Unconditional and Conditional Image generation


Results

Image Transformer. Parmar*, Vaswani*, Uszkoreit, Kaiser, Shazeer, Ku, and Tran. ICML 2018.


Unconditional Image Generation

                              CIFAR-10 (Test)   ImageNet (Validation)
PixelRNN                      3.00              3.86
Gated PixelCNN                3.03              3.83
PixelCNN++                    2.92 (dmol)       -
PixelSNAIL                    2.85              3.8
Image Transformer, 1D local   2.9 (xent)        3.77
Image Transformer, 1D local   2.9 (dmol)        3.78

Cross entropy (bits/dim) of various models on the CIFAR-10 and ImageNet datasets.


CIFAR-10 Samples


CelebA Super Resolution

(Figure: Input, Local 1D and Local 2D samples at Γ = 0.8, 0.9, 1.0, and the ground truth.)


CelebA Super Resolution — % Fooled

                                      Γ = n/a   Γ = 1.0       Γ = 0.9      Γ = 0.8
ResNet                                4.0       -             -            -
srez GAN (Garcia, 2016)               8.5       -             -            -
Pixel Recursive (Dahl et al., 2017)   -         11.0          10.4         10.25
Image Transformer, 1D local           -         35.94 ± 3.0   33.5 ± 3.5   29.6 ± 4.0
Image Transformer, 2D local           -         36.11 ± 2.5   34 ± 3.5     30.64 ± 4.0

Human eval performance for the Image Transformer on CelebA. The fraction of humans fooled is significantly better than the previous state of the art.


CIFAR-10 Super-Resolution


Conditional Image Completion


Music generation using relative self-attention

Music Transformer (ICLR 2019) by Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu and Douglas Eck.

Blog post: https://magenta.tensorflow.org/music-transformer


Raw representations in music and language

(Image from Simon & Oore, 2016, comparing raw representations: text and speech for language, and their counterparts for music.)


Music language model: prior work — Performance RNN (Simon & Oore, 2016)

An RNN language model (inputs x_t, hidden states h_t) over a vocabulary of performance events: note on, note off, note velocity, and advance clock (time shift).

Continuations to given initial motif

Given motif, with continuations from RNN-LSTM, Transformer, and Music Transformer.

Self-Similarity in Music


Sample from Music Transformer


Attention: a weighted average

TimeShift100 TimeShift100 TimeShift30 NoteOn60 TimeShift20 NoteOn62 TimeShift90 NoteOff62 NoteOff60 TimeShift90

Convolution: different linear transformations by relative position.

Relative attention (Shaw et al., 2018)

Multihead attention + convolution?

Closer look at attention

Q Erᵀ


Closer look at relative attention

Absolute position pairs (query, key):
(0,0) (0,1) (0,2)
(1,0) (1,1) (1,2)
(2,0) (2,1) (2,2)

Relative distances (key − query):
 0  1  2
-1  0  1
-2 -1  0

Q Erᵀ: attention logits modulated by relative positions.
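In Shaw et al.'s formulation, the logits pick up a second term that depends on the (clipped) relative distance between query i and key j, on top of the usual content term. A small numpy sketch of the naive version that materializes the full (L, L, D) table of relative embeddings — i.e. the O(L²D)-memory formulation discussed on the next slides; names and shapes are illustrative:

import numpy as np

def relative_attention_logits(q, k, rel_emb, max_dist):
    """q, k: (L, d). rel_emb: (2*max_dist + 1, d), one embedding per clipped
    relative distance in [-max_dist, max_dist].
    Returns (L, L) logits = (Q K^T + Q Er^T) / sqrt(d), where Er[i, j] is the
    embedding of the clipped distance j - i."""
    L, d = q.shape
    dist = np.arange(L)[None, :] - np.arange(L)[:, None]      # (L, L), entry = key - query
    dist = np.clip(dist, -max_dist, max_dist) + max_dist      # indices into rel_emb
    Er = rel_emb[dist]                                        # (L, L, d): the O(L^2 D) tensor
    content = q @ k.T                                         # Q K^T
    position = np.einsum('id,ijd->ij', q, Er)                 # Q Er^T
    return (content + position) / np.sqrt(d)

rng = np.random.default_rng(0)
L, d, max_dist = 6, 8, 4
q, k = rng.normal(size=(L, d)), rng.normal(size=(L, d))
rel = rng.normal(size=(2 * max_dist + 1, d))
print(relative_attention_logits(q, k, rel, max_dist).shape)   # (6, 6)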


Machine Translation (Shaw et al., 2018)

Model             Position Representation   BLEU EN-DE   BLEU EN-FR
Transformer Big   Absolute                  27.9         41.3
Transformer Big   Relative                  29.2         41.5


Previous work (Shaw et al., 2018): O(L²D) — 8.5 GB per layer (L=2048, D=512)

Gather the relative embeddings, indexed by relative distance, into an L×L×D tensor, then multiply by Q.


Our formulation: O(LD) — 4.2 MB per layer (L=2048, D=512)

Skew: convert the matrix from absolute-by-relative to absolute-by-absolute indexing with a pad, reshape, and slice along the query index i_q.


Goal of skewing procedure: transform the matrix indexed by (absolute position, relative distance) into one indexed by (absolute position, absolute position).


Skewing to reduce relative memory from O(L²D) to O(LD)

Previous work, O(L²D) (8.5 GB per layer): gather the relative embeddings Er into an L×L×D tensor of relative distances, then multiply by Q.

Our work, O(LD) (4.2 MB per layer): directly multiply Q by Er to get Q Erᵀ, then skew it (pad, reshape, slice along i_q) into Srel, the logits indexed by absolute positions.

Per layer, L=2048, D=512.
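A small numpy sketch of the skewing trick for the causal case, following the pad/reshape/slice recipe above and checked against the naive gather. The embedding layout (rows of Er ordered from distance -(L-1) up to 0) matches the Music Transformer description; the variable names are illustrative:

import numpy as np

def skew(rel):
    """rel = Q @ Er.T, shape (L, L); column m holds q_i . e_r for relative
    distance r = m - (L - 1), i.e. distances -(L-1) .. 0.
    Returns Srel with Srel[i, j] = q_i . e_{j-i} for j <= i (the upper
    triangle is junk/zeros and is removed by the causal mask anyway)."""
    L = rel.shape[0]
    padded = np.pad(rel, ((0, 0), (1, 0)), mode='constant')  # one dummy column on the left
    reshaped = padded.reshape(L + 1, L)                      # re-interpret as (L+1, L)
    return reshaped[1:]                                      # drop the first row -> (L, L)

# Check the skew against a direct O(L^2 D) gather.
rng = np.random.default_rng(0)
L, d = 5, 4
Q = rng.normal(size=(L, d))
Er = rng.normal(size=(L, d))                                 # Er[m] embeds distance m - (L-1)

s_rel = skew(Q @ Er.T)                                       # only O(L D) extra memory
direct = np.array([[Q[i] @ Er[j - i + L - 1] if j <= i else 0.0
                    for j in range(L)] for i in range(L)])
print(np.allclose(np.tril(s_rel), direct))                   # True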


A Jazz sample from Music Transformer


A Jazz sample from Music Transformer


Convolutions and Translational Equivariance



Relative positions: Translational Equivariance


Relative Attention And Graphs


Relative Attention And Graphs

Relational inductive biases, deep learning, and graph networks. (Battaglia et al., 2018)

Self-Attention With Relative Position Representations (Shaw et al., 2018)


Message Passing Neural Networks

(Figure: nodes h1, h2, h3 exchanging messages along graph edges.)

Slide credit: Justin Gilmer

Neural Message Passing for Quantum Chemistry. Gilmer et al. ICML 2017.
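A generic sketch of one message-passing round in the spirit of Gilmer et al. (2017): sum incoming messages, then update each node state. The specific message and update functions here (a single linear message and a tanh update, ignoring edge features) are simplifications, not the paper's exact choices:

import numpy as np

def mpnn_step(h, edges, W_msg, W_upd):
    """One round of message passing on a graph.
    h: (num_nodes, d) node states; edges: list of (src, dst) pairs.
    m_dst = sum over incoming edges of h_src @ W_msg,
    then h_dst <- tanh([h_dst, m_dst] @ W_upd)."""
    messages = np.zeros_like(h)
    for src, dst in edges:
        messages[dst] += h[src] @ W_msg            # message along edge src -> dst
    combined = np.concatenate([h, messages], axis=1)
    return np.tanh(combined @ W_upd)               # updated node states, (num_nodes, d)

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=(3, d))                        # nodes h1, h2, h3 from the figure
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]           # an undirected path h1 - h2 - h3
W_msg = rng.normal(size=(d, d)) * 0.1
W_upd = rng.normal(size=(2 * d, d)) * 0.1
print(mpnn_step(h, edges, W_msg, W_upd).shape)     # (3, 8)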


Multiple Towers

Mixing Network

● Run k smaller copies of the MPNN in parallel.

● Mix node states after each message pass.

● Offers a factor of k speedup for the same node dimension d (> 2x speedup when d=200).

● Also helped improve performance when used with matrix multiply message function.

Slide credit: Justin Gilmer
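A sketch of the multiple-towers idea described above, with hypothetical shapes (not the paper's implementation): split the node state into k towers, message-pass within each tower independently, then mix node states across towers with a shared dense layer.

import numpy as np

def towers_step(h, edges, tower_weights, W_mix):
    """h: (num_nodes, d). tower_weights: k per-tower message matrices of
    shape (d/k, d/k). W_mix: (d, d) mixing network applied after the pass."""
    num_nodes, d = h.shape
    k = len(tower_weights)
    chunk = d // k
    tower_outputs = []
    for t, W_msg in enumerate(tower_weights):
        ht = h[:, t * chunk:(t + 1) * chunk]             # this tower's slice of the state
        msgs = np.zeros_like(ht)
        for src, dst in edges:
            msgs[dst] += ht[src] @ W_msg                 # message pass within the tower
        tower_outputs.append(np.tanh(ht + msgs))
    return np.concatenate(tower_outputs, axis=1) @ W_mix  # mix node states across towers

rng = np.random.default_rng(0)
num_nodes, d, k = 3, 8, 4
h = rng.normal(size=(num_nodes, d))
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
tower_weights = [rng.normal(size=(d // k, d // k)) * 0.1 for _ in range(k)]
W_mix = rng.normal(size=(d, d)) * 0.1
print(towers_step(h, edges, tower_weights, W_mix).shape)  # (3, 8)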


Graph Library (code), with Justin Gilmer, Jonathan Frankle, and David Bieber.


Self-Attention

Constant ‘path length’ between any two positions.

Unbounded memory.

Trivial to parallelize (per layer).

Models Self-Similarity.

Relative attention provides expressive timing, equivariance, and extends naturally to graphs.


Non-Autoregressive Transformer (Gu, Bradbury, et al., 2018)

Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement (Lee, Mansimov, and Cho, 2018)

Fast Decoding in Sequence Models Using Discrete Latent Variables (ICML 2018). Kaiser, Roy, Vaswani, Parmar, Bengio, Uszkoreit, Shazeer.

Towards a Better Understanding of Vector Quantized Autoencoders. Roy, Vaswani, Parmar, Neelakantan, 2018.

Blockwise Parallel Decoding for Deep Autoregressive Models (NeurIPS 2018). Stern, Shazeer, Uszkoreit.

Active Research Area


Transfer learning


Improving Language Understanding by Generative Pre-Training (Radford, Narasimhan, Salimans, and Sutskever)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin, Chang, Lee, and Toutanova)


Optimization and Large Models


Adafactor: Adaptive Learning Rates with Sublinear Memory Cost (ICML 2018). Shazeer, Stern.

Memory-Efficient Adaptive Optimization for Large-Scale Learning (2019). Anil, Gupta, Koren, Singer.

Mesh-TensorFlow: Deep Learning for Supercomputers (NeurIPS 2018). Shazeer, Cheng, Parmar, Tran, Vaswani, Koanantakool, Hawkins, Lee, Hong, Young, Sepassi, Hechtman. Code available (5-billion-parameter models).


Self-attention in Other Work.


Generating Wikipedia by Summarizing Long sequences. (ICLR 2018). Liu, Saleh, Pot, Goodrich, Sepassi, Shazeer, Kaiser.

Universal Transformers (ICLR 2019). Dehghani*, Gouws*, Vinyals, Uszkoreit, Kaiser.

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (2019). Dai, Yang, Yang, Carbonell, Le, Salakhutdinov.

A Time-Restricted Self-Attention Layer for ASR (ICASSP 2018). Povey, Hadian, Ghahremani, Li, Khudanpur.

Character-Level Language Modeling with Deeper Self-Attention (2018). Al-Rfou*, Choe*, Guo*, Constant*, Jones*.


Ongoing and Future Work


Self-supervision and classification for images and video

Understanding Transfer

Ongoing


Multitask learning

Long-range attention

Future