
Page 1: [source: people.csail.mit.edu/.../papers/kim_etal_AAAI16_slides.pdf · 2015. 12. 11.]

Character-Aware Neural Language Models

Yoon Kim Yacine Jernite David Sontag Alexander Rush

Harvard SEAS New York University

To appear in AAAI 2016. Code: https://github.com/yoonkim/lstm-char-cnn

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 1 / 79

Page 2:

Language Model

Language Model (LM): probability distribution over a sequence of words.

p(w1, ..., wT) for any sequence of length T from a vocabulary V (with wi ∈ V for all i).

Important for many downstream applications:

machine translation

speech recognition

text generation

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 2 / 79

Page 3:

Count-based Language Models

By the chain rule, any distribution can be factorized as:

p(w1, ..., wT) = ∏_{t=1}^{T} p(wt | w1, ..., wt−1)

n-gram language models make a Markov assumption:

p(wt | w1, ..., wt−1) ≈ p(wt | wt−n, ..., wt−1)

Needs smoothing to deal with rare n-grams.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 3 / 79
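The count-based recipe above can be sketched in a few lines. This is an illustrative toy, not from the slides: the function names are ours, and add-one (Laplace) smoothing stands in for the more sophisticated smoothing (e.g., Kneser-Ney) that count-based LMs actually use.

```python
from collections import Counter

def train_bigram_lm(tokens):
    """Collect unigram and bigram counts from a token stream."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word, unigrams, bigrams, vocab_size):
    """p(word | prev) with add-one smoothing, so that unseen
    bigrams still receive nonzero probability."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

tokens = "the cat sat on the mat".split()
unigrams, bigrams = train_bigram_lm(tokens)
V = len(unigrams)  # number of word types
```

Even on this tiny corpus, the smoothing matters: the unseen bigram ("cat", "mat") still gets probability 1/6 instead of 0.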

Page 4:

Neural Language Models

Neural Language Models (NLM)

Represent words as dense vectors in R^n (word embeddings).

wt ∈ R^|V| : One-hot representation of the word ∈ V at time t

⇒ xt = Xwt : Word embedding (X ∈ R^(n×|V|), n < |V|)

Train a neural net that composes history to predict next word.

p(wt = j | w1, ..., wt−1) = exp(pj · g(x1, ..., xt−1) + qj) / Σ_{j′ ∈ V} exp(pj′ · g(x1, ..., xt−1) + qj′)

= softmax(Pg(x1, ..., xt−1) + q)

pj ∈ R^m, qj ∈ R : Output word embedding/bias for word j ∈ V

g : Composition function

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 4 / 79
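The output layer above can be sketched directly. This is a toy: the slides leave the composition function g abstract (a real NLM uses a feed-forward or recurrent network), so averaging the history embeddings here is a placeholder of ours, and all sizes and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, n = 10, 8                 # toy vocabulary and embedding sizes
X = rng.normal(size=(n, V))  # input embeddings (one column per word)
P = rng.normal(size=(V, n))  # output embeddings (one row p_j per word)
q = np.zeros(V)              # output biases q_j

def g(history_ids):
    """Placeholder composition: average the history embeddings."""
    return X[:, history_ids].mean(axis=1)

def next_word_dist(history_ids):
    """softmax(P g(x_1, ..., x_{t-1}) + q) over the vocabulary."""
    logits = P @ g(history_ids) + q
    e = np.exp(logits - logits.max())  # subtract max for stability
    return e / e.sum()

dist = next_word_dist([3, 1, 4])
```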


Page 6:

Feed-forward NLM (Bengio, Ducharme, and Vincent 2003)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 5 / 79


Page 10:

Recurrent Neural Network LM (Mikolov et al. 2011)

Maintain a hidden state vector ht that is recursively calculated.

ht = f (Wxt + Uht−1 + b)

ht ∈ R^m : Hidden state at time t (summary of history)

W ∈ R^(m×n) : Input-to-hidden transformation

U ∈ R^(m×m) : Hidden-to-hidden transformation

f (·) : Non-linearity

Apply softmax to ht .

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 9 / 79
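The recurrence above is compact enough to sketch directly. Sizes and weights here are arbitrary toy values of ours; tanh is used for f as one common choice.

```python
import numpy as np

n, m = 8, 16  # toy embedding size and hidden size
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(m, n))  # input-to-hidden
U = rng.normal(scale=0.1, size=(m, m))  # hidden-to-hidden
b = np.zeros(m)

def rnn_step(x_t, h_prev):
    """h_t = f(W x_t + U h_{t-1} + b), with f = tanh."""
    return np.tanh(W @ x_t + U @ h_prev + b)

h = np.zeros(m)                       # h_0
for x_t in rng.normal(size=(5, n)):   # five word embeddings
    h = rnn_step(x_t, h)              # h now summarizes the history
```

A softmax layer over h, as on the previous slide, then yields the next-word distribution.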



Page 17:

Word Embeddings (Collobert et al. 2011; Mikolov et al. 2012)

Key ingredient in Neural Language Models.

After training, similar words are close in the vector space.

(Not unique to NLMs)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 14 / 79

Page 18:

NLM Performance (on Penn Treebank)

Difficult/expensive to train, but performs well.

Language Model | Perplexity

5-gram count-based (Mikolov and Zweig 2012) | 141.2
RNN (Mikolov and Zweig 2012) | 124.7
Deep RNN (Pascanu et al. 2013) | 107.5
LSTM (Zaremba, Sutskever, and Vinyals 2014) | 78.4

Renewed interest in language modeling.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 15 / 79
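Perplexity, the metric in the table, is the exponential of the average per-token negative log-likelihood; a uniform model over k equally likely words has perplexity k. A quick sketch:

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood per token;
    token_probs holds the probability the model assigned to each
    token of the evaluation set."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

So a model scoring every token at probability 1/141.2 would land exactly at the 5-gram baseline's 141.2.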

Page 19:

NLM Issue

Issue: The fundamental unit of information is still the word

Separate embeddings for “trading”, “leading”, “training”, etc.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 16 / 79

Page 20:

NLM Issue

Issue: The fundamental unit of information is still the word

Separate embeddings for “trading”, “trade”, “trades”, etc.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 17 / 79

Page 21:

NLM Issue

No parameter sharing across orthographically similar words (e.g., spelled similarly).

Orthography contains much semantic/syntactic information.

How can we leverage subword information for language modeling?

Can we exploit this to perform better language modeling with rare words?

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 18 / 79

Page 22:

Previous (NLM-based) Work

Use morphological segmenter as a preprocessing step

unfortunately ⇒ un_PRE · fortunate_STM · ly_SUF

Luong, Socher, and Manning 2013: Recursive Neural Network over morpheme embeddings

Botha and Blunsom 2014: Sum over word/morpheme embeddings

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 19 / 79

Page 23:

This Work

Main Idea: No explicit morphology, use characters directly.

Convolutional Neural Networks (CNN) (LeCun et al. 1989)

Central to deep learning systems in vision.

Shown to be effective for NLP tasks (Collobert et al. 2011).

CNNs in NLP typically involve temporal (rather than spatial) convolutions over words.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 20 / 79


Page 25:

Network Architecture: Overview

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 21 / 79

Page 26:

Character-level CNN (CharCNN)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 22 / 79

Page 27:

Character-level CNN (CharCNN)

C ∈ R^(d×l) : Matrix representation of word (of length l)

H ∈ R^(d×w) : Convolutional filter matrix

d : Dimensionality of character embeddings (e.g., 15)

w : Width of convolution filter (e.g., 1, 2, 3, 4, 5)

1. Apply a convolution between C and H to obtain a vector f ∈ R^(l−w+1)

f[i] = ⟨C[∗, i : i + w − 1], H⟩

where ⟨A, B⟩ = Tr(AB^T) is the Frobenius inner product.

2. Take the max-over-time (with bias and nonlinearity)

y = tanh(max_i f[i] + b)

as the feature corresponding to the filter H (for a particular word).

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 23 / 79
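Steps 1 and 2 can be sketched for a single filter. Dimensions and values here are toy placeholders of ours; the Frobenius inner product is just the elementwise product summed over each width-w window.

```python
import numpy as np

d, l, w = 15, 9, 3  # char-embedding dim, word length, filter width
rng = np.random.default_rng(2)
C = rng.normal(size=(d, l))  # character matrix for one word
H = rng.normal(size=(d, w))  # one convolutional filter
b = 0.1

# Step 1: narrow convolution. Frobenius inner product of H with
# every width-w window of C, giving f of length l - w + 1.
f = np.array([np.sum(C[:, i:i + w] * H) for i in range(l - w + 1)])

# Step 2: max-over-time pooling, then bias and nonlinearity.
y = np.tanh(f.max() + b)
```

Each filter thus contributes one scalar feature y per word; stacking many filters of several widths gives the word's CharCNN representation.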


Page 30:

Character-level CNN (CharCNN)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 24 / 79

Page 31:

Character-level CNN (CharCNN)

C ∈ R^(d×l) : Representation of “absurdity”

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 25 / 79

Page 32:

Character-level CNN (CharCNN)

H ∈ R^(d×w) : Convolutional filter matrix of width w = 3

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 26 / 79

Page 33:

Character-level CNN (CharCNN)

f[1] = 〈C[∗, 1 : 3],H〉

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 27 / 79


Page 35:

Character-level CNN (CharCNN)

f[2] = 〈C[∗, 2 : 4],H〉

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 29 / 79

Page 36:

Character-level CNN (CharCNN)

f[T − 2] = 〈C[∗,T − 2 : T ],H〉

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 30 / 79

Page 37:

Character-level CNN (CharCNN)

y [1] = maxi{f[i ]}

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 31 / 79

Page 38:

Character-level CNN (CharCNN)

Each filter picks out a character n-gram

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 32 / 79

Page 39:

Character-level CNN (CharCNN)

f ′[1] = 〈C[∗, 1 : 2],H′〉

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 33 / 79

Page 40:

Character-level CNN (CharCNN)

y [2] = maxi{f ′[i ]}

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 34 / 79

Page 41:

Character-level CNN (CharCNN)

Many filter matrices (25–200) per width (1–7)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 35 / 79

Page 42:

Character-level CNN (CharCNN)

Add bias, apply nonlinearity

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 36 / 79


Page 43:

Character-level CNN (CharCNN)

For roughly the same number of parameters (20 million),

Before: word embedding as input (PTB perplexity: 85.4)

Now: output from CharCNN as input (PTB perplexity: 84.6)

CharCNN is slower, but convolution operations on GPUs are highly optimized.

Can we model more complex interactions between the character n-grams picked up by the filters?

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 37 / 79


Page 45:

Highway Network

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 38 / 79

Page 46:

Highway Network

y : output from CharCNN

Multilayer Perceptron: z = g(Wy + b)

Highway Network (Srivastava, Greff, and Schmidhuber 2015)

z = t ⊙ g(W_H y + b_H) + (1 − t) ⊙ y

W_H, b_H : Affine transformation

t = σ(W_T y + b_T) : transform gate

1 − t : carry gate

Hierarchical, adaptive composition of character n-grams.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 39 / 79
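One highway layer under these equations can be sketched as follows. This is a toy of ours: g is taken to be ReLU (one common choice), the weights are random placeholders, and the negative gate bias is a conventional initialization that makes the layer start near the identity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway(y, W_H, b_H, W_T, b_T):
    """z = t * g(W_H y + b_H) + (1 - t) * y, elementwise products;
    g is ReLU here."""
    t = sigmoid(W_T @ y + b_T)           # transform gate
    h = np.maximum(0.0, W_H @ y + b_H)   # candidate transformation
    return t * h + (1.0 - t) * y         # (1 - t) is the carry gate

m = 6
rng = np.random.default_rng(3)
y = rng.normal(size=m)
W_H, W_T = rng.normal(size=(m, m)), rng.normal(size=(m, m))
z = highway(y, W_H, np.zeros(m), W_T, -1.0 * np.ones(m))
```

With the transform gate near zero the layer simply carries y through unchanged, which is what lets it adaptively pass some character n-gram features along while recombining others.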


Page 48:

Highway Network

Input from CharCNN

Input to LSTM

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 40 / 79

Page 49:

Highway Network

Model | Perplexity

Word Model | 85.4
No Highway Layers | 84.6
One MLP Layer | 92.6
One Highway Layer | 79.7
Two Highway Layers | 78.9

No more gains with 3+ layers.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 41 / 79

Page 50:

Results: English Penn Treebank

Model | PPL | Size

KN-5 (Mikolov et al. 2012) | 141.2 | 2 m
RNN (Mikolov et al. 2012) | 124.7 | 6 m
Deep RNN (Pascanu et al. 2013) | 107.5 | 6 m
Sum-Prod Net (Cheng et al. 2014) | 100.0 | 5 m
LSTM-Medium (Zaremba, Sutskever, and Vinyals 2014) | 82.7 | 20 m
LSTM-Huge (Zaremba, Sutskever, and Vinyals 2014) | 78.4 | 52 m

LSTM-Word-Small | 97.6 | 5 m
LSTM-Char-Small | 92.3 | 5 m
LSTM-Word-Large | 85.4 | 20 m
LSTM-Char-Large | 78.9 | 19 m

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 42 / 79

Page 51:

What about morphologically rich languages?

              Data-S                  Data-L
              |V|    |C|    T         |V|    |C|    T
English (En)  10 k    51    1 m       60 k   197   20 m
Czech (Cs)    46 k   101    1 m      206 k   195   17 m
German (De)   37 k    74    1 m      339 k   260   51 m
Spanish (Es)  27 k    72    1 m      152 k   222   56 m
French (Fr)   25 k    76    1 m      137 k   225   57 m
Russian (Ru)  62 k    62    1 m      497 k   111   25 m

|V| = word vocabulary size; |C| = character vocabulary size; T = number of tokens in the training set.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 43 / 79

Page 52:

What about morphologically rich languages?

              Data-S                  Data-L
              |V|    |C|    T         |V|    |C|    T
English (En)  10 k    51    1 m       60 k   197   20 m
Czech (Cs)    46 k   101    1 m      206 k   195   17 m
German (De)   37 k    74    1 m      339 k   260   51 m
Spanish (Es)  27 k    72    1 m      152 k   222   56 m
French (Fr)   25 k    76    1 m      137 k   225   57 m
Russian (Ru)  62 k    62    1 m      497 k   111   25 m

|V| varies quite a bit by language.

(We effectively use the full vocabulary.)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 44 / 79

Page 53:

Baselines

Kneser-Ney LM: Count-based baseline

Word LSTM: Word embeddings as input

Morpheme LBL (Botha and Blunsom 2014)

Input for word k is

x_k + Σ_{j ∈ M_k} m_j

where x_k is the word embedding and the m_j are the embeddings of the word's morphemes M_k.

Morpheme LSTM: Same input as above, but with LSTM architecture

Morphemes obtained by running the unsupervised morphological tagger Morfessor Cat-MAP (Creutz and Lagus 2007).

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 45 / 79

Page 54:

Perplexity on Data-S (1 M Tokens)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 46 / 79


Page 59:

Perplexity on Data-L (17-57 M Tokens)

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 51 / 79

Page 60:

How does performance vary with corpus/vocab size?

Experiment on German large dataset:

Take the most frequent K words as the vocabulary and replace the rest with <unk>.

Compare % perplexity reduction going from word to character LSTM.

                   Vocabulary Size
                  10 k   25 k   50 k   100 k
Training   1 m     17     16     21      –
Size       5 m      8     14     16     21
          10 m      9      9     12     15
          25 m      9      8      9     10

Character model outperforms word model in all scenarios.

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 52 / 79
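The vocabulary truncation described above (keep the K most frequent word types, map everything else to <unk>) is a short preprocessing step; a sketch with illustrative names of ours:

```python
from collections import Counter

def truncate_vocab(tokens, k):
    """Keep the k most frequent word types; map the rest to <unk>."""
    keep = {w for w, _ in Counter(tokens).most_common(k)}
    return [w if w in keep else "<unk>" for w in tokens]

tokens = "a a a b b c d".split()
clipped = truncate_vocab(tokens, 2)
```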


Page 62:

Learned Word Representations

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 53 / 79


Page 65:

Learned Word Representations (In Vocab)

(Based on cosine similarity)

In Vocabulary:

                    while | his | you | richard | trading

Word                although | your | conservatives | jonathan | advertised
Embedding           letting | her | we | robert | advertising
                    though | my | guys | neil | turnover
                    minute | their | i | nancy | turnover

Characters          chile | this | your | hard | heading
(before highway)    whole | hhs | young | rich | training
                    meanwhile | is | four | richer | reading
                    white | has | youth | richter | leading

Characters          meanwhile | hhs | we | eduard | trade
(after highway)     whole | this | your | gerard | training
                    though | their | doug | edward | traded
                    nevertheless | your | i | carl | trader

Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 56 / 79

Page 66: Harvard SEAS New York Universitypeople.csail.mit.edu/.../papers/kim_etal_AAAI16_slides.pdf · 2015. 12. 11. · Harvard SEAS New York University To appear in AAAI 2016 Code: https:

Learned Word Representations (In Vocab)

(Based on cosine similarity)

In Vocabularywhile his you richard trading

although your conservatives jonathan advertisedWord letting her we robert advertising

Embedding though my guys neil turnoverminute their i nancy turnover

chile this your hard headingCharacters whole hhs young rich training

(before highway) meanwhile is four richer readingwhite has youth richter leading

meanwhile hhs we eduard tradeCharacters whole this your gerard training

(after highway) though their doug edward tradednevertheless your i carl trader

Learned Word Representations (OOV)

(Based on cosine similarity)

                    computer-aided   misinformed   looooook

Characters          computer-guided  informed      look
(before highway)    computerized     performed     cook
                    disk-drive       transformed   looks
                    computer         inform        shook

Characters          computer-guided  informed      look
(after highway)     computer-driven  performed     looks
                    computerized     outperformed  looked
                    computer         transformed   looking

What is the highway network doing?

Q: Might we simply be learning to carry some dimensions and to combine others? Is the transform gate truly a function of the input word?

[Figure: the highway network between its input from the CharCNN and its output, the input to the LSTM]


What is the highway network doing?

A: No. For all dimensions, on some words σ(·) ≈ 0, and for others ≈ 1.
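A single highway layer can be sketched directly from this observation: the transform gate t is computed from the input itself, so each dimension can be transformed for some words and carried through for others. A minimal NumPy sketch (ReLU transform nonlinearity; weights and sizes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway(y, W_H, b_H, W_T, b_T):
    """One highway layer:
        t = sigmoid(W_T @ y + b_T)                    # transform gate
        z = t * relu(W_H @ y + b_H) + (1 - t) * y     # transform vs. carry
    Because t depends on y, the gating is a function of the input word."""
    t = sigmoid(W_T @ y + b_T)
    return t * np.maximum(0.0, W_H @ y + b_H) + (1.0 - t) * y

d = 5
rng = np.random.default_rng(1)
y = rng.normal(size=d)                  # e.g. the CharCNN output for one word
W_H = rng.normal(size=(d, d))
W_T = rng.normal(size=(d, d))
b_H = np.zeros(d)
b_T = -2.0 * np.ones(d)                 # negative bias favors carrying y through
z = highway(y, W_H, b_H, W_T, b_T)
```

Driving b_T strongly negative forces t toward 0 and the layer toward the identity, which is the "carry" end of the behavior described above.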


What is the convolutional layer doing?

Q: Does each filter truly pick out a character n-gram?

[Figure: a single filter sliding over the concatenation of character embeddings for "absurdity", followed by max-over-time pooling to produce the single filter output]
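The single-filter computation in the figure can be sketched directly: one width-w filter is matched against every window of the character embedding matrix, and the maximum tanh response over time becomes that filter's output. A toy sketch with random embeddings (dimensions and names are illustrative):

```python
import numpy as np

def filter_feature(C, H, b=0.0):
    """Apply one width-w convolution filter H (d x w) to the character
    embedding matrix C (d x l), then max-pool over time:
        f[i] = tanh(<C[:, i:i+w], H> + b),   y = max_i f[i]."""
    d, w = H.shape
    l = C.shape[1]
    f = [np.tanh(np.sum(C[:, i:i + w] * H) + b) for i in range(l - w + 1)]
    return max(f)

word = "absurdity"
d, w = 4, 3                                   # toy embedding size / filter width
chars = sorted(set(word))
rng = np.random.default_rng(2)
Q = rng.normal(size=(d, len(chars)))          # character embedding table
C = Q[:, [chars.index(c) for c in word]]      # concatenated char embeddings
H = rng.normal(size=(d, w))                   # one convolution filter
y = filter_feature(C, H)                      # the single filter output
```

A full CharCNN repeats this for many filters of several widths and concatenates the pooled scalars into the word's feature vector.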


What is the convolutional layer doing?

For each length-6 filter, the 100 substrings with highest filter response.


Conclusion

A character-aware language model that relies only on character-level inputs: CharCNN + Highway network + LSTM.

Outperforms strong word/morpheme LSTM baselines.

Much recent work on character inputs:

Santos and Zadrozny 2014: CNN over characters concatenated with word embeddings into a CRF.
Zhang and LeCun 2015: deep CNN over characters for document classification.
Ballesteros, Dyer, and Smith 2015: LSTM over characters for parsing.
Ling et al. 2015: LSTM over characters into another LSTM for language modeling/POS-tagging.

More broadly, suggests new ways to think about representation learning.


Appendix: Character N-gram Representations

Prefixes, Suffixes, Hyphenated, Others

Prefixes: character n-grams that start with the 'start-of-word' character {, such as {un, {mis. Suffixes are defined similarly with the end-of-word character.
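A minimal sketch of boundary-padded character n-grams ({ as the start-of-word marker, as above; using } as the end-of-word marker is an assumption here):

```python
def char_ngrams(word, n, sow="{", eow="}"):
    """All character n-grams of a word padded with start-of-word and
    end-of-word markers, so prefixes and suffixes are distinguishable
    from word-internal n-grams."""
    padded = sow + word + eow
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

grams = char_ngrams("unkind", 3)
prefixes = [g for g in grams if g.startswith("{")]   # n-grams at word start
suffixes = [g for g in grams if g.endswith("}")]     # n-grams at word end
```

For "unkind" with n = 3 this yields {un, unk, nki, kin, ind, nd}, of which {un is the prefix and nd} the suffix.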


Appendix: Hyperparameters

            Small                Large

CNN     d   15                   15
        w   [1, 2, 3, 4, 5, 6]   [1, 2, 3, 4, 5, 6, 7]
        h   [25 · w]             [min{200, 50 · w}]
        f   tanh                 tanh

HW-Net  l   1                    2
        g   ReLU                 ReLU

LSTM    l   2                    2
        m   300                  650
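Since the filter counts h are simple functions of the width w, the total CharCNN output dimension (the feature vector fed to the highway layers) follows directly:

```python
# Filter counts per width, per the table above, and the resulting
# total CharCNN output dimension for each configuration.
small_w = [1, 2, 3, 4, 5, 6]
large_w = [1, 2, 3, 4, 5, 6, 7]

small_h = [25 * w for w in small_w]              # h = [25 * w]
large_h = [min(200, 50 * w) for w in large_w]    # h = [min{200, 50 * w}]

small_total = sum(small_h)   # 525-dimensional word feature vector
large_total = sum(large_h)   # 1100-dimensional word feature vector
```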


Appendix: Results on Data-S

               Cs    De    Es    Fr    Ru

B&B    KN-4   545   366   241   274   396
       MLBL   465   296   200   225   304

Small  Word   503   305   212   229   352
       Morph  414   278   197   216   290
       Char   401   260   182   189   278

Large  Word   493   286   200   222   357
       Morph  398   263   177   196   271
       Char   371   239   165   184   261


Appendix: Results on Data-L

               Cs    De    Es    Fr    Ru    En

B&B    KN-4   862   463   219   243   390   291
       MLBL   643   404   203   227   300   273

Small  Word   701   347   186   202   353   236
       Morph  615   331   189   209   331   233
       Char   578   305   169   190   313   216


Appendix: Effect of Highway Layers (PTB)

                        Small   Large

No Highway Layers       100.3   84.6
One Highway Layer        92.3   79.7
Two Highway Layers       90.1   78.9
Multilayer Perceptron   111.2   92.6

No further gains beyond two highway layers (this may be language dependent).


References

Bengio, Yoshua, Réjean Ducharme, and Pascal Vincent (2003). “A Neural Probabilistic Language Model”. In: Journal of Machine Learning Research 3, pp. 1137–1155.

Mikolov, Tomas et al. (2011). “Empirical Evaluation and Combination of Advanced Language Modeling Techniques”. In: Proceedings of INTERSPEECH.

Collobert, Ronan et al. (2011). “Natural Language Processing (almost) from Scratch”. In: Journal of Machine Learning Research 12, pp. 2493–2537.

Mikolov, Tomas et al. (2012). “Subword Language Modeling with Neural Networks”. In: preprint: www.fit.vutbr.cz/imikolov/rnnlm/char.pdf.

Mikolov, Tomas and Geoffrey Zweig (2012). “Context Dependent Recurrent Neural Network Language Model”. In: Proceedings of SLT.

Pascanu, Razvan et al. (2013). “How to Construct Deep Recurrent Neural Networks”. In: arXiv:1312.6026.


Zaremba, Wojciech, Ilya Sutskever, and Oriol Vinyals (2014). “Recurrent Neural Network Regularization”. In: arXiv:1409.2329.

Luong, Minh-Thang, Richard Socher, and Chris Manning (2013). “Better Word Representations with Recursive Neural Networks for Morphology”. In: Proceedings of CoNLL.

Botha, Jan and Phil Blunsom (2014). “Compositional Morphology for Word Representations and Language Modelling”. In: Proceedings of ICML.

LeCun, Yann et al. (1989). “Handwritten Digit Recognition with a Backpropagation Network”. In: Proceedings of NIPS.

Srivastava, Rupesh Kumar, Klaus Greff, and Jürgen Schmidhuber (2015). “Training Very Deep Networks”. In: arXiv:1507.06228.

Cheng, Wei-Chen et al. (2014). “Language Modeling with Sum-Product Networks”. In: Proceedings of INTERSPEECH.


Creutz, Mathias and Krista Lagus (2007). “Unsupervised Models for Morpheme Segmentation and Morphology Learning”. In: ACM Transactions on Speech and Language Processing.

Santos, Cicero Nogueira dos and Bianca Zadrozny (2014). “Learning Character-level Representations for Part-of-Speech Tagging”. In: Proceedings of ICML.

Zhang, Xiang and Yann LeCun (2015). “Text Understanding From Scratch”. In: arXiv:1502.01710.

Ballesteros, Miguel, Chris Dyer, and Noah A. Smith (2015). “Improved Transition-Based Parsing by Modeling Characters instead of Words with LSTMs”. In: Proceedings of EMNLP.

Ling, Wang et al. (2015). “Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation”. In: Proceedings of EMNLP.


Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long Short-Term Memory”. In: Neural Computation 9, pp. 1735–1780.
