carol v. alexandru, sebastiano panichella, harald c. gall … · 2017-05-21 · neural machine...
Carol V. Alexandru, Sebastiano Panichella, Harald C. Gall
{alexandru,panichella,gall}@ifi.uzh.ch
23. May 2017
Even "simple" problems need complex solutions
Unclear how exactly humans solve this problem

public int sum(int[] numbers) {
    int s = 0;
    for (int n : numbers) {
        s = s - n;
    }
    return s;
}
Where to begin?
print("Hello World")
Can we teach a machine to "read" code?
Replicating a Parser
.java → read → lex/tokenize → construct CST/AST

Source text as read:
import java.util.Scanner;
import java.io.File;
import java.io.IOException;
public class Person {
public int getAge() {

After lex/tokenize:
import java . util . Scanner ;
import java . io . File ;
import java . io . IOException ;
public class Person {
public int getAge ( ) {

? ? ?
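The lex/tokenize step of a conventional parser can be sketched in a few lines. The pattern below is a deliberately simplified assumption (identifiers, integer and string literals, single-character separators), not the grammar any real Java lexer uses, and the names `TOKEN_PATTERN` and `tokenize` are made up for this illustration.

```python
import re

# Simplified token pattern (an assumption, far from the full Java grammar):
# identifiers/keywords, integer literals, string literals, and finally any
# single non-space character as an operator or separator.
TOKEN_PATTERN = re.compile(r'[A-Za-z_$][A-Za-z0-9_$]*|\d+|"(?:\\.|[^"\\])*"|\S')


def tokenize(source: str) -> list[str]:
    """Split source text into a flat list of token strings."""
    return TOKEN_PATTERN.findall(source)


print(" ".join(tokenize("import java.util.Scanner;")))
print(" ".join(tokenize("public int getAge() {")))
```

Joined with spaces, the result has the same shape as the tokenized lines on the slide. A hand-written lexer encodes these rules explicitly; the question the talk raises is whether a network can learn them from examples instead.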
Neural Machine Translation
Source sequences: "Space: the final frontier"
Target sequences: "Espace: frontière de l'infini"
Model: RNN (LSTM/GRU)

tokenize:
Space : the final frontier
Espace : frontière de l' infini

vectorize and build vocabulary:
808 41 5 241 1020 701 624 12 9 -174

Vocabulary sorted by word frequency
Vocabulary has maximum size; uncommon words may not be included and will be represented as a special "unknown word"
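A frequency-sorted vocabulary with a maximum size and an "unknown word" fallback can be sketched as follows. This is a minimal illustration, not the vocabulary builder of any particular NMT toolkit; the `<unk>` spelling, the choice of id 0 for it, and the function names are assumptions.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical spelling of the special "unknown word"


def build_vocab(sentences, max_size):
    """Map the max_size most frequent tokens to integer ids; id 0 is reserved for UNK."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {UNK: 0}
    for tok, _count in counts.most_common(max_size):
        vocab[tok] = len(vocab)
    return vocab


def vectorize(tokens, vocab):
    """Replace each token by its id, falling back to the UNK id for unseen or dropped tokens."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


corpus = [
    ["Space", ":", "the", "final", "frontier"],
    ["the", "final", "countdown"],
]
vocab = build_vocab(corpus, max_size=4)
print(vocab)
print(vectorize(["the", "final", "voyage"], vocab))  # "voyage" maps to the UNK id
```

Because the vocabulary is capped, rare tokens such as "countdown" or unseen ones such as "voyage" collapse onto the same id, which is the limitation the slide describes.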
Data Gathering and Preparation
clone 1000 repos (language:java sort:stars)
parse (ANTLR) and preprocess

Plain text source
p r i n t l n ( " H e l l o ▯ W o r l d ! " ) ;
1 char per word; space "words" are replaced with an unassigned Unicode char

Lexing instructions
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
0 = continue or start token
1 = end token
blank space = ignore character

Why not translate to actual tokens?!
→ Target vocabulary would not contain all possible tokens (although there are ways around that...)

Tokens
println ( "Hello▯World!" ) ;
1 token per word; spaces in words are replaced with an unassigned Unicode char

Node type & AST depth annotations
Expression│12 Expression│13 Literal│17 Expression│13 Statement│11
Only contains AST node types correlating to literal tokens

Creation of 2x2 datasets for the two translation steps

Data creation tool is open source - define your own extractions and translations and apply them easily to 1000s of repos:
https://bitbucket.org/sealuzh/parsenn
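The character-level source and the lexing instructions shown for the println example can be reproduced with a short sketch. It assumes the token list is already known (on the slides it comes from ANTLR), uses U+25AF as a stand-in for the unassigned Unicode character, and simply skips whitespace between tokens; the function names are invented for this illustration and are not taken from the parsenn tool.

```python
SPACE_MARK = "\u25af"  # stand-in glyph for the replaced space character (assumption)


def to_char_words(source: str) -> str:
    """One character per 'word'; literal spaces become SPACE_MARK."""
    return " ".join(SPACE_MARK if ch == " " else ch for ch in source)


def lexing_instructions(source: str, tokens: list[str]) -> str:
    """Emit 0 for a character that starts or continues a token and 1 for the
    character that ends it. Whitespace between tokens is skipped here, which
    is a simplification of this sketch."""
    out, pos = [], 0
    for tok in tokens:
        pos = source.index(tok, pos)  # find the token at or after the current position
        out.extend(["0"] * (len(tok) - 1) + ["1"])
        pos += len(tok)
    return " ".join(out)


src = 'println("Hello World!");'
toks = ["println", "(", '"Hello World!"', ")", ";"]
print(to_char_words(src))
print(lexing_instructions(src, toks))
```

Since the target side only ever needs the symbols 0 and 1 plus a separator, its vocabulary stays tiny no matter which identifiers or literals appear in the source.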
Results: Tokenization

Plain text source code
Vocab size: 2185
Train: 25M samples (1.7Gb)
Validation: 2M samples (140Mb)

i m p o r t ❘ a n d r o i d . g r a p h i c s . B i t m a p ;
i m p o r t ❘ c o m . f a c e b o o k . c o m m o n . r e f e r e n c e s . R e s o u r c e R e l e a s e r ;
p u b l i c ❘ c l a s s ❘ S i m p l e B i t m a p R e l e a s e r ❘ i m p l e m e n t s ❘ R e s o u r c e R e l e a s e r < B i t m a p >
p r i v a t e ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ s I n s t a n c e ;
p u b l i c ❘ s t a t i c ❘ S i m p l e B i t m a p R e l e a s e r ❘ g e t I n s t a n c e ( )
i f ❘ ( s I n s t a n c e ❘ = = ❘ n u l l ) ❘ {
s I n s t a n c e ❘ = ❘ n e w ❘ S i m p l e B i t m a p R e l e a s e r ( ) ;
}

Lexing instructions
Vocab size: 3
Train: 25M samples (1.7Gb)
Validation: 2M samples (140Mb)

0 0 0 0 0 1 ❘ 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1
0 0 0 0 0 1 ❘ 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 1 ❘ 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1
0 0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 1 1
0 0 0 0 0 1 ❘ 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 1 1 1
0 1 ❘ 1 0 0 0 0 0 0 0 0 1 ❘ 0 1 ❘ 0 0 0 1 1 ❘ 1
0 0 0 0 0 0 0 0 1 ❘ 1 ❘ 0 0 1 ❘ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
1

NMT
Bi-RNN
7 epochs
7 days
Perplexity: 1.11

What is perplexity?
In the context of NMT: Perplexity describes how "confused" a probability model is on a given test data set. A perfect model has perplexity 1.
Lower perplexity is better
Meaning of perplexity value depends on target vocab size
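To put a perplexity such as 1.11 in context, the sketch below uses the usual per-token definition: the exponential of the average negative log-probability the model assigns to the reference tokens. That the reported figure was computed exactly this way is an assumption; the probabilities in the example are made up.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that is always certain and always right.
print(perplexity([1.0, 1.0, 1.0]))

# A model guessing uniformly among 3 target symbols: perplexity is about 3.
print(perplexity([1 / 3] * 5))

# The same uniform guessing over 50000 symbols: perplexity is about 50000.
print(perplexity([1 / 50000] * 5))
```

Uniform guessing yields a perplexity equal to the number of symbols, which is why the same value means something different for a target vocabulary of 3 than for one of thousands.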
Failed translation example:
@Test(expected = NullPointerException.class)
10001100000001 1 000000000000000000000000011
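To see what a predicted instruction string means for the token stream, it can be applied back to the source. The sketch below assumes the two strings line up character by character and uses the compact notation of this example (no separators between digits, a blank where the source character is ignored); `apply_lexing_instructions` is a made-up helper name.

```python
def apply_lexing_instructions(source: str, instructions: str) -> list[str]:
    """Cut source into tokens: '0' extends the current token, '1' closes it,
    and a blank in the instructions skips the aligned source character."""
    tokens, current = [], ""
    for ch, op in zip(source, instructions):
        if op == " ":
            continue
        current += ch
        if op == "1":
            tokens.append(current)
            current = ""
    return tokens


src = "@Test(expected = NullPointerException.class)"
predicted = "10001100000001 1 000000000000000000000000011"
print(apply_lexing_instructions(src, predicted))
```

A correct lexing would end a token after `NullPointerException`, after `.`, and after `class`. The predicted string has no end marker inside that stretch, so the three tokens are fused into one.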
Results: Token Annotation

Tokens
Vocab size: 50000
Train: 25M samples (904Mb)
Validation: 2M samples (76Mb)

import android . graphics . Bitmap ;
import com . facebook . common . references . ResourceReleaser ;
public class SimpleBitmapReleaser implements ResourceReleaser < Bitmap > {
private static SimpleBitmapReleaser sInstance ;
public static SimpleBitmapReleaser getInstance ( ) {
if ( sInstance == null ) {
sInstance = new SimpleBitmapReleaser ( ) ;
}

Type/Depth Annotations
Vocab size: 87|4459
Train: 25M samples (3Gb)
Validation: 2M samples (251Mb)

ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName
ImportDeclaration│2 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName│3 QualifiedName
ClassOrInterfaceModifier│3 ClassDeclaration│3 ClassDeclaration│3 ClassDeclaration│3 ClassOrInterfaceType
ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 VariableDeclaratorId
ClassOrInterfaceModifier│7 ClassOrInterfaceModifier│7 ClassOrInterfaceType│9 MethodDeclaration
IfStatement│12 ParExpression│13 Primary│16 Expression│14 Literal│17 ParExpression│13
Block│14

NMT
RNN
11 epochs
2 days
Perplexity: 1.28

A successful example:
List<Throwable> errors = TestHelper.trackPluginErrors();
000110000000011 000001 1 0000000001100000000000000001111
[ClassOrInterfaceType|14] [TypeArguments|15] [ClassOrInterfaceType|18] [TypeArguments|15] [VariableDeclaratorId|15] [VariableDeclarator|14]
[Primary|19] [Expression|17] [Expression|17] [Expression|16] [Expression|16] [LocalVariableDeclarationStatement|11]
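How one type/depth annotation per token can be derived from a syntax tree is sketched below. The tree is a hand-written toy, not real ANTLR output, so its node types and depths will not match the values on the slides; labelling each token with its innermost enclosing node is an assumption about the scheme, and `annotate` is a made-up name.

```python
# A hand-written toy tree (NOT real ANTLR output): (node_type, children),
# where each child is either another node or a token string.
tree = ("Statement", [
    ("Expression", [
        ("Expression", [("Primary", ["println"])]),
        "(",
        ("Literal", ['"Hello World!"']),
        ")",
    ]),
    ";",
])


def annotate(node, depth=0):
    """Yield (token, 'NodeType│depth') for every token, labelled with the
    type and depth of the innermost node that contains it."""
    node_type, children = node
    for child in children:
        if isinstance(child, str):
            yield child, f"{node_type}\u2502{depth}"
        else:
            yield from annotate(child, depth + 1)


for token, label in annotate(tree):
    print(token, label)
```

The output has exactly one label per token, which keeps the source and target sequences the same length, as in the annotated examples.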
Take-home messages:
• NN can learn to "read" code (tokens / syntactic elements)
  → What else could we teach? Type resolution? Calls & attribute access? Inheritance?
  → Could we follow the "human path" of learning to program to teach an AI?
• "If only we had good data"
  → Bug reports, commit messages etc. are still unstructured. This needs to change if we want to leverage deep learning in SE and PC.
Data creation tool: t.uzh.ch/Hb
Paper: t.uzh.ch/Hc