
Improving Automatic Source Code Summarization via Deep Reinforcement Learning

Yao Wan, Zhou Zhao
College of Computer Science and Technology, Zhejiang University, Hangzhou, China
{wanyao,zhaozhou}@zju.edu.cn

Min Yang
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
[email protected]

Guandong Xu
Advanced Analytics Institute, University of Technology Sydney, Sydney, Australia
[email protected]

Haochao Ying, Jian Wu
College of Computer Science and Technology, Zhejiang University, Hangzhou, China
[email protected], [email protected]

Philip S. Yu
University of Illinois at Chicago, Illinois, USA
Institute for Data Science, Tsinghua University, Beijing, China
[email protected]

ABSTRACT
Code summarization provides a high-level natural language description of the function performed by code, which benefits software maintenance, code categorization and retrieval. To the best of our knowledge, most state-of-the-art approaches follow an encoder-decoder framework which encodes the code into a hidden space and then decodes it into natural language space, suffering from two major drawbacks: a) Their encoders only consider the sequential content of code, ignoring the tree structure which is also critical for the task of code summarization; b) Their decoders are typically trained to predict the next word by maximizing the likelihood of the next ground-truth word, with the previous ground-truth words given. At test time, however, the decoder is expected to generate the entire sequence from scratch. This discrepancy can cause an exposure bias issue, making the learnt decoder suboptimal. In this paper, we incorporate an abstract syntax tree structure as well as the sequential content of code snippets into a deep reinforcement learning framework (i.e., an actor-critic network). The actor network provides the confidence of predicting the next word according to the current state. The critic network, in turn, evaluates the reward value of all possible extensions of the current state and can provide global guidance for exploration. We employ an advantage reward composed of the BLEU metric to train both networks. Comprehensive experiments on a real-world dataset show the effectiveness of our proposed model when compared with some state-of-the-art methods.

CCS CONCEPTS
• Software and its engineering → Documentation; • Computing methodologies → Natural language generation;

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ASE '18, September 3–7, 2018, Montpellier, France
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5937-5/18/09...$15.00
https://doi.org/10.1145/3238147.3238206

KEYWORDS
Code summarization, comment generation, deep learning, reinforcement learning

ACM Reference Format:
Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S. Yu. 2018. Improving Automatic Source Code Summarization via Deep Reinforcement Learning. In Proceedings of the 2018 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE '18), September 3–7, 2018, Montpellier, France. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3238147.3238206

1 INTRODUCTION
In the life cycle of software development (e.g., implementation, testing and maintenance), nearly 90% of effort is spent on maintenance, and much of this effort goes into understanding the maintenance task and the related source code [22]. Thus, documentation that provides a high-level description of the task performed by code is a must for software maintenance. Even though various techniques have been developed to support programmers during the implementation and testing of software, documenting code with comments remains a labour-intensive task, so few real-world software projects adequately document their code to reduce future maintenance costs [9, 16]. It is nontrivial for a novice programmer to write good comments for source code. A good comment should have at least the following characteristics: a) Correctness. The comment should correctly clarify the intent of the code. b) Fluency. The comment should be fluent natural language that can be easily read and understood by maintainers. c) Consistency. The comment should follow a standard style/format for better code reading. Code summarization is a task that tries to comprehend code and automatically generate descriptions directly from the source code. The summarization of code can also be viewed as a form of document expansion. Successful code summarization can not only benefit the maintenance of source code [15, 30], but can also be used to improve the performance of code search with natural language queries [32, 51] and code categorization [31].
Motivation. Recent research has made inroads towards the automatic generation of natural language descriptions of software.


Figure 1: An illustration of the motivation of our paper. Traditional methods suffer from the following two limitations: a) when representing the code, the structural information of code is always ignored; b) traditional maximum likelihood based methods suffer from the exposure bias issue. Panel (a) shows an example abstract syntax tree (AST) for the code snippet "# add two numbers / def add(a, b): return a + b", with a FunctionDef root, a Return node over a BinOp, and Name leaves for the arguments a and b. Panel (b) contrasts the training process of maximum likelihood based text generation, where the ground truth is given, with the testing process, where the ground truth is missing and errors accumulate.

As far as we know, most existing code summarization methods learn the semantic representation of source code based on statistical language models [30, 33], and then generate comments based on templates or rules [42]. With the development of deep learning, some neural translation models [2, 13, 15] have also been introduced for code summarization; these mainly follow an encoder-decoder framework. They generally employ a recurrent neural network (RNN, e.g., LSTM [14]) to encode the code snippet and utilize another RNN to decode the hidden state into coherent sentences. These models are typically trained to maximize the likelihood of the next word on the assumption that the previous ground-truth words are given. Such models are limited in two aspects: a) The sequential and structural information of code is not fully utilized in feature representation, although it is critical for code understanding. For example, given two simple expressions "f=a+b" and "f=c+d", although they are quite different as lexical sequences, they share the same structure (e.g., abstract syntax tree). b) These models, also termed "teacher forcing", suffer from exposure bias, since at test time the ground truth is missing and previously generated words from the trained model distribution are used to predict the next word [37]. Figure 1(b) presents a simple illustration of the discrepancy between the training and testing processes in these classical encoder-decoder models. In the testing phase, this exposure bias makes errors accumulate and makes these models suboptimal, unable to generate words that are appropriate but have a low probability of being drawn in the training phase.
Contribution. In this paper, we aim to address these two issues. To effectively capture the structural (or syntactic) information of code snippets, we employ the abstract syntax tree (AST) [7], a data structure widely used in compilers, to represent the structure of program code. Figure 1a shows an example of a Python code snippet and its corresponding AST. The root node is a composite node of type FunctionDef, while the leaf nodes typed as Name are tokens of the code snippet. It is worth mentioning that the tokens from AST parsing may differ from those obtained by word segmentation; in our paper, we consider both. We parse the code snippets into ASTs, and then propose an AST-based LSTM model [46] to represent the structure of code. We also use another LSTM model [14] to represent the sequential information of code. Besides, we apply a hybrid attention layer to fuse the structural representation and sequential representation of code when predicting each word, considering the alignment between the predicted word and the source words.
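To make the syntactic view concrete, the following minimal sketch (illustrative only, not our training pipeline) parses the snippet from Figure 1a with Python's built-in ast module and prints each node type together with the types of its children:

```python
# Illustrative sketch: inspect the AST of the Figure 1a snippet with Python's ast module.
import ast

code = "def add(a, b):\n    return a + b\n"
tree = ast.parse(code)

# Walk the tree and print each node type with the types of its child nodes,
# e.g. Module -> ['FunctionDef'], FunctionDef -> ['arguments', 'Return'], ...
for node in ast.walk(tree):
    children = [type(child).__name__ for child in ast.iter_child_nodes(node)]
    print(type(node).__name__, "->", children)
```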

To overcome the exposure bias, we draw on the insights of deep reinforcement learning, which integrates exploration and exploitation into a single framework. Instead of learning a sequential recurrent model that greedily looks for the next correct word, we utilize an actor network and a critic network to jointly determine the next best word at each time step. The actor network, which provides the confidence of predicting the next word according to the current state, serves as local guidance. The critic network, which evaluates the reward value of all possible extensions of the current state, serves as global guidance. Our framework is able to include good words that would have a low probability of being drawn by the actor network alone. To learn these two networks more efficiently, we start by pretraining the actor network using standard supervised learning with a cross-entropy loss, and pretraining the critic network with a mean-squared loss. Then, we update the actor and critic networks according to an advantage reward composed of the BLEU metric via policy gradient. We summarize our main contributions as follows.

• We propose a more comprehensive representation method for source code, with one AST-based LSTM for the structure of source code and another LSTM for the sequential content of source code. Furthermore, a hybrid attention layer is applied to fuse these two representations.
• To the best of our knowledge, this is the first time an advanced deep reinforcement learning framework, namely an actor-critic network, is proposed to cope with the exposure bias issue existing in most traditional maximum likelihood-based code summarization frameworks.
• We validate our proposed model on a real-world dataset of 108,726 Python code snippets. Comprehensive experiments show the effectiveness of the proposed model when compared with some state-of-the-art methods.


Organization. The remainder of this paper is organized as follows. Section 2 provides background knowledge on the neural language model, the RNN encoder-decoder model and reinforcement learning for a better understanding of our proposed model, and formally defines the problem. Section 3 gives an overview of our proposed framework. Section 4 presents a hybrid embedding approach for code representation. Section 5 describes our proposed deep reinforcement learning framework (i.e., the actor-critic network). Section 6 describes the dataset used in our experiments and presents the experimental results and analysis. Section 7 discusses threats to validity and limitations of our model. Section 8 highlights work related to this paper. Finally, we conclude the paper in Section 9.

2 BACKGROUND
As stated before, the code summarization task can be seen as a text generation task given the source code. In this section, we first present background knowledge on text generation that will be used in this paper, including the language model, the attentional RNN encoder-decoder model, and reinforcement learning for better decoding. To start with, we introduce some basic notations and terminology. Let x = (x_1, x_2, ..., x_{|x|}) denote the token sequence of a source code snippet and y = (y_1, y_2, ..., y_{|y|}) denote a sequence of generated words, where |·| denotes the length of a sequence. Let T denote the maximum number of decoding steps in the encoder-decoder framework. We will often use the notation y_{m...l} to refer to subsequences of the form (y_m, ..., y_l). D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} is the training dataset, where N is the size of the training set.

2.1 Language Model
A language model computes the probability of occurrence of a number of words in a particular sequence. The probability of a sequence of T words {y_1, ..., y_T} is denoted as p(y_1, ..., y_T). Since the number of words coming before a word y_i varies depending on its location in the input document, p(y_1, ..., y_T) is usually conditioned on a window of n previous words rather than all previous words:

    p(y_{1:T}) = \prod_{i=1}^{T} p(y_i \mid y_{1:i-1}) \approx \prod_{i=1}^{T} p(y_i \mid y_{i-(n-1):i-1}).    (1)

This kind of n-gram approach suffers from apparent limitations [27, 39]. For example, the n-gram model probabilities cannot be derived directly from frequency counts, because models derived this way have severe problems when confronted with n-grams that have not been explicitly seen before.
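As a toy illustration of this sparsity problem (our example, not from the paper), a purely count-based bigram model gives zero probability to any bigram it has never seen:

```python
# Toy bigram model estimated from raw frequency counts.
from collections import Counter

corpus = "return the sum of a and b".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def p_bigram(prev_word, word):
    # p(word | prev_word) from maximum-likelihood counts; no smoothing.
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(p_bigram("the", "sum"))      # 1.0, this bigram was observed
print(p_bigram("the", "product"))  # 0.0, unseen bigrams get zero probability
```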

The neural language model is a language model based on neural networks. Unlike the n-gram model, which predicts a word based on a fixed number of predecessor words, a neural language model can predict a word from predecessor words at longer distances. Figure 2(a) shows the basic structure of an RNN. The network includes three layers: an input layer which maps each word to a vector, a recurrent hidden layer which recurrently computes and updates a hidden state after reading each word, and an output layer which estimates the probabilities of the following word given the current hidden state. The RNN reads the words in the sentence one by one and predicts the possible following word at each time step. At step t, it estimates the probability of the following word p(y_{t+1} | y_{1:t}) through the following steps.

Figure 2: RNN and Tree-RNN (adapted from [46]). Panel (a) shows the structure of a recurrent neural network (RNN) and panel (b) the structure of a tree-structured RNN (Tree-RNN); each consists of an input layer, a hidden layer and an output layer.

First, the current word y_t is mapped to a vector by the input layer e. Then, the hidden state h_t at time t is generated according to the previous hidden state h_{t-1} and the current input y_t:

    h_t = f(h_{t-1}, e(y_t)).    (2)

Here, two common options for f are the long short-term memory (LSTM) [14] and the gated recurrent unit (GRU) [21]. Finally, p(y_{t+1} | y_{1:t}) is predicted according to the current hidden state h_t:

    p(y_{t+1} \mid y_{1:t}) = g(h_t),    (3)

where g is a stochastic output layer (typically a softmax for discrete outputs) that generates output tokens.
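A minimal PyTorch sketch of one step of Eqs. (2)-(3) is shown below; the framework and the toy dimensions are our assumptions, since the paper does not prescribe an implementation:

```python
# One step of a neural language model: Eq. (2) updates the hidden state,
# Eq. (3) turns it into a distribution over the next word.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128
embedding = nn.Embedding(vocab_size, embed_dim)    # input layer e(.)
rnn_cell = nn.LSTMCell(embed_dim, hidden_dim)      # recurrent layer f(.)
output_layer = nn.Linear(hidden_dim, vocab_size)   # output layer g(.)

y_t = torch.tensor([42])                           # id of the current word (toy value)
h_prev = torch.zeros(1, hidden_dim)
c_prev = torch.zeros(1, hidden_dim)

h_t, c_t = rnn_cell(embedding(y_t), (h_prev, c_prev))   # Eq. (2)
p_next = torch.softmax(output_layer(h_t), dim=-1)       # Eq. (3)
print(p_next.shape)  # torch.Size([1, 1000]) -- a distribution over the vocabulary
```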

2.2 Attentional RNN Encoder-Decoder Model
The RNN encoder-decoder consists of two recurrent neural networks. The encoder transforms the code snippet x into a sequence of hidden states (h_1, h_2, ..., h_{|x|}) with one RNN, while the decoder uses another RNN to generate one word y_{t+1} at a time in the target space.

2.2.1 Encoder. As an RNN, the encoder has a hidden state, which is a fixed-length vector. At time step t, the encoder computes the hidden state h_t by:

    h_t = f(h_{t-1}, c_{t-1}, e(x_t)).    (4)

Here, f is the hidden layer, with two main options, i.e., LSTM and GRU. The last symbol of x should be an end-of-sequence (<eos>) symbol which notifies the encoder to stop and output the final hidden state h_T, which is used as a vector representation of x.

2.2.2 Decoder. The output of the decoder is the target sequence y = (y_1, ..., y_T). One input of the decoder is a <start> symbol denoting the beginning of the target sequence. At time step t, the decoder computes the hidden state h_t and the conditional distribution of the next symbol y_{t+1} by:

    p(y_{t+1} \mid y_t) = g(h_t, c_t),    (5)


where g is a stochastic output layer and c_t is the distinct context vector for y_t, computed by:

    c_t = \sum_{j=1}^{|x|} \alpha_{t,j} h_j,    (6)

where \alpha_{t,j} is the attention weight of y_t on h_j [4].

2.2.3 Training Goal. The encoder and decoder networks are jointly trained to maximize the following objective:

    \max_\theta L(\theta) = \max_\theta \mathbb{E}_{(x, y) \sim D} \log p(y \mid x; \theta),    (7)

where \theta is the set of model parameters. We can see that this classical encoder-decoder framework aims to maximize the likelihood of the ground-truth word conditioned on the previously generated words. As mentioned above, the maximum likelihood based encoder-decoder framework suffers from the exposure bias issue. Motivated by this, we introduce the reinforcement learning technique for better decoding.
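The sketch below illustrates what Eq. (7) amounts to in practice under teacher forcing: a token-level cross-entropy loss over the ground-truth next words (toy tensors, assuming PyTorch):

```python
# Teacher forcing: maximize log p(y|x) by minimizing cross entropy against
# the ground-truth next words at every decoding step.
import torch
import torch.nn.functional as F

vocab_size, T = 1000, 12
decoder_logits = torch.randn(T, vocab_size, requires_grad=True)  # stand-in for g(h_t, c_t)
ground_truth = torch.randint(0, vocab_size, (T,))                # y_1 ... y_T

mle_loss = F.cross_entropy(decoder_logits, ground_truth)  # negative log-likelihood of Eq. (7)
mle_loss.backward()
```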

2.3 Reinforcement Learning for Better Decoding

Reinforcement learning is an approach that interacts with the environment and learns the optimal policy from a reward signal. It tries to generate text from scratch without the ground truth in the testing phase. Under this approach, the text generation process can be viewed as a Markov Decision Process (MDP) {S, A, P, R, γ}. In the MDP setting, the state s_t at time step t consists of the source code snippet x and the words/actions predicted until t, i.e., y_0, y_1, ..., y_t. The action space is the dictionary Y that the words are drawn from, i.e., y_t ∈ Y. With this definition of the state, the state transition function P is s_{t+1} = {s_t, y_{t+1}}, where the action y_{t+1} becomes part of the next state s_{t+1} and the reward r_{t+1} is received. γ ∈ [0, 1] is the discount factor. The objective of the generation process is to find a policy that maximizes the expected reward of the generated sentence sampled from the model's policy:

    \max_\theta L(\theta) = \max_\theta \mathbb{E}_{x \sim D,\, y \sim P_\theta(\cdot \mid x)} [R(y, x)],    (8)

where \theta is the parameter of the policy to be learnt, D is the training set, y is the sequence of predicted actions/words, and R is the reward function. Our problem can be formulated as follows. Given a code snippet x = (x_1, x_2, ..., x_{|x|}), our goal is to find a policy that generates a sequence of words y = (y_1, y_2, ..., y_{|y|}) from the dictionary Y with the objective of maximizing the expected reward.
To learn the policy, many approaches have been proposed, which can mainly be categorized into two classes [44]: a) policy-based approaches (e.g., REINFORCE [50]), which optimize the policy directly via policy gradient; and b) value-based approaches (e.g., Q-learning [48]), which learn the Q-function and at each step select the action with the highest Q-value. It has been verified that policy-based methods may suffer from a variance issue and value-based methods may suffer from a bias issue [17]. Thus, in our paper we adopt the actor-critic learning method, a more advanced technique that combines the advantages of both policy- and value-based methods.

Figure 3: An overall workflow of getting a trained model. (a) Offline training: <code, comment> pairs from the code corpus form the training data fed into the deep reinforcement learning model; (b) Testing: given a code snippet, the trained actor generates the comment.

3 OVERVIEW OF PROPOSED FRAMEWORK
In this section, we first give a brief overview of the workflow for obtaining a trained model for code summarization, and then present an overview of the network architecture of our proposed deep reinforcement learning based model. Figure 3 shows the overall workflow of obtaining a trained model. It includes an offline training stage and an online summarization stage. In the training stage, we prepare a large-scale corpus of annotated <code, comment> pairs. The annotated pairs are then fed into our proposed deep reinforcement learning model for training. After training, we obtain a trained actor network. Then, given a code snippet, the corresponding comment can be generated by the trained actor network.

Figure 4 is an overview of the network architecture of our proposed deep reinforcement learning based model. The architecture of our model follows the actor-critic framework [19], which has been successfully adopted in decision-making scenarios such as AlphaGo [41]. We split the framework into four submodules. (a) Hybrid code representation (cf. Sec. 4). This module represents the source code in a hidden space, and is also called the encoder in the encoder-decoder framework. (b) Hybrid attention (cf. Sec. 5.1.1). When decoding the encoded hidden space into the comment space, the attention layer assigns different weights to the code snippet tokens for better generation. (c) Text generation (cf. Sec. 5.1.2). This module is an RNN-based generative network, which generates the next word based on previously generated words. (d) Critic (cf. Sec. 5.2). This module evaluates whether the generated word is good or not.

Since the tokens generated in (c) can also be seen as actions, we call the pipeline (a)-(b)-(c) the actor network. Compared with the architecture of the traditional encoder-decoder framework, our proposed model has an additional critic module used to evaluate the value of the action taken under the current state. The pipeline (a)-(b)-(c)-(d) is called the critic network. Note that the actor and critic networks share modules (a)-(b)-(c), which greatly reduces the number of learnable parameters. We elaborate on each component of this framework in the following sections.

4 HYBRID REPRESENTATION OF CODE
Different from previous methods that only utilize sequential tokens to represent code, we also consider the structural information of source code. In this section, we present a hybrid embedding approach for code representation. We apply an LSTM to represent the lexical level of code, and an AST-based LSTM to represent the syntactic level of code.


Figure 4: An overview of our proposed deep reinforcement learning framework for code summarization. Its submodules are (a) hybrid code representation, (b) hybrid attention, (c) text generation, and (d) the critic; the actor network consists of (a)-(b)-(c), while the critic network consists of (a)-(b)-(c)-(d).

4.1 Lexical Level
The key insight behind the lexical-level representation of source code is that comments are often drawn from the lexical elements of code, such as function names and variable names. It is natural to apply an RNN (e.g., LSTM) to represent the sequential information of source code; in our paper, the LSTM is adopted.

4.2 Syntactic Level
When executing a program, a compiler decomposes the program into constituents and produces intermediate code according to the syntax of the language. The AST is one type of intermediate code that represents the hierarchical syntactic structure of a program [1]. We represent the syntactic level of source code via AST embedding. Similar to a traditional LSTM unit, the proposed AST-based LSTM unit contains an input gate, a memory cell and an output gate. Different from a standard LSTM unit, which has only one forget gate for its previous unit, an AST-based LSTM unit contains multiple forget gates. Given an AST, for any node j, let the hidden state and memory cell of its l-th child be h_{jl} and c_{jl}, respectively. Following [46], the hidden state is updated as follows:

    i_j = \sigma\Big(W^{(i)} x_j + \sum_{l=1}^{N} U^{(i)}_l h_{jl} + b^{(i)}\Big),
    f_{jk} = \sigma\Big(W^{(f)} x_j + \sum_{l=1}^{N} U^{(f)}_{kl} h_{jl} + b^{(f)}\Big),
    o_j = \sigma\Big(W^{(o)} x_j + \sum_{l=1}^{N} U^{(o)}_l h_{jl} + b^{(o)}\Big),
    u_j = \tanh\Big(W^{(u)} x_j + \sum_{l=1}^{N} U^{(u)}_l h_{jl} + b^{(u)}\Big),
    c_j = i_j \odot u_j + \sum_{l=1}^{N} f_{jl} \odot c_{jl},
    h_j = o_j \odot \tanh(c_j),    (9)

where k = 1, 2, ..., N. Here i_j, f_{jk}, o_j and u_j denote the input gate, the forget gates, the output gate, and the state for updating the memory cell, respectively. W^{(·)} and U^{(·)} are weight matrices, b^{(·)} is a bias vector, and x_j is the word embedding of the j-th node. σ(·) is the logistic function, and the operator ⊙ denotes element-wise multiplication between vectors. It is worth mentioning that when the tree is simply a chain, namely N = 1, the AST-based LSTM unit reduces to the standard LSTM. Figure 2 shows the structures of the RNN and the Tree-RNN.
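For concreteness, the following NumPy sketch implements one node update of Eq. (9); the weight shapes and names are illustrative assumptions, and in our setting N = 2 after the binarization described next:

```python
# One AST-based LSTM node update (Eq. 9), written for clarity rather than speed.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ast_lstm_node(x_j, child_h, child_c, W, U, b):
    """x_j: embedding of node j, shape (d_in,);
    child_h / child_c: lists of the N children's hidden states / memory cells, shape (d,);
    W[g]: (d, d_in) input weights, b[g]: (d,) bias, for gate g in {'i', 'f', 'o', 'u'};
    U[g][l]: (d, d) weight of child l for gate g; for the forget gates,
    U['f'][k][l] is the weight of child l inside forget gate f_{jk}."""
    n = len(child_h)
    i_j = sigmoid(W['i'] @ x_j + sum(U['i'][l] @ child_h[l] for l in range(n)) + b['i'])
    o_j = sigmoid(W['o'] @ x_j + sum(U['o'][l] @ child_h[l] for l in range(n)) + b['o'])
    u_j = np.tanh(W['u'] @ x_j + sum(U['u'][l] @ child_h[l] for l in range(n)) + b['u'])
    f_j = [sigmoid(W['f'] @ x_j + sum(U['f'][k][l] @ child_h[l] for l in range(n)) + b['f'])
           for k in range(n)]                      # one forget gate per child
    c_j = i_j * u_j + sum(f_j[k] * child_c[k] for k in range(n))
    h_j = o_j * np.tanh(c_j)
    return h_j, c_j
```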

Notice that the number of children N varies across nodes and across ASTs, which can cause problems for parameter sharing. For simplicity, we transform the generated ASTs into binary trees by the following two steps, which have been adopted in [49] (a sketch is given below): a) Split nodes with more than 2 children: generate a new right child, keep the old leftmost child, and move all children except the leftmost under this new node; repeat this operation top-down until only nodes with 0, 1 or 2 children are left. b) Combine nodes that have exactly 1 child with their child.
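A small recursive sketch of this binarization on a toy tree, represented as nested (label, children) pairs, is given below; the auxiliary node label is our own placeholder:

```python
# Binarize an n-ary AST: (a) push extra children under a new right child, top-down;
# (b) merge single-child nodes with their child.
def binarize(node):
    label, children = node
    if len(children) > 2:                                     # step (a)
        children = [children[0], ("<aux>", children[1:])]
    children = [binarize(child) for child in children]        # recurse top-down
    if len(children) == 1:                                    # step (b)
        child_label, grandchildren = children[0]
        return (label + "/" + child_label, grandchildren)
    return (label, children)

# A node with four children becomes a chain of binary nodes.
tree = ("Call", [("Name", []), ("Arg1", []), ("Arg2", []), ("Arg3", [])])
print(binarize(tree))
```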

5 DEEP REINFORCEMENT LEARNING FOR CODE SUMMARIZATION

In this section, we introduce the advanced deep learning framework named the actor-critic network, which has been successfully used in AlphaGo [41]. We introduce the actor and critic networks respectively, and then present how to train them simultaneously.

5.1 Actor Network
After obtaining the representation of the code snippet, we need to decode it into a comment. Here we describe how we generate the comment from the hidden space with a hybrid attention layer.

5.1.1 Hybrid Attention. Different parts of the code make different contributions to the final comment. We adopt an attention mechanism [4] that has been successfully used in neural machine translation. In the attention layer, we have two attention scores, one α^{str}_t(j) for the structural representation and another α^{txt}_t(j) for the sequential representation of code. At the t-th step of the decoding process, the attention scores α^{str}_t(j) and α^{txt}_t(j) are calculated as follows:


    \alpha^{str}_t(j) = \frac{\exp(h^{str}_j \cdot s_t)}{\sum_{k=1}^{n} \exp(h^{str}_k \cdot s_t)}, \qquad \alpha^{txt}_t(j) = \frac{\exp(h^{txt}_j \cdot s_t)}{\sum_{k=1}^{n} \exp(h^{txt}_k \cdot s_t)},    (10)

where n is the number of code tokens and h^{(·)}_j · s_t is the inner product of h^{(·)}_j and s_t, used to directly calculate the similarity score between h^{(·)}_j and s_t. The t-th context vector d^{(·)}_t is calculated as the summarization vector weighted by α^{(·)}_t(j):

    d^{str}_t = \sum_{j=1}^{n} \alpha^{str}_t(j)\, h^{str}_j, \qquad d^{txt}_t = \sum_{j=1}^{n} \alpha^{txt}_t(j)\, h^{txt}_j.    (11)

To integrate the structural context vector and the textual context vector, we first concatenate them and then feed them into a one-layer linear network:

    d_t = W_d [d^{str}_t ; d^{txt}_t] + b_d,    (12)

where [d^{str}_t ; d^{txt}_t] is the concatenation of d^{str}_t and d^{txt}_t. The fused context vector is then used for predicting the (t+1)-th word through an additional attentional hidden layer \tilde{s}_t:

    \tilde{s}_t = \tanh(W_c [s_t ; d_t] + b_c),    (13)

where [s_t ; d_t] is the concatenation of s_t and d_t.
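The following NumPy sketch ties Eqs. (10)-(13) together; the variable shapes and the weight matrices are illustrative assumptions:

```python
# Hybrid attention over the AST-based states h_str and the sequential states h_txt.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hybrid_attention(h_str, h_txt, s_t, W_d, b_d, W_c, b_c):
    """h_str, h_txt: (n, d) encoder states; s_t: (d,) decoder state;
    W_d, W_c: (d, 2d) fusion weights; b_d, b_c: (d,) biases."""
    a_str = softmax(h_str @ s_t)                        # Eq. (10), structural scores
    a_txt = softmax(h_txt @ s_t)                        # Eq. (10), textual scores
    d_str = a_str @ h_str                               # Eq. (11), structural context
    d_txt = a_txt @ h_txt                               # Eq. (11), textual context
    d_t = W_d @ np.concatenate([d_str, d_txt]) + b_d    # Eq. (12), fused context
    return np.tanh(W_c @ np.concatenate([s_t, d_t]) + b_c)  # Eq. (13), attentional state

n, d = 5, 8
rng = np.random.default_rng(0)
s_tilde = hybrid_attention(rng.normal(size=(n, d)), rng.normal(size=(n, d)),
                           rng.normal(size=d), rng.normal(size=(d, 2 * d)),
                           np.zeros(d), rng.normal(size=(d, 2 * d)), np.zeros(d))
print(s_tilde.shape)  # (8,)
```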

5.1.2 Text Generation. The model predicts the t-th word using a softmax function. Let p_π denote the policy π determined by the actor network, and let p_π(y_t | s_t) denote the probability distribution of generating the t-th word y_t; then:

    p_\pi(y_t \mid s_t) = \mathrm{softmax}(W_s \tilde{s}_t + b_s).    (14)

5.2 Critic Network
Unlike the traditional encoder-decoder framework, which generates the sequence by maximizing the likelihood of the next word given the ground-truth word, we directly optimize evaluation metrics such as BLEU [34] for code summarization. We apply a critic network to approximate the value of the generated action at time step t. Different from the actor network, the critic network outputs a single value instead of a probability distribution at each decoding step. Before introducing the critic network, we introduce the value function.

Given the policy π, the sampled actions and the reward function, the value function V^π is defined as the prediction of the total reward from the state s_t at step t under policy π, formulated as follows:

    V^\pi(s_t) = \mathbb{E}_{s_{t+1:T},\, y_{t+1:T}} \Big[ \sum_{l=0}^{T-t} r_{t+l} \,\Big|\, y_{1:t}, h \Big],    (15)

where T is the maximum number of decoding steps and h is the representation of the code snippet. For code summarization, we can only obtain an evaluation score (BLEU) when the sequence generation process (or episode) is finished. The episode terminates when the step exceeds the maximum step T or the end-of-sequence (EOS) token is generated. Therefore, we define the reward as follows:

    r_t = \begin{cases} 0 & t < T \\ \mathrm{BLEU} & t = T \text{ or EOS}. \end{cases}    (16)
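A sketch of this delayed reward is shown below, using NLTK's sentence-level BLEU as the terminal score; the smoothing choice and helper names are our assumptions, as the paper only specifies that BLEU is used:

```python
# Delayed reward of Eq. (16): zero until the episode ends, BLEU at the end.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def reward(generated, reference, t, max_step, eos_token="<EOS>"):
    """generated / reference: token lists; t: current decoding step."""
    finished = (t == max_step) or (len(generated) > 0 and generated[-1] == eos_token)
    if not finished:
        return 0.0
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference], generated, smoothing_function=smooth)

print(reward(["check", "if", "git", "is", "installed", "<EOS>"],
             ["check", "if", "git", "is", "installed", "."], t=6, max_step=30))
```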

Mathematically, the critic network tries to minimize the following loss function, where the mean squared error is used:

    L(\phi) = \frac{1}{2} \big\| V^\pi(s_t) - V^\pi_\phi(s_t) \big\|^2,    (17)

where V^π(s_t) is the target value, V^π_φ(s_t) is the value predicted by the critic network, and φ is the parameter of the critic network.
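In PyTorch terms (our assumption of framework), the critic objective of Eq. (17) is a plain squared error whose gradient with respect to the critic parameters is the quantity used in Eq. (20) below:

```python
# Critic regression target vs. prediction (toy scalars).
import torch

v_target = torch.tensor([0.37])                        # observed return from state s_t
v_pred = torch.tensor([0.25], requires_grad=True)      # V^pi_phi(s_t) from the critic
critic_loss = 0.5 * (v_target - v_pred).pow(2).sum()   # Eq. (17)
critic_loss.backward()                                  # gradient for the critic update, cf. Eq. (20)
print(v_pred.grad)  # tensor([-0.1200]) = -(v_target - v_pred)
```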

5.3 Model Training
We use the policy gradient method to optimize the policy directly, which is widely used in reinforcement learning. For the actor network, the goal of training is to minimize the negative expected reward, which can be defined as L(\theta) = -\mathbb{E}_{y_{1:T} \sim \pi}\big[\sum_{t=1}^{T} r_t\big], where θ is the parameter of the actor network. Denoting all the parameters as Θ = {θ, φ}, the total loss of our model can be represented as L(Θ) = L(θ) + L(φ).

For the policy gradient, it is typically better to train an expression of the following form according to [40]:

    \nabla_\theta L(\Theta) = \mathbb{E}\Big[\sum_{t=0}^{T-1} A^\pi(s_t, y_{t+1}) \nabla_\theta \log \pi_\theta(y_{t+1} \mid s_t)\Big],    (18)

where A^π(s_t, y_{t+1}) is the advantage function. We choose the advantage function because it achieves smaller variance compared with alternatives such as the TD residual and the reward with a baseline [40].

According to its definition, the advantage function can be formulated as follows (one can refer to [40] for more details):

    A^\pi(s_t, y_t) = Q^\pi(s_t, y_t) - V^\pi(s_t),    (19)

where Q^π(s_t, y_t) is the state-action value function, defined as Q^\pi(s_t, y_t) = \mathbb{E}_{s_{t+1:T},\, y_{t+1:T}}\big[\sum_{l=0}^{T-t} r_{t+l}\big]. From this formulation, we can see that the advantage function measures whether the action is better or worse than the policy's default behavior. Therefore, a step in the policy gradient direction increases the probability of better-than-average actions and decreases the probability of worse-than-average actions.
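A compact PyTorch sketch of the actor update implied by Eqs. (18)-(19) follows; the tensors are toy stand-ins for quantities produced during decoding:

```python
# Advantage-weighted policy gradient: A = Q - V scales each word's log-probability.
import torch

T = 10
log_probs = torch.randn(T, requires_grad=True)  # log pi_theta(y_{t+1} | s_t), t = 0..T-1
q_values = torch.rand(T)                         # Q^pi(s_t, y_{t+1}), e.g. sampled returns
v_values = torch.rand(T)                         # V^pi(s_t) predicted by the critic

advantage = (q_values - v_values).detach()       # Eq. (19); no gradient through the baseline
actor_loss = -(advantage * log_probs).sum()      # descending this follows the gradient in Eq. (18)
actor_loss.backward()
```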

On the other hand, the gradient of the critic network is calculated as follows:

    \nabla_\phi L(\Theta) = \sum_{t=0}^{T-1} \big[ V^\pi(s_t) - V^\pi_\phi(s_t) \big] \nabla_\phi V^\pi_\phi(s_t).    (20)

We employ stochastic gradient descent with the diagonal variant of AdaGrad [10] to optimize the parameters of our framework. Algorithm 1 summarizes the proposed model described above.

6 EXPERIMENTS AND ANALYSIS
To evaluate our proposed approach, in this section we conduct experiments to answer the following questions:
• RQ1. Does our proposed approach improve the performance of code summarization when compared with some state-of-the-art approaches?
• RQ2. What is the effectiveness of each component of our proposed model? For example, what about the performance of the hybrid code representation and reinforcement learning, respectively?
• RQ3. What is the performance of our proposed model on data with different code or comment lengths?


Algorithm 1 Actor-Critic training for code summarization.
1: Initialize the actor π(y_{t+1} | s_t) and the critic V^π(s_t) with random weights θ and φ;
2: Pre-train the actor to predict the ground truth y_t given {y_1, ..., y_{t-1}} by minimizing Eq. 7;
3: Pre-train the critic to estimate V(s_t) with the actor fixed;
4: for t = 1 → T do
5:     Receive a random example, and generate a sequence of actions {y_1, ..., y_T} according to the current policy π_θ;
6:     Calculate the advantage estimate A^π according to Eq. 19;
7:     Update the critic weights φ using the gradient in Eq. 20;
8:     Update the actor weights θ using the gradient in Eq. 18.

We ask RQ1 to evaluate our deep reinforcement learning-based model against some state-of-the-art baselines, RQ2 to evaluate each component of our model, and RQ3 to evaluate our model when varying the length of code or comment. In the following subsections, we first describe the dataset, the evaluation metrics and the training details. Then, we introduce the baselines for RQ1. Finally, we report our results and analysis for the research questions.

6.1 Dataset Preparation
We evaluate the performance of our proposed method using the dataset from [6], which was obtained from GitHub (https://github.com/), a popular open source project hosting platform. The dataset contains 108,726 code-comment pairs. The vocabulary sizes of code and comment are 50,400 and 31,350, respectively. For cross-validation, we shuffle the dataset and use the first 60% for training, 20% for validation and the remainder for testing. To construct the tree structure of code, we parse Python code into abstract syntax trees via the ast library (https://docs.python.org/2/library/ast.html). To convert code into sequential text, we tokenize the code by the delimiters {. , " ' : ; ) ( ! (space)}, as used in [31], and tokenize the comment by {(space)}; a sketch is given below.
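The sketch below shows one simple way to realize this tokenization with a regular-expression split over the listed delimiters; the exact tokenizer of [31] may differ in details:

```python
# Split code on the delimiter set {. , " ' : ; ) ( ! and whitespace}; comments split on spaces.
import re

CODE_DELIMITERS = ".,\"':;)(! \t\n"
CODE_PATTERN = "[" + re.escape(CODE_DELIMITERS) + "]+"

def tokenize_code(code):
    return [tok for tok in re.split(CODE_PATTERN, code) if tok]

def tokenize_comment(comment):
    return comment.split()

print(tokenize_code("def add(a, b): return a + b"))
# ['def', 'add', 'a', 'b', 'return', 'a', '+', 'b']
```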

Figure 5 shows the length distributions of code and comments in the testing data. From Figure 5a, we can see that the lengths of most code snippets lie between 20 and 60. This is consistent with the quote in [26] that "functions should hardly ever be 20 lines long"; for Python, the limit should be even shorter. From Figure 5b, we can see that nearly all comment lengths are between 5 and 15, which reveals that the comment sequences we need to generate will not be too long.

6.2 Evaluation Metrics
We evaluate the performance of our proposed model with four evaluation criteria widely used in neural machine translation and image captioning, i.e., BLEU [34], METEOR [5], ROUGE-L [23] and CIDER [47]. BLEU measures the average n-gram precision on a set of reference sentences, with a penalty for short sentences. METEOR is recall-oriented and measures how well our model captures content from the references in its output. ROUGE-L naturally takes sentence-level structural similarity into account and automatically identifies the longest co-occurring in-sequence n-grams. CIDER is a consensus-based evaluation protocol for image captioning.

Figure 5: Length distribution of testing data. ((a) Code length distribution; (b) comment length distribution.)

Figure 6: Iteration of training perplexity and reward.

6.3 Training Details
The hidden sizes of the encoder and decoder LSTM networks are both set to 512, and the word embedding size is set to 512. The mini-batch size is set to 64, and the learning rate is set to 0.001. We pretrain the actor network and the critic network for 10 epochs each, and then train the actor-critic network jointly for another 10 epochs. We record the perplexity (a function of the cross-entropy loss, widely used in the evaluation of many natural language processing tasks) and the reward every 50 iterations. Figure 6 shows the perplexity and reward curves of our method. All the experiments in this paper are implemented in Python 2.7 and run on a computer with a 2.2 GHz Intel Core i7 CPU, 64 GB of 1600 MHz DDR3 RAM, and a Titan X GPU with 12 GB of memory, running Ubuntu 16.04.

6.4 RQ1: Compared to Baselines
We compare our model with the following baselines:
• Seq2Seq [43] is a classical encoder-decoder framework in neural machine translation, which encodes the source sentences into a hidden space and decodes it into target sentences. In our comparison, the encoder and decoder are both based on LSTM.
• Seq2Seq+Attn [4] is a derived version of the Seq2Seq model with an attentional layer for word alignment.
• Tree2Seq [49] follows the same architecture as Seq2Seq and applies an AST-based LSTM as the encoder, originally for the task of code clone detection.



Table 1: Comparison of the overall performance between our model and previous methods. (Best scores are in boldface.)

                             BLEU-1   BLEU-2   BLEU-3   BLEU-4   METEOR   ROUGE-L   CIDER
Seq2Seq                      0.1660   0.0251   0.0100   0.0056   0.0535   0.2838    0.1262
Seq2Seq+Attn                 0.1897   0.0419   0.0200   0.0133   0.0649   0.3083    0.2594
Tree2Seq                     0.1649   0.0236   0.0096   0.0053   0.0501   0.2794    0.1168
Tree2Seq+Attn                0.1887   0.0417   0.0197   0.0129   0.0644   0.3068    0.2331
Hybrid2Seq+Attn+DRL (Ours)   0.2527   0.1033   0.0640   0.0441   0.0929   0.3913    0.7501

Table 2: Effectiveness of each component for our proposed model. (Best scores are in boldface.)

                             BLEU-1   BLEU-2   BLEU-3   BLEU-4   METEOR   ROUGE-L   CIDER
Seq2Seq+Attn+DRL             0.2421   0.0919   0.0513   0.0325   0.0882   0.3935    0.6390
Tree2Seq+Attn+DRL            0.2309   0.0854   0.0499   0.0338   0.0843   0.3767    0.6060
Hybrid2Seq                   0.1837   0.0379   0.0183   0.0122   0.0604   0.3020    0.2223
Hybrid2Seq+Attn              0.1965   0.0516   0.0280   0.0189   0.0693   0.3154    0.3475
Hybrid2Seq+Attn+DRL (Ours)   0.2527   0.1033   0.0640   0.0441   0.0929   0.3913    0.7501

• Tree2Seq+Attn [11] is a derived version of the Tree2Seq model with an attentional layer, which has been applied in neural machine translation.
• Hybrid2Seq(+Attn+DRL) denotes three versions of our proposed model with/without the Attn and DRL components.

Table 1 shows the results of the comparison between our proposed model and previous ones. From this table, we can see that our proposed model outperforms the other baselines on almost all evaluation metrics. When comparing Seq2Seq/Tree2Seq with its corresponding attention-based version, we can see that attention is indeed effective in aligning code tokens with comment tokens. We can also find that simply encoding the tree structure of code performs worse than simply encoding the code as a sequence. This can be explained by the fact that the words of comments are often drawn directly from the tokens of code. Our model, which considers both the structural and the sequential information of code, achieves the best performance in this comparison.

6.5 RQ2: Component Analysis
Table 2 shows the effectiveness of the main components of our proposed model. Comparing the results of Seq2Seq+Attn/Tree2Seq+Attn with (Table 2) and without (Table 1) deep reinforcement learning (DRL), we can see that the proposed DRL component indeed boosts the performance of comment generation for source code. We can also see that the proposed integration of an LSTM for content and an AST-based LSTM for structure is effective for representing code, compared with the corresponding non-hybrid variants in Table 1. Furthermore, this verifies that our proposed hybrid attention mechanism works well in our model.

6.6 RQ3: Parameter Analysis
We vary the lengths of code and comments, since the code length may affect the representation of code and the comment length may affect the performance of text generation. Figure 7 and Figure 8 show the performance of our proposed method compared with two baselines on data of varying code lengths and comment lengths, respectively.

From Figure 7, we can see that our model performs best compared with the other baselines on all four metrics with respect to different code lengths. Additionally, our proposed model shows stable performance even when the code length increases dramatically. We attribute this effect to the hybrid representation adopted in our model. For Figure 8, recall the comment length distribution in Figure 5b. Since nearly all comment lengths in the testing data are under 20, we ignore the performance analysis over samples whose comment length is larger than 20. From this figure, we can see that the performance of our model and the baselines varies dramatically on the four metrics with respect to different comment lengths.

6.7 Qualitative Analysis and Visualization
We show two examples in Table 3. It is clear that the comments generated by our model are closest to the ground truth. Although the models without DRL can generate some tokens that also appear in the ground truth, they cannot predict tokens that appear infrequently in the training data. In contrast, our deep reinforcement learning based model can generate tokens that are closer to the ground truth, such as git and symbolic. This can be explained by the fact that our model explores the word space more comprehensively and optimizes the BLEU score directly.

In Table 3, we also visualize the two attentions of our proposed model for the target sentences. For example, for Case 1 with the target sentence "check if git is installed .", we can notice that the str-attn (left of the figure) focuses more on tokens such as OSError, False, git and version, which represent the structure of the code. On the other hand, the attention of txt-attn (right of the figure) is comparatively dispersed, and focuses on some tokens such as def, which is of little significance for code summarization. This verifies our assumption that the LSTM captures the sequential content of code while the AST-based LSTM captures the structural information of code. Thus, it is reasonable to fuse them for a better representation.


Figure 7: Experimental results of our proposed method and some baselines on different metrics w.r.t. varying code length. (Panels (a) BLEU, (b) METEOR, (c) ROUGE-L and (d) CIDER compare Hybrid2Seq+Attn+DRL, Tree2Seq+Attn and Seq2Seq+Attn for code lengths from 0 to 90.)

Figure 8: Experimental results of our proposed method and some baselines on different metrics w.r.t. varying comment length. (Panels (a) BLEU, (b) METEOR, (c) ROUGE-L and (d) CIDER compare Hybrid2Seq+Attn+DRL, Tree2Seq+Attn and Seq2Seq+Attn for comment lengths from 0 to 40.)

7 THREATS TO VALIDITY AND LIMITATIONS
One threat to validity is that our approach is evaluated only on Python code collected from GitHub, so the results may not be representative of all comments. However, Python is a popular programming language used in a large number of projects, and in the future we will extend our approach to other programming languages. Another threat to validity concerns the metrics we chose for evaluation. It has always been a tough challenge to evaluate the similarity between two sentences for tasks such as neural machine translation [43] and image captioning [18]. In this paper, we only adopt four popular automatic metrics; it would be necessary to evaluate the quality of the generated text from more perspectives, such as human evaluation. Furthermore, from the deep reinforcement learning perspective, we only use the BLEU score of the generated sentence as the reward. It is well known that one of the biggest challenges for a reinforcement learning method is how to design a reward function that measures the value of an action correctly, and this remains an open problem. In future work, we plan to devise a reward function that reflects the value of each action more accurately.

8 RELATED WORK
8.1 Deep Code Representation
With the successful development of deep learning, it has become increasingly prevalent to represent source code with deep models in software engineering research. Gu et al. [12] use a sequence-to-sequence deep neural network [43], originally introduced for statistical machine translation, to learn intermediate distributed vector representations of natural language queries, which they use to predict relevant API sequences. Mou et al. [29] learn distributed vector representations using custom convolutional neural networks to represent features of code snippets; assuming that student solutions to various coursework problems have been intermixed, they seek to recover the solution-to-problem mapping via classification. Li et al. [21] learn distributed vector representations for the nodes of a memory heap and use the learned representations to synthesize candidate formal specifications for the code that produces the heap. Piech et al. [36] and Parisotto et al. [35] learn distributed representations of source code input/output pairs and use them to assess and review student assignments or to guide program synthesis from examples. Neural code-generative models also use distributed representations to capture context, which is a common practice in natural language processing. For example, the work of Maddison and Tarlow [25] and other neural language models (e.g., the LSTMs in Dam et al. [8]) describe context with distributed representations while sequentially generating code. Ling et al. [24] and Allamanis et al. [3] combine the code-context distributed representation with distributed representations of other modalities (e.g., natural language) to synthesize code.

8.2 Source Code Summarization
Code summarization is a relatively new task in the area of software engineering and has drawn great attention in recent years. Existing work on code summarization can be mainly categorized into rule-based approaches [42], statistical language model based approaches [30] and deep learning based approaches [2, 13, 15]. Sridhara et al. [42] first construct a software word usage model and then generate comments from the tokenized function/variable names via rules. Movshovitz-Attias et al. [30] predict comments from Java source files using topic models and n-grams. In [2], the authors introduce an attentional neural network that employs convolution on the input tokens to detect local time-invariant and long-range topical attention features, in order to summarize source code snippets into short, descriptive, function-name-like summaries. Iyer et al. [15] propose to use LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries. In Haije's thesis [13], the code summarization problem is modeled as a machine translation task, and translation models such as Seq2Seq [43] and Seq2Seq with attention [4] are employed. Unlike previous studies, we take both the tree structure and the sequential content of source code into consideration for a better representation of code.


Table 3: Examples of code summarization generated by each model and attention visualization of our model.

Case 1
  Code snippet:
    def _has_git():
        try:
            subprocess.check_call(['git', '--version'],
                                  stdout=subprocess.DEVNULL,
                                  stderr=subprocess.DEVNULL)
        except (OSError, subprocess.CalledProcessError):
            return False
        else:
            return True
  Ground truth:        check if git is installed .
  Seq2Seq:             helper function to create a new figuremanager instance .
  Seq2Seq+Attn:        return true if the user has access to the specified resource .
  Tree2Seq+Attn:       test that validate_folders throws a foldermissingerror .
  Hybrid2Seq+Attn:     returns the number of git modules that are not installed .
  Hybrid2Seq+Attn+DRL: returns true if git is installed .

Case 2
  Code snippet:
    def tensor3(name=None, dtype=None):
        if (dtype is None):
            dtype = config.floatX
        type = CudaNdarrayType(dtype=dtype,
                               broadcastable=(False, False, False))
        return type(name)
  Ground truth:        return a symbolic 3-d variable .
  Seq2Seq:             yaml
  Seq2Seq+Attn:        a decorator that returns a new class that will return a new class name .
  Tree2Seq+Attn:       helper function for #4957 .
  Hybrid2Seq+Attn:     return the path to the currently running server .
  Hybrid2Seq+Attn+DRL: return a symbolic graph .

Attention visualization

def

subprocess

subprocess

OSError

False

'git'

'--version'

subprocess

True

<EOS>

checkif

gitis

installed.

<EOS>

def

<unk>

try

subprocess

check_call

['git'

'--version']

stdout

subprocess

DEVNULL

stderr

subprocess

DEVNULL

except

OSError

subprocess

CalledProcessError

return

<unk>

return

<EOS>

0.0

0.5

1.0 None

None

config

False

False

type

name

dtype

False

CudaNdarrayType

dtype

type

dtype

name

dtype

<EOS>

returna

symbolic3-d

variable.

<EOS>

def

<unk>

name

None

dtype

None

if dtype

is None

dtype

config

<unk>

CudaNdarrayType

dtype

dtype

broadcastable

False

False

False

return

type

name

0.0

0.5

1.0

that describe C# code snippets and SQL queries. In Haije’s thesis[13], the code summarization problem is modeled as a machinetranslation task, and some translation models such as Seq2Seq [43]and Seq2Seq with attention [4] are employed. Unlike previous stud-ies, we take the tree structure and sequential content of source codeinto consideration for a better representation of code.

8.3 Deep Reinforcement Learning
Reinforcement learning [19, 45, 50], concerned with how software agents ought to take actions in an environment so as to maximize the cumulative reward, is well suited to decision-making tasks. Recently, a professional-level computer Go program was designed by Silver et al. [41] using deep neural networks and Monte Carlo Tree Search. Human-level gaming control [28] has been achieved through deep Q-learning. A visual navigation system [53] has been proposed based on an actor-critic reinforcement learning model. Text generation can also be formulated as a decision-making problem, and several reinforcement learning based works exist for specific tasks, including image captioning [38], dialogue generation [20] and sentence simplification [52]. Ren et al. [38] propose an actor-critic deep reinforcement learning model with an embedding reward for image captioning. Li et al. [20] integrate a developer-defined reward with the REINFORCE algorithm for dialogue generation. In this paper, we follow an actor-critic reinforcement learning framework, while our focus is on encoding the structural and sequential information of code snippets simultaneously with an attention mechanism.

9 CONCLUSION
In this paper, we first point out two issues (i.e., code representation and exposure bias) existing in traditional code summarization work. To handle these two issues, we encode the structure and the sequential content of code via an AST-based LSTM and an LSTM respectively, and add a hybrid attention layer to integrate them. We then feed the code representation into a deep reinforcement learning framework, namely an actor-critic network. Comprehensive experiments on a real-world dataset show that our proposed model outperforms other competitive baselines and achieves state-of-the-art performance on several automatic metrics, namely BLEU, METEOR, ROUGE-L and CIDER.

ACKNOWLEDGMENTS
This work is partially supported by the Ministry of Education of China under grant No. 2017PT18, the Natural Science Foundation of China under grants No. 61672453, 61773361, 61473273 and 61602405, the WE-DOCTOR company under grant No. 124000-11110, and the Zhejiang University Education Foundation under grant No. K17-511120-017. This work is also supported by the CCF-Tencent Open Research Fund, NSF grants IIS-1526499, IIS-1763325 and CNS-1626432, and NSFC 61672313.


REFERENCES

[1] A. V. Aho, R. Sethi, and J. D. Ullman. 1986. Compilers, Principles, Techniques. Addison Wesley 7, 8 (1986), 9.
[2] M. Allamanis, H. Peng, and C. Sutton. 2016. A convolutional attention network for extreme summarization of source code. In International Conference on Machine Learning. 2091–2100.
[3] M. Allamanis, D. Tarlow, A. Gordon, and Y. Wei. 2015. Bimodal modelling of source code and natural language. In International Conference on Machine Learning. 2123–2132.
[4] D. Bahdanau, K. Cho, and Y. Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[5] S. Banerjee and A. Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, Vol. 29. 65–72.
[6] A. V. M. Barone and R. Sennrich. 2017. A parallel corpus of Python functions and documentation strings for automated code documentation and code generation. arXiv preprint arXiv:1707.02275 (2017).
[7] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. 1998. Clone detection using abstract syntax trees. In Software Maintenance, 1998. Proceedings., International Conference on. IEEE, 368–377.
[8] H. K. Dam, T. Tran, and T. Pham. 2016. A deep language model for software code. arXiv preprint arXiv:1608.02715 (2016).
[9] S. C. B. de Souza, N. Anquetil, and K. M. de Oliveira. 2005. A study of the documentation essential to software maintenance. In Proceedings of the 23rd annual international conference on Design of communication: documenting & designing for pervasive information. ACM, 68–75.
[10] J. Duchi, E. Hazan, and Y. Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, Jul (2011), 2121–2159.
[11] A. Eriguchi, K. Hashimoto, and Y. Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. arXiv preprint arXiv:1603.06075 (2016).
[12] X. Gu, H. Zhang, D. Zhang, and S. Kim. 2016. Deep API learning. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 631–642.
[13] T. Haije, B. O. K. Intelligentie, E. Gavves, and H. Heuer. 2016. Automatic Comment Generation using a Neural Translation Model. (2016).
[14] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[15] S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer. 2016. Summarizing Source Code using a Neural Attention Model. In ACL (1).
[16] M. Kajko-Mattsson. 2005. A survey of documentation practice within corrective maintenance. Empirical Software Engineering 10, 1 (2005), 31–55.
[17] Y. Keneshloo, T. Shi, C. K. Reddy, and N. Ramakrishnan. 2018. Deep Reinforcement Learning For Sequence to Sequence Models. arXiv preprint arXiv:1805.09461 (2018).
[18] M. Kilickaya, A. Erdem, N. Ikizler-Cinbis, and E. Erdem. 2016. Re-evaluating automatic metrics for image captioning. arXiv preprint arXiv:1612.07600 (2016).
[19] V. R. Konda and J. N. Tsitsiklis. 2000. Actor-critic algorithms. In Advances in neural information processing systems. 1008–1014.
[20] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. 2016. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541 (2016).
[21] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. 2015. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493 (2015).
[22] B. P. Lientz and E. B. Swanson. 1980. Software maintenance management. (1980).
[23] C. Y. Lin. 2004. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out (2004).
[24] W. Ling, E. Grefenstette, K. M. Hermann, T. Kočisky, A. Senior, F. Wang, and P. Blunsom. 2016. Latent predictor networks for code generation. arXiv preprint arXiv:1603.06744 (2016).
[25] C. Maddison and D. Tarlow. 2014. Structured generative models of natural source code. In International Conference on Machine Learning. 649–657.
[26] R. C. Martin. 2009. Clean code: a handbook of agile software craftsmanship. Pearson Education.
[27] A. Mnih and Y. W. Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426 (2012).
[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
[29] L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin. 2016. Convolutional Neural Networks over Tree Structures for Programming Language Processing. In AAAI, Vol. 2. 4.
[30] D. Movshovitz-Attias and W. W. Cohen. 2013. Natural language models for predicting programming comments. (2013).
[31] A. T. Nguyen and T. N. Nguyen. 2017. Automatic categorization with deep neural network for open-source Java projects. In Proceedings of the 39th International Conference on Software Engineering Companion. IEEE Press, 164–166.
[32] L. Nie, H. Jiang, Z. Ren, Z. Sun, and X. Li. 2016. Query expansion based on crowd knowledge for code search. IEEE Transactions on Services Computing 9, 5 (2016), 771–783.
[33] Y. Oda, H. Fudaba, G. Neubig, H. Hata, S. Sakti, T. Toda, and S. Nakamura. 2015. Learning to generate pseudo-code from source code using statistical machine translation (t). In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 574–584.
[34] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 311–318.
[35] E. Parisotto, A. Mohamed, R. Singh, L. Li, D. Zhou, and P. Kohli. 2016. Neuro-symbolic program synthesis. arXiv preprint arXiv:1611.01855 (2016).
[36] C. Piech, J. Huang, A. Nguyen, M. Phulsuksombati, M. Sahami, and L. Guibas. 2015. Learning program embeddings to propagate feedback on student code. arXiv preprint arXiv:1505.05969 (2015).
[37] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732 (2015).
[38] Z. Ren, X. Wang, N. Zhang, X. Lv, and L. J. Li. 2017. Deep Reinforcement Learning-Based Image Captioning with Embedding Reward. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 1151–1159.
[39] R. Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here? Proc. IEEE 88, 8 (2000), 1270–1278.
[40] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015).
[41] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, and S. Dieleman. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.
[42] G. Sridhara, E. Hill, D. Muppaneni, L. Pollock, and K. Vijay-Shanker. 2010. Towards automatically generating summary comments for Java methods. In Proceedings of the IEEE/ACM international conference on Automated software engineering. ACM, 43–52.
[43] I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 3104–3112.
[44] R. S. Sutton and A. G. Barto. 1998. Introduction to reinforcement learning. Vol. 135. MIT Press, Cambridge.
[45] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems. 1057–1063.
[46] K. S. Tai, R. Socher, and C. D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075 (2015).
[47] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4566–4575.
[48] C. J. Watkins and P. Dayan. 1992. Q-learning. Machine learning 8, 3-4 (1992), 279–292.
[49] H. H. Wei and M. Li. 2017. Supervised Deep Features for Software Functional Clone Detection by Exploiting Lexical and Syntactical Information in Source Code. (2017).
[50] R. J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning. Springer, 5–32.
[51] D. Yang, A. Hussain, and C. V. Lopes. 2016. From query to usable code: An analysis of Stack Overflow code snippets. In Mining Software Repositories (MSR), 2016 IEEE/ACM 13th Working Conference on. IEEE, 391–401.
[52] X. Zhang and M. Lapata. 2017. Sentence Simplification with Deep Reinforcement Learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 584–594.
[53] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi. 2017. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 3357–3364.