
Drug-Drug Interaction Mining and Interaction Terms Extraction using Deep Learning: A Word-level Attention Bi-Directional LSTM

Zhengchao Yang1, Sudha Ram2, Faiz Currim3

Abstract
Detection of Drug-Drug Interactions (DDIs) is critical for patient medication safety. Knowledge about drug interactions can not only prevent serious adverse drug events but also reduce unnecessary health costs. The amount of biomedical literature related to drug interactions increases every year, while the publicly available databases for guiding medication treatment are consistently outdated due to the lack of automated DDIs extraction systems. In this paper, we propose a novel automated DDIs extraction method based on a word-level attention bi-directional LSTM neural network architecture (a special kind of recurrent neural network used to solve long-term dependency problems). Our proposed framework requires minimal manual feature engineering and achieves good results in detecting DDIs in biomedical documents. The proposed method is also able to extract interaction terms from sentences that contain DDIs via the word-level attention mechanism. Experimental results show that our method outperforms existing state-of-the-art baselines on the multi-class DDIs detection problem.

1. Introduction
Drug-Drug Interactions (DDIs) occur in daily life when people take different drugs at the same time and the desired effect of one drug is affected by another (while a drug may interact with different substances, in this paper we study interactions with other drugs). 20%-60% of patients admitted to hospitals are administered more than one drug simultaneously (Hayashi et al., 2017). Elderly patients who have multiple chronic diseases or conditions are especially at risk, as they are often administered multiple medicines by different medical specialists at the same time. It has been reported that adverse drug events (ADEs) affect more than 770,000 patients in U.S. hospitals every year, and DDIs are a leading cause of ADEs (Slight et al., 2018). To reduce the accompanying social and health costs, DDIs databases are under development by several organizations. However, even though databases such as Micromedex4, SFINX5 and DrugBank6 are available, most are built and updated based on clinical records or research experiments, and are often outdated (Fathelrahman, 2018). Instead, the most up-to-date DDIs results usually come from medical research and are hidden in scientific literature in the form of unstructured text. As medical research continues to grow, manually extracting DDIs from unstructured text is laborious and impractical. Therefore, automated DDIs mining from scientific literature is attracting attention from researchers. Several approaches based on machine learning (ML) and natural language processing (NLP) techniques have been proposed to solve this problem (Björne et al., 2013; Chowdhury and Lavelli, 2013; Kim et al., 2015; Zheng et al., 2016). In general, these ML/NLP-based DDIs mining methods can be categorized into three groups: Feature-based, Kernel-based and Deep Neural Network/Deep

1 University of Arizona, Tucson, AZ. Email: [email protected] 2 University of Arizona, Tucson, AZ. Email: [email protected] 3 University of Arizona, Tucson, AZ. Email: [email protected] 4 https://www.micromedexsolutions.com/ 5 https://www.emergesphinx.org/ 6 https://www.drugbank.ca/


Learning-based methods (Björne et al., 2013; Chowdhury and Lavelli, 2013; Kim et al., 2015; Zheng et al., 2016). In the first group, i.e., Feature-based methods (Björne et al., 2013; Chowdhury and Lavelli, 2013), researchers identify different features that can be extracted from text and represent the attributes of data instances. For example, Chowdhury and Lavelli (2013) proposed using heterogeneous features, including lexical, semantic, syntactic and negation features, in their ML system. Those features are derived from sentences as well as from the corresponding parsed dependency trees. However, feature extraction depends on domain knowledge and has limited lexical generalization ability for unseen words. In the second group, i.e., Kernel-based methods (Kim et al., 2015; Zheng et al., 2016), different kernels derived from syntactic parse trees or graphs are proposed to measure the similarity between data instances. Those kernels are based on the form of the syntactic representations of data instances. Kernel combinations are often used to compensate for the weaknesses of individual kernels (Zheng et al., 2016). The first two groups of methods involve considerable manual feature engineering, which in turn requires significant domain knowledge. To overcome this limitation, deep neural network (DNN)-based methods, which require minimal feature engineering effort, have gained attention and shown promising results in various research fields. DNN-based methods are well known for automatically learning multi-level representations of attributes, where the representation in each layer of the network becomes more abstract as the layers go deeper (LeCun et al., 2015). There are two major Deep Neural Network architectures used to extract DDIs: Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). CNN-based methods (Liu et al., 2016; Zhao et al., 2016) employ a fixed-size window to capture the contextual information of a word (represented as a word embedding feature). The combination of embedding-based features and, possibly, traditional manually defined features goes through convolutional and pooling layers, is transformed into high-level representations, and is finally fed into a Softmax classifier for DDIs detection. However, meaningful contextual information may not appear consecutively in a sentence, so a fixed-size window may not cover the complete information. CNN is therefore weak at learning long-distance (i.e., spatially separated) patterns, where, for example, two semantically related words might be separated by many irrelevant words. On the other hand, RNN (Grossberg, 2013; Hochreiter and Schmidhuber, 1997), being a temporal sequence model, can accumulate contextual information for the whole sentence through its memory units and is therefore suitable for learning from long sentences or sequences without the fixed-size window often used in CNN. A few RNN-based methods have proved successful for relationship classification (Li et al., 2017; Zhang and Wang, 2015). Nevertheless, in RNN-based methods, words still need to be passed to the RNN one by one to get the final output from the last cell, and some important syntactic or contextual information for important words may still be lost during training.
Recently, attention mechanisms have been proposed to solve this kind of problem in many NLP areas such as question answering (Ilievski et al., 2016; Yang et al., 2016) and machine translation (Luong et al., 2015; Wu et al., 2016), among others. In this study, we propose a novel word-level attention bi-directional LSTM based method to detect DDIs and extract important interaction terms/words from sentences that indicate DDIs. The extracted interaction terms/words can reveal more about how biomedical researchers present drug interactions in their publications. Experimental results on the benchmark SemEval 2013 (DDI) datasets (Herrero-Zazo et al., 2013) show that our Word-Level Attention (WLA) Bi-LSTM model performs well in identifying drug interactions.


2. Research Methods
In this section, we describe in detail our proposed WLA Bi-Directional LSTM neural network method for extracting DDIs from biomedical texts. An overview of our approach is shown in Figure 1.

Figure 1. Overview of the proposed framework

2.1 Data Preprocessing
2.1.1 Data instances extraction
The SemEval 2013 datasets contain many XML files representing individual biomedical research records collected from DrugBank and Medline. Figure 2 shows an example sentence from an XML file, illustrating how the records are organized. Each sentence is manually labelled to indicate whether it contains interacting drugs, and the type of interaction between the drugs if there is one. Since the sentence in Figure 2 contains 3 drugs, there are 3 drug pairs labelled with their corresponding interaction types. In this example, the “calcium”–“multivalent cations” pair is labelled “False”, which means there is no interaction between “calcium” and “multivalent cations”. Similarly, “calcium”–“alendronate” and “multivalent cations”–“alendronate” are both labelled “Mechanism”. There are 5 types of interaction labels in our datasets:

• Mechanism: labelled drug pairs interact via a pharmacokinetic mechanism.
• Effect: labelled drug pairs interact via a pharmacodynamic mechanism.
• Advice: indicates a recommendation or advice on dealing with a drug interaction between two drugs.
• Int: indicates a drug interaction without specifying any interaction details.
• False: there is no interaction between the two drugs.

Figure 2. An example sentence/record in SemEval Datasets


An XML parsing program was written to extract the sentences and their corresponding drug entities, as well as all the drug pairs. For example, three output data instances are extracted for the example in Figure 2 (a minimal parsing sketch follows the examples below). The extracted sentence, the drug pair and the interaction type are separated by the “|” delimiter.

1. Products containing calcium and other multivalent cations likely will interfere with absorption of alendronate.|calcium|multivalent cations|false

2. Products containing calcium and other multivalent cations likely will interfere with absorption of alendronate.|calcium|alendronate|mechanism

3. Products containing calcium and other multivalent cations likely will interfere with absorption of alendronate.|multivalent cations|alendronate|mechanism
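The following is a minimal sketch of this extraction step, assuming the usual SemEval 2013 DDI markup in which each <sentence> element carries its text and contains <entity> and <pair> children; attribute names may vary slightly across corpus releases.

```python
import xml.etree.ElementTree as ET

def extract_instances(xml_path):
    """Yield 'sentence|drug1|drug2|label' strings from one corpus file."""
    instances = []
    root = ET.parse(xml_path).getroot()
    for sent in root.iter("sentence"):
        text = sent.get("text")
        # Map entity ids to their surface names, e.g. "calcium".
        entities = {e.get("id"): e.get("text") for e in sent.iter("entity")}
        for pair in sent.iter("pair"):
            e1 = entities[pair.get("e1")]
            e2 = entities[pair.get("e2")]
            # Non-interacting pairs (ddi="false") carry no "type" attribute.
            label = pair.get("type") if pair.get("ddi") == "true" else "false"
            instances.append(f"{text}|{e1}|{e2}|{label}")
    return instances
```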

2.1.2 Negative Instance Filtering
Machine Learning algorithms tend to produce unsatisfactory classifiers when fed imbalanced datasets (Chawla, 2009). The SemEval datasets also have this imbalance problem: the ratio of extracted interaction instances to extracted negative (non-interacting) instances is 1:5.9. To alleviate the imbalance, we follow the under-sampling method of Zhao et al. (2016), using the following rules:
Rule 1: If the extracted data instance has two drugs with the same name, the instance is removed, because there is no point in having information in the training set suggesting that a drug can interact with itself.
Rule 2: If the extracted data instance has a pair of drugs in a coordinate relation, the instance is removed, because such pairs tend to cause false positives in the test sets. Note that the presence of a coordinate relation is often signaled by the appearance of a coordinator (coordinating conjunction), e.g., “and”, “or”, “but”. Therefore, the first extracted instance in section 2.1.1 will be removed from the training set.
2.1.3 Drug Blinding, Normalization and Tokenization
Since our DDIs extraction model only needs to learn the structure and contextual information of sentences that include drugs (it does not need to know the drug names), we also follow the drug blinding method of Zhao et al. (2016). We replace the two drugs in a drug pair with DRUGA and DRUGB, and replace all other drugs in the same sentence with DRUGN. The filtered instances from section 2.1.2 (for the example in Figure 2) are therefore transformed into the following two new instances:

1. Products containing DRUGA and other DRUGN likely will interfere with absorption of DRUGB.|calcium|alendronate|mechanism

2. Products containing DRUGN and other DRUGA likely will interfere with absorption of DRUGB.|multivalent cations|alendronate|mechanism

After the drug blinding process, we normalize the text by replacing all numbers (either integers or decimals) with the token “NUM_TOKEN” using the regular expression “[0-9]+\.*[0-9]*”. Finally, tokenization is conducted using NLTK to convert a sentence into a sequence of tokens, which will then be transformed into embeddings for the model described in the following sections.
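A compact sketch of the blinding, normalization and tokenization steps is shown below; the simple string replacement used here for blinding is illustrative, and the real system may rely on the entity character offsets instead.

```python
import re
from nltk.tokenize import word_tokenize  # requires NLTK's "punkt" data

NUM_RE = re.compile(r"[0-9]+\.*[0-9]*")  # the paper's number pattern

def preprocess(sentence, drug_a, drug_b, other_drugs=()):
    for d in other_drugs:                      # blind non-candidate drugs first
        sentence = sentence.replace(d, "DRUGN")
    sentence = sentence.replace(drug_a, "DRUGA")
    sentence = sentence.replace(drug_b, "DRUGB")
    sentence = NUM_RE.sub("NUM_TOKEN", sentence)  # normalization
    return word_tokenize(sentence)                # tokenization

tokens = preprocess(
    "Products containing calcium and other multivalent cations likely "
    "will interfere with absorption of alendronate.",
    "calcium", "alendronate", other_drugs=("multivalent cations",))
# ['Products', 'containing', 'DRUGA', 'and', 'other', 'DRUGN', ...]
```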


2.2 Embedding
Each token of the sequences derived from the previous section is transformed into an embedding, which consists of one word embedding and two distance embeddings (note that an embedding is a vector of floating-point numbers).
2.2.1 Word Embedding
Word embedding is a machine learning technique that maps words or phrases from a vocabulary to corresponding vectors of real numbers (Hinton and Roweis, 2003). Compared with the traditional bag of words (Hotho et al., 2003), which usually results in very large and sparse vectors whose dimensionality equals the size of the supported vocabulary, word embedding creates, for each word, a vector representation with two properties: (1) a much lower-dimensional representation (which solves the sparsity problem of the traditional one-hot representation); and (2) contextual and semantic similarity between words. Word embedding is therefore popular in machine learning, especially in NLP (Mikolov, Yih, et al., 2013). In our study, we trained word embedding vectors on a corpus from PubMed7, which contains almost 11 million English abstracts of biomedical articles, using Google's word2vec tool8. We obtained an embedding vector for each trained word with an embedding size of 200 (the size of the vector representation for each word), following the suggestion of 200-300 from Mikolov, Le, et al. (2013). Given a data instance $S = (w_1, w_2, \ldots, w_a, \ldots, w_i, \ldots, w_b, \ldots, w_T)$ from section 2.1, where $w_a$ and $w_b$ are DRUGA and DRUGB respectively, the word embedding vector for each word is looked up in the pre-trained word embedding table. The embedding of a word is represented as $W\_EMB(w_i) \in \mathbb{R}^U$, where $U = 200$.
2.2.2 Distance Embedding
The DDIs extraction process analyzes the relationship between two drugs in a sentence, where the distances of the other context words also contribute to determining whether the two drugs have an interaction relationship (Zeng et al., 2014). Accordingly, we introduce into our model the relative distances ($d_{ia}$ and $d_{ib}$) between a focal word $w_i$ and the two candidate drugs $w_a$ and $w_b$. Figure 3 shows an example of such distances.

Figure 3. An example of distances between focal word and two candidate drugs

Finally, the distance embedding is obtained by mapping each distance to a vector in another lookup table, shown in Table 1. Since the average sentence length in biomedical literature is 10-15, we use a distance embedding lookup table similar to the one proposed by Zeng et al. (2014). The distance embedding size is 11. The distance embeddings for each word are represented as $D1\_EMB(w_i) \in \mathbb{R}^S$ and $D2\_EMB(w_i) \in \mathbb{R}^S$, where $S = 11$.

7 https://www.ncbi.nlm.nih.gov/pubmed/ 8 https://code.google.com/archive/p/word2vec/


Table 1. Distance Embedding Lookup Table: each relative distance in {-10, ..., 10} is mapped to an 11-dimensional binary vector (distances beyond ±10 are clipped to the nearest endpoint).

2.2.3 Embedding Concatenation
After generating the embeddings for each word, the three embedding vectors are concatenated to form the embedding layer that serves as the input to our DDIs extraction model. The concatenated embedding for word $w_i$ is defined as:

$x_i = [W\_EMB(w_i), D1\_EMB(w_i), D2\_EMB(w_i)]$    (1)

where $x_i \in \mathbb{R}^d$, $d = U + S + S = 200 + 11 + 11 = 222$, and $[\,]$ is the concatenation operation.
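A small sketch of Eq. (1) follows; word_vecs stands for the pre-trained word2vec lookup and D1/D2 for the distance lookup tables of Table 1 (all three names are illustrative, and the tables are shown here as stand-in random arrays indexed by the clipped, shifted distance).

```python
import numpy as np

U, S, CLIP = 200, 11, 10
rng = np.random.default_rng(0)
word_vecs = {"DRUGA": rng.normal(size=U)}   # stand-in for the word2vec table
D1 = rng.normal(size=(2 * CLIP + 1, S))     # stand-in distance tables,
D2 = rng.normal(size=(2 * CLIP + 1, S))     # rows indexed by distance + 10

def embed_token(token, i, idx_a, idx_b):
    """Build x_i of Eq. (1) for the token at position i."""
    d1 = max(-CLIP, min(CLIP, i - idx_a)) + CLIP   # clipped distance to DRUGA
    d2 = max(-CLIP, min(CLIP, i - idx_b)) + CLIP   # clipped distance to DRUGB
    return np.concatenate([word_vecs[token], D1[d1], D2[d2]])

x = embed_token("DRUGA", i=2, idx_a=2, idx_b=9)
assert x.shape == (222,)   # U + S + S
```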

Figure 4. The architecture of the proposed Word Level Attention Bi-LSTM


2.3 Word Level Attention Bi-LSTM
In this sub-section, we introduce our proposed word level attention Bi-LSTM; its architecture is shown in Figure 4. The architecture contains 5 layers: an input layer, an embedding layer, a Bi-LSTM layer, an attention layer and a DDIs classification layer. The input layer holds the pre-processed data instances described in section 2.1. The embedding layer transforms the data instances into embedding vectors, which are then fed into the Bi-LSTM layer. The Bi-LSTM layer includes one forward and one backward LSTM network. The attention layer calculates an attention value for each input word; the attention value indicates how much each word contributes to helping the final classification layer decide whether a sentence includes a DDI or not.
2.3.1 Bi-LSTM layer
Given an embedding sequence $X = (x_1, x_2, \ldots, x_T)$ for a data instance, the traditional LSTM (Gers et al., 1999) consumes the sequence from past to future, i.e., from left to right. Figure 5 shows the structure of a unidirectional LSTM, which is in fact a single LSTM block/cell. The cell is used repeatedly as it consumes the inputs one by one.

Figure 5. The structure of a unidirectional LSTM block/cell

Typically, an LSTM recurrent neural network is composed of three major parts:

1. A forget gate $f_t$, in formulation (2), which is responsible for forgetting or dropping part of the information that the network has consumed and retained as memory (namely, the cell state $c_{t-1}$) when it takes in new information. The output of this gate is $f_t * c_{t-1}$;

2. A sigmoid function that decides which parts of the new information should be ignored or updated, in formulation (3), while a tanh function (formulation (4)) creates a vector of all the candidate values from the new input. These two are then multiplied to update the cell state. This operation produces the new memory $c_t$ that the LSTM block/cell keeps, in formulation (5);

3. Finally, the LSTM block/cell decides what it will output. Another sigmoid function decides which parts of the updated cell state participate in generating the output, in formulations (6) and (7).

$\vec{f}_t = \sigma(\vec{W}_f \cdot [\vec{h}_{t-1}, x_t] + \vec{b}_f)$    (2)
$\vec{i}_t = \sigma(\vec{W}_i \cdot [\vec{h}_{t-1}, x_t] + \vec{b}_i)$    (3)
$\tilde{c}_t = \tanh(\vec{W}_c \cdot [\vec{h}_{t-1}, x_t] + \vec{b}_c)$    (4)
$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t$    (5)
$o_t = \sigma(\vec{W}_o \cdot [\vec{h}_{t-1}, x_t] + \vec{b}_o)$    (6)
$\vec{h}_t = o_t * \tanh(c_t)$    (7)

where $[\,]$ denotes the concatenation operation, $*$ denotes element-wise multiplication, $\cdot$ denotes the dot product and $\sigma$ denotes the sigmoid function. $\vec{W}_f, \vec{W}_i, \vec{W}_c, \vec{W}_o \in \mathbb{R}^{N \times (N+d)}$ and $\vec{b}_f, \vec{b}_i, \vec{b}_c, \vec{b}_o \in \mathbb{R}^{N}$ are parameters that the model will learn from the data. Note that the arrow “→” above a parameter indicates that it belongs to the forward LSTM.
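As an illustration, a single forward step of Eqs. (2)-(7) can be written directly in NumPy; this is a minimal sketch, not the TensorFlow implementation used in our experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One forward LSTM step; W maps gate name -> (N, N + d) matrix."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])        # forget gate, Eq. (2)
    i = sigmoid(W["i"] @ z + b["i"])        # input gate, Eq. (3)
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate values, Eq. (4)
    c = f * c_prev + i * c_tilde            # new cell state, Eq. (5)
    o = sigmoid(W["o"] @ z + b["o"])        # output gate, Eq. (6)
    h = o * np.tanh(c)                      # hidden state, Eq. (7)
    return h, c
```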


However, a standard LSTM only processes sequences in a forward temporal order, and it can still lose information from the beginning of a long input sequence by the time it reaches the end (Huang et al., 2015). A bi-directional LSTM (shown in Figure 4) therefore extends the forward LSTM with an additional backward LSTM layer in order to keep more information from the beginning of the sequence. Formulations (8)-(13) are the corresponding equations for the backward LSTM:

$\overleftarrow{f}_t = \sigma(\overleftarrow{W}_f \cdot [\overleftarrow{h}_{t+1}, x_t] + \overleftarrow{b}_f)$    (8)
$\overleftarrow{i}_t = \sigma(\overleftarrow{W}_i \cdot [\overleftarrow{h}_{t+1}, x_t] + \overleftarrow{b}_i)$    (9)
$\tilde{c}_t = \tanh(\overleftarrow{W}_c \cdot [\overleftarrow{h}_{t+1}, x_t] + \overleftarrow{b}_c)$    (10)
$c_t = f_t * c_{t+1} + i_t * \tilde{c}_t$    (11)
$o_t = \sigma(\overleftarrow{W}_o \cdot [\overleftarrow{h}_{t+1}, x_t] + \overleftarrow{b}_o)$    (12)
$\overleftarrow{h}_t = o_t * \tanh(c_t)$    (13)

2.3.2 Word Level Attention Layer
Traditionally, LSTM-based tasks (Sutskever et al., 2014; Tang et al., 2015; Wöllmer et al., 2013) only use the final output $\vec{h}_T$ of the network for classification or regression prediction, on the assumption that the final output has accumulated enough information about the whole sequence. Because of the weakness of LSTM at modelling long sequences (Li et al., 2018), Bi-LSTM was proposed. However, both LSTM and Bi-LSTM can still lose information from the middle of a sequence, as they are unable to recognize the contribution of each member of the sequence to the given task. The attention mechanism has been successfully applied in NLP research areas such as question answering (Ilievski et al., 2016; Yang et al., 2016) and machine translation (Luong et al., 2015; Wu et al., 2016); its purpose is to give more attention to certain parts of a text during NLP tasks. Inspired by Luong et al. (2015), we propose an attention layer that learns the importance of interaction terms before the model decides the interaction types in the sentences. As shown in Figure 4, instead of discarding the individual outputs $\vec{h}_t$ and $\overleftarrow{h}_t$ as the model consumes the inputs, we add an extra layer that learns the relationship of the individual outputs with the final outputs (note that the final output is $\overleftarrow{h}_1$ for the backward LSTM and $\vec{h}_T$ for the forward LSTM). Two weight matrices $M_f$ and $M_b \in \mathbb{R}^{N \times N}$ are learned during training to decide the contribution of the individual outputs to the final DDIs detection. We denote the output matrices of the forward and backward LSTM as:

$\vec{H} = (\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_T) \in \mathbb{R}^{N \times T}$    (14)
$\overleftarrow{H} = (\overleftarrow{h}_1, \overleftarrow{h}_2, \ldots, \overleftarrow{h}_T) \in \mathbb{R}^{N \times T}$    (15)

Each individual output is then scored against the final output using the learned weight matrices $M_f$ and $M_b$. To simplify the calculation, we use the following formulations:

$\vec{c} = \vec{H}^{\top} M_f \vec{h}_T$    (16)
$\overleftarrow{c} = \overleftarrow{H}^{\top} M_b \overleftarrow{h}_1$    (17)

Finally, we apply a Softmax function to $\vec{c}$ and $\overleftarrow{c}$ respectively to get a normalized attention weight (between 0 and 1) for each input word, and generate the final attention by taking the average:

$\vec{a} = \mathrm{softmax}(\vec{c})$    (18)


$\overleftarrow{a} = \mathrm{softmax}(\overleftarrow{c})$    (19)
$a = (\vec{a} + \overleftarrow{a}) / 2$    (20)

where $\vec{c}, \overleftarrow{c}, \vec{a}, \overleftarrow{a}, a \in \mathbb{R}^T$ and $a$ is the learnt attention weight for each input word.
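The attention computation of Eqs. (16)-(20) reduces to a few matrix products; the sketch below stores the forward and backward outputs column-wise as (N, T) arrays, with the transpose in Eqs. (16)-(17) making the scores a length-T vector.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())   # subtract max for numerical stability
    return e / e.sum()

def word_attention(H_fwd, H_bwd, M_f, M_b):
    """H_fwd, H_bwd: (N, T) output matrices; M_f, M_b: (N, N) weights."""
    c_fwd = H_fwd.T @ (M_f @ H_fwd[:, -1])  # Eq. (16): score vs. final fwd output
    c_bwd = H_bwd.T @ (M_b @ H_bwd[:, 0])   # Eq. (17): score vs. final bwd output
    return (softmax(c_fwd) + softmax(c_bwd)) / 2.0   # Eqs. (18)-(20)
```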

As the training process proceeds, the optimizer minimizes the cost function (error rate) in the last layer of our model, which increases the weights of words that indicate an interaction and decreases the weights of words that do not. The final error rate is thus minimized through training.
2.3.3 DDIs Classification Layer
Since the forward and backward outputs are equally important, we concatenate both directional outputs for each input, following principles from past studies (Graves and Schmidhuber, 2005; Huang et al., 2015; Zhang and Wang, 2015):

$H = [\vec{H}; \overleftarrow{H}] = ([\vec{h}_1, \overleftarrow{h}_1], [\vec{h}_2, \overleftarrow{h}_2], \ldots, [\vec{h}_T, \overleftarrow{h}_T])$    (21)

The feature vector $h$ captured by the attention weight vector and the concatenated outputs is then calculated as:

$h = \tanh(H a) \in \mathbb{R}^{2N}$    (22)

Finally, the feature vector $h$, as an attention-weighted output, is fed into a fully connected network with one hidden layer and a Softmax classifier. In the fully connected layer, we set the number of nodes to the number of classes, namely 5. The final output is:

$p(y|x) = \mathrm{Softmax}(h^{\top} W_{out} + b_{out})$    (23)

where $W_{out} \in \mathbb{R}^{2N \times 5}$ and $b_{out} \in \mathbb{R}^{5}$ are network parameters trained during the training process. Cross-entropy with $L_2$ regularization is used as the objective cost function to be optimized:

$J(\theta) = -\frac{1}{M} \sum_{i=1}^{M} y_i \ln \hat{y}_i + \lambda \lVert \theta \rVert_2^2$    (24)

In the objective function, the first term is the cross-entropy cost, used to measure the difference between true labels and predicted labels. The second term is the $L_2$ regularization, used to avoid overfitting. $M$ is the number of training data instances, and $\theta = \{\vec{W}_f, \vec{W}_i, \vec{W}_c, \vec{W}_o, \vec{b}_f, \vec{b}_i, \vec{b}_c, \vec{b}_o, \overleftarrow{W}_f, \overleftarrow{W}_i, \overleftarrow{W}_c, \overleftarrow{W}_o, \overleftarrow{b}_f, \overleftarrow{b}_i, \overleftarrow{b}_c, \overleftarrow{b}_o, M_f, M_b, W_{out}, b_{out}\}$ includes all the parameters that the model needs to optimize. We use Adam (Kingma and Ba, 2014) to optimize the objective cost function.
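The classification layer of Eqs. (21)-(23) is summarized in the following sketch, continuing the array conventions of the attention sketch above (a minimal NumPy illustration rather than the trained TensorFlow model):

```python
import numpy as np

def classify(H_fwd, H_bwd, a, W_out, b_out):
    """Eqs. (21)-(23): attention-weighted feature -> 5-way Softmax."""
    H = np.concatenate([H_fwd, H_bwd], axis=0)  # (2N, T), Eq. (21)
    h = np.tanh(H @ a)                          # (2N,),  Eq. (22)
    logits = h @ W_out + b_out                  # W_out: (2N, 5), b_out: (5,)
    e = np.exp(logits - logits.max())
    return e / e.sum()                          # p(y|x), Eq. (23)
```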

3. Experiments
3.1 Datasets Description
The SemEval 2013 dataset is a manually labelled dataset containing both training and test sets. The dataset contains documents of sentences selected from the DrugBank and Medline databases. Table 2 shows the statistics of the dataset before and after negative instances are filtered out.


Table 2. Statistics of SemEval 2013 training and test datasets

DDI Type     Training Set                Test Set
             Before       After          Before      After
Mechanism    1,318        1,268          302         302
Advice       826          818            221         221
Effect       1,683        1,620          360         360
Int          187          176            96          96
False        23,757       13,923         4,737       3,025
Total        27,771       17,805         5,716       4,004

3.2 Evaluation Metrics
Since the filtered datasets are still imbalanced, using prediction accuracy as a performance evaluation metric is not recommended. Instead, we use precision (P), recall (R) and the F1-score (F1) to evaluate our model on the test set, and we use the F1-score to monitor performance on the validation set during training. We randomly sample 5% of the training set (namely, 890 data instances) as the validation set. The precision and recall of each $c \in C = \{\text{Mechanism}, \text{Effect}, \text{Int}, \text{Advice}\}$ are calculated by:

$P_c = \frac{\#\text{instances correctly classified as } c}{\#\text{instances classified as } c}$    (25)

$R_c = \frac{\#\text{instances correctly classified as } c}{\#\text{true instances of type } c}$    (26)

The overall precision, recall and F1-score are calculated by:

$P = \frac{1}{|C|} \sum_{c \in C} P_c$    (27)

$R = \frac{1}{|C|} \sum_{c \in C} R_c$    (28)

$F1 = \frac{1}{|C|} \sum_{c \in C} \frac{2 P_c R_c}{P_c + R_c}$    (29)

Note that False is excluded when calculating these metrics, because the majority of the data comes from false interactions, which are not the goal of our prediction.
3.3 Hyper-parameters
We implemented our WLA Bi-LSTM using the Tensorflow9 package. The parameters are summarized in Table 3. The parameter N is set to 250 in Table 3 because it gave the best performance in our framework; other values, including 100, 150, 200, 300 and 350, were also tried but did not perform as well. Our model ran on a desktop computer with a 3.52 GHz AMD CPU and 16 GB of DDR3 memory, with an Nvidia GeForce GTX 950 graphics card to speed up the training process.

Table 3. Hyper-parameters in our study
Parameter       Parameter name                                   Value
U               Word embedding size                              200
S               Distance embedding size                          11
Mini-batch      Number of instances used to train at one time    150
N               Number of hidden units for LSTM                  250
learning_rate   Learning rate for the Adam optimizer             0.01

9 https://www.tensorflow.org/
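For concreteness, a Keras-style assembly of the network with the hyper-parameters of Table 3 might look as follows; this is a hedged sketch only, since the word-level attention layer of section 2.3.2 is a custom component (replaced here by a placeholder pooling layer) and the original implementation was written directly in TensorFlow.

```python
import tensorflow as tf

D, N, NUM_CLASSES = 222, 250, 5   # embedding size, hidden units, classes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, D)),   # concatenated embeddings, Eq. (1)
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(N, return_sequences=True)),
    # Placeholder for the custom word-level attention layer of section 2.3.2:
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="categorical_crossentropy")
```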


3.4 Training
To avoid the overfitting problem (where performance on the training set is good but generalization to the test set is poor), we sampled 5% of the training set as a validation set to monitor the training process, and we kept the model that performed best on the validation set. Figure 6 shows the F1-score of both the training and validation sets over 50 epochs of training.

Figure 6. F1-score for both training set and validation set during the training process

From Figure 6, we can see that results on both the training and validation sets improve as training proceeds. Before epoch 16, the F1 value on the training set increases quickly. The F1 value on the validation set also increases rapidly before epoch 25, reaching a high of 0.6472. After epoch 25, even though performance on the training set continues to improve slowly, performance on the validation set starts to drop, a sign of overfitting as the model keeps fitting the training data ever more closely. We therefore keep the model trained at epoch 25 and use it to evaluate performance on the test set.
3.5 Experimental Results
Using the trained model from epoch 25 (section 3.4), we evaluate the test set and summarize the results in Table 4 and Table 5.

Table 4. Confusion Matrix for DDIs Classification
True Type    True Number   Predicted Type
                           Mechanism   Effect   Advice   Int   False
Mechanism    302           171         25       10       0     96
Effect       360           2           254      10       1     93
Advice       221           0           7        151      2     61
Int          96            0           18       0        69    9
False        3,025         27          73       27       5     2,893
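As a quick arithmetic check, the per-class metrics of Eqs. (25)-(26) can be recomputed directly from the confusion matrix in Table 4; the values printed below match the corresponding rows of Table 5.

```python
import numpy as np

# Rows = true type, columns = predicted type (Mechanism, Effect, Advice, Int, False).
cm = np.array([[171,  25,  10,  0,   96],
               [  2, 254,  10,  1,   93],
               [  0,   7, 151,  2,   61],
               [  0,  18,   0, 69,    9],
               [ 27,  73,  27,  5, 2893]])

for c, name in enumerate(["Mechanism", "Effect", "Advice", "Int"]):
    p = cm[c, c] / cm[:, c].sum()    # precision, Eq. (25)
    r = cm[c, c] / cm[c, :].sum()    # recall, Eq. (26)
    f1 = 2 * p * r / (p + r)
    print(f"{name}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```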


Table 5. Evaluation Results using Different Metrics
Type         Precision   Recall   F1-Score
Mechanism    0.855       0.566    0.681
Effect       0.674       0.706    0.689
Advice       0.763       0.683    0.721
Int          0.896       0.719    0.798
All          0.757       0.659    0.705

Given the evaluation results in Tables 4 and 5, we notice that relatively few instances of other types (“Effect”, “Advice”, “False”) are misclassified as “Mechanism” or “Int”, so both of those types get relatively high precision scores. In contrast, many instances are misclassified as “Effect”, leading to a lower precision score for “Effect”. The reason is that most sentences in the training set labelled “Effect” do not contain terms strongly indicating a pharmacodynamic mechanism, while instances of other types have a similar syntactic structure without words that clearly distinguish them from “Effect”. Another significant result is that many instances of type “Mechanism” are misclassified as “Effect” or “False”. After digging into the “Mechanism” instances in the test set, we found that many of them have very long word sequences, with a maximum length of 141. In addition, those instances have many clauses embedded in the sentences, making them very complicated. Future work could incorporate syntactic information about sentences to address this problem. To demonstrate the advantages of our proposed model, we compare our performance results with existing methods as baselines and summarize the results in Table 6. Our proposed model outperforms the baselines on both Recall and F1 scores. Our future work is focused on improving the precision (and recall).

Table 6. Performance comparison with past work
Studies                        Methods                         Precision   Recall   F1-Score
(Chowdhury and Lavelli, 2013)  Manual features + SVM           0.646       0.656    0.651
(Thomas et al., 2013)          Two stages + SVM                0.642       0.579    0.609
(Kim et al., 2015)             Linear kernel method            0.732       0.499    0.594
(Liu et al., 2016)             Dependency CNN                  0.772       0.644    0.701
(Zhao et al., 2016)            Syntax CNN                      0.725       0.651    0.686
Our model                      Word Level Attention Bi-LSTM    0.757       0.659    0.705

3.6 Interaction Terms
To demonstrate that our proposed word attention method can learn the contribution of words to the final classification, we extract four sentences containing interacting drugs from the test set and visualize the attention weights calculated by the trained model for the words in those sentences. The attention weights are shown in Figure 7. We can clearly see that the highlighted words strongly indicate the interaction between the drugs.



Figure 7. Attention weights for four example sentences from the test set

4. Conclusion
Automated DDIs mining from scientific literature is an important problem. In this study, we proposed a novel word-level attention bi-directional LSTM method to extract potential DDIs from biomedical literature. Our approach overcomes the disadvantages of feature-based and kernel-based methods in that it requires no laborious feature engineering while maintaining a high level of performance. We also improve on the performance of existing deep neural network-based methods by adapting the attention mechanism to the domain of drug interactions, advancing the development of adverse drug event alerting systems. Our results show that our model outperforms state-of-the-art models in terms of average Recall and F1-scores. Our work also shows that the word attention mechanism can identify interaction terms in scientific literature, helping to extract previously unknown DDIs and reveal more interaction information hidden in the literature. Our work has some limitations. One challenge of the DDIs dataset is that it suffers from the imbalanced class problem: even though we filtered out many negative instances, the dataset still contains a relatively large number of them. A future research direction is to combine rebalancing techniques to obtain more balanced data. The Generative Adversarial Network, recently proposed to generate realistic synthetic data from a sample set, is a promising technique that we will explore to improve performance. Overall, notwithstanding the limitations, our work has the best F1 and recall performance to date. Given that drug interactions can have serious adverse effects, to the point of fatalities, we believe that improving recall (i.e., not missing potential drug interactions) while maintaining high precision is vital, and our paper makes an important contribution towards that goal.

References:
Björne, J., Kaewphan, S. and Salakoski, T. (2013), “UTurku: Drug Named Entity Recognition

and Drug-Drug Interaction Extraction Using SVM Classification and Domain Knowledge”, Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Association for Computational Linguistics, Atlanta, Georgia, USA, pp. 651–659.

Chawla, N.V. (2009), “Data Mining for Imbalanced Datasets: An Overview”, Data Mining and Knowledge Discovery Handbook, Springer, Boston, MA, pp. 875–886.

Chowdhury, M.F.M. and Lavelli, A. (2013), “FBK-irst : A Multi-Phase Kernel Based Approach for Drug-Drug Interaction Detection and Classification that Exploits Linguistic Information”, Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic



Evaluation (SemEval 2013), Association for Computational Linguistics, Atlanta, Georgia, USA, pp. 351–355.

Fathelrahman, A.I. (2018), “Chapter 22 - Issues on Source, Access, Extent, and Quality of Information Available Among Pharmacists and Pharmacy Personnel to Practice Effectively”, in Ibrahim, M.I.M., Wertheimer, A.I. and Babar, Z.-U.-D. (Eds.), Social and Administrative Aspects of Pharmacy in Low- and Middle-Income Countries, Academic Press, pp. 363–383.

Gers, F.A., Schmidhuber, J. and Cummins, F. (1999), “Learning to forget: continual prediction with LSTM”, 9th International Conference on Artificial Neural Networks (ICANN ’99), pp. 850–855.

Graves, A. and Schmidhuber, J. (2005), “Framewise phoneme classification with bidirectional LSTM and other neural network architectures”, Neural Networks, Vol. 18 No. 5, pp. 602–610.

Grossberg, S. (2013), “Recurrent Neural Networks”, Scholarpedia, Vol. 8 No. 2, p. 1888.
Hayashi, Y., Godai, A., Yamada, M., Yoshikura, N., Harada, N., Koumura, A., Kimura, A., et al.

(2017), “Reduction in the numbers of drugs administered to elderly in-patients with polypharmacy by a multidisciplinary review of medication using electronic medical records”, Geriatrics & Gerontology International, Vol. 17 No. 4, pp. 653–658.

Herrero-Zazo, M., Segura-Bedmar, I., Martínez, P. and Declerck, T. (2013), “The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions”, Journal of Biomedical Informatics, Vol. 46 No. 5, pp. 914–920.

Hinton, G.E. and Roweis, S.T. (2003), “Stochastic Neighbor Embedding”, Advances in Neural Information Processing Systems.

Hochreiter, S. and Schmidhuber, J. (1997), “Long Short-Term Memory”, Neural Computation, Vol. 9 No. 8, pp. 1735–1780.

Hotho, A., Staab, S. and Stumme, G. (2003), “Ontologies improve text document clustering”, Third IEEE International Conference on Data Mining, presented at the Third IEEE International Conference on Data Mining, pp. 541–544.

Huang, Z., Xu, W. and Yu, K. (2015), “Bidirectional LSTM-CRF Models for Sequence Tagging”, ArXiv:1508.01991 [Cs], available at: http://arxiv.org/abs/1508.01991 (accessed 13 August 2018).

Ilievski, I., Yan, S. and Feng, J. (2016), “A Focused Dynamic Attention Model for Visual Question Answering”, ArXiv:1604.01485 [Cs], available at: http://arxiv.org/abs/1604.01485 (accessed 11 August 2018).

Kim, S., Liu, H., Yeganova, L. and Wilbur, W.J. (2015), “Extracting drug-drug interactions from literature using a rich feature-based linear kernel approach”, Journal of Biomedical Informatics, Vol. 55, pp. 23–30.

Kingma, D.P. and Ba, J. (2014), “Adam: A Method for Stochastic Optimization”, ArXiv:1412.6980 [Cs], available at: http://arxiv.org/abs/1412.6980 (accessed 13 August 2018).

LeCun, Y., Bengio, Y. and Hinton, G. (2015), “Deep learning”, Nature, Vol. 521 No. 7553, pp. 436–444.

Li, F., Zhang, M., Fu, G. and Ji, D. (2017), “A neural joint model for entity and relation extraction from biomedical text”, BMC Bioinformatics, Vol. 18, p. 198.

Li, S., Li, W., Cook, C., Zhu, C. and Gao, Y. (2018), “Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN”, available at: https://arxiv.org/abs/1803.04831 (accessed 13 August 2018).


Liu, S., Chen, K., Chen, Q. and Tang, B. (2016), “Dependency-based convolutional neural network for drug-drug interaction extraction”, 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), presented at the 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1074–1080.

Luong, M.-T., Pham, H. and Manning, C.D. (2015), “Effective Approaches to Attention-based Neural Machine Translation”, ArXiv:1508.04025 [Cs], available at: http://arxiv.org/abs/1508.04025 (accessed 11 August 2018).

Mikolov, T., Le, Q.V. and Sutskever, I. (2013), “Exploiting Similarities among Languages for Machine Translation”, ArXiv:1309.4168 [Cs], available at: http://arxiv.org/abs/1309.4168 (accessed 12 August 2018).

Mikolov, T., Yih, S.W. and Zweig, G. (2013), “Linguistic Regularities in Continuous Space Word Representations”, Microsoft Research, available at: https://www.microsoft.com/en-us/research/publication/linguistic-regularities-in-continuous-space-word-representations/ (accessed 12 August 2018).

Slight, S.P., Seger, D.L., Franz, C., Wong, A. and Bates, D.W. (2018), “The national cost of adverse drug events resulting from inappropriate medication-related alert overrides in the United States”, Journal of the American Medical Informatics Association.

Sutskever, I., Vinyals, O. and Le, Q.V. (2014), “Sequence to Sequence Learning with Neural Networks”, in Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D. and Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems 27, Curran Associates, Inc., pp. 3104–3112.

Tang, D., Qin, B. and Liu, T. (2015), “Document Modeling with Gated Recurrent Neural Network for Sentiment Classification”, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, pp. 1422–1432.

Thomas, P., Neves, M., Rocktäschel, T. and Leser, U. (2013), “WBI-DDI: Drug-Drug Interaction Extraction using Majority Voting”, Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Association for Computational Linguistics, Atlanta, Georgia, USA.

Wöllmer, M., Weninger, F., Knaup, T., Schuller, B., Sun, C., Sagae, K. and Morency, L. (2013), “YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context”, IEEE Intelligent Systems, Vol. 28 No. 3, pp. 46–53.

Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., et al. (2016), “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”, ArXiv:1609.08144 [Cs], available at: http://arxiv.org/abs/1609.08144 (accessed 11 August 2018).

Yang, Z., He, X., Gao, J., Deng, L. and Smola, A. (2016), “Stacked Attention Networks for Image Question Answering”, presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–29.

Zeng, D., Liu, K., Lai, S., Zhou, G. and Zhao, J. (2014), “Relation Classification via Convolutional Deep Neural Network”, Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin City University and Association for Computational Linguistics, Dublin, Ireland, pp. 2335–2344.

Zhang, D. and Wang, D. (2015), “Relation Classification via Recurrent Neural Network”, ArXiv:1508.01006 [Cs], available at: http://arxiv.org/abs/1508.01006 (accessed 11 August 2018).


Zhao, Z., Yang, Z., Luo, L., Lin, H. and Wang, J. (2016), “Drug drug interaction extraction from biomedical literature using syntax convolutional neural network”, Bioinformatics, Vol. 32 No. 22, pp. 3444–3453.

Zheng, W., Lin, H., Zhao, Z., Xu, B., Zhang, Y., Yang, Z. and Wang, J. (2016), “A graph kernel based on context vectors for extracting drug–drug interactions”, Journal of Biomedical Informatics, Vol. 61, pp. 34–43.