
Semantic Similarity Detection - Project Report
Pankaj Kabra

Contents

Abstract

Problem Description and Dataset

Exploratory Data Analysis

Methodology

Implementation & Code

Results

Conclusion

References


Abstract
In this project, I address the problem of semantic similarity detection in text. We often need to recognize that two sentences are similar in meaning even though they are built from different words. English is a versatile language: words mean different things in different contexts, a word may be relevant in one context and irrelevant in another, and people often use synonyms, which makes it hard for a machine to decide whether two sentences mean the same thing. Quora, a question-and-answer website, faces this problem in the form of duplicate questions: instead of finding a similar question that has already been asked, visitors post a new question, which leads to many duplicates. Quora therefore hosted a competition called "Quora Question Pairs", challenging participants to detect duplicate questions with high accuracy. Using the Quora Question Pairs dataset, I develop an algorithm for semantic similarity of textual data. I obtained an accuracy of 82% on the validation dataset; the highest reported accuracy on this task, achieved by a Stanford team, is 87%.

Problem Description and Dataset
In the official competition description, Quora described semantic similarity detection as very important to maintaining its service. The official description of the problem follows.

Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers and offer more value to both groups in the long term.


Dataset - ~400 MB of Text

The data has three columns: a pair of question texts and a binary variable indicating whether the two questions are duplicates. A sample of the dataset is shown below.

Exploratory Data Analysis
The Quora Question Pairs data looks as shown in Figure 1. There are roughly 404K question pairs like these available for training.

Figure 1: Input Data
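For reference, the training file can be loaded and inspected with a few lines of pandas. This is a minimal sketch, assuming the standard Kaggle train.csv layout (columns id, qid1, qid2, question1, question2, is_duplicate) and the path used by datatf.py below:

import pandas as pd

# load the Kaggle Quora Question Pairs training file
df = pd.read_csv('kaggle_qqp/train.csv').fillna('')
print(df.shape)  # roughly 404K rows, 6 columns
print(df[['question1', 'question2', 'is_duplicate']].head())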

For these question pairs, I check the length distribution of the questions. As Figure 2 shows, Question 1 and Question 2 have similar distributions, with peaks in the same places. It would be interesting, in future work, to find out why both questions show a small peak at a length of around 180.


Figure 2: Length Distribution of Q1 and Q2
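The distributions in Figure 2 can be reproduced with a short matplotlib sketch (assuming the dataframe df loaded above, with lengths measured in characters):

import matplotlib.pyplot as plt

df.question1.str.len().hist(bins=100, alpha=0.5, label='Question 1')
df.question2.str.len().hist(bins=100, alpha=0.5, label='Question 2')
plt.xlabel('Question length (characters)')
plt.ylabel('Number of questions')
plt.legend()
plt.show()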

Since I treat this problem as a classification problem, I examine whether the target variable is imbalanced. There are two classes, 0 and 1, indicating not-similar and similar questions. In Figure 3 we can see that class 0 occurs about 250K times and class 1 about 150K times. Although the data is not split 50-50, the imbalance is not extreme, and I believe no imbalance treatment is needed, as there are enough data points for each class.

Figure 3: Distribution of the Classes 0 and 1
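The class counts in Figure 3 come directly from the target column; a couple of lines suffice (same df as above):

# class 0 (not duplicate) occurs ~250K times, class 1 (duplicate) ~150K times
print(df.is_duplicate.value_counts())
print(df.is_duplicate.mean())  # fraction of duplicates, roughly 0.37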


Next, I examine the most frequently used words in the questions, which is best visualized with a word cloud. Figure 4 shows a word cloud of the text in Question 1 and Question 2 combined. We can see that the topics are drawn from various fields such as politics, science, health, and finance. This gives a good sense of the type of data we are dealing with.

Figure 4: Word Cloud using both the Questions.
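A figure like Figure 4 can be generated with a word-cloud library; a minimal sketch using the third-party wordcloud package (an assumption; any similar tool works), with df as above:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# combine both question columns into one text corpus
text = ' '.join(df.question1.astype(str)) + ' ' + ' '.join(df.question2.astype(str))
wc = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()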

Methodology

I use the Siamese recurrent neural network methodology described in [1]. Figure 6 shows a pictorial description of the neural network architecture used: Question 1 is fed into the first LSTM and Question 2 into the second LSTM, and the resulting encodings are compared using the exponential Manhattan similarity.
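For concreteness, the similarity used in [1] is the exponential of the negative Manhattan (L1) distance between the two LSTM encodings, exp(-||h1 - h2||_1). A minimal NumPy sketch (the function name is mine, not from the paper's code):

import numpy as np

def malstm_similarity(h1, h2):
    # exp(-||h1 - h2||_1): equals 1.0 for identical encodings and decays
    # toward 0 as the encodings move apart, so it behaves like a probability
    return np.exp(-np.sum(np.abs(h1 - h2), axis=-1))

print(malstm_similarity(np.array([1.0, 2.0]), np.array([1.0, 2.5])))  # ~0.607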

After learning, the model produces embeddings such that positive examples move closer together and negative examples move farther apart, as illustrated in Figure 5. Such embeddings are the core of any similarity detection mechanism; here they are obtained from the last layer of the LSTM.

Figure 5: Embedding Space Pre and Post Network Learning

Figure 6: Neural Network Architecture for Similarity Detection. Figure taken from [1].

After implementing the simple Siamese network above, I could not get accuracy above 78%. So I changed the approach: I stacked more layers on top of the network and added more arms to it, as sketched below. Strictly speaking, the result is no longer a Siamese network, since Siamese networks have only two identical branches.
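Conceptually, the change works as follows: instead of comparing two identical branches with a single distance, each question passes through several different encoders ("arms"), and all arm outputs are concatenated and fed to a classifier. A toy NumPy illustration (the identity encoders here are placeholders, not the real network):

import numpy as np

def multi_arm_features(q1_vec, q2_vec, encoders):
    # run every encoder on both questions and concatenate all outputs;
    # a classifier on top replaces the single distance of a true Siamese net
    arms = [enc(v) for enc in encoders for v in (q1_vec, q2_vec)]
    return np.concatenate(arms, axis=-1)

feats = multi_arm_features(np.ones(4), np.zeros(4),
                           [lambda x: x, lambda x: x * 2.0])
print(feats.shape)  # (16,)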


Implementation & Code
I use TensorFlow (with tf.keras preprocessing utilities) to implement the architecture described above, with many additional layers stacked on top; in my experiments, increasing the number of layers increased accuracy. I also added 1-D convolutional layers, which are generally faster than RNNs for NLP tasks. In the basic flow, the questions are converted to embeddings, passed through the network arms, merged, and compared to produce the output.

The code is divided into four files, each serving a different purpose.

Train.py – trains the model and dumps the trained model to disk.

# from tqdm import tqdm
import tensorflow as tf
import numpy as np
import utilstf as utl
import pickle
import logging
import time
import subprocess

# logtime = time.strftime("%d%H%M%S")
logging.basicConfig(filename='logfile.log', level=logging.DEBUG)

### loading the pre-processed data produced by datatf.py
train_x1 = pickle.load(open("interim_data_tf/train_x1.pkl", "rb"))
train_x2 = pickle.load(open("interim_data_tf/train_x2.pkl", "rb"))
train_y = pickle.load(open("interim_data_tf/train_y.pkl", "rb"))
embedding_matrix = pickle.load(open("interim_data_tf/embedding_matrix.pkl", "rb"))
word_index = pickle.load(open("interim_data_tf/word_index.pkl", "rb"))
tk = pickle.load(open("interim_data_tf/tk.pkl", "rb"))
logging.info("read the data")

max_features = 200000
filter_length = 5
nb_filter = 64
pool_length = 4
learning_rate = 0.001
max_len = 40

graph = tf.Graph()
graph.seed = 1


### building the graph and layers
with graph.as_default():
    ph_q1 = tf.placeholder(tf.int32, shape=(None, max_len), name="ph_q1")
    ph_q2 = tf.placeholder(tf.int32, shape=(None, max_len), name="ph_q2")
    ph_y = tf.placeholder(tf.float32, shape=(None, 1), name="ph_y")
    ph_training = tf.placeholder(tf.bool, shape=(), name="ph_training")

    # frozen GloVe embedding table plus a separate trainable embedding table
    glove = tf.Variable(embedding_matrix, trainable=False)
    q1_glove_lookup = tf.nn.embedding_lookup(glove, ph_q1)
    q2_glove_lookup = tf.nn.embedding_lookup(glove, ph_q2)

    emb_size = len(word_index) + 1
    emb_dim = 300
    emb_std = np.sqrt(2 / emb_dim)
    emb = tf.Variable(tf.random_uniform([emb_size, emb_dim], -emb_std, emb_std))
    q1_emb_lookup = tf.nn.embedding_lookup(emb, ph_q1)
    q2_emb_lookup = tf.nn.embedding_lookup(emb, ph_q2)

    # arms 1 and 2: time-distributed dense over the GloVe vectors, summed over time
    layer1 = q1_glove_lookup
    layer1 = utl.time_distributed_dense(layer1, 300)
    layer1 = tf.reduce_sum(layer1, axis=1)

    layer2 = q2_glove_lookup
    layer2 = utl.time_distributed_dense(layer2, 300)
    layer2 = tf.reduce_sum(layer2, axis=1)

    # arms 3 and 4: stacked 1-D convolutions over the GloVe vectors
    layer3 = q1_glove_lookup
    layer3 = utl.conv1d(layer3, nb_filter, filter_length, padding='valid')
    layer3 = tf.layers.dropout(layer3, rate=0.2, training=ph_training)
    layer3 = utl.conv1d(layer3, nb_filter, filter_length, padding='valid')
    layer3 = utl.maxpool1d_global(layer3)
    layer3 = tf.layers.dropout(layer3, rate=0.2, training=ph_training)
    layer3 = utl.dense(layer3, 300)
    layer3 = tf.layers.dropout(layer3, rate=0.2, training=ph_training)
    layer3 = tf.layers.batch_normalization(layer3, training=ph_training)

    layer4 = q2_glove_lookup
    layer4 = utl.conv1d(layer4, nb_filter, filter_length, padding='valid')
    layer4 = tf.layers.dropout(layer4, rate=0.2, training=ph_training)
    layer4 = utl.conv1d(layer4, nb_filter, filter_length, padding='valid')
    layer4 = utl.maxpool1d_global(layer4)
    layer4 = tf.layers.dropout(layer4, rate=0.2, training=ph_training)
    layer4 = utl.dense(layer4, 300)
    layer4 = tf.layers.dropout(layer4, rate=0.2, training=ph_training)
    layer4 = tf.layers.batch_normalization(layer4, training=ph_training)

    # arms 5 and 6: LSTMs over the trainable embeddings
    layer5 = q1_emb_lookup
    layer5 = tf.layers.dropout(layer5, rate=0.2, training=ph_training)
    layer5 = utl.lstm(layer5, size_hidden=300, size_out=300)

    layer6 = q2_emb_lookup
    layer6 = tf.layers.dropout(layer6, rate=0.2, training=ph_training)
    layer6 = utl.lstm(layer6, size_hidden=300, size_out=300)

    # concatenate all six arms (fixed: the original referenced undefined model1..model6)
    merged = tf.concat([layer1, layer2, layer3, layer4, layer5, layer6], axis=1)
    # merged = tf.concat([layer1, layer2], axis=1)
    merged = tf.layers.batch_normalization(merged, training=ph_training)

    for i in range(5):
        merged = utl.dense(merged, 300, activation=tf.nn.relu)
        merged = tf.layers.dropout(merged, rate=0.2, training=ph_training)
        merged = tf.layers.batch_normalization(merged, training=ph_training)

    pred_proba = tf.layers.dense(
        merged, units=1, activation=tf.nn.sigmoid,
        kernel_initializer=tf.random_normal_initializer(stddev=np.sqrt(2 / int(merged.shape[1]))),
        name="pred_proba")
    pred_round = tf.round(pred_proba)

    loss = tf.losses.log_loss(ph_y, pred_proba)
    accuracy = tf.reduce_mean(tf.cast(tf.equal(ph_y, pred_round), 'float32'), name="accuracy")

    opt = tf.train.AdamOptimizer(learning_rate=learning_rate)

    # for batchnorm: run the update ops together with the optimizer step


    extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(extra_update_ops):
        step = opt.minimize(loss)

    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

logging.info("graph created")

np.random.seed(1)
n_all, _ = train_y.shape
idx = np.arange(n_all)
np.random.shuffle(idx)

### splitting the data into training and validation sets (90/10)
n_split = n_all // 10
idx_val = idx[:n_split]
idx_train = idx[n_split:]

x1_train = train_x1[idx_train]
x2_train = train_x2[idx_train]
y_train = train_y[idx_train]

x1_val = train_x1[idx_val]
x2_val = train_x2[idx_val]
y_val = train_y[idx_val]

val_idx = np.arange(y_val.shape[0])
val_batches = utl.prepare_batches(val_idx, 5000)

n_epochs = 20
logging.info("starting training")

### start training
with tf.Session(config=None, graph=graph) as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        np.random.seed(epoch)
        train_idx_shuffle = np.arange(y_train.shape[0])
        np.random.shuffle(train_idx_shuffle)
        batches = utl.prepare_batches(train_idx_shuffle, 384)
        # progress = tqdm(total=len(batches))
        for batch, idx in enumerate(batches):
            feed_dict = {
                ph_q1: x1_train[idx],
                ph_q2: x2_train[idx],
                ph_y: y_train[idx],
                ph_training: True,
            }
            _, acc, lss = sess.run([step, accuracy, loss], feed_dict)
            # progress.update(1)
            # progress.set_description('%.3f / %.3f' % (acc, lss))
            if batch % 50 == 0:
                logging.info(' accu: %.1f%%, loss: %.3f, batch: %.0f/%.0f, epoch: %.0f/%.0f'
                             % (acc * 100, lss, batch, len(batches), epoch + 1, n_epochs))
        saver.save(sess, "tf_models/model.ckpt")

        # evaluate on the validation split after every epoch
        y_pred = np.zeros_like(y_val)
        for idx in val_batches:
            feed_dict = {
                ph_q1: x1_val[idx],
                ph_q2: x2_val[idx],
                ph_y: y_val[idx],
                ph_training: False,
            }
            y_pred[idx, :] = sess.run(pred_round, feed_dict)
        logging.info(' **** val accu: %0.1f%%, epoch= %.0f/%.0f **** '
                     % (np.mean(y_val == y_pred) * 100, epoch + 1, n_epochs))

try:
    ## only meant to run on Euler: archive the saved models
    logging.info("started: dumping tfmodels.tar.gz")
    subprocess.call("tar -zcvf tf_models.tar.gz tf_models/".split())
    logging.info("completed: dumping tfmodels.tar.gz")
except:
    pass

utilstf.py – utility functions shared by the other scripts.

import tensorflow as tf
import numpy as np

### prepare batches of indices for feeding into the network
def prepare_batches(seq, step):
    n = len(seq)
    batches = []
    for i in range(0, n, step):
        batches.append(seq[i:i + step])
    return batches

### dense layer with He initialization, which gives better convergence with ReLU
def dense(X, size, activation=None):
    he_std = np.sqrt(2 / int(X.shape[1]))
    out = tf.layers.dense(X, units=size, activation=activation,
                          kernel_initializer=tf.random_normal_initializer(stddev=he_std))
    return out

### 1-D convolutional layer
def conv1d(inputs, num_filters, filter_size, padding='same'):
    he_std = np.sqrt(2 / (filter_size * num_filters))
    out = tf.layers.conv1d(
        inputs=inputs,
        filters=num_filters,
        padding=padding,
        kernel_size=filter_size,
        activation=tf.nn.relu,
        kernel_initializer=tf.random_normal_initializer(stddev=he_std))
    return out

### global max-pooling over the time axis
def maxpool1d_global(X):
    out = tf.reduce_max(X, axis=1)
    return out

### time-distributed dense layer, so that we get an output at each timestep
def time_distributed_dense(X, dense_size):
    shape = X.shape.as_list()
    assert len(shape) == 3
    _, w, d = shape
    X_reshaped = tf.reshape(X, [-1, d])
    H = dense(X_reshaped, dense_size, tf.nn.relu)
    return tf.reshape(H, [-1, w, dense_size])

### LSTM over the sequence; returns a projection of the last hidden state
def lstm(X, size_hidden, size_out):
    with tf.variable_scope('lstm_%d' % np.random.randint(0, 100)):
        he_std = np.sqrt(2 / (size_hidden * size_out))
        W = tf.Variable(tf.random_normal([size_hidden, size_out], stddev=he_std))
        b = tf.Variable(tf.zeros([size_out]))
        size_time = int(X.shape[1])
        X = tf.unstack(X, size_time, axis=1)
        lstm_cell = tf.nn.rnn_cell.LSTMCell(size_hidden, name='basic_lstm_cell', forget_bias=1.0)
        outputs, states = tf.contrib.rnn.static_rnn(lstm_cell, X, dtype='float32')
        out = tf.matmul(outputs[-1], W) + b
        return out
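As a quick shape check, the helpers above can be exercised on a placeholder input; a sketch, using the TensorFlow 1.x API as in the rest of the code:

import tensorflow as tf
import utilstf as utl

x = tf.placeholder(tf.float32, shape=(None, 40, 300))  # (batch, time, features)
h = utl.time_distributed_dense(x, 300)                 # -> (batch, 40, 300)
c = utl.conv1d(x, 64, 5, padding='valid')              # -> (batch, 36, 64)
g = utl.maxpool1d_global(c)                            # -> (batch, 64)
print(h.shape, c.shape, g.shape)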

datatf.py – processes the raw data into the format required for training.

import zipfile
import pandas as pd
import tensorflow as tf
import pickle
import numpy as np
from tqdm import tqdm
import logging

### data paths
TRAIN_CSV = 'kaggle_qqp/train.csv'
TEST_CSV = 'kaggle_qqp/test.csv'
EMBEDDING_FILE = 'embeddings/glove.840B.300d.zip'

# load the training set
train_df = pd.read_csv(TRAIN_CSV)
# test_df = pd.read_csv(TEST_CSV)
train_df = train_df.fillna('')
train_y = train_df.is_duplicate.values
train_y = train_y.astype('float32').reshape(-1, 1)

logging.info("starting tokenizing")

### tokenizing and padding the question texts
Tokenizer = tf.keras.preprocessing.text.Tokenizer
pad_sequences = tf.keras.preprocessing.sequence.pad_sequences
tk = Tokenizer(num_words=200000)
max_len = 40
tk.fit_on_texts(list(train_df.question1) + list(train_df.question2))
train_x1 = tk.texts_to_sequences(train_df.question1)
train_x1 = pad_sequences(train_x1, maxlen=max_len)
train_x2 = tk.texts_to_sequences(train_df.question2)
train_x2 = pad_sequences(train_x2, maxlen=max_len)
word_index = tk.word_index

logging.info("starting embedding processing")

### build the GloVe embedding matrix for words in the vocabulary
embedding_matrix = np.zeros((len(word_index) + 1, 300), dtype='float32')
glove_zip = zipfile.ZipFile(EMBEDDING_FILE)
glove_file = glove_zip.filelist[0]
f_in = glove_zip.open(glove_file)
for line in tqdm(f_in):
    values = line.split(b' ')
    word = values[0].decode()
    if word not in word_index:
        continue
    i = word_index[word]
    coefs = np.asarray(values[1:], dtype='float32')
    embedding_matrix[i, :] = coefs
f_in.close()
glove_zip.close()

logging.info("starting pickle dumps")

### dumping processed data files for picking up at runtime
pickle.dump(train_x1, open("interim_data_tf/train_x1.pkl", "wb"))
pickle.dump(train_x2, open("interim_data_tf/train_x2.pkl", "wb"))
pickle.dump(train_y, open("interim_data_tf/train_y.pkl", "wb"))
pickle.dump(embedding_matrix, open("interim_data_tf/embedding_matrix.pkl", "wb"))
pickle.dump(word_index, open("interim_data_tf/word_index.pkl", "wb"))
pickle.dump(tk, open("interim_data_tf/tk.pkl", "wb"))
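After running datatf.py, the pickled arrays can be sanity-checked quickly (a sketch):

import pickle

x1 = pickle.load(open("interim_data_tf/train_x1.pkl", "rb"))
y = pickle.load(open("interim_data_tf/train_y.pkl", "rb"))
print(x1.shape)  # (num_pairs, 40): every question padded/truncated to max_len
print(y.shape)   # (num_pairs, 1): binary duplicate labels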

eval.py – scores the test set using the saved model.

import tensorflow as tf
import pandas as pd
import numpy as np
import pickle
# from copy import deepcopy
import logging

logging.basicConfig(filename='logfile.log', level=logging.DEBUG)

### loading saved files from datatf.py
word_index = pickle.load(open("interim_data_tf/word_index.pkl", "rb"))
tk = pickle.load(open("interim_data_tf/tk.pkl", "rb"))

### test-time hyperparameters (must match training)
max_features = 200000
filter_length = 5
nb_filter = 64
pool_length = 4
learning_rate = 0.001
max_len = 40

padseq = tf.keras.preprocessing.sequence.pad_sequences

def convert_text(txt, tokenizer, padder):
    x = tokenizer.texts_to_sequences(txt)
    x = padder(x, maxlen=max_len)
    return x

TEST_CSV = 'kaggle_qqp/test.csv'
test_df = pd.read_csv(TEST_CSV)
test_df.set_index("test_id", inplace=True)
# test_df = deepcopy(test_df[:1750])

### initializing the TensorFlow session and restoring the trained graph
sess = tf.Session()
saver = tf.train.import_meta_graph('tf_models/model.ckpt.meta')
saver.restore(sess, "tf_models/model.ckpt")
graph = tf.get_default_graph()
pred_proba = graph.get_tensor_by_name("pred_proba/Sigmoid:0")

bsize = 1500  # batch size for scoring
output = np.array([])
for i in list(range(0, len(test_df), bsize)):
    logging.info(i)
    q1l = np.vstack(test_df.iloc[i:i + bsize].question1
                    .apply(lambda x: convert_text(str(x), tk, padseq)[0]).tolist())
    q2l = np.vstack(test_df.iloc[i:i + bsize].question2
                    .apply(lambda x: convert_text(str(x), tk, padseq)[0]).tolist())
    feed_dict = {
        "ph_q1:0": q1l,
        "ph_q2:0": q2l,
        "ph_y:0": np.zeros((1, 1)),
        "ph_training:0": False,
    }
    # threshold at 0.5 (fixed: the original astype(int) truncated every probability to 0)
    out = (sess.run(pred_proba, feed_dict) > 0.5).astype(int)
    output = np.append(output, out)

test_df['is_duplicate'] = output.reshape(-1, 1)
test_df[['is_duplicate']].to_csv("submission.csv")  # filename typo "sumbission" fixed
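To reproduce the pipeline end to end, the scripts are run in order: datatf.py builds the pickled inputs and embedding matrix, Train.py fits the network and checkpoints it to tf_models/, and eval.py restores the checkpoint and writes the test predictions to submission.csv.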

Results
I chose the model from iteration 11: after that point, the training accuracy kept increasing while the validation accuracy slightly decreased or remained the same, which suggests the model was overfitting.

Figure 7: Training and validation accuracy as a function of epochs. (The plot shows accuracy from 0 to 100% on the y-axis against epochs 1-15 on the x-axis, with one curve each for training accuracy and validation accuracy.)
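The selection rule amounts to picking the epoch with the best validation accuracy; an illustrative sketch (the accuracy values here are made up, not the actual run):

def best_epoch(val_accs):
    # 1-based index of the epoch with the highest validation accuracy
    return max(range(len(val_accs)), key=lambda e: val_accs[e]) + 1

print(best_epoch([0.76, 0.79, 0.81, 0.82, 0.81]))  # 4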


I also performed a subjective evaluation of the model; the results follow.

Conclusion
An accuracy of 82% was obtained in this exercise. The accuracy could be increased further using attention mechanisms. Beyond text, the Siamese similarity architecture can be applied to a wide variety of tasks, such as recommendation systems and face identification, among others.

References
1) Mueller, Jonas, and Aditya Thyagarajan. "Siamese Recurrent Architectures for Learning Sentence Similarity." AAAI. Vol. 16. 2016.
2) Neculoiu, Paul, Maarten Versteegh, and Mihai Rotaru. "Learning Text Similarity with Siamese Recurrent Networks." Proceedings of the 1st Workshop on Representation Learning for NLP. 2016.
3) Yih, Wen-tau, et al. "Learning Discriminative Projections for Text Similarity Measures." Proceedings of the Fifteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2011.
4) Kenter, Tom, Alexey Borisov, and Maarten de Rijke. "Siamese CBOW: Optimizing Word Embeddings for Sentence Representations." arXiv preprint arXiv:1606.04640 (2016).