
Context-Aware Response Generation for Mental Health Counseling

Reid Pryzant
Stanford University

[email protected]

Abstract

We apply several document vectorization and context-sensitive information retrieval algorithms to the problem of response generation in mental health counseling. We enter a data-usage agreement with a technology-mediated therapy provider to obtain over 11,000 therapist-patient SMS transcripts. We use these transcripts to suggest therapist responses. Our results indicate that GloVe vectors work best in this setting, and that these vectors, combined with a context-aware nearest neighbors approach, are able to fool human evaluators more than 30% of the time.

1 Introduction

Natural language processing (NLP), speech, and dialogue systems have made great strides recently in extracting and understanding textual information. We believe these technologies also have the potential to help address major health issues, most notably mental health. Nearly 1 out of every 5 Americans experiences some form of mental illness each year (Abuse, 2012). This means that more than 40 million US adults experience some form of the illness every year. Fortunately, mental health is readily treatable with counseling and psychotherapy. In recent years, the availability of these treatments has greatly expanded, due in part to technology-mediated remote counseling. However, for many, real counseling with a trained professional is prohibitively difficult to access. Instead, many organizations employ untrained volunteers to meet their patients’ needs. The goal of our project was to aid these counselors by creating a recommendation system that can suggest useful responses for them.

In general, most research on mental health

Table 1: Basic dataset statistics.

Conversations                80,885
Messages                     3.2M
Counselors                   408
Messages per conversation    42.6

counseling is small-scale and qualitative due to the difficulty of obtaining data. Furthermore, in-depth analysis of these data is often prohibited due to their sensitive nature. To circumvent this problem, we entered a data usage agreement with a nonprofit organization that offers text-based crisis counseling and obtained transcripts for over 11,000 text-based counseling conversations and approximately one million messages.

We experimented with a variety of strategies and evaluations, settling on an IR-based system which uses distributed representations of each message and searches for nearest neighbors within this space to pick out likely candidate responses. Our results indicate that this system frequently emulates a supportive and attentive counselor, but falls short in some cases. We believe there are promising future directions for this project, including the incorporation of a generative sequence-to-sequence model for response generation.

2 Related Work

This work relates to two lines of research. First, there’s a large body of research focused on therapeutic discourse analysis and psycholinguistics. Conversation analysis has been applied to various psychotherapeutic settings (Gale, 1991). Importantly, work in this area has shown that the words people use can reveal important aspects of their psychological state (Pennebaker et al., 2003).

Second, we are interested in prior work in large-scale computational linguistics applied to dialogue modeling and conversations. Large studies have revealed subtle dynamics in dialogues like coordination between speakers and style matching (i.e., accent adjustment to grow closer to a conversational partner) (?). In terms of understanding these conversations, dissecting their meaning has been explored by the computer science community. Unsupervised machine learning models have been used to model conversations and segment them into speech acts, topical clusters, or stages. For example, (?) used a hidden Markov model to annotate dialogues with conversational stages. These ideas have rarely been applied to counseling data, but (?) applied an HMM to counseling data to model conversational stages, discovering actionable conversation strategies that are associated with better conversation outcomes.

3 Theoretical Background

We continue by reviewing the theoretical background needed to understand our approach. Each subsection discusses a different way of “vectorizing” sentences.

3.1 TF-IDF

TF-IDF weighting is a method of vectorizing documents under a bag-of-words assumption (Manning et al., 1999). Each document is viewed simply as a collection of words, with word order and semantic information cast aside. The TF-IDF weighting scheme assigns a weight to each term t in a document d with

tfidf_{t,d} = tf_{t,d} × idf_t    (1)

In words, this is the term frequency of t in document d multiplied by the inverse document frequency of t across the entire corpus of documents. There are many ways to calculate term frequency and document frequency. For this system, we used the log-normalized term frequency:

tf_{t,d} = 1 + log(f_{t,d})    (2)

where f_{t,d} is the frequency of term t in document d. For the IDF term, we used the smoothed inverse document frequency:

idf_t = log(1 + N / n_t)    (3)

where N is the total number of documents, and n_t is the number of documents in which term t occurs.

In order to get document vectors out of this algorithm, we select a corpus of keywords to use as features, then compute the TF-IDF weight of each feature. The resulting vector of TF-IDF values serves as our document vector.
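The scheme in equations (1)-(3) can be sketched in code. The following is a minimal, illustrative implementation assuming documents arrive as pre-tokenized lists of words; the function name and toy inputs are our own, not the paper's.

```python
import math
from collections import Counter

def tfidf_vectorize(docs, vocab):
    """TF-IDF document vectors using log-normalized term frequency
    (eq. 2) and smoothed inverse document frequency (eq. 3).
    `docs` is a list of token lists; `vocab` is the feature keyword list."""
    N = len(docs)
    # n_t: number of documents in which each vocabulary term occurs
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    idf = {t: math.log(1 + N / df[t]) if df[t] else 0.0 for t in vocab}
    vectors = []
    for d in docs:
        counts = Counter(d)
        # terms absent from the document get weight 0
        vec = [(1 + math.log(counts[t])) * idf[t] if counts[t] else 0.0
               for t in vocab]
        vectors.append(vec)
    return vectors
```

As in the text, each document vector has one entry per feature keyword, so the dimensionality grows with the vocabulary.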

3.2 word2vec

Though TF-IDF gives us some nice vectors, they are lacking in many respects. The induced vector space is extremely high-dimensional, and most of the vectors in that space are sparse. The word2vec model learns a dense vector space for individual words, and we can average the word vectors of a document to obtain dense representations therein.

The word2vec model learns vector embeddings for a collection of words. In general, these representations are trained such that one can predict co-occurring words from these representations. In this framework, each word is mapped to a unique vector with an embedding matrix W. The words in a context are passed through an output matrix U to produce a prediction as to the next word in a sentence.

Formally, let w_1, ..., w_n be the words of a sentence. We train a model to maximize the “next word” probability in a subsequence of length k, p(w_i | w_{i−1}, ..., w_{i−k}), and do this by maximizing the average log probability over the whole sequence:

(1/n) Σ_i log p(w_i | w_{i−1}, ..., w_{i−k})    (4)

We can obtain p(w_i | w_{i−1}, ..., w_{i−k}) by passing the embeddings for w_{i−1}, ..., w_{i−k} to some combination function c(·). This function can be concatenation, averaging, etc. We then pass this representation of context through a one-layer neural network (i.e., multiply by the U matrix). U maps this representation to a set of logits (one per possible output word), and we then transform the logits into probabilities with a softmax function:

Figure 1: The word2vec model.

p(w_i | w_{i−1}, ..., w_{i−k}) = e^{y_{w_i}} / Σ_j e^{y_j}    (5)

y = b + U c(w_{i−1}, ..., w_{i−k}; W)    (6)

At the end of the day, after optimizing on equation 4 and training via backprop, we get our word embeddings in W.
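The forward pass of equations (5) and (6) can be sketched as follows, using averaging as the combination function c(·). The dimensions, random initialization, and function name are illustrative assumptions, not the exact training setup.

```python
import numpy as np

# Hypothetical dimensions: V = vocab size, d = embedding size, k = context length
V, d, k = 10, 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))   # input embedding matrix (one row per word)
U = rng.normal(size=(V, d))   # output matrix
b = np.zeros(V)               # output bias

def predict_next(context_ids):
    """One forward pass of eqs. (5)-(6): combine context embeddings,
    compute logits y = b + Uc, and softmax into a distribution."""
    c = W[context_ids].mean(axis=0)   # combination function: averaging
    y = b + U @ c                     # one logit per vocabulary word
    e = np.exp(y - y.max())           # numerically stable softmax
    return e / e.sum()

p = predict_next([1, 2, 3])           # distribution over the next word
```

Training would backpropagate the cross-entropy loss of equation 4 into W, U, and b; only the prediction step is shown here.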

3.3 GloVe

Though word2vec seems like a step in the right direction, term-frequency co-occurrence measurements are certainly meaningful, and neither of the previous two methods makes direct use of them. The bag-of-words assumption of TF-IDF forces the algorithm to discard the semantics embedded in co-occurrence statistics, and by focusing on co-occurrence prediction instead of frequencies, word2vec only indirectly works with this information. GloVe is a way to combine the best of both worlds: it manipulates global co-occurrence statistics to learn dense distributed representations for each word (Pennington et al., 2014). This space has meaningful substructures that incorporate information from word co-occurrence statistics. These vectors have been shown to accurately capture word semantics and word analogies. Like word2vec, we can average these vector representations to obtain vectors at the document level.

The GloVe approach begins with the observation that semantically related words should occur in similar contexts. For example, let P(k|w) be the probability that word k appears in the context of word w. Then we would expect P(awesome|jiwei) to be high, but P(awesome|tube) and P(water|jiwei) to be low. By the nature of these co-occurrence probabilities, “jiwei” is distant from “water” but close to “awesome”. We can capture this notion by looking at the ratios of these co-occurrence probabilities. P(awesome|jiwei)/P(awesome|tube) would be high (because “jiwei” is more closely related to “awesome”), and P(water|jiwei)/P(water|tube) would be low (because “water” is more closely related to “tube”).

These observations indicate that a good word vector is something from which you can recover these co-occurrence ratios. And that’s exactly what GloVe learns. GloVe performs a weighted least squares regression whose training objective is such that the dot product between two word vectors equals the logarithm of the words’ probability of co-occurrence. The training objective is based on logarithms because the log of a ratio is equal to the difference of logs, thereby baking word analogies into vector differences.

Once word vectors are obtained for each word, vectorizing a document is done by averaging its constituent word vectors.
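This averaging step can be sketched minimally as below, assuming the pretrained vectors arrive as a dict from token to array. Skipping out-of-vocabulary tokens is our own design choice for the sketch, not necessarily what the system did.

```python
import numpy as np

def doc_vector(tokens, word_vectors, dim=50):
    """Average per-word vectors into one document vector.
    `word_vectors` maps token -> np.ndarray of length `dim`;
    tokens missing from the vocabulary are skipped."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        # no known tokens: fall back to the zero vector
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
```

The same helper works for word2vec or GloVe vectors alike, since both produce a token-to-vector lookup.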

3.4 doc2vec

Though GloVe learns great distributed representations for words, averaging these representations is suboptimal, because it discards word order information.

As a third and final way of vectorizing documents, we investigated the doc2vec (or paragraph-to-vector) algorithm (Le and Mikolov, 2014). This is a simple modification of the original word2vec model (figure 2). In this setting, we have a second embedding matrix D, with one column per document. At prediction time, we use the vectors for a word’s context to predict that word, but we also use the embedding vector of the document that the utterance came from. This algorithm has been shown to work better for some tasks like sentiment prediction.

4 Approach

We continue by outlining our recommendation system. The system consists of several modules: components for reading in data, manipulating data, and structuring and sorting data (there’s lots of it – dumped as a raw text file it’s 3GB, so we had to do some IR-y speed work). However, that’s

Figure 2: The doc2vec model.

all outside the scope of this report. Instead, we will discuss the algorithms at the core of the recommendation module.

At a high level, our recommender works as follows:

1. Preprocessing: vectorize the corpus

2. Vectorize the query

3. Find nearest neighbors of the query in corpus-space

4. Reconstruct those neighbors

5. Optional reranking steps
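Steps 2-4 above can be sketched as follows, assuming the corpus has already been vectorized and that `responses[i]` stores the reply that followed corpus message i. The function names are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors; guards against zero vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

def recommend(query_vec, message_vecs, responses):
    """Find the corpus message nearest to the vectorized query and
    return the response that succeeded it (the best-match strategy)."""
    sims = [cosine(query_vec, m) for m in message_vecs]
    return responses[int(np.argmax(sims))]
```

A production version would use an approximate nearest-neighbor index rather than this linear scan, but the retrieval logic is the same.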

For sentence vectorization, we implemented

• TF-IDF

• GloVe

• doc2vec

Each implementation was from scratch, i.e., we did not rely on any pre-made packages or libraries other than TensorFlow/numpy. This was because we wanted the project to be fundamentally a learning exercise.

For finding nearest neighbors, we used cosine distance:

cos(a, b) = (a · b) / (||a|| ||b||)    (7)

We implemented three separate reranking strategies:

• Best match (BM), where we search the corpus for the closest message to the “trigger”, then return the succeeding response to that message.

• Contextualized best match (CBM), where we search the corpus for the closest k messages to the “trigger”, then rerank those k candidates by comparing their contexts with the query conversation’s context. We rerank by simultaneously stepping back through (1) all of the candidates’ conversations, and (2) the ongoing query conversation. At each step, we compute the cosine similarity between each candidate conversation’s historical message and that of the query conversation. For each candidate conversation, we remember its “inverse rank” at each step in time (0 is the worst, k is the best), then at the end of the day compute a weighted average of these ranks, with an exponential decay of importance through time.

• Corpus-wide context matching (CCM). This is similar to contextualized best match, except there is no initial candidate selection step. Every conversation in the corpus is taken as a candidate, and we step back through time across all conversations at each reranking step.

Note that if the program’s cursor ever went past the edge of a conversation while stepping back through time for reranking, we ignored the “rank contribution” of that step; i.e., instead of giving the conversation an invalid rank of -1 for that context level, we simply ignored it. While this has the side effect of skewing the weighting of short conversations, it ensures that all conversations are treated fairly, and that none are unfairly pulled down in the rankings due to their length.
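The CBM reranking scheme above, including the edge-skipping rule, can be sketched like this. The decay constant is an assumed hyperparameter, and the exact weighting of the original system may differ.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity; guards against zero vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

def cbm_rerank(query_history, candidate_histories, decay=0.5):
    """Rerank candidate conversations by decay-weighted inverse ranks.
    Histories are lists of message vectors, most recent last. At each step
    back in time, candidates still in range are ranked by similarity to the
    query's historical message; steps past a conversation's edge contribute
    nothing (they are skipped, per the note above)."""
    k = len(candidate_histories)
    scores, weights = [0.0] * k, [0.0] * k
    for step in range(1, len(query_history) + 1):
        q = query_history[-step]
        sims = [(cosine(q, hist[-step]), i)
                for i, hist in enumerate(candidate_histories)
                if step <= len(hist)]          # skip past-the-edge steps
        sims.sort()                            # worst similarity first
        w = decay ** (step - 1)                # exponential decay through time
        for rank, (_, i) in enumerate(sims):   # inverse rank: 0 = worst
            scores[i] += w * rank
            weights[i] += w
    final = [s / w if w else 0.0 for s, w in zip(scores, weights)]
    return sorted(range(k), key=lambda i: -final[i])  # best candidate first
```

Dividing by the accumulated weights makes the score a weighted average over only the steps a candidate actually participated in, which is what keeps short conversations from being penalized outright.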

In addition to the three vectorization and ranking algorithms, we experimented with excluding stop words 1, common words like “the”, “if”, and “so” that offer little insight into the semantics of a sentence. We experimented with removing pronouns from the list, with little success.

5 Experiments

If we learned anything in this class, it’s that evaluation is critical, so rather than investing time in a fancy neural response generation system, we put lots of effort into evaluating our engine.

We evaluated the response engine in two ways: first, we used adversarial evaluation with an

1 Obtained from http://www.ranks.nl/stopwords

Table 2: AdverSuc scores achieved by different vectorization, ranking, and evaluation strategies.

SVM

         TF-IDF  word2vec  GloVe  doc2vec
Random   0.5     0.51      0.49   0.5
BM       0.62    0.66      0.59   0.59
CBM      0.38    0.28      0.26   0.35
CCM      0.41    0.30      0.35   0.40

NN

         TF-IDF  word2vec  GloVe  doc2vec
Random   0.5     0.5       0.5    0.5
BM       0.65    0.66      0.61   0.62
CBM      0.40    0.28      0.27   0.37
CCM      0.43    0.33      0.38   0.45

LSTM trained to discriminate between machine-generated and human-generated responses. Second, we built a “game” for humans to perform this discrimination (figure 3).

We first test two different adversarial evaluation models. The first is an SVM using unigram features; for each conversation, we select a history of 7 messages and transform it into a unigram representation. The second is a neural classification model with a softmax function. This model takes as input the concatenation of each message in the 7-message history and an encoding of the proposed message.
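The unigram featurization feeding the SVM might look like the sketch below; the whitespace tokenization and the vocabulary are our own assumptions.

```python
from collections import Counter

def unigram_features(history, vocab):
    """Bag-of-words feature vector for a message history.
    `history` is a list of message strings (e.g., the 7-message window);
    `vocab` fixes the feature order. Tokenization here is naive
    lowercased whitespace splitting, an illustrative choice."""
    counts = Counter(tok for msg in history for tok in msg.lower().split())
    return [counts[t] for t in vocab]
```

Each history becomes a fixed-length count vector, which is exactly the representation an off-the-shelf linear SVM can consume.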

The adversarial evaluation results are displayed in table 2. We find that the neural net evaluator performed better than the SVM, possibly due to the (slight) temporal modeling of a concatenated sequence, and the inappropriateness of the SVM’s bag-of-words assumption. When examining different ranking strategies, we see that contextualized best matching (CBM) generally outperformed its counterparts, likely due to the importance of the “trigger” message. When comparing different vectorization strategies, we see that GloVe performed the best. This is surprising, given the supposed advantages of doc2vec, but may be explained by a lack of hyperparameter tuning for that algorithm.

For human evaluation, we solicited 6 humans to manually evaluate a dozen conversations and proposed responses. We presented these evaluators with a conversation, followed by four candidate responses. One of these was the “true” response,

Figure 3: The intro screen for our human evalua-tion game.

and the three others were the top recommendations from our CBM-GloVe system. Our evaluators were asked to pick out the true response, and were capable of doing so 64% of the time.

To better understand the behavior of this system, we interviewed each evaluator and manually examined some responses ourselves. Our evaluators unanimously reported that the system was either a hit or a very big miss: the things that gave responses away weren’t minor; rather, they were wildly off topic.

We sincerely wish we could do an error analysis in this report, but as per the data use agreement we signed with CTL, we are only allowed to present their data in situations that they allow. When asked about this setting (a final project report for a class), they did not allow it.

6 Conclusion

We applied a context-sensitive recommendation engine to mental health counseling sessions. We implemented a variety of information retrieval and message vectorization algorithms, and entered a data-usage agreement with a technology-mediated therapy provider. Our results suggest that GloVe vectors work best for this task, and we were able to fool human evaluators more than 30% of the time.

There is much future work to be done, including the incorporation of neural network rerankers and more extensive evaluation strategies.

References

Substance Abuse. 2012. Results from the 2010 National Survey on Drug Use and Health: Mental Health Findings.

Jerry Edward Gale. 1991. Conversation analysis of therapeutic discourse: The pursuit of a therapeutic agenda. Ablex Publishing.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188–1196.

Christopher D. Manning, Hinrich Schutze, et al. 1999. Foundations of Statistical Natural Language Processing, volume 999. MIT Press.

James W. Pennebaker, Matthias R. Mehl, and Kate G. Niederhoffer. 2003. Psychological aspects of natural language use: Our words, our selves. Annual Review of Psychology 54(1):547–577.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.