News Analytics for Global Infectious Disease Surveillance

Saurav Ghosh

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in

Computer Science

Naren Ramakrishnan, Chair

Madhav Marathe

Chang-Tien Lu

Bryan Lewis

Elaine O. Nsoesie

September 28, 2017

Arlington, Virginia

Keywords: Infectious Disease Surveillance, News Analytics, HealthMap, WHO DONs

Copyright 2017, Saurav Ghosh


Saurav Ghosh

(ABSTRACT)

Traditional disease surveillance can be augmented with a wide variety of open sources, such as online news media, Twitter, blogs, and web search records. Rapidly increasing volumes of these open sources are proving to be extremely valuable resources in helping analyze, detect, and forecast outbreaks of infectious diseases, especially new diseases or diseases spreading to new regions. However, these sources are in general unstructured (noisy), and construction of surveillance tools, ranging from real-time disease outbreak monitoring to epidemiological line lists, involves considerable human supervision. Intelligent modeling of such sources using text mining methods such as topic models, deep learning, and dependency parsing can lead to automated generation of these surveillance tools. Moreover, real-time global availability of these open sources from web-based bio-surveillance systems, such as HealthMap and WHO Disease Outbreak News (DONs), can aid in the development of generic tools applicable to a wide range of diseases (rare, endemic, and emerging) across different regions of the world.

In this dissertation, we explore various methods of using internet news reports to develop generic surveillance tools which can supplement traditional surveillance systems and aid in early detection of outbreaks. We primarily investigate three major problems related to infectious disease surveillance as follows. (i) Can trends in online news reporting monitor and possibly estimate infectious disease outbreaks? We introduce approaches that use temporal topic models over the HealthMap corpus for detecting rare and endemic disease topics as well as capturing temporal trends (seasonality, abrupt peaks) for each disease topic. The discovery of temporal topic trends is followed by time-series regression techniques to estimate future disease incidence. (ii) In the second problem, we seek to automate the creation of epidemiological line lists for emerging diseases from WHO DONs in a near real-time setting. For this purpose, we formulate Guided Epidemiological Line List (GELL), an approach that combines neural word embeddings with information extracted from dependency parse trees at the sentence level to extract line list features. (iii) Finally, for the third problem, we aim to characterize diseases automatically from the HealthMap corpus using a disease-specific word embedding model; the resulting characterizations are evaluated for accuracy against human-curated ones.


Saurav Ghosh

(GENERAL AUDIENCE ABSTRACT)

Infectious disease outbreaks are a threat to the public health and economic stability of many countries. Traditional disease surveillance data released by organizations such as CDC and ProMED is delayed and therefore not reliable for real-time monitoring of infectious disease outbreaks. Recently, open source indicators, such as online news sources and social media sources (Twitter), have been shown to be effective in monitoring infectious disease outbreaks in real-time due to their volume, ease of availability and citizen participation. This dissertation focuses on developing multiple data analytic tools which perform automated analysis of online disease-related news articles with an aim to characterize infectious diseases and monitor their spatial and temporal progression in real-time. We show that temporal trends extracted from online news articles can be used to capture the dynamics of multiple disease outbreaks, such as the whooping cough outbreak in the U.S. during the summer of 2012, periodic outbreaks of H7N9 in China during 2013-2014 and the emerging MERS outbreak in Saudi Arabia. However, online news reporting during infectious disease outbreaks is driven by interest, and therefore news coverage for certain diseases can be inconsistent over time, leading to erroneous surveillance.

Grant Information

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC000337. The US Government is authorized to reproduce and distribute reprints of this work for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the US Government. This work has also been partially supported by DTRA CNIMS and DTRA BSVE (contract number HDTRA1-11-D-0016-0005), National Science Foundation grant NRT-DESE-154362, NSF DIBBS Grant ACI-1443054, NIH MIDAS Grant 5U01GM070694, NSF BIG DATA Grant IIS-1633028 and the National Institutes of Health grant 1R01GM109718.

Dedication

To my wonderful Mom, Dad and Sisters

Acknowledgments

Firstly, I would like to thank my wonderful advisor, Dr. Naren, for his patience, guidance, and support during this dissertation work. Indeed, this research would not have been possible without his efforts. I would also like to thank my committee members, Dr. Madhav, Dr. Bryan, Dr. Elaine, and Dr. Lu, for their advice, comments, and time. Special thanks to Dr. Bryan, Dr. Elaine and Dr. Madhav for all their efforts, feedback, and support during this research.

Secondly, I would like to thank my EMBERS colleagues and labmates, Prithwish, Malay, Sathappan, Subhodip, Rupinder, Patrick, Nikhil, Raihan, Parang, Samah, Tozammel, Huijuan and others for their help and support. I would also like to mention my Blacksburg friends, Deba Pratim, Arijit, Sayantan, Siddhartha and Abhishek, for making my Blacksburg stay a memorable one. Next, I would like to mention my special friends from undergraduate and childhood days, Minhazul, Joydeep, Subhrajit, Abhishek, Soham and many others, for supporting me throughout my life.

Finally, I would like to thank all of my wonderful family members. In particular, I thank my parents, Rina Ghosh and Uttam Ghosh, my aunts Krishna Sarkar, Soma Roy and Rakhi Ghosh, and my uncle Pradip Ghosh for their endless support and motivation over the years. They always encouraged me to pursue higher education. Indeed, without their support, I would not have been able to make it this far. I would also like to thank my wonderful sisters, specifically Trina Ghosh Acharya, Mayuri Sarkar and Aditi Sarkar, for their love and support.

Contents

1 Introduction

1.1 Background and Motivation

1.1.1 Problem 1: Assessing Associations between News Trends and Infectious Disease Outbreaks using Temporal Topic Models

1.1.2 Problem 2: Automating the construction of epidemiological line lists

1.1.3 Problem 3: Automated Disease Taxonomy Generation

1.2 Related Work

1.2.1 Spatio-temporal topic models

1.2.2 Word embeddings (word2vec)

1.2.3 Dependency parsing

1.3 Organization of the dissertation

2 Monitoring Rare Disease Outbreaks using Spatio-temporal Topic Model

2.1 Introduction

2.2 Spatio-temporal Topic Model

2.3 Rare Topic Discovery

2.4 Temporal and Spatial Patterns of Rare Topics

2.5 Limitations

3 Temporal Topic Model with Prior Disease Knowledge for Assessing Associations between News Trends and Multiple Infectious Disease Outbreaks

3.1 Introduction

3.2 Materials and Methods

3.2.1 Data sources

3.2.2 EpiNews

3.3 Results

3.3.1 Disease topic discovery

3.3.2 Detection of outbreak patterns

3.3.3 Estimating case counts

3.4 Discussions

4 GELL: Automating the Extraction of Epidemiological Line Lists from Open Sources

4.1 Introduction

4.2 Problem Overview

4.3 Materials and Methods

4.3.1 Level 0 Modeling

4.3.2 WHO Word Embeddings

4.3.3 Level 1 Modeling

4.3.4 Level 2 Modeling

4.4 Experimental Evaluation

4.4.1 WHO corpus

4.4.2 Models

4.4.3 Human annotated line list

4.4.4 Accuracy metric

4.4.5 Parameter settings

4.5 Results and Discussions

4.6 Future Research Directions

5 Characterizing Diseases from Unstructured Text: A Vocabulary Driven Word2vec Approach

5.1 Introduction

5.2 Model

5.2.1 Problem Overview

5.2.2 Basic Word2vec Model

5.2.3 Disease Specific Word2vec Model (Dis2Vec)

5.2.4 Parameters in Dis2Vec

5.3 Experimental Evaluation

5.3.1 Experimental Setup

5.4 Results

5.5 Discussions

6 Conclusions and Future Work

List of Figures

1.1 Examples of five human curated line list cases along with the features corresponding to each case for the MERS outbreak in Saudi Arabia.

1.2 Outline of this dissertation showing three text analytics methods for infectious disease surveillance using online news media.

2.1 Timeline of hantavirus outbreaks and keyword mentions from January 2013 to February 2014.

2.2 Graphical representation of the unsupervised temporal topic model used for detecting rare disease outbreaks.

2.3 Three discovered topics and their most likely words related to Hantavirus.

2.4 Three discovered topics that are related to Influenza (Avian Flu), Dengue and Swine Flu.

2.5 Histogram showing the temporal patterns of all disease topics, including the rare disease hantavirus, discovered by the unsupervised temporal topic model.

2.6 The country-specific topic prominence for different rare and endemic disease topics averaged over states.

3.1 Flow chart depicting the sequential modeling process in EpiNews.

3.2 Correlation between disease case counts and temporal topic distributions or trends (ξz) extracted by EpiNews for (a) whooping cough, (c) rabies, (e) salmonellosis, and (g) E. coli infection in the U.S. Along with the temporal topic trends (ξz), we also show the correlation between disease case counts and sampled case counts (generated by multinomial sampling from temporal topic trends) for (b) whooping cough, (d) rabies, (f) salmonellosis, and (h) E. coli infection. Note that the sampled case counts and disease case counts share a similar numerical range. However, the temporal topic trend values lie in a different numerical range (from 0 to 1) than the disease case counts.

3.3 Correlation between disease case counts and temporal topic distributions or trends (ξz) extracted by EpiNews for (a) H7N9, (c) HFMD, and (e) dengue in China. Along with the temporal topic trends (ξz), we also show the correlation between disease case counts and sampled case counts (generated by multinomial sampling from temporal topic trends) for (b) H7N9, (d) HFMD, and (f) dengue. Note that the sampled case counts and disease case counts share a similar numerical range. However, the temporal topic trend values lie in a different numerical range (from 0 to 1) than the disease case counts.

3.4 Correlation between disease case counts and temporal topic distributions or trends (ξz) extracted by EpiNews for (a) ADD, (c) dengue, and (e) malaria in India. Along with the temporal topic trends (ξz), we also show the correlation between disease case counts and sampled case counts (generated by multinomial sampling from temporal topic trends) for (b) ADD, (d) dengue, and (f) malaria. Note that the sampled case counts and disease case counts share a similar numerical range. However, the temporal topic trend values lie in a different numerical range (from 0 to 1) than the disease case counts.

3.5 Temporal correlation between actual case counts and case counts estimated by the methods Casecount-ARMA, EpiNews-ARMAX and EpiNews-ARNet corresponding to (a) dengue and (b) HFMD in China. In (c), we show the temporal correlation between actual case counts and case counts estimated by EpiNews-ARNet-sample corresponding to ADD in India.

4.1 Tabular extraction of a line list by GELL given a textual block of a WHO MERS bulletin. Each row in the extracted table depicts an infected case (or patient) and columns represent the epidemiological features corresponding to each case. Information for each case in the table is then used to make epidemiological inferences, such as inferring the demographic distribution of cases.

4.2 Block diagram depicting all components of the GELL framework. Given multiple WHO MERS bulletins as input, these components function in the depicted order to extract line lists in tabular form.

4.3 Undirected dependency graph corresponding to S5. The red-colored edges depict those edges included in the shortest paths between the date phrases (4-June, 12-June) and the indicators (symptoms, admitted).

4.4 Directed dependency graph corresponding to S6 showing direct and indirect negation detection.

4.5 Distribution of non-null features in the human annotated line list.

4.6 Distribution of QS values for each automated line listing model corresponding to the MERS line list in Saudi Arabia. The X-axis represents QS values and the Y-axis represents the number of automated line list cases having a particular QS value.

4.7 Accuracy of individual indicators (including the seed indicator) discovered via word2vec methods in GELL (SGNS) or GELL (SGHS) for each line list feature. For clinical features, we show the average F1-score. This figure depicts the informative indicators (indicators showing higher accuracies or F1-scores) which contribute to the improved performance of GELL (SGNS) or GELL (SGHS) for a particular feature. E.g., for animal contact, the most informative indicator contributing to the superior performance of GELL (SGHS) is camels, followed by animals (seed), sheep and direct.

4.8 Analysis of disease onset features in the extracted line list.

5.1 Comparative performance evaluation of the disease specific word2vec model (Dis2Vec) across the disease characterization tasks for three different classes of diseases: endemic (blue), emerging (red) and rare (green). The axes along the four vertices represent the modeling accuracy for the disease characterization of interest, viz. symptoms, transmission agents, transmission methods, and exposures. The area under the curve for each disease class represents the corresponding overall accuracy over all the characterizations. The best characterization performance can be seen for emerging diseases.

5.2 Automated taxonomy generation from an unstructured news corpus (HealthMap) and a pre-specified vocabulary (V). Dis2Vec takes this information as input to generate disease specific word embeddings that are then passed through a cosine comparator to generate the taxonomy for the disease of interest.

5.3 Distribution of word counts corresponding to each taxonomical category in the disease vocabulary (V). Words related to clinical symptoms constitute the majority of V, with much smaller percentages of terms related to exposures, transmission agents and transmission methods.

5.4 Case study for emerging, endemic and rare diseases: Disease characterization accuracy plot for Dis2Vec (first quadrant, red), SGNS (second quadrant, blue), SGHS (third quadrant, green), and CBOW (fourth quadrant, orange) w.r.t. H7N9 (left, emerging), avian influenza (middle, endemic) and plague (right, rare). The shaded area in a quadrant indicates the cosine similarity (scaled between 0 and 1) of the top words found for the category of interest using the corresponding model, as evaluated against the human annotated words (see Table 5.1). The top words found for each model are shown in the corresponding quadrant with radius equal to their average similarity with the human annotated words for the disease. Dis2Vec shows the best overall performance, with noticeable improvements for symptoms w.r.t. all diseases.

List of Tables

1.1 Disease-specific semantic constructs captured by SGNS and Dis2Vec.

3.1 Disease names (along with routes of transmission), health agencies from which case counts were collected, time period over which case counts were obtained and temporal granularity (daily, monthly, weekly or yearly) of the obtained case counts corresponding to each country. H7N9 stands for avian influenza A, ADD stands for acute diarrheal disease and HFMD stands for hand, foot, and mouth disease.

3.2 Country-wise distribution of the total number of HealthMap news articles along with unique words and location names extracted from all the corresponding articles.

3.3 Four disease topics (Whooping Cough, Rabies, Salmonella and E. coli infection) discovered by the supervised topic model from the HealthMap corpus for the U.S. For each disease topic, we show the seed words and their corresponding probabilities in the seed topic distribution. Along with the seed words, we also show some of the regular words (having higher probabilities in the regular topic distribution) discovered by the supervised topic model related to these input seed words.

3.4 Three disease topics (H7N9, HFMD and dengue) discovered by the supervised topic model from the HealthMap corpus for China. For each disease topic, we show the seed words and their corresponding probabilities in the seed topic distribution. Along with the seed words, we also show some of the regular words (having higher probabilities in the regular topic distribution) discovered by the supervised topic model related to these input seed words.

3.5 Three disease topics (ADD, dengue and malaria) discovered by the supervised topic model from the HealthMap corpus for India. For each disease topic, we show the seed words and their corresponding probabilities in the seed topic distribution. Along with the seed words, we also show some of the regular words (having higher probabilities in the regular topic distribution) discovered by the supervised topic model related to these input seed words.

3.6 Total time period of study, static training period and the evaluation period for estimating disease case counts in each country.

3.7 Comparing the performance of EpiNews-ARNet against the baseline methods EpiNews-ARMAX and Casecount-ARMA for 1-step ahead estimation of disease case counts. The metric used for comparing the case counts estimated by the methods against the actual case counts is the normalized root-mean-square error (NRMSE).

4.1 Seed indicator and the discovered indicators using word embeddings generated by SGNS.

4.2 Average Quality Score (QS) achieved by each automated line listing model for the MERS line list in Saudi Arabia. As can be seen, GELL (SGNS) shows the best performance, achieving an average QS of 0.73.

4.3 Comparing the automated line listing models based on the accuracy score for the demographics and disease onset features. For the disease onset features, GELL (SGNS) emerges as the best performing model. However, for the demographic features, all the models achieve similar performance.

4.4 Comparing the performance of the automated line listing models for extracting clinical features corresponding to the MERS line list in Saudi Arabia. We report the F1-score for class Y, class N and the average F1-score across the two classes. For animal contact, GELL (SGHS) emerges as the best performing model. For comorbidities and specified HCW, GELL (SGNS) shows the best performance. However, for secondary contact, Baseline achieves superior performance in comparison to GELL.

4.5 Parameter settings in GELL (SGNS) and GELL (SGHS) for which both models achieve optimal performance in terms of average QS and individual feature accuracies corresponding to the MERS line list in Saudi Arabia. Non-applicable combinations are marked by NA.

4.6 Comparing the performance of GELL on extraction of clinical features with or without indirect negation for the MERS line list in Saudi Arabia. It can be seen that indirect negation improves the performance of GELL for animal contact and secondary contact.

5.1 Human curated disease taxonomy for three diseases from three different classes of diseases (endemic, emerging, and rare).

5.2 Symptom categories and corresponding words.

5.3 Comparative performance evaluation of Dis2Vec-combined against Dis2Vec-objective and Dis2Vec-sample across the 4 characterization tasks under the best parameter configuration for that model and task combination. The value in each cell represents the overall accuracy across the 39 diseases for that particular model and characterization task. We use equation 5.8 as the accuracy metric in this table.

5.4 Comparative performance evaluation of Dis2Vec against SGNS, SGHS and CBOW across the 4 characterization tasks under the best parameter configuration for that model and task combination. The value in each cell represents the overall accuracy across the 39 diseases for that particular model and characterization task. We use equation 5.8 as the accuracy metric in this table.

5.5 Comparative performance evaluation of Dis2Vec with full vocabulary against each of the 6 conditions of Dis2Vec with a truncated vocabulary across the 4 characterization tasks, where the truncated vocabulary consists of disease names and all possible terms related to a particular taxonomical category. We use equation 5.8 as the accuracy metric in this table.

5.6 Comparison of different parameter settings for each model, measured by the number of characterization tasks in which the best configuration had that parameter setting. Non-applicable combinations are marked by ‘NA’.

5.7 Comparative performance evaluation of Dis2Vec against SGNS, SGHS and CBOW across the 4 characterization tasks for each class of diseases (emerging, endemic and rare) under the best parameter configuration for a particular {disease class, task, model} combination. We use equation 5.8 as the accuracy metric in this table.

Chapter 1

Introduction

1.1 Background and Motivation

There has been a growing interest in tracking infectious disease outbreaks using an array of online sources including social networks [18, 53, 54], blogs [17], web search records [28], and online news media [10, 43]. Due to their volume, ease of availability and citizen participation, such open source indicators can supplement traditional surveillance systems in monitoring disease emergence and progression, enabling effective control measures to be taken in a sufficiently timely fashion. Traditional surveillance approaches rely on highly specified data, including medical records or environmental time series. In particular, there are approaches that use rule-based techniques [9, 83] and Bayesian networks [84] for detecting outbreaks and temporal disease patterns by considering hospital data and weather-related time series. However, traditional surveillance data are difficult to obtain in real-time at a global scale. Due to these limitations, more recent work has focused on using web-based open sources, such as online news [10, 43] or Twitter [54], which are typically available in (near) real-time. Intelligent mining of signals from these open sources can decrease the time between an outbreak and formal recognition of an outbreak [13], thus allowing for an expedited response to disease outbreaks and related events around the world.

In this dissertation, we aim to build generic and automated surveillance tools that operate in (near) real-time on online health-related news reports and are applicable to a wide range of infectious diseases across different geographic regions of the world. We briefly introduce each problem explored in this dissertation as follows.

1.1.1 Problem 1: Assessing Associations between News Trends and Infectious Disease Outbreaks using Temporal Topic Models

In retrospective assessments, internet news reports have been shown to capture early reports of infectious disease transmission prior to official confirmation from health organizations. In general, media interest and reporting peak and wane during the course of an outbreak. Most prior work in disease surveillance using open source indicators has targeted specific endemic diseases, such as influenza [12, 17, 87] and West Nile virus [70]. Therefore, there is a need to develop generic frameworks that are applicable to infectious diseases with diverse characteristics, such as rare (hantavirus), endemic (dengue, salmonella) and emerging (H7N9).

The first problem of this dissertation is aimed at developing temporal topic models for monitoring infectious disease outbreaks with diverse temporal dynamics in a specific country. Firstly, we introduce a spatio-temporal topic model [27, 61] for monitoring rare disease outbreaks in multiple countries of Latin America. However, this spatio-temporal topic model has certain limitations when it comes to monitoring a mixture of rare, endemic and emerging diseases in a given country, as it does not incorporate any prior knowledge of diseases. Motivated by these limitations, we propose EpiNews [26], a novel approach based on a combination of supervised temporal topic models (incorporating prior knowledge about diseases) with time-series regression techniques that helps transform large corpora of news articles collected from HealthMap into temporal topic trends. We evaluated the method using data from multiple infectious disease outbreaks reported in the United States of America (U.S.), China and India, such as the whooping cough outbreak in the U.S. (2012), H7N9 outbreaks in China (2013 and 2014) and dengue outbreaks in India (2013) and China (2014). The key advantages of this approach include applicability to a wide range of diseases and the ability to capture disease dynamics, including seasonality, abrupt peaks and troughs. Our observations also suggest that, when news coverage is uniform, efficient modeling of temporal topic trends using time-series regression techniques can estimate disease outbreaks with increased precision before official reports by health organizations.
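The idea of pairing a news-derived topic trend with time-series regression can be illustrated with a small sketch. It assumes a first-order autoregressive model with one exogenous input, fitted by ordinary least squares; it is not the EpiNews implementation, and the function names, the model order, and the synthetic data are all hypothetical choices made for illustration.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_arx(cases, trend):
    """Least-squares fit of cases[t] ~ a + b * cases[t-1] + c * trend[t-1]."""
    rows = [[1.0, cases[t - 1], trend[t - 1]] for t in range(1, len(cases))]
    targets = cases[1:]
    k = 3
    # Normal equations: (X^T X) coef = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_next(coef, last_cases, last_trend):
    """One-step-ahead estimate from the most recent observations."""
    a, b, c = coef
    return a + b * last_cases + c * last_trend


# Synthetic data for illustration only (not taken from the dissertation):
# a topic trend in [0, 1] and case counts generated from a known model.
trend = [0.1, 0.3, 0.2, 0.6, 0.4, 0.8, 0.5, 0.2]
cases = [10.0]
for t in range(1, len(trend)):
    cases.append(2.0 + 0.5 * cases[t - 1] + 100.0 * trend[t - 1])

coef = fit_arx(cases, trend)
next_estimate = predict_next(coef, cases[-1], trend[-1])
print([round(c, 3) for c in coef], round(next_estimate, 3))
```

Because the synthetic counts are generated without noise, the fit recovers the generating coefficients; with real surveillance data the estimate would only be as good as the news coverage behind the trend.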

1.1.2 Problem 2: Automating the construction of epidemiological line lists

The second problem of this dissertation is aimed at developing tools to automatically create epidemiological line lists from open source reports (WHO DONs) of emerging diseases and make such lists readily available to epidemiologists. Specifically, we will focus on deriving characteristics of an emerging infectious disease and the affected population from reports of illness. Prior research [36, 44] has shown the utility in generating such lists through labor intensive human curation. In this problem, we seek to automate this effort in order to reduce the latency incurred due to time consuming human curation.

An epidemiological line list is a listing of individuals suffering from an emerging disease that describes both their demographic details and the timing of clinically and epidemiologically significant events during the course of disease. These are typically used during outbreak investigations to identify key features of the disease, e.g., incubation period, symptoms, associated risk factors, and outcomes. Traditionally, line lists have been curated manually and have rarely been available to epidemiologists in (near) real-time. As the characteristics of emerging diseases (Middle East Respiratory Syndrome, referred to as MERS, Ebola, H7N9, etc.) are relatively unknown compared with those of endemic diseases (dengue, influenza, salmonella, etc.), ready availability of line lists can assist epidemiologists in understanding the disease well enough to stop the outbreak. Real-time construction of line lists can also be useful in contact tracing as well as in identifying the risk of spread.

For this problem, we focus on the MERS outbreaks in Saudi Arabia and South Korea as our case studies. MERS was a relatively poorly understood disease at the onset of the outbreaks in both countries. Also, since MERS was regarded as an emerging outbreak, there has been good bulletin coverage of individual cases. This makes both outbreaks ideally suited to our goals. MERS is also infectious, and animal contact has been posited as one of the transmission mechanisms of the disease. In light of the aforementioned discussion, we curated the following features for our proposed line lists.

• Demographic details: Age and Gender

• Disease onset details: Symptom onset date, hospitalization date and outcome date.

• Clinical parameters: Specified comorbidities, animal contact, secondary contact and specified healthcare worker (HCW).

In Figure 1.1, we show examples of five human curated line list cases for the MERS outbreak in Saudi Arabia, where each row represents a line list case and each column represents a feature of that case. One of the primary challenges in extracting line lists is that a single health bulletin can contain information about multiple cases. Thus a syntax-aware parsing mechanism, which can distinguish between cases mentioned within an article, is required. We propose Guided Epidemiological Line List (GELL) [25], a fully automated, novel and generic framework that extracts line lists through multiple levels of modeling. Given a health bulletin, GELL’s first task is to identify the number of mentioned cases and the sentences corresponding to each case. Following this task, GELL integrates information extracted from neural word embeddings and dependency graphs at the sentence level to extract the features for each case.


Figure 1.1: Examples of five human curated line list cases along with the features corresponding to each case for MERS outbreak in Saudi Arabia.

1.1.3 Problem 3: Automated Disease Taxonomy Generation

Traditional disease surveillance has often relied on a multitude of reporting networks such as outpatient networks, on-field healthcare workers, and lab-based networks. Some of the most effective tools for analyzing or mapping diseases, especially new diseases or diseases spreading to new regions, rely on building disease taxonomies, which can aid in early detection of outbreaks.

In recent years, the ready availability of social and news media has led to services such as HealthMap [22], which have been used to track several disease outbreaks from news media, ranging from the flu to Ebola. However, most of this data is unstructured and often noisy. Annotating such corpora thus requires considerable human oversight. While significant information about both endemic [12, 80] and rare [61] diseases can be extracted from such news corpora, traditional text analytics methods such as lemmatization and tokenization are often shallow and do not retain sufficient contextual information. More involved methods such as topic models are too computationally expensive for real-time worldwide surveillance and do not provide simple semantic contexts that could be used to comprehend the data.

In recent years, several deep learning based methods, such as word2vec [47, 48] and doc2vec [38], have been found to be promising in analyzing such text corpora. Once trained over a representative corpus, these methods can be readily used to analyze new text and find semantic constructs (e.g., king : man = queen : woman) which can be useful for automated taxonomy creation. Classical word2vec methods are generally unsupervised, requiring no domain information, and as such have broad applicability. However, for highly specialized domains (such as disease surveillance) with moderately sized corpora, classical methods such as SGNS [48] fail to find meaningful semantic relationships (see Table 1.1). On the other hand, a disease vocabulary driven word2vec method (Dis2Vec) generates more meaningful semantic constructs (see Table 1.1) which can be used for such disease knowledge extraction.

Motivated by this, the third problem of this dissertation is aimed at developing a disease vocabulary driven word2vec model (Dis2Vec) [24] to model diseases and constituent attributes (symptoms, transmission agents, transmission methods and exposures) as word embeddings from the HealthMap news corpus. We use these word embeddings to automatically create disease taxonomies and evaluate our model against corresponding human annotated taxonomies. We compare our model accuracies against several state-of-the-art word2vec methods. Our results demonstrate that Dis2Vec outperforms traditional distributed vector representations in its ability to faithfully capture taxonomical attributes across different classes of diseases such as endemic, emerging and rare.

Table 1.1: Disease-specific semantic constructs captured by SGNS and Dis2Vec

| Semantic relation | SGNS | Dis2Vec |
|---|---|---|
| vec(malaria) - vec(vectorborne) = vec(whooping cough) - vec(??) | cerro | direct contact |
| vec(sars) - vec(zoonotic) = vec(salmonella) - vec(??) | heidelberg | foodborne |
| vec(typhoid) - vec(waterborne) = vec(mers) - vec(??) | cov | zoonotic |
| vec(dengue) - vec(vectorborne) = vec(polio) - vec(??) | eradication | direct contact |
| vec(chicken pox) - vec(droplet) = vec(campylobacter) - vec(??) | microorganism | foodborne |
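The relations in Table 1.1 are analogy queries: the missing term is the vocabulary word whose vector lies closest to, for example, vec(whooping cough) - vec(malaria) + vec(vectorborne). The sketch below shows one way such a query could be answered, assuming cosine similarity as the closeness measure. The function name `solve_analogy` and the hand-made three-dimensional vectors are illustrative only; they are not the Dis2Vec implementation or its trained embeddings.

```python
import numpy as np


def solve_analogy(emb, a, a_attr, b):
    """Return the word x that best satisfies vec(a) - vec(a_attr) = vec(b) - vec(x).

    emb maps each word to a 1-d numpy vector. Closeness is measured with
    cosine similarity (an assumption of this sketch).
    """
    target = emb[b] - emb[a] + emb[a_attr]
    best_word, best_score = None, -np.inf
    for word, vec in emb.items():
        if word in (a, a_attr, b):  # never return one of the query words
            continue
        score = np.dot(vec, target) / (np.linalg.norm(vec) * np.linalg.norm(target))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Hand-made toy vectors, purely illustrative (not learned from any corpus).
toy = {
    "malaria": np.array([1.0, 0.0, 1.0]),
    "vectorborne": np.array([0.0, 0.0, 1.0]),
    "whooping cough": np.array([1.0, 1.0, 0.0]),
    "direct contact": np.array([0.0, 1.0, 0.0]),
    "eradication": np.array([1.0, 0.0, 0.0]),
}

print(solve_analogy(toy, "malaria", "vectorborne", "whooping cough"))
```

With learned embeddings, the same query would run over the full vocabulary rather than five toy entries.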

1.2 Related Work

The prior work related to the text mining methods proposed in this dissertation can beplaced in a few categories. We describe each of them in turn.

1.2.1 Spatio-temporal topic models

Related work for our first problem falls into the categories of spatio-temporal topic models and the use of topic models to detect or forecast outbreaks. To the best of our knowledge, most existing topic models consider spatial or temporal trends in isolation and do not examine both types of trends jointly. A number of methods have been introduced for analyzing the time evolution of topics in document collections, such as the topics over time (TOT) model [79], dynamic topic models (DTM) [6], and the TriMine model [45]. TOT handles time-windows of fixed size and utilizes a Beta distribution to model the temporal evolution of topics. Unlike TOT, DTM uses Kalman filters to model temporal trends of topics over a time-window of fixed size. Finally, TriMine is able to examine windows of variable size in order to detect cyclic time patterns with disparate timescales, enabling forecasting of future events. A different line of work, Spatial Latent Dirichlet Allocation (SLDA) [78], discovers spatial patterns jointly with the word co-occurrences. Even though the model focuses on computer vision applications where the documents are represented by visual words, the proposed techniques can also be extended for use in regular text documents. Ramage et al. [58] introduced a similar approach for annotated documents where the annotations can correspond to locations.


1.2.2 Word embeddings (word2vec)

The related work of interest for our second and third problems comes primarily from the field of neural-network based word embeddings and their applications in a variety of NLP tasks. In recent years, we have witnessed a tremendous surge of research concerned with representing words from unstructured corpora as dense low-dimensional vectors, drawing inspiration from neural-network language modeling [5, 15, 50]. These representations, referred to as word embeddings, have been shown to perform with considerable accuracy and ease across a variety of linguistic tasks [3, 16, 72].

Mikolov et al. [47, 48] proposed the skip-gram model, currently a state-of-the-art word embedding method, which can be trained using either hierarchical softmax (SGHS) [48] or the negative sampling technique (SGNS) [48]. Skip-gram models have been found to be highly efficient in finding word embedding templates from huge amounts of unstructured text data and uncovering various semantic and syntactic relationships. Mikolov et al. [48] also showed that such word embeddings have the capability to capture linguistic regularities and patterns. These patterns can be represented as linear translations in the vector space. For example, vec(Madrid) - vec(Spain) + vec(France) is closer to vec(Paris) than to the vector of any other word in their corpus [49, 40]. Levy et al. [41] analyzed the theoretical foundation of the skip-gram model and showed that the training method of SGNS can be converted into a weighted matrix factorization and that its objective induces an implicit factorization of a shifted PMI matrix, i.e., the well-known word-context PMI matrix [4, 73] shifted by a constant offset. Levy et al. [42] performed an exhaustive evaluation showing the impact of each parameter (window size, context distribution smoothing, sub-sampling of frequent words and others) on the performance of SGNS and other recent word embedding methods, such as GloVe [56]. They found that SGNS consistently profits from a larger number of negative samples (> 1), showing significant improvement on various NLP tasks with higher values of negative samples.
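To make the shifted PMI matrix concrete: its entries are PMI(w, c) - log k, where PMI(w, c) = log [P(w, c) / (P(w) P(c))] and k is the number of negative samples. The sketch below computes this matrix from a small word-context co-occurrence table. The function name `shifted_pmi` and the 2 × 2 counts are made up for illustration; the code shows only the matrix itself, not the SGNS training procedure.

```python
import numpy as np


def shifted_pmi(counts, k):
    """Word-context PMI matrix shifted by log(k).

    counts: (words x contexts) co-occurrence counts; k: number of negative samples.
    Cells with zero co-occurrence evaluate to -inf.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_wc = counts / total                               # joint probabilities
    p_w = counts.sum(axis=1, keepdims=True) / total     # word marginals
    p_c = counts.sum(axis=0, keepdims=True) / total     # context marginals
    with np.errstate(divide="ignore"):                  # log(0) -> -inf without a warning
        pmi = np.log(p_wc) - np.log(p_w) - np.log(p_c)
    return pmi - np.log(k)


# Made-up counts: rows are words, columns are contexts.
counts = np.array([[4, 1], [1, 4]])
print(shifted_pmi(counts, k=5))
```

Increasing k lowers every entry by the same constant, which is the "constant offset" referred to above.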

1.2.3 Dependency parsing

Prior work related to our second problem also falls into the categories of dependency-based syntactic parsing and its applications in a variety of NLP tasks. This technique takes a sentence as input and outputs a dependency graph structure. Since in this dissertation we focus on utilizing dependency parsing for extracting information from natural language text, we briefly introduce the related work on dependency parsing and its applications in NLP tasks such as relation extraction [85, 11], word embeddings [39] and negation detection [51, 69, 2].

Wu et al. [85] introduced WOE, a new approach to Open IE that utilizes Wikipedia for self-supervised learning of unlexicalized relation extractors <subject, relation, object>. They showed that dependency parser based features improved the precision and recall of WOE in comparison to POS tag features. Bunescu et al. [11] focused on extracting relations between predefined entities by using the shortest path between them in the undirected dependency graph.
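The shortest-path idea can be illustrated with a breadth-first search over an undirected graph built from (head, dependent) pairs. The sketch below is a minimal illustration, not the method of [11]: the edge list is written by hand for a made-up sentence rather than produced by a parser, and the function name `shortest_dependency_path` is hypothetical.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Breadth-first search over the undirected graph induced by (head, dependent) edges."""
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)   # ignore edge direction
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in sorted(graph.get(node, ())):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # the two words are not connected


# Hand-written edges for "The patient was admitted to a hospital in Riyadh"
# (an assumed parse, for illustration only).
edges = [
    ("admitted", "patient"), ("patient", "The"), ("admitted", "was"),
    ("admitted", "hospital"), ("hospital", "to"), ("hospital", "a"),
    ("hospital", "Riyadh"), ("Riyadh", "in"),
]
print(shortest_dependency_path(edges, "patient", "Riyadh"))
```

The path skips function words such as "to" and "a" that sit between the two entities in the surface word order.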

Previous works on neural embeddings (including the skip-gram model) define the contexts of a word to be its linear context (words preceding and following the target word). Levy et al. [39] generalized the skip-gram model and used syntactic contexts derived from automatically generated dependency parse-trees. These syntactic contexts were found to capture more functional similarities, while the bag-of-words nature of the contexts in the original skip-gram model generates broad topical similarities.

Finally, dependency parsing has also emerged as a powerful tool in the field of negation detection. Ballesteros et al. [2] detect words affected by negation cues, such as no, not or nothing, by traversing dependency syntactic trees. In [51], Ou et al. used three different approaches for negation detection, of which the syntax-based approach used rules and negation patterns derived from the dependency output of the Stanford parser. Sohn et al. [69] used manually compiled negation rules derived from dependency paths for negation detection in clinical narratives. Unlike previous approaches [66], which searched for negation words within a fixed word distance, they argued that dependency based negation rules do not limit the negation scope to word distance; instead, they are based on syntactic context.
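The contrast between word distance and syntactic context can be seen in a small example. The sketch below applies one deliberately simple rule: a word is treated as negated if a negation cue is attached to it, or to one of its ancestors, in the dependency tree. This rule, the function name `negated_words` and the hand-written edges are assumptions made for illustration; they are not the rule sets used in [2, 51, 69]. Words double as node identifiers, so the sketch also assumes no word repeats within the sentence.

```python
NEGATION_CUES = {"no", "not", "nothing"}


def negated_words(edges):
    """Return the words in the subtree of any head that has a negation cue as a dependent."""
    children = {}
    for head, dep in edges:
        children.setdefault(head, []).append(dep)
    negated = set()
    for head, dep in edges:
        if dep.lower() in NEGATION_CUES:
            stack = [head]
            while stack:                      # depth-first walk of the head's subtree
                node = stack.pop()
                if node.lower() in NEGATION_CUES or node in negated:
                    continue
                negated.add(node)
                stack.extend(children.get(node, []))
    return negated


# Hand-written (head, dependent) edges for
# "The patient had no fever but reported cough" (an assumed parse).
edges = [
    ("had", "patient"), ("patient", "The"), ("had", "fever"),
    ("fever", "no"), ("had", "reported"), ("reported", "cough"),
]
print(negated_words(edges))
```

A fixed window of, say, five words after "no" would also cover "cough", whereas the tree-based rule restricts the scope to "fever".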

1.3 Organization of the dissertation

The structure of this dissertation document is described below.

Chapters 2 and 3 present temporal topic models for extracting temporal dynamics of infectious diseases from HealthMap news articles. Given uniform news coverage, the temporal topic trends can be used for monitoring disease emergence and progression during outbreak periods.

In Chapter 4, we introduce GELL, a method for automated construction of epidemiological line lists from WHO DONs with specific focus on emerging diseases, such as MERS in Saudi Arabia. We also propose a framework which uses the automatically extracted line list features augmented with surveillance case counts for forecasting outbreaks of emerging diseases.

Finally, in Chapter 5, we outline a word2vec approach that uses a pre-specified disease-related vocabulary to generate disease-specific word embeddings from the HealthMap corpus. The disease-specific word embeddings are used to characterize infectious diseases, and the resulting characterizations are evaluated for accuracy against human annotated ones.


[Figure 1.2 diagram. Online News Media is accessed through two sources, HealthMap and WHO DONs. HealthMap → Temporal topic models → Capturing temporal trends of rare diseases; HealthMap → Temporal topic models with prior knowledge → Capturing temporal trends of multiple diseases with diverse characteristics; HealthMap → Dis2Vec → Automated Disease Taxonomies; WHO DONs → GELL → Automated construction of line lists.]

Figure 1.2: Outline of this dissertation showing three text analytics methods for infectious disease surveillance using online news media.

Chapter 2

Monitoring Rare Disease Outbreaks using Spatio-temporal Topic Model

2.1 Introduction

In this chapter, we describe our efforts at using a spatio-temporal topic model for monitoring rare disease (hantavirus) outbreaks in Latin American countries. Most prior work focuses on detecting outbreaks of common endemic diseases, such as influenza, by discovering temporal patterns of a pre-defined group of keywords. However, in our scenario, the major challenge is that keyword-based techniques have significant limitations in monitoring rare disease outbreaks. As incidences are rare, related keywords may be scarce over time in comparison to endemic and emerging diseases. Therefore, it is difficult for keyword-based techniques to identify temporal patterns for a diverse set of diseases and predict new outbreaks in a timely manner. In Figure 2.1, we compare the number of mentions over time for the set of hantavirus-specific keywords ‘hanta’, ‘hantavirus’, ‘roedores’, ‘ratones’, and ‘cardiopulmonar’ with the actual timeline of hantavirus incidences for each country. The actual hantavirus incidences were obtained from a third-party gold standard. Figure 2.1(a) shows the timeline of hantavirus incidences in the four countries, while Figure 2.1(b) and Figure 2.1(c) show the timeline of word mentions for the aforementioned keyword set. There are cases where, despite a large number of hantavirus incidences, the number of keyword mentions is low. Also, the two timelines are not aligned, with spikes in the keyword timeline appearing with a delay after spikes in the actual incidence timeline.


Figure 2.1: Timeline of hantavirus outbreaks and keyword mentions from January 2013 to February 2014

2.2 Spatio-temporal Topic Model

Motivated by the limitations of keyword-based techniques, Rekatsinas and Ghosh et al. proposed a spatio-temporal topic model [27, 61] for monitoring rare disease (hantavirus) outbreaks in four Latin American countries (Chile, Brazil, Uruguay and Argentina) from January 2013 to March 2014. These four countries were specifically chosen because no hantavirus incidences were observed in other Latin American countries during this period. The topic model explicitly models time and location jointly with the word co-occurrence patterns over news articles from multiple data sources. This is done by incorporating both spatial and temporal components into the basic Latent Dirichlet Allocation (LDA) framework [7]. The topic model uses location and topic specific distributions to model the generation of words and time-stamps. Topic discovery is influenced not only by word co-occurrences, but also by spatial and temporal information. The graphical model representation of the temporal topic model is shown in Figure 2.2. In Figure 2.2, we see that each source entry is associated with a location and each location is defined as a multinomial distribution over disease topics that is randomly sampled from a Dirichlet prior distribution with a pre-specified hyperparameter. Each topic is in turn associated with two distributions: (a) a multinomial distribution over all the words in the vocabulary and (b) a multinomial distribution over time-stamps. Each of these two multinomial distributions is sampled from a Dirichlet prior distribution with pre-specified hyperparameters. For experimental purposes, the number of topics was fixed at 12.

Figure 2.2: Graphical Representation of the unsupervised temporal topic model used for detecting rare disease outbreaks.


2.3 Rare Topic Discovery

The HealthMap corpus used for experimental evaluations in [61] contains mentions of both common (avian influenza, dengue, swine flu) and rare (hantavirus, cholera, yellow fever) diseases over multiple countries in Latin America. Upon evaluating the topics discovered by the temporal topic model, the authors found that 6 out of the 12 topics are related to the diseases mentioned above, while the rest are background topics related to non-disease aspects of the news articles. The authors focused only on the disease-related topics. To evaluate the disease topics, the authors considered a vocabulary of 184 health-related words and examined the most likely words based on the health-related vocabulary and their per-topic probabilities. Figure 2.3 shows three topics related to hantavirus and their most likely words based on the health-related vocabulary. The first topic refers to the HPS syndrome, with words such as ‘pneumonia’, ‘sangre’ (blood), and ‘cardiopulmonar’ being ranked higher. We see that the temporal topic model is able to retrieve the correlation between the words ‘hanta’ and ‘ratones’ (mice) successfully. The second topic focuses on the HFRS syndrome, with words such as ‘nariz’ (nose), ‘estornudar’ (sneeze), and ‘renal’ being more prevalent. Finally, the third topic focuses on the hantavirus transmission routes, with words such as ‘lixo’ (garbage), ‘criaderos’ (breeding places), ‘manos’ (hands) and ‘roedores’ (rodents) being ranked higher than others. In Figure 2.4, we also observe three topics related to avian flu, dengue and swine flu. For all three topics we see that the corresponding disease keywords, i.e., ‘influenza’, ‘dengue’ and ‘gripe’ (flu), are ranked first. For the dengue topic, the temporal topic model is able to discover the correlation among words referring to both the causes, i.e., ‘mosquito’, ‘larvas’, ‘zancudos’ (mosquitos), and the symptoms, i.e., ‘fiebre’ (fever), of the disease. This indicates that the temporal topic model proposed in [61] is able to detect both common disease and rare disease topics from the HealthMap corpus.

2.4 Temporal and Spatial Patterns of Rare Topics

Figure 2.5 shows the temporal patterns discovered by the temporal topic model for both common disease and rare disease topics. Focusing on the temporal patterns, we observe that the HFRS and the hanta transmission topics show small fluctuations across the different time points. However, we observe that the HPS topic follows a trend similar to that of the hantavirus incidence timeline (Figure 2.1(a)). More precisely, we see that the prominence of this topic peaks towards the end of May 2013 and from December 2013 to March 2014, exactly during the months when the number of hantavirus incidences increases. Similar results were observed for topics related to common diseases.

Finally, we examine the correlations between the discovered topics and the countries in Latin America under consideration. Figure 2.6 shows the prominence of each topic for Brazil, Chile, Uruguay and Argentina. As expected, we observe that in Chile, HPS and HFRS are more prominent, while in Brazil the dengue topic is prominent, as Brazil is prone to dengue outbreaks throughout the year.

Figure 2.3: Three discovered topics and their most likely words related to Hantavirus.

Figure 2.4: Three discovered topics that are related to Influenza (Avian Flu), Dengue and Swine Flu.

2.5 Limitations

The temporal topic model described above is completely unsupervised, i.e., it does not incorporate any prior knowledge about diseases and therefore exhibits a lack of robustness when it comes to monitoring multiple diseases (a mix of endemic, rare and emerging diseases) in a specific country.


Figure 2.5: Histogram showing the temporal patterns of all disease topics, including the rare disease hantavirus, discovered by the unsupervised temporal topic model

• Topic Identification or Interpretation: One of the limitations is the identification or interpretation of each topic. Different runs of the unsupervised topic model will result in disparate topic distributions, and the user will have to perform post-analysis of the word distribution in each topic for its interpretation.

Figure 2.6: The country specific topic prominence for different rare and endemic disease topics averaged over states.

• Choice of number of topics: Secondly, the choice of the number of topics is difficult in unsupervised models. In [61], the authors had to manually evaluate the unsupervised topic model with K ∈ {8, 12, 15} and found that K = 12 resulted in more meaningful topics. Ideally, for monitoring the progression of K diseases, the number of input topics should be K + 1, one topic for each disease and the remaining topic capturing the background noise in the data. In such scenarios, the natural tendency of unsupervised models is to describe only the most frequent topics in the corpus.

• Underrepresentation of rare topics: Finally, if certain diseases are underrepresented in the corpus, they may get ignored in unsupervised models or they may get distributed over multiple topics, e.g., in Figure 2.5, we observe 3 topics related to the rare disease hantavirus and only one topic for each common disease (avian influenza, dengue, swine flu).

Chapter 3

Temporal Topic Model with Prior Disease Knowledge for Assessing Associations between News Trends and Multiple Infectious Disease Outbreaks

3.1 Introduction

Motivated by the limitations of the unsupervised temporal topic model described in Chapter 2, in this chapter we outline our efforts at using a supervised temporal topic model with prior disease knowledge for (near) real-time monitoring and estimation of outbreaks of multiple diseases with diverse characteristics (a mix of endemic, rare and emerging diseases) in a specific country. Our key contributions are as follows.

• We introduce EpiNews [26], a generic temporal framework for analyzing disease-related news reports using a supervised topic model. The supervised topic model discovers multiple disease topics of interest and their associated temporal trends of prominence in news media.

• EpiNews captures trends in disease progression, such as periodicity, peaks and troughs, via temporal trends of disease topics in news media.

• When news coverage is adequate, EpiNews also estimates disease incidence before official reports by health agencies using time-series regression models interposed over the temporal trends of disease topics.


We validated our method against disease case count reports available from public health agencies in the U.S., China, and India. Disease-related news articles were provided by HealthMap [22], an internationally recognized, global disease alert system capturing outbreak reports from over 200,000 electronic news sources. EpiNews was evaluated on multiple outbreaks in the recent past, such as whooping cough in the U.S. (2012) [14], periodic outbreaks of avian influenza A(H7N9) [86, 23] and hand, foot, and mouth disease (HFMD) in China (2013 and 2014), periodic outbreaks of acute diarrheal disease (ADD) in India (2013 and 2014), and major dengue outbreaks in China (2014) [67] and India (2013). Our experiments indicate that EpiNews was able to capture the dynamics of the mentioned outbreaks and estimate the case counts in many of these instances before official reports were published. However, inconsistent news coverage was found to adversely affect the performance of our approach.

3.2 Materials and Methods

3.2.1 Data sources

In this section, we discuss the data sources used to analyze the infectious disease outbreaks. We first describe the case count reports collected from public health agencies and then describe the HealthMap data used in this study.

Disease case counts. For each country, we collected case count data corresponding to multiple diseases over a certain time period. In Table 3.1, we show the disease names (along with methods of transmission), health agencies from which case counts were collected, time period over which case counts were obtained and temporal granularity (daily, weekly, monthly or yearly) of the obtained case counts corresponding to each country.

HealthMap. Disease-related news articles were found to be indicative of infectious disease outbreaks [61]. We collected such articles related to the diseases mentioned in Table 3.1, for each country under consideration, from HealthMap. The HealthMap corpus is a publicly available database from which we collected the disease-related articles reported during the time period of interest. Each article contains the reported date and the corresponding location information in the form of (lat, long) coordinate pairs. We converted the location coordinates to location names (country, state) via reverse geocoding. Reverse geocoding is defined as the process of finding a readable address or place name for a given (lat, long) pair. For example, (26.562851, −81.949532) was converted to (United States, Florida) after reverse geocoding. Each HealthMap article was passed through a series of preprocessing steps. For China, the majority (87.94%) of the articles were published in either Traditional Chinese or Simplified Chinese. We translated the textual content of these articles to English for ease of analysis using Google translate (https://translate.google.com/). Because of the unavailability of ground truth for these articles, we could not validate the performance of Google translate in this context. However, Google translate is one of the state-of-the-art commercial machine translation systems in use today. Recent advances in deep learning and neural machine translation have made it a reliable tool for Chinese-to-English translation. For more details, see http://www.androidauthority.com/google-translate-machine-learning-chinese-718813/. Prior research [77, 52, 76] on Chinese sentiment analysis has shown that using Google translate to translate Chinese reviews into English reviews improves the sentiment classification performance. In Pak et al. [52], Google translate also yielded better results for the sentiment classification task in comparison to another commercial machine translation service named Yahoo Babelfish (http://babelfish.yahoo.com/).

Table 3.1: Disease names (along with routes of transmission), health agencies from which case counts were collected, time period over which case counts were obtained and temporal granularity (daily, weekly, monthly or yearly) of the obtained case counts corresponding to each country. H7N9 stands for avian influenza A, ADD stands for acute diarrheal disease and HFMD stands for hand, foot, and mouth disease.

| Country | Disease names (methods of transmission) | Health agency | Time period | Temporal granularity |
|---|---|---|---|---|
| U.S. | Whooping cough (airborne, direct contact); Rabies (zoonotic); Salmonellosis (food-borne); E. coli infection (waterborne, food-borne) | Project Tycho [74] (https://www.tycho.pitt.edu/) | January 2010 - December 2013 | Weekly |
| China | H7N9 (zoonotic); HFMD (direct contact, airborne); Dengue (vector-borne) | National Health and Family Planning Commission (http://en.nhfpc.gov.cn/) | | Monthly |
| India | ADD (food-borne); Dengue (vector-borne); Malaria (vector-borne) | Integrated Disease Surveillance Programme (http://www.idsp.nic.in/) | | Weekly |

3.2.2 EpiNews

In this section, we describe in detail the components of our proposed framework EpiNews. In Figure 1, we show a flowchart depicting the sequential modeling process in EpiNews. The first component is the HealthMap preprocessing step, which takes the HealthMap corpus as input and outputs the set X where each element represents a three-dimensional tuple of the form {word(w), location(l), timepoint(t)} : count. The second component, referred to as temporal topic modeling, is used to extract temporal topic trends from X. The final component, referred to as EpiNews-ARNet, is responsible for generating estimates of disease case counts using past available case counts and temporal topic trends extracted by the supervised topic model.


HealthMap preprocessing

The first component of EpiNews deals with the preprocessing of HealthMap articles through the series of steps shown below.

1. Extracting main textual content. In the first step, we extracted the main textual content of each article using Dragnet [57] and Goose (https://github.com/grangier/python-goose), ignoring the non-textual elements such as images within the article.

2. Tokenization and lemmatization. Tokenization [81, 68] is the process of segmenting textual content into words, phrases, symbols or other meaningful elements commonly referred to as tokens. Lemmatization [34] is performed after tokenization and can be defined as the normalization process in which various inflected forms of a word are converted to the same underlying lemma so that they can be analyzed as a single term. For example, terms such as travel, traveled, travels, TRAVEL, traveling, Travelling, travelling, travelled, Travel, Traveling were converted to the same underlying lemma travel. In the second step, both tokenization and lemmatization were performed on the extracted textual content using BASIS Technologies’ Rosette Language Processing (RLP) tools [59, 21] to generate a set of unique words or phrases corresponding to each article.

3. Uppercase to lowercase. In the third step, we converted the uppercase letters in each extracted word to lowercase letters. For example, both the terms Salmonella and salmonella convey the same meaning, so they were converted to a single term salmonella.

4. Removal of stop words. In the final step, we removed all the stop words such as in, by, of, at, all, etc. from the set of unique words or phrases extracted from each article.

The set of unique words in these processed articles was found to contain general (e.g., cold, contagious, nausea, blood, food-borne, waterborne, sanitation) as well as specific (e.g., rabies, whooping, h7n9, dengue, salmonella, malaria) disease-related terms. In Table 3.2, we show the country-wise distribution of the total number of HealthMap news articles along with unique words and location names extracted from all the corresponding articles.
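Steps 2 to 4 of the preprocessing pipeline can be sketched in a few lines of Python. The sketch makes several simplifying assumptions: tokenization is a regular expression, lemmatization is a tiny hypothetical lookup table standing in for the RLP tools used in this work, the stop word set is a small illustrative subset, and lowercasing is applied before the lemma lookup so that the table stays small. The names `preprocess`, `LEMMAS` and `STOP_WORDS` are hypothetical.

```python
import re

# Illustrative subset of stop words (an assumption, not the list used in the study).
STOP_WORDS = {"in", "by", "of", "at", "all", "the", "a", "and", "was"}

# Hypothetical lemma lookup standing in for a real lemmatizer.
LEMMAS = {
    "traveled": "travel", "travels": "travel", "traveling": "travel",
    "travelling": "travel", "travelled": "travel",
}


def preprocess(text):
    """Return the set of unique, lowercased, lemmatized, non-stop-word tokens in text."""
    # Keep hyphenated terms such as "food-borne" together as one token.
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    words = set()
    for tok in tokens:
        tok = tok.lower()               # uppercase to lowercase
        tok = LEMMAS.get(tok, tok)      # map inflected forms to their lemma
        if tok not in STOP_WORDS:       # drop stop words
            words.add(tok)
    return words


print(preprocess("Salmonella was reported in Kansas; all TRAVELLING patients traveled by bus."))
```

Returning a set mirrors the text's description of each article as a set of unique words.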

Following Rekatsinas et al. [61], the processed corpus for each country was transformed to a collection of tuples of the form {w, l, t} : count, where count is the number of news articles mentioning the word w associated with the location l and time point t in the tuple. For this transformation, we assumed that for each country, each processed article consists of words from a vocabulary V, corresponds to a discretized time window t ∈ {1, 2, · · · , T} and is geotagged with a location l from a set of locations L in the country. For China, disease case counts were available at a monthly granularity and as such each time point t represents a period of 1 month. However, for diseases in the U.S. and India, case counts were obtained on a weekly basis and as such time point t represents a period of 1 week or, more specifically, an epidemiological week (hereafter referred to as epi week). For example, the tuple (salmonella, (United States, Kansas), 2013-10-06) : 9 denotes that the word salmonella was mentioned in 9 articles referring to the state of Kansas in the U.S. over the epi week extending from 6th October 2013 to 12th October 2013. For each country, let N_l represent the collection of tuples for each location l ∈ L and X denote the set of all tuple collections N_l until time point T. This transformed set X was analyzed to extract the temporal trends of disease topics as discussed in the following section. Both N_l and X were updated for each country as we proceeded along the time window.

Table 3.2: Country-wise distribution of the total number of HealthMap news articles along with unique words and location names extracted from all the corresponding articles.

| Country | Total number of HealthMap news articles | Total number of unique words | Total number of unique location names or (country, state) pairs |
|---|---|---|---|
| China | 11,209 | 21,879 | 30 |
| India | 1,204 | 17,160 | 30 |
| U.S. | 9,872 | 59,687 | 51 |
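The transformation into {w, l, t} : count tuples amounts to counting, for every (word, location, time point) triple, the number of articles that mention the word. The sketch below illustrates this for weekly time points. It assumes epi weeks run from Sunday to Saturday, which matches the 6th to 12th October 2013 example above; the function names and the three sample articles are made up for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def epi_week_start(day):
    """Sunday on or before `day` (assumes Sunday-to-Saturday epi weeks)."""
    # date.weekday() returns Monday=0 ... Sunday=6
    return day - timedelta(days=(day.weekday() + 1) % 7)


def build_tuples(articles):
    """Count articles per (word, location, epi week start).

    articles: iterable of (words, location, published_date) triples.
    """
    counts = Counter()
    for words, location, published in articles:
        week = epi_week_start(published)
        for w in set(words):            # each article counts at most once per word
            counts[(w, location, week)] += 1
    return counts


# Made-up articles for illustration.
kansas = ("United States", "Kansas")
articles = [
    (["salmonella", "outbreak", "salmonella"], kansas, date(2013, 10, 7)),
    (["salmonella"], kansas, date(2013, 10, 12)),
    (["salmonella"], kansas, date(2013, 10, 13)),
]
for key, count in sorted(build_tuples(articles).items(), key=str):
    print(key, ":", count)
```

For monthly data, as used for China, `epi_week_start` would be replaced by a function returning the first day of the month.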

Temporal topic modeling

The second component of EpiNews deals with the topic and pattern discovery problem. The set X of all tuple collections N_l can be treated as a three-dimensional matrix of size V × L × T where the dimensions are represented by words (size V), locations (size L) and time points (size T). Each element x_{w,l,t} in X represents the total number of articles mentioning the word w (w ∈ V) referring to location l (l ∈ L) over the time point t (t ∈ {1, 2, . . . , T}). We assume that each entry in a non-zero element x_{w,l,t} of X is associated with a latent disease topic and therefore, such hidden disease topics can be modeled in terms of the three dimensions of X. Our goal is to extract the hidden disease topics and their corresponding associations with each dimension of X. Following previous literature on topic models [7, 46, 61, 33], we implemented a supervised temporal topic model for this purpose. We supervise the discovery process of each disease topic by providing a set of prior words (also called seed words) [33]. These seed words are user-provided prior knowledge of each infectious disease and they encourage the topic model to find evidence of these disease topics in the HealthMap corpus. Seed words for each disease topic were extracted by examining the content of a subset of news articles mentioning the disease. Additionally, following similar techniques as in Chakraborty et al. [12], we also examined a number of expert websites, such as CDC and WHO, to identify the most important keywords for a particular disease. This supervised method helps in improving the discovery of word co-occurrences within each topic as the model tends to discover words that are related to the words in the seed set. Additionally, we model time and location jointly [61] with the word co-occurrence patterns. This enables tracking of temporal and spatial patterns of these disease topics in the news.

Generative process of the supervised topic model. Before going into the details of the generative process, we first define the notion of a topic in the supervised topic model. In unsupervised topic models [7, 61, 45, 64], each topic k is defined as a discrete probability distribution over all the words in the vocabulary V. In the supervised topic model, the notion of a topic is extended and defined as the convex combination of two discrete probability distributions: seed topic distribution and regular topic distribution [33]. The seed topic distribution can only generate words from the seed set S, and thus it is defined as a discrete probability distribution over only the words in the seed set S. On the other hand, the regular topic distribution has the freedom to generate any word including the seed words. So a regular topic is defined as a discrete probability distribution over all the words in the vocabulary V. Here we assume that each regular topic is associated with only one seed topic, i.e., there is a one-to-one correspondence between seed and regular topics.

Algorithm 1: Generative process of the supervised topic model

for each topic k ∈ {1, 2, . . . , K} do
    Draw φ^r_k ∼ Dirichlet(β^(k))
    Draw φ^s_k ∼ Dirichlet(µ^(k))
    Draw ξ_k ∼ Dirichlet(γ^(k))
    Draw π_k ∼ Beta(1, 1)
for each location l ∈ L do
    Draw θ_l ∼ Dirichlet(α^(l))
    for each entry i ∈ N_l do
        Draw topic z_i ∼ Discrete(θ_l)
        Draw indicator variable x_i ∼ Bernoulli(π_{z_i})
        Draw word w_i ∼ Discrete(φ^r_{z_i}) when x_i = 0 (regular topic), or w_i ∼ Discrete(φ^s_{z_i}) when x_i = 1 (seed topic)
        Draw timestamp t_i ∼ Discrete(ξ_{z_i})

The generative process of the supervised topic model is described in Algorithm (1). Given K disease topics, L locations and N_l for each l ∈ L, the supervised topic model uses location- and topic-specific discrete probability distributions to model the generation of the word and time point in each entry of N_l. To generate each entry i ∈ N_l for a location l ∈ L, we first sample a topic z_i (z_i ∈ {1, 2, · · · , K}) from the location-specific discrete probability distribution θ_l over K disease topics. To generate a word w_i, we choose either the seed topic distribution (φ^s) or the regular topic distribution (φ^r) corresponding to the sampled topic z_i. The indicator variable x_i sampled from Bernoulli(π_{z_i}) decides whether the word should be drawn from the seed topic distribution or the regular topic distribution. π_{z_i} is called the sampling probability for topic z_i. Once the distribution is chosen, the word w_i is generated from it. Finally, the time point t_i is drawn from the topic-specific discrete probability distribution ξ_{z_i} over the time points {1, 2, · · · , T}.
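The generative process can be sketched in a few lines of standard-library Python. This is an illustrative simulation only, not the code used for EpiNews: the function names, the representation of seed sets as lists of word ids, and the one-hot fallback for underflowing Gamma draws are all assumptions of this sketch.

```python
import random

def dirichlet(alpha, rng):
    """Sample from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    if total == 0.0:
        # Very small concentrations can underflow to all zeros; fall back to
        # a one-hot vector (a crude approximation, assumed for this sketch).
        hot = rng.randrange(len(alpha))
        return [1.0 if j == hot else 0.0 for j in range(len(alpha))]
    return [d / total for d in draws]

def discrete(probs, rng):
    """Sample an index from a discrete probability distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate(K, V, seed_sets, T, n_entries, alpha, beta, mu, gamma, rng):
    """Simulate entries (word, time, topic, indicator) for each location.

    seed_sets[k] lists the word ids of the seed words of topic k, alpha[l]
    is a K-vector, gamma[k] is a T-vector, and beta, mu are the values of
    the symmetric priors. n_entries[l] is the number of entries to draw.
    """
    phi_r = [dirichlet([beta] * V, rng) for _ in range(K)]
    phi_s = [dirichlet([mu] * len(seed_sets[k]), rng) for k in range(K)]
    xi = [dirichlet(gamma[k], rng) for k in range(K)]
    pi = [rng.betavariate(1, 1) for _ in range(K)]
    data = {}
    for l, n in enumerate(n_entries):
        theta = dirichlet(alpha[l], rng)
        entries = []
        for _ in range(n):
            z = discrete(theta, rng)                # topic of the entry
            x = 1 if rng.random() < pi[z] else 0    # 1: seed, 0: regular
            if x == 1:
                w = seed_sets[z][discrete(phi_s[z], rng)]
            else:
                w = discrete(phi_r[z], rng)
            t = discrete(xi[z], rng) + 1            # time points are 1..T
            entries.append((w, t, z, x))
        data[l] = entries
    return data
```

Words drawn through the seed path are always members of the seed set of the sampled topic, whereas the regular path can emit any word of the vocabulary, including seed words.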

Choice of priors. ξ_k (k ∈ {1, 2, . . . , K}) is drawn from an asymmetric Dirichlet prior [65, 75] parameterized by a T-dimensional vector γ^(k) as defined below in equation (3.1).

\[
\gamma^{(k)} = [\,N_{k,1} + \gamma',\; N_{k,2} + \gamma',\; \cdots,\; N_{k,t} + \gamma',\; \cdots,\; N_{k,T} + \gamma'\,] \tag{3.1}
\]

where N_{k,t} is the sum of the count variable across those tuples ({w, l, t′} : count) of X where the word w in the tuple is a seed word related to disease topic k, t′ is equal to the time point t in equation (3.1) and l refers to any location in the set L. In other words, N_{k,t} accounts for the occurrence of seed words related to topic k in X at time point t. Higher occurrence of seed words indicates higher prominence of topic k at time point t and vice versa. Therefore, the asymmetric prior γ^(k) is used to incorporate prior information into the supervised topic model regarding the prominence of disease topic k at different time points. The hyperparameter γ′ in equation (3.1) is an additional smoothing parameter that contributes a flat pseudocount to each component of γ^(k). Additive smoothing is done to assign non-zero probabilities to those time points for which we have no prior information (zero occurrence of seed words) related to topic k.

θ_l is also associated with an asymmetric Dirichlet prior parameterized by a K-dimensional vector α^(l) as defined below in equation (3.2).

\[
\alpha^{(l)} = [\,N_{l,1} + \alpha',\; N_{l,2} + \alpha',\; \cdots,\; N_{l,k} + \alpha',\; \cdots,\; N_{l,K} + \alpha'\,] \tag{3.2}
\]

where N_{l,k} is the sum of the count variable across those tuples ({w, l′, t} : count) of X where the word w in the tuple is a seed word related to disease topic k, l′ is equal to the location l in equation (3.2) and t can be any time point in the range {1, 2, · · · , T}. In other words, N_{l,k} accounts for the occurrence of seed words related to topic k in N_l. The hyperparameter α′ is the additional smoothing parameter that contributes a non-zero pseudocount to each component of α^(l). Additive smoothing is done to assign non-zero probabilities to those locations for which we have no prior information (zero occurrence of seed words) related to topic k.
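The two asymmetric priors of equations (3.1) and (3.2) amount to seed-word counts plus a flat pseudocount. The following sketch shows one way to compute them from the tuple set X; the function name, the dictionary layout of X and the toy values are assumptions made for illustration.

```python
def asymmetric_priors(X, seed_words, locations, T, alpha_s, gamma_s):
    """Build alpha^(l) and gamma^(k) from seed-word counts.

    X maps (word, location, t) to an article count, with t in 1..T.
    seed_words[k] is the set of seed words of topic k. alpha_s and gamma_s
    are the smoothing hyperparameters (alpha' and gamma' in the text).
    Returns alpha[l][k] = N_{l,k} + alpha' and gamma[k][t-1] = N_{k,t} + gamma'.
    """
    K = len(seed_words)
    alpha = {l: [alpha_s] * K for l in locations}
    gamma = [[gamma_s] * T for _ in range(K)]
    for (w, l, t), count in X.items():
        for k in range(K):
            if w in seed_words[k]:
                alpha[l][k] += count      # seed evidence for topic k at l
                gamma[k][t - 1] += count  # seed evidence for topic k at t
    return alpha, gamma

# Toy tuple set, for illustration only.
X = {
    ("salmonella", "KS", 1): 9,
    ("meat", "KS", 1): 3,        # not a seed word: ignored by the priors
    ("dengue", "KS", 2): 1,
    ("salmonella", "TX", 2): 4,
}
seed_words = [{"salmonella"}, {"dengue"}]
alpha, gamma = asymmetric_priors(X, seed_words, ["KS", "TX"], T=2,
                                 alpha_s=1.0, gamma_s=0.01)
```

Time points and locations without any seed-word mention keep only the pseudocount, which is what keeps their prior probability non-zero.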

Finally, the seed topic distribution (φ^s) and the regular topic distribution (φ^r) are drawn from symmetric Dirichlet priors [75] where each component of the parameter vectors µ^(k) (S-dimensional) and β^(k) (V-dimensional) assumes the values of the hyperparameters µ′ and β′ respectively, i.e., µ^(k) = [µ′, µ′, · · · , µ′] and β^(k) = [β′, β′, · · · , β′].


Choice of hyperparameters. A hyperparameter is defined as the parameter of a prior distribution. The hyperparameters α′, γ′, β′ and µ′ are set to 2/K, 0.01, 0.01 and 1e−07 respectively. These values are chosen heuristically, and an improved performance of the supervised topic model could be achieved via efficient hyperparameter optimization [75]. As suggested in Jagarlamudi et al. [33], we set the sampling probability π_k to a constant value of 0.7 for each topic k ∈ {1, 2, · · · , K}.

Inference via collapsed Gibbs sampling. The key problem in the supervised topic model is posterior inference. This amounts to reversing the defined generative process and inferring the output (latent) parameters θ, φ^r, φ^s and ξ given the observed tuples in N_l. A standard approach to posterior inference in topic models is collapsed Gibbs sampling [29], a Markov Chain Monte Carlo (MCMC) method.

To estimate the model parameters θ, φ^r, φ^s and ξ via collapsed Gibbs sampling, we need to compute the conditional probability distribution Pr(z_i = k | w, t, l, z_{−i}, α^(l), β^(k), µ^(k), γ^(k)) where z_i represents the topic assignment for the ith tuple or entry in N_l and z_{−i} represents the topic assignments for all entries in N_l except the ith entry. We have three scenarios as shown below.

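• If word w_i in the ith entry of N_l is a regular word and k is a regular topic, then the conditional probability distribution is defined below in equation (3.3).

\[
\Pr(z_i = k \mid w, t, l, z_{-i}, \alpha^{(l)}, \beta^{(k)}, \mu^{(k)}, \gamma^{(k)}) \propto
\frac{n^{k,-i}_{w_i} + \beta'}{\sum_{v=1}^{V} \left(n^{k,-i}_{v} + \beta'\right)} \cdot
\frac{m^{k,-i}_{t_i} + \gamma^{(k)}_{t_i}}{\sum_{t=1}^{T} \left(m^{k,-i}_{t} + \gamma^{(k)}_{t}\right)} \cdot
\frac{o^{l,-i}_{k} + \alpha^{(l)}_{k}}{\sum_{k'=1}^{K} \left(o^{l,-i}_{k'} + \alpha^{(l)}_{k'}\right)} \cdot
\frac{\left(\sum_{v=1}^{V} n^{k,-i}_{v} + \beta'\right) + \pi_k}{\left(\sum_{v=1}^{V} n^{k,-i}_{v} + \beta'\right) + \left(\sum_{v=1}^{S} s^{k,-i}_{v} + \mu'\right) + 2 \cdot \pi_k}
\tag{3.3}
\]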

• If word w_i in the ith entry of N_l is a regular word and k is a seed topic, then the conditional probability distribution Pr(z_i = k | w, t, l, z_{−i}, α^(l), β^(k), µ^(k), γ^(k)) = 0 since a regular word cannot be generated from any of the seed topic distributions.

• If word w_i in the ith entry of N_l is a seed word, then word w_i can be generated from either the seed topic k or the regular topic k. If word w_i is generated from a seed topic k, then the conditional probability distribution is defined below in equation (3.4). On the other hand, if word w_i is generated from a regular topic k, then the conditional probability distribution is defined below in equation (3.5).


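\[
\Pr(z_i = k \mid w, t, l, z_{-i}, \alpha^{(l)}, \beta^{(k)}, \mu^{(k)}, \gamma^{(k)}) \propto
\frac{s^{k,-i}_{w_i} + \mu'}{\sum_{v=1}^{S} \left(s^{k,-i}_{v} + \mu'\right)} \cdot
\frac{m^{k,-i}_{t_i} + \gamma^{(k)}_{t_i}}{\sum_{t=1}^{T} \left(m^{k,-i}_{t} + \gamma^{(k)}_{t}\right)} \cdot
\frac{o^{l,-i}_{k} + \alpha^{(l)}_{k}}{\sum_{k'=1}^{K} \left(o^{l,-i}_{k'} + \alpha^{(l)}_{k'}\right)} \cdot \pi_k
\tag{3.4}
\]

\[
\Pr(z_i = k \mid w, t, l, z_{-i}, \alpha^{(l)}, \beta^{(k)}, \mu^{(k)}, \gamma^{(k)}) \propto
\frac{n^{k,-i}_{w_i} + \beta'}{\sum_{v=1}^{V} \left(n^{k,-i}_{v} + \beta'\right)} \cdot
\frac{m^{k,-i}_{t_i} + \gamma^{(k)}_{t_i}}{\sum_{t=1}^{T} \left(m^{k,-i}_{t} + \gamma^{(k)}_{t}\right)} \cdot
\frac{o^{l,-i}_{k} + \alpha^{(l)}_{k}}{\sum_{k'=1}^{K} \left(o^{l,-i}_{k'} + \alpha^{(l)}_{k'}\right)} \cdot (1 - \pi_k)
\tag{3.5}
\]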

In equations (3.3), (3.4) and (3.5), n^{k,−i}_{w_i} denotes the number of times word w_i is assigned to regular topic k across all entries in N_l except the ith entry, s^{k,−i}_{w_i} denotes the number of times seed word w_i is assigned to seed topic k across all entries in N_l except the ith entry, m^{k,−i}_{t_i} denotes the number of times time point t_i is assigned to topic k across all entries in N_l except the ith entry and o^{l,−i}_k denotes the number of times location l is associated with topic k across all entries in N_l except the ith entry. α^(l)_k refers to the kth component of α^(l) and γ^(k)_{t_i} denotes the component of γ^(k) corresponding to time point t_i.

Implementing the collapsed Gibbs sampler. The collapsed Gibbs sampler for the supervised topic model is straightforward to implement. It involves setting up the required count variables and randomly initializing them, after which the Gibbs sampler executes in an iterative fashion: on each iteration a topic is sampled for each entry in N_l according to equation (3.3), or equations (3.4) and (3.5), depending on whether the word in the entry is a regular word or a seed word respectively. The required count variables include n^k_{w_i}, s^k_{w_i}, m^k_{t_i} and o^l_k corresponding to the ith entry in N_l. For simplicity and efficiency, we also keep a running count of n^k (= ∑_{v=1}^{V} n^k_v, the total number of times any word in vocabulary V is assigned to topic k), s^k (= ∑_{v=1}^{S} s^k_v, the total number of times any word in the set S of seed words is assigned to the corresponding seed topic k), m^k (= ∑_{t=1}^{T} m^k_t, the total number of times any time point t ∈ {1, 2, · · · , T} is assigned to topic k) and o^l (= ∑_{k=1}^{K} o^l_k, the total number of times any topic k ∈ {1, 2, · · · , K} is associated with location l). Finally, in addition to the mentioned count variables, we also require an array z which contains the topic assignment for each entry or tuple in N_l. Once we choose a topic for a particular entry in N_l, the chosen topic is set in the z array and the count variables are incremented in the appropriate position relevant to the entry.
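The bookkeeping described above can be sketched as follows. This is an illustrative reading of equations (3.3), (3.4) and (3.5), not the implementation used for EpiNews. Three assumptions are made here: a word that is not in the seed set of topic k is scored for that topic with equation (3.3), even if it is a seed word of another topic; the location denominator is dropped because it does not depend on k; and the sampling probability is a single constant pi shared by all topics. All names are hypothetical.

```python
import random
from collections import defaultdict

def gibbs(entries, K, V, T, seed_sets, alpha, gamma, beta, mu, pi, iters, rng):
    """Collapsed Gibbs sampling sketch for entries (w, t, l).

    w in 0..V-1 and t in 0..T-1; seed_sets[k] is the set of seed word ids
    of topic k; alpha[l] is a K-vector and gamma[k] a T-vector of prior
    values. Returns z[i] = (k, x), with x = 1 for the seed topic and x = 0
    for the regular topic, together with the count variables.
    """
    n = [[0] * V for _ in range(K)]   # n[k][w]: word w under regular topic k
    s = [[0] * V for _ in range(K)]   # s[k][w]: seed word w under seed topic k
    m = [[0] * T for _ in range(K)]   # m[k][t]: time point t under topic k
    o = defaultdict(lambda: [0] * K)  # o[l][k]: topic k at location l
    n_tot, s_tot, m_tot = [0] * K, [0] * K, [0] * K  # running totals

    def update(i, k, x, d):
        """Add d (+1 or -1) to every count touched by entry i."""
        w, t, l = entries[i]
        if x:
            s[k][w] += d
            s_tot[k] += d
        else:
            n[k][w] += d
            n_tot[k] += d
        m[k][t] += d
        m_tot[k] += d
        o[l][k] += d

    def weights(i):
        """Unnormalized conditional weights for entry i (already removed)."""
        w, t, l = entries[i]
        options, wts = [], []
        for k in range(K):
            shared = ((m[k][t] + gamma[k][t]) / (m_tot[k] + sum(gamma[k]))
                      * (o[l][k] + alpha[l][k]))
            reg = (n[k][w] + beta) / (n_tot[k] + V * beta)
            if w in seed_sets[k]:
                seed = (s[k][w] + mu) / (s_tot[k] + len(seed_sets[k]) * mu)
                options.append((k, 1))               # equation (3.4)
                wts.append(seed * shared * pi)
                options.append((k, 0))               # equation (3.5)
                wts.append(reg * shared * (1 - pi))
            else:
                mix = ((n_tot[k] + beta + pi)
                       / (n_tot[k] + beta + s_tot[k] + mu + 2 * pi))
                options.append((k, 0))               # equation (3.3)
                wts.append(reg * shared * mix)
        return options, wts

    z = []
    for i in range(len(entries)):           # random initialization
        z.append((rng.randrange(K), 0))
        update(i, z[i][0], z[i][1], +1)
    for _ in range(iters):
        for i in range(len(entries)):
            update(i, z[i][0], z[i][1], -1)  # counts now exclude entry i
            options, wts = weights(i)
            z[i] = rng.choices(options, weights=wts, k=1)[0]
            update(i, z[i][0], z[i][1], +1)
    return z, n, s, m, o
```

Decrementing the counts before sampling and incrementing them afterwards is what realizes the "−i" superscripts in the equations without recomputing any count from scratch.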

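Following the Gibbs iterations, the count variables can be used to compute the output (latent) parameters θ, φ^r, φ^s and ξ as shown below in equation (3.6).

\[
\begin{aligned}
\theta_{l,k} &= \frac{o^{l}_{k} + \alpha^{(l)}_{k}}{\sum_{k'=1}^{K} \left(o^{l}_{k'} + \alpha^{(l)}_{k'}\right)}, &
\phi^{r}_{k,w} &= \frac{n^{k}_{w} + \beta'}{\sum_{v=1}^{V} \left(n^{k}_{v} + \beta'\right)}, \\
\phi^{s}_{k,w} &= \frac{s^{k}_{w} + \mu'}{\sum_{v=1}^{S} \left(s^{k}_{v} + \mu'\right)}, &
\xi_{k,t} &= \frac{m^{k}_{t} + \gamma^{(k)}_{t}}{\sum_{t'=1}^{T} \left(m^{k}_{t'} + \gamma^{(k)}_{t'}\right)}
\end{aligned}
\tag{3.6}
\]

where θ_{l,k} represents the probability of topic k given location l, φ^r_{k,w} represents the probability of word w given regular topic k, φ^s_{k,w} represents the probability of seed word w given seed topic k and ξ_{k,t} denotes the temporal trend value of topic k at time point t. We ran the Gibbs sampler for 300 iterations.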
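Each estimate in equation (3.6) has the same shape: counts plus prior values, normalized to sum to one. A minimal sketch with a hypothetical helper and made-up counts:

```python
def normalize(counts, prior):
    """Return (count + prior) / sum(count + prior), component by component."""
    smoothed = [c + p for c, p in zip(counts, prior)]
    total = sum(smoothed)
    return [v / total for v in smoothed]

# Toy values for one topic k over T = 3 time points (illustrative only).
m_k = [3, 0, 1]                 # m^k_t: assignments of topic k per time point
gamma_k = [2.01, 0.01, 0.01]    # gamma^(k), as built in equation (3.1)
xi_k = normalize(m_k, gamma_k)  # temporal trend of topic k
```

The same helper would give θ_l from the counts o^l and the prior α^(l), and φ^r_k or φ^s_k from n^k or s^k with a constant prior vector of β′ or µ′.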

Estimation of disease case counts

The final component of EpiNews is concerned with estimation of disease case counts using relevant information such as past case counts and temporal topic trends (ξ). Let D be the disease of interest. Without loss of generality, let the zth disease topic correspond to D. Furthermore, let S_{D,T} denote the case counts of D and ξ_{z,T} denote the temporal trend value for the zth disease topic at a time point T. In general, reports of case counts published by health organizations are delayed (see Chakraborty et al. [12], Wang et al. [80]) and hence, at time point T case counts are available only till T′ < T with a delay δ = T − T′. However, temporal topic trend values (ξ_{z,1}, ξ_{z,2}, · · · , ξ_{z,T}) are available till T. Hence, we can formally define the case count estimation problem as estimating S_{D,T} using past case counts (S_D) available till T′ and temporal topic trends (ξ_z) available till T. In general, disease case counts have a publication delay of 1 time point (T′ = T − 1) and hence, estimating S_{D,T} at T is equivalent to 1-step ahead estimation.

EpiNews-ARNet. For 1-step ahead case count estimation, we used a regularized version of the autoregressive model with external input variables (ARX), where the external input variables are represented by the temporal topic trends (ξ_z). We used Elastic Net [88] as the regularization model in ARX. This estimating component of EpiNews is designated as EpiNews-ARNet and defined below in equation (3.7).


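\[
S_{D,T} = \underbrace{\sum_{i=1}^{p} \gamma_i S_{D,T-i}}_{\text{Internal component}}
+ \underbrace{\sum_{j=1}^{q} \eta_j \, g_r\!\left(\xi_{z,T-j+1+s}\right)}_{\text{External component}}
+ \varepsilon_{D,T}
\tag{3.7}
\]

where S_{D,T} is the case count for disease D being estimated at time point T, ε_{D,T} is the error term and γ_i, η_j are the regression coefficients fitted using Elastic Net constraints as given below in equation (3.8).

\[
\gamma_{\mathrm{opt}}, \eta_{\mathrm{opt}} = \arg\min_{\gamma,\eta}
\sum_{t'=0}^{T'} \left(S_{D,t'} - \hat{S}_{D,t'}\right)^2
+ \lambda_1 \sum_{i,j} \left|\gamma_i + \eta_j\right|
+ \lambda_2 \sum_{i,j} \left(\gamma_i + \eta_j\right)^2
\tag{3.8}
\]

Here \hat{S}_{D,t'} denotes the estimate of S_{D,t'} produced by equation (3.7).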

where λ_1 and λ_2 are the regularization coefficients for the L1 and L2 components of Elastic Net, respectively. The Elastic Net combines the properties of the Least Absolute Shrinkage and Selection Operator (LASSO) [71, 31] and Ridge regression [31] models. This combination allows for learning a sparse model like LASSO, while still maintaining the regularization properties of Ridge. If λ_1 equals 0, equation (3.8) reduces to a Ridge estimator. On the other hand, if λ_2 equals 0, equation (3.8) corresponds to a LASSO estimator.

There are broadly two components to equation (3.7), which capture different signals about the diseases as follows. (i) Internal component (p): This component is an autoregressive model that captures the signal embedded in past case counts and thus describes a delayed model. p indicates the order of autoregression. (ii) External component (q, r, s): This component can also be thought of as an autoregressive component over the temporal topic trends (ξ_z), where q is the number of time points to look back. The temporal topic trends are subjected to two additional transformations as follows. (a) Shift indicator (s): Often, the incidence of news reports is not concurrent with the incidence of diseases, as recorded in the case counts. EpiNews-ARNet incorporates this information by shifting the temporal topic trend value ξ_{z,T} by s steps. The shift can be positive (indicating a lagging trend), negative (indicating a leading trend) or zero (indicating a co-incident trend). (b) Rolling transformation (r): Disease case counts (S_D) do not follow a strictly linear relationship with temporal topic trends (ξ_z). One of the simplest remedies is to detrend the signals using differences of trend values instead of absolute values. However, our experiments showed that such differences taken over a single time point often lead to unstable estimates. As such, we define a rolling transformation g over a window length r, given below in equation (3.9).

\[
g_r(x_T) = x_T - x_{T-r} \tag{3.9}
\]

Essentially, such transformations aim to capture the changes in trend values over a period and were found to be more indicative than absolute values. We ran a cross-validation step to find the optimal (p, q, r, s) parameters.
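To make the roles of p, q, r and s concrete, the sketch below assembles the regressors of equation (3.7) for a single time point T. The function names and toy series are illustrative assumptions; fitting the coefficients is left to any Elastic Net solver and is not shown.

```python
def rolling_diff(x, t, r):
    """Rolling transformation g_r of equation (3.9): x[t] - x[t - r]."""
    if t - r < 0 or t >= len(x):
        raise IndexError("time index outside the available trend values")
    return x[t] - x[t - r]

def arx_row(cases, trend, T, p, q, r, s):
    """Regressors used to estimate the case count at time point T.

    cases[t] and trend[t] are indexed by time point t = 0, 1, 2, ...
    Returns the p lagged case counts followed by the q shifted and
    differenced topic trend values.
    """
    if T - p < 0:
        raise IndexError("not enough past case counts for order p")
    internal = [cases[T - i] for i in range(1, p + 1)]
    external = [rolling_diff(trend, T - j + 1 + s, r) for j in range(1, q + 1)]
    return internal + external

# Toy series, for illustration only.
cases = [5, 7, 9, 12, 20]
trend = [1, 1, 2, 4, 3, 2]
row_coincident = arx_row(cases, trend, T=4, p=2, q=2, r=1, s=0)
row_leading = arx_row(cases, trend, T=4, p=2, q=2, r=1, s=-1)
```

With s = 0 the external regressors end at the trend value for T itself, while s = −1 moves the whole window one step into the past, which corresponds to news trends that lead the case counts.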


Baselines. We compared EpiNews-ARNet with 2 baseline methods, namely Casecount-ARMA and EpiNews-ARMAX. In Casecount-ARMA, we fitted an autoregressive-moving-average model (ARMA(p, q) [8]) over past disease case counts to generate case count estimates as shown below in equation (3.10).

\[
S_{D,T} = \varepsilon_{D,T} + \sum_{i=1}^{p} \gamma_i S_{D,T-i} + \sum_{i=1}^{q} \theta_i \varepsilon_{D,T-i} \tag{3.10}
\]

where p and q are the orders of the autoregressive (AR) and moving average (MA) components, respectively. ε_{D,T}, ε_{D,T−1}, . . . , ε_{D,T−q} represent the white noise error terms. For further details including boundary conditions of ARMA, please refer to Box et al. [8]. Casecount-ARMA does not use any information related to temporal topic trends (ξ_z). However, in EpiNews-ARMAX, we used an autoregressive-moving-average model with external input variables (ARMAX(p, q) [8]). As shown below in equation (3.11), ARMAX(p, q) incorporates information from both past case counts and temporal topic trends (ξ_z) in order to estimate case counts. Similar to EpiNews-ARNet, external input variables are represented by the temporal topic trends (ξ_z).

\[
S_{D,T} = \varepsilon_{D,T} + \sum_{i=1}^{p} \gamma_i S_{D,T-i} + \sum_{i=0}^{p} \eta_i \xi_{z,T-i} + \sum_{i=1}^{q} \theta_i \varepsilon_{D,T-i} \tag{3.11}
\]

where p and q are the orders of the autoregressive (AR) and moving average (MA) components, respectively. For further details, please refer to Box et al. [8].

Converting temporal topic trends to sampled case counts. We described EpiNews-ARNet using the temporal topic trends or distribution (ξ_z) as the external input variables. It is to be noted that the disease case counts (S_D) and the temporal topic distribution (ξ_z) are typically at different numerical scales since values in a distribution range from 0 to 1. To improve numerical stability, we converted the temporal topic distributions to estimated case counts using multinomial sampling [35] over the time range. In multinomial sampling, samples are drawn from a multinomial distribution [35]. The case counts estimated via multinomial sampling from the temporal topic distributions are hereafter referred to as sampled case counts. To calculate the sampled case counts (Ξ_D) for disease D, the corresponding temporal distribution ξ_z for the zth topic was used as the multinomial distribution and the total number of case counts available till T′ < T at T (due to delay in reporting of case counts) was used as the number of samples to be drawn from the distribution. See Algorithm (2) for more details.


Algorithm 2: Multinomial sampling to convert temporal topic distribution to sampled case counts.

Input: Temporal topic distribution: ξ_{z,1}, . . . , ξ_{z,T}
       Total number of case counts till time point T′: TS_{D,T′} = ∑_{t′=0}^{T′} S_{D,t′}
Output: Sampled case counts from temporal topic distribution: Ξ_{D,1}, . . . , Ξ_{D,T}

p ← ξ_{z,1}, . . . , ξ_{z,T}
n ← TS_{D,T′}
Draw n time points 1 ≤ t_s ≤ T using multinomial sampling, where p is the multinomial distribution and n is the total number of samples to be drawn.
For each time point 1 ≤ t_s ≤ T, the sampled case count Ξ_{D,t_s} is calculated as the frequency of occurrence of t_s in the above n samples (time points) drawn from the multinomial distribution p.
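Algorithm 2 maps directly onto a weighted draw with replacement followed by a frequency count. The sketch below uses only the Python standard library; the function name and the toy distribution are assumptions for illustration.

```python
import random
from collections import Counter

def sampled_case_counts(xi, total_cases, rng):
    """Convert a temporal topic distribution into sampled case counts.

    xi[t - 1] is the trend value of time point t (t = 1..T) and
    total_cases is the number of case counts reported so far. Returns a
    list whose entry t - 1 is the sampled case count of time point t.
    """
    T = len(xi)
    # Draw total_cases time points, each with probability given by xi.
    draws = rng.choices(range(1, T + 1), weights=xi, k=total_cases)
    freq = Counter(draws)
    return [freq.get(t, 0) for t in range(1, T + 1)]

# Toy distribution over T = 4 time points (illustrative only).
xi = [0.1, 0.6, 0.3, 0.0]
sampled = sampled_case_counts(xi, total_cases=200, rng=random.Random(42))
```

The sampled counts always sum to the number of draws, so they share the scale of the reported case counts while following the shape of the topic trend.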

[Figure 3.1: Flow chart depicting the sequential modeling process in EpiNews. The HealthMap corpus passes through HealthMap preprocessing (first component), which yields a set of {w, l, t} : count tuples; temporal topic modeling (second component) turns these tuples into temporal topic trends; past case counts and temporal topic trends then feed the estimation of disease case counts, where Casecount-ARMA, EpiNews-ARMAX and EpiNews-ARNet each produce case count estimates.]


3.3 Results

In this section, we present an empirical evaluation of our proposed framework EpiNews. We first evaluated the disease topics discovered by the supervised topic model. Next, we analyzed whether the temporal topic trends (ξ) extracted by the supervised topic model are able to capture disease dynamics, including seasonality, abrupt peaks and troughs. Finally, we evaluated the quality of case counts estimated by EpiNews-ARNet against the actual disease case counts.

3.3.1 Disease topic discovery

To evaluate the discovered disease topics, we looked at the words having higher probabilities in the seed topic distributions (φ^s) and regular topic distributions (φ^r). We present the analysis of φ^s and φ^r in Tables 3.3, 3.4 and 3.5 corresponding to disease topics in the U.S., China and India respectively. For each country, both φ^s and φ^r were extracted from HealthMap data spanning the entire time period shown in Table 3.1. For each disease topic (z), we show the seed words and their corresponding probabilities (sorted in descending order) in the seed topic distribution φ^s_z. Seed words having higher probabilities in φ^s_z serve as informative prior words in the topic discovery process as they are mentioned frequently in news articles related to the zth disease topic. For example, seed words such as food, salmonella, product, fda, drug, contamination serve as informative prior words for the discovery of the salmonellosis topic in the U.S. since they have higher probabilities in the seed topic distribution (see Table 3.3). On the other hand, seed words such as enteritidis, newport provide less prior information due to their low probability values in the seed topic distribution. To understand how the supervised topic model discovers words from the HealthMap corpus related to these input seed words, we also show some of the regular words having higher probabilities in the regular topic distribution φ^r_z. For a particular disease topic, these regular words with higher probabilities are mentioned frequently in news articles related to that disease and also capture different aspects (causes and clinical symptoms, methods of transmission, etc.) of the disease that the topic represents. For example, in Table 3.3 we show these regular words (having higher probabilities in the regular topic distribution φ^r_z) for the salmonellosis topic in the U.S. Words such as diarrhea, nausea, vomit are related to clinical symptoms of salmonellosis. On the other hand, words such as eat, contaminated, restaurant, meat, beef are related to causes of salmonellosis.

3.3.2 Detection of outbreak patterns

We also examined the temporal distribution or trends (ξ_z) for each disease topic (z) in a specific country (Figures 2, 3 and 4) and their correlations with the disease case counts. For each country, temporal topic trends (ξ_z) were extracted from HealthMap data spanning the entire time period shown in Table 3.1. We made several important observations as follows.

Disease seasonality. In the U.S., case counts of salmonellosis and E. coli infection exhibit strong periodic outbreaks, both peaking during the summer (see Figures 2 (e) and (g)). Temporal topic trends extracted by EpiNews were able to capture the periodicity of these two diseases, particularly periodic outbreaks of salmonellosis and E. coli infection in 2010, 2012 and 2013. However, during 2011, temporal topic trends failed to monitor the peak season properly though they show a tendency to increase during summer. For salmonellosis in 2013, the temporal topic trends captured the major peak of the outbreak at the start of the season while failing to capture the seasonal activity towards the end. For rabies, although the topic trends captured the general characteristics, they failed to detect some major outbreaks, such as the outbreak in the summer of 2010 (see Figure 2 (c)).

In China, H7N9 and HFMD case counts exhibit strong periodic outbreaks, with H7N9 peaking during the winter and HFMD peaking during the summer (see Figures 3 (a) and (c)). For H7N9, temporal topic trends extracted by EpiNews were able to detect the seasonal outbreaks during March-April 2013 and January-February 2014. However, for HFMD, peaks in temporal topic trends precede the peaks in case counts during the summers of 2013 and 2014. Therefore, temporal topic trends for HFMD exhibit a negative shift (leading indicator) with respect to the case counts.

In India, case counts of ADD exhibit periodic outbreaks, peaking during the summer of 2013 and 2014 (see Figure 4 (a)). Temporal topic trends detected the seasonal outbreak in the summer of 2013 but failed to capture the outbreak in the summer of 2014.

Sudden peaks/troughs. In the U.S., whooping cough outbreaks do not exhibit yearly periodicity, unlike salmonellosis and E. coli infection (see Figure 2 (a)). There was a major outbreak of whooping cough during the summer of 2012 and EpiNews detected this sudden increase (peak) in case counts by displaying higher topic trends during the entire period of the outbreak. EpiNews also displayed lower topic trends during periods (summer of 2011 and 2013) known to have low incidence (troughs) of whooping cough and thus signaled no outbreaks in those periods, suggesting a low false alarm rate.

In China and India, dengue case counts exhibit seasonal outbreaks with peaks in case counts appearing during the months of September and October. However, China experienced a severe dengue outbreak in 2014 [67] in comparison to the outbreak in 2013, with the peak value of case counts exceeding 25,000 in the month of October (see Figure 3 (e)). Temporal topic trends detected this sudden massive increase in case counts by displaying a sharp spike during the outbreak period. India also experienced a large dengue outbreak in 2013 with the peak value of case counts exceeding 3,000 during a particular epi week in October (see Figure 4 (c)). EpiNews was able to detect this outbreak by displaying higher topic trends during the peak period. Malaria case counts in India exhibit irregular outbreaks or peaks (see Figure 4 (e)). EpiNews was successful in capturing the majority of these outbreaks though it failed to detect some major peaks, such as the peak during the month of June 2014.

Sampled case counts. Along with the temporal topic trends (ξ_z), we also showed the corresponding sampled case counts (Ξ_D) generated via multinomial sampling (see Algorithm (2)) from ξ_z for a disease D in Figures 2 ((b), (d), (f) and (h)), 3 ((b), (d) and (f)) and 4 ((b), (d) and (f)). The figures show that the sampled case count values share a similar numerical range with the disease case counts while maintaining the shapes of the temporal topic trends. On the other hand, the temporal topic trend values lie in a different numerical range (from 0 to 1) with respect to the case counts.

3.3.3 Estimating case counts

As official reports of case counts by health agencies are usually lagged by a single time point (week or month), reliable early estimates of disease incidence can facilitate the allocation of public health resources to enable effective control measures. Therefore, we aim to perform 1-step ahead estimation of disease case counts starting from a particular time point. For the purpose of experimental validation, we used historical HealthMap data over a certain time period as the static training set in a specific country (referred to as the static training period) and progressively utilized the remaining time points as the evaluation period over which we evaluated the case count estimates of EpiNews-ARNet. To estimate case counts at a particular time point T within the evaluation period, we utilized HealthMap data from t = 0 up to t = T and extracted disease topics using the supervised topic model. The disease case counts at T were next estimated using past case counts available up to t = T′ (T′ = T − 1) and temporal topic trends (or, sampled case counts) available up to t = T. In Table 3.6, we show the total time period of study, static training period and the evaluation period for each country.
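The evaluation protocol can be summarized as a walk-forward loop over the evaluation period. The generator below is a schematic sketch with hypothetical names: it only enumerates which time indices are available at each evaluation point and leaves out topic extraction and model fitting.

```python
def walk_forward(first_eval, last_eval):
    """Yield, for each evaluation time point T, the indices available at T."""
    for T in range(first_eval, last_eval + 1):
        case_idx = range(0, T)        # past case counts: t = 0 .. T' = T - 1
        trend_idx = range(0, T + 1)   # topic trends: t = 0 .. T
        yield T, case_idx, trend_idx

# Toy setting: time points 0..2 form the static training period and
# time points 3..5 form the evaluation period (illustrative only).
splits = list(walk_forward(first_eval=3, last_eval=5))
```

At every step the trend indices extend one time point beyond the case count indices, which is the 1 time point publication delay of the case counts.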

Models. For the task of 1-step ahead estimation, we compared the performance of EpiNews-ARNet against 2 baseline methods, namely EpiNews-ARMAX and Casecount-ARMA. We also compared temporal topic trends against sampled case counts (generated by multinomial sampling from the temporal topic trends) as the external input variables, for the applicable methods EpiNews-ARNet and EpiNews-ARMAX.

Evaluation. We evaluated the case count estimates of each method over the evaluation period by comparing them against the actual case counts using normalized root-mean-square error (NRMSE). In Table 3.7, we present a comparative performance evaluation of the methods for 1-step ahead estimation in terms of NRMSE values corresponding to diseases in the U.S., China and India respectively. Table 3.7 provides multiple insights as follows. (i) EpiNews-ARNet with sampled case counts as external variables is the best performing method, achieving the lowest NRMSE values for the majority (8 out of 10) of the {country, disease} combinations. (ii) The two exceptions are {China, HFMD} and {U.S., E. coli infection}, where EpiNews-ARNet and EpiNews-ARMAX with temporal topic trends as external variables achieve the lowest NRMSE values respectively. (iii) Both EpiNews-ARNet and EpiNews-ARMAX perform better overall with sampled case counts as external variables than with temporal topic trends. (iv) Casecount-ARMA does not achieve the lowest NRMSE value for any of the {country, disease} combinations, indicating the significance of incorporating temporal topic trends or sampled case counts as external variables for estimating case counts.
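For reference, NRMSE can be computed as sketched below. The text does not state which normalization was used, so dividing the root-mean-square error by the range of the actual case counts is an assumption of this sketch (normalizing by the mean is another common choice); the function name and the toy values are likewise illustrative.

```python
import math

def nrmse(actual, estimated):
    """Root-mean-square error divided by the range of the actual values.

    Range normalization is an assumption of this sketch; the source text
    does not specify the normalization.
    """
    if len(actual) != len(estimated) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    mse = sum((a - e) ** 2 for a, e in zip(actual, estimated)) / len(actual)
    spread = max(actual) - min(actual)
    if spread == 0:
        raise ValueError("actual values have zero range")
    return math.sqrt(mse) / spread

# Toy case counts, for illustration only.
actual = [10, 20, 30, 40]
estimated = [12, 18, 33, 37]
score = nrmse(actual, estimated)
```

Lower values indicate estimates that track the actual case counts more closely, and the normalization makes scores comparable across diseases with very different case count magnitudes.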

3.4 Discussions

In this chapter, we studied the problem of monitoring and estimating outbreaks of multiple infectious diseases using disease-related online news reports obtained from HealthMap. We introduced EpiNews, a novel and generic temporal framework that combines supervised temporal topic models with time-series regression techniques to monitor and estimate disease incidence. Experimental results demonstrate that EpiNews is able to capture the time-varying incidence of multiple diseases via temporal topic trends. Our experiments also illustrate that EpiNews can estimate disease incidence 1-step ahead with increased accuracy using information from temporal topic trends.

EpiNews uses online news reports as the sole data source to capture disease dynamics during outbreaks. Therefore, it is generic in the sense that it is not tailored to a particular disease or class of diseases. Moreover, the set of diseases selected for each country represents a diversity of transmission pathways as shown in Table 3.1. Hence, the applicability of EpiNews to these diverse sets of diseases as demonstrated in this study showcases the potential generalizability of our approach to different classes of diseases.

Temporal topic trends extracted by EpiNews from HealthMap news reports successfully captured the dynamics of multiple outbreaks, such as whooping cough in the U.S. during the summer of 2012, periodic outbreaks of salmonellosis and E. coli infection in the U.S., periodic outbreaks of H7N9 and HFMD in China, and dengue outbreaks in India (2013) and China (2014). However, there are certain deviations where temporal topic trends could not monitor the trends in disease outbreaks properly. We posit that such deviations result from multiple effects as follows. (i) Firstly, news media coverage during disease outbreaks is driven by interest. News coverage for certain diseases can be inconsistent over time. For salmonellosis and E. coli infection outbreaks in 2010, 2013 and 2014 (see Figures 2 (e) and (g)), the temporal topic trends capture the outbreak at the start of the season. However, as the outbreak season progresses, the temporal topic trends are unable to capture the outbreak dynamics accurately. This indicates that news media coverage is generally high at the start of a disease outbreak, while news media interest declines as the outbreak season progresses. (ii) Secondly, for diseases with low public interest, the coverage can be low even when there is an ongoing disease outbreak. For example, in the case of the ADD outbreak in 2014 (see Figure 4 (a)), we observe no coverage in news media (lack of activity in temporal topic trends) even though the outbreak occurred on a massive scale. (iii) Finally, our framework is heavily reliant on news corpora and does not account for possible reporting errors. As such, articles with missing or incomplete textual content can affect the performance of our framework. For example, in the case of salmonellosis and E. coli infection outbreaks in 2011, the rise in temporal topic trends during the outbreak period is lower (see Figures 2 (e) and (g)) than for the outbreaks in 2010, 2012, and 2014.

EpiNews supports monitoring and also 1-step ahead estimation of disease case counts with increased accuracy. Table 3.7 shows that EpiNews-ARNet yields lower NRMSE values than the baseline method Casecount-ARMA for all the diseases. This implies that incorporating information from temporal topic trends via EpiNews-ARNet results in improved estimation of case counts. It is also to be noted that EpiNews-ARNet with sampled case counts as external variables achieves lower NRMSE for most of the diseases than the variant using temporal topic trends. This supports our claim that using sampled case counts instead of actual topic trends as the external variables adds numerical stability to EpiNews-ARNet.

The performance of EpiNews-ARMAX is comparable to EpiNews-ARNet for diseases in the U.S. However, for diseases in China and India, EpiNews-ARNet significantly outperforms EpiNews-ARMAX. In China and India, both disease case counts and temporal topic trends (or, sampled case counts) are characterized by sharp peaks during the outbreak period (see Figures 3 and 4). EpiNews-ARMAX performs poorly in such scenarios (see Table 3.7) in comparison to EpiNews-ARNet, mainly due to the unstable behavior of the ARMAX model when it comes to handling sharp gradients in input case counts or temporal topic trends. In contrast, outbreak periods for diseases in the U.S. are characterized by flat peaks with slow rise and fall (see Figure 2). As a result, EpiNews-ARMAX achieves comparable performance to EpiNews-ARNet there, even performing better for E. coli infection. We therefore conclude that both EpiNews-ARMAX and EpiNews-ARNet are suitable approaches for estimating case counts of diseases characterized by flat outbreak peaks with slow rise and fall. However, when disease outbreaks exhibit sharp peaks, we recommend selecting EpiNews-ARNet for reliable estimation of case counts.

For dengue and HFMD in China, EpiNews-ARNet shows considerable improvement in 1-step ahead estimation of disease incidence when compared to the baselines, specifically Casecount-ARMA (see Table 3.7). To better understand this improvement, we plot in Figure 3.5 the temporal correlation between actual case counts and the case counts estimated by each method for dengue and HFMD in China. EpiNews-ARNet with sampled case counts as external variables estimates the peak in dengue case counts more accurately than the baselines (see Figure 3.5 (a)). For HFMD, EpiNews-ARNet with either topic trends or sampled case counts as external variables is able to estimate the peak in case counts, while the baselines fail to do so (see Figure 3.5 (b)). Casecount-ARMA's inability to estimate the peaks in case counts for both dengue and HFMD implies that past case counts are not reliable indicators for estimating sudden increases or peaks in disease incidence and therefore need to be augmented with disease signals from online news media for accurate estimation of outbreaks. However, inconsistent news coverage can adversely affect the timely estimation of outbreaks by EpiNews-ARNet, as shown in Figure 3.5 (c). India experienced periodic outbreaks of ADD with peaks in case counts during the summers of 2013 and 2014. However, we observe a lack of news coverage (no peak in temporal topic trends) during the peak in 2014 compared to the peak in 2013 (see Figures 3.4 (a) and (b)). As a result, the case count estimates generated by EpiNews-ARNet have a delayed peak with respect to the actual peak in case counts during the outbreak in 2014 (see Figure 3.5 (c)). This delayed peak is due to the internal component (p) in equation (3.7), which extracts information from past case counts. Overall, our results over a range of diseases and world regions suggest that monitoring the progression of infectious diseases is feasible and that disease incidence can be estimated with increased precision by efficiently capturing signals from online news media.

The effectiveness of online sources (news, tweets, search queries) for monitoring and forecasting the emergence and/or spread of diseases is an ongoing topic of debate, as evidenced by the community response to the study of Lazer et al. [37]. They demonstrated that Google Flu Trends (GFT) was overestimating influenza-like illness (ILI) case counts in CDC reports. However, when GFT was combined with lagged CDC data, the authors observed a substantial improvement in estimating the CDC counts. In EpiNews based models (EpiNews-ARNet and EpiNews-ARMAX), unlike GFT, we combine the lagged (past) disease case counts with the temporal topic trends extracted from the HealthMap news corpus in order to generate reliable case count estimates. During outbreak periods, however, inconsistent news media coverage and possible reporting errors can hamper forecasting performance, because lagged case counts are not helpful in such scenarios and we must rely on external news trends for forecasting. Therefore, given consistent media coverage, EpiNews based models are capable of generating reliable case count estimates (see Figures 3.5 (a) and (b)). In scenarios where news media coverage is lacking or inconsistent (Figure 3.5 (c)), we can supplement the model with information from physical data sources, such as climatic attributes (temperature [1], precipitation [19], and humidity [30]). The main take-away from Lazer et al. [37], applicable to our work as well, is that models based on machine learning, such as those developed here, need to be constantly tuned and retrained so that model drift can be detected and corrected.


Table 3.3: Four disease topics (Whooping Cough, Rabies, Salmonella and E. coli infection) discovered by the supervised topic model from the HealthMap corpus for U.S. For each disease topic, we show the seed words and their corresponding probabilities in the seed topic distribution. Along with the seed words, we also show some of the regular words (having higher probabilities in the regular topic distribution) discovered by the supervised topic model related to these input seed words.

Whooping cough topic
  Seed words: child 0.1498, school 0.1068, cough 0.0828, pertussis 0.0701, whoop 0.0691, whooping 0.0679, infant 0.0596, student 0.0557, contagious 0.0454, booster 0.0406, cold 0.0395, coughing 0.0309, nose 0.0304, respiratory 0.0284, mild 0.0269, tdap 0.0231, immunize 0.0212, runny 0.0198, tetanus 0.0175, breathe 0.0144
  Regular words with higher probabilities: contact 0.0037, young 0.0023, adult 0.0022, vaccination 0.0019, vaccine 0.0019, california 0.0019, vaccinate 0.0018, parent 0.0017, woman 0.0015, baby 0.0014, immunization 0.0011, kid 0.0009, air 0.0008, weather 0.0007, pregnant 0.0006, mother 0.0006, dose 0.0006, antibiotic 0.0005, pneumonia 0.0003

Rabies topic
  Seed words: animal 0.1596, rabies 0.1191, rabid 0.0718, bite 0.0695, rabie 0.0674, virus 0.0649, wild 0.0585, bat 0.0472, raccoon 0.0471, skunk 0.0424, fox 0.0422, wildlife 0.0379, domestic 0.0323, saliva 0.0247, scratch 0.0237, quarantine 0.0213, horse 0.0192, viral 0.0190, livestock 0.0166, mammal 0.0156
  Regular words with higher probabilities: pet 0.0037, contact 0.0037, cat 0.0028, vaccination 0.0024, florida 0.0015, vaccine 0.0014, shot 0.0014, street 0.0013, clinic 0.0012, texas 0.0010, park 0.0010, york 0.0010, wound 0.0009, virginia 0.0008, ferret 0.0007, brain 0.0007, coyote 0.0005, nervous 0.0005, canine 0.0002

Salmonellosis topic
  Seed words: food 0.2056, salmonella 0.1031, product 0.1013, recall 0.0878, drug 0.0712, consumer 0.0705, contamination 0.0598, fda 0.0579, contaminate 0.0567, abdominal 0.0351, egg 0.0277, chicken 0.0275, poultry 0.025, arthritis 0.0145, peanut 0.0139, cantaloupe 0.01, shell 0.0086, typhimurium 0.0083, newport 0.0082, enteritidis 0.0074
  Regular words with higher probabilities: eat 0.0019, diarrhea 0.0019, nausea 0.0013, foodborne 0.0013, package 0.0012, contaminated 0.0011, meat 0.0011, restaurant 0.0010, vomit 0.0010, products 0.0008, cook 0.0008, beef 0.0008, raw 0.0007, temperature 0.0006, honey 0.0005, pepper 0.0004, weather 0.0003, salad 0.0003, mango 0.0002

E. coli infection topic
  Seed words: coli 0.2265, boil 0.0887, cell 0.0745, toxin 0.0628, escherichia 0.0617, clinical 0.0573, chemical 0.0557, kidney 0.0414, microbiology 0.0402, reaction 0.0397, hemolytic 0.0376, lettuce 0.0366, uremic 0.036, physical 0.0342, gene 0.0339, shiga 0.0202, expression 0.0162, chemistry 0.0149, stec 0.0125, biochemistry 0.0094
  Regular words with higher probabilities: transmit 0.0014, massachusetts 0.0013, surface 0.0012, body 0.0012, pennsylvania 0.0012, blood 0.0012, pathogen 0.0011, resistant 0.0011, drinking 0.0011, agricultural 0.0011, hygiene 0.0010, raw 0.0009, apple 0.0009, sandwich 0.0009, milk 0.0008, stool 0.0008, parasite 0.0005, acs 0.0002, receptor 0.0001


Table 3.4: Three disease topics (H7N9, HFMD and dengue) discovered by the supervised topic model from the HealthMap corpus for China. For each disease topic, we show the seed words and their corresponding probabilities in the seed topic distribution. Along with the seed words, we also show some of the regular words (having higher probabilities in the regular topic distribution) discovered by the supervised topic model related to these input seed words.

H7N9 topic
  Seed words: flu 0.1229, bird 0.1225, avian 0.1053, influenza 0.1051, human 0.1031, virus 0.0832, poultry 0.0786, market 0.0610, animal 0.0360, chicken 0.0303, respiratory 0.0230, spring 0.0227, farm 0.0224, farmer 0.0213, slaughter 0.0194, winter 0.0179, egg 0.0125, pandemic 0.0117, h7n9 0.0012, h5n1 0.0000
  Regular words with higher probabilities: zhejiang 0.0034, beijing 0.0034, shanghai 0.0030, agriculture 0.0015, pneumonia 0.0013, temperature 0.0011, food 0.0010, eat 0.0009, duck 0.0008, pigeon 0.0008, cook 0.0006, vaccine 0.0006, tamiflu 0.0005, meat 0.0004, strain 0.0004, raw 0.0003, pig 0.0003

HFMD topic
  Seed words: hand 0.1573, child 0.1384, mouth 0.1127, school 0.1016, foot 0.0916, class 0.0734, hfmd 0.0557, parent 0.0546, nursery 0.0343, kindergarten 0.0294, oral 0.0192, intestinal 0.0185, infant 0.0178, mumps 0.0174, measles 0.0172, herpes 0.0140, enterovirus 0.0135, encephalitis 0.0124, dysentery 0.0117, ulcer 0.0093
  Regular words with higher probabilities: shandong 0.0028, hunan 0.0025, care 0.0015, rash 0.0008, meningitis 0.0007, viral 0.0007, hepatitis 0.0007, body 0.0006, tuberculosis 0.0006, childhood 0.0005, palm 0.0004, organ 0.0003, skin 0.0003, buttock 0.0003, childcare 0.0003, blister 0.0002, kidney 0.0002

Dengue topic
  Seed words: fever 0.2269, dengue 0.1586, mosquito 0.1052, october 0.0826, water 0.0682, breeding 0.0559, street 0.0481, bite 0.0330, aedes 0.0317, pain 0.0294, breed 0.0280, park 0.0269, sanitation 0.0179, borne 0.0175, albopictus 0.0168, rain 0.0139, hemorrhagic 0.0125, vector 0.0115, larva 0.0089, aegypti 0.0066
  Regular words with higher probabilities: guangdong 0.0071, guangzhou 0.0056, site 0.0013, temperature 0.0010, weather 0.0009, muscle 0.0008, blood 0.0006, urban 0.0005, bleed 0.0004, diarrhea 0.0004, medicine 0.0004, stagnant 0.0004, spray 0.0003, rainy 0.0003, climate 0.0003, cough 0.0002, tank 0.0002


Table 3.5: Three disease topics (ADD, dengue and malaria) discovered by the supervised topic model from the HealthMap corpus for India. For each disease topic, we show the seed words and their corresponding probabilities in the seed topic distribution. Along with the seed words, we also show some of the regular words (having higher probabilities in the regular topic distribution) discovered by the supervised topic model related to these input seed words.

ADD topic
  Seed words: fall 0.1284, child 0.1148, school 0.0949, student 0.0868, food 0.0837, consume 0.0611, eat 0.0588, vomit 0.0549, meal 0.0525, stomach 0.0412, diarrhea 0.0315, nausea 0.0304, vomiting 0.0300, poisoning 0.0249, poison 0.0241, midday 0.0237, contaminated 0.0183, cook 0.0179, lunch 0.0117, contaminate 0.0105
  Regular words with higher probabilities: village 0.0032, bihar 0.0023, inflammatory 0.0020, sample 0.0018, odisha 0.0017, ache 0.0011, sick 0.0010, pain 0.0008, iron 0.0008, rice 0.0006, pesticide 0.0005, flood 0.0004, drink 0.0004, sanitation 0.0004, stale 0.0003, drinking 0.0001

Dengue topic
  Seed words: dengue 0.2090, fever 0.0978, municipal 0.0759, breeding 0.0658, borne 0.0586, mosquito 0.0555, september 0.0491, august 0.0429, water 0.0408, rain 0.0385, aedes 0.0382, ward 0.0382, platelet 0.0330, breed 0.0300, larva 0.0268, blood 0.0264, bite 0.0246, chikungunya 0.0206, vector 0.0199, monsoon 0.0084
  Regular words with higher probabilities: civic 0.0038, delhi 0.0026, virus 0.0018, temperature 0.0014, fogging 0.0014, haryana 0.0009, spray 0.0009, stagnant 0.0008, infection 0.0008, aegypti 0.0008, drain 0.0007, larval 0.0006, stagnate 0.0005, gutter 0.0003, rainwater 0.0002, urbanization 0.0001

Malaria topic
  Seed words: malaria 0.1504, mosquito 0.1166, site 0.0994, water 0.0893, awareness 0.0826, lead 0.0735, vector 0.0678, breed 0.0567, monsoon 0.0484, blood 0.0414, construction 0.0331, camp 0.0316, drug 0.0228, rainfall 0.0175, typhoid 0.0148, tribal 0.0133, falciparum 0.0114, economic 0.0110, anopheles 0.0099, plasmodium 0.0084
  Regular words with higher probabilities: mumbai 0.0025, virus 0.0015, maharashtra 0.0011, stagnant 0.0011, insect 0.0009, garbage 0.0008, flu 0.0008, spraying 0.0007, aegypti 0.0007, parasite 0.0006, tank 0.0006, leptospirosis 0.0005, urban 0.0004, drainage 0.0003, rainwater 0.0002, waterlog 0.0002

[Figure 3.2 plot, panels (a)-(h). Panel titles: whooping cough case counts (a, b), rabies case counts (c, d), salmonellosis case counts (e, f), E. coli infection case counts (g, h). Axes: disease case counts and temporal topic trends or sampled case counts versus epi week starting date, with ticks from Jan 2010 to Jan 2013.]

Figure 3.2: Correlation between disease case counts and temporal topic distributions or trends (ξz) extracted by EpiNews for (a) whooping cough, (c) rabies, (e) salmonellosis, and (g) E. coli infection in U.S. Along with the temporal topic trends (ξz), we also show the correlation between disease case counts and sampled case counts (generated by multinomial sampling from temporal topic trends) for (b) whooping cough, (d) rabies, (f) salmonellosis, and (h) E. coli infection. Note that the sampled case counts and disease case counts share a similar numerical range. However, the temporal topic trend values lie in a different numerical range (from 0 to 1) than the disease case counts.

Table 3.6: Total time period of study, static training period and the evaluation period for estimating disease case counts in each country.

Country   Total time period of study      Static training period          Evaluation period
U.S.      January 2010 - December 2013    January 2010 - December 2011    January 2012 - December 2013
China     January 2013 - December 2014    January 2013 - March 2013       April 2013 - December 2014
India     January 2013 - December 2014    January 2013 - November 2013    December 2013 - December 2014

[Figure 3.3 plot, panels (a)-(f). Panel titles: H7N9 case counts (a, b), HFMD case counts (c, d), dengue case counts (e, f). Axes: disease case counts and temporal topic trends or sampled case counts versus time, with ticks from Jan 2013 to Oct 2014.]

Figure 3.3: Correlation between disease case counts and temporal topic distributions or trends (ξz) extracted by EpiNews for (a) H7N9, (c) HFMD, and (e) dengue in China. Along with the temporal topic trends (ξz), we also show the correlation between disease case counts and sampled case counts (generated by multinomial sampling from temporal topic trends) for (b) H7N9, (d) HFMD, and (f) dengue. Note that the sampled case counts and disease case counts share a similar numerical range. However, the temporal topic trend values lie in a different numerical range (from 0 to 1) than the disease case counts.

[Figure 3.4 plot, panels (a)-(f). Panel titles: ADD case counts (a, b), dengue case counts (c, d), malaria case counts (e, f). Axes: disease case counts and temporal topic trends or sampled case counts versus time, with ticks from Apr 2013 to Oct 2014.]

Figure 3.4: Correlation between disease case counts and temporal topic distributions or trends (ξz) extracted by EpiNews for (a) ADD, (c) dengue, and (e) malaria in India. Along with the temporal topic trends (ξz), we also show the correlation between disease case counts and sampled case counts (generated by multinomial sampling from temporal topic trends) for (b) ADD, (d) dengue, and (f) malaria. Note that the sampled case counts and disease case counts share a similar numerical range. However, the temporal topic trend values lie in a different numerical range (from 0 to 1) than the disease case counts.


Table 3.7: Comparing the performance of EpiNews-ARNet against the baseline methods EpiNews-ARMAX and Casecount-ARMA for 1-step ahead estimation of disease case counts. Metric used for comparing the case counts estimated by the methods against the actual case counts is the normalized root-mean-square error (NRMSE). In the column headers, "topic" denotes the variant with temporal topic trends and "sampled" the variant with sampled case counts as external variables.

Country  Disease            Casecount-ARMA  EpiNews-ARMAX  EpiNews-ARMAX  EpiNews-ARNet  EpiNews-ARNet
                                            (topic)        (sampled)      (topic)        (sampled)
U.S.     Whooping cough     0.584           0.577          0.582          0.583          0.558
U.S.     Rabies             0.875           0.888          0.886          0.877          0.865
U.S.     Salmonellosis      0.445           0.978          0.450          0.441          0.430
U.S.     E. coli infection  0.685           0.657          0.663          0.686          0.671
China    H7N9               1.096           0.850          0.888          1.027          0.712
China    HFMD               1.574           1.524          1.538          0.622          0.626
China    Dengue             1.076           0.639          0.634          1.094          0.549
India    ADD                1.226           1.285          1.119          0.844          0.833
India    Dengue             0.966           1.086          1.021          1.073          0.878
India    Malaria            1.060           1.062          1.047          1.016          0.963
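The NRMSE metric used in Table 3.7 can be illustrated with a minimal sketch. The normalization shown here (dividing the RMSE by the mean of the actual case counts) is an assumption made for illustration; other common conventions divide by the range or the standard deviation, and this section does not state which one was used. The function name `nrmse` is hypothetical.

```python
import math


def nrmse(actual, estimated):
    """Root-mean-square error divided by the mean of the actual values.

    ASSUMPTION: normalization by the mean; the convention behind
    Table 3.7 is not specified in this section.
    """
    if len(actual) != len(estimated) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - e) ** 2 for a, e in zip(actual, estimated)) / len(actual)
    mean_actual = sum(actual) / len(actual)
    if mean_actual == 0:
        raise ValueError("mean of actual values is zero; NRMSE undefined")
    return math.sqrt(mse) / mean_actual


# Toy weekly counts (made up for illustration, not taken from the study).
actual_counts = [10, 40, 90, 30]
one_step_estimates = [12, 35, 70, 40]
score = nrmse(actual_counts, one_step_estimates)
```

Under this convention a lower value indicates estimates closer to the actual counts, and a value above 1 means the typical error exceeds the average case count.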

[Figure 3.5 plot, panels (a)-(c). Axes: case counts versus month starting date. Legend: Casecount-ARMA, EpiNews-ARMAX-topic, EpiNews-ARMAX-sample, EpiNews-ARNet-topic, EpiNews-ARNet-sample, actual case count.]

Figure 3.5: Temporal correlation between actual case counts and case counts estimated by the methods Casecount-ARMA, EpiNews-ARMAX and EpiNews-ARNet corresponding to (a) dengue and (b) HFMD in China. In (c), we show the temporal correlation between actual case counts and case counts estimated by EpiNews-ARNet-sample corresponding to ADD in India.

Chapter 4

GELL: Automating the Extraction of Epidemiological Line Lists from Open Sources

4.1 Introduction

In Chapters 2 and 3, we outlined our efforts at monitoring and estimating infectious disease outbreaks in different regions of the world (Chile, Argentina, Brazil, Uruguay, U.S., China and India) using temporal topic models. However, topic models are computationally expensive and do not provide simple semantic and similarity constructs that can be used to interpret the data. Because of these limitations, deep learning based methods, such as word2vec [47, 48] and doc2vec [38], have been used in recent years for analyzing massive text corpora. Motivated by this, in this chapter we formulate Guided Epidemiological Line List (GELL), the first tool for building automated line lists (in near real-time) from open source reports of emerging disease outbreaks (such as WHO health bulletins for emerging diseases) using distributed vector representations (à la word2vec). Specifically, we focus on deriving epidemiological characteristics of an emerging disease (such as Middle East Respiratory Syndrome (MERS)) and the affected population from reports of illness. GELL uses these distributed vector representations to discover a set of indicators for each line list feature. This discovery of indicators is followed by the use of dependency parsing based techniques for final extraction in tabular form. Our main contributions are as follows.

• Automated: GELL is fully automatic. Given a WHO health bulletin, it automatically extracts the number of line list cases and the features associated with each case. The user only needs to provide a seed keyword for each line list feature to guide the extraction process.


• Novelty: To the best of our knowledge, there have been no prior systematic efforts at tabulating such information automatically from publicly available bulletins in a near real-time setting.

• Generality: GELL can be applied for extracting tabular information from reasonably well structured bulletins released by health organizations such as WHO, CDC or ProMED.

• Evaluation: We extensively evaluated the line list generated automatically by GELL against a human curated line list for MERS outbreaks in Saudi Arabia. In addition, we compared GELL against a baseline method.

• Epidemiological Inferences: Finally, we also show how these automatically extracted line lists can be used for making epidemiological inferences, such as inferring the age and gender distribution of affected individuals, their symptoms-to-hospitalization period, and their hospitalization-to-outcome period.

Figure 4.1: Tabular extraction of line list by GELL given a textual block of a WHO MERS bulletin. Each row in the extracted table depicts an infected case (or patient) and the columns represent the epidemiological features corresponding to each case. Information for each case in the table is then used to make epidemiological inferences, such as inferring the demographic distribution of cases.


4.2 Problem Overview

In this chapter, we focus on Middle East Respiratory Syndrome (MERS) outbreaks in Saudi Arabia [44] (2012-ongoing) as our case study. MERS was a relatively poorly understood disease when these outbreaks began. It was therefore treated as an emerging disease, which led to good bulletin coverage of individual cases. This makes these outbreaks ideally suited to our goals. MERS is also infectious, and animal contact has been posited as one of the transmission mechanisms of the disease. For each line list case, we seek to automatically extract three types of epidemiological features: (a) Demographics: age and gender; (b) Disease onset: onset date, hospitalization date and outcome date; and (c) Clinical features: animal contact, secondary contact, comorbidities and specified healthcare worker (abbreviated as HCW).

In Figure 4.2, we show the internal components of the GELL framework. GELL takes multiple WHO MERS bulletins as input. The textual content of each bulletin is pre-processed by sentence splitting, tokenization, lemmatization, POS tagging, and date phrase detection using spaCy [32] and BASIS Technologies' Rosette Language Processing (RLP) tools [59]. The pre-processing step is followed by three levels of modeling: (a) level 0 modeling, which extracts demographic information of cases, such as age and gender, and also identifies the key sentences related to each line list case; (b) level 1 modeling, which extracts disease onset information; and (c) level 2 modeling, which extracts clinical features. Level 2 is the final level of modeling in the GELL framework. Features extracted at this level take one of two labels, Y or N, so modeling at this level combines neural word embeddings with dependency parsing-based negation detection to classify each clinical feature as Y or N. In the subsequent subsections, we discuss each internal component of GELL in detail.

4.3 Materials and Methods

Given multiple WHO MERS bulletins as input, GELL proceeds through three levels of modeling for extracting line list features. We describe each level in turn.

4.3.1 Level 0 Modeling

In level 0 modeling, we extract the age and gender for each line list case. These two features are mentioned in a reasonably structured way and can therefore be extracted using a combination of regular expressions, as shown in Algorithm 3. One of the primary challenges in extracting line list cases is that a single WHO MERS bulletin can contain information about multiple cases, so the cases mentioned in the bulletin must be distinguished from one another. In level 0 modeling, we use the age and gender extraction to also identify the sentences associated with each case. Since age and gender are the fundamental information to be recorded for a line list case, we postulate that the sentence mentioning the age and gender will be the starting sentence describing a line list case (see the textual block in Figure 4.1). Therefore, the number of cases mentioned in the bulletin will be equal to the number of sentences mentioning age and gender information. We further postulate that information related to the other features (disease onset or clinical) will be present either in the starting sentence or in the subsequent sentences that do not mention any age and gender information (see the textual block in Figure 4.1). For more details on level 0 modeling, please see Algorithm 3, where N represents the number of line list cases mentioned in the bulletin and SC_n represents the set of sentences mentioning the nth case.

Figure 4.2: Block diagram depicting all components of the GELL framework. Given multiple WHO MERS bulletins as input, these components function in the depicted order to extract line lists in tabular form.

Algorithm 3: Level 0 modeling
Input : set of sentences in the input WHO MERS bulletin
Output: Age and Gender for each line list case, index of the starting sentence for each case

n = 0;
SC_n = Null;
R1 = \s+(?P<age>\d{1,2})(.{0,20})(\s+|-)(?P<gender>woman|man|male|female|boy|girl|housewife)
R2 = \s+(?P<age>\d{1,2})\s*years?(\s|-)old
R3 = \s*(?P<gender>woman|man|male|female|boy|girl|housewife|he|she)
for each sentence in the bulletin do
    is-starting ← 0;
    if R1.match(sentence) then
        Age = int(R1.groupdict()['age']);
        Gender = R1.groupdict()['gender'];
        is-starting ← 1;
    else
        if R2.match(sentence) then
            Age = int(R2.groupdict()['age']);
        else
            Age = Null;
        if R3.match(sentence) then
            Gender = R3.groupdict()['gender'];
        else
            Gender = Null;
        if Age ≠ Null && Gender ≠ Null then
            is-starting ← 1;
    if is-starting then
        n += 1;
        SC_n = index of the sentence;
N = n;
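The level 0 procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the GELL implementation: the patterns follow Algorithm 3 but are adapted with word boundaries and `re.search` (so that, for instance, "he" is not matched inside "The"), and the function names `level0` and `case_sentences` as well as the example bulletin are hypothetical.

```python
import re

# Gender terms are those listed in Algorithm 3. The \b word boundaries,
# case-insensitive flag, and use of re.search are ASSUMPTIONS of this sketch.
GENDER = r"woman|man|male|female|boy|girl|housewife"
R1 = re.compile(
    r"\b(?P<age>\d{1,2})\b.{0,20}?(?:\s+|-)(?P<gender>" + GENDER + r")\b", re.I
)
R2 = re.compile(r"\b(?P<age>\d{1,2})\s*-?\s*years?(?:\s|-)old\b", re.I)
R3 = re.compile(r"\b(?P<gender>" + GENDER + r"|he|she)\b", re.I)


def level0(sentences):
    """Return one record per case: age, gender, index of starting sentence."""
    cases = []
    for idx, sentence in enumerate(sentences):
        m1 = R1.search(sentence)
        if m1:
            age, gender = int(m1.group("age")), m1.group("gender").lower()
        else:
            # Fall back to separate age and gender patterns.
            m2, m3 = R2.search(sentence), R3.search(sentence)
            age = int(m2.group("age")) if m2 else None
            gender = m3.group("gender").lower() if m3 else None
        # A sentence starts a new case only if both features were found.
        if age is not None and gender is not None:
            cases.append({"age": age, "gender": gender, "start": idx})
    return cases


def case_sentences(sentences, cases):
    """Group sentences per case: from its starting sentence to the next start."""
    starts = [c["start"] for c in cases] + [len(sentences)]
    return [sentences[s:e] for s, e in zip(starts, starts[1:])]


# Made-up bulletin text for illustration only.
bulletin = [
    "A 45-year-old male from Riyadh developed symptoms on 4 June.",
    "He was admitted to hospital on 12 June.",
    "The second case is a 61-year-old woman.",
]
cases = level0(bulletin)
groups = case_sentences(bulletin, cases)
```

In this example the second sentence mentions a gender pronoun but no age, so it is attached to the first case instead of starting a new one.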

4.3.2 WHO Word Embeddings

Before presenting the details of level 1 modeling and level 2 modeling, we briefly discuss the process for providing WHO word embeddings as input to both of these levels of modeling (see Figure 4.2). In this process, our main objective is to identify words that tend to share similar contexts or appear in the contexts of each other, specific to the WHO bulletins (the contexts of a word are the words surrounding it within a specified window size). For instance, consider the sentences S1 = "The patient had no contact with animals" and S2 = "The patient was supposed to have no contact with camels". The terms animals and camels appear in similar contexts in S1 and S2. Both terms are indicative of information pertaining to the patient's exposure to animals or animal products.

Similarly, consider the sentences S3 = "The patient had an onset of symptoms on 23rd January 2016" and S4 = "The patient developed symptoms on 23rd January 2016". The terms onset and symptoms are indicators for the onset date feature, and both of them appear in similar contexts or in the contexts of each other in S3 and S4.

For generating word embeddings, neural network inspired word2vec models are ideally suited to our goals because these models work on the hypothesis that words sharing similar contexts or tending to appear in the contexts of each other have similar embeddings. In recent years, word2vec models based on the skip-gram architecture [47, 48] have emerged as the most popular word embedding models for information extraction tasks [42, 39, 24]. We used two variants of skip-gram models to generate embeddings for each term in the WHO vocabulary W: (a) the skip-gram model trained using the negative sampling technique (SGNS [48]) and (b) the skip-gram model trained using hierarchical softmax (SGHS [48]). W refers to the list of all unique terms extracted from the entire corpus of WHO Disease Outbreak News (DONs) corresponding to all diseases, downloaded from http://www.who.int/csr/don/archive/disease/en/. The embeddings for each term in W were provided as input to level 1 modeling and level 2 modeling as shown in Figure 4.2.
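The notion of context that skip-gram models train on can be made concrete with a small sketch. It only enumerates (target, context) pairs within a window for the example sentences S1 and S2; it does not train embeddings. The window size of 2 and the helper names `skipgram_pairs` and `contexts` are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Enumerate (target, context) pairs within `window` tokens on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def contexts(tokens, word, window=2):
    """Set of context words observed around `word` in one sentence."""
    return {c for t, c in skipgram_pairs(tokens, window) if t == word}


# Lowercased, whitespace-tokenized versions of S1 and S2.
s1 = "the patient had no contact with animals".split()
s2 = "the patient was supposed to have no contact with camels".split()

# Context words that "animals" and "camels" have in common.
shared = contexts(s1, "animals") & contexts(s2, "camels")
```

Here both terms share the context words "contact" and "with", which is the kind of overlap that leads a skip-gram model to place them close together in the embedding space.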

4.3.3 Level 1 Modeling

Level 1 modeling is responsible for extracting the disease onset features, such as symptom onset date, hospitalization date and outcome date, for each line list case, say the nth case. For extracting a given disease onset feature, level 1 modeling takes three inputs: (a) the seed indicator for the feature, (b) the word embeddings generated using SGNS or SGHS for each term in the WHO vocabulary W, and (c) SC_n, the set of sentences describing the nth case for which we are extracting the feature.

Growth of seed indicator. In the first phase of level 1 modeling, we discover the top-K most similar (or closest) indicators to the seed indicator for each feature using WHO word embeddings. The similarity metric used is the standard cosine similarity. We thus expand the seed indicator into a set of K + 1 indicators for each feature. In Table 4.1 we show the indicators discovered by SGNS for each disease onset feature given the seed indicators as input.


Table 4.1: Seed indicator and the discovered indicators using word embeddings generated by SGNS

Features              Seed indicator  Discovered indicators
Onset date            onset           symptoms, symptom, prior, days, dates
Hospitalization date  hospitalized    admitted, screened, hospitalised, passed, discharged
Outcome date          died            recovered, passed, became, ill, hospitalized
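The growth of a seed indicator can be sketched as a top-K cosine similarity search. The three-dimensional vectors below are made up for illustration and are not the WHO embeddings; the function names `cosine` and `grow_seed` are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def grow_seed(seed, embeddings, k):
    """Return the seed plus its k most similar terms (K + 1 indicators)."""
    seed_vec = embeddings[seed]
    scored = [
        (cosine(seed_vec, vec), term)
        for term, vec in embeddings.items()
        if term != seed
    ]
    # Highest similarity first; ties broken alphabetically (an assumption).
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [seed] + [term for _, term in scored[:k]]


# Toy embeddings, NOT learned from WHO DONs.
toy_embeddings = {
    "onset": [0.9, 0.1, 0.0],
    "symptoms": [0.8, 0.2, 0.1],
    "symptom": [0.85, 0.15, 0.05],
    "admitted": [0.1, 0.9, 0.1],
    "camels": [0.0, 0.1, 0.9],
}
indicators = grow_seed("onset", toy_embeddings, k=2)
```

With real embeddings, K controls the trade-off between recall (more indicators) and noise, as unrelated terms such as "passed" can enter the indicator set (see Table 4.1).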

Shortest Dependency Distance. In the second phase, we use these K + 1 indicators to extract the disease onset features. For each indicator I_t, t ∈ {1, 2, ..., K + 1}, we identify the sentences mentioning I_t by iterating over each sentence in SC_n. Then, for each sentence mentioning I_t, we discover the shortest path along the undirected dependency graph between I_t and each date phrase mentioned in the sentence. The length of the shortest path, measured as the number of edges traversed along it, is referred to as the dependency distance. For example, consider the sentence S5 = "He developed symptoms on 4-June and was admitted to a hospital on 12-June". S5 contains the date phrases 4-June and 12-June, as well as the indicators symptoms for onset date and admitted for hospitalization date (see Table 4.1). In Figure 4.3, we show the undirected dependency graph for S5. The dependency distance from symptoms to 4-June is 3 (symptoms → developed → on → 4-June) and to 12-June is 4 (symptoms → developed → admitted → on → 12-June). Similarly, the dependency distance from admitted to 4-June is 3 (admitted → developed → on → 4-June) and to 12-June is 2 (admitted → on → 12-June). Thus, for each indicator we extract a set of date phrases and the dependency distance corresponding to each date phrase. The output value of the indicator is set to the date phrase located at the shortest dependency distance. In S5, for example, the output values of symptoms and admitted are 4-June and 12-June respectively. The final output for each disease onset feature is obtained by majority voting over the outputs of the indicators. For more algorithmic details, please see Algorithm 4.
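The dependency distance computation for S5 can be sketched with a breadth-first search over an undirected graph. The edge list below is written by hand to be consistent with the paths described in the text; it is an assumption and not the output of a dependency parser. The two occurrences of "on" are given the hypothetical node names `on_1` and `on_2` to keep them distinct.

```python
from collections import deque

# Hand-written undirected dependency edges for S5 (ASSUMED, not parser output).
S5_EDGES = [
    ("developed", "He"), ("developed", "symptoms"), ("developed", "on_1"),
    ("on_1", "4-June"), ("developed", "and"), ("developed", "admitted"),
    ("admitted", "was"), ("admitted", "to"), ("to", "hospital"),
    ("hospital", "a"), ("admitted", "on_2"), ("on_2", "12-June"),
]


def build_graph(edges):
    """Adjacency sets for an undirected graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def dependency_distance(graph, source, target):
    """Number of edges on the shortest path, or None if unreachable."""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def closest_date(graph, indicator, date_phrases):
    """Date phrase at the shortest dependency distance from the indicator."""
    reachable = [
        (dependency_distance(graph, indicator, d), d) for d in date_phrases
    ]
    reachable = [(dist, d) for dist, d in reachable if dist is not None]
    return min(reachable)[1] if reachable else None


graph = build_graph(S5_EDGES)
dates = ["4-June", "12-June"]
onset = closest_date(graph, "symptoms", dates)
hospitalization = closest_date(graph, "admitted", dates)
```

In practice the edges would come from a dependency parser, and multi-token date phrases would need to be merged into single nodes before the search.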

4.3.4 Level 2 Modeling

The level 2 modeling is responsible for extracting the clinical features for each line list case. Extraction of clinical features is a binary classification problem where we have to classify each feature into one of two classes, Y or N. The first phase of level 2 modeling is similar to level 1 modeling. The seed indicator for each clinical feature is provided as input to the level 2 modeling and we extract the K + 1 indicators for each such feature by discovering the top-K most similar indicators to the seed indicator (in terms of cosine similarity) using WHO word embeddings.
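The expansion of a seed indicator into K + 1 indicators can be sketched as a nearest-neighbor lookup under cosine similarity. The sketch below is illustrative only: the three-dimensional vectors are invented toy values, not WHO word embeddings, and `expand_seed` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_seed(seed, embeddings, k):
    """Return the seed followed by its top-k most similar terms (K + 1 indicators)."""
    seed_vec = embeddings[seed]
    scored = [
        (cosine(seed_vec, vec), term)
        for term, vec in embeddings.items()
        if term != seed
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [seed] + [term for _, term in scored[:k]]


# Toy vectors, invented for illustration only.
TOY_EMBEDDINGS = {
    "animals": [1.0, 0.0, 0.0],
    "camels": [0.9, 0.1, 0.0],
    "sheep": [0.8, 0.3, 0.0],
    "fever": [0.0, 1.0, 0.0],
    "hospital": [0.0, 0.0, 1.0],
}

indicators = expand_seed("animals", TOY_EMBEDDINGS, k=2)
```

With real embeddings the lookup would run over the full vocabulary W; the ranking logic is unchanged.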


Figure 4.3: Undirected dependency graph corresponding to S5. The red-colored edges depict those edges included in the shortest paths between the date phrases (4-June, 12-June) and the indicators (symptoms, admitted).

Algorithm 4: Level 1 modeling
Input: seed indicator, word embeddings for each term in W, SCn
Output: date phrase

Grow the seed indicator using word embeddings to generate K + 1 indicators $I_t$, $t \in \{1, 2, \ldots, K+1\}$
for each $I_t$ do
    dependency-dist = dict()  (empty dictionary)
    for each sentence in SCn do
        check whether the sentence mentions $I_t$
        if $I_t$ found then
            identify the date phrases mentioned in the sentence
            if at least one date phrase is found then
                construct the undirected dependency graph for the sentence (see Figure 4.3)
                for each date phrase in the sentence do
                    dependency-dist[date phrase] = dependency distance (see Section 4.3.3)
            else
                continue
        else
            continue
    output of $I_t$ = date phrase in dependency-dist having the shortest dependency distance
final output = majority voting on the outputs of each $I_t$


Figure 4.4: Directed dependency graph corresponding to S6, showing direct and indirect negation detection.

Dependency based negation detection. In the second phase, we make use of the K + 1 indicators extracted in the first phase and a static lexicon of negation cues [20], such as no, not, without, unable and never, to detect negation for a clinical feature. If no negation is detected, we classify the feature as Y, otherwise as N. For each indicator $I_t$, $t \in \{1, 2, \ldots, K+1\}$, we identify the first sentence (referred to as $S_{I_t}$) mentioning $I_t$ by iterating over the sentences in SCn. Once $S_{I_t}$ is identified, we perform two types of negation detection on the directed dependency graph $D_{I_t}$ constructed for $S_{I_t}$.

Direct Negation Detection. In this negation detection, we search for a negation cue among the neighbors of $I_t$ in $D_{I_t}$. If a negation cue is found, then the output of $I_t$ is classified as N.

Indirect Negation Detection. The absence of a negation cue in the neighborhood of $I_t$ leads us to perform indirect negation detection. In this detection, we locate those terms in $D_{I_t}$ from which there is a directed path to $I_t$. We refer to these terms as the predecessors of $I_t$ in $D_{I_t}$. Then, we search for negation cues in the neighborhood of each predecessor. If we find a negation cue around a predecessor, we assume that the indicator $I_t$ is also affected by this negation and we classify the output of $I_t$ as N.

For example, consider the sentence S6 = "The patient had no comorbidities and had no contact with animals." The directed dependency graph corresponding to S6 is shown in Figure 4.4. Sentence S6 contains the seed indicators comorbidities for comorbidities and animals for animal contact. In Figure 4.4, we observe direct negation detection for comorbidities, as the negation cue no is located in the neighborhood of the indicator comorbidities. However, for animal contact, we observe indirect negation detection, as the negation cue no is situated in the neighborhood of the term contact, which is one of the predecessors of the indicator animals.

Therefore, for a clinical feature we have K + 1 indicators and the classification output Y or N from each indicator. The final output for a feature is obtained via majority voting on the outputs of the indicators.

Algorithm 5: Level 2 modeling
Input: seed indicator, word embeddings for each term in W, negation cues, SCn
Output: Y or N

Grow the seed indicator using word embeddings to generate K + 1 indicators $I_t$, $t \in \{1, 2, \ldots, K+1\}$
for each $I_t$ do
    iterate over each sentence in SCn and identify the first sentence $S_{I_t}$ mentioning $I_t$
    construct the directed dependency graph $D_{I_t}$ (see Figure 4.4) for $S_{I_t}$
    $N_{I_t}$ = set of terms connected to $I_t$ in $D_{I_t}$, i.e., neighbors of $I_t$
    $P_{I_t}$ = predecessors of $I_t$ in $D_{I_t}$
    Isnegation ← 0
    if $N_{I_t}$ has a negation cue then
        output of $I_t$ = N
        Isnegation ← 1
        break
    else
        iterate over each term in $P_{I_t}$ and search for a negation cue in its neighborhood
        if negation cue found in neighborhood of a predecessor then
            output of $I_t$ = N
            Isnegation ← 1
            break
    if ¬Isnegation then
        output of $I_t$ = Y
final output = majority voting on the outputs of each $I_t$
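The direct and indirect negation checks for a single indicator can be sketched on the example sentence S6. This is a minimal sketch under stated assumptions: the directed edges (head to dependent) are written by hand rather than produced by a parser, integer token ids are a hypothetical device for keeping the repeated words "had" and "no" apart, the cue lexicon is the short list quoted in the text rather than the full lexicon of [20], and the function names are invented.

```python
NEGATION_CUES = {"no", "not", "without", "unable", "never"}

# Tokens of S6 keyed by position, so repeated words stay distinct.
TOKENS = {0: "The", 1: "patient", 2: "had", 3: "no", 4: "comorbidities",
          5: "and", 6: "had", 7: "no", 8: "contact", 9: "with", 10: "animals"}

# Hand-built directed dependency edges (head -> dependent); an assumption.
EDGES = [(2, 1), (1, 0), (2, 4), (4, 3), (2, 5), (2, 6),
         (6, 8), (8, 7), (8, 9), (9, 10)]


def neighbors(node, edges):
    """Terms connected to `node` by an edge in either direction."""
    return {v for u, v in edges if u == node} | {u for u, v in edges if v == node}


def predecessors(node, edges):
    """All terms from which a directed path leads to `node`."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    found, stack = set(), [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def has_cue(nodes, tokens):
    """True if any of the given tokens is a negation cue."""
    return any(tokens[n].lower() in NEGATION_CUES for n in nodes)


def classify(indicator, tokens, edges):
    """Return (label, kind): 'N' with 'direct' or 'indirect', else ('Y', None)."""
    if has_cue(neighbors(indicator, edges), tokens):
        return "N", "direct"
    for pred in predecessors(indicator, edges):
        if has_cue(neighbors(pred, edges), tokens):
            return "N", "indirect"
    return "Y", None


comorbidities_result = classify(4, TOKENS, EDGES)
animals_result = classify(10, TOKENS, EDGES)
```

On this toy graph, comorbidities is negated directly and animals is negated indirectly through its predecessor contact, which mirrors the S6 example. The per-indicator labels would then be combined by majority voting.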

4.4 Experimental Evaluation

In this section, we provide a brief description of our experimental setup, including the corpus, the models for automatic extraction of line lists, the human annotated line list, the accuracy metric and the parameter settings.

4.4.1 WHO corpus

The WHO corpus used for generating the WHO word embeddings (see Figure 4.2) was downloaded from http://www.who.int/csr/don/archive/disease/en/. The corpus contains outbreak news articles related to a wide range of diseases reported during the time period 1996 to 2016. The textual content of each article was pre-processed by sentence splitting, tokenization and lemmatization using spaCy [32]. After pre-processing, the WHO corpus was found to contain 35,485 sentences, resulting in a vocabulary W of 4447 words.

4.4.2 Models

We evaluated the following automated line listing models.

• GELL (SGNS): Variant of GELL with SGNS used as the word2vec model for generating WHO word embeddings.
• GELL (SGHS): Variant of GELL with SGHS used as the word2vec model for generating WHO word embeddings.
• Baseline: Baseline model which does not use WHO word embeddings to expand the seed indicator in order to generate K + 1 indicators for each feature. Therefore, Baseline uses only a single indicator (the seed indicator) to extract line list features.

4.4.3 Human annotated line list

We evaluated the line list extracted by the automated line listing models against a human annotated line list for MERS outbreaks in Saudi Arabia. To create the human annotated list, patient and outcome data for confirmed MERS cases were collected from the MERS Disease Outbreak News (DONs) reports of WHO [82] and curated into a machine-readable tabular line list. The human annotated list contains 241 confirmed cases curated from 64 WHO bulletins reported during the period October 2012 to February 2015. Some of these 241 cases have missing (null) features (see Figure 4.1). In Figure 4.5, we show the distribution of non-null features in the human annotated list. We observe that the majority of human annotated cases have at least 6 (out of 9) non-null features, with the peak of the distribution at 8.

4.4.4 Accuracy metric

Matching automated line list to human annotated list. For evaluation, we are given a set of automated line list cases and a set of human annotated cases for a single WHO MERS bulletin. Our strategy is to construct a bipartite graph [59] where (i) an edge exists if the automated case and the human annotated case are extracted from the same WHO bulletin and (ii) the weight on the edge denotes the quality score (QS). The QS is defined as the number of correctly extracted features in the automated case divided by the number of non-null features in the human annotated case. We then compute a maximum weighted bipartite matching [59]. Such a matching is computed for each WHO bulletin to extract a set of matches, where each match is a pair (automated case, human annotated case) associated with a QS. Once the matches are found for all the WHO bulletins, we compute the average QS by averaging the QS values across the matches.
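The quality score and the per-bulletin matching can be sketched as follows. This is a sketch under assumptions, not the evaluation code used here: cases are represented as dictionaries with `None` for null features, the matching is found by brute force over permutations (adequate for the handful of cases in one bulletin; a Hungarian-style solver would be the usual choice for larger sets), and zero-weight pairs are kept as matches, which may differ from how unmatched cases were actually handled.

```python
from itertools import permutations


def quality_score(auto_case, human_case):
    """Correctly extracted features / non-null features in the human case."""
    non_null = [f for f, v in human_case.items() if v is not None]
    if not non_null:
        return 0.0
    correct = sum(1 for f in non_null if auto_case.get(f) == human_case[f])
    return correct / len(non_null)


def match_bulletin(auto_cases, human_cases):
    """Brute-force maximum weighted bipartite matching within one bulletin.

    Returns a list of (auto_index, human_index, QS) triples.
    """
    w = [[quality_score(a, h) for h in human_cases] for a in auto_cases]
    n_a, n_h = len(auto_cases), len(human_cases)
    if n_a <= n_h:
        candidates = (list(enumerate(p)) for p in permutations(range(n_h), n_a))
    else:
        candidates = ([(i, j) for j, i in enumerate(p)]
                      for p in permutations(range(n_a), n_h))
    best = max(candidates, key=lambda pairs: sum(w[i][j] for i, j in pairs))
    return [(i, j, w[i][j]) for i, j in best]


def average_qs(matches):
    """Mean QS over all matches (0.0 if there are none)."""
    return sum(qs for _, _, qs in matches) / len(matches) if matches else 0.0


# Toy cases for a single bulletin, invented for illustration.
auto = [
    {"age": 45, "gender": "M", "onset_date": "4-June"},
    {"age": 60, "gender": "F", "onset_date": None},
]
human = [
    {"age": 60, "gender": "F", "onset_date": "1-May"},
    {"age": 45, "gender": "M", "onset_date": "4-June"},
]

matches = match_bulletin(auto, human)
mean_qs = average_qs(matches)
```

In the toy bulletin the first automated case pairs with the second human case (QS 1.0) and the second automated case with the first human case (QS 2/3), since the missed onset date counts against it.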

Figure 4.5: Distribution of non-null features in the human annotated line list (x-axis: number of non-null features, from 0 to 9; y-axis: number of line list cases).

Once the average QS and the QS for each match are computed, we also compute the accuracy for each line list feature. For the demographic and disease onset features, we computed the accuracy classification score using scikit-learn [55] by comparing the automated features against the human annotated features across the matches. The clinical features are associated with two classes, Y and N (see Figure 4.1). For each class, we computed the F1-score using scikit-learn [55], where the F1-score can be interpreted as the harmonic mean of precision and recall. The F1-score reaches its best value at 1 and its worst value at 0. Along with the F1-score for each class, we also report the average F1-score across the two classes.
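The per-class F1-score and its average over the two classes can be illustrated with a small pure-Python computation. The evaluation itself used scikit-learn; the functions below are a stand-in written for this sketch, and the label lists are invented toy data.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the annotation."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """Harmonic mean of precision and recall for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for one clinical feature across six matched cases.
human = ["Y", "N", "N", "N", "Y", "N"]
auto = ["Y", "N", "Y", "N", "N", "N"]

f1_y = f1_for_class(human, auto, "Y")
f1_n = f1_for_class(human, auto, "N")
average_f1 = (f1_y + f1_n) / 2
```

Here class Y has one true positive, one false positive and one false negative, giving an F1-score of 0.5, while class N reaches 0.75, so the average F1-score is 0.625.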

4.4.5 Parameter settings

GELL (SGNS) and GELL (SGHS) use WHO word embeddings to generate K + 1 indicators for the line list columns. Therefore, these two models inherit the parameters of skip-gram based word2vec techniques, such as dimensionality, window size and negative samples, as shown in Table 4.5. Apart from the word2vec parameters, GELL also has the parameter K, which determines the K + 1 indicators used for the disease onset and clinical features (see Section ??). In Table 4.5, we provide the list of all parameters, the explored values for each parameter and the applicable models corresponding to each parameter. We selected the optimal parameter configuration for each model based on the maximum average QS value as well as the maximum average of the individual feature accuracies across the matches.
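The explored configurations can be enumerated as a simple grid. The sketch below only lists the candidate settings, using the explored values given in the header of Table 4.5; the key names are hypothetical, and the step that scores each configuration (average QS and feature accuracies) is not shown because it depends on the full pipeline.

```python
from itertools import product

# Explored values as listed in Table 4.5; key names are assumptions.
WORD2VEC_GRID = {
    "dimensionality": [300, 600],
    "window_size": [5, 10, 15],
    "training_iterations": [1, 2, 5],
}
NEGATIVE_SAMPLES = [1, 5, 15]  # applicable to SGNS only
K_VALUES = [3, 5, 7]


def configurations(model):
    """Yield one dict per parameter combination for 'SGNS' or 'SGHS'."""
    grid = dict(WORD2VEC_GRID, K=K_VALUES)
    if model == "SGNS":
        grid["negative_samples"] = NEGATIVE_SAMPLES
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


sgns_configs = list(configurations("SGNS"))
sghs_configs = list(configurations("SGHS"))
```

Under these value lists the grid contains 162 configurations for GELL (SGNS) and 54 for GELL (SGHS), the difference being the negative-sampling parameter.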


4.5 Results and Discussions

In this section, we try to ascertain the efficacy and applicability of GELL by investigating some of the pertinent questions related to the problem of automated line listing.

Multiple indicators vs single indicator - which is the better method for automated line listing?

As mentioned in Section 4.4.2, GELL (SGNS) and GELL (SGHS) use multiple indicators discovered by word2vec, whereas Baseline uses only the seed indicator to infer line list features. We executed our automated line listing models taking as input the same set of 64 WHO MERS bulletins from which the 241 human annotated line list cases were extracted. In Table 4.2, we observe that the number of automated line list cases (198) and the number of matches (182) after maximum bipartite matching are the same for all the models. This is because level 0 modeling (age and gender extraction) is a common component of all the models and the number of extracted line list cases depends on the age and gender extraction (see section ??). In Table 4.2, we also compare the average QS achieved by each model. We observe that GELL (SGNS) is the best performing model, achieving an average QS of 0.74, ahead of GELL (SGHS) (0.71) and Baseline (0.67). To further validate the results in Table 4.2, we also show the QS distribution for each model in Figure 4.6, where the x-axis represents the QS values and the y-axis represents the number of automated line list cases having a particular QS value. For Baseline, the peak of the QS distribution is at 0.62. However, for GELL (SGNS) and GELL (SGHS), the peak of the distribution is at 0.75. We further observe that GELL (SGNS) extracts a higher number of line list cases with a perfect QS of 1 in comparison to Baseline.

We also compared the models on the basis of the individual accuracies of the line list features across the matches in Tables 4.3 and 4.4. In Table 4.3, all the models achieve similar performance for the demographic features since level 0 modeling is the same for all the models (see section ??). However, for the disease onset features, both GELL (SGNS) and GELL (SGHS) outperform Baseline, achieving average accuracies of 0.45 and 0.43 respectively, in comparison to 0.20 for Baseline. GELL (SGNS) is the best performing model for onset date. However, for hospitalization date and outcome date, GELL (SGHS) performs better than GELL (SGNS). In Table 4.4, for the clinical features, we observe that GELL (SGNS) performs better than GELL (SGHS) and Baseline for comorbidities and specified HCW on the basis of average F1-score. Specifically, for specified HCW, GELL (SGNS) outperforms GELL (SGHS) and Baseline for the minority class Y. For animal contact, GELL (SGHS) emerges as the best performing model in terms of average F1-score, specifically outperforming the competing models for the minority class Y. Baseline performs better only for secondary contact, even though its performance for the minority class Y is almost the same as that of GELL (SGHS) and GELL (SGNS). Overall, we can conclude from Table 4.4 that GELL, employing multiple indicators discovered via SGNS or SGHS, performs better than Baseline in the majority of scenarios, specifically for the minority class of each clinical feature.

Figure 4.6: Distribution of QS values for each automated line listing model (Baseline, GELL (SGHS), GELL (SGNS)) corresponding to the MERS line list in Saudi Arabia. The x-axis represents QS values and the y-axis represents the number of automated line list cases having a particular QS value.

Table 4.2: Average Quality Score (QS) achieved by each automated line listing model for the MERS line list in Saudi Arabia. As can be seen, GELL (SGNS) shows the best performance, achieving an average QS of 0.74.

Models       Human lists  Auto lists  Matches  Average QS
Baseline     241          198         182      0.67
GELL (SGHS)  241          198         182      0.71
GELL (SGNS)  241          198         182      0.74

What are beneficial parameter settings for automated line listing?

To identify which parameter settings are beneficial for automated line listing, we looked at the best parameter configuration (see Table 4.5) of GELL (SGNS) and GELL (SGHS) which achieved the accuracy values in Tables 4.2, 4.3 and 4.4. In Table 4.5, we explored the standard settings of each word2vec parameter (dimensionality of word embeddings, window size, negative samples and training iterations) in accordance with previous research [42]. Regarding dimensionality of word embeddings, GELL (SGHS) prefers 600 dimensions, whereas GELL (SGNS) prefers 300 dimensions. For the window size, both the models seem to benefit from smaller-sized (5) context windows. Most sentences in the WHO corpus contain information about multiple columns; therefore, the relevant contexts of indicators are in their immediate vicinity, which favors smaller window sizes. The number of negative samples is applicable only to GELL (SGNS), which seems to prefer a single negative sample. For the training iterations, both the models benefit from more than one training iteration. This is expected as the WHO corpus used for generating WHO word embeddings (see Section 4.4.1) is a small corpus with a vocabulary of only |W| = 4447 words. In such scenarios, word2vec models (SGNS or SGHS) generate improved embeddings with a higher number of training iterations. Finally, both the models are also associated with the parameter K, which determines the number of indicators (K + 1) used for extracting the disease onset and clinical features. As expected, the models prefer at least 5 indicators, along with the seed indicator, to be used for automated line listing. Using a higher number of indicators increases the chance of discovering an informative indicator for a line list feature.

Table 4.3: Comparing the automated line listing models based on the accuracy score for the demographics and disease onset features. For the disease onset features, GELL (SGNS) emerges as the best performing model on average. However, for the demographic features, all the models achieve almost similar performance.

Feature type   Features              Baseline  GELL (SGHS)  GELL (SGNS)
Demographics   Age                   0.87      0.91         0.87
               Gender                0.99      0.98         0.97
               Average               0.93      0.95         0.92
Disease onset  Onset date            0.01      0.01         0.37
               Hospitalization date  0.11      0.63         0.62
               Outcome date          0.48      0.66         0.36
               Average               0.20      0.43         0.45

Table 4.4: Comparing the performance of the automated line listing models for extracting clinical features corresponding to the MERS line list in Saudi Arabia. We report the F1-score for class Y, class N and the average F1-score across the two classes. For animal contact, GELL (SGHS) emerges as the best performing model. For comorbidities and specified HCW, GELL (SGNS) shows the best performance. However, for secondary contact, Baseline achieves superior performance in comparison to GELL.

Clinical feature (Y:N)   Class    Baseline  GELL (SGHS)  GELL (SGNS)
Animal contact (1:3)     Y        0.33      0.68         0.37
                         N        0.87      0.91         0.88
                         Average  0.60      0.79         0.63
Secondary contact (1:3)  Y        0.57      0.52         0.56
                         N        0.86      0.70         0.72
                         Average  0.71      0.61         0.64
Comorbidities (2:1)      Y        0.52      0.52         0.81
                         N        0.56      0.54         0.61
                         Average  0.54      0.53         0.71
Specified HCW (1:6)      Y        0.26      0.35         0.44
                         N        0.95      0.93         0.90
                         Average  0.61      0.64         0.67

Table 4.5: Parameter settings in GELL (SGNS) and GELL (SGHS) for which both the models achieve optimal performance in terms of average QS and individual feature accuracies corresponding to the MERS line list in Saudi Arabia. The explored values for each parameter are given in parentheses. Non-applicable combinations are marked by NA.

Models       Dimensionality  Window size  Negative samples  Training iterations  Indicators
             (300:600)       (5:10:15)    (1:5:15)          (1:2:5)              (K = 3:5:7)
GELL (SGHS)  600             5            NA                5                    7
GELL (SGNS)  300             5            1                 2                    5

Which indicator keywords discovered using word2vec contribute to the improved performance of GELL?

Next, we investigate the informative indicators discovered using word2vec which contribute to the improved performance of GELL (SGNS) or GELL (SGHS) in Tables 4.3 and 4.4. In Figure 4.7, we show the accuracies (or average F1-scores) of individual indicators (including the seed indicator) corresponding to the best performing model for a particular line list feature. Regarding onset date (see Figure 4.7a), GELL (SGNS) is the best performing model and the seed indicator provided as input is onset. We observe that symptoms is the most informative indicator, achieving an accuracy of 0.36, similar to the overall accuracy (see Table 4.3). The rest of the indicators (including the seed indicator) achieve negligible accuracies and therefore do not contribute to the overall performance of GELL (SGNS). Similarly, for hospitalization date with the seed indicator hospitalized provided as input, admitted emerges as the most informative indicator, followed by the seed indicator, hospitalised and treated (see Figure 4.7b). Finally, for the outcome date, died (seed indicator) and passed are the two most informative indicators, as observed in Figure 4.7c.

Regarding the clinical features, we show the average F1-score of individual indicators. For animal contact, the seed indicator provided as input is animals. We observe in Figure 4.7d that the most informative indicator for animal contact is camels, followed by indicators such as animals (seed), sheep and direct. This suggests that contact with camels is a major transmission mechanism for MERS. The informative indicators found for comorbidities are patient, comorbidities and history. Finally, regarding specified HCW, the informative indicators discovered are healthcare (seed), tracing and intensive.

Does indirect negation detection play a useful role in extracting clinical features?

In level 2 modeling for extracting clinical features, both direct and indirect negation detection are used. For more details, please see section ??. To identify whether indirect negation detection contributes positively, we compared the performance of GELL with and without indirect negation detection for each clinical feature in Table 4.6 by reporting the F1-score for each class as well as the average F1-score. We observe that indirect negation detection has a positive effect on the performance for animal contact and secondary contact. However, for comorbidities and specified HCW it does not help: the average F1-score is unchanged for specified HCW (0.67) and slightly lower for comorbidities (0.72 vs. 0.75).

Table 4.6: Comparing the performance of GELL on extraction of clinical features with or without indirect negation for the MERS line list in Saudi Arabia. It can be seen that indirect negation improves the performance of GELL for animal contact and secondary contact.

Clinical feature   Class    Direct negation  Direct + indirect negation
Animal contact     Y        0.56             0.63
                   N        0.80             0.90
                   Average  0.68             0.77
Secondary contact  Y        0.55             0.54
                   N        0.65             0.72
                   Average  0.60             0.63
Comorbidities      Y        0.86             0.82
                   N        0.64             0.62
                   Average  0.75             0.72
Specified HCW      Y        0.44             0.44
                   N        0.90             0.90
                   Average  0.67             0.67

What insights can epidemiologists gain about the MERS disease from automatically extracted line lists?

Finally, we show some of the utilities of automated line lists by inferring different epidemiological insights from the line list extracted by GELL.

Demographic distribution. In Figure 4.1, we show the age and gender distribution of the affected individuals in the extracted line list. We observe that males are more prone to getting infected by MERS than females. This is expected as males have a higher probability of coming into contact with infected animals (animal contact) or with each other (secondary contact). Also, individuals aged between 40 and 70 are more prone to getting infected, as is evident from the age distribution.

Analysis of disease onset features. We analyzed the symptoms-to-hospitalization period by computing the difference (in days) between onset date and hospitalization date in the extracted line list, as shown in Figure 4.8a. We observe that most of the affected individuals with onset of symptoms were admitted to the hospital either on the same day or within 5 days. This indicates a prompt response by the concerned health authorities in Saudi Arabia in terms of admitting the individuals showing symptoms of MERS. In Figure 4.8b, we also show the distribution of the hospitalization-to-outcome period (in days). Interestingly, we see that the distribution has a peak at 0, which indicates that most of the infected individuals admitted to the hospital died on the same day, indicating a high fatality rate for MERS cases.
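The period distributions discussed above amount to counting day differences between two date columns of the line list. A minimal sketch follows, assuming each case is a dictionary holding `datetime.date` values (or `None` when a date is missing); the field names and the toy cases are invented for illustration and do not come from the extracted line list.

```python
from collections import Counter
from datetime import date


def period_in_days(start, end):
    """Difference in days between two dates, or None if either is missing."""
    if start is None or end is None:
        return None
    return (end - start).days


def period_distribution(cases, start_key, end_key):
    """Histogram mapping a period length (days) to the number of cases."""
    periods = (period_in_days(c.get(start_key), c.get(end_key)) for c in cases)
    return Counter(p for p in periods if p is not None)


# Toy line list cases (hypothetical values).
cases = [
    {"onset": date(2014, 6, 4), "hospitalization": date(2014, 6, 12)},
    {"onset": date(2014, 6, 10), "hospitalization": date(2014, 6, 10)},
    {"onset": date(2014, 6, 1), "hospitalization": None},
    {"onset": date(2014, 5, 30), "hospitalization": date(2014, 6, 1)},
    {"onset": date(2014, 7, 2), "hospitalization": date(2014, 7, 2)},
]

distribution = period_distribution(cases, "onset", "hospitalization")
```

Cases with a missing date are skipped, so the histogram covers only cases where both features were extracted.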

4.6 Future Research Directions

• Future research in automated line listing can focus on adapting GELL to extract line lists from highly unstructured news sources (compared to WHO) such as HealthMap [10], using advanced NLP techniques such as CRFs and LSTMs, specifically for negation detection in level 2 modeling. Extending GELL to extract line lists from generic news articles (HealthMap) could also be used for estimating case counts for emerging diseases in a real-time scenario.

• We also aim to extract line lists using GELL for other emerging diseases, such as Ebola and H7N9, in different geographical regions of the world.

• Finally, insights such as age distribution, gender distribution and incubation period obtained from line lists automatically extracted by GELL can be used to build and parameterize epidemiological models [63] for forecasting MERS outbreaks in Saudi Arabia.

Figure 4.7: Accuracy of individual indicators (including the seed indicator) discovered via word2vec methods in GELL (SGNS) or GELL (SGHS) for each line list feature. For clinical features, we show the average F1-score. Panels: (a) Onset date, GELL (SGNS); (b) Hospitalization date, GELL (SGHS); (c) Outcome date, GELL (SGHS); (d) Animal contact, GELL (SGHS); (e) Comorbidities, GELL (SGNS); (f) Specified HCW, GELL (SGNS). This figure depicts the informative indicators (indicators showing higher accuracies or F1-scores) which contribute to the improved performance of GELL (SGNS) or GELL (SGHS) for a particular feature. For example, for animal contact, the most informative indicator contributing to the superior performance of GELL (SGHS) is camels, followed by animals (seed), sheep and direct.

Figure 4.8: Analysis of disease onset features in the extracted line list. Panels: (a) symptoms-to-hospitalization period distribution; (b) hospitalization-to-outcome period distribution. In both panels the x-axis is the period in days and the y-axis is the number of line list cases.

Chapter 5

Characterizing Diseases from Unstructured Text: A Vocabulary Driven Word2vec Approach

5.1 Introduction

The massive volume and diversity of publicly available online news media makes manual analysis daunting. Moreover, news reports are in general unstructured, and the construction of surveillance tools such as taxonomic correlations and trace mapping requires considerable human supervision. Automated tools are necessary to extract meaningful information. Our aim in this chapter is to develop automated deep learning based methods which, once trained over a disease-related news corpus, can be used for disease knowledge extraction: given a disease, the method should be able to characterize its symptoms, exposures, transmission methods and transmission agents (see Table 5.1).

Our main contributions are as follows.

• We formulate Dis2Vec [24], a vocabulary driven word2vec method which is used to generate disease specific word embeddings from an unstructured health-related news corpus. Dis2Vec allows domain knowledge in the form of a pre-specified disease-related vocabulary V to supervise the discovery process of word embeddings.

• We use these disease specific word embeddings to generate automated disease taxonomies that are then evaluated against human curated ones for accuracy.

• Finally, we evaluate the applicability of such word embeddings over different classes of diseases (emerging, endemic and rare) for different taxonomical characterizations.


Preview of our results: In Figure 5.1, we provide a comparative performance evaluation of Dis2Vec across the disease characterization tasks for endemic, emerging and rare diseases. It can be seen that Dis2Vec is best able to characterize emerging diseases. Specifically, it is able to capture symptoms, transmission methods and transmission agents with near-perfect accuracies for emerging diseases. Such diseases (e.g. Ebola, H7N9) draw considerable media interest due to their unknown characteristics. News articles reporting emerging outbreaks tend to focus on all characteristics of such diseases: symptoms, exposures, transmission methods and transmission agents. However, for endemic and rare diseases, transmission agents and exposures are better understood, and news reports tend to focus mainly on symptoms and transmission methods. Dis2Vec can still be applied to these classes of diseases, but with decreased accuracy for these under-represented characteristics.

Figure 5.1: Comparative performance evaluation of the disease specific word2vec model (Dis2Vec) across the disease characterization tasks for three different classes of diseases: endemic (blue), emerging (red) and rare (green). The axes along the four vertices represent the modeling accuracy for the disease characterization of interest, viz. symptoms, transmission agents, transmission methods, and exposures. The area under the curve for each disease class represents the corresponding overall accuracy over all the characterizations. The best characterization performance can be seen for emerging diseases.

Figure 5.2: Automated taxonomy generation from an unstructured news corpus (HealthMap) and a pre-specified vocabulary (V). Dis2Vec takes this information as input to generate disease specific word embeddings that are then passed through a cosine comparator to generate the taxonomy (transmission methods, transmission agents, exposures, symptoms) for the disease of interest.

5.2 Model

5.2.1 Problem Overview

Disease taxonomy generation is the process of tabulating characteristics of diseases w.r.t.several pre-specified categories such as symptoms and transmission agents. Table 5.1 givesan example of taxonomy for three diseases viz. an emerging disease (H7N9), an endemicdisease (avian influenza) and a rare disease (plague). Traditionally, such taxonomies arehuman curated - either from prior expert knowledge or by combining a multitude of re-porting sources. News reports covering disease outbreaks can often contain disease specific

66

Table 5.1: Human curated disease taxonomy for three diseases from three different class ofdiseases (endemic, emerging, and rare).

Disease Transmissionmethods

Transmissionagents

Clinical symptoms Exposures

Avian influenza(endemic)

zoonotic domestic animal,wild animal

Fever, cough, sorethroat, diarrhea,vomiting

animal exposure,farmer, market,slaughter

H7N9 (emerging) zoonotic domestic animal Fever, cough,pneumonia

farmer, market,slaughter, animalexposure

Plague (rare) vectorborne,zoonotic

flea, wild animal Sore, fever,headache, mus-cle ache, vomiting,nausea

animal expo-sure, veterinarian,farmer

information, albeit in an unstructured way. Our aim is to generate an automated taxonomy of diseases similar to Table 5.1 using such unstructured information from news reports. Such automated methods can greatly simplify the process of generating taxonomies, especially for emerging diseases, and lead to timely dissemination of such information to public health services. In general, such a disease-related news corpus is of moderate size for deep-learning methods and, as explained in section 5.1, unsupervised methods often fail to extract meaningful information. Thus we incorporate domain knowledge in the form of a flat list of disease-related terms such as disease names, possible symptoms and possible transmission methods, hereafter referred to as the vocabulary V. Figure 5.2 shows the process of automated taxonomy generation, where we employ a supervised word2vec method referred to as Dis2Vec, which takes the following inputs - (a) the pre-specified disease vocabulary V and (b) the unstructured news corpus D - and generates embeddings for each word (including words in the vocabulary V) in the corpus. Once word embeddings are generated, we employ a cosine comparator to create a tabular list of disease taxonomies similar to Table 5.1. In this cosine comparator, to classify each disease for a taxonomical category, we calculate the cosine similarities between the embedding for the disease name and the embeddings for all possible words related to that category. Then, we sort these cosine similarities in descending order and extract the words higher up in the order, i.e. closer to the disease name, hereafter referred to as the top words found for that category. For example, to extract the transmission agents for plague, we calculate the cosine similarities between the embedding for the word plague and the embeddings for all possible terms related to transmission agents and extract the top words by sorting the terms w.r.t. these similarities. We can compare the top words found for a category with the human annotated words to compute the accuracy of the taxonomy generated from word embeddings. In the next two subsections, we briefly discuss the basic word2vec model (skip-gram model with negative sampling), followed by a detailed description of our vocabulary driven word2vec model Dis2Vec.
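The cosine comparator described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`cosine`, `top_words`), the cut-off `n` and the 3-dimensional toy embeddings are assumptions made for the example and are not taken from the Dis2Vec implementation.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def top_words(embeddings, disease, category_terms, n=3):
    """Rank the terms of one taxonomical category by similarity to the disease name."""
    d = embeddings[disease]
    scored = [(t, cosine(d, embeddings[t])) for t in category_terms if t in embeddings]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # descending similarity
    return scored[:n]

# Toy embeddings, made up for illustration only.
toy = {
    "plague": np.array([1.0, 0.2, 0.0]),
    "flea": np.array([0.9, 0.3, 0.1]),
    "mosquito": np.array([0.1, 1.0, 0.0]),
    "tick": np.array([0.0, 0.1, 1.0]),
}
ranked = top_words(toy, "plague", ["flea", "mosquito", "tick"], n=2)
print(ranked)
```

With these toy vectors, flea is ranked first among the candidate transmission agents for plague, because its vector points in nearly the same direction as the plague vector.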


5.2.2 Basic Word2vec Model

In this section, we present a brief description of SGNS - the skip-gram model introduced in [47] trained using the negative sampling procedure in [48]. The objective of the skip-gram model is to infer word embeddings that will be relevant for predicting the surrounding words in a sentence or a document. Note that the skip-gram model can also be trained using the Hierarchical Softmax method as shown in [48].

The inputs to the skip-gram model are a corpus of words w ∈ W and their corresponding contexts c ∈ C, where W and C are the word and context vocabularies. In SGNS, the contexts of word w_i are defined by the words surrounding it in an L-sized context window w_{i−L}, . . . , w_{i−1}, w_{i+1}, . . . , w_{i+L}. In order to convert the corpus D of unstructured news reports into a collection of observed (w, c) pairs, the textual content of each news report is processed to generate a set of unique terms or words, and then the contexts of each such term are extracted by identifying the words surrounding it in an L-sized window. The notation #(w, c) represents the number of times the pair (w, c) occurs in D. Therefore, $\#(w) = \sum_{c \in C} \#(w, c)$ and $\#(c) = \sum_{w \in W} \#(w, c)$, where #(w) and #(c) are the total number of times w and c occurred in D. Each word w ∈ W corresponds to a vector $\mathbf{w} \in \mathbb{R}^T$ and similarly, each context c ∈ C is represented as a vector $\mathbf{c} \in \mathbb{R}^T$, where T is the dimensionality of the word or context embedding. The entries in the vectors are the latent parameters to be learned.
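As a minimal sketch of how the observed (w, c) pairs and the counts #(w, c), #(w) and #(c) could be collected, consider the following Python fragment. The helper name `extract_pairs`, the toy sentence and the window size L = 1 are assumptions chosen for illustration.

```python
from collections import Counter

def extract_pairs(tokens, L=2):
    """Yield (word, context) pairs using an L-sized window on each side of a token."""
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - L), min(len(tokens), i + L + 1)
        for j in range(lo, hi):
            if j != i:
                yield (w, tokens[j])

# Toy pre-processed sentence (made up).
sentence = ["plague", "spread", "by", "flea", "bite"]

pair_counts = Counter(extract_pairs(sentence, L=1))   # #(w, c)
word_counts = Counter()                                # #(w) = sum over c of #(w, c)
context_counts = Counter()                             # #(c) = sum over w of #(w, c)
for (w, c), n in pair_counts.items():
    word_counts[w] += n
    context_counts[c] += n

print(pair_counts[("spread", "by")], word_counts["spread"], context_counts["spread"])
```

Tokens at the edges of the sentence simply get a shorter window here; how window truncation is handled in practice depends on the word2vec implementation used.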

SGNS tries to maximize the probability that a single word-context pair (w, c) was generated from the observed corpus D. Let P(D = 1|w, c) denote the probability that (w, c) was generated from the corpus, and P(D = 0|w, c) = 1 − P(D = 1|w, c) the probability that (w, c) was not. The objective function for a single (w, c) pair is modeled as:

$$P(D = 1 \mid w, c) = \sigma(\mathbf{w} \cdot \mathbf{c}) = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{c}}} \tag{5.1}$$

where w and c are the T-dimensional latent parameters or vectors to be learned.

The objective of negative sampling is to maximize P(D = 1|w, c) for observed (w, c) pairs while maximizing P(D = 0|w, c) for randomly sampled negative contexts (hence the name negative sampling), under the assumption that randomly selecting a context for a given word will tend to generate an unobserved (w, c) pair. SGNS's objective for a single (w, c) observation is then:

$$l_{(w,c)} = \log \sigma(\mathbf{w} \cdot \mathbf{c}) + k \cdot \mathbb{E}_{c_N \sim P_D}\left[\log \sigma(-\mathbf{w} \cdot \mathbf{c}_N)\right] \tag{5.2}$$

where k is the number of negative samples and c_N is the sampled context, drawn according to the smoothed unigram distribution $P_D(c) = \frac{\#(c)^{\alpha}}{\sum_{c'} \#(c')^{\alpha}}$, where α = 0.75 is the smoothing parameter and $\mathbb{E}$ denotes the expectation.
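To make the smoothed unigram distribution and the per-pair objective of equation 5.2 concrete, the following sketch computes P_D(c) from toy counts and estimates the objective with k sampled negatives (the sum over k draws is a Monte Carlo estimate of k times the expectation). The function names, the toy counts and the random toy embeddings are assumptions for illustration only.

```python
import numpy as np

def smoothed_unigram(context_counts, alpha=0.75):
    """Return the contexts and P_D(c), proportional to #(c)^alpha."""
    contexts = sorted(context_counts)
    weights = np.array([context_counts[c] for c in contexts], dtype=float) ** alpha
    return contexts, weights / weights.sum()

def log_sigmoid(x):
    """Numerically stable log(sigma(x))."""
    return -np.logaddexp(0.0, -x)

def sgns_pair_objective(w_vec, c_vec, neg_vecs):
    """log sigma(w.c) plus the sum of log sigma(-w.c_N) over the sampled negatives."""
    positive = log_sigmoid(np.dot(w_vec, c_vec))
    negative = sum(log_sigmoid(-np.dot(w_vec, n)) for n in neg_vecs)
    return float(positive + negative)

# Toy counts (made up): one frequent and one rare context.
contexts, probs = smoothed_unigram({"the": 16, "flea": 1})
print(dict(zip(contexts, probs.round(3))))

rng = np.random.default_rng(0)
emb = {c: rng.normal(scale=0.1, size=4) for c in contexts}   # toy context vectors
k = 2
sampled = rng.choice(len(contexts), size=k, p=probs)          # draw k negatives from P_D
value = sgns_pair_objective(emb["flea"], emb["the"], [emb[contexts[i]] for i in sampled])
```

With α = 0.75 the rare context receives probability 1/9 instead of the unsmoothed 1/17, which illustrates how smoothing shifts sampling mass toward infrequent contexts.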

The objective of SGNS is trained in an online fashion using stochastic gradient updates over the observed pairs in the corpus D. The global objective then sums over the observed (w, c) pairs in the corpus:

$$l_{SGNS} = \sum_{(w,c) \in D} \left( \log \sigma(\mathbf{w} \cdot \mathbf{c}) + k \cdot \mathbb{E}_{c_N \sim P_D}\left[\log \sigma(-\mathbf{w} \cdot \mathbf{c}_N)\right] \right) \tag{5.3}$$

Optimizing this objective will have a tendency to generate similar embeddings for observed word-context pairs, while scattering unobserved pairs in the vector space. Intuitively, words that appear in similar contexts or tend to appear in the contexts of each other should have similar embeddings.
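A single stochastic gradient step on the per-pair objective can be sketched as follows. This is a hedged illustration rather than the update used by any particular word2vec implementation: the learning rate, the vector dimensionality, the random initialisation and the function names are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def objective(w, c, negs):
    """Per-pair SGNS objective with explicitly sampled negatives."""
    return float(np.log(sigmoid(w @ c)) + sum(np.log(sigmoid(-(w @ n))) for n in negs))

def sgns_step(w, c, negs, lr=0.05):
    """One gradient-ascent step; returns updated copies of all vectors."""
    g_pos = 1.0 - sigmoid(w @ c)               # derivative of log sigma(x) at x = w.c
    g_negs = [-sigmoid(w @ n) for n in negs]   # derivative of log sigma(-x) at x = w.c_N
    grad_w = g_pos * c + sum(g * n for g, n in zip(g_negs, negs))
    new_w = w + lr * grad_w
    new_c = c + lr * g_pos * w
    new_negs = [n + lr * g * w for g, n in zip(g_negs, negs)]
    return new_w, new_c, new_negs

rng = np.random.default_rng(0)
w, c = rng.normal(scale=0.1, size=(2, 8))          # toy word and context vectors
negs = list(rng.normal(scale=0.1, size=(3, 8)))    # k = 3 toy negative contexts

before = objective(w, c, negs)
w2, c2, negs2 = sgns_step(w, c, negs)
after = objective(w2, c2, negs2)
print(before, after)
```

The step pulls the observed pair together (larger w · c) and pushes the word away from the sampled negatives (smaller w · c_N), which is the behaviour the paragraph above describes.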

5.2.3 Disease Specific Word2vec Model (Dis2Vec)

In this section, we introduce Dis2Vec, a disease specific word2vec model whose objective is to generate word embeddings which will be useful for automatic disease taxonomy creation given an input unstructured corpus D. We use a pre-specified disease-related vocabulary V (domain information) to guide the discovery process of word embeddings in Dis2Vec. The input corpus D consists of a collection of (w, c) pairs. Based on V, we can categorize the (w, c) pairs into three types as shown below:

• D(d) = {(w, c) : w ∈ V ∧ c ∈ V}, i.e. both the word w and the context c are in V

• D(¬d) = {(w, c) : w ∉ V ∧ c ∉ V}, i.e. neither the word w nor the context c is in V

• D(d)(¬d) = {(w, c) : w ∈ V ⊕ c ∈ V}, i.e. either the word w is in V or the context c is in V, but not both

Therefore, the input corpus D can be represented as D = D(d) + D(¬d) + D(d)(¬d). Each of these categories of (w, c) pairs needs special consideration while generating disease specific embeddings.
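The three-way split of the (w, c) pairs follows directly from set membership in V. A minimal Python sketch, with a hypothetical function name and a made-up toy vocabulary, is:

```python
def categorize(pairs, vocab):
    """Split (w, c) pairs into D(d), D(¬d) and D(d)(¬d) by membership in V."""
    both, neither, mixed = [], [], []
    for w, c in pairs:
        w_in, c_in = w in vocab, c in vocab
        if w_in and c_in:
            both.append((w, c))       # D(d): word and context both in V
        elif not w_in and not c_in:
            neither.append((w, c))    # D(¬d): neither in V
        else:
            mixed.append((w, c))      # D(d)(¬d): exactly one in V (exclusive or)
    return both, neither, mixed

# Toy vocabulary and pairs (made up for illustration).
V = {"plague", "flea", "fever"}
pairs = [("plague", "flea"), ("the", "report"), ("plague", "the"), ("said", "fever")]
d_pairs, non_d_pairs, mixed_pairs = categorize(pairs, V)
print(d_pairs, non_d_pairs, mixed_pairs)
```

Every pair falls into exactly one of the three lists, which is why the corpus can be written as the combination of the three categories.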

Vocabulary Driven Negative Sampling

The first category (D(d)) of (w, c) pairs, where both w and c are in V (w ∈ V ∧ c ∈ V), is of prime importance in generating disease specific word embeddings. Our first step in generating such embeddings is to maximize log σ(w · c) in order to achieve similar embeddings for these disease word-context pairs. Apart from maximizing the dot products, following classical approaches [48], negative sampling is also required to generate robust embeddings. In Dis2Vec, we adopt a vocabulary (V) driven negative sampling for these disease word-context pairs. In this vocabulary driven approach, instead of random sampling we sample negative examples (c_N) from the set of non-disease contexts, i.e. contexts which are not in V (c ∉ V). This targeted sampling of negative contexts will ensure dissimilar embeddings of disease words (w ∈ V) and non-disease contexts (c ∉ V), thus scattering them in the vector space. However, sampling negative examples only from the set of non-disease contexts may lead to overfitting and thus, we introduce a sampling parameter π_s which controls the probability of drawing a negative example from non-disease contexts (c ∉ V) versus disease contexts (c ∈ V). Dis2Vec’s objective for (w, c) ∈ D(d) is shown below in equation 5.4.

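$$\begin{aligned} l_{D_{(d)}} = \sum_{(w,c) \in D_{(d)}} \Big( & \log \sigma(\mathbf{w} \cdot \mathbf{c}) \\ & + k \cdot \big[ P(x_k < \pi_s)\, \mathbb{E}_{c_N \sim P_{D_{c \notin V}}}\left[\log \sigma(-\mathbf{w} \cdot \mathbf{c}_N)\right] \\ & \quad + P(x_k \geq \pi_s)\, \mathbb{E}_{c_N \sim P_{D_{c \in V}}}\left[\log \sigma(-\mathbf{w} \cdot \mathbf{c}_N)\right] \big] \Big) \end{aligned} \tag{5.4}$$

where x_k ∼ U(0, 1), U(0, 1) being the uniform distribution on the interval [0, 1]. If x_k < π_s, we sample a negative context c_N from the unigram distribution $P_{D_{c \notin V}}$, where $D_{c \notin V}$ is the collection of (w, c) pairs for which c ∉ V and $P_{D_{c \notin V}}(c) = \frac{\#(c)^{\alpha}}{\sum_{c' \notin V} \#(c')^{\alpha}}$, where α is the smoothing parameter. For values of x_k ≥ π_s, we sample c_N from the unigram distribution $P_{D_{c \in V}}$, with $P_{D_{c \in V}}(c) = \frac{\#(c)^{\alpha}}{\sum_{c' \in V} \#(c')^{\alpha}}$. Therefore, optimizing the objective in equation 5.4 will have a tendency to generate disease specific word embeddings for values of π_s ≥ 0.5, because a higher number of negative contexts (c_N) will be sampled from the set of non-disease contexts (c ∉ V) with π_s ≥ 0.5.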
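The vocabulary driven sampling step can be sketched as below: each of the k negatives is drawn from the non-disease contexts when a uniform draw falls below π_s, and from the disease contexts otherwise. The function names and toy counts are assumptions, and the sketch assumes that both context sets are non-empty.

```python
import numpy as np

def make_samplers(context_counts, vocab, alpha=0.75):
    """Smoothed unigram distributions over non-disease and disease contexts."""
    def dist(contexts):
        weights = np.array([context_counts[c] for c in contexts], dtype=float) ** alpha
        return contexts, weights / weights.sum()
    outside = dist(sorted(c for c in context_counts if c not in vocab))  # c not in V
    inside = dist(sorted(c for c in context_counts if c in vocab))       # c in V
    return outside, inside

def sample_negatives(outside, inside, k, pi_s, rng):
    """Draw k negatives; each comes from non-disease contexts with probability pi_s."""
    negatives = []
    for _ in range(k):
        x = rng.random()                                  # x_k ~ U(0, 1)
        contexts, probs = outside if x < pi_s else inside
        negatives.append(contexts[rng.choice(len(contexts), p=probs)])
    return negatives

# Toy counts and vocabulary (made up for illustration).
counts = {"the": 50, "report": 10, "said": 5, "flea": 4, "fever": 8}
V = {"flea", "fever", "plague"}
outside, inside = make_samplers(counts, V)
rng = np.random.default_rng(0)
print(sample_negatives(outside, inside, k=5, pi_s=0.7, rng=rng))
```

Setting π_s = 1 reduces to sampling only non-disease contexts, which is the purely targeted variant that the text warns may overfit.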

Out-of-vocabulary Objective Regularization

The second category (D(¬d)) of (w, c) pairs consists of those pairs for which both w and c are not in V (w ∉ V ∧ c ∉ V). These pairs are uninformative to us in generating disease specific word embeddings since neither w nor c is a part of V. However, minimizing the dot products, i.e. optimizing the objective log σ(−w · c), for these non-disease word-context pairs will scatter them in the embedding space (dissimilar embeddings), and thus a word w ∉ V can end up with an embedding similar (or closer) to that of a word w ∈ V, which should be avoided in our scenario. Therefore, we need to maximize log σ(w · c) for these (w, c) pairs in order to achieve similar (or closer) embeddings. We adopt the basic objective function of SGNS for (w, c) ∈ D(¬d) as shown below in equation 5.5.

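$$l_{D_{(\neg d)}} = \sum_{(w,c) \in D_{(\neg d)}} \left( \log \sigma(\mathbf{w} \cdot \mathbf{c}) + k \cdot \mathbb{E}_{c_N \sim P_D}\left[\log \sigma(-\mathbf{w} \cdot \mathbf{c}_N)\right] \right) \tag{5.5}$$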


Vocabulary Driven Objective Minimization

Lastly, the third category (D(d)(¬d)) consists of (w, c) pairs where either w is in V or c is in V (w ∈ V ⊕ c ∈ V), but not both. Consider an arbitrary (w, c) pair belonging to D(d)(¬d). As per the objective (equation 5.3) of SGNS, two words are similar to each other if they share the same contexts or if they tend to appear in the contexts of each other (and preferably both). If w ∈ V and c ∉ V, then maximizing log σ(w · c) will have the tendency to generate similar embeddings for the disease word w ∈ V and non-disease words (∉ V) which share the same non-disease context c ∉ V. On the other hand, if c ∈ V and w ∉ V, then maximizing log σ(w · c) will drive the embedding of the non-disease word w ∉ V closer to the embeddings of disease words (∈ V) sharing the same disease context c ∈ V. Therefore, we posit that the dot products for this category of (w, c) pairs should be minimized, i.e. the objective log σ(−w · c) should be optimized in order to ensure dissimilar embeddings for these (w, c) pairs. However, minimizing the dot products of all such word-context pairs may lead to over-penalization and thus we introduce an objective selection parameter π_o which controls the probability of selecting log σ(−w · c) versus log σ(w · c). The objective for (w, c) ∈ D(d)(¬d) is shown below in equation 5.6.

$$l_{D_{(d)(\neg d)}} = \sum_{(w,c) \in D_{(d)(\neg d)}} \Big( P(z < \pi_o) \log \sigma(-\mathbf{w} \cdot \mathbf{c}) + P(z \geq \pi_o) \log \sigma(\mathbf{w} \cdot \mathbf{c}) \Big) \tag{5.6}$$

where z ∼ U(0, 1), U(0, 1) being the uniform distribution over the interval [0, 1]. If z < π_o, log σ(−w · c) gets optimized; otherwise Dis2Vec optimizes log σ(w · c). Therefore, optimizing the objective in equation 5.6 will have a tendency to generate disease specific embeddings with values of π_o ≥ 0.5, because the objective log σ(−w · c) will be selected for optimization with a higher probability than log σ(w · c).

Finally, the overall objective of Dis2Vec comprising all three categories of (w, c) pairs can be defined as below.

$$l_{Dis2Vec} = l_{D_{(d)}} + l_{D_{(\neg d)}} + l_{D_{(d)(\neg d)}} \tag{5.7}$$

Similar to SGNS, the objective in equation 5.7 is trained in an online fashion using stochastic gradient updates over the three categories of (w, c) pairs.
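The objective selection for mixed pairs (equation 5.6) amounts to a biased coin flip per pair. The sketch below shows both the sampled objective and its expectation over z, which equals π_o · log σ(−w · c) + (1 − π_o) · log σ(w · c) because P(z < π_o) = π_o for z uniform on [0, 1]. Function names and toy vectors are assumptions for illustration.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigma(x))."""
    return -np.logaddexp(0.0, -x)

def mixed_pair_objective(w_vec, c_vec, pi_o, rng):
    """Select log sigma(-w.c) with probability pi_o, otherwise log sigma(w.c)."""
    dot = float(np.dot(w_vec, c_vec))
    z = rng.random()                      # z ~ U(0, 1)
    return float(log_sigmoid(-dot) if z < pi_o else log_sigmoid(dot))

def expected_mixed_objective(w_vec, c_vec, pi_o):
    """Expectation over z of the selected objective."""
    dot = float(np.dot(w_vec, c_vec))
    return float(pi_o * log_sigmoid(-dot) + (1.0 - pi_o) * log_sigmoid(dot))

# Toy vectors with w.c = 2 (made up).
w_vec, c_vec = np.array([1.0, 0.0]), np.array([2.0, 0.0])
rng = np.random.default_rng(0)
always_push_apart = mixed_pair_objective(w_vec, c_vec, pi_o=1.0, rng=rng)
always_pull_close = mixed_pair_objective(w_vec, c_vec, pi_o=0.0, rng=rng)
balanced = expected_mixed_objective(w_vec, c_vec, pi_o=0.5)
print(always_push_apart, always_pull_close, balanced)
```

With π_o = 1 every mixed pair is pushed apart, and with π_o = 0 the model falls back to the plain skip-gram attraction term for these pairs.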

5.2.4 Parameters in Dis2Vec

Dis2Vec inherits all the parameters of SGNS, such as dimensionality (T) of the word embeddings, window size (L), number of negative samples (k) and context distribution smoothing (α). It also introduces two new parameters - the objective selection parameter (π_o) and the sampling parameter (π_s). The explored values for each of the aforementioned parameters are shown in Table 5.6.

Algorithm 6: Dis2Vec model
Input : Unstructured corpus D = {(w, c)}, V
Output: word embeddings w ∀ w ∈ W, context embeddings c ∀ c ∈ C

1 Categorize D into 3 types: D(d) = {(w, c) : w ∈ V ∧ c ∈ V}, D(¬d) = {(w, c) : w ∉ V ∧ c ∉ V}, D(d)(¬d) = {(w, c) : w ∈ V ⊕ c ∈ V}
2 for each (w, c) ∈ D do
3     if (w, c) ∈ D(d) then
4         train the (w, c) pair using the objective in equation 5.4
5     else if (w, c) ∈ D(¬d) then
6         train the (w, c) pair using the objective in equation 5.5
7     else
8         train the (w, c) pair using the objective in equation 5.6

5.3 Experimental Evaluation

We evaluated Dis2Vec against several state-of-the-art methods. In this section, we first provide a brief description of our experimental setup, including the disease news corpus, human annotated taxonomy and the domain information used as the vocabulary V for the process. We present our experimental findings in Section 5.4, where we compare our model against several baselines and also explore its applicability to emerging diseases.

5.3.1 Experimental Setup

Corpus

We collected a dataset corresponding to a corpus of public health-related news articles in English extracted from HealthMap [22], a prominent online aggregator of news articles from all over the world for disease outbreak monitoring and real-time surveillance of emerging public health threats. Each article contains the following information - textual content, disease tag, reported date and location information in the form of (lat, long) coordinates. The articles were reported during the time period 2010 to 2014 and correspond to locations from all over the world. The textual content of each article was pre-processed by sentence splitting, tokenization and lemmatization via BASIS Technologies’ Rosette Language Processing (RLP) tools [59]. After pre-processing, the corpus consisting of 124850 articles was found to contain 1607921 sentences, spanning 52679298 words. Words that appeared fewer than 5 times in the corpus were ignored, resulting in a vocabulary of 91178 words.


Human Annotated Taxonomy

Literature reviews were conducted for each of the 39 infectious diseases of interest in order to make classifications for transmission methods, transmission agents, clinical symptoms and exposures or risk factors. These 39 diseases were selected such that no bias is included in the process, i.e. they represent a diversity of infectious diseases ranging from emerging (H7N9, Ebola) to endemic (dengue, avian influenza) to rare (plague, hantavirus, yellow fever).

Methods of transmission were first classified into 8 subcategories - direct contact, droplet, airborne, zoonotic, vectorborne, waterborne, foodborne, and environmental. For many diseases, multiple subcategories of transmission methods could be assigned. Transmission agents were classified into 8 subcategories - wild animal, fomite, fly, mosquito, bushmeat, flea, tick and domestic animal. The category of clinical symptoms was broken down into 8 subcategories: general, gastrointestinal, respiratory, nervous system, cutaneous, circulatory, musculoskeletal, and urogenital. A full list of the symptoms within each subcategory can be found in Table 5.2. For disease exposures or risk factors, 7 subcategories were assigned based on those listed/most commonly reported in the literature. The subcategories include: healthcare facility, healthcare worker, schoolchild, mass gathering, travel, animal exposure, and weakened immune system. The animal exposure category was further broken down into farmer, veterinarian, market and slaughter. For some diseases, there were no risk factors listed, and for other diseases, multiple exposures were assigned.

Table 5.2: Symptom categories and corresponding words.

| Symptom Category | Words |
|---|---|
| General | Fever, chill, weight loss, fatigue, lethargy, headache |
| Gastrointestinal | Abdominal pain, nausea, diarrhea, vomiting |
| Respiratory | Cough, runny nose, sneezing, chest pain, sore throat, pneumonia, dyspnea |
| Nervous system | Mental status, paralysis, paresthesia, encephalitis, meningitis |
| Cutaneous | Rash, sore, pink eye |
| Circulatory | Hemorrhagic |
| Musculoskeletal | Joint pain, muscle pain, muscle ache |

Disease Vocabulary V

Disease vocabulary V is provided as prior knowledge to Dis2Vec in order to generate disease specific word embeddings as explained in section 5.2.3. V is represented by a flat list of disease-related terms consisting of disease names (influenza, h7n9, plague, ebola, etc.), all possible words related to transmission methods (vectorborne, foodborne, waterborne, etc.), all possible words related to transmission agents (flea, domestic animal, mosquito, etc.), all possible words related to clinical symptoms (fever, nausea, paralysis, cough, headache, etc.) and all possible words related to exposures or risk factors (healthcare facility, slaughter, farmer, etc.). We treat the multi-word expressions (e.g. healthcare facility, sore throat) in V as phrases, i.e. we learn a single embedding for these expressions, not a composite embedding of their individual terms. The total number of words in V is 103. In Figure 5.3, we show the distribution of word counts associated with different taxonomical categories in the disease vocabulary (V). As depicted in Figure 5.3, half of the words in V are terms related to clinical symptoms, followed by exposures or risk factors (22.4%), transmission methods (13.8%) and transmission agents (13.8%).

Figure 5.3: Distribution of word counts corresponding to each taxonomical category in the disease vocabulary (V). Words related to clinical symptoms constitute the majority of V with relatively much smaller percentages of terms related to exposures, transmission agents and transmission methods. [Pie chart: Symptoms 50.0%, Exposures 22.4%, Transmission Agents 13.8%, Transmission Method 13.8%.]

Baselines

We compared the following baseline models with Dis2Vec on the four disease characterization tasks.

• SGNS: Unsupervised skip-gram model with negative sampling [48] described in section 5.2.2.

• SGHS: Skip-gram model trained using the hierarchical softmax algorithm [48] instead of negative sampling.

• CBOW: Continuous bag-of-words model described in [47]. Unlike skip-gram models, the training objective of the CBOW model is to correctly predict the target word given its contexts (surrounding words). CBOW is denoted as a bag-of-words model as the order of words in the contexts does not have any impact on the model.

All models (both baselines and Dis2Vec) were trained on the HealthMap corpus using a T-dimensional word embedding via gensim’s word2vec software [60]. We explored a large space of parameters for each model. In Table 5.6, we provide the list of parameters, the explored values for each parameter and the applicable models corresponding to each parameter. Apart from the parameters listed in Table 5.6, we also applied the sub-sampling technique developed by Mikolov et al. [48] to each model in order to counter the imbalance between common words (such as is, of, the, a, etc.) and rare words. In the context of NLP, these common words are referred to as stop words. For more details on the sub-sampling technique, please see Mikolov et al. [48]. Our initial experiments (not reported) demonstrated that both the baselines and Dis2Vec showed improved results on the disease characterization tasks with sub-sampling versus without sub-sampling.
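For intuition about sub-sampling, the discard rule from Mikolov et al. removes each occurrence of a word with probability 1 − sqrt(t / f), where f is the word's relative frequency and t is a small threshold. The sketch below applies that formula to toy counts. The threshold t = 1e-5 and the counts are assumptions made for the example (the value used in this chapter is not stated), and library implementations such as gensim may use a slightly different variant of the rule.

```python
import math

def discard_probability(count, total, t=1e-5):
    """P(discard) = 1 - sqrt(t / f), clipped at 0, with f the relative frequency."""
    f = count / total
    return max(0.0, 1.0 - math.sqrt(t / f))

# Toy corpus statistics (made up): a stop word and a rare disease term.
total_tokens = 1_000_000
p_the = discard_probability(50_000, total_tokens)   # very frequent word
p_h7n9 = discard_probability(5, total_tokens)       # rare word
print(round(p_the, 4), p_h7n9)
```

Frequent stop words are dropped most of the time, while words whose relative frequency is below t are never dropped, which counters the imbalance described above.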

Accuracy Metric

We evaluate the automatic taxonomy generation methods such that, for a taxonomical characteristic of a disease, models that generate a set of terms (top words) similar to the human annotated ones are preferable. As such, we use cosine similarity in a min-max setting between the aforementioned sets for a particular characterization category as our accuracy metric. The overall accuracy of a model for a category can be found by averaging the accuracy values across all diseases of interest. This is a bounded metric (between 0 and 1) where higher values indicate better model performance. We can formalize the metric as follows. Let D be the disease and C be the taxonomical category under investigation. Furthermore, let C_1, C_2, · · · , C_N be all possible terms or words related to C and H_1, H_2, · · · , H_M be the human annotated words. Then the characterization accuracy corresponding to category C and disease D is given below in equation 5.8.

$$\mathrm{Accuracy}(C, D) = \frac{1}{M} \sum_{j=1}^{M} \frac{\mathrm{cosine}(\mathbf{D}, \mathbf{H}_j) - \min_i \mathrm{cosine}(\mathbf{D}, \mathbf{C}_i)}{\max_i \mathrm{cosine}(\mathbf{D}, \mathbf{C}_i) - \min_i \mathrm{cosine}(\mathbf{D}, \mathbf{C}_i)} \tag{5.8}$$

where $\mathbf{D}$, $\mathbf{H}_j$ and $\mathbf{C}_i$ represent the word embeddings for D, H_j and C_i. min_i cosine(D, C_i) and max_i cosine(D, C_i) represent the minimum and maximum cosine similarity values between D and the word embeddings of the terms related to C. Therefore, equation 5.8 indicates that if the human annotated word H_j is among the top words found by the word2vec model for the category C, then the min-max normalized ratio is high, leading to high accuracy, and vice versa.
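Equation 5.8 can be computed directly from a dictionary of embeddings. The following sketch assumes that every human annotated word is also one of the category terms and that the category terms do not all have the same similarity to the disease (otherwise the denominator is zero); the function names and toy vectors are illustrative assumptions.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def characterization_accuracy(emb, disease, category_terms, human_terms):
    """Min-max normalised cosine similarity, averaged over the human annotated words."""
    sims = {t: cosine(emb[disease], emb[t]) for t in category_terms}
    lo, hi = min(sims.values()), max(sims.values())
    return sum((sims[h] - lo) / (hi - lo) for h in human_terms) / len(human_terms)

# Toy embeddings (made up for illustration).
toy = {
    "plague": np.array([1.0, 0.2, 0.0]),
    "flea": np.array([0.9, 0.3, 0.1]),
    "mosquito": np.array([0.1, 1.0, 0.0]),
    "tick": np.array([0.0, 0.1, 1.0]),
}
agents = ["flea", "mosquito", "tick"]
best = characterization_accuracy(toy, "plague", agents, ["flea"])
worst = characterization_accuracy(toy, "plague", agents, ["tick"])
mixed = characterization_accuracy(toy, "plague", agents, ["flea", "tick"])
print(best, worst, mixed)
```

The metric reaches 1 when the human annotated word is the term most similar to the disease and 0 when it is the least similar, which is what bounds it between 0 and 1.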


5.4 Results

In this section we try to ascertain the efficacy and the applicability of Dis2Vec by investigating some of the pertinent questions related to the problem of disease characterization.

Sample-vs-objective: which is the better method to incorporate disease vocabulary information into Dis2Vec? As described in Section 5.2, there are primarily two different ways by which disease vocabulary information (V) guides the generation of embeddings for Dis2Vec: (a) by modulating the negative sampling parameter (π_s) for disease word-context pairs ((w, c) ∈ D(d)), referred to as Dis2Vec-sample, and (b) by modulating the objective selection parameter (π_o) for non-disease words or non-disease contexts ((w, c) ∈ D(d)(¬d)), referred to as Dis2Vec-objective. We investigate the importance of these two strategies by comparing the accuracies for each strategy individually (Dis2Vec-sample and Dis2Vec-objective) as well as combined together (Dis2Vec-combined) under the best parameter configuration for a particular task in Table 5.3. As can be seen, no single strategy is best across all tasks. Henceforth, we select the best performing strategy for a particular task as our Dis2Vec in Table 5.4.

Table 5.3: Comparative performance evaluation of Dis2Vec-combined against Dis2Vec-objective and Dis2Vec-sample across the 4 characterization tasks under the best parameter configuration for that model and task combination. The value in each cell represents the overall accuracy across the 39 diseases for that particular model and characterization task. We use equation 5.8 as the accuracy metric in this table.

| Characterization tasks | Dis2Vec-sample | Dis2Vec-objective | Dis2Vec-combined |
|---|---|---|---|
| Symptoms | 0.635 | 0.945 | 0.940 |
| Exposures | 0.590 | 0.540 | 0.597 |
| Transmission methods | 0.794 | 0.754 | 0.734 |
| Transmission agents | 0.505 | 0.506 | 0.516 |
| Overall average accuracy | 0.631 | 0.686 | 0.697 |

Table 5.4: Comparative performance evaluation of Dis2Vec against SGNS, SGHS and CBOW across the 4 characterization tasks under the best parameter configuration for that model and task combination. The value in each cell represents the overall accuracy across the 39 diseases for that particular model and characterization task. We use equation 5.8 as the accuracy metric in this table.

| Characterization tasks | CBOW | SGHS | SGNS | Dis2Vec |
|---|---|---|---|---|
| Symptoms | 0.498 | 0.560 | 0.620 | 0.945 |
| Exposures | 0.383 | 0.498 | 0.605 | 0.597 |
| Transmission methods | 0.481 | 0.765 | 0.792 | 0.794 |
| Transmission agents | 0.274 | 0.466 | 0.498 | 0.516 |
| Overall average accuracy | 0.409 | 0.572 | 0.629 | 0.713 |

Does disease vocabulary information improve disease characterization? Dis2Vec was designed to incorporate disease vocabulary information in order to guide the generation of disease specific word embeddings. To evaluate the importance of such vocabulary information in Dis2Vec, we compare the performance of Dis2Vec against the baseline word2vec models described in section 5.3.1 under the best parameter configuration for a particular task. These baseline models do not permit incorporation of any vocabulary information due to their unsupervised nature. Table 5.4 presents the accuracy of the models for the 4 disease characterization tasks - symptoms, exposures, transmission methods and transmission agents. As can be seen, Dis2Vec performs the best for 3 tasks and on average. It is also interesting to note that Dis2Vec achieves a higher performance gain over the baseline models for the symptoms category than for the other categories. The superior performance of Dis2Vec in the symptoms category can be attributed to two factors - (a) the higher percentage of symptom words in the disease vocabulary V (see Figure 5.3) and (b) the higher occurrence of symptom words in the HealthMap news corpus. News articles reporting a disease outbreak generally tend to focus more on the symptoms related to the disease than on the other categories. Given the functionality of Dis2Vec, higher occurrences of symptom terms in outbreak news reports will lead to the generation of effective word embeddings for characterizing disease symptoms.

Table 5.5: Comparative performance evaluation of Dis2Vec with full vocabulary against each of the 4 conditions of Dis2Vec with a truncated vocabulary across the 4 characterization tasks, where the truncated vocabulary consists of disease names and all possible terms related to a particular taxonomical category. We use equation 5.8 as the accuracy metric in this table.

| Characterization tasks | Dis2Vec (Exposures) | Dis2Vec (Transmission methods) | Dis2Vec (Transmission agents) | Dis2Vec (Symptoms) | Dis2Vec (full vocabulary) |
|---|---|---|---|---|---|
| Symptoms | 0.597 | 0.581 | 0.165 | 0.883 | 0.945 |
| Exposures | 0.554 | 0.557 | 0.315 | 0.416 | 0.597 |
| Transmission methods | 0.748 | 0.768 | 0.517 | 0.455 | 0.794 |
| Transmission agents | 0.446 | 0.459 | 0.467 | 0.457 | 0.516 |

Table 5.6: Comparison of different parameter settings for each model, measured by the number of characterization tasks in which the best configuration had that parameter setting. The explored values of each parameter are given in the column header. Non-applicable combinations are marked by ‘NA’.

| Method | T (300 : 600) | L (5 : 10 : 15) | k (1 : 5 : 15) | α (0.75 : 1) | π_s (0.3 : 0.5 : 0.7) | π_o (0.3 : 0.5 : 0.7) |
|---|---|---|---|---|---|---|
| Dis2Vec-combined | 2 : 2 | 3 : 1 : 0 | 2 : 1 : 1 | 1 : 3 | 4 : 0 : 0 | 0 : 2 : 2 |
| Dis2Vec-sample | 2 : 2 | 2 : 1 : 1 | 1 : 1 : 2 | 4 : 0 | 1 : 2 : 1 | NA |
| Dis2Vec-objective | 3 : 1 | 2 : 2 : 0 | 1 : 1 : 2 | 3 : 1 | NA | 2 : 0 : 2 |
| SGNS | 2 : 2 | 2 : 2 : 0 | 0 : 2 : 2 | 2 : 2 | NA | NA |
| SGHS | 3 : 1 | 1 : 0 : 3 | NA | NA | NA | NA |
| CBOW | 0 : 4 | 0 : 4 : 0 | NA | NA | NA | NA |

Table 5.7: Comparative performance evaluation of Dis2Vec against SGNS, SGHS and CBOW across the 4 characterization tasks for each class of diseases (emerging, endemic and rare) under the best parameter configuration for a particular {disease class, task, model} combination. We use equation 5.8 as the accuracy metric in this table.

| Class | Tasks | CBOW | SGHS | SGNS | Dis2Vec |
|---|---|---|---|---|---|
| Emerging | Symptoms | 0.589 | 0.671 | 0.722 | 0.977 |
| Emerging | Exposures | 0.356 | 0.495 | 0.516 | 0.679 |
| Emerging | Transmission methods | 0.407 | 0.885 | 0.898 | 0.945 |
| Emerging | Transmission agents | 0.528 | 0.587 | 0.795 | 0.975 |
| Endemic | Symptoms | 0.453 | 0.583 | 0.671 | 0.930 |
| Endemic | Exposures | 0.421 | 0.512 | 0.642 | 0.631 |
| Endemic | Transmission methods | 0.472 | 0.820 | 0.851 | 0.856 |
| Endemic | Transmission agents | 0.164 | 0.399 | 0.408 | 0.415 |
| Rare | Symptoms | 0.506 | 0.536 | 0.599 | 0.949 |
| Rare | Exposures | 0.377 | 0.525 | 0.616 | 0.670 |
| Rare | Transmission methods | 0.503 | 0.760 | 0.755 | 0.775 |
| Rare | Transmission agents | 0.320 | 0.522 | 0.512 | 0.515 |

What are beneficial parameter configurations for characterizing diseases? To identify which parameter settings are beneficial for characterizing diseases, we looked at the best parameter configuration of all the 6 models on each task. We then counted the number of times each parameter setting was chosen in these configurations (see Table 5.6). We compared standard settings of each parameter as explored in previous research [42]. For the new parameters π_s and π_o introduced by Dis2Vec, we chose the values 0.3, 0.5 and 0.7 in order to analyze the impact of these parameters with values < 0.5 and ≥ 0.5. For Dis2Vec-objective and Dis2Vec-combined, a trend emerges regarding the parameter π_o: these two models consistently benefit from values of π_o ≥ 0.5, validating our claim in section 5.2.3 that when π_o ≥ 0.5, disease words and non-disease words get scattered from each other in the vector space, thus tending to generate disease specific embeddings. However, for π_s we observe mixed trends. As expected, Dis2Vec-sample benefits from higher values of the sampling parameter (π_s ≥ 0.5). But Dis2Vec-combined seems to prefer lower values of π_s (< 0.5) and higher values of π_o (≥ 0.5) for the disease characterization tasks. For the smoothing parameter (α), all the applicable models prefer the smoothed unigram distribution (α = 0.75) for negative sampling except Dis2Vec-combined, which is in favor of the unsmoothed distribution (α = 1.0) for characterizing diseases. For the number of negative samples k, all the applicable models seem to benefit from k > 1 except Dis2Vec-combined, which seems to prefer k = 1. For the window size (L), all the models prefer smaller-sized context windows (either 5 or 10) except SGHS, which prefers larger-sized windows (L > 10) for characterizing diseases. Finally, regarding the dimensionality (T) of the embeddings, Dis2Vec-combined, Dis2Vec-sample and SGNS are in equal favor of both 300 and 600 dimensions. Dis2Vec-objective and SGHS prefer 300 dimensions and CBOW is in favor of 600 dimensions for characterizing diseases.
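The explored parameter values in Table 5.6 can be enumerated programmatically. The sketch below assumes, for illustration only, that every combination of the applicable parameters was searched (a full grid); the chapter does not state the search procedure, and the dictionary and function names are hypothetical.

```python
from itertools import product

# Explored values per parameter, as listed in the header of Table 5.6.
GRID = {
    "T": [300, 600],
    "L": [5, 10, 15],
    "k": [1, 5, 15],
    "alpha": [0.75, 1.0],
    "pi_s": [0.3, 0.5, 0.7],
    "pi_o": [0.3, 0.5, 0.7],
}

# Parameters applicable to each model (entries not marked NA in Table 5.6).
APPLICABLE = {
    "Dis2Vec-combined": ["T", "L", "k", "alpha", "pi_s", "pi_o"],
    "Dis2Vec-sample": ["T", "L", "k", "alpha", "pi_s"],
    "Dis2Vec-objective": ["T", "L", "k", "alpha", "pi_o"],
    "SGNS": ["T", "L", "k", "alpha"],
    "SGHS": ["T", "L"],
    "CBOW": ["T", "L"],
}

def configurations(model):
    """All combinations of the explored values for the parameters a model uses."""
    names = APPLICABLE[model]
    return [dict(zip(names, values)) for values in product(*(GRID[n] for n in names))]

for model in APPLICABLE:
    print(model, len(configurations(model)))
```

Under the full-grid assumption, Dis2Vec-combined has the largest search space because it is the only model that uses both π_s and π_o.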

Importance of taxonomical categories - how should we construct the disease vocabulary? We follow up our previous analysis by investigating the importance of words related to each taxonomical category in constructing the disease vocabulary towards final characterization accuracy. To evaluate a particular category, we used a truncated disease vocabulary consisting of disease names and the words in the corresponding category to drive the discovery of word embeddings in Dis2Vec under the best parameter configuration for that category. We compared the accuracy of each of these conditions (Dis2Vec (exposures), Dis2Vec (transmission methods), Dis2Vec (transmission agents), Dis2Vec (symptoms)) against Dis2Vec (full vocabulary) across the 4 characterization tasks. Table 5.5 presents our results for this analysis and provides multiple insights as follows.

• Constructing the vocabulary with words related to all the categories leads to better characterization across all the tasks.

• As expected, Dis2Vec (symptoms) is the second best performing model for the symptoms category but its performance is degraded for other tasks. The same goes for Dis2Vec (transmission methods) and Dis2Vec (transmission agents).

• Therefore, it indicates that in order to achieve reasonable characterization accuracy for a category, we need to supply at least the words related to that category along with the disease names in constructing the vocabulary.

Can Dis2Vec be applied to characterize emerging, endemic and rare diseases? We classified the 39 diseases of interest into 3 classes as follows. For classifying each disease, we plotted the time series of the counts of HealthMap articles with disease tag equal to the corresponding disease.

• Endemic: We considered a disease as endemic if the counts of articles were consistently high for all years with repeating shapes, e.g. rabies, avian influenza, West Nile virus.

• Emerging: We considered a disease as emerging if the counts of articles were historically low, but have peaked in recent years, e.g. Ebola, H7N9, MERS.

• Rare: We considered a disease as rare if the counts were consistently low for all years with or without sudden spikes, e.g. plague, Chagas, Japanese encephalitis. We also considered a disease as rare if the counts of articles were high in 2010/2011, but have since fallen and remained consistently low, e.g. tuberculosis.

Following classification, the distribution of emerging, endemic and rare diseases is 4 : 12 : 23 respectively. In Table 5.7, we compared the accuracy of Dis2Vec against the baseline word2vec models for each class of diseases across the 4 characterization tasks under the best parameter configuration for a particular {disease class, task, model} combination. It can be seen that Dis2Vec is the best performing model for the majority of the {disease class, task} combinations, the exceptions being {endemic, exposures} and {rare, transmission agents}. It is interesting to note that for the symptoms category, Dis2Vec performs better than the baseline models across all the disease classes. Irrespective of disease class, news reports generally mention the symptoms of the disease while reporting an outbreak. As the characteristics of the emerging diseases are relatively unknown w.r.t. endemic and rare diseases, news media reports also tend to focus on other categories (exposures, transmission methods, transmission agents) apart from the symptoms to create awareness among the general public. Therefore, Dis2Vec and the baselines perform better overall for the emerging diseases in comparison to endemic and rare diseases. However, Dis2Vec outperforms the baselines for characterizing symptoms and exposures of emerging diseases. For endemic and rare diseases, Dis2Vec achieves higher accuracy than the baseline models w.r.t. the symptoms category. For other categories, Dis2Vec performs better overall, although the performance gain is not high in comparison to the symptoms. It is to be noted that Dis2Vec achieves reasonable accuracy for characterizing rare diseases even though the number of articles related to these diseases is very small in the HealthMap corpus, leading to under-represented categories. In Figure 5.4, we show the top words selected for each category of an emerging disease (H7N9), an endemic disease (avian influenza) and a rare disease (plague) across all the models. The human annotated words corresponding to each category of these diseases can be found in Table 5.1. We selected these 3 diseases due to their public health significance and the fact that these diseases have complete coverage across all the taxonomical categories (see Table 5.1). It is interesting to note that for H7N9, the top words found by Dis2Vec for the symptoms category contain all the human annotated words fever, cough and pneumonia, while the top words found by SGNS only contain the word fever. For exposures (H7N9), Dis2Vec is able to capture three human annotated words animal exposure, farmer, slaughter. However, SGNS is only able to capture the word animal. For the symptoms category of the rare disease plague, Dis2Vec is able to detect three human annotated words sore, fever and headache, with SGNS only being able to detect the word fever. Moreover, Dis2Vec is able to characterize the transmission method of plague as vectorborne, with SGNS failing to do so.

5.5 Discussions

Classical word2vec methods such as SGNS and SGHS have been applied to solve a variety of linguistic tasks with considerable accuracy. However, such methods fail to generate satisfactory embeddings for highly specific domains such as healthcare, where uncovering the relationships with respect to domain specific words is of greater importance than the non-domain ones. These algorithms are by design unsupervised and do not permit the inclusion of domain information to find interesting embeddings. In this chapter, we have proposed Dis2Vec, a disease specific word2vec framework that, given an unstructured news corpus and domain knowledge in terms of important words, can find interesting disease characterizations. We demonstrated the strength of our model by comparing it against three classical word2vec methods on four disease characterization tasks. Dis2Vec exhibits the best overall accuracy for 3 tasks across all the diseases and, in general, its relative performance improvement is found to be empirically dependent on the amount of supplied domain knowledge. Consequently, Dis2Vec works especially well for characteristics with more domain knowledge (symptoms) and is found to be a promising tool to analyze different classes of diseases, viz. emerging, endemic and rare.

[Figure 5.4 image: twelve panels arranged in four rows, (a) to (c) Symptoms, (d) to (f) Exposures, (g) to (i) Transmission Agents and (j) to (l) Transmission Method. In each row the panels correspond to H7N9, avian influenza and plague, and each panel shows the top words found by Dis2Vec, SGNS, SGHS and CBOW.]

Figure 5.4: Case study for emerging, endemic and rare diseases: Disease characterization accuracy plot for Dis2Vec (first quadrant, red), SGNS (second quadrant, blue), SGHS (third quadrant, green), and CBOW (fourth quadrant, orange) w.r.t. H7N9 (left, emerging), avian influenza (middle, endemic) and plague (right, rare). The shaded area in a quadrant indicates the cosine similarity (scaled between 0 and 1) of the top words found for the category of interest using the corresponding model, as evaluated against the human annotated words (see Table 5.1). Each top word found by a model is shown in the corresponding quadrant at a radius equal to its average similarity with the human annotated words for the disease. Dis2Vec shows the best overall performance with noticeable improvements for symptoms w.r.t. all diseases.
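As a rough illustration of the radius used in Figure 5.4, the sketch below averages the similarity between one top word and the human annotated words after mapping cosine values from [-1, 1] to [0, 1]. The linear rescaling (1 + cos) / 2 is an assumption made for this example; the caption states only that the similarity is scaled between 0 and 1. The function names and the toy vectors are likewise hypothetical.

```python
import numpy as np


def scaled_cosine(u, v):
    """Cosine similarity mapped linearly from [-1, 1] to [0, 1] (assumed scaling)."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float((1.0 + cos) / 2.0)


def average_similarity(word_vec, annotated_vecs):
    """Mean scaled similarity of one word against all annotated word vectors."""
    return float(np.mean([scaled_cosine(word_vec, a) for a in annotated_vecs]))


# Toy 2-dimensional vectors (hypothetical): identical, orthogonal and opposite
# to the top word, giving scaled similarities of 1.0, 0.5 and 0.0.
word = np.array([1.0, 0.0])
annotated = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]

radius = average_similarity(word, annotated)
print(radius)
```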

Chapter 6

Conclusions and Future Work

We have presented multiple text analytics methods for global infectious disease surveillance using online news media. We identified three major thrusts for this problem, viz. (i) assessing associations between news trends and infectious disease outbreaks using temporal topic models, (ii) automated construction of line lists for emerging diseases and their subsequent use for inferring epidemiological conclusions, and (iii) automated characterization of infectious diseases from unstructured online news reports. We presented our approaches for each of these thrusts in this dissertation. Future research directions are discussed below.

• Problem 1: We have presented approaches based on temporal topic models aimed at assessing the associations between news trends and infectious disease outbreaks. Our findings related to this problem have been communicated in multiple venues (NIPS workshop on Topic Models [27], SIAM Data Mining [61], Statistical Analysis and Data Mining Journal [62], Nature Scientific Reports [26]). Future work will focus on developing deep learning approaches (RNNs, LSTMs) for disease forecasting using multiple data sources (online news, weather attributes, twitter and Wikipedia). This can help mitigate inconsistent news media coverage during disease outbreaks by integrating signals from other open sources.

• Problem 2: Next, we have identified the problem of automated construction of line lists, which can lead to early understanding of an unknown emerging outbreak. We have presented our approach GELL for this purpose. Findings related to this problem have been communicated in the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2017) [25]. Future work will focus on extending GELL to extract line lists from more unstructured news sources, such as HealthMap, using advanced NLP techniques (CRFs, LSTMs).

• Problem 3: Finally, we have presented our disease-specific word2vec approach (Dis2Vec) aimed at automated characterization of diverse classes of diseases (endemic, emerging and rare) from HealthMap news sources. We communicated our findings related to this problem in the ACM Conference on Information and Knowledge Management (CIKM) 2016 [24]. Future efforts will be aimed at incorporating location information in Dis2Vec so that we can study the variations in characterization of diseases across different geographical regions.

Bibliography

[1] L. Akil, H. A. Ahmad, and R. S. Reddy. Effects of climate change on salmonella infections. Foodborne Pathogens and Disease, 11(12):974–980, 2014.

[2] M. Ballesteros, A. Díaz, V. Francisco, P. Gervas, J. C. De Albornoz, and L. Plaza. Ucm-2: a rule-based approach to infer the scope of negation via dependency parsing. In Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 288–293. Association for Computational Linguistics, 2012.

[3] M. Baroni, G. Dinu, and G. Kruszewski. Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the ACL, pages 238–247, 2014.

[4] M. Baroni and A. Lenci. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721, 2010.

[5] Y. Bengio, H. Schwenk, J.-S. Senecal, F. Morin, and J.-L. Gauvain. Neural probabilistic language models. In Innovations in Machine Learning, pages 137–186. Springer, 2006.

[6] D. M. Blei and J. D. Lafferty. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, pages 113–120. ACM, 2006.

[7] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.

[8] G. E. Box, G. M. Jenkins, and G. C. Reinsel. Time series analysis: forecasting and control, volume 734. John Wiley & Sons, 2011.

[9] S. E. Brossette, A. P. Sprague, J. M. Hardin, K. B. Waites, W. T. Jones, and S. A. Moser. Association rules and data mining in hospital infection control and public health surveillance. Journal of the American medical informatics association, 5(4):373–381, 1998.

[10] J. S. Brownstein, C. C. Freifeld, B. Y. Reis, and K. D. Mandl. Surveillance Sans Frontieres: Internet-based emerging infectious disease intelligence and the Healthmap project. PLoS Medicine, 5(7):e151, 2008.

[11] R. C. Bunescu and R. J. Mooney. A shortest path dependency kernel for relation extraction. In Proceedings of the conference on human language technology and empirical methods in natural language processing, pages 724–731. Association for Computational Linguistics, 2005.

[12] P. Chakraborty, P. Khadivi, B. Lewis, A. Mahendiran, J. Chen, P. Butler, E. O. Nsoesie, S. R. Mekaru, J. S. Brownstein, M. Marathe, et al. Forecasting a moving target: Ensemble models for ILI case count predictions. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 262–270. SIAM, 2014.

[13] E. H. Chan, T. F. Brewer, L. C. Madoff, M. P. Pollack, A. L. Sonricker, M. Keller, C. C. Freifeld, M. Blench, A. Mawudeku, and J. S. Brownstein. Global capacity for emerging infectious disease detection. Proceedings of the National Academy of Sciences, 107(50):21701–21706, 2010.

[14] J. D. Cherry. Epidemic pertussis in 2012 — the resurgence of a vaccine-preventable disease. The New England Journal of Medicine, 367(9):785–787, 2012.

[15] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine learning, pages 160–167. ACM, 2008.

[16] R. Collobert, J. Weston, et al. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.

[17] C. D. Corley, D. J. Cook, A. R. Mikler, and K. P. Singh. Text and structural data mining of influenza mentions in web and social media. International Journal of Environmental Research and Public Health, 7(2):596–615, 2010.

[18] A. Culotta. Towards detecting influenza epidemics by analyzing twitter messages. SOMA ’10, pages 115–122, 2010.

[19] F. C. Curriero, J. A. Patz, J. B. Rose, and S. Lele. The association between extreme precipitation and waterborne disease outbreaks in the united states, 1948-1994. American Journal of Public Health, 91(8):1194–1199, 2001.

[20] A. Diaz, M. Ballesteros, J. Carrillo-de Albornoz, and L. Plaza. Ucm at trec-2012: Does negation influence the retrieval of medical reports? Technical report, DTIC Document, 2012.

[21] A. Doyle, G. Katz, K. Summers, C. Ackermann, I. Zavorin, Z. Lim, S. Muthiah, P. Butler, N. Self, L. Zhao, et al. Forecasting significant societal events using the embers streaming predictive analytics system. Big Data, 2(4):185–195, 2014.

[22] C. C. Freifeld, K. D. Mandl, B. Y. Reis, and J. S. Brownstein. Healthmap: global infectious disease monitoring through automated classification and visualization of internet media reports. Journal of the American Medical Informatics Association, 15(2):150–157, 2008.

[23] H.-N. Gao, H.-Z. Lu, B. Cao, B. Du, H. Shang, J.-H. Gan, S.-H. Lu, Y.-D. Yang, Q. Fang, Y.-Z. Shen, et al. Clinical findings in 111 cases of influenza A (H7N9) virus infection. The New England Journal of Medicine, 368(24):2277–2285, 2013.

[24] S. Ghosh, P. Chakraborty, E. Cohn, J. S. Brownstein, and N. Ramakrishnan. Characterizing diseases from unstructured text: A vocabulary driven word2vec approach. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM ’16, pages 1129–1138, New York, NY, USA, 2016. ACM.

[25] S. Ghosh, P. Chakraborty, B. L. Lewis, M. S. Majumder, E. Cohn, J. S. Brownstein, M. V. Marathe, and N. Ramakrishnan. Gell: Automatic extraction of epidemiological line lists from open sources. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, pages 1477–1485, New York, NY, USA, 2017. ACM.

[26] S. Ghosh, P. Chakraborty, E. O. Nsoesie, E. Cohn, S. R. Mekaru, J. S. Brownstein, and N. Ramakrishnan. Temporal topic modeling to assess associations between news trends and infectious disease outbreaks. Scientific Reports, 7(40841), 2017.

[27] S. Ghosh, T. Rekatsinas, S. R. Mekaru, E. O. Nsoesie, J. S. Brownstein, L. Getoor, and N. Ramakrishnan. Forecasting rare disease outbreaks with spatio-temporal topic models. In NIPS 2013 workshop on Topic Models. Citeseer, 2013.

[28] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski, and L. Brilliant. Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012–1014, 2009.

[29] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.

[30] S. Hales, N. De Wet, J. Maindonald, and A. Woodward. Potential effect of population and climate changes on global distribution of dengue fever: an empirical model. The Lancet, 360(9336):830–834, 2002.

[31] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. New York: Springer, 2009.

[32] M. Honnibal and M. Johnson. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378, Lisbon, Portugal, September 2015. Association for Computational Linguistics.

[33] J. Jagarlamudi, H. Daume III, and R. Udupa. Incorporating lexical priors into topic models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 204–213. Association for Computational Linguistics, 2012.

[34] J. Kanis and L. Skorkovska. Comparison of different lemmatization approaches through the means of information retrieval performance. In Text, Speech and Dialogue, pages 93–100. Springer, 2010.

[35] G. J. Kerns. Introduction to probability and statistics using r. Lulu.com, 2010.

[36] E. H. Lau, J. Zheng, T. K. Tsang, Q. Liao, B. Lewis, J. S. Brownstein, S. Sanders, J. Y. Wong, S. R. Mekaru, C. Rivers, et al. Accuracy of epidemiological inferences based on publicly available information: retrospective comparative analysis of line lists of human cases infected with influenza a (h7n9) in china. BMC medicine, 12(1):88, 2014.

[37] D. Lazer, R. Kennedy, G. King, and A. Vespignani. The parable of google flu: traps in big data analysis. Science, 343(6176):1203–1205, 2014.

[38] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, volume 14, pages 1188–1196, 2014.

[39] O. Levy and Y. Goldberg. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the ACL, pages 302–308, 2014.

[40] O. Levy and Y. Goldberg. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on CoNLL, pages 171–180, 2014.

[41] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In 27th Annual Conference on Neural Information Processing Systems, pages 2177–2185, 2014.

[42] O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. TACL, 3:211–225, 2015.

[43] J. P. Linge, R. Steinberger, T. Weber, R. Yangarber, E. van der Goot, D. Al Khudhairy, and N. Stilianakis. Internet surveillance systems for early alerting of health threats. Eurosurveillance, 14(AVRJUIN):200–201, 2009.

[44] M. S. Majumder, C. Rivers, E. Lofgren, and D. Fisman. Estimation of mers-coronavirus reproductive number and case fatality rate for the spring 2014 saudi arabia outbreak: insights from publicly available data. PLOS Currents Outbreaks, 2014.

[45] Y. Matsubara, Y. Sakurai, C. Faloutsos, T. Iwata, and M. Yoshikawa. Fast mining and forecasting of complex time-stamped events. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 271–279. ACM, 2012.

[46] J. D. Mcauliffe and D. M. Blei. Supervised topic models. In Advances in Neural Information Processing Systems, pages 121–128, 2008.

[47] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.

[48] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In 26th Annual Conference on Neural Information Processing Systems, pages 3111–3119, 2013.

[49] T. Mikolov, W. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In Human Language Technologies: Conference of the NAACL, pages 746–751, 2013.

[50] A. Mnih and G. E. Hinton. A scalable hierarchical distributed language model. In Advances in neural information processing systems, pages 1081–1088, 2009.

[51] Y. Ou and J. Patrick. Automatic negation detection in narrative pathology reports. Artificial intelligence in medicine, 64(1):41–50, 2015.

[52] A. Pak and P. Paroubek. Twitter based system: Using twitter for disambiguating sentiment ambiguous adjectives. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 436–439. Association for Computational Linguistics, 2010.

[53] J. Parker, Y. Wei, A. Yates, O. Frieder, and N. Goharian. A framework for detecting public health trends with twitter. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 556–563. ACM, 2013.

[54] M. J. Paul and M. Dredze. You are what you tweet: Analyzing twitter for public health. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media, pages 265–272, 2011.

[55] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[56] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, 2014.

[57] M. E. Peters and D. Lecocq. Content extraction using diverse feature sets. In Proceedings of the 22nd International Conference on World Wide Web Companion, pages 89–90. International World Wide Web Conferences Steering Committee, 2013.

[58] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled lda: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, pages 248–256. Association for Computational Linguistics, 2009.

[59] N. Ramakrishnan, P. Butler, S. Muthiah, N. Self, R. Khandpur, P. Saraf, W. Wang, J. Cadena, A. Vullikanti, G. Korkmaz, et al. ‘Beating the news’ with EMBERS: Forecasting civil unrest using open source indicators. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1799–1808. ACM, 2014.

[60] R. Rehurek and P. Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.

[61] T. Rekatsinas, S. Ghosh, S. R. Mekaru, E. O. Nsoesie, J. S. Brownstein, L. Getoor, and N. Ramakrishnan. SourceSeer: Forecasting rare disease outbreaks using multiple data sources. In Proceedings of the 2015 SIAM International Conference on Data Mining, pages 379–387. SIAM, 2015.

[62] T. Rekatsinas, S. Ghosh, S. R. Mekaru, E. O. Nsoesie, J. S. Brownstein, L. Getoor, and N. Ramakrishnan. Forecasting rare disease outbreaks from open source indicators. Statistical Analysis and Data Mining: The ASA Data Science Journal, 10(2):136–150, 2017.

[63] C. Rivers. Modeling emerging infectious diseases for public health decision support. 2015.

[64] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487–494. AUAI Press, 2004.

[65] T. N. Rubin, A. Chambers, P. Smyth, and M. Steyvers. Statistical topic models for multi-label document classification. Machine learning, 88(1-2):157–208, 2012.

[66] G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. C. Kipper-Schuler, and C. G. Chute. Mayo clinical text analysis and knowledge extraction system (ctakes): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5):507–513, 2010.

[67] S.-Q. Shen, H.-X. Wei, Y.-H. Fu, H. Zhang, Q.-Y. Mo, X.-J. Wang, S.-Q. Deng, W. Zhao, Y. Liu, X.-S. Feng, et al. Multiple sources of infection and potential endemic characteristics of the large outbreak of dengue in guangdong in 2014. Scientific Reports, 5, 2015.

[68] V. Singh and B. Saini. An effective pre-processing algorithm for information retrieval systems. International Journal of Database Management Systems, 6(6):13, 2014.

[69] S. Sohn, S. Wu, and C. G. Chute. Dependency parser-based negation detection in clinical narratives. AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science, 2012:1–8, 2012.

[70] R. Sugumaran and J. Voss. Real-time spatio-temporal analysis of west nile virus using twitter data. In Proceedings of the 3rd International Conference on Computing for Geospatial Research and Applications, page 39. ACM, 2012.

[71] R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

[72] J. Turian, L. Ratinov, and Y. Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the ACL, pages 384–394, 2010.

[73] P. D. Turney and P. Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188, 2010.

[74] W. G. Van Panhuis, J. Grefenstette, S. Y. Jung, N. S. Chok, A. Cross, H. Eng, B. Y. Lee, V. Zadorozhny, S. Brown, D. Cummings, et al. Contagious diseases in the united states from 1888 to the present. The New England journal of medicine, 369(22):2152, 2013.

[75] H. M. Wallach, D. M. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. In Advances in Neural Information Processing Systems, pages 1973–1981, 2009.

[76] X. Wan. Using bilingual knowledge and ensemble techniques for unsupervised chinese sentiment analysis. In Proceedings of the conference on empirical methods in natural language processing, pages 553–561. Association for Computational Linguistics, 2008.

[77] X. Wan. Co-training for cross-lingual sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1-Volume 1, pages 235–243. Association for Computational Linguistics, 2009.

[78] X. Wang and E. Grimson. Spatial latent dirichlet allocation. In Advances in neural information processing systems, pages 1577–1584, 2008.

[79] X. Wang and A. McCallum. Topics over time: a non-markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 424–433. ACM, 2006.

[80] Z. Wang, P. Chakraborty, S. R. Mekaru, J. S. Brownstein, J. Ye, and N. Ramakrishnan. Dynamic poisson autoregression for influenza-like-illness case count prediction. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1285–1294. ACM, 2015.

[81] J. J. Webster and C. Kit. Tokenization as the initial phase in NLP. In Proceedings of the 14th Conference on Computational Linguistics, volume 4, pages 1106–1110. Association for Computational Linguistics, 1992.

[82] WHO. Disease outbreak news (dons). http://www.who.int/csr/don/en/.

[83] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Rule-based anomaly pattern detection for detecting disease outbreaks. In AAAI/IAAI, pages 217–223, 2002.

[84] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In ICML, pages 808–815, 2003.

[85] F. Wu and D. S. Weld. Open information extraction using wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 118–127. Association for Computational Linguistics, 2010.

[86] S. Yang, Y. Chen, D. Cui, H. Yao, J. Lou, Z. Huo, G. Xie, F. Yu, S. Zheng, Y. Yang, et al. Avian-origin influenza A (H7N9) infection in influenza A (H7N9)–affected areas of china: a serological study. Journal of Infectious Diseases, 209(2):265–269, 2014.

[87] Q. Yuan, E. O. Nsoesie, B. Lv, G. Peng, R. Chunara, and J. S. Brownstein. Monitoring influenza epidemics in china with search query from baidu. PLoS One, 8(5):e64323, 2013.

[88] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
