COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation

Qingyun Wang¹, Manling Li¹, Xuan Wang¹, Nikolaus Parulian¹, Guangxing Han², Jiawei Ma², Jingxuan Tu³, Ying Lin¹, Haoran Zhang¹, Weili Liu¹, Aabhas Chauhan¹, Yingjun Guan¹, Bangzheng Li¹, Ruisong Li¹, Xiangchen Song¹, Heng Ji¹, Jiawei Han¹, Shih-Fu Chang², James Pustejovsky³, David Liem⁴, Ahmed Elsayed⁵, Martha Palmer⁵, Jasmine Rah⁶, Cynthia Schneider⁷, Boyan Onyshkevych⁷

¹UIUC ²Columbia University ³Brandeis University ⁴UCLA ⁵CU ⁶UW ⁷DARPA
[email protected], [email protected], [email protected]

Abstract

To combat COVID-19, clinicians and scientists all need to digest the vast amount of relevant biomedical knowledge in literature to understand the disease mechanism and the related biological functions. We have developed a novel and comprehensive knowledge discovery framework, COVID-KG, which leverages novel semantic representation and external ontologies to represent text and images in the input literature data, and then applies various extraction components to extract fine-grained multimedia knowledge elements (entities, relations and events). We then exploit the constructed multimedia KGs for question answering and report generation, using drug repurposing as a case study. Our framework also provides detailed contextual sentences, subfigures and knowledge subgraphs as evidence. All of the data, KGs, resources and shared services are publicly available¹.

1 Introduction

The COVID-19 pandemic has rapidly changed our work styles and behaviors in previously unimaginable ways, including scientific research. Scientists across the world have dropped everything to fight COVID-19. They are racing while collaborating at unprecedented levels. Practical progress in combating COVID-19 depends heavily on effective search, analysis, discovery, assessment and extension of these research results. However, clinicians and scientists face two unique barriers to digesting these research papers.

The first challenge is quantity. The bottleneck in knowledge access is exacerbated during a pandemic, when increased investment in relevant research leads to even faster growth of literature than usual. For example, as of April 28, 2020, PubMed² listed 19,443 papers related to coronavirus; as of June 13, 2020, there were 140K+ related papers, with nearly 2.7K new papers per day (see Figure 1). This knowledge bottleneck causes significant delay in the development of vaccines and drugs for COVID-19. More intelligent knowledge discovery technologies need to be developed to enable researchers to access and digest relevant knowledge from literature more quickly and accurately.

Figure 1: The Growing Number of COVID-19 Papers at PubMed

¹ http://blender.cs.illinois.edu/covid19/
² https://www.ncbi.nlm.nih.gov/pubmed/

The second challenge is quality, owing to the rise of rapid, extensive publication of preprint manuscripts without pre-publication peer review. Many research results about coronavirus from different research labs and sources are redundant, complementary or even conflicting with each other, while some false information has been promoted at both formal publication venues and social media platforms such as Twitter. As a result, some of the policy responses to the virus, and public perception of it, have been based on misleading, and at times erroneous, claims. The isolation of these knowledge resources makes it hard, if not impossible, for researchers to connect dots that exist in separate resources to obtain insights. Thus, it is challenging to draw useful conclusions from previous research effectively.

Figure 2: COVID-KG Overview: From Data to Semantics to Knowledge

Let’s consider drug repurposing as a case study. Besides the lengthy clinical trials and biomedical experiments, another major cause of the long process is the complexity of the problem involved and the difficulty of drug discovery in general. Current clinical trials for drug repurposing rely mainly on symptoms, considering drugs that can treat diseases with similar symptoms. However, there are too many drug candidates and too much misinformation published from multiple sources. Clinicians and scientists thus urgently need help to obtain a reliable ranked list of drugs with detailed evidence. In addition to a ranked list of drugs, clinicians and scientists also aim to gain new insights into the underlying molecular and cellular mechanisms of COVID-19, and into which pre-existing conditions may affect the mortality and severity of this disease.

To tackle these two challenges, we propose a new framework, COVID-KG, to accelerate scientific discovery and build a bridge between clinicians and biology scientists, as illustrated in Figure 2. COVID-KG starts by reading existing papers to build multimedia knowledge graphs (KGs), in which nodes are entities/concepts and edges represent relations involving these entities, extracted from both text and images. Given the KGs enriched with path ranking and evidence mining, COVID-KG answers natural language questions effectively. Using drug repurposing as a case study, for 11 typical questions that human experts aim to explore, we integrate our techniques to generate a comprehensive report for each candidate drug. Preliminary assessment by expert clinicians and medical school students shows that our generated reports are informative and sound.

2 Multimedia Knowledge Graph Construction

2.1 Coarse-grained Text Knowledge Extraction

We apply our state-of-the-art biomedical Information Extraction (IE) system (Wang et al., 2019a; Li et al., 2019; Li and Ji, 2019; Zheng et al., 2014; Huang et al., 2017) to build knowledge graphs (KGs), in which nodes are entities/concepts and edges are the relations and events involving these entities. This system consists of three components: (1) Coarse-grained entity extraction and entity linking for four entity types: Gene nodes, Disease nodes, Chemical nodes, and Organism. We follow the entity ontology defined in the Comparative Toxicogenomics Database (CTD) (Davis et al., 2016), and obtain a Medical Subject Headings (MeSH) Unique ID for each mention. (2) Relation extraction: based on the MeSH Unique IDs, we further link all entities to the CTD and extract 133 subtypes of relations such as Marker/Mechanism, Therapeutic, and Increase Expression, and validate them based on document-level co-occurrence through distant supervision. These relation types include Gene–Chemical–Interaction Relationships, Chemical–Disease Associations, Gene–Disease Associations, Chemical–GO Enrichment Associations and Chemical–Pathway Enrichment Associations. (3) Event extraction: we extract 13 Event types and the roles of entities involved in these events, including Gene expression, Transcription, Localization, Protein catabolism, Binding, Protein modification, Phosphorylation, Ubiquitination, Acetylation, Deacetylation, Regulation, Positive regulation, and Negative regulation. Figure 3 shows an example of the constructed knowledge graph.

Figure 3: Constructed KG Connecting Losartan (candidate drug in COVID-19) and cathepsin L pseudogene 2 (gene related to coronavirus).
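The validation in step (2) can be illustrated with a minimal sketch, assuming entity mentions have already been linked to IDs: a curated triple is kept only when both of its entities co-occur in at least one document. The function name, the data layout and the toy IDs below are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def validate_relations(doc_entities, curated_triples, min_docs=1):
    """Keep curated (head, relation, tail) triples whose two entities
    co-occur in at least `min_docs` documents."""
    # Inverted index: entity ID -> set of document IDs that mention it.
    docs_by_entity = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            docs_by_entity[entity].add(doc_id)

    validated = []
    for head, relation, tail in curated_triples:
        support = docs_by_entity[head] & docs_by_entity[tail]
        if len(support) >= min_docs:
            validated.append((head, relation, tail, sorted(support)))
    return validated

# Hypothetical linked mentions per paper and hypothetical curated triples.
doc_entities = {
    "paper1": {"CHEM:losartan", "GENE:ACE2"},
    "paper2": {"CHEM:losartan", "DISEASE:hypertension"},
    "paper3": {"GENE:ACE2"},
}
curated_triples = [
    ("CHEM:losartan", "Therapeutic", "DISEASE:hypertension"),
    ("CHEM:losartan", "Marker/Mechanism", "GENE:placeholder"),
]
validated = validate_relations(doc_entities, curated_triples)
```

Only the first triple survives here, because its two entities appear together in `paper2`; the supporting document IDs are kept so they can later serve as evidence.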

2.2 Fine-grained Text Entity Extraction

NER Result Visualization (each recognized mention is followed by its entity type)

Angiotensin-converting enzyme 2 GENE_OR_GENOME (ACE2 GENE_OR_GENOME) as a SARS-CoV-2 CORONAVIRUS receptor: molecular mechanisms and potential therapeutic target. SARS-CoV-2 CORONAVIRUS has been sequenced [3]. A phylogenetic EVOLUTION analysis [3, 4] found a bat WILDLIFE origin for the SARS-CoV-2 CORONAVIRUS. There is a diversity of possible intermediate hosts for SARS-CoV-2 CORONAVIRUS, including pangolins WILDLIFE, but not mice EUKARYOTE and rats EUKARYOTE [5]. There are many similarities of SARS-CoV-2 CORONAVIRUS with the original SARS-CoV CORONAVIRUS. Using computer modeling, Xu et al. [6] found that the spike proteins GENE_OR_GENOME of SARS-CoV-2 CORONAVIRUS and SARS-CoV CORONAVIRUS have almost identical 3-D structures in the receptor binding domain that maintains Van der Waals forces PHYSICAL_SCIENCE. SARS-CoV spike proteins GENE_OR_GENOME has a strong binding affinity to human ACE2 GENE_OR_GENOME, based on biochemical interaction studies and crystal structure analysis [7]. SARS-CoV-2 CORONAVIRUS and SARS-CoV spike proteins GENE_OR_GENOME share identity in amino acid sequences and ……

Figure 4: Example of Fine-grained Entity Extraction

However, questions from experts often involve fine-grained knowledge elements, such as “Which amino acids in glycoprotein (a spike protein of COVID-19) are most related to Glycan (CHEMICAL)?”. To answer such questions, we have incorporated 75 fine-grained entity types automatically annotated by CORD-NER (Wang et al., 2020d) into the constructed KG. CORD-NER covers many new entity types specifically related to COVID-19 studies (e.g., coronaviruses, viral proteins, evolution, materials, substrates and immune responses), which may benefit research on COVID-19 related viruses, spreading mechanisms, and potential vaccines. The entity types come from four sources: (1) 18 general entity types from Spacy³ (e.g., person, location and organization), (2) 18 biomedical entity types from SciSpacy⁴ (e.g., organism, gene, chemical and disease), (3) 127 biomedical entity types from the UMLS knowledge base⁵ (e.g., organism, gene, chemical, disease and biological process), and (4) 9 new entity types defined by humans for COVID-19 studies (i.e., coronaviruses, viral proteins, evolution, materials, substrates, immune responses, livestock, wildlife and physical science). CORD-NER reorganizes all the entity types from the four sources into one entity type hierarchy of 75 fine-grained entity types. CORD-NER relies on distantly- and weakly-supervised NER methods (Wang et al., 2019b; Shang et al., 2018), with no need for expensive human annotation of any articles or subcorpus. Its entity annotation quality surpasses that of SciSpacy, a fully supervised BioNER tool, by over 10% on the F1 score, based on a sample set of documents.

³ https://spacy.io/api/annotation#named-entities
⁴ https://allenai.github.io/scispacy/

Figure 4 shows some examples of the annotation results on a CORD-19 paper (Zhang et al., 2020). CORD-NER recognizes the new entity types with high quality. For instance, “SARS-CoV-2” is recognized as the “CORONAVIRUS” type, “bat” and “pangolins” are recognized as the “WILDLIFE” type and “Van der Waals forces” is recognized as the “PHYSICAL_SCIENCE” type.
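To give a feel for how distant supervision can produce typed labels without human annotation, the sketch below performs greedy longest-match lookup of token spans in a typed dictionary. It is only an illustration of the dictionary-matching idea: CORD-NER itself relies on the cited methods, and the function name, the `max_len` parameter and the tiny dictionary (built from the examples in Figure 4) are assumptions.

```python
def dictionary_label(tokens, type_dictionary, max_len=4):
    """Greedy longest-match lookup of token spans in a typed dictionary.
    Returns (start, end, surface, entity_type) tuples; `end` is exclusive."""
    spans, i = [], 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span starting at position i first.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + length])
            entity_type = type_dictionary.get(surface.lower())
            if entity_type is not None:
                match = (i, i + length, surface, entity_type)
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return spans

# Toy dictionary using the entity types illustrated in Figure 4.
type_dictionary = {
    "sars-cov-2": "CORONAVIRUS",
    "bat": "WILDLIFE",
    "pangolins": "WILDLIFE",
    "van der waals forces": "PHYSICAL_SCIENCE",
}
tokens = "A phylogenetic analysis found a bat origin for SARS-CoV-2".split()
spans = dictionary_label(tokens, type_dictionary)
```

Labels produced this way are noisy, which is why they would normally serve as training signal for a weakly supervised tagger rather than as final annotations.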

2.3 Image Processing and Cross-media Entity Grounding

Figures in biomedical papers contain rich information uniquely manifested in the visual modality, such as molecular structures, microscopic images, dosage response curves, relational diagrams, and other visual types. We have developed a visual IE subsystem to extract the visual information from figure images to enrich the knowledge graph. We start by designing a pipeline and automatic tools, shown in Figure 5, to extract figures from papers in the CORD-19 dataset and segment the figures into close to half a million subfigures. Then, we perform cross-modal entity grounding to ground entities mentioned in captions or referring text to visual objects in the subfigures.

One main challenge for figure analysis is that figures are rarely stored in separate image files. Most figures are embedded in the PDF files of the papers. We employ Deepfigures (Siegel et al., 2018) to automatically detect and extract figures from each PDF document. Each figure is associated with text in its caption or referring context (main body text referring to the figure). In this way, a figure can be coarsely attached to a KG entity if the entity is mentioned in the associated text.

⁵ https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html

Figure 5: System pipeline for automatic figure extraction and subfigure segmentation.

Fig. The baseplate .p. (A) Isosurface representation of the C3 symmetric baseplate reconstruction at 4.5 σ density cutoff. The density corresponding to the Tal trimer is shown in pink, with the rest of the reconstruction in transparent gray. (B) Central section through C3 reconstruction (gray scale). Density inside the lumen presumably corresponding to TMP is indicated with the yellow arrow. Scale bar = 100 Å. (C) The Tal density alone, viewed from the side and bottom. The disordered density at the bottom in the side view is assumed to be the Tal CTD and was removed from the bottom view for clarity. (D) Ribbon representation of the atomic model of Tal viewed from the side and bottom. One subunit is colored dark gray. The TMP model is shown in green. (E) Tal subunit, colored in rainbow colors from blue (N-terminus) to red (C-terminus). Relevant structural elements are labeled. (F) Density corresponding to the central part of Tal with the atomic models of Tal and TMP shown in pink and green, respectively. (G) View down the axis of the baseplate showing the inside of Tal as a van der Waals surface, colored from hydrophilic (tan) to hydrophobic (purple). TMP is shown in green. (H) Detail of the interaction between the TMP C-terminus and Tal. The three C-terminal residues of TMP (Y1152, Y1153 and L1154) are shown in stick representation.

Figure 6: Examples of segmenting a figure into subfigures and aligning them with subcaption text

To further delineate the semantic and visual information contained in each subfigure, we have developed a pipeline to segment individual subfigures and then align each subfigure with its corresponding sub-caption. We employ Figure-separator (Tsutsui and Crandall, 2017) to detect and separate all non-overlapping image regions. Meanwhile, subfigures in a figure are typically marked with alphabetical letters (e.g., A, B, C, etc.). We use deep neural networks (Zhou et al., 2017) to detect text in the figures and use OCR tools (Smith, 2007) to automatically recognize text information within each figure. To distinguish subfigure marker text from text labels that annotate figure content, we use the location proximity between text labels and subfigures to locate subfigure text markers. The location information of such text markers can also be used to merge multiple image regions into a single subfigure. In the end, each subfigure is segmented and associated with its corresponding subcaption and referring context.
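The proximity step described above could look roughly like the following sketch, which treats short letter-like OCR strings as marker candidates and assigns each to the nearest image region. The regular expression, the `max_dist` threshold, the box format `(x0, y0, x1, y1)` and all example coordinates are assumptions made for illustration.

```python
import re

MARKER = re.compile(r"^\(?[A-Za-z]\)?$")  # matches e.g. "A" or "(b)"

def box_distance(point, box):
    """Euclidean distance from a point to an axis-aligned box (0 if inside)."""
    x, y = point
    x0, y0, x1, y1 = box
    dx = max(x0 - x, 0, x - x1)
    dy = max(y0 - y, 0, y - y1)
    return (dx * dx + dy * dy) ** 0.5

def assign_markers(regions, detections, max_dist=30.0):
    """Map each marker-like text detection to the index of the nearest region."""
    assignment = {}
    for text, center in detections:
        if not MARKER.match(text):
            continue  # treated as an in-figure content label, not a marker
        nearest = min(range(len(regions)),
                      key=lambda i: box_distance(center, regions[i]))
        if box_distance(center, regions[nearest]) <= max_dist:
            assignment.setdefault(text.strip("()").upper(), []).append(nearest)
    return assignment

# Hypothetical region boxes and OCR detections given as (text, center point).
regions = [(0, 0, 100, 100), (120, 0, 220, 100)]
detections = [("A", (5, 8)), ("B", (125, 8)), ("ACE2", (60, 50))]
markers = assign_markers(regions, detections)
```

In this toy layout, "A" and "B" are attached to the first and second region, while "ACE2" is left as a content label that can later be linked to KG entities.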

The segmented subfigures and associated text labels provide rich information that can be used to expand the KG constructed from text captions. For example, as shown in Figure 7, we apply a classifier to detect subfigure images that contain molecular structures. Then, by linking specific drug names extracted from within-figure text to the drug entity in the coarse KG constructed from the caption text, a cross-modal expanded KG can be constructed that links specific molecular structure images to corresponding drug entities in the KG.

Figure 7: Expanding knowledge graph through subfigure segmentation and cross-modal entity grounding.
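A minimal sketch of this cross-modal linking, assuming the molecular-structure classifier and OCR have already run: subfigures flagged as molecular structures are linked to KG drug entities whose names appear in the within-figure text. The dictionary keys, the `depicts` edge label and the example records are hypothetical.

```python
def ground_subfigures(subfigures, kg_drug_entities):
    """Link molecular-structure subfigures to KG drug entities named in their OCR text."""
    names = {name.lower(): name for name in kg_drug_entities}
    links = []
    for fig in subfigures:
        if not fig["is_molecular_structure"]:
            continue  # only structure images are linked in this sketch
        for token in fig["ocr_text"]:
            entity = names.get(token.lower())
            if entity is not None:
                links.append((fig["id"], "depicts", entity))
    return links

# Hypothetical classifier and OCR output for two subfigures.
subfigures = [
    {"id": "fig7_a", "is_molecular_structure": True, "ocr_text": ["Losartan", "MW"]},
    {"id": "fig7_b", "is_molecular_structure": False, "ocr_text": ["Losartan"]},
]
links = ground_subfigures(subfigures, ["Losartan", "Benazepril"])
```

Exact string matching is the simplest possible linker; a real system would likely need synonym handling and tolerance for OCR errors.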

2.4 Knowledge Graph Semantic Visualization

In order to enhance the exploration and discovery of the information mined from the COVID-19 literature through the algorithms discussed in previous sections, we have been developing techniques to create semantic visualizations over large datasets of complex networks of biomedical relations. Semantic visualization allows for visualization of user-defined subsets of these relations through interactive semantically typed tag clouds and heatmaps. This allows researchers to get a global view of selected relation subtypes drawn from hundreds or thousands of papers at a single glance. This in turn allows for the ready identification of novel relationships that would typically be missed by directed keyword searches or simple unigram word cloud or heatmap displays.⁶

We first build a data index from the datasets, and then create a Kibana dashboard from the generated data indices. Each Kibana dashboard has a collection of visualizations that are designed to interact with each other. Dashboards are implemented as web applications. A dashboard is navigated mainly by clicking and searching. For example, clicking the protein keyword EIF2AK2 in the tag cloud named “Enzyme proteins participating Modification relations” adds a constraint on the type of proteins in modifications. Correspondingly, all the other visualizations are updated.

⁶ https://www.semviz.org/

One unique feature of the SemViz semantic visualization is the creation of dense tag clouds and dense heatmaps, through a process of parameter reduction over relations, allowing for the visualization of relation sets as tag clouds and multiple chained relations as heatmaps. Figure 8 illustrates such a dense heatmap, where a functionally typed protein is implicated in a disease relation (e.g., “those proteins that are down regulators of INF which are implicated in obesity”).

3 Knowledge-driven Question Answering

3.1 KG Matching and Path Ranking

With the knowledge graphs constructed from massive, up-to-date scientific literature and domain knowledge, we can support many biomedical hypothesis-related queries and inference tools for working scientists and clinicians. Current question answering (QA) methods usually rely on word-level or sentence-level semantic matching. The questions from existing shared tasks are either limited to non-experts (e.g., “Corona Virus Update?”) or too high-level (e.g., “What is known about transmission, incubation, and environmental stability?”). Most current QA systems are trained on Wikipedia articles, and thus they will not be effective for the COVID-19 domain, where most answers are not explicitly written in a single sentence or document. In sharp contrast, we develop a QA component based on a combination of knowledge graph matching and distributional semantic matching. It provides fast and effective answers to questions about drugs, diseases, chemical entities and genes from any angle. We build knowledge graph indexing and searching functions so that users can pose queries and search effectively and efficiently. We also support semantic matching over the constructed KGs and related texts by accepting multi-hop queries.

Figure 8: Regulatory Processes-Disease Interactions Heatmap

A common category of queries is about the connections between two entities. Given two entities as a query, we generate a subgraph covering salient paths between them to show how they are connected through other entities. Figure 3 is an example subgraph summarizing the connections between Losartan and cathepsin L pseudogene 2. The paths are generated by traversing the constructed KG, and are ranked by the frequency of paths in the KG. We construct a subgraph for each query by merging the top-ranked paths. Each edge is assigned a salience score by aggregating the scores of paths passing through it.
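The path ranking and edge salience described above can be sketched as follows. The text does not define path frequency precisely, so this sketch assumes each edge carries a count of supporting mentions and scores a path by its weakest edge; that scoring rule, the function names, and the toy graph are all assumptions.

```python
from collections import defaultdict

def find_paths(adjacency, source, target, max_hops=3):
    """Enumerate simple paths (no repeated node) from source to target."""
    paths, stack = [], [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            paths.append(path)
            continue
        if len(path) - 1 >= max_hops:
            continue
        for neighbor in adjacency[node]:
            if neighbor not in path:
                stack.append((neighbor, path + [neighbor]))
    return paths

def salient_subgraph(edge_counts, source, target, top_k=2, max_hops=3):
    """Return the top_k ranked paths and a salience score per undirected edge."""
    adjacency = defaultdict(set)
    for u, v in edge_counts:
        adjacency[u].add(v)
        adjacency[v].add(u)

    def count(u, v):
        return edge_counts.get((u, v), 0) + edge_counts.get((v, u), 0)

    # Assumed path score: the count of the least frequent edge on the path.
    scored = [(min((count(u, v) for u, v in zip(p, p[1:])), default=0), p)
              for p in find_paths(adjacency, source, target, max_hops)]
    scored.sort(key=lambda item: (-item[0], len(item[1]), item[1]))

    # Edge salience: sum of the scores of the top-ranked paths through the edge.
    salience = defaultdict(int)
    for score, path in scored[:top_k]:
        for u, v in zip(path, path[1:]):
            salience[frozenset((u, v))] += score
    return scored[:top_k], dict(salience)

# Toy graph with hypothetical mention counts per edge.
edge_counts = {
    ("drug", "gene_a"): 5, ("gene_a", "target"): 3,
    ("drug", "gene_b"): 2, ("gene_b", "target"): 1,
    ("gene_a", "gene_b"): 4,
}
top_paths, salience = salient_subgraph(edge_counts, "drug", "target")
```

Here the edge between `gene_a` and `target` ends up with the highest salience because both top-ranked paths pass through it, which is the kind of edge a merged subgraph would emphasize.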

3.2 Knowledge-driven Sentence Matching

In addition to knowledge elements, we also present related sentences as evidence. We use the BioBERT (Lee et al., 2020) pre-trained language model to represent each sentence along with its left and right neighboring sentences as local context. Applying the same architecture to every candidate sentence and to the user query, we aggregate the sequence embedding layer (the last hidden layer in the BERT architecture) with average pooling (Reimers and Gurevych, 2019). We use the similarity between the embedding representations of each sentence and the query to extract the most relevant sentences as evidence. Table 1 shows some answer examples from our QA component.

Question | # of Answers | Example Answers
Which genes are related to COVID-19? | 687 | AP2 associated kinase 1, myeloperoxidase, thioredoxin
Which chemicals are related to COVID-19? | 3,142 | acetoacetic acid, Chlorine, Zymosan
Which genes are related to COVID-19 that can be transferred from its similar diseases? | 2,168 | DEK proto-oncogene, nuclear receptor corepressor 1
Which chemicals are related to COVID-19 that can be transferred from its similar diseases? | 327 | Ampicillin, Quercetin, Zoledronic Acid

Table 1: Knowledge-driven QA Output Examples
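The sentence-matching step of Section 3.2 can be sketched in plain Python as average pooling over per-token vectors followed by cosine similarity against the query. The toy two-dimensional vectors stand in for encoder hidden states, which would in practice come from the language model; the function names and `top_k` parameter are assumptions.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into a single sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_sentences(query_tokens, sentence_tokens, top_k=1):
    """Return indices of the top_k sentences most similar to the query."""
    query_vec = mean_pool(query_tokens)
    scores = [cosine(query_vec, mean_pool(tokens)) for tokens in sentence_tokens]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:top_k]

# Toy stand-ins for last-hidden-layer token vectors.
query = [[1.0, 0.0], [0.8, 0.2]]
sentences = [
    [[0.0, 1.0], [0.1, 0.9]],  # sentence 0: points away from the query
    [[0.9, 0.1], [1.0, 0.0]],  # sentence 1: close to the query
]
best = rank_sentences(query, sentences)
```

Because each sentence is encoded together with its neighbors in the described setup, the token vectors passed to `mean_pool` would cover that local context window rather than the sentence alone.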

3.3 Evidence Mining

Queries also often include entity types instead of entity instances, which requires us to extract evidence sentences based on type or pattern matching. We have developed EVIDENCEMINER (Wang et al., 2020b,c), a web-based system that accepts a user’s query as a natural language statement or an inquired relationship at the meta-symbol level (e.g., CHEMICAL, PROTEIN) and automatically retrieves textual evidence from a background corpus of COVID-19 literature. It is supported by data-driven methods for distantly supervised named entity recognition (Wang et al., 2020d) and pattern-based open information extraction (Wang et al., 2018a; Li et al., 2018). The entities and patterns are pre-computed and indexed offline to support fast online evidence retrieval. For example, in Figure 9, the top results of EVIDENCEMINER for the query (“SARS-COV-2”, “Losartan”) indicate that “Losartan” can be a potential drug treatment for “COVID-19”. EVIDENCEMINER achieves better performance than baseline methods such as BM25 (Robertson et al., 2009) and LitSense (Allot et al., 2019).
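The offline indexing idea can be illustrated with a small sketch in which each sentence is indexed under ordered pairs of keys, where a key is either an entity name or its type symbol, so that instance-level and type-level queries share one lookup. This is not EVIDENCEMINER's actual index; the `TYPES` set, function names and the two toy sentences are invented for illustration.

```python
from collections import defaultdict
from itertools import permutations

TYPES = {"CHEMICAL", "CORONAVIRUS", "PROTEIN"}  # assumed meta-symbols

def normalize(key):
    """Type symbols are kept as-is; entity names are matched case-insensitively."""
    return key if key in TYPES else key.lower()

def build_index(annotated_sentences):
    """Offline step: index sentence IDs by ordered (key, key) pairs."""
    index = defaultdict(set)
    for sent_id, (_, entities) in enumerate(annotated_sentences):
        for (name_a, type_a), (name_b, type_b) in permutations(entities, 2):
            for key_a in (name_a.lower(), type_a):
                for key_b in (name_b.lower(), type_b):
                    index[(key_a, key_b)].add(sent_id)
    return index

def retrieve(index, annotated_sentences, first, second):
    """Online step: a single dictionary lookup per query pair."""
    hits = index.get((normalize(first), normalize(second)), set())
    return [annotated_sentences[i][0] for i in sorted(hits)]

# Invented toy sentences with pre-computed entity annotations.
annotated = [
    ("Toy sentence one mentions Losartan and SARS-CoV-2.",
     [("Losartan", "CHEMICAL"), ("SARS-CoV-2", "CORONAVIRUS")]),
    ("Toy sentence two mentions Quercetin and SARS-CoV-2.",
     [("Quercetin", "CHEMICAL"), ("SARS-CoV-2", "CORONAVIRUS")]),
]
index = build_index(annotated)
instance_hits = retrieve(index, annotated, "SARS-COV-2", "Losartan")
type_hits = retrieve(index, annotated, "CORONAVIRUS", "CHEMICAL")
```

The instance-level query returns only the first sentence, while the type-level query returns both; ranking the hits is a separate step not covered here.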

4 A Case Study on Drug Repurposing Report Generation

4.1 Task and Data

A human-written report about drug repurposing usually includes answers to the following typical questions.

1. Current indication: what is the drug class? What is it currently approved to treat?

2. Molecular structure (symbols desired, but a pointer to a reference is also useful)

3. Mechanism of action, i.e., inhibits viral entry, replication, etc. (with a pointer to data)

4. Was the drug identified by manual or computational screen?

5. Who is studying the drug? (Source/lab name)

6. In vitro data available (cell line used, assays run, viral strain used, cytopathic effects, toxicity, LD50, dosage response curve, etc.)

7. Animal data available (what animal model, LD50, dosage response curve, etc.)

8. Clinical trials ongoing (what phase, facility, target population, dosing, intervention, etc.)

9. Funding source

10. Has the drug shown evidence of systemic toxicity?

11. List of relevant sources to pull data from.

Figure 9: Top results of EvidenceMiner for the query (“SARS-COV-2”, “Losartan”).

We use three drugs suggested by DARPA biologists as case studies: Benazepril, Losartan, and Amodiaquine. Our KG results for many other drugs are visualized at our website⁷. We use the following list of chemicals/genes related to COVID-19, suggested by DARPA biologists:

• BM1_00870 BM1_06175 BM1_16375 BM1_17125 BM1_22385 BM1_30360 BM1_33735 BM1_56245 BM1_56735 BM1_00870 BM1_06175 BM1_16375 BM1_17125 BM1_22385 BM1_30360 BM1_33735 BM1_56245 BM1_56735 CATB-10270 CATB-1418 CATB-1674 CATB-16A CATB-16D2 CATB-1852 CATB-1874 CATB-2744 CATB-3098 CATB-348 CATB-3483 CATB-5880 CATB-84 CATB-912 CATD CATHY CATK CATL CATL-LIKE CTS12 CTS3 CTS6 CTS7 CTS7-PS CTS8 CTS8L1 CTS8-PS CTSA CTSA.L CTSB CTSBA CTSBB CTSB.L CTSB-PS CTSB.S CTSC CTSC.L CTSC.S CTSD CTSD2 CTSD.S CTSE CTSEAL CTSE.L CTSE.S CTSF CTSF.L CTSG CTSH CTSH.L CTSH-PS CTSJ CTSK CTSK1 CTSK.L CTSL CTSL.1 CTSL3 CTSL3P CTSLA CTSLB CTSLL CTSL.L CTSLL3 CTSLP1 CTSLP2 CTSLP3 CTSLP4 CTSLP6 CTSLP8 CTSM CTSM-PS CTSM-PS2 CTSO CTSO.L CTSQ CTSQL2 CTSR CTSS CTSS1 CTSS.2 CTSS2.1 CTSS2.2 CTSSL CTSS.L CTSS.S CTSV CTSV.L CTSW CTSW.L CTSZ CTSZ.L CTSZ.S LOAG_18685 SMP_013040.1 SMP_034410.1 SMP_067050 SMP_067060 SMP_085010 SMP_085180 SMP_103610 SMP_105370 SMP_158410 SMP_158420 SMP_179950 TSP_01409 TSP_02382 TSP_02383 TSP_03306 TSP_07747 TSP_10129 TSP_10493 TSP_11596 LMAN1 LMAN1L LMAN1.L LMAN1.S LMAN2 LMAN2L MBL1P MBL2 ACE2 FURIN TMPRSS2

⁷ http://blender.cs.illinois.edu/covid19/visualization.html

As new knowledge is discovered and more papers on this topic are published every day, we need to keep up to date with the latest developments. For this purpose, we download new COVID-19 papers on a daily basis from three sources: the NCBI PMC API, the NCBI Pubtator API and the CORD-19 archive. We provide incremental updates, including new papers, removed papers and updated papers, along with their metadata, at our website⁸.
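One simple way to derive such incremental updates, sketched here under the assumption that each daily download is summarized as a mapping from paper ID to a content hash, is to diff two consecutive snapshots. The function name, the snapshot format and the example IDs are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two {paper_id: content_hash} snapshots of the corpus."""
    new = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    # A paper counts as updated when it is in both snapshots with a changed hash.
    updated = sorted(pid for pid in set(previous) & set(current)
                     if previous[pid] != current[pid])
    return {"new": new, "removed": removed, "updated": updated}

# Hypothetical snapshots from two consecutive daily downloads.
yesterday = {"PMC1": "a1", "PMC2": "b2", "PMC3": "c3"}
today = {"PMC1": "a1", "PMC2": "b9", "PMC4": "d4"}
update = diff_snapshots(yesterday, today)
```

Only the papers listed under `new` and `updated` would need to be re-processed by the extraction pipeline, while knowledge elements supported solely by `removed` papers could be retracted from the KG.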

4.2 Results

Up to June 14, 2020, we have collected 140K papers. We choose 25,534 peer-reviewed papers and construct the KG. The current KG includes 7,230 Diseases, 9,123 Chemicals and 50,864 Genes, 1,725,518 chemical-gene links, 5,556,670 chemical-disease links, and 77,844,574 gene-disease links. The KG has been downloaded more than 800 times.

Several clinicians and medical school students on our team have manually reviewed the drug repurposing reports for three drugs, and also the knowledge graphs connecting 41 drugs and COVID-19 related chemicals/genes. Preliminary results show that most of our output is informative, valid and sound. For instance, after the coronavirus enters the cell in the lungs, it can cause a severe disease called Acute Respiratory Distress Syndrome (ARDS). This condition causes the release of inflammatory molecules in the body named cytokines, such as Interleukin-2, Interleukin-6, Tumor Necrosis Factor, and Interleukin-10. We see all of these connections in our results, such as the examples shown in Figure 3 and Figure 10. Some results are a little surprising to scientists, who think they are worth further investigation. For example, in Figure 3 we can see that Losartan is connected to tumor protein p53, which is related to lung cancer.

Our final generated reports⁹ are shared publicly. For each question, our framework provides answers along with detailed evidence, knowledge subgraphs, and image segmentation and analysis results. Table 2 shows some example answers.

Figure 10: Connections Involving Coronavirus Related Diseases (node labels: lopinavir-ritonavir drug combination, cathepsin D, COVID-19, Coronavirus Infections, Severe Acute Respiratory Syndrome)

⁸ http://blender.cs.illinois.edu/covid19/
⁹ http://blender.cs.illinois.edu/covid19/DrugRe-purposingReport_V2.0.docx

Question | Field | Example Answers
Q1 | Drug Class | angiotensin-converting enzyme (ACE) inhibitors
Q1 | Disease | hypertension
Q1 | Evidence Sentences | [PMID:32314699 (PMC7253125)] Past medical history was significant for hypertension, treated with amlodipine and benazepril, and chronic back pain.
Q1 | Evidence Sentences | [PMID:32081428 (PMC7092824)] On the other hand, many ACE inhibitors are currently used to treat hypertension and other cardiovascular diseases. Among them are captopril, perindopril, ramipril, lisinopril, benazepril, and moexipril.
Q4 | Disease | COVID-19
Q4 | Evidence Sentences | [PMID:32081428 (PMC7092824)] By using a molecular docking approach, an earlier study identified N-(2-aminoethyl)-1 aziridine-ethanamine as a novel ACE2 inhibitor that effectively blocks the SARS-CoV RBD-mediated cell fusion. This has provided a potential candidate and lead compound for further therapeutic drug development. Meanwhile, biochemical and cell-based assays can be established to screen chemical compound libraries to identify novel inhibitors.
Q6 | Disease | cardiovascular disease
Q6 | Evidence Sentences | [PMID:22800722 (PMC7102827)] The in vitro half-maximal inhibitory concentration (IC50) values of food-derived ACE inhibitory peptides are about 1000-fold higher than that of synthetic captopril but they have higher in vivo activities than would be expected from their in vitro activities.....
Q8 | Disease | COVID-19
Q8 | Evidence Sentences | [PMID:32336612 (PMC7167588)] Two trials of losartan as additional treatment for SARS-CoV-2 infection in hospitalized (NCT04312009) or not hospitalized (NCT04311177) patients have been announced, supported by the background of the huge adverse impact of the ACE Angiotensin II AT1 receptor axis over-activity in these patients.
Q8 | Evidence Sentences | [PMID:32350632 (PMC7189178)] To address the role of angiotensin in lung injury, there is an ongoing clinical trial to examine whether losartan treatment affects outcomes in COVID-19 associated ARDS (NCT04312009).
Q8 | Evidence Sentences | [PMID:32439915 (PMC7242178)] Losartan was also the molecule chosen in two trials recently started in the United States by the University of Minnesota to treat patients with COVID-19 (clinical trials.gov NCT04311177 and NCT 104312009).

Table 2: Example Answers for Questions in Drug Repurposing Reports

5 Related Work

There has been a lot of previous work on extracting biomedical entities (Krallinger et al., 2013; Lu et al., 2015; Leaman and Lu, 2016; Habibi et al., 2017; Crichton et al., 2017; Wang et al., 2018b; Beltagy et al., 2019; Alsentzer et al., 2019; Wei et al., 2019; Wang et al., 2020d), relations (Uzuner et al., 2011; Krallinger et al., 2011; Segura-Bedmar et al., 2013; Bui et al., 2014; Peng et al., 2016; Wei et al., 2015; Peng et al., 2017; Quirk and Poon, 2016; Luo et al., 2017; Wei et al., 2019; Peng et al., 2019, 2020), and events (Ananiadou et al., 2010; Van Landeghem et al., 2013; Nedellec et al., 2013; Deleger et al., 2016; Wei et al., 2019; Li et al., 2019; ShafieiBavani et al., 2020) from biomedical corpora to construct KGs. Recently, Hope et al. (2020), Ilievski et al. (2020), Wolinski (2020), and Ahamed and Samad (2020) have built KGs based on CORD-19 (Wang et al., 2020a).

Most of the recent biomedical QA work is driven by the BioASQ initiative (Tsatsaronis et al., 2015), with many algorithms developed (Yang et al., 2015, 2016; Chandu et al., 2017; Kraus et al., 2017). Live COVID-19 question answering systems have come out of BioASQ (COVIDASK¹⁰, AUEB¹¹), and search engines (Kricka et al., 2020; Esteva et al., 2020; Hope et al., 2020; Taub Tabib et al., 2020) have also been built.

Our work advances the state of the art by extending the knowledge elements to more fine-grained types, and by incorporating image analysis, cross-media knowledge grounding, and knowledge graph matching into question answering.

6 Conclusions and Future Work

We have developed a novel framework, COVID-KG, that automatically transforms a massive scientific literature corpus into organized, structured and actionable knowledge graphs. With COVID-KG, researchers and clinicians are able to obtain trustworthy and non-trivial answers from scientific literature, and thus focus on more important hypothesis testing and prioritize the analysis efforts for candidate exploration directions. In our ongoing work we have created a new ontology that includes 77 entity subtypes and 58 event subtypes, and we are re-building an end-to-end joint neural IE system following this new ontology. In the future we plan to extend COVID-KG to automate the creation of new hypotheses by predicting new links. Inspired by our recent success at multimedia event extraction (Li et al., 2020), we will create a multimedia common semantic space for literature and apply it to improve cross-media knowledge grounding, inference and transfer.

¹⁰ https://covidask.korea.ac.kr/

¹¹ http://cslab241.cs.aueb.gr:5000/


References

Sabber Ahamed and Manar Samad. 2020. Information mining for covid-19 research from a large volume of scientific literature. arXiv preprint arXiv:2004.02085.

Alexis Allot, Qingyu Chen, Sun Kim, Roberto Vera Alvarez, Donald C Comeau, W John Wilbur, and Zhiyong Lu. 2019. Litsense: making sense of biomedical literature at sentence level. Nucleic acids research.

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Sophia Ananiadou, Sampo Pyysalo, Jun’ichi Tsujii, and Douglas B Kell. 2010. Event extraction for systems biology by text mining the literature. Trends in biotechnology, 28(7):381–390.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. Scibert: Pretrained language model for scientific text. In EMNLP.

Quoc-Chinh Bui, Peter MA Sloot, Erik M Van Mulligen, and Jan A Kors. 2014. A novel feature-based approach to extract drug–drug interactions from biomedical text. Bioinformatics, 30(23):3365–3371.

Khyathi Chandu, Aakanksha Naik, Aditya Chandrasekar, Zi Yang, Niloy Gupta, and Eric Nyberg. 2017. Tackling biomedical text summarization: Oaqa at bioasq 5b. In BioNLP 2017, pages 58–66.

Gamal Crichton, Sampo Pyysalo, Billy Chiu, and Anna Korhonen. 2017. A neural network multi-task learning approach to biomedical named entity recognition. BMC Bioinf., 18(1):368.

Allan Peter Davis, Cynthia J Grondin, Robin J Johnson, Daniela Sciaky, Benjamin L King, Roy McMorran, Jolene Wiegers, Thomas C Wiegers, and Carolyn J Mattingly. 2016. The comparative toxicogenomics database: update 2017. Nucleic acids research.

Louise Deleger, Robert Bossy, Estelle Chaix, Mouhamadou Ba, Arnaud Ferre, Philippe Bessieres, and Claire Nedellec. 2016. Overview of the bacteria biotope task at bionlp shared task 2016. In Proceedings of the 4th BioNLP shared task workshop, pages 12–22.

Andre Esteva, Anuprit Kale, Romain Paulus, Kazuma Hashimoto, Wenpeng Yin, Dragomir Radev, and Richard Socher. 2020. Co-search: Covid-19 information retrieval with semantic search, question answering, and abstractive summarization. arXiv preprint arXiv:2006.09595.

Maryam Habibi, Leon Weber, Mariana Neves, David Luis Wiegandt, and Ulf Leser. 2017. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics, 33(14):i37–i48.

Tom Hope, Jason Portenoy, Kishore Vasan, Jonathan Borchardt, Eric Horvitz, Daniel S Weld, Marti A Hearst, and Jevin West. 2020. Scisight: Combining faceted navigation and research group detection for covid-19 exploratory scientific search. arXiv preprint arXiv:2005.12668.

Lifu Huang, Jonathan May, Xiaoman Pan, Heng Ji, Xiang Ren, Jiawei Han, Lin Zhao, and James Hendler. 2017. Liberal entity extraction: Rapid construction of fine-grained entity typing systems. In Big Data, Mar 2017, 5(1): 19-31.

Filip Ilievski, Daniel Garijo, Hans Chalupsky, Naren Teja Divvala, Yixiang Yao, Craig Rogers, Ronpeng Li, Jun Liu, Amandeep Singh, Daniel Schwabe, et al. 2020. Kgtk: A toolkit for large knowledge graph manipulation and analysis. arXiv preprint arXiv:2006.00088.

Martin Krallinger, Florian Leitner, Obdulia Rabal, Miguel Vazquez, Julen Oyarzabal, and Alfonso Valencia. 2013. Overview of the chemical compound and drug name recognition (chemdner) task. In BioCreative challenge evaluation workshop, volume 2, page 2. Citeseer.

Martin Krallinger, Miguel Vazquez, Florian Leitner, David Salgado, Andrew Chatr-Aryamontri, Andrew Winter, Livia Perfetto, Leonardo Briganti, Luana Licata, Marta Iannuccelli, et al. 2011. The protein-protein interaction tasks of biocreative iii: classification/ranking of articles and linking bio-ontology concepts to full text. BMC bioinformatics, 12(S8):S3.

Milena Kraus, Julian Niedermeier, Marcel Jankrift, Soren Tietbohl, Toni Stachewicz, Hendrik Folkerts, Matthias Uflacker, and Mariana Neves. 2017. Olelo: a web application for intuitive exploration of biomedical literature. Nucleic acids research, 45(W1):W478–W483.

Larry J Kricka, Sergei Polevikov, Jason Y Park, Paolo Fortina, Sergio Bernardini, Daniel Satchkov, Valentin Kolesov, and Maxim Grishkov. 2020. Artificial intelligence-powered search tools and resources in the fight against covid-19. EJIFCC, 31(2):106.

Robert Leaman and Zhiyong Lu. 2016. Taggerone: joint named entity recognition and normalization with semi-markov models. Bioinformatics, 32(18):2839–2846.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.


Diya Li, Lifu Huang, Heng Ji, and Jiawei Han. 2019. Biomedical event extraction based on knowledge-driven tree-lstm. In Proc. 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT2019).

Diya Li and Heng Ji. 2019. Syntax-aware multi-task graph convolutional networks for biomedical relation extraction. In Proc. EMNLP2019 Workshop on Health Text Mining and Information Analysis.

Manling Li, Alireza Zareian, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji, and Shih-Fu Chang. 2020. Cross-media structured common space for multimedia event extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2557–2568, Online. Association for Computational Linguistics.

Qi Li, Xuan Wang, Yu Zhang, Qi Li, Fei Ling, Cathy Wu H, and Jiawei Han. 2018. Pattern discovery for wide-window open information extraction in biomedical literature. In BIBM. IEEE.

Yanan Lu, Donghong Ji, Xiaoyuan Yao, Xiaomei Wei, and Xiaohui Liang. 2015. CHEMDNER system with mixed conditional random fields and multi-scale word clustering. J. Cheminf., 7(S1):S4.

Yuan Luo, Ozlem Uzuner, and Peter Szolovits. 2017. Bridging semantics and syntax with graph algorithms—state-of-the-art of extracting biomedical relations. Briefings in bioinformatics, 18(1):160–178.

Claire Nedellec, Robert Bossy, Jin-Dong Kim, Jung-Jae Kim, Tomoko Ohta, Sampo Pyysalo, and Pierre Zweigenbaum. 2013. Overview of bionlp shared task 2013. In Proceedings of the BioNLP shared task 2013 workshop, pages 1–7.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph lstms. Transactions of the Association for Computational Linguistics, 5:101–115.

Yifan Peng, Qingyu Chen, and Zhiyong Lu. 2020. An empirical study of multi-task learning on BERT for biomedical text mining. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 205–214, Online. Association for Computational Linguistics.

Yifan Peng, Chih-Hsuan Wei, and Zhiyong Lu. 2016. Improving chemical disease relation extraction with rich features and weakly labeled data. Journal of cheminformatics, 8(1):53.

Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 58–65, Florence, Italy. Association for Computational Linguistics.

Chris Quirk and Hoifung Poon. 2016. Distant supervision for relation extraction beyond the sentence boundary. arXiv preprint arXiv:1609.04873.

Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: Bm25 and beyond. FnT Inf. Ret., 3(4):333–389.

Isabel Segura-Bedmar, Paloma Martínez, and María Herrero-Zazo. 2013. SemEval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 341–350, Atlanta, Georgia, USA. Association for Computational Linguistics.

Elaheh ShafieiBavani, Antonio Jimeno Yepes, Xu Zhong, and David Martinez Iraola. 2020. Global locality in biomedical relation and event extraction. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 195–204, Online. Association for Computational Linguistics.

Jingbo Shang, Liyuan Liu, Xiang Ren, Xiaotao Gu, Teng Ren, and Jiawei Han. 2018. Learning named entity tagger using domain-specific dictionary. In EMNLP. ACL.

Noah Siegel, Nicholas Lourie, Russell Power, and Waleed Ammar. 2018. Extracting scientific figures with distantly supervised neural networks. In Proceedings of the 18th ACM/IEEE on joint conference on digital libraries, pages 223–232.

Ray Smith. 2007. An overview of the tesseract ocr engine. In Ninth international conference on document analysis and recognition (ICDAR 2007), volume 2, pages 629–633. IEEE.

Hillel Taub Tabib, Micah Shlain, Shoval Sadde, Dan Lahav, Matan Eyal, Yaara Cohen, and Yoav Goldberg. 2020. Interactive extractive search over biomedical corpora. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 28–37, Online. Association for Computational Linguistics.

George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. 2015. An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC bioinformatics, 16(1):138.


Satoshi Tsutsui and David J Crandall. 2017. A data driven approach for compound figure separation using convolutional neural networks. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 533–540. IEEE.

Ozlem Uzuner, Brett R South, Shuying Shen, and Scott L DuVall. 2011. 2010 i2b2/va challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.

Sofie Van Landeghem, Jari Bjorne, Chih-Hsuan Wei, Kai Hakala, Sampo Pyysalo, Sophia Ananiadou, Hung-Yu Kao, Zhiyong Lu, Tapio Salakoski, Yves Van de Peer, et al. 2013. Large-scale event extraction from literature with multi-level gene normalization. PloS one, 8(4):e55814.

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, et al. 2020a. Cord-19: The covid-19 open research dataset. ArXiv.

Qingyun Wang, Lifu Huang, Zhiying Jiang, Kevin Knight, Heng Ji, Mohit Bansal, and Yi Luan. 2019a. Paperrobot: Incremental draft generation of scientific ideas. In Proc. The 57th Annual Meeting of the Association for Computational Linguistics (ACL2019).

Xuan Wang, Yingjun Guan, Weili Liu, Aabhas Chauhan, Enyi Jiang, Qi Li, David Liem, Dibakar Sigdel, John Caufield, Peipei Ping, et al. 2020b. Evidenceminer: Textual evidence discovery for life sciences. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 56–62.

Xuan Wang, Weili Liu, Aabhas Chauhan, Yingjun Guan, and Jiawei Han. 2020c. Automatic textual evidence mining in covid-19 literature. arXiv preprint arXiv:2004.12563.

Xuan Wang, Xiangchen Song, Yingjun Guan, Bangzheng Li, and Jiawei Han. 2020d. Comprehensive named entity recognition on cord-19 with distant or weak supervision. arXiv preprint arXiv:2003.12218.

Xuan Wang, Yu Zhang, Qi Li, Yinyin Chen, and Jiawei Han. 2018a. Open information extraction with meta-pattern discovery in biomedical literature. In BCB, pages 291–300. ACM.

Xuan Wang, Yu Zhang, Qi Li, Xiang Ren, Jingbo Shang, and Jiawei Han. 2019b. Distantly supervised biomedical named entity recognition with dictionary expansion. In BIBM.

Xuan Wang, Yu Zhang, Xiang Ren, Yuhao Zhang, Marinka Zitnik, Jingbo Shang, Curtis Langlotz, and Jiawei Han. 2018b. Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics, page bty869.

Chih-Hsuan Wei, Alexis Allot, Robert Leaman, and Zhiyong Lu. 2019. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Research, 47(W1):W587–W593.

Chih-Hsuan Wei, Yifan Peng, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Jiao Li, Thomas C Wiegers, and Zhiyong Lu. 2015. Overview of the biocreative v chemical disease relation (cdr) task. In Proceedings of the fifth BioCreative challenge evaluation workshop, volume 14.

Francis Wolinski. 2020. Visualization of diseases at risk in the covid-19 literature. arXiv preprint arXiv:2005.00848.

Zi Yang, Niloy Gupta, Xiangyu Sun, Di Xu, Chi Zhang, and Eric Nyberg. 2015. Learning to answer biomedical factoid & list questions: Oaqa at bioasq 3b. CLEF (Working Notes), 1391.

Zi Yang, Yue Zhou, and Eric Nyberg. 2016. Learning to answer biomedical questions: Oaqa at bioasq 4b. In Proceedings of the Fourth BioASQ workshop, pages 23–37.

Haibo Zhang, Josef M Penninger, Yimin Li, Nanshan Zhong, and Arthur S Slutsky. 2020. Angiotensin-converting enzyme 2 (ace2) as a sars-cov-2 receptor: molecular mechanisms and potential therapeutic target. Intensive care medicine, 46(4):586–590.

Jin Guang Zheng, Daniel Howsmon, Boliang Zhang, Juergen Hahn, Deborah McGuinness, James Hendler, and Heng Ji. 2014. Entity linking for biomedical literature. In BMC Medical Informatics and Decision Making.

Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. 2017. East: an efficient and accurate scene text detector. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 5551–5560.