bioqa - a question answering system for the biomedical domain


Page 1: BioQA -  A question answering system for the biomedical domain

BioQA - A question answering system for the

biomedical domain

Luis Tari

Page 2: BioQA -  A question answering system for the biomedical domain

Question Answering (QA)

What is QA?

- “QA is an interactive human computer process that encompasses understanding a user information need, typically expressed in a natural language query; retrieving relevant documents, data, or knowledge from selected sources; extracting, qualifying and prioritizing available answers from these sources; and presenting and explaining responses in an effective manner.”
- Cited from “New Directions in Question Answering”

Why QA?

- One of the ultimate goals in AI (human-level AI, Turing’s test, …)
- A move beyond keyword queries, towards finding what we really want to know

Page 3: BioQA -  A question answering system for the biomedical domain

QA

How is QA different from a search engine?

Check out www.brainboost.com

| QA | Search Engine |
|---|---|
| Queries in natural language (questions) | Queries based on keywords |
| Answers are presented to users | Users find the answers in retrieved results |
| Some natural language processing is used to determine answers | Mostly keywords and ranking are used to retrieve results |

Page 4: BioQA -  A question answering system for the biomedical domain

Text Retrieval Conference (TREC)

An annual activity of information retrieval (IR) research sponsored by the National Institute of Standards and Technology (NIST).

TREC is organized into “tracks” of common interest.

Research groups work on a common source of data and a common set of queries or tasks.

The goal is to allow comparisons across systems and approaches in a research-oriented, collegial manner.

Page 5: BioQA -  A question answering system for the biomedical domain

TREC Genomics Track

- The TREC Genomics Track focuses on the retrieval of information from biomedical literature.
- Ad hoc retrieval on a set of 4.5 million articles, 25% of which have no abstracts.
- 50 topics (queries) organized in 5 templates

Page 6: BioQA -  A question answering system for the biomedical domain

TREC Genomics Templates

1. Find articles describing standard methods or protocols for doing some sort of experiment or procedure.

2. Find articles describing the role of a gene involved in a given disease.

3. Find articles describing the role of a gene in a specific biological process.

4. Find articles describing interactions (e.g., promote, suppress, inhibit, etc.) between two or more genes in the function of an organ or in a disease.

5. Find articles describing one or more mutations of a given gene and its biological impact.

Page 7: BioQA -  A question answering system for the biomedical domain

BioQA

- A QA system for the biomedical domain
- Many genomics information resources are available
  - Entrez Gene, PubMed, UniProt, Gene Ontology, UMLS, and many more…
- BioQA utilizes some of the genomics resources, whereas a generic QA system does not
- Keyword search is not enough
  - Consider the following examples

Page 8: BioQA -  A question answering system for the biomedical domain

Example 1

Suppose that, as a biologist, I want to know the role of the gene interferon beta in the disease multiple sclerosis.

Query to PubMed: “interferon beta” AND “multiple sclerosis”

Oops… interferon beta IS also the name of a treatment. I’m not a medical doctor so I don’t really care….

Page 9: BioQA -  A question answering system for the biomedical domain

Example 2

Query: “interferon beta” AND “multiple sclerosis”

Hmm… this is more like what I am looking for….

Page 10: BioQA -  A question answering system for the biomedical domain

Objectives of BioQA

- Phase 1
  - Retrieve relevant articles with respect to the specific needs of the user’s questions
- Phase 2
  - Extract and present answers to the users
- Phase 3
  - Answer questions that require simple reasoning

Page 11: BioQA -  A question answering system for the biomedical domain

BioQA Prototype Offline Subsystem

Page 12: BioQA -  A question answering system for the biomedical domain

BioQA Prototype: Online Subsystem

[Flowchart. Boxes: Accept user’s query; Process user’s query (tag, Lucene syntax, stem, ...); Search database (using Lucene indexes); Filter result; Rank/categorize result; Present results to user; Allow user to modify/choose query patterns. Data stores: Database, Index.]
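The online steps named in the figure can be sketched as a small pipeline. This is only an illustration: the arrows of the original flowchart are not recoverable, so the ordering of the stages is an assumption, a toy in-memory inverted index stands in for Lucene, and the abstracts, stopword list, and function names are all invented. Tagging, stemming, and filtering are left out.

```python
import re
from collections import defaultdict

# Toy "database" of abstracts; the IDs and texts are invented for illustration.
ABSTRACTS = {
    1: "PRNP mutations are linked to prion disease in cattle.",
    2: "Interferon beta is used as a treatment for multiple sclerosis.",
    3: "The role of PRNP in prion diseases remains under study.",
}

# Hypothetical stopword list; "role" is included because it is not a search term.
STOPWORDS = {"what", "is", "the", "of", "in", "a", "role"}


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Offline step: map each token to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def process_query(question):
    """Turn a question into a list of keywords by dropping stopwords."""
    return [t for t in tokenize(question) if t not in STOPWORDS]


def search(index, keywords):
    """Return every document that contains at least one keyword."""
    hits = set()
    for kw in keywords:
        hits |= index.get(kw, set())
    return hits


def rank(docs, hits, keywords):
    """Order hits by how many distinct keywords each one contains."""
    def score(doc_id):
        tokens = set(tokenize(docs[doc_id]))
        return sum(kw in tokens for kw in keywords)
    return sorted(hits, key=lambda d: (-score(d), d))


def answer(question, docs=ABSTRACTS):
    """Run the whole pipeline and return ranked document IDs."""
    index = build_index(docs)
    keywords = process_query(question)
    return rank(docs, search(index, keywords), keywords)
```

Because no stemming is applied, “diseases” in abstract 3 does not match the keyword “disease”, which is one reason the slide lists stemming as part of query processing.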

Page 13: BioQA -  A question answering system for the biomedical domain

Main Components of BioQA Phase 1

- Question Processing and Query Formation
- Entity Recognition
- Indexing
- Pronoun Resolution
- Extraction
- Ranking

Page 14: BioQA -  A question answering system for the biomedical domain

Question Processing and Query Formation

- Process questions so that keywords are extracted to form queries for retrieval
- Incorporate synonyms for the keywords
- Consider the question: “What is the role of PRNP in mad cow disease?”
- First idea
  - Get all the nouns from the question
  - But we do not want a query that includes “role”
- Second idea
  - Identify all the entities from the question and treat them as keywords
  - But what if we are unable to identify some of the entities?

Page 15: BioQA -  A question answering system for the biomedical domain

Question Processing and Query Formation

Third idea – making use of dependency grammar (Link Grammar)

    +-----------------Xp-----------------+
    |            +--------MVp--------+   |
    |            +---Ost---+         |   |
    +---Ws--+Ss*w+   +--Ds-+-Mp-+J+  +J+ |
    |       |    |   |     |    | |  | | |
LEFT-WALL what is.v the role.n of X in Y ?

In the following rule, N1 = “role” and N2 = “X” in the question. The rule's own variable X is unrelated to the placeholder X in the question: it binds to the word that links N1 to N2 (here, “of”).

keyword(N2) :- noun(N1), noun(N2), Mp(N1,X), J(X,N2).
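As a rough illustration of how the rule above could be evaluated, the sketch below applies it to a hand-written list of links for the PRNP question. No parser is called: the link triples are transcribed from the parse diagram, “mad cow disease” is treated as a single token, and the names `LINKS`, `NOUNS`, and `keywords_from_links` are hypothetical.

```python
# Links of the parse, written as (label, left_word, right_word).
# They are transcribed by hand from the diagram; no parser is called here.
LINKS = [
    ("Ws", "LEFT-WALL", "what"),
    ("Ss*w", "what", "is"),
    ("Ost", "is", "role"),
    ("Ds", "the", "role"),
    ("Mp", "role", "of"),
    ("J", "of", "PRNP"),
    ("MVp", "is", "in"),
    ("J", "in", "mad cow disease"),
]
# Assumption: a part-of-speech step has already marked these words as nouns.
NOUNS = {"role", "PRNP", "mad cow disease"}


def keywords_from_links(links, nouns):
    """Apply keyword(N2) :- noun(N1), noun(N2), Mp(N1,X), J(X,N2)."""
    found = []
    for label1, n1, x in links:
        if label1 != "Mp" or n1 not in nouns:
            continue
        # x is the connecting word (e.g. "of"); look for J(x, N2).
        for label2, left, n2 in links:
            if label2 == "J" and left == x and n2 in nouns:
                found.append(n2)
    return found
```

With these links the rule yields only “PRNP”. The disease name hangs off “in”, which is attached through an MVp link, so a second rule of the same shape would be needed to pick it up.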

Page 16: BioQA -  A question answering system for the biomedical domain

Entity Recognition

- To recognize gene symbols, disease names
- Lots of resources on
  - gene symbols: Entrez Gene, HUGO, …
  - disease names: MeSH, UMLS, …
- Why is Entity Recognition still an issue?
  - “CDC28” can be written as “Cdc28”, “Cdc28p”, “cdc-28”
  - “hairy” is a gene name
  - “GSS” is a synonym of “PRNP”, but “GSS” itself is also a gene which is unrelated to “PRNP”!
- Two tasks
  - Recognize gene names given a biomedical article
  - Generate gene symbol synonyms and variants given a gene symbol in a query
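The second task, generating variants of a gene symbol, might look like the sketch below. The rules used here (case changes, a hyphen before trailing digits, a trailing “p” for the protein product) are illustrative assumptions based on the “CDC28” example, not BioQA's actual rule set, and `gene_symbol_variants` is a hypothetical name.

```python
import re


def gene_symbol_variants(symbol):
    """Generate simple spelling variants of a gene symbol.

    The rules (case changes, a hyphen before trailing digits, a trailing
    "p" for the protein product) are illustrative assumptions.
    """
    variants = {symbol, symbol.upper(), symbol.lower(), symbol.capitalize()}
    # Insert a hyphen between the letters and the trailing digits.
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", symbol)
    if match:
        letters, digits = match.groups()
        for form in (letters.upper(), letters.lower(), letters.capitalize()):
            variants.add(f"{form}-{digits}")
    # Add the protein-style name, e.g. Cdc28 -> Cdc28p.
    variants.add(symbol.capitalize() + "p")
    return sorted(variants)
```

Calling `gene_symbol_variants("CDC28")` produces a list that includes the three spellings from the slide. Rules like these over-generate, so in practice the variants would need to be checked against a dictionary or the index.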

Page 17: BioQA -  A question answering system for the biomedical domain

Entity Recognition

- Various approaches:
  - Machine learning techniques to recognize names on the basis of their characteristic features
  - Dictionary-based methods with generation of variants
  - Dictionary-based + Part-of-Speech methods
  - Rule-based methods
- Some of the best Entity Taggers:
  - ABNER
  - GAPSCORE

Page 18: BioQA -  A question answering system for the biomedical domain

Anaphora Resolution

- Pronominal Anaphors
  - Resolving third-person pronouns and reflexive pronouns
  - Example: “BRCA1 interacts with Smad2. It also interacts with Smad3.”
- Sortal Anaphors
  - “In this report, we show that virus infection of cells results in a dramatic hyperacetylation of histones H3 and H4 that is localized to the IFN-beta promoter. … Thus, coactivator-mediated localized hyperacetylation of histones may play a crucial role in inducible gene expression.” [PMID: 10024886]
  - Which histones?
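For the pronominal case, a deliberately naive heuristic is sketched below: a sentence-initial “It” is replaced by the first known entity of the preceding sentence. This is an assumption made for illustration, not the resolution method used in BioQA, and it does not attempt sortal anaphors such as “histones”. The entity set stands in for the output of an entity tagger.

```python
import re

# Stand-in for the output of an entity tagger (hypothetical).
KNOWN_ENTITIES = {"BRCA1", "Smad2", "Smad3"}


def resolve_it(text, entities=KNOWN_ENTITIES):
    """Replace a sentence-initial "It" with the first entity of the previous sentence."""
    sentences = re.split(r"(?<=\.)\s+", text.strip())
    resolved = []
    previous_subject = None
    for sentence in sentences:
        words = sentence.split()
        if words and words[0] == "It" and previous_subject:
            sentence = " ".join([previous_subject] + words[1:])
            words = sentence.split()
        # Remember the first known entity in this sentence as its subject.
        for word in words:
            token = word.strip(".,;")
            if token in entities:
                previous_subject = token
                break
        resolved.append(sentence)
    return " ".join(resolved)
```

On the slide's example this rewrites the second sentence as “BRCA1 also interacts with Smad3.”, which is the form an extraction step needs.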

Page 19: BioQA -  A question answering system for the biomedical domain

Anaphora Resolution

“Ethanol was found to inhibit the function of this chimeric receptor in a manner similar to that of nACh alpha 7 receptors. Because the inhibition transfers with the amino-terminal domain of the receptor, the observations suggest that the amino-terminal domain of the receptor is involved in the inhibition.” [PMID: 8863848]

Page 20: BioQA -  A question answering system for the biomedical domain

Extraction

- To extract knowledge from text
  - Knowledge such as protein-protein interactions, gene-disease relations, …
  - Can be used in presenting answers
- Extracting protein-protein interactions
  - “Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation.” [PMID: 15920482]
  - Should extract the following interactions from the above text:
    - Cdc28 binds Clb2
    - Swe1 is phosphorylated by Clb2-Cdc28 complex
    - Cdc5 is involved in Swe1 phosphorylation

Page 21: BioQA -  A question answering system for the biomedical domain

Extraction

- Extraction of other relations
  - “… Furthermore, PACT colocalized with viral replication complex in the infected cells. Thus the observed effect of PACT is novel and PACT is involved in the regulation of viral replication …” [PMID: 11401490]
  - Should extract the following relations from the above text:
    - PACT colocalized with viral replication complex in the infected cells
    - PACT is involved in the regulation of viral replication

Page 22: BioQA -  A question answering system for the biomedical domain

Extraction

- Two main directions towards extraction:
  - Cooccurrence
    - Identify entities that co-occur within abstracts
    - Frequency-based scoring scheme to rank the extracted relationships
  - NLP
    - Combine the analysis of syntax and semantics
    - Using extraction rules that are implemented manually or learned automatically from an annotated corpus
- However,
  - Cooccurrences sometimes do not actually mean correct relations
  - Cannot infer directional relationships from cooccurrences
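The cooccurrence direction can be illustrated with a few lines of counting. The sketch assumes that an entity tagger has already reduced each abstract to a set of entity names; the sets below are invented, and a raw abstract count stands in for whatever frequency-based score a real system would use.

```python
from collections import Counter
from itertools import combinations

# Toy abstracts, already reduced to the entities a tagger found in each one.
# The entity sets are invented for illustration.
TAGGED_ABSTRACTS = [
    {"PRNP", "prion disease"},
    {"PRNP", "prion disease", "GSS"},
    {"IDE", "Alzheimer's disease"},
    {"PRNP", "GSS"},
]


def cooccurrence_counts(tagged_abstracts):
    """Count how many abstracts mention each unordered pair of entities."""
    counts = Counter()
    for entities in tagged_abstracts:
        for pair in combinations(sorted(entities), 2):
            counts[pair] += 1
    return counts


def ranked_pairs(tagged_abstracts):
    """Rank entity pairs by co-occurrence frequency, most frequent first."""
    counts = cooccurrence_counts(tagged_abstracts)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

The pairs are unordered, which reflects the limitation noted on the slide: a count says that two entities appear together, not which one acts on the other or whether the relation is real.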

Page 23: BioQA -  A question answering system for the biomedical domain

Hard Lessons learned from TREC

- Synonyms from a gene dictionary are NOT enough
  - Generating gene symbol variants is essential
- One query is not enough to do the job
  - Generate query variants, which are slight variations of the original query.
  - For instance, the query “inhibitory synaptic transmission” can have the variants “synaptic transmission” and “inhibitory transmission”.
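One simple way to produce such query variants is to drop words while keeping the rest in order. The sketch below does that; `query_variants` and its `min_words` parameter are hypothetical, and the slides do not say how BioQA chooses which words to drop.

```python
from itertools import combinations


def query_variants(query, min_words=2):
    """Return shorter variants of a query made by dropping words.

    Word order is preserved, and variants shorter than `min_words` are skipped.
    """
    words = query.split()
    variants = []
    for size in range(len(words) - 1, min_words - 1, -1):
        for kept in combinations(words, size):
            variants.append(" ".join(kept))
    return variants
```

For “inhibitory synaptic transmission” this returns the two variants from the slide plus “inhibitory synaptic”, so a real system would probably filter the candidates, for example by requiring the head noun to be kept.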

Page 24: BioQA -  A question answering system for the biomedical domain

more….

- Abstracts related to a gene family can be relevant as well
  - Suppose we want to know about the gene COPII; we may want to know about COP and COPI as well
- Abstracts can merely mention an entity as an example
  - e.g. [PMID 10232877]: GSTM1 is mentioned as being related to breast cancer as an example, but the article is about GSTM1 and alcoholism.

Page 25: BioQA -  A question answering system for the biomedical domain

Future Components

- Structural Feedback
- Answer Presentation
- Semantics of Words
- Simple Reasoning using Domain Knowledge

Page 26: BioQA -  A question answering system for the biomedical domain

Structural Feedback

- Problem: Can we use the underlying “structures” among the relevant articles to improve the retrieval process? [IBM]
- Goal: To learn the “structures” of abstracts that are identified as relevant.
- Idea: Learn the structure of articles (such as common words, MeSH terms)
  - identified to be relevant by domain experts
  - identified to be relevant by users

Page 27: BioQA -  A question answering system for the biomedical domain

Answer Presentation

- To present answers to users in a precise and concise manner
- Current Status: relevant “answers” are presented to the users in the form of abstracts
- Problem: Not concise enough for users
- Ideas:
  - Retrieve a small passage of text, based on proximity of keywords [LCC02] and simple cosine similarity between sentences [Singapore05].
  - Extraction using NLP
  - Use text summarization techniques to present answers [PSB06].
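The cosine-similarity idea can be sketched with plain bag-of-words vectors: score each sentence of an abstract against the question and return the best one. This is a minimal sketch, not the method of any of the cited papers; the tokenizer, the sentence splitter, and the function names are assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphanumeric tokens in a string."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two bag-of-words vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_sentence(question, abstract):
    """Return the sentence of the abstract most similar to the question."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]
    q = bag_of_words(question)
    return max(sentences, key=lambda s: cosine_similarity(q, bag_of_words(s)))
```

Raw term counts give common words such as “the” as much weight as gene names, so a weighting scheme such as TF-IDF would be a natural refinement.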

Page 28: BioQA -  A question answering system for the biomedical domain

Semantics of Words

- WordNet – a resource that provides synonyms of words in different senses; relations between words
- Question: “What is the role of IDE in Alzheimer’s Disease?”
- Abstract (PMID:12161276): “… IDE plays in the degradation and clearance of human amyloid beta from microglial cells and neurons …”
- Semantic relation between “role” and “play” [from WordNet]:
  - role: function, purpose, role, use
  - play: is_a(play_use)
  - So we can say “role”, “play”, “use” are related.
- Answer: The role of IDE is in the degradation and clearance of human amyloid beta from microglial cells and neurons.
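The relatedness check between “role” and “play” can be mimicked with a tiny hand-written lexicon. The entries below are copied from the slide rather than queried from WordNet, and reading `is_a(play_use)` as “play is a kind of use” is an assumption; the names `SYNONYMS`, `IS_A`, and `related` are hypothetical.

```python
# Hand-copied from the slide: a WordNet-style synonym set for "role" and an
# is_a link from "play" to "use". A real system would query WordNet instead.
SYNONYMS = {"role": {"function", "purpose", "role", "use"}}
IS_A = {"play": "use"}


def related(word_a, word_b):
    """True if the two words share a synonym or an is_a parent."""
    def expand(word):
        terms = {word} | SYNONYMS.get(word, set())
        if word in IS_A:
            terms.add(IS_A[word])
        return terms
    return bool(expand(word_a) & expand(word_b))
```

Here `related("role", "play")` holds because both words expand to “use”, which is the link that lets the sentence containing “plays” be matched to a question about a “role”.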

Page 29: BioQA -  A question answering system for the biomedical domain

Simple Reasoning using Domain Knowledge (Example 1)

- Question: “Does IDE play a role in Alzheimer’s Disease (AD)?”
- Retrieved Abstract (PMID:12161276): “… The insulin degrading enzyme (IDE) is an attractive candidate gene since previous studies have identified a possible role that IDE plays in the degradation and clearance of human amyloid beta from microglial cells and neurons …”
- Domain knowledge:
  - AD is a nervous system disease.
  - Neurons are related to the nervous system.
- Answer: Yes, IDE plays a role in AD because AD is a nervous system disease and IDE plays a role in the degradation and clearance of human amyloid beta from microglial cells and neurons.

Page 30: BioQA -  A question answering system for the biomedical domain

Simple Reasoning using Domain Knowledge (Example 2)

- Question: Is MMS2 involved in cancer?
- Domain Knowledge about MMS2
  - MMS2 is known to be involved in biological processes such as cell proliferation and the ubiquitin cycle, based on the Gene Ontology.
  - Cell proliferation – cell growth
  - Ubiquitin cycle – regulating proteins' half-lives

Page 31: BioQA -  A question answering system for the biomedical domain

Simple Reasoning using Domain Knowledge (Example 2 cont.)

- Domain Knowledge about cancer
  - Abnormal growth of tissues
  - Sometimes in cancer, we find that the ubiquitin cycle is deregulated, leading to certain proteins having extra long or extra short half-lives.
- Answer: Yes. Since MMS2 is involved in regulating cell proliferation and the ubiquitin cycle, MMS2 is possibly involved in cancer.
- Challenges:
  - How to represent such knowledge
  - Where to get such domain knowledge
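One possible representation of the knowledge in Example 2 is a pair of lookup tables joined on shared biological processes. The sketch below is only that: the facts are hand-copied from the slides, mapping “abnormal growth of tissues” to the process “cell proliferation” is an assumption, and the table and function names are hypothetical.

```python
# Facts hand-copied from the slides; the table names are hypothetical.
INVOLVED_IN = {"MMS2": {"cell proliferation", "ubiquitin cycle"}}
# Processes whose disruption the slides associate with a disease.
# Assumption: "abnormal growth of tissues" is mapped to "cell proliferation".
DISEASE_PROCESSES = {"cancer": {"cell proliferation", "ubiquitin cycle"}}


def possibly_involved(gene, disease):
    """Return (answer, shared processes) for "is gene involved in disease?"."""
    gene_processes = INVOLVED_IN.get(gene, set())
    disease_processes = DISEASE_PROCESSES.get(disease, set())
    shared = sorted(gene_processes & disease_processes)
    return bool(shared), shared
```

The shared processes double as the explanation for the answer, matching the “Yes, since …” form on the slide. The answer remains a “possibly”, because sharing a process does not establish a causal role.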

Page 32: BioQA -  A question answering system for the biomedical domain

Potential Projects

- Learning
  - Structural Feedback
  - Rules for describing keywords in questions
- Answer Presentation
  - Passage retrieval, extraction
- Extraction
  - gene-disease, gene-biological process relations
- Sortal Resolution
- Semantics of Words

Page 33: BioQA -  A question answering system for the biomedical domain

References

- Literature mining for the biologist: from information retrieval to biological discovery. Lars Juhl Jensen, Jasmin Saric and Peer Bork. Nature Reviews Genetics 7, 119-129 (February 2006).
- Anaphora Resolution
  - Anaphora Resolution in Biomedical Literature. Jose Castano, Jason Zhang, James Pustejovsky.
- Extraction of Gene-Disease Relations
  - Association of genes to genetically inherited diseases using data mining. Perez-Iratxeta C, Bork P, Andrade MA. Nature Genetics 31, 316-319 (2002).
  - G2D: A Tool for Mining Genes Associated to Disease. Perez-Iratxeta C, Wjst M, Bork P, Andrade MA. BMC Genetics 6, 45 (2005).
  - Extraction of Gene-Disease Relations from Medline Using Domain Dictionaries and Machine Learning. Hong-Woo Chun, Yoshimasa Tsuruoka, Jin-Dong Kim, Rie Shiba, Naoki Nagata, Teruyoshi Hishiki, and Jun'ichi Tsujii. PSB 2006.
- Structural Feedback
  - [IBM] Rie Kubota Ando, Mark Dredze, Tong Zhang. TREC 2005 Genomics Track Experiments at IBM Watson.

Page 34: BioQA -  A question answering system for the biomedical domain

References

- Answer Presentation
  - [LCC02] Dan I. Moldovan, Mihai Surdeanu: On the Role of Information Retrieval and Information Extraction in Question Answering Systems. SCIE 2002: 129-147.
  - [Singapore05] Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan and Tat-Seng Chua. Question Answering Passage Retrieval Using Dependency Relations. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development of Information Retrieval (SIGIR 2005), Salvador, Brazil, August 15-19, 2005.
  - [PSB06] Zhiyong Lu, K. Bretonnel Cohen, and Lawrence Hunter. Finding GeneRIFs via Gene Ontology Annotations. To appear in PSB 2006.
- WordNet Resources
  - [WordNetSim] Pedersen, Patwardhan, and Michelizzi. WordNet::Similarity - Measuring the Relatedness of Concepts. Appears in the Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04), July 25-29, 2004, San Jose, CA (Intelligent Systems Demonstration).
  - [SenseRelate] Michelizzi. Semantic Relatedness Applied to All Words Sense Disambiguation. Master of Science Thesis, Department of Computer Science, University of Minnesota, Duluth, July, 2005.