

1

The BioText Project

Myers Seminar, Sept 22, 2003

Marti Hearst, Associate Professor

SIMS, UC Berkeley

Project sponsored by NSF DBI-0317510, ARDA AQUAINT, and a gift from Genentech

2

BioText Project Goals

• Provide fast, flexible, intelligent access to information for use in biosciences applications.

• Focus on
– Textual Information
– Tightly integrated with other resources
  • Ontologies
  • Record-based databases

3

People

• Project Leaders:
– PI: Marti Hearst
– Co-PI: Adam Arkin

• Computational Linguistics
– Barbara Rosario
– Preslav Nakov

• Database Research
– Ariel Schwartz
– Gaurav Bhalotia (graduated)

• User Interface / Information Retrieval
– Kevin Li
– Emilia Stoica

• Bioscience
– Dr. TingTing Zhang

4

Outline

• Main Goals
– System Architecture
– Apoptosis problem statement

• Recent results in
– Abbreviation definition recognition
– Semantic relation recognition (from text)
– Search User Interfaces
– Hierarchical grouping of journals

5

BioText: Main Goals

Sophisticated Text Analysis

Annotations in Database

Improved Search Interface

6

Recent Result (Schwartz & Hearst 03)

• Fast, simple algorithm for recognizing abbreviation definitions.
– Simpler and faster than the rest
– Higher precision and recall
– Idea: Work backwards from the end

• Examples:
– In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF).
– Gcn5-related N-acetyltransferase (GNAT)

• Idea: use redundancy across abstracts to figure out abbreviation meaning even when definition is not present.
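The "work backwards from the end" idea can be sketched in a few lines of Python. This is a minimal illustration rather than the project's own code: the function names, the regular expression for parenthesized short forms, and the rule that the first abbreviation character must start a word are assumptions made for this sketch.

```python
import re


def find_long_form(short_form, preceding_text):
    """Scan both strings right to left, matching each abbreviation
    character against the text that precedes the parenthesis."""
    s = len(short_form) - 1
    t = len(preceding_text) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():          # skip punctuation inside the short form
            s -= 1
            continue
        # Move left until the character matches. Assumption: the first
        # character of the short form must sit at the start of a word.
        while (t >= 0 and preceding_text[t].lower() != ch) or (
            s == 0 and t > 0 and preceding_text[t - 1].isalnum()
        ):
            t -= 1
        if t < 0:
            return None               # ran out of text: no definition found
        t -= 1
        s -= 1
    # Extend to the beginning of the word containing the first match.
    start = preceding_text.rfind(" ", 0, t + 1) + 1
    return preceding_text[start:]


def extract_definitions(sentence):
    """Return (short form, long form) pairs for 'long form (SF)' patterns."""
    pairs = []
    for m in re.finditer(r"\(([^()]{2,10})\)", sentence):
        short = m.group(1).strip()
        long_form = find_long_form(short, sentence[: m.start()].rstrip())
        if long_form:
            pairs.append((short, long_form))
    return pairs


sentence = ("In eukaryotes, the key to transcriptional regulation of the "
            "Heat Shock Response is the Heat Shock Transcription Factor (HSF).")
hsf_pairs = extract_definitions(sentence)
gnat_pairs = extract_definitions("Gcn5-related N-acetyltransferase (GNAT)")
```

On the two slide examples this recovers "Heat Shock Transcription Factor" for HSF and "Gcn5-related N-acetyltransferase" for GNAT. Scanning from the right means the match stays anchored next to the parenthesis instead of wandering to earlier occurrences of the same letters.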

7

BioText: A Two-Sided Approach

Sophisticated Database Design & Algorithms

Empirical Computational Linguistics Algorithms

Resources shown: SwissProt, Blast, MeSH, GO, WordNet, Medline, Journal Full Text

8

Apoptosis Network

Diagram node labels: Death Receptors Signaling; Survival Factors Signaling; Ca++ Signaling; ER Stress; Genotoxic Stress; Loss of Attachment, Cell Cycle stress, etc; P53 pathway; NFkB; Initiator Caspases (8, 10); Caspase 12; Caspase 9; Effector Caspases (3, 6, 7); Apaf 1; IAPs; Smac; AIF; Mitochondria Cytochrome c; Bax, Bak; Bcl-2 like; BH3 only; Apoptosis

Slide courtesy TingTing Zhang

9

The issues (courtesy TingTing Zhang):

• The network nodes are deduced from reading and processing of experimental knowledge by experts. Every month >1000 apoptosis papers are published.

• The supporting experimental data are gathered in different organs, tissues, cells using various techniques.

• There are various levels of uncertainty associated with different techniques used to answer certain questions.

• Depending on the expression patterns for the players in the network, the observation may or may not be extended to other contexts.

• We need to keep track of ALL the information in order to understand the system better.

10

Simple cases:

• Mouse Bim proteins (isoforms EL, L, S) bind to human Bcl-2 (bacteriophage screening using cDNA expression library from T-Lymphoma cell line KO52DA20).

• Human BimEL protein is 89% identical to mouse BimEL; human BimL is 85% identical to mouse BimL (hybridization of mouse bim cDNA to human fetal spleen and peripheral blood cDNA library).

• Bim mRNA is detected in B and T lymphoid cells (Northern blot analysis of mouse KO52DA20, WEHI 703, WEHI 707, WEHI7.1, CH1, WEHI231, WEHI415, B6.23.16BW2 cell extracts).

• BimL protein interacts with Bcl-2 OR Bcl-XL OR Bcl-w proteins (immunoprecipitation (anti-Bcl-2 OR Bcl-XL OR Bcl-w) followed by Western blot (anti-EEtag) using extracts of human 293T cells co-transfected with EE-tagged BimL AND (bcl-2 OR bcl-XL OR bcl-w) plasmids).

• BimL deleted of the BH3 domain does not bind to Bcl-2 OR Bcl-XL OR Bcl-w proteins (under the experimental conditions mentioned above).
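Each simple case pairs a finding with the experiment that supports it. One way such statements might be held as records is sketched below. The class and field names are hypothetical and are not taken from the BioText schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    technique: str         # e.g. "Northern blot analysis"
    source_material: str   # organism, tissue, or cell line the data came from


@dataclass
class Assertion:
    subject: str           # e.g. "BimL protein"
    relation: str          # e.g. "binds"
    objects: List[str]     # alternatives that the slide joins with OR
    negated: bool = False  # True for "does not bind" style findings
    evidence: List[Evidence] = field(default_factory=list)

    def describe(self) -> str:
        prefix = "NOT " if self.negated else ""
        return f"{prefix}{self.subject} | {self.relation} | {' OR '.join(self.objects)}"


# The last simple case, written as a record (wording paraphrased from the slide).
no_bh3 = Assertion(
    subject="BimL deleted of the BH3 domain",
    relation="binds",
    objects=["Bcl-2", "Bcl-XL", "Bcl-w"],
    negated=True,
    evidence=[Evidence("immunoprecipitation followed by Western blot",
                       "human 293T cells")],
)
summary = no_bh3.describe()
```

Keeping technique and source material next to each assertion is what would let a later query separate, for example, mouse cell-line findings from human ones.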

11

Computational Language Goals

• Recognizing and annotating entities within textual documents

• Identifying semantic relations among entities

• To (eventually) be used in tandem with semi-automated reasoning systems.

12

Main Ideas for NLP Approach

• Assign Semantics using
– Statistics
– Hierarchical Lexical Ontologies to generalize
– Redundancy in the data

• Build up Layers of Representation
– Syntactic and Semantic
– Use these in a feedback loop

13

Computational Linguistics Goals

• Mark up text with semantic relations

14

Recent Result: Descent of Hierarchy

• Idea:
– Use the top levels of a lexical hierarchy to identify semantic relations

• Hypothesis:
– A particular semantic relation holds between all 2-word Noun Compounds that can be categorized by a MeSH pair.

15

Definition

• NC: Any sequence of nouns that itself functions as a noun
– asthma hospitalizations
– health care personnel hand wash

• Technical text is rich with NCs
– Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment.

16

NCs: Three tasks

• Identification

• Syntactic analysis (attachments)
– [Baseline [headache frequency]]
– [[Tension headache] patient]

• Our Goal: Semantic analysis
– Headache treatment → treatment for headache
– Corticosteroid treatment → treatment that uses corticosteroid

17

Main Idea:

• Top-level MeSH categories can be used to indicate which relations hold between the nouns of a compound

• headache recurrence
– C23.888.592.612.441 C23.550.291.937

• headache pain
– C23.888.592.612.441 G11.561.796.444

• breast cancer cells
– A01.236 C04 A11

18

Linguistic Motivation

• Can cast NC into head-modifier relation, and assume head noun has an argument and qualia structure.
– (used-in): kitchen knife
– (made-of): steel knife
– (instrument-for): carving knife
– (used-on): putty knife
– (used-by): butcher’s knife

19

Distribution of Frequent Category Pairs

20

How Far to Descend?

• Anatomy: 250 CPs (category pairs)
– 187 (75%) remain first level
– 56 (22%) descend one level
– 7 (3%) descend two levels

• Natural Science (H01): 21 CPs
– 1 (4%) remain first level
– 8 (39%) descend one level
– 12 (57%) descend two levels

• Neoplasm (C04): 3 CPs
– 3 (100%) descend one level

21

Evaluation

• Apply the rules to a test set

• Accuracy:
– Anatomy: 91% accurate
– Natural Science: 79%
– Diseases: 100%

• Total:
– 89.6% via intra-category averaging
– 90.8% via extra-category averaging

22

Summary of NC Work

• Lexical hierarchy useful for inferring semantic relations

• Works because semantics are constrained and word sense ambiguity is not too much of a problem

• Can it be extended to other types of relations?
– Preliminary results on one set of relations are promising.

23

Database Research Issues

• Efficiently and effectively combining
– Relational databases & Text
– Hierarchical Ontologies
– Layers of Annotations

24

Interface Issues

• Create intuitive, appealing interfaces that are better than what’s currently out there.

• Start with existing assigned metadata

• As text analysis improves, incorporate the results into the interface.

25

26

27

28

29

Some Recent Work

• Organizing BioScience Journal Names– Currently there are > 3500

30

31

32

Some Recent Work

• Organizing BioScience Journal Names– Currently there are > 3500

• Idea:
– Group them into faceted hierarchies semi-automatically
– Using clustering of title terms, synonym similarity via WordNet, and other techniques
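As a rough illustration of grouping by title terms, the sketch below indexes journal names by the content words in their titles. It is a much simpler stand-in for the clustering mentioned above, with no WordNet step. The stopword list, function names, and the four sample journal names are chosen for this example only.

```python
from collections import defaultdict

# Assumption: a tiny stopword list that is enough for these sample titles.
STOPWORDS = {"of", "the", "and", "in", "for", "journal"}


def title_terms(name):
    """Lowercased content words of a journal title."""
    return {w for w in name.lower().replace(",", " ").split() if w not in STOPWORDS}


def group_by_term(journals, min_size=2):
    """Map each title term to the journals containing it,
    keeping only terms shared by at least `min_size` journals."""
    groups = defaultdict(list)
    for name in journals:
        for term in sorted(title_terms(name)):
            groups[term].append(name)
    return {t: names for t, names in groups.items() if len(names) >= min_size}


journals = [
    "Journal of Cell Biology",
    "Molecular Cell",
    "Journal of Molecular Biology",
    "Nature Genetics",
]
groups = group_by_term(journals)
```

Terms shared by several titles ("cell", "biology", "molecular") become candidate facet labels, and a journal can sit under more than one of them, which is the property a faceted hierarchy needs.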

33

34

35

Summary

• BioText aims to improve access to bioscience information via
– Sophisticated language analysis
– Integration of results into
  • Annotated database
  • Flexible user interface

• Eventual goal
– Semi-automated mining and discovery

36

There’s lots to do!

For more information: biotext.berkeley.edu