Supporting Annotation Layers for Natural Language Processing Marti Hearst, Preslav Nakov, Ariel Schwartz, Brian Wolf, Rowena Luk UC Berkeley Stanford InfoSeminar March 17, 2006 Supported by NSF DBI-0317510 And a gift from Genentech

Supporting Annotation Layers for Natural Language Processing

Marti Hearst, Preslav Nakov, Ariel Schwartz, Brian Wolf, Rowena Luk

UC Berkeley

Stanford InfoSeminar, March 17, 2006

Supported by NSF DBI-0317510

And a gift from Genentech

UC Berkeley Biotext Project

Outline

• Motivation: NLP tasks

• System Description
  - Annotation architecture
  - Sample queries

• Database Design and Evaluation

• Related Work

• Future Work

Double Exponential Growth in Bioscience Journal Articles

From Hunter & Cohen, Molecular Cell 21, 2006

BioText Project Goals

• Provide flexible, intelligent access to information for use in biosciences applications.

• Focus on textual information from journal articles
  - Tightly integrated with other resources
    - Ontologies
    - Record-based databases

Project Team

• Project Leaders
  - PI: Marti Hearst
  - Co-PI: Adam Arkin

• Computational Linguistics and Databases
  - Preslav Nakov
  - Ariel Schwartz
  - Brian Wolf
  - Barbara Rosario (alum)
  - Gaurav Bhalotia (alum)

• User Interface / IR
  - Rowena Luk
  - Dr. Emilia Stoica

• Bioscience
  - Janice Hamerja
  - Dr. TingTing Zhang (alum)

BioText Architecture

Sophisticated Text Analysis

Annotations in Database

Improved Search Interface

Sample Sentence

“Recent research, in proliferating cells, has demonstrated that interaction of E2F1 with the p53 pathway could involve transcriptional up-regulation of E2F1 target genes such as p14/p19ARF, which affect p53 accumulation [67,68], E2F1-induced phosphorylation of p53 [69], or direct E2F1-p53 complex formation [70].”

Motivation

• Most natural language processing (NLP) algorithms make use of the results of previous processing steps:
  - Tokenizer
  - Part-of-speech tagger
  - Phrase boundary recognizer
  - Syntactic parser
  - Semantic tagger

• No standard way to represent, store and retrieve text annotations efficiently.

• MEDLINE has close to 13 million abstracts. Full text has started to become available as well.

System overview

• A system for flexible querying of text that has been annotated with the results of NLP processing.

• Supports self-overlapping and parallel layers, integration of syntactic and ontological hierarchies, and tight integration with SQL.

• Designed to scale to very large corpora.
  - Most NLP annotation systems assume in-memory usage
  - We’ve evaluated indexing architectures

Text Annotation Framework

• Annotations are stored independently of text in an RDBMS.

• Declarative query language for annotation retrieval.

• Indexing structure designed for efficient query processing.

Key Contributions

• Support for hierarchical and overlapping layers of annotation.

• Querying multiple levels of annotations simultaneously.

• First to evaluate different physical database designs for an NLP annotation architecture.

Layers of Annotations

• Each annotation represents an interval spanning a sequence of characters
  - absolute start and end positions

• Each layer corresponds to a conceptually different kind of annotation
  - Protein, MeSH label, Noun Phrase

• Layers can be
  - Sequential
  - Overlapping (two multiple-word concepts sharing a word)
  - Hierarchical (two different ways):
    - spanning, when the intervals are nested as in a parse tree, or
    - ontologically, when the token itself is derived from a hierarchical ontology
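
The interval view of annotations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Annotation` class, the helper names, the half-open `[start, end)` convention, and the sample phrase offsets are hypothetical and are not taken from the BioText implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str  # e.g. "pos", "shallow_parse", "mesh" (illustrative names)
    start: int  # absolute start character position (inclusive, assumed)
    end: int    # absolute end character position (exclusive, assumed)
    tag: str


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True when the two character intervals share at least one character."""
    return a.start < b.end and b.start < a.end


def contains(outer: Annotation, inner: Annotation) -> bool:
    """True when inner is nested inside outer, as in a parse tree."""
    return outer.start <= inner.start and inner.end <= outer.end


text = "white blood cell death"
# Two multi-word concepts sharing the word "cell" (overlapping layer entries).
blood_cell = Annotation("mesh", 6, 16, "blood cell")
cell_death = Annotation("mesh", 12, 22, "cell death")
# A noun phrase spanning the whole string (hierarchical by spanning).
noun_phrase = Annotation("shallow_parse", 0, 22, "NP")

print(text[blood_cell.start:blood_cell.end], "|", text[cell_death.start:cell_death.end])
print(overlaps(blood_cell, cell_death), contains(noun_phrase, blood_cell))
```

Because each annotation carries its own interval, the two overlapping concepts coexist without either one having to be split or discarded.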

Layers of Annotations

[Figure: annotation layers aligned over the sentence “Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53.”]

Word:            Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53
Word (IDs):      185 8 51112 23017 7 5874 2791 8952 1263 5632 17 8252 8 12523
Part of Speech:  NN IN NN VBZ IN JJ JJ NN NN NN CC NN IN NN
Shallow Parse:   NP PP NP VP PP NP NP PP NP
Ontology:        D019254 D044465 D001769 D002477 D003643 D001773 D016923 D007962 D016158
Gene/protein:    24224596 28102012043 39727642722

Full parse, sentence and section layers are not shown.

Example: Query for Noun Compound Extraction

Goal: find noun phrases consisting ONLY of 3 nouns

plastic water bottle

blue water bottle

big plastic water bottle

FROM
  [layer='shallow_parse' && tag_name='NP'
    ^ [layer='pos' && tag_name="noun"]
      [layer='pos' && tag_name="noun"]
      [layer='pos' && tag_name="noun"] $
  ] AS compound
SELECT compound.content
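
What the `^ ... $` anchors require can be mimicked outside the database. The sketch below is a plain-Python approximation, not LQL and not the BioText code: the function name, the token representation, and the assumption that "plastic" is tagged as a noun are all hypothetical.

```python
def three_noun_compounds(tokens, noun_phrases):
    """Return NPs whose tokens are exactly three nouns, start to end.

    tokens: list of (word, pos_group) pairs.
    noun_phrases: list of (first, last) inclusive token indexes.
    """
    results = []
    for first, last in noun_phrases:
        span = tokens[first:last + 1]
        # The NP must be fully covered by three nouns: nothing before, nothing after.
        if len(span) == 3 and all(group == "noun" for _, group in span):
            results.append(" ".join(word for word, _ in span))
    return results


# Hypothetical tagging of the three example phrases from the slide.
tokens = [
    ("plastic", "noun"), ("water", "noun"), ("bottle", "noun"),
    ("blue", "adjective"), ("water", "noun"), ("bottle", "noun"),
    ("big", "adjective"), ("plastic", "noun"), ("water", "noun"), ("bottle", "noun"),
]
noun_phrases = [(0, 2), (3, 5), (6, 9)]

compounds = three_noun_compounds(tokens, noun_phrases)
print(compounds)
```

Under this tagging only the first phrase qualifies; the other two contain an adjective inside the noun phrase.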

Query for Noun Compound Extraction (SQL wrapping)

SELECT LOWER(compound.content), COUNT(*) AS freq
FROM (
  BEGIN_LQL
    [layer='shallow_parse' && tag_name='NP'
      ^ [layer='pos' && tag_name="noun"]
        [layer='pos' && tag_name="noun"]
        [layer='pos' && tag_name="noun"] $
    ] AS compound
    SELECT compound.content
  END_LQL
) AS lql
GROUP BY LOWER(compound.content)
ORDER BY freq DESC

Query for Noun Compound Extraction (using artificial layers)

Goal: find noun phrases which have EXACTLY two nouns at the end, but no nouns before those two.

“big blue water bottle”

“plastic water bottle”

FROM
  [layer='shallow_parse' && tag_name='NP'
    ^ ( { ALLOW GAPS }
        ![layer='pos' && tag_name="noun"]
        ( [layer='pos' && tag_name="noun"]
          [layer='pos' && tag_name="noun"] ) $
      ) $
  ] AS compound
SELECT compound.content

Example: Paraphrases

• Want to find phrases with certain variations:

immunodeficiency virus(?es) in ?the human(?s)

  - immunodeficiency virus in humans
  - immunodeficiency viruses in humans
  - immunodeficiency virus in the human
  - immunodeficiency virus in a human

Query for Paraphrases (optional layers and disjunction)

[layer='sentence'
  [layer='pos' && tag_name="noun" &&
    content = "immunodeficiency"]
  [layer='pos' && tag_name="noun" &&
    content IN ("virus","viruses")]
  [layer='pos' && tag_name='IN'] AS prep
  ?[layer='pos' && tag_name='DT' &&
    content IN ("the","a","an")]
  [layer='pos' && tag_name="noun" &&
    content IN ("human", "humans")]
] SELECT prep.content
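
For comparison, a rough plain-text approximation of the same pattern can be written as a regular expression. This is a sketch only: the names `PATTERN` and `prepositions` are hypothetical, and because raw text carries no part-of-speech layer, `\w+` accepts any word in the preposition slot, which the LQL query above would not.

```python
import re

# Optional plural on "virus", any single word as the "preposition",
# an optional determiner, then "human" or "humans".
PATTERN = re.compile(
    r"\bimmunodeficiency\s+virus(?:es)?\s+(?P<prep>\w+)\s+(?:(?:the|an|a)\s+)?humans?\b",
    re.IGNORECASE,
)


def prepositions(text: str) -> list[str]:
    """Return the word found in the preposition slot of each match."""
    return [m.group("prep").lower() for m in PATTERN.finditer(text)]


samples = [
    "immunodeficiency virus in humans",
    "immunodeficiency viruses in humans",
    "immunodeficiency virus in the human",
    "immunodeficiency virus in a human",
]
found = [prepositions(s) for s in samples]
print(found)
```

The optional determiner corresponds to the `?[...]` layer in the query, and the alternations correspond to the `content IN (...)` constraints.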

Example: Protein-Protein Interactions

• Find all sentences that consist of an NP containing a gene, followed by a morphological variant of the verb “activate”, “inhibit”, or “bind”, followed by another NP containing a gene.

[Diagram: Sentence = protein, then activate(d, ing) / inhibit(ed, ing) / bind(s, ing), then protein]

Query for Protein-Protein Interactions

SELECT p1_text, verb_content, p2_text, COUNT(*) AS cnt
FROM (
  BEGIN_LQL
    [layer='sentence' { ALLOW GAPS }
      [layer='shallow_parse' && tag_name='NP'
        [layer='gene'] $
      ] AS p1
      [layer='pos' && tag_name="verb" &&
        (content ~ "activate%" || content ~ "inhibit%" || content ~ "bind%")
      ] AS verb
      [layer='shallow_parse' && tag_name='NP'
        [layer='gene'] $
      ] AS p2
    ]
    SELECT p1.text AS p1_text, verb.content AS verb_content, p2.text AS p2_text
  END_LQL
) lql
GROUP BY p1_text, verb_content, p2_text
ORDER BY count(*) DESC

Protein-Protein Interactions: Sample Output

PROTEIN 1                 INTERACTION VERB   PROTEIN 2                 FREQUENCY
Ca2                       activates          protein kinase            312
Cln3                      activate           protein kinase            234
TAP                       binds              transcription factor      192
TNF                       activates          protein tyrosine kinase   133
serine/threonine kinase   binding            RhoA GTPase               132
Phospholamban             inhibits           ATPase                    114
PRL                       activated          transcription factor      108
Interleukin 2             activates          transcription factor      84
Prolactin                 activates          transcription factor      84
AMPA                      activated          protein kinase            78
Nerve growth factor       activates          protein kinase            78
LPS                       inhibited          MHC class II              75
Heat shock protein        Binding            p59                       72
EPO                       activated          STAT5                     63
EGF                       activated          PP2A                      60
cis                       binds              Sp1                       50

Example: Chemical-Disease Interactions

• “A new approach to the respiratory problems of cystic fibrosis is dornase alpha, a mucolytic enzyme given by inhalation.”

• Goal: extract the relation that dornase alpha (potentially) prevents cystic fibrosis.

• MeSH C06.689 subtree contains pancreatic diseases

• MeSH supplementary concepts represent chemicals.

Query on Disease-Chemical Interactions

[layer='sentence' { NO ORDER, ALLOW GAPS }
  [layer='shallow_parse' && tag_name='NP'
    [layer='chemicals'] AS chemical $
  ]
  [layer='shallow_parse' && tag_name='NP'
    [layer='mesh' && tree_number BELOW 'C06.689%'] AS disease $
  ]
] AS sent
SELECT chemical.text, disease.text, sent.text
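
The `BELOW` constraint relies on tree numbers encoding their ancestry as a dotted prefix. A minimal sketch of that test follows; the function name is hypothetical, and treating a node as "below" itself is an assumption about the operator rather than something the slides state.

```python
def is_below(tree_number: str, ancestor: str) -> bool:
    """True if tree_number is the ancestor itself or lies in its subtree.

    Comparing against ancestor + "." avoids matching a sibling whose
    number merely shares the same leading characters.
    """
    return tree_number == ancestor or tree_number.startswith(ancestor + ".")


# Illustrative tree numbers only; the subtree root comes from the slide.
candidates = ["C06.689", "C06.689.202", "C06.552.100", "C06.6891"]
in_subtree = [t for t in candidates if is_below(t, "C06.689")]
print(in_subtree)
```

A relational version of the same idea is a `LIKE 'C06.689%'` predicate, which is what the `%` in the query above suggests.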

Results: Chemical-Disease

Query Translation

Database Design & Evaluation

Database Design

• Evaluated 5 different logical and physical database designs.

• The basic model is similar to that of TIPSTER (Grishman, 1996): each annotation is stored as a record in a relation.

• Architecture 1 contains the following columns:
  1. docid: document ID;
  2. section: title, abstract or body text;
  3. layer_id: a unique identifier of the annotation layer;
  4. start_char_pos: starting character position, relative to particular section and docid;
  5. end_char_pos: end character position, relative to particular section and docid;
  6. tag_type: a layer-specific token unique identifier.

• There is a separate table mapping token IDs to entities (the string in the case of a word, the MeSH label(s) in the case of a MeSH term, etc.)
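
The six-column relation can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the talk used IBM DB2, the table name `annotation` and the column types are hypothetical, and the sample rows are illustrative values for the words of "Kinase inhibits RAG-1." on a word layer (0) and a POS layer (1).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE annotation (
        docid          INTEGER,
        section        TEXT,
        layer_id       INTEGER,
        start_char_pos INTEGER,
        end_char_pos   INTEGER,
        tag_type       INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)",
    [
        (3345, "b", 0, 34, 40, 59571),  # word layer: "Kinase"
        (3345, "b", 0, 41, 49, 55608),  # word layer: "inhibits"
        (3345, "b", 0, 50, 55, 89985),  # word layer: "RAG-1"
        (3345, "b", 1, 34, 40, 27),     # POS layer: NN
        (3345, "b", 1, 41, 49, 53),     # POS layer: VB
        (3345, "b", 1, 50, 55, 27),     # POS layer: NN
    ],
)

# Cross-layer lookup: which POS tag covers the same span as word 55608?
pos_tags = conn.execute(
    """
    SELECT p.tag_type
    FROM annotation AS w
    JOIN annotation AS p
      ON p.docid = w.docid AND p.section = w.section
     AND p.start_char_pos = w.start_char_pos
     AND p.end_char_pos = w.end_char_pos
    WHERE w.layer_id = 0 AND w.tag_type = 55608 AND p.layer_id = 1
    """
).fetchall()
print(pos_tags)
```

Every cross-layer constraint becomes a self-join on positions, which is why the number of joins is one of the measures reported in the evaluation.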

Database Design (cont.)

• Architecture 2 introduces one additional column, sequence_pos, thus defining an ordering for each layer.
  - Simplifies some SQL queries as there is no need for “NOT EXISTS” self joins, which are required under Architecture 1 in cases where tokens from the same layer must follow each other immediately.

• Architecture 3 adds sentence_id, which is the number of the current sentence, and redefines sequence_pos as relative to both layer_id and sentence_id.
  - Simplifies most queries since they are often limited to the same sentence.
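
The difference between the two adjacency tests can be seen side by side in a small `sqlite3` sketch. The SQL below is an assumed shape, not the output of the LQL translator; the table name, the trimmed column set, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE annotation (
        docid INTEGER, layer_id INTEGER,
        start_char_pos INTEGER, end_char_pos INTEGER,
        tag_type INTEGER, sequence_pos INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)",
    [
        (3345, 0, 34, 40, 59571, 1),
        (3345, 0, 41, 49, 55608, 2),
        (3345, 0, 50, 55, 89985, 3),
    ],
)

# Architecture 1 style: b starts after a ends, and no token of the
# same layer fits between them (the NOT EXISTS self join).
ARCH1_ADJACENT = """
SELECT a.tag_type, b.tag_type
FROM annotation AS a
JOIN annotation AS b
  ON b.docid = a.docid AND b.layer_id = a.layer_id
 AND b.start_char_pos >= a.end_char_pos
WHERE NOT EXISTS (
    SELECT 1 FROM annotation AS c
    WHERE c.docid = a.docid AND c.layer_id = a.layer_id
      AND c.start_char_pos >= a.end_char_pos
      AND c.end_char_pos <= b.start_char_pos
)
ORDER BY a.start_char_pos
"""

# Architecture 2 style: adjacency is a simple arithmetic condition.
ARCH2_ADJACENT = """
SELECT a.tag_type, b.tag_type
FROM annotation AS a
JOIN annotation AS b
  ON b.docid = a.docid AND b.layer_id = a.layer_id
 AND b.sequence_pos = a.sequence_pos + 1
ORDER BY a.sequence_pos
"""

arch1_pairs = conn.execute(ARCH1_ADJACENT).fetchall()
arch2_pairs = conn.execute(ARCH2_ADJACENT).fetchall()
print(arch1_pairs == arch2_pairs, arch2_pairs)
```

Both queries return the same adjacent pairs, but the second needs one fewer table reference and no subquery.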

Database Design (cont.)

• Architecture 4 merges the word and POS layers, and adds word_id assuming a one-to-one correspondence between them.
  - Reduces the number of stored annotations and the number of joins in queries with both word and POS constraints.

• Architecture 5 replaces sequence_pos with first_word_pos and last_word_pos, which correspond to the sequence_pos of the first/last word covered by the annotation.
  - Requires all annotation boundaries to coincide with word boundaries.
  - Copes naturally with adjacency constraints between different layers.
  - Allows for a simpler indexing structure.
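
Why word positions make cross-layer adjacency easy can be shown with a short sketch. The `Ann` type and the helper are hypothetical, and `last_word_pos` is assumed here to be inclusive; if the stored convention is exclusive, the `+ 1` simply drops out.

```python
from typing import NamedTuple


class Ann(NamedTuple):
    layer: str
    tag: str
    first_word_pos: int
    last_word_pos: int  # assumed inclusive in this sketch


def immediately_follows(b: Ann, a: Ann) -> bool:
    """True if b starts at the word right after a ends, whatever their layers."""
    return b.first_word_pos == a.last_word_pos + 1


# "Kinase inhibits RAG-1." with one annotation taken from each of three layers.
np1 = Ann("shallow_parse", "NP", 1, 1)  # "Kinase"
verb = Ann("pos", "VB", 2, 2)           # "inhibits"
gene = Ann("gene", "prt", 3, 3)         # "RAG-1"

chain_ok = immediately_follows(verb, np1) and immediately_follows(gene, verb)
print(chain_ok)
```

With per-layer sequence numbers this comparison would be meaningless, since position 2 in one layer need not be the same word as position 2 in another.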

Data Layout for all 5 Architectures

Example: “Kinase inhibits RAG-1.”

PMID  SECTION   LAYER_ID      START_CHAR_POS  END_CHAR_POS  TAG_TYPE   SEQUENCE_POS  SENTENCE  WORD_ID  FIRST_WORD_POS  LAST_WORD_POS
3345  b (body)  0 (word)      34              40            59571      1             2         59571    1               2
3345  b         0             41              49            55608      2             2         55608    2               3
3345  b         0             50              55            89985      3             2         89985    3               4
3345  b         1 (POS)       34              40            27 (NN)    1             2         59571    1               2
3345  b         1             41              49            53 (VB)    2             2         55608    2               3
3345  b         1             50              55            27         3             2         89985    3               4
3345  b         5 (gene)      34              40            39 (prt)   1             2         -        1               2
3345  b         5             50              55            39         2             2         -        3               4
3345  b         6 (mesh)      34              40            10770      1             2         -        1               2
3345  b         6             50              55            16654      2             2         -        3               4
3345  b         3 (s.parse)   34              40            31 (NP)    1             2         -        1               2
3345  b         3             41              49            59 (VP)    2             2         -        2               3
3345  b         3             50              55            31         3             2         -        3               4

Column groups: basic architecture (PMID through TAG_TYPE); added in Architecture 2 (SEQUENCE_POS); added in Architecture 3 (SENTENCE); added in Architecture 4 (WORD_ID); added in Architecture 5 (FIRST_WORD_POS, LAST_WORD_POS).

Indexing Structure

• Two types of composite indexes: forward and inverted.
  - An index lookup can be performed on any column combination that corresponds to an index prefix.
  - The forward indexes support lookup based on position in a given document.
  - The inverted indexes support lookup based on annotation values (i.e., tag type and word id).

• Most query plans involve both forward and inverted indexes.
  - Join statistics would have been useful.

• Detailed statistics are essential.
  - Standard statistics in DB2 are insufficient.

• Records are clustered on their primary key.
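
The prefix property of composite indexes can be imitated with sorted tuples and `bisect`. This is a toy model, not how DB2 stores its indexes: the function name, the key layouts, and the sample records are assumptions for illustration.

```python
from bisect import bisect_left


def prefix_lookup(index, prefix):
    """Return all keys in a sorted list of tuples that start with prefix."""
    # A shorter tuple sorts before any longer tuple it is a prefix of,
    # so bisect_left lands on the first matching key.
    lo = bisect_left(index, prefix)
    matches = []
    for key in index[lo:]:
        if key[:len(prefix)] != prefix:
            break
        matches.append(key)
    return matches


# Records as (docid, section, layer_id, start_char_pos, end_char_pos, tag_type).
records = [
    (3345, "b", 0, 34, 40, 59571),
    (3345, "b", 1, 34, 40, 27),
    (3345, "b", 1, 41, 49, 53),
    (3345, "b", 1, 50, 55, 27),
]

# Forward index: position-first key order.
forward = sorted(records)
# Inverted index: value-first key order (layer_id, tag_type, then position).
inverted = sorted((l, t, d, s, a, b) for d, s, l, a, b, t in records)

by_position = prefix_lookup(forward, (3345, "b", 1))
by_value = prefix_lookup(inverted, (1, 27))
print(len(by_position), by_value)
```

The same records answer "what is annotated in this document?" through the forward order and "where does this tag occur?" through the inverted order, which matches the observation that most query plans use both.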

Indexing Structure (cont.)

Type: F = forward index, I = inverted index.

Architecture  Type  Columns
Arch 1-4      F     *DOCID +SECTION +LAYER_ID +START_CHAR_POS +END_CHAR_POS +TAG_TYPE
Arch 1-4      I     LAYER_ID +TAG_TYPE +DOCID +SECTION +START_CHAR_POS +END_CHAR_POS
Arch 2        F     DOCID +SECTION +LAYER_ID +SEQUENCE_POS +TAG_TYPE +START_CHAR_POS +END_CHAR_POS
Arch 2        I     LAYER_ID +TAG_TYPE +DOCID +SECTION +SEQUENCE_POS +START_CHAR_POS +END_CHAR_POS
Arch 3-4      F     DOCID +SECTION +LAYER_ID +SENTENCE +SEQUENCE_POS +TAG_TYPE +START_CHAR_POS +END_CHAR_POS
Arch 3-4      I     LAYER_ID +TAG_TYPE +DOCID +SECTION +SENTENCE +SEQUENCE_POS +START_CHAR_POS +END_CHAR_POS
Arch 4        I     WORD_ID +LAYER_ID +TAG_TYPE +DOCID +SECTION +START_CHAR_POS +END_CHAR_POS +SENTENCE +SEQUENCE_POS
Arch 5        F     *DOCID +SECTION +LAYER_ID +SENTENCE +FIRST_WORD_POS +LAST_WORD_POS +TAG_TYPE
Arch 5        I     LAYER_ID +TAG_TYPE +DOCID +SECTION +SENTENCE +FIRST_WORD_POS +LAST_WORD_POS
Arch 5        I     WORD_ID +LAYER_ID +TAG_TYPE +DOCID +SECTION +SENTENCE +FIRST_WORD_POS

Experimental Setup

• Annotated 13,504 MEDLINE abstracts
  - Stanford Lexicalized Parser (Klein and Manning, 2003) for sentence splitting, word tokenization, POS tagging and parsing.
  - We wrote a shallow parser and tools for gene and MeSH term recognition.

• This resulted in 10,910,243 records stored in an IBM DB2 Universal Database Server.

• Defined 4 workloads based on variants of queries.

Experimental Setup: 4 Workloads

(a) Protein-Protein Interaction (Blaschke et al., 1999)

[layer='sentence' {ALLOW GAPS}
  [layer='gene'] AS gene1
  [layer='pos' && tag_name="verb" && content="binds"] AS verb
  [layer='gene'] AS gene2
] SELECT gene1.content, verb.content, gene2.content

(b) Protein-Protein Interaction (Thomas et al., 2000)

[layer='sentence'
  [layer='shallow_parse' && tag_name="NP"] AS np1
  [layer='pos' && tag_name="verb" && content='binds'] AS verb
  [layer='pos' && tag_name="prep" && content='to']
  [layer='shallow_parse' && tag_name="NP"] AS np2
] SELECT np1.content, verb.content, np2.content

(c) Descent of Hierarchy (Rosario et al., 2002)

[layer='shallow_parse' && tag_name="NP"
  [layer='pos' && tag_name="noun" ^ [layer='mesh' && tree_number BELOW "G07.553"] AS m1 $ ]
  [layer='pos' && tag_name="noun" ^ [layer='mesh' && tree_number BELOW "D"] AS m2 $ ]
] SELECT m1.content, m2.content

Labels from the slide figure: A01, A07; limb:vein, shoulder:artery

(d) Acronym-Meaning Extraction (Pustejovsky et al., 2001)

[layer='shallow_parse' && tag_name="NP"] AS np1
[layer='pos' && content='(']
[layer='shallow_parse' && tag_name="NP"] AS np2
[layer='pos' && content=')']

Results

Workload (a)
Architecture   1       2       3       4       5
SQL lines      37      37      34      29      29
# Joins        6       6       6       5       5
Time (sec)     3.98    4.35    3.59    1.69    1.94

Workload (b)
Architecture   1       2       3       4       5
SQL lines      91      77      75      65      50
# Joins        12      11      11      9       7
Time (sec)     3.88    5.68    5.41    3.85    3.55

Workload (c)
Architecture   1       2       3       4       5
SQL lines      45      38      38      39      41
# Joins        7       6       6       6       6
Time (sec)     17.9    23.42   21.49   30.07   4.06

Workload (d)
Architecture   1       2       3       4       5
SQL lines      59      50      53      53      35
# Joins        7       7       7       7       4
Time (sec)     1,879   1,700   2,182   1,682   1,582

Workload         (a)     (b)    (c)   (d)
#Queries         54      11     50    1
#Results/query   303.4   77.5   1.6   16,701
LQL lines        8       6      5     4

Results

Space (MB) by architecture

Architecture    1       2         3         4         5
Data Storage    168.5   168.5     168.5     132.5     136.5
Index Storage   617.0   1,397.0   1,441.0   1,182.0   673.5
Total Storage   785.5   1,565.5   1,609.5   1,314.5   810.0

• Architecture 5 performs well (if not best) on all query types, while the other architectures perform poorly on at least one query type.

• The storage requirement of Architecture 5 is comparable to that of Architecture 1.

• Architecture 5 results in much simpler queries.

• Conclusion: We recommend Architecture 5 in most cases, or Architecture 1 if an atomic annotation layer cannot be defined.

Scalability Analysis

• Combined workload of 3 query types

• Varying buffer pool sizes

Scalability Analysis

Buffer Pool Size (MB)   Elapsed Time (ms)   Buffer Read Time (ms)
1000                    2300                1050
100                     2900                1670
10                      4600                3340
1                       8300                6250

• Suggests that query execution time grows sub-linearly as memory shrinks: cutting the buffer pool from 1000 MB to 1 MB (a factor of 1000) raises elapsed time from 2300 ms to 8300 ms (about 3.6 times).

• We believe a similar ratio will be observed when increasing the database size and keeping the memory size fixed.

• Parallel query execution can be enabled after partitioning the annotations on document_id.
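
The sub-linear claim can be checked directly from the four measurements in the table. The power-law fit below is an added illustration using only the two end points; it is not an analysis from the talk.

```python
import math

# (buffer pool size in MB, elapsed time in ms), copied from the table.
measurements = [(1000, 2300), (100, 2900), (10, 4600), (1, 8300)]

base_mb, base_ms = measurements[0]
for mb, ms in measurements:
    # How much less memory than the largest pool, and how much slower.
    print(f"{base_mb // mb:>5}x less memory -> {ms / base_ms:.2f}x elapsed time")

# Overall slowdown across a 1000-fold reduction in memory.
slowdown = measurements[-1][1] / base_ms

# If time ~ memory ** (-k), the end points give k = log(slowdown) / log(1000).
exponent = math.log(slowdown) / math.log(base_mb / measurements[-1][0])
print(f"slowdown {slowdown:.2f}, fitted exponent {exponent:.2f}")
```

An exponent well below 1 is what "sub-linear" means here: a linear relationship would have made the 1 MB run about 1000 times slower, not under 4 times.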

Study on a larger dataset

• Annotated 1.4 million MEDLINE abstracts
  - 10 million sentences
  - 320 million annotations
  - 70 GB total database size

Workload         (a)      (b)     (c)    (d)       Random (a, b, c)
#Queries         54       11      50     1         115
#Results/query   32,295   5,420   48     113,483   15,686
Time/query       0:50     55:44   1:35   3:33:57   6:26

Related Work

• Annotation graphs (AG): directed acyclic graph; nodes can have time stamps or are constrained via paths to labeled parents and children. (Bird and Liberman, 2001)

• Emu system: sequential levels of annotations. Hierarchical relations may exist between different levels, but must be explicitly defined for each pair. (Cassidy & Harrington, 2001)

• The Q4M query language for MATE: directed graph; constraints and ordering of the annotated components. Stored in XML. (McKelvie et al., 2001)

• TIQL: queries consist of manipulating intervals of text, indicated by XML tags; supports set operations. (Nenadic et al., 2002)

Annotation Graphs. Find arcs labeled as words, whose phonetic transcription starts with a “hv”:

SELECT I WHERE X.[id:I].Y <- db/wrd X.[:hv].[]*.Y <- db/phn;

Emu. Find sentences of phonetic “A” followed by “p”, both dominated by an “S” syllable:

[[Phonetic=A -> Phonetic=p] ^ Syllable=S]

Q4M (MATE system). Find nouns followed by the word “lesser”:

($a word) ($b word); ($a pos ~ "NN") && ($a <> $b) && ($b # ~ "lesser")

TIQL (TIMS system). Find sentences containing the noun phrase “COUP-TF II” and the verb “inhibit”:

(<SENTENCE> <TERM nf=‘COUP TF II’>) <V lemma=‘inhibit’>

What about XQuery/XPath?

Main Advantages of LQL System

• Stand-off annotation
  - Flexible and modular
  - Multi-layered, including overlaps

• LQL – simple yet powerful
  - Support for hierarchies
  - Optimized for cross-layer queries
  - Much more expressive than standard text search engines

• Seamless integration with SQL and RDBMS
  - Easy integration with additional data sources
  - Simple parallelism

• Full text support
  - Caption search
  - Formatting-aware queries
  - Flexible support for document structure

On the Horizon

• Full text documents support
  - Really complex in bioscience text
  - Caption search
  - Formatting-aware annotation layers
  - Flexible support for document structure

• Query simplification
  - Shorthand syntax
  - GUI helper

Syntax-Helper Interface

Thank you!

biotext.berkeley.edu/lql

Overlap Example

Meta-data tables

BIOTEXT_ANNOTATION_LAYER

LAYER_ID LAYER_NAME OWNER LAST_UPDATED

1 pos hearst 6/12/2005

2 full_parse hearst 6/12/2005

3 shallow_parse hearst 6/12/2005

4 sentence hearst 6/12/2005

5 gene hearst 6/12/2005

6 mesh hearst 6/12/2005

7 chemicals hearst 6/12/2005

Meta-data tables

BIOTEXT_ANNOTATION_ATTRIBUTES

| LAYER_ID | ATTRIBUTE | ATTRIBUTE_FIELD | TABLE_NAME | ATTRIBUTE_ID | ATTRIBUTE_TEXT | DBL_QUOTE_ALIAS | TREE_TABLE | TREE_DESC | TREE_NUM |
|---|---|---|---|---|---|---|---|---|---|
| -1 | layer | layer_id | biotext_annotation_layers | layer_id | layer_name | layer | None | None | None |
| -1 | tag_name | tag_type | biotext_annotation_tag_types | tag_type_id | tag_name | tag_group | None | None | None |
| -1 | tag_group | tag_type | biotext_annotation_tag_types | tag_type_id | tag_group | tag_group | None | None | None |
| 1 | content | word_id | biotext_annotation_word | word_id | word | content_lower | None | None | None |
| 1 | content_lower | word_id | biotext_annotation_word | word_id | word_lower | content_lower | None | None | None |
| 5 | name | tag_type | locuslink_aliases | locus_id | name | name | None | None | None |
| 6 | tree_number | tag_type | biotext_annotation_mesh_tree | descriptor_ui | tree_number | tree_number | biotext_annotation_mesh_tree | descriptor_ui | tree_number |
| 6 | mesh_term | tag_type | biotext_annotation_mesh_terms | descriptor_ui | mesh_term | mesh_term_lower | biotext_annotation_mesh_tree | descriptor_ui | tree_number |
| 6 | mesh_term_lower | tag_type | biotext_annotation_mesh_terms | descriptor_ui | mesh_term_lower | mesh_term_lower | biotext_annotation_mesh_tree | descriptor_ui | tree_number |
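
One way to read the BIOTEXT_ANNOTATION_ATTRIBUTES table is as a lookup recipe: for a given layer and attribute name, it names the table to consult (TABLE_NAME), the column holding the text to match (ATTRIBUTE_TEXT), and the key column (ATTRIBUTE_ID) whose value is stored in the annotation's ATTRIBUTE_FIELD. The sketch below follows that reading. It is an interpretation, not the project's code: the helper names `find_row` and `lookup_sql` are hypothetical, treating LAYER_ID -1 as "any layer" is an assumption, and only a few rows of the table are copied in.

```python
from typing import NamedTuple


class AttributeRow(NamedTuple):
    layer_id: int          # -1 is read here as "applies to any layer" (assumption)
    attribute: str         # attribute name as it would appear in a query
    attribute_field: str   # field of the annotation that stores the key
    table_name: str        # lookup table that holds the attribute text
    attribute_id: str      # key column in the lookup table
    attribute_text: str    # text column in the lookup table


# A subset of the rows shown on the slide.
ROWS = [
    AttributeRow(-1, "tag_name", "tag_type",
                 "biotext_annotation_tag_types", "tag_type_id", "tag_name"),
    AttributeRow(1, "content", "word_id",
                 "biotext_annotation_word", "word_id", "word"),
    AttributeRow(1, "content_lower", "word_id",
                 "biotext_annotation_word", "word_id", "word_lower"),
    AttributeRow(5, "name", "tag_type",
                 "locuslink_aliases", "locus_id", "name"),
]


def find_row(layer_id, attribute):
    """Prefer a layer-specific row, then fall back to the -1 rows."""
    for wanted_layer in (layer_id, -1):
        for row in ROWS:
            if row.layer_id == wanted_layer and row.attribute == attribute:
                return row
    raise KeyError((layer_id, attribute))


def lookup_sql(layer_id, attribute):
    """Build a parameterized SQL lookup that maps attribute text to its key."""
    row = find_row(layer_id, attribute)
    return (
        f"SELECT {row.attribute_id} AS {row.attribute_field} "
        f"FROM {row.table_name} WHERE {row.attribute_text} = ?"
    )
```

Under this reading, a constraint on `content_lower` in layer 1 becomes a lookup of `word_id` by `word_lower`, while `tag_name` in any layer resolves through the shared tag-types table.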

Meta-data tables

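BIOTEXT_ANNOTATION_TAG_TYPES

| ROW | LAYER_ID | TAG_TYPE_ID | TAG_NAME | TAG_GROUP |
|---|---|---|---|---|
| 21 | 2 | 1019 | IN | IN |
| 22 | 2 | 1020 | INTJ | INTJ |
| 23 | 2 | 1021 | JJ | adjective |
| 24 | 2 | 1022 | JJR | adjective |
| 25 | 2 | 1023 | JJS | adjective |
| 26 | 2 | 1025 | LS | LS |
| 27 | 2 | 1069 | LST | LST |
| 28 | 2 | 1026 | MD | MD |
| 29 | 2 | 1070 | NAC | NAC |
| 30 | 2 | 1027 | NN | noun |
| 31 | 2 | 1028 | NNP | noun |
| 32 | 2 | 1029 | NNPS | noun |
| 33 | 2 | 1030 | NNS | noun |
| 34 | 2 | 1031 | NP | NP |
| 35 | 2 | 1032 | NX | NX |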
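
As a rough illustration of what the TAG_GROUP column makes possible, the sketch below loads a few of the BIOTEXT_ANNOTATION_TAG_TYPES rows into an in-memory SQLite database and expands the group `noun` into its individual tags. SQLite and the helper names `build_db` and `tags_in_group` are assumptions made for this example; the slides do not show how the actual system queries this table.

```python
import sqlite3

# Rows copied from the slide: (layer_id, tag_type_id, tag_name, tag_group).
TAG_TYPES = [
    (2, 1021, "JJ", "adjective"),
    (2, 1022, "JJR", "adjective"),
    (2, 1023, "JJS", "adjective"),
    (2, 1027, "NN", "noun"),
    (2, 1028, "NNP", "noun"),
    (2, 1029, "NNPS", "noun"),
    (2, 1030, "NNS", "noun"),
    (2, 1031, "NP", "NP"),
]


def build_db():
    """Create an in-memory copy of the tag-types table (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE biotext_annotation_tag_types ("
        " layer_id INTEGER,"
        " tag_type_id INTEGER PRIMARY KEY,"
        " tag_name TEXT,"
        " tag_group TEXT)"
    )
    conn.executemany(
        "INSERT INTO biotext_annotation_tag_types VALUES (?, ?, ?, ?)",
        TAG_TYPES,
    )
    return conn


def tags_in_group(conn, layer_id, tag_group):
    """Return (tag_type_id, tag_name) pairs for every tag in a group."""
    return conn.execute(
        "SELECT tag_type_id, tag_name FROM biotext_annotation_tag_types"
        " WHERE layer_id = ? AND tag_group = ?"
        " ORDER BY tag_type_id",
        (layer_id, tag_group),
    ).fetchall()
```

With a grouping like this, a query can ask for any noun without listing NN, NNP, NNPS and NNS separately, while tags that have no broader group (such as NP) simply act as their own group.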

Meta-data tables

BIOTEXT_ANNOTATION_WORD

| ROW | WORD_ID | WORD | WORD_LOWER |
|---|---|---|---|
| 1 | 1212952 | BCl | bcl |
| 2 | 1212953 | 2,2'-disulfonic | 2,2'-disulfonic |
| 3 | 1212954 | 1762-1860 | 1762-1860 |
| 4 | 1212955 | Premkumar | premkumar |
| 5 | 1212956 | 329:265-285 | 329:265-285 |
| 6 | 1212957 | EVPROC | evproc |
| 7 | 1212958 | fascinae | fascinae |
| 8 | 1212959 | fascines | fascines |
| 9 | 1212960 | Cox-Stuart | cox-stuart |
| 10 | 1212961 | epidydimo-orchitis | epidydimo-orchitis |
| 11 | 1212962 | 10-20-min | 10-20-min |
| 12 | 1212963 | 0.05-10-ng/ml | 0.05-10-ng/ml |
| 13 | 1212964 | 1.016x | 1.016x |
| 14 | 1212965 | Goldberg-Lindblom | goldberg-lindblom |
| 15 | 1212966 | Lundborg | lundborg |
| 16 | 1212967 | graft-loss | graft-loss |
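
Storing a lowercased copy of each word next to the original suggests that case-insensitive matching can be done as an exact match on WORD_LOWER rather than by transforming the column at query time. The sketch below shows that idea on a few rows of BIOTEXT_ANNOTATION_WORD. SQLite, the index name `idx_word_lower`, and the helpers `build_word_table` and `word_ids` are assumptions for illustration only.

```python
import sqlite3

# Rows copied from the slide: (word_id, word, word_lower).
WORDS = [
    (1212952, "BCl", "bcl"),
    (1212955, "Premkumar", "premkumar"),
    (1212960, "Cox-Stuart", "cox-stuart"),
    (1212967, "graft-loss", "graft-loss"),
]


def build_word_table():
    """Create an in-memory copy of the word table with an index on word_lower."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE biotext_annotation_word ("
        " word_id INTEGER PRIMARY KEY, word TEXT, word_lower TEXT)"
    )
    # Hypothetical index: lets the lowercase lookup be an indexed exact match.
    conn.execute(
        "CREATE INDEX idx_word_lower ON biotext_annotation_word (word_lower)"
    )
    conn.executemany("INSERT INTO biotext_annotation_word VALUES (?, ?, ?)", WORDS)
    return conn


def word_ids(conn, text, case_sensitive=False):
    """Return the word_ids whose surface form matches `text`."""
    if case_sensitive:
        sql = "SELECT word_id FROM biotext_annotation_word WHERE word = ?"
        param = text
    else:
        sql = "SELECT word_id FROM biotext_annotation_word WHERE word_lower = ?"
        param = text.lower()
    return [row[0] for row in conn.execute(sql, (param,))]
```

Annotations can then refer to the integer `word_id` instead of repeating the string, and the same ID serves both the case-sensitive and the case-insensitive lookup.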

References

• Steven Bird and Mark Liberman. 2001. A formal framework for linguistic annotation. Speech Communication, 33(1–2):23–60.

• Steve Cassidy and Jonathan Harrington. 2001. Speech annotation and corpus tools. Speech Communication, 33(1–2):61–77.

• David McKelvie, Amy Isard, Andreas Mengel, Morten B. Møller, Michael Grosse and Marion Klein. 2001. The MATE workbench: an annotation tool for XML coded speech corpora. Speech Communication, 33(1–2):97–112.

• Goran Nenadic, Hideki Mima, Irena Spasic, Sophia Ananiadou and Jun-ichi Tsujii. 2002. Terminology-Driven Literature Mining and Knowledge Acquisition in Biomedicine. International Journal of Medical Informatics, 67:33–48.

• Ralph Grishman. 1996. Building an Architecture: a CAWG Saga. In Advances in Text Processing: Tipster Program Phase II. Morgan Kaufmann.

• Steve Cassidy. 1999. Compiling Multi-tiered Speech Databases into the Relational Model: Experiments with the Emu System. 6th European Conference on Speech Communication and Technology Eurospeech 99, 2127–2130, Budapest, Hungary.

• Xiaoyi Ma, Haejoong Lee, Steven Bird and Kazuaki Maeda. 2002. Models and Tools for Collaborative Annotation. Third International Conference on Language Resources and Evaluation, 2066–2073.

Acquiring Labeled Data using Citances

A discovery is made …

A paper is written …

That paper is cited …

and cited …

and cited …

… as the evidence for some fact(s) F.

Each of these in turn is cited for some fact(s) …

… until it is the case that all important facts in the field can be found in citation sentences alone!

Citances

• Nearly every statement in a bioscience journal article is backed up with a cite.

• It is quite common for papers to be cited 30-100 times.

• The text around the citation tends to state biological facts. (Call these citances.)

• Different citances will state the same facts in different ways …

• … so can we use these for creating models of language expressing semantic relations?
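
A first step toward using citances this way is to collect the sentences that cite the same paper, so that different wordings of the same fact end up side by side. The sketch below is a minimal illustration under strong assumptions: citations are taken to be numeric bracket markers such as `[3]` or `[3, 7]`, the input is already split into sentences, and the function name `group_citances` is hypothetical. The slides do not describe how the project actually extracts citances.

```python
import re
from collections import defaultdict

# Assumed citation style: numeric markers in square brackets, e.g. [3] or [3, 7].
CITATION_MARKER = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


def group_citances(sentences):
    """Map each cited reference number to the sentences that cite it."""
    groups = defaultdict(list)
    for sentence in sentences:
        for match in CITATION_MARKER.finditer(sentence):
            for ref in match.group(1).split(","):
                ref_id = int(ref)
                # Avoid listing a sentence twice if it cites the same paper twice.
                if sentence not in groups[ref_id]:
                    groups[ref_id].append(sentence)
    return dict(groups)


# Invented placeholder sentences, not taken from any real article.
example = [
    "Protein A binds protein B [3].",
    "As shown previously, A interacts with B [3, 7].",
    "This sentence has no citation.",
]
citances_by_paper = group_citances(example)
```

Here the two sentences citing reference 3 land in the same group, which is the kind of paraphrase set the last bullet asks about.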