© 2003 The MITRE Corporation. ALL RIGHTS RESERVED. MITRE Critical Assessment of Information Extraction Systems in Biology (BioCreAtIvE) Marc Colosimo Lynette Hirschman Alexander Morgan Alexander Yeh http://www.mitre.org/public/biocreative

Critical Assessment of Information Extraction Systems in Biology (BioCreAtIvE)

Marc Colosimo
Lynette Hirschman
Alexander Morgan
Alexander Yeh

http://www.mitre.org/public/biocreative

Outline

Past evaluation
- KDD Cup 2002

Current evaluation
- BioCreAtIvE

Summary

Past Evaluation: KDD 2002 Challenge Cup Evaluation

We were invited to run a task for KDD Cup 2002*

We ran one of two tasks for 2002
- Alexander Yeh was the chair for Task 1 (fly genes)
- Mark Craven (U. Wisc.) was the chair for Task 2 (yeast genes)

Data-mining conference: NOT biology or text processing

*http://www.biostat.wisc.edu/~craven/kddcup/tasks.html

Task 1: For a Set of Papers on Genetics or Molecular Biology

We provided for each paper
- The full text of the paper
- A list of the genes mentioned in that paper

The task was to
- Rank the curatable papers before the non-curatable papers
- Decide whether each paper contains any curatable gene product information (Yes/No)
- For each curatable gene mentioned in the paper, decide whether that paper has experimental results for
  - Transcript(s) of that gene (Yes/No)
  - Protein(s) of that gene (Yes/No)

Results

The winner and honorable mentions were all combined teams from 2 or 3 organizations

Winner: a team from ClearForest and Celera
- Used manually generated rules and patterns to perform information extraction
- Also had the best score in each of the 3 sub-tasks

Sub-task                Best   Median
Ranked-list             84%    69%
Yes/No curate paper     78%    58%
Yes/No gene products    67%    35%

18 teams submitted test results

Current Evaluation: BioCreAtIvE

Organized by MITRE, CNB (Madrid) and others
- Under the umbrella of the ISCB BioLINK Special Interest Group for Text Data Mining*

Two tasks
- Entity extraction (MITRE)
  - Gene name mentions (NCBI)
  - Gene list (MITRE)
- Functional curation (CNB-Madrid)
  - Automatically map text to GO (Gene Ontology) terms for proteins described in text

*http://www.pdg.cnb.uam.es/BioLINK

Schedule

July 2003: initial training data & guidelines

Nov-Dec. 2003: test data released, results due

Participants may choose which tasks and which sub-tasks they want to participate in. You are not limited to one or all of the tasks.

Why Evaluate Entity Extraction for Molecular Biology?

Entity extraction is a basic text mining operation
- It indicates the items discussed in a document
- Variations in nomenclature constitute a major stumbling block to accessing the biomedical literature

Many groups are working on entity extraction
- But there is no way to compare the systems
  - Different data sets
  - Different tasks

Challenge evaluations have been successful at making comparisons

- This work should also lead to resources and standards for handling nomenclature

Progress in Speech Recognition

Source: Pallett, D. Garofolo, J. and Fiscus, J. (NIST) Measurements in Support of Research Accomplishments. Feb 2000. Communications of the ACM: Special Section on Broadcast News Understanding.

Results show decrease in error rate over time, measured by results from best system each year

Note that the research community selected new, harder problems over time

Can we expect the same progress for accessing biological literature?

Some Challenges of Extracting Entities in Molecular Biology Texts

Entity mentions are often common nouns (as opposed to proper nouns)

In fact, many entities are named with ordinary words

- E.g., some Drosophila gene names: by, for, if, blue, saw, period, white, midget

Also, new entities are constantly being discovered and/or renamed

“Complete” Entity Extraction is More Than Finding Mentions in the Text

For each mention, it is important to determine which entity is being discussed

This is non-trivial in molecular biology
- An entity can have synonyms
- The same word(s) can refer to different entities
  - E.g., Sek1 refers to two different proteins in mice (Map2k4 and Epha4)

- Mentions can share text: e.g., “MEK1/2” is about both MEK1 and MEK2
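
The two problems above (one name for several entities, and several entities sharing one mention) can be illustrated with a minimal Python sketch. The table `AMBIGUOUS_NAMES`, the pattern `SHARED_MENTION`, and both function names are hypothetical and introduced only for illustration; the only data comes from the Sek1 and MEK1/2 examples on the slide. A real system would need a far larger synonym resource and context to choose among candidates.

```python
import re

# Hypothetical ambiguity table; its single entry is the Sek1 example from the slide.
AMBIGUOUS_NAMES = {"Sek1": {"Map2k4", "Epha4"}}

# A name stem ending in a letter, a number, then one or more "/number" parts.
SHARED_MENTION = re.compile(
    r"^(?P<stem>.*?[A-Za-z])(?P<first>\d+)(?P<rest>(?:/\d+)+)$"
)


def expand_shared_mention(mention):
    """Split a mention such as "MEK1/2" into one name per entity."""
    match = SHARED_MENTION.match(mention)
    if match is None:
        return [mention]
    numbers = [match.group("first")] + match.group("rest").strip("/").split("/")
    return [match.group("stem") + number for number in numbers]


def candidate_entities(mention):
    """Return every entity that a single name could refer to."""
    return sorted(AMBIGUOUS_NAMES.get(mention, {mention}))


expanded = expand_shared_mention("MEK1/2")
candidates = candidate_entities("Sek1")
```

Here `expanded` holds the two names MEK1 and MEK2, while `candidates` holds both mouse proteins for Sek1; deciding which one a given paper means is the hard part that the sketch leaves open.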

Entity Extraction Task 1A: Gene Name Mention

Data provided by Lorrie Tanabe & John Wilbur, NCBI
- 15,000 sentences manually annotated for genes
  - 7,500 sentences for training
  - 2,500 sentences for development test
  - 5,000 sentences for testing

Example (transformed for display purposes)
- Data are marked for occurrences of gene-related mentions (underlined), including binding sites, motifs, domains, proteins, promoters, etc.

Structure and expression of a gene from Arabidopsis thaliana encoding a protein related to SNF1 protein kinase.

Entity Extraction Task 1B: Gene List Annotation

Given a set of abstracts

We have screened the Drosophila X chromosome for genes whose dosage affects the function of the homeotic gene Deformed. One of these genes, extradenticle, encodes a homeodomain transcription factor that heterodimerizes with Deformed and other homeotic Hox proteins. Mutations in the nejire gene, which encodes a transcriptional adaptor protein belonging to the CBP/p300 family, also interact with Deformed. The other previously characterized gene identified as a Deformed interactor is Notch, which encodes a transmembrane receptor. These three genes underscore the importance of transcriptional regulation and cell-cell signaling in Hox function. Four novel genes were also identified in the screen. One of these, rancor, is required for appropriate embryonic expression of Deformed and another homeotic gene, labial. Both Notch and nejire affect the function of another Hox gene, Ultrabithorax, indicating they may be required for homeotic activity in general.

Entity Extraction Task 1B: What a Contestant’s System Should Return

Return a list of the standardized names of the genes mentioned in each abstract:

Also return 1 text mention for each gene in list

We have screened the Drosophila X chromosome for genes whose dosage affects the function of the homeotic gene Deformed. One of these genes, extradenticle, encodes a homeodomain transcription factor that heterodimerizes with Deformed and other homeotic Hox proteins. Mutations in the nejire gene, which encodes a transcriptional adaptor protein belonging to the CBP/p300 family, also interact with Deformed. The other previously characterized gene identified as a Deformed interactor is Notch, which encodes a transmembrane receptor. These three genes underscore the importance of transcriptional regulation and cell-cell signaling in Hox function. Four novel genes were also identified in the screen. One of these, rancor, is required for appropriate embryonic expression of Deformed and another homeotic gene, labial. Both Notch and nejire affect the function of another Hox gene, Ultrabithorax, indicating they may be required for homeotic activity in general.

0004656, 0002522, 0015624, 0000439, 0012384, 0004647, 0000611
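
One way a returned gene list could be compared with a reference list is sketched below in Python. The slide does not say how Task 1B is scored, so precision, recall and F-measure are an assumption here, and `score_gene_list` is a hypothetical helper. The gold list reuses the identifiers shown on the slide; the returned list, including the identifier `9999999`, is invented for illustration.

```python
def score_gene_list(returned, gold):
    """Compare a returned list of gene identifiers with a gold-standard list.

    Scoring by precision, recall and F-measure is an assumption, not
    something the slide specifies.
    """
    returned_set, gold_set = set(returned), set(gold)
    correct = len(returned_set & gold_set)
    precision = correct / len(returned_set) if returned_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if correct == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Identifiers as listed on the slide.
gold = ["0004656", "0002522", "0015624", "0000439", "0012384", "0004647", "0000611"]
# Hypothetical system output: five correct identifiers and one spurious one.
returned = ["0004656", "0002522", "0015624", "0000439", "0012384", "9999999"]

precision, recall, f_measure = score_gene_list(returned, gold)
```

Because the lists are compared as sets, the order of identifiers and repeated mentions of the same gene do not affect the score in this sketch.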

Task 1B: Data Availability

Abstracts from PubMed/Medline
- Training
- Development test
- Test

Gene lists for papers from model organism databases (Drosophila, mouse, yeast)

- A list of genes (standardized names) for each paper is available

- Note that gene list is for full paper, but the text we can get is just the abstract

Synonym lists provided by each database to map alternate gene names, as mentioned in text, to their unique database identifier
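
The role of the synonym lists can be sketched as a simple lookup in Python. This is a minimal illustration, not the method any participant used: the `SYNONYMS` table, its placeholder identifiers (`ID-1`, `ID-2`, `ID-3`) and the function `gene_list` are all hypothetical, and the real lists come from the model organism databases.

```python
import re

# Hypothetical synonym list: lowercase name -> placeholder database identifier.
SYNONYMS = {
    "deformed": "ID-1",
    "dfd": "ID-1",
    "extradenticle": "ID-2",
    "exd": "ID-2",
    "notch": "ID-3",
}


def gene_list(abstract, synonyms):
    """Return {identifier: one text mention} for every gene found in the abstract."""
    found = {}
    for name, gene_id in synonyms.items():
        # Whole-word, case-insensitive match of the synonym in the text.
        match = re.search(r"\b" + re.escape(name) + r"\b", abstract, re.IGNORECASE)
        if match and gene_id not in found:
            found[gene_id] = match.group(0)
    return found


abstract = (
    "One of these genes, extradenticle, encodes a homeodomain transcription "
    "factor that heterodimerizes with Deformed and other homeotic Hox proteins."
)
result = gene_list(abstract, SYNONYMS)
```

The result pairs each identifier with one supporting mention, which matches the shape of output the task asks for. Plain lookup of this kind breaks down for gene names that are ordinary words (such as the Drosophila genes white or for) and for names shared by several genes, so it is only a starting point.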

Task 1B: References Associated with Lists of Genes

Task 1B: Data Set Size (in Abstracts*)

                    Fly      Mouse    Yeast
Training            5000     5000     5000
Development test    150      250      (150)
Test                (250)    (250)    (250)

*Each abstract is around 250 words

Task 2: Functional Annotation

Data provided by Swiss-Prot (Rolf Apweiler) and being run by Christian Blaschke (CNB-Madrid)

Task:
- Automatically generate evidence for Gene Ontology annotations for a set of proteins from the text of an article

Gold standard:
- SWISS PROT Human Genome Annotations
- SWISS PROT curators will also check the correctness and utility of the pointers to the evidence

Functional Annotation: Sub-tasks

1. Return text evidence for GO annotations found in a paper
- Given a full text paper, protein(s) and associated GO term(s)

2. Generate GO term(s) and evidence for a protein
- Given a paper and protein(s) in the paper
- Note that more than one GO term might be associated with a protein

3. Exploratory. Given a set of proteins, find relevant GO annotations and evidence from full text articles

Task 2: Find Text Evidence Supporting SWISS PROT GO Annotation

SWISS PROT entry for: Small inducible cytokine A8 precursor;

Synonyms: CCL8; Monocyte chemotactic protein 2; MCP-2; Monocyte chemoattractant protein 2; HC14

GO Annotation: 0006816 Calcium ion transport

Task 2: Find Text Evidence (cont.)

Full text article…

…cpt-cAMP (1 mM) pretreatment of the cells completely inhibited RANTES-, MIP-1-, and MCP-2-induced Ca2+ mobilization …

Protein: Small inducible cytokine A8 precursor

Synonym: MCP-2

GO Annotation: 0006816 Calcium ion transport

Evidence:
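
A first step toward finding such evidence could be to collect the sentences that mention the protein under any of its names. The Python sketch below does only that; it is an illustration under stated assumptions, not the task's reference method. The function `candidate_evidence` and the first sentence of `article_text` are hypothetical, while the second sentence and the name list follow the slides as transcribed.

```python
import re


def candidate_evidence(text, protein_names):
    """Return the sentences that mention the protein under any of the given names."""
    # Naive sentence splitting on end punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        if any(name.lower() in lowered for name in protein_names):
            hits.append(sentence.strip())
    return hits


# Names from the SWISS PROT entry shown on the previous slide.
names = ["Small inducible cytokine A8", "CCL8", "MCP-2", "HC14"]

# The first sentence is an invented filler; the second is the passage from the slide.
article_text = (
    "Cells were cultured overnight. "
    "cpt-cAMP (1 mM) pretreatment of the cells completely inhibited RANTES-, "
    "MIP-1-, and MCP-2-induced Ca2+ mobilization."
)

hits = candidate_evidence(article_text, names)
```

Matching the protein name is the easy half. Linking "Ca2+ mobilization" in the passage to the GO term "Calcium ion transport" needs more than string matching, which is what makes the sub-task a useful evaluation.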

Summary

We are trying to help the curators by providing common challenge evaluations based on relevant problems faced by curators

Providing common evaluations provides a means to directly compare different methods and helps to advance research in the area

There is still time to compete in the current challenge

http://www.mitre.org/public/biocreative

The End

Linking Literature, Databases, Ontologies, Data

[Diagram; labels: MEDLINE, Literature Collections, Genbank, Databases, SwissProt, Ontologies, Experimental Data, Data integration via metaschemas, Improved search and indexing, Pathway Discovery, DB update, Data Interpretation, DB curation]