usami bionlp2011

Post on 25-May-2015

Automatic Acquisition of Huge Training Data for Bio-Medical Named Entity Recognition

Yu Usami, Han-Cheol Cho, Naoaki Okazaki, and Jun’ichi Tsujii

Graduate School of Information Science and Technology, University of Tokyo

Introduction

Named Entity Recognition

AM, cystatin C and cathepsin B are present as ...

Recent approach:

Machine learning on manually annotated corpus

• BioCreAtIvE task 1A (Yeh et al, 2005)

• Semi-supervised (Vlachos and Gasperin, 2010)

AM/B ,/O cystatin/B C/I and/O cathepsin/B B/I are/O present/O as/O

Labels: B = beginning of NE, I = inside of NE, O = outside of NE
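The B/I/O labeling can be illustrated with a minimal sketch. The helper name `spans_to_bio` and the span format (start and end token indexes, end exclusive) are assumptions made for this example and do not come from the slides.

```python
def spans_to_bio(num_tokens, spans):
    """Convert entity spans (start, end-exclusive token indexes) to B/I/O labels."""
    labels = ["O"] * num_tokens          # every token starts outside any NE
    for start, end in spans:
        labels[start] = "B"              # first token of the entity
        for i in range(start + 1, end):
            labels[i] = "I"              # remaining tokens of the entity
    return labels


tokens = ["AM", ",", "cystatin", "C", "and", "cathepsin", "B", "are", "present", "as"]
spans = [(0, 1), (2, 4), (5, 7)]         # AM, cystatin C, cathepsin B
labels = spans_to_bio(len(tokens), spans)
print(list(zip(tokens, labels)))
```

Each token receives exactly one label, so a sequence classifier can predict the labels token by token.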

Manual annotation is expensive:

• Cost

• Time

Our Idea

Utilize inexpensive and large resources:

• Lexical database

• Unlabeled text

1. Build a dictionary from the lexical database

2. String-match the dictionary against the unlabeled text

3. Acquire an annotated corpus for training
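A minimal sketch of the string-match step, assuming whitespace tokenization, case-sensitive comparison, and greedy longest-match. The function name `label_by_dictionary`, the `max_len` parameter, and the example sentence are hypothetical and only illustrate the idea.

```python
def label_by_dictionary(tokens, dictionary, max_len=5):
    """Greedy longest-match labeling of tokens against a set of token tuples."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate first so "CD177 molecule" beats "CD177".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            labels[i] = "B"
            for j in range(i + 1, i + matched):
                labels[j] = "I"
            i += matched
        else:
            i += 1
    return labels


# Dictionary entries are stored as token tuples (an assumption of this sketch).
dictionary = {("CD177",), ("CD177", "molecule"), ("NB1",)}
tokens = "Expression of CD177 molecule and NB1 was measured".split()
labels = label_by_dictionary(tokens, dictionary)
print(list(zip(tokens, labels)))
```

Running this over unlabeled text yields token-level B/I/O labels that can serve directly as training data.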

Dictionary Building

Symbol: CD177

Official Name: CD177 molecule

Synonyms: NB1, PRV1, HNA2A, CD177

Dictionary entries: CD177, CD177 molecule, NB1, PRV1, HNA2A
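As a rough sketch, dictionary building can be seen as collecting the symbol, official name, and synonyms of each record and dropping duplicates. The field names `symbol`, `official_name`, and `synonyms` are hypothetical and are not the actual schema of the lexical database.

```python
def build_dictionary(records):
    """Collect unique names from records, preserving first-seen order."""
    entries = []
    seen = set()
    for record in records:
        names = [record["symbol"], record["official_name"], *record["synonyms"]]
        for name in names:
            if name not in seen:         # e.g. CD177 appears as symbol and synonym
                seen.add(name)
                entries.append(name)
    return entries


records = [
    {
        "symbol": "CD177",
        "official_name": "CD177 molecule",
        "synonyms": ["NB1", "PRV1", "HNA2A", "CD177"],
    }
]
dictionary = build_dictionary(records)
print(dictionary)  # ['CD177', 'CD177 molecule', 'NB1', 'PRV1', 'HNA2A']
```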

Task Settings

Task: Single class NER

Target Class: Gene-or-gene-product (GGP)

Resources:

• Lexical database: Entrez Gene (6,816,109 gene/protein records)

• Unlabeled text: 2009 MEDLINE (17,764,827 articles)

Ineffectiveness of Simple Approach

Dictionary-based NER

• String-match the dictionary against the test data

ML-based NER trained on acquired training data

• String-match the dictionary against the unlabeled text to get training data

• Learn a model from the training data

• Apply the model to the test data

Method       P       R       F1
Dic-based    39.03   42.69   40.78
ML-based     10.18   23.83   14.27

Problem of Simple Approach

Stats: Acquired 1,715,344,107 labeled tokens, of which 10.0% are NEs

Examples:

(A) PMID 1984484: It is clear that in culture media of AM, cystatin C and cathepsin B are present as proteinase-antiproteinase complexes.

(B) PMID 23456: Temperature in puerperium is higher in AM, lower in PM.

Plain string matching labels "AM" as an NE in both sentences, although in (B) it is a time of day.

Goal of This Study

Our Contribution

Acquire huge, high-quality training data with a lexical database and unlabeled text

Methodology

1. Utilize references (links) for disambiguation

2. Expand NEs based on coordination analysis

3. Gain new NEs by using self-training

Disambiguation

Utilize lexical database references

record AM

reference PMID 1984484

(A) PMID 1984484: It is clear that in culture media of AM, cystatin C and cathepsin B are present as proteinase-antiproteinase complexes.

(B) PMID 23456: Temperature in puerperium is higher in AM, lower in PM.
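The reference check can be sketched as follows: a dictionary name is labeled in an article only when the record for that name references the article. The function name `match_with_references` and the `name_to_pmids` mapping are assumptions for illustration, and the sketch handles single-token names only.

```python
def match_with_references(pmid, tokens, name_to_pmids):
    """Label a token as an NE only if its record references this article."""
    labels = []
    for token in tokens:
        # A name with no record, or a record that does not cite this PMID, stays O.
        referenced = pmid in name_to_pmids.get(token, set())
        labels.append("B" if referenced else "O")
    return labels


# Record "AM" references PMID 1984484, as in the slide.
name_to_pmids = {"AM": {1984484}}

labels_a = match_with_references(1984484, ["culture", "media", "of", "AM"], name_to_pmids)
labels_b = match_with_references(23456, ["higher", "in", "AM"], name_to_pmids)
print(labels_a)  # ['O', 'O', 'O', 'B']
print(labels_b)  # ['O', 'O', 'O']
```

The mention in (A) is kept because the record links to that article, while the mention in (B) is dropped.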

Side Effect of Using References

Some references are missing from the lexical database

record     entA    entB    entC
ref PMID   19025   1021    4928016

PMID 4928016: ... three genes concerned (designated entA, entB and entC)

String match is applied only if the article is referred to by the record, so only entC is labeled in this article.

Coordination Analysis

Expand NEs based on coordination structure

record entA entB entC

ref PMID 4928016

PMID 4928016: ... three genes concerned (designated entA, entB and entC)

1. Start from a labeled mention.

2. Check whether the adjacent token is a coordinate token.

3. If it is, ask: is the mention next to it included in the dictionary?

4. If yes, label that mention as an NE and repeat from step 2.

5. End when the adjacent token is not a coordinate token or the mention is not included in the dictionary.
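The walk over the coordination structure can be sketched as below. This is a simplified illustration, not the authors' implementation: the set `COORD_TOKENS`, the tokenization, and the restriction to single-token names are all assumptions.

```python
# Assumed set of coordinate tokens; the slides do not list them.
COORD_TOKENS = {",", "and", "or"}


def expand_by_coordination(tokens, labels, dictionary):
    """Spread NE labels across coordinated single-token dictionary entries."""
    labels = list(labels)
    seeds = [i for i, label in enumerate(labels) if label == "B"]
    for seed in seeds:
        for step in (-1, 1):                       # walk left, then right
            i = seed
            while True:
                coord, cand = i + step, i + 2 * step
                if not (0 <= cand < len(tokens)):
                    break                          # ran off the sentence
                if tokens[coord] not in COORD_TOKENS:
                    break                          # not a coordinate token
                if tokens[cand] not in dictionary:
                    break                          # mention not in dictionary
                labels[cand] = "B"                 # expand the NE label
                i = cand
    return labels


tokens = ["three", "genes", "concerned", "(", "designated",
          "entA", ",", "entB", "and", "entC", ")"]
labels = ["O"] * len(tokens)
labels[9] = "B"                                    # entC, matched via its reference
expanded = expand_by_coordination(tokens, labels, {"entA", "entB", "entC"})
print(list(zip(tokens, expanded)))
```

Starting from entC, the walk passes "and" to reach entB, passes "," to reach entA, and stops at "designated".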

Self-training

1. Learn a classifier model from the training data

2. Apply the model to the remaining data

3. Add the new NEs to the training data
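A minimal sketch of the self-training loop. The toy `train` and `predict` functions (a rule that remembers tokens directly preceding an NE) are a hypothetical stand-in for the real classifier, and the example sentences are invented for illustration.

```python
def train(sentences):
    """Toy stand-in for the classifier: remember tokens that directly precede an NE."""
    contexts = set()
    for tokens, labels in sentences:
        for i in range(1, len(tokens)):
            if labels[i] == "B":
                contexts.add(tokens[i - 1])
    return contexts


def predict(model, tokens):
    """Label a token B when its left neighbor was seen before an NE in training."""
    return ["B" if i > 0 and tokens[i - 1] in model else "O"
            for i in range(len(tokens))]


def self_train(training, remaining, rounds=1):
    """Learn a model, apply it to the remaining data, and add sentences with new NEs."""
    training = list(training)
    for _ in range(rounds):
        model = train(training)                    # Learning
        still_remaining = []
        for tokens in remaining:
            labels = predict(model, tokens)        # Apply
            if "B" in labels:
                training.append((tokens, labels))  # Add new NEs
            else:
                still_remaining.append(tokens)
        remaining = still_remaining
    return training


training = [(["the", "gene", "entA", "is", "expressed"], ["O", "O", "B", "O", "O"])]
remaining = [["the", "gene", "entB", "is", "mutated"],
             ["temperature", "is", "higher"]]
result = self_train(training, remaining)
print(result[1])
```

Only the sentence in which the model finds a new NE is added, so the training data grows without manual annotation.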

Evaluation Settings

Test corpus: BioNLP 2011 Shared Task EPI corpus (Training set + Development set)

Learning and Decoding: Linear kernel SVM (Predict each token label sequentially)

NER Results

Method              Prec.   Recall   F1

Dic-based
String match        39.03   42.69    40.78
+ References        90.62   13.52    23.53
+ Coord Analysis    89.66   13.77    23.87

ML-based
String match        10.18   23.83    14.27
+ References        69.25   39.12    50.00
+ Coord Analysis    66.79   47.44    55.47
+ Self-training     63.72   51.18    56.77
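For reference, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it for the two string-match rows; the function name `f1_score` is chosen for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


dic_f1 = f1_score(39.03, 42.69)   # Dic-based, string match
ml_f1 = f1_score(10.18, 23.83)    # ML-based, string match
print(round(dic_f1, 2))  # 40.78
print(round(ml_f1, 2))   # 14.27
```

Recomputing from the rounded precision and recall can differ from a reported F1 in the last digit, because the reported value may be derived from unrounded figures.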

Automatic vs Manual

Type        Total tokens   NE tokens
Manual      161,577        12,603
Automatic   48,677,426     3,055,362

NER Performance (trained on each corpus)

F1: Manual 67.89, Automatic 62.66

Chart values (P, R, F1 bars): 62.66, 67.89, 57.92, 58.56, 68.26, 80.76

Conclusion

Acquired high-quality training data automatically

• Use of references for high precision

• Improved recall with

‣ Coordination analysis

‣ Self-training

Acquired large-size training data

• Used 10% (memory limitation)

Future Work

Utilize all of the acquired training data for learning

‣ Online learning

Improve self-training performance

Semi-supervised approach with acquired data

Apply to another domain or semantic class
