usami bionlp2011

Automatic Acquisition of Huge Training Data for Bio-Medical Named Entity Recognition

Yu Usami, Han-Cheol Cho, Naoaki Okazaki, and Jun’ichi Tsujii

Graduate School of Information Science and Technology, University of Tokyo

Introduction

Named Entity Recognition (NER)

Example: AM, cystatin C and cathepsin B are present as ...

Recent approach:

Machine learning on manually annotated corpus

• BioCreAtIvE task 1A (Yeh et al., 2005)

• Semi-supervised (Vlachos and Gasperin, 2010)

BIO labels for the example:

AM/B ,/O cystatin/B C/I and/O cathepsin/B B/I are/O present/O as/O

Labels: B = beginning of NE, I = inside of NE, O = outside of NE

Manual annotation is expensive:

• Cost

• Time

Our Idea

Utilize inexpensive and large resources:

• Lexical database

• Unlabeled text

Pipeline:

1. Build a dictionary from the lexical database

2. String-match the dictionary against the unlabeled text

3. Acquire an annotated corpus for training
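The string-match step can be sketched as a longest-match dictionary lookup that emits B/I/O labels. This is only a minimal illustration: whitespace tokenization, exact case-sensitive matching, and the function name `bio_label` are assumptions, not details given in the slides.

```python
def bio_label(tokens, dictionary):
    """Label tokens with B/I/O by longest-match lookup in a set of token tuples."""
    max_len = max((len(entry) for entry in dictionary), default=0)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            labels[i] = "B"
            for j in range(i + 1, i + matched):
                labels[j] = "I"
            i += matched
        else:
            i += 1
    return labels


# Toy dictionary with illustrative entries only.
dictionary = {("AM",), ("cystatin", "C"), ("cathepsin", "B")}
tokens = "AM , cystatin C and cathepsin B are present as".split()
labels = bio_label(tokens, dictionary)
print(list(zip(tokens, labels)))
```

Matching on token tuples rather than raw substrings keeps a short name from firing inside a longer word.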

Dictionary Building

Example record from the lexical database:

Symbol: CD177

Official Name: CD177 molecule

Synonyms: NB1, PRV1, HNA2A, CD177

Resulting dictionary entries: CD177, CD177 molecule, NB1, PRV1, HNA2A
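A rough sketch of how the surface forms of a record could be collected into a dictionary. The record layout (`symbol`, `official_name`, `synonyms`) and the function name `build_dictionary` are hypothetical and chosen only for illustration; they do not reflect the actual Entrez Gene file format.

```python
def build_dictionary(records):
    """Collect every surface form (symbol, official name, synonyms) of each record."""
    dictionary = {}
    for record in records:
        names = [record["symbol"], record["official_name"], *record["synonyms"]]
        for name in names:
            # Map each surface form to the set of record symbols that use it,
            # so duplicated names (e.g. a symbol repeated as a synonym) collapse.
            dictionary.setdefault(name, set()).add(record["symbol"])
    return dictionary


# The CD177 example from the slide, in a hypothetical record layout.
records = [
    {
        "symbol": "CD177",
        "official_name": "CD177 molecule",
        "synonyms": ["NB1", "PRV1", "HNA2A", "CD177"],
    }
]
dictionary = build_dictionary(records)
print(sorted(dictionary))
```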

Task Settings

Task: Single class NER

Target Class: Gene-or-gene-product (GGP)

Resources:

• Lexical database: Entrez Gene

includes 6,816,109 gene (protein) records

• Unlabeled text: 2009 MEDLINE

includes 17,764,827 articles

Ineffectiveness of Simple Approach

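Two simple approaches:

• Dictionary-based NER (Dic-based): string match applied directly to the test data

• ML-based NER trained on acquired training data (ML-based): string match on unlabeled text yields training data, and the trained model is applied to the test data

[Figure: comparison of Dic-based and ML-based NER]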

Problem of Simple Approach

Stats: Acquired 1,715,344,107 labeled tokens including 10.0% NEs

Examples:

(A) PMID 1984484: It is clear that in culture media of AM, cystatin C and cathepsin B are present as proteinase-antiproteinase complexes.

(B) PMID 23456: Temperature in puerperium is higher in AM, lower in PM.

Simple string match labels "AM" in both sentences, although in (B) it denotes a time of day.

Goal of This Study

Our contribution: acquire huge, high-quality training data with a lexical database and unlabeled text

Methodology

1. Utilize references (links) for disambiguation

2. Expand NEs based on coordination analysis

3. Gain new NEs by using self-training

Disambiguation

Utilize lexical database references

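record AM

reference PMID 1984484

(A) PMID 1984484: It is clear that in culture media of AM, cystatin C and cathepsin B are present as proteinase-antiproteinase complexes.

A dictionary entry is string-matched only in articles that its record references.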

Side Effect of Using References

Some references are missing from the lexical database

record   ref PMID
entA     19025
entB     1021
entC     4928016

PMID 4928016: ... three genes concerned (designated entA, entB and entC)

String match if referred: only entC references PMID 4928016, so entA and entB are left unlabeled in this sentence.

Expand NEs based on coordination structure

Coordination Analysis

Walk-through on ref PMID 4928016 (... three genes concerned (designated entA, entB and entC)):

1. Start from a mention that was matched through its reference.

2. Look at the adjacent token: is it a coordinate token?

3. If so, look at the mention next to it: is this mention included in the dictionary? If it is, label it as an NE and repeat from step 2.

4. Stop when the adjacent token is not a coordinate token or the mention is not included in the dictionary.
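The coordination walk-through could look roughly like the following. The set of coordinate tokens (`COORDINATORS`), the restriction to single-token mentions, and the function name are assumptions made for this sketch; the slides do not specify them.

```python
COORDINATORS = {",", "and", "or"}  # assumed set of coordinate tokens


def expand_by_coordination(tokens, labels, dictionary):
    """Propagate NE labels across coordinate tokens to neighbouring dictionary mentions."""
    labels = list(labels)
    frontier = [i for i, label in enumerate(labels) if label == "B"]
    while frontier:
        i = frontier.pop()
        for step in (-1, 1):  # look to the left and to the right
            coord, cand = i + step, i + 2 * step
            if not (0 <= cand < len(tokens)):
                continue
            if tokens[coord] not in COORDINATORS:
                continue  # not a coordinate token: stop in this direction
            if tokens[cand] in dictionary and labels[cand] == "O":
                labels[cand] = "B"  # new NE found; keep expanding from it
                frontier.append(cand)
    return labels


tokens = "three genes concerned ( designated entA , entB and entC )".split()
dictionary = {"entA", "entB", "entC"}

# Seed: only entC was matched through its reference.
seed = ["B" if token == "entC" else "O" for token in tokens]
expanded = expand_by_coordination(tokens, seed, dictionary)
print([token for token, label in zip(tokens, expanded) if label == "B"])
```

Starting from entC, the sketch crosses "and" to reach entB, then "," to reach entA, and stops at "designated", which is not a coordinate token.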

Self-training

1. Learn a classifier model from the training data

2. Apply the model to the remaining data

3. Add new NEs found by the model
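A generic self-training loop, sketched under strong simplifications: examples are single numbers, the classifier is a toy nearest-centroid model, and any confidently labeled example is added back (the slide mentions adding new NEs only). All names, the confidence score, and the threshold are hypothetical and do not describe the authors' SVM setup.

```python
def self_train(labeled, unlabeled, train, predict, threshold=0.8, rounds=3):
    """Repeatedly train, label the remaining data, and keep confident predictions."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        confident, rest = [], []
        for x in remaining:
            label, score = predict(model, x)
            (confident if score >= threshold else rest).append((x, label))
        if not confident:
            break  # nothing new to add
        labeled.extend(confident)
        remaining = [x for x, _ in rest]
    return train(labeled), labeled


def train_centroids(labeled):
    """Toy model: mean feature value per class."""
    sums = {}
    for x, label in labeled:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}


def predict_centroid(model, x):
    """Return the nearest class and a toy confidence (needs at least two classes)."""
    dists = sorted((abs(x - centroid), label) for label, centroid in model.items())
    (d0, label), (d1, _) = dists[0], dists[1]
    score = 1.0 if d0 + d1 == 0 else d1 / (d0 + d1)
    return label, score


seed = [(0.0, "O"), (1.0, "NE")]
model, final_labeled = self_train(seed, [0.1, 0.9, 0.55], train_centroids, predict_centroid)
print(final_labeled)
```

The ambiguous example 0.55 never reaches the confidence threshold in this toy run, so it is not added to the training data.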

Evaluation Settings

Test corpus: BioNLP 2011 Shared Task EPI corpus (Training set + Development set)

Learning and Decoding: Linear kernel SVM (Predict each token label sequentially)

NER Results

Dic-based

Method             Prec.   Recall  F1
String match       39.03   42.69   40.78
+ References       90.62   13.52   23.53
+ Coord Analysis   89.66   13.77   23.87

ML-based

Method             Prec.   Recall  F1
String match       10.18   23.83   14.27
+ References       69.25   39.12   50.00
+ Coord Analysis   66.79   47.44   55.47
+ Self-training    63.72   51.18   56.77
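F1 is the harmonic mean of precision and recall. The short check below recomputes it for the reported rows; small last-digit differences can occur because the reported precision and recall are already rounded. The function name `f1` is just a helper for this sketch.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results above.
rows = [
    (39.03, 42.69, 40.78),
    (90.62, 13.52, 23.53),
    (89.66, 13.77, 23.87),
    (10.18, 23.83, 14.27),
    (69.25, 39.12, 50.00),
    (66.79, 47.44, 55.47),
    (63.72, 51.18, 56.77),
]

for precision, recall, reported in rows:
    print(f"P={precision:6.2f} R={recall:6.2f} F1={f1(precision, recall):6.2f} (reported {reported:.2f})")
```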

Automatic vs Manual

Type        Total tokens   NE tokens
Manual      161,577        12,603
Automatic   48,677,426     3,055,362

NER Performance

[Bar chart: P, R and F1 of NER trained on each corpus (Manual, Automatic). Values shown: 62.66, 67.89, 57.92, 58.56, 68.26, 80.76. Highlighted: F1: 67.89, F1: 62.66]

Conclusion

Acquired high-quality training data automatically

• Use of references for high precision

• Improved recall with

‣ Coordination analysis

‣ Self-training

Acquired large-size training data

• Used 10% (memory limitation)

Future Work

Utilize all of the acquired training data for learning

‣ Online learning

Improve self-training performance

Semi-supervised approach with acquired data

Apply to another domain or semantic class
