usami bionlp2011
Automatic Acquisition of Huge Training Data for Bio-Medical Named Entity Recognition
Yu Usami, Han-Cheol Cho, Naoaki Okazaki, and Jun’ichi Tsujii
Graduate School of Information Science and Technology, University of Tokyo
Introduction
Named Entity Recognition
Example: AM , cystatin C and cathepsin B are present as ...
Recent approach:
Machine learning on a manually annotated corpus
• BioCreAtIvE task 1A (Yeh et al., 2005)
• Semi-supervised (Vlachos and Gasperin, 2010)
Labeled example: AM/B ,/O cystatin/B C/I and/O cathepsin/B B/I are/O present/O as/O
Labels:
• B: Beginning of NE
• I: Inside of NE
• O: Out of NE
Manual annotation is expensive:
• Cost
• Time
Our Idea
Utilize inexpensive and large resources:
• Lexical database
• Unlabeled text
1. Build a dictionary from the lexical database
2. String-match the dictionary against the unlabeled text
3. Acquire an annotated corpus for training
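The string-match step can be sketched as below. This is only an illustration, not the authors' implementation: whitespace tokenization, greedy longest-match, and the names `label_by_string_match` and `max_len` are assumptions made for the example.

```python
def label_by_string_match(tokens, dictionary, max_len=5):
    """Assign B/I/O labels by greedy longest-match dictionary lookup.

    `dictionary` is a set of token tuples. `max_len` is an assumption of
    this sketch: it caps the entry length tried at each position.
    """
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            labels[i] = "B"                      # beginning of NE
            for j in range(i + 1, i + matched):
                labels[j] = "I"                  # inside of NE
            i += matched
        else:
            i += 1                               # stays "O"
    return labels


# Toy dictionary and the example sentence from the Introduction slide.
dictionary = {("AM",), ("cystatin", "C"), ("cathepsin", "B")}
tokens = "AM , cystatin C and cathepsin B are present as".split()
labels = label_by_string_match(tokens, dictionary)
print(list(zip(tokens, labels)))
```

Every matched span becomes training signal without any manual annotation, which is what makes the approach cheap, and also why ambiguous names become a problem.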
Dictionary Building
Example record in the lexical database:
• Symbol: CD177
• Official Name: CD177 molecule
• Synonyms: NB1, PRV1, HNA2A, CD177
Resulting dictionary entries:
CD177, CD177 molecule, NB1, PRV1, HNA2A
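A minimal sketch of this step, assuming each record has already been parsed into a plain dict. The field names `symbol`, `official_name` and `synonyms` are hypothetical and do not reflect the actual Entrez Gene file format.

```python
def build_dictionary(records):
    """Collect every distinct name (symbol, official name, synonyms) of each record."""
    names = set()
    for record in records:
        # Field names are hypothetical; adapt them to the real record layout.
        names.add(record["symbol"])
        names.add(record["official_name"])
        names.update(record["synonyms"])
    return names


record = {
    "symbol": "CD177",
    "official_name": "CD177 molecule",
    "synonyms": ["NB1", "PRV1", "HNA2A", "CD177"],
}
entries = build_dictionary([record])
print(sorted(entries))
```

Using a set removes the duplicate `CD177`, which appears both as the symbol and among the synonyms.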
Task Settings
Task: Single class NER
Target Class: Gene-or-gene-product (GGP)
Resources:
• Lexical database: Entrez Gene
  includes 6,816,109 gene (protein) records
• Unlabeled text: 2009 MEDLINE
  includes 17,764,827 articles
Ineffectiveness of Simple Approach
Dictionary-based NER
• String-match the dictionary against the test data
ML-based NER trained on acquired training data
1. String-match the dictionary against the unlabeled text to obtain training data
2. Learn a model from the training data
3. Apply the model to the test data
Results:

| Method | P | R | F1 |
|---|---|---|---|
| Dic-based | 39.03 | 42.69 | 40.78 |
| ML-based | 10.18 | 23.83 | 14.27 |
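The figures on this slide are consistent with F1 being the harmonic mean of precision and recall. The slides do not spell out the formula, so the standard definition is assumed in this quick check.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the two simple approaches, in percent.
dic_f1 = f1_score(39.03, 42.69)
ml_f1 = f1_score(10.18, 23.83)
print(round(dic_f1, 2), round(ml_f1, 2))
```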
Problem of Simple Approach
Stats: Acquired 1,715,344,107 labeled tokens including 10.0% NEs
Examples:
(A) PMID 1984484: It is clear that in culture media of AM, cystatin C and cathepsin B are present as proteinase-antiproteinase complexes.
(B) PMID 23456: Temperature in puerperium is higher in AM, lower in PM.
Goal of This Study
Our Contribution
Acquire huge high-quality training data with lexical database and unlabeled text
Methodology
1. Utilize references (links) for disambiguation
2. Expand NEs based on coordination analysis
3. Gain new NEs by using self-training
Disambiguation
Utilize lexical database references
• Record: AM
• Reference: PMID 1984484
(A) PMID 1984484: It is clear that in culture media of AM, cystatin C and cathepsin B are present as proteinase-antiproteinase complexes.
(B) PMID 23456: Temperature in puerperium is higher in AM, lower in PM.
A name is string-matched only in articles that its record refers to, so "AM" is labeled in (A) but not in (B).
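A rough sketch of reference-based disambiguation. It assumes single-token names and a hypothetical `references` mapping from each dictionary name to the set of PMIDs its record refers to; the real system presumably handles multi-token names as well.

```python
def match_if_referred(pmid, tokens, references):
    """Label a token as an NE only if its record references this article.

    `references` maps a name to the set of PMIDs cited by its record.
    Single-token names only: a simplification made for this sketch.
    """
    return ["B" if pmid in references.get(tok, set()) else "O" for tok in tokens]


references = {"AM": {1984484}}  # toy data taken from the slide

tokens_a = "in culture media of AM , cystatin C".split()
tokens_b = "higher in AM , lower in PM".split()

labels_a = match_if_referred(1984484, tokens_a, references)
labels_b = match_if_referred(23456, tokens_b, references)
print(labels_a)
print(labels_b)
```

In the toy data only the record for AM is present, so "cystatin C" stays unlabeled here; with a full database it would be matched if its own record cited the article.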
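Side Effect of Using References
Lack of references in the lexical database

| Record | entA | entB | entC |
|---|---|---|---|
| Reference (PMID) | 19025 | 1021 | 4928016 |

PMID 4928016: ... three genes concerned (designated entA, entB and entC)
String match only if referred: in PMID 4928016 only entC is matched, so entA and entB are missed.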
Expand NEs based on coordination structure
Records: entA, entB, entC
Reference: PMID 4928016
PMID 4928016: ... three genes concerned (designated entA, entB and entC)
Coordination Analysis
1. Start from the NE matched through the reference.
2. Is the neighboring token a coordinate token (here "and" and ",")? If it is not a coordinate token, end.
3. Is the mention next to the coordinate token included in the dictionary? If yes, label it as an NE and repeat from step 2. If it is not included, end.
Result: entA, entB and entC are all labeled as NEs.
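The coordination walk can be sketched as follows. Several details are assumptions rather than facts from the slides: the set of coordinate tokens, the restriction to single-token mentions, and walking in both directions from the seed NE.

```python
# Assumed set; the slides only mark individual tokens as "coordinate token".
COORDINATE_TOKENS = {",", "and", "or"}


def expand_by_coordination(tokens, labels, dictionary):
    """Spread NE labels across coordinations such as "entA , entB and entC"."""
    labels = list(labels)
    seeds = [i for i, lab in enumerate(labels) if lab == "B"]
    for seed in seeds:
        for step in (-1, 1):              # walk left, then right, from the seed
            pos = seed
            while True:
                coord, mention = pos + step, pos + 2 * step
                if not 0 <= mention < len(tokens):
                    break                 # ran off the sentence
                if tokens[coord] not in COORDINATE_TOKENS:
                    break                 # not a coordinate token: end
                if tokens[mention] not in dictionary:
                    break                 # not included in the dictionary: end
                labels[mention] = "B"
                pos = mention
    return labels


tokens = "three genes concerned ( designated entA , entB and entC )".split()
seed_labels = ["O"] * len(tokens)
seed_labels[tokens.index("entC")] = "B"   # only entC was matched via its reference
expanded = expand_by_coordination(tokens, seed_labels, {"entA", "entB", "entC"})
print(list(zip(tokens, expanded)))
```

Requiring both a coordinate token and a dictionary hit keeps the expansion conservative: the walk stops at "designated", which is neither.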
Self-training
1. Learning: train a classifier model on the training data
2. Apply: run the model on the remaining data
3. Add new NEs to the training data
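The loop can be sketched generically. The slides do not say how new NEs are selected or how many rounds are run, so `rounds`, the selection rule (keep any sentence in which the model finds an NE), and the toy context-based classifier standing in for the real model are all assumptions.

```python
def self_train(labeled, unlabeled, train, predict, rounds=1):
    """Grow the labeled set with the model's own predictions on remaining data."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)                                  # Learning
        predicted = [(toks, predict(model, toks)) for toks in remaining]  # Apply
        gained = [(t, l) for t, l in predicted if "B" in l]     # new NEs found
        if not gained:
            break
        labeled.extend(gained)                                  # Add new NEs
        remaining = [t for t, l in predicted if "B" not in l]
    return labeled


def train_context_model(labeled):
    """Toy model: remember which tokens immediately precede an NE."""
    contexts = set()
    for tokens, labels in labeled:
        for i, lab in enumerate(labels):
            if lab == "B" and i > 0:
                contexts.add(tokens[i - 1])
    return contexts


def predict_with_context(model, tokens):
    """Toy prediction: a token is an NE if its left neighbor is a known context."""
    return ["B" if i > 0 and tokens[i - 1] in model else "O"
            for i in range(len(tokens))]


labeled = [(["gene", "entA", "is"], ["O", "B", "O"])]
unlabeled = [["gene", "entX", "was"], ["no", "names", "here"]]
result = self_train(labeled, unlabeled, train_context_model, predict_with_context)
print(result)
```

The toy model learns that "gene" precedes an NE and therefore picks up "entX", a name that is not in the initial labeled data.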
Evaluation Settings
Test corpus: BioNLP 2011 Shared Task EPI corpus (Training set + Development set)
Learning and Decoding: Linear kernel SVM (Predict each token label sequentially)
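Sequential decoding means each token's label is predicted left to right, and earlier predictions can feed into later ones. The sketch below shows only that control flow: the feature set is an assumption, and `toy_classify` is a hypothetical rule-based stand-in for the trained linear SVM.

```python
def decode_sequentially(tokens, classify):
    """Predict labels left to right, feeding each prediction into the next step."""
    labels = []
    for i, token in enumerate(tokens):
        features = {
            "token": token,
            "prev_token": tokens[i - 1] if i > 0 else "<s>",
            "prev_label": labels[-1] if labels else "O",
        }
        labels.append(classify(features))
    return labels


def toy_classify(features):
    """Hypothetical stand-in for the trained classifier."""
    token = features["token"]
    if token in {"cystatin", "cathepsin"}:
        return "B"
    # A single capital letter right after a B label continues the name.
    if features["prev_label"] == "B" and len(token) == 1 and token.isupper():
        return "I"
    return "O"


tokens = "cystatin C and cathepsin B are present".split()
predicted = decode_sequentially(tokens, toy_classify)
print(list(zip(tokens, predicted)))
```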
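NER Results

| Type | Method | Prec. | Recall | F1 |
|---|---|---|---|---|
| Dic-based | String match | 39.03 | 42.69 | 40.78 |
| Dic-based | + References | 90.62 | 13.52 | 23.53 |
| Dic-based | + Coord Analysis | 89.66 | 13.77 | 23.87 |
| ML-based | String match | 10.18 | 23.83 | 14.27 |
| ML-based | + References | 69.25 | 39.12 | 50.00 |
| ML-based | + Coord Analysis | 66.79 | 47.44 | 55.47 |
| ML-based | + Self-training | 63.72 | 51.18 | 56.77 |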
Automatic vs Manual

| Type | Total tokens | NE tokens |
|---|---|---|
| Manual | 161,577 | 12,603 |
| Automatic | 48,677,426 | 3,055,362 |

NER Performance (trained on each corpus)

| Trained on | P | R | F1 |
|---|---|---|---|
| Manual | 80.76 | 58.56 | 67.89 |
| Automatic | 68.26 | 57.92 | 62.66 |
Conclusion
Acquired high-quality training data automatically
• Use of references for high precision
• Improve recall with
  ‣ Coordination analysis
  ‣ Self-training
Acquired large size training data
• Used 10% (Memory limitation)
Future Work
Utilize all of the acquired training data for learning
  ‣ Online learning
Improve self-training performance
Semi-supervised approach with acquired data
Apply to another domain or semantic class