Information Extraction and Named Entity Recognition-2
Post on 10-Feb-2018
TRANSCRIPT
7/22/2019 Information Extraction and Named Entity Recognition-2
Information Extraction and Named Entity Recognition
Introducing the tasks: Getting simple structured information out of text
Christopher Manning
Information Extraction
Information extraction (IE) systems:
  Find and understand limited relevant parts of texts
  Gather information from many pieces of text
  Produce a structured representation of relevant information:
    relations (in the database sense), a.k.a. a knowledge base
Goals:
1. Organize information so that it is useful to people
2. Put information in a semantically precise form that allows inferences to be made by computer algorithms
Information Extraction (IE)
IE systems extract clear, factual information
Roughly: Who did what to whom when?
E.g., gathering earnings, profits, board members, headquarters, etc. from company reports:
  The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia.
  → headquarters("BHP Billiton Limited", "Melbourne, Australia")
Learn drug-gene product interactions from medical research literature
Low-level information extraction
Is now availableand I think popularin applicationor Google mail, and web indexing
Often seems to be based on regular expressions and
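Such regex-driven, low-level extraction can be sketched as follows. The two patterns here are illustrative assumptions for the sketch, not any particular mail client's actual rules:

```python
import re

# Minimal regex-based low-level extraction, in the spirit of the slide:
# structured fields pulled from free text by pattern matching alone.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")  # illustrative US-style pattern

def extract_low_level(text):
    """Return simple structured fields found by pattern matching."""
    return {
        "emails": EMAIL.findall(text),
        "phones": PHONE.findall(text),
    }
```

Real recognizers add name lists and far more pattern variants, but the principle is the same.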
Low-level information extraction
Why is IE hard on the web?
[Figure: product web page. Need this: the title and the price; a book, not a toy]
How is IE useful? Classified Advertisements (Real Estate)
Background: Plain text advertisements
Lowest common denominator: only thing that 70+ newspapers using many different publishing systems can all handle
Why doesn't text search (IR) work?
What you search for in real estate advertisements:
Town/suburb. You might think easy, but:
  Real estate agents: Coldwell Banker, Mosman
  Phrases: Only 45 minutes from Parramatta
  Multiple property ads have different suburbs in one ad
Money: want a range, not a textual match
  Multiple amounts: was $155K, now $145K
  Variations: offers in the high 700s [but not rents for $270]
Bedrooms: similar issues: br, bdr, beds, B/R
Named Entity Recognition (NER)
A very important sub-task: find and classify names in text, for example:

  The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
Named Entity Recognition (NER)
The uses:
  Named entities can be indexed, linked off, etc.
  Sentiment can be attributed to companies or products
  A lot of IE relations are associations between named entities
  For question answering, answers are often named entities.
Concretely:
  Many web pages tag various entities, with links to bio or topic pages
  Reuters' OpenCalais, Evri, AlchemyAPI, Yahoo's Term Extraction, ...
  Apple/Google/Microsoft smart recognizers for document content
Information Extraction and Named Entity Recognition
Introducing the tasks: Getting simple structured information out of text
Evaluation of Named Entity Recognition
Precision, Recall, and the F measure; their extension to sequences
The 2-by-2 contingency table
              correct   not correct
selected        tp          fp
not selected    fn          tn
Precision and recall
Precision: % of selected items that are correct
Recall: % of correct items that are selected

              correct   not correct
selected        tp          fp
not selected    fn          tn
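These definitions translate directly into code; a minimal sketch in terms of the table's tp/fp/fn counts:

```python
def precision_recall(tp, fp, fn):
    # Precision: fraction of selected items that are correct.
    precision = tp / (tp + fp)
    # Recall: fraction of correct items that are selected.
    recall = tp / (tp + fn)
    return precision, recall
```

For example, 8 true positives with 2 false positives and 8 false negatives gives precision 0.8 but recall only 0.5.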
A combined measure: F
A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean):

  F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R)

The harmonic mean is a very conservative average; see section 8.3
People usually use balanced F1 measure
  i.e., with β = 1 (that is, α = ½): F = 2PR/(P+R)
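As a sketch, the weighted F measure (and its balanced F1 special case) in code; this also lets you check the quiz questions that follow:

```python
def f_measure(p, r, beta=1.0):
    # Weighted harmonic mean of precision and recall:
    #   F = (beta^2 + 1) * P * R / (beta^2 * P + R)
    # beta = 1 gives the balanced F1 = 2PR / (P + R).
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
```

Note how the harmonic mean punishes imbalance: P = 75%, R = 25% averages to 50% arithmetically, but F1 is only 37.5%.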
Quiz question
What is the F1?
P = 40%   R = 40%   F1 = ?
Quiz question
What is the F1?
P = 75%   R = 25%   F1 = ?
The Named Entity Recognition Task
Task: Predict entities in a text

  Foreign      ORG
  Ministry     ORG
  spokesman    O
  Shen         PER
  Guofang      PER
  told         O
  Reuters      ORG
  :            :

Standard evaluation is per entity, not per token
Precision/Recall/F1 for IE/NER
Recall and precision are straightforward for tasks like text categorization, where there is only one grain size (documents)
The measure behaves a bit funnily for IE/NER when there are boundary errors (which are common):
  First Bank of Chicago announced earnings ...
This counts as both a fp and a fn
Selecting nothing would have been better
Some other metrics (e.g., MUC scorer) give partial credit (according to complex rules)
Evaluation of Named Entity Recognition
Precision, Recall, and the F measure; their extension to sequences
Methods for doing NER and IE
The space; hand-written patterns
Three standard approaches to NER
1. Hand-written regular expressions
   Perhaps stacked
2. Using classifiers
   Generative: Naïve Bayes
   Discriminative: Maxent models
3. Sequence models
   HMMs
   CMMs/MEMMs
   CRFs
Hand-written Patterns for Information Extraction
If extracting from automatically generated web pages, simple regex patterns usually work.
[Example: an Amazon page]
MUC: the NLP genesis of IE
DARPA funded significant efforts in IE in the early to mid 1990s
Message Understanding Conference (MUC) was an annual event/competition where results were presented.
Focused on extracting information from news articles:
  Terrorist events
  Industrial joint ventures
  Company management changes
Starting off, all rule-based; gradually moved to ML
MUC Information Extraction Example

  Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and metal wood clubs a month.

JOINT-VENTURE-1
  Relationship: TIE-UP
  Entities: Bridgestone Sports Co., a local concern, a Japanese trading house
  Joint Ent: Bridgestone Sports Taiwan Co.
  Activity: ACTIVITY-1
  Amount: NT$20 000 000
ACTIVITY-1
  Activity: PRODUCTION
  Company: Bridgestone Sports Taiwan Co.
  Product: iron and metal wood clubs
  Start date: DURING: January 1990
Natural Language Processing-based Hand-written Information Extraction
For unstructured human-written text, some NLP may help:
  Part-of-speech (POS) tagging: mark each word as a noun, verb, preposition, etc.
  Syntactic parsing: identify phrases: NP, VP, PP
  Semantic word categories (e.g., from WordNet): KILL: kill, murder, assassinate, strangle, suffocate
[Figure: finite automaton with states 0-4; transitions labeled PN, Art, N, P]
Finite Automaton for Noun groups: John's interesting book with a nice cover
Grep++ = Cascaded grepping
Natural Language Processing-based Hand-written Information Extraction
We use cascaded regular expressions to match relations
  Higher-level regular expressions can use categories matched by lower-level expressions
  E.g., the CRIME-VICTIM pattern can use things matching NOUN-GROUP
This was the basis of the SRI FASTUS system in later MUCs
Example extraction pattern:
  Crime victim:
    Prefiller: [POS: V, Hypernym: KILL]
    Filler: [Phrase: NOUN-GROUP]
Rule-based Extraction Examples
Determining which person holds what office in what organization:
  [person], [office] of [org]
    Vuk Draskovic, leader of the Serbian Renewal Movement
  [org] (named, appointed, etc.) [person] Prep [office]
    NATO appointed Wesley Clark as Commander in Chief
Determining where an organization is located:
  [org] in [loc]
    NATO headquarters in Brussels
  [org] [loc] (division, branch, headquarters, etc.)
    KFOR Kosovo headquarters
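The first pattern, "[person], [office] of [org]", can be sketched as a regex. The capitalized-phrase sub-pattern standing in for [person] and [org] is a simplifying assumption for the sketch; a real system would use NER output rather than raw capitalization:

```python
import re

# Capitalized multi-word phrase, a crude stand-in for a named entity.
CAP = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# "[person], [office] of [org]" with an optional "the" before the org.
PERSON_OFFICE_ORG = re.compile(
    rf"({CAP}), ((?:[a-z]+ )*[a-z]+) of (the )?({CAP})"
)

def extract_person_office_org(text):
    """Return (person, office, org) for the first match, else None."""
    m = PERSON_OFFICE_ORG.search(text)
    if m:
        return m.group(1), m.group(2), m.group(4)
    return None
```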
Methods for doing NER and IE
The space; hand-written patterns
Information extraction as classification
Naïve use of text classification for IE
Use conventional classification algorithms to classify substrings of a document as "to be extracted" or not
In some simple but compelling domains, this naïve technique is remarkably effective.
But do think about when it would and wouldn't work!
Change of Address email
Change-of-Address detection [Kushmerick et al., ATEM 2001]
1. Classification: the message goes through a naïve-Bayes model → Yes/No (is this a change-of-address email?)
2. Extraction: each address in a "Yes" message is scored by an address-level naïve-Bayes model, e.g. P[robert@lousycorp.com], P[robert@cubemedia.com]
Example message:
  From: Robert Kubinsky  Subject: Email update
  ...everyone so.... My new email address is: robert@cubemedia.com Hope a...
Change-of-Address detection results [Kushmerick et al., ATEM 2001]
Corpus of 36 CoA emails and 5720 non-CoA emails
Results from 2-fold cross validation (train on half, test on other half)
Very skewed distribution intended to be realistic
Note very limited training data: only 18 training CoA messages
36 CoA messages have 86 email addresses; old, new, and miscellaneous

                         P     R     F1
Message classification   98%   97%   98%
Address classification   98%   68%   80%
Information extraction as classification
Sequence Models for Named Entity Recognition
The ML sequence model approach
Training
1. Collect a set of representative training documents
2. Label each token for its entity class or other (O)
3. Design feature extractors appropriate to the text and classes
4. Train a sequence classifier to predict the labels from the data
Testing
1. Receive a set of testing documents
2. Run sequence model inference to label each token
3. Appropriately output the recognized entities
Encoding classes for sequence labeling

            IO encoding   IOB encoding
  Fred        PER           B-PER
  showed      O             O
  Sue         PER           B-PER
  Mengqiu     PER           B-PER
  Huang       PER           I-PER
  's          O             O
  new         O             O
  painting    O             O
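A sketch of mechanically converting IO tags to IOB tags. Note the limitation this exposes, which is exactly the table's point: IO tagging cannot mark a boundary between adjacent same-class entities (Sue / Mengqiu Huang above), so the conversion merges them into one span:

```python
def io_to_iob(tokens_with_io):
    """Convert IO tags to IOB: the first token of each run of a class
    gets B-, later tokens of the same run get I-. Adjacent entities of
    the same class are merged, since IO cannot distinguish them."""
    iob = []
    prev = "O"
    for token, tag in tokens_with_io:
        if tag == "O":
            iob.append((token, "O"))
        elif tag == prev:
            iob.append((token, "I-" + tag))
        else:
            iob.append((token, "B-" + tag))
        prev = tag
    return iob
```

Running it on the example yields Sue B-PER but Mengqiu I-PER, unlike the hand-labeled IOB column, which is why systems label IOB directly.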
Features for sequence labeling
Words
Current word (essentially like a learned dictionary)
Previous/next word (context)
Other kinds of inferred linguistic classification
Part-of-speech tags
Label context
Previous (and perhaps next) label
Features: Word substrings
[Figure: counts of word substrings by class (drug, company, movie, place, person), e.g. for the substrings "oxa" and ":"; examples: Cotrimoxazole, Wethersfield, Alien Fury: Countdown to Invasion]
Features: Word shapes
Word Shapes
Map words to a simplified representation that encodes attributes such as length, capitalization, numerals, Greek letters, internal punctuation, etc.
  Varicella-zoster   Xx-xxx
  mRNA               xXXX
  CPA1               XXXd
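A minimal word-shape function. The collapsing rule here (keep short words whole, cap repeated-class runs at two) is an illustrative assumption; the slide's exact scheme differs slightly (it shows Xx-xxx for Varicella-zoster, where this sketch gives Xxx-xx):

```python
def word_shape(word, max_run=2, keep_short=4):
    """Map a word to a shape: uppercase -> X, lowercase -> x,
    digit -> d, other characters kept. Short words keep their full
    shape; longer words have same-class runs collapsed to max_run."""
    classes = []
    for ch in word:
        if ch.isupper():
            classes.append("X")
        elif ch.islower():
            classes.append("x")
        elif ch.isdigit():
            classes.append("d")
        else:
            classes.append(ch)
    if len(word) <= keep_short:
        return "".join(classes)
    out, run = [], 0
    for c in classes:
        run = run + 1 if out and out[-1] == c else 1
        if run <= max_run:
            out.append(c)
    return "".join(out)
```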
Sequence Models for Named Entity Recognition
Maximum entropy sequence models
Maximum entropy Markov models (MEMMs), a.k.a. Conditional Markov models
Sequence problems
Many problems in NLP have data which is a sequence of characters, words, phrases, lines, or sentences
We can think of our task as one of labeling each item

  VBG      NN           IN  DT  NN   IN  NN
  Chasing  opportunity  in  an  age  of  upheaval
  (POS tagging)

  B B I I B I
  (Word segmentation)

  PERS     O          O       O   ORG   ORG
  Murdoch  discusses  future  of  News  Corp.
  (Named entity recognition)
MEMM inference in systems
For a Conditional Markov Model (CMM), a.k.a. a Maximum Entropy Markov Model (MEMM), the classifier makes a single decision at a time, conditioned on evidence from observations and previous decisions
A larger space of sequences is usually explored via search

Local Context (decision point at position 0):
   -3    -2   -1    0     +1
   DT    NNP  VBD   ???   ???
   The   Dow  fell  22.6  %
Features: W0, W+1, W-1, T-1, T-1-T-2, hasDigit?
(Ratnaparkhi 1996; Toutanova et al. 2003, etc.)
Example: POS Tagging
Scoring individual labeling decisions is no more complex than standard classification decisions
  We have some assumed labels to use for prior positions
  We use features of those and the observed data (which can include current, previous, and next words) to predict the current label
Example: POS Tagging
POS tagging features can include:
  Current, previous, next words in isolation or together.
  Previous one, two, three tags.
  Word-internal features: word types, suffixes, dashes, etc.
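The feature set on these slides (W0, W+1, W-1, T-1, T-1-T-2, hasDigit?) can be sketched as a feature-extraction function; the boundary symbols are assumptions for the sketch:

```python
def local_features(words, tags, i):
    """Extract MEMM-style local-context features for position i:
    current/previous/next word, previous tag, previous tag pair,
    and a has-digit test on the current word."""
    feats = {
        "W0": words[i],
        "W-1": words[i - 1] if i > 0 else "<S>",       # sentence start marker
        "W+1": words[i + 1] if i + 1 < len(words) else "</S>",
        "T-1": tags[i - 1] if i > 0 else "<S>",
        "hasDigit": any(ch.isdigit() for ch in words[i]),
    }
    feats["T-1-T-2"] = feats["T-1"] + "," + (tags[i - 2] if i > 1 else "<S>")
    return feats
```

On the slide's example ("The Dow fell 22.6 %" with prior tags DT NNP VBD, deciding position 0 = "22.6"), this yields W0 = "22.6", T-1 = "VBD", T-1-T-2 = "VBD,NNP", hasDigit = True.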
Inference in Systems
[Diagram:
  Local level: Local Data → Feature Extraction → Features → Label; classifier type: Maximum Entropy Models; smoothing: quadratic penalties; optimization: conjugate gradient
  Sequence level: Sequence Data → Sequence Model → Inference]
Greedy Inference
Greedy inference:
  We just start at the left, and use our classifier at each position to assign a label
  The classifier can depend on previous labeling decisions as well as observed data
Advantages:
  Fast, no extra memory requirements
  Very easy to implement
  With rich features including observations to the right, it may perform quite well
Disadvantage:
  Greedy: we make commitments (possibly errors) we cannot recover from

[Diagram: Sequence Model → Inference → Best Sequence]
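A minimal sketch of greedy decoding, with a `score` function standing in for the trained local classifier:

```python
def greedy_decode(words, labels, score):
    """Greedy left-to-right inference: at each position, pick the label
    preferred by the local scoring function given the previous decision.
    score(words, i, prev_label, label) stands in for a trained classifier."""
    out = []
    prev = "<S>"  # assumed sentence-start marker
    for i, _ in enumerate(words):
        best = max(labels, key=lambda y: score(words, i, prev, y))
        out.append(best)
        prev = best  # commit: this decision can never be revisited
    return out
```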
Beam Inference
Beam inference:
  At each position keep the top k complete sequences.
  Extend each sequence in each local way.
  The extensions compete for the k slots at the next position.
Advantages:
  Fast; beam sizes of 3-5 are almost as good as exact inference in many cases.
  Easy to implement (no dynamic programming required).
Disadvantage:
  Inexact: the globally best sequence can fall off the beam.
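Beam inference can be sketched the same way; `k` is the beam size, and with k = 1 this reduces to greedy decoding:

```python
def beam_decode(words, labels, score, k=3):
    """Beam inference: keep the top-k partial sequences at each position,
    extend each with every label, and re-prune to k.
    score(words, i, prefix, label) stands in for a trained classifier."""
    beam = [((), 0.0)]  # (label sequence so far, cumulative score)
    for i, _ in enumerate(words):
        candidates = [
            (seq + (y,), s + score(words, i, seq, y))
            for seq, s in beam
            for y in labels
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:k]  # the k slots the extensions compete for
    return list(beam[0][0])
```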
Viterbi Inference
Viterbi inference:
  Dynamic programming or memoization.
  Requires small window of state influence (e.g., past two states are relevant).
Advantage:
  Exact: the globally best sequence is returned.
Disadvantage:
  Harder to implement long-distance state-state interactions (but beam inference tends not to allow long-distance resurrection of sequences anyway).
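Under the assumption that the sequence score decomposes into local transition scores `trans[prev][y]` and per-position emission scores `emit(word, y)` (a first-order model, i.e. only the immediately previous state matters), Viterbi decoding can be sketched as:

```python
def viterbi_decode(words, labels, trans, emit):
    """Exact Viterbi inference via dynamic programming.
    delta[y] = best score of any label sequence ending in y at the
    current position; back-pointers recover the argmax sequence."""
    delta = {y: emit(words[0], y) for y in labels}
    back = []
    for word in words[1:]:
        ptr, new_delta = {}, {}
        for y in labels:
            best_prev = max(labels, key=lambda p: delta[p] + trans[p][y])
            ptr[y] = best_prev
            new_delta[y] = delta[best_prev] + trans[best_prev][y] + emit(word, y)
        back.append(ptr)
        delta = new_delta
    # Follow back-pointers from the best final label.
    best = max(labels, key=lambda y: delta[y])
    seq = [best]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))
```

The small state window is what makes this exact search linear in sentence length (times the square of the label set size).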
Viterbi Inference: J&M Ch. 5/6
I'm punting on this here; read J&M Ch. 5/6.
I'll do dynamic programming for parsing.
Basically, providing you only look at neighboring states, you can dynamic-program a search for the optimal state sequence.
Viterbi Inference: J&M Ch. 5/6
CRFs [Lafferty, Pereira, and McCallum 2001]
Another sequence model: Conditional Random Fields (CRFs)
A whole-sequence conditional model rather than a chaining of local models
  The space of c's is now the space of sequences
  But if the features fi remain local, the conditional sequence likelihood can be calculated exactly using dynamic programming
Training is slower, but CRFs avoid causal-competition biases
These (or a variant using a max margin criterion) are seen as the state of the art these days... but in practice usually work much the same as MEMMs

  P(c|d, λ) = exp Σi λi fi(c,d)  /  Σc' exp Σi λi fi(c',d)
Maximum entropy sequence models
Maximum entropy Markov models (MEMMs), a.k.a. Conditional Markov models
The Full Task: Information Extraction
The Full Task of Information Extraction
Information Extraction = segmentation + classification + association
As a family of techniques:

  For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Now Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
  "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
  Richard Stallman, founder of the Free Software Foundation, countered saying...

Extracted entities and roles:
  Microsoft Corporation    CEO       Bill Gates
  Gates
  Microsoft
  Bill Veghte              VP        Microsoft
  Richard Stallman         founder   Free Software Foundation
Slide by Andrew McCallum
An Even Broader View
[Diagram: Spider → Document collection → Filter by relevance → IE (Segment / Classify / Associate / Cluster) → Load DB → Database → Query, Search; supporting steps: Create ontology; Label training data; Train extraction models]
Slide by Andrew McCallum
Landscape of IE Tasks: Document Formatting
Text paragraphs without formatting:
  Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University.
Grammatical sentences and some formatting & links
Non-grammatical snippets, rich formatting & links
Tables
Slide by Andrew McCallum
Landscape of IE Tasks: Intended Breadth of Coverage
  Web site specific (e.g., Amazon.com book pages; driven by formatting)
  Genre specific (e.g., resumes; driven by layout)
  Wide, non-specific (e.g., university names; driven by language)
Slide by Andrew McCallum
Landscape of IE Tasks: Complexity of entities/relations
Closed set (e.g., U.S. states):
  He was born in Alabama...
  The big Wyoming sky...
Regular set (e.g., U.S. phone numbers):
  Phone: (413) 545-13...
  The CALD main office is 412...1299
Complex pattern (e.g., U.S. postal addresses):
  University of Arkansas, P.O. Box 140, Hope, AR 71802
  Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210
Ambiguous patterns, needing context and many sources of evidence (e.g., person names):
  ...was among the six sold by Hope Feldman...
  Pawel Opalinski, Software Engineer at WhizBang
Slide by Andrew McCallum
Landscape of IE Tasks: Arity of relation
  Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.
Single entity (named entity extraction):
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut
Binary relationship (relation extraction):
  Relation: Person-Title
    Person: Jack Welch
    Title: CEO
  Relation: Company-Location
    Company: General Electric
    Location: Connecticut
N-ary record:
  Relation: Succession
    Company: General Electric
    Title: CEO
    Out: Jack Welch
    In: Jeffrey Immelt
Slide by Andrew McCallum
Association task = Relation Extraction
Checking if groupings of entities are instances of a relation
1. Manually engineered rules
   Rules defined over words/entities: "<company> located in <location>"
Relation Extraction: Disease Outbreaks
  May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis...
(text → Information Extraction System → table)
  Date        Disease Name
  Jan. 1995   Malaria
  July 1995   Mad Cow Disease
  Feb. 1995   Pneumonia
  May 1995    Ebola
Relation Extraction: Protein Interactions
  We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.
[Diagram: CBF-A interacts with CBF-C; together they form the CBF-A-CBF-C complex; CBF-B associates with the CBF-A-CBF-C complex]
Binary Relation Association as Binary Classification
  Christos Faloutsos conferred with Ted Senator, the KDD 2003 General Chair.
  Person-Role(Christos Faloutsos, KDD 2003 General Chair)
  Person-Role(Ted Senator, KDD 2003 General Chair)
Resolving coreference (both within and across documents)
  John Fitzgerald Kennedy was born at 83 Beals Street in Brookline, Massachusetts on May 29, 1917, at 3:00 pm,[7] the second son of Joseph P. Kennedy, Sr., and Rose Fitzgerald; Rose, in turn, was the eldest child of John "Honey Fitz" Fitzgerald, a prominent Boston politician who was the city's mayor and a three-term member of Congress. Kennedy lived in Brookline for ten years and attended Edward Devotion School, Noble and Greenough Lower School, and the Dexter School, through 4th grade. In 1927, the family moved to 5040 Independence Avenue in Riverdale, Bronx, New York City; two years later, they moved to 294 Pondfield Road in Bronxville, New York, where Kennedy was a member of Scout Troop 2 (and was the first Boy Scout to become President).[8] Kennedy spent summers with his family at their home in Hyannisport, Massachusetts, and Christmas and Easter holidays with his family at their winter home in Palm Beach, Florida. For the 5th through 7th grade, Kennedy attended Riverdale Country School, a private school for boys. For 8th grade in September 1930, the 13-year-old Kennedy attended Canterbury School in New Milford, Connecticut.
Rough Accuracy of Information Extraction
Errors cascade (an error in an entity tag → an error in relation extraction)
These are very rough, actually optimistic, numbers
  Hold for well-established tasks, but lower for many specific/novel IE tasks

  Information type   Accuracy
  Entities           90-98%
  Attributes         80%
  Relations          60-70%
  Events             50-60%
The Full Task: Information Extraction