IN4080 – 2018 FALL NATURAL LANGUAGE PROCESSING, Jan Tore Lønning


Lecture 14, 31 Oct

Information extraction, pipelines


Today

Dependency parsing:

Wrap-up

Evaluation

Treebanks and pipelines

Information extraction, IE

Chunking

Named entity recognition

Relation extraction, 5 different ways


Dependency structure

• Words are connected to each other by directed links, called dependencies

• Asymmetric relationship between a head word and its dependents:

• A → B

• A governs B

• B depends on A

Figures from Nivre: Dependency Grammar and Dependency Structure


Data-driven text parsing

Induce syntactic analysis directly from a treebank

Only interested in one/the correct/the best analysis (or analyses)

Components:

1. A formal model M defining possible analyses of sentences in L

2. A sample of annotated text S = (y1, …, ym) from L

3. An inductive inference scheme J defining analyses for a text T = (x1, …, xn) given M and S


Transition-based dependency parsing


Arc eager

1. LEFTARC: Assert a head-dependent relation between the word at the front of the input buffer and the word at the top of the stack; pop the stack.

2. RIGHTARC: Assert a head-dependent relation between the word on the top of the stack and the word at the front of the input buffer; shift the word at the front of the input buffer to the stack.

3. SHIFT: Remove the word from the front of the input buffer and push it onto the stack.

4. REDUCE: Pop the stack
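The four transitions can be sketched as operations on a stack and a buffer. This is a minimal illustration, not a full parser: the names `Config`, `left_arc`, `right_arc`, `shift`, `reduce_op` and `parse` are made up for the sketch, words are identified by their string (so repeated words would collide), the usual preconditions on each transition are not checked, and the transition sequence is chosen by hand rather than predicted by a classifier.

```python
from dataclasses import dataclass, field


@dataclass
class Config:
    """Parser state: a stack, an input buffer, and the arcs built so far."""
    stack: list
    buffer: list
    arcs: set = field(default_factory=set)  # (head, dependent) pairs


def left_arc(c):
    # Front of the buffer becomes head of the stack top; pop the stack.
    c.arcs.add((c.buffer[0], c.stack.pop()))


def right_arc(c):
    # Stack top becomes head of the front of the buffer; shift that word.
    c.arcs.add((c.stack[-1], c.buffer[0]))
    c.stack.append(c.buffer.pop(0))


def shift(c):
    # Move the front of the buffer onto the stack.
    c.stack.append(c.buffer.pop(0))


def reduce_op(c):
    # Pop the stack (the top word already has its head).
    c.stack.pop()


def parse(words, transitions):
    """Apply a given sequence of transitions and return the resulting arcs."""
    c = Config(stack=["ROOT"], buffer=list(words))
    for transition in transitions:
        transition(c)
    return c.arcs


# Hand-picked transition sequence for a toy sentence (illustrative only).
arcs = parse(["She", "eats", "fish"], [shift, left_arc, right_arc, right_arc])
print(sorted(arcs))
```

In a trained parser, a classifier would pick the next transition from features of the current configuration.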


Non-projective parsers

Some languages are best described by non-projective structures

How to parse?

Alt 1: Use a form of projective structures during parsing, and transform the structures afterwards

Alt 2: Some sort of graph-based parsing


Graph-based parsing

An alternative to transition-based parsing

The model is based on the idea of constructing all possible structures and ranking them

Can handle non-projective structures

Less efficient


Deep-learning approaches to dependency parsing

Neural architectures have taken over from more traditional machine learners

Irrespective of transition-based or graph-based

Less need for manual feature engineering

Better results

See IN5550


Evaluation

Two standard measures of evaluation

Labeled attachment score (LAS)

Proportion of words that are ascribed the correct head and dependency label.

Unlabeled attachment score (UAS)

Proportion of words that are ascribed the correct head.
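Both scores are simple proportions over the words of the test set. A minimal sketch, assuming each word is represented as a hypothetical `(head_index, label)` pair and that gold and predicted analyses are aligned word by word; the function name and the toy labels are made up for the illustration:

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two aligned lists of (head_index, label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    n = len(gold)
    # UAS: only the head has to be correct.
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n
    # LAS: both head and dependency label have to be correct.
    las = sum(g == p for g, p in zip(gold, predicted)) / n
    return uas, las


gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
print(uas, las)  # 0.75 0.5
```

The third word has the right head but the wrong label, so it counts for UAS but not for LAS; the fourth word has the wrong head and counts for neither.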


Evaluation example


Results from the days of Malt parser (2007)


Treebanks

Free dependency treebanks are available for many languages

The place to start these days: http://universaldependencies.org/

CoNLL formats:

One word per line, a number of columns for various information

CoNLL-X, CoNLL-U – different POSTAGs
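To make "one word per line, a number of columns" concrete, here is a small sketch that reads CoNLL-U text into dictionaries. The ten column names follow the CoNLL-U format; the function name `read_conllu` and the sample sentence are made up for the illustration, and multiword-token ranges and empty nodes get no special treatment here.

```python
# The ten tab-separated columns of a CoNLL-U word line.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Split CoNLL-U text into sentences, each a list of column dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        else:
            current.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


sample = (
    "# text = She eats fish\n"
    "1\tShe\tshe\tPRON\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\teats\teat\tVERB\tVBZ\t_\t0\troot\t_\t_\n"
    "3\tfish\tfish\tNOUN\tNN\t_\t2\tobj\t_\t_\n"
)
sents = read_conllu(sample)
for word in sents[0]:
    print(word["form"], word["upos"], word["head"], word["deprel"])
```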

From Andrei's INF5830 slides


Dependency parsers

There are many freely available and trainable dependency parsers

Malt parser: http://www.maltparser.org/

UDPipe: http://ufal.mff.cuni.cz/udpipe

Online demo: http://lindat.mff.cuni.cz/services/udpipe/

Stanford CoreNLP: https://stanfordnlp.github.io/CoreNLP/depparse.html

Online demo: http://corenlp.run/

spaCy: https://spacy.io/

And more


IE basics

Bottom-Up approach

Start with unrestricted texts, and do the best you can

The approach was in particular developed by the Message Understanding Conferences (MUC) in the 1990s

Select a particular domain and task

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. (Wikipedia)


Steps

(Figure from NLTK)

(Some approaches do these steps in a different order, or simultaneously)


Next steps

Chunk words together into phrases


NP-chunks

Exactly what is an NP-chunk?

It is an NP

But not all NPs are chunks

Flat structure: no NP-chunk is part of another NP-chunk

Maximally large

These are opposing restrictions

[ The/DT market/NN ] for/IN

[ system-management/NN software/NN ] for/IN

[ Digital/NNP ]

[ 's/POS hardware/NN ] is/VBZ fragmented/JJ enough/RB that/IN

[ a/DT giant/NN ] such/JJ as/IN

[ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB there/RB ./.


Regular Expression Chunker

Input POS-tagged sentences

Use a regular expression over POS to identify NP-chunks

NLTK example:

grammar = r"""NP: {<DT|PP\$>?<JJ>*<NN>}
              {<NNP>+} """

It inserts parentheses
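The idea of a regular expression over POS tags can be sketched with the standard `re` module alone. This is not NLTK's implementation: the two rules of the grammar above are merged into one alternation and matched left to right, which gives the same chunks only when the rules cannot compete for the same tokens. `NP_PATTERN`, `np_chunks` and the example sentence are made up for the illustration.

```python
import re

# Tag pattern corresponding to {<DT|PP\$>?<JJ>*<NN>} and {<NNP>+}.
NP_PATTERN = re.compile(r"(?:<(?:DT|PP\$)>)?(?:<JJ>)*<NN>|(?:<NNP>)+")


def np_chunks(tagged):
    """Return (start, end) token spans of NP-chunks in a POS-tagged sentence."""
    # Write the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # Map character offsets in tag_string back to token indices.
    offset_to_token = {}
    pos = 0
    for i, (_, tag) in enumerate(tagged):
        offset_to_token[pos] = i
        pos += len(tag) + 2  # the tag plus the two angle brackets
    offset_to_token[pos] = len(tagged)
    return [(offset_to_token[m.start()], offset_to_token[m.end()])
            for m in NP_PATTERN.finditer(tag_string)]


sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("Rapunzel", "NNP")]
for start, end in np_chunks(sentence):
    print(" ".join(word for word, _ in sentence[start:end]))
```

With NLTK itself, the same grammar string would be passed to `nltk.RegexpParser` and applied with its `parse` method.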


IOB-tags

B-NP: First word in NP

I-NP: Part of NP, not first word

O: Not part of NP (phrase)

Properties

One tag per token

Unambiguous

Does not insert anything in the text itself
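A small sketch of the encoding, assuming chunks are given as hypothetical `(start, end)` token spans with an exclusive end. The two helper names are made up for the illustration; the round trip shows that the tags determine the chunks uniquely.

```python
def spans_to_iob(n_tokens, np_spans):
    """Encode NP-chunk spans as one IOB tag per token."""
    tags = ["O"] * n_tokens
    for start, end in np_spans:
        tags[start] = "B-NP"
        for i in range(start + 1, end):
            tags[i] = "I-NP"
    return tags


def iob_to_spans(tags):
    """Decode IOB tags back into (start, end) spans."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a final chunk
        if tag != "I-NP" and start is not None:
            spans.append((start, i))
            start = None
        if tag == "B-NP":
            start = i
    return spans


# "The market for system-management software"
tags = spans_to_iob(5, [(0, 2), (3, 5)])
print(tags)  # ['B-NP', 'I-NP', 'O', 'B-NP', 'I-NP']
print(iob_to_spans(tags))  # [(0, 2), (3, 5)]
```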


Assigning IOB-tags

The process can be considered a form of tagging

POS-tagging: Word to POS-tag

IOB-tagging: POS-tag to IOB-tag

But one may also use additional features, e.g. words

Can use various types of classifiers

NLTK uses a MaxEnt Classifier (=LogReg, but the implementation is slow)

We can modify along the lines of mandatory assignment 2, using scikit-learn

(Figure from J&M, 3. ed.)

Evaluating (IOB-)chunkers

import nltk
cp = nltk.RegexpParser("")
test_sents = nltk.corpus.conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))

IOB Accuracy: 43.4%

Precision: 0.0%

Recall: 0.0%

F-Measure: 0.0%

What do we evaluate?

IOB-tags? or

Whole chunks?

Yields different results

For IOB-tags:

Baseline: majority class O, yields > 33%

Whole chunks:

Which chunks did we find?

Harder

Lower numbers
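The difference between the two ways of counting can be seen in a small sketch. Tag accuracy compares tags token by token, while chunk-level precision and recall count a predicted chunk only if its span matches a gold chunk exactly. The function names and the toy data are made up for the illustration.

```python
def iob_accuracy(gold_tags, pred_tags):
    """Proportion of tokens that received the correct IOB tag."""
    correct = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return correct / len(gold_tags)


def chunk_prf(gold_spans, pred_spans):
    """Precision, recall and F-measure over whole chunks (exact span match)."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold_tags = ["B-NP", "I-NP", "O", "B-NP", "I-NP"]
gold_spans = [(0, 2), (3, 5)]

# A chunker that finds nothing still gets some tags right.
print(iob_accuracy(gold_tags, ["O"] * 5), chunk_prf(gold_spans, []))

# One wrong tag costs one token in accuracy but a whole chunk in precision/recall.
pred_tags = ["B-NP", "I-NP", "O", "B-NP", "O"]
pred_spans = [(0, 2), (3, 4)]
print(iob_accuracy(gold_tags, pred_tags), chunk_prf(gold_spans, pred_spans))
```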


Evaluating (IOB-)chunkers, contd.

Baseline, empty grammar: IOB Accuracy 43.4%, Precision 0.0%, Recall 0.0%, F-Measure 0.0%

cp = nltk.RegexpParser(r"NP: {<[CDJNP].*>+}")
print(cp.evaluate(test_sents))

IOB Accuracy: 87.7%

Precision: 70.6%

Recall: 67.8%

F-Measure: 69.2%


Named entities

Named entity:

Anything you can refer to by a proper name

i.e. not all NP (chunks):

high fuel prices

Maybe longer NP than just chunk:

Bank of America

Find the phrases

Classify them

Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].


Types of NE

The set of types varies between different systems

Which classes are useful depends on the application


Ambiguities


Gazetteer

Useful: List of names, e.g. Gazetteer: list of geographical names

But does not remove all ambiguities

cf. example


Representation (IOB)


Feature-based NER

Similar to tagging and chunking

You will need features from several layers

Features may include

Words, POS-tags, chunk-tags, graphical properties

and more (see J&M, 3. ed.)
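A feature extractor along these lines might look as follows. It is a sketch under assumptions: the sentence is a list of hypothetical `(word, pos_tag, chunk_tag)` triples, the graphical properties are reduced to a word shape and a capitalization flag, and the feature names are made up. The resulting dictionaries could be fed to a classifier that accepts feature dicts.

```python
def word_shape(word):
    """Map characters to X/x/d, e.g. 'AMR' -> 'XXX', 'Corp.' -> 'Xxxx.'."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


def token_features(sent, i):
    """Features for token i of a sentence of (word, pos_tag, chunk_tag) triples."""
    word, pos, chunk = sent[i]
    return {
        "word": word.lower(),
        "pos": pos,
        "chunk": chunk,
        "shape": word_shape(word),
        "is_capitalized": word[:1].isupper(),
        # Neighbouring words, with padding symbols at the sentence edges.
        "prev_word": sent[i - 1][0].lower() if i > 0 else "<s>",
        "next_word": sent[i + 1][0].lower() if i < len(sent) - 1 else "</s>",
    }


sent = [("American", "NNP", "B-NP"), ("Airlines", "NNPS", "I-NP"),
        ("matched", "VBD", "O")]
features = token_features(sent, 0)
print(features)
```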


Feature-based NER algorithms

Greedy decoding

"Word by word": decide for the first word, then for the second word, etc.

Can use various learners, e.g. logistic regression (MaxEnt)

We can use our set-up for mandatory assignment 2 with minor adjustments

For shortcomings and better alternatives, cf. lecture 9 / J&M, 3. ed., ch. 8:

Maximum Entropy Markov Models (MEMM)

Conditional random fields (preferred approach until recently)


Neural NER

In recent years, neural architectures have shown the best results

J&M, 3. ed., ch. 17, sec. 17.1.3, not on the curriculum in IN4080

See IN5550


Evaluation

Have we found the correct named entities?

Evaluate precision and recall as for chunking

For the correctly identified named entities, have we labelled them correctly?


Goal

Extract the relations that exist between the (named) entities in the text

A fixed set of relations (normally)

Determined by application:

Jeopardy

Preventing terrorist attacks

Detecting illness from medical records

• Born_in

• Date_of_birth

• Parent_of

• Author_of

• Winner_of

• Part_of

• Located_in

• Acquire

• Threaten

• Has_symptom

• Has_illness


Examples


Methods for relation extraction

1. Hand-written patterns

2. Machine Learning (Supervised classifiers)

3. Semi-supervised classifiers via bootstrapping

4. Semi-supervised classifiers via distant supervision

5. Unsupervised


1. Hand-written patterns

Example: acquisitions

[ORG] … (buy(s) | bought | acquire(s|d)) … [ORG]

Hand-write patterns like this

Properties:

High precision

Will only cover a small set of patterns

Low recall

Time consuming

(Also in NLTK, sec 7.6)
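The acquisition pattern can be written as an ordinary regular expression over text where the named entities are already bracketed as in the NER example. This is a sketch under assumptions: the bracket notation as input format, the restriction that no other entity may occur between the two organizations, and the names `ACQUIRE` and `extract_acquisitions` are all choices made for the illustration.

```python
import re

ACQUIRE = re.compile(
    r"\[ORG (?P<buyer>[^\]]+)\]"                        # first organization
    r"[^\[\]]*?\b(?:buys?|bought|acquire[sd]?)\b[^\[\]]*?"  # trigger verb
    r"\[ORG (?P<target>[^\]]+)\]"                       # second organization
)


def extract_acquisitions(text):
    """Return (buyer, target) pairs matched by the hand-written pattern."""
    return [(m.group("buyer"), m.group("target"))
            for m in ACQUIRE.finditer(text)]


print(extract_acquisitions(
    "[ORG IBM] has acquired [ORG AlchemyAPI] to broaden its portfolio."))
# The pattern misses "is buying", which illustrates the low recall.
print(extract_acquisitions(
    "[ORG IBM] is buying machine-learning systems maker [ORG AlchemyAPI Inc.]"))
```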


Example


2. Supervised classifiers

A corpus

A fixed set of entities and relations

The sentences in the corpus are hand-annotated:

Entities

Relations between them

Split the corpus into parts for training and testing

Train a classifier:

Choose learner: Naive Bayes, Logistic regression (Max Ent), SVM, …

Select features


2. Supervised classifiers, contd.

Training:

Use pairs of entities within the same sentence with no relation between them as negative data

Classification

1. Find the named entities

2. For each pair of named entities, determine whether there is a relation between them

3. If there is, label the relation
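Generating training instances, including the negative data, can be sketched as follows. The assumptions are that relations are directed (so ordered pairs are used), that the annotation is available as a hypothetical dictionary from entity pairs to labels, and that `NO_RELATION` is the name chosen for the negative class; the `Part_of` annotation is only an illustration.

```python
from itertools import permutations


def candidate_pairs(entities, gold_relations):
    """Label every ordered entity pair in one sentence.

    entities: entity strings found in the sentence.
    gold_relations: dict mapping (e1, e2) to a relation label.
    Pairs without an annotated relation become negative examples.
    """
    instances = []
    for e1, e2 in permutations(entities, 2):
        label = gold_relations.get((e1, e2), "NO_RELATION")
        instances.append(((e1, e2), label))
    return instances


entities = ["American Airlines", "AMR", "Tim Wagner"]
gold = {("American Airlines", "AMR"): "Part_of"}
instances = candidate_pairs(entities, gold)
for pair, label in instances:
    print(pair, label)
```

Three entities give six ordered pairs, of which only one is positive, so negative examples tend to dominate the training data.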


Examples of features

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said


Properties

The bottleneck is the availability of training data

Hand-labeling data is time consuming

Mostly applied to restricted domains

Does not generalize well to other domains


3. Semi-supervised, bootstrapping

If we know a pattern for a relation we can determine whether a pair stands in the relation

Conversely: If we know that a pair stands in a relation, we can find patterns that describe the relation

Pairs:

IBM – AlchemyAPI

Google – YouTube

Facebook – WhatsApp

Patterns:

[ORG]…bought…[ORG]

Relation:

ACQUIRE


Example

(IBM, AlchemyAPI): ACQUIRE

Search for sentences containing IBM and AlchemyAPI

Results (web search, Google, among the first 10 results):

IBM's Watson makes intelligent acquisition of Denver-based AlchemyAPI (Denver Post)

IBM is buying machine-learning systems maker AlchemyAPI Inc. to bolster its Watson technology as competition heats up in the data analytics and artificial intelligence fields. (Bloomberg)

IBM has acquired computing services provider AlchemyAPI to broaden its portfolio of Watson-branded cognitive computing services. (ComputerWorld)


Example contd.

Extract patterns

IBM's Watson makes intelligent acquisition of Denver-based AlchemyAPI (Denver Post)

IBM is buying machine-learning systems maker AlchemyAPI Inc. to bolster its Watson technology as competition heats up in the data analytics and artificial intelligence fields. (Bloomberg)

IBM has acquired computing services provider AlchemyAPI to broaden its portfolio of Watson-branded cognitive computing services. (ComputerWorld)


Procedure

From the extracted sentences, we extract patterns

Use these patterns to extract more pairs of entities that stand in these patterns

These pairs may again be used for extracting more patterns, etc.

…makes intelligent acquisition …

… is buying …

… has acquired …
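The alternation between pairs and patterns can be sketched as a loop. Everything here is simplified for illustration: a pattern is just the literal text between the two entities, entities are matched against a given set of names instead of NER output, and the toy sentences are shortened versions without intervening material. The function names are made up.

```python
import re


def find_patterns(sentences, pairs):
    """Collect the text between the two members of each known pair."""
    patterns = set()
    for s in sentences:
        for a, b in pairs:
            m = re.search(re.escape(a) + r"\s+(.+?)\s+" + re.escape(b), s)
            if m:
                patterns.add(m.group(1))
    return patterns


def find_pairs(sentences, patterns, entities):
    """Find entity pairs connected by one of the known patterns."""
    names = "|".join(re.escape(e)
                     for e in sorted(entities, key=len, reverse=True))
    pairs = set()
    for s in sentences:
        for p in patterns:
            m = re.search(rf"({names})\s+{re.escape(p)}\s+({names})", s)
            if m:
                pairs.add((m.group(1), m.group(2)))
    return pairs


def bootstrap(sentences, seed_pairs, entities, rounds=2):
    """Alternate between extracting patterns and extracting pairs."""
    pairs, patterns = set(seed_pairs), set()
    for _ in range(rounds):
        patterns |= find_patterns(sentences, pairs)
        pairs |= find_pairs(sentences, patterns, entities)
    return pairs, patterns


sentences = [
    "IBM has acquired AlchemyAPI",
    "Google has acquired YouTube",
    "Google bought YouTube",
    "Facebook bought WhatsApp",
]
entities = {"IBM", "AlchemyAPI", "Google", "YouTube", "Facebook", "WhatsApp"}
pairs, patterns = bootstrap(sentences, {("IBM", "AlchemyAPI")}, entities)
print(sorted(patterns))
print(sorted(pairs))
```

Starting from one seed pair, the first round learns "has acquired" and finds Google and YouTube; the second round learns "bought" from that pair and finds Facebook and WhatsApp.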


Bootstrapping


A little more

We could

either extract pattern templates and search for these

or extract features for classification and build a classifier

If we use patterns we should generalize

makes intelligent acquisition → (make(s)|made) JJ* acquisition

During the process we should evaluate before we extend:

Does the new pattern recognize other pairs we know stand in the relation?

Does the new pattern return pairs that are not in the relation? (Precision)


4. Distant supervision for RE

Combine:

A large external knowledge base, e.g. Wikipedia, WordNet

Large amounts of unlabeled text

Extract tuples that stand in a known relation from the knowledge base:

Many tuples

Follow the bootstrapping technique on the text
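The labeling step can be sketched as follows: every sentence that mentions both members of a known tuple is taken as a training example for that relation. The knowledge base is a hypothetical dictionary, entity matching is plain substring search, and the two sentences are toy examples chosen to show that the automatically assigned labels can be noisy.

```python
def label_sentences(sentences, kb):
    """Label sentences using tuples from a knowledge base.

    kb: dict mapping (entity1, entity2) to a relation name.
    A sentence mentioning both entities becomes an example of the relation.
    """
    training = []
    for s in sentences:
        for (e1, e2), relation in kb.items():
            if e1 in s and e2 in s:
                training.append((s, e1, e2, relation))
    return training


kb = {("Google", "YouTube"): "ACQUIRE"}
sentences = [
    "Google bought YouTube in 2006.",
    "YouTube videos often show up in Google searches.",  # does not express ACQUIRE
    "Facebook bought WhatsApp.",                         # pair not in the knowledge base
]
training = label_sentences(sentences, kb)
for example in training:
    print(example)
```

The second sentence is labeled ACQUIRE although it does not express the relation, which is why large amounts of data are needed to average out such noise.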


4. Distant supervision for RE

Properties:

Large data sets allow for

fine-grained features

combinations of features

Evaluation

Requirement

Large knowledge-base


5. Unsupervised relation extraction

Open IE

Example:

1. Tag and chunk

2. Find all word sequences satisfying certain syntactic constraints, in particular containing a verb

These are taken to be the relations

3. For each such sequence, find the nearest non-vacuous NP to the left and to the right

4. Assign a confidence score

United has a hub in Chicago, which is the headquarters of United Continental Holdings.

r1: <United, has a hub in, Chicago>

r2: <Chicago, is the headquarters of, United Continental Holdings>
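Steps 2 and 3 can be sketched on the example sentence. The syntactic constraint used here (one or more verbs, optionally followed by other words and a preposition) is an assumption chosen for the sketch, as are the coarse tag classes and the function names. The POS tags and NP spans are given by hand in place of step 1, and the confidence score of step 4 is left out.

```python
import re


def coarse(tag):
    """Collapse a Penn Treebank tag into one character: V, P, W or O."""
    if tag.startswith("VB"):
        return "V"
    if tag in ("IN", "RP", "TO"):
        return "P"
    if tag[:2] in ("NN", "JJ", "RB", "PR", "DT"):
        return "W"
    return "O"


# A relation phrase: verbs, optionally followed by words and a preposition.
RELATION = re.compile(r"V+(?:W*P)?")


def open_ie(tagged, np_spans):
    """Return (left NP, relation phrase, right NP) triples."""
    words = [w for w, _ in tagged]
    # One character per token, so match offsets are token indices.
    tag_string = "".join(coarse(t) for _, t in tagged)
    triples = []
    for m in RELATION.finditer(tag_string):
        left = [s for s in np_spans if s[1] <= m.start()]
        right = [s for s in np_spans if s[0] >= m.end()]
        if left and right:
            l = max(left, key=lambda s: s[1])   # nearest NP to the left
            r = min(right, key=lambda s: s[0])  # nearest NP to the right
            triples.append((" ".join(words[l[0]:l[1]]),
                            " ".join(words[m.start():m.end()]),
                            " ".join(words[r[0]:r[1]])))
    return triples


tagged = [("United", "NNP"), ("has", "VBZ"), ("a", "DT"), ("hub", "NN"),
          ("in", "IN"), ("Chicago", "NNP"), (",", ","), ("which", "WDT"),
          ("is", "VBZ"), ("the", "DT"), ("headquarters", "NN"), ("of", "IN"),
          ("United", "NNP"), ("Continental", "NNP"), ("Holdings", "NNPS"),
          (".", ".")]
np_spans = [(0, 1), (2, 4), (5, 6), (9, 11), (12, 15)]
triples = open_ie(tagged, np_spans)
for triple in triples:
    print(triple)
```

Since "which" is not among the NP spans, the left argument of the second relation falls back to the nearest real NP, "Chicago".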


Evaluating relation extraction

Supervised methods can be evaluated on each of the examples in a test set.

For the semi-supervised method:

we don’t have a test set.

we can evaluate the precision of the returned examples manually

Beware the difference between

Determine for a sentence whether an entity pair in the sentence is in a particular relation

Recall and precision

Determine from a text:

We may use several occurrences of the pair in the text to draw a conclusion

Precision

We skip the confidence scoring


More fine grained IE

Tokenization+tagging

Identifying the "actors"

Chunking

Named-entity recognition

Co-reference resolution

Relation detection

Event detection

Co-reference resolution of events

Temporal extraction

Template filling

(Slide columns: So far | Possible refinements)


Some example systems

Stanford CoreNLP: http://corenlp.run/

SpaCy (Python): https://spacy.io/docs/api/

OpenNLP (Java): https://opennlp.apache.org/docs/

GATE (Java): https://gate.ac.uk/