Special Topics in Computer Science: Advanced Topics in Information Retrieval. Lecture 10: Natural Language Processing and IR. Syntax and structural disambiguation. Alexander Gelbukh, www.Gelbukh.com

Upload: danielle-cantrell

Post on 27-Mar-2015


Page 1: Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 10: Natural Language Processing and IR. Syntax and structural disambiguation

Special Topics in Computer Science

Advanced Topics in Information Retrieval

Lecture 10: Natural Language Processing and IR.

Syntax and structural disambiguation

Alexander Gelbukh

www.Gelbukh.com

Page 2: Previous Chapter: Conclusions

Tagging, word sense disambiguation, and anaphora resolution are cases of disambiguation of meaning

Useful in translation, information retrieval, and text understanding

Dictionary-based methods are good but expensive

Statistical methods are cheap and sometimes imperfect... but not always (if very large corpora are available)

Page 3: Previous Chapter: Research topics

Too many to list
New methods
Lexical resources (dictionaries)
= Computational linguistics

Page 4: Contents

Language levels
Syntax
  Dependency approach
  Constituency-based approach
  Head-driven approach
Grammars and parsing
Ambiguity and disambiguation

Page 5: Language levels

Letters are built up into words
Words into sentences
Sentences into <...> text

Each level has its own representation
This allows for modular processing
  A module describes one level or transforms from one level to another

Page 6: Source of language complexity: 1-D

[Figure. Labels: Brain 1, Brain 2, Meaning, Language, Text (speech). The text area repeats the filler sentence "This is a text that represents the meaning shown in the right part of the picture."]

Page 7: Source of language complexity: 1-D

[Figure. Labels: Knowledge, Language, Text. The text area repeats the filler sentence "This is a text that represents the meaning shown in the right part of the picture."]

Page 8: Linguistic processor translates between representations

[Figure. Labels: Texts, Linguistic module, Meanings, Linguistic module, Applied system. The text area repeats the filler sentence "This is an example of the output text of the system."]

Page 9: General scheme of text processing

[Figure. Labels: Input, Output, Linguistic processor, (Semantic) representation, Applied system (e.g., Expert system).]

Linguistic processor uses linguistic knowledge
Applied system uses other types of knowledge (e.g., Artificial Intelligence)

Page 10: Language levels

Morphological: words
Syntactic: sentences
Semantic: meaning
Pragmatic: intention
...?

Page 11: Fine structure of linguistic processor

[Figure. Labels: Text, Language, Meaning; Surface representation, Morphological transformer, Morphological representation, Syntactic transformer, Syntactic representation, Semantic transformer, Semantic representation. The text area repeats the filler sentence "This is a text that represents the meaning shown in the right part of the picture."]

Page 12: Example of text

“Science is important for our country. The Government pays it much attention.”

Page 13: Textual representation

Text is a sequence of letters.

S c i e n c e   i s   i m p o r t a n t   f o r   o u r   c o u n t r y .   T h e   G o v e r n m e n t   p a y s   i t   m u c h   a t t e n t i o n .

Page 14: Morphological analysis

[Figure. Linguistic processor: Morphological analyzer, Syntactic parser, Semantic analyzer. Stage shown: Morphological analysis.]

Page 15: Morphological representation

A sequence of words.

Word        Lemma       Part of speech   Features
The         THE         article          definite, plural/singular
science     SCIENCE     noun             singular
is          BE          verb             present, 3rd person, sing.
important   IMPORTANT   adjective
for         FOR         preposition
our         WE          pronoun          possessive
country     COUNTRY     noun             singular
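The entries on this slide can be pictured as records produced by a dictionary lookup. The following is a minimal sketch under that assumption: the `Token` class, the `LEXICON` dictionary, and the `analyze` function are hypothetical names introduced here, and a real morphological analyzer would need a full dictionary and inflection rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str      # the word as it appears in the text
    lemma: str     # dictionary form
    pos: str       # part of speech
    features: str  # grammatical features (may be empty)


# Toy lexicon holding only the entries listed on this slide.
LEXICON = {
    "the": ("THE", "article", "definite, plural/singular"),
    "science": ("SCIENCE", "noun", "singular"),
    "is": ("BE", "verb", "present, 3rd person, sing."),
    "important": ("IMPORTANT", "adjective", ""),
    "for": ("FOR", "preposition", ""),
    "our": ("WE", "pronoun", "possessive"),
    "country": ("COUNTRY", "noun", "singular"),
}


def analyze(text):
    """Turn a text into a sequence of Token records (the morphological representation)."""
    tokens = []
    for raw in text.split():
        form = raw.strip('.,!?"')
        if not form:
            continue
        # Unknown words get an upper-cased lemma and the tag "unknown".
        lemma, pos, features = LEXICON.get(form.lower(), (form.upper(), "unknown", ""))
        tokens.append(Token(form, lemma, pos, features))
    return tokens


morph = analyze("Science is important for our country.")
for token in morph:
    print(token.form, token.lemma, token.pos, token.features)
```

The output of this step is what the syntactic parser would take as input.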

Page 16: Syntactic parsing

[Figure. Linguistic processor: Morphological analyzer, Syntactic parser, Semantic analyzer. Stage shown: Syntactic parsing.]

Page 17: Syntactic representation

A sequence of syntactic trees.

[Figure: two trees. Tree 1, top to bottom: BE; SCIENCE, IMPORTANT; COUNTRY; WE (with the label "of"). Tree 2, top to bottom: PAY; GOVERNMENT, ATTENTION, IT; MUCH.]

Page 18: Syntactic representation

What happened?
With whom did it happen?
... their details

[Figure: tree, top to bottom: PAY; GOVERNMENT, ATTENTION, IT; MUCH.]

Page 19: Semantic analysis

[Figure. Linguistic processor: Morphological analyzer, Syntactic parser, Semantic analyzer. Stage shown: Semantic analysis.]

Next lecture...

Page 20: Syntax

The structure describing the relationships between words in a sentence

Describes the relationships implied by grammatical characteristics, not by meaning

Often allows for simple paraphrasing
  John reads the book
  The book is read by John

Page 21: Early approach: Dependency syntax

Tree
  Nodes: words
  Arcs: modified by

"Modifies" means adds details, clarifies, chooses one of many... makes more specific

Arcs are typed
  Types are: subject, object, attribute, ...

[Figure: dependency tree. PAY has three dependents: GOVERNMENT (Subject), ATTENTION (Object), IT (Recipient). ATTENTION has one dependent: MUCH (Attribute).]

Page 22: ... Dependency syntax

General situation: pay
More specifically: the one where:
  who pays is government
  what is paid is attention
  to whom it is paid is it
More specifically: attention that is much

[Figure: dependency tree. PAY has three dependents: GOVERNMENT (Subject), ATTENTION (Object), IT (Recipient). ATTENTION has one dependent: MUCH (Attribute).]
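A dependency tree with typed arcs fits naturally into a mapping from each head to its labeled dependents. The sketch below encodes the tree from this slide that way; the names `TREE`, `arcs`, and `depth` are illustrative choices, not part of the lecture.

```python
# Dependency tree as a mapping: head -> list of (arc type, dependent).
TREE = {
    "PAY": [("subject", "GOVERNMENT"), ("object", "ATTENTION"), ("recipient", "IT")],
    "ATTENTION": [("attribute", "MUCH")],
}


def arcs(tree, root):
    """Yield (head, arc type, dependent) triples, depth first from the root."""
    for arc_type, dependent in tree.get(root, []):
        yield (root, arc_type, dependent)
        yield from arcs(tree, dependent)


def depth(tree, node):
    """Number of levels in the subtree rooted at node."""
    return 1 + max((depth(tree, d) for _, d in tree.get(node, [])), default=0)


all_arcs = list(arcs(TREE, "PAY"))
for head, arc_type, dependent in all_arcs:
    # Each arc reads "head is modified by dependent", as on the slide.
    print(f"{head} is modified by {dependent} ({arc_type})")
```

Because word order is not stored anywhere in this structure, the same tree can describe sentences whose words come in a different order.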

Page 23: Advantages/disadvantages of Dependency Syntax

Advantages
  Solid linguistic base
  Rather direct translation into semantics
  Easily applicable to languages with free word order
    Korean? Russian, Latin
    This is why it has a solid linguistic base: good for classical languages!

Disadvantages
  No nice mathematical base
  No simple algorithms

Page 24: Most popular approach: Constituency (Phrase Structure grammars)

Tree
  Nodes: nested segments of the phrase
    Cannot intersect, only nested
    Usually are labeled with part-of-speech names
  Arcs: nesting
    In classical approach, arcs are not labeled

[[Our Government] [pays [much attention] [to it]]]

Page 25: Constituency

[[Our Government] [pays [much attention] [to it]]]

[Figure: the same bracketing drawn graphically over the segments "Our Government", "pays", "much attention", "to it".]

Page 26: Constituency

[[Our_R Government_N]NP [pays_V [much_A attention_N]NP [to_P it_R]PP]VP]S

R: pronoun      NP: noun phrase
N: noun         VP: verb phrase
V: verb         PP: prepositional phrase
A: adjective    S: sentence

Page 27: Constituency: graphical representation

[[Our Government]NP [pays [much attention]NP [to it]PP]VP]S

S
  NP
    R  Our
    NP
      N  Government
  VP
    VP
      V  pays
    NP
      A  much
      NP
        N  attention
    PP
      P  to
      NP
        R  it

Page 28: Phrase structure grammar

Enumerates possible configurations at nodes
Usually recursive

S → NP VP
NP → A NP
NP → R NP
PP → P NP
NP → N
VP → VP NP PP
VP → V

[Figure: constituency tree of "Our Government pays much attention to it". Levels from the top: S; VP; NP, NP, PP; NP, VP, NP, NP; R N V A N P R; the words.]

Page 29: Context-independency hypothesis

A configuration is possible or not, regardless of where it is used
  Wherever you find VP NP PP, it can be VP
  Wherever you find NP VP, it can be S
  If you can put together S that covers all the sentence, it is a grammatically correct description

With this, given a suitable grammar, you can
  List all sentences of a language
  List only correct sentences of that language
  List all and only correct structures

Correctness means a native speaker’s intuition

Page 30: Generative idea

Find a grammar to list all and only correct sentences (with their structures) of a language

This is a complete description of that language!

How can it be useful in analysis?
  Reverse the grammar

Page 31: Parsing

Given a grammar and a sentence
  Find all possible structures
  That describe this sentence with this grammar

Many methods. Not discussed today.
  A lot of research. Very fast algorithms

Complexity: cubic in the number of words in the sentence (there are asymptotically better methods, with an exponent of about 2.8)

Problem: combinatorics of variants
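To make "find all possible structures" concrete, here is a small span-based enumerator for a toy phrase structure grammar. It is only a sketch: it is not one of the fast cubic algorithms mentioned above, the rules `PP -> P NP` and `NP -> R` are assumptions added so that "to it" can be analyzed, and `GRAMMAR`, `LEXICON`, `parse`, and `brackets` are hypothetical names.

```python
from functools import lru_cache
from itertools import combinations

# Toy grammar; NP -> R and PP -> P NP are assumptions of this sketch.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("A", "NP"), ("R", "NP"), ("N",), ("R",)],
    "PP": [("P", "NP")],
    "VP": [("VP", "NP", "PP"), ("V",)],
}
LEXICON = {"our": "R", "government": "N", "pays": "V", "much": "A",
           "attention": "N", "to": "P", "it": "R"}


def parse(words, start="S"):
    """Return every tree for `start` that covers the whole sentence."""
    tags = [LEXICON[w.lower()] for w in words]

    @lru_cache(maxsize=None)
    def trees(symbol, i, j):
        found = []
        if j - i == 1 and tags[i] == symbol:      # part-of-speech node over one word
            found.append((symbol, words[i]))
        for rhs in GRAMMAR.get(symbol, []):
            # Try every way to cut words[i:j] into len(rhs) non-empty parts.
            for cuts in combinations(range(i + 1, j), len(rhs) - 1):
                bounds = (i, *cuts, j)
                partial = [()]
                for k, child in enumerate(rhs):
                    options = trees(child, bounds[k], bounds[k + 1])
                    partial = [p + (t,) for p in partial for t in options]
                found.extend((symbol, *children) for children in partial)
        return tuple(found)

    return trees(start, 0, len(words))


def brackets(tree):
    """Render a tree in the labeled bracket notation used on the slides."""
    label, *rest = tree
    if len(rest) == 1 and isinstance(rest[0], str):
        return rest[0]
    return "[" + " ".join(brackets(t) for t in rest) + "]" + label


result = parse("Our Government pays much attention to it".split())
for tree in result:
    print(brackets(tree))
```

With this tiny grammar the sentence gets a single structure; with a realistic grammar the same enumeration returns many, which is the combinatorial problem noted on the slide.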

Page 32: Advantages and disadvantages of constituency approach

Advantages
  Nice mathematics, very well understood
  Efficient analysis algorithms, very well-elaborated
  Good for languages with fixed word order
    English. Chinese?

Disadvantages
  Difficult translation into semantics
  Bad when it comes to freer word order
    Even in English! Worse in other languages

Page 33: Head-driven approaches

Combine some advantages of dependency-based and constituency-based approaches

Syntax is still fixed-order. But word dependency information is added
  Easier translation into semantics
  More linguistically-based

How?
  In each constituent, the main word (head) is marked
  It modifies the head of the larger constituent

[[Our Government] [pays [much attention] [to it]]]
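The head-marking idea can be sketched as follows: each constituent records which of its children is the head, and word-to-word dependencies are then read off the tree. Which child counts as the head is an assumption made here (for example, the preposition "to" is taken as the head of "to it"), and the names `TREE`, `head_word`, and `dependencies` are hypothetical.

```python
# A constituent is (head_index, children); a child is a word or a constituent.
# The choice of heads below is an assumption made for this sketch.
TREE = (1, [
    (1, ["Our", "Government"]),          # head: Government
    (0, ["pays",                         # head: pays
         (1, ["much", "attention"]),     # head: attention
         (0, ["to", "it"])]),            # head: to
])


def head_word(node):
    """Follow head marks down to the word that heads this constituent."""
    if isinstance(node, str):
        return node
    head_index, children = node
    return head_word(children[head_index])


def dependencies(node):
    """Return (dependent, head) pairs: the head of each non-head child
    modifies the head of the enclosing constituent."""
    if isinstance(node, str):
        return []
    head_index, children = node
    head = head_word(node)
    pairs = []
    for k, child in enumerate(children):
        if k != head_index:
            pairs.append((head_word(child), head))
        pairs.extend(dependencies(child))
    return pairs


deps = dependencies(TREE)
for dependent, head in deps:
    print(f"{dependent} -> {head}")
```

The resulting pairs form a dependency tree rooted at "pays", recovered from a fixed-order constituency structure.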

Page 34: Syntactic ambiguity

I see a cat with a telescope
  I see [a cat] [with a telescope]
    I use a telescope to see a cat
  I see [a cat [with a telescope]]
    I see a cat that has a telescope

Nearly any preposition causes ambiguity
Dozens, thousands, millions of variants for a sentence!
  Because their numbers multiply
  I see a cat with a telescope in a garden at the shore of a river
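How fast the variants grow can be estimated with a simple counting model. The sketch below assumes that each new prepositional phrase may attach to the verb or to any noun still reachable on the right edge of the structure, and that attachments may not cross; the function name `count_attachments` and the model itself are assumptions of this example, not something stated in the lecture.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_attachments(open_sites, remaining_pps):
    """Count attachment variants.

    open_sites: words a new prepositional phrase can still attach to
                (for "I see a cat ..." these are "see" and "cat", so 2).
    remaining_pps: prepositional phrases still to be attached.
    """
    if remaining_pps == 0:
        return 1
    total = 0
    for site in range(1, open_sites + 1):
        # Attaching to the site-th open word closes every deeper word
        # and opens the noun of the new phrase: site + 1 words remain open.
        total += count_attachments(site + 1, remaining_pps - 1)
    return total


counts = [count_attachments(2, n) for n in range(1, 5)]
for n, c in enumerate(counts, start=1):
    print(f"{n} prepositional phrase(s): {c} variants")
```

Under this model the two readings of "I see a cat with a telescope" correspond to one phrase, and the four-phrase sentence on the slide already has dozens of variants.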

Page 35: Ambiguity resolution

Syntactic means are not enough
Is telescope more related to see or to cat?
  Statistical methods: is it used with see or cat?
  Dictionary-based methods: does it share more meaning with see or cat?
    Path length in a dictionary of semantic relationships

Ideally, context should be analyzed, and reasoning applied:
  I see a cat with a telescope. It keeps the telescope in its left paw.
  At present there are no good methods for this.
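The statistical question "is it used with see or cat?" can be sketched as a comparison of co-occurrence frequencies. All numbers below are invented for illustration, and `COOCCURRENCE`, `WORD_COUNT`, and `attach` are hypothetical names; a real system would estimate such counts from a large corpus.

```python
# Hypothetical counts, invented for illustration only.
COOCCURRENCE = {
    ("see", "telescope"): 40,
    ("cat", "telescope"): 1,
    ("see", "garden"): 5,
    ("cat", "garden"): 12,
}
WORD_COUNT = {"see": 1000, "cat": 300}


def attach(candidates, noun):
    """Pick the candidate head that co-occurs with the noun most often,
    relative to how frequent the candidate itself is."""
    def score(head):
        return COOCCURRENCE.get((head, noun), 0) / WORD_COUNT[head]
    return max(candidates, key=score)


print(attach(["see", "cat"], "telescope"))
print(attach(["see", "cat"], "garden"))
```

With these made-up counts, "with a telescope" attaches to "see" and "in a garden" to "cat"; different corpus statistics would give different decisions.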

Page 36: Shallow parsing

Due to the HUGE problems in resolving ambiguity
  Do not resolve it!
  Do what you can do well
    I see [a cat] [with a telescope] [in a garden] [at the shore] [of a river]

Better than nothing
Can be done well
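A shallow parser of this kind only marks the chunks and leaves their attachment open. The following sketch uses two tiny word lists as a stand-in for real part-of-speech tags; `PREPOSITIONS`, `DETERMINERS`, `chunk`, and `render` are hypothetical names and the rule is deliberately simplistic.

```python
PREPOSITIONS = {"with", "in", "at", "of"}
DETERMINERS = {"a", "the"}


def chunk(words):
    """Split a sentence into flat chunks without deciding what attaches to what."""
    chunks, current = [], []
    for k, word in enumerate(words):
        w = word.lower()
        # A chunk starts at a preposition, or at a determiner
        # that does not directly follow a preposition.
        starts_chunk = w in PREPOSITIONS or (
            w in DETERMINERS
            and (k == 0 or words[k - 1].lower() not in PREPOSITIONS)
        )
        if starts_chunk and current:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return chunks


def render(chunks):
    """Bracket the noun and prepositional chunks, as on the slide."""
    parts = []
    for c in chunks:
        text = " ".join(c)
        marked = c[0].lower() in PREPOSITIONS | DETERMINERS
        parts.append(f"[{text}]" if marked else text)
    return " ".join(parts)


sentence = "I see a cat with a telescope in a garden at the shore of a river"
chunks = chunk(sentence.split())
print(render(chunks))
```

The chunker never has to choose between "see" and "cat" as the attachment point, which is why this step can be done reliably.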

Page 37: Evaluation

PARSEVAL international contests
A practical parser usually gives only one variant
  Implies disambiguation!

Manually built corpora (treebanks)
Compare what the program did with what humans did
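Comparing the program's output with a treebank is commonly done by counting matching brackets. The sketch below computes bracket precision, recall, and F1 in the spirit of PARSEVAL-style scoring; the span encoding `(label, start, end)`, the function `bracket_scores`, and the choice of which reading is the "gold" one are assumptions made for illustration.

```python
def bracket_scores(gold, predicted):
    """Precision, recall and F1 over labeled spans (label, start, end)."""
    gold, predicted = set(gold), set(predicted)
    matched = len(gold & predicted)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# "I see a cat with a telescope", word positions 0..6, end index exclusive.
# Hypothetical treebank reading: the telescope is the instrument of seeing.
gold = [("S", 0, 7), ("VP", 1, 7), ("NP", 2, 4), ("PP", 4, 7), ("NP", 5, 7)]
# Hypothetical parser output: the cat has the telescope (one extra NP bracket).
predicted = gold + [("NP", 2, 7)]

precision, recall, f1 = bracket_scores(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here the parser finds every treebank bracket but adds one that the human annotators did not, so recall is perfect while precision drops.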

Page 38: One of the uses in IR: Lexical ambiguity resolution

Syntactic analysis helps in POS disambiguation:
  Oil is used well in Mexico.
  Oil well is used in Mexico.
  Well = ?

But does not help in WSD:
  I deposited my money in an international bank.
  I live on a beautiful bank of Han river.

Page 39: Research topics

Faster algorithms
  E.g. parallel

Handling linguistic phenomena not handled by current approaches

Ambiguity resolution!
  Statistical methods
  A lot can be done

Page 40: Conclusions

Syntactic structure is one of the intermediate representations of a text for its processing

Helps text understanding
  Thus reasoning, question answering, ...

Directly helps POS tagging
  Resolves lexical ambiguity of part of speech
  But not WSD-type ambiguities

A big science in itself, with 50 (2000?) years of history

Page 41: Thank you!

Till June 8? 6 pm

Semantics