Overview of Statistical NLP IR Group Meeting March 7, 2006

Upload: hortense-jacobs

Post on 31-Dec-2015

Overview of Statistical NLP

IR Group Meeting

March 7, 2006

03/07/2006 IR Group Meeting -- NLP 2

Outline

- Some basic/important NLP problems
- Topics that have recently attracted much interest
- NLP research groups
- Discussion on the relation between NLP and IR

Levels of Analysis in NLP (from Dan Roth’s CS598)

- Morphology
  - How words are constructed
- Syntax
  - Structural relations between words
- Semantics
  - The meaning of words and of combinations of words
- Pragmatics
  - How is a sentence used? What is its purpose?
- Discourse (sometimes distinguished as a subfield of Pragmatics)
  - Relationships between sentences; global context

Some NLP Problems

- N-gram Models
- Word Sense Disambiguation
- Lexical Acquisition
- (POS) Tagging
- (Syntactic) Parsing
- Semantic Role Labeling (Semantic Parsing)
- Named Entity Recognition
- Textual Entailment
- …

N-gram Models

- The task: to estimate P(w_n | w_1, …, w_{n-1})
- Approaches:
  - Maximum likelihood estimation
  - Various smoothing methods
- Applications:
  - Automatic speech recognition
  - Spelling correction
  - Handwriting recognition
  - Statistical machine translation
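
As a rough illustration of the two approaches listed above, the sketch below estimates bigram probabilities (the n = 2 case) by maximum likelihood and with add-one (Laplace) smoothing, which is only one of many smoothing methods. The toy corpus, the function names, and the `<s>`/`</s>` padding symbols are assumptions made for this example, not material from the talk.

```python
from collections import Counter


def train_bigram(sentences):
    """Collect history counts and bigram counts from tokenized sentences."""
    history_counts = Counter()   # C(prev): how often prev is followed by anything
    bigram_counts = Counter()    # C(prev, cur)
    vocab = set()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]   # hypothetical boundary symbols
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts, vocab


def p_mle(cur, prev, history_counts, bigram_counts):
    """Maximum likelihood estimate: C(prev, cur) / C(prev)."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / history_counts[prev]


def p_add_one(cur, prev, history_counts, bigram_counts, vocab):
    """Add-one smoothing: (C(prev, cur) + 1) / (C(prev) + |V|)."""
    return (bigram_counts[(prev, cur)] + 1) / (history_counts[prev] + len(vocab))


# Toy corpus, invented for illustration only.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
hc, bc, vocab = train_bigram(corpus)

print(p_mle("cat", "the", hc, bc))             # seen bigram
print(p_mle("dog", "cat", hc, bc))             # unseen bigram gets zero under MLE
print(p_add_one("dog", "cat", hc, bc, vocab))  # smoothing gives it a small nonzero mass
```

The unseen bigram is the point of the comparison: maximum likelihood assigns it probability zero, while the smoothed estimate reserves some mass for it.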

Word Sense Disambiguation (WSD)

- The task: to determine which of the senses of an ambiguous word is involved in a particular use of the word
- Approaches:
  - Supervised:
    - Log-linear models
    - Information-theoretic
    - Memory-based learning (kNN)
  - Dictionary-based:
    - Sense definitions
    - Thesauri
    - Translations in a second language
  - Unsupervised:
    - Clustering using EM algorithm
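
To make the dictionary-based idea concrete, here is a minimal sketch of a simplified Lesk-style method: pick the sense whose definition shares the most words with the context. This is one possible reading of "sense definitions", not necessarily the method discussed in the meeting. The two glosses for "bank", the sense labels, and the stopword list are invented for the example.

```python
def simplified_lesk(context_words, sense_glosses, stopwords=frozenset()):
    """Return the sense whose gloss overlaps most with the context words."""
    context = {w.lower() for w in context_words} - stopwords
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        gloss_words = set(gloss.lower().split()) - stopwords
        overlap = len(context & gloss_words)
        if overlap > best_overlap:   # ties keep the earlier sense
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical stopword list and glosses, for illustration only.
STOPWORDS = frozenset(
    {"a", "an", "the", "of", "in", "on", "to", "that", "and", "i", "my", "at"}
)
BANK_SENSES = {
    "bank/finance": "an institution that accepts deposits of money and makes loans",
    "bank/river": "sloping land at the edge of a river or other body of water",
}

print(simplified_lesk("I deposited my money at the bank".split(), BANK_SENSES, STOPWORDS))
print(simplified_lesk("We sat on the bank of the river".split(), BANK_SENSES, STOPWORDS))
```

The sketch does exact string matching only, so "deposited" does not match "deposits"; a practical system would at least add stemming or lemmatization.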

Word Sense Disambiguation (WSD)

- Accuracy:
  - Word-specific
    - Easy words: > 90%
    - Hard words: 50-70%
- Applications:
  - Statistical machine translation
  - Information retrieval

Lexical Acquisition

- The task: to develop algorithms and statistical techniques for filling the holes in existing machine-readable dictionaries by looking at the occurrence patterns of words in large text corpora
- Examples:
  - Verb subcategorization
  - Prepositional phrase attachment disambiguation
  - Selectional preferences
  - Semantic similarity

Semantic Similarity

- The task: to acquire a relative measure of similarity between two words
- Approaches:
  - Vector space measures (document space, word space, modifier space, etc.)
  - Probabilistic measures (KL-divergence, etc.)
- Applications:
  - Information retrieval (query expansion)
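
The two families of measures above can be sketched with a few lines of code: cosine similarity for the vector space view and KL divergence for the probabilistic view. The co-occurrence counts, the context words, and the add-alpha smoothing used to avoid zero probabilities are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def normalize(counts, alpha=1.0):
    """Turn counts into a distribution, with add-alpha smoothing (an assumption)."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def kl_divergence(p, q):
    """D(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical co-occurrence counts with the context words
# ["drink", "eat", "drive", "road"] (a tiny "word space").
coffee = [8, 1, 0, 0]
tea = [6, 2, 0, 0]
car = [0, 0, 7, 5]

print(cosine(coffee, tea), cosine(coffee, car))

p_coffee, p_tea, p_car = normalize(coffee), normalize(tea), normalize(car)
print(kl_divergence(p_coffee, p_tea), kl_divergence(p_coffee, p_car))
```

Note that cosine is a similarity (higher means more alike) while KL divergence is a dissimilarity and is not symmetric, so the two are not interchangeable without some care.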

POS Tagging

- The task: labeling each word in a sentence with its appropriate part of speech
- Major approaches:
  - HMM
  - Transformation-based
    - Advantages: speed and storage
- Other approaches:
  - Neural networks, decision trees, memory-based learning, maximum entropy models
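
For the HMM approach, tagging amounts to finding the most probable tag sequence under transition and emission probabilities, which is typically done with the Viterbi algorithm. The sketch below is a minimal bigram-HMM decoder; the tag set, all probability values, and the small `floor` probability used for unseen events are invented for illustration and are not taken from any trained model.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for words under a bigram HMM."""
    if not words:
        return []

    def lp(table, key):
        # Log probability, with a small floor for unseen events (an assumption).
        return math.log(table.get(key, floor))

    # best[i][t] = (log prob of the best path ending in tag t at position i, backpointer)
    best = [{t: (lp(start_p, t) + lp(emit_p, (t, words[0])), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + lp(trans_p, (p, t)), p) for p in tags
            )
            column[t] = (score + lp(emit_p, (t, words[i])), prev)
        best.append(column)

    # Follow backpointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


# Hypothetical toy parameters.
TAGS = ["DET", "NOUN", "VERB"]
START_P = {"DET": 0.8, "NOUN": 0.15, "VERB": 0.05}
TRANS_P = {
    ("DET", "NOUN"): 0.9,
    ("NOUN", "VERB"): 0.6,
    ("NOUN", "NOUN"): 0.3,
    ("VERB", "DET"): 0.5,
    ("VERB", "NOUN"): 0.3,
}
EMIT_P = {
    ("DET", "the"): 0.7,
    ("NOUN", "dog"): 0.4,
    ("NOUN", "barks"): 0.05,
    ("VERB", "barks"): 0.3,
    ("VERB", "dog"): 0.01,
}

print(viterbi("the dog barks".split(), TAGS, START_P, TRANS_P, EMIT_P))
# ['DET', 'NOUN', 'VERB']
```

Working in log space avoids numerical underflow on longer sentences; in a real tagger the tables would be estimated from a tagged corpus.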

POS Tagging

- Accuracy:
  - 95-97%
  - Achieved only when the application text and the training text are from a similar source
- Applications:
  - For higher-level NLP tasks: partial parsing, parsing, NER, etc.
  - “…the best lexicalized probabilistic parsers are now good enough that they perform better starting with untagged text and doing the tagging themselves, rather than using a tagger as preprocessor.” (Charniak 1997)

(Syntactic) Parsing

- The task: to find the most likely syntactic parse tree of a sentence
- Approaches:
  - Probabilistic context-free grammar (PCFG)
    - Supervised
    - Unsupervised
  - Lexicalized models
  - Dependency-based models
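
Under a PCFG, the probability of a parse tree is the product of the probabilities of the rules used to build it, and "most likely" means the tree that maximizes this product. The sketch below only scores a given tree; a real parser would also search over candidate trees (for example with a CKY-style chart). The grammar, its probabilities, and the tuple-based tree encoding are assumptions made for this example.

```python
# Hypothetical toy grammar: (left-hand side, right-hand side) -> probability.
# Probabilities for each left-hand side sum to 1.
RULE_PROBS = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.6,
    ("NP", ("NN",)): 0.4,
    ("VP", ("VB", "NP")): 0.7,
    ("VP", ("VB",)): 0.3,
    ("DT", ("the",)): 1.0,
    ("NN", ("dog",)): 0.5,
    ("NN", ("cat",)): 0.5,
    ("VB", ("chased",)): 1.0,
}


def tree_prob(tree, rule_probs):
    """Probability of a tree encoded as (label, child, ...).

    A child is either a nested tree tuple or a word (a plain string).
    Rules missing from the grammar get probability 0.
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_probs.get((label, rhs), 0.0)
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child, rule_probs)
    return p


tree = (
    "S",
    ("NP", ("DT", "the"), ("NN", "dog")),
    ("VP", ("VB", "chased"), ("NP", ("DT", "the"), ("NN", "cat"))),
)
print(tree_prob(tree, RULE_PROBS))
```

For this tree the product is 0.6 × 0.5 × 0.7 × 0.6 × 0.5 = 0.063 (the rules with probability 1.0 do not change it), up to floating-point rounding.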

(Syntactic) Parsing

- Accuracy:
  - Charniak 1997: recall 0.875, precision 0.874
  - Collins 1997: recall 0.881, precision 0.886
- Applications:
  - For other NLP tasks such as semantic role labeling and relation extraction
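
Assuming the recall and precision figures above are bracket-level scores of the usual PARSEVAL kind (the slide does not say how they were computed), the sketch below shows how such scores can be obtained by comparing labeled constituents of a predicted tree against a gold tree. The `(label, start, end)` encoding and the example brackets are invented; using sets ignores duplicate brackets, which a full implementation would handle.

```python
def bracket_scores(gold, predicted):
    """Precision, recall and F1 over labeled constituents (label, start, end)."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical constituents for a five-word sentence.
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
predicted = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5), ("NP", 4, 5)]

precision, recall, f1 = bracket_scores(gold, predicted)
print(precision, recall, f1)
```

Here three of the five predicted brackets are correct (precision 0.6) and three of the four gold brackets are recovered (recall 0.75).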

Semantic Role Labeling

- The task: to identify the predicate-argument structures in sentences
- Approaches:
  - Supervised learning
- Accuracy:
  - Best ~70% (CoNLL-2004 shared task)
- Applications:
  - Information extraction
  - Question answering

Textual Entailment

- The task: given two text fragments, to recognize whether the meaning of one text is entailed by (can be inferred from) the other text
- Approaches:
  - Word overlap
  - Statistical lexical relations
  - Syntactic matching
  - Logical inference
- Accuracy:
  - ~0.56, best ~0.60 (PASCAL Challenge 05)
- Applications:
  - Question answering
  - Multi-document summarization
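
The simplest approach in the list, word overlap, can be sketched as follows: measure the fraction of hypothesis words that also appear in the text and predict entailment when it exceeds a threshold. The function names, the 0.75 threshold, and the example sentences are assumptions made for illustration, not the setup of any challenge system.

```python
def overlap_score(text, hypothesis):
    """Fraction of distinct hypothesis words that also occur in the text."""
    text_words = set(text.lower().split())
    hyp_words = set(hypothesis.lower().split())
    return len(text_words & hyp_words) / len(hyp_words) if hyp_words else 0.0


def entails(text, hypothesis, threshold=0.75):
    """Predict entailment when the overlap reaches a (hypothetical) threshold."""
    return overlap_score(text, hypothesis) >= threshold


text = "a man is playing a guitar on stage"
print(entails(text, "a man is playing a guitar"))      # True
print(entails(text, "a woman is playing a piano"))     # False
print(entails(text, "a man is not playing a guitar"))  # True, although it is not entailed
```

The last case shows one limitation of pure overlap: a negated hypothesis shares almost all of its words with the text and is still accepted.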

Tools

- Brill Tagger
- Charniak Parser
- Collins Parser
- MiniPar
- Semantic Parser
  - ASSERT Parser
  - CCG’s demo

Corpora

- WordNet
- Penn Treebank (Sample)
- PropBank
- FrameNet

Other Tasks

- Automatic Speech Recognition
- Natural Language Generation
- Automatic Summarization
- …

Outline

- Some basic/important NLP problems
- Topics that have recently attracted much interest
- NLP research groups
- Discussion on the relation between NLP and IR

Recent topics

- Unsupervised and semi-supervised approaches
  - Knowledge acquisition bottleneck
- Semantic role labeling
  - Improve the performance of SRL
  - Use the results for other tasks

- Relation extraction
- WSD
- Parsing
- Statistical machine translation
  - Word alignment

Outline

- Some basic/important NLP problems
- Topics that have recently attracted much interest
- NLP research groups
- Discussion on the relation between NLP and IR

NLP Research Groups

- USC/ISI
- Stanford
- UPenn
- Johns Hopkins
- UIUC
- …

Outline

- Some basic/important NLP problems
- Topics that have recently attracted much interest
- NLP research groups
- Discussion on the relation between NLP and IR