natural language processing references: 1. foundations of statistical natural language processing 2....

Natural Language Processing

References:1. Foundations of Statistical Natural Language Processing

2. Speech and Language Processing

Berlin ChenDepartment of Computer Science & Information Engineering

National Taiwan Normal University

2

Motivation for NLP (1/2)

• Academic: Explore the nature of linguistic communication– Obtain a better understanding of how languages work

• Practical: Enable effective human-machine communication

– Conversational agents are becoming an important form of human-computer communication

– Revolutionize the way computers are used• More flexible and intelligent

3

Motivation for NLP (2/2)

• Different Academic Disciplines: Problems and Methods– Electrical Engineering,

Statistics– Computer Science – Linguistics– Psychology

• Many of the techniques presented were first develpoed for speech and then spread over into NLP– E.g. Language models in speech recognition

LinguisticsPsychology

Computer

Science

Electrical

Engineering,

Statistics

NLP

4

Turing Test

• Alan Turing,1950

– Alan predicted at the end of 20 century a machine with 10 gigabytes of memory would have 30% chance of fooling a human interrogator after 5 minutes of questions

• Does it come true?

interrogator

5

Hollywood Cinema

• Computers/robots can listen, speak, and answer our questions– E.g.: HAL 9000 computer in “2001: A Space Odyssey”

(2001 太空漫遊 )

6

State of the Art

• Canadian computer program accepted daily weather data and generated weather reports (1976)

• Read student essays and grade them• Automated reading tutor• Spoken Dialogues

– AT&T, How May I Help You?

7

Major Topics for NLP

• Semantics/Meaning– Representation of Meaning– Semantic Analysis– Word Sense Disambiguation

• Pragmatics – Natural Language Generation– Discourse, Dialogue and Conversational Agents– Machine Translation

8

Dissidences

• Rationalists (e.g. Chomsky)– Humans are innate language faculties– (Almost fully) encoded rules plus reasoning mechanisms– Dominating between 1960’s~mid 1980’s

• Empiricists (e.g. Shannon) – The mind does not begin with detailed sets of principles and

procedures for language components and cognitive domains– Rather, only general operations for association, pattern

recognition, generalization, etc., are endowed with• General language models plus machine learning approaches

– Dominating between 1920’s~mid 1960’s and resurging 1990’s~

9

Dissidences: Statistical and Non-Statistical NLP

• The dividing line between the two has become much more fuzzy recently– An increasing number of non-statistical researches use corpus e

vidence and incorporate quantitative methods• Corpus: “a body of texts” ( 大量的文稿 )

– Statistical NLP needs to start with all the scientific knowledge available about a phenomenon when building a probabilistic model, rather than closing one’s eye and taking a clean-slate approach

• Probabilistic and data-driven

• Statistical NLP → “Language Technology” or “Language Engineering”

(I) Part-of-Speech Tagging

11

Review

• Tagging (part-of-speech tagging)– The process of assigning (labeling) a part-of-speech or other

lexical class marker to each word in a sentence (or a corpus)• Decide whether each word is a noun, verb, adjective, or

whatever

The/AT representative/NN put/VBD chairs/NNS on/IN the/AT table/NN

Or

The/AT representative/JJ put/NN chairs/VBZ on/IN the/AT table/NN

– An intermediate layer of representation of syntactic structure• When compared with syntactic parsing

– Above 96% accuracy for most successful approaches

Tagging can be viewed as a kind of syntactic disambiguation

12

Introduction

• Parts-of-speech– Known as POS, word classes, lexical tags, morphology classes

• Tag sets– Penn Treebank : 45 word classes used (Francis, 1979)

• Penn Treebank is a parsed corpus

– Brown corpus: 87 word classes used (Marcus et al., 1993)

– ….

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

13

The Penn Treebank POS Tag Set

14

Disambiguation

• Resolve the ambiguities and choose the proper tag for the context

• Most English words are unambiguous (have only one tag) but many of the most common words are ambiguous– E.g.: “can” can be a (an auxiliary) verb or a noun – E.g.: statistics of Brown corpus

- 11.5% word types are

ambiguous

- But 40% tokens are ambiguous

(However, the probabilities of

tags associated a word are

not equal → many ambiguous

tokens are easy to disambiguate)

wtPwtP 21

15

Process of POS Tagging

Tagging Algorithm

A String of WordsA Specified

Tagset

A Single Best Tag of Each WordVB DT NN .

Book that flight .

VBZ DT NN VB NN ?

Does that flight serve dinner ?

Two information sources used:

- Syntagmatic information (looking at information about tag sequences)

- Lexical information (predicting a tag based on the word concerned)

16

POS Tagging Algorithms (1/2)

Fall into One of Two Classes

• Rule-based Tagger– Involve a large database of handcrafted disambiguation rules

• E.g. a rule specifies that an ambiguous word is a noun rather than a verb if it follows a determiner

• ENGTWOL: a simple rule-based tagger based on the constraint grammar architecture

• Stochastic/Probabilistic Tagger– Also called model-based tagger– Use a training corpus to compute the probability of a given word

having a given context – E.g.: the HMM tagger chooses the best tag for a given word

(maximize the product of word likelihood and tag sequence probability)

“a new play”

P(NN|JJ) ≈ 0.45

P(VBP|JJ) ≈ 0.0005

17

POS Tagging Algorithms (1/2)

• Transformation-based/Brill Tagger– A hybrid approach

– Like rule-based approach, determine the tag of an ambiguous word based on rules

– Like stochastic approach, the rules are automatically induced from previous tagged training corpus with the machine learning technique

• Supervised learning

18

Rule-based POS Tagging (1/3)

• Two-stage architecture– First stage: Use a dictionary to assign each word a list of

potential parts-of-speech– Second stage: Use large lists of hand-written disambiguation

rules to winnow down this list to a single part-of-speech for each word

Pavlov had shown that salivation …

Pavlov PAVLOV N NOM SG PROPER

had HAVE V PAST VFIN SVO

HAVE PCP2 SVO

shown SHOW PCP2 SVOO SVO SV

that ADV

PRON DEM SG

DET CENTRAL DEM SG

CS

salivation N NOM SG

An example forThe ENGTOWL tagger

A set of 1,100 constraintscan be applied to the input sentence

(complementizer)

(preterit)

(past participle)

19


• Simple lexical entries in the ENGTWOL lexicon

past participle

20


Example: It isn’t that odd! ( 它沒有那麼奇特的 )

I consider that odd. ( 我思考那奇數 ?)

ADV

Complement

A

NUM

21

HMM-based Tagging (1/8)

• Also called Maximum Likelihood Tagging– Pick the most-likely tag for a word

• For a given sentence or words sequence , an HMM tagger chooses the tag sequence that maximizes the following probability

tags1 previoustagtagwordmaxargtag

:position at wordaFor

nPP

i

jjij

i

N-gram HMM tagger

tag sequence probabilityword/lexical likelihood

22


• Assumptions made here– Words are independent of each other

• A word’s identity only depends on its tag

– “Limited Horizon” and “Time Invariant” (“Stationary”) • Limited Horizon: a word’s tag only depends on the previous

few tags (limited horizon) and the dependency does not change over time (time invariance)

• Time Invariant: the tag dependency won’t change as tag sequence appears different positions of a sentence

Do not model long-distance relationships well !

- e.g., Wh-extraction,…

23


• Apply a bigram-HMM tagger to choose the best tag for a given word – Choose the tag ti for word wi that is most probable given the prev

ious tag ti-1 and current word wi

– Through some simplifying Markov assumptions

iijj

i wttPt ,maxarg 1

jiijj

i twPttPt 1maxarg

tag sequence probability word/lexical likelihood

24


• Example: Choose the best tag for a given word

Secretariat/NNP is /VBZ expected/VBN to/TO race/VB tomorrow/NN

to/TO race/??? P(VB|TO) P(race|VB)=0.00001

P(NN|TO) P(race|NN)=0.000007

0.34 0.00003

0.021 0.00041

Pretend that the previous

word has already tagged

25


• The Viterbi algorithm for the bigram-HMM tagger

end

do 1- step 1 to1:ifor

argmaxion 3.Terminat

argmax

1 2maxInduction 2.

, 1tion Initializa 1.

1

1

11

1

11

iii

nJj

n

kjiJk

i

jikjik

i

jjjj

XX

n-

jX

ttPkj

Jjn,i, twPttPkj

tPπJj, twPπj

26


• Apply trigram-HMM tagger to choose the best sequence of tags for a given sentence– When trigram model is used

• Maximum likelihood estimation based on the relative frequencies observed in the pre-tagged training corpus (labeled data)

n

i

ii

n

i

iiinttt

twPtttPttPtPT13

12121,..,2,1

,maxargˆ

jij

iiiiML

jjii

iiiiiiML

twc

twctwP

tttc

tttctttP

,

,

,12

1212

Smoothing or linear interpolationare needed !

iML

iiMLiiiMLiiismoothed

tP

ttPtttPtttP

)1(

,, 11212

27


• Probability smoothing of and is necessary

j

ij

iiii twc

twctwP

,

,

jji

iiii ttc

ttcttP

1

11

1ii ttP ii twP

28


• Probability re-estimation based on unlabeled data • EM (Expectation-Maximization) algorithm is applied

– Start with a dictionary that lists which tags can be assigned to which words

» word likelihood function cab be estimated» tag transition probabilities set to be equal

– EM algorithm learns (re-estimates) the word likelihood function for each tag and the tag transition probabilities

• However, a tagger trained on hand-tagged data worked better than one trained via EM

– Treat the model as a Markov Model in training but treat them as a Hidden Markov Model in tagging

ii twP

1ii ttP

Secretariat/NNP is /VBZ expected/VBN to/TO race/VB tomorrow/NN

29

Transformation-based Tagging (1/8)

• Also called Brill tagging– An instance of Transformation-Based Learning (TBL)

• Motive– Like the rule-based approach, TBL is based on rules that

specify what tags should be assigned to what word– Like the stochastic approach, rules are automatically induced

from the data by the machine learning technique

• Note that TBL is a supervised learning technique– It assumes a pre-tagged training corpus

30


• How the TBL rules are learned– Three major stages

1. Label every word with its most-likely tag using a set of tagging rules (use the broadest rules at first)

2. Examine every possible transformation (rewrite rule), and select the one that results in the most improved tagging (supervised! should compare to the pre-tagged corpus )

3. Re-tag the data according this rule

– The above three stages are repeated until some stopping criterion is reached

• Such as insufficient improvement over the previous pass

– An ordered list of transformations (rules) can be finally obtained

31


• Example

So, race will be initially coded as NN

(label every word with its most-likely tag)

P(NN|race)=0.98

P(VB|race)=0.02

(a). is/VBZ expected/VBN to/To race/NN tomorrow/NN

(b). the/DT race/NN for/IN outer/JJ space/NN

Refer to the correct tag

Information of each word,

and find the tag of racein (a) is wrong

Learn/pick a most suitable transformation rule: (by examining every possible transformation)

Change NN to VB while the previous tag is TO

expected/VBN to/To race/NN → expected/VBN to/To race/VB Rewrite rule:

1

2

3

32


• Templates (abstracted transformations)– The set of possible transformations may be infinite

– Should limit the set of transformations

– The design of a small set of templates (abstracted transformations) is needed

E.g., a strange rule like:

transform NN to VB if the previous word was “IBM” and the word “the” occurs between 17 and 158 words before

that

33


• Possible templates (abstracted transformations)

Brill’s templates.Each begins with“Change tag a to

tagb when ….”

34


• Learned transformations

more valuable player

Constraints for tags

Constraints for words

Rules learned by Brill’s original tagger

Modal verbs (should, can,…)

Verb, past participle

Verb, 3sg, past tense

Verb, 3sg, Present

35


• Reference for tags used in the previous slide

36


• Algorithm

The GET_BEST_INSTANCE procedure in the example algorithm is

“Change tag from X to Y if the previous tag is Z”.

for all combinations

of tags

Get best instance for each transformation

Z

XYtraverse

corpus

Check if it is better than the best instance achieved in previous iterations

append to the rule list

score

(II) Extractive Spoken Document Summarization - Models and Features

38

Introduction (1/3)

• World Wide Web has led to a renaissance of the research of automatic document summarization, and has extended it to cover a wider range of new tasks

• Speech is one of the most important sources of information about multimedia content

• However, spoken documents associated with multimedia are unstructured without titles and paragraphs and thus are difficult to retrieve and browse– Spoken documents are merely audio/video signals or a very long

sequence of transcribed words including errors

– It is inconvenient and inefficient for users to browse through each of them from the beginning to the end

39

Introduction (2/3)

• Spoken document summarization, which aims to generate a summary automatically for the spoken documents, is the key for better speech understanding and organization

• Extractive vs. Abstractive Summarization– Extractive summarization is to select a number of indicative

sentences or paragraphs from original document and sequence them to form a summary

– Abstractive summarization is to rewrite a concise abstract that can reflect the key concepts of the document

– Extractive summarization has gained much more attention in the recent past

40

Introduction (3/3)

Large-textCorpus

Speech Signal

Acousticmodel

Linguistic model

Prosodic FeaturesExtraction

SpeechRecognition

Speech Transcription

Important Unit(Sentence)Extraction

SentenceCompaction

Statistical Features

Prosody Features

Confidence Score

Language Model

Lexical Features

Word DependencyProbability

RecordsCaptionsIndexes

SummarySummary

FeaturesExtraction

41

History of Summarization Research

1950 1960 1970 1980 1990 2000

Early system using a surface-level approach (1958)

The first entity-level approaches based on syntactic analysis (1961)

The use of location features (1969)

The extended surface-level approach to include the used of cue phrases

The emergence of more extensive entity-level approaches (1972)

The first discourse-based approaches based on story grammars (1980)

A variety of different work (entity-level approaches based on AI 、 logic and production rules semantic networks 、 hybrid approaches)

Recent work has almost exclusively focused on extract rather than abstracts.A renewed interest in earlier surface-level approaches.

The emergence of new areas such as multi-document summarization (1997), multiligual summarization, and multimedia summarization (1997)

Spoken-document Summarization

Text-document Summarization

The first training approach (1995)

The first SVD-based approach (1995)

More natural language generation work begins to focus on text summarization

42

Extraction Based on Sentence Locations/Structures

• Sentence extraction using sentence location information– Lead (Hajime and Manabu 2000) – Focusing on the introductory and concluding segments (Hirohata

et al. 2005)– Specific structure on some domain (Maskey et al. 2003)

• E.g., broadcast news programs － sentence position, speaker type, previous-speaker type, next-speaker type, speaker change

43

Statistical Summarization Approaches (1/7)

• Spoken sentences are ranked and selected based on some similarity measures or significant scores

(a) Similarity Measures– Vector Space Model (VSM) (Ho 2003)– The document and sentence of it are represented in vector forms– The sentences that have the highest relevance scores to the

whole document are selected– To summarize more important and different concepts in a

document• Relevance measure (Gong et al. 2001)

• Maximum Marginal Relevance (MMR) (Murray et al. 2005)

x

y

iS

D

44


(a) Similarity Measures– Relevance measure (Gong et al. 2001)

– Maximum Marginal Relevance, MMR (Murray et al. 2005)

)),()(1()),(()( SummSSimaDSSimaiS ii

MMR D

Summary

iS

The candidate sentence set

mi SSSS ,...,,..,, 21

Compute the relevance score between sentence and document D

iS

Select sentence that hasthe highest relevance score

If the number of sentence in the summary reaches

the predefined valueNo

Delete sentence and recompute the weighted term-frequency vector for the document

maxS

D

YesStop

maxS

45


(b) SVD-based Method– The sentence can also be represented as a semantic vector– While the sentence with more topic or semantic information are s

elected– LSA (Gong et al. 2001)

– DIM (Hirohata et al. 2005)

Mi

i

i

i

i

a

a

a

a

A

3

2

1

SVD

iNN

i

i

i

v

v

v

A

22

11

ˆ

Dimension reduction

iKK

i

i

v

v

11

Weighted word-frequency vector

Weighted singular-value

vector

Reduced dimension vector

Mi

i

i

i

i

a

a

a

a

A

3

2

1

SVD

iNN

i

i

i

v

v

v

A

22

11

ˆ

Dimension reduction

iKK

i

i

v

v

11

Weighted word-frequency vector

Weighted singular-value

vector

Reduced dimension vector

K

kikki vS

1

2)(

Jw

w

w

2

1

1S 2S MS

J content words

M sentences Information of word j

j

A U

1

2

K

i

Information of sentence i

tVTerm-sentence

matrixLeft singularvector matrix

Right singularvector matrix

singular value matrix

12

k

Jw

w

w

2

1

1S 2S MS

J content words

M sentences Information of word j

j

A U

1

2

K

i

Information of sentence i

tVTerm-sentence

matrixLeft singularvector matrix

Right singularvector matrix

singular value matrix

12

k

46


(c) Sentence Significance Score (SIG)– Each sentence in the document is represented as a sequence of

terms, which can be simply given by a significance score– Features such as the confidence score, linguistic score or prosod

ic information also can be further integrated – Sentence selection can be performed based on this score– E.g., Given a sentence

• Linguistic score:

• Significance score:

• Or Sentence Significance Score

(Hirohata et al. 2005)

J

jjCjIji wCwIwL

JS

1

)()()(1

Jji wwwwS ,...,,...,, 21

J

jji wI

JS

1

)(1

).....|(log)( 1 jjj wwPwL

j

Ajj F

FfwI log)(

47


(c) Sentence Significance Score– Sentence:

• :statistical measure, such as TF/IDF• :linguistic measure, e.g., named entities and POSs• :confidence score• :N-gram score• is calculated from the grammatical structure of the sentence

• Statistical measure also can be evaluated using PLSA (Probabilistic Latent Semantic Analysis)

– Topic Significance– Term Entropy

)()()()()(1

51

4321 i

J

jjjjji Sbwgwcwlws

JS

Jji wwwwS ,...,,...,, 21

)( jws)( jwl)( jwc)( jwg

)( iSb

48


(d) Classification-based Methods– Sentence selection is formulated as a binary classification proble

m. A sentence can either be included in a summary or not

– These methods need a set of training documents (or labeled data) for training the classifiers

– For example,• Naïve Bayes’ Classifier/Bayesian Network Classifier (Kupiec 1995, Koumpis et al. 2005, Maskey et al. 2005)

• Support Vector Machine (SVM) (Zhu and Penn 2005)

• Logistic Regression (Zhu and Penn 2005)

• Gaussian Mixture Models (GMM) (Murray et al. 2005)

)|()(1

Cxpxp v

V

v

SummaryNon-summary

iS

49


(e) Combined Methods (Hirohata et al. 2005)

– Sentence Significance Score (SIG) combined with Location Information

– Latent semantic analysis (LSA) combined with Location Information

– DIM combined with Location Information

50

Probabilistic Generative Approaches (1/7)

• MAP criterion for sentence selection

• Sentence prior– Sentence prior is simply set to uniform here– Or may have to do with

• Sentence duration/position, correctness of sentence boundary, confidence score, prosodic information, etc.

• Each sentence of the document can be ranked by this likelihood value

SPSDP

DP

SPSDPDSP

i

i

ii

Sentence model Sentence prior

51


• Hidden Markov Model (HMM)– Each sentence of the spoken document is treated as a

probabilistic generative model of N-grams, while the spoken document is the observation

– : the sentence model, estimated from the sentence

– : the collection model, estimated from a large corpus

(In order to have some probability of every term in the vocabulary)

– : a weighting parameter

SiD

ij

ij

Dw

Dtc

jjiHMM CwPSwPSDP,

1

SwP j S

CwP j C

SwP j

CwP j

1

Sentence model

Document (Observation)

iLji wwwwD ....21

ij

ij

Dw

Dtc

jjiHMM CwPSwPSDP,

1

S

SwP j

CwP j

1

Sentence model

Document (Observation)

iLji wwwwD ....21

ij

ij

Dw

Dtc

jjiHMM CwPSwPSDP,

1

S

52


• Relevance Model, RM– In HMM, the true sentence model might not be

accurately estimated (by MLE)• Since the sentence consists only of few terms

– In order to improve estimation of the sentence model• Each sentence has its own associated relevant

model , constructed by the subset of documents in the collection that are relevant to the sentence

• The relevance model is then linearly combined with the original sentence model to form a more accurate sentence model

SwP j

SRS

Sjjj RwPSwPSwP 1ˆ

ij

ij

Dw

Dwc

jjiHMM CwPSwPSDP,

1ˆ ˆ

i

ii S

wSSwP

)()|(

53


• A schematic diagram of extractive spoken document summarization jointly using the HMM and RM models

1D2D

iD

IR System

General Text NewsCollection

Retrieved Relevant

Documentsof S S’s HMM

Model

S’s RMModel

Sentence

Document Likelihood

)|( SDP iHMM

ContemporaryText NewsCollection

iD

S S

Spoken Documents to be Summarized Local Feedback

54


• Topical Mixture Model (TMM)– Build a probabilistic latent topical space – Measure the likelihood of a sentence generating a given

document in such space

Dwn

Dw

K

kikki STPTwPSDP

,

1

1TwnP

DocumentD=w1w2…wn…wN

A sentence model

2TwnP

KTwnP

iSTP 2

iSTP 1

iK STP

The TMM model for s specific sentence Si

55


• Word Topical Mixture Model (wTMM)– To explore the co-occurrence relationship between words of the l

anguage– Each word of the language as a topical mixture model for

predicting the occurrence of the other word

– Each sentence of the spoken document to be summarized was treated as a composite word TMM model for generating the document

– The likelihood of the document being generated by can be expressed as:

jww

jwM

K

kwkkw jj

MTPTwPMwP1

)|()|()|(

D iS

Dwn

Dw Sw

K

kwkkiji

ij

jMTPTwPSDP

,

1,

57

Comparison of Extractive Summarization Methods

• Literal Term Matching Vs. Concept Matching – Literal Term Matching ：

• Extraction using degree of similarity (VSM, MMR)• Extraction using features score (Sentence score)• HMM, HMMRM

– Concept Matching ：• Extraction using latent semantic analysis (LSA, DIM)• TMM, wTMM

58

Evaluation Metrics (1/3)

• Subjective Evaluation Metrics (Direct evaluation)– Conducted by human subjects– Different levels

• Objective Evaluation Metrics– Automatic summaries were evaluated by objective metrics

• Automatic Evaluation– Summaries are evaluated by IR

59


• Objective Evaluation Metrics– Sentence recall/precision (Hirohata et al. 2004)

• Sentence recall/precision is commonly used in evaluating sentence-extraction-based text summarization

• Sentence boundaries are not explicitly indicated in input speech, estimated boundaries based on recognition results do not always agree with those in manual summaries

(Kitade et al., 2004)• F-measure, F-measure/max, F-measure/ave.

man

summan

S

SSR

sum

summan

S

SSP

PR

PRF

2

60


• Objective Evaluation Metrics– ROUGE-N (Lin et al. 2003)

• ROUGE-N is an N-gram recall between an automatic summary and a set of manual summaries

– Cosine Measure (Saggion et al. 2002, Ho 2003)

H n

H n

SS Sgn

SS Sgnm

gC

gC

)(

)(

NROUGE

DD

H

hDhDDhD

H

RmASIMmEmASIM

DmACC

2

%),(%)(%),(1

%)( 1,,

%)(%)(

%)(%)(

,

,

,%)(%),(mEmA

mEmA

DhD

DhD

DhD

VV

VVmEmASIM

昨天　馬英九　訪問　中國大陸

昨天　馬英九　結束　訪問　回國

x

y

DA

DhE ,

natural language processing references: 1. foundations of statistical natural language processing 2....

Documents

language components

language engineering

speech taggingreviewtagging

body of texts statistical

nonstatistical nlpthe

methodselectrical engineering

artcanadian computer

human interrogator