Upload: wendy-beryl-burns

Post on 03-Jan-2016

What is the first step you would take to comprehend the passage below?

Slumdog Millionaire, the latest megahit flick, tells the rags-to-riches story of a slum dweller. The movie, an adaptation of a novel, is based on the popular Indian version of the American quiz show – Who Wants to Be a Millionaire – which was well received by the masses.

Freida Pinto is the heroine of the movie. She hails from Mumbai. Even though it was her debut movie, her exemplary performance has earned her offers for many Hollywood movies.

Slumdog received numerous accolades from all over the world. Apart from the Oscars, some notable ones were the Toronto International Film Festival, Cannes, etc.

CS 626 - Group 1 Dept of CSE -IIT Bombay 1

Discourse Segmentation

CS 626 Course Seminar, Dept of CSE, IIT Bombay

Group 1: Sriraj (08305034), Dipak (08305901), Balamurali (08405401)

The way we go….

• Introduction

• Motivation

• TextTiling

• Context Vectors and Segmentation

• Lexical Chains and Segmentation

• Segmentation with LSA

• Conclusion

• References

INTRODUCTION

Discourse comes from the Latin word 'discursus'.

Discourse: "A continuous stretch of (especially spoken) language larger than a sentence, often constituting a coherent unit such as a sermon, argument, joke, or narrative" - (Crystal 1992)

Discourse covers novels as well as short conversations or even groans (cries).


Beaugrande's definition of discourse

• Cohesion - the grammatical relationship between parts of a sentence, essential for its interpretation.

• Coherence - the order of statements; they relate to one another by sense.

• Intentionality - the message has to be conveyed deliberately and consciously.

• Acceptability - the communicative product needs to be satisfactory, in that the audience approves it.

• Informativeness - some new information has to be included in the discourse.

• Situationality - the circumstances in which the remark is made are important.

• Intertextuality - reference to the world outside the text or to the interpreters' schemata.


DISCOURSE STRUCTURE - SALIENT FEATURES

• Existence of a hierarchy

• Segmentation at the semantic level

• Domain-specific knowledge


DISCOURSE SEGMENTATION

"Partition of full-length text into coherent multi-paragraph units" - Marti Hearst


MOTIVATION

• Text Summarization

• Question Answering

• Sentiment Analysis

• Topic Detection


TEXTTILING

Uses the TF-IDF concept within a single document.

Analogy - IR: Document -> Entire Corpus; NLP: Block -> Entire Document.

A term used more inside a block weighs more.

Adjacent blocks that contain more related terms - evidence of strong cohesion.


CONTD...

Algorithm -

• Divide the text into blocks (say, k sentences long).

• Compute the cosine similarity between adjacent blocks:

  cos(b1, b2) = Σ_t w_{t,b1} · w_{t,b2} / √( Σ_t w_{t,b1}² · Σ_t w_{t,b2}² )

• Plot the smoothed, interpolated similarity against the sentence gap number.

• The lowermost portions of the valleys are the boundaries.
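The steps above can be sketched in Python. This is a minimal illustration under simplifying assumptions (raw word counts instead of TF-IDF weights, no smoothing or interpolation), not Hearst's original implementation:

```python
import math
from collections import Counter

def block_similarity(b1, b2):
    """Cosine similarity between two blocks, each a list of words."""
    w1, w2 = Counter(b1), Counter(b2)
    num = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    den = math.sqrt(sum(c * c for c in w1.values()) *
                    sum(c * c for c in w2.values()))
    return num / den if den else 0.0

def texttile(sentences, k=2):
    """Return indices into the gap-similarity sequence that are local
    minima (valley bottoms), i.e. candidate segment boundaries.
    sentences: list of tokenized sentences; blocks are k sentences long."""
    sims = []
    for gap in range(k, len(sentences) - k + 1):
        left = [w for s in sentences[gap - k:gap] for w in s]
        right = [w for s in sentences[gap:gap + k] for w in s]
        sims.append(block_similarity(left, right))
    # A valley: similarity strictly lower than both neighbours.
    return [i for i in range(1, len(sims) - 1)
            if sims[i] < sims[i - 1] and sims[i] < sims[i + 1]]
```

On a toy text whose first three sentences are about one topic and last three about another, the single valley falls at the topic shift.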

CONTD...

(Figure: plot of smoothed similarity against sentence gap number; valleys mark boundaries. Source: [1])

Are we satisfied?

TEXTTILING - WHAT WENT WRONG?

The same word need not be repeated - but a similar word could be.

WSD was not performed - polysemy issues.

Contextual information was not considered.


CONTEXT VECTORS & SEGMENTATION

Capture contextual information in different blocks.

Steps:

• Encode contextual information - context vector creation

• Create block vectors

• Measure similarity - instead of TF-IDF weights, use context vectors:

  cos(v, w) = Σ_t v_t · w_t / √( Σ_t v_t² · Σ_t w_t² )
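A minimal sketch of these steps. Note the context-vector encoding here is a simple co-occurrence count built from the text itself; Kaufmann's method [2] derives the vectors from a training corpus:

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Encode each word by the counts of words co-occurring with it
    within +/- window positions (context vector creation)."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vecs[w][sent[j]] += 1
    return vecs

def block_vector(block, vecs):
    """Block vector: sum of the context vectors of the block's words."""
    v = Counter()
    for w in block:
        v.update(vecs[w])
    return v

def cosine(v, w):
    num = sum(v[t] * w[t] for t in v.keys() & w.keys())
    den = math.sqrt(sum(x * x for x in v.values()) *
                    sum(x * x for x in w.values()))
    return num / den if den else 0.0
```

Blocks that share no surface words can still score high: with the sentences [["car", "engine"], ["automobile", "engine"]], the context vectors of "car" and "automobile" are identical, so the cosine between single-word blocks built from them is 1, where raw term matching gives 0.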

DID IT DO THE TRICK?

Yes!

• Precision increased from 32% to 52%

• Recall increased from 40% to 51%

Let's try to improve a bit more!


LEXICAL CHAINS

• A technique for computing lexical cohesion.

• A sequence of related words in the text.

• Independent of the grammatical structure.

• Provides a context for disambiguation.

• Enables identification of the concept.


Different forms of Lexical Cohesion

• Repetition

• Repetition through synonymy
– police, officers

• Word association through
– specialization/generalization: murder weapon, knife
– part-whole/whole-part relationship: committee, members

• Statistical association between words
– Osama Bin Laden and World Trade Center

How

• Uses an auxiliary resource (WordNet) to cluster words into sets of related concepts

• Areas of low cohesive strength are good indicators of topic boundaries

• Process:
– Tokenizer
– Lexical chainer
– Boundary detector

Process

• Tokenizer
– POS tagging is done
– Morphological analysis is done

• Lexical Chainer
– Finds relations between tokens
– Single-pass clustering
– The first token starts the first chain
– Each subsequent token is added to the most recently updated chain with which it shares the strongest relationship
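The single-pass chaining loop can be sketched as follows. The hand-made synonym table is a toy stand-in for WordNet, and relation strength is reduced to a yes/no test (the first related most-recently-updated chain wins), which simplifies the "strongest relationship" rule above:

```python
# Toy stand-in for WordNet relations (an assumption for illustration).
SYNONYMS = {
    frozenset(("police", "officers")),
    frozenset(("committee", "members")),
}

def related(a, b):
    """Toy relatedness test: identity or a listed synonym pair."""
    return a == b or frozenset((a, b)) in SYNONYMS

def build_chains(tokens):
    """Single-pass lexical chaining: each token joins the most
    recently updated chain containing a related word; otherwise
    it starts a new chain."""
    chains = []  # most recently updated chain is kept last
    for tok in tokens:
        for i in range(len(chains) - 1, -1, -1):
            if any(related(tok, w) for w in chains[i]):
                chain = chains.pop(i)
                chain.append(tok)
                chains.append(chain)  # move to "most recent" position
                break
        else:
            chains.append([tok])
    return chains
```

For the token stream ["police", "committee", "officers", "members"] this produces the two chains police–officers and committee–members.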

Process Contd...

• Boundary Detection
– A boundary is indicated where a high concentration of chains begins and ends between two adjacent textual units
– Boundary strength: w(n, n+1) = E * S

• E = number of lexical chains whose span ends at sentence n

• S = number of chains that begin their span at sentence n+1

– Take the mean of all non-zero scores
– This mean acts as the minimum allowable boundary strength
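The boundary-strength rule above can be sketched as follows; chains are given as hypothetical (first_sentence, last_sentence) spans with 0-based sentence indices:

```python
def detect_boundaries(chains, num_sentences):
    """Score each gap (n, n+1) as w = E * S, where E counts chains
    ending at sentence n and S counts chains starting at n+1; keep
    gaps whose score reaches the mean of the non-zero scores."""
    scores = []
    for n in range(num_sentences - 1):
        e = sum(1 for start, end in chains if end == n)
        s = sum(1 for start, end in chains if start == n + 1)
        scores.append(e * s)
    nonzero = [w for w in scores if w > 0]
    if not nonzero:
        return []
    threshold = sum(nonzero) / len(nonzero)
    # n in the result means a boundary between sentences n and n+1.
    return [n for n, w in enumerate(scores) if w >= threshold]
```

With chains [(0, 2), (1, 2), (3, 5), (3, 4)] over six sentences, two chains end at sentence 2 and two begin at sentence 3, so the only boundary is placed between sentences 2 and 3.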

And the Improvement is …

• Evaluation metrics: Precision and Recall

                        Precision   Recall
  SeLeCT                   36.6      62.7
  JTextTile                13.3      19.7
  Random Segmentation       7.1       7.1

Latent Semantic Analysis (LSA)

Problems with Frequency Vector Based Similarity

Short Passages

• The similarity estimate is inaccurate for short passages.

• An additional occurrence of a common word (reflected in the numerator) causes a disproportionate increase in sim(x, y) unless the denominator is large:

  sim(x, y) = Σ_j f_{x,j} · f_{y,j} / √( Σ_j f_{x,j}² · Σ_j f_{y,j}² )
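The short-passage effect is easy to demonstrate numerically. A small sketch of the frequency-vector cosine above, applied to two-word passages that share only the common word "the":

```python
import math

def sim(fx, fy):
    """Frequency-vector cosine similarity, as in the formula above.
    fx, fy: dicts mapping term -> frequency."""
    num = sum(fx.get(t, 0) * fy.get(t, 0) for t in fx.keys() | fy.keys())
    den = math.sqrt(sum(v * v for v in fx.values()) *
                    sum(v * v for v in fy.values()))
    return num / den if den else 0.0

# One shared occurrence of "the": sim = 1 / 2 = 0.5
low = sim({"the": 1, "cat": 1}, {"the": 1, "dog": 1})

# One extra occurrence of "the" in each: sim jumps to 4 / 5 = 0.8,
# even though the passages are no more related in content.
high = sim({"the": 2, "cat": 1}, {"the": 2, "dog": 1})
```

The small denominator of short passages lets a single extra common word inflate the score disproportionately.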

Problems with Frequency Vector Based Similarity …cont'd(2)

Term Matching Problem

• Car; Automobile
• Car; Petrol

• Similar/related but distinct words are counted as negative evidence

• Solutions:
– Stemming
– Thesaurus or WordNet-based similarity measures
– Latent Semantic Analysis

Introduction to LSA

• LSA stems from work in IR

• Represents word and passage meaning as high-dimensional vectors in the semantic space

• Does not use humanly constructed dictionaries, knowledge bases, semantic networks, etc.

• Meaning of a word: the average of the meanings of all passages in which it appears

• Meaning of a passage: the average of the meanings of all the words it contains

Training LSA

• Input: a set of texts {t_1, …, t_m}

• Vocabulary: {w_1, …, w_n}

• Build the term-by-text matrix A, with one row per vocabulary word w_1 … w_n and one column per text t_1 … t_m; entry A_ij is the frequency of word w_i in text t_j

The values are scaled according to a general form of inverse document frequency.

Dimensionality reduction using SVD

Training LSA …cont’d(2)

B = U Σ Vᵀ, with B of size m×n, U of size m×r, Σ (diagonal, holding the singular values) of size r×r, and Vᵀ of size r×n

Keeping only the k largest singular values gives the k-dimensional LSA space; the LSA feature vector for word w_i is its row in this reduced space.

Benefits of applying SVD:

• Concise representation: the storage and complexity of the similarity matrix are reduced

• Captures major structural associations between words and documents

• Noise is removed simply by omitting the less salient dimensions in U
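A sketch of the training step using NumPy. Raw counts are used here for clarity; the slides scale values by a form of inverse document frequency before the SVD:

```python
import numpy as np

def train_lsa(texts, k=2):
    """Build the term-by-text matrix A (one row per word, one column
    per text), factor it with SVD, and keep the top k dimensions.
    Returns the vocabulary and one k-dimensional vector per word."""
    vocab = sorted({w for t in texts for w in t})
    A = np.array([[t.count(w) for t in texts] for w in vocab], dtype=float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Row i of U_k * S_k is the LSA feature vector for word vocab[i].
    word_vecs = U[:, :k] * s[:k]
    return vocab, word_vecs
```

For three toy texts and k = 2 this yields a 2-dimensional vector per vocabulary word.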


Applying LSA

• A sentence s_i is represented by its term-frequency vector f_i, where f_ij is the frequency of term j in s_i

• Meaning of s_i: the frequency-weighted sum of the LSA vectors u_j of its words:

  s̃_i = Σ_j f_i(j) · u_j

• The inter-sentence similarity matrix is built from cosines of these vectors:

  M_ij = cos(s̃_i, s̃_j) = Σ_k s̃_{i,k} · s̃_{j,k} / √( Σ_k s̃_{i,k}² · Σ_k s̃_{j,k}² )
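A sketch of this application step; vocab and word_vecs are assumed to come from a previously trained LSA model:

```python
import numpy as np

def similarity_matrix(sentences, vocab, word_vecs):
    """Represent each sentence as the frequency-weighted sum of the
    LSA vectors of its words, then return M with M[i, j] equal to
    the cosine of sentence vectors i and j."""
    index = {w: i for i, w in enumerate(vocab)}
    S = np.zeros((len(sentences), word_vecs.shape[1]))
    for i, sent in enumerate(sentences):
        for w in sent:
            if w in index:              # skip out-of-vocabulary words
                S[i] += word_vecs[index[w]]
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0           # guard against empty sentences
    S = S / norms
    return S @ S.T
```

With toy 2-dimensional word vectors, two sentences made only of "cat" get cosine 1 with each other and 0 with a "dog" sentence.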

Significance of k

• Finding the optimal dimensionality is an important step in LSA

• Hypothetically, the optimal space for the reconstruction has the same dimensionality as the source that generates the discourse

• The source generates passages by choosing words from a k-dimensional space in such a way that words in the same paragraph tend to be selected from nearby locations

LSA results

• LSA is twice as accurate as the word-similarity-based co-occurrence vector (error reduced from 22% to 11%)

• LSA values become less accurate as more dimensions are incorporated into the feature vectors

Conclusion

• TextTiling, context-vector-based similarity, lexical chaining, and LSA are all bag-of-words approaches.

• Bag-of-words approaches are sufficient … to some extent: "LSA makes no use of word order, thus of syntactic relations or logic, or of morphology. Remarkably, it manages to extract reflections of passage and word meanings quite well without these aids, but it must still be suspected of resulting incompleteness or likely error on some occasions" (excerpt from [5]).

Contd..

• LSA is purely statistical, whereas the other approaches use some form of external knowledge base in addition to statistical techniques.

• External knowledge therefore plays an important role.

• To move to the next level we need some linguistics.

• We need the right mix of statistical and linguistic approaches to move forward.


Reference

[1]. Hearst, M. A. 1993. TextTiling: A Quantitative Approach to Discourse Segmentation. Technical Report, UMI Order Number: S2K-93-24, University of California at Berkeley.

[2]. Kaufmann, S. 1999. Cohesion and collocation: using context vectors in text segmentation. In Proceedings of the 37th Annual Meeting of the Association For Computational Linguistics on Computational Linguistics , Pages 99-107

[3]. Landauer, T. K., Foltz, P. W., & Laham, D. 1998. Introduction to Latent Semantic Analysis. Discourse Processes, 25, pages 259-284.

[4]. Barzilay, Regina and Michael Elhadad. 1997. Using lexical chains for text summarization. In Proceedings of the Intelligent Scalable Text Summarization Workshop (ISTS-97), Madrid, Spain.

[4]. Freddy Y. Y. Choi. 2000. Advances in domain independent linear text segmentation. In Proceedings of NAACL , pages 26-33

[5]. Freddy Y. Y. Choi, Peter Wiemer-Hastings, Johanna Moore. 2001. Latent semantic analysis for text segmentation. In Proceedings of EMNLP, pages 109-117.

[6]. Stokes, N., Carthy, J., Smeaton, A.F. 2002. Segmenting Broadcast News Streams Using Lexical Chains. in Proceedings of 1st Starting AI Researchers Symposium (STAIRS 2002), volume 1, pp.145-154.


Contd..

[7]. http://www.freewebs.com/hsalhi/Discourse%20Analysis%20Handout.doc

[8]. http://ilze.org/semio/005.htm

[9]. http://www.dfki.de/etai/SpecialIssues/Dia99/denecke/ETAI-00/node11.html

[10]. http://www.csi.ucd.ie/staff/jcarthy/home/Lex.html
