Experimenting with the TextTiling Algorithm
Adam C., Andreani V., Bengsston J., Bouchara N., Choucavy L., Delpech E., El Maarouf I., Fontan L., Gotlik W.
Summary of the work done by master's students at Université Toulouse Le Mirail
Part I: What is the TextTiling algorithm?
Part II: Experiments with the TextTiling algorithm
Part III: Demo
Part I: What is the TextTiling algorithm?
« an algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts »
developed by Marti Hearst (1997):
«TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages », In Computational Linguistics, March 1997.
http://www.ischool.berkeley.edu/~hearst/tiling-about.html
Why segment a text into multi-paragraph units?
Computational tasks that use arbitrary windows might benefit from using windows with motivated boundaries
Easier reading of long online texts (Reading Assistant Tools)
IR : retrieving relevant passages instead of whole document
Summarization : extract sentences according to their position in the subtopic structure
What is the hypothesis behind TextTiling ?
« TextTiling assumes that a set of lexical items is in use during the course of a given subtopic discussion, and when that subtopic changes, a significant proportion of the vocabulary changes as well »
TextTiling doesn't detect subtopics per se, but rather shifts in topic, by means of changes in vocabulary
Performs a linear segmentation (no hierarchy)
Detection of topic shift
Raw text
→ Tokenisation
→ Segmentation into pseudo-sentences (20 tokens each)
→ Similarity score S between block A and block B:
a similarity score is computed at every pseudo-sentence gap between 2 blocks of 6 pseudo-sentences; the more vocabulary the blocks have in common, the higher the score
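The pipeline above can be sketched in Python (a minimal sketch; the original student implementation was in Perl, and the function names, the regex tokeniser, and the shared-vocabulary count used as a score here are illustrative, not the authors' code):

```python
import re

def pseudo_sentences(text, size=20):
    # Tokenise the raw text and group tokens into fixed-size
    # pseudo-sentences of `size` tokens (20 in the slides).
    tokens = re.findall(r"\w+", text.lower())
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def gap_scores(pseqs, block_size=6):
    # At each pseudo-sentence gap, compare the block of up to 6
    # pseudo-sentences before the gap with the block after it.
    # The score here is simply the size of the shared vocabulary:
    # the more vocabulary in common, the higher the score.
    scores = []
    for gap in range(1, len(pseqs)):
        block_a = set(sum(pseqs[max(0, gap - block_size):gap], []))
        block_b = set(sum(pseqs[gap:gap + block_size], []))
        scores.append(len(block_a & block_b))
    return scores
```

On a toy text whose vocabulary changes halfway through, the score drops to its minimum exactly at the vocabulary shift.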
[Figure: similarity SCORE (y-axis, 0–1) plotted against pseudo-sentence number (x-axis), shown before and after smoothing]
A gap means there is a drop in vocabulary similarity.
Topic shifts occur at the deepest gaps (after smoothing).
Tile boundaries are then adjusted to the nearest paragraph break.
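The smoothing and gap-depth steps can be sketched as follows (a minimal sketch: the moving-average smoothing, the hill-climbing depth score, and the mean-minus-half-a-standard-deviation cutoff are in the spirit of Hearst's paper, but parameter names and details here are illustrative):

```python
def smooth(scores, width=1, iterations=1):
    # Moving-average smoothing, repeated `iterations` times
    # over a window of `width` neighbours on each side.
    for _ in range(iterations):
        scores = [sum(scores[max(0, i - width):i + width + 1]) /
                  len(scores[max(0, i - width):i + width + 1])
                  for i in range(len(scores))]
    return scores

def depth_scores(scores):
    # Depth of the gap at i: climb from the valley to the nearest
    # peak on the left and on the right; deep gaps = deep drops
    # in vocabulary similarity, i.e. candidate topic shifts.
    depths = []
    for i in range(len(scores)):
        l = i
        while l > 0 and scores[l - 1] >= scores[l]:
            l -= 1
        r = i
        while r < len(scores) - 1 and scores[r + 1] >= scores[r]:
            r += 1
        depths.append((scores[l] - scores[i]) + (scores[r] - scores[i]))
    return depths

def boundaries(depths):
    # Keep gaps deeper than mean(depth) - stdev(depth) / 2,
    # a cutoff in the spirit of Hearst (1997).
    mean = sum(depths) / len(depths)
    var = sum((d - mean) ** 2 for d in depths) / len(depths)
    cutoff = mean - (var ** 0.5) / 2
    return [i for i, d in enumerate(depths) if d > cutoff]
```

In a real tiler the selected gaps would then be snapped to the nearest paragraph break, as the slide notes.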
Evaluation by Hearst (1997)
Evaluation on 12 magazine articles annotated by 7 judges
Judges were asked « to mark the paragraph boundary at which the topic changed »
Agreement among judges (kappa measure): kappa = 0.647
In case of disagreement among judges, a boundary is kept if at least 3 judges agree on it
Evaluation by Hearst (1997)
                    Precision   Recall
Baseline (random)     0.43       0.42
TextTiler             0.66       0.61
Judges                0.81       0.71
Works well on long (>1800 words) expository texts with little structural demarcation
Part II: Experiments with the TextTiling algorithm
Work done by master's students, Université Toulouse Le Mirail
Experiments:
implementation in Perl
cross-annotation of 3 texts
variation of linguistic parameters and computation parameters
Annotation of topic boundaries
No clear-cut topic shifts, rather 'regions' of shift
A difficult (unnatural?) task for humans
Annotators felt a smaller unit (the sentence) would have been more convenient
Our kappa: 0.56 (Hearst's judges: 0.65)
Kappa should be at least 0.67; above 0.8 is considered reliable
Variation of linguistic parameters
                 Precision   F-measure   Recall
Basic              0.58        0.35       0.25
Trigrams           0.61        0.26       0.17
Lemmatization*     0.53        0.34       0.23
(lemmatization via TreeTagger*)
* http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
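The F-measure column above is the harmonic mean of precision and recall; a one-line check (values recover the table up to rounding of the displayed precision and recall):

```python
def f_measure(precision, recall):
    # F-measure = harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# e.g. the "basic" row: precision 0.58, recall 0.25 gives F ≈ 0.35
```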
Variation of computation parameters
[Figures: similarity-score curves (y-axis 0–0.7) over pseudo-sentence positions, under variation of the computation window (pseudo-sentence length, block length) and of the smoothing]
Size of computation window (rows: pseudo-sentence length; columns: block length)

Pseudo-sentence     Block length
length         2    4    6    8    10   12   14   16   18   20
 5             ++   +++  ++   ++   ++   ++   ++   ++   ++   ++
10             ++   ++   ++   +    +    ++   +    +    +    +
15             ++   +    +    +    +    +    +    -    -    -
20             +    +    +    -    -    -    -    -    -    --
25             +    +    -    -    -    -    -    --   --   --
30             +    -    -    -    -    --   --   --   --   --
35             +    -    -    -    -    --   --   --   --   --
40             --   --   --   --   --   --   --   --   --   --
Correlation between window size and smoothing:
The smaller the window, the more smoothing is needed.

Window size (number of tokens)   10   20   30   40   50
Smoothing iterations              3    3    1    1    1
Smoothing width                   2    1    2    2    1
Optimal parameter set

         Nb parag.   Nb words   Words/parag.   Sentences/block   Tokens/sentence   Smooth. iterations   Smooth. width
Text 1      12         2000         167               6                 5                  3                  2
Text 2      22         2400         109               6                10                  1                  1
Text 3      37         1750          20               8                10                  1                  1
One optimal parameter set per text
Does the optimal set vary according to text/paragraph length?
Final thoughts
Linguistic processing:
lemmatization doesn't significantly improve TextTiling; what about stemming?
Computation parameters:
parameters are highly interdependent, and the optimal parameter set varies from text to text
Proposal: an adaptive TextTiler?
window size could be adapted to the text's intrinsic qualities; smoothing could then be adapted to the window size
Part III: Demo
Similarity score – Hearst (1997)
sim(b1, b2) = Σ_t (w_{t,b1} · w_{t,b2}) / √( Σ_t w²_{t,b1} · Σ_t w²_{t,b2} )

b1: block 1
b2: block 2
t: token
w_{t,b}: weight (frequency) of token t in block b
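Hearst's similarity score is a cosine similarity over token frequencies; a minimal sketch (function and variable names are illustrative):

```python
from collections import Counter
from math import sqrt

def sim(block1, block2):
    # Cosine similarity between two blocks of tokens, with
    # w(t, b) = frequency of token t in block b, matching the
    # formula above.
    w1, w2 = Counter(block1), Counter(block2)
    num = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    den = sqrt(sum(c * c for c in w1.values()) *
               sum(c * c for c in w2.values()))
    return num / den if den else 0.0
```

Identical blocks score 1.0, blocks with no vocabulary in common score 0.0.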
Kappa measure
                  Annot 1
                yes      no       TOTAL
Annot 2   yes    40      35       Y2 = 75
          no      5      20       N2 = 25
TOTAL         Y1 = 45  N1 = 55    T = 100

Observed agreement: P(A) = (40 + 20) / 100 = 0.6
Expected agreement: P(E) = (Y1·Y2 + N1·N2) / T² = 0.475
Kappa = (P(A) − P(E)) / (1 − P(E)) = 0.24
http://www.musc.edu/dc/icrebm/kappa.html
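The kappa computation above can be written out directly (a minimal sketch; the argument names are illustrative, and the counts follow the 2×2 table above):

```python
def kappa(a, b, c, d):
    # Cohen's kappa from a 2x2 agreement table:
    # a: both annotators say yes, b: annot 2 yes / annot 1 no,
    # c: annot 2 no / annot 1 yes, d: both say no
    t = a + b + c + d
    p_a = (a + d) / t                   # observed agreement P(A)
    y1, y2 = a + c, a + b               # "yes" totals per annotator
    n1, n2 = b + d, c + d               # "no" totals per annotator
    p_e = (y1 * y2 + n1 * n2) / t ** 2  # expected agreement P(E)
    return (p_a - p_e) / (1 - p_e)

# The worked example above: kappa(40, 35, 5, 20) ≈ 0.24
```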