Experimenting the TextTiling Algorithm
Adam C., Andreani V., Bengsston J., Bouchara N., Choucavy L., Delpech E., El Maarouf I., Fontan L., Gotlik W.
Summary of the work done by master students at Université Toulouse Le Mirail
Experimenting the TextTiling algorithm
Part I : What is the TextTiling algorithm ?
Part II : Experimentations with the TextTiling algorithm
Part III : Demo
Part I :
What is the TextTiling algorithm?
« an algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts »
developed by Marti Hearst (1997) :
« TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages », Computational Linguistics, March 1997.
http://www.ischool.berkeley.edu/~hearst/tiling-about.html
Why segment a text into multi-paragraph units ?
Computational tasks that use arbitrary windows might benefit from using windows with motivated boundaries
Easier reading of long online texts (reading assistant tools)
IR : retrieving relevant passages instead of whole documents
Summarization : extracting sentences according to their position in the subtopic structure
What is the hypothesis behind TextTiling ?
« TextTiling assumes that a set of lexical items is in use during the course of a given subtopic discussion, and when that subtopic changes, a significant proportion of the vocabulary changes as well »
TextTiling doesn’t detect subtopics per se, but shifts in topic, signalled by changes in vocabulary
Performs a linear segmentation (no hierarchy)
Detection of topic shift
Raw text
Tokenisation
Segmentation into pseudo-sentences (20 tokens)
Similarity score S : block A vs block B
a similarity score is computed at every pseudo-sentence boundary, between the block of 6 pseudo-sentences before it and the block of 6 pseudo-sentences after it
the more vocabulary the two blocks have in common, the higher the score
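The steps on this slide can be sketched in Python as follows. This is a minimal illustration, not the students' Perl implementation: the function names, the regular-expression tokeniser and the handling of short blocks at the ends of the text are all assumptions, and no stop-word filtering is applied.

```python
import math
import re
from collections import Counter


def pseudo_sentences(text, size=20):
    """Lowercase the text, extract word tokens, and cut them into
    fixed-size pseudo-sentences (the last one may be shorter)."""
    tokens = re.findall(r"\w+", text.lower())
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def cosine(a, b):
    """Cosine similarity between two token-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) *
                     sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_scores(sentences, block=6):
    """Score every gap between adjacent pseudo-sentences.

    Each score compares the block of up to `block` pseudo-sentences
    before the gap with the block after it. Blocks are simply shorter
    near the ends of the text (an assumption of this sketch).
    """
    scores = []
    for gap in range(1, len(sentences)):
        left = Counter(t for s in sentences[max(0, gap - block):gap] for t in s)
        right = Counter(t for s in sentences[gap:gap + block] for t in s)
        scores.append(cosine(left, right))
    return scores
```

With a toy text whose vocabulary changes completely halfway through, the lowest score falls on the gap where the vocabulary changes.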
Detection of topic shift
[Figure: two plots of the similarity score (y-axis, SCORE) against pseudo-sentence number (x-axis, 1 to 39)]
a gap means there is a drop in vocabulary similarity
topic shifts occur at the deepest gaps (after smoothing)
tile boundaries are then adjusted to the nearest paragraph break
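One way to turn "deepest gaps" into code is to give each gap a depth, measured against the nearest peak on each side, and to keep the valleys whose depth exceeds a cutoff. The sketch below assumes the scores have already been smoothed. The cutoff (mean depth minus half a standard deviation) and all function names are assumptions of this sketch; the slides do not state which threshold was used.

```python
from statistics import mean, pstdev


def depth_scores(scores):
    """Depth of each gap: distance from its score up to the nearest peak
    on the left, plus the distance up to the nearest peak on the right."""
    depths = []
    for i, s in enumerate(scores):
        left = s
        for v in reversed(scores[:i]):   # climb leftwards while scores do not fall
            if v < left:
                break
            left = v
        right = s
        for v in scores[i + 1:]:         # climb rightwards while scores do not fall
            if v < right:
                break
            right = v
        depths.append((left - s) + (right - s))
    return depths


def pick_boundaries(scores):
    """Indices of the gaps that are valleys of the score curve and whose
    depth exceeds the cutoff (assumed: mean depth minus half a std dev)."""
    depths = depth_scores(scores)
    cutoff = max(mean(depths) - pstdev(depths) / 2, 0.0)
    chosen = []
    for i, s in enumerate(scores):
        before = scores[i - 1] if i > 0 else float("inf")
        after = scores[i + 1] if i + 1 < len(scores) else float("inf")
        if s <= before and s <= after and depths[i] > cutoff:
            chosen.append(i)
    return chosen


def snap_to_paragraph(token_pos, paragraph_breaks):
    """Move a boundary (a token offset) to the nearest paragraph break."""
    return min(paragraph_breaks, key=lambda b: abs(b - token_pos))
```

If gap `i` lies after pseudo-sentence `i + 1`, its token offset is `(i + 1) * size`, which is the value to pass to `snap_to_paragraph` together with the token offsets of the paragraph breaks.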
Evaluation by Hearst (1997)
Evaluation on 12 magazine articles annotated by 7 judges
Judges are asked « to mark the paragraph boundary at which the topic changed »
Agreement among judges (kappa measure) : kappa = 0.647
In case of disagreement among judges, a boundary is kept if at least 3 judges agree on it
Evaluation by Hearst (1997)
                    Precision   Recall
Baseline (random)   0.43        0.42
TextTiler           0.66        0.61
Judges              0.81        0.71
Works well on long (1800+ words) expository texts with little structural demarcation
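As an illustration of how such figures can be obtained, the sketch below scores a set of predicted boundaries against reference boundaries. It assumes exact-match scoring on paragraph-boundary indices and a balanced F-measure; the function name is hypothetical, and the slides do not say whether near misses were credited.

```python
def boundary_prf(predicted, reference):
    """Precision, recall and balanced F-measure of predicted boundaries,
    counting a hit only when a predicted boundary matches a reference one."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, predicting boundaries after paragraphs 3, 7 and 12 when the reference has 3, 8, 12 and 15 gives two hits, so precision is 2/3 and recall is 1/2.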
Part II : Experimentations with the TextTiling algorithm
Work done by master's students, Université Toulouse Le Mirail
Implementation in Perl
Experiments :
cross-annotation of 3 texts
variation of : linguistic parameters, computation parameters
Annotation of topic boundaries
No clear-cut topic shift, rather ‘regions’ of shift
A difficult (unnatural ?) task for humans
Annotators felt a smaller unit (the sentence) would have been more convenient
Our kappa : 0.56 ; Hearst’s judges : 0.65
kappa should be above 0.67 ; above 0.8 is best
Variation of linguistic parameters
                              Precision   F-measure   Recall
basic                         0.58        0.35        0.25
trigrams                      0.61        0.26        0.17
lemmatization (TreeTagger*)   0.53        0.34        0.23
* http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
Variation of computation parameters
Computation window : pseudo-sentence length, block length
Smoothing
[Figure: three plots, y-axis from 0 to 0.7]
Size of computation window
Rows : pseudo-sentence length ; columns : block length
      2     4     6     8     10    12    14    16    18    20
5     ++    +++   ++    ++    ++    ++    ++    ++    ++    ++
10    ++    ++    ++    +     +     ++    +     +     +     +
15    ++    +     +     +     +     +     +     -     -     -
20    +     +     +     -     -     -     -     -     -     --
25    +     +     -     -     -     -     -     --    --    --
30    +     -     -     -     -     --    --    --    --    --
35    +     -     -     -     -     --    --    --    --    --
40    --    --    --    --    --    --    --    --    --    --
Correlation window size / smoothing
Correlation between window size and smoothing :
The smaller the window, the more smoothing is needed
Window size (number of tokens)   10   20   30   40   50
Smoothing iterations             3    3    1    1    1
Smoothing width                  2    1    2    2    1
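The two smoothing parameters can be read as the number of passes and the half-width of a moving-average window. That reading is an assumption, since the slides do not define "width"; the sketch below, with a hypothetical function name, shows one smoother of this kind.

```python
def smooth(scores, iterations=1, width=1):
    """Moving-average smoothing of a list of similarity scores.

    Each pass replaces every score by the mean of the scores lying within
    `width` positions on either side; the window is clipped at both ends.
    """
    out = list(scores)
    for _ in range(iterations):
        smoothed = []
        for i in range(len(out)):
            window = out[max(0, i - width):i + width + 1]
            smoothed.append(sum(window) / len(window))
        out = smoothed
    return out
```

Under this reading, a 10-token window with 3 iterations of width 2 would be smoothed with `smooth(scores, iterations=3, width=2)`.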
Optimal parameter set
         Nb parag.   Nb words   Words/parag.   sentences/block   tokens/sentence   smooth. iteration   smooth. width
Text 1   12          2000       167            6                 5                 3                   2
Text 2   22          2400       109            6                 10                1                   1
Text 3   37          1750       20             8                 10                1                   1
One optimal parameter set per text
Does the optimal set vary with text/paragraph length ?
Final thoughts
Linguistic processing :
lemmatization doesn’t significantly improve TextTiling ; what about stemming ?
Computation parameters :
parameters are highly interdependent
the optimal parameter set varies from text to text
Proposal : an adaptive TextTiler ?
window size could be adapted to the intrinsic qualities of the text
smoothing could then be adapted to window size
Part III : Demo
Similarity score – Hearst (1997)
$$\mathrm{Sim}(b_1, b_2) = \frac{\sum_t w_{t,b_1}\, w_{t,b_2}}{\sqrt{\sum_t w_{t,b_1}^2 \, \sum_t w_{t,b_2}^2}}$$
b1 : block 1
b2 : block 2
t : token
w : weight (frequency) of the token in the block
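As a worked example with made-up counts (not taken from the experiments): suppose block 1 contains "topic" twice and "text" once, and block 2 contains "topic" once and "tile" once. Only "topic" is shared, so:

$$\mathrm{Sim}(b_1, b_2) = \frac{2 \cdot 1}{\sqrt{(2^2 + 1^2)\,(1^2 + 1^2)}} = \frac{2}{\sqrt{10}} \approx 0.63$$

Two blocks with no token in common score 0, and two blocks with identical frequencies score 1.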
Kappa measure
                    Annot 1
                    yes      no       TOTAL
Annot 2   yes       40       35       Y2=75
          no        5        20       N2=25
          TOTAL     Y1=45    N1=55    T=100
Agreement : P(A) = 0.6
Expected agreement : P(E) = (Y1.Y2 + N1.N2) / T² = 0.475
$$\mathrm{Kappa} = \frac{P(A) - P(E)}{1 - P(E)} = 0.24$$
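The same computation can be checked with a few lines of Python. This is a sketch for the two-annotator, yes/no case only; the function name and the table layout (rows for annotator 2, columns for annotator 1, as on the slide) are choices made here.

```python
def kappa_2x2(table):
    """Kappa for two annotators from a 2x2 table of counts.

    table[0] = counts where annotator 2 said yes (annotator 1: yes, no)
    table[1] = counts where annotator 2 said no  (annotator 1: yes, no)
    Undefined (division by zero) when the expected agreement equals 1.
    """
    (a, b), (c, d) = table
    total = a + b + c + d
    p_agree = (a + d) / total                        # P(A): observed agreement
    y1, n1 = a + c, b + d                            # annotator 1 totals
    y2, n2 = a + b, c + d                            # annotator 2 totals
    p_expected = (y1 * y2 + n1 * n2) / total ** 2    # P(E): chance agreement
    return (p_agree - p_expected) / (1 - p_expected)
```

Calling `kappa_2x2([[40, 35], [5, 20]])` reproduces the slide's figures: P(A) = 0.6, P(E) = 0.475, and a kappa of about 0.24.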
http://www.musc.edu/dc/icrebm/kappa.html