Experimenting with the TextTiling Algorithm
Adam C., Andreani V., Bengsston J., Bouchara N., Choucavy L., Delpech E., El Maarouf I., Fontan L., Gotlik W.
Summary of the work done by master's students at Université Toulouse Le Mirail
Part I: What is the TextTiling algorithm?
Part II: Experiments with the TextTiling algorithm
Part III: Demo
Part I: What is the TextTiling algorithm?
« an algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts »
developed by Marti Hearst (1997):
«TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages », In Computational Linguistics, March 1997.
http://www.ischool.berkeley.edu/~hearst/tiling-about.html
Why segment a text into multi-paragraph units?
Computational tasks that use arbitrary windows might benefit from using windows with motivated boundaries
Easier reading of long online texts (Reading Assistant Tools)
IR : retrieving relevant passages instead of whole document
Summarization : extract sentences according to their position in the subtopic structure
What is the hypothesis behind TextTiling ?
« TextTiling assumes that a set of lexical items is in use during the course of a given subtopic discussion, and when that subtopic changes, a significant proportion of the vocabulary changes as well »
TextTiling doesn't detect subtopics per se, but rather shifts in topic, by means of changes in vocabulary
Performs a linear segmentation (no hierarchy)
Detection of topic shift
Raw text
→ Tokenisation
→ Segmentation into pseudo-sentences (20 tokens each)
→ Similarity score S between block A and block B:
a similarity score is computed at every pseudo-sentence gap between 2 blocks of 6 pseudo-sentences; the more vocabulary the blocks have in common, the higher the score
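The pipeline above can be sketched in Python (a minimal sketch; the original student implementation was in Perl, and the function names, the regex tokeniser, and the shared-vocabulary count used as a score here are illustrative, not the authors' code):

```python
import re

def pseudo_sentences(text, size=20):
    # Tokenise the raw text and group tokens into fixed-size
    # pseudo-sentences of `size` tokens (20 in the slides).
    tokens = re.findall(r"\w+", text.lower())
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def gap_scores(pseqs, block_size=6):
    # At each pseudo-sentence gap, compare the block of up to 6
    # pseudo-sentences before the gap with the block after it.
    # The score here is simply the size of the shared vocabulary:
    # the more vocabulary in common, the higher the score.
    scores = []
    for gap in range(1, len(pseqs)):
        block_a = set(sum(pseqs[max(0, gap - block_size):gap], []))
        block_b = set(sum(pseqs[gap:gap + block_size], []))
        scores.append(len(block_a & block_b))
    return scores
```

On a toy text whose vocabulary changes halfway through, the score drops to its minimum exactly at the vocabulary shift.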
[Figure: similarity SCORE (y-axis, 0–1) plotted against pseudo-sentence number (x-axis), shown before and after smoothing]
A gap means there is a drop in vocabulary similarity.
Topic shifts occur at the deepest gaps (after smoothing).
Tile boundaries are then adjusted to the nearest paragraph break.
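The smoothing and gap-depth steps can be sketched as follows (a minimal sketch: the moving-average smoothing, the hill-climbing depth score, and the mean-minus-half-a-standard-deviation cutoff are in the spirit of Hearst's paper, but parameter names and details here are illustrative):

```python
def smooth(scores, width=1, iterations=1):
    # Moving-average smoothing, repeated `iterations` times
    # over a window of `width` neighbours on each side.
    for _ in range(iterations):
        scores = [sum(scores[max(0, i - width):i + width + 1]) /
                  len(scores[max(0, i - width):i + width + 1])
                  for i in range(len(scores))]
    return scores

def depth_scores(scores):
    # Depth of the gap at i: climb from the valley to the nearest
    # peak on the left and on the right; deep gaps = deep drops
    # in vocabulary similarity, i.e. candidate topic shifts.
    depths = []
    for i in range(len(scores)):
        l = i
        while l > 0 and scores[l - 1] >= scores[l]:
            l -= 1
        r = i
        while r < len(scores) - 1 and scores[r + 1] >= scores[r]:
            r += 1
        depths.append((scores[l] - scores[i]) + (scores[r] - scores[i]))
    return depths

def boundaries(depths):
    # Keep gaps deeper than mean(depth) - stdev(depth) / 2,
    # a cutoff in the spirit of Hearst (1997).
    mean = sum(depths) / len(depths)
    var = sum((d - mean) ** 2 for d in depths) / len(depths)
    cutoff = mean - (var ** 0.5) / 2
    return [i for i, d in enumerate(depths) if d > cutoff]
```

In a real tiler the selected gaps would then be snapped to the nearest paragraph break, as the slide notes.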
Evaluation by Hearst (1997)
Evaluation on 12 magazine articles annotated by 7 judges
Judges were asked « to mark the paragraph boundary at which the topic changed »
Agreement among judges (kappa measure): kappa = 0.647
In case of disagreement among judges, a boundary is kept if at least 3 judges agree on it
Evaluation by Hearst (1997)
                    Precision   Recall
Baseline (random)     0.43       0.42
TextTiler             0.66       0.61
Judges                0.81       0.71
Works well on long (>1800 words) expository texts with little structural demarcation
Part II: Experiments with the TextTiling algorithm
Work done by master's students, Université Toulouse Le Mirail
Experiments:
implementation in Perl
cross-annotation of 3 texts
variation of linguistic parameters and computation parameters
Annotation of topic boundaries
No clear-cut topic shifts, rather 'regions' of shift
A difficult (unnatural?) task for humans
Annotators felt a smaller unit (the sentence) would have been more convenient
Our kappa: 0.56 (Hearst's judges: 0.65)
Kappa should be at least 0.67; above 0.8 is considered reliable
Variation of linguistic parameters
                 Precision   F-measure   Recall
Basic              0.58        0.35       0.25
Trigrams           0.61        0.26       0.17
Lemmatization*     0.53        0.34       0.23
(lemmatization via TreeTagger*)
* http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
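The F-measure column above is the harmonic mean of precision and recall; a one-line check (values recover the table up to rounding of the displayed precision and recall):

```python
def f_measure(precision, recall):
    # F-measure = harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# e.g. the "basic" row: precision 0.58, recall 0.25 gives F ≈ 0.35
```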
Variation of computation parameters
[Figures: similarity-score curves (y-axis 0–0.7) over pseudo-sentence positions, under variation of the computation window (pseudo-sentence length, block length) and of the smoothing]
Size of computation window (rows: pseudo-sentence length; columns: block length)

Pseudo-sentence     Block length
length         2    4    6    8    10   12   14   16   18   20
 5             ++   +++  ++   ++   ++   ++   ++   ++   ++   ++
10             ++   ++   ++   +    +    ++   +    +    +    +
15             ++   +    +    +    +    +    +    -    -    -
20             +    +    +    -    -    -    -    -    -    --
25             +    +    -    -    -    -    -    --   --   --
30             +    -    -    -    -    --   --   --   --   --
35             +    -    -    -    -    --   --   --   --   --
40             --   --   --   --   --   --   --   --   --   --
Correlation between window size and smoothing:
The smaller the window, the more smoothing is needed.

Window size (number of tokens)   10   20   30   40   50
Smoothing iterations              3    3    1    1    1
Smoothing width                   2    1    2    2    1
Optimal parameter set

         Nb parag.   Nb words   Words/parag.   Sentences/block   Tokens/sentence   Smooth. iterations   Smooth. width
Text 1      12         2000         167               6                 5                  3                  2
Text 2      22         2400         109               6                10                  1                  1
Text 3      37         1750          20               8                10                  1                  1
One optimal parameter set per text
Does the optimal set vary according to text/paragraph length?
Final thoughts
Linguistic processing:
lemmatization doesn't significantly improve TextTiling; what about stemming?
Computation parameters:
parameters are highly interdependent, and the optimal parameter set varies from text to text
Proposal: an adaptive TextTiler?
window size could be adapted to the text's intrinsic qualities; smoothing could then be adapted to the window size
Part III: Demo
Similarity score – Hearst (1997)
sim(b1, b2) = Σ_t (w_{t,b1} · w_{t,b2}) / √( Σ_t w²_{t,b1} · Σ_t w²_{t,b2} )

b1: block 1
b2: block 2
t: token
w_{t,b}: weight (frequency) of token t in block b
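Hearst's similarity score is a cosine similarity over token frequencies; a minimal sketch (function and variable names are illustrative):

```python
from collections import Counter
from math import sqrt

def sim(block1, block2):
    # Cosine similarity between two blocks of tokens, with
    # w(t, b) = frequency of token t in block b, matching the
    # formula above.
    w1, w2 = Counter(block1), Counter(block2)
    num = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    den = sqrt(sum(c * c for c in w1.values()) *
               sum(c * c for c in w2.values()))
    return num / den if den else 0.0
```

Identical blocks score 1.0, blocks with no vocabulary in common score 0.0.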
Kappa measure
                  Annot 1
                yes      no       TOTAL
Annot 2   yes    40      35       Y2 = 75
          no      5      20       N2 = 25
TOTAL         Y1 = 45  N1 = 55    T = 100

Observed agreement: P(A) = (40 + 20) / 100 = 0.6
Expected agreement: P(E) = (Y1·Y2 + N1·N2) / T² = 0.475
Kappa = (P(A) − P(E)) / (1 − P(E)) = 0.24
http://www.musc.edu/dc/icrebm/kappa.html
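The kappa computation above can be written out directly (a minimal sketch; the argument names are illustrative, and the counts follow the 2×2 table above):

```python
def kappa(a, b, c, d):
    # Cohen's kappa from a 2x2 agreement table:
    # a: both annotators say yes, b: annot 2 yes / annot 1 no,
    # c: annot 2 no / annot 1 yes, d: both say no
    t = a + b + c + d
    p_a = (a + d) / t                   # observed agreement P(A)
    y1, y2 = a + c, a + b               # "yes" totals per annotator
    n1, n2 = b + d, c + d               # "no" totals per annotator
    p_e = (y1 * y2 + n1 * n2) / t ** 2  # expected agreement P(E)
    return (p_a - p_e) / (1 - p_e)

# The worked example above: kappa(40, 35, 5, 20) ≈ 0.24
```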