1
A Boosting Algorithm for Classification of Semi-Structured Text
Taku Kudo *#
Yuji Matsumoto *
* Nara Institute of Science and Technology
# Currently, NTT Communication Science Labs.
2
Background: Text Classification using Machine Learning
- categories: topics (sports, finance, politics, …)
- features: bag-of-words (BOW)
- methods: SVM, Boosting, Naïve Bayes
Changes in categories: modalities, subjectivities, or sentiments
Changes in text size: document (large) → passage, sentence (small)
Our claim: BOW is not sufficient
3
Background, cont.
Straightforward extensions: add structural features, e.g., fixed-length N-grams or fixed-length syntactic relations.
But…
- ad hoc and task-dependent; requires careful feature selection
- how do we determine the optimal size (length)?
- larger substructures are inefficient; smaller substructures reduce to BOW
4
Our approach
Semi-structured text: assume that text is represented as a tree (word sequence, dependency tree, base phrases, XML).
We propose a new ML algorithm that automatically captures the relevant substructures in semi-structured text.
Characteristics:
- an instance is not a numerical vector but a tree
- all subtrees are used as features, without any constraints
- a compact and relevant feature set is selected automatically
5
Classifier for Trees
6
Tree classification problem
Goal: induce a mapping $f(x): \mathcal{X} \rightarrow \{+1, -1\}$ from given training data.
Training data: a set of pairs of tree x and class label y (+1 or -1),
$$T = \{\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, \ldots, \langle x_L, y_L \rangle\}$$
[Figure: four example trees over the labels a, b, c, d, each paired with a class label +1 or -1]
7
Labeled ordered tree, subtree
Labeled ordered tree (or simply tree):
- labeled: each node is associated with a label
- ordered: siblings are ordered
Subtree:
- preserves the parent-daughter relation
- preserves the sibling relation
- preserves the labels
[Figure: trees A and B; B is a subtree of A, and A is a supertree of B]
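To make the subtree relation concrete, here is a minimal Python sketch (illustrative only, not the authors' implementation); the (label, children-list) encoding and the greedy matching are assumptions, but the check preserves labels, parent-daughter relations, and sibling order exactly as defined above:

```python
# A labeled ordered tree encoded as (label, [children]) -- an assumed
# representation for illustration, not the paper's data structure.

def embeds_at(t, node):
    """True if t matches with its root at `node`: labels agree and t's
    children map, in left-to-right order, onto an ordered subsequence of
    node's children (parent-daughter and sibling relations preserved)."""
    t_label, t_kids = t
    n_label, n_kids = node
    if t_label != n_label:
        return False
    i = 0
    for kid in t_kids:            # greedy leftmost matching of siblings
        while i < len(n_kids) and not embeds_at(kid, n_kids[i]):
            i += 1
        if i == len(n_kids):
            return False
        i += 1
    return True

def nodes(x):
    """All nodes of x in preorder."""
    yield x
    for child in x[1]:
        yield from nodes(child)

def contains(x, t):
    """True if t is a subtree of x (equivalently, x is a supertree of t)."""
    return any(embeds_at(t, n) for n in nodes(x))
```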
8
Decision stumps for trees
A simple rule-based classifier:
$$h_{\langle t, y \rangle}(x) = \begin{cases} y & \text{if } t \subseteq x \\ -y & \text{otherwise} \end{cases}$$
<t, y> is a parameter (rule) of the decision stump.
[Figure: an example tree x and two rules; t1 of <t1, +1> is a subtree of x, so h_<t1,+1>(x) = +1, while t2 of <t2, -1> is not, so h_<t2,-1>(x) = +1]
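With contains() in place, the stump is a one-liner. The concrete trees below are only a rough analogue of the slide's drawing (the exact shapes are assumptions):

```python
def stump(t, y, x):
    """Decision stump h_<t,y>(x): vote y if t is a subtree of x, else -y."""
    return y if contains(x, t) else -y

# Assumed stand-ins for the slide's example:
x  = ('b', [('d', []), ('a', [('c', [])])])
t1 = ('a', [])            # rule <t1, +1>: t1 occurs in x     -> h = +1
t2 = ('d', [('b', [])])   # rule <t2, -1>: t2 does not occur  -> h = +1
print(stump(t1, +1, x), stump(t2, -1, x))   # 1 1
```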
9
Decision stumps for trees, cont.
Training: select the optimal rule that maximizes the gain (or accuracy):
$$\langle \hat{t}, \hat{y} \rangle = \operatorname*{argmax}_{t \in F,\ y \in \{+1,-1\}} \mathrm{gain}(\langle t, y \rangle), \qquad \mathrm{gain}(\langle t, y \rangle) = \sum_{i=1}^{L} y_i\, h_{\langle t, y \rangle}(x_i)$$
where $F = \bigcup_{i=1}^{L} \{\, t' \mid t' \subseteq x_i \,\}$ is the feature set (the set of all subtrees) of the training data $T = \{\langle x_1, y_1 \rangle, \ldots, \langle x_L, y_L \rangle\}$.
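As a sanity check of this objective (deliberately brute-force; the branch-and-bound described later replaces it), one could enumerate an explicit candidate set:

```python
def best_rule(candidates, data):
    """Exhaustively maximize gain(<t,y>) = sum_i y_i * h_<t,y>(x_i).
    `candidates` stands in for F, which is far too large to enumerate
    in practice."""
    best, best_gain = None, float('-inf')
    for t in candidates:
        for y in (+1, -1):
            g = sum(yi * stump(t, y, xi) for xi, yi in data)
            if g > best_gain:
                best, best_gain = (t, y), g
    return best, best_gain
```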
10
Decision stumps for trees, cont.
[Table: every candidate rule <t, y> is scored on the four training trees; each row lists the stump outputs h_<t,y>(x_i) per instance and the resulting gain, e.g. <a, +1> and <a, -1> both obtain gain 0, while a larger subtree rule obtains the maximum gain of 4]
Select the optimal rule that yields the maximum gain.
11
Boosting
Decision stumps alone are too weak. Boosting [Schapire97]:
1. build a weak learner (decision stump) H_j
2. re-weight the instances with respect to their error rates
3. repeat steps 1 and 2 K times
4. output a linear combination of H_1, …, H_K
Redefine the gain for Boosting:
$$\mathrm{gain}(\langle t, y \rangle) = \sum_{i=1}^{L} y_i\, d_i\, h_{\langle t, y \rangle}(x_i), \qquad d_i \geq 0, \quad \sum_{i=1}^{L} d_i = 1$$
where $d_i$ is the weight of instance i.
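A compact AdaBoost loop over these stumps, under the same assumed encoding; the identity eps = (1 - gain) / 2 holds because the d_i sum to 1, and exhaustive search again stands in for the efficient rule finder:

```python
import math

def boost(candidates, data, K=10):
    """AdaBoost with the weighted gain sum_i y_i * d_i * h_<t,y>(x_i)."""
    L = len(data)
    d = [1.0 / L] * L                      # instance weights, sum to 1
    ensemble = []                          # (alpha_k, t_k, y_k)
    for _ in range(K):
        # 1. build a weak learner: the stump with maximum weighted gain
        (t, y), g = max(
            (((t, y), sum(yi * di * stump(t, y, xi)
                          for (xi, yi), di in zip(data, d)))
             for t in candidates for y in (+1, -1)),
            key=lambda r: r[1])
        eps = (1.0 - g) / 2.0              # weighted error: gain = 1 - 2*eps
        alpha = 0.5 * math.log((1.0 - eps) / max(eps, 1e-12))
        ensemble.append((alpha, t, y))
        # 2. re-weight instances with respect to their errors
        d = [di * math.exp(-alpha * yi * stump(t, y, xi))
             for (xi, yi), di in zip(data, d)]
        z = sum(d)
        d = [di / z for di in d]
    # 3. steps 1-2 ran K times; 4. the ensemble is a linear combination
    return ensemble

def predict(ensemble, x):
    return 1 if sum(a * stump(t, y, x) for a, t, y in ensemble) >= 0 else -1
```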
12
Efficient Computation
13
How to find the optimal rule?
$$\langle \hat{t}, \hat{y} \rangle = \operatorname*{argmax}_{t \in F,\ y \in \{+1,-1\}} \sum_{i=1}^{L} y_i\, d_i\, h_{\langle t, y \rangle}(x_i), \qquad F = \bigcup_{i=1}^{L} \{\, t' \mid t' \subseteq x_i \,\}$$
F is far too huge to be enumerated explicitly, so we need to find the optimal rule efficiently.
A variant of branch-and-bound:
- define a search space in which the whole set of subtrees is enumerated
- find the optimal rule by traversing this search space
- prune the search space with a proposed criterion
14
Rightmost extension [Asai02, Zaki02]
Extend a given tree of size (n-1) by adding a new node to obtain trees of size n:
- the node is added to a node on the rightmost path
- the node is added as the rightmost sibling
[Figure: a tree t over the label set L = {a, b, c}; a new node 7 is attached at each node of the rightmost path, once per label in L]
15
Rightmost extension, cont.
Recursive application of rightmost extensions creates a search space.
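A sketch of one extension step under the same (label, [children]) encoding; the label set {a, b, c} mirrors the slide's example:

```python
import copy

def rightmost_extensions(t, labels=('a', 'b', 'c')):
    """All trees of size n obtained from a tree t of size n-1 by attaching
    one new node, with each possible label, as the new rightmost child of
    some node on t's rightmost path."""
    # positions (paths of child indices) of the nodes on the rightmost path
    positions, pos, node = [()], (), t
    while node[1]:
        pos += (len(node[1]) - 1,)
        node = node[1][-1]
        positions.append(pos)
    extended = []
    for p in positions:
        for label in labels:
            new = copy.deepcopy(t)
            target = new
            for i in p:
                target = target[1][i]
            target[1].append((label, []))   # added as the rightmost sibling
            extended.append(new)
    return extended
```

Starting from the single-node trees ('a', []), ('b', []), ('c', []) and applying this repeatedly enumerates every labeled ordered tree over L = {a, b, c}, which is exactly the search space the pruning below walks.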
16
Pruning
For all $t' \supseteq t$ and $y \in \{+1, -1\}$, propose an upper bound $\mu(t)$ such that $\mathrm{gain}(\langle t', y \rangle) \leq \mu(t)$.
We can prune the node t if $\mu(t) < \tau$, where $\tau$ is the (suboptimal) best gain found so far.
[Figure: a search tree annotated with gains and bounds; a node with $\mu(t) = 0.4$ is pruned once a gain larger than 0.4 has been found]
Pruning strategy: $\mu(t) = 0.4$ implies that the gain of any supertree of t is no greater than 0.4.
17
Upper bound of the gain (an extension of [Morishita 02]):
$$\mu(t) = \max\left( 2 \sum_{\{i \mid y_i = +1,\ t \subseteq x_i\}} d_i - \sum_{i=1}^{L} y_i\, d_i,\quad 2 \sum_{\{i \mid y_i = -1,\ t \subseteq x_i\}} d_i + \sum_{i=1}^{L} y_i\, d_i \right)$$
where $\mathrm{gain}(\langle t', y \rangle) \leq \mu(t)$ for all $t' \supseteq t$ and $y \in \{+1, -1\}$.
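Putting the bound and the traversal together, a minimal branch-and-bound sketch (reusing contains, stump, and rightmost_extensions from the earlier sketches; tau tracks the best gain found so far, and a node is expanded only when mu(t) exceeds it):

```python
def mu(t, data, d):
    """Upper bound on gain(<t',y>) over all supertrees t' of t: only the
    instances that contain t can contain a supertree of t."""
    s = sum(yi * di for (_, yi), di in zip(data, d))
    pos = sum(di for (xi, yi), di in zip(data, d)
              if yi == +1 and contains(xi, t))
    neg = sum(di for (xi, yi), di in zip(data, d)
              if yi == -1 and contains(xi, t))
    return max(2 * pos - s, 2 * neg + s)

def find_best_rule(data, d, labels=('a', 'b', 'c')):
    """Branch-and-bound search over the rightmost-extension space."""
    def gain(t, y):
        return sum(yi * di * stump(t, y, xi)
                   for (xi, yi), di in zip(data, d))
    tau, best = float('-inf'), None
    stack = [(lab, []) for lab in labels]   # roots: all single-node trees
    while stack:
        t = stack.pop()
        for y in (+1, -1):
            g = gain(t, y)
            if g > tau:
                tau, best = g, (t, y)
        if mu(t, data, d) > tau:            # otherwise prune t and all
            stack.extend(rightmost_extensions(t, labels))  # its supertrees
    return best, tau
```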
18
Relation to SVMs with Tree Kernel
19
Classification algorithm
$$f(x) = \mathrm{sgn}\left( \sum_{k=1}^{K} y_k\, h_{\langle t_k, y_k \rangle}(x) \right) = \mathrm{sgn}\left( \sum_{k=1}^{K} y_k\, (2 I(t_k \subseteq x) - 1) \right) = \mathrm{sgn}\left( \sum_{t \in F} w_t\, I(t \subseteq x) - b \right)$$
where $h_{\langle t, y \rangle}(x) = y\, (2 I(t \subseteq x) - 1)$, $I(\cdot)$ is the indicator function, $w_t = 2 \sum_{\{k \mid t_k = t\}} y_k$, and $b = \sum_{k=1}^{K} y_k$.
Modeled as a linear classifier:
- $w_t$: weight of tree t
- $-b$: bias (default class label)
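Folding the trained stumps into (w_t, b) is mechanical; a small helper, assuming a list of (t_k, y_k) pairs as on the slide:

```python
from collections import defaultdict

def to_linear(stumps):
    """Fold stumps [(t_1, y_1), ..., (t_K, y_K)] into the linear form
    f(x) = sgn(sum_t w_t * I(t <= x) - b), via
    h_<t,y>(x) = y * (2 * I(t <= x) - 1)."""
    w = defaultdict(int)
    b = 0
    for t, y in stumps:
        w[repr(t)] += 2 * y   # w_t = 2 * sum_{k : t_k = t} y_k
        b += y                # b = sum_k y_k
    return dict(w), b         # keys are repr(tree): the tuples hold
                              # lists, so they are not hashable as-is
```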
20
SVMs and Tree Kernel [Collins 02]
$$\Phi(x) = ( I_{t_1}(x), \ldots, I_{t_J}(x) )$$
[Figure: a tree is mapped to an indicator vector {0, …, 1, …, 1, …, 1, …, 0, …} over all of its subtrees]
Tree Kernel: all subtrees are expanded implicitly.
SVM: $f(x) = \mathrm{sgn}(\, \mathbf{w} \cdot \Phi(x) - b \,)$
Boosting: $f(x) = \mathrm{sgn}\left( \sum_{t \in F} w_t\, I(t \subseteq x) - b \right)$
The feature spaces are essentially the same; the learning strategies are different.
21
SVM vs. Boosting [Rätsch 01]
Both are known as large-margin classifiers; the metric of the margin differs.
Boosting: L1-norm margin
- w is expressed with a small number of features
- a sparse solution in the feature space
SVM: L2-norm margin
- w is expressed with a small number of examples (the support vectors)
- a sparse solution in the example space
22
SVM vs. Boosting, cont.
Accuracy is task-dependent. Practical advantages of Boosting:
- Good interpretability: one can analyze how the model performs and which features are useful, and the compact features (rules) are easy to deal with.
- Fast classification: the complexity depends on the small number of rules, whereas kernel methods are heavy.
23
Experiments
24
Sentence classification
PHS: cell-phone review classification (5,741 sentences)
- domain: Web-based BBS about PHS, a sort of cell phone
- categories: positive review or negative review
- positive: "It is useful that we can know the date and time of E-Mails."
- negative: "I feel that the response is not so good."
MOD: modality identification (1,710 sentences)
- domain: editorial news articles
- categories: assertion, opinion, or description
- assertion: "We should not hold an optimistic view of the success of POKEMON."
- opinion: "I think that now is the best time for developing the blue print."
- description: "Social function of education has been changing."
25
Sentence representations
N-gram tree:
- each word simply modifies the next word
- any subtree is an N-gram (N is unrestricted)
Dependency tree:
- word-based dependency tree
- a Japanese dependency parser, CaboCha, is used
Bag-of-words (baseline)
[Figure: the sentence "response is very good" shown both as an N-gram chain and as a dependency tree]
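The N-gram tree in particular is just a chain; a sketch with the slide's example sentence (the dependency variant would come from a parser such as CaboCha instead):

```python
def ngram_tree(words):
    """Chain tree in which each word modifies the next; its connected
    subtrees are exactly the N-grams, with N unrestricted."""
    node = (words[0], [])
    for w in words[1:]:
        node = (w, [node])    # the following word becomes the head
    return node

x = ngram_tree("response is very good".split())
# ('good', [('very', [('is', [('response', [])])])])
```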
26
Results
                        PHS     MOD
                                opinion   assertion   description
Boosting      bow       76.0    59.6      70.0        82.2
              dep       78.7    78.7      86.7        91.7
              n-gram    79.3    76.7      87.2        91.6
SVM +         dep       77.0    24.2      81.7        87.6
Tree Kernel   n-gram    78.9    57.5      84.1        90.1

- Boosting with trees outperforms the baseline (bow)
- dep vs. n-gram: comparable (no significant difference)
- SVMs show worse performance on some tasks (overfitting)
27
Interpretability
PHS dataset with dependency, $f(x) = \mathrm{sgn}\left( \sum_{t \in F} w_t\, I(t \subseteq x) - b \right)$
A: subtrees that include "hard, difficult"
  0.0004  be hard to hang up
 -0.0006  be hard to read
 -0.0007  be hard to use
 -0.0017  be hard to …
B: subtrees that include "use"
  0.0027  want to use
  0.0002  use
  0.0002  be in use
  0.0001  be easy to use
 -0.0001  was easy to use
 -0.0007  be hard to use
 -0.0019  is easier to use than…
C: subtrees that include "recharge"
  0.0028  recharging time is short
 -0.0041  recharging time is long
28
Interpretability, cont.
PHS dataset with dependency, $f(x) = \mathrm{sgn}\left( \sum_{t \in F} w_t\, I(t \subseteq x) - b \right)$
Input: "The LCD is large, beautiful and easy to see"

  weight w    subtree t
  0.00368     be easy to
  0.00353     beautiful
  0.00237     be easy to see
  0.00174     is large
  0.00107     The LCD is large
  0.00074     The LCD is …
  0.00057     The LCD
  0.00036     see
 -0.00001     large
29
Advantages
Compact feature set:
- Boosting extracts only 1,783 unique features
- the numbers of distinct 1-grams, 2-grams, and 3-grams are 4,211, 24,206, and 43,658 respectively
- SVMs implicitly use a huge number of features
Fast classification:
- Boosting: 0.531 sec. / 5,741 instances
- SVM: 255.42 sec. / 5,741 instances
- Boosting is about 480 times faster than SVMs
30
Conclusions
- We assume that text is represented as a tree.
- Extension of decision stumps: all subtrees are potentially used as features.
- Boosting.
- Branch-and-bound makes it possible to find the optimal rule efficiently.
Advantages:
- good interpretability
- fast classification
- accuracy comparable to SVMs with tree kernels
31
Future work
Other applications:
- information extraction
- semantic-role labeling
- parse-tree re-ranking
Confidence-rated predictions for decision stumps.
32
Thank you!
An implementation of our method is available as open-source software at:
http://chasen.naist.jp/~taku/software/bact/