1
A Boosting Algorithm for Classification of Semi-Structured Text
Taku Kudo *#
Yuji Matsumoto *
* Nara Institute of Science and Technology
# Currently, NTT Communication Science Labs.
2
Background: Text Classification using Machine Learning
- categories: topics (sports, finance, politics, …)
- features: bag-of-words (BOW)
- methods: SVM, Boosting, Naïve Bayes
Changes in categories: modalities, subjectivities, or sentiments
Changes in text size: document (large) → passage, sentence (small)
Our claim: BOW is not sufficient
3
Background, cont.
Straightforward extensions: add structural features, e.g., fixed-length N-grams or fixed-length syntactic relations.
But…
- ad hoc and task-dependent; requires careful feature selection
- how do we determine the optimal size (length)?
- larger substructures are inefficient; smaller substructures reduce to BOW
4
Our approach
Semi-structured text: assume that text is represented as a tree (word sequence, dependency tree, base phrases, XML).
We propose a new ML algorithm that automatically captures the relevant substructures in semi-structured text.
Characteristics:
- an instance is not a numerical vector but a tree
- all subtrees are used as features, without any constraints
- a compact and relevant feature set is selected automatically
5
Classifier for Trees
6
Tree classification problem
Goal: induce a mapping $f(x): \mathcal{X} \rightarrow \{+1, -1\}$ from given training data.
Training data: a set of pairs of tree x and class label y (+1 or -1),
$$T = \{\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, \ldots, \langle x_L, y_L \rangle\}$$
[Figure: four example trees over the labels a, b, c, d, each paired with a class label +1 or -1]
7
Labeled ordered tree, subtree
Labeled ordered tree (or simply tree):
- labeled: each node is associated with a label
- ordered: siblings are ordered
Subtree:
- preserves the parent-daughter relation
- preserves the sibling relation
- preserves the labels
[Figure: trees A and B; B is a subtree of A, and A is a supertree of B]
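To make the subtree relation concrete, here is a minimal Python sketch (illustrative only, not the authors' implementation); the (label, children-list) encoding and the greedy matching are assumptions, but the check preserves labels, parent-daughter relations, and sibling order exactly as defined above:

```python
# A labeled ordered tree encoded as (label, [children]) -- an assumed
# representation for illustration, not the paper's data structure.

def embeds_at(t, node):
    """True if t matches with its root at `node`: labels agree and t's
    children map, in left-to-right order, onto an ordered subsequence of
    node's children (parent-daughter and sibling relations preserved)."""
    t_label, t_kids = t
    n_label, n_kids = node
    if t_label != n_label:
        return False
    i = 0
    for kid in t_kids:            # greedy leftmost matching of siblings
        while i < len(n_kids) and not embeds_at(kid, n_kids[i]):
            i += 1
        if i == len(n_kids):
            return False
        i += 1
    return True

def nodes(x):
    """All nodes of x in preorder."""
    yield x
    for child in x[1]:
        yield from nodes(child)

def contains(x, t):
    """True if t is a subtree of x (equivalently, x is a supertree of t)."""
    return any(embeds_at(t, n) for n in nodes(x))
```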
8
Decision stumps for trees
A simple rule-based classifier:
$$h_{\langle t, y \rangle}(x) = \begin{cases} y & \text{if } t \subseteq x \\ -y & \text{otherwise} \end{cases}$$
<t, y> is a parameter (rule) of the decision stump.
[Figure: an example tree x and two rules; t1 of <t1, +1> is a subtree of x, so h_<t1,+1>(x) = +1, while t2 of <t2, -1> is not, so h_<t2,-1>(x) = +1]
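With contains() in place, the stump is a one-liner. The concrete trees below are only a rough analogue of the slide's drawing (the exact shapes are assumptions):

```python
def stump(t, y, x):
    """Decision stump h_<t,y>(x): vote y if t is a subtree of x, else -y."""
    return y if contains(x, t) else -y

# Assumed stand-ins for the slide's example:
x  = ('b', [('d', []), ('a', [('c', [])])])
t1 = ('a', [])            # rule <t1, +1>: t1 occurs in x     -> h = +1
t2 = ('d', [('b', [])])   # rule <t2, -1>: t2 does not occur  -> h = +1
print(stump(t1, +1, x), stump(t2, -1, x))   # 1 1
```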
9
Decision stumps for trees, cont.
Training: select the optimal rule that maximizes the gain (or accuracy):
$$\langle \hat{t}, \hat{y} \rangle = \operatorname*{argmax}_{t \in F,\ y \in \{+1,-1\}} \mathrm{gain}(\langle t, y \rangle), \qquad \mathrm{gain}(\langle t, y \rangle) = \sum_{i=1}^{L} y_i\, h_{\langle t, y \rangle}(x_i)$$
where $F = \bigcup_{i=1}^{L} \{\, t' \mid t' \subseteq x_i \,\}$ is the feature set (the set of all subtrees) of the training data $T = \{\langle x_1, y_1 \rangle, \ldots, \langle x_L, y_L \rangle\}$.
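As a sanity check of this objective (deliberately brute-force; the branch-and-bound described later replaces it), one could enumerate an explicit candidate set:

```python
def best_rule(candidates, data):
    """Exhaustively maximize gain(<t,y>) = sum_i y_i * h_<t,y>(x_i).
    `candidates` stands in for F, which is far too large to enumerate
    in practice."""
    best, best_gain = None, float('-inf')
    for t in candidates:
        for y in (+1, -1):
            g = sum(yi * stump(t, y, xi) for xi, yi in data)
            if g > best_gain:
                best, best_gain = (t, y), g
    return best, best_gain
```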
10
Decision stumps for trees, cont.
[Table: every candidate rule <t, y> is scored on the four training trees; each row lists the stump outputs h_<t,y>(x_i) per instance and the resulting gain, e.g. <a, +1> and <a, -1> both obtain gain 0, while a larger subtree rule obtains the maximum gain of 4]
Select the optimal rule that yields the maximum gain.
11
Boosting
Decision stumps alone are too weak. Boosting [Schapire97]:
1. build a weak learner (decision stump) H_j
2. re-weight the instances with respect to their error rates
3. repeat steps 1 and 2 K times
4. output a linear combination of H_1, …, H_K
Redefine the gain for Boosting:
$$\mathrm{gain}(\langle t, y \rangle) = \sum_{i=1}^{L} y_i\, d_i\, h_{\langle t, y \rangle}(x_i), \qquad d_i \geq 0, \quad \sum_{i=1}^{L} d_i = 1$$
where $d_i$ is the weight of instance i.
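A compact AdaBoost loop over these stumps, under the same assumed encoding; the identity eps = (1 - gain) / 2 holds because the d_i sum to 1, and exhaustive search again stands in for the efficient rule finder:

```python
import math

def boost(candidates, data, K=10):
    """AdaBoost with the weighted gain sum_i y_i * d_i * h_<t,y>(x_i)."""
    L = len(data)
    d = [1.0 / L] * L                      # instance weights, sum to 1
    ensemble = []                          # (alpha_k, t_k, y_k)
    for _ in range(K):
        # 1. build a weak learner: the stump with maximum weighted gain
        (t, y), g = max(
            (((t, y), sum(yi * di * stump(t, y, xi)
                          for (xi, yi), di in zip(data, d)))
             for t in candidates for y in (+1, -1)),
            key=lambda r: r[1])
        eps = (1.0 - g) / 2.0              # weighted error: gain = 1 - 2*eps
        alpha = 0.5 * math.log((1.0 - eps) / max(eps, 1e-12))
        ensemble.append((alpha, t, y))
        # 2. re-weight instances with respect to their errors
        d = [di * math.exp(-alpha * yi * stump(t, y, xi))
             for (xi, yi), di in zip(data, d)]
        z = sum(d)
        d = [di / z for di in d]
    # 3. steps 1-2 ran K times; 4. the ensemble is a linear combination
    return ensemble

def predict(ensemble, x):
    return 1 if sum(a * stump(t, y, x) for a, t, y in ensemble) >= 0 else -1
```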
12
Efficient Computation
13
How to find the optimal rule?
$$\langle \hat{t}, \hat{y} \rangle = \operatorname*{argmax}_{t \in F,\ y \in \{+1,-1\}} \sum_{i=1}^{L} y_i\, d_i\, h_{\langle t, y \rangle}(x_i), \qquad F = \bigcup_{i=1}^{L} \{\, t' \mid t' \subseteq x_i \,\}$$
F is far too huge to be enumerated explicitly, so we need to find the optimal rule efficiently.
A variant of branch-and-bound:
- define a search space in which the whole set of subtrees is enumerated
- find the optimal rule by traversing this search space
- prune the search space with a proposed criterion
14
Rightmost extension [Asai02, Zaki02]
Extend a given tree of size (n-1) by adding a new node to obtain trees of size n:
- the node is added to a node on the rightmost path
- the node is added as the rightmost sibling
[Figure: a tree t over the label set L = {a, b, c}; a new node 7 is attached at each node of the rightmost path, once per label in L]
15
Rightmost extension, cont.
Recursive application of rightmost extensions creates a search space.
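A sketch of one extension step under the same (label, [children]) encoding; the label set {a, b, c} mirrors the slide's example:

```python
import copy

def rightmost_extensions(t, labels=('a', 'b', 'c')):
    """All trees of size n obtained from a tree t of size n-1 by attaching
    one new node, with each possible label, as the new rightmost child of
    some node on t's rightmost path."""
    # positions (paths of child indices) of the nodes on the rightmost path
    positions, pos, node = [()], (), t
    while node[1]:
        pos += (len(node[1]) - 1,)
        node = node[1][-1]
        positions.append(pos)
    extended = []
    for p in positions:
        for label in labels:
            new = copy.deepcopy(t)
            target = new
            for i in p:
                target = target[1][i]
            target[1].append((label, []))   # added as the rightmost sibling
            extended.append(new)
    return extended
```

Starting from the single-node trees ('a', []), ('b', []), ('c', []) and applying this repeatedly enumerates every labeled ordered tree over L = {a, b, c}, which is exactly the search space the pruning below walks.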
16
Pruning
For all $t' \supseteq t$ and $y \in \{+1, -1\}$, propose an upper bound $\mu(t)$ such that $\mathrm{gain}(\langle t', y \rangle) \leq \mu(t)$.
We can prune the node t if $\mu(t) < \tau$, where $\tau$ is the (suboptimal) best gain found so far.
[Figure: a search tree annotated with gains and bounds; a node with $\mu(t) = 0.4$ is pruned once a gain larger than 0.4 has been found]
Pruning strategy: $\mu(t) = 0.4$ implies that the gain of any supertree of t is no greater than 0.4.
17
Upper bound of the gain (an extension of [Morishita 02]):
$$\mu(t) = \max\left( 2 \sum_{\{i \mid y_i = +1,\ t \subseteq x_i\}} d_i - \sum_{i=1}^{L} y_i\, d_i,\quad 2 \sum_{\{i \mid y_i = -1,\ t \subseteq x_i\}} d_i + \sum_{i=1}^{L} y_i\, d_i \right)$$
where $\mathrm{gain}(\langle t', y \rangle) \leq \mu(t)$ for all $t' \supseteq t$ and $y \in \{+1, -1\}$.
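Putting the bound and the traversal together, a minimal branch-and-bound sketch (reusing contains, stump, and rightmost_extensions from the earlier sketches; tau tracks the best gain found so far, and a node is expanded only when mu(t) exceeds it):

```python
def mu(t, data, d):
    """Upper bound on gain(<t',y>) over all supertrees t' of t: only the
    instances that contain t can contain a supertree of t."""
    s = sum(yi * di for (_, yi), di in zip(data, d))
    pos = sum(di for (xi, yi), di in zip(data, d)
              if yi == +1 and contains(xi, t))
    neg = sum(di for (xi, yi), di in zip(data, d)
              if yi == -1 and contains(xi, t))
    return max(2 * pos - s, 2 * neg + s)

def find_best_rule(data, d, labels=('a', 'b', 'c')):
    """Branch-and-bound search over the rightmost-extension space."""
    def gain(t, y):
        return sum(yi * di * stump(t, y, xi)
                   for (xi, yi), di in zip(data, d))
    tau, best = float('-inf'), None
    stack = [(lab, []) for lab in labels]   # roots: all single-node trees
    while stack:
        t = stack.pop()
        for y in (+1, -1):
            g = gain(t, y)
            if g > tau:
                tau, best = g, (t, y)
        if mu(t, data, d) > tau:            # otherwise prune t and all
            stack.extend(rightmost_extensions(t, labels))  # its supertrees
    return best, tau
```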
18
Relation to SVMs with Tree Kernel
19
Classification algorithm
$$f(x) = \mathrm{sgn}\left( \sum_{k=1}^{K} y_k\, h_{\langle t_k, y_k \rangle}(x) \right) = \mathrm{sgn}\left( \sum_{k=1}^{K} y_k\, (2 I(t_k \subseteq x) - 1) \right) = \mathrm{sgn}\left( \sum_{t \in F} w_t\, I(t \subseteq x) - b \right)$$
where $h_{\langle t, y \rangle}(x) = y\, (2 I(t \subseteq x) - 1)$, $I(\cdot)$ is the indicator function, $w_t = 2 \sum_{\{k \mid t_k = t\}} y_k$, and $b = \sum_{k=1}^{K} y_k$.
Modeled as a linear classifier:
- $w_t$: weight of tree t
- $-b$: bias (default class label)
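Folding the trained stumps into (w_t, b) is mechanical; a small helper, assuming a list of (t_k, y_k) pairs as on the slide:

```python
from collections import defaultdict

def to_linear(stumps):
    """Fold stumps [(t_1, y_1), ..., (t_K, y_K)] into the linear form
    f(x) = sgn(sum_t w_t * I(t <= x) - b), via
    h_<t,y>(x) = y * (2 * I(t <= x) - 1)."""
    w = defaultdict(int)
    b = 0
    for t, y in stumps:
        w[repr(t)] += 2 * y   # w_t = 2 * sum_{k : t_k = t} y_k
        b += y                # b = sum_k y_k
    return dict(w), b         # keys are repr(tree): the tuples hold
                              # lists, so they are not hashable as-is
```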
20
SVMs and Tree Kernel [Collins 02]
$$\Phi(x) = ( I_{t_1}(x), \ldots, I_{t_J}(x) )$$
[Figure: a tree is mapped to an indicator vector {0, …, 1, …, 1, …, 1, …, 0, …} over all of its subtrees]
Tree Kernel: all subtrees are expanded implicitly.
SVM: $f(x) = \mathrm{sgn}(\, \mathbf{w} \cdot \Phi(x) - b \,)$
Boosting: $f(x) = \mathrm{sgn}\left( \sum_{t \in F} w_t\, I(t \subseteq x) - b \right)$
The feature spaces are essentially the same; the learning strategies are different.
21
SVM vs. Boosting [Rätsch 01]
Both are known as large-margin classifiers; the metric of the margin differs.
Boosting: L1-norm margin
- w is expressed with a small number of features
- a sparse solution in the feature space
SVM: L2-norm margin
- w is expressed with a small number of examples (the support vectors)
- a sparse solution in the example space
22
SVM vs. Boosting, cont.
Accuracy is task-dependent. Practical advantages of Boosting:
- Good interpretability: one can analyze how the model performs and which features are useful, and the compact features (rules) are easy to deal with.
- Fast classification: the complexity depends on the small number of rules, whereas kernel methods are heavy.
23
Experiments
24
Sentence classification
PHS: cell-phone review classification (5,741 sentences)
- domain: Web-based BBS about PHS, a sort of cell phone
- categories: positive review or negative review
- positive: "It is useful that we can know the date and time of E-Mails."
- negative: "I feel that the response is not so good."
MOD: modality identification (1,710 sentences)
- domain: editorial news articles
- categories: assertion, opinion, or description
- assertion: "We should not hold an optimistic view of the success of POKEMON."
- opinion: "I think that now is the best time for developing the blue print."
- description: "Social function of education has been changing."
25
Sentence representations
N-gram tree:
- each word simply modifies the next word
- any subtree is an N-gram (N is unrestricted)
Dependency tree:
- word-based dependency tree
- a Japanese dependency parser, CaboCha, is used
Bag-of-words (baseline)
[Figure: the sentence "response is very good" shown both as an N-gram chain and as a dependency tree]
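The N-gram tree in particular is just a chain; a sketch with the slide's example sentence (the dependency variant would come from a parser such as CaboCha instead):

```python
def ngram_tree(words):
    """Chain tree in which each word modifies the next; its connected
    subtrees are exactly the N-grams, with N unrestricted."""
    node = (words[0], [])
    for w in words[1:]:
        node = (w, [node])    # the following word becomes the head
    return node

x = ngram_tree("response is very good".split())
# ('good', [('very', [('is', [('response', [])])])])
```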
26
Results
                        PHS     MOD
                                opinion   assertion   description
Boosting      bow       76.0    59.6      70.0        82.2
              dep       78.7    78.7      86.7        91.7
              n-gram    79.3    76.7      87.2        91.6
SVM +         dep       77.0    24.2      81.7        87.6
Tree Kernel   n-gram    78.9    57.5      84.1        90.1

- Boosting with trees outperforms the baseline (bow)
- dep vs. n-gram: comparable (no significant difference)
- SVMs show worse performance on some tasks (overfitting)
27
Interpretability
PHS dataset with dependency, $f(x) = \mathrm{sgn}\left( \sum_{t \in F} w_t\, I(t \subseteq x) - b \right)$
A: subtrees that include "hard, difficult"
  0.0004  be hard to hang up
 -0.0006  be hard to read
 -0.0007  be hard to use
 -0.0017  be hard to …
B: subtrees that include "use"
  0.0027  want to use
  0.0002  use
  0.0002  be in use
  0.0001  be easy to use
 -0.0001  was easy to use
 -0.0007  be hard to use
 -0.0019  is easier to use than…
C: subtrees that include "recharge"
  0.0028  recharging time is short
 -0.0041  recharging time is long
28
Interpretability, cont.
PHS dataset with dependency, $f(x) = \mathrm{sgn}\left( \sum_{t \in F} w_t\, I(t \subseteq x) - b \right)$
Input: "The LCD is large, beautiful and easy to see"

  weight w    subtree t
  0.00368     be easy to
  0.00353     beautiful
  0.00237     be easy to see
  0.00174     is large
  0.00107     The LCD is large
  0.00074     The LCD is …
  0.00057     The LCD
  0.00036     see
 -0.00001     large
29
Advantages
Compact feature set:
- Boosting extracts only 1,783 unique features
- the numbers of distinct 1-grams, 2-grams, and 3-grams are 4,211, 24,206, and 43,658 respectively
- SVMs implicitly use a huge number of features
Fast classification:
- Boosting: 0.531 sec. / 5,741 instances
- SVM: 255.42 sec. / 5,741 instances
- Boosting is about 480 times faster than SVMs
30
Conclusions
- We assume that text is represented as a tree.
- Extension of decision stumps: all subtrees are potentially used as features.
- Boosting.
- Branch-and-bound makes it possible to find the optimal rule efficiently.
Advantages:
- good interpretability
- fast classification
- accuracy comparable to SVMs with tree kernels
31
Future work
Other applications:
- information extraction
- semantic-role labeling
- parse-tree re-ranking
Confidence-rated predictions for decision stumps.
32
Thank you!
An implementation of our method is available as open-source software at:
http://chasen.naist.jp/~taku/software/bact/