
Page 1: A Boosting Algorithm for Classification of  Semi-Structured Text

A Boosting Algorithm for Classification of Semi-Structured Text

Taku Kudo *#
Yuji Matsumoto *

* Nara Institute of Science and Technology
# Currently, NTT Communication Science Labs.

Page 2: A Boosting Algorithm for Classification of  Semi-Structured Text

Background

Text classification using machine learning:
- categories: topics (sports, finance, politics, …)
- features: bag-of-words (BOW)
- methods: SVM, Boosting, Naïve Bayes

Changes in categories: modalities, subjectivities, or sentiments

Changes in text size: document (large) → passage, sentence (small)

Our claim: BOW is not sufficient

Page 3: A Boosting Algorithm for Classification of  Semi-Structured Text

Background, cont.

Straightforward extensions:
- add structural features, e.g., fixed-length N-grams or fixed-length syntactic relations

But…
- ad hoc and task-dependent; requires careful feature selection
- how do we determine the optimal size (length)?
  - larger substructures yield inefficiency
  - smaller substructures are the same as BOW

Page 4: A Boosting Algorithm for Classification of  Semi-Structured Text

Our approach

Semi-structured text:
- assume that text is represented as a tree: word sequence, dependency tree, base phrases, XML

We propose a new ML algorithm that automatically captures relevant substructures in semi-structured text.

Characteristics:
- an instance is not a numerical vector but a tree
- all subtrees are used as features, without any constraints
- a compact and relevant feature set is selected automatically

Page 5: A Boosting Algorithm for Classification of  Semi-Structured Text


Classifier for Trees

Page 6: A Boosting Algorithm for Classification of  Semi-Structured Text

Tree classification problem

Goal: induce a mapping

  f(x): \mathcal{X} \rightarrow \{+1, -1\}

from given training data.

Training data: a set of pairs of a tree x and a class label y (+1 or −1):

  T = \{\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, \ldots, \langle x_L, y_L \rangle\}

[Figure: example training trees, each labeled +1 or −1.]
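To make the data structures concrete, here is a minimal sketch in Python; the Node class and the variable names are our own illustration, not part of the paper.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A node of a labeled ordered tree: a label plus an ordered child list."""
    label: str
    children: List["Node"] = field(default_factory=list)

# Training data T = {<x_1, y_1>, ..., <x_L, y_L>}: trees paired with +1/-1 labels.
train = [
    (Node("d", [Node("a"), Node("c")]), +1),
    (Node("a", [Node("c", [Node("d")])]), -1),
]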

Page 7: A Boosting Algorithm for Classification of  Semi-Structured Text

Labeled ordered tree, subtree

Labeled ordered tree (or simply tree):
- labeled: each node is associated with a label
- ordered: siblings are ordered

A subtree:
- preserves the parent-daughter relation
- preserves the sibling relation
- preserves the labels

If B is a subtree of A, then A is a supertree of B.

[Figure: an example tree A and one of its subtrees B.]
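The subtree relation can be checked with a small recursive matcher. This is a sketch under our reading of the definition above (parent-daughter edges must be direct, sibling order is preserved as a subsequence, labels must match); contains and node_match are illustrative names, and Node is the class from the earlier sketch.

def node_match(u: "Node", v: "Node") -> bool:
    """Can pattern node u be mapped onto target node v?"""
    if u.label != v.label:
        return False
    def seq(i: int, j: int) -> bool:
        # map u.children[i:] onto an order-preserving subsequence of v.children[j:]
        if i == len(u.children):
            return True
        if j == len(v.children):
            return False
        return (node_match(u.children[i], v.children[j]) and seq(i + 1, j + 1)) \
            or seq(i, j + 1)
    return seq(0, 0)

def nodes(v: "Node"):
    """All nodes of a tree, in preorder."""
    yield v
    for c in v.children:
        yield from nodes(c)

def contains(t: "Node", x: "Node") -> bool:
    """True iff t is a subtree of x (written t ⊆ x below)."""
    return any(node_match(t, v) for v in nodes(x))

For example, contains(Node("a", [Node("c")]), Node("d", [Node("a", [Node("b"), Node("c")])])) is True: the intervening sibling b may be skipped, since only the relative order of siblings has to be preserved.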

Page 8: A Boosting Algorithm for Classification of  Semi-Structured Text

Decision stumps for trees

A simple rule-based classifier:

  h_{\langle t, y \rangle}(x) = \begin{cases} y & \text{if } t \subseteq x \\ -y & \text{otherwise} \end{cases}

⟨t, y⟩ is a parameter (rule) of the decision stump.

[Example: for a tree x that contains t₁ but not t₂, the rule ⟨t₁, +1⟩ gives h_{⟨t₁,+1⟩}(x) = +1, and the rule ⟨t₂, −1⟩ also gives h_{⟨t₂,−1⟩}(x) = +1.]
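As a code sketch (reusing Node and contains from the sketches above; the function name stump is ours):

def stump(t: "Node", y: int):
    """The decision stump h_<t,y>: returns y if t ⊆ x, and -y otherwise."""
    def h(x: "Node") -> int:
        return y if contains(t, x) else -y
    return h

# h1 = stump(Node("a", [Node("c")]), +1)   # a rule <t1, +1>
# h1(x) == +1 for every tree x that contains t1.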

Page 9: A Boosting Algorithm for Classification of  Semi-Structured Text

Decision stumps for trees, cont.

Training: select the optimal rule that maximizes the gain (or accuracy):

  \langle \hat{t}, \hat{y} \rangle = \operatorname*{argmax}_{t \in F,\, y \in \{+1, -1\}} \mathrm{gain}(\langle t, y \rangle), \qquad \mathrm{gain}(\langle t, y \rangle) = \sum_{i=1}^{L} y_i \, h_{\langle t, y \rangle}(x_i)

where T = \{\langle x_1, y_1 \rangle, \ldots, \langle x_L, y_L \rangle\} is the training data and

  F = \bigcup_{i=1}^{L} \{ t' \mid t' \subseteq x_i \}

is the feature set (the set of all subtrees in the training data).
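A naive version of this training step, assuming the feature set F is handed to us as an explicit list of candidate trees (efficiently enumerating F is the topic of the "Efficient Computation" part below); find_best_rule is our own name:

def find_best_rule(F, train):
    """Return the rule <t, y> in F x {+1, -1} with the maximum gain."""
    best_t, best_y, best_gain = None, 0, float("-inf")
    for t in F:
        # gain of <t, +1>; the gain of <t, -1> is exactly its negation
        g = sum(yi * (1 if contains(t, xi) else -1) for xi, yi in train)
        for y, gy in ((+1, g), (-1, -g)):
            if gy > best_gain:
                best_t, best_y, best_gain = t, y, gy
    return best_t, best_y, best_gain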

Page 10: A Boosting Algorithm for Classification of  Semi-Structured Text

Decision stumps for trees, cont.

Select the optimal rule, i.e., the one that yields the maximum gain.

[Table: each candidate rule ⟨t, y⟩ applied to the training trees; e.g., the single-node rules ⟨a, +1⟩ and ⟨a, −1⟩ both have gain 0, larger subtree rules reach gains of 2 and 4, and the rule with the maximum gain (4) is selected.]

Page 11: A Boosting Algorithm for Classification of  Semi-Structured Text

Boosting

Decision stumps are too weak. Boosting [Schapire97]:

1. Build a weak learner (a decision stump) H_j
2. Re-weight instances with respect to error rates
3. Repeat steps 1–2 K times
4. Output a linear combination of H_1, …, H_K

Redefine the gain to use Boosting:

  \mathrm{gain}(\langle t, y \rangle) = \sum_{i=1}^{L} y_i \, d_i \, h_{\langle t, y \rangle}(x_i), \qquad d_i \geq 0, \quad \sum_{i=1}^{L} d_i = 1

where d_i is the weight of the i-th training instance.
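A minimal AdaBoost-style loop around weighted stump selection, as a sketch: the exact re-weighting and stopping scheme of the paper may differ, and contains is the helper from the earlier sketch.

import math

def boost(F, train, K):
    """Learn K stumps; returns a list of (t, y, alpha) triples."""
    L = len(train)
    d = [1.0 / L] * L                        # instance weights d_i, sum to 1
    model = []
    for _ in range(K):
        # 1. pick the rule maximizing the weighted gain sum_i y_i d_i h(x_i)
        best_t, best_y, best_gain = None, 0, float("-inf")
        for t in F:
            g = sum(yi * di * (1 if contains(t, xi) else -1)
                    for (xi, yi), di in zip(train, d))
            for y, gy in ((+1, g), (-1, -g)):
                if gy > best_gain:
                    best_t, best_y, best_gain = t, y, gy
        # 2. re-weight instances with respect to the error rate
        eps = (1.0 - best_gain) / 2.0        # weighted error, since gain = 1 - 2*eps
        eps = min(max(eps, 1e-12), 1.0 - 1e-12)
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        model.append((best_t, best_y, alpha))
        for i, (xi, yi) in enumerate(train):
            h = best_y if contains(best_t, xi) else -best_y
            d[i] *= math.exp(-alpha * yi * h)  # misclassified gain weight
        z = sum(d)
        d = [di / z for di in d]             # renormalize so sum_i d_i = 1
    return model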

Page 12: A Boosting Algorithm for Classification of  Semi-Structured Text


Efficient Computation

Page 13: A Boosting Algorithm for Classification of  Semi-Structured Text

How to find the optimal rule?

  \langle \hat{t}, \hat{y} \rangle = \operatorname*{argmax}_{t \in F,\, y \in \{+1, -1\}} \sum_{i=1}^{L} y_i \, d_i \, h_{\langle t, y \rangle}(x_i), \qquad F = \bigcup_{i=1}^{L} \{ t' \mid t' \subseteq x_i \}

F is too huge to be enumerated explicitly; we need to find the optimal rule efficiently.

Our method, a variant of branch-and-bound:
- define a search space in which the whole set of subtrees is enumerated
- find the optimal rule by traversing this search space
- prune the search space with a proposed criterion

Page 14: A Boosting Algorithm for Classification of  Semi-Structured Text

Rightmost extension [Asai02, Zaki02]

Extend a given tree of size (n−1) by adding a new node, obtaining trees of size n:
- the new node is added to a node on the rightmost path
- the new node is added as the rightmost sibling

[Figure: a tree over the label set L = {a, b, c} whose nodes are numbered 1–6 in preorder; a new node 7 is attached, with each label in L, under each node on the rightmost path.]
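A compact implementation uses the standard preorder (depth, label) encoding from the FREQT family of miners; the encoding and the function name rightmost_extensions are our illustration:

def rightmost_extensions(tree, labels):
    """All size-n trees obtained from a size-(n-1) tree.

    tree: list of (depth, label) pairs in preorder; the root has depth 0.
    A new node at depth d becomes the rightmost child of the node at
    depth d-1 on the rightmost path.
    """
    if not tree:                       # empty tree: all single-node trees
        return [[(0, lab)] for lab in labels]
    last_depth = tree[-1][0]           # depth of the rightmost leaf
    return [tree + [(d, lab)]
            for d in range(1, last_depth + 2)
            for lab in labels]

# Example with L = {a, b, c}: the chain a-b, encoded [(0, "a"), (1, "b")],
# extends to six trees: the new node goes at depth 1 (rightmost sibling
# of b) or depth 2 (child of b), with each of the three labels.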

Page 15: A Boosting Algorithm for Classification of  Semi-Structured Text

Rightmost extension, cont.

Recursive application of rightmost extensions creates a search space in which every labeled ordered tree is generated exactly once.

Page 16: A Boosting Algorithm for Classification of  Semi-Structured Text

Pruning

For all t' ⊇ t and y ∈ {+1, −1}, propose an upper bound μ(t) such that

  \mathrm{gain}(\langle t', y \rangle) \leq \mu(t)

The node t can then be pruned if μ(t) < τ, where τ is a suboptimal gain (the best gain found so far).

[Figure: a search space of subtrees annotated with gains and bounds.]

Pruning strategy: μ(t) = 0.4 implies that the gain of any supertree of t is no greater than 0.4, so if a rule with gain 0.5 has already been found, the whole branch below t can be pruned.
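Putting the pieces together, here is a sketch of the branch-and-bound search. The bound mu implements the formula given on the next slide; decode converts the (depth, label) encoding back to Node trees; all names are ours, and the authors' released implementation (BACT, linked on the last slide) is the real, optimized version.

def decode(enc):
    """(depth, label) preorder encoding -> Node tree."""
    root = Node(enc[0][1])
    stack = [root]                         # stack[d] = current node at depth d
    for depth, lab in enc[1:]:
        node = Node(lab)
        stack[depth - 1].children.append(node)
        del stack[depth:]
        stack.append(node)
    return root

def mu(t, train, d):
    """Upper bound on gain(<t', y>) over all supertrees t' of t (next slide)."""
    pos = sum(di for (xi, yi), di in zip(train, d) if yi == +1 and contains(t, xi))
    neg = sum(di for (xi, yi), di in zip(train, d) if yi == -1 and contains(t, xi))
    total = sum(yi * di for (_, yi), di in zip(train, d))
    return max(2.0 * pos - total, 2.0 * neg + total)

def best_rule(train, d, labels):
    """Branch-and-bound over the rightmost-extension search space.

    labels: the set of node labels occurring in the training trees.
    """
    best = [None, 0, float("-inf")]        # t, y, tau (best gain so far)
    def visit(enc):
        t = decode(enc)
        if not any(contains(t, xi) for xi, _ in train):
            return                         # t occurs nowhere; nor can a supertree
        g = sum(yi * di * (1 if contains(t, xi) else -1)
                for (xi, yi), di in zip(train, d))
        for y, gy in ((+1, g), (-1, -g)):
            if gy > best[2]:
                best[:] = [t, y, gy]
        if mu(t, train, d) <= best[2]:
            return                         # prune: no supertree can beat tau
        for ext in rightmost_extensions(enc, labels):
            visit(ext)
    for start in rightmost_extensions([], labels):
        visit(start)
    return tuple(best)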

Page 17: A Boosting Algorithm for Classification of  Semi-Structured Text

Upper bound of the gain (an extension of [Morishita 02])

  \mu(t) = \max\left( 2 \sum_{\{i \,\mid\, y_i = +1,\; t \subseteq x_i\}} d_i - \sum_{i=1}^{L} y_i d_i,\;\; 2 \sum_{\{i \,\mid\, y_i = -1,\; t \subseteq x_i\}} d_i + \sum_{i=1}^{L} y_i d_i \right)

where

  \mathrm{gain}(\langle t', y \rangle) \leq \mu(t) \quad \text{for all } t' \supseteq t,\; y \in \{+1, -1\}

Page 18: A Boosting Algorithm for Classification of  Semi-Structured Text


Relation to SVMs with Tree Kernel

Page 19: A Boosting Algorithm for Classification of  Semi-Structured Text

Classification algorithm

  f(x) = \operatorname{sgn}\left( \sum_{k=1}^{K} h_{\langle t_k, y_k \rangle}(x) \right)
       = \operatorname{sgn}\left( \sum_{k=1}^{K} y_k \,(2 I(t_k \subseteq x) - 1) \right)
       = \operatorname{sgn}\left( \sum_{t \in F} w_t \, I(t \subseteq x) + b \right)

since h_{\langle t, y \rangle}(x) = y \,(2 I(t \subseteq x) - 1), where I(\cdot) is the indicator function and

  w_t = 2 \sum_{\{k \,\mid\, t_k = t\}} y_k, \qquad b = -\sum_{k=1}^{K} y_k

Modeled as a linear classifier:
- w_t: the weight of tree t
- b: the bias (the default class label)
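A sketch of evaluating this linear form directly from the learned stumps; we carry the AdaBoost weights alpha from the boosting sketch (for unweighted stumps exactly as written above, take alpha = 1 for every rule):

def classify(model, x):
    """f(x) = sgn( sum_t w_t I(t ⊆ x) + b ), folded over the K stumps.

    Each stump contributes alpha * y * (2*I(t ⊆ x) - 1): a feature weight
    w_t = 2 * sum(alpha_k * y_k for t_k == t) and a bias share of -alpha * y.
    """
    score = -sum(alpha * y for _, y, alpha in model)   # the bias b
    for t, y, alpha in model:
        if contains(t, x):
            score += 2.0 * alpha * y
    return +1 if score >= 0 else -1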

Page 20: A Boosting Algorithm for Classification of  Semi-Structured Text

SVMs and Tree Kernel [Collins 02]

All subtrees are expanded into a feature vector:

  \Phi(x) = ( I_{t_1}(x), \ldots, I_{t_J}(x) )

[Figure: a tree mapped to a binary vector {0, …, 1, …, 1, …, 0, …}, one coordinate per subtree.]

Tree kernel: all subtrees are expanded implicitly.

  SVM:      f(x) = \operatorname{sgn}( \mathbf{w} \cdot \Phi(x) + b ) = \operatorname{sgn}\left( \sum_{t} w_t \, I(t \subseteq x) + b \right)
  Boosting: f(x) = \operatorname{sgn}\left( \sum_{t} w_t \, I(t \subseteq x) + b \right)

The feature spaces are essentially the same; the learning strategies are different.

Page 21: A Boosting Algorithm for Classification of  Semi-Structured Text

SVM vs. Boosting [Rätsch 01]

Both are known as large-margin classifiers of the form f(x) = sgn( Σ_t w_t I(t ⊆ x) + b ), but the metric of the margin is different:

  \text{margin} = \min_i \; y_i (\mathbf{w} \cdot \Phi(x_i)) \,/\, \|\mathbf{w}\|_p

Boosting: L1-norm margin (p = 1)
- w is expressed in a small number of features
- sparse solution in the feature space

SVM: L2-norm margin (p = 2)
- w is expressed in a small number of examples (the support vectors)
- sparse solution in the example space

Page 22: A Boosting Algorithm for Classification of  Semi-Structured Text

SVM vs. Boosting, cont.

Accuracy is task-dependent. Practical advantages of Boosting:

Good interpretability:
- we can analyze how the model performs, and what kinds of features are useful
- compact features (rules) are easy to deal with

Fast classification:
- complexity depends on the small number of rules
- kernel methods are too heavy

Page 23: A Boosting Algorithm for Classification of  Semi-Structured Text


Experiments

Page 24: A Boosting Algorithm for Classification of  Semi-Structured Text

Sentence classification

PHS: cell-phone review classification (5,741 sentences)
- domain: a Web-based BBS on PHS, a sort of cell phone
- categories: positive review or negative review
  - positive: "It is useful that we can know the date and time of E-Mails."
  - negative: "I feel that the response is not so good."

MOD: modality identification (1,710 sentences)
- domain: editorial news articles
- categories: assertion, opinion, or description
  - assertion: "We should not hold an optimistic view of the success of POKEMON."
  - opinion: "I think that now is the best time for developing the blue print."
  - description: "Social function of education has been changing."

Page 25: A Boosting Algorithm for Classification of  Semi-Structured Text

Sentence representations

N-gram tree:
- each word simply modifies the next word
- a subtree is an N-gram (N is unrestricted)

Dependency tree:
- word-based dependency tree
- a Japanese dependency parser, CaboCha, is used

Bag-of-words (baseline)

[Figure: the sentence "response is very good" drawn both as an N-gram chain and as a dependency tree.]
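In the (depth, label) encoding of the earlier sketches, the N-gram tree is just a chain, so its connected subtree chains are exactly the N-grams; ngram_tree is our illustrative name:

def ngram_tree(words):
    """Each word modifies the next word, so the last word is the root and
    the tree is a single chain, encoded as (depth, label) pairs in preorder."""
    return [(depth, w) for depth, w in enumerate(reversed(words))]

# ngram_tree(["response", "is", "very", "good"])
#   -> [(0, 'good'), (1, 'very'), (2, 'is'), (3, 'response')]
# Every connected chain of this tree is an N-gram ("very good",
# "is very good", ...), with no fixed upper bound on N.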

Page 26: A Boosting Algorithm for Classification of  Semi-Structured Text

Results

                        PHS     MOD
                                opinion   assertion   description
  Boosting    bow       76.0    59.6      70.0        82.2
              dep       78.7    78.7      86.7        91.7
              n-gram    79.3    76.7      87.2        91.6
  SVM + Tree Kernel
              dep       77.0    24.2      81.7        87.6
              n-gram    78.9    57.5      84.1        90.1

- subtree features outperform the baseline (bow)
- dep vs. n-gram: comparable (no significant difference)
- SVMs show worse performance depending on the task (overfitting)

Page 27: A Boosting Algorithm for Classification of  Semi-Structured Text

Interpretability

PHS dataset with dependency. Because f(x) = sgn( Σ_{t∈F} w_t I(t ⊆ x) + b ), the learned rules and their weights can be inspected directly.

A: subtrees that include "hard, difficult"
   0.0004   be hard to hang up
  -0.0006   be hard to read
  -0.0007   be hard to use
  -0.0017   be hard to …

B: subtrees that include "use"
   0.0027   want to use
   0.0002   use
   0.0002   be in use
   0.0001   be easy to use
  -0.0001   was easy to use
  -0.0007   be hard to use
  -0.0019   is easier to use than …

C: subtrees that include "recharge"
   0.0028   recharging time is short
  -0.0041   recharging time is long

Page 28: A Boosting Algorithm for Classification of  Semi-Structured Text

Interpretability, cont.

PHS dataset with dependency.

Input: "The LCD is large, beautiful and easy to see"

  weight w    subtree t
   0.00368    be easy to
   0.00353    beautiful
   0.00237    be easy to see
   0.00174    is large
   0.00107    The LCD is large
   0.00074    The LCD is …
   0.00057    The LCD
   0.00036    see
  -0.00001    large

Page 29: A Boosting Algorithm for Classification of  Semi-Structured Text

Advantages

Compact feature set:
- Boosting extracts only 1,783 unique features
- the numbers of distinct 1-grams, 2-grams, and 3-grams are 4,211, 24,206, and 43,658 respectively
- SVMs implicitly use a huge number of features

Fast classification:
- Boosting: 0.531 sec. / 5,741 instances
- SVM: 255.42 sec. / 5,741 instances
- Boosting is about 480 times faster than SVMs

Page 30: A Boosting Algorithm for Classification of  Semi-Structured Text

Conclusions

- We assume that text is represented as a tree
- An extension of decision stumps: all subtrees are potentially used as features
- Boosting
- Branch-and-bound enables us to find the optimal rule efficiently
- Advantages: good interpretability, fast classification, and accuracy comparable to SVMs with tree kernels

Page 31: A Boosting Algorithm for Classification of  Semi-Structured Text

Future work

Other applications:
- information extraction
- semantic-role labeling
- parse tree re-ranking

Confidence-rated predictions for decision stumps

Page 32: A Boosting Algorithm for Classification of  Semi-Structured Text


Thank you!

An implementation of our method is available as open-source software at:

http://chasen.naist.jp/~taku/software/bact/