
Machine Learning

Reading: Chapter 18

2

Text Classification

Is text_i a finance news article?

Positive Negative

3

20 attributes

Investors 2, Dow 2, Jones 2, Industrial 1, Average 3, Percent 5, Gain 6, Trading 8, Broader 5, stock 5, Indicators 6, Standard 2, Rolling 1, Nasdaq 3, Early 10, Rest 12, More 13, first 11, Same 12, The 30

4

20 attributes

Men’s Basketball, Championship, UConn, Huskies, Georgia Tech, Women, Playing, Crown, Titles, Games, Rebounds, All-America, early, rolling, Celebrates, Rest, More, First, The, same

Example  stock  rolling  the  class
1        0      3        40   other
2        6      8        35   finance
3        7      7        25   other
4        5      7        14   other
5        8      2        20   finance
6        9      4        25   finance
7        5      6        20   finance
8        0      2        35   other
9        0     11        25   finance
10       0     15        28   other

6

Constructing the Decision Tree

Goal: Find the smallest decision tree consistent with the examples

Find the attribute that best splits the examples
Form a tree with root = best attribute
For each value v_i (or range) of the best attribute:

Select those examples with best = v_i

Construct subtree_i by recursively calling decision tree learning with that subset of examples and all attributes except best

Add a branch to the tree with label = v_i and subtree = subtree_i
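In Python, this recursive procedure might look roughly like the sketch below (not the chapter's code; examples are assumed to be dicts mapping attribute names to discrete values plus a "class" key, and the helper names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the class labels of a set of examples."""
    counts = Counter(labels)
    return -sum(c / len(labels) * math.log2(c / len(labels)) for c in counts.values())

def information_gain(examples, attr):
    """Expected reduction in entropy from splitting on attr."""
    before = entropy([e["class"] for e in examples])
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == v]
        remainder += len(subset) / len(examples) * entropy([e["class"] for e in subset])
    return before - remainder

def plurality_value(examples):
    """Most common class; used when no examples or no attributes remain."""
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def decision_tree_learning(examples, attributes, parent_examples=None):
    if not examples:
        return plurality_value(parent_examples)
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                      # all remaining examples agree
        return classes.pop()
    if not attributes:                         # no attributes left to split on
        return plurality_value(examples)

    # Find the attribute that best splits the examples.
    best = max(attributes, key=lambda a: information_gain(examples, a))
    tree = {best: {}}                          # root = best attribute

    # For each value v_i of the best attribute...
    for vi in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == vi]   # examples with best = v_i
        # Recurse on the subset, using all attributes except best.
        tree[best][vi] = decision_tree_learning(
            subset, [a for a in attributes if a != best], examples)
    return tree

# Example call (attribute values must be discrete, e.g. word-count ranges):
# tree = decision_tree_learning(examples, ["stock", "rolling", "the"])
```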

7

Choosing the Best Attribute: Binary Classification

Want a formal measure that returns a maximum value when an attribute makes a perfect split and a minimum value when it makes no distinction

Information theory (Shannon and Weaver, 1949)

Entropy: a measure that characterizes the impurity of a collection of examples

Information gain: the expected reduction in entropy caused by partitioning the examples according to this attribute

8

Formula for Entropy

H(P(v1), …, P(vn)) = Σ(i=1..n) -P(vi) log2 P(vi)

where P(vi) = probability of value vi

Examples:

Suppose we have a collection of 10 examples, 5 positive, 5 negative:
H(1/2, 1/2) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit

Suppose we have a collection of 100 examples, 1 positive and 99 negative:
H(1/100, 99/100) = -.01 log2(.01) - .99 log2(.99) = .08 bits
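A quick numerical check of these two examples (a minimal sketch; H takes the class probabilities directly):

```python
import math

def H(*probs):
    """Entropy in bits of a distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(H(1/2, 1/2))        # 1.0 bit   (5 positive, 5 negative)
print(H(1/100, 99/100))   # ~0.08 bits (1 positive, 99 negative)
```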

9

Choosing the Best Attribute: Information Gain

Information gain (from attribute test) = difference between the original information requirement and new requirement

Gain(A) = H(p/(p+n), n/(p+n)) - Remainder(A)

H = entropy
Highest when the set is equally divided between positive (p) and negative (n) examples (.5, .5): value of 1
Lower as the set becomes more unbalanced (e.g., (.9, .1))

Information based on the attributes = Remainder(A)

Remainder(A) = Σi (pi + ni)/(p + n) · H(pi/(pi+ni), ni/(pi+ni)), summed over the values (or ranges) of A

Here p = n (5 finance, 5 other out of 10 examples), so H(1/2, 1/2) = 1 bit
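As an illustration, the sketch below evaluates this for the "stock" attribute of the 10-example table above, using the value ranges (<5, 5-10, >10) that appear in the split on a later slide. The two-argument H(p, n) over counts and the helper names are conveniences for the sketch, not the chapter's notation:

```python
import math

def H(p, n):
    """Entropy of a set with p positive and n negative examples."""
    probs = (p / (p + n), n / (p + n))
    return -sum(x * math.log2(x) for x in probs if x > 0)

def remainder(subsets):
    """Remainder(A): size-weighted entropy of the subsets produced by attribute A.
    subsets is a list of (p_i, n_i) counts, one per value (or range) of A."""
    p = sum(pi for pi, ni in subsets)
    n = sum(ni for pi, ni in subsets)
    return sum((pi + ni) / (p + n) * H(pi, ni) for pi, ni in subsets if pi + ni > 0)

def gain(subsets):
    """Gain(A) = H(p/(p+n), n/(p+n)) - Remainder(A)."""
    p = sum(pi for pi, ni in subsets)
    n = sum(ni for pi, ni in subsets)
    return H(p, n) - remainder(subsets)

# (finance, other) counts when splitting the 10 examples on 'stock':
#   stock < 5   -> examples 1, 8, 9, 10      : 1 finance, 3 other
#   stock 5-10  -> examples 2, 3, 4, 5, 6, 7 : 4 finance, 2 other
#   stock > 10  -> no examples
print(gain([(1, 3), (4, 2)]))   # ~0.12 bits of information gained
```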

11

Text Classification

Is text_i a finance news article?

Positive Negative

Example  stock  rolling  the  class
1        0      3        40   other
2        6      8        35   finance
3        7      7        25   other
4        5      7        14   other
5        8      2        20   finance
6        9      4        25   finance
7        5      6        20   finance
8        0      2        35   other
9        0     11        25   finance
10       0     15        28   other

Candidate splits (example numbers reaching each branch):

stock:    <5 → {1, 8, 9, 10}   5-10 → {2, 3, 4, 5, 6, 7}   >10 → { }
rolling:  <5 → {1, 5, 6, 8}    5-10 → {2, 3, 4, 7}          >10 → {9, 10}

14

The algorithm as specified so far is designed for binary classification and attributes with discrete values

Attributes:
Outlook: sunny, overcast, rain
Temperature: hot, mild, cool
Humidity: normal, high
Wind: weak, strong
Classification:
PlayTennis?: Yes, No

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

[Figure: candidate root attributes for the 14 training examples. The full set S has E = .940 (9/14 yes). Humidity splits into High (E = .985) and Normal (E = .592); Wind into Weak (E = .811) and Strong (E = 1.0); Outlook into Sunny, Overcast, and Rain; Temperature into Hot, Mild, and Cool.]

Gain(S, Humidity) = .940 - (7/14)(.985) - (7/14)(.592) = .151
Gain(S, Wind) = .940 - (8/14)(.811) - (6/14)(1.0) = .048
Gain(S, Outlook) = .940 - (4/14)(0) - (5/14)(.971) - (5/14)(.971) = .246
Gain(S, Temperature) = .029

Outlook is selected because it has the highest gain
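These numbers can be reproduced directly from the (Yes, No) counts in the table above (a minimal sketch; the counts per attribute value are read off the 14 days):

```python
import math

def H(p, n):
    """Entropy of a subset with p 'Yes' and n 'No' examples (0 for a pure subset)."""
    total = p + n
    return -sum(x / total * math.log2(x / total) for x in (p, n) if x > 0)

def gain(subsets):
    """Gain(S, A) = H(S) - sum over values of (p_i + n_i)/(p + n) * H(subset_i)."""
    p = sum(pi for pi, ni in subsets)
    n = sum(ni for pi, ni in subsets)
    return H(p, n) - sum((pi + ni) / (p + n) * H(pi, ni) for pi, ni in subsets)

print(gain([(2, 3), (4, 0), (3, 2)]))   # Outlook (Sunny, Overcast, Rain): ~0.246
print(gain([(3, 4), (6, 1)]))           # Humidity (High, Normal):         ~0.151
print(gain([(6, 2), (3, 3)]))           # Wind (Weak, Strong):             ~0.048
print(gain([(2, 2), (4, 2), (3, 1)]))   # Temperature (Hot, Mild, Cool):   ~0.029
```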


19

Extending the algorithm for continuous-valued attributes

Dynamically define new discrete-valued attributes that partition the continuous attribute into a discrete set of intervals

For continuous A, create Ac that is true if A<c, false otherwise

How to select the best value for threshold c?

Sort examples by the continuous attribute
Identify adjacent examples that differ in target classification
Generate a set of candidate thresholds midway between the corresponding values of A
Choose the threshold c that maximizes information gain

20

Example: temperature as continuous value

Temp         40   48   60   72   80   90
PlayTennis?  No   No   Yes  Yes  Yes  No

Two candidate thresholds: (48+60)/2 = 54 and (80+90)/2 = 85

Information gain greater for Temperature>54 than for Temperature>85
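A sketch of this threshold search on the temperature example (it follows the candidate-generation steps from the previous slide; the function names are illustrative):

```python
import math
from collections import Counter

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

def entropy(subset):
    """Entropy (in bits) of a list of class labels."""
    if not subset:
        return 0.0
    counts = Counter(subset)
    return -sum(c / len(subset) * math.log2(c / len(subset)) for c in counts.values())

def gain_for_threshold(c):
    """Information gain of the boolean test Temperature < c."""
    below = [l for t, l in zip(temps, labels) if t < c]
    above = [l for t, l in zip(temps, labels) if t >= c]
    remainder = (len(below) / len(labels) * entropy(below)
                 + len(above) / len(labels) * entropy(above))
    return entropy(labels) - remainder

# Candidate thresholds: midpoints between adjacent examples whose class changes.
pairs = sorted(zip(temps, labels))
candidates = [(t1 + t2) / 2
              for (t1, l1), (t2, l2) in zip(pairs, pairs[1:]) if l1 != l2]
print(candidates)                               # [54.0, 85.0]
print(max(candidates, key=gain_for_threshold))  # 54.0: splitting at 54 gains more
```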

21

Other Cases

What if the class is discrete-valued, not binary?

What if an attribute has many values (e.g., 1 per instance)?

22

Training vs. Testing

A learning algorithm is good if it uses its learned hypothesis to make accurate predictions on unseen data

Collect a large set of examples (with classifications)

Divide into two disjoint sets: the training set and the test set

Apply the learning algorithm to the training set, generating hypothesis h

Measure the percentage of examples in the test set that are correctly classified by h

Repeat for different sizes of training sets and different randomly selected training sets of each size.
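With scikit-learn, this train-and-test methodology might look like the sketch below (illustrative only: the Iris dataset stands in for a collected set of labeled examples, and any learner with fit/score would slot in the same way):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)         # stand-in for a large set of labeled examples

for train_size in (0.2, 0.4, 0.6, 0.8):   # repeat for different training-set sizes
    # Divide into two disjoint sets: the training set and the test set.
    # Varying random_state gives different randomly selected training sets of each size.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_size, random_state=0)
    h = DecisionTreeClassifier().fit(X_train, y_train)   # hypothesis h
    # Fraction of test examples correctly classified by h.
    print(train_size, h.score(X_test, y_test))
```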

23

24

Overfitting

Learning algorithms may use irrelevant attributes to make decisions
For news: day published and newspaper
When else can overfitting occur?

Solution #1: Decision tree pruning
Prune away attributes with low information gain
Use statistical significance to test whether the gain is meaningful

25

K-fold Cross Validation

Solution #2: To reduce overfitting
Run k experiments
Use a different 1/k of the data for testing each time
Average the results

5-fold, 10-fold, leave-one-out

26

Cross-Validation

[Figure: 10-fold cross-validation. The labeled data (1566 examples) is split into 10 folds. Each round trains the model on 9 folds (approx. 1409 examples) and evaluates on the remaining fold (approx. 157 examples). Lather, rinse, repeat 10 times; report the average.]
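A minimal sketch of the same 10-fold procedure with scikit-learn (illustrative; cross_val_score does the fold splitting and the per-fold train/evaluate loop, and the mean of its scores is the reported average):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # stand-in for the labeled data

# k = 10: each experiment tests on a different 1/10 of the data
# and trains on the remaining 9/10.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())                # report the average accuracy
```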

27

Example

28

Ensemble Learning

Learn from a collection of hypotheses

Majority voting

Enlarges the hypothesis space
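A minimal sketch of the majority-voting idea (illustrative; each hypothesis here is just a callable from an input to a class label):

```python
from collections import Counter

def majority_vote(hypotheses, x):
    """Classify x with the most common prediction among the hypotheses."""
    votes = Counter(h(x) for h in hypotheses)
    return votes.most_common(1)[0][0]

# Three toy hypotheses voting on a single numeric feature:
hs = [lambda x: "finance" if x > 3 else "other",
      lambda x: "finance" if x > 5 else "other",
      lambda x: "other"]
print(majority_vote(hs, 7))   # "finance": two of the three hypotheses vote for it
```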

29

Boosting

Uses a weighted training set
Each example has an associated weight wj ≥ 0
Higher-weighted examples have higher importance

Initially, wj = 1 for all examples
Next round: increase the weights of misclassified examples, decrease the other weights
From the new weighted set, generate hypothesis h2
Continue until M hypotheses have been generated

Final ensemble hypothesis = weighted-majority combination of all M hypotheses
Weight each hypothesis according to how well it did on the training data
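A compact sketch of this weighting scheme in the style of AdaBoost, with decision stumps as the weak learner (an illustration under these assumptions, not the chapter's exact pseudocode; scikit-learn's sample_weight argument stands in for "learning from a weighted set"):

```python
import math
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, M=10):
    """Train up to M weak hypotheses on successively reweighted data."""
    y = np.asarray(y)
    w = np.ones(len(y)) / len(y)                 # initially, equal weight for every example
    hypotheses, z = [], []
    for _ in range(M):
        h = DecisionTreeClassifier(max_depth=1)  # a decision stump as the weak learner
        h.fit(X, y, sample_weight=w)             # generate a hypothesis from the weighted set
        miss = h.predict(X) != y
        err = w[miss].sum() / w.sum()            # weighted error on the training data
        if err >= 0.5:                           # weak learner must beat random; otherwise stop
            break
        hypotheses.append(h)
        z.append(math.log((1 - err) / err) if err > 0 else 10.0)  # weight by how well it did
        if err == 0:                             # already perfect on the weighted set
            break
        w[miss] *= (1 - err) / err               # increase weights of misclassified examples,
        w /= w.sum()                             #   then normalize (so the others decrease)
    return hypotheses, z

def weighted_majority(hypotheses, z, x):
    """Final ensemble hypothesis: weighted-majority vote of the M hypotheses."""
    votes = {}
    for h, zi in zip(hypotheses, z):
        label = h.predict([x])[0]
        votes[label] = votes.get(label, 0.0) + zi
    return max(votes, key=votes.get)
```

Multiplying the misclassified weights by (1 - err)/err and renormalizing is equivalent to shrinking the weights of correctly classified examples, which is how the reweighting step is usually stated.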

30

AdaBoost

If the input learning algorithm L is a weak learning algorithm, i.e., L always returns a hypothesis with weighted error on the training data slightly better than random,

then AdaBoost returns a hypothesis that classifies the training data perfectly for large enough M

That is, it boosts the accuracy of the original learning algorithm on the training data