
Page 1: Machine Learning

Machine Learning

Reading: Chapter 18

Page 2: Machine Learning

2

Text Classification

Is text_i a finance news article?

Positive Negative

Page 3: Machine Learning

3

20 attributes

Investors: 2, Dow: 2, Jones: 2, Industrial: 1, Average: 3, Percent: 5, Gain: 6, Trading: 8, Broader: 5, stock: 5, Indicators: 6, Standard: 2, Rolling: 1, Nasdaq: 3, Early: 10, Rest: 12, More: 13, first: 11, Same: 12, The: 30

Page 4: Machine Learning

4

20 attributes

Men’s Basketball Championship UConn Huskies Georgia Tech Women Playing Crown Titles Games Rebounds All-America early rolling Celebrates Rest More First The same

Page 5: Machine Learning

Example  stock  rolling  the  class
1        0      3        40   other
2        6      8        35   finance
3        7      7        25   other
4        5      7        14   other
5        8      2        20   finance
6        9      4        25   finance
7        5      6        20   finance
8        0      2        35   other
9        0      11       25   finance
10       0      15       28   other

Page 6: Machine Learning

6

Constructing the Decision Tree

Goal: Find the smallest decision tree consistent with the examples

Find the attribute that best splits the examples
Form a tree with root = best attribute
For each value v_i (or range) of the best attribute:

  Select those examples with best = v_i

  Construct subtree_i by recursively calling the decision-tree algorithm on that subset of examples, with all attributes except best

  Add a branch to the tree with label = v_i and subtree = subtree_i
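A minimal Python sketch of this recursive construction, assuming discrete attribute values, examples stored as dicts with a "class" key, and a caller-supplied choose_best_attribute function (e.g., one maximizing information gain); the names are mine, not from the slides:

from collections import Counter

def plurality_class(examples):
    # Most common class among the examples (used when no attributes remain).
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def build_tree(examples, attributes, choose_best_attribute):
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                  # all remaining examples agree on the class
        return classes.pop()
    if not attributes:                     # no attributes left: fall back to the majority class
        return plurality_class(examples)
    best = choose_best_attribute(examples, attributes)    # attribute that best splits the examples
    tree = {"attribute": best, "branches": {}}
    for v in {e[best] for e in examples}:                 # each value v of the best attribute
        subset = [e for e in examples if e[best] == v]    # select examples with best == v
        rest = [a for a in attributes if a != best]       # all attributes except best
        tree["branches"][v] = build_tree(subset, rest, choose_best_attribute)   # subtree for v
    return tree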

Page 7: Machine Learning

7

Choosing the Best Attribute: Binary Classification

Want a formal measure that returns a maximum value when the attribute makes a perfect split and a minimum value when it makes no distinction between the classes

Information theory (Shannon and Weaver, 1949)

Entropy: a measure that characterizes the impurity of a collection of examples

Information gain: the expected reduction in entropy caused by partitioning the examples according to this attribute

Page 8: Machine Learning

8

Formula for Entropy

H(P(v1), …, P(vn)) = ∑i=1..n -P(vi) log2 P(vi)

where P(v) = probability of v

Examples:

Suppose we have a collection of 10 examples, 5 positive, 5 negative:
H(1/2, 1/2) = -1/2 log2 1/2 - 1/2 log2 1/2 = 1 bit

Suppose we have a collection of 100 examples, 1 positive and 99 negative:
H(1/100, 99/100) = -.01 log2 .01 - .99 log2 .99 = .08 bits
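A small Python sketch of this entropy formula, reproducing the two worked examples above:

import math

def entropy(probs):
    # H(P(v1), ..., P(vn)) = sum over i of -P(vi) * log2 P(vi)
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))      # 1.0 bit (5 positive, 5 negative)
print(entropy([0.01, 0.99]))    # about 0.08 bits (1 positive, 99 negative)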

Page 9: Machine Learning

9

Choosing the Best Attribute: Information Gain

Information gain (from an attribute test) = the difference between the original information requirement and the new requirement after the split

Gain(A) = H(p/(p+n), n/(p+n)) - Remainder(A)

H = entropy. It is highest when the set is equally divided between positive (p) and negative (n) examples, i.e., (.5, .5), where it has value 1, and lower as the set becomes more unbalanced (e.g., (.9, .1)).

Page 10: Machine Learning

Information based on attributes = Remainder(A)

Remainder(A) = ∑i (pi + ni)/(p + n) · H(pi/(pi + ni), ni/(pi + ni)), summed over the values (or ranges) i of attribute A

With the 10 examples split evenly between finance and other (p = n = 5), H(1/2, 1/2) = 1 bit
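For illustration, a hedged Python sketch that computes Remainder and Gain for the 10-example table above, using the <5 / 5-10 / ≥10 value ranges shown later (Page 13); the example encoding and helper names are mine:

import math

def H(p, n):
    # Entropy of a set with p positive (finance) and n negative (other) examples.
    return sum(-x / (p + n) * math.log2(x / (p + n)) for x in (p, n) if x > 0)

# The 10 examples above as (stock, rolling, class).
examples = [(0, 3, "other"),   (6, 8, "finance"), (7, 7, "other"),   (5, 7, "other"),
            (8, 2, "finance"), (9, 4, "finance"), (5, 6, "finance"), (0, 2, "other"),
            (0, 11, "finance"),(0, 15, "other")]

def gain(attr, ranges):
    p = sum(1 for e in examples if e[2] == "finance")
    n = len(examples) - p
    remainder = 0.0
    for lo, hi in ranges:                                  # one branch per value range of the attribute
        subset = [e for e in examples if lo <= e[attr] < hi]
        if not subset:
            continue
        pi = sum(1 for e in subset if e[2] == "finance")
        ni = len(subset) - pi
        remainder += (pi + ni) / (p + n) * H(pi, ni)       # Remainder(A)
    return H(p, n) - remainder                             # Gain(A) = H(p/(p+n), n/(p+n)) - Remainder(A)

ranges = [(0, 5), (5, 10), (10, 100)]      # <5, 5-10, >=10
print(gain(0, ranges))   # gain from splitting on "stock"
print(gain(1, ranges))   # gain from splitting on "rolling"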

Page 11: Machine Learning

11

Text Classification

Is text_i a finance news article?

Positive Negative

Page 12: Machine Learning

Example  stock  rolling  the  class
1        0      3        40   other
2        6      8        35   finance
3        7      7        25   other
4        5      7        14   other
5        8      2        20   finance
6        9      4        25   finance
7        5      6        20   finance
8        0      2        35   other
9        0      11       25   finance
10       0      15       28   other

Page 13: Machine Learning

[Figure: candidate splits of the 10 examples on the attributes stock and rolling]

stock:   <5 → {1, 8, 9, 10};   5-10 → {2, 3, 4, 5, 6, 7}
rolling: <5 → {1, 5, 6, 8};    5-10 → {2, 3, 4, 7};   ≥10 → {9, 10}

Page 14: Machine Learning

14

The algorithm as specified so far is designed for binary classification and attributes with discrete values

Attributes:
  Outlook: sunny, overcast, rain
  Temperature: hot, mild, cool
  Humidity: normal, high
  Wind: weak, strong
Classification:
  PlayTennis?: Yes, No

Page 15: Machine Learning

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Page 16: Machine Learning

[Figure: candidate splits of the full training set S (E = .940, 9/14 yes) on Humidity, Wind, Outlook, and Temperature, with the entropy of each resulting subset]
  Humidity: High E=.985, Normal E=.592
  Wind: Weak E=.811, Strong E=1.0
  Outlook: Sunny E=.971, Overcast E=0, Rain E=.971
  Temperature: Hot, Mild, Cool

Gain(S,Humidity) = .940 - (7/14)(.985) - (7/14)(.592) = .151
Gain(S,Wind) = .940 - (8/14)(.811) - (6/14)(1.0) = .048
Gain(S,Outlook) = .940 - (5/14)(.971) - (4/14)(0) - (5/14)(.971) = .246

Gain(S,Outlook)=.246, Gain(S,Humidity)=.151, Gain(S,Wind)=.048, Gain(S,Temperature)=.029

Outlook is selected because it has the highest gain
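A hedged Python sketch that reproduces these gain values from the PlayTennis table (data as on Page 15; variable names are mine):

import math

def entropy(labels):
    counts = {l: labels.count(l) for l in set(labels)}
    return sum(-c / len(labels) * math.log2(c / len(labels)) for c in counts.values())

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for D1..D14
data = [("Sunny","Hot","High","Weak","No"),      ("Sunny","Hot","High","Strong","No"),
        ("Overcast","Hot","High","Weak","Yes"),  ("Rain","Mild","High","Weak","Yes"),
        ("Rain","Cool","Normal","Weak","Yes"),   ("Rain","Cool","Normal","Strong","No"),
        ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
        ("Sunny","Cool","Normal","Weak","Yes"),  ("Rain","Mild","Normal","Weak","Yes"),
        ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
        ("Overcast","Hot","Normal","Weak","Yes"),("Rain","Mild","High","Strong","No")]

def gain(col):
    labels = [row[-1] for row in data]
    remainder = 0.0
    for v in {row[col] for row in data}:                       # each value of the attribute
        subset = [row[-1] for row in data if row[col] == v]
        remainder += len(subset) / len(data) * entropy(subset)
    return entropy(labels) - remainder

for name, col in [("Outlook", 0), ("Temperature", 1), ("Humidity", 2), ("Wind", 3)]:
    print(name, round(gain(col), 3))   # .246, .029, .151, .048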

Page 17: Machine Learning

(The PlayTennis training examples from Page 15, repeated.)

Page 18: Machine Learning

(The PlayTennis training examples from Page 15, repeated.)

Page 19: Machine Learning

19

Extending the algorithm to continuous-valued attributes

Dynamically define new discrete-valued attributes that partition the continuous attribute into a discrete set of intervals

For a continuous attribute A, create a new Boolean attribute A_c that is true if A < c and false otherwise

How to select the best value for the threshold c?

  Sort the examples by the continuous attribute
  Identify adjacent examples that differ in their target classification
  Generate a set of candidate thresholds midway between the corresponding values of A
  Choose the threshold c that maximizes information gain

Page 20: Machine Learning

20

Example: temperature as continuous value

Temp         40   48   60   72   80   90
PlayTennis?  No   No   Yes  Yes  Yes  No

Two candidate thresholds: (48+60)/2 = 54 and (80+90)/2 = 85

Information gain is greater for Temperature > 54 than for Temperature > 85
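A Python sketch of this threshold search on the temperature example (a sketch of the procedure described above, not code from the slides):

import math

def entropy(labels):
    counts = {l: labels.count(l) for l in set(labels)}
    return sum(-c / len(labels) * math.log2(c / len(labels)) for c in counts.values())

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

# Candidate thresholds: midpoints between adjacent examples whose class differs.
candidates = [(a + b) / 2
              for (a, la), (b, lb) in zip(zip(temps, labels), zip(temps[1:], labels[1:]))
              if la != lb]

def gain(c):
    below = [l for t, l in zip(temps, labels) if t < c]
    above = [l for t, l in zip(temps, labels) if t >= c]
    remainder = (len(below) * entropy(below) + len(above) * entropy(above)) / len(temps)
    return entropy(labels) - remainder

for c in candidates:
    print(c, round(gain(c), 3))   # thresholds 54.0 and 85.0; 54.0 has the higher gain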

Page 21: Machine Learning

21

Other cases

  What if the class is discrete-valued, not binary?

  What if an attribute has many values (e.g., one per instance)?

Page 22: Machine Learning

22

Training vs. Testing

A learning algorithm is good if it uses its learned hypothesis to make accurate predictions on unseen data

Collect a large set of examples (with classifications)

Divide into two disjoint sets: the training set and the test set

Apply the learning algorithm to the training set, generating hypothesis h

Measure the percentage of examples in the test set that are correctly classified by h

Repeat for different sizes of training sets and different randomly selected training sets of each size.
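A minimal Python sketch of this train/test methodology (learn and predict are placeholders for, e.g., the decision-tree routines sketched earlier; random splitting is an assumption):

import random

def evaluate(examples, train_fraction, learn, predict, trials=10):
    # Split into disjoint training and test sets, learn h, measure test accuracy; average over trials.
    accuracies = []
    for _ in range(trials):
        shuffled = random.sample(examples, len(examples))   # a fresh random split each trial
        cut = int(train_fraction * len(shuffled))
        train, test = shuffled[:cut], shuffled[cut:]
        h = learn(train)                                    # hypothesis learned from the training set
        correct = sum(1 for e in test if predict(h, e) == e["class"])
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

Calling evaluate with several values of train_fraction gives the accuracy-versus-training-set-size behavior described above.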

Page 23: Machine Learning

23

Page 24: Machine Learning

24

Overfitting

Learning algorithms may use irrelevant attributes to make decisions
  For news articles: the day published and the newspaper

When else can overfitting occur?

Solution #1: Decision tree pruning
  Prune away attributes with low information gain
  Use statistical significance to test whether the gain is meaningful
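One standard way to implement such a significance test is a chi-squared check on each split; the sketch below is my own and only computes the statistic, under the assumption of a binary (positive/negative) classification:

# Chi-squared test that a split's gain is statistically meaningful (sketch).
# counts: list of (p_k, n_k) pairs, the positive/negative counts in each branch of the split.
def chi_squared_statistic(counts):
    p = sum(pk for pk, nk in counts)
    n = sum(nk for pk, nk in counts)
    delta = 0.0
    for pk, nk in counts:
        expected_p = p * (pk + nk) / (p + n)   # counts expected if the attribute were irrelevant
        expected_n = n * (pk + nk) / (p + n)
        if expected_p > 0:
            delta += (pk - expected_p) ** 2 / expected_p
        if expected_n > 0:
            delta += (nk - expected_n) ** 2 / expected_n
    return delta   # compare to a chi-squared threshold with (branches - 1) degrees of freedom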

Page 25: Machine Learning

25

K-fold Cross Validation

Solution #2 to reduce overfitting: k-fold cross-validation

Run k experiments, using a different 1/k of the data for testing each time
Average the results

Common choices: 5-fold, 10-fold, leave-one-out
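A minimal Python sketch of k-fold cross-validation as described (learn and predict are placeholders as before; the examples are assumed to be already shuffled):

def cross_validate(examples, k, learn, predict):
    # Run k experiments, each testing on a different 1/k of the data, and average the results.
    fold_size = len(examples) // k          # any leftover examples at the end are ignored in this sketch
    accuracies = []
    for i in range(k):
        test = examples[i * fold_size:(i + 1) * fold_size]          # the held-out fold
        train = examples[:i * fold_size] + examples[(i + 1) * fold_size:]
        h = learn(train)
        correct = sum(1 for e in test if predict(h, e) == e["class"])
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)                        # average over the k folds

Setting k = len(examples) gives leave-one-out.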

Page 26: Machine Learning

26

Cross-Validation

[Figure: 10-fold cross-validation. The labeled data (1566 examples) is split into 10 folds; each round trains a model on 9 folds (approx. 1409 examples) and evaluates it on the remaining fold (approx. 157). Lather, rinse, repeat (10 times), then report the average.]

Page 27: Machine Learning

27

Example

Page 28: Machine Learning

28

Ensemble Learning

Learn from a collection of hypotheses

Majority voting

Enlarges the hypothesis space
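A tiny sketch of the majority-voting idea (each hypothesis is assumed to be a plain prediction function; names are mine):

from collections import Counter

def majority_vote(hypotheses, example):
    # Each hypothesis votes; the ensemble returns the most common prediction.
    votes = [h(example) for h in hypotheses]
    return Counter(votes).most_common(1)[0][0]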

Page 29: Machine Learning

29

Boosting

Uses a weighted training set
  Each example has an associated weight w_j ≥ 0
  Higher-weighted examples have higher importance

Initially, w_j = 1 for all examples; generate the first hypothesis h_1 from this set
Next round: increase the weights of misclassified examples and decrease the other weights
From the new weighted set, generate hypothesis h_2
Continue until M hypotheses have been generated
Final ensemble hypothesis = weighted-majority combination of all M hypotheses, weighting each hypothesis according to how well it did on the training data
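A hedged AdaBoost-style sketch of this scheme, assuming binary labels in {-1, +1} and a weak_learn function that accepts example weights and returns a classifier; the weights here are kept normalized (summing to 1) rather than starting at 1, which is equivalent:

import math

def adaboost(examples, labels, weak_learn, M):
    N = len(examples)
    weights = [1.0 / N] * N                        # equal weight on every example to start
    hypotheses, alphas = [], []
    for _ in range(M):
        h = weak_learn(examples, labels, weights)  # learn from the current weighted set
        err = sum(w for w, x, y in zip(weights, examples, labels) if h(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)      # keep the weighted error strictly between 0 and 1
        alpha = 0.5 * math.log((1 - err) / err)    # hypothesis weight: accurate hypotheses count more
        # Increase the weights of misclassified examples, decrease the others, then renormalize.
        weights = [w * math.exp(alpha if h(x) != y else -alpha)
                   for w, x, y in zip(weights, examples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
        hypotheses.append(h)
        alphas.append(alpha)

    def ensemble(x):
        # Weighted-majority combination of the M hypotheses.
        return 1 if sum(a * h(x) for a, h in zip(alphas, hypotheses)) >= 0 else -1

    return ensemble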

Page 30: Machine Learning

30

AdaBoost

If the input learning algorithm L is a weak learning algorithm, i.e., L always returns a hypothesis whose weighted error on the training set is slightly better than random guessing,
then AdaBoost returns a hypothesis that classifies the training data perfectly for large enough M

Boosting therefore boosts the accuracy of the original learning algorithm on the training data