Introduction to Machine Learning Algorithms
TRANSCRIPT
1
Introduction to Machine Learning Algorithms
Boonserm Kijsirikul
Department of Computer Engineering, Chulalongkorn University
2
The Need for Machine Learning
Recently, full-genome sequencing has blossomed.
Several other technologies, such as DNA microarrays and mass spectrometry, have also progressed considerably.
These technologies rapidly produce terabytes of data.
3
The Problem of Diabetes
Goal: predict whether a patient shows signs of diabetes.
The data contain 2 classes:
+ : tested positive for diabetes
− : tested negative for diabetes
4
The Problem of Diabetes (cont.)
The data contain 8 attributes:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
5
The Data of Diabetes (Table 1)

No. Time Pregnant | Plasma Glucose | Blood Pressure | Fold Thickness | Serum Insulin | BMI  | Pedigree | Age | Class
1                 | 189            | 60             | 23             | 846           | 30.1 | 0.398    | 59  | +
5                 | 166            | 72             | 19             | 175           | 25.8 | 0.587    | 51  | +
1                 | 103            | 30             | 38             | 83            | 43.3 | 0.183    | 33  | −
1                 | 89             | 66             | 23             | 94            | 28.1 | 0.167    | 21  | −
0                 | 137            | 40             | 35             | 168           | 43.1 | 2.288    | 33  | +
3                 | 126            | 88             | 41             | 235           | 39.3 | 0.704    | 27  | −
13                | 145            | 82             | 19             | 110           | 22.2 | 0.245    | 57  | −
1                 | 97             | 66             | 15             | 140           | 23.2 | 0.487    | 22  | −
2                 | 197            | 70             | 45             | 543           | 30.5 | 0.158    | 53  | +
6
Inductive Learning from Data (Classification Tasks)
Input Data → Learning Algorithm → Learned Model
7
Inductive Learning from Data (Classification Tasks) – Cont.
Test Data → Learned Model → Classification
8
Machine Learning Algorithms
Decision Tree Learning
Neural Networks
Support Vector Machines
Etc.
9
Decision Tree Learning
Decision tree representation:
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
10
An Example of Decision Trees
[Figure: a decision tree for the diabetes data. The root tests plasma glucose (≤127 / >127); deeper nodes test body mass index (≤29.9 / >29.9), blood pressure (thresholds 72, 61, 66), and plasma glucose again (threshold 155); each leaf assigns positive or negative.]
11
A Decision Tree Learning Algorithm
Main loop:
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for node
3. For each value of A, create a new descendant of node
4. Sort training examples to leaf nodes
5. If training examples are perfectly classified, then STOP; else iterate over new leaf nodes
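The main loop above can be sketched as a recursive procedure. This is a minimal illustration, not code from the lecture: `choose_attribute` is a hypothetical caller-supplied function (e.g. one that maximizes information gain, introduced later), and the data are assumed to be consistent.

```python
def build_tree(examples, attributes, choose_attribute):
    """examples: list of (features_dict, label). Returns a label or a node dict."""
    label_list = [label for _, label in examples]
    if len(set(label_list)) == 1 or not attributes:
        return max(set(label_list), key=label_list.count)  # leaf: (majority) class
    A = choose_attribute(examples, attributes)      # step 1: pick the "best" attribute
    node = {"attribute": A, "branches": {}}         # step 2: A becomes the decision
    remaining = [a for a in attributes if a != A]
    groups = {}
    for feats, label in examples:                   # step 4: sort examples to branches
        groups.setdefault(feats[A], []).append((feats, label))
    for value, subset in groups.items():            # steps 3 & 5: grow each descendant
        node["branches"][value] = build_tree(subset, remaining, choose_attribute)
    return node
```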
12
Selection of the "Best" Attribute
Using the data in Table 1 (4+, 5−), some candidate attributes are shown below:
no. time pregnant: ≤1 → 2+,3−; >1 → 2+,2−
plasma glucose: ≤126 → 0+,4−; >126 → 4+,1−
body mass index: ≤28.1 → 1+,3−; >28.1 → 3+,2−
age: ≤27 → 0+,3−; >27 → 4+,2−
13
Entropy
S is a sample of training examples
p+ is the proportion of positive examples in S
p− is the proportion of negative examples in S
Entropy measures the impurity of S:
Entropy(S) = − p+ log2 p+ − p− log2 p−
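The entropy formula translates directly into code. A small sketch (the function name `entropy` is ours; the 0·log 0 term is taken as 0, as is conventional):

```python
from math import log2

def entropy(pos, neg):
    """Entropy(S) = -p+ log2 p+ - p- log2 p-, given positive/negative counts."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:               # 0 * log2(0) is taken as 0
            result -= p * log2(p)
    return result
```

For the Table 1 sample (4+, 5−), `entropy(4, 5)` is about 0.99, close to the maximum impurity of 1.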
14
Information Gain
Gain(S, A) = expected reduction in entropy due to sorting on A:

Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)

Example: plasma glucose splits S = 4+,5− into 0+,4− (≤126) and 4+,1− (>126):

Gain(S, plasma glucose) ≡ [ − (4/9) log2 (4/9) − (5/9) log2 (5/9) ]
  − (4/9) [ − (0/4) log2 (0/4) − (4/4) log2 (4/4) ]
  − (5/9) [ − (4/5) log2 (4/5) − (1/5) log2 (1/5) ]
  = 0.59
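The worked example can be checked numerically. A sketch assuming (positive, negative) count pairs for S and for each subset Sv (function names are ours):

```python
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c)

def gain(parent, partitions):
    # Gain(S, A) = Entropy(S) - sum over v of |Sv|/|S| * Entropy(Sv)
    total = sum(parent)
    return entropy(*parent) - sum(
        (p + n) / total * entropy(p, n) for p, n in partitions)

# plasma glucose from the slide: (4+, 5-) split into (0+, 4-) and (4+, 1-)
print(round(gain((4, 5), [(0, 4), (4, 1)]), 2))  # 0.59
```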
15
Selection of the "Best" Attribute by Gain
no. time pregnant: ≤1 → 2+,3−; >1 → 2+,2−  (Gain = 0.01)
plasma glucose: ≤126 → 0+,4−; >126 → 4+,1−  (Gain = 0.59)
body mass index: ≤28.1 → 1+,3−; >28.1 → 3+,2−  (Gain = 0.09)
age: ≤27 → 0+,3−; >27 → 4+,2−  (Gain = 0.38)
"Plasma glucose" is selected as the root node, and the process of adding nodes is repeated.
16
Artificial Neural Networks
Artificial neural networks (ANNs) simulate units in the human brain (neurons)
Many weighted interconnections among units
Learning by adaptation of the connection weights
17
Perceptron
One of the earliest neural network models
Input is a real-valued vector (x1, …, xn)
An activation function computes the output value (o)
Output is based on a linear combination of the input with weights (w1, …, wn)
[Figure: a perceptron unit with inputs x1, …, xn, weights w1, …, wn, a bias input x0 = 1 with weight w0, and a summation unit Σ producing output o]

o(x1, …, xn) = 1 if w1x1 + w2x2 + … + wnxn > θ
               −1 if w1x1 + w2x2 + … + wnxn < θ

Equivalently, with w0 = −θ and x0 = 1:

o(x1, …, xn) = 1 if w0 + w1x1 + w2x2 + … + wnxn > 0
               −1 otherwise
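The thresholded output translates directly into code. A minimal sketch using the w0 = −θ, x0 = 1 form (the function name is ours):

```python
def perceptron_output(w, x):
    # w = [w0, w1, ..., wn] with w0 = -theta; the bias input x0 = 1 is implicit
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1
```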
18
The Data of Diabetes (Table 1)The Data of Diabetes (Table 1)
No. Time Pregnant
Plasma Glucose
Blood Pressure
Fold Thickness
Serum Insulin
BMI Pedi-gree
189 846
175
83
94
168
235
110
140
543
166
0.39830.1
25.8
43.3
28.1
43.1
39.3
22.2
103
23.2
0.587
0.183
0.167
2.288
0.704
0.245
0.487
89
137
126
145
97
30.5 0.158197
60
72
30
66
40
88
82
66
70
23
19
38
23
35
41
19
15
45
Age Class
1 59 +
5 51 +
1 33 −
1 21 −
0 33 +
3 27 −
13 57 −
1 22 −
2 53 +
19
[Table 1 is repeated here; see The Data of Diabetes (Table 1) above]
20
Perceptron's Decision Surface
[Figure: the Table 1 examples plotted with plasma glucose (x1, roughly 90–200) on the horizontal axis and BMI (x2, roughly 22–44) on the vertical axis; a straight line separates the + and − examples]

The decision surface is the line w0 + w1x1 + w2x2 = 0, i.e. x2 = −(w1/w2) x1 − w0/w2.
21
Perceptron Training
Weight update rule: wi ← wi + η (t − o) xi, where t is the target output, o the perceptron output, and η the learning rate (here η = 0.1).
Initial weights w = [0.25, −0.1, 0.5], i.e. the boundary x2 = 0.2 x1 − 0.5, with o = 1 on one side and o = −1 on the other:
(x, t) = ([−1, −1], 1): o = sgn(0.25 + 0.1 − 0.5) = −1, so ∆w = [0.2, −0.2, −0.2]
(x, t) = ([2, 1], −1): o = sgn(0.45 − 0.6 + 0.3) = 1, so ∆w = [−0.2, −0.4, −0.2]
(x, t) = ([1, 1], 1): o = sgn(0.25 − 0.7 + 0.1) = −1, so ∆w = [0.2, 0.2, 0.2]
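A single training step of the perceptron rule can be sketched as follows (function name is ours; η = 0.1 is the value consistent with the ∆w's shown above):

```python
def train_step(w, x, t, eta=0.1):
    # Perceptron rule: w_i <- w_i + eta * (t - o) * x_i, with bias input x0 = 1
    x_full = [1] + list(x)
    s = sum(wi * xi for wi, xi in zip(w, x_full))
    o = 1 if s > 0 else -1
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x_full)], o
```

Starting from w = [0.25, −0.1, 0.5], the three examples above reproduce the weight changes shown on the slide.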
22
Linearly Non-Separable Examples
[Figure: + and − examples arranged so that no single straight line can separate the two classes]
23
Multilayer Networks
Multilayer feedforward neural networks can represent non-linear decision surfaces.
An example of a multilayer network:
[Figure: inputs x1 and x2 feed two hidden units (weights 1.0 and 1.0 each, biases w0 = −0.5 and w0 = −1.5), whose outputs feed an output unit (weights 1.0 and −2.0, bias w0 = −0.5); every unit applies a binary step function. An accompanying table shows the resulting net inputs for x1, x2 ∈ {0, 1}.]
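As a check, the network in the example (hidden biases −0.5 and −1.5, output weights 1.0 and −2.0 with bias −0.5) computes XOR of binary inputs, a function no single perceptron can represent. A small sketch with a binary step activation:

```python
def step(s):
    # binary threshold activation: 1 if the net input is positive, else 0
    return 1 if s > 0 else 0

def xor_net(x1, x2):
    h1 = step(-0.5 + 1.0 * x1 + 1.0 * x2)    # fires when x1 OR x2
    h2 = step(-1.5 + 1.0 * x1 + 1.0 * x2)    # fires when x1 AND x2
    return step(-0.5 + 1.0 * h1 - 2.0 * h2)  # OR and not AND, i.e. XOR
```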
24
Sigmoid Function
The sigmoid function is used as the activation function in multilayer networks:

σ(y) = 1 / (1 + e^(−y))

[Figure: the sigmoid curve, rising smoothly from 0 to 1; a unit with inputs x1, …, xn, weights w1, …, wn, bias input x0 = 1 with weight w0, and summation Σ producing output o]
25
Using Sigmoid
[Figure: a smooth, non-linear decision surface separating + and − examples]
26
Backpropagation Algorithm
Forward step: propagate activation from the input layer to the output layer
Backward step: propagate errors from the output layer back to the hidden layer
[Figure: a network with input xi, hidden-unit outputs xk and error terms δk (weights wki), and output yj with error term δj (weights wjk)]
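The two steps can be sketched for a tiny 2-2-1 sigmoid network with squared-error loss. This is a minimal illustration under our own assumptions (function names and the learning rate η = 0.5 are ours, not from the lecture):

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(w_hidden, w_out, x):
    # Forward step: propagate activation from the input to the output layer.
    # w_hidden holds one weight list [w0, w1, w2] per hidden unit; w_out likewise.
    h = [sigmoid(w[0] + w[1] * x[0] + w[2] * x[1]) for w in w_hidden]
    o = sigmoid(w_out[0] + sum(wk * hk for wk, hk in zip(w_out[1:], h)))
    return h, o

def backward(w_hidden, w_out, x, t, eta=0.5):
    # Backward step: propagate the error from the output unit back to the
    # hidden layer (squared-error loss; sigmoid'(s) = out * (1 - out)).
    h, o = forward(w_hidden, w_out, x)
    delta_o = (t - o) * o * (1 - o)
    delta_h = [hk * (1 - hk) * w_out[k + 1] * delta_o for k, hk in enumerate(h)]
    new_w_out = [w_out[0] + eta * delta_o] + [
        wk + eta * delta_o * hk for wk, hk in zip(w_out[1:], h)]
    new_w_hidden = [[w[0] + eta * d] + [w[i + 1] + eta * d * x[i] for i in range(2)]
                    for w, d in zip(w_hidden, delta_h)]
    return new_w_hidden, new_w_out
```

Each step of `backward` moves the output for x closer to the target t.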
27
Support Vector Machines
An SVM constructs an optimal hyperplane that separates the data points of two classes as far as possible.
[Table 1 is repeated here; see The Data of Diabetes (Table 1) above]
28
Optimal Separating Hyperplane
[Figure: the Table 1 examples plotted with plasma glucose on the horizontal axis and BMI on the vertical axis; the optimal separating hyperplane lies between the + and − examples, as far from both classes as possible]
29
An Example of Linearly Separable Functions
• No. of support vectors = 4
30
An Example of Linearly Non-Separable Functions
31
An Example of Linearly Separable Functions
• In case of using the input space
32
Feature Spaces
For a linearly non-separable function, it is very likely that a linear separator (hyperplane) can be constructed in a higher-dimensional space.
Suppose we map the data points in the input space R^n into some feature space of higher dimension R^m using a function F:
F : R^n → R^m
Example: F : R^2 → R^3
x = (x1, x2), F(x) = (x1, x2, √2 x1x2)
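As an illustration of why such a map helps: under F(x) = (x1, x2, √2 x1x2), XOR-like points (which no line in R^2 separates) get a third coordinate whose sign differs between the two classes, so the plane x3 = 0 separates them in R^3. A small sketch (the point sets are our example, not from the lecture):

```python
import math

def F(x):
    # the mapping from the slide: F : R^2 -> R^3
    x1, x2 = x
    return (x1, x2, math.sqrt(2) * x1 * x2)

positive = [(1, 1), (-1, -1)]   # not linearly separable from `negative` in R^2
negative = [(1, -1), (-1, 1)]
print([F(p)[2] for p in positive])   # third coordinate positive for this class
print([F(n)[2] for n in negative])   # third coordinate negative for this class
```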
33
An Example of Linearly Non-Separable Functions
• In case of using the feature space: F(x) = (x1, x2, √2 x1x2)
34
An Example of Linearly Non-Separable Functions
• The corresponding non-linear function in the input space
35
Multiclass SVMs
The Max Win algorithm is an approach for constructing multiclass SVMs.
Training:
− Construct all possible binary SVMs
− For N classes, there will be N(N−1)/2 binary SVMs
− Each classifier is trained on 2 out of the N classes
Testing:
− A test example is classified by all classifiers
− Each classifier provides one vote for its preferred class
− The majority vote is the final output
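The voting step can be sketched as follows. This assumes a hypothetical interface of our own: `binary_clfs` maps each class pair to a trained binary classifier that returns its preferred class for the test example x.

```python
from collections import Counter
from itertools import combinations

def max_win_predict(classes, binary_clfs, x):
    # one vote per binary classifier (N(N-1)/2 of them); majority vote wins
    votes = Counter(binary_clfs[pair](x) for pair in combinations(classes, 2))
    return votes.most_common(1)[0][0]
```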
36
Comparison of the Algorithms on Diabetes

Algorithm               Accuracy
Decision Tree           78.65%
Neural Network          79.17%
Support Vector Machine  78.65%