Introduction to Machine Learning Algorithms
TRANSCRIPT
1
Introduction to Machine Learning Algorithms
Boonserm Kijsirikul
Department of Computer Engineering, Chulalongkorn University
2
The Need for Machine Learning
Recently, full-genome sequencing has blossomed.
Several other technologies, such as DNA microarrays and mass spectrometry, have also progressed considerably.
These technologies rapidly produce terabytes of data.
3
The Problem of Diabetes
Goal: predict whether a patient shows signs of diabetes.
The data contain 2 classes:
+ : tested positive for diabetes
− : tested negative for diabetes
4
The Problem of Diabetes (cont.)
The data contain 8 attributes:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
5
The Data of Diabetes (Table 1)

No. Time Pregnant | Plasma Glucose | Blood Pressure | Fold Thickness | Serum Insulin | BMI  | Pedigree | Age | Class
1                 | 189            | 60             | 23             | 846           | 30.1 | 0.398    | 59  | +
5                 | 166            | 72             | 19             | 175           | 25.8 | 0.587    | 51  | +
1                 | 103            | 30             | 38             | 83            | 43.3 | 0.183    | 33  | −
1                 | 89             | 66             | 23             | 94            | 28.1 | 0.167    | 21  | −
0                 | 137            | 40             | 35             | 168           | 43.1 | 2.288    | 33  | +
3                 | 126            | 88             | 41             | 235           | 39.3 | 0.704    | 27  | −
13                | 145            | 82             | 19             | 110           | 22.2 | 0.245    | 57  | −
1                 | 97             | 66             | 15             | 140           | 23.2 | 0.487    | 22  | −
2                 | 197            | 70             | 45             | 543           | 30.5 | 0.158    | 53  | +
6
Inductive Learning from Data (Classification Tasks)
Input Data → Learning Algorithm → Learned Model
7
Inductive Learning from Data (Classification Tasks) – Cont.
Test Data → Learned Model → Classification
8
Machine Learning Algorithms
Decision Tree Learning
Neural Networks
Support Vector Machines
Etc.
9
Decision Tree Learning
Decision tree representation:
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
10
An Example of Decision Trees
[Figure: a decision tree for the diabetes data. The root tests plasma glucose (≤127 / >127); deeper nodes test body mass index (≤29.9 / >29.9), blood pressure (thresholds 72, 61, 66), and plasma glucose again (threshold 155); each leaf assigns positive or negative.]
11
A Decision Tree Learning Algorithm
Main loop:
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for node
3. For each value of A, create a new descendant of node
4. Sort training examples to leaf nodes
5. If training examples are perfectly classified, then STOP; else iterate over new leaf nodes
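The main loop above can be sketched as a recursive procedure. This is a minimal illustration, not code from the lecture: `choose_attribute` is a hypothetical caller-supplied function (e.g. one that maximizes information gain, introduced later), and the data are assumed to be consistent.

```python
def build_tree(examples, attributes, choose_attribute):
    """examples: list of (features_dict, label). Returns a label or a node dict."""
    label_list = [label for _, label in examples]
    if len(set(label_list)) == 1 or not attributes:
        return max(set(label_list), key=label_list.count)  # leaf: (majority) class
    A = choose_attribute(examples, attributes)      # step 1: pick the "best" attribute
    node = {"attribute": A, "branches": {}}         # step 2: A becomes the decision
    remaining = [a for a in attributes if a != A]
    groups = {}
    for feats, label in examples:                   # step 4: sort examples to branches
        groups.setdefault(feats[A], []).append((feats, label))
    for value, subset in groups.items():            # steps 3 & 5: grow each descendant
        node["branches"][value] = build_tree(subset, remaining, choose_attribute)
    return node
```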
12
Selection of the "Best" Attribute
Using the data in Table 1 (4+, 5−), some candidate attributes are shown below:
no. time pregnant: ≤1 → 2+,3−; >1 → 2+,2−
plasma glucose: ≤126 → 0+,4−; >126 → 4+,1−
body mass index: ≤28.1 → 1+,3−; >28.1 → 3+,2−
age: ≤27 → 0+,3−; >27 → 4+,2−
13
Entropy
S is a sample of training examples
p+ is the proportion of positive examples in S
p− is the proportion of negative examples in S
Entropy measures the impurity of S:
Entropy(S) = − p+ log2 p+ − p− log2 p−
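The entropy formula translates directly into code. A small sketch (the function name `entropy` is ours; the 0·log 0 term is taken as 0, as is conventional):

```python
from math import log2

def entropy(pos, neg):
    """Entropy(S) = -p+ log2 p+ - p- log2 p-, given positive/negative counts."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:               # 0 * log2(0) is taken as 0
            result -= p * log2(p)
    return result
```

For the Table 1 sample (4+, 5−), `entropy(4, 5)` is about 0.99, close to the maximum impurity of 1.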
14
Information Gain
Gain(S, A) = expected reduction in entropy due to sorting on A:

Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)

Example: plasma glucose splits S = 4+,5− into 0+,4− (≤126) and 4+,1− (>126):

Gain(S, plasma glucose) ≡ [ − (4/9) log2 (4/9) − (5/9) log2 (5/9) ]
  − (4/9) [ − (0/4) log2 (0/4) − (4/4) log2 (4/4) ]
  − (5/9) [ − (4/5) log2 (4/5) − (1/5) log2 (1/5) ]
  = 0.59
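The worked example can be checked numerically. A sketch assuming (positive, negative) count pairs for S and for each subset Sv (function names are ours):

```python
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c)

def gain(parent, partitions):
    # Gain(S, A) = Entropy(S) - sum over v of |Sv|/|S| * Entropy(Sv)
    total = sum(parent)
    return entropy(*parent) - sum(
        (p + n) / total * entropy(p, n) for p, n in partitions)

# plasma glucose from the slide: (4+, 5-) split into (0+, 4-) and (4+, 1-)
print(round(gain((4, 5), [(0, 4), (4, 1)]), 2))  # 0.59
```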
15
Selection of the "Best" Attribute by Gain
no. time pregnant: ≤1 → 2+,3−; >1 → 2+,2−  (Gain = 0.01)
plasma glucose: ≤126 → 0+,4−; >126 → 4+,1−  (Gain = 0.59)
body mass index: ≤28.1 → 1+,3−; >28.1 → 3+,2−  (Gain = 0.09)
age: ≤27 → 0+,3−; >27 → 4+,2−  (Gain = 0.38)
"Plasma glucose" is selected as the root node, and the process of adding nodes is repeated.
16
Artificial Neural Networks
Artificial neural networks (ANNs) simulate units in the human brain (neurons)
Many weighted interconnections among units
Learning by adaptation of the connection weights
17
Perceptron
One of the earliest neural network models
Input is a real-valued vector (x1, …, xn)
An activation function computes the output value (o)
Output is based on a linear combination of the input with weights (w1, …, wn)
[Figure: a perceptron unit with inputs x1, …, xn, weights w1, …, wn, a bias input x0 = 1 with weight w0, and a summation unit Σ producing output o]

o(x1, …, xn) = 1 if w1x1 + w2x2 + … + wnxn > θ
               −1 if w1x1 + w2x2 + … + wnxn < θ

Equivalently, with w0 = −θ and x0 = 1:

o(x1, …, xn) = 1 if w0 + w1x1 + w2x2 + … + wnxn > 0
               −1 otherwise
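The thresholded output translates directly into code. A minimal sketch using the w0 = −θ, x0 = 1 form (the function name is ours):

```python
def perceptron_output(w, x):
    # w = [w0, w1, ..., wn] with w0 = -theta; the bias input x0 = 1 is implicit
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1
```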
18
The Data of Diabetes (Table 1)The Data of Diabetes (Table 1)
No. Time Pregnant
Plasma Glucose
Blood Pressure
Fold Thickness
Serum Insulin
BMI Pedi-gree
189 846
175
83
94
168
235
110
140
543
166
0.39830.1
25.8
43.3
28.1
43.1
39.3
22.2
103
23.2
0.587
0.183
0.167
2.288
0.704
0.245
0.487
89
137
126
145
97
30.5 0.158197
60
72
30
66
40
88
82
66
70
23
19
38
23
35
41
19
15
45
Age Class
1 59 +
5 51 +
1 33 −
1 21 −
0 33 +
3 27 −
13 57 −
1 22 −
2 53 +
19
[Table 1 is repeated here; see The Data of Diabetes (Table 1) above]
20
Perceptron's Decision Surface
[Figure: the Table 1 examples plotted with plasma glucose (x1, roughly 90–200) on the horizontal axis and BMI (x2, roughly 22–44) on the vertical axis; a straight line separates the + and − examples]

The decision surface is the line w0 + w1x1 + w2x2 = 0, i.e. x2 = −(w1/w2) x1 − w0/w2.
21
Perceptron Training
Weight update rule: wi ← wi + η (t − o) xi, where t is the target output, o the perceptron output, and η the learning rate (here η = 0.1).
Initial weights w = [0.25, −0.1, 0.5], i.e. the boundary x2 = 0.2 x1 − 0.5, with o = 1 on one side and o = −1 on the other:
(x, t) = ([−1, −1], 1): o = sgn(0.25 + 0.1 − 0.5) = −1, so ∆w = [0.2, −0.2, −0.2]
(x, t) = ([2, 1], −1): o = sgn(0.45 − 0.6 + 0.3) = 1, so ∆w = [−0.2, −0.4, −0.2]
(x, t) = ([1, 1], 1): o = sgn(0.25 − 0.7 + 0.1) = −1, so ∆w = [0.2, 0.2, 0.2]
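A single training step of the perceptron rule can be sketched as follows (function name is ours; η = 0.1 is the value consistent with the ∆w's shown above):

```python
def train_step(w, x, t, eta=0.1):
    # Perceptron rule: w_i <- w_i + eta * (t - o) * x_i, with bias input x0 = 1
    x_full = [1] + list(x)
    s = sum(wi * xi for wi, xi in zip(w, x_full))
    o = 1 if s > 0 else -1
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x_full)], o
```

Starting from w = [0.25, −0.1, 0.5], the three examples above reproduce the weight changes shown on the slide.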
22
Linearly Non-Separable Examples
[Figure: + and − examples arranged so that no single straight line can separate the two classes]
23
Multilayer Networks
Multilayer feedforward neural networks can represent non-linear decision surfaces.
An example of a multilayer network:
[Figure: inputs x1 and x2 feed two hidden units (weights 1.0 and 1.0 each, biases w0 = −0.5 and w0 = −1.5), whose outputs feed an output unit (weights 1.0 and −2.0, bias w0 = −0.5); every unit applies a binary step function. An accompanying table shows the resulting net inputs for x1, x2 ∈ {0, 1}.]
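As a check, the network in the example (hidden biases −0.5 and −1.5, output weights 1.0 and −2.0 with bias −0.5) computes XOR of binary inputs, a function no single perceptron can represent. A small sketch with a binary step activation:

```python
def step(s):
    # binary threshold activation: 1 if the net input is positive, else 0
    return 1 if s > 0 else 0

def xor_net(x1, x2):
    h1 = step(-0.5 + 1.0 * x1 + 1.0 * x2)    # fires when x1 OR x2
    h2 = step(-1.5 + 1.0 * x1 + 1.0 * x2)    # fires when x1 AND x2
    return step(-0.5 + 1.0 * h1 - 2.0 * h2)  # OR and not AND, i.e. XOR
```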
24
Sigmoid Function
The sigmoid function is used as the activation function in multilayer networks:

σ(y) = 1 / (1 + e^(−y))

[Figure: the sigmoid curve, rising smoothly from 0 to 1; a unit with inputs x1, …, xn, weights w1, …, wn, bias input x0 = 1 with weight w0, and summation Σ producing output o]
25
Using Sigmoid
[Figure: a smooth, non-linear decision surface separating + and − examples]
26
Backpropagation Algorithm
Forward step: propagate activation from the input layer to the output layer
Backward step: propagate errors from the output layer back to the hidden layer
[Figure: a network with input xi, hidden-unit outputs xk and error terms δk (weights wki), and output yj with error term δj (weights wjk)]
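The two steps can be sketched for a tiny 2-2-1 sigmoid network with squared-error loss. This is a minimal illustration under our own assumptions (function names and the learning rate η = 0.5 are ours, not from the lecture):

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(w_hidden, w_out, x):
    # Forward step: propagate activation from the input to the output layer.
    # w_hidden holds one weight list [w0, w1, w2] per hidden unit; w_out likewise.
    h = [sigmoid(w[0] + w[1] * x[0] + w[2] * x[1]) for w in w_hidden]
    o = sigmoid(w_out[0] + sum(wk * hk for wk, hk in zip(w_out[1:], h)))
    return h, o

def backward(w_hidden, w_out, x, t, eta=0.5):
    # Backward step: propagate the error from the output unit back to the
    # hidden layer (squared-error loss; sigmoid'(s) = out * (1 - out)).
    h, o = forward(w_hidden, w_out, x)
    delta_o = (t - o) * o * (1 - o)
    delta_h = [hk * (1 - hk) * w_out[k + 1] * delta_o for k, hk in enumerate(h)]
    new_w_out = [w_out[0] + eta * delta_o] + [
        wk + eta * delta_o * hk for wk, hk in zip(w_out[1:], h)]
    new_w_hidden = [[w[0] + eta * d] + [w[i + 1] + eta * d * x[i] for i in range(2)]
                    for w, d in zip(w_hidden, delta_h)]
    return new_w_hidden, new_w_out
```

Each step of `backward` moves the output for x closer to the target t.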
27
Support Vector Machines
An SVM constructs an optimal hyperplane that separates the data points of two classes as far as possible.
[Table 1 is repeated here; see The Data of Diabetes (Table 1) above]
28
Optimal Separating Hyperplane
[Figure: the Table 1 examples plotted with plasma glucose on the horizontal axis and BMI on the vertical axis; the optimal separating hyperplane lies between the + and − examples, as far from both classes as possible]
29
An Example of Linearly Separable Functions
• No. of support vectors = 4
30
An Example of Linearly Non-Separable Functions
31
An Example of Linearly Separable Functions
• In case of using the input space
32
Feature Spaces
For a linearly non-separable function, it is very likely that a linear separator (hyperplane) can be constructed in a higher-dimensional space.
Suppose we map the data points in the input space R^n into some feature space of higher dimension R^m using a function F:
F : R^n → R^m
Example: F : R^2 → R^3
x = (x1, x2), F(x) = (x1, x2, √2 x1x2)
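As an illustration of why such a map helps: under F(x) = (x1, x2, √2 x1x2), XOR-like points (which no line in R^2 separates) get a third coordinate whose sign differs between the two classes, so the plane x3 = 0 separates them in R^3. A small sketch (the point sets are our example, not from the lecture):

```python
import math

def F(x):
    # the mapping from the slide: F : R^2 -> R^3
    x1, x2 = x
    return (x1, x2, math.sqrt(2) * x1 * x2)

positive = [(1, 1), (-1, -1)]   # not linearly separable from `negative` in R^2
negative = [(1, -1), (-1, 1)]
print([F(p)[2] for p in positive])   # third coordinate positive for this class
print([F(n)[2] for n in negative])   # third coordinate negative for this class
```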
33
An Example of Linearly Non-Separable Functions
• In case of using the feature space: F(x) = (x1, x2, √2 x1x2)
34
An Example of Linearly Non-Separable Functions
• The corresponding non-linear function in the input space
35
Multiclass SVMs
The Max Win algorithm is an approach for constructing multiclass SVMs.
Training:
− Construct all possible binary SVMs
− For N classes, there will be N(N−1)/2 binary SVMs
− Each classifier is trained on 2 out of the N classes
Testing:
− A test example is classified by all classifiers
− Each classifier provides one vote for its preferred class
− The majority vote is the final output
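The voting step can be sketched as follows. This assumes a hypothetical interface of our own: `binary_clfs` maps each class pair to a trained binary classifier that returns its preferred class for the test example x.

```python
from collections import Counter
from itertools import combinations

def max_win_predict(classes, binary_clfs, x):
    # one vote per binary classifier (N(N-1)/2 of them); majority vote wins
    votes = Counter(binary_clfs[pair](x) for pair in combinations(classes, 2))
    return votes.most_common(1)[0][0]
```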
36
Comparison of the Algorithms on Diabetes

Algorithm               Accuracy
Decision Tree           78.65%
Neural Network          79.17%
Support Vector Machine  78.65%