Neural Networks — Week #5
Machine Learning I 80-629A
Apprentissage Automatique I 80-629
Laurent Charlin — 80-629
This lecture
• Neural Networks
A. Modeling
B. Fitting
C. Deep neural networks
D. In practice
Some of today’s material is (adapted) from Joelle Pineau’s slides
From Linear Classification to Neural Networks
Recall Linear Classification

y(x) = wᵀx + w₀
Decision: (wᵀx + w₀) > 0 ⇒ class 1; (wᵀx + w₀) < 0 ⇒ class 0

[Figure: two classes of points in the (x₁, x₂) plane, separated by a line]
What if data is not linearly separable?

Example: Exclusive OR (XOR)
Use the joint decision of several linear classifiers?
Two classifiers: wᵀx + w₀ and w′ᵀx + w′₀

[Figure: XOR data in the (x₁, x₂) plane with two linear decision boundaries]
Combining models

f(x): (wᵀx + w₀) > 0 ⇒ class 1; (wᵀx + w₀) < 0 ⇒ class 0
f′(x): (w′ᵀx + w′₀) > 0 ⇒ class 1; (w′ᵀx + w′₀) < 0 ⇒ class 0

1. Evaluate each model
2. Combine the output of models

e.g., f(x) = class 1 AND f′(x) = class 1 ⇒ class 1
The combination is itself a thresholded linear model: f″(x) = threshold(w″₁ f(x) + w″₂ f′(x) + w″₀)
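As a sketch of this combination, the XOR problem can be solved by two hand-picked linear classifiers whose thresholded outputs are combined by a third threshold unit. The weights below are illustrative assumptions, not learned values.

```python
import numpy as np

def threshold(a):
    return (a > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# f(x): fires unless both inputs are on (x1 + x2 < 1.5)
f = threshold(1.5 - X @ np.array([1.0, 1.0]))
# f'(x): fires if at least one input is on (x1 + x2 > 0.5)
f_prime = threshold(X @ np.array([1.0, 1.0]) - 0.5)

# f''(x) = threshold(w1*f + w2*f' + w0): an AND of the two decisions
f_double_prime = threshold(1.0 * f + 1.0 * f_prime - 1.5)

print(f_double_prime)  # [0 1 1 0], matching XOR
```

Neither linear classifier alone separates the data, but their joint decision does.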
Combining models (graphical view)

[Figure: inputs x₁ and x₂ feed two units computing f(x) and f′(x); their outputs feed a third unit computing f″(x). A single unit is a perceptron (neuron); the full graph is a neural network.]
Feed-forward neural network
• Each arrow denotes a connection
• Each connection carries a signal and is associated with a weight
• Each node computes the weighted sum of its inputs followed by a non-linear activation
• Connections go left to right
• No connections within a layer
• No backward (recurrent) connections
[Figure: input layer (x₁, x₂, and a constant 1 for the bias), hidden layer(s) computing σ(∑ᵢ wᵢxᵢ), and an output layer computing σ(∑ᵢ wᵢoᵢ) from the hidden outputs oᵢ]
Feed-forward neural network
1. An input layer
• Its size is the number of inputs + 1 (the extra unit is the constant bias)
2. One or more hidden layer(s)
• Their size is a hyper-parameter
3. An output layer
• Its size is the number of outputs
[Figure: input layer (x₁, x₂, 1), hidden layer(s), output layer]
Compute a prediction (forward pass)

1. Compute each hidden unit's activation: σ(∑ᵢ wᵢxᵢ)
2. Compute each output from the hidden activations: σ(∑ᵢ wᵢoᵢ)

[Figure: input layer, hidden layer(s), output layer]
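The two steps above can be sketched in a few lines of numpy. The shapes, random initialization, and use of the logistic function for σ are illustrative assumptions.

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # (2 inputs + bias) -> 4 hidden units
W2 = rng.normal(size=(5, 1))   # (4 hidden + bias) -> 1 output

def forward(x):
    x = np.append(x, 1.0)      # input plus the constant-1 bias unit
    o = sigma(x @ W1)          # 1. hidden activations sigma(sum_i w_i x_i)
    o = np.append(o, 1.0)
    return sigma(o @ W2)       # 2. output sigma(sum_i w_i o_i)

print(forward(np.array([0.5, -1.0])))  # a value in (0, 1)
```

Each layer is just a matrix multiply followed by an element-wise non-linearity.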
Neural Networks
• Flexible model class
• Highly non-linear models
• Good for regression/classification/density estimation
• Models behind “Deep Learning”
• Historical aspects
Learning the Parameters of a Neural Network
Fitting a neural network

How do we estimate the model's parameters?
• No closed-form solution
• Gradient-based optimization
• Threshold functions are not differentiable
• Replace the threshold by the sigmoid (inverse logit), a soft threshold:

sigmoid(a) := 1 / (1 + exp(−a))
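The difference between the hard threshold and its soft replacement can be seen numerically: the sigmoid agrees with the threshold far from zero but varies smoothly (and differentiably) near it.

```python
import numpy as np

def hard_threshold(a):
    return (a > 0).astype(float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(hard_threshold(a))  # [0. 0. 0. 1. 1.] -- a jump at 0
print(sigmoid(a))         # smooth values in (0, 1), exactly 0.5 at a = 0
```

Because the sigmoid has a well-defined derivative everywhere, gradient-based optimization can be applied.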
Fit the parameters w (backward pass)

• Derive a gradient with respect to the parameters w
• Back-propagation starts from the output node(s) and heads toward the input(s)
• In practice, the order of the computation is important

For the squared error (y − ŷ)²:
∂(y − ŷ)²/∂wⱼ = ∂(y − f(∑ᵢ wᵢoᵢ))²/∂wⱼ = ∂(y − f(∑ᵢ′ wᵢ′ f(∑ⱼ wⱼxⱼ)))²/∂wⱼ

[Figure: input layer, hidden layer(s), output layer]
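This chain rule can be sketched for a tiny one-hidden-layer network and checked against a finite-difference estimate. All sizes and values are illustrative assumptions.

```python
import numpy as np

def f(a):  # sigmoid activation
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.5, -1.0])
w1 = np.array([[0.1, -0.2], [0.4, 0.3]])  # input -> 2 hidden units
w2 = np.array([0.7, -0.5])                # hidden -> output
y = 1.0

def loss(w1, w2):
    h = f(w1 @ x)
    yhat = f(w2 @ h)
    return (y - yhat) ** 2, h, yhat

# Backward pass: start from the output node, head toward the inputs
L, h, yhat = loss(w1, w2)
d_yhat = -2.0 * (y - yhat)            # d loss / d yhat
d_a2 = d_yhat * yhat * (1 - yhat)     # through the output sigmoid
grad_w2 = d_a2 * h                    # gradient for the output weights
d_h = d_a2 * w2                       # through the weighted sum
d_a1 = d_h * h * (1 - h)              # through the hidden sigmoids
grad_w1 = np.outer(d_a1, x)           # gradient for the input weights

# Finite-difference check on one weight
eps = 1e-6
w1p = w1.copy(); w1p[0, 0] += eps
numeric = (loss(w1p, w2)[0] - L) / eps
print(grad_w1[0, 0], numeric)         # the two estimates should agree closely
```

The intermediate quantities (d_a2, d_h, d_a1) are computed once at the output and reused for every earlier layer, which is why the order of computation matters.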
Laurent Charlin — 80-629
Gradient descent• No closed-form formula
• Repeat the following steps (for t=0,1,2,… until convergence):
1. Calculate a gradient
2. Apply the update
• Stochastic gradient descent
• One example at a time
• Batch gradient descent
• All examples at a time
15
∇\YNO
\Y+�NO = \Y
NO − α∇\YNO
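The update rule wᵗ⁺¹ = wᵗ − α∇wᵗ can be run in both flavors on a toy least-squares problem. The data, step sizes, and iteration counts here are illustrative assumptions, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true

# Batch gradient descent: gradient over all examples at each step
w = np.zeros(3)
for t in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(X)
    w = w - 0.05 * grad

# Stochastic gradient descent: one example at a time
w_sgd = np.zeros(3)
for epoch in range(30):
    for i in rng.permutation(len(X)):
        grad = 2 * X[i] * (X[i] @ w_sgd - y[i])
        w_sgd = w_sgd - 0.02 * grad

print(w, w_sgd)  # both approach w_true
```

Stochastic updates are noisier per step but much cheaper, which is why they dominate for large datasets.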
What can an MLP learn?

1. A single unit (neuron)
• Linear classifier + non-linear output
2. A network with a single hidden layer
• Any continuous function (but may require exponentially many hidden units as a function of the inputs)
3. A network with two (or more) hidden layers
• Any function can be approximated with arbitrary accuracy
The Importance of Representations
From Neural Networks to Deep Neural Networks

[Figure: a neural network vs. a deep neural network with several hidden layers]

"Modern deep learning provides a powerful framework for supervised learning. By adding more layers and more units within a layer, a deep network can represent functions of increasing complexity."
Deep Learning — Part II, p. 163, http://www.deeplearningbook.org/contents/part_practical.html
Another view of deep learning

• Representations are important

Representations
[Figures from Honglak Lee (and Graham Taylor): features learned at successive layers]
Data → Layer 1 → Layer 2 → … → Classifier → Output
Machine Translation

[Figure: French, English, and Spanish encoders map sentences to a universal sentence representation, which French, English, and Spanish decoders map back into each language]
Machine learning: make machines that can learn
Deep learning: a set of machine learning techniques based on neural networks
Representation learning: a machine learning paradigm to discover data representations
[Idea: Hugo Larochelle]
Deep neural networks

• Several layers of hidden nodes
• Parameters at different levels of representation

[Figure from Yoshua Bengio: the raw input image is transformed into gradually higher levels of representation — more and more abstract functions of the raw input, e.g., edges, local shapes, object parts. In practice, we do not know in advance what the "right" representation should be for all these levels of abstraction, although linguistic concepts might help in guessing what the higher levels should implicitly represent.]
Neural Network Hyper-parameters
Hyperparameters
1. Model specific
• Activation functions (output & hidden), Network size
2. Optimization objective
• Regularization, Early-stopping, Dropout
3. Optimization procedure
• Momentum, Adaptive learning rates
Activation Functions

• Non-linear functions that transform the weighted sum of the inputs, e.g., f(∑ᵢ wᵢxᵢ)
• Non-linearities increase model representational power
• Non-linearities increase the difficulty of optimization
• Different functions for hidden units and output units
Activation functions — hidden units

• Traditional: logistic-like units
• f(z) = logit⁻¹(z) = 1 / (1 + exp(−z))
• f(z) = tanh(z)
• Saturate on both sides
• Differentiable everywhere

• Rectified linear units (ReLU): f(z) = max{0, z}
• Non-differentiable at a single point
• Now standard
• Better results / faster training
• Shuts off units
• Leaky ReLU: g(z) = max{0, z} + α min{0, z}

[https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40811.pdf]
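The hidden-unit activations above are one-liners in numpy; the leak coefficient α below is an illustrative choice.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):  # alpha is an assumed leak coefficient
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])
print(logistic(z))     # saturates toward 0 and 1
print(np.tanh(z))      # saturates toward -1 and 1
print(relu(z))         # [0. 0. 3.] -- shuts off negative inputs
print(leaky_relu(z))   # [-0.02  0.  3.] -- keeps a small negative slope
```

The leaky variant avoids units that are permanently shut off, since its gradient is non-zero for negative inputs too.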
Activation functions — output units

sigmoid(z) = 1 / (1 + exp(−z))
softmax(z)ᵢ = exp(zᵢ) / ∑ⱼ exp(zⱼ)

Output type | Output unit | Equivalent statistical distribution
Binary (0, 1) | sigmoid | Bernoulli
Categorical (0, 1, 2, …, k) | softmax | Multinoulli
Continuous | identity(z) = z | Gaussian
Multi-modal | mean, (co-)variance, components | Mixture of Gaussians
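The softmax formula above can be implemented directly; subtracting max(z) before exponentiating is a standard numerical trick (not on the slide) that avoids overflow without changing the result.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p, p.sum())  # a probability vector summing to 1
```

Larger inputs get larger probabilities, and the output is a valid Multinoulli distribution over the k categories.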
Regularization

• Weight decay on the parameters
• L2 penalty on the parameters
• Early stopping of the optimization procedure
• Monitor the validation loss and terminate when it stops improving
• Number of hidden layers and hidden units per layer
Momentum

wᵗ⁺¹ = wᵗ − α∇wᵗ  (gradient descent)
v = βv − α∇wᵗ; wᵗ⁺¹ = wᵗ + v  (gradient descent with momentum)

• Pro: can allow you to jump over small local optima
• Pro: goes faster through flat areas by using acquired speed
• Con: can also jump over global optima
• Con: one more hyper-parameter to tune
• More advanced adaptive steps: AdaGrad, Adam
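The two update rules can be compared on a 1-D quadratic; the objective, step size α, and momentum coefficient β below are illustrative assumptions.

```python
import numpy as np

def grad(w):  # gradient of the quadratic 0.5 * w**2
    return w

alpha, beta = 0.1, 0.9
w_gd, w_mom, v = 5.0, 5.0, 0.0
for t in range(100):
    w_gd = w_gd - alpha * grad(w_gd)       # w_{t+1} = w_t - alpha * grad
    v = beta * v - alpha * grad(w_mom)     # v = beta * v - alpha * grad
    w_mom = w_mom + v                      # w_{t+1} = w_t + v

print(w_gd, w_mom)  # both head toward the minimum at 0
```

The velocity v accumulates past gradients, so momentum keeps moving through flat regions where the instantaneous gradient is small.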
Wide or Deep?

[Figure 6.6, Deep Learning book]
[Figure 6.7, Deep Learning book]
Dropout

• Standard regularization technique
• At training time, drop a percentage of the units
• Used for non-output layers
• Prevents co-adaptation / a form of bagging
• At test time, use the full network
• Normalize the weights

[Figure 7.6, Deep Learning]
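A sketch of the train/test behavior described above for one hidden layer; the layer size and keep probability are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.8                    # keep each unit with probability 0.8
h = rng.normal(size=10)         # activations of a hidden layer

# Training: each unit survives independently with probability p_keep
mask = rng.random(10) < p_keep
h_train = h * mask

# Test: keep every unit, but scale so expected activations match training
h_test = h * p_keep

print(h_train)
print(h_test)
```

Scaling by p_keep at test time is the weight normalization the slide refers to; it makes the expected test-time activation equal to the (random) training-time one.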
Neural Network Takeaways
• Very flexible models
• Composed of simple units (neurons)
• Adapt to different types of data
• (Highly) non-linear models
• E.g., can learn to order/rank inputs easily
• Scale to very large datasets
• May require “fiddling” with model architecture + optimization hyper-parameters
• Standardizing data can be very important
Where do NNs shine?
• Input is high-dimensional discrete or real-valued
• Output is discrete or real-valued, or a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human interpretability is not important
• The computation of the output based on the input has to be fast
Most tasks that consist of mapping an input vector to an output vector, and that are easy for a person to do rapidly, can be accomplished via deep learning, given sufficiently large models and sufficiently large datasets of labeled training examples.
Other tasks, that cannot be described as associating one vector to another, or that are difficult enough that a person would require time to think and reflect in order to accomplish the task, remain beyond the scope of deep learning for now.
Deep Learning — Part II, p.163 http://www.deeplearningbook.org/contents/part_practical.html
Neural Networks in Practice
In practice

• Software now derives gradients automatically
• You specify the architecture of the network
• Connection pattern
• Number of hidden layers
• Number of units per layer
• Activation functions
• Learning rate (learning rate updates)
• Dropout
• For intuitions: https://playground.tensorflow.org
A selection of standard tools (in python)
• Scikit-learn
• Machine learning toolbox
• Feed-forward neural networks
• Neural network specific tools
• PyTorch, Tensorflow
• Keras
• More specific tools for specific tasks:
• caffe for computer vision, pySpark for distributed computations, NLTK for natural language processing
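As a concrete use of one of the tools above, scikit-learn's MLPClassifier trains a feed-forward network in a few lines, here on the XOR problem from earlier in the lecture. The hyper-parameters (layer size, activation, solver) are illustrative choices.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# One hidden layer of 8 tanh units; lbfgs works well on tiny datasets
clf = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    solver="lbfgs", random_state=0, max_iter=2000)
clf.fit(X, y)
print(clf.predict(X))  # ideally recovers XOR
```

For larger datasets, the same architecture choices (layers, units, activations, learning rate) transfer to PyTorch, TensorFlow, or Keras, which also derive the gradients automatically.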