Math Recap
Jingwei Tang @ CGL, 19.02.2020
• Bottom-up: building up concepts from foundational to more advanced. This strategy has the advantage that the reader is at all times able to rely on previously learned concepts.
• Top-down: drilling down from practical needs to more basic requirements. This goal-driven approach has the advantage that the reader knows at all times why they need to work on a particular concept.
-- "Mathematics for Machine Learning"
1. Linear Algebra
2. (Brief) Vector Calculus
3. Probability Theory
Matrix Multiplications
• Let A ∈ ℝ^{m×n} and B ∈ ℝ^{n×p}. Matrix-matrix multiplication is defined as
      C_{ij} = Σ_{k=1}^{n} A_{ik} B_{kj}
  The resulting matrix is C ∈ ℝ^{m×p}.
• Vectors can be viewed as matrices, x ∈ ℝ^{n×1}, so matrix-vector multiplication is
      y_i = Σ_{j=1}^{n} A_{ij} x_j
Matrix Multiplications
• Example: matrix-matrix multiplication:
      [1 2 3]   [0  2]   [2 3]
      [3 2 1] · [1 −1] = [2 5]
                [0  1]
• Example: matrix-vector multiplication:
      [0  2]         [ 6]
      [1 −1] · [2] = [−1]
      [0  1]   [3]   [ 3]
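The two example products above can be checked with NumPy (a minimal sketch; `@` is NumPy's matrix-multiplication operator):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [3, 2, 1]])        # A in R^{2x3}
B = np.array([[0,  2],
              [1, -1],
              [0,  1]])          # B in R^{3x2}

C = A @ B                        # matrix-matrix product, C in R^{2x2}
print(C)                         # [[2 3]
                                 #  [2 5]]

x = np.array([2, 3])
print(B @ x)                     # matrix-vector product: [ 6 -1  3]
```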
Matrix Multiplications
For matrices of compatible sizes:
• Associativity: (AB)C = A(BC)
• Distributivity: (A + B)C = AC + BC and A(C + D) = AC + AD
• NO commutativity: in general, AB ≠ BA
Linear Systems
• A system of linear equations
      a_{11} x_1 + ⋯ + a_{1n} x_n = b_1
      ⋮
      a_{m1} x_1 + ⋯ + a_{mn} x_n = b_m
  can be represented through matrix-vector multiplication:
      [a_{11} ⋯ a_{1n}] [x_1]   [b_1]
      [  ⋮         ⋮  ] [ ⋮ ] = [ ⋮ ]
      [a_{m1} ⋯ a_{mn}] [x_n]   [b_m]
Linear Systems Ax = b
A ∈ ℝ^{m×n}, x ∈ ℝ^n, b ∈ ℝ^m: m equations, n variables
• m = n: square systems — can have 0, 1, or ∞ solutions.
• m < n: underdetermined systems — typically have ∞ solutions.
• m > n: overdetermined systems — e.g. linear regression: least-squares solution (min_x ‖Ax − b‖²)
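A minimal NumPy sketch of the two main cases (the overdetermined A and b are arbitrary illustrative values, not from the slides): `np.linalg.solve` handles a square system, `np.linalg.lstsq` returns the least-squares solution.

```python
import numpy as np

# Square system (m = n): 2x + 3y = 5, x + y = 3
A = np.array([[2.0, 3.0],
              [1.0, 1.0]])
b = np.array([5.0, 3.0])
sol = np.linalg.solve(A, b)
print(sol)                       # [ 4. -1.]

# Overdetermined system (m > n): least-squares solution min ||Ax - b||^2
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 4.0])
x, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(x)
```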
Linear Systems Ax = b
Square system examples:
• 2x + 3y = 5,  x + y = 3    ⟹ x = 4, y = −1
• 2x + 3y = 5,  4x + 6y = 5  ⟹ NO solutions
• 2x + 3y = 5,  4x + 6y = 10 ⟹ x = (5 − 3y)/2, ∀y ∈ ℝ

A square system Ax = b has a unique solution
⟺ A is invertible
⟺ Ax = 0 has only the trivial solution x = 0.
Invertibility and Determinant
• A matrix A ∈ ℝ^{n×n} is called invertible if there exists B ∈ ℝ^{n×n} s.t. AB = I = BA. B is then called the inverse of A, B = A^{−1}.
• A ∈ ℝ^{n×n} is invertible (nonsingular) if and only if it is square and full rank; equivalently, det(A) ≠ 0.
• det: ℝ^{n×n} → ℝ, e.g.
      det [a b; c d] = ad − bc
      det [a11 a12 a13; a21 a22 a23; a31 a32 a33]
        = a11 · det[a22 a23; a32 a33] − a12 · det[a21 a23; a31 a33] + a13 · det[a21 a22; a31 a32]
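A quick NumPy check of determinant and inverse on the square system's matrix from earlier:

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [1.0, 1.0]])

# 2x2 determinant: ad - bc = 2*1 - 3*1 = -1, nonzero, so A is invertible.
detA = np.linalg.det(A)
print(detA)                      # -1.0 (up to floating-point error)

B = np.linalg.inv(A)
print(B)                         # [[-1.  3.]
                                 #  [ 1. -2.]]
print(A @ B)                     # the identity (up to floating-point error)
```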
Eigenvalues and Eigenvectors
• Let A ∈ ℝ^{n×n} be a square matrix. λ ∈ ℝ is an eigenvalue of A and x ∈ ℝ^n \ {0} is a corresponding eigenvector if
      Ax = λx
  ⟺ (A − λI_n)x = 0 has solutions other than x = 0
  ⟺ det(A − λI_n) = 0   (a polynomial of degree n in λ)

(Recall: A is invertible, equivalently det(A) ≠ 0, ⟺ Ax = 0 has only the trivial solution x = 0.)
Eigenvalues and Eigenvectors
Example: find the eigenvalues and eigenvectors of
      A = [1    1/2]
          [1/2  1  ]
Solution:
      λ1 = 1/2,  E_{λ1} = span{(1, −1)^T},
      λ2 = 3/2,  E_{λ2} = span{(1, 1)^T}.
Eigenvalues and Eigenvectors
Eigendecomposition: collect the eigenvectors as columns of a matrix X and the eigenvalues in a diagonal matrix D:
      A [x_1 ⋯ x_n] = [x_1 ⋯ x_n] [λ_1     0 ]
                                  [    ⋱     ]
                                  [0      λ_n]
      AX = XD  ⟹  A = X D X^{−1}   (if and only if the eigenvectors of A form a basis of ℝ^n)
Eigendecomposition
Example: find the eigendecomposition of
      A = [1    1/2]
          [1/2  1  ]
From before: λ1 = 1/2, E_{λ1} = span{(1, −1)^T};  λ2 = 3/2, E_{λ2} = span{(1, 1)^T}.

A = X D X^{−1} with
      D = [1/2  0  ]    X = (1/√2) [ 1  1]    X^{−1} = (1/√2) [1  −1]
          [0    3/2]               [−1  1]                    [1   1]
The factor 1/√2 is optional; it normalizes the eigenvectors to form a set of (nice) orthonormal basis vectors.
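As a sanity check, the factors above reconstruct A, and with the 1/√2 normalization X is orthogonal, so X^{−1} = X^T:

```python
import numpy as np

A = np.array([[1.0, 0.5],
              [0.5, 1.0]])

D = np.diag([0.5, 1.5])
X = np.array([[1.0, 1.0],
              [-1.0, 1.0]]) / np.sqrt(2)   # normalized eigenvectors as columns

# Reconstruct A from its eigendecomposition.
print(X @ D @ np.linalg.inv(X))            # [[1.  0.5]
                                           #  [0.5 1. ]]

# Since A is symmetric, X is orthogonal: X^{-1} = X^T.
assert np.allclose(np.linalg.inv(X), X.T)
```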
Matrix Decompositions
• Eigendecomposition: A ∈ ℝ^{n×n}, eigenvectors of A form a basis of ℝ^n:
      A = X D X^{−1}
• QR/QU decomposition (from the Gram-Schmidt process): A ∈ ℝ^{m×n}:
      A = QR
• LU decomposition (from Gaussian elimination): A ∈ ℝ^{n×n}:
      A = LU
• Singular Value Decomposition (SVD): A ∈ ℝ^{m×n}, U ∈ ℝ^{m×m}, V ∈ ℝ^{n×n}, Σ ∈ ℝ^{m×n} diagonal:
      A = U Σ V^T
• Cholesky decomposition: A ∈ ℝ^{n×n} symmetric and positive definite:
      A = L L^T
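Most of these decompositions are available in `np.linalg`; a minimal sketch on a small symmetric positive definite matrix (an LU factorization lives in SciPy as `scipy.linalg.lu` rather than in NumPy):

```python
import numpy as np

A = np.array([[9.0, 6.0],
              [6.0, 5.0]])

# QR decomposition: Q orthogonal, R upper triangular.
Q, R = np.linalg.qr(A)
assert np.allclose(Q @ R, A)

# SVD: U, Vt orthogonal, s the singular values.
U, s, Vt = np.linalg.svd(A)
assert np.allclose(U @ np.diag(s) @ Vt, A)

# Cholesky: works here because A is symmetric positive definite.
L = np.linalg.cholesky(A)       # L is lower triangular
assert np.allclose(L @ L.T, A)
```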
Positive Definiteness of Matrices
• Symmetric: A ∈ ℝ^{n×n} is symmetric if A_{ij} = A_{ji}, ∀i, j ∈ [1, n].
  e.g. [1 2; 2 0],  [1 2 3; 2 −1 5; 3 5 0]
• Symmetric positive definite: a symmetric matrix A ∈ ℝ^{n×n} is called symmetric, positive definite if
      ∀x ∈ ℝ^n \ {0}: x^T A x > 0
  If only ≥ holds, A is called symmetric, positive semidefinite.
Positive Definiteness of Matrices
Example: find out whether the following matrix is symmetric positive definite:
      A = [9 6]
          [6 5]
Solution: x^T A x = 9x_1² + 12x_1x_2 + 5x_2² = (3x_1 + 2x_2)² + x_2² > 0 for all x ≠ 0, so A is symmetric positive definite.
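Numerically, a symmetric matrix is positive definite iff all its eigenvalues are strictly positive; a quick NumPy check of the example:

```python
import numpy as np

A = np.array([[9.0, 6.0],
              [6.0, 5.0]])

# eigvalsh is the eigenvalue routine for symmetric matrices.
eigvals = np.linalg.eigvalsh(A)
print(eigvals)                      # both eigenvalues are > 0
assert np.all(eigvals > 0)

# Equivalent practical test: Cholesky succeeds exactly for SPD matrices
# (it raises LinAlgError otherwise).
np.linalg.cholesky(A)
```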
1. Linear Algebra
2. (Brief) Vector Calculus
3. Probability Theory
Notion of Derivatives
Derivative: let f: ℝ → ℝ, x ↦ f(x). The derivative is defined as
      df/dx ≜ lim_{δx→0} ( f(x + δx) − f(x) ) / δx
Notion of Derivatives
Partial derivative: let f: ℝ^n → ℝ, x ↦ f(x), with x ∈ ℝ^n a vector of n variables. The partial derivative is defined as
      ∂f/∂x_i ≜ lim_{δx→0} ( f(x_1, …, x_i + δx, …, x_n) − f(x_1, …, x_i, …, x_n) ) / δx
Gradient: collect the partial derivatives with respect to all variables in a row vector:
      ∇f = grad f = df/dx = [∂f/∂x_1  ∂f/∂x_2  …  ∂f/∂x_n] ∈ ℝ^{1×n}
Notion of Derivatives
Jacobian: let f: ℝ^n → ℝ^m, x ↦ f(x), with x ∈ ℝ^n and f(x) ∈ ℝ^m. Stacking the gradients of all components of f(x) into a matrix:
      J = df(x)/dx = [∇f_1(x)]   [∂f_1(x)/∂x_1 ⋯ ∂f_1(x)/∂x_n]
                     [   ⋮   ] = [      ⋮               ⋮    ] ∈ ℝ^{m×n}
                     [∇f_m(x)]   [∂f_m(x)/∂x_1 ⋯ ∂f_m(x)/∂x_n]
Notion of Derivatives
• Derivative: f: ℝ → ℝ,  df/dx ∈ ℝ
• Gradient: f: ℝ^n → ℝ,  ∇f ∈ ℝ^{1×n}
• Jacobian: f: ℝ^n → ℝ^m,  J ∈ ℝ^{m×n}
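The limit definition above can be approximated numerically with finite differences; a sketch on an arbitrary illustrative function f(x) = x₁² + 3x₁x₂ (not from the slides), whose analytic gradient is [2x₁ + 3x₂, 3x₁]:

```python
import numpy as np

def f(x):
    return x[0]**2 + 3 * x[0] * x[1]

def num_grad(f, x, eps=1e-6):
    """Gradient as a 1xN row vector via central finite differences."""
    g = np.zeros((1, len(x)))
    for i in range(len(x)):
        e = np.zeros(len(x))
        e[i] = eps
        g[0, i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([1.0, 2.0])
print(num_grad(f, x))          # approx [[8. 3.]], matching the analytic gradient
```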
Chain Rule
• Real-valued functions f, g: ℝ → ℝ, x ↦ f(x), x ↦ g(x):
      df(g(x))/dx = ( df(g(x))/dg(x) ) · ( dg(x)/dx )
• Multi-variable functions g: ℝ^n → ℝ^m, x ↦ g(x), and f: ℝ^m → ℝ, g ↦ f(g):
      ∂f(g(x))/∂x = ( ∂f(g)/∂g ) · ( ∂g(x)/∂x )
         (1 × n)       (1 × m)       (m × n)
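The shapes above can be verified on a concrete pair of functions (illustrative choices, not from the slides): g: ℝ² → ℝ³ and f: ℝ³ → ℝ, so the chain rule multiplies a 1×3 gradient by a 3×2 Jacobian to give a 1×2 gradient.

```python
import numpy as np

def g(x):                                   # g: R^2 -> R^3
    return np.array([x[0] * x[1], x[0]**2, x[1]])

def f(u):                                   # f: R^3 -> R
    return u[0] + 2 * u[1] + 3 * u[2]

x = np.array([1.0, 2.0])

df_dg = np.array([[1.0, 2.0, 3.0]])         # 1x3 gradient of f
dg_dx = np.array([[x[1], x[0]],             # 3x2 Jacobian of g at x
                  [2 * x[0], 0.0],
                  [0.0, 1.0]])

grad = df_dg @ dg_dx                        # chain rule: 1x2 row vector
print(grad)                                 # [[6. 4.]]
```

Direct differentiation of f(g(x)) = x₁x₂ + 2x₁² + 3x₂ gives the same [x₂ + 4x₁, x₁ + 3] = [6, 4] at x = (1, 2).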
More Complicated Derivatives
• Derivatives of matrices
• Higher-order derivatives
[See references]
1. Linear Algebra
2. (Brief) Vector Calculus
3. Probability Theory
Probability Space (Ω, 𝒜, P)
• Sample space Ω: the set of all possible outcomes of an experiment
• Event space 𝒜: the space of potential results of the experiment (in the discrete setting, a collection of all subsets of Ω)
• Probability P: with each event A ∈ 𝒜, we associate a number P(A) that measures the "degree of belief" that the event will occur
• Random variable X: a function/mapping X: Ω → 𝒯. We are interested in the probabilities of elements of 𝒯.
Probability Space (Ω, 𝒜, P)
Example: tossing coins
• Experiment: tossing a coin two consecutive times.
• Sample space: Ω = {hh, tt, ht, th}  (h for heads and t for tails)
• Random variable: X maps each outcome to the number of heads, 𝒯 = {0, 1, 2}:
      X(hh) = 2,  X(tt) = 0,  X(ht) = X(th) = 1
• Probabilities (on 𝒯):
      P(X = 0) = 0.25,  P(X = 1) = 0.5,  P(X = 2) = 0.25
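The coin-tossing example can be simulated to recover these probabilities empirically (a sketch using NumPy's random generator):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two fair coin flips, repeated many times; X = number of heads.
flips = rng.integers(0, 2, size=(100_000, 2))   # 0 = tails, 1 = heads
X = flips.sum(axis=1)

for k in [0, 1, 2]:
    # Empirical P(X = k), close to 0.25, 0.5, 0.25 respectively.
    print(k, np.mean(X == k))
```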
PDF and CDF
• Probability density function (PDF):
      f: ℝ → ℝ s.t. ∀x ∈ ℝ, f(x) ≥ 0 and ∫_ℝ f(x) dx = 1
  We can associate a random variable X with the PDF:
      P(a ≤ X ≤ b) = ∫_a^b f(x) dx
• Cumulative distribution function (CDF): F_X(x) = P(X ≤ x)
      F_X(x) = ∫_{−∞}^{x} f(z) dz,    f(x) = dF_X(x)/dx
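These relations can be checked numerically on a concrete distribution (an exponential with rate 1, an illustrative choice): its pdf is f(x) = e^(−x) for x ≥ 0 and its CDF is F(x) = 1 − e^(−x).

```python
import numpy as np

f = lambda x: np.exp(-x)          # pdf of Exp(1) on x >= 0
F = lambda x: 1.0 - np.exp(-x)    # its cdf

# The pdf integrates to 1 (crude Riemann-sum check on [0, 50]).
xs = np.linspace(0.0, 50.0, 200_001)
dx = xs[1] - xs[0]
mass = np.sum(f(xs)) * dx
print(mass)                       # approximately 1

# P(a <= X <= b) = integral of the pdf over [a, b] = F(b) - F(a)
a, b = 0.5, 2.0
xs = np.linspace(a, b, 100_001)
dx = xs[1] - xs[0]
prob = np.sum(f(xs)) * dx
print(prob, F(b) - F(a))          # both approximately 0.471
```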
Joint Distribution
• Let X, Y be two random variables over the same probability space. The joint distribution is defined as
      F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y)
  with joint density
      f(x, y) = ∂²F_{X,Y}(x, y) / ∂x∂y
• Marginalization:
      f_X(x) = ∫_{−∞}^{∞} f(x, y) dy,    F_X(x) = F_{X,Y}(x, ∞)
Independence
• Two events A and B are independent if
      P(A ∩ B) = P(A) P(B)
• Two random variables X and Y are independent if their joint distribution function factorizes, i.e.
      F_{X,Y}(x, y) = F_X(x) F_Y(y)
Conditional Probability
Probability of A given that B has occurred:
      P(A | B) = P(A ∩ B) / P(B)
Laws regarding conditional probability:
• Law of total probability: P(B) = Σ_{i=1}^{n} P(B | A_i) P(A_i)
• Bayes rule: P(A | B) = P(B | A) P(A) / P(B)
• Chain rule: P(A_1, …, A_n) = P(A_1) P(A_2 | A_1) P(A_3 | A_1, A_2) ⋯ P(A_n | A_1, …, A_{n−1})
[See references]
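Bayes rule and the law of total probability combine nicely in a small worked example (a made-up diagnostic-test scenario; all numbers are illustrative only):

```python
# Prior and test characteristics (illustrative numbers):
p_d = 0.01                 # P(D): prevalence of the condition
p_pos_given_d = 0.95       # P(+ | D): true positive rate
p_pos_given_not_d = 0.05   # P(+ | not D): false positive rate

# Law of total probability: P(+) = P(+|D)P(D) + P(+|not D)P(not D)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes rule: P(D | +) = P(+|D) P(D) / P(+)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(p_d_given_pos)       # about 0.161: a positive test is far from certain
```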
Expectation
• The expected value of a function g: ℝ → ℝ of a univariate continuous random variable X ~ f(x) is given by
      E_X[g(x)] = ∫_𝒳 g(x) f(x) dx
• The mean of a random variable X is defined as
      E_X[x] = ∫_𝒳 x f(x) dx
• Linearity: E_X[a g(x) + b h(x)] = a E_X[g(x)] + b E_X[h(x)]
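Expectations can be estimated by Monte Carlo sampling; a sketch for x ~ Uniform(0, 1) and g(x) = x² (illustrative choices), where the exact value is ∫₀¹ x² dx = 1/3:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.uniform(0.0, 1.0, size=1_000_000)
print(np.mean(x**2))       # Monte Carlo estimate of E[x^2], approximately 1/3

# Linearity: E[2*g(x) + 3*h(x)] = 2*E[g(x)] + 3*E[h(x)], here with h(x) = x.
print(np.mean(2 * x**2 + 3 * x), 2 * np.mean(x**2) + 3 * np.mean(x))
```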
(Co)variance
• Covariance between two univariate random variables X, Y ∈ ℝ:
      Cov_{X,Y}[x, y] = E_{X,Y}[(x − E_X[x])(y − E_Y[y])] = E[xy] − E[x]E[y]
• Variance is the covariance with itself:
      V[x] = Cov_{X,X}[x, x] = E_X[(x − E_X[x])²] = E_X[x²] − E_X[x]²
• NOT linear:
      V[x + y] = V[x] + V[y] + Cov[x, y] + Cov[y, x]
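A quick empirical check of both identities on correlated samples (y is built from x with coefficient 0.5, an illustrative construction, so Cov[x, y] ≈ 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=1_000_000)
y = 0.5 * x + rng.normal(size=1_000_000)   # y depends on x, so Cov[x,y] != 0

# Cov[x, y] = E[xy] - E[x]E[y]
cov = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov)                                  # approximately 0.5

# V[x + y] = V[x] + V[y] + 2 Cov[x, y]
print(np.var(x + y), np.var(x) + np.var(y) + 2 * cov)
```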
Gaussian (Normal) Distribution
• The Gaussian distribution has the density
      p(x | μ, σ²) = 1/√(2πσ²) · exp( −(x − μ)² / (2σ²) )
  denoted as X ~ N(μ, σ²).
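The density formula translates directly into code; a minimal sketch that also checks it integrates to 1:

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """Density of N(mu, sigma^2) at x."""
    return np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(gaussian_pdf(0.0))               # 1/sqrt(2*pi), about 0.3989

# The density integrates to 1 (crude Riemann-sum check on [-10, 10]).
xs = np.linspace(-10.0, 10.0, 200_001)
dx = xs[1] - xs[0]
print(np.sum(gaussian_pdf(xs)) * dx)   # approximately 1
```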
PDF, CDF, joint distribution, expectation, covariance, and the Gaussian distribution can all be extended, with some effort, to higher dimensions. [See references]
References
• Mathematics for Machine Learning
• The Matrix Cookbook
End of Presentation
Start of Q&A Session