Introduction | Neural Networks | Splines and NNs* | Training | Interpretability* | Equity Factor Modeling | Recurrent Neural Networks | Advanced RNNs | Convolutional Neural Networks | Autoencoders | Experiments
Deep Learning for Financial Econometrics

Matthew Dixon
[email protected]
Illinois Institute of Technology

Machine Learning in Finance Bootcamp, Fields Institute
University of Toronto, September 26th, 2019

Code: https://github.com/mfrdixon/ML_Finance_Codes

Slides with an * in the title are advanced material for mathematicians. ** denotes advanced programming details.
Overview
• Deep Learning for cross-sectional data
  • Background: review of supervised machine learning
  • Introduction to feedforward neural networks
  • Related approximation and learning theory
  • Training and back-propagation
  • Interpretability with factor modeling example
• Deep Learning for time-series data prediction
  • Understand the conceptual formulation and statistical explanation of RNNs
  • Understand the different types of RNNs, including plain RNNs, GRUs and LSTMs
  • Understand the conceptual formulation and statistical explanation of CNNs
  • Understand 1D CNNs for time series prediction
  • Understand how dilated CNNs capture multiple scales
Overview
• Deep Learning for dimensionality reduction
  • Understand principal component analysis for dimension reduction
  • Understand how to formulate a linear autoencoder
  • Understand how a linear autoencoder performs PCA
Quick Quiz
Which of the following statements are true?
A Machine learning is different from statistics: machine learning methods assume the data generation process is unknown
B The outputs from neural network classifiers are the probabilities of an input belonging to a class
C We need multiple hidden layers in the neural network to capture non-linearity
D Neural networks provide no statistical interpretability; they are 'black boxes' and the importance of the features is unknown
E Neural networks can only be fitted to stationary data
Quick Quiz
Which of the following statements are true?
A Machine learning is different from statistics: machine learning methods assume the data generation process is unknown [True]
B The outputs from neural network classifiers are the probabilities of an input belonging to a class [True]
C We need multiple hidden layers in the neural network to capture non-linearity [False(3)]
D Neural networks provide no statistical interpretability; they are 'black boxes' and the importance of the features is unknown [False]
E Neural networks can only be fitted to stationary data [False]

(3) The universal representation theorem suggests that we only need one hidden layer: Andrei Nikolaevich Kolmogorov, "On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition," AMS Translation, 28(2):55-59, 1963.
Quick Overview
Property        | Statistical Inference                | Machine Learning
----------------|--------------------------------------|-----------------------------------------------------------
Goal            | Causal models with explanatory power | Prediction performance, often with limited explanatory power
Data            | The data is generated by a model     | The data generation process is unknown
Framework       | Probabilistic                        | Algorithmic & Probabilistic
Expressability  | Typically linear                     | Non-linear
Model selection | Based on information criteria        | Numerical optimization
Scalability     | Limited to lower dimensional data    | Scales to higher dimensional input data
Robustness      | Prone to over-fitting                | Designed for out-of-sample performance
Supervised Machine Learning
• Machine learning addresses a fundamental prediction problem: construct a nonlinear predictor, Ŷ(X), of an output, Y, given a high dimensional input matrix X = (X_1, ..., X_P) of P variables.
• Machine learning can be simply viewed as the study and construction of an input-output map of the form
  Y = F(X) where X = (X_1, ..., X_P).
• The output variable, Y, can be continuous, discrete or mixed.
• For example, in a classification problem, F: X → Y where Y ∈ {1, ..., K} and K is the number of categories.
• We will denote the i-th observation of the data D := (X, Y) as (x_i, y_i) - this is a "feature vector" and response (a.k.a. "label"). Note that lower case refers to a specific observation in D.
Taxonomy of Most Popular Neural Network Architectures
feed forward | auto-encoder | convolution | recurrent | long / short term memory | neural Turing machines

Figure: Most commonly used deep learning architectures for modeling. Source: http://www.asimovinstitute.org/neural-network-zoo
FFWD Neural Networks

Neural network model:

  Y = F_{W,b}(X) + ε,

where F_{W,b}: R^p → R^d is a deep neural network with L layers:

  Ŷ(X) := F_{W,b}(X) = f^{(L)}_{W^{(L)},b^{(L)}} ∘ ⋯ ∘ f^{(1)}_{W^{(1)},b^{(1)}}(X).

• W = (W^{(1)}, ..., W^{(L)}) and b = (b^{(1)}, ..., b^{(L)}) are weight matrices and bias vectors.
• For any W^{(i)} ∈ R^{m×n}, we can write the matrix as n column m-vectors, W^{(i)} = [w^{(i)}_1, ..., w^{(i)}_n].
• Denote each weight as w^{(ℓ)}_{i,j} := (W^{(ℓ)})_{i,j}.
• X is an N × p matrix of observations in R^p.
• No assumptions are made on the distribution of the error, ε, other than that it is independently distributed.
Deep Neural Networks*
• Let σ: R → B ⊂ R denote a continuous, monotonically increasing function.
• A function f^{(ℓ)}_{W^{(ℓ)},b^{(ℓ)}}: R^n → R^m given by f(v) = W^{(ℓ)}σ^{(ℓ−1)}(v) + b^{(ℓ)}, with W^{(ℓ)} ∈ R^{m×n} and b^{(ℓ)} ∈ R^m, is a semi-affine function in v, e.g. f(v) = W tanh(v) + b.
• σ(·) are known activation functions applied to the output of the previous layer, e.g. σ(x) := max(x, 0).
• F_{W,b}(X) is a composition of semi-affine functions.
• If all the activation functions are linear, F_{W,b} is just linear regression, regardless of the number of layers L.
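The last bullet can be checked numerically: with identity activations, two stacked affine layers collapse to a single affine map. A small sketch (dimensions and weights are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

X = rng.normal(size=(6, 2))
# Two "layers" with identity activation...
deep = (X @ W1.T + b1) @ W2.T + b2
# ...equal a single affine map with W = W2 W1 and b = W2 b1 + b2.
flat = X @ (W2 @ W1).T + (W2 @ b1 + b2)
print(np.allclose(deep, flat))  # True
```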
Geometric Interpretation of FFWD Neural Networks
No hidden layers One hidden layer Two hidden layers
Geometric Interpretation of FFWD Neural Networks
Half-Moon Dataset
No hidden layers One hidden layer Two hidden layers
Why do we need more Neurons?
Geometric Interpretation of Neural Networks
25 hidden units 50 hidden units 75 hidden units
Figure: The number of hidden units is adjusted according to the requirements of the classification problem and can be very high for data sets which are difficult to separate.
Geometric Interpretation of Neural Networks
Figure: Hyperplanes defined by three activated neurons in the hidden layer.
Universal Representation Theorem [1989]*
• Let C^p := {f: R^p → R | f is continuous} be the set of continuous functions from R^p to R.
• Denote Σ^p(σ) as the class of functions
  {F_{W,b}: R^p → R : F_{W,b}(x) = W^{(2)}σ(W^{(1)}x + b^{(1)}) + b^{(2)}}.
• Consider Ω := (0, 1] and let C_0 be the collection of all open intervals in (0, 1].
• Then σ(C_0), the σ-algebra generated by C_0, is the Borel σ-algebra, B((0, 1]).
• Let M^p := {f: R^p → R | f is Borel measurable} denote the set of all Borel measurable functions from R^p to R.
• Denote the Borel σ-algebra of R^p as B^p.
Universal Representation Theorem*
Theorem (Hornik, Stinchcombe & White, 1989)
For every monotonically increasing activation function σ, every input dimension size p, and every probability measure µ on (R^p, B^p), Σ^p(σ) is uniformly dense on compacta in C^p and ρ_µ-dense in M^p.
Shattering
• What is the maximum number of points that can be arranged so that F_{W,b}(X) shatters them?
• i.e. for all possible assignments of binary labels to those points, does there exist a (W, b) such that F_{W,b} makes no errors when classifying that set of data points?
• Every distinct pair of points is separable with the linear threshold perceptron, so every data set of size 2 is shattered by the perceptron.
• However, the linear threshold perceptron is incapable of shattering triplets, for example X ∈ {−0.5, 0, 0.5} with labels Y = (0, 1, 0).
• In general, the VC dimension of the class of halfspaces in R^k is k + 1. For example, a 2-d plane shatters any three points, but cannot shatter four points.
• This maximum number of points is the Vapnik-Chervonenkis (VC) dimension and is one characterization of the learnability of a classifier.
Example of Shattering over the Interval [−1, 1]

Right point activated | Left point activated | None activated | Both activated

Figure: For the points {−0.5, 0.5}, there are weights and biases that activate only one of them (W = 1, b = 0 or W = −1, b = 0), none of them (W = 1, b = −0.75), and both of them (W = 1, b = 0.75).
VC Dimension Example
• Determine the VC dimension of the indicator function class over Ω = [0, 1]:
  F(x) = {f: Ω → {0, 1} | f(x) = 1_{x∈[t1,t2)} or f(x) = 1 − 1_{x∈[t1,t2)}, t1 < t2 ∈ Ω}.
• Suppose there are three points x_1, x_2 and x_3, and assume x_1 < x_2 < x_3 without loss of generality. Every possible binary labeling of the points is reachable, therefore we assert that VC(F) ≥ 3.
• With four points x_1, x_2, x_3 and x_4 (assumed increasing as always), you cannot, for example, label x_1 and x_3 with the value 1 and x_2 and x_4 with the value 0. So VC(F) = 3.
• A related, but alternative, approach to the VC dimension is a numerical analysis perspective.
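This argument can be verified by brute-force enumeration. The sketch below (an illustration, not from the slides) sweeps candidate thresholds t1 < t2 for both 1_[t1,t2) and its complement, collects every labeling reachable on a point set, and confirms that three points are shattered while the alternating labeling of four points is not:

```python
def reachable_labelings(xs):
    """Labelings of the points xs achievable by f = 1_[t1,t2) or 1 - 1_[t1,t2)."""
    xs = sorted(xs)
    # candidate thresholds: strictly below, at, between and above the points
    cands = [xs[0] - 1.0] + list(xs) + \
            [(a + b) / 2.0 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    out = set()
    for t1 in cands:
        for t2 in cands:
            if t1 < t2:
                lab = tuple(int(t1 <= x < t2) for x in xs)
                out.add(lab)                          # f = 1_[t1,t2)
                out.add(tuple(1 - v for v in lab))    # its complement
    return out

three = reachable_labelings([0.1, 0.5, 0.9])
four = reachable_labelings([0.1, 0.4, 0.6, 0.9])
print(len(three) == 8)          # True: all 2^3 labelings are reachable, so VC(F) >= 3
print((1, 0, 1, 0) in four)     # False: the alternating labeling is unreachable
```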
When is a neural network a spline?*
• Under certain choices of the activation function, we can construct NNs which are piecewise polynomial interpolants referred to as 'splines'.
• Let f: Ω → R be any Lipschitz continuous function over a bounded domain Ω ⊂ R^p:
  |f(x′) − f(x)| ≤ L|x′ − x|, ∀x, x′ ∈ Ω,
  for some constant L ≥ 0.
• Suppose that the values f_k := f(x_k) are known only at distinct points {x_k}_{k=1}^n in Ω.
• Form an orthogonal basis over Ω to give the interpolant
  f̂(x) = Σ_{k=1}^n φ_k(x) f_k, ∀x ∈ Ω,
  where the {φ_k(x)}_{k=1}^n are piecewise constant (PC) orthogonal basis functions, φ_i(x_j) = δ_ij (Kronecker delta).
When is a neural network a spline?*

• If the data points {x_k}_{k=1}^n are (almost) uniformly distributed in the bounded domain Ω, then the error of our PC interpolant scales as
  ‖f − f̂‖_{L²_{p,µ}(Ω)} ∼ O(n^{−1/p}).   (1)
Piecewise basis functions as differences of Heaviside functions*

• Heaviside functions (a.k.a. step functions) over R are defined as
  H(x) = 1 if x ≥ 0, and 0 if x < 0.
• Choose the domain to be the unit interval in R and restrict the data to a uniform grid of width 2ε so that x_k = (2k − 1)ε, k = 1, ..., n.
• Construct a piecewise constant basis from the difference of Heaviside functions:
  φ_k(x) = H(x − (x_k − ε)) − H(x − (x_k + ε)).
• The basis functions are indicator functions, φ_k = 1_{[x_k−ε, x_k+ε)}, ∀k = 1, ..., n, with compact support over [x_k − ε, x_k + ε) and centered about x_k.
Piecewise constant basis functions*

Figure: The first three piecewise constant basis functions, φ_1(x), φ_2(x) and φ_3(x), plotted over x ∈ [0, 1]; each is produced by the difference of neighboring step functions, φ_k(x) = H(x − (x_k − ε)) − H(x − (x_k + ε)).
Single Layer NNs Activated with Heaviside functions*

• Key Idea: Choose the biases b^{(1)}_k = −2(k − 1)ε and weights W^{(1)} = 1, then
  H(W^{(1)}x + b^{(1)}) = [H(x), H(x − 2ε), ..., H(x − 2(k − 1)ε), ..., H(x − 2(n − 1)ε)]^T.
• The two layer NN approximation, f̂(x) = W^{(2)}H(W^{(1)}x + b^{(1)}), is a PC interpolant when
  W^{(2)} = [f(x_1), f(x_2) − f(x_1), ..., f(x_n) − f(x_{n−1})].
Single Layer NNs Activated with Heaviside functions*
• This single layer neural network gives the following output:
  f̂(x) = f(x_1),  x ≤ 2ε,
        = f(x_2),  2ε < x ≤ 4ε,
          ...
        = f(x_k),  2(k − 1)ε < x ≤ 2kε,
          ...
        = f(x_n),  2(n − 1)ε < x ≤ 2nε.
• f̂(x) is a piecewise constant interpolant over [0, 1].
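The construction above can be reproduced in a few lines of NumPy. The sketch below (grid size n = 50 and the target sin(2πx) are illustrative choices consistent with the figure that follows) builds the Heaviside-activated network and checks that the interpolation error is bounded by Lε:

```python
import numpy as np

def heaviside_nn(x, f, n):
    """Single-hidden-layer NN with Heaviside activations reproducing the
    piecewise constant interpolant of f on a uniform grid of n cells."""
    eps = 1.0 / (2 * n)
    xk = (2 * np.arange(1, n + 1) - 1) * eps      # grid points x_k = (2k - 1) eps
    b1 = -2 * np.arange(n) * eps                   # biases b_k^(1) = -2(k - 1) eps
    H = (np.add.outer(x, b1) >= 0).astype(float)   # H(W^(1) x + b^(1)), with W^(1) = 1
    fk = f(xk)
    W2 = np.concatenate(([fk[0]], np.diff(fk)))    # telescoping output weights
    return H @ W2

f = lambda x: np.sin(2 * np.pi * x)
n = 50
x = np.linspace(0, 1, 1000, endpoint=False)
err = np.max(np.abs(heaviside_nn(x, f, n) - f(x)))
L = 2 * np.pi                                      # Lipschitz constant of sin(2 pi x)
print(err <= L / (2 * n))                          # True: error bounded by L * eps
```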
Single Layer NNs Activated with Heaviside functions*

Function values | Absolute error

Figure: The approximation of sin(2πx) using gridded input data and Heaviside activation functions. Left panel: function values over x ∈ [0, 1]; right panel: the absolute error |f(x) − f̂(x)|.
Single Layer NNs Activated with Heaviside functions *
Theorem
|f(x) − f̂(x)| ≤ δ, ∀x ∈ [0, 1], if there are at least n = ⌊1/(2ε) + 1⌋ hidden units, where δ := Lε.

Proof.
Since x_k = (2k − 1)ε, we have that f̂(x) = f(x_k) over the interval [x_k − ε, x_k + ε], which is the support of φ_k(x). By the Lipschitz continuity of f, the worst case error occurs at the edge of an interval, a distance ε from x_k, so that
  |f(x) − f̂(x)| = |f(x) − f(x_k)| ≤ L|x − x_k| ≤ Lε = δ.
Training with Regularization under IID noise
• With a training set D = {x_i, y_i}_{i=1}^N, solve the constrained optimization
  (Ŵ, b̂) = arg min_{(W,b)} (1/N) Σ_{i=1}^N L(y_i, Ŷ(x_i)).
• Often, the L2-norm for a traditional least squares problem is chosen as an error measure, and we then minimize the loss function combined with a penalty:
  L(y_i, ŷ_i) = ‖y_i − ŷ_i‖²₂ + λ₁‖W‖₁.
• ∇L is given in closed form by the chain rule and, through back-propagation, each layer's weights W^{(ℓ)} and biases b^{(ℓ)} are fitted with stochastic gradient descent.
Equivalence with OLS estimation
• Without activation and no hidden layers, the optimal parameters β̂ := [b̂, Ŵ] are an OLS estimator.
• If we assume that the errors ε_1, ε_2, ..., ε_N are white noise and let y := Y ∈ R so that the linear model is
  y = Xβ + ε,
then
  β̂ = (X′X)^{−1}X′y,
where the augmented matrix is
  X = [1  x_{1,1}  ...  x_{1,p}
       ⋮    ⋮     ⋱      ⋮
       1  x_{N,1}  ...  x_{N,p}].
Training
• Solve the constrained optimization
  (Ŵ, b̂) = arg min_{W,b} (1/N) Σ_{i=1}^N L(y_i, Ŷ_{W,b}(x_i)).
• If Y is continuous, then the L2-norm for a traditional least squares problem is chosen as an error measure, and we then minimize the loss function
  L(Y, Ŷ(X)) = ‖Y − Ŷ‖²₂.
• A 1-of-K encoding is used for a categorical response, so that Y is a K-binary vector, Y ∈ [0, 1]^K, and the value k is represented as Y_k = 1 and Y_j = 0, ∀j ≠ k. The loss function is the cross-entropy term
  L(Y, Ŷ(X)) = −Y^T ln Ŷ.
Derivative of the softmax function
• Using the quotient rule, f′(x) = [g′(x)h(x) − h′(x)g(x)] / [h(x)]², the derivative of σ := σ(x) can be written as:
  ∂σ_i/∂x_i = [exp(x_i)‖exp(x)‖₁ − exp(x_i)exp(x_i)] / ‖exp(x)‖₁²   (2)
            = [exp(x_i)/‖exp(x)‖₁] · [‖exp(x)‖₁ − exp(x_i)] / ‖exp(x)‖₁   (3)
            = σ_i(1 − σ_i)   (4)
Derivative of the softmax function
• For the case i ≠ j, the derivative is:
  ∂σ_i/∂x_j = [0 − exp(x_i)exp(x_j)] / ‖exp(x)‖₁²   (5)
            = −[exp(x_j)/‖exp(x)‖₁] · [exp(x_i)/‖exp(x)‖₁]   (6)
            = −σ_jσ_i   (7)
• This can be written compactly as ∂σ_i/∂x_j = σ_i(δ_ij − σ_j).
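The compact form σ_i(δ_ij − σ_j) is easy to validate against finite differences. In the sketch below, the test point and the max-subtraction for numerical stability are implementation choices, not from the slides:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())            # subtract max for numerical stability
    return e / e.sum()

x = np.array([0.2, -1.0, 0.7])
s = softmax(x)
J = np.diag(s) - np.outer(s, s)        # J_ij = sigma_i (delta_ij - sigma_j)

h = 1e-6                               # central finite differences
J_num = np.empty((3, 3))
for j in range(3):
    e_j = np.zeros(3)
    e_j[j] = h
    J_num[:, j] = (softmax(x + e_j) - softmax(x - e_j)) / (2 * h)
print(np.allclose(J, J_num, atol=1e-8))   # True
```

Since the softmax outputs sum to one, each column of the Jacobian sums to zero, which is a useful extra sanity check.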
Updating the Weight Matrices
• In machine learning, the goal is to find the weights and biases given an input X. We can express Ŷ as a function of the weight matrix W ∈ R^{K×M} and bias b ∈ R^K so that
  Ŷ(W, b) = σ ∘ I(W, b),
where the input function I: R^{K×M} × R^K → R^K is of the form I(W, b) := WX + b.
• Applying the multivariate chain rule to give the Jacobian of Ŷ(W, b) gives
  ∇Ŷ(W, b) = ∇(σ ∘ I)(W, b)   (8)
            = ∇σ(I(W, b)) · ∇I(W, b)   (9)
Updating the Weight Matrices
• Recall that the cross-entropy function is given by
  L(Y, Ŷ(X)) = −Y^T ln Ŷ.
• Since Y is a constant, we can express the cross-entropy as a function of (W, b):
  L(W, b) = L ∘ σ(I(W, b)).
• Applying the multivariate chain rule gives
  ∇L(W, b) = ∇(L ∘ σ)(I(W, b))
            = ∇L(σ(I(W, b))) · ∇σ(I(W, b)) · ∇I(W, b).
Back-propagation
• Use stochastic gradient descent to find the minimum
  (Ŵ, b̂) = arg min_{W,b} (1/N) Σ_{i=1}^N L(y_i, ŷ_i).
• Because of the compositional form of the model, the gradient can be easily derived using the chain rule for differentiation.
• This can be computed by a forward and then a backward sweep ('back-propagation') over the network, keeping track only of quantities local to each neuron.
Back-propagation
Forward Pass:
• Set Z^{(0)} = X and for ℓ = 1, ..., L set
  Z^{(ℓ)} = f^{(ℓ)}_{W^{(ℓ)},b^{(ℓ)}}(Z^{(ℓ−1)}) = σ^{(ℓ)}(W^{(ℓ)}Z^{(ℓ−1)} + b^{(ℓ)}).

Back-propagation:
• Define the back-propagation error δ^{(ℓ)} := ∇_{b^{(ℓ)}}L. Given δ^{(L)} = Ŷ − Y, for ℓ = L − 1, ..., 1 set
  δ^{(ℓ)} = (∇_{I^{(ℓ)}}σ^{(ℓ)})(W^{(ℓ+1)})^T δ^{(ℓ+1)},   (10)
  ∇_{W^{(ℓ)}}L = δ^{(ℓ)} ⊗ Z^{(ℓ−1)},   (11)
where ⊗ is the outer product of two vectors.
Updating the weights
• The weights and biases are updated according to the expressions:
  ΔW^{(ℓ)} = −γ∇_{W^{(ℓ)}}L = −γδ^{(ℓ)} ⊗ Z^{(ℓ−1)},   (12)
  Δb^{(ℓ)} = −γδ^{(ℓ)},   (13)
where γ is a learning rate parameter. See the back-propagation notebook example and the Appendix of the Chapter 2 lecture notes for more details.
• Mini-batch or offline updates involve using many observations of X at the same time. The batch size refers to the number of observations of X used in each pass.
• An epoch refers to a round-trip (forward + backward sweep) through all training examples.
Back-propagation example with three layers
• Suppose that a feed-forward network classifier has two sigmoid-activated hidden layers and a softmax-activated output layer.
• After a forward pass, the values {Z^{(ℓ)}}_{ℓ=1}^3 are stored and the error Ŷ − Y, where Ŷ = Z^{(3)}, is calculated for one observation of X.
• The back-propagation error and weight updates in the final layer are evaluated:
  δ^{(3)} = Ŷ − Y,
  ∇_{W^{(3)}}L = δ^{(3)} ⊗ Z^{(2)}.
From Equations 10 and 11, we update the back-propagation error and weight updates for Hidden Layer 2:
  δ^{(2)} = Z^{(2)}(1 − Z^{(2)}) ⊙ (W^{(3)})^T δ^{(3)},
  ∇_{W^{(2)}}L = δ^{(2)} ⊗ Z^{(1)}.
Repeating for Hidden Layer 1:
  δ^{(1)} = Z^{(1)}(1 − Z^{(1)}) ⊙ (W^{(2)})^T δ^{(2)},
  ∇_{W^{(1)}}L = δ^{(1)} ⊗ X.
We update the weights and biases using Equations 12 and 13, so that b^{(3)} → b^{(3)} − γδ^{(3)}, W^{(3)} → W^{(3)} − γδ^{(3)} ⊗ Z^{(2)}, and repeat for the other weight-bias pairs, (W^{(ℓ)}, b^{(ℓ)})_{ℓ=1}^2.
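The three-layer example can be coded directly from Equations 10-13. The sketch below (layer sizes, random seed and the single training point are illustrative assumptions) runs a forward pass, back-propagates δ^(3), δ^(2), δ^(1), and verifies the resulting gradient of the cross-entropy against a finite-difference check:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(x, Ws, bs):
    """Forward pass, storing Z^(0) = x and the activations Z^(1..3)."""
    Z = [x]
    for l, (W, b) in enumerate(zip(Ws, bs)):
        I = W @ Z[-1] + b
        Z.append(softmax(I) if l == len(Ws) - 1 else sigmoid(I))
    return Z

def loss(x, y, Ws, bs):
    return -y @ np.log(forward(x, Ws, bs)[-1])      # cross-entropy -Y^T ln Yhat

rng = np.random.default_rng(3)
sizes = [4, 5, 5, 3]                                 # input, two hidden layers, K = 3
Ws = [0.5 * rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
x = rng.normal(size=4)
y = np.array([0.0, 1.0, 0.0])                        # 1-of-K label

Z = forward(x, Ws, bs)
delta = Z[-1] - y                                    # delta^(3) = Yhat - Y
grads = []
for l in (2, 1, 0):
    grads.insert(0, np.outer(delta, Z[l]))           # eq (11): outer(delta, Z^(l))
    if l > 0:                                        # eq (10) for sigmoid layers
        delta = Z[l] * (1 - Z[l]) * (Ws[l].T @ delta)

# finite-difference check on one entry of W^(1)
h = 1e-6
Wp = [W.copy() for W in Ws]; Wp[0][0, 0] += h
Wm = [W.copy() for W in Ws]; Wm[0][0, 0] -= h
num = (loss(x, y, Wp, bs) - loss(x, y, Wm, bs)) / (2 * h)
print(abs(num - grads[0][0, 0]) < 1e-6)              # True
```

An SGD step would then be `Ws[l] -= gamma * grads[l]`, matching Equation 12.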
Stopping Criteria**
We can use callbacks in Keras to terminate the training based on stopping criteria.

from keras.callbacks import EarlyStopping
...
callbacks = [EarlyStopping(monitor='loss', min_delta=0, patience=2)]
...
model.fit(X, Y, callbacks=callbacks)
Explanatory Power of Neural Networks
• In a linear regression model
  Y = F_β(X) := β_0 + β_1X_1 + ⋯ + β_KX_K,
the sensitivities are
  ∂_{X_i}Y = β_i.
• In a FFWD neural network, we can use the chain rule:
  ∂_{X_i}Ŷ = ∂_{X_i}F_{W,b}(X) = ∂_{X_i} f^{(L)}_{W^{(L)},b^{(L)}} ∘ ⋯ ∘ f^{(1)}_{W^{(1)},b^{(1)}}(X).
• For example, with one hidden layer, σ(x) := tanh(x) and f^{(ℓ)}_{W^{(ℓ)},b^{(ℓ)}}(X) := σ(I^{(ℓ)}) := σ(W^{(ℓ)}X + b^{(ℓ)}):
  ∂_{X_j}Ŷ = Σ_i w^{(2)}_i (1 − σ²(I^{(1)}_i)) w^{(1)}_{i,j}, where ∂_x σ(x) = 1 − σ²(x).
Explanatory Power of Neural Networks
• Or in matrix form, using the Jacobian(4) J(I^{(1)}) = D(I^{(1)})W^{(1)} of some continuous σ:
  ∂_X Ŷ = W^{(2)}J(I^{(1)}) = W^{(2)}D(I^{(1)})W^{(1)},
where D_{i,i}(I) = σ′(I_i) and D_{i,j≠i} = 0.
• Bounds on the sensitivities are given by the product of the weight matrices:
  min(W^{(2)}W^{(1)}, 0) ≤ ∂_X Ŷ ≤ max(W^{(2)}W^{(1)}, 0).

(4) When σ is an identity function, the Jacobian is J(I^{(1)}) = W^{(1)}.
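The matrix form of the sensitivities can be checked numerically for a one-hidden-layer tanh network (sizes and random weights below are illustrative). The sketch compares W^(2)D(I^(1))W^(1) against central finite differences, and also checks the termwise bounds implied by 0 ≤ σ′ ≤ 1:

```python
import numpy as np

rng = np.random.default_rng(4)
W1, b1 = rng.normal(size=(8, 3)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

def net(x):                                    # one tanh hidden layer
    return W2 @ np.tanh(W1 @ x + b1) + b2

x = rng.normal(size=3)
I1 = W1 @ x + b1
D = np.diag(1 - np.tanh(I1) ** 2)              # D_ii = sigma'(I_i) for tanh
J = (W2 @ D @ W1).ravel()                      # dYhat/dX = W^(2) D(I^(1)) W^(1)

h = 1e-6
J_num = np.array([(net(x + h * e) - net(x - h * e)) / (2 * h)
                  for e in np.eye(3)]).ravel()
print(np.allclose(J, J_num, atol=1e-6))        # True

# since 0 <= sigma'(I_i) <= 1, each sensitivity lies between the sums of the
# negative and positive parts of the terms W2_i * W1_ij
terms = W2.ravel()[:, None] * W1
lower, upper = terms.clip(max=0).sum(0), terms.clip(min=0).sum(0)
print(bool(np.all(lower <= J + 1e-12) and np.all(J <= upper + 1e-12)))  # True
```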
Example: Step test
• The model is trained on the following data generation process, where the coefficients of the features are stepped:
  Y = Σ_{i=1}^{10} i X_i,  X_i ∼ U(0, 1).
Example: Step test
Figure: Step test: This figure shows the ranked importance of the input variables X1, ..., X10 in a fitted neural network with one hidden layer, under three measures: Importance (Sensitivity), Importance (Garson's algorithm) and Importance (Olden's algorithm).
Example: Friedman test
Y = 10 sin(πX1X2) + 20(X3 − 0.5)² + 10X4 + 5X5 + ε.
Figure: Friedman test: Estimated sensitivities of the fitted neural network to the inputs X1, ..., X10, under three measures: Importance (Sensitivity), Importance (Garson's algorithm) and Importance (Olden's algorithm).
Interaction Terms Introduced by Activation
• Even with one hidden layer, the activation function introduces interaction terms X_iX_j.
• To see this, consider the partial derivative
  ∂_{X_j}Ŷ = Σ_i w^{(2)}_i σ′(I^{(1)}_i) w^{(1)}_{i,j}
• and differentiate again with respect to X_k to give the Hessian:
  ∂²_{X_j,X_k}Ŷ = Σ_i w^{(2)}_i σ″(I^{(1)}_i) w^{(1)}_{i,j} w^{(1)}_{i,k}
  (for σ = tanh, σ″(x) = −2σ(x)(1 − σ²(x)))
• which is not zero everywhere unless σ is (piecewise) linear or constant.
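A finite-difference check of the Hessian formula for a one-hidden-layer tanh network (the sizes, seed and test point are illustrative assumptions; σ″(x) = −2σ(x)(1 − σ²(x)) for tanh):

```python
import numpy as np

rng = np.random.default_rng(6)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2 = rng.normal(size=(1, 5))

def net(x):                                       # one tanh hidden layer, scalar output
    return (W2 @ np.tanh(W1 @ x + b1))[0]

x = rng.normal(size=3)
t = np.tanh(W1 @ x + b1)
sig2 = -2 * t * (1 - t ** 2)                      # sigma'' for tanh
# H_jk = sum_i w2_i sigma''(I_i) w1_ij w1_ik
Hess = np.einsum('i,i,ij,ik->jk', W2.ravel(), sig2, W1, W1)

h = 1e-4                                          # second-order central differences
H_num = np.empty((3, 3))
I3 = np.eye(3)
for j in range(3):
    for k in range(3):
        ej, ek = h * I3[j], h * I3[k]
        H_num[j, k] = (net(x + ej + ek) - net(x + ej - ek)
                       - net(x - ej + ek) + net(x - ej - ek)) / (4 * h * h)
print(np.allclose(Hess, H_num, atol=1e-5))        # True
```

The nonzero off-diagonal entries are exactly the interaction terms; with a linear σ they would vanish.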
Interaction Terms Introduced by Activation
Figure: Friedman test: Estimated interaction terms (e.g. x_12, x_24, x_25) in the fitted neural network, ranked by Importance (Sensitivity).
The Jacobian using ReLU activation
• In matrix form, with σ(x) = max(x, 0), the Jacobian J can be written as a linear combination of Heaviside functions:
  J := J(X) = ∂_X Ŷ(X) = W^{(2)}J(I^{(1)}) = W^{(2)}H(W^{(1)}X + b^{(1)})W^{(1)},
where H_{ii}(I^{(1)}) = H(I^{(1)}_i) = 1_{I^{(1)}_i > 0} and H_{ij} = 0, i ≠ j.
• This can be written in matrix element form as
  J_{ij} = [∂_X Ŷ]_{ij} = Σ_{k=1}^n W^{(2)}_{ik} W^{(1)}_{kj} H(I^{(1)}_k) = Σ_{k=1}^n c_k H_k(I^{(1)}),
where c_k := W^{(2)}_{ik} W^{(1)}_{kj}.
• As a linear combination of indicator functions, we have
  J_{ij} = Σ_{k=1}^{n−1} a_k 1_{I^{(1)}_k > 0, I^{(1)}_{k+1} ≤ 0} + a_n 1_{I^{(1)}_n > 0},  a_k := Σ_{i=1}^k c_i.
The Jacobian using ReLU activation
• Alternatively, the Jacobian can be expressed in terms of X:
  J_{ij} = Σ_{k=1}^{n−1} a_k 1_{W^{(1)}_k X > −b^{(1)}_k, W^{(1)}_{k+1}X ≤ −b^{(1)}_{k+1}} + a_n 1_{W^{(1)}_n X > −b^{(1)}_n}.
• wlog: in the case when p = 1, this simplifies to a weighted sum of independent Bernoulli trials
  J_{ij} = Σ_{k=1}^{n−1} a_k 1_{x_k < X ≤ x_{k+1}} + a_n 1_{x_n < X},  j = 1,
where x_k := −b^{(1)}_k / W^{(1)}_k. The expectation of the Jacobian is given by
  µ_{ij} := E[J_{ij}] = Σ_{k=1}^n a_k p_k,  p_k := Pr[x_k < X ≤ x_{k+1}], ∀k = 1, ..., n − 1, and p_n := Pr[x_n < X].
• For finite weights, the expectation is bounded above by Σ_{k=1}^n a_k.
Bounds on the Variance
• The variance is
  V[J_{ij}] = Σ_{k=1}^{n−1} a_k V[1_{I^{(1)}_k > 0, I^{(1)}_{k+1} ≤ 0}] + a_n V[1_{I^{(1)}_n > 0}] = Σ_{k=1}^n a_k p_k(1 − p_k).
• If the weights are constrained so that the mean is constant, µ^n_{ij} = µ_{ij}, then the weights are a_k = µ/(n p_k), and the variance is monotonically increasing with the number of hidden units:
  V[J_{ij}] = µ (n − 1)/n < µ.
• Otherwise, in general, we have
  V[J_{ij}] ≤ µ_{ij} ≤ Σ_{k=1}^n a_k.
• Increasing variance with more hidden units is undesirable for interpretability - we want the opposite effect!
Robustness of Interpretability
• If the data is generated from a linear model, can the neural network recover the correct weights?
  Y = β_1X_1 + β_2X_2 + ε,  X_1, X_2, ε ∼ N(0, 1),  β_1 = 1, β_2 = 1.
• Compare OLS estimators with zero-hidden-layer NNs and single-hidden-layer NNs.
• Use tanh activation functions for smoothness of the Jacobian, because max(x, 0) gives a piecewise continuous Jacobian.
• Increase the number of hidden units and show that the variance of the sensitivities decreases.
Robustness of Interpretability
Method | β_0 | β_1 | β_2
OLS | 0 | 1.0154 | 1.018
NN_0 | b^(1) = 0.03 | W^(1)_1 = 1.0184141 | W^(1)_2 = 1.02141815
NN_1 | W^(2)σ(b^(1)) + b^(2) = 0.02 | E[W^(2)D^(1)(I^(1))W^(1)_1] = 1.013887 | E[W^(2)D^(1)(I^(1))W^(1)_2] = 1.02224
Robustness of Interpretability
Figure: (a) density of β_1; (b) density of β_2.
Robustness of Interpretability (β1)
Hidden Units | Mean | Median | Std. dev | 1% C.I. | 99% C.I.
2 | 0.980875 | 1.0232913 | 0.10898393 | 0.58121675 | 1.0729908
10 | 0.9866159 | 1.0083131 | 0.056483902 | 0.76814914 | 1.0322522
50 | 0.99183553 | 1.0029879 | 0.03123002 | 0.8698967 | 1.0182846
100 | 1.0071343 | 1.0175397 | 0.028034585 | 0.89689034 | 1.0296803
200 | 1.0152218 | 1.0249312 | 0.026156902 | 0.9119074 | 1.0363332

The confidence interval narrows with an increasing number of hidden units.
Robustness of Interpretability (β2)
Hidden Units | Mean | Median | Std. dev | 1% C.I. | 99% C.I.
2 | 0.98129386 | 1.0233982 | 0.10931312 | 0.5787732 | 1.073728
10 | 0.9876832 | 1.0091512 | 0.057096474 | 0.76264584 | 1.0339714
50 | 0.9903236 | 1.0020974 | 0.031827927 | 0.86471796 | 1.0152498
100 | 0.9842479 | 0.9946766 | 0.028286876 | 0.87199813 | 1.0065105
200 | 0.9976638 | 1.0074166 | 0.026751818 | 0.8920307 | 1.0189484

The confidence interval narrows with an increasing number of hidden units.
Linear Cross-sectional Factor Models
• Consider the linear cross-sectional factor model:

r_t = B_{t−1} f_t + ε_t,  t = 1, …, T

• B_t = [1 | b_{1,t} | ··· | b_{K,t}] is the N × (K + 1) matrix of factor exposures
• (b_{i,t})_k is the exposure (a.k.a. loading) of asset i to factor k at time t
• f_t = [α, f_{1,t}, …, f_{K,t}] is the (K + 1)-vector of factor returns
• r_t is the N-vector of asset returns
• ρ(f_t, ε_t) = 0 and cross-sectional heteroskedasticity E[ε²_{t,i}] = σ²_i, i ∈ {1, …, N}.
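A minimal numpy sketch of a single cross-section, with simulated exposures and returns (all sizes and values are illustrative): the factor returns f_t are recovered by regressing the N asset returns on the exposure matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
N, K = 200, 3                                   # assets, factors (toy sizes)

# Exposure matrix with an intercept column: B = [1 | b_1 | ... | b_K]
B = np.column_stack([np.ones(N), rng.standard_normal((N, K))])
f_true = np.array([0.01, 0.5, -0.3, 0.2])       # [alpha, f_1, ..., f_K]
r = B @ f_true + 0.05 * rng.standard_normal(N)  # asset returns with noise

# Cross-sectional OLS estimate of the factor returns
f_hat, *_ = np.linalg.lstsq(B, r, rcond=None)
```

With enough assets in the cross-section, the estimated factor returns are close to the simulated ones.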
Non-linear Non-parametric Cross-sectional Factor Models
• Consider the non-linear cross-sectional factor model:

r_t = F(B_t) + ε_t

• F : R^K → R is a differentiable non-linear function
• We do not assume that ε_t is Gaussian or require any other restrictions on ε_t, i.e. the error distribution is non-parametric
• The model is used only to predict the next-period returns, and stationarity of the factor returns is not required.
Neural Network Cross-sectional Factor Models [5]
• Neural network cross-sectional factor model:

r_t = F_{W,b}(B_t) + ε_t

where F_{W,b}(X) is a neural network with L layers

r̂ := F_{W,b}(X) = f^(L)_{W^(L),b^(L)} ◦ ··· ◦ f^(1)_{W^(1),b^(1)}(X)
[5] Dixon & Polson, Deep Fundamental Factor Models, 2019, arxiv.org/abs/1903.07677.
Experimental Setup
• Here, we define the universe as the top 250 stocks from the S&P 500 index, ranked by market cap.
• Factors are given by Bloomberg and reported monthly.
• Remove stocks with too many missing factor values, leaving 218 stocks.
• Use a 100-period experiment, fitting a cross-sectional regression model at each period and using the model to forecast the next period's monthly returns.
• At each time step, the canonical experimental design over t = 2, …, T is
  • (X^train_t, Y^train_t) = (B_{t−1}, r_{t−1})
  • (X^test_t, Y^test_t) = (B_t, r_t)
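The walk-forward design above can be sketched as follows, using simulated data and a cross-sectional OLS fit as a stand-in for the model (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, K = 12, 50, 2          # periods, assets, factors (toy sizes)

# Simulated panel: exposures B[t] and realized returns r[t] for each period
B = rng.standard_normal((T, N, K))
f = 0.1 * rng.standard_normal((T, K))                      # factor returns
r = np.einsum('tnk,tk->tn', B, f) + 0.02 * rng.standard_normal((T, N))

oos_mse = []
for t in range(1, T):
    # Train on the previous cross-section (B_{t-1}, r_{t-1}) ...
    f_hat, *_ = np.linalg.lstsq(B[t - 1], r[t - 1], rcond=None)
    # ... and test on the current one (B_t, r_t)
    r_pred = B[t] @ f_hat
    oos_mse.append(float(np.mean((r[t] - r_pred) ** 2)))
```

Each period contributes one out-of-sample MSE, mirroring the per-period refitting described above.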
MSEs
(a) Linear Regression (b) FFWD Neural Network
Overfitting
Figure: In-sample MSE as a function of the number of neurons in the hidden layer. The models are trained here without L1 regularization.
Managing the Bias-Variance Tradeoff with L1 Regularization
(a) In-sample (b) Out-of-Sample
Figure: The effect of L1 regularization on the MSE errors for a network with 50 neurons in the hidden layer.
Information Ratios (Random Selection of Stocks)
Figure: Random selection of x stocks from the universe in each period.
Information Ratios (In-Sample)
(a) Linear Regression (b) FFWD Neural Network
Figure: Selection of the x top performing stocks from the universe in each in-sample period.
Information Ratios (Out-of-Sample)
(a) Linear Regression (b) FFWD Neural Network
Figure: Selection of the x top performing stocks from the universe in each out-of-sample period.
Sensitivities
Figure: The sensitivities and their uncertainty can be estimated from data (black) or derived analytically (red).
Recurrent Neural Networks
• Understand the conceptual formulation and statistical explanationof RNNs
• Understand the different types of RNNs
• Two notebook examples with limit order book data and Bitcoin history
• Navigate further learning resources
Recurrent Neural Network Predictors
• Input-output pairs D = {x_t, y_t}_{t=1}^N are auto-correlated observations of X and Y at times t = 1, …, N
• Construct a nonlinear time series predictor, ŷ_{t+h}(X_t), of an output, y_{t+h}, using a high-dimensional input matrix of p-length sub-sequences X_t:

ŷ_{t+h} = F(X_t)  where  X_t = seq_{t−p+1,t}(X) = (x_{t−p+1}, …, x_t)

• x_{t−j} is the j-th lagged observation of x_t: x_{t−j} = L^j[x_t], for j = 0, …, p − 1.
Recurrent Neural Networks (p=6)
Figure: An unfolded RNN with p = 6 lags. At each time s = t−5, …, t, the input x_s updates the hidden units z^1_s, …, z^H_s, and the final hidden state produces the prediction ŷ_{t+h}.
Recurrent Neural Networks
• For each time step s = t − p + 1, …, t, a function F_h generates a hidden state z_s:

z_s = F_h(x_s, z_{s−1}) := σ(W_h x_s + U_h z_{s−1} + b_h),  W_h ∈ R^{H×d}, U_h ∈ R^{H×H}

• When the output is continuous, the model output from the final hidden state is given by:

ŷ_{t+h} = F_y(z_t) = W_y z_t + b_y,  W_y ∈ R^{K×H}

• When the output is categorical, the output is given by

ŷ_{t+h} = F_y(z_t) = softmax(W_y z_t + b_y)

• Goal: find the weight matrices W = (W_h, U_h, W_y) and biases b = (b_h, b_y).
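These recursions can be sketched directly in numpy. This is a toy forward pass with a tanh activation and random illustrative weights, not a trained model:

```python
import numpy as np

def rnn_forward(X, Wh, Uh, bh, Wy, by):
    """Plain RNN forward pass over the sub-sequence X = (x_{t-p+1}, ..., x_t):
    z_s = tanh(Wh x_s + Uh z_{s-1} + bh), then y_{t+h} = Wy z_t + by."""
    z = np.zeros(Wh.shape[0])        # initial hidden state z_{t-p} = 0
    for x_s in X:                    # s = t-p+1, ..., t
        z = np.tanh(Wh @ x_s + Uh @ z + bh)
    return Wy @ z + by

# Toy sizes: p = 6 lags, d = 1 input, H = 4 hidden units, K = 1 output
rng = np.random.default_rng(1)
p, d, H, K = 6, 1, 4, 1
X = rng.standard_normal((p, d))
Wh = 0.1 * rng.standard_normal((H, d))
Uh = 0.1 * rng.standard_normal((H, H))
Wy = rng.standard_normal((K, H))
y_pred = rnn_forward(X, Wh, Uh, np.zeros(H), Wy, np.zeros(K))
```

In training, W and b would be fitted by back-propagation through time; here only the forward recursion is shown.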
Univariate Example
• Consider the univariate, step-ahead, time series prediction ŷ_t = F(X_{t−1}), using the p previous observations {y_{t−i}}_{i=1}^p.
• Because this is a special case in which no input is available at time t (since we are predicting it), we form the hidden states up to time t − 1, i.e. z_{t−1}.
• Consider the simplest case of an RNN with one hidden unit, H = 1, no activation function, and input dimension d = 1.
Univariate Example
• If W_h = U_h = φ, |φ| < 1, W_y = 1 and b_h = b_y = 0.
• Then we can show that ŷ_t = F(X_{t−1}) is a zero-drift autoregressive, AR(p), model with geometrically decaying weights:

z_{t−p} = φ y_{t−p},
z_{t−p+1} = φ(z_{t−p} + y_{t−p+1}),
…
z_{t−1} = φ(z_{t−2} + y_{t−1}),
ŷ_t = z_{t−1},

where

ŷ_t = (φL + φ²L² + ··· + φ^p L^p)[y_t].
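A quick numerical check of this equivalence, with an arbitrary φ = 0.6 and p = 5 (values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
phi, p = 0.6, 5                        # illustrative values
y = rng.standard_normal(p)             # y_{t-p}, ..., y_{t-1}

# RNN recursion: z_{t-p} = phi*y_{t-p}, then z_s = phi*(z_{s-1} + y_s)
z = phi * y[0]
for y_s in y[1:]:
    z = phi * (z + y_s)
y_hat_rnn = z                          # y_hat_t = z_{t-1}

# Direct AR(p) form: geometrically decaying weight phi^j on the lag y_{t-j}
y_hat_ar = sum(phi ** j * y[p - j] for j in range(1, p + 1))
```

The unrolled recursion and the lag-polynomial form give the same prediction.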
Stability
• Given a lag-1 unit innovation, 1, the output from the hidden layer is

z_t = σ(W_h 0 + U_h 1 + b_h).

• For strict stationarity, we require that the absolute value of each component of the hidden variable under a unit impulse at lag 1 is less than 1:

|z_t|_j = |σ(U_h 1 + b_h)|_j < 1,

which is satisfied if |σ(x)| ≤ 1 and each element of U_h 1 + b_h is finite.

• Additionally, if σ is strictly monotone increasing then |z_t|_j under a lag-2 unit innovation is less than |z_t|_j under a lag-1 unit innovation:

|σ(U_h 1 + b_h)|_j > |σ(U_h σ(U_h 1 + b_h) + b_h)|_j. (14)

In particular, the choice tanh satisfies the requirement on σ.
Multiple Choice Question 1
Identify the following correct statements:

1. A linear recurrent neural network with a memory of p lags is an autoregressive model AR(p) with non-parametric error.
2. Recurrent neural networks, as time series models, are guaranteed to be stable, for any choice of weights.
3. The amount of memory in a shallow recurrent network corresponds to the number of times a single perceptron layer is unfolded.
4. The amount of memory in a deep recurrent network corresponds to the number of perceptron layers.
Overview
• RNNs assume that the data is covariance stationary. The autocovariance structure of the RNN is fixed, and therefore so must be that of the data.
• RNNs are trained using a version of back-propagation known as "back-propagation through time" (BPTT), which is prone to a vanishing gradient problem when the number of lags in the model is large.
• We will see extensions of these plain RNNs which maintain a long-term memory through an exponentially smoothed hidden state.
Exponential Smoothing
• Exponential smoothing is a type of forecasting or filtering method that exponentially decreases the weight of past and current observations to give smoothed predictions ŷ_{t+1}.
• It requires a single parameter, α, also called the smoothing factor or smoothing coefficient.
• This parameter controls the rate at which the influence of the observations at prior time steps decays exponentially.
• α is often set to a value between 0 and 1. Large values mean that the model pays attention mainly to the most recent past observations, whereas smaller values mean more of the history is taken into account when making a prediction.
Exponential Smoothing
• Exponential smoothing takes the forecast for the previous period, ŷ_t, and adjusts it with the forecast error, y_t − ŷ_t. The forecast for the next period becomes:

ŷ_{t+1} = ŷ_t + α(y_t − ŷ_t)

or equivalently

ŷ_{t+1} = α y_t + (1 − α) ŷ_t.

• Writing this as a geometrically decaying autoregressive series back to the first observation:

ŷ_{t+1} = α y_t + α(1 − α) y_{t−1} + α(1 − α)² y_{t−2} + α(1 − α)³ y_{t−3} + ··· + α(1 − α)^{t−1} y_1 + (1 − α)^t ŷ_1.
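The recursion and its geometric expansion can be checked numerically. A sketch with arbitrary data, assuming the common initialization ŷ_1 = y_1:

```python
import numpy as np

def exp_smooth(y, alpha):
    """Exponential smoothing y_hat_{t+1} = alpha*y_t + (1 - alpha)*y_hat_t,
    initialised with y_hat_1 = y_1."""
    y_hat = np.empty(len(y) + 1)
    y_hat[0] = y[0]
    for t, y_t in enumerate(y):
        y_hat[t + 1] = alpha * y_t + (1 - alpha) * y_hat[t]
    return y_hat

rng = np.random.default_rng(0)
y, alpha = rng.standard_normal(50), 0.3
y_hat = exp_smooth(y, alpha)

# The recursion equals the geometric expansion back to the first observation
t = len(y)
expansion = sum(alpha * (1 - alpha) ** j * y[t - 1 - j] for j in range(t)) \
            + (1 - alpha) ** t * y_hat[0]
```

Unrolling the recursion reproduces the geometrically decaying weights term by term.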
Exponential Smoothing
• We observe that smoothing introduces long-term memory: the model depends on the entire observed history, not just a sub-sequence used for prediction as in a plain RNN, for example.
• For geometrically decaying models, it is useful to characterize the model by its "half-life", the lag k at which the coefficient is equal to a half:

α(1 − α)^k = 1/2,

or

k = −ln(2α)/ln(1 − α).
Dynamic Smoothing
• Dynamic exponential smoothing is a time-dependent, convex combination of the smoothed output, ŷ_t, and the observation y_t:

ŷ_{t+1} = α_t y_t + (1 − α_t) ŷ_t,

where α_t ∈ [0, 1] denotes the dynamic smoothing factor. This can equivalently be written as the one-step-ahead forecast

ŷ_{t+1} = ŷ_t + α_t(y_t − ŷ_t).

• Hence the smoothing can be viewed as a form of dynamic forecast error correction: when α_t = 1, the one-step-ahead forecast merely repeats the current observation y_t, and when α_t = 0, the forecast error is ignored and the forecast is the current model output.
Dynamic Smoothing
• The smoothing can also be viewed as a weighted sum of the lagged observations, with lower or equal weights α_{t−s} ∏_{r=1}^{s} (1 − α_{t−r+1}) at the lag-s (s ≥ 1) past observation y_{t−s}:

ŷ_{t+1} = α_t y_t + ∑_{s=1}^{t−1} α_{t−s} ∏_{r=1}^{s} (1 − α_{t−r+1}) y_{t−s} + ∏_{r=0}^{t−1} (1 − α_{t−r}) ŷ_1,

where the last term is a time-dependent constant and typically we initialize the exponential smoother with ŷ_1 = y_1.

• Note that for any α_{t−r+1} = 1, the prediction ŷ_{t+1} will have no dependency on the lags {y_{t−s}}_{s≥r}. The model simply forgets the observations at or beyond the r-th lag.
• In the special case when the smoothing is constant and equal to 1 − α, then the above expression simplifies to

ŷ_{t+1} = α^{−1} y_t,
Dynamic Smoothing
• Putting this together gives the following dynamic RNN:

smoothing: h_t = α_t h̃_t + (1 − α_t) h_{t−1} (15)
smoother update: α_t = σ^(1)(U_α h_{t−1} + W_α x_t + b_α) (16)
hidden state update: h̃_t = σ(U_h h_{t−1} + W_h x_t + b_h), (17)

where σ^(1) is a sigmoid or Heaviside function and σ is any activation function.
Figure: An illustrative example of the response of a simplified GRU, compared with a plain RNN and an RNN with an exponentially smoothed hidden state under a constant α. The RNN loses memory of the unit impulse after three lags, whereas the RNNs with smoothed hidden states maintain memory of the first unit impulse even when the second unit impulse arrives. The difference between the dynamically smoothed RNN (the toy GRU) and the smoothed RNN with a fixed smoothing parameter appears insignificant. Keep in mind, however, that the dynamic smoothing model has much more flexibility in controlling the sensitivity of the smoothing to the unit impulses.
Gated Recurrent Units (GRUs)
• We introduce a reset, or switch, r_t, in order to forget the dependence of h̃_t on previous data.
• Effectively, r_t turns h̃_t from an RNN into a FNN.

h_t = Z_t h̃_t + (1 − Z_t) h_{t−1} (18)
Z_t = σ(U_Z h_{t−1} + W_Z x_t + b_Z) (19)
h̃_t = tanh(U_h (r_t · h_{t−1}) + W_h x_t + b_h) (20)
r_t = σ(U_r h_{t−1} + W_r x_t + b_r) (21)
ŷ_t = W_Y h_t + b_Y (22)

• r_t is the switching (reset) parameter which determines how much recurrence is used for the candidate hidden state h̃_t.
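A single GRU step following Eqs. (18)-(21) can be sketched in numpy. The parameters are random placeholders, and "·" is taken to mean element-wise multiplication:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, P):
    """One GRU step following Eqs. (18)-(21)."""
    r = sigmoid(P['Ur'] @ h_prev + P['Wr'] @ x + P['br'])             # reset (21)
    h_cand = np.tanh(P['Uh'] @ (r * h_prev) + P['Wh'] @ x + P['bh'])  # candidate (20)
    Z = sigmoid(P['UZ'] @ h_prev + P['WZ'] @ x + P['bZ'])             # update (19)
    return Z * h_cand + (1 - Z) * h_prev                              # smoothing (18)

# Toy parameters: H = 3 hidden units, d = 2 inputs (random placeholders)
rng = np.random.default_rng(0)
H, d = 3, 2
P = {k: 0.1 * rng.standard_normal((H, H)) for k in ('Ur', 'Uh', 'UZ')}
P.update({k: 0.1 * rng.standard_normal((H, d)) for k in ('Wr', 'Wh', 'WZ')})
P.update({k: np.zeros(H) for k in ('br', 'bh', 'bZ')})

h = np.zeros(H)
for x in rng.standard_normal((5, d)):      # run over a length-5 input sequence
    h = gru_step(x, h, P)
```

Because the update is a convex combination of a tanh candidate and the previous state, the hidden state stays bounded in (−1, 1).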
References
• S. Haykin, Kalman Filtering and Neural Networks, volume 47. John Wiley & Sons, 2004.
• M.F. Dixon, N. Polson and V. Sokolov, Deep Learning for Spatio-Temporal Modeling: Dynamic Traffic Flows and High Frequency Trading, Applied Stochastic Models in Business and Industry, arXiv:1705.09851, 2018.
• M.F. Dixon, Sequence Classification of the Limit Order Book using Recurrent Neural Networks, J. Computational Science, Special Issue on Topics in Computational and Algorithmic Finance, arXiv:1707.05642, 2018.
• M.F. Dixon, A High Frequency Trade Execution Model for Supervised Learning, High Frequency, arXiv:1710.03870, 2018.
Multiple Choice Question 1: Answers
1. A linear recurrent neural network with a memory of p lags is an autoregressive model AR(p) with non-parametric error.
2. Recurrent neural networks, as time series models, are guaranteed to be stable, for any choice of weights.
3. The amount of memory in a shallow recurrent network corresponds to the number of times a single perceptron layer is unfolded.
4. The amount of memory in a deep recurrent network corresponds to the number of perceptron layers.

Answer: 1, 2, 3
Learning Objectives
• Understand the conceptual formulation and statistical explanationof CNNs.
• Understand 1D CNNs for time series prediction.
• A notebook example on univariate time series prediction with 1D CNNs.
• Understand how dilation CNNs capture multi-scales.
• Understand 2D CNNs for image processing. Supplementary notebook.
Multiple Choice Question 2
Identify the following correct statements:
1. A feedforward network can, in principle, always be used for any supervised learning problem.
2. A linear recurrent neural network with a memory of p lags is an autoregressive model AR(p).
3. Recurrent neural networks, as time series models, are guaranteed to be stable, for any choice of weights and activation.
4. Feedforward and recurrent neural network predictors require that the data is stationary.
5. GRUs allow the hidden state to persist beyond a sequence by using smoothing.
Multiple Choice Question 2
Identify the following correct statements:
1. A feedforward network can, in principle, always be used for any supervised learning problem.
2. A linear recurrent neural network with a memory of p lags is an autoregressive model AR(p).
3. Recurrent neural networks, as time series models, are guaranteed to be stable, for any choice of weights and activation.
4. Feedforward and recurrent neural network predictors require that the data is stationary.
5. GRUs allow the hidden state to persist beyond a sequence by using smoothing.

Answer: 1, 2, 4, 5.
Convolutional neural networks (CNN)
• Convolutional neural networks (CNNs) are feedforward neural networks that can exploit local spatial structures in the input data.
• CNNs attempt to reduce the network size by exploiting spatial or temporal locality.
• CNNs use sparse connections between the different layers.
• Convolution is frequently used for image processing, such as for smoothing, sharpening, and edge detection of images.
Convolutional Neural Networks
Figure: Convolutional Neural Networks. Source: http://www.asimovinstitute.org/neural-network-zoo
Weighted Moving Average Smoothers
• A common technique in time series analysis and signal processing is to filter the time series.
• We have already seen exponential smoothing as a special case of a class of smoothers known as "weighted moving average" (WMA) smoothers.
• WMA smoothers take the form

x̄_t = (1/∑_{i∈I} w_i) ∑_{i∈I} w_i x_{t−i}

where x̄_t is the local mean of the time series. The weights are specified to emphasize or deemphasize particular observations x_{t−i} in the span |I|.
• Examples of well-known smoothers include the Hanning smoother h(3):

x̄_t = (x_{t−1} + 2x_t + x_{t+1})/4.

• Such smoothers have the effect of reducing noise in the time series [6].
• The smoother takes |I| samples of input at a time and then computes the weighted average to produce a single output. As the length of the filter increases, the smoothness of the output increases, whereas the sharp modulations in the data are smoothed out.

[6] The moving average filter is a simple low-pass Finite Impulse Response (FIR) filter.
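The Hanning smoother h(3) can be implemented with a short convolution. A sketch (symmetric, centered windows assumed; the ends are left unsmoothed, as in the footnote):

```python
import numpy as np

def wma_smooth(x, w):
    """Weighted moving average x_bar_t = sum_i w_i x_{t-i} / sum_i w_i for a
    symmetric, centered window w; the ends are left unsmoothed."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    k = len(w) // 2
    x_bar = x.copy()
    x_bar[k:len(x) - k] = np.convolve(x, w, mode='valid')
    return x_bar

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 4 * np.pi, 200)) + 0.3 * rng.standard_normal(200)
x_hanning = wma_smooth(x, [1, 2, 1])       # the Hanning smoother h(3)
```

Each interior point becomes (x_{t−1} + 2x_t + x_{t+1})/4, while the endpoints are passed through unchanged.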
Moving Average Smoothers
Figure: Example of a moving average filter applied to time series.
Convolution for Time Series
• The moving average filter is in fact a (discrete) convolution using a very simple filter kernel.
• The simplest discrete convolution gives the relation between x_i and x_j:

x_{t−i} = ∑_{j=0}^{t−1} δ_{ij} x_{t−j},  i ∈ {0, …, t − 1},

where we have used the Kronecker delta δ.
• The kernel-filtered time series is a convolution [7]:

x̄_{t−i} = ∑_{j∈J} K_{j+k+1} x_{t−i−j},  i ∈ {k + 1, …, p − k}.

[7] For simplicity, the ends of the sequence are assumed to be unsmoothed, but for notational reasons we set x̄_{t−i} = x_{t−i} for i ∈ {1, …, k} ∪ {p − k + 1, …, p}.
Convolution for Time Series
• The filtered AR(p) model is

x̂_t = µ + ∑_{i=1}^{p} φ_i x̄_{t−i} (23)
    = µ + (φ_1 L + φ_2 L² + ··· + φ_p L^p)[x̄_t] (24)
    = µ + [L, L², …, L^p] φ [x̄_t], (25)

with coefficients φ := [φ_1, …, φ_p]^T [8].

[8] Note that there is no look-ahead bias because we do not smooth the last k values of the observed data {x_s}_{s=1}^t.
1D CNN with one hidden unit
• A simple 1D CNN consisting of a feedforward output layer and a non-activated hidden layer with one unit (i.e. kernel):

x̂_t = W_y z_t + b_y,  z_t = [x̄_{t−1}, …, x̄_{t−p}]^T,  W_y = φ^T,  b_y = µ,

where x̄_{t−i} is the i-th output from a convolution of the p-length input sequence with a kernel consisting of 2k + 1 weights.
• These weights are fixed over time, and hence the CNN is only suited to prediction from stationary time series.
• Note also, in contrast to a RNN, that the size of the weight matrix W_y increases with the number of lags in the model.
1D CNN with H hidden units
• The univariate CNN predictor with p lags and H activated hidden units (kernels) is

x̂_t = W_y vec(z_t) + b_y (26)
[z_t]_{i,m} = σ(∑_{j∈J} K_{m,j+k+1} x_{t−i−j} + [b_h]_m) (27)
           = σ(K ∗ x_t + b_h) (28)

where m ∈ {1, …, H} denotes the index of the kernel, and the kernel matrix K ∈ R^{H×(2k+1)}, hidden bias vector b_h ∈ R^H and output matrix W_y ∈ R^{1×pH}.
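A direct, loop-based numpy sketch of this layer, with H = 4 kernels of width 2k + 1 = 3 and a tanh activation (all weights are random placeholders):

```python
import numpy as np

def conv1d_layer(x, K, bh):
    """1D convolution with H kernels of width 2k+1 and tanh activation:
    z[i, m] = tanh(sum_j K[m, j] x[i + j] + bh[m]) (no padding)."""
    H, width = K.shape
    n_out = len(x) - width + 1
    z = np.empty((n_out, H))
    for m in range(H):                       # loop over kernels
        for i in range(n_out):               # slide the kernel along x
            z[i, m] = np.tanh(K[m] @ x[i:i + width] + bh[m])
    return z

rng = np.random.default_rng(0)
p, H, k = 10, 4, 1                           # lags, kernels, kernel half-width
x = rng.standard_normal(p)                   # input sub-sequence of length p

K = 0.5 * rng.standard_normal((H, 2 * k + 1))
z = conv1d_layer(x, K, np.zeros(H))          # feature maps, shape (p - 2k, H)

# Linear read-out: x_hat_t = Wy vec(z_t) + by
Wy = 0.1 * rng.standard_normal(z.size)
x_hat = float(Wy @ z.ravel())
```

Note that without padding each feature map has length p − 2k rather than p, so the read-out weight vector here has (p − 2k)H entries.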
Remarks on CNNs
• Since the size of W_y increases with both the number of lags and the number of kernels, it may be preferable to reduce the dimensionality of the weights with an additional layer and hence avoid over-fitting.
• Convolutional neural networks are not limited to sequential models. One might, for example, sample the past lags non-uniformly so that I = {2^i}_{i=1}^p; then the maximum lag in the model is 2^p. Such a non-sequential model allows a large maximum lag without capturing all the intermediate lags.
Stationarity
• A univariate linear CNN predictor, with one kernel and no activation, can be written in the canonical form

x̂_t = µ + (1 − Φ(L))[K ∗ x_t] = µ + K ∗ (1 − Φ(L))[x_t] (29)
    = µ + (φ̄_1 L + ··· + φ̄_p L^p)[x_t] := µ + (1 − Φ̄(L))[x_t], (30)

where, by the linearity of Φ(L) in x_t, the convolution commutes and thus we can write φ̄ := K ∗ φ.
• Provided that Φ̄(L)^{−1} forms a divergent sequence in the noise process {ε_s}_{s=1}^t, then the model is stable.
• Finding the roots of the characteristic equation

Φ̄(z) = 0,

it follows that the linear CNN is strictly stationary and ergodic if all the roots lie outside the unit circle in the complex plane: |λ_i| > 1, i ∈ {1, …, p}.
Equivalent Eigenvalue Problem*
• Finding roots of polynomials is equivalent to finding eigenvalues of the "companion matrix".
• A famous theorem states that the roots of any polynomial can be found by turning it into a matrix and finding the eigenvalues.
• Given the degree-p polynomial [9]

q(z) = c_0 + c_1 z + ··· + c_{p−1} z^{p−1} + z^p,

we define the p × p companion matrix

C := [ 0    1    0    …    0        ]
     [ 0    0    1    …    0        ]
     [ ⋮              ⋱    ⋮        ]
     [ 0    0    0    …    1        ]
     [ −c_0 −c_1 …  −c_{p−2} −c_{p−1} ]

Then the characteristic polynomial det(C − λI) = q(λ), and so the eigenvalues of C are the roots of q [10].

[9] Notice that the z^p coefficient is 1.
[10] Note that if the polynomial does not have a unit leading coefficient, then one can divide the polynomial by that coefficient to put it in monic form, without changing its roots.
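A sketch of the companion-matrix construction for a small monic polynomial, cross-checked against numpy's eigenvalue and root routines:

```python
import numpy as np

def companion(c):
    """Companion matrix of the monic polynomial
    q(z) = c[0] + c[1] z + ... + c[p-1] z^{p-1} + z^p."""
    p = len(c)
    C = np.zeros((p, p))
    C[:-1, 1:] = np.eye(p - 1)          # ones on the super-diagonal
    C[-1, :] = -np.asarray(c, float)    # last row: -c_0, ..., -c_{p-1}
    return C

# q(z) = z^2 - 3z + 2 = (z - 1)(z - 2), with roots 1 and 2
c = [2.0, -3.0]
eigs = np.linalg.eigvals(companion(c))

# Cross-check with numpy's root finder (coefficients highest-degree first)
roots = np.roots([1.0, -3.0, 2.0])
```

numpy's `roots` itself forms a companion matrix internally, so the two computations agree.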
Strided Convolution

• In the slides above, we performed the convolution by sliding the image area by a unit increment. This is known as using a stride, s, of one [11].
• More generally, given an integer s ≥ 1, a convolution with stride s for an input (feature) f ∈ R^m is defined as:

[K ∗_s f]_i = ∑_{p=−k}^{k} K_p f_{s(i−1)+1+p},  i ∈ {1, …, ⌈m/s⌉}. (31)

Here ⌈m/s⌉ denotes the smallest integer greater than or equal to m/s.

[11] A common choice in CNNs is to take s = 2.
Dilated Convolution
• In addition to image processing, CNNs have also been successfully applied to time series. WaveNet, for example, is a CNN developed for audio processing.
• Time series often display long-term correlations. Moreover, the dependent variable(s) may exhibit non-linear dependence on the lagged predictors.
• Recall that a CNN for time series is a non-linear p-auto-regression of the form

y_t = ∑_{i=1}^{p} φ_i(x_{t−i}) + ε_t (32)

where the coefficient functions φ_i, i ∈ {1, …, p}, are data-dependent and optimized through the convolutional network.
• A dilated convolution effectively allows the network to operate on a coarser scale than with a normal convolution. This is similar to pooling or strided convolutions, but here the output has the same size as the input.
• The network can learn multi-scale, non-linear dependencies by using stacked layers of dilated convolutions.
Dilated convolution
• In a dilated convolution the filter is applied to every d-th element in the input vector, allowing the model to efficiently learn connections between far-apart data points.
• For an architecture with L layers of dilated convolutions, ℓ ∈ {1, …, L}, a dilated convolution outputs a stack of "feature maps" given by

[K^(ℓ) ∗_{d^(ℓ)} f^(ℓ−1)]_i = ∑_{p=−k}^{k} K^(ℓ)_p f^(ℓ−1)_{d^(ℓ)(i−1)+1+p},  i ∈ {1, …, ⌈m/d^(ℓ)⌉}, (33)

where d is the dilation factor and we can choose the dilations to increase by a factor of two: d^(ℓ) = 2^{ℓ−1}.
• The filters for each layer, K^(ℓ), are chosen to be of size 1 × k′ = 1 × 2, where k′ := 2k + 1.
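A loop-based sketch of a stack of dilated convolutions, using a width-2 averaging kernel and a simplified, 0-based version of the indexing in Eq. (33) (the window origin advances by the dilation factor d; this indexing is an assumption for illustration):

```python
import numpy as np

def dilated_conv(f, K, d):
    """Apply kernel K with its window origin at every d-th element of f,
    a 0-based simplification of the indexing in Eq. (33)."""
    n_out = int(np.ceil(len(f) / d))
    out = np.zeros(n_out)
    for i in range(n_out):
        for p, K_p in enumerate(K):
            idx = d * i + p                 # window origin spaced by d
            if idx < len(f):
                out[i] += K_p * f[idx]
    return out

# Three stacked layers with dilations d = 1, 2, 4 (i.e. d^(l) = 2^{l-1})
f = np.arange(16, dtype=float)
K = np.array([0.5, 0.5])                    # width-2 averaging kernel
for d in (1, 2, 4):
    f = dilated_conv(f, K, d)
```

Each layer coarsens the scale by its dilation factor, so a length-16 input is reduced to 16, then 8, then 2 values across the three layers.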
Dilated Convolution
• The output is the forecasted time series Y = (y_t)_{t=0}^{N−1}.

Figure: A dilated convolutional neural network with three layers. The receptive field is r = 8, i.e. one output value is influenced by eight input neurons.

• The receptive field depends on the number of layers L and the filter size k′, and is given by

  r = 2^{L−1} k′.   (34)
Learning Objectives
• Understand principal component analysis for dimension reduction.
• Understand how to formulate a linear autoencoder.
• Understand how a linear autoencoder performs PCA.
• Notebook example using dimension reduction on the yield curve.
Preliminaries
• Let {y_i}_{i=1}^N be a set of N observation vectors, each of dimension n. We assume that n ≤ N.

• Let Y ∈ R^{n×N} be the matrix whose columns are {y_i}_{i=1}^N:

  Y = [ y_1 · · · y_N ].

• The element-wise average of the N observations is an n-dimensional signal which may be written as:

  ȳ = (1/N) Σ_{i=1}^N y_i = (1/N) Y 1_N,

  where 1_N ∈ R^{N×1} is a column vector of all ones.

• Let Y_0 be the matrix whose columns are the demeaned observations (we center each observation y_i by subtracting ȳ from it):

  Y_0 = Y − ȳ 1_N^T.
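In numpy, with the observations stored as columns, the demeaning step is a one-liner. A small sketch with hypothetical random data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 3, 5                                  # n <= N, as assumed above
Y = rng.normal(size=(n, N))                  # columns are the observations y_i

y_bar = Y.mean(axis=1, keepdims=True)        # (1/N) Y 1_N, shape (n, 1)
Y0 = Y - y_bar                               # Y - y_bar 1_N^T

print(np.allclose(Y0.mean(axis=1), 0.0))     # True: each row of Y0 has zero mean
```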
Projection
• A linear projection from R^n to R^m is a linear transformation of a finite-dimensional vector given by a matrix multiplication:

  x_i = W^T y_i,

  where y_i ∈ R^n, x_i ∈ R^m, and W ∈ R^{n×m}.

• Each element j of the vector x_i is an inner product between y_i and the j-th column of W, which we denote by w_j.

• Let X ∈ R^{m×N} be the matrix whose columns are the set of N transformed observations, let x̄ = (1/N) Σ_{i=1}^N x_i = (1/N) X 1_N be the element-wise average, and let X_0 = X − x̄ 1_N^T be the demeaned matrix.

• Clearly, X = W^T Y and X_0 = W^T Y_0.
Principal Component Projection*
• When the matrix W^T represents the transformation that applies principal component analysis, we denote W = P, and the columns of the orthonormal matrix P (Footnote 12: i.e. P^{−1} = P^T), denoted {p_j}_{j=1}^n, are referred to as loading vectors.

• The transformed vectors {x_i}_{i=1}^N are referred to as principal components or scores.

• The first loading vector is defined as the unit vector with which the inner products of the observations have the greatest variance:

  p_1 = argmax_{w_1} w_1^T Y_0 Y_0^T w_1   s.t. w_1^T w_1 = 1.   (35)

• The solution to (35) is known to be the eigenvector of the sample covariance matrix Y_0 Y_0^T corresponding to its largest eigenvalue (Footnote 13: we normalize the eigenvector and disregard its sign).
Principal Component Projection*
• Next, p_2 is the unit vector which has the largest variance of inner products between it and the observations, after removing the orthogonal projections of the observations onto p_1. It may be found by solving:

  p_2 = argmax_{w_2} w_2^T (Y_0 − p_1 p_1^T Y_0)(Y_0 − p_1 p_1^T Y_0)^T w_2   s.t. w_2^T w_2 = 1.   (36)
• The solution to (36) is known to be the eigenvector corresponding to the largest eigenvalue under the constraint that it is not collinear with p_1.

• Similarly, the remaining loading vectors are equal to the remaining eigenvectors of Y_0 Y_0^T, corresponding to descending eigenvalues.
Principal Components
• The eigenvalues of Y_0 Y_0^T, which is a positive semi-definite matrix, are non-negative.

• They are not necessarily distinct, but since the matrix is symmetric it has n mutually orthogonal eigenvectors and is always diagonalizable.
• Thus, the matrix P may be computed by diagonalizing the covariance matrix:

  Y_0 Y_0^T = P Λ P^{−1} = P Λ P^T,

  where Λ = X_0 X_0^T is a diagonal matrix whose diagonal elements {λ_i}_{i=1}^n are sorted in descending order.
• The transformation back to the observations is Y = PX.
• The fact that the covariance matrix of X is diagonal means that PCA is a decorrelation transformation, and it is often used to denoise data.
Dimensionality reduction
• PCA is often used as a method for dimensionality reduction, the process of reducing the number of variables in a model in order to avoid the curse of dimensionality.
• PCA gives the first m principal components (m < n) by applying the truncated transformation

  X_m = P_m^T Y,

  where each column of X_m ∈ R^{m×N} is a vector whose elements are the first m principal components, and P_m is the matrix whose columns are the first m loading vectors:

  P_m = [ p_1 · · · p_m ] ∈ R^{n×m}.

• Intuitively, by keeping only m principal components we lose information, and we minimize this loss of information by maximizing their variances.
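The loading vectors and scores can be computed exactly as described, by diagonalizing Y_0 Y_0^T. A small numpy sketch with synthetic data (all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, m = 4, 200, 2
Y = rng.normal(size=(n, N))
Y0 = Y - Y.mean(axis=1, keepdims=True)       # demeaned observations

# Diagonalize Y0 Y0^T; eigh returns ascending eigenvalues, so reverse them.
lam, P = np.linalg.eigh(Y0 @ Y0.T)
order = np.argsort(lam)[::-1]
lam, P = lam[order], P[:, order]

Pm = P[:, :m]                 # first m loading vectors
Xm = Pm.T @ Y0                # first m principal components (scores)

# The scores are decorrelated: Xm Xm^T = diag(lambda_1, ..., lambda_m).
C = Xm @ Xm.T
print(np.allclose(C, np.diag(lam[:m])))      # True
```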
Minimum total squared reconstruction error
• P_m is also a solution to:

  min_{W ∈ R^{n×m}} ‖Y_0 − W W^T Y_0‖_F²   s.t. W^T W = I_{m×m},   (37)

  where ‖·‖_F denotes the Frobenius matrix norm.

• The m leading loading vectors form an orthonormal basis which spans the m-dimensional subspace onto which the projections of the demeaned observations have the minimum squared difference from the original demeaned observations.

• In other words, P_m compresses each demeaned vector of length n into a vector of length m (where m ≤ n) in such a way that minimizes the sum of total squared reconstruction errors.

• The minimizer of (37) is not unique: W = P_m Q is also a solution, where Q ∈ R^{m×m} is any orthogonal matrix, Q^T = Q^{−1}. Multiplying P_m from the right by Q transforms the first m loading vectors into a different orthonormal basis for the same subspace.
Minimum total squared reconstruction error
• A neural network that is trained to learn the identity function f(Y) = Y is called an autoencoder. Its output layer has the same number of nodes as the input layer, and the cost function is some measure of the reconstruction error.

• Autoencoders are unsupervised learning models, and they are often used for the purpose of dimensionality reduction.

• A simple autoencoder that implements dimensionality reduction is a feed-forward autoencoder with at least one layer that has a smaller number of nodes, which functions as a bottleneck.

• After training the neural network using backpropagation, it is separated into two parts: the layers up to the bottleneck are used as an encoder, and the remaining layers are used as a decoder.

• In the simplest case, there is only one hidden layer (the bottleneck), and the layers in the network are fully connected.
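For the linear case, the bottleneck weights can also be fitted without backpropagation, by alternating least squares: each step solves one weight matrix in closed form while holding the other fixed. This is a sketch under that assumption, not the training procedure used in the lecture notebooks, and the data are synthetic; it recovers the same projector as PCA, as the following slides explain.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, m = 4, 300, 2
# Synthetic data with a dominant m-dimensional structure plus small noise.
Y = rng.normal(size=(n, m)) @ rng.normal(size=(m, N)) + 0.01 * rng.normal(size=(n, N))
Y0 = Y - Y.mean(axis=1, keepdims=True)

W2 = rng.normal(size=(n, m))                    # decoder weights, random init
for _ in range(100):                            # alternating least squares
    W1 = np.linalg.pinv(W2)                     # optimal encoder given decoder
    X = W1 @ Y0                                 # bottleneck codes
    W2 = Y0 @ X.T @ np.linalg.inv(X @ X.T)      # optimal decoder given codes

# The learned projector matches the PCA projector U_m U_m^T onto the
# principal subspace of Y0.
U = np.linalg.svd(Y0)[0]
proj_pca = U[:, :m] @ U[:, :m].T
proj_ae = W2 @ np.linalg.pinv(W2)
print(np.allclose(proj_ae, proj_pca, atol=1e-6))   # True
```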
Linear autoencoders*
• In the case that no non-linear activation function is used, x_i = W^(1) y_i + b^(1) and ŷ_i = W^(2) x_i + b^(2). If the cost function is the total squared difference between output and input, then training the autoencoder on the input data matrix Y solves:

  min_{W^(1), b^(1), W^(2), b^(2)} ‖Y − (W^(2)(W^(1) Y + b^(1) 1_N^T) + b^(2) 1_N^T)‖_F².   (38)

• If we set the partial derivative with respect to b^(2) to zero and insert the solution into (38), then the problem becomes:

  min_{W^(1), W^(2)} ‖Y_0 − W^(2) W^(1) Y_0‖_F².

• Thus, for any b^(1), the optimal b^(2) is such that the problem becomes independent of b^(1) and of ȳ. Therefore, we may focus only on the weights W^(1), W^(2).
Linear autoencoders as orthogonal projections
• Setting the gradients to zero, W^(1) is the left Moore-Penrose pseudoinverse of W^(2) (and W^(2) is the right pseudoinverse of W^(1)):

  W^(1) = (W^(2))† = ((W^(2))^T W^(2))^{−1} (W^(2))^T.

• The minimization with respect to a single matrix is:

  min_{W^(2) ∈ R^{n×m}} ‖Y_0 − W^(2)(W^(2))† Y_0‖_F².   (39)

  The matrix W^(2)(W^(2))† = W^(2)((W^(2))^T W^(2))^{−1}(W^(2))^T is the orthogonal projection operator onto the column space of W^(2), whose columns are not necessarily orthonormal.

• This problem is very similar to (37), but without the orthonormality constraint.
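That W^(2)(W^(2))† is an orthogonal projector is easy to verify numerically: it is symmetric and idempotent for any full-rank W^(2), orthonormal columns or not. A minimal sketch with a random W^(2):

```python
import numpy as np

rng = np.random.default_rng(2)
W2 = rng.normal(size=(5, 2))          # columns are not orthonormal
Pi = W2 @ np.linalg.pinv(W2)          # W^(2) (W^(2))^dagger

print(np.allclose(Pi @ Pi, Pi))       # True: idempotent
print(np.allclose(Pi, Pi.T))          # True: symmetric
print(np.linalg.matrix_rank(Pi))      # 2: projects onto a 2-D column space
```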
Linear autoencoders: mathematical properties*
• It can be shown that W^(2) is a minimizer of (39) if and only if its column space is spanned by the first m loading vectors of Y.

• The linear autoencoder is said to apply PCA to the input data in the sense that its output is a projection of the data onto the low-dimensional principal subspace.

• However, unlike actual PCA, the coordinates of the output of the bottleneck are correlated and are not sorted in descending order of variance.

• The solutions for reduction to different dimensions are not nested: when reducing the data from dimension n to dimension m_1, the first m_2 vectors (m_2 < m_1) are not an optimal solution to reduction from dimension n to m_2, which therefore requires training an entirely new autoencoder.
Equivalence of linear autoencoders and PCA*

Hypothesis: The first m loading vectors of Y are the first m left singular vectors of the matrix W^(2) which minimizes (39).

• Train the linear autoencoder on the original dataset Y and then compute the first m left singular vectors of W^(2) ∈ R^{n×m}, where typically m ≪ N.

• The loading vectors may also be recovered from the weights of the hidden layer, W^(1), by a singular value decomposition.

• If W^(2) = UΣV^T, which we assume is full-rank, then

  W^(1) = (W^(2))† = VΣ†U^T

  and

  W^(2)(W^(2))† = UΣV^T VΣ†U^T = UΣΣ†U^T = U_m U_m^T,   (40)

  where we used the fact that (V^T)† = V and that Σ† ∈ R^{m×n} is a matrix whose diagonal elements are 1/σ_j (assuming σ_j ≠ 0, and 0 otherwise).

• The matrix ΣΣ† is a diagonal matrix whose first m diagonal elements are equal to one and the other n − m elements are equal to zero.

• The matrix U_m ∈ R^{n×m} is a matrix whose columns are the first m left singular vectors of W^(2).

• Thus, the first m left singular vectors of (W^(1))^T ∈ R^{n×m} are also equal to the first m loading vectors of Y.
Figure: This figure shows the yield curve over time; each line corresponds to a different maturity in the term-structure of interest rates.
(a) P_m^T Y_0 Y_0^T P_m   (b) (W^(2))^T Y_0 Y_0^T W^(2)   (c) U_m^T Y_0 Y_0^T U_m

Figure: The covariance matrix of the data in the transformed coordinates, according to (a) the loading vectors computed by applying SVD to the entire dataset, (b) the weights of the linear autoencoder, and (c) the left singular vectors of the autoencoder weights.
(a) P_m^T ∆Y_0   (b) U_m^T ∆Y_0

Figure: (a) The first two principal components of ∆Y_0, projected using P_m. (b) The first two approximated principal components (up to a sign change), projected using U_m. The first principal component is represented by the x-axis and the second by the y-axis.
Summary
• Deep Learning for cross-sectional data
• Background: review of supervised machine learning
• Introduction to feedforward neural networks
• Related approximation and learning theory
• Training and back-propagation
• Interpretability with factor modeling example
• Deep Learning for time-series data prediction
• Understand the conceptual formulation and statistical explanation of RNNs

• Understand the different types of RNNs, including plain RNNs, GRUs and LSTMs

• Two notebook examples with limit order book data and Bitcoin history

• Understand the conceptual formulation and statistical explanation of CNNs.

• Understand 1D CNNs for time series prediction.

• A notebook example on univariate time series prediction with 1D CNNs.
• Understand how dilation CNNs capture multi-scales.
Summary
• Deep Learning for dimensionality reduction
• Understand principal component analysis for dimension reduction.
• Understand how to formulate a linear autoencoder.
• Understand how a linear autoencoder performs PCA.
• Notebook example using dimension reduction on the yield curve.
Extra Slides
Review of time series analysis
• Characterize the properties of time series
• Problem formulation in time series prediction
• Practical challenges that frequently arise in time series prediction
Time Series Data
Time series data:
• Consists of observations on a variable over time, e.g. stock prices
• Observations are not assumed to be independent over time; rather, observations are often strongly related to their recent histories.

• For this reason, the ordering of the data matters (unlike cross-sectional data).
• Pay special attention to the data frequency.
Classical Time Series Analysis
• Box-Jenkins linear time series analysis, e.g. ARIMA
• Non-linear time series analysis, e.g. GARCH
• Dynamic state based models, e.g. Kalman filters.
Autoregressive Processes AR(p)
• The p-th order autoregressive process of a variable y depends only on the previous values of the variable plus a white noise disturbance term:

  y_t = µ + Σ_{i=1}^p φ_i y_{t−i} + ε_t

• Defining the polynomial function φ(L) := (1 − φ_1 L − φ_2 L² − · · · − φ_p L^p), the AR(p) process can be expressed in the more compact form

  φ(L) y_t = µ + ε_t
Key point: an AR(p) process has a geometrically decaying acf.
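The geometric decay is easy to confirm by simulation. A small sketch for a zero-mean AR(1) with φ = 0.7, for which ρ_s = φ^s:

```python
import numpy as np

rng = np.random.default_rng(4)
phi, T = 0.7, 200_000
eps = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + eps[t]          # zero-mean AR(1)

def acf(x, lag):
    """Sample autocorrelation at the given lag."""
    x = x - x.mean()
    return (x[lag:] * x[:-lag]).mean() / (x * x).mean()

# For AR(1), rho_s = phi**s: a geometrically decaying acf.
for s in (1, 2, 3):
    print(s, acf(y, s), phi**s)             # sample vs. theoretical value
```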
Stationarity
• An important property of AR(p) processes is whether or not they are stationary. For a stationary process, the error disturbance terms ε_t have a declining impact on the current value of y as the lag increases.

• Conversely, non-stationary AR(p) processes exhibit the counter-intuitive behavior that the error disturbance terms become increasingly influential as the lag increases.

• If the roots of the characteristic equation

  φ(z) = (1 − φ_1 z − φ_2 z² − · · · − φ_p z^p) = 0

  *all* lie outside the unit circle, then the AR(p) process is stationary.
Exercise 1
Is the following random walk (zero-mean AR(1) process) stationary?

  y_t = y_{t−1} + ε_t
Exercise 2
Compute the mean, variance and autocorrelation function of the following zero-mean AR(1) process:

  y_t = φ_1 y_{t−1} + ε_t.
Moving Average Processes MA(q)
MA(q) process: The q-th order moving average process is a linear combination of the white noise terms {ε_{t−i}}_{i=0}^q:

  y_t = µ + Σ_{i=1}^q θ_i ε_{t−i} + ε_t

It is convenient to define the lag (or backshift) operator L^i y_t := y_{t−i} and write the MA(q) process as

  y_t = µ + θ(L) ε_t,

where the polynomial function θ(L) := 1 + θ_1 L + θ_2 L² + · · · + θ_q L^q.
Moving Average Processes
The distinguishing properties of the MA(q) process are the following:

• Constant mean: E[y_t] = µ

• Constant variance: V[y_t] = (1 + θ_1² + θ_2² + · · · + θ_q²) σ²

• Autocovariances which may be non-zero up to lag q and zero thereafter:

  γ_s = (θ_s + θ_{s+1}θ_1 + θ_{s+2}θ_2 + · · · + θ_q θ_{q−s}) σ²,  s = 1, 2, . . . , q,
  γ_s = 0,  s > q.

Key point: an MA(q) process has a non-zero acf up to lag q; above q there is a sharp cut-off.
Exercise 3
Consider the following zero-mean MA(2) process:

  y_t = ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2}
Show that:
• E[y_t] = 0, ∀t

• V[y_t] = γ_0 = (1 + θ_1² + θ_2²) σ², ∀t

• ρ_1 = (θ_1 + θ_1 θ_2)/(1 + θ_1² + θ_2²), ρ_2 = θ_2/(1 + θ_1² + θ_2²), and ρ_s = 0 for s > q = 2.

If θ_1 = −0.5 and θ_2 = 0.25, sketch the autocorrelation function (acf) of the sample MA(2) process.
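The values in the exercise can be checked by simulation. A sketch with θ_1 = −0.5 and θ_2 = 0.25, for which ρ_1 ≈ −0.476 and ρ_2 ≈ 0.190:

```python
import numpy as np

rng = np.random.default_rng(5)
theta1, theta2, T = -0.5, 0.25, 500_000
eps = rng.normal(size=T + 2)
y = eps[2:] + theta1 * eps[1:-1] + theta2 * eps[:-2]   # zero-mean MA(2)

def acf(x, lag):
    """Sample autocorrelation at the given lag."""
    x = x - x.mean()
    return (x[lag:] * x[:-lag]).mean() / (x * x).mean()

denom = 1 + theta1**2 + theta2**2
print(acf(y, 1), (theta1 + theta1 * theta2) / denom)   # both approx -0.476
print(acf(y, 2), theta2 / denom)                       # both approx  0.190
print(acf(y, 3))                                       # approx 0: cutoff above q = 2
```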
ARMA
The (p, q)-order ARMA process of a variable y combines the AR(p) and MA(q) processes:

  φ(L) y_t = µ + θ(L) ε_t,

where

  φ(L) := (1 − φ_1 L − φ_2 L² − · · · − φ_p L^p),
  θ(L) := 1 + θ_1 L + θ_2 L² + · · · + θ_q L^q.

The mean of an ARMA series depends only on the coefficients of the autoregressive process and not on the coefficients of the moving average process:

  E[y_t] = µ / (1 − φ_1 − φ_2 − · · · − φ_p)
Forecasting with an MA(q) model
• Predict the value y_{t+1} given information Ω_t at time t using the equation:

  y_{t+1} = µ + θ_1 ε_t + θ_2 ε_{t−1} + θ_3 ε_{t−2} + ε_{t+1}

• The forecast is the conditional expectation of the prediction. The one-step ahead forecast at time t is denoted f_{t,1}:

  f_{t,1} = E[y_{t+1}|Ω_t] = µ + θ_1 ε_t + θ_2 ε_{t−1} + θ_3 ε_{t−2},

  since the conditional expectation E[ε_{t+1}|Ω_t] = 0.

• The two-step ahead forecast f_{t,2} is given by:

  f_{t,2} = E[y_{t+2}|Ω_t] = µ + θ_2 ε_t + θ_3 ε_{t−1}

• In general, an MA(q) process has a memory of q periods, so forecasts q + 1 or more steps ahead collapse to the intercept µ.
Forecasting with an AR(p) model
• Recall that an AR(p) process has infinite memory.
• In general, the s-step ahead forecast for an AR(2) process is given by

  f_{t,s} = E[y_{t+s}|Ω_t] = µ + φ_1 f_{t,s−1} + φ_2 f_{t,s−2};

  thus we need to compute the one-step and two-step ahead forecasts first.

• ARMA(p, q): Putting the MA(q) and AR(p) s-step ahead forecasting equations together, we get

  f_{t,s} = Σ_{i=1}^p a_i f_{t,s−i} + Σ_{j=1}^q b_j ε_{t+s−j},

  where f_{t,s} = y_{t+s} for s ≤ 0 and ε_{t+s} = 0 for s > 0.
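The recursion can be sketched directly for the zero-mean case (µ = 0); the function and all names here are hypothetical, not from the lecture code:

```python
def arma_forecast(y, eps, a, b, s_max):
    """s-step ahead forecasts f_{t,s} for a zero-mean ARMA(p, q), following
    the recursion above: f_{t,s} = y_{t+s} for s <= 0, and future shocks
    eps_{t+s}, s > 0, are replaced by their conditional expectation, zero."""
    p, q = len(a), len(b)
    t = len(y) - 1                                  # forecast origin
    f = {s: y[t + s] for s in range(-(p - 1), 1)}   # known past values
    for s in range(1, s_max + 1):
        ar = sum(a[i - 1] * f[s - i] for i in range(1, p + 1))
        ma = sum(b[j - 1] * eps[t + s - j]
                 for j in range(1, q + 1) if s - j <= 0)
        f[s] = ar + ma
    return [f[s] for s in range(1, s_max + 1)]

# One-step ahead ARMA(2,2) forecast:
#   f_{t,1} = a1*y_t + a2*y_{t-1} + b1*eps_t + b2*eps_{t-1}
#           = 0.5*2 + 0.2*1 + 0.3*(-0.5) + 0.1*0.5 = 1.1
print(arma_forecast(y=[1.0, 2.0], eps=[0.5, -0.5],
                    a=[0.5, 0.2], b=[0.3, 0.1], s_max=1))
```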
Exercise 4
Consider the following ARMA(2,2) 1-step ahead forecast
  f_{t,1} = a_1 f_{t,0} + a_2 f_{t,−1} + b_1 ε_t + b_2 ε_{t−1}   (41)
          = a_1 y_t + a_2 y_{t−1} + b_1 ε_t + b_2 ε_{t−1}   (42)

where the a_i and b_j are the autoregressive and moving average coefficients respectively.
Parametric tests: time series analysis
• Augmented Dickey-Fuller test for time series stationarity, with the Schwert (heuristic) estimate for the maximum lag
• Durbin-Watson tests for co-integration of two variables
• Granger causality tests
Identification
• It may be difficult to observe the order of the MA and AR processes from the acf and pacf plots.

• It is often preferable to use the Akaike Information Criterion (AIC) to measure the quality of fit:

  AIC = ln(σ̂²) + 2k/T,

  where σ̂² is the residual variance (the residual sum of squares divided by the number of observations T) and k = p + q + 1 is the total number of parameters estimated.

• This criterion expresses a trade-off between the first term, the quality of fit, and the second term, a penalty proportional to the number of parameters.

• Adding more parameters to the model reduces the residuals but increases the penalty term; thus the AIC favors the best fit with the fewest number of parameters.

• The AIC is limited to asymptotic normality assumptions on the error.
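The criterion is a one-liner; a minimal sketch (the function name is hypothetical):

```python
import numpy as np

def aic(residuals, k):
    """AIC = ln(sigma^2) + 2k/T, with sigma^2 the residual sum of squares
    divided by the number of observations T, and k = p + q + 1 parameters."""
    r = np.asarray(residuals, dtype=float)
    T = len(r)
    sigma2 = np.sum(r**2) / T
    return np.log(sigma2) + 2 * k / T

# With unit residual variance and k = 1, T = 4: AIC = ln(1) + 2/4 = 0.5.
print(aic([1.0, -1.0, 1.0, -1.0], k=1))
```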
In Sample: Diagnostics
• A fitted model must be examined for underfitting with a white noise test. Box and Pierce propose the portmanteau statistic

  Q*(m) = T Σ_{l=1}^m ρ̂_l²

  as a test statistic for the null hypothesis

  H_0: ρ_1 = · · · = ρ_m = 0

  against the alternative hypothesis

  H_a: ρ_i ≠ 0 for some i ∈ {1, . . . , m},

  where the ρ̂_i are the sample autocorrelations of the residuals.

• The Box-Pierce statistic follows an asymptotically chi-squared distribution with m − p − q degrees of freedom.

• The decision rule is to reject H_0 if Q*(m) > χ²_α, where χ²_α denotes the 100(1 − α)-th percentile of a chi-squared distribution with m − p − q degrees of freedom and α is the significance level for rejecting H_0.
In Sample: Diagnostics
• The Ljung-Box test statistic increases the power of the test in finite samples:

  Q(m) = T(T + 2) Σ_{l=1}^m ρ̂_l² / (T − l).

• This statistic also follows an asymptotically chi-squared distribution, with m − p − q degrees of freedom when applied to the residuals of a fitted ARMA(p, q) model.
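A minimal numpy sketch of the statistic (the function name is hypothetical); for the toy series [1, 2, 3, 4], ρ̂_1 = 0.25 and hence Q(1) = 4 · 6 · 0.25²/3 = 0.5:

```python
import numpy as np

def ljung_box(residuals, m):
    """Ljung-Box Q(m) = T (T + 2) * sum_{l=1}^m rho_l^2 / (T - l), where
    rho_l is the lag-l sample autocorrelation of the residuals."""
    x = np.asarray(residuals, dtype=float)
    x = x - x.mean()
    T = len(x)
    denom = np.sum(x * x)
    q = sum((np.sum(x[l:] * x[:-l]) / denom) ** 2 / (T - l)
            for l in range(1, m + 1))
    return T * (T + 2) * q

print(ljung_box([1, 2, 3, 4], m=1))
```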
Time Series Cross Validation
[Figure labels: verification split — tuning hyper-parameters; testing split — assess performance.]
Figure: Time series cross validation, also referred to as walk-forward optimization, is used to avoid introducing look-ahead bias into the model.
Forecasting Accuracy
• The forecasting accuracy may be measured using the mean square error (MSE) or the mean absolute error (MAE) of the out-of-sample forecasts against the actual observations.

• Suppose that we have trained the model with T observations; then the MSE evaluated over the next m observations is:

  MSE = (1/m) Σ_{t=0}^{m−1} (y_{T+t+s} − f_{T+t,s})²

• The MAE is given by

  MAE = (1/m) Σ_{t=0}^{m−1} |y_{T+t+s} − f_{T+t,s}|
Question: Why use the MSE in preference to the MAE?
Predicting Events
• Suppose we have conditionally i.i.d. Bernoulli r.v.s X_t with p_t := P[X_t = 1|Ω_t] representing a binary event

• E[X_t | Ω_t] = 0 · (1 − p_t) + 1 · p_t = p_t

• V[X_t | Ω_t] = p_t(1 − p_t)

• Under an ARMA model:

  ln(p_t / (1 − p_t)) = φ^{−1}(L)(µ + θ(L) ε_t)
Entropy
C. Shannon, “A Mathematical Theory of Communication”, The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July, October, 1948.
Directional Performance: Confusion Matrix
              Predicted
Actual        up      down    total
up            m_11    m_12    m_10
down          m_21    m_22    m_20
total         m_01    m_02    m

  χ² = Σ_{i=1}^2 Σ_{j=1}^2 (m_ij − m_i0 m_0j / m)² / (m_i0 m_0j / m),

with 1 degree of freedom.
Example: Confusion Matrix
              Predicted
Actual        up      down    total
up            12      2       14
down          8       2       10
total         20      4       24

  χ² = 0.137, with a p-value of 0.71.
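The example can be reproduced directly from the 2 × 2 table; for one degree of freedom the p-value is P[χ²₁ > x] = erfc(√(x/2)), so no statistics library is needed. A small sketch:

```python
import math
import numpy as np

M = np.array([[12.0, 2.0],      # actual up:   predicted up, predicted down
              [8.0, 2.0]])      # actual down: predicted up, predicted down
row, col, m = M.sum(axis=1), M.sum(axis=0), M.sum()
E = np.outer(row, col) / m                       # expected counts under independence
chi2_stat = ((M - E) ** 2 / E).sum()
p_value = math.erfc(math.sqrt(chi2_stat / 2))    # chi-squared survival, 1 d.o.f.

print(round(chi2_stat, 3), round(p_value, 2))    # 0.137 0.71
```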
ROC chart
Performance Terminology
• Precision is TP/(TP + FP): the fraction of records that were positive from the group that the classifier predicted to be positive.

• True Positive Rate is TP/(TP + FN): the fraction of positive examples the classifier correctly identified. This is also known as Recall or Sensitivity.

• False Positive Rate is FP/(FP + TN): the fraction of negative examples which the classifier mis-identified as positive.

• Specificity is TN/(FP + TN) = 1 − FPR: the fraction of negative examples which the classifier correctly identified.
Practical Prediction Issues
• How to predict inherently rare events, e.g. black swans? Ans:over-sampling?
• How much training history?
• How frequently to retrain the model?
Confusion Matrices with Oversampling
Without oversampling (columns sum to one):

            Predicted: −1   Predicted: 0   Predicted: 1
True −1:    0.1188          0.0030         0.0466
True  0:    0.8500          0.9961         0.8653
True  1:    0.0312          0.0009         0.0881

With oversampling:

            Predicted: −1   Predicted: 0   Predicted: 1
True −1:    0.6938          0.0743         0.1295
True  0:    0.2188          0.8267         0.1244
True  1:    0.0875          0.0990         0.7461

Figure: Normalized confusion matrices (left) without and (right) with oversampling.
ROC Curves with Oversampling
[Figure data: true positive rate vs. false positive rate; AUC before oversampling: 0.543; AUC after oversampling: 0.784.]
Figure: ROC curves with and without oversampling.