Introduction | Neural Networks | Splines and NNs* | Training | Interpretability* | Equity Factor Modeling | Recurrent Neural Networks | Advanced RNNs | Convolutional Neural Networks | Autoencoders | Experiments
Deep Learning for Financial Econometrics

Matthew Dixon
[email protected]
Illinois Institute of Technology

Machine Learning in Finance Bootcamp, Fields Institute
University of Toronto, September 26th, 2019

Code: https://github.com/mfrdixon/ML_Finance_Codes

Slides with an * in the title are advanced material for mathematicians. ** denotes advanced programming details.
Overview
• Deep Learning for cross-sectional data
  • Background: review of supervised machine learning
  • Introduction to feedforward neural networks
  • Related approximation and learning theory
  • Training and back-propagation
  • Interpretability with factor modeling example
• Deep Learning for time-series data prediction
  • Understand the conceptual formulation and statistical explanation of RNNs
  • Understand the different types of RNNs, including plain RNNs, GRUs and LSTMs
  • Understand the conceptual formulation and statistical explanation of CNNs
  • Understand 1D CNNs for time series prediction
  • Understand how dilated CNNs capture multiple scales
Overview
• Deep Learning for dimensionality reduction
  • Understand principal component analysis for dimension reduction
  • Understand how to formulate a linear autoencoder
  • Understand how a linear autoencoder performs PCA
Quick Quiz
Which of the following statements are true?
A Machine learning is different from statistics: machine learning methods assume the data generation process is unknown
B The outputs from neural network classifiers are the probabilities of an input belonging to a class
C We need multiple hidden layers in the neural network to capture non-linearity
D Neural networks provide no statistical interpretability; they are 'black boxes' and the importance of the features is unknown
E Neural networks can only be fitted to stationary data
Quick Quiz
Which of the following statements are true?
A Machine learning is different from statistics: machine learning methods assume the data generation process is unknown [True]
B The outputs from neural network classifiers are the probabilities of an input belonging to a class [True]
C We need multiple hidden layers in the neural network to capture non-linearity [False(3)]
D Neural networks provide no statistical interpretability; they are 'black boxes' and the importance of the features is unknown [False]
E Neural networks can only be fitted to stationary data [False]

(3) The universal representation theorem suggests that we only need one hidden layer: Andrei Nikolaevich Kolmogorov, "On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition," AMS Translation, 28(2):55-59, 1963.
Quick Overview
Property        | Statistical Inference                | Machine Learning
----------------|--------------------------------------|-----------------------------------------------------------
Goal            | Causal models with explanatory power | Prediction performance, often with limited explanatory power
Data            | The data is generated by a model     | The data generation process is unknown
Framework       | Probabilistic                        | Algorithmic & Probabilistic
Expressability  | Typically linear                     | Non-linear
Model selection | Based on information criteria        | Numerical optimization
Scalability     | Limited to lower dimensional data    | Scales to higher dimensional input data
Robustness      | Prone to over-fitting                | Designed for out-of-sample performance
Supervised Machine Learning
• Machine learning addresses a fundamental prediction problem: construct a nonlinear predictor, Ŷ(X), of an output, Y, given a high dimensional input matrix X = (X_1, ..., X_P) of P variables.
• Machine learning can be simply viewed as the study and construction of an input-output map of the form
  Y = F(X) where X = (X_1, ..., X_P).
• The output variable, Y, can be continuous, discrete or mixed.
• For example, in a classification problem, F: X → Y where Y ∈ {1, ..., K} and K is the number of categories.
• We will denote the i-th observation of the data D := (X, Y) as (x_i, y_i) - this is a "feature vector" and response (a.k.a. "label"). Note that lower case refers to a specific observation in D.
Taxonomy of Most Popular Neural Network Architectures
feed forward | auto-encoder | convolution | recurrent | long / short term memory | neural Turing machines

Figure: Most commonly used deep learning architectures for modeling. Source: http://www.asimovinstitute.org/neural-network-zoo
FFWD Neural Networks

Neural network model:

  Y = F_{W,b}(X) + ε,

where F_{W,b}: R^p → R^d is a deep neural network with L layers:

  Ŷ(X) := F_{W,b}(X) = f^{(L)}_{W^{(L)},b^{(L)}} ∘ ⋯ ∘ f^{(1)}_{W^{(1)},b^{(1)}}(X).

• W = (W^{(1)}, ..., W^{(L)}) and b = (b^{(1)}, ..., b^{(L)}) are weight matrices and bias vectors.
• For any W^{(i)} ∈ R^{m×n}, we can write the matrix as n column m-vectors, W^{(i)} = [w^{(i)}_1, ..., w^{(i)}_n].
• Denote each weight as w^{(ℓ)}_{i,j} := (W^{(ℓ)})_{i,j}.
• X is an N × p matrix of observations in R^p.
• No assumptions are made on the distribution of the error, ε, other than that it is independently distributed.
Deep Neural Networks*
• Let σ: R → B ⊂ R denote a continuous, monotonically increasing function.
• A function f^{(ℓ)}_{W^{(ℓ)},b^{(ℓ)}}: R^n → R^m given by f(v) = W^{(ℓ)}σ^{(ℓ−1)}(v) + b^{(ℓ)}, with W^{(ℓ)} ∈ R^{m×n} and b^{(ℓ)} ∈ R^m, is a semi-affine function in v, e.g. f(v) = W tanh(v) + b.
• σ(·) are known activation functions applied to the output of the previous layer, e.g. σ(x) := max(x, 0).
• F_{W,b}(X) is a composition of semi-affine functions.
• If all the activation functions are linear, F_{W,b} is just linear regression, regardless of the number of layers L.
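The last bullet can be checked numerically: with identity activations, two stacked affine layers collapse to a single affine map. A small sketch (dimensions and weights are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

X = rng.normal(size=(6, 2))
# Two "layers" with identity activation...
deep = (X @ W1.T + b1) @ W2.T + b2
# ...equal a single affine map with W = W2 W1 and b = W2 b1 + b2.
flat = X @ (W2 @ W1).T + (W2 @ b1 + b2)
print(np.allclose(deep, flat))  # True
```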
Geometric Interpretation of FFWD Neural Networks
No hidden layers One hidden layer Two hidden layers
Geometric Interpretation of FFWD Neural Networks
Half-Moon Dataset
No hidden layers One hidden layer Two hidden layers
Why do we need more Neurons?
Geometric Interpretation of Neural Networks
25 hidden units 50 hidden units 75 hidden units
Figure: The number of hidden units is adjusted according to the requirements of the classification problem and can be very high for data sets which are difficult to separate.
Geometric Interpretation of Neural Networks
Figure: Hyperplanes defined by three activated neurons in the hidden layer.
Universal Representation Theorem [1989]*
• Let C^p := {f: R^p → R | f is continuous} be the set of continuous functions from R^p to R.
• Denote Σ^p(σ) as the class of functions
  {F_{W,b}: R^p → R : F_{W,b}(x) = W^{(2)}σ(W^{(1)}x + b^{(1)}) + b^{(2)}}.
• Consider Ω := (0, 1] and let C_0 be the collection of all open intervals in (0, 1].
• Then σ(C_0), the σ-algebra generated by C_0, is the Borel σ-algebra, B((0, 1]).
• Let M^p := {f: R^p → R | f is Borel measurable} denote the set of all Borel measurable functions from R^p to R.
• Denote the Borel σ-algebra of R^p as B^p.
Universal Representation Theorem*
Theorem (Hornik, Stinchcombe & White, 1989)
For every monotonically increasing activation function σ, every input dimension size p, and every probability measure µ on (R^p, B^p), Σ^p(σ) is uniformly dense on compacta in C^p and ρ_µ-dense in M^p.
Shattering
• What is the maximum number of points that can be arranged so that F_{W,b}(X) shatters them?
• i.e. for all possible assignments of binary labels to those points, does there exist a (W, b) such that F_{W,b} makes no errors when classifying that set of data points?
• Every distinct pair of points is separable with the linear threshold perceptron, so every data set of size 2 is shattered by the perceptron.
• However, the linear threshold perceptron is incapable of shattering triplets, for example X ∈ {−0.5, 0, 0.5} with labels Y = (0, 1, 0).
• In general, the VC dimension of the class of halfspaces in R^k is k + 1. For example, a 2-d plane shatters any three points, but cannot shatter four points.
• This maximum number of points is the Vapnik-Chervonenkis (VC) dimension and is one characterization of the learnability of a classifier.
Example of Shattering over the Interval [−1, 1]

Right point activated | Left point activated | None activated | Both activated

Figure: For the points {−0.5, 0.5}, there are weights and biases that activate only one of them (W = 1, b = 0 or W = −1, b = 0), none of them (W = 1, b = −0.75), and both of them (W = 1, b = 0.75).
VC Dimension Example
• Determine the VC dimension of the indicator function class over Ω = [0, 1]:
  F(x) = {f: Ω → {0, 1} | f(x) = 1_{x∈[t1,t2)} or f(x) = 1 − 1_{x∈[t1,t2)}, t1 < t2 ∈ Ω}.
• Suppose there are three points x_1, x_2 and x_3, and assume x_1 < x_2 < x_3 without loss of generality. Every possible binary labeling of the points is reachable, therefore we assert that VC(F) ≥ 3.
• With four points x_1, x_2, x_3 and x_4 (assumed increasing as always), you cannot, for example, label x_1 and x_3 with the value 1 and x_2 and x_4 with the value 0. So VC(F) = 3.
• A related, but alternative, approach to the VC dimension is a numerical analysis perspective.
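This argument can be verified by brute-force enumeration. The sketch below (an illustration, not from the slides) sweeps candidate thresholds t1 < t2 for both 1_[t1,t2) and its complement, collects every labeling reachable on a point set, and confirms that three points are shattered while the alternating labeling of four points is not:

```python
def reachable_labelings(xs):
    """Labelings of the points xs achievable by f = 1_[t1,t2) or 1 - 1_[t1,t2)."""
    xs = sorted(xs)
    # candidate thresholds: strictly below, at, between and above the points
    cands = [xs[0] - 1.0] + list(xs) + \
            [(a + b) / 2.0 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    out = set()
    for t1 in cands:
        for t2 in cands:
            if t1 < t2:
                lab = tuple(int(t1 <= x < t2) for x in xs)
                out.add(lab)                          # f = 1_[t1,t2)
                out.add(tuple(1 - v for v in lab))    # its complement
    return out

three = reachable_labelings([0.1, 0.5, 0.9])
four = reachable_labelings([0.1, 0.4, 0.6, 0.9])
print(len(three) == 8)          # True: all 2^3 labelings are reachable, so VC(F) >= 3
print((1, 0, 1, 0) in four)     # False: the alternating labeling is unreachable
```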
When is a neural network a spline?*
• Under certain choices of the activation function, we can construct NNs which are piecewise polynomial interpolants referred to as 'splines'.
• Let f: Ω → R be any Lipschitz continuous function over a bounded domain Ω ⊂ R^p:
  |f(x′) − f(x)| ≤ L|x′ − x|, ∀x, x′ ∈ Ω,
  for some constant L ≥ 0.
• Suppose that the values f_k := f(x_k) are known only at distinct points {x_k}_{k=1}^n in Ω.
• Form an orthogonal basis over Ω to give the interpolant
  f̂(x) = Σ_{k=1}^n φ_k(x) f_k, ∀x ∈ Ω,
  where the {φ_k(x)}_{k=1}^n are piecewise constant (PC) orthogonal basis functions, φ_i(x_j) = δ_ij (Kronecker delta).
When is a neural network a spline?*

• If the data points {x_k}_{k=1}^n are (almost) uniformly distributed in the bounded domain Ω, then the error of our PC interpolant scales as
  ‖f − f̂‖_{L²_{p,µ}(Ω)} ∼ O(n^{−1/p}).   (1)
Piecewise basis functions as differences of Heaviside functions*

• Heaviside functions (a.k.a. step functions) over R are defined as
  H(x) = 1 if x ≥ 0, and 0 if x < 0.
• Choose the domain to be the unit interval in R and restrict the data to a uniform grid of width 2ε so that x_k = (2k − 1)ε, k = 1, ..., n.
• Construct a piecewise constant basis from the difference of Heaviside functions:
  φ_k(x) = H(x − (x_k − ε)) − H(x − (x_k + ε)).
• The basis functions are indicator functions, φ_k = 1_{[x_k−ε, x_k+ε)}, ∀k = 1, ..., n, with compact support over [x_k − ε, x_k + ε) and centered about x_k.
Piecewise constant basis functions*

Figure: The first three piecewise constant basis functions, φ_1(x), φ_2(x) and φ_3(x), plotted over x ∈ [0, 1]; each is produced by the difference of neighboring step functions, φ_k(x) = H(x − (x_k − ε)) − H(x − (x_k + ε)).
Single Layer NNs Activated with Heaviside functions*

• Key Idea: Choose the biases b^{(1)}_k = −2(k − 1)ε and weights W^{(1)} = 1, then
  H(W^{(1)}x + b^{(1)}) = [H(x), H(x − 2ε), ..., H(x − 2(k − 1)ε), ..., H(x − 2(n − 1)ε)]^T.
• The two layer NN approximation, f̂(x) = W^{(2)}H(W^{(1)}x + b^{(1)}), is a PC interpolant when
  W^{(2)} = [f(x_1), f(x_2) − f(x_1), ..., f(x_n) − f(x_{n−1})].
Single Layer NNs Activated with Heaviside functions*
• This single layer neural network gives the following output:
  f̂(x) = f(x_1),  x ≤ 2ε,
        = f(x_2),  2ε < x ≤ 4ε,
          ...
        = f(x_k),  2(k − 1)ε < x ≤ 2kε,
          ...
        = f(x_n),  2(n − 1)ε < x ≤ 2nε.
• f̂(x) is a piecewise constant interpolant over [0, 1].
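The construction above can be reproduced in a few lines of NumPy. The sketch below (grid size n = 50 and the target sin(2πx) are illustrative choices consistent with the figure that follows) builds the Heaviside-activated network and checks that the interpolation error is bounded by Lε:

```python
import numpy as np

def heaviside_nn(x, f, n):
    """Single-hidden-layer NN with Heaviside activations reproducing the
    piecewise constant interpolant of f on a uniform grid of n cells."""
    eps = 1.0 / (2 * n)
    xk = (2 * np.arange(1, n + 1) - 1) * eps      # grid points x_k = (2k - 1) eps
    b1 = -2 * np.arange(n) * eps                   # biases b_k^(1) = -2(k - 1) eps
    H = (np.add.outer(x, b1) >= 0).astype(float)   # H(W^(1) x + b^(1)), with W^(1) = 1
    fk = f(xk)
    W2 = np.concatenate(([fk[0]], np.diff(fk)))    # telescoping output weights
    return H @ W2

f = lambda x: np.sin(2 * np.pi * x)
n = 50
x = np.linspace(0, 1, 1000, endpoint=False)
err = np.max(np.abs(heaviside_nn(x, f, n) - f(x)))
L = 2 * np.pi                                      # Lipschitz constant of sin(2 pi x)
print(err <= L / (2 * n))                          # True: error bounded by L * eps
```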
Single Layer NNs Activated with Heaviside functions*

Function values | Absolute error

Figure: The approximation of sin(2πx) using gridded input data and Heaviside activation functions. Left panel: function values over x ∈ [0, 1]; right panel: the absolute error |f(x) − f̂(x)|.
Single Layer NNs Activated with Heaviside functions *
Theorem
|f(x) − f̂(x)| ≤ δ, ∀x ∈ [0, 1], if there are at least n = ⌊1/(2ε) + 1⌋ hidden units, where δ := Lε.

Proof.
Since x_k = (2k − 1)ε, we have that f̂(x) = f(x_k) over the interval [x_k − ε, x_k + ε], which is the support of φ_k(x). By the Lipschitz continuity of f, the worst case error occurs at the edge of an interval, a distance ε from x_k, so that
  |f(x) − f̂(x)| = |f(x) − f(x_k)| ≤ L|x − x_k| ≤ Lε = δ.
Training with Regularization under IID noise
• With a training set D = {x_i, y_i}_{i=1}^N, solve the constrained optimization
  (Ŵ, b̂) = arg min_{(W,b)} (1/N) Σ_{i=1}^N L(y_i, Ŷ(x_i)).
• Often, the L2-norm for a traditional least squares problem is chosen as an error measure, and we then minimize the loss function combined with a penalty:
  L(y_i, ŷ_i) = ‖y_i − ŷ_i‖²₂ + λ₁‖W‖₁.
• ∇L is given in closed form by the chain rule and, through back-propagation, each layer's weights W^{(ℓ)} and biases b^{(ℓ)} are fitted with stochastic gradient descent.
Equivalence with OLS estimation
• Without activation and no hidden layers, the optimal parameters β̂ := [b̂, Ŵ] are an OLS estimator.
• If we assume that the errors ε_1, ε_2, ..., ε_N are white noise and let y := Y ∈ R so that the linear model is
  y = Xβ + ε,
then
  β̂ = (X′X)^{−1}X′y,
where the augmented matrix is
  X = [1  x_{1,1}  ...  x_{1,p}
       ⋮    ⋮     ⋱      ⋮
       1  x_{N,1}  ...  x_{N,p}].
Training
• Solve the constrained optimization
  (Ŵ, b̂) = arg min_{W,b} (1/N) Σ_{i=1}^N L(y_i, Ŷ_{W,b}(x_i)).
• If Y is continuous, then the L2-norm for a traditional least squares problem is chosen as an error measure, and we then minimize the loss function
  L(Y, Ŷ(X)) = ‖Y − Ŷ‖²₂.
• A 1-of-K encoding is used for a categorical response, so that Y is a K-binary vector, Y ∈ [0, 1]^K, and the value k is represented as Y_k = 1 and Y_j = 0, ∀j ≠ k. The loss function is the cross-entropy term
  L(Y, Ŷ(X)) = −Y^T ln Ŷ.
Derivative of the softmax function
• Using the quotient rule, f′(x) = [g′(x)h(x) − h′(x)g(x)] / [h(x)]², the derivative of σ := σ(x) can be written as:
  ∂σ_i/∂x_i = [exp(x_i)‖exp(x)‖₁ − exp(x_i)exp(x_i)] / ‖exp(x)‖₁²   (2)
            = [exp(x_i)/‖exp(x)‖₁] · [‖exp(x)‖₁ − exp(x_i)] / ‖exp(x)‖₁   (3)
            = σ_i(1 − σ_i)   (4)
Derivative of the softmax function
• For the case i ≠ j, the derivative is:
  ∂σ_i/∂x_j = [0 − exp(x_i)exp(x_j)] / ‖exp(x)‖₁²   (5)
            = −[exp(x_j)/‖exp(x)‖₁] · [exp(x_i)/‖exp(x)‖₁]   (6)
            = −σ_jσ_i   (7)
• This can be written compactly as ∂σ_i/∂x_j = σ_i(δ_ij − σ_j).
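The compact form σ_i(δ_ij − σ_j) is easy to validate against finite differences. In the sketch below, the test point and the max-subtraction for numerical stability are implementation choices, not from the slides:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())            # subtract max for numerical stability
    return e / e.sum()

x = np.array([0.2, -1.0, 0.7])
s = softmax(x)
J = np.diag(s) - np.outer(s, s)        # J_ij = sigma_i (delta_ij - sigma_j)

h = 1e-6                               # central finite differences
J_num = np.empty((3, 3))
for j in range(3):
    e_j = np.zeros(3)
    e_j[j] = h
    J_num[:, j] = (softmax(x + e_j) - softmax(x - e_j)) / (2 * h)
print(np.allclose(J, J_num, atol=1e-8))   # True
```

Since the softmax outputs sum to one, each column of the Jacobian sums to zero, which is a useful extra sanity check.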
Updating the Weight Matrices
• In machine learning, the goal is to find the weights and biases given an input X. We can express Ŷ as a function of the weight matrix W ∈ R^{K×M} and bias b ∈ R^K so that
  Ŷ(W, b) = σ ∘ I(W, b),
where the input function I: R^{K×M} × R^K → R^K is of the form I(W, b) := WX + b.
• Applying the multivariate chain rule to give the Jacobian of Ŷ(W, b) gives
  ∇Ŷ(W, b) = ∇(σ ∘ I)(W, b)   (8)
            = ∇σ(I(W, b)) · ∇I(W, b)   (9)
Updating the Weight Matrices
• Recall that the cross-entropy function is given by
  L(Y, Ŷ(X)) = −Y^T ln Ŷ.
• Since Y is a constant, we can express the cross-entropy as a function of (W, b):
  L(W, b) = L ∘ σ(I(W, b)).
• Applying the multivariate chain rule gives
  ∇L(W, b) = ∇(L ∘ σ)(I(W, b))
            = ∇L(σ(I(W, b))) · ∇σ(I(W, b)) · ∇I(W, b).
Back-propagation
• Use stochastic gradient descent to find the minimum
  (Ŵ, b̂) = arg min_{W,b} (1/N) Σ_{i=1}^N L(y_i, ŷ_i).
• Because of the compositional form of the model, the gradient can be easily derived using the chain rule for differentiation.
• This can be computed by a forward and then a backward sweep ('back-propagation') over the network, keeping track only of quantities local to each neuron.
Back-propagation
Forward Pass:
• Set Z^{(0)} = X and for ℓ = 1, ..., L set
  Z^{(ℓ)} = f^{(ℓ)}_{W^{(ℓ)},b^{(ℓ)}}(Z^{(ℓ−1)}) = σ^{(ℓ)}(W^{(ℓ)}Z^{(ℓ−1)} + b^{(ℓ)}).

Back-propagation:
• Define the back-propagation error δ^{(ℓ)} := ∇_{b^{(ℓ)}}L. Given δ^{(L)} = Ŷ − Y, for ℓ = L − 1, ..., 1 set
  δ^{(ℓ)} = (∇_{I^{(ℓ)}}σ^{(ℓ)})(W^{(ℓ+1)})^T δ^{(ℓ+1)},   (10)
  ∇_{W^{(ℓ)}}L = δ^{(ℓ)} ⊗ Z^{(ℓ−1)},   (11)
where ⊗ is the outer product of two vectors.
Updating the weights
• The weights and biases are updated according to the expressions:
  ΔW^{(ℓ)} = −γ∇_{W^{(ℓ)}}L = −γδ^{(ℓ)} ⊗ Z^{(ℓ−1)},   (12)
  Δb^{(ℓ)} = −γδ^{(ℓ)},   (13)
where γ is a learning rate parameter. See the back-propagation notebook example and the Appendix of the Chapter 2 lecture notes for more details.
• Mini-batch or offline updates involve using many observations of X at the same time. The batch size refers to the number of observations of X used in each pass.
• An epoch refers to a round-trip (forward + backward sweep) through all training examples.
Back-propagation example with three layers
• Suppose that a feed-forward network classifier has two sigmoid-activated hidden layers and a softmax-activated output layer.
• After a forward pass, the values {Z^{(ℓ)}}_{ℓ=1}^3 are stored and the error Ŷ − Y, where Ŷ = Z^{(3)}, is calculated for one observation of X.
• The back-propagation error and weight updates in the final layer are evaluated:
  δ^{(3)} = Ŷ − Y,
  ∇_{W^{(3)}}L = δ^{(3)} ⊗ Z^{(2)}.
From Equations 10 and 11, we update the back-propagation error and weight updates for Hidden Layer 2:
  δ^{(2)} = Z^{(2)}(1 − Z^{(2)}) ⊙ (W^{(3)})^T δ^{(3)},
  ∇_{W^{(2)}}L = δ^{(2)} ⊗ Z^{(1)}.
Repeating for Hidden Layer 1:
  δ^{(1)} = Z^{(1)}(1 − Z^{(1)}) ⊙ (W^{(2)})^T δ^{(2)},
  ∇_{W^{(1)}}L = δ^{(1)} ⊗ X.
We update the weights and biases using Equations 12 and 13, so that b^{(3)} → b^{(3)} − γδ^{(3)}, W^{(3)} → W^{(3)} − γδ^{(3)} ⊗ Z^{(2)}, and repeat for the other weight-bias pairs, (W^{(ℓ)}, b^{(ℓ)})_{ℓ=1}^2.
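The three-layer example can be coded directly from Equations 10-13. The sketch below (layer sizes, random seed and the single training point are illustrative assumptions) runs a forward pass, back-propagates δ^(3), δ^(2), δ^(1), and verifies the resulting gradient of the cross-entropy against a finite-difference check:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(x, Ws, bs):
    """Forward pass, storing Z^(0) = x and the activations Z^(1..3)."""
    Z = [x]
    for l, (W, b) in enumerate(zip(Ws, bs)):
        I = W @ Z[-1] + b
        Z.append(softmax(I) if l == len(Ws) - 1 else sigmoid(I))
    return Z

def loss(x, y, Ws, bs):
    return -y @ np.log(forward(x, Ws, bs)[-1])      # cross-entropy -Y^T ln Yhat

rng = np.random.default_rng(3)
sizes = [4, 5, 5, 3]                                 # input, two hidden layers, K = 3
Ws = [0.5 * rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
x = rng.normal(size=4)
y = np.array([0.0, 1.0, 0.0])                        # 1-of-K label

Z = forward(x, Ws, bs)
delta = Z[-1] - y                                    # delta^(3) = Yhat - Y
grads = []
for l in (2, 1, 0):
    grads.insert(0, np.outer(delta, Z[l]))           # eq (11): outer(delta, Z^(l))
    if l > 0:                                        # eq (10) for sigmoid layers
        delta = Z[l] * (1 - Z[l]) * (Ws[l].T @ delta)

# finite-difference check on one entry of W^(1)
h = 1e-6
Wp = [W.copy() for W in Ws]; Wp[0][0, 0] += h
Wm = [W.copy() for W in Ws]; Wm[0][0, 0] -= h
num = (loss(x, y, Wp, bs) - loss(x, y, Wm, bs)) / (2 * h)
print(abs(num - grads[0][0, 0]) < 1e-6)              # True
```

An SGD step would then be `Ws[l] -= gamma * grads[l]`, matching Equation 12.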
Stopping Criteria**
We can use callbacks in Keras to terminate the training based on stopping criteria.

from keras.callbacks import EarlyStopping
...
callbacks = [EarlyStopping(monitor='loss', min_delta=0, patience=2)]
...
model.fit(X, Y, callbacks=callbacks)
Explanatory Power of Neural Networks
• In a linear regression model
  Y = F_β(X) := β_0 + β_1X_1 + ⋯ + β_KX_K,
the sensitivities are
  ∂_{X_i}Y = β_i.
• In a FFWD neural network, we can use the chain rule:
  ∂_{X_i}Ŷ = ∂_{X_i}F_{W,b}(X) = ∂_{X_i} f^{(L)}_{W^{(L)},b^{(L)}} ∘ ⋯ ∘ f^{(1)}_{W^{(1)},b^{(1)}}(X).
• For example, with one hidden layer, σ(x) := tanh(x) and f^{(ℓ)}_{W^{(ℓ)},b^{(ℓ)}}(X) := σ(I^{(ℓ)}) := σ(W^{(ℓ)}X + b^{(ℓ)}):
  ∂_{X_j}Ŷ = Σ_i w^{(2)}_i (1 − σ²(I^{(1)}_i)) w^{(1)}_{i,j}, where ∂_x σ(x) = 1 − σ²(x).
Explanatory Power of Neural Networks
• Or in matrix form, using the Jacobian(4) J(I^{(1)}) = D(I^{(1)})W^{(1)} of some continuous σ:
  ∂_X Ŷ = W^{(2)}J(I^{(1)}) = W^{(2)}D(I^{(1)})W^{(1)},
where D_{i,i}(I) = σ′(I_i) and D_{i,j≠i} = 0.
• Bounds on the sensitivities are given by the product of the weight matrices:
  min(W^{(2)}W^{(1)}, 0) ≤ ∂_X Ŷ ≤ max(W^{(2)}W^{(1)}, 0).

(4) When σ is an identity function, the Jacobian is J(I^{(1)}) = W^{(1)}.
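The matrix form of the sensitivities can be checked numerically for a one-hidden-layer tanh network (sizes and random weights below are illustrative). The sketch compares W^(2)D(I^(1))W^(1) against central finite differences, and also checks the termwise bounds implied by 0 ≤ σ′ ≤ 1:

```python
import numpy as np

rng = np.random.default_rng(4)
W1, b1 = rng.normal(size=(8, 3)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

def net(x):                                    # one tanh hidden layer
    return W2 @ np.tanh(W1 @ x + b1) + b2

x = rng.normal(size=3)
I1 = W1 @ x + b1
D = np.diag(1 - np.tanh(I1) ** 2)              # D_ii = sigma'(I_i) for tanh
J = (W2 @ D @ W1).ravel()                      # dYhat/dX = W^(2) D(I^(1)) W^(1)

h = 1e-6
J_num = np.array([(net(x + h * e) - net(x - h * e)) / (2 * h)
                  for e in np.eye(3)]).ravel()
print(np.allclose(J, J_num, atol=1e-6))        # True

# since 0 <= sigma'(I_i) <= 1, each sensitivity lies between the sums of the
# negative and positive parts of the terms W2_i * W1_ij
terms = W2.ravel()[:, None] * W1
lower, upper = terms.clip(max=0).sum(0), terms.clip(min=0).sum(0)
print(bool(np.all(lower <= J + 1e-12) and np.all(J <= upper + 1e-12)))  # True
```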
Example: Step test
• The model is trained on the following data generation process, where the coefficients of the features are stepped:
  Y = Σ_{i=1}^{10} i X_i,  X_i ∼ U(0, 1).
Example: Step test
Figure: Step test: This figure shows the ranked importance of the input variables X1, ..., X10 in a fitted neural network with one hidden layer, under three measures: Importance (Sensitivity), Importance (Garson's algorithm) and Importance (Olden's algorithm).
Example: Friedman test
Y = 10 sin(πX1X2) + 20(X3 − 0.5)² + 10X4 + 5X5 + ε.
Figure: Friedman test: Estimated sensitivities of the fitted neural network to the inputs X1, ..., X10, under three measures: Importance (Sensitivity), Importance (Garson's algorithm) and Importance (Olden's algorithm).
Interaction Terms Introduced by Activation
• Even with one hidden layer, the activation function introduces interaction terms X_iX_j.
• To see this, consider the partial derivative
  ∂_{X_j}Ŷ = Σ_i w^{(2)}_i σ′(I^{(1)}_i) w^{(1)}_{i,j}
• and differentiate again with respect to X_k to give the Hessian:
  ∂²_{X_j,X_k}Ŷ = Σ_i w^{(2)}_i σ″(I^{(1)}_i) w^{(1)}_{i,j} w^{(1)}_{i,k}
  (for σ = tanh, σ″(x) = −2σ(x)(1 − σ²(x)))
• which is not zero everywhere unless σ is (piecewise) linear or constant.
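A finite-difference check of the Hessian formula for a one-hidden-layer tanh network (the sizes, seed and test point are illustrative assumptions; σ″(x) = −2σ(x)(1 − σ²(x)) for tanh):

```python
import numpy as np

rng = np.random.default_rng(6)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2 = rng.normal(size=(1, 5))

def net(x):                                       # one tanh hidden layer, scalar output
    return (W2 @ np.tanh(W1 @ x + b1))[0]

x = rng.normal(size=3)
t = np.tanh(W1 @ x + b1)
sig2 = -2 * t * (1 - t ** 2)                      # sigma'' for tanh
# H_jk = sum_i w2_i sigma''(I_i) w1_ij w1_ik
Hess = np.einsum('i,i,ij,ik->jk', W2.ravel(), sig2, W1, W1)

h = 1e-4                                          # second-order central differences
H_num = np.empty((3, 3))
I3 = np.eye(3)
for j in range(3):
    for k in range(3):
        ej, ek = h * I3[j], h * I3[k]
        H_num[j, k] = (net(x + ej + ek) - net(x + ej - ek)
                       - net(x - ej + ek) + net(x - ej - ek)) / (4 * h * h)
print(np.allclose(Hess, H_num, atol=1e-5))        # True
```

The nonzero off-diagonal entries are exactly the interaction terms; with a linear σ they would vanish.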
Interaction Terms Introduced by Activation
Figure: Friedman test: Estimated interaction terms (e.g. x_12, x_24, x_25) in the fitted neural network, ranked by Importance (Sensitivity).
The Jacobian using ReLU activation
• In matrix form, with σ(x) = max(x, 0), the Jacobian J can be written as a linear combination of Heaviside functions:
  J := J(X) = ∂_X Ŷ(X) = W^{(2)}J(I^{(1)}) = W^{(2)}H(W^{(1)}X + b^{(1)})W^{(1)},
where H_{ii}(I^{(1)}) = H(I^{(1)}_i) = 1_{I^{(1)}_i > 0} and H_{ij} = 0, i ≠ j.
• This can be written in matrix element form as
  J_{ij} = [∂_X Ŷ]_{ij} = Σ_{k=1}^n W^{(2)}_{ik} W^{(1)}_{kj} H(I^{(1)}_k) = Σ_{k=1}^n c_k H_k(I^{(1)}),
where c_k := W^{(2)}_{ik} W^{(1)}_{kj}.
• As a linear combination of indicator functions, we have
  J_{ij} = Σ_{k=1}^{n−1} a_k 1_{I^{(1)}_k > 0, I^{(1)}_{k+1} ≤ 0} + a_n 1_{I^{(1)}_n > 0},  a_k := Σ_{i=1}^k c_i.
The Jacobian using ReLU activation
• Alternatively, the Jacobian can be expressed in terms of X:
  J_{ij} = Σ_{k=1}^{n−1} a_k 1_{W^{(1)}_k X > −b^{(1)}_k, W^{(1)}_{k+1}X ≤ −b^{(1)}_{k+1}} + a_n 1_{W^{(1)}_n X > −b^{(1)}_n}.
• wlog: in the case when p = 1, this simplifies to a weighted sum of independent Bernoulli trials
  J_{ij} = Σ_{k=1}^{n−1} a_k 1_{x_k < X ≤ x_{k+1}} + a_n 1_{x_n < X},  j = 1,
where x_k := −b^{(1)}_k / W^{(1)}_k. The expectation of the Jacobian is given by
  µ_{ij} := E[J_{ij}] = Σ_{k=1}^n a_k p_k,  p_k := Pr[x_k < X ≤ x_{k+1}], ∀k = 1, ..., n − 1, and p_n := Pr[x_n < X].
• For finite weights, the expectation is bounded above by Σ_{k=1}^n a_k.
Bounds on the Variance
• The variance is
  V[J_{ij}] = Σ_{k=1}^{n−1} a_k V[1_{I^{(1)}_k > 0, I^{(1)}_{k+1} ≤ 0}] + a_n V[1_{I^{(1)}_n > 0}] = Σ_{k=1}^n a_k p_k(1 − p_k).
• If the weights are constrained so that the mean is constant, µ^n_{ij} = µ_{ij}, then the weights are a_k = µ/(n p_k), and the variance is monotonically increasing with the number of hidden units:
  V[J_{ij}] = µ (n − 1)/n < µ.
• Otherwise, in general, we have
  V[J_{ij}] ≤ µ_{ij} ≤ Σ_{k=1}^n a_k.
• Increasing variance with more hidden units is undesirable for interpretability - we want the opposite effect!
Robustness of Interpretability
• If the data is generated from a linear model, can the neural network recover the correct weights?
  Y = β_1X_1 + β_2X_2 + ε,  X_1, X_2, ε ∼ N(0, 1),  β_1 = 1, β_2 = 1.
• Compare OLS estimators with zero-hidden-layer NNs and single-hidden-layer NNs.
• Use tanh activation functions for smoothness of the Jacobian, because max(x, 0) gives a piecewise continuous Jacobian.
• Increase the number of hidden units and show that the variance of the sensitivities decreases.
Robustness of Interpretability
Method | β_0 | β_1 | β_2
OLS | 0 | 1.0154 | 1.018
NN_0 | b^(1) = 0.03 | W^(1)_1 = 1.0184141 | W^(1)_2 = 1.02141815
NN_1 | W^(2)σ(b^(1)) + b^(2) = 0.02 | E[W^(2)D^(1)(I^(1))W^(1)_1] = 1.013887 | E[W^(2)D^(1)(I^(1))W^(1)_2] = 1.02224
Robustness of Interpretability
Figure: (a) density of β_1; (b) density of β_2.
Robustness of Interpretability (β1)
Hidden Units | Mean | Median | Std. dev | 1% C.I. | 99% C.I.
2 | 0.980875 | 1.0232913 | 0.10898393 | 0.58121675 | 1.0729908
10 | 0.9866159 | 1.0083131 | 0.056483902 | 0.76814914 | 1.0322522
50 | 0.99183553 | 1.0029879 | 0.03123002 | 0.8698967 | 1.0182846
100 | 1.0071343 | 1.0175397 | 0.028034585 | 0.89689034 | 1.0296803
200 | 1.0152218 | 1.0249312 | 0.026156902 | 0.9119074 | 1.0363332

The confidence interval narrows with an increasing number of hidden units.
Robustness of Interpretability (β2)
Hidden Units | Mean | Median | Std. dev | 1% C.I. | 99% C.I.
2 | 0.98129386 | 1.0233982 | 0.10931312 | 0.5787732 | 1.073728
10 | 0.9876832 | 1.0091512 | 0.057096474 | 0.76264584 | 1.0339714
50 | 0.9903236 | 1.0020974 | 0.031827927 | 0.86471796 | 1.0152498
100 | 0.9842479 | 0.9946766 | 0.028286876 | 0.87199813 | 1.0065105
200 | 0.9976638 | 1.0074166 | 0.026751818 | 0.8920307 | 1.0189484

The confidence interval narrows with an increasing number of hidden units.
Linear Cross-sectional Factor Models
• Consider the linear cross-sectional factor model:

r_t = B_{t−1} f_t + ε_t,  t = 1, …, T

• B_t = [1 | b_{1,t} | ··· | b_{K,t}] is the N × (K + 1) matrix of factor exposures
• (b_{i,t})_k is the exposure (a.k.a. loading) of asset i to factor k at time t
• f_t = [α, f_{1,t}, …, f_{K,t}] is the (K + 1)-vector of factor returns
• r_t is the N-vector of asset returns
• ρ(f_t, ε_t) = 0 and cross-sectional heteroskedasticity E[ε²_{t,i}] = σ²_i, i ∈ {1, …, N}.
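A minimal numpy sketch of a single cross-section, with simulated exposures and returns (all sizes and values are illustrative): the factor returns f_t are recovered by regressing the N asset returns on the exposure matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
N, K = 200, 3                                   # assets, factors (toy sizes)

# Exposure matrix with an intercept column: B = [1 | b_1 | ... | b_K]
B = np.column_stack([np.ones(N), rng.standard_normal((N, K))])
f_true = np.array([0.01, 0.5, -0.3, 0.2])       # [alpha, f_1, ..., f_K]
r = B @ f_true + 0.05 * rng.standard_normal(N)  # asset returns with noise

# Cross-sectional OLS estimate of the factor returns
f_hat, *_ = np.linalg.lstsq(B, r, rcond=None)
```

With enough assets in the cross-section, the estimated factor returns are close to the simulated ones.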
Non-linear Non-parametric Cross-sectional Factor Models
• Consider the non-linear cross-sectional factor model:

r_t = F(B_t) + ε_t

• F : R^K → R is a differentiable non-linear function
• We do not assume that ε_t is Gaussian or require any other restrictions on ε_t, i.e. the error distribution is non-parametric
• The model is used only to predict the next-period returns, and stationarity of the factor returns is not required.
Neural Network Cross-sectional Factor Models [5]
• Neural network cross-sectional factor model:

r_t = F_{W,b}(B_t) + ε_t

where F_{W,b}(X) is a neural network with L layers

r̂ := F_{W,b}(X) = f^(L)_{W^(L),b^(L)} ◦ ··· ◦ f^(1)_{W^(1),b^(1)}(X)
[5] Dixon & Polson, Deep Fundamental Factor Models, 2019, arxiv.org/abs/1903.07677.
Experimental Setup
• Here, we define the universe as the top 250 stocks from the S&P 500 index, ranked by market cap.
• Factors are given by Bloomberg and reported monthly.
• Remove stocks with too many missing factor values, leaving 218 stocks.
• Use a 100-period experiment, fitting a cross-sectional regression model at each period and using the model to forecast the next period's monthly returns.
• At each time step, the canonical experimental design over t = 2, …, T is
  • (X^train_t, Y^train_t) = (B_{t−1}, r_{t−1})
  • (X^test_t, Y^test_t) = (B_t, r_t)
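The walk-forward design above can be sketched as follows, using simulated data and a cross-sectional OLS fit as a stand-in for the model (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, K = 12, 50, 2          # periods, assets, factors (toy sizes)

# Simulated panel: exposures B[t] and realized returns r[t] for each period
B = rng.standard_normal((T, N, K))
f = 0.1 * rng.standard_normal((T, K))                      # factor returns
r = np.einsum('tnk,tk->tn', B, f) + 0.02 * rng.standard_normal((T, N))

oos_mse = []
for t in range(1, T):
    # Train on the previous cross-section (B_{t-1}, r_{t-1}) ...
    f_hat, *_ = np.linalg.lstsq(B[t - 1], r[t - 1], rcond=None)
    # ... and test on the current one (B_t, r_t)
    r_pred = B[t] @ f_hat
    oos_mse.append(float(np.mean((r[t] - r_pred) ** 2)))
```

Each period contributes one out-of-sample MSE, mirroring the per-period refitting described above.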
MSEs
(a) Linear Regression (b) FFWD Neural Network
Overfitting
Figure: In-sample MSE as a function of the number of neurons in the hidden layer. The models are trained here without L1 regularization.
Managing the Bias-Variance Tradeoff with L1 Regularization
(a) In-sample (b) Out-of-Sample
Figure: The effect of L1 regularization on the MSE errors for a network with 50 neurons in the hidden layer.
Information Ratios (Random Selection of Stocks)
Figure: Random selection of x stocks from the universe in each period.
Information Ratios (In-Sample)
(a) Linear Regression (b) FFWD Neural Network
Figure: Selection of the x top performing stocks from the universe in each in-sample period.
Information Ratios (Out-of-Sample)
(a) Linear Regression (b) FFWD Neural Network
Figure: Selection of the x top performing stocks from the universe in each out-of-sample period.
Sensitivities
Figure: The sensitivities and their uncertainty can be estimated from data (black) or derived analytically (red).
Recurrent Neural Networks
• Understand the conceptual formulation and statistical explanationof RNNs
• Understand the different types of RNNs
• Two notebook examples with limit order book data and Bitcoin history
• Navigate further learning resources
Recurrent Neural Network Predictors
• Input-output pairs D = {x_t, y_t}_{t=1}^N are auto-correlated observations of X and Y at times t = 1, …, N
• Construct a nonlinear time series predictor, ŷ_{t+h}(X_t), of an output, y_{t+h}, using a high-dimensional input matrix of p-length sub-sequences X_t:

ŷ_{t+h} = F(X_t)  where  X_t = seq_{t−p+1,t}(X) = (x_{t−p+1}, …, x_t)

• x_{t−j} is the j-th lagged observation of x_t: x_{t−j} = L^j[x_t], for j = 0, …, p − 1.
Recurrent Neural Networks (p=6)
Figure: An unfolded RNN with p = 6 lags. At each time s = t−5, …, t, the input x_s updates the hidden units z^1_s, …, z^H_s, and the final hidden state produces the prediction ŷ_{t+h}.
Recurrent Neural Networks
• For each time step s = t − p + 1, …, t, a function F_h generates a hidden state z_s:

z_s = F_h(x_s, z_{s−1}) := σ(W_h x_s + U_h z_{s−1} + b_h),  W_h ∈ R^{H×d}, U_h ∈ R^{H×H}

• When the output is continuous, the model output from the final hidden state is given by:

ŷ_{t+h} = F_y(z_t) = W_y z_t + b_y,  W_y ∈ R^{K×H}

• When the output is categorical, the output is given by

ŷ_{t+h} = F_y(z_t) = softmax(W_y z_t + b_y)

• Goal: find the weight matrices W = (W_h, U_h, W_y) and biases b = (b_h, b_y).
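These recursions can be sketched directly in numpy. This is a toy forward pass with a tanh activation and random illustrative weights, not a trained model:

```python
import numpy as np

def rnn_forward(X, Wh, Uh, bh, Wy, by):
    """Plain RNN forward pass over the sub-sequence X = (x_{t-p+1}, ..., x_t):
    z_s = tanh(Wh x_s + Uh z_{s-1} + bh), then y_{t+h} = Wy z_t + by."""
    z = np.zeros(Wh.shape[0])        # initial hidden state z_{t-p} = 0
    for x_s in X:                    # s = t-p+1, ..., t
        z = np.tanh(Wh @ x_s + Uh @ z + bh)
    return Wy @ z + by

# Toy sizes: p = 6 lags, d = 1 input, H = 4 hidden units, K = 1 output
rng = np.random.default_rng(1)
p, d, H, K = 6, 1, 4, 1
X = rng.standard_normal((p, d))
Wh = 0.1 * rng.standard_normal((H, d))
Uh = 0.1 * rng.standard_normal((H, H))
Wy = rng.standard_normal((K, H))
y_pred = rnn_forward(X, Wh, Uh, np.zeros(H), Wy, np.zeros(K))
```

In training, W and b would be fitted by back-propagation through time; here only the forward recursion is shown.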
Univariate Example
• Consider the univariate, step-ahead, time series prediction ŷ_t = F(X_{t−1}), using the p previous observations {y_{t−i}}_{i=1}^p.
• Because this is a special case in which no input is available at time t (since we are predicting it), we form the hidden states up to time t − 1, i.e. z_{t−1}.
• Consider the simplest case of an RNN with one hidden unit, H = 1, no activation function, and input dimension d = 1.
Univariate Example
• If W_h = U_h = φ, |φ| < 1, W_y = 1 and b_h = b_y = 0.
• Then we can show that ŷ_t = F(X_{t−1}) is a zero-drift autoregressive, AR(p), model with geometrically decaying weights:

z_{t−p} = φ y_{t−p},
z_{t−p+1} = φ(z_{t−p} + y_{t−p+1}),
…
z_{t−1} = φ(z_{t−2} + y_{t−1}),
ŷ_t = z_{t−1},

where

ŷ_t = (φL + φ²L² + ··· + φ^p L^p)[y_t].
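A quick numerical check of this equivalence, with an arbitrary φ = 0.6 and p = 5 (values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
phi, p = 0.6, 5                        # illustrative values
y = rng.standard_normal(p)             # y_{t-p}, ..., y_{t-1}

# RNN recursion: z_{t-p} = phi*y_{t-p}, then z_s = phi*(z_{s-1} + y_s)
z = phi * y[0]
for y_s in y[1:]:
    z = phi * (z + y_s)
y_hat_rnn = z                          # y_hat_t = z_{t-1}

# Direct AR(p) form: geometrically decaying weight phi^j on the lag y_{t-j}
y_hat_ar = sum(phi ** j * y[p - j] for j in range(1, p + 1))
```

The unrolled recursion and the lag-polynomial form give the same prediction.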
Stability
• Given a lag-1 unit innovation, 1, the output from the hidden layer is

z_t = σ(W_h 0 + U_h 1 + b_h).

• For strict stationarity, we require that the absolute value of each component of the hidden variable under a unit impulse at lag 1 is less than 1:

|z_t|_j = |σ(U_h 1 + b_h)|_j < 1,

which is satisfied if |σ(x)| ≤ 1 and each element of U_h 1 + b_h is finite.

• Additionally, if σ is strictly monotone increasing then |z_t|_j under a lag-2 unit innovation is less than |z_t|_j under a lag-1 unit innovation:

|σ(U_h 1 + b_h)|_j > |σ(U_h σ(U_h 1 + b_h) + b_h)|_j. (14)

In particular, the choice tanh satisfies the requirement on σ.
Multiple Choice Question 1
Identify the following correct statements:

1. A linear recurrent neural network with a memory of p lags is an autoregressive model AR(p) with non-parametric error.
2. Recurrent neural networks, as time series models, are guaranteed to be stable, for any choice of weights.
3. The amount of memory in a shallow recurrent network corresponds to the number of times a single perceptron layer is unfolded.
4. The amount of memory in a deep recurrent network corresponds to the number of perceptron layers.
Overview
• RNNs assume that the data is covariance stationary. The autocovariance structure of the RNN is fixed, and therefore so must be that of the data.
• RNNs are trained using a version of back-propagation known as "back-propagation through time" (BPTT), which is prone to a vanishing gradient problem when the number of lags in the model is large.
• We will see extensions of these plain RNNs which maintain a long-term memory through an exponentially smoothed hidden state.
Exponential Smoothing
• Exponential smoothing is a type of forecasting or filtering method that exponentially decreases the weight of past and current observations to give smoothed predictions ŷ_{t+1}.
• It requires a single parameter, α, also called the smoothing factor or smoothing coefficient.
• This parameter controls the rate at which the influence of the observations at prior time steps decays exponentially.
• α is often set to a value between 0 and 1. Large values mean that the model pays attention mainly to the most recent past observations, whereas smaller values mean more of the history is taken into account when making a prediction.
Exponential Smoothing
• Exponential smoothing takes the forecast for the previous period, ŷ_t, and adjusts it with the forecast error, y_t − ŷ_t. The forecast for the next period becomes:

ŷ_{t+1} = ŷ_t + α(y_t − ŷ_t)

or equivalently

ŷ_{t+1} = α y_t + (1 − α) ŷ_t.

• Writing this as a geometrically decaying autoregressive series back to the first observation:

ŷ_{t+1} = α y_t + α(1 − α) y_{t−1} + α(1 − α)² y_{t−2} + α(1 − α)³ y_{t−3} + ··· + α(1 − α)^{t−1} y_1 + (1 − α)^t ŷ_1.
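The recursion and its geometric expansion can be checked numerically. A sketch with arbitrary data, assuming the common initialization ŷ_1 = y_1:

```python
import numpy as np

def exp_smooth(y, alpha):
    """Exponential smoothing y_hat_{t+1} = alpha*y_t + (1 - alpha)*y_hat_t,
    initialised with y_hat_1 = y_1."""
    y_hat = np.empty(len(y) + 1)
    y_hat[0] = y[0]
    for t, y_t in enumerate(y):
        y_hat[t + 1] = alpha * y_t + (1 - alpha) * y_hat[t]
    return y_hat

rng = np.random.default_rng(0)
y, alpha = rng.standard_normal(50), 0.3
y_hat = exp_smooth(y, alpha)

# The recursion equals the geometric expansion back to the first observation
t = len(y)
expansion = sum(alpha * (1 - alpha) ** j * y[t - 1 - j] for j in range(t)) \
            + (1 - alpha) ** t * y_hat[0]
```

Unrolling the recursion reproduces the geometrically decaying weights term by term.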
Exponential Smoothing
• We observe that smoothing introduces long-term memory: the model depends on the entire observed history, not just a sub-sequence used for prediction as in a plain RNN, for example.
• For geometrically decaying models, it is useful to characterize the model by its "half-life", the lag k at which the coefficient is equal to a half:

α(1 − α)^k = 1/2,

or

k = −ln(2α)/ln(1 − α).
Dynamic Smoothing
• Dynamic exponential smoothing is a time-dependent, convex combination of the smoothed output, ŷ_t, and the observation y_t:

ŷ_{t+1} = α_t y_t + (1 − α_t) ŷ_t,

where α_t ∈ [0, 1] denotes the dynamic smoothing factor. This can equivalently be written as the one-step-ahead forecast

ŷ_{t+1} = ŷ_t + α_t(y_t − ŷ_t).

• Hence the smoothing can be viewed as a form of dynamic forecast error correction: when α_t = 1, the one-step-ahead forecast merely repeats the current observation y_t, and when α_t = 0, the forecast error is ignored and the forecast is the current model output.
Dynamic Smoothing
• The smoothing can also be viewed as a weighted sum of the lagged observations, with lower or equal weights α_{t−s} ∏_{r=1}^{s} (1 − α_{t−r+1}) at the lag-s (s ≥ 1) past observation y_{t−s}:

ŷ_{t+1} = α_t y_t + ∑_{s=1}^{t−1} α_{t−s} ∏_{r=1}^{s} (1 − α_{t−r+1}) y_{t−s} + ∏_{r=0}^{t−1} (1 − α_{t−r}) ŷ_1,

where the last term is a time-dependent constant and typically we initialize the exponential smoother with ŷ_1 = y_1.

• Note that for any α_{t−r+1} = 1, the prediction ŷ_{t+1} will have no dependency on the lags {y_{t−s}}_{s≥r}. The model simply forgets the observations at or beyond the r-th lag.
• In the special case when the smoothing is constant and equal to 1 − α, then the above expression simplifies to

ŷ_{t+1} = α^{−1} y_t,
Dynamic Smoothing
• Putting this together gives the following dynamic RNN:

smoothing: h_t = α_t h̃_t + (1 − α_t) h_{t−1} (15)
smoother update: α_t = σ^(1)(U_α h_{t−1} + W_α x_t + b_α) (16)
hidden state update: h̃_t = σ(U_h h_{t−1} + W_h x_t + b_h), (17)

where σ^(1) is a sigmoid or Heaviside function and σ is any activation function.
Figure: An illustrative example of the response of a simplified GRU, compared with a plain RNN and an RNN with an exponentially smoothed hidden state under a constant α. The RNN loses memory of the unit impulse after three lags, whereas the RNNs with smoothed hidden states maintain memory of the first unit impulse even when the second unit impulse arrives. The difference between the dynamically smoothed RNN (the toy GRU) and the smoothed RNN with a fixed smoothing parameter appears insignificant. Keep in mind, however, that the dynamic smoothing model has much more flexibility in controlling the sensitivity of the smoothing to the unit impulses.
Gated Recurrent Units (GRUs)
• We introduce a reset, or switch, r_t, in order to forget the dependence of h̃_t on previous data.
• Effectively, r_t turns h̃_t from an RNN into a FNN.

h_t = Z_t h̃_t + (1 − Z_t) h_{t−1} (18)
Z_t = σ(U_Z h_{t−1} + W_Z x_t + b_Z) (19)
h̃_t = tanh(U_h (r_t · h_{t−1}) + W_h x_t + b_h) (20)
r_t = σ(U_r h_{t−1} + W_r x_t + b_r) (21)
ŷ_t = W_Y h_t + b_Y (22)

• r_t is the switching (reset) parameter which determines how much recurrence is used for the candidate hidden state h̃_t.
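A single GRU step following Eqs. (18)-(21) can be sketched in numpy. The parameters are random placeholders, and "·" is taken to mean element-wise multiplication:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, P):
    """One GRU step following Eqs. (18)-(21)."""
    r = sigmoid(P['Ur'] @ h_prev + P['Wr'] @ x + P['br'])             # reset (21)
    h_cand = np.tanh(P['Uh'] @ (r * h_prev) + P['Wh'] @ x + P['bh'])  # candidate (20)
    Z = sigmoid(P['UZ'] @ h_prev + P['WZ'] @ x + P['bZ'])             # update (19)
    return Z * h_cand + (1 - Z) * h_prev                              # smoothing (18)

# Toy parameters: H = 3 hidden units, d = 2 inputs (random placeholders)
rng = np.random.default_rng(0)
H, d = 3, 2
P = {k: 0.1 * rng.standard_normal((H, H)) for k in ('Ur', 'Uh', 'UZ')}
P.update({k: 0.1 * rng.standard_normal((H, d)) for k in ('Wr', 'Wh', 'WZ')})
P.update({k: np.zeros(H) for k in ('br', 'bh', 'bZ')})

h = np.zeros(H)
for x in rng.standard_normal((5, d)):      # run over a length-5 input sequence
    h = gru_step(x, h, P)
```

Because the update is a convex combination of a tanh candidate and the previous state, the hidden state stays bounded in (−1, 1).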
References
• S. Haykin, Kalman Filtering and Neural Networks, volume 47. John Wiley & Sons, 2004.
• M.F. Dixon, N. Polson and V. Sokolov, Deep Learning for Spatio-Temporal Modeling: Dynamic Traffic Flows and High Frequency Trading, Applied Stochastic Models in Business and Industry, arXiv:1705.09851, 2018.
• M.F. Dixon, Sequence Classification of the Limit Order Book using Recurrent Neural Networks, J. Computational Science, Special Issue on Topics in Computational and Algorithmic Finance, arXiv:1707.05642, 2018.
• M.F. Dixon, A High Frequency Trade Execution Model for Supervised Learning, High Frequency, arXiv:1710.03870, 2018.
Multiple Choice Question 1: Answers
1. A linear recurrent neural network with a memory of p lags is an autoregressive model AR(p) with non-parametric error.
2. Recurrent neural networks, as time series models, are guaranteed to be stable, for any choice of weights.
3. The amount of memory in a shallow recurrent network corresponds to the number of times a single perceptron layer is unfolded.
4. The amount of memory in a deep recurrent network corresponds to the number of perceptron layers.

Answer: 1, 2, 3
Learning Objectives
• Understand the conceptual formulation and statistical explanationof CNNs.
• Understand 1D CNNs for time series prediction.
• A notebook example on univariate time series prediction with 1D CNNs.
• Understand how dilation CNNs capture multi-scales.
• Understand 2D CNNs for image processing. Supplementary notebook.
Multiple Choice Question 2
Identify the following correct statements:
1. A feedforward network can, in principle, always be used for any supervised learning problem.
2. A linear recurrent neural network with a memory of p lags is an autoregressive model AR(p).
3. Recurrent neural networks, as time series models, are guaranteed to be stable, for any choice of weights and activation.
4. Feedforward and recurrent neural network predictors require that the data is stationary.
5. GRUs allow the hidden state to persist beyond a sequence by using smoothing.
Multiple Choice Question 2
Identify the following correct statements:
1. A feedforward network can, in principle, always be used for any supervised learning problem.
2. A linear recurrent neural network with a memory of p lags is an autoregressive model AR(p).
3. Recurrent neural networks, as time series models, are guaranteed to be stable, for any choice of weights and activation.
4. Feedforward and recurrent neural network predictors require that the data is stationary.
5. GRUs allow the hidden state to persist beyond a sequence by using smoothing.

Answer: 1, 2, 4, 5.
Convolutional neural networks (CNN)
• Convolutional neural networks (CNNs) are feedforward neural networks that can exploit local spatial structures in the input data.
• CNNs attempt to reduce the network size by exploiting spatial or temporal locality.
• CNNs use sparse connections between the different layers.
• Convolution is frequently used for image processing, such as for smoothing, sharpening, and edge detection of images.
Convolutional Neural Networks
Figure: Convolutional Neural Networks. Source: http://www.asimovinstitute.org/neural-network-zoo
Weighted Moving Average Smoothers
• A common technique in time series analysis and signal processing is to filter the time series.
• We have already seen exponential smoothing as a special case of a class of smoothers known as "weighted moving average" (WMA) smoothers.
• WMA smoothers take the form

x̄_t = (1/∑_{i∈I} w_i) ∑_{i∈I} w_i x_{t−i}

where x̄_t is the local mean of the time series. The weights are specified to emphasize or deemphasize particular observations x_{t−i} in the span |I|.
• Examples of well-known smoothers include the Hanning smoother h(3):

x̄_t = (x_{t−1} + 2x_t + x_{t+1})/4.

• Such smoothers have the effect of reducing noise in the time series [6].
• The smoother takes |I| samples of input at a time and then computes the weighted average to produce a single output. As the length of the filter increases, the smoothness of the output increases, whereas the sharp modulations in the data are smoothed out.

[6] The moving average filter is a simple low-pass Finite Impulse Response (FIR) filter.
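The Hanning smoother h(3) can be implemented with a short convolution. A sketch (symmetric, centered windows assumed; the ends are left unsmoothed, as in the footnote):

```python
import numpy as np

def wma_smooth(x, w):
    """Weighted moving average x_bar_t = sum_i w_i x_{t-i} / sum_i w_i for a
    symmetric, centered window w; the ends are left unsmoothed."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    k = len(w) // 2
    x_bar = x.copy()
    x_bar[k:len(x) - k] = np.convolve(x, w, mode='valid')
    return x_bar

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 4 * np.pi, 200)) + 0.3 * rng.standard_normal(200)
x_hanning = wma_smooth(x, [1, 2, 1])       # the Hanning smoother h(3)
```

Each interior point becomes (x_{t−1} + 2x_t + x_{t+1})/4, while the endpoints are passed through unchanged.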
Moving Average Smoothers
Figure: Example of a moving average filter applied to time series.
Convolution for Time Series
• The moving average filter is in fact a (discrete) convolution using a very simple filter kernel.
• The simplest discrete convolution gives the relation between x_i and x_j:

x_{t−i} = ∑_{j=0}^{t−1} δ_{ij} x_{t−j},  i ∈ {0, …, t − 1},

where we have used the Kronecker delta δ.
• The kernel-filtered time series is a convolution [7]:

x̄_{t−i} = ∑_{j∈J} K_{j+k+1} x_{t−i−j},  i ∈ {k + 1, …, p − k}.

[7] For simplicity, the ends of the sequence are assumed to be unsmoothed, but for notational reasons we set x̄_{t−i} = x_{t−i} for i ∈ {1, …, k} ∪ {p − k + 1, …, p}.
Convolution for Time Series
• The filtered AR(p) model is

x̂_t = µ + ∑_{i=1}^{p} φ_i x̄_{t−i} (23)
    = µ + (φ_1 L + φ_2 L² + ··· + φ_p L^p)[x̄_t] (24)
    = µ + [L, L², …, L^p] φ [x̄_t], (25)

with coefficients φ := [φ_1, …, φ_p]^T [8].

[8] Note that there is no look-ahead bias because we do not smooth the last k values of the observed data {x_s}_{s=1}^t.
1D CNN with one hidden unit
• A simple 1D CNN consisting of a feedforward output layer and a non-activated hidden layer with one unit (i.e. kernel):

x̂_t = W_y z_t + b_y,  z_t = [x̄_{t−1}, …, x̄_{t−p}]^T,  W_y = φ^T,  b_y = µ,

where x̄_{t−i} is the i-th output from a convolution of the p-length input sequence with a kernel consisting of 2k + 1 weights.
• These weights are fixed over time, and hence the CNN is only suited to prediction from stationary time series.
• Note also, in contrast to a RNN, that the size of the weight matrix W_y increases with the number of lags in the model.
1D CNN with H hidden units
• The univariate CNN predictor with p lags and H activated hidden units (kernels) is

x̂_t = W_y vec(z_t) + b_y (26)
[z_t]_{i,m} = σ(∑_{j∈J} K_{m,j+k+1} x_{t−i−j} + [b_h]_m) (27)
           = σ(K ∗ x_t + b_h) (28)

where m ∈ {1, …, H} denotes the index of the kernel, and the kernel matrix K ∈ R^{H×(2k+1)}, hidden bias vector b_h ∈ R^H and output matrix W_y ∈ R^{1×pH}.
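A direct, loop-based numpy sketch of this layer, with H = 4 kernels of width 2k + 1 = 3 and a tanh activation (all weights are random placeholders):

```python
import numpy as np

def conv1d_layer(x, K, bh):
    """1D convolution with H kernels of width 2k+1 and tanh activation:
    z[i, m] = tanh(sum_j K[m, j] x[i + j] + bh[m]) (no padding)."""
    H, width = K.shape
    n_out = len(x) - width + 1
    z = np.empty((n_out, H))
    for m in range(H):                       # loop over kernels
        for i in range(n_out):               # slide the kernel along x
            z[i, m] = np.tanh(K[m] @ x[i:i + width] + bh[m])
    return z

rng = np.random.default_rng(0)
p, H, k = 10, 4, 1                           # lags, kernels, kernel half-width
x = rng.standard_normal(p)                   # input sub-sequence of length p

K = 0.5 * rng.standard_normal((H, 2 * k + 1))
z = conv1d_layer(x, K, np.zeros(H))          # feature maps, shape (p - 2k, H)

# Linear read-out: x_hat_t = Wy vec(z_t) + by
Wy = 0.1 * rng.standard_normal(z.size)
x_hat = float(Wy @ z.ravel())
```

Note that without padding each feature map has length p − 2k rather than p, so the read-out weight vector here has (p − 2k)H entries.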
Remarks on CNNs
• Since the size of W_y increases with both the number of lags and the number of kernels, it may be preferable to reduce the dimensionality of the weights with an additional layer and hence avoid over-fitting.
• Convolutional neural networks are not limited to sequential models. One might, for example, sample the past lags non-uniformly so that I = {2^i}_{i=1}^p; then the maximum lag in the model is 2^p. Such a non-sequential model allows a large maximum lag without capturing all the intermediate lags.
Stationarity
• A univariate linear CNN predictor, with one kernel and no activation, can be written in the canonical form

x̂_t = µ + (1 − Φ(L))[K ∗ x_t] = µ + K ∗ (1 − Φ(L))[x_t] (29)
    = µ + (φ̄_1 L + ··· + φ̄_p L^p)[x_t] := µ + (1 − Φ̄(L))[x_t], (30)

where, by the linearity of Φ(L) in x_t, the convolution commutes and thus we can write φ̄ := K ∗ φ.
• Provided that Φ̄(L)^{−1} forms a divergent sequence in the noise process {ε_s}_{s=1}^t, then the model is stable.
• Finding the roots of the characteristic equation

Φ̄(z) = 0,

it follows that the linear CNN is strictly stationary and ergodic if all the roots lie outside the unit circle in the complex plane: |λ_i| > 1, i ∈ {1, …, p}.
Equivalent Eigenvalue Problem*
• Finding roots of polynomials is equivalent to finding eigenvalues of the "companion matrix".
• A famous theorem states that the roots of any polynomial can be found by turning it into a matrix and finding the eigenvalues.
• Given the degree-p polynomial [9]

q(z) = c_0 + c_1 z + ··· + c_{p−1} z^{p−1} + z^p,

we define the p × p companion matrix

C := [ 0    1    0    …    0        ]
     [ 0    0    1    …    0        ]
     [ ⋮              ⋱    ⋮        ]
     [ 0    0    0    …    1        ]
     [ −c_0 −c_1 …  −c_{p−2} −c_{p−1} ]

Then the characteristic polynomial det(C − λI) = q(λ), and so the eigenvalues of C are the roots of q [10].

[9] Notice that the z^p coefficient is 1.
[10] Note that if the polynomial does not have a unit leading coefficient, then one can divide the polynomial by that coefficient to put it in monic form, without changing its roots.
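A sketch of the companion-matrix construction for a small monic polynomial, cross-checked against numpy's eigenvalue and root routines:

```python
import numpy as np

def companion(c):
    """Companion matrix of the monic polynomial
    q(z) = c[0] + c[1] z + ... + c[p-1] z^{p-1} + z^p."""
    p = len(c)
    C = np.zeros((p, p))
    C[:-1, 1:] = np.eye(p - 1)          # ones on the super-diagonal
    C[-1, :] = -np.asarray(c, float)    # last row: -c_0, ..., -c_{p-1}
    return C

# q(z) = z^2 - 3z + 2 = (z - 1)(z - 2), with roots 1 and 2
c = [2.0, -3.0]
eigs = np.linalg.eigvals(companion(c))

# Cross-check with numpy's root finder (coefficients highest-degree first)
roots = np.roots([1.0, -3.0, 2.0])
```

numpy's `roots` itself forms a companion matrix internally, so the two computations agree.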
Strided Convolution

• In the slides above, we performed the convolution by sliding the image area by a unit increment. This is known as using a stride, s, of one [11].
• More generally, given an integer s ≥ 1, a convolution with stride s for an input (feature) f ∈ R^m is defined as:

[K ∗_s f]_i = ∑_{p=−k}^{k} K_p f_{s(i−1)+1+p},  i ∈ {1, …, ⌈m/s⌉}. (31)

Here ⌈m/s⌉ denotes the smallest integer greater than or equal to m/s.

[11] A common choice in CNNs is to take s = 2.
Dilated Convolution
• In addition to image processing, CNNs have also been successfully applied to time series. WaveNet, for example, is a CNN developed for audio processing.
• Time series often display long-term correlations. Moreover, the dependent variable(s) may exhibit non-linear dependence on the lagged predictors.
• Recall that a CNN for time series is a non-linear p-auto-regression of the form

y_t = ∑_{i=1}^{p} φ_i(x_{t−i}) + ε_t (32)

where the coefficient functions φ_i, i ∈ {1, …, p}, are data-dependent and optimized through the convolutional network.
• A dilated convolution effectively allows the network to operate on a coarser scale than with a normal convolution. This is similar to pooling or strided convolutions, but here the output has the same size as the input.
• The network can learn multi-scale, non-linear dependencies by using stacked layers of dilated convolutions.
Dilated convolution
• In a dilated convolution the filter is applied to every d-th element in the input vector, allowing the model to efficiently learn connections between far-apart data points.
• For an architecture with L layers of dilated convolutions, ℓ ∈ {1, …, L}, a dilated convolution outputs a stack of "feature maps" given by

[K^(ℓ) ∗_{d^(ℓ)} f^(ℓ−1)]_i = ∑_{p=−k}^{k} K^(ℓ)_p f^(ℓ−1)_{d^(ℓ)(i−1)+1+p},  i ∈ {1, …, ⌈m/d^(ℓ)⌉}, (33)

where d is the dilation factor and we can choose the dilations to increase by a factor of two: d^(ℓ) = 2^{ℓ−1}.
• The filters for each layer, K^(ℓ), are chosen to be of size 1 × k′ = 1 × 2, where k′ := 2k + 1.
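A loop-based sketch of a stack of dilated convolutions, using a width-2 averaging kernel and a simplified, 0-based version of the indexing in Eq. (33) (the window origin advances by the dilation factor d; this indexing is an assumption for illustration):

```python
import numpy as np

def dilated_conv(f, K, d):
    """Apply kernel K with its window origin at every d-th element of f,
    a 0-based simplification of the indexing in Eq. (33)."""
    n_out = int(np.ceil(len(f) / d))
    out = np.zeros(n_out)
    for i in range(n_out):
        for p, K_p in enumerate(K):
            idx = d * i + p                 # window origin spaced by d
            if idx < len(f):
                out[i] += K_p * f[idx]
    return out

# Three stacked layers with dilations d = 1, 2, 4 (i.e. d^(l) = 2^{l-1})
f = np.arange(16, dtype=float)
K = np.array([0.5, 0.5])                    # width-2 averaging kernel
for d in (1, 2, 4):
    f = dilated_conv(f, K, d)
```

Each layer coarsens the scale by its dilation factor, so a length-16 input is reduced to 16, then 8, then 2 values across the three layers.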
Dilated Convolution
• The output is the forecasted time series Y = (y_t)_{t=0}^{N−1}.

Figure: A dilated convolutional neural network with three layers. The receptive field is r = 8, i.e. one output value is influenced by eight input neurons.

• The receptive field depends on the number of layers L and the filter size k′, and is given by

  r = 2^{L−1} k′.   (34)
Learning Objectives
• Understand principal component analysis for dimension reduction.
• Understand how to formulate a linear autoencoder.
• Understand how a linear autoencoder performs PCA.
• Notebook example using dimension reduction on the yield curve.
Preliminaries
• Let {y_i}_{i=1}^N be a set of N observation vectors, each of dimension n. We assume that n ≤ N.

• Let Y ∈ R^{n×N} be the matrix whose columns are {y_i}_{i=1}^N:

  Y = [ y_1 · · · y_N ].

• The element-wise average of the N observations is an n-dimensional signal which may be written as:

  ȳ = (1/N) Σ_{i=1}^N y_i = (1/N) Y 1_N,

  where 1_N ∈ R^{N×1} is a column vector of all ones.

• Let Y_0 be the matrix whose columns are the demeaned observations (we center each observation y_i by subtracting ȳ from it):

  Y_0 = Y − ȳ 1_N^T.
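In numpy, with the observations stored as columns, the demeaning step is a one-liner. A small sketch with hypothetical random data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 3, 5                                  # n <= N, as assumed above
Y = rng.normal(size=(n, N))                  # columns are the observations y_i

y_bar = Y.mean(axis=1, keepdims=True)        # (1/N) Y 1_N, shape (n, 1)
Y0 = Y - y_bar                               # Y - y_bar 1_N^T

print(np.allclose(Y0.mean(axis=1), 0.0))     # True: each row of Y0 has zero mean
```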
Projection
• A linear projection from R^n to R^m is a linear transformation of a finite-dimensional vector given by a matrix multiplication:

  x_i = W^T y_i,

  where y_i ∈ R^n, x_i ∈ R^m, and W ∈ R^{n×m}.

• Each element j of the vector x_i is an inner product between y_i and the j-th column of W, which we denote by w_j.

• Let X ∈ R^{m×N} be the matrix whose columns are the set of N transformed observations, let x̄ = (1/N) Σ_{i=1}^N x_i = (1/N) X 1_N be the element-wise average, and let X_0 = X − x̄ 1_N^T be the demeaned matrix.

• Clearly, X = W^T Y and X_0 = W^T Y_0.
Principal Component Projection*
• When the matrix W^T represents the transformation that applies principal component analysis, we denote W = P, and the columns of the orthonormal matrix P (Footnote 12: i.e. P^{−1} = P^T), denoted {p_j}_{j=1}^n, are referred to as loading vectors.

• The transformed vectors {x_i}_{i=1}^N are referred to as principal components or scores.

• The first loading vector is defined as the unit vector with which the inner products of the observations have the greatest variance:

  p_1 = argmax_{w_1} w_1^T Y_0 Y_0^T w_1   s.t. w_1^T w_1 = 1.   (35)

• The solution to (35) is known to be the eigenvector of the sample covariance matrix Y_0 Y_0^T corresponding to its largest eigenvalue (Footnote 13: we normalize the eigenvector and disregard its sign).
Principal Component Projection*
• Next, p_2 is the unit vector which has the largest variance of inner products between it and the observations, after removing the orthogonal projections of the observations onto p_1. It may be found by solving:

  p_2 = argmax_{w_2} w_2^T (Y_0 − p_1 p_1^T Y_0)(Y_0 − p_1 p_1^T Y_0)^T w_2   s.t. w_2^T w_2 = 1.   (36)
• The solution to (36) is known to be the eigenvector corresponding to the largest eigenvalue under the constraint that it is not collinear with p_1.

• Similarly, the remaining loading vectors are equal to the remaining eigenvectors of Y_0 Y_0^T, corresponding to descending eigenvalues.
Principal Components
• The eigenvalues of Y_0 Y_0^T, which is a positive semi-definite matrix, are non-negative.

• They are not necessarily distinct, but since the matrix is symmetric it has n mutually orthogonal eigenvectors and is always diagonalizable.
• Thus, the matrix P may be computed by diagonalizing the covariance matrix:

  Y_0 Y_0^T = P Λ P^{−1} = P Λ P^T,

  where Λ = X_0 X_0^T is a diagonal matrix whose diagonal elements {λ_i}_{i=1}^n are sorted in descending order.
• The transformation back to the observations is Y = PX.
• The fact that the covariance matrix of X is diagonal means that PCA is a decorrelation transformation, and it is often used to denoise data.
Dimensionality reduction
• PCA is often used as a method for dimensionality reduction, the process of reducing the number of variables in a model in order to avoid the curse of dimensionality.
• PCA gives the first m principal components (m < n) by applying the truncated transformation

  X_m = P_m^T Y,

  where each column of X_m ∈ R^{m×N} is a vector whose elements are the first m principal components, and P_m is the matrix whose columns are the first m loading vectors:

  P_m = [ p_1 · · · p_m ] ∈ R^{n×m}.

• Intuitively, by keeping only m principal components we lose information, and we minimize this loss of information by maximizing their variances.
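The loading vectors and scores can be computed exactly as described, by diagonalizing Y_0 Y_0^T. A small numpy sketch with synthetic data (all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, m = 4, 200, 2
Y = rng.normal(size=(n, N))
Y0 = Y - Y.mean(axis=1, keepdims=True)       # demeaned observations

# Diagonalize Y0 Y0^T; eigh returns ascending eigenvalues, so reverse them.
lam, P = np.linalg.eigh(Y0 @ Y0.T)
order = np.argsort(lam)[::-1]
lam, P = lam[order], P[:, order]

Pm = P[:, :m]                 # first m loading vectors
Xm = Pm.T @ Y0                # first m principal components (scores)

# The scores are decorrelated: Xm Xm^T = diag(lambda_1, ..., lambda_m).
C = Xm @ Xm.T
print(np.allclose(C, np.diag(lam[:m])))      # True
```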
Minimum total squared reconstruction error
• P_m is also a solution to:

  min_{W ∈ R^{n×m}} ‖Y_0 − W W^T Y_0‖_F²   s.t. W^T W = I_{m×m},   (37)

  where ‖·‖_F denotes the Frobenius matrix norm.

• The m leading loading vectors form an orthonormal basis which spans the m-dimensional subspace onto which the projections of the demeaned observations have the minimum squared difference from the original demeaned observations.

• In other words, P_m compresses each demeaned vector of length n into a vector of length m (where m ≤ n) in such a way that minimizes the sum of total squared reconstruction errors.

• The minimizer of (37) is not unique: W = P_m Q is also a solution, where Q ∈ R^{m×m} is any orthogonal matrix, Q^T = Q^{−1}. Multiplying P_m from the right by Q transforms the first m loading vectors into a different orthonormal basis for the same subspace.
Minimum total squared reconstruction error
• A neural network that is trained to learn the identity function f(Y) = Y is called an autoencoder. Its output layer has the same number of nodes as the input layer, and the cost function is some measure of the reconstruction error.

• Autoencoders are unsupervised learning models, and they are often used for the purpose of dimensionality reduction.

• A simple autoencoder that implements dimensionality reduction is a feed-forward autoencoder with at least one layer that has a smaller number of nodes, which functions as a bottleneck.

• After training the neural network using backpropagation, it is separated into two parts: the layers up to the bottleneck are used as an encoder, and the remaining layers are used as a decoder.

• In the simplest case, there is only one hidden layer (the bottleneck), and the layers in the network are fully connected.
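For the linear case, the bottleneck weights can also be fitted without backpropagation, by alternating least squares: each step solves one weight matrix in closed form while holding the other fixed. This is a sketch under that assumption, not the training procedure used in the lecture notebooks, and the data are synthetic; it recovers the same projector as PCA, as the following slides explain.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, m = 4, 300, 2
# Synthetic data with a dominant m-dimensional structure plus small noise.
Y = rng.normal(size=(n, m)) @ rng.normal(size=(m, N)) + 0.01 * rng.normal(size=(n, N))
Y0 = Y - Y.mean(axis=1, keepdims=True)

W2 = rng.normal(size=(n, m))                    # decoder weights, random init
for _ in range(100):                            # alternating least squares
    W1 = np.linalg.pinv(W2)                     # optimal encoder given decoder
    X = W1 @ Y0                                 # bottleneck codes
    W2 = Y0 @ X.T @ np.linalg.inv(X @ X.T)      # optimal decoder given codes

# The learned projector matches the PCA projector U_m U_m^T onto the
# principal subspace of Y0.
U = np.linalg.svd(Y0)[0]
proj_pca = U[:, :m] @ U[:, :m].T
proj_ae = W2 @ np.linalg.pinv(W2)
print(np.allclose(proj_ae, proj_pca, atol=1e-6))   # True
```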
Linear autoencoders*
• In the case that no non-linear activation function is used, x_i = W^(1) y_i + b^(1) and ŷ_i = W^(2) x_i + b^(2). If the cost function is the total squared difference between output and input, then training the autoencoder on the input data matrix Y solves:

  min_{W^(1), b^(1), W^(2), b^(2)} ‖Y − (W^(2)(W^(1) Y + b^(1) 1_N^T) + b^(2) 1_N^T)‖_F².   (38)

• If we set the partial derivative with respect to b^(2) to zero and insert the solution into (38), then the problem becomes:

  min_{W^(1), W^(2)} ‖Y_0 − W^(2) W^(1) Y_0‖_F².

• Thus, for any b^(1), the optimal b^(2) is such that the problem becomes independent of b^(1) and of ȳ. Therefore, we may focus only on the weights W^(1), W^(2).
Linear autoencoders as orthogonal projections
• Setting the gradients to zero, W^(1) is the left Moore-Penrose pseudoinverse of W^(2) (and W^(2) is the right pseudoinverse of W^(1)):

  W^(1) = (W^(2))† = ((W^(2))^T W^(2))^{−1} (W^(2))^T.

• The minimization with respect to a single matrix is:

  min_{W^(2) ∈ R^{n×m}} ‖Y_0 − W^(2)(W^(2))† Y_0‖_F².   (39)

  The matrix W^(2)(W^(2))† = W^(2)((W^(2))^T W^(2))^{−1}(W^(2))^T is the orthogonal projection operator onto the column space of W^(2), whose columns are not necessarily orthonormal.

• This problem is very similar to (37), but without the orthonormality constraint.
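That W^(2)(W^(2))† is an orthogonal projector is easy to verify numerically: it is symmetric and idempotent for any full-rank W^(2), orthonormal columns or not. A minimal sketch with a random W^(2):

```python
import numpy as np

rng = np.random.default_rng(2)
W2 = rng.normal(size=(5, 2))          # columns are not orthonormal
Pi = W2 @ np.linalg.pinv(W2)          # W^(2) (W^(2))^dagger

print(np.allclose(Pi @ Pi, Pi))       # True: idempotent
print(np.allclose(Pi, Pi.T))          # True: symmetric
print(np.linalg.matrix_rank(Pi))      # 2: projects onto a 2-D column space
```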
Linear autoencoders: mathematical properties*
• It can be shown that W^(2) is a minimizer of (39) if and only if its column space is spanned by the first m loading vectors of Y.

• The linear autoencoder is said to apply PCA to the input data in the sense that its output is a projection of the data onto the low-dimensional principal subspace.

• However, unlike actual PCA, the coordinates of the output of the bottleneck are correlated and are not sorted in descending order of variance.

• The solutions for reduction to different dimensions are not nested: when reducing the data from dimension n to dimension m_1, the first m_2 vectors (m_2 < m_1) are not an optimal solution to reduction from dimension n to m_2, which therefore requires training an entirely new autoencoder.
Equivalence of linear autoencoders and PCA*

Hypothesis: The first m loading vectors of Y are the first m left singular vectors of the matrix W^(2) which minimizes (39).

• Train the linear autoencoder on the original dataset Y and then compute the first m left singular vectors of W^(2) ∈ R^{n×m}, where typically m ≪ N.

• The loading vectors may also be recovered from the weights of the hidden layer, W^(1), by a singular value decomposition.

• If W^(2) = UΣV^T, which we assume is full-rank, then

  W^(1) = (W^(2))† = VΣ†U^T

  and

  W^(2)(W^(2))† = UΣV^T VΣ†U^T = UΣΣ†U^T = U_m U_m^T,   (40)

  where we used the fact that (V^T)† = V and that Σ† ∈ R^{m×n} is a matrix whose diagonal elements are 1/σ_j (assuming σ_j ≠ 0, and 0 otherwise).

• The matrix ΣΣ† is a diagonal matrix whose first m diagonal elements are equal to one and the other n − m elements are equal to zero.

• The matrix U_m ∈ R^{n×m} is a matrix whose columns are the first m left singular vectors of W^(2).

• Thus, the first m left singular vectors of (W^(1))^T ∈ R^{n×m} are also equal to the first m loading vectors of Y.
Figure: This figure shows the yield curve over time; each line corresponds to a different maturity in the term-structure of interest rates.
(a) P_m^T Y_0 Y_0^T P_m   (b) (W^(2))^T Y_0 Y_0^T W^(2)   (c) U_m^T Y_0 Y_0^T U_m

Figure: The covariance matrix of the data in the transformed coordinates, according to (a) the loading vectors computed by applying SVD to the entire dataset, (b) the weights of the linear autoencoder, and (c) the left singular vectors of the autoencoder weights.
(a) P_m^T ∆Y_0   (b) U_m^T ∆Y_0

Figure: (a) The first two principal components of ∆Y_0, projected using P_m. (b) The first two approximated principal components (up to a sign change), projected using U_m. The first principal component is represented by the x-axis and the second by the y-axis.
Summary
• Deep Learning for cross-sectional data
• Background: review of supervised machine learning
• Introduction to feedforward neural networks
• Related approximation and learning theory
• Training and back-propagation
• Interpretability with factor modeling example
• Deep Learning for time-series data prediction
• Understand the conceptual formulation and statistical explanation of RNNs

• Understand the different types of RNNs, including plain RNNs, GRUs and LSTMs

• Two notebook examples with limit order book data and Bitcoin history

• Understand the conceptual formulation and statistical explanation of CNNs.

• Understand 1D CNNs for time series prediction.

• A notebook example on univariate time series prediction with 1D CNNs.
• Understand how dilation CNNs capture multi-scales.
Summary
• Deep Learning for dimensionality reduction
• Understand principal component analysis for dimension reduction.
• Understand how to formulate a linear autoencoder.
• Understand how a linear autoencoder performs PCA.
• Notebook example using dimension reduction on the yield curve.
Extra Slides
Review of time series analysis
• Characterize the properties of time series
• Problem formulation in time series prediction
• Practical challenges that frequently arise in time series prediction
Time Series Data
Time series data:
• Consists of observations on a variable over time, e.g. stock prices
• Observations are not assumed to be independent over time; rather, observations are often strongly related to their recent histories.

• For this reason, the ordering of the data matters (unlike cross-sectional data).
• Pay special attention to the data frequency.
Classical Time Series Analysis
• Box-Jenkins linear time series analysis, e.g. ARIMA
• Non-linear time series analysis, e.g. GARCH
• Dynamic state based models, e.g. Kalman filters.
Autoregressive Processes AR(p)
• The p-th order autoregressive process of a variable y depends only on the previous values of the variable plus a white noise disturbance term:

  y_t = µ + Σ_{i=1}^p φ_i y_{t−i} + ε_t

• Defining the polynomial function φ(L) := (1 − φ_1 L − φ_2 L² − · · · − φ_p L^p), the AR(p) process can be expressed in the more compact form

  φ(L) y_t = µ + ε_t
Key point: an AR(p) process has a geometrically decaying acf.
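The geometric decay is easy to confirm by simulation. A small sketch for a zero-mean AR(1) with φ = 0.7, for which ρ_s = φ^s:

```python
import numpy as np

rng = np.random.default_rng(4)
phi, T = 0.7, 200_000
eps = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + eps[t]          # zero-mean AR(1)

def acf(x, lag):
    """Sample autocorrelation at the given lag."""
    x = x - x.mean()
    return (x[lag:] * x[:-lag]).mean() / (x * x).mean()

# For AR(1), rho_s = phi**s: a geometrically decaying acf.
for s in (1, 2, 3):
    print(s, acf(y, s), phi**s)             # sample vs. theoretical value
```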
Stationarity
• An important property of AR(p) processes is whether or not they are stationary. For a stationary process, the error disturbance terms ε_t have a declining impact on the current value of y as the lag increases.

• Conversely, non-stationary AR(p) processes exhibit the counter-intuitive behavior that the error disturbance terms become increasingly influential as the lag increases.

• If the roots of the characteristic equation

  φ(z) = (1 − φ_1 z − φ_2 z² − · · · − φ_p z^p) = 0

  *all* lie outside the unit circle, then the AR(p) process is stationary.
Exercise 1
Is the following random walk (zero-mean AR(1) process) stationary?

  y_t = y_{t−1} + ε_t
Exercise 2
Compute the mean, variance and autocorrelation function of the following zero-mean AR(1) process:

  y_t = φ_1 y_{t−1} + ε_t.
Moving Average Processes MA(q)
MA(q) process: The q-th order moving average process is a linear combination of the white noise terms {ε_{t−i}}_{i=0}^q:

  y_t = µ + Σ_{i=1}^q θ_i ε_{t−i} + ε_t

It is convenient to define the lag (or backshift) operator L^i y_t := y_{t−i} and write the MA(q) process as

  y_t = µ + θ(L) ε_t,

where the polynomial function θ(L) := 1 + θ_1 L + θ_2 L² + · · · + θ_q L^q.
Moving Average Processes
The distinguishing properties of the MA(q) process are the following:

• Constant mean: E[y_t] = µ

• Constant variance: V[y_t] = (1 + θ_1² + θ_2² + · · · + θ_q²) σ²

• Autocovariances which may be non-zero up to lag q and zero thereafter:

  γ_s = (θ_s + θ_{s+1}θ_1 + θ_{s+2}θ_2 + · · · + θ_q θ_{q−s}) σ²,  s = 1, 2, . . . , q,
  γ_s = 0,  s > q.

Key point: an MA(q) process has a non-zero acf up to lag q; above q there is a sharp cut-off.
Exercise 3
Consider the following zero-mean MA(2) process:

  y_t = ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2}
Show that:
• E[y_t] = 0, ∀t

• V[y_t] = γ_0 = (1 + θ_1² + θ_2²) σ², ∀t

• ρ_1 = (θ_1 + θ_1 θ_2)/(1 + θ_1² + θ_2²), ρ_2 = θ_2/(1 + θ_1² + θ_2²), and ρ_s = 0 for s > q = 2.

If θ_1 = −0.5 and θ_2 = 0.25, sketch the autocorrelation function (acf) of the sample MA(2) process.
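The values in the exercise can be checked by simulation. A sketch with θ_1 = −0.5 and θ_2 = 0.25, for which ρ_1 ≈ −0.476 and ρ_2 ≈ 0.190:

```python
import numpy as np

rng = np.random.default_rng(5)
theta1, theta2, T = -0.5, 0.25, 500_000
eps = rng.normal(size=T + 2)
y = eps[2:] + theta1 * eps[1:-1] + theta2 * eps[:-2]   # zero-mean MA(2)

def acf(x, lag):
    """Sample autocorrelation at the given lag."""
    x = x - x.mean()
    return (x[lag:] * x[:-lag]).mean() / (x * x).mean()

denom = 1 + theta1**2 + theta2**2
print(acf(y, 1), (theta1 + theta1 * theta2) / denom)   # both approx -0.476
print(acf(y, 2), theta2 / denom)                       # both approx  0.190
print(acf(y, 3))                                       # approx 0: cutoff above q = 2
```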
ARMA
The (p, q)-order ARMA process of a variable y combines the AR(p) and MA(q) processes:

  φ(L) y_t = µ + θ(L) ε_t,

where

  φ(L) := (1 − φ_1 L − φ_2 L² − · · · − φ_p L^p),
  θ(L) := 1 + θ_1 L + θ_2 L² + · · · + θ_q L^q.

The mean of an ARMA series depends only on the coefficients of the autoregressive process and not on the coefficients of the moving average process:

  E[y_t] = µ / (1 − φ_1 − φ_2 − · · · − φ_p)
Forecasting with an MA(q) model
• Predict the value y_{t+1} given information Ω_t at time t using the equation:

  y_{t+1} = µ + θ_1 ε_t + θ_2 ε_{t−1} + θ_3 ε_{t−2} + ε_{t+1}

• The forecast is the conditional expectation of the prediction. The one-step ahead forecast at time t is denoted f_{t,1}:

  f_{t,1} = E[y_{t+1}|Ω_t] = µ + θ_1 ε_t + θ_2 ε_{t−1} + θ_3 ε_{t−2},

  since the conditional expectation E[ε_{t+1}|Ω_t] = 0.

• The two-step ahead forecast f_{t,2} is given by:

  f_{t,2} = E[y_{t+2}|Ω_t] = µ + θ_2 ε_t + θ_3 ε_{t−1}

• In general, an MA(q) process has a memory of q periods, so forecasts q + 1 or more steps ahead collapse to the intercept µ.
Forecasting with an AR(p) model
• Recall that an AR(p) process has infinite memory.
• In general, the s-step ahead forecast for an AR(2) process is given by

  f_{t,s} = E[y_{t+s}|Ω_t] = µ + φ_1 f_{t,s−1} + φ_2 f_{t,s−2};

  thus we need to compute the one-step and two-step ahead forecasts first.

• ARMA(p, q): Putting the MA(q) and AR(p) s-step ahead forecasting equations together, we get

  f_{t,s} = Σ_{i=1}^p a_i f_{t,s−i} + Σ_{j=1}^q b_j ε_{t+s−j},

  where f_{t,s} = y_{t+s} for s ≤ 0 and ε_{t+s} = 0 for s > 0.
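The recursion can be sketched directly for the zero-mean case (µ = 0); the function and all names here are hypothetical, not from the lecture code:

```python
def arma_forecast(y, eps, a, b, s_max):
    """s-step ahead forecasts f_{t,s} for a zero-mean ARMA(p, q), following
    the recursion above: f_{t,s} = y_{t+s} for s <= 0, and future shocks
    eps_{t+s}, s > 0, are replaced by their conditional expectation, zero."""
    p, q = len(a), len(b)
    t = len(y) - 1                                  # forecast origin
    f = {s: y[t + s] for s in range(-(p - 1), 1)}   # known past values
    for s in range(1, s_max + 1):
        ar = sum(a[i - 1] * f[s - i] for i in range(1, p + 1))
        ma = sum(b[j - 1] * eps[t + s - j]
                 for j in range(1, q + 1) if s - j <= 0)
        f[s] = ar + ma
    return [f[s] for s in range(1, s_max + 1)]

# One-step ahead ARMA(2,2) forecast:
#   f_{t,1} = a1*y_t + a2*y_{t-1} + b1*eps_t + b2*eps_{t-1}
#           = 0.5*2 + 0.2*1 + 0.3*(-0.5) + 0.1*0.5 = 1.1
print(arma_forecast(y=[1.0, 2.0], eps=[0.5, -0.5],
                    a=[0.5, 0.2], b=[0.3, 0.1], s_max=1))
```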
Exercise 4
Consider the following ARMA(2,2) 1-step ahead forecast
  f_{t,1} = a_1 f_{t,0} + a_2 f_{t,−1} + b_1 ε_t + b_2 ε_{t−1}   (41)
          = a_1 y_t + a_2 y_{t−1} + b_1 ε_t + b_2 ε_{t−1}   (42)

where the a_i and b_j are the autoregressive and moving average coefficients respectively.
Parametric tests: time series analysis
• Augmented Dickey-Fuller test for time series stationarity, with the Schwert (heuristic) estimate for the maximum lag
• Durbin-Watson tests for co-integration of two variables
• Granger causality tests
Identification
• It may be difficult to observe the order of the MA and AR processes from the acf and pacf plots.

• It is often preferable to use the Akaike Information Criterion (AIC) to measure the quality of fit:

  AIC = ln(σ̂²) + 2k/T,

  where σ̂² is the residual variance (the residual sum of squares divided by the number of observations T) and k = p + q + 1 is the total number of parameters estimated.

• This criterion expresses a trade-off between the first term, the quality of fit, and the second term, a penalty proportional to the number of parameters.

• Adding more parameters to the model reduces the residuals but increases the penalty term; thus the AIC favors the best fit with the fewest number of parameters.

• The AIC is limited to asymptotic normality assumptions on the error.
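The criterion is a one-liner; a minimal sketch (the function name is hypothetical):

```python
import numpy as np

def aic(residuals, k):
    """AIC = ln(sigma^2) + 2k/T, with sigma^2 the residual sum of squares
    divided by the number of observations T, and k = p + q + 1 parameters."""
    r = np.asarray(residuals, dtype=float)
    T = len(r)
    sigma2 = np.sum(r**2) / T
    return np.log(sigma2) + 2 * k / T

# With unit residual variance and k = 1, T = 4: AIC = ln(1) + 2/4 = 0.5.
print(aic([1.0, -1.0, 1.0, -1.0], k=1))
```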
In Sample: Diagnostics
• A fitted model must be examined for underfitting with a white noise test. Box and Pierce propose the portmanteau statistic

  Q*(m) = T Σ_{l=1}^m ρ̂_l²

  as a test statistic for the null hypothesis

  H_0: ρ_1 = · · · = ρ_m = 0

  against the alternative hypothesis

  H_a: ρ_i ≠ 0 for some i ∈ {1, . . . , m},

  where the ρ̂_i are the sample autocorrelations of the residuals.

• The Box-Pierce statistic follows an asymptotically chi-squared distribution with m − p − q degrees of freedom.

• The decision rule is to reject H_0 if Q*(m) > χ²_α, where χ²_α denotes the 100(1 − α)-th percentile of a chi-squared distribution with m − p − q degrees of freedom and α is the significance level for rejecting H_0.
In Sample: Diagnostics
• The Ljung-Box test statistic increases the power of the test in finite samples:

  Q(m) = T(T + 2) Σ_{l=1}^m ρ̂_l² / (T − l).

• This statistic also follows an asymptotically chi-squared distribution, with m − p − q degrees of freedom when applied to the residuals of a fitted ARMA(p, q) model.
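A minimal numpy sketch of the statistic (the function name is hypothetical); for the toy series [1, 2, 3, 4], ρ̂_1 = 0.25 and hence Q(1) = 4 · 6 · 0.25²/3 = 0.5:

```python
import numpy as np

def ljung_box(residuals, m):
    """Ljung-Box Q(m) = T (T + 2) * sum_{l=1}^m rho_l^2 / (T - l), where
    rho_l is the lag-l sample autocorrelation of the residuals."""
    x = np.asarray(residuals, dtype=float)
    x = x - x.mean()
    T = len(x)
    denom = np.sum(x * x)
    q = sum((np.sum(x[l:] * x[:-l]) / denom) ** 2 / (T - l)
            for l in range(1, m + 1))
    return T * (T + 2) * q

print(ljung_box([1, 2, 3, 4], m=1))
```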
Time Series Cross Validation
[Figure labels: verification split — tuning hyper-parameters; testing split — assess performance.]
Figure: Time series cross validation, also referred to as walk-forward optimization, is used to avoid introducing look-ahead bias into the model.
Forecasting Accuracy
• The forecasting accuracy may be measured using the mean square error (MSE) or the mean absolute error (MAE) of the out-of-sample forecasts against the actual observations.

• Suppose that we have trained the model with T observations; then the MSE evaluated over the next m observations is:

  MSE = (1/m) Σ_{t=0}^{m−1} (y_{T+t+s} − f_{T+t,s})²

• The MAE is given by

  MAE = (1/m) Σ_{t=0}^{m−1} |y_{T+t+s} − f_{T+t,s}|
Question: Why use the MSE in preference to the MAE?
Predicting Events
• Suppose we have conditionally i.i.d. Bernoulli r.v.s X_t with p_t := P[X_t = 1|Ω_t] representing a binary event

• E[X_t | Ω_t] = 0 · (1 − p_t) + 1 · p_t = p_t

• V[X_t | Ω_t] = p_t(1 − p_t)

• Under an ARMA model:

  ln(p_t / (1 − p_t)) = φ^{−1}(L)(µ + θ(L) ε_t)
Entropy
C. Shannon, “A Mathematical Theory of Communication”, The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July, October, 1948.
Directional Performance: Confusion Matrix
              Predicted
Actual        up      down    total
up            m_11    m_12    m_10
down          m_21    m_22    m_20
total         m_01    m_02    m

  χ² = Σ_{i=1}^2 Σ_{j=1}^2 (m_ij − m_i0 m_0j / m)² / (m_i0 m_0j / m),

with 1 degree of freedom.
Example: Confusion Matrix
              Predicted
Actual        up      down    total
up            12      2       14
down          8       2       10
total         20      4       24

  χ² = 0.137, with a p-value of 0.71.
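The example can be reproduced directly from the 2 × 2 table; for one degree of freedom the p-value is P[χ²₁ > x] = erfc(√(x/2)), so no statistics library is needed. A small sketch:

```python
import math
import numpy as np

M = np.array([[12.0, 2.0],      # actual up:   predicted up, predicted down
              [8.0, 2.0]])      # actual down: predicted up, predicted down
row, col, m = M.sum(axis=1), M.sum(axis=0), M.sum()
E = np.outer(row, col) / m                       # expected counts under independence
chi2_stat = ((M - E) ** 2 / E).sum()
p_value = math.erfc(math.sqrt(chi2_stat / 2))    # chi-squared survival, 1 d.o.f.

print(round(chi2_stat, 3), round(p_value, 2))    # 0.137 0.71
```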
ROC chart
Performance Terminology
• Precision is TP/(TP + FP): the fraction of records that were positive from the group that the classifier predicted to be positive.

• True Positive Rate is TP/(TP + FN): the fraction of positive examples the classifier correctly identified. This is also known as Recall or Sensitivity.

• False Positive Rate is FP/(FP + TN): the fraction of negative examples which the classifier mis-identified as positive.

• Specificity is TN/(FP + TN) = 1 − FPR: the fraction of negative examples which the classifier correctly identified.
Practical Prediction Issues
• How to predict inherently rare events, e.g. black swans? Ans:over-sampling?
• How much training history?
• How frequently to retrain the model?
Confusion Matrices with Oversampling
Without oversampling (columns sum to one):

            Predicted: −1   Predicted: 0   Predicted: 1
True −1:    0.1188          0.0030         0.0466
True  0:    0.8500          0.9961         0.8653
True  1:    0.0312          0.0009         0.0881

With oversampling:

            Predicted: −1   Predicted: 0   Predicted: 1
True −1:    0.6938          0.0743         0.1295
True  0:    0.2188          0.8267         0.1244
True  1:    0.0875          0.0990         0.7461

Figure: Normalized confusion matrices (left) without and (right) with oversampling.
ROC Curves with Oversampling
[Figure data: true positive rate vs. false positive rate; AUC before oversampling: 0.543; AUC after oversampling: 0.784.]
Figure: ROC curves with and without oversampling.