Page 1:

Deep Learning for Financial Econometrics²

Matthew Dixon [email protected]

Illinois Institute of Technology

Machine Learning in Finance Bootcamp, Fields Institute

University of Toronto, September 26th, 2019

Code: https://github.com/mfrdixon/ML_Finance_Codes

² Slides with an * in the title are advanced material for mathematicians. ** denotes advanced programming details.

Page 2:

Overview

• Deep Learning for cross-sectional data
  • Background: review of supervised machine learning
  • Introduction to feedforward neural networks
  • Related approximation and learning theory
  • Training and back-propagation
  • Interpretability with factor modeling example

• Deep Learning for time-series data prediction
  • Understand the conceptual formulation and statistical explanation of RNNs
  • Understand the different types of RNNs, including plain RNNs, GRUs and LSTMs
  • Understand the conceptual formulation and statistical explanation of CNNs
  • Understand 1D CNNs for time series prediction
  • Understand how dilated CNNs capture multiple scales

Page 3:

Overview

• Deep Learning for dimensionality reduction
  • Understand principal component analysis for dimension reduction
  • Understand how to formulate a linear autoencoder
  • Understand how a linear autoencoder performs PCA

Page 4:

Quick Quiz

Which of the following statements are true?

A. Machine learning is different from statistics: machine learning methods assume the data generation process is unknown

B. The output from a neural network classifier is the probability of an input belonging to a class

C. We need multiple hidden layers in the neural network to capture non-linearity

D. Neural networks provide no statistical interpretability; they are 'black boxes' and the importance of the features is unknown

E. Neural networks can only be fitted to stationary data

Page 5:

Quick Quiz

Which of the following statements are true?

A. Machine learning is different from statistics: machine learning methods assume the data generation process is unknown [True]

B. The output from a neural network classifier is the probability of an input belonging to a class [True]

C. We need multiple hidden layers in the neural network to capture non-linearity [False³]

D. Neural networks provide no statistical interpretability; they are 'black boxes' and the importance of the features is unknown [False]

E. Neural networks can only be fitted to stationary data [False]

³ The Universal representation theorem suggests that we only need one hidden layer: Andrei Nikolaevich Kolmogorov, On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. AMS Translation, 28(2):55-59, 1963.

Page 6:

Quick Overview

Property          Statistical Inference                    Machine Learning
Goal              Causal models with explanatory power     Prediction performance, often with limited explanatory power
Data              The data is generated by a model         The data generation process is unknown
Framework         Probabilistic                            Algorithmic & probabilistic
Expressability    Typically linear                         Non-linear
Model selection   Based on information criteria            Numerical optimization
Scalability       Limited to lower dimensional data        Scales to higher dimensional input data
Robustness        Prone to over-fitting                    Designed for out-of-sample performance

Page 7:

Supervised Machine Learning

• Machine learning addresses a fundamental prediction problem: construct a nonlinear predictor, $\hat{Y}(X)$, of an output, $Y$, given a high dimensional input matrix $X = (X_1, \ldots, X_P)$ of $P$ variables.

• Machine learning can be simply viewed as the study and construction of an input-output map of the form

  $Y = F(X)$ where $X = (X_1, \ldots, X_P)$.

• The output variable, $Y$, can be continuous, discrete or mixed.

• For example, in a classification problem, $F: X \to Y$ where $Y \in \{1, \ldots, K\}$ and $K$ is the number of categories.

• We will denote the $i$th observation of the data $\mathcal{D} := (X, Y)$ as $(x_i, y_i)$; this is a "feature vector" and response (a.k.a. "label"). Note that lower case refers to a specific observation in $\mathcal{D}$.

Page 8:

Taxonomy of Most Popular Neural Network Architectures

Figure: Most commonly used deep learning architectures for modeling: feed-forward, auto-encoder, convolutional, recurrent, long short-term memory, and neural Turing machines. Source: http://www.asimovinstitute.org/neural-network-zoo

Page 9:

FFWD Neural Networks

Neural network model:

$$ Y = F_{W,b}(X) + \varepsilon $$

where $F_{W,b}: \mathbb{R}^p \to \mathbb{R}^d$ is a deep neural network with $L$ layers

$$ \hat{Y}(X) := F_{W,b}(X) = f^{(L)}_{W^{(L)},b^{(L)}} \circ \cdots \circ f^{(1)}_{W^{(1)},b^{(1)}}(X) $$

• $W = (W^{(1)}, \ldots, W^{(L)})$ and $b = (b^{(1)}, \ldots, b^{(L)})$ are weight matrices and bias vectors.

• For any $W^{(i)} \in \mathbb{R}^{m \times n}$, we can write the matrix as $n$ column $m$-vectors $W^{(i)} = [w^{(i)}_1, \ldots, w^{(i)}_n]$.

• Denote each weight as $w^{(\ell)}_{i,j} := (W^{(\ell)})_{i,j}$.

• $X$ is an $N \times p$ matrix of observations in $\mathbb{R}^p$.

• No assumptions are made on the distribution of the error, $\varepsilon$, other than it is independently distributed.
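A minimal Keras sketch of such a composition $F_{W,b}$ with $L = 3$ layers; the layer widths, ReLU activations and toy data below are illustrative choices, not taken from the slides.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

p, d = 10, 1                          # input and output dimensions (illustrative)
model = Sequential([
    Dense(32, activation='relu', input_shape=(p,)),   # f^(1): W^(1), b^(1) with sigma = ReLU
    Dense(32, activation='relu'),                     # f^(2): W^(2), b^(2)
    Dense(d, activation='linear')                     # f^(3): output layer
])
model.compile(optimizer='sgd', loss='mse')

X = np.random.rand(100, p)            # N x p matrix of observations
Y = X.sum(axis=1, keepdims=True)      # toy response
model.fit(X, Y, epochs=5, verbose=0)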

Page 10:

Deep Neural Networks*

• Let $\sigma: \mathbb{R} \to B \subset \mathbb{R}$ denote a continuous, monotonically increasing function.

• A function $f^{(\ell)}_{W^{(\ell)},b^{(\ell)}}: \mathbb{R}^n \to \mathbb{R}^m$, given by $f(v) = W^{(\ell)}\sigma^{(\ell-1)}(v) + b^{(\ell)}$, $W^{(\ell)} \in \mathbb{R}^{m \times n}$ and $b^{(\ell)} \in \mathbb{R}^m$, is a semi-affine function in $v$, e.g. $f(v) = w\tanh(v) + b$.

• $\sigma(\cdot)$ are known activation functions of the output from the previous layer, e.g. $\sigma(x) := \max(x, 0)$.

• $F_{W,b}(X)$ is a composition of semi-affine functions.

• If all the activation functions are linear, $F_{W,b}$ is just linear regression, regardless of the number of layers $L$.

Page 11:

Geometric Interpretation of FFWD Neural Networks

Figure: Geometric interpretation in the (x1, x2) plane with no hidden layers, one hidden layer and two hidden layers.

Page 12:

Geometric Interpretation of FFWD Neural Networks

Figure: Half-moon dataset classified with no hidden layers, one hidden layer and two hidden layers.

Page 13:

Why do we need more Neurons?


Page 14:

Geometric Interpretation of Neural Networks

Figure: Results with 25, 50 and 75 hidden units. The number of hidden units is adjusted according to the requirements of the classification problem and can be very high for data sets which are difficult to separate.

Page 15:

Geometric Interpretation of Neural Networks

Figure: Hyperplanes defined by three activated neurons in the hidden layer.

Page 16:

Universal Representation Theorem [1989]*

• Let $C^p := \{f: \mathbb{R}^p \to \mathbb{R} \mid f(x) \in C(\mathbb{R})\}$ be the set of continuous functions from $\mathbb{R}^p$ to $\mathbb{R}$.

• Denote $\Sigma^p(\sigma)$ as the class of functions

  $\{F_{W,b}: \mathbb{R}^p \to \mathbb{R} : F_{W,b}(x) = W^{(2)}\sigma(W^{(1)}x + b^{(1)}) + b^{(2)}\}$.

• Consider $\Omega := (0, 1]$ and let $C_0$ be the collection of all open intervals in $(0, 1]$.

• Then $\sigma(C_0)$, the $\sigma$-algebra generated by $C_0$, is the Borel $\sigma$-algebra, $\mathcal{B}((0, 1])$.

• Let $\mathcal{M}^p := \{f: \mathbb{R}^p \to \mathbb{R} \mid f(x) \in \mathcal{B}(\mathbb{R})\}$ denote the set of all Borel measurable functions from $\mathbb{R}^p$ to $\mathbb{R}$.

• Denote the Borel $\sigma$-algebra of $\mathbb{R}^p$ as $\mathcal{B}^p$.

Page 17:

Universal Representation Theorem*

Theorem (Hornik, Stinchcombe & White, 1989)

For every monotonically increasing activation function $\sigma$, every input dimension size $p$, and every probability measure $\mu$ on $(\mathbb{R}^p, \mathcal{B}^p)$, $\Sigma^p(\sigma)$ is uniformly dense on compacta in $C^p$ and $\rho_\mu$-dense in $\mathcal{M}^p$.

Page 18:

Shattering

• What is the maximum number of points that can be arranged so that $F_{W,b}(X)$ shatters them? i.e. for all possible assignments of binary labels to those points, does there exist a $W, b$ such that $F_{W,b}$ makes no errors when classifying that set of data points?

• Every distinct pair of points is separable with the linear threshold perceptron. So every data set of size 2 is shattered by the perceptron.

• However, this linear threshold perceptron is incapable of shattering triplets, for example $X \in \{-0.5, 0, 0.5\}$ and $Y \in \{0, 1, 0\}$.

• In general, the VC dimension of the class of halfspaces in $\mathbb{R}^k$ is $k + 1$. For example, a 2-d plane shatters any three points, but cannot shatter four points.

• This maximum number of points is the Vapnik-Chervonenkis (VC) dimension and is one characterization of learnability of a classifier.

Page 19:

Example of Shattering over the Interval [−1, 1]

Figure: (Panels: right point activated, left point activated, none activated, both activated.) For the points $\{-0.5, 0.5\}$, there are weights and biases that activate only one of them ($W = 1, b = 0$ or $W = -1, b = 0$), none of them ($W = 1, b = -0.75$) and both of them ($W = 1, b = 0.75$).

Page 20:

VC Dimension Example

• Determine the VC dimension of the indicator function where $\Omega = [0, 1]$

  $\mathcal{F}(x) = \{f: \Omega \to \{0, 1\},\ f(x) = 1_{x \in [t_1, t_2)}\ \text{or}\ f(x) = 1 - 1_{x \in [t_1, t_2)},\ t_1 < t_2 \in \Omega\}$.

• Suppose there are three points $x_1$, $x_2$ and $x_3$ and assume $x_1 < x_2 < x_3$ without loss of generality. All possible binary labelings of the points are reachable, therefore we assert that $VC(\mathcal{F}) \geq 3$.

• With four points $x_1, x_2, x_3$ and $x_4$ (assumed increasing as always), you cannot label $x_1$ and $x_3$ with the value 1 and $x_2$ and $x_4$ with the value 0, for example. So $VC(\mathcal{F}) = 3$.

• A related, but alternative, approach to the VC dimension is a numerical analysis perspective.

Page 21:

When is a neural network a spline?*

• Under certain choices of the activation function, we can construct NNs which are piecewise polynomial interpolants, referred to as 'splines'.

• Let $f: \Omega \to \mathbb{R}$ be any Lipschitz continuous function over a bounded domain $\Omega \subset \mathbb{R}^p$:

  $\forall x, x' \in \Omega,\ |f(x') - f(x)| \leq L|x' - x|$, for some constant $L \geq 0$.

• Suppose that the values $f_k := f(x_k)$ are known only at distinct points $\{x_k\}_{k=1}^n$ in $\Omega$.

• Form an orthogonal basis over $\Omega$ to give the interpolant

  $$ \hat{f}(x) = \sum_{k=1}^n \phi_k(x) f_k, \quad \forall x \in \Omega, $$

  where the $\{\phi_k(x)\}_{k=1}^n$ are piecewise constant (PC) orthogonal basis functions, $\phi_i(x_j) = \delta_{ij}$ (Kronecker delta).

Page 22:

When is a neural network a spline?*

• If the data points $\{x_k\}_{k=1}^n$ are (almost) uniformly distributed in the bounded domain $\Omega$, then the error of our PC interpolant scales as

  $$ \|f - \hat{f}\|_{L^2_{p,\mu}(\Omega)} \sim O(n^{-1/p}). \qquad (1) $$

Page 23:

Piecewise basis functions as differences of Heaviside functions*

• Heaviside functions (a.k.a. step functions) over $\mathbb{R}$ are defined as

  $$ H(x) = \begin{cases} 1 & x \geq 0, \\ 0 & x < 0. \end{cases} $$

• Choose the domain to be the unit interval in $\mathbb{R}$ and restrict the data to a uniform grid of width $2\epsilon$ so that $x_k = (2k - 1)\epsilon$, $k = 1, \ldots, n$.

• Construct a piecewise constant basis from the difference of Heaviside functions:

  $$ \phi_k(x) = H(x - (x_k - \epsilon)) - H(x - (x_k + \epsilon)). $$

• Basis functions are indicator functions: $\phi_k = 1_{[x_k - \epsilon,\ x_k + \epsilon)}$, $\forall k = 1, \ldots, n$, with compact support over $[x_k - \epsilon, x_k + \epsilon)$ and centered about $x_k$.

Page 24:

Piecewise constant basis functions*

Figure: The first three piecewise constant basis functions produced by the difference of neighboring step functions, $\phi_k(x) = H(x - (x_k - \epsilon)) - H(x - (x_k + \epsilon))$.

Page 25:

Single Layer NNs Activated with Heaviside functions*

• Key Idea: Choose the biases $b^{(1)}_k = -2(k - 1)\epsilon$ and weights $W^{(1)} = 1$, then

  $$ H(W^{(1)}x + b^{(1)}) = \big[ H(x),\ H(x - 2\epsilon),\ \ldots,\ H(x - 2(k - 1)\epsilon),\ \ldots,\ H(x - 2(n - 1)\epsilon) \big]. $$

• The two layer NN approximation, $\hat{f}(x) = W^{(2)} H(W^{(1)}x + b^{(1)})$, is a PC interpolant when

  $$ W^{(2)} = [f(x_1),\ f(x_2) - f(x_1),\ \ldots,\ f(x_n) - f(x_{n-1})]. $$

Page 26:

Single Layer NNs Activated with Heaviside functions*

• This single layer neural network gives the following output:

  $$ \hat{f}(x) = \begin{cases} f(x_1) & x \leq 2\epsilon, \\ f(x_2) & 2\epsilon < x \leq 4\epsilon, \\ \ldots & \ldots \\ f(x_{k+1}) & 2k\epsilon < x \leq 2(k + 1)\epsilon, \\ \ldots & \ldots \\ f(x_n) & 2(n - 1)\epsilon < x \leq 2n\epsilon. \end{cases} $$

• $\hat{f}(x)$ is a piecewise constant interpolant over $[0, 1]$.
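A NumPy sketch of this construction, under the assumptions above (uniform grid, Heaviside activations), approximating $f(x) = \sin(2\pi x)$ as in the example on the following slide; the number of hidden units n is an illustrative choice.

import numpy as np

f = lambda x: np.sin(2 * np.pi * x)
n = 50                                                # number of hidden units (illustrative)
eps = 1.0 / (2 * n)                                   # grid half-width so the grid covers [0, 1]
xk = (2 * np.arange(1, n + 1) - 1) * eps              # x_k = (2k - 1) * eps
W1 = np.ones(n)                                       # W^(1) = 1
b1 = -2 * (np.arange(1, n + 1) - 1) * eps             # b^(1)_k = -2(k - 1) * eps
fk = f(xk)
W2 = np.concatenate(([fk[0]], np.diff(fk)))           # W^(2) = [f(x_1), f(x_2) - f(x_1), ...]

def f_hat(x):
    H = (np.outer(x, W1) + b1 >= 0).astype(float)     # Heaviside of W^(1) x + b^(1)
    return H @ W2

x = np.linspace(0, 1, 1000)
print(np.max(np.abs(f(x) - f_hat(x))))                # worst-case error is roughly L * eps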

Page 27:

Single Layer NNs Activated with Heaviside functions*

Figure: The approximation of $\sin(2\pi x)$ using gridded input data and Heaviside activation functions. Left panel: function values; right panel: absolute error $|f(x) - \hat{f}(x)|$.

Page 28:

Single Layer NNs Activated with Heaviside functions *

Theorem
$|f(x) - \hat{f}(x)| \leq \delta$, $\forall x \in [0, 1]$, if there are at least $n = \lfloor \frac{1}{2\epsilon} + 1 \rfloor$ hidden units and $\delta := L\epsilon$.

Proof.
Since $x_k = (2k - 1)\epsilon$, we have that $\hat{f}(x) = f(x_k)$ over the interval $[x_k - \epsilon, x_k + \epsilon]$, which is the support of $\phi_k(x)$. By the Lipschitz continuity of $f(x)$, it follows that the worst case error appears at the mid-point of any interval $[x_k, x_{k+1}]$:

$$ |\hat{f}(x_k + \epsilon) - f(x_k + \epsilon)| = |f(x_k) - f(x_k + \epsilon)| \leq L\epsilon = \delta. $$

Page 29:

Training with Regularization under IID noise

• With a training set $\mathcal{D} = \{x_i, y_i\}_{i=1}^N$, solve the constrained optimization

  $$ (\hat{W}, \hat{b}) = \arg\min_{(W,b)} \frac{1}{N} \sum_{i=1}^N L(y_i, \hat{Y}(x_i)) $$

• Often, the $L_2$-norm for a traditional least squares problem is chosen as an error measure, and we then minimize the loss function combined with a penalty

  $$ L(y_i, \hat{y}_i) = \|y_i - \hat{y}_i\|_2^2 + \lambda_1 \|W\|_1 $$

• $\nabla L$ is given in closed form by a chain rule and, through back-propagation, each layer's weights $W^{(\ell)}$ and biases $b^{(\ell)}$ are fitted with stochastic gradient descent.
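A hedged Keras sketch of this objective: an L1 weight penalty added to a least-squares loss and fitted by SGD. The penalty weight lambda_1, layer sizes and activations are illustrative choices.

from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l1

lambda_1 = 1e-3                       # L1 penalty weight (illustrative)
model = Sequential([
    Dense(50, activation='relu', input_shape=(10,), kernel_regularizer=l1(lambda_1)),
    Dense(1, activation='linear', kernel_regularizer=l1(lambda_1))
])
model.compile(optimizer='sgd', loss='mse')    # minimizes ||y - y_hat||_2^2 + lambda_1 * ||W||_1
# model.fit(X_train, Y_train, epochs=100, batch_size=64)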

Page 30:

Equivalence with OLS estimation

• Without activation and no hidden layers, the optimal parameters $\hat{\beta} := [\hat{b}, \hat{W}]$ are an OLS estimator.

• If we assume that the errors $\epsilon_1, \epsilon_2, \ldots, \epsilon_N$ are white noise and let $y := Y \in \mathbb{R}$ so that the linear model is

  $$ y = X\beta + \epsilon, $$

  then

  $$ \hat{\beta} = (X'X)^{-1}X'y $$

  where the augmented matrix is

  $$ X = \begin{bmatrix} 1 & x_{1,1} & \ldots & x_{1,p} \\ \ldots & \ldots & \ldots & \ldots \\ 1 & x_{N,1} & \ldots & x_{N,p} \end{bmatrix}. $$
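A NumPy sketch of this estimator on simulated data; the coefficients and noise level are illustrative.

import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 3
X_raw = rng.normal(size=(N, p))
y = 0.5 + X_raw @ np.array([1.0, -2.0, 0.3]) + 0.1 * rng.normal(size=N)

X = np.column_stack([np.ones(N), X_raw])          # augmented design matrix with intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # beta_hat = (X'X)^{-1} X'y
print(beta_hat)                                   # approximately [0.5, 1.0, -2.0, 0.3]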

Page 31:

Training

• Solve the constrained optimization

  $$ (\hat{W}, \hat{b}) = \arg\min_{W,b} \frac{1}{N} \sum_{i=1}^N L(y_i, \hat{Y}_{W,b}(x_i)) $$

• If $Y$ is continuous then the $L_2$-norm for a traditional least squares problem is chosen as an error measure, and we then minimize the loss function

  $$ L(Y, \hat{Y}(X)) = \|Y - \hat{Y}\|_2^2 $$

• A 1-of-K encoding is used for a categorical response, so that $Y$ is a K-binary vector $Y \in [0, 1]^K$ and the value $k$ is represented as $Y_k = 1$ and $Y_j = 0$, $\forall j \neq k$. The loss function is the cross-entropy term

  $$ L(Y, \hat{Y}(X)) = -Y^T \ln \hat{Y}. $$

Page 32:

Derivative of the softmax function

• Using the quotient rule $f'(x) = \frac{g'(x)h(x) - h'(x)g(x)}{[h(x)]^2}$, the derivative of $\sigma := \sigma(x)$ can be written as:

  $$ \frac{\partial \sigma_i}{\partial x_i} = \frac{\exp(x_i)\,\|\exp(x)\|_1 - \exp(x_i)\exp(x_i)}{\|\exp(x)\|_1^2} \qquad (2) $$
  $$ = \frac{\exp(x_i)}{\|\exp(x)\|_1} \cdot \frac{\|\exp(x)\|_1 - \exp(x_i)}{\|\exp(x)\|_1} \qquad (3) $$
  $$ = \sigma_i(1 - \sigma_i) \qquad (4) $$

Page 33:

Derivative of the softmax function

• For the case $i \neq j$, the derivative of the sum is:

  $$ \frac{\partial \sigma_i}{\partial x_j} = \frac{0 - \exp(x_i)\exp(x_j)}{\|\exp(x)\|_1^2} \qquad (5) $$
  $$ = -\frac{\exp(x_j)}{\|\exp(x)\|_1} \cdot \frac{\exp(x_i)}{\|\exp(x)\|_1} \qquad (6) $$
  $$ = -\sigma_j \sigma_i \qquad (7) $$

• This can be written compactly as $\frac{\partial \sigma_i}{\partial x_j} = \sigma_i(\delta_{ij} - \sigma_j)$.
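A short numeric check of the compact formula $\frac{\partial \sigma_i}{\partial x_j} = \sigma_i(\delta_{ij} - \sigma_j)$ against a finite-difference approximation; the test vector is arbitrary.

import numpy as np

softmax = lambda x: np.exp(x) / np.sum(np.exp(x))

x = np.array([0.5, -1.2, 2.0])
s = softmax(x)
J_analytic = np.diag(s) - np.outer(s, s)               # sigma_i * (delta_ij - sigma_j)

eps = 1e-6
J_numeric = np.column_stack([
    (softmax(x + eps * e) - softmax(x - eps * e)) / (2 * eps)
    for e in np.eye(len(x))
])
print(np.allclose(J_analytic, J_numeric, atol=1e-6))   # True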

Page 34:

Updating the Weight Matrices

• In machine learning, the goal is to find the weights and biases given an input $X$. We can express $\hat{Y}$ as a function of the weight matrix $W \in \mathbb{R}^{K \times M}$ and bias $b \in \mathbb{R}^K$ so that

  $$ \hat{Y}(W, b) = \sigma \circ I(W, b), $$

  where the input function $I: \mathbb{R}^{K \times M} \times \mathbb{R}^K \to \mathbb{R}^K$ is of the form $I(W, b) := WX + b$.

• Applying the multivariate chain rule to give the Jacobian of $\hat{Y}(W, b)$ gives

  $$ \nabla \hat{Y}(W, b) = \nabla(\sigma \circ I)(W, b) \qquad (8) $$
  $$ = \nabla\sigma(I(W, b)) \cdot \nabla I(W, b) \qquad (9) $$

Page 35:

Updating the Weight Matrices

• Recall that the cross-entropy function is given by

  $$ L(Y, \hat{Y}(X)) = -Y^T \ln \hat{Y} $$

• Since $Y$ is a constant we can express the cross-entropy as a function of $(W, b)$:

  $$ L(W, b) = L \circ \sigma(I(W, b)). $$

• Applying the multivariate chain rule gives

  $$ \nabla L(W, b) = \nabla(L \circ \sigma)(I(W, b)) = \nabla L(\sigma(I(W, b))) \cdot \nabla\sigma(I(W, b)) \cdot \nabla I(W, b) $$

Page 36:

Back-propagation

• Use stochastic gradient descent to find the minimum

  $$ (\hat{W}, \hat{b}) = \arg\min_{W,b} \frac{1}{N} \sum_{i=1}^N L(y_i, \hat{y}_i) $$

• Because of the compositional form of the model, the gradient can be easily derived using the chain rule for differentiation.

• This can be computed by a forward and then a backward sweep ('back-propagation') over the network, keeping track only of quantities local to each neuron.

Page 37:

Back-propagation

Forward Pass:

• Set $Z^{(0)} = X$ and for $\ell = 1, \ldots, L$ set

  $$ Z^{(\ell)} = f^{(\ell)}_{W^{(\ell)},b^{(\ell)}}(Z^{(\ell-1)}) = \sigma^{(\ell)}(W^{(\ell)} Z^{(\ell-1)} + b^{(\ell)}). $$

Back-propagation:

• Define the back-propagation error $\delta^{(\ell)} := \nabla_{b^{(\ell)}} L$, where given $\delta^{(L)} = \hat{Y} - Y$, for $\ell = L - 1, \ldots, 1$ set

  $$ \delta^{(\ell)} = (\nabla_{I^{(\ell)}} \sigma^{(\ell)})(W^{(\ell+1)})^T \delta^{(\ell+1)}, \qquad (10) $$
  $$ \nabla_{W^{(\ell)}} L = \delta^{(\ell)} \otimes Z^{(\ell-1)}, \qquad (11) $$

  and $\otimes$ is the outer product of two vectors.
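A NumPy sketch of one forward pass and one back-propagation sweep following Equations (10)-(13), for a sigmoid-activated network with a softmax output; the layer sizes and random weights are illustrative.

import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
softmax = lambda x: np.exp(x) / np.sum(np.exp(x))

def forward(X, Ws, bs):
    # Forward pass: Z^(0) = X, Z^(l) = sigma^(l)(W^(l) Z^(l-1) + b^(l))
    Z = [X]
    for ell, (W, b) in enumerate(zip(Ws, bs)):
        I = W @ Z[-1] + b
        Z.append(softmax(I) if ell == len(Ws) - 1 else sigmoid(I))
    return Z

def backprop(Y, Z, Ws):
    # delta^(L) = Y_hat - Y; delta^(l) = sigma'(I^(l)) * (W^(l+1))^T delta^(l+1)   (Eq. 10)
    deltas = [Z[-1] - Y]
    for ell in range(len(Ws) - 1, 0, -1):
        grad_sigma = Z[ell] * (1 - Z[ell])          # sigmoid' evaluated at the stored Z^(l)
        deltas.insert(0, grad_sigma * (Ws[ell].T @ deltas[0]))
    grads_W = [np.outer(d, z) for d, z in zip(deltas, Z[:-1])]   # Eq. (11)
    return deltas, grads_W

# One observation: 4 inputs, two hidden layers of width 5, K = 3 classes (one-hot Y).
rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3]
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
X, Y = rng.normal(size=4), np.array([0.0, 1.0, 0.0])
Z = forward(X, Ws, bs)
deltas, grads_W = backprop(Y, Z, Ws)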

Page 38:

Updating the weights

• The weights and biases are updated according to the expressions:

  $$ \Delta W^{(\ell)} = -\gamma \nabla_{W^{(\ell)}} L = -\gamma \delta^{(\ell)} \otimes Z^{(\ell-1)}, \qquad (12) $$
  $$ \Delta b^{(\ell)} = -\gamma \delta^{(\ell)}, \qquad (13) $$

  where $\gamma$ is a learning rate parameter. See the back-propagation notebook example and the Appendix of the Chapter 2 lecture notes for more details.

• Mini-batch or offline updates involve using many observations of $X$ at the same time. The batch size refers to the number of observations of $X$ used in each pass.

• An epoch refers to a round-trip (forward + backward sweep) over all training examples.

Page 39:

Back-propagation example with three layers

• Suppose that a feed-forward network classifier has two sigmoid activated hidden layers and a softmax activated output layer.

• After a forward pass, the values of $\{Z^{(\ell)}\}_{\ell=1}^3$ are stored and the error $\hat{Y} - Y$, where $\hat{Y} = Z^{(3)}$, is calculated for one observation of $X$.

• The back-propagation and weight updates in the final layer are evaluated:

  $$ \delta^{(3)} = \hat{Y} - Y $$
  $$ \nabla_{W^{(3)}} L = \delta^{(3)} \otimes Z^{(2)}. $$

Page 40:

From Equations 10 and 11, we update the back-propagation error and weight updates for Hidden Layer 2:

$$ \delta^{(2)} = Z^{(2)}(1 - Z^{(2)})(W^{(3)})^T \delta^{(3)}, $$
$$ \nabla_{W^{(2)}} L = \delta^{(2)} \otimes Z^{(1)}. $$

Repeating for Hidden Layer 1:

$$ \delta^{(1)} = Z^{(1)}(1 - Z^{(1)})(W^{(2)})^T \delta^{(2)}, $$
$$ \nabla_{W^{(1)}} L = \delta^{(1)} \otimes X. $$

We update the weights and biases using Equations 12 and 13, so that $b^{(3)} \to b^{(3)} - \gamma\delta^{(3)}$, $W^{(3)} \to W^{(3)} - \gamma\delta^{(3)} \otimes Z^{(2)}$, and repeat for the other weight-bias pairs, $(W^{(\ell)}, b^{(\ell)})_{\ell=1}^2$.

Page 41:

Stopping Criteria**

We can use callbacks in Keras to terminate the training based on stopping criteria.

from keras.callbacks import EarlyStopping
...
callbacks = [EarlyStopping(monitor='loss', min_delta=0, patience=2)]
...
model.fit(X, Y, callbacks=callbacks)

Page 42:

Explanatory Power of Neural Networks

• In a linear regression model

  $$ Y = F_\beta(X) := \beta_0 + \beta_1 X_1 + \cdots + \beta_K X_K $$

  the sensitivities are

  $$ \partial_{X_i} Y = \beta_i $$

• In a FFWD neural network, we can use the chain rule

  $$ \partial_{X_i} \hat{Y} = \partial_{X_i} F_{W,b}(X) = \partial_{X_i} f^{(L)}_{W^{(L)},b^{(L)}} \circ \cdots \circ f^{(1)}_{W^{(1)},b^{(1)}}(X) $$

• For example, with one hidden layer, $\sigma(x) := \tanh(x)$ and $f^{(\ell)}_{W^{(\ell)},b^{(\ell)}}(X) := \sigma(I^{(\ell)})$, where $I^{(\ell)} := W^{(\ell)}X + b^{(\ell)}$:

  $$ \partial_{X_j} \hat{Y} = \sum_i w^{(2)}_i (1 - \sigma^2(I^{(1)}_i)) w^{(1)}_{i,j} \quad \text{where} \quad \partial_x \sigma(x) = 1 - \sigma^2(x) $$

Page 43:

Explanatory Power of Neural Networks

• Or in matrix form, using the Jacobian⁴ $J = D(I^{(1)})W^{(1)}$ of some continuous $\sigma$:

  $$ \partial_X \hat{Y} = W^{(2)} J(I^{(1)}) = W^{(2)} D(I^{(1)}) W^{(1)}, $$

  where $D_{i,i}(I) = \sigma'(I_i)$, $D_{i,j \neq i} = 0$.

• Bounds on the sensitivities are given by the product of the weight matrices

  $$ \min(W^{(2)}W^{(1)}, 0) \leq \partial_X \hat{Y} \leq \max(W^{(2)}W^{(1)}, 0). $$

⁴ When $\sigma$ is an identity function, the Jacobian $J(I^{(1)}) = W^{(1)}$.
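A NumPy sketch of the analytic sensitivity formula above for a one-hidden-layer tanh network; the random weights stand in for a trained model's parameters.

import numpy as np

rng = np.random.default_rng(0)
p, H = 3, 8                                # input dimension and hidden units (illustrative)
W1, b1 = rng.normal(size=(H, p)), rng.normal(size=H)
W2 = rng.normal(size=(1, H))

def sensitivities(x):
    I1 = W1 @ x + b1                       # pre-activation I^(1)
    D = np.diag(1.0 - np.tanh(I1) ** 2)    # D_ii = sigma'(I^(1)_i) for sigma = tanh
    return W2 @ D @ W1                     # dY/dX = W^(2) D(I^(1)) W^(1), shape (1, p)

print(sensitivities(rng.uniform(size=p)))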

Page 44:

Example: Step test

• The model is trained to the following data generation process, where the coefficients of the features are stepped:

  $$ Y = \sum_{i=1}^{10} i X_i, \qquad X_i \sim U(0, 1). $$
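A short sketch of this data generating process; the sample size N is an illustrative choice.

import numpy as np

rng = np.random.default_rng(0)
N, P = 10000, 10                      # sample size is illustrative
X = rng.uniform(size=(N, P))          # X_i ~ U(0, 1)
Y = X @ np.arange(1, P + 1)           # Y = sum_i i * X_i, coefficients 1, ..., 10
# A fitted network's average sensitivity to X_i should recover approximately i.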

Page 45:

Example: Step test

Figure: Step test: the ranked importance of the input variables X1-X10 in a fitted neural network with one hidden layer, measured by sensitivity, Garson's algorithm and Olden's algorithm.

Page 46:

Example: Friedman test

$$ Y = 10\sin(\pi X_1 X_2) + 20(X_3 - 0.5)^2 + 10X_4 + 5X_5 + \epsilon. $$

Page 47:

Figure: Friedman test: estimated sensitivities of the fitted neural network to the inputs X1-X10, measured by sensitivity, Garson's algorithm and Olden's algorithm.

Page 48:

Interaction Terms Introduced by Activation

• Even with one hidden layer, the activation function introduces interaction terms $X_i X_j$.

• To see this, consider the partial derivative

  $$ \partial_{X_j} \hat{Y} = \sum_i w^{(2)}_i \sigma'(I^{(1)}_i) w^{(1)}_{i,j} $$

• and differentiate again with respect to $X_k$, $k \neq j$, to give the Hessian:

  $$ \partial^2_{X_j, X_k} \hat{Y} = \sum_i w^{(2)}_i \sigma''(I^{(1)}_i) w^{(1)}_{i,j} w^{(1)}_{i,k} $$

• which is not zero everywhere unless $\sigma$ is (piecewise) linear or constant.

Page 49:

Interaction Terms Introduced by Activation

Figure: Friedman test: estimated interaction terms in the fitted neural network, ranked by importance (sensitivity).

Page 50:

The Jacobian using ReLU activation

• In matrix form, with $\sigma(x) = \max(x, 0)$, the Jacobian, $J$, can be written as a linear combination of Heaviside functions

  $$ J := J(X) = \partial_X \hat{Y}(X) = W^{(2)} J(I^{(1)}) = W^{(2)} H(W^{(1)}X + b^{(1)}) W^{(1)}, $$

  where $H_{ii}(I^{(1)}) = H(I^{(1)}_i) = 1_{I^{(1)}_i > 0}$, $H_{ij} = 0$, $j \neq i$.

• This can be written in matrix element form as

  $$ J_{ij} = [\partial_X \hat{Y}]_{ij} = \sum_{k=1}^n W^{(2)}_{ik} W^{(1)}_{kj} H(I^{(1)}_k) = \sum_{k=1}^n c_k H_k(I^{(1)}) $$

  where $c_k := W^{(2)}_{ik} W^{(1)}_{kj}$.

• As a linear combination of indicator functions, we have

  $$ J_{ij} = \sum_{k=1}^{n-1} a_k 1_{I^{(1)}_k > 0,\ I^{(1)}_{k+1} \leq 0} + a_n 1_{I^{(1)}_n > 0}, \qquad a_k := \sum_{i=1}^k c_i. $$

Page 51:

The Jacobian using ReLU activation

• Alternatively, the Jacobian can be expressed in terms of $X$:

  $$ J_{ij} = \sum_{k=1}^{n-1} a_k 1_{W^{(1)}_k X > -b^{(1)}_k,\ W^{(1)}_{k+1} X \leq -b^{(1)}_{k+1}} + a_n 1_{W^{(1)}_n X > -b^{(1)}_n}. $$

• wlog: In the case when $p = 1$, this simplifies to a weighted sum of independent Bernoulli trials

  $$ J_{ij} = \sum_{k=1}^{n-1} a_k 1_{x_k < X \leq x_{k+1}} + a_n 1_{x_n < X}, \quad j = 1, $$

  where $x_k := \frac{b^{(1)}_k}{W^{(1)}_k}$. The expectation of the Jacobian is given by

  $$ \mu_n := \mathbb{E}[J_{ij}] = \sum_{k=1}^n a_k p_k, \qquad p_k := \Pr[x_k < X \leq x_{k+1}]\ \forall k = 1, \ldots, n-1, \quad p_n := \Pr[x_n < X]. $$

• For finite weights, the expectation is bounded above by $\sum_{k=1}^n a_k$.

Page 52:

Bounds on the Variance

• The variance is

  $$ \mathbb{V}[J_{ij}] = \sum_{k=1}^{n-1} a_k \mathbb{V}[1_{I^{(1)}_k > 0,\ I^{(1)}_{k+1} \leq 0}] + a_n \mathbb{V}[1_{I^{(1)}_n > 0}] = \sum_{k=1}^n a_k p_k (1 - p_k). $$

• If the weights are constrained so that the mean is constant, $\mu^n_{ij} = \mu_{ij}$, then the weights are $a_k = \frac{\mu}{n p_k}$. Then the variance is monotonically increasing with the number of hidden units:

  $$ \mathbb{V}[J_{ij}] = \mu \frac{n - 1}{n} < \mu. $$

• Otherwise, in general, we have

  $$ \mathbb{V}[J_{ij}] \leq \mu_{ij} \leq \sum_{k=1}^n a_k. $$

• Increasing variance with more hidden units is undesirable for interpretability - we want the opposite effect!

Page 53:

Robustness of Interpretability

• If the data is generated from a linear model, can the neural network recover the correct weights?

  $$ Y = \beta_1 X_1 + \beta_2 X_2 + \epsilon, \quad X_1, X_2, \epsilon \sim N(0, 1), \quad \beta_1 = 1,\ \beta_2 = 1 $$

• Compare OLS estimators with zero hidden layer NNs and single hidden layer NNs.

• Use tanh activation functions for smoothness of the Jacobian, because $\max(x, 0)$ gives a piecewise continuous Jacobian.

• Increase the number of hidden units and show that the variance of the sensitivities converges.

Page 54:

Robustness of Interpretability

Method   β0                                   β1                                             β2
OLS      0                                    1.0154                                         1.018
NN0      b^(1) = 0.03                         W^(1)_1 = 1.0184141                            W^(1)_2 = 1.02141815
NN1      W^(2)σ(b^(1)) + b^(2) = 0.02         E[W^(2)D^(1)(I^(1))W^(1)_1] = 1.013887         E[W^(2)D^(1)(I^(1))W^(1)_2] = 1.02224

Page 55:

Robustness of Interpretability

Figure: (a) density of β1; (b) density of β2.

Page 56:

Robustness of Interpretability (β1)

Hidden Units   Mean         Median       Std.dev       1% C.I.       99% C.I.
2              0.980875     1.0232913    0.10898393    0.58121675    1.0729908
10             0.9866159    1.0083131    0.056483902   0.76814914    1.0322522
50             0.99183553   1.0029879    0.03123002    0.8698967     1.0182846
100            1.0071343    1.0175397    0.028034585   0.89689034    1.0296803
200            1.0152218    1.0249312    0.026156902   0.9119074     1.0363332

The confidence interval narrows with increasing number of hidden units.

Page 57:

Robustness of Interpretability (β2)

Hidden Units   Mean         Median       Std.dev       1% C.I.       99% C.I.
2              0.98129386   1.0233982    0.10931312    0.5787732     1.073728
10             0.9876832    1.0091512    0.057096474   0.76264584    1.0339714
50             0.9903236    1.0020974    0.031827927   0.86471796    1.0152498
100            0.9842479    0.9946766    0.028286876   0.87199813    1.0065105
200            0.9976638    1.0074166    0.026751818   0.8920307     1.0189484

The confidence interval narrows with increasing number of hidden units.

Page 58:

Linear Cross-sectional Factor Models

• Consider the linear cross-sectional factor model:

  $$ r_t = B_{t-1} f_t + \epsilon_t, \quad t = 1, \ldots, T $$

• $B_t = [1 \mid b_{1,t} \mid \cdots \mid b_{K,t}]$ is the $N \times (K + 1)$ matrix of factor exposures

• $(b_{i,t})_k$ is the exposure (a.k.a. loading) of asset $i$ to factor $k$ at time $t$

• $f_t = [\alpha, f_{1,t}, \ldots, f_{K,t}]$ is the $K + 1$ vector of factor returns

• $r_t$ is the $N$ vector of asset returns

• $\rho(f_t, \epsilon_t) = 0$ and cross-sectional heteroskedasticity $\mathbb{E}[\epsilon^2_{t,i}] = \sigma^2_i$, $i \in \{1, \ldots, N\}$.

Page 59:

Non-linear Non-parametric Cross-sectional Factor Models

• Consider the non-linear cross-sectional factor model:

  $$ r_t = F(B_t) + \epsilon_t $$

• $F: \mathbb{R}^K \to \mathbb{R}$ is a differentiable non-linear function

• Do not assume that $\epsilon_t$ is Gaussian or require any other restrictions on $\epsilon_t$, i.e. the error distribution is non-parametric

• The model is used only to predict the next-period returns, and stationarity of the factor returns is not required.

Page 60:

Neural Network Cross-sectional Factor Models⁵

• Neural network cross-sectional factor model:

  $$ r_t = F_{W,b}(B_t) + \epsilon_t $$

  where $F_{W,b}(X)$ is a neural network with $L$ layers

  $$ \hat{r} := F_{W,b}(X) = f^{(L)}_{W^{(L)},b^{(L)}} \circ \cdots \circ f^{(1)}_{W^{(1)},b^{(1)}}(X) $$

⁵ Dixon & Polson, Deep Fundamental Factor Models, 2019, arxiv.org/abs/1903.07677.

Page 61:

Experimental Setup

• Here, we define the universe as the top 250 stocks from the S&P 500 index, ranked by market cap.

• Factors are given by Bloomberg and reported monthly.

• Remove stocks with too many missing factor values, leaving 218 stocks.

• Use a 100-period experiment, fitting a cross-sectional regression model at each period and using the model to forecast the next period's monthly returns.

• At each time step, the canonical experimental design over $t = 2, \ldots, T$ is (see the sketch after this list)

  • $(X^{\text{train}}_t, Y^{\text{train}}_t) = (B_{t-1}, r_{t-1})$
  • $(X^{\text{test}}_t, Y^{\text{test}}_t) = (B_t, r_t)$
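A sketch of this walk-forward design; fit_model is a hypothetical placeholder for either the OLS or the neural network estimator.

def walk_forward(B, r, fit_model):
    """B[t]: N x (K+1) exposure matrix at period t; r[t]: N-vector of returns at period t."""
    forecasts = {}
    for t in range(1, len(r)):
        model = fit_model(B[t - 1], r[t - 1])     # train on (X_t^train, Y_t^train) = (B_{t-1}, r_{t-1})
        forecasts[t] = model.predict(B[t])        # forecast r_t from X_t^test = B_t
    return forecasts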

Page 62:

MSEs

Figure: MSEs. (a) Linear regression; (b) FFWD neural network.

Page 63:

Overfitting

Figure: In-sample MSE as a function of the number of neurons in the hidden layer. The models are trained here without L1 regularization.

Page 64:

Managing the Bias-Variance Tradeoff with L1 Regularization

Figure: The effect of L1 regularization on the MSE errors for a network with 50 neurons in the hidden layer. (a) In-sample; (b) out-of-sample.

Page 65:

Information Ratios (Random Selection of Stocks)

Figure: Random selection of x stocks from the universe in each period.

Page 66:

Information Ratios (In-Sample)

Figure: Selection of the x top performing stocks from the universe in each in-sample period. (a) Linear regression; (b) FFWD neural network.

Page 67:

Information Ratios (Out-of-Sample)

Figure: Selection of the x top performing stocks from the universe in each out-of-sample period. (a) Linear regression; (b) FFWD neural network.

Page 68:

Sensitivities

Figure: The sensitivities and their uncertainty can be estimated from data (black) or derived analytically (red).

Page 69:

Recurrent Neural Networks

• Understand the conceptual formulation and statistical explanation of RNNs

• Understand the different types of RNNs

• Two notebook examples with limit order book data and Bitcoinhistory

• Navigate further learning resources

Page 70:

Recurrent Neural Network Predictors

• Input-output pairs $\mathcal{D} = \{x_t, y_t\}_{t=1}^N$ are auto-correlated observations of $X$ and $Y$ at times $t = 1, \ldots, N$

• Construct a nonlinear time series predictor, $\hat{y}_{t+h}(X_t)$, of an output, $y_{t+h}$, using a high dimensional input matrix of $p$-length sub-sequences $X_t$:

  $$ \hat{y}_{t+h} = F(X_t) \quad \text{where} \quad X_t = \text{seq}_{t-p+1,t}(X) = (x_{t-p+1}, \ldots, x_t) $$

• $x_{t-j}$ is a $j$th lagged observation of $x_t$, $x_{t-j} = L^j[x_t]$, for $j = 0, \ldots, p - 1$.

Page 71:

Recurrent Neural Networks (p=6)

Figure: An unfolded recurrent neural network with p = 6 lagged inputs $x_{t-5}, \ldots, x_t$, hidden states $z^1_s, \ldots, z^H_s$ at each step, and output $\hat{y}_{t+h}$.

Page 72:

Recurrent Neural Networks

• For each time step $s = t - p + 1, \ldots, t$, a function $F_h$ generates a hidden state $z_s$:

  $$ z_s = F_h(x_s, z_{s-1}) := \sigma(W_h x_s + U_h z_{s-1} + b_h), \quad W_h \in \mathbb{R}^{H \times d},\ U_h \in \mathbb{R}^{H \times H} $$

• When the output is continuous, the model output from the final hidden state is given by:

  $$ \hat{y}_{t+h} = F_y(z_t) = W_y z_t + b_y, \quad W_y \in \mathbb{R}^{K \times H} $$

• When the output is categorical, the output is given by

  $$ \hat{y}_{t+h} = F_y(z_t) = \text{softmax}(W_y z_t + b_y) $$

• Goal: find the weight matrices $W = (W_h, U_h, W_y)$ and biases $b = (b_h, b_y)$.

Page 73:

Univariate Example

• Consider the univariate, step-ahead, time series prediction $\hat{y}_t = F(X_{t-1})$, using the $p$ previous observations $\{y_{t-i}\}_{i=1}^p$.

• Because this is a special case where no input is available at time $t$ (since we are predicting it), we form the hidden states up to time $t - 1$, i.e. $z_{t-1}$.

• Consider the simplest case of an RNN with one hidden unit, $H = 1$, no activation function, and input dimension $d = 1$.


Univariate Example

• If $W_h = U_h = \phi$, $|\phi| < 1$, $W_y = 1$ and $b_h = b_y = 0$.

• Then we can show that $\hat{y}_t = F(X_{t-1})$ is a zero-drift autoregressive, AR(p), model with geometrically decaying weights:

$z_{t-p} = \phi y_{t-p},$

$z_{t-p+1} = \phi(z_{t-p} + y_{t-p+1}),$

$\dots$

$z_{t-1} = \phi(z_{t-2} + y_{t-1}),$

$\hat{y}_t = z_{t-1},$

where

$\hat{y}_t = (\phi L + \phi^2 L^2 + \cdots + \phi^p L^p)[y_t].$
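A quick numerical check of this equivalence, under the stated choice of weights (an illustrative sketch, not part of the deck):

```python
import numpy as np

phi, p = 0.8, 6
rng = np.random.default_rng(1)
y = rng.standard_normal(p)               # y_{t-p}, ..., y_{t-1}

# Unrolled RNN with W_h = U_h = phi, W_y = 1 and zero biases
z = phi * y[0]                           # z_{t-p} = phi * y_{t-p}
for y_s in y[1:]:
    z = phi * (z + y_s)                  # z_s = phi * (z_{s-1} + y_s)
rnn_pred = z                             # \hat{y}_t = z_{t-1}

# AR(p) with geometrically decaying weights phi^j on y_{t-j}
ar_pred = sum(phi ** j * y[-j] for j in range(1, p + 1))
print(np.isclose(rnn_pred, ar_pred))     # True
```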


Stability

• Given a lag one unit innovation, $\mathbf{1}$, the output from the hidden layer is

$z_t = \sigma(W_h \mathbf{0} + U_h \mathbf{1} + b_h).$

• For strict stationarity, we require that the absolute value of each component of the hidden variable under a unit impulse at lag one is less than 1,

$|z_t|_j = |\sigma(U_h \mathbf{1} + b_h)|_j < 1,$

which is satisfied if $|\sigma(x)| \leq 1$ and each element of $U_h \mathbf{1} + b_h$ is finite.

• Additionally, if $\sigma$ is strictly monotone increasing, then $|z_t|_j$ under a lag two unit innovation is less than $|z_t|_j$ under a lag one unit innovation:

$|\sigma(U_h \mathbf{1} + b_h)|_j > |\sigma(U_h \sigma(U_h \mathbf{1} + b_h) + b_h)|_j. \quad (14)$

In particular, the choice $\tanh$ satisfies the requirement on $\sigma$.


Multiple Choice Question 1

Identify the following correct statements:

• A linear recurrent neural network with a memory of p lags is an autoregressive model AR(p) with non-parametric error.

• Recurrent neural networks, as time series models, are guaranteed to be stable, for any choice of weights.

• The amount of memory in a shallow recurrent network corresponds to the number of times a single perceptron layer is unfolded.

• The amount of memory in a deep recurrent network corresponds to the number of perceptron layers.


Overview

• RNNs assume that the data is covariance stationary: the autocovariance structure implied by the RNN is fixed over time, and therefore so must be that of the data.

• RNNs are trained using a version of back-propagation known as "back-propagation through time" (BPTT), which is prone to the vanishing gradient problem when the number of lags in the model is large.

• We will see extensions to these plain RNNs which maintain a long-term memory through an exponentially smoothed hidden state.


Exponential Smoothing

• Exponential smoothing is a type of forecasting or filtering method that exponentially decreases the weight of past and current observations to give smoothed predictions $\hat{y}_{t+1}$.

• It requires a single parameter, $\alpha$, also called the smoothing factor or smoothing coefficient.

• This parameter controls the rate at which the influence of the observations at prior time steps decays exponentially.

• $\alpha$ is often set to a value between 0 and 1. Large values mean that the model pays attention mainly to the most recent past observations, whereas smaller values mean more of the history is taken into account when making a prediction.


Exponential Smoothing

• Exponential smoothing takes the forecast for the previous period, $\hat{y}_t$, and adjusts it with the forecast error, $y_t - \hat{y}_t$. The forecast for the next period becomes:

$\hat{y}_{t+1} = \hat{y}_t + \alpha(y_t - \hat{y}_t)$

or equivalently

$\hat{y}_{t+1} = \alpha y_t + (1-\alpha)\hat{y}_t.$

• Writing this as a geometrically decaying autoregressive series back to the first observation:

$\hat{y}_{t+1} = \alpha y_t + \alpha(1-\alpha) y_{t-1} + \alpha(1-\alpha)^2 y_{t-2} + \alpha(1-\alpha)^3 y_{t-3} + \cdots + \alpha(1-\alpha)^{t-1} y_1 + (1-\alpha)^t \hat{y}_1.$
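A minimal sketch of the smoothing recursion $\hat{y}_{t+1} = \alpha y_t + (1-\alpha)\hat{y}_t$ on illustrative data (the unit-impulse series and the value of $\alpha$ are assumptions):

```python
import numpy as np

def exp_smooth(y, alpha, y0=None):
    """Return the one-step-ahead exponentially smoothed forecasts \\hat{y}_{t+1}."""
    y_hat = y[0] if y0 is None else y0              # initialize \hat{y}_1 = y_1
    forecasts = []
    for y_t in y:
        y_hat = alpha * y_t + (1 - alpha) * y_hat   # \hat{y}_{t+1}
        forecasts.append(y_hat)
    return np.array(forecasts)

y = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])        # a unit impulse
print(exp_smooth(y, alpha=0.5))                      # geometrically decaying response
```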


Exponential Smoothing

• We observe that smoothing introduces long-term memory: the prediction depends on the entire observed history, not just on a fixed-length sub-sequence as used for prediction in a plain RNN, for example.

• For geometrically decaying models, it is useful to characterize the decay by the "half-life", the lag $k$ at which the coefficient is equal to one half:

$\alpha(1-\alpha)^k = \frac{1}{2},$

or

$k = -\frac{\ln(2\alpha)}{\ln(1-\alpha)}.$


Dynamic Smoothing

• Dynamic exponential smoothing is a time-dependent, convex combination of the smoothed output, $\hat{y}_t$, and the observation $y_t$:

$\hat{y}_{t+1} = \alpha_t y_t + (1-\alpha_t)\hat{y}_t,$

where $\alpha_t \in [0, 1]$ denotes the dynamic smoothing factor; equivalently, in one-step-ahead forecast form,

$\hat{y}_{t+1} = \hat{y}_t + \alpha_t(y_t - \hat{y}_t).$

• Hence the smoothing can be viewed as a form of dynamic forecast error correction: when $\alpha_t = 1$, the one-step-ahead forecast merely repeats the current observation $y_t$, and when $\alpha_t = 0$, the forecast error is ignored and the forecast is the current model output.


Dynamic Smoothing

• The smoothing can also be viewed as a weighted sum of the lagged observations, with lower or equal weights, $\alpha_{t-s}\prod_{r=1}^{s}(1-\alpha_{t-r+1})$, at the lag $s \geq 1$ past observation, $y_{t-s}$:

$\hat{y}_{t+1} = \alpha_t y_t + \sum_{s=1}^{t-1} \alpha_{t-s}\prod_{r=1}^{s}(1-\alpha_{t-r+1})\, y_{t-s} + \prod_{r=0}^{t-1}(1-\alpha_{t-r})\, \hat{y}_1,$

where the last term is a time-dependent constant and typically we initialize the exponential smoother with $\hat{y}_1 = y_1$.

• Note that if any $\alpha_{t-r+1} = 1$, the prediction $\hat{y}_{t+1}$ has no dependency on the lags $\{y_{t-s}\}_{s \geq r}$: the model simply forgets the observations at or beyond the $r$-th lag.

• In the special case when the smoothing is constant and equal to $1-\alpha$, the weights reduce to a geometric sequence and the expression above collapses to the fixed-parameter exponential smoother given earlier, with the roles of $\alpha$ and $1-\alpha$ interchanged.


Dynamic Smoothing

• Putting this together gives the following dynamic RNN:

smoothing: $\tilde{h}_t = \alpha_t \hat{h}_t + (1-\alpha_t)\tilde{h}_{t-1}$ (15)

smoother update: $\alpha_t = \sigma^{(1)}(U_\alpha \tilde{h}_{t-1} + W_\alpha x_t + b_\alpha)$ (16)

hidden state update: $\hat{h}_t = \sigma(U_h \tilde{h}_{t-1} + W_h x_t + b_h),$ (17)

where $\sigma^{(1)}$ is a sigmoid or Heaviside function and $\sigma$ is any activation function.


Figure: An illustrative example of the response of a simplified GRU, compared with a plain RNN and an RNN with an exponentially smoothed hidden state under a constant $\alpha$. The plain RNN loses memory of the unit impulse after three lags, whereas the RNNs with smoothed hidden states maintain memory of the first unit impulse even when the second unit impulse arrives. The difference between the dynamically smoothed RNN (the toy GRU) and the smoothed RNN with a fixed smoothing parameter appears insignificant. Keep in mind, however, that the dynamic smoothing model has much more flexibility in controlling the sensitivity of the smoothing to the unit impulses.


Gated Recurrent Units (GRUs)

• We introduce a reset, or switch, $r_t$, in order to forget the dependence of the candidate hidden state $\hat{h}_t$ on previous data.

• Effectively, when the reset is closed ($r_t = 0$), the update for $\hat{h}_t$ turns from an RNN into a FNN.

$h_t = Z_t \hat{h}_t + (1-Z_t) h_{t-1}$ (18)

$Z_t = \sigma(U_Z h_{t-1} + W_Z x_t + b_Z)$ (19)

$\hat{h}_t = \tanh(U_h (r_t \circ h_{t-1}) + W_h x_t + b_h)$ (20)

$r_t = \sigma(U_r h_{t-1} + W_r x_t + b_r)$ (21)

$\hat{y}_t = W_Y h_t + b_Y$ (22)

• $r_t$ is the switching (reset) parameter which determines how much recurrence is used for the candidate hidden state $\hat{h}_t$.
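A minimal NumPy sketch of one step of the GRU cell in Equations (18)-(22); the dimensions, random weights and the sigmoid helper are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, P):
    """One GRU update following Equations (18)-(22); P holds the weights."""
    r_t = sigmoid(P["Ur"] @ h_prev + P["Wr"] @ x_t + P["br"])              # reset gate (21)
    h_cand = np.tanh(P["Uh"] @ (r_t * h_prev) + P["Wh"] @ x_t + P["bh"])   # candidate (20)
    z_t = sigmoid(P["Uz"] @ h_prev + P["Wz"] @ x_t + P["bz"])              # smoothing gate (19)
    h_t = z_t * h_cand + (1 - z_t) * h_prev                                # smoothed state (18)
    y_t = P["Wy"] @ h_t + P["by"]                                          # output (22)
    return h_t, y_t

# Illustrative dimensions: d = 1 input, H = 3 hidden units
rng = np.random.default_rng(2)
d, H = 1, 3
P = {k: 0.1 * rng.standard_normal((H, d)) for k in ("Wr", "Wh", "Wz")}
P.update({k: 0.1 * rng.standard_normal((H, H)) for k in ("Ur", "Uh", "Uz")})
P.update({"br": np.zeros(H), "bh": np.zeros(H), "bz": np.zeros(H),
          "Wy": 0.1 * rng.standard_normal((1, H)), "by": np.zeros(1)})
h, y = gru_step(rng.standard_normal(d), np.zeros(H), P)
```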


References

• S. Haykin, Kalman Filtering and Neural Networks, volume 47. John Wiley & Sons, 2004.

• M.F. Dixon, N. Polson and V. Sokolov, Deep Learning for Spatio-Temporal Modeling: Dynamic Traffic Flows and High Frequency Trading, Applied Stochastic Models in Business and Industry, arXiv:1705.09851, 2018.

• M.F. Dixon, Sequence Classification of the Limit Order Book using Recurrent Neural Networks, J. Computational Science, Special Issue on Topics in Computational and Algorithmic Finance, arXiv:1707.05642, 2018.

• M.F. Dixon, A High Frequency Trade Execution Model for Supervised Learning, High Frequency, arXiv:1710.03870, 2018.


Multiple Choice Question 1: Answers

• A linear recurrent neural network with a memory of p lags is an autoregressive model AR(p) with non-parametric error.

• Recurrent neural networks, as time series models, are guaranteed to be stable, for any choice of weights.

• The amount of memory in a shallow recurrent network corresponds to the number of times a single perceptron layer is unfolded.

• The amount of memory in a deep recurrent network corresponds to the number of perceptron layers.

Answer: 1, 2, 3


Learning Objectives

• Understand the conceptual formulation and statistical explanation of CNNs.

• Understand 1D CNNs for time series prediction.

• A notebook example on univariate time series prediction with 1D CNNs.

• Understand how dilation CNNs capture multi-scales.

• Understand 2D CNNs for image processing. Supplementary notebook.


Multiple Choice Question 2

Identify the following correct statements:

• A feedforward network can, in principle, always be used for any supervised learning problem.

• A linear recurrent neural network with a memory of p lags is an autoregressive model AR(p).

• Recurrent neural networks, as time series models, are guaranteed to be stable, for any choice of weights and activation.

• Feedforward and recurrent neural network predictors require that the data is stationary.

• GRUs allow the hidden state to persist beyond a sequence by using smoothing.


Multiple Choice Question 2

Identify the following correct statements:

• A feedforward network can, in principle, always be used for any supervised learning problem.

• A linear recurrent neural network with a memory of p lags is an autoregressive model AR(p).

• Recurrent neural networks, as time series models, are guaranteed to be stable, for any choice of weights and activation.

• Feedforward and recurrent neural network predictors require that the data is stationary.

• GRUs allow the hidden state to persist beyond a sequence by using smoothing.

Answer: 1, 2, 4, 5.


Convolutional neural networks (CNN)

• Convolutional neural networks (CNN) are feedforward neural networks that can exploit local spatial structures in the input data.

• CNNs attempt to reduce the network size by exploiting spatial or temporal locality.

• CNNs use sparse connections between the different layers.

• Convolution is frequently used for image processing, such as for smoothing, sharpening, and edge detection of images.

Figure: Convolutional Neural Networks. Source: http://www.asimovinstitute.org/neural-network-zoo


Weighted Moving Average Smoothers

• A common technique in time series analysis and signal processing is to filter the time series.

• We have already seen exponential smoothing as a special case of a class of smoothers known as "weighted moving average (WMA)" smoothers.

• WMA smoothers take the form

$\bar{x}_t = \frac{1}{\sum_{i \in I} w_i} \sum_{i \in I} w_i x_{t-i}$

where $\bar{x}_t$ is the local mean of the time series. The weights are specified to emphasize or de-emphasize particular observations $x_{t-i}$ in the span $|I|$.

• Examples of well known smoothers include the Hanning smoother h(3):

$\bar{x}_t = (x_{t-1} + 2x_t + x_{t+1})/4.$

• Such smoothers have the effect of reducing noise in the time series; the moving average filter is a simple low-pass Finite Impulse Response (FIR) filter.

• A filter takes $|I|$ samples of input at a time and computes the weighted average to produce a single output. As the length of the filter increases, the smoothness of the output increases, whereas the sharp modulations in the data are smoothed out.
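A short sketch of applying WMA smoothers such as the Hanning filter h(3) with np.convolve (the noisy sine series is illustrative):

```python
import numpy as np

t = np.linspace(0, 4 * np.pi, 200)
x = np.sin(t) + 0.3 * np.random.default_rng(3).standard_normal(t.size)  # noisy series

hanning = np.array([1.0, 2.0, 1.0]) / 4.0          # Hanning smoother h(3)
x_bar = np.convolve(x, hanning, mode="same")       # local weighted mean \bar{x}_t

flat = np.ones(11) / 11.0                          # longer flat filter: smoother output,
x_bar_11 = np.convolve(x, flat, mode="same")       # sharp modulations are smoothed out
```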


Moving Average Smoothers

Figure: Example of a moving average filter applied to time series.


Convolution for Time Series

• The moving average filter is in fact a (discrete) convolution using a very simple filter kernel.

• The simplest discrete convolution gives the relation between $x_i$ and $x_j$:

$x_{t-i} = \sum_{j=0}^{t-1} \delta_{ij} x_{t-j}, \quad i \in \{0, \dots, t-1\},$

where we have used the Kronecker delta $\delta$.

• The kernel-filtered time series is a convolution:

$\bar{x}_{t-i} = \sum_{j \in J} K_{j+k+1} x_{t-i-j}, \quad i \in \{k+1, \dots, p-k\}.$

For simplicity, the ends of the sequence are assumed to be unsmoothed, but for notational convenience we set $\bar{x}_{t-i} = x_{t-i}$ for $i \in \{1, \dots, k\} \cup \{p-k+1, \dots, p\}$.


Convolution for Time Series

• The filtered AR(p) model is

$\hat{x}_t = \mu + \sum_{i=1}^{p} \phi_i \bar{x}_{t-i}$ (23)

$\quad = \mu + (\phi_1 L + \phi_2 L^2 + \cdots + \phi_p L^p)[\bar{x}_t]$ (24)

$\quad = \mu + [L, L^2, \dots, L^p]\,\phi\,[\bar{x}_t],$ (25)

with coefficients $\phi := [\phi_1, \dots, \phi_p]^T$. Note that there is no look-ahead bias because we do not smooth the last $k$ values of the observed data $\{x_s\}_{s=1}^{t}$.


1D CNN with one hidden unit

• A simple 1D CNN consists of a feedforward output layer and a non-activated hidden layer with one unit (i.e. one kernel):

$\hat{x}_t = W_y z_t + b_y, \quad z_t = [\bar{x}_{t-1}, \dots, \bar{x}_{t-p}]^T, \quad W_y = \phi^T, \quad b_y = \mu,$

where $\bar{x}_{t-i}$ is the $i$-th output from a convolution of the $p$-length input sequence with a kernel consisting of $2k+1$ weights.

• These weights are fixed over time and hence the CNN is only suited to prediction from stationary time series.

• Note also, in contrast to a RNN, that the size of the weight matrix $W_y$ increases with the number of lags in the model.


1D CNN with H hidden units

• The univariate CNN predictor with $p$ lags and $H$ activated hidden units (kernels) is

$\hat{x}_t = W_y \, \text{vec}(z_t) + b_y$ (26)

$[z_t]_{i,m} = \sigma\Big(\sum_{j \in J} K_{m, j+k+1} \, x_{t-i-j} + [b_h]_m\Big)$ (27)

$\quad\quad = \sigma(K * x_t + b_h)$ (28)

where $m \in \{1, \dots, H\}$ denotes the index of the kernel, the kernel matrix $K \in \mathbb{R}^{H \times (2k+1)}$, the hidden bias vector $b_h \in \mathbb{R}^H$, and the output matrix $W_y \in \mathbb{R}^{1 \times pH}$.
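For illustration, a univariate 1D CNN predictor along these lines can be sketched in Keras; the number of lags, kernels, kernel width and the toy training data below are assumptions, not the deck's notebook:

```python
import numpy as np
from tensorflow.keras import layers, models

p, H, kernel_width = 20, 8, 3          # p lags, H kernels, kernel of 2k+1 = 3 weights

model = models.Sequential([
    layers.Input(shape=(p, 1)),                        # sub-sequence (x_{t-p}, ..., x_{t-1})
    layers.Conv1D(filters=H, kernel_size=kernel_width,
                  padding="same", activation="relu"),  # hidden feature maps [z_t]_{i,m}
    layers.Flatten(),                                  # vec(z_t), length p * H
    layers.Dense(1),                                   # output W_y vec(z_t) + b_y
])
model.compile(optimizer="adam", loss="mse")

# Toy data: predict x_t from its previous p values
rng = np.random.default_rng(4)
x = np.cumsum(0.1 * rng.standard_normal(500))
X = np.stack([x[i:i + p] for i in range(len(x) - p)])[..., None]
y = x[p:]
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```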


Remarks on CNNs

• Since the size of $W_y$ increases with both the number of lags and the number of kernels, it may be preferable to reduce the dimensionality of the weights with an additional layer and hence avoid over-fitting.

• Convolutional neural networks are not limited to sequential models. One might, for example, sample the past lags non-uniformly so that $I = \{2^i\}_{i=1}^{p}$; then the maximum lag in the model is $2^p$. Such a non-sequential model allows a large maximum lag without capturing all the intermediate lags.


Stationarity

• A univariate linear CNN predictor, with one kernel and no activation, can be written in the canonical form

$\hat{x}_t = \mu + (1 - \Phi(L))[K * x_t] = \mu + K * (1 - \Phi(L))[x_t]$ (29)

$\quad = \mu + (\bar{\phi}_1 L + \cdots + \bar{\phi}_p L^p)[x_t] := \mu + (1 - \bar{\Phi}(L))[x_t],$ (30)

where, by the linearity of $\Phi(L)$ in $x_t$, the convolution commutes and thus we can write $\bar{\phi} := K * \phi$.

• Provided that $\bar{\Phi}(L)^{-1}$ forms a convergent sequence in the noise process $\{\varepsilon_s\}_{s=1}^{t}$, the model is stable.

• Finding the roots of the characteristic equation

$\bar{\Phi}(z) = 0,$

it follows that the linear CNN is strictly stationary and ergodic if all the roots lie outside the unit circle in the complex plane, $|\lambda_i| > 1$, $i \in \{1, \dots, p\}$.


Equivalent Eigenvalue Problem*

• Finding roots of polynomials is equivalent to finding eigenvalues of the "companion matrix".

• A famous theorem states that the roots of any polynomial can be found by turning it into a matrix and finding the eigenvalues.

• Given the degree-$p$ polynomial (note that the $z^p$ coefficient is 1)

$q(z) = c_0 + c_1 z + \cdots + c_{p-1} z^{p-1} + z^p,$

we define the $p \times p$ companion matrix

$C := \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ -c_0 & -c_1 & \cdots & -c_{p-2} & -c_{p-1} \end{pmatrix}$

then the characteristic polynomial $\det(C - \lambda I) = q(\lambda)$, and so the eigenvalues of $C$ are the roots of $q$. If the polynomial does not have a unit leading coefficient, one can divide the polynomial by that coefficient to bring it into the monic form above without changing its roots.
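A small NumPy check of this equivalence (an illustrative sketch; np.roots is used only for comparison):

```python
import numpy as np

# q(z) = c0 + c1 z + ... + c_{p-1} z^{p-1} + z^p, here p = 3 with illustrative coefficients
c = np.array([-0.5, 0.2, 0.1])          # [c0, c1, c2]
p = len(c)

C = np.zeros((p, p))
C[:-1, 1:] = np.eye(p - 1)              # super-diagonal of ones
C[-1, :] = -c                           # last row: -c0, ..., -c_{p-1}

# The eigenvalues of the companion matrix coincide with the roots of q
print("eigenvalues:", np.sort_complex(np.linalg.eigvals(C)))
print("roots:      ", np.sort_complex(np.roots(np.r_[1.0, c[::-1]])))  # highest degree first
```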


Strided Convolution

• In the slides above, we performed the convolution by sliding the filter window by a unit increment. This is known as using a stride, $s$, of one (a common choice in CNNs is to take $s = 2$).

• More generally, given an integer $s \geq 1$, a convolution with stride $s$ for an input (feature) $f \in \mathbb{R}^m$ is defined as:

$[K *_s f]_i = \sum_{p=-k}^{k} K_p \, f_{s(i-1)+1+p}, \quad i \in \{1, \dots, \lceil m/s \rceil\}.$ (31)

Here $\lceil m/s \rceil$ denotes the smallest integer greater than or equal to $m/s$.


Dilated Convolution

• In addition to image processing, CNNs have also been successfully applied to time series. WaveNet, for example, is a CNN developed for audio processing.

• Time series often display long-term correlations. Moreover, the dependent variable(s) may exhibit non-linear dependence on the lagged predictors.

• Recall that a CNN for time series is a non-linear p-auto-regression of the form

$y_t = \sum_{i=1}^{p} \phi_i(x_{t-i}) + \varepsilon_t$ (32)

where the coefficient functions $\phi_i$, $i \in \{1, \dots, p\}$, are data-dependent and optimized through the convolutional network.

• A dilated convolution effectively allows the network to operate on a coarser scale than with a normal convolution. This is similar to pooling or strided convolutions, but here the output has the same size as the input.

• The network can learn multi-scale, non-linear dependencies by using stacked layers of dilated convolutions.


Dilated convolution

• In a dilated convolution the filter is applied to every $d$-th element in the input vector, allowing the model to efficiently learn connections between far-apart data points.

• For an architecture with $L$ layers of dilated convolutions, $\ell \in \{1, \dots, L\}$, a dilated convolution outputs a stack of "feature maps" given by

$[K^{(\ell)} *_{d^{(\ell)}} f^{(\ell-1)}]_i = \sum_{p=-k}^{k} K_p^{(\ell)} \, f^{(\ell-1)}_{d^{(\ell)}(i-1)+1+p}, \quad i \in \{1, \dots, \lceil m/d^{(\ell)} \rceil\},$ (33)

where $d^{(\ell)}$ is the dilation factor and we can choose the dilations to increase by a factor of two: $d^{(\ell)} = 2^{\ell-1}$.

• The filters for each layer, $K^{(\ell)}$, are chosen to be of size $1 \times k' = 1 \times 2$, where $k' := 2k+1$.


Dilated Convolution

• The output is the forecasted time series $\hat{Y} = (\hat{y}_t)_{t=0}^{N-1}$.

Figure: A dilated convolutional neural network with three layers. The receptive field is given by r = 8, i.e. one output value is influenced by eight input neurons.

• The receptive field depends on the number of layers $L$ and the filter size $k'$, and is given by

$r = 2^{L-1} k'.$ (34)
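A WaveNet-style stack of dilated causal convolutions along these lines can be sketched in Keras as follows; the layer count, channel width and causal padding are illustrative assumptions:

```python
from tensorflow.keras import layers, models

L, k_prime, channels = 3, 2, 8          # three layers, filter size k' = 2

inputs = layers.Input(shape=(None, 1))  # univariate series of arbitrary length
h = inputs
for ell in range(1, L + 1):
    h = layers.Conv1D(filters=channels, kernel_size=k_prime,
                      dilation_rate=2 ** (ell - 1),   # d^(l) = 2^(l-1): 1, 2, 4
                      padding="causal", activation="relu")(h)
outputs = layers.Conv1D(filters=1, kernel_size=1)(h)  # forecasted series
model = models.Model(inputs, outputs)

# Receptive field r = 2^(L-1) k': eight input values influence each output value
print(2 ** (L - 1) * k_prime)
```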


Learning Objectives

• Understand principal component analysis for dimension reduction.

• Understand how to formulate a linear autoencoder.

• Understand how a linear autoencoder performs PCA.

• Notebook example using dimension reduction on the yield curve.


Preliminaries

• Let $\{y_i\}_{i=1}^{N}$ be a set of N observation vectors, each of dimension n. We assume that $n \leq N$.

• Let $\mathbf{Y} \in \mathbb{R}^{n \times N}$ be the matrix whose columns are $\{y_i\}_{i=1}^{N}$:

$\mathbf{Y} = \begin{pmatrix} | & & | \\ y_1 & \cdots & y_N \\ | & & | \end{pmatrix}.$

• The element-wise average of the N observations is an n-dimensional signal which may be written as:

$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i = \frac{1}{N}\mathbf{Y}\mathbf{1}_N,$

where $\mathbf{1}_N \in \mathbb{R}^{N \times 1}$ is a column vector of all ones.

• Let $\mathbf{Y}_0$ be the matrix whose columns are the demeaned observations (we center each observation $y_i$ by subtracting $\bar{y}$ from it):

$\mathbf{Y}_0 = \mathbf{Y} - \bar{y}\mathbf{1}_N^T.$


Projection

• A linear projection from $\mathbb{R}^n$ to $\mathbb{R}^m$ is a linear transformation of a finite dimensional vector given by a matrix multiplication:

$x_i = \mathbf{W}^T y_i,$

where $y_i \in \mathbb{R}^n$, $x_i \in \mathbb{R}^m$, and $\mathbf{W} \in \mathbb{R}^{n \times m}$.

• Each element $j$ of the vector $x_i$ is an inner product between $y_i$ and the $j$-th column of $\mathbf{W}$, which we denote by $w_j$.

• Let $\mathbf{X} \in \mathbb{R}^{m \times N}$ be the matrix whose columns are the set of N transformed observations, let $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{1}{N}\mathbf{X}\mathbf{1}_N$ be the element-wise average, and $\mathbf{X}_0 = \mathbf{X} - \bar{x}\mathbf{1}_N^T$ the de-meaned matrix.

• Clearly, $\mathbf{X} = \mathbf{W}^T\mathbf{Y}$ and $\mathbf{X}_0 = \mathbf{W}^T\mathbf{Y}_0$.


Principal Component Projection*

• When the matrix $\mathbf{W}^T$ represents the transformation that applies principal component analysis, we denote $\mathbf{W} = \mathbf{P}$, and the columns of the orthonormal matrix $\mathbf{P}$ (i.e. $\mathbf{P}^{-1} = \mathbf{P}^T$), denoted $\{p_j\}_{j=1}^{n}$, are referred to as loading vectors.

• The transformed vectors $\{x_i\}_{i=1}^{N}$ are referred to as principal components or scores.

• The first loading vector is defined as the unit vector with which the inner products of the observations have the greatest variance:

$p_1 = \arg\max_{w_1} \; w_1^T \mathbf{Y}_0\mathbf{Y}_0^T w_1 \quad \text{s.t.} \quad w_1^T w_1 = 1.$ (35)

• The solution to (35) is known to be the eigenvector of the sample covariance matrix $\mathbf{Y}_0\mathbf{Y}_0^T$ corresponding to its largest eigenvalue (we normalize the eigenvector and disregard its sign).


Principal Component Projection*

• Next, $p_2$ is the unit vector which has the largest variance of inner products between it and the observations after removing the orthogonal projections of the observations onto $p_1$. It may be found by solving:

$p_2 = \arg\max_{w_2} \; w_2^T \big(\mathbf{Y}_0 - p_1 p_1^T \mathbf{Y}_0\big)\big(\mathbf{Y}_0 - p_1 p_1^T \mathbf{Y}_0\big)^T w_2 \quad \text{s.t.} \quad w_2^T w_2 = 1.$ (36)

• The solution to (36) is known to be the eigenvector corresponding to the largest eigenvalue under the constraint that it is not collinear with $p_1$.

• Similarly, the remaining loading vectors are equal to the remaining eigenvectors of $\mathbf{Y}_0\mathbf{Y}_0^T$, corresponding to descending eigenvalues.


Principal Components

• The eigenvalues of $\mathbf{Y}_0\mathbf{Y}_0^T$, which is a positive semi-definite matrix, are non-negative.

• They are not necessarily distinct, but since the matrix is symmetric it has n mutually orthogonal eigenvectors and is always diagonalizable.

• Thus, the matrix $\mathbf{P}$ may be computed by diagonalizing the covariance matrix:

$\mathbf{Y}_0\mathbf{Y}_0^T = \mathbf{P}\Lambda\mathbf{P}^{-1} = \mathbf{P}\Lambda\mathbf{P}^T,$

where $\Lambda = \mathbf{X}_0\mathbf{X}_0^T$ is a diagonal matrix whose diagonal elements $\{\lambda_i\}_{i=1}^{n}$ are sorted in descending order.

• The transformation back to the observations is $\mathbf{Y} = \mathbf{P}\mathbf{X}$.

• The fact that the covariance matrix of $\mathbf{X}$ is diagonal means that PCA is a decorrelation transformation; it is often used to denoise data.
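A compact NumPy sketch of this diagonalization on illustrative data (eigh is used because $\mathbf{Y}_0\mathbf{Y}_0^T$ is symmetric):

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 5, 200
Y = rng.standard_normal((n, N))                      # columns are observations y_i

y_bar = Y.mean(axis=1, keepdims=True)                # element-wise average
Y0 = Y - y_bar                                       # demeaned observations

lam, P = np.linalg.eigh(Y0 @ Y0.T)                   # eigen-decomposition (ascending order)
order = np.argsort(lam)[::-1]                        # re-sort eigenvalues in descending order
lam, P = lam[order], P[:, order]                     # columns of P are the loading vectors

X = P.T @ Y0                                         # principal components (scores)
C = X @ X.T / N                                      # covariance of the scores
print(np.allclose(C, np.diag(np.diag(C))))           # True: scores are decorrelated
```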


Dimensionality reduction

• PCA is often used as a method for dimensionality reduction, the process of reducing the number of variables in a model in order to avoid the curse of dimensionality.

• PCA gives the first m principal components (m < n) by applying the truncated transformation

$\mathbf{X}_m = \mathbf{P}_m^T\mathbf{Y},$

where each column of $\mathbf{X}_m \in \mathbb{R}^{m \times N}$ is a vector whose elements are the first m principal components, and $\mathbf{P}_m$ is the matrix whose columns are the first m loading vectors,

$\mathbf{P}_m = \begin{pmatrix} | & & | \\ p_1 & \cdots & p_m \\ | & & | \end{pmatrix} \in \mathbb{R}^{n \times m}.$

• Intuitively, by keeping only m principal components we are losing information, and we minimize this loss of information by maximizing their variances.


Minimum total squared reconstruction error

• $\mathbf{P}_m$ is also a solution to:

$\min_{\mathbf{W} \in \mathbb{R}^{n \times m}} \big\| \mathbf{Y}_0 - \mathbf{W}\mathbf{W}^T\mathbf{Y}_0 \big\|_F^2 \quad \text{s.t.} \quad \mathbf{W}^T\mathbf{W} = \mathbf{I}_{m \times m},$ (37)

where $\|\cdot\|_F$ denotes the Frobenius matrix norm.

• The m leading loading vectors are an orthonormal basis which spans the m-dimensional subspace onto which the projections of the demeaned observations have the minimum squared difference from the original demeaned observations.

• In other words, $\mathbf{P}_m$ compresses each demeaned vector of length n into a vector of length m (where m ≤ n) in a way that minimizes the sum of total squared reconstruction errors.

• The minimizer of (37) is not unique: $\mathbf{W} = \mathbf{P}_m\mathbf{Q}$ is also a solution, where $\mathbf{Q} \in \mathbb{R}^{m \times m}$ is any orthogonal matrix, $\mathbf{Q}^T = \mathbf{Q}^{-1}$. Multiplying $\mathbf{P}_m$ from the right by $\mathbf{Q}$ transforms the first m loading vectors into a different orthonormal basis for the same subspace.


Minimum total squared reconstruction error

• A neural network that is trained to learn the identity function $f(\mathbf{Y}) = \mathbf{Y}$ is called an autoencoder. Its output layer has the same number of nodes as the input layer, and the cost function is some measure of the reconstruction error.

• Autoencoders are unsupervised learning models, and they are often used for the purpose of dimensionality reduction.

• A simple autoencoder that implements dimensionality reduction is a feed-forward autoencoder with at least one layer that has a smaller number of nodes, which functions as a bottleneck.

• After training the neural network using backpropagation, it is separated into two parts: the layers up to the bottleneck are used as an encoder, and the remaining layers are used as a decoder.

• In the simplest case, there is only one hidden layer (the bottleneck), and the layers in the network are fully connected.
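A minimal sketch of such a linear bottleneck autoencoder in Keras (the dimensions n and m, the synthetic data and the training settings are illustrative assumptions; note that Keras expects observations in rows rather than columns):

```python
import numpy as np
from tensorflow.keras import layers, models

n, m = 10, 3                                             # input dimension n, bottleneck m

inputs = layers.Input(shape=(n,))
code = layers.Dense(m, activation="linear")(inputs)      # encoder: x = W1 y + b1
outputs = layers.Dense(n, activation="linear")(code)     # decoder: y_hat = W2 x + b2
autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")        # squared reconstruction error

encoder = models.Model(inputs, code)                     # layers up to the bottleneck

# Illustrative data with low-dimensional structure (rows are observations)
rng = np.random.default_rng(6)
Y = rng.standard_normal((500, m)) @ rng.standard_normal((m, n)) \
    + 0.01 * rng.standard_normal((500, n))
autoencoder.fit(Y, Y, epochs=20, batch_size=32, verbose=0)
codes = encoder.predict(Y, verbose=0)                    # m-dimensional representation
```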


Linear autoencoders*

• In the case that no non-linear activation function is used, $x_i = W^{(1)} y_i + b^{(1)}$ and $\hat{y}_i = W^{(2)} x_i + b^{(2)}$. If the cost function is the total squared difference between output and input, then training the autoencoder on the input data matrix $\mathbf{Y}$ solves:

$\min_{W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}} \Big\| \mathbf{Y} - \Big( W^{(2)}\big(W^{(1)}\mathbf{Y} + b^{(1)}\mathbf{1}_N^T\big) + b^{(2)}\mathbf{1}_N^T \Big) \Big\|_F^2.$ (38)

• If we set the partial derivative with respect to $b^{(2)}$ to zero and insert the solution into (38), then the problem becomes:

$\min_{W^{(1)}, W^{(2)}} \Big\| \mathbf{Y}_0 - W^{(2)}W^{(1)}\mathbf{Y}_0 \Big\|_F^2$

• Thus, for any $b^{(1)}$, the optimal $b^{(2)}$ is such that the problem becomes independent of $b^{(1)}$ and of $\bar{y}$. Therefore, we may focus only on the weights $W^{(1)}$, $W^{(2)}$.


Linear autoencoders as orthogonal projections

• Setting the gradients to zero, $W^{(1)}$ is the left Moore-Penrose pseudoinverse of $W^{(2)}$ (and $W^{(2)}$ is the right pseudoinverse of $W^{(1)}$):

$W^{(1)} = (W^{(2)})^{\dagger} = \big((W^{(2)})^T W^{(2)}\big)^{-1}(W^{(2)})^T$

• The minimization with respect to a single matrix is:

$\min_{W^{(2)} \in \mathbb{R}^{n \times m}} \Big\| \mathbf{Y}_0 - W^{(2)}(W^{(2)})^{\dagger}\mathbf{Y}_0 \Big\|_F^2$ (39)

The matrix $W^{(2)}(W^{(2)})^{\dagger} = W^{(2)}\big((W^{(2)})^T W^{(2)}\big)^{-1}(W^{(2)})^T$ is the orthogonal projection operator onto the column space of $W^{(2)}$ when its columns are not necessarily orthonormal.

• This problem is very similar to (37), but without the orthonormality constraint.


Linear autoencoders: mathematical properties*

• It can be shown that $W^{(2)}$ is a minimizer of (39) if and only if its column space is spanned by the first m loading vectors of $\mathbf{Y}$.

• The linear autoencoder is said to apply PCA to the input data in the sense that its output is a projection of the data onto the low dimensional principal subspace.

• However, unlike actual PCA, the coordinates of the output of the bottleneck are correlated and are not sorted in descending order of variance.

• The solutions for reduction to different dimensions are not nested: when reducing the data from dimension n to dimension $m_1$, the first $m_2$ vectors ($m_2 < m_1$) are not an optimal solution to reduction from dimension n to $m_2$, which therefore requires training an entirely new autoencoder.


Equivalence of linear autoencoders and PCA*

Hypothesis: The first m loading vectors of $\mathbf{Y}$ are the first m left singular vectors of the matrix $W^{(2)}$ which minimizes (39).

• Train the linear autoencoder on the original dataset $\mathbf{Y}$ and then compute the first m left singular vectors of $W^{(2)} \in \mathbb{R}^{n \times m}$, where typically $m \ll N$.

• The loading vectors may also be recovered from the weights of the hidden layer, $W^{(1)}$, by a singular value decomposition.

• If $W^{(2)} = \mathbf{U}\Sigma\mathbf{V}^T$, which we assume is full-rank, then

$W^{(1)} = (W^{(2)})^{\dagger} = \mathbf{V}\Sigma^{\dagger}\mathbf{U}^T$

and

$W^{(2)}(W^{(2)})^{\dagger} = \mathbf{U}\Sigma\mathbf{V}^T\mathbf{V}\Sigma^{\dagger}\mathbf{U}^T = \mathbf{U}\Sigma\Sigma^{\dagger}\mathbf{U}^T = \mathbf{U}_m\mathbf{U}_m^T,$ (40)

where we used the fact that $(\mathbf{V}^T)^{\dagger} = \mathbf{V}$ and that $\Sigma^{\dagger} \in \mathbb{R}^{m \times n}$ is the matrix whose diagonal elements are $\frac{1}{\sigma_j}$ (assuming $\sigma_j \neq 0$, and 0 otherwise).

• The matrix $\Sigma\Sigma^{\dagger}$ is a diagonal matrix whose first m diagonal elements are equal to one and whose other $n - m$ elements are equal to zero.

• The matrix $\mathbf{U}_m \in \mathbb{R}^{n \times m}$ is the matrix whose columns are the first m left singular vectors of $W^{(2)}$.

• Thus, the first m left singular vectors of $(W^{(1)})^T \in \mathbb{R}^{n \times m}$ are also equal to the first m loading vectors of $\mathbf{Y}$.


Figure: This figure shows the yield curve over time; each line corresponds to a different maturity in the term-structure of interest rates.


Figure: The covariance matrix of the data in the transformed coordinates, according to (a) the loading vectors computed by applying SVD to the entire dataset, $\mathbf{P}_m^T\mathbf{Y}_0\mathbf{Y}_0^T\mathbf{P}_m$; (b) the weights of the linear autoencoder, $(W^{(2)})^T\mathbf{Y}_0\mathbf{Y}_0^T W^{(2)}$; and (c) the left singular vectors of the autoencoder weights, $\mathbf{U}_m^T\mathbf{Y}_0\mathbf{Y}_0^T\mathbf{U}_m$.


Figure: (a) The first two principal components of $\Delta\mathbf{Y}_0$, projected using $\mathbf{P}_m$; (b) the first two approximated principal components (up to a sign change), projected using $\mathbf{U}_m$. The first principal component is represented by the x-axis and the second by the y-axis.


Summary

• Deep Learning for cross-sectional data

• Background: review of supervised machine learning

• Introduction to feedforward neural networks

• Related approximation and learning theory

• Training and back-propagation

• Interpretability with factor modeling example

• Deep Learning for time-series data prediction

• Understand the conceptual formulation and statistical explanation of RNNs

• Understand the different types of RNNs, including plain RNNs, GRUs and LSTMs

• Two notebook examples with limit order book data and Bitcoin history

• Understand the conceptual formulation and statistical explanation of CNNs.

• Understand 1D CNNs for time series prediction.

• A notebook example on univariate time series prediction with 1D CNNs.

• Understand how dilation CNNs capture multi-scales.


Summary

• Deep Learning for dimensionality reduction

• Understand principal component analysis for dimension reduction.

• Understand how to formulate a linear autoencoder.

• Understand how a linear autoencoder performs PCA.

• Notebook example using dimension reduction on the yield curve.


Extra Slides

Review of time series analysis

• Characterize the properties of time series

• Problem formulation in time series prediction

• Practical challenges that frequently arise in time series prediction


Time Series Data

Time series data:

• Consists of observations on a variable over time, e.g. stock prices

• Observations are not assumed to be independent over time; rather, observations are often strongly related to their recent histories.

• For this reason, the ordering of the data matters (unlike cross-sectional data).

• Pay special attention to the data frequency.


Classical Time Series Analysis

• Box-Jenkins linear time series analysis, e.g. ARIMA

• Non-linear time series analysis, e.g. GARCH

• Dynamic state based models, e.g. Kalman filters.


Autoregressive Processes AR(p)

• The p-th order autoregressive process of a variable y depends only on the previous values of the variable plus a white noise disturbance term:

$y_t = \mu + \sum_{i=1}^{p} \phi_i y_{t-i} + \varepsilon_t$

• Defining the polynomial function $\phi(L) := (1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p)$, the AR(p) process can be expressed in the more compact form

$\phi(L) y_t = \mu + \varepsilon_t$

Key point: an AR(p) process has a geometrically decaying acf.


Stationarity

• An important property of AR(p) processes is whether they are stationary or not. A process which is stationary will have error disturbance terms $\varepsilon_t$ whose impact on the current value of y declines as the lag increases.

• Conversely, non-stationary AR(p) processes exhibit the counter-intuitive behavior that the error disturbance terms become increasingly influential as the lag increases.

• If the roots of the characteristic equation

$\phi(z) = (1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p) = 0$

*all* lie outside the unit circle, then the AR(p) process is stationary.
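A short NumPy check of this root condition for illustrative AR coefficients (a sketch, not part of the deck):

```python
import numpy as np

def is_stationary(phi):
    """AR(p) stationarity: all roots of 1 - phi_1 z - ... - phi_p z^p lie outside the unit circle."""
    coeffs = np.r_[-np.asarray(phi)[::-1], 1.0]   # highest degree first: -phi_p, ..., -phi_1, 1
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(is_stationary([0.5, 0.3]))   # True: a stationary AR(2)
print(is_stationary([1.0]))        # False: the random walk is non-stationary
```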


Exercise 1

Is the following random walk (zero-mean AR(1) process) stationary?

$y_t = y_{t-1} + \varepsilon_t$


Exercise 2

Compute the mean, variance and autocorrelation function of the following zero-mean AR(1) process:

$y_t = \phi_1 y_{t-1} + \varepsilon_t.$


Moving Average Processes MA(q)

MA(q) process: The q-th order moving average process is the linear combination of the white noise process $\{\varepsilon_t\}_{t=1}^{q+1}$:

$y_t = \mu + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} + \varepsilon_t$

It is convenient to define the lag (or backshift) operator $L^i y_t := y_{t-i}$ and write the MA(q) process as

$y_t = \mu + \theta(L)\varepsilon_t$

where the polynomial function $\theta(L) := 1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q$.


Moving Average Processes

The distinguishing properties of the MA(q) process are the following:

• Constant mean: $E[y_t] = \mu$

• Constant variance: $V[y_t] = (1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2)\sigma^2$

• Autocovariances which may be non-zero up to lag q and are zero thereafter:

$\gamma_s = \begin{cases} (\theta_s + \theta_{s+1}\theta_1 + \theta_{s+2}\theta_2 + \cdots + \theta_q\theta_{q-s})\sigma^2 & s = 1, 2, \dots, q, \\ 0 & s > q. \end{cases}$

Key point: an MA(q) process has a non-zero acf up to lag q; above q there is a sharp cut-off.


Exercise 3

Consider the following zero-mean MA(2) process:

$y_t = \varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2}$

Show that:

• $E[y_t] = 0, \; \forall t$

• $V[y_t] = \gamma_0 = (1 + \theta_1^2 + \theta_2^2)\sigma^2, \; \forall t$

• $\rho_1 = \frac{\theta_1 + \theta_1\theta_2}{1 + \theta_1^2 + \theta_2^2}$, $\quad \rho_2 = \frac{\theta_2}{1 + \theta_1^2 + \theta_2^2}$, and $\rho_s = 0$ for $s > q = 2$.

If $\theta_1 = -0.5$ and $\theta_2 = 0.25$, sketch the autocorrelation function (acf) of the sample MA(2) process.


ARMA

The (p, q)-order ARMA process of a variable y combines the AR(p) and MA(q) processes:

$\phi(L) y_t = \mu + \theta(L)\varepsilon_t$

where

$\phi(L) := (1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p)$

$\theta(L) := 1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q$

The mean of an ARMA series depends only on the coefficients of the autoregressive process and not on the coefficients of the moving average process:

$E[y_t] = \frac{\mu}{1 - \phi_1 - \phi_2 - \cdots - \phi_p}$


Forecasting with an MA(q) model

• Predict the value $y_{t+1}$ given information $\Omega_t$ at time t using the equation:

$y_{t+1} = \mu + \theta_1\varepsilon_t + \theta_2\varepsilon_{t-1} + \theta_3\varepsilon_{t-2} + \varepsilon_{t+1}$

• The forecast is the conditional expectation of the prediction. The one-step-ahead forecast at time t is denoted $f_{t,1}$:

$f_{t,1} = E[y_{t+1}|\Omega_t] = \mu + \theta_1\varepsilon_t + \theta_2\varepsilon_{t-1} + \theta_3\varepsilon_{t-2}$

since the conditional expectation $E[\varepsilon_{t+1}|\Omega_t] = 0$.

• The two-step-ahead forecast $f_{t,2}$ is given by:

$f_{t,2} = E[y_{t+2}|\Omega_t] = \mu + \theta_2\varepsilon_t + \theta_3\varepsilon_{t-1}$

• In general, an MA(q) process has a memory of q periods, so forecasts q + 1 or more steps ahead collapse to the intercept $\mu$.


Forecasting with an AR(p) model

• Recall that an AR(p) process has infinite memory.

• In general, the s-step-ahead forecast for an AR(2) process is given by

$f_{t,s} = E[y_{t+s}|\Omega_t] = \mu + \phi_1 f_{t,s-1} + \phi_2 f_{t,s-2}$

thus we need to compute the one-period-lag and two-period-lag forecasts first.

• ARMA(p, q): Putting the MA(q) and AR(p) s-step-ahead forecasting equations together, we get

$f_{t,s} = \sum_{i=1}^{p} a_i f_{t,s-i} + \sum_{j=1}^{q} b_j \varepsilon_{t+s-j}$

where $f_{t,s} = y_{t+s}$ for $s \leq 0$ and $\varepsilon_{t+s} = 0$ for $s > 0$.


Exercise 4

Consider the following ARMA(2,2) 1-step ahead forecast

$f_{t,1} = a_1 f_{t,0} + a_2 f_{t,-1} + b_1\varepsilon_t + b_2\varepsilon_{t-1}$ (41)

$\quad = a_1 y_t + a_2 y_{t-1} + b_1\varepsilon_t + b_2\varepsilon_{t-1}$ (42)

where $a_i$ and $b_j$ are the autoregressive and moving average coefficients respectively.


Parametric tests: time series analysis

• Augmented Dickey-Fuller test for time series stationarity, with the Schwert (heuristic) estimate for the maximum lag

• Durbin-Watson tests for co-integration of two variables

• Granger causality tests


Identification

• It may be difficult to observe the order of the MA and AR processes from the acf and pacf plots.

• It is often preferable to use the Akaike Information Criterion (AIC) to measure the quality of fit:

$\text{AIC} = \ln(\hat{\sigma}^2) + \frac{2k}{T}$

where $\hat{\sigma}^2$ is the residual variance (the residual sum of squares divided by the number of observations T) and $k = p + q + 1$ is the total number of parameters estimated.

• This criterion expresses a trade-off between the first term, the quality of fit, and the second term, a penalty function proportional to the number of parameters.

• Adding more parameters to the model reduces the residuals but increases the right-hand term; thus the AIC favors the best fit with the fewest number of parameters.

• The AIC is limited to asymptotic normality assumptions on the error.


In Sample: Diagnostics

• A fitted model must be examined for underfitting with a white noise test. Box and Pierce propose the Portmanteau statistic

$Q^*(m) = T\sum_{l=1}^{m} \hat{\rho}_l^2$

• as a test statistic for the null hypothesis

$H_0: \rho_1 = \cdots = \rho_m = 0$

against the alternative hypothesis

$H_a: \rho_i \neq 0 \; \text{for some } i \in \{1, \dots, m\},$

where $\hat{\rho}_i$ are the sample autocorrelations of the residuals.

• The Box-Pierce statistic follows an asymptotically chi-squared distribution with $m - p - q$ degrees of freedom.

• The decision rule is to reject $H_0$ if $Q(m) > \chi^2_\alpha$, where $\chi^2_\alpha$ denotes the $100(1-\alpha)$-th percentile of a chi-squared distribution with m degrees of freedom and $\alpha$ is the significance level for rejecting $H_0$.


In Sample: Diagnostics

• The Ljung-Box test statistic increases the power of the test in finite samples:

$Q(m) = T(T+2)\sum_{l=1}^{m} \frac{\hat{\rho}_l^2}{T-l}.$

• This statistic also follows an asymptotically chi-squared distribution with m degrees of freedom; when applied to the residuals of a fitted ARMA(p, q) model, the Ljung-Box statistic asymptotically follows a chi-squared distribution with $m - p - q$ degrees of freedom.
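In practice the Ljung-Box statistic is available in standard libraries; a sketch using statsmodels' acorr_ljungbox (the simulated residuals and the lag choice are illustrative, and the output format varies across statsmodels versions):

```python
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(7)
resid = rng.standard_normal(500)          # residuals from a fitted model (here: white noise)

# Q(m) and its p-value for m = 10 lags; reject H0 (white noise) for small p-values
print(acorr_ljungbox(resid, lags=[10]))
```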


Time Series Cross Validation

[Figure omitted: walk-forward splits with segments labeled "Verification: tuning hyper-parameters" and "Testing: assess performance".]

Figure: Time series cross validation, also referred to as walk-forward optimization, is used to avoid introducing look-ahead bias into the model.
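A minimal sketch of such walk-forward splits using scikit-learn's TimeSeriesSplit is shown below (the slide itself does not prescribe a particular library); the toy data and the number of splits are arbitrary.

```python
# A minimal sketch of walk-forward (time series) cross validation.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)   # 20 time-ordered observations

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Training indices always precede test indices, so no look-ahead bias.
    print(f"fold {fold}: train {train_idx[0]}-{train_idx[-1]}, test {test_idx[0]}-{test_idx[-1]}")
```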

Page 142

Extra Slides

Forecasting Accuracy

• Forecasting accuracy may be measured using the mean square error (MSE) or the mean absolute error (MAE) of the out-of-sample forecasts against the actual observations.

• Suppose that we have trained the model with $T$ observations; then the MSE evaluated over the next $m$ observations is:

$$\mathrm{MSE} = \frac{1}{m} \sum_{t=0}^{m-1} \left(y_{T+t+s} - f_{T+t,s}\right)^2$$

• The MAE is given by

$$\mathrm{MAE} = \frac{1}{m} \sum_{t=0}^{m-1} \left|y_{T+t+s} - f_{T+t,s}\right|$$

Question: Why use the MSE in preference to the MAE?
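A minimal sketch of both error measures on a handful of hypothetical out-of-sample forecasts:

```python
# Out-of-sample MSE and MAE; forecasts and realized values are placeholders.
import numpy as np

y_actual = np.array([1.2, 0.8, 1.0, 1.5, 0.9])    # y_{T+t+s}
y_forecast = np.array([1.0, 0.9, 1.1, 1.3, 1.0])  # f_{T+t,s}

errors = y_actual - y_forecast
mse = np.mean(errors ** 2)
mae = np.mean(np.abs(errors))
print(f"MSE: {mse:.4f}, MAE: {mae:.4f}")
```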

Page 143

Extra Slides

Predicting Events

• Suppose we have conditionally i.i.d. Bernoulli r.v.s $X_t$ with $p_t := P[X_t = 1 \mid \Omega_t]$ representing a binary event

• $E[X_t \mid \Omega_t] = 0 \cdot (1 - p_t) + 1 \cdot p_t = p_t$

• $V[X_t \mid \Omega_t] = p_t(1 - p_t)$

• Under an ARMA model for the log-odds (see the sketch after the formula):

$$\ln\left(\frac{p_t}{1 - p_t}\right) = \phi^{-1}(L)\big(\mu + \theta(L)\varepsilon_t\big)$$
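To recover the event probability from such a model, the forecast of the log-odds is pushed through the logistic (inverse-logit) transform. A minimal sketch, with a hypothetical log-odds forecast z_t:

```python
# Map a hypothetical ARMA forecast of the log-odds z_t to a probability p_t.
import numpy as np

z_t = 0.35                        # forecast of ln(p_t / (1 - p_t))
p_t = 1.0 / (1.0 + np.exp(-z_t))  # logistic (inverse-logit) transform
print(f"P[X_t = 1 | Omega_t] ≈ {p_t:.3f}")
```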

Page 144

Extra Slides

Entropy

C. Shannon, "A Mathematical Theory of Communication," The Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, July, October, 1948.

Page 145

Extra Slides

Directional Performance: Confusion Matrix

                  Predicted
    Actual        up        down      Total
    up            m11       m12       m10
    down          m21       m22       m20
    Total         m01       m02       m

$$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{\left(m_{ij} - m_{i0} m_{0j}/m\right)^2}{m_{i0} m_{0j}/m}$$

with 1 degree of freedom.

Page 146

Extra Slides

Example: Confusion Matrix

                  Predicted
    Actual        up        down      Total
    up            12        2         14
    down          8         2         10
    Total         20        4         24

$\chi^2 = 0.137$ with a p-value of 0.71.
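These figures can be reproduced with a standard Pearson chi-squared test of independence without continuity correction; the sketch below uses scipy for the computation.

```python
# Verify the example: Pearson chi-squared test on the 2x2 confusion matrix.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[12, 2],
                  [8, 2]])   # rows: actual up/down, columns: predicted up/down

chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.2f}, dof = {dof}")
# chi2 ≈ 0.137 with p-value ≈ 0.71, matching the slide.
```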

Page 147

Extra Slides

ROC chart

Page 148

Extra Slides

Performance Terminology

• Precision is TP/(TP + FP): the fraction of records that were actually positive among those the classifier predicted to be positive.

• True Positive Rate is TP/(TP + FN): the fraction of positive examples the classifier correctly identified. This is also known as Recall or Sensitivity.

• False Positive Rate is FP/(FP + TN): the fraction of negative examples which the classifier incorrectly identified as positive.

• Specificity is TN/(FP + TN) = 1 − FPR: the fraction of negative examples which the classifier correctly identified (a code sketch follows this list).
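A minimal sketch computing these quantities from raw counts, using the counts of the earlier example confusion matrix with "up" treated as the positive class (TP = 12, FN = 2, FP = 8, TN = 2):

```python
# Classification metrics from raw counts (taken from the earlier example).
TP, FP, FN, TN = 12, 8, 2, 2

precision = TP / (TP + FP)
tpr = TP / (TP + FN)          # recall / sensitivity
fpr = FP / (FP + TN)
specificity = TN / (FP + TN)  # = 1 - FPR

print(f"precision={precision:.2f}, TPR={tpr:.2f}, FPR={fpr:.2f}, specificity={specificity:.2f}")
```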

Page 149

Extra Slides

Practical Prediction Issues

• How to predict inherently rare events, e.g. black swans? Ans: over-sampling? (a minimal sketch follows this list)

• How much training history?

• How frequently to retrain the model?
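On the first question, one simple (if crude) remedy is to randomly over-sample the minority class in the training set. A minimal sketch, with entirely hypothetical data and an event rate of roughly 5%:

```python
# Naive minority-class oversampling (random resampling with replacement).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.05).astype(int)   # ~5% rare "event" class

minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

# Resample the minority class with replacement until the classes are balanced.
resampled = rng.choice(minority_idx, size=len(majority_idx), replace=True)
X_bal = np.vstack([X[majority_idx], X[resampled]])
y_bal = np.concatenate([y[majority_idx], y[resampled]])
print("class balance after oversampling:", np.bincount(y_bal))
```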

Page 150

Extra Slides

Confusion Matrices with Oversampling

[Figure omitted: two 3×3 normalized confusion matrices over the classes {−1, 0, 1} (true vs. predicted). The printed diagonal entries are approximately 0.12, 1.00 and 0.09 without oversampling, and rise to approximately 0.69, 0.83 and 0.75 with oversampling.]

Figure: Normalized confusion matrices (left) without and (right) with oversampling.

Page 151

Extra Slides

ROC Curves with Oversampling

[Figure omitted: ROC curves (true positive rate vs. false positive rate) before and after oversampling, with AUC 0.543 before oversampling and AUC 0.784 after oversampling.]

Figure: ROC curves with and without oversampling.